MiniMax's new MoE LLM reaches Claude-Sonnet level with 4M tokens context length
This work from Chinese startup @MiniMax-AI introduces a novel architecture that achieves state-of-the-art performance while handling context windows up to 4 million tokens - roughly 20x longer than current models. The key was combining lightning attention, mixture of experts (MoE), and a careful hybrid approach.
Key insights:
MoE with novel hybrid attention:
‣ Mixture of Experts with 456B total parameters (45.9B activated per token)
‣ Combines Lightning Attention (linear complexity) for most layers with traditional softmax attention every 8 layers (see the sketch after this list)
Outperforms leading models across benchmarks while offering vastly longer context:
‣ Competitive with GPT-4/Claude-3.5-Sonnet on most tasks
‣ Can efficiently handle 4M token contexts (vs 256K for most other LLMs)
Technical innovations enable efficient scaling:
‣ Novel expert parallel and tensor parallel strategies cut communication overhead in half
‣ Improved linear attention sequence parallelism, multi-level padding and other optimizations achieve 75% GPU utilization (that's really high; utilization is generally around 50%)
Thorough training strategy:
‣ Careful data curation and quality control, using a smaller preliminary version of their LLM as a quality judge (see the second sketch below)!
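To make the hybrid layering concrete, here is a minimal sketch (my own toy code, not the official MiniMax-01 implementation): most layers use a linear-complexity attention in the spirit of Lightning Attention, and every 8th layer falls back to standard softmax attention. Dimensions and layer counts are illustrative, and the MoE feed-forward blocks are omitted for brevity.

```python
# Toy sketch of the hybrid attention stack described above (not MiniMax's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttention(nn.Module):
    """O(n) attention: apply a positive feature map to Q/K and reorder the matmuls."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = F.elu(q) + 1, F.elu(k) + 1            # positive feature map
        kv = torch.einsum("bnd,bne->bde", k, v)       # d x d summary, linear in n
        z = 1 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + 1e-6)
        return self.out(torch.einsum("bnd,bde,bn->bne", q, kv, z))

class SoftmaxAttention(nn.Module):
    """Standard O(n^2) softmax attention, used only every 8th layer."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        return self.attn(x, x, x, need_weights=False)[0]

def build_hybrid_stack(dim=512, num_layers=16, softmax_every=8):
    # 1-indexed: layers 8, 16, ... use softmax attention, the rest linear attention.
    return nn.ModuleList(
        SoftmaxAttention(dim) if (i + 1) % softmax_every == 0 else LinearAttention(dim)
        for i in range(num_layers)
    )

x = torch.randn(2, 1024, 512)                        # (batch, seq_len, dim)
for layer in build_hybrid_stack():
    x = x + layer(x)                                 # residual connection
```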
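And a tiny illustration of the LLM-as-a-judge data filtering idea from the last bullet. The judge model, prompt, and score threshold here are my assumptions for the sake of the example, not MiniMax's actual pipeline.

```python
# Hypothetical quality-filtering loop: a small instruct model scores each document,
# and only high-scoring documents are kept for pretraining.
from transformers import pipeline

judge = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")  # assumed judge model

def quality_score(document: str) -> int:
    prompt = (
        "Rate the quality of the following text for LLM pretraining "
        "on a scale of 1-5. Answer with a single digit.\n\n"
        f"{document[:2000]}\n\nScore:"
    )
    out = judge(prompt, max_new_tokens=2, do_sample=False)[0]["generated_text"]
    digits = [c for c in out[len(prompt):] if c.isdigit()]
    return int(digits[0]) if digits else 1

corpus = [
    "Well-structured article about attention mechanisms ...",
    "buy cheap followers click here click here click here",
]
kept = [doc for doc in corpus if quality_score(doc) >= 4]  # assumed threshold
```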
Overall, not only is the model impressive, but the technical paper is also really interesting!
It has lots of insights, including a great comparison showing how a 2B-activated MoE (24B total parameters) far outperforms a 7B dense model at the same FLOPs budget.
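For intuition on that FLOPs comparison, here's a rough back-of-the-envelope (my numbers, using the standard ~6N training FLOPs-per-token approximation, not figures from the paper):

```python
# Per-token training compute scales with *activated* parameters, so at a fixed
# FLOPs budget the 2B-activated MoE can process ~3.5x more tokens than a 7B dense model.
active_moe, dense = 2e9, 7e9
flops_per_token_moe = 6 * active_moe        # ~1.2e10 FLOPs per token
flops_per_token_dense = 6 * dense           # ~4.2e10 FLOPs per token
budget = 1e21                               # arbitrary training budget in FLOPs
print(budget / flops_per_token_moe)         # ~8.3e10 tokens
print(budget / flops_per_token_dense)       # ~2.4e10 tokens, i.e. ~3.5x fewer
```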
Read it in full here 👉 MiniMax-01: Scaling Foundation Models with Lightning Attention (2501.08313)
Model here (commercial use allowed for products with fewer than 100M monthly users) 👉 MiniMaxAI/MiniMax-Text-01