Kimi Open-Sources Moonlight: A 3B/16B-Parameter Mixture-of-Experts Model
Moonshot AI's Kimi team released a new technical report, "Muon is Scalable for LLM Training," yesterday and announced the launch of Moonlight: a 3B/16B-parameter Mixture-of-Experts (MoE) model trained with Muon. Trained on 5.7 trillion tokens, it achieves better performance with fewer floating-point operations (FLOPs), pushing out the Pareto efficiency frontier.
The team found that the Muon optimizer can be scaled up by adding weight decay, carefully adjusting the update magnitude of each parameter, and other techniques. The work has the following highlights:
These techniques allow Muon to be used out of the box in large-scale training without any hyperparameter tuning. Experimental results show that, compared with compute-optimal AdamW training, Muon achieves roughly twice the computational efficiency.
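As a rough illustration of what such an update looks like, below is a minimal PyTorch sketch of a Muon-style step combining momentum, Newton-Schulz orthogonalization, an RMS-matched rescaling, and decoupled weight decay. The Newton-Schulz coefficients follow the public Muon reference implementation; the function names and constants here are illustrative assumptions, not the paper's exact code.

```python
import torch

def newton_schulz_orthogonalize(M: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a 2D momentum matrix with a quintic
    Newton-Schulz iteration (coefficients from the public Muon reference)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M / (M.norm() + eps)            # normalize so the iteration converges
    transposed = X.size(0) > X.size(1)
    if transposed:                      # iterate on the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_style_step(W, G, M, lr=2e-2, momentum=0.95, weight_decay=0.1):
    """One Muon-style update on a single 2D weight matrix W with gradient G
    and momentum buffer M. Hypothetical helper for illustration only."""
    M.mul_(momentum).add_(G)                        # SGD-style momentum
    O = newton_schulz_orthogonalize(M)
    # Rescale so the update's RMS is around 0.2, roughly matching a typical
    # AdamW update -- this is what lets AdamW-style learning rates carry over.
    O = O * (0.2 * max(W.shape) ** 0.5)
    W.mul_(1 - lr * weight_decay)                   # decoupled weight decay
    W.add_(O, alpha=-lr)

# Example: one step on a random 1024x4096 projection matrix.
W = torch.randn(1024, 4096) * 0.02
G = torch.randn_like(W)
M = torch.zeros_like(W)
muon_style_step(W, G, M)
```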
The model used in the report is Moonlight-16B-A3B, with 15.29 billion total parameters and 2.24 billion activated parameters. Trained with the Muon optimizer on 5.7 trillion tokens, it achieved the results described above.
"Our model not only pushes beyond the current Pareto frontier but also achieves better performance than prior models while requiring significantly fewer training FLOPs."
"We have open-sourced a distributed Muon implementation optimized for memory usage and communication efficiency. We have also released the pretrained model, instruction-tuned models, and intermediate training checkpoints to support future research."
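For readers who want to try the released checkpoints, the following sketch shows how they could be loaded with Hugging Face transformers, assuming the weights are published on the Hub under an ID like moonshotai/Moonlight-16B-A3B-Instruct (the exact repository name and the availability of a chat template are assumptions to verify against the official release).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo ID -- check the official release for the exact name.
model_id = "moonshotai/Moonlight-16B-A3B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",       # load in the checkpoint's native precision
    device_map="auto",        # spread the MoE weights across available GPUs
    trust_remote_code=True,   # custom architecture code ships with the repo
)

messages = [{"role": "user", "content": "Briefly explain the Muon optimizer."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```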
SEE ALSO: Under DeepSeek’s Impact, Moonshot AI Significantly Reduces Marketing Budget