MiniMax-M2.1-REAP-139B-A10B is an optimized, compressed version of MiniMax-M2.1, designed to significantly reduce its memory footprint while retaining near-identical performance. Using the REAP (Router-weighted Expert Activation Pruning) method, it removes 40% of redundant experts while preserving the router's effectiveness, making it 40% lighter than the original model. The model excels at complex tasks such as code generation, mathematical reasoning, tool calling, and agentic interactions, while remaining compatible with standard tooling such as vLLM. Ideal for resource-constrained environments, local deployments, and academic research, it offers a strong balance between performance and efficiency. Its light footprint and accuracy make it a relevant choice for applications that require both power and accessibility.
𓌳 REAP𓌳 the Experts: Why Pruning Prevails for One-Shot MoE Compression
Introducing MiniMax-M2.1-REAP-139B-A10B, a memory-efficient compressed variant of MiniMax-M2.1 that maintains near-identical performance while being 40% lighter.
This model was created using REAP (Router-weighted Expert Activation Pruning), a novel expert pruning method that selectively removes redundant experts while preserving the router's independent control over the remaining experts. Key features include:
- 139B total parameters with 10B active parameters per token, 40% lighter than the base MiniMax-M2.1
- 40% of experts pruned uniformly across all MoE blocks using the REAP saliency criterion
- Near-identical performance on coding, reasoning, tool-calling, and agentic tasks
- Drop-in deployment with standard vLLM, with no source modifications or custom patches
MiniMax-M2.1-REAP-139B-A10B compares to the original model and the 25%-compressed REAP variant as follows:
| Benchmark | MiniMax-M2.1 | MiniMax-M2.1-REAP-172B-A10B | MiniMax-M2.1-REAP-139B-A10B |
|---|---|---|---|
| Compression | — | 25% | 40% |
| Coding | | | |
| HumanEval | 94.5 | 93.9 | 93.9 |
| HumanEval+ | 89.0 | 90.9 | 87.8 |
🟩 This checkpoint maintains almost identical performance while being 40% lighter.
For more details on the evaluation setup, refer to the REAP arXiv preprint.
You can deploy the model directly using the latest vLLM (with MiniMax-M2.1 support); no source modifications or custom patches are required.
vllm serve cerebras/MiniMax-M2.1-REAP-139B-A10B \
--tensor-parallel-size 8 \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--trust-remote-code \
    --enable-expert-parallel \
--enable-auto-tool-choice
If you encounter insufficient memory when running this model, you may need to set a lower value for the --max-num-seqs flag (e.g. 64). For more information, refer to the official vLLM deployment guide.
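Once the server is up, the checkpoint can be queried through vLLM's OpenAI-compatible endpoint. The sketch below is a minimal illustration, assuming the server started with the command above is reachable on the default port 8000; the `get_weather` tool schema is a hypothetical example used only to exercise the tool-calling flags, not part of the model.

```python
# Minimal client sketch against vLLM's OpenAI-compatible API.
# Assumptions: server from the command above running on localhost:8000,
# and a hypothetical "get_weather" tool defined purely for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical example tool
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="cerebras/MiniMax-M2.1-REAP-139B-A10B",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,  # handled server-side via --enable-auto-tool-choice
)

print(response.choices[0].message)
```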
This checkpoint was created by applying the REAP (Router-weighted Expert Activation Pruning) method uniformly across all Mixture-of-Experts (MoE) blocks of MiniMax-M2.1, with a 40% pruning rate.
REAP selects experts to prune based on a novel saliency criterion that considers both:
- the gate values the router assigns to each expert, and
- the magnitude of the expert's output activations.
This dual consideration ensures that experts contributing minimally to the layer's output are pruned, while preserving those that play critical roles in the model's computations.
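To make this concrete, the sketch below illustrates the general shape of router-weighted saliency scoring for a single MoE layer. It is a simplified illustration of the idea only, not the reference REAP implementation; the exact saliency definition and calibration procedure are given in the REAP paper.

```python
# Schematic sketch (NumPy) of router-weighted expert saliency for one MoE layer.
# Illustrative only: the precise criterion and calibration setup are defined
# in the REAP paper, not here.
import numpy as np

def expert_saliency(gate_weights: np.ndarray, expert_outputs: np.ndarray) -> np.ndarray:
    """gate_weights:   [num_tokens, num_experts] router weights on calibration tokens
                       (zero where an expert is not selected for a token)
       expert_outputs: [num_tokens, num_experts, hidden] per-expert outputs
       Returns one score per expert: the average, over tokens routed to it,
       of (router weight x expert output norm)."""
    norms = np.linalg.norm(expert_outputs, axis=-1)        # [tokens, experts]
    weighted = gate_weights * norms                        # router-weighted activation
    routed = (gate_weights > 0).sum(axis=0).clip(min=1)    # tokens routed per expert
    return weighted.sum(axis=0) / routed

def experts_to_prune(saliency: np.ndarray, prune_ratio: float = 0.40) -> np.ndarray:
    """Uniformly drop the lowest-saliency experts (40% for this checkpoint)."""
    k = int(len(saliency) * prune_ratio)
    return np.argsort(saliency)[:k]                        # indices of pruned experts
```

Experts with low scores contribute little to the layer's output regardless of how often they are selected, which is why they can be removed in one shot while the router's weighting of the remaining experts is left untouched.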
📚 For more details, refer to the REAP arXiv preprint (arXiv:2510.13999) cited below.
This model is derived from MiniMaxAI/MiniMax-M2.1 and distributed under the modified MIT license.
If you use this checkpoint, please cite the REAP paper:
@article{lasby-reap,
title={REAP the Experts: Why Pruning Prevails for One-Shot MoE compression},
author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
journal={arXiv preprint arXiv:2510.13999},
year={2025}
}