by cerebras
MiniMax-M2.1-REAP-139B-A10B is an optimized, compressed version of MiniMax-M2.1 that significantly reduces its memory footprint while maintaining near-identical performance. Using the REAP (Router-weighted Expert Activation Pruning) method, it removes the 40% of experts that contribute least while preserving the router's control over those that remain, making it 40% lighter than the original model. The model excels at complex tasks such as code generation, mathematical reasoning, tool calling, and agentic interactions, and it remains compatible with standard serving stacks such as vLLM. Ideal for resource-constrained environments, local deployments, and academic research, it offers a strong balance between performance, efficiency, and accessibility.
𓌳 REAP𓌳 the Experts: Why Pruning Prevails for One-Shot MoE Compression
Introducing MiniMax-M2.1-REAP-139B-A10B, a memory-efficient compressed variant of MiniMax-M2.1 that maintains near-identical performance while being 40% lighter.
This model was created using REAP (Router-weighted Expert Activation Pruning), a novel expert pruning method that selectively removes redundant experts while preserving the router's independent control over the remaining experts.

MiniMax-M2.1-REAP-139B-A10B benchmarks as follows:
| Benchmark | MiniMax-M2.1 | MiniMax-M2.1-REAP-172B-A10B | MiniMax-M2.1-REAP-139B-A10B |
|---|---|---|---|
| Compression | — | 25% | 40% |
| **Coding** | | | |
| HumanEval | 94.5 | 93.9 | 93.9 |
| HumanEval+ | 89.0 | 90.9 | 87.8 |
🟩 This checkpoint maintains almost identical performance while being 40% lighter.
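The parameter counts in the checkpoint names follow directly from the pruning rates. As a back-of-envelope sketch (assuming MiniMax-M2.1 shares MiniMax-M2's roughly 230B total parameters, and noting that expert weights dominate the total, so pruning N% of experts removes roughly N% of all weights):

```python
# Rough check of the checkpoint names against the pruning rates.
# Assumption: ~230B total parameters, with expert weights dominating.
total_params_b = 230

for rate, name in [(0.25, "REAP-172B"), (0.40, "REAP-139B")]:
    remaining = total_params_b * (1 - rate)
    print(f"{name}: ~{remaining:.0f}B parameters remain")
```

This yields roughly 172B and 138B, matching the checkpoint names once rounding and the never-pruned non-expert parameters are accounted for.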
For more details on the evaluation setup, refer to the REAP arXiv preprint.
You can deploy the model directly with any recent vLLM release that supports MiniMax-M2.1; no source modifications or custom patches are required.
```shell
vllm serve cerebras/MiniMax-M2.1-REAP-139B-A10B \
  --tensor-parallel-size 8 \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think \
  --trust-remote-code \
  --enable-expert-parallel \
  --enable-auto-tool-choice
```
If you encounter insufficient memory when running this model, you may need to lower the `--max-num-seqs` flag (e.g. set it to 64). For more information, refer to the official vLLM deployment guide.
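Once the server is up, it exposes an OpenAI-compatible API. A minimal sketch of a client request, assuming vLLM's default port 8000 on localhost (the prompt and `max_tokens` value are illustrative):

```python
import requests

# vLLM's OpenAI-compatible endpoint; adjust host/port to your deployment.
URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "cerebras/MiniMax-M2.1-REAP-139B-A10B",
    "messages": [
        {"role": "user", "content": "Write a Python function that reverses a string."}
    ],
    "max_tokens": 256,
}

# Uncomment to send the request against a running server:
# resp = requests.post(URL, json=payload, timeout=120)
# print(resp.json()["choices"][0]["message"]["content"])
```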
This checkpoint was created by applying the REAP (Router-weighted Expert Activation Pruning) method uniformly across all Mixture-of-Experts (MoE) blocks of MiniMax-M2.1, with a 40% pruning rate.
REAP selects experts to prune based on a novel saliency criterion that considers both the router gate values assigned to each expert and the magnitude of that expert's output activations.
This dual consideration ensures that experts contributing minimally to the layer's output are pruned, while preserving those that play critical roles in the model's computations.
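The criterion can be illustrated with a toy sketch. The code below is not the authors' implementation; it assumes a simplified saliency score of the form "router gate value times expert output norm, averaged over tokens", with random stand-ins for the gates and expert outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, num_experts, hidden = 512, 8, 16

# Stand-ins: normalized router gate probabilities and per-expert outputs.
gates = rng.random((num_tokens, num_experts))
gates /= gates.sum(axis=1, keepdims=True)
expert_outputs = rng.normal(size=(num_tokens, num_experts, hidden))

# Sketch of a REAP-style saliency: router-weighted expert output norm.
norms = np.linalg.norm(expert_outputs, axis=-1)   # (tokens, experts)
saliency = (gates * norms).mean(axis=0)           # (experts,)

# Prune the 40% of experts with the lowest saliency; keep the rest.
num_prune = int(0.4 * num_experts)
order = np.argsort(saliency)
pruned, kept = order[:num_prune], np.sort(order[num_prune:])
```

In the real method the gates and activations come from calibration data rather than random draws, but the ranking-and-drop step works the same way.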
📚 For more details, refer to the following resources:
This model is derived from MiniMaxAI/MiniMax-M2.1 and is distributed under the modified MIT license.
If you use this checkpoint, please cite the REAP paper:
```bibtex
@article{lasby-reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}
```