by cerebras
GLM-4.7-Flash-REAP-23B-A3B is an optimized, lightweight version of GLM-4.7-Flash, designed to reduce the model's size by 25% while retaining near-identical performance. Using the REAP (Router-weighted Expert Activation Pruning) method, it selectively removes redundant experts from the Mixture-of-Experts architecture, shrinking the memory footprint without degrading core capabilities. The model excels at complex tasks such as code generation, autonomous agent execution, repository-level code understanding, and function calling, while remaining compatible with standard tooling such as vLLM. Well suited to resource-constrained environments, it offers a capable, economical alternative to heavier models without requiring any software modifications. This makes it a strong choice for local deployments, academic research, and industrial applications that need a balance of efficiency and power.
REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression
Introducing GLM-4.7-Flash-REAP-23B-A3B, a memory-efficient compressed variant of GLM-4.7-Flash that maintains near-identical performance while being 25% lighter.
This model was created using REAP (Router-weighted Expert Activation Pruning), a novel expert pruning method that selectively removes redundant experts while preserving the router's independent control over the remaining experts.
GLM-4.7-Flash-REAP-23B-A3B benchmark results compared to the base model:
| Benchmark | GLM-4.7-Flash | GLM-4.7-Flash-REAP-23B-A3B |
|---|---|---|
| Compression | — | 25% |
| Coding | ||
| HumanEval | 94.5 | 95.1 |
| HumanEval+ | 89.0 | 89.0 |
🟩 This checkpoint maintains almost identical performance while being 25% lighter.
For more details on the evaluation setup, refer to the REAP arXiv preprint.
You can deploy the model directly using the latest vLLM (any version that supports GLM-4.7-Flash); no source modifications or custom patches are required.
```shell
vllm serve cerebras/GLM-4.7-Flash-REAP-23B-A3B \
  --tensor-parallel-size 4 \
  --reasoning-parser glm45 \
  --tool-call-parser glm47 \
  --enable-auto-tool-choice
```
If you run out of GPU memory when serving this model, try setting a lower value for the `--max-num-seqs` flag (e.g. 64).
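Once the server is up, you can query it through vLLM's OpenAI-compatible endpoint. The sketch below builds a standard chat-completions request; the host, port, and sampling parameters are assumptions based on vLLM's defaults, so adjust them to your deployment.

```python
import json
import urllib.request

# vLLM serves an OpenAI-compatible API, by default at http://localhost:8000/v1.
# The model name must match the one passed to `vllm serve`.
payload = {
    "model": "cerebras/GLM-4.7-Flash-REAP-23B-A3B",
    "messages": [
        {"role": "user", "content": "Write a Python function that reverses a string."}
    ],
    "max_tokens": 256,       # illustrative sampling settings, not tuned values
    "temperature": 0.6,
}

def chat(url: str = "http://localhost:8000/v1/chat/completions") -> dict:
    """Send the request to a running vLLM server and return the parsed JSON reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    print(json.dumps(payload, indent=2))
```

Because the endpoint follows the OpenAI schema, any OpenAI-compatible client library can be pointed at the same URL instead of using raw HTTP.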
This checkpoint was created by applying the REAP (Router-weighted Expert Activation Pruning) method uniformly across all Mixture-of-Experts (MoE) blocks of GLM-4.7-Flash, with a 25% pruning rate.
REAP selects experts to prune based on a novel saliency criterion that considers both the router gate values assigned to each expert and the magnitude of that expert's activations. This dual consideration ensures that experts contributing minimally to the layer's output are pruned, while those that play critical roles in the model's computations are preserved.
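The idea above can be sketched numerically. The snippet below is a simplified illustration of a router-weighted activation score, not the exact REAP saliency formula from the paper: the gate values and output norms are synthetic stand-ins for statistics that would be collected on calibration data.

```python
import numpy as np

rng = np.random.default_rng(0)

n_tokens, n_experts = 512, 8
prune_frac = 0.25  # matches the 25% pruning rate of this checkpoint

# Synthetic per-token statistics for one MoE layer:
# gate[t, j]     -- router weight for expert j on token t (0 if not routed)
# out_norm[t, j] -- L2 norm of expert j's output on token t
gate = rng.random((n_tokens, n_experts)) * (rng.random((n_tokens, n_experts)) < 0.3)
out_norm = rng.random((n_tokens, n_experts)) * 5.0

# Router-weighted activation saliency: experts that are both rarely selected
# and small in output magnitude score low on both factors at once.
saliency = (gate * out_norm).mean(axis=0)

# Prune the lowest-saliency 25% of experts; the router rows for the surviving
# experts are left untouched, preserving independent router control over them.
n_prune = int(prune_frac * n_experts)
pruned = np.argsort(saliency)[:n_prune]
kept = np.sort(np.setdiff1d(np.arange(n_experts), pruned))

print(f"pruned experts: {pruned.tolist()}, kept: {kept.tolist()}")
```

One-shot pruning at this rate leaves a smaller expert pool that the router can still address directly, which is why no retraining or source changes are needed downstream.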
The model was calibrated on a diverse mixture of domain-specific datasets.
📚 For more details, refer to the REAP arXiv preprint (arXiv:2510.13999).
This model is derived from zai-org/GLM-4.7-Flash and distributed under the MIT license.
If you use this checkpoint, please cite the REAP paper:
```bibtex
@article{lasby-reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}
```