GLM-4.7-Flash-REAP-23B-A3B is an optimized, streamlined version of the GLM-4.7-Flash model that is 25% smaller while maintaining nearly identical performance. Using the REAP (Router-weighted Expert Activation Pruning) method, it selectively removes redundant experts from the Mixture-of-Experts architecture, reducing the memory footprint without compromising core capabilities. The model excels at complex tasks such as code generation, autonomous agent execution, software repository comprehension, and function calling, while remaining compatible with standard tooling such as vLLM. Ideal for resource-constrained environments, it offers a high-performance, cost-effective alternative to heavier models without requiring any software modifications, making it a strong choice for local deployment, academic research, and industrial applications that demand a balance of efficiency and capability.
𓌳 REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression
Introducing GLM-4.7-Flash-REAP-23B-A3B, a memory-efficient compressed variant of GLM-4.7-Flash that maintains near-identical performance while being 25% lighter.
This model was created using REAP (Router-weighted Expert Activation Pruning), a novel expert pruning method that selectively removes redundant experts while preserving the router's independent control over the remaining experts. Key features include:

- 25% reduction in total parameter count and memory footprint
- Near-identical benchmark performance to the base GLM-4.7-Flash model
- Drop-in compatibility with standard serving stacks such as vLLM, with no source modifications or custom patches

GLM-4.7-Flash-REAP-23B-A3B compares to the base model as follows:
| Benchmark | GLM-4.7-Flash | GLM-4.7-Flash-REAP-23B-A3B |
|---|---|---|
| Compression | — | 25% |
| Coding | ||
| HumanEval | 94.5 | 95.1 |
| HumanEval+ | 89.0 | 89.0 |
🟩 This checkpoint maintains almost identical performance while being 25% lighter.
For more details on the evaluation setup, refer to the REAP arXiv preprint.
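For a rough sense of scale: reading the name suffix as ~23B total parameters (with ~3B active per token), a back-of-envelope estimate of bf16 weight storage looks like the sketch below. These figures are inferred from the model name, not official measurements.

```python
# Back-of-envelope weight-memory estimate for the pruned checkpoint.
# Assumption: "23B-A3B" in the model name means ~23B total parameters,
# ~3B active per token; inferred from the name, not official specs.
TOTAL_PARAMS = 23e9
BYTES_PER_PARAM_BF16 = 2  # bf16 stores each weight in 2 bytes

pruned_gb = TOTAL_PARAMS * BYTES_PER_PARAM_BF16 / 1e9
base_gb = pruned_gb / 0.75  # pruned model is 75% the size of the base (25% compression)

print(f"Pruned weights: ~{pruned_gb:.0f} GB (bf16)")
print(f"Base weights:   ~{base_gb:.0f} GB (bf16)")
```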
You can deploy the model directly using a recent vLLM release (any version that supports GLM-4.7-Flash); no source modifications or custom patches are required:
```bash
vllm serve cerebras/GLM-4.7-Flash-REAP-23B-A3B \
  --tensor-parallel-size 4 \
  --reasoning-parser glm45 \
  --tool-call-parser glm47 \
  --enable-auto-tool-choice
```
If you run out of memory when serving this model, try lowering the --max-num-seqs flag (e.g., set it to 64).
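Once the server is running, it exposes vLLM's OpenAI-compatible API. A minimal client sketch, assuming the default local endpoint (http://localhost:8000/v1) and an illustrative prompt:

```python
# Minimal client for the vLLM OpenAI-compatible endpoint.
# Assumes `vllm serve` from above is running on localhost:8000 (vLLM's default).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # vLLM does not check the API key by default
)

response = client.chat.completions.create(
    model="cerebras/GLM-4.7-Flash-REAP-23B-A3B",
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```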
This checkpoint was created by applying the REAP (Router-weighted Expert Activation Pruning) method uniformly across all Mixture-of-Experts (MoE) blocks of GLM-4.7-Flash, at a 25% pruning rate.
REAP selects experts to prune based on a novel saliency criterion that considers both:

- the gate values the router assigns to each expert, and
- the magnitude (norm) of each expert's output activations.
This dual consideration ensures that experts contributing minimally to the layer's output are pruned, while preserving those that play critical roles in the model's computations.
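The sketch below is a minimal numpy illustration of such a criterion, not the authors' implementation. It assumes an expert's saliency is the router gate value times the expert's output norm, averaged over the calibration tokens routed to that expert, and prunes the lowest-scoring 25%.

```python
import numpy as np

def reap_saliency(gate_values, expert_outputs):
    """Illustrative REAP-style saliency for one MoE layer.

    gate_values:    (tokens, experts) router gate value per token and expert
                    (zero where the expert was not selected).
    expert_outputs: (tokens, experts, hidden) expert outputs per token.
    Returns one score per expert: the average of gate value * ||output||_2
    over the tokens routed to that expert.
    """
    norms = np.linalg.norm(expert_outputs, axis=-1)  # (tokens, experts)
    weighted = gate_values * norms                   # router-weighted norms
    routed = gate_values > 0                         # tokens routed to each expert
    counts = np.maximum(routed.sum(axis=0), 1)       # avoid division by zero
    return weighted.sum(axis=0) / counts             # (experts,)

# Toy example: prune the 25% of experts with the lowest saliency.
rng = np.random.default_rng(0)
gates = rng.random((512, 64)) * (rng.random((512, 64)) < 0.1)  # sparse routing
outputs = rng.standard_normal((512, 64, 128))
scores = reap_saliency(gates, outputs)
n_prune = int(0.25 * scores.size)
pruned = np.argsort(scores)[:n_prune]
print(f"Pruning {n_prune} experts with the lowest saliency scores.")
```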
The model was calibrated using a diverse mixture of domain-specific datasets.
📚 For more details, refer to the REAP arXiv preprint: https://arxiv.org/abs/2510.13999
This model is derived from zai-org/GLM-4.7-Flash and is distributed under the MIT license.
If you use this checkpoint, please cite the REAP paper:
```bibtex
@article{lasby-reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}
```