by bg-digitalservices
Gemma 4 26B A4B it NVFP4 is a quantized version of the Gemma 4 model, optimized for instruction-following and reasoning tasks. The model uses a Mixture-of-Experts (MoE) approach with only 3.8 billion active parameters per token, while retaining high accuracy thanks to advanced FP4 quantization of both weights and activations. It excels at applications that require fine-grained contextual understanding, such as conversational assistants or complex data analysis, while delivering higher throughput thanks to its reduced size and memory efficiency. Its distinguishing feature is a quantization method designed to preserve quality despite the compression, with only minimal losses even on mathematical reasoning tasks. Developed by the community and validated on NVIDIA infrastructure, it stands out for running on limited resources without significantly sacrificing performance.
First community NVFP4 quantization of google/gemma-4-26B-A4B-it — the Mixture-of-Experts variant of Gemma 4 with 25.2B total parameters and only 3.8B active per token.
W4A4 — weights and activations both in FP4 (full W4A4 quantization).
| | Original (BF16) | NVFP4 (this) |
|---|---|---|
| Size on disk | ~49 GB | ~16.5 GB |
| Compression | — | 3.0x |
| Total parameters | 25.2B | 25.2B |
| Active parameters | 3.8B | 3.8B |
| Architecture | MoE: 128 experts, 8 active/token | same |
| Context window | 256K tokens | 256K tokens |
| Modalities | Text, Image, Video | Text, Image, Video (all verified) |
| Quantization | — | W4A4 (FP4 weights AND activations) |
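To make the W4A4 idea concrete, here is a toy Python sketch of FP4 (E2M1) block quantization. It is illustrative only: the real NVFP4 format uses 16-element micro-blocks with FP8 (E4M3) block scales plus a per-tensor scale, whereas this sketch keeps one full-precision scale per block.

```python
# Illustrative sketch (not the real NVFP4 kernel): quantize a block of
# weights to the FP4 E2M1 grid with one shared scale per block.

# The 8 non-negative magnitudes representable in FP4 E2M1
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize `block` (a list of floats) to FP4 with a shared scale."""
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # map the largest value onto ±6
    out = []
    for x in block:
        mag = abs(x) / scale
        # round to the nearest representable E2M1 magnitude
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))
        out.append(q * scale * (1.0 if x >= 0 else -1.0))
    return out, scale

weights = [0.11, -0.52, 0.98, 0.03, -1.47, 0.75, 0.20, -0.61]
deq, scale = quantize_block(weights)
# every dequantized value now sits on the scaled E2M1 grid
```

Each value costs 4 bits plus an amortized share of the block scale, which is where the ~3x on-disk compression versus BF16 comes from.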
A/B comparison against the BF16 original, both served via vLLM on DGX Spark (GB10 Blackwell, SM 12.1). Quality via lm-evaluation-harness with --apply_chat_template.
| Benchmark | BF16 (reference) | NVFP4 (this) | Retained |
|---|---|---|---|
| GSM8K (flexible-extract) | 87.79% | 84.23% | 95.9% |
| GSM8K (strict-match) | 86.96% | 82.64% | 95.0% |
| IFEval prompt-strict | 89.46% | 87.99% | 98.3% |
| IFEval inst-strict | 92.81% | 91.37% | 98.4% |
| IFEval prompt-loose | 90.94% | 89.65% | 98.6% |
| IFEval inst-loose | 93.88% | 93.05% | 99.1% |
| Average | 90.31% | 88.15% | 97.6% |
Math reasoning (GSM8K) takes a ≈4pp hit — chained numerical steps accumulate rounding errors. Instruction-following (IFEval) is essentially unaffected (≈1pp, within noise). Typical quantization signature.
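For reference, the Retained column above is simply the NVFP4 score divided by the BF16 score. A quick sketch reproducing it from the table values (last digits may differ by rounding):

```python
# Derive the "Retained" column: retained = 100 * NVFP4 / BF16.
# Scores are (BF16 reference, NVFP4) pairs copied from the table above.
scores = {
    "GSM8K (flexible-extract)": (87.79, 84.23),
    "GSM8K (strict-match)":     (86.96, 82.64),
    "IFEval prompt-strict":     (89.46, 87.99),
    "IFEval inst-strict":       (92.81, 91.37),
    "IFEval prompt-loose":      (90.94, 89.65),
    "IFEval inst-loose":        (93.88, 93.05),
}
retained = {k: round(100 * q / ref, 1) for k, (ref, q) in scores.items()}
avg_ref = sum(ref for ref, _ in scores.values()) / len(scores)
avg_q = sum(q for _, q in scores.values()) / len(scores)
avg_retained = 100 * avg_q / avg_ref  # ≈ 97.6
```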
| Metric | BF16 | NVFP4 | Factor |
|---|---|---|---|
| Tokens/sec (1000-token generation) | 23.3 | 48.2 | 2.07x |
| TTFT (ms) | 97 | 53 | 1.83x |
| Model size on disk | ~49 GB | ~16.5 GB | 2.97x |
MoE inference on GB10 is memory-bandwidth-bound, so 4x smaller weights translate directly into roughly 2x throughput. W4A4 gives a bit more headroom than W4A16 at the cost of slightly more quality drop.
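A back-of-envelope roofline shows why smaller weights buy throughput in a bandwidth-bound regime. The 273 GB/s figure is an assumption about the DGX Spark's unified-memory bandwidth, not from this card; the measured 2.07x is below the ~3.6x weight-bytes ratio because KV-cache and activation traffic do not shrink with the weights.

```python
# Bandwidth-bound decode bound: per token, the 3.8B active parameters
# must be streamed from memory. ASSUMED bandwidth: ~273 GB/s (GB10).
BANDWIDTH_GBPS = 273
ACTIVE_PARAMS = 3.8e9

def decode_bound_tok_s(bytes_per_param):
    """Upper bound on tokens/sec when weight streaming dominates."""
    bytes_per_token = ACTIVE_PARAMS * bytes_per_param
    return BANDWIDTH_GBPS * 1e9 / bytes_per_token

bf16_bound = decode_bound_tok_s(2.0)    # 16-bit weights: ~36 tok/s ceiling
nvfp4_bound = decode_bound_tok_s(0.55)  # ~4 bits + block scales per weight
# measured 23.3 and 48.2 tok/s both sit below their respective ceilings
```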
Requirements:

- transformers >= 5.4 (for Gemma 4 architecture support)
- the `--tf5` flag
- `gemma4_patched.py` for NVFP4 MoE scale key loading (see vLLM Patch)

```shell
docker run -d \
  --name vllm-gemma-4 \
  --gpus all --ipc=host --network host \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -v /path/to/Gemma-4-26B-A4B-it-NVFP4:/model \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v /path/to/gemma4_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py \
  <your-vllm-image> \
  vllm serve /model \
    --served-model-name gemma-4 \
    --host 0.0.0.0 --port 8888 \
    --quantization modelopt \
    --dtype auto \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.85 \
    --max-model-len 262144 \
    --max-num-seqs 4 \
    --moe-backend marlin \
    --trust-remote-code
```
| Flag | Why |
|---|---|
| `--quantization modelopt` | modelopt NVFP4 checkpoint format |
| `--moe-backend marlin` | Marlin kernel for MoE expert layers |
| `--kv-cache-dtype fp8` | Saves memory for longer contexts |
| `-e VLLM_NVFP4_GEMM_BACKEND=marlin` | Marlin for non-MoE layers (needed on SM 12.1) |
| `--trust-remote-code` | Required for Gemma 4 |
This is an instruct model — use the chat completions endpoint:
```shell
curl http://localhost:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4",
    "messages": [{"role": "user", "content": "Hello! Tell me a joke."}],
    "max_tokens": 200
  }'
```
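The same request from Python, using only the standard library. `build_chat_request` is a helper defined here for illustration, not part of any library; it returns what you would hand to `urllib.request` or any HTTP client.

```python
import json

def build_chat_request(prompt, model="gemma-4", max_tokens=200,
                       base_url="http://localhost:8888"):
    """Assemble an OpenAI-style chat completions request for the vLLM server."""
    url = f"{base_url}/v1/chat/completions"
    headers = {"Content-Type": "application/json"}
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    })
    return url, headers, body

url, headers, body = build_chat_request("Hello! Tell me a joke.")
# send with e.g.:
# urllib.request.urlopen(urllib.request.Request(url, body.encode(), headers))
```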
Tested on NVIDIA DGX Spark (128GB unified memory, GB10 Blackwell, SM 12.1). Model loads at 15.7 GiB — plenty of headroom for 256K context with FP8 KV cache.
Gemma 4 MoE stores expert weights as fused 3D tensors (nn.Parameter of shape [128, dim, dim]) instead of individual nn.Linear modules. NVIDIA Model Optimizer (modelopt) only quantizes nn.Linear — it silently skips the 3D expert parameters, which are 91% of the model.
We wrote a _QuantGemma4TextExperts modelopt plugin that unfuses the 3D expert tensors into 128 × 3 individual nn.Linear layers before quantization. This follows the same pattern modelopt uses for Qwen3.5, Llama4, and DBRX MoE models. After quantization, a post-processing step renames the exported keys to match vLLM's expected format.
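The unfuse pattern can be sketched in a few lines, with plain nested lists standing in for tensors (the real plugin operates on `torch.nn.Parameter` and emits `nn.Linear` modules; the key names below are illustrative only):

```python
# Toy sketch of the unfuse step: a fused 3D expert parameter of shape
# [num_experts, out, in] becomes num_experts standalone 2D weights that
# a Linear-only quantizer like modelopt can actually see.

NUM_EXPERTS = 4   # 128 in the real model; kept small for the demo
DIM = 2

# fused 3D expert weight: shape [NUM_EXPERTS, DIM, DIM]
fused = [[[float(e * 10 + r) for _ in range(DIM)] for r in range(DIM)]
         for e in range(NUM_EXPERTS)]

def unfuse(fused_param, prefix="experts"):
    """Split a fused [E, out, in] parameter into per-expert 2D weights,
    keyed the way individual nn.Linear modules would be."""
    return {f"{prefix}.{e}.proj.weight": mat
            for e, mat in enumerate(fused_param)}

state = unfuse(fused)
# each expert is now an independent weight matrix; after quantization,
# a rename pass folds the per-expert keys back into vLLM's fused format
```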
vLLM's Gemma 4 `expert_params_mapping` doesn't correctly map NVFP4 scale keys (`.weight_scale`, `.weight_scale_2`, `.input_scale`) to FusedMoE parameter names. The included `gemma4_patched.py` fixes this. A PR to upstream vLLM is forthcoming.
```shell
pip install torch "transformers>=5.4" accelerate datasets
git clone https://github.com/NVIDIA/Model-Optimizer.git
pip install -e "Model-Optimizer[all]"
pip install --force-reinstall "transformers>=5.4" "huggingface_hub>=1.5"
python quantize_gemma4_moe.py --qformat nvfp4
```
The full quantization script is included as `quantize_gemma4_moe.py`.
- transformers >= 5.4 and the included `gemma4_patched.py` are required
- `--moe-backend marlin` is required for correct MoE computation

License: Apache 2.0 — inherited from the base model.
Quantized by Mario Iseli on an NVIDIA DGX Spark. Built and validated with AI-engineering assistance from Anthropic.
Shout-out to eugr/spark-vllm-docker for the DGX Spark-optimized vLLM build.
📬 [email protected] ☕ Buy me a coffee if this makes your Spark go brrrrrr! 🚀