by bg-digitalservices
Gemma 4 26B A4B it NVFP4 is a quantized version of the Gemma 4 model, optimized for instruction-following and reasoning tasks. The model uses a mixture-of-experts (MoE) architecture with only 3.8 billion active parameters per token, and quantizes both weights and activations to FP4 (W4A4). It suits applications that need fine-grained contextual understanding, such as conversational assistants or complex data analysis, while its reduced size improves memory efficiency and throughput. Its standout feature is a quantization recipe designed to preserve quality despite compression; even in mathematical reasoning tasks the losses remain small. Developed by the community and validated on NVIDIA hardware, it runs on limited resources without significantly compromising performance.
First community NVFP4 quantization of google/gemma-4-26B-A4B-it — the Mixture-of-Experts variant of Gemma 4 with 25.2B total parameters and only 3.8B active per token.
W4A4 — both weights and activations in FP4 (full W4A4 quantization).
| | Original (BF16) | NVFP4 (this) |
|---|---|---|
| Size on disk | ~49 GB | ~16.5 GB |
| Compression | — | 3.0x |
| Total parameters | 25.2B | 25.2B |
| Active parameters | 3.8B | 3.8B |
| Architecture | MoE: 128 experts, 8 active/token | same |
| Context window | 256K tokens | 256K tokens |
| Modalities | Text, Image, Video | Text, Image, Video (all verified) |
| Quantization | — | W4A4 (FP4 weights AND activations) |
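For context on what W4A4 means in practice: NVFP4 stores values as FP4 E2M1 numbers grouped into small blocks, each block sharing a scale. A minimal NumPy sketch of the round-trip for a single 16-element block (the E2M1 grid is standard; the real format also encodes the block scale in FP8 E4M3 plus a per-tensor FP32 scale, which is simplified here to a plain float):

```python
import numpy as np

# FP4 E2M1 representable magnitudes (sign is a separate bit).
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def nvfp4_quantize_block(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize one 16-element block: choose a scale so the block's absmax
    maps onto the largest FP4 magnitude (6.0), then snap every element to
    the nearest representable FP4 value."""
    scale = float(np.abs(x).max()) / 6.0
    if scale == 0.0:
        scale = 1.0  # all-zero block: any scale works
    idx = np.abs(np.abs(x)[:, None] / scale - E2M1[None, :]).argmin(axis=1)
    return np.sign(x) * E2M1[idx], scale

w = np.random.default_rng(0).normal(size=16).astype(np.float32)
q, s = nvfp4_quantize_block(w)
print("max abs error:", np.max(np.abs(w - q * s)))  # bounded by half the widest grid gap, times scale
```

The widest gap on the E2M1 grid is between 4 and 6, so the worst-case per-element error is one scale unit; this is why per-block scaling matters so much for quality.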
A/B comparison against the BF16 original, both served via vLLM on a DGX Spark (GB10 Blackwell, SM 12.1). Quality measured with lm-evaluation-harness using `--apply_chat_template`.
| Benchmark | BF16 (reference) | NVFP4 (this) | Retained |
|---|---|---|---|
| GSM8K (flexible-extract) | 87.79% | 84.23% | 95.9% |
| GSM8K (strict-match) | 86.96% | 82.64% | 95.0% |
| IFEval prompt-strict | 89.46% | 87.99% | 98.3% |
| IFEval inst-strict | 92.81% | 91.37% | 98.4% |
| IFEval prompt-loose | 90.94% | 89.65% | 98.6% |
| IFEval inst-loose | 93.88% | 93.05% | 99.1% |
| Average | 90.31% | 88.15% | 97.6% |
Math reasoning (GSM8K) takes a ≈4pp hit, since chained numerical steps accumulate rounding errors. Instruction-following (IFEval) is essentially unaffected (≈1pp, within noise). This is the typical quantization signature.
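The quality numbers above can be reproduced with lm-evaluation-harness against the served endpoint. One plausible invocation, assuming the vLLM server from this card is running on localhost:8888 (exact `--model_args` vary by harness version, so treat this as a sketch):

```shell
# local-chat-completions + --apply_chat_template evaluates through the
# chat endpoint, matching how the benchmark numbers were gathered.
lm_eval \
  --model local-chat-completions \
  --model_args base_url=http://localhost:8888/v1/chat/completions,model=gemma-4 \
  --tasks gsm8k,ifeval \
  --apply_chat_template
```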
| Metric | BF16 | NVFP4 | Factor |
|---|---|---|---|
| Tokens/sec (1000-token generation) | 23.3 | 48.2 | 2.07x |
| TTFT (ms) | 97 | 53 | 1.83x |
| Model size on disk | ~49 GB | ~16.5 GB | 2.97x |
MoE inference on GB10 is memory-bandwidth-bound, so weights roughly 4x smaller (FP4 vs. BF16) translate into roughly 2x measured throughput. W4A4 gives a bit more headroom than W4A16 at the cost of a slightly larger quality drop.
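The bandwidth-bound argument can be sanity-checked with roofline arithmetic. The ~273 GB/s unified-memory bandwidth figure for GB10 is an assumption (not from this card), as is the simplification that each decoded token streams all 3.8B active parameters once:

```python
# Back-of-envelope decode-throughput ceiling for a bandwidth-bound MoE.
BW = 273e9      # bytes/s, assumed GB10 unified-memory bandwidth
ACTIVE = 3.8e9  # active parameters per token (from the model card)

def ceiling_tok_s(bytes_per_param: float) -> float:
    """Upper bound on tokens/sec if weight streaming is the only cost."""
    return BW / (ACTIVE * bytes_per_param)

print(f"BF16 ceiling  ≈ {ceiling_tok_s(2.0):.0f} tok/s")   # ≈ 36
print(f"NVFP4 ceiling ≈ {ceiling_tok_s(0.5):.0f} tok/s")   # ≈ 144
```

The measured 23.3 and 48.2 tok/s both sit below their ceilings; fixed kernel and attention costs shrink the realized speedup from the ideal 4x to the observed ~2x.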
Requirements: transformers >= 5.4 (for Gemma 4 architecture support), the `--tf5` flag, and `gemma4_patched.py` for NVFP4 MoE scale key loading (see vLLM Patch).

```bash
docker run -d \
  --name vllm-gemma-4 \
  --gpus all --ipc=host --network host \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -v /path/to/Gemma-4-26B-A4B-it-NVFP4:/model \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v /path/to/gemma4_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py \
  <your-vllm-image> \
  vllm serve /model \
    --served-model-name gemma-4 \
    --host 0.0.0.0 --port 8888 \
    --quantization modelopt \
    --dtype auto \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.85 \
    --max-model-len 262144 \
    --max-num-seqs 4 \
    --moe-backend marlin \
    --trust-remote-code
```
| Flag | Why |
|---|---|
| `--quantization modelopt` | modelopt NVFP4 checkpoint format |
| `--moe-backend marlin` | Marlin kernel for MoE expert layers |
| `--kv-cache-dtype fp8` | Saves memory for longer contexts |
| `-e VLLM_NVFP4_GEMM_BACKEND=marlin` | Marlin for non-MoE layers (needed on SM 12.1) |
| `--trust-remote-code` | Required for Gemma 4 |
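Once the container is up, the standard vLLM OpenAI-compatible routes make for a quick sanity check before running anything heavier:

```shell
curl http://localhost:8888/v1/models   # should list "gemma-4"
curl http://localhost:8888/health      # returns 200 once the server is ready
```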
This is an instruct model — use the chat completions endpoint:
```bash
curl http://localhost:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4",
    "messages": [{"role": "user", "content": "Hello! Tell me a joke."}],
    "max_tokens": 200
  }'
```
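The same request from Python, via the OpenAI client (assumes `pip install openai` and the server above running on localhost:8888; the API key is unused by vLLM but required by the client):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM endpoint.
client = OpenAI(base_url="http://localhost:8888/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="gemma-4",
    messages=[{"role": "user", "content": "Hello! Tell me a joke."}],
    max_tokens=200,
)
print(resp.choices[0].message.content)
```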
Tested on NVIDIA DGX Spark (128GB unified memory, GB10 Blackwell, SM 12.1). Model loads at 15.7 GiB — plenty of headroom for 256K context with FP8 KV cache.
Gemma 4 MoE stores expert weights as fused 3D tensors (`nn.Parameter` of shape `[128, dim, dim]`) instead of individual `nn.Linear` modules. NVIDIA Model Optimizer (modelopt) only quantizes `nn.Linear` — it silently skips the 3D expert parameters, which are 91% of the model.
We wrote a `_QuantGemma4TextExperts` modelopt plugin that unfuses the 3D expert tensors into 128 × 3 individual `nn.Linear` layers before quantization. This follows the same pattern modelopt uses for Qwen3.5, Llama4, and DBRX MoE models. After quantization, a post-processing step renames the exported keys to match vLLM's expected format.
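The unfusing step amounts to slicing each fused `[num_experts, out_dim, in_dim]` tensor into per-expert 2D weights that a Linear-style module can own. A minimal NumPy sketch of the idea (toy dimensions; the function name is illustrative, not modelopt's actual plugin API):

```python
import numpy as np

def unfuse_experts(fused: np.ndarray) -> list[np.ndarray]:
    """Split a fused [num_experts, out_dim, in_dim] MoE tensor into
    per-expert 2D weight matrices, as a Linear layer would store them."""
    return [fused[e] for e in range(fused.shape[0])]

num_experts, dim = 128, 8  # toy dim; the real model fuses 128 experts
fused = np.random.default_rng(0).normal(size=(num_experts, dim, dim))
per_expert = unfuse_experts(fused)

# The unfused form computes exactly what the fused batched matmul did:
x = np.random.default_rng(1).normal(size=(dim,))
assert np.allclose(per_expert[7] @ x, fused[7] @ x)
```

Once each expert is an ordinary 2D weight, modelopt's `nn.Linear` quantization path sees it like any other layer.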
`_nvfp4_selective_quant_cfg(["*"], )`

vLLM's Gemma 4 `expert_params_mapping` doesn't correctly map NVFP4 scale keys (`.weight_scale`, `.weight_scale_2`, `.input_scale`) to `FusedMoE` parameter names. The included `gemma4_patched.py` fixes this; a PR to upstream vLLM is forthcoming.
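The scale-key remapping is essentially a string rewrite from per-expert export keys to fused parameter names. A hypothetical sketch — the concrete key layouts and the `w13_`/`w2_` target names are assumptions based on common FusedMoE conventions, and the real mapping lives in `gemma4_patched.py`:

```python
import re

def rename_scale_key(key: str) -> str:
    """Collapse per-expert NVFP4 scale keys onto fused MoE parameter names.
    gate/up projections merge into a single 'w13' weight; down becomes 'w2'.
    Non-expert keys pass through unchanged."""
    m = re.match(
        r"(.*\.experts)\.\d+\.(gate_proj|up_proj|down_proj)"
        r"\.(weight_scale_2|weight_scale|input_scale)$",
        key,
    )
    if not m:
        return key
    prefix, proj, scale = m.groups()
    fused = "w13" if proj in ("gate_proj", "up_proj") else "w2"
    return f"{prefix}.{fused}_{scale}"

print(rename_scale_key("model.layers.0.mlp.experts.42.gate_proj.weight_scale"))
# -> model.layers.0.mlp.experts.w13_weight_scale
```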
```bash
pip install torch "transformers>=5.4" accelerate datasets
git clone https://github.com/NVIDIA/Model-Optimizer.git
pip install -e "Model-Optimizer[all]"
pip install --force-reinstall "transformers>=5.4" "huggingface_hub>=1.5"
python quantize_gemma4_moe.py --qformat nvfp4
```

Full quantization script included as `quantize_gemma4_moe.py`. (Note the quotes around `transformers>=5.4`: unquoted, the shell treats `>` as a redirect.)
Requires transformers >= 5.4 and the included `gemma4_patched.py`. `--moe-backend marlin` is required for correct MoE computation.

License: Apache 2.0 — inherited from the base model.
Quantized by Mario Iseli on an NVIDIA DGX Spark. Built and validated with AI-engineering assistance from Anthropic.
Shout-out to eugr/spark-vllm-docker for the DGX Spark-optimized vLLM build.
📬 [email protected] ☕ Buy me a coffee if this makes your Spark go brrrrrr! 🚀