

Gemma 4 26B A4B it NVFP4

by bg-digitalservices

Open source · 189k downloads · 19 likes
Rated 1.6 (19 reviews) · Chat · API & Local

About
About

Gemma 4 26B A4B it NVFP4 is a quantized version of the Gemma 4 model, optimized for instruction-following and reasoning tasks. The model uses a mixture-of-experts (MoE) architecture with only 3.8 billion active parameters per token, compressed with NVFP4 (W4A4) quantization: FP4 for both weights and activations. It suits applications requiring fine contextual understanding, such as conversational assistants or complex data analysis, while its reduced size improves memory efficiency and throughput. Its standout feature is a quantization recipe designed to preserve quality despite compression; losses remain minimal even in mathematical reasoning, the task family most sensitive to quantization. Developed by the community and validated on NVIDIA hardware, it runs on limited resources without significantly compromising quality.

Documentation

Gemma-4-26B-A4B-it-NVFP4

First community NVFP4 quantization of google/gemma-4-26B-A4B-it — the Mixture-of-Experts variant of Gemma 4 with 25.2B total parameters and only 3.8B active per token.

W4A4: weights and activations both in FP4 (full W4A4 quantization).

Key Specs

|  | Original (BF16) | NVFP4 (this) |
|---|---|---|
| Size on disk | ~49 GB | ~16.5 GB |
| Compression | — | 3.0x |
| Total parameters | 25.2B | 25.2B |
| Active parameters | 3.8B | 3.8B |
| Architecture | MoE: 128 experts, 8 active/token | same |
| Context window | 256K tokens | 256K tokens |
| Modalities | Text, Image, Video | Text, Image, Video (all verified) |
| Quantization | — | W4A4 (FP4 weights and activations) |
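
As a rough sanity check on the sizes above, here is the back-of-the-envelope arithmetic; the gap between the ideal 4x and the observed 3.0x is attributed (as an assumption, not a measured breakdown) to block scales and the unquantized BF16 vision encoder:

```python
# Back-of-the-envelope size estimate for the figures in the Key Specs table.
# The overhead attribution (scales, BF16 vision encoder) is an assumption
# for illustration, not a measured breakdown.

total_params = 25.2e9

bf16_gb = total_params * 2 / 1e9    # 2 bytes per BF16 parameter -> ~50 GB
fp4_gb = total_params * 0.5 / 1e9   # 4 bits = 0.5 bytes per parameter -> ~12.6 GB

print(f"BF16 weights alone: ~{bf16_gb:.1f} GB")   # close to the ~49 GB on disk
print(f"FP4 weights alone:  ~{fp4_gb:.1f} GB")    # scales + BF16 vision encoder
                                                  # make up the rest of ~16.5 GB
print(f"Ideal compression: {bf16_gb / fp4_gb:.1f}x")  # 4.0x ideal vs 3.0x observed
```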

Benchmarks

A/B comparison against the BF16 original, both served via vLLM on DGX Spark (GB10 Blackwell, SM 12.1). Quality via lm-evaluation-harness with --apply_chat_template.

Quality

| Benchmark | BF16 (reference) | NVFP4 (this) | Retained |
|---|---|---|---|
| GSM8K (flexible-extract) | 87.79% | 84.23% | 95.9% |
| GSM8K (strict-match) | 86.96% | 82.64% | 95.0% |
| IFEval prompt-strict | 89.46% | 87.99% | 98.3% |
| IFEval inst-strict | 92.81% | 91.37% | 98.4% |
| IFEval prompt-loose | 90.94% | 89.65% | 98.6% |
| IFEval inst-loose | 93.88% | 93.05% | 99.1% |
| Average | 90.31% | 88.15% | 97.6% |

Math reasoning (GSM8K) takes a ≈4pp hit — chained numerical steps accumulate rounding errors. Instruction-following (IFEval) is essentially unaffected (≈1pp, within noise). Typical quantization signature.
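
The "Retained" column is just the NVFP4 score divided by the BF16 reference; a quick sketch to reproduce it from the table:

```python
# Recompute the "Retained" column: retained = NVFP4 / BF16.
scores = {
    "GSM8K (flexible-extract)": (87.79, 84.23),
    "GSM8K (strict-match)":     (86.96, 82.64),
    "IFEval prompt-strict":     (89.46, 87.99),
    "IFEval inst-strict":       (92.81, 91.37),
    "IFEval prompt-loose":      (90.94, 89.65),
    "IFEval inst-loose":        (93.88, 93.05),
}

for name, (bf16, nvfp4) in scores.items():
    print(f"{name}: {100 * nvfp4 / bf16:.1f}% retained")

bf16_avg = sum(b for b, _ in scores.values()) / len(scores)
nvfp4_avg = sum(n for _, n in scores.values()) / len(scores)
print(f"Average: BF16 {bf16_avg:.2f}% vs NVFP4 {nvfp4_avg:.2f}%")
```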

Speed & Size

| Metric | BF16 | NVFP4 | Factor |
|---|---|---|---|
| Tokens/sec (1000-token generation) | 23.3 | 48.2 | 2.07x |
| TTFT (ms) | 97 | 53 | 1.83x |
| Model size on disk | ~49 GB | ~16.5 GB | 2.97x |

MoE inference on GB10 is memory-bandwidth-bound, so 4x smaller weights translate directly into roughly 2x throughput. W4A4 gives a bit more headroom than W4A16 at the cost of slightly more quality drop.
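
A crude roofline sketch of that bandwidth-bound argument; the memory-bandwidth figure is an assumption for illustration, and only active (routed) parameters are counted per decode step:

```python
# Roofline-style sketch: on a memory-bandwidth-bound decoder, throughput is
# bounded by bandwidth / bytes-of-weights-read-per-token. The 273 GB/s figure
# is an ASSUMED bandwidth for illustration, not a measured value.

bandwidth_gb_s = 273        # assumed unified-memory bandwidth, GB/s
active_params = 3.8e9       # only routed experts are read per token

results = {}
for name, bytes_per_param in [("BF16", 2.0), ("NVFP4", 0.5)]:
    bytes_per_token = active_params * bytes_per_param
    results[name] = bandwidth_gb_s * 1e9 / bytes_per_token
    print(f"{name}: ~{results[name]:.0f} tok/s upper bound")

# Ideal speedup from 4x smaller weights is 4x; kernel overheads, activations,
# and KV-cache traffic explain why the observed speedup is closer to 2x.
print(f"Ideal speedup: {results['NVFP4'] / results['BF16']:.1f}x")
```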

Serving with vLLM

Requirements

  • vLLM build with transformers >= 5.4 (for Gemma 4 architecture support)
  • On DGX Spark / SM 12.1: spark-vllm-docker built with --tf5 flag
  • Included gemma4_patched.py for NVFP4 MoE scale key loading (see vLLM Patch)

Quick Start

Bash
docker run -d \
  --name vllm-gemma-4 \
  --gpus all --ipc=host --network host \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -v /path/to/Gemma-4-26B-A4B-it-NVFP4:/model \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v /path/to/gemma4_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py \
  <your-vllm-image> \
  vllm serve /model \
    --served-model-name gemma-4 \
    --host 0.0.0.0 --port 8888 \
    --quantization modelopt \
    --dtype auto \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.85 \
    --max-model-len 262144 \
    --max-num-seqs 4 \
    --moe-backend marlin \
    --trust-remote-code

Key Flags

| Flag | Why |
|---|---|
| --quantization modelopt | modelopt NVFP4 checkpoint format |
| --moe-backend marlin | Marlin kernel for MoE expert layers |
| --kv-cache-dtype fp8 | Saves memory for longer contexts |
| -e VLLM_NVFP4_GEMM_BACKEND=marlin | Marlin for non-MoE layers (needed on SM 12.1) |
| --trust-remote-code | Required for Gemma 4 |

Testing

This is an instruct model — use the chat completions endpoint:

Bash
curl http://localhost:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4",
    "messages": [{"role": "user", "content": "Hello! Tell me a joke."}],
    "max_tokens": 200
  }'

DGX Spark

Tested on NVIDIA DGX Spark (128GB unified memory, GB10 Blackwell, SM 12.1). Model loads at 15.7 GiB — plenty of headroom for 256K context with FP8 KV cache.
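
To see where that headroom goes, here is an illustrative FP8 KV-cache sizing sketch. The layer/head geometry below is a HYPOTHETICAL placeholder (the model's exact attention config isn't stated here); only the shape of the arithmetic is the point:

```python
# Illustrative FP8 KV-cache sizing. The geometry below is HYPOTHETICAL
# (n_layers, n_kv_heads, head_dim are placeholders, not the real Gemma 4
# config); the formula is what matters.

n_layers, n_kv_heads, head_dim = 48, 8, 128   # assumed values
context = 262_144                             # 256K tokens
bytes_per_elem = 1                            # FP8 KV cache

# 2x for K and V, one entry per layer, per KV head, per head dim, per token.
kv_gib = 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_elem / 2**30
print(f"KV cache for one full 256K sequence: ~{kv_gib:.1f} GiB")

model_gib = 15.7          # loaded weights, from the paragraph above
total_gib = 128.0         # DGX Spark unified memory
print(f"Headroom after weights: ~{total_gib - model_gib:.1f} GiB")
```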

How this was made

The Problem

Gemma 4 MoE stores expert weights as fused 3D tensors (nn.Parameter of shape [128, dim, dim]) instead of individual nn.Linear modules. NVIDIA Model Optimizer (modelopt) only quantizes nn.Linear — it silently skips the 3D expert parameters, which are 91% of the model.

The Solution

We wrote a _QuantGemma4TextExperts modelopt plugin that unfuses the 3D expert tensors into 128 × 3 individual nn.Linear layers before quantization. This follows the same pattern modelopt uses for Qwen3.5, Llama4, and DBRX MoE models. After quantization, a post-processing step renames the exported keys to match vLLM's expected format.
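
The unfuse step can be sketched with plain arrays. This is a simplified illustration of the idea only: the real plugin operates on nn.Parameter / nn.Linear inside modelopt, and each expert actually contributes three linears (hence 128 × 3), not one:

```python
import numpy as np

# Simplified sketch: unfuse a fused 3D MoE expert tensor of shape
# [num_experts, dim, dim] into per-expert 2D matrices (one per "nn.Linear"),
# and check the two paths compute the same thing.

rng = np.random.default_rng(0)
num_experts, dim = 4, 8                    # toy sizes (real model: 128 experts)

fused = rng.standard_normal((num_experts, dim, dim))
x = rng.standard_normal(dim)

# Fused path: one batched matmul over the expert axis.
fused_out = fused @ x                      # shape [num_experts, dim]

# Unfused path: individual weight matrices, as a module-based quantizer sees them.
experts = [fused[e] for e in range(num_experts)]
unfused_out = np.stack([w @ x for w in experts])

assert np.allclose(fused_out, unfused_out)  # numerically identical
```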

Calibration

  • Tool: NVIDIA Model Optimizer v0.43, _nvfp4_selective_quant_cfg(["*"], )
  • Data: 4096 samples from CNN/DailyMail, batch 16, seq_len 1024
  • Why 4096 samples: with 128 experts and top-8 routing, each expert sees only ~6% of tokens. At 4096 samples of 1024 tokens, each expert therefore sees the equivalent of ~256 samples (≈260k tokens) of calibration data, enough for stable activation range estimation; fewer samples leave rarely-routed experts uncalibrated, producing poor scales.
  • Expert routing: Natural (router decides which experts see which data — forced uniform routing degrades quality by overriding the model's learned specialization)
  • Vision encoder: Excluded from quantization (stays BF16)
  • Hardware: NVIDIA DGX Spark
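
The coverage arithmetic behind the sample-count choice, spelled out:

```python
# Calibration coverage for a top-8-of-128 MoE: each token activates 8 of 128
# experts, so each expert sees 8/128 = 1/16 of all calibration tokens.
num_experts, top_k = 128, 8
samples, seq_len = 4096, 1024

expert_token_share = top_k / num_experts                 # fraction per expert
tokens_per_expert = samples * seq_len * expert_token_share
samples_equiv = samples * expert_token_share

print(f"Each expert sees ~{expert_token_share:.1%} of tokens")
print(f"= ~{samples_equiv:.0f} samples' worth (~{tokens_per_expert:,.0f} tokens) per expert")
```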

vLLM Patch

vLLM's Gemma 4 expert_params_mapping doesn't correctly map NVFP4 scale keys (.weight_scale, .weight_scale_2, .input_scale) to FusedMoE parameter names. The included gemma4_patched.py fixes this. A PR to upstream vLLM is forthcoming.
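
Conceptually, that kind of patch is a key remapping over the checkpoint's state dict. A HYPOTHETICAL sketch follows; the key names are illustrative only (the exact modelopt export and vLLM FusedMoE names differ), and this is not the actual contents of gemma4_patched.py:

```python
import re

# HYPOTHETICAL sketch of remapping per-expert NVFP4 scale keys from a
# modelopt-style export layout to a fused-MoE layout. Key names here are
# illustrative placeholders, not the real checkpoint or vLLM names.

def remap_scale_key(key: str) -> str:
    # e.g. "...experts.17.gate_proj.weight_scale" -> "...experts.w13_weight_scale"
    m = re.match(r"(.*)\.experts\.\d+\.(?:gate|up)_proj\.(weight_scale(?:_2)?|input_scale)$", key)
    if m:
        return f"{m.group(1)}.experts.w13_{m.group(2)}"
    m = re.match(r"(.*)\.experts\.\d+\.down_proj\.(weight_scale(?:_2)?|input_scale)$", key)
    if m:
        return f"{m.group(1)}.experts.w2_{m.group(2)}"
    return key  # non-expert keys pass through unchanged

print(remap_scale_key("model.layers.0.mlp.experts.17.gate_proj.weight_scale"))
# -> model.layers.0.mlp.experts.w13_weight_scale
```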

Reproduce

Bash
pip install torch "transformers>=5.4" accelerate datasets
git clone https://github.com/NVIDIA/Model-Optimizer.git
pip install -e "Model-Optimizer[all]"
pip install --force-reinstall "transformers>=5.4" "huggingface_hub>=1.5"

python quantize_gemma4_moe.py --qformat nvfp4

Full quantization script included as quantize_gemma4_moe.py.

Limitations

  • Requires vLLM with transformers >= 5.4 and the included gemma4_patched.py
  • --moe-backend marlin required for correct MoE computation
  • Community quantization, not an official NVIDIA or Google release

License

Apache 2.0 — inherited from the base model.

Credits

Quantized by Mario Iseli on an NVIDIA DGX Spark. Built and validated with AI-engineering assistance from Anthropic.

Shout-out to eugr/spark-vllm-docker for the DGX Spark-optimized vLLM build.

📬 [email protected] ☕ Buy me a coffee if this makes your Spark go brrrrrr! 🚀

Capabilities & Tags
transformers · safetensors · gemma4 · image-text-to-text · nvidia · nvfp4 · modelopt · quantized · moe · dgx-spark
Specifications

  • Category: Chat
  • Access: API & Local
  • License: Open Source
  • Pricing: Open Source
  • Parameters: 26B