

Gemma 4 26B A4B it NVFP4

by bg-digitalservices

Open source · 189k downloads · 19 likes

1.6 (19 reviews) · Chat · API & Local
About

Gemma 4 26B A4B it NVFP4 is a quantized version of the Gemma 4 model, optimized for instruction-following and reasoning tasks. It uses a Mixture-of-Experts (MoE) design with only 3.8 billion active parameters per token, and preserves high accuracy through advanced W4A4 quantization in FP4 for both weights and activations. It excels in applications that require fine-grained contextual understanding, such as conversational assistants or complex data analysis, while its reduced size and memory efficiency deliver higher throughput. Its distinguishing feature is a quantization recipe designed to preserve quality despite the compression, with only minimal losses even on mathematical reasoning tasks. Developed by the community and validated on NVIDIA infrastructure, it stands out for running on limited hardware without a significant drop in performance.

Documentation

Gemma-4-26B-A4B-it-NVFP4

First community NVFP4 quantization of google/gemma-4-26B-A4B-it — the Mixture-of-Experts variant of Gemma 4 with 25.2B total parameters and only 3.8B active per token.

W4A4 — weights in FP4, activations in FP4 (full W4A4 quantization).
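For intuition, NVFP4 stores values as FP4 (E2M1) in small micro-blocks that share a scale (an FP8 E4M3 scale in the real format). Below is a minimal numpy sketch of the quantize-dequantize round trip for one 16-element block; this is an illustration of the idea only, not Model Optimizer's actual implementation:

```python
import numpy as np

# Representable magnitudes of FP4 E2M1, the element format used by NVFP4.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_nvfp4_block(block: np.ndarray) -> np.ndarray:
    """Quantize-dequantize one 16-element block: scale so the max magnitude
    maps to FP4's max (6.0), snap each value to the nearest FP4 point, then
    scale back. The real format stores the scale in FP8 E4M3; here it is
    kept in full precision for clarity."""
    scale = np.abs(block).max() / 6.0
    if scale == 0.0:
        return block.copy()
    scaled = block / scale
    # Round each magnitude to the nearest representable FP4 value.
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(scaled) * FP4_GRID[idx] * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(16).astype(np.float32)
wq = fake_quant_nvfp4_block(w)
print("max abs error:", np.abs(w - wq).max())
```

Because the scale is chosen per 16-element block rather than per tensor, outliers in one block do not destroy resolution everywhere else, which is what lets 4-bit storage retain most of the model's quality.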

Key Specs

| | Original (BF16) | NVFP4 (this) |
|---|---|---|
| Size on disk | ~49 GB | ~16.5 GB |
| Compression | — | 3.0x |
| Total parameters | 25.2B | 25.2B |
| Active parameters | 3.8B | 3.8B |
| Architecture | MoE: 128 experts, 8 active/token | same |
| Context window | 256K tokens | 256K tokens |
| Modalities | Text, Image, Video | Text, Image, Video (all verified) |
| Quantization | — | W4A4 (FP4 weights and activations) |

Benchmarks

A/B comparison against the BF16 original, both served via vLLM on DGX Spark (GB10 Blackwell, SM 12.1). Quality via lm-evaluation-harness with --apply_chat_template.

Quality

| Benchmark | BF16 (reference) | NVFP4 (this) | Retained |
|---|---|---|---|
| GSM8K (flexible-extract) | 87.79% | 84.23% | 95.9% |
| GSM8K (strict-match) | 86.96% | 82.64% | 95.0% |
| IFEval prompt-strict | 89.46% | 87.99% | 98.3% |
| IFEval inst-strict | 92.81% | 91.37% | 98.4% |
| IFEval prompt-loose | 90.94% | 89.65% | 98.6% |
| IFEval inst-loose | 93.88% | 93.05% | 99.1% |
| Average | 90.31% | 88.15% | 97.6% |

Math reasoning (GSM8K) takes a ≈4pp hit — chained numerical steps accumulate rounding errors. Instruction-following (IFEval) is essentially unaffected (≈1pp, within noise). Typical quantization signature.
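For reference, the Retained column is just the NVFP4 score as a fraction of the BF16 score, and the Average row is the plain mean over the six benchmarks; a quick check of the table's arithmetic:

```python
bf16  = [87.79, 86.96, 89.46, 92.81, 90.94, 93.88]   # GSM8K x2, IFEval x4
nvfp4 = [84.23, 82.64, 87.99, 91.37, 89.65, 93.05]

retained = [100 * q / ref for ref, q in zip(bf16, nvfp4)]
avg_bf16, avg_nvfp4 = sum(bf16) / len(bf16), sum(nvfp4) / len(nvfp4)

print([f"{r:.1f}%" for r in retained])
print(f"averages: {avg_bf16:.2f}% vs {avg_nvfp4:.2f}%")
```

The computed values agree with the table to within rounding.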

Speed & Size

| Metric | BF16 | NVFP4 | Factor |
|---|---|---|---|
| Tokens/sec (1000-token generation) | 23.3 | 48.2 | 2.07x |
| TTFT (ms) | 97 | 53 | 1.83x |
| Model size on disk | ~49 GB | ~16.5 GB | 2.97x |

MoE inference on GB10 is memory-bandwidth-bound, so roughly 3x smaller weights translate directly into roughly 2x throughput. W4A4 gives a bit more headroom than W4A16 at the cost of a slightly larger quality drop.
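That bandwidth-bound reasoning can be made concrete with a back-of-envelope ceiling: each decoded token must stream the active expert weights from memory at least once. Assuming roughly 273 GB/s of unified-memory bandwidth for GB10 (an assumption, not a figure from this card):

```python
# Decode-throughput ceiling for a memory-bandwidth-bound MoE: every token
# must read the ~3.8B active parameters from memory at least once.
BW_GBPS = 273.0          # assumed GB10 unified-memory bandwidth (GB/s)
ACTIVE_PARAMS = 3.8e9

def max_tokens_per_sec(bytes_per_param: float) -> float:
    return BW_GBPS * 1e9 / (ACTIVE_PARAMS * bytes_per_param)

bf16_ceiling = max_tokens_per_sec(2.0)           # BF16: 2 bytes per parameter
nvfp4_ceiling = max_tokens_per_sec(16.5 / 25.2)  # NVFP4: ~0.65 bytes incl. scales
print(f"BF16 ceiling ~{bf16_ceiling:.0f} tok/s, NVFP4 ceiling ~{nvfp4_ceiling:.0f} tok/s")
```

The measured numbers (23.3 and 48.2 tok/s) sit well below both ceilings because attention, KV-cache reads, and kernel overheads also consume bandwidth; the ~3x gap between the two ceilings is why the realized speedup lands near 2x rather than the full compression ratio.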

Serving with vLLM

Requirements

  • vLLM build with transformers >= 5.4 (for Gemma 4 architecture support)
  • On DGX Spark / SM 12.1: spark-vllm-docker built with --tf5 flag
  • Included gemma4_patched.py for NVFP4 MoE scale key loading (see vLLM Patch)

Quick Start

Bash
docker run -d \
  --name vllm-gemma-4 \
  --gpus all --ipc=host --network host \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -v /path/to/Gemma-4-26B-A4B-it-NVFP4:/model \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v /path/to/gemma4_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py \
  <your-vllm-image> \
  vllm serve /model \
    --served-model-name gemma-4 \
    --host 0.0.0.0 --port 8888 \
    --quantization modelopt \
    --dtype auto \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.85 \
    --max-model-len 262144 \
    --max-num-seqs 4 \
    --moe-backend marlin \
    --trust-remote-code

Key Flags

| Flag | Why |
|---|---|
| --quantization modelopt | modelopt NVFP4 checkpoint format |
| --moe-backend marlin | Marlin kernel for MoE expert layers |
| --kv-cache-dtype fp8 | Saves memory for longer contexts |
| -e VLLM_NVFP4_GEMM_BACKEND=marlin | Marlin for non-MoE layers (needed on SM 12.1) |
| --trust-remote-code | Required for Gemma 4 |

Testing

This is an instruct model — use the chat completions endpoint:

Bash
curl http://localhost:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4",
    "messages": [{"role": "user", "content": "Hello! Tell me a joke."}],
    "max_tokens": 200
  }'
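The same request can be sent from Python with just the standard library (a sketch; adjust host and port to your deployment):

```python
import json
import urllib.request

# Chat completions request payload, mirroring the curl example above.
payload = {
    "model": "gemma-4",
    "messages": [{"role": "user", "content": "Hello! Tell me a joke."}],
    "max_tokens": 200,
}

def ask(base_url: str = "http://localhost:8888") -> str:
    """POST the payload to the vLLM server and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Any OpenAI-compatible client works as well, since vLLM exposes the standard chat completions API.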

DGX Spark

Tested on NVIDIA DGX Spark (128GB unified memory, GB10 Blackwell, SM 12.1). Model loads at 15.7 GiB — plenty of headroom for 256K context with FP8 KV cache.

How this was made

The Problem

Gemma 4 MoE stores expert weights as fused 3D tensors (nn.Parameter of shape [128, dim, dim]) instead of individual nn.Linear modules. NVIDIA Model Optimizer (modelopt) only quantizes nn.Linear — it silently skips the 3D expert parameters, which are 91% of the model.
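A toy reproduction of that blind spot (miniature dimensions and hypothetical class names, not Gemma 4's real modules): a traversal that only visits `nn.Linear` never touches weights stored as a raw 3D `nn.Parameter`.

```python
import torch
import torch.nn as nn

class FusedExperts(nn.Module):
    """Toy stand-in for Gemma 4's fused expert storage: one 3D parameter
    instead of many separate nn.Linear modules."""
    def __init__(self, n_experts: int = 128, dim: int = 8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_experts, dim, dim))

model = nn.Sequential(nn.Linear(8, 8), FusedExperts())

# A quantizer that walks nn.Linear modules (as modelopt does) only ever
# sees the first layer; the fused 3D expert tensor is invisible to it.
linear_params = sum(p.numel() for m in model.modules()
                    if isinstance(m, nn.Linear) for p in m.parameters())
total_params = sum(p.numel() for p in model.parameters())
print(f"covered by nn.Linear traversal: {linear_params / total_params:.1%}")
```

At Gemma 4's real sizes the fused tensors hold 91% of the weights, so a Linear-only pass would leave the model essentially unquantized.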

The Solution

We wrote a _QuantGemma4TextExperts modelopt plugin that unfuses the 3D expert tensors into 128 × 3 individual nn.Linear layers before quantization. This follows the same pattern modelopt uses for Qwen3.5, Llama4, and DBRX MoE models. After quantization, a post-processing step renames the exported keys to match vLLM's expected format.
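The unfuse step can be sketched as follows (toy dimensions and layout; the real plugin handles Gemma 4's gate/up/down projections and the subsequent key renaming):

```python
import torch
import torch.nn as nn

def unfuse_experts(fused: torch.Tensor) -> nn.ModuleList:
    """Split a fused [n_experts, d_in, d_out] tensor into per-expert
    nn.Linear modules that a Linear-only quantizer can see."""
    n_experts, d_in, d_out = fused.shape
    experts = nn.ModuleList()
    for e in range(n_experts):
        lin = nn.Linear(d_in, d_out, bias=False)
        # nn.Linear stores its weight as [out_features, in_features].
        lin.weight = nn.Parameter(fused[e].T.contiguous())
        experts.append(lin)
    return experts

fused = torch.randn(4, 8, 16)
experts = unfuse_experts(fused)
x = torch.randn(2, 8)
# Each expert applied via nn.Linear matches the fused matmul exactly.
assert torch.allclose(experts[0](x), x @ fused[0], atol=1e-6)
```

Once the experts exist as `nn.Linear` modules, modelopt calibrates and quantizes them like any other linear layer.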

Calibration

  • Tool: NVIDIA Model Optimizer v0.43, _nvfp4_selective_quant_cfg(["*"], )
  • Data: 4096 samples from CNN/DailyMail, batch 16, seq_len 1024
  • Why 4096 samples: MoE models have 128 experts with top-8 routing — each expert only sees ~6% of tokens. With 4096 samples of 1024 tokens, each expert sees roughly 260K calibration tokens on average, enough for stable activation range estimation. Fewer samples leave rarely-routed experts uncalibrated, producing poor scales.
  • Expert routing: Natural (router decides which experts see which data — forced uniform routing degrades quality by overriding the model's learned specialization)
  • Vision encoder: Excluded from quantization (stays BF16)
  • Hardware: NVIDIA DGX Spark
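The per-expert coverage arithmetic behind the sample count (each token activates 8 of the 128 experts):

```python
samples, seq_len = 4096, 1024
n_experts, top_k = 128, 8

total_tokens = samples * seq_len                 # 4,194,304 calibration tokens
per_expert = total_tokens * top_k // n_experts   # each token reaches top_k experts
print(f"~{per_expert:,} tokens per expert on average")  # → ~262,144
```

Under natural routing the load is not perfectly uniform, so rarely-chosen experts see fewer tokens than this average, which is exactly why a generous sample count matters.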

vLLM Patch

vLLM's Gemma 4 expert_params_mapping doesn't correctly map NVFP4 scale keys (.weight_scale, .weight_scale_2, .input_scale) to FusedMoE parameter names. The included gemma4_patched.py fixes this. A PR to upstream vLLM is forthcoming.
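A hypothetical illustration of the kind of remapping involved — the key names below are illustrative only, not vLLM's exact Gemma 4 mapping:

```python
import re

# Illustrative sketch (not vLLM's actual code): fold per-expert NVFP4 scale
# keys into fused-MoE parameter names of the shape vLLM's FusedMoE expects.
def remap_expert_scale_key(key: str) -> str:
    m = re.match(
        r"(.*)\.experts\.(\d+)\.(gate_proj|up_proj|down_proj)"
        r"\.(weight_scale(?:_2)?|input_scale)$",
        key,
    )
    if not m:
        return key  # non-expert keys pass through unchanged
    prefix, _idx, proj, scale = m.groups()
    fused = "w13" if proj in ("gate_proj", "up_proj") else "w2"
    return f"{prefix}.experts.{fused}_{scale}"

print(remap_expert_scale_key("model.layers.0.mlp.experts.5.gate_proj.weight_scale"))
```

The included gemma4_patched.py performs the real version of this mapping at load time so the quantized checkpoint's scale tensors land on the right FusedMoE parameters.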

Reproduce

Bash
pip install torch "transformers>=5.4" accelerate datasets
git clone https://github.com/NVIDIA/Model-Optimizer.git
pip install -e "Model-Optimizer[all]"
pip install --force-reinstall "transformers>=5.4" "huggingface_hub>=1.5"

python quantize_gemma4_moe.py --qformat nvfp4

Full quantization script included as quantize_gemma4_moe.py.

Limitations

  • Requires vLLM with transformers >= 5.4 and the included gemma4_patched.py
  • --moe-backend marlin required for correct MoE computation
  • Community quantization, not an official NVIDIA or Google release

License

Apache 2.0 — inherited from the base model.

Credits

Quantized by Mario Iseli on an NVIDIA DGX Spark. Built and validated with AI-engineering assistance from Anthropic.

Shout-out to eugr/spark-vllm-docker for the DGX Spark-optimized vLLM build.

📬 [email protected] ☕ Buy me a coffee if this makes your Spark go brrrrrr! 🚀
