
Qwen3 Coder Next NVFP4

by GadflyII

Open source · 196k downloads · 40 likes
Rating: 2.0 (40 reviews) · Category: Code · Access: API & Local
About

Qwen3 Coder Next NVFP4 is a quantized version of the Qwen3-Coder-Next model, optimized for efficient resource usage while maintaining high accuracy. Aimed at developers and environments that need robust processing, it excels at code comprehension, code generation, and technical text analysis. The model supports very long contexts (up to 262,144 tokens, with 128K verified in testing), making it suitable for complex projects and large-scale data analysis. NVFP4 quantization reduces memory consumption and speeds up execution while remaining compatible with modern serving stacks such as vLLM. The result is a balance of performance, accessibility, and flexibility suited to professional environments and resource-constrained infrastructure alike.

Documentation

Note: If you have a multi-GPU SM120 Blackwell system (RTX 50 / RTX Pro), try my vLLM fork to resolve P2P / TP=2 issues (pending PR into upstream).

https://github.com/Gadflyii/vllm/tree/main

Qwen3-Coder-Next-NVFP4

NVFP4 quantized version of Qwen/Qwen3-Coder-Next (80B-A3B).

Model Details

Base Model: Qwen/Qwen3-Coder-Next
Architecture: Qwen3NextForCausalLM (Hybrid DeltaNet + Attention + MoE)
Parameters: 80B total, 3B activated per token
Experts: 512 total, 10 activated + 1 shared
Layers: 48
Context Length: 262,144 tokens (256K)
Quantization: NVFP4 (FP4 weights + FP4 activations)
Size: 45 GB (down from ~149 GB BF16, a 70% reduction)
Format: compressed-tensors
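The size figures line up with back-of-the-envelope arithmetic. The sketch below assumes NVFP4's group size of 16 with one FP8 scale per group (roughly 0.5 extra bits per weight); the remaining gap to the reported 45 GB is approximately the layers kept in BF16.

```python
# Rough storage arithmetic for an 80B-parameter model (sizes in GiB).
params = 80e9

bf16_gib = params * 2 / 2**30    # 2 bytes per weight
fp4_gib = params * 0.5 / 2**30   # 4 bits per weight, no scales

# Assumption: NVFP4 adds one FP8 scale per 16-weight group (~0.5 bit/weight).
fp4_with_scales_gib = params * (4 + 8 / 16) / 8 / 2**30

print(round(bf16_gib), round(fp4_gib), round(fp4_with_scales_gib))  # → 149 37 42
```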

Quantization Details

Quantized using llmcompressor 0.9.0.1.

Python
NUM_CALIBRATION_SAMPLES = 20
MAX_SEQUENCE_LENGTH = 2048
DATASET = "HuggingFaceH4/ultrachat_200k"  # train_sft split
moe_calibrate_all_experts = True

# Layers kept in BF16
ignore = [
    "lm_head",
    "re:.*mlp.gate$",               # MoE router gates
    "re:.*mlp.shared_expert_gate$", # Shared expert gates
    "re:.*linear_attn.*",           # DeltaNet linear attention
]
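To illustrate how the ignore list behaves: entries with a `re:` prefix are regular expressions matched against module names, and the rest are exact names. The helper below is a sketch of that matching logic (not llmcompressor's actual implementation) showing which modules stay in BF16.

```python
import re

# Ignore list from the recipe above: "re:"-prefixed entries are regexes,
# anything else is an exact module name.
ignore = [
    "lm_head",
    "re:.*mlp.gate$",
    "re:.*mlp.shared_expert_gate$",
    "re:.*linear_attn.*",
]

def kept_in_bf16(module_name: str) -> bool:
    """Return True if the module would be excluded from FP4 quantization."""
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return True
        elif module_name == entry:
            return True
    return False

print(kept_in_bf16("model.layers.3.mlp.gate"))                 # router gate -> True
print(kept_in_bf16("model.layers.3.linear_attn.in_proj"))      # DeltaNet -> True
print(kept_in_bf16("model.layers.3.mlp.experts.7.down_proj"))  # expert weight -> False
```

Note that `.*mlp.gate$` does not match `mlp.shared_expert_gate` (the `.` consumes exactly one character before `gate`), which is why the shared expert gates need their own pattern.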

Benchmark Results

MMLU-Pro

BF16: 52.90%
NVFP4: 51.27% (delta: -1.63 points)
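In relative terms, the quantized model retains nearly all of the BF16 score:

```python
bf16, nvfp4 = 52.90, 51.27

drop_points = bf16 - nvfp4      # absolute drop in percentage points
retention = 100 * nvfp4 / bf16  # share of BF16 accuracy retained

print(f"{drop_points:.2f} points, {retention:.1f}% retained")  # → 1.63 points, 96.9% retained
```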

Context Length Testing

Successfully tested up to 128K tokens with an FP8 KV cache (not enough VRAM to test longer contexts).

Usage with vLLM

Requires vLLM with NVFP4 support (0.16.0+) and Transformers 5.0.0+.

Bash
# Serve with tensor parallelism across 2 GPUs, 128K context, FP8 KV cache
vllm serve GadflyII/Qwen3-Coder-Next-NVFP4 \
    --tensor-parallel-size 2 \
    --max-model-len 131072 \
    --kv-cache-dtype fp8
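Once serving, vLLM exposes an OpenAI-compatible API (by default on port 8000). A minimal stdlib-only client might look like the sketch below; the prompt and sampling parameters are illustrative.

```python
import json
import urllib.request

# Chat-completions payload for the OpenAI-compatible endpoint vLLM exposes.
payload = {
    "model": "GadflyII/Qwen3-Coder-Next-NVFP4",
    "messages": [
        {"role": "user", "content": "Write a Python function that reverses a linked list."}
    ],
    "max_tokens": 512,
    "temperature": 0.2,
}

def ask(url: str = "http://localhost:8000/v1/chat/completions") -> str:
    """POST the payload to a running vLLM server and return the completion text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `ask()` against the running server returns the generated code as a string.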

License

Apache 2.0 (same as base model)

Acknowledgments

  • Qwen Team for the base model
  • RedHatAI for the quantization approach reference
  • vLLM Project for llmcompressor
Capabilities & Tags
transformers · safetensors · qwen3_next · text-generation · qwen3 · moe · nvfp4 · quantized · llmcompressor · vllm
Specifications
Category: Code
Access: API & Local
License: Open Source
Pricing: Open Source
