by GadflyII
Qwen3 Coder Next NVFP4 is a quantized version of the Qwen3-Coder-Next model, optimized for efficient resource usage while maintaining high accuracy. It is aimed at developers who need strong code comprehension, code generation, and technical text analysis. The base model supports contexts up to 262,144 tokens (256K), and this quantization has been verified up to 128K. NVFP4 quantization reduces memory consumption by roughly 70% and speeds up execution while remaining compatible with modern inference stacks such as vLLM, making the model a good fit for professional environments and resource-constrained infrastructures.
vLLM fork: https://github.com/Gadflyii/vllm/tree/main
NVFP4 quantized version of Qwen/Qwen3-Coder-Next (80B-A3B).
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-Coder-Next |
| Architecture | Qwen3NextForCausalLM (Hybrid DeltaNet + Attention + MoE) |
| Parameters | 80B total, 3B activated per token |
| Experts | 512 total, 10 activated + 1 shared |
| Layers | 48 |
| Context Length | 262,144 tokens (256K) |
| Quantization | NVFP4 (FP4 weights + FP4 activations) |
| Size | 45GB (down from ~149GB BF16, 70% reduction) |
| Format | compressed-tensors |
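The size figures in the table can be sanity-checked with back-of-the-envelope arithmetic. A rough sketch, assuming the standard NVFP4 layout (4-bit weights with one FP8 scale per group of 16); the remaining gap to the 45GB checkpoint comes from the layers kept in BF16:

```python
# Rough memory estimate for an 80B-parameter model (illustrative arithmetic only).
GIB = 2**30
params = 80e9

# BF16: 2 bytes per parameter.
bf16_gib = params * 2 / GIB            # ~149 GiB, matching the table

# NVFP4: 4-bit weights plus one FP8 (1-byte) scale per group of 16 weights.
fp4_weights_gib = params * 0.5 / GIB   # ~37.3 GiB
fp8_scales_gib = (params / 16) / GIB   # ~4.7 GiB
nvfp4_gib = fp4_weights_gib + fp8_scales_gib

print(round(bf16_gib), round(nvfp4_gib, 1))  # → 149 41.9
```

The unquantized BF16-kept layers (router gates, shared expert gates, DeltaNet attention, `lm_head`) account for the difference between ~42 GiB and the final ~45GB.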
Quantized with llmcompressor 0.9.0.1 using the following calibration settings:

```python
NUM_CALIBRATION_SAMPLES = 20
MAX_SEQUENCE_LENGTH = 2048
DATASET = "HuggingFaceH4/ultrachat_200k"  # train_sft split
moe_calibrate_all_experts = True

# Layers kept in BF16
ignore = [
    "lm_head",
    "re:.*mlp.gate$",                # MoE router gates
    "re:.*mlp.shared_expert_gate$",  # shared expert gates
    "re:.*linear_attn.*",            # DeltaNet linear attention
]
```
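In the ignore list, `re:`-prefixed entries are regex patterns (the llmcompressor/compressed-tensors convention) and plain entries match module names exactly. A small sketch of how they resolve; the example module names are illustrative, and the library's actual matching logic may differ in detail:

```python
import re

IGNORE = [
    "lm_head",
    "re:.*mlp.gate$",
    "re:.*mlp.shared_expert_gate$",
    "re:.*linear_attn.*",
]

def is_ignored(module_name: str) -> bool:
    """Return True if the module stays in BF16 (skipped by quantization)."""
    for pattern in IGNORE:
        if pattern.startswith("re:"):
            if re.fullmatch(pattern[3:], module_name):
                return True
        elif module_name == pattern:
            return True
    return False

print(is_ignored("model.layers.0.mlp.gate"))             # → True  (router gate)
print(is_ignored("model.layers.0.linear_attn.in_proj"))  # → True  (DeltaNet)
print(is_ignored("model.layers.0.mlp.experts.3.down_proj"))  # → False (expert weight, quantized)
```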
| Model | Accuracy | Delta |
|---|---|---|
| BF16 | 52.90% | - |
| NVFP4 | 51.27% | -1.63% |
Successfully tested up to 128K tokens with an FP8 KV cache; there was not enough VRAM to test longer contexts.
Requires vLLM with NVFP4 support (0.16.0+) and Transformers 5.0.0+.
# vLLM Serving

```shell
vllm serve GadflyII/Qwen3-Coder-Next-NVFP4 \
    --tensor-parallel-size 2 \
    --max-model-len 131072 \
    --kv-cache-dtype fp8
```
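Once running, the server exposes vLLM's OpenAI-compatible API. A minimal client sketch; the host/port are vLLM defaults and an assumption here, and the actual request is left commented so the snippet stands alone without a live server:

```python
import json

# Hypothetical endpoint, assuming vLLM's default host and port.
URL = "http://localhost:8000/v1/chat/completions"

def build_payload(prompt: str, max_tokens: int = 512) -> dict:
    """Build an OpenAI-style chat-completions payload for the served model."""
    return {
        "model": "GadflyII/Qwen3-Coder-Next-NVFP4",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_payload("Write a Python function that reverses a linked list.")
print(json.dumps(payload, indent=2))

# To send it (requires the `requests` package and the server above running):
# resp = requests.post(URL, json=payload, timeout=120)
# print(resp.json()["choices"][0]["message"]["content"])
```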
License: Apache 2.0 (same as the base model).