Qwen3.5 27B Claude 4.6 Opus Reasoning Distilled FP8 Dynamic

par mconcat

Open source · 133k downloads · 17 likes

1.6

(17 avis)ChatAPI & Local

À propos

Le modèle Qwen3.5 27B Claude 4.6 Opus Reasoning Distilled FP8 Dynamic est une version optimisée et quantifiée en FP8 du modèle Qwen3.5-27B, spécialement conçu pour le raisonnement avancé. Il combine une architecture hybride DeltaNet et attention softmax, offrant une capacité de contexte quatre fois supérieure à un transformateur classique grâce à un cache KV réduit. Grâce à sa quantification FP8, il réduit l'empreinte mémoire tout en maintenant une dégradation minimale des performances (seulement 1,4 % de perte en perplexité par rapport au BF16), tout en améliorant le débit de 1,6 fois. Ce modèle excelle dans les tâches complexes nécessitant une réflexion approfondie, comme l'analyse de données, la résolution de problèmes mathématiques ou la génération de raisonnements structurés. Il est particulièrement adapté aux environnements où les ressources GPU sont limitées, tout en conservant une grande précision. Sa conception le rend idéal pour des applications professionnelles ou éducatives où la qualité des réponses et l'efficacité computationnelle sont essentielles.

Documentation

Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-FP8-Dynamic

Uniform FP8 quantized version of Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled — a Claude 4.6 Opus reasoning-distilled Qwen3.5-27B model.

~29 GB on disk (~27 GiB in VRAM). Near-lossless FP8 quantization with only 1.4% perplexity degradation vs BF16. Recommended GPU: NVIDIA RTX PRO 6000 (96 GB) or other GPUs with >= 48 GB VRAM.

For 32 GB GPUs (RTX 5090): Use the NVFP4 mixed-precision variant instead (~25 GB, fits with usable context on a single 5090).

Quantization Strategy

Uniform FP8 W8A8 dynamic quantization using llm-compressor v0.10.1, stored in the compressed-tensors format. No calibration data needed — weight scales are computed statically per-channel, activation scales are computed dynamically per-token at inference time.

Precision	Layers	Rationale
FP8 W8A8 (per-channel weights, per-token dynamic activations)	All `nn.Linear` layers except those in the ignore list	Near-lossless: FP8 E4M3 preserves 3 mantissa bits with per-channel granularity
BF16 (unquantized)	`lm_head`, `embed_tokens`, DeltaNet small projections (`in_proj_a`, `in_proj_b`), all norms, visual encoder, MoE router gates	lm_head amplifies errors across 248K vocab; embed_tokens is a lookup table; DeltaNet low-rank projections are numerically sensitive; vision tower retained at full precision

Weight Breakdown

Component	Size	Precision
MLP	14.6 GB	FP8
DeltaNet attention	6.9 GB	FP8 + BF16
lm_head	2.5 GB	BF16
embed_tokens	2.5 GB	BF16
Softmax attention	2.1 GB	FP8
Visual encoder	0.9 GB	BF16
Total	~29 GB

Architecture

Qwen3.5-27B uses a hybrid DeltaNet + softmax attention architecture with full_attention_interval=4:

INI

Layer pattern (64 layers):
  [DeltaNet, DeltaNet, DeltaNet, Softmax] × 16
  = 48 DeltaNet layers + 16 softmax attention layers

Key architectural parameters:

Hidden size: 5,120
Attention heads: 24 (query), 4 (KV, GQA)
Head dimension: 256
DeltaNet heads: 16 key, 48 value (dim 128 each)
MLP intermediate: 17,408
Vocabulary: 248,320
Max position embeddings: 262,144

Only 16 of 64 layers require KV cache — the 48 DeltaNet layers use a fixed-size recurrent state that doesn't grow with sequence length. This gives ~4x more context capacity than a standard transformer of the same size.

KV Cache Budget

Per-token KV cache cost (only 16 softmax layers):

FP16: 4 KV heads x 256 dim x 2 (K+V) x 2 bytes x 16 layers = 64 KB/token
FP8: 32 KB/token

GPU	Available for KV	Max Context (FP8 KV)
RTX 5090 (32 GB)	~4 GiB	~128K tokens (single request)
RTX PRO 6000 (96 GB)	~68 GiB	8 concurrent requests × 262K tokens each

Usage

Serving with vLLM (recommended)

Bash

pip install vllm>=0.17.0

vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-FP8-Dynamic \
  --max-model-len 131072 \
  --reasoning-parser qwen3

RTX PRO 6000 / high-VRAM GPUs (>= 48 GB):

Bash

vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-FP8-Dynamic \
  --max-model-len 262144 \
  --reasoning-parser qwen3

Note: On RTX 5090 (32 GB), the same Blackwell-specific vLLM issues that affect the NVFP4 variant also apply here. See the NVFP4 model card for details and tracking PRs. On GPUs with >= 48 GB VRAM, these issues are irrelevant.

Transformers (direct loading)

Python

from transformers import AutoTokenizer, Qwen3_5ForConditionalGeneration
import torch

model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-FP8-Dynamic",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-FP8-Dynamic",
    trust_remote_code=True,
)

Compatibility

Framework	Supported	Notes
vLLM >= 0.17.0	Yes	FP8 W8A8 with Blackwell FP8 acceleration. Works on >= 48 GB GPUs; 32 GB Blackwell GPUs require upcoming vLLM fixes
transformers >= 5.3.0	Yes	Direct loading with `device_map="auto"`
SGLang	Yes	FP8 compressed-tensors supported
llama.cpp / GGUF	No	compressed-tensors FP8 format not supported

Hardware Requirements

Configuration	VRAM	Notes
Minimum	32 GB	Weights only, minimal context
RTX PRO 6000 (recommended)	96 GB	8 concurrent × 262K context with FP8 KV cache. Works out of the box with vLLM 0.17.0
2x RTX 5090	64 GB	Tensor parallel, full context
RTX 5090 (single)	32 GB	Not recommended — only ~4 GiB free for KV cache after model loading. Use the NVFP4 variant instead

Benchmark Results

Comparison against the BF16 source model. All benchmarks run on NVIDIA RTX PRO 6000 (96 GB) with vLLM 0.17.0, temperature=0.6 for generation tasks (Qwen recommended setting for thinking mode).

Benchmark	BF16 (54 GB)	FP8 (29 GB)	Delta
Perplexity (FineWeb-Edu, 100 samples)	6.6119	6.7026	+0.09 (+1.4%)
MMLU-Pro (500 samples)	54.0%	56.0%*	+2.0%
ARC-Challenge (1,172 samples)	97.6%	100%*	+2.4%
GSM8K Platinum (200 samples)	99.5%	—	—
AIME 2025 (30 problems)	40.0%	—	—
Throughput (single GPU)	17.8 tok/s	29.1 tok/s	+1.6x

*FP8 MMLU-Pro and ARC ran with 50 samples (quick mode); BF16 used full sample sizes. Full FP8 benchmarks will be updated.

Summary: FP8 quantization is near-lossless — perplexity degrades only 1.4% vs BF16, while throughput improves 1.6x from reduced memory bandwidth. For comparison, the NVFP4 variant (25 GB) shows 2.1% perplexity degradation but fits in 4 GB less VRAM.

Source Model

This is a quantization of Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled, which is an SFT fine-tune of Qwen/Qwen3.5-27B using Claude 4.6 Opus reasoning distillation data.

Training datasets:

Quantization Details

Tool: llm-compressor v0.10.1
Format: compressed-tensors (uniform FP8)
Scheme: FP8 W8A8 dynamic — per-channel static weight scales, per-token dynamic activation scales
Calibration: None required (weight-only scale computation)
Hardware: NVIDIA RTX PRO 6000 Blackwell (96 GB)

License

Apache 2.0, following the base model license.

Liens & Ressources

Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-FP8-Dynamic

Uniform FP8 quantized version of Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled — a Claude 4.6 Opus reasoning-distilled Qwen3.5-27B model.

~29 GB on disk (~27 GiB in VRAM). Near-lossless FP8 quantization with only 1.4% perplexity degradation vs BF16. Recommended GPU: NVIDIA RTX PRO 6000 (96 GB) or other GPUs with >= 48 GB VRAM.

For 32 GB GPUs (RTX 5090): Use the NVFP4 mixed-precision variant instead (~25 GB, fits with usable context on a single 5090).

Quantization Strategy

Precision	Layers	Rationale
FP8 W8A8 (per-channel weights, per-token dynamic activations)	All `nn.Linear` layers except those in the ignore list	Near-lossless: FP8 E4M3 preserves 3 mantissa bits with per-channel granularity
BF16 (unquantized)	`lm_head`, `embed_tokens`, DeltaNet small projections (`in_proj_a`, `in_proj_b`), all norms, visual encoder, MoE router gates	lm_head amplifies errors across 248K vocab; embed_tokens is a lookup table; DeltaNet low-rank projections are numerically sensitive; vision tower retained at full precision

Weight Breakdown

Component	Size	Precision
MLP	14.6 GB	FP8
DeltaNet attention	6.9 GB	FP8 + BF16
lm_head	2.5 GB	BF16
embed_tokens	2.5 GB	BF16
Softmax attention	2.1 GB	FP8
Visual encoder	0.9 GB	BF16
Total	~29 GB

Architecture

Qwen3.5-27B uses a hybrid DeltaNet + softmax attention architecture with full_attention_interval=4:

INI

Layer pattern (64 layers):
  [DeltaNet, DeltaNet, DeltaNet, Softmax] × 16
  = 48 DeltaNet layers + 16 softmax attention layers

Key architectural parameters:

Hidden size: 5,120
Attention heads: 24 (query), 4 (KV, GQA)
Head dimension: 256
DeltaNet heads: 16 key, 48 value (dim 128 each)
MLP intermediate: 17,408
Vocabulary: 248,320
Max position embeddings: 262,144

KV Cache Budget

Per-token KV cache cost (only 16 softmax layers):

FP16: 4 KV heads x 256 dim x 2 (K+V) x 2 bytes x 16 layers = 64 KB/token
FP8: 32 KB/token

GPU	Available for KV	Max Context (FP8 KV)
RTX 5090 (32 GB)	~4 GiB	~128K tokens (single request)
RTX PRO 6000 (96 GB)	~68 GiB	8 concurrent requests × 262K tokens each

Usage

Serving with vLLM (recommended)

Bash

pip install vllm>=0.17.0

vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-FP8-Dynamic \
  --max-model-len 131072 \
  --reasoning-parser qwen3

RTX PRO 6000 / high-VRAM GPUs (>= 48 GB):

Bash

vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-FP8-Dynamic \
  --max-model-len 262144 \
  --reasoning-parser qwen3

Note: On RTX 5090 (32 GB), the same Blackwell-specific vLLM issues that affect the NVFP4 variant also apply here. See the NVFP4 model card for details and tracking PRs. On GPUs with >= 48 GB VRAM, these issues are irrelevant.

Transformers (direct loading)

Python

from transformers import AutoTokenizer, Qwen3_5ForConditionalGeneration
import torch

model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-FP8-Dynamic",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-FP8-Dynamic",
    trust_remote_code=True,
)

Compatibility

Framework	Supported	Notes
vLLM >= 0.17.0	Yes	FP8 W8A8 with Blackwell FP8 acceleration. Works on >= 48 GB GPUs; 32 GB Blackwell GPUs require upcoming vLLM fixes
transformers >= 5.3.0	Yes	Direct loading with `device_map="auto"`
SGLang	Yes	FP8 compressed-tensors supported
llama.cpp / GGUF	No	compressed-tensors FP8 format not supported

Hardware Requirements

Configuration	VRAM	Notes
Minimum	32 GB	Weights only, minimal context
RTX PRO 6000 (recommended)	96 GB	8 concurrent × 262K context with FP8 KV cache. Works out of the box with vLLM 0.17.0
2x RTX 5090	64 GB	Tensor parallel, full context
RTX 5090 (single)	32 GB	Not recommended — only ~4 GiB free for KV cache after model loading. Use the NVFP4 variant instead

Benchmark Results

Comparison against the BF16 source model. All benchmarks run on NVIDIA RTX PRO 6000 (96 GB) with vLLM 0.17.0, temperature=0.6 for generation tasks (Qwen recommended setting for thinking mode).

Benchmark	BF16 (54 GB)	FP8 (29 GB)	Delta
Perplexity (FineWeb-Edu, 100 samples)	6.6119	6.7026	+0.09 (+1.4%)
MMLU-Pro (500 samples)	54.0%	56.0%*	+2.0%
ARC-Challenge (1,172 samples)	97.6%	100%*	+2.4%
GSM8K Platinum (200 samples)	99.5%	—	—
AIME 2025 (30 problems)	40.0%	—	—
Throughput (single GPU)	17.8 tok/s	29.1 tok/s	+1.6x

*FP8 MMLU-Pro and ARC ran with 50 samples (quick mode); BF16 used full sample sizes. Full FP8 benchmarks will be updated.

Source Model

This is a quantization of Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled, which is an SFT fine-tune of Qwen/Qwen3.5-27B using Claude 4.6 Opus reasoning distillation data.

Training datasets:

Quantization Details

Tool: llm-compressor v0.10.1
Format: compressed-tensors (uniform FP8)
Scheme: FP8 W8A8 dynamic — per-channel static weight scales, per-token dynamic activation scales
Calibration: None required (weight-only scale computation)
Hardware: NVIDIA RTX PRO 6000 Blackwell (96 GB)

License

Apache 2.0, following the base model license.