by mudler
The Qwen3.5 35B A3B APEX GGUF model is an optimized build of the Qwen3.5 architecture, designed specifically for Mixture-of-Experts (MoE) deployment. Using the APEX quantization technique, it combines adaptive layer-wise precision with intelligent calibration to deliver quality comparable to floating-point baselines (such as F16) at a fraction of the file size. Its core strengths are high accuracy across diverse tasks -- text generation, reasoning, programming, and tool invocation -- with variants tailored to different use cases, from high-end environments to modest hardware. What sets it apart is that it maintains quality after quantization, matching or outperforming Q8_0 at roughly half the size. The "I-" variants add diversified calibration that improves benchmark performance, in particular reducing KL divergence against the full-precision model and improving response consistency. Ideal for local deployments or resource-constrained infrastructure, it adapts to both powerful servers and consumer-grade GPUs, offering flexibility and efficiency without compromising precision.
Brought to you by the LocalAI team -- the creators of LocalAI, the open-source AI engine that runs any model (LLMs, vision, voice, image, video) on any hardware. No GPU required.
APEX Technical Report | GitHub Repository | LocalAI
APEX (Adaptive Precision for EXpert Models) is a novel quantization technique for Mixture-of-Experts language models. Unlike uniform quantization methods that apply the same precision to every tensor, APEX introduces a layer-wise precision gradient combined with MoE-aware tensor classification and diverse imatrix calibration to achieve Q8_0-level quality at a fraction of the size. The method was discovered through systematic human-driven, AI-assisted research across 25+ quantization strategies. APEX outperforms Unsloth Dynamic 2.0 (UD) quantizations on accuracy benchmarks while being 2x smaller.
This repository contains seven APEX GGUF files plus a vision projector (mmproj) covering every deployment scenario from maximum accuracy to consumer GPU inference. The best configuration (APEX Quality) beats both Q8_0 and F16 perplexity while being 38% smaller than Q8_0. I-variants use a diverse imatrix (chat, code, reasoning, tool-calling -- no Wikipedia) that trades tiny perplexity increases for significant accuracy gains and lower KL divergence.
For the full technical details, method description, and reproduction scripts, see the APEX GitHub repository.
| File | Configuration | Size | PPL | Speed (tg128) | Best for |
|---|---|---|---|---|---|
| Qwen3.5-35B-A3B-APEX-Quality.gguf | APEX Quality | 21.3 GB | 6.527 | 62.3 t/s | Lowest perplexity of any quantization |
| Qwen3.5-35B-A3B-APEX-I-Quality.gguf | APEX I-Quality | 21.3 GB | 6.552 | 63.1 t/s | Best accuracy across benchmarks |
| Qwen3.5-35B-A3B-APEX-Balanced.gguf | APEX Balanced | 23.6 GB | 6.533 | 60.8 t/s | Interactive use, serving, general purpose |
| Qwen3.5-35B-A3B-APEX-I-Balanced.gguf | APEX I-Balanced | 23.6 GB | 6.548 | 61.4 t/s | All-round with lower KL divergence |
| Qwen3.5-35B-A3B-APEX-Compact.gguf | APEX Compact | 16.1 GB | 6.783 | 69.8 t/s | Consumer 24 GB GPUs |
| Qwen3.5-35B-A3B-APEX-I-Compact.gguf | APEX I-Compact | 16.1 GB | 6.669 | 69.8 t/s | 16 GB GPUs, best accuracy at this size |
| Qwen3.5-35B-A3B-APEX-Mini.gguf | APEX Mini | 12.2 GB | 7.088 | 74.4 t/s | Consumer 16 GB VRAM, smallest viable |
| mmproj-F16.gguf | Vision Projector | 899 MB | -- | -- | Required for vision/multimodal tasks |
APEX Quality uses a 3-tier layer-wise precision gradient (Q6_K/Q5_K/IQ4_XS) with Q8_0 shared experts. It achieves the lowest perplexity of any quantization tested -- beating even F16 (6.527 vs 6.537).
APEX I-Quality uses the same architecture as Quality but with a diverse imatrix (chat, code, reasoning, tool-calling -- no Wikipedia). It achieves the highest HellaSwag (83.5%), matches Q8_0 on ARC (57.9%), and posts the best TruthfulQA (38.4%) of any model tested.
APEX Balanced uses a 2-tier gradient (Q6_K edges, Q5_K middle) with Q8_0 shared experts. It matches Q8_0 perplexity exactly (6.533) while being 31% smaller and 16% faster. Recommended for general-purpose use.
APEX I-Balanced uses the same architecture as Balanced with a diverse imatrix. KL divergence drops 11% (mean 0.0078 vs 0.0088) and KL max drops from 6.03 to 5.77.
APEX Compact uses Q4_K edge layers, Q3_K middle layers, and Q6_K shared experts. At 16.1 GB it fits consumer 24 GB GPUs with room for KV cache.
APEX I-Compact is the biggest imatrix winner: PPL drops from 6.783 to 6.669 (-0.114), KL max from 7.56 to 5.50, and MMLU rises from 40.9% to 41.7%. The diverse imatrix has the largest impact on aggressively quantized tiers.
APEX Mini combines the layer-wise precision gradient with IQ2_S middle-layer experts and a diverse imatrix, pushing to 12.2 GB. It beats bartowski IQ2_M (11.3 GB) on every metric: PPL 7.088 vs 7.303, HellaSwag 81.0% vs 80.3%, MMLU 41.3% vs 39.6%. Fits consumer 16 GB VRAM GPUs with room for context.
All measurements on Qwen3.5-35B-A3B, NVIDIA DGX Spark (GB10, 122 GB VRAM). Perplexity measured on wikitext-2-raw, context 2048. Accuracy benchmarks (HellaSwag, Winogrande, MMLU, ARC-Challenge, TruthfulQA) evaluated via llama.cpp using 400 tasks where applicable.
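For reference, the perplexity figures below are the exponential of the mean negative log-likelihood per token. A minimal sketch of the metric itself (not the llama.cpp implementation):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over all tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model assigning every token probability 1/e has perplexity e.
print(perplexity([-1.0, -1.0, -1.0]))  # → 2.718281828459045
```

Lower is better: a perplexity of 6.5 means the model is, on average, about as uncertain as choosing uniformly among 6.5 tokens.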
| Quantization | Size (GB) | PPL | KL mean | KL max | HS | WG | MMLU | ARC | TQA | tg128 (t/s) |
|---|---|---|---|---|---|---|---|---|---|---|
| F16 | 64.6 | 6.537 | -- | -- | 82.5% | 74.5% | 41.5% | 56.9% | 37.2% | 30.4 |
| Q8_0 | 34.4 | 6.533 | 0.0046 | 14.71 | 83.0% | 75.3% | 41.2% | 57.9% | 37.7% | 52.5 |
| APEX Quality | 21.3 | 6.527 | 0.0114 | 5.85 | 83.0% | 74.5% | 41.2% | 56.2% | 37.7% | 62.3 |
| APEX I-Quality | 21.3 | 6.552 | 0.0102 | 5.59 | 83.5% | 74.5% | 41.4% | 57.9% | 38.4% | 63.1 |
| APEX Balanced | 23.6 | 6.533 | 0.0088 | 6.03 | 83.0% | 74.5% | 41.3% | 56.9% | 36.8% | 60.8 |
| APEX I-Balanced | 23.6 | 6.548 | 0.0078 | 5.77 | 83.0% | 73.3% | 41.0% | 57.5% | 37.5% | 61.4 |
| APEX Compact | 16.1 | 6.783 | 0.0469 | 7.56 | 82.5% | 73.3% | 40.9% | 55.2% | 36.5% | 69.8 |
| APEX I-Compact | 16.1 | 6.669 | 0.0332 | 5.50 | 81.8% | 75.0% | 41.7% | 55.5% | 37.9% | 69.8 |
| APEX Mini | 12.2 | 7.088 | 0.0870 | 5.57 | 81.0% | 75.5% | 41.3% | 57.2% | 36.7% | 74.4 |
| Unsloth UD-Q8_K_XL | 45.3 | 6.536 | 0.0025 | 4.36 | 82.5% | 74.8% | 41.3% | 57.9% | 38.1% | 36.4 |
| Unsloth UD-Q4_K_L | 18.8 | 6.586 | 0.0151 | 5.98 | 82.3% | 75.8% | 41.1% | 59.2% | 37.3% | 65.5 |
| bartowski IQ2_M | 11.3 | 7.303 | 0.1113 | 6.07 | 80.3% | 74.0% | 39.6% | 56.2% | 35.0% | 76.2 |
| bartowski Q3_K_M | 15.1 | 6.730 | 0.0420 | 5.56 | 82.0% | 75.0% | 41.5% | 57.5% | 38.8% | 60.6 |
| Benchmark | F16 | Q8_0 | Quality | I-Quality | Balanced | I-Balanced | Compact | I-Compact | Mini | Q8_K_XL | Q4_K_L | IQ2_M | Q3_K_M |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HellaSwag | 82.5% | 83.0% | 83.0% | 83.5% | 83.0% | 83.0% | 82.5% | 81.8% | 81.0% | 82.5% | 82.3% | 80.3% | 82.0% |
| Winogrande | 74.5% | 75.3% | 74.5% | 74.5% | 74.5% | 73.3% | 73.3% | 75.0% | 75.5% | 74.8% | 75.8% | 74.0% | 75.0% |
| MMLU | 41.5% | 41.2% | 41.2% | 41.4% | 41.3% | 41.0% | 40.9% | 41.7% | 41.3% | 41.3% | 41.1% | 39.6% | 41.5% |
| ARC | 56.9% | 57.9% | 56.2% | 57.9% | 56.9% | 57.5% | 55.2% | 55.5% | 57.2% | 57.9% | 59.2% | 56.2% | 57.5% |
| TruthfulQA | 37.2% | 37.7% | 37.7% | 38.4% | 36.8% | 37.5% | 36.5% | 37.9% | 36.7% | 38.1% | 37.3% | 35.0% | 38.8% |

```bash
# Download
huggingface-cli download mudler/Qwen3.5-35B-A3B-APEX-GGUF \
  Qwen3.5-35B-A3B-APEX-I-Quality.gguf --local-dir ./model

# Interactive chat
llama-cli -m ./model/Qwen3.5-35B-A3B-APEX-I-Quality.gguf \
  --conversation -ngl 99

# Server mode
llama-server -m ./model/Qwen3.5-35B-A3B-APEX-I-Quality.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```
Requires ~22 GB VRAM for full GPU offload. Uses diverse imatrix calibration for best accuracy across benchmarks. Recommended when downstream task performance matters more than raw perplexity.
```bash
# Download
huggingface-cli download mudler/Qwen3.5-35B-A3B-APEX-GGUF \
  Qwen3.5-35B-A3B-APEX-Quality.gguf --local-dir ./model

# Interactive chat
llama-cli -m ./model/Qwen3.5-35B-A3B-APEX-Quality.gguf \
  --conversation -ngl 99

# Server mode
llama-server -m ./model/Qwen3.5-35B-A3B-APEX-Quality.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```
Requires ~22 GB VRAM for full GPU offload. Uses IQ4_XS for middle-layer experts, so llama.cpp b5460 or later is recommended.
```bash
# Download
huggingface-cli download mudler/Qwen3.5-35B-A3B-APEX-GGUF \
  Qwen3.5-35B-A3B-APEX-I-Balanced.gguf --local-dir ./model

# Interactive chat
llama-cli -m ./model/Qwen3.5-35B-A3B-APEX-I-Balanced.gguf \
  --conversation -ngl 99

# Server mode
llama-server -m ./model/Qwen3.5-35B-A3B-APEX-I-Balanced.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```
Requires ~24 GB VRAM for full GPU offload. Uses diverse imatrix calibration with standard K-quant formats for lower KL divergence.
```bash
# Download
huggingface-cli download mudler/Qwen3.5-35B-A3B-APEX-GGUF \
  Qwen3.5-35B-A3B-APEX-Balanced.gguf --local-dir ./model

# Interactive chat
llama-cli -m ./model/Qwen3.5-35B-A3B-APEX-Balanced.gguf \
  --conversation -ngl 99

# Server mode
llama-server -m ./model/Qwen3.5-35B-A3B-APEX-Balanced.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```
Requires ~24 GB VRAM for full GPU offload. Uses only standard K-quant formats (Q6_K/Q5_K) with optimized dequantization kernels.
```bash
# Download
huggingface-cli download mudler/Qwen3.5-35B-A3B-APEX-GGUF \
  Qwen3.5-35B-A3B-APEX-I-Compact.gguf --local-dir ./model

# Interactive chat
llama-cli -m ./model/Qwen3.5-35B-A3B-APEX-I-Compact.gguf \
  --conversation -ngl 99

# Server mode
llama-server -m ./model/Qwen3.5-35B-A3B-APEX-I-Compact.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```
Requires ~17 GB VRAM for full GPU offload. The biggest imatrix winner -- PPL drops 0.114 vs standard Compact, MMLU rises from 40.9% to 41.7%.
```bash
# Download
huggingface-cli download mudler/Qwen3.5-35B-A3B-APEX-GGUF \
  Qwen3.5-35B-A3B-APEX-Compact.gguf --local-dir ./model

# Interactive chat
llama-cli -m ./model/Qwen3.5-35B-A3B-APEX-Compact.gguf \
  --conversation -ngl 99

# Server mode
llama-server -m ./model/Qwen3.5-35B-A3B-APEX-Compact.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```
Requires ~17 GB VRAM for full GPU offload. Fits consumer 24 GB GPUs (RTX 4090, RTX 5090) with room for KV cache and context.
```bash
# Download
huggingface-cli download mudler/Qwen3.5-35B-A3B-APEX-GGUF \
  Qwen3.5-35B-A3B-APEX-Mini.gguf --local-dir ./model

# Interactive chat
llama-cli -m ./model/Qwen3.5-35B-A3B-APEX-Mini.gguf \
  --conversation -ngl 99

# Server mode
llama-server -m ./model/Qwen3.5-35B-A3B-APEX-Mini.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```
Requires ~13 GB VRAM for full GPU offload. Fits consumer 16 GB VRAM GPUs (RTX 4060 Ti 16GB, RTX 5060 Ti) with room for context. Beats bartowski IQ2_M on every metric despite being only 0.9 GB larger.
```bash
# Download the entire repository (all variants plus the vision projector)
huggingface-cli download mudler/Qwen3.5-35B-A3B-APEX-GGUF --local-dir ./model
```
Qwen3.5-35B-A3B is a Mixture-of-Experts language model with 35 billion total parameters but only 3 billion active per token. It uses 256 experts per MoE layer, routing 8 experts plus 1 shared expert per token across 40 transformer layers. This sparse activation pattern means 97% of expert weights are idle for any given token, creating an opportunity for differentiated quantization.
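The sparsity figure above follows from simple arithmetic on the routing configuration (a sketch; only the expert counts from this card are used):

```python
# Routing configuration of Qwen3.5-35B-A3B as described above.
experts_total = 256
experts_routed = 8   # experts selected by the router per token
experts_shared = 1   # always-active shared expert

active_fraction = (experts_routed + experts_shared) / experts_total
idle_fraction = 1 - active_fraction
print(f"{idle_fraction:.1%} of expert weights idle per token")  # → 96.5% (the ~97% quoted above)
```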
APEX exploits three properties of MoE models to achieve near-lossless compression:
Not all tensors in an MoE model are equal. APEX classifies them into three categories with different precision requirements:
Edge transformer layers (the first and last 5) handle input embedding alignment and output logit generation. They are significantly more sensitive to quantization than the middle layers, which perform more redundant intermediate processing. APEX assigns higher precision to the edges and lower precision to the middle.
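The edge/middle assignment can be sketched as a small lookup over the layer index. This is an illustration of the idea, not the APEX tooling; the function name and the near-edge tier boundary (a second band of 5 layers) are assumptions, with type names taken from the Quality configuration:

```python
def expert_quant_type(layer: int, n_layers: int = 40, edge: int = 5) -> str:
    """3-tier layer-wise precision gradient: high precision at the
    edges, lower precision toward the redundant middle layers."""
    if layer < edge or layer >= n_layers - edge:
        return "Q6_K"      # edge layers: most quantization-sensitive
    if layer < 2 * edge or layer >= n_layers - 2 * edge:
        return "Q5_K"      # near-edge layers (assumed band width)
    return "IQ4_XS"        # middle layers: most redundant

assert expert_quant_type(0) == "Q6_K"
assert expert_quant_type(7) == "Q5_K"
assert expert_quant_type(20) == "IQ4_XS"
```

In practice these assignments would be serialized into a tensor-type file and passed to the quantizer.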
| Configuration | Size | Expert strategy | Shared expert | Attention | Best for |
|---|---|---|---|---|---|
| APEX I-Quality | 21.3 GB | Q6_K edges, Q5_K near-edges, IQ4_XS middle, diverse imatrix | Q8_0 | Q6_K | Best accuracy |
| APEX Quality | 21.3 GB | Q6_K edges, Q5_K near-edges, IQ4_XS middle | Q8_0 | Q6_K | Lowest perplexity |
| APEX I-Balanced | 23.6 GB | Q6_K edges, Q5_K middle, diverse imatrix | Q8_0 | Q6_K | All-round, lower KL |
| APEX Balanced | 23.6 GB | Q6_K edges, Q5_K middle | Q8_0 | Q6_K | General purpose |
| APEX I-Compact | 16.1 GB | Q4_K edges, Q3_K middle, diverse imatrix | Q6_K | Q4_K | Best accuracy at 16 GB |
| APEX Compact | 16.1 GB | Q4_K edges, Q3_K middle | Q6_K | Q4_K | Consumer 24 GB GPUs |
| APEX Mini | 12.2 GB | Layer gradient with IQ2_S middle, diverse imatrix | Q6_K | Q4_K | Consumer 16 GB VRAM |
Standard imatrix calibration uses Wikipedia text, which biases quantization toward encyclopedic prose. APEX I-variants use a diverse calibration dataset spanning chat, code, reasoning, and tool-calling -- no Wikipedia. This produces a different optimization tradeoff: I-variants trade a tiny perplexity increase on wikitext (which is itself Wikipedia-derived, and so favors the standard calibration) for significant gains on real-world accuracy benchmarks and consistently lower KL divergence.
The effect is most dramatic on aggressive quantizations. I-Compact drops perplexity from 6.783 to 6.669 (-0.114), reduces KL max from 7.56 to 5.50, and lifts MMLU from 40.9% to 41.7%. At the Quality tier, I-Quality achieves the highest HellaSwag score of any model tested (83.5%), matches Q8_0 on ARC (57.9%), and posts the best TruthfulQA (38.4%).
APEX Mini combines the layer-wise precision gradient with IQ2_S middle-layer experts and a diverse imatrix to push MoE quantization to 12.2 GB. At this size it fits consumer 16 GB VRAM GPUs (RTX 4060 Ti 16GB, RTX 5060 Ti) with room for context. It beats bartowski IQ2_M (11.3 GB) on every single metric: PPL 7.088 vs 7.303, HellaSwag 81.0% vs 80.3%, MMLU 41.3% vs 39.6%, ARC 57.2% vs 56.2%. The layer gradient + diverse imatrix combination outperforms uniform quantization even at extreme compression ratios.
The APEX method and code will be published soon.
Information-theoretic metrics: Perplexity is measured on wikitext-2-raw (context 2048, full dataset). KL Divergence measures the divergence between quantized and full-precision logit distributions, reported as mean, max, 99.9th percentile, and median. Lower values indicate the quantized model's predictions more closely match the original.
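The KL statistics compare, at each token position, the quantized model's predictive distribution against the full-precision one. A minimal sketch of the metric (softmax over logits, KL per position, then mean and max):

```python
import math

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    """KL(P || Q) for two discrete distributions over the vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_stats(ref_logits, quant_logits):
    """Per-position KL between full-precision and quantized logits."""
    kls = [kl(softmax(r), softmax(q)) for r, q in zip(ref_logits, quant_logits)]
    return sum(kls) / len(kls), max(kls)

mean_kl, max_kl = kl_stats([[2.0, 0.0, -1.0]], [[1.9, 0.1, -1.0]])
```

A mean of 0.0046 (as for Q8_0) means the quantized next-token distribution is, on average, nearly indistinguishable from full precision; the max captures the worst single position.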
Downstream accuracy benchmarks: HellaSwag (commonsense reasoning), Winogrande (coreference resolution), MMLU (multitask language understanding), ARC-Challenge (science QA), and TruthfulQA (truthful generation) are evaluated via llama.cpp with 400 tasks where applicable.
Note: Evaluations on hybrid MoE models were enabled by our upstream fix to llama.cpp's hybrid memory path for recurrent architectures (PR-ready).
All benchmarks were measured on an NVIDIA DGX Spark. Per-layer precision assignments were produced with `llama-quantize` using `--tensor-type-file`.

These APEX quantized models work out of the box with LocalAI -- a free, open-source OpenAI-compatible API that runs locally. Load any APEX GGUF and get an instant API server with chat completions, embeddings, and more:
```bash
# Run APEX Balanced with LocalAI
local-ai run mudler/[email protected]
```
LocalAI supports GPU acceleration, multiple model loading, and function calling. See the LocalAI documentation for more.
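Because the endpoint is OpenAI-compatible, any OpenAI client works against it. A minimal stdlib-only sketch; the base URL assumes a local server on port 8080, and the model alias is hypothetical (use whatever name LocalAI loaded):

```python
import json
import urllib.request

def chat_payload(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def ask(base_url: str, model: str, prompt: str) -> str:
    """POST to the OpenAI-compatible /v1/chat/completions endpoint."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Example (requires a running LocalAI instance; model alias is an assumption):
# print(ask("http://localhost:8080", "qwen3.5-35b-a3b-apex", "Hello!"))
```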
For additional memory savings and faster prompt processing, APEX models can be combined with KV cache compression via TurboQuant+, a fork of llama.cpp that adds turbo quantization types for the KV cache. This is separate from weight quantization -- TurboQuant compresses the KV cache 4.6x, allowing longer contexts in less VRAM.
This requires the feature/turboquant-kv-cache branch of the TurboQuant+ fork:
```bash
# Build (same as llama.cpp, but clone the fork)
git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```
Recommended configuration: `-ctk q8_0 -ctv turbo3 -fa on`

```bash
# Example: APEX Mini with TurboQuant KV cache compression
./build/bin/llama-server -m Qwen3.5-35B-A3B-APEX-Mini.gguf \
  -ctk q8_0 -ctv turbo3 -fa on \
  --host 0.0.0.0 --port 8080 -ngl 99
```
| Model | pp8192 baseline | pp8192 turbo3 | Speedup | tg128 delta |
|---|---|---|---|---|
| APEX I-Quality | 1,752 t/s | 2,003 t/s | +14.3% | <1% |
| APEX I-Balanced | 1,695 t/s | 1,927 t/s | +13.7% | <1% |
| APEX I-Compact | 1,714 t/s | 1,959 t/s | +14.3% | <1% |
| APEX Mini | 1,696 t/s | 1,938 t/s | +14.3% | <1% |
TurboQuant delivers 13-14% prompt processing speedup at 8K context with negligible impact on token generation speed (<1% delta on tg128). The KV cache compression is orthogonal to weight quantization, so all quality metrics (perplexity, accuracy, KL divergence) remain unchanged.
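For intuition on what a 4.6x KV-cache reduction buys, a back-of-the-envelope sizing sketch. The cache-size formula is the standard one (K and V per layer, per position); the model dimensions below are illustrative, not the measured Qwen3.5 configuration:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx, bytes_per_elem):
    """K + V caches: 2 tensors per layer, each ctx × n_kv_heads × head_dim."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

# Illustrative dimensions (NOT the real Qwen3.5 config): 40 layers,
# 8 KV heads of dim 128, 8K context, F16 (2 bytes per element).
f16 = kv_cache_bytes(40, 8, 128, 8192, 2)
compressed = f16 / 4.6  # TurboQuant's claimed 4.6x compression
print(f"F16 KV cache: {f16 / 2**20:.0f} MiB, compressed: {compressed / 2**20:.0f} MiB")
# → F16 KV cache: 1280 MiB, compressed: 278 MiB
```

The gigabyte-scale savings at long context is what lets the 12.2 GB Mini file keep 8K+ context inside a 16 GB budget.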
APEX Mini + TurboQuant enables running a 35B MoE model at 12 GB with 8K+ context on 16 GB VRAM GPUs.
APEX is brought to you by the LocalAI team -- the creators of the free, open-source OpenAI-compatible API for running AI locally.
Developed through human-driven, AI-assisted research to systematically explore MoE quantization strategies across 25+ experiments. Built on llama.cpp by Georgi Gerganov and contributors. Inspired by karpathy/autoresearch.
If you use APEX quantized models in your research, please cite:
```bibtex
@misc{apex-quant-2026,
  title  = {APEX: Adaptive Precision for Expert Models -- MoE-Aware Mixed-Precision Quantization},
  author = {Di Giacinto, Ettore and {LocalAI Team}},
  year   = {2026},
  url    = {https://github.com/mudler/apex-quant},
  note   = {Layer-wise precision gradient quantization for Mixture-of-Experts models using llama.cpp}
}

@misc{localai,
  title  = {LocalAI: the free, Open Source OpenAI alternative},
  author = {Di Giacinto, Ettore and {LocalAI Contributors}},
  year   = {2023},
  url    = {https://github.com/mudler/LocalAI}
}
```