by mudler
The Qwen3.5 35B A3B APEX GGUF model is an optimized build of the Qwen3.5 architecture, designed specifically for Mixture-of-Experts (MoE) deployment. Using the APEX quantization technique, it combines adaptive layer-wise precision with intelligent calibration to deliver quality comparable to floating-point baselines (such as F16) at a fraction of the file size. Its core strengths are high accuracy across diverse tasks -- text generation, reasoning, programming, and tool invocation -- with variants tailored to different use cases, from high-end environments to modest hardware. What sets it apart is that it maintains quality after quantization, matching or outperforming Q8_0 at roughly half the size. The "I-" variants add diversified calibration that improves benchmark performance, in particular reducing KL divergence against the full-precision model and improving response consistency. Ideal for local deployments or resource-constrained infrastructure, it adapts to both powerful servers and consumer-grade GPUs, offering flexibility and efficiency without compromising precision.
Brought to you by the LocalAI team -- the creators of LocalAI, the open-source AI engine that runs any model (LLMs, vision, voice, image, video) on any hardware. No GPU required.
APEX Technical Report | GitHub Repository | LocalAI
APEX (Adaptive Precision for EXpert Models) is a novel quantization technique for Mixture-of-Experts language models. Unlike uniform quantization methods that apply the same precision to every tensor, APEX introduces a layer-wise precision gradient combined with MoE-aware tensor classification and diverse imatrix calibration to achieve Q8_0-level quality at a fraction of the size. The method was discovered through systematic human-driven, AI-assisted research across 25+ quantization strategies. APEX outperforms Unsloth Dynamic 2.0 (UD) quantizations on accuracy benchmarks while being 2x smaller.
This repository contains seven APEX GGUF files plus a vision projector (mmproj) covering every deployment scenario from maximum accuracy to consumer GPU inference. The best configuration (APEX Quality) beats both Q8_0 and F16 perplexity while being 38% smaller than Q8_0. I-variants use a diverse imatrix (chat, code, reasoning, tool-calling -- no Wikipedia) that trades tiny perplexity increases for significant accuracy gains and lower KL divergence.
For the full technical details, method description, and reproduction scripts, see the APEX GitHub repository.
| File | Configuration | Size | PPL | Speed (tg128) | Best for |
|---|---|---|---|---|---|
| Qwen3.5-35B-A3B-APEX-Quality.gguf | APEX Quality | 21.3 GB | 6.527 | 62.3 t/s | Lowest perplexity of any quantization |
| Qwen3.5-35B-A3B-APEX-I-Quality.gguf | APEX I-Quality | 21.3 GB | 6.552 | 63.1 t/s | Best accuracy across benchmarks |
| Qwen3.5-35B-A3B-APEX-Balanced.gguf | APEX Balanced | 23.6 GB | 6.533 | 60.8 t/s | Interactive use, serving, general purpose |
| Qwen3.5-35B-A3B-APEX-I-Balanced.gguf | APEX I-Balanced | 23.6 GB | 6.548 | 61.4 t/s | All-round with lower KL divergence |
| Qwen3.5-35B-A3B-APEX-Compact.gguf | APEX Compact | 16.1 GB | 6.783 | 69.8 t/s | Consumer 24 GB GPUs |
| Qwen3.5-35B-A3B-APEX-I-Compact.gguf | APEX I-Compact | 16.1 GB | 6.669 | 69.8 t/s | 16 GB GPUs, best accuracy at this size |
| Qwen3.5-35B-A3B-APEX-Mini.gguf | APEX Mini | 12.2 GB | 7.088 | 74.4 t/s | Consumer 16 GB VRAM, smallest viable |
| mmproj-F16.gguf | Vision Projector | 899 MB | -- | -- | Required for vision/multimodal tasks |
APEX Quality uses a 3-tier layer-wise precision gradient (Q6_K/Q5_K/IQ4_XS) with Q8_0 shared experts. It achieves the lowest perplexity of any quantization tested -- beating even F16 (6.527 vs 6.537).
APEX I-Quality uses the same architecture as Quality but with a diverse imatrix (chat, code, reasoning, tool-calling -- no Wikipedia). It achieves the highest HellaSwag (83.5%), matches Q8_0 on ARC (57.9%), and posts the best TruthfulQA (38.4%) of any model tested.
APEX Balanced uses a 2-tier gradient (Q6_K edges, Q5_K middle) with Q8_0 shared experts. It matches Q8_0 perplexity exactly (6.533) while being 31% smaller and 16% faster. Recommended for general-purpose use.
APEX I-Balanced uses the same architecture as Balanced with a diverse imatrix. KL divergence drops 11% (mean 0.0078 vs 0.0088) and KL max drops from 6.03 to 5.77.
APEX Compact uses Q4_K edge layers, Q3_K middle layers, and Q6_K shared experts. At 16.1 GB it fits consumer 24 GB GPUs with room for KV cache.
APEX I-Compact is the biggest imatrix winner: PPL drops from 6.783 to 6.669 (-0.114), KL max from 7.56 to 5.50, and MMLU rises from 40.9% to 41.7%. The diverse imatrix has the largest impact on aggressively quantized tiers.
APEX Mini combines the layer-wise precision gradient with IQ2_S middle-layer experts and a diverse imatrix, pushing to 12.2 GB. It beats bartowski IQ2_M (11.3 GB) on every metric: PPL 7.088 vs 7.303, HellaSwag 81.0% vs 80.3%, MMLU 41.3% vs 39.6%. Fits consumer 16 GB VRAM GPUs with room for context.
All measurements on Qwen3.5-35B-A3B, NVIDIA DGX Spark (GB10, 122 GB VRAM). Perplexity measured on wikitext-2-raw, context 2048. Accuracy benchmarks (HellaSwag, Winogrande, MMLU, ARC-Challenge, TruthfulQA) evaluated via llama.cpp using 400 tasks where applicable.
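For reference, the perplexity figures below are the exponential of the mean negative log-likelihood per token. A minimal sketch of the metric itself (not the llama.cpp implementation):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over all tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model assigning every token probability 1/e has perplexity e.
print(perplexity([-1.0, -1.0, -1.0]))  # → 2.718281828459045
```

Lower is better: a perplexity of 6.5 means the model is, on average, about as uncertain as choosing uniformly among 6.5 tokens.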
| Quantization | Size (GB) | PPL | KL mean | KL max | HS | WG | MMLU | ARC | TQA | tg128 (t/s) |
|---|---|---|---|---|---|---|---|---|---|---|
| F16 | 64.6 | 6.537 | -- | -- | 82.5% | 74.5% | 41.5% | 56.9% | 37.2% | 30.4 |
| Q8_0 | 34.4 | 6.533 | 0.0046 | 14.71 | 83.0% | 75.3% | 41.2% | 57.9% | 37.7% | 52.5 |
| APEX Quality | 21.3 | 6.527 | 0.0114 | 5.85 | 83.0% | 74.5% | 41.2% | 56.2% | 37.7% | 62.3 |
| APEX I-Quality | 21.3 | 6.552 | 0.0102 | 5.59 | 83.5% | 74.5% | 41.4% | 57.9% | 38.4% | 63.1 |
| APEX Balanced | 23.6 | 6.533 | 0.0088 | 6.03 | 83.0% | 74.5% | 41.3% | 56.9% | 36.8% | 60.8 |
| APEX I-Balanced | 23.6 | 6.548 | 0.0078 | 5.77 | 83.0% | 73.3% | 41.0% | 57.5% | 37.5% | 61.4 |
| APEX Compact | 16.1 | 6.783 | 0.0469 | 7.56 | 82.5% | 73.3% | 40.9% | 55.2% | 36.5% | 69.8 |
| APEX I-Compact | 16.1 | 6.669 | 0.0332 | 5.50 | 81.8% | 75.0% | 41.7% | 55.5% | 37.9% | 69.8 |
| APEX Mini | 12.2 | 7.088 | 0.0870 | 5.57 | 81.0% | 75.5% | 41.3% | 57.2% | 36.7% | 74.4 |
| Unsloth UD-Q8_K_XL | 45.3 | 6.536 | 0.0025 | 4.36 | 82.5% | 74.8% | 41.3% | 57.9% | 38.1% | 36.4 |
| Unsloth UD-Q4_K_L | 18.8 | 6.586 | 0.0151 | 5.98 | 82.3% | 75.8% | 41.1% | 59.2% | 37.3% | 65.5 |
| bartowski IQ2_M | 11.3 | 7.303 | 0.1113 | 6.07 | 80.3% | 74.0% | 39.6% | 56.2% | 35.0% | 76.2 |
| bartowski Q3_K_M | 15.1 | 6.730 | 0.0420 | 5.56 | 82.0% | 75.0% | 41.5% | 57.5% | 38.8% | 60.6 |
| Benchmark | F16 | Q8_0 | Quality | I-Quality | Balanced | I-Balanced | Compact | I-Compact | Mini | Q8_K_XL | Q4_K_L | IQ2_M | Q3_K_M |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HellaSwag | 82.5% | 83.0% | 83.0% | 83.5% | 83.0% | 83.0% | 82.5% | 81.8% | 81.0% | 82.5% | 82.3% | 80.3% | 82.0% |
| Winogrande | 74.5% | 75.3% | 74.5% | 74.5% | 74.5% | 73.3% | 73.3% | 75.0% | 75.5% | 74.8% | 75.8% | 74.0% | 75.0% |
| MMLU | 41.5% | 41.2% | 41.2% | 41.4% | 41.3% | 41.0% | 40.9% | 41.7% | 41.3% | 41.3% | 41.1% | 39.6% | 41.5% |
| ARC | 56.9% | 57.9% | 56.2% | 57.9% | 56.9% | 57.5% | 55.2% | 55.5% | 57.2% | 57.9% | 59.2% | 56.2% | 57.5% |
| TruthfulQA | 37.2% | 37.7% | 37.7% | 38.4% | 36.8% | 37.5% | 36.5% | 37.9% | 36.7% | 38.1% | 37.3% | 35.0% | 38.8% |

```bash
# Download
huggingface-cli download mudler/Qwen3.5-35B-A3B-APEX-GGUF \
  Qwen3.5-35B-A3B-APEX-I-Quality.gguf --local-dir ./model

# Interactive chat
llama-cli -m ./model/Qwen3.5-35B-A3B-APEX-I-Quality.gguf \
  --conversation -ngl 99

# Server mode
llama-server -m ./model/Qwen3.5-35B-A3B-APEX-I-Quality.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```
Requires ~22 GB VRAM for full GPU offload. Uses diverse imatrix calibration for best accuracy across benchmarks. Recommended when downstream task performance matters more than raw perplexity.
```bash
# Download
huggingface-cli download mudler/Qwen3.5-35B-A3B-APEX-GGUF \
  Qwen3.5-35B-A3B-APEX-Quality.gguf --local-dir ./model

# Interactive chat
llama-cli -m ./model/Qwen3.5-35B-A3B-APEX-Quality.gguf \
  --conversation -ngl 99

# Server mode
llama-server -m ./model/Qwen3.5-35B-A3B-APEX-Quality.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```
Requires ~22 GB VRAM for full GPU offload. Uses IQ4_XS for middle-layer experts, so llama.cpp b5460 or later is recommended.
```bash
# Download
huggingface-cli download mudler/Qwen3.5-35B-A3B-APEX-GGUF \
  Qwen3.5-35B-A3B-APEX-I-Balanced.gguf --local-dir ./model

# Interactive chat
llama-cli -m ./model/Qwen3.5-35B-A3B-APEX-I-Balanced.gguf \
  --conversation -ngl 99

# Server mode
llama-server -m ./model/Qwen3.5-35B-A3B-APEX-I-Balanced.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```
Requires ~24 GB VRAM for full GPU offload. Uses diverse imatrix calibration with standard K-quant formats for lower KL divergence.
```bash
# Download
huggingface-cli download mudler/Qwen3.5-35B-A3B-APEX-GGUF \
  Qwen3.5-35B-A3B-APEX-Balanced.gguf --local-dir ./model

# Interactive chat
llama-cli -m ./model/Qwen3.5-35B-A3B-APEX-Balanced.gguf \
  --conversation -ngl 99

# Server mode
llama-server -m ./model/Qwen3.5-35B-A3B-APEX-Balanced.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```
Requires ~24 GB VRAM for full GPU offload. Uses only standard K-quant formats (Q6_K/Q5_K) with optimized dequantization kernels.
```bash
# Download
huggingface-cli download mudler/Qwen3.5-35B-A3B-APEX-GGUF \
  Qwen3.5-35B-A3B-APEX-I-Compact.gguf --local-dir ./model

# Interactive chat
llama-cli -m ./model/Qwen3.5-35B-A3B-APEX-I-Compact.gguf \
  --conversation -ngl 99

# Server mode
llama-server -m ./model/Qwen3.5-35B-A3B-APEX-I-Compact.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```
Requires ~17 GB VRAM for full GPU offload. The biggest imatrix winner -- PPL drops 0.114 vs standard Compact, MMLU rises from 40.9% to 41.7%.
```bash
# Download
huggingface-cli download mudler/Qwen3.5-35B-A3B-APEX-GGUF \
  Qwen3.5-35B-A3B-APEX-Compact.gguf --local-dir ./model

# Interactive chat
llama-cli -m ./model/Qwen3.5-35B-A3B-APEX-Compact.gguf \
  --conversation -ngl 99

# Server mode
llama-server -m ./model/Qwen3.5-35B-A3B-APEX-Compact.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```
Requires ~17 GB VRAM for full GPU offload. Fits consumer 24 GB GPUs (RTX 4090, RTX 5090) with room for KV cache and context.
```bash
# Download
huggingface-cli download mudler/Qwen3.5-35B-A3B-APEX-GGUF \
  Qwen3.5-35B-A3B-APEX-Mini.gguf --local-dir ./model

# Interactive chat
llama-cli -m ./model/Qwen3.5-35B-A3B-APEX-Mini.gguf \
  --conversation -ngl 99

# Server mode
llama-server -m ./model/Qwen3.5-35B-A3B-APEX-Mini.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```
Requires ~13 GB VRAM for full GPU offload. Fits consumer 16 GB VRAM GPUs (RTX 4060 Ti 16GB, RTX 5060 Ti) with room for context. Beats bartowski IQ2_M on every metric despite being only 0.9 GB larger.
```bash
# Download the entire repository (all variants plus the vision projector)
huggingface-cli download mudler/Qwen3.5-35B-A3B-APEX-GGUF --local-dir ./model
```
Qwen3.5-35B-A3B is a Mixture-of-Experts language model with 35 billion total parameters but only 3 billion active per token. It uses 256 experts per MoE layer, routing 8 experts plus 1 shared expert per token across 40 transformer layers. This sparse activation pattern means 97% of expert weights are idle for any given token, creating an opportunity for differentiated quantization.
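The sparsity figure above follows from simple arithmetic on the routing configuration (a sketch; only the expert counts from this card are used):

```python
# Routing configuration of Qwen3.5-35B-A3B as described above.
experts_total = 256
experts_routed = 8   # experts selected by the router per token
experts_shared = 1   # always-active shared expert

active_fraction = (experts_routed + experts_shared) / experts_total
idle_fraction = 1 - active_fraction
print(f"{idle_fraction:.1%} of expert weights idle per token")  # → 96.5% (the ~97% quoted above)
```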
APEX exploits three properties of MoE models to achieve near-lossless compression:
Not all tensors in an MoE model are equal. APEX classifies them into three categories with different precision requirements:
Edge transformer layers (the first and last 5) handle input embedding alignment and output logit generation. They are significantly more sensitive to quantization than the middle layers, which perform more redundant intermediate processing. APEX assigns higher precision to the edges and lower precision to the middle.
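The edge/middle assignment can be sketched as a small lookup over the layer index. This is an illustration of the idea, not the APEX tooling; the function name and the near-edge tier boundary (a second band of 5 layers) are assumptions, with type names taken from the Quality configuration:

```python
def expert_quant_type(layer: int, n_layers: int = 40, edge: int = 5) -> str:
    """3-tier layer-wise precision gradient: high precision at the
    edges, lower precision toward the redundant middle layers."""
    if layer < edge or layer >= n_layers - edge:
        return "Q6_K"      # edge layers: most quantization-sensitive
    if layer < 2 * edge or layer >= n_layers - 2 * edge:
        return "Q5_K"      # near-edge layers (assumed band width)
    return "IQ4_XS"        # middle layers: most redundant

assert expert_quant_type(0) == "Q6_K"
assert expert_quant_type(7) == "Q5_K"
assert expert_quant_type(20) == "IQ4_XS"
```

In practice these assignments would be serialized into a tensor-type file and passed to the quantizer.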
| Configuration | Size | Expert strategy | Shared expert | Attention | Best for |
|---|---|---|---|---|---|
| APEX I-Quality | 21.3 GB | Q6_K edges, Q5_K near-edges, IQ4_XS middle, diverse imatrix | Q8_0 | Q6_K | Best accuracy |
| APEX Quality | 21.3 GB | Q6_K edges, Q5_K near-edges, IQ4_XS middle | Q8_0 | Q6_K | Lowest perplexity |
| APEX I-Balanced | 23.6 GB | Q6_K edges, Q5_K middle, diverse imatrix | Q8_0 | Q6_K | All-round, lower KL |
| APEX Balanced | 23.6 GB | Q6_K edges, Q5_K middle | Q8_0 | Q6_K | General purpose |
| APEX I-Compact | 16.1 GB | Q4_K edges, Q3_K middle, diverse imatrix | Q6_K | Q4_K | Best accuracy at 16 GB |
| APEX Compact | 16.1 GB | Q4_K edges, Q3_K middle | Q6_K | Q4_K | Consumer 24 GB GPUs |
| APEX Mini | 12.2 GB | Layer gradient with IQ2_S middle, diverse imatrix | Q6_K | Q4_K | Consumer 16 GB VRAM |
Standard imatrix calibration uses Wikipedia text, which biases quantization toward encyclopedic prose. APEX I-variants use a diverse calibration dataset spanning chat, code, reasoning, and tool-calling -- no Wikipedia. This produces a different optimization tradeoff: I-variants trade a tiny perplexity increase on wikitext (which is itself Wikipedia-derived, and so favors the standard calibration) for significant gains on real-world accuracy benchmarks and consistently lower KL divergence.
The effect is most dramatic on aggressive quantizations. I-Compact drops perplexity from 6.783 to 6.669 (-0.114), reduces KL max from 7.56 to 5.50, and lifts MMLU from 40.9% to 41.7%. At the Quality tier, I-Quality achieves the highest HellaSwag score of any model tested (83.5%), matches Q8_0 on ARC (57.9%), and posts the best TruthfulQA (38.4%).
APEX Mini combines the layer-wise precision gradient with IQ2_S middle-layer experts and a diverse imatrix to push MoE quantization to 12.2 GB. At this size it fits consumer 16 GB VRAM GPUs (RTX 4060 Ti 16GB, RTX 5060 Ti) with room for context. It beats bartowski IQ2_M (11.3 GB) on every single metric: PPL 7.088 vs 7.303, HellaSwag 81.0% vs 80.3%, MMLU 41.3% vs 39.6%, ARC 57.2% vs 56.2%. The layer gradient + diverse imatrix combination outperforms uniform quantization even at extreme compression ratios.
The APEX method and code will be published soon.
Information-theoretic metrics: Perplexity is measured on wikitext-2-raw (context 2048, full dataset). KL Divergence measures the divergence between quantized and full-precision logit distributions, reported as mean, max, 99.9th percentile, and median. Lower values indicate the quantized model's predictions more closely match the original.
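The KL statistics compare, at each token position, the quantized model's predictive distribution against the full-precision one. A minimal sketch of the metric (softmax over logits, KL per position, then mean and max):

```python
import math

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    """KL(P || Q) for two discrete distributions over the vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_stats(ref_logits, quant_logits):
    """Per-position KL between full-precision and quantized logits."""
    kls = [kl(softmax(r), softmax(q)) for r, q in zip(ref_logits, quant_logits)]
    return sum(kls) / len(kls), max(kls)

mean_kl, max_kl = kl_stats([[2.0, 0.0, -1.0]], [[1.9, 0.1, -1.0]])
```

A mean of 0.0046 (as for Q8_0) means the quantized next-token distribution is, on average, nearly indistinguishable from full precision; the max captures the worst single position.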
Downstream accuracy benchmarks: HellaSwag (commonsense reasoning), Winogrande (coreference resolution), MMLU (multitask language understanding), ARC-Challenge (science QA), and TruthfulQA (truthful generation) are evaluated via llama.cpp with 400 tasks where applicable.
Note: Evaluations on hybrid MoE models were enabled by our upstream fix to llama.cpp's hybrid memory path for recurrent architectures (PR-ready).
All benchmarks were measured on an NVIDIA DGX Spark. Per-layer precision assignments were produced with `llama-quantize` using `--tensor-type-file`.

These APEX quantized models work out of the box with LocalAI -- a free, open-source OpenAI-compatible API that runs locally. Load any APEX GGUF and get an instant API server with chat completions, embeddings, and more:
```bash
# Run APEX Balanced with LocalAI
local-ai run mudler/[email protected]
```
LocalAI supports GPU acceleration, multiple model loading, and function calling. See the LocalAI documentation for more.
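Because the endpoint is OpenAI-compatible, any OpenAI client works against it. A minimal stdlib-only sketch; the base URL assumes a local server on port 8080, and the model alias is hypothetical (use whatever name LocalAI loaded):

```python
import json
import urllib.request

def chat_payload(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def ask(base_url: str, model: str, prompt: str) -> str:
    """POST to the OpenAI-compatible /v1/chat/completions endpoint."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Example (requires a running LocalAI instance; model alias is an assumption):
# print(ask("http://localhost:8080", "qwen3.5-35b-a3b-apex", "Hello!"))
```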
For additional memory savings and faster prompt processing, APEX models can be combined with KV cache compression via TurboQuant+, a fork of llama.cpp that adds turbo quantization types for the KV cache. This is separate from weight quantization -- TurboQuant compresses the KV cache 4.6x, allowing longer contexts in less VRAM.
This requires the feature/turboquant-kv-cache branch of the TurboQuant+ fork:
```bash
# Build (same as llama.cpp, but clone the fork)
git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```
Recommended configuration: `-ctk q8_0 -ctv turbo3 -fa on`

```bash
# Example: APEX Mini with TurboQuant KV cache compression
./build/bin/llama-server -m Qwen3.5-35B-A3B-APEX-Mini.gguf \
  -ctk q8_0 -ctv turbo3 -fa on \
  --host 0.0.0.0 --port 8080 -ngl 99
```
| Model | pp8192 baseline | pp8192 turbo3 | Speedup | tg128 delta |
|---|---|---|---|---|
| APEX I-Quality | 1,752 t/s | 2,003 t/s | +14.3% | <1% |
| APEX I-Balanced | 1,695 t/s | 1,927 t/s | +13.7% | <1% |
| APEX I-Compact | 1,714 t/s | 1,959 t/s | +14.3% | <1% |
| APEX Mini | 1,696 t/s | 1,938 t/s | +14.3% | <1% |
TurboQuant delivers 13-14% prompt processing speedup at 8K context with negligible impact on token generation speed (<1% delta on tg128). The KV cache compression is orthogonal to weight quantization, so all quality metrics (perplexity, accuracy, KL divergence) remain unchanged.
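For intuition on what a 4.6x KV-cache reduction buys, a back-of-the-envelope sizing sketch. The cache-size formula is the standard one (K and V per layer, per position); the model dimensions below are illustrative, not the measured Qwen3.5 configuration:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx, bytes_per_elem):
    """K + V caches: 2 tensors per layer, each ctx × n_kv_heads × head_dim."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

# Illustrative dimensions (NOT the real Qwen3.5 config): 40 layers,
# 8 KV heads of dim 128, 8K context, F16 (2 bytes per element).
f16 = kv_cache_bytes(40, 8, 128, 8192, 2)
compressed = f16 / 4.6  # TurboQuant's claimed 4.6x compression
print(f"F16 KV cache: {f16 / 2**20:.0f} MiB, compressed: {compressed / 2**20:.0f} MiB")
# → F16 KV cache: 1280 MiB, compressed: 278 MiB
```

The gigabyte-scale savings at long context is what lets the 12.2 GB Mini file keep 8K+ context inside a 16 GB budget.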
APEX Mini + TurboQuant enables running a 35B MoE model at 12 GB with 8K+ context on 16 GB VRAM GPUs.
APEX is brought to you by the LocalAI team -- the creators of the free, open-source OpenAI-compatible API for running AI locally.
Developed through human-driven, AI-assisted research to systematically explore MoE quantization strategies across 25+ experiments. Built on llama.cpp by Georgi Gerganov and contributors. Inspired by karpathy/autoresearch.
If you use APEX quantized models in your research, please cite:
```bibtex
@misc{apex-quant-2026,
  title  = {APEX: Adaptive Precision for Expert Models -- MoE-Aware Mixed-Precision Quantization},
  author = {Di Giacinto, Ettore and {LocalAI Team}},
  year   = {2026},
  url    = {https://github.com/mudler/apex-quant},
  note   = {Layer-wise precision gradient quantization for Mixture-of-Experts models using llama.cpp}
}

@misc{localai,
  title  = {LocalAI: the free, Open Source OpenAI alternative},
  author = {Di Giacinto, Ettore and {LocalAI Contributors}},
  year   = {2023},
  url    = {https://github.com/mudler/LocalAI}
}
```