by boboliu
Qwen3-Embedding-4B-W4A16-G128 is a quantized version of the Qwen3-Embedding-4B model, designed to cut memory footprint while preserving embedding quality. It converts text into dense numerical vectors for tasks such as information retrieval, classification, and semantic similarity. The 4-bit weight quantization costs only about 0.72% on the C-MTEB benchmark while substantially reducing VRAM usage compared to the original model, making it a practical choice for deploying embedding pipelines on limited hardware or at scale without compromising result quality.
GPTQ-quantized Qwen/Qwen3-Embedding-4B, using THUIR/T2Ranking and m-a-p/COIG-CQIA as the calibration set.
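For reference, a quantization along these lines can be run with gptqmodel. The sketch below is a minimal, hedged example: it assumes gptqmodel's `QuantizeConfig` / `GPTQModel.load` / `quantize` / `save` API, and a toy calibration list stands in for the actual T2Ranking/COIG-CQIA sampling and preprocessing, which this card does not document.

```python
# Minimal sketch of W4A16 G128 GPTQ quantization with gptqmodel.
# Assumption: the toy calibration list below stands in for the real
# THUIR/T2Ranking and m-a-p/COIG-CQIA preprocessing used for this model.
from gptqmodel import GPTQModel, QuantizeConfig

quant_config = QuantizeConfig(
    bits=4,         # W4: 4-bit weights (activations stay 16-bit)
    group_size=128  # G128: one scale/zero-point per group of 128 weights
)

calibration = [
    "What are the benefits of regular exercise?",
    "Explain the difference between supervised and unsupervised learning.",
    # ... in practice, samples drawn from T2Ranking and COIG-CQIA
]

model = GPTQModel.load("Qwen/Qwen3-Embedding-4B", quant_config)
model.quantize(calibration)
model.save("Qwen3-Embedding-4B-W4A16-G128")
```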
VRAM usage: 17430 MB -> 11000 MB (without FlashAttention-2).
About 0.72% average score lost on C-MTEB.
Evaluation was performed with the official evaluation code.
| Model (C-MTEB) | Param. | Mean (Task) | Mean (Type) | Class. | Clust. | Pair Class. | Rerank. | Retr. | STS |
|---|---|---|---|---|---|---|---|---|---|
| multilingual-e5-large-instruct | 0.6B | 58.08 | 58.24 | 69.80 | 48.23 | 64.52 | 57.45 | 63.65 | 45.81 |
| bge-multilingual-gemma2 | 9B | 67.64 | 68.52 | 75.31 | 59.30 | 86.67 | 68.28 | 73.73 | 55.19 |
| gte-Qwen2-1.5B-instruct | 1.5B | 67.12 | 67.79 | 72.53 | 54.61 | 79.50 | 68.21 | 71.86 | 60.05 |
| gte-Qwen2-7B-instruct | 7.6B | 71.62 | 72.19 | 75.77 | 66.06 | 81.16 | 69.24 | 75.70 | 65.20 |
| ritrieve_zh_v1 | 0.3B | 72.71 | 73.85 | 76.88 | 66.50 | 85.98 | 72.86 | 76.97 | 63.92 |
| Qwen3-Embedding-4B | 4B | 72.27 | 73.51 | 75.46 | 77.89 | 83.34 | 66.05 | 77.03 | 61.26 |
| This Model | 4B-W4A16 | 71.75 | 73.05 | 75.43 | 77.51 | 83.04 | 65.73 | 76.15 | 60.47 |
To run the model, install the dependencies with `pip install compressed-tensors optimum` plus either `auto-gptq` or `gptqmodel`, then follow the official Qwen3-Embedding usage guide; a minimal loading sketch follows.
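As a hedged illustration of the embedding workflow (a sketch mirroring the upstream Qwen3-Embedding transformers example, not this card's official guide verbatim), the snippet below loads the checkpoint through the standard transformers API and computes L2-normalized, last-token-pooled embeddings. The repo id is an assumption based on this card's title.

```python
# Sketch of embedding inference, mirroring the upstream Qwen3-Embedding
# transformers example (last-token pooling + L2 normalization).
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_id = "boboliu/Qwen3-Embedding-4B-W4A16-G128"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
model = AutoModel.from_pretrained(model_id, device_map="auto").eval()

texts = [
    "What is the capital of China?",
    "Beijing is the capital of China.",
]
batch = tokenizer(texts, padding=True, truncation=True,
                  max_length=8192, return_tensors="pt").to(model.device)

with torch.no_grad():
    hidden = model(**batch).last_hidden_state

# With left padding, every sequence ends at position -1,
# so the last hidden state is the sentence embedding.
embeddings = F.normalize(hidden[:, -1], p=2, dim=1)
print(embeddings @ embeddings.T)  # cosine-similarity matrix
```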