
cerebras GLM 4.5 Air REAP 82B A12B GGUF

by bartowski

Open source · 181k downloads · 38 likes
Rated 2.0 (38 reviews) · Chat · API & Local

About

The cerebras GLM 4.5 Air REAP 82B A12B GGUF model is an optimized, quantized version of GLM-4.5-Air-REAP-82B-A12B, designed to run efficiently on limited hardware. It excels at understanding and generating text in Chinese and English, delivering strong performance while reducing memory footprint through advanced quantization techniques. Its main use cases include conversational assistance, automated writing, text-data analysis, and applications requiring high linguistic accuracy. What sets it apart is its balance of response quality and computational efficiency, along with its compatibility with tools such as LM Studio and llama.cpp, which allows flexible integration into a wide range of environments.

Documentation

Llamacpp imatrix Quantizations of GLM-4.5-Air-REAP-82B-A12B by cerebras

Using llama.cpp release b6818 for quantization.

Original model: https://huggingface.co/cerebras/GLM-4.5-Air-REAP-82B-A12B

All quants were made using the imatrix option with a dataset from here, combined with a subset of combined_all_small.parquet from Ed Addario here
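
For context, the imatrix quantization flow in llama.cpp looks roughly like the sketch below; the file names and calibration file are placeholders, not the exact commands used for this release.

Bash
# Hedged sketch of llama.cpp's imatrix quantization flow (paths are placeholders).
# Step 1: compute an importance matrix from a calibration text file.
./llama-imatrix -m GLM-4.5-Air-REAP-82B-A12B-f16.gguf -f calibration.txt -o imatrix.dat
# Step 2: quantize the full-precision GGUF using that imatrix.
./llama-quantize --imatrix imatrix.dat GLM-4.5-Air-REAP-82B-A12B-f16.gguf \
  GLM-4.5-Air-REAP-82B-A12B-Q4_K_M.gguf Q4_K_M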

Run them in LM Studio

Run them directly with llama.cpp, or any other llama.cpp-based project
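
For example, a minimal invocation might look like this (a sketch: the file name matches the Q4_K_M entry from the table below, and the flags should be adapted to your hardware):

Bash
# Minimal sketch: run a downloaded quant with llama.cpp's llama-cli.
# -ngl sets how many layers to offload to the GPU; -c sets the context size.
./llama-cli -m ./cerebras_GLM-4.5-Air-REAP-82B-A12B-Q4_K_M.gguf -ngl 99 -c 8192 \
  -p "Explain GGUF quantization in two sentences."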

Prompt format

No chat template was specified, so a default is used. This may be incorrect; check the original model card for details.

Text
[gMASK]<sop><|system|>
{system_prompt}<|user|>
{prompt}<|assistant|>
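
For illustration, here is the same template with its placeholders filled in (the system prompt and user message are arbitrary examples, not from the model card):

Text
[gMASK]<sop><|system|>
You are a helpful assistant.<|user|>
What is the capital of France?<|assistant|>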

Download a file (not the whole branch) from below:

Filename | Quant type | File Size | Split | Description
GLM-4.5-Air-REAP-82B-A12B-Q8_0.gguf | Q8_0 | 90.37GB | true | Extremely high quality, generally unneeded but max available quant.
GLM-4.5-Air-REAP-82B-A12B-Q6_K.gguf | Q6_K | 76.21GB | true | Very high quality, near perfect, recommended.
GLM-4.5-Air-REAP-82B-A12B-Q5_K_M.gguf | Q5_K_M | 64.39GB | true | High quality, recommended.
GLM-4.5-Air-REAP-82B-A12B-Q5_K_S.gguf | Q5_K_S | 60.49GB | true | High quality, recommended.
GLM-4.5-Air-REAP-82B-A12B-Q4_K_L.gguf | Q4_K_L | 57.03GB | true | Uses Q8_0 for embed and output weights. Good quality, recommended.
GLM-4.5-Air-REAP-82B-A12B-Q4_K_M.gguf | Q4_K_M | 56.57GB | true | Good quality, default size for most use cases, recommended.
GLM-4.5-Air-REAP-82B-A12B-Q4_1.gguf | Q4_1 | 53.60GB | true | Legacy format, similar performance to Q4_K_S but with improved tokens/watt on Apple silicon.
GLM-4.5-Air-REAP-82B-A12B-Q4_K_S.gguf | Q4_K_S | 52.65GB | true | Slightly lower quality with more space savings, recommended.
GLM-4.5-Air-REAP-82B-A12B-Q4_0.gguf | Q4_0 | 49.16GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference.
GLM-4.5-Air-REAP-82B-A12B-IQ4_NL.gguf | IQ4_NL | 48.71GB | false | Similar to IQ4_XS, but slightly larger. Offers online repacking for ARM CPU inference.
GLM-4.5-Air-REAP-82B-A12B-IQ4_XS.gguf | IQ4_XS | 46.99GB | false | Decent quality, smaller than Q4_K_S with similar performance, recommended.
GLM-4.5-Air-REAP-82B-A12B-Q3_K_XL.gguf | Q3_K_XL | 43.60GB | false | Uses Q8_0 for embed and output weights. Lower quality but usable, good for low RAM availability.
GLM-4.5-Air-REAP-82B-A12B-Q3_K_L.gguf | Q3_K_L | 43.05GB | false | Lower quality but usable, good for low RAM availability.
GLM-4.5-Air-REAP-82B-A12B-Q3_K_M.gguf | Q3_K_M | 42.88GB | false | Low quality.
GLM-4.5-Air-REAP-82B-A12B-IQ3_M.gguf | IQ3_M | 42.88GB | false | Medium-low quality, new method with decent performance comparable to Q3_K_M.
GLM-4.5-Air-REAP-82B-A12B-Q3_K_S.gguf | Q3_K_S | 41.32GB | false | Low quality, not recommended.
GLM-4.5-Air-REAP-82B-A12B-IQ3_XS.gguf | IQ3_XS | 39.37GB | false | Lower quality, new method with decent performance, slightly better than Q3_K_S.
GLM-4.5-Air-REAP-82B-A12B-IQ3_XXS.gguf | IQ3_XXS | 38.98GB | false | Lower quality, new method with decent performance, comparable to Q3 quants.
GLM-4.5-Air-REAP-82B-A12B-Q2_K_L.gguf | Q2_K_L | 36.34GB | false | Uses Q8_0 for embed and output weights. Very low quality but surprisingly usable.
GLM-4.5-Air-REAP-82B-A12B-Q2_K.gguf | Q2_K | 35.73GB | false | Very low quality but surprisingly usable.
GLM-4.5-Air-REAP-82B-A12B-IQ2_M.gguf | IQ2_M | 35.00GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable.
GLM-4.5-Air-REAP-82B-A12B-IQ2_S.gguf | IQ2_S | 33.05GB | false | Low quality, uses SOTA techniques to be usable.
GLM-4.5-Air-REAP-82B-A12B-IQ2_XS.gguf | IQ2_XS | 32.69GB | false | Low quality, uses SOTA techniques to be usable.
GLM-4.5-Air-REAP-82B-A12B-IQ2_XXS.gguf | IQ2_XXS | 30.74GB | false | Very low quality, uses SOTA techniques to be usable.
GLM-4.5-Air-REAP-82B-A12B-IQ1_M.gguf | IQ1_M | 29.40GB | false | Extremely low quality, not recommended.

Embed/output weights

Some of these quants (Q3_K_XL, Q4_K_L, etc.) use the standard quantization method with the embedding and output weights quantized to Q8_0 instead of their usual defaults.

Downloading using huggingface-cli

Click to view download instructions

First, make sure you have huggingface-cli installed:

Bash
pip install -U "huggingface_hub[cli]"

Then, you can target the specific file you want:

Bash
huggingface-cli download bartowski/cerebras_GLM-4.5-Air-REAP-82B-A12B-GGUF --include "cerebras_GLM-4.5-Air-REAP-82B-A12B-Q4_K_M.gguf" --local-dir ./

If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run:

Bash
huggingface-cli download bartowski/cerebras_GLM-4.5-Air-REAP-82B-A12B-GGUF --include "cerebras_GLM-4.5-Air-REAP-82B-A12B-Q8_0/*" --local-dir ./

You can either specify a new local-dir (cerebras_GLM-4.5-Air-REAP-82B-A12B-Q8_0) or download them all in place (./).
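
Note that split files do not need to be merged: point llama.cpp at the first shard and the remaining shards are picked up automatically (the shard name below is illustrative; use whatever the first file in your local folder is called):

Bash
# Sketch: load a split GGUF by its first shard; llama.cpp finds the rest.
./llama-cli -m ./cerebras_GLM-4.5-Air-REAP-82B-A12B-Q8_0/GLM-4.5-Air-REAP-82B-A12B-Q8_0-00001-of-00003.gguf -p "Hello"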

ARM/AVX information

Previously, you would download Q4_0_4_4/4_8/8_8, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass.

Now, however, there is something called "online repacking" for weights; details are in this PR. If you use Q4_0 and your hardware would benefit from repacking weights, it will do it automatically on the fly.

As of llama.cpp build b4282 you will not be able to run the Q4_0_X_X files and will instead need to use Q4_0.

Additionally, if you want slightly better quality, you can use IQ4_NL thanks to this PR, which will also repack the weights for ARM, though only the 4_4 variant for now. The loading time may be slower, but it will result in an overall speed increase.
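
To see what repacking does on your own machine, llama.cpp's llama-bench tool can measure prompt-processing and text-generation throughput, as in the benchmarks below (a sketch; adjust the model path and thread count to your setup):

Bash
# Sketch: benchmark a quant on CPU with llama-bench.
# -p 512 measures prompt processing (pp512); -n 128 measures generation (tg128).
./llama-bench -m ./cerebras_GLM-4.5-Air-REAP-82B-A12B-Q4_0.gguf -t 16 -p 512 -n 128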

Click to view Q4_0_X_X information (deprecated)

I'm keeping this section to show the potential theoretical uplift in performance from using the Q4_0 with online repacking.

Click to view benchmarks on an AVX2 system (EPYC7702)
model | size | params | backend | threads | test | t/s | % (vs Q4_0)
qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100%
qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100%
qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100%
qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100%
qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100%
qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100%
qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147%
qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101%
qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101%
qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48%
qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83%
qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90%
qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133%
qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100%
qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124%
qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111%
qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110%
qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105%

Q4_0_8_8 offers a nice bump to prompt processing and a small bump to text generation.

Which file should I choose?

Click here for details

A great write-up with charts comparing the performance of various quants is provided by Artefact2 here

The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have.

If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM.

If you want the absolute maximum quality, add your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.
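
As a worked example with hypothetical hardware: 48GB of VRAM plus 16GB of free system RAM gives a 64GB budget; leaving ~2GB of headroom, Q5_K_S (60.49GB) fits while Q5_K_M (64.39GB) does not. A small sketch of that check, using sizes copied from the table above:

Bash
# Hypothetical helper: print the quants that fit a memory budget in GB.
budget_gb=${1:-62}
while read -r quant size; do
  # bash has no floating point, so compare with awk
  if awk -v s="$size" -v b="$budget_gb" 'BEGIN { exit !(s <= b) }'; then
    echo "$quant ($size GB) fits in ${budget_gb}GB"
  fi
done <<'EOF'
Q6_K 76.21
Q5_K_M 64.39
Q5_K_S 60.49
Q4_K_M 56.57
IQ4_XS 46.99
EOF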

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'.

If you don't want to think too much, grab one of the K-quants. These are in format 'QX_K_X', like Q5_K_M.

If you want to get more into the weeds, you can check out this extremely useful feature chart:

llama.cpp feature matrix

But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQX_X, like IQ3_M. These are newer and offer better performance for their size.

These I-quants can also be used on CPU, but will be slower than their K-quant equivalents, so speed vs. performance is a tradeoff you'll have to decide on.

Credits

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset.

Thank you ZeroWw for the inspiration to experiment with embed/output.

Thank you to LM Studio for sponsoring my work.

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski

Specifications

Category: Chat
Access: API & Local
License: Open Source
Pricing: Open Source
Parameters: 82B
Rating: 2.0 (38 reviews)