by Disty0
FLUX.2 klein 9B SDNQ 4bit is an optimized version of the FLUX.2 klein 9B text-to-image model, using dynamic 4-bit quantization to shrink the checkpoint without significantly compromising output quality. A fine-grained, layer-by-layer search picks each layer's data type (uint4 or int5) to keep quantization error below a fixed threshold, balancing precision against size. An SVD (singular value decomposition) quantization step with rank 32 additionally preserves the dominant weight structure in higher precision. The result is a memory-efficient alternative that fits on GPUs with limited VRAM while producing images nearly identical to the original BF16 model.
Dynamic 4 bit quantization of black-forest-labs/FLUX.2-klein-9B using SDNQ.
This model uses per-layer, fine-grained quantization.
The dtype for each layer is selected dynamically by trial and error until the std-normalized MSE loss falls below the selected threshold.
The minimum allowed dtype is set to uint4 and the std-normalized MSE loss threshold is set to 1e-2.
This produced a mixed-precision model with uint4 and int5 dtypes.
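The per-layer search described above can be sketched as follows: quantize a layer at the narrowest allowed dtype, measure the std-normalized MSE between the original and dequantized weights, and widen the dtype until the loss drops below the 1e-2 threshold. This is an illustrative NumPy sketch, not SDNQ's actual implementation; the helper names and the candidate dtype list are assumptions.

```python
import numpy as np

def quantize_dequantize(w: np.ndarray, bits: int, signed: bool) -> np.ndarray:
    # Round-trip the weights through affine integer quantization.
    if signed:
        qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    else:
        qmin, qmax = 0, 2 ** bits - 1
    scale = (w.max() - w.min()) / (qmax - qmin)
    zero = qmin - w.min() / scale
    q = np.clip(np.round(w / scale + zero), qmin, qmax)
    return (q - zero) * scale

def std_normalized_mse(w: np.ndarray, w_hat: np.ndarray) -> float:
    # MSE between original and dequantized weights, normalized by the weight variance.
    return float(np.mean((w - w_hat) ** 2) / (np.std(w) ** 2))

def select_dtype(w: np.ndarray, threshold: float = 1e-2):
    # Try progressively wider dtypes until the loss is under the threshold.
    for name, bits, signed in [("uint4", 4, False), ("int5", 5, True), ("int8", 8, True)]:
        loss = std_normalized_mse(w, quantize_dequantize(w, bits, signed))
        if loss < threshold:
            return name, loss
    return "bf16", 0.0  # fall back to leaving the layer unquantized

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
name, loss = select_dtype(w)
```

For a Gaussian weight matrix like this one, uint4 typically misses the 1e-2 threshold while int5 clears it, which is exactly how a mixed uint4/int5 model arises.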
SVD quantization is enabled with SVD rank 32.
Usage:
```shell
pip install sdnq
```

```python
import torch
import diffusers

from sdnq import SDNQConfig  # import sdnq to register it into diffusers and transformers
from sdnq.common import use_torch_compile as triton_is_available
from sdnq.loader import apply_sdnq_options_to_model

pipe = diffusers.Flux2KleinPipeline.from_pretrained("Disty0/FLUX.2-klein-9B-SDNQ-4bit-dynamic-svd-r32", torch_dtype=torch.bfloat16)

# Enable INT8 MatMul for AMD, Intel ARC and Nvidia GPUs:
if triton_is_available and (torch.cuda.is_available() or torch.xpu.is_available()):
    pipe.transformer = apply_sdnq_options_to_model(pipe.transformer, use_quantized_matmul=True)
    pipe.text_encoder = apply_sdnq_options_to_model(pipe.text_encoder, use_quantized_matmul=True)
    # pipe.transformer = torch.compile(pipe.transformer)  # optional for faster speeds

pipe.enable_model_cpu_offload()

prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    guidance_scale=1.0,
    num_inference_steps=4,
    generator=torch.manual_seed(0),
).images[0]
image.save("flux-klein-sdnq-4bit-dynamic-svd-r32.png")
```
Original BF16 vs SDNQ quantization comparison:
(Side-by-side output images are not reproduced here.)

| Quantization | Model Size |
|---|---|
| Original BF16 | 18.2 GB |
| SDNQ 4 Bit | 5.7 GB |
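As a back-of-the-envelope check on the sizes in the table (the 4.5-bit average for the uint4/int5 mix is an assumption; the real checkpoint also carries quantization scales, SVD factors, and non-quantized layers):

```python
params = 9e9  # approximate parameter count of FLUX.2 klein 9B

bf16_gb = params * 2 / 1e9  # 2 bytes per weight -> about 18 GB
mixed_bits = 4.5  # assumed average of the uint4/int5 mix
quantized_gb = params * mixed_bits / 8 / 1e9  # about 5.1 GB before overhead

print(f"BF16: ~{bf16_gb:.1f} GB, SDNQ 4-bit: ~{quantized_gb:.1f} GB")
```

Both estimates land close to the reported 18.2 GB and 5.7 GB figures, with the gap covered by the per-layer overhead above.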