8-bit quantization of Tongyi-MAI/Z-Image-Turbo using SDNQ.
The model is quantized with group sizes disabled, which allows faster INT8 MatMul.
Example code to enable INT8 MatMul is provided in the Usage section below.
INT8 MatMul is optional and disabled by default.
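
To illustrate why disabling group sizes helps INT8 MatMul, here is a minimal plain-Python sketch of symmetric INT8 quantization with a single scale per row. It is not SDNQ's implementation; the function names (`quantize_rows`, `int8_matmul`, `float_matmul`) and the sample values are made up for illustration. With one scale per row, the whole dot product can be accumulated in integers and rescaled once at the end, whereas per-group scales along the reduction dimension would need a rescale for every group.

```python
# Illustrative only: symmetric INT8 quantization with one scale per row
# (no groups). This is NOT SDNQ's implementation.

def quantize_rows(matrix):
    """Quantize each row to integers in [-127, 127] with a single scale per row."""
    q_rows, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127 if max_abs > 0 else 1.0
        q_rows.append([round(v / scale) for v in row])
        scales.append(scale)
    return q_rows, scales


def int8_matmul(x, w):
    """Approximate x @ w.T using integer dot products and one rescale per output."""
    xq, x_scales = quantize_rows(x)  # one scale per activation row
    wq, w_scales = quantize_rows(w)  # one scale per weight row (output channel)
    out = []
    for xi, sx in zip(xq, x_scales):
        out_row = []
        for wj, sw in zip(wq, w_scales):
            acc = sum(a * b for a, b in zip(xi, wj))  # pure integer accumulation
            out_row.append(acc * sx * sw)  # single rescale at the end
        out.append(out_row)
    return out


def float_matmul(x, w):
    """Reference x @ w.T in floating point."""
    return [[sum(a * b for a, b in zip(xi, wj)) for wj in w] for xi in x]


# Hypothetical sample data: one activation row, two weight rows.
x = [[0.5, -1.0, 0.25, 2.0]]
w = [[0.1, 0.2, -0.3, 0.4], [1.5, -0.5, 0.0, 0.75]]
approx = int8_matmul(x, w)
exact = float_matmul(x, w)
print("int8 :", approx)
print("float:", exact)
```

The two printed results should agree to within a small quantization error. Real kernels perform the integer accumulation on the GPU, which is where the speedup comes from.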

Usage:

pip install sdnq

import torch
import diffusers
from sdnq import SDNQConfig # import sdnq to register it into diffusers and transformers
from sdnq.common import use_torch_compile as triton_is_available
from sdnq.loader import apply_sdnq_options_to_model

pipe = diffusers.ZImagePipeline.from_pretrained("Disty0/Z-Image-Turbo-SDNQ-int8", torch_dtype=torch.bfloat16)

# Enable INT8 MatMul for AMD, Intel ARC and Nvidia GPUs:
if triton_is_available and (torch.cuda.is_available() or torch.xpu.is_available()):
    pipe.transformer = apply_sdnq_options_to_model(pipe.transformer, use_quantized_matmul=True)
    pipe.text_encoder = apply_sdnq_options_to_model(pipe.text_encoder, use_quantized_matmul=True)
    pipe.transformer = torch.compile(pipe.transformer) # optional for faster speeds

pipe.enable_model_cpu_offload()

prompt = "Young Chinese woman in red Hanfu, intricate embroidery. Impeccable makeup, red floral forehead pattern. Elaborate high bun, golden phoenix headdress, red flowers, beads. Holds round folding fan with lady, trees, bird. Neon lightning-bolt lamp (⚡️), bright yellow glow, above extended left palm. Soft-lit outdoor night background, silhouetted tiered pagoda (西安大雁塔), blurred colorful distant lights."
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    num_inference_steps=9,
    guidance_scale=0.0,
    generator=torch.manual_seed(42),
).images[0]
image.save("z-image-turbo-sdnq-int8.png")  # Pillow needs a file extension to infer the format
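
To check whether INT8 MatMul or `torch.compile` actually speeds things up on a given machine, a small timing helper along these lines may be useful. This is a sketch using only the standard library. `time_calls` is a hypothetical helper, not part of SDNQ or diffusers, and the warmup calls are there because the first call after `torch.compile` typically includes compilation time.

```python
import time


def time_calls(fn, warmup=1, runs=3):
    """Return the mean wall-clock seconds over `runs` calls to `fn`, after `warmup` untimed calls."""
    for _ in range(warmup):
        fn()  # untimed: absorbs one-off costs such as compilation
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs


# Hypothetical usage with the pipeline defined above:
# seconds = time_calls(
#     lambda: pipe(prompt=prompt, height=1024, width=1024,
#                  num_inference_steps=9, guidance_scale=0.0)
# )
# print(f"{seconds:.2f} s per image")
```

Running it once with the INT8 MatMul block enabled and once without gives a rough comparison. The numbers depend heavily on the GPU and driver.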

Original BF16 vs SDNQ quantization comparison:

Quantization	Model Size	Visualization
Original BF16	12.3 GB
SDNQ INT8	6.2 GB
SDNQ INT8 MatMul	6.2 GB
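
The sizes in the table are roughly what halving the storage per weight would predict. The following back-of-the-envelope check uses only the figures from the table. Attributing the small remainder to quantization scales, layers kept in higher precision, or rounding of the reported sizes is an assumption, not something the table states.

```python
BF16_SIZE_GB = 12.3      # reported size of the original BF16 model
REPORTED_INT8_GB = 6.2   # reported size of the SDNQ INT8 model
BYTES_BF16 = 2           # 16 bits per weight
BYTES_INT8 = 1           # 8 bits per weight

# Size expected if every weight simply shrank from 2 bytes to 1 byte.
expected_int8_gb = BF16_SIZE_GB * BYTES_INT8 / BYTES_BF16

# Remainder not explained by the halving alone (assumed: scales,
# unquantized layers, or rounding of the reported figures).
overhead_gb = REPORTED_INT8_GB - expected_int8_gb

print(f"expected {expected_int8_gb:.2f} GB, reported {REPORTED_INT8_GB} GB")
print(f"unexplained remainder: {overhead_gb:.2f} GB")
```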
