Model Card (SVDQuant)

Language: English | 中文

Model Name

Model repo: tonera/dreamshaperXL_v21TurboDPMSDE
Base (Diffusers weights path): tonera/dreamshaperXL_v21TurboDPMSDE (repo root)
Quantized UNet weights: tonera/dreamshaperXL_v21TurboDPMSDE/svdq-<precision>_r32-dreamshaperXL_v21TurboDPMSDE.safetensors

Quantization / Inference Tech

Inference engine: Nunchaku (https://github.com/nunchaku-ai/nunchaku)

Nunchaku is a high-performance inference engine for 4-bit (FP4/INT4) neural networks. It aims to cut VRAM usage and speed up inference while preserving generation quality as far as possible. It implements post-training quantization methods such as SVDQuant, and it reduces the overhead of their low-rank branches through operator/kernel fusion and other optimizations.

The SDXL quantized weights in this repository (e.g. svdq-*_r32-*.safetensors) are intended to be used with Nunchaku for efficient inference on supported GPUs.

Quantization Quality (fp8)

PSNR: mean=16.6145 p50=16.8903 p90=18.686 best=19.0489 worst=13.1796 (N=25)
SSIM: mean=0.683617 p50=0.697688 p90=0.769644 best=0.818764 worst=0.492368 (N=25)
LPIPS: mean=0.289557 p50=0.283484 p90=0.349915 best=0.170336 worst=0.414013 (N=25)
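
The summary lines above can be reproduced from per-image scores with a few lines of code. The sketch below is a minimal illustration, not the script used for this card: it assumes each score compares a quantized-pipeline image with a reference image from the unquantized pipeline (the card does not spell this out), it computes PSNR over flat sequences of 8-bit pixel values, and it uses a nearest-rank p90 because the percentile method is not stated. The names `psnr` and `summarize` are hypothetical. SSIM and LPIPS need dedicated libraries and are not shown.

```python
import math
import statistics


def psnr(reference, candidate, max_value=255.0):
    """PSNR in dB between two equal-length sequences of pixel values."""
    if len(reference) != len(candidate):
        raise ValueError("inputs must have the same number of values")
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def summarize(name, scores, higher_is_better=True):
    """Format per-image scores in the same layout as the summary lines."""
    ordered = sorted(scores)
    # Nearest-rank 90th percentile (an assumption; the card does not say).
    p90 = ordered[max(0, math.ceil(0.9 * len(ordered)) - 1)]
    # For PSNR and SSIM higher is better; for LPIPS lower is better.
    best = ordered[-1] if higher_is_better else ordered[0]
    worst = ordered[0] if higher_is_better else ordered[-1]
    return (
        f"{name}: mean={statistics.fmean(scores):.6g} "
        f"p50={statistics.median(scores):.6g} p90={p90:.6g} "
        f"best={best:.6g} worst={worst:.6g} (N={len(scores)})"
    )
```

For LPIPS, pass `higher_is_better=False`, which matches the summary above where the best LPIPS value is the smallest one.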

Performance

Below is the inference performance comparison (Diffusers vs Nunchaku-UNet).

Inference config: bf16 / steps=30 / guidance_scale=5.0
Resolutions (5 images each, batch=5): 1024x1024, 1024x768, 768x1024, 832x1216, 1216x832
Software versions: torch 2.9 / cuda 12.8 / nunchaku 1.1.0+torch2.9 / diffusers 0.37.0.dev0
Optimization switches: no torch.compile, no explicit cudnn tuning flags
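
For readers who want to run a comparable measurement, the following is a minimal timing sketch, not the benchmark script behind these numbers. It assumes the first call is treated as the cold run and doubles as warmup, and that the following calls are the steady-state runs. `time_runs` and its parameters are hypothetical names; `sync` is an optional callable such as `torch.cuda.synchronize`, so that queued GPU work is included in each measurement.

```python
import time


def time_runs(generate, n_runs=5, sync=None):
    """Return (cold_seconds, list_of_steady_state_seconds)."""

    def timed_call():
        if sync is not None:
            sync()  # drain pending GPU work before starting the clock
        start = time.perf_counter()
        generate()
        if sync is not None:
            sync()  # wait for this call's GPU work to finish
        return time.perf_counter() - start

    cold = timed_call()  # first call: includes one-time setup costs
    steady = [timed_call() for _ in range(n_runs)]
    return cold, steady


# Hypothetical usage with a loaded pipeline `pipe`:
#   cold, steady = time_runs(
#       lambda: pipe(prompt=prompt, guidance_scale=5.0, num_inference_steps=30),
#       sync=torch.cuda.synchronize,
#   )
#   print(cold, sum(steady), sum(steady) / len(steady))
```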

Cold-start performance (end-to-end for the first image)

GPU	Metric	Diffusers	Nunchaku	Speedup	Gain
RTX 5090	load	3.505s	3.432s	1.02x	+2.1%
RTX 5090	cold_infer	2.944s	2.447s	1.20x	+16.9%
RTX 5090	cold_e2e	6.449s	5.880s	1.10x	+8.8%
RTX 3090	load	3.787s	3.442s	1.10x	+9.1%
RTX 3090	cold_infer	7.503s	5.231s	1.43x	+30.3%
RTX 3090	cold_e2e	11.290s	8.673s	1.30x	+23.2%

Steady-state performance (5 consecutive images after warmup)

GPU	Metric	Diffusers	Nunchaku	Speedup	Gain
RTX 5090	total (5 images)	12.937s	9.813s	1.32x	+24.2%
RTX 5090	avg (per image)	2.587s	1.963s	1.32x	+24.2%
RTX 3090	total (5 images)	33.413s	22.975s	1.45x	+31.2%
RTX 3090	avg (per image)	6.683s	4.595s	1.45x	+31.2%

Notes:

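Loading quantized weights involves extra one-time processing, but in these runs load time was not longer with Nunchaku: it was within 0.1s of Diffusers on RTX 5090 and about 0.35s shorter on RTX 3090.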
During inference (cold_infer and steady-state), Nunchaku shows clear speedups on both GPUs.
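
The card does not define the Speedup and Gain columns. The formulas below are inferred from the table values (they reproduce the rows checked here), so treat them as an assumption: speedup is the Diffusers time divided by the Nunchaku time, and gain is the share of the Diffusers time that was saved.

```python
def speedup_and_gain(diffusers_s, nunchaku_s):
    """Return (speedup factor, time saved as a percentage of the baseline)."""
    speedup = diffusers_s / nunchaku_s
    gain_pct = (diffusers_s - nunchaku_s) / diffusers_s * 100.0
    return speedup, gain_pct


# RTX 3090 cold_infer row: 7.503s (Diffusers) vs 5.231s (Nunchaku)
speedup, gain = speedup_and_gain(7.503, 5.231)
print(f"{speedup:.2f}x  +{gain:.1f}%")  # prints "1.43x  +30.3%"
```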

Nunchaku Installation Required

Official installation docs (recommended source of truth): https://nunchaku.tech/docs/nunchaku/installation/installation.html

(Recommended) Install the official prebuilt wheel

Prerequisite: PyTorch >= 2.5 (follow the wheel requirements)
Install Nunchaku wheel: choose a wheel matching your torch/cuda/python versions from GitHub Releases / HuggingFace / ModelScope (note cp311 means Python 3.11):
- https://github.com/nunchaku-ai/nunchaku/releases

Bash

# Example (select the correct wheel URL for your torch/cuda/python versions)
pip install https://github.com/nunchaku-ai/nunchaku/releases/download/vX.Y.Z/nunchaku-X.Y.Z+torch2.9-cp311-cp311-linux_x86_64.whl

Tip (RTX 50 series): typically prefer CUDA >= 12.8, and prefer FP4 models for compatibility/performance (follow official docs).
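
To pick a matching wheel, it helps to print the tags for the local environment first. The sketch below follows the filename pattern from the example command above; `python_tag` and `wheel_name` are hypothetical helpers, and the resulting name should be checked against the actual release assets rather than assumed to exist.

```python
import sys


def python_tag(version_info=sys.version_info):
    """CPython tag as used in wheel names, e.g. cp311 for Python 3.11."""
    return f"cp{version_info[0]}{version_info[1]}"


def wheel_name(nunchaku_version, torch_version, py_tag, platform="linux_x86_64"):
    """Build a wheel filename following the pattern in the example command."""
    # Keep only major.minor of torch, dropping any local suffix such as "+cu128".
    torch_mm = ".".join(torch_version.split("+")[0].split(".")[:2])
    return f"nunchaku-{nunchaku_version}+torch{torch_mm}-{py_tag}-{py_tag}-{platform}.whl"


print(python_tag())
# With torch installed, torch.__version__ can be passed as torch_version.
print(wheel_name("X.Y.Z", "2.9.0+cu128", python_tag()))
```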

Usage Example (Diffusers + Nunchaku UNet)

Python

import torch
from diffusers import StableDiffusionXLPipeline

from nunchaku.models.unets.unet_sdxl import NunchakuSDXLUNet2DConditionModel
from nunchaku.utils import get_precision

MODEL = "dreamshaperXL_v21TurboDPMSDE"
REPO_ID = f"tonera/{MODEL}"

if __name__ == "__main__":
    # get_precision() picks the precision string that matches the local GPU,
    # so the filename resolves to the svdq-<precision>_r32-... file for this device.
    unet = NunchakuSDXLUNet2DConditionModel.from_pretrained(
        f"{REPO_ID}/svdq-{get_precision()}_r32-{MODEL}.safetensors"
    )

    pipe = StableDiffusionXLPipeline.from_pretrained(
        f"{REPO_ID}",
        unet=unet,
        torch_dtype=torch.bfloat16,
        use_safetensors=True,
    ).to("cuda")

    prompt = "Make Pikachu hold a sign that says 'Nunchaku is awesome', yarn art style, detailed, vibrant colors"
    image = pipe(prompt=prompt, guidance_scale=5.0, num_inference_steps=30).images[0]
    image.save("sdxl.png")
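
The usage example builds the quantized UNet path from the `svdq-<precision>_r32-...` pattern given under Model Name. To see what that path looks like for a given precision, or to pin a precision explicitly instead of relying on auto-detection, a small helper like the one below can be used. It is a sketch: `quantized_unet_path` is a hypothetical name, the accepted precisions are assumed to be `int4` and `fp4` (the 4-bit formats Nunchaku is described as supporting above), and it does not check that the file is actually present in the repository.

```python
MODEL = "dreamshaperXL_v21TurboDPMSDE"
REPO_ID = f"tonera/{MODEL}"


def quantized_unet_path(precision, rank=32):
    """Return the repo-relative path of the quantized UNet weights."""
    # Assumption: only the two 4-bit formats are valid precision strings.
    if precision not in ("int4", "fp4"):
        raise ValueError(f"unsupported precision: {precision!r}")
    return f"{REPO_ID}/svdq-{precision}_r{rank}-{MODEL}.safetensors"


print(quantized_unet_path("int4"))
print(quantized_unet_path("fp4"))
```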

