by FabioSarracino
VibeVoice Large Q8 is a voice-generation model that stands out for running in 8-bit while preserving audio quality, unlike other quantized models that often introduce noise. Through a selective quantization technique, it cuts the model size by 38% (11.6 GB instead of 18.7 GB) and uses less VRAM (12 GB instead of 20 GB), making it accessible to mid-range GPUs like the RTX 3060 or 4070 Ti. It excels in applications that need a balance of performance and quality, such as professional audio production or resource-limited environments, while remaining compatible with tools like ComfyUI. In short: a reliable way to harness voice models without compromising the clarity of the final output.
If you've tried other 8-bit quantized VibeVoice models, you probably got nothing but static noise. This one actually works.
The secret? Selective quantization: I only quantized the language model (the most robust part), while keeping audio-critical components (diffusion head, VAE, connectors) at full precision.
Most 8-bit models you'll find online quantize everything aggressively: the audio components get quantized → numerical errors propagate → audio = pure noise.
I only quantized what can be safely quantized without losing quality.
Result: 52% of parameters quantized, 48% at full precision = perfect audio quality.
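Conceptually, the selection boils down to a name filter over the model's modules. The module-name patterns below are illustrative assumptions, not the checkpoint's actual names; with Hugging Face transformers the same policy maps onto `BitsAndBytesConfig`'s `llm_int8_skip_modules` option.

```python
# Sketch of selective quantization: quantize only the language model,
# skip audio-critical modules. The name patterns here are assumptions
# for illustration, not the checkpoint's real module names.
AUDIO_CRITICAL = ("diffusion_head", "acoustic", "semantic", "connector")

def should_quantize(module_name: str) -> bool:
    """True only for modules that are safe to quantize (the language model)."""
    return not any(key in module_name for key in AUDIO_CRITICAL)

# With transformers + bitsandbytes, the same policy can be expressed as:
#   BitsAndBytesConfig(load_in_8bit=True,
#                      llm_int8_skip_modules=list(AUDIO_CRITICAL))
```

For example, `should_quantize("language_model.layers.0.mlp")` is `True`, while `should_quantize("diffusion_head.proj")` is `False`.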
| Model | Size | Audio Quality | Status |
|---|---|---|---|
| Original VibeVoice | 18.7 GB | ⭐⭐⭐⭐⭐ | Full precision |
| Other 8-bit models | 10.6 GB | 💥 NOISE | ❌ Don't work |
| This model | 11.6 GB | ⭐⭐⭐⭐⭐ | ✅ Perfect |
+1.0 GB vs other 8-bit models = perfect audio instead of noise. Worth it.
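The headline size reduction checks out arithmetically:

```python
# Size reduction from full precision (18.7 GB) to this model (11.6 GB)
full_gb, q8_gb = 18.7, 11.6
reduction = (full_gb - q8_gb) / full_gb
print(f"Size reduction: {reduction:.0%}")  # → Size reduction: 38%
```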
```python
from transformers import AutoModelForCausalLM, AutoProcessor
import torch
import scipy.io.wavfile as wavfile

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "FabioSarracino/VibeVoice-Large-Q8",
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
processor = AutoProcessor.from_pretrained(
    "FabioSarracino/VibeVoice-Large-Q8",
    trust_remote_code=True,
)

# Generate audio
text = "Hello, this is VibeVoice speaking."
inputs = processor(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=None)

# Save as 24 kHz WAV
audio = output.speech_outputs[0].cpu().numpy()
wavfile.write("output.wav", 24000, audio)
```
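Note that `scipy.io.wavfile.write` stores float arrays as 32-bit float WAV, which some players don't handle. A small helper (not part of the model's API, just a convenience sketch) converts the output to 16-bit PCM first:

```python
import numpy as np

def to_int16_pcm(audio: np.ndarray) -> np.ndarray:
    """Clip float audio to [-1, 1] and scale to 16-bit PCM."""
    clipped = np.clip(audio, -1.0, 1.0)
    return (clipped * 32767.0).astype(np.int16)

# e.g. wavfile.write("output.wav", 24000, to_int16_pcm(audio))
```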
Install the custom node:

```bash
cd ComfyUI/custom_nodes
git clone https://github.com/Enemyx-net/VibeVoice-ComfyUI
```

Download this model to `ComfyUI/models/vibevoice/`, then restart ComfyUI and use it normally!
⚠️ Not supported: CPU, Apple Silicon (MPS), AMD GPUs
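A quick sanity check before loading can catch unsupported hardware early. This sketch relies on the fact that ROCm (AMD) builds of PyTorch set `torch.version.hip`, so a plain `torch.cuda.is_available()` alone would not exclude AMD GPUs:

```python
import torch

def cuda_gpu_available() -> bool:
    """True only on NVIDIA CUDA builds: excludes CPU, Apple MPS, and AMD ROCm."""
    if getattr(torch.version, "hip", None):  # ROCm (AMD) build of PyTorch
        return False
    return torch.cuda.is_available()
```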
Requirements: `transformers>=4.51.3` and `bitsandbytes>=0.43.0` (the model loads with `device_map="auto"`). If bitsandbytes is missing:

```bash
pip install "bitsandbytes>=0.43.0"
```

Noisy audio? This shouldn't happen! If it does:
- Upgrade transformers: `pip install --upgrade transformers`
- Check that `torch.cuda.is_available()` returns `True`

Citation:

```bibtex
@misc{vibevoice-q8-2025,
  title={VibeVoice-Large-Q8: Selective 8-bit Quantization for Audio Quality},
  author={Fabio Sarracino},
  year={2025},
  url={https://huggingface.co/FabioSarracino/VibeVoice-Large-Q8}
}
```
```bibtex
@misc{vibevoice2024,
  title={VibeVoice: High-Quality Text-to-Speech with Large Language Models},
  author={Microsoft Research},
  year={2024},
  url={https://github.com/microsoft/VibeVoice}
}
```
MIT License.
If this model helped you, leave a ⭐ on GitHub!
Created by Fabio Sarracino
The first 8-bit VibeVoice model that actually works