Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation

Professional-grade AI sound effect generation for video content creators

👥 Authors

Sizhe Shan¹,²* • Qiulin Li¹,³* • Yutao Cui¹ • Miles Yang¹ • Yuehai Wang² • Qun Yang³ • Jin Zhou¹† • Zhao Zhong¹

🏢 ¹Tencent Hunyuan • 🎓 ²Zhejiang University • ✈️ ³Nanjing University of Aeronautics and Astronautics

*Equal contribution • †Project lead

🔥🔥🔥 News

[2025.9.29] 🚀 HunyuanVideo-Foley-XL Model Release - Release XL-sized model with offload inference support, significantly reducing VRAM requirements.
[2025.8.28] 🌟 HunyuanVideo-Foley Open Source Release - Inference code and model weights publicly available.

✨ Key Highlights

🎭 Multi-scenario Sync
High-quality audio synchronized with complex video scenes

🧠 Multi-modal Balance
Perfect harmony between visual and textual information

🎵 48kHz Hi-Fi Output
Professional-grade audio generation with crystal clarity

📄 Abstract

🚀 Tencent Hunyuan open-sources HunyuanVideo-Foley, an end-to-end video sound effect generation model!

A professional-grade AI tool specifically designed for video content creators, widely applicable to diverse scenarios including short video creation, film production, advertising creativity, and game development.

🎯 Core Highlights

🎬 Multi-scenario Audio-Visual Synchronization
Supports generating high-quality audio that is synchronized and semantically aligned with complex video scenes, enhancing realism and immersive experience for film/TV and gaming applications.

⚖️ Multi-modal Semantic Balance
Intelligently balances visual and textual information analysis, comprehensively orchestrates sound effect elements, avoids one-sided generation, and meets personalized dubbing requirements.

🎵 High-fidelity Audio Output
Self-developed 48kHz audio VAE perfectly reconstructs sound effects, music, and vocals, achieving professional-grade audio generation quality.

🏆 SOTA Performance Achieved

HunyuanVideo-Foley leads on most metrics across multiple evaluation benchmarks, reaching state-of-the-art levels in audio fidelity, visual-semantic alignment, temporal alignment, and distribution matching, and outperforming the open-source baselines compared below on the majority of metrics!

Figure (Performance Overview): 📊 Performance comparison across different evaluation metrics; HunyuanVideo-Foley leads in most categories

🔧 Technical Architecture

📊 Data Pipeline Design

Figure (Data Pipeline): 🔄 Comprehensive data processing pipeline for high-quality text-video-audio datasets

The TV2A (Text-Video-to-Audio) task is a complex multimodal generation problem that requires large-scale, high-quality datasets. Our data pipeline systematically identifies and excludes unsuitable content, yielding training data that supports robust and generalizable audio generation.

🏗️ Model Architecture

Figure (Model Architecture): 🧠 HunyuanVideo-Foley hybrid architecture with multimodal and unimodal transformer blocks

HunyuanVideo-Foley employs a sophisticated hybrid architecture:

🔄 Multimodal Transformer Blocks: Process visual-audio streams simultaneously
🎵 Unimodal Transformer Blocks: Focus on audio stream refinement
👁️ Visual Encoding: Pre-trained encoder extracts visual features from video frames
📝 Text Processing: Semantic features extracted via pre-trained text encoder
🎧 Audio Encoding: Latent representations with Gaussian noise perturbation
⏰ Temporal Alignment: Synchformer-based frame-level synchronization with gated modulation
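To make the "gated modulation" in the temporal-alignment bullet more concrete, here is a minimal sketch in plain Python. It assumes the adaLN-style form common in diffusion transformers, where a conditioning signal supplies a per-channel shift, scale, and gate. The source does not spell out the exact formulation, so the function name, the argument names, and the omission of layer normalization are all illustrative assumptions, not the model's actual code.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def gated_modulation(
    x: Sequence[float],
    shift: Sequence[float],
    scale: Sequence[float],
    gate: Sequence[float],
    sublayer: Callable[[Vector], Vector],
) -> Vector:
    """Hypothetical adaLN-style gated residual update for a single token."""
    # Modulate each channel with the conditioning-derived scale and shift.
    modulated = [xi * (1.0 + s) + b for xi, s, b in zip(x, scale, shift)]
    # Run the placeholder sublayer (attention or MLP in a real block).
    update = sublayer(modulated)
    # The gate controls how much of the update joins the residual stream.
    return [xi + g * ui for xi, g, ui in zip(x, gate, update)]


def identity(v: Vector) -> Vector:
    """Stand-in sublayer used only for this demonstration."""
    return list(v)


token = [1.0, -2.0, 0.5]
zeros = [0.0, 0.0, 0.0]
ones = [1.0, 1.0, 1.0]

closed_gate = gated_modulation(token, zeros, zeros, zeros, identity)
open_gate = gated_modulation(token, zeros, zeros, ones, identity)
print(closed_gate)
print(open_gate)
```

With the gate at zero the token passes through unchanged, so the conditioning branch can be introduced without disturbing the audio stream; as the gate opens, the modulated update contributes more.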

📈 Performance Benchmarks

🎬 MovieGen-Audio-Bench Results

Objective and subjective evaluation results; HunyuanVideo-Foley scores best on 8 of the 10 metrics

🏆 Method	PQ ↑	PC ↓	CE ↑	CU ↑	IB ↑	DeSync ↓	CLAP ↑	MOS-Q ↑	MOS-S ↑	MOS-T ↑
FoleyCrafter	6.27	2.72	3.34	5.68	0.17	1.29	0.14	3.36±0.78	3.54±0.88	3.46±0.95
V-AURA	5.82	4.30	3.63	5.11	0.23	1.38	0.14	2.55±0.97	2.60±1.20	2.70±1.37
Frieren	5.71	2.81	3.47	5.31	0.18	1.39	0.16	2.92±0.95	2.76±1.20	2.94±1.26
MMAudio	6.17	2.84	3.59	5.62	0.27	0.80	0.35	3.58±0.84	3.63±1.00	3.47±1.03
ThinkSound	6.04	3.73	3.81	5.59	0.18	0.91	0.20	3.20±0.97	3.01±1.04	3.02±1.08
HunyuanVideo-Foley (ours)	6.59	2.74	3.88	6.13	0.35	0.74	0.33	4.14±0.68	4.12±0.77	4.15±0.75

🎯 Kling-Audio-Eval Results

Comprehensive objective evaluation showcasing state-of-the-art performance

🏆 Method	FD_PANNs ↓	FD_PASST ↓	KL ↓	IS ↑	PQ ↑	PC ↓	CE ↑	CU ↑	IB ↑	DeSync ↓	CLAP ↑
FoleyCrafter	22.30	322.63	2.47	7.08	6.05	2.91	3.28	5.44	0.22	1.23	0.22
V-AURA	33.15	474.56	3.24	5.80	5.69	3.98	3.13	4.83	0.25	0.86	0.13
Frieren	16.86	293.57	2.95	7.32	5.72	2.55	2.88	5.10	0.21	0.86	0.16
MMAudio	9.01	205.85	2.17	9.59	5.94	2.91	3.30	5.39	0.30	0.56	0.27
ThinkSound	9.92	228.68	2.39	6.86	5.78	3.23	3.12	5.11	0.22	0.67	0.22
HunyuanVideo-Foley (ours)	6.07	202.12	1.89	8.30	6.12	2.76	3.22	5.53	0.38	0.54	0.24

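🎉 Outstanding Results! HunyuanVideo-Foley achieves the best scores on most evaluation metrics (8 of 10 on MovieGen-Audio-Bench and 7 of 11 on Kling-Audio-Eval), with the clearest gains in audio quality, synchronization, and visual-semantic alignment. The exceptions are PC and CLAP on both benchmarks, plus IS and CE on Kling-Audio-Eval, where a baseline scores better.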
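The per-metric comparison can be checked directly from the two tables above. The sketch below copies the scores (MOS columns use the mean only) and lists the metrics on which at least one baseline is strictly better than HunyuanVideo-Foley. The variable and function names are illustrative; the baseline rows are kept in table order without names.

```python
from typing import List, Sequence, Tuple

Column = Tuple[str, bool]  # (metric name, True if higher is better)

MOVIEGEN_COLUMNS: List[Column] = [
    ("PQ", True), ("PC", False), ("CE", True), ("CU", True), ("IB", True),
    ("DeSync", False), ("CLAP", True), ("MOS-Q", True), ("MOS-S", True),
    ("MOS-T", True),
]
MOVIEGEN_BASELINES = [
    [6.27, 2.72, 3.34, 5.68, 0.17, 1.29, 0.14, 3.36, 3.54, 3.46],
    [5.82, 4.30, 3.63, 5.11, 0.23, 1.38, 0.14, 2.55, 2.60, 2.70],
    [5.71, 2.81, 3.47, 5.31, 0.18, 1.39, 0.16, 2.92, 2.76, 2.94],
    [6.17, 2.84, 3.59, 5.62, 0.27, 0.80, 0.35, 3.58, 3.63, 3.47],
    [6.04, 3.73, 3.81, 5.59, 0.18, 0.91, 0.20, 3.20, 3.01, 3.02],
]
MOVIEGEN_OURS = [6.59, 2.74, 3.88, 6.13, 0.35, 0.74, 0.33, 4.14, 4.12, 4.15]

KLING_COLUMNS: List[Column] = [
    ("FD_PANNs", False), ("FD_PASST", False), ("KL", False), ("IS", True),
    ("PQ", True), ("PC", False), ("CE", True), ("CU", True), ("IB", True),
    ("DeSync", False), ("CLAP", True),
]
KLING_BASELINES = [
    [22.30, 322.63, 2.47, 7.08, 6.05, 2.91, 3.28, 5.44, 0.22, 1.23, 0.22],
    [33.15, 474.56, 3.24, 5.80, 5.69, 3.98, 3.13, 4.83, 0.25, 0.86, 0.13],
    [16.86, 293.57, 2.95, 7.32, 5.72, 2.55, 2.88, 5.10, 0.21, 0.86, 0.16],
    [9.01, 205.85, 2.17, 9.59, 5.94, 2.91, 3.30, 5.39, 0.30, 0.56, 0.27],
    [9.92, 228.68, 2.39, 6.86, 5.78, 3.23, 3.12, 5.11, 0.22, 0.67, 0.22],
]
KLING_OURS = [6.07, 202.12, 1.89, 8.30, 6.12, 2.76, 3.22, 5.53, 0.38, 0.54, 0.24]


def metrics_not_led(
    columns: Sequence[Column],
    ours: Sequence[float],
    baselines: Sequence[Sequence[float]],
) -> List[str]:
    """Return metrics where some baseline strictly beats our score."""
    behind = []
    for i, (name, higher_is_better) in enumerate(columns):
        scores = [row[i] for row in baselines]
        best_baseline = max(scores) if higher_is_better else min(scores)
        beaten = best_baseline > ours[i] if higher_is_better else best_baseline < ours[i]
        if beaten:
            behind.append(name)
    return behind


for label, cols, ours, base in [
    ("MovieGen-Audio-Bench", MOVIEGEN_COLUMNS, MOVIEGEN_OURS, MOVIEGEN_BASELINES),
    ("Kling-Audio-Eval", KLING_COLUMNS, KLING_OURS, KLING_BASELINES),
]:
    behind = metrics_not_led(cols, ours, base)
    print(f"{label}: best on {len(cols) - len(behind)}/{len(cols)}, behind on {behind}")
```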

🚀 Quick Start

📦 Installation

🔧 System Requirements

CUDA: 12.4 or 11.8 recommended
Python: 3.8+
OS: Linux (primary support)
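A quick pre-flight check along these lines may help before installing. It is only a sketch using the Python standard library: it checks the interpreter version and operating system listed above, and whether an `nvidia-smi` executable is on the PATH as a rough proxy for an NVIDIA driver. It does not verify the CUDA version, and the function name is illustrative.

```python
import platform
import shutil
import sys


def check_environment() -> dict:
    """Report whether the basic requirements listed above appear to be met."""
    return {
        # Python 3.8 or newer is required.
        "python_ok": sys.version_info >= (3, 8),
        # Linux is the primary supported OS.
        "is_linux": platform.system() == "Linux",
        # Presence of nvidia-smi suggests a driver is installed; it says
        # nothing about which CUDA toolkit version is available.
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
    }


for name, ok in check_environment().items():
    print(f"{name}: {'yes' if ok else 'no'}")
```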

Step 1: Clone Repository

Bash

# 📥 Clone the repository
git clone https://github.com/Tencent-Hunyuan/HunyuanVideo-Foley
cd HunyuanVideo-Foley

Step 2: Environment Setup

💡 Tip: We recommend using Conda for Python environment management.

Bash

# 🔧 Install dependencies
pip install -r requirements.txt

Step 3: Download Pretrained Models

🔗 Download the model weights from Hugging Face

Bash

# using git-lfs
git clone https://huggingface.co/tencent/HunyuanVideo-Foley

# using huggingface-cli
huggingface-cli download tencent/HunyuanVideo-Foley

💻 Usage

🎬 Single Video Generation

Generate Foley audio for a single video file with text description:

Bash

python3 infer.py \
    --model_path PRETRAINED_MODEL_PATH_DIR \
    --config_path ./configs/hunyuanvideo-foley-xxl.yaml \
    --single_video video_path \
    --single_prompt "audio description" \
    --output_dir OUTPUT_DIR
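When calling the script from Python, for example in a larger pipeline, the same command can be assembled as an argument list. The sketch below uses only the flags shown above; the helper name and the example paths are hypothetical, and the command is printed instead of executed.

```python
import shlex
from typing import List


def build_infer_command(
    model_path: str,
    video_path: str,
    prompt: str,
    output_dir: str,
    config_path: str = "./configs/hunyuanvideo-foley-xxl.yaml",
) -> List[str]:
    """Assemble the single-video infer.py invocation as an argument list."""
    return [
        "python3", "infer.py",
        "--model_path", model_path,
        "--config_path", config_path,
        "--single_video", video_path,
        "--single_prompt", prompt,
        "--output_dir", output_dir,
    ]


# Example paths and prompt are placeholders, not files shipped with the repo.
cmd = build_infer_command(
    model_path="./HunyuanVideo-Foley",
    video_path="./clips/walk.mp4",
    prompt="footsteps on gravel",
    output_dir="./outputs",
)
print(shlex.join(cmd))
# To actually run it from the repository root:
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing a list instead of a single shell string avoids quoting problems when the prompt contains spaces or punctuation.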

📂 Batch Processing

Process multiple videos using a CSV file with video paths and descriptions:

Bash

python3 infer.py \
    --model_path PRETRAINED_MODEL_PATH_DIR \
    --config_path ./configs/hunyuanvideo-foley-xxl.yaml \
    --csv_path assets/test.csv \
    --output_dir OUTPUT_DIR
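A batch CSV can be generated programmatically with the standard `csv` module. The column names are not stated here, so the `video` and `prompt` headers below are assumptions: check the header row of `assets/test.csv` in the repository and adjust `video_column` and `prompt_column` to match. The helper name and example clips are likewise hypothetical.

```python
import csv
import tempfile
from pathlib import Path
from typing import Iterable, Tuple


def write_batch_csv(
    pairs: Iterable[Tuple[str, str]],
    csv_path: Path,
    video_column: str = "video",    # assumed header name, verify against assets/test.csv
    prompt_column: str = "prompt",  # assumed header name, verify against assets/test.csv
) -> Path:
    """Write (video path, text description) pairs to a CSV file."""
    path = Path(csv_path)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow([video_column, prompt_column])
        for video, prompt in pairs:
            # csv.writer quotes fields containing commas automatically.
            writer.writerow([str(video), prompt])
    return path


batch_file = write_batch_csv(
    [
        ("clips/rain.mp4", "heavy rain on a tin roof"),
        ("clips/walk.mp4", "footsteps on gravel, slow pace"),
    ],
    Path(tempfile.gettempdir()) / "foley_batch.csv",
)
print(batch_file.read_text(encoding="utf-8"))
```

The resulting file path can then be passed to `--csv_path`.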

🌐 Interactive Web Interface

Launch a user-friendly Gradio web interface for easy interaction:

Bash

export HIFI_FOLEY_MODEL_PATH=PRETRAINED_MODEL_PATH_DIR
python3 gradio_app.py

🚀 Then open your browser and navigate to the provided local URL to start generating Foley audio!

📚 Citation

If you find HunyuanVideo-Foley useful for your research, please consider citing our paper:

Bibtex

@misc{shan2025hunyuanvideofoleymultimodaldiffusionrepresentation,
      title={HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation}, 
      author={Sizhe Shan and Qiulin Li and Yutao Cui and Miles Yang and Yuehai Wang and Qun Yang and Jin Zhou and Zhao Zhong},
      year={2025},
      eprint={2508.16930},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2508.16930}, 
}

🙏 Acknowledgements

We extend our heartfelt gratitude to the open-source community!

🎨 Stable Diffusion 3: Foundation diffusion models
⚡ FLUX: Advanced generation techniques
🎵 MMAudio: Multimodal audio generation
🤗 HuggingFace: Platform & diffusers library
🗜️ DAC: High-Fidelity Audio Compression
🔗 Synchformer: Audio-Visual Synchronization

🌟 Special thanks to all researchers and developers who contribute to the advancement of AI-generated audio and multimodal learning!

