by Tencent
HunyuanVideo Foley is an AI model specialized in generating realistic, synchronized sound effects for video, offering professional quality suited to content creators. Using a multimodal approach, it analyzes both the images and the text descriptions to produce sounds that match the scene, such as footsteps, impacts, or ambiences, with high audio fidelity at 48 kHz. Its capabilities cover varied scenarios, from urban environments to natural landscapes to specific actions, all with precise synchronization. What sets it apart is its balance between visual accuracy and contextual understanding, which yields immersive, natural-sounding results. Aimed at video editors, directors, and game creators, it simplifies sound production while raising the quality of multimedia projects.
Professional-grade AI sound effect generation for video content creators
Sizhe Shan1,2* • Qiulin Li1,3* • Yutao Cui1 • Miles Yang1 • Yuehai Wang2 • Qun Yang3 • Jin Zhou1† • Zhao Zhong1
🏢 1Tencent Hunyuan • 🎓 2Zhejiang University • ✈️ 3Nanjing University of Aeronautics and Astronautics
*Equal contribution • †Project lead
🎭 Multi-scenario Sync • 🧠 Multi-modal Balance • 🎵 48kHz Hi-Fi Output
🚀 Tencent Hunyuan open-sources HunyuanVideo-Foley, an end-to-end video sound effect generation model!
A professional-grade AI tool specifically designed for video content creators, widely applicable to diverse scenarios including short video creation, film production, advertising creativity, and game development.
🎬 Multi-scenario Audio-Visual Synchronization
Supports generating high-quality audio that is synchronized and semantically aligned with complex video scenes, enhancing realism and immersive experience for film/TV and gaming applications.
⚖️ Multi-modal Semantic Balance
Intelligently balances visual and textual information analysis, comprehensively orchestrates sound effect elements, avoids one-sided generation, and meets personalized dubbing requirements.
🎵 High-fidelity Audio Output
A self-developed 48kHz audio VAE reconstructs sound effects, music, and vocals with high fidelity, enabling professional-grade audio generation quality.
🏆 SOTA Performance Achieved
HunyuanVideo-Foley leads multiple evaluation benchmarks, reaching new state-of-the-art levels among open-source solutions in audio fidelity, visual-semantic alignment, temporal alignment, and distribution matching.
📊 Performance comparison across different evaluation metrics - HunyuanVideo-Foley leads in all categories
🔄 Comprehensive data processing pipeline for high-quality text-video-audio datasets
The TV2A (Text-Video-to-Audio) task presents a complex multimodal generation challenge requiring large-scale, high-quality datasets. Our comprehensive data pipeline systematically identifies and excludes unsuitable content to produce robust and generalizable audio generation capabilities.
🧠 HunyuanVideo-Foley hybrid architecture with multimodal and unimodal transformer blocks
HunyuanVideo-Foley employs a hybrid architecture that combines multimodal and unimodal transformer blocks.
Objective and subjective evaluation results; HunyuanVideo-Foley scores best on every metric except PC and CLAP
| 🏆 Method | PQ ↑ | PC ↓ | CE ↑ | CU ↑ | IB ↑ | DeSync ↓ | CLAP ↑ | MOS-Q ↑ | MOS-S ↑ | MOS-T ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| FoleyGrafter | 6.27 | 2.72 | 3.34 | 5.68 | 0.17 | 1.29 | 0.14 | 3.36±0.78 | 3.54±0.88 | 3.46±0.95 |
| V-AURA | 5.82 | 4.30 | 3.63 | 5.11 | 0.23 | 1.38 | 0.14 | 2.55±0.97 | 2.60±1.20 | 2.70±1.37 |
| Frieren | 5.71 | 2.81 | 3.47 | 5.31 | 0.18 | 1.39 | 0.16 | 2.92±0.95 | 2.76±1.20 | 2.94±1.26 |
| MMAudio | 6.17 | 2.84 | 3.59 | 5.62 | 0.27 | 0.80 | 0.35 | 3.58±0.84 | 3.63±1.00 | 3.47±1.03 |
| ThinkSound | 6.04 | 3.73 | 3.81 | 5.59 | 0.18 | 0.91 | 0.20 | 3.20±0.97 | 3.01±1.04 | 3.02±1.08 |
| HunyuanVideo-Foley (ours) | 6.59 | 2.74 | 3.88 | 6.13 | 0.35 | 0.74 | 0.33 | 4.14±0.68 | 4.12±0.77 | 4.15±0.75 |
Comprehensive objective evaluation showcasing state-of-the-art performance
| 🏆 Method | FD_PANNs ↓ | FD_PASST ↓ | KL ↓ | IS ↑ | PQ ↑ | PC ↓ | CE ↑ | CU ↑ | IB ↑ | DeSync ↓ | CLAP ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FoleyGrafter | 22.30 | 322.63 | 2.47 | 7.08 | 6.05 | 2.91 | 3.28 | 5.44 | 0.22 | 1.23 | 0.22 |
| V-AURA | 33.15 | 474.56 | 3.24 | 5.80 | 5.69 | 3.98 | 3.13 | 4.83 | 0.25 | 0.86 | 0.13 |
| Frieren | 16.86 | 293.57 | 2.95 | 7.32 | 5.72 | 2.55 | 2.88 | 5.10 | 0.21 | 0.86 | 0.16 |
| MMAudio | 9.01 | 205.85 | 2.17 | 9.59 | 5.94 | 2.91 | 3.30 | 5.39 | 0.30 | 0.56 | 0.27 |
| ThinkSound | 9.92 | 228.68 | 2.39 | 6.86 | 5.78 | 3.23 | 3.12 | 5.11 | 0.22 | 0.67 | 0.22 |
| HunyuanVideo-Foley (ours) | 6.07 | 202.12 | 1.89 | 8.30 | 6.12 | 2.76 | 3.22 | 5.53 | 0.38 | 0.54 | 0.24 |
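
The arrows in the table headers give the direction of each metric: ↑ means higher is better, ↓ means lower is better. As a minimal sketch of how to read them (the `SCORES` layout and the `best_method` helper are illustrative and not part of the repository), the following Python snippet picks the best-scoring method for three columns copied from the table above:

```python
# Illustrative only: three columns copied from the table above.
# Each metric maps to (higher_is_better, {method: score}).
SCORES = {
    "FD_PANNs": (False, {
        "FoleyGrafter": 22.30, "V-AURA": 33.15, "Frieren": 16.86,
        "MMAudio": 9.01, "ThinkSound": 9.92, "HunyuanVideo-Foley": 6.07,
    }),
    "IS": (True, {
        "FoleyGrafter": 7.08, "V-AURA": 5.80, "Frieren": 7.32,
        "MMAudio": 9.59, "ThinkSound": 6.86, "HunyuanVideo-Foley": 8.30,
    }),
    "DeSync": (False, {
        "FoleyGrafter": 1.23, "V-AURA": 0.86, "Frieren": 0.86,
        "MMAudio": 0.56, "ThinkSound": 0.67, "HunyuanVideo-Foley": 0.54,
    }),
}


def best_method(higher_is_better, scores):
    """Return the method with the best score for one metric."""
    pick = max if higher_is_better else min
    return pick(scores, key=scores.get)


for metric, (higher_is_better, scores) in SCORES.items():
    print(f"{metric}: {best_method(higher_is_better, scores)}")
```

Running it prints one line per metric with the leading method; the remaining columns can be added to `SCORES` in the same way.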
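🎉 HunyuanVideo-Foley achieves the best scores on most evaluation metrics, including PQ, CU, IB, and DeSync in both tables and all three MOS ratings, demonstrating improvements in audio quality, synchronization, and semantic alignment. The exceptions: MMAudio scores higher on CLAP in both tables and on IS and CE in the second, and the best PC belongs to FoleyGrafter in the first table and Frieren in the second.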
🔧 System Requirements
# 📥 Clone the repository
git clone https://github.com/Tencent-Hunyuan/HunyuanVideo-Foley
cd HunyuanVideo-Foley
💡 Tip: We recommend using Conda for Python environment management.
# 🔧 Install dependencies
pip install -r requirements.txt
🔗 Download model weights from Hugging Face
# using git-lfs
git clone https://huggingface.co/tencent/HunyuanVideo-Foley
# using huggingface-cli
huggingface-cli download tencent/HunyuanVideo-Foley
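The same download can also be scripted from Python. A minimal sketch, assuming the `huggingface_hub` package is installed (it is the package that provides the `huggingface-cli` command); the `download_weights` helper and the `./pretrained_models` default directory are illustrative choices, not names taken from the repository:

```python
from pathlib import Path

REPO_ID = "tencent/HunyuanVideo-Foley"  # model repository named above


def download_weights(local_dir="./pretrained_models"):
    """Download a snapshot of the model repository and return its local path."""
    # Imported lazily so this module can be imported without huggingface_hub.
    from huggingface_hub import snapshot_download

    target = Path(local_dir)  # hypothetical default location
    target.mkdir(parents=True, exist_ok=True)
    return snapshot_download(repo_id=REPO_ID, local_dir=str(target))


# Example (requires network access):
# model_dir = download_weights()
```

The returned directory is what the commands below refer to as `PRETRAINED_MODEL_PATH_DIR`.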
Generate Foley audio for a single video file with text description:
python3 infer.py \
--model_path PRETRAINED_MODEL_PATH_DIR \
--config_path ./configs/hunyuanvideo-foley-xxl.yaml \
--single_video video_path \
--single_prompt "audio description" \
--output_dir OUTPUT_DIR
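To call the single-video command from a script, one option is to assemble the argument list in Python and hand it to `subprocess`. The sketch below uses only the flags shown above; the helper names `build_infer_command` and `run_inference` are hypothetical, and it uses the current interpreter (`sys.executable`) where the command above uses `python3`:

```python
import subprocess
import sys


def build_infer_command(model_path, video_path, prompt, output_dir,
                        config_path="./configs/hunyuanvideo-foley-xxl.yaml"):
    """Return the argv list for the single-video infer.py call shown above."""
    return [
        sys.executable, "infer.py",
        "--model_path", str(model_path),
        "--config_path", str(config_path),
        "--single_video", str(video_path),
        "--single_prompt", prompt,
        "--output_dir", str(output_dir),
    ]


def run_inference(videos_and_prompts, model_path, output_dir):
    """Run infer.py once per (video, prompt) pair; stops on the first failure."""
    for video_path, prompt in videos_and_prompts:
        cmd = build_infer_command(model_path, video_path, prompt, output_dir)
        # Must be run from the repository root, where infer.py lives.
        subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a prompt contains spaces or quotes. Each call starts a new process and therefore loads the model again, so this pattern suits a handful of clips rather than large batches.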
Process multiple videos using a CSV file with video paths and descriptions:
python3 infer.py \
--model_path PRETRAINED_MODEL_PATH_DIR \
--config_path ./configs/hunyuanvideo-foley-xxl.yaml \
--csv_path assets/test.csv \
--output_dir OUTPUT_DIR
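The expected CSV layout is not described here; `assets/test.csv` in the repository is the reference. As a sketch only, assuming hypothetical column names `video` and `prompt` (check them against the header of `assets/test.csv` before use), a batch file could be written with Python's `csv` module:

```python
import csv


def write_batch_csv(csv_path, rows, video_column="video", prompt_column="prompt"):
    """Write (video_path, prompt) pairs to a CSV file with a header row.

    The default column names are assumptions, not confirmed by this README;
    match them to the header of assets/test.csv.
    """
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([video_column, prompt_column])
        for video_path, prompt in rows:
            writer.writerow([str(video_path), prompt])
    return csv_path
```

For example, `write_batch_csv("batch.csv", [("clips/rain.mp4", "heavy rain on a tin roof")])` produces a two-line file that can then be passed to `--csv_path`. The `csv` module quotes prompts that contain commas, so descriptions do not need manual escaping.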
Launch a user-friendly Gradio web interface for easy interaction:
export HIFI_FOLEY_MODEL_PATH=PRETRAINED_MODEL_PATH_DIR
python3 gradio_app.py
🚀 Then open your browser and navigate to the provided local URL to start generating Foley audio!
If you find HunyuanVideo-Foley useful for your research, please consider citing our paper:
@misc{shan2025hunyuanvideofoleymultimodaldiffusionrepresentation,
title={HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation},
author={Sizhe Shan and Qiulin Li and Yutao Cui and Miles Yang and Yuehai Wang and Qun Yang and Jin Zhou and Zhao Zhong},
year={2025},
eprint={2508.16930},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2508.16930},
}
We extend our heartfelt gratitude to the open-source community!
🎨 Stable Diffusion 3 • ⚡ FLUX • 🎵 MMAudio
🤗 HuggingFace • 🗜️ DAC • 🔗 Synchformer
🌟 Special thanks to all researchers and developers who contribute to the advancement of AI-generated audio and multimodal learning!