by HKUSTAudio
AudioX is a unified framework for generating audio and music from diverse multimodal control signals, including text, video, and audio. Its Multimodal Adaptive Fusion (MAF) module aligns and fuses these inputs so that the generated audio stays consistent with the conditioning signals. Typical tasks include video-to-music generation and audio synthesis from text descriptions, with use cases such as AI-assisted music creation, custom soundtrack production, and ambient sound generation from visuals.
To use AudioX, first install the required dependencies and the package from the official repository:
# Clone the repository
git clone https://github.com/ZeyueT/AudioX.git
cd AudioX
# Install dependencies
pip install git+https://github.com/ZeyueT/AudioX.git
conda install -c conda-forge ffmpeg libsndfile
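Before running the example, it can help to confirm that the installation steps above took effect. The following is a minimal sketch and not part of AudioX: `check_environment` is a hypothetical helper that uses only the Python standard library to report whether the modules imported in the example and the `ffmpeg` binary can be found. It does not detect `libsndfile`, which is a shared library rather than a command on `PATH`.

```python
import importlib.util
import shutil


def check_environment(
    modules=("torch", "torchaudio", "einops", "audiox"),
    tools=("ffmpeg",),
):
    """Return a dict mapping each requirement to True (found) or False (missing).

    Hypothetical helper for illustration; the module and tool names are taken
    from the install and import lines of this document.
    """
    report = {}
    for name in modules:
        try:
            # find_spec locates a module without importing it.
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False
        report[f"module:{name}"] = found
    for name in tools:
        # shutil.which returns None when the executable is not on PATH.
        report[f"tool:{name}"] = shutil.which(name) is not None
    return report


if __name__ == "__main__":
    for item, ok in check_environment().items():
        print(f"{item}: {'found' if ok else 'MISSING'}")
```

Any entry reported as `MISSING` suggests repeating the corresponding `pip` or `conda` step.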
Below is an example of how to perform Video-to-Music generation programmatically:
import torch
import torchaudio
from einops import rearrange
from audiox import get_pretrained_model
from audiox.inference.generation import generate_diffusion_cond
from audiox.data.utils import read_video, merge_video_audio, load_and_process_audio, encode_video_with_synchformer
import os
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load pretrained model
# Choose one: "HKUSTAudio/AudioX", "HKUSTAudio/AudioX-MAF", or "HKUSTAudio/AudioX-MAF-MMDiT"
model_name = "HKUSTAudio/AudioX"
model, model_config = get_pretrained_model(model_name)
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]
target_fps = model_config["video_fps"]
seconds_start = 0
seconds_total = 10
model = model.to(device)
# Example: Video-to-Music generation
video_path = "example/V2M_sample-1.mp4"
text_prompt = "Generate music for the video"
audio_path = None
# Prepare inputs
video_tensor = read_video(video_path, seek_time=seconds_start, duration=seconds_total, target_fps=target_fps)
if audio_path:
    audio_tensor = load_and_process_audio(audio_path, sample_rate, seconds_start, seconds_total)
else:
    # Use zero tensor when no audio is provided
    audio_tensor = torch.zeros((2, int(sample_rate * seconds_total)))
# For AudioX-MAF and AudioX-MAF-MMDiT: encode video with synchformer
video_sync_frames = None
if "MAF" in model_name:
    video_sync_frames = encode_video_with_synchformer(
        video_path, model_name, seconds_start, seconds_total, device
    )
# Create conditioning
conditioning = [{
    "video_prompt": {"video_tensors": video_tensor.unsqueeze(0), "video_sync_frames": video_sync_frames},
    "text_prompt": text_prompt,
    "audio_prompt": audio_tensor.unsqueeze(0),
    "seconds_start": seconds_start,
    "seconds_total": seconds_total
}]
# Generate audio
output = generate_diffusion_cond(
    model,
    steps=250,
    cfg_scale=7,
    conditioning=conditioning,
    sample_size=sample_size,
    sigma_min=0.3,
    sigma_max=500,
    sampler_type="dpmpp-3m-sde",
    device=device
)
# Post-process and save audio
output = rearrange(output, "b d n -> d (b n)")
output = output.to(torch.float32).div(torch.max(torch.abs(output))).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
torchaudio.save("output.wav", output, sample_rate)
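The example above writes only `output.wav`; it imports `merge_video_audio` but does not call it, and its signature is not documented here. As an alternative, the sketch below attaches the generated audio to the source video by calling the `ffmpeg` command-line tool installed earlier. The helper names `build_mux_command` and `mux_audio_into_video`, the AAC audio codec, and the output filename are assumptions made for illustration, not part of AudioX.

```python
import shutil
import subprocess
from pathlib import Path


def build_mux_command(video_path, audio_path, output_path):
    """Build an ffmpeg command that pairs the video stream of one file
    with the audio stream of another (hypothetical helper)."""
    return [
        "ffmpeg", "-y",            # overwrite the output file if it exists
        "-i", str(video_path),     # input 0: source video
        "-i", str(audio_path),     # input 1: generated audio
        "-map", "0:v:0",           # take the first video stream from input 0
        "-map", "1:a:0",           # take the first audio stream from input 1
        "-c:v", "copy",            # copy video without re-encoding
        "-c:a", "aac",             # encode audio as AAC (assumed choice for .mp4)
        "-shortest",               # stop at the end of the shorter stream
        str(output_path),
    ]


def mux_audio_into_video(video_path, audio_path, output_path):
    """Run ffmpeg and return the output path; raises if ffmpeg is missing or fails."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH")
    subprocess.run(build_mux_command(video_path, audio_path, output_path), check=True)
    return Path(output_path)
```

For the files used in the example, a call might look like `mux_audio_into_video("example/V2M_sample-1.mp4", "output.wav", "output_with_music.mp4")`. Copying the video stream keeps the step fast and lossless for the picture, at the cost of requiring a container that accepts the original video codec.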
If you find AudioX useful in your research, please consider citing the following:
@article{tian2025audiox,
title={AudioX: Diffusion Transformer for Anything-to-Audio Generation},
author={Tian, Zeyue and Jin, Yizhu and Liu, Zhaoyang and Yuan, Ruibin and Tan, Xu and Chen, Qifeng and Xue, Wei and Guo, Yike},
journal={arXiv preprint arXiv:2503.10522},
year={2025}
}