by CypressYang
SongBloom is a model for generating complete songs that combines two approaches: an autoregressive model that progressively sketches the musical structure, and a diffusion model that refines the details from coarse to fine. It pairs the fidelity of diffusion models with the efficiency of language models, and integrates semantic and acoustic context to guide the generation process. This framework produces coherent, high-quality tracks, outperforming existing methods on both subjective and objective evaluations. SongBloom stands out for its ability to generate entire songs, including melody, accompaniment, and lyrics, with quality comparable to leading commercial platforms, opening new possibilities in AI-assisted composition for musicians, producers, and content creators.
We propose SongBloom, a novel framework for full-length song generation that leverages an interleaved paradigm of autoregressive sketching and diffusion-based refinement. SongBloom employs an autoregressive diffusion model that combines the high fidelity of diffusion models with the scalability of language models. Specifically, it gradually extends a musical sketch from short to long and refines the details from coarse to fine-grained. The interleaved generation paradigm effectively integrates prior semantic and acoustic context to guide the generation process. Experimental results demonstrate that SongBloom outperforms existing methods across both subjective and objective metrics and achieves performance comparable to the state-of-the-art commercial music generation platforms.
https://cypress-yang.github.io/SongBloom_demo/
```bash
conda create -n SongBloom python=3.8.12
conda activate SongBloom
# yum install libsndfile
# pip install torch==2.2.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu118  # adjust for a different CUDA version
pip install -r requirements.txt
```
A `.jsonl` file, where each line is a JSON object:

```json
{
    "idx": "The index of each sample",
    "lyrics": "The lyrics to be generated",
    "prompt_wav": "The path of the style prompt audio"
}
```
An example can be found at `example/test.jsonl`.
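As a sketch of how such an input file could be assembled programmatically (the sample IDs, lyrics, and `prompt_wav` paths below are hypothetical placeholders, not files shipped with the repo):

```python
import json

# Hypothetical samples; the "prompt_wav" paths are placeholders.
samples = [
    {"idx": "sample_000",
     "lyrics": "la la la ...",
     "prompt_wav": "prompts/style_000.wav"},
    {"idx": "sample_001",
     "lyrics": "da da da ...",
     "prompt_wav": "prompts/style_001.wav"},
]

# Write one JSON object per line, as infer.py expects.
with open("my_test.jsonl", "w", encoding="utf-8") as f:
    for s in samples:
        f.write(json.dumps(s, ensure_ascii=False) + "\n")
```

The resulting file can then be passed to inference via `--input-jsonl my_test.jsonl`.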
The prompt wav should be a 10-second, 48kHz audio clip.
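Since the style prompt must be a 10-second, 48kHz clip, it can be worth validating prompt files before inference. A minimal sketch using only Python's standard-library `wave` module (it handles uncompressed WAV only; the function name and tolerance are assumptions, not part of the repo):

```python
import wave

def check_prompt_wav(path, expected_sr=48000, expected_secs=10.0, tol=0.1):
    """Return (ok, sample_rate, duration_seconds) for an uncompressed WAV file."""
    with wave.open(path, "rb") as w:
        sr = w.getframerate()
        duration = w.getnframes() / sr
    ok = (sr == expected_sr) and abs(duration - expected_secs) <= tol
    return ok, sr, duration
```

If the check fails, the clip can be resampled or trimmed with an external tool such as ffmpeg before being listed in the `.jsonl` file.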
The details about lyric format can be found in docs/lyric_format.md.
```bash
source set_env.sh
python3 infer.py --input-jsonl example/test.jsonl

# For GPUs with limited VRAM (e.g., an RTX 4090), set the dtype to bfloat16:
python3 infer.py --input-jsonl example/test.jsonl --dtype bfloat16
```

SongBloom also optionally supports flash-attn. To enable it, install flash-attn manually (v2.6.3 was used during training) and set `os.environ['DISABLE_FLASH_ATTN'] = "0"` at line 8 of `infer.py`.
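For reference, enabling flash-attn amounts to setting the environment variable quoted above before the model code is imported (a sketch based on the note about `infer.py`; it assumes flash-attn is already installed):

```python
import os

# "0" means flash attention is NOT disabled, i.e. flash-attn will be used.
# Requires a manual `pip install flash-attn` (v2.6.3 was used during training).
# Must run before any SongBloom modules are imported.
os.environ['DISABLE_FLASH_ATTN'] = "0"
```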
| Name | Size | Max Length | Prompt type | 🤗 |
|---|---|---|---|---|
| songbloom_full_150s | 2B | 2m30s | 10s wav | link |
| ... |
```bibtex
@article{yang2025songbloom,
  title={SongBloom: Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement},
  author={Yang, Chenyu and Wang, Shuai and Chen, Hangting and Tan, Wei and Yu, Jianwei and Li, Haizhou},
  journal={arXiv preprint arXiv:2506.07634},
  year={2025}
}
```