The Riffusion v1 model is an innovative real-time music generation tool capable of transforming textual descriptions into visual spectrograms and then into audio clips. It is based on a fine-tuned version of Stable Diffusion, specialized in interpreting musical prompts to create soundscapes or melodies tailored to specific moods or styles. Ideal for artists, content creators, or music enthusiasts, it allows for quick exploration of sound ideas without requiring technical composition skills. What sets it apart is its ability to generate coherent and aesthetically pleasing results from simple text instructions while offering flexibility for creative experimentation. Accessible via a web application or dedicated tools, it opens up possibilities for music education, audio production, or simply the joy of creation.
Riffusion is an app for real-time music generation with stable diffusion.
Read about it at https://www.riffusion.com/about and try it at https://www.riffusion.com/.
This repository contains the model files.
Riffusion is a latent text-to-image diffusion model capable of generating spectrogram images given any text input. These spectrograms can be converted into audio clips.
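The spectrogram-to-audio step can be done with phase reconstruction. Below is a minimal NumPy sketch of the classic Griffin-Lim algorithm: it starts from a random phase and alternately enforces the known magnitudes and a consistent phase estimate. The FFT size and hop length here are illustrative assumptions, not Riffusion's actual pipeline settings (which also involve mel scaling and image decoding).

```python
import numpy as np

N_FFT, HOP = 512, 128  # assumed analysis parameters, not Riffusion's actual settings

def stft(x):
    """Short-time Fourier transform with a Hann window."""
    win = np.hanning(N_FFT)
    frames = np.array([x[i:i + N_FFT] * win
                       for i in range(0, len(x) - N_FFT + 1, HOP)])
    return np.fft.rfft(frames, axis=1).T  # shape: (N_FFT//2 + 1, n_frames)

def istft(spec):
    """Inverse STFT via weighted overlap-add."""
    win = np.hanning(N_FFT)
    frames = np.fft.irfft(spec.T, n=N_FFT, axis=1)
    length = (spec.shape[1] - 1) * HOP + N_FFT
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * HOP:i * HOP + N_FFT] += frame * win
        norm[i * HOP:i * HOP + N_FFT] += win ** 2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(magnitude, n_iter=32, seed=0):
    """Estimate a waveform whose spectrogram magnitude matches `magnitude`
    by iteratively refining a random initial phase (Griffin-Lim)."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        audio = istft(magnitude * phase)          # project onto valid waveforms
        phase = np.exp(1j * np.angle(stft(audio)))  # keep phase, restore magnitudes
    return istft(magnitude * phase)
```

A magnitude spectrogram decoded from a generated image would be fed to `griffin_lim` to obtain a playable waveform; higher-quality systems often replace this step with a neural vocoder.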
The model was created by Seth Forsgren and Hayk Martiros as a hobby project.
You can use the Riffusion model directly, or try the Riffusion web app.
The Riffusion model was created by fine-tuning the Stable-Diffusion-v1-5 checkpoint. Read about Stable Diffusion in 🤗's Stable Diffusion blog.
The model is intended for research purposes only.
The original Stable Diffusion v1.5 was trained on the LAION-5B dataset using the CLIP text encoder, which provided an amazing starting point with an in-depth understanding of language, including musical concepts. The team at LAION also compiled a fantastic audio dataset from many general, speech, and music sources that we recommend at LAION-AI/audio-dataset.
Check out the diffusers training examples from Hugging Face. Fine-tuning requires a dataset of spectrogram images of short audio clips, each paired with text describing it. Note that the CLIP encoder can understand and connect many words even if they never appear in the dataset. It is also possible to use a DreamBooth method to get custom styles.
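Preparing such a dataset means rendering each audio clip as an image. The sketch below shows one plausible way to do that with NumPy: a log-magnitude spectrogram clipped to a fixed dynamic range and quantized to 8-bit greyscale. The FFT size, hop length, and 80 dB range are illustrative assumptions, not the exact parameters Riffusion was trained with.

```python
import numpy as np

def audio_to_spectrogram_image(audio, n_fft=512, hop=128, dyn_range_db=80.0):
    """Render a mono waveform as an 8-bit greyscale log-magnitude spectrogram,
    the kind of image a fine-tuning dataset would pair with a text caption."""
    win = np.hanning(n_fft)
    frames = np.array([audio[i:i + n_fft] * win
                       for i in range(0, len(audio) - n_fft + 1, hop)])
    mag = np.abs(np.fft.rfft(frames, axis=1)).T        # (freq_bins, time_frames)
    db = 20.0 * np.log10(mag + 1e-10)                  # convert to decibels
    db = np.clip(db, db.max() - dyn_range_db, None)    # keep a fixed dB window
    scale = max(np.ptp(db), 1e-10)                     # avoid division by zero
    img = np.round((db - db.min()) / scale * 255.0)    # quantize to 0..255
    return img[::-1].astype(np.uint8)                  # low frequencies at the bottom
```

Each resulting image, saved as a PNG alongside a caption describing the clip's genre, mood, or instrumentation, would form one training example for the diffusers fine-tuning scripts.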
If you build on this work, please cite it as follows:
@article{Forsgren_Martiros_2022,
author = {Forsgren, Seth* and Martiros, Hayk*},
title = {{Riffusion - Stable diffusion for real-time music generation}},
url = {https://riffusion.com/about},
year = {2022}
}