

clap musicgen

by yuhuacheng

Open source · 562 downloads · 0 likes

0.0 (0 reviews) · Audio · API & Local

About

CLAP-MusicGen is an audio-text embedding model designed to improve music retrieval and classification. It combines the capabilities of CLAP (Contrastive Language-Audio Pretraining) with Meta's MusicGen, making it possible to extract latent representations from audio files or text descriptions. Thanks to this multimodal approach, it supports tasks such as music similarity search and zero-shot classification without requiring task-specific training data. The model stands out for its ability to perform two-way retrieval (audio-to-audio or text-to-audio) and to produce rich embeddings, even from synthetic data. Well suited to organizing, exploring, or recommending music tracks, it offers a flexible solution for applications that require a fine-grained understanding of audio and text content.

Documentation

(Below is from https://github.com/yuhuacheng/clap_musicgen)

👏🏻 CLAP-MusicGen 🎵

CLAP-MusicGen is a contrastive audio-text embedding model that combines the strengths of Contrastive Language-Audio Pretraining (CLAP) with Meta's MusicGen as the audio encoder. Users can generate latent embeddings for any given audio or text, enabling downstream tasks like music similarity search and audio classification.

Note that this is a proof-of-concept project and is not aimed at providing the highest quality embeddings but rather at demonstrating the idea, as it is my personal pet project.

Table of Contents

  • 👨‍🏫 Overview
  • 🏗️ Model Architecture
  • 📀 Training Data
  • 💻 Quick Start
  • 🎧 Similarity Search Demo
  • 🤿 Training / Evaluation Deep Dives
  • 🪪 License
  • 🖇️ Citation

👨‍🏫 Overview

CLAP-MusicGen is a multimodal model designed to enhance music retrieval capabilities. By embedding both audio and text into a shared space, it enables efficient music-to-music and text-to-music search. Unlike traditional models limited to predefined categories, CLAP-MusicGen supports zero-shot classification, retrieval, and embedding extraction, making it a valuable tool for exploring and organizing music collections.

Key Capabilities:

  • MusicGen-based Audio Encoding: Uses MusicGen to extract high-quality audio embeddings.
  • Two-way Retrieval: Supports searching for audio given an input audio or text.

🏗️ Model Architecture

CLAP-MusicGen consists of:

  1. Audio Encoder: Uses MusicGen’s decoder for feature extraction from audio tokenized by EnCodec.

  2. Text Encoder: A pretrained RoBERTa fine-tuned on music style/genre text with an MLM objective.

  3. Projection Head: A multi-layer perceptron (MLP) that projects both text and audio embeddings into the same space.

  4. Contrastive(ish) Learning: Trained with a listwise ranking loss instead of a traditional contrastive loss to optimize the alignment between text and audio embeddings, improving retrieval performance for tasks like music similarity search.
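The README does not spell out the listwise objective, so as a rough illustration only, here is one common listwise formulation (softmax cross-entropy over rows of a cosine-similarity matrix, where the matching caption for each audio clip sits on the diagonal). The function name and temperature value are assumptions, not the repository's actual loss.

```python
import torch
import torch.nn.functional as F

def listwise_ranking_loss(audio_emb, text_emb, temperature=0.07):
    """Listwise sketch: for each audio embedding, softmax over its
    similarities to every caption in the batch and maximize the
    probability of the matching caption (index i for row i)."""
    audio = F.normalize(audio_emb, dim=-1)
    text = F.normalize(text_emb, dim=-1)
    logits = audio @ text.T / temperature      # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))     # matching pairs on the diagonal
    return F.cross_entropy(logits, targets)

batch_audio = torch.rand(8, 1024)              # stand-in encoder outputs
batch_text = torch.rand(8, 1024)
loss = listwise_ranking_loss(batch_audio, batch_text)
print(loss.shape)  # torch.Size([]) — a scalar
```

Unlike a pairwise contrastive loss, each row's gradient here depends on the whole list of candidates at once, which matches the retrieval setting where an entire candidate set is ranked per query.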

📀 Training Data

The model is trained on the nyuuzyou/suno dataset from Hugging Face. This dataset includes approximately 10K curated audio-caption pairs, split into 80% training, 10% validation, and 10% evaluation. Captions are derived from the metadata.tags field, which provides descriptions of musical styles and genres. Note that one can also include the full prompt from metadata.prompt along with the style tags during training, to obtain even richer audio/text embeddings supervised by the full captions.

Note: since our CLAP model is trained on AI-generated music-caption pairs from Suno, it forms a synthetic data loop in which one AI learns from another AI’s outputs. This carries the potential biases of training on AI-generated data, and it opens up opportunities for further refinement by incorporating human-annotated music datasets.
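As an illustration of the 80/10/10 split described above (the exact split procedure and seed are assumptions, not taken from the repository), a seeded shuffle over the ~10K example indices:

```python
import random

n_pairs = 10_000                      # ~10K audio-caption pairs in nyuuzyou/suno
indices = list(range(n_pairs))
random.Random(42).shuffle(indices)    # seeded shuffle for reproducibility

n_train = int(0.8 * n_pairs)          # 80% training
n_val = int(0.1 * n_pairs)            # 10% validation
train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
eval_idx = indices[n_train + n_val:]  # remaining 10% evaluation

print(len(train_idx), len(val_idx), len(eval_idx))  # 8000 1000 1000
```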

💻 Quick Start

Installation

To install the necessary dependencies, run:

Bash
pip install torch torchvision torchaudio transformers

Loading the Model from 🤗 Hugging Face

First, clone the project repository (the code below imports from its src.modules package) and navigate to the project directory:

Bash
git clone https://github.com/yuhuacheng/clap_musicgen
cd clap_musicgen

Then load the model and tokenizer from Hugging Face:

Python
from src.modules.clap_model import CLAPModel
from transformers import RobertaTokenizer

model = CLAPModel.from_pretrained("yuhuacheng/clap-musicgen")
tokenizer = RobertaTokenizer.from_pretrained("yuhuacheng/clap-roberta-finetuned")

Extracting Embeddings

From Audio

Python
import torch 

with torch.no_grad():
  waveform = torch.rand(1, 1, 32000) # 1 sec waveform at 32kHz sample rate
  audio_embeddings = model.audio_encoder(ids=None, waveform=waveform)
  print(audio_embeddings.shape) # (1, 1024)

From Text

Python
sample_captions = [
    'positive jazzy lofi',
    'fast house edm',
    'gangsta rap',
    'dark metal'
]

with torch.no_grad():
    tokenized_captions = tokenizer(list(sample_captions), return_tensors="pt", padding=True, truncation=True)    
    text_embeddings = model.text_encoder(ids=None, **tokenized_captions)
    print(text_embeddings.shape) # (4, 1024)
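Since both encoders project into the same 1024-dimensional space, text-to-audio search reduces to a nearest-neighbour lookup over cosine similarities. A sketch with random stand-in tensors (in practice these would be the encoder outputs from the snippets above; the library size of 100 is arbitrary):

```python
import torch
import torch.nn.functional as F

# Stand-ins for real encoder outputs: a library of 100 audio embeddings
# and one text query embedding, both 1024-dim as in the snippets above.
audio_library = F.normalize(torch.rand(100, 1024), dim=-1)
text_query = F.normalize(torch.rand(1, 1024), dim=-1)

scores = text_query @ audio_library.T           # cosine similarities, shape (1, 100)
top_scores, top_idx = scores.topk(k=5, dim=-1)  # indices of the 5 closest tracks
print(top_idx.shape)  # torch.Size([1, 5])
```

Audio-to-audio search works identically, with an audio embedding in place of the text query.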

🎧 Similarity Search Demo

Please refer to the demo notebook, which demonstrates audio-to-audio as well as text-to-audio search.

(Result snapshots)

🎵 Audio-to-Audio Search

💬 Text-to-Audio Search

🤿 Training / Evaluation Deep Dives

(Coming soon)

🪪 License

  • The code in this repository is released under the MIT license as found in the LICENSE file.
  • Since the model was trained from the pretrained MusicGen weights, the model weights in this repository are released under the CC-BY-NC 4.0 license as found in the LICENSE_weights file.

🖇️ Citation

BibTeX
@inproceedings{copet2023simple,
    title={Simple and Controllable Music Generation},
    author={Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Défossez},
    booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
    year={2023},
}
BibTeX
@inproceedings{laionclap2023,
  title = {Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation},
  author = {Wu*, Yusong and Chen*, Ke and Zhang*, Tianyu and Hui*, Yuchen and Berg-Kirkpatrick, Taylor and Dubnov, Shlomo},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP},
  year = {2023}
}
BibTeX
@inproceedings{htsatke2022,
  author = {Ke Chen and Xingjian Du and Bilei Zhu and Zejun Ma and Taylor Berg-Kirkpatrick and Shlomo Dubnov},
  title = {HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP},
  year = {2022}
}
Links & Resources

Specifications

  • Category: Audio
  • Access: API & Local
  • License: Open Source
  • Pricing: Open Source
  • Rating: 0.0

Try clap musicgen

Access the model directly