

MOSS Audio Tokenizer

by OpenMOSS-Team

Open source · 118k downloads · 39 likes

About

MOSS Audio Tokenizer is a discrete encoder-decoder designed to turn raw audio into a compact, semantically rich representation while guaranteeing high-quality reconstruction. Thanks to an architecture built entirely from causal transformers (no CNNs), it efficiently compresses 24 kHz audio signals into a token sequence at a very low frame rate (12.5 Hz), with bitrates ranging from 0.125 kbps to 4 kbps. Trained on 3 million hours of diverse audio data, it covers all domains (speech, sound effects, music) and produces tokens that are both acoustically accurate and semantically meaningful, suitable for speech understanding and generation tasks. Its unified approach, optimized end to end and independent of any pretrained models, stands out for its simplicity, scalability, and ability to serve as a universal interface for future audio foundation models.

Documentation

MossAudioTokenizer

This is the code for MOSS-Audio-Tokenizer presented in "MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models".

MOSSAudioTokenizer is a unified discrete audio tokenizer based on the Cat (Causal Audio Tokenizer with Transformer) architecture. Scaling to 1.6 billion parameters, it functions as a unified discrete interface, delivering both lossless-quality reconstruction and high-level semantic alignment.

Key Features:

  • Extreme Compression & Variable Bitrate: It compresses 24kHz raw audio into a remarkably low frame rate of 12.5Hz. Utilizing a 32-layer Residual Vector Quantizer (RVQ), it supports high-fidelity reconstruction across a wide range of bitrates, from 0.125kbps to 4kbps.
  • Pure Transformer Architecture: The model features a "CNN-free" homogeneous architecture built entirely from Causal Transformer blocks. With 1.6B combined parameters (Encoder + Decoder), it ensures exceptional scalability and supports low-latency streaming inference.
  • Large-Scale General Audio Training: Trained on 3 million hours of diverse audio data, the model excels at encoding and reconstructing all audio domains, including speech, sound effects, and music.
  • Unified Semantic-Acoustic Representation: While achieving state-of-the-art reconstruction quality, Cat produces discrete tokens that are "semantic-rich," making them ideal for downstream tasks like speech understanding (ASR) and generation (TTS).
  • Fully Trained From Scratch: Cat does not rely on any pretrained encoders (such as HuBERT or Whisper) or distillation from teacher models. All representations are learned autonomously from raw data.
  • End-to-End Joint Optimization: All components—including the encoder, quantizer, decoder, discriminator, and a decoder-only LLM for semantic alignment—are optimized jointly in a single unified training pipeline.
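The quoted bitrate range follows directly from the frame rate and the number of RVQ layers in use. A minimal arithmetic sketch — note that the 10 bits per codebook (i.e. 1024-entry codebooks) is an inferred value, since the README states the bitrate range but not the codebook size:

```python
# Sanity-check of the quoted bitrate range: a 12.5 Hz frame rate with a
# 32-layer RVQ spans 0.125-4 kbps if each codebook contributes 10 bits
# per frame. The 10-bit figure is an assumption inferred from the range.

FRAME_RATE_HZ = 12.5
BITS_PER_CODEBOOK = 10  # log2(1024), inferred, not stated in the README

def bitrate_kbps(num_quantizers: int) -> float:
    """Bitrate when only the first `num_quantizers` RVQ layers are kept."""
    return FRAME_RATE_HZ * num_quantizers * BITS_PER_CODEBOOK / 1000

print(bitrate_kbps(1))   # 0.125 kbps (minimum, a single RVQ layer)
print(bitrate_kbps(8))   # 1.0 kbps
print(bitrate_kbps(32))  # 4.0 kbps (maximum, all 32 layers)
```

This also explains why dropping RVQ layers at inference time (as in the quickstart below) trades fidelity for bitrate in fixed 0.125 kbps steps.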

Summary: By combining a simple, scalable architecture with massive-scale data, the Cat architecture overcomes the bottlenecks of traditional audio tokenizers. It provides a robust, high-fidelity, and semantically grounded interface for the next generation of native audio foundation models.

This repository contains a lightweight remote-code implementation that mirrors the current 🤗 Transformers transformers.models.moss_audio_tokenizer module. It is intended to be uploaded to a Hugging Face Hub model repository and loaded with trust_remote_code=True when needed.



Architecture of MossAudioTokenizer


Usage

Quickstart

Python
import torch
from transformers import AutoModel
import torchaudio

repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()

wav, sr = torchaudio.load('demo/demo_gt.wav')
if sr != model.sampling_rate:
    wav = torchaudio.functional.resample(wav, sr, model.sampling_rate)
wav = wav.unsqueeze(0)
enc = model.encode(wav, return_dict=True)
print(f"enc.audio_codes.shape: {enc.audio_codes.shape}")
dec = model.decode(enc.audio_codes, return_dict=True)
print(f"dec.audio.shape: {dec.audio.shape}")
wav = dec.audio.squeeze(0)
torchaudio.save("demo/demo_rec.wav", wav, sample_rate=model.sampling_rate)

# Decode using only the first 8 layers of the RVQ
dec_rvq8 = model.decode(enc.audio_codes[:8], return_dict=True)
wav_rvq8 = dec_rvq8.audio.squeeze(0)
torchaudio.save("demo/demo_rec_rvq8.wav", wav_rvq8, sample_rate=model.sampling_rate)
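The token budget for a clip can be estimated from the sampling rate and the downsample rate alone. A back-of-the-envelope sketch, assuming the values quoted in this README (24 kHz input, 1920 samples per token frame, i.e. 12.5 Hz) and ignoring any padding the model may apply at the boundary:

```python
# Rough token-frame count for a clip of given duration, using the
# README's quoted values: 24 kHz audio, downsample_rate = 1920.
# Ignores any padding the model may apply to partial final windows.

SAMPLING_RATE = 24_000
DOWNSAMPLE_RATE = 1_920  # samples per token frame -> 12.5 Hz

def frames_for(duration_s: float) -> int:
    """Number of full token frames produced for `duration_s` seconds of audio."""
    samples = int(duration_s * SAMPLING_RATE)
    return samples // DOWNSAMPLE_RATE

print(frames_for(1.0))   # 12 full frames per second
print(frames_for(10.0))  # 125 frames for a 10 s clip
```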

Streaming

MossAudioTokenizerModel.encode and MossAudioTokenizerModel.decode support simple streaming via a chunk_duration argument.

  • chunk_duration is expressed in seconds.
  • It must be <= MossAudioTokenizerConfig.causal_transformer_context_duration.
  • chunk_duration * MossAudioTokenizerConfig.sampling_rate must be divisible by MossAudioTokenizerConfig.downsample_rate.
  • Streaming chunking only supports batch_size=1.
Python
import torch
from transformers import AutoModel

repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()
audio = torch.randn(1, 1, 3200)  # dummy waveform

# 0.08s @ 24kHz = 1920 samples, divisible by downsample_rate=1920
enc = model.encode(audio, return_dict=True, chunk_duration=0.08)
dec = model.decode(enc.audio_codes, return_dict=True, chunk_duration=0.08)
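The constraints above can be validated before calling `encode`. A minimal sketch, hardcoding the sampling rate and `downsample_rate=1920` from the example; the context-duration limit below is a placeholder — read the real value from the loaded model's `config.causal_transformer_context_duration`:

```python
# Pre-flight check for the streaming constraints on chunk_duration.
# SAMPLING_RATE and DOWNSAMPLE_RATE mirror the example above;
# CONTEXT_DURATION_S is a placeholder, not the model's actual limit.

SAMPLING_RATE = 24_000
DOWNSAMPLE_RATE = 1_920
CONTEXT_DURATION_S = 30.0  # placeholder; read from the model config

def validate_chunk_duration(chunk_duration: float) -> None:
    """Raise ValueError if chunk_duration violates the streaming constraints."""
    if chunk_duration > CONTEXT_DURATION_S:
        raise ValueError("chunk_duration exceeds the causal transformer context")
    samples = chunk_duration * SAMPLING_RATE
    if samples % DOWNSAMPLE_RATE != 0:
        raise ValueError(
            f"chunk_duration * sampling_rate ({samples:g} samples) "
            f"must be divisible by downsample_rate ({DOWNSAMPLE_RATE})"
        )

validate_chunk_duration(0.08)  # OK: 1920 samples, exactly one token frame
```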

Repository layout

  • configuration_moss_audio_tokenizer.py
  • modeling_moss_audio_tokenizer.py
  • __init__.py
  • config.json
  • model weights

Evaluation Metrics

The table below compares the reconstruction quality of open-source audio tokenizers with MossAudioTokenizer on speech and audio/music data.

  • Speech metrics are evaluated on LibriSpeech test-clean (English) and AISHELL-2 (Chinese), reported as EN/ZH.
  • Audio metrics are evaluated on the AudioSet evaluation subset, while music metrics are evaluated on MUSDB, reported as audio/music.
  • STFT-Dist. denotes the STFT distance.
  • Higher is better for speech metrics, while lower is better for audio/music metrics (Mel-Loss, STFT-Dist.).
  • Nq denotes the number of quantizers.
| Model | bps | Frame rate (Hz) | Nq | SIM ↑ (EN/ZH) | STOI ↑ (EN/ZH) | PESQ-NB ↑ (EN/ZH) | PESQ-WB ↑ (EN/ZH) | Mel-Loss ↓ (audio/music) | STFT-Dist. ↓ (audio/music) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| XCodec2.0 | 800 | 50 | 1 | 0.82 / 0.74 | 0.92 / 0.86 | 3.04 / 2.46 | 2.43 / 1.96 | -- / -- | -- / -- |
| MiMo Audio Tokenizer | 850 | 25 | 4 | 0.80 / 0.74 | 0.91 / 0.87 | 2.94 / 2.62 | 2.39 / 2.14 | 0.82 / 0.81 | 2.33 / 2.23 |
| Higgs Audio Tokenizer | 1000 | 25 | 4 | 0.77 / 0.68 | 0.83 / 0.82 | 3.03 / 2.61 | 2.48 / 2.14 | 0.83 / 0.80 | 2.20 / 2.05 |
| SpeechTokenizer | 1000 | 50 | 2 | 0.36 / 0.25 | 0.77 / 0.68 | 1.59 / 1.38 | 1.25 / 1.17 | -- / -- | -- / -- |
| XY-Tokenizer | 1000 | 12.5 | 8 | 0.85 / 0.79 | 0.92 / 0.87 | 3.10 / 2.63 | 2.50 / 2.12 | -- / -- | -- / -- |
| BigCodec | 1040 | 80 | 1 | 0.84 / 0.69 | 0.93 / 0.88 | 3.27 / 2.55 | 2.68 / 2.06 | -- / -- | -- / -- |
| Mimi | 1100 | 12.5 | 8 | 0.74 / 0.59 | 0.91 / 0.85 | 2.80 / 2.24 | 2.25 / 1.78 | 1.24 / 1.19 | 2.62 / 2.49 |
| MOSS Audio Tokenizer (Ours) | 750 | 12.5 | 6 | 0.82 / 0.75 | 0.93 / 0.89 | 3.14 / 2.73 | 2.60 / 2.22 | 0.86 / 0.85 | 2.21 / 2.10 |
| MOSS Audio Tokenizer (Ours) | 1000 | 12.5 | 8 | 0.88 / 0.81 | 0.94 / 0.91 | 3.38 / 2.96 | 2.87 / 2.43 | 0.82 / 0.80 | 2.16 / 2.04 |

| Model | bps | Frame rate (Hz) | Nq | SIM ↑ (EN/ZH) | STOI ↑ (EN/ZH) | PESQ-NB ↑ (EN/ZH) | PESQ-WB ↑ (EN/ZH) | Mel-Loss ↓ (audio/music) | STFT-Dist. ↓ (audio/music) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DAC | 1500 | 75 | 2 | 0.48 / 0.41 | 0.83 / 0.79 | 1.87 / 1.67 | 1.48 / 1.37 | -- / -- | -- / -- |
| Encodec | 1500 | 75 | 2 | 0.60 / 0.45 | 0.85 / 0.81 | 1.94 / 1.80 | 1.56 / 1.48 | 1.12 / 1.04 | 2.60 / 2.42 |
| Higgs Audio Tokenizer | 2000 | 25 | 8 | 0.90 / 0.83 | 0.85 / 0.85 | 3.59 / 3.22 | 3.11 / 2.73 | 0.74 / 0.70 | 2.07 / 1.92 |
| SpeechTokenizer | 2000 | 50 | 4 | 0.66 / 0.50 | 0.88 / 0.80 | 2.38 / 1.79 | 1.92 / 1.49 | -- / -- | -- / -- |
| Qwen3 TTS Tokenizer | 2200 | 12.5 | 16 | 0.95 / 0.88 | 0.96 / 0.93 | 3.66 / 3.10 | 3.19 / 2.62 | -- / -- | -- / -- |
| MiMo Audio Tokenizer | 2250 | 25 | 12 | 0.89 / 0.83 | 0.95 / 0.92 | 3.57 / 3.25 | 3.05 / 2.71 | 0.70 / 0.68 | 2.21 / 2.10 |
| Mimi | 2475 | 12.5 | 18 | 0.89 / 0.76 | 0.94 / 0.91 | 3.49 / 2.90 | 2.97 / 2.35 | 1.10 / 1.06 | 2.45 / 2.32 |
| MOSS Audio Tokenizer (Ours) | 1500 | 12.5 | 12 | 0.92 / 0.86 | 0.95 / 0.93 | 3.64 / 3.27 | 3.20 / 2.74 | 0.77 / 0.74 | 2.08 / 1.96 |
| MOSS Audio Tokenizer (Ours) | 2000 | 12.5 | 16 | 0.95 / 0.89 | 0.96 / 0.94 | 3.78 / 3.46 | 3.41 / 2.96 | 0.73 / 0.70 | 2.03 / 1.90 |

| Model | bps | Frame rate (Hz) | Nq | SIM ↑ (EN/ZH) | STOI ↑ (EN/ZH) | PESQ-NB ↑ (EN/ZH) | PESQ-WB ↑ (EN/ZH) | Mel-Loss ↓ (audio/music) | STFT-Dist. ↓ (audio/music) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DAC | 3000 | 75 | 4 | 0.74 / 0.67 | 0.90 / 0.88 | 2.76 / 2.47 | 2.31 / 2.07 | 0.86 / 0.83 | 2.23 / 2.10 |
| MiMo Audio Tokenizer | 3650 | 25 | 20 | 0.91 / 0.85 | 0.95 / 0.93 | 3.73 / 3.44 | 3.25 / 2.89 | 0.66 / 0.65 | 2.17 / 2.06 |
| SpeechTokenizer | 4000 | 50 | 8 | 0.85 / 0.69 | 0.92 / 0.85 | 3.05 / 2.20 | 2.60 / 1.87 | -- / -- | -- / -- |
| Mimi | 4400 | 12.5 | 32 | 0.94 / 0.83 | 0.96 / 0.94 | 3.80 / 3.31 | 3.43 / 2.78 | 1.02 / 0.98 | 2.34 / 2.21 |
| Encodec | 4500 | 75 | 6 | 0.86 / 0.75 | 0.92 / 0.91 | 2.91 / 2.63 | 2.46 / 2.15 | 0.91 / 0.84 | 2.33 / 2.17 |
| DAC | 6000 | 75 | 8 | 0.89 / 0.84 | 0.95 / 0.94 | 3.75 / 3.57 | 3.41 / 3.20 | 0.65 / 0.63 | 1.97 / 1.87 |
| MOSS Audio Tokenizer (Ours) | 3000 | 12.5 | 24 | 0.96 / 0.92 | 0.97 / 0.96 | 3.90 / 3.64 | 3.61 / 3.20 | 0.69 / 0.66 | 1.98 / 1.84 |
| MOSS Audio Tokenizer (Ours) | 4000 | 12.5 | 32 | 0.97 / 0.93 | 0.97 / 0.96 | 3.95 / 3.71 | 3.69 / 3.30 | 0.68 / 0.64 | 1.96 / 1.82 |

LibriSpeech Speech Metrics (MOSS Audio Tokenizer vs. Open-source Tokenizers)

The plots below compare our MOSS Audio Tokenizer model with other open-source speech tokenizers on the LibriSpeech dataset, evaluated with SIM, STOI, PESQ-NB, and PESQ-WB (higher is better). We control the bps of the same model by adjusting the number of RVQ codebooks used during inference.

[Plots: SIM, STOI, PESQ-NB, and PESQ-WB vs. bitrate on LibriSpeech]

Citation

If you use this code or these results in your paper, please cite our work as:

Tex