by chinedudave06
MusicGen Small Stereo ONNX is a text-to-music model that generates stereo audio from text descriptions. It is an ONNX export aimed at mobile devices and uses a key-value cache (KV-cache) to speed up autoregressive generation. Stereo output is produced from 8 codebooks (4 per audio channel). The model is intended for mobile applications such as DJNed, where it generates custom music from text prompts, and the ONNX format allows it to run locally on the device.
ONNX export of facebook/musicgen-stereo-small with KV-cache decoder for efficient on-device autoregressive generation.
| Property | Value |
|---|---|
| Base Model | facebook/musicgen-stereo-small |
| Precision | FP32 |
| Audio | Stereo (2 channels) |
| Codebooks | 8 (4 per channel) |
| Hidden Size | 1024 |
| Sample Rate | 32 kHz |
| Max Length | 1500 steps (~30s) |
| Total Size | ~3.7 GB |
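
The ~30 s figure in the Max Length row can be checked with a little arithmetic. The sketch below assumes one decoder step per EnCodec frame at 50 frames per second, which is the frame rate of the 32 kHz codec used by MusicGen; the frame rate is not stated in the table, so treat it as an assumption. The function names are illustrative only.

```python
# Assumption: one autoregressive step corresponds to one EnCodec frame,
# and the codec runs at 50 frames per second.
FRAME_RATE_HZ = 50
SAMPLE_RATE_HZ = 32_000  # "Sample Rate" row of the table
MAX_STEPS = 1500         # "Max Length" row of the table


def steps_to_seconds(steps: int, frame_rate_hz: int = FRAME_RATE_HZ) -> float:
    """Convert a number of decoder steps to audio duration in seconds."""
    return steps / frame_rate_hz


def steps_to_samples(
    steps: int,
    sample_rate_hz: int = SAMPLE_RATE_HZ,
    frame_rate_hz: int = FRAME_RATE_HZ,
) -> int:
    """Convert a number of decoder steps to audio samples per channel."""
    return steps * sample_rate_hz // frame_rate_hz


duration_s = steps_to_seconds(MAX_STEPS)
samples_per_channel = steps_to_samples(MAX_STEPS)
print(duration_s, samples_per_channel)
```

Under that assumption, each step accounts for 640 samples per channel, so a shorter clip can be requested by lowering the step count proportionally.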

| File | Description | Size |
|---|---|---|
| decoder_model.onnx | Step-0 decoder (no KV-cache) | 1.7 GB |
| decoder_with_past_model.onnx | Steps 1+ decoder (with KV-cache) | 1.5 GB |
| text_encoder.onnx | T5 text encoder | 419 MB |
| encodec_decode.onnx | EnCodec audio decoder | 113 MB |
| tokenizer.json | T5 tokenizer vocabulary | 2.4 MB |
| config.json | Model architecture config | <1 KB |
| generation_config.json | Generation parameters | <1 KB |

The stereo model uses 8 codebooks (4 per audio channel). During export, the EnCodec quantizer's decode method was monkeypatched to handle the codebook index mismatch: EnCodec has 4 physical quantizer layers, but the stereo model produces 8 codebook indices. The exported EnCodec ONNX file was then replaced with the mono version, which handles both mono and stereo decoding.
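
To make the 8-versus-4 mismatch concrete, here is a minimal sketch of splitting eight codebook streams into two groups of four, one per channel. It assumes the interleaved layout used by the Hugging Face transformers MusicGen implementation (left channel on even codebook indices, right channel on odd ones). Whether this export expects the same ordering is not stated here, and `split_stereo_codebooks` is a hypothetical helper, not part of the exported files.

```python
from typing import List, Tuple

Codes = List[List[int]]  # indexed as codes[codebook][time_step]


def split_stereo_codebooks(codes: Codes) -> Tuple[Codes, Codes]:
    """Split 8 interleaved codebook streams into (left, right), 4 each.

    Assumption: even indices belong to the left channel and odd indices
    to the right channel.
    """
    if len(codes) != 8:
        raise ValueError(f"expected 8 codebook streams, got {len(codes)}")
    left = codes[0::2]   # codebooks 0, 2, 4, 6
    right = codes[1::2]  # codebooks 1, 3, 5, 7
    return left, right


# Toy input: codebook k holds the values k*10, k*10 + 1, k*10 + 2.
toy_codes = [[k * 10 + t for t in range(3)] for k in range(8)]
left, right = split_stereo_codebooks(toy_codes)
print(len(left), len(right))
```

Each group of four then matches the four quantizer layers of a mono EnCodec decoder, which is one way a single mono decoder could serve both channels.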
These models are designed for the DJNed Android app using ONNX Runtime.
1. text_encoder.onnx encodes the text prompt.
2. decoder_model.onnx generates the first token and the initial KV-cache.
3. decoder_with_past_model.onnx generates subsequent tokens using the KV-cache.
4. encodec_decode.onnx converts 8 codebook streams (4 per channel) to stereo audio.

This model is derived from Meta's MusicGen under the CC-BY-NC-4.0 license.
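
The four-stage flow (text encoder, step-0 decoder, with-past decoder, EnCodec decoder) can be sketched as a plain control loop. This is only an outline of the order of calls: the four callables are hypothetical stand-ins for whatever wraps each ONNX Runtime session, and the actual input and output tensor names of the exported graphs are not given here, so none are assumed. Sampling strategy and any codebook delay handling are left out.

```python
from typing import Any, Callable, List, Tuple


def generate(
    prompt: str,
    encode_text: Callable[[str], Any],                    # wraps text_encoder.onnx
    decode_first: Callable[[Any], Tuple[Any, Any]],       # wraps decoder_model.onnx
    decode_next: Callable[[Any, Any], Tuple[Any, Any]],   # wraps decoder_with_past_model.onnx
    decode_audio: Callable[[List[Any]], Any],             # wraps encodec_decode.onnx
    max_steps: int = 1500,
) -> Any:
    """Run the four stages in order and return the decoded audio."""
    if max_steps < 1:
        raise ValueError("max_steps must be at least 1")

    text_states = encode_text(prompt)

    # Step 0: no cache exists yet, so the full decoder runs once and
    # returns the first token together with the initial KV-cache.
    token, cache = decode_first(text_states)
    tokens = [token]

    # Steps 1+: feed only the newest token plus the cache.
    # Assumption: the cache carries everything the with-past graph needs.
    for _ in range(max_steps - 1):
        token, cache = decode_next(token, cache)
        tokens.append(token)

    return decode_audio(tokens)


# Stub stages so the control flow can be exercised without model files.
demo = generate(
    "lo-fi beat",
    encode_text=lambda p: len(p),
    decode_first=lambda states: (0, [0]),
    decode_next=lambda tok, cache: (tok + 1, cache + [tok + 1]),
    decode_audio=lambda toks: toks,
    max_steps=5,
)
print(demo)
```

Splitting step 0 from the later steps mirrors the two decoder files: the first call builds the cache, and every later call reuses it instead of reprocessing the whole sequence.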