
MIDI LLM Llama 3.2 1B

by slseanwu

Open source · 4k downloads · 29 likes

Rating: 1.8 (29 reviews) · Audio · API & Local

About
About

The MIDI LLM Llama 3.2 1B model is a specialized adaptation of the Llama 3.2 model, designed to generate music in MIDI format from textual descriptions. It expands the original vocabulary to include dedicated tokens for notes, durations, and instruments, enabling precise and nuanced musical composition. Through training on diverse musical data, it combines language understanding with the generation of structured musical sequences. Its primary use cases include automatic score generation, musical accompaniment, or exploring various styles based on textual instructions. What sets it apart is its ability to merge the strengths of large language models with fine-grained musical expertise, offering an innovative approach to AI-generated music.

Documentation

MIDI-LLM

Built on Llama 3.2 (1B) with an extended vocabulary for MIDI tokens.

Research Paper

  • Shih-Lun Wu, Yoon Kim, and Cheng-Zhi Anna Huang.
    "MIDI-LLM: Adapting large language models for text-to-MIDI music generation."
    NeurIPS AI4Music Workshop, 2025.
    [Code] [Live Demo] [Paper] [Video]

Model Description

  • Base Model: meta-llama/Llama-3.2-1B
  • Model Size: 1.4B parameters
  • Extended Vocabulary: 183,286 tokens (128,256 for text + 55,030 for MIDI music)
  • Architecture: LlamaForCausalLM with extended embedding layer
  • Precision: BFloat16

Quick Start

Clone our GitHub code repo, run through the setup steps, and try:

Bash
git clone https://github.com/slSeanWU/MIDI-LLM
cd MIDI-LLM

python generate_transformers.py \
    --model slseanwu/MIDI-LLM_Llama-3.2-1B \
    --prompt "A cheerful rock song with bright electric guitars" \
    --n_outputs 4

The repo and inference scripts provide a more complete usage guide.

Model Details

Extended Vocabulary

The model extends Llama 3.2's vocabulary (128,256 tokens) with 55,030 MIDI tokens representing:

  • Onset times (when notes occur)
  • Durations (how long each note is held)
  • Instrument-pitch pairs (which note to play & on which instrument)

These tokens follow the vocabulary of Anticipatory Music Transformer (AMT) (Thickstun et al., TMLR 2024).
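The extension pattern can be sketched with a toy Hugging Face Llama config (illustrative sizes, not the released checkpoint): rows for the new tokens are appended to the embedding matrix and LM head via `resize_token_embeddings`, which is how a 128,256-token text vocabulary grows to also cover MIDI events.

```python
# Illustrative sketch of the vocabulary-extension pattern (toy sizes, NOT the
# released 1B checkpoint): append rows for new MIDI tokens to the embedding
# layer and LM head, mirroring 128,256 text + 55,030 MIDI tokens at full scale.
from transformers import LlamaConfig, LlamaForCausalLM

base_vocab = 128      # stands in for Llama 3.2's 128,256 text tokens
n_midi_tokens = 16    # stands in for the 55,030 MIDI tokens

config = LlamaConfig(
    vocab_size=base_vocab,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)
model = LlamaForCausalLM(config)
model.resize_token_embeddings(base_vocab + n_midi_tokens)

print(model.get_input_embeddings().weight.shape[0])  # 144
```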

Training Data

  • Datasets:
    • Continued Pretraining (CPT)
      • music-related text from MusicPile (~1.7B tokens)
      • standalone MIDIs from GigaMIDI (~1.4B tokens after filtering out SFT examples)
    • Supervised Finetuning (SFT)
      • LakhMIDI music paired with MidiCaps text descriptions (~5B tokens with AMT infilling augmentation)
  • Training objective: Causal language modeling
  • Training sequence length: 2,048
  • System prompt: You are a world-class composer. Please compose some music according to the following description: [your input text]
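Following the system prompt above, the full input string can be assembled as below (a minimal sketch; the repo's inference scripts may add special tokens or templating on top):

```python
# Minimal sketch: prepend the documented system prompt to a user description.
# The repo's scripts may apply additional special tokens or chat templating.
SYSTEM_PROMPT = (
    "You are a world-class composer. "
    "Please compose some music according to the following description: "
)

def build_prompt(description: str) -> str:
    """Combine the fixed system prompt with a free-text music description."""
    return SYSTEM_PROMPT + description

prompt = build_prompt("A cheerful rock song with bright electric guitars")
print(prompt)
```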

Inference Hyperparameters

Recommended settings for best results:

YAML
temperature: 1.0
top_p: 0.98
max_tokens: 2046
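Mapped onto the Hugging Face generation API, these settings correspond to something like the following sketch (assuming `max_tokens` means `max_new_tokens`; sampling must be enabled for `temperature`/`top_p` to take effect):

```python
# Sketch: the recommended settings expressed as a transformers GenerationConfig.
# Assumption: max_tokens maps to max_new_tokens; do_sample=True is required for
# temperature/top_p sampling to apply at all.
from transformers import GenerationConfig

gen_config = GenerationConfig(
    do_sample=True,
    temperature=1.0,
    top_p=0.98,
    max_new_tokens=2046,
)
print(gen_config.temperature, gen_config.top_p, gen_config.max_new_tokens)
```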

Evaluation

This model checkpoint was evaluated with FAD and CLAP metrics on 896 LakhMIDI examples, whose IDs can be found in our repo:

  • https://github.com/slSeanWU/MIDI-LLM/blob/main/assets/evaluation_set_lakh_ids.txt

| Model    | Params | Precision | FAD ↓ | CLAP ↑ |
|----------|--------|-----------|-------|--------|
| MIDI-LLM | 1.47B  | BF16      | 0.173 | 22.1   |
| MIDI-LLM | 1.47B  | FP8       | 0.216 | 21.8   |

Citation

If you find our model useful, please cite our research as:

BibTeX
@inproceedings{wu2025midillm,
  title={{MIDI-LLM}: Adapting large language models for text-to-{MIDI} music generation},
  author={Wu, Shih-Lun and Kim, Yoon and Huang, Cheng-Zhi Anna},
  booktitle={Proc. NeurIPS AI4Music Workshop},
  year={2025}
}

License

This model is based on Llama 3.2 and is subject to the Llama 3.2 Community License.

Capabilities & Tags

transformers · safetensors · llama · text-generation · music · midi · text-to-music · text-to-midi · text-to-audio · en

Specifications

  • Category: Audio
  • Access: API & Local
  • License: Open Source
  • Pricing: Open Source
  • Parameters: 1B
  • Rating: 1.8