ACE-Step Captioner

Description

ACE-Step Captioner is the annotation model used by ACE-Step v1.5 for training data labeling. It is a professional-grade music captioning model that generates detailed, structured descriptions of audio content.

Performance

🏆 Accuracy surpasses Gemini Pro 2.5 in music description tasks

Key Features

🎼 Musical Style Analysis - Identifies genres, sub-genres, and stylistic influences
🎸 Instrument Recognition - Detects and describes 1000+ instrument types and combinations
🎭 Structure & Progression - Analyzes musical arrangement including intro, verse, chorus, bridge, climax, and outro
🔊 Timbre Description - Captures tonal qualities, textures, and sonic characteristics
📝 Rich Vocabulary - Supports 1000+ descriptive terms for comprehensive music annotation

Usage

The usage is the same as Qwen2.5 Omni-7B.

Prompt Format

Use the following prompt to caption audio:

Arduino

*Task* Describe this audio in detail
<audio>

Output Format

The model generates natural language descriptions covering multiple aspects of the music.

Example Output

CSS

A melancholic indie folk track featuring fingerpicked acoustic guitar 
as the primary instrument. The song opens with a sparse, contemplative 
intro before the vocals enter with a breathy, intimate delivery. 
The arrangement gradually builds through the verse, adding subtle 
string pads and a gentle kick drum. The chorus lifts with layered 
harmonies and a warmer, fuller texture. The bridge introduces a 
key change and emotional climax before returning to the stripped-down 
acoustic arrangement for the outro.

Descriptive Capabilities

Musical Styles (Examples)

Category	Styles
Electronic	Ambient, Techno, House, Drum & Bass, Synthwave, IDM, Downtempo
Rock	Alternative, Indie, Post-Rock, Progressive, Psychedelic, Grunge
Pop	Synth-pop, Electropop, Dream Pop, Art Pop, Indie Pop
Classical	Orchestral, Chamber, Minimalist, Neo-Classical, Cinematic
World	Latin, African, Middle Eastern, Asian Traditional, Celtic
Jazz	Fusion, Smooth, Bebop, Modal, Free Jazz
Hip-Hop	Trap, Boom Bap, Lo-fi, Instrumental, Cloud Rap

Instruments (1000+ Supported)

Category	Examples
Strings	Acoustic Guitar, Electric Guitar, Violin, Cello, Bass, Harp, Mandolin
Keys	Piano, Synthesizer, Organ, Rhodes, Wurlitzer, Mellotron
Percussion	Drums, Electronic Drums, Congas, Bongos, Timpani, Vibraphone
Wind	Saxophone, Trumpet, Flute, Clarinet, Oboe, French Horn
Electronic	Synth Bass, Pad, Lead, Arpeggiator, Sampler, 808, 303

Structure Analysis

Intro / Outro - Opening and closing sections
Verse / Pre-Chorus / Chorus - Main song structure
Bridge / Break - Transitional sections
Build-up / Drop / Climax - Dynamic progression
Interlude / Solo - Instrumental passages

Timbre Descriptions

Dimension	Descriptors
Texture	Warm, Bright, Dark, Crisp, Muddy, Clean, Distorted, Saturated
Space	Reverberant, Dry, Spacious, Intimate, Cavernous, Tight
Dynamics	Punchy, Soft, Aggressive, Gentle, Compressed, Dynamic
Character	Ethereal, Gritty, Smooth, Raw, Polished, Organic, Synthetic

Use Cases

Music AI Training - Generate high-quality captions for music generation models
Music Information Retrieval - Create searchable metadata for audio databases
Content Moderation - Analyze and categorize music content
Music Education - Provide detailed analysis for learning purposes
Audio Production - Document and describe sound design elements

Tech Report

ACE-Step Captioner

Description

Performance

🏆 Accuracy surpasses Gemini Pro 2.5 in music description tasks

Key Features

🎼 Musical Style Analysis - Identifies genres, sub-genres, and stylistic influences
🎸 Instrument Recognition - Detects and describes 1000+ instrument types and combinations
🎭 Structure & Progression - Analyzes musical arrangement including intro, verse, chorus, bridge, climax, and outro
🔊 Timbre Description - Captures tonal qualities, textures, and sonic characteristics
📝 Rich Vocabulary - Supports 1000+ descriptive terms for comprehensive music annotation

Usage

The usage is the same as Qwen2.5 Omni-7B.

Prompt Format

Use the following prompt to caption audio:

Arduino

*Task* Describe this audio in detail
<audio>

Output Format

The model generates natural language descriptions covering multiple aspects of the music.

Example Output

CSS

A melancholic indie folk track featuring fingerpicked acoustic guitar 
as the primary instrument. The song opens with a sparse, contemplative 
intro before the vocals enter with a breathy, intimate delivery. 
The arrangement gradually builds through the verse, adding subtle 
string pads and a gentle kick drum. The chorus lifts with layered 
harmonies and a warmer, fuller texture. The bridge introduces a 
key change and emotional climax before returning to the stripped-down 
acoustic arrangement for the outro.

Descriptive Capabilities

Musical Styles (Examples)

Category	Styles
Electronic	Ambient, Techno, House, Drum & Bass, Synthwave, IDM, Downtempo
Rock	Alternative, Indie, Post-Rock, Progressive, Psychedelic, Grunge
Pop	Synth-pop, Electropop, Dream Pop, Art Pop, Indie Pop
Classical	Orchestral, Chamber, Minimalist, Neo-Classical, Cinematic
World	Latin, African, Middle Eastern, Asian Traditional, Celtic
Jazz	Fusion, Smooth, Bebop, Modal, Free Jazz
Hip-Hop	Trap, Boom Bap, Lo-fi, Instrumental, Cloud Rap

Instruments (1000+ Supported)

Category	Examples
Strings	Acoustic Guitar, Electric Guitar, Violin, Cello, Bass, Harp, Mandolin
Keys	Piano, Synthesizer, Organ, Rhodes, Wurlitzer, Mellotron
Percussion	Drums, Electronic Drums, Congas, Bongos, Timpani, Vibraphone
Wind	Saxophone, Trumpet, Flute, Clarinet, Oboe, French Horn
Electronic	Synth Bass, Pad, Lead, Arpeggiator, Sampler, 808, 303

Structure Analysis

Intro / Outro - Opening and closing sections
Verse / Pre-Chorus / Chorus - Main song structure
Bridge / Break - Transitional sections
Build-up / Drop / Climax - Dynamic progression
Interlude / Solo - Instrumental passages

Timbre Descriptions

Dimension	Descriptors
Texture	Warm, Bright, Dark, Crisp, Muddy, Clean, Distorted, Saturated
Space	Reverberant, Dry, Spacious, Intimate, Cavernous, Tight
Dynamics	Punchy, Soft, Aggressive, Gentle, Compressed, Dynamic
Character	Ethereal, Gritty, Smooth, Raw, Polished, Organic, Synthetic

Use Cases

Music AI Training - Generate high-quality captions for music generation models
Music Information Retrieval - Create searchable metadata for audio databases
Content Moderation - Analyze and categorize music content
Music Education - Provide detailed analysis for learning purposes
Audio Production - Document and describe sound design elements

acestep captioner

ACE-Step Captioner

Description

Performance

Key Features

Usage

Prompt Format

Output Format

Example Output

Descriptive Capabilities

Musical Styles (Examples)

Instruments (1000+ Supported)

Structure Analysis

Timbre Descriptions

Use Cases

acestep captioner

ACE-Step Captioner

Description

Performance

Key Features

Usage

Prompt Format

Output Format

Example Output

Descriptive Capabilities

Musical Styles (Examples)

Instruments (1000+ Supported)

Structure Analysis

Timbre Descriptions

Use Cases