by Marvis-AI
Marvis TTS 250M v0.2 is a conversational text-to-speech model designed to generate speech in real time from text, with a natural fluency suited to interactive exchanges. Thanks to its optimized architecture, it produces a continuous audio stream without chunking artifacts, while remaining compact (500 MB in its quantized version) so it runs efficiently on mobile devices and consumer machines such as iPhones and Macs. The model supports several languages, including English, French, and German, and can even reproduce custom voices from audio samples, offering great flexibility for a wide range of applications. Its main strengths lie in its ability to process entire text sequences contextually, guaranteeing natural intonation and rhythm, and in its multimodal approach that handles text and audio simultaneously. Ideal for voice assistants, accessibility tools, and content creation, Marvis stands out for its light footprint and efficiency, while remaining accessible through libraries such as MLX and Transformers.
Marvis is a cutting-edge conversational speech model designed to enable real-time streaming text-to-speech synthesis. Built with efficiency and accessibility in mind, Marvis addresses the growing need for high-quality, real-time voice synthesis that can run on consumer devices such as Apple Silicon, iPhones, iPads, Macs and others.
Currently optimized for English, French, and German.
Real-time audio streaming:
pip install -U mlx-audio
mlx_audio.tts.generate --model Marvis-AI/marvis-tts-250m-v0.2 --stream \
--text "Marvis TTS is a new text-to-speech model that provides fast streaming on edge devices."
Voice cloning:
mlx_audio.tts.generate --model Marvis-AI/marvis-tts-250m-v0.2 --stream \
--text "Marvis TTS is a new text-to-speech model that provides fast streaming on edge devices." --ref_audio ./conversational_a.wav
You can pass any audio file to clone a voice from, or select a sample audio file from here.
from pathlib import Path

import soundfile as sf
from huggingface_hub import snapshot_download
from transformers import AutoProcessor, CsmForConditionalGeneration, infer_device

model_id = "Marvis-AI/marvis-tts-250m-v0.2-transformers"
device = infer_device()

processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# Download the bundled voice prompts (reference text + audio pairs).
# snapshot_download returns a plain string, so wrap it in Path for the joins below.
prompts_path = Path(snapshot_download(model_id, allow_patterns=["prompts/*.txt", "prompts/*.wav"]))

voice = "conversational_a"
generation_text = "With its pristine jungles and small towns, this Hawaiian island retains the unmanicured charm of Old Polynesia with few modern intrusions."

prompt_text = (prompts_path / f"prompts/{voice}.txt").read_text()
prompt_audio, _ = sf.read(prompts_path / f"prompts/{voice}.wav")

# The first turn supplies the reference voice (text + audio); the second turn,
# from the same speaker, is the text to synthesize in that voice.
context = [
    {"role": "0", "content": [{"type": "text", "text": prompt_text}, {"type": "audio", "path": prompt_audio}]},
    {"role": "0", "content": [{"type": "text", "text": generation_text}]},
]

inputs = processor.apply_chat_template(
    context,
    tokenize=True,
    return_dict=True,
).to(device)
inputs.pop("token_type_ids", None)

audio = model.generate(**inputs, output_audio=True)
sf.write("marvis-example.wav", audio[0].cpu().numpy(), samplerate=24_000, subtype="PCM_16")
Marvis is built on the Sesame CSM-1B (Conversational Speech Model) architecture, a multimodal transformer that operates directly on Residual Vector Quantization (RVQ) tokens and uses Kyutai's mimi codec. The architecture enables end-to-end training while maintaining low-latency generation and employs a dual-transformer approach:
Multimodal Backbone (250M parameters): Processes interleaved text and audio sequences to model the zeroth codebook level, providing semantic understanding and context.
Audio Decoder (60M parameters): A smaller, specialized transformer that models the remaining 31 codebook levels to reconstruct high-quality speech from the backbone's representations.
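The backbone/decoder split above can be illustrated with a toy sketch (plain NumPy; the shapes, codebook size, and random "predictions" are illustrative stand-ins, not the real learned transformers or Mimi's actual codec tables). Per frame, the backbone picks the level-0 code and produces a hidden state; the small decoder fills in the remaining 31 residual levels; RVQ reconstruction sums one entry from each level.

```python
import numpy as np

rng = np.random.default_rng(0)

N_LEVELS = 32         # RVQ codebook levels: 1 backbone + 31 decoder
CODEBOOK_SIZE = 2048  # entries per codebook (illustrative)
DIM = 16              # toy embedding dimension

# Toy "codec": one embedding table per RVQ level. A frame's representation
# is the sum of one selected entry from each level (residual quantization).
codebooks = rng.normal(size=(N_LEVELS, CODEBOOK_SIZE, DIM))

def backbone_step(prev_frames):
    """Stand-in for the 250M backbone: predict the level-0 (semantic) code
    for the next frame and return a hidden state for the audio decoder."""
    hidden = rng.normal(size=DIM)
    code0 = int(rng.integers(CODEBOOK_SIZE))
    return code0, hidden

def audio_decoder_step(code0, hidden):
    """Stand-in for the 60M audio decoder: predict the remaining 31
    residual codes conditioned on the backbone's output."""
    return [int(rng.integers(CODEBOOK_SIZE)) for _ in range(N_LEVELS - 1)]

def decode_frame(codes):
    """RVQ reconstruction: sum the selected entry from each level."""
    return sum(codebooks[level][c] for level, c in enumerate(codes))

# Generate three frames one at a time, mirroring streaming generation.
frames = []
for _ in range(3):
    code0, hidden = backbone_step(frames)
    codes = [code0] + audio_decoder_step(code0, hidden)
    frames.append(decode_frame(codes))

print(len(frames), frames[0].shape)  # → 3 (16,)
```

Because each frame is complete as soon as its 32 codes are chosen, frames can be emitted to the codec immediately, which is what enables low-latency streaming.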
Unlike models that require text chunking based on regex patterns, Marvis processes entire text sequences contextually, resulting in more natural speech flow and intonation.
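The pitfall of regex-based chunking can be seen in a few lines (the split pattern below is a hypothetical example of such a rule, not taken from any particular model): a naive sentence-boundary regex fractures abbreviations, handing the synthesizer fragments with no usable prosodic context.

```python
import re

text = "Dr. Smith arrived at 5 p.m. He was tired."

# Naive chunking on sentence-final punctuation followed by whitespace,
# as some TTS pipelines do (hypothetical rule, for illustration).
chunks = re.split(r"(?<=[.!?])\s+", text)

print(chunks)
# → ['Dr.', 'Smith arrived at 5 p.m.', 'He was tired.']
# "Dr." is split off as its own chunk, breaking the first sentence apart.
```

Processing the whole sequence at once, as Marvis does, sidesteps this class of error entirely and lets intonation span full sentences.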
If you use Marvis in your research or applications, please cite:
@misc{marvis-tts-2025,
  title={Marvis-TTS: Efficient Real-time Voice Cloning with Streaming Speech Synthesis},
  author={Prince Canuma and Lucas Newman},
  year={2025}
}
Special thanks to Sesame and Kyutai for their groundbreaking open-source contributions that inspired our work, and to the broader open-source community for their unwavering support and collaboration.
Version: 0.2
Release Date: 20/10/2025
Creators: Prince Canuma & Lucas Newman