by Marvis-AI
Marvis TTS 250M v0.2 is a conversational text-to-speech model that generates speech from text in real time, with the natural flow that interactive exchanges need. It produces a continuous audio stream without choppy artifacts and stays compact (500 MB in quantized form), so it runs efficiently on mobile devices and consumer machines such as iPhones and Macs. The model supports multiple languages, including English, French, and German, and can clone a custom voice from an audio sample. It processes entire text sequences in context, which yields natural intonation and rhythm, and its multimodal design handles text and audio together. Suited to voice assistants, accessibility tools, and content creation, Marvis is available through libraries such as MLX and Transformers.
Marvis is a conversational speech model for real-time streaming text-to-speech synthesis. Built with efficiency and accessibility in mind, it addresses the growing need for high-quality, real-time voice synthesis that runs on consumer devices such as iPhones, iPads, and Apple Silicon Macs, among others.
Currently optimized for English, French, and German.
Real-time audio streaming:
pip install -U mlx-audio
mlx_audio.tts.generate --model Marvis-AI/marvis-tts-250m-v0.2 --stream \
--text "Marvis TTS is a new text-to-speech model that provides fast streaming on edge devices."
Voice cloning:
mlx_audio.tts.generate --model Marvis-AI/marvis-tts-250m-v0.2 --stream \
--text "Marvis TTS is a new text-to-speech model that provides fast streaming on edge devices." --ref_audio ./conversational_a.wav
You can pass any audio file to clone a voice from, or use one of the provided sample audio files.
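When the CLI needs to be driven from a script, the same command can be assembled in Python. This is a minimal sketch, assuming the `mlx_audio.tts.generate` entry point is on the PATH after `pip install -U mlx-audio`. The helper names `build_tts_command` and `run_tts` are hypothetical, and only the flags shown in the commands above are used.

```python
import shlex
import subprocess
from typing import List, Optional

MODEL_ID = "Marvis-AI/marvis-tts-250m-v0.2"


def build_tts_command(text: str, ref_audio: Optional[str] = None, stream: bool = True) -> List[str]:
    """Assemble the argument list for the mlx_audio.tts.generate CLI (hypothetical helper)."""
    cmd = ["mlx_audio.tts.generate", "--model", MODEL_ID]
    if stream:
        cmd.append("--stream")
    cmd += ["--text", text]
    if ref_audio is not None:
        # Voice cloning: point the CLI at a reference recording.
        cmd += ["--ref_audio", ref_audio]
    return cmd


def run_tts(text: str, ref_audio: Optional[str] = None) -> None:
    """Run the CLI and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_tts_command(text, ref_audio), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works without mlx-audio installed.
    print(shlex.join(build_tts_command("Hello from Marvis.", "./conversational_a.wav")))
```

Passing the arguments as a list avoids shell quoting problems when the text contains quotes or other special characters.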
Using Transformers:
from pathlib import Path

import soundfile as sf
from huggingface_hub import snapshot_download
from transformers import AutoProcessor, CsmForConditionalGeneration, infer_device

model_id = "Marvis-AI/marvis-tts-250m-v0.2-transformers"
device = infer_device()

# Load the processor and place the model on the detected device.
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# Download the bundled voice prompts (transcript plus reference audio).
prompts_path = Path(snapshot_download(model_id, allow_patterns=["prompts/*.txt", "prompts/*.wav"]))
voice = "conversational_a"
generation_text = "With its pristine jungles and small towns, this Hawaiian island retains the unmanicured charm of Old Polynesia with few modern intrusions."
prompt_text = (prompts_path / "prompts" / f"{voice}.txt").read_text()
prompt_audio, _ = sf.read(prompts_path / "prompts" / f"{voice}.wav")

# First turn: reference transcript and audio for the voice.
# Second turn: the text to synthesize in that voice.
context = [
    {"role": "0", "content": [{"type": "text", "text": prompt_text}, {"type": "audio", "path": prompt_audio}]},
    {"role": "0", "content": [{"type": "text", "text": generation_text}]},
]

# Move the inputs to the same device as the model.
inputs = processor.apply_chat_template(
    context,
    tokenize=True,
    return_dict=True,
).to(device)
# token_type_ids is not passed to generate(); drop it if the processor returned it.
inputs.pop("token_type_ids", None)

audio = model.generate(**inputs, output_audio=True)
# Save as 24 kHz, 16-bit PCM.
sf.write("marvis-example.wav", audio[0].cpu().float().numpy(), samplerate=24_000, subtype="PCM_16")
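Once `marvis-example.wav` has been written, its header can be checked against the values passed to `sf.write` (24 kHz, 16-bit PCM). The sketch below uses only the standard-library `wave` module and assumes the file is a plain PCM WAV. `describe_wav` is a hypothetical helper, and the demo writes half a second of silence to a temporary file so that it runs without the model.

```python
import os
import tempfile
import wave


def describe_wav(path: str) -> dict:
    """Read basic properties from a PCM WAV header (hypothetical helper)."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_s": frames / rate,
        }


if __name__ == "__main__":
    # Stand-in file: 12,000 silent 16-bit mono samples at 24 kHz (0.5 s).
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
    with wave.open(demo_path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(24_000)
        wav.writeframes(b"\x00\x00" * 12_000)
    print(describe_wav(demo_path))
```

Pointing `describe_wav` at `marvis-example.wav` should report a sample rate of 24000 and 16 bits per sample if the file was saved as in the example above.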
Marvis is built on the Sesame CSM-1B (Conversational Speech Model) architecture, a multimodal transformer that operates directly on Residual Vector Quantization (RVQ) tokens and uses Kyutai's Mimi codec. The architecture enables end-to-end training while keeping generation latency low, and it splits the work between two transformers:
Multimodal Backbone (250M parameters): Processes interleaved text and audio sequences to model the zeroth codebook level, providing semantic understanding and context.
Audio Decoder (60M parameters): A smaller, specialized transformer that models the remaining 31 codebook levels to reconstruct high-quality speech from the backbone's representations.
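The division of labor between the two transformers can be pictured as a per-frame loop. This is a toy illustration, not the Marvis implementation: `toy_backbone` and `toy_decoder` are hypothetical stand-ins that return pseudo-random codes, the codebook size of 2048 is an assumed value, and the sketch assumes the decoder fills the remaining levels one at a time. Only the codebook counts (1 + 31) come from the description above.

```python
import random
from typing import List

NUM_CODEBOOKS = 32    # 1 level from the backbone + 31 from the audio decoder
CODEBOOK_SIZE = 2048  # assumed number of entries per codebook (illustrative only)


def toy_backbone(history: List[List[int]], rng: random.Random) -> int:
    """Stand-in for the backbone: pick the zeroth-codebook token of the next frame."""
    return rng.randrange(CODEBOOK_SIZE)


def toy_decoder(partial_frame: List[int], rng: random.Random) -> int:
    """Stand-in for the audio decoder: pick the next codebook level given the levels so far."""
    return rng.randrange(CODEBOOK_SIZE)


def generate_frames(num_frames: int, seed: int = 0) -> List[List[int]]:
    """Produce num_frames frames, each holding one code per codebook level."""
    rng = random.Random(seed)
    frames: List[List[int]] = []
    for _ in range(num_frames):
        # The backbone sees all previous frames and emits level 0.
        frame = [toy_backbone(frames, rng)]
        # The decoder then fills levels 1..31 conditioned on the partial frame.
        while len(frame) < NUM_CODEBOOKS:
            frame.append(toy_decoder(frame, rng))
        frames.append(frame)
    return frames


if __name__ == "__main__":
    frames = generate_frames(3)
    print(len(frames), "frames with", len(frames[0]), "codes each")
```

In the real model each completed frame would be passed to the codec to reconstruct audio; that step is omitted here.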
Unlike models that require text chunking based on regex patterns, Marvis processes entire text sequences contextually, resulting in more natural speech flow and intonation.
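For contrast, the regex-based chunking used by some other pipelines can look like the sketch below. The pattern is an illustrative assumption and is not taken from any particular model. Each chunk would be synthesized on its own, so intonation cannot carry across chunk boundaries, and simple patterns break on abbreviations.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive chunking: split after '.', '!' or '?' when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


if __name__ == "__main__":
    print(split_sentences("Hello there. How are you? Fine!"))
    # → ['Hello there.', 'How are you?', 'Fine!']
    print(split_sentences("Dr. Smith arrived."))
    # → ['Dr.', 'Smith arrived.']  (the abbreviation is wrongly treated as a sentence end)
```

A model that takes the whole text as one sequence has no such boundaries to get wrong.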
If you use Marvis in your research or applications, please cite:
@misc{marvis-tts-2025,
title={Marvis-TTS: Efficient Real-time Voice Cloning with Streaming Speech Synthesis},
author={Prince Canuma and Lucas Newman},
year={2025}
}
Special thanks to Sesame and Kyutai for their groundbreaking open-source contributions that inspired our work, and to the broader open-source community for their unwavering support and collaboration.
Version: 0.2
Release Date: 20/10/2025
Creators: Prince Canuma & Lucas Newman