by Marvis-AI
Marvis TTS 250M v0.1 is a real-time conversational text-to-speech model that streams natural-sounding audio from text directly on consumer devices such as iPhones, iPads, and Macs. It uses a multimodal architecture and processes entire text sequences without segmentation, which gives more natural prosody and intonation than chunk-based approaches, while staying compact (about 500 MB when quantized) for efficient local execution. It is currently optimized for English; support for additional languages such as German, Portuguese, French, and Mandarin is planned. Suited to voice assistants, accessibility tools, and content creation, it offers low latency and handles interleaved audio and text streams without artifacts. It can be deployed locally or in the cloud, for interactive or automated applications.
Marvis is a conversational speech model for real-time streaming text-to-speech synthesis. Built with efficiency and accessibility in mind, it addresses the growing need for high-quality, real-time voice synthesis that runs on consumer devices such as Apple Silicon iPhones, iPads, and Macs, among others.
Marvis is currently optimized for English and supports expressive speech synthesis. Additional languages such as German, Portuguese, French, and Mandarin are coming soon.
pip install -U mlx-audio
python -m mlx_audio.tts.generate --model Marvis-AI/marvis-tts-250m-v0.1 --stream \
--text "Marvis TTS is a new text-to-speech model that provides fast streaming on edge devices."
Without Voice Cloning
import torch
from transformers import AutoProcessor, CsmForConditionalGeneration
import soundfile as sf
model_id = "Marvis-AI/marvis-tts-250m-v0.1-transformers"
device = "cuda" if torch.cuda.is_available() else "cpu"
# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)
# prepare the inputs
text = "[0]Marvis TTS is a new text-to-speech model that provides fast streaming on edge devices." # `[0]` for speaker id 0
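inputs = processor(text, add_special_tokens=True, return_tensors="pt").to(device)
inputs.pop("token_type_ids", None)  # drop the key in place; chaining .pop() would replace `inputs` with the popped tensor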
# infer the model
audio = model.generate(**inputs, output_audio=True)
sf.write("example_without_context.wav", audio[0].cpu(), samplerate=24_000, subtype="PCM_16")
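The `[0]` speaker prefix and the 24 kHz sample rate in the example above are the two details most likely to be mistyped when adapting it. The following is a minimal sketch of two convenience helpers. `speaker_prompt` and `duration_seconds` are hypothetical names, not part of `transformers` or the Marvis release, and the sketch assumes the `[N]` prefix convention shown above carries over to other speaker ids.

```python
SAMPLE_RATE = 24_000  # sample rate passed to sf.write in the example above


def speaker_prompt(speaker_id: int, text: str) -> str:
    """Prefix text with a speaker tag such as "[0]" (hypothetical helper)."""
    if speaker_id < 0:
        raise ValueError("speaker_id must be non-negative")
    return f"[{speaker_id}]{text.strip()}"


def duration_seconds(num_samples: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Length of a mono clip in seconds, given its sample count."""
    return num_samples / sample_rate


if __name__ == "__main__":
    print(speaker_prompt(0, "Marvis TTS runs on edge devices."))
    print(duration_seconds(48_000))
```

With these helpers, the `text` variable above could be built as `speaker_prompt(0, "...")`, and `duration_seconds(audio[0].shape[-1])` would give the clip length, assuming the generated tensor's last dimension is the sample axis.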
Marvis is built on the Sesame CSM-1B (Conversational Speech Model) architecture, a multimodal transformer that operates directly on Residual Vector Quantization (RVQ) tokens and uses Kyutai's Mimi codec. The architecture enables end-to-end training while maintaining low-latency generation and employs a dual-transformer approach:
Multimodal Backbone (250M parameters): Processes interleaved text and audio sequences to model the zeroth codebook level, providing semantic understanding and context.
Audio Decoder (60M parameters): A smaller, specialized transformer that models the remaining 31 codebook levels to reconstruct high-quality speech from the backbone's representations.
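The division of labor between the two transformers can be sketched as a per-frame loop: the backbone picks the zeroth codebook entry, then the decoder fills in the remaining 31 levels one at a time. The code below is only a structural illustration under stated assumptions. `backbone_step` and `decoder_step` are hypothetical stand-ins that return random values rather than real model calls, and the codebook size of 2048 is assumed for illustration since it is not stated here.

```python
import random

NUM_CODEBOOKS = 32    # 1 backbone level + 31 decoder levels, per the description above
CODEBOOK_SIZE = 2048  # assumption for illustration; not stated in this document


def backbone_step(context: list, rng: random.Random) -> tuple:
    """Stand-in for the multimodal backbone: returns a level-0 code and a hidden state."""
    code0 = rng.randrange(CODEBOOK_SIZE)
    hidden = [rng.random() for _ in range(4)]  # placeholder hidden state
    return code0, hidden


def decoder_step(hidden: list, previous_codes: list, rng: random.Random) -> int:
    """Stand-in for the audio decoder: predicts the next codebook level."""
    return rng.randrange(CODEBOOK_SIZE)


def generate_frame(context: list, rng: random.Random) -> list:
    """Produce one audio frame as a list of NUM_CODEBOOKS RVQ codes."""
    code0, hidden = backbone_step(context, rng)
    codes = [code0]
    for _level in range(1, NUM_CODEBOOKS):
        # Each level is conditioned on the backbone state and the codes chosen so far.
        codes.append(decoder_step(hidden, codes, rng))
    return codes


def generate(num_frames: int, seed: int = 0) -> list:
    """Autoregressively generate frames; earlier frames form the context."""
    rng = random.Random(seed)
    frames = []
    for _ in range(num_frames):
        frames.append(generate_frame(frames, rng))
    return frames


if __name__ == "__main__":
    frames = generate(3)
    print(len(frames), len(frames[0]))
```

In a real system the resulting frames would be passed to the codec's decoder to reconstruct a waveform. That step is omitted here.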
Unlike models that require text chunking based on regex patterns, Marvis processes entire text sequences contextually, resulting in more natural speech flow and intonation.
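To make the contrast concrete, here is a minimal sketch of the kind of regex-based sentence chunking the paragraph above refers to. The pattern and the `chunk_by_sentence` name are illustrative assumptions, not taken from any specific TTS system.

```python
import re


def chunk_by_sentence(text: str) -> list:
    """Split after '.', '!' or '?' followed by whitespace (illustrative pattern only)."""
    return [part for part in re.split(r"(?<=[.!?])\s+", text.strip()) if part]


text = "Dr. Lee arrived at 5 p.m. She was late. Nobody minded!"
chunks = chunk_by_sentence(text)

if __name__ == "__main__":
    for chunk in chunks:
        print(repr(chunk))
```

The pattern splits the abbreviation "Dr." into its own chunk, and a pipeline that synthesizes each chunk separately has no shared context from which to carry intonation across chunk boundaries. Passing `text` to the model as a single sequence, as in the usage example earlier, avoids both problems.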
Pretraining:
Post-training:
Total Training Cost: ~$2,000
Local Deployment:
Cloud Deployment:
If you use Marvis in your research or applications, please cite:
@misc{marvis-tts-2025,
title={Marvis-TTS: Efficient Real-time Voice Cloning with Streaming Speech Synthesis},
author={Prince Canuma and Lucas Newman},
year={2025}
}
Special thanks to Sesame and Kyutai for their groundbreaking open-source contributions that inspired our work, and to the broader open-source community for their unwavering support and collaboration.
Version: 0.1
Release Date: 26/08/2025
Creators: Prince Canuma & Lucas Newman