by KandirResearch
Open source · 226 downloads · 9 likes
CiSiMi v0.1 is a text-to-speech (TTS) model designed to convert text into audio in an accessible way, even on modest hardware thanks to its CPU-optimized performance. It stands out for its ability to generate both text and audio responses, offering a lightweight alternative to traditional pipelines that combine ASR, LLM, and TTS. Primarily intended for English use, it targets developers or users looking to integrate a speech synthesis solution without relying on expensive hardware resources. While still in the prototype phase, it paves the way for smoother, more interactive applications, with plans to evolve toward multi-turn conversations and improved context handling. Its open-source approach and efficiency make it a promising tool for democratizing access to advanced voice technology.
CiSiMi is an early prototype of a text-to-audio model that can process text inputs and respond with both text and audio. Built for resource-constrained environments, it's designed to run efficiently on CPU using llama.cpp, making advanced speech synthesis accessible even without powerful GPUs.
"Being GPU poor and slightly disappointed with the csm release and my inability to run it, having to wait for time it takes me to run an ASR+LLM+TTS combo, I decided to ask Mom and Mom gave me CiSiMi At Home!"
This project demonstrates the power of open-source tools to create accessible speech technology. While still in its early stages, it represents a step toward democratizing advanced text-to-audio capabilities.
Dataset Preparation:
Audio Generation:
Model Training:
Explain to me how gravity works!
pip install outetts llama-cpp-python --upgrade
pip install huggingface_hub sounddevice
import torch
import outetts
import numpy as np
from huggingface_hub import hf_hub_download
from outetts.wav_tokenizer.audio_codec import AudioCodec
from outetts.version.v2.prompt_processor import PromptProcessor
from outetts.version.playback import ModelOutput
# Download the model
model_path = hf_hub_download(
repo_id="KandirResearch/CiSiMi-v0.1",
filename="unsloth.Q8_0.gguf",
)
# Configure the model
model_config = outetts.GGUFModelConfig_v2(
model_path=model_path,
tokenizer_path="KandirResearch/CiSiMi-v0.1",
)
# Initialize components
interface = outetts.InterfaceGGUF(model_version="0.3", cfg=model_config)
audio_codec = AudioCodec()
prompt_processor = PromptProcessor("KandirResearch/CiSiMi-v0.1")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
gguf_model = interface.get_model()
# Helper function to extract audio from tokens
def get_audio(tokens):
outputs = prompt_processor.extract_audio_from_tokens(tokens)
if not outputs:
return None
audio_tensor = audio_codec.decode(torch.tensor([[outputs]], dtype=torch.int64).to(device))
return ModelOutput(audio_tensor, audio_codec.sr)
# Helper function to clean text output
def extract_text_from_tts_output(tts_output):
text = ""
for line in tts_output.strip().split('\n'):
if '<|audio_end|>' in line or '<|im_end|>' in line:
continue
if '<|' in line:
word = line.split('<|')[0].strip()
if word:
text += word + " "
else:
text += line.strip() + " "
return text.strip()
# Generate response function
def generate_response(instruction):
prompt = f"<|im_start|>\nInstructions:\n{instruction}\n<|im_end|>\nAnswer:\n"
gen_cfg = outetts.GenerationConfig(
text=prompt,
temperature=0.6,
repetition_penalty=1.1,
max_length=4096,
speaker=None
)
input_ids = prompt_processor.tokenizer.encode(prompt)
tokens = gguf_model.generate(input_ids, gen_cfg)
output_text = prompt_processor.tokenizer.decode(tokens, skip_special_tokens=False)
if "<|audio_end|>" in output_text:
first_part, _, _ = output_text.partition("<|audio_end|>")
if "<|audio_end|>\n<|im_end|>\n" not in first_part:
first_part += "<|audio_end|>\n<|im_end|>\n"
extracted_text = extract_text_from_tts_output(first_part)
audio_start_pos = first_part.find("<|audio_start|>\n") + len("<|audio_start|>\n")
audio_end_pos = first_part.find("<|audio_end|>\n<|im_end|>\n") + len("<|audio_end|>\n<|im_end|>\n")
if audio_start_pos >= len("<|audio_start|>\n") and audio_end_pos > audio_start_pos:
audio_tokens_text = first_part[audio_start_pos:audio_end_pos]
audio_tokens = prompt_processor.tokenizer.encode(audio_tokens_text)
audio_output = get_audio(audio_tokens)
if audio_output is not None and hasattr(audio_output, 'audio') and audio_output.audio is not None:
audio_numpy = audio_output.audio.cpu().numpy()
if audio_numpy.ndim > 1:
audio_numpy = audio_numpy.squeeze()
return extracted_text, (audio_output.sr, audio_numpy)
return output_text, None
# Example usage
question = "What is the meaning of life?"
response_text, response_audio = generate_response(question)
print(response_text)
# Play audio if available
if response_audio is not None:
if "ipykernel" in sys.modules:
from IPython.display import display, Audio
display(Audio(response_audio[1], rate=response_audio[0], autoplay=True))
else:
import sounddevice as sd
sd.play(response_audio[1], samplerate=response_audio[0])
sd.wait()
This early prototype has several areas for improvement:
Potential Limitation: This type of model quickly fills up context window, making smaller models generally more practical for implementation.
This model builds on the following open-source projects: