by susnato
The CLVP Dev model is a component of the Tortoise-TTS speech synthesis system. Its architecture is inspired by CLIP but uses two distinct encoders: one for text tokens and another for MEL tokens, which represent the spectral characteristics of the audio signal. Encoding both modalities lets the system measure how well a piece of generated speech matches its text, which is intended to yield more natural and expressive results. Typical use cases include creating voiceovers, generating dialogue for virtual characters, and producing audio content from text. Because the two encoders are compared against each other, the model helps keep the generated voice consistent with the wording and prosody implied by the text.
DISCLAIMER: I do not own any of the weights in this repository. All weights belong to James Betker, the author of the
paper "Better speech synthesis through scaling". I am storing the weights (temporarily) for the tortoise-tts integration
into Hugging Face. Please refer to this PR to learn more.
The CLVP model is an integral part of tortoise-tts, presented in the paper "Better speech synthesis through scaling" by James Betker.
CLVP uses an architecture similar to the CLIP text encoder, except it uses two of them: one for text
tokens and the other for MEL tokens.
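
The two-encoder idea can be illustrated with a minimal, self-contained sketch. This is not the CLVP implementation: the "encoders" below are random embedding tables with mean pooling standing in for the real transformer encoders, and every name, vocabulary size, and dimension (`EMBED_DIM`, `make_embedding_table`, `rank_candidates`, and so on) is hypothetical. It only shows the general CLIP-style pattern of embedding text tokens and MEL tokens separately, normalizing both, and comparing them with cosine similarity.

```python
import math
import random

EMBED_DIM = 8  # hypothetical size, chosen only for illustration


def make_embedding_table(vocab_size, dim, seed):
    """Random embedding table standing in for a trained encoder."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


def encode(token_ids, table):
    """Toy encoder: mean-pool the token embeddings, then L2-normalize."""
    dim = len(table[0])
    pooled = [sum(table[t][d] for t in token_ids) / len(token_ids) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]


def similarity(text_vec, speech_vec):
    """Cosine similarity of two already-normalized vectors."""
    return sum(a * b for a, b in zip(text_vec, speech_vec))


def rank_candidates(text_ids, candidates, text_table, speech_table):
    """Score each MEL-token candidate against the text; return best index and all scores."""
    text_vec = encode(text_ids, text_table)
    scores = [similarity(text_vec, encode(c, speech_table)) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Separate tables: one "encoder" per modality, as in the two-encoder design.
text_table = make_embedding_table(vocab_size=32, dim=EMBED_DIM, seed=0)
speech_table = make_embedding_table(vocab_size=64, dim=EMBED_DIM, seed=1)

text_ids = [3, 7, 12]                                         # made-up text token ids
candidates = [[5, 9, 40, 2], [11, 11, 63], [0, 1, 2, 3, 4]]   # made-up MEL token ids
best, scores = rank_candidates(text_ids, candidates, text_table, speech_table)
print("best candidate:", best)
print("scores:", scores)
```

With untrained random tables the scores carry no meaning; in a trained model the two encoders are learned jointly so that matching text and speech pairs score higher than mismatched ones.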