by espnet
Open source · 3k downloads · 7 likes
FastSpeech2Conformer is a non-autoregressive text-to-speech model that generates natural, high-quality speech from text while offering significantly faster processing speeds than traditional models. It combines the efficiency of FastSpeech2 with the Conformer architecture, which integrates convolutions within transformer blocks to better capture both local and global speech patterns. The model stands out for its ability to directly incorporate varied information such as intonation, energy, and duration, thereby enhancing the naturalness of the generated voice. It is particularly well-suited for applications requiring fast and realistic speech synthesis, such as voice assistants, audiobooks, or communication tools. Its non-autoregressive approach makes it more efficient than classical sequential models while maintaining high sound quality.
FastSpeech2Conformer is a non-autoregressive text-to-speech (TTS) model that combines the strengths of FastSpeech2 and the conformer architecture to generate high-quality speech from text quickly and efficiently.
The FastSpeech2Conformer model was proposed with the paper Recent Developments On Espnet Toolkit Boosted By Conformer by Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, and Yuekai Zhang. It was first released in this repository. The license used is Apache 2.0.
FastSpeech2 is a non-autoregressive TTS model, which means it can generate speech significantly faster than autoregressive models. It addresses some of the limitations of its predecessor, FastSpeech, by directly training the model with ground-truth targets instead of the simplified output from a teacher model. It also introduces more variation information of speech (e.g., pitch, energy, and more accurate duration) as conditional inputs. Furthermore, the conformer (convolutional transformer) architecture makes use of convolutions inside the transformer blocks to capture local speech patterns, while the attention layer is able to capture relationships in the input that are farther away.
You can run FastSpeech2Conformer locally with the 🤗 Transformers library.
pip install --upgrade pip
pip install --upgrade transformers g2p-en
from transformers import FastSpeech2ConformerTokenizer, FastSpeech2ConformerModel, FastSpeech2ConformerHifiGan
import soundfile as sf
tokenizer = FastSpeech2ConformerTokenizer.from_pretrained("espnet/fastspeech2_conformer")
inputs = tokenizer("Hello, my dog is cute.", return_tensors="pt")
input_ids = inputs["input_ids"]
model = FastSpeech2ConformerModel.from_pretrained("espnet/fastspeech2_conformer")
output_dict = model(input_ids, return_dict=True)
spectrogram = output_dict["spectrogram"]
hifigan = FastSpeech2ConformerHifiGan.from_pretrained("espnet/fastspeech2_conformer_hifigan")
waveform = hifigan(spectrogram)
sf.write("speech.wav", waveform.squeeze().detach().numpy(), samplerate=22050)
from transformers import FastSpeech2ConformerTokenizer, FastSpeech2ConformerWithHifiGan
import soundfile as sf
tokenizer = FastSpeech2ConformerTokenizer.from_pretrained("espnet/fastspeech2_conformer")
inputs = tokenizer("Hello, my dog is cute.", return_tensors="pt")
input_ids = inputs["input_ids"]
model = FastSpeech2ConformerWithHifiGan.from_pretrained("espnet/fastspeech2_conformer_with_hifigan")
output_dict = model(input_ids, return_dict=True)
waveform = output_dict["waveform"]
sf.write("speech.wav", waveform.squeeze().detach().numpy(), samplerate=22050)
from transformers import pipeline, FastSpeech2ConformerHifiGan
import soundfile as sf
vocoder = FastSpeech2ConformerHifiGan.from_pretrained("espnet/fastspeech2_conformer_hifigan")
synthesiser = pipeline(model="espnet/fastspeech2_conformer", vocoder=vocoder)
speech = synthesiser("Hello, my dog is cooler than you!")
sf.write("speech.wav", speech["audio"].squeeze(), samplerate=speech["sampling_rate"])
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
Use the code below to get started with the model.
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
BibTeX:
[More Information Needed]
APA:
[More Information Needed]
[More Information Needed]
[More Information Needed]
Connor Henderson (Disclaimer: no ESPnet affiliation)
[More Information Needed]