FastSpeech2Conformer is a non-autoregressive text-to-speech (TTS) model that combines the strengths of FastSpeech2 and the Conformer architecture to generate natural, high-quality speech from text quickly and efficiently. By integrating convolutions into the transformer blocks, the Conformer captures both local and global speech patterns, and the model directly incorporates variation information such as pitch, energy, and duration, improving the naturalness of the generated voice. Its non-autoregressive approach makes it markedly faster than classic sequential models while preserving high audio quality, which suits applications such as voice assistants, audiobooks, and other communication tools.
The FastSpeech2Conformer model was proposed in the paper Recent Developments on ESPnet Toolkit Boosted by Conformer by Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, and Yuekai Zhang. It was first released in this repository. The license used is Apache 2.0.
FastSpeech2 is a non-autoregressive TTS model, which means it can generate speech significantly faster than autoregressive models. It addresses some of the limitations of its predecessor, FastSpeech, by training the model directly on ground-truth targets instead of the simplified output of a teacher model, and by introducing more speech variation information (e.g., pitch, energy, and more accurate duration) as conditional inputs; these variance predictions can be inspected directly, as shown in the sketch after the first example below. Furthermore, the Conformer (convolutional transformer) architecture uses convolutions inside the transformer blocks to capture local speech patterns, while the self-attention layers capture longer-range relationships in the input.
You can run FastSpeech2Conformer locally with the 🤗 Transformers library. First, install the 🤗 Transformers library and g2p-en:
```bash
pip install --upgrade pip
pip install --upgrade transformers g2p-en
```
Then run inference via the Transformers modelling code, with the model and the HiFi-GAN vocoder separately:

```python
from transformers import FastSpeech2ConformerTokenizer, FastSpeech2ConformerModel, FastSpeech2ConformerHifiGan
import soundfile as sf

# Tokenize the input text into phoneme ids (g2p-en handles grapheme-to-phoneme conversion)
tokenizer = FastSpeech2ConformerTokenizer.from_pretrained("espnet/fastspeech2_conformer")
inputs = tokenizer("Hello, my dog is cute.", return_tensors="pt")
input_ids = inputs["input_ids"]

# Generate a mel-spectrogram from the phoneme ids
model = FastSpeech2ConformerModel.from_pretrained("espnet/fastspeech2_conformer")
output_dict = model(input_ids, return_dict=True)
spectrogram = output_dict["spectrogram"]

# Convert the spectrogram into a waveform with the HiFi-GAN vocoder
hifigan = FastSpeech2ConformerHifiGan.from_pretrained("espnet/fastspeech2_conformer_hifigan")
waveform = hifigan(spectrogram)

sf.write("speech.wav", waveform.squeeze().detach().numpy(), samplerate=22050)
```
Alternatively, run inference with the combined model and vocoder:

```python
from transformers import FastSpeech2ConformerTokenizer, FastSpeech2ConformerWithHifiGan
import soundfile as sf

tokenizer = FastSpeech2ConformerTokenizer.from_pretrained("espnet/fastspeech2_conformer")
inputs = tokenizer("Hello, my dog is cute.", return_tensors="pt")
input_ids = inputs["input_ids"]

# The combined model wraps the acoustic model and the HiFi-GAN vocoder,
# so it returns a waveform directly instead of a spectrogram
model = FastSpeech2ConformerWithHifiGan.from_pretrained("espnet/fastspeech2_conformer_with_hifigan")
output_dict = model(input_ids, return_dict=True)
waveform = output_dict["waveform"]

sf.write("speech.wav", waveform.squeeze().detach().numpy(), samplerate=22050)
```
Finally, you can run inference with a pipeline and specify which vocoder to use:

```python
from transformers import pipeline, FastSpeech2ConformerHifiGan
import soundfile as sf

# Pass the vocoder explicitly so the pipeline returns a waveform
vocoder = FastSpeech2ConformerHifiGan.from_pretrained("espnet/fastspeech2_conformer_hifigan")
synthesiser = pipeline(model="espnet/fastspeech2_conformer", vocoder=vocoder)

speech = synthesiser("Hello, my dog is cooler than you!")

sf.write("speech.wav", speech["audio"].squeeze(), samplerate=speech["sampling_rate"])
```
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.
Use the code above to get started with the model.
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
BibTeX:

```bibtex
@inproceedings{guo2021recent,
  title     = {Recent Developments on ESPnet Toolkit Boosted by Conformer},
  author    = {Guo, Pengcheng and Boyer, Florian and Chang, Xuankai and Hayashi, Tomoki and Higuchi, Yosuke and Inaguma, Hirofumi and Kamo, Naoyuki and Li, Chenda and Garcia-Romero, Daniel and Shi, Jiatong and Shi, Jing and Watanabe, Shinji and Wei, Kun and Zhang, Wangyou and Zhang, Yuekai},
  booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year      = {2021}
}
```

APA:

Guo, P., Boyer, F., Chang, X., Hayashi, T., Higuchi, Y., Inaguma, H., Kamo, N., Li, C., Garcia-Romero, D., Shi, J., Shi, J., Watanabe, S., Wei, K., Zhang, W., & Zhang, Y. (2021). Recent developments on ESPnet toolkit boosted by Conformer. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE.
Model card written by Connor Henderson (no ESPnet affiliation).