AI/EXPLORER
OutilsCatégoriesSitesLLMsComparerQuiz IAAlternativesPremium
—Outils IA
—Sites & Blogs
—LLMs & Modèles
—Catégories
AI Explorer

Trouvez et comparez les meilleurs outils d'intelligence artificielle pour vos projets.

Fait avecen France

Explorer

  • ›Tous les outils
  • ›Sites & Blogs
  • ›LLMs & Modèles
  • ›Comparer
  • ›Chatbots
  • ›Images IA
  • ›Code & Dev

Entreprise

  • ›Premium
  • ›À propos
  • ›Contact
  • ›Blog

Légal

  • ›Mentions légales
  • ›Confidentialité
  • ›CGV

© 2026 AI Explorer·Tous droits réservés.

AccueilLLMsfastspeech2 conformer

fastspeech2 conformer

par espnet

Open source · 2k downloads · 7 likes

1.1
(7 avis)AudioAPI & Local
À propos

FastSpeech2Conformer est un modèle de synthèse vocale (text-to-speech) non autorégressif qui génère une parole naturelle et de haute qualité à partir de texte, tout en offrant une vitesse de traitement bien supérieure aux modèles traditionnels. Il combine l'efficacité de FastSpeech2 avec l'architecture Conformer, qui intègre des convolutions dans les blocs de transformers pour mieux capturer les motifs locaux et globaux de la parole. Ce modèle se distingue par sa capacité à intégrer directement des informations variées comme l'intonation, l'énergie et la durée, améliorant ainsi la naturalité de la voix générée. Il est particulièrement adapté aux applications nécessitant une synthèse vocale rapide et réaliste, comme les assistants vocaux, les livres audio ou les outils de communication. Son approche non autorégressive le rend plus efficace que les modèles séquentiels classiques, tout en conservant une qualité sonore élevée.

Documentation

FastSpeech2Conformer

FastSpeech2Conformer is a non-autoregressive text-to-speech (TTS) model that combines the strengths of FastSpeech2 and the conformer architecture to generate high-quality speech from text quickly and efficiently.

Model Description

The FastSpeech2Conformer model was proposed with the paper Recent Developments On Espnet Toolkit Boosted By Conformer by Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, and Yuekai Zhang. It was first released in this repository. The license used is Apache 2.0.

FastSpeech2 is a non-autoregressive TTS model, which means it can generate speech significantly faster than autoregressive models. It addresses some of the limitations of its predecessor, FastSpeech, by directly training the model with ground-truth targets instead of the simplified output from a teacher model. It also introduces more variation information of speech (e.g., pitch, energy, and more accurate duration) as conditional inputs. Furthermore, the conformer (convolutional transformer) architecture makes use of convolutions inside the transformer blocks to capture local speech patterns, while the attention layer is able to capture relationships in the input that are farther away.

  • Developed by: Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, and Yuekai Zhang.
  • Shared by: Connor Henderson
  • Model type: text-to-speech
  • Language(s) (NLP): [More Information Needed]
  • License: Apache 2.0
  • Finetuned from model [optional]: [More Information Needed]

Model Sources [optional]

  • Repository: ESPnet
  • Paper [optional]: Recent Developments On Espnet Toolkit Boosted By Conformer

🤗 Transformers Usage

You can run FastSpeech2Conformer locally with the 🤗 Transformers library.

  1. First install the 🤗 Transformers library, g2p-en:
CSS
pip install --upgrade pip
pip install --upgrade transformers g2p-en
  1. Run inference via the Transformers modelling code with the model and hifigan separately
Python

from transformers import FastSpeech2ConformerTokenizer, FastSpeech2ConformerModel, FastSpeech2ConformerHifiGan
import soundfile as sf

tokenizer = FastSpeech2ConformerTokenizer.from_pretrained("espnet/fastspeech2_conformer")
inputs = tokenizer("Hello, my dog is cute.", return_tensors="pt")
input_ids = inputs["input_ids"]

model = FastSpeech2ConformerModel.from_pretrained("espnet/fastspeech2_conformer")
output_dict = model(input_ids, return_dict=True)
spectrogram = output_dict["spectrogram"]

hifigan = FastSpeech2ConformerHifiGan.from_pretrained("espnet/fastspeech2_conformer_hifigan")
waveform = hifigan(spectrogram)

sf.write("speech.wav", waveform.squeeze().detach().numpy(), samplerate=22050)
  1. Run inference via the Transformers modelling code with the model and hifigan combined
Python
from transformers import FastSpeech2ConformerTokenizer, FastSpeech2ConformerWithHifiGan
import soundfile as sf

tokenizer = FastSpeech2ConformerTokenizer.from_pretrained("espnet/fastspeech2_conformer")
inputs = tokenizer("Hello, my dog is cute.", return_tensors="pt")
input_ids = inputs["input_ids"]

model = FastSpeech2ConformerWithHifiGan.from_pretrained("espnet/fastspeech2_conformer_with_hifigan")
output_dict = model(input_ids, return_dict=True)
waveform = output_dict["waveform"]

sf.write("speech.wav", waveform.squeeze().detach().numpy(), samplerate=22050)
  1. Run inference with a pipeline and specify which vocoder to use
Python
from transformers import pipeline, FastSpeech2ConformerHifiGan
import soundfile as sf

vocoder = FastSpeech2ConformerHifiGan.from_pretrained("espnet/fastspeech2_conformer_hifigan")
synthesiser = pipeline(model="espnet/fastspeech2_conformer", vocoder=vocoder)

speech = synthesiser("Hello, my dog is cooler than you!")

sf.write("speech.wav", speech["audio"].squeeze(), samplerate=speech["sampling_rate"])

Direct Use

[More Information Needed]

Downstream Use [optional]

[More Information Needed]

Out-of-Scope Use

[More Information Needed]

Bias, Risks, and Limitations

[More Information Needed]

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]

Training Details

Training Data

[More Information Needed]

Training Procedure

Preprocessing [optional]

[More Information Needed]

Training Hyperparameters

  • Training regime: [More Information Needed]

Speeds, Sizes, Times [optional]

[More Information Needed]

Evaluation

Testing Data, Factors & Metrics

Testing Data

[More Information Needed]

Factors

[More Information Needed]

Metrics

[More Information Needed]

Results

[More Information Needed]

Summary

Model Examination [optional]

[More Information Needed]

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: [More Information Needed]
  • Hours used: [More Information Needed]
  • Cloud Provider: [More Information Needed]
  • Compute Region: [More Information Needed]
  • Carbon Emitted: [More Information Needed]

Technical Specifications [optional]

Model Architecture and Objective

[More Information Needed]

Compute Infrastructure

[More Information Needed]

Hardware

[More Information Needed]

Software

[More Information Needed]

Citation [optional]

BibTeX:

[More Information Needed]

APA:

[More Information Needed]

Glossary [optional]

[More Information Needed]

More Information [optional]

[More Information Needed]

Model Card Authors [optional]

Connor Henderson (Disclaimer: no ESPnet affiliation)

Model Card Contact

[More Information Needed]

Liens & Ressources
Spécifications
CatégorieAudio
AccèsAPI & Local
LicenceOpen Source
TarificationOpen Source
Note
1.1

Essayer fastspeech2 conformer

Accédez directement au modèle