by facebook
W2v-BERT 2.0 is an audio encoder built on a Conformer architecture and designed for processing and understanding speech. Pre-trained on 4.5 million hours of unlabeled audio covering more than 143 languages, it extracts rich, multilingual representations of the speech signal. Although it requires fine-tuning for specific tasks such as automatic speech recognition (ASR) or audio classification, it can already be used to generate audio embeddings from its top layer. Its integration into the Seamless Communication models shows its versatility for multilingual communication applications. What sets it apart is its ability to capture varied linguistic nuances while remaining effective on low-resource languages.
We are open-sourcing our Conformer-based W2v-BERT 2.0 speech encoder as described in Section 3.2.1 of the paper, which is at the core of our Seamless models.
This model was pre-trained on 4.5M hours of unlabeled audio data covering more than 143 languages. It requires fine-tuning before it can be used for downstream tasks such as Automatic Speech Recognition (ASR) or Audio Classification.
| Model Name | #params | checkpoint |
|---|---|---|
| W2v-BERT 2.0 | 600M | checkpoint |
This model and its training are supported by 🤗 Transformers, more on it in the docs.
This is a bare checkpoint without any modeling head, and thus requires finetuning to be used for downstream tasks such as ASR. You can however use it to extract audio embeddings from the top layer with this code snippet:
```python
import torch
from datasets import load_dataset
from transformers import AutoFeatureExtractor, Wav2Vec2BertModel

# Small LibriSpeech demo split, sorted so that the first sample is deterministic
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate

# Feature extractor (waveform -> model input features) and bare encoder
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")
model = Wav2Vec2BertModel.from_pretrained("facebook/w2v-bert-2.0")

# audio file is decoded on the fly
inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.last_hidden_state holds the top-layer embeddings,
# with shape (batch, frames, hidden_size)
```
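The encoder returns one embedding per frame. If a single vector per utterance is needed, one common option (not something prescribed by this model card) is to average the frame embeddings while ignoring padded frames. The sketch below assumes that `outputs.last_hidden_state` has shape `(batch, frames, hidden_size)` and that `inputs["attention_mask"]` has one entry per frame. The helper name `mean_pool` is hypothetical, and synthetic tensors stand in for real model outputs so the block runs without downloading the checkpoint.

```python
import torch


def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average frame embeddings over time, ignoring padded frames.

    hidden_states: (batch, frames, hidden_size)
    attention_mask: (batch, frames), 1 for real frames and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # (batch, frames, 1)
    summed = (hidden_states * mask).sum(dim=1)                   # (batch, hidden_size)
    counts = mask.sum(dim=1).clamp(min=1.0)                      # avoid division by zero
    return summed / counts


# Synthetic stand-in for outputs.last_hidden_state: 2 utterances, 4 frames, 3 dims.
hidden = torch.arange(24, dtype=torch.float32).reshape(2, 4, 3)
# The second utterance has only 2 real frames; the last 2 are padding.
mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])

pooled = mean_pool(hidden, mask)
print(pooled.shape)  # torch.Size([2, 3])
```

With the real model, the call would be `mean_pool(outputs.last_hidden_state, inputs["attention_mask"])`, provided the feature extractor returned an attention mask of matching length.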
To learn more about using the model, refer to the 🤗 Transformers documentation for Wav2Vec2-BERT.
This model can be used in Seamless Communication, where it was released.
Here's how to run a forward pass through the speech encoder after completing the Seamless Communication installation steps:
```python
from pathlib import Path

import torch
from fairseq2.data import Collater
from fairseq2.data.audio import AudioDecoder, WaveformToFbankConverter
from fairseq2.memory import MemoryBlock
from fairseq2.nn.padding import get_seqs_and_padding_mask
from seamless_communication.models.conformer_shaw import load_conformer_shaw_model

# Placeholders: set the path to a WAV file, the target device, and the dtype
audio_wav_path, device, dtype = ...

# Decode the waveform, then convert it to 80-bin log-mel filterbank features
audio_decoder = AudioDecoder(dtype=torch.float32, device=device)
fbank_converter = WaveformToFbankConverter(
    num_mel_bins=80,
    waveform_scale=2**15,
    channel_last=True,
    standardize=True,
    device=device,
    dtype=dtype,
)
collater = Collater(pad_value=1)

model = load_conformer_shaw_model("conformer_shaw", device=device, dtype=dtype)
model.eval()

with Path(audio_wav_path).open("rb") as fb:
    block = MemoryBlock(fb.read())

decoded_audio = audio_decoder(block)
src = collater(fbank_converter(decoded_audio))["fbank"]
seqs, padding_mask = get_seqs_and_padding_mask(src)

# Run the frontend and the encoder without tracking gradients
with torch.inference_mode():
    seqs, padding_mask = model.encoder_frontend(seqs, padding_mask)
    seqs, padding_mask = model.encoder(seqs, padding_mask)
```