AI Explorer

© 2026 AI Explorer · All rights reserved.


W2v-BERT 2.0

by Facebook

Open source · 2M downloads · 213 likes

2.9 (213 reviews) · Embedding · API & Local
About

The W2v-BERT 2.0 model is an audio encoder based on a Conformer architecture, specifically designed to process and understand speech. Pre-trained on 4.5 million hours of unlabeled audio data spanning over 143 languages, it excels at extracting rich, multilingual representations from vocal signals. While it requires fine-tuning for specific tasks such as automatic speech recognition (ASR) or audio classification, it can already be used to generate high-quality audio embeddings from its top layer. Its integration into the Seamless Communication models highlights its versatility for multilingual communication applications. What sets it apart is its ability to capture diverse linguistic nuances while remaining effective for low-resource languages.

Documentation

W2v-BERT 2.0 speech encoder

We are open-sourcing our Conformer-based W2v-BERT 2.0 speech encoder as described in Section 3.2.1 of the paper, which is at the core of our Seamless models.

This model was pre-trained on 4.5M hours of unlabeled audio data covering more than 143 languages. It requires fine-tuning to be used for downstream tasks such as Automatic Speech Recognition (ASR) or Audio Classification.

Model Name   | #params | checkpoint
W2v-BERT 2.0 | 600M    | checkpoint
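As a rough sense of scale (a back-of-the-envelope estimate derived only from the 600M parameter count above, not a figure from the model card), the checkpoint size can be approximated from the bytes per parameter:

```python
# Approximate on-disk size of a 600M-parameter checkpoint.
# Assumption: dense parameters only, no optimizer state.
params = 600_000_000

size_fp32_gb = params * 4 / 1e9  # 4 bytes per float32 weight
size_fp16_gb = params * 2 / 1e9  # 2 bytes per float16 weight

print(f"fp32: ~{size_fp32_gb:.1f} GB")  # ~2.4 GB
print(f"fp16: ~{size_fp16_gb:.1f} GB")  # ~1.2 GB
```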

This model and its training are supported by 🤗 Transformers; see the docs for more.

🤗 Transformers usage

This is a bare checkpoint without any modeling head, so it requires fine-tuning before use on downstream tasks such as ASR. You can, however, use it to extract audio embeddings from the top layer with this code snippet:

Python
from transformers import AutoFeatureExtractor, Wav2Vec2BertModel
import torch
from datasets import load_dataset

dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")
model = Wav2Vec2BertModel.from_pretrained("facebook/w2v-bert-2.0")

# audio file is decoded on the fly
inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# frame-level embeddings are in outputs.last_hidden_state
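The forward pass above stops at the raw encoder output; the model card does not prescribe how to turn the per-frame features into a single utterance embedding. One common choice (an assumption here, not part of the card) is mean pooling over the time axis, sketched below in plain Python on stand-in frame vectors:

```python
def mean_pool(frames):
    """Average a list of equal-length frame vectors into one
    fixed-size utterance embedding."""
    if not frames:
        raise ValueError("no frames to pool")
    dim = len(frames[0])
    pooled = [0.0] * dim
    for frame in frames:
        for i, value in enumerate(frame):
            pooled[i] += value
    return [total / len(frames) for total in pooled]

# three stand-in 4-dimensional "frames" (real embeddings are much wider)
frames = [[1.0, 2.0, 3.0, 4.0],
          [3.0, 2.0, 1.0, 0.0],
          [2.0, 2.0, 2.0, 2.0]]
print(mean_pool(frames))  # [2.0, 2.0, 2.0, 2.0]
```

With the 🤗 Transformers snippet, the same pooling would apply to `outputs.last_hidden_state` along its time dimension (e.g. `torch.mean` over dim 1).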

To learn more about using the model, refer to the following resources:

  • its docs
  • a blog post showing how to fine-tune it on Mongolian ASR
  • a training script example

Seamless Communication usage

This model can be used in Seamless Communication, where it was released.

Here's how to make a forward pass through the voice encoder, after completing the installation steps:

Python
import torch

from fairseq2.data import Collater
from fairseq2.data.audio import AudioDecoder, WaveformToFbankConverter
from fairseq2.memory import MemoryBlock
from fairseq2.nn.padding import get_seqs_and_padding_mask
from pathlib import Path
from seamless_communication.models.conformer_shaw import load_conformer_shaw_model


audio_wav_path, device, dtype = ...
audio_decoder = AudioDecoder(dtype=torch.float32, device=device)
fbank_converter = WaveformToFbankConverter(
    num_mel_bins=80,
    waveform_scale=2**15,
    channel_last=True,
    standardize=True,
    device=device,
    dtype=dtype,
)
collater = Collater(pad_value=1)

model = load_conformer_shaw_model("conformer_shaw", device=device, dtype=dtype)
model.eval()

with Path(audio_wav_path).open("rb") as fb:
    block = MemoryBlock(fb.read())

decoded_audio = audio_decoder(block)
src = collater(fbank_converter(decoded_audio))["fbank"]
seqs, padding_mask = get_seqs_and_padding_mask(src)

with torch.inference_mode():
    seqs, padding_mask = model.encoder_frontend(seqs, padding_mask)
    seqs, padding_mask = model.encoder(seqs, padding_mask)

Capabilities & Tags
transformers · safetensors · wav2vec2-bert · feature-extraction · af · am · ar · as · az · be
Links & Resources

Specifications

Category: Embedding
Access: API & Local
License: Open Source
Pricing: Open Source
Rating: 2.9

Try W2v-BERT 2.0

Access the model directly