by ai-forever
Open source · 102k downloads · 100 likes
SBERT Large NLU RU is a BERT-based model specialized for generating sentence embeddings in Russian. It converts text into dense numerical vectors that can be used for tasks such as semantic search, classification, and sentence-similarity comparison. Because the embeddings reflect the context in which words appear, the model suits Russian-language NLP projects where nuance and context matter. Embedding quality is best when the token embeddings are averaged (mean pooling).
The model is described in this article
For better quality, use mean token embeddings.
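To make the idea of mean token embeddings concrete, here is a toy sketch in plain Python. The vectors are hand-written stand-ins rather than real model output, and `masked_mean` is a hypothetical helper name used only for illustration. It shows why the average should skip padding positions: counting them changes the result.

```python
# Toy illustration only: hand-written 2-dimensional "token embeddings",
# not real model output. masked_mean is a hypothetical helper name.
def masked_mean(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vector)]
            count += 1
    # Guard against an all-padding input so we never divide by zero.
    return [t / max(count, 1) for t in total]

# Three real tokens followed by one padding position.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
mask = [1, 1, 1, 0]

pooled = masked_mean(tokens, mask)         # padding ignored
naive = masked_mean(tokens, [1, 1, 1, 1])  # padding counted as a token
print(pooled)
print(naive)
```

With a real model the padding positions generally do not hold zero vectors, so ignoring the attention mask distorts the average even more than in this toy case.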
You can use the model directly from the model repository to compute sentence embeddings:
from transformers import AutoTokenizer, AutoModel
import torch

# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return sum_embeddings / sum_mask

# Sentences we want sentence embeddings for
sentences = ['Привет! Как твои дела?',
             'А правда, что 42 твое любимое число?']

# Load AutoModel from the Hugging Face model repository
tokenizer = AutoTokenizer.from_pretrained("ai-forever/sbert_large_nlu_ru")
model = AutoModel.from_pretrained("ai-forever/sbert_large_nlu_ru")

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=24, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
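
Once sentence embeddings are available, one common way to compare two sentences is cosine similarity. The sketch below is a minimal plain-Python version under stated assumptions: the `cosine_similarity` helper and the short stand-in vectors are illustrative and not part of the model's repository. With the real model, the input could come from `sentence_embeddings.tolist()`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Treat a zero vector as having no similarity to anything.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Stand-in vectors (assumption); with the real model one could use
# embeddings = sentence_embeddings.tolist()
embeddings = [[1.0, 0.0, 1.0], [1.0, 0.0, 0.0]]

score = cosine_similarity(embeddings[0], embeddings[1])
print(f"cosine similarity: {score:.4f}")
```

Scores closer to 1 indicate vectors pointing in more similar directions. When working directly with tensors, `torch.nn.functional.cosine_similarity` computes the same quantity without converting to lists.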