by DeepPavlov
The *rubert-base-cased-sentence* model is a Russian sentence encoder that produces contextualized vector representations of whole sentences. It is based on RuBERT, a pre-trained language model for Russian, and has been fine-tuned on the natural language inference benchmarks SNLI and XNLI, translated or adapted into Russian. Its key capabilities include semantic understanding, sentence comparison, and similarity classification, making it useful for tasks such as information retrieval, text clustering, and sentiment analysis. Its embeddings are optimized for fine-grained semantic analysis of Russian, making it a strong choice for NLP projects that require in-depth processing of Russian text.
Sentence RuBERT (Russian, cased, 12-layer, 768-hidden, 12-heads, 180M parameters) is a representation-based sentence encoder for Russian. It is initialized with RuBERT and fine-tuned on SNLI [1], Google-translated into Russian, and on the Russian part of the XNLI dev set [2]. Sentence representations are mean-pooled token embeddings, in the same manner as in Sentence-BERT [3].
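The mean pooling described above can be sketched in a few lines of PyTorch: padded positions are masked out and the remaining token embeddings are averaged. The commented loading snippet assumes the checkpoint is published under the Hugging Face id `DeepPavlov/rubert-base-cased-sentence` and consumed via `transformers`; treat those lines as an illustrative assumption, while the pooling function itself is runnable as shown on dummy tensors.

```python
import torch

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over non-padding positions (Sentence-BERT-style pooling)."""
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(dim=1)    # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)         # (batch, 1), avoid division by zero
    return summed / counts

# With the real checkpoint (assumed Hugging Face model id) the call would look like:
#   from transformers import AutoTokenizer, AutoModel
#   tok = AutoTokenizer.from_pretrained("DeepPavlov/rubert-base-cased-sentence")
#   model = AutoModel.from_pretrained("DeepPavlov/rubert-base-cased-sentence")
#   enc = tok(["Привет, мир!"], return_tensors="pt", padding=True)
#   emb = mean_pool(model(**enc).last_hidden_state, enc["attention_mask"])

# Dummy demonstration: 1 sentence, 3 token positions (last one is padding), hidden size 4.
emb = torch.tensor([[[1.0, 2.0, 3.0, 4.0],
                     [3.0, 4.0, 5.0, 6.0],
                     [9.0, 9.0, 9.0, 9.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = mean_pool(emb, mask)
print(pooled)  # padding row ignored: [[2., 3., 4., 5.]]
```

Because padding tokens are excluded from both the sum and the count, sentences of different lengths in one batch yield comparable fixed-size vectors.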
[1]: S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326
[2]: A. Conneau, R. Rinott, G. Lample, A. Williams, S. R. Bowman, H. Schwenk, and V. Stoyanov (2018) XNLI: Evaluating Cross-lingual Sentence Representations. arXiv preprint arXiv:1809.05053
[3]: N. Reimers and I. Gurevych (2019) Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv preprint arXiv:1908.10084