by DeepPavlov
Open source · 285k downloads · 126 likes
RuBERT base cased is a language model for Russian based on the BERT architecture. Trained on Russian Wikipedia and news data, it excels at understanding Russian text. Its key capabilities include semantic analysis, text classification, and masked word prediction, making it suitable for a wide range of natural language processing tasks, including chatbot pipelines. The model stands out for its precision in capturing the linguistic nuances of Russian, thanks in part to a vocabulary tailored to the language's subword structure. It is especially useful for businesses and researchers working with Russian-language content, delivering robust performance across a variety of applications.
RuBERT (Russian, cased, 12‑layer, 768‑hidden, 12‑heads, 180M parameters) was trained on the Russian part of Wikipedia and news data. We used this training data to build a vocabulary of Russian subtokens and took a multilingual version of BERT‑base as an initialization for RuBERT[1].
08.11.2021: uploaded model with MLM and NSP heads
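As a minimal sketch, the MLM head can be exercised for masked word prediction with the Hugging Face `transformers` library. This assumes the checkpoint is published under the id `DeepPavlov/rubert-base-cased` and that the weights can be downloaded at runtime:

```python
# Sketch: masked word prediction with RuBERT via the transformers fill-mask
# pipeline. Assumes the checkpoint id "DeepPavlov/rubert-base-cased" and
# network access to download the weights on first use.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="DeepPavlov/rubert-base-cased")

# Predict the masked word in a Russian sentence
# ("Moscow is the [MASK] of Russia.").
results = fill_mask("Москва - это [MASK] России.")
for r in results:
    print(f"{r['token_str']!r}: {r['score']:.3f}")
```

Each result is a dict with the predicted token, its score, and the filled-in sequence, so the top candidates can be inspected or filtered directly.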
[1]: Kuratov, Y., & Arkhipov, M. (2019). Adaptation of Deep Bidirectional Multilingual Transformers for Russian Language. arXiv preprint arXiv:1905.07213.