by allegro
HerBERT is a language model based on the BERT architecture and trained specifically on Polish text. Its pretraining combines dynamic whole-word masking with a sentence structural objective. As an encoder model, it produces contextual representations rather than generating text, which makes it suited to tasks such as text classification, sentiment analysis, and extractive question answering in Polish. Typical use cases are natural language processing applications where a close grasp of Polish linguistic nuance matters. Its main distinguishing feature is pretraining on a large and diverse collection of Polish corpora.
HerBERT is a BERT-based Language Model trained on Polish corpora using Masked Language Modelling (MLM) and Sentence Structural Objective (SSO) with dynamic masking of whole words. For more details, please refer to: HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish.
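To make "dynamic masking of whole words" concrete, here is a minimal sketch in plain Python. It is not the authors' training code: the `<mask>` symbol, the `toy_split` helper (a stand-in for a real subword tokenizer), and the `mask_prob` value are all assumptions chosen for illustration, and the usual BERT-style random or unchanged replacements are left out for brevity. The point it shows is that a word is either masked in all of its subwords or in none, and that the mask is re-sampled every time the sentence is seen instead of being fixed once during preprocessing.

```python
import random

MASK_TOKEN = "<mask>"  # assumed mask symbol, for illustration only


def toy_split(word, size=3):
    """Hypothetical stand-in for a subword tokenizer: fixed-size chunks."""
    return [word[i:i + size] for i in range(0, len(word), size)]


def whole_word_mask(words, subword_split, mask_prob=0.15, rng=None):
    """Mask all subwords of each selected word; return (tokens, labels).

    labels holds the original subword at masked positions and None elsewhere.
    """
    if rng is None:
        rng = random.Random()
    tokens, labels = [], []
    for word in words:
        pieces = subword_split(word)
        if rng.random() < mask_prob:
            # The whole word is selected: hide every one of its subwords.
            tokens.extend([MASK_TOKEN] * len(pieces))
            labels.extend(pieces)
        else:
            tokens.extend(pieces)
            labels.extend([None] * len(pieces))
    return tokens, labels


words = "ślepy dziad prowadzony przez tłustego kundla".split()
rng = random.Random(0)
# "Dynamic": each pass over the same sentence draws a fresh mask.
for epoch in range(3):
    tokens, _ = whole_word_mask(words, toy_split, mask_prob=0.3, rng=rng)
    print(epoch, tokens)
```

Masking at the word level keeps the model from recovering a hidden subword simply by looking at the remaining pieces of the same word.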
Model training and experiments were conducted with transformers version 2.9.
HerBERT was trained on six different corpora available for the Polish language:
| Corpus | Tokens | Documents |
|---|---|---|
| CCNet Middle | 3243M | 7.9M |
| CCNet Head | 2641M | 7.0M |
| National Corpus of Polish | 1357M | 3.9M |
| Open Subtitles | 1056M | 1.1M |
| Wikipedia | 260M | 1.4M |
| Wolne Lektury | 41M | 5.5k |
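As a quick sanity check on the table, the short script below sums the token counts and prints each corpus's share of the total. The numbers are copied from the table (tokens in millions); the variable names are only illustrative.

```python
# Token counts in millions, copied from the table above.
corpora = {
    "CCNet Middle": 3243,
    "CCNet Head": 2641,
    "National Corpus of Polish": 1357,
    "Open Subtitles": 1056,
    "Wikipedia": 260,
    "Wolne Lektury": 41,
}

total_tokens = sum(corpora.values())
shares = {name: 100 * tokens / total_tokens for name, tokens in corpora.items()}

for name, tokens in corpora.items():
    print(f"{name:<28}{tokens:>6}M  {shares[name]:5.1f}%")
print(f"{'Total':<28}{total_tokens:>6}M")
```

The total comes to 8598M tokens, of which the two CCNet corpora together account for roughly two thirds.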
The training dataset was tokenized into subwords using character-level byte-pair encoding (CharBPETokenizer) with a vocabulary size of 50k tokens. The tokenizer itself was trained with the tokenizers library.
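For readers unfamiliar with character-level byte-pair encoding, the following toy sketch shows the core idea in plain Python. It is not the tokenizers library's implementation and not the actual HerBERT tokenizer: the function names, the `</w>` end-of-word suffix, and the tiny corpus are assumptions made for illustration. Each word starts as a sequence of characters, and the most frequent adjacent pair of symbols is merged repeatedly; a real 50k vocabulary corresponds to tens of thousands of such merges learned from the full training data.

```python
from collections import Counter

END = "</w>"  # assumed end-of-word suffix attached to the last character


def to_symbols(word):
    """Split a non-empty word into characters, marking the final one."""
    return tuple(word[:-1]) + (word[-1] + END,)


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from an iterable of words."""
    vocab = Counter(to_symbols(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            merged_vocab[merge_pair(symbols, best)] += freq
        vocab = merged_vocab
    return merges


def encode(word, merges):
    """Segment a word by applying the learned merges in order."""
    symbols = to_symbols(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = "drogi drogi droga droga drodze dom domu".split()
merges = learn_bpe(corpus, num_merges=6)
print(merges)
print(encode("drogi", merges))
print(encode("drogą", merges))  # unseen word falls back to smaller pieces
```

Because unseen words decompose into smaller learned pieces, and ultimately into single characters, the tokenizer can segment any word built from characters seen during training.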
We kindly encourage you to use the Fast version of the tokenizer, namely HerbertTokenizerFast.
Example code:
```python
from transformers import AutoTokenizer, AutoModel

# Load the pretrained tokenizer and encoder.
tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
model = AutoModel.from_pretrained("allegro/herbert-base-cased")

# Encode one sentence pair as a batch of size one and run it through the model.
output = model(
    **tokenizer.batch_encode_plus(
        [
            (
                "A potem szedł środkiem drogi w kurzawie, bo zamiatał nogami, ślepy dziad prowadzony przez tłustego kundla na sznurku.",
                "A potem leciał od lasu chłopak z butelką, ale ten ujrzawszy księdza przy drodze okrążył go z dala i biegł na przełaj pól do karczmy."
            )
        ],
        padding='longest',
        add_special_tokens=True,
        return_tensors='pt'
    )
)
```
CC BY 4.0
If you use this model, please cite the following paper:
```bibtex
@inproceedings{mroczkowski-etal-2021-herbert,
    title = "{H}er{BERT}: Efficiently Pretrained Transformer-based Language Model for {P}olish",
    author = "Mroczkowski, Robert and
      Rybak, Piotr and
      Wr{\'o}blewska, Alina and
      Gawlik, Ireneusz",
    booktitle = "Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing",
    month = apr,
    year = "2021",
    address = "Kyiv, Ukraine",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.bsnlp-1.1",
    pages = "1--10",
}
```
The model was trained by the Machine Learning Research Team at Allegro and the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences.
You can contact us at: [email protected]