by allegro
HerBERT is a language model based on the BERT architecture and trained specifically to model Polish text. Its pretraining uses dynamic whole-word masking together with a sentence structural objective. The model is suited to tasks such as text classification, sentiment analysis, and question answering in Polish, and to natural language processing applications where command of the nuances of the language is crucial. What sets it apart is its extensive training on varied Polish corpora.
HerBERT is a BERT-based Language Model trained on Polish corpora using Masked Language Modelling (MLM) and Sentence Structural Objective (SSO) with dynamic masking of whole words. For more details, please refer to: HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish.
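To illustrate what "dynamic masking of whole words" means, here is a minimal pure-Python sketch. It is not HerBERT's training code: the `mask_whole_words` helper, the `<mask>` placeholder, the 15% default masking rate, and the subword segmentation of the sample phrase are all assumptions made for illustration. The sketch also always substitutes the mask token, which simplifies the usual BERT replacement scheme.

```python
import random

MASK_TOKEN = "<mask>"  # assumed placeholder; the real mask token depends on the tokenizer


def mask_whole_words(words, mask_prob=0.15, rng=None):
    """Mask every subword of each randomly selected word.

    `words` is a list of words, each given as a list of subword tokens.
    Returns (masked_tokens, labels); labels hold the original token at
    masked positions and None everywhere else.
    """
    if rng is None:
        rng = random.Random()
    masked, labels = [], []
    for subwords in words:
        if rng.random() < mask_prob:
            # The whole word is selected: hide all of its subword pieces.
            masked.extend([MASK_TOKEN] * len(subwords))
            labels.extend(subwords)
        else:
            masked.extend(subwords)
            labels.extend([None] * len(subwords))
    return masked, labels


# Hypothetical subword segmentation of a short Polish phrase.
words = [["ślepy</w>"], ["dzi", "ad</w>"], ["na</w>"], ["sznur", "ku</w>"]]

# "Dynamic" masking: a fresh mask is drawn each time the example is seen,
# instead of fixing one mask per example during preprocessing.
rng = random.Random(0)
for epoch in range(3):
    tokens, _ = mask_whole_words(words, mask_prob=0.5, rng=rng)
    print(epoch, tokens)
```

Because selection happens per word rather than per subword, a word split into several pieces is either hidden entirely or left intact, so the model cannot recover a masked piece from its visible neighbours within the same word.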
Model training and experiments were conducted with version 2.9 of the transformers library.
HerBERT was trained on six corpora available for the Polish language:
| Corpus | Tokens | Documents |
|---|---|---|
| CCNet Middle | 3243M | 7.9M |
| CCNet Head | 2641M | 7.0M |
| National Corpus of Polish | 1357M | 3.9M |
| Open Subtitles | 1056M | 1.1M |
| Wikipedia | 260M | 1.4M |
| Wolne Lektury | 41M | 5.5k |
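As a quick check on the relative weight of each corpus, the following sketch sums the token counts from the table and prints each corpus's share. The figures are copied from the table; the `corpus_shares` helper is a hypothetical name introduced only for this example.

```python
# Token and document counts copied from the corpus table
# (tokens in millions, documents in millions; 5.5k documents = 0.0055M).
CORPORA = {
    "CCNet Middle": (3243, 7.9),
    "CCNet Head": (2641, 7.0),
    "National Corpus of Polish": (1357, 3.9),
    "Open Subtitles": (1056, 1.1),
    "Wikipedia": (260, 1.4),
    "Wolne Lektury": (41, 0.0055),
}


def corpus_shares(corpora):
    """Return the total token count (in millions) and each corpus's share of it."""
    total_tokens = sum(tokens for tokens, _ in corpora.values())
    shares = {name: tokens / total_tokens for name, (tokens, _) in corpora.items()}
    return total_tokens, shares


total_tokens, shares = corpus_shares(CORPORA)
print(f"Total: {total_tokens}M tokens")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

By these figures the six corpora add up to 8598M tokens, and the two CCNet corpora together contribute about 68% of them.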
The training dataset was tokenized into subwords using character-level byte-pair encoding (CharBPETokenizer) with a vocabulary size of 50k tokens. The tokenizer itself was trained with the tokenizers library.
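For readers unfamiliar with byte-pair encoding, the toy sketch below shows the core idea of learning merges over characters. It is a pure-Python illustration, not the tokenizers library implementation and not the actual HerBERT tokenizer training: the `learn_bpe_merges` helper, the `</w>` end-of-word marker, and the four-word corpus are assumptions chosen for the example.

```python
from collections import Counter


def learn_bpe_merges(corpus_words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words."""
    # Start from characters, with an end-of-word marker closing each word.
    vocab = Counter(tuple(word) + ("</w>",) for word in corpus_words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab


# Tiny illustrative corpus: three inflected forms sharing a stem, plus one other word.
words = ["droga", "drogi", "drogą", "dom"]
merges, vocab = learn_bpe_merges(words, num_merges=3)
print(merges)
print(list(vocab))
```

On this toy input the three merges assemble the shared stem `drog`, while the inflectional endings stay as separate symbols. A full-size tokenizer repeats the same loop until the vocabulary reaches its target size, which is 50k tokens in HerBERT's case.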
We kindly encourage you to use the Fast version of the tokenizer, namely HerbertTokenizerFast.
Example code:
```python
from transformers import AutoTokenizer, AutoModel

# Load the pretrained tokenizer and model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
model = AutoModel.from_pretrained("allegro/herbert-base-cased")

# Encode a single sentence pair as a batch of one and run it through the model.
output = model(
    **tokenizer.batch_encode_plus(
        [
            (
                "A potem szedł środkiem drogi w kurzawie, bo zamiatał nogami, ślepy dziad prowadzony przez tłustego kundla na sznurku.",
                "A potem leciał od lasu chłopak z butelką, ale ten ujrzawszy księdza przy drodze okrążył go z dala i biegł na przełaj pól do karczmy."
            )
        ],
        padding='longest',
        add_special_tokens=True,
        return_tensors='pt'
    )
)
```
License: CC BY 4.0
If you use this model, please cite the following paper:
```bibtex
@inproceedings{mroczkowski-etal-2021-herbert,
    title = "{H}er{BERT}: Efficiently Pretrained Transformer-based Language Model for {P}olish",
    author = "Mroczkowski, Robert and
      Rybak, Piotr and
      Wr{\'o}blewska, Alina and
      Gawlik, Ireneusz",
    booktitle = "Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing",
    month = apr,
    year = "2021",
    address = "Kyiv, Ukraine",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.bsnlp-1.1",
    pages = "1--10",
}
```
The model was trained by the Machine Learning Research Team at Allegro and the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences.
You can contact us at: [email protected]