AI/EXPLORER
ToolsCategoriesSitesLLMsCompareAI QuizAlternativesPremium
—AI Tools
—Sites & Blogs
—LLMs & Models
—Categories
AI Explorer

Find and compare the best artificial intelligence tools for your projects.

Made within France

Explore

  • ›All tools
  • ›Sites & Blogs
  • ›LLMs & Models
  • ›Compare
  • ›Chatbots
  • ›AI Images
  • ›Code & Dev

Company

  • ›Premium
  • ›About
  • ›Contact
  • ›Blog

Legal

  • ›Legal notice
  • ›Privacy
  • ›Terms

© 2026 AI Explorer·All rights reserved.

HomeLLMsherbert base cased

herbert base cased

by allegro

Open source · 207k downloads · 22 likes

1.7
(22 reviews)EmbeddingAPI & Local
About

HerBERT is a language model based on the BERT architecture, specifically trained to understand and generate text in Polish. It employs advanced techniques such as dynamic whole-word masking and structural sentence objectives to enhance its performance. The model excels in tasks like text classification, sentiment analysis, and question answering in Polish. Its use cases include natural language processing for local applications where a deep grasp of linguistic nuances is essential. What sets it apart is its extensive training on diverse Polish corpora, optimized to deliver precise results in the language.

Documentation

HerBERT

HerBERT is a BERT-based Language Model trained on Polish corpora using Masked Language Modelling (MLM) and Sentence Structural Objective (SSO) with dynamic masking of whole words. For more details, please refer to: HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish.

Model training and experiments were conducted with transformers in version 2.9.

Corpus

HerBERT was trained on six different corpora available for Polish language:

CorpusTokensDocuments
CCNet Middle3243M7.9M
CCNet Head2641M7.0M
National Corpus of Polish1357M3.9M
Open Subtitles1056M1.1M
Wikipedia260M1.4M
Wolne Lektury41M5.5k

Tokenizer

The training dataset was tokenized into subwords using a character level byte-pair encoding (CharBPETokenizer) with a vocabulary size of 50k tokens. The tokenizer itself was trained with a tokenizers library.

We kindly encourage you to use the Fast version of the tokenizer, namely HerbertTokenizerFast.

Usage

Example code:

Python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
model = AutoModel.from_pretrained("allegro/herbert-base-cased")

output = model(
    **tokenizer.batch_encode_plus(
        [
            (
                "A potem szedł środkiem drogi w kurzawie, bo zamiatał nogami, ślepy dziad prowadzony przez tłustego kundla na sznurku.",
                "A potem leciał od lasu chłopak z butelką, ale ten ujrzawszy księdza przy drodze okrążył go z dala i biegł na przełaj pól do karczmy."
            )
        ],
    padding='longest',
    add_special_tokens=True,
    return_tensors='pt'
    )
)

License

CC BY 4.0

Citation

If you use this model, please cite the following paper:

INI
@inproceedings{mroczkowski-etal-2021-herbert,
    title = "{H}er{BERT}: Efficiently Pretrained Transformer-based Language Model for {P}olish",
    author = "Mroczkowski, Robert  and
      Rybak, Piotr  and
      Wr{\\'o}blewska, Alina  and
      Gawlik, Ireneusz",
    booktitle = "Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing",
    month = apr,
    year = "2021",
    address = "Kiyv, Ukraine",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.bsnlp-1.1",
    pages = "1--10",
}

Authors

The model was trained by Machine Learning Research Team at Allegro and Linguistic Engineering Group at Institute of Computer Science, Polish Academy of Sciences.

You can contact us at: [email protected]

Capabilities & Tags
transformerspytorchtfjaxbertfeature-extractionherbertplendpoints_compatible
Links & Resources
Specifications
CategoryEmbedding
AccessAPI & Local
LicenseOpen Source
PricingOpen Source
Rating
1.7

Try herbert base cased

Access the model directly