

SapBERT UMLS 2020AB all lang from XLMR

by cambridgeltl

Open source · 181k downloads · 10 likes

Rating: 1.3 (10 reviews) · Embedding · API & Local
About

This model, named SapBERT UMLS 2020AB all lang from XLMR, is designed to identify and group medical or biomedical concepts across different languages by leveraging a unified knowledge base. It excels at recognizing entities such as diseases, medications, or procedures, even when formulations vary by language or context. Its key capabilities include semantic normalization and disambiguation, making it a valuable tool for analyzing multilingual medical texts. It is particularly useful in fields like clinical research, patient record management, or large-scale information extraction. What sets it apart is its ability to process multilingual data efficiently without requiring prior translation, thanks to its training on multilingual corpora.

Documentation

language: multilingual

tags:

  • biomedical
  • lexical-semantics
  • cross-lingual

datasets:

  • UMLS

[news] A cross-lingual extension of SapBERT will appear in the main conference of ACL 2021!
[news] SapBERT will appear in the conference proceedings of NAACL 2021!

SapBERT-XLMR

SapBERT (Liu et al. 2020) trained with UMLS 2020AB, using xlm-roberta-base as the base model. Please use [CLS] as the representation of the input.

Extracting embeddings from SapBERT

The following script converts a list of strings (entity names) into embeddings.

Python
import numpy as np
import torch
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModel

# load the checkpoint described on this page
tokenizer = AutoTokenizer.from_pretrained("cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR")
model = AutoModel.from_pretrained("cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR").cuda()
model.eval()

# replace with your own list of entity names
all_names = ["covid-19", "Coronavirus infection", "high fever", "Tumor of posterior wall of oropharynx"]

bs = 128  # batch size during inference
all_embs = []
for i in tqdm(np.arange(0, len(all_names), bs)):
    toks = tokenizer.batch_encode_plus(all_names[i:i+bs],
                                       padding="max_length",
                                       max_length=25,
                                       truncation=True,
                                       return_tensors="pt")
    toks_cuda = {k: v.cuda() for k, v in toks.items()}
    with torch.no_grad():
        cls_rep = model(**toks_cuda)[0][:, 0, :]  # use the [CLS] representation as the embedding
    all_embs.append(cls_rep.cpu().numpy())

all_embs = np.concatenate(all_embs, axis=0)

For more details about training and evaluation, see the SapBERT GitHub repo.
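Embeddings extracted this way are typically used for nearest-neighbour search, e.g. linking a query mention to its closest candidate concept name by cosine similarity. Below is a minimal, self-contained sketch of that step; the `nearest_neighbours` helper is illustrative (not part of any library), and random vectors stand in for real SapBERT embeddings produced by the script above.

```python
import numpy as np

def nearest_neighbours(query_embs, cand_embs, k=1):
    # L2-normalise both sides so the dot product equals cosine similarity
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    c = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    sims = q @ c.T                          # (n_queries, n_candidates)
    return np.argsort(-sims, axis=1)[:, :k] # indices of top-k candidates

# toy stand-ins for SapBERT embeddings (768-d, like xlm-roberta-base)
rng = np.random.default_rng(0)
cand_embs = rng.normal(size=(4, 768))
# a query that is a slightly perturbed copy of candidate 0
query_embs = cand_embs[:1] + 0.01 * rng.normal(size=(1, 768))

print(nearest_neighbours(query_embs, cand_embs))  # nearest candidate index: 0
```

In practice the candidate side would hold embeddings of all UMLS concept names and the query side the mentions to link; for large candidate sets an approximate index (e.g. FAISS) replaces the dense matrix product.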

Citation

Bibtex
@inproceedings{liu2021learning,
	title={Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking},
	author={Liu, Fangyu and Vuli{\'c}, Ivan and Korhonen, Anna and Collier, Nigel},
	booktitle={Proceedings of ACL-IJCNLP 2021},
	month = aug,
	year={2021}
}
Capabilities & Tags

transformers · pytorch · safetensors · xlm-roberta · feature-extraction · endpoints_compatible
Specifications

Category: Embedding
Access: API & Local
License: Open Source
Pricing: Open Source
Rating: 1.3
