by cambridgeltl
This model, SapBERT UMLS 2020AB all lang from XLMR, is designed to identify and group biomedical concepts across languages by leveraging a unified knowledge base (UMLS). It recognizes entities such as diseases, medications, or procedures even when their wording varies by language or context. Its key capabilities are semantic normalization and disambiguation, which make it useful for analyzing multilingual medical texts in areas such as clinical research, patient record management, and large-scale information extraction. Because it was trained on multilingual data, it can process text in multiple languages without prior translation.
language: multilingual
[news] A cross-lingual extension of SapBERT will appear in the main conference of ACL 2021!
[news] SapBERT will appear in the conference proceedings of NAACL 2021!
SapBERT (Liu et al. 2020) trained with UMLS 2020AB, using xlm-roberta-base as the base model. Please use [CLS] as the representation of the input.
The following script converts a list of strings (entity names) into embeddings.
import numpy as np
import torch
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModel
# load the XLM-R-based checkpoint described on this card
tokenizer = AutoTokenizer.from_pretrained("cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR")
model = AutoModel.from_pretrained("cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR").cuda()
# replace with your own list of entity names
all_names = ["covid-19", "Coronavirus infection", "high fever", "Tumor of posterior wall of oropharynx"]
bs = 128 # batch size during inference
all_embs = []
for i in tqdm(np.arange(0, len(all_names), bs)):
    toks = tokenizer.batch_encode_plus(all_names[i:i+bs],
                                       padding="max_length",
                                       max_length=25,
                                       truncation=True,
                                       return_tensors="pt")
    toks_cuda = {}
    for k,v in toks.items():
        toks_cuda[k] = v.cuda()
    cls_rep = model(**toks_cuda)[0][:,0,:] # use CLS representation as the embedding
    all_embs.append(cls_rep.cpu().detach().numpy())
all_embs = np.concatenate(all_embs, axis=0)
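Once the embeddings are computed, one common way to use them for entity linking is a nearest-neighbour search against a dictionary of concept names. The sketch below assumes cosine similarity as the distance measure and uses a hypothetical helper, `nearest_names`, that is not part of this model card or the SapBERT codebase. The toy vectors stand in for real embeddings such as `all_embs`, and all rows are assumed to be non-zero.

```python
import numpy as np

def nearest_names(query_embs, dict_embs, dict_names, top_k=1):
    """For each query embedding, return the top_k dictionary names
    ranked by cosine similarity (hypothetical helper, illustration only)."""
    # L2-normalise each row so that a dot product equals cosine similarity
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = dict_embs / np.linalg.norm(dict_embs, axis=1, keepdims=True)
    sims = q @ d.T  # shape: (n_queries, n_dict)
    # column indices sorted by descending similarity, truncated to top_k
    top = np.argsort(-sims, axis=1)[:, :top_k]
    return [
        [(dict_names[j], float(sims[i, j])) for j in row]
        for i, row in enumerate(top)
    ]

# Toy stand-ins for real embeddings (2-dimensional for readability)
toy_dict_names = ["fever", "cough"]
toy_dict_embs = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_query_embs = np.array([[0.9, 0.1]])

toy_result = nearest_names(toy_query_embs, toy_dict_embs, toy_dict_names)
```

In practice, `dict_embs` would be the embeddings of the ontology's concept names and `query_embs` the embeddings of the mentions to be linked, both produced by the script above. For very large dictionaries, an approximate nearest-neighbour index may be preferable to the dense matrix product used here.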
For more details about training and evaluation, see the SapBERT GitHub repo.
@inproceedings{liu2021learning,
title={Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking},
author={Liu, Fangyu and Vuli{\'c}, Ivan and Korhonen, Anna and Collier, Nigel},
booktitle={Proceedings of ACL-IJCNLP 2021},
month = aug,
year={2021}
}