SciNCL

SciNCL is a pre-trained BERT language model to generate document-level embeddings of research papers. It uses the citation graph neighborhood to generate samples for contrastive learning. Prior to the contrastive training, the model is initialized with weights from scibert-scivocab-uncased. The underlying citation embeddings are trained on the S2ORC citation graph.

Paper: Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings (EMNLP 2022 paper).

Code: https://github.com/malteos/scincl

PubMedNCL: Working with biomedical papers? Try PubMedNCL.

How to use the pretrained model

Sentence Transformers

Python

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("malteos/scincl")

# Concatenate the title and abstract with the [SEP] token
papers = [
    "BERT [SEP] We introduce a new language representation model called BERT",
    "Attention is all you need [SEP] The dominant sequence transduction models are based on complex recurrent or convolutional neural networks",
]
# Inference
embeddings = model.encode(papers)

# Compute the (cosine) similarity between embeddings
similarity = model.similarity(embeddings[0], embeddings[1])
print(similarity.item())
# => 0.8440517783164978

Transformers

Python

from transformers import AutoTokenizer, AutoModel

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('malteos/scincl')
model = AutoModel.from_pretrained('malteos/scincl')

papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
          {'title': 'Attention is all you need', 'abstract': ' The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]

# concatenate title and abstract with [SEP] token
title_abs = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]

# preprocess the input
inputs = tokenizer(title_abs, padding=True, truncation=True, return_tensors="pt", max_length=512)

# inference
result = model(**inputs)

# take the first token ([CLS] token) in the batch as the embedding
embeddings = result.last_hidden_state[:, 0, :]

# calculate the similarity
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
similarity = (embeddings[0] @ embeddings[1].T)
print(similarity.item())
# => 0.8440518379211426

Triplet Mining Parameters

Setting	Value
seed	4
triples_per_query	5
easy_positives_count	5
easy_positives_strategy	5
easy_positives_k	20-25
easy_negatives_count	3
easy_negatives_strategy	random_without_knn
hard_negatives_count	2
hard_negatives_strategy	knn
hard_negatives_k	3998-4000

SciDocs Results

These model weights are the ones that yielded the best results on SciDocs (seed=4). In the paper we report the SciDocs results as mean over ten seeds.

model	mag-f1	mesh-f1	co-view-map	co-view-ndcg	co-read-map	co-read-ndcg	cite-map	cite-ndcg	cocite-map	cocite-ndcg	recomm-ndcg	recomm-P@1	Avg
Doc2Vec	66.2	69.2	67.8	82.9	64.9	81.6	65.3	82.2	67.1	83.4	51.7	16.9	66.6
fasttext-sum	78.1	84.1	76.5	87.9	75.3	87.4	74.6	88.1	77.8	89.6	52.5	18	74.1
SGC	76.8	82.7	77.2	88	75.7	87.5	91.6	96.2	84.1	92.5	52.7	18.2	76.9
SciBERT	79.7	80.7	50.7	73.1	47.7	71.1	48.3	71.7	49.7	72.6	52.1	17.9	59.6
SPECTER	82	86.4	83.6	91.5	84.5	92.4	88.3	94.9	88.1	94.8	53.9	20	80
SciNCL (10 seeds)	81.4	88.7	85.3	92.3	87.5	93.9	93.6	97.3	91.6	96.4	53.9	19.3	81.8
SciNCL (seed=4)	81.2	89.0	85.3	92.2	87.7	94.0	93.6	97.4	91.7	96.5	54.3	19.6	81.9

Additional evaluations are available in the paper.

License

MIT

from sentence_transformers import SentenceTransformer # Load the model model = SentenceTransformer("malteos/scincl") # Concatenate the title and abstract with the [SEP] token papers = [ "BERT [SEP] We introduce a new language representation model called BERT", "Attention is all you need [SEP] The dominant sequence transduction models are based on complex recurrent or convolutional neural networks", ] # Inference embeddings = model.encode(papers) # Compute the (cosine) similarity between embeddings similarity = model.similarity(embeddings[0], embeddings[1]) print(similarity.item()) # => 0.8440517783164978

from transformers import AutoTokenizer, AutoModel # load model and tokenizer tokenizer = AutoTokenizer.from_pretrained('malteos/scincl') model = AutoModel.from_pretrained('malteos/scincl') papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'}, {'title': 'Attention is all you need', 'abstract': ' The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}] # concatenate title and abstract with [SEP] token title_abs = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers] # preprocess the input inputs = tokenizer(title_abs, padding=True, truncation=True, return_tensors="pt", max_length=512) # inference result = model(**inputs) # take the first token ([CLS] token) in the batch as the embedding embeddings = result.last_hidden_state[:, 0, :] # calculate the similarity embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1) similarity = (embeddings[0] @ embeddings[1].T) print(similarity.item()) # => 0.8440518379211426

Setting

Value

seed

triples_per_query

easy_positives_count

easy_positives_strategy

easy_positives_k

20-25

easy_negatives_count

easy_negatives_strategy

random_without_knn

hard_negatives_count

hard_negatives_strategy

knn

hard_negatives_k

3998-4000

model

mag-f1

mesh-f1

co-view-map

co-view-ndcg

co-read-map

co-read-ndcg

cite-map

cite-ndcg

cocite-map

cocite-ndcg

recomm-ndcg

recomm-P@1

Avg

Doc2Vec

66.2

69.2

67.8

82.9

64.9

81.6

65.3

82.2

67.1

83.4

51.7

16.9

66.6

fasttext-sum

78.1

84.1

76.5

87.9

75.3

87.4

74.6

88.1

77.8

89.6

52.5

74.1

SGC

76.8

82.7

77.2

75.7

87.5

91.6

96.2

84.1

92.5

52.7

18.2

76.9

SciBERT

79.7

80.7

50.7

73.1

47.7

71.1

48.3

71.7

49.7

72.6

52.1

17.9

59.6

SPECTER

86.4

83.6

91.5

84.5

92.4

88.3

94.9

88.1

94.8

53.9

SciNCL (10 seeds)

81.4

88.7

85.3

92.3

87.5

93.9

93.6

97.3

91.6

96.4

53.9

19.3

81.8

SciNCL (seed=4)

81.2

89.0

85.3

92.2

87.7

94.0

93.6

97.4

91.7

96.5

54.3

19.6

81.9