
scincl

by malteos

Open source · 22k downloads · 36 likes

2.0 (36 reviews) · Embedding · API & Local
About

SciNCL is a BERT language model designed specifically to generate document-level embeddings for research papers. It leverages the citation graph to improve its representations through contrastive learning, starting from the weights of the SciBERT model. Trained on the large S2ORC citation graph, it excels at capturing semantic relationships between scientific publications. The model is particularly useful for tasks such as finding similar papers, document recommendation, and research-network analysis. Its strength lies in its ability to incorporate citation context, yielding richer and more accurate embeddings than traditional approaches.

Documentation

SciNCL

SciNCL is a pre-trained BERT language model to generate document-level embeddings of research papers. It uses the citation graph neighborhood to generate samples for contrastive learning. Prior to the contrastive training, the model is initialized with weights from scibert-scivocab-uncased. The underlying citation embeddings are trained on the S2ORC citation graph.

Paper: Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings (EMNLP 2022).

Code: https://github.com/malteos/scincl

PubMedNCL: Working with biomedical papers? Try PubMedNCL.

How to use the pretrained model

Sentence Transformers

Python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("malteos/scincl")

# Concatenate the title and abstract with the [SEP] token
papers = [
    "BERT [SEP] We introduce a new language representation model called BERT",
    "Attention is all you need [SEP] The dominant sequence transduction models are based on complex recurrent or convolutional neural networks",
]
# Inference
embeddings = model.encode(papers)

# Compute the (cosine) similarity between embeddings
similarity = model.similarity(embeddings[0], embeddings[1])
print(similarity.item())
# => 0.8440517783164978
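With embeddings in hand, finding similar papers reduces to a cosine-similarity ranking. A minimal, dependency-free sketch of that step (the three-dimensional vectors and paper names are toy stand-ins, not real SciNCL outputs, which are 768-dimensional):

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    """Cosine similarity: dot product of the unit-normalized vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

# Toy stand-in embeddings
query = [0.2, 0.9, 0.1]
corpus = {
    "paper A": [0.1, 0.8, 0.2],
    "paper B": [0.9, 0.1, 0.0],
    "paper C": [0.3, 0.7, 0.3],
}

# Rank the corpus by similarity to the query, most similar first
ranked = sorted(corpus, key=lambda name: cosine(query, corpus[name]), reverse=True)
print(ranked)  # => ['paper A', 'paper C', 'paper B']
```

In practice you would batch-encode the corpus once, cache the embeddings, and only encode the query at search time.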

Transformers

Python
import torch
from transformers import AutoTokenizer, AutoModel

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('malteos/scincl')
model = AutoModel.from_pretrained('malteos/scincl')

papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
          {'title': 'Attention is all you need', 'abstract': ' The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]

# concatenate title and abstract with [SEP] token
title_abs = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]

# preprocess the input
inputs = tokenizer(title_abs, padding=True, truncation=True, return_tensors="pt", max_length=512)

# inference
result = model(**inputs)

# take the first token ([CLS] token) in the batch as the embedding
embeddings = result.last_hidden_state[:, 0, :]

# calculate the similarity
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
similarity = (embeddings[0] @ embeddings[1].T)
print(similarity.item())
# => 0.8440518379211426
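The snippet above pools the [CLS] token, as the model card prescribes. Mean pooling over non-padding tokens is a common alternative for BERT-style encoders (shown here purely as an illustration, not a card recommendation); the key detail is that the attention mask must exclude padded positions from the average:

```python
# Illustrative mean pooling with an attention mask, in plain Python.
# A real pipeline would do this with torch tensors on result.last_hidden_state.

def mean_pool(hidden_states, attention_mask):
    """Average the token vectors whose attention_mask entry is 1."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    return [t / count for t in total]

# Two content tokens followed by one padding token
hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(hidden, mask))  # => [2.0, 3.0]; the padding row is ignored
```

Averaging over all positions instead would let padding vectors pull the embedding toward whatever the model emits for pad tokens, which varies with batch composition.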

Triplet Mining Parameters

| Setting | Value |
|---|---|
| seed | 4 |
| triples_per_query | 5 |
| easy_positives_count | 5 |
| easy_positives_strategy | 5 |
| easy_positives_k | 20-25 |
| easy_negatives_count | 3 |
| easy_negatives_strategy | random_without_knn |
| hard_negatives_count | 2 |
| hard_negatives_strategy | knn |
| hard_negatives_k | 3998-4000 |
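The settings above describe how training triples were mined from kNN neighborhoods in citation-embedding space. The exact strategies are not spelled out here, so the sketch below is only a guess at the semantics: it takes mid-rank neighbors as easy positives, random out-of-neighborhood documents as easy negatives, and the far tail of the kNN list as hard negatives, with parameter names and defaults mirroring the table. The document ids are fabricated.

```python
import random

def mine_triples(query, knn, corpus, seed=4,
                 easy_positives_k=(20, 25),      # neighbor ranks for easy positives
                 easy_negatives_count=3,
                 hard_negatives_k=(3998, 4000)): # tail ranks for hard negatives
    """Toy triple mining over a precomputed kNN list (nearest first)."""
    rng = random.Random(seed)
    # easy positives: neighbors at ranks ~20-25 (easy_positives_count = 5)
    positives = knn[easy_positives_k[0]:easy_positives_k[1]]
    # easy negatives: random documents outside the kNN neighborhood
    excluded = set(knn) | {query}
    pool = [d for d in corpus if d not in excluded]
    easy_negatives = rng.sample(pool, easy_negatives_count)
    # hard negatives: far end of the kNN list (hard_negatives_count = 2)
    hard_negatives = knn[hard_negatives_k[0]:hard_negatives_k[1]]
    return positives, easy_negatives, hard_negatives

# Fabricated ids: a corpus of 10,000 papers, 4,000 nearest neighbors of paper 0
pos, easy_neg, hard_neg = mine_triples(0, list(range(1, 4001)), range(10000))
print(pos, hard_neg)  # => [21, 22, 23, 24, 25] [3999, 4000]
```

The intuition behind the ranges: very near neighbors are trivially similar, so positives come from a slightly deeper rank band, while hard negatives sit just inside the neighborhood boundary, close enough to be confusable but likely uncited.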

SciDocs Results

These model weights are the ones that yielded the best results on SciDocs (seed=4). In the paper, we report SciDocs results as the mean over ten seeds.

| model | mag-f1 | mesh-f1 | co-view-map | co-view-ndcg | co-read-map | co-read-ndcg | cite-map | cite-ndcg | cocite-map | cocite-ndcg | recomm-ndcg | recomm-P@1 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Doc2Vec | 66.2 | 69.2 | 67.8 | 82.9 | 64.9 | 81.6 | 65.3 | 82.2 | 67.1 | 83.4 | 51.7 | 16.9 | 66.6 |
| fasttext-sum | 78.1 | 84.1 | 76.5 | 87.9 | 75.3 | 87.4 | 74.6 | 88.1 | 77.8 | 89.6 | 52.5 | 18.0 | 74.1 |
| SGC | 76.8 | 82.7 | 77.2 | 88.0 | 75.7 | 87.5 | 91.6 | 96.2 | 84.1 | 92.5 | 52.7 | 18.2 | 76.9 |
| SciBERT | 79.7 | 80.7 | 50.7 | 73.1 | 47.7 | 71.1 | 48.3 | 71.7 | 49.7 | 72.6 | 52.1 | 17.9 | 59.6 |
| SPECTER | 82.0 | 86.4 | 83.6 | 91.5 | 84.5 | 92.4 | 88.3 | 94.9 | 88.1 | 94.8 | 53.9 | 20.0 | 80.0 |
| SciNCL (10 seeds) | 81.4 | 88.7 | 85.3 | 92.3 | 87.5 | 93.9 | 93.6 | 97.3 | 91.6 | 96.4 | 53.9 | 19.3 | 81.8 |
| SciNCL (seed=4) | 81.2 | 89.0 | 85.3 | 92.2 | 87.7 | 94.0 | 93.6 | 97.4 | 91.7 | 96.5 | 54.3 | 19.6 | 81.9 |

Additional evaluations are available in the paper.
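Assuming the Avg column is the plain mean of the twelve task metrics in each row (as in the SPECTER evaluation), the reported values are easy to check, e.g. for the SciNCL (seed=4) row:

```python
# Twelve task metrics from the SciNCL (seed=4) row of the SciDocs table
scincl_seed4 = [81.2, 89.0, 85.3, 92.2, 87.7, 94.0,
                93.6, 97.4, 91.7, 96.5, 54.3, 19.6]

avg = sum(scincl_seed4) / len(scincl_seed4)
print(round(avg, 1))  # => 81.9, matching the Avg column
```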

License

MIT

Links & Resources

Specifications

- Category: Embedding
- Access: API & Local
- License: Open Source
- Pricing: Open Source
- Rating: 2.0
