MoLFormer XL both 10pct

by ibm-research

Open source · 253k downloads · 35 likes

Rating: 1.9 (35 reviews) · Embedding · API & Local

About

MoLFormer XL both 10% is a specialized language model for analyzing chemical molecules, trained on textual representations of molecular structures (SMILES) from the ZINC and PubChem databases. It extracts molecular features and can be fine-tuned to predict chemical properties (such as solubility or toxicity) or used directly as a descriptor extractor for similarity or visualization tasks. Unlike generative models, it focuses on understanding molecular structures rather than creating them. It has not been tested on molecules larger than about 200 atoms, and it expects canonical SMILES as input; invalid or noncanonical SMILES may degrade results. The model uses a linear attention mechanism, which allows efficient analysis of large-scale chemical data.

Documentation

MoLFormer-XL-both-10%

MoLFormer is a class of models pretrained on SMILES string representations of up to 1.1B molecules from ZINC and PubChem. This repository is for the model pretrained on 10% of both datasets.

It was introduced in the paper Large-Scale Chemical Language Representations Capture Molecular Structure and Properties by Ross et al. and first released in this repository.

Model Details

Model Description

MoLFormer is a large-scale chemical language model trained on small molecules represented as SMILES strings. MoLFormer leverages masked language modeling and employs a linear attention Transformer combined with rotary embeddings.
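To make the masked language modeling objective concrete, here is a minimal sketch of BERT-style token masking applied to a tokenized SMILES string. The 15% selection rate, the 80/10/10 replacement split, and the `<mask>` symbol are assumptions borrowed from the usual BERT recipe, not values taken from this model card; `mask_tokens` is a hypothetical helper.

```python
import random

MASK_TOKEN = "<mask>"  # assumed mask symbol; check the real tokenizer's mask_token


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels) for one tokenized SMILES string.

    labels[i] holds the original token at selected positions and None
    elsewhere, so a loss would only be computed on selected positions.
    """
    if rng is None:
        rng = random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK_TOKEN)          # usually replace with the mask symbol
            elif roll < 0.9:
                masked.append(rng.choice(vocab))   # sometimes replace with a random token
            else:
                masked.append(tok)                 # sometimes leave the token unchanged
        else:
            labels.append(None)  # position ignored by the loss
            masked.append(tok)
    return masked, labels


# Character-level split is enough for this aspirin SMILES (no multi-character atoms).
tokens = list("CC(=O)Oc1ccccc1C(=O)O")
masked, labels = mask_tokens(tokens, vocab=sorted(set(tokens)), rng=random.Random(0))
```

Passing a seeded `random.Random` keeps the masking reproducible, which helps when debugging a training pipeline.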

[Figure: MoLFormer pipeline]

The figure gives an overview of the MoLFormer pipeline. The transformer-based neural network model is trained in a self-supervised fashion on a large collection of chemical molecules represented by SMILES sequences from two public chemical datasets, PubChem and ZINC. The MoLFormer architecture was designed with an efficient linear attention mechanism and relative positional embeddings, with the goal of learning a meaningful and compressed representation of chemical molecules. After training, the MoLFormer foundation model was adapted to different downstream molecular property prediction tasks via fine-tuning on task-specific data. To further test the representative power of MoLFormer, the MoLFormer encodings were used to recover molecular similarity, and the correspondence between interatomic spatial distance and attention value for a given molecule was analyzed.
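The efficiency of linear attention comes from reordering the computation so that keys and values are summarized once, instead of comparing every query with every key. The sketch below illustrates that idea in plain Python with the `elu(x) + 1` feature map, which is a common choice in the linear attention literature and is an assumption here; it is not necessarily the feature map MoLFormer uses, and rotary embeddings are left out. The pairwise version is included only to show that both orderings give the same result.

```python
import math


def feature_map(x):
    # elu(x) + 1, which is positive for every real input
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(queries, keys, values):
    """Non-causal kernel attention with cost linear in sequence length."""
    d_k, d_v = len(keys[0]), len(values[0])
    # Summarize all keys and values once:
    #   s = sum_j phi(k_j) v_j^T   (d_k x d_v matrix)
    #   z = sum_j phi(k_j)         (d_k vector)
    s = [[0.0] * d_v for _ in range(d_k)]
    z = [0.0] * d_k
    for k, v in zip(keys, values):
        pk = feature_map(k)
        for a in range(d_k):
            z[a] += pk[a]
            for b in range(d_v):
                s[a][b] += pk[a] * v[b]
    outputs = []
    for q in queries:
        pq = feature_map(q)
        norm = sum(pq[a] * z[a] for a in range(d_k))
        outputs.append(
            [sum(pq[a] * s[a][b] for a in range(d_k)) / norm for b in range(d_v)]
        )
    return outputs


def quadratic_attention(queries, keys, values):
    """Same kernel attention computed pairwise, for comparison only."""
    phi_k = [feature_map(k) for k in keys]
    d_v = len(values[0])
    outputs = []
    for q in queries:
        pq = feature_map(q)
        weights = [sum(a * b for a, b in zip(pq, pk)) for pk in phi_k]
        total = sum(weights)
        outputs.append(
            [sum(w * v[b] for w, v in zip(weights, values)) / total for b in range(d_v)]
        )
    return outputs


# Small made-up inputs: 3 positions, key dimension 2, value dimension 2.
Q = [[0.1, -0.2], [0.5, 0.3], [-1.0, 0.7]]
K = [[0.4, 0.0], [-0.3, 0.9], [0.2, -0.6]]
V = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
fast = linear_attention(Q, K, V)
slow = quadratic_attention(Q, K, V)
```

In `linear_attention` the loop over keys runs once regardless of the number of queries, which is why the cost grows linearly rather than quadratically with sequence length.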

Intended use and limitations

You can use the model for masked language modeling, but it is mainly intended to be used as a feature extractor or to be fine-tuned for a prediction task. The "frozen" model embeddings may be used for similarity measurements, visualization, or training predictor models. The model may also be fine-tuned for sequence classification tasks (e.g., solubility, toxicity, etc.).

This model is not intended for molecule generation. It has also not been tested on molecules larger than ~200 atoms (i.e., macromolecules). Furthermore, using invalid or noncanonical SMILES may result in worse performance.

Example code

Use the code below to get started with the model.

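```python
import torch
from transformers import AutoModel, AutoTokenizer

# trust_remote_code=True is required because the model ships custom code.
model = AutoModel.from_pretrained("ibm/MoLFormer-XL-both-10pct", deterministic_eval=True, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ibm/MoLFormer-XL-both-10pct", trust_remote_code=True)

# Two example molecules as SMILES strings.
smiles = ["Cn1c(=O)c2c(ncn2C)n(C)c1=O", "CC(=O)Oc1ccccc1C(=O)O"]
inputs = tokenizer(smiles, padding=True, return_tensors="pt")

# Inference only, so gradient tracking is disabled.
with torch.no_grad():
    outputs = model(**inputs)

# One pooled embedding vector per input SMILES.
embeddings = outputs.pooler_output
```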
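Once pooled embeddings are available, similarity between molecules can be measured with cosine similarity. The following is a minimal, dependency-free sketch; `cosine_similarity` and `similarity_matrix` are hypothetical helpers, and the vectors are placeholders for illustration only, not real MoLFormer outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def similarity_matrix(embeddings):
    """Pairwise cosine similarities for a list of embedding vectors."""
    return [[cosine_similarity(u, v) for v in embeddings] for u in embeddings]


# Placeholder vectors for illustration only; not real model embeddings.
toy_embeddings = [[1.0, 0.0, 1.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
matrix = similarity_matrix(toy_embeddings)
```

With real model output, the same helper could be called as `similarity_matrix(outputs.pooler_output.tolist())`.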

Training Details

Data

We trained MoLFormer-XL on a combination of molecules from the ZINC15 and PubChem datasets. This repository contains the version trained on 10% ZINC + 10% PubChem.

Molecules were canonicalized with RDKit prior to training and isomeric information was removed. Also, molecules longer than 202 tokens were dropped.
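The length filter described above can be sketched as follows. The regular expression is a pattern widely used for SMILES tokenization and is an assumption here; it was not read from this model's tokenizer, and whether the 202-token limit counts special tokens is not stated in the source. Canonicalization with RDKit is not shown.

```python
import re

# Widely used SMILES tokenization pattern (an assumption, not taken from
# the MoLFormer tokenizer): bracket atoms, two-letter halogens, organic
# subset atoms, bonds, branches, and ring-closure digits.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
)

MAX_TOKENS = 202  # limit quoted in the training details


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens using the pattern above."""
    return SMILES_TOKEN_PATTERN.findall(smiles)


def drop_long_molecules(smiles_list, max_tokens=MAX_TOKENS):
    """Keep only molecules whose token count does not exceed max_tokens."""
    return [s for s in smiles_list if len(tokenize_smiles(s)) <= max_tokens]


aspirin_tokens = tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O")
kept = drop_long_molecules(["C" * 202, "C" * 203])
```

For exact agreement with the pretrained model, token counts should come from the model's own tokenizer rather than from this pattern.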

Hardware

  • 16 x NVIDIA V100 GPUs

Evaluation

We evaluated MoLFormer by fine-tuning on 11 benchmark tasks from MoleculeNet. The tables below show the performance of different MoLFormer variants:

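| Model variant | BBBP | HIV | BACE | SIDER | ClinTox | Tox21 |
|---|---|---|---|---|---|---|
| 10% ZINC + 10% PubChem | 91.5 | 81.3 | 86.6 | 68.9 | 94.6 | 84.5 |
| 10% ZINC + 100% PubChem | 92.2 | 79.2 | 86.3 | 69.0 | 94.7 | 84.5 |
| 100% ZINC | 89.9 | 78.4 | 87.7 | 66.8 | 82.2 | 83.2 |
| MoLFormer-Base | 90.9 | 77.7 | 82.8 | 64.8 | 61.3 | 43.1 |
| MoLFormer-XL | 93.7 | 82.2 | 88.2 | 69.0 | 94.8 | 84.7 |

| Model variant | QM9 | QM8 | ESOL | FreeSolv | Lipophilicity |
|---|---|---|---|---|---|
| 10% ZINC + 10% PubChem | 1.7754 | 0.0108 | 0.3295 | 0.2221 | 0.5472 |
| 10% ZINC + 100% PubChem | 1.9093 | 0.0102 | 0.2775 | 0.2050 | 0.5331 |
| 100% ZINC | 1.9403 | 0.0124 | 0.3023 | 0.2981 | 0.5440 |
| MoLFormer-Base | 2.2500 | 0.0111 | 0.2798 | 0.2596 | 0.6492 |
| MoLFormer-XL | 1.5984 | 0.0102 | 0.2787 | 0.2308 | 0.5298 |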

We report AUROC for all classification tasks, average MAE for QM9/8, and RMSE for the remaining regression tasks.
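For readers less familiar with these metrics, the sketch below shows how AUROC, MAE, and RMSE can be computed for a single task. It is a plain-Python illustration with made-up numbers, not the evaluation code behind the reported results; in practice a library such as scikit-learn would normally be used. Averaging MAE across the several targets of QM9 or QM8 is not shown.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def auroc(labels, scores):
    """Probability that a random positive scores above a random negative.

    Ties count as half a win. Labels must be 0 or 1.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up predictions for illustration only.
example_auroc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
example_mae = mae([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
example_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

The pairwise AUROC above is quadratic in the number of samples, which is fine for illustration but slow for large test sets. The classification table appears to report AUROC as a percentage, so a value such as 0.915 from this function would correspond to 91.5 there.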

Citation

```bibtex
@article{10.1038/s42256-022-00580-7,
  year = {2022},
  title = {{Large-scale chemical language representations capture molecular structure and properties}},
  author = {Ross, Jerret and Belgodere, Brian and Chenthamarakshan, Vijil and Padhi, Inkit and Mroueh, Youssef and Das, Payel},
  journal = {Nature Machine Intelligence},
  doi = {10.1038/s42256-022-00580-7},
  pages = {1256--1264},
  number = {12},
  volume = {4}
}
```

```bibtex
@misc{https://doi.org/10.48550/arxiv.2106.09553,
  doi = {10.48550/ARXIV.2106.09553},
  url = {https://arxiv.org/abs/2106.09553},
  author = {Ross, Jerret and Belgodere, Brian and Chenthamarakshan, Vijil and Padhi, Inkit and Mroueh, Youssef and Das, Payel},
  keywords = {Machine Learning (cs.LG), Computation and Language (cs.CL), Biomolecules (q-bio.BM), FOS: Computer and information sciences, FOS: Biological sciences},
  title = {Large-Scale Chemical Language Representations Capture Molecular Structure and Properties},
  publisher = {arXiv},
  year = {2021},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```

Capabilities & Tags

transformers, pytorch, safetensors, molformer, fill-mask, chemistry, feature-extraction, custom_code

Specifications

  • Category: Embedding
  • Access: API & Local
  • License: Open Source
  • Pricing: Open Source
  • Rating: 1.9