
materials.smi ted

by ibm-research

Open source · 18k downloads · 33 likes

Rating: 1.9 (33 reviews) · Embedding · API & Local
About

The *materials.smi ted* model is a foundation model specialized in materials science and chemistry, developed by IBM. It employs a transformer-based encoder-decoder architecture trained on a dataset of 91 million molecules represented in SMILES format, equivalent to 4 billion molecular tokens. Designed to predict quantum properties and tackle complex tasks in materials science, it comes in two main variants (289M and 8X289M parameters) and delivers state-of-the-art performance across various benchmarks. Its capabilities include predicting molecular properties, reconstructing chemical structures, and extracting relevant features for sustainable-chemistry research. The model accelerates the discovery of new materials by leveraging diverse representations and self-supervised learning. Its pretraining approach, combining token masking with SMILES reconstruction, yields rich latent spaces that are directly exploitable in downstream applications.

Documentation

Introduction to IBM's Foundation Models for Materials

Welcome to IBM's series of large foundation models for sustainable materials. Our models span a variety of representations and modalities, including SMILES, SELFIES, 3D atom positions, 3D density grids, molecular graphs, and other formats. These models are designed to support and advance research in materials science and chemistry.

GitHub: GitHub Link

Paper (pre-print): arXiv:2407.20267

Paper: Communications Chemistry

SMILES-based Transformer Encoder-Decoder (SMI-TED)

This repository provides PyTorch source code associated with our publication, "A Large Encoder-Decoder Family of Foundation Models for Chemical Language".

Paper: arXiv:2407.20267

We provide the model weights in two formats:

  • PyTorch (.pt): smi-ted-Light_40.pt
  • safetensors (.safetensors): model_weights.safetensors

For more information contact: [email protected] or [email protected].

Introduction

We present a large encoder-decoder chemical foundation model, SMILES-based Transformer Encoder-Decoder (SMI-TED), pre-trained on a curated dataset of 91 million SMILES samples sourced from PubChem, equivalent to 4 billion molecular tokens. SMI-TED supports various complex tasks, including quantum property prediction, with two main variants (289M and 8X289M). Our experiments across multiple benchmark datasets demonstrate state-of-the-art performance for various tasks.

Table of Contents

  1. Getting Started
    1. Pretrained Models and Training Logs
    2. Replicating Conda Environment
  2. Pretraining
  3. Finetuning
  4. Feature Extraction
  5. Citations

Getting Started

This code and environment have been tested on Nvidia V100 and Nvidia A100 GPUs.

Pretrained Models and Training Logs

We provide checkpoints of the SMI-TED model pre-trained on a dataset of ~91M molecules curated from PubChem. The pre-trained model shows competitive performance on classification and regression benchmarks from MoleculeNet.

Add the SMI-TED pre-trained weights (.pt) to the inference/ or finetune/ directory, depending on your use case. The directory structure should look like the following:

Text
inference/
├── smi_ted_light
│   ├── smi_ted_light.pt
│   ├── bert_vocab_curated.txt
│   └── load.py

and/or:

Text
finetune/
├── smi_ted_light
│   ├── smi_ted_light.pt
│   ├── bert_vocab_curated.txt
│   └── load.py

Replicating Conda Environment

Follow these steps to replicate our Conda environment and install the necessary libraries:

Create and Activate Conda Environment

Bash
conda create --name smi-ted-env python=3.10
conda activate smi-ted-env

Install Packages with Conda

Bash
conda install pytorch=2.1.0 pytorch-cuda=11.8 -c pytorch -c nvidia

Install Packages with Pip

Bash
pip install -r requirements.txt
pip install pytorch-fast-transformers

Pretraining

For pretraining, we use two strategies: the masked language model method to train the encoder part and an encoder-decoder strategy to refine SMILES reconstruction and improve the generated latent space.
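
The masked-language-model objective can be illustrated on a single SMILES string. The sketch below uses a simplified regex tokenizer and a 15% mask rate; both are illustrative assumptions, not the model's actual vocabulary or hyperparameters.

```python
import random
import re

# Minimal SMILES tokenizer: bracketed atoms, two-letter elements (Cl, Br),
# then single characters. Illustrative only, not the model's real vocabulary.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|.")

def tokenize(smiles: str) -> list:
    return TOKEN_RE.findall(smiles)

def mask_tokens(tokens, mask_rate=0.15, rng=None):
    """Replace a random subset of tokens with [MASK], as in masked-LM pretraining."""
    rng = rng or random.Random(0)
    return [("[MASK]" if rng.random() < mask_rate else t) for t in tokens]

tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
masked = mask_tokens(tokens, mask_rate=0.15)
```

During pretraining, the encoder would be asked to recover the original tokens at the masked positions, while the encoder-decoder stage reconstructs the full SMILES string from the latent representation.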

SMI-TED is pre-trained on canonicalized and curated 91M SMILES from PubChem with the following constraints:

  • Compounds are filtered to a maximum length of 202 tokens during preprocessing.
  • A 95/5/0 split is used for encoder training, with 5% of the data for decoder pretraining.
  • A 100/0/0 split is also used to train the encoder and decoder directly, enhancing model performance.
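
The preprocessing constraints above can be sketched in a few lines. The tokenizer and the split arithmetic here are illustrative assumptions, not the project's actual pipeline.

```python
import re

MAX_TOKENS = 202  # preprocessing cap stated above
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|.")  # illustrative tokenizer, not the real vocabulary

def keep_compound(smiles: str, max_tokens: int = MAX_TOKENS) -> bool:
    """Filter rule: keep only compounds with at most max_tokens tokens."""
    return len(TOKEN_RE.findall(smiles)) <= max_tokens

# 95/5 encoder/decoder split over the ~91M curated molecules (integer arithmetic)
total = 91_000_000
encoder_share = total * 95 // 100   # molecules for encoder (masked-LM) training
decoder_share = total - encoder_share  # molecules for decoder pretraining
```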

The pretraining code provides examples of data processing and model training on a smaller dataset, requiring 8 A100 GPUs.

To pre-train the two variants of the SMI-TED model, run:

Bash
bash training/run_model_light_training.sh

or

Bash
bash training/run_model_large_training.sh

Use train_model_D.py to train only the decoder or train_model_ED.py to train both the encoder and decoder.

Finetuning

The finetuning datasets and environment can be found in the finetune directory. After setting up the environment, you can run a finetuning task with:

Bash
bash finetune/smi_ted_light/esol/run_finetune_esol.sh

Finetuning training/checkpointing resources will be available in directories named checkpoint_<measure_name>.

Feature Extraction

The example notebook smi_ted_encoder_decoder_example.ipynb contains code to load checkpoint files and use the pre-trained model for encoder and decoder tasks. It also includes examples of classification and regression tasks.

To load smi-ted, you can simply use:

Python
model = load_smi_ted(
    folder='../inference/smi_ted_light',
    ckpt_filename='smi_ted_light.pt'
)

or

Python
import torch

with open('model_weights.bin', 'rb') as f:
    state_dict = torch.load(f)
model.load_state_dict(state_dict)

To encode SMILES into embeddings, you can use:

Python
with torch.no_grad():
    encoded_embeddings = model.encode(df['SMILES'], return_torch=True)

To decode embeddings back into SMILES strings, use:

Python
with torch.no_grad():
    decoded_smiles = model.decode(encoded_embeddings)
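
The resulting embeddings can feed downstream tasks such as similarity search. Below is a minimal, self-contained cosine-similarity sketch using hypothetical vectors; in practice the embeddings come from model.encode as torch tensors (convertible with .tolist()).

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (plain Python lists here)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical 4-d embeddings for two molecules (illustrative values only)
emb_a = [0.1, 0.3, -0.2, 0.7]
emb_b = [0.2, 0.1, -0.1, 0.9]
sim = cosine_similarity(emb_a, emb_b)
```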

Citations

BibTeX
@article{soares2025open,
  title     = {An open-source family of large encoder-decoder foundation models for chemistry},
  author    = {Soares, Eduardo and Vital Brazil, Emilio and Shirasuna, Victor and Zubarev, Dmitry and Cerqueira, Renato and Schmidt, Kristin},
  journal   = {Communications Chemistry},
  volume    = {8},
  pages     = {193},
  year      = {2025},
  publisher = {Nature Portfolio},
  doi       = {10.1038/s42004-025-01585-0},
  url       = {https://doi.org/10.1038/s42004-025-01585-0}
}
BibTeX
@misc{soares2024largeencoderdecoderfamilyfoundation,
      title={A Large Encoder-Decoder Family of Foundation Models For Chemical Language}, 
      author={Eduardo Soares and Victor Shirasuna and Emilio Vital Brazil and Renato Cerqueira and Dmitry Zubarev and Kristin Schmidt},
      year={2024},
      eprint={2407.20267},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2407.20267}, 
}
Capabilities & Tags

transformers · pytorch · SMI-TED · chemistry · foundation models · AI4Science · materials · molecules · safetensors

Specifications

  • Category: Embedding
  • Access: API & Local
  • License: Open Source
  • Pricing: Open Source
  • Rating: 1.9 (33 reviews)