AI Explorer

gpt2 zinc 87m

by entropy

Open source · 268k downloads · 4 likes

0.9 (4 reviews) · Chat · API & Local

About
About

GPT2 Zinc 87m is an autoregressive language model inspired by GPT-2, designed specifically to generate chemical molecules as SMILES strings. Trained on a corpus of nearly 480 million SMILES from the ZINC database, it can create molecules with properties similar to those of existing drugs. Its main capabilities include generating new chemical structures and producing embeddings from SMILES strings, useful for analysis and modeling tasks in computational chemistry. The model stands out for its compact size (87 million parameters) and its performance tuned for drug design and chemical-space exploration. It is particularly well suited to researchers and engineers in medicinal chemistry who want to accelerate the discovery of new compounds.

Documentation

GPT2 Zinc 87m

This is a GPT2-style autoregressive language model trained on ~480m SMILES strings from the ZINC database.

The model has ~87m parameters and was trained for 175,000 iterations with a batch size of 3072, reaching a validation loss of ~0.615. This model is useful for generating drug-like molecules or for computing embeddings from SMILES strings.
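For intuition about that number: assuming the reported validation loss is the standard per-token cross-entropy in nats (the usual convention for language-model training), it corresponds to a per-token perplexity of about 1.85:

```python
import math

# perplexity = exp(cross-entropy loss in nats)
perplexity = math.exp(0.615)
print(round(perplexity, 2))  # ~1.85
```

In other words, the model is on average choosing between fewer than two equally likely tokens at each step, which is plausible for the small, highly regular vocabulary of SMILES strings.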

How to use

Python
from transformers import GPT2TokenizerFast, GPT2LMHeadModel

tokenizer = GPT2TokenizerFast.from_pretrained("entropy/gpt2_zinc_87m", max_len=256)
model = GPT2LMHeadModel.from_pretrained("entropy/gpt2_zinc_87m")

To generate molecules:

Python
import torch

# start generation from a lone BOS token
inputs = torch.tensor([[tokenizer.bos_token_id]])

gen = model.generate(
    inputs,
    do_sample=True,
    max_length=256,
    temperature=1.0,
    early_stopping=True,
    pad_token_id=tokenizer.pad_token_id,
    num_return_sequences=32,
)
smiles = tokenizer.batch_decode(gen, skip_special_tokens=True)

To compute embeddings:

Python
from transformers import DataCollatorWithPadding

collator = DataCollatorWithPadding(tokenizer, padding=True, return_tensors='pt')

# tokenize, then pad the batch to a uniform length
inputs = collator(tokenizer(smiles))
outputs = model(**inputs, output_hidden_states=True)
full_embeddings = outputs[-1][-1]  # last hidden layer: (batch, seq_len, hidden)
mask = inputs['attention_mask']
# mean-pool over non-padding positions to get one vector per molecule
embeddings = ((full_embeddings * mask.unsqueeze(-1)).sum(1) / mask.sum(-1).unsqueeze(-1))
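A common use of these pooled embeddings is molecule similarity search. A minimal sketch, where `pairwise_cosine` is an illustrative helper (not part of the model's API) that takes the `embeddings` tensor computed above:

```python
import torch
import torch.nn.functional as F

def pairwise_cosine(embeddings: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity between mean-pooled molecule embeddings.

    embeddings: (n_molecules, hidden_dim) tensor.
    Returns an (n_molecules, n_molecules) similarity matrix.
    """
    normed = F.normalize(embeddings, dim=-1)  # unit-length rows
    return normed @ normed.T

sims = pairwise_cosine(torch.eye(3))  # toy input; identical rows score 1.0
```

`sims[i, j]` close to 1 suggests molecules `i` and `j` have similar representations under the model.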

WARNING

This model was trained with bos and eos tokens around SMILES inputs. The GPT2TokenizerFast tokenizer DOES NOT ADD special tokens, even when add_special_tokens=True. Huggingface says this is intended behavior.

It may be necessary to add these tokens manually:

Python
inputs = collator(tokenizer([tokenizer.bos_token + i + tokenizer.eos_token for i in smiles]))

Model Performance

To test generation performance, 1m compounds were generated at various temperature values. Generated compounds were checked for uniqueness and structural validity.

  • percent_unique denotes n_unique_smiles/n_total_smiles
  • percent_valid denotes n_valid_smiles/n_unique_smiles
  • percent_unique_and_valid denotes n_valid_smiles/n_total_smiles
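The three ratios above can be computed directly. A minimal sketch, where `is_valid` is a placeholder predicate (in practice a structural check such as whether RDKit's `Chem.MolFromSmiles` parses the string):

```python
def generation_metrics(all_smiles, is_valid):
    """Uniqueness/validity ratios for a batch of generated SMILES strings."""
    n_total = len(all_smiles)
    unique = set(all_smiles)
    n_unique = len(unique)
    n_valid = sum(1 for s in unique if is_valid(s))  # validity checked on unique set
    return {
        "percent_unique": n_unique / n_total,
        "percent_valid": n_valid / n_unique,
        "percent_unique_and_valid": n_valid / n_total,
    }

# toy example: 4 samples, 3 unique, 2 of the unique ones "valid"
m = generation_metrics(["CCO", "CCO", "CCN", "not-smiles"],
                       lambda s: s != "not-smiles")
```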
temperature | percent_unique | percent_valid | percent_unique_and_valid
0.5         | 0.928074       | 1             | 0.928074
0.75        | 0.998468       | 0.999967      | 0.998436
1           | 0.999659       | 0.999164      | 0.998823
1.25        | 0.999514       | 0.99351       | 0.993027
1.5         | 0.998749       | 0.970223      | 0.96901

[Figure: property histograms computed over 1m generated compounds]

Capabilities & Tags
transformers · pytorch · gpt2 · text-generation · chemistry · molecule · drug · text-generation-inference · endpoints_compatible
Specifications

Category: Chat
Access: API & Local
License: Open Source
Pricing: Open Source
Rating: 0.9
