forked from jinhojsk515/spmm

Multimodal learning for chemical domain, with SMILES and properties.

ga83wuw/spmmGANs

 
 

SPMM: Structure-Property Multi-Modal learning for molecules

The official GitHub repository for SPMM, a multi-modal molecular pre-trained model for a synergistic comprehension of molecular structure and properties. Details can be found in the following paper: Bidirectional Generation of Structure and Properties Through a Single Molecular Foundation Model (Nature Communications 2024).


Molecular structures are given as SMILES strings, and 53 simple chemical properties are used to build the property vector (PV) of each molecule.

The model checkpoint and data are too large to be included in this repo; they can be found here.

Files

  • data/: Contains the data used for the experiments in the paper. (You have to create this folder and place the data downloaded from the link above in it.)
  • Pretrain/: Contains the checkpoint of the pre-trained SPMM. (You have to create this folder and place the checkpoint downloaded from the link above in it.)
  • vocab_bpe_300.txt: Contains the SMILES tokens for the SMILES tokenizer.
  • property_name.txt: Contains the names of the 53 chemical properties.
  • normalize.pkl: Contains the mean and standard deviation of the 53 chemical properties used for the PV.
  • calc_property.py: Contains the code to calculate the 53 chemical properties and build a PV for a given SMILES. Modify this code to use SPMM pre-training with your own custom PVs.
  • SPMM_models.py: Contains the code for the SPMM model and its pre-training.
  • SPMM_pretrain.py: Runs SPMM pre-training.
  • d_*.py: Code for the downstream tasks.
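
The role of normalize.pkl can be illustrated with a minimal sketch of per-property z-score normalization. This is an assumption about how the stored mean and standard deviation are applied, not the repository's actual code: the function names `fit_stats`, `normalize`, and `denormalize` are hypothetical, the on-disk layout of normalize.pkl is not described here, and the toy vectors have 2 entries instead of 53.

```python
import math

def fit_stats(pvs):
    """Column-wise mean and (population) standard deviation of equal-length PVs."""
    n, dim = len(pvs), len(pvs[0])
    means = [sum(pv[j] for pv in pvs) / n for j in range(dim)]
    stds = [
        math.sqrt(sum((pv[j] - means[j]) ** 2 for pv in pvs) / n)
        for j in range(dim)
    ]
    return means, stds

def normalize(pv, means, stds, eps=1e-8):
    """Map raw property values to z-scores; eps guards against zero variance."""
    return [(x - m) / (s + eps) for x, m, s in zip(pv, means, stds)]

def denormalize(z, means, stds, eps=1e-8):
    """Invert normalize(), e.g. to read a predicted PV in physical units."""
    return [v * (s + eps) + m for v, m, s in zip(z, means, stds)]

# Toy data: two molecules, two made-up properties each.
toy_pvs = [[1.0, 10.0], [3.0, 30.0]]
means, stds = fit_stats(toy_pvs)
z = normalize(toy_pvs[1], means, stds)
restored = denormalize(z, means, stds)
```

If custom properties are added through calc_property.py, the stored statistics would presumably need to be recomputed over the new property set so that every PV entry stays on a comparable scale.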

Requirements

Run pip install -r requirements.txt to install the required packages.

Running the code

Arguments can be passed on the command line or edited manually in the script being run.
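
As an illustration of these two options, here is a minimal `argparse` sketch. It is not the parser used by the scripts in this repo: the flag names are taken from the commands in this README, but the default values, the `str2bool` helper, and `build_parser` are assumptions made for the example. Editing a `default=` value corresponds to changing an argument in the code, while passing a flag overrides it for a single run.

```python
import argparse

def str2bool(value):
    """Parse 'True'/'False' style strings; plain type=bool would treat any
    non-empty string (including 'False') as True."""
    if isinstance(value, bool):
        return value
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")

def build_parser():
    parser = argparse.ArgumentParser(description="Hypothetical downstream-task CLI")
    # Defaults below are illustrative assumptions, not the repo's values.
    parser.add_argument("--checkpoint", default="./Pretrain/checkpoint_SPMM.ckpt")
    parser.add_argument("--n_generate", type=int, default=1000)
    parser.add_argument("--stochastic", type=str2bool, default=True)
    parser.add_argument("--k", type=int, default=2)
    return parser

# Equivalent to: python some_script.py --k 3 --stochastic False
args = build_parser().parse_args(["--k", "3", "--stochastic", "False"])
```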

  1. Pre-training

    python SPMM_pretrain.py --data_path './data/pretrain.txt'
    
  2. PV-to-SMILES generation

    • batched: The model takes the PVs of the molecules in input_file and generates molecules with those PVs using k-beam search. The generated molecules are written to generated_molecules.txt.
      python d_pv2smiles_batched.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --input_file './data/pubchem_1k_unseen.txt' --k 2
      
    • single: The model takes one query PV and generates n_generate molecules with that PV using k-beam search. The generated molecules are written to generated_molecules.txt. Here, you need to build your input PV in the code; see the four examples included in the script.
      python d_pv2smiles_single.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --n_generate 1000 --stochastic True --k 2
      
  3. SMILES-to-PV generation

    The model takes the query molecules in input_file and generates their PVs.

    python d_smiles2pv.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --input_file './data/pubchem_1k_unseen.txt'
    
  4. MoleculeNet + DILI prediction task

    d_regression.py, d_classification.py, and d_classification_multilabel.py perform regression, binary classification, and multi-label classification, respectively.

    python d_regression.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --name 'bace'
    python d_classification.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --name 'bbbp'
    python d_classification_multilabel.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --name 'clintox'
    
  5. Forward/retro-reaction prediction tasks

    d_rxn_prediction.py performs both forward and retro reaction prediction on the USPTO-480k and USPTO-50k datasets.

    e.g. forward reaction prediction, no beam search

    python d_rxn_prediction.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --mode 'forward' --n_beam 1 
    

    e.g. retro reaction prediction, beam search with k=3

    python d_rxn_prediction.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --mode 'retro' --n_beam 3 
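
Both the PV-to-SMILES scripts (--k) and the reaction prediction script (--n_beam) decode with beam search, where a beam width of 1 reduces to greedy decoding. The sketch below shows a simplified beam search over a toy next-token table. It is only an illustration of the technique under stated assumptions: `beam_search`, `toy_next_log_probs`, `TOY_TABLE`, and the `[BOS]`/`[EOS]` token names are hypothetical, and the real scripts score tokens with the SPMM decoder rather than a lookup table.

```python
import math

# Toy conditional distributions keyed by the tokens generated so far
# (made-up numbers; any other prefix ends the sequence).
TOY_TABLE = {
    (): {"C": 0.6, "N": 0.4},
    ("C",): {"O": 0.55, "N": 0.45},
    ("N",): {"O": 0.9, "C": 0.1},
}

def toy_next_log_probs(tokens):
    """Return log-probabilities of each possible next token for a prefix."""
    probs = TOY_TABLE.get(tuple(tokens[1:]), {"[EOS]": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

def beam_search(next_log_probs, k, max_len, bos="[BOS]", eos="[EOS]"):
    """Keep the k highest-scoring hypotheses at each step (simplified variant)."""
    beams = [([bos], 0.0)]  # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = [
            (tokens + [tok], score + lp)
            for tokens, score in beams
            for tok, lp in next_log_probs(tokens).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:k]:
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:k]

greedy = beam_search(toy_next_log_probs, k=1, max_len=5)
wide = beam_search(toy_next_log_probs, k=2, max_len=5)
best_greedy = "".join(greedy[0][0][1:-1])  # strip [BOS]/[EOS]
best_wide = "".join(wide[0][0][1:-1])
```

In this toy table the greedy search commits to the locally best first token, while the width-2 search keeps a second hypothesis alive and ends with a higher-probability sequence, which is the usual motivation for setting the beam width above 1.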
    

Acknowledgement

  • The code for BERT with cross-attention layers (xbert.py) and the schedulers is modified from ALBEF.
  • The code for SMILES augmentation is taken from pysmilesutils.
