A trainable PyTorch reproduction of AlphaFold 3.

kehan777/Protenix

Protenix: Protein + X

For more information on the model's performance and capabilities, see our technical report.

You can follow us on Twitter or join the conversation on our Discord server.

⚡ Try it online

Installation and Preparations

Installing Protenix

Run with PyPI (recommended):

    # You may first need to install or update libxrender1 and libxext6. On Debian, for example:
    # apt-get update
    # apt-get install libxrender1
    # apt-get install libxext6
    pip3 install protenix

Run with Docker:

See the "Run with Docker" documentation.

Setting up kernels

  • Custom CUDA layernorm kernels, modified from FastFold and OneFlow, speed up training by about 30%-50%, depending on the training stage. To use this feature, run the following command:
    export LAYERNORM_TYPE=fast_layernorm
    If the environment variable LAYERNORM_TYPE is set to fast_layernorm, the model uses our custom layernorm kernels; otherwise, it falls back to the standard PyTorch layernorm. The kernels are compiled the first time fast_layernorm is called.
  • DeepSpeed DS4Sci_EvoformerAttention kernel is a memory-efficient attention kernel developed as part of a collaboration between OpenFold and the DeepSpeed4Science initiative. To use this feature, simply pass:
    --use_deepspeed_evo_attention true
    on the command line. DS4Sci_EvoformerAttention is implemented on top of CUTLASS, so you need to clone the CUTLASS repository and set the environment variable CUTLASS_PATH to its location. The Dockerfile already includes this setting:
    RUN git clone -b v3.5.1 https://github.com/NVIDIA/cutlass.git /opt/cutlass
    ENV CUTLASS_PATH=/opt/cutlass
    The kernels will be compiled when DS4Sci_EvoformerAttention is called for the first time.
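The two kernel settings above can also be applied from Python when you launch training or inference from a script. The following is a minimal sketch: the variable names LAYERNORM_TYPE and CUTLASS_PATH come from this README, while the helper configure_kernels is hypothetical and not part of Protenix. Since this README does not say when the variables are read, the safest assumption is to set them before any Protenix module is imported.

```python
import os
from pathlib import Path


def configure_kernels(cutlass_path=None, fast_layernorm=True):
    """Set the kernel-related environment variables for the current process.

    Hypothetical helper: only the variable names are taken from the README.
    """
    if fast_layernorm:
        os.environ["LAYERNORM_TYPE"] = "fast_layernorm"
    else:
        # Unset the variable to fall back to the standard PyTorch layernorm.
        os.environ.pop("LAYERNORM_TYPE", None)

    if cutlass_path is not None:
        path = Path(cutlass_path)
        if not path.is_dir():
            raise FileNotFoundError(f"CUTLASS checkout not found: {path}")
        os.environ["CUTLASS_PATH"] = str(path)

    # Report what is currently set so the caller can log it.
    return {name: os.environ.get(name) for name in ("LAYERNORM_TYPE", "CUTLASS_PATH")}
```

For example, configure_kernels("/opt/cutlass") mirrors the Dockerfile setting shown above; failing early on a missing CUTLASS directory is a design choice of this sketch, so that the error does not surface later during kernel compilation.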

Preparing the datasets

To download the wwPDB dataset and the preprocessed training data, you need at least 1 TB of disk space.

Use the following commands to download the preprocessed wwPDB training databases:

wget -P /af3-dev/release_data/ https://af3-dev.tos-cn-beijing.volces.com/release_data.tar.gz
tar -xzvf /af3-dev/release_data/release_data.tar.gz -C /af3-dev/release_data/
rm /af3-dev/release_data/release_data.tar.gz

The data should be placed in the /af3-dev/release_data/ directory. You can also download it to a different directory, but remember to modify DATA_ROOT_DIR in configs/configs_data.py accordingly. After extraction, the data hierarchy is as follows:

├── components.v20240608.cif [408M] # ccd source file
├── components.v20240608.cif.rdkit_mol.pkl [121M] # rdkit Mol object generated by ccd source file
├── indices [33M] # chain or interface entries
├── mmcif [283G]  # raw mmcif data
├── mmcif_bioassembly [36G] # preprocessed wwPDB structural data
├── mmcif_msa [450G] # msa files
├── posebusters_bioassembly [42M] # preprocessed posebusters structural data
├── posebusters_mmcif [361M] # raw mmcif data
├── recentPDB_bioassembly [1.5G] # preprocessed recentPDB structural data
└── seq_to_pdb_index.json [45M] # sequence to pdb id mapping file

With the above data, you can run the training demo from scratch. components.v20240608.cif and components.v20240608.cif.rdkit_mol.pkl are also used in the inference pipeline to generate the CCD reference features. If you only want to run inference, the full released data is not necessary; you can download these two files separately:

wget -P /af3-dev/release_data/ https://af3-dev.tos-cn-beijing.volces.com/release_data/components.v20240608.cif
wget -P /af3-dev/release_data/ https://af3-dev.tos-cn-beijing.volces.com/release_data/components.v20240608.cif.rdkit_mol.pkl
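Before starting a long run, it can help to confirm that the data root contains what the listing above describes. The sketch below checks for the entry names taken from that listing; the function missing_entries and its inference_only flag are hypothetical helpers, not part of Protenix, and the check only tests existence, not file integrity.

```python
from pathlib import Path

# The two CCD files that the inference pipeline needs (per the README).
INFERENCE_FILES = [
    "components.v20240608.cif",
    "components.v20240608.cif.rdkit_mol.pkl",
]

# Full listing after extracting release_data.tar.gz (per the README).
TRAINING_ENTRIES = INFERENCE_FILES + [
    "indices",
    "mmcif",
    "mmcif_bioassembly",
    "mmcif_msa",
    "posebusters_bioassembly",
    "posebusters_mmcif",
    "recentPDB_bioassembly",
    "seq_to_pdb_index.json",
]


def missing_entries(data_root, inference_only=False):
    """Return the expected entries that do not exist under data_root."""
    root = Path(data_root)
    expected = INFERENCE_FILES if inference_only else TRAINING_ENTRIES
    return [name for name in expected if not (root / name).exists()]
```

Calling missing_entries("/af3-dev/release_data/", inference_only=True) returns an empty list once the two CCD files are in place; pass your own directory if you changed DATA_ROOT_DIR.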

Data processing scripts are still being organized and prepared, and distillation data will be released in the future.

Running your first prediction

Model checkpoints

Use the following command to download pretrained checkpoint [1.4G]:

wget -P /af3-dev/release_model/ https://af3-dev.tos-cn-beijing.volces.com/release_model/model_v1.pt

The checkpoint should be placed in the /af3-dev/release_model/ directory.

Notebook demo

You can use notebooks/protenix_inference.ipynb to run the model inference.

Inference demo

You can run the script inference_demo.sh to do model inference:

bash inference_demo.sh

The arguments in this script are explained as follows:

  • load_checkpoint_path: path to the model checkpoints.
  • input_json_path: path to a JSON file that fully describes the input.
  • dump_dir: path to a directory where the results of the inference will be saved.
  • dtype: data type used in inference. Valid options include "bf16" and "fp32".
  • use_deepspeed_evo_attention: whether to use the EvoformerAttention provided by DeepSpeed.
  • use_msa: whether to use the MSA feature (default: true). To disable the MSA feature, add --use_msa false to the inference_demo.sh script.

Alternatively, you can run inference with:

# run with the examples folder
protenix_infer --input_json_path examples/example.json --dump_dir  ./output

Detailed information on the format of the input JSON file and the output files can be found in the input and output documentation.

Predicted structures for the posebusters set are available at:

https://af3-dev.tos-cn-beijing.volces.com/pb_samples_release.tar.gz

Training and Finetuning

Training demo

After the installation and data preparations, you can run the following command to train the model from scratch:

bash train_demo.sh 

The key arguments in this script are explained as follows:

  • dtype: data type used in training. Valid options include "bf16" and "fp32".

    • --dtype fp32: the model will be trained in full FP32 precision.
    • --dtype bf16: the model will be trained in BF16 mixed precision. By default, the SampleDiffusion, ConfidenceHead, Mini-rollout, and Loss parts still run in FP32 precision. If you want to train and run inference in full BF16 mixed precision, pass the following arguments to train_demo.sh:
      --skip_amp.sample_diffusion_training false \
      --skip_amp.confidence_head false \
      --skip_amp.sample_diffusion false \
      --skip_amp.loss false \
  • use_deepspeed_evo_attention: whether to use the EvoformerAttention provided by DeepSpeed, as mentioned above.

  • ema_decay: the decay rate of the EMA, default is 0.999.

  • sample_diffusion.N_step: during evaluation, the number of steps for the diffusion process is reduced to 20 to improve efficiency.

  • data.train_sets/data.test_sets: the datasets used for training and evaluation. If there are multiple datasets, separate them with commas.

  • Some settings follow those in the AlphaFold 3 paper. The table in model_performance.md shows the training settings and memory usage for the different training stages.

  • In this version, we do not use the template and RNA MSA features for training. The default settings in configs/configs_base.py and configs/configs_data.py are:

    --model.template_embedder.n_blocks 0 \
    --data.msa.enable_rna_msa false \

    This will be considered in our future work.

  • The model also supports distributed training with PyTorch’s torchrun. For example, if you’re running distributed training on a single node with 4 GPUs, you can use:

    torchrun --nproc_per_node=4 runner/train.py

    You can also pass additional arguments in the form --<ARGS_KEY> <ARGS_VALUE>.
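To give a feel for what ema_decay = 0.999 means, here is a minimal sketch of the standard exponential moving average update, ema = decay * ema + (1 - decay) * param. This is the textbook form and an assumption about the general technique; the Protenix implementation may differ in details such as warm-up or which parameters are tracked. The function ema_update is hypothetical.

```python
def ema_update(ema_params, params, decay=0.999):
    """Return new EMA values: decay * ema + (1 - decay) * param, element-wise."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


# Toy example: one scalar "weight" that jumps from 0.0 to 1.0 and stays there.
ema = [0.0]
for _ in range(1000):
    ema = ema_update(ema, [1.0])

# After n steps the EMA equals 1 - decay**n, so with decay=0.999 it has
# covered only about 63% of the jump after 1000 steps.
print(f"EMA after 1000 steps: {ema[0]:.3f}")
```

With a decay this close to 1, the averaged weights change slowly, which is why the fine-tuning example in this README loads the same checkpoint for both the model and its EMA copy.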

Finetune demo

If you want to fine-tune the model on a specific subset, such as an antibody dataset, you only need to provide a PDB list file and load the pretrained weights as finetune_demo.sh shows:

checkpoint_path="/af3-dev/release_model/model_v1.pt"
...

--load_checkpoint_path ${checkpoint_path} \
--load_checkpoint_ema_path ${checkpoint_path} \
--data.weightedPDB_before2109_wopb_nometalc_0925.base_info.pdb_list examples/subset.txt \

where subset.txt is a file containing PDB IDs, one per line, for example:

6hvq
5mqc
5zin
3ew0
5akv
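A malformed list file is an easy mistake to make before a long fine-tuning run. The sketch below reads a file like subset.txt and validates each entry, assuming one ID per line in the classic four-character PDB format (a digit followed by three alphanumeric characters), as in the example above. The helper read_pdb_list is hypothetical; how Protenix itself treats blank lines, uppercase IDs, or duplicates is not stated in this README.

```python
import re
from pathlib import Path

# Classic PDB ID: one digit followed by three alphanumeric characters.
PDB_ID_RE = re.compile(r"^[0-9][A-Za-z0-9]{3}$")


def read_pdb_list(path):
    """Read PDB IDs (one per line), lowercase them, and drop duplicates."""
    ids, seen = [], set()
    lines = Path(path).read_text().splitlines()
    for line_no, raw in enumerate(lines, start=1):
        pdb_id = raw.strip().lower()
        if not pdb_id:
            continue  # skip blank lines
        if not PDB_ID_RE.match(pdb_id):
            raise ValueError(f"{path}:{line_no}: not a 4-character PDB ID: {raw!r}")
        if pdb_id not in seen:
            seen.add(pdb_id)
            ids.append(pdb_id)
    return ids
```

Running read_pdb_list("examples/subset.txt") before training reports the first malformed line with its line number instead of failing somewhere inside the data loader.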

Performance

See the performance documentation.

Acknowledgements

The implementation of the layernorm operators is based on OneFlow and FastFold. We used OpenFold for some module implementations, except for LayerNorm.

Contribution

Please check Contributing for more details. If you encounter problems using Protenix, feel free to create an issue! We also welcome pull requests from the community.

Code of Conduct

Please check Code of Conduct for more details.

Security

If you discover a potential security issue in this project, or think you may have discovered a security issue, we ask that you notify Bytedance Security via our security center or vulnerability reporting email.

Please do not create a public GitHub issue.

License

This project, including code and model parameters, is made available under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License. You can find details at: https://creativecommons.org/licenses/by-nc/4.0/

For commercial use, please reach out to us at [email protected] for the commercial license. We welcome all types of collaborations.
