A trainable PyTorch reproduction of AlphaFold 3.
For more information on the model's performance and capabilities, see our technical report.
You can follow us on Twitter or join the conversation on the Discord server.
# You may need to install libxrender1 and libxext6 first. On Debian, run the following:
# apt-get update
# apt-get install libxrender1
# apt-get install libxext6
pip3 install protenix
See the run with docker documentation.
- Custom CUDA layernorm kernels, modified from FastFold and Oneflow, give a speedup of about 30%-50% during different training stages. To use this feature, run the following command:
export LAYERNORM_TYPE=fast_layernorm
If the environment variable LAYERNORM_TYPE is set to fast_layernorm, the model will use the layernorm we have developed; otherwise, the naive PyTorch layernorm will be used. The kernels will be compiled when fast_layernorm is called for the first time.
- The DeepSpeed DS4Sci_EvoformerAttention kernel is a memory-efficient attention kernel developed as part of a collaboration between OpenFold and the DeepSpeed4Science initiative. To use this feature, simply pass:
--use_deepspeed_evo_attention true
on the command line. DS4Sci_EvoformerAttention is implemented based on CUTLASS. You need to clone the CUTLASS repository and specify the path to it in the environment variable CUTLASS_PATH. The Dockerfile already includes this setting:
RUN git clone -b v3.5.1 https://github.com/NVIDIA/cutlass.git /opt/cutlass
ENV CUTLASS_PATH=/opt/cutlass
The kernels will be compiled when DS4Sci_EvoformerAttention is called for the first time.
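The two environment variables described above can be checked before a long run is launched. The following is a minimal sketch, not Protenix's own code: the helper names `select_layernorm` and `cutlass_ready` and the label `"torch"` are hypothetical, and the only behavior taken from the text is that the exact value `fast_layernorm` enables the custom kernels and that CUTLASS_PATH must point at a CUTLASS checkout.

```python
import os


def select_layernorm(env=None):
    """Return which layernorm the text above says would be used (hypothetical helper)."""
    env = os.environ if env is None else env
    # Only the exact value "fast_layernorm" enables the custom CUDA kernels.
    if env.get("LAYERNORM_TYPE") == "fast_layernorm":
        return "fast_layernorm"
    # "torch" is an illustrative label for the naive PyTorch layernorm.
    return "torch"


def cutlass_ready(env=None):
    """Return True if CUTLASS_PATH is set and names an existing directory."""
    env = os.environ if env is None else env
    path = env.get("CUTLASS_PATH", "")
    return bool(path) and os.path.isdir(path)
```

Calling `select_layernorm()` and `cutlass_ready()` with no arguments inspects the current process environment; passing a plain dict makes the checks easy to try without changing the shell.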
To download the wwPDB dataset and preprocessed training data, you need at least 1 TB of disk space.
Use the following command to download the preprocessed wwPDB training databases:
wget -P /af3-dev/release_data/ https://af3-dev.tos-cn-beijing.volces.com/release_data.tar.gz
tar -xzvf /af3-dev/release_data/release_data.tar.gz -C /af3-dev/release_data/
rm /af3-dev/release_data/release_data.tar.gz
The data should be placed in the /af3-dev/release_data/ directory. You can also download it to a different directory, but remember to modify DATA_ROOT_DIR in configs/configs_data.py accordingly. The data hierarchy after extraction is as follows:
├── components.v20240608.cif [408M] # ccd source file
├── components.v20240608.cif.rdkit_mol.pkl [121M] # rdkit Mol object generated by ccd source file
├── indices [33M] # chain or interface entries
├── mmcif [283G] # raw mmcif data
├── mmcif_bioassembly [36G] # preprocessed wwPDB structural data
├── mmcif_msa [450G] # msa files
├── posebusters_bioassembly [42M] # preprocessed posebusters structural data
├── posebusters_mmcif [361M] # raw mmcif data
├── recentPDB_bioassembly [1.5G] # preprocessed recentPDB structural data
└── seq_to_pdb_index.json [45M] # sequence to pdb id mapping file
With the above data, you can run the training demo from scratch. components.v20240608.cif and components.v20240608.cif.rdkit_mol.pkl are also used in the inference pipeline to generate the CCD reference features. If you only want to run inference, the full released data is not necessary; you can download these two files separately:
wget -P /af3-dev/release_data/ https://af3-dev.tos-cn-beijing.volces.com/release_data/components.v20240608.cif
wget -P /af3-dev/release_data/ https://af3-dev.tos-cn-beijing.volces.com/release_data/components.v20240608.cif.rdkit_mol.pkl
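After downloading, it can be useful to confirm that the expected entries are present before starting a run. The sketch below is an illustration only: the entry names are copied from the hierarchy listed above, while the function `missing_entries` and its `inference_only` flag are hypothetical and not part of Protenix.

```python
from pathlib import Path

# Entry names taken from the data hierarchy listed above.
FULL_RELEASE = [
    "components.v20240608.cif",
    "components.v20240608.cif.rdkit_mol.pkl",
    "indices",
    "mmcif",
    "mmcif_bioassembly",
    "mmcif_msa",
    "posebusters_bioassembly",
    "posebusters_mmcif",
    "recentPDB_bioassembly",
    "seq_to_pdb_index.json",
]

# The two CCD files that the text says are enough for inference.
INFERENCE_ONLY = FULL_RELEASE[:2]


def missing_entries(data_root, inference_only=False):
    """List expected entries that do not exist under data_root (hypothetical helper)."""
    root = Path(data_root)
    expected = INFERENCE_ONLY if inference_only else FULL_RELEASE
    return [name for name in expected if not (root / name).exists()]
```

For example, `missing_entries("/af3-dev/release_data", inference_only=True)` returns an empty list once both CCD files have been downloaded to the default location.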
Data processing scripts are still being organized and prepared, and distillation data will be released in the future.
Use the following command to download the pretrained checkpoint [1.4G]:
wget -P /af3-dev/release_model/ https://af3-dev.tos-cn-beijing.volces.com/release_model/model_v1.pt
The checkpoint should be placed in the /af3-dev/release_model/ directory.
You can use notebooks/protenix_inference.ipynb to run the model inference.
You can run the script inference_demo.sh to do model inference:
bash inference_demo.sh
The arguments in this script are explained as follows:
- load_checkpoint_path: path to the model checkpoints.
- input_json_path: path to a JSON file that fully describes the input.
- dump_dir: path to a directory where the inference results will be saved.
- dtype: data type used in inference. Valid options include "bf16" and "fp32".
- use_deepspeed_evo_attention: whether to use the EvoformerAttention provided by DeepSpeed.
- use_msa: whether to use the MSA feature; the default is true. To disable the MSA feature, add --use_msa false to the inference_demo.sh script.
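The arguments above are passed as `--key value` pairs, with booleans written as lowercase `true` or `false`. A minimal sketch of assembling such a list is shown below; `build_cli_args` is a hypothetical helper, not a Protenix function, and the only validation it performs is the dtype restriction stated above.

```python
VALID_DTYPES = ("bf16", "fp32")  # the two options named in the argument list above


def build_cli_args(options):
    """Turn a dict of options into a flat ["--key", "value", ...] list (hypothetical helper)."""
    args = []
    for key, value in options.items():
        if key == "dtype" and value not in VALID_DTYPES:
            raise ValueError(f"dtype must be one of {VALID_DTYPES}, got {value!r}")
        if isinstance(value, bool):
            # Booleans are spelled lowercase, as in "--use_msa false".
            value = "true" if value else "false"
        args += [f"--{key}", str(value)]
    return args


# Example: the options discussed above, with MSA disabled.
example_args = build_cli_args({
    "input_json_path": "examples/example.json",
    "dump_dir": "./output",
    "dtype": "bf16",
    "use_msa": False,
})
```

The resulting list can be appended to whichever command is being launched, for instance via `subprocess.run`.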
Alternatively, you can run inference with:
# run with the examples folder
protenix_infer --input_json_path examples/example.json --dump_dir ./output
Detailed information on the format of the input JSON file and the output files can be found in the input and output documentation.
Predicted structures for the posebusters set are available at:
https://af3-dev.tos-cn-beijing.volces.com/pb_samples_release.tar.gz
After the installation and data preparations, you can run the following command to train the model from scratch:
bash train_demo.sh
Key arguments in this script are explained as follows:
- dtype: data type used in training. Valid options include "bf16" and "fp32".
  - --dtype fp32: the model will be trained in full FP32 precision.
  - --dtype bf16: the model will be trained in BF16 mixed precision. By default, the SampleDiffusion, ConfidenceHead, Mini-rollout and Loss parts will still be trained in FP32 precision. If you want to train and infer the model in full BF16 mixed precision, pass the following arguments to train_demo.sh:
    --skip_amp.sample_diffusion_training false \
    --skip_amp.confidence_head false \
    --skip_amp.sample_diffusion false \
    --skip_amp.loss false \
- use_deepspeed_evo_attention: whether to use the EvoformerAttention provided by DeepSpeed, as mentioned above.
- ema_decay: the decay rate of the EMA; the default is 0.999.
- sample_diffusion.N_step: during evaluation, the number of steps for the diffusion process is reduced to 20 to improve efficiency.
- data.train_sets/data.test_sets: the datasets used for training and evaluation. If there are multiple datasets, separate them with commas.
- Some settings follow those in the AlphaFold 3 paper. The table in model_performance.md shows the training settings and memory usage for different training stages.
- In this version, we do not use the template and RNA MSA features for training, as reflected in the default settings in configs/configs_base.py and configs/configs_data.py:
    --model.template_embedder.n_blocks 0 \
    --data.msa.enable_rna_msa false \
  This will be considered in our future work.
- The model also supports distributed training with PyTorch’s torchrun. For example, if you’re running distributed training on a single node with 4 GPUs, you can use:
    torchrun --nproc_per_node=4 runner/train.py
  You can also pass other arguments with --<ARGS_KEY> <ARGS_VALUE> as needed.
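To give a feel for what the ema_decay value of 0.999 means, here is a minimal sketch of a standard exponential moving average update on plain Python floats. It assumes the usual rule, ema = decay * ema + (1 - decay) * param; the text does not confirm that Protenix uses exactly this form, and `ema_update` is a hypothetical name.

```python
def ema_update(ema_params, params, decay=0.999):
    """One standard EMA step over two equal-length lists of floats (illustrative only)."""
    # Each averaged value moves a fraction (1 - decay) of the way toward the new value.
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


# Starting from 0.0 and repeatedly seeing 1.0, the average approaches 1.0 slowly.
ema = [0.0]
for _ in range(1000):
    ema = ema_update(ema, [1.0])
```

With a decay of 0.999, each step contributes only 0.1% of the new value, so the average is dominated by roughly the last thousand updates.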
If you want to fine-tune the model on a specific subset, such as an antibody dataset, you only need to provide a PDB list file and load the pretrained weights as finetune_demo.sh shows:
checkpoint_path="/af3-dev/release_model/model_v1.pt"
...
--load_checkpoint_path ${checkpoint_path} \
--load_checkpoint_ema_path ${checkpoint_path} \
--data.weightedPDB_before2109_wopb_nometalc_0925.base_info.pdb_list examples/subset.txt \
where subset.txt is a file containing PDB IDs such as:
6hvq
5mqc
5zin
3ew0
5akv
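A PDB list file like the one above is easy to sanity-check before fine-tuning. The sketch below assumes the classic four-character ID form (a digit followed by three alphanumeric characters) and normalizes to lowercase to match the example; both choices are assumptions rather than documented Protenix requirements, and `parse_pdb_list` is a hypothetical helper.

```python
import re

# Assumed ID shape: one digit followed by three lowercase letters or digits.
PDB_ID = re.compile(r"^[0-9][a-z0-9]{3}$")


def parse_pdb_list(text):
    """Return unique, lowercased PDB IDs from one-ID-per-line text, preserving order."""
    ids = []
    for line in text.splitlines():
        entry = line.strip().lower()
        if not entry:
            continue  # skip blank lines
        if not PDB_ID.match(entry):
            raise ValueError(f"not a four-character PDB ID: {entry!r}")
        if entry not in ids:
            ids.append(entry)
    return ids


# The IDs shown above, as they would appear in subset.txt.
subset = parse_pdb_list("6hvq\n5mqc\n5zin\n3ew0\n5akv\n")
```

Writing the result back with `"\n".join(subset)` gives a cleaned list file with duplicates and blank lines removed.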
See the performance documentation.
The implementation of the layernorm operators draws on OneFlow and FastFold. We used OpenFold for some module implementations, except for LayerNorm.
Please check Contributing for more details. If you encounter problems using Protenix, feel free to create an issue! We also welcome pull requests from the community.
Please check Code of Conduct for more details.
If you discover a potential security issue in this project, or think you may have discovered a security issue, we ask that you notify Bytedance Security via our security center or vulnerability reporting email.
Please do not create a public GitHub issue.
This project, including code and model parameters, is made available under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License. You can find details at: https://creativecommons.org/licenses/by-nc/4.0/
For commercial use, please reach out to us at [email protected] for the commercial license. We welcome all types of collaborations.