Here we provide a few worked examples of creating graph datasets from protein structures.
The data contained within PPISP is drawn from DeepPPISP [1]. The authors collate a number of protein-protein interaction structures from three existing datasets. This is a node-classification task: the goal is to predict whether or not a residue in the graph participates in a protein-protein interaction. The authors also make available additional evolutionary information in the form of a position-specific scoring matrix (PSSM) for each protein.
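To make the node-classification framing concrete, here is a minimal sketch of how per-residue features and labels might be laid out for one protein. The function name `build_node_task`, the one-hot encoding, and the choice to concatenate it with the PSSM are illustrative assumptions, not the DeepPPISP pipeline or any library's API.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def build_node_task(sequence, pssm, interface_positions):
    """Return (node_features, node_labels) for one protein (hypothetical helper).

    sequence: one-letter residue string of length N
    pssm: array of shape (N, 20) with evolutionary profile scores
    interface_positions: 0-based indices of residues in an interaction site
    """
    n = len(sequence)
    pssm = np.asarray(pssm, dtype=float)
    if pssm.shape != (n, 20):
        raise ValueError(f"expected a PSSM of shape ({n}, 20), got {pssm.shape}")

    # One-hot encode residue identity (assumed feature choice).
    one_hot = np.zeros((n, 20))
    for i, aa in enumerate(sequence):
        one_hot[i, AMINO_ACIDS.index(aa)] = 1.0

    # Node features: residue identity plus the PSSM row for that residue.
    features = np.concatenate([one_hot, pssm], axis=1)

    # Node labels: 1 if the residue participates in an interaction, else 0.
    labels = np.zeros(n, dtype=int)
    labels[list(interface_positions)] = 1
    return features, labels


# Toy protein with five residues and an all-zero placeholder PSSM.
features, labels = build_node_task("MKTAY", np.zeros((5, 20)), [1, 3])
print(features.shape, labels.tolist())  # (5, 40) [0, 1, 0, 1, 0]
```

Each row of `features` corresponds to one node of the protein graph, and `labels` is the per-node target a classifier would be trained to predict.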
The authors describe the dataset construction as follows: "The three benchmark datasets are given, i.e., Dset_186, Dset_72 and PDBset_164. Dset_186 consists of 186 protein sequences with the resolution lower than 3.0 Å with sequence homology less than 25%. Dset_72 and PDBset_164 were constructed as the same as Dset_186. Dset_72 has 72 protein sequences and PDBset_164 consists of 164 protein sequences. These protein sequences in the three benchmark datasets have been annotated. Thus, we have 422 different annotated protein sequences. We remove two protein sequences for they do not have PSSM file."
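The counts in the quoted description can be checked with a few lines of arithmetic. This is only a sanity check on the numbers above; the variable names are illustrative.

```python
# Sizes of the three benchmark datasets as stated in the quoted description.
dataset_sizes = {"Dset_186": 186, "Dset_72": 72, "PDBset_164": 164}

total_annotated = sum(dataset_sizes.values())
removed_without_pssm = 2  # sequences dropped because no PSSM file exists
usable = total_annotated - removed_without_pssm

print(total_annotated, usable)  # 422 420
```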
The data contained within PSCDB is drawn from the Protein Structural Change Database [2]. The dataset consists of paired protein structures in their bound and unbound forms across 7 classes of structural rearrangement motion. Several tasks can be formulated with this dataset, e.g. predicting the bound conformation of a protein as an edge-prediction task, or a graph-classification task predicting which class of structural rearrangement a protein undergoes upon ligand binding.
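One way to picture the edge-prediction formulation is to build a residue contact graph for each form and compare the edge sets. The sketch below assumes one coordinate per residue (for example the Cα atom), identical residue ordering in both structures, and an 8 Å distance cutoff. All of these, along with the function names, are illustrative assumptions rather than the construction used by PSCDB or by any particular library.

```python
import numpy as np


def contact_edges(coords, cutoff=8.0):
    """Return the set of residue pairs (i, j), i < j, closer than `cutoff`."""
    coords = np.asarray(coords, dtype=float)
    # Pairwise Euclidean distances between all residues.
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Keep the strict upper triangle so each pair appears once, without self-loops.
    i_idx, j_idx = np.where(np.triu(dists < cutoff, k=1))
    return {(int(i), int(j)) for i, j in zip(i_idx, j_idx)}


def edge_changes(unbound_coords, bound_coords, cutoff=8.0):
    """Return (gained, lost) contact edges when going from unbound to bound."""
    unbound = contact_edges(unbound_coords, cutoff)
    bound = contact_edges(bound_coords, cutoff)
    return bound - unbound, unbound - bound


# Toy example: three residues on a line; residue 2 moves closer upon binding.
unbound = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [10.0, 0.0, 0.0]]
bound = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [7.0, 0.0, 0.0]]

gained, lost = edge_changes(unbound, bound)
print(gained, lost)  # {(0, 2)} set()
```

In this framing, the gained edges are the positive examples an edge-prediction model would try to recover from the unbound graph alone.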
These datasets are unions of structural protein-ligand, protein-metal, protein-nucleotide and protein-nucleic acid interactions sourced from ccPDB [3, 4]. Each dataset is a collection of non-redundant, ligand-interacting PDB protein chains generated by Blastclust (25% identity) and LPC. PDB resolution is maximum 3

Dataset | # Classes | # Proteins | Notebook |
---|---|---|---|
PROTEINS_LIGANDS | 7 | 3312 | |
PROTEINS_METAL | 7 | 6262 | |
PROTEINS_NUCLEIC | 2 | 975 | |
PROTEINS_NUCLEOTIDES | 8 | 1226 | |

The ligands in PROTEINS_LIGANDS are:

PDB Ligand | Ligand Name | # Proteins | # Residues | Dataset |
---|---|---|---|---|
SO4 | Sulphate | 3312 | 954518 | SO4_3312 |
PO4 | Phosphate | 1299 | 401977 | PO4_1299 |
NAG | N-Acetylglucosamine | 727 | 267137 | NAG_727 |
HEM | Heme | 176 | 48737 | HEM176 |
BME | Beta-Mercaptoethanol | 191 | 52910 | BME191 |
EDO | Ethylene Glycol | 1507 | 438555 | EDO1507 |
PLP | Vitamin B6 Phosphate | 65 | 26754 | PLP65 |

The nucleotides in PROTEINS_NUCLEOTIDES are:

PDB Ligand | Ligand Name | # Proteins | # Residues | Dataset |
---|---|---|---|---|
ATP | Adenosine Triphosphate | 313 | 127493 | ATP313 |
ADP | Adenosine Diphosphate | 353 | 152419 | ADP353 |
GTP | Guanosine Triphosphate | 83 | 40442 | GTP83 |
GDP | Guanosine Diphosphate | 120 | 41412 | GDP120 |
NAD | Nicotinamide Adenine Dinucleotide | 140 | 53276 | NAD140 |
FAD | Flavin Adenine Dinucleotide | 172 | 73733 | FAD172 |
FMN | Flavin Mononucleotide | 117 | 33514 | FMN117 |
UDP | Uridine Diphosphate | 68 | 25064 | UDP68 |

The metals in PROTEINS_METAL are:

PDB Ligand | Ligand Name | # Proteins | # Residues | Dataset |
---|---|---|---|---|
Fe | Iron | 215 | 69779 | Fe215 |
Mg | Magnesium | 1908 | 655860 | MG1908 |
Ca | Calcium | 1402 | 468225 | CA1402 |
Mn | Manganese | 521 | 179324 | MN521 |
Zn | Zinc | 1660 | 487759 | ZN1660 |
Co | Cobalt | 201 | 56079 | CO201 |
Ni | Nickel | 355 | 95902 | NI355 |

The nucleic acids in PROTEINS_NUCLEIC are:

Ligand Name | # Proteins | # Residues | Dataset |
---|---|---|---|
DNA | 560 | 168746 | DNA560 |
RNA | 415 | 117841 | RNA415 |
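The per-ligand tables can be loaded and summarised with a small amount of code. The sketch below parses the nucleic acid table as plain text and computes the total number of proteins and the mean number of residues per protein. The parser `parse_pipe_table` is a hypothetical helper written for this example and only handles simple pipe tables like the ones shown here.

```python
TABLE = """\
Ligand Name | # Proteins | # Residues | Dataset |
---|---|---|---|
DNA | 560 | 168746 | DNA560 |
RNA | 415 | 117841 | RNA415 |
"""


def split_row(line):
    """Split one pipe-delimited row into stripped cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def parse_pipe_table(text):
    """Parse a simple pipe table into a list of dicts keyed by header name."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = split_row(lines[0])
    # lines[1] is the ---|--- separator row and is skipped.
    return [dict(zip(header, split_row(ln))) for ln in lines[2:]]


rows = parse_pipe_table(TABLE)

total_proteins = sum(int(row["# Proteins"]) for row in rows)
mean_residues = {
    row["Ligand Name"]: int(row["# Residues"]) / int(row["# Proteins"])
    for row in rows
}

print(total_proteins)  # 975
for name, mean in mean_residues.items():
    print(name, round(mean, 1))
```

The total of 975 agrees with the protein count listed for PROTEINS_NUCLEIC in the summary table.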
[1] Min Zeng, Fuhao Zhang, Fang-Xiang Wu, Yaohang Li, Jianxin Wang, Min Li. Protein-protein interaction site prediction through combining local and global features with deep neural networks. Bioinformatics. DOI:10.1093/bioinformatics/btz699
[2] Amemiya, T., Koike, R., Kidera, A., & Ota, M. (2011). PSCDB: a database for protein structural change upon ligand binding. Nucleic Acids Research, 40(D1), D554–D558. https://doi.org/10.1093/nar/gkr966
[3] Agrawal et al. (2019). ccPDB 2.0: an updated version of datasets created and compiled from Protein Data Bank. Database, Volume 2019, 1 January 2019, bay142.
[4] Singh et al. (2012). ccPDB: compilation and creation of data sets from Protein Data Bank. Nucleic Acids Res. 40(1):D486-9.