Datasets

Here we provide a few worked examples of creating graph datasets from protein structures.

PPISP - Protein-Protein Interaction Site Prediction

The data contained within PPISP is drawn from DeepPPISP [1], whose authors collate a number of protein-protein interaction structures from three existing datasets. This is a node-classification task: the goal is to predict whether or not a residue in the graph participates in a protein-protein interaction. The authors also make available additional evolutionary information in the form of a position-specific scoring matrix (PSSM) for each protein.
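To make the node-classification framing concrete, here is a minimal sketch of how per-residue features and binary labels could be assembled. It assumes a PSSM with one row of 20 scores per residue and a set of 0-based interface positions; the function names (`one_hot`, `build_node_data`) and the toy data are hypothetical and are not taken from DeepPPISP or from this repository.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(residue):
    """Return a 20-dimensional one-hot encoding of a single residue."""
    return [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]


def build_node_data(sequence, pssm, interface_positions):
    """Build per-residue features (one-hot + PSSM row) and 0/1 labels.

    sequence: protein sequence as a string
    pssm: list of rows, one row of scores per residue (assumed layout)
    interface_positions: set of 0-based indices of interacting residues
    """
    if len(pssm) != len(sequence):
        raise ValueError("PSSM must have one row per residue")
    features = [one_hot(res) + list(row) for res, row in zip(sequence, pssm)]
    labels = [1 if i in interface_positions else 0 for i in range(len(sequence))]
    return features, labels


# Toy example with made-up data (not a real DeepPPISP entry).
sequence = "MKTAY"
pssm = [[0.0] * 20 for _ in sequence]
features, labels = build_node_data(sequence, pssm, interface_positions={1, 3})
print(len(features), len(features[0]), labels)
```

Each node then carries a 40-dimensional feature vector and a single binary target, which is the shape a node-classification model would expect.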

The authors describe the dataset construction as follows: "The three benchmark datasets are given, i.e., Dset_186, Dset_72 and PDBset_164. Dset_186 consists of 186 protein sequences with the resolution lower than 3.0 Å with sequence homology less than 25%. Dset_72 and PDBset_164 were constructed as the same as Dset_186. Dset_72 has 72 protein sequences and PDBset_164 consists of 164 protein sequences. These protein sequences in the three benchmark datasets have been annotated. Thus, we have 422 different annotated protein sequences. We remove two protein sequences for they do not have PSSM file."
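The counts in the description above can be checked with a few lines of arithmetic. The final figure of usable sequences is implied by the quoted text rather than stated in it, so treat it as a derived value.

```python
# Sizes of the three benchmark datasets, as quoted above.
benchmark_sizes = {"Dset_186": 186, "Dset_72": 72, "PDBset_164": 164}

total_annotated = sum(benchmark_sizes.values())
missing_pssm = 2  # sequences removed because they have no PSSM file
usable_sequences = total_annotated - missing_pssm

print(total_annotated, usable_sequences)
```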

PSCDB - Protein Structural Change Database

The data contained within PSCDB is drawn from the Protein Structural Change Database [2]. The dataset consists of paired protein structures in their bound and unbound forms across 7 classes of structural rearrangement motion. Several tasks can be formulated with this dataset, for example predicting the bound conformation of a protein as an edge-prediction task, or predicting which class of structural rearrangement a protein undergoes upon ligand binding as a graph-classification task.
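One way to picture the edge-prediction formulation is to compare residue contact graphs of the unbound and bound forms. The sketch below is an illustration under stated assumptions, not the method used by PSCDB or this repository: it assumes one representative coordinate per residue, identical residue indexing in both structures, and an arbitrary 8 Å contact cutoff. The function names are hypothetical.

```python
import math


def contact_edges(coords, cutoff=8.0):
    """Return the set of residue pairs (i, j), i < j, closer than `cutoff` Å."""
    edges = set()
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) < cutoff:
                edges.add((i, j))
    return edges


def edge_prediction_targets(unbound_coords, bound_coords, cutoff=8.0):
    """Edges gained or lost when going from the unbound to the bound form."""
    unbound = contact_edges(unbound_coords, cutoff)
    bound = contact_edges(bound_coords, cutoff)
    return {"gained": bound - unbound, "lost": unbound - bound}


# Toy coordinates (made up): the third residue moves closer upon binding.
unbound = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
bound = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
targets = edge_prediction_targets(unbound, bound)
print(targets)
```

A model for this task would take the unbound graph as input and be trained to predict the edges of the bound graph.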

PROTEINS_{LIGANDS / METAL / NUCLEOTIDES / NUCLEIC}

These datasets are unions of structural protein-ligand / protein-metal / protein-nucleotide / protein-nucleic acid interactions sourced from ccPDB [3, 4]. Each dataset is a collection of non-redundant, ligand-interacting PDB chains generated with Blastclust (25% identity) and LPC. The maximum PDB resolution is 3 Å, the minimum chain length is 80 amino acids, and the interaction distance (for example, the distance between SO4 and an interacting amino acid) is 0-4.0 Å.
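The selection criteria above can be expressed as simple filters. The following sketch is only an illustration: the `ChainRecord` type, the function names, and the toy entries are hypothetical, and the redundancy-removal step (Blastclust at 25% identity) is not reproduced here.

```python
import math
from dataclasses import dataclass

MAX_RESOLUTION = 3.0            # Å, maximum PDB resolution
MIN_CHAIN_LENGTH = 80           # amino acids
MAX_INTERACTION_DISTANCE = 4.0  # Å, ligand-to-residue distance


@dataclass
class ChainRecord:
    """Hypothetical container for the metadata of one PDB chain."""
    chain_id: str
    resolution: float
    length: int


def passes_filters(chain):
    """Apply the resolution and chain-length criteria."""
    return chain.resolution <= MAX_RESOLUTION and chain.length >= MIN_CHAIN_LENGTH


def interacting_residues(residue_atoms, ligand_atoms, cutoff=MAX_INTERACTION_DISTANCE):
    """Return indices of residues with any atom within `cutoff` Å of the ligand.

    residue_atoms: dict mapping residue index -> list of (x, y, z) atom coordinates
    ligand_atoms: list of (x, y, z) coordinates of the ligand atoms
    """
    hits = []
    for index, atoms in residue_atoms.items():
        if any(math.dist(a, l) <= cutoff for a in atoms for l in ligand_atoms):
            hits.append(index)
    return sorted(hits)


# Toy data (made up, not real PDB entries).
chains = [
    ChainRecord("chain_a", resolution=2.1, length=150),
    ChainRecord("chain_b", resolution=3.5, length=200),  # resolution too poor
    ChainRecord("chain_c", resolution=1.8, length=60),   # chain too short
]
kept = [c.chain_id for c in chains if passes_filters(c)]
site = interacting_residues({0: [(0.0, 0.0, 0.0)], 1: [(10.0, 0.0, 0.0)]}, [(3.0, 0.0, 0.0)])
print(kept, site)
```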

| Dataset | # Classes | # Proteins | Notebook |
| --- | --- | --- | --- |
| PROTEINS_LIGANDS | 7 | 3312 | Open In Colab |
| PROTEINS_METAL | 7 | 6262 | Open In Colab |
| PROTEINS_NUCLEIC | 2 | 975 | Open In Colab |
| PROTEINS_NUCLEOTIDES | 8 | 1226 | Open In Colab |

PROTEINS_LIGANDS

| PDB Ligand | Ligand Name | # Proteins | # Residues | Dataset |
| --- | --- | --- | --- | --- |
| SO4 | Sulphate | 3312 | 954518 | SO4_3312 |
| PO4 | Phosphate | 1299 | 401977 | PO4_1299 |
| NAG | N-Acetylglucosamine | 727 | 267137 | NAG_727 |
| HEM | Heme | 176 | 48737 | HEM176 |
| BME | Beta-Mercaptoethanol | 191 | 52910 | BME191 |
| EDO | Ethylene Glycol | 1507 | 438555 | EDO1507 |
| PLP | Vitamin B6 Phosphate | 65 | 26754 | PLP65 |

PROTEINS_NUCLEOTIDES

| PDB Ligand | Ligand Name | # Proteins | # Residues | Dataset |
| --- | --- | --- | --- | --- |
| ATP | Adenosine Triphosphate | 313 | 127493 | ATP313 |
| ADP | Adenosine Diphosphate | 353 | 152419 | ADP353 |
| GTP | Guanosine Triphosphate | 83 | 40442 | GTP83 |
| GDP | Guanosine Diphosphate | 120 | 41412 | GDP120 |
| NAD | Nicotinamide Adenine Dinucleotide | 140 | 53276 | NAD140 |
| FAD | Flavin Adenine Dinucleotide | 172 | 73733 | FAD172 |
| FMN | Flavin Mononucleotide | 117 | 33514 | FMN117 |
| UDP | Uridine Diphosphate | 68 | 25064 | UDP68 |

PROTEINS_METAL

| PDB Ligand | Ligand Name | # Proteins | # Residues | Dataset |
| --- | --- | --- | --- | --- |
| Fe | Iron | 215 | 69779 | Fe215 |
| Mg | Magnesium | 1908 | 655860 | MG1908 |
| Ca | Calcium | 1402 | 468225 | CA1402 |
| Mn | Manganese | 521 | 179324 | MN521 |
| Zn | Zinc | 1660 | 487759 | ZN1660 |
| Co | Cobalt | 201 | 56079 | CO201 |
| Ni | Nickel | 355 | 95902 | NI355 |
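The per-metal protein counts above are strongly imbalanced (for example, 1908 magnesium-binding proteins versus 201 cobalt-binding ones). If these are used as classes in a classification task, one common heuristic is inverse-frequency class weighting. The sketch below uses the counts from the table and assumes each protein belongs to exactly one class; the weighting formula is a generic choice, not something prescribed by this repository.

```python
# Protein counts per metal, copied from the table above.
metal_counts = {
    "Fe": 215, "Mg": 1908, "Ca": 1402, "Mn": 521,
    "Zn": 1660, "Co": 201, "Ni": 355,
}

total_proteins = sum(metal_counts.values())
n_classes = len(metal_counts)

# "Balanced" heuristic: weight = total / (n_classes * class_count),
# so rare classes receive proportionally larger weights.
class_weights = {
    metal: total_proteins / (n_classes * count)
    for metal, count in metal_counts.items()
}

for metal, weight in sorted(class_weights.items(), key=lambda kv: kv[1]):
    print(f"{metal}: {weight:.2f}")
```

The total computed here matches the 6262 proteins listed for PROTEINS_METAL in the summary table.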

PROTEINS_NUCLEIC

| Ligand Name | # Proteins | # Residues | Dataset |
| --- | --- | --- | --- |
| DNA | 560 | 168746 | DNA560 |
| RNA | 415 | 117841 | RNA415 |

References

[1] Min Zeng, Fuhao Zhang, Fang-Xiang Wu, Yaohang Li, Jianxin Wang, Min Li. Protein-protein interaction site prediction through combining local and global features with deep neural networks. Bioinformatics. DOI:10.1093/bioinformatics/btz699

[2] Amemiya, T., Koike, R., Kidera, A., & Ota, M. (2011). PSCDB: a database for protein structural change upon ligand binding. Nucleic Acids Research, 40(D1), D554–D558. https://doi.org/10.1093/nar/gkr966

[3] Agrawal et al., (2019) ccPDB 2.0: an updated version of datasets created and compiled from Protein Data Bank. Database, Volume 2019, 1 January 2019, bay142

[4] Singh et al., (2012) ccPDB: compilation and creation of data sets from Protein Data Bank. Nucleic Acids Res. 40(1):D486-9