Datasets

Here we provide a few worked examples of creating graph datasets from protein structures.

PPISP - Protein-Protein Interaction Site Prediction

The data contained within PPISP is drawn from DeepPPISP [1]. The authors collate a number of protein-protein interaction structures from three existing datasets. This is a node-classification task: the goal is to predict whether or not a residue in the graph participates in a protein-protein interaction. The authors also make available additional evolutionary information in the form of a PSSM for each protein.
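To make the node-classification framing concrete, the following is a minimal sketch of how a residue graph with per-node labels and PSSM features could be represented. The class names, the residue identifier format, and the zero-valued PSSM rows are illustrative assumptions, not the layout used by DeepPPISP or by the files in this directory.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ResidueNode:
    residue_id: str        # hypothetical identifier format, e.g. "A:LYS:2"
    pssm_row: List[float]  # 20 PSSM scores, one per amino acid type
    is_interface: int      # 1 if the residue participates in a PPI, else 0


@dataclass
class ProteinGraph:
    nodes: List[ResidueNode] = field(default_factory=list)
    edges: List[Tuple[int, int]] = field(default_factory=list)  # node index pairs

    def features(self) -> List[List[float]]:
        """Node feature matrix (one PSSM row per residue)."""
        return [node.pssm_row for node in self.nodes]

    def labels(self) -> List[int]:
        """Binary node labels: the prediction target of the task."""
        return [node.is_interface for node in self.nodes]


# Toy three-residue chain with placeholder (all-zero) PSSM rows.
toy_graph = ProteinGraph(
    nodes=[
        ResidueNode("A:MET:1", [0.0] * 20, 0),
        ResidueNode("A:LYS:2", [0.0] * 20, 1),
        ResidueNode("A:GLU:3", [0.0] * 20, 1),
    ],
    edges=[(0, 1), (1, 2)],
)
```

A model for this task would take `toy_graph.features()` and `toy_graph.edges` as input and be trained to reproduce `toy_graph.labels()`.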

The authors describe the dataset construction as follows: "The three benchmark datasets are given, i.e., Dset_186, Dset_72 and PDBset_164. Dset_186 consists of 186 protein sequences with the resolution lower than 3.0 Å with sequence homology less than 25%. Dset_72 and PDBset_164 were constructed as the same as Dset_186. Dset_72 has 72 protein sequences and PDBset_164 consists of 164 protein sequences. These protein sequences in the three benchmark datasets have been annotated. Thus, we have 422 different annotated protein sequences. We remove two protein sequences for they do not have PSSM file."
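The counts in the description above can be checked with a few lines of arithmetic. The variable names are illustrative; the numbers come directly from the quoted text.

```python
# Sizes of the three benchmark sets as stated by the authors.
dataset_sizes = {"Dset_186": 186, "Dset_72": 72, "PDBset_164": 164}

# Total number of annotated sequences before filtering.
total_annotated = sum(dataset_sizes.values())

# Two sequences are dropped because they have no PSSM file.
missing_pssm = 2
usable_sequences = total_annotated - missing_pssm

print(total_annotated, usable_sequences)  # 422 420
```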

PSCDB - Protein Structural Change Database

The data contained within PSCDB is drawn from the Protein Structural Change Database [2]. The dataset consists of paired protein structures in their bound and unbound forms across 7 classes of structural rearrangement motion. Several tasks can be formulated with this dataset, e.g. predicting the bound conformation of a protein as an edge-prediction task, or a graph-classification task predicting which class of structural rearrangement a protein undergoes upon ligand binding.
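As a rough illustration of the edge-prediction framing, the sketch below compares the contact edges of an unbound graph with those of its bound partner. It assumes both graphs index the same residues identically and that edges are undirected; the function name and the toy edge lists are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[int, int]


def contact_changes(
    unbound_edges: Iterable[Edge], bound_edges: Iterable[Edge]
) -> Dict[str, Set[Edge]]:
    """Split contacts into those formed, broken, and kept upon binding."""
    # Normalise each undirected edge so (1, 0) and (0, 1) compare equal.
    unbound = {tuple(sorted(edge)) for edge in unbound_edges}
    bound = {tuple(sorted(edge)) for edge in bound_edges}
    return {
        "formed": bound - unbound,  # present only in the bound form
        "broken": unbound - bound,  # present only in the unbound form
        "kept": bound & unbound,    # present in both forms
    }


# Toy pair of graphs over four residues.
changes = contact_changes(
    unbound_edges=[(0, 1), (1, 2), (2, 3)],
    bound_edges=[(1, 0), (1, 2), (0, 3)],
)
```

In this framing a model receives the unbound graph and is asked to predict the bound edge set (`formed` plus `kept`). For the graph-classification variant, each pair would instead carry a single label for one of the 7 motion classes.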

PROTEINS_{LIGANDS / METAL / NUCLEOTIDES / NUCLEIC}

These datasets are unions of structural protein-ligand / protein-metal / protein-nucleotide / protein-nucleic acid interactions sourced from ccPDB [3, 4]. Each dataset is a collection of non-redundant, ligand-interacting PDB protein chains generated with Blastclust (25% identity) and LPC. The maximum PDB resolution is 3 Å, the minimum chain length is 80 amino acids, and the interaction distance (for example, between SO4 and an amino acid) is 0-4.0 Å.
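The selection criteria above can be expressed as a simple filter. This is a sketch under stated assumptions: the `ChainRecord` fields and the toy records are hypothetical, and the redundancy reduction step (Blastclust at 25% identity) is not reproduced here.

```python
from dataclasses import dataclass
from typing import List

MAX_RESOLUTION = 3.0            # Å, maximum PDB resolution
MIN_CHAIN_LENGTH = 80           # amino acids
MAX_INTERACTION_DISTANCE = 4.0  # Å, upper bound of the 0-4.0 Å range


@dataclass
class ChainRecord:
    chain_id: str               # hypothetical identifier
    resolution: float           # Å
    chain_length: int           # number of amino acids
    min_ligand_distance: float  # Å, closest ligand-residue distance


def passes_filters(record: ChainRecord) -> bool:
    """Return True if the chain satisfies all three stated criteria."""
    return (
        record.resolution <= MAX_RESOLUTION
        and record.chain_length >= MIN_CHAIN_LENGTH
        and 0.0 <= record.min_ligand_distance <= MAX_INTERACTION_DISTANCE
    )


# Toy records; the identifiers and values are made up for illustration.
records: List[ChainRecord] = [
    ChainRecord("toy_A", resolution=2.1, chain_length=150, min_ligand_distance=3.2),
    ChainRecord("toy_B", resolution=3.5, chain_length=200, min_ligand_distance=2.8),
    ChainRecord("toy_C", resolution=1.8, chain_length=60, min_ligand_distance=3.0),
    ChainRecord("toy_D", resolution=2.9, chain_length=95, min_ligand_distance=5.1),
]
selected = [record.chain_id for record in records if passes_filters(record)]
```

Only `toy_A` meets all three thresholds: `toy_B` fails on resolution, `toy_C` on chain length, and `toy_D` on interaction distance.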

Dataset	# Classes	# Proteins
PROTEINS_LIGANDS	7	3312
PROTEINS_METAL	7	6262
PROTEINS_NUCLEIC	2	975
PROTEINS_NUCLEOTIDES	8	1226

PROTEINS_LIGANDS

PDB Ligand	Ligand Name	# Proteins	# Residues	Dataset
SO4	Sulphate	3312	954518	SO4_3312
PO4	Phosphate	1299	401977	PO4_1299
NAG	N-Acetylglucosamine	727	267137	NAG_727
HEM	Heme	176	48737	HEM176
BME	Beta-Mercaptoethanol	191	52910	BME191
EDO	Ethylene Glycol	1507	438555	EDO1507
PLP	Vitamin B6 Phosphate	65	26754	PLP65

PROTEINS_NUCLEOTIDES

PDB Ligand	Ligand Name	# Proteins	# Residues	Dataset
ATP	Adenosine Triphosphate	313	127493	ATP313
ADP	Adenosine Diphosphate	353	152419	ADP353
GTP	Guanosine Triphosphate	83	40442	GTP83
GDP	Guanosine Diphosphate	120	41412	GDP120
NAD	Nicotinamide Adenine Dinucleotide	140	53276	NAD140
FAD	Flavin Adenine Dinucleotide	172	73733	FAD172
FMN	Flavin Mononucleotide	117	33514	FMN117
UDP	Uridine Diphosphate	68	25064	UDP68

PROTEINS_METAL

PDB Ligand	Ligand Name	# Proteins	# Residues	Dataset
Fe	Iron	215	69779	Fe215
Mg	Magnesium	1908	655860	MG1908
Ca	Calcium	1402	468225	CA1402
Mn	Manganese	521	179324	MN521
Zn	Zinc	1660	487759	ZN1660
Co	Cobalt	201	56079	CO201
Ni	Nickel	355	95902	NI355

PROTEINS_NUCLEIC

Ligand Name	# Proteins	# Residues	Dataset
DNA	560	168746	DNA560
RNA	415	117841	RNA415
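The dataset identifiers in the tables above combine a ligand code with a protein count, sometimes separated by an underscore (`SO4_3312`) and sometimes not (`HEM176`). The helper below is a minimal sketch for splitting such identifiers; the function name is hypothetical, and it assumes that identifiers without an underscore use a letters-only ligand code, which holds for every entry listed here.

```python
import re
from typing import Tuple

# With an underscore the ligand code may contain digits (e.g. "SO4_3312").
_WITH_UNDERSCORE = re.compile(r"^([A-Za-z0-9]+)_(\d+)$")
# Without an underscore the ligand code is assumed to be letters only (e.g. "HEM176").
_WITHOUT_UNDERSCORE = re.compile(r"^([A-Za-z]+)(\d+)$")


def parse_dataset_name(name: str) -> Tuple[str, int]:
    """Split a dataset identifier into (ligand code, protein count)."""
    match = _WITH_UNDERSCORE.match(name) or _WITHOUT_UNDERSCORE.match(name)
    if match is None:
        raise ValueError(f"Unrecognised dataset name: {name!r}")
    return match.group(1), int(match.group(2))


parsed = [parse_dataset_name(n) for n in ("SO4_3312", "HEM176", "Fe215", "DNA560")]
```

The parsed count can then be compared against the "# Proteins" column as a quick consistency check after loading a dataset.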

References

[1] Min Zeng, Fuhao Zhang, Fang-Xiang Wu, Yaohang Li, Jianxin Wang, Min Li. Protein-protein interaction site prediction through combining local and global features with deep neural networks. Bioinformatics. DOI:10.1093/bioinformatics/btz699

[2] Amemiya, T., Koike, R., Kidera, A., & Ota, M. (2011). PSCDB: a database for protein structural change upon ligand binding. Nucleic Acids Research, 40(D1), D554–D558. https://doi.org/10.1093/nar/gkr966

[3] Agrawal et al., (2019) ccPDB 2.0: an updated version of datasets created and compiled from Protein Data Bank. Database, Volume 2019, 1 January 2019, bay142

[4] Singh et al., (2012) ccPDB: compilation and creation of data sets from Protein Data Bank. Nucleic Acids Res. 40(1):D486-9
