Graphein - Residue Graph Tutorial#
In this notebook, we’ll run through residue-level graph construction in Graphein. We start by discsussing the config, the high-level API and spend the bulk of the tutorial running through the various options. At the end we’ll see how the low-level API can be used to control each step of the graph construction process.
[1]:
# Install Graphein if necessary
# !pip install graphein
# Install pymol if necessary - in this tutorial PyMol is only used for the initial plot. Feel free to skip!
# sudo apt-get install pymol (recommended for colab) OR conda install -c schrodinger pymol
First, let’s checkout the protein we’ll be playing with today:
[2]:
# NBVAL_SKIP
from graphein.utils.pymol import MolViewer
pymol = MolViewer()
pymol.delete("all") # delete all objects from other sessions if necessary.
pymol.fetch("3eiy")
pymol.show_as("cartoon")
pymol.display()
[2]:
Config#
Graphein is designed for processing datasets of protein structures into graphs. We rely on a global config object, `ProteinGraphConfig
<https://graphein.ai/modules/graphein.protein.html#graphein.protein.config.ProteinGraphConfig>`__ to store various parameters in the high-level API. We use `pydantic
<https://pydantic-docs.helpmanual.io/>`__ for the config object and provide sane defaults.
[3]:
from graphein.protein.config import ProteinGraphConfig
config = ProteinGraphConfig()
config.dict()
[3]:
{'granularity': 'CA',
'keep_hets': False,
'insertions': False,
'pdb_dir': PosixPath('../examples/pdbs'),
'verbose': False,
'exclude_waters': True,
'deprotonate': False,
'protein_df_processing_functions': None,
'edge_construction_functions': [<function graphein.protein.edges.distance.add_peptide_bonds(G: 'nx.Graph') -> 'nx.Graph'>],
'node_metadata_functions': [<function graphein.protein.features.nodes.amino_acid.meiler_embedding(n, d, return_array: bool = False) -> Union[pandas.core.series.Series, <built-in function array>]>],
'edge_metadata_functions': None,
'graph_metadata_functions': None,
'get_contacts_config': None,
'dssp_config': None}
Let’s run through the parameters of `ProteinGraphConfig
<https://graphein.ai/modules/graphein.protein.html#graphein.protein.config.ProteinGraphConfig>`__:
granularity
: specifies the granularity of the graph (i.e. what should the nodes be). Possible values are: atom identifiers (e.g."CA"
for \(\alpha\) carbon,"CB"
for \(\beta\) carbon),"centroid"
to use residue centroids (under the hood, this is the same as"CA"
, but we use the averagex,y,z
coordinates for the atoms in the residue) or “atom” for atom-level construction. This is discussed in another notebook.keep_hets
: this is a boolean specifying whether or not to keep heteroatoms present in the.pdb
file. Heteroatoms are typically non-protein atoms (waters, metal ions, ligands) but can sometimes contain non-standard or modified residues.insertions
: boolean specifying whether or not to keep insertions in the PDB filepdb_dir
optional path to a folder in which to save pdb files. Otherwise,/tmp/
will be usedverbose
: bool controlling amount of info printedexclude_waters
: not implementeddeprotonate
: bool indicating whether or not to remove Hydrogen atomsprotein_df_processing_functions
: list of functions with which to process the PDB dataframe. Discussed in the low-level API.edge_construction_functions
: list of functions to compute edges withnode_metadata_functions
: list of functions to annotate nodes withedge_metadata_functions
: list of functions to annotate edges withgraph_meta_functions
: list of functions to annotate graph withget_contacts_config
: A separate config object if usingGetContacts
edge construction functions
If you wish, you can construct an entirely fresh configuration. Or, if we wish to change only some of these parameters, we can pass a dictionary containing our modifications:
[4]:
params_to_change = {"granularity": "centroids"}
config = ProteinGraphConfig(**params_to_change)
config.dict()
[4]:
{'granularity': 'centroids',
'keep_hets': False,
'insertions': False,
'pdb_dir': PosixPath('../examples/pdbs'),
'verbose': False,
'exclude_waters': True,
'deprotonate': False,
'protein_df_processing_functions': None,
'edge_construction_functions': [<function graphein.protein.edges.distance.add_peptide_bonds(G: 'nx.Graph') -> 'nx.Graph'>],
'node_metadata_functions': [<function graphein.protein.features.nodes.amino_acid.meiler_embedding(n, d, return_array: bool = False) -> Union[pandas.core.series.Series, <built-in function array>]>],
'edge_metadata_functions': None,
'graph_metadata_functions': None,
'get_contacts_config': None,
'dssp_config': None}
High-level API#
Graphein features a high-level API which should be applicable for most simple graph constructions. This can be used on either a .pdb
file (so you can run whatever pre-processing you wish), or we can provide a PDB accession code and retrieve a structure from the PDB itself. If a path is provided, it takes precedence over the PDB code.
To use it we do as follows:
[5]:
from graphein.protein.graphs import construct_graph
g = construct_graph(config=config, pdb_code="3eiy")
DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Converting dataframe to centroids. This averages XYZ coords of the atoms in a residue
DEBUG:graphein.protein.graphs:Calculated 174 centroid nodes
DEBUG:graphein.protein.graphs:Detected 174 total nodes
DEBUG:graphein.protein.features.nodes.amino_acid:Reading meiler embeddings from: /Users/arianjamasb/github/graphein/graphein/protein/features/nodes/meiler_embeddings.csv
174
If you wish to use a local .pdb
file, you can run:
g = construct_graph(config=config, pdb_path="../graphein/examples/pdbs/3eiy.pdb")
Let’s check out the results with the in-built visualisation
[6]:
from graphein.protein.visualisation import plotly_protein_structure_graph
p = plotly_protein_structure_graph(
g,
colour_edges_by="kind",
colour_nodes_by="degree",
label_node_ids=False,
plot_title="Peptide backbone graph. Nodes coloured by degree.",
node_size_multiplier=1
)
p.show()
`Cool! So let’s look at what we’ve got here:
We’ve plotted the graph, positioning nodes according to their x,y,z
coordinates and coloured them by their degree. As we can see all the nodes are yellow except the two corresponding to the N and C terminal residues. Why is this? Because we’ve only computed the peptide-bond edges. Now, we probably want more so let’s look at how to do that!
Edge Functions#
Graphein is implemented in a functional fashion. In this case, this means in order to compute edges, we pass a list of edge construction functions to the construction. We have supplied a number of edge computation functions. These are located in: * `graphein.protein.edges.distance
<https://graphein.ai/modules/graphein.protein.html#module-graphein.protein.edges.distance>`__ *
`graphein.protein.edges.intramolecular
<https://graphein.ai/modules/graphein.protein.html#module-graphein.protein.edges.intramolecular>`__ (these rely on an installation of GetContacts, an optional dependency) and a separate GetContactsConfig *
`graphein.protein.edges.atomic
<https://graphein.ai/modules/graphein.protein.html#module-graphein.protein.edges.atomic>`__ (these are used in atomic-level graphs and we discuss these in another notebook tutorial)
However, edge functions are simple functions that take in an nx.Graph
and return an nx.Graph
with added edges. This means users can easily define their own to suit their purposes! Let’s take a closer look at some of the in-built functions before defining our own.
Built-in Edge Functions#
[7]:
from graphein.protein.edges.distance import add_hydrogen_bond_interactions, add_peptide_bonds
new_edge_funcs = {"edge_construction_functions": [add_peptide_bonds, add_hydrogen_bond_interactions]}
config = ProteinGraphConfig(**new_edge_funcs)
g = construct_graph(config=config, pdb_code="3eiy")
p = plotly_protein_structure_graph(
g,
colour_edges_by="kind",
colour_nodes_by="seq_position",
label_node_ids=False,
plot_title="Protein graph with peptide backbone and H-Bonds. \n Nodes coloured by sequence position. Edges coloured by type.",
node_size_multiplier=1,
)
p.show()
DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
INFO:graphein.protein.edges.distance:Found 75 hbond interactions.
INFO:graphein.protein.edges.distance:Found 7 hbond interactions.
174
Great! So what if we try a bunch of them
[8]:
from graphein.protein.edges.distance import (add_peptide_bonds,
add_hydrogen_bond_interactions,
add_disulfide_interactions,
add_ionic_interactions,
add_aromatic_interactions,
add_aromatic_sulphur_interactions,
add_cation_pi_interactions
)
new_edge_funcs = {"edge_construction_functions": [add_peptide_bonds,
add_aromatic_interactions,
add_hydrogen_bond_interactions,
add_disulfide_interactions,
add_ionic_interactions,
add_aromatic_sulphur_interactions,
add_cation_pi_interactions]
}
config = ProteinGraphConfig(**new_edge_funcs)
g = construct_graph(config=config, pdb_code="3eiy")
p = plotly_protein_structure_graph(
g,
colour_edges_by="kind",
colour_nodes_by="seq_position",
label_node_ids=False,
plot_title="Protein graph with: peptide backbone, H-Bonds, \n Disulphide, ionic, aromatic, aromatic-sulphur and cation-pi interactions. \n Nodes coloured by sequence position, edges by type",
node_size_multiplier=1
)
p.show()
DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
INFO:graphein.protein.edges.distance:Found: 16 aromatic-aromatic interactions
INFO:graphein.protein.edges.distance:Found 75 hbond interactions.
INFO:graphein.protein.edges.distance:Found 7 hbond interactions.
DEBUG:graphein.protein.edges.distance:1 CYS residues found. Cannot add disulfide interactions with fewer than two CYS residues.
INFO:graphein.protein.edges.distance:Found 1308 ionic interactions.
174
Cool! So we’ve added a bunch on distance-based computations of intramolecular edges to our graph! We have a few more distance based edge functions that we can visualise. Let’s check them out!
`graphein.protein.edges.distance.add_delaunay_triangulation
<https://graphein.ai/modules/graphein.protein.html#graphein.protein.edges.distance.add_delaunay_triangulation>`__: Adds edges based on the delaunay triangulation`graphein.protein.edges.distance.add_k_nn_edges
<https://graphein.ai/modules/graphein.protein.html#graphein.protein.edges.distance.add_k_nn_edges>`__: Adds edges based on K Nearest neighbours`graphein.protein.edges.distance.add_distance_threshold
<https://graphein.ai/modules/graphein.protein.html#graphein.protein.edges.distance.add_distance_threshold>`__: Adds edges based on distance cutoff in Angstroms
First up, the Delaunay triangulation:
[9]:
from graphein.protein.edges.distance import add_delaunay_triangulation
new_edge_funcs = {"edge_construction_functions": [add_delaunay_triangulation]}
config = ProteinGraphConfig(**new_edge_funcs)
g = construct_graph(config=config, pdb_code="3eiy")
p = plotly_protein_structure_graph(
G=g,
colour_edges_by="kind",
colour_nodes_by="seq_position",
label_node_ids=False,
node_size_multiplier=1,
plot_title="Protein graph created by the Delaunay triangulation of the structure. \n Nodes coloured by sequence position."
)
p.show()
DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
DEBUG:graphein.protein.edges.distance:Detected 954 simplices in the Delaunay Triangulation.
Next, let’s consider the last two edge construction functions `add_k_nn_edges
<https://graphein.ai/modules/graphein.protein.html#graphein.protein.edges.distance.add_k_nn_edges>`__ and `add_distance_threshold
<https://graphein.ai/modules/graphein.protein.html#graphein.protein.edges.distance.add_distance_threshold>`__. These two functions take additional parameters. We can manage this with partial functions. These functions also feature a long_interaction_threshold
parameter which
specifies the minimum number of residues in the sequence between two residues in order to add an edge. This is because we may not be so interested in residues close together in the sequence being close together in the graph.
[10]:
from functools import partial
from graphein.protein.edges.distance import add_distance_threshold
new_edge_funcs = {"edge_construction_functions": [partial(add_distance_threshold, long_interaction_threshold=5, threshold=10.)]}
config = ProteinGraphConfig(**new_edge_funcs)
g = construct_graph(config=config, pdb_code="3eiy")
p = plotly_protein_structure_graph(
g,
colour_edges_by="kind",
colour_nodes_by="seq_position",
label_node_ids=False,
plot_title="Protein graph created by thresholding distance between nodes. \n Nodes must be <10A apart and at least 5 positions apart \n Nodes coloured by sequence position.",
node_size_multiplier=1
)
p.show()
DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
INFO:graphein.protein.edges.distance:Found: 3218 distance edges
INFO:graphein.protein.edges.distance:Added 1904 distance edges. (1314 removed by LIN)
[11]:
from functools import partial
from graphein.protein.edges.distance import add_k_nn_edges
new_edge_funcs = {"edge_construction_functions": [partial(add_k_nn_edges, k=3, long_interaction_threshold=0)]}
config = ProteinGraphConfig(**new_edge_funcs)
g = construct_graph(config=config, pdb_code="3eiy")
p = plotly_protein_structure_graph(
g,
colour_edges_by="kind",
colour_nodes_by="seq_position",
label_node_ids=False,
plot_title="Protein graph created from K-NN of each node. Nodes coloured by sequence position",
node_size_multiplier=1
)
p.show()
DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
INFO:graphein.protein.edges.distance:Found: 522 KNN edges
Defining your own edge functions#
Now, let’s turn our attention to defining our own edge construction functions. These must take in a nx.Graph
and return a nx.Graph
. You can use these custom functions together with all the in-built edge functions as before. If you come up with something cool, consider making a pull-request and sharing it with the community!
Here, as an example, we define an arbitrary function that creates an edge between all histidine residues in the graph.
[12]:
import networkx as nx
def add_histidine_histidine_edges(G: nx.Graph) -> nx.Graph:
# Iterate over nodes to identify histidines
histidines = [n for n, d in G.nodes(data=True) if d["residue_name"] == "HIS"]
# Iterate over histidines and create a bond between them
[G.add_edge(x, y, kind={"histidine"}) for i, x in enumerate(histidines) for j, y in enumerate(histidines) if i!=j]
return G
[13]:
new_edge_funcs = {"edge_construction_functions": [add_histidine_histidine_edges]}
config = ProteinGraphConfig(**new_edge_funcs)
g = construct_graph(config=config, pdb_code="3eiy")
p = plotly_protein_structure_graph(
g,
colour_edges_by="kind",
colour_nodes_by="seq_position",
label_node_ids=False,
plot_title="Protein graph created using a user-defined function that connects all Histidines."
)
p.show()
# Let's check this worked:
for u, v, a in g.edges(data=True):
print(u, v, a)
DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
A:HIS:111 A:HIS:137 {'kind': {'histidine'}}
A:HIS:111 A:HIS:163 {'kind': {'histidine'}}
A:HIS:137 A:HIS:163 {'kind': {'histidine'}}
Features & Metadata#
Graphein
is designed to facilitate geometric deep learning on protein structures and so we have a bunch of feature and metadata annotation functions. These operate on three levels: * node-level annotation * graph-level annotation * edge-level annotation
These behave very similarly to the edge construction functions we just looked at. You simply pass a list of functions that take the right sort of arguments and return the right types.
graph-level annotation takes in a
nx.Graph
and returns annx.Graph
(just like before!).node-level annotation takes in a node, data tuple from
G.nodes(data=True)
and returns apd.Series
edge-level annotation in node_u, node_v, data tuple from
G.edges(data=True)
and returns
Built-in Graph Annotation Functions#
These are found in graphein.protein.features.sequence
and graphein.protein.features.graph
. We make a slight distinction between those functions that operate on the protein sequence for cleaner organisation. N.B. we’ve mentioned the chain_selection
parameter briefly - if you include multiple chains in your graph, then will be multiple sequences associated with it. These are accessed as such: G.graph[f"sequence_{CHAIN_ID}"]
and features computed from sequence are accessed as:
G.graph[f"{FEATURE_NAME}_{CHAIN_ID}]
[14]:
from graphein.protein.features.sequence.sequence import molecular_weight
new_graph_annotation_funcs = {"graph_metadata_functions": [molecular_weight]}
config = ProteinGraphConfig(**new_graph_annotation_funcs)
g = construct_graph(config=config, pdb_code="3eiy")
print("Sequence:", g.graph["sequence_A"])
print("MW:", g.graph["molecular_weight_A"])
DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
174
Sequence: SFSNVPAGKDLPQDFNVIIEIPAQSEPVKYEADKALGLLVVDRFIGTGMRYPVNYGFIPQTLSGDGDPVDVLVITPFPLLAGSVVRARALGMLKMTDESGVDAKLVAVPHDKVCPMTANLKSIDDVPAYLKDQIKHFFEQYKALEKGKWVKVEGWDGIDAAHKEITDGVANFKK
MW: 19029.71029999999
We also provide some utilities for adding graph-level features in the form of sequence embeddings from various pre-trained language models. Let’s check them out!
[15]:
# Warning! This cell may crash a binder notebook as the pre-trained model download is rather large!
# NBVAL_SKIP
from graphein.protein.features.sequence.embeddings import esm_sequence_embedding, biovec_sequence_embedding
new_graph_annotation_funcs = {"graph_metadata_functions": [esm_sequence_embedding, biovec_sequence_embedding]}
config = ProteinGraphConfig(**new_graph_annotation_funcs)
g = construct_graph(config=config, pdb_code="3eiy")
print("ESM:", g.graph["esm_embedding_A"])
print("biovec:", g.graph["biovec_embedding_A"])
DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
Using cache found in /Users/arianjamasb/.cache/torch/hub/facebookresearch_esm_master
/Users/arianjamasb/opt/anaconda3/envs/graphein-wip/lib/python3.8/site-packages/torchvision/io/image.py:11: UserWarning:
Failed to load image Python extension: dlopen(/Users/arianjamasb/opt/anaconda3/envs/graphein-wip/lib/python3.8/site-packages/torchvision/image.so, 0x0006): Symbol not found: __ZN2at4_ops19empty_memory_format4callEN3c108ArrayRefIxEENS2_8optionalINS2_10ScalarTypeEEENS5_INS2_6LayoutEEENS5_INS2_6DeviceEEENS5_IbEENS5_INS2_12MemoryFormatEEE
Referenced from: /Users/arianjamasb/opt/anaconda3/envs/graphein-wip/lib/python3.8/site-packages/torchvision/image.so
Expected in: /Users/arianjamasb/opt/anaconda3/envs/graphein-wip/lib/python3.8/site-packages/torch/lib/libtorch_cpu.dylib
174
INFO:gensim.utils:loading Word2Vec object from /Users/arianjamasb/github/graphein/graphein/protein/features/pretrained_models/swissprot-reviewed-protvec.model
DEBUG:smart_open.smart_open_lib:{'uri': '/Users/arianjamasb/github/graphein/graphein/protein/features/pretrained_models/swissprot-reviewed-protvec.model', 'mode': 'rb', 'buffering': -1, 'encoding': None, 'errors': None, 'newline': None, 'closefd': True, 'opener': None, 'ignore_ext': False, 'compression': None, 'transport_params': None}
INFO:gensim.utils:loading vocabulary recursively from /Users/arianjamasb/github/graphein/graphein/protein/features/pretrained_models/swissprot-reviewed-protvec.model.vocabulary.* with mmap=None
INFO:gensim.utils:loading wv recursively from /Users/arianjamasb/github/graphein/graphein/protein/features/pretrained_models/swissprot-reviewed-protvec.model.wv.* with mmap=None
INFO:gensim.utils:setting ignored attribute vectors_norm to None
INFO:gensim.utils:loading trainables recursively from /Users/arianjamasb/github/graphein/graphein/protein/features/pretrained_models/swissprot-reviewed-protvec.model.trainables.* with mmap=None
INFO:gensim.utils:setting ignored attribute cum_table to None
INFO:gensim.utils:loaded /Users/arianjamasb/github/graphein/graphein/protein/features/pretrained_models/swissprot-reviewed-protvec.model
ESM: [-0.02279943 0.16464774 -0.00461995 ... 0.02556622 -0.19798394
0.09682215]
biovec: [array([ -8.865931 , 5.65769 , -1.5052402 , 3.2893941 ,
-6.6228924 , 2.4890645 , 1.0635861 , 2.4991994 ,
-2.2740204 , -7.6614437 , 0.3739053 , 4.285517 ,
-1.7950253 , -2.0930681 , 0.19075748, 9.249631 ,
-1.2537682 , 6.0140305 , -3.0592997 , 3.1944623 ,
5.6747837 , 1.4732779 , -9.556535 , 4.5126605 ,
-15.789192 , -9.917104 , 10.61495 , 1.6693095 ,
1.8446481 , -12.656563 , 2.5226657 , 3.2782233 ,
5.6363025 , -1.5776551 , 9.913681 , -1.4776824 ,
7.4292827 , 2.6953607 , -0.81495905, 1.0415509 ,
-18.352554 , -9.603085 , -11.215957 , 3.9449768 ,
2.2988393 , 2.8612108 , -5.647224 , -9.894558 ,
2.8099368 , -3.47802 , -1.7364154 , 4.10581 ,
-1.1337806 , -7.851351 , 3.667841 , 2.5098174 ,
0.3878222 , -0.282476 , 4.242093 , 2.7918775 ,
-0.3927042 , -3.504907 , 2.2170143 , -6.7696056 ,
-8.574576 , 7.549987 , 5.5607357 , -7.514735 ,
5.778782 , 2.1968088 , 5.747694 , -4.4072075 ,
1.6411833 , 2.339804 , 4.364424 , -4.615387 ,
0.61950064, 7.19126 , -0.10569584, -6.485407 ,
-3.3555477 , 0.58533096, 4.150122 , 4.6470437 ,
-7.6111784 , 5.8581285 , -9.8836 , 8.114656 ,
4.3746543 , 1.4033681 , -4.632345 , 13.016971 ,
-2.839704 , -6.210647 , 3.4460034 , 2.6573007 ,
-14.987186 , -6.545936 , -2.3990562 , 1.2732263 ],
dtype=float32), array([ -9.079695 , 3.7650356 , 0.3614651 , 2.1325634 ,
-5.285296 , 3.0021908 , 1.2300422 , 1.8999658 ,
-2.0472856 , -5.5658956 , 0.5014471 , 3.8743246 ,
-1.0728757 , -3.0830898 , 0.73323405, 9.658365 ,
-0.10142106, 6.2466826 , -2.4961333 , 2.8022783 ,
6.5060906 , 1.6311501 , -10.04929 , 5.4711785 ,
-16.297068 , -10.765219 , 10.580414 , 0.503202 ,
2.4227042 , -12.57771 , 2.7075891 , 3.9921677 ,
5.5343657 , -0.9493397 , 10.082592 , 0.301396 ,
6.978562 , 2.4752004 , -1.1012759 , -1.1357698 ,
-16.81547 , -9.117541 , -10.919918 , 3.8029845 ,
2.5540738 , 3.0349758 , -5.4471354 , -9.438955 ,
3.067747 , -3.1571815 , -3.1563768 , 4.7784586 ,
-2.3092556 , -7.704368 , 0.82422435, 1.8206171 ,
2.161718 , 0.37946355, 4.7007704 , 2.2930143 ,
0.75466317, -3.3875456 , 2.3561945 , -6.173627 ,
-7.655903 , 6.066585 , 5.7083497 , -8.785268 ,
4.0877957 , 2.1213973 , 5.8782153 , -4.1335964 ,
0.26376987, 3.185395 , 3.9498727 , -4.5998893 ,
-0.11535516, 8.4039135 , -0.8255909 , -5.910516 ,
-2.2402806 , -0.15940607, 3.476339 , 6.034683 ,
-6.72777 , 6.143824 , -9.662175 , 8.747985 ,
3.985702 , 2.272303 , -3.933255 , 12.071237 ,
-1.8998326 , -6.6413393 , 4.2529097 , 3.4216733 ,
-15.628299 , -5.513835 , -3.004577 , 1.3258756 ],
dtype=float32), array([ -6.9852 , 4.425231 , -0.5174276 , 2.0380528 ,
-7.1311088 , 2.4794824 , 1.9649935 , 1.9006643 ,
-1.3209196 , -5.8989286 , 1.2391084 , 3.485902 ,
-1.9545169 , -2.2625084 , 1.8165749 , 9.708086 ,
-0.74385244, 6.5338845 , -3.0916584 , 3.9312694 ,
6.1540847 , 3.4793413 , -10.056544 , 5.7322254 ,
-13.599701 , -10.923328 , 9.378405 , 0.555632 ,
3.512236 , -11.891929 , 2.382594 , 3.2599561 ,
5.123339 , -0.55314994, 10.585077 , -0.9078483 ,
8.167705 , 2.4789596 , 0.78942966, -1.0045129 ,
-17.38679 , -9.496934 , -10.542821 , 3.5852609 ,
3.3952198 , 2.9937162 , -5.9167666 , -9.517663 ,
2.2889428 , -3.6738756 , -2.7072256 , 4.6205435 ,
-1.970848 , -6.7903805 , 2.1213295 , 2.5594456 ,
0.4943355 , 0.27333227, 4.096034 , 1.2748998 ,
-1.3203058 , -3.6010296 , 2.623712 , -8.468901 ,
-6.3624053 , 7.388231 , 4.908528 , -7.1587243 ,
4.9127274 , 3.2538335 , 6.45744 , -3.3919108 ,
-0.22479697, 2.4506824 , 4.864544 , -4.201532 ,
-0.08151506, 5.778758 , -0.5880602 , -7.52435 ,
-2.3001645 , 0.04607171, 4.0975637 , 5.7935786 ,
-8.676276 , 6.952252 , -11.0132 , 7.6395264 ,
4.4887743 , 2.5889583 , -5.0494356 , 12.681698 ,
-2.3577175 , -5.7180576 , 3.481403 , 3.9928114 ,
-14.854708 , -6.135002 , -2.9227335 , 1.2122653 ],
dtype=float32)]
Writing your own graph-level annotation function.#
We provide a couple of useful functions for dealing with multiple chains robustly, namely compute_feature_over_chains
and aggregate_feature_over_chains
. This takes an input (nx.Graph
), your function: (Callable
), and feature_name (str
). Your function must operate on a sequence (str
) to use this utility.
aggregate_feature_over_chains
can be used to perform aggregation {"min", "max", "sum" , "mean"}
of chain-specific features. It takes: an nx.Graph
, a feature name (str
) and and aggregation type (str
)
If you don’t want to operate on the sequnce, you don’t need to use these!
Let’s use this to construct a graph-level feature that operates on the sequence. This feature is an integer equal to the number of histidines in the chain.
[16]:
from graphein.protein.features.sequence.utils import compute_feature_over_chains, aggregate_feature_over_chains
# Define our graph-level function that operates on sequences
def histidine_count_feature(G: nx.Graph) -> nx.Graph:
# Define our feature function that operates on a sequence
def count_histidines(sequence: str) -> int:
return sequence.count("H")
# Compute the feature over the chains in the graph
G = compute_feature_over_chains(G, func=count_histidines, feature_name="histidine_count")
# Aggregate the feature over the chains
G = aggregate_feature_over_chains(G, feature_name="histidine_count", aggregation_type="mean")
return G
new_graph_annotation_funcs = {"graph_metadata_functions": [histidine_count_feature]}
config = ProteinGraphConfig(**new_graph_annotation_funcs)
# Test our new feature
g = construct_graph(config=config, pdb_code="3eiy")
print(g.graph["histidine_count_A"])
# And now on a multi-chain graph
g = construct_graph(config=config, pdb_code="9API", chain_selection="all")
print("Chain A Hists:", g.graph["histidine_count_A"])
print("Chain B Hists:", g.graph["histidine_count_B"])
print("Chain B Seq:", g.graph["sequence_B"])
print("Aggregated HIS count:", g.graph["histidine_count_mean"])
DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
174
3
DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 375 total nodes
375
Chain A Hists: 11
Chain B Hists: 0
Chain B Seq: SIPPEVKFNKPFVFLMIEQNTKSPLFMGKVVNPTQK
Aggregated HIS count: 5.5
Node-level features#
Similarly, we provide some in-built functions for computing node-level features. For example, we’ll look at some featurisations of amino acid types.
First, we’ll look at a simple one-hot encoding of the amino acid type
[17]:
from graphein.protein.features.nodes.amino_acid import amino_acid_one_hot
config = ProteinGraphConfig(**{"node_metadata_functions": [amino_acid_one_hot]})
g = construct_graph(config=config, pdb_code="3eiy")
for n, d in g.nodes(data=True):
print(d["amino_acid_one_hot"])
DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
174
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[18]:
from graphein.protein.features.nodes.amino_acid import expasy_protein_scale
config = ProteinGraphConfig(**{"node_metadata_functions": [expasy_protein_scale]})
g = construct_graph(config=config, pdb_code="3eiy")
for n, d in g.nodes(data=True):
print(d['expasy'])
break
DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
DEBUG:graphein.protein.features.nodes.amino_acid:Reading Expasy protein scales from: /Users/arianjamasb/github/graphein/graphein/protein/features/nodes/amino_acid_properties.csv
174
pka_cooh_alpha 2.21
pka_nh3 9.15
pka_rgroup 7.00
isoelectric_points 5.68
molecularweight 105.00
...
antiparallelbeta_strand 0.87
parallelbeta_strand 0.70
a_a_composition 6.90
a_a_swiss_prot 6.56
relativemutability 120.00
Name: SER, Length: 61, dtype: float64
Geometric features#
We can also add geometric features to protein structure graphs.
[19]:
from graphein.protein.features.nodes.geometry import add_sidechain_vector, add_beta_carbon_vector, add_sequence_neighbour_vector
from graphein.protein.visualisation import add_vector_to_plot
add_sidechain_vector(g)
add_beta_carbon_vector(g)
add_sequence_neighbour_vector(g)
fig = plotly_protein_structure_graph(g, node_size_multiplier=1)
fig = add_vector_to_plot(g, fig, "sidechain_vector", colour="red", scale=1.5)
fig = add_vector_to_plot(g, fig, "c_beta_vector", colour="blue", scale=1.5)
fig = add_vector_to_plot(g, fig, "sequence_neighbour_vector_n_to_c", colour="green", scale=1.5)
fig
Low-Level API#
Graphein also features a low-level API if you want to get down into the gritty details of the construction. Essentially, every step of the construction is a function that you can tweak/alter the sequencing of as you please. This is very flexible if not unnecessary for most applications
Pre-Processing#
All graphs begin as pd.DataFrame
representations of the .PDB file. These are handled by PandasPDB. There is a convenience function here process_dataframe
, to which you can pass a list of dataframe processing functions. These are similar to the edge construction and annotation functions described above except they take in a pd.DataFrame
and return a pd.DataFrame
. We’ll start by reimplementing the construction we’ve been using above.
[20]:
from graphein.protein.graphs import read_pdb_to_dataframe, process_dataframe, deprotonate_structure, convert_structure_to_centroids, subset_structure_to_atom_type, filter_hetatms, remove_insertions
processing_funcs = [deprotonate_structure, convert_structure_to_centroids, remove_insertions]
pdb_code = "3eiy"
# Read dataframe from PDB
raw_df = read_pdb_to_dataframe(pdb_code=pdb_code)
# Apply processing functions
df = process_dataframe(raw_df, atom_df_processing_funcs = processing_funcs)
df.head()
DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Converting dataframe to centroids. This averages XYZ coords of the atoms in a residue
DEBUG:graphein.protein.graphs:Calculated 174 centroid nodes
[20]:
record_name | atom_number | blank_1 | atom_name | alt_loc | residue_name | blank_2 | chain_id | residue_number | insertion | ... | y_coord | z_coord | occupancy | b_factor | blank_4 | segment_id | element_symbol | charge | line_idx | node_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ATOM | 2 | CA | SER | A | 2 | ... | 54.615500 | -0.136833 | 1.0 | 52.54 | C | NaN | 610 | A:SER:2 | ||||||
1 | ATOM | 8 | CA | PHE | A | 3 | ... | 51.417273 | 2.656727 | 1.0 | 48.73 | C | NaN | 616 | A:PHE:3 | ||||||
2 | ATOM | 19 | CA | SER | A | 4 | ... | 50.090833 | -2.889667 | 1.0 | 47.00 | C | NaN | 627 | A:SER:4 | ||||||
3 | ATOM | 25 | CA | ASN | A | 5 | ... | 52.759000 | -4.146375 | 1.0 | 41.42 | C | NaN | 633 | A:ASN:5 | ||||||
4 | ATOM | 33 | CA | VAL | A | 6 | ... | 51.575429 | 0.109000 | 1.0 | 32.41 | C | NaN | 641 | A:VAL:6 |
5 rows × 22 columns
We can define our own processing functions. Here we define a function that removes odd numbered residues from the dataframe. We’ll use this again later.
[21]:
import pandas as pd
def remove_odd_resis(df: pd.DataFrame) -> pd.DataFrame:
return df.loc[df.index % 2 == 0]
remove_odd_resis(df).head()
[21]:
record_name | atom_number | blank_1 | atom_name | alt_loc | residue_name | blank_2 | chain_id | residue_number | insertion | ... | y_coord | z_coord | occupancy | b_factor | blank_4 | segment_id | element_symbol | charge | line_idx | node_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ATOM | 2 | CA | SER | A | 2 | ... | 54.615500 | -0.136833 | 1.0 | 52.54 | C | NaN | 610 | A:SER:2 | ||||||
2 | ATOM | 19 | CA | SER | A | 4 | ... | 50.090833 | -2.889667 | 1.0 | 47.00 | C | NaN | 627 | A:SER:4 | ||||||
4 | ATOM | 33 | CA | VAL | A | 6 | ... | 51.575429 | 0.109000 | 1.0 | 32.41 | C | NaN | 641 | A:VAL:6 | ||||||
6 | ATOM | 47 | CA | ALA | A | 8 | ... | 46.781200 | 0.618800 | 1.0 | 25.67 | C | NaN | 655 | A:ALA:8 | ||||||
8 | ATOM | 56 | CA | LYS | A | 10 | ... | 51.873333 | 1.231222 | 1.0 | 27.63 | C | NaN | 664 | A:LYS:10 |
5 rows × 22 columns
Building the Graph#
First we initialise the graph with the metadata. We then use this to add the nodes
[22]:
from graphein.protein.graphs import initialise_graph_with_metadata, add_nodes_to_graph
g = initialise_graph_with_metadata(protein_df=df, # from above cell
raw_pdb_df=raw_df.df["ATOM"], # Store this for traceability
pdb_id = pdb_code, #and again
granularity = "centroid" # Store this so we know what kind of graph we have
)
g = add_nodes_to_graph(g)
print(g.nodes)
['A:SER:2', 'A:PHE:3', 'A:SER:4', 'A:ASN:5', 'A:VAL:6', 'A:PRO:7', 'A:ALA:8', 'A:GLY:9', 'A:LYS:10', 'A:ASP:11', 'A:LEU:12', 'A:PRO:13', 'A:GLN:14', 'A:ASP:15', 'A:PHE:16', 'A:ASN:17', 'A:VAL:18', 'A:ILE:19', 'A:ILE:20', 'A:GLU:21', 'A:ILE:22', 'A:PRO:23', 'A:ALA:24', 'A:GLN:25', 'A:SER:26', 'A:GLU:27', 'A:PRO:28', 'A:VAL:29', 'A:LYS:30', 'A:TYR:31', 'A:GLU:32', 'A:ALA:33', 'A:ASP:34', 'A:LYS:35', 'A:ALA:36', 'A:LEU:37', 'A:GLY:38', 'A:LEU:39', 'A:LEU:40', 'A:VAL:41', 'A:VAL:42', 'A:ASP:43', 'A:ARG:44', 'A:PHE:45', 'A:ILE:46', 'A:GLY:47', 'A:THR:48', 'A:GLY:49', 'A:MET:50', 'A:ARG:51', 'A:TYR:52', 'A:PRO:53', 'A:VAL:54', 'A:ASN:55', 'A:TYR:56', 'A:GLY:57', 'A:PHE:58', 'A:ILE:59', 'A:PRO:60', 'A:GLN:61', 'A:THR:62', 'A:LEU:63', 'A:SER:64', 'A:GLY:65', 'A:ASP:66', 'A:GLY:67', 'A:ASP:68', 'A:PRO:69', 'A:VAL:70', 'A:ASP:71', 'A:VAL:72', 'A:LEU:73', 'A:VAL:74', 'A:ILE:75', 'A:THR:76', 'A:PRO:77', 'A:PHE:78', 'A:PRO:79', 'A:LEU:80', 'A:LEU:81', 'A:ALA:82', 'A:GLY:83', 'A:SER:84', 'A:VAL:85', 'A:VAL:86', 'A:ARG:87', 'A:ALA:88', 'A:ARG:89', 'A:ALA:90', 'A:LEU:91', 'A:GLY:92', 'A:MET:93', 'A:LEU:94', 'A:LYS:95', 'A:MET:96', 'A:THR:97', 'A:ASP:98', 'A:GLU:99', 'A:SER:100', 'A:GLY:101', 'A:VAL:102', 'A:ASP:103', 'A:ALA:104', 'A:LYS:105', 'A:LEU:106', 'A:VAL:107', 'A:ALA:108', 'A:VAL:109', 'A:PRO:110', 'A:HIS:111', 'A:ASP:112', 'A:LYS:113', 'A:VAL:114', 'A:CYS:115', 'A:PRO:116', 'A:MET:117', 'A:THR:118', 'A:ALA:119', 'A:ASN:120', 'A:LEU:121', 'A:LYS:122', 'A:SER:123', 'A:ILE:124', 'A:ASP:125', 'A:ASP:126', 'A:VAL:127', 'A:PRO:128', 'A:ALA:129', 'A:TYR:130', 'A:LEU:131', 'A:LYS:132', 'A:ASP:133', 'A:GLN:134', 'A:ILE:135', 'A:LYS:136', 'A:HIS:137', 'A:PHE:138', 'A:PHE:139', 'A:GLU:140', 'A:GLN:141', 'A:TYR:142', 'A:LYS:143', 'A:ALA:144', 'A:LEU:145', 'A:GLU:146', 'A:LYS:147', 'A:GLY:148', 'A:LYS:149', 'A:TRP:150', 'A:VAL:151', 'A:LYS:152', 'A:VAL:153', 'A:GLU:154', 'A:GLY:155', 'A:TRP:156', 'A:ASP:157', 'A:GLY:158', 'A:ILE:159', 'A:ASP:160', 'A:ALA:161', 'A:ALA:162', 'A:HIS:163', 'A:LYS:164', 'A:GLU:165', 'A:ILE:166', 'A:THR:167', 'A:ASP:168', 'A:GLY:169', 'A:VAL:170', 'A:ALA:171', 'A:ASN:172', 'A:PHE:173', 'A:LYS:174', 'A:LYS:175']
Next, we add edges to the graph:
[23]:
from graphein.protein.graphs import compute_edges
g = compute_edges(g, get_contacts_config=None, funcs=[add_peptide_bonds, add_hydrogen_bond_interactions])
print(g.edges)
INFO:graphein.protein.edges.distance:Found 75 hbond interactions.
INFO:graphein.protein.edges.distance:Found 7 hbond interactions.
174
[('A:SER:2', 'A:PHE:3'), ('A:PHE:3', 'A:SER:4'), ('A:SER:4', 'A:ASN:5'), ('A:ASN:5', 'A:VAL:6'), ('A:VAL:6', 'A:PRO:7'), ('A:PRO:7', 'A:ALA:8'), ('A:ALA:8', 'A:GLY:9'), ('A:GLY:9', 'A:LYS:10'), ('A:LYS:10', 'A:ASP:11'), ('A:ASP:11', 'A:LEU:12'), ('A:LEU:12', 'A:PRO:13'), ('A:PRO:13', 'A:GLN:14'), ('A:GLN:14', 'A:ASP:15'), ('A:ASP:15', 'A:PHE:16'), ('A:ASP:15', 'A:ARG:87'), ('A:PHE:16', 'A:ASN:17'), ('A:ASN:17', 'A:VAL:18'), ('A:VAL:18', 'A:ILE:19'), ('A:ILE:19', 'A:ILE:20'), ('A:ILE:20', 'A:GLU:21'), ('A:GLU:21', 'A:ILE:22'), ('A:ILE:22', 'A:PRO:23'), ('A:PRO:23', 'A:ALA:24'), ('A:ALA:24', 'A:GLN:25'), ('A:GLN:25', 'A:SER:26'), ('A:SER:26', 'A:GLU:27'), ('A:GLU:27', 'A:PRO:28'), ('A:PRO:28', 'A:VAL:29'), ('A:VAL:29', 'A:LYS:30'), ('A:LYS:30', 'A:TYR:31'), ('A:TYR:31', 'A:GLU:32'), ('A:GLU:32', 'A:ALA:33'), ('A:ALA:33', 'A:ASP:34'), ('A:ASP:34', 'A:LYS:35'), ('A:LYS:35', 'A:ALA:36'), ('A:ALA:36', 'A:LEU:37'), ('A:LEU:37', 'A:GLY:38'), ('A:GLY:38', 'A:LEU:39'), ('A:LEU:39', 'A:LEU:40'), ('A:LEU:40', 'A:VAL:41'), ('A:VAL:41', 'A:VAL:42'), ('A:VAL:42', 'A:ASP:43'), ('A:ASP:43', 'A:ARG:44'), ('A:ARG:44', 'A:PHE:45'), ('A:ARG:44', 'A:LYS:149'), ('A:PHE:45', 'A:ILE:46'), ('A:ILE:46', 'A:GLY:47'), ('A:GLY:47', 'A:THR:48'), ('A:THR:48', 'A:GLY:49'), ('A:GLY:49', 'A:MET:50'), ('A:MET:50', 'A:ARG:51'), ('A:ARG:51', 'A:TYR:52'), ('A:TYR:52', 'A:PRO:53'), ('A:PRO:53', 'A:VAL:54'), ('A:VAL:54', 'A:ASN:55'), ('A:ASN:55', 'A:TYR:56'), ('A:TYR:56', 'A:GLY:57'), ('A:TYR:56', 'A:LYS:105'), ('A:TYR:56', 'A:ASP:71'), ('A:GLY:57', 'A:PHE:58'), ('A:PHE:58', 'A:ILE:59'), ('A:ILE:59', 'A:PRO:60'), ('A:PRO:60', 'A:GLN:61'), ('A:GLN:61', 'A:THR:62'), ('A:THR:62', 'A:LEU:63'), ('A:LEU:63', 'A:SER:64'), ('A:SER:64', 'A:GLY:65'), ('A:GLY:65', 'A:ASP:66'), ('A:ASP:66', 'A:GLY:67'), ('A:GLY:67', 'A:ASP:68'), ('A:ASP:68', 'A:PRO:69'), ('A:PRO:69', 'A:VAL:70'), ('A:VAL:70', 'A:ASP:71'), ('A:ASP:71', 'A:VAL:72'), ('A:ASP:71', 'A:LYS:105'), ('A:VAL:72', 'A:LEU:73'), ('A:LEU:73', 'A:VAL:74'), ('A:VAL:74', 'A:ILE:75'), ('A:ILE:75', 'A:THR:76'), ('A:THR:76', 'A:PRO:77'), ('A:PRO:77', 'A:PHE:78'), ('A:PHE:78', 'A:PRO:79'), ('A:PRO:79', 'A:LEU:80'), ('A:LEU:80', 'A:LEU:81'), ('A:LEU:81', 'A:ALA:82'), ('A:ALA:82', 'A:GLY:83'), ('A:GLY:83', 'A:SER:84'), ('A:SER:84', 'A:VAL:85'), ('A:VAL:85', 'A:VAL:86'), ('A:VAL:86', 'A:ARG:87'), ('A:ARG:87', 'A:ALA:88'), ('A:ALA:88', 'A:ARG:89'), ('A:ARG:89', 'A:ALA:90'), ('A:ARG:89', 'A:ASP:112'), ('A:ALA:90', 'A:LEU:91'), ('A:LEU:91', 'A:GLY:92'), ('A:GLY:92', 'A:MET:93'), ('A:MET:93', 'A:LEU:94'), ('A:LEU:94', 'A:LYS:95'), ('A:LYS:95', 'A:MET:96'), ('A:MET:96', 'A:THR:97'), ('A:THR:97', 'A:ASP:98'), ('A:ASP:98', 'A:GLU:99'), ('A:ASP:98', 'A:LYS:143'), ('A:GLU:99', 'A:SER:100'), ('A:SER:100', 'A:GLY:101'), ('A:GLY:101', 'A:VAL:102'), ('A:VAL:102', 'A:ASP:103'), ('A:ASP:103', 'A:ALA:104'), ('A:ASP:103', 'A:LYS:105'), ('A:ALA:104', 'A:LYS:105'), ('A:LYS:105', 'A:LEU:106'), ('A:LEU:106', 'A:VAL:107'), ('A:VAL:107', 'A:ALA:108'), ('A:ALA:108', 'A:VAL:109'), ('A:VAL:109', 'A:PRO:110'), ('A:PRO:110', 'A:HIS:111'), ('A:HIS:111', 'A:ASP:112'), ('A:ASP:112', 'A:LYS:113'), ('A:LYS:113', 'A:VAL:114'), ('A:VAL:114', 'A:CYS:115'), ('A:CYS:115', 'A:PRO:116'), ('A:CYS:115', 'A:MET:117'), ('A:PRO:116', 'A:MET:117'), ('A:MET:117', 'A:THR:118'), ('A:THR:118', 'A:ALA:119'), ('A:ALA:119', 'A:ASN:120'), ('A:ASN:120', 'A:LEU:121'), ('A:LEU:121', 'A:LYS:122'), ('A:LYS:122', 'A:SER:123'), ('A:SER:123', 'A:ILE:124'), ('A:ILE:124', 'A:ASP:125'), ('A:ASP:125', 'A:ASP:126'), ('A:ASP:126', 'A:VAL:127'), ('A:VAL:127', 'A:PRO:128'), ('A:PRO:128', 'A:ALA:129'), ('A:ALA:129', 'A:TYR:130'), ('A:TYR:130', 'A:LEU:131'), ('A:LEU:131', 'A:LYS:132'), ('A:LYS:132', 'A:ASP:133'), ('A:ASP:133', 'A:GLN:134'), ('A:GLN:134', 'A:ILE:135'), ('A:ILE:135', 'A:LYS:136'), ('A:LYS:136', 'A:HIS:137'), ('A:HIS:137', 'A:PHE:138'), ('A:PHE:138', 'A:PHE:139'), ('A:PHE:139', 'A:GLU:140'), ('A:GLU:140', 'A:GLN:141'), ('A:GLN:141', 'A:TYR:142'), ('A:TYR:142', 'A:LYS:143'), ('A:LYS:143', 'A:ALA:144'), ('A:ALA:144', 'A:LEU:145'), ('A:LEU:145', 'A:GLU:146'), ('A:GLU:146', 'A:LYS:147'), ('A:LYS:147', 'A:GLY:148'), ('A:GLY:148', 'A:LYS:149'), ('A:LYS:149', 'A:TRP:150'), ('A:TRP:150', 'A:VAL:151'), ('A:VAL:151', 'A:LYS:152'), ('A:LYS:152', 'A:VAL:153'), ('A:VAL:153', 'A:GLU:154'), ('A:GLU:154', 'A:GLY:155'), ('A:GLY:155', 'A:TRP:156'), ('A:TRP:156', 'A:ASP:157'), ('A:ASP:157', 'A:GLY:158'), ('A:GLY:158', 'A:ILE:159'), ('A:ILE:159', 'A:ASP:160'), ('A:ASP:160', 'A:ALA:161'), ('A:ALA:161', 'A:ALA:162'), ('A:ALA:162', 'A:HIS:163'), ('A:HIS:163', 'A:LYS:164'), ('A:LYS:164', 'A:GLU:165'), ('A:LYS:164', 'A:ASP:168'), ('A:GLU:165', 'A:ILE:166'), ('A:ILE:166', 'A:THR:167'), ('A:THR:167', 'A:ASP:168'), ('A:ASP:168', 'A:GLY:169'), ('A:GLY:169', 'A:VAL:170'), ('A:VAL:170', 'A:ALA:171'), ('A:ALA:171', 'A:ASN:172'), ('A:ASN:172', 'A:PHE:173'), ('A:PHE:173', 'A:LYS:174'), ('A:LYS:174', 'A:LYS:175')]
Let’s check it out!
[24]:
p = plotly_protein_structure_graph(
g,
colour_edges_by="kind",
colour_nodes_by="seq_position",
label_node_ids=False,
plot_title="Protein graph created using low-level API. Shows peptide bond back bond and H-Bonds",
node_size_multiplier=1
)
p.show()
Just as we expected! From here we can apply any number of graph, node or edge metadata annotation functions that we may wish to. The advantage here is that we are not tied to the sequencing in the high-level API of:
(annotate node metadata -> compute edges -> annotate graph metadata -> annotate edge metadata).
This means, if we want, we can define graph annotation functions that utilise edge metadata that would otherwise not be possible. E.g. here we annotate bonds containing histidines with a 1. We then define a graph feature that looks for these bonds, sums the residue numbers involved and sums them.
[25]:
from graphein.protein.graphs import annotate_node_metadata, annotate_graph_metadata, annotate_edge_metadata
# Define edge feature that assigns hist_status = 1 to any bond involving a histidine
def bond_has_histidine_edge_feature(u, v, d):
if "HIS" in u or "HIS" in v:
d["hist_status"] = 1
else:
d["hist_status"] = 0
# Annotate the edge metadata
g = annotate_edge_metadata(g, [bond_has_histidine_edge_feature])
# Check it out
for u, v, d in g.edges(data=True):
if d["hist_status"] == 1:
print(u, v, d)
# Annotate graph-level feature that depends on histidine_status edge feature. This adds the sum of the residue numbers involved in hist_status bonds
def sum_hist_status_bond_residue_nums(G: nx.Graph) -> nx.Graph:
to_add = []
for u, v, a in G.edges(data=True):
if a["hist_status"] == 1:
to_add.append(G.nodes[u]["residue_number"])
to_add.append(G.nodes[v]["residue_number"])
total = sum(to_add)
G.graph["hist_bond_total_residue_number"] = total
return G
g = annotate_graph_metadata(g, [sum_hist_status_bond_residue_nums])
print("Graph-level hist feature:", g.graph["hist_bond_total_residue_number"])
A:PRO:110 A:HIS:111 {'kind': {'peptide_bond'}, 'hist_status': 1}
A:HIS:111 A:ASP:112 {'kind': {'peptide_bond'}, 'hist_status': 1}
A:LYS:136 A:HIS:137 {'kind': {'peptide_bond'}, 'hist_status': 1}
A:HIS:137 A:PHE:138 {'kind': {'peptide_bond'}, 'hist_status': 1}
A:ALA:162 A:HIS:163 {'kind': {'peptide_bond'}, 'hist_status': 1}
A:HIS:163 A:LYS:164 {'kind': {'peptide_bond'}, 'hist_status': 1}
Graph-level hist feature: 1644
fin