Lecture 6 Evolutionary Sequence Alignment Algorithms

Evolutionary Basis of Sequence Alignment and Algorithms
Dr. Aditya Kumar Padhi
Laboratory for Computational Biology & Biomolecular Design (LCBD)

School of Biochemical Engineering, IIT (BHU)
Outline of today’s discussion
• General overview of sequence alignment
• Evolution, and its significance and relationship to sequence alignment
• The rationale behind sequence alignment
• Types of sequence alignment
• Protein sequence alignment
• Sequence similarity and sequence identity
• Algorithms of sequence alignment (brief overview)
• A case study
Overview
• Sequence comparison (DNA, RNA and protein) lies at the heart of bioinformatics analysis.
• It is the first step toward structural and functional analysis of newly determined sequences.
• The most fundamental comparison process is sequence alignment.
  • Search for common character patterns
Evolutionary basis
• DNA and proteins are products of evolution.
• Linear sequences of nucleotide bases and amino acids form the primary structure of DNA and proteins.
• They can be considered molecular fossils that encode the history of millions of years of evolution.
• Over this period, they undergo changes driven by:
  • Mutation (random)
  • Selection
Evolutionary basis cont…
• Despite these changes, traces of evolution may remain, which allow common ancestry to be identified.
• This is because residues that perform key functional or structural roles tend to be preserved by natural selection, while others tend to mutate more frequently.
• To detect sequence homology, we must first align the sequences.
• Sequence alignment reveals patterns of conservation and variation.
• The degree of sequence conservation in an alignment often reveals the evolutionary relatedness of different sequences.
Why perform sequence alignment?
• It serves as the basis for predicting the structure and function of uncharacterized sequences.
• A significant similarity between two sequences suggests that they belong to the same family.
• It also allows inference of the relatedness (evolutionary relationship) of the two sequences under study.
• If they share significant similarity, they have most likely derived from a common evolutionary origin.
Evolutionary Basis for Sequence Alignment and its Inference
• Homology: When two sequences are descended from a common evolutionary origin, they are said to have a homologous relationship, or to share homology.
• Similarity: The percentage of aligned residues that are similar in physicochemical properties such as size and charge.
• Identity: The percentage of aligned residues that are exactly the same. It describes how alike two sequences are in the strictest terms, that is, the extent to which they are invariant.
• Similarity and identity are quantitative, whereas homology is qualitative. Example: two sequences may share 40% similarity, but they are either homologous or non-homologous.

Sequence alignment and evolution
• Assume we know the evolutionary history relating species q and d.
• The true alignment can be found using h as a template:

h : GLVS T
q’: GLISVT
d’: GIV--T
Sequence alignment and evolution
• Given an alignment, several different evolutionary histories can be (equally) possible.
• Example:
  • Alignment:
      q’: GLISVT
      d’: G-I-VT
  • One possible history (S inserted on the branch to q, L deleted on the branch to d):

             H*: GLIVT
               /    \
        ->S   /      \   L->
             /        \
      q: GLISVT    d: GIVT
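As a minimal sketch, the history above can be replayed on the ancestor in Python. The helper names `insert_residue` and `delete_residue` are hypothetical and are not part of the lecture; positions are 0-based.

```python
def insert_residue(seq, pos, residue):
    """Return seq with residue inserted before 0-based position pos."""
    return seq[:pos] + residue + seq[pos:]


def delete_residue(seq, pos):
    """Return seq with the residue at 0-based position pos removed."""
    return seq[:pos] + seq[pos + 1:]


ancestor = "GLIVT"                    # H* from the slide
q = insert_residue(ancestor, 3, "S")  # branch to q: S inserted between I and V
d = delete_residue(ancestor, 1)       # branch to d: L deleted
print(q, d)
```

Other histories (for example, an ancestor that already contained S, followed by a deletion on the branch to d) would lead to the same pair of sequences, which is why an alignment alone does not fix the history.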
Evolutionary basis of sequence alignment
Why are there regions of identity?
1) Conserved function - residues participate in the reaction.
2) Structural - residues participate in maintaining the structure of the protein (for example, conserved cysteine residues that form a disulfide linkage).
3) Historical - residues that are conserved solely because they were inherited from a common ancestral gene.
Protein sequence alignment
• Nucleotide sequences consist of only four characters, so two unrelated sequences have at least a 25% chance of matching at any aligned position.
• Protein sequences are built from 20 possible amino acid residues, so two unrelated sequences can match at about 5% of their residues by random chance. If gaps are allowed, the percentage can increase to 10–20%.
• Sequence length is also a crucial factor.
• The shorter the sequence, the higher the chance that an alignment is attributable to random chance. The longer the sequence, the less likely it is that matching at the same level of similarity is attributable to random chance.
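The effect of alphabet size and sequence length can be sketched with a simple binomial model. This is an illustration under strong assumptions that are not in the lecture: residues are equally frequent, positions are independent, and the alignment is ungapped. The function names are hypothetical.

```python
from math import comb


def chance_identity(alphabet_size):
    """Expected fraction of matching positions for two random sequences."""
    return 1 / alphabet_size


def prob_identity_at_least(length, threshold_percent, alphabet_size):
    """Probability that a random ungapped alignment of the given length
    reaches at least threshold_percent identity (binomial model)."""
    p = 1 / alphabet_size
    # Smallest number of matches that reaches the threshold (integer ceiling).
    k_min = (threshold_percent * length + 99) // 100
    return sum(
        comb(length, k) * p**k * (1 - p) ** (length - k)
        for k in range(k_min, length + 1)
    )


print(chance_identity(4))   # nucleotides
print(chance_identity(20))  # amino acids
# Same 30% identity level, short versus long protein alignment:
print(prob_identity_at_least(10, 30, 20))
print(prob_identity_at_least(100, 30, 20))
```

Under this model, 30% identity over 10 residues arises by chance far more often than 30% identity over 100 residues, which is the length effect described above.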
Protein sequence alignment
• When determining whether two protein sequences are homologous, if both sequences are aligned over their full length (about 100 residues), an identity of 30% or higher can be safely regarded as indicating close homology.
Figure: The three zones of protein sequence alignments. Two protein sequences can be regarded as homologous if the percentage sequence identity falls in the safe zone (identity of 30% or higher). Sequence identity values below the zone boundary, but above 20%, are considered to be in the twilight zone, where homologous relationships are less certain. The region below 20% is the midnight zone, where homologous relationships cannot be reliably determined.
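The three zones can be written as a small lookup, shown here as a sketch. The 30% and 20% cut-offs come from the text above; the function name `classify_zone` is hypothetical, and placing a value of exactly 20% in the twilight zone is an assumption, since the text does not say which side that boundary belongs to.

```python
def classify_zone(identity_percent):
    """Map a percentage identity to the zone named in the lecture."""
    if identity_percent >= 30:
        return "safe"
    if identity_percent >= 20:  # assumption: exactly 20% counts as twilight
        return "twilight"
    return "midnight"


for value in (45.0, 25.0, 12.0):
    print(value, classify_zone(value))
```

These fixed cut-offs only apply to alignments of roughly 100 residues or more; for shorter alignments a higher identity is needed before homology can be inferred.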
Sequence Similarity & Sequence Identity
• Sequence similarity and sequence identity are synonymous for nucleotide sequences. For protein sequences, however, the two concepts are very different.
• In a protein sequence alignment, sequence identity refers to the percentage of matches of the same amino acid residues between two aligned sequences.
• Similarity refers to the percentage of aligned residues that have similar physicochemical characteristics and can be more readily substituted for each other.
• One way to calculate both values is to use the overall lengths of the two sequences:
S = [(Ls ✕ 2) / (La + Lb)] ✕ 100
where S is the percentage sequence similarity, Ls is the number of aligned residues with similar characteristics, and La and Lb are the total lengths of each individual sequence.
I = [(Li ✕ 2) / (La + Lb)] ✕ 100
where I is the percentage sequence identity, Li is the number of aligned residues with the exact same residues, and La and Lb are the total lengths of each individual sequence.
> protein-A
ACHKLMGCGLITPNASR
> protein-B
SKTVHRMPGSRAPKLSM
Alignment legend: star = identical residues; one dot = somewhat similar; two dots = very similar; dashes = gaps in the sequences.
I = [(4 ✕ 2) / (17 + 17)] ✕ 100
Percentage sequence identity (I) = 23.53%
• Although of equal length, these two sequences share only a low percentage of identical residues.
• They may not have a common evolutionary origin.
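The arithmetic of the worked example can be checked with a short sketch. It assumes the length-normalised form used in the example (matches doubled, divided by the summed sequence lengths); the function name `percent_score` is hypothetical, and the count of 4 identical residues is taken from the slide rather than recomputed.

```python
def percent_score(n_matching, len_a, len_b):
    """Percentage identity (or similarity) normalised by both lengths.

    n_matching is Li for identity or Ls for similarity.
    """
    return (n_matching * 2) / (len_a + len_b) * 100


# protein-A and protein-B are both 17 residues long, with 4 identical
# aligned residues according to the slide.
identity = percent_score(4, 17, 17)
print(f"{identity:.2f}%")
```

The same function gives the percentage similarity when the number of similar aligned residues is passed instead of the number of identical ones.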

Methods of sequence alignment
• Pairwise sequence alignment – compare two sequences
• Multiple sequence alignment – compare one sequence to many others
For each of the above, we can do
• Local alignment – compare similar parts of two sequences
• Global alignment – compare the sequences over their whole length
• Different types of alignment rely on different assumptions and methods.
Pairwise vs. Multiple sequence alignment
• Pairwise
  • The process of lining up two sequences to achieve maximal levels of identity/similarity, for the purpose of assessing the degree of similarity and the possibility of homology.
  • Example: it is used to decide whether two genes are structurally or functionally related.
Figure legend: bar = identical residues; one dot = somewhat similar; two dots = very similar; dashes = gaps in the sequences.
Pairwise vs. Multiple sequence alignment
• Multiple
  • A multiple sequence alignment (MSA) is an alignment of three or more sequences such that each column of the alignment is an attempt to represent the evolutionary changes at one sequence position, including substitutions, insertions, and deletions.
  • It is believed that, over time, the functional components embedded within the sequences are conserved in order to retain function.
Disease-associated I71V variant/mutant

Local vs. Global alignment
Local alignment
• Aligns segments of the sequences.
• Identifies short conserved regions.
• Complete alignment is not done.
• May miss out on some important conserved residues.
• Computationally less intensive, faster than global alignment.
• Example: Smith-Waterman, BLAST, FASTP
Global alignment
• Aligns the entire sequence.
• Identifies all conserved residues.
• Dynamic programming is required.
• Computationally intensive, much slower than local alignment.
• Example: Needleman & Wunsch method
[Figure: S1 and S2 with their Ancestor, shown for each alignment type]
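To illustrate what "aligns segments" means, here is a minimal, score-only sketch in the style of Smith-Waterman. The scoring values (match +2, mismatch -1, gap -1) are assumptions chosen for illustration and do not come from the lecture; real tools use substitution matrices and affine gap penalties.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between a and b (Smith-Waterman style).

    Scores are clamped at zero, so a poor region never drags down a good
    segment, and the best cell anywhere in the matrix is reported.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The shared segment GLIVT is found even though the flanks are unrelated.
print(local_alignment_score("XXXGLIVTYYY", "AAGLIVTCC"))
```

A global method would instead be forced to account for the unrelated flanking residues at both ends.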
Alignment algorithms
The three primary methods of producing pairwise alignments:
1. Dot matrix method (old method)
2. The dynamic programming (DP) algorithm (advanced method)
3. Word or k-tuple methods

Alignment algorithms
The dot-matrix method:
• The two sequences are written out as the column and row headings of a two-dimensional matrix.
• A dot is placed in the dot-matrix plot at each position where the nucleotides in the two sequences are identical.
• The alignment is defined by a path from the upper-left element to the lower-right element.
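The construction described above can be sketched in a few lines. The function names are hypothetical, and the example reuses the short sequences GLISVT and GIVT from the earlier slides; no window or threshold filtering is applied, which real dot-plot programs usually add to reduce noise.

```python
def dot_matrix(row_seq, col_seq):
    """Boolean matrix: True where the row and column residues are identical."""
    return [[r == c for c in col_seq] for r in row_seq]


def render(row_seq, col_seq):
    """Text rendering: 'x' marks a dot, '.' marks an empty cell."""
    lines = ["  " + " ".join(col_seq)]
    for residue, row in zip(row_seq, dot_matrix(row_seq, col_seq)):
        cells = " ".join("x" if hit else "." for hit in row)
        lines.append(residue + " " + cells)
    return "\n".join(lines)


print(render("GLISVT", "GIVT"))
```

Runs of dots along a diagonal correspond to matching segments, and a shift from one diagonal to another corresponds to a gap in one of the sequences.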
Advantages of Dot-Matrix method
The vertical gap indicates that a coding region corresponding to ~75 amino acids has either been deleted from the human gene or inserted into the bacterial gene.
The two diagonally oriented parallel lines most probably indicate that a small internal duplication has occurred in the bacterial gene.
Disadvantages of Dot-Matrix method
May not identify the best alignment.

Dynamic programming method
• Global alignment programs are based on the Needleman-Wunsch algorithm and local alignment programs on the Smith-Waterman algorithm. Both are derived from the basic dynamic programming algorithm.
• Three steps in dynamic programming:
1. Initialization
2. Matrix fill (scoring)
3. Traceback (alignment)
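The three steps can be sketched for global alignment as follows. This is a minimal Needleman-Wunsch style illustration, not a production implementation: the scoring scheme (match +1, mismatch -1, linear gap penalty -1) is an assumption that does not come from the lecture, and the function name is hypothetical.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # 1. Initialization: first row and column hold cumulative gap penalties.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # 2. Matrix fill: each cell takes the best of its three neighbours.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # 3. Traceback: walk from the lower-right cell back to the origin.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, q_aligned, d_aligned = global_align("GLISVT", "GIVT")
print(score)
print(q_aligned)
print(d_aligned)
```

With these assumed scores, the sequences q and d from the earlier slides align as GLISVT over G-I-VT, the same alignment shown there. A local (Smith-Waterman style) variant differs mainly in clamping cell scores at zero and starting the traceback from the highest-scoring cell.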
Word or k-tuple method

• This method is useful in large-scale database searches to find out whether a database contains a significant match to the query sequence.
• The word method is used in database search tools such as the BLAST family.
• These tools identify a series of short, non-overlapping subsequences (words) in the query sequence, which are then matched against candidate database sequences.
• Details of the other algorithms will be explained in the next class.
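The word-matching idea can be sketched as follows. This only illustrates the seeding step described above and is not how BLAST itself is implemented; the word length k = 3, the function names, and the example sequences are all assumptions made for illustration.

```python
def query_words(query, k=3):
    """Short, non-overlapping words of length k with their query positions."""
    return [(i, query[i:i + k]) for i in range(0, len(query) - k + 1, k)]


def word_hits(query, subject, k=3):
    """Pairs (query_pos, subject_pos) where a query word occurs in subject."""
    # Index every k-length window of the subject for constant-time lookup.
    index = {}
    for j in range(len(subject) - k + 1):
        index.setdefault(subject[j:j + k], []).append(j)
    return [(i, j) for i, word in query_words(query, k) for j in index.get(word, [])]


print(query_words("GLISVTGLI"))
print(word_hits("GLISVTGLI", "AAGLIVT"))
```

In a database search, only subject sequences that produce hits would be passed on to a slower, more sensitive alignment step, which is what makes the approach fast on large databases.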

MSA - a case study
• Gorilla and Chimpanzee are closely related in terms of ANG’s evolution.

• Human is also a closely related species.
• Chicken is distant in terms of evolution when ANG is considered specifically.
Thank you
