Lecture 6 Evolutionary Sequence Alignment Algorithms
Sequence Alignment and Algorithms
• A case study
Overview
• Sequence comparison (DNA, RNA, and protein) lies at the heart of
bioinformatics analysis.
• This is because residues that perform key functional or structural roles tend to be
preserved by natural selection, while other residues tend to mutate more frequently.
• Identity: a quantity that describes how much two sequences are alike in
the strictest sense, i.e., the extent to which two sequences are invariant
(the same residue at the same aligned position).
  • Example:
    • Alignment:
      q’: GLISVT
      d’: G-I-VT
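The identity of the example alignment can be worked out with a few lines of Python. This is a minimal sketch, not a standard routine: the function name `percent_identity` is made up for illustration, and it assumes identity is counted over all alignment columns, so gapped columns count as mismatches. Other conventions (for example, dividing by the length of the shorter sequence) give different percentages.

```python
def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent identity over all alignment columns.

    Assumption: gapped columns count as mismatches and the denominator
    is the alignment length. Other conventions exist.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    if not a:
        raise ValueError("alignment is empty")
    # A column is a match only if both residues are equal and not a gap.
    matches = sum(1 for x, y in zip(a, b) if x == y and x != gap)
    return 100.0 * matches / len(a)


identity = percent_identity("GLISVT", "G-I-VT")
print(f"{identity:.1f}% identity")  # 4 of 6 columns match: 66.7% identity
```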
• For protein sequences, there are 20 possible amino acid residues, so two
unrelated sequences can match at about 5% of residues by random chance
(1 in 20, assuming equal residue frequencies). If gaps are allowed, the
percentage could increase to 10–20%.
• The shorter the sequence, the higher the chance that some alignment is
attributable to random chance. The longer the sequence, the less likely it is
that matching at the same level of similarity is attributable to random chance.
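Both points above can be checked numerically. The sketch below rests on simplifying assumptions that are not in the slides: residues are drawn independently with equal frequencies, the alignment is fixed and ungapped, and the number of chance matches therefore follows a binomial distribution. The function names are illustrative only.

```python
from math import comb


def expected_chance_identity(freqs):
    """Probability that two independently drawn residues are identical."""
    return sum(p * p for p in freqs)


def prob_identity_at_least(n, k, p):
    """Binomial tail: P(at least k of n ungapped positions match by chance)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Assumption: all 20 amino acids are equally frequent.
uniform = [1 / 20] * 20
p_match = expected_chance_identity(uniform)  # about 0.05, i.e. 5%

# Chance of reaching 30% identity or more at different sequence lengths.
for n in (10, 50, 100):
    k = (3 * n + 9) // 10  # smallest integer count that is >= 30% of n
    print(n, k, prob_identity_at_least(n, k, p_match))
```

Under these assumptions the tail probability shrinks rapidly as the length grows, which is why the same percent identity is far more convincing for a long alignment than for a short one. Real residue frequencies are not uniform and real alignments are optimized with gaps, so the figures are only indicative.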
Protein sequence alignment
• For determining a homology relationship between two protein sequences, if both
sequences are aligned over their full length (100 residues long), an identity of
30% or higher can be safely regarded as indicating close homology.
> protein-B
SKTVHRMPGSRAPKLSM
Star: identical residues, One dot: somewhat similar, Two dots: very similar, Dashes: gaps in sequences
• Although of equal length, these two sequences have low identity.
Bar: identical residues, One dot: somewhat similar, Two dots: very similar, Dashes: gaps in sequences
Pairwise vs. Multiple sequence alignment
• Multiple
  • A multiple sequence alignment (MSA) is an alignment of three or more
  sequences such that each column of the alignment is an attempt to represent
  the evolutionary changes in one sequence position, including substitutions,
  insertions, and deletions.
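The column-wise view of an MSA can be illustrated as follows. This is a sketch on a hypothetical toy alignment (the three rows are invented for the example), and the helper names `columns` and `describe_column` are illustrative, not taken from any library.

```python
from collections import Counter

# Hypothetical toy alignment: rows are aligned sequences of equal length.
msa = [
    "GLISVT",
    "G-I-VT",
    "GLITVT",
]


def columns(alignment):
    """Return the alignment columns as strings, one per sequence position."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows must be non-empty and of equal length")
    return ["".join(col) for col in zip(*alignment)]


def describe_column(column, gap="-"):
    """Summarize one column: residue counts, gap count, and full conservation."""
    residues = Counter(c for c in column if c != gap)
    return {
        "residues": dict(residues),
        "gaps": column.count(gap),  # gaps suggest insertions or deletions
        "conserved": len(residues) == 1 and gap not in column,
    }


for position, column in enumerate(columns(msa), start=1):
    print(position, column, describe_column(column))
```

In this toy case a column such as `S-T` shows both a substitution (S versus T) and a gap in one sequence, while `GGG` is fully conserved.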
• The two sequences are written out as column and row headings of a
two-dimensional matrix.
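One common use of such a matrix is a dot plot, where a cell is marked whenever the row residue equals the column residue. The sketch below assumes that is the intended use here. Dynamic programming methods use the same layout but store scores instead of marks. The function names and the short test sequences are illustrative.

```python
def dot_matrix(row_seq, col_seq):
    """Boolean matrix: cell [i][j] is True when row_seq[i] == col_seq[j]."""
    return [[r == c for c in col_seq] for r in row_seq]


def render(row_seq, col_seq):
    """Draw the matrix as text with sequence residues as row and column headings."""
    matrix = dot_matrix(row_seq, col_seq)
    lines = ["  " + " ".join(col_seq)]
    for residue, row in zip(row_seq, matrix):
        lines.append(residue + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


# Runs of '*' along a diagonal indicate stretches where the sequences match.
print(render("GLISVT", "GIVT"))
```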
• The word method is used in database search tools such as the BLAST
family.
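The core idea of a word method, finding short exact words shared by two sequences before doing any costly alignment, can be sketched as below. This is a simplified illustration, not how any particular tool is implemented: it covers only exact word seeding, and real search tools add scoring and extension steps that are omitted here. The word length `k=3`, the function names, and the query string are assumptions made for the example.

```python
from collections import defaultdict


def index_words(seq, k):
    """Map each length-k word in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def word_hits(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where an exact k-letter word is shared."""
    index = index_words(subject, k)
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits.append((i, j))
    return hits


subject = "SKTVHRMPGSRAPKLSM"  # protein-B from the slides
query = "TVHRMAG"              # hypothetical query, invented for illustration
print(word_hits(query, subject))  # [(0, 2), (1, 3), (2, 4)]
```

The three hits lie on the same diagonal (subject position minus query position is always 2), which is the kind of signal a search tool would use to decide where a fuller alignment is worth attempting.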