Sequence Alignment Methods and Algorithms
Sequence Alignments
A. Brüngger, Labhead Bioinformatics
Novartis Pharma AG
adrian.bruengger@pharma.novartis.com
Algorithms for Sequence Alignments
Structure
Similar sequence leads to similar function
Evolutionary relationship between two similar sequences and a possible common
ancestor. The number of steps to convert one sequence into the other is the
"evolutionary" distance between the sequences (x + y). Usually, the ancestor
sequence is not available, only (x + y) can be computed.
[Figure: common ancestor sequence, y steps to Sequence A, x steps to Sequence B]
Origins of Homology Significance of Sequence Alignments
Pairwise Global Alignment (over whole length of sequences)

MLGPSSKQTGKGS-SRIWDN*
||    |  |||   |  |
MLN-ITKSAGKGAIMRLGDA*

Pairwise Local Alignment (similar parts of sequences)

GKG
|||
GKG
• sequence alignment is an optimization problem:
bring as many identical residues as possible into corresponding positions
Pairwise Sequence Alignment
agaag-tagattcta
|| || ||| || ||
aggaggtag-tt-ta

Compute Score:
• 11 matches
• 1 mismatch
• 3 gaps
Score = 11 - 1 - 3 = 7
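The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch that assumes the scoring implied by the slide (match +1, mismatch -1, gap -1); the function name `alignment_score` is illustrative and not part of any library.

```python
def alignment_score(top, bottom, match=1, mismatch=-1, gap=-1):
    """Score two already-aligned strings of equal length ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            counts["gap"] += 1
        elif x == y:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    score = (counts["match"] * match
             + counts["mismatch"] * mismatch
             + counts["gap"] * gap)
    return score, counts

score, counts = alignment_score("agaag-tagattcta", "aggaggtag-tt-ta")
print(score, counts)  # 7 {'match': 11, 'mismatch': 1, 'gap': 3}
```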
Example
Dot plots of DNA and protein of phage λ cI (horizontal) and P22 c3 (vertical)
[Figure panels: DNA, Protein]
Removing “noise”:
Plot a dot only if 7 ("stringency") out of the next 11 ("window size") residues are identical
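The filtering rule can be sketched as follows. This is an illustrative implementation, not the code of any of the tools named in this deck; the function name `dot_plot` and the toy sequence are assumptions. Positions are 0-based.

```python
def dot_plot(a, b, window=11, stringency=7):
    """Return (i, j) pairs where at least `stringency` of the next `window`
    residues of a (from i) and b (from j) are identical."""
    dots = []
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            identical = sum(a[i + k] == b[j + k] for k in range(window))
            if identical >= stringency:
                dots.append((i, j))
    return dots

# Toy example: a sequence compared with itself gives dots on the main diagonal.
seq = "ACGTACGTACGTAC"
print(dot_plot(seq, seq))  # [(0, 0), (1, 1), (2, 2), (3, 3)]
```

This direct version costs O(n·m·window); the window count can be updated incrementally along each diagonal to reach O(n·m).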
Algorithms for Pairwise Sequence Alignments: Dot Plot
• Strengths:
– visualization of sequence similarity
– finding direct and inverted repeats in sequences
– finding self-complementary regions in RNA (secondary structures)
– simple to compute, simple to visualize
• Implementations
– DNA Strider (Macintosh)
– DOTTER (UNIX X-Windows)
– GCG "COMPARE", "DOTPLOT"
– online SIB: http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
• Computational Complexity:
– two sequences of length n, m: O(n·m)
[Figure: compute time vs. sequence length]
Scoring Matrices for Proteins
A W T V A S A V R T S I   Organism A
A Y T V A S A V R T S I   Organism B
A W T V A A A V L T S I   Organism C
[Figure: tree relating organisms A, B, C]
Observed substitutions:
– W to Y
– L to R
– A to S
Therefore: Assigning L to R scores higher than assigning it to another AA
Scoring Matrices for Proteins
PAM1: p(HM) = 0
PAM2: p(HM) = p(HK)*p(KM) + p(HR)*p(RM)
[Figure: amino acids H, K, R; H reaches M via K or R]
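The PAM2 entry above is one element of the matrix product PAM1 · PAM1. The sketch below illustrates this on a toy four-letter alphabet; the probabilities are hypothetical values chosen only so that each row sums to 1 and p(HM) = 0, and they are not taken from a real PAM matrix.

```python
STATES = ["H", "K", "R", "M"]

# Hypothetical one-step substitution probabilities (rows: from, columns: to).
PAM1 = [
    [0.98, 0.01, 0.01, 0.00],  # from H: no direct H -> M step
    [0.01, 0.97, 0.01, 0.01],  # from K
    [0.01, 0.01, 0.97, 0.01],  # from R
    [0.00, 0.01, 0.01, 0.98],  # from M
]

def matmul(a, b):
    """Plain square-matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

PAM2 = matmul(PAM1, PAM1)

h, k, r, m = (STATES.index(s) for s in "HKRM")
via_k_or_r = PAM1[h][k] * PAM1[k][m] + PAM1[h][r] * PAM1[r][m]
print(PAM2[h][m], via_k_or_r)  # both are p(HK)*p(KM) + p(HR)*p(RM)
```

The terms through H and M themselves drop out because p(HM) = 0 in PAM1, which leaves exactly the two paths written on the slide.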
Scoring Matrices for Proteins
• Description by Example:
– input: two AA sequences “VDSCY” and “VESLCY”
  scoring “matrix”: match +4, mismatch +2, gap -2
– Some possible alignments and their scores:
    VDSCY      VD-SCY     VDS-CY
    | |        |  |       | | ||
    VESLCY     V-ESLCY    VESLCY
    14         8          16
– Observation: longer alignments can be derived from shorter ones
  new score = old score + score of new pair
    VDS-CY     VDS-C  Y
    VESLCY     VESLC  Y
    16      =  12   + 4

    VDS-C      VDS-  C
    VESLC      VESL  C
    12      =  8   + 4
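The observation can be checked by accumulating the score column by column. A minimal sketch, assuming the scoring from the example (match +4, mismatch +2, gap -2); `column_scores` is an illustrative helper name.

```python
from itertools import accumulate

def column_scores(top, bottom, match=4, mismatch=2, gap=-2):
    """Score of each aligned column ('-' marks a gap)."""
    scores = []
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            scores.append(gap)
        else:
            scores.append(match if x == y else mismatch)
    return scores

# Running score of every prefix of the alignment VDS-CY / VESLCY.
running = list(accumulate(column_scores("VDS-CY", "VESLCY")))
print(running)  # [4, 6, 10, 8, 12, 16]
```

The last three values are the 8, 12 and 16 from the slide: each longer alignment adds only the score of its new pair.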
Dynamic Programming Approach to Sequence Alignments
Seq 1: V D S - C Y
Seq 2: V E S L C Y

Matrix layout: Seq 1 across the top, Seq 2 down the side.

Step 1: fill first row/column

      V  D  S  C  Y
   V  4  2  2  2  2
   E  2
   S  2
   L  2
   C  2
   Y  2

Step 2: fill second row and column

      V  D  S  C  Y
   V  4  2  2  2  2
   E  2  6  4  4  4
   S  2  4
   L  2  4
   C  2  4
   Y  2  4

Cell (E, S), two possibilities:
   4 + (E->S) + gap = 4
   2 + (E->S) = 4
   choose one (first)

Cell (L, D), three possibilities:
   4 + (L->D) + 2 gaps = 2
   2 + (L->D) + 1 gap = 2
   2 + (L->D) = 4
   choose third (score max)
Dynamic Programming Approach to Sequence Alignments
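• Formally, with s(ai -> bj) the score for pairing residue ai with residue bj
and gap the (negative) gap score, as in the worked example:
Si,j = s(ai -> bj) + max { Si-1,j-1 ;
                           Si-k,j-1 + (k-1)·gap, k ≥ 2 ;
                           Si-1,j-k + (k-1)·gap, k ≥ 2 }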
• Iterate process until whole matrix is filled with scores and back-
pointers
• Choose maximum score in last column or row
• Follow pointers to construct alignment
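The procedure can be sketched in Python for the worked example (VDSCY against VESLCY, match +4, mismatch +2, gap -2). This is an illustrative reading of the slides, not a reference implementation: predecessors are taken from the previous row and previous column as in the example cells, ties are broken in favour of the diagonal (the slide picks "the first"), and overhanging sequence ends are not printed. The names `fill_matrix` and `traceback` are assumptions.

```python
def fill_matrix(a, b, match=4, mismatch=2, gap=-2):
    """a runs across the top (columns), b down the side (rows)."""
    n, m = len(b), len(a)
    S = [[0] * m for _ in range(n)]
    ptr = [[None] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            pair = match if b[i] == a[j] else mismatch
            best, back = 0, None          # first row/column: pair score only
            if i > 0 and j > 0:
                best, back = S[i - 1][j - 1], (i - 1, j - 1)
                for k in range(2, i + 1):  # skip k-1 residues of b
                    cand = S[i - k][j - 1] + (k - 1) * gap
                    if cand > best:
                        best, back = cand, (i - k, j - 1)
                for k in range(2, j + 1):  # skip k-1 residues of a
                    cand = S[i - 1][j - k] + (k - 1) * gap
                    if cand > best:
                        best, back = cand, (i - 1, j - k)
            S[i][j] = best + pair
            ptr[i][j] = back
    return S, ptr

def traceback(a, b, S, ptr):
    """Start at the maximum of the last row or column and follow pointers."""
    n, m = len(b), len(a)
    cells = [(n - 1, j) for j in range(m)] + [(i, m - 1) for i in range(n)]
    i, j = max(cells, key=lambda c: S[c[0]][c[1]])
    score, top, bottom = S[i][j], [], []
    while True:
        top.append(a[j])
        bottom.append(b[i])
        if ptr[i][j] is None:
            break
        pi, pj = ptr[i][j]
        for x in range(i - 1, pi, -1):    # skipped residues of b face gaps
            top.append("-")
            bottom.append(b[x])
        for y in range(j - 1, pj, -1):    # skipped residues of a face gaps
            top.append(a[y])
            bottom.append("-")
        i, j = pi, pj
    return score, "".join(reversed(top)), "".join(reversed(bottom))

S, ptr = fill_matrix("VDSCY", "VESLCY")
print(S[1])                                # [2, 6, 4, 4, 4]
print(traceback("VDSCY", "VESLCY", S, ptr))  # (16, 'VDS-CY', 'VESLCY')
```

Scanning a whole row and column for every cell makes this version slower than O(n·m); the usual three-neighbour recurrence (diagonal, up, left) avoids the inner loops.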
Dynamic Programming Approach to Sequence Alignments
• Global Alignment
Needleman and Wunsch
(1970)
• Local Alignment:
Smith-Waterman
(minor modification)
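The "minor modification" for local alignment is usually described as two changes: no cell may drop below zero, and the traceback starts at the highest cell anywhere in the matrix and stops at the first zero. The sketch below uses the common three-neighbour recurrence with a linear gap penalty. The scoring values (match +2, mismatch -1, gap -2) and the toy sequences are assumptions; local alignment needs negative mismatch scores, so the +4/+2/-2 scheme of the earlier example would not work here.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) of one best local alignment."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                       # restart: the local twist
                          H[i - 1][j - 1] + pair,  # pair a[i-1] with b[j-1]
                          H[i - 1][j] + gap,       # gap in b
                          H[i][j - 1] + gap)       # gap in a
            if H[i][j] > best:
                best, bi, bj = H[i][j], i, j
    top, bottom = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and H[i][j] > 0:         # stop at the first zero
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))

# Toy sequences that share only the motif GKG.
print(smith_waterman("AAGKGTT", "CCGKGCC"))  # (6, 'GKG', 'GKG')
```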
Dynamic Programming Approach to Sequence Alignments
• Strengths:
– finds optimal (mathematically best) alignment
– suited for both local and global alignments
– global alignment: Needleman-Wunsch
– local alignment: Smith-Waterman
– statistical significance can be attached
(recompute the alignment after one or both sequences have been randomly
shuffled and compare the scores)
• Implementations
– LALIGN
– GCG "GAP" (global) and BESTFIT (local)
– ...
• Complexity:
– two sequences of length n, m
– time: O(n·m), space: O(n·m)
– space is crucial
• example: matrix element 4 bytes, n=m=10000, space requirement: 400MB
• can be improved: trade time for space
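The space figure is simple arithmetic, and the score alone needs far less memory because each row depends only on the row above it. The sketch below assumes the common three-neighbour global recurrence with a linear gap penalty (end gaps are penalised, unlike the worked example earlier); both function names are illustrative.

```python
def matrix_bytes(n, m, bytes_per_cell=4):
    """Memory for a full n x m score matrix."""
    return n * m * bytes_per_cell

def global_score_two_rows(a, b, match=4, mismatch=2, gap=-2):
    """Optimal global alignment score, keeping only two rows in memory."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + pair, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[-1]

print(matrix_bytes(10000, 10000))                # 400000000 bytes = 400 MB
print(global_score_two_rows("VDSCY", "VESLCY"))  # 16
```

Two rows give the score but not the alignment itself. Recovering the alignment in linear space takes a divide-and-conquer scheme (Hirschberg's algorithm), which recomputes parts of the matrix: that is the time-for-space trade.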
Multiple Sequence Alignment
[Figure: tree with sequences B, C and D and a joined node "C, D"]
Multiple Sequence Alignment: Progressive Methods, CLUSTALW
• Example: Seven Globins from SWISSPROT
Multiple Sequence Alignment: Progressive Methods, CLUSTALW
• Available Implementations
Riddle: Probability that a short string occurs in a longer text
• Example:
q    MAAARLCLSLLLLSTCVALLLQPLLGAQGAPLEPVYPGDNATPEQMAQYAADLRRYINMLTRPRYGKRHKEDTLAFSEWGS
     || |  |||| | || |||||||| |  |||||| |||| |||||||||   |||||| |||||||||  |
...  MAVAYCCLSLFLVSTWVALLLQPLQGTWGAPLEPMYPGDYATPEQMAQYETQLRRYINTLTRPRYGKRAEEENTGGLP...
Gonnet120 score: 412, 76 % identities
[Figure labels: 1, 2, 3; "no more increase in score"]
• BLAST
• HSP, “high scoring pair”
• gapped alignment
• starting extension also from similar (and not only identical) seeds
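Extending a seed until the score stops increasing can be sketched without gaps as follows. This is a simplified illustration, not the BLAST implementation: it scores identities as +1 and everything else as -1 instead of using a substitution matrix, and the drop-off threshold `x_drop=5` is an arbitrary assumption. `extend_seed` is a hypothetical name; positions are 0-based.

```python
def extend_seed(q, s, qi, si, k, match=1, mismatch=-1, x_drop=5):
    """Extend an identical k-mer (q[qi:qi+k] == s[si:si+k]) in both directions.

    Returns (score, q_start, q_end) with q[q_start:q_end] the extended segment.
    """
    def walk(step, qpos, spos):
        best = score = 0
        best_len = length = 0
        while 0 <= qpos < len(q) and 0 <= spos < len(s):
            score += match if q[qpos] == s[spos] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length   # remember the best extension
            elif best - score >= x_drop:
                break                            # score fell too far: give up
            qpos += step
            spos += step
        return best, best_len

    right, right_len = walk(+1, qi + k, si + k)
    left, left_len = walk(-1, qi - 1, si - 1)
    return k * match + left + right, qi - left_len, qi + k + right_len

query = ("MAAARLCLSLLLLSTCVALLLQPLLGAQGAPLEPVYPGDNATPEQMAQYAADLRRYINML"
         "TRPRYGKRHKEDTLAFSEWGS")
subject = ("MAVAYCCLSLFLVSTWVALLLQPLQGTWGAPLEPMYPGDYATPEQMAQYETQLRRYINTL"
           "TRPRYGKRAEEENTGGLP")

seed = "APLEP"
score, start, end = extend_seed(query, subject, query.find(seed),
                                subject.find(seed), len(seed))
print(score, query[start:end])
```

With this identity scoring the extension ends where further residues no longer raise the score, so the segment boundaries differ from those obtained with Gonnet120 in the example.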
Basics of Sequence DB Searches: Detection of identical k-mers
• Precompute position of all k-mers in DB sequence
• Indexing all peptides of length k in "database"
• Example:
  0        1         2         3
  1234567890123456789012345678901234
  MAAARLCLSLLLLSTCVALLLQPLLGAQGAPLEP
  MAAAR
   AAARL
    AARLC
  ....
                               APLEP
Sorted:
AAARL 2
AARLC 3
APLEP 30
...
MAAAR 1
...
VALLL 17
• For each peptide of length k in "query", the position in the wordlist can be
easily computed (no binary search!)
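One way to read "no binary search" is to treat each k-mer as a base-20 number, which gives its slot in the sorted word list directly. The sketch below follows that reading; it is an assumption about the intended scheme, and it stores the table as a dictionary keyed by the code, where a full implementation might use an array of 20^k slots. Function names are illustrative.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, alphabetical
RANK = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def kmer_code(kmer):
    """Read the k-mer as a base-20 number: its rank among all sorted k-mers."""
    code = 0
    for aa in kmer:
        code = code * 20 + RANK[aa]
    return code

def build_index(db_seq, k):
    """Map k-mer code -> list of 1-based start positions in db_seq."""
    index = {}
    for start in range(len(db_seq) - k + 1):
        code = kmer_code(db_seq[start:start + k])
        index.setdefault(code, []).append(start + 1)
    return index

db = "MAAARLCLSLLLLSTCVALLLQPLLGAQGAPLEP"
index = build_index(db, 5)
for word in ("AAARL", "AARLC", "APLEP", "MAAAR", "VALLL"):
    print(word, index.get(kmer_code(word), []))
# AAARL [2], AARLC [3], APLEP [30], MAAAR [1], VALLL [17]
```

Looking up a query k-mer is then one code computation plus one table access, independent of the database size.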
Significance of matches: DNA case
• issue: when searching with a short query against a large database, found
matches could have occurred by pure chance
• assume a uniform distribution of c, g, a, t
• what is ...
– the probability q, that sequence B (len=m) is contained in sequence A
(len=n)?
– the expected length of a common subsequence of two sequences?
– the expected score when locally aligning two sequences of length n, m
• statistics
– the statistical distribution of alignment scores found in a DB search
follows the extreme value distribution (not normal distribution)
– extreme value distribution changes with length of sequences and their
residue composition
– scores of actual database search results are plotted vs. expected
scores (FASTA)
– BLAST computes E-Value (number of expected hits with this score,
when comparing the query with unrelated database sequences)
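For the first question, a back-of-the-envelope estimate under the uniform-distribution assumption can be sketched as follows. The expected number of occurrences is exact by linearity of expectation; the probability of at least one occurrence treats the start positions as independent, which is only an approximation because overlapping positions are correlated. Function names are illustrative.

```python
def expected_occurrences(n, m, alphabet_size=4):
    """Expected number of exact matches of a fixed word of length m
    in a random text of length n (uniform, independent letters)."""
    if m > n:
        return 0.0
    return (n - m + 1) / alphabet_size ** m

def prob_at_least_one(n, m, alphabet_size=4):
    """Approximate probability q that the word occurs at least once,
    treating the n - m + 1 start positions as independent."""
    if m > n:
        return 0.0
    p = alphabet_size ** -m          # chance of a match at one position
    return 1.0 - (1.0 - p) ** (n - m + 1)

# A 5-mer in a DNA text of length 1000: close to one chance hit expected.
print(expected_occurrences(1000, 5))   # 996 / 1024
print(prob_at_least_one(1000, 5))
```

The same reasoning explains why short queries against large databases need significance statistics: as n grows, chance matches of short words become near certain.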
Putting it together: BLAST2
Query: A W T V A S A V R T S I
3-mers of the query:
AWT VAS AVR TSI | WTV ASA VRT | TVA SAV RTS
Similar words:
AWA IAS TVR ...
TWA LAS AIR ...
ART ITS AVS ...
... ... ...
[Figure: dot plot]
Each dot: conserved stretch of AA
HSP, high scoring pair
Conclusions