75% found this document useful (4 votes)
2K views37 pages

Sequence Alignment Methods and Algorithms

This document discusses algorithms for sequence alignment. It begins by defining sequence alignment and the origins of sequence similarity, including global alignment, local alignment, and alignment of pairs and multiple sequences. It then describes methods for pairwise sequence alignment, including dot matrices, scoring matrices, and dynamic programming approaches like Smith-Waterman and Needleman-Wunsch. Finally, it discusses methods for multiple sequence alignment and database searching for similar sequences using tools like BLAST, FASTA, and hidden Markov models.

Uploaded by

api-3747254
Copyright
© Attribution Non-Commercial (BY-NC)
Available Formats
Download as PPS, PDF, TXT or read online on Scribd
Download as pps, pdf, or txt
75% found this document useful (4 votes)
2K views37 pages

Sequence Alignment Methods and Algorithms

This document discusses algorithms for sequence alignment. It begins by defining sequence alignment and the origins of sequence similarity, including global alignment, local alignment, and alignment of pairs and multiple sequences. It then describes methods for pairwise sequence alignment, including dot matrices, scoring matrices, and dynamic programming approaches like Smith-Waterman and Needleman-Wunsch. Finally, it discusses methods for multiple sequence alignment and database searching for similar sequences using tools like BLAST, FASTA, and hidden Markov models.

Uploaded by

api-3747254
Copyright
© Attribution Non-Commercial (BY-NC)
Available Formats
Download as PPS, PDF, TXT or read online on Scribd
Download as pps, pdf, or txt
Download as pps, pdf, or txt
You are on page 1/ 37

Algorithms for

Sequence Alignments
A. Brüngger, Labhead Bioinformatics
Novartis Pharma AG
adrian.bruengger@pharma.novartis.com
Algorithms for Sequence Alignments

• Definition of Sequence Alignment and Origins of Sequence


Similarity
– Global Alignment
– Local Alignment
– Alignment of Pairs of Sequences
– Alignment of Multiple Sequences
• Methods for Pairwise Sequence Alignments
– Dot matrix
– Scoring Matrices
– Dynamic Programming: Smith-Waterman, Needleman-Wunsch
• Methods for Multiple Sequence Alignment
– Dynamic Programming
– Progressive multiple alignments: CLUSTALW, PILEUP
• Database Searching for Similar Sequences
– FASTA, BLAST Outline follows: David W. Mount,
"Bioionformatics - Sequence and
– Patscan Genome Analysis“ Cold Spring
– Hidden Markov Models Harbour Laboratory Press, 2001.
Online:
http://www.bioinformaticsonline.or
g
Rational for Sequence Analysis, Origins of Sequence
Similarity
Function

Structure
Similar sequence leads to similar function

Protein Sequence Analysis as the basic tool to


discover
functional, structural,evolutionary
DNA information in biological sequences

Sequence A Sequence B
Evolutionary relationship between two
similar sequences and a possible common
y Steps x Steps ancestor. The number of steps to convert
one sequence into the other is the
"evolutionary" distance between the
sequences (x + y). Usually, the ancestor
common ancestor sequence is not available, only (x + y) can
sequence be computed.
Origins of Homology  Significance of Sequence Alignments

Possible Origins of Sequence Homology:


• orthologs (panel A and B) a1 in species I and a1 in species II (same ancestor!)
• paralogs (panel A and B) a1 and a2 (arose from gene duplication event)
• analogs (panel C): different genes converge to same function by different
evolutionary paths
• transfer of genetic material (panel D) between different species
Homology vs. Similarity
• Similarity can be computed (by sequence alignments)
• Homology is deduced (e.g. from similarity, but also from other evidence!)
Definition of Sequence Alignment

Computational procedure (“algorithm”) for comparing two/many


sequences
• identify series of identical residues or patterns of identical residues
that appear in the same order in the sequences
• visualized by writing sequences as follows:

MLGPSSKQTGKGS-SRIWDN*
|| | ||| | | Pairwise Global Alignment
(over whole length of sequences)
MLN-ITKSAGKGAIMRLGDA*

GKG
||| Pairwise Local Alignment
(similar parts of sequences)
GKG
• sequence alignment is an optimiztion problem
bringing as many identical residues as possible into corresponding
positions
Pairwise Sequence Alignment

Alignment Score: Quantify the Quality of an alignment


• simple scoring system (eg. for DNA sequences)
– when two identical residues are aligned: +1
– when two non-identical residues are aligned: -1
– when a gap is introduced -1
• total score of alignment is the sum of the scores in each position:

Compute Score:
agaag-tagattcta •11 matches
|| || ||| || || •1 mismatch
•3 gaps
aggaggtag-tt-ta Score = 11 - 1 -3 = 7

Alignment is 'optimal' when assigned score is maximal


Note: Scoring scheme defines 'optimal' alignment
Algorithms for Pairwise Sequence Alignments: Dot Plot

Algorithm introduced by Gibbs and McIntyre (1970):


• write one sequence on top from left to right
• write other sequence on the left from top to bottom
• place a dot whenever two residues are identical; variants use
colors
• visual inspection (variants: identify diagonals with “many” dots
a g a c c t a g
a * * *
g * *
a * * *
c * *
c * *
t *
Algorithms for Pairwise Sequence Alignments: Dot Plot

Example
Dot Plots of DNA and Protein of Phage l cI (horizontal) and P22 c3
(vertical)

DNA Protein

Removing “noise”:
Plot a dot only if 7 ("stringency") out of the next 11 ("window size") residues are identical
Algorithms for Pairwise Sequence Alignments: Dot Plot

Example: Human LDL receptor against itself.

Window size 1, stringency 1. Window size 23, stringency 7.

"Repeats" in first part of sequence.


Algorithms for Pairwise Sequence Alignments: Dot Plot

• Strengths:
– visualization of sequence similarity
– finding direct and inverted repeats in sequences
– finding self-complementary regions in RNA (secondary structures)
– simple to compute, simple to visualize

• Implementations
– DNA Strider (Macintosh)
– DOTTER (UNIX X-Windows)
– GCG "COMPARE", "DOTPLOT"
– online SIB: http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html

• Computational Complexity:

compute time
– two sequences of length n, m: O(n·m)

sequence length
Scoring Matrices for Proteins

Observation: Protein (Function, Structure) is preserved when


substitutions in certain AAs happen

• scoring system less simple in case of proteins


• Example: The same protein in three different species

A W T V A S A V R T S I Organism A
A Y T V A S A V R T S I Organism B
A W T V A A A V L T S I Organism C

A B C
Therefore:
W to Y Assigning L to R scores higher
than assigning it to another AA
L to R
A to S
Scoring Matrices for Proteins

Generalization: Dayhoff PAM Weight Matrices (percent accepted


mutation)
– observed mutations during a unit of evolutionary time
(1 PAM ~ time that is required that an AA is changed with probability
1%)
– initial PAM matrix ("PAM1"): derived from 1572 changes in 71 groups
of protein sequences that are at least 85% similar
– observed in proteins that have the same function => "accepted"
mutations
C S T P ..
C fcc fcs fct fcp
S fsc fss fst fsp
.
.
Scoring Matrices for Proteins

Extrapolation to longer evolutionary distances possible!

PAM1: p(HM) = 0
R K PAM2: p(HM) = p(HK)*p(KM)
+p(HR)*p(RM)

H
Scoring Matrices for Proteins

• For each pair of residues, a score is given by a "scoring matrix"


• Scoring of two pairs of residues depends on:
– frequency of the two residues
– exchange that tends to preserves function should have high score
• Scoring matrices are constructed by empirically evaluating
(manually constructed) alignments of closely related proteins
• PAM
– introduced by Margarete Dayhoff, 1978
– inspecting 1572 changes in 71 groups of protein sequences
– different matrices for varying degree of similarity (or, evolutionary
distance)
• BLOSUM
– inspection of about 2'000 conserved AA-stretches, "BLOCKS"
Sequence 1: V D S - C Y
Sequence 2: V E S L C Y
score 4 2 4 -2 4 4 Total 16
Dynamic Programming Approach to Sequence Alignments

• Description by Example:
– input: two AA sequences “VDSCY” and “VESLCY”
scoring “matrix”: match +4, mismatch +2, gap -2
– Some possible alignments and their score:
VDSCY VD-SCY VDS-CY
| | | | | | ||
VESLCY V-ESLCY VESLCY
14 8 16
– Observation: Longer alignments can be derived from shorter
new score = old score + score of new pair

VDS-CY VDS-C Y
VESLCY VESLC Y
16 = 12 + 4

VDS-C VDS- C
VESLC VESL C
12 = 8 + 4
Dynamic Programming Approach to Sequence Alignments

• Alignment: Path in matrix from "top left" to "bottom right"

V D S C Y
V
E
Seq 1: V D S - C Y
Seq 2: V E S L C Y S
L
C
Y

possible extensions of "current"


alignment ( )

• Three situations for extending previous alignment:

introduce gap in introduce gap in


sequence 1 sequence 2
Dynamic Programming Approach to Sequence Alignments
• In each matrix element, we can write the score of the best alignment
up to this element, and a pointer to the shorter alignment it is derive
from
V D S C Y
Goal is to compute this path!
V 4
Seq 1: V D S - C Y E 6
Seq 2: V E S L C Y
S 10
4 2 4 -2 4 4
L
match +4, mismatch +2, gap -2 C 12
Y 16

V D S C Y two possibilities: V D S C Y
4 + (E->S) + gap = 4
4 2 2 2 2 fill first V 4 2 2 2 2
V 2 + (E->S) = 4
row/column 2 6 4 4 4
E 2 choose one (first) E
2 S 2 4
S three possibilities:
2 4 + (L->D) + 2 gaps = 2 L 2 4
L
2 + (L->D) + 1 gap = 2 C 2 4
C 2 2 + (L->D) = 4
Y 2 choose third (score max) Y 2 4
Dynamic Programming Approach to Sequence Alignments

• Formally:
Si-1, j-1 + s(ai -> bi)

Sij = max [x≥1] (Si-x,j - wx)


max
max [y≥1] (Si,j-y - wy)

– ai, bj : Residue of Sequence A at position i and Residue of Sequence B at position j


– wx : Penalty for introducing gap of length x
– Sij : Score of alignment at position (i, j)
– s(a->b) : Score for aligning a and b (known from scoring matrix)

• Iterate process until whole matrix is filled with scores and back-
pointers
• Choose maximum score in last column or row
• Follow pointers to construct alignment
Dynamic Programming Approach to Sequence Alignments

• Global Alignment
Needleman and Wunsch
(1970)
• Local Alignment:
Smith-Waterman
(minor modification)
Dynamic Programming Approach to Sequence Alignments

• Implementations available on the web


Dynamic Programming Approach to Sequence Alignments

• Strengths:
– finds optimal (mathematically best) alignment
– suited for both, local and global alignments
– global alignment: Needleman-Wunsch
– local alignment: Smith-Waterman
– statistical significance can be attached
(recompute alignment when one or both sequences are randomly
changed)
• Implementations
– LALIGN
– GCG "GAP" (global) and BESTFIT (local)
– ...
• Complexity:
– two sequences of length n, m
– time: O(n·m), space: O(n·m)
– space is crucial
• example: matrix element 4 bytes, n=m=10000, space requirement: 400MB
• can be improved: trade time for space
Multiple Sequence Alignment

• When aligning k (k>2) sequences, the dynamic programming


idea can theoretically be used:
– instead of a two-dimensional scoring schema, a k-dimensional one is
used
– instead of two-dimensional residue substitution matrix, a k-
dimensional one is used
• However, this is not feasible in practical terms
– size of these matrices increase too rapidly!
– example: 5 sequences, 100 residues each, matrix size = 100^5 =
10^10
in terms of computer memory: 10 GB
• Consequence:
Optimal alignment for multiple sequences is not
computable!
• Approximative methods used (“heuristics”)
Multiple Sequence Alignment: Progressive methodes

• Idea for progressive methodes:

– step 1 : Align two of the sequences


– step 2 : Compute "consensus" sequence
– step 3 : Align a third sequence to the consensus
– step 4 : Repeat steps 2 and 3
Multiple Sequence Alignment: Progressive methodes,
CLUSTALW
• Basic steps
– step 1 : Perform pairwise sequence alignments of all possible pairs
– step 2 : Use scores to produce a “phylogenetic tree”
– step 3 : align the sequences sequentially, guided by the tree
– the most closely related sequences are aligned first
– other sequences or groups of sequences are added

• Example: Phylogenetic tree for 4 sequences


A B C D A B
A 100 90 85 90 90
B 100 95 80 90
C 100 97 85 80
D 100
95
Percentage Identities in C 97 D Join sequences with highest similarity
pairwise alignment
Multiple Sequence Alignment: Progressive methodes,
CLUSTALW
• Computation of Phylogenetic tree Join sequences with
highest similarity
A B C D
A 100 95 80 90 A 95 B A 95 B
B 100 95 85 90
C 100 97
80 85 (80+90)/2 (85+95)/2
D 100
95 = 85 = 90
Percentage Identities in
C 97 D
pairwise alignment
C, D

Join sequences with A, B


highest similarity
(85+90)/2
= 87.5
A

B
C, D
C
D
Multiple Sequence Alignment: Progressive methodes,
CLUSTALW
• Example: Seven
Globins from
SWISSPROT
Multiple Sequence Alignment: Progressive methodes,
CLUSTALW
• Available Implementations
Riddle: Probability that a short string occurs in a longer text

• Given: an alphabet A = {a, c, g, t}


a random string s of length k over A
a random text t of length n (n>k) over A

• Find probability p, by which the string s occurs in the text t.


• Find expected number of occurrences e of s in t.

• Example: s = tggtacaaatgttct (glucocorticoid response element GR


t = 10’000 bp (‘promoter’, upstream DNA to start codon
Basics of Sequence DB Searches: Definition

• Given: one “short” sequence q (query)


sequence DB, conceptually one long sequence s
(subject)

• Find occurrences of “similar” sequences to q in s.


• Attach to each similar occurrence a statistical significance.

• Example: s = human genomic DNA, 3·109 b


q = tggtacaaatgttct (glucocorticoid response
element GRE)

• Dynamic programming approach not feasible!


Basics of Sequence DB Searches: Detection of identical k-
mers

• Idea: identify identical k-mers in s and q (“seed”)


expand alignment from seed in both directions

• Example:

q MAAARLCLSLLLLSTCVALLLQPLLGAQGAPLEPVYPGDNATPEQMAQYAADLRRYINMLTRPRYGKRHKEDTLAFSEWGS
|| | |||| | || |||||||| | |||||| |||| ||||||||| |||||| ||||||||| |
... MAVAYCCLSLFLVSTWVALLLQPLQGTWGAPLEPMYPGDYATPEQMAQYETQLRRYINTLTRPRYGKRAEEENTGGLP...
1
3
2 Gonnet120 score: 412, 76 % identities no more increase
in score

• BLAST
• HSP, “high scoring pair”
• gapped alignment
• starting extension also from similar (and not only identical) seeds
Basics of Sequence DB Dearches: Detection of identical k-
mers
• Precompute position of all k-mers in DB sequence
• Indexing all Peptides of length k in "database“
• Example: 0 1 2 3
1234567890123456789012345678901234
MAAARLCLSLLLLSTCVALLLQPLLGAQGAPLEP

MAAAR
AAARL
AARLC
....
APLEP
Sorted:
AAARL 2
AARLC 3
APLEP 30
...
MAAAR 1
...
VALLL 17

• For each peptide of length k in "query", the identical peptides in


"database" are identified efficiently (binary search in sorted list)
• For identical pairs, extension step in both directions is performed
Basics of Sequence DB Dearches: Detection of identical k-
mers
• Indexing all Peptides of length k in "database“: some
refinements
0 1 2 3
1234567890123456789012345678901234
MAAARLCVALLLLSTCVALLLQPLLGAQGAPLEP
- 8 Pointers to previous occurrences

List of all words of length k and


last occurrence in query:
AAAAA - Simple, yet efficient data-structure:
AAAAC - - array of integers (size=length of db)
AAAAD - - array of integers (size=number of words with length k
....
AARLC 3 Book-keeping:
....
- more than one db/query sequence
APLEP 30
... - build database chunks that fit into main memory
... (speeds up computation 1000x)
VALLL 17
... Extension step: optimizations

• For each peptide of length k in "query", the position in the wordlist can be
easily computed (no binary search!)
Significance of matches: DNA case

• issue: searching with short query vs. large database  found mat
could have occurred by pure chance
• assume equal distribution of c,g,a,t
• what is ...
– the probability q, that sequence B (len=m) is contained in sequence A
(len=n)?
– the expected length of a common subsequence of two sequences?
– the expected score when locally aligning two sequences of length n, m

– Solution for first:


• a. There are (n-m) 'words' of length m in sequence A Not independent!
• b. In total, there are 4^m sequences of length m
• c. p = (n-m) / 4^m

This is wrong. Why?


Significance of matches: Statistics

• statistics
– the statistical distribution of alignment scores found in a DB search
follows the extreme value distribution (not normal distribution)
– extreme value distribution changes with length of sequences and their
residual composition
– scores of actual database search results are plotted vs. expected
scores
(FASTA)
– BLAST computes E-Value (number of expected hits with this score,
when comparing the query with unrelated database sequences)
Putting it together: BLAST2 A W T V A S A V R T S I

• (optional) filtering for low complexity region in query


• all words of length 3 are listed:
AWT VAS AVR TSI | WTV ASA VRT | TVA SAV RTS

• to each word, ~50 'high scoring' additional words are added:

AWT VAS AVR TSI | WTV ASA VRT | TVA SAV RTS
AWA IAS TVR ...
TWA LAS AIR ...
ART ITS AVS ...
... ... ...

• matching words are found in DB (with datastructure as described


before)
• ungapped alignment constructed from word matches, 'HSP'
• statistics determines, whether HSP is significant
• SW-alignment for significant HSPs
An example: comparing mus chr X vs. hum chr X, syntenic
regions mouse chromosome X (147Mbp)
human chromosome X (149Mbp)

Each dot:
conserved stretch of AA
HSP, high scoring pair
Conclusions

• types of sequence alignments: pairwise, multiple, query vs. database


• local and global alignment
• scoring matrices: attach score to each pair of residues
(allow „similar“ residues to be aligned)
• sequence alignment problem is mathematical optimization problem:
find optimal alignment: maximise score
• optimal alignment in practical terms only feasible for pairwise alignment
• Heuristics for multiple sequence alignments:
progressive alignment guided by all pairwise alignments
• Sequence database searches:
– efficiently find occurrence of query k-mers in db sequences (seeds)
– expand (ungapped) HSP from seeds
– attach statistical significance

You might also like