previous page  PLNT4610/PLNT7690
Bioinformatics Lecture 4, part 2 of 2 
next page 
A 
A 
C 
T 
G 
T 
T 

A 
1 

A 
2 

C 
3 

T 
4 

G 
5 

T 
6 

T 
7 
Sequence alignment methods predate dotmatrix searches, and all of the alignment methods in use today are related to the original method of Needleman and Wunsch (1970). Needleman and Wunsch wanted to quantify the similarity between two sequences. Over the course of evolution, some positions undergo base or amino acid substitutions, and bases or amino acids can be inserted or deleted. Any measurement of similarity must therefore be done with respect to the best possible alignment between two sequences. Because insertion/deletion events are rare compared to base substitutions, it makes sense to penalize gaps more heavily than mismatches when calculating a similarity score. As an example, a very simple scoring scheme would add +1 for each match, 1 for each mismatch, and 2 for each gap inserted. That is, the larger the gap, the more we subtract. The similarity between two sequences would then be
Similarity = +1(# identities) 1(#mismatches) 2(#gaps)For example:
GCTGGAACCAGSimilarity = 1(8)  1(2) 2(1) = 4
 
ACTGGATCAG
By definition, the alignment which gives the higest similarity score is the optimal alignment. However, the number of alignments that must be checked increases exponentially with the lengths of the sequences. Allowing gaps also results in an exponential increase in the computation time required.
Although the problem may seem intractably large for all but very small sequences, Needleman and Wunsch conceptualized alignment as a problem in dynamic programming, in which the solution to a large problem is simplified if we first know the solution to a smaller problem that is a subset of the larger problem. Think of an alignment occurring in a matrix, where sequence s of length m is written on the Yaxis, and sequence t of length n is written on the Xaxis. The alignment can then be acomplished in two steps:
1. All possible alignments of s and t are contained in array a[0..m,0..n] (the component a[0,0] represents a gap at the beginnning of each sequence).
2. The optimal alignment will be the path through the array that has the highest score, ie. the largest number of matches and fewest mismatches and gaps.
Needleman and Wunch realized that all parts of the alignment problem boiled down to the same decision made at every position i in sequence s, when compared with every position j in sequence t.
If we want to calculate the score at any position a[i,j] in the alignment matrix a, we only have to look at three adjacent cells in the matrix to calculate that score, a[i,j1], a[i1,j1], or a[i1,j]. These are the positions in the alignment that represent the part of the alignment just prior to a[i,j], at which point either:
The first step is the
trivial calculation of the case in which one or more terminal
gaps are added to the beginning of either sequence. This is done
by running across the top of the array and progressively adding
a gap penalty (eg. 2) to the score in each cell, and doing the
same to the leftmost column. Next, we apply the three scoring
rules to each cell in the matrix.
At cell a[1,1], the score is the largest of three possible scores
This process is repeated down the matrix:
until all cells are filled.
The optimal alignment will be the path that gives the highest total score. In the example, the optimal alignment would be
and the similarity score is simply the score at cell a[i,j], the last cell in the alignment. We can check the score by using the written alignment: 4(+1) + 0(1) + 1(2) = 2.TCGCA
 
TCCA
Note that if we hadn't inserted the gap, the alignment would be
and the similarity score at cell a[i,j], would be: 2(+1) + 2(1) + 1(2) = 2. The point is that a single gap can make a large difference in the score.TCGCA

TCCA
Though these examples
give the essence of the Needleman Wunsch method, the literature
contains a wealth of papers improving on this simple algorithm.
Example of an optimal alignment between cDNA
sequences for defensin genes from Pea (PPEADRR230B) and
Cucumber (CAGT). The alignment was produced by GLSEARCH from the FASTA package. nw opt: The raw NeedlemanWunch alignment score Zscore: Standard Deviations of the best score relative to the mean of all alignments tested. bits: Measure of information content of the alignment ie. how it differs from a random alignment. E: Probability of seeing an alignment this good by comparing two arbitrary sequences of similar nucleotide content. 
Statistics: (shuffled [500]) Unscaled normal statistics: mu= 90.3780 var=579.7105 Ztrim: 0 
penalty = gap_begin + (# gaps * gap_extend)Example: For a 2 base gap in a DNA pairwise alignment, if gap_begin = 10 and gap_extend=1, the gap penalty is 10 + 2(1) = 12.
Construction of biologically significant alignments should take into account the fact that protein evolution is constrained by the chemical properties of amino acids, and by the degeneracy of the genetic code. Chemically conservative replacements tend to occur more frequently than replacements with amino acids that are chemically different. For example, it is far more likely to see a substitution of Leucine with Isoleucine, both of which are nonpolar, than a substitution of Aspartic acid, which is negativelycharged, for Leucine.
Scoring matrices have been constructed to replace the p(i,j) function above. That is, for any possible amino acid substitution, a value from the scoring matrix is added to the similarity score, rather than adding 1 for an identity and 1 for a mismatch. One of the most commonlyused scoring matrices is the PAM250 matrix, shown below:
Cys 12Substitution of an Aspartic acid for Glutamic acid (both acidic) adds 3 to the score. Substitution of the positivelycharged Lys for the nonpolar Pro adds 1 to the score. Generally, the more conservative the replacement, the higher the value that will be added to the score.
Gly 3 5
Pro 3 1 6
Ser 0 1 1 1
Ala 2 1 1 1 2
Thr 2 0 0 1 1 3
Asp 5 1 1 0 0 0 4
Glu 5 0 1 0 0 0 3 4
Asn 4 0 1 1 0 0 2 1 2
Gln 5 1 0 1 0 1 2 2 1 4
His 3 2 0 1 1 1 1 1 2 3 6
Lys 5 2 1 0 1 0 0 0 1 1 0 5
Arg 4 3 0 0 2 1 1 1 0 1 2 3 6
Val 2 1 1 1 0 0 2 2 2 2 2 2 2 4
Met 5 3 2 2 1 1 3 2 0 1 2 0 0 2 6
Ile 2 3 2 1 1 0 2 2 2 2 2 2 2 4 2 5
Leu 6 4 3 3 2 2 4 3 3 2 2 3 3 2 4 2 6
Phe 4 5 5 3 4 3 6 5 4 5 2 5 4 1 0 1 2 9
Tyr 0 5 5 3 3 3 4 4 2 4 0 4 5 2 2 1 1 7 10
Trp 8 7 6 2 6 5 7 7 4 5 3 3 2 6 4 5 2 0 0 17
Cys Gly Pro Ser Ala Thr Asp Glu Asn Gln His Lys Arg Val Met Ile Leu Phe Tyr Trp
Substitution matrices
like PAM250 are constructed by observing the frequencies of
amino acid replacements in large samples of protein sequences.
For a given replacement, the PAM value is proportional to the
natural log of the frequency with which that replacement was
observed to occur.
The Blosum series of matrices are also commonlyused.
Gly 7
Pro 2 9
Asp 1 1 7
Glu 2 0 2 6
Asn 0 2 2 0 6
His 2 2 0 0 1 10
Gln 2 1 0 2 0 1 6
Lys 2 1 0 1 0 1 1 5
Arg 2 2 1 0 0 0 1 3 7
Ser 0 1 0 0 1 1 0 1 1 4
Thr 2 1 1 1 0 2 1 1 1 2 5
Ala 0 1 2 1 1 2 1 1 2 1 0 5
Met 2 2 3 2 2 0 0 1 1 2 1 1 6
Val 3 3 3 3 3 3 3 2 2 1 0 0 1 5
Ile 4 2 4 3 2 3 2 3 3 2 1 1 2 3 5
Leu 3 3 3 2 3 2 2 3 2 3 1 1 2 1 2 5
Phe 3 3 4 3 2 2 4 3 2 2 1 2 0 0 0 1 8
Tyr 3 3 2 2 2 2 1 1 1 2 1 2 0 1 0 0 3 8
Trp 2 3 4 3 4 3 2 2 2 4 3 2 2 3 2 2 1 3 15
Cys 3 4 3 3 2 3 3 3 3 1 1 1 2 1 3 2 2 3 5 12
Gly Pro Asp Glu Asn His Gln Lys Arg Ser Thr Ala Met Val Ile Leu Phe Tyr Trp Cys
One criticism of the PAM
matrices is that the PAM units were calibrated using pairs of
closelyrelated proteins, because these could be aligned by
hand. However, PAM1 units are extrapolated to high PAM values
(eg. PAM250), so it is uncertain how realistic these
extrapolated values are.
The Blosum matrices
were calculated using data from the BLOCKS* database [http://blocks.fhcrc.org ],
which contains alignments of more distantlyrelated proteins. In
principle, Blosum matrices should be more realistic for
comparing distantlyrelated proteins because the matrices are
derived from distantlyrelated proteins. On the other hand, one
could argue that the more distantly related proteins are, the
less reliable the alignment, which could contribute to error in
the calculation of Blosum matrices.
*Note: The
BLOCKS database is no longer supported. This just goes to show
a chronic problem in bioinformatics  the original datasets
used to build empiracallyderived computational tools become
more difficult to find as the original data sinks into
obsolescence.
Example: Calculation of the
NeedlwmanWunsch score for a short segment of superoxide
dismutase proteins from rice and parsley. The BLOSUM 45
scoring matrix was used. :  identities .  similar amino acids 
RICSOD G S V S G L K P G L . . : : : : : PETSOD V R I T
G L A P G L nw: 3 1 +3 +2 +7 +5 1 +9 +7 +5 = 33 
An excellent summary of the most commonlyused scoring matrices
for DNA and proteins can be found at http://www.ebi.ac.uk/help/matrix.html.
The following is an excerpt from that web site.
Equivalent PAM and Blossum matrices
Unless otherwise cited or referenced, all content on this page is licensed under the Creative Commons License Attribution ShareAlike 2.5 Canada 
previous page  PLNT4610/PLNT7690
Bioinformatics Lecture 4, part 2 of 2 
next page 