PLNT4610/PLNT7690 Bioinformatics - Lecture 6, part 2 of 2

PLNT4610/PLNT7690 Bioinformatics
Lecture 6, part 2 of 2

a. Dynamic Programming

An optimal multiple alignment depends on all possible pairwise comparisons of all amino acids in each protein at each position. It is possible to extend the dynamic programming algorithm for pairwise comparisons from the special case of k=2 sequences to the more general case of k=n sequences.

Scoring must account for those comparisons. Consider the alignment of k sequences of length n

MQPILLL
MLR-LL-
MK-ILLL
MPPVLIL

For every position (column) in the alignment, we calculate the sum of scores, based on the amino acid substitution matrix (eg. PAM250) using the function

where i and j refer to all unique pairs of sequences, ignoring self comparisons, and eliminating duplicate comparisons.

The sum-of-pairs (SP) function scores each position in the protein, that is, each column, as the sum of the pairwise scores. For k sequences, there are k(k-1)/2 unique pairwise comparisons, excluding self comparisons.

For example, in column 4, the score would be

SP-score(I,-,I,V) = p(I,-) + p(I,I) + p(I,V) + p(-,I) + p(-,V) + p(I,V)

where p(a,b) is the pairwise score of two amino acids. This function is valid regardless of the order of sequences in the alignment.

Unlike pairwise alignments, it is legitimate in multiple alignments to have a match between two gap characters. By definition , p(-,-) = 0. Although one might be tempted to score a gap as highly negative, when two sequences match at a gap, it indicates that both sequences have a deletion at that position, that is not shared by other sequences. Gaps that are conserved between two sequences represent the conservation of a unique evolutionary event. Put another way, if two sequences have a gap at the same position, they haven't changed. In contrast, when two amino acids are drastically different, that indicates a significant mutational event. Matching gaps therefore should get a higher score than, for example, a mismatch between two very different amino acids, such as Proline and Lysine.

Generalizing the pairwise dynamic programming algorithm to k sequences is, in essence, doing an alignment in k dimensions. For example, an optimal alignment between k=3 sequences could be visualized in 3 dimensions:

Figure 2 from Zucker [http://www.genetics.wustl.edu/bio5495/1999-course/lecture.7/]

The problem with the dynamic programming algorithm is size and speed. To start with, where k sequences of length n are compared, the storage required for the alignment matrix is O(n^k). For 20 sequences of 500 amino acids each, it would require 500²⁰ units of storage (eg. bytes).

Recall that in a pairwise comparison, for each cell in the 2-dimensional array the score was dependent on the 3 preceeding adjacent cells. In a k-dimensional array, there are 2^k-1 preceeding cells. (This makes sense: for k=2 ie. a pairwise comparison, 2²-1 = 3). Finally, for each column, there are k(k-1)/2 pairwise comparisons.

In summary, the time required for a truly exhaustive multiple alignment, using the most straightforward approach, is O(k²2^kn^k).

Figure from Zucker [http://www.genetics.wustl.edu/bio5495/1999-course/lecture.7/]

Although some methods have introduced great efficiencies that bring global dynamic programming to a level that is practical for a handful of sequences, more efficient approximate methods are needed for typical alignment problems.

Probably the single most important speedup method for dynamic programming methods [Carillo and Lipman (1988) SIAM J. Appl. Math.48:1073-1082, Altschul and Lipman (1989) SIAM J. Appl. Math. 49:197-209]. This method only scales to about 8 sequences, so it is not usually practical.

Optimal multiple alignments are an example of NP complete problems. NP complete problems are problems whose solutions can not be done in polynomial time. Polynomial time requires O(n^c) steps, where c is a constant. In contrast, an NP complete problem requires O(n^x) steps, where x is proportional to the size of the dataset. So, in the equation above, the real culprits are the exponential terms 2^k and n^k.

b. Heuristic alignment methods

It is often the case that good approximations to an otherwise intractable problem can be found by heuristic methods. Heuristic methods take a "learn as you go" strategy, in which the algorithm solves small parts of the problem at a time, and gradually converges on a solution.

Heuristic methods are not guaranteed to find the optimal solution, but can sometimes be shown to produce a solution that is close to optimal. Heuristics are often very sensitive to the starting conditions, which are often arbitrary. That is, different starting conditions may yeild different solutions.

1. CLUSTAL Omega - global alignment using Neighbor-Joining Trees

(Note: CLUSTAL Omega is the most recent of the CLUSTAL family of programs, and should be considered the successor to the CLUSTAL family of programs.)

Hierarchical clustering methods work by aligning small sets of closely-related sequences into sub-alignments, and then aligning the alignments, until all sequences are represented in an alignment.

Hierarchical methods work on the hypothesis that sequences in an alignment will reflect their evolutionary history. That is, if one were to go from one sequence to the next most-closely related one, one would visit all nodes on the phylogenetic tree describing how these sequences evolved.

interior nodes (red) - each level is an alignment of some subset of the sequence dataset.

sequences (blue) - The terminal nodes of each branch are the sequences in the alignment.

As illustrated, the sequences to be aligned are the end nodes, the "leaves" of the tree.

CLUSTAL ALGORITHM

Step	Efficiency
1. Calculate distances between all possible pairs of sequences	O(n²)
2. Construct a Neighbor-Joining tree from pairwise distances	O(n)
3. while (not all nodes on the tree have been visited) 3.1 align each pair of sequences or profiles at the terminal nodes 3.2 replace aligned sequences with a profile representing the alignment of all sequences in below that node	O(log₂(n))

In effect, this algorithm keeps going deeper and deeper into the tree, clustering larger and larger groups of sequences. Clusters (profiles) are merged, until the root of the tree is reached.

from G. Fuellen, Multiple Alignment. Complexity International 4, 1997. URL http://journal-ci.csse.monash.edu.au/ci/vol04/mulali/mulali.html.

EXAMPLE: Progressive Alignment The alignment starts with the individual sequences and progresses up the tree until a final alignment is complete.
from https://www.ncbi.nlm.nih.gov/CBBresearch/Przytycka/download/lectures/PCB_Lect05_Multip_Align.pdf

Once a gap always a gap.

An important rule of multiple alignment is that once a gap is inserted in 1 or more sequences, that gap is a permanent part of the alignment. Gaps are never removed as the alignment proceeds.

WARNING - The guide trees produced by programs such as clustal in step A are not legitimate phylogenetic trees! They should never be used for any purpose other than creating the alignment.

Reason: A good phylogeny is calculated from all of the information in the multiple alignment. Because of the "once a gap, always a gap" rule, pairwise distances calculated from the final alignment may contain gaps not present in the original pairwise alignments. Thus, the original pairwise alignments would underestimate the distances across the whole tree.

Advantages

Very fast alignment, can be done for large numbers of sequences

Disadvantages

Multiple alignment is strongly influenced by the initial pairwise-alignments
Error from pairwise alignments propagate through subsequent steps
Neighbor Joining, used for tree building, is one of the least reliable tree-building algorithms

2. MAFFT - Progressive and Iterative refinement of Alignments

Katoh K, Standley DM (2013) MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Mol. Biol. Evol. 30:772-780 doi: 10.1093/molbev/mst010

MAFFT Algorithms (https://mafft.cbrc.jp/alignment/software/algorithms/algorithms.html)

MAFFT (Multiple Alignment through Fast Fourier Transform) improves upon the progressive alignment algorithm of CLUSTAL/TCOFFEE by providing a number of different algorithms. Alignments are done using a Fast Fourier Transform (FFT) approximation, which is faster than the exhaustive Needlman-Wunsch (NW) alignment.

a) Progressive methods

FFT-NS-1

Comparable to ClustalW, but distance matrices calculated based on 6-mers shared between each pair of sequences
VERY FAST (Use for >2000 sequences)

FFT-NS-2

Guide tree is reconstructed based on the first progressive alignment, and a second progressive alignment is done
VERY FAST (Use for >2000 sequences)
Improves upon FFT-NS-1 by recalculating distances and the guide tree and reconstructing the alignment from the improved guide tree

b) Iterative refinement of progressive alignments using Weighted Sum of Pair (WSP) scores

One of the limitations of using the SP (sum of pairs) function to calculate alignment scores is that when there are many sequences, some highly similar to one another, and others that are more distant, the highly similar sequences will dominate the score. Put another way, they will carry more weight than the distantly-related sequences. To correct for that, some programs use a Weighted Sum of Pairs, which multiplies the distances between any pair of two sequences times their pairwise similarity score. Two sequences sharing a high degree of similarity will thus be given a smaller weight, because the distance between them is smaller.

FFT-NS-i (2 cycles)

Similar to FFT-NS-2, but distance matrices calculated by WSP at each position for individual amino acid or nucleotide pairs, rather than using the 6-mer shortcut.
still FAST, but slower than FFT-NS-2

FFT-NS-i (up to 1000 cycles)

Like FFT-NS-i (2cycles), but steps are iterated (i) until no further improvement in the alignment score is seen.
time proportional to the number of iterations required
More accurate than FFT-NS-2

Calculating the Weighted Sum of Pairs

from

https://www4.stat.ncsu.edu/~muse/assets/msa.pdf

The distances between any 2 sequences in the guide tree can be used as a weight, which can be multiplied by the SP score for those two sequences.

For example, the distance between sequences A and C would be

w_A,C = w_A + w_C

We can modify the SP function by multiplying each SP score between two sequences i and j at any position by the weight w_ij.

c) Iterative refinement of progressive alignments using WSP and consistency scores (SLOW)

TCOFFEE (Noterdame, Higgins, Heringa, JMB 2000, 302 205-217) introduced the idea of a consistency alignment. When adding sequences to the alignment, the consistency alignment searches for the way to align the third sequence to a given pair of sequences, such that the highest scoring alignment for the three sequences is found.

pre-calculate all pairwise alignments between each pair of sequences to create a library of aligned sequence pairsfor each aligned sequence pair in the library calculate all possible alignemts with sequence c choose the highest-scoring three-way alignment

MAAFT implements three versions of this iterative alignment process, each optimized for different cases. As with FFT-NS-i, alignments are iterated until no further improvement is seen in the alignment score.

G-INS-i (global)

Appropriate for the case in which all parts of all sequences can be legitimately aligned, or if one alignable domain can be edited from the sequences for subsequent alignment.

XXXXXXXXXXX-XXXXXXXXXXXXXXX
XX-XXXXXXXXXXXXXXX-XXXXXXXX
XXXXX----XXXXXXXX---XXXXXXX
XXXXX-XXXXXXXXXX----XXXXXXX
XXXXXXXXXXXXXXXX----XXXXXXX

x alignable residues
o unalinable residues
- gaps

L-INS-i (local) - Appropriate for proteins which contain a single large region that is alignable. Flanking sequences are NOT aligned.

ooooooooooooooooooooooooooooooooXXXXXXXXXXX-XXXXXXXXXXXXXXX------------------
--------------------------------XX-XXXXXXXXXXXXXXX-XXXXXXXXooooooooooo-------
------------------ooooooooooooooXXXXX----XXXXXXXX---XXXXXXXooooooooooo-------
--------ooooooooooooooooooooooooXXXXX-XXXXXXXXXX----XXXXXXXoooooooooooooooooo
--------------------------------XXXXXXXXXXXXXXXX----XXXXXXX------------------

E-INS-i (multiple domains) - Appropriate for proteins/RNAs which have 2 or more conserved domains, flanked by regions that cannot be aligned. Only the conserved domains will be aligned.

oooooooooXXX------XXXXooooooooooo---------------oooooooXXXXXooooooooooooooooo--oooooooooooooooo
---------XXXXX----XXXXoooooooooooooooooooooooooooooooooXXXXX-ooooooooooooooooooooooo-----------
oooooooo-XXXXX----XXXX---------------------------------XXXXX---oooooooooo--oooooooooooo--------
---------XXXXXX---XXXX---------------------------------XXXXX-----------------------------------
---------XXXXXXXXXXXXX---------------------------------XXXXX-----------------------------------
---------XX-------XXXX---------------------------------XXXXX-----------------------------------