PLNT4610/PLNT7690 Bioinformatics
Lecture 6, part 1 of 2

MULTIPLE ALIGNMENTS AND PHYLOGENIES

Learning checklist:

1. Understand how the multiple alignment problem is related to the phylogenetic relationships among sequences.

2. Understand the different steps involved in going from a raw dataset of sequences to a final alignment.

3. Understand the main approaches to multiple sequence alignment, and what is distinct about each.

   global alignment by dynamic programming
   global alignment through use of a guide tree (Clustal, T-Coffee)

1. Multiple alignment and phylogeny construction are related problems

2. Tutorial: Creating a dataset

3. Constructing alignments

4. Tutorial: Creating a multiple alignment

Why do multiple alignments and phylogenies?

Determine whether a group of proteins is related
Show regions of conservation within a protein
Be able to design PCR primers for conserved or divergent regions
Determine the number of orthologous groups within a multigene family
Determine evolutionary history of a gene family
Determine evolutionary relationships among species or populations

1. Multiple alignment and phylogeny construction are related problems

An alignment of a set of related proteins or DNA sequences from several species illustrates that amino acid sequences can diverge over time. Below are the first 120 amino acid positions of chitinase III proteins from several plant species. A consensus for all of the proteins is listed at top. Sites which agree with the consensus are indicated by a dot, and amino acids that differ from the consensus are explicitly printed.

                  10        20        30        40        50        60
consensus -----xxxxx-xxxxlxxxxlxxxxxxxsxaxgIxiYWGQngnEGsLadtCxtgnYxxVn
CACHIT    .....me--k.cfniipsll.islliks.n.a..av...........q.a.n.n..qf..
PSTCHIT   .....meslk.kaslvlfpi.vlslfnh.n.a..av......g........n....ef..
NTACIDCL3 ....mi----.kysf.ltalvlflralkle.gd.v................a.n..ai..
S66038    .maaki----.vsvlflisl.ifasfes.hgsq.v.......d........ns...gt.i
CUSSEQ_1  .....maahkiittt.siff.lssifrs.n.a..a.............s..a....ef..
CUSSEQ_2  .....maahk.ittt.siff.lssifrs.d.a..a.............s..a....ef..
CUSSEQ_3  .....maahk.ittt.siff.lssifhs.d.a..g.............s..a....ef..
VIRECT    .....-maclkqvsa.llpl.fisffkp.h.g..sv.............a.n....ky..
VURNACH3A .....-----.----------------------.sv.............a.n....ky..
ATHCHIA   ..mtnmtlrkhviyf.ffiscslskpsdasrg..a..........n.sa..a..r.ay..
VURNACH3B mvktkisl--.llpl.ff---tlvgtsha--g..a..........t.sea.d..r.th..
NTBASICL3 .mnikvsl--lfilpifl----llltskvk.gd.vv....dvg..k.i...ns.l.ni..

                  70        80        90       100       110       120
consensus iAFlsxFGxgQtPxlNLAGHCnPxxnxCxxxsxxixxCqsxgiKvllSxGGgxgxYslxS
CACHIT    .....t..n..n.qi......d.st.g.tkf.pe.qa..ak.......l...a.s...n.
PSTCHIT   .....t..s....q.......d.ss.g.tgf.se.qt..nr.......l..sa.t...n.
NTACIDCL3 ....vv..n..n.v.......d.naga.tgl.nd.ra..nq....m..l...a.s.f.s.
S66038    l..vat..n....a.......d.-atn.nsl.sd.kt..qa.......i...a.g...s.
CUSSEQ_1  .....s..s....v.........dn.g.afv.de.ns...qnv.....i...v.r...s.
CUSSEQ_2  .....s..s..a.v.........dn.g.afl.de.ns.k.qnv.....i...a.s...s.
CUSSEQ_3  .....s..g....v.........dn.g.til.ne.ns...qnv.....i...t.s...y.
VIRECT    ....ft..g....q.........si.n.nvf.dq.ke...kd......l..as.s...t.
VURNACH3A .....a..g....q.........si.n.nvf.dq.kg...r.....p.l..as.s...s.
ATHCHIA   v...vk..n....e.........aa.t.thfgsqvkd...r....m..l...i.n..ig.
VURNACH3B ....nk..n....em........at.s.tkf.aq.ky...kn......i...i.t.t.a.
NTBASICL3 .....s..nf...k.......e.ssgg.qqltks.rh...i...im..i...tpt.t.s.

Just by looking at the alignment, it is obvious that the three cucumber sequences CUSSEQ_1, CUSSEQ_2 and CUSSEQ_3 are very closely related. For example, the N-terminal region (positions 1-35 in the alignment), which is poorly conserved across species, is almost identical among the three cucumber sequences.

However, relationships among the chitinase III proteins are easier to visualize in a phylogenetic tree. For example, we see that among the legumes, the chitinase III proteins from winged bean and chickpea are closely related, but sequences from other legumes in the genus Vigna appear to have evolved from some more distant ancestor: VURNACH3B, for instance, clusters with the crucifer Arabidopsis.

Perhaps the biggest problem concerning multiple alignment and phylogeny is that they are interdependent. In principle, multiple alignment could be done in the absence of any knowledge of the evolutionary relationships of the species or proteins being aligned. In practice, the time required by exhaustive methods for multiple alignment grows exponentially with the number of sequences, so that exhaustive alignment of more than a few sequences is not feasible. Alignments can be built quickly using phylogenetic trees and pairwise comparisons as guides. The price we pay is the loss of independence: the alignment depends on some sort of phylogeny, while the phylogeny must be calculated from an alignment.
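To get a feel for why exhaustive alignment breaks down, the short sketch below counts the work done by a straightforward dynamic programming alignment of k sequences. It assumes all sequences have the same length and uses the textbook figures of (L+1)^k lattice cells with 2^k - 1 predecessor cells each; the function name dp_lattice_cost is made up for this illustration.

```python
def dp_lattice_cost(k, length):
    """Estimate the work for exhaustive dynamic programming on k sequences.

    Assumes every sequence has the same length. Returns a tuple of
    (number of lattice cells, number of cell-to-cell comparisons).
    """
    cells = (length + 1) ** k        # one cell per combination of prefixes
    predecessors = 2 ** k - 1        # each cell looks back at 2^k - 1 neighbours
    return cells, cells * predecessors


if __name__ == "__main__":
    # Sequences of 100 residues, as in a short protein domain
    for k in range(2, 7):
        cells, work = dp_lattice_cost(k, 100)
        print(f"k={k}: {cells:.3e} cells, {work:.3e} comparisons")
```

Each added sequence multiplies the number of cells by roughly the sequence length, which is why guide-tree methods are used for anything beyond a handful of sequences.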

Assumptions of multiple alignment

All sequences are homologous
No duplicate sequences are present
In each column, amino acid residues are homologous
The alignment is optimal, with minimal gaps

Assumptions of phylogeny

All sequences are homologous
No duplicate sequences are present
In each column, amino acid residues are homologous
The alignment is optimal with minimal gaps
No back mutation has occurred (some methods take this into account)
All sequences are the same length

2. Overview of alignment and phylogeny

The actual strategy used in construction of an alignment and phylogeny varies with the biological problem, and the nature of the data available.

Protein vs DNA - During our discussion of pairwise sequence comparisons, we mentioned that pairwise alignment of DNA sequences is far less reliable than protein alignment, due to the small alphabet size of 4 for DNA, compared to 20 for proteins. This problem is far more serious for multiple alignments, because there are O(k²) pairwise comparisons, where k is the number of sequences. Therefore, alignments should be done with proteins wherever possible. One exception is tRNA or rRNA sequences, where information on secondary structure can be used to guide an alignment.
For phylogeny construction, on the other hand, DNA data is more informative than protein data, because phylogeny construction depends on detecting mutational events:

The degeneracy of the genetic code masks mutations at the protein level, making it preferable to construct phylogenies using protein-coding DNA sequences, rather than proteins.
DNA sequence may be under less selective pressure than the corresponding protein sequence.
For closely related sequences, little or no sequence divergence may have occurred at the amino acid level, while divergence might be detectable at the DNA level.
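The masking effect of the genetic code can be seen in a toy example. The two coding sequences below are invented for illustration, and the codon dictionary covers only the five codons they use (it is not a complete genetic code). The sketch counts differences at the DNA level and at the protein level.

```python
# Partial codon table: only the codons used in this toy example.
CODON_SUBSET = {"ATG": "M", "GCT": "A", "GCC": "A", "CTG": "L", "TTA": "L"}


def translate_subset(cds):
    """Translate a coding sequence using the partial table above."""
    return "".join(CODON_SUBSET[cds[i:i + 3]] for i in range(0, len(cds), 3))


def count_differences(a, b):
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Two hypothetical coding sequences that differ only by synonymous changes
cds1 = "ATGGCTCTG"
cds2 = "ATGGCCTTA"

dna_diffs = count_differences(cds1, cds2)
protein_diffs = count_differences(translate_subset(cds1), translate_subset(cds2))
print("DNA differences:", dna_diffs)
print("Protein differences:", protein_diffs)
```

The DNA sequences differ at three sites, yet both translate to the same peptide, so a phylogeny built from the proteins alone would see no mutational events here.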

A good compromise is to do the protein alignment, and from the protein alignment, construct a DNA alignment for use in the phylogeny. The pal2nal.pl script can do this.

Displaying the alignment in various ways can yield important insights.
A very small dataset of only a few genes or proteins may give a misleading answer simply because there are too few examples.
Very large datasets may impose computational constraints on the choice of methods used.
As datasets get larger, redundant sequences may creep into the dataset.
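One simple way to catch redundant sequences as a dataset grows is to compare the sequences themselves rather than their names. The sketch below is only an illustration: it assumes the records are already in memory as (name, sequence) pairs, treats sequences as duplicates only if they are identical after removing gaps and ignoring case, and the function name remove_duplicates is made up for this example.

```python
def remove_duplicates(records):
    """Keep the first record for each distinct sequence.

    records: list of (name, sequence) tuples.
    Sequences are compared after upper-casing and stripping '-' gap characters.
    """
    seen = set()
    kept = []
    for name, seq in records:
        key = seq.upper().replace("-", "")
        if key in seen:
            continue                  # same sequence already kept under another name
        seen.add(key)
        kept.append((name, seq))
    return kept


if __name__ == "__main__":
    dataset = [("seqA", "MKTLLV"), ("seqB", "mktllv"), ("seqC", "MKT-LLV"), ("seqD", "MKTLIV")]
    for name, seq in remove_duplicates(dataset):
        print(name, seq)
```

Near-identical sequences (for example, alleles differing at one site) are not removed by this approach; deciding whether to keep those is a biological judgment.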

Conceptually, the workflow in creating a phylogeny would include creating a multiple alignment. The overall chain of events might look something like this:

Implementation of such a workflow might be done in a number of ways. There is no one protocol for constructing alignments and phylogenies. At each step, decisions must be made as to which approach to take. In some cases, it may be necessary to try several methods before choosing one. As well, results at one step often make it necessary to go back several steps and refine the dataset. For example, a poor phylogeny may indicate the need to re-do the alignment, and then to retry the phylogeny.

The steps in constructing an alignment and a phylogeny are illustrated using programs from the BIRCH system. Assume a set of GenBank entries has been retrieved, all of which represent homologous genes from several species. In GenBank entries, protein coding sequences are annotated as 'CDS' features. The FEATURES program can extract CDS sequences from a group of GenBank entries automatically. Next, the coding sequences must be translated. Two multiple alignment programs are available: Clustal Omega and MAFFT. To display the alignment for evaluation or final publication, it is best to try several of the programs listed to tailor the output to your needs. REFORM generates plain ASCII text, which can be easily imported into a word processor or drawing program. The other programs listed generate PostScript output for direct viewing or printing.
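The translation step mentioned above can be sketched in a few lines. This is not the BIRCH implementation, just a minimal illustration that assumes the standard genetic code, a CDS that starts in frame, and 'X' for any codon containing an ambiguous base; the names CODON_TABLE and translate are chosen for this example.

```python
from itertools import product

# Standard genetic code, written in the usual compact form with bases ordered TCAG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate an in-frame coding sequence; '*' marks a stop codon."""
    seq = cds.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3          # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")    # 'X' for codons with ambiguous bases
        for i in range(0, usable, 3)
    )


if __name__ == "__main__":
    print(translate("ATGGCTCTGTAA"))
```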

If an aligned DNA sequence is desired, pal2nal.pl can read in an aligned protein sequence and the corresponding DNA sequence and generate a file containing the DNA alignment.
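The idea behind this conversion is simple enough to sketch: each residue in the aligned protein is replaced by its codon, and each gap by three gap characters. The function below is an illustration of that idea only, not the pal2nal.pl code; it assumes the DNA is the in-frame CDS for the protein, and the name thread_codons is made up for this example.

```python
def thread_codons(aligned_protein, cds, gap="-"):
    """Build an aligned DNA sequence from an aligned protein and its CDS.

    Each residue takes the next codon from the CDS; each gap becomes three
    gap characters. A trailing stop codon in the CDS is simply left unused.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(cds) // 3 < n_residues:
        raise ValueError("CDS is too short for the protein sequence")
    codons = iter(cds[3 * i:3 * i + 3] for i in range(n_residues))
    return "".join(gap * 3 if aa == gap else next(codons) for aa in aligned_protein)


if __name__ == "__main__":
    # Toy protein alignment row with one gap, and its (hypothetical) CDS
    print(thread_codons("M-AL", "ATGGCTCTGTAA"))
```

Because gaps are always inserted in multiples of three, the reading frame is preserved in the resulting DNA alignment.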

The most complex part of the decision process involves choosing a strategy for phylogeny construction. The three main phylogeny methods, parsimony, distance, and maximum likelihood, will be explained elsewhere.

Though not shown in the figure, protein sequences can be directly used in phylogeny construction. In this example, though, a phylogenetic tree is constructed from the aligned DNA. The quality of the alignment may be evaluated by bootstrapping the alignment and rerunning the phylogeny search, although this option is not usually possible for large numbers of sequences when using a computationally intensive maximum likelihood program.
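Bootstrapping an alignment means drawing columns at random, with replacement, until a new alignment of the same width has been built; the phylogeny search is then repeated on each replicate. The sketch below shows only the resampling step, assuming the alignment is held as a dictionary of name to aligned sequence; the function name bootstrap_alignment is chosen for this illustration.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate of an alignment.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    rng: a random.Random instance, so that replicates can be reproduced.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_cols = lengths.pop()
    # Sample column indices with replacement, then rebuild every row
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


if __name__ == "__main__":
    toy = {"seq1": "ATGGCTCTG", "seq2": "ATGGCCTTA", "seq3": "ATG---TTA"}
    replicate = bootstrap_alignment(toy, random.Random(42))
    for name, seq in replicate.items():
        print(name, seq)
```

Since the phylogeny program must be rerun on every replicate, the total cost is the cost of one search multiplied by the number of replicates, which is what makes bootstrapping impractical with slow methods on large datasets.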

October 10 and 15, 2024