last  page PLNT4610/PLNT7690 Bioinformatics
Lecture 6, part 1 of 2
next page

October 17 and 19, 2017

MULTIPLE ALIGNMENTS AND PHYLOGENIES


REFERENCES

Baxevanis AD (1998) Practical Aspects of Multiple Sequence Alignment.  In Baxevanis AD and Ouellette BFF. Eds. Bioinformatics: A practical guide to the analysis of genes and proteins. John Wiley, Toronto.

Setubal, J. and Meidanis, J. (1997) Introduction to Computational Molecular Biology. PWS Publishing Co., Toronto. Ch. 3 "Sequence Comparison and Database Search".

Lectures by Dr. Michael Zuker, Bio-5495, Institute for Biomedical Computing, Washington University Medical School
[http://www.genetics.wustl.edu/bio5495/1999-course/lecture.7/ ]

G. Fuellen, Multiple Alignment. Complexity International 4, 1997. URL http://journal-ci.csse.monash.edu.au/ci/vol04/mulali/mulali.html.

Tompa, M. (2000) Multiple sequence alignment
http://www.cs.washington.edu/education/courses/527/00wi/lectures/lect06.pdf


Learning checklist:

1. Understand how the multiple alignment problem is related to the phylogenetic relationships among sequences.

2. Understand the different steps involved in going from a raw dataset of sequences to a final alignment.

3. Understand the main approaches to multiple sequence alignment, and what is distinct about each.



1. Multiple alignment and phylogeny construction are related problems

2. Tutorial: Creating a dataset

3. Constructing  alignments

4. Tutorial: Creating a multiple alignment


Why do multiple alignments and phylogenies?

1. Multiple alignment and phylogeny construction are related problems

An alignment of a set of related protein or DNA sequences from several species illustrates how sequences diverge over time. Below are the first 120 amino acid positions of an alignment of chitinase III proteins from several plant species. A consensus for all of the proteins is listed at the top. Sites that agree with the consensus are indicated by a dot, and amino acids that differ from the consensus are printed explicitly.
                  10        20        30        40        50        60
consensus -----xxxxx-xxxxlxxxxlxxxxxxxsxaxgIxiYWGQngnEGsLadtCxtgnYxxVn
CACHIT    .....me--k.cfniipsll.islliks.n.a..av...........q.a.n.n..qf..
PSTCHIT   .....meslk.kaslvlfpi.vlslfnh.n.a..av......g........n....ef..
NTACIDCL3 ....mi----.kysf.ltalvlflralkle.gd.v................a.n..ai..
S66038    .maaki----.vsvlflisl.ifasfes.hgsq.v.......d........ns...gt.i
CUSSEQ_1  .....maahkiittt.siff.lssifrs.n.a..a.............s..a....ef..
CUSSEQ_2  .....maahk.ittt.siff.lssifrs.d.a..a.............s..a....ef..
CUSSEQ_3  .....maahk.ittt.siff.lssifhs.d.a..g.............s..a....ef..
VIRECT    .....-maclkqvsa.llpl.fisffkp.h.g..sv.............a.n....ky..
VURNACH3A .....-----.----------------------.sv.............a.n....ky..
ATHCHIA   ..mtnmtlrkhviyf.ffiscslskpsdasrg..a..........n.sa..a..r.ay..
VURNACH3B mvktkisl--.llpl.ff---tlvgtsha--g..a..........t.sea.d..r.th..
NTBASICL3 .mnikvsl--lfilpifl----llltskvk.gd.vv....dvg..k.i...ns.l.ni..

                  70        80        90       100       110       120
consensus iAFlsxFGxgQtPxlNLAGHCnPxxnxCxxxsxxixxCqsxgiKvllSxGGgxgxYslxS
CACHIT    .....t..n..n.qi......d.st.g.tkf.pe.qa..ak.......l...a.s...n.
PSTCHIT   .....t..s....q.......d.ss.g.tgf.se.qt..nr.......l..sa.t...n.
NTACIDCL3 ....vv..n..n.v.......d.naga.tgl.nd.ra..nq....m..l...a.s.f.s.
S66038    l..vat..n....a.......d.-atn.nsl.sd.kt..qa.......i...a.g...s.
CUSSEQ_1  .....s..s....v.........dn.g.afv.de.ns...qnv.....i...v.r...s.
CUSSEQ_2  .....s..s..a.v.........dn.g.afl.de.ns.k.qnv.....i...a.s...s.
CUSSEQ_3  .....s..g....v.........dn.g.til.ne.ns...qnv.....i...t.s...y.
VIRECT    ....ft..g....q.........si.n.nvf.dq.ke...kd......l..as.s...t.
VURNACH3A .....a..g....q.........si.n.nvf.dq.kg...r.....p.l..as.s...s.
ATHCHIA   v...vk..n....e.........aa.t.thfgsqvkd...r....m..l...i.n..ig.
VURNACH3B ....nk..n....em........at.s.tkf.aq.ky...kn......i...i.t.t.a.
NTBASICL3 .....s..nf...k.......e.ssgg.qqltks.rh...i...im..i...tpt.t.s.
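
The dot notation used in the display above is simple to reproduce. The following is a minimal sketch in Python, not the program that produced the alignment shown here. The function name dot_display and the toy sequences are hypothetical, and the case-insensitive comparison is an assumption about how the display treats upper- and lower-case letters.

```python
def dot_display(consensus, sequence):
    """Return the sequence with '.' at every position that matches the consensus."""
    if len(consensus) != len(sequence):
        raise ValueError("consensus and sequence must have the same aligned length")
    out = []
    for c, s in zip(consensus, sequence):
        # Assumption: case is ignored when comparing a residue to the consensus.
        out.append("." if c.lower() == s.lower() else s)
    return "".join(out)


# Toy example (made-up sequences, not taken from the chitinase alignment)
print(dot_display("gIxiYWGQ", "gIavYWGQ"))   # prints ..av....
```

Positions where the consensus is 'x' (no consensus) never match a residue, so the residue is always printed there, as in the display above.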

Just by looking at the alignment, it is obvious that the three cucumber sequences CUSSEQ_1, CUSSEQ_2 and CUSSEQ_3 are very closely related. For example, the highly divergent N-terminal region from positions 1 - 35 in the alignment, which is poorly conserved across species, is almost identical among the three sequences from cucumber.

However, relationships among the chitinase III proteins are easier to visualize in a phylogenetic tree. For example, we see that among the legumes, chitinase III from winged bean and chickpea are closely related, but sequences from other legumes in the genus Vigna appear to have evolved from some more distant ancestor. VURNACH3B, for instance, clusters with the crucifer Arabidopsis.

Perhaps the biggest problem concerning multiple alignment and phylogeny is that they are interdependent. In principle, multiple alignment could be done in the absence of any knowledge of the evolutionary relationships of the species or proteins being aligned. In practice, the time and memory required by exhaustive methods for multiple alignment grow exponentially with the number of sequences, so that exhaustive alignments of more than a few sequences are impossible. Alignments can be built quickly using phylogenetic trees and pairwise comparisons as guides. The price we pay is the loss of independence. The alignment depends on some sort of phylogeny, while the phylogeny must be calculated from an alignment.
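
To get a feel for the exponential growth, consider the direct extension of pairwise dynamic programming to k sequences. Assuming k sequences that are each n residues long, the scoring lattice has (n+1)^k cells, and each cell is computed from up to 2^k - 1 neighbouring cells. The Python sketch below only does this arithmetic; the function name exhaustive_cost and the example length of 100 residues are chosen for illustration.

```python
def exhaustive_cost(n, k):
    """Lattice size and approximate work for exhaustive alignment of
    k sequences of length n by k-dimensional dynamic programming."""
    cells = (n + 1) ** k          # one cell per combination of prefix lengths
    predecessors = 2 ** k - 1     # each sequence either advances or takes a gap
    return cells, cells * predecessors


for k in (2, 3, 4, 6):
    cells, updates = exhaustive_cost(100, k)
    print(f"k={k}: {cells:.2e} cells, about {updates:.2e} cell comparisons")
```

Going from two sequences to six multiplies the number of cells by roughly 100^4, which is why guide trees and pairwise comparisons are used instead.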

Assumptions of multiple alignment

Assumptions of phylogeny

2. Overview of alignment and phylogeny

The actual strategy used in construction of an alignment and phylogeny varies with the biological problem, and the nature of the data available.
  1. Protein vs DNA - During our discussion of pairwise sequence comparisons, we mentioned that pairwise alignment of DNA sequences is far less reliable than protein alignment, due to the small alphabet size of 4 for DNA, compared to 20 for proteins. This problem is far more serious for multiple alignments, because there are O(k^2) pairwise comparisons, where k is the number of sequences. Therefore, alignments should be done with proteins wherever possible. One exception, of course, is tRNA or rRNA molecules, where information on secondary structure can be used to guide an alignment.
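  2. For phylogeny construction, DNA data is more informative than protein data, because phylogeny construction depends on detecting mutational events, and some DNA changes (for example, synonymous codon changes) are not visible in the protein sequence.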
  3. Display of the alignment in various ways can yield important insights into an alignment.
  4. A very small dataset of only a few genes or proteins may give a misleading answer simply because there are too few examples.
  5. Very large datasets may impose computational constraints on the choice of methods used.
  6. As datasets get larger, redundant sequences may creep into the dataset.
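
Two points from the list above can be made concrete with a short Python sketch. The first function counts the pairwise comparisons among k sequences, k(k-1)/2, which is the O(k^2) term in point 1. The second removes exact duplicate sequences, one simple way of dealing with the redundancy mentioned in point 6. Both function names are hypothetical, and treating only identical sequences as redundant is an assumption; a real dataset may also call for removing near-identical sequences.

```python
def pairwise_comparisons(k):
    """Number of distinct pairs among k sequences."""
    return k * (k - 1) // 2


def drop_exact_duplicates(records):
    """records is a list of (name, sequence) tuples.
    Keep the first record for each distinct sequence, ignoring case."""
    seen = set()
    kept = []
    for name, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            kept.append((name, seq))
    return kept


# The chitinase III alignment above has 12 sequences.
print(pairwise_comparisons(12))   # prints 66
```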
Conceptually, the workflow in creating a phylogeny would include creating a multiple alignment. The overall chain of events might look something like this:

 

Implementation of such a workflow might be done in a number of ways. There is no one protocol for constructing alignments and phylogenies. At each step, decisions must be made as to which approach to take. In some cases, it may be necessary to try several methods before choosing one. As well, results at one step often make it necessary to go back several steps and refine the dataset. For example, a poor phylogeny may indicate the need to re-do the alignment, and then to retry the phylogeny.

The steps in constructing an alignment and a phylogeny are illustrated using programs from the BIRCH system. Assume a set of GenBank entries has been retrieved, all of which represent homologous genes from several species. In GenBank entries, protein coding sequences are annotated as 'CDS' features. The FEATURES program can extract CDS sequences from a group of GenBank entries automatically. Next, the coding sequences must be translated. Two multiple alignment programs are available: TCOFFEE and Dialign. To display the alignment for evaluation or final publication, it is best to try several of the programs listed to tailor the output to your needs. REFORM generates plain ASCII text, which can be easily imported into a word processor or drawing program. The other programs listed generate PostScript output for direct viewing or printing.
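
The translation step can be sketched in a few lines of Python. This is an illustration of the idea only, not the code used by the BIRCH programs. It assumes the standard genetic code, a CDS that starts in frame at its first base, and the hypothetical function name translate. Codons containing ambiguous bases are translated as 'X'.

```python
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    # Ignore any incomplete codon at the 3' end.
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCTTAA"))   # prints MA
```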

If an aligned DNA sequence is desired, MRTRANS from the FASTA package can read in an aligned protein sequence and the corresponding DNA sequence and generate a file containing the DNA alignment.
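
The idea behind this step is to thread each DNA sequence onto its aligned protein: every amino acid takes the next three nucleotides, and every gap in the protein becomes a three-base gap in the DNA. The Python sketch below shows the principle and is not MRTRANS itself. The function name thread_dna is hypothetical, and the sketch assumes the DNA begins in frame with the first residue and contains no introns.

```python
def thread_dna(aligned_protein, cds, gap="-"):
    """Build an aligned DNA sequence from an aligned protein and its unaligned CDS."""
    codons = []
    pos = 0
    for residue in aligned_protein:
        if residue == gap:
            codons.append(gap * 3)       # protein gap becomes a codon-sized gap
        else:
            codon = cds[pos:pos + 3]
            if len(codon) < 3:
                raise ValueError("DNA is too short for the aligned protein")
            codons.append(codon)
            pos += 3
    return "".join(codons)


print(thread_dna("M-A", "ATGGCT"))   # prints ATG---GCT
```

Because gaps are only ever inserted between codons, the resulting DNA alignment preserves the reading frame.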

The most complex part of the decision process involves choosing a strategy for phylogeny construction. The three main phylogeny methods, parsimony, distance, and maximum likelihood, will be explained elsewhere.

Though not shown in the figure, protein sequences can be directly used in phylogeny construction. In this example, though, a phylogenetic tree is constructed from the aligned DNA. The quality of the alignment may be evaluated by bootstrapping the alignment and rerunning the phylogeny search, although this option is not usually possible for large numbers of sequences when using a computationally intensive maximum likelihood program.
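
Bootstrapping an alignment means building a new alignment of the same length by sampling columns of the original with replacement. Each resampled alignment is then given to the phylogeny program, which is why the cost multiplies quickly. The Python sketch below shows the resampling step only. The function name bootstrap_alignment and the toy alignment are hypothetical, and the sketch assumes all sequences have the same aligned length.

```python
import random


def bootstrap_alignment(alignment, rng):
    """alignment maps sequence names to aligned sequences of equal length.
    Return one bootstrap replicate built from columns sampled with replacement."""
    length = len(next(iter(alignment.values())))
    # Choose the same column indices for every sequence so columns stay intact.
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }


# Toy alignment (made-up sequences); a fixed seed makes the replicate repeatable.
toy = {"seq1": "ATGGCT", "seq2": "ATGACT", "seq3": "ATG-CT"}
replicate = bootstrap_alignment(toy, random.Random(1))
for name, seq in replicate.items():
    print(name, seq)
```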
 
