PLNT4610/PLNT7690 Bioinformatics
Lecture 7, part 1 of 4

October 23, 2018



Felsenstein J. PHYLIP documentation.

Fitch WM and Margoliash E (1967) Construction of phylogenetic trees. Science 155:279-284.

Hershkovitz MA and Leipe DD (1998) Phylogenetic analysis. In: Baxevanis AD and Ouellette BFF, Eds. Bioinformatics: A practical guide to the analysis of genes and proteins. John Wiley, Toronto.

Hillis DM, Allard MW, Miyamoto MM (1993) Analysis of DNA sequence data: Phylogenetic Inference. Meth. Enz. 224:456-487.

Setubal, J. and Meidanis, J. (1997) Introduction to Computational Molecular Biology. PWS Publishing Co., Toronto. Ch. 6 "Phylogenetic Trees".

Learning checklist:

1. Understand the principles of gene evolution by duplication, deletion, and mutation.
2. Know the advantages and disadvantages of using DNA or protein sequences for phylogeny construction.
3. Understand the basic concepts, advantages and disadvantages of the main tree building methods: distance matrix, maximum parsimony and maximum likelihood.
4. Understand how bootstrap resampling provides an estimate of the quality of branches on a tree.

A. Evolutionary Models

B. Methods for tree building

C. Evaluating phylogenies

D. Bayesian phylogenies

E. Running phylogeny programs

  1. Distance matrix programs
  2. Maximum parsimony and maximum likelihood

Estimating phylogenetic trees is not just an academic exercise: in some cases it can literally be a matter of life or death. For example, phylogenetic trees provided crucial evidence in the murder trial of a dentist that infected one of his patients with human immunodeficiency virus (HIV). Evolutionary trees also showed that cases of encephalitis in New York and New England represented the first examples of the mosquito-borne West Nile virus in the western hemisphere.

- Holder M, Lewis PO (2003) Phylogeny estimation: Traditional and Bayesian approaches. Nature Reviews Genetics 4:275-284.

A. Evolutionary Models

Evolution of genes or species can be modeled as a bifurcating tree

To a first approximation, the evolution of species or genes can be modeled as a bifurcating process. Two populations become reproductively isolated and diverge due to random mutational processes. Over time, this process may repeat itself, so that at any time, each population can be said to be most closely related to some other population with which it shares a direct common ancestor.

Divergence consists of changes in characters, such as amino acids in a protein, or nucleotides in DNA. The longer two populations remain reproductively isolated, the more divergence will occur. In principle, given the states for homologous characters across a set of populations, it should be possible to work backwards in time, ascending the tree, and reconstructing the character states for each common ancestor in the tree, until a common ancestor of all populations in the set is reached.
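The idea of working backwards from observed character states to ancestral states can be sketched with Fitch's parsimony procedure, which is one standard way of doing this and is not named in these notes. The sketch assumes the tree is already known and written as nested pairs, and it handles a single character (one alignment column). The taxon names A to D and their nucleotides are hypothetical.

```python
def fitch(tree, states):
    """Return (candidate ancestral states, minimum number of changes) for a subtree.

    tree   : a leaf name (str) or a pair (left_subtree, right_subtree)
    states : dict mapping each leaf name to its observed character state
    """
    if isinstance(tree, str):
        # A leaf: its state is observed, and no change is needed below it.
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    shared = left_set & right_set
    if shared:
        # Both descendants can share a state, so no extra change is implied.
        return shared, left_cost + right_cost
    # No shared state: at least one change occurred on a branch below this node.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical four-taxon tree ((A,B),(C,D)) and one alignment column.
tree = (("A", "B"), ("C", "D"))
column = {"A": "G", "B": "G", "C": "T", "D": "G"}
root_states, changes = fitch(tree, column)
print(root_states, changes)
```

For this column the most parsimonious explanation is a G at the root with a single change on the branch leading to C. Real programs repeat this over every column and over many candidate trees.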

Reproductive isolation

The most important assumption in any phylogenetic model is that of reproductive isolation, such that no horizontal gene flow has occurred. However, speciation can often be a gradual process. Until reproductive isolation becomes complete, re-hybridization of closely-related species can result in horizontal gene flow. A good example is the recent evolutionary history of species related to wheat.

As illustrated at right, modern wheat (Triticum aestivum) is a hexaploid containing three genomes: the A, B and D genomes. The three genomes are inferred to have arisen from the same common ancestor and then diverged in separate species. However, subsequent hybridization events have produced cereal species with all combinations of genomes, of which wheat is only one example.

Horizontal gene flow is even more critical when considering prokaryotic genomes. Prokaryotes appear to have fewer fundamental barriers to horizontal gene transfer, so horizontal DNA transfer among them is relatively easy. At the other extreme, higher eukaryotes have germlines that are physically sequestered from contact with foreign species.


A second factor that is critical to phylogenetic analysis is that homologous characters be compared across species. For DNA and proteins, this means that gaps must be correctly placed in multiple alignments to ensure that the same position is being compared for each species. Consequently, if a multiple alignment is poor, phylogeny construction will also be poor.

The alignment below shows a small region of a group of monocot thionins.

Heavily gapped regions are a major source of error in multiple alignments

The alignment shows a region with gaps flanked by non-gapped sequence. This heavily gapped region illustrates two points. First, regions with large numbers of gaps are probably poorly aligned. For example, it might be more legitimate to align the block of A's at position 13 with the A's at position 10. However, that would require additional gaps, which would lower the alignment score. Put another way, we are uncertain whether those A's are homologous in each case, that is, similar due to common ancestry, or whether they represent independent positions in these proteins.

Therefore, where parts of an alignment are uncertain due to gaps, it is often best to delete those parts from the alignment before performing phylogenetic analysis.
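Deleting uncertain regions can be approximated very simply by dropping every alignment column in which too many sequences have a gap. The following is a minimal sketch, not the procedure used by any particular program. The function name, the `max_gap_fraction` parameter and the toy sequences are all hypothetical.

```python
def drop_gapped_columns(alignment, gap_chars="-.", max_gap_fraction=0.0):
    """Return a copy of the alignment without heavily gapped columns.

    alignment        : dict mapping sequence name to aligned sequence (equal lengths)
    gap_chars        : characters that count as gaps
    max_gap_fraction : a column is kept only if its fraction of gaps is at or
                       below this value (0.0 keeps only gap-free columns)
    """
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all aligned sequences must have the same length")

    keep = []
    for col in range(length):
        gaps = sum(1 for seq in seqs if seq[col] in gap_chars)
        if gaps / len(seqs) <= max_gap_fraction:
            keep.append(col)

    return {name: "".join(alignment[name][col] for col in keep) for name in names}


# Hypothetical toy alignment with a gapped region in the middle.
toy = {
    "seq1": "MKAA--CPT",
    "seq2": "MK-AAACPT",
    "seq3": "MKA---CPS",
}
trimmed = drop_gapped_columns(toy)
for name, seq in trimmed.items():
    print(name, seq)
```

With the default threshold only the gap-free flanking columns survive. Raising `max_gap_fraction` keeps more columns at the cost of including positions whose homology is less certain.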

For those gaps that remain in the alignment, the problem of scoring arises. It is most likely that gaps spanning more than one position represent single insertion/deletion events. In that case, it is unreasonable to allow each gap position to contribute to the score during tree construction. One approach that is often used is to recode all but the first gap position in a long insertion/deletion region as unknowns. Most programs represent gaps as either dashes (-) or periods (.), and unknown positions as question marks (?). Thus programs that correct for incomplete data at some positions make it possible to work with a more realistic model of gaps.
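The recoding described above can be sketched in a few lines. This is an illustration under the stated convention of dashes or periods for gaps and question marks for unknowns. The function name and the example sequence are hypothetical, and leading or trailing gaps are treated the same as internal ones.

```python
import re


def recode_gap_runs(seq, gap_chars="-.", unknown="?"):
    """Keep the first gap of each run of gaps and recode the rest as unknowns."""
    # Character class matching one or more consecutive gap characters.
    pattern = "[" + re.escape(gap_chars) + "]+"

    def fix(match):
        run = match.group(0)
        # The first position still counts as one insertion/deletion event,
        # the remaining positions are marked as missing data.
        return run[0] + unknown * (len(run) - 1)

    return re.sub(pattern, fix, seq)


# Hypothetical aligned sequence with one three-position gap and one single gap.
recoded = recode_gap_runs("MKA---CP-T")
print(recoded)
```

Each multi-position gap then contributes a single event to the score, while single-position gaps are left unchanged.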

Gene Trees versus Species Trees

The existence of multigene families makes it important to distinguish between gene trees and species trees. If only one copy of a gene existed as species diverged, then the two would be equivalent. However, as species diverge, duplication and deletion events generate paralogous copies of genes which may be unique to each species. At the same time, orthologous copies of a gene may persist across many speciation events.

The figure shows a multigene family in some ancestral species, consisting of four copies of a gene. As speciation progresses, deletions and duplications occur independently in each species. In the lineage on the left, copies A and C are lost, and both B and D are duplicated. In the lineage leading to species 2, copy D is lost, and copies B and C are tandemly duplicated, followed by deletion of one copy of B. This model predicts that the longer the time since two species diverged, the fewer orthologous copies of a gene will remain, unless there is selective pressure to retain specific orthologous copies. However, while orthologous copies persist, they will cluster together on a tree. For example, B orthologues in species 1 will be more closely related to B orthologues in species 2 than they will be to D orthologues in species 1. Consequently, there is no guarantee that multiple copies of a gene will always be more closely related to other copies within a species than they will be to copies in other species.

The more rapidly gene families undergo duplication/deletion events, the more likely it is that most or all copies of a gene within a species cluster together on the same branch of a tree.

Finally, if gene conversion results in the complete replacement of one copy of a gene with another, it is functionally equivalent to a deletion of one copy and duplication of another. However, it is possible for partial gene conversion to result in replacement of only part of a gene. In that case, a partially converted copy would not be consistent with any simple model of evolution. That is, two domains of a protein might have different evolutionary histories. This phenomenon is referred to as 'reticulate evolution'.

DNA/RNA versus protein

The choice of DNA versus protein for phylogeny construction is complex and depends on the system being studied. Ribosomal RNA also has a specific role in phylogenetic analysis.

1 homoplasy - return of a character to its previous state, thus masking intervening mutational events. Homoplasies are most important in DNA sequences, because there are only 4 nucleotides. Every fourth mutation should result in a homoplasy.
