Lecture 7, part 1 of 4
Felsenstein J PHYLIP documentation [http://home.cc.umanitoba.ca/~psgendb/doc/Phylip/main.html]
Fitch WM and Margoliash E (1967) Construction of phylogenetic trees. Science 155:279-284.
Hershkovitz MA and Leipe DD (1998) Phylogenetic Analysis In Baxevanis AD and Ouellette BFF. Eds. Bioinformatics: A practical guide to the analysis of genes and proteins. John Wiley, Toronto.
Hillis DM, Allard MW, Miyamoto MM (1993) Analysis of DNA sequence data: Phylogenetic Inference. Meth. Enz. 224:456-487.
Setubal J and Meidanis J (1997) Introduction to Computational Molecular Biology. PWS Publishing Co., Toronto. Ch. 6 "Phylogenetic
Evolution of genes or species can be modeled as a bifurcating tree
To a first approximation, the evolution of species or genes can be modeled as a bifurcating process. Two populations become reproductively isolated and diverge due to random mutational processes. Over time, this process may repeat itself, so that at any time, each population can be said to be most closely related to some other population with which it shares a direct common ancestor.
Divergence consists of changes in characters, such as amino acids in a protein, or nucleotides in DNA. The longer two populations remain reproductively isolated, the more divergence will occur. In principle, given the states for homologous characters across a set of populations, it should be possible to work backwards in time, ascending the tree, and reconstructing the character states for each common ancestor in the tree, until a common ancestor of all populations in the set is reached.
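The idea of working backwards from observed character states to ancestral states can be sketched in a few lines. The example below is only an illustration, not a method prescribed in these notes: it assumes a rooted bifurcating tree written as nested Python tuples, a single character (one alignment column), and the set-intersection rule of Fitch parsimony as the reconstruction method. The names `fitch_states`, `tree` and `site` are hypothetical.

```python
def fitch_states(tree, leaf_states):
    """Bottom-up pass of Fitch parsimony for one character.

    tree        -- a leaf name (str) or a 2-tuple (left_subtree, right_subtree)
    leaf_states -- dict mapping each leaf name to its observed state
    Returns (set of candidate states at the root of `tree`,
             minimum number of state changes within `tree`).
    """
    if isinstance(tree, str):
        # A leaf: its state is observed, no changes needed.
        return {leaf_states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch_states(left, leaf_states)
    right_set, right_changes = fitch_states(right, leaf_states)
    common = left_set & right_set
    if common:
        # The two descendants can share a state: no extra change.
        return common, left_changes + right_changes
    # No shared state: the ancestor is ambiguous and one change is required.
    return left_set | right_set, left_changes + right_changes + 1


# Hypothetical four-population tree and one nucleotide position.
tree = (("sp1", "sp2"), ("sp3", "sp4"))
site = {"sp1": "A", "sp2": "A", "sp3": "G", "sp4": "A"}
root_set, changes = fitch_states(tree, site)
print(root_set, changes)
```

For this tree the common ancestor of all four populations is inferred to have carried A, with a single change on the branch leading to sp3. Real programs repeat this for every position and also compare alternative trees.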
Horizontal gene flow is even more critical when considering prokaryotic genomes. Prokaryotes appear to have fewer fundamental barriers to horizontal gene transfer, so DNA moves between lineages relatively easily. Such transfer violates the bifurcating model, because a transferred gene has a different history from the genome that receives it. At the other extreme, higher eukaryotes have germlines that are physically sequestered from contact with foreign species.
Heavily gapped regions are a major source of error in multiple alignments
The alignment shows a region with gaps flanked by non-gapped sequence. This heavily gapped region illustrates two points. First, regions with large numbers of gaps are probably poorly aligned. For example, it might be more legitimate to align the block of A's at position 13 with the A's at position 10. However, that would require additional gaps, which would lower the alignment score. Put another way, we are uncertain whether those A's in each case are homologous, that is, similar due to common ancestry, or whether they represent independent positions in these proteins.
Therefore, where parts of an alignment are uncertain due to gaps, it is often best to delete those parts from the alignment before performing phylogenetic analysis.
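One simple way to delete uncertain regions is to drop every alignment column in which too many sequences have a gap. The sketch below assumes the alignment is a list of equal-length strings and uses an arbitrary cutoff of 50% gaps. The function name `drop_gappy_columns`, the cutoff and the toy sequences are illustrative assumptions, not values recommended in these notes.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.5, gap_chars="-."):
    """Return the alignment with heavily gapped columns removed.

    alignment        -- list of aligned sequences, all the same length
    max_gap_fraction -- columns with a larger fraction of gaps are dropped
                        (0.5 is an arbitrary illustrative cutoff)
    gap_chars        -- characters treated as gaps
    """
    n_seqs = len(alignment)
    keep = [
        col
        for col in range(len(alignment[0]))
        if sum(seq[col] in gap_chars for seq in alignment) / n_seqs
        <= max_gap_fraction
    ]
    return ["".join(seq[col] for col in keep) for seq in alignment]


# Toy alignment: columns 3 and 4 are mostly gaps and are removed.
aligned = ["MKT--AAL", "MKTG-AAL", "MK---AAL"]
trimmed = drop_gappy_columns(aligned)
print(trimmed)
```

In practice the uncertain region is often judged by eye or by dedicated trimming tools; a fixed per-column cutoff is just the easiest rule to write down.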
For those gaps that remain in the alignment, the problem of scoring arises. It is most likely that a gap spanning more than one position represents a single insertion/deletion event. In that case, it is unreasonable to allow each gap position to contribute to the score during tree construction. One approach that is often used is to recode all but the first gap position in a long insertion/deletion region as unknowns. Most programs represent gaps as either dashes (-) or periods (.), and unknown positions as question marks (?). Thus, programs that correct for incomplete data at some positions make it possible to work with a more realistic model of gaps.
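The recoding described above can be sketched as follows. This is a minimal illustration that assumes each sequence is a plain string, that `-` and `.` both mark gaps, and that `?` marks an unknown; the function name `recode_gap_runs` is hypothetical. Whether a given phylogeny program treats `?` as missing data should be checked in that program's documentation.

```python
import re


def recode_gap_runs(seq, gap_chars="-.", unknown="?"):
    """Keep the first gap of each multi-position gap run; recode the rest.

    A run of two or more consecutive gap characters is assumed to be one
    insertion/deletion event, so only its first position stays a gap.
    """
    # Character class matching any gap character, repeated at least twice.
    pattern = "[" + re.escape(gap_chars) + "]{2,}"

    def recode(match):
        run = match.group(0)
        return run[0] + unknown * (len(run) - 1)

    return re.sub(pattern, recode, seq)


# The four-position gap becomes one gap plus three unknowns;
# the single-position gap is left alone.
print(recode_gap_runs("ACG----TT-A"))  # ACG-???TT-A
```

Each multi-position gap then counts as one event during tree construction instead of one event per position.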
The figure shows a multigene family in some ancestral species, consisting of four copies of a gene. As speciation progresses, deletions and duplications occur independently in each species. In the lineage on the left, copies A and C are lost, and both B and D are duplicated. In the lineage leading to species 2, copy D is lost, and copies B and C are tandemly duplicated, followed by deletion of one copy of B. This model predicts that the longer the time since the divergence of two species, the fewer orthologous copies of a gene will remain, unless there is selective pressure to retain specific orthologous copies. However, while orthologous copies persist, they will cluster together on a tree. For example, B orthologues in species 1 will be more closely related to B orthologues in species 2 than they will be to D orthologues in species 1. Consequently, there is no guarantee that multiple copies of a gene will always be more closely related to other copies within a species than they will be to copies in other species.
The more rapidly gene families undergo duplication/deletion events, the more likely it is that most or all copies of a gene within a species cluster together on the same branch of a tree.
Finally, if gene conversion results in the complete replacement of one copy of a gene with another, it is functionally equivalent to a deletion of one copy and duplication of another. However, it is possible for partial gene conversion to result in replacement of only part of a gene. In that case, a partially converted copy would not be consistent with any simple model of evolution. That is, two domains of a protein might have different evolutionary histories. This phenomenon is referred to as 'reticulate
1 Homoplasy - return of a character to a previous state, thus masking intervening mutational events. Homoplasies are most important in DNA sequences, because there are only 4 nucleotides. If all substitutions are equally likely, a second substitution at the same site restores the original nucleotide one time in three.