TUTORIAL: CHARACTER PHYLOGENY METHODS

Parsimony and Maximum Likelihood

June 24, 2021

PHYLIP Main documentation: $doc/Phylip/main.html
PHYLIP DNA Parsimony methods: $doc/Phylip/dnapars.html
PHYLIP Protein Parsimony methods: $doc/Phylip/protpars.html
PHYLIP Bootstrapping: $doc/Phylip/seqboot.html
PHYLIP Consensus trees: $doc/Phylip/consense.html
PHYLIP DNA Maximum Likelihood methods: $doc/Phylip/dnaml.html
PHYLIP Protein Maximum Likelihood methods: $doc/Phylip/proml.html
Archaeopteryx: $doc/forester/ArchaeopteryxUserDocumentation.pdf

This tutorial continues from the previous tutorial Phylogenetic Analysis Using Distance Methods. It begins with multiply aligned protein and DNA sequences for plant type III chitinases. To eliminate gappy positions, the alignments were processed by Gblocks. The starting point for this tutorial is the Gblocks output alignments.

1. The dataset

Create a new directory called character. Save these files to the character directory

Aligned Gblocks DNA sequences - chitIII.CDS.pal2nal.gblocks.fsn
Aligned Gblocks protein sequences - chitIII.pro.mafft.gblocks.fsa

Because the steps for DNA and protein phylogeny are virtually identical, we will show as an example DNA phylogenies run from blnalign.

2. Parsimony methods are sensitive to the order in which sequences are added to the tree.

cd into the character directory and launch blnalign. Read in chitIII.CDS.pal2nal.gblocks.fsn.

The goal of parsimony is to find the tree that requires the fewest base or amino acid substitutions, when mutational distances from each sequence to each ancestral node, and between ancestral nodes, are added up. For an example, run DNAPARS by choosing Phylogeny --> DNAPARS.

As seen in the previous tutorial, multiple output windows appear: the DNAPARS report, a machine readable tree file in a text editor, the same treefile in bltree, and the same treefile in Archaeopteryx.

Parsimony methods set out to build a tree by successively adding sequences to the tree, until all sequences have been added. Unlike distance methods, which tend towards a single answer, the tree you get could be strongly influenced by the order in which branches are added.

To randomize the order of sequence addition, choose 'Yes' in the Jumble area of the DNAPARS menu. BioLegato will automatically set a random number seed to create a list of random numbers used for jumbling. Usually only a few jumbles are needed to uncover most of the alternate trees. In the example, the search is repeated 5 times with a random order of sequences.

Three equally parsimonious trees are produced, as shown in bltree:

By "equally parsimonious", we mean that the number of mutational steps required to traverse the entire tree is the same in all 3 trees, in this case, the report tells us requires a total of 2718 (data not shown).

An example shows 1 subtree from the complete tree for each of the three resultant trees.

One subtree that is mostly conserved between the three trees illustrates the differences, solely due to redoing the analysis by adding the sequences to the tree in different orders. For example, In A and C, the subbranch containing sequences whose names end in 8984, 3327,3226, 8024, and 0067 are conserved. Comparing B and C, both have an additional sequence ending in 6547, in B as an outgroup, and in C as an internal node.

There are other differences when the larger tree is considered, but this example serves to demonstrate that parsimony is sensitive to the order in which sequences are added. For this reason, jumbling should always be done in when using either parsimony or maximum likelihood.

The other thing to take not of in the output report is the time elapsed:

Elapsed time on cc14.cc.umanitoba.ca: 8.55016589165 seconds.

This will be important in the next section.

For comparison, let's try running DNAML on the same dataset. Again, we'll do 5 jumbles, and global rearrangements.

This tree compares well with the previous ones, although in order to include 8984 we also had to include another clade, 1055 and 7126. As we will see later from bootstrap replication, this is because many of the internal branches are not as reproducible as the more terminal branches.

From the report, we get Ln Likelihood = -13106.23792

Ln likelihood - The log likelihood is the log to the base e of the products of the probabilities of observing the data at each position, given the final tree shown. This makes it possible to compare the likelihood of seeing the data with one tree, as opposed to another, thereby making it possible to choose the best tree as the one with the highest likelihood.

WARNING
Maximum likelihood methods are very slow, because they attempt to consider an enormous number of possible trees. The time required increases exponentially with the number of sequences. Therefore doubling the number of sequences does not double the execution time. In fact for DNAML, the time required increases roughly with the 4th power of the number of the sequences! For most practical purposes, direct tree construction with greater than 30 sequences requires prohibitive amounts of time.

Since bootstrapping multiplies the time required often by a factor of 100 or more, we usually don't have the luxury of bootstrapping with maximum likelihood methods. However, as illustrated above, we can bootstrap with a less time consuming method such as parsimony, and then build the final tree with Maximum Likelihood.

In this case, elapsed time on cc14.cc.umanitoba.ca: 1735.23524117 seconds, which means that DNAML took about 1500 times longer to construct a tree de-novo than did DNAPARS. If we were to do 100 bootstrap replicates using 5 jumbles as described in the next section, it would take about 173500 seconds, or about 48 hr. to complete. This is not necessarily a reason not to do Maximum Likelihood, but is does illustrate the fact that for larger datasets it won't be practical.

Therefore, the our strategy will be to create a rigorous tree topology using parsimony with bootstrap resampling, and then calculating the final branch lengths using Maximum Likelihood.

3. Validating the quality of the tree by bootstrapping.

Bootstrap resampling of the sequence data is usually the method of choice for evaluating trees. Set RESAMPLING to "Bootstrap":

By default, 100 bootstrap replicate datasets will be created, each containing positions sampled at random from the sequence alignment . In each set, some positions will be overrepresented, and others underrepresented. A large enough set of replicates should ensure that all parts of the sequence are equally biased among the replicates as a whole. If the original tree was simply due to a fortuitous circumstance that a few positions tipped the balance between one topology and another, different topologies will appear as each replicate dataset is evaluated. If the data are robust, meaning that a given branch appears regardless of which sites are omitted from the sample, then that branch is strongly supported by the data.

In order to add bootstrap values to the final consensus tree, we also need to save all of the bootstrapped trees in a separate file.

Since 100 bootstraps require 100 iterations of the tree building process, the time required can be substantial when there are large numbers of sequences. If 1 iteration took 8.55 sec., then 100 iterations will require about 855 seconds, or about 14 min. 15 sec. It is therefore a good practice to send output to files, rather than to windows:

In this case two files will be created:

Report: chitIII.CDS.pal2nal.gblocks.dnapars.boot.outfile
Bootstrap replicate trees: chitIII.CDS.pal2nal.gblocks.dnapars.boot.alltrees.treefile
Consensus tree: chitIII.CDS.pal2nal.gblocks.dnapars.boot.treefile

The report tells us

Elapsed time on cc14.cc.umanitoba.ca: 953.693402052 seconds

So it took a bit longer than expected. Since this is a multiuser system, actual time required will vary depending on system load.

The trees created during the 100 runs of DNAPARS are combined into a consensus tree, showing the number of times each branch occurred among 100 bootstrap replicates. Note that branch lengths are arbitrary. This tree ONLY gives us boostrap numbers telling how often a group of sequences is found on the same clade.

A larger subtree is shown to illustrate where sequences from the previous subtrees ended up. The 5 sequences on the lower clade are still found together, although the bootstrap value shows that these sequences cluster together only 57.2% of the time. (The bootstrap values are not always whole numbers because for each of 100 replicates we jumble 5 times, and may get several equally parsimonious trees with each jumble.)

The remaining sequences from the original subtrees do cluster together 29.5% of the time. Since the deeper branches on this clade are 100%, it is likely that in other replicates, 6547 clusters on some other clade. This would be consistent with the fact that it was seen in 2 of the 3 previous subtrees.

Bootstrapping therefore gives us a way to determine which parts of the tree are most strongly supported by the data, and which are not.

4. Calculating branch lengths

Bootstrapping does not directly give us branch lengths. However, it is possible to use parsimony trees as input maximum likelihood programs which do give branch lengths. For large datasets, DNAML is probably too slow to do 100 or more bootstrap replicates, or maybe even a single run. For calculating branch lengths for an already existing tree, however, it is quite fast. Choose Phylogeny --> DNAML.

In this example, "Evaluate user-supplied tree" is chosen, and "chitIII.CDS.pal2nal.gblocks.dnapars.boot.treefile" is chosen as the treefile to use.Output is shown below:

Report: chitIII.CDS.pal2nal.gblocks.dnaml.outfile
Consensus tree: chitIII.CDS.pal2nal.gblocks.dnaml.treefile

In order to visualize this tree with both branch lengths and boostrap values, we need to create a phyloXML file from the existing treefile. Select the maximum likelihood tree in bltree and choose Draw --> ConfAdd. Set the Bootstrap tree file name to chitIII.CDS.pal2nal.gblocks.dnapars.boot.alltrees.treefile, which we saved earlier. As shown, confadd will add the bootstrap values to this tree and create a phyloXML file with the output. The output will pop up in Archaeopteryx

Here is what we wanted: A gree with branch lengths and bootstrap values. From Archaeopteryx, save this file as chitIII.CDS.pal2nal.gblocks.dnaml.phyloxml.

NEXT: Visualization of Phylogenetic Trees