TUTORIAL: PHYLOGENETIC ANALYSIS USING PARSIMONY
Oct. 19, 2014
|The PHYLIP programs are command line programs, but
can be run by BioLegato.
The programs in the PHYLIP package are interactive programs designed to be run at the command line. BioLegato can run these programs by generating the keystrokes you would have typed to set program parameters.
Create a directory called parsimony, and save chitIII.mrtrans.gde
to this directory. Open the file in blnalign:
OUTFILE - the report on the
TREEFILE- the machine
-readable treefile. Readable by programs such as DRAWTREE,
DRAWGRAM, and TREETOOL.
The treefile is also loaded into a bltree window, allowing further analysis.
ATV - the treefile in the ATV tree editor.
Parsimony methods set out to build a tree by successively adding sequences to the tree, until all sequences have been added. Unlike distance methods, which tend towards a single answer, the tree you get could be strongly influenced by the order in which branches are added.
To randomize the order of sequence addition, choose 'Yes' in the
Jumble area of the DNAPARS menu. BioLegato will automatically set
a random number seed to create a list of random numbers used for
jumbling. Usually only a few jumbles are needed to uncover most of
the alternate trees. In the example, the search is repeated 5
times with a random order of sequences.
Two equally parsimonious trees are produced.
Many of the groups of sequences cluster together in both trees. However, sequences shown in bold on the tree (NTACIDCL3, ATHCHIA, VURNACH3B, NTBASICL3, S66038) cluster together in the tree at right, but are split among several clades in the tree at left. The fundamental topology of the tree is therefore dependent upon the order in which the sequences are added.
By default, 100 bootstrap replicate datasets will be created, each containing positions sampled at random from the sequence alignment . In each set, some positions will be overrepresented, and others underrepresented. A large enough set of replicates should ensure that all parts of the sequence are equally biased among the replicates as a whole. If the original tree was simply due to a fortuitous circumstance that a few positions tipped the balance between one topology and another, different topologies will appear as each replicate dataset is evaluated. If the data are robust, meaning that a given branch appears regardless of which sites are omitted from the sample, then that branch is strongly supported by the data.
Since 100 bootstraps require 100 iterations of the tree building
process, the time required can be substantial when there are large
numbers of sequences. It is therefore a good practice to send
output to files, rather than to windows:
In this case two files will be created:
chitIII.dnapars.boot.outfileThe trees created during the 100 runs of DNAPARS are combined into a consensus tree, showing the number of times each branch occurred among 100 bootstrap replicates.
From the two parsimony trees examined above, we got a rough idea of which groups should cluster together. For example, CUSSEQ1, 2 & 3 should all cluster together, although surprisingly, CUSSEQ3 clusters with CUSSEQ1 in 71% of the trees, this is not true in 29% of the trees. The least certain grouping in the tree is the clustering of NTBASICL3, ATHCHIA, VURNACH3B and NTACIDCL3. These only cluster together 33% of the time.
Bootstrapping therefore gives us a way to determine which parts
of the tree are most strongly supported by the data, and which are
In this example, "Evaluate user-supplied tree" is chosen, and "chitIII.dnapars.boot.treefile" is chosen as the treefile to use.
The contents of the output file (chitIII.dnaml.outfile)
are shown below. The machine-readable tree is in chitIII.dnaml.treefile.
JUMBLING SEQUENCE ORDER 1 ITERATIONS, SEED=9221
Nucleic acid sequence Maximum Likelihood method, version 3.63
Empirical Base Frequencies:
Transition/transversion ratio = 2.000000
| +----------------9 +--CUSSEQ_1
| | |
| +----2 +-CUSSEQ_2
| | |
| | | +-VIRECT
| | +------1
| | +--VURNACH3A
| | +--6
| | | | +--------------ATHCHIA
| | +---5 +-----7
| | | | +-------------VURNACH3B
| +----4 |
| | +------------NTACIDCL3
remember: this is an unrooted tree!
Ln Likelihood = -7100.58647
Between And Length Approx. Confidence Limits
------- --- ------ ------- ---------- ------
8 CACHIT 0.18427 ( 0.14360, 0.22495) **
8 PSTCHIT 0.15338 ( 0.11579, 0.19090) **
8 3 0.06366 ( 0.03188, 0.09549) **
3 2 0.08926 ( 0.05358, 0.12495) **
2 9 0.27943 ( 0.22659, 0.33225) **
9 10 0.00861 ( zero, 0.01910) *
10 CUSSEQ_3 0.03501 ( 0.02064, 0.04935) **
10 CUSSEQ_1 0.03584 ( 0.02138, 0.05030) **
9 CUSSEQ_2 0.02704 ( 0.01325, 0.04084) **
2 1 0.12011 ( 0.08334, 0.15696) **
1 VIRECT 0.02928 ( 0.01359, 0.04493) **
1 VURNACH3A 0.04964 ( 0.03105, 0.06823) **
3 4 0.08198 ( 0.04708, 0.11699) **
4 5 0.06701 ( 0.03375, 0.10031) **
5 6 0.03709 ( 0.00899, 0.06511) **
6 NTBASICL3 0.32799 ( 0.26703, 0.38900) **
6 7 0.09252 ( 0.05328, 0.13175) **
7 ATHCHIA 0.26025 ( 0.20748, 0.31309) **
7 VURNACH3B 0.24921 ( 0.19733, 0.30108) **
5 NTACIDCL3 0.22492 ( 0.17727, 0.27256) **
4 S66038 0.25430 ( 0.20315, 0.30545) **
* = significantly positive, P < 0.05
** = significantly positive, P < 0.01
Execution times on deneb: 0.0u 0.0s 0:00 0% 0+0k 0+0io 0pf+0w
Maximum likelihood methods are very slow, because they attempt to consider an enormous number of possible trees. The time required increases exponentially with the number of sequences. Therefore doubling the number of sequences does not double the execution time. In fact for DNAML, the time required increases roughly with the 4th power of the number of the sequences! For most practical purposes, direct tree construction with greater than 30 sequences requires prohibitive amounts of time.
Since bootstrapping multiplies the time required often by
a factor of 100 or more, we usually don't have the luxury
of bootstrapping with maximum likelihood methods. However,
as illustrated above, we can bootstrap with a less time
consuming method such as parsimony, and then build the
final tree with Maximum Likelihood.
Note that the original tree had a log likelihood of -7100.58647. After trying global rearrangements of this tree, a slightly different tree with a log likelihood of -7094.81012 was found. The difference in likelihoods between these two trees is roughly -7101 - (-7095) = -6. Recalling that these are logs to the base e, e6 = 403. Therefore the final tree is 403 times more likely to produce the data than the original tree.
A note on execution times - The execution time shown in
the example is not the time that fastDNAml would require to build
this tree from scratch. It is simply the time required to evaluate
the tree it was given, doing some rearrangements, but leaving most
of the tree intact. When the tree was built from scratch using
DNAML, the execution times were:
Execution times on deneb: 95.0u 0.0s 1:35 99% 0+0k 0+0io 0pf+0w
Therefore, a de-novo construction of this tree would require 95 seconds of CPU time, and took 1 minutes 35 seconds elapsed time.