BIRCH

TUTORIAL: PHYLOGENETIC ANALYSIS USING PARSIMONY


Oct. 19, 2014


PHYLIP Main documentation: $doc/Phylip/main.html
PHYLIP Parsimony methods: $doc/Phylip/dnapars.html
PHYLIP Parsimony methods: $doc/Phylip/protpars.html
PHYLIP Bootstrapping: $doc/Phylip/seqboot.html
PHYLIP Consensus trees: $doc/Phylip/consense.html
PHYLIP Maximum Likelihood methods: $doc/Phylip/dnaml.html

The PHYLIP programs are command line programs, but can be run by BioLegato.
The programs in the PHYLIP package are interactive programs designed to be run at the command line. BioLegato can run these programs by generating the keystrokes you would have typed to set program parameters.

Example: Plant Type III Chitinases

The chitinases in plants are hydrolytic enzymes that degrade chitins (N-acetyl glucosamine). Although chitin does not occur in plants,  in many fungi, it is a major component of the fungal cell wall. Not surprisingly, chitinases are produced in plants in response to fungi. Chitinases have been demonstrated to play an important role in plant defense responses. There are six classes of chitinases so far identified. Most known chitinases fall into the Type I and Type II classes. This exercise will work with a smaller class of genes encoding Type III chitinases.
 

1. The dataset

The file chitIII.mrtrans.gde is a GDE format file containing protein coding sequences (CDS) from chitinase III genes. These DNA sequences have already been aligned using Pearson's MRTRANS program, which reads a set of unaligned DNA sequences and aligns them according to a set of aligned proteins.

Create a directory called parsimony, and save chitIII.mrtrans.gde to this directory. Open the file in blnalign:

 

2. Parsimony methods are sensitive to the order in which sequences are added to the tree.

The goal of parsimony is to find the tree that requires the fewest base or amino acid substitutions, when mutational distances from each sequence to each ancestral node, and between ancestral nodes, are added up. For an example, run DNAPARS by choosing Phylogeny --> DNAPARS.

dnapars.menu.png

OUTFILE - the report on the phylogeny

outfile.png

TREEFILE- the machine -readable treefile. Readable by programs such as DRAWTREE, DRAWGRAM, and TREETOOL.


treefile.png

The treefile is also loaded into a bltree window, allowing further analysis.

treefile.bltree.png

ATV - the treefile in the ATV tree editor.

chitIII.dnapars.atv.png

Parsimony methods set out to build a tree by successively adding sequences to the tree, until all sequences have been added. Unlike distance methods, which tend towards a single answer, the tree you get could be strongly influenced by the order in which branches are added.

To randomize the order of sequence addition, choose 'Yes' in the Jumble area of the DNAPARS menu. BioLegato will automatically set a random number seed to create a list of random numbers used for jumbling. Usually only a few jumbles are needed to uncover most of the alternate trees. In the example, the search is repeated 5 times with a random order of sequences.

jumble.png

Two equally parsimonious trees are produced.
 
          
+---------VURNACH3B
+-----9
| +----------ATHCHIA
|
| +-VURNACH3A
+---8 +-------7
| | | +-VIRECT
| | +-----6
| | | | +-CUSSEQ_2
| | | +---------4
| +---3 | +-CUSSEQ_3
+----2 | +--5
| | | +-CUSSEQ_1
| | |
| | +---------S66038
| |
| | +------------NTBASICL3
| +--10
| +---------NTACIDCL3
|
1-------PSTCHIT
|
+-------CACHIT
                  
+-VURNACH3A
+------7
| +-VIRECT
+----6
| | +CUSSEQ_2
| +----------4
| | +-CUSSEQ_3
| +--5
+----3 +-CUSSEQ_1
| |
| | +---------S66038
| | |
| +----2 +----------NTBASICL3
| | +---10
| | | | +----------VURNACH3B
| +----8 +-----9
| | +----------ATHCHIA
| |
| +-------NTACIDCL3
|
1-------PSTCHIT
|
+-------CACHIT

Many of the groups of sequences cluster together in both trees. However, sequences shown in bold on the tree (NTACIDCL3, ATHCHIA, VURNACH3B, NTBASICL3, S66038) cluster together in the tree at right, but are split among several clades in the tree at left. The fundamental topology of the tree is therefore dependent upon the order in which the sequences are added.

3. Testing the quality of the tree by bootstrapping.

Bootstrap resampling of the sequence data is usually the method of choice for evaluating trees. Set RESAMPLING to "Bootstrap", and set a random number seed as shown below:
bootstrap.png

By default, 100 bootstrap replicate datasets will be created, each containing positions sampled at random from the sequence alignment . In each set, some positions will be overrepresented, and others underrepresented. A large enough set of replicates should ensure that all parts of the sequence are equally biased among the replicates as a whole. If the original tree was simply due to a fortuitous circumstance that a few positions tipped the balance between one topology and another, different topologies will appear as each replicate dataset is evaluated. If the data are robust, meaning that a given branch appears regardless of which sites are omitted from the sample, then that branch is strongly supported by the data.

Since 100 bootstraps require 100 iterations of the tree building process, the time required can be substantial when there are large numbers of sequences. It is therefore a good practice to send output to files, rather than to windows:

In this case two files will be created:

chitIII.dnapars.boot.outfile
chitIII.dnapars.boot.treefile
The trees created during the 100 runs of DNAPARS are combined into a consensus tree, showing the number of times each branch occurred among 100 bootstrap replicates.
 
                                            +-------------NTBASICL3
+-38.6-|
| | +------ATHCHIA
+-32.6-| +-90.5-|
| | +------VURNACH3B
+-46.3-| |
| | +--------------------NTACIDCL3
| |
+-75.5-| +---------------------------S66038
| |
| | +------CACHIT
| +----------------------91.0-|
+-100.0-| +------PSTCHIT
| |
| | +------CUSSEQ 3
| | +-70.7-|
+------| +----------------------100.0-| +------CUSSEQ 1
| | |
| | +-------------CUSSEQ 2
| |
| +------------------------------------------------VIRECT
|
+-------------------------------------------------------VURNACH3A

From the two parsimony trees examined above, we got a rough idea of which groups should cluster together. For example, CUSSEQ1, 2 & 3 should all cluster together, although surprisingly,  CUSSEQ3 clusters with CUSSEQ1 in 71% of the trees, this is not true in 29% of the trees. The least certain grouping in the tree is the clustering of NTBASICL3, ATHCHIA, VURNACH3B and NTACIDCL3. These only cluster together 33% of the time.

Bootstrapping therefore gives us a way to determine which parts of the tree are most strongly supported by the data, and which are not.
 

4. Calculating branch lengths

Parsimony analysis does yield reliable branch lengths. However, it is possible to use parsimony trees as input to distance or maximum likelihood programs which do give branch lengths. Choose Phylogeny --> DNAML.

In this example, "Evaluate user-supplied tree" is chosen, and "chitIII.dnapars.boot.treefile" is chosen as the treefile to use.

The contents of the output file (chitIII.dnaml.outfile) are shown below. The machine-readable tree is in chitIII.dnaml.treefile.

DNAML


JUMBLING SEQUENCE ORDER 1 ITERATIONS, SEED=9221

Nucleic acid sequence Maximum Likelihood method, version 3.63


Empirical Base Frequencies:

A 0.25559
C 0.24475
G 0.23579
T(U) 0.26388

Transition/transversion ratio = 2.000000

User-defined tree:


+--------PSTCHIT
|
| +--CUSSEQ_3
| +-10
| +----------------9 +--CUSSEQ_1
| | |
| +----2 +-CUSSEQ_2
| | |
| | | +-VIRECT
| | +------1
| | +--VURNACH3A
| |
8---3 +-------------------NTBASICL3
| | +--6
| | | | +--------------ATHCHIA
| | +---5 +-----7
| | | | +-------------VURNACH3B
| +----4 |
| | +------------NTACIDCL3
| |
| +--------------S66038
|
+----------CACHIT


remember: this is an unrooted tree!

Ln Likelihood = -7100.58647

Between And Length Approx. Confidence Limits
------- --- ------ ------- ---------- ------

8 CACHIT 0.18427 ( 0.14360, 0.22495) **
8 PSTCHIT 0.15338 ( 0.11579, 0.19090) **
8 3 0.06366 ( 0.03188, 0.09549) **
3 2 0.08926 ( 0.05358, 0.12495) **
2 9 0.27943 ( 0.22659, 0.33225) **
9 10 0.00861 ( zero, 0.01910) *
10 CUSSEQ_3 0.03501 ( 0.02064, 0.04935) **
10 CUSSEQ_1 0.03584 ( 0.02138, 0.05030) **
9 CUSSEQ_2 0.02704 ( 0.01325, 0.04084) **
2 1 0.12011 ( 0.08334, 0.15696) **
1 VIRECT 0.02928 ( 0.01359, 0.04493) **
1 VURNACH3A 0.04964 ( 0.03105, 0.06823) **
3 4 0.08198 ( 0.04708, 0.11699) **
4 5 0.06701 ( 0.03375, 0.10031) **
5 6 0.03709 ( 0.00899, 0.06511) **
6 NTBASICL3 0.32799 ( 0.26703, 0.38900) **
6 7 0.09252 ( 0.05328, 0.13175) **
7 ATHCHIA 0.26025 ( 0.20748, 0.31309) **
7 VURNACH3B 0.24921 ( 0.19733, 0.30108) **
5 NTACIDCL3 0.22492 ( 0.17727, 0.27256) **
4 S66038 0.25430 ( 0.20315, 0.30545) **

* = significantly positive, P < 0.05
** = significantly positive, P < 0.01



Execution times on deneb: 0.0u 0.0s 0:00 0% 0+0k 0+0io 0pf+0w



WARNING
Maximum likelihood methods are very slow, because they attempt to consider an enormous number of possible trees. The time required increases exponentially with the number of sequences. Therefore doubling the number of sequences does not double the execution time. In fact for DNAML, the time required increases roughly with the 4th power of the number of the sequences! For most practical purposes, direct tree construction with greater than 30 sequences requires prohibitive amounts of time. 

Since bootstrapping multiplies the time required often by a factor of 100 or more, we usually don't have the luxury of bootstrapping with maximum likelihood methods. However, as illustrated above, we can bootstrap with a less time consuming method such as parsimony, and then build the final tree with Maximum Likelihood.
 

What the output means

Ln likelihood - The log likelihood is the log to the base e of the products of the probabilities of observing the data at each position, given the final tree shown. This makes it possible to compare the likelihood of seeing the data with one tree, as opposed to another, thereby making it possible to choose the best tree as the one with the highest likelihood.

Note that the original tree had a log likelihood of  -7100.58647. After trying global rearrangements of this tree, a slightly different tree with a log likelihood of -7094.81012 was found. The difference in likelihoods between these two trees is roughly -7101 - (-7095)  = -6.  Recalling that these are logs to the base e,  e6 =  403. Therefore the final tree is 403 times more likely to produce the data than the original tree.

A note on execution times - The execution time shown in the example is not the time that fastDNAml would require to build this tree from scratch. It is simply the time required to evaluate the tree it was given, doing some rearrangements, but leaving most of the tree intact. When the tree was built from scratch using DNAML, the execution times were:

Execution times on deneb: 95.0u 0.0s 1:35 99% 0+0k 0+0io 0pf+0w

Therefore, a de-novo construction of this tree would require 95 seconds of CPU time, and took 1 minutes 35 seconds elapsed time.