TUTORIAL: PHYLOGENETIC ANALYSIS OF DISCRETE STATE DATA

Oct. 21, 2014

PHYLIP Main documentation: $doc/Phylip/main.html
PHYLIP Discrete state data methods: $doc/Phylip/discrete.html
PHYLIP RESTDIST (distance matrices from discrete state data): $doc/Phylip/restdist.html
PHYLIP Bootstrapping: $doc/Phylip/seqboot.html
PHYLIP Consensus trees: $doc/Phylip/consense.html
PHYLIP Dollo and polymorphism parsimony: $doc/Phylip/dollop.html
PHYLIP Camin-Sokal and Wagner parsimony: $doc/Phylip/mix.html
PHYLIP RESTML - Maximum Lilelihood: $doc/Phylip/restml.html

This tutorial assumes that you have already gone through the DNA phylogeny tutorials at $birch/tutorials/bioLegato/bioLegato.html

Example: RAPD Analysis of green foxtail populations in Manitoba

1. The dataset

Data taken from: Stilkowski, Gayle (1997) Inheritance of fenoxaprop-P-ethyl and trifuluralin resistance, and genetic variability among green foxtail (Setaria viridis (L.) Beauv.) populations. M.Sc. Thesis, Department of Plant Science, University of Manitoba.

Green foxtail is a grassy weed that is commonly found in cultivated fields in the Canadian prairies. Foxtail is usually controlled in cereal crops by inhibitors of acetyl-CoA carboxylase (ACCase) such as fenoxaprop-P-ethyl or by trifuluralin. However, accessions of foxtail have been identified which are resistant to one or both types of herbicide. The map below indicates the locations of sites in the province of Manitoba where herbicide resistant green foxtail populations have been found.

Thirteen accessions of green foxtail were tested for resistance to ACCase inhibitor and trifluralin. These accessions were also subjected to RAPD analysis, in which polymorphism was scored at 42 loci. The data are represented in Phylip format in the file manitoba.phyl.

Note: Phylip format for discrete data

Discrete state data are represented as follows: 0 - absence of a band; 1 - presence of a band; ? - data not available.

The first line lists the number of sequences (eg. species, lines, isolates, populations etc.) followed by the number of markers scored for each sequence.

The name of each sequence is listed in columns 1-10. The marker data begins in column 11. If the name is less than 10 characters, it must be padded by blanks so that the sequence begins in column 11. Do NOT use TAB characters in columns 1-10, which will cause blmarker to crash when reading in the data.

If the sequences are too long to be written in one line, use either Phylip sequential or Phylip interleved format, as described in $doc/Phylip/main.html.

Create a directory called foxtail, and save manitoba.phyl to this directory. Start blmarker by typing

blmarker

To read this file into blmarker, choose File -> Import Phylip Discrete Data.

The sequences will be read into the blmarker window.

2. Distance methods

For distance methods, blmarker calls restdist which creates a distance matrix from the selected sequences. This makes it possible to use the same distance programs used for DNA sequence analysis. Select all accessions (rows) using Edit --> Select All, and then choose Plylogeny -> RFLP/RAPD/AFLP etc. Distance methods:

Although this menu is quite the same as the distance menu for DNA and protein sequences, let's look at some important menu items:

DISTANCE MODEL: SITES/REST. FRAGS. - By default, each marker represents a distinct genetic locus. This is true for most types of molecular markers. However, for RFLPs, it is often the case that a given locus may be represented by two distinct bands, representing alleles. If this were restriction fragment data then, "REST. FRAGS." should be used.
SITE LENGTH - The number of nucleotide positions assayed by a marker method. For RFLP analysis using restriction enzymes with 6-base recognition sequences, this number is 6. That is, cleavage by the enzyme requires specific nucleotides must match at 6 positions. For RAPDs using primers 10 bases long, SITE LENGTH is 20 (not 10!). This is because the primer must match 10 bases on each side of the fragment to be amplified. For AFLPs, the number of selective bases is between 14 - 18 nt, depending on the restriction enzymes used. Obviously, the longer the site length, the lower the probability of a 0 -> 1 mutation. Put another way, it is much easier for a single site to mutate to match a 6 base restriction sequence than it is for two nearby sites to mutate to match a 10 base primer.

The output appears in windows as shown:

OUTFILE (manitoba.fitch.outfile)

TREEFILE (manitoba.fitch.treefile)

The output treefile is launched in two applications: a text editor, to make it easy to save the file, and bltree, to make it easier to work with the tree, or to use various tree drawing programs.

ATV

Note that accession 439-86 appears to be an outgroup, very distantly related to all other accessions. This is consistent with the fact that 439-86 originated in China, and was included as a control.

3. Parsimony methods.

The goal of parsimony is to find the tree that requires the fewest base or amino acid substitutions, when mutational distances from each sequence to each ancestral node, and between ancestral nodes, are added up. Parsimony methods can be run by choosing Phylogeny --> RFLP/RAPD/AFLP etc. Parsimony.

Several parsimony methods are available, as indicated in the menu:

In this example, Joe Felsenstein's Polymorphism parsimony method has been chosen. This method is worthy of some mention. For most molecular marker systems, Dollop parsimony is usually the most realistic choice because it assumes that the probability of a 1 -> 0 mutation (loss of a band) is much greater than a 0 -> 1 mutation (gain of a band.) However, this model may be overly pessimistic, simply due to the fact that, if polymorphism for a given site exists, in a population (ie. both 1 AND 0 are present), a population containing both alleles might be sampled, and erroneously assigned 0, even though 1 is present in other individuals that were not sampled. Polymorphism parsimony tries to take this possibility into account. If most nodes on a branch contain 0s, but at least one 1 node is implied, the polymorphism parsimony option of dollop will assume that the presence of the 1 is due to polymorphism in the population, rather than a rare forward mutation.

For the same input data, dollop will produce the following output files:

manitoba.polymorphism.outfile
manitoba.polymorphism.treefile

Make sure to save your files using these names for the next example.

4. Maximum Likelihood

Maximum likelihood methods are considered the most robust method of phylogenetic inference, and are demonstrably more accurate than distance or parsimony methods. However, the increased accuracy comes at a price of computational time. As the number of isolates or species increases, the time required increases exponentially.

For this reason, it is often useful to construct trees using a method such as parsimony, and then read the parsimony tree into a maximum likehood program to refine the tree, calculate branch lengths and assign probability estimates to branches on the tree.

Although the dataset in this tutorial is small enough to run maximum likelihood directly, we will illustrate how to run DNAML on an already-existing parsimony tree, using the treefile from the previous section.

Select the same set of isolates, and choose Phylogeny --> RFLP/RAPD/AFLP etc. Max. Likelihood. Select "Evaluate user-supplied tree", and choose manitoba.polymorphism.treefile as the User tree filename.

Menu items are similar to those in DNAML, but several items are worthy of note.

When a data set is monomorphic at a given site (ie. all 0 or all 1) is it because the population is monomorphic, or is it that we simply haven't sampled enough individuals to detect polymorphism? In a population monomorphic for absence of a band (ie. all 0), probability of seeing a band is zero. In a population containing 1s, if we take a sample that contains all 0's, the probabilty of 1 is most definitely not 0. We only got all 0's due to how the population was sampled. For this reason, restml defaults to No, for All sites detected'. For situations in which some sites may be polymorphic, but are not reflected in the data, however, choose 'Yes', all sites detected.

Caveat: I think I got this right, but it's a subtle point. See $doc/Phylip/restml.html for further explaination. Again, since we're using RAPD data, remember to set SITE LENGTH to 20.

Sample output files:

manitoba.restml.outfile
manitoba.restml.treefile

WARNING
Maximum likelihood methods are very slow, because they attempt to consider an enormous number of possible trees. The time required increases exponentially with the number of sequences. Therefore doubling the number of sequences does not double the execution time. In fact for restml, the time required increases roughly with the 4th power of the number of the sequences! For most practical purposes, direct tree construction with greater than 30 sequences requires prohibitive amounts of time.

Since bootstrapping multiplies the time required often by a factor of 100 or more, we usually don't have the luxury of bootstrapping with maximum likelihood methods. However, as illustrated above, we can bootstrap with a less time consuming method such ase parsimony, and then build the final tree with Maximum Likelihood.