TUTORIAL: PHYLOGENETIC ANALYSIS DISCRETE STATE DATA



PHYLIP Main documentation: $doc/Phylip/main.html
PHYLIP Discrete state data methods: $doc/Phylip/discrete.html
PHYLIP RESTDIST (distance matrices from discrete state data): $doc/Phylip/restdist.html
PHYLIP Bootstrapping: $doc/Phylip/seqboot.html
PHYLIP Consensus trees: $doc/Phylip/consense.html
PHYLIP Dollo and polymorphism parsimony: $doc/Phylip/dollop.html
PHYLIP Camin-Sokal and Wagner parsimony: $doc/Phylip/mix.html
PHYLIP RESTML - Maximum Lilelihood: $doc/Phylip/restml.html

This tutorial assumes that you have already gone through the DNA phylogeny tutorials at http://home.cc.umanitoba.ca/~psgendb/tutorials/GDE/gde.html

Example: RAPD Analysis of green foxtail populations in Manitoba

1. The dataset

Data taken from: Stilkowski, Gayle (1997) Inheritance of fenoxaprop-P-ethyl and trifuluralin resistance, and genetic variability among green foxtail (Setaria viridis (L.) Beauv.) populations. M.Sc. Thesis, Department of Plant Science, University of Manitoba.

Green foxtail is a  grassy weed that is commonly found in cultivated fields in the Canadian prairies. Foxtail is usually controlled in cereal crops by inhibitors of acetyl-CoA carboxylase (ACCase) such as fenoxaprop-P-ethyl or by trifuluralin. However, accessions of foxtail have been identified which are resistant to one or both types of herbicide. The map below indicates the locations of sites in the province of Manitoba where herbicide resistant green foxtail populations have been found.

Thirteen accessions of green foxtail were tested for resistance to ACCase inhibitor and trifluralin. These accessions were also subjected to RAPD analysis, in which polymorphism was scored at 42 loci. The data are represented in Phylip format in the file manitoba.phyl.
 
Note: Phylip format for discrete data
  • Discrete state data are represented as follows: 0 - absence of a band; 1 - presence of a band; ? - data not available. 
  • The first line lists the number of sequences (eg. species, lines, isolates, populations etc.) followed by the number of markers scored for each sequence.
  • The name of each sequence is listed in columns 1-10. The sequence begins in column 11. If the name is less than 10 characters, it must be padded by blanks so that the sequence begins in column 11. Do NOT use TAB characters in columns 1-10, which will cause GDE to crash when reading in the data.
  • If the sequences are too long to be written in one line, use either Phylip sequential or Phylip interleved format, as described in  $doc/Phylip/main.html.

Create a directory called foxtail, and save manitoba.phyl  to this directory. Start mGDE by typing

mgde

To read this file into mGDE, choose File -> Import Phylip Discrete Data.


The sequences will be read into the GDE window.

2. Distance methods

For distance methods, GDE calls restdist  which creates a distance matrix from the selected sequences. This makes it possible to use the same distance programs used for DNA sequence analysis. Select all sequences, and then choose Plylogeny -> RFLP/RAPD/AFLP etc. Distance methods:

Although this menu is quite the same as the distance menu for DNA and protein sequences, let's look at some important menu items:

DISTANCE MODEL: SITES/REST. FRAGS. - By default, each marker represents a distinct genetic locus. This is true for most types of molecular markers. However, for RFLPs, it is often the case that a given locus may be represented by two distinct bands, representing alleles. In this case, "REST. FRAGS." should be used.

SITE LENGTH - The number of nucleotide positions assayed by a marker method. For RFLP analysis using restriction enzymes with 6-base recognition sequences, this number is 6. That is, cleavage by the enzyme requires specific nucleotides must match at 6 positions. For RAPDs using primers 10 bases long, SITE LENGTH is 20 (not 10!). This is because the primer must match 10 bases on each side of the fragment to be amplified. For AFLPs, the number of selective bases is between 14 - 18 nt, depending on the restriction enzymes used. Obviously, the longer the site length, the lower the probability of a 0 -> 1 mutation. Put another way, it is much easier for a single site to mutate to match a 6 base restriction sequence than it is for two nearby sites to mutate to match a 10 base primer.

The output appears in windows as shown:

OUTFILE (manitoba.fitch.outfile)

TREEFILE (manitoba.fitch.treefile)

The output treefile is launched in two applications: a text editor, to make it easy to save the file, and tGDE, to make it easier to work with the tree, or to use various tree drawing programs.


ATV

Note that accession 439-86 appears to be an outgroup, very distantly related to all other accessions. This is consistent with the fact that 439-86 originated in China, and was included as a control.

3. Parsimony methods.

The goal of parsimony is to find the tree that requires the fewest base or amino acid substitutions, when mutational distances from each sequence to each ancestral node, and between ancestral nodes, are added up. Parsimony methods can be run by choosing Phylogeny --> RFLP/RAPD/AFLP etc. Parsimony.

Several parsimony methods are available, as indicated in the menu:
 
 

In this example, Joe Felsenstein's Polymorphism parsimony method has been chosen. This method is worthy of some mention. For most molecular marker systems, Dollo parsimony is usually the most realistic choice because it assumes that the probability of a 1 -> 0 mutation (loss of a band) is much greater than a 0 -> 1 mutation (gain of a band.) However, this model may be overly pessimistic, simply due to the fact that,  if polymorphism for a given site exists, in a population (ie. both 1 AND  0  are present), a population containing both alleles might be sampled, and erroneously assigned 0, even though 1 is present in other individuals that were not sampled. Polymorphism parsimony tries to take this possibility into account. If most nodes on a branch contain 0s, but at least one 1 node is implied, the polymorphism parsimony option of dollop will assume that the presence of the 1 is due to polymorphism in the population, rather than a rare forward mutation.

For the same input data, dollop will produce the following output files:

manitoba.polymorphism.outfile
manitoba.polymorphism.treefile

4. Maximum Likelihood

Maximum likelihood trees can be build from discrete data by choose Phylogeny --> RFLP/RAPD/AFLP etc.  Max. Likelihood. Menu items are similar to those in DNAML, but two items are worthy of note.

When a data set is monomorphic at a given site (ie. all 0 or all 1) is it because the population is monomorphic, or is it that we simply haven't sampled enough individuals to detect polymorphism? In a populaiton monomorphic for absence of a band (ie. all 0), probability of seeing a band is zero. In a population containing 1s, if we take a sample that contains all 0's, the probabilty of 1 is most definitely not 0. We only got all 0's due to how the population was sampled. For this reason, restml defaults to No, for 'All sites detected'. For situations in which some sites may be polymorphic, but are not reflected in the data, however, choose 'Yes', all sites detected.

Caveat: I think I got this right, but it's a subtle point. See $doc/Phylip/restml.html   for further explaination.


When branch lengths are being calculated, restml has to try a number of
different branch lengths, for each branch, to see which is the branch
length
most likely to generate the observed data. Since branch lengths are
real
numbers, in principle, an infinite number of branches would have to be
tested. In practice, restml tries a small number of possible branch
lengths,
starting with a short branch and multiplying its length by an
extrapolation
factor. A small extrapolation factor will increase the branch length
slowly,
requiring more iterations, but resulting in more precise branch
lengths.
A large extrapolation factor will require few iterations, but at the
expense
of precision. A "Speedier but Rougher Analysis" uses a larger
extrapolation factor, and therefore sacrifices thoroughness in favor of
speed.

Sample output files:

manitoba.restml.outfile
manitoba.restml.treefile
WARNING
Maximum likelihood methods are very slow, because they attempt to consider an enormous number of possible trees. The time required increases exponentially with the number of sequences. Therefore doubling the number of sequences does not double the execution time. In fact for restml, the time required increases roughly with thr 4th power of the number of the sequences! For most practical purposes, direct tree construction with greater than 30 sequences requires prohibitive amounts of time. 

Since bootstrapping multiplies the time required often by a factor of 100 or more, we usually don't have the luxury of bootstrapping with maximum likelihood methods. However, as illustrated above, we can bootstrap with a less time consuming method such ase parsimony, and then build the final tree with Maximum Likelihood.