TUTORIAL: PHYLOGENETIC ANALYSIS USING DISTANCE METHODS

June 24, 2021

Gblocks - $doc/Castresana/Gblocks_documentation.html
PHYLIP Main documentation: $doc/Phylip/main.html
PHYLIP Distance methods: $doc/Phylip/distance.html
PHYLIP DNADIST: $doc/Phylip/dnadist.html
PHYLIP PROTDIST: $doc/Phylip/protdist.html
PHYLIP FITCH: $doc/Phylip/fitch.html
PHYLIP PROTDIST: $doc/Phylip/kitsch.html
PHYLIP NEIGHBOR: $doc/Phylip/neighbor.html
Archaeopteryx: $doc/forester/ArchaeopteryxUserDocumentation.pdf

The PHYLIP programs are command line programs, but can be run by BioLegato.
The programs in the PHYLIP package are interactive programs designed to be run at the command line. BioLegato can run these programs by generating the keystrokes you would have typed to set program parameters.

Construction of a phylogeny using distance methods involves two steps:

blnalign and blpalign run DNADIST and PROTDIST, respectively, to construct a distance matrix.

The distance matrix is used to construct a phylogenetic tree, using any of a number of methods implemented in the programs FITCH, KITSCH or NEIGHBOR.

This tutorial is not an exhaustive treatment of phylogenetic analysis, which is a massive field spanning many decades. For a short introduction to the theory behind phylogenetic analysis, see the lecture on Phylogenetic Analysis from PLNT4610 Bioinformatics.

Example: Plant Type III Chitinases

The chitinases in plants are hydrolytic enzymes that degrade chitins (N-acetyl glucosamine). Although chitin does not occur in plants, in many fungi, it is a major component of the fungal cell wall. Not surprisingly, chitinases are produced in plants in response to fungi. Chitinases have been demonstrated to play an important role in plant defense responses. There are six classes of chitinases so far identified. Most known chitinases fall into the Type I and Type II classes. This exercise will work with a smaller class of genes encoding Type III chitinases.

1. The datasets

The datasets were created as follows: starting point for this dataset is a collection of chitinase III genes retrieved from a tblastn search of the RefSeq database, chitIII.gen. Coding sequences (CDS features) for the 40 highest-scoring hits were extracted to a fasta file called chitIII.CDS.fsn. These sequences were translated to proteins using RIBOSOME, and proteins saved to chitIII.pro.fsa. Amino acids were aligned using MAFFT FFT-NS-i, and the aligned proteins written to chitIII.pro.mafft.fsn. The CDS sequences were aligned to the protein alignment using pal2nal, as described in the Multiple Alignment tutorial. The DNA alignment is found chitIII.CDS.pal2nal.fsn.

In this tutorial, we will demonstrate phylogeny using both protein and DNA alignments as input.

Start by creating a directory called distance, and save chitIII.pro.mafft.fsn and chitIII.CDS.pal2nal.fsn to this directory. Open these files in blpalign and blnalign, respectively.

blpalign - protein alignment	blnalign - DNA alignment

For comparison only 10 of the 40 sequences in the alignment are shown for both proteins and DNA. The alignments have been scrolled to show a gap of two amino acids (eg. AS--LLL) in the protein, corresponding to three codons (six gap positions) in the DNA (eg. gcctca------ttactctta). The N-terminal regions of the proteins vary in length, which is why gaps are also seen in the N-terminal region of the alignment.

The other item of note is that ":CDSx" is appended to the Accession numbers of the proteins (which are actually the Accession numbers of the GenBank DNA entries from which they were translated. These CDS labels indicate which CDS from each entry was used (remember, long DNA sequences may contain many CDS features). When Ribosome translates proteins, it automatically shortens the CDSx label to _x, in part because some programs have limits to the length of sequence names. As we proceed, the _x naming convention will be used.

Why compare DNA and protein phylogenies?

Due to the degeneracy of the genetic code, DNA alignments contain evolutionary information that is lost from protein alignments. In principle then, more realistic phylogenetic trees should be obtained from DNA alignments. The problem is that the small alphabet size (n=4) of DNA, compared to the alphabet size of proteins (n=20) means that alignment of proteins, especially where multiple gaps are found. The solution as shown above is to align the proteins, and then align the DNA sequences to the protein alignment. In effect, this preserves evolutionary information that would otherwise be hidden in the protein alignment. In practice, it is a good idea to do phylogenies on both proteins and DNA sequences.

2. The problem with gappy regions

The whole point of using multiple sequence alignment as input for phylogeny programs is that over the course of evolution, insertions and deletions (indels) occur in genes. Multiple alignments insert gaps into sequences to make amino acids at homologous positions to line up.

At right we see the alignment scrolled to the C-terminal region. It is obvious that an insertion of 29 amino acids occurred in XM_024786391, which we can assume was a recent event since this insertion was not seen in any of the other sequences.

A problem arises because phylogeny programs treat each position (column) independently. That is, the gap would be treated as if it was 29 independent indels, resulting in an artificially long branch on the tree, separating XM_024786391 from all other sequences.

The other problem with gappy regions is that they are usually regions in which any multiple alignment is less reliable, because it is not always obvious where gaps should be inserted.

One common solution to this problem is to edit out ALL positions containing gaps. The downside of this approach is that indels are highly informative from an evolutionary standpoint, because a single indel, regardless of its size, can point to ancient events which instruct the deeper branches of phylogenetic trees. In other words, completely eliminating all gap positions would be throwing away useful data.

Gblocks from the Castresana lab gives us a more sophisticated way to "trim" gaps, while retaining some gap positions.
To create a de-gapped alignment, select all sequences in the protein alignment and choose Alignment --> Gblocks.

Set "Allowed Gap positions" to "Only positions with <= 50% gaps. This will retain some gap positions, as we'll see below. See the Gblocks documentation for more detailed information on fine-tuning gappy regions in multiple alignments.

Output pops up in a new blpalign window. This output was saved in chitIII.pro.maaft.gblocks.fsa.

One of the strong points of Gblocks is the HTML output, excerpted below. The first 100 positions of the alignment, out of 352 positions are shown. Conserved amino acids are highlighted.

Based on setting for conserved blocks and flanking posiitons, a subset of 283 positions (underlined in blue) were extracted from the alignment into the blpalign window at right.

Next, let's run Gblocks on the DNA alignment in blnalign.

Save this output in chitIII.CDS.pal2nal.gblocks.fsn.

(The HTML output for the DNA alignment is not shown.)

3. A quick DNA phylogeny using FITCH

For routine distance tree construction, the method of Fitch and Margoliash is the method of choice. FITCH allows for variable rates of evolution in different lineages, and iterates the tree to minimize the least squares distance across the entire tree. Although Neighbor-Joining is faster, it is also much less thorough, only considering one tree. It is probably the least rigorous method for constructing a phylogeny . Go to the DNA alignment in blnalign and choose Edit --> Select All. Next choose Phylogeny --> DNA Distance methods. Choose the Fitch-Margoliash method.

DNADIST will calculate a distance matrix, and then FITCH will run, and by default, 3 windows will appear

OUTFILE- the report on the phylogeny

TREEFILE - the machine -readable treefile. Readable by programs such as DRAWTREE, DRAWGRAM, and Archaeopteryx.

The treefile also pops up on a bltree window, allowing further tasks to be performed using the tree as input.

TREEFILE - the treefile in the Archaeopteryx tree editor.

Note: Do NOT save the contents of the Archaeopteryx window using the .treefile extension. You will overwrite the original treefile. Archaeopteryx saves files in phyloXML format with the extension ".phyloxml".

4. Phylogeny using amino acid sequences.

Since also possible to construct distance matricies for multiple alignments of amino acid sequences, the same programs (FITCH, KITSCH, and NEIGHBOR) can be used to construct distance trees. Go back to the blpalign window with the output from Gblocks. Or read chitIII.pro.maaft.gblocks.fsa into a new blpalign window.

Some of the parameters for construction of the distance matrix using PROTDIST are different from those for DNADIST. These include several different methods for constructing distance matrices, as well as a choice of alternative genetic codes, where appropriate.

Once the distance matrix is constructed, there is no difference in computation of the phylogenetic tree, so all parameters are the same as previously.

FITCH will produce an outfile (chitIII.pro.maaft.gblocks.fitch.outfile) and a treefile (chitIII.pro.maaft.gblocks.fitch.treefile) similar to those with the DNA alignment. For comparison, the treefile is shown in Archaeopteryx below:

5. Analysis of Results in Archaeopteryx

For now, we will show results using Archaeopteryx without going into how they were produced. In a later tutorial, Visualization of Phylogenetic Trees, we will explore in more depth how to work with trees in Archaeopteryx.

The trees containing only Accession numbers are missing important information, such as taxonomic classifications. In a later tutorial we will see step by step how Archaeopteryx can add these metadata to your trees. We will also show the importance of how the presentation of the trees can lead to insights into the data.

The two trees from the previous steps have been edited in Archaeopteryx as shown below. Based on the Accession numbers, Archaeopteryx has retrieved taxonomic information for each sequence. We have used Colorize Subtrees via Taxonomic Rank to assign colors to nodes based on Order. You may also notice that many of the branches have been swapped (rotated) to make the protein and DNA trees as similar as possible. (It's perfectly legitimate to rotate branches at any node, because it doesn't change the branching order.) After swapping, the remaining differences tell us which clades have different branching orders between the two trees.

protein	DNA

Sequences from plants in the Order Fabales are shown in green. These include leguminous crops such as soybean (Glycine max) common bean (Phaseolus vulgaris) and peanut (Arachis). (The fact that one Glycine soja sequence is at the top, rather than clustering with other Fabales, is an artifact of the fact that this branch is actually a trifurcation, and there was no way to force it to the bottom by swapping.) The other sequences come from woody trees or shrubs such as popular (Populus, Order Malphigales), Oak (Quercus, Order Fagales), or Apple (Malus, Order Rosales).

Like many genes in plants, the chitinases are a multigene family, so that there are numerous chitinase genes in a typical plant genome. Because this dataset was made by simply taking the best 40 BLAST hits to a chitinase III query, the dataset is missing many of the other chitIII genes known to be in the genome. However, this small dataset still tells us some things.

To a first approximation, sequences within a genus, and usually within a species, tend to cluster together. This is explained by the fact that over time, some members of a multigene family will be lost, and others duplicated. Given enough time, chitinase orthologues within a species will most likely resemble one ancestral copy of the gene. This phenomenon tends to homogenize gene homologues, so that the gene tree resembles the species tree.

The one anomaly seems to be Prosopis alba (South American mesquite, Order Fabales). In the protein tree, P. alba is on the same internal clade as peanus (Arachis). In the DNA tree, P. alba is on a deep branch by itself, although still separate from the woody plants. This could be an indication that this gene is from an ancient gene duplication event, whose orthologues may not be present in most of the other Fabales.

Part of the problem is that distance methods derive their speed through data reduction. Put bluntly, distance methods throw away most of the evolutionary information found in sequences. The next tutorial will take us through Maximum Parsimony and Maximum Likelihood, which are more computationally intensive methods, but can often clear up ambiguities of this type.

NEXT: Character Phylogeny Methods: Parsimony and Maximum Likelihood