Assignment 3 (Nov. 8, 2018)


This assignment is worth 20% of the course grade.  

Due: Tuesday November 20, 2018.


One of the assumptions underlying phylogenetic analysis of a gene is that the gene evolves as a single unit, and that all parts of the gene evolve uniformly. Evolutionary processes such as exon shuffling, unequal crossing over and gene conversion can invalidate these assumptions. For example, in gene conversion between two copies of a gene, the net effect is that one copy of the gene overwrites the other. Gene conversion* can replace an entire copy of a gene with another, or just part of a gene with sequence from a different copy. If the two original copies of the gene diverged from a single ancestral copy in the distant past, each copy would have a distinct phylogenetic tree. As a consequence, a chimeric gene resulting from gene conversion would give you different phylogenetic trees, depending on which part of the gene you looked at.

*If you are not already familiar with gene conversion, see Forsdyke Evolution Academy 01-53 Gene Conversion. Not required.

The problem

The file priglobin.fasta is a FASTA file containing 10 gamma globin genes: four human and two each from chimp, gorilla and orangutan. These sequences have been aligned to maximize similarity. The corresponding GenBank entries (5 GenBank entries containing 2 genes each) are contained in priglobin.gen. The genes in priglobin.fasta have been extracted from these larger GenBank entries. Examination of the GenBank entries will show that in primates, gamma globin genes are found in two tandem copies.

Note: In priglobin.fasta, HumanA1 and HumanA2 are tandem copies of the SAME locus as HumanB1 and HumanB2. That is, A1A2 is a haplotype of the same locus as B1B2. In the GenBank file, the entry for the A1A2 haplotype is HUMGAMGLOA, and the entry for the B1B2 haplotype is HUMGAMGLOB.

The question you need to address is: Is it valid to construct a phylogenetic tree using the sequences in  priglobin.fasta  as a single unit, or do different parts of the gene have distinct evolutionary histories?

1. (5 points) Annotate the alignment, showing the locations of important parts of the alignment, particularly promoters, exons and introns.

a) Read priglobin.fasta  into blnalign. Run Alignment --> Reform to generate a view of the alignment. Print 100 nucleotides per line, and make sure that conserved sites are printed as dots (.), and gaps as dashes (-). Save this file as priglobin.reform.

b) Read priglobin.reform into LibreOffice Writer. Format the sequence to fit a standard letter-sized page as follows: Select all and change the font to a fixed font, such as Liberation mono, 8 point. You will probably need to adjust top, bottom, left and right margins to fit the alignment.

c) Using the GenBank file as a guide, annotate the alignment for features such as exons and introns. It is critical to realize that each GenBank entry contains two tandem copies of each gene. As well, your alignment contains gaps. For this reason, the coordinates in the alignment will not correspond exactly with the coordinates found in the GenBank Features Table. However, if you use the Chimp1 sequence as a reference sequence, it should be straightforward to find the beginnings and ends of features such as exons and introns by inspection.
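The effect of gaps on coordinates can be illustrated with a short Python sketch. It is not part of the course software, and the function name ungapped_to_column is hypothetical. It assumes that you have the aligned reference sequence (for example Chimp1) as a string with gaps written as dashes, and that the position you supply is counted from the start of the extracted gene rather than from the start of the full GenBank entry.

```python
def ungapped_to_column(aligned_ref, pos, gap_char="-"):
    """Return the 1-based alignment column holding the pos-th
    (1-based) non-gap residue of an aligned reference sequence."""
    if pos < 1:
        raise ValueError("pos must be 1 or greater")
    seen = 0  # number of non-gap residues passed so far
    for column, char in enumerate(aligned_ref, start=1):
        if char != gap_char:
            seen += 1
            if seen == pos:
                return column
    raise ValueError("pos is beyond the end of the ungapped sequence")


# Toy aligned sequence, not taken from priglobin.fasta.
toy_reference = "AC--GT-A"
print(ungapped_to_column(toy_reference, 3))  # the G sits in column 5
print(ungapped_to_column(toy_reference, 5))  # the final A sits in column 8
```

Every gap to the left of a feature pushes its alignment column one position to the right, which is why boundaries still need to be confirmed by inspection.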

Add lines to your alignment document showing the start and stop points for each feature. For an example of annotation, see labeling.odt. Save your document as priglobin.odt. Be creative in how you annotate, but don't waste a lot of time on this step. The main point is to have a well-annotated copy of the alignment to guide decisions on further analysis steps. Although you can look at your document on the screen, it will probably be most useful to print a copy.

d) Although the reform output is enough to assess polymorphism in different regions of the gene, Jalview has several features that make it easy to visualize the conserved vs. divergent regions. With your alignment in blnalign, launch Jalview using Alignment --> Jalview.

First, choose Select --> Select All. Next, choose Color --> Nucleotide colour scheme. This colours the conserved bases and leaves rare bases with a white background, so they stick out prominently. Also choose Format --> Show non-conserved. This will show ONLY those nucleotides that differ from the consensus, which helps bring out the polymorphism in the alignment. Next, open the View --> Overview window. This gives a coloured, low-resolution view of the entire alignment. There is a red box around the region shown in the alignment editor; if you drag the red box left or right, the alignment scrolls with it. This is a really great way to get a feeling for the conservation in an alignment.

2. (5 points) Based on the results above, create files containing specific regions of the alignment to be used to test the hypothesis that the different parts of these genes evolved independently.

The goal, in essence, is to find out whether different parts of the alignment give different phylogenetic trees. Therefore, it will be necessary to build trees using different parts of the alignment. Two considerations would include:
It is up to you to decide which regions of the alignment are best to use. Explain your choices.

Next, create FASTA files for each region you wish to analyze. Specific regions of the alignment can be extracted using readseq. For example, if you wanted to extract the part of the alignment from 500 to 1000, use the following command to send output to a file called priglobin500-1000.fasta:
readseq -extract=500..1000 -f fasta -o priglobin500-1000.fasta priglobin.fasta
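As a cross-check on the extracted files, the same column slicing can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a replacement for readseq: the function names read_fasta and extract_columns are hypothetical, and the sketch assumes that the range is 1-based and inclusive and that every sequence in the file is already aligned to the same length.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def extract_columns(records, start, end):
    """Keep alignment columns start..end (1-based, inclusive)."""
    if start < 1 or end < start:
        raise ValueError("need 1 <= start <= end")
    return [(name, seq[start - 1:end]) for name, seq in records]


# Toy alignment, not taken from priglobin.fasta.
toy = ">seq1\nACGT\nAC\n>seq2\nTT--GG\n"
for name, seq in extract_columns(read_fasta(toy), 2, 4):
    print(">" + name)
    print(seq)
```

Because the slice is taken on alignment columns, gaps inside the range are kept, so every extracted sequence has the same length.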

3. (5 points) Test the hypothesis by constructing maximum likelihood trees.

In the tutorial entitled Phylogenetic Analysis Using Parsimony and Maximum Likelihood, we saw a good compromise between the speed of parsimony and the rigor of Maximum Likelihood.

a) Construct a maximum likelihood tree of the entire alignment. Run DNAPARS to generate a tree topology. Keep in mind that the branch lengths on the consensus tree are bootstrap replicate numbers, not real branch lengths. Consequently, it is still necessary to save the consensus tree from the bootstrapping step, and then run DNAML using the bootstrap consensus tree as a User Tree, to generate a final tree with branch lengths.
Create image files as described in tree_images.html.

b) Repeat the process for each region of the alignment which you extracted in step 2 above.

c) Compare the trees from different regions of the sequence. Probably the best criterion for comparing trees from different regions is the bootstrap replicate numbers from DNAPARS. If a region gives a consistent tree across all bootstrapped replicates, it probably has evolved as a coherent unit over time. If few branches on the tree are consistently replicated, the region probably has a more complex evolutionary history. Also, if branch lengths are long in one region and short in another, it is evidence that different mutation rates occur between the two regions. (This assumes that regions of roughly equal numbers of informative positions are compared.)
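To judge whether two regions have roughly equal numbers of informative positions, you could count them directly. The Python sketch below is one possible way to do this and is not part of the course software. The function name count_informative_sites is hypothetical. It assumes the usual parsimony definition (a column is informative when at least two different bases each occur in at least two sequences) and ignores gaps and ambiguity codes.

```python
from collections import Counter


def count_informative_sites(seqs, valid="ACGT"):
    """Count parsimony-informative columns in a list of aligned sequences."""
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("all sequences must have the same aligned length")
    informative = 0
    for column in zip(*seqs):
        # Tally only unambiguous bases; gaps and other symbols are skipped.
        counts = Counter(c.upper() for c in column if c.upper() in valid)
        # Informative: at least two bases that each appear two or more times.
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            informative += 1
    return informative


# Toy alignment, not taken from priglobin.fasta.
toy = ["AAGA",
       "AAGC",
       "ATCA",
       "ATC-"]
print(count_informative_sites(toy))  # columns 2 and 3 qualify, so 2
```

A region with very few informative columns will tend to give poorly supported branches regardless of its evolutionary history, which is why the comparison should be made between regions with similar counts.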

d) Evaluate how consistent the trees are between different regions of the gene using the Phylip TREEDIST program. In bltree, import all your treefiles using File --> Import Treefile, and select the trees you wish to compare. Next, choose Evaluate --> Treedist. The output should be saved in a file called priglobin.treedist.
Present your TREEDIST results as shown in the example in sample_tree.html.
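To help interpret the numbers, the sketch below illustrates the symmetric difference, one of the distances TREEDIST can report. It counts the internal branches (splits of the taxa into two groups) that are present in one tree but not in the other. This is a simplified Python illustration and not the PHYLIP implementation: trees are written as nested tuples rather than read from treefiles, branch lengths are ignored, and all function names are hypothetical.

```python
def leaves(tree):
    """Return the set of taxon names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    result = frozenset()
    for child in tree:
        result |= leaves(child)
    return result


def splits(tree):
    """Return the non-trivial splits of a tree, treated as unrooted.
    Each split is stored as the side that lacks the anchor taxon."""
    all_taxa = leaves(tree)
    anchor = min(all_taxa)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset()
        for child in node:
            clade |= visit(child)
        side = all_taxa - clade if anchor in clade else clade
        # Skip splits that separate fewer than two taxa from the rest.
        if 2 <= len(side) <= len(all_taxa) - 2:
            found.add(side)
        return clade

    visit(tree)
    return found


def symmetric_difference(tree_a, tree_b):
    """Number of splits found in exactly one of the two trees."""
    if leaves(tree_a) != leaves(tree_b):
        raise ValueError("trees must contain the same taxa")
    return len(splits(tree_a) ^ splits(tree_b))


# Toy trees with placeholder taxon names.
tree_a = (("A", "B"), ("C", ("D", "E")))
tree_b = (("A", "C"), ("B", ("D", "E")))
tree_c = ("A", ("B", ("C", ("D", "E"))))  # tree_a drawn with a different root
print(symmetric_difference(tree_a, tree_b))  # 2
print(symmetric_difference(tree_a, tree_c))  # 0
```

A distance of 0 means the two topologies are identical when treated as unrooted trees. Larger values mean that more groupings are supported by one region of the alignment but not by the other.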

4. (3 points) Conclusions - What have you learned?

Give a brief summary of your conclusions from the data, but be sure to address the following:

5. (2 points) Presentation.

Part of the grade will be determined by the quality of your web page(s) for the assignment, including:

How to get started

1. Create a directory called either public_html/PLNT4610/as3 or public_html/PLNT7690/as3. Make this directory world-readable and world-searchable.

2. Do  all work in the as3 directory. That way, all your files will already be where they need to be.

What you need to complete your assignment

Your report should include links to the following:

How to post it

1. Create a new HTML file called as3/as3.html. Your web page for Assignment 3 should take the form of a report that makes it easy to figure out what you did.

2. Make all files in the as3 directory world-readable. (chmod a+r  *)

3. Edit either PLNT4610.html or PLNT7690.html to include a link to as3/as3.html.

4. In the Firefox or SeaMonkey browser, go to your home page and follow all hypertext links to your assignment, and test all links to your output files.

5. If you paste excerpts of output into a web page, change the output section to a fixed font such as Courier, or set the style to "Preformat". The output from most sequence programs assumes that each character takes up an equal amount of width, which is not true for proportional fonts such as Helvetica or Times.

Academic integrity: Your work is assumed to be your own original work. All University policies regarding academic integrity apply.

On the day the assignments are due, I should be able to just go to each person's web site and find the output. You don't need to send me an email message saying that your assignment is complete. If you choose not to hand in this assignment, you don't need to do anything.