TUTORIAL: Visualizing genomes with DNAPlotter

Sept. 12, 2018

DNAPlotter Web Site: http://www.sanger.ac.uk/science/tools/dnaplotter

In this tutorial we will use the Artemis companion program, DNAPlotter, create a genomic map of Acidianus hospitalis. Acidianus hospitalis is a member of the Archaea, and an acidothermophile which oxidizes reduced inorganic sulphuric compounds as a source of energy. One of the advantages of using DNAPlotter is that it can save a file containing the settings that govern which features are displayed and how they are displayed. The file can be used to create maps from other species or strains, making it easy to create plots that can easily be compared between genomes.

To begin the tutorial, first create a directory called tutorials/artemis/dnaplotter:

cd $HOME/tutorials/artemis
mkdir dnaplotter
cd dnaplotter

1. Save the test sequence, and read it into Artemis.

Download the following files to your dnaplotter directory:

Acidianus_hospitalis.gen

Launch artemis, either by typing 'artemis' at the command line, or from the BIRCH launcher, choose Genomics --> Artemis. If you launched from BIRCH, make sure to set the working directory to your tutorials/artemis/dnaplotter directory by choosing Options ---> Set Working Directory in artemis.

To read the file, choose File --> Open.

Set Files of Type to All Files. Next select Acidianus_hospitalis.gen. This GenBank entry contains the complete genome. Click on Open to open the file.

The artemis window should appear something like the one below.

2. Customize the features displayed and their appearance.

DNAPlotter is used by Artemis to render maps of annotated features.

To launch DNAPlotter, choose File --> Open in DNAPlotter.

The default plot shows, from the outer circle and going inwards:

coordinates and tick marks
CDS (coding sequence) features on the forward strand (aqua)
CDS (coding sequence) features on the reverse strand (aqua)
pseudo CDS features (none shown on this plot)
the source feature (grey). The source feature typically refers to the entire sequence.

Changing number and tick mark intervals

When more precision is desired on the scale line, we can change the interval for numbering and tick marks. In DNAPlotter choose Options --> Tick marks.

Set Tick interval to 100,000 and the Minor Tick Interval to 20,000 and press Set to apply the changes.

Customizing feature display using the Track Manager

Features in DNA plotter are displayed in concentric circles, referred to as 'tracks'. The Track Manager is used to tell which types of features to display on each track. You can assign color and thickness features on each track. The Track Manager will display ALL features with a given feature key. If you want to modify the display of any specific feature (eg. a single gene), use Options --> Features.

Let's start by opening the Track Manager: Options --> Track Manager

Track: the circle to display, going from outer to inner
Key: display all features with this key
Qualifier: display only features with this qualifier
Not: display only features that do NOT have this qualifier
Elipsis (...) : remove qualifier
Strand: forward, reverse or any
Size: width of the line in pixels
Position: radius at which to display the circle
Colour: brings up a colour menu for the feature
Delete: delete this track

Let's start by changing the colour of the CDS sequences displayed on the forward strand (outermost track). Click on Colour to open the colour chooser. Click on the color box to the left of the button that says "Apply Colour to All".
The colour chooser has five different methods for specifying a color. For now, we'll use the Swatches method. In this example, click on one of the pink boxes in the center of the top line. You'll notice that when the mouse hovers over a color, the RGB code for that colour appears (eg. 255,204,204). Click on OK to continue.
You MUST click on Apply Colour to All, at this point, for your colour change to take effect. If you fail to do so, the default colour of White will be used, and it will look as if your circle has disappeared! (This little quirk took me a long time to figure out.) Next, click on Close to return to the track manager.
About the "Use colour qualifier" box. Some genome annotation software an generate annotation files that include colour codes for each feature. If the use colour qualifier box is checked, DNAPlotter will use those colours. An example of such a file is S_typhi.tab, used in the Hinxton Artemis tutorial.

The DNAPlotter window should now look like this:

Having different colours for the two strands make it easier to distinguish features on each strand.

Let's complete the plot, including gene features, including CDS, pseudo genes, tRNAs and rRNAs, as well as repeat regions. (No pseudo genes are annotated in this file.) To illustrate another way of making it easy to distinguish between features on two strands, where only a few features are found in the genome, we will add two gray lines as separators.
You will need to add several tracks:

Track 4 - tRNA, forward strand, colour: blue
Track 5 - tRNA, reverse strand, colour: blue
Track 6 - rRNA, forward strand, colour: green
Track 7 - rRNA, reverse strand, colour: green

Tracks 8 and 9 - We will create a gray line between the forward and reverse strands for both tRNA and rRNA features. To do this, we take advantage of the fact that all GenBank entries contain a source feature, that encompasses the entire sequence. For Track 8, set the feature key to 'source', both strands, size=3, and Position= 0.673. This will create a 3 pixel line between the two rRNA features which are at positions 0.7 and 0.65. For Track 9, set the feature key to 'source', both strands, size=3, and Position= 0.775. This will create a 3 pixel line between the two rRNA features which are at positions 0.8 and 0.75.

Track 10 - repeat_region, forward and reverse strands, colour: red

Click on "Update Tracks" to make your changes take effect. The map should now appear as shown at right.

Finally, let's add plots of GC plots. In DNAPlotter choose Graph --> GC plot --> Draw, and Graph --> GC skew --> Draw.

GC plot - tells the fraction of bases that are either G or C within a sliding window, plotted between the minimum and maximum values.

GC skew - tells the degree to which the GC content is skewed toward G or skewed toward C, plotted between teh minimum and maximum values. It is calculated by the formula GC skew = (G - C)/(G + C). (To clarify, G - C is a subtraction. This doesn't refer to GC dinucleotides.) When (G-C) is greater than 0, the bias is toward G. When (G-C) is negative, the bias is toward C.

The final plot should look similar to that shown at right.

At this point, it would be a good idea to save the settings for this map in a file, so that the same settings can be applied to each track in the display of other genomes. In DNAPlotter, choose File --> Export Track Template, and save as Acidianus_hospitalis.tracks.

One of the most obvious features that we see in the graph is the spike in GC content in the immediate vicinity of the rRNA genes, shown in green on track 6. At this level of resolution, we can't be sure that the spike perfectly coincides with the rRNA genes. To verify that hypothesis, go back to the Artemis window and go to the neighborhood of the spike at nucleotide 1,300,000: Goto --> Navigator --> Goto Base: 1300000. Next, display the GC plot by choosing Graph --> GC content (%). To more easily distinguish the plot from the average line, move the mouse into the graph, hold down the right mouse button and choose Configure. Click on the Color box to set the color of the plot to red, and click OK.

Artemis should look something like this:

To see which genes are under the long peak in GC content, click on the genes marked Ahos_r002 and Ahos_r003. The gene annotation will be highlighted in the feature box at the bottom. We have confirmed that both of these features correspond to gene features at the coordinates of these two ribosomal RNA genes.

3. What have we learned?

Virtually the entire A. hospitalis genome is transcribed. Non-coding regions between genes are extremely small.
There is no strand bias for protein coding genes. Genes are transcribed in both directions throughout the entire genome. There are some regions (eg. 330,000) at which there is a local bias toward transcription from the forward strand. These are small enough that they could be solely due to random variation in the orientation of genes.
The vast majority of tRNA genes are found between around 500,000 and 1,400,000. Ribosomal genes are also found in this region.
The rRNA genes show up as a peak of high GC content. This should not be surprising, since rRNAs have a complex secondary structure, comprised of numerous stem-loop hairpins. RNA duplexes typicaly are stabilized by G-C base pairs.

A word of caution about genome annotations.

Not everything in a genome is annotated. The features that you see in a genome are limited to those that the authors or annotation pipeline chose to annotate. For example, some projects will annotate repeat regions or transposable elements. Others might annotate nothing except CDS regions. The apparent absence of a feature does not necessarily mean that those features are not present in the genome!!

Not everything that is annotated is annotated correctly. Genome annotation is at best an imperfect science. Although most annotations are based in some way on sequence similarity to known genes or features, there are many ways to get both false negatives and false positives. As well, the exact start and stop positions of genome features are not always correctly assigned by automated software. Very few genome features are ever verified by experiments.