TUTORIAL: Comparing genomes using dotplots

Dec. 13, 2016


Last - Genome Scale Sequence Comparisons

Rationale: Genomes contain enormous amounts of evolutionary information, which tells us not only about the evolutionary histories of species, but also about genome organization and structure. For this reason, we need robust graphical tools that create views of the data that bring out the important features of genomes. In this tutorial, we will compare two yeast genomes in the genus Saccharomyces, using baker's yeast, S. cerevisiae, as the reference genome.

Goal: To discover major genome rearrangements in a pairwise comparison of two complete genomes.

This tutorial continues from the previous tutorial, Finding and retrieving complete eukaryotic genomes.

1. Comparison of S. cerevisiae and S. arboricola genomes

The Last package contains a number of tools for comparative genomics. To create a dotplot comparing the two species, we first need to do a sequence alignment. In Last, this is done in two steps: first, create a database from the reference sequence for use in later steps; second, align S. arboricola with S. cerevisiae, using that database.

When working with large files, it's best to get an idea of how long various steps will take. The Unix time command can precede any other command, and will print out a report of the time used in executing that command. To make the database, type

{mars:/home/birch/BIRCH/tutorials/bioLegato/getgenome}time lastdb -cR01 Scer GCF_000146045.2_R64_genomic.fna

real    0m21.387s
user    0m20.282s
sys    0m0.383s

The database will be written to a number of files with the base name Scer:

-rw-rw-r-- 1 birch birch  7789492 Dec 11 12:23 Scer.bck
-rw-rw-r-- 1 birch birch      188 Dec 11 12:23 Scer.des
-rw-rw-r-- 1 birch birch      468 Dec 11 12:23 Scer.prj
-rw-rw-r-- 1 birch birch       72 Dec 11 12:23 Scer.sds
-rw-rw-r-- 1 birch birch       72 Dec 11 12:23 Scer.ssp
-rw-rw-r-- 1 birch birch 46626752 Dec 11 12:23 Scer.suf
-rw-rw-r-- 1 birch birch 12157123 Dec 11 12:23 Scer.tis

The output of the time command tells us the time elapsed during the execution of the program (real), the CPU time used by the program itself (user) and the CPU time spent by the system on the program's behalf (sys).
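To compare runs, it can help to turn these figures into plain seconds. The following is a minimal sketch: the function name to_seconds is made up for illustration, and the three values are copied from the lastdb report above. A ratio of CPU time to elapsed time close to 1 suggests that the program kept a single CPU busy for almost the whole run.

```shell
# Hypothetical helper: convert a duration such as "1m33.753s" into seconds.
# The field separator splits on "m" and "s", leaving minutes in $1 and seconds in $2.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Values copied from the lastdb run shown above.
real=$(to_seconds 0m21.387s)
user=$(to_seconds 0m20.282s)
sys=$(to_seconds 0m0.383s)

# Ratio of CPU time (user + sys) to elapsed time (real).
awk -v r="$real" -v u="$user" -v s="$sys" \
    'BEGIN { printf "CPU time / elapsed time = %.2f\n", (u + s) / r }'
```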

To create the alignment, type

time lastal Scer GCF_000292725.1_SacArb1.0_genomic.fna > ScerSarb.maf

real    1m33.753s
user    1m33.213s
sys    0m0.121s

This run takes about 1.5 min. on our system (96 GB RAM, 64 CPUs).
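Before plotting, it can be worth checking how many alignments were written. The following is a minimal sketch, assuming the output is in MAF format, in which each alignment block begins with a line starting with "a". The function name count_alignments and the file toy.maf are made up for illustration, and the toy records are not real lastal output.

```shell
# Hypothetical helper: count alignment blocks in a MAF file.
# In MAF, every alignment block starts with a line beginning with "a".
count_alignments() {
    grep -c '^a' "$1"
}

# Toy two-block MAF file, invented for illustration only.
cat > toy.maf <<'EOF'
# toy example, not real lastal output
a score=27
s chrA 0 8 + 100 ACGTACGT
s chrB 5 8 + 200 ACGTACGT

a score=31
s chrA 20 6 + 100 GGCCTT
s chrB 40 6 + 200 GGCCTT
EOF

count_alignments toy.maf
```

On the real data, the same function would be called as count_alignments ScerSarb.maf.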

Finally, to see the results in a dotplot, type

time last-dotplot ScerSarb.maf ScerSarb.png

last-dotplot: reading alignments...
last-dotplot: done
last-dotplot: choosing bp per pixel...
last-dotplot: bp per pixel = 12951
last-dotplot: processing alignments...
last-dotplot: done

real    0m6.534s
user    0m6.082s
sys    0m0.317s

You can double-click on ScerSarb.png to view the output.

The X-axis sequence at the top is the S. cerevisiae genome, and the Y-axis sequence at the left is the S. arboricola genome. There are several points to note:
It is important to keep in mind that the plot is drawn at an extremely low level of resolution in order to view 12 Mb in a single image. In this run, last-dotplot chose 12951 bp per pixel, so what seem like minor diagonals in the plot actually span many thousands of base pairs.
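The scale can be made concrete with a little arithmetic. This is a minimal sketch: the function name pixels_to_bp is made up for illustration, and the bp-per-pixel value is the one reported by last-dotplot in the run above. A different pair of genomes or a different image size would give a different value.

```shell
bp_per_pixel=12951   # value reported by last-dotplot in the run above

# Hypothetical helper: genomic span, in bp, of a feature N pixels long.
pixels_to_bp() {
    echo $(( $1 * bp_per_pixel ))
}

pixels_to_bp 1     # span of a single pixel
pixels_to_bp 10    # span of a short diagonal 10 pixels long
```

Even a diagonal only 10 pixels long therefore corresponds to roughly 130 kb of aligned sequence.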