Rationale: Genomes contain
enormous amounts of evolutionary information, which tell us not
only about evolutionary histories of species, but also about
genome organization and structure. For this reason, we need robust
graphical tools that create views of the data bring out the
important features of genomes. In this tutorial, we will compare a
yeast genomes in the genus Saccharomyces, using the baker's yeast,
S. cereviseae as the reference genome.
To discover major
in a pairwise
comparison of two complete genomes.
1. Comparison of S. cereviseae and S.
The Last package
contains a number of tools for comparative genomics. To
create a dotplot comparing the two species, we first need to
do a sequence alignment. In Last, this is done in two setps:
first create a database from the reference sequence for use
in later steps, and secondly, to align S. arboricola with S.
cereveseae, using that database.
When working with large files, it's best to get an idea of
how long various steps will take. The Unix time command can
preceed any other command, and will print out a report of
the time used in executing the command. To make the
lastdb -cR01 Scer GCF_000146045.2_R64_genomic.fna real 0m21.387s user 0m20.282s sys 0m0.383s
The database will be written to a number of files with the
base name Scer:
-rw-rw-r-- 1 birch
birch 7789492 Dec 11 12:23 Scer.bck -rw-rw-r-- 1 birch birch
188 Dec 11 12:23 Scer.des -rw-rw-r-- 1 birch birch
468 Dec 11 12:23 Scer.prj -rw-rw-r-- 1 birch
birch 72 Dec 11 12:23
Scer.sds -rw-rw-r-- 1 birch
birch 72 Dec 11 12:23
Scer.ssp -rw-rw-r-- 1 birch birch 46626752 Dec 11 12:23 Scer.suf -rw-rw-r-- 1 birch birch 12157123 Dec 11 12:23 Scer.tis
The output of the time command tells us the time elapsed
during the execution of the program (real), the CPU time
used by the program (user) and the time required for system
To create the alignment, type
time lastal Scer
GCF_000292725.1_SacArb1.0_genomic.fna > ScerSarb.maf
real 1m33.753s user 1m33.213s sys 0m0.121s
This run takes about 1.5 min. on our system (96 Gb RAM, 64
Finally, to see the results in a dotplot,
last-dotplot ScerSarb.maf ScerSarb.png
alignments... last-dotplot: done last-dotplot: choosing bp per pixel... last-dotplot: bp per pixel = 12951 last-dotplot: processing alignments... last-dotplot: done real 0m6.534s user 0m6.082s sys 0m0.317s
You can double-click on ScerSarb.png to view the output.
The X-axis sequence at top is the S. cer. genome, and the
Y-axis at the left is S. arb. There are several points to
Each chromosome in
each genome appears in the order in which they are found in
the input file eg. for S. cer, they are in the roman numeral
order I - XVIII plus mitochondrial at the end. However, not
all sequence names were able to fit in the output. However,
you will see thin grey lines in the matrix plot that delineate
the extents of each chromosome. For publication purposes, one
could edit the genome files to say something like ScerI,
ScerII etc, rather than the Accession numbers. With the
default output, unfortunately, if you want to know which
chromosome you are looking at, you need to count from left to
right or top to bottom.
representing matches between two chromosomes on the forward
strand go top left to bottom right, and are colored red.
Diagonals representing matches between two chromosomes on the
reverse strands go bottom left to top right, and are shown in
The plot tells us
several things of importance:
The two genomes are substantially colinear, as indicated by
the fact that the main diagonal spans all chromosomes.
A reciprocal translocation between Chromosomes IV and XIII is
indicated by offset diagonals and gaps in the main diagonal.
It is important to keep
in mind that the scale of the plots is at an extremely low
level of resolution in order to view 12 Mb in a single
image. What seem like minor diagonals in the plot actually
span thousands of base pairs.