Because of its breadth MUMmer can, at first glance, be an overwhelming sea of scripts and subroutines. This document attempts to walk the user through some of the more useful modules of the package, and provides example data and expected outputs to assure the correct and productive operation of MUMmer. All example data is real DNA sequence from various eukaryotic and prokaryotic organisms, and can be found in its entirety in the data directory. Although the input sequences are only subsections of their respective genomes, they have been carefully selected to permit speedy and informative walk-throughs. It is not necessary to download all of the data at once, as each subsection will have separate links to the relevant files.
For further information regarding any of the MUMmer programs or their output formats, please refer to the online MUMmer manual.
MapView is a utility script for displaying sequence alignments as provided by NUCmer or PROmer. It takes the output from show-coords and converts it to a FIG, PDF or PS image file. By default, it produces FIG files, which can be viewed with the common system utility xfig or converted to PDF or PS with the fig2dev utility (neither program is included with MUMmer). mapview is useful for mapping multiple query contigs (e.g. from a draft sequencing project) against an annotated reference sequence. Exons and other features can also be plotted with the NUCmer or PROmer alignments, aiding in exon refinement and analysis. Individual MUMmer hits are plotted according to their percent identity, making regions of high or low similarity easily distinguishable.
In the following sections, a short example is given that demonstrates how to use mapview. Since nucmer and promer have a nearly identical user interface, the alignments for this example will be generated using promer. This example aligns a few query sequences to a single reference sequence using promer, and then uses mapview to plot the resulting areas of conservation and the reference sequence annotation.
D_melanogaster_2Rslice.cds
D_melanogaster_2Rslice.fasta
D_melanogaster_2Rslice.utr
D_pseudoobscura_contigs.fasta
Please complete the PROmer walk-through in order to generate the alignment between the Drosophila melanogaster chromosome 2R segment and the 2 contigs from Drosophila pseudoobscura. The PROmer walk-through will generate the .coords file that is necessary to continue with the rest of this tutorial. If you are already familiar with the promer alignment script, simply continue this tutorial using the supplied promer.coords file. Note that when generating the .coords file with show-coords it is important to use the -l -r options (and optionally the -k option) in order to generate the proper input format for mapview.
The output of show-coords is then used by MapView to create a FIG, PDF or PS file.
mapview -n 1 -p mapview promer.coords
The -n option is used to set the number of output files to 1. By default, MapView partitions its output among 10 files in order to keep the figures for large comparisons small. Since we are only comparing a small slice of the actual chromosome, only 1 file will be needed. The output of this command will be a single file named mapview_0.fig. A more informative plot can be generated by supplying a UTR and CDS coordinate file in GFF format. These files contain annotation information that will be plotted alongside the PROmer alignments, thus making it possible to compare the conserved regions with annotated exon positions.
mapview -n 1 -p mapview promer.coords D_melanogaster_2Rslice.utr D_melanogaster_2Rslice.cds
This will generate a single file, mapview_0.fig, that will have the annotation information displayed above the blue reference rectangle. Below, you can see this file displayed with the xfig viewer. The only difference between this file and the file produced without the UTR and CDS files is the set of annotation rectangles above the blue rectangle at the very top of the figure.
In order to generate PDF output, use the same command plus the -f pdf option.
mapview -n 1 -f pdf -p mapview promer.coords D_melanogaster_2Rslice.utr D_melanogaster_2Rslice.cds
This will generate the same image, mapview_0.pdf, but in PDF format.
The above MapView FIG shows a 220 kbp slice of D. melanogaster chromosome 2R and its alignment to D. pseudoobscura. The alignment, generated by PROmer, shows all regions of conserved amino acid sequence. The blue rectangle spanning the figure represents the reference (D. melanogaster), with annotated genes shown above it and the PROmer alignments shown below it. Alternative splice variants of the same gene are stacked vertically. Exons are shown as boxes, with intervening introns connecting them. The 5' and 3' UTRs are colored pink and blue to indicate the gene's direction of translation. PROmer matches are shown twice: once just below the reference genome, where all matches are collapsed into red boxes, and once in a larger display showing the separate matches within each contig, where the contigs are colored differently to indicate contig boundaries. The vertical position of the matches indicates their percent identity, ranging from 50% at the bottom of the display to 100% just below the red rectangles. Percent identity is that of the amino acid translations used by PROmer. Matches from the same query sequence are connected by lines of the same color.
mummer is a suffix tree algorithm designed to find maximal exact matches of some minimum length between two input sequences. The match lists produced by mummer can be used alone to generate alignment dot plots, or can be passed on to the clustering algorithms for the identification of longer non-exact regions of conservation. These match lists are versatile because they contain a large amount of information and can be passed forward to other interpretation programs for clustering, analysis, searching, etc.
In the following sections, a short example is given that demonstrates how to use mummer. This example compares a single query sequence to a single reference sequence using mummer, and then uses mummerplot to generate a dot plot representation of the comparison.
mummer can handle multiple reference and multiple query sequences; however, a dotplot of more than two sequences can be confusing, so this example deals with a single reference and a single query sequence.
mummer -mum -b -c H_pylori26695_Eslice.fasta H_pyloriJ99_Eslice.fasta > mummer.mums
This command will find all maximal unique matches (-mum) between the reference and query on both the forward and reverse strands (-b) and report all the match positions relative to the forward strand (-c). Output is to stdout, so we will redirect it into a file named mummer.mums. This file lists all of the MUMs of the default length or greater between the two input sequences.
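The match list can be read back into a program for further processing. The following is a minimal Python sketch, not part of MUMmer. It assumes the usual layout of mummer output: a header line beginning with ">" for each query sequence (with the word "Reverse" appended for reverse-strand matches), followed by lines whose last three columns are the reference position, the query position and the match length. The function name parse_mums is hypothetical.

```python
def parse_mums(text):
    """Parse mummer match output into a list of tuples.

    Assumed layout (not verified against every mummer version):
      > query_id            header for forward-strand matches
      > query_id Reverse    header for reverse-strand matches
      ref_pos query_pos length   one match per line (last three columns)

    Returns (query_id, strand, ref_start, query_start, length) tuples.
    """
    matches = []
    query_id, strand = None, "+"
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            query_id = fields[0] if fields else ""
            is_reverse = len(fields) > 1 and fields[-1] == "Reverse"
            strand = "-" if is_reverse else "+"
            continue
        # Take the last three columns so a leading reference name is ignored.
        ref_start, query_start, length = (int(x) for x in line.split()[-3:])
        matches.append((query_id, strand, ref_start, query_start, length))
    return matches
```

For example, parse_mums(open("mummer.mums").read()) would return one tuple per MUM, which can then be sorted or filtered by length.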
A dotplot of all the MUMs between two sequences can reveal their macroscopic similarity.
mummerplot -x "[0,275287]" -y "[0,265111]" -postscript -p mummer mummer.mums
This command will plot all of the MUMs in the mummer.mums file in postscript format (-postscript) between the given ranges for the X and Y axes. When plotting mummer output, it is necessary to use the lengths of the input sequences to set the plot ranges, otherwise the plot will be automatically scaled around the minimum and maximum data points. The four output files are prefixed by the string specified with the -p option. The plot files contain the data points, mummer.gp is a gnuplot script for plotting the data points in the plot files, and mummer.ps is the postscript plot generated by the gnuplot script.
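The axis ranges passed to -x and -y are simply the lengths of the two input sequences. If the lengths are not known in advance, they can be computed from the FASTA files. The sketch below is an illustration only; the helper names fasta_lengths and plot_range are hypothetical, and it assumes plain FASTA input in which each record begins with a ">" header line.

```python
def fasta_lengths(text):
    """Return {sequence_id: length} for FASTA-formatted text.

    The ID is taken as the first word after '>' on each header line.
    """
    lengths = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            lengths[name] = 0
        elif line and name is not None:
            lengths[name] += len(line)
    return lengths


def plot_range(length):
    """Format a sequence length as a range string such as "[0,275287]"."""
    return "[0,%d]" % length
```

For example, applying fasta_lengths to the contents of each input file and passing the resulting lengths through plot_range yields strings in the same form as the -x and -y arguments used above.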
Below, you can see the mummer.ps file displayed with ghostview. Note that with newer versions of mummerplot the color and thickness of the plot lines may be different.
Most image manipulation programs can edit the postscript output, or it can be sent directly to a printer with the lpr command. If you would rather use the default terminal for gnuplot, simply remove the -postscript option from the mummerplot call.
The above postscript plot represents the set of all MUMs between the two input sequences used in this example. Forward MUMs are plotted as red lines/dots while reverse MUMs are plotted as green lines/dots (blue may be used for reverse matches in newer versions). A line of dots with slope == 1 represents an undisturbed segment of conservation between the two sequences, while a line of slope == -1 represents an inverted segment of conservation between the two sequences. The green segment in the upper left quadrant of the graph shows both an inversion and translocation, as it is of negative slope and inconsistently located relative to the rest of the plot which falls on a line approximated by f(x) = x. However the green segment in the upper right quadrant of the graph shows only an inversion, as it is of negative slope but is consistent in location with the rest of the plot. Generally, the closer a plot is to an imaginary line f(x) = x (or -x) the fewer macroscopic differences exist between the two sequences.
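The relationship between match direction and slope can be made concrete with a few lines of Python. This is a sketch under a stated assumption: for reverse matches reported with the -c option, the query coordinate is taken to be the forward-strand position at which the match begins, so that the match runs toward lower query coordinates. The function names are hypothetical.

```python
def match_segment(ref_start, query_start, length, reverse=False):
    """Return the two endpoints of a match as ((x0, y0), (x1, y1)).

    Forward matches increase in both coordinates. Reverse matches are
    ASSUMED to decrease in the query coordinate (see the text above).
    """
    step = -1 if reverse else 1
    start = (ref_start, query_start)
    end = (ref_start + length - 1, query_start + step * (length - 1))
    return start, end


def slope(segment):
    """Slope of a segment; matches of length 1 have no defined slope."""
    (x0, y0), (x1, y1) = segment
    if x1 == x0:
        raise ValueError("a single-base match has no slope")
    return (y1 - y0) / (x1 - x0)
```

Under this assumption every forward match has slope 1 and every reverse match has slope -1, which is why inversions appear as lines running from upper left to lower right in the dot plot.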
nucmer is MUMmer's most user-friendly alignment script for standard DNA sequence alignment. It is a robust pipeline that allows for multiple reference and multiple query sequences to be aligned in a many vs. many fashion. For instance, a very common use for nucmer is to determine the position and orientation of a set of sequence contigs in relation to a finished sequence; however, it can be just as effective in comparing two finished sequences to one another.
In the following sections, a short example is given that demonstrates how to use nucmer. This example aligns a set of draft sequence contigs to a finished sequence using nucmer; displays the alignment coordinates using show-coords; and tiles them across the reference using show-tiling.
Like mummer, nucmer can handle multiple reference and query sequences; however, it is most commonly used to map a set of query sequences to a single reference sequence. This example will demonstrate that functionality, as a number of B. anthracis draft contigs will be mapped to the final assembly.
nucmer -maxmatch -c 100 -p nucmer B_anthracis_Mslice.fasta B_anthracis_contigs.fasta
To ensure all contigs were mapped, all maximal matches were used as alignment anchors (-maxmatch), and because of the sequence similarity the minimum cluster size was raised to 100 (-c 100). The two output files are prefixed by the string specified with the -p option. nucmer.delta is an encoded file that represents the alignment between the two inputs. At this stage, the alignment of the two inputs is complete; however, it is necessary to parse the nucmer.delta file with the provided utilities in order to extract useful information from the comparison.
To view a summary of all the alignments produced by NUCmer, we need to run the nucmer.delta file through the show-coords utility.
show-coords -r -c -l nucmer.delta > nucmer.coords
This command will list the coordinates, percent identities and other useful statistics of each alignment in a table. Each line of the table represents an individual pairwise alignment, and the lines are sorted by starting reference coordinate (-r). Additional information, like alignment coverage (-c) and sequence length (-l), can be added to the table with the appropriate options. Output is to stdout, so we have redirected it into the file nucmer.coords.
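The table can be loaded into a script for filtering or summarizing. The sketch below is not part of MUMmer and makes assumptions about the layout: that the column names appear in square brackets on a header line, that the header is followed by a rule made of "=" characters, and that "|" characters in the rows are purely decorative. Because some columns may hold more than one value per row, the rows are returned as plain token lists rather than being mapped onto the header names. The function name parse_coords_table is hypothetical.

```python
import re


def parse_coords_table(text):
    """Split a show-coords style table into (column_names, rows).

    ASSUMED layout: a header line with names in [brackets], then a line
    consisting only of '=' characters, then one alignment per line.
    Rows are lists of whitespace-separated tokens with '|' removed.
    """
    lines = text.splitlines()
    rule = None
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped and set(stripped) == {"="}:
            rule = i
            break
    if rule is None:
        raise ValueError("no '=====' rule line found")
    header_line = lines[rule - 1] if rule > 0 else ""
    header = re.findall(r"\[([^\]]+)\]", header_line)
    rows = [l.replace("|", " ").split() for l in lines[rule + 1:] if l.strip()]
    return header, rows
```

With the rows in hand, simple summaries become one-liners; for instance, the number of alignments is len(rows), and the reference start of each alignment is the first token of each row if the header lists the reference start coordinate first.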
To view a summary of all the SNPs and indels between the two sequence sets, we need to run the nucmer.delta file through the show-snps utility.
show-snps -C nucmer.delta > nucmer.snps
This will generate a report of all the SNPs internal to the alignments contained in the nucmer.delta file. Each line of the table represents a single mismatch in the pairwise alignment. With the -C option, only SNPs from uniquely aligned regions will be reported. Additional information can be added or removed with the command line switches described in the manual. Output is to stdout, so we have redirected it into the file nucmer.snps.
To produce a minimal tiling of contigs across the reference sequence, we need to run the nucmer.delta file through the show-tiling utility.
show-tiling nucmer.delta > nucmer.tiling
This command will list the contigs and positions that generate the maximal alignment coverage across the reference sequence using the fewest contigs possible. This output can aid the closure of a draft genome when a closely related organism has already been finished.
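The idea of covering as much of the reference as possible with as few contigs as possible can be illustrated with the classic greedy interval-cover method. The sketch below is a simplified stand-in and is not the algorithm used by show-tiling; it ignores orientation, percent identity and the other criteria a real tiling considers. Contigs are represented as hypothetical (name, start, end) tuples with inclusive reference coordinates.

```python
def greedy_tiling(intervals):
    """Pick a smallest subset of intervals that covers their whole union.

    intervals: list of (name, start, end), inclusive reference coordinates.
    Classic greedy interval cover; an illustration only, NOT show-tiling.
    """
    ordered = sorted(intervals, key=lambda iv: (iv[1], -iv[2]))
    chosen = []
    covered_to = 0  # rightmost reference position covered so far
    i, n = 0, len(ordered)
    while i < n:
        if ordered[i][2] <= covered_to:
            i += 1  # contributes no new coverage
            continue
        # Consider every interval that starts inside or directly after the
        # covered region; if there is a gap, start again at the next interval.
        limit = max(covered_to + 1, ordered[i][1])
        best = None
        while i < n and ordered[i][1] <= limit:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        chosen.append(best)
        covered_to = best[2]
    return chosen
```

At each step the method keeps the contig that extends coverage furthest to the right, so contigs that lie entirely within already-covered sequence are never selected.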
nucmer and show-tiling output can both be viewed with mummerplot; however, these plots would offer little additional information for this example. mapview can also be used to display the output of show-coords, as is shown in the mapview walkthrough.
promer is a close relative of the NUCmer script. It follows the exact same steps as NUCmer and even uses most of the same programs in its pipeline, with one exception: all matching and alignment routines are performed on the six frame amino acid translation of the DNA input sequence. This provides promer with a much higher sensitivity than nucmer because protein sequences tend to diverge much more slowly than their underlying DNA sequence. Therefore, on the same input sequences, promer may find many conserved regions that nucmer will not, simply because the DNA sequence is not as highly conserved as the amino acid translation.
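A six frame translation can be sketched in a few lines of Python. This is an illustration of the concept only, not PROmer's implementation: it uses the standard genetic code, translates unknown codons to "X", and numbers the frames 1 to 3 (forward) and -1 to -3 (reverse complement), which may not match the frame convention PROmer reports.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, ordered to match product(BASES, repeat=3).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate DNA codon by codon; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Return {frame: protein} for frames 1, 2, 3, -1, -2, -3."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(seq[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames
```

The codons GCT and GCC both translate to alanine, so translate("ATGGCT") and translate("ATGGCC") give the same protein even though the DNA differs. Silent changes of this kind are one reason matches can survive at the amino acid level after the DNA has diverged.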
In the following sections, a short example is given that demonstrates how to use promer. This example aligns a few query sequences to a single reference sequence using promer; displays the alignment coordinates using show-coords; and prints a pairwise alignment of one of the contigs using show-aligns.
Like mummer, promer can handle multiple reference and query sequences; however, it is most commonly used to map a set of query sequences to a single reference sequence. This example will demonstrate that functionality, as two D. pseudoobscura draft contigs will be mapped to the final D. melanogaster assembly.
promer -p promer D_melanogaster_2Rslice.fasta D_pseudoobscura_contigs.fasta
Default parameters were used to align the two inputs; however, if the alignment is too sensitive or not sensitive enough, the minimum match length and cluster sizes can be adjusted accordingly. The two output files are prefixed by the string specified with the -p option. promer.delta is an encoded file that represents the alignment between the two inputs. At this stage, the alignment of the two inputs is complete; however, it is necessary to parse the promer.delta file with the provided utilities in order to extract useful information from the comparison.
To view a summary of all the alignments produced by PROmer, we need to run the promer.delta file through the show-coords utility.
show-coords -r -c -l -L 100 -I 50 promer.delta > promer.coords
This command will list the coordinates, percent identities and other useful statistics of each alignment in a table. Each line of the table represents an individual pairwise alignment, and the lines are sorted by starting reference coordinate (-r). Additional information, like alignment coverage (-c) and sequence length (-l), can be added to the table with the appropriate options. Minimum length (-L) and minimum percent identity (-I) cutoffs can also be specified to filter out poor alignments. Output is to stdout, so we have redirected it into the file promer.coords. If this file is planned for input to mapview, it is important to always use the -r -c -l options.
To view all the pairwise alignments between two of the input sequences, we need to run the promer.delta file through the show-aligns utility.
show-aligns promer.delta "D_melanogaster_2Rslice" "3214968" > promer.aligns
This command will print all of the pairwise alignments stored in the promer.delta file for the sequences "D_melanogaster_2Rslice" and "3214968". Output is to stdout, so we have redirected it into the file promer.aligns. If the alignments do not fit within your screen width, or you would like them to be printed on longer lines, the screen width can be adjusted with the -w option. Since show-aligns only displays the alignments between two sequences, it will have to be run separately for each desired pair of sequences.
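When many sequence pairs are of interest, the individual show-aligns command lines can be generated rather than typed by hand. The following Python sketch only builds the command strings, using the same invocation form as the command above; it does not run them. The function name and the output file naming scheme are hypothetical.

```python
import shlex
from itertools import product


def show_aligns_commands(delta_file, ref_ids, query_ids):
    """Build one show-aligns command line per reference/query pair.

    Output files are named <ref>_<query>.aligns, a hypothetical scheme
    chosen for this illustration.
    """
    commands = []
    for ref_id, query_id in product(ref_ids, query_ids):
        out_file = "%s_%s.aligns" % (ref_id, query_id)
        commands.append("show-aligns %s %s %s > %s" % (
            shlex.quote(delta_file),
            shlex.quote(ref_id),
            shlex.quote(query_id),
            shlex.quote(out_file),
        ))
    return commands
```

The resulting strings can be written to a shell script and executed in turn. Pairs that have no alignment in the delta file would need to be skipped or handled separately.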
promer and show-tiling output can both be viewed with mummerplot; however, these plots would offer little additional information for this example. mapview can also be used to display the output of show-coords, as is shown in the mapview walkthrough, which uses the promer.coords file generated in this example to generate a plot of the alignment.
run-mummer1 is a legacy script from the original MUMmer1.0 release. It has been updated to utilize the new suffix tree code of version 3.0; however, all other programs called from this script are identical to the original MUMmer release back in 1999. Even though it is an outdated program, it still has some advantages over the newer alignment scripts (nucmer, promer, run-mummer3). Like all of the alignment scripts, run-mummer1 is a three step process: matching, clustering and extension. However, unlike the newer alignment scripts, run-mummer1 uses the gaps program for its clustering step. Unlike mgaps, the gaps program does not allow for rearrangements; instead it finds the single longest increasing subset of matches across the full length of both sequences. This makes it well suited for SNP and small indel identification between small (< 10 Mbp), very similar sequences with few to no rearrangements.
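The notion of a longest increasing subset of matches can be illustrated with a small dynamic program. This sketch is not the code of the gaps program. It assumes each match is a hypothetical (ref_start, query_start, length) tuple, requires both coordinates to strictly increase along the chain, scores a chain by its total match length (an assumption), and ignores overlaps between neighboring matches.

```python
def longest_increasing_chain(matches):
    """Return the heaviest chain of matches increasing in both coordinates.

    matches: list of (ref_start, query_start, length) tuples.
    Simple O(n^2) dynamic program; an illustration only, NOT the
    implementation used by the gaps program.
    """
    ordered = sorted(matches)
    n = len(ordered)
    if n == 0:
        return []
    best = [m[2] for m in ordered]  # best chain score ending at each match
    prev = [-1] * n                 # back-pointers for chain recovery
    for j in range(n):
        for i in range(j):
            increasing = (ordered[i][0] < ordered[j][0]
                          and ordered[i][1] < ordered[j][1])
            if increasing and best[i] + ordered[j][2] > best[j]:
                best[j] = best[i] + ordered[j][2]
                prev[j] = i
    j = max(range(n), key=best.__getitem__)
    chain = []
    while j != -1:
        chain.append(ordered[j])
        j = prev[j]
    return chain[::-1]
```

A match that lies far off the main diagonal, such as one produced by a translocation, cannot be part of the chain without excluding the matches around it, so it is dropped. This is why the approach suits sequences with few or no rearrangements.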
In the following sections, a short example is given that demonstrates how to use run-mummer1. This example aligns a single query sequence to a single reference sequence using run-mummer1.
run-mummer1 is only suited for a single reference and query sequence that have few to zero inversions or translocations. This example aligns two such sequences.
run-mummer1 H_pylori26695_Bslice.fasta H_pyloriJ99_Bslice.fasta mummer1
To adjust the minimum match length for the comparison, the user must manually edit the run-mummer1 script. Output files are prefixed by the string specified at the end of the command line call. mummer1.align displays the alignments of each gap between adjacent MUMs, mummer1.errorsgaps lists each MUM and the number of errors between it and the previous MUM, mummer1.gaps lists the ordered set of MUMs and the gap distance to the previous MUM, and mummer1.out simply lists all of the MUMs greater than or equal to the minimum match length.
There are no visualization tools designed for run-mummer1 output. To view a MUM dotplot, run mummer by itself on two individual sequences as demonstrated in the mummer walkthrough.
run-mummer3 is the simplest pipeline of the latest MUMmer3.0 programs. It runs the same matching and clustering algorithm as nucmer and promer; however, it uses a different extension technique and does not perform the important pre- and post-processing steps of NUC/PROmer. Because of its simplistic form, run-mummer3 can only handle a single reference sequence, but like run-mummer1 its error-focused output makes it a handy tool for detecting SNPs and other small errors. The only major difference between run-mummer3 and run-mummer1 is the new version's ability to handle multiple query sequences and its tolerance of large rearrangements. This makes run-mummer3 well suited for error detection between highly similar sequences that may have large rearrangements, inversions etc.
In the following sections, a short example is given that demonstrates how to use run-mummer3. This example aligns a single query sequence to a single reference sequence using run-mummer3.
run-mummer3 can only handle a single reference sequence, but it is capable of dealing with multiple query sequences. However, this example aligns a single query sequence to a single reference sequence. Unlike run-mummer1, run-mummer3 can handle inversions and translocations, but not with the same grace as nucmer.
run-mummer3 H_pylori26695_Bslice.fasta H_pyloriJ99_Bslice.fasta mummer3
To adjust any of the alignment parameters, the user must manually edit the run-mummer3 script. Do not, however, add the -c option to the mummer invocation, as it will confuse the next steps in the pipeline. It may be easier to reverse complement the sequence yourself and run the script twice (once for forward, once for reverse) with the -b option removed. Try adding the -D option to the combineMUMs command line in the script to output a format that is easier to parse for SNPs and small indels. Output files are prefixed by the string specified at the end of the command line call. mummer3.align displays the alignments of each gap between adjacent MUMs, mummer3.errorsgaps lists each MUM and the number of errors between it and the previous MUM, mummer3.gaps lists the ordered set of MUMs and the gap distance to the previous MUM, and mummer3.out simply lists all of the MUMs greater than or equal to the minimum match length.
The mummer3.out file is identical to the output of mummer on a 1 vs many search, so it may be plotted as demonstrated in the mummer walkthrough.
VERSION 3.17 - May 2005