--------------------------------------------------------------------------------
    dnadiff is a wrapper for nucmer and analysis utilities that
    provides detailed information on the differences between two
    genomes, and also provides a high level report file that
    quantifies the differences between the two inputs.

    Use Cases:
      + diff'ing two strains of the same species
      + diff'ing two assemblies of the same organism
      + diff'ing a draft assembly and a closely related finished genome

    If any of this code is used in any publication, please cite the
    following:

      Versatile and open software for comparing large genomes.
      S. Kurtz, A. Phillippy, A.L. Delcher, M. Smoot, M. Shumway,
      C. Antonescu, and S.L. Salzberg.
      Genome Biology (2004), 5:R12.
--------------------------------------------------------------------------------

This manual is also available as HTML documentation included in this
distribution, or at:

    http://mummer.sourceforge.net
    http://mummer.sourceforge.net/manual
    http://mummer.sourceforge.net/examples


-- DESCRIPTION --

dnadiff is a wrapper around nucmer that builds an alignment using
default parameters, and runs many of nucmer's helper scripts to process
the output and report alignment statistics, SNPs, breakpoints, etc. It
is designed for evaluating the sequence and structural similarity of
two highly similar sequence sets. E.g. comparing two different
assemblies of the same organism, or comparing two strains of the same
species.


-- dnadiff EXAMPLE --

To compare two strains of the same species, type:

    "dnadiff genome1.fna genome2.fna"

Output will be...

    out.report  - Summary of alignments, differences and SNPs
    out.delta   - Standard nucmer alignment output
    out.1delta  - 1-to-1 alignment from delta-filter -1
    out.mdelta  - M-to-M alignment from delta-filter -m
    out.1coords - 1-to-1 coordinates from show-coords -THrcl .1delta
    out.mcoords - M-to-M coordinates from show-coords -THrcl .mdelta
    out.snps    - SNPs from show-snps -rlTHC .1delta
    out.rdiff   - Classified ref breakpoints from show-diff -rH .mdelta
    out.qdiff   - Classified qry breakpoints from show-diff -qH .mdelta
    out.unref   - Unaligned reference sequence IDs and lengths
    out.unqry   - Unaligned query sequence IDs and lengths

For more information on the formats and meanings of all the files
produced, please see the documentation for the corresponding utility.
This document serves to describe running the dnadiff script and
interpreting the produced .report file.


-- RUNNING 'dnadiff' --

    USAGE: dnadiff  [options]  <reference>  <query>
      or   dnadiff  [options]  -d <delta file>

    DESCRIPTION:
        Run comparative analysis of two sequence sets using nucmer and
        its associated utilities with recommended parameters. See
        MUMmer documentation for a more detailed description of the
        output.
        Produces the following output files:

        .delta   - Standard nucmer alignment output
        .1delta  - 1-to-1 alignment from delta-filter -1
        .mdelta  - M-to-M alignment from delta-filter -m
        .1coords - 1-to-1 coordinates from show-coords -THrcl .1delta
        .mcoords - M-to-M coordinates from show-coords -THrcl .mdelta
        .snps    - SNPs from show-snps -rlTHC .1delta
        .rdiff   - Classified alignment breakpoints from show-diff -rH .mdelta
        .qdiff   - Classified alignment breakpoints from show-diff -qH .mdelta
        .report  - Summary of alignments, differences and SNPs
        .unref   - Unaligned reference sequence IDs and lengths
        .unqry   - Unaligned query sequence IDs and lengths

    MANDATORY:
        Reference       Set the input reference multi-FASTA filename
        Query           Set the input query multi-FASTA filename
          or
        Delta File      Unfiltered .delta alignment file from nucmer

    OPTIONS:
        -d|delta        Provide precomputed delta file for analysis
        -h
        --help          Display help information and exit
        -p|prefix       Set the prefix of the output files (default "out")
        -V
        --version       Display the version information and exit


-- NOTES --

The -p option is recommended to avoid overwriting previous output. A
simple naming convention for files A.fna and B.fna is to set "-p A_B".

It is safest to let dnadiff run nucmer automatically, so avoid using
the -d option unless the delta file was already generated with
"nucmer --maxmatch" and has not been filtered.


-- OUTPUT FILES --

dnadiff produces many outputs; however, all but one are produced by
other utilities in the MUMmer package. Please see their corresponding
documentation for more information. This section will only describe
the .report file generated by dnadiff and tips on interpreting it.

*** .report OUTPUT ***

Report statistics are broken into two columns - reference and query.
Rows are grouped by themed alignment metrics and are described here.
Summary counts are estimates and do not represent the exact number of
occurrences of a particular evolutionary event.
When reading a reference column, think number of XYZ in reference with
regard to the query. When reading a query column, think number of XYZ
in query with regard to the reference.

[Sequences] - Sequence-centric stats.

    TotalSeqs - Total number of input sequences.

    AlignedSeqs - Number of input sequences with at least one
    alignment.

    UnalignedSeqs - Number of input sequences with no alignment.

[Bases] - Base-pair-centric stats.

    TotalBases - Total number of bases in the input sequences.

    AlignedBases - Total number of bases contained within an alignment.

    UnalignedBases - Total number of unaligned bases. This is a rough
    measure for the amount of "unique" sequence in the reference and
    query.

[Alignments] - Alignment-centric stats.

    1-to-1 - Number of alignment blocks comprising the 1-to-1 mapping
    of reference to query. This is a subset of the M-to-M mapping, with
    repeats removed.

    TotalLength - Total length of 1-to-1 alignment blocks.

    AvgLength - Average length of 1-to-1 alignment blocks.

    AvgIdentity - Average identity of 1-to-1 alignment blocks.

    M-to-M - Number of alignment blocks comprising the many-to-many
    mapping of reference to query. The M-to-M mapping represents the
    smallest set of alignments that maximize the coverage of both
    reference and query. This is a superset of the 1-to-1 mapping.

    TotalLength - Total length of M-to-M alignment blocks.

    AvgLength - Average length of M-to-M alignment blocks.

    AvgIdentity - Average identity of M-to-M alignment blocks.

[Features] - Structural alignment features, such as rearrangements.
These counts are rough estimates based on an automated analysis of the
alignments. Features are identified by scanning the reference (or
query) from low to high, and noting the positions where the query
alignments are inconsistently ordered or oriented with respect to the
reference.

    Breakpoints - Number of non-maximal alignment endpoints, i.e.
    endpoints that do not occur at the beginning or end of a sequence.

    Relocations - Number of breaks in the alignment where adjacent
    1-to-1 alignment blocks are in the same sequence, but not
    consistently ordered. A separate feature is recorded for each end
    of a relocation, so this is really a count of relocation endpoints.

    Translocations - Number of breaks in the alignment where adjacent
    1-to-1 alignment blocks are in different sequences. A separate
    feature is recorded for each end of a translocation, so this is
    really a count of translocation endpoints.

    Inversions - Number of breaks in the alignment where adjacent
    1-to-1 alignment blocks are inverted with respect to one another.
    A separate feature is recorded for each end of an inversion, so
    this is really a count of inversion endpoints.

    Insertions - Rough count of insertion events. Note that this is
    slightly different from "UnalignedBases" because it counts
    duplications as insertions, whereas UnalignedBases does not. Also,
    this count does not include sequences that have no alignments as
    insertions, whereas UnalignedBases does. Note that insertions in R
    can be viewed as deletions from Q. This number reports only "major"
    insertions, defined as insertions large enough to break an
    alignment. Nucmer will align through smaller insertions of less
    than ~60 bases. These smaller insertions are reported in the
    "Indels" count below.

    InsertionSum - Rough sum of inserted sequence.

    InsertionAvg - Average length of insertion.

    TandemIns - Rough count of tandem duplication insertion events.
    Note that expansions in R can be viewed as collapses in Q.

    TandemInsSum - Rough sum of tandem duplication insertions.

    TandemInsAvg - Average length of tandem duplications.

[SNPs] - Single Nucleotide Polymorphism counts.

    TotalSNPs - Total number of SNPs, same for both sequences.

    XY - X-to-Y SNP. For reference column, this means reference 'X' to
    query 'Y'. For query column, this means query 'X' to reference 'Y'.
    The same convention applies below.

    TotalGSNPs - Single Nucleotide Polymorphisms bounded by 20 exact
    base-pair matches on both sides.

    TotalIndels - Single Nucleotide Insertions/Deletions.

    X. - X insertion. For reference column, 'X.' means insertion of 'X'
    in the reference. For query column, 'X.' means insertion of 'X' in
    the query. Nucmer will align through group insertions of up to ~60
    bases. Each base of these group insertions will be reported in this
    count. Large insertions will be reported in the "Insertions" count
    above.

    TotalGIndels - Single Nucleotide Insertions/Deletions bounded by 20
    exact base-pair matches on both sides.
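
For scripted post-processing, the two-column layout described above can
be read with a few lines of Python. The following is only a minimal
sketch, not part of the MUMmer distribution. It assumes that each group
begins with a bracketed header line such as "[Sequences]" and that each
statistic is a single whitespace-separated row of the form "label
reference-value query-value". The function name parse_report and the
sample text are hypothetical, and the sample values are invented for
illustration. Rows are returned as a list rather than a dictionary
because labels such as TotalLength appear under both the 1-to-1 and
M-to-M mappings.

```python
import re

# Assumed shape of a group header line, e.g. "[Sequences]".
SECTION_RE = re.compile(r"^\[(\w+)\]$")


def parse_report(text):
    """Return a list of (section, label, ref_value, qry_value) tuples.

    Values are kept as strings; lines that do not look like a
    three-field statistic row inside a section are skipped.
    """
    rows = []
    section = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        match = SECTION_RE.match(line)
        if match:
            section = match.group(1)
            continue
        fields = line.split()
        # Keep only "label ref qry" rows that follow a section header.
        if section is not None and len(fields) == 3:
            rows.append((section, fields[0], fields[1], fields[2]))
    return rows


# Hypothetical sample text; the numbers are invented, not real output.
SAMPLE = """\
[Sequences]
TotalSeqs        2     3
AlignedSeqs      2     2
UnalignedSeqs    0     1

[Alignments]
1-to-1           10    10
TotalLength      5000  5000
M-to-M           12    12
TotalLength      5600  5600
"""

rows = parse_report(SAMPLE)
for section, label, ref_value, qry_value in rows:
    print(section, label, ref_value, qry_value)
```

To use the sketch on real output, read the report into a string first,
for example with open("out.report").read(), and check the result
against the file by eye, since the layout assumed here may not match
every dnadiff version.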