QUAST 2.3 manual

QUAST stands for QUality ASsessment Tool. The tool evaluates genome assemblies by computing various metrics.

You can find all project news and the latest version of the tool at http://sourceforge.net/projects/quast.

QUAST utilizes MUMmer, GeneMark.hmm, MetaGeneMark, GlimmerHMM and GAGE. These tools are built in, so you do not need to install them separately.

Version 2.3 of QUAST was released under GPL v2 (see LICENSE for details) on 17 January 2014.

1. Installation

QUAST automatically compiles all its sub-parts when needed (on the first use), so there is no special installation command. However, we recommend that you run:

Note: place the quast-2.3 directory in its final destination before the first use (e.g. before running with --test). If you want to move QUAST to a new location after it has already been used, start from a clean copy of quast-2.3. This limitation exists because the compiled modules of QUAST contain auto-generated absolute paths.

2. Running QUAST

2.1 For impatient people

2.2 Input data

The test_data directory contains examples of assembly, reference, gene and operon files.

Sequences
The tool accepts assemblies and references in FASTA format. Files may be compressed with zip, gzip, or bzip2.
Multiple reference chromosomes can be provided as separate sequences in a single FASTA file.

Maximum assembly length is 4.29 Gbp.
Maximum length of a reference sequence (e.g. a chromosome) is 536 Mbp. The number of sequences in a reference file is not limited.

These restrictions come from Nucmer, the tool that QUAST uses to align contigs to a reference genome. Metrics that do not require alignment are computed in any case.
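If you want to check your input against these limits before a run, a minimal sketch could look like the following. The helper names are hypothetical and not part of QUAST, and the constants are the rounded values quoted above rather than the exact Nucmer limits.

```python
MAX_ASSEMBLY_BP = 4290000000       # 4.29 Gbp, total assembly length (rounded)
MAX_REFERENCE_SEQ_BP = 536000000   # 536 Mbp, per reference sequence (rounded)

def fasta_lengths(lines):
    """Return {sequence name: length} for an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths

def oversized_reference_sequences(lengths):
    """Names of reference sequences longer than the per-sequence limit."""
    return [n for n, size in lengths.items() if size > MAX_REFERENCE_SEQ_BP]

def assembly_too_long(lengths):
    """True if the summed contig length exceeds the assembly limit."""
    return sum(lengths.values()) > MAX_ASSEMBLY_BP

example = [">chr1 demo", "ACGTACGT", "ACGT", ">chr2", "GGCC"]
print(fasta_lengths(example))
```

For a real file, pass an open text file handle to fasta_lengths() instead of the example list.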

Genes and operons
One can also specify files with gene and operon positions in the reference. QUAST will count fully and partially aligned regions, and output total values and cumulative plots.

The following file formats are supported:

2.3 GAGE mode

2.4 Command line options

-t (or --threads) <int>

Maximum number of threads. The default value is the number of CPUs. If QUAST fails to determine the number of CPUs, the number is set to 4.
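The fallback rule can be illustrated with a short Python sketch. It only mirrors the behaviour described above and is not taken from the QUAST sources; the function name is hypothetical.

```python
import multiprocessing

FALLBACK_THREADS = 4  # value used when the CPU count cannot be determined

def default_threads():
    """Return the number of CPUs, or 4 if it cannot be determined."""
    try:
        return multiprocessing.cpu_count()
    except NotImplementedError:
        return FALLBACK_THREADS

print(default_threads())
```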

--labels (or -l) <label,label...>

Human-readable assembly names. Those names will be used in reports, plots and logs. For example:

-l SPAdes,IDBA-UD

If your labels include spaces, use quotes:

-l SPAdes,"Assembly 2",Assembly3

-l "SPAdes 2.5, SPAdes 2.4, IDBA-UD"

-L

Take assembly names from their parent directory names.

--gene-finding

Enables gene finding. Affects performance, thus disabled by default.

By default, we assume that the genome is prokaryotic, and apply GeneMark.hmm for gene finding. If the genome is eukaryotic, add the --eukaryote option to enable GlimmerHMM instead. If it is a metagenome, add the --meta option.

If a gene file is also provided with -G, QUAST reports both the number of genes from the file that are covered by the assembly (# genes) and the number of predicted genes (# predicted genes). Note that operons are not predicted, but a file of known operon positions can be provided instead.

--gene-thresholds <int,int,...>

Comma-separated list of gene length thresholds applied to genes found by the gene finding tool. The default value is 0,300,1500,3000. Note: this list is used only if the --gene-finding option is specified.

--eukaryote

Genome is eukaryotic. Affects gene finding and contig alignment:

For prokaryotes (which is the default), GeneMark.hmm is used. For eukaryotes, GlimmerHMM is used.
By default, QUAST assumes that a genome is circular and correctly processes its linear representation. This option indicates that the genome is not circular.

--meta

Use MetaGeneMark for gene finding, if the --gene-finding option is specified. If the --eukaryote option is also provided, MetaGeneMark will still be used.

Note: if you have multiple references, use metaquast.py instead (it is in the same directory as quast.py).

--est-ref-size <int>

Estimated reference size (in bases) for computing NGx statistics. This value is used only if a reference genome file is not specified (see the -R option).

--gage

Starts QUAST in "GAGE mode" (see section 2.3). Note: in this case, you also have to set the -R option.

--contig-thresholds <int,int,...>

Comma-separated list of contig length thresholds. Used in # contigs ≥ x and total length (≥ x) metrics (see section 3). The default value is 0,1000.

--scaffolds

The assemblies are scaffolds (rather than contigs). QUAST will add split versions of assemblies to the comparison. Assemblies are split by continuous fragments of N's of length ≥ 10.
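The splitting rule can be sketched in a few lines of Python. This is an illustration of the rule as stated, not QUAST's own code; treating lowercase n the same as N is an assumption.

```python
import re

# Runs of 10 or more N's are treated as scaffold gaps.
# Matching lowercase n as well is an assumption of this sketch.
GAP_PATTERN = re.compile(r"[Nn]{10,}")

def split_scaffold(sequence):
    """Split a scaffold into contigs at runs of >= 10 N's."""
    return [part for part in GAP_PATTERN.split(sequence) if part]

scaffold = "ACGT" + "N" * 12 + "GGCC" + "N" * 3 + "TTAA"
print(split_scaffold(scaffold))
```

In this example the run of 12 N's separates two contigs, while the run of 3 N's stays inside the second contig.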

--use-all-alignments

Compute genome fraction, # genes, # operons metrics in the manner used in QUAST v.1.*. By default, QUAST v.2.0 and higher filters out ambiguous and redundant alignments, keeping only one alignment per contig (or one set of non-overlapping or slightly overlapping alignments). This option makes QUAST count all alignments.

--ambiguity-usage <none|one|all>

Way of processing equally good alignments (probably repeats):

`none`	skip all such alignments;
`one`	take only one (the first one);
`all`	use all alignments. Can cause a significant increase of # mismatches (repeats are almost always inexact due to accumulated SNPs, indels, etc.).

The default value is 'one'.

--strict-NA

Break contigs at every misassembly event (including local ones) to compute NAx and NGAx statistics. By default, QUAST breaks contigs only at extensive misassemblies (not local ones).

--no-plots

Do not draw plots. This will speed up computation but you will get only text reports as a result.

--test

Run the tool on data from the test_data folder and check the correctness of the evaluation process. Output is saved in quast_test_output.

-h (or --help)

Print help.

2.5 Metagenomic assemblies

The metaquast.py script accepts multiple references. One can provide several files, a merged FASTA file with multiple sequences, or a combination. The tool partitions all contigs in groups aligned to each reference. Then it runs quast.py several times:

All outputs are in separate directories inside the directory provided by -o (or in quast_results/latest).

3. QUAST output

If an output path was not specified manually, QUAST puts its output into the directory quast_results/result_<DATE> and creates a symlink latest to it inside the directory quast_results/.

QUAST output contains:

report.txt	an assessment summary in a simple text format,
report.tsv	a tab-separated version of the summary, suitable for spreadsheets (Google Docs, Excel, etc),
report.tex	a LaTeX version of the summary,
alignment.svg	a contig alignment plot (file is created if the matplotlib python library is installed),
report.pdf	all other plots combined with all tables (file is created if the matplotlib python library is installed),
report.html	an HTML version of the report with interactive plots inside it,
contigs_reports/
misassemblies_report	a detailed report on misassemblies. See section 3.1.2 for details,
unaligned_report	a detailed report on unaligned and partially unaligned contigs. See section 3.1.3 for details.
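For further processing, report.tsv can be read with the standard csv module. The sketch below assumes a layout in which the first row lists the assembly names and every following row holds a metric name followed by one value per assembly; check this against your own file. The sample values are invented for illustration only.

```python
import csv
import io

def read_report(handle):
    """Parse a tab-separated report into {assembly: {metric: value}}."""
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    header, body = rows[0], rows[1:]
    assemblies = header[1:]
    report = {name: {} for name in assemblies}
    for row in body:
        metric, values = row[0], row[1:]
        for name, value in zip(assemblies, values):
            report[name][metric] = value  # values are kept as strings
    return report

# Made-up sample content standing in for a real report.tsv.
sample = io.StringIO(
    "Assembly\tSPAdes\tIDBA-UD\n"
    "# contigs\t120\t150\n"
    "N50\t95000\t70000\n"
)
print(read_report(sample)["SPAdes"]["N50"])
```

For a real report, replace the StringIO object with open("quast_results/latest/report.tsv", newline="").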

Note:

3.1 Metrics description

3.1.1 Summary report

# contigs (≥ x bp) is the total number of contigs of length ≥ x bp. Not affected by the --min-contig parameter (see section 2.4).

Total length (≥ x bp) is the total number of bases in contigs of length ≥ x bp. Not affected by the --min-contig parameter (see section 2.4).

All remaining metrics are computed only for contigs that exceed the threshold specified by the --min-contig option (see section 2.4, default is 500).
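As an illustration of the two threshold metrics, here is a minimal sketch with made-up contig lengths. The function name is hypothetical; the thresholds are the defaults of --contig-thresholds.

```python
def contigs_at_least(lengths, threshold):
    """Return (# contigs, total length) for contigs of length >= threshold."""
    selected = [n for n in lengths if n >= threshold]
    return len(selected), sum(selected)

lengths = [120, 480, 950, 1000, 5200]  # made-up contig lengths in bp
for threshold in (0, 1000):            # default --contig-thresholds
    count, total = contigs_at_least(lengths, threshold)
    print(threshold, count, total)
```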

# contigs is the total number of contigs in the assembly.

Largest contig is the length of the longest contig in the assembly.

Total length is the total number of bases in the assembly.

Reference length is the total number of bases in the reference.

GC (%) is the total number of G and C nucleotides in the assembly, divided by the total length of the assembly.

Reference GC (%) is the percentage of G and C nucleotides in the reference.

N50 is the length for which the collection of all contigs of that length or longer covers at least half an assembly.

NG50 is the length for which the collection of all contigs of that length or longer covers at least half a reference genome.
This metric is computed only if a reference genome is provided.

N75 and NG75 are defined similarly with 75 % instead of 50 %.

L50 (L75, LG50, LG75) is the number of contigs at least as long as N50 (N75, NG50, NG75).
In other words, L50, for example, is the minimal number of contigs that cover half the assembly.
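The definitions of Nx, NGx, Lx and LGx can be written as one small function. This is a sketch of the definitions above with made-up contig lengths, not QUAST's implementation; the function name and its arguments are hypothetical.

```python
def nx(lengths, percent, total=None):
    """Return (Nx, Lx): the contig length and the contig count at which
    contigs, taken from longest to shortest, first cover percent % of total.

    total defaults to the assembly length (Nx, Lx); pass the reference
    length to get NGx and LGx. Returns (None, None) if the target is
    never reached, e.g. when the assembly is much shorter than the reference.
    """
    if total is None:
        total = sum(lengths)
    target = total * percent / 100.0
    covered = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        covered += length
        if covered >= target:
            return length, count
    return None, None

contigs = [80, 70, 50, 40, 30, 20, 10]   # made-up lengths, 300 bp in total
print(nx(contigs, 50))                   # N50 and L50
print(nx(contigs, 75))                   # N75 and L75
print(nx(contigs, 50, total=400))        # NG50 and LG50 for a 400 bp reference
```

Here the two longest contigs (80 + 70 = 150) already cover half of the 300 bp assembly, so N50 is 70 and L50 is 2.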

# misassemblies is the number of positions in the contigs that satisfy one of the following criteria:

the left flanking sequence aligns over 1 kbp away from the right flanking sequence on the reference;
flanking sequences overlap by more than 1 kbp;
flanking sequences align to different strands or different chromosomes.

This metric requires a reference genome.

# misassembled contigs is the number of contigs that contain misassembly events.

Misassembled contigs length is the total number of bases in misassembled contigs.

# local misassemblies is the number of breakpoints that satisfy the following conditions:

Two or more distinct alignments cover the breakpoint.
The gap between left and right flanking sequences is less than 1 kbp.
The left and right flanking sequences both are on the same strand of the same chromosome of the reference genome.

# unaligned contigs is the number of contigs that have no alignment to the reference sequence. The value "X + Y part" means X totally unaligned contigs plus Y partially unaligned contigs.

Unaligned length is the total length of all unaligned regions in the assembly (sum of lengths of fully unaligned contigs and unaligned parts of partially unaligned ones).

Genome fraction (%) is the percentage of aligned bases in the reference. A base in the reference is aligned if there is at least one contig with at least one alignment to this base. Contigs from repetitive regions may map to multiple places, and thus may be counted multiple times.

Duplication ratio is the total number of aligned bases in the assembly divided by the total number of aligned bases in the reference (see Genome fraction (%) for the 'aligned base' definition). If the assembly contains many contigs that cover the same regions of the reference, its duplication ratio may be much larger than 1. This may occur due to overestimating repeat multiplicities and due to small overlaps between contigs, among other reasons.
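Both metrics can be illustrated on a list of alignment intervals. The sketch below makes several simplifying assumptions: a single reference sequence, alignments given as hypothetical (start, end) reference coordinates (1-based, inclusive), and the reference span of each alignment used as a stand-in for its aligned length in the assembly (i.e. no indels).

```python
def covered_bases(intervals):
    """Number of reference positions covered by at least one interval."""
    covered = 0
    current_end = 0  # rightmost reference position counted so far
    for start, end in sorted(intervals):
        if end <= current_end:
            continue  # interval lies entirely inside already counted bases
        covered += end - max(start, current_end + 1) + 1
        current_end = end
    return covered

def genome_fraction(intervals, reference_length):
    """Percentage of reference bases covered by at least one alignment."""
    return 100.0 * covered_bases(intervals) / reference_length

def duplication_ratio(intervals):
    """Aligned bases in the assembly per aligned base in the reference."""
    aligned_in_assembly = sum(end - start + 1 for start, end in intervals)
    return float(aligned_in_assembly) / covered_bases(intervals)

alignments = [(1, 400), (301, 700), (301, 700)]  # made-up alignments
print(genome_fraction(alignments, 1000))
print(duplication_ratio(alignments))
```

The three alignments cover 700 of the 1000 reference bases, while their lengths sum to 1200, so the duplication ratio is above 1.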

# N's per 100 kbp is the average number of uncalled bases (N's) per 100000 assembly bases.

# mismatches per 100 kbp is the average number of mismatches per 100000 aligned bases. True SNPs and sequencing errors are not distinguished and are counted equally.

# indels per 100 kbp is the average number of indels per 100000 aligned bases. Several consecutive single nucleotide indels are counted as one indel.
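The three "per 100 kbp" metrics share the same normalization, shown here as a minimal sketch with made-up numbers (the function name is hypothetical):

```python
def per_100_kbp(count, bases):
    """Average number of events per 100000 bases."""
    return 100000.0 * count / bases

# E.g. 12 mismatches over 4.6 Mbp of aligned bases (made-up numbers).
print(per_100_kbp(12, 4600000))
```

For # N's per 100 kbp the denominator is the assembly length; for mismatches and indels it is the number of aligned bases.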

# genes is the number of genes in the assembly (complete and partial), based on a user-provided list of gene positions in the reference. A gene is 'partially covered' if the assembly contains at least 100 bp of this gene but not the whole gene.

This metric is computed only if a reference genome and an annotated list of gene positions are provided (see section 2.4).

# operons is defined similarly to # genes, but an operon positions file is required instead.

# predicted genes is the number of genes in the assembly found by GeneMark.hmm, GlimmerHMM or MetaGeneMark. See the description of the --gene-finding option for details.

Largest alignment is the length of the largest continuous alignment in the assembly. This value can be smaller than the length of the largest contig if the largest contig is misassembled.

NA50, NGA50, NA75, NGA75, LA50, LA75, LGA50, LGA75 ("A" stands for "aligned") are similar to the corresponding metrics without "A", but in this case aligned blocks instead of contigs are considered.
Aligned blocks are obtained by breaking contigs at misassembly events and removing all unaligned bases.

3.1.2 Misassemblies report

# misassemblies is the same as # misassemblies from section 3.1.1. However, this report also contains a classification of all misassemblies into three groups: relocations, translocations, and inversions (see below).

Relocation is a misassembly where the left flanking sequence aligns over 1 kbp away from the right flanking sequence on the reference, or they overlap by more than 1 kbp, and both flanking sequences align on the same chromosome.

Translocation is a misassembly where the flanking sequences align on different chromosomes.

Inversion is a misassembly where the flanking sequences align on opposite strands of the same chromosome.
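The three definitions can be summarized in a small classifier. This is a simplified sketch, not QUAST's algorithm: the Flank record and its coordinates are hypothetical, and checking chromosomes first, then strands, then distance is an assumption about how overlapping cases are resolved.

```python
from collections import namedtuple

# Hypothetical record for one flanking alignment on the reference
# (start and end are 1-based, inclusive reference coordinates).
Flank = namedtuple("Flank", "chrom strand start end")

THRESHOLD = 1000  # 1 kbp, as in the definitions above

def classify(left, right):
    """Classify the junction between two flanking alignments."""
    if left.chrom != right.chrom:
        return "translocation"
    if left.strand != right.strand:
        return "inversion"
    # Positive gap: distance between the flanks on the reference.
    # Negative gap: the flanks overlap by -gap bases.
    gap = max(left.start, right.start) - min(left.end, right.end) - 1
    if abs(gap) > THRESHOLD:
        return "relocation"
    return None  # not an extensive misassembly

print(classify(Flank("chr1", "+", 1, 1000), Flank("chr1", "+", 5000, 6000)))
```

A junction for which the function returns None falls below the 1 kbp threshold and would at most be a candidate for a local misassembly (see section 3.1.1).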

# misassembled contigs and misassembled contigs length are the same as the metrics from section 3.1.1 and are counted among all contigs with any type of a misassembly (relocation, translocation or inversion).

# local misassemblies is the same as # local misassemblies from section 3.1.1.

# mismatches is the number of mismatches in all aligned bases.

# indels is the number of indels in all aligned bases.

# short indels (≤ 5 bp) is the number of indels of length ≤ 5 bp.

# long indels (> 5 bp) is the number of indels of length > 5 bp.

Indels length is the total number of bases contained in all indels.
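Given the length of every indel, the four indel metrics follow directly. A minimal sketch with made-up indel lengths (the function name is hypothetical):

```python
def indel_summary(indel_lengths, short_max=5):
    """Summarize indels given the length in bp of each indel."""
    short = sum(1 for length in indel_lengths if length <= short_max)
    return {
        "# indels": len(indel_lengths),
        "# short indels": short,
        "# long indels": len(indel_lengths) - short,
        "Indels length": sum(indel_lengths),
    }

print(indel_summary([1, 1, 3, 5, 6, 12]))
```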

3.1.3 Unaligned report

# fully unaligned contigs is the number of contigs that have no alignment to the reference sequence.

Fully unaligned length is the total number of bases in all unaligned contigs.

# partially unaligned contigs is the number of contigs that are not fully unaligned, but have fragments with no alignment to the reference sequence.

# with misassembly is the number of partially unaligned contigs that have a misassembly in their aligned fragment. Note that such misassemblies are not counted in # misassemblies and other misassemblies statistics.

# both parts are significant is the number of partially unaligned contigs that have both aligned and unaligned fragments longer than the value of --min-contig.

Partially unaligned length is the total number of unaligned bases in all partially unaligned contigs.

# N's is the total number of uncalled bases (N's) in the assembly.

3.2 Plots description

Contig alignment plot shows alignment of contigs to the reference genome and the positions of misassemblies in these contigs. Contigs that align correctly are colored blue if the boundaries agree (within 2 kbp on each side, contigs are larger than 10 kbp) in at least half of the assemblies, and green otherwise. Blocks of misassembled contigs are colored orange if the boundaries agree in at least half of the assemblies, and red otherwise. Contigs are staggered vertically and are shown in different shades of their color in order to distinguish the separate contigs, including small ones. If the reference file consists of several sequences all of them are drawn on the single plot horizontally next to each other.

Cumulative length plot shows the growth of contig lengths. On the x-axis, contigs are ordered from the largest to smallest. The y-axis gives the size of the x largest contigs in the assembly.
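The y-values of this plot are running sums over the sorted contig lengths, as in this minimal sketch (made-up lengths, hypothetical function name):

```python
from itertools import accumulate

def cumulative_lengths(lengths):
    """Total length of the 1, 2, ... largest contigs."""
    return list(accumulate(sorted(lengths, reverse=True)))

print(cumulative_lengths([30, 80, 50]))
```

The i-th value in the returned list is the y-coordinate for x = i.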

Nx plot shows Nx values as x varies from 0 to 100 %.

NGx plot shows NGx values as x varies from 0 to 100 %.

GC content plot shows the distribution of GC content in the contigs.

The x value is the GC percentage (0 to 100 %).
The y value is the number of non-overlapping 100 bp windows whose GC content equals x %.

For a single genome, the distribution is typically Gaussian. However, for assemblies with contaminants, the GC distribution appears to be a superposition of Gaussian distributions, giving a plot with multiple peaks.
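The data behind the GC content plot can be approximated as follows. This is a sketch under stated assumptions: a trailing fragment shorter than a full window is ignored, and GC percentages are rounded to whole numbers; QUAST may handle both differently.

```python
from collections import Counter

WINDOW = 100  # window size in bp

def gc_histogram(contigs):
    """Count non-overlapping 100 bp windows by their GC percentage."""
    histogram = Counter()
    for sequence in contigs:
        sequence = sequence.upper()
        # Assumption: a trailing fragment shorter than WINDOW is skipped.
        for start in range(0, len(sequence) - WINDOW + 1, WINDOW):
            window = sequence[start:start + WINDOW]
            gc = window.count("G") + window.count("C")
            histogram[int(round(100.0 * gc / WINDOW))] += 1
    return histogram

contig = "GC" * 50 + "AT" * 50 + "ACGT" * 10  # 240 bp toy contig
print(sorted(gc_histogram([contig]).items()))
```

The toy contig yields one window with 100 % GC and one with 0 % GC; its last 40 bp do not fill a window and are skipped.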

Cumulative length plot for aligned contigs shows the growth of lengths of aligned blocks. If a contig has a misassembly, QUAST breaks it into smaller pieces called aligned blocks.

On the x-axis, blocks are ordered from the largest to smallest. The y-axis gives the size of the x largest aligned blocks.
This plot is created only if a reference genome is provided.

NAx and NGAx plots
These plots are similar to the Nx and NGx plots but for the NAx and NGAx metrics respectively. These plots are created only if a reference genome is provided.

Genes plot shows the growth rate of full genes in assemblies.
The y-axis is the number of full genes in the assembly, and the x-axis is the number of contigs in the assembly (from the largest one to the smallest one).
This plot is created only if a reference and a gene annotation file are given.

Operons plot is similar to the previous one but for operons.

4. Adjusting QUAST reports and plots

You can easily change the content, order of metrics, and metric names in all QUAST reports. To do this, edit the CONFIGURABLE PARAMETERS section in libs/reporting.py. It contains many informative comments, which will help you adjust QUAST reports even if you are new to Python.

You can also adjust plot colors, style and width of lines, legend font, plot output format, etc. Please see the CONFIGURABLE PARAMETERS section in libs/plotter.py.

Note: if you restart QUAST on the same directory with new parameters, it will reuse alignments and run much faster. See the description of the -o option in section 2.4.

5. Citation

6. Feedback and bug reports

We will be thankful if you help us make QUAST better by sending your comments, bug reports, and suggestions to quast.support@bioinf.spbau.ru.

We kindly ask you to attach the quast.log file from the output directory (or an archive of the entire folder) if you have trouble running QUAST.

Note that if you didn't specify the output directory manually, it is going to be automatically set to quast_results/results_<date_time>, with a symbolic link quast_results/latest to that directory.

7. FAQ

This section contains the most popular questions about QUAST output. Read the answers for a deeper understanding of the results generated by the tool.

Several answers describe files under the <quast_output_dir> directory.
If you use the command-line version of QUAST, you specify <quast_output_dir> with the -o option; otherwise it is "quast_results/latest" by default.
If you use http://quast.bioinf.spbau.ru/, download the full report by pressing the "Download report" button (in the top-right corner), decompress the result, and go to the "full_report" subdirectory.

Q1. It seems that QUAST is giving me a differing number of misassemblies and misassembled contigs. Does this imply that QUAST looks for multiple misassemblies within one contig?

Yes, you are right, QUAST looks for multiple misassemblies within one contig. Thus, the number of misassembled contigs is always less than or equal to the number of misassemblies.

Yes, there is such a way.
QUAST copies all misassembled contigs of the "<assembly_name>" assembly into the <quast_output_dir>/contigs_reports/<assembly_name>.mis_contigs.fa file.
E.g. if your assembly is "contigs.fasta", the file is "contigs.mis_contigs.fa"; if your assembly is "ecoli_assembly_1.fasta", the file is "ecoli_assembly_1.mis_contigs.fa".

Q3. Is it possible to find which misassembly corresponds to each contig and which kind of a misassembly it is?

Yes, it is possible.
You should open the <quast_output_dir>/contigs_reports/contigs_report_<assembly_name>.stdout file.
E.g. if your assembly is "contigs.fasta", the file is "contigs_report_contigs.stdout"; if your assembly is "ecoli_assembly_1.fasta", the file is "contigs_report_ecoli_assembly_1.stdout".

After that, search for "Extensive misassembly" in the file and look at the surrounding lines to find the name of the contig that corresponds to this misassembly.

Let's look at the following example:

Q4. Could you explain the format of Real Alignments in contigs report files (see the answer for Q3)?

There are two output files concerning SNPs. Both of them are saved in the <quast_output_dir>/contigs_reports/nucmer_output/ directory.
The first one has the extension ".all_snps" and contains raw output of the Nucmer aligner. Its format is: