dipSPAdes 1.0 manual

1. What is dipSPAdes?
    1.1 dipSPAdes pipeline
2. Installing dipSPAdes
3. Running dipSPAdes
    3.1 dipSPAdes input
    3.2 dipSPAdes command line options
         3.2.1 Basic options
         3.2.2 Input data
         3.2.3 Advanced options
         3.2.4 Examples
         3.2.5 Examples of advanced options usage
    3.3 dipSPAdes output
         3.3.1 Haplocontigs alignment output
         3.3.2 Haplotype assembly output
4. Citation
5. Feedback and bug reports

1. What is dipSPAdes?

dipSPAdes is a genome assembler, based on SPAdes, designed specifically for highly polymorphic diploid genomes. It takes advantage of divergence between haplomes in repetitive genome regions to resolve them and construct longer contigs. dipSPAdes produces consensus contigs (representing a consensus of both haplomes for the orthologous regions) and performs haplotype assembly. Note that dipSPAdes benefits only from a high polymorphism rate (at least 0.4%). For data with a low polymorphism rate, no improvement in N50 over conventional assemblers is expected.
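
For reference, N50 is the contig length such that contigs of that length or longer contain at least half of all assembled bases. The minimal Python sketch below computes it from a list of contig lengths. It follows the common definition of N50 and is not taken from dipSPAdes itself; the function name n50 is only illustrative.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the largest length L such that contigs of length >= L
    together cover at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # no contigs at all


# Total length is 54, so half is 27; 10 + 9 + 8 reaches 27.
print(n50([10, 9, 8, 7, 6, 5, 4, 3, 2]))  # prints 8
```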

1.1 dipSPAdes pipeline

The dipSPAdes pipeline consists of three steps:
    1. Assembly of haplocontigs (contigs representing both haplomes).
    2. Consensus contigs construction.
    3. Haplotype assembly.

2. Installing dipSPAdes

dipSPAdes comes as a part of the SPAdes assembler package.
See the SPAdes manual for installation instructions.
Please verify your dipSPAdes installation before running dipSPAdes:


    <spades installation dir>/dipspades.py --test

If the installation is successful, you will find the following information at the end of the log:


 * Assembled consensus contigs are in: test_dipspades/dipspades/consensus_contigs.fasta
 * Assembled paired consensus contigs are in: test_dipspades/dipspades/paired_consensus_contigs.fasta
 * Assembled unpaired consensus contigs are in: test_dipspades/dipspades/unpaired_consensus_contigs.fasta
 * Alignment of haplocontigs is in: test_dipspades/dipspades/haplocontigs_alignment
 * Assembled paired consensus contigs are in: test_dipspades/dipspades/haplotype_assembly.out
 * Possibly conservative regions are in: test_dipspades/dipspades/possibly_conservative_regions.fasta

Thank you for using SPAdes!

======= dipSPAdes finished.
dipSPAdes log can be found here: test_dipspades/dipspades/dipspades.log
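
The same check can be scripted. The sketch below assumes the test run wrote its results to test_dipspades/dipspades, as in the log excerpt above, and simply reports which of the listed files are absent. The helper name missing_test_outputs is hypothetical, and the haplocontigs alignment file is left out because its exact name is not certain from the excerpt.

```python
from pathlib import Path

# File names taken from the end-of-log summary of the test run.
EXPECTED_TEST_OUTPUTS = (
    "consensus_contigs.fasta",
    "paired_consensus_contigs.fasta",
    "unpaired_consensus_contigs.fasta",
    "haplotype_assembly.out",
    "possibly_conservative_regions.fasta",
    "dipspades.log",
)


def missing_test_outputs(test_dir="test_dipspades"):
    """Return the expected files that are absent under <test_dir>/dipspades."""
    result_dir = Path(test_dir) / "dipspades"
    return [name for name in EXPECTED_TEST_OUTPUTS
            if not (result_dir / name).is_file()]
```

An empty list means that every listed file was found; any name in the list points to an output worth looking for in dipspades.log.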

3. Running dipSPAdes

3.1 dipSPAdes input

dipSPAdes accepts one of the following three kinds of input:

Reads. dipSPAdes takes them in the same format as described in the SPAdes manual. In this case dipSPAdes runs SPAdes to obtain haplocontigs as the first step, "Assembly of haplocontigs".
Haplocontigs. dipSPAdes can use user-provided haplocontigs (for example, computed with another assembler). In this case dipSPAdes skips the first step and starts from the second step, "Consensus contigs construction".
Reads and haplocontigs. dipSPAdes can also use both reads and haplocontigs. In this case dipSPAdes first computes haplocontigs from the reads and then uses the mixture of computed and user-provided haplocontigs as input for the further steps.

Example command lines for each of these scenarios are given in the Examples section (3.2.4).

3.2 dipSPAdes command line options

To run dipSPAdes from the command line, type


dipspades.py [options] -o <output_dir>

Note that we assume that the SPAdes installation directory is added to the PATH variable (otherwise, provide the full path to the dipSPAdes executable: <spades installation dir>/dipspades.py).

3.2.1 Basic options

-o <output_dir>
Specifies the output directory. Required option.

--test
Runs dipSPAdes on the toy data set; see section 2.

-h (or --help)
Prints help.

3.2.2 Input data

To specify input reads, use the SPAdes options described in the SPAdes manual.

--hap <file_name>
Specifies a file with haplocontigs in FASTA format. The option can be repeated: dipSPAdes can use any number of haplocontig files.

3.2.3 Advanced options

--expect-gaps
Indicates that a significant number of gaps in genome coverage is expected (e.g. for datasets with relatively low coverage).

--expect-rearrangements
Indicates an extreme heterozygosity rate in haplomes (e.g. haplomes differ by long insertions/deletions).

--hap-assembly
Enables the haplotype assembly phase, which produces the files haplotype_assembly.out, conservative_regions.fasta, and possibly_conservative_regions.fasta (see Haplotype assembly output).

3.2.4 Examples

To assemble a diploid genome (construct consensus contigs and perform haplotype assembly) from paired-end reads (reads_left.fastq and reads_right.fastq), run:


dipspades.py -1 reads_left.fastq -2 reads_right.fastq -o output_dir

To assemble a diploid genome (construct consensus contigs and perform haplotype assembly) from previously computed haplocontigs (haplocontigs1.fasta and haplocontigs2.fasta), run:


dipspades.py --hap haplocontigs1.fasta --hap haplocontigs2.fasta -o output_dir

To assemble a diploid genome from both reads (reads_left.fastq and reads_right.fastq) and previously computed haplocontigs (haplocontigs.fasta), run:


dipspades.py -1 reads_left.fastq -2 reads_right.fastq --hap haplocontigs.fasta -o output_dir
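
When dipSPAdes is launched from a script, the three scenarios above differ only in which input options are present. The sketch below builds the argument list for each case using only the options shown in this manual (-1, -2, --hap, -o). The function name build_dipspades_command and its parameters are hypothetical, and dipspades.py is assumed to be on the PATH.

```python
def build_dipspades_command(output_dir, left_reads=None, right_reads=None,
                            haplocontigs=()):
    """Build an argument list for reads, haplocontigs, or both."""
    if (left_reads is None) != (right_reads is None):
        raise ValueError("paired-end reads need both a left and a right file")
    if left_reads is None and not haplocontigs:
        raise ValueError("provide reads, haplocontigs, or both")

    command = ["dipspades.py"]
    if left_reads is not None:
        command += ["-1", left_reads, "-2", right_reads]
    # --hap is repeated once per haplocontig file.
    for hap_file in haplocontigs:
        command += ["--hap", hap_file]
    command += ["-o", output_dir]
    return command


# The returned list can be passed to subprocess.run(command, check=True).
print(" ".join(build_dipspades_command(
    "output_dir", haplocontigs=["haplocontigs1.fasta", "haplocontigs2.fasta"])))
```

Returning a list rather than a single string avoids shell quoting problems when file names contain spaces.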

3.2.5 Examples of advanced options usage

To assemble a diploid genome from paired-end reads with the advanced option --expect-gaps, run:


dipspades.py -1 reads_left.fastq -2 reads_right.fastq --expect-gaps -o output_dir

To relaunch steps 2 and 3 of dipSPAdes (see the dipSPAdes pipeline section) with a different set of advanced options, you can reuse the haplocontigs constructed in the previous run (see the dipSPAdes output section):


dipspades.py --hap output_dir/haplocontigs.fasta --expect-gaps --expect-rearrangements --hap-assembly -o new_output_dir
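
Trying several combinations of advanced options on the same haplocontigs is easier with a small helper. The sketch below assumes that the previous run left haplocontigs.fasta directly in its output directory, as in the command above, and passes it with the --hap option described in section 3.2.2. The function name build_relaunch_command and its keyword arguments are hypothetical.

```python
def build_relaunch_command(previous_output_dir, new_output_dir,
                           expect_gaps=False, expect_rearrangements=False,
                           hap_assembly=False):
    """Build an argument list that reruns steps 2 and 3 on saved haplocontigs."""
    # Assumed location of the haplocontigs from the previous run.
    haplocontigs = previous_output_dir.rstrip("/") + "/haplocontigs.fasta"
    command = ["dipspades.py", "--hap", haplocontigs]
    if expect_gaps:
        command.append("--expect-gaps")
    if expect_rearrangements:
        command.append("--expect-rearrangements")
    if hap_assembly:
        command.append("--hap-assembly")
    command += ["-o", new_output_dir]
    return command


print(" ".join(build_relaunch_command("output_dir", "new_output_dir",
                                      expect_gaps=True)))
```

Each combination should be given its own new output directory so that the results can be compared afterwards.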

3.3 dipSPAdes output

dipSPAdes produces the following output:

haplocontigs.fasta - file in FASTA format with computed haplocontigs (produced only if input reads were provided).
consensus_contigs.fasta - file in FASTA format with the set of constructed consensus contigs.
paired_consensus_contigs.fasta - file in FASTA format with the subset of consensus contigs that have a polymorphism detected on them.
unpaired_consensus_contigs.fasta - file in FASTA format with the subset of consensus contigs that have no polymorphism detected on them. These contigs are potentially redundant.
haplocontigs_alignment.out - file with recorded haplocontigs that correspond to homologous regions on haplomes.
haplotype_assembly.out - result of haplotype assembly.
conservative_regions.fasta - file in FASTA format with conservative regions of the diploid genome.
possibly_conservative_regions.fasta - file in FASTA format with unresolved regions of haplocontigs that may be either conservative or repetitive.

3.3.1 Haplocontigs alignment output

File haplocontigs_alignment.out consists of blocks of the following structure:


Consensus contig: CONSENSUS_CONTIG_NAME
    Overlapping haplocontigs:
        HAPLOCONTIG_NAME_1 HAPLOCONTIG_NAME_2
                         ...
    Nested haplocontigs:
        HAPLOCONTIG_NAME_3 HAPLOCONTIG_NAME_4
                        ...

Each block corresponds to the alignment of haplocontigs to the consensus contig CONSENSUS_CONTIG_NAME. The name of the consensus contig, CONSENSUS_CONTIG_NAME, coincides with its name in the file consensus_contigs.fasta. It is followed by a list of pairs of haplocontig names. The haplocontigs in each pair at least partially correspond either to the same positions on the same haplome or to homologous positions on different haplomes. The list is divided into two subblocks: Overlapping haplocontigs and Nested haplocontigs. Overlapping haplocontigs contains pairs of haplocontigs such that a suffix of the first haplocontig corresponds to a prefix of the second haplocontig. Nested haplocontigs contains pairs of haplocontigs such that a certain subcontig of the second haplocontig corresponds to the entire first haplocontig.
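
The block structure is simple enough to read with a few lines of Python. The sketch below is an illustration, not part of dipSPAdes. It assumes that the header lines appear exactly as shown above, that the two names in a pair are separated by whitespace, and that haplocontig names contain no whitespace. The function name parse_haplocontigs_alignment is hypothetical.

```python
def parse_haplocontigs_alignment(lines):
    """Parse blocks into dicts with 'consensus', 'overlapping' and 'nested' keys."""
    blocks = []
    current = None   # block being filled
    section = None   # "overlapping" or "nested"
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Consensus contig:"):
            name = line.split(":", 1)[1].strip()
            current = {"consensus": name, "overlapping": [], "nested": []}
            blocks.append(current)
            section = None
        elif line == "Overlapping haplocontigs:":
            section = "overlapping"
        elif line == "Nested haplocontigs:":
            section = "nested"
        elif current is not None and section is not None:
            names = line.split()
            # Lines that are not a pair of names are skipped.
            if len(names) == 2:
                current[section].append((names[0], names[1]))
    return blocks
```

A file can be parsed by passing the open file object, for example `with open("haplocontigs_alignment.out") as handle: blocks = parse_haplocontigs_alignment(handle)`.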

3.3.2 Haplotype assembly output

File haplotype_assembly.out consists of lines of the following structure:


HAPLOCONTIG_NAME_1	HAPLOCONTIG_NAME_2

where HAPLOCONTIG_NAME_1 and HAPLOCONTIG_NAME_2 are the names of homologous haplocontigs that correspond to different haplomes and at least partially correspond to homologous positions in different chromosomes. The names are those of the haplocontigs provided as input with the --hap option or computed at the first step.
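
A minimal sketch for loading these pairs in Python follows. It assumes that each non-empty line holds exactly two names separated by a single tab, as in the line shown above, and it raises an error on anything else. The function name parse_haplotype_assembly is hypothetical.

```python
def parse_haplotype_assembly(lines):
    """Return a list of (haplocontig_1, haplocontig_2) name pairs."""
    pairs = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {number}: expected two tab-separated names")
        pairs.append((fields[0], fields[1]))
    return pairs
```

The resulting list can be used, for example, to look up the partner of a given haplocontig on the other haplome.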

4. Citation

If you use dipSPAdes in your research, please include Safonova, Bankevich, and Pevzner, 2014 in your reference list.

In addition, we would like to list your publications that use our software on our website. Please email the reference, the name of your lab, department and institution to spades.support@bioinf.spbau.ru.

5. Feedback and bug reports

Your comments, bug reports, and suggestions are very welcome.
If you have trouble running dipSPAdes, please provide us with the files params.txt and dipspades.log from the directory <output_dir>.
Address for communications: spades.support@bioinf.spbau.ru.