dipSPAdes 1.0 manual

1. What is dipSPAdes?
    1.1 dipSPAdes pipeline
2. Installing dipSPAdes
3. Running dipSPAdes
    3.1 dipSPAdes input
    3.2 dipSPAdes command line options
         3.2.1 Basic options
         3.2.2 Input data
         3.2.3 Advanced options
         3.2.4 Examples
         3.2.5 Examples of advanced options usage
    3.3 dipSPAdes output
         3.3.1 Haplocontigs alignment output
         3.3.2 Haplotype assembly output
4. Citation
5. Feedback and bug reports

1. What is dipSPAdes?

dipSPAdes is a genome assembler based on SPAdes and designed specifically for highly polymorphic diploid genomes. It takes advantage of the divergence between haplomes in repetitive genome regions to resolve these regions and construct longer contigs. dipSPAdes produces consensus contigs (representing a consensus of both haplomes for the orthologous regions) and performs haplotype assembly. Note that dipSPAdes benefits only from a high polymorphism rate (at least 0.4%). For data with a low polymorphism rate, no improvement in N50 over conventional assemblers is expected.

1.1 dipSPAdes pipeline

The dipSPAdes pipeline consists of three steps:
    1. Assembly of haplocontigs (contigs representing both haplomes).
    2. Consensus contigs construction.
    3. Haplotype assembly.

2. Installing dipSPAdes

dipSPAdes comes as a part of the SPAdes assembler package.
See the SPAdes manual for installation instructions.
Please verify your dipSPAdes installation before running dipSPAdes:

    <spades installation dir>/dipspades.py --test

If the installation is successful, you will find the following information at the end of the log:

 * Assembled consensus contigs are in: test_dipspades/dipspades/consensus_contigs.fasta
 * Assembled paired consensus contigs are in: test_dipspades/dipspades/paired_consensus_contigs.fasta
 * Assembled unpaired consensus contigs are in: test_dipspades/dipspades/unpaired_consensus_contigs.fasta
 * Alignment of haplocontigs is in: test_dipspades/dipspades/haplocontigs_alignment
 * Haplotype assembly is in: test_dipspades/dipspades/haplotype_assembly.out
 * Possibly conservative regions are in: test_dipspades/dipspades/possibly_conservative_regions.fasta

Thank you for using SPAdes!

======= dipSPAdes finished.
dipSPAdes log can be found here: test_dipspades/dipspades/dipspades.log
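
The check described above can also be scripted. The following is a minimal Python sketch, assuming the log ends in the form quoted in this section. The function name dipspades_finished is hypothetical, and the function looks only for the "dipSPAdes finished" marker line shown above.

```python
def dipspades_finished(log_text):
    """Return True if the end of the log contains the completion marker."""
    # Keep only the last few non-empty lines: the marker is printed at the end.
    tail = [line.strip() for line in log_text.splitlines() if line.strip()][-5:]
    return any("dipSPAdes finished" in line for line in tail)


# Tail of the log as quoted in this section.
example_tail = """\
Thank you for using SPAdes!

======= dipSPAdes finished.
dipSPAdes log can be found here: test_dipspades/dipspades/dipspades.log
"""
print(dipspades_finished(example_tail))
```

In practice the text would be read from the dipspades.log file named on the last line of the log.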

3. Running dipSPAdes

3.1 dipSPAdes input

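dipSPAdes can take one of the following three alternatives as input:
    1. Reads, specified with the SPAdes input options.
    2. Previously computed haplocontigs in FASTA format, specified with --hap.
    3. Both reads and previously computed haplocontigs.

Example command lines for each of these scenarios are given in section 3.2.4 (Examples).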

3.2 dipSPAdes command line options

To run dipSPAdes from the command line, type

dipspades.py [options] -o <output_dir>


Note that we assume that the SPAdes installation directory has been added to the PATH variable. Otherwise, provide the full path to the dipSPAdes executable: <spades installation dir>/dipspades.py.

3.2.1 Basic options

-o <output_dir>
    Specifies the output directory. Required option.

--test
    Runs dipSPAdes on the toy data set; see section 2.

-h (or --help)
    Prints help.

3.2.2 Input data

To specify input reads, use the SPAdes options described in the SPAdes manual.

--hap <file_name>
    Specifies file with haplocontigs in FASTA format. Note that dipSPAdes can use any number of haplocontig files.

3.2.3 Advanced options

--expect-gaps
    Indicates that a significant number of gaps in genome coverage is expected (e.g. for datasets with relatively low coverage).

--expect-rearrangements
    Indicates an extremely high heterozygosity rate between haplomes (e.g. haplomes differ by long insertions/deletions).

--hap-assembly
    Enables the haplotype assembly phase, which produces the files haplotype_assembly.out, conservative_regions.fasta, and possibly_conservative_regions.fasta (see section 3.3.2, Haplotype assembly output).

3.2.4 Examples

To assemble a diploid genome (construct consensus contigs and perform haplotype assembly) from paired-end reads (reads_left.fastq and reads_right.fastq), run:

dipspades.py -1 reads_left.fastq -2 reads_right.fastq -o output_dir


To assemble a diploid genome (construct consensus contigs and perform haplotype assembly) from previously computed haplocontigs (haplocontigs1.fasta and haplocontigs2.fasta), run:

dipspades.py --hap haplocontigs1.fasta --hap haplocontigs2.fasta -o output_dir


To assemble a diploid genome from both reads (reads_left.fastq and reads_right.fastq) and previously computed haplocontigs (haplocontigs.fasta), run:

dipspades.py -1 reads_left.fastq -2 reads_right.fastq --hap haplocontigs.fasta -o output_dir
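
The three input scenarios above differ only in which options are present, so the command line can be assembled programmatically. The following Python sketch illustrates this under stated assumptions: the helper build_dipspades_command and its parameter names are hypothetical and not part of dipSPAdes, and it uses only the options documented in this manual (-1, -2, --hap, -o, and the advanced flags of section 3.2.3).

```python
import shlex


def build_dipspades_command(output_dir, left_reads=None, right_reads=None,
                            haplocontigs=(), extra_flags=()):
    """Build an argument list for dipspades.py (hypothetical helper)."""
    if (left_reads is None) != (right_reads is None):
        raise ValueError("paired-end reads need both a left and a right file")
    has_reads = left_reads is not None
    if not has_reads and not haplocontigs:
        raise ValueError("provide reads, haplocontigs, or both")

    cmd = ["dipspades.py"]
    if has_reads:
        cmd += ["-1", left_reads, "-2", right_reads]
    for hap in haplocontigs:
        # --hap is repeated once per haplocontig file.
        cmd += ["--hap", hap]
    cmd += list(extra_flags)  # e.g. "--expect-gaps"
    cmd += ["-o", output_dir]
    return cmd


# Third scenario from this section: reads plus precomputed haplocontigs.
cmd = build_dipspades_command("output_dir",
                              left_reads="reads_left.fastq",
                              right_reads="reads_right.fastq",
                              haplocontigs=["haplocontigs.fasta"])
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) if dipspades.py is on the PATH.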


3.2.5 Examples of advanced options usage

To assemble a diploid genome with an advanced option (here --expect-gaps), run:

dipspades.py -1 reads_left.fastq -2 reads_right.fastq --expect-gaps -o output_dir


To relaunch steps 2 and 3 of dipSPAdes (see section 1.1, dipSPAdes pipeline) with a different set of advanced options, you can reuse the haplocontigs constructed in the previous run (see section 3.3, dipSPAdes output):

dipspades.py --hap output_dir/haplocontigs.fasta --expect-gaps --expect-rearrangements --hap-assembly -o new_output_dir


3.3 dipSPAdes output

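dipSPAdes produces the following output:
    haplocontigs.fasta (haplocontigs computed at the first step)
    consensus_contigs.fasta, paired_consensus_contigs.fasta and unpaired_consensus_contigs.fasta (consensus contigs)
    haplocontigs_alignment.out (alignment of haplocontigs; see section 3.3.1)
    haplotype_assembly.out, conservative_regions.fasta and possibly_conservative_regions.fasta (results of haplotype assembly; see section 3.3.2)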

3.3.1 Haplocontigs alignment output

File haplocontigs_alignment.out consists of blocks of the following structure:

Consensus contig: CONSENSUS_CONTIG_NAME
    Overlapping haplocontigs:
        HAPLOCONTIG_NAME_1 HAPLOCONTIG_NAME_2
                         ...
    Nested haplocontigs:
        HAPLOCONTIG_NAME_3 HAPLOCONTIG_NAME_4
                        ...

Each block describes the alignment of haplocontigs to the consensus contig CONSENSUS_CONTIG_NAME. This name coincides with the name of the contig in the file consensus_contigs.fasta. The name is followed by a list of pairs of haplocontig names. The haplocontigs in each pair at least partially correspond either to the same positions on the same haplome or to homologous positions on different haplomes. The list is divided into two subblocks, Overlapping haplocontigs and Nested haplocontigs. Overlapping haplocontigs contains pairs in which a suffix of the first haplocontig corresponds to a prefix of the second haplocontig. Nested haplocontigs contains pairs in which a subcontig of the second haplocontig corresponds to the entire first haplocontig.
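
The block structure described above is simple enough to parse line by line. The following is a minimal Python sketch, not a reference parser. It assumes that haplocontig names contain no whitespace and that each pair occupies one line, as in the template above. The function name parse_haplocontigs_alignment and the contig names in the example are hypothetical.

```python
def parse_haplocontigs_alignment(text):
    """Parse blocks into {consensus_name: {"overlapping": [...], "nested": [...]}}."""
    blocks = {}
    current = None   # entry of the consensus contig currently being read
    section = None   # "overlapping" or "nested"
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Consensus contig:"):
            name = line.split(":", 1)[1].strip()
            current = blocks.setdefault(name, {"overlapping": [], "nested": []})
            section = None
        elif line == "Overlapping haplocontigs:":
            section = "overlapping"
        elif line == "Nested haplocontigs:":
            section = "nested"
        elif current is not None and section is not None:
            fields = line.split()
            # Assumption: exactly two whitespace-free names per pair line.
            if len(fields) == 2:
                current[section].append((fields[0], fields[1]))
    return blocks


# Made-up example that follows the template in this section.
example = """\
Consensus contig: consensus_1
    Overlapping haplocontigs:
        hap_a hap_b
    Nested haplocontigs:
        hap_c hap_d
"""
result = parse_haplocontigs_alignment(example)
print(result["consensus_1"]["overlapping"])
print(result["consensus_1"]["nested"])
```

In each nested pair the first name is the contained haplocontig and the second is the containing one, following the description above.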

3.3.2 Haplotype assembly output

File haplotype_assembly.out consists of lines of the following structure:

HAPLOCONTIG_NAME_1	HAPLOCONTIG_NAME_2

where HAPLOCONTIG_NAME_1 and HAPLOCONTIG_NAME_2 are the names of homologous haplocontigs that correspond to different haplomes and at least partially correspond to homologous positions in different chromosomes. The names are those of the haplocontigs provided as input with the --hap option or computed at the first step.
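
Since each line holds one pair, the file can be loaded into a lookup table of homologous haplocontigs. The following Python sketch assumes two whitespace-separated names per line and names without embedded whitespace. The function names parse_haplotype_assembly and homolog_map and the example names are hypothetical.

```python
def parse_haplotype_assembly(text):
    """Return the list of (name_1, name_2) pairs, one per non-empty line."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split()
        # Assumption: exactly two whitespace-free names per line.
        if len(fields) != 2:
            raise ValueError(f"expected two names, got: {line!r}")
        pairs.append((fields[0], fields[1]))
    return pairs


def homolog_map(pairs):
    """Map each haplocontig to the set of haplocontigs paired with it."""
    partners = {}
    for first, second in pairs:
        partners.setdefault(first, set()).add(second)
        partners.setdefault(second, set()).add(first)
    return partners


# Made-up example with tab-separated names.
example = "hap_a\thap_b\nhap_a\thap_c\n"
pairs = parse_haplotype_assembly(example)
partners = homolog_map(pairs)
print(sorted(partners["hap_a"]))
```

The map is symmetric because the pairing itself has no direction: either haplocontig of a pair can be used as the lookup key.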

4. Citation

If you use dipSPAdes in your research, please include Safonova, Bankevich, and Pevzner, 2014 in your reference list.

In addition, we would like to list your publications that use our software on our website. Please email the reference, the name of your lab, department and institution to spades.support@bioinf.spbau.ru.

5. Feedback and bug reports

Your comments, bug reports, and suggestions are very welcome.
If you have trouble running dipSPAdes, please provide us with the files params.txt and dipspades.log from the directory <output_dir>.
Address for communications: spades.support@bioinf.spbau.ru.