NAME

validateGapAssembly.pl


SYNOPSIS

  validateGapAssembly.pl [options] <leftAnchorFastaFile> <rightAnchorFastaFile>
    <readinfo.txt> <libinfo.txt> <sffinfo.txt> <contigs.fasta> <contigs.qual>
    <outputFile (e.g., validinfo.txt)>
    
  Options:
  -gsize <number>             Gap size (required)
  -gsizeStd <number>          Gap size standard deviation (required)
  -aligner <name>             Path and name of aligner to use for aligning anchors to reference (required)
  -alignerParams <params>     Aligner parameters (required)
  -formatdb <name>            Path and name of formatdb (required)
  -pctId <number>             Minimum percent identity for aligning anchors to reference (optional; default=95)
  -alignLen <number>          Minimum alignment length for aligning anchors to reference (optional; default=40)
  -pctValidReads <number>     Minimum percent of valid read pairs (optional; default=90)
  -minQual <number>           Minimum avg consensus quality between anchors (optional; default=30)
  -insertSizeStdMult <number> Library insert size standard deviation multiplier (optional; default=1)
  -debug                      Prints additional information in output file (optional)
  -h                          Detailed help message (optional)


DESCRIPTION

This software component is part of the Gap Resolution subsystem. It validates a sub project for closure after it has been reassembled.

Unless otherwise noted, "anchor" here refers to the anchor sequence obtained from the left and right contigs of the gap prior to reassembly. The anchors are generated upstream by idContigRepeats.pl.

The following validations are performed:

1. Validate anchor distance. The left and right anchors are aligned to the contigs of the assembly using the aligner given by the -aligner option. Alignments are filtered by percent identity (-pctId option) and alignment length (-alignLen option). If the anchors reside on the same contig and their distance is within the gap size (-gsize option) +/- the gap size standard deviation (-gsizeStd option), the anchor distance is considered valid. Otherwise, the anchor distance is invalid.

2. Validate read pairing. For each read pair, the library insert size and its standard deviation are determined by mapping the read to its corresponding sff file (via 454's sffinfo script) and looking up the library using the sffinfo.txt and libinfo.txt files. A read pair is considered valid only if it meets both of the following criteria: a) the two reads are located on the same contig and their distance is within the library insert size +/- standard deviation * a multiplier (-insertSizeStdMult option), and b) the reads point towards each other. Read pairing as a whole is considered valid if the percentage of valid read pairs is >= 90% (configurable using the -pctValidReads option).

3. Validate consensus quality. If the anchors reside on the same contig of the assembly, the average quality between the anchors is determined and must be >= 30 (configurable using -minQual option) to be considered valid. Otherwise, the consensus quality is invalid.

4. Flag primer design. If validation fails and the anchors are on different contigs, doPrimerDesign is set to 1 so that primers can be designed.
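
The checks above can be sketched as follows. This is an illustrative Python sketch, not the script's actual Perl implementation. All function and parameter names are hypothetical, the bounds are assumed to be inclusive, and the percentage of valid read pairs is assumed to be computed over all read pairs (valid plus invalid).

```python
def is_distance_valid(left_contig, right_contig, anchor_distance,
                      gap_size, gap_size_std):
    """Check 1: anchors on the same contig, distance within gsize +/- gsizeStd."""
    if left_contig != right_contig:
        return False
    # Assumption: the bounds are inclusive.
    return abs(anchor_distance - gap_size) <= gap_size_std


def is_read_pair_valid(contig_a, contig_b, pair_distance, facing_each_other,
                       insert_size, insert_std, std_mult=1.0):
    """Check 2a/2b: same contig, reads facing each other, distance within
    insert size +/- (standard deviation * insertSizeStdMult)."""
    if contig_a != contig_b or not facing_each_other:
        return False
    return abs(pair_distance - insert_size) <= insert_std * std_mult


def is_read_pairing_valid(num_valid, num_invalid, pct_valid_reads=90.0):
    """Check 2, overall: percent of valid pairs must reach pctValidReads."""
    total = num_valid + num_invalid
    if total == 0:
        # Assumption: with no read pairs there is no supporting evidence.
        return False
    return 100.0 * num_valid / total >= pct_valid_reads


def is_quality_valid(quals_between_anchors, min_qual=30.0):
    """Check 3: average consensus quality between the anchors >= minQual."""
    if not quals_between_anchors:
        return False
    avg = sum(quals_between_anchors) / len(quals_between_anchors)
    return avg >= min_qual


def do_primer_design(all_valid, left_contig, right_contig):
    """Step 4: request primers only when validation failed and the anchors
    landed on different contigs."""
    return int(not all_valid and left_contig != right_contig)
```

For example, with -gsize 1000 and -gsizeStd 200, an anchor distance of 1150 on a shared contig passes the first check, while a distance of 1300 does not.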

The specified output file contains information pertaining to the validation as key/value pairs. The following entries are reported:

  leftAnchorContig=name of contig
  leftAnchorContigLength=number
  leftAnchorStart=number
  leftAnchorEnd=number
  rightAnchorContig=name of contig
  rightAnchorContigLength=number
  rightAnchorStart=number
  rightAnchorEnd=number
  anchorStart=number
  anchorEnd=number
  anchorDistance=number
  gapSize=number
  gapSizeStdDev=number
  numConsistentReadPairs=number
  numInconsistentReadPairs=number
  pctConsistent=number
  avgConsensusQualityBetweenAnchors=number
  isDistanceValid=0|1
  isReadPairingValid=0|1
  isQualityValid=0|1
  status=PASS|FAIL
  doPrimerDesign=0|1
  Comment=comment entry

The status is reported as PASS if all three validations passed. Otherwise, it is reported as FAIL.
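
A downstream tool could read the output file along these lines. This is a minimal Python sketch, assuming one key=value entry per line as in the listing above. The function names and the sample values are hypothetical and for illustration only. Lines without an "=" (for example, extra text that -debug might add) are skipped, which is also an assumption.

```python
def parse_validinfo(text):
    """Parse key=value lines into a dict of strings."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank lines and anything that is not key=value
        # Split on the first "=" only, so values may themselves contain "=".
        key, value = line.split("=", 1)
        info[key.strip()] = value.strip()
    return info


def all_checks_passed(info):
    """True only if all three validation flags are set to 1."""
    flags = ("isDistanceValid", "isReadPairingValid", "isQualityValid")
    return all(info.get(flag) == "1" for flag in flags)


# Made-up sample content, not real output of the script.
sample = """\
leftAnchorContig=contig00001
rightAnchorContig=contig00001
anchorDistance=1150
isDistanceValid=1
isReadPairingValid=1
isQualityValid=1
status=PASS
doPrimerDesign=0
"""

info = parse_validinfo(sample)
print(info["anchorDistance"], all_checks_passed(info))  # prints: 1150 True
```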

Description of input files:

  * leftAnchorFastaFile - fasta file containing the sequence of the left anchor.
  * rightAnchorFastaFile - fasta file containing the sequence of the right anchor.
  * readinfo.txt - file containing read pairing information of the assembly.
    This file is generated by newblerAce2ReadPair.pl. For more information on
    the format of the file, refer to newblerAce2ReadPair.pl's documentation.
  * libinfo.txt - file containing library insert size and standard deviation.
    This file is generated by parseNewblerMetrics.pl.  For more information on
    the format of the file, refer to parseNewblerMetrics.pl's documentation.
  * sffinfo.txt - file containing the path of the sff file, its corresponding
    library and the type of the sff file. This file is generated by
    parseNewblerMetrics.pl. For more information on the format of the file,
    refer to parseNewblerMetrics.pl's documentation.
  * contigs.fasta - fasta file containing the contigs of the assembly.
  * contigs.qual - qual file corresponding to contigs.fasta.
  * outputFile - name of the output file containing the information of the validation.
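
For reference, a qual file such as contigs.qual uses a FASTA-like layout: a ">" header line per contig followed by whitespace-separated integer quality scores. The Python sketch below shows one way to load such a file. The function name is hypothetical, and this is not how validateGapAssembly.pl itself reads the file.

```python
def read_qual(lines):
    """Return {contig_name: [quality, ...]} from FASTA-style .qual lines."""
    quals = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The contig name is the first word after ">".
            fields = line[1:].split()
            name = fields[0] if fields else ""
            quals[name] = []
        elif name is not None:
            # Score lines may wrap, so keep appending to the current contig.
            quals[name].extend(int(q) for q in line.split())
    return quals


# Made-up example lines; a real file would be passed as an open file handle,
# e.g. `with open("contigs.qual") as fh: quals = read_qual(fh)`.
example = [
    ">contig00001 length=5",
    "40 40 38",
    "35 20",
    ">contig00002",
    "10 11",
]
quals = read_qual(example)
```

The resulting per-contig score lists can then be sliced between the anchor positions to compute the average consensus quality.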


VERSION

$Revision$

$Date$


AUTHOR(S)

Stephan Trong


HISTORY