validateGapAssembly.pl
validateGapAssembly.pl [options] <leftAnchorFastaFile> <rightAnchorFastaFile> <readinfo.txt> <libinfo.txt> <sffinfo.txt> <contigs.fasta> <contigs.qual> <outputFile (e.g., validinfo.txt)>

Options:
    -gsize <number>              Gap size (required)
    -gsizeStd <number>           Gap size standard deviation (required)
    -aligner <name>              Path and name of aligner to use for aligning anchors to reference (required)
    -alignerParams <params>      Aligner parameters (required)
    -formatdb <name>             Path and name of formatdb (required)
    -pctId <number>              Minimum percent identity for aligning anchors to reference (optional; default=95)
    -alignLen <number>           Minimum alignment length for aligning anchors to reference (optional; default=40)
    -pctValidReads <number>      Minimum percent of valid read pairs (optional; default=90)
    -minQual <number>            Minimum avg consensus quality between anchors (optional; default=30)
    -insertSizeStdMult <number>  Library insert size standard deviation multiplier (optional; default=1)
    -debug                       Prints additional information in output file (optional)
    -h                           Detailed help message (optional)
This software component is part of the Gap Resolution subsystem and is responsible for validating a subproject for closure after it has been reassembled. The following validations are performed:
Unless otherwise noted, "anchor" here refers to the anchor sequence obtained from the left and right contigs of the gap prior to reassembly. The anchors are extracted upstream by idContigRepeats.pl.
1. Validate anchor distance. The left and right anchors are aligned (using -aligner <aligner>) to the contigs of the assembly. If the anchors reside on the same contig and the distance is within the gap size (-gsize option) +/- standard deviation (-gsizeStd option), the anchor distance is considered to be valid. Otherwise, the anchor distance is invalid. Alignments of the anchors to the assembly contigs are filtered by percent identity (-pctId option) and the alignment length (-alignLen option).
2. Validate read pairing. For each read pair, determine the library insert size by mapping each read to its corresponding sff file (via 454's sffinfo script) and looking up the library insert size and standard deviation in the sffinfo.txt and libinfo.txt files. A read pair is deemed valid only if it meets both of the following criteria: a) the two reads are located on the same contig and their distance is within the library insert size +/- standard deviation * a multiplier (-insertSizeStdMult option), and b) the reads are oriented toward each other. Read pairing is considered valid if the percentage of valid read pairs is >= 90% (configurable using -pctValidReads option).
3. Validate consensus quality. If the anchors reside on the same contig of the assembly, the average quality between the anchors is determined and must be >= 30 (configurable using -minQual option) to be considered valid. Otherwise, the consensus quality is invalid.
4. If validation fails and the anchors are on different contigs, set doPrimerDesign=1 for designing primers.
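The four steps above can be sketched in Python as follows. This is only an illustration of the described logic, not the script's actual implementation: the Placement and ReadPair types, the function names, the 1-based inclusive coordinates, the exact definitions of anchor distance and read-pair span, and the treatment of empty inputs are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Placement:
    """Hypothetical location of an anchor or read on a contig (1-based, inclusive)."""
    contig: str
    start: int
    end: int
    strand: str = "+"


@dataclass
class ReadPair:
    first: Placement
    second: Placement
    insert_size: float  # library insert size (as recorded in libinfo.txt)
    insert_std: float   # library insert size standard deviation


def distance_valid(left: Placement, right: Placement, gsize: float, gsize_std: float) -> bool:
    """Step 1: anchors on the same contig, distance within gsize +/- gsize_std."""
    if left.contig != right.contig:
        return False
    distance = right.start - left.end - 1  # assumed definition: bases between anchors
    return abs(distance - gsize) <= gsize_std


def pair_valid(pair: ReadPair, std_mult: float = 1.0) -> bool:
    """Per-pair test: same contig, span inside the insert size window, reads face each other."""
    a, b = sorted((pair.first, pair.second), key=lambda p: p.start)
    if a.contig != b.contig:
        return False
    span = b.end - a.start + 1  # assumed definition: outer span of the pair
    in_window = abs(span - pair.insert_size) <= pair.insert_std * std_mult
    facing = a.strand == "+" and b.strand == "-"
    return in_window and facing


def pairing_valid(pairs: List[ReadPair], std_mult: float = 1.0,
                  pct_valid_reads: float = 90.0) -> bool:
    """Step 2: percentage of valid pairs (assumed: out of all pairs) meets the threshold."""
    if not pairs:
        return False  # assumption: with no pairs, pairing cannot be confirmed
    n_valid = sum(pair_valid(p, std_mult) for p in pairs)
    return 100.0 * n_valid / len(pairs) >= pct_valid_reads


def quality_valid(left: Placement, right: Placement, quals: Sequence[int],
                  min_qual: float = 30.0) -> bool:
    """Step 3: average quality of the bases strictly between the anchors."""
    if left.contig != right.contig:
        return False
    between = quals[left.end:right.start - 1]  # 1-based positions left.end+1 .. right.start-1
    if not between:
        return False  # assumption: an empty region is treated as invalid
    return sum(between) / len(between) >= min_qual


def validate(left, right, pairs, quals, gsize, gsize_std,
             std_mult=1.0, pct_valid_reads=90.0, min_qual=30.0) -> dict:
    """Run the three checks, then apply step 4 (doPrimerDesign)."""
    result = {
        "isDistanceValid": int(distance_valid(left, right, gsize, gsize_std)),
        "isReadPairingValid": int(pairing_valid(pairs, std_mult, pct_valid_reads)),
        "isQualityValid": int(quality_valid(left, right, quals, min_qual)),
    }
    failed = not all(result.values())
    result["doPrimerDesign"] = int(failed and left.contig != right.contig)
    return result
```

Here quals stands for the quality values of the contig that holds both anchors. The default arguments mirror the documented defaults for -insertSizeStdMult, -pctValidReads, and -minQual.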
The specified output file contains information pertaining to the validation as key/value pairs. The following entries are reported:

    leftAnchorContig=name of contig
    leftAnchorContigLength=number
    leftAnchorStart=number
    leftAnchorEnd=number
    rightAnchorContig=name of contig
    rightAnchorContigLength=number
    rightAnchorStart=number
    rightAnchorEnd=number
    anchorStart=number
    anchorEnd=number
    anchorDistance=number
    gapSize=number
    gapSizeStdDev=number
    numConsistentReadPairs=number
    numInconsistentReadPairs=number
    pctConsistent=number
    avgConsensusQualityBetweenAnchors=number
    isDistanceValid=0|1
    isReadPairingValid=0|1
    isQualityValid=0|1
    status=PASS|FAIL
    doPrimerDesign=0|1
    Comment=comment entry
The status is reported as PASS if all three validations passed. Otherwise, it is reported as FAIL.
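A downstream step might read the output file along these lines. This is a minimal sketch that assumes one key=value entry per line; the function names and the sample values are made up for illustration and are not taken from the script.

```python
def parse_validinfo(text: str) -> dict:
    """Parse key=value lines into a dict; lines without '=' are skipped."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)  # split on the first '=' only
        info[key.strip()] = value.strip()
    return info


def all_checks_passed(info: dict) -> bool:
    """True only if all three validation flags are set to 1."""
    flags = ("isDistanceValid", "isReadPairingValid", "isQualityValid")
    return all(info.get(flag) == "1" for flag in flags)


# Made-up sample content, for illustration only.
SAMPLE = """\
anchorDistance=512
isDistanceValid=1
isReadPairingValid=1
isQualityValid=0
doPrimerDesign=0
"""

info = parse_validinfo(SAMPLE)
print(all_checks_passed(info))
```

Checking the three flags directly, rather than the status string, keeps the sketch independent of how the status value is spelled.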
Description of input files:
* leftAnchorFastaFile - fasta file containing the sequence of the left anchor.
* rightAnchorFastaFile - fasta file containing the sequence of the right anchor.
* readinfo.txt - file containing read pairing information of the assembly. This file is generated by newblerAce2ReadPair.pl. For more information on the format of the file, refer to newblerAce2ReadPair.pl's documentation.
* libinfo.txt - file containing library insert size and standard deviation. This file is generated by parseNewblerMetrics.pl. For more information on the format of the file, refer to parseNewblerMetrics.pl's documentation.
* sffinfo.txt - file containing the path of the sff file, its corresponding library, and the type of the sff file. This file is generated by parseNewblerMetrics.pl. For more information on the format of the file, refer to parseNewblerMetrics.pl's documentation.
* contigs.fasta - fasta file containing the contigs of the assembly.
* contigs.qual - qual file corresponding to contigs.fasta.
* outputFile - name of the output file containing the validation information.
Stephan Trong
S.Trong 2008/12/05 - creation.
S.Trong 2009/03/02 - added library insert size checking for phrap based reads.