getSubProjReads.pl
getSubProjReads.pl [options] <readinfo.txt> <gapdirs.txt> <libinfo.txt> <contigs.fasta> <contigs.fasta.qual>
Options:
    -od <dir>     Output directory (optional; default is current directory).
    -log <file>   Log file (optional; default is getSubProjectReads.pl.log).
    -warn <file>  Warnings log file (optional; default is defined in gapRes.config).
    -h            Detailed message (optional).
This is a wrapper program for the Gap Resolution subsystem. It is responsible for getting reads and their pairs in the unique and repeat regions of each contig in a sub project directory.
For each sub project directory and contig left or right of the gap, the program performs the following steps:
1. Get reads + pairs in the unique region of the contig by calling getReadsInUnique.pl (configurable). Save list in a file within the sub project's directory as <contigName>.unique.reads (extension configurable).
2. Get reads in the repeat region of the contig by calling getReadsInRepeat.pl (configurable). Save list in a file within the sub project's directory as <contigName>.repeat.reads (extension configurable).
3. Look for read pairs from the contig of interest that are in contigs from different scaffolds using getRepeatContig.pl (configurable). At least 2 (configurable) read pairs must be present. If found, check that the "repeat" contig is longer than 250 bases (configurable) and shorter than the gap size plus a gap padding (configurable). For each repeat contig, create the fasta and qual files in the sub project directory and a <contigName>.repeatContigs.txt (extension configurable) file containing a list of the repeat contig names. The position within the contig to look for reads is determined by the largest library insert size + 2 (configurable) x standard deviation.
4. For each repeat contig found, get reads for the entire contig using getReadsInRepeat.pl (configurable). Save the list in a file within the sub project directory as <contigName>.repeat.reads (extension configurable). For each repeat contig fasta and qual file created as fakes, check if fasta sequence is > 2kb (configurable). If so, shred fasta and qual to 1kb (configurable) fragments with 100bp (configurable) overlap.
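The shredding rule in step 4 can be sketched as follows. This is a minimal illustration in Python, not the code of shredFasta.pl: the function name shred_consensus and the handling of the final (possibly shorter) fragment are assumptions, and only the default values (2000, 1000, 100) come from this documentation.

```python
def shred_consensus(seq, quals, fragment_len=1000, overlap=100, min_len=2000):
    """Split a consensus sequence and its quality values into overlapping fragments.

    Hypothetical helper. Sequences that are not longer than min_len are
    returned as a single unshredded piece. The last fragment may be shorter
    than fragment_len (an assumption about tail handling).
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lists must have the same length")
    if overlap >= fragment_len:
        raise ValueError("overlap must be smaller than the fragment length")
    if len(seq) <= min_len:
        return [(seq, quals)]

    step = fragment_len - overlap  # distance between fragment start positions
    fragments = []
    start = 0
    while True:
        end = start + fragment_len
        fragments.append((seq[start:end], quals[start:end]))
        if end >= len(seq):  # this fragment reached the end of the sequence
            break
        start += step
    return fragments
```

With the defaults, a 2500-base consensus yields fragments starting at positions 0, 900 and 1800, and each fragment shares its first 100 bases with the end of the previous one.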
For more information regarding each software component called, refer to its documentation.
The following output files are created within each sub project directory:
* <contig>.unique.reads - list of reads to assemble from the unique region of the contig adjacent to the gap (one for the left contig, one for the right contig).
* <contig>.repeat.reads - list of reads to assemble from the repeat region of the contig adjacent to the gap or from the repeat contig (if one exists).
* <contig>.repeatContigs.txt - list of repeat contig names found from read pairs belonging to the contig adjacent to the gap (one for the left contig, one for the right contig). If no repeat contig is found, this file is not created.
* fastas/<contig>.fasta - fasta sequence of the repeat contig consensus (fake read). If no repeat contig is found, this file is not created. This file is created in the repeatFastas (configurable) sub directory within the sub project directory.
* fastas/<contig>.fasta.qual - quality values of the repeat contig consensus (fake read). If no repeat contig is found, this file is not created. This file is created in the repeatFastas (configurable) sub directory within the sub project directory.
* readlist.txt - list of output read pairing info files created by this software.
The file format of the <contig>.unique.reads and <contig>.repeat.reads files is the same as that of the readinfo.txt file. For more details, refer to the documentation for newblerAce2ReadPair.pl.
A default config file named gapRes.config residing in <installPath>/config is used to specify the following parameters:
(configurable)
getSubProjReads.libInsertSizeStdDevMultiplier=2 Specify the multiplier of the library insert size standard deviation to determine the distance from the end of the repeat contig to grab reads. The maximum library insert size defined in the libinfo.txt file is used.
getSubProjReads.minRepeatContigLength=250 Specify the minimum repeat contig length to be considered for creating fakes and grabbing reads from.
getSubProjReads.gapSizePadding=0 Specify the padding to add/subtract from the gap size to determine the maximum repeat contig length such that it can fit inside the gap.
getSubProjReads.shredRepeatConsensus=1 Specify whether to shred the repeat consensus fasta and qual files.
getSubProjReads.shredRepeatConsensusIfGreaterThanThisLength=2000 Specify the minimum length of the repeat contig to be considered for shredding.
shredFasta.fragmentLength=1000 Specify the fragment length when shredding the repeat contig consensus.
shredFasta.overlapLength=100 Specify the overlap length when shredding the repeat contig consensus.
getSubProjReads.minNumReadLinksInRepeatContig=2 Specify the minimum number of read links between the contig belonging to the gap and the repeat contig outside of the scaffold.
getSubProjReads.keepTempContigReadInfoFiles=2 Keep tmp directory containing read info files by contig.
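How the parameters above combine can be sketched as follows. This is a minimal illustration in Python, not the implementation used by getSubProjReads.pl or getRepeatContig.pl: the function names and the (insert size, standard deviation) tuple layout are hypothetical, and using the standard deviation of the library with the largest insert size is an assumption. Only the default values are taken from this documentation.

```python
def read_search_distance(libraries, std_dev_multiplier=2):
    """Distance from the contig end within which reads are looked for.

    libraries: list of (insert_size, std_dev) tuples, a hypothetical
    in-memory form of libinfo.txt. Assumes the standard deviation of the
    library with the largest insert size is the one that applies.
    """
    insert_size, std_dev = max(libraries, key=lambda lib: lib[0])
    return insert_size + std_dev_multiplier * std_dev


def accept_repeat_contig(contig_len, num_read_links, gap_size,
                         min_contig_len=250, gap_size_padding=0,
                         min_read_links=2):
    """Return True if a candidate repeat contig passes all three checks."""
    has_enough_links = num_read_links >= min_read_links
    long_enough = contig_len > min_contig_len
    fits_in_gap = contig_len < gap_size + gap_size_padding
    return has_enough_links and long_enough and fits_in_gap
```

For example, with libraries of 3000 +/- 300 and 8000 +/- 800, the search distance under these assumptions is 8000 + 2 x 800 = 9600 bases.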
(system configuration)
script.getReadsInUnique=getReadsInUnique.pl
script.getReadsInRepeat=getReadsInRepeat.pl
script.getRepeatContig=getRepeatContig.pl
getSubProjReads.repeatFastaFileExtension=.repeat.fasta
getSubProjReads.repeatQualFileExtension=.repeat.fasta.qual
getSubProjReads.repeatContigListFileExtension=.repeatContigs.txt
idContigRepeats.boundaryFileExtension=.boundary
getSubProjReads.readListFileName=readlist.txt
getSubProjReads.directoryOfRepeatContigConsensus=fastas
getSubProjReads.uniqueReadsFileExtension=.unique.reads
getSubProjReads.repeatReadsFileExtension=.repeat.reads
getSubProjReads.outputFastaDirectory=fastas
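The way the extension settings map to output file names can be sketched as follows. This is a minimal illustration in Python under stated assumptions: the exact syntax of gapRes.config is not described here, so plain key=value lines with optional "#" comment lines are assumed, and the helper names parse_config and unique_reads_path are hypothetical.

```python
def parse_config(text):
    """Parse key=value lines into a dict.

    Assumes one setting per line; blank lines and lines starting with '#'
    are skipped (the comment syntax is an assumption).
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without an '=' sign
            settings[key.strip()] = value.strip()
    return settings


def unique_reads_path(settings, subproject_dir, contig_name):
    """Build <subproject_dir>/<contigName><extension> for the unique reads list."""
    ext = settings.get("getSubProjReads.uniqueReadsFileExtension", ".unique.reads")
    return f"{subproject_dir}/{contig_name}{ext}"
```

A usage example with a hypothetical sub project directory and contig name:

```python
config = parse_config("getSubProjReads.uniqueReadsFileExtension=.unique.reads\n")
path = unique_reads_path(config, "subproject1", "contig00012")
# path is "subproject1/contig00012.unique.reads"
```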
The following scripts (configurable in config file) must exist in the same path as getSubProjReads.pl unless the path to the script is defined in the config file:
* getReadsInUnique.pl
* getReadsInRepeat.pl
* getRepeatContig.pl
* shredFasta.pl
The following describes the input files used by getSubProjReads.pl:
* readinfo.txt - read pairing file created by newblerAce2ReadPair.pl
* gapdirs.txt - list of gap directories created by createSubProject.pl
* 454Scaffolds.txt - agp formatted file containing scaffold information created by Newbler
* libinfo.txt - library insert size and std dev file created by parseNewblerMetrics.pl
* contigs.fasta - fasta file of all contigs in the assembly
* contigs.fasta.qual - qual file of all contigs in the assembly
getSubProjReads.pl also expects a scaffinfo.txt file and a <contigName>.boundary file for each of the gap contigs within each sub project directory. The scaffinfo.txt file is created by createSubProject.pl. The <contigName>.boundary files are created by idRepeatBoundary.pl.
For more information regarding the formats of these files, refer to the documentation of the scripts that are used to create them.
$Revision: 1.21 $
$Date: 2010-03-06 14:46:14 $
Stephan Trong
S.Trong 2008/11/11 creation
S.Trong 2009/08/05 - added ability to skip and create warnings file if sub project fails.
S.Trong 2009/12/29 - added -log and -warn options.