NAME

getSubProjReads.pl


SYNOPSIS

  getSubProjReads.pl [options] <readinfo.txt> <gapdirs.txt> <libinfo.txt>
  <contigs.fasta> <contigs.fasta.qual>
  Options:
  -od <dir>     Output directory (optional; default is current directory).
  -log <file>   Log file (optional; default is getSubProjectReads.pl.log).
  -warn <file>  Warnings log file (optional; default is defined in gapRes.config).
  -h            Detailed message (optional).


DESCRIPTION

This is a wrapper program for the Gap Resolution subsystem. It is responsible for getting reads and their pairs in the unique and repeat regions of each contig in a sub project directory.

For each sub project directory, and for each contig to the left or right of the gap, the program performs the following steps:

1. Get reads + pairs in the unique region of the contig by calling getReadsInUnique.pl (configurable). Save list in a file within the sub project's directory as <contigName>.unique.reads (extension configurable).

2. Get reads in the repeat region of the contig by calling getReadsInRepeat.pl (configurable). Save list in a file within the sub project's directory as <contigName>.repeat.reads (extension configurable).

3. Look for read pairs from the contig of interest that are in contigs from different scaffolds using getRepeatContig.pl (configurable). At least 2 (configurable) read pairs must be present. If found, check that the "repeat" contig is longer than 250 (configurable) and shorter than the gap size plus a gap padding (configurable). For each repeat contig, create the fasta and qual files in the sub project directory and a <contig>.repeatContigs.txt (extension configurable) file containing a list of the repeat contig names. The position within the contig to look for reads is determined by using the largest library insert size + 2 (configurable) x standard deviation.

4. For each repeat contig found, get reads for the entire contig using getReadsInRepeat.pl (configurable). Save the list in a file within the sub project directory as <contigName>.repeat.reads (extension configurable). For each repeat contig fasta and qual file created as fakes, check if fasta sequence is > 2kb (configurable). If so, shred fasta and qual to 1kb (configurable) fragments with 100bp (configurable) overlap.
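The arithmetic in step 3 can be sketched as follows. This is an illustration only, not the code used by getRepeatContig.pl: the Library class and both function names are hypothetical, the standard deviation is assumed to be that of the library with the largest insert size, and the comparisons are assumed to be strict as written in the step.

```python
from dataclasses import dataclass


@dataclass
class Library:
    """Hypothetical record for one library listed in libinfo.txt."""
    name: str
    insert_size: int
    std_dev: int


def read_search_window(libraries, std_dev_multiplier=2):
    """Largest insert size + multiplier x std dev (assumed: of that same library)."""
    largest = max(libraries, key=lambda lib: lib.insert_size)
    return largest.insert_size + std_dev_multiplier * largest.std_dev


def is_usable_repeat_contig(contig_length, gap_size, read_pair_links,
                            min_length=250, gap_padding=0, min_links=2):
    """Apply the step 3 checks: enough read pair links, and a length that
    is above the minimum and below gap size + padding."""
    if read_pair_links < min_links:
        return False
    return min_length < contig_length < gap_size + gap_padding


libs = [Library("lib3kb", 3000, 300), Library("lib8kb", 8000, 800)]
print(read_search_window(libs))               # 9600
print(is_usable_repeat_contig(600, 1000, 3))  # True
print(is_usable_repeat_contig(600, 500, 3))   # False
```

The default arguments mirror the default values of libInsertSizeStdDevMultiplier, minRepeatContigLength, gapSizePadding and minNumReadLinksInRepeatContig in gapRes.config.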

For more information regarding each software component called, refer to its documentation.
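The shredding described in step 4 can be pictured with a minimal sketch. It is not the shredFasta.pl implementation: the function name is hypothetical, and how the real script handles the final, shorter fragment is an assumption (here it is simply kept as is).

```python
def shred(sequence, fragment_length=1000, overlap=100, min_length=2000):
    """Split a sequence into overlapping fragments.

    Sequences no longer than min_length are returned unshredded.
    Works on any sliceable sequence, so the same call can be used for a
    list of quality values. Keeping the short last fragment is an assumption.
    """
    if overlap >= fragment_length:
        raise ValueError("overlap must be smaller than fragment_length")
    if len(sequence) <= min_length:
        return [sequence]

    step = fragment_length - overlap  # each fragment starts this far after the previous one
    fragments = []
    start = 0
    while True:
        fragments.append(sequence[start:start + fragment_length])
        if start + fragment_length >= len(sequence):
            break
        start += step
    return fragments


fragments = shred("A" * 2500)
print([len(f) for f in fragments])  # [1000, 1000, 700]
```

With the default values, consecutive fragments start 900 bases apart, so the last 100 bases of one fragment are the first 100 bases of the next.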

The following output files are created within each sub project directory:

  * <contig>.unique.reads - list of reads to assemble from the unique region of
    the contig adjacent to the gap (one for the left contig, one for the right
    contig).
  
  * <contig>.repeat.reads - list of reads to assemble from the repeat region of
    the contig adjacent to the gap or from the repeat contig (if exists).
  
  * <contig>.repeatContigs.txt - list of repeat contig names found from read
    pairs belonging to the contig adjacent to the gap (one for the left contig,
    one for the right contig). If no repeat contig is found, this file is not
    created.
  
  * fastas/<contig>.fasta - fasta sequence of the repeat contig consensus (fake
    read). If no repeat contig is found, this file is not created. This file is
    created in the fastas (configurable) sub directory within the sub project
    directory.
  
  * fastas/<contig>.fasta.qual - quality values of the repeat contig consensus
    (fake read). If no repeat contig is found, this file is not created. This
    file is created in the fastas (configurable) sub directory within the sub
    project directory.
  
  * readlist.txt - list of output read pairing info files created by this software.

The <contig>.unique.reads and <contig>.repeat.reads files use the same format as the readinfo.txt file. For more details, refer to the documentation for newblerAce2ReadPair.pl.

A default config file named gapRes.config residing in <installPath>/config is used to specify the following parameters:

(configurable)

  getSubProjReads.libInsertSizeStdDevMultiplier=2
    Specify the multiplier of the library insert size standard deviation
    to determine the distance from the end of the repeat contig to grab reads.
    The maximum library insert size defined in libinfo.txt is used.
  getSubProjReads.minRepeatContigLength=250
    Specify the minimum repeat contig length to be considered for creating fakes
    and grabbing reads from.
  getSubProjReads.gapSizePadding=0
    Specify the padding to add/subtract from the gap size to determine the
    maximum repeat contig length such that it can fit inside the gap.
  getSubProjReads.shredRepeatConsensus=1
    Specify whether to shred the repeat consensus fasta and qual files.
  getSubProjReads.shredRepeatConsensusIfGreaterThanThisLength=2000
    Specify the minimum length of the repeat contig to be considered for
    shredding.
  shredFasta.fragmentLength=1000
    Specify the fragment length when shredding the repeat contig consensus.
  shredFasta.overlapLength=100
    Specify the overlap length when shredding the repeat contig consensus.
  getSubProjReads.minNumReadLinksInRepeatContig=2
    Specify the minimum number of read links between the contig belonging
    to the gap and the repeat contig outside of the scaffold.
  getSubProjReads.keepTempContigReadInfoFiles=2
    Keep tmp directory containing read info files by contig.

(system configuration)

  script.getReadsInUnique=getReadsInUnique.pl
  script.getReadsInRepeat=getReadsInRepeat.pl
  script.getRepeatContig=getRepeatContig.pl
  getSubProjReads.repeatFastaFileExtension=.repeat.fasta
  getSubProjReads.repeatQualFileExtension=.repeat.fasta.qual
  getSubProjReads.repeatContigListFileExtension=.repeatContigs.txt
  idContigRepeats.boundaryFileExtension=.boundary
  getSubProjReads.readListFileName=readlist.txt
  getSubProjReads.directoryOfRepeatContigConsensus=fastas
  getSubProjReads.uniqueReadsFileExtension=.unique.reads
  getSubProjReads.repeatReadsFileExtension=.repeat.reads
  getSubProjReads.outputFastaDirectory=fastas
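
The parameters above are plain key=value pairs. A minimal sketch of reading them is shown here; it is not the parser used by the Gap Resolution scripts. The function name is hypothetical, and treating blank lines and lines starting with # as ignorable is an assumption about the file format.

```python
def parse_config(text):
    """Return a dict of key=value settings from config file text.

    Assumptions: one setting per line, blank lines and lines starting
    with '#' are skipped, and everything after the first '=' is the value.
    """
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


example = """
getSubProjReads.minRepeatContigLength=250
shredFasta.fragmentLength=1000
shredFasta.overlapLength=100
script.getRepeatContig=getRepeatContig.pl
"""

config = parse_config(example)
# Values are read as strings; numeric settings need an explicit conversion.
print(int(config["shredFasta.fragmentLength"]) - int(config["shredFasta.overlapLength"]))  # 900
```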


DEPENDENCIES

The following scripts (configurable in config file) must exist in the same path as getSubProjReads.pl unless the path to the script is defined in the config file:

  * getReadsInUnique.pl
  * getReadsInRepeat.pl
  * getRepeatContig.pl
  * shredFasta.pl
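
The lookup rule stated above (use the path from the config file if one is given, otherwise look next to getSubProjReads.pl) can be sketched as follows. The function name and the example directories are hypothetical, and the rule that a configured value counts as a path when it has a directory part is an assumption.

```python
import os


def resolve_script(configured_value, wrapper_dir):
    """Return the location of a helper script.

    configured_value is the value from the config file, for example the
    value of script.getReadsInUnique. If it carries a directory part it is
    used unchanged; otherwise the script is expected in wrapper_dir, the
    directory that holds getSubProjReads.pl.
    """
    if os.path.dirname(configured_value):
        return configured_value
    return os.path.join(wrapper_dir, configured_value)


# Hypothetical install locations, for illustration only.
print(resolve_script("getReadsInUnique.pl", "/opt/gapRes/bin"))
print(resolve_script("/usr/local/bin/shredFasta.pl", "/opt/gapRes/bin"))
```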

The input files used by getSubProjReads.pl are described below.

  * readinfo.txt - read pairing file created by newblerAce2ReadPair.pl
  
  * gapdirs.txt - list of gap directories created by createSubProject.pl
  
  * 454Scaffolds.txt - agp formatted file containing scaffold information
    created by Newbler
    
  * libinfo.txt - library insert size and std dev file created by parseNewblerMetrics.pl
  
  * contigs.fasta - fasta file of all contigs in the assembly
  
  * contigs.fasta.qual - qual file of all contigs in the assembly

getSubProjReads.pl also expects a scaffinfo.txt file and a <contigName>.boundary file for each of the gap contigs within each sub project directory. The scaffinfo.txt file is created by createSubProject.pl. The <contigName>.boundary files are created by idRepeatBoundary.pl.

For more information regarding the formats of these files, refer to the documentation of the scripts that create them.


VERSION

$Revision: 1.21 $

$Date: 2010-03-06 14:46:14 $


AUTHOR(S)

Stephan Trong


HISTORY