#!/usr/bin/env perl

=head1 NAME

getSubProjReads.pl

=head1 SYNOPSIS

  getSubProjReads.pl [options] <readinfo.txt> <gapdirs.txt> <libinfo.txt>
  <contigs.fasta> <contigs.fasta.qual>

  Options:
  -od <dir>     Output directory (optional; default is current directory).
  -log <file>   Log file (optional; default is getSubProjectReads.pl.log).
  -warn <file>  Warnings log file (optional; default is defined in gapRes.config)
  -h            Detailed message (optional)

=head1 DESCRIPTION

This is a wrapper program for the Gap Resolution sub system that is responsible
for getting reads and it's pairs in the unique and repeat regions of each
contig in a sub project directory. 

For each sub project directory and contig left or right of the gap, the program
performs the following steps:

1. Get reads + pairs in the unique region of the contig by calling
   getReadsInUnique.pl (configurable). Save list in a file within the sub
   project's directory as <contigName>.unique.reads (extension configurable).

2. Get reads in the repeat region of the contig by calling getReadsInRepeats.pl
   (configurable) Save list in a file within the sub project's directory 
   as <contigName>.repeat.reads (extension configurable).

3. Look for read pairs from the contig of interest that are in contigs from
   different scaffolds using getRepeatContig.pl (configurable). At least 2
   (configurable) read pairs must be present. If found, check that the "repeat"
   contig is > 250 (configurable) and less than the gap size + a gap padding
   (configurable). For each repeat contig, create the fasta and qual files in
   the sub project directory and a $contigs.repeatContigs.txt (extension
   configurable) file containing a list of the repeat contig names. The position
   withing the contig to look for reads is determined by using the largest
   library insert size + 2 (configurable) x standard deviation.

4. For each repeat contig found, get reads for the entire contig using
   getReadsInRepeat.pl (configurable). Save the list in a file within the sub
   project directory as <contigName>.repeat.reads (extension configurable).
   For each repeat contig fasta and qual file created as fakes, check if
   fasta sequence is > 2kb (configurable).  If so, shred fasta and qual to
   1kb (configurable) fragments with 100bp (configurable) overlap.

For more information regarding each of the software component called, refer
to it's documentation.

The following output files are created within each of the sub project directory:

  * <contig>.unique.reads - list of reads to assemble from the unique region of
    the contig adjacent to the gap (one for the left contig, one for the right
    contig).
  
  * <contig>.repeat.reads - list of reads to assemble from the repeat region of
    the contig adjacent to the gap or from the repeat contig (if exists).
  
  * <contig>.repeatContigs.txt - list of repeat contig names found from read
    pairs belonging to the contig adjacent to the gap (one for the left contig,
    one for the right contig). If no repeat contig is found, this file is not
    created.
  
  * fastas/<contig>.fasta - fasta sequence of the repeat contig consensus (fake
    read). If no repeat contig is found, this file is not created. This file is
    created in the repeatFastas (configurable) sub directory within the sub
    project directory.
  
  * fastas/<contig>.fasta.qual - quality values of the repeat contig consensus
    (fake read). If no repeat contig is found, this file is not created. This
    file is created in the repeatFastas (configurable) sub directory within the
    sub project directory.
  
  * readlist.txt - list of output read pairing info files created by this software.

The file format of the <contig>.unique.reads and <contig>.repeat.reads are
the same as the readinfo.txt file.  For more details, refer to the documentation
for newblerAce2ReadPair.pl.

A default config file named gapRes.config residing in
<installPath>/config is used to specify the following parameters:

(configurable)

  getSubProjReads.libInsertSizeStdDevMultiplier=2
    Specify the multiplier of the library insert size standard deviation
    to determine the distance from the end of the repeat contig to grab reads.
    The maximum library insert size defined in the the libinfo.txt is used.

  getSubProjReads.minRepeatContigLength=250
    Specify the minimum repeat contig length to be consider for creating fakes
    and grabbing reads from.

  getSubProjReads.gapSizePadding=0
    Specify the padding to add/subtract from the gap size to determine the
    maximum repeat contig length such that it can fit inside the gap.

  getSubProjReads.shredRepeatConsensus=1
    Specify whether to shred the repeat consensus fasta and qual files.

  getSubProjReads.shredRepeatConsensusIfGreaterThanThisLength=2000
    Specify the minimum length of the repeat contig to be considered for
    shredding.

  shredFasta.fragmentLength=1000
    Specify the fragment length when shredding the repeat contig consensus.

  shredFasta.overlapLength=100
    Specify the overlap length when shredding the repeat contig consensus.

  getSubProjReads.minNumReadLinksInRepeatContig=2
    Specify the minimum number of read links between the contig belonging
    to the gap and the repeat contig outside of the scaffold.

  getSubProjReads.keepTempContigReadInfoFiles=2
    Keep tmp directory containing read info files by contig.

(system configuration)

  script.getReadsInUnique=getReadsInUnique.pl

  script.getReadsInRepeat=getReadsInRepeat.pl

  script.getRepeatContig=getRepeatContig.pl

  getSubProjReads.repeatFastaFileExtension=.repeat.fasta

  getSubProjReads.repeatQualFileExtension=.repeat.fasta.qual

  getSubProjReads.repeatContigListFileExtension=.repeatContigs.txt

  idContigRepeats.boundaryFileExtension=.boundary

  getSubProjReads.readListFileName=readlist.txt

  getSubProjReads.directoryOfRepeatContigConsensus=fastas

  getSubProjReads.uniqueReadsFileExtension=.unique.reads

  getSubProjReads.repeatReadsFileExtension=.repeat.reads

  getSubProjReads.outputFastaDirectory=fastas

=head1 DEPENDENCIES

The following scripts (configurable in config file) must exist in the same
path as getSubProjReads.pl unless the path to the script is defined in the config
file:

  * getReadsInUnique.pl
  * getReadsInRepeat.pl
  * getRepeatContig.pl
  * shredFasta.pl

The following are the description of the input files used by the getSubProjReads.pl.

  * readinfo.txt - read pairing file created by newblerAce2ReadPair.pl
  
  * gapdir.txt - list of gap directories created by createSubProject.pl
  
  * 454Scaaffolds.txt - agp formatted file containing scaffold information
    created by Newbler
    
  * libinfo.txt - library insert size and std dev file created by parseNewblerMetrics.pl
  
  * contigs.fasta - fasta file of all contigs in the assembly
  
  * contigs.fasta.qual - qual file of all contigs in the assembly

The getSubProjReads.pl also expects a scaffinfo.txt file and a
<contigName>.boundary file for each of the gap contigs within each of the sub
project directory. This scaffinfo.txt file is created by createSubProject.pl.
The <contigName>.boundary files are created by idRepeatBoundary.pl.

For more information regarding the formats of these files, refer the documentation
of the scripts that are used to create the file.

=head1 VERSION

$Revision: 1.21 $

$Date: 2010-03-06 14:46:14 $

=head1 AUTHOR(S)

Stephan Trong

=head1 HISTORY

=over

=item *

S.Trong 2008/11/11 creation

=item *

S.Trong 2009/08/05 - added ability to skip and create warnings file if sub project fails.

=item *

S.Trong 2009/12/29 - added -log and -warn options.

=back

=cut

use strict;
use warnings;
use Pod::Usage;
use Cwd;
use Cwd qw(abs_path);
use Carp;
use Carp qw(cluck);
use Getopt::Long;
use File::Path;
use File::Copy;
use File::Basename;
use IO::File;
use FindBin qw($RealBin);
use lib "$RealBin/../lib";
use PGF::Utilities::Properties;
use PGF::Utilities::RunProcess qw(runProcess);
use PGF::Utilities::Logger;
use PGF::Parsers::FastaParse;
use PGF::GapResolution::Warnings;
use vars qw( $optHelp $optOutputDir $optLogFile $optWarningsFile
    $optContigReadFileOfFiles);

#============================================================================#
# INPUT VALIDATION
#============================================================================#
my $programExecution = abs_path(dirname($0))."/".basename($0)." @ARGV";

if( !GetOptions(
        "od=s"=>\$optOutputDir,
        "cf=s"=>\$optContigReadFileOfFiles,
        "log=s"=>\$optLogFile,
        "warn=s"=>\$optWarningsFile,
        "h"=>\$optHelp,
    )
) {
    printhelp(1);
}

printhelp(2) if $optHelp;

#============================================================================#
# INITIALIZATION
#============================================================================#

my $DEBUG = 0;
my $configFile = defined $ENV{GAPRES_CONFIG} ?
    $ENV{GAPRES_CONFIG} : "$RealBin/../config/gapRes.config";
my $OBJ_PROPS = PGF::Utilities::Properties->new(-configFile=>$configFile);
   $OBJ_PROPS->setExceptionIfEntryNotFound(1); # confess if entry in config file
                                               # is not found.
my $OBJ_LOGGER = PGF::Utilities::Logger->new();
my $outputDir = $optOutputDir ? $optOutputDir : getcwd;
   $outputDir =~ s/\/+$//g; # remove trailing '/'.
my $OBJ_WARNINGS = PGF::GapResolution::Warnings->new(
    path=>$outputDir, logger=>$OBJ_LOGGER);
my $logfile = $optLogFile ? $optLogFile : "$outputDir/".basename($0) . ".log";
my $tmpContigReadInfoFileDir = "contigReadInfoFiles.tmp.$$";

if ( @ARGV != 5 ) {
    my $errMsg = "Required input parameters are missing in command line.";
    print STDERR "$errMsg\n";
    printhelp(1);
}
    
my ($readInfoFile, $subProjectFile, $libInfoFile, $contigsFastaFile,
    $contigsQualFile) = @ARGV;

$OBJ_WARNINGS->setLogFile($optWarningsFile) if $optWarningsFile;

#============================================================================#
# VALIDATE INPUTS
#============================================================================#
my $errMsg = '';

# Set path for logging.
#
setFileForLogging($logfile);

# Log execution into log file.
#
logExecution($programExecution);

# Check readinfo file.
#
if (!-s $readInfoFile ) {
    $errMsg .= "The $readInfoFile is missing or zero size.\n";
}

# Check sub project file.
#
if ( !-s $subProjectFile) {
    $errMsg .= "The $subProjectFile is missing or zero size.\n";
}

# Check library info file.
#
if ( !-s $libInfoFile ) {
    $errMsg .= "The $libInfoFile is missing or zero size.\n";
}

# Check contigs fasta file.
#
if ( !-s $contigsFastaFile ) {
    $errMsg .= "The $contigsFastaFile is missing or zero size.\n";
}

# Check contigs qual file.
#
if ( !-s $contigsQualFile ) {
    $errMsg .= "The $contigsQualFile is missing or zero size.\n";
}

# Check contig read file of files.
#
if ( $optContigReadFileOfFiles && !-s $optContigReadFileOfFiles ) {
    $errMsg .= "The $optContigReadFileOfFiles is missing or zero size.\n";
}

if ( $errMsg  ){
    print STDERR "$errMsg.\nUse the -h option for more info.\n";
    logError($errMsg,1);
}

#============================================================================#
# MAIN
#============================================================================#

# Get sub project directories
#
my @subProjectDirs = getSubProjectDirectories($subProjectFile);
my %contigReadInfoFiles = ();

if ( $optContigReadFileOfFiles ) {
    %contigReadInfoFiles = parseContigReadFileOfFiles($optContigReadFileOfFiles);
    
} else {
    # Create tmp dir to store individual read info files for each contig.
    #
    createTmpContigReadFileDir($tmpContigReadInfoFileDir);
    
    # Split read info file into individual files by contig.
    #
    %contigReadInfoFiles = createContigReadInfoFiles($readInfoFile,
        $tmpContigReadInfoFileDir);
}

foreach my $subProjectDir (@subProjectDirs) {
    chomp $subProjectDir;
    next if !length $subProjectDir;
    
    my $subProjScaffInfoFile = "$subProjectDir/".
    	$OBJ_PROPS->getProperty("idContigRepeats.scaffFileName");
    my $repeatFastaExt = $OBJ_PROPS->getProperty("getSubProjReads.repeatFastaFileExtension");
    my $repeatQualExt = $OBJ_PROPS->getProperty("getSubProjReads.repeatQualFileExtension");
    my $repeatContigFastaDir = "$subProjectDir/".
        $OBJ_PROPS->getProperty("getSubProjReads.outputFastaDirectory");
    my $scaffoldFile = "$subProjectDir/".
        $OBJ_PROPS->getProperty("idContigRepeats.scaffFileName");

    # Add sub project dir name to warnings.
    #
    $OBJ_WARNINGS->setSubProjectName($subProjectDir);
    
    next if !$OBJ_WARNINGS->fileExists($subProjScaffInfoFile);
        
    # Parse scaffold file in sub project directory.
    #
    my %scaffDatas = parseSubProjectScaffoldFile($subProjScaffInfoFile);
    if ( !%scaffDatas ) {
        $OBJ_WARNINGS->add("Failed to parse $subProjScaffInfoFile");
        next;
    }

    # Log scaffold information.
    #
    my $msg = "Scaffold: $scaffDatas{scaffoldName}\n".
              "Directory: $scaffDatas{directoryName}\n".
              "LeftContig: $scaffDatas{leftContigName}, length: $scaffDatas{leftContigLength}\n".
              "RightContig: $scaffDatas{rightContigName}, length: $scaffDatas{rightContigLength}\n".
              "GapSize: $scaffDatas{gapSize}";
    logOutput($msg, 1);

    my @repeatContigListFiles = ();
    my @listOfReadInfoFiles = ();
    
    # Get reads for contigs left and right of gap.
    #
    foreach my $contigPos ("left", "right") {
        my %repeatContigs = ();
        my $contigName = $scaffDatas{"${contigPos}ContigName"};
        my $boundaryFile = "$subProjectDir/$contigName".
            $OBJ_PROPS->getProperty("idContigRepeats.boundaryFileExtension");
        my $contigReadInfoFile = $contigReadInfoFiles{$contigName};
        
        # Error handling...
        #
        last if !$OBJ_WARNINGS->fileExists($contigReadInfoFile);
        last if !$OBJ_WARNINGS->fileExists($boundaryFile);
        
        # Get unique and repeat coordinates from boundary file.
        #
	my %coords = parseBoundaryFile($boundaryFile);

        logOutput("Getting unique reads for $contigPos contig $contigName ...",1);
        
        # Get unique reads for this contig.
        #
	my $uniqueReadInfoFile = runGetReadsInUnique($contigReadInfoFile, $contigName,
            $contigPos, $coords{uniqueStart}, $coords{uniqueStop}, $subProjectDir);

        # Add read info file to list if exists.
        #
        push @listOfReadInfoFiles, $uniqueReadInfoFile if -s $uniqueReadInfoFile;
        
        logOutput("Getting repeat reads for $contigPos contig $contigName ...",1);
        
        # Get repeat reads for this contig.
        #
        my $repeatReadsInfoFile = runGetReadsInRepeat($contigReadInfoFile,
            $contigName, $coords{repeatStart}, $coords{repeatStop}, $subProjectDir);

        # Add read info file to list if exists.
        #
        push @listOfReadInfoFiles, $repeatReadsInfoFile if -s $repeatReadsInfoFile;
        
        my $repeatContigListFile = "$subProjectDir/${contigName}".
            $OBJ_PROPS->getProperty("getSubProjReads.repeatContigListFileExtension");
        
        logOutput("Looking for repeat contigs from $contigPos contig $contigName ...",1);
        
        # If unique reads found, look for repeat contigs based on read pairs found in this contig.
        #
        if ( -s $uniqueReadInfoFile ) {
            runGetRepeatContigs($libInfoFile, $scaffoldFile, $uniqueReadInfoFile,
                $contigsFastaFile, $contigsQualFile, $repeatContigListFile,
                $repeatContigFastaDir, $scaffDatas{gapSize}, $contigName, $contigPos,
                $repeatFastaExt, $repeatQualExt);
        } else {
            last;
        }
        
        # Add read info file to list if exists.
        #
        push @repeatContigListFiles, $repeatContigListFile if -s $repeatContigListFile;
        
    }
    
    # If repeat contigs exist, get repeat contig reads and shred repeat consensus
    # fasta and quals.
    #
    if ( @repeatContigListFiles ) {
        my @readList = processRepeatContigs($subProjectDir,
            $repeatContigFastaDir, $repeatFastaExt, \@repeatContigListFiles,
            \%contigReadInfoFiles);
        # Add read info file to list if exists.
        #
        push @listOfReadInfoFiles, @readList if @readList;
    }
    
    # Create file of read info file names.
    #
    if ( @listOfReadInfoFiles ) {
        createReadListFile($subProjectDir, @listOfReadInfoFiles);
    }
       
}

# Remove directory containing contig read info files if config param
# getSubProjReads.keepTempContigReadInfoFiles is set to zero.
#
if ( !$OBJ_PROPS->getProperty("getSubProjReads.keepTempContigReadInfoFiles") &&
    -e $tmpContigReadInfoFileDir ) {
    rmtree( $tmpContigReadInfoFileDir );
}

# If warning messages present, then save in .warnings.out file.
#
$OBJ_WARNINGS->createFile() if $OBJ_WARNINGS->getNumberOfWarnings;

exit 0;

#============================================================================#
# SUBROUTINES
#============================================================================#
sub createTmpContigReadFileDir {
    
    my $tmpContigReadInfoFileDir = shift;
    
    mkpath( $tmpContigReadInfoFileDir, 0777 ) if !-e $tmpContigReadInfoFileDir;

}

#============================================================================#
sub parseContigReadFileOfFiles {
    
    my $file = shift;
    
    my %contigReadInfoFiles = ();
    
    unless (open FH, $file) {
        logError("ERROR: failed to open file $file: $!",1);
    }
    
    while (my $entry = <FH> ) {
        next if !length $entry;
        next if $entry =~ /^#/;
        $contigReadInfoFiles{$1} = $2 if $entry =~ /^(\S+)\s+(\S+)/;
    }
    
    close FH;
    
    return %contigReadInfoFiles;
    
}

#============================================================================#
sub getSubProjectDirectories {
    
    my $file = shift;
    
    unless (open FH, $subProjectFile) {
        logError("ERROR: failed to open file $subProjectFile: $!",1);
    }
    while (my $subProjectDir = <FH>) {
        chomp $subProjectDir;
        next if !length $subProjectDir;
        if ( !-s $subProjectDir ) {
            logError("WARNING: in file $subProjectFile, directory $subProjectDir does not exist.\n");
            next;
        } else {
            push @subProjectDirs, $subProjectDir;
        }
    }
    close FH;
    
    return @subProjectDirs;
    
}

#============================================================================#
sub createContigReadInfoFiles {

    my $readInfoFile = shift;
    my $outputFileDir = shift;
    
    my %contigFH = ();
    my %contigFileNames = ();
    my %openedFiles = ();
    my $numFileHandles = 0;
    my $maxNumFileHandles = 1020;
    my $lastContig = '';
    
    unless (open FH, $readInfoFile) {
        logError("ERROR: failed to open file $readInfoFile: $!",1);
    }
    
    # NOTE: using IO::File to create multiple file handles has a limit of
    # 1020 max file handles created!
    #
    while (my $line = <FH>) {
        next if $line !~ /\w/;
        my @values = split /\t/, $line;
        next if @values < 5;
        my $contig = $values[4];
        if ( defined $contigFH{$contig} ) {
            my $fh = $contigFH{$contig};
            print $fh $line;
        } else {
            $numFileHandles++;
            # if num of file handles is greater than max limit, then
            # close previous file handle.  We'll need to open the previous
            # file handle the next time it is needed.
            #
            if ( $numFileHandles > $maxNumFileHandles ) {
                my $fh = $contigFH{$lastContig};
                $fh->close;
                delete $contigFH{$lastContig};
                $numFileHandles--;
            }
            
            # If file has already been opened, but has been closed due
            # to max limit, then open file handle with append. Otherwise
            # open file handle to write from beginning of file.
            #
            my $openMode = defined $openedFiles{$contig} ? ">>" : ">";
            
            my $file = "$outputFileDir/$contig.readinfo.txt";
            my $fh = new IO::File "$openMode $file";
            $contigFH{$contig} = $fh;
            $contigFileNames{$contig} = $file;
            $openedFiles{$contig}++;
            $lastContig = $contig;
            print $fh $line;
        }
    }
    
    foreach my $contig ( keys %contigFH ) {
        my $fh = $contigFH{$contig};
        $fh->close;
    }
    
    close FH;

    return %contigFileNames;

}

#============================================================================#
sub getRepeatContigs {
    
    my @contigListFiles = @_;
    
    my @repeatContigs = ();
    
    foreach my $file (@contigListFiles) {
        unless (open FH, $file ) {
            logError("ERROR: failed to open file $file: $!",1);
        }
        chomp( my @contigs = <FH> );
        close FH;
        push @repeatContigs, @contigs;
    }
    
    # Remove duplicates from array.
    #
    my %tmp = ();
    @repeatContigs = grep { !$tmp{$_}++ } @repeatContigs;
    
    return @repeatContigs;

}

#============================================================================#
sub checkSubProjectDirectories {

    my $file = shift;

    my $errMsg = '';

    unless (open FH, $file ) {
        logError("ERROR: failed to open file $file: $!",1);
    }

    while (my $dir = <FH>) {
        chomp $dir;
	next if !length $dir;
        next if $dir =~ /^#/;
	if ( !-d $dir ) {
	    $errMsg = "In file $file, directory $dir does not exist.\n";
	    last;
        }
    }
    close FH;

    return $errMsg;

}

#============================================================================#
sub parseSubProjectScaffoldFile {

    my $file = shift;

    my %scaffDatas = ();

    unless (open FH, $file ) {
        logError("WARNING: failed to open file $file: $!");
        return;
    }
    my @entries = <FH>;
    close FH;
    @entries = grep !/^#/, @entries;
    
    if ( !@entries ) {
        logError("WARNING: could not find any entries in scaffold file $file.");
        return;
    }
    
    chomp( my $entry = $entries[0] );

    if ( !length $entry ) {
        logError("WARNING: scaffold file $file is not valid.");
        return;
    }

    my @datas = split /\t/, $entry;

    if( @datas != 7 ) {
        logError("WARNING: scaffold file $file is not valid.");
        return
    }

    $scaffDatas{directoryName} = $datas[0];
    $scaffDatas{gapSize} = $datas[1];
    $scaffDatas{leftContigLength} = $datas[2];
    $scaffDatas{leftContigName} = $datas[3];
    $scaffDatas{rightContigLength} = $datas[4];
    $scaffDatas{rightContigName} = $datas[5];
    $scaffDatas{scaffoldName} = $datas[6];
        
    return %scaffDatas;

}

#============================================================================#
sub parseBoundaryFile {

    my $file = shift;

    my %coords = ();

    unless (open FH, $file ) {
        logError("WARNING: failed to open file $file: $!");
        return;
    }
    my @entries = <FH>;
    close FH;
    @entries = grep !/^#/, @entries;
    
    if ( !@entries ) {
        logError("WARNING: could not find any entries in boundary file $file.");
        return;
    }
    
    chomp( my $entry = $entries[0] );

    if ( !length $entry ) {
        logError("WARNING: boundary file $file is not valid.");
        return;
    }

    my @datas = split /\t/, $entry;

    if ( @datas != 4 ) {
        logError("WARNING: boundary file $file is not valid.");
        return;
    }

    $coords{uniqueStart} = $datas[0];
    $coords{uniqueStop} = $datas[1];
    $coords{repeatStart} = $datas[2];
    $coords{repeatStop} = $datas[3];

    return %coords;

}

#============================================================================#
sub getLargestLibraryInsertSize {

    my $file = shift;

    unless (open FH, $file ) {
        logError("ERROR: failed to open file $file: $!",1);
    }
    chomp( my @entries = <FH> );
    close FH;

    if ( !@entries ) {
        logError("ERROR: could not find any entries in lib file $file.",1);
    }
    
    my $largestLibSize = 0;
    my $stdDev = 0;

    foreach my $entry (@entries) {
        next if !length $entry;
        next if $entry =~ /^#/;
        my @datas = split /\t/, $entry;
	if ( @datas != 3 ) {
            logError("ERROR: library file $file contains an invalid entry.",1);
	}
	if ( $datas[1] > $largestLibSize ) {
	    $largestLibSize = $datas[1];
	    $stdDev = $datas[2];
	}
    }

    return $largestLibSize, $stdDev;

}

#============================================================================#
sub runGetReadsInUnique {

    my $readInfoFile = shift;
    my $contigName = shift;
    my $contigPos = shift;
    my $start = shift;
    my $stop = shift;
    my $outputDir = shift;

    my $outputFile = "$outputDir/$contigName".
        $OBJ_PROPS->getProperty("getSubProjReads.uniqueReadsFileExtension");
    my $script = getScript("script.getReadsInUnique");
    my $args = "-iRead $readInfoFile -cName $contigName ".
               "-cPos $contigPos -bstart $start -bend $stop -oRead $outputFile";
    my $cmd = "$script $args";

    my %processInfo = runCommand($cmd);
     
    checkProcess(%processInfo);
	  
    if ( -s $outputFile ) {
        logOutputFileCreation($outputFile);
    } else {
        logOutput("No $outputFile created.",1);
    }

    return $outputFile;

}

#============================================================================#
sub runGetReadsInRepeat {

    my $readInfoFile = shift;
    my $contigName = shift;
    my $start = shift;
    my $stop = shift;
    my $outputDir = shift;

    my $outputFile = "$outputDir/$contigName".
        $OBJ_PROPS->getProperty("getSubProjReads.repeatReadsFileExtension");
    my $script = getScript("script.getReadsInRepeat");
    my $args = "-iRead $readInfoFile -cName $contigName ".
               "-bstart $start -bend $stop -oRead $outputFile";
    my $cmd = "$script $args";

    my %processInfo = runCommand($cmd);
     
    checkProcess(%processInfo);
	  
    if ( -s $outputFile ) {
        logOutputFileCreation($outputFile);
    } else {
        logOutput("No $outputFile created.",1);
    }
    
    return $outputFile;
    
}

#============================================================================#
sub runGetRepeatContigs {

    my $libInfoFile = shift;
    my $scaffoldFile = shift;
    my $readInfoFile = shift;
    my $contigsFastaFile = shift;
    my $contigsQualFile = shift;
    my $outputFile = shift;
    my $outputDir = shift;
    my $gapSize = shift;
    my $contigName = shift;
    my $contigPos = shift;
    my $repeatFastaExt = shift;
    my $repeatQualExt = shift;
    
    my $minNumReadLinks = $OBJ_PROPS->getProperty("getSubProjReads.minNumReadLinksInRepeatContig");
    my $stdDevMult = $OBJ_PROPS->getProperty("getSubProjReads.libInsertSizeStdDevMultiplier");
    my $gapSizePad = $OBJ_PROPS->getProperty("getSubProjReads.gapSizePadding");
    my $minContigLength = $OBJ_PROPS->getProperty("getSubProjReads.minRepeatContigLength");
    my ($libInsertSize, $stdDev) = getLargestLibraryInsertSize($libInfoFile);
    my $regionLength = $libInsertSize + $stdDevMult*$stdDev;
    my $maxContigLength = $gapSize + $gapSizePad;

    my $contigPosParam = $contigPos eq "left" ? "L":"R";
    my $script = getScript("script.getRepeatContig");
    my $args = "-rs $regionLength ".
               "-contig $contigName ".
	       "-pos $contigPosParam ".
	       "-min $minContigLength ".
	       "-max $maxContigLength ".
	       "-rl $minNumReadLinks ".
	       "-od $outputDir ".
               "-fastaExt $repeatFastaExt ".
               "-qualExt $repeatQualExt ".
	       "$scaffoldFile $readInfoFile $contigsFastaFile $contigsQualFile ".
	       "$outputFile";

    my $cmd = "$script $args";

    my %processInfo = runCommand($cmd);
     
    checkProcess(%processInfo);
    
    if ( -s $outputFile ) {
        logOutputFileCreation($outputFile);
    } else {
        logOutput("No $outputFile created.",1);
    }
    
}
    
#============================================================================#
sub shredRepeatConsensus {
    
    my $repeatContig = shift;
    my $fastaDir = shift;
    
    my $fragLength = $OBJ_PROPS->getProperty("shredFasta.fragmentLength");
    my $overlapLength = $OBJ_PROPS->getProperty("shredFasta.overlapLength");
    my $minFastaLengthToShred =
        $OBJ_PROPS->getProperty("getSubProjReads.shredRepeatConsensusIfGreaterThanThisLength");
    
    my $repeatFasta = "$fastaDir/$repeatContig".
        $OBJ_PROPS->getProperty("getSubProjReads.repeatFastaFileExtension");
    my $repeatQual = "$fastaDir/$repeatContig".
        $OBJ_PROPS->getProperty("getSubProjReads.repeatQualFileExtension");
    my $outputFasta = "$repeatFasta.shred";
    my $outputQual = "$repeatQual.shred";
    
    my $repeatFastaLength = getFastaLength($repeatContig, $repeatFasta);
    
    return if $repeatFastaLength <= $minFastaLengthToShred;
    
    # shred fasta sequences if fasta file found.
    #
    if ( -e $repeatFasta ) {
        logOutput("Shredding $repeatFasta to $fragLength bp lengths with $overlapLength bp overlap",1);
        runShredFasta($repeatFasta, 0, $outputFasta, $fragLength,
            $overlapLength);
        
        # Rename output shredded file back to repeat fasta file.
        #
        if ( -s $repeatFasta && -s $outputFasta ) {
            unlink( $repeatFasta );
            rename ($outputFasta, $repeatFasta);
            logOutput("Renaming $outputFasta to $repeatFasta ...",1);
        }
            
    }
    
    # shred qual values if qual file found.
    #
    if ( -e $repeatQual) {
        logOutput("Shredding $repeatQual to $fragLength bp lengths with $overlapLength bp overlap",1);
        runShredFasta($repeatQual, 1, $outputQual, $fragLength,
            $overlapLength);
        
        # Rename output shredded file back to repeat qual file.
        #
        if ( -s $repeatQual && -s $outputQual ) {
            unlink( $repeatQual);
            rename ($outputQual, $repeatQual);
            logOutput("Renaming $outputQual to $repeatQual ...",1);
        }
    }
    
}
    
#============================================================================#
sub runShredFasta {
    
    my $fastaFile = shift;
    my $isQualityFile = shift;
    my $outputFile = shift;
    my $fragLength = shift;
    my $overlapLength = shift;
    
    my $script = getScript("script.shredFasta");
    my $args = "-l $fragLength -o $overlapLength";
       $args .= " -q" if $isQualityFile;
       $args .= " $fastaFile $outputFile";
    
    my $cmd = "$script $args";

    my %processInfo = runCommand($cmd);
     
    checkProcess(%processInfo);
    
    if ( -s $outputFile ) {
        logOutputFileCreation($outputFile);
    } else {
        logOutput("No $outputFile created.",1);
    }
    
}

#============================================================================#
sub getRepeatContigReads {
    
    my $contig = shift;
    my $subProjectDir = shift;
    my $readInfoFile = shift;
    my $repeatContigFasta = shift;
    
    logOutput("Getting repeat contig reads for contig $contig ...",1);
    
    my $contigLength = getFastaLength($contig, $repeatContigFasta);
    if ( !$contigLength ) {
        logError("ERROR: failed to get contig length for contig $contig from fasta file $repeatContigFasta.",1);
    }
    
    my $outputFile = runGetReadsInRepeat($readInfoFile, $contig, 1, $contigLength,
        $subProjectDir);
    
    return $outputFile;

}
    
#============================================================================#
sub processRepeatContigs {
    
    my $subProjectDir = shift;
    my $repeatContigFastaDir = shift;
    my $repeatFastaExt = shift;
    my $A_repeatContigListFiles = shift;
    my $H_contigRepeatInfoFiles = shift;
    
    my @listOfReadInfoFiles = ();
    my $doShredRepeatConsensus = $OBJ_PROPS->getProperty("getSubProjReads.shredRepeatConsensus");
    
    # get list of repeat contigs from repeat contig list files.
    #
    my @repeatContigs = sort {$a cmp $b} getRepeatContigs(@$A_repeatContigListFiles);
    
    # log list of repeat contigs found.
    #
    my $logmsg = "Repeat contigs found: ";
       $logmsg .= join ", ", @repeatContigs;
       $logmsg =~ s/,\s*$//;
    logOutput($logmsg, 1);
    
    # For each repeat contigs found, get reads and shred repeat contig fasta
    # and quals.
    #
    foreach my $repeatContig (@repeatContigs) {
        
        # define repeat contig fasta.  Need this to determine fasta length
        # in getRepeatContigReads subroutine.
        #
        my $repeatContigFasta = "$repeatContigFastaDir/$repeatContig$repeatFastaExt";
        
        if ( !-s $repeatContigFasta ) {
            logError("WARNING: expecting file $repeatContigFasta but it does not exist.");
            return @listOfReadInfoFiles;
        }
        
        my $contigReadInfoFile = $H_contigRepeatInfoFiles->{$repeatContig};
        
        # get repeat contig reads.
        #
        my $repeatContigReadsFile = getRepeatContigReads($repeatContig, $subProjectDir,
            $contigReadInfoFile, $repeatContigFasta);
        
        if ( -s $repeatContigReadsFile ) {
            push @listOfReadInfoFiles, $repeatContigReadsFile;
        }
        
        # shred repeat contig fastas and quals.
        #
        shredRepeatConsensus($repeatContig, $repeatContigFastaDir) if
            $doShredRepeatConsensus;
    }
    
    return @listOfReadInfoFiles;
    
}

#============================================================================#
sub createReadListFile {
    
    my $subProjectDir = shift;
    my @listOfReadInfoFiles = @_;
    
    my $outputFile = "$subProjectDir/".
        $OBJ_PROPS->getProperty("getSubProjReads.readListFileName");
        
    unless (open FH, ">$outputFile" ) {
        logError("ERROR: failed to open file $outputFile: $!",1);
    }
    foreach my $readListFile (@listOfReadInfoFiles) {
        chomp $readListFile;
        my $fileName = basename($readListFile);
        print FH "$fileName\n";
    }
    close FH;
    
    if ( -s $outputFile ) {
        logOutputFileCreation($outputFile);
    } else {
        $OBJ_WARNINGS->add("No $outputFile created.");
        logOutput("No $outputFile created.");
    }
    
}

#============================================================================#
sub getFastaLength {
    
    my $contig = shift;
    my $file = shift;
    
    return 0 if !$OBJ_WARNINGS->fileExists($file);
    
    my $contigLength = 0;
    my $objFastaParser = new PGF::Parsers::FastaParse($file);

    while ($objFastaParser->MoreEntries) {
        $objFastaParser->ReadNextEntry( -rawFormat=>0 );
        next if $objFastaParser->Name ne $contig;
        
        my $seq = $objFastaParser->Seq;
        $contigLength = length $seq;
        last;
    }
    
    return $contigLength;
    
}
    
#============================================================================#
sub setFileForLogging {
    
    my $logFile = shift;
    
    $OBJ_LOGGER->setLogOutFileAppend($logFile);
    $OBJ_LOGGER->setLogErrorFileAppend($logFile);
    
}

#============================================================================#
sub createDirectory {
    
    my $dir = shift;
    
    if ( !-d $dir ) {
        eval { mkpath($dir, {mode=>0755}) };
        if ($@) {
            my $errMsg = "ERROR: failed to create directory $dir: $@\n";
            logError($errMsg, 1);
        }
    }
}

#============================================================================#
sub getScript {
    
    my $scriptType = shift;
    
    my $script = $OBJ_PROPS->getProperty("$scriptType");
       $script = "$FindBin::RealBin/$script" if
            $script !~ /\//; # add path to script if not specified in config file
            
    # Check if script exists.
    #
    if ( !-e $script ) {
        my $errMsg = "ERROR: cannot find script $script defined in config file as $scriptType.\n";
        print $errMsg;
        logError($errMsg, 1);
    }
    
    return $script;
    
}

#============================================================================#
sub checkProcess {
    
    my %processInfo = @_;
    
    if ( $processInfo{exitCode} ) {
        my $errMsg = '';
        if ( defined $processInfo{logStdoutMessage} &&
            $processInfo{logStdoutMessage} ) {
            $errMsg .= $processInfo{stdoutMessage};
        }
        $errMsg .= $processInfo{stderrMessage};
        logError($errMsg);
    }
    
}

#============================================================================#
sub runCommand {
   
    my $cmd = shift;
    
    # Execute command, capture stderr, stdout, exitcode
    #
    my $errMessage = "CMD: $cmd\n";
    
    print "Running $cmd ...\n\n";
    logOutput($cmd);
    
    my %processInfo = runProcess($cmd,
        {-checkExecutable=>0,
        }
    );
    
    return %processInfo;
    
}

#============================================================================#
sub logExecution {
    
    my $programExecution = shift;
    
    my $msg = "Command: ".$programExecution."\n".
              "Current directory: ".getcwd;
    logOutput($msg);
}

#============================================================================#
sub logError {
    my $message = shift;
    my $confess = shift || 0;
    
    $OBJ_LOGGER->logError($message);
    $OBJ_WARNINGS->add($message);
    
    if ( $confess ) {
        $OBJ_WARNINGS->createFile() if $OBJ_WARNINGS->getNumberOfWarnings;
        confess $message;
    }
    
}

#============================================================================#
sub logOutput {
    my $message = shift;
    my $printMsg = shift || 0;
    
    $OBJ_LOGGER->logOut($message);
    print "$message\n" if $printMsg;
}
    
#============================================================================#
sub logOutputFileCreation {
    
    my $outputFile = shift;
    
    $outputFile = abs_path($outputFile);
    
    my $msg = -e $outputFile ? "Created $outputFile" :
        "Failed to create $outputFile";
    logOutput($msg,1);
    
}

#============================================================================#
sub printhelp {
    my $verbose = shift || 1;
    pod2usage(-verbose=>$verbose);
    exit 1;
}

#============================================================================#
