#!/usr/bin/env perl

=head1 NAME

validateSubProject.pl

=head1 SYNOPSIS

  validateSubProject.pl [options] <gapdirs.txt> <libinfo.txt> <sffinfo.txt>

  Options:
  -o            Output validation summary file (optional; default defined in gapRes.config file)
  -log <file>   Log file (optional; default is validateSubProject.pl.log)
  -warn <file>  Warning file (optional; default is defined in gapRes.config)
  -h            Detailed message (optional)

=head1 DESCRIPTION

This is a wrapper program for the Gap Resolution sub system that is responsible
for validating each of the sub project to determine if the gap is successfully
closed or not.

Unless otherwise noted, anchor here refers to the anchor sequence obtained
from the left and right contigs of the gap prior to reassembly.  This is done
upstream using idContigRepeats.pl.

The following tasks are done for each sub project:

  1. Create a contig orientation file of the assembly in the assembly directory
     using the 454Scaffolds.txt file.
  2. Create a read pairing info file (readinfo.txt) file in the assembly directory
     from the ace file.
  3. Validate the gap assembly by calling valdateGapAssembly.pl. This will create
     a validinfo.txt file within the sub project directory containing information
     pertaining to the validation of the assembly.  For more info, refer to 
     validateGapAssembly.pl's documentation.

For more information on the validation rules, see the documentation on
validateGapAssembly.pl.

The format of the validinfo.txt file is as follows:

  leftAnchorContig=name of contig
  leftAnchorContigLength=number
  leftAnchorStart=number
  leftAnchorEnd=number
  rightAnchorContig=name of contig
  rightAnchorContigLength=number
  rightAnchorStart=number
  rightAnchorEnd=number
  anchorStart=number
  anchorEnd=number
  anchorDistance=number
  gapSize=number
  gapSizeStdDev=number
  numConsistentReadPairs=number
  numInconsistentReadPairs=number
  pctConsistent=number
  avgConsensusQualityBetweenAnchors=number
  isDistanceValid=0|1
  isReadPairingValid=0|1
  isQualityValid=0|1
  status=PASS|FAIL
  doPrimerDesign=0|1
  comment=comment entry

The format of the contigOrientation.txt file is as follows, with each item
separated by a tab, one entry per line:

  1. contigName
  2. orientation (+|-)

A validationSummary.txt file is created in the working directory containing
the validation status of each of the sub project directories.  The format of
the file is as follows, with each entry separated by a tab:

  1. full path of sub project directory
  2. status (PASS or FAIL)
  3. comments

=head1 DEPENDENCIES

The following scripts (configurable in config file) must exist in the same
path as validateSubProject.pl unless the path to the script is defined in the config
file:

  * newblerAce2ReadPair.pl
  * validateGapAssembly.pl

The following are the description of the input files used by the validateSubProject.pl.

  * gapdir.txt - list of gap directories created by createSubProject.pl
  * libinfo.txt - library insert size and std dev file created by parseNewblerMetrics.pl
  * sffinfo.txt - location of sff files created by parseNewblerMetrics.pl

The following files are created or must exist in each sub project directory:

  * 454Scaffolds.txt - created by Newbler in the sub project directory + Newbler/assembly.
  * contigOrientation.txt - created within the sub project directory +
    Newbler/assembly using the 454Scaffolds.txt file. The format is contig + tab
    + orientation(+/-), one per line.
  * readinfo.txt - created within the sub project directory + Newbler/assembly
    using newblerAce2ReadPair.pl.
  * scaffinfo.txt - must exist within the sub project directory (created
    elsewhere using createSubProject.pl)

For more information regarding the formats of these files, refer the documentation
of the scripts that are used to create the file.

A default config file named gapRes.config residing in
<installPath>/config is used to specify the following parameters:

(configurable)

  validateSubProject.anchorSeqMinAlignPercentIdentity
    The anchor sequences are aligned to the reference sequence to determine the
    positions of the anchors in the assembly. Use this to specify the minimum
    alignment percent identity.

  validateSubProject.anchorSeqMinAlignmentLength
    The anchor sequences are aligned to the reference sequence to determine the
    positions of the anchors in the assembly. Use this to specify the minimum
    alignment length.
  
  validateSubProject.percentGapSizedPaddingForValidation
    If the left and right anchors are aligned to the same contig in the assembly,
    the distance between the anchors must be within the gap size +/ a standard
    deviation to be considered valid. This parameter is used to compute the gap
    size standard deviation, represented as a percentage of the gap size (e.g.,
    std dev=gap size * percent gap size padding). The gap size is determined
    from the required pre-existing file scaffinfo.txt (previously generaged by
    createSubProject.pl)
   
  validateSubProject.percentReadPairConsistency
    When validating the read pairs for consistency, the percentage of consistent
    read pairs (valid read pairs/invalid read pairs) must be >= to the threshold
    defined by this parameter.  Read pairs are considered valid if they meet all
    of the following criteria: 1) read pairs are located on the same contig, 2)
    distance between the read pairs are within the library insert size +/ standard
    deviation, 3) The read orientation points toward each other.
  
  validateSubProject.libInsertSizeStdDevMultiplier
    One of the criteria for validating read pairs for consistency is determining
    whether the distance of the read pairs are within the library insert size +/
    standard deviation. This parameter is used as a standard deviation multiplier
    in order to allow for the configuration of the library insert size range.
  
  validateSubProject.minAvgConsensusQualityBetweenAnchors
    One of the criteria for validating a sub project is that the average cosensus
    quality between the anchors must be >= to a threshold.  This parameter is used
    to specify this threshold.
  
(system configuration)

  validateSubProject.assemblyDirectory

  validateSubProject.validateGapAssemblyOutputFile

  validateSubProject.assemblyContigsFasta

  validateSubProject.assemblyContigsQual

  validateSubProject.validationSummaryFile

  validateSubProject.createDbFileExecutable

  validateSubProject.aligner

  validateSubProject.alignerParameters

=head1 VERSION

$Revision$

$Date$

=head1 AUTHOR(S)

Stephan Trong

=head1 HISTORY

=over

=item *

S.Trong 2008/12/05 creation

=item *

S.Trong 2009/08/05
  - Split validation of sub projects from the creation of fakes. Fakes creation
    is now handled by the createSubProjectFakes.pl.
  - If sub project fails during analysis, skip instead of generating fatal error.
    The error is reported in a .warnings file.

=item *

S.Trong 2008/12/29 - added -log and -warn options.

=back

=cut

use strict;
use warnings;
use Pod::Usage;
use Cwd;
use Cwd qw(abs_path);
use Carp;
use Carp qw(cluck);
use Getopt::Long;
use File::Path;
use File::Copy;
use File::Basename;
use FindBin qw($RealBin);
use lib "$RealBin/../lib";
use PGF::Utilities::Properties;
use PGF::Utilities::RunProcess qw(runProcess);
use PGF::Utilities::Logger;
use PGF::Newbler::Scaffolds454;
use PGF::GapResolution::ParseScaffInfo;
use PGF::GapResolution::ParseGapValidationInfo;
use PGF::GapResolution::ParseContigOrientation;
use PGF::GapResolution::Warnings;
use vars qw( $optHelp $optDebug $optValiditionSummaryFile $optLogFile
    $optWarningsFile );

#============================================================================#
# INPUT VALIDATION
#============================================================================#
my $programExecution = abs_path(dirname($0))."/".basename($0)." @ARGV";

if( !GetOptions(
        "o=s"=>\$optValiditionSummaryFile,
        "log=s"=>\$optLogFile,
        "warn=s"=>\$optWarningsFile,
        "debug"=>\$optDebug,
        "h"=>\$optHelp,
    )
) {
    printhelp(1);
}

printhelp(2) if $optHelp;

if ( @ARGV != 3 ) {
    my $errMsg = "Required input parameters are missing in command line.";
    print STDERR "$errMsg\n";
    printhelp(1);
}

#============================================================================#
# INITIALIZATION
#============================================================================#

my $DEBUG = $optDebug ? 1:0;
my $configFile = defined $ENV{GAPRES_CONFIG} ?
    $ENV{GAPRES_CONFIG} : "$RealBin/../config/gapRes.config";
my $OBJ_PROPS = PGF::Utilities::Properties->new(-configFile=>$configFile);
   $OBJ_PROPS->setExceptionIfEntryNotFound(1); # confess if entry in config file
                                               # is not found.
my $OBJ_LOGGER = PGF::Utilities::Logger->new();
my $outputDir = getcwd;
my $logfile = $optLogFile ? $optLogFile : "$outputDir/".basename($0) . ".log";
my $OBJ_WARNINGS = PGF::GapResolution::Warnings->new(
    path=>$outputDir, logger=>$OBJ_LOGGER);
   $OBJ_WARNINGS->setLogFile($optWarningsFile) if $optWarningsFile;
my $subProjectListFile = $ARGV[0];
my $libInfoFile = $ARGV[1];
my $sffInfoFile = $ARGV[2];
    
#============================================================================#
# VALIDATE INPUTS
#============================================================================#
my $errMsg = '';

# Set path for logging.
#
setFileForLogging($logfile);

# Log execution into log file.
#
logExecution($programExecution);

# Check sub project list file.
#
if (!-s $subProjectListFile) {
    $errMsg .= "The $subProjectListFile file is missing or zero size.\n";
}

if ( $errMsg  ){
    print STDERR "$errMsg.\nUse the -h option for more info.\n";
    logError($errMsg,1);
}

#============================================================================#
# MAIN
#============================================================================#

my @validationStatuses = ();

# validate and get sub project directories.
#
my @subProjectDirs = validateSubProjectLocs($subProjectListFile);

foreach my $subProjectDir (@subProjectDirs) {
    $subProjectDir =~ s/\/$//; # remove any trailing '/'
    
    # Add sub project dir name to warnings.
    #
    $OBJ_WARNINGS->setSubProjectName($subProjectDir);
    
    logOutput("Validating sub project $subProjectDir ...",1);
       
    # Initialize file name and paths
    #
    my $subProjectName = basename($subProjectDir);
    my $assemDir = $subProjectDir.'/'.
        $OBJ_PROPS->getProperty("validateSubProject.assemblyDirectory");
    my $contigOrientationFile = $assemDir.'/'.
        $OBJ_PROPS->getProperty("createGapResProject.contigOrientationFileName");
    my $readInfoFile = $assemDir.'/'.
        $OBJ_PROPS->getProperty("createGapResProject.readInfoFileName");
    my $newblerMetricsFile = $assemDir.'/454NewblerMetrics.txt';
    my $newblerScaffoldsFile = $assemDir.'/454Scaffolds.txt';
    my $aceFile = $assemDir.'/consed/edit_dir/454Contigs.ace.1';
    my $validInfoFile = $subProjectDir.'/'.
        $OBJ_PROPS->getProperty("validateSubProject.validateGapAssemblyOutputFile");
    my $contigsFasta = $assemDir.'/'.
        $OBJ_PROPS->getProperty("validateSubProject.assemblyContigsFasta");
    my $contigsQual= $assemDir.'/'.
        $OBJ_PROPS->getProperty("validateSubProject.assemblyContigsQual");
        
    $ENV{SFFFILE_PATH} = $OBJ_PROPS->getProperty("script.sffinfo");
        
    # check for existence of required files
    #
    next if !checkForRequiredFiles(
        $subProjectDir,
        $newblerMetricsFile,
        $newblerScaffoldsFile,
        $aceFile,
        $contigsFasta,
        $contigsQual
    );
        
    my $success = 0;
    
    # Create contig orientation file.
    #
    $success = createContigOrientationFile($assemDir, $newblerScaffoldsFile,
        $contigOrientationFile);
    next if !$success;
    
    # Create read pairing info file.
    #
    $success = runReadInfoFileGenerator($assemDir, $aceFile, $contigOrientationFile,
        $readInfoFile);
    next if !$success;
    
    # Validate sub project and create valid info file.
    #
    $success = runValidateGapAssembly($subProjectDir, $assemDir, $readInfoFile,
        $libInfoFile, $sffInfoFile, $contigsFasta, $contigsQual, $validInfoFile);
    next if !$success;
    
    # Generate status entries to place into summary file.
    #
    my $validationStatus = "$subProjectDir\t";
    my $objGapValidation = PGF::GapResolution::ParseGapValidationInfo->new(
        $validInfoFile);
    my $comment = $objGapValidation->getComment;
    
    if ( $objGapValidation->isSuccessful ) {
        $validationStatus .= "PASS";
    } else {
        $validationStatus .= "FAIL";
    }
    $validationStatus .= "\t$comment\n";
    push @validationStatuses, $validationStatus;
}

# Create summary file in current working directory.
#
createValidationSummaryFile(@validationStatuses);

# If warning messages present, then save in .warnings.out file.
#
$OBJ_WARNINGS->createFile() if $OBJ_WARNINGS->getNumberOfWarnings;

exit 0;

#============================================================================#
# SUBROUTINES
#============================================================================#
sub checkForRequiredFiles {
    
    my $subProjectDir = shift;
    my @files = @_;
    
    my $success = 1;
    
    foreach my $file (@files) {
        if ( !$OBJ_WARNINGS->fileExists($file) ) {
            $success = 0;
        }
    }
    
    return $success;
}

#============================================================================#
sub validateSubProjectLocs {
    
    my $file = shift;   
    
    my @subProjectDirs = ();
    
    my $scaffFileName = $OBJ_PROPS->getProperty("idContigRepeats.scaffFileName");
    
    unless (open FH, $file ) {
        logError("ERROR: failed to open file $file: $!", 1);
    }
    
    while( my $subProjectDir = <FH> ) {
        chomp $subProjectDir;
        my $errMsg = '';
        if ( !-e $subProjectDir ) {
            $errMsg = "WARNING: in file $file, directory $subProjectDir does not exist.\n";
        } elsif ( !-s "$subProjectDir/$scaffFileName" ) {
            $errMsg = "WARNING: Scaff info file $subProjectDir/$scaffFileName file does not exist or is zero size.\n";
        }
        
        if ( $errMsg  ){
            logError($errMsg);
            next;
        } else {
            $subProjectDir =~ s/\/$//; # remove any trailing '/'
            push @subProjectDirs, $subProjectDir;
        }
    }
    
    close FH;
    
    return @subProjectDirs;
    
}

#============================================================================#
sub createContigOrientationFile {
    
    my $subProjectDir = shift;
    my $scaffoldFile = shift;
    my $outputFile = shift;
    
    logOutput("Creating contig orientation file ...", 1);
    
    my $objScaffold = PGF::Newbler::Scaffolds454->new($scaffoldFile);
    $objScaffold->createContigOrientationFile($outputFile);
    
    my $success = $OBJ_WARNINGS->fileExists($outputFile);
    
    logOutputFileCreation($outputFile);
    
    return $success;
    
}

#============================================================================#
sub runReadInfoFileGenerator {
    
    my $assemDir = shift;
    my $aceFile = shift;
    my $contigOrientationFile = shift;
    my $outputFile = shift;
    
    logOutput("Parsing ace file for read pairing info ...", 1);
    
    my $success = 0;
    my $script = getScript("script.readInfoFileGenerator");
    my $options = '';
       $options .= "-cf $contigOrientationFile" if -s $contigOrientationFile;
    my $cmd = "$script $options $aceFile $outputFile";
    
    my %processInfo = runCommand($cmd);
    
    if ( checkProcess(%processInfo) ) {
        $success = $OBJ_WARNINGS->fileExists($outputFile);
    }
    
    logOutputFileCreation($outputFile);
    
    return $success;
    
}
    
#============================================================================#
sub runValidateGapAssembly {
    
    my $subProjectDir = shift;
    my $assemDir = shift;
    my $readInfoFile = shift;
    my $libInfoFile = shift;
    my $sffInfoFile = shift;
    my $contigsFasta = shift;
    my $contigsQual = shift;
    my $outputFile = shift;
    
    logOutput("Checking gap assembly ...", 1);
    
    my $success = 0;
    my $scaffInfoFile = $subProjectDir.'/'.
        $OBJ_PROPS->getProperty("idContigRepeats.scaffFileName");
    my $objScaffInfo = PGF::GapResolution::ParseScaffInfo->new($scaffInfoFile);
    my $leftContig = $objScaffInfo->getLeftContig();
    my $rightContig = $objScaffInfo->getRightContig();
    my $gapSize = $objScaffInfo->getGapSize();
    my $percentGapSizePadding = $OBJ_PROPS->getProperty(
        "validateSubProject.percentGapSizedPaddingForValidation") / 100;
    my $gapSizeStdDev = $gapSize * $percentGapSizePadding;
    my $leftAnchorSeqFile = "$subProjectDir/$leftContig".
        $OBJ_PROPS->getProperty("idContigRepeats.anchorFileExtension");
    my $rightAnchorSeqFile = "$subProjectDir/$rightContig".
        $OBJ_PROPS->getProperty("idContigRepeats.anchorFileExtension");
    
    return 0 if !$OBJ_WARNINGS->fileExists($leftAnchorSeqFile );
    return 0 if !$OBJ_WARNINGS->fileExists($rightAnchorSeqFile);
    return 0 if !$OBJ_WARNINGS->fileExists($contigsFasta);
    return 0 if !$OBJ_WARNINGS->fileExists($contigsQual);
    
    my $script = getScript("script.validateGapAssembly");
    my $args = "-gsize $gapSize".
               " -gsizeStd $gapSizeStdDev".
               " -pctId ".$OBJ_PROPS->getProperty(
                    "validateSubProject.anchorSeqMinAlignPercentIdentity").
               " -alignLen ".$OBJ_PROPS->getProperty(
                    "validateSubProject.anchorSeqMinAlignmentLength").
               " -pctValidReads ".$OBJ_PROPS->getProperty(
                    "validateSubProject.percentReadPairConsistency").
               " -minQual ".$OBJ_PROPS->getProperty(
                    "validateSubProject.minAvgConsensusQualityBetweenAnchors").
               " -insertSizeStdMult ".$OBJ_PROPS->getProperty(
                    "validateSubProject.libInsertSizeStdDevMultiplier").
               " -formatdb ".$OBJ_PROPS->getProperty(
                    "validateSubProject.createDbFileExecutable").
               " -aligner ".$OBJ_PROPS->getProperty(
                    "validateSubProject.aligner").
               " -alignerParams '".$OBJ_PROPS->getProperty(
                    "validateSubProject.alignerParameters")."'".
               " $leftAnchorSeqFile".
               " $rightAnchorSeqFile".
               " $readInfoFile".
               " $libInfoFile".
               " $sffInfoFile".
               " $contigsFasta".
               " $contigsQual".
               " $outputFile";
               
    $args .= " -debug" if $DEBUG;
    
    my $cmd = "$script $args";
    my %processInfo = runCommand($cmd);
    
    if ( checkProcess(%processInfo) ) {
        $success = $OBJ_WARNINGS->fileExists($outputFile);
        logOutputFileCreation($outputFile);
    }
    
    return $success;
    
}

#============================================================================#
sub createValidationSummaryFile {
    
    my @validationStatuses = @_;
    
    my $validationSummaryFile = $optValiditionSummaryFile ?
        $optValiditionSummaryFile : $OBJ_PROPS->getProperty(
        "validateSubProject.validationSummaryFile");
    unless (open FH, ">$validationSummaryFile" ) {
        logError("ERROR: failed to create file $validationSummaryFile: $!",1);
    }
    print FH @validationStatuses;
    close FH;
    
    logOutputFileCreation($validationSummaryFile);

}

#============================================================================#
sub setFileForLogging {
    
    my $logFile = shift;
    
    $OBJ_LOGGER->setLogOutFileAppend($logFile);
    $OBJ_LOGGER->setLogErrorFileAppend($logFile);
    
}

#============================================================================#
sub getScript {
    
    my $scriptType = shift;
    
    my $script = $OBJ_PROPS->getProperty("$scriptType");
       $script = "$FindBin::RealBin/$script" if
            $script !~ /\//; # add path to script if not specified in config file
            
    # Check if script exists.
    #
    if ( !-e $script ) {
        my $errMsg = "ERROR: cannot find script $script defined in config file as $scriptType.\n";
        print $errMsg;
        logError($errMsg, 1);
    }
    
    return $script;
    
}

#============================================================================#
sub checkProcess {
    
    my %processInfo = @_;
    
    my $success = 1;
    
    if ( $processInfo{exitCode} ) {
        my $errMsg = '';
        if ( defined $processInfo{logStdoutMessage} &&
            $processInfo{logStdoutMessage} ) {
            $errMsg .= $processInfo{stdoutMessage};
        }
        $errMsg .= $processInfo{stderrMessage}."\n";
        print STDERR "$errMsg\n";
        logError($errMsg);
        $success = 0;
    }
    
    return $success;
}

#============================================================================#
sub runCommand {
   
    my $cmd = shift;
    
    # Execute command, capture stderr, stdout, exitcode
    #
    my $errMessage = "CMD: $cmd\n";
    
    print "Running $cmd ...\n\n";
    logOutput($cmd);
    
    my %processInfo = runProcess($cmd,
        {-checkExecutable=>0,
        }
    );
    
    return %processInfo;
    
}

#============================================================================#
sub logExecution {
    
    my $programExecution = shift;
    
    my $msg = "Command: ".$programExecution."\n".
              "Current directory: ".getcwd;
    logOutput($msg);
}

#============================================================================#
sub logError {
    
    my $message = shift;
    my $confess = shift || 0;
    
    $OBJ_LOGGER->logError($message);
    $OBJ_WARNINGS->add($message);
    
    if ( $confess ) {
        $OBJ_WARNINGS->createFile() if $OBJ_WARNINGS->getNumberOfWarnings;
        confess $message;
    }
    
}
    
#============================================================================#
sub logOutput {
    my $message = shift;
    my $printMsg = shift || 0;
    
    $OBJ_LOGGER->logOut($message);
    print "$message\n" if $printMsg;
}
    
#============================================================================#
sub logOutputFileCreation {
    
    my $outputFile = shift;
    
    $outputFile = abs_path($outputFile);
    
    my $fileExists = -s $outputFile ? 1:0;
    my $msg = $fileExists ? "Created $outputFile" :
        "ERROR: failed to create $outputFile";
    logOutput($msg,1);
    
    return $fileExists;
    
}

#============================================================================#
sub printhelp {
    my $verbose = shift || 1;
    pod2usage(-verbose=>$verbose);
    exit 1;
}

#============================================================================#
