idRepeatBoundary.pl
idRepeatNoundary.pl [options] -if <input fasta> -id <input blast database> -e <contig end> -blastcmd <blastcommand> -ao <AnchorFileName> -bo <BoundaryFileName> Options: -h this help message -d turn debug printing on -id input blast database (REQUIRED) -if input query fasta (REQUIRED) -e contig end (REQUIRED) -li library info file (REQUIRED) -blastcmd blast command (REQUIRED) ex /usr/X11R6/bin/blastall -p blastn -F F -e 1e-5 #-m 8 is hardcoded, and -i -d are command line options -bo boundary output file name (REQUIRED) ex /thisdir/ThisCtg.boundary" -ao anchor output file name (REQUIRED) ex "/thisotherdir/ThisCtg.anchor" -k keep temporary blast files -al unique anchor size (default=50) -pl length of boundary padding (default=50) -wl window size to examine for repeats (default=500) -sl subwindow size to examine for repeats (default=100) -ws window step size (default=250) -rl definition: repeat minimum length (default=100) -ri definition: repeat minimum identity (default=95)
Input fasta is aligned to a database using blastn -m 8 and the unique/repeat boundary is returned as well as a unique anchor in fasta format. =head1 DESCRIPTION
Given a newbler assembly where 2 contigs in a scaffold flank a particular gap,
ctgA ctgB |----------| gap |----------| 1 N 1 N
This program is designed to identify the boundary between unique and repeat sequence within a contig by aligning the contig to a blast database of the entire assembly. Blast results are parsed for any repeat >= repeat length (-rl) with percent identity >= repeat identity (-ri).
The boundary locations of unique and repeat as well as the fasta sequence for a unique anchor are deposited in the location of the input query sequence.
The unique-repeat boundary is defined by examining blast results within a specified size window X to Y (-wl), starting in a small subwindow Y-Z defined by subwindow length (-sl).
<-gap ctgNNN |----------------------------------------------------.... X Y-Z Y ^.............|.........^
If there are no repeat hits then the boundary is X+padding (-pl). The unique anchor is then: X+padding (-pl) to X+padding+anchor length (-al).
If there are blast hits then the hits are merged to consense overlapping, adjacent, and redundant hits, and the presence of merged repeat is first checked for in the sub-window, defined by Y-subwindow length (-sl) to Y. If repeat is present within this sub-window then the entire window X to Y is shifted by window step (-ws). The window keeps sliding by -ws, and the presence of repeat is checked in the subwindow until no repeat is found.
Once the presence of repeat is not detected in the subwindow, the rest of the window X to Y - subwindow length (-sl) is checked for repeat and the unqique/repeat boundary is defined as the end of the last repeat within that window + padding (-pl):
<-gap ctgNNN --------||||||----|||||---------------------.... ^repeat Y minus Z X.................................|........Y ^ ^- no repeat detected in sub-window | boundary -| <-(-pl)-> x y repeat bounds >------< unique anchor |-----------------------------| |------..... unique bounds
The END of the unique boundary is defined by the largest insert size library for the project x 1.5, starting at the beginning of the contig, no the beginning of the unique boundary.
If the boundary end coordinate goes past the end of the contig then the contig is considered all repeat and no anchor tag is created.
The unique anchor (see x,y) start is then defined as 1 + boundary position + padding length (-pl), and end is start + anchor length (-al).
The boundaries of unique and repeat as well as unique anchor sequence are named and deposited according to -bo -ao respectively.
boundary file: #uniqueStart uniqueEnd repeatStart repeatEnd 51 2080 1 50
anchor file: >contig00005 length=2080 numreads=426 GGTCGGCGACGTAGCCGGGCGCGGCGCCGTCCGGCACCTCGCCGTGGAAC
Contig end, specified by -e, is important in that if the contig resides on the left side of a gap, then the fasta sequence is reverse complimented and treated as a contig who resides on the right side of the gap (for coding efficiency). The boundary and anchor sequence are identified, and reverse complimented.
The coordinates of unique and repeat can then be used to extract which reads can be safely gathered for subprojecting, and the anchor is used to check if reassembly of a subproject is in a single contig.
$Revision: 1.23 $
$Date: 2010-01-07 19:41:48 $
Kurt M. LaButti
K.LaButti 2008/10/23 Creation