tacg  -  finds short patterns in nucleic acids, translates
       DNA <-> protein.

SYNOPSIS

       tacg  -flag  [option]  -flag  [option]   ...   <input.file
       >output.file  tacg  takes input from stdin (| or <); spits
       output to screen (default), >file, | next command

       [-chHlLqQsSv]  [-b  #]  [-e  #]   [-C   {0-12}]   [--clone
       '#_#,#x#...']    [--cost   #]  [-D  0-4]  [--dam]  [--dcm]
       [--example] [-f {0|1}] [-F {0-3}] [-g #,#] [-G  #,{X|Y|L}]
       [-H  (--HTML)  0|1]  [-i (--idonly) 0-2] [-m #] [-M #] [-n
       {3-8}] [--numstart] [--notics] [-o {0|1|3|5}] [-O {1-6},#]
       [-p                  Name,pattern,Err]                 [-P
       NameA,(+|-)(l|g)Dist_Lo(-Dist_Hi),NameB]  [--ps]   [--pdf]
       [--logdegens]    [-r    (--regex)    {'Label:RegexPat'   |
       'FILE:FileOfRegexPats'}]                           [--rule
       'Name,(LabA:m:M&LabB:m:M),Win']                [--rulefile
       '/path/to/rulefile'] [-R alterative  Pattern/Matrix  file]
       [--raw]  [--silent] [--strands {1|2}] [-T {0|1|3|6},{1|3}]
       [-V   {1-3}]   [-w   {1|#}]   [-W   (--slidwin)   #]   [-x
       NameA(=),NameB..(,C)] [-X (--extract) {b,e,[0|1]}] [-# %]

DESCRIPTION

       tacg takes input from stdin, automagically translates most
       standard ASCII formats of Nucleic Acid (NA) sequence, then
       analyses  that  sequence for restriction enzyme (RE) sites
       and other NA motifs  such  as  Transcription  Factor  (TF)
       binding sites (w/ or w/o mismatch errors), matrix matches,
       and regular  expressions,  finally  writing   analyses  to
       stdout.   It also can translate the NA input to protein in
       any frame, using any of a  number  of  Codon  translations
       tables, and search for Open Reading Frames (ORFs), as well
       as perform many other analyses.  Most of the internals use
       dynamic  memory  so there are few limits on sequence input
       size and pattern number.  It's ~  5-50x  faster  than  the
       comparable  routines  in GCG or EMBOSS and as it's writ in
       ANSI C, portable to all unix variants, and even  Microsoft
       Win32 with the Cygwin and the ming32 toolkits.

       tacg  searches  the  sequence  read from stdin for matches
       based on descriptions stored in a  database  of  patterns,
       either   explicit  sequences,  possibly  containing  IUPAC
       degeneracies  (default  rebase.data,  in  GCG  format   or
       extended   format),   or   matrix   descriptions  (default
       matrix.data,  in  TRANSFAC  format),  regular  expressions
       (default  regex.data, in GCG-like format), or a rules file
       (default rules.data in a simple format) based  on  matches
       and  options entered on the command line, sends ALL output
       to stdout. (Unless requested, it no longer sends errors to
       the input sequence in different ways depending  on the  -D
       flag (see below).  It either strips all letters other than
       'a','c','g', or 't' and analyzes the  sequence  as  'pure'
       using a fast incremental hashing algorithm or it treats it
       as degenerate and analyses it via a slower de  novo  hash.
       By default, it treats sequence as 'pure' unless it detects
       an IUPAC degeneracy, in  which  case  it  will  adaptively
       switch  back  and  forth between the fast and slow hashing
       routines.


       NB: tacg can produce lots of  output,  especially  in  the
       Linear  map  mode;  while  it's possible to pipe direct to
       lp/lpr, you'll probably regret it.

REQUIREMENTS

       tacg 3.5 requires an external Codon  file  codon.data  but
       does   not   absolutely  require  a  pattern/REBASE  file,
       allowing you to enter patterns  via the command line  with
       the  '-p'  flag.   However,  most users will want to use a
       REBASE file in GCG format to supply  the  RE  definitions.
       By   default   the   name  of  this  (supplied)  file  is:
       rebase.data, altho other files in the same format  can  be
       specified  by  the -R flag.  While you can use the default
       GCG-formatted  file   from   NEB's   REBASE   distribution
       (http://rebase.neb.com),    additional    information   is
       required to use the --dam, --dcm, or --cost options.  This
       info  is  included  in the distribution of tacg and can be
       added or modified  with  a  text  editor.   Searching  for
       Matrices  requires  the  use  of a TRANSFAC-formatted file
       (also supplied in the default name of matrix.data ).


       The codons/pattern/matrix data files may exist in any of 3
       locations  which are searched in the order of: the current
       directory $PWD, your home directory  $HOME,  or  tacg  lib
       $TACGLIB.   Many  shells will automatically define the 1st
       two; the last must be specified either via command line or
       in your .cshrc file.

       ie. 'setenv TACGLIB /usr/local/lib/tacg'   [csh/tcsh]

       or 'export TACGLIB=/usr/local/lib/tacg'    [bash]

FLAGS and OPTIONS

       {}  =  required  for flag; * = default (doesn't need to be
       entered); # = an integer value; () = optional

       ie. -f {0,1*} means that the flag must be entered  -f1  or
       -f0.   The  flags  -f0  or -f 0 are equally acceptable and
              label indicates numbering from the beginning of the
              entire sequence (see file 'tacg.main.html' for more
              detail).   The  smallest  sequence  that  tacg  can
              handle is 4 bases, 10  for  the  ladder  map  (-l).
              This allows analysis of primers and linkers.


       -e {#} select  the  end  of  a  subsequence  from a larger
              sequence file; 0* for last base of  sequence.   The
              largest sequence that I've sent thru it is ~75MB.


       -c     order  the output by # of cuts/fragments by each RE
              (Strider   style)   and   thence    alphabetically;
              otherwise  output  is by order of appearance in the
              REBASE file.




       -C {1*-13}
              Codon Usage table to use for translation:


               1 Standard                  8 Euplotid_Nuc
               2 Vert_Mito                 9 Bacterial
               3 Yeast_Mito               10 Alt_Yeast_Nuc
               4 Mold_Protoz_Coel_Mito    11 Ascidian_Mito
               5 Invert_Mito              12 Flatworm_Mito
               6 Ciliate_Nuc              13 Blepharisma_Nuc
               7 Echino_Mito

              The Codon Usage file used in Ver 3 (codon.data)  is
              a  slightly  modified  NCBI  format, which includes
              info (currently ignored) about  multiple  initiator
              codons and references.


       --clone '#_#,#x#...'
              Clone  finds  sequence ranges which either MUST NOT
              be cut (#_#) or that MUST be cut  (#x#),  up  to  a
              maximum of 15 at once.  Ranges not specified can be
              either cut or not cut.  The output first lists  all
              REs  (if  any)  which match ALL the rules, then all
              REs which match SOME rules as long  as  all  NO-CUT
              rules are respected.  The same filters that work in
              other  RE  selections  (-n,  -o,  -m,  -M,  --cost,
              --dam/dcm)  can  be  applied  here to fine-tune the
              selection.


       --cost {#}
              as '-D3'
               2   degen's  OK;  ignore in KEY hexamer, but match
              outside of KEY
               3  degen's OK; expand in KEY  hexamer,  find  only
              EXACT matches
               4   degen's  OK;  expand  in KEY hexamer, find ALL
              POSSIBLE matches

              The pattern matching is  adaptive;  given  a  small
              window  of  nondegenerate  sequence,  the algorithm
              will match very fast;  if  degenerate  sequence  is
              detected,  it  will  switch  to a slower, iterative
              approach.   This   results   in   speed   that   is
              proportional  to degeneracy for most cases.  If you
              have  long   sequences   of   'n's   (inserted   as
              placekeepers,  for  instance),  -D2 may be a better
              choice.  In all cases, as soon as degeneracy of the
              KEY  hexamer  exceeds  a compiled-in limit (usually
              256-fold degeneracy), the KEY is skipped.


       --dam  Dam sensitivity simulation of  Dam  methylation  of
              the   DNA.    Dam   methylase   has  a  palindromic
              recognition site (GmATC) which can  interfere  with
              the  binding  and   cutting  of a number of Type II
              REs.   This  flag  simulates  the  effect  of   Dam
              methylation,   but   requires   extra  data  to  be
              available  in  the  rebase  file.   If  the  RE  is
              completely  blocked,   it will be noted that it did
              not  cut  at  all   in   the   summary   statement.
              Otherwise,  the  effect is noted only by difference
              in the number of sites listed for the  -S   and  -F
              flags.   The  sites  are still listed in the Linear
              Map to indicate where they WOULD be if the DNA  was
              not methylated.


       --dcm  Dcm sensitivity similar to '--dam' simulation above
              but with Dcm methylation of the DNA.  Dcm methylase
              also  has  a  palindromic recognition site (CmCWGG)
              which can interfere with RE action.


       --example {1-10}
              example code to show how to add your own flags  and
              functions.   Search  for  'EXAMPLE' in 'SetFlags.c'
              and 'tacg.c' for the code.


       -f {0|1*}
              form (or topology) of DNA - 0 (zero) for  circular;
              1   for   linear.    This  flag  also  operates  on
              matrix  matching,  even  tho  it  doesn't make much
              sense to use it in that way.


       -g {min#(,Max#)}
              specify if you want a pseudo-gel map graphic,  with
              a  low  end  cutoff  of  min#  bases and a high end
              cutoff of Max#.  If Max # is omitted, the length of
              the sequence is assumed.   These numbers can be any
              any integer exponent of 10 (10,  100,  1000,  etc).
              See examples below.


       -G {binsize,X|Y|L}
              Graphic data output, so (mis)named for its original
              use, where:

              binsize = # bases for which hits should  be  pooled
              X|Y|L  indicates  whether the BaseBins should be on
              the X or Y axis
               X: BaseBins 1000 2000 3000 4000  ..
                  NameA      0    4    0    7   ..
                  NameB     22   57    98    29    ..      (#s  =
              matches per bin)
                  NameC      1    0    0    3   ..
                  .
               Y: BaseBins  NameA   NameB   NameC   ..
                    1000      0      22       1     ..
                    2000      4      57       0     ..
                    3000      0      98       0     ..
                    4000      7      29       3     ..
                   .
               L: Basebins  NameA
                    1000      0
                    2000      4
                      .      .
                  Basebins  NameB
                    1000     22
                    2000     57
                      .      .

              This  addresses  some missing features - allows the
              export of match data  for  the  selected  Names  to
              allow  external  analysis  of  the  raw data.  Like
              other output, it is streamed to stdout, so it's not
              wise  to  mix  -G  with  other  analyses; the lines
              generated (esp.  w/ the X  option),  can  be  quite
              long and are NOT governed by the -w flag).


       -h     brief help page (condensed man page).

              controls the output for sequences (in a collection)
              that have no hits for the options  selected.   0  -
              (default)  ID  line and normal output regardless of
              hits 1 - BOTH ID line and normal output are printed
              ONLY  IF  there  are  hits.   2  -  ONLY ID line is
              printed if there are hits (to identify sequences of
              interest in a scan for further analysis).



       -l     specify  if  you  want  a  ladder  map  of selected
              enzymes, much like the  GCG  MAPPLOT  output.  Also
              appends a summary of those enzymes that match a few
              times.  The number of matches that is  included  in
              the  summary is length-sensitive in the distributed
              source code, but it can be overrriden  by  changing
              the  value  assigned  to  '#define SUMMARY_CUTS' in
              'tacg.h'



       -L     specify if you WANT a Linear map.  This  spews  the
              most  output  (about 10x the # of input characters)
              and depending on what other options are  specified,
              can  be  of  moderate  to  very   little use.  This
              option no longer generates  the  co-translation  by
              default  as  it did in prior versions.  If you want
              the co-translation, you'll have to specify  it  via
              the  -T  flag below.  The Linear map also no longer
              shows ALL the patterns that match from the  pattern
              file.   It  now obeys the same filtering rules that
              the Sites, Fragments, Ladder Map and other analyses
              do.  This behavior was requested by several people,
              and I have to admit it makes sense.   tacg  3  also
              labels  non-palindromic  patterns as to orientation
              if they are reversed relative to the way they  were
              enterered, by appending a ~ character to the end of
              the pattern label in the linear map.



       --strands {1|2*}
              in Linear Map, print 1 or 2  strands.   Along  with
              '--notics',  can be used to compact the output by 2
              lines per stanza.  1  -  only  the  top  strand  is
              printed.   2  -  both  top  and  bottom strands are
              printed



       --notics
              in Linear Map, DON'T print the tics -  can be  used
              cuts in sequence; 0* for all. Affects the number of
              enzymes displayed  by  the  sites  (-s),  fragments
              (-F),  gel  (-g),  ladder (-l), and linear map (-L)
              flags.


       -n {3*-10}
              select enzymes by magnitude of recognition site;  3
              =  all,  5  =  5,6,7,8...   n's  don't count, other
              degeneracies  are  summed  ie:  tgca=4,   tgyrca=5,
              tgcnnngca=6, tannnnnnnnnnta=4


       -o {0,1*,3,5}
              select  enzymes  by overhang generated; 5 = 5', 3 =
              3', 0 for blunt, 1 for all.


       -O {1-6,MinSiz}
              crude ORF analysis producing either  a  line  or  a
              block (depends on -w) for each ORF including:

               = Frame of the Current ORF
               = Sequence # of the Current ORF
               = Offset from the start in both bases and AAs
               = Size of the ORF in AAs and KDa
               =  ORF  itself  in  1  letter  code  for  external
              analysis

              NB: If -w is set to 1, the output is written in a 2
              line,  FASTA-like  stanza  for each ORF (the header
              prefixed by '>', and the ORF itself), so that line-
              oriented  pattern-matching tools (grep, egrep, awk)
              can  examine   the   ORF   for   matching   regular
              expressions  (see  the  GNU  grep  man  page for an
              explanation of regular expressions).  In  this  way
              you  can  search  all  6 frames of >MinSize AAs for
              whatever pattern interests you.  If -w  is  set  to
              one  of the regular widths, the ORF will be wrapped
              at that length to form a FASTA formatted block  for
              analysis  by  other  apps,  more biologically aware
              tools like FASTA, BLAST, etc.

              Examples:
               -O 145,25  frames 1,4,5 with a min ORF size of  25
              AAs
               -O  2,66     frame 2 with a min ORF size of 66 AAs


       -p {Name,Pattern[,Err]}
              allows entry of search patterns  from  the  command
              line;
              in validating the patterns. But actual  the  search
              will go very fast.


       -P   {NameA,[+-][lg]Dist_Lo[-Dist_Hi],NameB}
              Proximity  matching.  Use this option to search for
              spacial relationships between factors, 2 at a  time
              (up to a total of 10).

              NameA and NameB must be in a REBASE-formatted file,
              either the default rebase.data or another specified
              by  the  -R flag and are case INsensitive.  NameA/B
              patterns can be composed of  any  IUPAC  bases  and
              ERRORs can be specified in the REBASE entry ie:

               Pit1   5   WWTATNCATW   0   2 ! a Pit1 site with 2
              error
               Tataa 4  TATAAWWWW   0  1 ! a Tataa  site  with  1
              error

               +   NameA  is  DOWNSTREAM  of  NameB  (default  is
              either)
               -  NameA is UPSTREAM of NameB     (ditto)
               l  NameA is LESS THAN Dist_Lo from NameB (default)
               g  NameA is GREATER THAN Dist_Lo from NameB
               Dist_Hi  - if used, implies a RANGE, obviates l or
              g

              Example I
                 -PHindIII,350,bamhi     Match all HindIII  sites
              within 350 bases of BamHI sites

              Example II
                 -PPit1,-30-2500,Tataa    Match  all  Pit1  sites
              that are 30 to 2500 bases UPSTREAM of a Tataa site.


       --ps   generates  a  postscript  plasmid map (and multiple
              pages with the same  parameters  if  fed  a  multi-
              sequence   file).    The   output   file  is  named
              tacg_Map.ps and additional plots will  be  appended
              to  it  if it exists in the same directory.  REs to
              be  plotted  can  be  selected   with   the   usual
              parameters:  (-m  -M  --cost  --n -x -p) but you'll
              usually want to use -M1 or -M2.   Degeneracies  are
              plotted  along  the rim as grayscale arcs (remember
              tacg can tolerate degeneracies in sequence, so  you
              can  compose  accurate  plasmid  maps by connecting
              known sequences with N's.)  ORFs from any  and  all
              frames can be plotted internal to the sequence ring
              by using the -O flag.

              shading  degeneracies  differently.   It  is  quite
              memory intensive as it marks the beginning and  end
              of  every  degeneracy  run.    No  external data is
              produced, but could be as it's just a simple 2-step
              array.



       -q     Work  quietly.   DISallows  sending  diagnostic UDP
              info back  to  author;  this  is  now  the  default
              behavior, on the request of a number of people.  If
              you wish to send UDP packets back to my server  and
              possibly  annoy  your  network security people, you
              may try to with the -Q flag.



       -R {REBASE|Matrix file}
              specifies an alternative database, (RE  or  Matrix)
              use.   The  RE  database  must  be  in the same GCG
              format as  rebase.data.   There  are  some  example
              alternative  REBASE  files  shipped  with  the tacg
              distribution named '*.RB'.

              The latest REBASE files are available via FTP:

              ftp://ftp.neb.com/pub/rebase/

              or via WWW:

              http://www.neb.com/rebase/rebase.html

              and the latest TRANSFAC database is available at:

              http://transfac.gbf.de/TRANSFAC/index.html

              The file specified with the -R flag is searched for
              in  the  same order as the other data files: $PWD ,
              $HOME , $TACGLIB.


       --raw  makes tacg consider ALL input as  raw,  unformatted
              sequence.   This  allows it to process unstructured
              data such as fragments of files and editor buffers.
              It  ignores everything NOT an IUPAC degeneracy, but
              will consider all possible IUPAC  degeneracies,  so
              will   produce   odd  output  if  fed  a  regularly
              formatted sequence file (it  will  process  headers
              and  comments   as sequence.)  This is the behavior
              of the version 2 tacg (before SEQIO).


              pattern  'FILE').  Specific regex patterns from the
              file can be specified by using  the  '-r'  flag  to
              name  them explicitly.  Regular expression searches
              are  considerably  slower  than  other   types   of
              searches,  but  searches  of  100Kb, with <10 regex
              patterns of even reasonably high complexity  should
              be tolerable.



       --rule {logic}
              (see  also  -P  above) --rule allows you to specify
              arbitrarily   complex   logical   associations   of
              characteristics   to   detect   the  patterns  that
              interest   you.    Admittedly,   that   phrase   is
              incomprehensible  on  its  own,  so  let me give an
              example:

              Say you wanted to search for an enhancer  that  you
              suspected  might be involved in the transcriptional
              regulation of a pituitary-specific gene.  You  knew
              that  you were looking for a sequence about 1000 bp
              long in which there were at least 2 Pit1 sites  and
              3-5 Estrogen response elements, but NO TATAA boxes.
              If you had defined these patterns in a file  called
              pit.specific as:

               Pit1  0  WWTATNCATW    0 1 ! Pit1 site w/ 1 error
               ERE   0  GGTCAGCCTGACC 0 1 ! ERE site w/ 1 error
               TATAA 0  tataawwww     0 0 ! TATAA site, no errors
              allowed

               you could specify this search by:

              tacg --rule '((Pit1:2:7&ERE:3:5)&(TATAA:0:0),1000)'
              -R pit.specific < input_sequence >output


              This  query  searches  a sliding window of 1000 bps
              (-W 1000) for ((2-7 Pit1 AND 3-5 ERE sites) AND  (0
              TATAA  sites)).  These combinations can be as large
              as your OS allows  your  command-line  to  be  with
              arbitraily   complex   relations  represented  with
              logical  AND  (&),  OR  (|),   and   XOR   (^)   as
              conjunctions.   Parens enforce groupings; otherwise
              it's evaluated left to right.




       --rulefile '/path/to/the/rulefile'
              This option allows you to read in a  complete  file


       -S     prints  the  the  actual  matched  Sites in tabular
              form, much like Strider's output.  See  also  '-c',
              above.


       --silent
              requests  that  the  nucleic  sequence submitted be
              translated starting at the 1st  base,  in  frame  1
              (use  -b  to shift the starting base), according to
              the Codon Translation table specified with -C, then
              reverse translated, using the same table, using all
              the  possible  degeneracies,  then  restrict   that
              (quite)  degenerate  sequence  and show all the REs
              that will match it.  You should use  the  '-L'  and
              '-T'  flags  to generate the linear map which shows
              both the  REs  and  the  cotranslated  sequence  to
              verify  that all is as it should be.  NB: Depending
              on  Codon  Table,  some  AAs  are  not   reversibly
              translatable.    Using   the  standard  table,  Arg
              (=mgn),  Leu  (=ytn),  and  Ser  (=wsn)  cannot  be
              Forward  translated from their Reverse translation.


       --tmppath /path/to/tmp/dir
              passes the path to tacg to cooperate with  CGIs  or
              other  programs  that  need  to  tell tacg where to
              place the ps/pdf files.


       -T {[0*|1|3|6],[1|3]}
              requests frames 1, 1-3, or 1-6 to  be  cotranslated
              with  the  Linear  Map  using  1 or 3 letter codes.
              Requires '-L' to have any effect.

              Ex: "-T3,3" translateslates  Frames  1,2,3  with  3
              letter labels.
                  "-T1,1"  translateslates  Frame 1 with 1 letter
              labels.



       -v     asks for program version  (there  may  be  multiple
              versions  of  the  same functional program to track
              its migration).


       -V {1-3}
              Verbose  output-  requests  all   kinds   of   ugly
              diagnostic  info  to be spat to the screen.  May be
              useful in diagnosing why tacg  did  not  behave  as
              small fonts, you may want to use this flag to widen
              the output. In version 3, the option '-w 1'  allows
              you  to  put as much information as possible on one
              line for easier parsing by some external apps.

              Ex: "-w 1" prints output in one line
                  "-w  150"  causes   wrapping   at   about   170
              characters  (150 bp wide in the Linear map option).



       -x {Label(,=),Label..(,C)}
              used to restrict the patterns searched for by  Name
              label (either from the 1st field of a REBASE format
              file or the NA field from a TRANSFAC  format  file)
              up to a maximum of 15.  Case INsensitive (HindIII =
              hindiii =  HinDiIi),  but  it  HAS  to  be  spelled
              exactly  like  the  entry  in  rebase.data  with no
              spaces.  (HindIII != Hind III != Hind3).

              The '=' tag  invokes  the  Hookey  function  (named
              after its requestor, John Hookey), in which the '='
              tags the RE to  which  it  is  appended.   This  is
              useful  if  you're  trying  to discern or predict a
              labelled fragment in a mixture of  fragments.   The
              output  shows  the fragments generated only if they
              have one or both ends generated by the  tagged  RE.
              This  option  works  even  if there are a number of
              REs,  but  only  one  can  be   tagged.   Ex:   '-x
              HindIII,=,MseI,HinfI'  causes  the DNA to be cut by
              HindIII, MseI, and HinfI, but only  fragments  that
              have  a  HindIII  end will be shown.  The output is
              shown both unsorted and sorted  by  fragment  size.
              If  you  want  to  cause  the  output to simulate a
              multiple digest with all the REs designated, append
              a   ',C'   to   the   list   of   RE   names.   Ex:
              -xBamHI,EcorI,NruI,C

              NB: Don't assign the name 'C' to  any  patterns  or
              REs.




       -X (--extract) {b,e,[0|1]}
              causes  the sequences bounding the match to be spat
              to stdout  in  FASTA  format.   b  and  e  are  the
              beginning   and  ending  offsets  respectively  for
              varying the window around the match.   NB:  both  b
              and  e are measured from the start of the match, so
              e must be corrected for the length of  the  pattern
              itself.
              has to match the matrix at 95% or better).

RELATED PROGRAMS

       In Ver 3, tacg has  (finally)  incorporated  Jim  Knight's
       (jknight@guarneri.curagen.com)   SEQIO  library  calls  to
       provide automagic format conversion of incoming sequences.
       This  also allows multiple sequences to be run at the same
       time, allowing tacg to scan databases.

       Wu and Manber's agrep is an amazing piece of software  for
       searching  for  multiple  patterns with errors.  While not
       optimzed for molecular biology, it can  be  used  to  scan
       sequences.   Jim Knight distributes a variant of it called
       grepseq with his SEQIO pkg, which IS molbio-aware, but not
       as  generally useful as (to me anyway) as tacg, as it only
       scans one strand and will only search up to 6 matches  for
       some  reason.  However,  I've  started  to incorporate the
       grepseq  core  into  tacg.    agrep   is   available   via
       ftp://ftp.cs.arizona.edu/agrep/ or http://manber.com.
        The SEQIO pkg is distributed around the web.


       You can also use the excellent paging utility less to move
       thru your sequence file and use  its  marking  and  piping
       facility  to  punt the sequence of interest to 'tacg'.  In
       many terminal emulators it  will  also  highlight  matched
       search  terms,  and  so makes an excellent way to scan the
       output for regions of interest.  Many editors  also  allow
       piping  a  selection  of  text  to an external program and
       inclusion of the  result  into  another  window  (  nedit,
       crisp, joe, the indefatiguable emacs/xemacs and others).


       Much   of   the  output  benefits  from  wider-than-normal
       printing.  The '-w#' flag allows output up  to  about  230
       characters  wide,  however to print this without wrapping,
       you need to use small fonts.  A number  of  unix  printing
       utilities   allow  you  to  do  this,  notably  genscript:
       http://www.hut.fi/%7Emtr/genscript/index.html

EXAMPLES

       Used alone:

              tacg  -f0   -n5   -T3,1   -sL   -F3   -g   100,1000
              <NewFile.Genbank >output.file

              tacg  -R  yeast.matrices  -#  85   -sSlc   -w90   <
              yst_chr_4.seq >out

              Translation:  Search  the sequence in yst_chr_4.seq
              for  all  the  matrices  described  in   the   file
              yeast.matrices  ,  applying a uniform cutoff of 85%
              (-# 85) to the maximum possible score, writing  the
              summary,  Sites,  ladder map, doubly-sorted (-sSlc)
              printed 90 characters wide (-w90) to the file out


       Specifying patterns on the command-line

              tacg -p Pit1,tatwcata,1 -p ap2,tgygcatw,1 -w90 -sSL
              < rprlPromo.seq > promo.map

              Translation:  search  for the patterns labeled Pit1
              and ap2 with 1 error each and search  the  sequence
              from  the file rprlPromo.seq for them, printing the
              results (summary (-s), Sites (S),  and  the  Linear
              Map  (L)  90  characters  wide  (-w90)  to the file
              promo.map


       Used to search the entire yeast 500bp Upstream  Regulatory
       sequences  (a  database  of  6226  500  bp  sequences) for
       matches to the MATa1 binding site (from TRANSFAC) :

              tacg  -R  TRANSFAC.data  -sScw1  -rMATa1   -#95   <
              utr5_sc_500.fasta > yeast.summary

              Translation:  translate each of the FASTA formatted
              entries in the input  file  utr5_sc_500.fasta  into
              usable  sequence,  and  after finding the MATa1 (-r
              MATa1) matrix  description  from  the  database  -R
              TRANSFAC.data  search  the sequences for matches at
              95% of the max score that it has  in  the  TRANSFAC
              database  (-#  95), returning the summary (-s), the
              sites  (-S)  sorted  in  Strider  order  (-c)  with
              results  printed  on  1  line  (w1),  directing the
              output into the file yeast.summary

BUGS and ODDITIES

       Major


              tacg, if used with the -Q flag,  spits  back  about
              100  bytes  of  information  about  it's  use  (the
              SEQIO  library,  but  I'd  rather  not  as  it does
              include a lot of (hidden) functionality that I plan
              to use later.


              tacg  v2.0  will not currently cut sequence shorter
              than 5 bases; if  you  need  to  analyze  sequences
              shorter  than  this, perhaps you're using the wrong
              program.


              main() and functions  were  originally  written  as
              single pass code but with the help of Gray Watson's
              excellent (!)  dmalloc  malloc  debugging  library,
              available  at: http://www.dmalloc.com I've recently
              put  some  effort   into  tracking  memory   leaks,
              especially  since  much  of  the code has to be re-
              entrant for doing  analyses  over  many  sequences.
              However, it's not completely leak-free yet, so user
              beware.


              The command line handling has been  completely  re-
              written,   using  the  getopt()  and  getopt_long()
              functions,  so  the  flags  are  considerably  less
              sensitive to spacing and order.


              Translation  in  6 frames assumes circular sequence
              regardless of '-f' flag, so  that  the  last  amino
              acids in frames 5 and 6 in the 1st output block are
              obviously incorrect  if  you  are  assuming  linear
              sequence.

              See  the  manual  for  other  bugs which the author
              thinks are less problematic.


              Harry Mangalam (hjm@tacgi.com)

Man(1) output converted with man2html