tacg - finds short patterns in nucleic acids, translates
DNA <-> protein.
SYNOPSIS
tacg -flag [option] -flag [option] ... <input.file
>output.file tacg takes input from stdin (| or <); spits
output to screen (default), >file, | next command
[-chHlLqQsSv] [-b #] [-e #] [-C {0-12}] [--clone
'#_#,#x#...'] [--cost #] [-D 0-4] [--dam] [--dcm]
[--example] [-f {0|1}] [-F {0-3}] [-g #,#] [-G #,{X|Y|L}]
[-H (--HTML) 0|1] [-i (--idonly) 0-2] [-m #] [-M #] [-n
{3-8}] [--numstart] [--notics] [-o {0|1|3|5}] [-O {1-6},#]
[-p Name,pattern,Err] [-P
NameA,(+|-)(l|g)Dist_Lo(-Dist_Hi),NameB] [--ps] [--pdf]
[--logdegens] [-r (--regex) {'Label:RegexPat' |
'FILE:FileOfRegexPats'}] [--rule
'Name,(LabA:m:M&LabB:m:M),Win'] [--rulefile
'/path/to/rulefile'] [-R alterative Pattern/Matrix file]
[--raw] [--silent] [--strands {1|2}] [-T {0|1|3|6},{1|3}]
[-V {1-3}] [-w {1|#}] [-W (--slidwin) #] [-x
NameA(=),NameB..(,C)] [-X (--extract) {b,e,[0|1]}] [-# %]
DESCRIPTION
tacg takes input from stdin, automagically translates most
standard ASCII formats of Nucleic Acid (NA) sequence, then
analyses that sequence for restriction enzyme (RE) sites
and other NA motifs such as Transcription Factor (TF)
binding sites (w/ or w/o mismatch errors), matrix matches,
and regular expressions, finally writing analyses to
stdout. It also can translate the NA input to protein in
any frame, using any of a number of Codon translations
tables, and search for Open Reading Frames (ORFs), as well
as perform many other analyses. Most of the internals use
dynamic memory so there are few limits on sequence input
size and pattern number. It's ~ 5-50x faster than the
comparable routines in GCG or EMBOSS and as it's writ in
ANSI C, portable to all unix variants, and even Microsoft
Win32 with the Cygwin and the ming32 toolkits.
tacg searches the sequence read from stdin for matches
based on descriptions stored in a database of patterns,
either explicit sequences, possibly containing IUPAC
degeneracies (default rebase.data, in GCG format or
extended format), or matrix descriptions (default
matrix.data, in TRANSFAC format), regular expressions
(default regex.data, in GCG-like format), or a rules file
(default rules.data in a simple format) based on matches
and options entered on the command line, sends ALL output
to stdout. (Unless requested, it no longer sends errors to
the input sequence in different ways depending on the -D
flag (see below). It either strips all letters other than
'a','c','g', or 't' and analyzes the sequence as 'pure'
using a fast incremental hashing algorithm or it treats it
as degenerate and analyses it via a slower de novo hash.
By default, it treats sequence as 'pure' unless it detects
an IUPAC degeneracy, in which case it will adaptively
switch back and forth between the fast and slow hashing
routines.
NB: tacg can produce lots of output, especially in the
Linear map mode; while it's possible to pipe direct to
lp/lpr, you'll probably regret it.
REQUIREMENTS
tacg 3.5 requires an external Codon file codon.data but
does not absolutely require a pattern/REBASE file,
allowing you to enter patterns via the command line with
the '-p' flag. However, most users will want to use a
REBASE file in GCG format to supply the RE definitions.
By default the name of this (supplied) file is:
rebase.data, altho other files in the same format can be
specified by the -R flag. While you can use the default
GCG-formatted file from NEB's REBASE distribution
(http://rebase.neb.com), additional information is
required to use the --dam, --dcm, or --cost options. This
info is included in the distribution of tacg and can be
added or modified with a text editor. Searching for
Matrices requires the use of a TRANSFAC-formatted file
(also supplied in the default name of matrix.data ).
The codons/pattern/matrix data files may exist in any of 3
locations which are searched in the order of: the current
directory $PWD, your home directory $HOME, or tacg lib
$TACGLIB. Many shells will automatically define the 1st
two; the last must be specified either via command line or
in your .cshrc file.
ie. 'setenv TACGLIB /usr/local/lib/tacg' [csh/tcsh]
or 'export TACGLIB=/usr/local/lib/tacg' [bash]
FLAGS and OPTIONS
{} = required for flag; * = default (doesn't need to be
entered); # = an integer value; () = optional
ie. -f {0,1*} means that the flag must be entered -f1 or
-f0. The flags -f0 or -f 0 are equally acceptable and
label indicates numbering from the beginning of the
entire sequence (see file 'tacg.main.html' for more
detail). The smallest sequence that tacg can
handle is 4 bases, 10 for the ladder map (-l).
This allows analysis of primers and linkers.
-e {#} select the end of a subsequence from a larger
sequence file; 0* for last base of sequence. The
largest sequence that I've sent thru it is ~75MB.
-c order the output by # of cuts/fragments by each RE
(Strider style) and thence alphabetically;
otherwise output is by order of appearance in the
REBASE file.
-C {1*-13}
Codon Usage table to use for translation:
1 Standard 8 Euplotid_Nuc
2 Vert_Mito 9 Bacterial
3 Yeast_Mito 10 Alt_Yeast_Nuc
4 Mold_Protoz_Coel_Mito 11 Ascidian_Mito
5 Invert_Mito 12 Flatworm_Mito
6 Ciliate_Nuc 13 Blepharisma_Nuc
7 Echino_Mito
The Codon Usage file used in Ver 3 (codon.data) is
a slightly modified NCBI format, which includes
info (currently ignored) about multiple initiator
codons and references.
--clone '#_#,#x#...'
Clone finds sequence ranges which either MUST NOT
be cut (#_#) or that MUST be cut (#x#), up to a
maximum of 15 at once. Ranges not specified can be
either cut or not cut. The output first lists all
REs (if any) which match ALL the rules, then all
REs which match SOME rules as long as all NO-CUT
rules are respected. The same filters that work in
other RE selections (-n, -o, -m, -M, --cost,
--dam/dcm) can be applied here to fine-tune the
selection.
--cost {#}
as '-D3'
2 degen's OK; ignore in KEY hexamer, but match
outside of KEY
3 degen's OK; expand in KEY hexamer, find only
EXACT matches
4 degen's OK; expand in KEY hexamer, find ALL
POSSIBLE matches
The pattern matching is adaptive; given a small
window of nondegenerate sequence, the algorithm
will match very fast; if degenerate sequence is
detected, it will switch to a slower, iterative
approach. This results in speed that is
proportional to degeneracy for most cases. If you
have long sequences of 'n's (inserted as
placekeepers, for instance), -D2 may be a better
choice. In all cases, as soon as degeneracy of the
KEY hexamer exceeds a compiled-in limit (usually
256-fold degeneracy), the KEY is skipped.
--dam Dam sensitivity simulation of Dam methylation of
the DNA. Dam methylase has a palindromic
recognition site (GmATC) which can interfere with
the binding and cutting of a number of Type II
REs. This flag simulates the effect of Dam
methylation, but requires extra data to be
available in the rebase file. If the RE is
completely blocked, it will be noted that it did
not cut at all in the summary statement.
Otherwise, the effect is noted only by difference
in the number of sites listed for the -S and -F
flags. The sites are still listed in the Linear
Map to indicate where they WOULD be if the DNA was
not methylated.
--dcm Dcm sensitivity similar to '--dam' simulation above
but with Dcm methylation of the DNA. Dcm methylase
also has a palindromic recognition site (CmCWGG)
which can interfere with RE action.
--example {1-10}
example code to show how to add your own flags and
functions. Search for 'EXAMPLE' in 'SetFlags.c'
and 'tacg.c' for the code.
-f {0|1*}
form (or topology) of DNA - 0 (zero) for circular;
1 for linear. This flag also operates on
matrix matching, even tho it doesn't make much
sense to use it in that way.
-g {min#(,Max#)}
specify if you want a pseudo-gel map graphic, with
a low end cutoff of min# bases and a high end
cutoff of Max#. If Max # is omitted, the length of
the sequence is assumed. These numbers can be any
any integer exponent of 10 (10, 100, 1000, etc).
See examples below.
-G {binsize,X|Y|L}
Graphic data output, so (mis)named for its original
use, where:
binsize = # bases for which hits should be pooled
X|Y|L indicates whether the BaseBins should be on
the X or Y axis
X: BaseBins 1000 2000 3000 4000 ..
NameA 0 4 0 7 ..
NameB 22 57 98 29 .. (#s =
matches per bin)
NameC 1 0 0 3 ..
.
Y: BaseBins NameA NameB NameC ..
1000 0 22 1 ..
2000 4 57 0 ..
3000 0 98 0 ..
4000 7 29 3 ..
.
L: Basebins NameA
1000 0
2000 4
. .
Basebins NameB
1000 22
2000 57
. .
This addresses some missing features - allows the
export of match data for the selected Names to
allow external analysis of the raw data. Like
other output, it is streamed to stdout, so it's not
wise to mix -G with other analyses; the lines
generated (esp. w/ the X option), can be quite
long and are NOT governed by the -w flag).
-h brief help page (condensed man page).
controls the output for sequences (in a collection)
that have no hits for the options selected. 0 -
(default) ID line and normal output regardless of
hits 1 - BOTH ID line and normal output are printed
ONLY IF there are hits. 2 - ONLY ID line is
printed if there are hits (to identify sequences of
interest in a scan for further analysis).
-l specify if you want a ladder map of selected
enzymes, much like the GCG MAPPLOT output. Also
appends a summary of those enzymes that match a few
times. The number of matches that is included in
the summary is length-sensitive in the distributed
source code, but it can be overrriden by changing
the value assigned to '#define SUMMARY_CUTS' in
'tacg.h'
-L specify if you WANT a Linear map. This spews the
most output (about 10x the # of input characters)
and depending on what other options are specified,
can be of moderate to very little use. This
option no longer generates the co-translation by
default as it did in prior versions. If you want
the co-translation, you'll have to specify it via
the -T flag below. The Linear map also no longer
shows ALL the patterns that match from the pattern
file. It now obeys the same filtering rules that
the Sites, Fragments, Ladder Map and other analyses
do. This behavior was requested by several people,
and I have to admit it makes sense. tacg 3 also
labels non-palindromic patterns as to orientation
if they are reversed relative to the way they were
enterered, by appending a ~ character to the end of
the pattern label in the linear map.
--strands {1|2*}
in Linear Map, print 1 or 2 strands. Along with
'--notics', can be used to compact the output by 2
lines per stanza. 1 - only the top strand is
printed. 2 - both top and bottom strands are
printed
--notics
in Linear Map, DON'T print the tics - can be used
cuts in sequence; 0* for all. Affects the number of
enzymes displayed by the sites (-s), fragments
(-F), gel (-g), ladder (-l), and linear map (-L)
flags.
-n {3*-10}
select enzymes by magnitude of recognition site; 3
= all, 5 = 5,6,7,8... n's don't count, other
degeneracies are summed ie: tgca=4, tgyrca=5,
tgcnnngca=6, tannnnnnnnnnta=4
-o {0,1*,3,5}
select enzymes by overhang generated; 5 = 5', 3 =
3', 0 for blunt, 1 for all.
-O {1-6,MinSiz}
crude ORF analysis producing either a line or a
block (depends on -w) for each ORF including:
= Frame of the Current ORF
= Sequence # of the Current ORF
= Offset from the start in both bases and AAs
= Size of the ORF in AAs and KDa
= ORF itself in 1 letter code for external
analysis
NB: If -w is set to 1, the output is written in a 2
line, FASTA-like stanza for each ORF (the header
prefixed by '>', and the ORF itself), so that line-
oriented pattern-matching tools (grep, egrep, awk)
can examine the ORF for matching regular
expressions (see the GNU grep man page for an
explanation of regular expressions). In this way
you can search all 6 frames of >MinSize AAs for
whatever pattern interests you. If -w is set to
one of the regular widths, the ORF will be wrapped
at that length to form a FASTA formatted block for
analysis by other apps, more biologically aware
tools like FASTA, BLAST, etc.
Examples:
-O 145,25 frames 1,4,5 with a min ORF size of 25
AAs
-O 2,66 frame 2 with a min ORF size of 66 AAs
-p {Name,Pattern[,Err]}
allows entry of search patterns from the command
line;
in validating the patterns. But actual the search
will go very fast.
-P {NameA,[+-][lg]Dist_Lo[-Dist_Hi],NameB}
Proximity matching. Use this option to search for
spacial relationships between factors, 2 at a time
(up to a total of 10).
NameA and NameB must be in a REBASE-formatted file,
either the default rebase.data or another specified
by the -R flag and are case INsensitive. NameA/B
patterns can be composed of any IUPAC bases and
ERRORs can be specified in the REBASE entry ie:
Pit1 5 WWTATNCATW 0 2 ! a Pit1 site with 2
error
Tataa 4 TATAAWWWW 0 1 ! a Tataa site with 1
error
+ NameA is DOWNSTREAM of NameB (default is
either)
- NameA is UPSTREAM of NameB (ditto)
l NameA is LESS THAN Dist_Lo from NameB (default)
g NameA is GREATER THAN Dist_Lo from NameB
Dist_Hi - if used, implies a RANGE, obviates l or
g
Example I
-PHindIII,350,bamhi Match all HindIII sites
within 350 bases of BamHI sites
Example II
-PPit1,-30-2500,Tataa Match all Pit1 sites
that are 30 to 2500 bases UPSTREAM of a Tataa site.
--ps generates a postscript plasmid map (and multiple
pages with the same parameters if fed a multi-
sequence file). The output file is named
tacg_Map.ps and additional plots will be appended
to it if it exists in the same directory. REs to
be plotted can be selected with the usual
parameters: (-m -M --cost --n -x -p) but you'll
usually want to use -M1 or -M2. Degeneracies are
plotted along the rim as grayscale arcs (remember
tacg can tolerate degeneracies in sequence, so you
can compose accurate plasmid maps by connecting
known sequences with N's.) ORFs from any and all
frames can be plotted internal to the sequence ring
by using the -O flag.
shading degeneracies differently. It is quite
memory intensive as it marks the beginning and end
of every degeneracy run. No external data is
produced, but could be as it's just a simple 2-step
array.
-q Work quietly. DISallows sending diagnostic UDP
info back to author; this is now the default
behavior, on the request of a number of people. If
you wish to send UDP packets back to my server and
possibly annoy your network security people, you
may try to with the -Q flag.
-R {REBASE|Matrix file}
specifies an alternative database, (RE or Matrix)
use. The RE database must be in the same GCG
format as rebase.data. There are some example
alternative REBASE files shipped with the tacg
distribution named '*.RB'.
The latest REBASE files are available via FTP:
ftp://ftp.neb.com/pub/rebase/
or via WWW:
http://www.neb.com/rebase/rebase.html
and the latest TRANSFAC database is available at:
http://transfac.gbf.de/TRANSFAC/index.html
The file specified with the -R flag is searched for
in the same order as the other data files: $PWD ,
$HOME , $TACGLIB.
--raw makes tacg consider ALL input as raw, unformatted
sequence. This allows it to process unstructured
data such as fragments of files and editor buffers.
It ignores everything NOT an IUPAC degeneracy, but
will consider all possible IUPAC degeneracies, so
will produce odd output if fed a regularly
formatted sequence file (it will process headers
and comments as sequence.) This is the behavior
of the version 2 tacg (before SEQIO).
pattern 'FILE'). Specific regex patterns from the
file can be specified by using the '-r' flag to
name them explicitly. Regular expression searches
are considerably slower than other types of
searches, but searches of 100Kb, with <10 regex
patterns of even reasonably high complexity should
be tolerable.
--rule {logic}
(see also -P above) --rule allows you to specify
arbitrarily complex logical associations of
characteristics to detect the patterns that
interest you. Admittedly, that phrase is
incomprehensible on its own, so let me give an
example:
Say you wanted to search for an enhancer that you
suspected might be involved in the transcriptional
regulation of a pituitary-specific gene. You knew
that you were looking for a sequence about 1000 bp
long in which there were at least 2 Pit1 sites and
3-5 Estrogen response elements, but NO TATAA boxes.
If you had defined these patterns in a file called
pit.specific as:
Pit1 0 WWTATNCATW 0 1 ! Pit1 site w/ 1 error
ERE 0 GGTCAGCCTGACC 0 1 ! ERE site w/ 1 error
TATAA 0 tataawwww 0 0 ! TATAA site, no errors
allowed
you could specify this search by:
tacg --rule '((Pit1:2:7&ERE:3:5)&(TATAA:0:0),1000)'
-R pit.specific < input_sequence >output
This query searches a sliding window of 1000 bps
(-W 1000) for ((2-7 Pit1 AND 3-5 ERE sites) AND (0
TATAA sites)). These combinations can be as large
as your OS allows your command-line to be with
arbitraily complex relations represented with
logical AND (&), OR (|), and XOR (^) as
conjunctions. Parens enforce groupings; otherwise
it's evaluated left to right.
--rulefile '/path/to/the/rulefile'
This option allows you to read in a complete file
-S prints the the actual matched Sites in tabular
form, much like Strider's output. See also '-c',
above.
--silent
requests that the nucleic sequence submitted be
translated starting at the 1st base, in frame 1
(use -b to shift the starting base), according to
the Codon Translation table specified with -C, then
reverse translated, using the same table, using all
the possible degeneracies, then restrict that
(quite) degenerate sequence and show all the REs
that will match it. You should use the '-L' and
'-T' flags to generate the linear map which shows
both the REs and the cotranslated sequence to
verify that all is as it should be. NB: Depending
on Codon Table, some AAs are not reversibly
translatable. Using the standard table, Arg
(=mgn), Leu (=ytn), and Ser (=wsn) cannot be
Forward translated from their Reverse translation.
--tmppath /path/to/tmp/dir
passes the path to tacg to cooperate with CGIs or
other programs that need to tell tacg where to
place the ps/pdf files.
-T {[0*|1|3|6],[1|3]}
requests frames 1, 1-3, or 1-6 to be cotranslated
with the Linear Map using 1 or 3 letter codes.
Requires '-L' to have any effect.
Ex: "-T3,3" translateslates Frames 1,2,3 with 3
letter labels.
"-T1,1" translateslates Frame 1 with 1 letter
labels.
-v asks for program version (there may be multiple
versions of the same functional program to track
its migration).
-V {1-3}
Verbose output- requests all kinds of ugly
diagnostic info to be spat to the screen. May be
useful in diagnosing why tacg did not behave as
small fonts, you may want to use this flag to widen
the output. In version 3, the option '-w 1' allows
you to put as much information as possible on one
line for easier parsing by some external apps.
Ex: "-w 1" prints output in one line
"-w 150" causes wrapping at about 170
characters (150 bp wide in the Linear map option).
-x {Label(,=),Label..(,C)}
used to restrict the patterns searched for by Name
label (either from the 1st field of a REBASE format
file or the NA field from a TRANSFAC format file)
up to a maximum of 15. Case INsensitive (HindIII =
hindiii = HinDiIi), but it HAS to be spelled
exactly like the entry in rebase.data with no
spaces. (HindIII != Hind III != Hind3).
The '=' tag invokes the Hookey function (named
after its requestor, John Hookey), in which the '='
tags the RE to which it is appended. This is
useful if you're trying to discern or predict a
labelled fragment in a mixture of fragments. The
output shows the fragments generated only if they
have one or both ends generated by the tagged RE.
This option works even if there are a number of
REs, but only one can be tagged. Ex: '-x
HindIII,=,MseI,HinfI' causes the DNA to be cut by
HindIII, MseI, and HinfI, but only fragments that
have a HindIII end will be shown. The output is
shown both unsorted and sorted by fragment size.
If you want to cause the output to simulate a
multiple digest with all the REs designated, append
a ',C' to the list of RE names. Ex:
-xBamHI,EcorI,NruI,C
NB: Don't assign the name 'C' to any patterns or
REs.
-X (--extract) {b,e,[0|1]}
causes the sequences bounding the match to be spat
to stdout in FASTA format. b and e are the
beginning and ending offsets respectively for
varying the window around the match. NB: both b
and e are measured from the start of the match, so
e must be corrected for the length of the pattern
itself.
has to match the matrix at 95% or better).
RELATED PROGRAMS
In Ver 3, tacg has (finally) incorporated Jim Knight's
(jknight@guarneri.curagen.com) SEQIO library calls to
provide automagic format conversion of incoming sequences.
This also allows multiple sequences to be run at the same
time, allowing tacg to scan databases.
Wu and Manber's agrep is an amazing piece of software for
searching for multiple patterns with errors. While not
optimzed for molecular biology, it can be used to scan
sequences. Jim Knight distributes a variant of it called
grepseq with his SEQIO pkg, which IS molbio-aware, but not
as generally useful as (to me anyway) as tacg, as it only
scans one strand and will only search up to 6 matches for
some reason. However, I've started to incorporate the
grepseq core into tacg. agrep is available via
ftp://ftp.cs.arizona.edu/agrep/ or http://manber.com.
The SEQIO pkg is distributed around the web.
You can also use the excellent paging utility less to move
thru your sequence file and use its marking and piping
facility to punt the sequence of interest to 'tacg'. In
many terminal emulators it will also highlight matched
search terms, and so makes an excellent way to scan the
output for regions of interest. Many editors also allow
piping a selection of text to an external program and
inclusion of the result into another window ( nedit,
crisp, joe, the indefatiguable emacs/xemacs and others).
Much of the output benefits from wider-than-normal
printing. The '-w#' flag allows output up to about 230
characters wide, however to print this without wrapping,
you need to use small fonts. A number of unix printing
utilities allow you to do this, notably genscript:
http://www.hut.fi/%7Emtr/genscript/index.html
EXAMPLES
Used alone:
tacg -f0 -n5 -T3,1 -sL -F3 -g 100,1000
<NewFile.Genbank >output.file
tacg -R yeast.matrices -# 85 -sSlc -w90 <
yst_chr_4.seq >out
Translation: Search the sequence in yst_chr_4.seq
for all the matrices described in the file
yeast.matrices , applying a uniform cutoff of 85%
(-# 85) to the maximum possible score, writing the
summary, Sites, ladder map, doubly-sorted (-sSlc)
printed 90 characters wide (-w90) to the file out
Specifying patterns on the command-line
tacg -p Pit1,tatwcata,1 -p ap2,tgygcatw,1 -w90 -sSL
< rprlPromo.seq > promo.map
Translation: search for the patterns labeled Pit1
and ap2 with 1 error each and search the sequence
from the file rprlPromo.seq for them, printing the
results (summary (-s), Sites (S), and the Linear
Map (L) 90 characters wide (-w90) to the file
promo.map
Used to search the entire yeast 500bp Upstream Regulatory
sequences (a database of 6226 500 bp sequences) for
matches to the MATa1 binding site (from TRANSFAC) :
tacg -R TRANSFAC.data -sScw1 -rMATa1 -#95 <
utr5_sc_500.fasta > yeast.summary
Translation: translate each of the FASTA formatted
entries in the input file utr5_sc_500.fasta into
usable sequence, and after finding the MATa1 (-r
MATa1) matrix description from the database -R
TRANSFAC.data search the sequences for matches at
95% of the max score that it has in the TRANSFAC
database (-# 95), returning the summary (-s), the
sites (-S) sorted in Strider order (-c) with
results printed on 1 line (w1), directing the
output into the file yeast.summary
BUGS and ODDITIES
Major
tacg, if used with the -Q flag, spits back about
100 bytes of information about it's use (the
SEQIO library, but I'd rather not as it does
include a lot of (hidden) functionality that I plan
to use later.
tacg v2.0 will not currently cut sequence shorter
than 5 bases; if you need to analyze sequences
shorter than this, perhaps you're using the wrong
program.
main() and functions were originally written as
single pass code but with the help of Gray Watson's
excellent (!) dmalloc malloc debugging library,
available at: http://www.dmalloc.com I've recently
put some effort into tracking memory leaks,
especially since much of the code has to be re-
entrant for doing analyses over many sequences.
However, it's not completely leak-free yet, so user
beware.
The command line handling has been completely re-
written, using the getopt() and getopt_long()
functions, so the flags are considerably less
sensitive to spacing and order.
Translation in 6 frames assumes circular sequence
regardless of '-f' flag, so that the last amino
acids in frames 5 and 6 in the 1st output block are
obviously incorrect if you are assuming linear
sequence.
See the manual for other bugs which the author
thinks are less problematic.
Harry Mangalam (hjm@tacgi.com)
Man(1) output converted with
man2html