tacg Version 3.5 - Documentation

Welcome to tacg Version 3.5

Document Version 3.5.0 - Nov 16, 2001

This page provides information about both the commandline and Web interface of tacg Ver 3.5, describes some of the philosophy behind its design, provides links to more documentation, and where it can be got.
Please note that this document is in flux, incomplete, and certainly inaccurate in places. The man page is more likely to be correct than this doc, and the help section even more likely to be correct. In the event of complete incomprehensibility, please mail me.

A web interface to version 3.5 is now available; you can find some sites that offer it here.

Cross Index of Command-line Options

What's new with Version 3?
What is it?
How does it handle incoming sequence?
What will it do next?
Where can I get it?
Is there a man page I can look at?
Is there a GUI or Web Interface?
Design Criteria
Feature Set
Changes and Fixes since Ver 2.28
- Major
- Minor
Changes and Fixes since Ver 1.7
- Major
- Minor
Changes and Fixes since Ver 1.5
- Major
- Minor
Changes and Fixes since Ver 1.34
- Major
- Minor
Continuing Bugs and Oddities
- Major
- Minor
To Do List
...And The Small Print Taketh Away...(Copyright Notice and Legal Stuff)

What's new or changed with Version 3.5 ? (vs 2.4)

Briefly:

Added circular graphic plasmid maps in Postscript and PDF.
Added 'Rules-based' searching
Added cloning features to find REs that hit or don't hit in particular regions.
brought man pages and HTML-ized man pages up to date.
use of -p (pattern) also sets -S (Sites table) so that tacg acts more grep-like.
changed license to DUAL license, either GPL or proprietary.
better autoconf integration (tho no automatic regression testing yet)
integration (finally) with Jim Knight's SEQIO to allow automatic sequence conversion on input and scanning multiple sequence databases (or not - with --raw it still supports the ability to assume that ALL input is sequence - useful for analyzing file fragments or editor buffers).
can use it to search for arbitraily complex combinations of patterns at the same time by combining pattern specifications with the logic conjunctions AND, NOT. see --rule.
searching for regular expressions (in nucleic acid), with autoconversion of IUPAC degeneracies to the appropriate regex.
searching for TRANSFAC matrices, with user-specified cutoffs.
gel simulations with low and high end cutoffs for expansion.
selection of Restriction Enzymes explicitly, by overhang generated, magnitude of recognition site, price, minimum, maximum number of cuts (overall or on a per-pattern basis).
supports Combination Cuts of up to 15 REs at a time.
supports limited AFLP fragment matching / simulation.
simulates Dam and/or Dcm methylation of DNA
searches for silent sites, with reverse translation.
ORF finding in any combination of frames with FASTA output, with offsets in DNA, protein, Molecular Wt, pI, with optional additional info on AA frequency in #s or %.
conditional output based on matches, for scanning large numbers of sequences at a time.
sequence extraction surrounding pattern matches, with variable upstream, downstream inclusion, optional reverse translation in FASTA format.
various ways of making the output even more compact (printing only single strands, or eliminating tics)
uses autoconf/configure to ease building on different platforms.
by default, tacg no longer sends usage data back to me.
includes an explicit example function to show how to add your own funtionality.
much more robust, debugged, especially as to memory allocation. Since most of the routines tacg run in a loop now, memory leaks are much more apparent and get plugged quicker.
BUT: more features leads to more (new) bugs so be on guard..

What is it?

tacg is a character-based, command line tool for unix-like operating systems for pattern-matching in nucleic acids and performing some of the basic protein manipulations. It was originally designed for restriction enzyme analysis of DNA, but has been extended to other types of matching. It now handles degenerate sequence input in a variety of matching approaches, as well as patterns with errors, regular expressions and TRANSFAC-formatted matrices.
It was designed to be a 'grep' for DNA and like the original grep, it's capabilities have grown so that now the author has to keep calling up the help page to figure out which flags (now ~50) mean what.

tacg itself is NOT a GUI application in any sense. However, it's existance as a strictly command-line tool lends itself well to Webification and it is now distributed with a web interface form and a Perl CGI handler (the original C CGI written with Tom Boutell's CGIC code is also distributed with it for historical/educational purposes, but Perl is a MUCH easier interface language. NO question about it.)
Also, by it's strict command line nature, it can parasitize the interface of a number of GUI programs that allow passing selections made within their windows to external programs. This is particularly effective with a free, Xwindow editor called NEdit. With minimal reconfiguration, NEdit can become a fairly good biosequence editor and analysis tool a la Strider for XWindows. See the entry for NEdit below.

A real world example may help to illumninate.. say the Genbank entry for the lambda genome.

tacg -w75 -n6 -F2 -l -c -g'100,10000' -L -T3,1 -O145x,25 < lamcg.gb | gzip - > lambda.map.gz

Translation: Process the file 'lamcg.gb' (Genbank format) thru the seqio routines (transparent to users) and pipe the following sequence to tacg, returning the analysis:

75 characters wide (-w75)
on all 6+ cutters (-n6)
giving me the sorted fragment sizes (-F2) of those Enz's that match
providing a ladder map summary (-l)
sort tabular ouput by number of hits and thence alphabetically (-c)
provide a gel simulation with a 100 bp LOW cutoff and a 10000 bp HIGH cutoff (-g100)
with a linear restriction map (-L)
with 1 letter, 3 frame translation (-T3,1)
with the ORF table for frames 1,4,5 at a minimum ORF size of 25 AAs with the extended composition info (-O145x,25)
and pipe the output thru 'gzip' to a file called lambda.map.gz.

Altho it's not clear why you would be interested in seeing this particular analysis, here's the resulting output (130KB text, gzipped - you'll need a gzip-compatible app to view it, such as Stuffit Expander, Stuffit For Windows, Winzip, or even ... gzip).

The lambda genome used to generate this map is also included in the tacg distribution directory (called lamcg.gb.gz)

How does it handle incoming sequence?

If the --raw flag is used, the file is processed directly by the GetSequence() function which reads everything and strips out anything not interpretable as an IUPAC character. Note that there are a lot of IUPAC characters and the typical letter to your Mum read with the --raw flag will probably yield quite a bit of acceptable, if uninterpretable, sequence.
If the --raw flag is not used, the input stream is processed by the SEQIO routines. If the stream contains multiple sequences, they are passed to tacg serially, in an internal loop.
tacg reads ASCII characters from the buffer passed in by SEQIO, determines whether there are IUPAC characters in the stream and, depending on the -D flag option, EITHER strips out anything not an 'a', 'c', 't', 'u', or 'g', OR allows all IUPAC characters (but strips all others, including numbers, spaces, tabs, linefeeds, etc), and places that input sequence in a buffer.

What it does then depends on what other flags are set. The 1st writ function (probably still the most popular) is Restriction Enzyme Analysis. Flags calling this function cause tacg to read in a series of named patterns (either from the command line, or more typically, from a GCG-formatted REBASE file, now somewhat extended for increased functionality) and match them against the sequence. These matches are logged, then some basic math is done with them and the output is sent to stdout, in the form of tables of cut sites, fragments generated, various maps and simulations.

The number of recognition patterns is now dynamically allocated so 1000s of patterns can be searched for at once (I've generated ~7000 without problems), as long as they are in the simple format required. For instance it can now process all the patterns in the TRANSFAC database at one shot (altho I'm not sure that would produce anything usable).

A similar function matches (possibly degenerate) input with patterns that contain errors (as well as simple degeneracies). This means that if the pattern is gartc with 1 error, it could match nartc, gnrtc, gantc, garnc, or gartn. Of course, the 1st and last sequences decay to artc and gart, as a leading or lagging 'n' makes no difference in the pattern. NB: See the note on how subsequence patterns are currently handled.

Related functions (-P and --rule) analyze the data from the above searches by parameters that you give them and filter the number of reports by your proximity rules.

The -O flag allows Open Reading Frames to be extracted. This function is not very sophisticated, but does present the data in a way that allows it to be analyzed by other programs (each ORF is written in a FASTA format so that you can use the resulting file for other types of searching).

Enzyme selection can be filtered by a variety of criteria; these criteria affect what is printed as output (if only 3 enzymes match your selection criteria, only those 3 will show up in most analyses. They can be chosen by 'magnitude' of recognition site (4, 5, 6 cutters, etc), by overhang generated, by minimum/maximum times they cut, cost, dam/dcm protection, etc.

The entire feature set is described in more detail below, but it includes:

Enzyme cut-frequency summary tables,
Tables of site data (sorted or not)
Tables of fragment data (sorted or not; ordered or not)
Linear map with names stacked efficiently with/without translations in 1,3, or 6 frames, from 1 of 8 Codon Usage tables
Pseudo-graphical ladder maps a la GCG
Summary map of infrequent cutters
Gel simulations (using a simple log fit).
Tables of ORFs for any frames, with offsets in bases, AAs, and MolWt. [new]
Entering patterns from the commandline [new]
Searching for patterns with errors [new]
Proximity matching - reports only sites that match distance criteria (up/downstream, less/more than a distance, within a range of distances) [new]
Other match data output formats

What will it do next?

What you want it to do will have an effect on what I work on. If you want something added, especially a variant of an existing analysis, I'd be happy to consider it, so let me know your opinion. Otherwise, I try to add what I believe to be the most useful features, especially if they are missing from other tools.

Anyone who believes the promises of software suppliers also believes in the tooth fairy. That said...

I'd like to add secondary structure prediction output to the postscript maps, so if I can wrangle some public domain code into shape, I may be able to carry thru on this wild promise as well.
XML output (BSML for example) would be a neat additional feature.
support for more sequence formats to Seqio would be useful (support for some binary formats (Lasergene, DNA Strider, GFF, etc).
automated regression testing would be very useful for development testing and porting to different platforms.
adding support for more efficient searching using Manber's agrep code for extremely fast approximate searching.

Where can I get it?

tacg V3.5 and higher is now available via its own Sourceforge Project site in ANSI C source code. Precompiled binaries for a few architectures may follow. Instructions are included with each package.

Is there a GUI or Web Interface?

tacg is a command line program (see Simplicity, below) but as such, it lends itself well to Webification as a cgi program.
The Version 3 Web interface is actually composed of three similar pages that emphasize different aspects of tacg's abilities. All three pages address the same CGI (perl-based for this version), and call the same tacg executable. The reason for the different pages is just for clarity. Only the 1st is finished and you can find sites that make it publicly available via the Sourceforge project site

The only substantial crippling is that it currently has a cutoff at 500,000 characters as the output for that size of analysis can get very large for some analyses. If you routinely use more sequence than this, it's probably worth your time to install it locally. NB The 500,000 character limit, while probably sufficient for most analyses, counts the header and comment characters as well.

Is there a man page I can look at?

Yup. You have your choice of

a real groffed manpage,
a PREFORMATTED man page,
an auto-HTMLized version thereof,
or a slightly more complex Webified man page

All are included with both the source code and the binary pkgs.

Design Criteria

Simplicity (and a few Alternatives)

In answer to: "Why does the world need another restriction mapping program, especially one as crude as this?" or Why didn't you just use (or modify) XXX?", I enumerate:

Well, in the 1st place, crude is not always bad... 'grep' could be classified as crude and it's one of the most useful and used programs in the world. It's free, simple, flexible, chainable, and very fast; it provides the desired info in a useable form with a minimum of mucking about.
There are a variety of programs available for Windows and (especially) the Mac for doing these kinds of analysis, but there's little along the same lines for unix, especially Linux, which is an extremely attractive OS for bioinformatics. There is the very complete EMBOSS package tho, which does a LOT more than tacg, but what they both do, tacg does much faster (20-50x faster).
Many available programs, even the commercial ones, start to come undone at sequence lengths quite a bit smaller than those now being published. Dynamic memory allocation relieves this limit, so I decided to write one that incorporatedS this feature. tacg is relatively miserly in memory requrirements - vs the EMBOSS restrict program which uses >100MB RAM to analyze the E coli genome (4.6 MB), tacg uses 13MB.
Even some of my favorite programs (such as the beautifully designed DNA Strider) refuse to analyze degenerate DNA sequence, making the generation of useable plasmid maps from the usual bits of known sequence, polylinker, and hand-drawn maps impossible or extremely unpleasant.
While I have had access to GCG for quite some time, I have been increasingly aggravated at it's speed, output, cost, and the fact that it doesn't run on Linux. While tacg's code is sufficiently arcane that few bench biologists will attempt to add functionality to it, an undergrad CS student could, especially with the inclusion of the new example function.
There is also the challenge of creating a program that, like grep has significant utility in a relatively small footprint.
It was fun.

However, if you want alternatives, here are a few:

The very complete EMBOSS package has to top the list. It's well-designed, autoconfigs, has good documentation, and is Open Source (GPL). All you could ask for is a little more speed...
Roland Walker's SEALS Perl package is also very comprehensive, but requires a lot of add-ons and admin to get it up an running.
There's the stalwart (if ugly) GCG, but it's $ware and does not (yet?) run on Linux. It also remains essentially a collection of unfriendly command-line programs (hey, nice glass house..) sheltered under a common but still not too intuitive GUI. The GUI is strangely warped - it appears to be very similar to a HTML forms-based GUI, but with neither the benefit of HTML, nor many of the advantages of a 'native' GUI. The controls are surprisingly crudely laid out (much like the tacg Web form), which is odd for a 'native' app; I would have expected a much more compact and intuitive interface.
Webcutter 2 is a Web-only cgi (no command-line version, no source code) that Max Heiman put together at about the same time that I was writing tacgi. I've tried it a few times and while it may be useful, I found it too slow and the output a bit info-sparse. It does have some features that tacg lacks (coloration of selected Restriction Enzymes (REs), grabbing sequences direct from Genbank to restrict), but for my own use they weren't that compelling. Although the access is free, you cannot install it locally, nor examine the source code. Nevertheless, if anyone thinks that a feature of WebCutter would make tacg considerably more usable, I'd be interested in hearing about it.
WebGene is a service offered from Indiana which offers a small subset of the options and analyses that WWWtacg does. It has a similar Restriction Enzyme selection interface that the original WWWtacg had, but I decided that while it was a nice display, it took too much HTML to code which made selecting the REs too slow. It offers both an ORF interface as well as a Restriction Enzyme interface.
Cloneit, a unix-based cloning program with a web interface as well.
BioNavigator from eBioinformatics, now Entigen, VERY nice (uses tacg, incidentally), but for-pay.
Curatools, Curagen's genomics portal (still free, I believe).
restmap (perl thingie out of stanford by Steve Chervitz, now at Neomorphic). Quite good Web output, including colored image maps that really nice) but being perl, it's a bit slow and doesn't have nearly the features that tacg has. Appears to have been taken offline..?
The Biology Workbench is an excellent example of how a Web Interface can make coherent a great number of the free software tools of the net. Now that you mention it, tacg is one of its included tools :).
The only exception I take to Web interfaces (including my own) is the response time/uncertainty of the connection and the inability to retain state easily. But it is a great way to take advantage of programs that you'd never be able to take the time to get working and working reasonably easily and a well thought-out interface can do a lot to reduce the problem of retained state.
Steve Smith's 'GDE' has been compiled for Linux, but requires a fair amount of dedication to put and hold it all together; last time I looked it was available only an OpenLook interface, which is an increasing disadvantage. Steve is now at GCG (I think - the new versions of GCG seem to show the GDE influence.) so there probably won't be a huge amount of work going into it.
Brian Fristensky's estimable suite of programs BIRCH (of which tacg is also a part) is pretty good and his comments and suggestions have resulted in a a number of the improvements seen here, especially the Degenerate Sequence matching.
MSI has just released an app called Gene Explorer, which includes quite a bit of functionality as a loss-leader to attract researchers to their higher-end products. Haven't had enough experience with it to comment further.
James Tisdall's 'DNA Workbench' is designed along the same idea as tacg, but requires perl and is therfore interpreted. Despite perl's many syntactic and other advantages, I wanted something even simpler (a standalone app) and therefore easier to install, faster, and more complete.
Don Gilbert's multiplatform, GUI ' SeqPup' is the closest to an ideal setup in terms of design and interface. It has recently been recoded in Java and Don is still leading the way in terms of innovation and interface in bioinformatics. Anyone interested in interface design for bioinformatics is well-advised to check out and follow Don's contributions. His have consistently led the field in terms of usability, Internet use, cohesiveness, and inventiveness.
On some systems, you can run Executor, Macintosh Application Environment, SoftWindows, and Virtual PC and thus have access to the Mac and Windows programs, but that's also somewhat costly, slow, and complex.

Installation

tacg REQUIRES only 2 files - the executable (tacg, which can be renamed) and the Codon table file 'codon.prefs'. Most people will want to use the restriction enzyme database file (rebase.data; included with the distributions, which is simply the GCG-formatted version of Rich Robert's REBASE), but it is not strictly needed - you can enter patterns from the command line since Ver 2. The codon usage file (codon.prefs) is also required if you want to do any translation. It was required in the 1st few versions, then was compiled into the app in version 2 and now has been externalized again for easier access and modification. As the sequencing factories roll thru more organisms, there will more need for changing or adding to the file.

The data files (rebase.data, codon.prefs, other pattern files) will be found in any of 3 places automatically, in the following order: the current directory ($PWD), the home directory ($HOME), or the tacg lib directory ($TACGLIB). The program searches these paths and gives up only if the file can't be found in any of these places. 'rebase.data' is ASCII text and can be edited and modified by the user, if required. I'd suggest placing 'rebase.data' in the TACGLIB directory and keeping modified versions of these files, if required, in your home directory which is searched first.

tacg can use the NEB/REBASE-supplied file of restriction enzyme data as is, (in GCG format - latest unmodified version here) but can also make use of additional extended information if available. Version 3 will use error, cost, and methylation information added to the file as shown here.

The format of the unmodified version is:

Comments
..
;         Top   Recognition   Bottom
;Name   Offset  Pattern       Offset  !  Isoschiz                   >Supplier Index
;AarI      3    CACCTGC        0      !                             >
;AatI      3    AGG'CCT        0      !  StuI,Eco147I,Pme55I,SseBI  >O
AatII      5    G_ACGT'C      -4      !                             >ADEFIKLMNOR
;AauI      1    T'GTAC_A       4      !  Bsp1407I,BsrGI,SspBI       >I
AccI       2    GT'mk_AC       2      !  FblI,XmiI                  >ABDEGJKLMNOQRS

The format of the MODIFIED version is:

Comments
..
;           Top Recognition     Bottom
;Name    Offset  Pattern        Offset Err dam dcm   units   $    !  Isoschiz         >Supplier Index
;AarI         3 CACCTGC          0      0  100 100     0     0    !                   >
AatII         5 G_ACGT'C        -4      0  100 100  2500   176    !                   >ADEFIKLMNOR
AccI          2 GT'mk_AC         2      0  100 100  2500   220    !  FblI,XmiI        >ABDEGJKLMNOQRS
AccII         2 CG'CG            0      0  100 100   444    99    !  Bsh1236I,BstFNI  >KQR
Acc65I        1 G'GTAC_C         4      0  100   3  5000   200    !  KpnI,Asp718I     >DFINR

In the above, the extra fields are:

Err - the maximum number of errors that can be tolerated in the search.
dam - the offset from the Top Offset that if methylated by Dam methylase, blocks cutting
dcm - the offset from the Top Offset that if methylated by Dcm methylase, blocks cutting
units - the number of units packaged (by NEB) in their Small size lots
$ - the cost for that many units above.

Program Output

The output of tacg uses only alphanumeric characters (except for the recent circular graphics maps) so that all of its output can be viewed on a vanilla vt100-like terminal, although you can do more useful things if you're using an X display or have imported the output into a Word Processor (some of the output can best be viewed using very small fonts or in multiple columns on a page).

The different output sections have been prefixed with "==" so that you can easily jump to the different sections if you're viewing the output with an editor or pager such as less. In the WWW version, the different sections have been labeled with HTML tags so that clicking the automatically generated Table of Results entry will drop you to the corresponding results section.

Plasmid Maps

WARNING: This is a beta-level function and may explode without warning or give inaccurate results. If you suspect the latter, please let me know.
Select the # of REs to place via the Restriction Enzyme Selection section above, paying special attention to the Maximum Number of Cuts setting. Usually a Max of 1 or 2 is what you want.
To place ORFs on map, check the "Stream Out Open Reading Frames" stanza immediately above, setting the minimum size of the ORF to detect, and the frames to display. ORF (The extended info will be printed but is currently ignored in the plasmid map.) The map will not accurately track ORFs that cross the origin, so if you suspect that an ORF crosses the origin, edit or subselect the sequence to verify.

For the Web interface, if you're using a Mac, printing to a Laserwriter-type device, you might try the '105' width, with the 'Small' font, then use Page Setup to select Landscape output, 2-up (2 pages fitted to one page).

If you're using MS Windows, there are probably similar output selections that you can use.

If you're using a unix box, save the output in whatever width you want and pass it thru a text-to-postscript filter to generate nicely formatted output. Some that I've used are:

lptops - included on SGI's; probably available somewhere for other platforms.
nenscript - (Free implementation of Adobe's enscript)
genscript aka GNU enscript - Free, most features, currently supported, no good reason to use any other...
a2ps - a monster text-printing system that will handle almost any conceivable file format (and allows you to color code your own if you want).

Use it thusly:
genscript -1GZ -r -fCourier7 -p- [filename] | lpr -P[PSprinter]

Most Xterminals allow different font selections and window re-sizing, so you can probably view the results adequately that way as well.

High Portability

The program is written in vanilla ANSI C. After being directed to the correct libraries, it compiles with few complaints on:

Silicon Graphics/IRIX 5.3, 6.2 (R3000, R4X00, R5000, R10K) (cc)
Sun SPARC/SunOS (4.1.3) (gcc)
Sun SPARC/Solaris (gcc, cc)
DEC Alpha/DEC Unix aka OSF/1 (V3.0/347) (cc, gcc)
Convex C3480/ConvexOS (cc, gcc)
HP-UX/Exemplar OS (single CPU of an SPP2000) (cc, gcc)
NeXT boxes (Black)/NeXTSTEP (3.2) (except the UDP code) (gcc)
Intel PCs running Linux (Slackware, RedHat, Caldera) (gcc)
DEC Alpha Linux (gcc)
MkLinux on PowerMacs (gcc)
Probably anything with gcc.

Bewildering to me, on those systems where both the native commercial compiler (cc) and gcc have been available, gcc has consistently outperformed cc. YMMV. I think this is because almost all the core code is integer ops and therefore the carefully coded math and floating point libs make little difference. Stemming from this, tacg runs faster on an Intel P6 architecture than on many RISC workstations of similar or even higher clock speed.

Speed and Capacity

The program uses a pre-calculated hashtable lookup of the restriction enzyme recognition sites, so that only about half of the sequence is checked any further than the initial hash (which itself is generated via the very efficient shift/add DFA algorithm). Depending on the degeneracy of the input sequence, the kind of output you request and the i/o of the machine, the program processes DNA at this speed:

Speed*	Hardware	OS	Compiler, flags
68	i486/66	RH Linux 2.0.18	gcc -O2 -m486
140	R4000/100 Indigo2	IRIX 5.3	cc -O2 -mips2
165	early DEC Alpha	OSF/1	gcc -O2
290	R4400/200 Indigo2	IRIX 5.3	cc -O2 -mips2
313	SPP/HP-PA	SPP-UX 5.1/HP-UX 10.1	gcc -O2
300	PMac8500/PPC604/120	Mklinux 2.0.27	gcc -O2
330	DECAlphaStation/233	RH Linux 2.0.29	gcc -O2
450	Sparc Ultra1/??MHz	Sun OS v5.5	gcc -O2
454	R10000/175 Indigo2	IRIX 6.2	cc -O2 -mips2
515	Intel PPro/200	RHLinux 2.0.27	gcc -O2 -486

* in Kb/sec, crudely adjusted for cpu usage via the `time` shell command, using 215 enzymes, digesting the E coli genome (4.638 MB), for only the summary listing (-s). Requesting the full analysis (-n4 -sSlc -F3 -g100) takes about twice this time. The full Linear Map doesn't take much more CPU time, but considerably more wall time due to the i/o (output is 10x input).

NB: This is about the slowest performance you can expect. Because of the size of the input sequence, the program has to go thru several rounds of memory allocation. Shorter sequences, more typical of cloning projects, complete essentially immediately even on slow machines.

Here's a table of how fast the same analysis completes on various sequence lengths (for a PPro/200/96M/PCI-SCSI and a 486/66/32M/VESA-SCSI):

Sequence Length Seconds(PPro) Bases/s(PPro) Seconds(486) Bases/s(486)

100 .04 2.5k .28 360

1,000 .05 20k .3 3.3k

10,000 .07 143k .4 24.4k

100,000 .2 500k 1.4 71.4k

1,000,000 1.6 617k 11.3 88.5k

4,630,000 9.1 509k 68 68k

10,000,000 22.7 492k 261 38k

Sequence Length	Seconds(PPro)	Bases/s(PPro)	Seconds(486)	Bases/s(486)
100	.04	2.5k	.28	360
1,000	.05	20k	.3	3.3k
10,000	.07	143k	.4	24.4k
100,000	.2	500k	1.4	71.4k
1,000,000	1.6	617k	11.3	88.5k
4,630,000	9.1	509k	68	68k
10,000,000	22.7	492k	261	38k

From earlier, crude analyses it appears to be about 7-35 times faster than the equivalent routines in the GCG 'map' program.

In my experience, interacting with tacgi/tacg via it's web interface (on an 100MHz R4000 SGI) over 2 LAN hops is both faster and easier than using GCG's native Xwindows interface (on a 200MHz R4400 SGI). There may be some subjective feeling involved here, but objectively, it's faster by 10X (on a machine that's less than 1/2 as fast :) ). GCG has recently been trumpeting that its code is now parallel, allowing much faster operation. Putting the above in a similar context, you'd have to buy ~20 CPUs to run your analyses with GCG's routines compared to 1 if you ran them on tacg.

Memory Usage

tacg uses dynamic memory allocation for most data structures so that while there are a few hard-coded limitations (mostly in output format), it easily handles sequences into the millions of bases.

As a practical rule of thumb, the program needs about 2 MB of free memory for itself, and then, depending on what results you've asked for, as much as 3 times the input sequence length to hold all the intermediate results before they are barfed to stdout. For the E Coli genome (4.7Mbases), tacg memory usage tops out at about 13 MB, so on a 16 MB machine, you'll probably be doing some swapping in this case. On a 32 MB Linux PC, I had no swapping on the above 4.7 Mbase analysis, but quite a bit with 10MB (about 40MB total required, according to 'top'). YMMV.

In the Web form, there is a cutoff of 500,000 characters (not bases) as input; currently this includes extraneous characters such as headers, numbers, spaces, etc, so if you think you are being unfairly denied service, check how many bytes your file actually is. The command-line version is limited only by your machine's RAM and your patience. (Actually, I think Linux's max per-PID allocation is currently ~900MB...?

Feature Set

Inspired by [add mailto] Christian Marck's elegant DNA Strider, I tried for a similiar output format, changing a few things I didn't like, adding a few things I wished he had. I'm open to suggestions for enhancements, interface changes, etc.
Kvetch or it won't change.

Sequence Input

tacg will take any ASCII data as input. If it isn't a recognized format, it will assume that it's raw sequence and will process it accordingly. You can also explicit tell it that all input is to be considered sequence with the --raw flag. If this is the case, it will ignore anything that isn't a,c,g, u, or t (or IUPAC degeneracies - see below) and subject that sequence to analysis. Because of this, numbering, spacing, line width, etc, of the input file should not be a problem.
The Web interface allows you to upload entire files of sequence for analysis. Recently, with the recruitment of Jim Knight's SEQIO, Version 3 in both commandline and Web form allows most ASCII sequence formats (see below) to be read and processed without further modification or editing.

Input formats supported by SEQIO (and therefore tacg and WWWtacg):

IG/Stanford GenBank/GB NBRF

EMBL GCG DNAStrider (ASCII)

Fitch Pearson/Fasta Zuker (in-only)

Olsen (in-only) Phylip3.2 Phylip

Plain/Raw PIR/CODATA MSF

ASN.1 PAUP/NEXUS

However, it will NOT accept (at least reliably) binary file formats such as those from DNA Strider, MacVector, NBI, GeneWorks, etc. However, most commercial programs have an ASCII export function that should work. For the WWW interface, anything that you can cut and paste into the window and looks like sequence ought to be fine. As mentioned above, the --raw option skips SEQIO and so reverts tacg to it's former aggressive behavior of considering EVERYTHING sequence.

IUPAC Degeneracies

IUPAC Degeneracies are those characters which denote uncertainty in the sequence; most are familiar with y (pyrimidines - c or t) and r (purines - a or g), but there are several others that can be useful. The entire set is:

Base	Name	Bases Represented	Complementary Base
A	Adenine	A	T
T	Thymidine	T	A
U	Uridine(RNA only)	U	A
G	Guanidine	G	C
C	Cytidine	C	G
Y	pYrimidine	C T	R
R	puRine	A G	Y
S	Strong(3Hbonds)	G C	W
W	Weak(2Hbonds)	A T	S
K	Keto	T/U G	M
M	aMino	A C	K
B	not A	C G T	V
D	not C	A G T	H
H	not G	A C T	D
V	not T/U	A C G	B
N	Unknown	A C G T	N

Handling Degeneracy

tacg V2 was modified to handle degenerate input sequence as well as the previous gatc-only sequences. Version 3 has the same matching engine as Version 2 which matches sequence using the fast, hexamer hashtable look-ahead algorithm that DNA Strider uses (my implementation). When it hits a degeneracy in the leading hexamer window, it switches to a more extensive (but slower) matching approach until the degeneracy is cleared, whereupon it switches back to the faster approach. This allows tacg to accurately search sequence that is only slightly degenerate (such as many genomic sequences) essentially as fast as nondegenerate sequence.

Although tacg behaves as above in the default mode, it can be forced into several other modes if desired. Truth be told, I'm not aware of ANYONE besides me actually using the other modes, tho they are formally useful. See the explanation of the -D options in the man page.

Enzyme Selection

Restriction Enzymes can be selected either by:

Explicit pick from all the active entries from the latest REBASE.
Enzymes can be picked explicitly via the -x flag (-x NameA,NameB,NameC... where NameX is the case INsensitive name of the Enzyme or pattern you wish to use in the GCG-formatted REBASE file, either 'rebase.data' or another specified by the -R flag (-R alt.rebase.file).)

Selected recursively by characteristic, based on:

Magnitude of recognition sequence -n
The 'magnitude' of the recognition sequence depends on the number of defined bases that make up the site. Degenerate bases can also contribute:

acgt    each count   '1'  magnitude point 
yrwsmk  each count  '1/2' magnitude point 
bdhu    each count  '1/4' magnitude point 
n       doesn't count at all

(tgca=4, tgyrca=5, tgcnnngca=6, etc)

Overhang of resulting ends (5', 3', blunt) -o
This refers to the stretch of unpaired bases at the border of the cut created by offset breaks in the phosphate backbone. Enzymes can leave:

5' overhangs - ie. BamHI(G'GATC_C) 
          v                   BamHI
 5'...tagG GATC_Ccga...3'      ->      5'...tagG       GATCCcga...3'
 3'...atcC CTAG Ggct...5'      ->      3'...atcCCTAG       Ggct...5'
               ^
 
 3' overhangs - ie. BbeI(G_GCGC'C)     
               v              BbeI
 5'...tagG_GCGC'Ccga...3'      ->      5'...tagGGCGC       Ccga...3'
 3'...atcC CGCG Ggct...5'      ->      3'...atcC       CGCGGgct...5'
          ^  
              
 Blunt (no overhang) ... You get the idea...

Minimum, Maximum times they hit the sequence -m, -M
This is pretty self explanatory, but..
'Minimum' indicates the minimum # of times a pattern HAS TO match before it's included in the output (the default is 1).
'Maximum' indicates the maximum # of times an enzyme CAN match before it's excluded from the output (no default maximum).

Alternative REBASE database files to use -R
This option allows you to specify an alternative REBASE file to use in the matching, perhaps edited to contain only cheap enzymes or otherwise customized for your lab/site. It starts at 300 entries, which is enough for most Restriction Enzyme databases, such as REBASE, but if you format TRANSFAC or TFD into REBASE form, it will easily process them (about 3000 entries).

Linear Restriction Map -L

produces linear restriction maps, with enzyme names nested compactly, readably.
is routinely tested to more 4.6 Mbases, has been tested to 70 Mbases.
handles linear/circular topologies -f, subsequences -b, -e.
shows EXACT cutting position (not just the start of the recognition sequence - minor quibble with Strider and other programs)
can make the Linear Map even more compact by eliminating the second strand --strands or the tic marks --notics
can show co-linear translation
- in 1/3/6 frames, in 1 or 3 letter codes -T
- using one of several Codon Usage Tables -C
can scale output (reasonably intelligently) in widths up to ~230 characters -w
can re-number the sequence arbitrarily, negative numbers as well as positives --numstart
shows numeric indices for both the whole sequence (4635, below) and the subsequence (181, below) if specified.

                            Bsu36I
                       Sau3AI
                         DpnI
                   AlwI  NlaIV
                       BstYI                            Tsp4CI
              BbvI     BamHI   AlwI          MseI MaeIII SfaNI
              \    \   \ \  \  \             \    \     \\
    181   gagaaacaacaatggatcctaaggttagaacattgttaaaagttactgttgaagatgctt    240
   4635   ctctttgttgttacctaggattccaatcttgtaacaattttcaatgacaacttctacgaa   4694
              ^    *    ^    *    ^    *    ^    *    ^    *    ^    *
          E  K  Q  Q  W  I  L  R  L  E  H  C  Z  K  L  L  L  K  M  L

Plasmid Maps --ps, --pdf

Plasmid maps can be generated by using these flags. The --ps flag causes it to be generated in the original Postscript. The --pdf flag invokes the --ps option silently and then ghostscript to convert the postscript to be converted to Adobe's Portable Document Format (aka pdf). The plasmid map shows a variant of the standard circular plasmid map, with Restriction Enzyme hits marked around the periphery. The algorithm that lays out the distribution of RE names is sub-optimal, but works OK for small numbers of hits. You can generate some interesting spirograph-like effects by forcing large numbers of hits to be plotted. In addition, it plots any degeneracy as shaded marks around the circumference. You can also add any ORFs from any 6 frames as radial lines by using the -O flag to find ORFs larger than the minimum ORF size you specify. For example, to request a map in PDF format with ORFs larger than 50 Amino acids from frames 1, 3, and 5, the appro flags would be:
--pdf -O135,50

Simple ORF Analysis -O

tacg can search for Open Reading Frames (ORFs) in any of the 6 frames with a user-specified minimum ORF size -O. It is not a sophisticated search algorithm; the rules are:

It assumes that the sequence starts in a ORF - this may cause ORFs to be reported without starting M's and may lead to short ORFs being reported in short input sequences. (If it has not hit a STOP before it gets to the end of the input sequence, it will report the cruft that it assumed was an aleady started ORF.
If not already in an ORF, a Met codon starts an ORF and a STOP codon ends it.
If the ORF is smaller than the specified size when it hits a STOP, it's discarded.
Internal Met's are not noted in any other way, as they are in some other apps such as Lasergene's Protean (good graphic design there), although this is an easy addition.
If there is an unidentified AA (because of degeneracies in the DNA) the AA is translated as an 'X' and given the averaged MolWt (118.89 Da), but the ORF is allowed to continue (see output below).

Very little analysis is done (still..). The beginning and ending Offsets in both base pairs and Amino Acids are reported so that it can be found on a linear map (offsets for frames in the opposite strand are calculated so that they refer to the current top strand). The size in both Amino Acids and KDa is calculated, as is the pI. Perhaps most usefully, the results are writ to stdout in FASTA format, with the stats as the ID line so that it can be examined by a more sophisticated pattern matcher such as grep, agrep, perl, etc (see Using tacg with Other Programs
The output of this analysis can be in either a single line (by specifying -w 1) in the case where you're concerned that the external pattern matcher won't be able to pick up a pattern split by a linefeed) or wrapped at the regular width.
Programming Note: The data structure that holds the ORF information is dynamically allocated so that all the ORF info is available until the end of the program. As well this allows very large sequences to be scanned in one sweep - my aging 486/66/32M will process the E coli genome, searching all the ORFs for a degenerate 10 AA motif with 'agrep' in under 5 minutes, but this means that it can suck up QUITE a lot of memory. Here is some sample output:

% ./tacg -O156,60  < dpp.mel.seq

== ORF Analysis for Frame 1:   7 ORF(s) > 60 AAs
 F#  ORF#   Begin(bp/AAs)     End(bp/AAs)   #AAs    MWt(KDa)
> 1     1    2581 /   860    2922 /   974    114    13155.40
MWTVLWHAENFMTSFADHQASGAEAPNKKHSPRLAHNRPRFCLLMASLERFSGCNTRNFVQRQHNSPKDSLNFDQLSDWVAAAPQHHRNWLAEKLELHPLLNGPIRNPQDFTDG
> 1     2    3091 /  1030    3282 /  1094     64     7107.97
MLNKNDKTQPRIVICNLCTCTLNHLVAPVLLLDAPPAPILFMPLSVAHCCCCCHTHTHIRTTQK
> 1     3    5929 /  1976    6216 /  2072     96     9692.80
MEVDAGSGMLGAGQVGYGLGWVLAPLRFPGHADAADGARGATTPADKAVRVQLPTPAREASAQATKLAKMQRDQGMSGGVLARMGLRLGLCGINCA
> 1     4    7522 /  2507    7704 /  2568     61     7430.89
MYIYLFFACVQYQTLMFNKSGKNLYCAYKNHQNHKGYALPYEPTYLTRSQQTLHFESIHWV
> 1     5    7870 /  2623    8160 /  2720     97    10690.71
MIYVWDVGECVSVCWRRLGGWGAERADRAGRSDPAESSSKTTTAAAIMIHSRNVGNGCLTILFFFSIFGSRRRRSCAGYEIHTTQILTHTATVAFTT
> 1     6   11071 /  3690   11277 /  3759     69     7770.53
MWNKLARASQRFSRSFFSSLRATISAALFRGQRMMGTLPRVTNRAKSESTMESIRASISSSRVLARLMA
> 1     7   12406 /  4135   12768 /  4256    121    14055.17
MILKEEHPHQSIETAANAARQAQVRWRMAHLKALSRTRTPAHGNCCGRVVSKNHFFKHSRAFLWFLLCNLVMNADAFAHSQLLINVQNQVRNRKKKWFKTLKSQQRKTQQRQTATLAALPI

== ORF Analysis for Frame 5:   5 ORF(s) > 60 AAs
 F#  ORF#   Begin(bp/AAs)     End(bp/AAs)   #AAs    MWt(KDa)
> 5     1   11241 /  3747   10885 /  3627    120    13410.25
MEALMDSIVDSLFALFVTLGNVPIIRCPRNSAAEMVARKLEKKLRENLWDARANLFHMDATQAGGGVFSFQRPVLLLLDRNMDLATPLHHTWSYQALVHDVLDLGLNLVYVEDETASAGK
> 5     2    8007 /  2669    7708 /  2568    101    10973.69
MAAAVVVLLDDSAGSDRPARSARSAPQPPNLRQHTDTHSPTSHTYIIILTTLNKAEQIPKFEEPDSNSQSGSRTIPQNEVPARQRRRSSDTASALALGWDL
> 5     3    2304 /   768    2113 /   703     65     7230.86
MLAQHQIEIKVYPPFRSIPCYVCFLPGQNNSALFWALGATQKVMIKLCGDPVEGTRLRGYPSTSS
> 5     4    1734 /   578    1492 /   496     82     9039.19
MPISWHKKMRQPCPHSPAITEGTGLCIIFNEDICQQKCSGVLFMAFKAKPIFITTATAGQRKTTRDAVVVVVVLIIDTSYAN
> 5     5     762 /   254     583 /   193     61     6683.20
MLCPSILPFTLKHKFMLSVHFFQCTTKLTPCGAESATRRRSAKGKGAHGQGTGCATDWLTR

== ORF Analysis for Frame 6:   5 ORF(s) > 60 AAs
 F#  ORF#   Begin(bp/AAs)     End(bp/AAs)   #AAs    MWt(KDa)
> 6     1   11181 /  3727   10997 /  3664     63     7224.82
MCPSSAVRGTVPLKWWHASWRRSCARTFGMRVPICSTWMPRKPAAECSAFRDLCSCCWTGTWI
> 6     2   10764 /  3588    9728 /  3241    347    39136.07
MTHKGSPFPTVAEAIQEELESYRNSEEEIKRLKTSMGIEGESDIAFSLVNDTTARLTNAVNSLPQLMEKKRLIDMHTKIATAILNFIKARRLDSFFEIEEKVMSKQTLDRPLLDLLRDGEFGQAEDKLRLYI
IYFICAQQLPESEQERLKEALQAAGCDLTALAYVQRWKGIMNRSPSISQATQYEGGGTKTVSMFTKLVSQGSSFVMEGVKNLVVKRHVSFSFQIQIQVLYQNCLQNLPVTKITEQVMECRSNAETDDYLYLD
PKLLKGGEVLPKNRAPFQDAVVFMVGGGNYIEYQNLVDFIKQKQTSNVQRRIIYGASTLTNARQFLKELSALGGEIQSPTATS
> 6     3    6774 /  2258    6530 /  2175     83     9280.03
MQGNCNRLAGEWTQKGQPRMREDAALPARNPTFLGLHYPGLGRFFVRPTAPNGHQQTAANSFIKHKQIPWTCSALTPHAAFEH
> 6     4    5901 /  1967    5687 /  1894     73     8163.18
MVLFLRVVCFSPVDNKVQPTCGYRTDRFRVLVALVLGASRWLTLRANDAKQQIKGAIRKIAARDVLLNCSSKA
> 6     5    1647 /   549    1310 /   435    114    13523.17
MKTSVSKNVRAFCLWHLRQNRFLLPQRQQGKGKPHATLSSLSSSSSLTHRMQIKSVSQPDHKVAAPDRRICRFQCWERGDPWRFRSRIPDTDTETRLEVKRTRPIRRDLWHESS

==--------------------------  End of Analysis  --------------------------==

Tabular Output

All these output formats, as well as the Linear Map and Summaries are affected by the width parameter -w; if you can manage the wider output, it's more efficient and easier to view. The -w flag has a special case besides the normal. If you specify -w1, it will set the output to be 10,000,000 characters wide, assuring that all output for a particular stanza will be in 1 line. This is sometimes useful for post-processing data.

Summary of All Cuts -s

This is no longer printed by default; you have to explicitly request all output. This was to make tacg more amenable to inclusion into other programs as a shell command - see Using tacg with Other Programs. This output includes info on EVERY active (uncommented) enzyme/pattern in the rebase file that has passed all the filtering flags (-n, -o, -m/M).

Restriction Enzymes that DO NOT CUT in this sequence:

      BbeI      EheI      FseI      KasI      NarI      NheI      NotI
      PacI    PaeR7I      SalI      SfiI      SpeI      SwaI      XhoI

 Total Number of Cuts per Restriction Enzyme:

     AatII    5     BsiYI  130     EcoNI    5      MluI    7      SalI    0
      AccI    5      BsmI   30  EcoO109I    2      MmeI    8      SapI    7
     AflII    2     BsmAI   26     EcoRI    3      MnlI  184      SauI    1
    AflIII   13   Bsp120I    1    EcoRII   49      MscI   17    Sau96I   61
      AgeI   12  Bsp1286I   26     EcoRV   14      MseI  106      ScaI    4
      AluI   89     BspEI   22      EheI    0      MspI  278     ScrFI  145
                               [remainder omitted]

Tables of Cutting Sites -S

(for enzymes that pass the filtering flags)
All of the following output formats (except the Infrequent Cutters Map) can have the REs listed by order of their names in the database or sorted by # of cuts and thence alphabetically, like Strider, using the -c flag.
In the next example output, the first example is sorted by 'order in database', the second is sorted by '# cuts, then alphabetically'.

sorted by 'order in database'

  ==  Cut Sites by Restriction Enzyme 

AatII       G_ACGT'C - 2 Cut(s)
   1118  12143

AccI        GT'mk_AC - 16 Cut(s)
    632   3008   3073   3079   5393   5816   6881   7974   8815   9978  11962
  14136  16857  18087  20067  20222

AceIII      CAGCTCnnnnnnn'nnnn_ - 8 Cut(s)
   2127   5417  11681  11750  15151  17252  17584  21795

AciI        C'CG_C - 48 Cut(s)
    567    850   1512   2455   2931   3300   3339   3490   4842   5468   5600
   5907   6174   6318   7533   8123   8807   9086   9745  10482  10805  11626
  11669  11896  12050  12214  12404  12417  12675  13337  13646  14217  14881
  15949  16013  16668  17060  17089  17101  17132  18035  18140  18820  18896
  19781  21146  21536  21958

AflII       C'TTAA_G - 5 Cut(s)
   1292   4185  11172  18334  20663

                               [remainder omitted]

sorted by '# cuts, then alphabetically' -c flag

== Cut Sites by Restriction Enzyme

ApaI        G_GGCC'C - 1 Cut(s)
  21979

ApaLI       G'TGCA_C - 1 Cut(s)
  10676

Bpu10I      CC'TnA_GC - 1 Cut(s)
    844

Bpu1102I    GC'TnA_GC - 1 Cut(s)
  13369

NarI        GG'CG_CC - 1 Cut(s)
  16014

NgoAIV      G'CCGG_C - 1 Cut(s)
  15606
                               [remainder omitted]

Tables of Fragment Sizes -F

(unsorted by fragment size, sorted or both; 'vertical' ordering can also be sorted with '-c' as in the above example.

  **  SORTED Fragment Sizes by Restriction Enzyme **

AatII       G_ACGT'C - 5 Fragment(s)
   1849   3449   3731   4289   5110  14062

AccI        GT'mk_AC - 5 Fragment(s)
    639   1187   2192   3574  11828  13070

AflII       C'TTAA_G - 2 Fragment(s)
   6078   6541  19871

AflIII      A'CryG_T - 13 Fragment(s)
     35    170    459    493    956   1268   1712   1913   2360   2419   4091
   4920   5733   5961
                               [remainder omitted]

Ladder Map -l (el, not one)

overhangs indicated: 5'(\), 3'(/) blunt(|) cutters
if more than 1 cut maps to a space, the number of cuts is shown numerically (see AluI, below)
total # of cuts indicated at end of line).
width-sensitive, so you can increase accuracy by choosing a wider output. See also the complete output of the lambda analysis.

  Ladder Map of Restriction Enzyme Cut Sites:

            900 1800 2700 3600 4500 5400 6300 7200 8100 9000
              :    :    :    :    :    :    :    :    :    :
     AccI ----------------\-----------------\--------------------         2
     AciI -----------------\-----------\--\---\\------\--\\------         8
    AflII ---------\-------------\---------\--\----\---\---------         6
     AhdI ---------------------------/---------------------------         1
              :    :    :    :    :    :    :    :    :    :
     AluI --|---|||--23||||2||-|2--2---|2|232||-|-|22--2-|3|-||--        50
     AlwI ------2---\-----------\--2-------------\2--------------         9
   Alw26I ----------------------------------------------------\--         1
    AlwNI -----------------------/--------------/--/-------------         3
              :    :    :    :    :    :    :    :    :    :

Infrequent Cutters Map part of the -l flag

(a la Strider) of enzymes that cut less than X (a small number of times) - (In version 2/3 this has been changed to be length-sensitive; X = 2 until the sequence increases to beyond 10,000, then X increases proportional to the log of the increased size). This output option is not called explicitly, but is appended to the Ladder map (-l).
It should be obvious, but each label is of the form:
Name@Cut_Site

                                                                                     
 == Summary of Enzymes that cut ** 2 ** times or less:
                           XcmI@9286
      XcmI@2466           UbaEI@9054
    Sse8647I@1801          SexAI@9288  UbaDI@13309
        SmaI@2882            PstI@10119       PstI@15573
    Psp5II@1801                 Pfl1108I@10898   SgrAI@16556
      NruI@2248              HgiEII@10152            PflMI@18025
        NcoI@2848        SexAI@8621  EcoNI@12615  NruI@17146
        BstXI@2962              EcoNI@10850   NgoAIV@15606  PmlI@20299
           BstEII@4076             EagI@12054  NarI@16014   Pfl1108I@20364
   BstEII@1183                  Bsu36I@10980         HgiEII@18039 MluI@22230
        BstDSI@2848   NcoI@7724        Bpu1102I@13369  PacI@18544 MluI@22201
  Bpu10I@844   PshAI@5399      ApaLI@10676     EagI@16150   BstXI@20204
  AatII@1118          BstDSI@7724  AatII@12143   Bsu36I@16688    ApaI@21979
  ||| | |  |   |      |  ||| | ||  | | |      || ||  | |    |    ||
-------------------------------------------------------------------
     :     :     :     :     :     :     :     :     :     :     :
  2000  4000  6000  8000 10000 12000 14000 16000 18000 20000 22000

Pseudo Gel Map -g

This option shows approximately how different digests would look if run on an agarose gel. Turn the page 90°, blur your eyes slightly, and it'll look like a massive agarose gel... really. Useful for selecting restriction enzymes for best separation of fragments...
It uses a straight log10() approximation, but the addition of math to simulate different percentages of agarose/polyacryamide would be trivial. If anyone has a favorite approximation, I'd be happy to include it.
It uses the same representation as the ladder map, with single fragments represented as '|', multiple fragments that cannot be resolved as a digit showing how many map to that space.
the flag values set the LOW and HIGH CUTOFFs - fragments smaller or larger than this window are excluded from the simulation. The value must be an integer multiple of 10 (10, 100, 1000, etc).

Below is a section of gel output generated with the width set to '-w75'; the sequence length was 22380 bp (so the gel is automatically extended to 30 Kb).

From examining the UDP data so far, this is the option that is used incorrectly most of the time. To obtain the results below (a 10 bp cutoff), it should be used thusly:
tacg -g10 [other flags] < input.file
If you wanted a 1000 bp cutoff, change it to:
tacg -g1000 [other flags] < input.file

 == Pseudo-Gel Map of Digestions:

      10                   100                  1000                 10000
       .      .  .  . . . ....      .  .  . . . ....      .  .  . . . ....      .  .
 AatII                                               |                    ||          2
  AccI |                 |        |        |   |  | |2|   222 |                      16
AceIII                    |              |     |           2   || |   |               8
  AciI   ||      || ||   | |  | |22 || 33||4 |227||3| |2                             48
       .      .  .  . . . ....      .  .  . . . ....      .  .  . . . ....      .  .
 AflII                                                |  |  | |        2              5
AflIII           |               |    |  |          |   | 2 2     |   |              11
  AhdI      3                                  |      |     |         |     |         7
  AluI |     3 ||3  3|222 |||2223 3334|2352|2|42|||  2|2                             74
       .      .  .  . . . ....      .  .  . . . ....      .  .  . . . ....      .  .
  AlwI   5                     | ||   |  |||  ||||2|| |  3|   ||                     26
Alw26I |           |  2      | |  | |2|| 3 32||3|   |4|2 |    |                      35
 AlwNI                      |             |        |    2  |          |  |            7
  ApaI                                     |                                          1
                         [rest deleted]

Rules-based Logical Associations --rule, --rulefile

--rule allows you to specify arbitrarily complex logical associations of small motifs to detect the patterns that interest you. Here's an example:
Say you wanted to search for an enhancer that you suspected might be involved in the transcriptional regulation of a pituitary-specific gene. You knew that you were looking for a sequence about 1000 bp long in which there were at least 2 Pit1 sites and 3-5 Estrogen response elements, but NO TATAA boxes. If you had defined these patterns in a file called pit.specific as:

 Pit1  0  WWTATNCATW    0 1 ! Pit1 site w/ 1 error
 ERE   0  GGTCAGCCTGACC 0 1 ! ERE site w/ 1 error
 TATAA 0  tataawwww     0 0 ! TATAA site, no errors allowed

you could specify this search by:

tacg --rule '((Pit1:2:7&ERE:3:5)&(TATAA:0:0),1000)' -R pit.specific < input_sequence

This query searches a sliding window of 1000 bps (-W 1000) for ((2-7 Pit1 AND 3-5 ERE sites) AND (0 TATAA sites)). These combinations can be as large as your OS allows your command-line to be with arbitraily complex relations represented with logical AND (&), OR (|), and XOR (^) as conjunctions. Parens enforce groupings; otherwise it's evaluated left to right.

After rules are evaluated from the commandline, if the rule parses correctly, the rule is written to the tacg.patterns file (created in the current directory if it doesn't exist) so that it can be incorporated into a FILE of such rules which can then be evaluated en masse using the --rulefile flag. This option, followed by a path to the file, allows all the uncommented rules in the file to be evaluated at once. The format for the rules in both the tacg.patterns and the rulefile is as follows:

free-format comments 
free-format comments down to the '..' separater between header and body.
..
;comments below the '..' must have a ';' character as the 1st character of the line.
;Name ,   Rule  , sliding window   -  Note that they must be separated by a ','
TestRule1  ,  (MwoI:1:8 | NaeI:1:8 | NarI:1:8) , 866
;rules can be split over multiple lines with a '\' as the last (continuation) character
TestRule2  ,  ((MwoI:1:8 | NaeI:1:8 | NarI:1:8) ^ ( NciI:1:8 & NcoI:1:8) | \
(NdeI:1:8 | (NgoMIV:1:8 & NheI:1:8))),866

Cloning functions --clone

The --clone flag allows you to specify regions of the sequence which can or cannot be cleaved by the REs selected by the other RE selection criteria (-o, -n, --cost, -m, -M, etc). You can specify up to the compiled-in limit of 15 criteria at once in the form of: --clone #_#,#x#,#x#,#_#... where # is an integer indicating the sequence offset and each #_# or #x# is a criterium. #_# indicates a range that MAY NOT be cleaved; #x# indicates a range that MUST BE cleaved.
The output indicates which criteria are met, with any combinations that perfectly meet all the criteria reported 1st.

Pattern Matching -p

The pattern matching used in tacg has been extended to a more general approach for nucleic acids so that it is applicable for searching for degenerate patterns with a specified number of errors (in possibly degenerate sequences).

Pattern Matching with Errors

The original tacg allowed searching for nucleic acid patterns with the standard IUPAC degeneracies, but V2/3 allows you to extend this capability by allowing searching with mismatch errors.

Quite often when you are searching for binding sites for transcription patterns, you wish to search for not only degenerate sites but degenerate sites with 'a bit of slop'. You can use tacg to do this in 2 ways, using the -p flag:
Say you had defined a site (called NPR) on a promoter by footprinting as 'gtcagggcgaat' and you wanted to search another implicated region for this sequence but you wanted to allow 2 errors to be allowed, so that you could find a sequence that would, for example match as follows:

               ...gtcagggcgaat...
                     x      x 
               ...gtcggggcgatt...

An appropriate flag/option sequence to obtain such a match would be:

tacg -pNPR,gtcagggcgaat,2 -Sl -w75 <input.file |less

Translation: search input.file for the NPR sequence gtcagggcgaat, allowing at most 2 errors (-pNPR,gtcagggcgaat,2), returning the Site data and a ladder map (-Sl) formatted 75 characters wide (-w75) and pipe the results directly to the pager program 'less' (|less)

NB: In Version 3.5, the error term can be omitted if you want perfect matches (O mismatches allowed). -pNPR,gtcagggcgaat indicates that no mismatches are allowed. Additionally, when the -p flag is invoked, it also sets the -S flag so that tacg now behaves more like a grep.

or you can specify that the sequence is always supposed to be searched 'with slop' via the addition of a single character in the standard GCG-formatted REBASE file entry:

Here is the standard GCG REBASE format line:

NPR  6   gtcagggcgaat   0  !NPR is searched with 0 errors

Here is my hack to the standard GCG format line:

NPR  6   gtcagggcgaat   0  2 !NPR is always searched with 2 errors

You can also add more info to the REBASE file as described here. The algorithm to do this is reasonably brute force: an 'n' scan of the sequence is done, replacing every base of the sequence with an 'n'. If there is more than 1 error allowed, then each successive generation is 'n'-scanned. At each iteration, each generated pattern is compared with the previous ones so that only novel patterns are actually stored. This approach dramatically decreases the number of patterns searched for, but can consume substantial CPU resources in the validation for long patterns with many errors allowed. There's a tweak to the algorithm that will also substantially decrease this time, but I haven't implemented it yet. I'll see if anyone uses it 1st. Udi Manber's very fast, elegant, bitmasked agrep could also be applied to this approach, but I haven't spent the time to do so yet.

Proximity Matching -P

This is another option designed specifically for Transcription Factor (TF) analyses. I apologize that the commandline is particularly horrible, but In many cases, you are interested in a particular juxtaposition of 2 or more TFs; this option allows you to specify up to 10 sets of relationships between TFs and extract only those which match your criteria. The output has been organized so that you can see both the raw data (All) and the filtered data (Matched) in the same output stanza (below):

== Proximity Matches:

lef(1E)/IHF(0E) * 1 matches * found from search parameters: lef,200,IHF
   57309/57116     


 == Ladder Map of Enzyme Sites:  
                                                                          # hits

           10000 20000 30000 40000 50000 60000 70000 80000 90000100000
               :     :     :     :     :     :     :     :     :     : 

       lef |--|-|-|463222222||||25|2|243---|324|222222--32223|-||3-3|--   98 (All)
       lef ----------------------------------|-------------------------    1 (Matched)
       IHF ----------------------------------|-------------------------    1 (Matched)
       IHF ----------------------------|-----|-------------------------    2 (All)

               :     :     :     :     :     :     :     :     :     :

Pattern Match Data Output by 'Bins' NEW -G

This option was included to allow the results of the matching to be exported to other programs for further analysis. Especially where there is a high number of matches, it may be of more use to plot or analyze these data using an external program, whether it be a spreadsheet, plotting app, or custom-writ program. I don't presume to know what you might want to do with the data and so to make tacg play nicely with other apps in the best unix tradition, I've tried to address some of the simpler data output formats that might be of most use. Output is possible in 2 rectangular formats (X and Y; one is the diagonal flip of the other) as well as in a long form (L)that may be better suited for simple, custom-writ programs that just want to read x,y pairs. The very slick gnuplot can read the multiple columns of the Y form with only a very small amount of editing or filtering of the output. An example of gnuplot's processing of some tacg-generated data is shown as well. See also the gnuplot stanza in the section below.

Using tacg with Other Programs

tacg is obviously not the only molbio program you need to do your work. There are many analyses tacg just cannot do; to that end, it at least regurgitates your data in a re-digestible form so that you can masticate it further elsewhere. If you can program a bit, for quick and dirty programs you probably can't do better than perl (but if you can program a bit, you already know that). If you can't program, importing your data into a spreadsheet and manipulating it cell by cell is certainly viable with the sophistication of modern spreadsheets.

In terms of working with your data in the context of tacg, there are lots of other programs that make input easier, and output more readable/browsable. Here are some that I use on a regular basis:

NEdit A well-designed, actively supported, free GUI editor for Xwindows on unix and VMS. Besides its own features, NEdit allows you to seamlessly pipe selected text to external programs (...like tacg) and display the output in a NEdit window, and even add menu items of external programs (...like tacg), effectively making the NEdit/tacg combination a decent GUI biosequence analysis tool.
I've included a few of my own shell commands and other NEdit settings in the dotnedit file accompanying the distribution. Either copy your .nedit to a safe place and start NEdit with mine, or edit the tasty bits of mine into yours. If you're just browsing, here's my .nedit. And incidentally, here's another from Thomas Siegmund which allows biosequence coloration inside Nedit. He has also developed some very neat Kaptain scripts to wrap many of the EMBOSS applications (some examples shown here.
While the NEdit/tacg combo doesn't allow quite the same flexibility as Strider, it does allow you to use an exceptional editor to modify or annotate your sequence in a very natural way and also allows you to submit parts of that sequence for some of the usual analyses. On slow machines, (~486/66) there is a perceptible pause while the data is passed to tacg and tacg's output is redirected back into nedit, but the window does appear; at least you don't have to go searching for it as you usually do in GCG.
less My favorite pager - slick design, fast, memory miser. Like joe, it can pass selected text (but by lines only) to an external app with the '|from mark X to here' command.
Example:
- go to the start of the selected text
- hit 'm', then a key to mark the current position (say 'a')
- move to the end of the text
- hit '|', then the same indicator key as above ('a'), then the command you wish to pipe it to (|a tacg -n4 -sSl)
readseq The standard for biological sequence format conversion, from that biocomputing code wiz Don Gilbert. You can quibble with the interface a bit, but it's a lifesaver. If only GCG could convert sequence like readseq...
gnuplot gnuplot is a another of those many unix utilities that seems crude and clunky until you play with it a while. And after a bit, things fall into place and you wonder how you got along without it. Using tacg's -G output with Y format, it is easy to plot the distribution of A/T rich regions (using a pattern entry called 'at-rich') in the E coli genome, along with the distribution of an IHF site (pattern called 'ihf') allowing 1 error. While they should both be further reduced with a sliding window, you can see that it is reasonably easy to visualize primary data from tacg. Example:
- generate the plot data:
```
tacg -x at-rich,ihf -G2000,Y < ec.raw > plot.data
```
- clip off the top and bottom few lines with your favorite editor (or be scripted with `head` and `tail`, if the whole thing is going to be iterated).
- start gnuplot and type the commands:
- plot 'plot.data' using 2
- replot 'plot.data' using 3
- The output is a plot with A/T-rich regions marked by green dots and IHF sites marked by red dots.
agrep A really useful tool that works very much like the other members of the 'grep' family but with a few notable differences. It's VERY fast, and like tacg, you can use it to search for patterns with errors. I was originally going to use it as the basis for tacg's searching with errors, but by the time I started to understand the bit operations that it uses, I had already implemented a workable scheme using the code I had already written, which was more applicable to the kind of output I wanted. I might switch later - it's a very cool algorithm (underlies Jim Knight's grepseq as well as the Glimpse and Harvest search tools). The technical papers describing it (in the same FTP tree) are models of readability.
Genscript (Gnu enscript) This is an ASCII to postscript converter that allows you to scale and rotate text on the printed page. For many tacg analyses, it's convenient to have very wide output and genscript allows you to squeeze the text you want in the space you have.
Alpha Not a unix tool at all, but a hair-raisingly, spine-tinglingly good shareware text editor for the Mac from Pete Keleher. Often compared with the commercial BBedit. The reason I use it is because it works the way I work - almost everything I'd like in an editor is there and if it isn't, Alpha is mostly written in tcl so that it can be extended to do whatever you want. If I was going to try a native Mac port of tacg, I'd try it 1st as an add-in to Alpha, similar to the shell commands added to NEdit, described above.

Using tacg to Scan Databases

Thanks to the SEQIO lib, tacg now handles this just fine. Another thing that would be very handy for database handling is the ability to deal transparently with gzipped/compressed data; this is available via the zlib library.

The Odd one: UDP Reporting

tacg was also designed to track it's own use and spread - something in which I'm also interested. However, with version 3.x, this has ceased to be the default. You can still explicitly send back info with the -Q flag (and I'd be grateful), but it annoyed a few people and I think it has fulfilled it's usefulness. The binaries have been compiled with code (udping.c) that spits a small amount of data back to me at each usage IF YOU EXPLICITLY ALLOW IT TO.
The data that gets sent back is:

the IP number of the hosting machine
the cpu type
Operating System and Version # (output of 'uname -a')
what flags were used in calling the program
the number of bases processed.

It does NOT return user names or actual sequence. An example of the data that is returned:

Returned: [hw=IP22 os=IRIX osver=5.3] [TACG Version 1.5F] /usr/local/bin/tacg -n5 -o5 -F2 -slL -g100 <100000 bp>

...And The Small Print Taketh Away...

COPYRIGHT NOTICE

Permission to use, copy, modify, and distribute this software and its documentation is hereby granted, subject to the following restrictions and understandings:

Any copy of this software or any copy of software derived from it must include this copyright notice in full.
All materials or software developed as a consequence of the use of this software or software derived from it must duly acknowledge such use, in accordance with the usual standards of acknowledging credit in academic research.
This software is licensed under the Free Software Foundation's General Public which is available by clicking the preceding link or at http://www.gnu.org/copyleft/gpl.html and is also included in the source code distribution in the file COPYING
If the GPL is unacceptable to you or your organization, you may also contact the author to negotiate a proprietary license under which you are relieved of the terms of the GPL.
This software is provided AS IS with no warranties of any kind. The author shall have no liability with respect to the infringement of copyrights, trade secrets or any patents by this software or any part thereof. In no event will the author be liable for any lost revenue or profits or other special, indirect and consequential damages. Period.

IG/Stanford	GenBank/GB	NBRF
EMBL	GCG	DNAStrider (ASCII)
Fitch	Pearson/Fasta	Zuker (in-only)
Olsen (in-only)	Phylip3.2	Phylip
Plain/Raw	PIR/CODATA	MSF
ASN.1	PAUP/NEXUS