rest.txt update 4/ 2/2006 RESTRICTION SITE SEARCH PROGRAMS: INTREST AND BACHREST I. Interactive restriction site search: INTREST Below is the screen output of a typical interactive session with INTREST. Program output and user responses are listed as they would actually appear on the screen. Comments, which are listed here for explanatory purposes but would not appear in the program, are enclosed in the symbols (* *). INTREST VERSION 5/10/90 (* program begins *) Enter input filename: PBR322.DNA The following file formats can be read: F:free format B:BIONET G:GENBANK Type letter of format (F|B|G) F Reading input file... Is sequence circular or linear? (Type C or L ) C (* User types C or L *) (* INTREST reads in the sequence. *) Enter output filename: PBR322.rest Type name to appear on output: pBR322 _____________________________________________________________________ INTREST MAIN MENU _____________________________________________________________________ Input file: PBR322.DNA Output file: PBR322.rest _____________________________________________________________________ 1) Read in a new sequence 2) Open a new output file 3) Search for sites (output to screen) 4) Search for sites (output to file) _____________________________________________________________________ Type the number of your choice (0 to quit program) 3 (* INTREST writes a heading to the screen or file *) INTREST Version 5/10/90 pBR322 Configuration: CIRCULAR Length: 4363 bp # of Cut Sites Sites Frags Begin End (* User is prompted for enzyme information. *) (* Ambiguous nucloetides may be specified *) (* using the IUPAC-IUB conventions: *) (* Cornish-Bowden (1985) Nucl. Acids Res. 13: *) (* 3021-3030. *) Enter name of enzyme: Ava2 (*User types name of enzyme*) Enter recognition site using the symbols below: Nucleotides: A,C,G,T Ambiguous nucleotides: R = A,G K = G,T H = A,C,T D = G,T,A Y = C,T S = G,C B = G,T,C N = G,A,T,C M = A,C W = A,T V = G,C,A Site must be <= 20 characters: GGWCC (* user types restriction site *) Position of cut? 1 Searching... (* INTREST searches for the enzyme in the sequence and writes a report*) (* to the output file, in this case, the printer. If the restriction *) (* site is symmetric, it searches once. If it is asymmetric, INTREST *) (* searches again using the inverse complement of the site. *) (* After searching and printing, the user is prompted: *) Type S to search for a site, Q to quit (* The user may repeat the cycle and search for a new enzyme by *) (* by typing S. Typing Q ends the program. *) (* Below an example of what output might look like: *) BamH1 GGATCC 1 1 376 4363 376 375 Bgl1 GCCNNNNNGGC 7 3 936 2319 1170 3488 1170 1810 3489 935 3489 234 936 1169 BstN1 CCWGG 2 6 132 1857 2638 131 1061 1060 1444 2503 1444 929 132 1060 2504 383 1061 1443 2625 121 2504 2624 2638 13 2625 2637 EcoR1 GAATTC 1 1 4362 4363 4362 4361 Hind3 AAGCTT 1 1 30 4363 30 29 Kpn1 GGTACC 5 0 Mbo2 GAAGA 13 ( 12) 11 477 791 2347 3137 731 755 3209 3963 1002 753 1594 2346 1594 592 1002 1593 2347 493 4347 476 3138 271 731 1001 3209 254 477 730 3964 196 4151 4346 4042 109 4042 4150 4151 78 3964 4041 4347 71 3138 3208 II. What the output means Although the output is quite straightforward, several aspects deserve some comment. First, note that the column labled "Cut" refers to the position in the restriction site after which the enzyme cuts. "# of sites" obviously tells how many sites were found in the sequence. The column "Sites" tells the starting position of the fragment after the cut, not the beginning of the restriction site. Thus, in a BamH1 digest of pBR322, a single linear fragment is produced which begins at position 376 and ends at position 375 (as read 5'-->3'). Restriction sites are listed in this column in the order of their appearance in the sequence. The rightmost three columns should be thought of as a group. The column "Frags" lists the fragments produced by digestion with a given enzyme in descending order of size, as they would appear on a gel. The columns "Begin" and "End" list the corresponding 5' and 3' ends of each fragment in "Frags". To reiterate, the column "Sites" represents an entirely different way of representing the data as compared to "Frags", "Begin" and "End". Note that the fragment lengths, and beginning and ends of fragments refer only to the strand being processed. III. Searching for many sites: BACHREST INTREST allows the user to search for restriction sites one at a time in an interactive manner. Frequently, however, the investigator may wish to search for a battery of restriction enzymes. Since typing a large number of sites can be rather tedious and allows errors to crop up, a batch version of INTREST has been written, called BACHREST that reads recognition sequences are read from a file. Two types of restriction site files can be read. Preferrably, you should use Rich Robert's REBASE file "type2.xxx"(where xxx is the current version number) as input. With REBASE, a variety of subsets of restriction enzymes may be chosen, as illustrated in the example below. REBASE is available by anonyous FTP to the directory pub/rebase at vent.neb.com. REBASE entries take the form: <1> <2> <3> <4> <5> <6> For example: <1>HindII <2> <3>GTY^RAC <4>5(6) <5>EM <6>333,627,628,686 The original restriction enzyme files used with earlier versions of BACHREST can also be read. For example, to produce the output shown above, the following restriction site file would be used: ENZYME RECOGNITION SEQ. CUTTING POSITION(S) 4/24/87 BamH1 GGATCC 1 Bgl1 GCCNNNNNGGC 7 BstN1 CCWGG 2 EcoR1 GAATTC 1 Hind3 AAGCTT 1 Kpn1 GGTACC 5 Mbo2 GAAGA 13 12 It is readily apparant that the restriction sequence file simply contains the same information that the user would have typed at the keyboard in INTREST. Such a file, therefore, is quite easy to make. The user can then make a file of restriction sequences to suit his own needs, following the three column format shown above. The following rules apply : 1. The first two lines, reserved for titles and spacing, are ignored by the programs. Input begins on line 3 of the file. 2. All enzymes must include a name, a recognition sequence, and a cutting site. These data items must all be on the same line, and must be separated by one or more blanks. 3. Restriction enzyme names must be ten or fewer characters. Blanks are not permitted. 4. If the recognition sequence contains characters other than those specified above, the enzyme will be ignored. 5. If the recognition sequence is assymetric, (ie. the inverse complement is not the same as the original site) a second cutting site must be included after the first, telling where the enzyme cuts on the complementary strand. A useful type of file might consist of only "4-cutters", or another file of "6-cutters", or any other grouping that is convenient. A sample file which contains commercially-available restriction enzymes is found in the file comrest.dat. (Note: With the availability of REBASE, comrest.dat is no longer being updated.) IV. Sample BACHREST run In this example, we will search for restriction sites that cut the Pseudomonas aeruginosa (http://www.pseudomonas.org) into only a few fragments. To accomplish this, we will search for recognition sequences 6 bases or longer, and set FRAGMOST to 5, which will only print digests giving 5 or fewer fragments. To eliminate redundancy, we will only search for prototypes (ie. the original enzyme for a given recognition sequence, ignoring all isoscizomers.) NOTE: The sequence obtained from pseudomonas.org is in FASTA format, which has no way to specify that the sequence is circular. When reading fasta (bionet) formatted sequences, bachrest assumes that the sequence is linear, unless the file ends with the character '2'. In Unix, a 2 can be appended to a file in the following way: echo '2' >> P_aeruginosa.fsa The following is a transcript of the bachrest search of P_aeruginosa: bachrest BACHREST Version 3/29/2006 Enter filename of sequence to search: P_aeruginosa.fsa The following file formats can be read: F:free format B:BIONET G:GENBANK Type letter of format (F|B|G) b (* Bionet format is essentially the same as FASTA*) Reading input file... Enter restriction site filename: /home/psgendb/dat/REBASE/type2.lst (* REBASE enzymes*) Enter output filename: P_aeruginosa.bachrest ________________________________________________________________________________ BACHREST MAIN MENU ________________________________________________________________________________ Input file: P_aeruginosa.fsa Rest. enz. file: /home/psgendb/dat/REBASE/type2.lst Output file: P_aeruginosa.bachrest ________________________________________________________________________________ 1) Read in a new sequence 2) Open a new rest. enz. file 3) Open a new output file 4) Set search parameters 5) Search for sites and write output to file ________________________________________________________________________________ Type the number of your choice (0 to quit program) 4 ________________________________________________________________________________ Parameter Description/Response Value ________________________________________________________________________________ 1)SOURCE C:Commercial only A:all C 2)PROTOTYPE P:prototypes only A:all A 3)PROT3 3' protruding end cutters (Y/N) Y 4)BLUNT Blunt end cutters (Y/N) Y 5)PROT5 5' protruding end cutters (Y/N) Y 6)SYMM S:symmetric A:asymmetric B:both B 7)MINSITE Minimum RE site length 4 8)MAXSITE Maximum RE site length 23 9)FRAGLEAST Min. # of fragments to print a digest 0 10)FRAGMOST Max. # of fragments to print a digest 6000 11)FRAGPRINT Print a number if > FRAGPRINT frags 30 ________________________________________________________________________________ Type number of parameter you wish to change (0 to continue) 2 Type new value for PROTOTYPE (CURRENT VALUE: A) p ________________________________________________________________________________ Parameter Description/Response Value ________________________________________________________________________________ 1)SOURCE C:Commercial only A:all C 2)PROTOTYPE P:prototypes only A:all P 3)PROT3 3' protruding end cutters (Y/N) Y 4)BLUNT Blunt end cutters (Y/N) Y 5)PROT5 5' protruding end cutters (Y/N) Y 6)SYMM S:symmetric A:asymmetric B:both B 7)MINSITE Minimum RE site length 4 8)MAXSITE Maximum RE site length 23 9)FRAGLEAST Min. # of fragments to print a digest 0 10)FRAGMOST Max. # of fragments to print a digest 6000 11)FRAGPRINT Print a number if > FRAGPRINT frags 30 ________________________________________________________________________________ Type number of parameter you wish to change (0 to continue) 7 Type new value for MINSITE (CURRENT VALUE: 4) 6 ________________________________________________________________________________ Parameter Description/Response Value ________________________________________________________________________________ 1)SOURCE C:Commercial only A:all C 2)PROTOTYPE P:prototypes only A:all P 3)PROT3 3' protruding end cutters (Y/N) Y 4)BLUNT Blunt end cutters (Y/N) Y 5)PROT5 5' protruding end cutters (Y/N) Y 6)SYMM S:symmetric A:asymmetric B:both B 7)MINSITE Minimum RE site length 6 8)MAXSITE Maximum RE site length 23 9)FRAGLEAST Min. # of fragments to print a digest 0 10)FRAGMOST Max. # of fragments to print a digest 6000 11)FRAGPRINT Print a number if > FRAGPRINT frags 30 ________________________________________________________________________________ Type number of parameter you wish to change (0 to continue) 10 Type new value for FRAGMOST (CURRENT VALUE: 6000) 5 ________________________________________________________________________________ Parameter Description/Response Value ________________________________________________________________________________ 1)SOURCE C:Commercial only A:all C 2)PROTOTYPE P:prototypes only A:all P 3)PROT3 3' protruding end cutters (Y/N) Y 4)BLUNT Blunt end cutters (Y/N) Y 5)PROT5 5' protruding end cutters (Y/N) Y 6)SYMM S:symmetric A:asymmetric B:both B 7)MINSITE Minimum RE site length 6 8)MAXSITE Maximum RE site length 23 9)FRAGLEAST Min. # of fragments to print a digest 0 10)FRAGMOST Max. # of fragments to print a digest 5 11)FRAGPRINT Print a number if > FRAGPRINT frags 30 ________________________________________________________________________________ Type number of parameter you wish to change (0 to continue) 0 ________________________________________________________________________________ BACHREST MAIN MENU ________________________________________________________________________________ Input file: P_aeruginosa.fsa Rest. enz. file: /home/psgendb/dat/REBASE/type2.lst Output file: P_aeruginosa.bachrest ________________________________________________________________________________ 1) Read in a new sequence 2) Open a new rest. enz. file 3) Open a new output file 4) Set search parameters 5) Search for sites and write output to file ________________________________________________________________________________ Type the number of your choice (0 to quit program) 5 AatII AccI AcyI AflII etc... XmnI ________________________________________________________________________________ BACHREST MAIN MENU ________________________________________________________________________________ Input file: P_aeruginosa.fsa Rest. enz. file: /home/psgendb/dat/REBASE/type2.lst Output file: P_aeruginosa.bachrest ________________________________________________________________________________ 1) Read in a new sequence 2) Open a new rest. enz. file 3) Open a new output file 4) Set search parameters 5) Search for sites and write output to file ________________________________________________________________________________ Type the number of your choice (0 to quit program) 0 The output file looks like this: ----------------------------------------------------------- BACHREST Version 3/29/2006 >[org=Pseudomonas Topology: CIRCULAR Length: 6264404 bp ----------------------------------------------------------- Search parameters: Recognition sequences between 6 and 23 bp Ends: 5' protruding, Blunt, 3' protruding Type: Symmetric, Asymmetric Minimum fragments: 0 Maximum fragments: 5 Maximum fragments to print: 30 ----------------------------------------------------------- # of Enzyme Recognition Sequence Sites Sites Frags Begin End -------------------------------------------------------------------------------- PmeI GTTT^AAAC 1 5859604 6264404 5859604 5859603 SwaI ATTT^AAAT 5 331044 3424071 2549592 5973662 786993 1302789 1246803 2549591 1246803 621785 5973663 331043 2549592 459810 786993 1246802 5973663 455949 331044 786992 TaqII GACCGA(11/9),CACCCA(11/ 2 843937 5442291 1666050 843936 1666050 822113 843937 1666049 V. Usage Notes 1. Sequence topology (circular or linear) must be specified in the input file. GenBank entries contain this information on the LOCUS line. In Fasta format files, simply append a '2' to the end of the file to indicate circularity. Free-format files do not have a mechanism for specifying topology, so linearity is assumed. 2. "N"'s may be included in the DNA sequence to indicate regions that are unknown. Any restriction sites which are known in an otherwise unknown region may be included in the sequence and will be found by INTREST and BACHREST. The fragments calculated will then be of the correct size. The real power of this approach is that it allows the user to create a composite sequence containing a vector, whose sequence is known, with a partial sequence of the insert, with the experimentally determined sized of unknown regions represented by the appropriate number of "N"'s. INTREST and BACHREST can be used to predict the fragments expected from single-enzyme digests, and MULTIDIGEST can be used to predict double or triple enzyme digests. One can then select from among these several digests which will be easy to experimentally verify. The information gained from these experiments can then be used to update the DNA sequence file. 3. Parameters PROT3,BLUNT,PROT5 and SYMM control which types of enzymes are to be included in the report. Setting PROT3, BLUNT or PROT5 to N will suppress, respectively, reporting of enzymes generating 3' protruding ends, blunt ends, or 5' protruding ends. By default, all three are printed. Similarly, values for SYMM of S, A or B will print, respectively, only symmetric sites, only asymmetric sites, or both. (Default B). 4. The parameters FRAGLEAST and FRAGMOST allow the user to limit the report only to those enzymes producing a specified number of fragments. A digest will only be printed if the number of fragments is such that FRAGLEAST <= # of fragments <= FRAGMOST. For example, if you only wanted to see enzymes that generate at least one fragment, but no more than four fragments, set FRAGLEAST=1 and FRAGMOST=4. 5. FRAGPRINT specifies the maximum number of fragments that will be printed in the report. For example, with the default of FRAGPRINT=30, enzymes generating up to 30 fragments will have a complete listing of the fragment sizes and ends. For enzymes generating a larger number of fragements, only a number will be printed, telling how many fragments were generated, along with the message: (Fragments not shown). In the current version, the maximum value of FRAGPRINT is 6000. This could be increased in the code and the program recompiled, if needed. 6. Although these programs have been written with restriction sequences in mind, one could just as easily use them to search for other short sequences, such as splice junctions or promoter sequences. This is especially facilitated by the fact that INTREST and BACHREST allow ambiguities at any position. Since the programs always assume that the sequence given is a restriction site, a list of fragments will still be generated. With reference to old style Rest. Enz. files: 7. If searching for sites which are not restriction enzyme sites, or for enzymes whose cutting sites are not known, it is best to specify a cutting site of 0, so that sites will be listed as beginning at the first position in the recognition sequence. 8. The cutting site of an enzyme may be a positive or negative integer, and may refer to positions beyond the recognition sequence. In such cases (eg. Mbo2), if the predicted cutting site occurs beyond the end of a linear fragment, no site will be listed, even though the recognition sequence does occur within the fragment. With refreence to REBASE: 9. Be careful about setting BOTH PROTOTYPE and COMMERCIAL. If you search for only those commercially-available enzymes that are also prototypes, you will miss enzymes such as AhaIII and DraIII (TTTAAA). AhaIII is the prototype, but is not commercially available, while DraIII is available, but is not a prototype. 10. BACHREST and INTREST now can read GenBank files with LOCUS names of up to 18 characters. The old format, in which names could be a maximum of 12 characters can also be read.