testcode.doc                                          update 10/29/90

I. Function

Reads in a DNA or RNA sequence and evaluates open reading frames for
likelihood of coding for protein, using the TESTCODE algorithm found in:

     Fickett, J., "Recognition of protein coding regions in DNA
     sequences", Nucl.Acids Res., 10 No.17 (1982) p5303-5318.

Briefly, TESTCODE determines the degree to which base composition and
the distribution of nucleotide frequencies in positions 1, 2, or 3 of
codons deviate from those predicted by random chance. No prior knowledge
of codon strategy is required. If there is a codon preference of any
kind, TESTCODE will detect this as a deviation from randomness.

IT IS IMPORTANT TO READ THE PAPER BEFORE USING THIS PROGRAM.

II. Menus

TESTCODE begins by asking the user for the name of a file containing the
sequence (Input file) and for the name of a file to which output will be
written (Output file). After these files have been opened, the user is
placed in the main menu. It is best to think of the main menu as
performing those tasks in which TESTCODE communicates with the operating
system, such as opening and closing files, or writing output. The status
lines show the current filenames (whose form is specific to the operating
system, in this case UNIX). Option 3 also allows the user to enter a
title to be printed on the output.

_____________________________________________________________________
                         TESTCODE MAIN MENU
_____________________________________________________________________
Input file:        humhbb.dna
Output file:       humhbb.tes
Title:             Human Beta Globin
_____________________________________________________________________
     1) Read in a new sequence
     2) Open a new output file
     3) Type in a title line for output
     4) Change parameters
     5) Search sequence (output to screen)
     6) Search sequence (output to file)
_____________________________________________________________________
Type the number of your choice  (0 to quit program)
4

Choosing option 4 in the main menu brings the user into the parameters menu.
The parameters menu can be thought of as dealing with the sequence
itself, and is thus independent of the operating system. The name and
topology are read from the sequence file in the case of GENBANK or BIONET
files, or are typed in when the sequence is read, in the case of
free-format files. Using the defaults in the menu below, TESTCODE will
scan the entire sequence and produce graphic output.

Name: HUMHBB          Topology: LINEAR          Length: 2165 nt
_____________________________________________________________________
Parameter    Description/Response                          Value
_____________________________________________________________________
 1)START     first nucleotide evaluated                    1
 2)FINISH    last nucleotide evaluated                     2165
 3)WHICH     I: input strand  O: opposite strand           I
 4)FORMAT    T:tabular output  G:graphic output            G
 5)WINDOW    #codons in search window                      67
 6)SKIP      #codons to skip for each window               10
_____________________________________________________________________
Type number of parameter you wish to change (0 to continue)
0

The output is sent to the printer as shown below. At regular intervals in
the sequence, the value of the TESTCODE indicator is calculated for a 67
codon search window. The value is printed in the histogram, and the
program moves SKIP codons to the right (20 in the output below) to
perform the calculation again. In the beta globin gene, there are three
exons, located at positions 267-359, 490-711, and 1562-1690. These appear
as peaks on the graph. See section V for a more thorough discussion of
the output.

Human Beta Globin
TESTCODE  Version 7/13/90    WINDOW= 67    SKIP= 20

              NON-CODING               NO OPINION        CODING
   0.0       0.2       0.4       0.6       0.8       1.0       1.2       1.4
     ---------+---------+---------+---------+---------+---------+---------+
  97|==================== | |
 157|========================== | |
 217|============================= | |
 277|=========================================== |
 337|================================================
 397|================================ | |
 457|================================ | |
 517|============================================= |
 577|======================================================
 637|====================================================
 697|====================================== |
 757|===================================== |
 817|======================================= |
 877|===================================== |
 937|=============================== | |
 997|============================= | |
1057|======================== | |
1117|================== | |
1177|================================== | |
1237|=============================== | |
1297|=========================== | |
1357|========================= | |
1417|================== | |
1477|============================================ |
1537|=========================================== |
1597|=====================================================
1657|==================================================
1717|========================================= |
1777|======================= | |
1837|=================== | |
1897|===================== | |
1957|=========================== | |
2017|========================================= |

(* An example of tabular output is shown below using exon II.
   START=490, FINISH=711, and FORMAT=T. *)

Human Beta Globin Exon II
Open reading frame from 490 to 711

                 T       C       A       G
Pos.Freq.
    1           11      19      18      26
    2           23      15      24      12
    3           22      24       2      26
------------------------------------------------------------
Cont.Param    0.25    0.26    0.20    0.29
Posn.Param    1.92    1.50    8.00    2.00

TESTCODE indicator:     1.277600000E+00
Probability of coding:  1.000000000E+00
Prediction:             CODING

III. Constants

Constants are defined in the constant definition part of the main
procedure of TESTCODE. To change them, it is necessary to change their
values in the Pascal text and re-compile.

MAXSEQ      The maximum number of nucleotides in a DNA or RNA sequence.
            Set to 32700 by default.

MINCODONS   The minimum number of codons permitted for an evaluation.
            Set to 30 by default.

MAXLINE     The maximum length of a variable of the type LINE, used here
            only for the title to be printed on output. Set by default
            to 80.

LINESIZE    Maximum length of a graph line. Set by default to 70.

IV. Parameters

START
FINISH      START and FINISH are, respectively, the first and last
            nucleotides of the input strand to be evaluated. START may be
            greater than FINISH, but neither parameter may be less than 1
            or greater than the length of the sequence.

WHICH       WHICH is set by default to 'I', for the input strand. Setting
            WHICH to 'O' will cause TESTCODE to process the opposite,
            i.e. complementary, strand in the opposite direction. Thus,
            TESTCODE always works 5'--->3'. This perfectly sensible rule
            leads to some subtle results in the interpretation of how the
            sequence is to be processed. (see NUMSEQ documentation)

FORMAT      Indicates type of output. FORMAT=G, which specifies graphic
            output, is the default. FORMAT=T specifies tabular output. In
            this case, the search window used is delimited by START and
            FINISH. SKIP is irrelevant, since only one evaluation will be
            done.

WINDOW      (GRAPHIC OUTPUT ONLY) Size of the region (AS MEASURED IN
            CODONS) evaluated by TESTCODE at each position in the graph.

SKIP        (GRAPHIC OUTPUT ONLY) Number of CODONS skipped between each
            calculation of the TESTCODE indicator.

V. What the output means

GRAPHIC OUTPUT

TESTCODE produces a plot of the TESTCODE indicator versus position in the
sequence. On the X-axis is plotted the value of the dimensionless
TESTCODE indicator, while on the Y-axis is the position (in nucleotides)
of the center of each window evaluated by the program. The plot itself is
in the form of a histogram. Note that the X-axis has been divided into
three regions, NON-CODING, NO OPINION, and CODING, as described in the
Fickett article.

In the example shown above, the TESTCODE indicator is evaluated for a 67
codon WINDOW at 20 codon intervals along the entire length of the Human
Beta Globin gene. Exons I, II, and III appear as peaks in the graph,
whose limits can be roughly defined by the first NON-CODING window on
either side of each peak. At best, TESTCODE gives a rough approximation
of exon/intron boundaries, which must be precisely assigned by other
means.

TABULAR OUTPUT

For each evaluation done, a table is printed, showing the frequencies at
which T, C, A, & G are used at positions 1, 2, & 3 in all codons in the
region evaluated. The sum of nucleotide usage at positions 1, 2, & 3
yields the absolute T, C, A, or G content of the region. Dividing the
absolute content by the distance searched yields the relative content, or
Content Parameter, shown in the table. The position parameters for the
four nucleotides are computed as described by Fickett. Finally, the
TESTCODE indicator, its respective probability of coding, and the coding
prediction are printed.

The user should be aware that TESTCODE IS PREDICTIVE, NOT DEFINITIVE. The
TESTCODE indicator can only suggest whether a region is probably coding
or non-coding. The actual nature of the region in question must be
verified experimentally.

VI. Usage Notes

1. TESTCODE assumes that the region it is searching is an open reading
   frame, and makes no attempt to detect stop codons. It is up to the
   user to find open reading frames using a program such as NUMSEQ.

2. TESTCODE gives the same result regardless of which strand is
   evaluated. This is good in that you only have to search one strand,
   but bad, because it doesn't distinguish between two genes on opposite
   strands of the same stretch of DNA.

3. TESTCODE cannot determine which register of a protein coding sequence
   actually codes. This is usually not a problem, since protein coding
   sequences usually have open reading frames in one register only. This
   aspect of TESTCODE is good, since it means that you don't have to
   worry about searching in three different reading frames. All that
   TESTCODE needs is a consistent register to evaluate the degree to
   which the usage of nucleotides in positions 1, 2, or 3 is non-random.
   However, this does require that the sequence be error-free, at least
   to the extent that no frameshifts are introduced.

4. The resolution of TESTCODE is coarse at best. Since TESTCODE is a
   statistical method, it must have an adequate sample size (i.e. WINDOW)
   in order to accurately predict the probability of coding in a given
   region. Use of too small a WINDOW will result in a background of high
   scores, owing to local fluctuations in nucleotide usage. Use of too
   large a WINDOW will reduce the resolution of the TESTCODE scan. In
   practice, a WINDOW of 67 codons (the minimum value recommended by
   Fickett) gives somewhat precise determinations of exon boundaries when
   used with a fairly small SKIP value (e.g. 5 or 10). When an exon is
   smaller than the WINDOW size (e.g. Exons I and III in beta globin),
   the exon will probably still be detected as a peak, but it may not
   yield a TESTCODE indicator in the CODING range.
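
The Cont.Param and Posn.Param rows of the tabular output described in
section V can be recomputed by hand from the Pos.Freq. counts. The sketch
below is in Python, not the Pascal of TESTCODE itself, and all function
and variable names are hypothetical. It assumes the position parameter
for each base is max/(min+1), taken over the counts at the three codon
positions. That formula agrees with every value in the exon II sample
table, but Fickett's paper remains the authority for the definition. The
final step, converting these parameters into the TESTCODE indicator,
depends on lookup tables from the paper and is not attempted here.

```python
# Sketch only: recompute the Cont.Param and Posn.Param rows of the
# tabular output from the Pos.Freq. counts. All names are hypothetical
# and do not come from the TESTCODE source.

BASES = ("T", "C", "A", "G")

# Pos.Freq. counts for beta globin exon II (490-711), copied from the
# sample table. Row p holds the T, C, A, G counts at codon position p+1.
EXON2_COUNTS = [
    (11, 19, 18, 26),
    (23, 15, 24, 12),
    (22, 24, 2, 26),
]


def content_params(counts):
    """Fraction of the evaluated region made up of each base."""
    total = sum(sum(row) for row in counts)  # nucleotides evaluated
    return {
        base: sum(row[i] for row in counts) / total
        for i, base in enumerate(BASES)
    }


def position_params(counts):
    """Per base: max/(min + 1) over the three codon positions.

    ASSUMPTION: this formula is inferred to match the sample table;
    check it against Fickett (1982) before relying on it.
    """
    result = {}
    for i, base in enumerate(BASES):
        per_position = [row[i] for row in counts]
        result[base] = max(per_position) / (min(per_position) + 1)
    return result


if __name__ == "__main__":
    cont = content_params(EXON2_COUNTS)
    posn = position_params(EXON2_COUNTS)
    print("Cont.Param", " ".join(f"{cont[b]:.2f}" for b in BASES))
    print("Posn.Param", " ".join(f"{posn[b]:.2f}" for b in BASES))
```

Run on the exon II counts, the two printed rows should match the
Cont.Param and Posn.Param rows of the sample table (0.25 0.26 0.20 0.29
and 1.92 1.50 8.00 2.00).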