comp.doc                                            update 5/15/90

                               COMP

I. Function-

COMP determines the base or amino acid composition of a DNA, RNA, or
protein sequence as a function of position. The user defines a set
of characters, called COMPSET, which are searched for at each
position. At the starting position, COMP calculates the percentage
of the first REGION bases or amino acids that are members of
COMPSET. It then shifts SKIP positions to the right and recalculates
the percentage. This cycle is repeated until the entire sequence or
subsequence has been searched. The output may be sent to any file in
the form of a table, but by default, COMP will write the resultant
coordinates to a file that is directly readable by LINEPLOT.
LINEPLOT then uses these points to create a graph of composition as
a function of position in the sequence.

NOTE: COMP ONLY READS FREE FORMAT FILES, NOT GENBANK, NBRF, OR
BIONET.

II. Program Flow

Program output and user responses are listed as they would actually
appear on the screen. Comments, which are listed here for
explanatory purposes but would not appear, are enclosed in the
symbols (* *).

COMP                Version 5/15/90
Type N for DNA or RNA, P for protein sequence:
P
Type input filename:
b:humhbb.pro                       (* IBM-PC DOS protocol *)
Reading input file...
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH*
Type output filename:
b:humhbb.dat                       (* IBM-PC DOS protocol *)
Type name to appear on output:
Beta-globin

(* COMP displays parameters which may be changed by user *)

Parameter    Description/Response                      Value
---------------------------------------------------------
 1)START     first position searched                     1
 2)FINISH    last position searched                    148
 3)REGION    width of region searched at each posn.     10
 4)SKIP      move right SKIP posn.after each search      5
 5)FORMAT    G: print a graph  T: print a table          G
 6)COMPSET   bases or amino acids to search for
             COMPSET [ ]
Type number of parameter you wish to change (0 to continue)
6
COMPSET: [ ]
The following may be added to or subtracted from COMPSET:
  1) a single amino acid
  2)NONPOLAR   [ A F I L M P V W ]
  3)UNCHPOLAR  [ C G N Q S T Y ]
  4)ACIDIC     [ D E ]
  5)BASIC      [ H K R ]
Type the number of your choice: (0 to continue)
2           (* user has chosen to search for non-polar a.a.'s *)
Type + to add, - to subtract:
+
COMPSET: [ A F I L M P V W ]       (* Current value of COMPSET *)
The following may be added to or subtracted from COMPSET:
  1) a single amino acid
  2)NONPOLAR   [ A F I L M P V W ]
  3)UNCHPOLAR  [ C G N Q S T Y ]
  4)ACIDIC     [ D E ]
  5)BASIC      [ H K R ]
Type the number of your choice: (0 to continue)
0

Parameter    Description/Response                      Value
---------------------------------------------------------
 1)START     first position searched                     1
 2)FINISH    last position searched                    148
 3)REGION    width of region searched at each posn.     10
 4)SKIP      move right SKIP posn.after each search      5
 5)FORMAT    G: print a graph  T: print a table          G
 6)COMPSET   bases or amino acids to search for
             COMPSET [ A F I L M P V W ]
Type number of parameter you wish to change (0 to continue)
0

(* COMP calculates the coordinates and writes them to the output
   file. Using LINEPLOT, MAXHSCALE is set to 0.15 (i.e. 0.15 x 1000
   amino acids) and the graph is printed. *)

[Graph: PERCENT [ A F I L M P V W ], axis from 0.000E+00 to
 1.000E+02 with a dotted reference line at 5.000E+01, plotted against
 Posn. in Beta-globin, axis from 0.0E+00 to 1.5E-01
 (REGION= 10 SKIP= 5). The plotted points fall between 3.000E+01 and
 7.000E+01.]

If only a few datapoints are expected, COMP may also be used to
produce output in tabular format.
This is most useful if you want to find the base composition of a
large region or the entire sequence. To use the table option, change
FORMAT to 'T'. Now, COMP will search as above, but only print the
actual values found at each position, omitting graph parameters.

After each search has been completed, the message

     Type Q to quit, S to search again:

gives the user the option to change search parameters and search
again. If, for the sequence used above, the parameters had been
changed so that FORMAT='T' and FINISH=50, the non-polar amino acid
content of the first 50 positions would have been calculated and
sent to the output file as shown below:

Beta-globin   SKIP= 5
POSN.     PERCENT [ A F I L M P V W ]
0.006      40.0
0.011      50.0
0.016      60.0
0.021      40.0
0.026      40.0
0.031      60.0
0.036      60.0
0.041      40.0

III. Parameters

START
FINISH
START and FINISH determine the part of the sequence to be searched.
By default, START is the first position and FINISH is the last.

REGION
REGION is the width of the region, centered on a given position in
the sequence, for which a percent composition is to be calculated.
Thus, if REGION = 30 and the current position is 260, COMP will
calculate the percent composition of the part of the sequence
beginning at 245 and ending at 274. COMP can only determine
composition for complete regions. Thus, if REGION=20, the first
position at which a value can be calculated is 11.

The size of REGION strongly affects how much the percent composition
varies from one position to the next. As one might expect, over very
large REGIONs, the fluctuations in composition are damped and the
curve flattens toward a constant value. Conversely, for small values
of REGION (e.g. a few nucleotides), the resultant graph will have
numerous jagged peaks and valleys.

SKIP
After calculating the percent composition at a given position, COMP
moves right SKIP positions. Generally, SKIP should be small relative
to REGION.
If REGION is equal to the size of the sequence, then the entire
sequence will be searched once, and the value of SKIP is irrelevant.
This would occur if the base composition of the sequence as a whole
were to be determined.

FORMAT
By default, FORMAT=G, which results in the output file being written
in a format readable by LINEPLOT. Setting FORMAT to T will result in
the output appearing in tabular form, one coordinate per line. This
is recommended only if few output points are expected.

COMPSET
The user is given the option of defining a set of nucleotides or
amino acids to search for. For example, to search for purine-rich
regions, COMPSET would be set to [A G]. Nucleotides or amino acids
can be added to or subtracted from COMPSET one at a time, or, for
amino acids, in groups, such as ACIDIC, BASIC etc. Subtracting
nucleotides or amino acids that are not members of COMPSET will have
no effect.

IV. Input file

The input for COMP may be any DNA, RNA, or protein sequence file as
described in the general notes.

V. Usage notes

1. When COMP calculates base composition for a given region of a DNA
sequence, N's are ignored. Similarly, for proteins, X's and *'s are
ignored. Thus, the A-composition for a given REGION is

                              A
     A composition =  -------------------
                         A + G + C + T

such that the sum of the base compositions always equals unity. If
N's were included in the calculation, all base compositions would be
underestimates, since, in reality, even the unknown part of the
sequence consists of A, G, C, & T. Although the true base
composition of the unknown part of the sequence may differ from that
of the known part, it is probably best, unless other information is
available, to assume that they are the same.
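
The windowing rule described under REGION and SKIP, together with
the treatment of unknown characters in usage note 1, can be
illustrated with a short sketch. It is written in Python purely for
illustration and is not COMP's own source. The names composition,
NONPOLAR, ignored and HBB are hypothetical, and two details are
assumptions: the value reported for a window containing no known
characters, and the exact rule COMP uses to stop near FINISH.

```python
from typing import Iterable, List, Tuple

NONPOLAR = set("AFILMPVW")  # the NONPOLAR group listed in section II


def composition(seq: str, compset: Iterable[str], region: int, skip: int,
                start: int = 1, finish: int = 0,
                ignored: str = "X*") -> List[Tuple[int, float]]:
    """Return (position, percent) pairs, one per complete window.

    Positions are 1-based. The window reported at position p runs from
    p - region // 2 for region characters, which agrees with the
    REGION = 30, position 260 example (245 through 274).
    """
    if region < 1 or skip < 1:
        raise ValueError("region and skip must be positive")
    finish = finish or len(seq)       # 0 means "to the end of the sequence"
    members = set(compset)
    skipped = set(ignored)            # characters left out of the count
    half = region // 2
    points = []
    left = start                      # 1-based left edge of the window
    while left + region - 1 <= finish:
        window = seq[left - 1:left - 1 + region]
        known = [c for c in window if c not in skipped]
        hits = sum(1 for c in known if c in members)
        # Assumption: report 0.0 when the window holds no known characters.
        percent = 100.0 * hits / len(known) if known else 0.0
        points.append((left + half, percent))
        left += skip
    return points


HBB = ("MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG"
       "AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN"
       "ALAHKYH*")

# Positions are divided by 1000 to mimic the POSN. column of the table.
for pos, pct in composition(HBB, NONPOLAR, region=10, skip=5, finish=50)[:8]:
    print(f"{pos / 1000:.3f}  {pct:5.1f}")
```

The first eight values produced this way agree with the Beta-globin
table in section II. With FINISH=50 the sketch also finds one further
complete window that the table does not list, so COMP's own stopping
rule may differ in that detail. For DNA, passing ignored="N" leaves
N's out of both the numerator and the denominator, as in usage
note 1.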