numseq.doc                                              update 10/27/90

                               NUMSEQ

I. Function:

Reads in a DNA or RNA sequence and writes one or both strands in either
orientation in a numbered format specified by the user. The amino acid
sequence may also be printed in one or all three reading frames, using
any genetic code.

NUMSEQ is a general-purpose sequence display program. By setting a
variety of display parameters, the user can control which parts of the
sequence are printed, the numbering scheme, the case of the nucleotides
printed, the strand(s) printed, and the width of the lines printed. A
number of translation options control the display of amino acid
sequences. Because it can read in and write parts of different sequences
one after another, NUMSEQ can also perform simulated cloning
experiments.

II. Menus

NUMSEQ begins by asking the user for the name of the file containing the
sequence (input file) and the name of the file to which the sequence
will be written (output file). After these files have been opened, the
user is placed in the main menu. It is best to think of the main menu as
performing those tasks in which NUMSEQ communicates with the operating
system, such as opening and closing files, or writing output. The status
lines show the current filenames (whose form is specific to the
operating system, in this case UNIX). Note that alternative genetic
codes can be read in for use in sequence translation (see Note 2). The
main menu below is taken from a run in which the Marchantia polymorpha
(liverwort) chloroplast genome has been read from a GenBank file, and
the Arg-tRNA-acg transcript will be stored in a file.
_____________________________________________________________________
                         NUMSEQ   MAIN MENU
_____________________________________________________________________
Input file:        /usr2/genbank/mpocpcg.gen
Output file:       mpoarg.trna
Genetic code file:
_____________________________________________________________________
      1) Read in a new sequence
      2) Open a new output file
      3) Read in an alternative genetic code
      4) Change parameters
      5) Write output to screen
      6) Write output to file
_____________________________________________________________________
Type the number of your choice  (0 to quit program)
4

Choosing option 4 in the main menu brings the user into the parameters
menu. The parameters menu can be thought of as dealing with the sequence
itself, and is thus independent of the operating system. (In practice,
however, NUMSEQ can only hold sequences this large on 32-bit computers,
i.e. not on IBM PCs.) The name and topology are read from the sequence
file in the case of GenBank or BIONET files, or are typed in when the
sequence is read, in the case of free-format files.

Name: MPOCPCG            Topology: CIRCULAR        Length: 121024 nt
_____________________________________________________________________
   Parameter   Description/Response                     Value
_____________________________________________________________________
 1)START       first nucleotide printed                 112638
 2)FINISH      last nucleotide printed                  112565
 3)NUCCASE     U:(A,G,C,T...), l:(a,g,c,t...)           U
 4)STARTNO     number of starting nucleotide            1
 5)GROUP       number every GROUP nucleotides           10
 6)GPL         number of GROUPs printed per line        6
 7)WHICH       I: input strand  O: opposite strand      O
 8)STRANDS     1: one strand, 2:both strands            1
 9)KIND        R:RNA  D:DNA                             R
10)NUMBERS     Number the sequence (Y or N)             Y
11)NUCS        Print nucleotide seq. (Y or N)           Y
12)PEPTIDES    Print amino acid seq. (Y or N)           N
13)FRAMES      1 for this frame, 3 for 3 frames         1
14)FORM        L:3 letter amino acid, S: 1 letter       L
_____________________________________________________________________
Type number of parameter you wish to change (0 to continue)
0

In the example shown above, the arginine (acg) tRNA is being written to
the output in six groups of 10 nucleotides per line. Since this gene is
transcribed from the opposite strand, FINISH is less than START, and
WHICH='O'. In order to print the sequence as RNA, KIND has been set to
'R'. When the sequence is written to the output file using option 6, the
user is prompted for a title line to begin the file. In this case, the
title line was begun with a semicolon, so it would be ignored as a
comment line if mpoarg.trna were later read in as a free-format file.
The output is shown below:

; Arg-tRNA-acg
    112629     112619     112609     112599     112589     112579
GGGUUUGUAG CUCAGAGGAU UAGAGCACGU GGCUACGAAC CACGGUGUCG GGGGUUCGAA
    112569
UCCCUCCUUG CCCA

III. Constants

MAXSEQ and MAXLINE are defined in the constant definition part of the
main procedure of NUMSEQ. To change them, it is necessary to change
their values in the Pascal source and recompile.

MAXSEQ
     The maximum number of nucleotides in a DNA or RNA sequence. Set to
     32700 by default. (*IBM-PC*)

MAXLINE
     The maximum length of a variable of the type LINE, used here only
     for the title to be printed on output, and the width of horizontal
     lines written as parts of the menus. Set by default to 80.

IV. Parameters

START
FINISH

START and FINISH are, respectively, the first and last nucleotides of
the input strand to be printed. START may be greater than FINISH, but
neither parameter may be less than 1 or greater than the length of the
sequence. NUMSEQ always works 5'--->3'.
When START>FINISH and the input strand is being printed (WHICH='I'),
NUMSEQ processes nucleotides from position START to the end of the
sequence in linear molecules, and from position START through the end,
continuing through the beginning of the sequence to position FINISH, in
circular molecules. Thus in the case of the circular plasmid pBR322
(4363 nucleotides long), if we ask NUMSEQ to print nucleotides using
START=4000 and FINISH=100, the 464 nucleotide stretch going through the
EcoRI site at nucleotide 1 will be printed, rather than the 3901
nucleotides in the other direction. See the discussion of parameter
WHICH for further details.

NUCCASE

The case of nucleotides printed: U for uppercase (capitals) and l for
lowercase (small letters).

STARTNO

By default, STARTNO=START, which means that the numbering refers to the
absolute position in the input sequence, i.e. a sequence-based
coordinate system. Setting STARTNO to any number other than START
implies a user-specified coordinate system. The position referred to by
START will then be numbered as STARTNO, and numbering will increase from
left to right. For example, if you want the 58th nucleotide in a
sequence to be numbered as '1', and all the preceding nucleotides are to
be printed (START=1), STARTNO should be set to -57. Numbering will
increase through position 57, which will be numbered -1, and position 58
will then be 1. NUMSEQ always skips the zero coordinate. If WHICH='O',
numbering will decrease from left to right, thus ensuring that an
absolute numbering system is maintained regardless of which strand is
being printed.

NOTE: Setting START will reset STARTNO to be equal to START. Therefore,
you should set START first, then assign a value to STARTNO.

GROUP
GPL

Sequences are numbered every GROUP nucleotides, and GPL GROUPs are
printed per line. By default, 7 GROUPs of 10 nucleotides are printed per
line (the example above uses 6 GROUPs of 10). When changing these
parameters, the user should keep the width of the paper in mind.
If the amino acid sequence is also to be printed, GROUP must be evenly
divisible by three; 15 and 30 are the usual choices. If a value is
specified that is not divisible by 3, NUMSEQ will set GROUP to 15 and
GPL to 3. Remember, GROUP refers to nucleotides, not codons. The minimum
value for GROUP is 3, the maximum is 160.

WHICH

WHICH is set by default to 'I', for the input strand. Setting WHICH to
'O' will cause NUMSEQ to process the opposite, i.e. complementary,
strand, reading it in the opposite direction. Remember, NUMSEQ always
works 5'--->3', and START and FINISH always refer to nucleotide
positions in the input strand. Thus to obtain the entire complementary
strand of a linear sequence 40 nucleotides long, for example, START must
be set to 40 (which is the 5' end of the complementary strand) and
FINISH to 1. NUMSEQ will begin by printing the complementary nucleotide
to input position 40, then the complementary nucleotide to position 39,
and so on. If START=1 and FINISH=40, then NUMSEQ will only print the
complementary nucleotide to input position 1. Study the table and make
sure you understand the reason for each result:

WHICH  START <> FINISH   Config.    Result
------------------------------------------------------------------------
  I    START < FINISH    linear     seq. printed from START to FINISH
  I    START < FINISH    circular   seq. printed from START to FINISH
  I    START > FINISH    linear     seq. printed from START to end
  I    START > FINISH    circular   seq. printed from START past the
                                    end, continuing at beginning and
                                    finishing at FINISH
  O    START < FINISH    linear     seq. printed from START to beginning
                                    of sequence
  O    START < FINISH    circular   seq. printed from START past the
                                    beginning, finishing at FINISH
  O    START > FINISH    linear     seq. printed from START to FINISH
  O    START > FINISH    circular   seq. printed from START to FINISH

Although it is permissible to set START and FINISH values which imply
circularity for linear molecules, sequences specified as linear at the
beginning of the program will only be processed until an end is reached.
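
The rules in the table above can be sketched in a few lines of Python.
This is only an illustration of the table, not NUMSEQ's own Pascal code:
the function name printed_positions and its argument names are
hypothetical, and treating START=FINISH as a single nucleotide is an
assumption, since the table only covers START<FINISH and START>FINISH.

```python
def printed_positions(start, finish, length, which="I", circular=False):
    """Return input-strand positions in the order they would be printed.

    Hypothetical helper that follows the WHICH table; not NUMSEQ code.
    """
    if not (1 <= start <= length and 1 <= finish <= length):
        raise ValueError("START and FINISH must lie between 1 and length")
    if which == "I":
        # Input strand: input positions increase while reading 5'--->3'.
        if start <= finish:  # START == FINISH handled here by assumption
            return list(range(start, finish + 1))
        head = list(range(start, length + 1))  # START to the end
        # Linear molecules stop at the end; circular ones wrap to FINISH.
        return head + list(range(1, finish + 1)) if circular else head
    if which == "O":
        # Opposite strand: input positions decrease while reading 5'--->3'.
        if start >= finish:
            return list(range(start, finish - 1, -1))
        head = list(range(start, 0, -1))  # START back to position 1
        # Circular molecules wrap past position 1 and finish at FINISH.
        return head + list(range(length, finish - 1, -1)) if circular else head
    raise ValueError("WHICH must be 'I' or 'O'")


# The pBR322 example: START=4000, FINISH=100 on a 4363 nt circle.
print(len(printed_positions(4000, 100, 4363, "I", circular=True)))
# Complement of a 40 nt linear sequence with START=1, FINISH=40.
print(printed_positions(1, 40, 40, "O", circular=False))
```

With the pBR322 values the first call counts 464 positions, and the
second call returns only position 1, matching the cases described in the
text.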
NUMSEQ, like an enzyme, will simply 'fall off the ends' of linear
molecules.

STRANDS

If STRANDS=1, only one strand is printed: the strand selected by WHICH
(the input strand by default). If STRANDS=2, the complementary strand is
also printed, below it.

KIND

If KIND='D' (the default), the nucleotide sequence will be printed as
DNA. If KIND='R', it will be printed as RNA.

NUMBERS

NUMBERS='Y' causes numbers to be printed corresponding to the nucleotide
sequence. 'N' suppresses the printing of numbers. This is useful when
sending output to a file for use as input by another program.

NUCS

NUCS='Y' causes the nucleotide sequence to be printed. 'N' suppresses
printing of nucleotides.

PEPTIDES

PEPTIDES='Y' causes the amino acid sequence corresponding to the input
strand to be printed. 'N' suppresses printing. NUMBERS, NUCS, and
PEPTIDES may be set independently of one another.

FRAMES

By default, FRAMES=1, which causes the sequence to be translated in one
reading frame, where START is the first position of the first codon, and
FINISH is the last position of the last codon. The nucleotide sequence
is printed with spacing between codons. If FRAMES=3, then translation is
in all three reading frames, beginning at positions START, START+1, and
START+2, and ending at positions FINISH, FINISH+1, and FINISH+2,
respectively. There is no spacing between codons.

FORM

By default, FORM='L', which specifies that amino acids are to be printed
in the long (3 letter) form. FORM='S' selects the short (1 letter) form.
When using the long form, stop codons are translated as blanks, and
untranslatable codons are indicated by ---.

V. Usage Notes

1. NUMSEQ is extremely flexible. The loop structure of the program
allows the user to print different parts of the sequence in different
ways. For example, one can print only the DNA sequence in a 5'
non-coding region, then print the coding region with DNA and amino acids
in the correct reading frame, and then print only DNA again in the 3'
non-coding region.
Prior to printing, option 5 of the main menu may be used to view output
on the screen, without affecting what goes into the output file.

The user can simulate recombinant DNA experiments by changing input
files, as illustrated in the following example. To clone a restriction
fragment derived from a larger sequence into the polylinker region of
pUC18, one might first read in pUC18 and, with output going to a disk
file, print the plasmid sequence from the origin to the cut site into
which the subfragment is to be cloned. Next, one would read in the
sequence to be cloned as a new input file, and print only that part of
the sequence corresponding to the subfragment into the same output file,
retaining the numbering system of the original sequence. The last step
would be to read in pUC18 once again, and print the rest of the plasmid
sequence, from the restriction site back to the origin, again using the
same output file. The resultant file will contain the sequence of the
chimeric plasmid, with plasmid sequences and insert sequences retaining
their original coordinates for easy reference. This sequence could later
be used for further manipulations.

2. Option 3 of the main menu makes it possible for the user to read in
an alternative genetic code from a file. By default, the program is
initialized to use the Universal Genetic Code, which can also be read
from the file UGC. Two other genetic code files are supplied: the
mammalian mitochondrial genetic code (SGC1) and the yeast mitochondrial
genetic code (SGC2). Users can make their own genetic code files by
altering the appropriate amino acid to codon assignments in one of the
supplied files. These files may take almost any form you wish to give
them, with the following limitations: amino acids are represented by the
standard one-letter symbols, and must be in capital letters.
Lowercase letters and all other characters, including blanks, will be
ignored, and may be used for annotation of the file. NUMSEQ will read in
the first 64 capital letters that could be amino acid or stop (*)
symbols and ignore the rest. The order in which they are read determines
the codons to which the amino acids are assigned. The amino acids are
read in from left to right, and top to bottom. Thus, the first amino
acid will be assigned to TTT, the second to TCT, ... and the last to
GGG.

3. For each genetic code used, NUMSEQ creates a set of rules enabling it
to correctly assign amino acids to any ambiguous codons for which this
is possible. Since the input DNA or RNA sequence may contain any of the
IUPAC-IUB ambiguity codes (see READTHIS), even partially ambiguous
sequences or consensus sequences will still be translated wherever
possible. For example, AUY ACN NCN MGG will be correctly translated to
Ile Thr --- Arg. (Programmer's note: The code for printing the ambiguity
rule lists made by NUMSEQ can be found at the end of the procedure
MAKERULES, enclosed within comment delimiters. NUMSEQ will list the
rules to the screen if the comment delimiters are removed.)
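
Notes 2 and 3 can be illustrated together with a short Python sketch. It
is not a transcription of NUMSEQ: the function names are hypothetical,
the full codon order is an assumption (the conventional table layout
with bases ordered T, C, A, G, which agrees with the TTT, TCT, ... GGG
order given in Note 2), and ambiguous codons are resolved by enumerating
every codon they could stand for rather than by the rule lists that
MAKERULES builds.

```python
from itertools import product

BASES = "TCAG"  # assumed base order of the conventional code table
AA_SYMBOLS = set("ACDEFGHIKLMNPQRSTVWY*")  # one-letter symbols plus stop

# Universal code laid out as a code file might be; lowercase is annotation.
UGC_TEXT = """
universal genetic code, one table row per line
F S Y C   F S Y C   L S * *   L S * W
L P H R   L P H R   L P Q R   L P Q R
I T N S   I T N S   I T K R   M T K R
V A D G   V A D G   V A E G   V A E G
"""

# IUPAC-IUB nucleotide ambiguity codes (U is treated as T).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
         "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
         "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT"}

# Three-letter names for FORM='L'; a stop codon prints as blanks.
THREE = dict(zip(
    "ACDEFGHIKLMNPQRSTVWY*",
    ("Ala Cys Asp Glu Phe Gly His Ile Lys Leu Met Asn Pro Gln Arg Ser "
     "Thr Val Trp Tyr").split() + ["   "]))


def read_genetic_code(text):
    """Assign the first 64 amino acid/stop symbols in text to codons."""
    symbols = [ch for ch in text if ch in AA_SYMBOLS][:64]
    if len(symbols) < 64:
        raise ValueError("fewer than 64 amino acid symbols found")
    # Table order: first base outermost, then third base, then second.
    codons = [b1 + b2 + b3 for b1 in BASES for b3 in BASES for b2 in BASES]
    return dict(zip(codons, symbols))


def translate_codon(codon, code):
    """Return the one-letter symbol, or None if the codon is ambiguous."""
    try:
        choices = [IUPAC[base] for base in codon]
    except KeyError:
        return None  # unknown nucleotide symbol
    amino_acids = {code["".join(c)] for c in product(*choices)}
    return amino_acids.pop() if len(amino_acids) == 1 else None


def translate(seq, code):
    """Translate seq in one frame, long form, '---' for untranslatable."""
    seq = seq.upper()
    out = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        aa = translate_codon(seq[i:i + 3], code)
        out.append(THREE[aa] if aa else "---")
    return " ".join(out)


code = read_genetic_code(UGC_TEXT)
print(translate("AUYACNNCNMGG", code))  # the example from Note 3
```

Enumerating the possible codons is the simplest way to show why AUY and
MGG translate while NCN does not: every expansion of AUY gives Ile and
every expansion of MGG gives Arg, but NCN expands to codons for four
different amino acids.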