TUTORIAL: DISPLAYING AND MANIPULATING SEQUENCES



NUMSEQ documentation: $doc/fsap/numseq.asc
BACHREST documentation: $doc/fsap/rest.asc

1. Copy sample sequences to your $HOME/tutorials/sequence directory


{brassica:/home/plants/frist}cd
{brassica:/home/plants/frist}mkdir tutorials                 do this step only if $HOME/tutorials doesn't exist
{brassica:/home/plants/frist}mkdir tutorials/sequence        create directory for this tutorial

{brassica:/home/plants/frist}cd $birch/tutorials/bioLegato/sequence
{brassica:/home/plants/frist}cp *.gen $HOME/tutorials/sequence       copy GenBank files to new directory
{brassica:/home/plants/frist}cd                 return to $HOME directory
{brassica:/home/plants/frist}ls -l              verify that new files and directories are present
drwx------   1 frist    drr     512 Oct 31 10:11 tutorials/
{brassica:/home/plants/frist}cd tutorials
{brassica:/home/plants/frist/tutorials}ls -l
drwx------   3 frist    drr     512 Oct 31 10:11 sequence/
{brassica:/home/plants/frist/tutorials}cd sequence
{brassica:/home/plants/frist/tutorials/sequence}ls -l
-rw-------   1 frist  frist   127286 Oct 31 10:13 AC002329.gen
-rw-------   1 frist  frist     5404 Oct 31 10:13 ARBLKSP.gen
-rw-------   1 frist  frist    10739 Oct 31 10:13 PBI101TD.gen
-rw-------   1 frist  frist     8278 Oct 31 10:13 pBSGUS.gen
-rw-------   1 frist  frist     3674 Oct 31 10:13 PEACAB15.gen

Files with the .gen extension are in GenBank format. Since these are ASCII text files, you can view them in any text editor. Double clicking on a file in the file manager will bring up the file in the default text editor for your bioLegato installation.
 

2. Read PEACAB15.gen in NUMSEQ

NUMSEQ is a program for printing out, translating, and subcloning sequences. It runs at the command line. The main menu handles file input and output. Output can go either to the screen or to a file. In the example, the output file has been called PEACAB15.numseq to indicate that the file contains output from numseq.

Example: Reading and printing PEACAB15.gen with NUMSEQ

3. Parameter menu

The parameter menu controls how the sequence is printed. Type '4' in the main menu to bring up the Parameters menu.
Name: PEACAB15            Topology:     LINEAR       Length:        822 nt
________________________________________________________________________________
            Parameter   Description/Response                 Value
________________________________________________________________________________
             1)START    first nucleotide printed               1
             2)FINISH   last  nucleotide printed             822
             3)NUCCASE   U:(A,G,C,T...), l:(a,g,c,t...)        U
             4)STARTNO  number of starting nucleotide          1
             5)GROUP    number every GROUP nucleotides        10
             6)GPL      number of GROUPs printed per line      7
             7)WHICH    I: input strand  O: opposite strand    I
             8)STRANDS  1: one  strand,  2:both strands        1
             9)KIND     R:RNA            D:DNA                 D
            10)NUMBERS  Number  the sequence    (Y or N)       Y
            11)NUCS     Print nucleotide seq.   (Y or N)       Y
            12)PEPTIDES Print amino acid seq.   (Y or N)       N
            13)FRAMES   1 for this frame, 3 for 3 frames       1
            14)FORM     L:3 letter amino acid, S: 1 letter     L
________________________________________________________________________________
Type number of parameter you wish to change (0 to continue)
By default, NUMSEQ will print out the entire sequence (from START to FINISH) as a single strand (STRANDS) in 7 groups (GPL) of 10 nucleotides (GROUP) per line. To change parameters, type the number of a parameter, and you will be prompted for a new value. When you're ready to view the sequence with the new parameters, type '0' at the prompt and '5' in the main menu to print the sequence to the screen.
 

Examples:


To view both strands:

8) STRANDS: 2
To translate in 3 reading frames:
12) PEPTIDES: y
13) FRAMES: 3
5) GROUP: 15
6) GPL: 4

NUMSEQ breaks up the sequence into groups of nucleotides, numbering each group. For translation, GROUP must be divisible by 3, because translation is done in discrete codons of 3 bases each. GPL is set to 4 so that the output line will fit on a typical 80-character line.
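The relationship between GROUP, GPL and line width can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not NUMSEQ's actual code or output layout; the function names format_groups and codons_per_group are made up for this example.

```python
def codons_per_group(group):
    """Translation works on whole codons, so GROUP must be a multiple of 3."""
    if group % 3 != 0:
        raise ValueError("GROUP must be divisible by 3 for translation")
    return group // 3


def format_groups(seq, group=10, gpl=7, startno=1):
    """Split seq into lines of `gpl` groups of `group` bases each.

    Each line is prefixed with the number of its first base. The layout is
    an assumption made for illustration; NUMSEQ's real output may differ.
    """
    per_line = group * gpl
    lines = []
    for offset in range(0, len(seq), per_line):
        chunk = seq[offset:offset + per_line]
        groups = [chunk[i:i + group] for i in range(0, len(chunk), group)]
        lines.append(f"{startno + offset:>8} " + " ".join(groups))
    return lines
```

With GROUP=15 and GPL=4, each line holds 60 bases (20 codons) plus separators and a number field, which stays under 80 characters.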

To limit printing to only part of the sequence, e.g. bases 200 - 400:
1) START:  200
2) FINISH: 400
To view the opposite strand of the same region:
7) WHICH: o
1) START: 400
2) FINISH: 200

This example illustrates that creating an opposite strand requires two steps. First, we have to specify the strand as 'o' (opposite) rather than 'i' (input strand). This causes the bases to be complemented. However, if all we do is complement the input strand, the opposite strand would be printed 3' to 5', because we would still be starting at 200 and ending at 400. Therefore, START must be set to 400, and FINISH to 200.
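The two steps (complement, then reverse the direction of reading) can be sketched in Python for a linear sequence. This is a minimal illustration of the idea, not NUMSEQ's implementation; the function names are hypothetical, and only the unambiguous bases A, C, G and T are handled.

```python
# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(seq):
    """Complement each base without changing the order (reads 3' to 5')."""
    return seq.translate(COMPLEMENT)


def opposite_strand(seq, start, finish):
    """Return the opposite strand, written 5' to 3'.

    start and finish are 1-based positions on the input strand, with
    start >= finish, mirroring START: 400 / FINISH: 200 in the example.
    """
    if start < finish:
        raise ValueError("for the opposite strand, start must be >= finish")
    region = seq[finish - 1:start]      # input-strand bases finish..start
    return complement(region)[::-1]     # complement, then reverse
```

Calling opposite_strand(seq, 400, 200) on a sequence of at least 400 bases gives the opposite strand of bases 200 - 400, starting with the complement of base 400.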

4. Running NUMSEQ from bioLegato

bioLegato is a program that runs other programs. As you'll see, the program runs in the window in which bioLegato was started. bioLegato generates the keystrokes that you would normally be typing.

To illustrate the point, let's try running NUMSEQ from bioLegato.

Launch bioLegato from the command line.

{brassica:/home/plants/frist/tutorials/sequence}biolegato
 
 
IMPORTANT NOTES:
1. While bioLegato is running, the terminal window cannot be used for other commands. If you need to type commands, open another terminal window.
2. Although bioLegato can read files from any directory, it's best to launch bioLegato from the directory in which you plan to work.

Read in PEACAB15.gen:

File --> Open


Click on the filename, and click 'Open'.


 
 
 
Hint: There are 2 steps to running a program from bioLegato
1. Select sequence(s) - either:
  • click on a single sequence
  • To select a group of adjacent names
    • click on topmost name
    • hold down SHIFT key
    • click on bottommost name
  • To select several sequences that are not adjacent
    • hold down the Ctrl key and click each sequence
  • To select all sequences, choose Edit --> Select All
2. Choose a program from one of the menus

To run numseq, click on PEACAB15 and choose DNA/RNA --> NUMSEQ.

The numseq menu appears, containing menu items for all parameters in the NUMSEQ Parameters menu.

 
 
 
 
HINTS ON bioLegato MENUS:
  • for sliders, you can either choose the number with the slider knob, increment or decrement by clicking on the slider cable, or increment or decrement using the up and down arrow buttons.
  • always remember to select a sequence before going to a menu. bioLegato has to be told which sequence(s) to work with

Output goes to a temporary file, and appears in a text editor window

When you click OK, bioLegato saves the specified sequence in a temporary file, and runs numseq. Numseq reads in the temporary sequence file and prints it out according to the parameters sent to it by bioLegato. Output is stored in a temporary file, which is opened in a text editor.

 
 

Normally, the temporary output file (e.g. bioLegato74903554674472742396.tmp.out) will be deleted when you quit the Text Editor window. To save the file, choose File --> Save As and type in a name for the output. It's a good idea to include a .numseq file extension to indicate that this file is output from numseq.

Because the output is ASCII text, you can do lots of things with it, including importing it into a word processor, pasting it into another window, mailing it, or even using it as input for other sequence programs. In the latter case, the output will probably need to be modified to conform to the desired input file format, e.g. Pearson/Fasta.
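As an illustration of the kind of modification involved, the Python sketch below turns numbered, grouped sequence text into a Pearson/Fasta record. It assumes the text passed in contains only bases, position numbers and whitespace (no title lines and no amino acid translation); the function name to_fasta and the 70-character line width are choices made for this example, not part of NUMSEQ or bioLegato.

```python
def to_fasta(name, numbered_text, width=70):
    """Build a Pearson/Fasta record from numbered, grouped sequence text.

    Everything that is not a letter (digits, spaces, line breaks) is
    dropped, so the input must not contain header text or peptide lines.
    """
    bases = "".join(ch for ch in numbered_text if ch.isalpha())
    lines = [bases[i:i + width] for i in range(0, len(bases), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"
```

For example, to_fasta("PEACAB15", text) would give a record whose first line is '>PEACAB15', followed by the bases in lines of 70.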
 

5. Working with circular DNA molecules


 

Circular DNA molecules require a bit of thought. Since printing is always done 5' --> 3', the direction (clockwise vs. counterclockwise) determines the strand, or vice versa. Consider the Bluescript cloning vector (GenBank X52331). Conceptually, one base must be arbitrarily labeled as 1. In the GenBank entry, 1 is the first base in the file, and 2958 is the last base in the file. In the physical plasmid, of course, base 2958 is adjacent to 1.

In NUMSEQ, the  START, FINISH and WHICH parameters govern which parts of the sequence are displayed.

To view the top strand of the PvuI (CGAT^CG) fragment going clockwise from 2417 to 503:

1) START:  2417
2) FINISH: 503
7) WHICH: Original
Since you're only considering 1 strand at a time, you want to start with 2417, which is the 5' end of the small PvuI fragment, on the original strand.
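Reading a circular molecule across the origin can be sketched as follows. This is a minimal Python illustration that assumes, as described above, that START greater than FINISH on the original strand means 'continue past the last base and wrap around to base 1'; the function name circular_region is hypothetical.

```python
def circular_region(seq, start, finish):
    """Original-strand bases from start to finish on a circular molecule.

    Positions are 1-based and inclusive. When start > finish, the region
    runs from start to the last base, then wraps around to base 1.
    """
    if start <= finish:
        return seq[start - 1:finish]
    return seq[start - 1:] + seq[:finish]
```

For a 2958-base plasmid, the region from 2417 to 503 contains (2958 - 2417 + 1) + 503 = 1045 bases.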

To print the same sequence on the other strand, we can't just change WHICH to 'Opposite'.

1) START:  2417
2) FINISH: 503
7) WHICH: Opposite
Try it and you'll see that what you get is the large PvuI fragment going from 2417 to 503, and that this fragment doesn't even terminate where PvuI would cut. It's best to visualize the fragment ends as illustrated below:

So the correct way to print the opposite strand of the small fragment would be:

1) START: 501
2) FINISH: 2415
7) WHICH: Opposite
Example: Simulated restriction digest of pBluescriptKSm13+ at BamHI
The BamHI site is at 690 on the input strand, meaning that the 5' end of the BamHI site on the original strand is at position 690. Thus:
1) START:  690
2) FINISH:  689
7) WHICH: Original
If we wanted the inverse complement (i.e. counterclockwise), the NUMSEQ parameters would be
1) START:  693
2) FINISH:  694
7) WHICH: Opposite

6. Simulated Cloning

Any recombinant construct can be simulated by pasting together the correct sequences into a single file.
One easy way is to use NUMSEQ to print out the precise fragments required, and paste them, in the correct order, into a file, using any text editor.

EXAMPLE:  Cloning beta-glucuronidase gene (GUS) from pBI101 to pBluescriptKSm13+.

The GUS gene in pBI101 can be conveniently excised using BamHI and SacI (see map).
 
The goal is to make a datafile that correctly represents the recombinant construct that results from cloning the BamHI/SacI fragment containing the GUS gene into the BamHI/SacI-digested BlueScript plasmid. It should look something like this:

How to do it:

a. Read GenBank entries for pBI101 (PBI101TD.gen) and pBluescriptKSm13+ (ARBLKSP.gen) into bioLegato.

b. Use DNA/RNA --> BACHREST to find the locations of the BamHI and SacI sites in PBI101TD. (PBI101TD BACHREST output). According to BACHREST, the 5' ends of the BamHI and SacI sites are at 2528 and 4419, respectively. Therefore, the 3' end of the fragment we want is at 4418, not 4419.

(See 'II. What the output means' in the BACHREST documentation file rest.asc for details on the output.)

c. PBI101TD --> NUMSEQ

START: 2528
FINISH: 4418

Save this output in pBSGUS.dna, and minimize the window to get it out of the way.
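The coordinate arithmetic in steps b and c can be written out as a short Python sketch. It assumes only what is stated above: the fragment starts at the 5' end of the first site and stops one base before the 5' end of the second site. The function name fragment_bounds is made up for this example.

```python
def fragment_bounds(left_site_5prime, right_site_5prime):
    """Return (start, finish, length) for a fragment between two sites.

    The fragment keeps the whole left-hand site and stops one base before
    the 5' end of the right-hand site. Positions are 1-based, inclusive.
    """
    start = left_site_5prime
    finish = right_site_5prime - 1
    length = finish - start + 1
    return start, finish, length
```

For the sites at 2528 and 4419 this gives START 2528, FINISH 4418 and a fragment of 1891 bases, so whatever is pasted after this fragment begins at position 1892 of the new file.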

d. Use DNA/RNA --> BACHREST to find the locations of the BamHI and SacI sites in ARBLKSP. (ARBLKSP BACHREST output). According to BACHREST, the 5' ends of the BamHI and SacI sites are at 690 and 658, respectively. Since the GUS fragment terminates at a SacI site, the SacI site from the plasmid must come next, with the BamHI site at the other end. We need to generate the opposite strand of the plasmid, going from the 5' end of SacI to the 3' end of BamHI.

e. ARBLKSP --> NUMSEQ

START: 653
FINISH: 694
WHICH: Opposite

Copy this output to the end of pBSGUS.dna and save the file.

f. Before going any farther, verify that the construct has been built correctly. One way to do this is to use NUMSEQ to generate double-stranded printouts of both original sequences, and then mark the positions of the restriction sites on these printouts. Print out pBSGUS.dna and compare the sequence at the cloning junctions to the sequences in the originals. Make sure that complete BamHI and SacI sites appear at these junctions.
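The junction check can also be sketched in Python, as a supplement to comparing printouts by eye. The example below searches one strand of a circular sequence for a recognition site, so a site that spans the join between the end of the file and base 1 is still found. The function name find_sites and the short test sequences are invented for illustration; GGATCC and GAGCTC are the BamHI and SacI recognition sequences.

```python
def find_sites(seq, site, circular=True):
    """Return the 1-based positions at which `site` begins in `seq`.

    For a circular molecule, the first len(site) - 1 bases are appended
    to the end so that sites spanning the origin are also reported.
    """
    seq = seq.upper()
    site = site.upper()
    search = seq + seq[:len(site) - 1] if circular else seq
    positions = []
    i = search.find(site)
    while i != -1:
        positions.append(i + 1)
        i = search.find(site, i + 1)
    return positions


# Toy stand-ins for the two pieces pasted into the construct file.
insert = "GGATCC" + "AAAA"      # begins with a complete BamHI site
vector = "GAGCTC" + "TTTT"      # begins with a complete SacI site
construct = insert + vector
```

In this toy construct the BamHI site is found at position 1 and the SacI site at len(insert) + 1, which is the pattern to look for in the real pBSGUS.dna.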

g. Convert pBSGUS into a Pearson/Fasta file to be read by SEQUIN.

In bioLegato, use File --> Import Free Format, and type in 'pBSGUS.dna'. Note that free-format files do not contain sequence names, so bioLegato uses the filename as the sequence name. Since we don't want the .dna extension to be part of the name, remove '.dna' in File --> GetInfo. The name should now be 'pBSGUS'.

Save as a Pearson/Fasta file by choosing File --> Export Foreign Format, and typing 'pBSGUS.wrp'.

h. The last step in creation of a sequence file is annotation. This is critical, because it documents precisely what you have done. The ability to reproduce results is as important in computers as it is in the lab. GenBank format is the richest and most versatile sequence file format, and it is read by most sequence programs. SEQUIN automates the process of creating GenBank format files.

The menus in SEQUIN walk you through a step-by-step process of entering the minimal information needed for a GenBank entry. Without going into every step, the overall series of events is as follows:

1. Start SEQUIN by typing 'sequin' at the command line.
2. Choose "Start new submission"
3. Fill in the information under Submission, Contact, Author and Title (all required)
4. On the page entitled Organism and Sequences, click on "Import Nucleotide FASTA" to import your .wrp file
5. Click on "Specify Topology" and set the topology to 'Circular'.
6. For Organism, type in 'synthetic construct'.
7. This is the minimum information needed to create a GenBank entry  that can be used as a model for a restriction digest. Once the minimal information has been entered, follow the 'Next Page' links until a window pops up with the GenBank entry in it. (For other purposes, you may wish to annotate locations of coding sequences and other features of interest. In a laboratory setting, if you were planning to submit the sequence to GenBank, the most critical things to annotate for a construct such as this are the precise sources of the component sequences, in the FEATURES TABLE, along with a simple explanation in words in the DEFINITION line.)
8. To export your sequence to a GenBank file, choose File --> Export GenBank. Save your sequence as pBSGUS.gen.

A good introduction to SEQUIN, including screen shots, can be found at http://www.ncbi.nlm.nih.gov/Sequin/.

i. Test your GenBank file by reading it into bioLegato, and running BACHREST. The BACHREST output should show that pBSGUS is circular, and that the BamHI and SacI sites are at 1 and 1892, respectively (pBSGUS.bachrest).