BIRCH

TUTORIAL: FINDING AND RETRIEVING SEQUENCES FROM NCBI


March 1, 2017

ENTREZ documentation: http://www.ncbi.nlm.nih.gov/books/NBK44864/

ncbiquery.py documentation


Rationale:

Overview:
Goal: To learn how to search for sequences in the NCBI database, using blncbi to sort through the hits.

1. Create a working directory

I can't repeat this often enough. ALWAYS create a new directory for each project.

cd tutorials      go into the tutorials directory
mkdir findseq     create a directory called findseq
cd findseq        go into the findseq directory

Next, open the BIRCH launcher, from which we can run any program in the BIRCH system. One way to launch BIRCH is to type 'birch' at the command line. This will open the BIRCH launcher in your findseq directory.

Alternatively, click on the BIRCH icon on your desktop. The BIRCH launcher will appear.



For this tutorial, we will be using the BIRCH blncbi application to search for sequences at NCBI. Choose Data Mining --> blncbi to launch blncbi.
A chooser will appear on your screen asking for the name of the directory in which you wish to work. Choose 'tutorials/findseq' and click on Open.


 
blncbi will appear on the screen. blncbi has several functions for searching NCBI, and results are displayed in a spreadsheet panel.



2. Finding sequences when you know the Accession number

In many cases, you already know the Accession number of a sequence, typically because it is listed in the publication.
As an example, we'll search for the plasmid vector pUC19, whose accession number is M77789.

Choose Database --> Nucleotide, which will open the query builder. The query builder lets you create query statements which connect keywords with operators such as AND, OR and NOT, grouped with parentheses. You can also choose specific databases to search, set parameters that limit things such as sequence length or the number of hits to retrieve, and choose where to send the output.

Set query term 1 to search the Accession field of GenBank entries for the Accession number M77789. Click on Run: Output to new window.
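For readers who want to see what such a query looks like outside the GUI, here is a minimal Python sketch that builds the equivalent search URL for the NCBI E-utilities ESearch service. It only constructs the URL and does not contact NCBI. The helper name esearch_url is made up for illustration, and the use of the [ACCN] field tag for the Accession field is an assumption about how the query builder's choice maps onto Entrez syntax; this is not necessarily how blncbi works internally.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, db="nuccore", retmax=20):
    """Return an ESearch URL for `term` (hypothetical helper, no network access)."""
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


# Search the Accession field for pUC19 (assumed field tag: [ACCN]).
url = esearch_url("M77789[ACCN]")
print(url)
```

Pasting the printed URL into a web browser should return a short XML document listing the matching record IDs.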


blncbi presents output in a spreadsheet, which is particularly useful for viewing large numbers of hits.

Things to note:

Since there is only one hit, we want to retrieve this one. Click on M77789 in column A, and then choose Database --> Seqfetch. By default, seqfetch will retrieve results to bldna, a BioLegato application for working with DNA sequences. This is usually the best choice, since from bldna you can always save your sequences, or open them for viewing in programs such as a text editor or the Artemis Genome Viewer. Click on Run to retrieve your sequence.


The sequence is retrieved from NCBI to a bldna window.
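Retrieval by Accession number can also be sketched as a URL for the NCBI E-utilities EFetch service. The Python fragment below only builds the URL; the helper name efetch_url is hypothetical, and nothing here is claimed to be what seqfetch does internally.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities EFetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, rettype="gb"):
    """Return an EFetch URL for one nucleotide record (hypothetical helper).

    rettype="gb" asks for a GenBank flat file, rettype="fasta" for FASTA.
    """
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


print(efetch_url("M77789"))           # GenBank entry for pUC19
print(efetch_url("M77789", "fasta"))  # same record as FASTA
```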


3. Viewing and saving sequences

BioLegato applications such as blncbi and bldna are really programs that launch other programs. Thus, they serve to organize large sets of programs into a coherent user interface. Once you have retrieved sequences, there is a large array of tasks that can be done.

In these tutorials, we'll see that all tasks run through BioLegato fall into four basic steps:
  1. Select a sequence by clicking once on the sequence name. If you wish to select several sequences, hold down the CTRL key, and click on the name of each sequence.
  2. Choose a program from the menus.
  3. Set the parameters and click on Run to start the program.
  4. Output appears in one or more windows.
If you get empty output or no output at all, it's probably because you forgot to select a sequence.

Saving sequences

Since we already know that this is the sequence we want, it's best to save it now, before proceeding further.
Select the sequence by clicking on its name, SYNPUC19V. Choose File --> Save SELECTION AS. To give the file a name that is more descriptive than the Accession number, let's call it pUC19.gen. To preserve all sequence annotation, set the file format to GenBank. Click on Run to save.



In the file manager (Finder on Mac) you should now see this sequence in your findseq directory.

(Files whose names begin with 'bioxxxx' are temporary files created by BioLegato. These should automatically be deleted when BioLegato terminates.)


Viewing sequences

To view your sequence in a text editor, you could either click on pUC19.gen in the file manager, or from bldna, File --> View Sequences. The pull-down menu lets you choose which sequence format you wish to view. For example, if you wanted to paste the sequence into a web program that requires sequences in FASTA format, you could set the format to FASTA. For now, we'll view the complete GenBank entry, which is the default. Click "Run" to view.
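To make the FASTA format concrete, the following Python sketch writes a sequence as a FASTA record: one header line beginning with '>', followed by the sequence wrapped to a fixed width. The function name to_fasta and the default width of 70 are choices made for this illustration, not settings taken from bldna.

```python
def to_fasta(name, seq, description="", width=70):
    """Format one sequence as a FASTA record (illustrative helper)."""
    header = ">" + name
    if description:
        header += " " + description
    # Remove any whitespace, then wrap the sequence to `width` characters per line.
    seq = "".join(seq.split())
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header] + lines) + "\n"


# Short made-up sequence, used only to show the layout.
print(to_fasta("example", "ACGTACGTACGT", description="demo record", width=5))
```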


The GenBank file will pop up in the default text editor for your BIRCH installation, in this case, gedit.


It is often useful to keep the sequence view open on the screen for reference while doing other tasks. For example, if you scroll down to the FEATURES table, you can see the annotations for different parts of this vector.



A more elaborate program for viewing sequences and their features is the Artemis Genome Browser. In bldna, choose Database --> artemis.

Artemis is a sophisticated genome browser and annotator, used in many genome projects. The wide array of functions and capabilities of Artemis is beyond the scope of this tutorial. For an in-depth introduction, see the BIRCH tutorial Genome Visualization with ARTEMIS.





Working with sequences

Although bldna can perform a large array of tasks on DNA and RNA sequences, we will illustrate only two of them here.
First, let's try printing a sequence along with its translation in three reading frames using NUMSEQ. Choose DNA/RNA --> NUMSEQ. A menu will pop up allowing us to set different parameters for printing the sequence. At this point, don't change any parameters. Just click on Run.

By default, NUMSEQ will print sequences in 7 groups of 10 nucleotides per line.
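The grouping itself is easy to picture in code. This Python sketch splits a sequence into groups of 10 nucleotides and prints 7 groups per line. It reproduces only the grouping; the coordinates and other annotation that NUMSEQ prints are not modelled, and the function name format_groups is made up for this example.

```python
def format_groups(seq, group=10, per_line=7):
    """Split `seq` into space-separated groups, `per_line` groups on each line."""
    groups = [seq[i:i + group] for i in range(0, len(seq), group)]
    lines = [
        " ".join(groups[i:i + per_line])
        for i in range(0, len(groups), per_line)
    ]
    return "\n".join(lines)


# Made-up 95-nucleotide sequence: one full line of 70, then a partial line of 25.
print(format_groups("ACGTA" * 19))
```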


Say that we wanted to see both strands of the sequence, along with a translation of the forward strand. Re-run NUMSEQ after setting the following parameters:

write in GROUPs of 15
Both strands
TRANSLATION: Yes
Reading frames: Three

Click on Run to proceed.


The output appears as shown below. Note that translation of the top strand is shown in each of 3 reading frames, using the 1-letter amino acid code. Stop codons are seen as asterisks (*).
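The idea behind a three-frame translation can be sketched in a few lines of Python. The code below uses the standard genetic code, translates the top strand in frames 1 to 3, and shows how the bottom strand is obtained as the reverse complement. It is an illustration of the concept only, not the NUMSEQ implementation; codons containing characters other than A, C, G and T are shown as 'X', which is a choice made for this sketch.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(seq, frame=0):
    """Translate `seq` starting at offset `frame` (0, 1 or 2); '*' marks a stop."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def reverse_complement(seq):
    """Return the bottom strand, written 5' to 3'."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


dna = "ATGGCCTAA"  # made-up example: Met, Ala, stop
for frame in range(3):
    print("frame", frame + 1, translate(dna, frame))
print("bottom strand", reverse_complement(dna))
```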



bldna can generate a report of restriction sites found in a sequence using BACHREST.

Choose DNA/RNA --> BACHREST.

The BACHREST menu lets you customize your search based on whether or not an enzyme is commercially available, the type of ends it generates, whether the recognition sequence is symmetric or asymmetric, the length of the recognition sequence, or the number of fragments generated.

To see the output with the default settings, click on Run.



Things to note:

  • Sequence information and search parameters are shown at the top of the report.
  • Enzyme - name of the enzyme
  • Recog. Seq. - 1-strand formula for the restriction site, with the cut site indicated by a caret (^), or for asymmetric sites, the position before which the enzyme cuts on each strand.
  • # of sites - the number of sites
  • Sites - the 5' coordinate of the top strand in a site
  • columns 5 - 7:
    • Frags - size of fragment
    • Begin - 5' coordinate of top strand of fragment
    • End - 3' coordinate of top strand of fragment.
Note: Highlighting colors are an artifact of the gedit editor, and have no specific meaning in this context.
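The relationship between the Sites, Frags, Begin and End columns can be illustrated with a small Python sketch. It makes several simplifying assumptions that BACHREST itself does not necessarily share: the molecule is treated as linear (pUC19 is circular), only the top strand is searched (sufficient for a symmetric site such as EcoRI's G^AATTC), and the function names are invented for this example.

```python
def find_sites(seq, recognition):
    """Return the 1-based 5' coordinate of every occurrence of `recognition`."""
    seq, recognition = seq.upper(), recognition.upper()
    sites = []
    start = seq.find(recognition)
    while start != -1:
        sites.append(start + 1)
        start = seq.find(recognition, start + 1)  # also finds overlapping sites
    return sites


def linear_fragments(seq_length, sites, cut_offset):
    """Return (size, begin, end) for each fragment of a linear molecule.

    cut_offset is the number of recognition-site bases to the left of the
    cut, e.g. 1 for G^AATTC.
    """
    cuts = [site + cut_offset - 1 for site in sites]  # last base before each cut
    bounds = [0] + cuts + [seq_length]
    return [(end - begin, begin + 1, end) for begin, end in zip(bounds, bounds[1:])]


# Made-up 20-nucleotide sequence containing two EcoRI sites.
seq = "AAGAATTCAAAAGAATTCAA"
sites = find_sites(seq, "GAATTC")
print("sites:", sites)
for size, begin, end in linear_fragments(len(seq), sites, cut_offset=1):
    print(size, begin, end)
```

For a circular molecule, the first and last fragments in this list would be joined into a single fragment.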


4. Finding sequences using keywords

It is often the case that you don't have an Accession number for a sequence, but do have limited information regarding the sequence. For example, there is a family of plasmid vectors going by the name of Bluescript. There are four Bluescript vectors. The pBluescript SK (+/-) vectors have the multiple cloning site (MCS) in the coding orientation of the lacZ gene, going from SacI to KpnI, 5' to 3'. The pBluescript KS (+/-) vectors have the MCS in the opposite orientation, going from KpnI to SacI, 5' to 3' relative to the direction of lacZ transcription. For each set there are two vectors, designated by (+) if the f1 origin of replication is in the opposite orientation relative to lacZ, or (-) if the f1 origin is in the same orientation relative to lacZ. Consequently, these vectors have the designations pBluescript SK (+), pBluescript SK (-), pBluescript KS (+) and pBluescript KS (-). An information sheet commonly distributed with the Bluescript vectors is found in the file bluescript.pdf.

Finding these vectors is actually more of a challenge than one might first imagine. This section illustrates ways of narrowing the search to a manageable number of hits, from which the desired sequences can be identified for retrieval.

First, make sure you have a fresh blncbi window. If you have blncbi open, you can create a new window with File --> New Window. Otherwise, launch blncbi from the BIRCH launcher using Data Mining --> blncbi.




Open the query builder using Database --> Nucleotide. Let's do the simplest search first. For query term 1, the default is to search ALL FIELDS. Set the search term to 'bluescript'. Click on Run: Output to a new window to begin the search.


There are 421 hits.



As a quick way to see if the Bluescript vectors are in the list, you could try sorting the output. Choose Edit --> BLSORT and set the 1st sort key to column 4 (shown as D in BioLegato).


If you remembered that Bluescript was just a bit under 3 kb in length, you could try scrolling through the sorted output to the correct size range, as shown in column D. However, we don't see anything that looks like Bluescript in this list.
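Sorting on a column and then looking at a size range amounts to the following, shown as a Python sketch. The rows are placeholders invented for this example, not real hits, and the column layout (Accession in column A, length in column D) is assumed from the description above.

```python
# Hypothetical rows in the order (A, B, C, D); D holds the sequence length.
hits = [
    ("ACC_A", "NAME_A", "placeholder description A", 5120),
    ("ACC_B", "NAME_B", "placeholder description B", 2961),
    ("ACC_C", "NAME_C", "placeholder description C", 740),
]

LENGTH = 3  # zero-based index of column D


def sort_by_column(rows, column):
    """Return the rows sorted in ascending order on one column."""
    return sorted(rows, key=lambda row: row[column])


def in_size_range(rows, column, low, high):
    """Keep only rows whose value in `column` lies between low and high."""
    return [row for row in rows if low <= row[column] <= high]


for row in sort_by_column(hits, LENGTH):
    print(row)

# Candidates "just a bit under 3 kb"; the bounds are arbitrary for this sketch.
print(in_size_range(hits, LENGTH, 2900, 3000))
```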

The Bluescript vectors themselves don't appear to be in the list. Rather than trying different permutations of capitalizations or hyphenations, let's try a different tactic.


Since the information sheet calls this vector a phagemid, set query term 1 to 'phagemid', and narrow the search by setting query term 2 to AND Division: SYN, where 'SYN' limits the search to only those sequences in the GenBank Synthetic division. This time there are only 56 hits, which is a short enough list to scan by eye. Scrolling down, we see the four Bluescript vectors (which are distinct from the Bluescript II vectors). Select all four by holding down the CTRL key and clicking on each Accession number.
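The way two query terms are joined by a relation can be sketched in Python as simple string building. The helper names are invented for this example, and the field tag spellings, in particular [Division], are assumptions about how the query builder's field names map onto Entrez syntax; check the ENTREZ documentation before relying on them.

```python
ALLOWED = {"AND", "OR", "NOT"}


def field_term(text, field):
    """Attach a field tag to a search word, e.g. phagemid[All Fields]."""
    return f"{text}[{field}]"


def combine(operator, *terms):
    """Join terms with AND, OR or NOT and wrap the result in parentheses."""
    if operator not in ALLOWED:
        raise ValueError("operator must be AND, OR or NOT")
    return "(" + f" {operator} ".join(terms) + ")"


query = combine(
    "AND",
    field_term("phagemid", "All Fields"),
    field_term("SYN", "Division"),  # assumed tag for the GenBank division field
)
print(query)
```

Because combine returns a parenthesized string, its result can itself be passed as a term to another combine call to build nested queries.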


Retrieve these entries using Database --> SEQFETCH.

 



To save all sequences to a single file, choose File --> Save ALL as.

Set the File Name to bluescript.gen, and make sure Files of Type is set to GenBank. Save the file.