AND RETRIEVING SEQUENCES FROM NCBI
Feb. 7, 2019
ENTREZ documentation: http://www.ncbi.nlm.nih.gov/books/NBK44864/
To learn how to search for sequences in the NCBI
database, using blncbi to sort through the hits.
- Searching using an Accession number
- Building search terms for complex queries
- Saving, viewing and working with sequences
1. Create a working directory
I can't repeat this often
enough. ALWAYS create a new directory for each project.
cd tutorials      go into tutorials directory
mkdir findseq     create a directory called findseq
cd findseq        go into the findseq directory
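If you prefer to script your project setup, the same step can be sketched in Python. This is only an illustration and not part of BIRCH: the helper name make_project_dir is hypothetical, and the sketch assumes the tutorials directory sits under whatever base directory you pass in.

```python
from pathlib import Path
import tempfile


def make_project_dir(base: Path, project: str) -> Path:
    """Create base/tutorials/<project> if needed and return its path."""
    project_dir = base / "tutorials" / project
    # parents=True creates 'tutorials' too; exist_ok=True makes reruns harmless
    project_dir.mkdir(parents=True, exist_ok=True)
    return project_dir


# Demonstrate in a throwaway directory; for real work, base would be Path.home()
with tempfile.TemporaryDirectory() as scratch:
    workdir = make_project_dir(Path(scratch), "findseq")
    created = workdir.is_dir()
    print(workdir.name, created)
```

Because exist_ok=True is set, calling the helper again for an existing project does nothing, which suits the one-directory-per-project habit.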
Next, open the BIRCH launcher, from which we can run any program
in the BIRCH system. One way to launch BIRCH is to type 'birch' at
the command line. Alternatively, click on the BIRCH icon on your
desktop. The BIRCH launcher will appear.
For this tutorial, we will be using the BIRCH blncbi application
to search for sequences at NCBI. Choose Data Mining
--> blncbi to launch blncbi.
A file chooser will appear on your screen asking for the name of
the directory in which you wish to work. Choose 'tutorials/findseq'
and click on Open.
blncbi will appear on the screen. blncbi has several functions
for searching NCBI, and results are displayed in a spreadsheet.
There are four tabbed pages in the Nucleotide Query menu.
- Allows you to build a query of up to 8 search terms. For
each term, choose the field to limit the search, and type
in the value. If you choose the Feature Key field, also
choose the Feature Key to search. For example, choosing
the 'intron' feature key would limit retrievals to only
those sequences for which introns are annotated.
Pull-down widgets let you group terms with parentheses,
and join terms using AND, NOT or OR.
At the bottom of the window, you can also limit searches
to a particular molecule type, e.g. tRNA, rRNA, ncRNA etc.
What you need to know about databases
All information in a database is organized into fields.
Each field holds a value. For example, a database of people
might have three fields in each record: FirstName,
LastName, and Phone. Each record has its own value for
each field. (Think of a field name as a variable from
algebra.) Any database search program will allow you to
search for records which have specific values for one or
more fields. All entries in which the field(s) match the
specified value are returned.
For searching the NCBI databases, the Query tab lets you
specify values for one or more fields, and then retrieves
entries which match those values.
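The idea of matching records on field values can be sketched in a few lines of Python. The records below are invented purely for illustration, and the search function is a hypothetical helper, not anything blncbi or NCBI provides.

```python
# Hypothetical records, invented purely for illustration
PEOPLE = [
    {"FirstName": "Ada", "LastName": "Lovelace", "Phone": "555-0101"},
    {"FirstName": "Alan", "LastName": "Turing", "Phone": "555-0102"},
    {"FirstName": "Ada", "LastName": "Yonath", "Phone": "555-0103"},
]


def search(records, **criteria):
    """Return every record whose fields equal all of the given values."""
    return [
        record
        for record in records
        if all(record.get(field) == value for field, value in criteria.items())
    ]


one_field = search(PEOPLE, FirstName="Ada")
two_fields = search(PEOPLE, FirstName="Ada", LastName="Yonath")
print(len(one_field), len(two_fields))
```

Giving a value for a second field narrows the result, which is what happens when you add a query term joined with AND.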
- Sets the database to search.
- Allows you to retrieve only hits in a particular size
range, and to increase or decrease the number of hits
retrieved.
Where you know the approximate size of the sequence you
want, setting a narrow Min.-Max. range can limit a search
that would otherwise have hundreds of thousands of hits to
a small enough number that you can view them in the
spreadsheet. In such cases, you might increase the maximum hits
retrieved to, say, 5000. If you sort the hits by size or
even title, you can often quickly scan by eye to find the
sequences you really want.
- Allows you to specify how the output is saved or
formatted. By default, the format is Summary, which
returns output as a table. Although you could change the
format to GenBank, the Summary is usually best, since you
can always retrieve GenBank entries from the Summary
output.
By default, Summary output goes to a new blncbi window,
which lets you screen the output, and retrieve selected
sequences. If you wish the output to go directly to a file,
choose Output file, and make sure to type in an output
filename, e.g. results.tsv. The .tsv file extension
indicates that the Summary output is in TAB-separated
value format, which can be read directly by blncbi, or by any
program that reads TAB-separated files.
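As an illustration of how easily TAB-separated output can be read elsewhere, here is a minimal Python sketch. It assumes a simplified file with a single header row using the column names UID, Title, BioMol and Slen; real blncbi output may carry extra header lines, and the two sample rows are invented.

```python
import csv
import io

# Hypothetical Summary rows; real blncbi output may carry extra header lines
SAMPLE_TSV = (
    "UID\tTitle\tBioMol\tSlen\n"
    "EX000001\tExample cloning vector A\tgenomic DNA\t2961\n"
    "EX000002\tExample cDNA clone B\tmRNA\t1480\n"
)


def read_summary(handle):
    """Parse TAB-separated Summary text into a list of dicts, one per hit."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    for row in rows:
        row["Slen"] = int(row["Slen"])  # sequence length as a number
    return rows


hits = read_summary(io.StringIO(SAMPLE_TSV))
for hit in hits:
    print(hit["UID"], hit["Slen"], hit["Title"])
```

To read a saved file instead of the in-memory sample, pass open("results.tsv", newline="") to read_summary.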
Finding sequences when you know the Accession number
In many cases, you already
know the Accession number of a sequence, typically because it is
listed in a publication.
As an example, we'll search for the plasmid
vector pUC19, whose accession number is M77789.
Choose Database --> Nucleotide, which will open the
query builder. The query builder lets you create query
statements which connect keywords with relations such as
AND, OR, NOT and parentheses. You can also choose specific
databases to search, set parameters limiting things such
as sequence length or number of hits to retrieve, and
where to send the output.
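To make the structure of such a query statement concrete, here is a small Python sketch that joins field-limited terms with a Boolean operator and parentheses. The Term class and join_terms function are hypothetical names for illustration only; the field tags ACCN and SLEN are the ones blncbi shows in its output.

```python
from dataclasses import dataclass


@dataclass
class Term:
    """One search term: a value limited to a field, e.g. M77789[ACCN]."""
    value: str
    field: str  # Entrez field tag; ACCN and SLEN appear in blncbi output

    def render(self) -> str:
        return f"{self.value}[{self.field}]"


def join_terms(operator: str, *parts: str) -> str:
    """Join already-rendered parts with AND, OR or NOT, inside parentheses."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return "(" + f" {operator} ".join(parts) + ")"


accession = Term("M77789", "ACCN").render()
length = Term("1:500000", "SLEN").render()
query = join_terms("AND", accession, length)
print(query)
```

Because join_terms returns a parenthesized string, its result can be passed back in as one part of a larger expression, which mirrors grouping terms with parentheses in the query builder.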
Set query term 1 to search the Accession field of GenBank
entries for the Accession number M77789. Click on Run:
Output to new window.
Blncbi presents output in a spreadsheet,
which is particularly useful for viewing large numbers of hits.
Things to note:
- line 3 - query term created by blncbi which tells the program
to search for a sequence with accession [ACCN] number M77789 and
sequence length [SLEN] between 1 and 500000. (The SLEN term can
be set in the Limits tab.)
- line 6 - the number of hits found
- line 8 - UID (Accession number), Title (corresponds to GenBank
Description line), BioMol (type of molecule) and Slen (sequence
length)
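The query term on line 3 is ordinary Entrez syntax, so the same search can also be expressed as a request to NCBI's public E-utilities. The sketch below only builds the URLs and makes no network call. How blncbi itself contacts NCBI is not described here, so treat this as an assumed equivalent: the database name nuccore, the retmax value and the rettype/retmode settings are my choices, not settings taken from blncbi.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term: str, db: str = "nuccore", retmax: int = 20) -> str:
    """URL that asks ESearch for the IDs of records matching a query term."""
    return f"{EUTILS}/esearch.fcgi?" + urlencode(
        {"db": db, "term": term, "retmax": retmax}
    )


def efetch_url(accession: str, db: str = "nuccore") -> str:
    """URL that asks EFetch for one record as a GenBank flat file."""
    return f"{EUTILS}/efetch.fcgi?" + urlencode(
        {"db": db, "id": accession, "rettype": "gb", "retmode": "text"}
    )


term = "M77789[ACCN] AND 1:500000[SLEN]"
search_url = esearch_url(term)
fetch_url = efetch_url("M77789")
print(search_url)
print(fetch_url)
```

urlencode takes care of escaping the square brackets, colon and spaces in the term, which is easy to get wrong when assembling such URLs by hand.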
Since there is only one hit, we want to retrieve this one.
Click on M77789 in column A, and then choose Database
--> Seqfetch. By default, seqfetch will retrieve
results to bldna, a BioLegato application for working with
DNA sequences. This is usually the best choice, since
you can always save your sequences from bldna, or open
them for viewing in programs such as a text editor or the
Artemis Genome Viewer. Click on Run to retrieve your
sequence. The sequence is retrieved from NCBI to a bldna window.
Viewing and saving sequences
BioLegato applications such as blncbi and bldna are really
programs that launch other programs. Thus, they serve to organize
large sets of programs into a coherent user interface. Once you
have retrieved sequences, there is a large array of tasks that
can be done.
- Saving sequences
- Viewing sequences
- Working with sequences
In these tutorials, we'll see that all
tasks run through BioLegato fall into four basic steps:
- Select a sequence by clicking once on the sequence name. If you
wish to select several sequences, hold down the CTRL
key, and click on the name of each sequence.
- Choose a program from the menus.
- Set the parameters and click on Run to start the program.
- Output appears in one or more windows.
If you get empty output or no output
at all, it's probably because you forgot to
select a sequence.
Since we already know that
this is the sequence we want, it's best to save it now, before
doing anything else. Select the sequence by clicking on its
name, SYNPUC19V. Choose File
--> Save SELECTION AS. To give the file a name
that is more descriptive than the Accession number, let's
call it pUC19.gen. To preserve all sequence annotation,
set the file format to GenBank. Click on Run.
In the file manager (Finder on Mac) you should now see this
sequence in your findseq directory.
(Files whose names begin with 'bioxxxx' are temporary
files created by BioLegato. These should automatically
be deleted when BioLegato terminates.)
To view your sequence in a text editor, you could either
click on pUC19.gen in the file manager, or from bldna, choose File
--> View Sequences. The pull-down menu lets you
choose which sequence format you wish to view. For
example, if you wanted to paste the sequence into a web
program that requires sequences in FASTA format, you could
set the format to FASTA. For now, we'll view the complete
GenBank entry, which is the default. Click on Run. The
GenBank file will pop up in the default text editor for
your BIRCH installation, in this case, gedit.
It is often useful to keep the sequence view open on the
screen for reference while doing other tasks. For example,
if you scroll down to the FEATURES table, you can see the
annotations for different parts of this vector.
A more elaborate program for viewing sequences and their
features is the Artemis Genome Browser. In bldna, choose Database
Artemis is a
sophisticated genome browser and annotator, used in many
genome projects. The wide array of functions and
capabilities of Artemis is beyond the scope of this
tutorial. However, an introduction to Artemis is found in
the BIRCH tutorial. See Genome
Visualization with ARTEMIS for an in-depth treatment.
Working with sequences
Although bldna can perform
a large array of tasks on DNA and RNA sequences, we will
illustrate only two of them here.
First, let's try printing a sequence along with its translation
in three reading frames using NUMSEQ. Choose DNA/RNA
--> NUMSEQ. A menu will pop up allowing us to set
different parameters for printing the sequence. At this
point, don't change any parameters. Just click on Run.
By default, NUMSEQ will print sequences in 7 groups of 10
nucleotides per line.
Suppose that we wanted to see both strands of the sequence, along
with a translation of the forward strand. Re-run NUMSEQ
after setting the following parameters:
- write in GROUPs of 15
- Reading frames: Three
Click on Run to proceed.
The output appears as shown below. Note that translation of the
top strand is shown in each of 3 reading frames, using the 1-letter
amino acid code. Stop codons are seen as asterisks (*).
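To show what translation in three reading frames means, here is a minimal Python sketch. It is not NUMSEQ's code: it assumes the standard genetic code, reads only the top strand, and uses a short made-up test sequence. The function names are hypothetical.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order map onto AMINO_ACIDS
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str, frame: int) -> str:
    """Translate the top strand starting at offset frame (0, 1 or 2)."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    # 'X' marks codons containing anything other than A, C, G or T
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frames(dna: str) -> list[str]:
    """Translations of the top strand in reading frames 1, 2 and 3."""
    return [translate(dna, frame) for frame in range(3)]


frames = three_frames("ATGGCCTAA")
for number, protein in enumerate(frames, start=1):
    print(f"frame {number}: {protein}")
```

Each frame simply starts one base later than the previous one, so a stop codon (*) in one frame says nothing about the other two.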
Bldna can generate a report of restriction sites found in a
sequence using BACHREST.
Choose DNA/RNA --> BACHREST.
The BACHREST menu lets you customize your search based on
whether or not an enzyme is commercially available, the
type of ends it generates, whether the recognition
sequence is symmetric or asymmetric, the length of
the recognition sequence, or the number of fragments
generated.
To see the output with the default settings, click on Run.
The colors are an artifact of the gedit editor, and have no
specific meaning in this context.
Sequence information and search parameters are shown at the top
of the report.
- Enzyme - name of the enzyme
- Recog. Seq. - 1-strand formula for the restriction site, with
the cut site indicated by a caret (^), or for asymmetric
sites, the position before which the enzyme cuts
- # of sites - the number of sites
- Sites - the 5' coordinate of the top strand in a site
- columns 5 - 7
- Frags - size of fragment
- Begin - 5' coordinate of top strand of fragment
- End - 3' coordinate of top strand of fragment.
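The relationship between the Sites column and the Frags, Begin and End columns can be sketched in Python. This is not BACHREST's algorithm. It makes three simplifying assumptions: the molecule is treated as linear (a circular plasmid such as pUC19 would give one fragment fewer), only the top strand is scanned, and the enzyme has a symmetric site. The function names and the 18-base test sequence are invented.

```python
def find_sites(sequence: str, recognition: str) -> list[int]:
    """1-based 5' coordinates of every top-strand match, overlaps included."""
    sequence, recognition = sequence.upper(), recognition.upper()
    sites = []
    start = sequence.find(recognition)
    while start != -1:
        sites.append(start + 1)
        start = sequence.find(recognition, start + 1)
    return sites


def fragments(length: int, sites: list[int], cut_offset: int):
    """(size, begin, end) for each fragment of a LINEAR molecule.

    cut_offset is the number of bases of the site left of the cut,
    e.g. 1 for G^AATTC.
    """
    # Coordinate of the last base before each cut, on the top strand
    cuts = [site + cut_offset - 1 for site in sites]
    begins = [1] + [cut + 1 for cut in cuts]
    ends = cuts + [length]
    return [(end - begin + 1, begin, end) for begin, end in zip(begins, ends)]


# Hypothetical 18-base sequence with two G^AATTC (EcoRI) sites
demo = "AAGAATTCTTGAATTCAA"
sites = find_sites(demo, "GAATTC")
frags = fragments(len(demo), sites, cut_offset=1)
print(sites)
print(frags)
```

The fragment sizes always add up to the sequence length, which is a quick sanity check on any restriction report.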
Finding sequences using keywords
It is often the case that
you don't have an Accession number for a sequence, but do have
limited information regarding the sequence. For example, there is
a family of plasmid vectors going by the name of Bluescript. There
are four Bluescript vectors. The pBluescript SK (+/-)
vectors have the multiple cloning site (MCS) in the coding
orientation of the lacZ gene, going from SacI to KpnI, 5' to
3'. The pBluescript KS (+/-) vectors have the MCS in the
opposite orientation, going from KpnI to SacI, 5' to 3' relative
to the direction of lacZ transcription. For each set there are two
vectors, designated by (+) if the f1 origin of replication is in
the opposite direction relative to lacZ, or (-) if the f1 origin
is in the same orientation relative to lacZ. Consequently, these
vectors have the designations pBluescript SK (+), pBluescript SK
(-), pBluescript KS (+) and pBluescript KS (-). An information
sheet commonly distributed with the Bluescript vectors is found in
the file bluescript.pdf.
Finding these vectors is actually more of a challenge than one
might first imagine. This section illustrates ways of narrowing
the search to a manageable number of hits that can be identified
by inspection.
First, make sure you have a fresh blncbi window. If you have
blncbi open, you can create a new window with File
--> New Window. Otherwise, launch blncbi from the
BIRCH launcher using Data Mining --> blncbi.
Open the query builder using Database -->
Nucleotide. Let's do the simplest search first. For
query term 1, the default is to search ALL FIELDS. Set the
search term to 'bluescript'. Click on Run:
Output to a new window to begin the search.
There are 345,462 hits. In retrospect, this number shouldn't be
too surprising, because the Bluescript vectors and their
derivatives have been widely used in cloning for decades.
Let's try limiting the search by changing the search field
to Title, so that only those entries
in which 'bluescript' appears in the title will be
retrieved.
The search indicates that
there are 35,102 hits. This is an improvement by a factor of 10,
but still too many hits to examine by inspection. Most of those
hits are probably from clones that were made using a Bluescript
vector.
Bluescript-related vectors are probably a very small
percentage of those hits. That means that we can eliminate
most clones by limiting the search to the GenBank
Synthetic division, which only has synthetic sequences. We
join the two search terms by choosing 'AND', and rerun the
search.
That was disappointing.
This example illustrates that searches of the NCBI
databases can be counterintuitive. I have no idea why the
actual vectors themselves weren't found, because as we'll
see later, they are in fact in the Synthetic division.
(Repeating the search using variants such as "Bluescript"
and "SYN" gives the same result).
Looking again at bluescript.pdf, we see that the term
'phagemid' is prominent in the title. Let's use this as the
search term instead of 'bluescript'. Also, turn off
AND and remove 'syn' from query term 2, because that term
caused us to miss the Bluescript vectors previously.
This result is promising. 191 hits are few enough to scan by
eye. As a quick way to see if the Bluescript vectors are in the
list, you could try sorting the output by sequence length.
Choose Edit --> BLSORT and set
the 1st sort key to column 4 (shown as D in BioLegato).
If you remembered that Bluescript was just a bit under 3 kb in
length, you could try scrolling through the sorted output
to the correct size range, as shown in column D.
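The sort-then-scan approach can be sketched in Python as follows. The four hits are invented, the helper names are hypothetical, and the 2900 to 3000 window is simply my reading of "a bit under 3 kb"; none of this reflects how BLSORT is implemented.

```python
# Hypothetical hits as (UID, Title, BioMol, Slen), invented for illustration
HITS = [
    ("EX000003", "Example phagemid vector C", "genomic DNA", 2958),
    ("EX000001", "Example phage clone A", "genomic DNA", 48502),
    ("EX000004", "Example phagemid vector D", "genomic DNA", 2961),
    ("EX000002", "Example cDNA in phagemid B", "mRNA", 912),
]

SLEN = 3  # zero-based index of column 4 (shown as D in BioLegato)


def sort_by_length(hits):
    """Return the hits ordered by sequence length, shortest first."""
    return sorted(hits, key=lambda hit: hit[SLEN])


def in_size_range(hits, minimum, maximum):
    """Keep only hits whose length lies between minimum and maximum."""
    return [hit for hit in hits if minimum <= hit[SLEN] <= maximum]


ordered = sort_by_length(HITS)
# "Just a bit under 3 kb": the 2900-3000 window is an assumed cut-off
candidates = in_size_range(ordered, 2900, 3000)
for uid, title, biomol, slen in candidates:
    print(uid, slen, title)
```

Sorting first keeps similar-sized entries next to each other, so related vectors tend to appear as a block.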
Select all four by
holding down the CTRL key and clicking on each
one. Retrieve these entries using Database --> SEQFETCH.
Select the four
sequences in bldna, and choose File --> View sequences, with
the output format set to GenBank. A quick look at the LOCUS
lines of the four sequences in this file will verify that these
sequences are indeed in the Synthetic (SYN) division.
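Checking the division by eye works for four entries. For larger files the same check can be sketched in Python. The sketch assumes the usual GenBank flat-file layout, in which the division code is the second-to-last item on the LOCUS line. The function name and the two sample entries are invented for illustration.

```python
def locus_divisions(genbank_text: str) -> dict[str, str]:
    """Map each LOCUS name to its division code, read from the LOCUS lines."""
    divisions = {}
    for line in genbank_text.splitlines():
        if line.startswith("LOCUS"):
            tokens = line.split()
            # Tokens: LOCUS, name, length, 'bp', molecule, [topology],
            # division, date. The division is therefore second from last.
            divisions[tokens[1]] = tokens[-2]
    return divisions


# Hypothetical two-entry file; only the LOCUS lines matter here
SAMPLE = """\
LOCUS       EXAMPLE1                2958 bp    DNA     circular SYN 01-JAN-2000
DEFINITION  Example entry one.
//
LOCUS       EXAMPLE2                1480 bp    mRNA    linear   PLN 01-JAN-2000
DEFINITION  Example entry two.
//
"""

found = locus_divisions(SAMPLE)
all_synthetic = all(code == "SYN" for code in found.values())
print(found, all_synthetic)
```

To check a saved file, read its contents into a string and pass that to locus_divisions.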
To save all sequences to a single file, choose File
--> Save ALL as.
Set the File Name to bluescript.gen, and make sure Files
of Type is set to GenBank. Save the file.