TUTORIAL: FINDING AND RETRIEVING SEQUENCES FROM NCBI
March 1, 2017
Go into the tutorials directory.
Create a directory called findseq.
Go into the findseq directory.
A chooser will appear on your screen asking for the name of
the directory in which you wish to work. Choose 'tutorials/findseq'
and click on Open.
The blncbi window will appear on the screen. blncbi has several functions
for searching NCBI, and results are displayed in a table. Choose
Database --> Nucleotide, which will open the
query builder. The query builder lets you create query
statements which connect keywords with relations such as
AND, OR, NOT and parentheses. You can also choose specific
databases to search, set parameters limiting things such
as sequence length or number of hits to retrieve, and
where to send the output.
Set query term 1 to search the Accession field of GenBank entries for the Accession number M77789. Click on Run: Output to new window.
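The idea of connecting field-restricted keywords with AND, OR and NOT can be sketched in a few lines of Python. This is only an illustration of how such a query statement might be assembled as text; the tutorial does not show how blncbi builds its queries internally, and the helper names field_term and combine are made up for this example. The tags ACCN (Accession) and ALL (All Fields) are Entrez field tags.

```python
def field_term(value, field):
    """Return a field-restricted term such as 'M77789[ACCN]'."""
    return f"{value}[{field}]"


def combine(operator, *terms):
    """Join terms with AND, OR or NOT and wrap the result in parentheses."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return "(" + f" {operator} ".join(terms) + ")"


# Query term 1 from the step above: Accession field, value M77789.
accession_query = field_term("M77789", "ACCN")
print(accession_query)

# Two terms joined with a relation (hypothetical keywords).
print(combine("AND", field_term("vector", "ALL"), field_term("cloning", "ALL")))
```

Nesting calls to combine gives the effect of parentheses in the query builder.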
Since there is only one hit, we want to retrieve this one.
Click on M77789 in column A, and then choose Database
--> Seqfetch. By default, seqfetch will retrieve
results to bldna, a BioLegato application for working with
DNA sequences. This is usually the best choice, since
you can always save your sequences from bldna, or open
them for viewing in programs such as a text editor or the
Artemis Genome Viewer. Click on Run to retrieve your
sequence. The sequence is retrieved from NCBI to a bldna window.
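For readers curious about what a retrieval by Accession number can look like outside of BioLegato, here is a minimal sketch that builds a request URL for NCBI's public E-utilities EFetch service. It is an assumption that this resembles what seqfetch does; the tutorial does not say how seqfetch contacts NCBI. The function name efetch_url is made up for this example, and the code only constructs the URL without contacting NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities EFetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, rettype="gb"):
    """Build an EFetch URL for one nucleotide record.

    rettype="gb" asks for a GenBank flat file; rettype="fasta" asks for FASTA.
    """
    params = {
        "db": "nuccore",      # the Nucleotide database
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


print(efetch_url("M77789"))
```

Pasting the printed URL into a web browser should display the record as plain text.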
In these tutorials, we'll see that all
tasks run through BioLegato fall into four basic steps:
Select the sequence by clicking on its name, SYNPUC19V. Choose File
--> Save SELECTION AS. To give the file a name
that is more descriptive than the Accession number, let's
call it pUC19.gen. To preserve all sequence annotation,
set the file format to GenBank. Click on Run. In
the file manager (Finder on Mac) you should now see this
sequence in your findseq directory.
(Files whose names begin with 'bioxxxx' are temporary files created by BioLegato. These should automatically be deleted when BioLegato terminates.)
To view your sequence in a text editor, you could either
click on pUC19.gen in the file manager, or from bldna, File
--> View Sequences. The pull-down menu lets you
choose which sequence format you wish to view. For example,
if you wanted to paste the sequence into a web program
that requires sequences in FASTA format, you could set the
format to FASTA. For now, we'll view the complete GenBank
entry, which is the default. Click "Run" to view.
The GenBank file will pop up in the default text editor for
your BIRCH installation, in this case, gedit.
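To make the difference between the GenBank and FASTA formats concrete, the sketch below pulls the name and the sequence out of a GenBank flat file and writes them as FASTA. It is a simplified illustration, not the converter that bldna uses: it reads only the LOCUS line and the ORIGIN block, and the function name genbank_to_fasta and the toy record are made up for this example.

```python
def genbank_to_fasta(text, width=60):
    """Convert one GenBank flat-file record to FASTA text."""
    name = "unknown"
    bases = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            name = line.split()[1]       # second field is the locus name
        elif line.startswith("ORIGIN"):
            in_origin = True             # sequence lines follow
        elif line.startswith("//"):
            in_origin = False            # end of record
        elif in_origin:
            # Drop the position numbers and spaces, keep only letters.
            bases.append("".join(ch for ch in line if ch.isalpha()))
    seq = "".join(bases)
    rows = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([">" + name] + rows)


# A tiny made-up record, just to show the shape of the input.
toy_record = """LOCUS       TOY1          12 bp    DNA
ORIGIN
        1 acgtacgtac gt
//
"""
print(genbank_to_fasta(toy_record))
```

Note that all annotation, including the FEATURES table, is lost in FASTA output, which is why the tutorial saves files in GenBank format.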
It is often useful to keep the sequence view open on the
screen for reference while doing other tasks. For example,
if you scroll down to the FEATURES table, you can see the
annotations for different parts of this vector.
A more elaborate program for viewing sequences and their
features is the Artemis Genome Browser. In bldna, choose Database
Artemis is a sophisticated genome browser and annotator, used in many genome projects. The wide array of functions and capabilities of Artemis is beyond the scope of this tutorial. See the BIRCH tutorial Genome Visualization with ARTEMIS for an in-depth introduction.
Let's try printing a sequence along with its translation
in three reading frames using NUMSEQ. Choose DNA/RNA
--> NUMSEQ. A menu will pop up allowing us to set
different parameters for printing the sequence. At this
point, don't change any parameters. Just click on Run.
By default, NUMSEQ will print sequences in 7 groups of 10 nucleotides per line.
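The default layout of 7 groups of 10 nucleotides per line can be imitated with a short function. This is a sketch of the grouping idea only; the exact spacing and the placement of position numbers in real NUMSEQ output are assumptions here, and format_groups is a made-up name.

```python
def format_groups(seq, group=10, groups_per_line=7):
    """Print-ready text: each line starts with the position of its first base,
    followed by the bases in space-separated groups."""
    per_line = group * groups_per_line
    lines = []
    for start in range(0, len(seq), per_line):
        chunk = seq[start:start + per_line]
        pieces = [chunk[i:i + group] for i in range(0, len(chunk), group)]
        lines.append(f"{start + 1:>8} " + " ".join(pieces))
    return "\n".join(lines)


# 75 bases: one full line of 70, then a short line of 5.
print(format_groups("A" * 75))
```

Calling format_groups(seq, group=15) changes the group size in the same way as the GROUPs parameter in the NUMSEQ menu.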
Suppose that we wanted to see both strands of the sequence, along
with a translation of the forward strand. Re-run NUMSEQ
after setting the following parameters:
write in GROUPs of 15
Reading frames: Three
Click on Run to proceed.
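What "both strands" and "three reading frames" mean can be shown with a small sketch. It uses the standard genetic code and is not NUMSEQ's own code; the function names and the short test sequence are made up for this example. Reading frame 0 starts at the first base, frame 1 at the second, and frame 2 at the third.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def complement(seq):
    """Return the opposite strand, base by base (not reversed)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))


def translate(seq, frame):
    """Translate the forward strand starting at offset frame (0, 1 or 2)."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    # Codons containing anything other than A, C, G, T become 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


dna = "ATGGCCTAA"
print(dna)               # forward strand
print(complement(dna))   # opposite strand, aligned under the forward strand
for frame in range(3):
    print(frame + 1, translate(dna, frame))
```

A stop codon is shown as an asterisk, as is common in sequence printouts.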
Choose DNA/RNA --> BACHREST.
The BACHREST menu lets you customize your search based on whether or not an enzyme is commercially available, the type of ends it generates, whether the recognition sequence is symmetric or asymmetric, the length of the recognition sequence, or the number of fragments generated.
To see the output with the default settings, click on Run.
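The core of a restriction site search is finding every occurrence of a recognition sequence and checking whether that sequence is symmetric. The sketch below shows both ideas for EcoRI, whose recognition sequence is GAATTC. It is a simplified illustration and not how BACHREST is implemented: it treats the sequence as linear, searches one strand only, and the function names and test sequence are made up.

```python
def find_sites(seq, site):
    """Return the 1-based start positions of every occurrence of site."""
    seq, site = seq.upper(), site.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start + 1)
        start = seq.find(site, start + 1)   # allow overlapping matches
    return positions


def is_symmetric(site):
    """True if the site reads the same on both strands (a palindrome)."""
    reverse_complement = site.upper()[::-1].translate(str.maketrans("ACGT", "TGCA"))
    return reverse_complement == site.upper()


dna = "AAGAATTCTTGAATTC"
print(find_sites(dna, "GAATTC"))
print(is_symmetric("GAATTC"))
```

For a symmetric site, searching one strand is enough; an asymmetric site would also need a search for its reverse complement.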
Make sure you have a fresh blncbi window. If you have
blncbi open, you can create a new window with File
--> New Window. Otherwise, launch blncbi from the
BIRCH launcher using Data Mining --> blncbi.
Open the query builder using Database -->
Nucleotide. Let's do the simplest search first. For
query term 1, the default is to search ALL FIELDS. Set the
search term to 'bluescript'. Click on Run:
Output to a new window to begin the search.
There are 421 hits.
As a quick way to see if the Bluescript vectors are in the
list, you could try sorting the output. Choose Edit
--> BLSORT and set the 1st sort
key to column 4 (shown as D in BioLegato).
If you remembered that Bluescript was just a bit under 3 kb in
length, you could try scrolling through the sorted output
to the correct size range, as shown in column D. However,
we don't see anything that looks like Bluescript in this
range. The Bluescript vectors don't appear to be in the list. Rather than trying different permutations of capitalization or hyphenation, let's try a different tactic.
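Sorting the hits by length and then looking only at a size range can be expressed in a few lines. The rows below are invented placeholders, not real search results, and the layout of three fields per row is an assumption made for the example; in blncbi the length is in column D.

```python
# Placeholder rows: (accession, description, length in bp). Not real hits.
hits = [
    ("HYPO_1", "placeholder entry one", 5200),
    ("HYPO_2", "placeholder entry two", 2950),
    ("HYPO_3", "placeholder entry three", 870),
]


def sort_by_length(rows):
    """Return the rows ordered from shortest to longest."""
    return sorted(rows, key=lambda row: row[2])


def in_size_range(rows, low, high):
    """Keep only rows whose length lies between low and high, inclusive."""
    return [row for row in rows if low <= row[2] <= high]


for row in sort_by_length(hits):
    print(row)

# Entries a bit under 3 kb, the size range described above.
print(in_size_range(hits, 2800, 3000))
```

Filtering by size range does the same job as scrolling through the sorted output by eye.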
Since the information sheet calls this vector a phagemid,
try setting query term 1 to 'phagemid', and narrow the
search by setting query term 2 to AND Division: SYN where
'SYN' limits the search to only those sequences in the
GenBank Synthetic division. This time there are only 56
hits, which is a short enough list to scan by eye.
Scrolling down, we see the four Bluescript vectors (which
are distinct from the Bluescript II vectors). Select
all four by holding down the CTRL key and clicking
on each Accession number.
Retrieve these entries using Database --> SEQFETCH.
To save all sequences to a single file, choose File
--> Save ALL as.
Set the File Name to bluescript.gen, and make sure Files of Type is set to GenBank. Save the file.
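A GenBank file holding several sequences is simply the individual entries written one after another, each ending with a line containing only //. The sketch below splits such a file back into separate entries, which is a quick way to check that a saved file contains as many sequences as expected. The function name and the two toy entries are made up for this example.

```python
def split_genbank_records(text):
    """Split multi-entry GenBank text into a list of single entries."""
    records = []
    current = []
    for line in text.splitlines():
        current.append(line)
        if line.strip() == "//":          # end-of-entry marker
            records.append("\n".join(current) + "\n")
            current = []
    return records


# Two tiny made-up entries in one string.
two_entries = """LOCUS       TOY1          4 bp    DNA
ORIGIN
        1 acgt
//
LOCUS       TOY2          4 bp    DNA
ORIGIN
        1 ttaa
//
"""
entries = split_genbank_records(two_entries)
print(len(entries))
```

To try this on a saved file such as bluescript.gen, read the file into a string first and pass it to split_genbank_records.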