TUTORIAL: RETRIEVING SEQUENCES BY KEYWORD


Fristensky, B. (1993) Feature Expressions: Creating and Manipulating Sequence Datasets. Nucleic Acids Res. 21:5997-6003 

FINDKEY documentation: $doc/xylem/findkey.txt
FETCH documentation:  $doc/xylem/fetch.txt


Note: This tutorial requires a local copy of the PIR database, as described in [http://home.cc.umanitoba.ca/~psgendb/birchadmin/pir.html]. If there is already a copy on your system, the $PIR environment variable will give its location, e.g.

echo $PIR
/home/psgendb/PIR
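
In a script, the same check can be made explicit. The following is a minimal sketch for a POSIX shell (sh or bash), not csh or tcsh; check_pir is a hypothetical helper name and is not part of BIRCH.

```shell
# Sketch: report whether PIR points at a usable directory.
# check_pir is a hypothetical helper, not part of BIRCH.
check_pir() {
    if [ -z "${PIR:-}" ]; then
        echo "PIR is not set"
        return 1
    fi
    if [ ! -d "$PIR" ]; then
        echo "PIR is set to $PIR, but that directory does not exist"
        return 1
    fi
    echo "PIR database directory: $PIR"
}

# Run the check once; "|| true" keeps a calling script from aborting on failure.
check_pir || true
```

If the function reports that PIR is not set, the database probably still has to be installed as described at the URL above.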

Example: Antifreeze Proteins

Suppose you were curious about antifreeze proteins. You knew they came from fish of some sort, but that's about all you knew. FINDKEY lets you find sequences whose annotation contains one or more given words anywhere in its text.

1. Create a temporary working directory and launch GDE

{goad:/home/plants/frist}cd
{goad:/home/plants/frist}cd tutorials
{goad:/home/plants/frist/tutorials}mkdir antifreeze
{goad:/home/plants/frist/tutorials}cd antifreeze
{goad:/home/plants/frist/tutorials/antifreeze}gde &
 

2. Search for entries in PIR containing the keyword 'antifreeze'.

The PIR (Protein Identification Resource) contains data on structure, function and sequence for all of the known families of proteins. To search PIR using FINDKEY, choose Database --> FINDKEY. Type the word to search for in the FINDKEY menu, on the line reading "Single keyword". To search for several words, click on "Create list of keywords" and then click OK.

By default, FINDKEY will search PIR, but the Database menu allows you to select specific divisions of GenBank to search (e.g. Primate, Rodent, Mammalian, Vertebrate etc.). FINDKEY can also search database subsets you have created yourself, containing either PIR or GenBank entries.

FINDKEY does not retrieve sequence entries. Rather, it produces a hitfile and a namefile. The hitfile contains the lines that matched the keyword(s).


 

The namefile contains the names of the sequences found. These names can be copied and pasted directly into FETCH for retrieval, as shown below. However, since the hitfile shows the hits in context, it is possible to eliminate some of the names from the list before retrieval.
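
The relationship between a hitfile and a namefile, and the pruning of names before retrieval, can be sketched with standard Unix tools. This is not how FINDKEY works internally and it does not use FINDKEY's file formats. The toy file, its one-line-per-entry "name<TAB>annotation" layout, the TOY00n entry names and all file names are assumptions made for illustration. It is written for a POSIX shell (sh or bash).

```shell
# Toy annotation index: one "name<TAB>annotation text" pair per line.
# Entry names and text are invented for illustration; they are not real PIR data.
workdir=$(mktemp -d)
printf '%s\t%s\n' \
    'TOY001' 'antifreeze protein precursor - example fish' \
    'TOY002' 'lysozyme - example bird' \
    'TOY003' 'Antifreeze glycoprotein - example fish' \
    'TOY004' 'protein similar to antifreeze protein - example plant' \
    > "$workdir/toy.ano"

# "Hitfile": every line whose text contains the keyword (case-insensitive).
grep -i 'antifreeze' "$workdir/toy.ano" > "$workdir/toy.hit"

# "Namefile": only the names of the matching entries, one per line.
cut -f1 "$workdir/toy.hit" | sort -u > "$workdir/toy.nam"

# After reading the hits in context, list the names you do not want.
# TOY004 is excluded here purely as an example.
printf '%s\n' 'TOY004' > "$workdir/exclude.txt"

# Keep every name that is not an exact, whole-line match for an excluded name.
grep -v -x -F -f "$workdir/exclude.txt" "$workdir/toy.nam" > "$workdir/selected.nam"

cat "$workdir/selected.nam"
```

The exclusion step uses fixed-string, whole-line matching (-F -x), so removing one name cannot accidentally remove another name that merely contains it.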

3. Retrieve selected entries

When you're satisfied with your list, choose Database --> FETCH. In the FETCH menu, you could type in a single name for retrieval, but since we want to retrieve a group of sequences, select "Create list of names/acc#'s". Set DATABASE to "PIR".

There are several choices for WHERE TO SEND OUTPUT. Clicking on GDE would cause the sequences to appear in a new GDE window. This is convenient, but loses all the annotation in the PIR entries. To retrieve the entries intact to a single file, choose "Textedit window" or "Output file". With the former, we have to wait for the entries to pop up in a textedit window. With the latter, the retrieval runs in the background. At this point you could even log out, and the retrieved file would be waiting in the directory in which GDE was run the next time you logged in. If you choose "Output file", as in this example, you must also type the name of a file to contain the output, e.g. "antifreeze.pir". Finally, FETCH can directly create XYLEM datasets, in which output is split into files containing annotation, sequence, and an index. See XYLEM documentation for details.

After you click "OK", a Text Edit window will pop up, into which you can paste the names or accession numbers of entries you wish to retrieve.

To begin the retrieval, choose File --> Save in the Text Editor, and then File --> Close to quit the Editor. FETCH will retrieve the PIR entries and write them to the directory in which GDE was launched. The message "Fetch completed" will appear in the Terminal window from which GDE was launched. There should now be a file called $home/tutorials/antifreeze/antifreeze.pir.
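
Because retrieval to an output file runs in the background, a script may want to wait for the file to appear. The following is a minimal sketch for a POSIX shell (sh or bash); wait_for_file is a hypothetical helper and is not part of FETCH or GDE. It only checks that the file exists and is non-empty, which is an assumption about what "ready" means: the "Fetch completed" message remains the real signal that FETCH has finished writing.

```shell
# wait_for_file is a hypothetical helper, not part of FETCH or GDE.
# Usage: wait_for_file FILE [TIMEOUT_SECONDS]   (default timeout: 60)
wait_for_file() {
    file=$1
    timeout=${2:-60}
    waited=0
    # Poll once per second until the file exists and is non-empty.
    while [ ! -s "$file" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "still waiting for $file after ${timeout}s"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    # Report the size in bytes; tr strips the padding some wc versions add.
    echo "$file is present ($(wc -c < "$file" | tr -d ' ') bytes)"
}
```

For the example in this tutorial, it could be called from the directory in which GDE was launched as: wait_for_file antifreeze.pir 120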

4. Read entries into GDE

GDE can read GenBank entries from the Open menu, but not PIR entries. To read PIR entries, choose File --> Import Foreign Format, and type in the name of the file. Don't forget to press Enter after the end of the filename.