TUTORIAL: SEARCHING LOCAL SEQUENCE DATABASES USING FASTA



FASTA documentation: $doc/fasta/fasta.txt
Bill Pearson's FASTA Web site: [http://www.people.Virginia.EDU/~wrp/pearson.html]

Example: Antifreeze Proteins

This tutorial assumes that local copies of the GenBank and GenPept databases are installed.

1. Read in the test sequence and generate the corresponding amino acid sequence

This tutorial will introduce the several types of sequence database searches, using the antifreeze protein from the sea raven, Hemitripterus americanus. The cDNA sequence for this gene is found in SRAAFP.gen.

Create a directory called 'db' and save this sequence to the db directory.

To read this sequence into bldna, choose 'File --> Open', and select SRAAFP.gen.

Since this tutorial will demonstrate database searches using both DNA and protein sequences as query sequences, we need to generate the amino acid sequence. This is done in two steps. First, we need to extract the protein coding sequence from the cDNA. (Remember that mRNAs contain both 5' and 3' untranslated sequences. If you look at SRAAFP.gen, you'll see that the CDS (protein coding sequence) runs from positions 10 to 597. We will use the FEATURES program to extract the CDS, and then translate the CDS using RIBOSOME.

Make sure SRAAFP is selected, and choose Database --> FEATURES (Extract by feature keys).

Click on 'Run' with the above settings. The CDS will appear in a new bldna window.

Note that the CDS begins with a start codon, indicating that this cDNA contains the entire protein coding region.

Next translate the CDS into protein. Select SRAAFP:CDS1, and choose DNARNA --> Ribosome.  By default, Ribosome sends the output to a new blprotein window.


2. The Database Menu

bldna


blprotein

3. DNA vs. DNA

Select the DNA sequence SRAAFP:CDS1, and choose Database --> FASTA (DNA/RNA vs. DNA database).

First, choose a database. All GenBank divisions are listed, along with a choice to search ALL divisions, or a User-created file in Pearson/FASTA format. Most of the time, it is best to search the taxonomic group from which your query sequence is derived. This is especially true for DNA/DNA searches, since only closely-related DNA sequences are likely to give statistically-significant matches. If you do need to search all of GenBank, this can take more than an hour, depending on the length of the sequence and the K-tuple value (each K-tuple speeds up the search by a factor of 4, but decreases sensitivity).


Hint: Send output from long-running searches to a file.

bldna runs all database searches in the background, ie. as independent jobs. If you quit bldna or even logout, the job will run to completion. For long-running searches, it is usually best to send output directly to a file. Simply click on 'Output file', and type an Output file name. In the example above, output will be directed to two files: SRAAFP.gbvrt.fasta, which contains the FASTA report, and SRAAFP.gbvrt.fasta.nam, which contain the names of the hits. This namefile could be used as input to FETCH to retrieve the hits. Notice that biolegato automatically appends the .nam file extension.


 
Hint: For very closely-related sequences, speed up the search with K-tuple = 5 or 6.
In some cases, you are specifically looking for sequences that share very high similarity with your query. Suppose you had a cDNA and wanted to find the corresponding genomic clone. In that case, the similarity is likely to be greater than 90%. A K-tuple value of 5 or 6 will give a very fast search at low sensitivity, which is all you need.

Note: The FASTA, FASTX/Y, and TFASTA search BOTH strands of DNA database sequences, by default.

4. Protein vs. Protein

DNA vs. DNA searches are far less sensitive than protein vs. protein searches. DNA searches only work for closely-related sequences.

Consequently, protein vs. protein searches are likely to detect more distantly-related genes. As well, since amino acid sequences are only 1/3 the length of the corresponding DNA sequences, protein searches require only (1/3)2 = 1/9 the computational time.

Select the amino acid sequence SRAAFP:CDS1, and choose Database --> FASTA (protein vs. protein database).



For DNA searches a very simple scoring scheme is used, in which +1 is added to the score for a match, and -1 is added to the score for a mismatch. The greater diversity of amino acids lends itself to more complex scoring schemes. The menu shows a variety of protein scoring matrices. The Blosum matrices were constructed using distantly-related sequences. If you need a highly sensitive search, use a low-numbered Blosum matrix. The PAM matrices were constructed from a set of closely-related sequences. For a high-sensitivity search, use PAM250, or PAM250 Gonnet (based on a more recent dataset.). For closely-related sequences, use PAM120.

More information of scoring matrices can be found in
Hugh B. Nicholas Jr., David W. Deerfield II., and Alexander J. Ropelewski, (1998) A Tutorial on Searching Sequence Databases and Sequence Scoring Methods Developed by the Biomedical Supercomputing Initiative of the Pittsburgh Supercomputing Center.[http://www.nrbsc.org/education/tutorials/sequence/db/].

The results of this protein search are found in SRAAFP.genpept.fasta and SRAAFP.genpept.fasta.nam.
 
 
Hint: Oligopeptides need logarithmic score weighting
For very short query sequences such as oligopeptides, unweighted scores would not exceed the minimum cutoff values needed to appear in the output. The FASTA programs allow you to specify a logarithmic weighting ratio that gives short query sequences higher matches. The score is weighted by the natural log of the query length divided by the natural log of the database size.
 

5. Protein vs. Translated DNA

TFASTA is the most sensitive method for searching DNA sequence databases. Each DNA sequence is translated on the fly in either 3 or 6 reading frames (ie. one or both strands) . ALL of the sequence is translated, whether coding or non-coding, intron or exon.


 

Two versions of TFASTA are available.  TFASTX allows 3-base insertions (ie. 1 codon), while TFASTY allows 1 or 2 base insertions. TFASTY is therefore the most sensitive program, but it is also slower. TFASTY would be especially good at detecting similarities between a query protein and ESTs, since ESTs typically have more frameshifts than most sequences.

The results of this protein search are found in SRAAFP.gbvrt.tfasta and SRAAFP.gbvrt.tfasta.nam.

Note on automatic translation:

Programs that translate sequences on the fly (eg. TFASTX, TBLASTN, TBLASTX) have no knowledge whatsoever about gene structure (ie. exons, introns, 5' UTR, 3'UTR).  All these programs do is take every group of 3 nucleotides and assign a codon to it. Even stop codons are represented by an asterisk (*).  Consequently, non-coding sequences and non-coding open reading frames are translated into meaningless amino acid sequences.


6. DNA vs. Protein


FASTX and FASTY are the converse of TFASTX and TFASTY, in that they translate a DNA query sequence, allowing either codon-sized gaps (TFASTX) or 1 or 2 base gaps (TFASTY). TFASTY is therefore the slower, but more sensitive. These programs are particularly well-suited for comparing a sequence that may have frameshifts (eg. an EST) with a protein database.

The result of this protein search is found in SRAAFP.genpept.fasty.

7. Retrieving Hits

1. FETCH - retrieve sequences from local copy of GenBank

An advantage of the BIRCH implementation of the FASTA programs under biolegato instances is that a biolegato instance generates a namefile, with the names of hits taken from the FASTA output. It's easy to copy the names of the most significant hits, in one chunk, into a textedit window for retrieval by FETCH .  Note: The current implementation of FETCH is very slow!

2. NCBI Batch Entrez - retrieve sequences from NCBI

To retrieve the GenBank entries associated with your accession numbers, we'll use Batch Entrez at  http://www.ncbi.nlm.nih.gov/sites/batchentrez. As an example, we'll use
SRAAFP.gbvrt.tfasta.nam, the file of hits from searching the SRAAFP protein against the vertebrate division of GenBank using tfasta. (You may need to save this file in the current working directory.) Go to the batchentrez link above and click on Browse. In the file chooser, select SRAAFP.gbvrt.tfasta.nam.



  Now, click on Retrieve. A message will appear saying something like "Retrieve records for 357 UID(s). Click on that link.

Click on Send to and fill in the following:

Destination: File
Format: GenBank

When you click on Create File, a file chooser will pop up.




The default name will be something like "sequence.gb", but it is good practice to give it a name consistent with the rest of the files in this exercise. A good name would be

SRAAFP.gbvrt.tfasta.gen

NOTE: Remember to save this file in the current working directory! Your file chooser will show a different location from the one shown a right.


This file will contain full GenBank flatfile entries for all of the tfasta hits.