GENE IDENTIFICATION - Database searches

In this section, you will do a database search. Although we could search using your DNA sequence, DNA searches do not always yield statistically significant matches. This is in part because DNA has a 4-letter alphabet (AGCT), so "false positive" matches are easy to get. That is, with a large database, the chance that two DNA sequences will share some sequence similarity purely by random chance is quite high. Therefore, the threshold for a "significant" match is very high. Searches of protein databases with amino acid sequences more often yield significant matches. This is because the amino acid alphabet has 20 letters, making false positives more difficult to obtain.
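
As a rough illustration of the alphabet-size argument above, the following Python sketch estimates how many exact word matches would arise by chance alone. It assumes that every letter is equally likely and independent of its neighbours, which real sequences are not. The function name, the word length of 11, and the database size are hypothetical values chosen only for illustration.

```python
def expected_chance_matches(alphabet_size, word_length, database_positions):
    """Expected number of exact chance matches to one word of the given length.

    Assumes uniform, independent letters (a deliberate simplification).
    """
    per_position = (1.0 / alphabet_size) ** word_length
    return database_positions * per_position


# Hypothetical database of one billion positions and a hypothetical 11-letter word.
dna = expected_chance_matches(4, 11, 1_000_000_000)
protein = expected_chance_matches(20, 11, 1_000_000_000)
print("DNA word, expected chance matches:", dna)
print("Protein word, expected chance matches:", protein)
```

Under these assumptions the DNA word is expected to turn up by chance hundreds of times, while the protein word is expected far less than once.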

Therefore, you will compare one or more of your translated protein sequences with all sequences in GenBank, using the tblastn program. tblastn translates the DNA sequences in GenBank to protein, in all six reading frames, before comparing them with your protein. In other words, your protein sequence is searched against the GenBank DNA database, which contains all published DNA sequences from plants, animals, fungi, bacteria, archaea, and viruses.
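
To make the translation step concrete, here is a minimal Python sketch of translating a DNA sequence in all six reading frames (three offsets on each strand), which is the kind of conversion a translated search relies on. It is not the tblastn implementation. It assumes the standard genetic code, and the function names and the frame labels (+1 to +3, -1 to -3, by simple offset) are illustrative choices.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop codon.

    Codons containing letters other than A, C, G, T are written as 'X'.
    """
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map frame labels '+1'..'+3' and '-1'..'-3' to translated protein strings."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for label, protein in six_frame_translations("ATGGCCTAA").items():
    print(label, protein)
```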

Make sure your protein sequences from the previous section are open in separate text editor windows (e.g., Notepad or WordPad).
Select your protein sequence by dragging over the entire sequence with the mouse. Choose Edit --> Copy to copy to the clipboard.
Click on the link below to go to the BLAST search page at the NCBI.
Choose Edit --> Paste to paste the protein sequence into the BLAST input window.
For a protein search, choose the Program 'PROTEIN query - TRANSLATED database (tblastn)', and the Database 'nr'. (Don't check the email reply box unless you want results returned by email, rather than directly to the Web browser.)
Press the 'BLAST!' button to search the database. The program will compare your sequence with those in the database, and present the results in hypertext form. This may take a few minutes.
If the results give a significant match (see below), you may wish to save the results page directly from your Web browser. Save your results by choosing File --> Save as. Give the file a distinct name.
Click on the 'Clear Input' button before pasting in a new sequence. Repeat your search for each long open reading frame found. For protein searches, choose the Program 'tblastn' and the Database 'nr'.

SEARCH at the National Center for Biotechnology Information in Bethesda, Maryland.

What the results mean

A summary of results, with hypertext links to each sequence that matches your unknown, is given below.
Results are shown for a search with the sequence from open reading frame 2, translated from the sample sequence.
(Actual results will differ, as new sequences are added to the database).

                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value
gi|3286690|emb|AJ007450.1|ATH7450  Arabidopsis thaliana mRNA...   166   2e-40 
gi|18410827|ref|NM_106186.1|  Arabidopsis thaliana DNAJ heat...   157   2e-37 
gi|7212003|gb|AC023754.3|AC023754  Arabidopsis thaliana chro...    54   4e-14 
gi|12331602|gb|AC025814.7|AC025814  Arabidopsis thaliana chr...    54   4e-14
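
If you save the results as text, summary lines like those above can be read by a short script. The Python sketch below assumes each line ends with two whitespace-separated columns, the bit score and then the E value, as in this example. The function names and the cutoff of 0.001 are hypothetical, and other BLAST versions may lay the lines out differently.

```python
def parse_hit_line(line):
    """Split one summary line into (description, bit_score, e_value)."""
    description, score, e_value = line.rstrip().rsplit(None, 2)
    # Assumption: some older reports print values such as 'e-100'
    # without a leading digit, which float() cannot read directly.
    if e_value.startswith("e"):
        e_value = "1" + e_value
    return description.strip(), float(score), float(e_value)


def significant_hits(lines, max_e_value=0.001):
    """Keep the hits whose E value is at or below a chosen cutoff."""
    hits = [parse_hit_line(line) for line in lines if line.strip()]
    return [hit for hit in hits if hit[2] <= max_e_value]


summary = [
    "gi|3286690|emb|AJ007450.1|ATH7450  Arabidopsis thaliana mRNA...   166   2e-40 ",
    "gi|18410827|ref|NM_106186.1|  Arabidopsis thaliana DNAJ heat...   157   2e-37 ",
    "gi|7212003|gb|AC023754.3|AC023754  Arabidopsis thaliana chro...    54   4e-14 ",
    "gi|12331602|gb|AC025814.7|AC025814  Arabidopsis thaliana chr...    54   4e-14",
]
for description, score, e_value in significant_hits(summary):
    print(score, e_value, description)
```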

For each match, a score is given. The more amino acids that match, the higher the score. The 'E value' is the number of matches with at least that score that would be expected by random chance alone, given the size of the database. An E value greater than 1 means that more than one such match is expected by chance, and therefore the match is of no statistical significance. An E value of 0.01 means that a match of this score would be seen by chance about once in every 100 database searches with different test sequences of comparable length to your sequence. E = 0.001 means that you would see a match this good by chance only once in a thousand searches. The choice of significance level therefore depends on how important it is to eliminate false positives.
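
The relationship between score, database size, and E value can be sketched in a few lines of Python. The sketch uses the standard relation between a bit score S and the expected number of chance hits, E = m x n x 2^(-S), where m is the query length and n is the database length. BLAST itself uses corrected "effective" lengths, so the numbers it reports will not be reproduced exactly, and the function names and example lengths here are hypothetical.

```python
import math


def expected_hits(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least bit_score.

    Uses E = m * n * 2**(-S) with raw lengths (an approximation).
    """
    return query_length * database_length * 2.0 ** (-bit_score)


def chance_of_at_least_one_hit(e_value):
    """Probability of seeing one or more chance hits, given E expected hits.

    Treats the number of chance hits as Poisson distributed: P = 1 - exp(-E).
    """
    return -math.expm1(-e_value)


# Hypothetical search: a 120-residue query against 1e9 database positions.
for bits in (30, 50, 100):
    print(bits, "bits ->", expected_hits(bits, 120, 1_000_000_000))

for e in (10, 1, 0.01, 0.001):
    print("E =", e, "-> chance of at least one such hit:",
          chance_of_at_least_one_hit(e))
```

For small E values the probability is almost equal to E itself, which is why E = 0.01 can be read as "about once in 100 searches".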

The best match in this example is to an Arabidopsis thaliana auxilin-like protein:

>gi|3286690|emb|AJ007450.1|ATH7450   Arabidopsis thaliana mRNA for auxilin-like protein
          Length = 1649

 Score =  166 bits (421), Expect = 2e-40
 Identities = 91/124 (73%), Positives = 104/124 (83%), Gaps = 1/124 (0%)
 Frame = +3

Query: 1    LLKREVMVAASRLALLVIDEAPHLLVQRTKVRVLLVLQTRLPKLSQSRDAKLDLRDTREH 60
            LLK++VM AAS LALLV DEAPHLLV+RTKV+VL +LQTRLPK++QSRD KLDLRDTREH
Sbjct: 837  LLKQKVMAAASHLALLVKDEAPHLLVRRTKVQVLPILQTRLPKVNQSRDVKLDLRDTREH 1016

Query: 61   LSVRQKLLQRRNFVISKSRKRRQREIGSRKLLMLMSSGGRTERKTTCGHCS-NTPIYLGA 119
            L+ +Q+LLQRRNFVI K RK RQREI SRKLLMLMS+GGR ERKTT G  S ++ IYL  
Sbjct: 1017 LTAQQRLLQRRNFVILKPRKSRQREIDSRKLLMLMSNGGRVERKTTYGR*SQHSNIYLEQ 1196

Query: 120  ESDG 123
              DG
Sbjct: 1197 RVDG 1208

These results show the test (query) sequence aligned with the Arabidopsis sequence from GenBank, with gaps (-) inserted to optimize the alignment. On the line between the two, positions where both sequences have the same amino acid are marked with that amino acid's letter, while plus (+) characters mark positions where the two amino acids are chemically similar, though not identical. Positions left blank are neither identical nor similar.
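
The 'Identities' and 'Positives' figures in the report can be recovered from that middle line. The Python sketch below counts letters as identities and letters plus '+' characters as positives. The function names are hypothetical, and rounding the percentages down is an assumption inferred from the figures shown above (91/124 reported as 73%, 104/124 as 83%).

```python
def count_matches(midline):
    """Return (identities, positives) for one middle line of an alignment.

    Letters mark identical residues; '+' marks similar residues.
    Positives include the identities.
    """
    identities = sum(1 for ch in midline if ch.isalpha())
    positives = identities + midline.count("+")
    return identities, positives


def percent(count, alignment_length):
    """Whole-number percentage, rounded down (assumed to match the report)."""
    return 100 * count // alignment_length


# Last block of the alignment above: four aligned columns, middle line '  DG'.
identities, positives = count_matches("  DG")
print(identities, positives)

# Totals quoted in the report for the full 124-column alignment.
print(percent(91, 124), percent(104, 124))
```

The alignment length is passed separately to percent() because trailing blanks on the middle line are easily lost when a report is copied.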
