GENE IDENTIFICATION - Database searches


In this section, you will do a database search. Although we could search using your DNA sequence, DNA searches do not always yield statistically significant matches. This is in part because DNA has a 4-letter alphabet (AGCT), so "false positive" matches are easy to get. That is, with a large database, the chance that two DNA sequences will share some sequence similarity purely by chance is quite high, so the criterion for a "significant" match must be very stringent. Searches of protein databases with amino acid sequences more often yield significant matches. This is because the amino acid alphabet has 20 letters, making false positives more difficult to obtain.
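To make the alphabet argument concrete, here is a minimal sketch that estimates how often two unrelated letters match by chance. It assumes every letter is equally frequent and that positions are independent, which real sequences only approximate; the function names are illustrative and not part of any BLAST tool.

```python
def chance_identity(freqs):
    """Probability that two independently drawn letters are identical.

    freqs maps each letter to its frequency; frequencies must sum to 1.
    """
    return sum(p * p for p in freqs.values())


def uniform(alphabet):
    """Assumption: every letter in the alphabet is equally likely."""
    return {letter: 1 / len(alphabet) for letter in alphabet}


dna = chance_identity(uniform("ACGT"))
protein = chance_identity(uniform("ACDEFGHIKLMNPQRSTVWY"))

print(f"Chance identity per position, DNA:     {dna:.2%}")
print(f"Chance identity per position, protein: {protein:.2%}")
```

Under these assumptions, two unrelated DNA sequences agree at about one position in four, while two unrelated proteins agree at about one position in twenty, so a given level of identity is much harder to reach by accident in a protein comparison.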

Therefore, you will compare one or more of your translated protein sequences with all sequences in GenBank, using the tblastn program. tblastn translates the DNA sequences in GenBank to protein, in all six reading frames, before comparing them with your protein. In effect, you will search your protein sequence against the GenBank DNA database, which contains all published DNA sequences from plants, animals, fungi, bacteria, archaea, and viruses.
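The translation step can be pictured with the following sketch, which translates a DNA sequence in all six reading frames (three on each strand) using the standard genetic code. It only illustrates the idea and is not the tblastn implementation; the function names and the short test sequence are made up for the example.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Map frame number (+1..+3, -1..-3) to its translation; * marks a stop."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


for frame, protein in sorted(six_frames("ATGGCCTAA").items()):
    print(f"{frame:+d}: {protein}")
```

The "Frame = +3" line in a tblastn result refers to this kind of numbering: the match was found in the third forward reading frame of the database sequence.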

  1. Make sure your protein sequences from the previous section are open in separate text editor windows (e.g. Notepad or Wordpad).
  2. Select your protein sequence by dragging over the entire sequence with the mouse. Choose Edit --> Copy to copy to the clipboard.
  3. Click on the link below to go to the BLAST search page at the NCBI.
  4. Choose Edit --> Paste to paste the protein sequence into the BLAST input window.
  5. For a protein search, choose the Program 'PROTEIN query - TRANSLATED database (tblastn)', and the Database 'nr'. (Don't check the email reply box unless you want results returned by email, rather than directly to the Web browser.)
  6. Press the 'BLAST!' button to search the database. The program will compare your sequence with those in the database, and present the results in hypertext form. This may take a few minutes.
  7. If the results give a significant match (see below) you may wish to save the results page directly from your Web browser. Save your results by choosing File --> Save as. Give the file a distinct name.
  8. Click on the 'Clear Input' button before pasting in a new sequence. Repeat your search for each long open reading frame found, using the same Program ('tblastn') and Database ('nr') as in step 5.


SEARCH at the National Center for Biotechnology Information in Bethesda, Maryland.
 

What the results mean

A summary of results, with hypertext links to each sequence that matches your unknown, appears at the top of the results page. An example is given below.
Results are shown for a search with the sequence from open reading frame 2, translated from the sample sequence.
(Actual results will differ, as new sequences are added to the database).
                                                                  Score     E
Sequences producing significant alignments:                       (bits)  Value
gi|3286690|emb|AJ007450.1|ATH7450  Arabidopsis thaliana mRNA...     166   2e-40
gi|18410827|ref|NM_106186.1| Arabidopsis thaliana DNAJ heat...      157   2e-37
gi|7212003|gb|AC023754.3|AC023754 Arabidopsis thaliana chro...       54   4e-14
gi|12331602|gb|AC025814.7|AC025814 Arabidopsis thaliana chr...       54   4e-14

For each match, a score is given. The more amino acids that match, the higher the score. The 'E value' tells the number of matches at that score that would be expected by random chance alone, given the size of the database. An E value greater than 1 means that one or more such matches are expected by chance, and therefore the match is of no statistical significance. An E value of 0.01 means that a match of this score would be seen by chance once in every 100 database searches with different test sequences of comparable length to your sequence. E = 0.001 means that you would only see a match this good once in a thousand searches. The choice of significance level therefore depends on how important it is to eliminate false positives.
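The link between the bit score and the E value can be sketched with the relation used in BLAST statistics, E = m x n x 2^(-S'), where S' is the bit score and m and n are the lengths of the query and the database. The example below is only an approximation: real BLAST programs apply length corrections to m and n, and the query and database lengths used here are hypothetical, so it will not reproduce the values in the table above exactly.

```python
import math


def expect_value(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least bit_score.

    Uses E = m * n * 2**(-S'), with m = query_len and n = db_len.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


def p_value(e_value):
    """Probability of seeing at least one chance hit, given E."""
    return -math.expm1(-e_value)


# Hypothetical search space: a 123-residue query against 10**9 database letters.
QUERY_LEN = 123
DB_LEN = 10**9

for bits in (30, 40, 54, 157, 166):
    e = expect_value(bits, QUERY_LEN, DB_LEN)
    print(f"{bits:4d} bits  E = {e:.1e}  P = {p_value(e):.1e}")
```

Each additional bit of score halves the E value, and doubling the size of the database doubles it. This is why the same alignment can become less significant as the database grows.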

The best match in this example is to an Arabidopsis thaliana auxilin-like protein:

>gi|3286690|emb|AJ007450.1|ATH7450 Arabidopsis thaliana mRNA for auxilin-like protein
Length = 1649

Score = 166 bits (421), Expect = 2e-40
Identities = 91/124 (73%), Positives = 104/124 (83%), Gaps = 1/124 (0%)
Frame = +3

Query: 1    LLKREVMVAASRLALLVIDEAPHLLVQRTKVRVLLVLQTRLPKLSQSRDAKLDLRDTREH 60
            LLK++VM AAS LALLV DEAPHLLV+RTKV+VL +LQTRLPK++QSRD KLDLRDTREH
Sbjct: 837  LLKQKVMAAASHLALLVKDEAPHLLVRRTKVQVLPILQTRLPKVNQSRDVKLDLRDTREH 1016

Query: 61   LSVRQKLLQRRNFVISKSRKRRQREIGSRKLLMLMSSGGRTERKTTCGHCS-NTPIYLGA 119
            L+ +Q+LLQRRNFVI K RK RQREI SRKLLMLMS+GGR ERKTT G  S ++ IYL
Sbjct: 1017 LTAQQRLLQRRNFVILKPRKSRQREIDSRKLLMLMSNGGRVERKTTYGR*SQHSNIYLEQ 1196

Query: 120  ESDG 123
              DG
Sbjct: 1197 RVDG 1208

These results show the test (query) sequence aligned with the Arabidopsis sequence from GenBank, with gaps (-) inserted to optimize the alignment. On the line between the two sequences, amino acids that are identical in both are written with the corresponding letter, while plus (+) characters indicate that the corresponding amino acids are chemically similar, though not identical. A blank means the two amino acids are neither identical nor similar. The asterisk (*) in the Sbjct line marks a stop codon in the translated database sequence.
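As a check on how the "Identities" and "Gaps" figures are counted, the following sketch recomputes them from the aligned Query and Sbjct lines of the example above, joined into single strings. The function name is illustrative. "Positives" is not recomputed, because that would require the amino acid substitution matrix used by BLAST (BLOSUM62 by default).

```python
# Aligned sequences from the example, with the three blocks joined together.
QUERY = (
    "LLKREVMVAASRLALLVIDEAPHLLVQRTKVRVLLVLQTRLPKLSQSRDAKLDLRDTREH"
    "LSVRQKLLQRRNFVISKSRKRRQREIGSRKLLMLMSSGGRTERKTTCGHCS-NTPIYLGA"
    "ESDG"
)
SBJCT = (
    "LLKQKVMAAASHLALLVKDEAPHLLVRRTKVQVLPILQTRLPKVNQSRDVKLDLRDTREH"
    "LTAQQRLLQRRNFVILKPRKSRQREIDSRKLLMLMSNGGRVERKTTYGR*SQHSNIYLEQ"
    "RVDG"
)


def alignment_stats(query, sbjct):
    """Return (identities, gaps, aligned length) for two aligned strings."""
    if len(query) != len(sbjct):
        raise ValueError("aligned sequences must have the same length")
    pairs = list(zip(query, sbjct))
    identities = sum(q == s and q != "-" for q, s in pairs)
    gaps = sum(q == "-" or s == "-" for q, s in pairs)
    return identities, gaps, len(pairs)


identities, gaps, length = alignment_stats(QUERY, SBJCT)
print(f"Identities = {identities}/{length} ({100 * identities // length}%)")
print(f"Gaps = {gaps}/{length} ({100 * gaps // length}%)")
```

The counts agree with the header of the example alignment: 91 identical positions and 1 gap over 124 aligned columns. The aligned length is 124 rather than 123 because the single gap in the query adds one column.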
 



 
