Lecture 5, part 3 of 3
Fortunately, the same
strategies that speed up pairwise searches can be applied to
database searches. The two most common database search packages,
FASTA and BLAST, both use lookup tables of k-tuple words to make
searches quick. Additionally, sites such as NCBI's BLAST server
employ high performance computers that have large numbers of
CPUs, allowing many pairwise comparisons to be performed in
ssearch3* (slow, full Smith-Waterman alignment)
vs. protein database
ssearch3 (slow, full Smith-Waterman alignment)
vs. translated DNA database
vs. translated DNA database
vs. protein database
||fastx3, fasty3 (especially
well-suited for cDNAs, which often contain frameshift
vs. protein database (iterative refinement
pattern vs. protein database
worth pointing out that BLAST comparisons are always an
approximation. BLAST does not do a complete Smith-Waterman
alignment. When you want optimal alignment, use SSEARCH.
There are a number of
more specialized BLAST programs than are listed here. A more
complete listing of BLAST programs can be found at
the scale of the search itself, server-implemented database
searches have additional layers of complexity. The servers must
be able to schedule and process requests from a number of
different types of web clients. In the case of NCBI, over
100,000 requests per day. (1 day = 86,400 seconds, so that is
more than one request processed per second.)
At NCBI, requests from web browsers or PC-based client programs are read by an interface layer, which sends the specifics of the request to a database server.
The Queuing Components server schedules jobs and monitors the progress of jobs. It launches searches, sending each query sequence to one of 280 Worker Nodes, which are Linux servers, each of which has multiple CPUs and large amounts of memory.
Each worker node stores part of the database in memory, such that the entire binary database is stored permanently in memory. This is a critical speedup step, because simply reading the data from disk can be a major bottleneck in database searches. The time required for a program to read data from a disk drive is much slower than reading data from RAM (random access memory). Collectively, the worker nodes store two complete copies of the data to increase the throughput.
When a query has been compared to all parts of the database as requested, the results are returned to the client.
2. Statistical significance of hits
In comparison to
pairwise searches, database searches carry some additional
problems. The most critical is evaluating the statistical
significance of a match. Statistical significance depends on the
size of the query sequence, the size of the database, and the
sample of sequences represented by the database.
The larger the database, the higher the probability of finding a match at a particular score by random chance. Therefore, as the size of the database increases, the score required to infer statistical significance is greater. However, it has been observed that in database searches, sequence scores are not distributed according to a Gaussian (normal) distribution.
The Extreme Value Distribution (EVD) does show a good fit to observed score distributions.
E (the "E value") is the expected number of hits with score >= SBill Pearson shows an example of how observed FASTA scores match up with the Extreme Value Distribution. [ http://www.people.virginia.edu/~wrp/talk95/prot_talk12-95_5.html ]
m is the size of the query sequence
n is the size of the database
and K are empirically derived constants for the scoring system and the search space size, respectively
observation to make about the EVD is that the right tail,
representing the highest scores, is strongly skewed, compared to
a normal distribution. This means that a normal distribution
would predict a much lower number of expected scores above some
high score S. Consequently, the significance of some
high scores might be tremendously overestimated. Put simply, the
normal distribution would judge many similarities to be
significant which were not significant. The right tail of the curve
is shown in the inset on an expanded scale. Remember,
scores in this region are simply part of the random
distribution of scores. By definition, they are not deemed to be
|E values in BLAST and
The most important difference between BLAST and FASTA is the way in which the E values are calculated. Both FASTA and BLAST start with the calculation of probabilities for a similarity of a given score between two randomized sequences of the same size, with equivalent amino acid compositions.
BLAST - K and are pre-calculated from a sample of 10,000 randomly-shuffled sequences BLAST therefore calculates E values based on the total length of the database.
FASTA - K and are calculated from the actual distribution of similarity scores obtained in each search. This makes FASTA slower than BLAST, since FASTA must calcluate an approximate similarity score for every sequence in the database. FASTA thus calculates E values based on the number of sequences in the database.
Pearson, W. R. and Wood, T. C. (2001) Statistical significance in biological sequence comparison. Handbook of Statistical Genetics, Balding, D. J., Bishop, M. and Cannings, C. eds. London:Wiley pp. 39-65 PDF version
A good summary of the
statistical issues related to database similarity searches can
be found in "The Statistics of Sequence Similarity Scores" by
Stephen Altschul at NCBI [
Low complexity sequences are sequences which contain many repeats of short sequence elements. Sequences with skewed amino acid or nucleotide compositions will have a low complexity. When two sequences with similarly skewed compositions are compared, the probability of local similarities is much greater than for pairs of sequences with fairly even compositions. For example, two A/T-rich DNA sequences are far more likely to have matches by random chance than sequences in which all four nucleotides are present at a frequency of approximately 0.25.
Example: Comparison of maize and rice hydroxyproline-rich glycoproteins
SSEARCH searches a sequence data bankThe alignment is deceptive, because the high percentages of Hydroxyproline, Lysine and Tyrosine ensure than a very large number of alignments could have very high scores. This is best illustrated in the dot-matrix plot of the same comparison:
version 35.04 Oct. 7, 2008
T. F. Smith and M. S. Waterman, (1981) J. Mol. Biol. 147:195-197;
W.R. Pearson (1991) Genomics 11:635-650
1>>>S20500 369 bp - 369 aa
Library: bio537987931014488970.tmp.seq2 350 residues in 1 sequences
350 residues in 1 sequences
Statistics: (shuffled ) MLE statistics: Lambda= 0.0325; K=6.695e+07
Algorithm: Smith-Waterman (SSE2, Michael Farrar 2006) (6.0 Mar 2007)
Parameters: BL50 matrix (15:-5), open/ext: -10/-2
Scan time: 0.090
The best scores are: s-w bits E(1)
S22456 350 bp ( 350) 1539 46.2 1.6e-09
>>S22456 350 bp (350 aa)
s-w opt: 1539 Z-score: 203.2 bits: 46.2 E(): 1.6e-09
Smith-Waterman score: 1539; 62.8% identity (73.0% similar) in 392 aa overlap (1-369:1-349)
10 20 30 40
:::. ::::..:::.:...: .:::: :: :::::::. :.:::::::
10 20 30 40 50
50 60 70 80 90 100
::::::..: :::::::::::: :.::: .: :::::::::: ::: :::::::..:
60 70 80 90 100 110
110 120 130 140 150 160
:.: . ::::: : :::::: :.::: : :: :::: : :.::: :: : :: :::
120 130 140 150 160
170 180 190 200 210 220
::: :.::: : :: :.:::: :.:::.:: :: :.::: ::: .:.:::: :.:::
170 180 190 200 210
230 240 250 260 270
: : : :::.:: :: :.::: ::: ::.:::: :.::: .: :::: :.::: ::
220 230 240 250 260 270
280 290 300 310 320 330
: ::::: :.::: :: ::: : ::.:.:: : : ::::: :: : : :::
280 290 300 310 320
340 350 360
:. :::::::: : :.::.:::::
330 340 350
It's easy to see that the HPRG proteins would show high scores with any protein containing a high proline content.P2HOM Version 5/13/91
SIMILARITY RANGE: 10 MIN.PERCENT SIMILARITY: 40
SCALE FACTOR: 0.90 COMPRESSION: 10
100 200 300
Y . . . .
W . . . .
WV de Z c b d . d cc .
N WSXZTTXYVVXW.bd bd bcLLWWWWY .
XQZX Uadbaea XcXYcYYYcaYeZcMaYa b .
IIK cd . .d c .
PIJcd. . . c .
NSKTSURXRZYZUW .c cY bYX aaZWe .
d QHIQPLLONJITVQTVVQRVQSRUURQKcc .
dZ NMMLLLSRROQQYUWZYUTbWXUQaPPNOe .
Va UPMSLXTVTVSVaZVd ZV aZVTaUPOUb .
dd LJPOTTTUTQQScVZccVZcUTOSRVLHW .
dd ZISOMTUXSPPSYaWYXaWWWXPTaZVIUZ .
bb QPQQQNVRTNSRXWWYXWTXYUXUYTVQZ .
ce aSYeYUYbadYbYbZYbaZaYXVWXXd cW .
e MLLPNOSSQONRYTVXYTSYSUTZXUTNZZ .
db ZNPSPVSVQSNTYXWYXXWWWWUTaRUOUa .
ed OHOPPLUPUKRPUSTVUSRUQRRUVRONYb .
eb dOMNOQNTRTNPRYWWYXWTWWUUTYRSNUe .
a aaMMPPOSaSPOSXZWYXZUXXUVUXdVNUZ .
ed dMLMRTPSSUPPabTabZTYbSWUZXUURa .
dd ZLLPNOSSQONRYaVXXaSWWUTWaZTNUZ .
eb dOMMOPNSRSNORXWWYXWTXWUUTYRSNZZ .
ab ccQNORVSXSVPWbXYbaXWYTXWUZUXOUa .
dd dRNSLLPTWRSOVYXVXXXSWWRVWaVVPbZ .
eb dOMNOQNTRTNPRYWWYYWTYWUUTYRSNUe .
dd ZLOSNPSXQONRYaVXXaSWWVTTaZTNUZ .
ad aSScWSYbadYbYbZYbaZaYXVWVXdccUe .
Z KJLMMOOSOPLSXTWYXTUXSUSUXUUKZZ .
Ye ZVOaVMWOXOXMVbZYbbZYaYVWTXYYJRe .
BLAST from NCBI preproccesses all sequences to mask low complexity regions [http://www.ncbi.nlm.nih.gov/blast/blastcgihelp.shtml#filter].
Particularly when using genomic DNA sequences in database searches, a smaller unique sequence is the best query sequence, rather than a long sequence.
BLAST from NCBI
automatically checks query DNA sequences for known repetitive
elements such as AluI, and flags the presence of these sequences
in the output.
|What if I use BLAST to
search the Human Genome with an AluI family sequence? (eg. HUMRSA27)
Do NOT try this at home! You'll just get the same results, and add load to the NCBI servers.
|blastn with defaults
||"No significant similarity
found" This is because BLAST automatically filters out known
repeats such as AluI.
|blastn; no filtering for
||268967 hits. For a 262 base
search sequence, this corresponds to 7 x 107
bases. Since the human genome is about 3 x 109
bp, a BLAST search would estimate that about 2.3% of the
human genome is AluI family sequences.
|blastn; no filtering for
species-specific repeats; discontiguous
megablast which breaks the query words into shorter
words, which in effect allows gaps within the larger query
||430533 hits. For a 262 base search sequence, this corresponds to 11.3 x 107 bases. Since the human genome is about 3 x 109 bp, a BLAST search would estimate that > 3.8% of the human genome is AluI family sequences.|
|Unless otherwise cited or referenced, all content on this page is licensed under the Creative Commons License Attribution Share-Alike 2.5 Canada|
Lecture 5, part 3 of 3