Most query systems are oriented toward retrieving small data entries, not
generating statistics on the database as a whole.
Example: NCBI Entrez
  Retrieve sequence entries based on BLAST scores
  Retrieve sequence entries based on keyword search
What we need is a way to run through large sections of a database and
tabulate statistics. For example:
What is the distribution of sequence lengths reported for plants?
Are some plant species overrepresented, and others underrepresented, in the databases?
For a given gene family, is the representation of genes in genomic sequences similar to that in EST populations (i.e., is there a bias toward strongly expressed genes)?
Are certain categories of genes reported more frequently than others?
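
As a rough illustration of this kind of tabulation, the sketch below streams over GenBank flat-file text and counts sequence lengths (in bins) and entries per organism. It assumes that each record has a LOCUS line whose third field is the sequence length and an ORGANISM line naming the species. The function name tabulate_genbank, the bin size, and the sample records are made up for this example and are not part of the pipeline described here.

```python
from collections import Counter
import io

def tabulate_genbank(lines, bin_size=500):
    """Count sequence-length bins and organisms in GenBank flat-file lines."""
    length_bins = Counter()
    organisms = Counter()
    for line in lines:
        if line.startswith("LOCUS"):
            # Assumed layout: LOCUS <name> <length> bp ...
            length = int(line.split()[2])
            length_bins[(length // bin_size) * bin_size] += 1
        elif line.lstrip().startswith("ORGANISM"):
            parts = line.split(None, 1)
            if len(parts) == 2:
                organisms[parts[1].strip()] += 1
    return length_bins, organisms

# Made-up records, trimmed to the two lines the function reads.
SAMPLE = """\
LOCUS       AB000001     1200 bp    mRNA    linear   PLN 01-JAN-2000
  ORGANISM  Arabidopsis thaliana
//
LOCUS       AB000002      450 bp    mRNA    linear   PLN 01-JAN-2000
  ORGANISM  Oryza sativa
//
LOCUS       AB000003     1350 bp    DNA     linear   PLN 01-JAN-2000
  ORGANISM  Arabidopsis thaliana
//
"""

length_bins, organisms = tabulate_genbank(io.StringIO(SAMPLE))
for lower in sorted(length_bins):
    print(f"{lower}-{lower + 499} bp: {length_bins[lower]}")
for name, count in organisms.most_common():
    print(f"{name}: {count}")
```

Because the function only iterates over lines, it can be given an open file handle for a large database dump without loading the whole file into memory.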
Data pipeline for generating statistics on databases
We use the GDE interface by Steven Smith to call web services that provide
the raw data.
In turn, the actual GenBank entries corresponding to the list of GI
numbers can be retrieved.
The new GDE window has menus for working with sequence data. In this
fashion, the same GDE interface can be used to go back and forth
between different types of data.
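
The text does not say which web service performs the retrieval of GenBank entries for a list of GI numbers. As one possible sketch, the code below assumes the NCBI E-utilities EFetch endpoint and assumes that it accepts GI numbers as identifiers. The helper names build_efetch_url and fetch_genbank and the example GI numbers are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: NCBI E-utilities EFetch is the web service being called.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def build_efetch_url(gi_numbers, db="nuccore"):
    """Build a request URL for GenBank flat-file text for the given IDs."""
    params = {
        "db": db,
        "id": ",".join(str(gi) for gi in gi_numbers),
        "rettype": "gb",     # GenBank flat-file format
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)

def fetch_genbank(gi_numbers, db="nuccore"):
    """Download the entries as text (needs network access; not called here)."""
    with urlopen(build_efetch_url(gi_numbers, db)) as response:
        return response.read().decode("utf-8")

# Placeholder GI numbers, used only to show the resulting URL.
url = build_efetch_url([12345, 67890])
print(url)
```

The text returned by fetch_genbank would be in GenBank flat-file format, so it could be passed line by line to a tabulating routine.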
GDE is designed for rapid addition of new functions. GDE itself does
nothing but display data and call external programs. Therefore, any
existing program can be added to GDE's functionality simply by adding a
menu specification.
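
The design idea can be sketched in a few lines of Python. This is not GDE's actual menu file syntax; it is a hypothetical table that maps a menu item name to an external command, with a small runner that writes the selected data to a temporary file and calls the program. The names MENU and run_menu_item and the "{infile}" placeholder are invented for this sketch.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical menu table: item name -> external command.
# "{infile}" is replaced by the path of a file holding the selected data.
MENU = {
    "Count lines": [
        sys.executable, "-c",
        "import sys; print(sum(1 for _ in open(sys.argv[1])))",
        "{infile}",
    ],
}

def run_menu_item(name, data):
    """Write data to a temporary file, run the external program, return stdout."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
        handle.write(data)
        infile = handle.name
    try:
        command = [part.replace("{infile}", infile) for part in MENU[name]]
        result = subprocess.run(command, capture_output=True, text=True, check=True)
        return result.stdout
    finally:
        os.remove(infile)

output = run_menu_item("Count lines", "ACGT\nTTGA\n")
print(output.strip())
```

In this sketch, adding a new function means adding one entry to MENU; the runner itself does not change, which mirrors the point that the interface only displays data and calls external programs.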