Database bias - Types of Queries

Most query systems are oriented toward retrieving small data entries, not toward generating statistics on the database as a whole.

Example: NCBI Entrez

Retrieve sequence entries based on BLAST scores

Retrieve sequence entries based on keyword search

What we need is a way to run through large sections of a database and tabulate statistics. For example:

What is the distribution of sequence lengths reported for plants?
Are some plant species overrepresented, and others underrepresented, in the databases?
For a given gene family, is the representation of genes in genomic sequences similar to that in EST populations (i.e., is there a bias toward strongly expressed genes)?
Are certain categories of genes reported more frequently than others?
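
As a rough illustration of the first question, the sketch below tabulates a length distribution from FASTA-formatted text. The function names (parse_fasta, length_histogram) and the bin size are assumptions made for this example; they are not part of GDE or Entrez.

```python
from collections import Counter


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def length_histogram(records, bin_size=100):
    """Count sequences per length bin; each key is the lower bound of a bin."""
    counts = Counter()
    for _header, seq in records:
        counts[(len(seq) // bin_size) * bin_size] += 1
    return dict(sorted(counts.items()))
```

The same loop could count records per species or per gene category by keying the counter on a field parsed from the header instead of on the length.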

Data pipeline for generating statistics on databases

We use the GDE interface by Steven Smith to call web services which provide the raw data.

In turn, the actual GenBank entries corresponding to the list of GI numbers can be retrieved.
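
Outside GDE, a comparable retrieval step can be sketched with the NCBI E-utilities efetch service. How GDE itself calls its web services is not described here, so this is only an assumed stand-in: the helper name efetch_url and the default parameters are choices made for this example. The function only builds the request URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(ids, db="nucleotide", rettype="gb"):
    """Build an efetch URL requesting GenBank flat-file text for a list of IDs.

    efetch_url is a hypothetical helper for this sketch; it performs no
    network access.
    """
    params = {
        "db": db,
        "id": ",".join(str(i) for i in ids),  # comma-separated identifier list
        "rettype": rettype,
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)
```

The resulting URL could be passed to urllib.request.urlopen to download the entries; long ID lists are better sent in batches.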

The new GDE window has menus for working with sequence data. In this fashion, the same GDE interface can be used to go back and forth between different types of data.

GDE is designed for rapid addition of new functions. GDE itself does nothing but display data and call external programs. Therefore, any existing program can be added to GDE's functionality, simply by adding a menu specification.
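
The idea of a thin front end that only hands data to external programs can be sketched as follows. The MENU table and run_menu_item are hypothetical and do not reproduce GDE's actual menu specification syntax; the example command simply counts the lines it receives on standard input.

```python
import subprocess
import sys

# Hypothetical menu table: label -> external command (argument list).
# This is NOT the GDE menu file format, only an illustration of the idea.
MENU = {
    "Count lines": [
        sys.executable,
        "-c",
        "import sys; print(len(sys.stdin.read().splitlines()))",
    ],
}


def run_menu_item(label, data, menu=MENU):
    """Send the selected data to the external program and return its output."""
    result = subprocess.run(
        menu[label],
        input=data,          # selected data goes to the program's stdin
        capture_output=True,
        text=True,
        check=True,          # raise if the external program fails
    )
    return result.stdout
```

Under this design, adding a function means adding one entry to the table; the front end itself does not change.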