What kinds of biases exist in biological databases?
Model species
Coding vs. noncoding
Strongly expressed genes
Redundancy
Length
cDNA vs. genomic
Sampling error
Automated annotation favors:
  known protein families
  smaller genes with few exons
Data pipeline
Bias can be found at all levels. For example, the distribution of sequences
in the Vertebrate division of GenBank (i.e., excluding mammals) is
dominated by four genera: