How do we measure biases?
Simple Metrics:
bp/taxon
#genes/taxon
#copies/taxon
median length of sequences/taxon
features annotated/kb
and probably many more...
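As a rough illustration, a few of the metrics above can be computed from a table of sequence records. This is only a sketch: the record layout (taxon, gene name, length in bp), the sample values, and the function name `simple_metrics` are assumptions made for the example, not part of any GenBank tool.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records: (taxon, gene name, sequence length in bp).
records = [
    ("Mus musculus", "geneA", 1200),
    ("Mus musculus", "geneA", 800),
    ("Mus musculus", "geneB", 400),
    ("Homo sapiens", "geneA", 1500),
]

def simple_metrics(records):
    """Return per-taxon totals: bp, distinct genes, sequence count, median length."""
    lengths = defaultdict(list)   # taxon -> list of sequence lengths
    genes = defaultdict(set)      # taxon -> set of distinct gene names
    for taxon, gene, length in records:
        lengths[taxon].append(length)
        genes[taxon].add(gene)
    return {
        taxon: {
            "bp": sum(lens),
            "genes": len(genes[taxon]),
            "sequences": len(lens),
            "median_length": median(lens),
        }
        for taxon, lens in lengths.items()
    }

metrics = simple_metrics(records)
```

Comparing these per-taxon values side by side is one simple way to see which taxa dominate a dataset.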
Example: The size distribution of mouse sequences in the GenBank database follows an extreme value distribution, characterized by:
Skew toward low values
A very long right-hand tail of large values
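Both features can be checked numerically. The sketch below, using made-up lengths rather than real GenBank data, compares the mean with the median and computes the sample skewness (third central moment divided by the variance to the power 3/2). A mean well above the median and a positive skewness both point to a long right-hand tail.

```python
from statistics import mean, median

def skewness(values):
    """Sample skewness g1 = m3 / m2**1.5 (population moments)."""
    mu = mean(values)
    n = len(values)
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical sequence lengths (bp): mostly short, one very long.
lengths = [300, 350, 400, 450, 500, 600, 25000]

right_skewed = mean(lengths) > median(lengths) and skewness(lengths) > 0
```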
We are evaluating several statistical tests for measuring differences in the distributions of data between any two datasets.
Kolmogorov-Smirnov Test: Measures the maximum vertical distance between the empirical cumulative distribution functions of the two datasets.
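Cramér-von Mises Test: Measures the difference between two distributions by summing the squared differences between their empirical cumulative distribution functions over the whole range, so it reflects the area between the curves rather than a single point.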
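To make the contrast concrete, here is a minimal sketch of both two-sample statistics in plain Python. It computes the statistics only, not p-values. The function names are illustrative, and the Cramér-von Mises form used (the sum of squared ECDF differences over the pooled sample, scaled by NM/(N+M)^2) is one common definition, assumed here rather than taken from the evaluation described above.

```python
from bisect import bisect_right

def ecdf(sorted_sample, x):
    """Fraction of the sample that is <= x (sample must be sorted)."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest ECDF gap."""
    a, b = sorted(a), sorted(b)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)

def cvm_statistic(a, b):
    """Two-sample Cramér-von Mises statistic: scaled sum of squared ECDF gaps."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    total = sum((ecdf(a, x) - ecdf(b, x)) ** 2 for x in a + b)
    return n * m / (n + m) ** 2 * total

# Hypothetical sequence lengths (bp) from two datasets.
dataset_1 = [1, 2, 3, 4]
dataset_2 = [3, 4, 5, 6]

d = ks_statistic(dataset_1, dataset_2)
t = cvm_statistic(dataset_1, dataset_2)
```

The Kolmogorov-Smirnov value depends only on the single largest gap, while the Cramér-von Mises value accumulates gaps across all pooled data points, so the two can rank pairs of datasets differently.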