Classifying Coding DNA with Nucleotide Statistics
Nicolas Carels and Diego Frias
Presentation
Carels, N. and Frias, D. (2009). Classifying coding DNA with nucleotide statistics. Bioinformatics and Biology Insights 3:141-154.
Outline:
- Importance of accurate methods for the detection of coding DNA
- Methods which have been used to identify CDS
- Based on codon usage (Hidden Markov Models)
- Nucleotide periodicity
- Detection of ancestral codon, RNY pattern
- Methodology behind UFM
- Results
- Discussion
- Scoring Purine Bias with UFM
- Comparison of CSF and UFM
- Comparison of the Classification of Coding and Non-coding ORF by UFM
- References
1. Importance of accurate methods for the detection of coding DNA
Methods of Gene Detection
Extrinsic Methods
- search sequences for documented protein families (homology search)
- depends on accurate input sequences; may miss proteins due to low conservation of enzyme regions
Intrinsic Methods
- search sequence for patterns
- identification of CDS, introns, 5' and 3' gene extremities, and
gene structure
What is CDS?
The coding sequence (CDS) region of a gene is a sequence of nucleotides that corresponds to a sequence of amino acids in a protein. A typical CDS starts with ATG and ends in a stop codon (1).
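
The start/stop rule above can be checked mechanically. The following is a minimal sketch, not part of the paper: the function name looks_like_typical_cds is hypothetical, the stop codons are those of the standard genetic code, and the extra requirements (length divisible by three, no earlier in-frame stop) are assumptions added for this illustration.

```python
# Stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def looks_like_typical_cds(seq):
    """Return True if seq starts with ATG and ends in a stop codon.

    Assumptions of this sketch (not stated in the slides): the length
    must be a multiple of three and no stop codon may occur in frame
    before the final codon.
    """
    seq = seq.upper()
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    # Split the sequence into consecutive, non-overlapping codons.
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    # Reject sequences with a premature in-frame stop codon.
    return not any(codon in STOP_CODONS for codon in codons[1:-1])


if __name__ == "__main__":
    print(looks_like_typical_cds("ATGGCCAAATGA"))
```

A real CDS may deviate from this pattern (alternative start codons, partial sequences), so the check only captures the "typical" case described above.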

Importance
A large proportion of a DNA sequence is non-coding, and identifying the coding regions is critical in determining which areas of the genome code for particular proteins. Due to advances in sequencing technologies, there is a large amount of genomic data that needs to be searched to identify genes. This has created a demand for automated programs that can accurately and quickly identify the coding regions, thereby providing insight into the function of a gene.
2. Methods for CDS Identification
A. Codon Usage
Hidden Markov Models
- integration of information about the gene structure
- accuracy depends on the training data being representative
B. Nucleotide Periodicity
Average Mutual Information and Spectral Rotation Measure
- independent of biological species
- tolerant to codon usage
- based on nucleotide statistics
- not very sensitive for CDS regions below 400bp
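
To make "nucleotide statistics" concrete, here is a minimal sketch of the mutual information between nucleotides separated by a fixed lag. It is a generic textbook quantity, not the Average Mutual Information or Spectral Rotation Measure implementations referred to above; the function name mutual_information and the choice of base-2 logarithms are assumptions of this example.

```python
from collections import Counter
from math import log2


def mutual_information(seq, lag):
    """Mutual information (in bits) between positions i and i + lag.

    Probabilities are plain relative frequencies estimated from the
    sequence itself; no correction for short sequences is applied.
    """
    seq = seq.upper()
    pairs = [(seq[i], seq[i + lag]) for i in range(len(seq) - lag)]
    if not pairs:
        return 0.0
    n = len(pairs)
    joint = Counter(pairs)                  # counts of (x, y) pairs
    left = Counter(a for a, _ in pairs)     # counts of the first symbol
    right = Counter(b for _, b in pairs)    # counts of the second symbol
    return sum(
        (count / n) * log2((count / n) / ((left[a] / n) * (right[b] / n)))
        for (a, b), count in joint.items()
    )


if __name__ == "__main__":
    # Compare a lag that is a multiple of three with one that is not.
    example = "ATGGCCAAAGCTGATGCCAAGGCTTGA"
    print(mutual_information(example, 3), mutual_information(example, 4))
```

Periodicity-based approaches generally look for a signal that repeats every three positions; comparing values at lags that are multiples of three with other lags is one simple way to probe for it. Estimates from short sequences are noisy, which is in line with the reduced sensitivity noted above for short CDS regions.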
C. Detection of Ancestral Codon, RNY pattern (CSF and UFM)
- greater success with CDS regions below 350bp
- based on nucleotide statistics
Codon Structure Factor (CSF)
- measures codon asymmetry in 3 reading frames
- maximizes a function; a set threshold determines the sequence classification (coding or not)
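
The idea of comparing the three reading frames can be illustrated with a simple count of RNY codons (R = purine, A or G; N = any nucleotide; Y = pyrimidine, C or T). This is only a sketch: the slides do not give the CSF function, so the code below is not the CSF, and the name rny_fraction_by_frame is hypothetical.

```python
PURINES = set("AG")      # R in IUPAC notation
PYRIMIDINES = set("CT")  # Y in IUPAC notation


def rny_fraction_by_frame(seq):
    """Return the fraction of RNY codons in each of the three frames.

    Index 0, 1 and 2 of the result correspond to reading the sequence
    starting at the first, second and third nucleotide respectively.
    """
    seq = seq.upper()
    fractions = []
    for frame in range(3):
        # Only complete codons are considered; a trailing partial codon is dropped.
        codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
        hits = sum(
            1 for codon in codons
            if codon[0] in PURINES and codon[2] in PYRIMIDINES
        )
        fractions.append(hits / len(codons) if codons else 0.0)
    return fractions


if __name__ == "__main__":
    fractions = rny_fraction_by_frame("GCTACCGAT")
    best_frame = fractions.index(max(fractions))
    print(fractions, best_frame)
```

A frame whose RNY fraction stands out from the other two is a candidate coding frame; how the asymmetry is turned into a score and a threshold is specific to each method.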
Universal Feature Method (UFM)
- independent of codon usage
- able to classify the coding frame among the six possible frames of an ORF without parameter adjustment
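
The six frames mentioned above are the three offsets on the given strand plus the three offsets on its reverse complement. The sketch below only enumerates them; it does not implement UFM scoring, and the function names and the "+1" to "-3" frame labels are assumptions of this example.

```python
# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Map frame labels (+1, +2, +3, -1, -2, -3) to lists of codons."""
    frames = {}
    strands = (("+", seq.upper()), ("-", reverse_complement(seq)))
    for label, strand in strands:
        for offset in range(3):
            # Keep complete codons only; trailing partial codons are dropped.
            codons = [
                strand[i:i + 3]
                for i in range(offset, len(strand) - 2, 3)
            ]
            frames[f"{label}{offset + 1}"] = codons
    return frames


if __name__ == "__main__":
    for name, codons in six_frames("ATGGCC").items():
        print(name, codons)
```

A frame classifier would then score each of the six codon lists and pick the best one.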