newcpgseek

Function

Description

newcpgseek identifies regions of a nucleotide sequence that have a higher than expected frequency of the dinucleotide CG. Each position in the sequence is scored using a running sum calculated from all positions in the sequence. This differs from the method typically used for identifying CpG islands, for example by newcpgreport and cpgplot. The running-sum method overpredicts islands but finds the smaller ones around primary exons. An output file is written with information on the CpG-rich regions that are found. A feature table of sequence features in these regions is also written.

Algorithm

newcpgseek scores each position in the sequence using a running sum calculated from all positions in the sequence, starting with the first and ending with the last. If there is no CG dinucleotide at a position, the score is decremented; if there is one, the score is incremented by a constant (user-defined) value. If the score for a region in the sequence is higher than a threshold (currently 17), a putative island is declared. Sequence regions scoring above the threshold are searched recursively.
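The running sum described above can be illustrated with a minimal Python sketch. This is not the newcpgseek source code. The function names, the increment of 17, the decrement of 1, and the floor at zero are assumptions made for illustration. Only the threshold of 17 comes from the text. The sketch simply flags every run of positions whose running sum is above the threshold and does not reproduce the recursive search.

```python
def running_sum_scores(seq, cg_bonus=17, penalty=1):
    """Return the running-sum score at each position of seq.

    cg_bonus and penalty are illustrative values, not documented defaults.
    """
    seq = seq.upper()
    scores = []
    total = 0
    for i in range(len(seq)):
        if seq[i:i + 2] == "CG":
            total += cg_bonus                # CG dinucleotide starts here
        else:
            total = max(0, total - penalty)  # assumption: score never drops below 0
        scores.append(total)
    return scores


def putative_islands(seq, threshold=17, cg_bonus=17, penalty=1):
    """Return (start, end) pairs, 0-based with end exclusive, for each
    run of positions whose running sum is above threshold."""
    islands = []
    start = None
    for i, score in enumerate(running_sum_scores(seq, cg_bonus, penalty)):
        if score > threshold and start is None:
            start = i                        # run begins
        elif score <= threshold and start is not None:
            islands.append((start, i))       # run ends
            start = None
    if start is not None:
        islands.append((start, len(seq)))    # run reaches end of sequence
    return islands
```

For example, putative_islands("ATCGCGAT") scores the two CG dinucleotides and reports a single run beginning at the second one, where the sum first exceeds the threshold.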

Usage

Command line arguments


Input file format

newcpgseek reads one or more nucleotide sequences.

Output file format

Data files

None.

Notes

"CpG" refers to a C nucleotide immediately followed by a G. The 'p' in 'CpG' refers to the phosphate group linking the two bases. Regions of genomic sequence rich in the CpG pattern, or "CpG islands", are resistant to methylation and tend to be associated with genes that are frequently switched on. It has been estimated that about half of all mammalian genes, and possibly all mammalian housekeeping genes, have a CpG-rich region around their 5' end. Non-mammalian vertebrates have some CpG islands that are associated with genes, but the association becomes equivocal in more distant taxonomic groups. The detection of a CpG island upstream of predicted exons or genes is evidence in support of a highly expressed gene.

As there is no official definition of what a CpG island is or of how to identify where one begins and ends, we work with two definitions and thus two methods. These are:

1. cpgplot and newcpgreport use a sliding window within which the Observed/Expected ratio of CpG is calculated. For a sequence region to be reported as a CpG island, it must satisfy the following constraints:

   Observed/Expected ratio > 0.6
   % C + % G > 50%
   Sequence Length > 200

2. newcpgseek and cpgreport use a running sum calculated from all positions in a sequence, rather than a window, to produce a score. If there is no CG dinucleotide at a position, the score is decremented; if there is one, the score is incremented by a constant (user-defined) value. If the score for a region in the sequence is higher than a threshold (currently 17), a putative island is declared. Sequence regions scoring above the threshold are searched recursively.

This method overpredicts islands but finds the smaller ones around primary exons. newcpgseek uses the same method as cpgreport but the output is different and more readable. For most purposes you should probably use newcpgreport rather than cpgreport. It is used to produce the human cpgisland database you can find on the EBI's ftp server as well as on the EBI's SRS server.

newcpgseek and cpgreport both now display the actual CpG count, the (%C + %G) and the Observed/Expected ratio in the region where the score is above the threshold.
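As an illustration of these three statistics, the sketch below computes them for a single region and applies the window-method constraints listed under definition 1. It is a sketch under stated assumptions, not the code used by any of these programs. The function names are hypothetical, and the expected CpG count is assumed to be (number of C x number of G) / region length, which is the commonly used formula but is not stated in this document.

```python
def cpg_stats(region):
    """Return (CpG count, %C + %G, Observed/Expected ratio) for region."""
    region = region.upper()
    n = len(region)
    c = region.count("C")
    g = region.count("G")
    cpg = region.count("CG")                 # CG cannot overlap itself
    pct_gc = 100.0 * (c + g) / n if n else 0.0
    # Assumption: expected CpG count = (C count * G count) / length
    expected = c * g / n if n else 0.0
    obs_exp = cpg / expected if expected else 0.0
    return cpg, pct_gc, obs_exp


def meets_window_criteria(region):
    """Apply the three constraints listed for the sliding-window method."""
    _, pct_gc, obs_exp = cpg_stats(region)
    return obs_exp > 0.6 and pct_gc > 50.0 and len(region) > 200
```

A region that has a high ratio and a high (%C + %G) can still fail the length constraint, which is one reason the two methods report different sets of islands.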

The geecee program measures CG content across the entire input sequence and should not be used to detect CpG islands. It can be useful for detecting sequences that might contain an island.

References

None.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It always exits with a status of 0.

Known bugs

None.

Author(s)

History

Target users

Comments