GeneParser              GeneParser Reference Manual              GeneParser

NAME
     gp - parse a DNA sequence into introns and exons

SYNOPSIS
     gp -s file -c flat [-m max_length] [-y val] [-z val]

DESCRIPTION
     gp reads a DNA sequence file and writes the predicted introns and
     exons to standard output.

     The following flags identify the required input parameters:

     -s   Must precede the sequence file name.

     -c   Must precede the file type, e.g. flat to indicate a flat
          sequence file. At present, only the flat file format is
          supported. (See INPUT FILE FORMAT.)

     The following options may be used:

     -m   Maximum length of any intron or exon to consider. For
          example, if the sequence is known to contain no introns or
          exons longer than max_length, only the L-matrix values for
          subsequences below this maximum are calculated, and the
          dynamic programming (DP) step searches only for candidate
          sequences shorter than this length. Using this parameter
          can significantly decrease run time.

     -y   Add a constant to the exon L-matrix values. A positive value
          makes the program more selective; a negative value makes it
          more sensitive. Recommended values range from -0.2 to +0.2.

     -z   Add a constant to the intron L-matrix values. A positive
          value makes the program more selective; a negative value
          makes it more sensitive. Recommended values range from -0.2
          to +0.2.
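     When gp is driven from a script, the flags described above can be
     assembled programmatically. The following is a minimal Python
     sketch, not part of GeneParser itself. The function names
     build_gp_command and run_gp are hypothetical, the range warning
     is a local convenience based on the recommended values, and the
     sketch assumes that gp is on the PATH and accepts numeric values
     written in plain decimal form.

```python
import subprocess
import warnings
from typing import List, Optional


def build_gp_command(seq_file: str,
                     max_length: Optional[int] = None,
                     exon_offset: Optional[float] = None,
                     intron_offset: Optional[float] = None) -> List[str]:
    """Assemble an argument list that follows the documented SYNOPSIS."""
    # -s and -c are required; "flat" is the only documented file type.
    cmd = ["gp", "-s", seq_file, "-c", "flat"]

    if max_length is not None:
        if max_length <= 0:
            raise ValueError("max_length must be a positive integer")
        cmd += ["-m", str(max_length)]

    # -y shifts exon L-matrix values, -z shifts intron L-matrix values.
    for flag, value in (("-y", exon_offset), ("-z", intron_offset)):
        if value is None:
            continue
        # The manual recommends -0.2 to +0.2. Warning here is an
        # assumption of this sketch, not documented gp behaviour.
        if not -0.2 <= value <= 0.2:
            warnings.warn(f"{flag} {value} is outside the recommended range")
        cmd += [flag, str(value)]
    return cmd


def run_gp(cmd: List[str]) -> str:
    """Run gp and return its standard output (assumes gp is installed)."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


print(" ".join(build_gp_command("HUMALPI", max_length=500)))
```

     Building the command as a list rather than a single string avoids
     shell quoting problems when the sequence file name contains
     spaces.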

     Thus:

          gp -s HUMALPI -c flat -m 500

     generates the following output:

     GeneParser:              *********** T-matrix Score *************
      Position    Class  RF   Donor  Accep  IF-6   LCC    Len    6-tup  L-mat
        49   231  intron  0   0.765  0.858  0.433  0.755  0.900  0.447  +0.206
       232   362  exon    1   0.765  0.871  0.592  0.784  1.000  0.439  -0.147
       363   444  intron  1   0.816  0.871  0.529  0.756  1.000  0.460  +0.218
       445   561  exon    2   0.816  0.932  0.654  0.794  0.530  0.437  -0.096
       562   675  intron  0   0.917  0.932  0.505  0.798  0.900  0.456  +0.247
       676   791  exon    2   0.917  0.967  0.586  0.769  0.530  0.473  -0.125
       792   994  intron  2   0.871  0.967  0.407  0.749  0.940  0.433  +0.259
       995  1169  exon    1   0.871  0.944  0.598  0.824  0.880  0.480  -0.110
      1170  1245  intron  1   0.869  0.944  0.495  0.783  1.000  0.484  +0.236
      1246  1418  exon    2   0.869  0.737  0.607  0.759  0.880  0.478  -0.185
      1419  1660  intron  2   0.862  0.737  0.457  0.743  0.940  0.389  +0.236
      1661  1808  exon    1   0.862  0.873  0.577  0.773  1.000  0.445  -0.145
      1809  2093  intron  0   0.924  0.873  0.515  0.784  0.940  0.397  +0.263
      2094  2228  exon    1   0.924  0.853  0.599  0.753  1.000  0.459  -0.136
      2229  2312  intron  0   0.896  0.853  0.540  0.666  1.000  0.463  +0.237
      2313  2504  exon    1   0.896  0.803  0.609  0.779  0.620  0.496  -0.170
      2505  2727  intron  1   0.892  0.803  0.389  0.784  0.940  0.446  +0.238
      2728  2844  exon    2   0.892  0.906  0.564  0.837  0.530  0.506  -0.158
      2845  2955  intron  0   0.862  0.906  0.513  0.747  0.900  0.471  +0.224
      2956  3333  exon    2   0.862  0.775  0.620  0.770  0.000  0.463  -0.193
      3334  3551  intron  2   0.841  0.775  0.521  0.760  0.940  0.425  +0.219

     Position  beginning and ending nucleotides of sequence, inclusive
     Class     sequence class, intron or exon
     RF        Exon: reading frame relative to beginning of sequence
               Intron: intron length mod 3
     Donor     donor site
     Accep     acceptor site
     IF-6      in-frame 6-tuple log-likelihood
     LCC       local compositional complexity
     Len       exon and intron length distribution
     6-tup     exon and intron 6-tuple log-likelihood
     L-mat     L-matrix score for subsequence

INPUT FILE FORMAT
     The input sequence file should contain only the sequence to be
     analyzed.
     Currently, the following characters are recognized as being part
     of the sequence: [acgtnryACGTNRY]. All other characters,
     including numbers and white space, are ignored. Ambiguous
     positions are assigned randomly according to the IUB code for
     incompletely specified bases.

LIMITATIONS
     GeneParser was trained on a collection of 60 human gene fragments
     containing either intron or internal exon sequences. Thus, the
     program performs best on sequences that contain no terminal exons
     or intergenic DNA. If asked to predict terminal exons, GeneParser
     will usually miss the distal (non-splice site) boundaries.
     Intergenic DNA will often be parsed into a series of introns and
     exons with low L-matrix scores.

     L-matrix scores must be interpreted with care. The larger the
     score, the better the exon or intron, but exons will still tend
     to have negative scores.

     The parsing of long sequences can be time consuming.

FILES
     gp must have access to the following files:

     acceptor_primate             primate acceptor site weight matrix
     donor_primate                primate donor site weight matrix
     exon_lengths                 exon length distribution data
     intron_lengths               intron length distribution data
     primate_6tuple_freq          primate 6-tuple frequency data
     primate_inframe_6tuple_freq  primate in-frame 6-tuple data
     fmc36x_exon.wts              weights for L_E calculation
     fmc36x_intron.wts            weights for L_I calculation
     hum.cod                      codon translations
     parameter_file               parameters for running gp

     The path names that gp uses to find these files should be set in
     parameter_file. This file can be stored in the current working
     directory or in the directory specified by the environment
     variable GP_HOME.

AUTHORS
     GeneParser was developed by Eric E. Snyder and Gary Stormo at the
     University of Colorado at Boulder, Department of Molecular,
     Cellular and Developmental Biology. The authors can be reached by
     electronic mail at eesnyder@boulder.colorado.edu or
     stormo@boulder.colorado.edu.

REFERENCE
     Snyder, E. E., Stormo, G. D.
     (1993) Identification of Coding Regions in Genomic DNA
     Sequences: An Application of Dynamic Programming and Neural
     Networks. Nucleic Acids Research 21(3): 607-613.

RELEASE
     GeneParser version 1.0.1, June 11, 1993.
     Copyright 1993 Eric E. Snyder.
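
     The output described under DESCRIPTION is whitespace-delimited,
     so it can be loaded into records for further filtering. The
     following is a minimal Python sketch, not part of GeneParser. It
     assumes that every data row has the eleven fields shown in the
     example output (two positions, the class, RF, and seven scores)
     and that header lines can be recognized by not matching that
     shape. The name parse_gp_output and the dictionary keys are
     hypothetical.

```python
from typing import Dict, List, Union

# Score columns that follow RF, in the order given by the output legend.
SCORE_COLUMNS = ["Donor", "Accep", "IF-6", "LCC", "Len", "6-tup", "L-mat"]

Record = Dict[str, Union[int, float, str]]


def parse_gp_output(text: str) -> List[Record]:
    """Turn gp's tabular output into a list of per-segment records."""
    records: List[Record] = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the banner, the column header, and blank lines: only data
        # rows have 11 fields with "intron" or "exon" in the third one.
        if len(fields) != 11 or fields[2] not in ("intron", "exon"):
            continue
        record: Record = {
            "start": int(fields[0]),   # first nucleotide, inclusive
            "end": int(fields[1]),     # last nucleotide, inclusive
            "class": fields[2],
            "RF": int(fields[3]),
        }
        for name, value in zip(SCORE_COLUMNS, fields[4:]):
            record[name] = float(value)  # float() accepts a leading "+"
        records.append(record)
    return records


sample = """\
GeneParser:              *********** T-matrix Score *************
 Position    Class  RF   Donor  Accep  IF-6   LCC    Len    6-tup  L-mat
   49   231  intron  0   0.765  0.858  0.433  0.755  0.900  0.447  +0.206
  232   362  exon    1   0.765  0.871  0.592  0.784  1.000  0.439  -0.147
"""

exons = [r for r in parse_gp_output(sample) if r["class"] == "exon"]
for exon in exons:
    print(exon["start"], exon["end"], exon["L-mat"])
```

     Because exons tend to have negative L-matrix scores, any
     threshold applied to these records should be chosen separately
     for exons and introns.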