FEATURES                                          update 14 June 2024

NAME
     FEATURES - extracts features from GenBank entries

SYNOPSIS
     features.py expression
     features.py [-f featurekey | -F keyfile]
                 [-n name | -a accession | -e expression |
                  -N namefile | -A accfile | -E expfile]
                 [-u dbfile | -U dbfile | -g]
     features.py -h

DESCRIPTION
     FEATURES extracts sequence objects from GenBank entries, using the
     Features Table language. Features can be retrieved either by
     specifying keywords (eg. CDS, mRNA, exon, intron etc.) or by
     evaluating expressions. In practical terms, FEATURES is a wrapper
     for GETOB, which performs the actual parsing and extraction of
     sequence objects.

     'features' followed by an expression retrieves the data directly
     from GenBank and evaluates the expression. The second form of
     features requires all arguments to be accompanied by their
     respective option flags. Finally, 'features -h' prints the
     SYNOPSIS.

     Feature keys:
          -f key         {feature key}
          -F filename    {file of feature keys}

     The following example of a feature key file would retrieve both
     tRNA and rRNA sequences:

          OBJECTS
          tRNA
          rRNA
          SITES

     The words 'OBJECTS' and 'SITES' must enclose the feature keywords,
     and each keyword must be on a separate line. For a rigorous
     definition of the input file format, see the GETOB manual pages
     (getob.txt). A complete list of legal feature keywords can be found
     in the GenBank Release notes (gbrel.txt) under the subheading
     'Feature Key Names'.

     Entries:
          -n name        {GenBank LOCUS name}
          -N filename    {file of GenBank LOCUS names}
          -a accession   {GenBank ACCESSION number}
          -A filename    {file of GenBank ACCESSION numbers}
          -e expression  {Feature Table expression}
          -E filename    {file of Feature Table expressions, each
                          beginning with '@'}

     e    Expressions take the form

               accession:location

          where accession refers to a GenBank accession number, while
          location is any legal feature location. A brief description
          of location syntax can be found under the subheading "Feature
          Location" in the GenBank release notes (gbrel.txt).
          See "The DDBJ/EMBL/GenBank Feature Table: Definition" for a
          complete definition.

     E    name of a file containing one or more Feature expressions.
          EACH EXPRESSION MUST BEGIN WITH A '@'. All lines beginning
          with '@' are processed as expressions, and all other lines
          are copied to the output file unchanged.

     Databases:
          -u filename    {GenBank dataset}
          -U filename    {GenBank dataset, process all entries
                          ie. -nNaAeE options will be ignored}
          -g             {GenBank}

     For example:

          features.py -f tRNA -a M81884

     In the example, FEATURES was instructed to retrieve all tRNAs from
     the GenBank entry EPFCPCG with Accession number M81884, which
     contains the Epifagus plastid genome. By default, the GenBank
     database was the source of the sequence. A log describing the
     extraction of each feature is written to M81884.msg, while the
     extracted features themselves are written to M81884.out. Feature
     expressions which could be used by FEATURES to reconstruct the .out
     file are written to M81884.exp.

     The first step is to retrieve the EPFCPCG entry from GenBank, which
     is accomplished by calling seqfetch.py. Next, FEATURES extracts the
     specified features from the entry. An excerpt from M81884.msg is
     shown below, describing the extraction of the fifth tRNA found in
     this entry. To create this tRNA, two exons had to be joined. The
     qualifier line associated with this feature indicates that it is a
     Histidine tRNA with a gtg anticodon.

          EPFCPCG:anticodon gtg
          complement ( join ( 70023 70028 1 69 ) )
          /product="transfer RNA-His"
          /gene="His-tRNA"
          /label=anticodon gtg
          /note="anticodon gtg"
          //----------------------------------------------

     The actual sequence for this feature, as written to M81884.out, is
     written with each exon beginning a new line:

          >EPFCPCG:anticodon gtg
          ggcggatgtagccaaatggatcaaggtagtggattgtgaatccaacatat
          gcgggttcaattcccgtcg
          ttcgcc

     Finally, the expression that was evaluated to create this feature
     is written to M81884.exp:

          >EPFCPCG:anticodon gtg
          @M81884:anticodon gtg

     If M81884.exp were used as an expression file (-E) for FEATURES,
     M81884.out would be recreated.

     Examples:

     Each feature is found by its qualifier line from the original
     entry. Note that the qualifier line must differ from all others in
     the same entry within its first 15 characters after the '='.

     The flaL protein coding region of B. licheniformis is described in
     GenBank entry BLIFALA, accession number M60287, in the following
     feature:

          CDS             305..640
                          /gene="flaL"
                          /note="flaD (sin) homologue; putative; label: ORF2"
                          /codon_start=1
                          /transl_table=11
                          /protein_id="AAA22439.1"

     This feature could be retrieved using any of the following
     expressions:

          features.py 'M60287:305..640'
          features.py 'M60287:/gene="flaL"'
          features.py 'M60287:/note="flaD (sin) homologue; putative; label: ORF2"'

     Some locations contain characters that have a special meaning to
     the shell. It is therefore safest to enclose expressions in single
     quotes, as shown above, when typing them on the command line.

     DATABASE (WHERE TO GET IT) - By default, all entries processed will
     be automatically retrieved from GenBank using seqfetch.py.
     Specifying 'u' (User-defined database subset) makes it possible to
     extract features from GenBank subsets created by the user.
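
     When FEATURES is driven from a script rather than typed at a
     prompt, the quoting advice above can be applied mechanically. The
     following is a minimal sketch and not part of FEATURES itself: the
     helper name build_features_command and the database name
     flagella.gen are hypothetical, and features.py is assumed to be on
     the PATH. Only the -e and -u flags listed in the SYNOPSIS are used.

```python
import shlex


def build_features_command(expression, dbfile=None):
    """Return the argument list for one FEATURES run (hypothetical helper).

    expression: a Feature Table expression of the form accession:location
    dbfile:     optional user-defined database subset, passed with -u
    """
    args = ["features.py", "-e", expression]
    if dbfile is not None:
        args += ["-u", dbfile]
    return args


# shlex.join quotes each argument only where the shell requires it,
# so the double quotes inside the expression are protected.
args = build_features_command('M60287:/gene="flaL"', dbfile="flagella.gen")
print(shlex.join(args))
```

     Passing the unjoined argument list directly to subprocess.run
     avoids the shell entirely, in which case no quoting is needed.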
     Usually, retrieval of features is much faster with a User-defined
     subset, so if you frequently work with sets of genes, it is best to
     retrieve them en masse using seqfetch.py, and work with them
     directly. For example, if you had retrieved a set of Beta-globin
     sequences into a file called 'globin.gen', you could directly
     extract features from these entries by specifying 'globin' or
     'globin.gen' as your User-defined database. If the file extension
     is '.gen', FEATURES will automatically create temporary files
     called globin.ano, globin.wrp and globin.ind, containing
     annotation, sequence, and an index, respectively. These files will
     be read during feature extraction, and then discarded. If you have
     already created such files using SPLITDB, simply specify any of
     'globin', 'globin.ano', etc. ie. anything, as long as it does not
     have the .gen file extension.

     'U' rather than 'u' causes ALL entries in the user-defined database
     to be processed. This means that it is unnecessary to specify entry
     options (eg. -n, -N etc.); if given, they will be ignored.

     WHERE TO SEND IT - By default (a), the output for all entries goes
     to a single set of files, whose names are chosen by FEATURES,
     depending on the setting of option 2, Entries. If a single name (n)
     or accession number (a) has been chosen, that will be used as the
     raw filename. For example, if you were processing the entry WHTCAB,
     the output files would be WHTCAB.msg and WHTCAB.out. If names (N),
     accession numbers (A) or expressions (E) were read from a file, the
     raw name of that file would be used, eg. cellulase.nam would result
     in cellulase.msg and cellulase.out. Finally, if a single expression
     is processed (e), then the primary accession number in that
     expression will be used for the filenames. In all cases, FEATURES
     will tell you the names of the files being written.

     By choosing suboption s, you can specify that the features created
     for each entry be sent to separate files.
     In this case, each file will have the name of that entry, with the
     extension .obj. However, all messages and expressions will still go
     to a single file. While this can be a convenient way of creating
     separate files when you need them, this option still has the
     limitation of writing all features for a given entry (if there is
     more than one) to the same file. Also, successive resolution of
     features (anything requiring 'getob -r') will not work with this
     option. This may be corrected in future versions.

COMMAND LINE EXECUTION
     There are two ways of running FEATURES from the command line. If
     only one argument is supplied, that argument is interpreted as an
     expression, and the result of that expression (ie. a sequence) is
     written to the standard output. .msg, .out and .exp files are NOT
     created. For example, GenBank entry BACFLALA (M60287) contains the
     following feature:

          CDS             95..271
                          /label=LORF-
                          /codon_start=1
                          /translation="MNKDKNEKEELDEEWTELIKHALEQGISPDDIRIFLNLGKKSSK
                          PSASIERSHSINPF"

     Any of

          features M60287:LORF-
          features M60287:95..271
          features M60287:/label=LORF-

     would write the open reading frame to the standard output:

          atgaataaagataaaaatgagaaagaagaattggatgaggagtggacaga
          actgattaaacacgctcttgaacaaggcattagtccagacgatatacgta
          tttttctcaatttgggtaagaagtcttcaaaaccttccgcatcaattgaa
          agaagtcattcaataaatcctttctga

     This form of FEATURES is provided to make it easy to pipe output to
     other programs for further processing. For example

          features M60287:LORF- |ribosome >LORF.protein

     would write the translation of the open reading frame to a file
     called LORF.protein.

     The full functionality of FEATURES can be accessed using arguments
     on the command line. In particular, when there are multiple entries
     to be processed, or multiple features within entries, it is much
     faster to supply FEATURES with lists of entries, feature keys or
     expressions.
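
     Lists of feature keys and expressions in the formats described
     earlier can be generated by a short script. The sketch below is
     illustrative only and is not part of FEATURES: the function names
     and the file names rna.key and flaL.exp are hypothetical, and it
     assumes only what this page states, namely that keys go one per
     line between OBJECTS and SITES, and that each expression begins
     with '@'. The rigorous file format is defined in getob.txt.

```python
import tempfile
from pathlib import Path


def key_file_text(keys):
    """Feature keys, one per line, enclosed by OBJECTS and SITES."""
    return "\n".join(["OBJECTS", *keys, "SITES"]) + "\n"


def expression_file_text(expressions):
    """One expression per line, each beginning with '@'."""
    lines = [e if e.startswith("@") else "@" + e for e in expressions]
    return "\n".join(lines) + "\n"


# Write both files into a scratch directory (hypothetical file names).
outdir = Path(tempfile.mkdtemp())
keyfile = outdir / "rna.key"
expfile = outdir / "flaL.exp"
keyfile.write_text(key_file_text(["tRNA", "rRNA"]))
expfile.write_text(expression_file_text(["M60287:305..640", "M60287:LORF-"]))

# The files could then be supplied with the -F and -E options, eg.
#   features.py -F rna.key -a M81884
#   features.py -E flaL.exp
print(keyfile.read_text(), end="")
print(expfile.read_text(), end="")
```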
     Command line options are similar to suboptions in menu items 1-3
     above:

     Examples:

          features -f tRNA -n EPFCPCG

     retrieves all tRNAs from GenBank entry EPFCPCG and writes .msg,
     .out, and .exp files.

          features -e M60287:LORF-

     would retrieve the same open reading frame as in the earlier
     example.

     Since the most time-consuming operation in FEATURES is sequence
     retrieval, it is often best to retrieve frequently-used sequences
     as database subsets. For example, a set of GenBank entries for
     chlorophyll a/b binding protein genes might be stored in a file
     called CAB.gen.

          features -f CDS -N CAB.nam -u CAB.gen

     would generate the files CAB.msg, CAB.out and CAB.exp containing
     output for all CDS features in the entries listed in the file
     CAB.nam.

          features -E CAB.exp -u CAB.gen

     would re-create the output file CAB.out.

BUGS
     FEATURES does no preliminary error checking for syntax of GenBank
     expressions prior to their evaluation. Expressions that cannot be
     evaluated will be flagged by GETOB in the .msg file.

     At present, little checking is done to test for the presence or
     correctness of input files. Some errors may cause the program to
     crash.

     For User-defined datasets, filename expansion is not performed.

FILES
     Temporary files:
          X.term X.ano X.wrp X.ind X.gen   {X is raw filename, see 4) }
          UNRESOLVED.fea UNRESOLVED.out
          FEA.inf FEA.nam FEA.gen FEA.ano FEA.wrp FEA.ind FEA.msg FEA.out

SEE ALSO
     grep(1V) getob splitdb

AUTHOR
     Dr. Brian Fristensky
     Dept. of Plant Science
     University of Manitoba
     Winnipeg, MB Canada R3T 2N2
     Phone: 204-474-6085
     FAX: 204-474-7528
     frist@cc.umanitoba.ca

REFERENCE
     Fristensky, B. (1993) Feature expressions: creating and
     manipulating sequence datasets. Nucleic Acids Research
     21:5997-6003.