FEATURES                                               update 7 Feb 94

NAME
     FEATURES - extracts features from GenBank entries

SYNOPSIS
     features
     features expression
     features [-f featurekey | -F keyfile]
              [-n name | -a accession | -e expression |
               -N namefile | -A accfile | -E expfile]
              [-u dbfile | -U dbfile | -g]
     features -h

DESCRIPTION
     FEATURES extracts sequence objects from GenBank entries, using the
     Features Table language. Features can be retrieved either by
     specifying keywords (e.g. CDS, mRNA, exon, intron etc.) or by
     evaluating expressions. In practical terms, FEATURES is a user
     interface for GETOB, which performs the parsing and extraction of
     sequence objects.

     FEATURES can be run either as an interactive program or with
     command line arguments. 'features' with no arguments runs the
     program interactively. 'features' followed by an expression
     retrieves the data directly from GenBank and evaluates the
     expression. The third form of features requires all arguments to
     be accompanied by their respective option flags. Finally,
     'features -h' prints the SYNOPSIS.

INTERACTIVE EXECUTION
     FEATURES executed with no arguments runs interactively. An example
     of the FEATURES menu is shown below:

 ___________________________________________________________________
                     FEATURES - Version 7 FEB 94
      Please cite: Fristensky (1993) Nucl. Acids Res. 21:5997-6003
 ___________________________________________________________________
 Features: tRNA
 Entries:  EPFCPCG
 Dataset:
 ___________________________________________________________________
   Parameter       Description                              Value
 -------------------------------------------------------------------
 1)....................FEATURES TO EXTRACT ....................> f
     f:Type a feature at the keyboard
     F:Read a list of features from a file
 2)....................ENTRIES TO BE PROCESSED (choose one).....> n
     Keyboard input - n:name     a:accession #     e:expression
     File input     - N:name(s)  A:accession #(s)  E:expression(s)
 3)....................WHERE TO GET IT .........................> g
     u:Genbank dataset     g:complete GenBank database
     U: same as u, but all entries
 4)....................WHERE TO SEND IT ........................> a
     s:Each feature to a separate file     a:All output to same file
 ---------------------------------------------------------------
 Type number of your choice or 0 to continue:
 0
 Messages will be written to EPFCPCG.msg
 Final sequence output will be written to EPFCPCG.out
 Expressions will be written to EPFCPCG.exp
 Extracting features...

     In the example, FEATURES was instructed to retrieve all tRNAs from
     the GenBank entry EPFCPCG, which contains the Epifagus plastid
     genome. By default, the GenBank database was the source of the
     sequence. Messages indicate the progress of the job. A log
     describing the extraction of each feature is written to
     EPFCPCG.msg, while the extracted features themselves are written
     to EPFCPCG.out. Feature expressions, which could be used by
     FEATURES to reconstruct the .out file, are written to EPFCPCG.exp.

     The first step is to retrieve the EPFCPCG entry from GenBank,
     which is accomplished by calling FETCH. Next, FEATURES extracts
     the specified features from the entry. An excerpt from EPFCPCG.msg
     is shown below, describing the extraction of the fifth tRNA found
     in this entry. To create this tRNA, two exons had to be joined.
     The qualifier lines associated with this feature indicate that it
     is a Histidine tRNA with a gtg anticodon.

     EPFCPCG:anticodon gtg
     complement ( join ( 70023 70028 1 69 ) )
     /product="transfer RNA-His"
     /gene="His-tRNA"
     /label=anticodon gtg
     /note="anticodon gtg"
     //----------------------------------------------

     The actual sequence for this feature, as written to EPFCPCG.out,
     is written with each exon beginning a new line:

     >EPFCPCG:anticodon gtg
     ggcggatgtagccaaatggatcaaggtagtggattgtgaatccaacatat
     gcgggttcaattcccgtcg
     ttcgcc

     Finally, the expression that was evaluated to create this feature
     is written to EPFCPCG.exp:

     >EPFCPCG:anticodon gtg
     @M81884:anticodon gtg

     If EPFCPCG.exp were used as an expression file in option 2 (E) of
     FEATURES, EPFCPCG.out would be recreated.

OPTIONS
     1) FEATURES - Choosing f will cause FEATURES to prompt for a
     feature to extract. If you wish to extract several types of
     features simultaneously (i.e. suboption F), you must construct a
     file listing the feature keywords. The following example would
     retrieve both tRNA and rRNA sequences:

     OBJECTS
     tRNA
     rRNA
     SITES

     The words 'OBJECTS' and 'SITES' must enclose the feature keywords,
     and each keyword must be on a separate line. For a rigorous
     definition of the input file format, see the GETOB manual pages
     (getob.doc).

     In the menu shown above, f was chosen, and the user entered tRNA
     at the prompt. Thus tRNA is now displayed on the 'Features:' line.
     If features had been specified from a file (suboption F), then the
     name of the file containing the feature keywords would be
     displayed instead.

     A complete list of legal feature keywords can be found in the
     GenBank Release notes (gbrel.txt) under the subheading 'Feature
     Key Names'.

     2) ENTRIES

     n    User is prompted for the name of an entry from which the
          feature is to be extracted. The name of the entry will appear
          on the 'Entries:' line of the menu.

     N    User is prompted for a filename containing one or more entry
          names. Each name must be on a separate line. The filename
          will be displayed on the 'Entries:' line of the menu.

     a    User is prompted for an accession number, which will appear
          on the 'Entries:' line of the menu.

     A    User is prompted for the name of a file containing accession
          numbers. The filename will appear on the 'Entries:' line.

     e    User is prompted for a GenBank Features expression of the
          form accession:location. 'accession' refers to a GenBank
          accession number, while 'location' is any legal feature
          location. A brief description of location syntax can be found
          under the subheading "Feature Location" in the GenBank
          release notes (gbrel.txt). See "The DDBJ/EMBL/GenBank Feature
          Table: Definition" Version 1.04 for a complete definition.

     E    User is prompted for a filename containing one or more
          Feature expressions. EACH EXPRESSION MUST BEGIN WITH A '@'.
          All lines beginning with '@' are processed as expressions,
          and all other lines are copied to the output file unchanged.

     Examples: The tRNA shown above could have been extracted by
     choosing suboption e and entering either of the following
     expressions:

     M81884:complement(join(70023..70028,1..69))
     M81884:anticodon gtg

     In the first example, the feature line from the original entry is
     used as the location. In the second example, the feature is found
     by its qualifier line, which also appeared in the original entry.
     Note that the first 15 characters after the '=' in the qualifier
     line must distinguish it from all other qualifier lines in the
     same entry.

     The flaL protein coding region of B. licheniformis is described in
     GenBank entry BLIFALA, accession number M60287, in the following
     feature:

     CDS             305..640
                     /note="flaD (sin) homologue"
                     /gene="flaL"
                     /label=ORF2
                     /codon_start=1

     This feature could be retrieved using any of the following
     expressions:

     M60287:305..640
     M60287:ORF2
     M60287:/label=ORF2
     M60287:/gene="flaL"
     M60287:/note="flaD (sin) homologue"

     Note that the /label= qualifier is special, in that labels are
     specifically intended as unique tags on a feature. For labels,
     only the label itself need be specified. Thus, /label=ORF2 is
     equivalent to ORF2.
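
     The expression forms listed above can be illustrated with a short
     Python sketch. This is only an illustration of the conventions
     described in this section, not the parser used by GETOB. The
     function name classify_expression and the pattern used to
     recognize a location are assumptions made for this example.

```python
import re

# Hypothetical helper for illustration; not part of FEATURES or GETOB.
# Assumption: a location starts with a digit (optionally preceded by '<')
# or with one of the operators complement(, join( or order(.
_LOCATION_START = re.compile(r"(?:complement|join|order)\(|<?\d")


def classify_expression(expr):
    """Split 'accession:rest' and guess how 'rest' identifies a feature.

    Returns (accession, kind, value), where kind is one of
    'location', 'label' or 'qualifier'.
    """
    # Expression files prefix each expression with '@'; drop it if present.
    expr = expr.strip().lstrip("@")
    accession, sep, rest = expr.partition(":")
    if not sep or not accession or not rest:
        raise ValueError("expected accession:location, got %r" % expr)
    if rest.startswith("/label="):
        # /label=ORF2 is equivalent to the bare label ORF2.
        return accession, "label", rest[len("/label="):]
    if rest.startswith("/"):
        # Any other qualifier must keep its keyword, e.g. /gene="flaL".
        return accession, "qualifier", rest
    if _LOCATION_START.match(rest):
        return accession, "location", rest
    # Anything else is treated as a bare label.
    return accession, "label", rest


if __name__ == "__main__":
    for e in ("M60287:305..640", "M60287:ORF2", 'M60287:/gene="flaL"'):
        print(classify_expression(e))
```

     In this sketch, M60287:ORF2 and M60287:/label=ORF2 are classified
     identically, which mirrors the equivalence described above.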
     For other qualifiers, the qualifier keyword (e.g. /note=) must be
     included.

     3) DATABASE (WHERE TO GET IT) - By default, all entries processed
     will be automatically retrieved from GenBank using FETCH.
     Specifying 'u' (User-defined database subset) makes it possible to
     extract features from GenBank subsets created by the user.
     Usually, retrieval of features is much faster with a User-defined
     subset, so if you frequently work with sets of genes, it is best
     to retrieve them en masse using FETCH, and work with them
     directly.

     For example, if you had retrieved a set of Beta-globin sequences
     into a file called 'globin.gen', you could directly extract
     features from these entries by specifying 'globin' or 'globin.gen'
     as your User-defined database. If the file extension is '.gen',
     FEATURES will automatically create temporary files called
     globin.ano, globin.wrp and globin.ind, containing annotation,
     sequence, and an index, respectively. These files will be read
     during feature extraction, and then discarded. If you have already
     created such files using SPLITDB, simply specify any of 'globin',
     'globin.ano', etc., i.e. any name that does not have the .gen file
     extension.

     'U' rather than 'u' causes ALL entries in the user-defined
     database to be processed. This means that it is unnecessary to
     specify entry options (e.g. -n, -N etc.); these will be ignored if
     given.

     One consequence of these conventions is that the individual
     GenBank divisions can be processed directly. For example, suppose
     you were only interested in rodent globins. You could directly
     access the rodent division of GenBank by specifying the base name
     of that division's files (e.g. /home/psgendb/GenBank/gbrod) as
     your user-defined database. In this case, the files gbrod.ano,
     gbrod.wrp and gbrod.ind already exist. Again, this approach is
     faster, since FEATURES does not have to find and retrieve the
     sequences, but can read directly from the database files.
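
     The file naming convention for user-defined databases can be
     sketched in Python as follows. This is a reading of the rule
     stated above (only a '.gen' extension triggers the creation of
     temporary .ano, .wrp and .ind files), not code taken from
     FEATURES. The function name dataset_files is hypothetical.

```python
import os

# Hypothetical helper for illustration; not part of FEATURES.
SPLIT_EXTENSIONS = ("ano", "wrp", "ind")


def dataset_files(name):
    """Map a -u/-U argument to the three split-database file names.

    Returns (needs_split, files). needs_split is True only when the
    name ends in '.gen', i.e. when the annotation, sequence and index
    files would first have to be created from the .gen file.
    """
    base, ext = os.path.splitext(name)
    needs_split = ext == ".gen"
    known = [".gen"] + ["." + e for e in SPLIT_EXTENSIONS]
    if ext not in known:
        # A bare base name such as 'globin' or a path such as
        # '/home/psgendb/GenBank/gbrod' is used unchanged.
        base = name
    files = {e: base + "." + e for e in SPLIT_EXTENSIONS}
    return needs_split, files


if __name__ == "__main__":
    print(dataset_files("globin.gen"))
    print(dataset_files("/home/psgendb/GenBank/gbrod"))
```

     Under these assumptions, 'globin.gen' is flagged as needing to be
     split, while 'globin.ano' and a division base name such as gbrod
     map directly onto existing .ano, .wrp and .ind files.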
     Finally, if you wanted to process all of the entries in the
     database division, simply use -U. The user is warned that a
     GenBank division is a huge amount of data, and processing every
     entry could take a long time.

     4) WHERE TO SEND IT - By default (a), the output for all entries
     goes to a single set of files, whose names are chosen by FEATURES,
     depending on the setting of option 2, Entries. If a single name
     (n) or accession number (a) has been chosen, that will be used as
     the raw filename. For example, if you were processing the entry
     WHTCAB, the output files would be WHTCAB.msg and WHTCAB.out. If
     names (N), accession numbers (A) or expressions (E) were read from
     a file, the raw name of that file would be used, e.g.
     cellulase.nam would result in cellulase.msg and cellulase.out.
     Finally, if a single expression is processed (e), then the primary
     accession number in that expression will be used for the
     filenames. In all cases, FEATURES will tell you the names of the
     files being written.

     By choosing suboption s, you can specify that the features created
     for each entry be sent to separate files. In this case, each file
     will have the name of that entry, with the extension .obj.
     However, all messages and expressions will still go to a single
     file. While this can be a convenient way of creating separate
     files when you need them, this option still has the limitation of
     writing all features for a given entry (if there is more than one)
     to the same file. Also, successive resolution of features
     (anything requiring 'getob -r') will not work with this option.
     This may be corrected in future versions.

COMMAND LINE EXECUTION
     There are two ways of running FEATURES from the command line. If
     only one argument is supplied, that argument is interpreted as an
     expression, and the result of that expression (i.e. a sequence) is
     written to the standard output. .msg, .out and .exp files are NOT
     created.
     For example, GenBank entry BACFLALA (M60287) contains the
     following feature:

     CDS             95..271
                     /label=LORF-
                     /codon_start=1
                     /translation="MNKDKNEKEELDEEWTELIKHALEQGISPDDIRIFLNLGKKSSK
                     PSASIERSHSINPF"

     Any of

     features M60287:LORF-
     features M60287:95..271
     features M60287:/label=LORF-

     would write the open reading frame to the standard output:

     atgaataaagataaaaatgagaaagaagaattggatgaggagtggacaga
     actgattaaacacgctcttgaacaaggcattagtccagacgatatacgta
     tttttctcaatttgggtaagaagtcttcaaaaccttccgcatcaattgaa
     agaagtcattcaataaatcctttctga

     This form of FEATURES is provided to make it easy to pipe output
     to other programs for further processing. For example,

     features M60287:LORF- | ribosome > LORF.protein

     would write the translation of the open reading frame to a file
     called LORF.protein.

     The full functionality of FEATURES can be accessed using arguments
     on the command line. In particular, when there are multiple
     entries to be processed, or multiple features within entries, it
     is much faster to supply FEATURES with lists of entries, feature
     keys or expressions. Command line options are similar to the
     suboptions in menu items 1-3 above:

     Feature keys:
          -f key          {feature key}
          -F filename     {file of feature keys}

     Entries:
          -n name         {GenBank LOCUS name}
          -N filename     {file of GenBank LOCUS names}
          -a accession    {GenBank ACCESSION number}
          -A filename     {file of GenBank ACCESSION numbers}
          -e expression   {Feature Table expression}
          -E filename     {file of Feature Table expressions, each
                           beginning with '@'}

     Databases:
          -u filename     {GenBank dataset}
          -U filename     {GenBank dataset, process all entries, i.e.
                           -nNaAeE options will be ignored}
          -g              {GenBank}

     Examples:

     features -f tRNA -n EPFCPCG

     retrieves all tRNAs from GenBank entry EPFCPCG and writes .msg,
     .out, and .exp files.

     features -e M60287:LORF-

     would retrieve the same open reading frame as in the earlier
     example.

     Since the most time-consuming operation in FEATURES is sequence
     retrieval, it is often best to retrieve frequently-used sequences
     as database subsets.
     For example, a set of GenBank entries for chlorophyll a/b binding
     protein genes might be stored in a file called CAB.gen.

     features -f CDS -N CAB.nam -u CAB.gen

     would generate the files CAB.msg, CAB.out and CAB.exp containing
     output for all CDS features in the entries listed in the file
     CAB.nam.

     features -E CAB.exp -u CAB.gen

     would re-create the output file CAB.out.

BUGS
     FEATURES does no preliminary checking of the syntax of GenBank
     expressions prior to their evaluation. Expressions that cannot be
     evaluated will be flagged by GETOB in the .msg file.

     At present, little checking is done to test for the presence or
     correctness of input files. Some errors may cause the program to
     crash.

     For User-defined datasets, filename expansion is not performed.

FILES
     Temporary files:
          X.term X.ano X.wrp X.ind X.gen   {X is raw filename, see 4)}
          UNRESOLVED.fea UNRESOLVED.out
          FEA.inf FEA.nam FEA.gen FEA.ano FEA.wrp FEA.ind
          FEA.msg FEA.out

SEE ALSO
     grep(1V) fetch getob splitdb

TRANSPORTATION NOTES
     It should be fairly easy to get FEATURES to work even on systems
     in which GenBank has not been formatted for the XYLEM package.
     This is because FEATURES does not work directly on the database,
     but rather retrieves all necessary sequences by calling FETCH.
     Thus, statements like 'fetch FEA.nam FEA.gen' could be replaced
     with any command that, given a file containing names or accession
     numbers, returns a file containing GenBank entries. In principle,
     you could even implement this sort of command to retrieve entries
     from the email server (retrieve@ncbi.nlm.nih.gov) at NCBI,
     although such a setup would undoubtedly be quite slow.

AUTHOR
     Dr. Brian Fristensky
     Dept. of Plant Science
     University of Manitoba
     Winnipeg, MB Canada R3T 2N2
     Phone: 204-474-6085
     FAX: 204-261-5732
     frist@cc.umanitoba.ca

REFERENCE
     Fristensky, B. (1993) Feature expressions: creating and
     manipulating sequence datasets. Nucleic Acids Research
     21:5997-6003.