features.py

update July 19, 2024

NAME

features.py - extracts features from GenBank entries

SYNOPSIS

features.py expression
features.py [-f featurekey | -F keyfile]
      [-n name | -N namefile | -a accession | -A accfile | -e expression | -E expfile]
      [-u dbfile | -U dbfile | -g]
      [-o outname]
features.py -h

DESCRIPTION

FEATURES extracts sequence objects from GenBank entries, using the Feature Table language. Features can be retrieved either by specifying feature keys (e.g. CDS, mRNA, exon, intron) or by evaluating expressions. In practical terms, FEATURES is a wrapper for GETOB, which performs the parsing and extraction of sequence objects.

In the first form, 'features.py' followed by a single expression retrieves the data directly from GenBank and evaluates the expression. The second form requires all arguments to be accompanied by their respective option flags. Finally, 'features.py -h' prints the SYNOPSIS.

Feature keys
       -f key               {feature key}
       -F keyfile          {file of feature keys}

       The following keyfile would retrieve both tRNA and rRNA sequences:

       OBJECTS
       tRNA
       rRNA
       SITES

The words 'OBJECTS' and 'SITES' must enclose the feature keywords, and each keyword must be on a separate line. For a rigorous definition of the input file format, see the GETOB manual pages.

       A complete list of legal feature keywords can be found in the GenBank
       Release notes (gbrel.txt) under the subheading 'Feature Key Names'.
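
The keyfile layout described above can be generated programmatically. The following is a minimal sketch, assuming only the layout shown in this section (an OBJECTS line, one feature key per line, then a SITES line). The function name keyfile_text is hypothetical and is not part of FEATURES or GETOB.

```python
def keyfile_text(keys):
    """Return keyfile contents: feature keys enclosed by OBJECTS and SITES.

    Hypothetical helper; it only reproduces the layout shown in this manual.
    """
    cleaned = [key.strip() for key in keys if key.strip()]
    # Each keyword goes on its own line, between the two marker lines.
    return "\n".join(["OBJECTS", *cleaned, "SITES"]) + "\n"


if __name__ == "__main__":
    # Prints the same keyfile as the tRNA/rRNA example above.
    print(keyfile_text(["tRNA", "rRNA"]), end="")
```

The resulting text could be saved to a file and passed to features.py with -F.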

Entries

       -n name                 {GenBank LOCUS name}
       -N namefile            {file of GenBank LOCUS names}
       -a accession           {GenBank ACCESSION number}
       -A accfile              {file of GenBank ACCESSION numbers}

Note: -n, -N, -a, -A, -f and -F are ignored when -e or -E are used.

       -e expression          {Feature Table expression}
       -E expfile                {file of Feature Table expressions, each beginning with '@'}

         Expressions take the form

            accession:location

where accession refers to a GenBank accession number, while location is any legal feature location. A brief description of location syntax can be found under the subheading "Feature Location" in the GenBank release notes (gbrel.txt). See "The DDBJ/EMBL/GenBank Feature Table Definition" for a complete definition.

-E takes the name of a file containing one or more Feature Table expressions. EACH EXPRESSION MUST BEGIN WITH '@'. All lines beginning with '@' are processed as expressions, and all other lines are copied to the output file unchanged.
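
The handling of expression files can be illustrated with a short sketch. It assumes only what is stated above: a line beginning with '@' holds an expression of the form accession:location, and every other line is passed through unchanged. The function name split_expression_lines is hypothetical; this is not how FEATURES or GETOB is implemented.

```python
def split_expression_lines(lines):
    """Separate '@' expression lines from pass-through lines.

    Returns (expressions, passthrough), where expressions is a list of
    (accession, location) tuples. Hypothetical helper for illustration.
    """
    expressions = []
    passthrough = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("@"):
            # Split at the first ':' only; locations and qualifiers
            # may themselves contain colons.
            accession, sep, location = line[1:].partition(":")
            if not sep or not accession or not location:
                raise ValueError(f"not of the form accession:location: {line!r}")
            expressions.append((accession, location))
        else:
            passthrough.append(line)
    return expressions, passthrough


if __name__ == "__main__":
    sample = [
        ">EPFCPCG:tRNA1\n",
        "@M81884:complement(join(70023..70028,1..69))\n",
    ]
    print(split_expression_lines(sample))
```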

Databases
        -u filename           {GenBank dataset}
        -U filename           {GenBank dataset; process all entries, i.e. -n, -N, -a, -A, -e and -E
                                      will be ignored}
        -g                    {GenBank. If neither -u nor -U is specified, -g is assumed}

By default, all entries processed are retrieved automatically from GenBank using seqfetch.py. Specifying -u (User-defined database subset) makes it possible to extract features from GenBank subsets created by the user. Retrieval of features is usually much faster from a User-defined subset, so if you frequently work with sets of genes, it is best to retrieve them en masse using seqfetch.py and work with them directly. For example, if you had retrieved a set of beta-globin sequences into a file called 'globin.gen', you could extract features directly from these entries by specifying 'globin' or 'globin.gen' as your User-defined database.

If the file extension is '.gen', FEATURES automatically creates temporary files called globin.ano, globin.wrp and globin.ind, containing annotation, sequence and an index, respectively. These files are read during feature extraction and then discarded. If you have already created such files using SPLITDB, simply specify any of 'globin', 'globin.ano', etc., i.e. any name that does not have the .gen file extension.

Using -U rather than -u causes ALL entries in the user-defined database to be processed. It is therefore unnecessary to specify entry options (e.g. -n, -N); these are ignored if given.

Where to send output

-o outname - prefix for output filenames. Default is the name of the database, without the file extension.

-s - send output for each entry to a separate file

By default, the output for all entries goes to a single set of files, whose names are chosen by FEATURES according to the Entries option used. If a single name (-n) or accession number (-a) is given, that is used as the base filename. For example, if you were processing the entry WHTCAB, the output files would be WHTCAB.msg and WHTCAB.out. If names (-N), accession numbers (-A) or expressions (-E) are read from a file, the base name of that file is used, e.g. cellulase.nam would result in cellulase.msg and cellulase.out. Finally, if a single expression is processed (-e), the primary accession number in that expression is used for the filenames. In all cases, FEATURES reports the names of the files being written.
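
The naming rules in the preceding paragraph can be summarized in a short sketch. It covers only the cases described there and ignores -o and -s. The function name output_prefix and the single-letter option codes are hypothetical conventions for this example, not the internals of features.py.

```python
from pathlib import PurePath


def output_prefix(option, value):
    """Return the base name for output files, following the rules above.

    option is the entry flag without its dash, e.g. "a" or "E".
    Hypothetical helper; -o and -s are deliberately not modelled.
    """
    if option in ("n", "a"):
        # A single LOCUS name or accession number is used directly.
        return value
    if option in ("N", "A", "E"):
        # For input files, the file name without its extension is used.
        return PurePath(value).stem
    if option == "e":
        # For a single expression, the accession precedes the first ':'.
        return value.split(":", 1)[0]
    raise ValueError(f"unsupported entry option: {option!r}")


if __name__ == "__main__":
    for opt, val in [("n", "WHTCAB"), ("N", "cellulase.nam"), ("e", "M60287:305..640")]:
        print(opt, val, "->", output_prefix(opt, val) + ".out")
```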

With -s, the features created for each entry are sent to separate files. Each file has the name of that entry, with the extension .obj. However, all messages and expressions still go to single files. While this can be a convenient way of creating separate files when you need them, this option still has the limitation of writing all features for a given entry (if there is more than one) to the same file. Also, successive resolution of features (anything requiring 'getob -r') will not work with this option. This may be corrected in future versions.

EXAMPLES

Example 1: Retrieving features by feature key, from a specific sequence

features.py -f tRNA -a M81884

In this example, FEATURES was instructed to retrieve all tRNAs from the GenBank entry EPFCPCG, accession number M81884, which contains the Epifagus plastid genome. By default, the GenBank database was the source of the sequence. A log describing the extraction of each feature is written to M81884.msg, while the extracted features themselves are written to M81884.out. Feature expressions that FEATURES could use to reconstruct the .out file are written to M81884.exp.

features.py first retrieves the EPFCPCG entry from GenBank by calling seqfetch.py. Next, FEATURES extracts the specified features from the entry.

An excerpt from M81884.msg is shown below, describing the extraction of the first tRNA found in this entry. To create this tRNA, two exons had to be joined. The qualifier line associated with this feature indicates that it is a histidine tRNA with a gtg anticodon.

EPFCPCG:tRNA1
complement ( join ( 4487 4492 1 69 ) )
/gene="trnH"
/product="tRNA-His"
/note="label: anticodon_gtg"
//----------------------------------------------

The actual sequence for this feature, as written to M81884.out, is shown with each exon beginning a new line:

>EPFCPCG:tRNA1
ggcggatgtagccaaatggatcaaggtagtggattgtgaatccaacatat
gcgggttcaattcccgtcg

Finally, the expression that was evaluated to create this feature is written to M81884.exp:

>EPFCPCG:tRNA1
@M81884:complement(join(70023..70028,1..69))

If M81884.exp were used as the expression file with the -E option of FEATURES, M81884.out would be recreated.

Example 2: Retrieving features by feature expressions

A feature can be located by one of its qualifier lines from the original entry. The qualifier line must differ from all others in the same entry within the first 15 characters after the '='.

The flaL protein coding region of B. licheniformis is described in GenBank entry BLIFALA, accession number M60287 in the following feature:

     CDS             305..640
                     /gene="flaL"
                     /note="flaD (sin) homologue; putative; label: ORF2"
                     /codon_start=1
                     /transl_table=11
                     /protein_id="AAA22439.1"
                     /translation="MIGQRIKQYRKEKGYSLSELAEKAGVAKSYLSSIERNLQTNPSI
                     QFLEKVSAVLDVSVHTLLNEKDETEYDGQLDSEWENLVRDAMASGVSKKQFREFLDYQ
                     KWKKRQEKE"

         This feature could be retrieved using any of the following
         expressions:

        features.py 'M60287:305..640'
        features.py 'M60287:/gene="flaL"'
        features.py 'M60287:/note="flaD (sin) homologue; putative; label: ORF2"'

Some expressions contain characters that have a special meaning to the shell, such as parentheses and double quotes. It is therefore safest to enclose expressions in single quotes, as shown above, when typing them on the command line. The output would appear as

ttgattggccagcgtattaaacaatatcgaaaagaaaaaggctactcactatctgaactagctgaaaaggctggggtagcgaagtcttatttaagttcaatagaaagaaacttgcaaacaaacccctccattcaatttctagaaaaagtctccgctgttctggacgtctcggttcataccctgctcaatgagaaagatgaaaccgaatacgatggtcaattagatagtgaatgggaaaatctagttcgtgacgctatggcatcaggggtttctaaaaaacaatttcgagaatttttagattatcaaaagtggaagaaaagacaggaaaaggagtaa
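
On the quoting advice above: when features.py command lines are generated from a script rather than typed by hand, the quoting can be delegated to the Python standard library. The sketch below uses shlex.quote for this. The function name features_command is hypothetical, and the sketch only builds the command string; it does not run features.py.

```python
import shlex


def features_command(expression):
    """Build a shell-safe single-argument features.py command line.

    Hypothetical helper: shlex.quote adds single quotes only when the
    expression contains characters the shell would otherwise interpret.
    """
    return "features.py " + shlex.quote(expression)


if __name__ == "__main__":
    print(features_command("M60287:305..640"))
    print(features_command('M60287:/gene="flaL"'))
    print(features_command("M81884:complement(join(70023..70028,1..69))"))
```

A plain range such as M60287:305..640 contains nothing the shell treats specially, so it is left unquoted, while expressions with double quotes or parentheses are wrapped in single quotes.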

The single argument form of features shown above makes it possible to pipe output to another program. For example,

features.py 'M60287:305..640' | ribosome

would pipe the output to the ribosome program, generating the amino acid sequence

LIGQRIKQYRKEKGYSLSELAEKAGVAKSYLSSIERNLQTNPSIQFLEKVSAVLDVSVHTLLNEKDETEYDGQLDSEWENLVRDAMASGVSKKQFREFLDYQKWKKRQEKE*
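
Note that the amino acid sequence above begins with L, whereas the /translation qualifier in the GenBank entry begins with M. This is consistent with a plain codon-by-codon translation: ttg codes for leucine in the standard genetic code, but is read as methionine when it serves as a start codon under /transl_table=11. The sketch below illustrates the point with a standard-code lookup. It is not the ribosome program, and the assumption that ribosome translates every codon with the standard table is ours.

```python
# Standard genetic code, with codons enumerated in t, c, a, g order.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate DNA codon by codon; any trailing partial codon is ignored.

    No special handling of start codons is applied, so an initial ttg
    is reported as L rather than M.
    """
    dna = dna.lower()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


if __name__ == "__main__":
    # First four and last two codons of the flaL coding region shown above.
    print(translate("ttgattggccag"))
    print(translate("gagtaa"))
```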

If the -e option is used,

features.py -e 'M60287:305..640'

features.py writes output to files whose names start with the accession number, i.e. M60287.out and M60287.msg.

Example 3: Retrieving features from a custom GenBank dataset

Since the most time-consuming operation in FEATURES is sequence retrieval, it is often best to retrieve frequently used sequences as database subsets. For example, a set of GenBank entries for chlorophyll a/b binding protein genes might be stored in a file called CAB.gen.

features.py -f CDS -A CAB.acc -u CAB.gen

would generate the files CAB.msg, CAB.out and CAB.exp containing output for all CDS features in the entries listed in the file CAB.acc.

features.py -E CAB.exp -u CAB.gen

would re-create the output file CAB.out.

BUGS

FEATURES does not check the syntax of GenBank expressions before evaluating them. Expressions that cannot be evaluated will be flagged by GETOB in the .msg file.

At present, little checking is done to test for the presence or correctness of input files. Some errors may cause the program to crash.

For User-defined datasets, filename expansion is not performed.