update July 19, 2024
NAME
features.py - extracts features from GenBank entries
SYNOPSIS
features.py expression
features.py [-f featurekey | -F keyfile]
            [-n name | -N namefile | -a accession | -A accfile | -e expression | -E expfile]
            [-u dbfile | -U dbfile | -g]
            [-o outname]
features.py -h
DESCRIPTION
FEATURES extracts sequence objects from GenBank entries,
using the Features Table language. Features can be retrieved
either by specifying keywords (e.g. CDS, mRNA, exon, intron)
or by evaluating expressions. In practical terms, FEATURES is a
wrapper for GETOB, which performs the parsing and extraction of
sequence objects.
'features' followed by an expression retrieves the data directly
from GenBank and evaluates the expression. The second form of
features requires all arguments to be accompanied by their
respective option flags. Finally, 'features -h' prints the
SYNOPSIS.
Feature keys
-f key        {feature key}
-F keyfile    {file of feature keys}
The following keyfile would retrieve both tRNA and rRNA sequences:
OBJECTS
tRNA
rRNA
SITES
The words 'OBJECTS' and 'SITES' must enclose the
feature keywords, and each keyword must be on a separate line.
For a rigorous definition of the input file format, see the GETOB
manual pages.
A complete list of legal feature keywords can be found in the
GenBank Release notes (gbrel.txt) under the subheading
'Feature Key Names'.
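A keyfile in this layout can also be generated by a script. The following Python sketch is illustrative only: the helper name make_keyfile is hypothetical and not part of FEATURES, and it relies solely on the layout described above (keywords on separate lines, enclosed by 'OBJECTS' and 'SITES').

```python
def make_keyfile(keys):
    """Return keyfile text: one feature key per line, between OBJECTS and SITES.

    Hypothetical helper; the layout follows the description in this manual page.
    """
    cleaned = [k.strip() for k in keys if k.strip()]
    if not cleaned:
        raise ValueError("at least one feature key is required")
    return "\n".join(["OBJECTS", *cleaned, "SITES"]) + "\n"


if __name__ == "__main__":
    # Write a keyfile that asks for tRNA and rRNA features.
    with open("rna.key", "w") as handle:  # 'rna.key' is an arbitrary file name
        handle.write(make_keyfile(["tRNA", "rRNA"]))
```

The resulting file could then be passed to features.py with the -F option.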
Entries
-n name         {GenBank LOCUS name}
-N namefile     {file of GenBank LOCUS names}
-a accession    {GenBank ACCESSION number}
-A accfile      {file of GenBank ACCESSION numbers}
Note: -n, -N, -a, -A, -f and -F are ignored when -e or
-E are used.
-e expression   {Feature Table expression}
-E expfile      {file of Feature Table expressions, each beginning with '@'}
Expressions take the form
accession:location
where accession refers to a GenBank accession number,
while location is any legal feature location. A brief
description of location syntax can be found under the subheading
"Feature Location" in the GenBank release notes (gbrel.txt).
See "The DDBJ/EMBL/GenBank Feature Table Definition" for a
complete definition.
-E takes the name of a file containing one or more Feature
expressions. EACH EXPRESSION MUST BEGIN with '@'. All
lines beginning with '@' are processed as expressions, and all
other lines are copied to the output file unchanged.
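As an illustration of the expression file rules above, the following Python sketch separates '@' lines from pass-through lines and splits each expression into its accession and location parts. It is not the parser used by FEATURES or GETOB; the function names are hypothetical, and splitting at the first colon is an assumption based on the accession:location form.

```python
def split_expression(expr):
    """Split 'accession:location' (optionally prefixed with '@') at the first colon."""
    text = expr[1:] if expr.startswith("@") else expr
    accession, sep, location = text.strip().partition(":")
    if not sep or not accession or not location:
        raise ValueError(f"not of the form accession:location: {expr!r}")
    return accession, location


def classify_lines(text):
    """Yield ('expression', (accession, location)) for '@' lines, else ('text', line)."""
    for line in text.splitlines():
        if line.startswith("@"):
            yield "expression", split_expression(line)
        else:
            yield "text", line


if __name__ == "__main__":
    sample = ">EPFCPCG:tRNA1\n@M81884:complement(join(70023..70028,1..69))\n"
    for kind, value in classify_lines(sample):
        print(kind, value)
```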
Databases
-u filename   {GenBank dataset}
-U filename   {GenBank dataset; process all entries, i.e. -n, -N, -a, -A, -e and -E will be ignored}
-g            {GenBank. If neither -u nor -U is specified, -g is assumed}
By default, all entries processed will be
automatically retrieved from GenBank using seqfetch.py.
Specifying -u (User-defined database subset) makes it possible
to extract features from GenBank subsets created by the user.
Retrieval of features is usually much faster with a
User-defined subset, so if you frequently work with sets of
genes, it is best to retrieve them en masse using
seqfetch.py, and work with them directly. For example, if you
had retrieved a set of Beta-globin sequences into a file called
'globin.gen', you could directly extract features from these
entries by specifying 'globin.gen' as your
User-defined database. If the file extension is '.gen', FEATURES
will automatically create temporary files called globin.ano,
globin.wrp and globin.ind, containing annotation, sequence, and
an index, respectively. These files will be read during feature
extraction, and then discarded. If you have already created such
files using SPLITDB, simply specify any of 'globin',
'globin.ano', etc., i.e. any name that does not have the
.gen file extension.
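The file naming convention described above can be summarized in a short Python sketch. This is only an illustration of the rule as stated in this manual page (a '.gen' extension triggers temporary splitting, any other name is assumed to refer to existing SPLITDB files); describe_database is a hypothetical helper, not a function of FEATURES.

```python
from pathlib import Path

# Extensions of the annotation, sequence and index files named in this manual page.
SPLIT_EXTENSIONS = (".ano", ".wrp", ".ind")


def describe_database(dbname):
    """Report whether a user-defined database would be split, and the split file names."""
    path = Path(dbname)
    base = path.with_suffix("")  # drop the extension, if there is one
    return {
        "needs_split": path.suffix == ".gen",
        "files": [str(base) + ext for ext in SPLIT_EXTENSIONS],
    }


if __name__ == "__main__":
    print(describe_database("globin.gen"))
    print(describe_database("globin"))
```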
-U rather than -u causes ALL entries in the
user-defined database to be processed. This means that it is
unnecessary to specify entry options (e.g. -n, -N); these
will be ignored if given.
Where to send output
-o outname    prefix for output filenames. Default is the name of
              the database, without the file extension.
-s            send output for each feature to a separate file
By default, the output for all entries goes to a single set of
files, whose names are chosen by FEATURES according to the
Entries option used. If a single name (-n) or accession
number (-a) is given, it is used as the base filename.
For example, if you were processing the entry WHTCAB,
the output files would be WHTCAB.msg and WHTCAB.out. If names
(-N), accession numbers (-A) or expressions (-E) are read from a
file, the base name of that file is used, e.g. cellulase.nam
would result in cellulase.msg and cellulase.out. Finally,
if a single expression is processed (-e), the primary
accession number in that expression is used for the
filenames. In all cases, FEATURES will tell you the names
of the files being written.
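The naming rules above can be expressed as a small Python sketch. It is a reading of this manual page rather than the actual logic in features.py: the function name output_basename and its keyword arguments are hypothetical, and the order in which the options are checked is an assumption.

```python
from pathlib import Path


def output_basename(outname=None, name=None, accession=None,
                    listfile=None, expression=None, database=None):
    """Guess the base name of the .msg/.out files from the options given.

    Hypothetical helper; the precedence of options is assumed, not documented.
    """
    if outname:                      # -o outname
        return outname
    if name:                         # -n name
        return name
    if accession:                    # -a accession
        return accession
    if listfile:                     # -N, -A or -E file: base name of that file
        return Path(listfile).stem
    if expression:                   # -e expression: accession before the colon
        return expression.lstrip("@").split(":", 1)[0]
    if database:                     # default: database name without extension
        return Path(database).stem
    raise ValueError("no option from which to derive an output name")


if __name__ == "__main__":
    for kwargs in ({"name": "WHTCAB"},
                   {"listfile": "cellulase.nam"},
                   {"expression": "M60287:305..640"}):
        base = output_basename(**kwargs)
        print(base + ".msg", base + ".out")
```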
With the -s option, you can specify that the features created
for each entry be sent to separate files. In this case, each
file will have the name of that entry, with the extension .obj.
However, all messages and expressions will still go to a
single file. While this can be a convenient way of creating
separate files when you need them, this option still has the
limitation of writing all features for a given entry (if there
is more than one) to the same file. Also, successive resolution
of features (anything requiring 'getob -r') will not work with
this option. This may be corrected in future versions.
EXAMPLES
Example 1: Retrieving features by feature key, from a specific sequence
features.py -f tRNA -a M81884
In the example, FEATURES was instructed to retrieve all tRNAs from
the GenBank entry EPFCPCG with Accession number M81884, which
contains the Epifagus plastid genome. By default, the GenBank
database was the source of the sequence. A log describing the
extraction of each feature is written to M81884.msg, while the
extracted features themselves are written to M81884.out. Feature
expressions which could be used by FEATURES to reconstruct the
.out file, are written to M81884.exp.
features.py will retrieve the EPFCPCG entry from GenBank, which is
accomplished by calling seqfetch.py. Next, FEATURES extracts the
specified features from the entry.
An excerpt from M81884.msg is shown below, describing the
extraction of the first tRNA found in this entry. To create this
tRNA, two exons had to be joined. The qualifier line associated
with this feature indicates that it is a histidine tRNA with a gtg
anticodon.
EPFCPCG:tRNA1
complement
(
join
(
4487
4492
1
69
)
)
/gene="trnH"
/product="tRNA-His"
/note="label: anticodon_gtg"
//----------------------------------------------
The actual sequence for this feature, as written to M81884.out, is
written with each exon beginning a new line:
>EPFCPCG:tRNA1
ggcggatgtagccaaatggatcaaggtagtggattgtgaatccaacatat
gcgggttcaattcccgtcg
Finally, the expression that was evaluated to create this feature
is written to M81884.exp:
>EPFCPCG:tRNA1
@M81884:complement(join(70023..70028,1..69))
If M81884.exp were used as an expression file with the -E option of
FEATURES, M81884.out would be recreated.
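To make the meaning of an expression such as complement(join(...)) concrete, the following Python sketch evaluates a very small subset of the location syntax (single positions, a..b ranges, join and complement) against a sequence held in a string. It is not the GETOB implementation and ignores most of the Feature Table language; the function names and the toy sequence are made up for illustration.

```python
import re

# Translation table for complementing nucleotides (unambiguous bases only).
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def split_top_level(text):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def evaluate(location, seq):
    """Return the subsequence described by a simple location (1-based, inclusive)."""
    location = location.strip()
    if location.startswith("complement(") and location.endswith(")"):
        inner = evaluate(location[len("complement("):-1], seq)
        return inner.translate(COMPLEMENT)[::-1]  # reverse complement
    if location.startswith("join(") and location.endswith(")"):
        pieces = split_top_level(location[len("join("):-1])
        return "".join(evaluate(piece, seq) for piece in pieces)
    match = re.fullmatch(r"(\d+)(?:\.\.(\d+))?", location)
    if match is None:
        raise ValueError(f"unsupported location: {location!r}")
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else start
    return seq[start - 1:end]


if __name__ == "__main__":
    toy = "acgtacgg"  # toy 8-base sequence, not real GenBank data
    print(evaluate("join(7..8,1..2)", toy))
    print(evaluate("complement(join(7..8,1..2))", toy))
```

The second call joins the last two bases to the first two, as a feature spanning the origin of a circular genome would, and then takes the reverse complement of the joined sequence.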
Example 2: Retrieving features by feature expressions
A feature can also be located by one of its qualifier lines from the
original entry. Note that the qualifier line must differ from all
other qualifier lines in the same entry within its first 15
characters after the '='.
The flaL protein coding region of B. licheniformis is
described in GenBank entry BLIFALA, accession number M60287 in the
following feature:
CDS             305..640
/gene="flaL"
/note="flaD (sin) homologue; putative; label: ORF2"
/codon_start=1
/transl_table=11
/protein_id="AAA22439.1"
/translation="MIGQRIKQYRKEKGYSLSELAEKAGVAKSYLSSIERNLQTNPSI
QFLEKVSAVLDVSVHTLLNEKDETEYDGQLDSEWENLVRDAMASGVSKKQFREFLDYQ
KWKKRQEKE"
This feature could be retrieved using any of the following
expressions:
features.py 'M60287:305..640'
features.py 'M60287:/gene="flaL"'
features.py 'M60287:/note="flaD (sin) homologue; putative; label: ORF2"'
Some locations contain characters that have a special meaning
to the shell (e.g. quotes, parentheses and semicolons). Therefore,
it is safest to enclose expressions in single quotes as shown above
when typing them on the command line. The output would appear as
ttgattggccagcgtattaaacaatatcgaaaagaaaaaggctactcact
atctgaactagctgaaaaggctggggtagcgaagtcttatttaagttcaa
tagaaagaaacttgcaaacaaacccctccattcaatttctagaaaaagtc
tccgctgttctggacgtctcggttcataccctgctcaatgagaaagatga
aaccgaatacgatggtcaattagatagtgaatgggaaaatctagttcgtg
acgctatggcatcaggggtttctaaaaaacaatttcgagaatttttagat
tatcaaaagtggaagaaaagacaggaaaaggagtaa
The single argument form of features shown above makes it possible
to pipe output to another program. For example,
features.py 'M60287:305..640' | ribosome
would pipe the output to the ribosome program, generating the
amino acid sequence
LIGQRIKQYRKEKGYSLSELAEKAGVAKSYLSSIERNLQTNPSIQFLEKV
SAVLDVSVHTLLNEKDETEYDGQLDSEWENLVRDAMASGVSKKQFREFLD
YQKWKKRQEKE*
If the -e option is used,
features.py -e 'M60287:305..640'
features.py writes output to files whose names start with the
accession number, i.e. M60287.out and M60287.msg.
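When features.py is called from a script rather than typed at a prompt, the quoting concern mentioned earlier in this example can be avoided by passing the expression as a single list element, so that no shell is involved. The Python sketch below assumes that features.py is on the PATH; the helper names build_command and run_features are hypothetical.

```python
import shlex
import subprocess


def build_command(expression, to_files=False):
    """Build the argument list for features.py; -e sends output to files."""
    argv = ["features.py"]
    if to_files:
        argv.append("-e")
    argv.append(expression)
    return argv


def run_features(expression):
    """Run the single-argument form and return its standard output as text.

    Assumes features.py is installed and on the PATH.
    """
    result = subprocess.run(build_command(expression), check=True,
                            capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    # shlex.join shows how the same command would be quoted for a shell.
    print(shlex.join(build_command('M60287:/gene="flaL"')))
```

Because the argument list is handed to the program directly, the double quotes, parentheses and semicolons in an expression need no extra escaping.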
Example 3: Retrieving features from a custom GenBank
dataset
Since the most time-consuming operation in FEATURES is
sequence retrieval, it is often best to retrieve frequently-used
sequences as database subsets. For example, a set of GenBank entries
for chlorophyll a/b binding protein genes might be stored in a
file called CAB.gen.
features.py -f CDS -A CAB.acc -u CAB.gen
would generate the files CAB.msg, CAB.out and CAB.exp
containing output for all CDS features in the entries listed in
the file CAB.acc.
features.py -E CAB.exp -u CAB.gen
would re-create the output file CAB.out.
BUGS
FEATURES does not check the syntax of GenBank expressions
before they are evaluated. Expressions that
cannot be evaluated will be flagged by GETOB in the .msg file.
At present, little checking is done to test for the presence or
correctness of input files. Some errors may cause the program to
crash.
For User-defined datasets, filename expansion is not performed.
SEE ALSO
seqfetch.py, getob, splitdb
AUTHOR
Dr. Brian Fristensky
Department of Plant Science
University of Manitoba
Winnipeg, MB Canada R3T 2N2
frist@cc.umanitoba.ca
http://home.cc.umanitoba.ca/~frist
REFERENCES
Fristensky, B. (1993) Feature Expressions: Creating and
Manipulating Sequence Datasets. Nucleic Acids Res. 21:5997-6003
(https://doi.org/10.1093/nar/21.25.5997).
GenBank Release Notes (https://ftp.ncbi.nih.gov/genbank/gbrel.txt).