FEATURES                                              update 14 June 2024


      NAME
            FEATURES - extracts features from GenBank entries
      
      SYNOPSIS

            features.py expression
            features.py [-f featurekey | -F keyfile]
                     [-n name     |-a accession    | -e expression |
                      -N namefile |-A accfile      | -E expfile]
                     [-u dbfile   | -U dbfile      | -g ] 
            features.py -h     

      DESCRIPTION
            FEATURES extracts sequence objects from GenBank entries, using
            the Features Table language. Features can be retrieved either by 
            specifying keywords (eg. CDS, mRNA, exon, intron etc.) or by 
            evaluating expressions. In practical terms, FEATURES is a
            wrapper for GETOB, which performs the parsing and extraction
            of sequence objects. 
       
            'features' followed by an expression retrieves the data directly
            from GenBank and evaluates the expression. The second form of 
            features requires all arguments to be accompanied by their 
            respective option flags. Finally, 'features -h' prints the
            SYNOPSIS. 

      Feature keys:  
       -f  key               {feature key}
       -F  filename          {file of feature keys}

       The following example keyfile would retrieve both tRNA and rRNA
       sequences:

	 OBJECTS
	 tRNA
	 rRNA
	 SITES

      The words 'OBJECTS' and 'SITES' must enclose the feature keywords,
      and each keyword must be on a separate line. For a rigorous
      definition of the input file format, see the GETOB manual pages
      (getob.txt).
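
      As an illustration, a keyfile in this layout could be generated with
      a few lines of Python. This is only a sketch: the function names
      keyfile_text and write_keyfile are hypothetical, and it assumes that
      leading whitespace on each line is not required (see getob.txt for
      the authoritative format).

```python
from pathlib import Path


def keyfile_text(feature_keys):
    """Return keyfile contents: one feature key per line, enclosed by
    the words OBJECTS and SITES."""
    lines = ["OBJECTS", *feature_keys, "SITES"]
    return "\n".join(lines) + "\n"


def write_keyfile(path, feature_keys):
    """Write the keyfile to disk (hypothetical helper, not part of FEATURES)."""
    Path(path).write_text(keyfile_text(feature_keys))


if __name__ == "__main__":
    # Same keys as the example above: tRNA and rRNA.
    print(keyfile_text(["tRNA", "rRNA"]), end="")
```

      The resulting file could then be passed to FEATURES with -F.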
      
       A complete list of legal feature keywords can be found in the GenBank
       Release notes (gbrel.txt) under the subheading 'Feature Key Names'.


       Entries:      
       -n name                {GenBank LOCUS name}
       -N filename            {file of GenBank LOCUS names}
       -a accession           {GenBank ACCESSION number}
       -A filename            {file of GenBank ACCESSION numbers}


       -e expression          {Feature Table expression}
       -E filename            {file of Feature Table expressions, each
                               beginning with '@'}
        -e  Expressions take the form

            accession:location

            where accession is a GenBank accession number and location is
            any legal feature location. A brief description of location
            syntax can be found under the subheading "Feature Location" in
            the GenBank release notes (gbrel.txt). See "The DDBJ/EMBL/GenBank
            Feature Table: Definition" for a complete definition.
        -E  The name of a file containing one or more feature expressions.
            EACH EXPRESSION MUST BEGIN WITH '@'. All lines beginning
            with '@' are processed as expressions, and all other lines are
            copied to the output file unchanged.
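
            The rule for expression files can be sketched in Python as
            follows. This is not the FEATURES or GETOB implementation; the
            function names are hypothetical, and splitting an expression at
            its first colon is an assumption based on the
            accession:location form given above.

```python
def split_expression_lines(lines):
    """Separate '@' expression lines from lines that would be copied
    to the output unchanged."""
    expressions, passthrough = [], []
    for line in lines:
        text = line.rstrip("\n")
        if text.startswith("@"):
            expressions.append(text[1:])  # drop the leading '@'
        else:
            passthrough.append(text)
    return expressions, passthrough


def split_expression(expression):
    """Split 'accession:location' at the first colon (assumed rule)."""
    accession, _, location = expression.partition(":")
    return accession, location


if __name__ == "__main__":
    # The '>flaL' header line is an illustrative non-expression line.
    sample = [">flaL\n", "@M60287:305..640\n"]
    expressions, passthrough = split_expression_lines(sample)
    for expression in expressions:
        print(split_expression(expression))
```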

        Databases:
        -u filename           {GenBank dataset}
        -U filename           {GenBank dataset; process all entries, i.e.
                              the -n, -N, -a, -A, -e and -E options will
                              be ignored}
        -g                    {GenBank}

     
            For example:

            features.py -f tRNA -a M81884

      In the example, FEATURES was instructed to retrieve all tRNAs from
      the GenBank entry EPFCPCG with Accession number M81884, which contains
      the Epifagus plastid genome. By default, the GenBank database was the
      source of the sequence. A log describing the extraction of each feature
      is written to M81884.msg, while the
      extracted features themselves are written to M81884.out. Feature 
      expressions which could be used by FEATURES to reconstruct the .out
      file, are written to M81884.exp. 

      The first step is to retrieve the EPFCPCG entry from GenBank, which is
      accomplished by calling seqfetch.py. Next, FEATURES extracts the specified
      features from the entry.
      
      An excerpt from M81884.msg is shown below, describing the extraction
      of the fifth tRNA found in this entry. To create this tRNA, two exons
      had to be joined. The qualifier lines associated with this feature 
      indicate that it is a histidine tRNA with a gtg anticodon.


      EPFCPCG:anticodon gtg
          complement     
              (
                  join           
                (
                           70023                         70028

                           1                         69

                      )

              )


      /product="transfer RNA-His"
      /gene="His-tRNA"
      /label=anticodon gtg
      /note="anticodon gtg"
      //----------------------------------------------

 
      The actual sequence for this feature, as written to M81884.out, is
      shown below, with each exon beginning a new line:

      >EPFCPCG:anticodon gtg
      ggcggatgtagccaaatggatcaaggtagtggattgtgaatccaacatat
      gcgggttcaattcccgtcg
      ttcgcc
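
      Because a feature may span several lines in the .out file, a program
      reading it has to rejoin the lines under each '>' header. The
      following Python sketch does this for the record shown above. The
      function name read_records is hypothetical, and the sketch assumes
      that feature names are unique within the file.

```python
def read_records(lines):
    """Collect '>'-headed records, joining the sequence lines of each
    record into a single string keyed by the feature name."""
    records = {}
    name = None
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if text.startswith(">"):
            name = text[1:]
            records[name] = ""
        elif name is not None:
            records[name] += text
    return records


SAMPLE = """\
>EPFCPCG:anticodon gtg
ggcggatgtagccaaatggatcaaggtagtggattgtgaatccaacatat
gcgggttcaattcccgtcg
ttcgcc
"""

if __name__ == "__main__":
    for name, sequence in read_records(SAMPLE.splitlines()).items():
        # The joined locations 70023..70028 and 1..69 cover 6 + 69 = 75 bases.
        print(name, len(sequence))
```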

      Finally, the expression that was evaluated to create this feature is 
      written to M81884.exp:

      >EPFCPCG:anticodon gtg
      @M81884:anticodon gtg

      If M81884.exp were used as an expression file with the -E option of
      FEATURES, M81884.out would be recreated.
 

         Examples:

            Each feature is found by its qualifier line from the
            original entry. Note that the first 15 characters after the
            '=' of a qualifier line must be unique within the entry.
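
            Before relying on a qualifier to identify a feature, it may be
            worth checking this uniqueness rule. The Python sketch below is
            one way to do so. The function names are hypothetical, the two
            /note lines are invented for illustration, and the sketch
            assumes that an opening quotation mark counts as one of the 15
            characters.

```python
from collections import Counter


def qualifier_prefix(qualifier, width=15):
    """Return the first `width` characters following '=' in a qualifier line."""
    _, _, value = qualifier.partition("=")
    return value[:width]


def ambiguous_prefixes(qualifiers, width=15):
    """Return the prefixes shared by more than one qualifier line."""
    counts = Counter(qualifier_prefix(q, width) for q in qualifiers)
    return sorted(prefix for prefix, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Hypothetical qualifier lines that differ only after the 15th character.
    qualifiers = [
        '/note="tRNA-Leu (anticodon caa)"',
        '/note="tRNA-Leu (anticodon uaa)"',
        '/gene="His-tRNA"',
    ]
    print(ambiguous_prefixes(qualifiers))
```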

	    The flaL protein coding region of B. licheniformis is described
	    in GenBank entry BLIFALA, accession number M60287 in the
	    following feature:

            CDS             305..640
                            /gene="flaL"
                            /note="flaD (sin) homologue; putative; label: ORF2"
                            /codon_start=1
                            /transl_table=11
                            /protein_id="AAA22439.1"

         This feature could be retrieved using any of the following
         expressions:

		 features.py 'M60287:305..640'
		 features.py 'M60287:/gene="flaL"'
		 features.py 'M60287:/note="flaD (sin) homologue; putative; label: ORF2"'
 
         Some locations contain characters that have a special meaning to
         the shell. It is therefore safest to enclose expressions in single
         quotes, as shown above, when typing them on the command line.
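
         When command lines are assembled by a script rather than typed,
         the quoting can be left to the Python standard library. In this
         sketch, shlex.quote is a real standard-library function, while
         features_command is a hypothetical helper written for this
         example.

```python
import shlex


def features_command(expression, program="features.py"):
    """Build a shell command line with the expression safely quoted."""
    return f"{program} {shlex.quote(expression)}"


if __name__ == "__main__":
    # Expressions taken from the flaL example above.
    for expression in ("M60287:305..640", 'M60287:/gene="flaL"'):
        print(features_command(expression))
```

         shlex.quote adds single quotes only when the expression contains
         characters that the shell would otherwise interpret, so a plain
         location such as 305..640 is passed through unquoted.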

         DATABASE (WHERE TO GET IT) - By default, all entries processed will
         be automatically retrieved from GenBank using seqfetch.py. Specifying
         -u (User-defined database subset) makes it possible to extract
         features from GenBank subsets created by the user. Retrieval of
         features is usually much faster with a User-defined subset, so if you
         frequently work with sets of genes, it is best to retrieve them
         en masse using seqfetch.py, and work with them directly. For example,
         if you had retrieved a set of Beta-globin sequences into a file called
         'globin.gen', you could directly extract features from these entries
         by specifying 'globin' or 'globin.gen' as your User-defined database.
         If the file extension is '.gen', FEATURES will automatically create
         temporary files called globin.ano, globin.wrp and globin.ind,
         containing annotation, sequence, and an index, respectively. These
         files will be read during feature extraction, and then discarded. If
         you have already created such files using SPLITDB, simply specify
         any of 'globin', 'globin.ano', etc., i.e. any name that does not
         have the .gen file extension.
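
         The naming convention can be summarized in a short Python sketch.
         It only models the rule described above and is not taken from the
         FEATURES source; describe_dataset is a hypothetical name, and
         stripping whatever extension is present to obtain the base name is
         an assumption.

```python
from pathlib import PurePath

# Extensions of the annotation, sequence and index files named above.
SPLIT_EXTENSIONS = (".ano", ".wrp", ".ind")


def describe_dataset(name):
    """Return (needs_split, split_files) for a user-defined dataset name.

    needs_split is True only for a '.gen' name, in which case the
    split files are described above as temporary.
    """
    path = PurePath(name)
    needs_split = path.suffix == ".gen"
    base = path.with_suffix("")  # drop the extension, if any
    split_files = [f"{base}{ext}" for ext in SPLIT_EXTENSIONS]
    return needs_split, split_files


if __name__ == "__main__":
    for name in ("globin.gen", "globin.ano", "globin"):
        print(name, describe_dataset(name))
```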

         -U rather than -u causes ALL entries in the user-defined
         database to be processed. It is therefore unnecessary to 
         specify entry options (eg. -n, -N etc.); these are ignored
         if given.


         WHERE TO SEND IT - By default, the output for all entries goes
         to a single set of files, whose names are chosen by FEATURES,
         depending on which Entries option is used. If a single name (-n) or
         accession number (-a) has been chosen, that will be used as
         the raw filename. For example, if you were processing the entry
         WHTCAB, the output files would be WHTCAB.msg and WHTCAB.out. If names
         (-N), accession numbers (-A) or expressions (-E) were read from a
         file, the raw name of that file would be used, eg. cellulase.nam
         would result in cellulase.msg and cellulase.out. Finally, if a
         single expression is processed (-e), then the primary accession
         number in that expression will be used for the filenames. In all
         cases, FEATURES will tell you the names of the files being written.
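
         The naming rules above can be sketched in Python as follows. The
         function names are hypothetical, and two details are assumptions
         rather than documented behavior: that the primary accession number
         is the text before the first colon of the expression, and that any
         directory part of a file name is dropped.

```python
from pathlib import PurePath


def raw_name_from_option(flag, value):
    """Derive the raw output filename from the Entries option in use."""
    if flag in ("-n", "-a"):
        return value                     # single name or accession number
    if flag in ("-N", "-A", "-E"):
        return PurePath(value).stem      # file name without its extension
    if flag == "-e":
        return value.partition(":")[0]   # accession part of the expression
    raise ValueError(f"not an Entries option: {flag}")


def output_names(raw_name, extensions=(".msg", ".out", ".exp")):
    """Return the output file names built from the raw filename."""
    return [raw_name + ext for ext in extensions]


if __name__ == "__main__":
    for flag, value in (("-n", "WHTCAB"), ("-N", "cellulase.nam"),
                        ("-e", "M60287:95..271")):
        print(flag, value, output_names(raw_name_from_option(flag, value)))
```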

         Choosing suboption s, you can specify that the features created for
         each entry be sent to separate files. In this case, each file will
         have the name of that entry, with the extension .obj. However, all
         messages and expressions will still go to a single set of files.
         While this can be a convenient way of creating separate files when
         you need them, this option still has the limitation of writing all
         features for a given entry (if there is more than one) to the same
         file. Also, successive resolution of features (anything requiring
         'getob -r') will not work with this option. This may be corrected
         in future versions.


      COMMAND LINE EXECUTION

      There are two ways of running FEATURES from the command line. If only one
      argument is supplied, that argument is interpreted as an expression, and
      the result of that expression (i.e. a sequence) is written to the 
      standard output. The .msg, .out and .exp files are NOT created. For example,
      GenBank entry BACFLALA (M60287) contains the following feature:

      CDS             95..271
                      /label=LORF-
                      /codon_start=1
                      /translation="MNKDKNEKEELDEEWTELIKHALEQGISPDDIRIFLNLGKKSSK
                      PSASIERSHSINPF"
      Any of 

      features M60287:LORF-
      features M60287:95..271
      features M60287:/label=LORF-

      would write the open reading frame to the standard output: 

      atgaataaagataaaaatgagaaagaagaattggatgaggagtggacaga
      actgattaaacacgctcttgaacaaggcattagtccagacgatatacgta
      tttttctcaatttgggtaagaagtcttcaaaaccttccgcatcaattgaa
      agaagtcattcaataaatcctttctga

      This form of FEATURES is provided to make it easy to pipe output to 
      other programs for further processing. For example

      features M60287:LORF- |ribosome >LORF.protein

      would write the translation of the open reading frame to a file called
      LORF.protein.      

      The full functionality of FEATURES can be accessed using arguments on
      the command line. In particular, when there are multiple entries to be
      processed, or multiple features within entries, it is much faster to
      supply FEATURES with lists of entries, feature keys or expressions. 
      The command line options are those listed in the SYNOPSIS above.


        Examples:
        
        features -f tRNA -n EPFCPCG

        retrieves all tRNAs from GenBank entry EPFCPCG and writes .msg, .out,
        and .exp files.

        features -e M60287:LORF-  	  

        would retrieve the same open reading frame as in the earlier example.


        Since the most time-consuming operation in FEATURES is sequence
        retrieval, it is often best to retrieve frequently-used sequences as
        database subsets. For example, a set of GenBank entries for
        chlorophyll a/b binding protein genes might be stored in a file
        called CAB.gen.

        features -f CDS -N CAB.nam -u CAB.gen

        would generate the files CAB.msg, CAB.out and CAB.exp containing output 
        for all CDS features in the entries listed in the file CAB.nam.

        features -E CAB.exp -u CAB.gen
 
        would re-create the output file CAB.out. 
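
        When FEATURES is driven from a script, the argument list can be
        assembled and checked before the program is run. The Python sketch
        below uses only the flags listed in the SYNOPSIS. The keyword names
        (feature_key, namefile, dataset and so on) and the function
        build_argv are hypothetical, and the sketch assumes that the
        alternatives separated by '|' in the SYNOPSIS are mutually
        exclusive.

```python
# Hypothetical keyword names mapped to the flags listed in the SYNOPSIS.
FLAGS = {
    "feature_key": "-f", "keyfile": "-F",
    "name": "-n", "namefile": "-N",
    "accession": "-a", "accfile": "-A",
    "expression": "-e", "expfile": "-E",
    "dataset": "-u", "dataset_all": "-U",
}

# At most one option from each group (assumed from the '|' notation).
GROUPS = [
    ("feature_key", "keyfile"),
    ("name", "namefile", "accession", "accfile", "expression", "expfile"),
    ("dataset", "dataset_all", "genbank"),
]


def build_argv(program="features.py", genbank=False, **options):
    """Return an argument list for FEATURES, rejecting conflicting options."""
    unknown = set(options) - set(FLAGS)
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    chosen = set(options) | ({"genbank"} if genbank else set())
    for group in GROUPS:
        picked = [key for key in group if key in chosen]
        if len(picked) > 1:
            raise ValueError(f"choose only one of: {picked}")
    argv = [program]
    for key, flag in FLAGS.items():
        if key in options:
            argv += [flag, options[key]]
    if genbank:
        argv.append("-g")
    return argv


if __name__ == "__main__":
    # Corresponds to: features -f CDS -N CAB.nam -u CAB.gen
    print(build_argv(feature_key="CDS", namefile="CAB.nam", dataset="CAB.gen"))
```

        The resulting list could be passed to subprocess.run from the
        Python standard library, which avoids shell quoting altogether.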
        
     BUGS
       FEATURES does no preliminary error checking for syntax of 
       GenBank expressions prior to their evaluation. Expressions that
       cannot be evaluated will be flagged by GETOB in the .msg file.

       At present, little checking is done to test for the presence or
       correctness of input files. Some errors may cause the program to
       crash.

       For User-defined datasets, filename expansion is not performed.

     FILES
        Temporary files:
          X.term X.ano X.wrp X.ind X.gen {X is the raw filename; see
                                          WHERE TO SEND IT}
          UNRESOLVED.fea UNRESOLVED.out
          FEA.inf FEA.nam FEA.gen FEA.ano FEA.wrp FEA.ind FEA.msg FEA.out

     SEE ALSO
            grep(1V) getob splitdb 
 

     AUTHOR
       Dr. Brian Fristensky
       Dept. of Plant Science
       University of Manitoba
       Winnipeg, MB  Canada  R3T 2N2
       Phone: 204-474-6085
       FAX: 204-474-7528
       frist@cc.umanitoba.ca

     REFERENCE
       Fristensky, B. (1993) Feature expressions: creating and manipulating
       sequence datasets. Nucleic Acids Research 21:5997-6003.