name
   dbinst: extract Delila instructions from a GenBank database

synopsis
   dbinst(db: in,
          binst: out, einst: out,
          oinst: out, sinst: out,
          olength: out, slength: out,
          dbinstp: in, locuslist: out, missing: out, output: out)

files
   db: a set of GenBank entries
   binst: instructions for finding the beginning of a feature
   einst: instructions for finding the ending of a feature
   oinst: instructions for finding the whole feature, called the "object".
      They are given in the form "from begin + f to end + t" where f and t are
      the "from" and "to" parameters given in dbinstp.
   sinst: instructions for finding the regions between features, called
      the "space".  They have the same form as those of oinst.
   olength: list of object lengths
   slength: list of space lengths
   dbinstp: parameters to control the program
     First line: the name of the feature to use.
     Second line: two integers, the base "from" and the base "to" relative to
        the alignment point to write the instructions.
        If "from" is larger than "to" then generic names "before" and "after"
        are written.  This allows one to make a generic file of instructions
        to be copied and edited later.
     Third line:  The first 4 characters on the line control which instruction
	files are to be written.  To have all 4 on, use 'beos', for begin, end,
	object and space.  Any other character in a position means that the
	corresponding file will not be written.  The file will be rewritten
	however.  Thus beos means write all files, and bEos would not write
        the einst file.
     Fourth line: 2 characters without spaces that control which length
	files are to be written.  To turn both on, use 'os', for object and space.
	Any other character means that the corresponding file will not be
	written.  The file will be rewritten however.
     Fifth line:  If the first character is 'r' then remove obviously
	duplicated instructions and object or space lengths.  When alternative
	splicing occurs, GenBank records the endpoint several times, so that
	the sequence instructions are identical.  By using this toggle switch,
	such cases are eliminated.
     Sixth line:  If the first character is 'f' then the coordinates of the
        instruction are written whether or not the object is off the end
        of the sequence.  This allows one to pick up objects that are
        partially on a piece.

        If the first character is 's' then select against the feature if
        either end is missing.  This makes the length list correspond
        to the instruction set.

     Seventh line:  Alignment shift.  This integer is added to the
        from and to coordinates of the instructions written out.
        Normally this should be 0.  For example, if the zero of splice
        donor sites is defined as the first base of the intron, then
        instructions based on exon coordinates will place the zero base
        1 too low.  By making the alignment shift
        1, the instructions written out will match the expectations of
        other programs. 
        Note: object coordinates are shifted accordingly; this may
        not be quite what you want if you are using them from the olength
        file!  However, the length is not affected.

   locuslist: a list of all the loci in the db that have features of interest.
      This list can be used with dbpull to create reduced databases containing
      only those entries that contain the features we want.
   missing: Features that are listed under the database COMMENT are listed
      here.  These are "EMBL features not translated to GenBank features".  We
      do not consider these to be reliable.  They are NOT included in the binst,
      einst or olength, slength instructions.
   output: messages to the user
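
   The seven-line dbinstp layout described above can be sketched in Python.
   This is only an illustration, not the program's own code: the names
   DbinstParams and parse_dbinstp are hypothetical, and the sketch assumes
   that the switch letters are case sensitive (as the "bEos" example
   suggests) and that anything after the values on a line is a comment.

```python
from dataclasses import dataclass


@dataclass
class DbinstParams:
    """Hypothetical container for the seven dbinstp lines."""
    feature: str             # line 1: feature name
    from_base: int           # line 2: "from" offset
    to_base: int             # line 2: "to" offset
    write_inst: dict         # line 3: which of binst/einst/oinst/sinst are on
    write_len: dict          # line 4: which of olength/slength are on
    remove_duplicates: bool  # line 5: 'r' switch
    end_mode: str            # line 6: 'f', 's' or '' (neither)
    shift: int               # line 7: alignment shift


def parse_dbinstp(text):
    lines = text.splitlines()
    if len(lines) < 7:
        raise ValueError("dbinstp needs seven lines")
    feature = lines[0].split()[0]
    from_base, to_base = (int(tok) for tok in lines[1].split()[:2])
    # A switch is on only when the expected lowercase letter is in its position.
    inst = lines[2][:4].ljust(4)
    write_inst = {name: inst[i] == letter for i, (name, letter)
                  in enumerate(zip(("binst", "einst", "oinst", "sinst"), "beos"))}
    lens = lines[3][:2].ljust(2)
    write_len = {name: lens[i] == letter for i, (name, letter)
                 in enumerate(zip(("olength", "slength"), "os"))}
    first = lines[5][:1]
    return DbinstParams(
        feature=feature,
        from_base=from_base,
        to_base=to_base,
        write_inst=write_inst,
        write_len=write_len,
        remove_duplicates=lines[4][:1] == "r",
        end_mode=first if first in ("f", "s") else "",
        shift=int(lines[6].split()[0]),
    )


EXAMPLE = """CDS     the name of the feature to use.
-40 20  from and to
beOS    instruction switches
os      length switches
r       remove duplicates
F       find-anyway is off because F is capitalized
0       alignment shift
"""

params = parse_dbinstp(EXAMPLE)
```

   With this input the sketch turns binst and einst on and leaves oinst and
   sinst off, because O and S are capitalized.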

description
   The GenBank entries in db are scanned, and Delila instructions are
   generated, according to a desired feature table item.  Four kinds of
   instruction are made:  beginning, ending, object and space.  Beginning
   appears only if the data for the beginning of the feature is in the db.
   Ending appears only if the data for the ending of the feature is in the db.
   Object appears only if both the beginning and ending are there.  Space only
   appears if there was an ending to the previous feature, and the current
   feature has a beginning.  Thus object and space instructions are guaranteed
   to have a "natural" length.
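
   The rules above for when each kind of instruction appears can be
   summarised in a short Python sketch.  The function name and its boolean
   arguments are hypothetical; they stand for "the db has the beginning of
   this feature", "the db has the ending of this feature" and "the previous
   feature had an ending".

```python
def instruction_kinds(has_begin, has_end, prev_has_end):
    """Return which instruction kinds would be written for one feature.

    prev_has_end should be False for the first feature of an entry,
    since there is no previous feature to measure a space from.
    """
    return {
        "begin": has_begin,
        "end": has_end,
        "object": has_begin and has_end,      # needs both ends
        "space": prev_has_end and has_begin,  # previous end to current begin
    }


# A feature whose beginning is missing, following a complete feature:
kinds = instruction_kinds(has_begin=False, has_end=True, prev_has_end=True)
```

   In this case only the ending instruction is produced; neither an object
   nor a space can be given a natural length.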

   The names for the instructions are determined as follows.  The GenBank
   ORGANISM contains the two part genus/species name, such as:

  ORGANISM  Homo sapiens

   The parts are joined into "Homo.sapiens", and this becomes the name of the
   organism and chromosome in the instructions.  The instructions for organism
   and chromosome only change when the genus/species name changes in db.  The
   LOCUS name of the entry is picked up and used as the piece name.  These
   naming conventions are the ones generated automatically by the dbbk program,
   so one need not think about it most of the time.
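
   As an illustration of the naming rule, the following Python sketch joins
   the two parts of an ORGANISM line with a period.  The function name is
   hypothetical, and the truncation of long genus names to genuslimit
   (mentioned under "bugs") is not modelled because its value is not given
   here.

```python
def organism_name(organism_line):
    """Build the organism/chromosome name from a GenBank ORGANISM line."""
    fields = organism_line.split()
    if len(fields) < 3 or fields[0] != "ORGANISM":
        raise ValueError("expected 'ORGANISM  Genus species'")
    # Only the genus and species parts are used.
    return ".".join(fields[1:3])


name = organism_name("  ORGANISM  Homo sapiens")
```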

   In each entry, lines of the form:

    pept    <     1       46     Ig V-R-H region protein, exon x

   are located and used to generate Delila "get" statements.

   If a "<" appears before the first number, then no instruction is
   written to binst, since the beginning point is before the GenBank sequence.

   If a ">" appears before the second number, then no instruction is
   written to einst, since the ending point is after the GenBank sequence.

   If a "<" or ">" appears in the feature line, then no object instructions
   or lengths are written.

   If a ">" appears in the previous feature or "<" appears in the current
   feature, then no space instructions or lengths are written.

   So for the above example, only one Delila instruction would be written:

        get from 46 -10 to 46 +20;

   if the dbinstp contained -10 20, and

        get from 46 before to 46 after;

   if the dbinstp contained 10 -20.  Here "from" is larger than "to", so the
   generic names "before" and "after" are written; they are meant to be
   replaced by integers when the file is later edited.
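
   The two forms of the "get" statement shown above can be reproduced with a
   small Python sketch.  The function name is hypothetical, the alignment
   shift is not modelled, and the way a zero offset is printed is an
   assumption, since the examples only show nonzero values.

```python
def get_statement(coordinate, from_base, to_base):
    """Format one Delila get statement around an alignment coordinate."""
    if from_base > to_base:
        # Generic form, to be edited by hand later.
        return f"get from {coordinate} before to {coordinate} after;"
    # Offsets are written with an explicit sign, as in "-10" and "+20".
    return f"get from {coordinate} {from_base:+d} to {coordinate} {to_base:+d};"


numeric = get_statement(46, -10, 20)
generic = get_statement(46, 10, -20)
```

   For the feature line in the example, numeric matches the first statement
   above and generic matches the second.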

examples
   If dbinstp contains:
CDS     the name of the feature to use.
-40 20  "from" and "to" to write the instructions.
beOS    "beos" means begin, end, object, space instructions written
os      "os" means object and space length file written
r       "r" means remove obviously duplicated instructions.
F       "f" = find-anyway.  's'= select AGAINST feature if either end missing
0       alignment shift:  amount to shift the zero base.

   then instructions to get coding sequence (CDS) starts (binst) and ends
   (einst) from -40 to +20 will be made.

   Instructions for the entire coding region, from 40 bases before the start
   of the peptide to 20 bases after its end, will not be written because O is
   capitalized and so not recognized.

   Instructions for the regions between peptides, from 40 bases inside the end
   of each previous peptide to 20 bases inside the start of the next peptide,
   will not be written because S is capitalized and so not recognized.

documentation
   none

see also
   dbbk.p

author
   Thomas Dana Schneider

bugs
   The program does not produce the instructions for space between the first
   object and the beginning of the sequence or the space after the last object
   in the sequence.  This is possible (and perhaps should be controlled by a
   parameter) but it would not produce "natural" lengths because those space
   lengths depend on the length of the reported sequence.

   It is not clear that spaces are done properly anymore.  Possible bug
   at "SPACE PROBLEM".

   Genus names are limited to genuslimit (a constant) to avoid names longer
   than the standard Delila limit.

technical notes
   The expected column locations of the complement flag in the database, (the
   'before end of piece' and the 'after end of piece' flags) are given in the
   program constants.

   If a file is not written to, this version of the program will not
   touch the file.  Though this could lead to some confusion on
   the part of an incautious user (who thinks the program wrote a file
   when it did not), this does mean that the program will not create
   any new files that are not necessary.