name
   dbinst: extract Delila instructions from a GenBank database

synopsis
   dbinst(db: in, binst: out, einst: out, oinst: out, sinst: out,
          olength: out, slength: out, dbinstp: in, locuslist: out,
          missing: out, output: out)

files
   db: a set of GenBank entries

   binst: instructions for finding the beginning of a feature

   einst: instructions for finding the ending of a feature

   oinst: instructions for finding the whole feature, called the "object".
      They are given in the form "from begin + f to end + t" where f and t
      are the "from" and "to" parameters given in dbinstp.

   sinst: instructions for finding the regions between features, called
      the "space".  They have the same form as those of oinst.

   olength: list of object lengths

   slength: list of space lengths

   dbinstp: parameters to control the program

      First line: the name of the feature to use.

      Second line: two integers, the base "from" and the base "to" relative
      to the alignment point, used to write the instructions.  If "from" is
      larger than "to" then the generic names "before" and "after" are
      written instead.  This allows one to make a generic file of
      instructions to be copied and edited later.

      Third line: The first 4 characters on the line control which
      instruction files are to be written.  To have all 4 on, use 'beos',
      for begin, end, object and space.  Any other character in a position
      means that the corresponding file will not be written.  The file will
      be rewritten however.  Thus beos means write all files, and bEos
      would not write the einst file.

      Fourth line: 2 characters without spaces that control which length
      files are to be written.  To have both on, use 'os', for object and
      space.  Any other character means that the corresponding file will
      not be written.  The file will be rewritten however.

      Fifth line: If the first character is 'r' then remove obviously
      duplicated instructions and object or space lengths.  When alternative
      splicing occurs, GenBank records the endpoint several times, so that
      the sequence instructions are identical.
      By using this toggle switch, such cases are eliminated.

      Sixth line: If the first character is 'f' then the coordinates of the
      instruction are written whether or not the object is off the end of
      the sequence.  This allows one to pick up objects that are partially
      on a piece.  If the first character is 's' then select against the
      feature if either end is missing.  This makes the length list
      correspond to the instruction set.

      Seventh line: Alignment shift.  This integer is added to the "from"
      and "to" coordinates of the instructions written out.  Normally this
      should be 0.  An example helps.  If the zero of splice donor sites is
      defined as the first base of the intron, then instructions written
      from exon coordinates will have a zero base that is 1 too low.  By
      making the alignment shift 1, the instructions written out will match
      the expectations of other programs.
      Note: object coordinates are shifted accordingly; this may not be
      quite what you want if you are using them from the olength file!
      However, the length is not affected.

   locuslist: a list of all the loci in the db that have features of
      interest.  This list can be used with dbpull to create reduced
      databases containing only those entries that contain the features
      we want.

   missing: Features that are listed under the database COMMENT are listed
      here.  These are "EMBL features not translated to GenBank features".
      We do not consider these to be reliable.  They are NOT included in
      the binst, einst or olength, slength instructions.

   output: messages to the user

description
   The GenBank entries in db are scanned, and Delila instructions are
   generated, according to a desired feature table item.  Four kinds of
   instruction are made: beginning, ending, object and space.

   Beginning appears only if the data for the beginning of the feature is
   in the db.
   Ending appears only if the data for the ending of the feature is in
   the db.
   Object appears only if both the beginning and ending are there.
   Space appears only if there was an ending to the previous feature, and
   the current feature has a beginning.
   Thus object and space instructions are guaranteed to have "natural"
   lengths.

   The names for the instructions are determined as follows.  The GenBank
   ORGANISM contains the two part genus/species name, such as:

      ORGANISM  Homo sapiens

   The parts are joined into "Homo.sapiens", and this becomes the name of
   the organism and chromosome in the instructions.  The instructions for
   organism and chromosome only change when the genus/species name changes
   in db.  The LOCUS name of the entry is picked up and used as the piece
   name.  These naming conventions are the ones generated automatically by
   the dbbk program, so one need not think about it most of the time.

   In each entry, lines of the form:

      pept   <    1   46    Ig V-R-H region protein, exon x

   are located and used to generate Delila "get" statements.  If a "<"
   appears before the first number, then no instruction is written to
   binst, since the beginning point is before the GenBank sequence.  If a
   ">" appears before the second number, then no instruction is written to
   einst, since the ending point is after the GenBank sequence.  If a "<"
   or ">" appears in the db, then no object instructions or lengths are
   written.  If a ">" appears in the previous feature or "<" appears in
   the current feature, then no space instructions or lengths are written.

   So for the above example, only one Delila instruction would be written:

      get from 46 -10 to 46 +20;

   if the dbinstp contained -10 20, and

      get from 46 before to 46 after;

   if the dbinstp contained 10 -20, where "before" and "after" stand in
   place of the integers from dbinstp.

examples

   If dbinstp contains:

CDS      the name of the feature to use.
-40 20   "from" and "to" to write the instructions.
beOS     "beos" means begin, end, object, space instructions written
os       "os" means object and space length file written
r        "r" means remove obviously duplicated instructions.
F        "f" = find-anyway.
         's' = select AGAINST feature if either end missing
0        alignment shift: amount to shift the zero base.

   then instructions to get coding sequence (CDS) starts (binst) and ends
   (einst) from -40 to +20 will be made.  Instructions for the entire
   coding region, from -40 before the start of the peptide to 20 bases
   after will not be written because O is capitalized and so not
   recognized.  Instructions for the regions between peptides, from -40
   inside each previous peptide to 20 bases into the inside of the next
   peptide will not be written because S is capitalized and so not
   recognized.

documentation
   none

see also
   dbbk.p

author
   Thomas Dana Schneider

bugs
   The program does not produce the instructions for space between the
   first object and the beginning of the sequence or the space after the
   last object in the sequence.  This is possible (and perhaps should be
   controlled by a parameter) but it would not produce "natural" lengths
   because those space lengths depend on the length of the reported
   sequence.

   It is not clear that spaces are done properly anymore.  Possible bug at
   "SPACE PROBLEM".

   Genus names are limited to genuslimit (a constant) to avoid names longer
   than the standard Delila limit.

technical notes
   The expected column locations of the complement flag in the database,
   (the 'before end of piece' and the 'after end of piece' flags) are given
   in the program constants.

   If a file is not written to, this version of the program will not touch
   the file.  Though this could lead to some confusion on the part of an
   incautious user (who thinks the program wrote a file when it did not),
   this does mean that the program will not create any new files that are
   not necessary.
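
   As an illustration only, the rules given in the description for which
   instructions are written can be sketched in Python.  This is not the
   program's own code: the function names (offset_text, enabled_files,
   feature_instructions) and the boolean "missing" flags are hypothetical,
   and the exact text of an object instruction is an assumption based on
   the form "from begin + f to end + t" given for oinst.  The sketch
   ignores the space instructions, duplicate removal, the 'f' and 's'
   switches and the alignment shift.

```python
def offset_text(frm, to):
    """Return the two offset strings written after the coordinates."""
    if frm > to:
        # "from" larger than "to": the generic names are written instead.
        return "before", "after"
    # Signed integers, as in "get from 46 -10 to 46 +20;".
    return f"{frm:+d}", f"{to:+d}"


def enabled_files(third_line):
    """List the instruction files switched on by the third dbinstp line.

    A file is on only when its position holds the matching lowercase
    letter of 'beos'; any other character switches it off.
    """
    names = ["binst", "einst", "oinst", "sinst"]
    return [name
            for name, want, got in zip(names, "beos", third_line[:4])
            if want == got]


def feature_instructions(begin, end, begin_missing, end_missing, frm, to):
    """Build the begin, end and object instructions for one feature.

    begin_missing / end_missing stand for a "<" before the first number
    or a ">" before the second number of the feature line.
    """
    f_txt, t_txt = offset_text(frm, to)
    written = {}
    if not begin_missing:
        written["binst"] = f"get from {begin} {f_txt} to {begin} {t_txt};"
    if not end_missing:
        written["einst"] = f"get from {end} {f_txt} to {end} {t_txt};"
    if not begin_missing and not end_missing:
        # Assumed form of an object instruction (see the note above).
        written["oinst"] = f"get from {begin} {f_txt} to {end} {t_txt};"
    return written


if __name__ == "__main__":
    # The feature "pept < 1 46" with dbinstp offsets -10 20.
    print(feature_instructions(1, 46, True, False, -10, 20))
    # The third line of the example dbinstp.
    print(enabled_files("beOS"))
```

   For the feature line shown in the description, with a "<" before the
   first number, only the einst entry is produced, which agrees with the
   single instruction quoted there.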