SPLITDB                                                 update 16 Aug 2001

NAME
      splitdb - split GenBank files into annotation, sequence, and index

SYNOPSIS
      splitdb [-gepvlct] dbfile anofile seqfile indfile

DESCRIPTION
      Splitdb splits a database (dbfile) among three files: anofile,
      seqfile and indfile. Splitdb ignores any header information that
      might be in the file and begins processing at the first entry.

      anofile contains the annotation portion of each entry. Entries are
      terminated with '//' or '///' (PIR only). Trailing blanks present in
      dbfile are omitted in anofile.

      seqfile contains the sequence data for each entry. Each sequence
      entry begins with a header line, followed by sequence data on
      succeeding lines of 75 characters per line. The header line consists
      of the header flag character '>' in column 1, followed by the name,
      followed by the first 50 characters of the first DEFINITION line.
      An example is shown below:

      >UNHOR1 - Unicorn horn protein 1, complete cDNA sequence
      attcctctatagtctattctagctagccaaataggttagatggctgtcttactacttacgc
      ...

      Removal of blanks and numbers from sequence lines makes split
      datasets about 8-9% smaller than the original GenBank files.

      indfile is an index that gives the line numbers at which each entry
      begins in anofile and seqfile. It is assumed to be in alphabetical
      order by name. Each line contains a name and accession number,
      followed by the line numbers on which the annotation and sequence
      data begin in anofile and seqfile, respectively. Thus the file
      plants.ind might contain:

               1         2         3         4         5         6         7         8
      12345678901234567890123456789012345678901234567890123456789012345678901234567890
      A15660           TA156608          1          1
      A15671           A15671           33         11
      A15673           A15673           65         25
      A15675           AK156751         97         36
      A15677           BA156770        128         46
      A16780           BA167807        160         57
      A16782           A16782          192         70
      ATHRPRP1C        GM905105        225         83
      etc...

      Note that indfile is a perfectly legitimate .nam file, for use with
      programs such as getloc, getob, or comm.
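      As a rough illustration of how an index line ties an entry to its
      starting lines in anofile and seqfile, the following Python sketch
      parses lines like those in the plants.ind example. It is not part of
      splitdb: the names IndexEntry and parse_ind_line are hypothetical,
      and the sketch assumes fields are separated by whitespace rather
      than relying on exact column positions. Any fields after the fourth
      are ignored.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    """One line of an index file (hypothetical helper type, not from splitdb)."""
    name: str
    accession: str
    ano_line: int  # line on which the annotation begins in anofile
    seq_line: int  # line on which the sequence entry begins in seqfile


def parse_ind_line(line: str) -> IndexEntry:
    """Split one index line into name, accession and the two line numbers.

    Assumption: fields are whitespace-separated; extra trailing fields
    are ignored.
    """
    fields = line.split()
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 fields, got {len(fields)}: {line!r}")
    return IndexEntry(fields[0], fields[1], int(fields[2]), int(fields[3]))


# A few lines taken from the plants.ind example.
sample = """\
A15660 TA156608 1 1
A15671 A15671 33 11
ATHRPRP1C GM905105 225 83
"""

entries = [parse_ind_line(ln) for ln in sample.splitlines() if ln.strip()]
for e in entries:
    print(f"{e.name}: annotation at line {e.ano_line}, sequence at line {e.seq_line}")
```

      A program reading anofile or seqfile could use ano_line and seq_line
      to skip directly to the start of an entry.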
      The following options identify the type of database being read:

      -g   GenBank (default)
      -e   EMBL
      -p   PIR (NBRF)
      -v   Vecbase
      -l   LiMB

      Other options:

      -c   Compress 3 or more leading blanks in annotation lines to take
           the form CRUNCHFLAG followed by CRUNCHCHAR, where CRUNCHFLAG is
           the ASCII character specified by the Pascal const CRUNCHOFFSET,
           which is set to 33 ("!") in the current implementation. For
           each annotation line read, if the number of leading blanks is
           >=3, splitdb sets CRUNCHCHAR to CRUNCHOFFSET + the number of
           blanks. Thus, for lines with 3, 4, or 5 leading blanks,
           CRUNCHCHAR would be '$', '%' and '&', respectively. GETLOC and
           GETOB automatically expand crunched blanks when CRUNCHFLAG is
           encountered on an input line.

           Empirical observations indicate that the -c option decreases
           the size of GenBank files by about 10%. This compression method
           may fail when the number of leading blanks exceeds
           127-CRUNCHOFFSET. However, none of the above-mentioned
           databases currently supports any data field with anywhere near
           that number of leading blanks.

      -t   (GenBank only) Append all information in the first ORGANISM
           line to the end of each line in indfile. For example, the entry
           which begins:

      LOCUS       GORMTDLOOZ    282 bp    DNA             UNA       11-MAR-1996
      DEFINITION  GGGOMT493; Gorilla gorilla gorilla (BomBom, ISIS 438, Audubon
                  Zoological Gardens) mitochondrial D-loop DNA.
      ACCESSION   L76759
      NID         g1222584
      KEYWORDS    D-loop.
      SOURCE      Mitochondrion Gorilla gorilla gorilla (individual_isolate
                  BomBom, ISIS 438, Audubon Zoological Gardens, sub_species
                  gorilla) male DNA.
        ORGANISM  Mitochondrion Gorilla gorilla gorilla
                  Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
                  Vertebrata; Eutheria; Primates; Catarrhini; Hominidae; Gorilla.

           might be indexed as

      GORMTDLOOZ       L76759            1          1 Mitochondrion Gorilla gorilla gorilla

           This is useful for taxonomic studies, or as a way of making it
           easy to create subsets from a single index. Thus,
           'grep gorilla primates.ind' would print all lines in the file
           that contain the word gorilla.
           The output from this command could be used as a .nam file for
           extracting just gorilla sequences from a larger dataset using
           fetch.

NOTES
      1. Header lines that aren't part of entries are automatically
         stripped out during processing. For example, in a file containing
         GenBank entries, all lines up to the first occurrence of 'LOCUS'
         starting in column 1 are ignored. Similarly, for PIR, processing
         begins on the first line containing 'ENTRY' beginning in column 1.

      2. GenBank/EMBL/DDBJ entries created on or after Feb. 1, 1996, have
         accession numbers of 8 characters, rather than 6. Previously
         assigned accession numbers will remain at 6 characters. Splitdb
         has been updated to write all accession numbers to the .ind file,
         left justified in a field of 8 characters.

      3. As of GenBank Release 127.0, the LOCUS names will increase from a
         maximum of 12 to a maximum of 16 characters. To accommodate this
         change, the NAME field in .ind files has expanded to 16
         characters. The current version of splitdb can read both old and
         new GenBank flatfiles, but will create .ind files using the new
         format.

SEE ALSO
      getloc, getob, comm(1) (Unix command).

AUTHOR
      Dr. Brian Fristensky
      Dept. of Plant Science
      University of Manitoba
      Winnipeg, MB  Canada  R3T 2N2
      Phone: 204-474-6085
      FAX: 204-261-5732
      frist@cc.umanitoba.ca

REFERENCE
      Fristensky, B. (1993) Feature expressions: creating and manipulating
      sequence datasets. Nucleic Acids Research 21:5997-6003.
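
EXAMPLE
      The leading-blank compression described for the -c option can be
      sketched in Python as below. This is an illustration only, not the
      Pascal source of splitdb, getloc or getob. It assumes that the flag
      character is chr(CRUNCHOFFSET), i.e. '!', that the flag and
      CRUNCHCHAR together replace the leading blanks, and that lines with
      more than 127-CRUNCHOFFSET leading blanks are simply left
      uncompressed. The function names crunch and expand are hypothetical.

```python
CRUNCHOFFSET = 33                  # value stated for the current implementation
CRUNCHFLAG = chr(CRUNCHOFFSET)     # assumption: the flag character is '!'
MAX_BLANKS = 127 - CRUNCHOFFSET    # beyond this the method may fail


def crunch(line: str) -> str:
    """Replace 3 or more leading blanks with CRUNCHFLAG + CRUNCHCHAR."""
    stripped = line.lstrip(" ")
    nblanks = len(line) - len(stripped)
    if nblanks < 3 or nblanks > MAX_BLANKS:
        # Too few blanks to be worth compressing, or too many to encode
        # (leaving the line unchanged is a choice made for this sketch).
        return line
    crunchchar = chr(CRUNCHOFFSET + nblanks)
    return CRUNCHFLAG + crunchchar + stripped


def expand(line: str) -> str:
    """Restore leading blanks on a line that begins with CRUNCHFLAG.

    Caveat of this sketch: an uncompressed line that happens to begin
    with '!' followed by a character >= '$' would be misread as crunched.
    """
    if len(line) >= 2 and line.startswith(CRUNCHFLAG):
        nblanks = ord(line[1]) - CRUNCHOFFSET
        if 3 <= nblanks <= MAX_BLANKS:
            return " " * nblanks + line[2:]
    return line


for n in (3, 4, 5):
    original = " " * n + "text"
    packed = crunch(original)
    print(repr(original), "->", repr(packed))
    assert expand(packed) == original
```

      With CRUNCHOFFSET set to 33, lines with 3, 4 or 5 leading blanks
      yield the CRUNCHCHAR values '$', '%' and '&', matching the values
      given in the description of -c.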