XYLEM_IDENTIFY                                        update 26 Jul 02

NAME
       xylem_identify - creates a file of locus names corresponding to
       lines found by grep in a GenBank annotation file.

SYNOPSIS
       xylem_identify grepfile indfile namefile findfile

DESCRIPTION
       grepfile is created using the Unix grep command to search a .ano
       file created by splitgb. For example, to find all lines
       containing the word 'chlorophyll' in plant.ano, use

              grep -n -i 'chlorophyll' plant.ano > plant.grep

       In the example shown, the -n option causes each line written to
       plant.grep to be preceded by the number of that line in
       plant.ano. (The -i option causes grep to ignore case.)
       xylem_identify uses indfile to determine which entry a given
       numbered line was found in, and writes the corresponding LOCUS
       name to namefile. In addition, all lines found in a given entry
       are rewritten to findfile without the line numbers, and preceded
       by the LOCUS name for that entry.

EXAMPLES
       Suppose you wanted to obtain a list of names for all plant
       sequences which code for proteins. The task is complicated by
       the fact that many fungal sequences are included in the GenBank
       plant file. You could begin by searching plant.ano (containing
       all GenBank plant entries) for the word 'Planta':

              grep -n 'Planta' plant.ano > Planta.grep

       However, we want to eliminate all fungal sequences, as well as
       all sequences for RNAs other than mRNAs. If we create the file
       bad.str containing the keywords

              Mycophyta
              tRNA
              rRNA
              uRNA

       we can then type

              grep -n -f bad.str plant.ano > bad.grep

       bad.grep now contains all lines containing the offending
       keywords. We next use xylem_identify to find the names of the
       entries found by grep.

              xylem_identify Planta.grep plant.ind Planta.nam Planta.fnd
              xylem_identify bad.grep plant.ind bad.nam bad.fnd

       Next, we can use the Unix comm command to compare the two .nam
       files and produce an output file containing only names which are
       present in Planta.nam but not in bad.nam:

              comm -23 Planta.nam bad.nam > plants.nam

       The file plants.nam now contains names of either plant cDNA or
       genomic sequences which do not code for structural RNAs. At this
       point, getloc could be used to create a sub-database containing
       only those entries listed in plants.nam. See the documentation
       for getloc for a more detailed discussion.

SEE ALSO
       grep, fgrep, egrep, ngrep, comm, splitgb, getloc

AUTHOR
       Dr. Brian Fristensky
       Dept. of Plant Science
       University of Manitoba
       Winnipeg, MB Canada R3T 2N2
       Phone: 204-474-6085
       FAX: 204-261-5732
       frist@cc.umanitoba.ca

REFERENCE
       Fristensky, B. (1993) Feature expressions: creating and
       manipulating sequence datasets. Nucleic Acids Research
       21:5997-6003.
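
A note on the comm step in EXAMPLES: comm compares its two inputs line by line and expects both to be sorted in the same collating order. Whether the .nam files written by xylem_identify are already sorted is not stated here, so the sketch below sorts them first as a precaution. The locus names are made up for illustration, and the .srt file names are hypothetical; neither comes from xylem_identify itself.

```shell
# Use one collating order for both sort and comm.
export LC_ALL=C

# Made-up locus names standing in for real xylem_identify output.
printf '%s\n' PEACAB15 ATHRBCS MZERRNA > Planta.nam
printf '%s\n' MZERRNA > bad.nam

# comm expects sorted input; -u also drops any repeated names.
sort -u Planta.nam > Planta.srt
sort -u bad.nam > bad.srt

# -23 suppresses columns 2 and 3, leaving names found only in Planta.srt.
comm -23 Planta.srt bad.srt > plants.nam
cat plants.nam
```

With these sample names, plants.nam keeps the two names that do not appear in bad.nam. If the real .nam files turn out to be sorted already, the two sort commands are harmless.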