estano.csh reads all FASTA-formatted nucleotide sequence files with the
file extension .fsa in the current working directory. Each file must
contain a single sequence. For each
file, blastx is run at NCBI using the blastcl3 client. The GI number of
the top blastx hit (the hit with the lowest E-value) is used to obtain
additional information from the SeqHound server at SLRI
[http://www.blueprint.org/seqhound/].
Output is sent to a .csv file, which can be directly imported by most
spreadsheet programs.
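
The input-selection step described above can be sketched in Python as follows. This is only an illustration, not the code in estano.csh: the function names are hypothetical, and it assumes that a sequence is identified by a header line beginning with ">".

```python
from pathlib import Path


def count_fasta_records(path):
    """Count FASTA header lines (lines starting with '>') in a file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def find_single_sequence_files(directory="."):
    """Return the sorted .fsa files in `directory` that hold exactly one sequence.

    Hypothetical helper: files with zero or several records are skipped,
    mirroring the one-sequence-per-file requirement.
    """
    usable = []
    for path in sorted(Path(directory).glob("*.fsa")):
        if count_fasta_records(path) == 1:
            usable.append(path)
    return usable


if __name__ == "__main__":
    for fsa in find_single_sequence_files("."):
        print(fsa.name)
```

Checking the record count before any search is submitted avoids sending a multi-sequence file to blastx by mistake.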

outfile.csv consists of lines of comma-separated fields. Each line
contains the following fields:
(1) EST name: The name of the EST in the .fsa input file.
(2) GI number: The NCBI GI number.
(3) Taxonomy name: The NCBI Taxonomy name, listing genus and species,
corresponding to the GI number.
(4) Protein name: The NCBI protein name corresponding to the GI number.
(5) 3D Structure IDs: The semicolon-separated list of 3D structure IDs
retrieved from the SeqHound API
corresponding to the GI number, with an E-value of 10^-11 or higher
(6) E-value: The E-value of the top blastx hit.
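
As an illustration of how the six fields could be consumed downstream, here is a minimal Python sketch for reading the output file. It rests on several assumptions that the description above does not confirm: that the file has no header line, that the fields appear in the order listed, and that any field containing a comma is quoted. The field keys and the sample row are hypothetical.

```python
import csv
import io

# Hypothetical keys for the six fields, in the order listed above.
FIELDS = ["est_name", "gi", "taxonomy", "protein", "structure_ids", "evalue"]


def read_estano_csv(handle):
    """Parse estano-style CSV lines into a list of dictionaries."""
    records = []
    for row in csv.reader(handle):
        if len(row) != len(FIELDS):
            continue  # skip blank or malformed lines
        record = dict(zip(FIELDS, row))
        # Field (5) is a semicolon-separated list; it may be empty.
        record["structure_ids"] = [
            s.strip() for s in record["structure_ids"].split(";") if s.strip()
        ]
        record["evalue"] = float(record["evalue"])
        records.append(record)
    return records


if __name__ == "__main__":
    # Invented sample row, for illustration only.
    sample = "EST001,12345,Arabidopsis thaliana,hypothetical protein,1ABC;2XYZ,2e-35\n"
    for rec in read_estano_csv(io.StringIO(sample)):
        print(rec["est_name"], rec["structure_ids"], rec["evalue"])
```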

1. A number of improvements might be useful:
- ability to specify blast parameters
- ability to specify a directory from which EST sequence files are to
be read
- ability to read more than one sequence from a file
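
The last of these improvements, reading more than one sequence from a file, could be approached along the following lines. This Python sketch is not part of estano.csh; the function name is hypothetical, and it assumes that the sequence name is the first whitespace-delimited word of each ">" header line.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines.

    Handles any number of records; sequence lines are concatenated.
    """
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


if __name__ == "__main__":
    text = ">est1 first clone\nACGT\nTTGA\n>est2\nGGCC\n"
    for est_name, seq in read_fasta(text.splitlines()):
        print(est_name, len(seq))
```

Each yielded pair could then be written to its own temporary query file, so that the rest of the pipeline still sees one sequence per search.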

2. For almost any realistic EST project, it is impractical to do all of
the BLAST searches at NCBI. Rather, blastall or fasty3 would be used to
search a locally installed database.
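
A local search of this kind might look roughly like the Python sketch below. It is an assumption-laden illustration, not a tested replacement: it uses the legacy blastall options -p, -d, -i, -e and -m 8 (tabular output), and it assumes the standard 12-column tabular layout with the E-value in column 11 and the bit score in column 12. The database name, the E-value cutoff and the function names are hypothetical.

```python
def blastall_command(query, database, evalue="1e-5", program="blastx"):
    """Build a legacy blastall command line that requests tabular output."""
    return [
        "blastall",
        "-p", program,      # search program
        "-d", database,     # locally installed database (hypothetical name)
        "-i", query,        # query file
        "-e", str(evalue),  # E-value cutoff (illustrative value)
        "-m", "8",          # tabular output
    ]


def best_hit(tabular_text):
    """Return (subject, evalue, bit_score) for the hit with the lowest E-value.

    Returns None when there are no hits; on ties the first hit is kept.
    """
    best = None
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 12:
            continue
        subject, evalue, bits = cols[1], float(cols[10]), float(cols[11])
        if best is None or evalue < best[1]:
            best = (subject, evalue, bits)
    return best


if __name__ == "__main__":
    print(" ".join(blastall_command("est1.fsa", "nr")))
```

The command list can be passed to subprocess.run with capture_output=True and text=True, and the captured standard output handed to best_hit.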