Updated April 26, 2016

blastdbkit.py - Install and update local BLAST databases


blastdbkit.py --showall [--ftpsite url]
blastdbkit.py --configure [--birchdir directory --blastdb directory]
blastdbkit.py --reportlocal
blastdbkit.py --reportftp [--ftpsite url]
blastdbkit.py --add [--ftpsite url] --dblist db[,db]
blastdbkit.py --delete --dblist db[,db]
blastdbkit.py --update [--ftpsite url] --dblist db[,db]

blastdbkit.py performs tasks on a BLAST database whose location is given in the environment variable $BLASTDB.

A full technical description of how blastdbkit.py works can be found at

--showall - Print an alphabetical list of the BLAST databases available at the remote FTP site. Default: ftp.ncbi.nlm.nih.gov. Use --ftpsite to specify a different FTP site.
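As a rough illustration of how a database list could be derived from an FTP directory listing, the sketch below collapses multi-part archive names such as nt.00.tar.gz into a single database name and sorts the result. The function name database_names and the assumption that archives end in .tar.gz are for illustration only; this is not taken from the blastdbkit.py source.

```python
import re

# Matches "nt.00.tar.gz" (one part of a multi-part database)
# as well as "taxdb.tar.gz" (a single-file database).
_ARCHIVE = re.compile(r"^(?P<db>.+?)(?:\.\d+)?\.tar\.gz$")


def database_names(filenames):
    """Return the sorted, de-duplicated database names found in a listing.

    Hypothetical helper: files that are not .tar.gz archives
    (checksum files, READMEs, ...) are ignored.
    """
    names = set()
    for filename in filenames:
        match = _ARCHIVE.match(filename)
        if match:
            names.add(match.group("db"))
    # Case-insensitive sort so the list reads alphabetically.
    return sorted(names, key=str.lower)
```

For example, a listing containing nt.00.tar.gz, nt.01.tar.gz and taxdb.tar.gz would yield the two names nt and taxdb.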

--configure - This option is called by getbirch during installs and updates of a BIRCH system. If the BLASTDB environment variable is already set (i.e. a BLAST database already exists on the system), this value is set in $BIRCH/local/admin/BIRCH.settings. Otherwise, BLASTDB is set to $BIRCH/GenBank. The location of the BIRCH directory must be given on the command line with --birchdir, because --configure is called during a fresh BIRCH install, when neither the $BIRCH environment variable nor this setting in BIRCH.properties can be relied on. Not compatible with --add, --delete or --update.

--birchdir directory - path to the BIRCH home directory ($BIRCH).

--blastdb directory - path to the BLAST database directory. In BIRCH, the default is $BIRCH/GenBank.
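
The rule --configure uses to choose the database directory can be summarized in a few lines. This is a minimal sketch under stated assumptions: resolve_blastdb is a hypothetical name, the environment is passed in as a dictionary so the rule can be tried without changing the real environment, and any precedence given to an explicit --blastdb argument is not covered.

```python
import os


def resolve_blastdb(birchdir, environ=None):
    """Pick the BLAST database directory as described for --configure.

    Hypothetical helper: an existing, non-empty BLASTDB value wins;
    otherwise fall back to <birchdir>/GenBank.
    """
    environ = os.environ if environ is None else environ
    existing = environ.get("BLASTDB", "").strip()
    if existing:
        return existing
    return os.path.join(birchdir, "GenBank")
```

Passing birchdir explicitly mirrors the requirement that --birchdir be given on the command line, since $BIRCH may not be defined yet during a fresh install.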

--reportlocal - Write a spreadsheet-ready report with statistics on the local copy of the NCBI databases. The report is a tab-separated value file written to $BLASTDB/localstats.tsv.

--reportftp - Write a spreadsheet-ready report with statistics on the remote copy of the NCBI databases. The report is a tab-separated value file written to $BLASTDB/ftpstats.tsv.

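
A tab-separated report of this kind can be produced with Python's csv module. The sketch below is illustrative only: the column names (database, files, bytes) and both function names are assumptions, since the actual columns written by blastdbkit.py are not listed here.

```python
import csv
import io


def render_tsv_report(rows, columns=("database", "files", "bytes")):
    """Return the report as one tab-separated string, header line first.

    The column names are hypothetical placeholders.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buf.getvalue()


def write_tsv_report(path, rows, columns=("database", "files", "bytes")):
    """Write the report to path, e.g. a file under $BLASTDB."""
    with open(path, "w", newline="") as handle:
        handle.write(render_tsv_report(rows, columns))
```

A file written this way opens directly in a spreadsheet program, with one database per row.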
--add - Add files in dblist from the FTP site specified by --ftpsite to the BLASTDB database.

--delete - Delete files in dblist from the BLASTDB database.

--update - Update files in dblist from the FTP site specified by --ftpsite. BLAST databases are often split into many parts, e.g. nt.00.tar.gz, nt.01.tar.gz, nt.02.tar.gz. During an update, only the files that are newer than the locally installed copies are downloaded. This avoids downloading an entire database again when only a few files have changed.
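
The selection step of an update can be sketched as a comparison of modification times. The function name parts_to_download and the two dictionaries (file name mapped to a timestamp) are assumptions for illustration; treating a part that is missing locally as needing download is also an assumption, not something stated above.

```python
def parts_to_download(remote_mtimes, local_mtimes):
    """Return the remote files that are newer than the local copies.

    Both arguments map a file name (e.g. "nt.00.tar.gz") to a
    comparable timestamp. Hypothetical helper: files with no local
    copy are treated as needing download.
    """
    needed = []
    for name, remote_time in sorted(remote_mtimes.items()):
        local_time = local_mtimes.get(name)
        if local_time is None or remote_time > local_time:
            needed.append(name)
    return needed
```

With three remote parts of nt, of which only nt.01.tar.gz has changed and nt.02.tar.gz is new, only those two files would be fetched.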

--ftpsite url - FTP site from which to download pre-formatted BLAST database files, e.g. ftp.ncbi.nih.gov. update_blastdb.pl will not download files if md5 checksum files are not available. Depending on the FTP site chosen, blastdbkit.py downloads files from the appropriate directory, as listed in the table below. It is usually best to download files from the FTP site geographically closest to your location.
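
Since md5 checksum files are mentioned above, here is one way a downloaded archive could be checked against such a file. This is a sketch, not the blastdbkit.py implementation: both function names are hypothetical, and it assumes the checksum file holds the hexadecimal digest as its first whitespace-separated token.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, md5_text):
    """Compare a file with the contents of its checksum file.

    Assumes md5_text looks like "<hexdigest>  <filename>"; an empty
    checksum file is treated as a failed check.
    """
    tokens = md5_text.split()
    if not tokens:
        return False
    return md5_of_file(path) == tokens[0].lower()
```

Reading in chunks keeps memory use low, which matters for archives that are several gigabytes in size.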

FTP site
Directory for BLAST file downloads
Bethesda, Maryland, USA
Tokyo, Japan

--dblist db[,db] - a comma-separated list of databases that should be installed. Every database in the list will be installed or, if it is already installed, updated. A database that is currently installed but is not included in the list will be deleted.
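
Read literally, the --dblist rules sort each database into one of three groups. The sketch below shows that classification; parse_dblist and plan_from_dblist are hypothetical names, the installed databases are passed in as a plain list, and the special value 'all' is not handled here.

```python
def parse_dblist(arg):
    """Split a comma-separated --dblist value into database names."""
    return [name.strip() for name in arg.split(",") if name.strip()]


def plan_from_dblist(arg, installed):
    """Classify databases according to the --dblist description.

    Hypothetical helper: listed but not installed -> install,
    listed and installed -> update, installed but not listed -> delete.
    """
    requested = parse_dblist(arg)
    installed = set(installed)
    return {
        "install": [db for db in requested if db not in installed],
        "update": [db for db in requested if db in installed],
        "delete": sorted(installed - set(requested)),
    }
```

For example, with nt and pdbaa installed, the list nt,swissprot would mark swissprot for installation, nt for update and pdbaa for deletion.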

The argument 'all' can be used with --add, --update and --delete as follows. Note: Because of the size of these databases, 'all' should only be used after careful forethought!

blastdbkit.py --update --dblist all
Updates all currently installed databases.

blastdbkit.py --add --dblist all
Adds ALL databases from the remote FTP site. At this writing, that corresponds to about 850 GB!

blastdbkit.py --delete --dblist all
Deletes all currently installed databases. For obvious reasons, this is potentially a dangerous option!

The following options are mutually exclusive: --configure, --reportlocal, --reportftp, --add, --delete, --update.
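
One common way to enforce this kind of rule in Python is argparse's mutually exclusive group. The sketch below is an assumption about how such a command line could be declared, not a copy of the blastdbkit.py parser; build_parser is a hypothetical name, and --showall is left outside the group because it is not in the list above.

```python
import argparse


def build_parser():
    """Build a parser in which the task options exclude one another."""
    parser = argparse.ArgumentParser(prog="blastdbkit.py")
    tasks = parser.add_mutually_exclusive_group()
    for flag in ("--configure", "--reportlocal", "--reportftp",
                 "--add", "--delete", "--update"):
        tasks.add_argument(flag, action="store_true")
    # Options that are not part of the exclusive set.
    parser.add_argument("--showall", action="store_true")
    parser.add_argument("--ftpsite")
    parser.add_argument("--dblist")
    parser.add_argument("--birchdir")
    parser.add_argument("--blastdb")
    return parser
```

With this declaration, a command line that combines two task options, such as --add with --delete, is rejected by argparse with a usage error before any work is done.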

Table 1. Codes for --dblist
nt Non-redundant nucleotide
refseqgene RefSeq Gene
refseq_rna RefSeq RNA
human_genomic Human Genomic - RefSeq Human chromosomal
refseq_genomic RefSeq Genomic
human_genomic_transcript Human Genomic plus Transcripts
mouse_genomic_transcript Mouse Genomic plus Transcripts
Representative_Genomes Representative Genomes
other_genomic Other Genomic - RefSeq non-human chromosomal seqs
vector Vector
patnt Patented Nucleotide
pdbnt Nucleotide sequences from PDB 3D nucl. acid structures
16SMicrobial 16S Microbial
nr Non-redundant protein
refseq_protein RefSeq Protein
swissprot Uniprot
pataa Patented Protein
pdbaa Protein sequences from PDB 3D protein structures
cdd_delta Conserved Domain Database for DeltaBlast
env_nt Environmental - Nucleotide
env_nr Environmental - Protein
est Expressed Sequence Tags
est_human Expressed Sequence Tags - Human
est_mouse Expressed Sequence Tags - Mouse
est_others Expressed Sequence Tags - Other
sts Sequence-Tagged Sites
gss Genome Survey Sequences
gss_annot Genome Survey Sequences - Annotation
htgs High Throughput Genomic Sequencing
tsa_nt Transcriptome Shotgun Assembly nucleotide
wgs Whole Genome Sequencing
taxdb Taxonomy

Some of the ideas in blastdbkit.py have been borrowed from the NCBI script update_blastdb.pl. blastdbkit.py differs from update_blastdb.pl in a number of ways:

Dr. Brian Fristensky
Department of Plant Science
University of Manitoba
Winnipeg, MB  Canada R3T 2N2