BIRCH
Local BLAST Databases
Important Considerations

May  4, 2020


Before deciding whether or not to install a set of BLAST databases, a number of factors must be considered:

Disk space

Databases are downloaded as gzip-compressed tar archive files (i.e. .tar.gz files). When uncompressed, the combined files in any given database will require anywhere from 1.1 to 14 times the size of the .tar.gz archive files. For most databases, the ratio of decompressed to compressed size is around 2.5. The table below gives a snapshot of sizes and decompression ratios for NCBI databases downloaded May 4, 2020.

blastdbkit.py: LOCAL BLAST DATABASE REPORT

Source:              localhost              ftp.ncbi.nih.gov
Database Directory:  /home/psgendb/GenBank  /blast/db

DB name                  Uncompressed size (MB)  Compressed size (MB)*  Decompression ratio
nt                       149731                  64046                  2.34
refseq_rna               43752                   15815                  2.77
human_genome             1566                    1192                   1.35
mouse_genome             1322                    1025                   1.33
ref_euk_rep_genomes      199793                  186603                 1.07
ref_prok_rep_genomes     12651                   12368                  1.03
ref_viroids_rep_genomes  0                       30                     1.00
ref_viruses_rep_genomes  82                      108                    1.05
patnt                    17847                   6709                   2.67
pdbnt                    14                      31                     14.00
16S_ribosomal_RNA        14                      36                     2.33
18S_fungal_sequences     1                       30                     1.00
28S_fungal_sequences     3                       31                     3.00
ITS_RefSeq_Fungi         5                       32                     2.50
ITS_eukaryote_sequences  37                      39                     4.11
LSU_eukaryote_rRNA       5                       32                     2.50
LSU_prokaryote_rRNA      2                       31                     2.00
SSU_eukaryote_rRNA       6                       32                     3.00
Betacoronavirus          29                      38                     3.63
nr                       404350                  98231                  4.12
refseq_protein           170800                  51766                  3.30
swissprot                695                     183                    4.54
pdbaa                    267                     63                     8.09
landmark                 374                     160                    2.88
env_nt                   79383                   40140                  1.98
taxdb                    161                     30                     5.37
TOTAL:                   1082890

*Compressed database files for each division of the database include a copy of the taxonomy database (taxdb), which at this writing is about 30 MB. Thus, although the total size of the compressed files for ITS_eukaryote_sequences is 39 MB, 30 MB of that is a copy of taxdb. Each time the database files are de-archived, any existing copy of taxdb is overwritten. Consequently, there will only be one copy of taxdb after each install or update, even though many copies may have been downloaded. Decompression ratios in the table are calculated after subtracting the compressed size of taxdb (e.g. 30 MB) from the compressed size.
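The calculation described in the footnote can be sketched in a few lines of Python. This is only an illustration, not the code used by blastdbkit.py; the function name and the constant TAXDB_MB are hypothetical, and the 30 MB figure is the taxdb size quoted above.

```python
# Illustrative sketch only; names are hypothetical, not taken from blastdbkit.py.
TAXDB_MB = 30  # compressed size of the bundled taxdb copy, as quoted in the footnote


def decompression_ratio(uncompressed_mb, compressed_mb, taxdb_mb=TAXDB_MB):
    """Return uncompressed / (compressed - taxdb).

    Returns None when nothing is left after subtracting taxdb, since the
    ratio is then undefined.
    """
    net_compressed = compressed_mb - taxdb_mb
    if net_compressed <= 0:
        return None
    return uncompressed_mb / net_compressed


# Rows taken from the table above.
print(f"{decompression_ratio(37, 39):.2f}")  # ITS_eukaryote_sequences: 4.11
print(f"{decompression_ratio(14, 31):.2f}")  # pdbnt: 14.00
```

For the largest databases the 30 MB correction is negligible, but for small databases such as pdbnt it dominates the result, which is why the ratios for small databases vary so widely.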


When considering how much disk space you can realistically afford to use, keep in mind that you can't just fill up the remaining space on the file system with databases. Day to day use of any computer demands some breathing room in available disk space. A big consideration is that to download any database file, you need enough space to hold both the downloaded file and the uncompressed contents. As well, just about any program running on the system will generate some disk files, even if only temporarily. A good rule of thumb is that you should avoid allowing any filesystem to fill much beyond about 90% of its capacity.
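As a rough illustration of the reasoning above, the following Python sketch checks whether a database would fit. It assumes, conservatively, that the 90% rule of thumb should hold even at the moment when both the downloaded archive and its uncompressed contents are on disk. The function name and the max_fill parameter are hypothetical; shutil.disk_usage is part of the Python standard library.

```python
import shutil


def fits_with_headroom(total, used, compressed, uncompressed, max_fill=0.90):
    """Return True if the archive plus its uncompressed contents can be held
    at the same time without filling the filesystem beyond max_fill.

    All sizes must be in the same unit (bytes here).
    """
    peak_used = used + compressed + uncompressed
    return peak_used <= max_fill * total


MB = 1000 ** 2

# Example: would the nt database (sizes from the table above) fit on the
# filesystem that holds the current directory?
usage = shutil.disk_usage(".")
print(fits_with_headroom(usage.total, usage.used, 64046 * MB, 149731 * MB))
```

If archives are deleted immediately after extraction, the long-term requirement is lower than this peak, so the check errs on the side of caution.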


Memory and Cores*

Unless you have enough cores and RAM, it may be pointless to install local copies of BLAST databases.

Standalone BLAST+ uses a great deal of memory, and may be unreasonably slow on machines without an adequate number of cores (CPUs). For most databases, you probably want a minimum of 8 cores and 16 GB of RAM. The good news is that it is relatively cheap to upgrade a computer to a configuration that will give you faster turnaround times than sending your jobs to NCBI.
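A quick way to compare a machine against these suggested minimums is sketched below in Python. The thresholds are the ones given above; the function names are hypothetical. Note that os.cpu_count reports logical CPUs, which may be higher than the number of physical cores, and that the os.sysconf names used here are only available on some Unix-like systems, so the RAM check is an assumption about the platform.

```python
import os

MIN_CORES = 8     # suggested minimum from the text above
MIN_RAM_GB = 16   # suggested minimum from the text above


def meets_minimum(cores, ram_gb, min_cores=MIN_CORES, min_ram_gb=MIN_RAM_GB):
    """Return True if both the core count and the RAM meet the minimums."""
    return cores >= min_cores and ram_gb >= min_ram_gb


def detect_ram_gb():
    """Return physical RAM in GB, or None if it cannot be determined
    (e.g. on platforms without os.sysconf or these sysconf names)."""
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None
    return page_size * page_count / 1024 ** 3


cores = os.cpu_count() or 0  # logical CPUs, not necessarily physical cores
ram_gb = detect_ram_gb()
if ram_gb is None:
    print(f"{cores} logical CPUs; RAM could not be determined on this platform")
else:
    print(f"{cores} logical CPUs, {ram_gb:.1f} GB RAM:",
          "OK" if meets_minimum(cores, ram_gb) else "below suggested minimum")
```

A machine sold as having 16 GB will usually report slightly less than 16 GB of usable RAM, so a small tolerance on min_ram_gb may be appropriate in practice.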

A future version of this document will include some statistics on search times on computer systems with various configurations of CPU and RAM.

*The distinction between cores and CPUs is as follows:
A Central Processing Unit performs operations on data in RAM. Originally, the CPU had one processor. Today, the vast majority of CPUs manufactured, even for low-end PCs, have 2 or more cores, each of which can process information independently. The terms CPU and core, while not synonymous, are often used interchangeably. Strictly speaking, it is most correct to use the term core to refer to the number of processing units.

Network Load

A hard-wired (Ethernet) connection is best.  In most institutional settings, each computer will have a separate IP address on a switch.

Wifi may be too slow or not have adequate bandwidth. Downloads will almost certainly take longer on Wifi. As well, since Wifi is a shared resource, large downloads may affect the Wifi performance for others nearby.

Mirrors

There are several FTP sites open to the general public for downloading copies of the NCBI BLAST databases. Since these "mirrors" are kept in sync with NCBI, the primary considerations should be download speed and, as a courtesy to others, minimizing the amount of network load that your downloads generate. As well, network traffic can often result in dropped connections during a download.

For all these reasons, it is usually best to download files from the FTP site geographically closest to your location.

FTP site          Directory for BLAST file downloads  Location
ftp.ncbi.nih.gov  /blast/db                           Bethesda, Maryland, USA
ftp.ebi.ac.uk     pub/blast/db                        EBI, UK
ftp.hgc.jp        pub/mirror/ncbi/blast/db            Tokyo, Japan
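The mirror list above can be captured in a small Python sketch. The helper names are hypothetical. Using the time needed to open a TCP connection to the FTP port as a stand-in for "closest" is an assumption made for illustration; it is only a rough proxy for geographic distance or download speed.

```python
import socket
import time

# Hosts and directories taken from the table above.
MIRRORS = {
    "ftp.ncbi.nih.gov": "/blast/db",
    "ftp.ebi.ac.uk": "pub/blast/db",
    "ftp.hgc.jp": "pub/mirror/ncbi/blast/db",
}


def mirror_url(host):
    """Build the FTP URL of the BLAST database directory on a mirror."""
    return f"ftp://{host}/{MIRRORS[host].strip('/')}/"


def connect_time(host, port=21, timeout=5.0):
    """Seconds needed to open a TCP connection, or None if unreachable."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


def fastest(timings):
    """Return the host with the smallest timing, ignoring unreachable hosts."""
    reachable = {host: t for host, t in timings.items() if t is not None}
    return min(reachable, key=reachable.get) if reachable else None


for host in MIRRORS:
    print(mirror_url(host))
```

To rank the mirrors from a given machine, one could call fastest({host: connect_time(host) for host in MIRRORS}); this needs network access and was left out of the block above for that reason.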


Backups

In general, it is best practice to exclude your NCBI databases from your backup schedule. Keeping large BLAST databases on your machine could have a major impact on your automated backups. This is highly dependent on how you do your backups, and on the media (e.g. networked servers, tapes, removable media) to which you do your backups.
 

Network backups - If your computer is automatically backed up over the network, backups of NCBI databases will generate substantial network load. As well, you need to know that the destination media, such as backup drives or cloud storage, have the capacity to handle the backed-up databases.

Backups to local media - If you back up to local media such as tapes or backup drives, you need to take into account the capacity of the media, as well as the speed. The entire backup process could be slowed down depending on the devices to which you do your backups.




Please send suggestions or comments regarding this page to psgendb@cc.umanitoba.ca