BIRCH
Local BLAST Databases
Important Considerations

May  1, 2016


Before deciding whether or not to install a set of BLAST databases, a number of factors must be considered:

Disk space

Databases are downloaded as gzip-compressed tar archive files (i.e. .tar.gz files). When uncompressed, the combined files in a given database will typically require anywhere from about 1.04 to 5.3 times the size of the .tar.gz archive file. For most databases, the ratio of decompressed to compressed size is around 2.5. Very small archives can be outliers: pdbnt in the table below has a ratio of about 41. The table below gives a snapshot of sizes and decompression ratios for NCBI databases downloaded between April 1 and 3, 2016.


blastdbkit.py: REMOTE FTP BLAST DATABASE REPORT

FTP site: ftp.ncbi.nih.gov
Database Directory: /blast/db

DB name                   compressed size (Mbytes)  uncompressed size (Mbytes)  decompression ratio
nt                        26067                     39188                       1.50
refseqgene                119                       125                         1.05
refseq_rna                6846                      11552                       1.69
human_genomic             11244                     11941                       1.06
refseq_genomic            192483                    210822                      1.10
Representative_Genomes    1349                      1403                        1.04
other_genomic             63062                     68332                       1.08
vector                    0.554                     1                           1.81
patnt                     4085                      9745                        2.39
pdbnt                     0.073                     3                           41.10
16SMicrobial              4                         10                          2.50
nr                        26225                     85465                       3.26
refseq_protein            15163                     37732                       2.49
swissprot                 122                       336                         2.75
pataa                     327                       1097                        3.35
pdbaa                     23                        108                         4.70
cdd_delta                 1414                      3045                        2.15
env_nt                    24839                     45543                       1.83
env_nr                    1127                      2947                        2.61
est                       na                        na                          na
est_human                 1350                      3004                        2.23
est_mouse                 715                       1774                        2.48
est_others                10335                     23918                       2.31
sts                       200                       476                         2.38
gss                       7454                      15789                       2.12
gss_annot                 38                        105                         2.76
htgs                      6568                      7112                        1.08
tsa_nt                    14931                     32388                       2.17
human_genomic_transcript  6392                      8368                        1.31
mouse_genomic_transcript  2693                      2948                        1.09
wgs                       179751                    215608                      1.20
taxdb                     19                        100                         5.26
TOTAL: 593701.627
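
As a rough illustration of how the ratio column relates to the two size columns, the following Python sketch recomputes the ratio for one row and uses a ratio to estimate an uncompressed size. The function names are hypothetical and are not part of blastdbkit.py; the figures come from the nr row of the table above.

```python
def decompression_ratio(compressed_mb, uncompressed_mb):
    """Ratio of uncompressed to compressed size, as in the table's last column."""
    return uncompressed_mb / compressed_mb


def estimate_uncompressed_mb(compressed_mb, ratio=2.5):
    """Estimate the uncompressed size of an archive from its compressed size.

    The default ratio of 2.5 is the typical figure quoted in the text.
    It is only a rough guide, since real ratios vary widely between databases.
    """
    return compressed_mb * ratio


if __name__ == "__main__":
    # nr row: 26225 Mbytes compressed, 85465 Mbytes uncompressed
    print(round(decompression_ratio(26225, 85465), 2))
    # Estimate for a hypothetical 1000 Mbyte archive at the typical ratio
    print(estimate_uncompressed_mb(1000))
```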

When considering how much disk space you can realistically afford to use, keep in mind that you can't just fill up the remaining space on the file system with databases. Day-to-day use of any computer demands some breathing room in available disk space. A big consideration is that to download any database file, you need enough space to hold both the downloaded archive and its uncompressed contents at the same time. As well, just about any program running on the system will generate some disk files, even if only temporarily. A good rule of thumb is to avoid letting any file system fill much beyond about 90% of its capacity.
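
The two constraints above (room for both the archive and its uncompressed contents, and staying under roughly 90% full) can be checked before a download. The sketch below is one possible way to do this in Python. The function names are hypothetical, the sizes are assumed to be in units of 1024 x 1024 bytes, and checking the peak usage against the 90% limit is a deliberately conservative choice rather than anything the BIRCH tools are known to do.

```python
import shutil

MB = 1024 * 1024  # assumption: one "Mbyte" in the report is 2**20 bytes


def fits_with_headroom(total_bytes, used_bytes, compressed_mb,
                       uncompressed_mb, max_fill=0.90):
    """Return True if the download and its contents fit under max_fill.

    Peak usage occurs while the .tar.gz archive and the extracted
    files are both on disk, so the two sizes are added together.
    """
    peak_bytes = used_bytes + (compressed_mb + uncompressed_mb) * MB
    return peak_bytes <= total_bytes * max_fill


def can_install(path, compressed_mb, uncompressed_mb, max_fill=0.90):
    """Check the file system that holds `path` using its current usage."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return fits_with_headroom(usage.total, usage.used,
                              compressed_mb, uncompressed_mb, max_fill)


if __name__ == "__main__":
    # nr row from the table: 26225 compressed, 85465 uncompressed
    print(can_install(".", 26225, 85465))
```

Here "." stands in for whichever directory would hold the databases.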

Memory and Cores*

Unless you have enough cores and RAM, it may be pointless to install local copies of BLAST databases.

Standalone BLAST+ uses a great deal of memory, and may be unreasonably slow on machines without an adequate number of cores (CPUs). For most databases, you probably want a minimum of 8 cores and 16 GB of RAM. The good news is that it is relatively cheap to upgrade a computer to a configuration that will give you faster turnaround times than sending your jobs to NCBI.
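
To see how a given machine compares with the suggested minimum of 8 cores and 16 GB of RAM, something like the following Python sketch could be used. The function names are hypothetical. The memory query relies on os.sysconf names that exist on Linux but may be missing on other systems, in which case the sketch simply reports that RAM could not be determined.

```python
import os

MIN_CORES = 8     # suggested minimum from the text
MIN_RAM_GB = 16   # suggested minimum from the text


def meets_minimum(cores, ram_gb, min_cores=MIN_CORES, min_ram_gb=MIN_RAM_GB):
    """Return True if both the core count and the RAM reach the minimum."""
    return cores >= min_cores and ram_gb >= min_ram_gb


def detect_cores():
    """Number of logical CPUs seen by the OS (may include hyperthreads)."""
    return os.cpu_count() or 1


def detect_ram_gb():
    """Physical RAM in units of 2**30 bytes, or None if it cannot be read."""
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None
    return page_size * page_count / 1024 ** 3


if __name__ == "__main__":
    cores = detect_cores()
    ram_gb = detect_ram_gb()
    print("cores:", cores)
    if ram_gb is None:
        print("RAM: could not be determined on this system")
    else:
        # The OS usually reports slightly less than the nominal amount,
        # so a "16 GB" machine may show a value a little below 16.
        print("RAM (GB):", round(ram_gb, 1))
        print("meets suggested minimum:", meets_minimum(cores, ram_gb))
```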

A future version of this document will include some statistics on search times on computer systems with various configurations of CPU and RAM.

*The distinction between cores and CPUs is as follows:
A Central Processing Unit (CPU) performs operations on data in RAM. Originally, a CPU had a single processor. Today, the vast majority of CPUs, even in low-end PCs, have 2 or more cores, each of which can process information independently. The terms CPU and core, while not synonymous, are often used interchangeably. Strictly speaking, it is most correct to use the term core when counting processing units.

Network Load

A hard-wired (Ethernet) connection is best. In most institutional settings, each computer will have a separate IP address on a switch.

Wi-Fi may be too slow or may not have adequate bandwidth. Downloads will almost certainly take longer on Wi-Fi. As well, since Wi-Fi is a shared resource, large downloads may affect Wi-Fi performance for others nearby.
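
To get a feel for what connection speed means in practice, download time can be estimated from archive size and link speed. The Python sketch below is only back-of-the-envelope arithmetic. The function name is hypothetical, the efficiency factor of 0.8 is an assumed allowance for protocol overhead and sharing, and sizes are treated as decimal megabytes (8 megabits each).

```python
def download_hours(size_mb, link_mbit_per_s, efficiency=0.8):
    """Estimate download time in hours.

    size_mb          archive size in megabytes (e.g. from the size table)
    link_mbit_per_s  nominal link speed in megabits per second
    efficiency       assumed fraction of the nominal speed actually achieved
    """
    megabits = size_mb * 8
    seconds = megabits / (link_mbit_per_s * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    nt_mb = 26067  # compressed size of nt from the table
    for speed in (20, 100, 1000):  # illustrative link speeds in Mbit/s
        print(speed, "Mbit/s:", round(download_hours(nt_mb, speed), 2), "hours")
```

The link speeds in the loop are illustrative values, not measurements.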

Mirrors

There are several FTP sites open to the general public for downloading copies of the NCBI BLAST databases. Since these "mirrors" are kept in sync with NCBI, the primary considerations should be download speed and minimizing the amount of network load that your downloads generate, as a courtesy to others. As well, heavy network traffic can often result in dropped connections during a download.

For all these reasons, it is usually best to download files from the FTP site geographically closest to your location.

FTP site            Directory for BLAST file downloads    Location
ftp.ncbi.nih.gov    /blast/db                             Bethesda, Maryland, USA
ftp.ebi.ac.uk       pub/blast/db                          EBI, UK
ftp.hgc.jp          pub/mirror/ncbi/blast/db              Tokyo, Japan
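
If it is not obvious which mirror is closest, one rough approach is to time how long each site takes to accept a connection. The Python sketch below does this using the sites and directories listed above. The function names are hypothetical, and connection time is only an approximate proxy for proximity. It says little about sustained download speed.

```python
import socket
import time

# Sites and directories as listed in the table above
MIRRORS = {
    "ftp.ncbi.nih.gov": "/blast/db",
    "ftp.ebi.ac.uk": "pub/blast/db",
    "ftp.hgc.jp": "pub/mirror/ncbi/blast/db",
}


def connect_time(host, port=21, timeout=5.0):
    """Seconds taken to open a TCP connection to the FTP port.

    Returns infinity if the connection fails or times out.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return float("inf")


def fastest_mirror(mirrors=MIRRORS, probe=connect_time):
    """Return the host name with the smallest probe time.

    If every probe fails, the first host in `mirrors` is returned.
    """
    return min(mirrors, key=probe)


if __name__ == "__main__":
    host = fastest_mirror()
    print(host, MIRRORS[host])
```

The probe is passed in as a parameter so that a different measurement, such as timing a small file transfer, could be substituted.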


Backups

In general, it is best practice to exclude your NCBI databases from your backup schedule. Keeping large BLAST databases on your machine could have a major impact on your automated backups. This is highly dependent on how you do your backups, and on the media (e.g. networked servers, tapes, removable media) to which you do your backups.
 

Network backups - If your computer is automatically backed up over the network, backups of NCBI databases will generate substantial network load. As well, you need to know that the destination media, such as backup drives or cloud storage, have the capacity to handle the backed-up databases.

Backups to local media - If you back up to local media such as tapes or backup drives, you need to take into account the capacity of the media, as well as the speed. The entire backup process could be slowed down depending on the devices to which you do your backups.
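
When deciding whether a database directory can reasonably be included in a backup, it helps to know how much data it actually holds. The Python sketch below totals the file sizes under a directory. The function name is hypothetical, and the directory is whatever location you pass on the command line. No particular BIRCH path is assumed.

```python
import os
import sys


def dir_size_bytes(root):
    """Total size in bytes of all regular files under `root`.

    Symbolic links are skipped so that linked files are not counted twice.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


if __name__ == "__main__":
    # Directory to measure; defaults to the current directory
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    size_gb = dir_size_bytes(target) / 1024 ** 3
    print(target, "holds about", round(size_gb, 2), "GB")
```

The result can be compared against the free capacity of the backup destination before deciding whether to exclude the directory.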



Please send suggestions or comments regarding this page to psgendb@cc.umanitoba.ca