Important Considerations |
May 1, 2016 |
Databases are downloaded as gzip-compressed tar archive files
(i.e. .tar.gz files). When uncompressed, the combined files in a
given database will typically require from just over 1 to about 5.3
times the size of the .tar.gz archive file. For most databases, the
ratio of decompressed to compressed size is around 2.5. The one
exception in the table is the very small pdbnt archive, whose ratio
of 41.10 is well outside this range. The table below gives a
snapshot of sizes and decompression ratios for NCBI databases
downloaded between April 1 and 3, 2016.
blastdbkit.py: | REMOTE FTP BLAST DATABASE REPORT | ||
FTP site: | ftp.ncbi.nih.gov | ||
Database Directory: | /blast/db | ||
DB name: | compressed size (Mbytes) | uncompressed size (Mbytes) | decompression ratio |
nt | 26067 | 39188 | 1.50 |
refseqgene | 119 | 125 | 1.05 |
refseq_rna | 6846 | 11552 | 1.69 |
human_genomic | 11244 | 11941 | 1.06 |
refseq_genomic | 192483 | 210822 | 1.10 |
Representative_Genomes | 1349 | 1403 | 1.04 |
other_genomic | 63062 | 68332 | 1.08 |
vector | 0.554 | 1 | 1.81 |
patnt | 4085 | 9745 | 2.39 |
pdbnt | 0.073 | 3 | 41.10 |
16SMicrobial | 4 | 10 | 2.50 |
nr | 26225 | 85465 | 3.26 |
refseq_protein | 15163 | 37732 | 2.49 |
swissprot | 122 | 336 | 2.75 |
pataa | 327 | 1097 | 3.35 |
pdbaa | 23 | 108 | 4.70 |
cdd_delta | 1414 | 3045 | 2.15 |
env_nt | 24839 | 45543 | 1.83 |
env_nr | 1127 | 2947 | 2.61 |
est | na | na | na |
est_human | 1350 | 3004 | 2.23 |
est_mouse | 715 | 1774 | 2.48 |
est_others | 10335 | 23918 | 2.31 |
sts | 200 | 476 | 2.38 |
gss | 7454 | 15789 | 2.12 |
gss_annot | 38 | 105 | 2.76 |
htgs | 6568 | 7112 | 1.08 |
tsa_nt | 14931 | 32388 | 2.17 |
human_genomic_transcript | 6392 | 8368 | 1.31 |
mouse_genomic_transcript | 2693 | 2948 | 1.09 |
wgs | 179751 | 215608 | 1.20 |
taxdb | 19 | 100 | 5.26 |
TOTAL: | 593701.627 |
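A decompression ratio like those in the table can be estimated for any archive you have already downloaded. The following is a minimal sketch using Python's standard `tarfile` module; the function name `decompression_ratio` is made up for illustration, and this is not necessarily how blastdbkit.py produces its report.

```python
import os
import tarfile


def decompression_ratio(archive_path):
    """Return (total size of files inside the archive) / (size of the .tar.gz)."""
    compressed = os.path.getsize(archive_path)
    with tarfile.open(archive_path, "r:gz") as tar:
        # Sum the sizes recorded in the tar headers; nothing is extracted,
        # but the whole archive is read, so this is slow for large files.
        uncompressed = sum(m.size for m in tar.getmembers() if m.isfile())
    return uncompressed / compressed
```

Calling `decompression_ratio("swissprot.tar.gz")` on a local copy of an archive would return a single floating-point ratio; the file name here is only an example.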
When considering how
much disk space you can realistically afford to use, keep in
mind that you can't just fill up the remaining space on the file
system with databases. Day-to-day use of any computer demands
some breathing room in available disk space. A big consideration
is that to download any database file, you need enough space to
hold both the downloaded file and its uncompressed contents at the
same time. In addition, just about any program running on the
system will generate some disk files, even if only temporarily.
A good rule of thumb is to avoid letting any filesystem fill much
beyond about 90% of its capacity.
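The two rules above (room for both the archive and its contents, and staying under about 90% full) can be combined into a quick check before downloading. This is a minimal sketch, not part of blastdbkit.py: the function names are hypothetical, and it assumes that one Mbyte in the table means 2**20 bytes.

```python
import shutil

MBYTE = 2**20  # assumption: the table's "Mbytes" are 2**20 bytes


def peak_mb(compressed_mb, uncompressed_mb):
    """Space needed while the archive and its extracted files coexist."""
    return compressed_mb + uncompressed_mb


def would_stay_below(total_bytes, used_bytes, needed_bytes, max_fill=0.90):
    """True if adding needed_bytes keeps the filesystem at or under max_fill."""
    return used_bytes + needed_bytes <= max_fill * total_bytes


def fits_with_headroom(path, compressed_mb, uncompressed_mb, max_fill=0.90):
    """Check the filesystem holding `path` against the 90% rule of thumb."""
    usage = shutil.disk_usage(path)
    needed = peak_mb(compressed_mb, uncompressed_mb) * MBYTE
    return would_stay_below(usage.total, usage.used, needed, max_fill)
```

For example, using the nr figures from the table, `fits_with_headroom("/data/blast", 26225, 85465)` asks whether about 111690 Mbytes can be added to the filesystem holding `/data/blast` (a made-up path) without exceeding 90% of its capacity.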
Memory and Cores*
Unless you have enough cores and RAM, it
may be pointless to install local copies of BLAST databases.
Standalone BLAST+ uses a great deal of memory, and may be unreasonably slow on machines without enough cores (CPUs). For most databases, you probably want a minimum of 8 cores and 16 GB of RAM. The good news is that it is relatively cheap to upgrade a computer to a configuration that will give you faster turnaround times than sending your jobs to NCBI.
A future version of
this document will include some statistics on search times on
computer systems with various configurations of
CPU and RAM.
*The distinction
between cores and CPUs is as follows:
A Central Processing
Unit (CPU) performs operations on data in RAM. Originally, a CPU
had one processor. Today, the vast majority of CPUs
manufactured, even for low-end PCs, have 2 or more
cores, each of which can process information independently.
The terms CPU and core, while not synonymous, are often used
interchangeably. Strictly speaking, it is most correct to use
the term core to refer to the number of processing units.
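To see how a given machine compares with the suggested minimum of 8 cores and 16 GB of RAM, something like the following sketch can be used. The function names are hypothetical. Note two assumptions: `os.cpu_count()` reports logical CPUs, which can be higher than the number of physical cores on machines with hyper-threading, and the RAM query relies on `os.sysconf`, which is not available on every platform (for example Windows), in which case `None` is returned.

```python
import os

MIN_CORES = 8     # suggested minimum from the text
MIN_RAM_GB = 16   # suggested minimum from the text


def logical_cores():
    """Number of logical CPUs seen by the operating system (at least 1)."""
    return os.cpu_count() or 1


def total_ram_gb():
    """Physical RAM in decimal GB, or None if the platform can't report it."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None
    # Decimal GB, because the OS usually reports slightly less than the
    # nominal installed memory when measured in binary units.
    return pages * page_size / 1e9


def meets_minimum(cores, ram_gb, min_cores=MIN_CORES, min_ram_gb=MIN_RAM_GB):
    """True only if both the core count and the RAM reach the minimums."""
    return cores >= min_cores and ram_gb is not None and ram_gb >= min_ram_gb
```

A typical call would be `meets_minimum(logical_cores(), total_ram_gb())`.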
A hard-wired (Ethernet)
connection is best. In most institutional settings, each
computer will have a separate IP address on a switch.
Wi-Fi may be too slow
or may not have adequate bandwidth. Downloads will almost certainly
take longer on Wi-Fi. In addition, since Wi-Fi is a shared resource,
large downloads may degrade Wi-Fi performance for others
nearby.
There are several FTP
sites open to the general public for downloading copies of the
NCBI BLAST databases. Since these "mirrors" are kept in sync
with NCBI, the primary considerations should be download speed and
minimizing the amount of network load that your downloads generate,
as a courtesy to others. Heavy network traffic can also cause
connections to drop partway through a download.
For all these reasons,
it is usually best to download files from the FTP site
geographically closest to your location.
FTP site | Directory for BLAST file downloads | Location |
ftp.ncbi.nih.gov | /blast/db | Bethesda, Maryland, USA |
ftp.ebi.ac.uk | pub/blast/db | EBI, UK |
ftp.hgc.jp | pub/mirror/ncbi/blast/db | Tokyo, Japan |
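Because connections can drop during a long download, it helps to resume a partial file rather than start over. The sketch below uses Python's standard `ftplib` and the FTP `REST` command (via the `rest` argument of `retrbinary`) to continue from the bytes already on disk. It is an illustration only: the function name, the `connect` parameter (present so a different FTP class can be substituted), and the use of anonymous login are assumptions, and it presumes the chosen mirror supports restarting transfers.

```python
import os
from ftplib import FTP, all_errors

# Hosts and directories taken from the table of mirrors.
MIRRORS = {
    "ftp.ncbi.nih.gov": "/blast/db",
    "ftp.ebi.ac.uk": "pub/blast/db",
    "ftp.hgc.jp": "pub/mirror/ncbi/blast/db",
}


def fetch_with_resume(host, filename, dest, max_attempts=5, connect=FTP):
    """Download `filename` from a mirror, resuming after dropped connections.

    Returns True on success, False if every attempt failed.
    """
    for _ in range(max_attempts):
        # Resume from however many bytes are already saved locally.
        offset = os.path.getsize(dest) if os.path.exists(dest) else 0
        ftp = None
        try:
            ftp = connect(host, timeout=60)
            ftp.login()  # assumption: anonymous login is accepted
            ftp.cwd(MIRRORS[host])
            with open(dest, "ab") as out:
                ftp.retrbinary("RETR " + filename, out.write,
                               rest=offset or None)
            ftp.quit()
            return True
        except all_errors:
            continue  # connection dropped or refused; try again
        finally:
            if ftp is not None:
                ftp.close()
    return False
```

For example, `fetch_with_resume("ftp.ebi.ac.uk", "swissprot.tar.gz", "swissprot.tar.gz")` would try up to five times to complete the file; the archive name is only an example, and a leftover partial file from an unrelated download should be deleted first.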