BIRCH
Local BLAST Databases
  Disk space and updates

April 25, 2016


To help you manage your local databases, the BIRCH Administration Tool (birchadmin) can generate spreadsheets listing the sizes and modification dates of both the local copies of databases and their counterparts on a remote FTP site. Because the reports are spreadsheets, you can use them to work out how much space a given choice of databases will need. To generate these reports, first launch birchadmin from the BIRCH launcher using File --> birchadmin, or by typing 'birchadmin' at the command line.

Next, choose UpdateAddInstall --> Reports on local or remote BLAST databases.

To generate a report on the local copy of the databases, choose Report: Local Databases. Note: The Directory field in this menu cannot be changed. It is there solely as a placeholder to indicate the location of the BLASTDB directory.

To generate a report on a remote copy of the databases, first select the FTP site. It is strongly recommended that you generate the report from the same FTP site you plan to download files from. Although mirror FTP sites are supposed to be synchronized, synchronization is not guaranteed shortly after new versions of files are released. Generally, it is best to download from the site that is geographically closest to your location.

Once you have chosen an FTP site, click on Report: FTP database.

Report on your Local BLAST databases.
Once you click on the Report button, the files in the local BLAST database will be checked, and output will be sent to a Tab-Separated Value (.tsv) file called localstats.tsv, written to your BLASTDB directory. This file will automatically open in your default spreadsheet program (e.g. LibreOffice Calc). You will usually be prompted to verify that the Tab character is the separator used in this file. Click on OK to continue.

An example of the spreadsheet is shown at right. Disk space statistics for the filesystem in which the Database Directory resides are given in row 4.

Next, the disk usage for each database is displayed. The database name is the NCBI code for each database. The size in Mb refers to the total size taken by all files in a given database, in megabytes. The Last Update column is the date and time at which the local copy of the database was last modified.

For example, filenames in the non-redundant nucleotide database all begin with the prefix 'nt', as shown in the directory listing below. Large databases such as nt are broken up into parts numbered 00, 01, 02 etc. For each part, there are a number of files, including annotation files, index files, and sequence files, e.g. nt.00.nsq. Each part of the database also includes an MD5 checksum, which is used to verify that the part downloaded correctly.


The "size (Mb)" column gives a subtotal of the sizes of all files in each database. Use it to decide whether you have the disk space to add a database, and to judge whether a database will take a long time to download.

Example of a directory listing for files in the nt.00 part of the nt database.
-rw-r--r-- 1 psgendb psgendb  14521511 Mar 26 06:09 nt.00.nhd
-rw-r--r-- 1 psgendb psgendb    330157 Mar 26 06:09 nt.00.nhi
-rw-r--r-- 1 psgendb psgendb 146250710 Mar 29 17:25 nt.00.nhr
-rw-r--r-- 1 psgendb psgendb   9897836 Mar 29 17:25 nt.00.nin
-rw-r--r-- 1 psgendb psgendb   7668392 Mar 26 06:09 nt.00.nnd
-rw-r--r-- 1 psgendb psgendb     30004 Mar 26 06:09 nt.00.nni
-rw-r--r-- 1 psgendb psgendb   3299280 Mar 26 06:09 nt.00.nog
-rw-r--r-- 1 psgendb psgendb  34287591 Mar 26 06:09 nt.00.nsd
-rw-r--r-- 1 psgendb psgendb    792774 Mar 26 06:09 nt.00.nsi
-rw-r--r-- 1 psgendb psgendb 867970671 Mar 26 06:09 nt.00.nsq
-rw-rw-r-- 1 psgendb psgendb        47 Apr  1 19:35 nt.00.tar.gz.md5


The dates of the database files are their dates of creation at NCBI, whereas the date of the .md5 file is the date on which those files were downloaded. The timestamp of the .md5 file is therefore used to determine the time of the last update. For this reason, do not delete the .md5 files, since they are used to determine whether or not a database needs to be updated.
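The subtotal and last-update logic described above can be sketched in a few lines of Python. This is only an illustration, not the code used by blastdbkit.py: the function name summarize_blastdb is hypothetical, and the sketch assumes that the database name is everything before the first dot in a filename and that one megabyte is 1,000,000 bytes.

```python
import os
from collections import defaultdict
from datetime import datetime


def summarize_blastdb(directory):
    """Return {database: (size_in_mb, last_update_or_None)} for a directory."""
    sizes = defaultdict(int)
    updated = {}
    for entry in os.scandir(directory):
        if not entry.is_file():
            continue
        # Assumption: the database name is the text before the first dot,
        # so nt.00.nsq and nt.01.nsq both count towards "nt".
        name = entry.name.split(".")[0]
        info = entry.stat()
        sizes[name] += info.st_size
        # The newest .md5 timestamp stands in for the time of last update.
        if entry.name.endswith(".md5"):
            updated[name] = max(updated.get(name, 0.0), info.st_mtime)
    return {
        name: (
            round(total / 1_000_000),  # assumption: 1 Mb = 1,000,000 bytes
            datetime.fromtimestamp(updated[name]) if name in updated else None,
        )
        for name, total in sizes.items()
    }
```

Calling summarize_blastdb(os.environ["BLASTDB"]) would return one entry per filename prefix. Note that the sketch treats every file in the directory as part of a database, so a file such as localstats.tsv would also appear as its own entry.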


At the command line - To create a report on the local databases:

blastdbkit.py --reportlocal

The output will be written to $BLASTDB/localstats.tsv, which can be imported into any spreadsheet program.
 
Report on BLAST database files at a remote FTP site.

Once you click on the Report: FTP Database button, a request will be sent to the FTP site asking for a listing of database archive files currently available at the FTP site. The report is saved in $BLASTDB/ftpstats.tsv, and will also pop up in your default spreadsheet program.

The "compressed size (Mbytes)" column requires a bit of explanation. As described above, each database is broken up into a number of parts. For example, the nt database is broken into nt.00, nt.01, nt.02 etc. For each part, there is a set of files, e.g. nt.00.nhd, nt.00.nhi etc. All files for a given part of the database are saved in a compressed archive file (similar to a ZIP file) with a .tar.gz extension. Thus, all files in the nt.00 part of the database are saved in a file called nt.00.tar.gz, which has a corresponding md5 file, nt.00.tar.gz.md5. Compression of the file makes it faster to download the data. After download, the archive is decompressed and the individual files are written to the $BLASTDB directory. Finally, the .tar.gz file is deleted.
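The verify, unpack and delete steps described above might look roughly like the following in Python. This is a sketch under stated assumptions, not the BIRCH implementation: the function names md5_of_file and unpack_part are hypothetical, and the .md5 file is assumed to begin with the hexadecimal digest followed by whitespace and the archive name.

```python
import hashlib
import os
import tarfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unpack_part(archive_path, md5_path, dest_dir):
    """Verify one .tar.gz part, extract it into dest_dir, delete the archive."""
    # Assumption: the .md5 file starts with the hex digest, followed by
    # whitespace and the archive name.
    with open(md5_path) as handle:
        expected = handle.read().split()[0].lower()
    actual = md5_of_file(archive_path)
    if actual != expected:
        raise ValueError(f"MD5 mismatch for {archive_path}")
    # Only unpack archives downloaded from a site you trust.
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest_dir)
    # The archive is removed; the .md5 file is kept, since its timestamp
    # records when the part was downloaded.
    os.remove(archive_path)
    return names
```

For example, unpack_part("nt.00.tar.gz", "nt.00.tar.gz.md5", os.environ["BLASTDB"]) would leave the nt.00.* files and the .md5 file in place and remove the archive. A mismatch raises an error before anything is extracted.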

The disk space taken up by a database, therefore, will usually be greater than the space shown on the FTP report, often significantly so. The ratio between .tar.gz file sizes and the final disk space is described in Figuring out disk space needs. Based on empirically determined decompression ratios for each database, the "est. decompressed size (Mbytes)" column gives an estimate of the space taken up by all files in a database after decompression and de-archiving. As listed in that document, the database files can decompress to anywhere from a few percent larger to 3 to 5 times larger, with several files decompressing by an even larger ratio. (Note: The sizes in this spreadsheet are rounded to the nearest megabyte, so databases taking up less than 0.5 megabytes will show a file size of 0.)
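As a rough illustration of how such an estimate can be used for planning, the sketch below multiplies a compressed size by a decompression ratio and totals the result over a chosen set of databases. The function names and all numbers are hypothetical; the real ratios are the empirically determined ones given in Figuring out disk space needs.

```python
def estimate_decompressed_mb(compressed_mb, ratio):
    """Estimate the on-disk size, in whole megabytes, after decompression."""
    return round(compressed_mb * ratio)


def space_needed_mb(selection, compressed_mb, ratios):
    """Total estimated space, in megabytes, for the selected databases."""
    return sum(
        estimate_decompressed_mb(compressed_mb[db], ratios[db])
        for db in selection
    )


# Hypothetical values for illustration only; they are not taken from
# any real FTP report or from the BIRCH decompression ratios.
compressed = {"big_db": 250.0, "tiny_db": 0.4}
ratios = {"big_db": 4.0, "tiny_db": 1.05}

print(space_needed_mb(["big_db", "tiny_db"], compressed, ratios))
```

With these made-up values, tiny_db contributes 0 because its estimate falls below 0.5 megabytes, which mirrors the rounding behaviour noted above. Comparing the total with the free space shown in the local report gives a quick check before downloading.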

Finally, the Modification Time column shows the modification timestamp of the most recent .md5 file for a part of a database on the FTP site.


At the command line - Example: To create a report on the FTP mirror at the EBI:

blastdbkit.py --reportftp --ftpsite ftp.ebi.ac.uk

The output will be written to $BLASTDB/ftpstats.tsv, which can be imported into any spreadsheet program.

Please send suggestions or comments regarding this page to psgendb@cc.umanitoba.ca