BIRCH
 Downloading and Maintaining GenPept
This document is currently under revision.




Organization of GenPept flatfile distribution

The Protein Information Resource (PIR) is produced by the National Biomedical Research Foundation at Georgetown University, Washington, D.C. It is distributed as a set of flatfiles (text files), as described in the file 0protein_doc.codata. As summarized in Table 1, the sequences are divided among four files, accompanied by various documentation files. Indices and other files not used by BIRCH are not listed below.

Table 1. GenPept flatfiles used in BIRCH

file(s)              description
pir1.dat.Z           completely merged, classified and annotated sequences
pir2.dat.Z           completely merged, classified and annotated sequences
pir3.dat.Z           unverified and unannotated sequences
pir4.dat.Z           sequences neither naturally occurring nor naturally expressed, but fully annotated (see note)
pirx.nam             header files listing current description of pirx.dat files
0codata.txt          release notes
0protein_doc.codata  formal description of file format

Note on pir4.dat.Z: this file includes sequences known to be conceptual translations of pseudogenes, mistranslations or otherwise unexpressed potential ORFs that may have mistakenly been assigned identifiers as coding regions by other databases. It also includes engineered or synthetic sequences, sequences resulting from fusion, cross-over or frame-shift mutations, and sequences of natural polypeptides that are not synthesized on ribosomes.

The .Z extension indicates that files are compressed with the Unix compress utility for faster download.

Automated downloading and installation of GenPept 

1. Disk space considerations

When you install BIRCH for the first time, the directory $BIRCH/GenPept ($GenPept, $pir) will be created. It will contain two files, gpupdate and master.filelist.

GenPept Release 75.02 required about 510 MB, including all files listed in Table 1. GenPept eliminates redundancy by maintaining a single entry where a protein is identical in more than one species. Therefore, growth is relatively slow, and disk space is rarely a problem.

2. Running gpupdate

Table 2. A sample filelist
0codata.txt
0protein_doc.codata
pir

To keep current on when new GenBank releases become available, check the GenPept Web site.

The gpupdate script automates the process of downloading and reformatting some or all files for PIR. Before running this script, you need to set the environment variable $MAILID to your email address. Most anonymous FTP sites request an email address as the password, and the easiest place to set it is the .cshrc file of the BIRCH administrator.

The file 'filelist' defines which files and divisions to download.

Rules for the filelist file:
  1. Any file with a .gz or .Z file extension will be uncompressed after downloading. 
  2. All PIR sequence files can be downloaded and processed by simply putting the 3-letter code pir into filelist.
  3. Alternatively, any individual file can be downloaded by putting its name into filelist (e.g. pir4.dat.Z).
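
The rules above can be sketched in a few lines of Python. This is only an illustration of the rules as stated, not the logic of gpupdate itself. The function name expand_filelist is hypothetical, and the sketch assumes that the code pir expands to the eight files that appear in the sample session later in this document.

```python
# Illustrative sketch only; gpupdate's real logic may differ.
# Assumption: the code "pir" expands to the eight files shown in the
# sample session (pir1..pir4, each with a .dat.Z and a .nam file).
PIR_FILES = [f"pir{n}{ext}" for n in range(1, 5) for ext in (".dat.Z", ".nam")]

def expand_filelist(lines):
    """Return (filename, needs_uncompress) pairs for the entries of a filelist."""
    plan = []
    for line in lines:
        entry = line.strip()
        if not entry:
            continue  # ignore blank lines
        # Rule 2: the 3-letter code stands for all PIR sequence files.
        # Rule 3: anything else is taken as an individual file name.
        names = PIR_FILES if entry == "pir" else [entry]
        for name in names:
            # Rule 1: .gz and .Z files are uncompressed after downloading.
            plan.append((name, name.endswith((".gz", ".Z"))))
    return plan

plan = expand_filelist(["0codata.txt", "0protein_doc.codata", "pir"])
for name, needs_uncompress in plan:
    print(name, "uncompress" if needs_uncompress else "keep as is")
```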

A typical download session

We will now show the sequence of events in a typical download session. The file master.filelist is distributed with BIRCH. It's probably safest to copy this to another file called 'filelist' to use as a working copy. To launch gpupdate, move to the PIR directory and run it with the name of the filelist. By terminating the line with '&' you can make the command run in the background:
cd $pir
./gpupdate filelist &

The advantage of running gpupdate in the background is that you can log out at any time during the download without interrupting it.

The first file in the example list is 0codata.txt, which contains the release notes.

When the file is received, the sizes of the original file from the FTP server and the file received are listed.

0codata.txt
ORIGINAL=  3358
RECEIVED=  3358
If these numbers are equal, the name of the file is written to files_received. Otherwise, the name of the file is written to files_missed. By default, files remain in the $pir directory. Files beginning with "0" are documentation files, and are moved to $doc/PIR. For a division such as pir, the full listing of its files is written, and then the name of each file is echoed to the output. Before beginning the download, gpupdate removes the current files for the division, if they exist, to make sure that enough space is available.
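
The size check described above can be pictured as follows. This is a minimal sketch of the bookkeeping only, assuming the two sizes are already known. The function name record_transfer and the temporary working directory are hypothetical, and the mismatched size in the second call is invented to simulate an incomplete transfer.

```python
import os
import tempfile

def record_transfer(name, original_size, received_size, workdir):
    """Append name to files_received if the sizes match, else to files_missed."""
    log = "files_received" if original_size == received_size else "files_missed"
    with open(os.path.join(workdir, log), "a") as handle:
        handle.write(name + "\n")
    return log

# A scratch directory stands in for the download directory.
workdir = tempfile.mkdtemp()
first = record_transfer("0codata.txt", 3358, 3358, workdir)
# Simulated short transfer (the received size is made up).
second = record_transfer("pir1.dat.Z", 21091315, 1024, workdir)
print(first, second)
```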

If the 'pir' code is specified in filelist, the .dat files are downloaded next. Partial output is shown below:

pir
-rw-r--r-- 1 root system 21091315 Feb 3 06:37 pir1.dat.Z
-rw-r--r-- 1 root system 1243 Feb 3 06:37 pir1.nam
-rw-r--r-- 1 root system 177763121 Feb 3 06:40 pir2.dat.Z
-rw-r--r-- 1 root system 1248 Feb 3 06:40 pir2.nam
-rw-r--r-- 1 root system 11135 Feb 3 06:40 pir3.dat.Z
-rw-r--r-- 1 root system 1233 Feb 3 06:40 pir3.nam
-rw-r--r-- 1 root system 221983 Feb 3 06:40 pir4.dat.Z
-rw-r--r-- 1 root system 1897 Feb 3 06:40 pir4.nam
pir1.dat.Z
pir1.nam
pir2.dat.Z
pir2.nam
pir3.dat.Z
pir3.nam
pir4.dat.Z
pir4.nam
Removing file(s) for pir1, if they exist
No match
ORIGINAL= 21091315
RECEIVED= 21091315
etc ...

If a file has the .dat extension, it is assumed to be a sequence file containing PIR entries. After uncompressing the file, the .dat file is split into 3 files containing annotation, sequence and an index. Thus, pir1.dat is split into pir1.ano, pir1.wrp and pir1.ind. This is fully described in the documentation for the XYLEM program splitdb. It is strongly recommended that you read this documentation file. The key point is that the annotation files and sequence files can be searched independently, saving a great deal of disk I/O. Thus, fasta or blast would only search the .wrp files containing sequence, and wouldn't have to read all of the annotation. When sequence entries are retrieved by fetch, the index (.ind) file is used to find the annotation and sequence for each entry so that the complete entry can be retrieved.
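
The idea behind the three-way split can be illustrated with a toy example. The record layout below (a name, one annotation line and one sequence line per entry) and the function names are hypothetical and far simpler than real PIR entries; the actual file formats are those described in the splitdb documentation. The sketch only shows how an index of offsets lets a complete entry be reassembled from separate annotation and sequence texts.

```python
import io

def split_entries(entries):
    """Split (name, annotation, sequence) records into three parts.

    Returns the annotation text, the sequence text and an index that maps
    each name to its starting offsets in the two texts.
    """
    ano, wrp, index = io.StringIO(), io.StringIO(), {}
    for name, annotation, sequence in entries:
        index[name] = (ano.tell(), wrp.tell())
        ano.write(annotation + "\n")
        wrp.write(sequence + "\n")
    return ano.getvalue(), wrp.getvalue(), index

def fetch_entry(name, ano_text, wrp_text, index):
    """Reassemble one entry using the offsets stored in the index."""
    ano_start, wrp_start = index[name]
    annotation = ano_text[ano_start:ano_text.index("\n", ano_start)]
    sequence = wrp_text[wrp_start:wrp_text.index("\n", wrp_start)]
    return annotation, sequence

# Made-up entries, for illustration only.
toy = [("ENTRY1", "hypothetical protein one", "MKTAYIAK"),
       ("ENTRY2", "hypothetical protein two", "MSLLNV")]
ano_text, wrp_text, index = split_entries(toy)
result = fetch_entry("ENTRY2", ano_text, wrp_text, index)
print(result)
```

A search program would read only the sequence text, while a retrieval program would consult the index first and then read just the parts it needs.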

The two critical factors influencing the time required for a download are the speed of the internet connection and the speed of the filesystem. On our Sun Ultra 60 at the University of Manitoba, using a remotely-mounted NFS fileserver, a complete download and installation of PIR 75.02 took 20 minutes.

The progress of the download can be monitored in a number of ways. A periodic directory listing of $pir will show all files currently in the directory. 'less files_received' will list the files successfully downloaded. 'top' will show the program currently running: ftp if a file is being downloaded, uncompress if a file is being uncompressed, or splitdb if the file is being split.

When splitdb finishes processing a .dat file, the .ano, .wrp and .ind files are ready for use with no further processing. Thus, when gpupdate is complete all files are ready to use. The only thing remaining is to regenerate the index files used by FASTA, as described in the next section.

Configuring FASTA for PIR searches

How FASTA finds database files

FASTA reads a list of database files from the file 'fastgbs'. The location of fastgbs is specified by the environment variable $FASTLIBS, which is set to $BIRCH/dat/fasta/fastgbs. A typical fastgbs file is shown below:

PIR   Protein Identification Resource 75.02 $00@/home/psgendb/BIRCHDEV/dat/fasta/pir.fil 
GenPept GenBank 133.0 CDS translations$01/home/psgendb/BIRCHDEV/GenPept/genpept.wrp
GB133 Primate$1P@/home/psgendb/BIRCHDEV/dat/fasta/gbpri.fil
GB133 Rodent$1R@/home/psgendb/BIRCHDEV/dat/fasta/gbrod.fil
GB133 other Mammal$1M@/home/psgendb/BIRCHDEV/dat/fasta/gbmam.fil
GB133 verteBrates$1B@/home/psgendb/BIRCHDEV/dat/fasta/gbvrt.fil
GB133 Invertebrates$1I@/home/psgendb/BIRCHDEV/dat/fasta/gbinv.fil
GB133 pLants$1L@/home/psgendb/BIRCHDEV/dat/fasta/gbpln.fil
GB133 Expressed Sequence Tags$1E@/home/psgendb/BIRCHDEV/dat/fasta/gbest.fil
GB133 Bacteria$1T@/home/psgendb/BIRCHDEV/dat/fasta/gbbct.fil
GB133 Viral$1V@/home/psgendb/BIRCHDEV/dat/fasta/gbvrl.fil
GB133 Phage$1G@/home/psgendb/BIRCHDEV/dat/fasta/gbphg.fil
GB133 Synthetic$1Y@/home/psgendb/BIRCHDEV/dat/fasta/gbsyn.fil
GB133 Unannotated$1U@/home/psgendb/BIRCHDEV/dat/fasta/gbuna.fil
GB133 Patented$1D@/home/psgendb/BIRCHDEV/dat/fasta/gbpat.fil
GB133 STS$1X@/home/psgendb/BIRCHDEV/dat/fasta/gbsts.fil
GB133 HTG$1h@/home/psgendb/BIRCHDEV/dat/fasta/gbhtg.fil
GB133 GSS$1s@/home/psgendb/BIRCHDEV/dat/fasta/gbgss.fil
GB133 All sequences (VERY long!)$1A@/home/psgendb/BIRCHDEV/dat/fasta/genbank.fil

pir.fil lists the files comprising PIR.  An example is shown in Table 3.
Table 3. pir.fil
</home/psgendb/BIRCHDEV/PIR
pir1.wrp 0
pir2.wrp 0
pir3.wrp 0
pir4.wrp 0


This file lists the location and names of the 4 files in PIR.  A complete description of syntax for these files can be found in $BIRCH/doc/fasta/fasta3x.asc.
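
To make the layout of Table 3 concrete, the following sketch reads such a listing and prints the full path of each library file. It is not part of FASTA or BIRCH. The function name read_fil is hypothetical, and the sketch assumes that a line beginning with "<" names the directory for the lines that follow and that every other line is a file name followed by a number, which is kept here without interpretation. The authoritative syntax is in fasta3x.asc.

```python
import posixpath

def read_fil(text):
    """Return (full_path, number) pairs for the files named in a .fil listing."""
    directory, libraries = "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith("<"):
            directory = line[1:]  # assumed: directory for the lines below
            continue
        name, code = line.split()
        libraries.append((posixpath.join(directory, name), int(code)))
    return libraries

# The contents of Table 3.
pir_fil = """</home/psgendb/BIRCHDEV/PIR
pir1.wrp 0
pir2.wrp 0
pir3.wrp 0
pir4.wrp 0
"""
libraries = read_fil(pir_fil)
for path, code in libraries:
    print(path, code)
```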

Finally, remember to edit fastgbs, using Find/Replace in your text editor to change the PIR release number to the current release.
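
For administrators who prefer to script this step, the same replacement might look like the sketch below. It is an assumption-laden illustration and not a BIRCH tool: the function name update_release is hypothetical, the release number 76.00 is invented for the example, and only lines beginning with PIR are touched so that other release numbers in the file are left alone.

```python
def update_release(fastgbs_text, old_release, new_release):
    """Replace the release number on lines that begin with PIR."""
    updated = []
    for line in fastgbs_text.splitlines():
        if line.startswith("PIR"):
            line = line.replace(old_release, new_release, 1)
        updated.append(line)
    return "\n".join(updated) + "\n"

# Two lines taken from the sample fastgbs file shown earlier.
sample = (
    "PIR   Protein Identification Resource 75.02 $00@/home/psgendb/BIRCHDEV/dat/fasta/pir.fil\n"
    "GenPept GenBank 133.0 CDS translations$01/home/psgendb/BIRCHDEV/GenPept/genpept.wrp\n"
)
# 76.00 is a hypothetical new release number.
new_text = update_release(sample, "75.02", "76.00")
print(new_text, end="")
```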

Configuring GDE to read PIR

The default FASTA menus for GDE are located in $dat/GDE/makemenus/menus/Database. These menus have only one database choice, for user-created files. The FASTA menus in $birch/local/dat/GDE/makemenus/menus/Database have additional menu choices for each database file listed in $BIRCH/dat/fasta/fastgbs. All we need to do is edit $birch/local/dat/GDE/makemenus/menus/menulist to choose the local menu item files. This is done by adding lines to menulist, such as:
Database
FASTAPROTEIN
FASTXY

Now, re-run makemenus.py to update the .GDEmenus files.



Please send suggestions or comments regarding this page to psgendb@cc.umanitoba.ca