BIRCH Administration

Downloading and Maintaining GenPept
This document is currently under revision.

Organization of GenPept flatfile distribution
Automated downloading and installation of GenPept
Configuring FASTA for GenPept searches
Configuring GDE to read GenPept

Organization of GenPept flatfile distribution

The Protein Information Resource is produced by the National Biomedical Research Foundation at Georgetown University, Washington D.C. It is distributed as a set of flatfiles (text files), as described in the file 0protein_doc.codata. As summarized in Table 1, the sequences are divided among four files. Additionally, there are various documentation files. Indices and other files not used by BIRCH are not listed below.

Table 1. GenPept flatfiles used in BIRCH
description	file(s)
completely merged, classified and annotated sequences	pir1.dat.Z
completely merged, classified and annotated sequences	pir2.dat.Z
unverified and unannotated sequences	pir3.dat.Z
sequences neither naturally occurring nor naturally expressed, but fully annotated. Includes sequences known to be conceptual translations of pseudogenes, mistranslations or otherwise unexpressed potential ORFs that may have mistakenly been assigned identifiers as coding regions by other databases. It also includes engineered or synthetic sequences, sequences resulting from fusion, cross-over or frame-shift mutations, and sequences of natural polypeptides that are not synthesized on ribosomes.	pir4.dat.Z
Header files listing current description of pirx.dat files	pirx.nam
Release notes	0codata.txt
Formal description of file format	0protein_doc.codata

The .Z extension indicates that files are compressed with the Unix compress utility for faster download.

Automated downloading and installation of GenPept

1. Disk space considerations

When you install BIRCH for the first time, the directory $BIRCH/GenPept ($GenPept, $pir) will be created. It will contain two files, gpupdate and master.filelist.

GenPept Release 75.02 required about 510 MB including all files listed in Table 1. GenPept eliminates redundancy by maintaining a single entry where a protein is identical in more than one species. Therefore, growth is relatively slow, and disk space is rarely a problem.

2. Running gpupdate

Table 2. A sample filelist

0codata.txt
0protein_doc.codata
pir

To find out when new GenBank releases become available, check the GenPept Web site.

The gpupdate script automates the process of downloading and reformatting some or all files for PIR. Before running this script, you need to set the environment variable $MAILID to your email address. Most anonymous FTP sites request this address at login. It is most easily set in the .cshrc file of the BIRCH administrator.

The file 'filelist' defines which files and divisions to download.

Rules for the filelist file:

Any file with a .gz or .Z file extension will be uncompressed after downloading.
All PIR sequence files can be downloaded and processed by simply putting the 3-letter code pir into filelist.
Alternatively, any individual file can be downloaded by putting its name into filelist (e.g. pir4.dat.Z).
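
The rules above can be sketched in a few lines of Python. This is only an illustration, not the gpupdate script itself. The function names are hypothetical, and the expansion of the pir code to eight files is an assumption based on the sample directory listing later in this section.

```python
# Sketch of the filelist rules; not the actual gpupdate implementation.

# Assumption: the "pir" code stands for the eight files shown in the
# sample directory listing (pir1..pir4, each with .dat.Z and .nam).
PIR_FILES = [f"pir{n}{ext}" for n in range(1, 5) for ext in (".dat.Z", ".nam")]


def expand_filelist(lines):
    """Turn filelist lines into the individual file names to download."""
    names = []
    for line in lines:
        entry = line.strip()
        if not entry:
            continue  # skip blank lines
        if entry == "pir":
            names.extend(PIR_FILES)  # 3-letter code: all PIR files
        else:
            names.append(entry)  # an individual file name
    return names


def needs_uncompress(name):
    """Files ending in .gz or .Z are uncompressed after downloading."""
    return name.endswith((".gz", ".Z"))


if __name__ == "__main__":
    for name in expand_filelist(["0codata.txt", "0protein_doc.codata", "pir"]):
        print(name, "uncompress" if needs_uncompress(name) else "keep as is")
```

Run against the sample filelist in Table 2, the sketch lists the two documentation files followed by the PIR files, and marks only the .Z files for uncompressing.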

A typical download session

We will now show the sequence of events in a typical download session. The file master.filelist is distributed with BIRCH. It's probably safest to copy this to another file called 'filelist' to use as a working copy. To launch gpupdate, move to the PIR directory and run it with filelist as its argument. By terminating the line with '&' you can make the command run in the background:

cd $pir
./gpupdate filelist &

The advantage of running in the background is that you can log out at any time during the download without interrupting it.

The first file in the example list is 0codata.txt, which contains the release notes.

When the file is received, the sizes of the original file from the FTP server and the file received are listed.

0codata.txt
ORIGINAL=  3358
RECEIVED=  3358

If these numbers are equal, the name of the file is written to files_received. Otherwise, the name of the file is written to files_missed. By default, files remain in the $pir directory. Files beginning with "0" are documentation files, and are moved to $doc/PIR. For a division such as pir, the full listing of files for the division is written, and then the name of each file is echoed to the output. Before beginning the download, gpupdate will remove the current files for this division, if they exist, as a way of making sure that enough space is available.
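
The size check and the handling of documentation files can be illustrated with a short Python sketch. It is not taken from gpupdate. The function names are hypothetical, and the input is assumed to be the ORIGINAL=/RECEIVED= lines in the form shown in the sample output.

```python
# Sketch of the bookkeeping described above; names are illustrative only.

def sizes_match(log_lines):
    """Compare the ORIGINAL= and RECEIVED= sizes reported for one file."""
    sizes = {}
    for line in log_lines:
        key, sep, value = line.partition("=")
        key = key.strip()
        if sep and key in ("ORIGINAL", "RECEIVED"):
            sizes[key] = int(value)  # int() ignores the padding spaces
    # A missing ORIGINAL line is treated as a failed transfer.
    return "ORIGINAL" in sizes and sizes.get("ORIGINAL") == sizes.get("RECEIVED")


def log_file_for(log_lines):
    """Name of the file to which the downloaded file's name is written."""
    return "files_received" if sizes_match(log_lines) else "files_missed"


def destination(name):
    """Documentation files (names starting with "0") move to $doc/PIR."""
    return "$doc/PIR" if name.startswith("0") else "$pir"


if __name__ == "__main__":
    sample = ["0codata.txt", "ORIGINAL=  3358", "RECEIVED=  3358"]
    print(log_file_for(sample), destination(sample[0]))
```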

If the 'pir' code is specified in filelist, the .dat files are downloaded next. Partial output is shown below:

pir
-rw-r--r--   1 root     system    21091315 Feb  3 06:37 pir1.dat.Z
-rw-r--r--   1 root     system        1243 Feb  3 06:37 pir1.nam
-rw-r--r--   1 root     system   177763121 Feb  3 06:40 pir2.dat.Z
-rw-r--r--   1 root     system        1248 Feb  3 06:40 pir2.nam
-rw-r--r--   1 root     system       11135 Feb  3 06:40 pir3.dat.Z
-rw-r--r--   1 root     system        1233 Feb  3 06:40 pir3.nam
-rw-r--r--   1 root     system      221983 Feb  3 06:40 pir4.dat.Z
-rw-r--r--   1 root     system        1897 Feb  3 06:40 pir4.nam
pir1.dat.Z
pir1.nam
pir2.dat.Z
pir2.nam
pir3.dat.Z
pir3.nam
pir4.dat.Z
pir4.nam
Removing file(s) for pir1, if they exist
No match
ORIGINAL=  21091315
RECEIVED=  21091315
etc ...

If a file has the .dat extension, it is assumed to be a sequence file containing PIR entries. After uncompressing the file, the .dat file is split into 3 files containing annotation, sequence and an index. Thus, pir1.dat is split into pir1.ano, pir1.wrp and pir1.ind. This is fully described in the documentation for the XYLEM program splitdb. It is strongly recommended that you read this documentation file. The key point is that the annotation files and sequence files can be searched independently, saving a great deal of disk I/O. Thus, fasta or blast would only search the .wrp files containing sequence, and wouldn't have to read all of the annotation. When sequence entries are retrieved by fetch, the index (.ind) file is used to find the annotation and sequence for each entry so that the complete entry can be retrieved.
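
As a quick way to see which files to expect after a download, the naming convention can be written out in Python. This sketch only derives file names. It does not call splitdb, and the function name is hypothetical.

```python
# Sketch of the file naming only; the real splitting is done by splitdb.

def split_outputs(downloaded_name):
    """Names of the three files produced from one .dat sequence file."""
    base = downloaded_name
    for suffix in (".Z", ".gz"):
        if base.endswith(suffix):
            base = base[: -len(suffix)]  # name after uncompressing
            break
    if not base.endswith(".dat"):
        raise ValueError(f"{downloaded_name} is not a .dat sequence file")
    stem = base[: -len(".dat")]
    return {
        "annotation": stem + ".ano",
        "sequence": stem + ".wrp",
        "index": stem + ".ind",
    }


if __name__ == "__main__":
    print(split_outputs("pir1.dat.Z"))
```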

The two critical factors influencing the time required for a download are the speed of the internet connection and the speed of the filesystem. On our Sun Ultra 60 at the University of Manitoba, using a remotely-mounted NFS fileserver, a complete download and installation of PIR 75.02 took 20 minutes.

The progress of the download can be monitored in a number of ways. A periodic directory listing of $pir will show all files currently in the directory. 'less files_received' will list the files successfully downloaded. 'top' will show the program currently running: FTP if a file is being downloaded, uncompress if a file is being uncompressed, or splitdb if the file is being split.

When splitdb finishes processing a .dat file, the .ano, .wrp and .ind files are ready for use with no further processing. Thus, when gpupdate is complete all files are ready to use. The only thing remaining is to regenerate the index files used by FASTA, as described in the next section.

Configuring FASTA for PIR searches

How FASTA finds database files

FASTA reads a list of database files from the file 'fastgbs'. The location of fastgbs is specified by the environment variable $FASTLIBS, which is set to $BIRCH/dat/fasta/fastgbs. A typical fastgbs file is shown below:

PIR   Protein Identification Resource 75.02 $00@/home/psgendb/BIRCHDEV/dat/fasta/pir.fil 
GenPept GenBank 133.0 CDS translations$01/home/psgendb/BIRCHDEV/GenPept/genpept.wrp 
GB133 Primate$1P@/home/psgendb/BIRCHDEV/dat/fasta/gbpri.fil
GB133 Rodent$1R@/home/psgendb/BIRCHDEV/dat/fasta/gbrod.fil
GB133 other Mammal$1M@/home/psgendb/BIRCHDEV/dat/fasta/gbmam.fil
GB133 verteBrates$1B@/home/psgendb/BIRCHDEV/dat/fasta/gbvrt.fil
GB133 Invertebrates$1I@/home/psgendb/BIRCHDEV/dat/fasta/gbinv.fil
GB133 pLants$1L@/home/psgendb/BIRCHDEV/dat/fasta/gbpln.fil
GB133 Expressed Sequence Tags$1E@/home/psgendb/BIRCHDEV/dat/fasta/gbest.fil
GB133 Bacteria$1T@/home/psgendb/BIRCHDEV/dat/fasta/gbbct.fil
GB133 Viral$1V@/home/psgendb/BIRCHDEV/dat/fasta/gbvrl.fil
GB133 Phage$1G@/home/psgendb/BIRCHDEV/dat/fasta/gbphg.fil
GB133 Synthetic$1Y@/home/psgendb/BIRCHDEV/dat/fasta/gbsyn.fil
GB133 Unannotated$1U@/home/psgendb/BIRCHDEV/dat/fasta/gbuna.fil
GB133 Patented$1D@/home/psgendb/BIRCHDEV/dat/fasta/gbpat.fil
GB133 STS$1X@/home/psgendb/BIRCHDEV/dat/fasta/gbsts.fil
GB133 HTG$1h@/home/psgendb/BIRCHDEV/dat/fasta/gbhtg.fil
GB133 GSS$1s@/home/psgendb/BIRCHDEV/dat/fasta/gbgss.fil
GB133 All sequences (VERY long!)$1A@/home/psgendb/BIRCHDEV/dat/fasta/genbank.fil
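
The layout of each fastgbs line can be made explicit with a small Python parser. It is a sketch under the following assumptions, which should be checked against fasta3x.asc: the text before '$' is the menu title, the first character after '$' is 0 for protein or 1 for DNA, the second character is the one-letter selector, and a leading '@' on the path marks a file that lists other files. The names LibraryEntry and parse_fastlibs_line are hypothetical.

```python
# Sketch of a fastgbs line parser; the field meanings are assumptions
# to be verified against $BIRCH/doc/fasta/fasta3x.asc.
from typing import NamedTuple


class LibraryEntry(NamedTuple):
    title: str      # text shown in the menu
    lib_type: str   # "protein", "DNA" or "unknown"
    selector: str   # one-character choice
    path: str       # database file, or list of files if indirect
    indirect: bool  # True when the path was prefixed with '@'


def parse_fastlibs_line(line):
    """Split one fastgbs line into its fields."""
    title, sep, rest = line.rstrip("\n").partition("$")
    if not sep or len(rest) < 3:
        raise ValueError(f"not a fastgbs entry: {line!r}")
    lib_type = {"0": "protein", "1": "DNA"}.get(rest[0], "unknown")
    target = rest[2:].strip()
    indirect = target.startswith("@")
    path = target[1:] if indirect else target
    return LibraryEntry(title.strip(), lib_type, rest[1], path, indirect)


if __name__ == "__main__":
    print(parse_fastlibs_line(
        "GB133 Primate$1P@/home/psgendb/BIRCHDEV/dat/fasta/gbpri.fil"))
```

Under these assumptions, the PIR line points indirectly at pir.fil, while the GenPept line names genpept.wrp directly.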

pir.fil lists the files comprising PIR. An example is shown in Table 3.

Table 3. pir.fil

</home/psgendb/BIRCHDEV/PIR
pir1.wrp 0
pir2.wrp 0
pir3.wrp 0
pir4.wrp 0

This file lists the location and names of the 4 files in PIR. A complete description of syntax for these files can be found in $BIRCH/doc/fasta/fasta3x.asc.
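
To check that every file named in a .fil file actually exists after an update, the file can be expanded into full paths. The Python sketch below assumes the layout shown in Table 3: a line beginning with '<' gives the directory, and each following line gives a file name followed by a format number. The function name parse_fil is hypothetical.

```python
# Sketch of reading a .fil file such as pir.fil (layout as in Table 3).
import posixpath


def parse_fil(text):
    """Return the full path of every database file listed in a .fil file."""
    directory = ""
    paths = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith("<"):
            directory = line[1:].strip()  # directory for the files below
            continue
        name = line.split()[0]  # drop the trailing format number
        paths.append(posixpath.join(directory, name))
    return paths


if __name__ == "__main__":
    table3 = "</home/psgendb/BIRCHDEV/PIR\npir1.wrp 0\npir2.wrp 0\n"
    for path in parse_fil(table3):
        print(path)
```

Each returned path could then be tested with os.path.exists before running a search.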

Finally, remember to edit fastgbs, using Find/Replace in your text editor to change the PIR release number to the current release.
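
If you prefer to script this step, the same replacement can be sketched in Python. The function name is hypothetical, and the release number 76.00 in the example is invented purely for illustration. Only the title part of lines starting with PIR is changed, so paths and other databases are left alone.

```python
# Sketch of updating the PIR release number in the text of fastgbs.

def update_release(fastgbs_text, old_release, new_release):
    """Replace old_release with new_release in the titles of PIR lines."""
    updated = []
    for line in fastgbs_text.splitlines(keepends=True):
        title, sep, rest = line.partition("$")
        if sep and title.startswith("PIR"):
            title = title.replace(old_release, new_release)
        updated.append(title + sep + rest)
    return "".join(updated)


if __name__ == "__main__":
    sample = (
        "PIR   Protein Identification Resource 75.02 $00@/x/pir.fil\n"
        "GB133 Primate$1P@/x/gbpri.fil\n"
    )
    # 76.00 is a made-up release number used only for this example.
    print(update_release(sample, "75.02", "76.00"), end="")
```

Write the result to a new file and compare it with the original before replacing fastgbs.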

Configuring GDE to read PIR

The default FASTA menus for GDE are located in $dat/GDE/makemenus/menus/Database. These menus only have one database choice for User-created files. The FASTA menus in $birch/local/dat/GDE/makemenus/menus/Database have additional menu choices for each database file listed in $BIRCH/dat/fasta/fastgbs. All we need to do is to edit $birch/local/dat/GDE/makemenus/menus/menulist to choose the local menu item files. This is done by adding lines to menulist, such as

Database
        FASTAPROTEIN
        FASTXY

Now, re-run makemenus.py to update the .GDEmenus files.

Please send suggestions or comments regarding this page to psgendb@cc.umanitoba.ca