BIRCH
Downloading and Maintaining GenPept
This document
is currently under revision.
Organization of GenPept flatfile distribution
The Protein Information Resource
is produced by the National
Biomedical Research Foundation at Georgetown University, Washington
D.C. It is distributed as a set of flatfiles (text files), as described
in the file 0protein_doc.codata.
As summarized in Table 1, the sequences are divided among four files.
Additionally, there are various documentation files. Indices and other
files not used by BIRCH are not listed below.
Table 1. GenPept flatfiles used in BIRCH
| description | file(s) |
| completely merged, classified and annotated sequences | pir1.dat.Z |
| completely merged, classified and annotated sequences | pir2.dat.Z |
| unverified and unannotated sequences | pir3.dat.Z |
| sequences neither naturally occurring nor naturally expressed, but fully annotated. Includes sequences known to be conceptual translations of pseudogenes, mistranslations or otherwise unexpressed potential ORF's that may have mistakenly been assigned identifiers as coding regions by other databases. It also includes engineered or synthetic sequences, sequences resulting from fusion, cross-over or frame-shift mutations, and sequences of natural polypeptides that are not synthesized on ribosomes. | pir4.dat.Z |
| Header files listing current description of pirx.dat files | pirx.nam |
| Release notes | 0codata.txt |
| Formal description of file format | 0protein_doc.codata |
The .Z extension indicates that the files are compressed with the Unix
compress utility for faster download.
Automated downloading and installation of GenPept
1. Disk space considerations
When you install BIRCH for the first time, the directory $BIRCH/GenPept
($GenPept, $pir) will be created. This directory will contain two files,
gpupdate and master.filelist.
GenPept Release 75.02 required about 510 MB, including all files listed
in Table 1. GenPept eliminates redundancy by maintaining a single
entry where a protein is identical in more than one species. Therefore,
growth is relatively slow, and disk space is rarely a problem.
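As a rough pre-flight check, the free space on the target filesystem can be compared against the figure quoted above. The following is a minimal sketch, not part of BIRCH: the function name has_enough_space and the 510 MB default are assumptions taken from the Release 75.02 figure, and newer releases may need more.

```python
import shutil

# Figure quoted in this document for Release 75.02; treat it as a lower bound.
REQUIRED_MB = 510


def has_enough_space(directory, required_mb=REQUIRED_MB):
    """Return True if the filesystem holding `directory` has at least
    `required_mb` megabytes free."""
    free_mb = shutil.disk_usage(directory).free / (1024 * 1024)
    return free_mb >= required_mb
```

For example, has_enough_space(".") could be run from the database directory before starting a download.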
2. Running gpupdate
Table 2. A sample filelist
|
0codata.txt
0protein_doc.codata
pir
|
To keep current on when new GenBank releases become available, check
the GenPept Web site.
The gpupdate script automates the process of downloading and
reformatting some or all files for PIR. Before running this
script, you need to set the environment variable $MAILID to your
email address. Most anonymous FTP sites request an email address,
and the variable can most easily be set in the .cshrc file of the BIRCH
administrator.
The file 'filelist' defines which files and divisions to download.
Rules for the filelist file:
- Any file with a .gz or .Z file extension will be uncompressed
after downloading.
- All PIR sequence files can be downloaded and processed by simply
putting the 3-letter code pir into filelist.
- Alternatively, any individual file can be downloaded by putting
its name into filelist (e.g., pir4.dat.Z).
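The rules above can be sketched in a few lines. This is an illustration only, not the logic of gpupdate itself: the names PIR_FILES, expand_filelist and needs_uncompress are hypothetical, and the expansion of the pir code is assumed to be the eight pirN files that appear in the sample session listing in this document.

```python
# Assumed expansion of the 'pir' code: pir1..pir4, each with a .dat.Z and a .nam file.
PIR_FILES = [f"pir{n}.{ext}" for n in range(1, 5) for ext in ("dat.Z", "nam")]


def expand_filelist(lines):
    """Turn filelist entries into individual file names.

    The 3-letter code 'pir' expands to all PIR files; any other
    non-blank entry is taken as a literal file name.
    """
    names = []
    for line in lines:
        entry = line.strip()
        if not entry:
            continue
        if entry == "pir":
            names.extend(PIR_FILES)
        else:
            names.append(entry)
    return names


def needs_uncompress(name):
    """Files ending in .gz or .Z are uncompressed after downloading."""
    return name.endswith((".gz", ".Z"))
```

Applied to the sample filelist in Table 2, expand_filelist would yield the two documentation files followed by the eight PIR files.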
A typical download session
We will now show the sequence of events in a typical download session.
The file master.filelist is distributed with BIRCH. It's
probably safest to copy this to another file called 'filelist' to
use as a working copy. To launch gpupdate, move to the PIR directory and
run it. By terminating the line with '&' you can make the
command run in the background:
cd $pir
./gpupdate filelist &
The advantage of running in the background is that you can log out
at any time during the download without interrupting it.
The first file in the list is 0codata.txt, which in this example
contains the release notes.
When the file is received, the sizes of the original file from the FTP
server and the file received are listed.
0codata.txt
ORIGINAL= 3358
RECEIVED= 3358
If these numbers are equal, the name of the file is written to
files_received. Otherwise, the name of the file is written to
files_missed. By default, files remain in the $PIR directory. Files
beginning with "0" are documentation files, and are moved to $doc/PIR.
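The bookkeeping described above amounts to a size comparison followed by an append to one of two log files. The sketch below illustrates that idea under stated assumptions: the function names are hypothetical, and the log files are assumed to hold one file name per line, which this document does not confirm.

```python
import os


def transfer_log(original_size, received_size):
    """Name of the log a file belongs in, based on the size comparison."""
    return "files_received" if original_size == received_size else "files_missed"


def is_documentation(name):
    """Files beginning with "0" are documentation files."""
    return name.startswith("0")


def record_transfer(name, original_size, received_size, directory="."):
    """Append `name` to files_received or files_missed in `directory`.

    Assumes one file name per line in each log.
    """
    log = transfer_log(original_size, received_size)
    with open(os.path.join(directory, log), "a") as handle:
        handle.write(name + "\n")
    return log
```

With the sizes from the sample output, transfer_log(3358, 3358) selects files_received.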
If the 'pir' code was specified in filelist, the .dat files are
downloaded next. A full listing of the files for this division is
written first, and then the name of each file is echoed to the output.
Before beginning the download, gpupdate removes the current files for
this division, if they exist, to make sure that enough space is
available. A partial output is shown below:
pir
-rw-r--r-- 1 root system 21091315 Feb 3 06:37 pir1.dat.Z
-rw-r--r-- 1 root system 1243 Feb 3 06:37 pir1.nam
-rw-r--r-- 1 root system 177763121 Feb 3 06:40 pir2.dat.Z
-rw-r--r-- 1 root system 1248 Feb 3 06:40 pir2.nam
-rw-r--r-- 1 root system 11135 Feb 3 06:40 pir3.dat.Z
-rw-r--r-- 1 root system 1233 Feb 3 06:40 pir3.nam
-rw-r--r-- 1 root system 221983 Feb 3 06:40 pir4.dat.Z
-rw-r--r-- 1 root system 1897 Feb 3 06:40 pir4.nam
pir1.dat.Z
pir1.nam
pir2.dat.Z
pir2.nam
pir3.dat.Z
pir3.nam
pir4.dat.Z
pir4.nam
Removing file(s) for pir1, if they exist
No match
ORIGINAL= 21091315
RECEIVED= 21091315
etc ...
If a file has the .dat extension, it is assumed to be a sequence
file containing PIR entries. After uncompressing the file, the .dat
file is split into 3 files containing annotation, sequence and an
index. Thus, pir1.dat is split into pir1.ano, pir1.wrp and pir1.ind.
This is fully described in the documentation for the XYLEM program
splitdb. It is strongly recommended that you read this documentation
file. The key point is that the annotation files and sequence files can
be searched independently, saving a great deal of disk I/O. Thus,
fasta or blast would only search the .wrp files containing
sequence, and wouldn't have to read all of the annotation. When
sequence entries are retrieved by fetch, the index (.ind) file is used
to find the annotation and sequence for each entry so that the complete
entry can be retrieved.
The two critical factors influencing the time required for a download
are the speed of the internet connection and the speed of the
filesystem. On our Sun Ultra 60 at the University of Manitoba, using
a remotely-mounted NFS fileserver, a complete download and
installation of PIR 75.02 took 20 minutes.
The progress of the download can be monitored in a number of ways. A
periodic directory listing of $pir will show all files currently in
the directory. 'less files_received' will list the files successfully
downloaded. 'top' will show the program currently running: ftp if a
file is being downloaded, uncompress if a file is being uncompressed,
or splitdb if the file is being split.
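The same information can be gathered in one step by comparing the two log files against the list of files you expect. This is a minimal sketch, not a BIRCH tool: the function names are hypothetical, and files_received and files_missed are assumed to contain one file name per line.

```python
import os


def read_names(path):
    """Return the non-blank lines of `path`, or an empty list if it is absent."""
    if not os.path.exists(path):
        return []
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def progress(directory, expected):
    """Summarize which expected files were received, missed, or are still pending."""
    received = set(read_names(os.path.join(directory, "files_received")))
    missed = set(read_names(os.path.join(directory, "files_missed")))
    pending = [name for name in expected
               if name not in received and name not in missed]
    return {"received": sorted(received),
            "missed": sorted(missed),
            "pending": pending}
```

Any name listed under "missed" when the run finishes is a candidate for a filelist containing just that file.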
When splitdb finishes processing a .dat file, the .ano, .wrp and .ind
files are ready for use with no further processing. Thus, when gpupdate
is complete all files are ready to use. The only thing
remaining is to regenerate the index files used by FASTA, as
described in the next section.
Configuring FASTA for PIR searches
How FASTA finds database files
FASTA reads a list of database files from the file 'fastgbs'. The
location of fastgbs is specified by the environment variable
$FASTLIBS, which is set to $BIRCH/dat/fasta/fastgbs. A typical
fastgbs file is shown below:
PIR Protein Identification Resource 75.02 $00@/home/psgendb/BIRCHDEV/dat/fasta/pir.fil
GenPept GenBank 133.0 CDS translations$01/home/psgendb/BIRCHDEV/GenPept/genpept.wrp
GB133 Primate$1P@/home/psgendb/BIRCHDEV/dat/fasta/gbpri.fil
GB133 Rodent$1R@/home/psgendb/BIRCHDEV/dat/fasta/gbrod.fil
GB133 other Mammal$1M@/home/psgendb/BIRCHDEV/dat/fasta/gbmam.fil
GB133 verteBrates$1B@/home/psgendb/BIRCHDEV/dat/fasta/gbvrt.fil
GB133 Invertebrates$1I@/home/psgendb/BIRCHDEV/dat/fasta/gbinv.fil
GB133 pLants$1L@/home/psgendb/BIRCHDEV/dat/fasta/gbpln.fil
GB133 Expressed Sequece Tags$1E@/home/psgendb/BIRCHDEV/dat/fasta/gbest.fil
GB133 Bacteria$1T@/home/psgendb/BIRCHDEV/dat/fasta/gbbct.fil
GB133 Viral$1V@/home/psgendb/BIRCHDEV/dat/fasta/gbvrl.fil
GB133 Phage$1G@/home/psgendb/BIRCHDEV/dat/fasta/gbphg.fil
GB133 Synthetic$1Y@/home/psgendb/BIRCHDEV/dat/fasta/gbsyn.fil
GB133 Unannotated$1U@/home/psgendb/BIRCHDEV/dat/fasta/gbuna.fil
GB133 Patented$1D@/home/psgendb/BIRCHDEV/dat/fasta/gbpat.fil
GB133 STS$1X@/home/psgendb/BIRCHDEV/dat/fasta/gbsts.fil
GB133 HTG$1h@/home/psgendb/BIRCHDEV/dat/fasta/gbhtg.fil
GB133 GSS$1s@/home/psgendb/BIRCHDEV/dat/fasta/gbgss.fil
GB133 All sequences (VERY long!)$1A@/home/psgendb/BIRCHDEV/dat/fasta/genbank.fil
pir.fil lists the files comprising PIR. An example is shown in
Table 3.
Table 3. pir.fil
|
</home/psgendb/BIRCHDEV/PIR pir1.wrp 0 pir2.wrp 0 pir3.wrp 0 pir4.wrp 0
|
This file lists the location and names of the 4 files in PIR. A
complete description of syntax for these files can be found in $BIRCH/doc/fasta/fasta3x.asc.
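The entries in the sample fastgbs file appear to follow a fixed layout: a description, a '$', one digit for the library type (0 for protein, 1 for DNA), a one-character menu key, and then a file name, where a leading '@' marks a file that itself lists database files (such as pir.fil). The parser below is a sketch based on that reading; parse_fastgbs_line is a hypothetical name, and fasta3x.asc remains the authoritative description of the syntax.

```python
def parse_fastgbs_line(line):
    """Split one fastgbs entry into its parts.

    Assumed layout: <description>$<type digit><menu key>[@]<path>
    """
    description, sep, rest = line.rstrip("\n").partition("$")
    if not sep or len(rest) < 3:
        raise ValueError("not a fastgbs entry: %r" % line)
    if rest[0] not in "01":
        raise ValueError("unknown library type: %r" % rest[0])
    target = rest[2:]
    indirect = target.startswith("@")
    return {
        "description": description.strip(),
        "type": "protein" if rest[0] == "0" else "DNA",
        "key": rest[1],
        "indirect": indirect,  # True when the path names a list of files
        "path": target[1:] if indirect else target,
    }
```

This sketch does not handle any extra fields that may follow the file name on the same line.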
Finally, remember to edit fastgbs, using Find/Replace in
your text editor to change the PIR Release number to the current
release.
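If you prefer to script this edit, the replacement can be done with a regular expression. The sketch below assumes the PIR entry begins with the exact label used in the sample fastgbs file above; update_pir_release is a hypothetical name, and the release number in the usage note is a placeholder, not a real release.

```python
import re

# Matches the label from the sample fastgbs file, followed by the release number.
_PIR_LABEL = re.compile(r"^(PIR Protein Identification Resource )\S+", re.MULTILINE)


def update_pir_release(text, new_release):
    """Return `text` with the release number on the PIR line replaced."""
    return _PIR_LABEL.sub(lambda match: match.group(1) + new_release, text)
```

For example, update_pir_release(contents, "76.00") rewrites only the PIR line and leaves the GenBank entries untouched; write the result back to fastgbs after checking it.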
Configuring GDE to read PIR
The default FASTA menus for GDE are located in
$dat/GDE/makemenus/menus/Database. These menus only have one database
choice for User-created files. The FASTA menus in
$birch/local/dat/GDE/makemenus/menus/Database have additional menu
choices for each database file listed in $BIRCH/dat/fasta/fastgbs. All
we need to do is to
edit $birch/local/dat/GDE/makemenus/menus/menulist to choose the
local menu item files. This is done by adding lines to menulist, such
as
Database
FASTAPROTEIN
FASTXY
Now, re-run makemenus.py to update the .GDEmenus files.
Please send suggestions or comments regarding this
page to psgendb@cc.umanitoba.ca