PLNT4610/PLNT7690 Bioinformatics
Lecture 5, part 2 of 3

A. Sequence Databases

1. GenBank/EMBL/DDBJ nucleotide databases

Strictly speaking, what many people think of as GenBank is a collaboration between three groups:
  1. GenBank - National Center for Biotechnology Information at the National Library of Medicine, NIH. [https://ncbi.nlm.nih.gov/]
  2. EMBL - European Molecular Biology Laboratory Nucleotide Sequence Database, Hinxton, UK [http://www.ebi.ac.uk]
  3. DDBJ - DNA Data Bank of Japan, Mishima, Japan [http://www.ddbj.nig.ac.jp]
All three sites are synchronized, such that each has the complete set of known nucleotide sequences. That is, there is a 1:1 correspondence for each entry at one database with the same entry at the other databases. Together, these organizations process sequence and annotation data submitted electronically by authors, and monitor the research literature to make sure that sequences not submitted by authors are entered by database staff.

The entries that most workers see are 'flatfile' reports generated from the databases.

At the database centers, data is stored using database management tools. GenBank and DDBJ both produce database entries in a human-readable report format, described in the current release notes for GenBank [ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt]. The EMBL database entries contain the same information in a different human-readable format. GenBank also produces database entries in the generic ASN.1 format, to make it easier for other database software to import GenBank files.

One critical point to make is that there is a 1:1 correspondence between GenBank, EMBL and DDBJ entries. In all cases, the Accession number will be the same for a given sequence, although the sequence name may differ.
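
As a rough illustration of this correspondence, the sketch below pulls the accession number out of either flatfile style. It assumes that GenBank/DDBJ reports carry an "ACCESSION" line and EMBL reports an "AC" line ending in a semicolon; the two fragments are abbreviated single lines written for this example, not complete database entries, and the function name is hypothetical.

```python
def accession_from_flatfile(text):
    """Return the first accession found in a GenBank- or EMBL-style flatfile.

    Assumes GenBank/DDBJ reports use an 'ACCESSION' line and EMBL reports
    use an 'AC' line whose accessions are terminated by semicolons.
    """
    for line in text.splitlines():
        if line.startswith("ACCESSION"):      # GenBank/DDBJ style
            fields = line.split()
            return fields[1] if len(fields) > 1 else None
        if line.startswith("AC "):            # EMBL style, e.g. "AC   X00371;"
            fields = line[2:].replace(";", " ").split()
            return fields[0] if fields else None
    return None


# Illustrative one-line fragments only (not full records).
genbank_fragment = "ACCESSION   X00371\n"
embl_fragment = "AC   X00371;\n"

print(accession_from_flatfile(genbank_fragment))
print(accession_from_flatfile(embl_fragment))
```

Both calls should report the same accession, which is the point of the 1:1 correspondence: the accession, not the sequence name, is the stable key across the three databases.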

Example: Human myoglobin gene (X00371)

Every two months GenBank releases are produced, containing essentially all published and many unpublished DNA sequences worldwide. For organizational purposes, the database is split up among a number of divisions:

Division  Description
PRI       primate sequences
ROD       rodent sequences
MAM       other mammalian sequences
VRT       other vertebrate sequences
INV       invertebrate sequences
PLN       plant, fungal, and algal sequences
BCT       bacterial and archaeal sequences
VRL       viral sequences
PHG       bacteriophage sequences
SYN       synthetic sequences, e.g. cloning vectors
UNA       unannotated sequences
EST       EST sequences (expressed sequence tags)
PAT       sequences from patent applications
STS       STS sequences (sequence tagged sites - sequences for which PCR primers are described; used in genomic mapping)
GSS       GSS sequences (genome survey sequences)
HTG       High Throughput Genomic sequences (raw, high throughput genomic sequencing reads)
HTC       HTC sequences (raw, high throughput cDNA sequencing reads)
ENV       environmental sampling sequences (e.g. metagenomics)
TSA       Transcriptome Shotgun Assembly sequences
CON       does not contain sequence data; contigs are described by join() statements, joining other sequences into contigs
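
The division codes can be treated as a simple lookup table. The sketch below assumes that the division code is the second-to-last whitespace-separated field of the LOCUS line in a GenBank flatfile (the release notes give the exact layout); the LOCUS line shown is a made-up example, and the function name is hypothetical.

```python
# Division codes and descriptions, taken from the table above.
DIVISIONS = {
    "PRI": "primate sequences",
    "ROD": "rodent sequences",
    "MAM": "other mammalian sequences",
    "VRT": "other vertebrate sequences",
    "INV": "invertebrate sequences",
    "PLN": "plant, fungal, and algal sequences",
    "BCT": "bacterial and archaeal sequences",
    "VRL": "viral sequences",
    "PHG": "bacteriophage sequences",
    "SYN": "synthetic sequences",
    "UNA": "unannotated sequences",
    "EST": "expressed sequence tags",
    "PAT": "sequences from patent applications",
    "STS": "sequence tagged sites",
    "GSS": "genome survey sequences",
    "HTG": "high throughput genomic sequences",
    "HTC": "high throughput cDNA sequences",
    "ENV": "environmental sampling sequences",
    "TSA": "transcriptome shotgun assembly sequences",
    "CON": "contigs described by join() statements",
}


def division_of(locus_line):
    """Return the division code from a LOCUS line, or None if not recognized.

    Assumption: the division is the second-to-last field, followed by the date.
    """
    fields = locus_line.split()
    if len(fields) < 3 or fields[0] != "LOCUS":
        return None
    code = fields[-2]
    return code if code in DIVISIONS else None


# Hypothetical LOCUS line, invented for illustration.
example = "LOCUS       EXAMPLE01    1200 bp    DNA     linear   PRI 01-JAN-2000"
code = division_of(example)
print(code, "-", DIVISIONS[code])
```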


RefSeq - Reference Sequence Database

For model organisms with well-annotated genomes, the RefSeq databases have non-redundant annotated entries for genomic, RNA and protein sequences. RefSeq is integrated, in that each gene has a corresponding RNA and protein entry.

>>> Most species are NOT represented in RefSeq <<<

2. UniProt and other protein databases

The Protein Information Resource at Georgetown University [http://pir.georgetown.edu/] and the UniProt database at EMBL [http://www.ebi.ac.uk/uniprot/] both catalogue and classify proteins. Protein sequences are submitted directly by researchers, translated from DNA databases, or taken from the research literature.

Example: protein entries corresponding to Human myoglobin gene (GB:X00371)

Both GenBank and EMBL also generate databases containing translations of DNA sequences. GenBank produces the GenPept database, and EMBL the TrEMBL database. These databases can be thought of as raw translations of known or predicted coding sequences from DNA data. GenPept and TrEMBL exist primarily as a convenience for database searches.
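
To make "raw translation" concrete, here is a minimal sketch of translating a coding sequence with the standard genetic code. It is only an illustration, not the pipeline GenBank or EMBL actually use: it assumes the sequence starts in frame, ignores introns and alternative genetic codes, and the function name is hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(dna):
    """Translate an in-frame coding sequence, stopping at the first stop codon.

    Codons containing ambiguous bases are translated as 'X'.
    """
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*":          # stop codon ends the translation
            break
        protein.append(aa)
    return "".join(protein)


print(translate_cds("ATGGGTTAA"))
```

A translation produced this way carries no annotation beyond what the DNA entry already had, which is why such databases serve mainly as search targets.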

In contrast, PIR and UniProt databases are carefully annotated to produce an efficient report on each protein. In particular, when many genes encode identical proteins, only one protein entry is produced, citing all the genes and their respective GenBank/EMBL/DDBJ accession numbers. PIR and UniProt specialize in annotating features relating to protein structure or chemistry. Where 3D structures are known, links to protein structure databases are also included.
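
The idea of one protein entry citing several nucleotide accessions can be sketched as a grouping step. This is a simplified illustration, not the curation procedure used by PIR or UniProt; the accession numbers and peptide sequences below are invented placeholders.

```python
from collections import defaultdict


def collapse_identical(entries):
    """Group (accession, protein_sequence) pairs by identical sequence.

    Returns a dict mapping each distinct protein sequence to the list of
    nucleotide accessions whose translations produced it.
    """
    by_sequence = defaultdict(list)
    for accession, sequence in entries:
        by_sequence[sequence.upper()].append(accession)
    return dict(by_sequence)


# Hypothetical translations: two genes encode the same protein.
translations = [
    ("ACC0001", "MGLSDGEW"),
    ("ACC0002", "MGLSDGEW"),
    ("ACC0003", "MKTAYIAK"),
]

nonredundant = collapse_identical(translations)
for sequence, accessions in nonredundant.items():
    print(sequence, "<-", ", ".join(accessions))
```

Three translated entries collapse to two protein entries here, each keeping a list of the accessions it came from.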


Unless otherwise cited or referenced, all content on this page is licensed under the Creative Commons License Attribution Share-Alike 2.5 Canada