GenBank/EMBL/DDBJ nucleotide databases
Strictly speaking, what many people think of as GenBank is a
collaboration between three groups:
- GenBank - National Center for Biotechnology Information at the
National Library of Medicine, NIH [https://ncbi.nlm.nih.gov/]
- EMBL - European Molecular Biology Laboratory Nucleotide Sequence
Database, Hinxton, UK
- DDBJ - DNA Data Bank of Japan, Mishima, Japan
[http://www.ddbj.nig.ac.jp]
The three sites are synchronized, such that each has the complete set
of known nucleotide sequences. That is, there is a 1:1 correspondence
between each entry at one database and the same entry at the other
databases. Together, these organizations process sequence and
annotation data submitted electronically by authors, and monitor
the research literature to make sure that sequences not submitted
by authors are entered by database staff.
The entries that most workers see are 'flatfile' reports generated
from the databases. At the database centers, data is stored using
database management tools. GenBank and DDBJ both produce database
entries in a human-readable report format, described in the current
release notes for GenBank. EMBL database entries contain the same
information in a different human-readable format. GenBank also
produces database entries in the generic ASN.1 format, to make it
easier for other database software to import GenBank files.
A critical point is that there is a 1:1 correspondence between
GenBank, EMBL and DDBJ entries. In all cases, the accession number
will be the same for a given sequence, although the sequence name
may differ.
Example: human myoglobin gene (X00371)
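The point about shared accession numbers can be illustrated with a short Python sketch. It assumes the usual flatfile conventions, in which a GenBank-style report carries the accession on an ACCESSION line and an EMBL-style report carries it on an AC line. The two header fragments below are abbreviated toy text, not real database output, and the entry names NAME_A and NAME_B are hypothetical placeholders.

```python
# Abbreviated, hand-written header fragments (NOT real database output).
# NAME_A and NAME_B are hypothetical entry names used only for illustration.
GENBANK_STYLE = "LOCUS       NAME_A\nACCESSION   X00371\n"
EMBL_STYLE = "ID   NAME_B\nAC   X00371;\n"


def entry_name(text):
    """Return the second token of the first line (the entry name)."""
    return text.splitlines()[0].split()[1]


def genbank_accession(text):
    """Return the first accession on the ACCESSION line, or None."""
    for line in text.splitlines():
        if line.startswith("ACCESSION"):
            return line.split()[1]
    return None


def embl_accession(text):
    """Return the first accession on the AC line, or None."""
    for line in text.splitlines():
        if line.startswith("AC "):
            # AC lines list accessions separated by semicolons.
            return line[2:].split(";")[0].strip()
    return None


same_accession = genbank_accession(GENBANK_STYLE) == embl_accession(EMBL_STYLE)
same_name = entry_name(GENBANK_STYLE) == entry_name(EMBL_STYLE)
print(same_accession, same_name)
```

In this toy case the accession matches across the two formats while the entry names do not, which is why the accession number is the safer key when cross-referencing databases.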
GenBank releases are produced every two months, containing
essentially all published and many unpublished DNA sequences
worldwide. For organizational purposes, the database is split up
among a number of divisions:
MAM | other mammalian sequences
VRT | other vertebrate sequences
PLN | plant, fungal, and algal sequences
BCT | bacterial and archaeal sequences
SYN | synthetic sequences, e.g. cloning vectors
EST | EST sequences (expressed sequence tags)
PAT | sequences from patent applications
STS | STS sequences (sequence tagged sites: sequences for which PCR primers are described)
GSS | GSS sequences (genome survey sequences)
HTG | High Throughput Genomic Sequences (raw, high throughput genomic sequencing reads)
HTC | HTC sequences (raw, high throughput cDNA sequences)
ENV | environmental sampling sequences
TSA | Transcriptome Shotgun Assembly
CON | Does not contain sequence data. Contigs are described by join() statements, joining other sequences into contigs
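As a minimal sketch of how the divisions show up in practice, the Python below looks up the division of an entry from its LOCUS line. It assumes the current GenBank flatfile layout, in which the three-letter division code is the second-to-last field of the LOCUS line, followed by the date. The LOCUS line used here is a hypothetical example, and the lookup table covers only the divisions listed above.

```python
# Division codes for the divisions described above (not the full GenBank list).
DIVISIONS = {
    "MAM": "other mammalian sequences",
    "VRT": "other vertebrate sequences",
    "PLN": "plant, fungal, and algal sequences",
    "BCT": "bacterial and archaeal sequences",
    "SYN": "synthetic sequences",
    "EST": "expressed sequence tags",
    "PAT": "sequences from patent applications",
    "STS": "sequence tagged sites",
    "GSS": "genome survey sequences",
    "HTG": "high throughput genomic sequences",
    "HTC": "high throughput cDNA sequences",
    "ENV": "environmental sampling sequences",
    "TSA": "transcriptome shotgun assembly",
    "CON": "contigs built with join() statements",
}


def division_of(locus_line):
    """Return (code, description) for a GenBank-style LOCUS line.

    Assumes the division code is the second-to-last whitespace-separated
    field, with the date as the last field.
    """
    fields = locus_line.split()
    if len(fields) < 3 or fields[0] != "LOCUS":
        raise ValueError("not a LOCUS line")
    code = fields[-2]
    return code, DIVISIONS.get(code, "division not in this table")


# Hypothetical LOCUS line, written by hand for illustration only.
example = "LOCUS       EXAMPLE01    1200 bp    DNA     linear   MAM 01-JAN-2000"
print(division_of(example))
```

A dictionary lookup with a fallback string is used so that entries from divisions not listed here are reported rather than raising an error.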
RefSeq - Reference Sequence Database
For model organisms with well-annotated genomes, the RefSeq databases
have non-redundant annotated entries for genomic, RNA and protein
sequences. RefSeq is integrated, in that each gene has a
corresponding RNA and protein entry.
- refseqgene - genomic DNA for reference genes. In many cases,
"genes" are segments of genomic DNA taken from a larger chromosome.
- refseq_rna - transcripts for reference genes
- refseq_protein - amino acid sequences for reference genes
In other words, RefSeq is where you go if you want a fully annotated
copy of a specific gene.
>>> What is not in RefSeq?
- Not all species are represented in RefSeq.
- Some species have most genes from the genome, while others have
perhaps only a few genes.
- Entries in the RefSeq genomic database are genes, not complete
chromosomes.
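The three RefSeq databases described in this section (refseqgene, refseq_rna, refseq_protein) can usually be told apart by accession prefix. The Python sketch below is an illustration under that assumption and covers only four common prefixes: NG_ for genomic regions (which include RefSeqGene records), NM_ and NR_ for transcripts, and NP_ for proteins. Other prefixes exist and are reported as unknown here. The accessions in the usage lines are hypothetical placeholders.

```python
# Partial mapping from RefSeq accession prefix to the database it belongs to.
# Only a few common prefixes are listed; this is not a complete table.
REFSEQ_PREFIXES = {
    "NG_": "refseqgene",      # genomic regions, including RefSeqGene records
    "NM_": "refseq_rna",      # protein-coding transcripts (mRNA)
    "NR_": "refseq_rna",      # non-coding transcripts
    "NP_": "refseq_protein",  # protein sequences
}


def refseq_database(accession):
    """Return the RefSeq database name for an accession, or 'unknown'."""
    prefix = accession[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "unknown")


# Hypothetical accessions, used only to exercise the lookup.
for acc in ("NG_000000", "NM_000000", "NP_000000", "X00371"):
    print(acc, refseq_database(acc))
```

Note that X00371 is a GenBank/EMBL/DDBJ accession rather than a RefSeq one, so the lookup reports it as unknown.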
2. UniProt and other protein databases
The Protein Information Resource at Georgetown University
[http://pir.georgetown.edu/] and the UniProt database at EMBL
[http://www.ebi.ac.uk/uniprot/] both catalogue and
classify proteins. Protein sequences are submitted directly
by researchers, translated from DNA databases, or taken from
the research literature.
Example: protein entries corresponding to the human myoglobin gene (GB:X00371)
Both GenBank and EMBL also
generate databases containing translations of DNA sequences.
GenBank produces the GenPept database, and EMBL the TrEMBL
database. These databases can be thought of as raw translations
of known or predicted coding sequences from DNA data. GenPept
and TrEMBL exist primarily as a convenience for database searches.
In contrast, PIR and UniProt databases are carefully
annotated to produce an efficient report on each protein. In
particular, when many genes encode identical proteins, only
one protein entry is produced, citing all the genes and their
respective GenBank/EMBL/DDBJ accession numbers. PIR and UniProt
specialize in annotating features relating to protein
structure or chemistry. Where 3D structures are known, links
to protein structure databases are also included.
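The difference between raw translations and a non-redundant protein database can be sketched in a few lines of Python. This is only an illustration of the idea of collapsing identical proteins into one entry that cites every source accession. It is not how PIR or UniProt curate entries in practice, and the accessions and sequences below are made-up toy data.

```python
from collections import defaultdict

# Hypothetical raw translations: (nucleotide accession, protein sequence).
# Both the accessions and the sequences are invented for illustration.
RAW_TRANSLATIONS = [
    ("ACC0001", "MGLSDGEWQL"),
    ("ACC0002", "MGLSDGEWQL"),  # identical protein from a second gene entry
    ("ACC0003", "MKTAYIAKQR"),
]


def collapse_identical(translations):
    """Group translations by protein sequence.

    Returns one record per distinct protein, listing every nucleotide
    accession that encodes it, in the order first seen.
    """
    by_sequence = defaultdict(list)
    for accession, protein in translations:
        by_sequence[protein].append(accession)
    return [
        {"sequence": seq, "sources": accessions}
        for seq, accessions in by_sequence.items()
    ]


entries = collapse_identical(RAW_TRANSLATIONS)
for entry in entries:
    print(entry["sequence"], entry["sources"])
```

Three raw translations become two protein entries here, with the first entry citing both nucleotide accessions that encode the same sequence.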