PLNT4610/PLNT7690 Bioinformatics - Lecture 5, part 2 of 3

1. GenBank/EMBL/DDBJ nucleotide databases

Strictly speaking, what many people think of as GenBank is a collaboration between three groups:

GenBank - National Center for Biotechnology Information at the National Library of Medicine, NIH [https://ncbi.nlm.nih.gov/]
EMBL - European Molecular Biology Laboratory Nucleotide Sequence Database, Hinxton, UK [http://www.ebi.ac.uk]
DDBJ - DNA Data Bank of Japan, Mishima, Japan [http://www.ddbj.nig.ac.jp]

All three sites are synchronized, such that each has the complete set of known nucleotide sequences. That is, there is a 1:1 correspondence for each entry at one database with the same entry at the other databases. Together, these organizations process sequence and annotation data submitted electronically from authors, and monitor the research literature to make sure that sequences not submitted by authors are entered by database staff.

The entries that most workers see are 'flatfile' reports generated from the databases

At the database centers, data is stored using database management tools. GenBank and DDBJ both produce database entries in a human-readable report format, described in the current release notes for GenBank [https://ftp.ncbi.nlm.nih.gov/genbank/gbrel.txt]. The EMBL database entries contain the same information in a different human-readable format. GenBank also produces database entries in the generic ASN.1 format, to make it easier for other database software to import GenBank files.

Example: Human myoglobin gene (X00371)
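
Because flatfile reports are plain text, they are easy to take apart with a short script. The sketch below is a minimal illustration, not the official parser. It assumes the usual GenBank layout, in which top-level keywords (LOCUS, DEFINITION, ACCESSION and so on) start in column 1, continuation lines are indented, and a record ends with a line containing //. The function name split_flatfile and the record TOY00001 are invented for illustration; the record is not the real X00371 entry.

```python
def split_flatfile(text):
    """Group the lines of one GenBank-style flatfile record by top-level keyword."""
    sections = {}
    current = None
    for line in text.splitlines():
        if line.startswith("//"):        # record terminator
            break
        if not line.strip():             # skip blank lines
            continue
        if not line[0].isspace():        # a keyword starts in column 1
            parts = line.split(None, 1)
            current = parts[0]
            sections[current] = [parts[1].strip() if len(parts) > 1 else ""]
        elif current is not None:        # indented continuation of the current keyword
            sections[current].append(line.strip())
    return sections


# Hypothetical record, made up to show the layout only.
EXAMPLE_RECORD = """\
LOCUS       TOY00001                  12 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Made-up record used only to illustrate the layout.
ACCESSION   TOY00001
FEATURES             Location/Qualifiers
     source          1..12
ORIGIN
        1 atggggctca gc
//
"""

sections = split_flatfile(EXAMPLE_RECORD)
print(sections["ACCESSION"])
print(sections["ORIGIN"])
```

Each keyword maps to a list of lines, so multi-line fields such as FEATURES and ORIGIN keep their continuation lines together.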

A new GenBank release is produced every two months, containing essentially all published and many unpublished DNA sequences worldwide. For organizational purposes, the database is split up among a number of divisions:

Division	Description
PRI	primate sequences
ROD	rodent sequences
MAM	other mammalian sequences
VRT	other vertebrate sequences
INV	invertebrate sequences
PLN	plant, fungal, and algal sequences
BCT	bacterial and archaeal sequences
VRL	viral sequences
PHG	bacteriophage sequences
SYN	synthetic sequences, e.g. cloning vectors
UNA	unannotated sequences
EST	EST sequences (expressed sequence tags)
PAT	Sequences from patent applications
STS	STS sequences (sequence tagged sites - sequences for which PCR primers are described; used in genomic mapping)
GSS	GSS sequences (genome survey sequences)
HTG	High Throughput Genomic Sequences (raw, high throughput genomic sequencing reads)
HTC	HTC sequences (raw, high throughput cDNA sequencing reads)
ENV	environmental sampling sequences (e.g. metagenomics)
TSA	Transcriptome Shotgun Assembly sequences
CON	Does not contain sequence data. Contigs are described by join() statements, joining other sequences into contigs
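
The division code of an entry appears on the LOCUS line of its flatfile report. As a rough sketch, the table above can be turned into a lookup that reports the division of an entry. This assumes the LOCUS line has all of its usual fields, so that the division is the second-to-last whitespace-separated token (just before the date). The function name division_of and the LOCUS line used here are hypothetical.

```python
# Division codes and descriptions, abbreviated from the table above.
DIVISIONS = {
    "PRI": "primate sequences",
    "ROD": "rodent sequences",
    "MAM": "other mammalian sequences",
    "VRT": "other vertebrate sequences",
    "INV": "invertebrate sequences",
    "PLN": "plant, fungal, and algal sequences",
    "BCT": "bacterial and archaeal sequences",
    "VRL": "viral sequences",
    "PHG": "bacteriophage sequences",
    "SYN": "synthetic sequences",
    "UNA": "unannotated sequences",
    "EST": "expressed sequence tags",
    "PAT": "sequences from patent applications",
    "STS": "sequence tagged sites",
    "GSS": "genome survey sequences",
    "HTG": "high throughput genomic sequences",
    "HTC": "high throughput cDNA sequences",
    "ENV": "environmental sampling sequences",
    "TSA": "transcriptome shotgun assembly sequences",
    "CON": "contigs described by join() statements",
}


def division_of(locus_line):
    """Return (code, description) for the division named on a LOCUS line."""
    fields = locus_line.split()
    if len(fields) < 2 or fields[0] != "LOCUS":
        raise ValueError("not a LOCUS line")
    code = fields[-2]  # assumption: division is the token just before the date
    return code, DIVISIONS.get(code, "unknown division")


# Hypothetical LOCUS line, not taken from a real entry.
line = "LOCUS       TOY00001                  12 bp    DNA     linear   SYN 01-JAN-2000"
print(division_of(line))
```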

RefSeq - Reference Sequence Database

For model organisms with well-annotated genomes, the RefSeq databases have non-redundant annotated entries for genomic, RNA and protein sequences. RefSeq is integrated, in that each gene has a corresponding RNA and protein entry.

refseqgene - genomic DNA for reference genes. In many cases, "genes" are segments of genomic DNA taken from larger chromosome contigs
refseq_rna - transcripts for reference genes
refseq_protein - amino acid sequences for reference genes
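
RefSeq accession numbers begin with a two-letter prefix and an underscore that indicate the type of molecule, which makes it possible to guess which of the three databases above an accession belongs to. The prefix convention is not described in this lecture; it is taken from NCBI's RefSeq documentation, and the mapping below covers only the common prefixes as I understand them. The function name refseq_database and the accession NM_000000.0 are placeholders for illustration.

```python
# Assumed mapping from RefSeq accession prefix to the databases listed above.
PREFIX_TO_DATABASE = {
    "NG": "refseqgene",       # genomic regions for reference genes
    "NM": "refseq_rna",       # curated mRNA
    "NR": "refseq_rna",       # curated non-coding RNA
    "XM": "refseq_rna",       # predicted mRNA
    "XR": "refseq_rna",       # predicted non-coding RNA
    "NP": "refseq_protein",   # curated protein
    "XP": "refseq_protein",   # predicted protein
}


def refseq_database(accession):
    """Return the RefSeq database suggested by the accession prefix, or None."""
    prefix, sep, _rest = accession.partition("_")
    if not sep:               # no underscore: not a RefSeq-style accession
        return None
    return PREFIX_TO_DATABASE.get(prefix.upper())


print(refseq_database("NM_000000.0"))  # placeholder accession
print(refseq_database("X00371"))       # GenBank-style accession, no RefSeq prefix
```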

>>> What is not in RefSeq? <<<

Not all species are represented in RefSeq.
Some species have most of the genes from their genome, while others have perhaps only a few genes.
Entries in the RefSeq genomic database are genes, not complete chromosomes.

In other words, RefSeq is where you go if you want a fully annotated copy of a specific gene.

2. UniProt and other protein databases

The Protein Information Resource at Georgetown University [http://pir.georgetown.edu/] and the UniProt database at EMBL [http://www.ebi.ac.uk/uniprot/] both catalogue and classify proteins. Protein sequences are submitted directly by researchers, translated from DNA databases, or taken from the research literature.

Example: protein entries corresponding to Human myoglobin gene (GB:X00371): PIR, UniProt

Both GenBank and EMBL also generate databases containing translations of DNA sequences. GenBank produces the GenPept database, and EMBL the TrEMBL database. These databases can be thought of as raw translations of known or predicted coding sequences from DNA data. GenPept and TrEMBL exist primarily as a convenience for database searches.
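
The idea of a "raw translation" of a coding sequence can be illustrated with a few lines of code. This is only a sketch under simplifying assumptions: it uses the standard genetic code, reads a single already-spliced coding sequence in frame from its first base, and stops at the first stop codon. It is not how GenPept or TrEMBL are actually built. The function name translate_cds and the short input sequence are made up for the example.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(dna):
    """Translate an in-frame coding sequence, stopping at the first stop codon."""
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks ambiguous codons
        if aa == "*":                            # stop codon ends the protein
            break
        protein.append(aa)
    return "".join(protein)


print(translate_cds("atggggctcagcgactaa"))  # prints MGLSD
```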

In contrast, PIR and UniProt databases are carefully annotated to produce a concise report on each protein. In particular, when many genes encode identical proteins, only one protein entry is produced, citing all the genes and their respective GenBank/EMBL/DDBJ accession numbers. PIR and UniProt specialize in annotating features relating to protein structure or chemistry. Where 3D structures are known, links to protein structure databases are also included.
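
The merging of identical proteins into a single entry can be sketched as a grouping step: translations are keyed by their amino acid sequence, and each group keeps the list of nucleotide accessions that encode it. This is a toy illustration only. Curated databases involve far more than exact string matching, and the function name collapse_identical, the accessions and the sequences below are all hypothetical.

```python
from collections import defaultdict


def collapse_identical(translations):
    """Merge (nucleotide_accession, protein_sequence) pairs with identical proteins.

    Returns one entry per distinct protein, citing every accession that encodes it.
    """
    by_sequence = defaultdict(list)
    for accession, protein in translations:
        by_sequence[protein.upper()].append(accession)
    return [
        {"sequence": seq, "accessions": sorted(accs)}
        for seq, accs in by_sequence.items()
    ]


# Hypothetical raw translations: two genes encode the same protein.
raw = [
    ("TOY00001", "MGLSD"),
    ("TOY00002", "MGLSD"),
    ("TOY00003", "MKVLA"),
]

for entry in collapse_identical(raw):
    print(entry["sequence"], entry["accessions"])
```

Three raw translations collapse to two protein entries, the first of which cites both TOY00001 and TOY00002.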

Unless otherwise cited or referenced, all content on this page is licensed under the Creative Commons License Attribution Share-Alike 2.5 Canada
