last  page PLNT4610/PLNT7690 Bioinformatics
Lecture 8, part 1 of 2
next page

November 7, 9, 2017

DATABASES


REFERENCES

Harrington, JL (2000) Object-Oriented Database Design Clearly Explained. Morgan Kaufmann Publishers.

Simon, AR (1995) Strategic Database Technology. Morgan Kaufmann Publishers.

Walsh S and Anderson M (1998)  ACEDB: A database for genome information. In Baxevanis AD and Ouellette BFF. Eds. Bioinformatics: A practical guide to the analysis of genes and proteins . John Wiley, Toronto.


A. WHY DATABASES?

B. DATABASES ARE TO DATA AS CONSERVATION IS TO ENERGY

C. TYPES OF DATABASES

1. Flatfile

2. Relational Databases

3. Object-oriented databases

4. Hypertext databases

D. ACeDB

Tutorial: A Laboratory ACEDB Database

Tutorial:  Creating a database using ACEDB


Energy can be a metaphore for data:
"Among the unconventional sources of energy, conservation presents itself as the most immediate opportunity. It should be regarded as a largely untapped source of energy. Indeed, conservation - not coal or nuclear energy - is the major alternative to imported oil. "
Stoboagh, R and Yergin D (1979) Energy Future. pg. 10.Ballantine Books, New York.

A. WHY DATABASES?


B. DATABASES ARE TO DATA AS CONSERVATION IS TO ENERGY


HOW WE LOSE DATA:


HOW WE CONSERVE DATA

C.  TYPES OF DATABASES

 

1. Flatfile databases

A flatfile database consists of a collection of records, each containing several data fields.

For example,  three records in a database of cell stocks might look like this:
 
Cell_Stock : "SK11.pEA215.3"
Species  "Escherichia coli"
Plasmid  "pEA215.3"
Experiment       "SK11"
Freezer  "AG334 -80C"
Box      "Pisum ESTs II"
Gridded  "Rack(BF7) Box(Pisum ESTs II)"

Cell_Stock : "SK11.pI206KS"
Species  "Escherichia coli"
Plasmid  "pI206KS"
Experiment       "SK11"
Freezer  "AG334 -80C"
Box      "Pisum ESTs II"
Gridded  "Rack(BF7) Box(Pisum ESTs II)"

Cell_Stock : "SK11.pEA46.2"
Species  "Escherichia coli"
Plasmid  "pEA46.2"
Experiment       "SK11"
Freezer  "AG334 -80C"
Box      "Pisum ESTs II"
Gridded  "Rack(BF7) Box(Pisum ESTs II)"
One observation that can be made immediately is that flatfile databases are highly redundant. In this example, the data in the Species, Freezer, Box and Gridded fields are identical. Not only does this redundancy waste space, it means that if a field has to change, it must be changed in all records. For example, if the box was moved to a different freezer, all Freezer fields for that box would have to be changed.

Transactions update the database

 

 
 
 
 
 
 
 
 
 

Transactions that change, add or delet records require rewriting the database. To add a record, the database is written to the point at which the record is to be added, the new record is written, and then the rest of the records is written. Similarly, changing and deleting records also requires that all records be rewritten.

Searching the database is done by linear search or index search.

Searching a flatfile database normally requires that the entire text of the database be searched. Faster searches can be done if the database is indexed. For example, the Cell Stock database above could be indexed by Experiment, Plasmid, Species,  etc. Indexing can greatly increase the size of the database.
 

Flatfile databases force a single view of the data

Probably the single most important disadvantage of flatfile databases is that the records must be organized by one of the fields. In the example, the decision would have to be made that the Cell Stock was the fundamental unit of data, and that other types of information, such as Plasmids or Species, were best considered as an attribute of a Cell Stock. For many purposes, this is an unnatural representation of the data.

GenBank entries retrieved by Entrez are flatfile representations of data that are centered around DNA sequences. In a GenBank entry, every field appears to belong to a particular sequence. We will see later that the flat file representation is simply one possible view of a richer database.
 

2. Relational databases

Relational databases can be thought of as comprehensive tables of data. Every record from a flatfile could be implemented as a row in a table. The Cell Stock database might be implemented as a table with
 
Cell_Stock          Experiment  plasmid     Freezer    Box         Species

SK4.pcyclinB        SK4         pcyclinB    AG334 -80C Pisum ESTs Escherichia coli
SK4.pcyclinD        SK4         pcyclinD    AG334 -80C Pisum ESTs IEscherichia coli
SK4.pdiminuto       SK4         pdiminuto   AG334 -80C Pisum ESTs IEscherichia coli
SK4.pFF100          SK4         pFF100      AG334 -80C Pisum ESTs IEscherichia coli
SK4.pME1            SK4         pME1        AG334 -80C Pisum ESTs IEscherichia coli
SK4.pNDK-P1         SK4         pNDK-P1     AG334 -80C Pisum ESTs IEscherichia coli
SK4.pPABP1          SK4         pPABP1      AG334 -80C Pisum ESTs IEscherichia coli
SK4.pPCNA           SK4         pPCNA       AG334 -80C Pisum ESTs IEscherichia coli
SK4.pPS-IAA4-5      SK4         pPS-IAA4-5  AG334 -80C Pisum ESTs IEscherichia coli
SK4.pPS-IAA6        SK4         pPS-IAA6    AG334 -80C Pisum ESTs IEscherichia coli
SK4.pTic110         SK4         pTic110     AG334 -80C Pisum ESTs IEscherichia coli
From a relational database, many views of the data are possible.

Plasmid View

Plasmid     Species           Cell Stock
pEA25       Escherichia coli  SK10.2.pEA25
pEA46.2     Escherichia coli  SK11.pEA46.2
pEA207.2    Escherichia coli  SK11.pEA207.2
pEA214.6    Escherichia coli  MB123.pEA214.6
pEA215.3    Escherichia coli  SK11.pEA215.3
pEA238.2    Escherichia coli  MB123.3.PEA238.2
pEA238.11   Escherichia coli  MB123.3.pEA238.11
pEA277.11   Escherichia coli  SK11.pEA277.11
pEA303.4    Escherichia coli  SK11.pEA303.4
pEA315.2    Escherichia coli  MB123.3.pEA315.2 
peB4        Escherichia coli  VB1.eB4

Experiment View

Experiment  Cell Stock       Box               Freezer
SK4         SK4.pPS-IAA4-5   Pisum ESTs I      AG334 -80C
SK4         SK4.pPS-IAA6     Pisum ESTs I      AG334 -80C
SK4         SK4.pTic110      Pisum ESTs I      AG334 -80C
SK4         SK4.pToc34       Pisum ESTs I      AG334 -80C
SK4         SK4.pToc86       Pisum ESTs I      AG334 -80C
SK5         SK5.pAB96.3      Pisum ESTs I      AG334 -80C
SK5         SK5.pABR17.10    Pisum ESTs I      AG334 -80C
SK5         SK5.pABR18.2     Pisum ESTs I      AG334 -80C
SK5         SK5.pI39         Pisum ESTs I      AG334 -80C
SK5         SK5.pI49KS       Pisum ESTs I      AG334 -80C
SK5         SK5.pI176KS      Pisum ESTs I      AG334 -80C
SK5         SK5.pI225KS      Pisum ESTs I      AG334 -80C
 

Relational databases can be organized across many files

Although a relational database can be implemented in a single large table or "relation", it is often advantageous to split the database up into multiple tables.

Four tables are shown: PLASMID, DNA_SAMPLE, VECTOR and LOCATION. Fields in lowercase contain text or other data. Fields in uppercase contain names of items in other tables. Fields that reference other tables are referred to as links. Several factors have to be considered when designing a relational database:

Common relational database packages:

Unless otherwise cited or referenced, all content on this page is licensed under the Creative Commons License Attribution Share-Alike 2.5 Canada
last  page PLNT4610/PLNT7690 Bioinformatics
Lecture 8, part 1 of 2
next page