PLNT4610/PLNT7690 Bioinformatics - Lecture 8, part 2 of 2

3. Object-oriented databases

The previous examples illustrate that one database can generate many views. This concept is addressed in a more formal way in object-oriented databases. Object-oriented databases solve many of the problems that are inherent in relational databases, and provide more powerful capabilities.

The object concept can be stated as follows:

A class is a formula for creating an object.
Everything is an object.
Objects have Data and Methods
You can make new classes by reusing and extending an existing class

The fundamental unit of Object-oriented databases (OODB) is the Class.

A class can be thought of as a formula for an object.

A class may have many attributes and methods.

Attributes - Attributes hold the information that is intrinsic to the class. Data items are usually simple types of data, such as text or numbers. Attributes can also be pointers to other classes. There may be either one or many data items or classes of each type.
Methods - Methods are procedures or tasks that an object can carry out. A method may contain its own data fields, and may point to other classes.

An object is an instance of a class

Objects, in OODBs, correspond to individual records in a relational database. There may therefore be many objects of a given class. A class is an abstraction. An object is a tangible thing.

pFF100 is an instance of the class PLASMID. The class definition allows it to point to two other classes, VECTOR and DNA_SAMPLE. DNA_SAMPLE might look like this:

CLASS	OBJECT
DNA_SAMPLE	SK99.pFF100
VECTOR | PLASMID	pFF100
EXPERIMENT	SK99
concentration [µg/µl]	0.5
BOX	Pisum ESTs I

In the DNA_SAMPLE class, the first field points to either a VECTOR or a PLASMID. In this case, the corresponding field in SK99.pFF100 points to the PLASMID pFF100. The EXPERIMENT field points to an object of the type EXPERIMENT. The concentration field contains a floating point number whose units are µg/µl. The BOX field points to an object of the type BOX, called "Pisum ESTs I".
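As a rough illustration, the DNA_SAMPLE class and the SK99.pFF100 object described above can be sketched in Python. This is only an analogy: the Python class and attribute names are assumptions made for the example and are not part of any database system, while the object values come from the table above.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Vector:
    name: str


@dataclass
class Plasmid:
    name: str


@dataclass
class Experiment:
    name: str


@dataclass
class Box:
    name: str


@dataclass
class DnaSample:
    name: str
    source: Union[Vector, Plasmid]  # the "VECTOR | PLASMID" field: a link
    experiment: Experiment          # a link to an EXPERIMENT object
    concentration: float            # the only plain data field, in µg/µl
    box: Box                        # a link to a BOX object


# An object is an instance of a class.
pff100 = Plasmid("pFF100")
sample = DnaSample(
    name="SK99.pFF100",
    source=pff100,
    experiment=Experiment("SK99"),
    concentration=0.5,
    box=Box("Pisum ESTs I"),
)

# Following a link leads to another object, not to a copy of its data.
print(sample.source.name)
```

Three of the four fields hold references to other objects, and only concentration holds a value directly.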

A database is a model of something in the real world

It is important to note that of the four fields in the DNA_SAMPLE class, three are relations; that is, they point to other classes. Only one contains data. This illustrates the point that relations between objects are often as important as the data themselves. The relations between objects give the data objects characteristics similar to those of their real-world counterparts.

Methods

The implementation of methods is highly dependent on the specific database software being used. In some cases, macros (that is, sets of database commands) might be run. In other cases, an external program might be called. For example, the PLASMID class could have a method called MAP_VIEWER:

CLASS	OBJECT
PLASMID	pFF100
VECTOR	pBluescriptIISK+
insert	2500bp BamHI frag. from GB::X6638
MAP_VIEWER	eog
filename	pFF100.gif
DNA_SAMPLE	SK99.pFF100

Here, the MAP_VIEWER and filename fields could be a template for a command that would launch a viewing program with a specific file. In the pFF100 object, the actual command that would be run is 'eog pFF100.gif'. This command would be passed to UNIX, launching the eog image viewer with the GIF file pFF100.gif.

Depending on the software, it might even be possible to call a different viewing program for different types of image files. For example, if a plasmid called pMU1 had a map in Adobe PDF format, it would be necessary to view its map using the Adobe Acrobat viewer, eg. 'acroread pMU1.pdf'.
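The MAP_VIEWER idea can be sketched in Python as a method that assembles a command from two fields of the object. The class layout and method name are assumptions made for this example; only the viewer and file names (eog, acroread, pFF100.gif, pMU1.pdf) come from the text above. The sketch builds and prints the commands but does not launch anything.

```python
import shlex
from dataclasses import dataclass
from typing import List


@dataclass
class Plasmid:
    name: str
    map_viewer: str  # MAP_VIEWER field: the program that displays the map
    filename: str    # filename field: the map file itself

    def map_command(self) -> List[str]:
        """Build the command that a MAP_VIEWER method would run."""
        return [self.map_viewer, self.filename]


pff100 = Plasmid("pFF100", "eog", "pFF100.gif")
pmu1 = Plasmid("pMU1", "acroread", "pMU1.pdf")

for plasmid in (pff100, pmu1):
    # shlex.join only formats the command for display; nothing is executed.
    print(shlex.join(plasmid.map_command()))
```

To actually open the viewer, the list returned by map_command() could be handed to subprocess.run(), assuming the named program is installed.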

One-to-one vs. one-to-many

The tabular structure of relational databases makes one-to-many relationships awkward to implement. In OODBs, it is trivial to modify a class to allow many fields of the same type. In databases like ACeDB, all fields may be present in arbitrary numbers unless specifically implemented as UNIQUE fields.

CLASS	OBJECT
VECTOR	pBluescriptKSm13+
PLASMID	pI206KS pI49KS pI236KS pI230KS
ACCESSION	X52331
DNA_SAMPLE	AN29.pBluescriptKSm13+ GK302.pBluescriptKSm13+ FJ120.pBluescript.KSm13+

Here, four plasmid constructs were made using the pBluescriptKSm13+ vector, and three DNA_SAMPLEs of this vector (not the plasmids) are listed.
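In a programming language, a field that may occur any number of times is simply a list. The sketch below mirrors the VECTOR object above. Two simplifications are assumptions made for brevity: linked objects are stored by name rather than as object references, and ACCESSION is treated as a single-valued (UNIQUE-style) field.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Vector:
    name: str
    accession: Optional[str] = None                       # at most one value
    plasmids: List[str] = field(default_factory=list)     # any number of values
    dna_samples: List[str] = field(default_factory=list)  # any number of values


vec = Vector("pBluescriptKSm13+", accession="X52331")
vec.plasmids += ["pI206KS", "pI49KS", "pI236KS", "pI230KS"]
vec.dna_samples += [
    "AN29.pBluescriptKSm13+",
    "GK302.pBluescriptKSm13+",
    "FJ120.pBluescript.KSm13+",
]

# Adding a fifth plasmid later needs no change to the class definition.
print(len(vec.plasmids), len(vec.dna_samples))
```

In a relational design, the same one-to-many relationship would usually require a separate table joining vectors to plasmids.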

Independence of classes

Many changes in classes have no effect on other classes. In particular, any number of attributes can be added, without changing objects that link to a class. The one big exception to this is the case in which 2-way links are added. In that case, both classes must be modified.

One point to make here is that when a class is changed, not all objects in that class need to be modified. OODBs, and to some extent other types of databases, do not require that all objects contain data for all possible attributes defined in the class. This allows a 'grandfathering' of preexisting objects. For example, if the class Cell_Stock were modified by the addition of an attribute called 'Date', listing the date on which the stock was made, it would not be necessary to go back and insert a date into potentially thousands of existing Cell_Stock objects. Dates can be included in new Cell_Stock objects as they are created.
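The grandfathering of existing objects can be sketched with an optional attribute. In this Python analogy, the stock names and the date are hypothetical values chosen for the example; only the class name Cell_Stock and the added 'Date' attribute come from the text.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class CellStock:
    name: str
    date_made: Optional[date] = None  # the new 'Date' attribute; may be absent


# An object created before the class was changed: it carries no date.
old_stock = CellStock("stock_0001")
# An object created after the change: the date is filled in.
new_stock = CellStock("stock_2045", date(2024, 3, 1))

for stock in (old_stock, new_stock):
    label = stock.date_made.isoformat() if stock.date_made else "no date recorded"
    print(stock.name, label)
```

Code that reads the new attribute has to tolerate its absence, which is the price of not updating every old object.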

Modifications are done on specific objects

OODBs are the most efficient type of database to modify. Each object can be updated independently. Depending on the implementation, only a small part of the database may need to be rewritten during updates.

4. Schemas - models for objects

A schema is a model, or a formula, for how to create objects of a given class. Schemas can be expressed as diagrams for human readability, or in languages such as XML, for programmatic use.

Goals for creation of a good schema:

Each class represents something in the real world. The closer your classes and relations are to the real-world description, the more the database will behave like the real thing you are modeling.
Classes point to other classes. Much of the information about things has to do with their relationships to other things.
Try to minimize the number of classes. If you see redundancy between two or more classes, it may be better to combine them into a single class.
Try to minimize the size of classes. When a class gets too big, it may be time to create a new class.
Never store the same piece of information in two different classes. Links (relations) can be duplicated, but not raw data (eg. numbers, weights etc.).

EXAMPLE - Schema to implement biochemical pathways

The schema at right implements biochemical pathways, using the conventions of the ACeDB system. Each pathway object points to one or more enzymes present in that pathway, and each enzyme object points to one or more pathways to which it belongs.

Reactions performed by each enzyme are conceptualized as consumption of a substrate to produce a product. The pathway class also has a Chart field, which points to an image file showing the pathway. Note that objects contain other pieces of information. For example, each compound has a molecular weight, and each enzyme has an EC number.

$ACE_FILE_LAUNCHER - Rather than hard-code the name of a file viewer into the database, it is better to hand off the viewing to a script. The environment variable $ACE_FILE_LAUNCHER specifies the name of a script that determines the type of file to be viewed from its file extension (eg. .png, .html, .pdf etc.) and chooses an application to view the file.
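A launcher script of this kind might work along the following lines. This is a minimal sketch, not the script actually used with ACeDB: the extension-to-viewer table is an assumption (eog and acroread are the viewers mentioned earlier in this lecture; firefox for .html is a guess), and the function only returns the command rather than running it.

```python
from pathlib import Path
from typing import List

# Hypothetical mapping from file extension to viewing program.
VIEWERS = {
    ".gif": "eog",
    ".png": "eog",
    ".pdf": "acroread",
    ".html": "firefox",
}


def launch_command(filename: str) -> List[str]:
    """Choose a viewer from the file extension and return the command to run."""
    suffix = Path(filename).suffix.lower()
    if suffix not in VIEWERS:
        raise ValueError(f"no viewer configured for {suffix!r}")
    return [VIEWERS[suffix], filename]


print(launch_command("pMU1.pdf"))
```

Because the choice of viewer lives in one script, replacing a viewer means editing one table instead of every object in the database.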

Each field contains a label and a data type

Databases try to create a model of real-world things as we understand them. To make this possible, it is useful to give each field a label, which describes what each piece of data is intended to represent. The label is a convenience for the human user. Each field also has a data type, which indicates the type of data used to represent that piece of information.

Common_Name is implemented as a Text field, a string of characters.
Mol_Wt (molecular weight) is a number, so it is implemented as an integer.

In a biochemical pathway, a compound can be a product of one enzyme, and a substrate for one enzyme. To represent these concepts, we have two fields, Produced_by and Consumed_by. Both point to objects of the Enzyme class.

Note: Common_Name and Mol_Wt are examples of fields in which the information is contained in each object. Produced_by and Consumed_by are examples of fields which point to other objects.

Remember, objects are instances of a class. We can make as many instances of a class as we wish. So here are two Objects of the class Compound, as implemented in the ACeDB system:

This illustrates the point that Classes are abstract ideas, whereas Objects are specific instances of those abstract ideas.

As mentioned above, Common_Name and Mol_Wt contain information, whereas Produced_by and Consumed_by point to other objects, in this case, Enzyme objects.
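A rough Python analogue of the Compound class and two of its objects is sketched here. The four field names follow the schema described above. The compounds and enzymes are an assumption: they are taken from the TCA cycle for illustration and are not necessarily the objects used in the lecture's own example, and the molecular weights are rounded values for the free acids.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Enzyme:
    name: str


@dataclass
class Compound:
    common_name: str                                         # Text: data held in the object
    mol_wt: int                                              # Int: data held in the object
    produced_by: List[Enzyme] = field(default_factory=list)  # links to Enzyme objects
    consumed_by: List[Enzyme] = field(default_factory=list)  # links to Enzyme objects


citrate_synthase = Enzyme("citrate synthase")
aconitase = Enzyme("aconitase")
isocitrate_dh = Enzyme("isocitrate dehydrogenase")

# One class, two objects.
citrate = Compound("citrate", 192, [citrate_synthase], [aconitase])
isocitrate = Compound("isocitrate", 192, [aconitase], [isocitrate_dh])

# Both compounds link to the same Enzyme object rather than to copies of it.
print(citrate.consumed_by[0] is isocitrate.produced_by[0])
```

The name and weight are stored in each object, whereas the enzyme information is stored once and shared through links.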

Pathway Demo

You can try a database that implements this schema to emulate the TCA cycle, by typing 'pathace' at the Linux prompt.

One of the best tests of a well-thought-out database occurs when you decide that the schema needs to be modified to add new concepts. For example, the existing Enzyme class could be extended to incorporate the concept of stoichiometry by adding a coefficient to each compound linked to in the Enzyme class. In the example at right, an integer (Int) tells the number of molecules of a substrate or product consumed or produced.

Adding these fields doesn't require any changes in the other data objects. If your classes are well-designed, a change in one class will not break other classes.
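The stoichiometry extension can be sketched as follows. Only the Enzyme class gains something new (an integer beside each linked compound), and the Compound class is left exactly as it was. The Python names are assumptions made for the example, and catalase (2 H2O2 -> 2 H2O + O2) is used simply because its coefficients are not all 1.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Compound:
    common_name: str  # unchanged by the schema extension


@dataclass
class Enzyme:
    name: str
    # Each link to a compound now carries an integer coefficient.
    substrates: List[Tuple[Compound, int]] = field(default_factory=list)
    products: List[Tuple[Compound, int]] = field(default_factory=list)


def side(pairs: List[Tuple[Compound, int]]) -> str:
    """Format one side of a reaction, e.g. '2 water + 1 oxygen'."""
    return " + ".join(f"{n} {c.common_name}" for c, n in pairs)


h2o2 = Compound("hydrogen peroxide")
water = Compound("water")
oxygen = Compound("oxygen")

catalase = Enzyme(
    "catalase",
    substrates=[(h2o2, 2)],
    products=[(water, 2), (oxygen, 1)],
)

print(side(catalase.substrates), "->", side(catalase.products))
```

Existing Compound objects need no edits, since the coefficient belongs to the link held by the enzyme and not to the compound itself.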

There are other possible modifications that might be reasonable for a database of this sort. For example, the current schema doesn't have provisions for enzymatic reactions that can proceed in either direction. Databases should always be designed with the goal of creating a realistic representation of something in the real world, and building in the ability for change to occur in one part without disrupting other parts.

Guidelines for good schema design

1. The database is a model of a biological or experimental system. Make it as close to the real system as possible.

2. Keep each class simple. The fewer fields, the better.

3. Do not duplicate the same piece of information in more than one object.

4. Wherever practical, avoid free text. Use links or enumerated choices.

5. BioLegato applies Object-Oriented concepts to Graphical User Interfaces

Alvare, G., Roche-Lima, A. & Fristensky, B. BioLegato: a programmable, object-oriented graphic user interface. BMC Bioinformatics 24, 316 (2023). https://doi.org/10.1186/s12859-023-05436-4

BioLegato is a fundamental rethinking about application programs. It takes as its premise the idea that objects are an intuitive way to combine information and the methods that work with that information. If the objects are structured like things that the end user is already familiar with, the fact that the user already understands the relationships between objects, and what they are expected to do, makes it easier to use the software.

blgeneric is a BioLegato application that launches BioLegato without any menus or canvas. This is mainly for demonstration purposes, to illustrate the fact that almost all functionality of BioLegato is programmable. In the terminology of object-oriented programming, think of BioLegato as an abstract class that is extended to create real classes. So, in a way, blgeneric is like instantiating an abstract class. To launch it, type 'blgeneric'.

To continue showing how BioLegato follows the Object-Oriented paradigm, bldna shows that all BioLegato windows have two parts: The canvas, which displays the data, and the Menus, which are the methods for the BioLegato object. In this example, bldna has a sequence canvas and menus for working with DNA.

Similarly, blncbi has a table canvas for displaying NCBI search results, and menu items for performing operations on those results, such as retrieval of hits.

Designing software tools as objects ensures that only methods appropriate to a particular kind of data accompany those objects. bldna has methods for DNA or RNA sequences. blprotein only has methods for proteins. blncbi has methods for NCBI query results. Packaging data and methods together prevents errors by making it impossible to use a method with data for which it is not suited. For example, bldna can launch BLAST searches, but only those searches that take a DNA sequence as input. Searches that take protein as input cannot be run from bldna.
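The principle can be sketched with a small class hierarchy. This is not how BioLegato itself is implemented: the class names, the run() method, and the menu contents are all hypothetical and serve only to show a generic window being extended with menus that fit one kind of data. The menu items return a description of the search instead of running one.

```python
from typing import Callable, Dict


class Window:
    """Generic window: a canvas holding data, and no menus of its own."""

    def __init__(self, data: str = "") -> None:
        self.canvas = data

    def menus(self) -> Dict[str, Callable[[str], str]]:
        return {}

    def run(self, item: str) -> str:
        menu = self.menus()
        if item not in menu:
            raise KeyError(f"{type(self).__name__} has no menu item {item!r}")
        return menu[item](self.canvas)


class DnaWindow(Window):
    """Extends the generic window with methods that take DNA as input."""

    def menus(self) -> Dict[str, Callable[[str], str]]:
        return {"blastn": lambda seq: f"blastn query of {len(seq)} nt"}


class ProteinWindow(Window):
    """Extends the generic window with methods that take protein as input."""

    def menus(self) -> Dict[str, Callable[[str], str]]:
        return {"blastp": lambda seq: f"blastp query of {len(seq)} aa"}


dna = DnaWindow("ACGTACGT")
print(dna.run("blastn"))
```

Calling dna.run("blastp") raises a KeyError, because a protein search is simply not among the methods packaged with DNA data.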

OO design also simplifies the look and feel of software tools by limiting menus to only those methods that make sense for a particular type of data.

SUMMARY:

A schema defines classes
A database instantiates classes into objects

6. Hypertext databases

Ways in which Web sites can be considered databases

In many people's minds, the World Wide Web is one big database. There is some element of truth in that statement.

Provides an efficient way to store and retrieve data
Web pages could each be considered a record of data. In this context, the browser is the 'front end' to the database.
Is machine readable and searchable
This sounds very obvious, but is a critical distinction between a library and a web site. Information in a library is not directly searchable. In contrast, some very sophisticated search engines now exist, using highly-efficient indexing schemes, that make it possible to quickly find almost any kind of information on the Web.
Is object-oriented, to some degree
There are many kinds of objects that are almost universally-recognized by web browsers, including HTML and text files, graphic files, Java applets, FTP sites, terminal programs, and plugins of many kinds. They all have attributes and methods.
Knowledge can be encoded in the structure of a web site
In much the same way that links structure an OODB, links between web pages structure web sites. In principle, an object-oriented database could be devised in which each plasmid had a web page, and each plasmid web page linked to corresponding pages for DNA samples and vectors.

Example: The Tree of Life [http://tolweb.org/tree/phylogeny.html]

The Tree of Life is a taxonomic database edited by David R. Maddison at the University of Arizona.
Its main structure is a hierarchy of web pages, whose root is at the kingdom level. Hypertext links allow a user to traverse the phylogenetic tree from one level to another (eg. phylum, order, class, family, genus, species). At each node, specialized data of almost any kind may be found, from images to text documents, or even links to other web sites.

DEMO: Descend Tree of life as follows:

root

Organisms with nucleated cells (Eukaryota)

Animals (Metazoa)

Bilateria

Deuterostomia

Why the Web is not, strictly speaking, a database

Web pages have no formal structure
A web page can be anything, and web sites can have any underlying structure. The structure can change from one moment to the next. The browser knows about nothing except the current page being viewed.
Much of the data is in the form of text, rather than structured types
In formal databases, every field has a type. For example, in DNA_SAMPLE, the concentration field was a floating point number. Many database programs would have a straightforward way of finding all DNA_SAMPLES whose concentration was 0.5 µg/µl or greater. Although text on web sites is machine searchable, the lack of formal types makes it impossible to write programs to work with the data.
There is no formal definition of any 'object' on the Web
There can be as many different ways of representing data as there are web sites. At one site, literature references could be listed as plain text, and at another, they could be implemented with links to authors, journals, and electronic copies of papers. At one site, a DNA sequence might be represented as raw nucleotides, while at another, a sequence might be available in a format such as GenBank, EMBL, or GCG. More importantly, the lack of any formal definition of sequence means that even at a single web site, each sequence could be in a different format. Thus, there is no way to write a program to handle a sequence from such a web site, because no formal definition exists. As well, there is no way for database software to do automated validation and error detection.
Structuring of data is on an ad hoc basis
In a database, well defined objects mean that the knowledge encoded in the database has a predictable structure. For example, given an object of the type Paper, we can be pretty sure that there will be authors, a journal or book, dates, pages etc. Links to other objects exist, or not, at the whim of the author. Any web page can link to any other web page, rather than to other web pages of a specific type.
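The value of typed fields is easy to see in code. The query mentioned above, finding all DNA_SAMPLEs whose concentration is 0.5 µg/µl or greater, becomes a one-line comparison once concentration is known to be a number. In this sketch only SK99.pFF100 comes from the lecture; the other two samples are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DnaSample:
    name: str
    concentration: float  # µg/µl, always a number


samples = [
    DnaSample("SK99.pFF100", 0.5),
    DnaSample("sample_A", 0.2),  # hypothetical
    DnaSample("sample_B", 1.3),  # hypothetical
]


def at_least(records: List[DnaSample], minimum: float) -> List[str]:
    """Return the names of samples at or above the given concentration."""
    return [s.name for s in records if s.concentration >= minimum]


print(at_least(samples, 0.5))
```

If the same values were buried in free text on a web page, a program would first have to locate and parse them, with no guarantee that every page wrote them the same way.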

For comparison with the Tree of Life, the NCBI operates a taxonomy server through a relational-database engine, as part of the NCBI database [http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html].

DEMO: Descend the NCBI taxonomy as follows:

root

Eukaryota

Animals (Metazoa)

Bilateria

Deuterostomia

Although it may seem like a subtle difference, the Tree of Life IS a collection of web pages, whereas the web pages visited at NCBI are generated on the fly from the NCBI database. The web pages seen at NCBI are therefore a view of the data.
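The difference between stored pages and generated views can be sketched in a few lines. Here the taxonomy is held once, as data, and two different views are computed from it on request. This is only an illustration of the idea and says nothing about how the NCBI server is actually built; the function names are assumptions, and the data are the lineage from the demo above.

```python
from typing import Dict, List

# The data: each taxon and its children (a tiny slice of the demo lineage).
CHILDREN: Dict[str, List[str]] = {
    "root": ["Eukaryota"],
    "Eukaryota": ["Metazoa"],
    "Metazoa": ["Bilateria"],
    "Bilateria": ["Deuterostomia"],
    "Deuterostomia": [],
}


def render_page(taxon: str) -> str:
    """View 1: a 'page' for one taxon, built when it is requested."""
    lines = [f"Taxon: {taxon}"]
    lines += [f"  child: {child}" for child in CHILDREN[taxon]]
    return "\n".join(lines)


def lineage(taxon: str) -> List[str]:
    """View 2: the path from the root down to a taxon."""
    parent = {c: p for p, kids in CHILDREN.items() for c in kids}
    path = [taxon]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path[::-1]


print(render_page("Metazoa"))
print(" > ".join(lineage("Deuterostomia")))
```

Neither view is stored anywhere. If a taxon is added to the data, every page and lineage that involves it changes automatically the next time it is requested.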

It should be pointed out that each approach has advantages and disadvantages. The NCBI web site is formal and structured, but primarily serves to encode a taxonomic structure, with links to database items such as sequences or literature references. The Tree of Life is rich, with images, articles, and other information, limited only by the creativity of the contributors.

Unless otherwise cited or referenced, all content on this page is licensed under the Creative Commons License Attribution Share-Alike 2.5 Canada
