PLNT4610/PLNT7690 Bioinformatics
Lecture 1, part 1 of 2

INFORMATION-DRIVEN SCIENCE

Before next class:

Running a Linux session on CCL

A. Why Bioinformatics?

1. What is Bioinformatics?
2. Much of the work of Bioinformatics is organizing raw data to create knowledge.
3. Knowledge is built by constructing relations between different kinds of data.
4. Machine Learning (ML) attempts to find mathematical functions that best classify subjects into two or more groups.
5. Public databases organize knowledge into models of things in the real world.
6. The public databases are a laboratory for investigating scientific questions.

B. Linux fundamentals

1. Understand how thin clients make it possible to use remote computer systems from anywhere.
2. Understand how file sharing makes it possible for any user on the system to use files and programs transparently, from any workstation on the system.
3. Know how to use a small core of Unix commands.
4. Understand the concept of a home directory.
5. Know how to organize your files by topic in a hierarchical directory tree.
6. Know the distinction between ASCII textfiles and binary files and why scientific data is typically read from textfiles.

A. Why Bioinformatics?

The purpose of this course is not just to teach you how to use a bag of computerized magic tricks, although you will gain many practical skills. We will present the theory behind the methods, which will hopefully make it possible to distinguish between what we do know, what we think we know, and what we don't know. However, most importantly, the purpose of bioinformatics is to provide an organized and rigorous framework in which to do biology, and to make possible experimental strategies that would not otherwise be possible.

What I want you to get out of this course, then, is not just a set of skills, but rather a mindset. I want you to understand that the computer is the ultimate general purpose tool, and I want you to have the ability to use its capabilities in a creative way to attack biological problems.

1. What is Bioinformatics?

The terms "bioinformatics" and "computational biology" mean a lot of different things to a lot of people. There are no universally accepted definitions for these terms. This may seem odd, until you try to come up with definitions for terms such as "gene" or "life". A very strict definition could actually be counterproductive, missing many things that should be included.

Bioinformatics can be thought of as a branch of Data Science, focusing on biological problems.

The term "Data Science" encompasses all of the steps that we might do in the process of discovery, starting with acquisition of raw data, refining the dataset, and analysis to learn things from our data.

The series of steps to arrive at a goal is referred to as a workflow.

Workflows can be implemented using a series of programs, each of which performs a single specific task. The chain of programs implementing a workflow is referred to as a data pipeline.
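As a minimal sketch of this idea, the Python fragment below chains two small functions, each of which performs a single task. The function names (clean, gc_fraction, run_pipeline) and the toy sequences are made up for illustration; a real pipeline would usually chain separate programs that read and write files.

```python
def clean(records):
    """Refine the dataset: trim whitespace, upper-case, drop empty records."""
    return [r.strip().upper() for r in records if r.strip()]


def gc_fraction(records):
    """Analysis step: fraction of G and C characters in each sequence."""
    return {r: (r.count("G") + r.count("C")) / len(r) for r in records}


def run_pipeline(raw, steps):
    """Feed the output of each step into the next one, in order."""
    data = raw
    for step in steps:
        data = step(data)
    return data


# Toy "raw data" standing in for a file or a database query.
raw_data = [" ATGC ", "atgn", "", "GGCC"]
result = run_pipeline(raw_data, [clean, gc_fraction])
print(result)
```

Because each step only needs to accept the output of the previous one, a step can be replaced or a new one added without touching the rest of the chain.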

One way of getting a better sense of the domain of a difficult-to-define concept is to create a tag cloud. A tag cloud is a visual representation of a population of words, in which the most frequently used words are shown in font sizes directly proportional to their frequency in the source. For example, a tag cloud of words from the abstracts in a 2008 issue of BMC Bioinformatics looked like this:

from Saunders N, What if journal current contents were tag clouds? http://nsaunders.wordpress.com/2008/08/23/what-if-journal-current-contents-were-tag-clouds/
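The calculation behind a tag cloud is simple enough to sketch in a few lines of Python. This is only an illustration of the principle, not the method used to make the figure: the function name tag_sizes, the maximum font size of 48 points, and the sample text are all assumptions, and a real tag cloud would also filter out common words such as "the" and "of".

```python
import re
from collections import Counter


def tag_sizes(text, max_pt=48, top=5):
    """Map the most frequent words to font sizes proportional to their counts."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words).most_common(top)
    if not counts:
        return {}
    largest = counts[0][1]  # count of the most frequent word
    return {word: round(max_pt * n / largest) for word, n in counts}


sample = "protein sequence protein gene sequence protein data"
sizes = tag_sizes(sample)
print(sizes)
```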

Question: What conclusions can we draw about this issue of BMC Bioinformatics? What broader generalizations can we make about the field of bioinformatics?

One thing we take away from tag clouds is that visualization can lead to important insights into information of any type. This principle led to the development of Sequence Logos by Tom Schneider and colleagues at the NCI. Sequence logos use different colors and font sizes to represent the frequency of nucleotides in a DNA sequence, or amino acids in a protein. For example, the logo in the figure at right shows the information content in bits, calculated from the raw frequencies of nucleotides in sites bound by the Lambda phage cI and cro proteins. At position 0, all four nucleotides are found in near equal frequencies. Since position 0 appears to have a random distribution of nucleotides, its information content is very close to 0 bits. At position -7, all binding sites had an A, indicating an extreme deviation from randomness. Therefore, position -7 is said to have a very high information content.

(Sequence logos appear to predate the tag cloud concept.)

figure from http://schneider.ncifcrf.gov/gallery/hawaii.fig1.gif

Ref: Schneider TD, Stephens RM (1990) Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 18: 6097–6100.
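To make the idea of information content concrete, here is a minimal Python sketch. For DNA, the information at a position can be taken as 2 bits (the maximum for four equally likely nucleotides) minus the Shannon entropy of the observed nucleotide frequencies. The aligned sites below are invented, not the Lambda cI/cro data, and the small-sample correction used in published logos is left out.

```python
import math
from collections import Counter


def information_content(sites):
    """Bits of information at each column of equal-length aligned DNA sites."""
    bits = []
    for column in zip(*sites):
        total = len(column)
        counts = Counter(column)
        # Shannon entropy of the nucleotide frequencies in this column.
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        bits.append(2.0 - entropy)
    return bits


# Hypothetical sites: column 1 is always A, column 2 is evenly mixed.
sites = ["AA", "AC", "AG", "AT"]
print(information_content(sites))
```

A column that is always the same nucleotide scores the full 2 bits, like position -7 in the logo, while a column with all four nucleotides at equal frequency scores 0 bits, like position 0.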

2. Much of the work of Bioinformatics is organizing raw data to create knowledge.

The human brain is superb at pattern recognition. We have an innate tendency to impose order on raw or incomplete information.

"In the Kanizsa illusion there appears to be a white triangle lying on top of a black-outlined one. But if you look closely, you'll see that there are no triangles in the figure. Our perceptual system completes or 'fills in' information that isn't there."

Levitin, DJ (2006) This is Your Brain on Music. Penguin Group, Canada.

Image from http://en.wikipedia.org/wiki/Illusory_contours

In bioinformatics, we seldom get perfect, complete data sets. The challenge, often, is to use the data we do have to learn about the biological system from which it was taken. For example, in phylogenetic analysis, we usually only have sequences from modern-day species to work with. Nonetheless, these data often give us enough information to reconstruct the evolutionary history of a gene family.

Part of the goal of bioinformatics, therefore, is to discover the relationships between different pieces of information and assemble that information into a structure, or model, that explains the data. Put another way, bioinformatics creates knowledge from data.

Or, another representation:

One way to organize biological concepts is through ontologies. An ontology organizes concepts in a hierarchical structure. For example, the AmiGO Gene Ontology database classifies genes involved in plant incompatible (resistant) defense responses to fungi as part of a hierarchy whose root is "biological process".

Link:
http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0009817&session_id=8221amigo1375399571

Ontologies express relationships between biological concepts in a way that is amenable to manipulation by computer programs. They codify ideas into a structure that matches as closely as possible the processes that occur in nature.
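The sketch below shows, in Python, one simple way a program could hold and traverse such a hierarchy. The term names and the single chain of parents are simplified for illustration and are not copied from the Gene Ontology itself, where a term can have more than one parent.

```python
# Hypothetical "is a kind of" links, child -> parent.
IS_A = {
    "incompatible defense response to fungus": "defense response to fungus",
    "defense response to fungus": "defense response",
    "defense response": "response to stimulus",
    "response to stimulus": "biological process",
}


def path_to_root(term, is_a=IS_A):
    """Follow parent links from a term until a term with no parent is reached."""
    path = [term]
    while path[-1] in is_a:
        path.append(is_a[path[-1]])
    return path


for level, name in enumerate(path_to_root("incompatible defense response to fungus")):
    print("  " * level + name)
```

Because the relationships are stored as data, a program can answer questions such as "is this term a kind of defense response?" simply by checking whether that term appears on the path to the root.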

Ontologies, while they express relationships between different biological concepts, do not carry the data or other attributes associated with these processes. In computer science, almost anything can be described as an object. The object concept can be stated as follows:

A class is a formula for creating an object.
Everything is an object.
Objects have Data and Methods.
You can make new classes by reusing and extending an existing class.

Let's look at an example. Below we define a class called Bird. Its data include a list of things that all birds have. As well, methods include a list of things that a bird would be able to do. These are things that should be true of all birds.

However, there is no such thing as an object of the type Bird, i.e., a generic bird. We must reuse and extend the Bird class to create classes that can be instantiated.

The Duck and Penguin classes extend the Bird class by adding different sets of Data and Methods, specific to either Duck or Penguin. (Note that "Bill" appears in both classes. If Bill were a characteristic of all possible birds, it would have been better to have added it to the Bird class. But birds like cardinals and chickadees have beaks, so to accommodate them, we don't add Bill to the Bird class.)
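For those who would like to see how this looks in a programming language, here is a minimal sketch in Python. The particular data and methods (feathers, lay_egg, move, and the descriptions of each bill) are illustrative guesses, not a transcription of the class diagram.

```python
from abc import ABC, abstractmethod


class Bird(ABC):
    """Data and methods assumed true of all birds; cannot be instantiated."""

    def __init__(self, name):
        self.name = name
        self.feathers = True  # data shared by every bird (illustrative)

    def lay_egg(self):  # method shared by every bird (illustrative)
        return f"{self.name} lays an egg"

    @abstractmethod
    def move(self):
        """Each kind of bird must say how it gets around."""


class Duck(Bird):
    def __init__(self, name):
        super().__init__(name)
        self.bill = "broad and flat"  # Bill is added here, not in Bird

    def move(self):
        return f"{self.name} flies"


class Penguin(Bird):
    def __init__(self, name):
        super().__init__(name)
        self.bill = "narrow and pointed"

    def move(self):
        return f"{self.name} swims"


for bird in (Duck("Daisy"), Penguin("Pingu")):
    print(bird.lay_egg(), "and", bird.move())
```

Trying to create a generic bird with Bird("generic") raises a TypeError, which mirrors the point that only the extended classes can be instantiated.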

Biological databases, while not necessarily object-oriented, present their records in what could be considered to be objects. For an example, let's go to the Protein Data Bank (PDB) at http://www.pdb.org, and have a look at entry 4K1V, for delta 5-3-ketosteroid isomerase [http://www.pdb.org/pdb/explore/explore.do?structureId=4k1v].

This record follows the formula for protein objects at this site, having data for Primary Citation, Molecular Description, Source, Related PDB entries, and so forth. (In practice, each of these parts of the object is probably implemented in the database as a smaller object.) Some of these objects have methods. Under Biological assembly, there is a link saying "View in 3D", which opens a Java 3D viewer in which we can manipulate and zoom the structure, and control which features are shown. Another example of a method is found under Experimental Details. Clicking on the link "Structure Factors" will download experimental data related to structure.

3. Knowledge is built by constructing relations between different kinds of data.

We can think of data as the raw material and knowledge as the value-added product. Relationships between different types of information, such as the Gene and Allele models described above, are themselves information. For example, the UniProt database [http://www.uniprot.org/] organizes genes into gene families. In addition, each gene and gene family contains automatically generated links to related information in other databases.

EXAMPLE: Pathogenesis-related protein PR10 in plants
The PR10 family of proteins is found widely in dicotyledonous plants. This gene is often activated by pathogens and, in some species, by the hormone abscisic acid. It is constitutively expressed in the roots of some plants, and has also been isolated as a pollen allergen in several tree species. Because of the range of biological contexts in which this protein has been found, it has been called by a variety of names, such as PR10, Betvi, PR1, SAM, RH2, etc. In peas, it has been shown to have ribonuclease activity. Workers in one area might clone the gene (e.g., from birch pollen) and not know that it was in the same protein family as pathogenesis-related proteins from other plants. The UniProt database entry for this gene is P13239. UniProt classifies PR10 homologues from all plant species in which this protein has been found as the BetVI pollen allergen family.

The entry for the pea gene shows that automatically generated hypertext links to other databases provide the user with a complete picture of this gene family. For example, the link to the Pfam database provides information on protein domains and structure for this gene family. The knowledge encoded in these relations embodies information from a large number of research projects from many labs.

Take home lessons:
The examples of knowledge that we have looked at here may seem to be nothing more than common sense, and even trivial. Their significance becomes apparent when you realize that all of the information comes from raw pieces of data, such as raw sequence, similarity comparisons between sequences, protein structural analysis, and endless details written in research articles. The work of bioinformatics is not so much the acquisition of raw data as the art of organizing and analyzing that data.

4. Machine Learning (ML) attempts to find mathematical functions that best classify subjects into two or more groups.

Cannataro M et al. (2022) Artificial Intelligence in Bioinformatics: From Omics Analysis to Deep Learning and Network Mining. Elsevier Inc. ISBN: 978-0-12-822952-1

Figure: XY scatterplot showing three possible lines through a scatter of datapoints

from Cost Function in Machine Learning (https://www.javatpoint.com/cost-function-in-machine-learning)

Human beings build models of things in their environment by generalizing and simplifying. By glancing at a scatter plot like the one above, a person sees right away that the green and red dots cluster into two groups in the lower left and upper right hand corners. A machine learning algorithm might test thousands of linear functions by trial and error, and select the function that best separates green from red.

The human understands the visual concept of grouping. The Intelligent Agent (function) resulting from ML has no understanding. It is simply a mathematical function, in this case, a slope (m) and a constant (b) in the equation of a straight line, y = f(x) = mx + b.
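The trial-and-error search described above can be sketched in a few lines of Python. This is a deliberately naive illustration, not how real ML libraries fit a model: the data points, the ranges from which m and b are drawn, and the choice of which color sits in which corner are all assumptions made for the example.

```python
import random


def accuracy(m, b, points):
    """Fraction of points on the correct side of the line y = m*x + b."""
    correct = 0
    for x, y, label in points:
        predicted = "red" if y > m * x + b else "green"
        correct += predicted == label
    return correct / len(points)


def fit_by_trial(points, trials=2000, seed=0):
    """Try random lines and keep the one that classifies the most points."""
    rng = random.Random(seed)
    best, best_acc = (0.0, 0.0), accuracy(0.0, 0.0, points)
    for _ in range(trials):
        m, b = rng.uniform(-5, 5), rng.uniform(-10, 10)
        acc = accuracy(m, b, points)
        if acc > best_acc:
            best, best_acc = (m, b), acc
    return best, best_acc


# Hypothetical data: green in the lower left, red in the upper right.
points = [(1, 1, "green"), (2, 1, "green"), (1, 2, "green"),
          (8, 9, "red"), (9, 8, "red"), (9, 9, "red")]
(m, b), acc = fit_by_trial(points)
print(f"y = {m:.2f}x + {b:.2f} classifies {acc:.0%} of the points")
```

Notice that the result is nothing more than two numbers, m and b. Nothing in the program "knows" that the points form two clusters.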

While real-world ML creates highly complex and sophisticated Intelligent Agents such as neural networks, in essence, even the most sophisticated of these are nothing more than black boxes resulting from the best fit of the IA to the input dataset.

In Bioinformatics, we can train Intelligent Agents using Machine Learning, and derive an IA that is a robust classifier for the dataset on which it was trained. For example, given a set of gene expression data for tumor cells and a comparable set of data for normal cells, ML can arrive at a neural network that will correctly identify tumor or normal cells most of the time. However, the IA is only as good as the dataset on which it was trained. Depending on biases inherent in the dataset, the IA may fail to correctly classify new data. Because we don't know how the IA works, the IA provides no insights into the underlying biochemical and physiological differences between tumor and normal cells. Put another way, Machine "Learning" is really a misleading term, because there is no understanding or modeling in the sense that humans reason and understand. In the next section, we will discuss how databases impose a formal structure onto data, based on human concepts.

5. Public databases organize knowledge into models of things in the real world.

One of the goals of bioinformatics is to create models, data objects that represent concepts from the real world. The more we make the model like the real-world concept, the easier the model is to work with, and the more useful it is. Here is a view of chromosome III from the nematode worm, Caenorhabditis elegans, displayed in a web-based genome viewer.

from http://www.wormbase.org/tools/genome/gbrowse/c_elegans_PRJNA13758/

Questions:

1. In what ways does this model accurately represent the chromosome? (In other words, what are the data and methods of this object?)
2. What sorts of information appear to be missing from this model?
3. In what ways is this model an oversimplification of the chromosome?

6. The public databases are a laboratory for investigating scientific questions.

Beyond the initial project, data is still a valuable resource, if it is accessible by software. Results from numerous research projects, each of which might be of minimal significance on its own, can often be put together to make generalizations or observations that could be quite significant.

EXAMPLE: Transposable elements in plants may carry regulatory sequences from gene to gene.

Bureau, TE and Wessler, SR (1994) Stowaway: A new family of inverted repeat elements associated with the genes of both monocotyledonous and dicotyledonous plants. Plant Cell 6:907-916.

Bureau and Wessler found a transposon called "Tourist" in the 5' non-coding region of the PEP carboxylase CP21 gene of sorghum. This Tourist element appears to contain another transposon, which they called "Stowaway".

Figure 1. The disrupted Tourist-Sb1 in the 5' flanking region of the sorghum phosphoenolpyruvate carboxylase CP21 gene (transcription start site, bent arrow). The 5' coding sequence has been expanded to show the position of Stowaway-Sb1. Triangles indicate terminal inverted repeats.

Stowaway had terminal inverted repeats characteristic of transposons. To test the hypothesis that Stowaway is a transposon, they searched for Stowaway in other plant sequences in GenBank.

Search GenBank for sequences with >60% DNA sequence identity to Stowaway.
Select sequences with terminal inverted repeats and target site duplications.
Repeat searches with presumptive Stowaway elements.
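The first two steps can be illustrated with a small Python sketch. This is not the procedure used by Bureau and Wessler, who searched GenBank with database search software. The toy sequences, the 4 bp repeat length, the 2 bp duplication length, and all function names are assumptions for illustration, and the sequences are treated as already aligned and of equal length. The final step, repeating the search with each new element as the query, is not shown.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def percent_identity(a, b):
    """Percent identical positions between two aligned, equal-length sequences."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def has_terminal_inverted_repeats(seq, length=4):
    """True if the last bases are the reverse complement of the first bases."""
    return seq[-length:] == reverse_complement(seq[:length])


def has_target_site_duplication(left_flank, right_flank, length=2):
    """True if the bases just before and just after the element are the same."""
    return left_flank[-length:] == right_flank[:length]


def select_candidates(query, hits, min_identity=60.0):
    """Keep hits that are similar enough and have terminal inverted repeats."""
    return [h for h in hits
            if percent_identity(query, h) > min_identity
            and has_terminal_inverted_repeats(h)]


query = "CTCCAAAATTTTGGAG"      # hypothetical element
hits = ["CTCCAAAATTTTGGAG",     # identical to the query
        "CTCCAATATATTGGAG",     # diverged, inverted repeats intact
        "CTCCAAAATTTTGCAG",     # similar, but one inverted repeat is damaged
        "GGGGGGGGGGGGGGGG"]     # unrelated sequence
print(select_candidates(query, hits))
```

Only the first two hits pass both tests. The third is similar to the query but has lost its inverted repeat, and the fourth falls below the identity threshold.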

Many of the presumptive Stowaway elements were present in 5' or 3' non-coding regions. Polymorphisms for Stowaway insertions in different copies of genes are evidence that transposition has occurred after gene duplication.

Stowaway-St5 is located within intron 5 of the potato patatin pseudogene (STPATP1) but not in the corresponding position of three other members of the patatin gene family.

STPATP1  TTTCTTAATATA>===St5====<TATAATAGAAAA

STPATP2  TTTCTTAATATA--------------TGAAAGGAAA

POTPATA  TTTCTTAATATA--------------TGGTAGAAAA

STPATG   TTTCTTAATATA--------------TGATAGGAAA

Stowaway-Ps2 is located in the 5' flanking region of the pea CAB80 gene but not in CAB66.

PEACAB80 TATAATTAACTA>===Ps2====<TATATACTAGTT

PEACAB66 TATAATTAACCA--------------TATACTAGTA

Stowaway-Le2 is located in the 5' flanking region of the tomato rbcS1 gene (LERBSS1) but not in the corresponding position of the potato rbcS3 gene (STRBCS3). An asterisk indicates a short variable region, 7bp in LERBSS1 and 16bp in STRBCS3.

LERBSS1  TCTTGTCTATTA>===Le2====<TAAAATAT*AAA

STRBCS3  TCTTGTCTATTA--------------AAATAT*AAA

>,< - terminal inverted repeats; red - target sites presumed duplicated during transposition.

The Stowaway-Le2 element is particularly interesting because about 50% of the sites protected in DNase footprinting assays [Manzara et al., (1993) Plant Mol. Biol. 21:69-88] are found within this Le2 element of the LERBSS1 gene. In other words, Stowaway contains sites bound by light-specific DNA binding proteins.

Stowaways also make up most of the 3' non-translated region and polyA site of mRNAs encoding stress-inducible thaumatin-like proteins Hv-1a, Hv-1b, Hv-1c and Hv2.

Hypothesis: Transposons could be a powerful way for regulatory elements to move from gene to gene. Mixing and matching cis-regulatory elements could make it easier for evolution to try out new regulatory strategies.

Do these data really support that hypothesis? It's a difficult question to answer:

Would it prove anything if we saw more Stowaways or other transposons in genes, rather than intergenic regions?
Most of GenBank is genes, not intergenic regions. (True at the time the paper was written, but less true today).
Even if transposons are uniformly distributed in the genome, it might not falsify the hypothesis.
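The database-composition problem can be made concrete with a little arithmetic, sketched here in Python. All of the numbers are invented for illustration and are not taken from the paper or from GenBank. The point is that the fraction of elements found in genes only means something when compared with the fraction of the database that consists of genes.

```python
def enrichment(hits_in_genes, total_hits, genic_bases, total_bases):
    """Observed fraction of hits in genes divided by the fraction expected
    if elements were spread uniformly over the sequences in the database."""
    observed = hits_in_genes / total_hits
    expected = genic_bases / total_bases
    return observed / expected


# Hypothetical database that is 90% genic: 45 of 50 hits in genes is no surprise.
print(enrichment(45, 50, 900, 1000))

# The same 45 of 50 hits in a hypothetical database that is only 30% genic.
print(enrichment(45, 50, 300, 1000))
```

A ratio near 1 means that the elements turn up in genes about as often as chance would predict, no matter how large the raw percentage looks.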

Perhaps a better question to ask is, "How often do we see cis-acting elements within transposons?" This is a more difficult question to answer, because identifying promoter elements in sequences is still not very reliable.

Unless otherwise cited or referenced, all content on this page is licensed under the Creative Commons License Attribution Share-Alike 2.5 Canada

September 5 & 10, 2024
