Gene Expression

1) Gene array technology: measures mRNA levels for thousands of genes in an mRNA population isolated from cells.

gene x - strongly expressed; high abundance transcript

gene y - moderately expressed; medium abundance transcript

gene z - weakly expressed; low abundance transcript

Each transcript base pairs with the complementary DNA for its corresponding gene on the array.

The signal strength at each position on the array is proportional to the abundance of the corresponding mRNA.


2) Two general classes of data:

A whole family of problems arises in normalizing the data and in controlling for the components of experimental variation.

3) Gene array technology can generate data for thousands of genes in a single experiment.


Key questions:
  • Which genes are differentially expressed between condition A and condition B?
  • How can genes be grouped according to similarities in expression patterns?

a) Cluster analysis

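The variation in G over N conditions:

$$\Phi_G = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(G_i - G_{\mathrm{offset}}\right)^2}$$

where $G_i$ is the (log-transformed) expression measurement for gene G in condition i.

In experiments in which expression is normalized to the mean expression for gene G, $G_{\mathrm{offset}}$ is the mean of G, which makes $\Phi_G$ the standard deviation of G. However, in most gene array experiments, $G_{\mathrm{offset}}$ is set to 0, the log of a fluorescence ratio of 1, that is, the ratio that would be seen if expression did not change from one condition to the next. The similarity of the expression patterns of any two genes X and Y can be expressed as a correlation coefficient:

$$S(X,Y) = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{X_i - X_{\mathrm{offset}}}{\Phi_X}\right)\left(\frac{Y_i - Y_{\mathrm{offset}}}{\Phi_Y}\right)$$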
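As a minimal sketch of the similarity measure described above, the following Python computes the variation of a gene as the root-mean-square deviation from its offset, and the correlation coefficient as the mean product of the offset-corrected, variation-scaled values of two genes. This assumes the formulation of Eisen et al. (1998). The function names (`phi`, `similarity`) and the expression values are hypothetical, chosen only for illustration.

```python
import math

def phi(values, offset=0.0):
    """Root-mean-square deviation of the values from the chosen offset."""
    return math.sqrt(sum((v - offset) ** 2 for v in values) / len(values))

def similarity(x, y, x_offset=0.0, y_offset=0.0):
    """Correlation-style similarity between two expression profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must cover the same conditions")
    phi_x, phi_y = phi(x, x_offset), phi(y, y_offset)
    if phi_x == 0 or phi_y == 0:
        raise ValueError("a profile with no variation has no defined similarity")
    total = sum(((xi - x_offset) / phi_x) * ((yi - y_offset) / phi_y)
                for xi, yi in zip(x, y))
    return total / len(x)

# Hypothetical log fluorescence ratios over four conditions (offset = 0).
gene_x = [1.0, 2.0, -1.0, -2.0]
gene_y = [0.5, 1.0, -0.5, -1.0]   # same shape as gene_x, half the amplitude
gene_z = [-1.0, -2.0, 1.0, 2.0]   # mirror image of gene_x

print(similarity(gene_x, gene_y))  # close to +1: same pattern
print(similarity(gene_x, gene_z))  # close to -1: opposite pattern
```

Because each profile is scaled by its own variation, two genes with the same pattern but different amplitudes score as highly similar. Passing the mean of each profile as its offset gives the ordinary Pearson correlation.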


Eisen et al. examined 8600 human genes in cells grown in the presence or absence of serum. Genes whose expression changed by a factor of 3.0 or more in at least 2 timepoints were subjected to cluster analysis.
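The selection step described above can be sketched as a simple filter. This is an illustration under stated assumptions, not the authors' code. It assumes that expression is given as raw ratios (not logs) and that a "change by a factor of 3.0" counts in either direction (ratio of at least 3 or at most 1/3). The gene names and values are hypothetical.

```python
def changed_enough(ratios, min_fold=3.0, min_timepoints=2):
    """True if the ratio is >= min_fold or <= 1/min_fold at enough timepoints."""
    hits = sum(1 for r in ratios if r >= min_fold or r <= 1.0 / min_fold)
    return hits >= min_timepoints

# Hypothetical expression ratios (treated / reference) at five timepoints.
profiles = {
    "gene_a": [1.1, 3.5, 4.0, 2.0, 1.0],    # up at least 3-fold at two timepoints
    "gene_b": [0.9, 0.30, 0.25, 0.8, 1.0],  # down at least 3-fold at two timepoints
    "gene_c": [1.0, 3.2, 1.5, 1.1, 0.9],    # only one timepoint passes
    "gene_d": [1.0, 1.2, 0.9, 1.1, 1.0],    # essentially unchanged
}

# Only genes passing the filter would go on to cluster analysis.
selected = [name for name, ratios in profiles.items() if changed_enough(ratios)]
print(selected)  # ['gene_a', 'gene_b']
```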

Green - strong down-regulation at a given timepoint.

Red - strong up-regulation at a given timepoint.

Black - little or no difference between serum-treated and serum-starved cells.

Gene clusters: A - cholesterol biosynthesis; B - cell cycle; C - immediate-early response; D - signaling and angiogenesis; E - wound healing and tissue remodeling

Michael B. Eisen, Paul T. Spellman, Patrick O. Brown, and David Botstein (1998)
Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA Vol. 95, Issue 25, 14863-14868.


b) Self-organizing maps (SOM)

Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR (1999) Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci U S A

Törönen P, Kolehmainen M, Wong G, Castrén E (1999) Analysis of gene expression data using self-organizing maps. FEBS Letters 451(2):142-146.

Tamayo and coworkers applied self-organizing maps (SOMs) to the grouping of gene expression data. In Figure 1, they illustrate very simple X,Y data, with the raw datapoints shown as black dots. Such a dataset might represent, for example, a measurement on a wild-type gene on the X-axis and a measurement on a mutant gene on the Y-axis. By the same logic, a timecourse with n timepoints would be represented in n dimensions. By eye, the datapoints appear to fall into distinct groups.

The goal is to find a set of X,Y points that most closely approximate the mean values of the groups of datapoints. An SOM begins by arbitrarily creating a set of nodes (N) with randomly assigned values. In the example, a set of six nodes (1-6) is randomly placed in the X,Y space.

For each iteration of the algorithm, a datapoint P is chosen, and the position of each node is changed to move it closer to P. The closer a node N is to point P, the greater the distance it is moved towards P. This process is continued for thousands of iterations, until the total change is lower than some threshold.

The net result is that all nodes will be moved many times, but each node will "come to rest" in the vicinity of the set of datapoints to which it is closest.
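The procedure above can be sketched in a few lines of Python. This is a minimal illustration of a standard SOM update, not the implementation used by Tamayo et al. It assumes a 2 x 3 grid of six nodes, a Gaussian neighborhood measured on the grid around the node nearest to P, and linearly decaying learning rate and neighborhood radius. It runs for a fixed number of iterations rather than testing a convergence threshold. All names, parameter values, and the synthetic data are hypothetical.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def train_som(points, rows=2, cols=3, iterations=5000, seed=0):
    """Fit a rows x cols SOM to 2-D points; returns {grid cell: [x, y]}."""
    rng = random.Random(seed)
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    # Start every node at a random position inside the range of the data.
    nodes = {(r, c): [rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys))]
             for r in range(rows) for c in range(cols)}
    for t in range(iterations):
        frac = t / iterations
        rate = 0.5 * (1 - frac) + 0.01 * frac   # learning rate decays 0.5 -> 0.01
        sigma = 1.5 * (1 - frac) + 0.3 * frac   # neighborhood radius shrinks
        p = rng.choice(points)                  # datapoint P for this iteration
        # The winner is the node currently closest to P in data space.
        winner = min(nodes, key=lambda g: dist(nodes[g], p))
        for g, node in nodes.items():
            grid_d2 = (g[0] - winner[0]) ** 2 + (g[1] - winner[1]) ** 2
            pull = rate * math.exp(-grid_d2 / (2 * sigma ** 2))
            # Move the node a fraction of the way towards P; the winner moves most.
            node[0] += pull * (p[0] - node[0])
            node[1] += pull * (p[1] - node[1])
    return nodes

def mean_error(points, nodes):
    """Average distance from each datapoint to its nearest node."""
    return sum(min(dist(n, p) for n in nodes.values()) for p in points) / len(points)

def make_clusters(seed=1):
    """Three synthetic groups of 50 datapoints each."""
    rng = random.Random(seed)
    centers = [(2.0, 2.0), (8.0, 2.0), (5.0, 8.0)]
    return [(rng.gauss(cx, 0.4), rng.gauss(cy, 0.4))
            for cx, cy in centers for _ in range(50)]

points = make_clusters()
before = train_som(points, iterations=0)   # the random starting nodes
after = train_som(points)                  # the same nodes after training
print(mean_error(points, before), mean_error(points, after))
```

Comparing the mean distance from each datapoint to its nearest node before and after training gives a rough check that the nodes have "come to rest" near the groups of datapoints.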

For example, Tamayo et al. studied 6000 human genes in the myeloid leukemia cell line HL-60 in response to the phorbol ester PMA, which stimulates macrophage differentiation. 567 genes were shown to change significantly upon addition of PMA. Expression data were modeled onto a 3 x 4 array in which each node had a randomly generated timecourse curve. Each iteration consisted of selecting an actual timecourse curve for a human gene and modifying all 12 randomized curves to fit that timecourse. The curves most closely matching the data were modified to strongly resemble the data. Curves that were less closely related to the data to begin with underwent less modification. The 12 resultant curves are shown below:


The authors point out that, "An SOM based on a rectangular grid is analogous to an entomologist's specimen drawer, with adjacent compartments holding similar insects."