mRNA population isolated from cells
gene x - strongly expressed; high abundance transcript
gene y - moderately expressed; medium abundance transcript
gene z - weakly expressed; low abundance transcript
Each transcript base-pairs with the complementary DNA for its corresponding gene on the array.
Signal strength is proportional to the abundance of each mRNA.
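The variation in G over N conditions (following Eisen et al., 1998):
\[ \Phi_G = \sqrt{\sum_{i=1}^{N} \frac{(G_i - G_{\mathrm{offset}})^2}{N}} \]
In experiments in which the expression is normalized to the mean expression
for gene G, \(G_{\mathrm{offset}}\) is the mean of G, which makes \(\Phi_G\)
the standard deviation for G. However, in most gene array experiments,
\(G_{\mathrm{offset}}\) is set to 0, the log of a fluorescence ratio
of 1, meaning the ratio that would be seen if no change were observed from
one condition to the next. The similarity of gene expression patterns for
any two genes X and Y can be expressed as a correlation coefficient:
\[ S(X,Y) = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{X_i - X_{\mathrm{offset}}}{\Phi_X} \right) \left( \frac{Y_i - Y_{\mathrm{offset}}}{\Phi_Y} \right) \]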
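The similarity measure can be sketched in a few lines of Python. This is a minimal illustration, assuming the Eisen et al. form of the score: each profile is shifted by its offset, scaled by its own root-mean-square deviation around that offset, and the products are averaged over the N conditions. The function names and the three example profiles are hypothetical, and the offsets default to 0 as in most gene array experiments.

```python
import math

def phi(values, offset=0.0):
    """Root-mean-square deviation of a profile around the chosen offset."""
    n = len(values)
    return math.sqrt(sum((v - offset) ** 2 for v in values) / n)

def similarity(x, y, x_offset=0.0, y_offset=0.0):
    """Similarity score for two expression profiles measured over the same conditions."""
    if len(x) != len(y):
        raise ValueError("profiles must cover the same number of conditions")
    phi_x = phi(x, x_offset)
    phi_y = phi(y, y_offset)
    if phi_x == 0 or phi_y == 0:
        raise ValueError("a profile with no variation has no defined similarity")
    n = len(x)
    # Average the product of the offset-corrected, scaled values.
    return sum(
        ((xi - x_offset) / phi_x) * ((yi - y_offset) / phi_y)
        for xi, yi in zip(x, y)
    ) / n

# Hypothetical log fluorescence ratios for three genes over four conditions.
gene_x = [1.0, 2.0, -1.0, -2.0]
gene_y = [0.5, 1.0, -0.5, -1.0]   # same shape as gene_x, half the amplitude
gene_z = [-1.0, -2.0, 1.0, 2.0]   # mirror image of gene_x

print(similarity(gene_x, gene_y))
print(similarity(gene_x, gene_z))
```

Because each profile is scaled by its own deviation, gene_x and gene_y score about 1 despite their different amplitudes, while gene_x and its mirror image gene_z score about -1. Passing each profile's mean as its offset would turn the score into the ordinary Pearson correlation.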
GREEN - strong down-regulation at a given timepoint.
RED - strong up-regulation at a given timepoint.
BLACK - little or no difference between serum-treated and serum-starved cells.
Gene clusters: A - cholesterol biosynthesis; B - cell cycle; C - immediate-early response; D - signaling and angiogenesis; E - wound healing and tissue remodeling.
Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95(25):14863-14868.
Törönen P, Kolehmainen M, Wong G, Castrén E (1999) Analysis of gene expression data using self-organizing maps. FEBS Letters 451(2):142-146.
Tamayo and coworkers have applied Self-Organizing Maps (SOMs) to grouping gene expression data. In Figure 1, they illustrate very simple X,Y data, with the raw datapoints drawn as black dots. Such a dataset might represent, for example, a measurement on a wild-type gene on the X-axis and a measurement on a mutant gene on the Y-axis. By extension, a timecourse with n timepoints would be represented as a point in n dimensions. Just by eye, the datapoints appear to fall into distinct groups.
The goal is to find a set of X,Y points that most closely approximate the mean value of each group of datapoints. SOMs begin by arbitrarily creating a set of nodes (N) with randomly assigned values. In the example, a set of six nodes (1-6) is randomly placed in the X,Y space.
For each iteration of the algorithm, a datapoint P is chosen, and the position of each node is changed to move it closer to P. The closer a node N is to point P, the greater the distance it is moved towards P. This process is continued for thousands of iterations, until the total change is lower than some threshold.
The net result is that all nodes will be moved many times, but each node will "come to rest" in the vicinity of the set of datapoints to which it is closest.
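The iteration described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the implementation used by Tamayo et al.: the nodes sit on a rows x cols grid, the node closest to P (the "winner") moves the most, and the other nodes move by amounts that fall off with their grid distance from the winner, which is the usual SOM convention. The function names, the linear decay schedules, the fixed iteration count (used here in place of a change threshold) and the example data are all hypothetical.

```python
import math
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def train_som(data, rows, cols, iterations=5000, rate0=0.5, sigma0=1.5, seed=0):
    """Fit a rows x cols grid of nodes to data, a list of equal-length tuples."""
    rng = random.Random(seed)
    dim = len(data[0])
    lo = [min(p[d] for p in data) for d in range(dim)]
    hi = [max(p[d] for p in data) for d in range(dim)]
    # Start every node at a random position inside the bounding box of the data.
    nodes = {
        (r, c): [rng.uniform(lo[d], hi[d]) for d in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    for t in range(iterations):
        frac = t / iterations
        rate = rate0 * (1.0 - frac)           # step size shrinks toward 0
        sigma = sigma0 * (1.0 - frac) + 0.1   # neighbourhood narrows over time
        p = rng.choice(data)                  # pick a datapoint P at random
        # The winner is the node currently closest to P in data space.
        winner = min(nodes, key=lambda k: sq_dist(nodes[k], p))
        for (r, c), w in nodes.items():
            grid_d2 = (r - winner[0]) ** 2 + (c - winner[1]) ** 2
            pull = rate * math.exp(-grid_d2 / (2.0 * sigma ** 2))
            for d in range(dim):
                w[d] += pull * (p[d] - w[d])  # move the node toward P
    return nodes

# Hypothetical X,Y data: three tight groups of four points each.
groups = [(0.0, 0.0), (10.0, 0.0), (5.0, 8.0)]
data = [(gx + dx, gy + dy) for gx, gy in groups
        for dx in (0.0, 0.5) for dy in (0.0, 0.5)]

som = train_som(data, rows=2, cols=3)  # six nodes, as in the example
for position, weights in sorted(som.items()):
    print(position, [round(v, 2) for v in weights])
```

The same function accepts points of any dimension, so a timecourse with n timepoints can be passed as an n-tuple without changing the code.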
For example, Tamayo et al. studied 6000 human genes in the myeloid leukemia cell line HL-60, in response to the phorbol ester PMA, which stimulates macrophage differentiation. 567 genes were shown to change significantly with addition of PMA. Expression data were modeled onto a 3 x 4 array in which each node in the array had a randomly generated timecourse curve. Each iteration consisted of selecting an actual timecourse curve for a human gene, and modifying all 12 randomized curves to fit that timecourse. The curves most closely matching the data were modified to strongly resemble the data. Curves that were less closely related to the data to begin with underwent less modification. The 12 resultant curves are shown below:
The authors point out that, "An SOM based on a rectangular grid is analogous to an entomologist's specimen drawer, with adjacent compartments holding similar insects."