previous  page PLNT4610/PLNT7690 Bioinformatics
Lecture 12, part 3 of 3
first page

D. Transcriptomic Data pipelines

There is no such thing as an off-the shelf pipeline that can simply take raw reads, allow you to press a button, and generate a complete differential expression profile. This is in part because the nature of RNA-seq analysis is such that each step in the process is a decision point at which the results should be inspected to decide both whether the previous step needs to be re-run with different conditions, or with a different program, and also to decide, based on the result of the previous step, which program to use in the next step, or which parameters.
A second reason is that both the RNA sequencing technologies, and the bioinformatics tools are always changing to incorporate new advances. Therefore, RNA analysis pipelines will always be a work in progress.

For teaching purposes, this lecture will use as an example the RNA seq pipeline currently in place at the Bio Information Technologies Lab at the University of Manitoba.

There are usually considered to be 3 main steps in the RNA-seq pipeline, leading up to analysis of the data:

1. Assembly and mapping of reads
2. Normalization of the data
3. Calculation of expression levels within treatments and differential expression ratios between treatments

1. de-novo assemblies vs. assembly by read mapping (TopHat and Bowtie)

The first step in the data pipeline aligns sequencing reads, either by mapping each read to a gene, or by de-novo assembly of transcripts from the reads themselves.

from Thiru, P RNA-seq: Methods and Applications.

Mapping transcripts is generally easier in prokaryotes, and requires less computational time, than in eukaryotes.

smaller genomes and fewer genes much larger genomes and more genes
more likely to have completely sequenced and annotated genomes less likely to have completely sequenced and annotated genome
almost all genes present in single copies many genes are in multigene familes
genes seldom have introns genes almost always have introns; many genes have splice variants

genomes may be polyploid

The illustration at right shows RNA-seq reads aligned to two eukaryotic genes A and B. Reads that span part of an exon are shown as single lines, whereas reads that include parts of two adjacent exons are indicated by V-shaped lines. The presence of introns being spliced out of pre-mRNA transcripts means that alignment programs have to check to see whether a read contains part of the 3' end of one exon and part of a 5' end of another exon.

It is also notable that most transcriptomics experiments contain reads that map in presumably non-coding intergenic regions. These can either indicate that there are previously-unannotated transcribed regions in the genome, or the presence of untranscribed pseudogenes.

Transcriptomics is also revealing that alternative spllicing occurs more frequently in eukaryotic gene expression than was previously appreciated.

2. Normalization

As shown in the illustration above, more reads will be found for larger genes than for smaller genes. Consequently, it is necessary to correct gene expression levels for the size of each gene. To make results comparable across experiments, we also need to correct for the number of reads or fragments that were mapped to a gene in the genome or transcriptome.
Depending on whether you are doing single reads or paired-end reads, there are two almost identical formulae.

RPKM - Reads Per Kilobase of transcript per Million mapped reads


C : number of mappable reads on a feature
L : Length of feature (in kb)
N : Total number of mappable reads (in millions).
    Not all reads can be mapped to a specific gene.

from Thiru, P RNA-seq: Methods and Applications.

FPKM - Fragments Per Kilobase of transcript per Million mapped reads (necessary for paired-end reads)


F : number of mappable fragments on a feature. There are usually two reads for a fragment with paired end reads. Each pair of reads counts as a fragment.
L : Length of feature (in kb)
N : Total number of mappable reads (in millions)

Example of distribution of genes for a given FPKM.

RNAs from Clostridium thermocellum grown in exponential phase culture were assembled in a BowTie-TopHat Cufflinks pipeline.

FPKM is represented as the log2 of FPKM. The number of genes expressed at a given FPKM level are represented as bars on the Y-axis.

3. Which genes show a  "significant" difference between treatments?

Correction for the "Thousands of Genes" problem

In most experiments in biology, the accepted practice is to conclude that an observation is statistically significant if it is unlikely  to be seen by random chance more often than 5% or 1% of the time (p= 0.5 or p= 0.01). With RNAseq, the results for each gene could be considered to be a different experiment. What this means is that if the expression levels of 10,000 genes are compared between two treatments, we expect, when using p= 0.01 as a cutoff for significance, that about 100 genes would appear to have significantly different levels, just by random chance alone!  We need ways to correct p values so that one gets a more realistic estimate of statistical significance, for each gene compared.

Many gene expression programs use statistical methods such as ANOVA to calculate a p value p(x >
α), indicating the probability, the observed value x would exceed α by random chance alone. In many experiments, we accept p values less than 0.05 or 0.01 as being significant. Remember, however, that this means that respectively, 1 out of 20 times, or 1 out of 100 times, we get a false positive! Since each sequencing run can be thought of as thousands of experiments, we expect to see dozens or even hundreds of false positives that appear to be significant, but are not. Therefore, the threshold p values for assessing significance must be adjusted. The page linked below presents some of the ways of correcting p values in transcriptomics experiments.

Volcano Plots

Volcano plots let you assess the significant changes, between treatments, for the gene population as a whole
by Natalie Bjorklund, University of Manitoba

Volcano plots are a way to determine if there is data present that is worth taking a further look at. Gene expression experiments are typically designed to compare two things, such as exponential growth phase versus stationary phase, and looking for gene expression levels that change, either going up or down. Also because the experiments are done in replicates it is possible to determine how much variation there is between the replicates and use that value to determine if the change in gene expression is statistically significant or just a result of differences between each experimental replication. The Volcano plot allows you to quickly see if there is any difference that is both large and statistically significant. If the plot does not form a proper looking Volcano plot something is wrong. (Converting the values to log often makes the data fit into the same general order of magnitude and removes issues like different unit values so Volcano plots for gene arrays are typically shown in -log10 of the p value and log2 fold change.)

In the diagram below:

Probe 1: a statistically significant result (consistent across the experimental replicates) showing little change in expression.

Probe 2: a large decrease in gene expression that is also statistically significant (replicates match well)

Probe 3: a large increase in gene expression that is also statistically significant, (replicates match well)

Probe 4: a large increase in gene expression but one that is not statistically significant because the replicates of the experiment did not match well ie. observed change is less than the variance between replicates.

Probe 5: a large increase in gene expression but one that is not statistically significant because the replicates of the experiment did not match well

Example of a Volcano Plot showing which genes differ significantly, between two treatments. In this experiment, gene expression levels are compared in two different Clostridium thermocellum strains grown in exponential culture.

4. Differential expression

Ideally for each gene in the genome, RNAseq can give us an expression map, showing the level of expression of each gene, or even each exon (in the case of alternative splicing).

In principle, the level of expression for each exon in a gene should be identical, since each exon should be present in each mRNA transcript. There are at least three possible reasons:

Examples of Differential Expression

Trapnell, C et al. (2012) Differential gene expression and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols 7:562-578. doi:10.1038/nprot.2012.016.

Output form CummeRbund.

a) CummeRbund can generate expression profiles for single genes. In this example, FPKM levels for tropomyosin I are shown in four different samples, with replicates, at four stages of differentiation in myoblast cells in mouse muscle tissue.

b) Clustering of gene expression in myoblast cells over a differentiation timecourse in mouse muscle tissue.

RNA-Seq makes it possible to compare the expression of different splice variants

a) Map of a hypothetical gene with two different Transcription Start Sites (TSS) and three splice variants.

b - e)  Detection of variants make it possible to separately examine differential promoter use and differential splicing, in different conditions.

f, g) Since different proteins would also be generated by these splice and transcription variants, the proteome may be affected to

Unless otherwise cited or referenced, all content on this page is licensed under the Creative Commons License Attribution Share-Alike 2.5 Canada

previous  page PLNT4610/PLNT7690 Bioinformatics
Lecture 12, part 3 of 3
first page