PLNT4610/PLNT7690 Bioinformatics - Lecture 12, part 2 of 3

C. Experimental considerations for RNAseq

RNA sequencing is not perfect. It has a number of potential problems that must be corrected for in the analytical pipeline.

1. Sources of experimental variation

Treatments

Experimental conditions
Tissue preparation

RNA isolation - use identical amounts of tissue and identical extraction methods; use the minimum number of steps; measure the amount of RNA and normalize the concentration
RNA is very susceptible to degradation during all steps of handling unless very strict measures are taken to eliminate ribonucleases, which are found in all cells.

All of these sources of variation can be minimized by doing biological replicates.

Order of sequencing - PLEASE, PLEASE, PLEASE! Do not sequence your RNAs in the exact order in which they occur in your experimental design. Say you have 48 samples: 2 treatments x 4 replicates x 6 time points. Suppose you can run 8 samples per flow cell on the Illumina platform. The temptation would be to run all 4 replicates of both treatments for the first time point on one flow cell, all 4 replicates of both treatments for the second time point on the second flow cell, and so on. Because quality can vary from one flow cell to the next, this approach would confound flow cell with time point, introducing a bias that could produce differences that appear to have strong statistical significance. For example, all RNAs on flow cell 3 might give higher numbers of reads than those on the other flow cells. However, all you would be seeing is variation from one flow cell to the next. This could be especially problematic if sequencing reactions were done on different days with different batches of reagents.
The right way to sequence these samples is to randomize the order of samples sequenced and loaded on flow cells, so that no systematic bias would be introduced.
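
One way to do the randomization is sketched below for the 48-sample example. This is only a minimal illustration: the treatment names, the function name and the fixed seed are assumptions made for the example, and a sequencing facility may prefer a blocked design over the simple shuffle shown here.

```python
import random

def assign_to_flow_cells(samples, lanes_per_cell, seed=None):
    """Shuffle the samples, then deal them out to flow cells in that random order."""
    rng = random.Random(seed)        # a fixed seed makes the layout reproducible
    shuffled = list(samples)         # copy, so the original design order is untouched
    rng.shuffle(shuffled)
    return [shuffled[i:i + lanes_per_cell]
            for i in range(0, len(shuffled), lanes_per_cell)]

# 2 treatments x 6 time points x 4 replicates = 48 samples
# (treatment names are hypothetical)
samples = [(treatment, timepoint, replicate)
           for treatment in ("control", "treated")
           for timepoint in range(1, 7)
           for replicate in range(1, 5)]

flow_cells = assign_to_flow_cells(samples, lanes_per_cell=8, seed=42)
for number, cell in enumerate(flow_cells, start=1):
    print(f"flow cell {number}: {cell}")
```

Record the layout that was actually used, so that flow cell can be included as a factor if it turns out to matter during analysis.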

2. Experimental design

The most critical factor in experimental design for gene expression experiments is the number of biological replicates. By biological replicates, we mean a complete repetition of the treatment or condition.

Plotting the number of significant t-test results against the number of biological replicates gives a flattening curve, indicating diminishing returns beyond 6 replicates. The fold-test line (not shown) is flat.

In microarray experiments, a dataset of 6 biological replicates was sampled to create test datasets containing all possible permutations of 3, 4, 5 or 6 replicates. Results for 3, 4 and 5 replicates are averages, while results for 6 replicates mean that all 6 replicates were used. Although these results used microarray data, the same statistical principles apply to RNA-seq data.

Björklund, Natalie (2012) University of Manitoba.

Statisticians have long known that the power of any statistical test increases with the number of replicates. While it is tempting to minimize the number of replicates due to cost and time required to repeat experiments, it is pointless to do any experiment if the results are questionable. The ability of the data to distinguish whether a gene has changed its expression from one condition to the next increases as the number of biological replicates increases.

3 replicates - should be considered the bare minimum number of biological replicates
4 replicates - drastically increases the number of genes that can be classified correctly as having changed in expression
beyond 6 replicates - there is little improvement in the discriminatory power of the test unless you can do a very large number of replicates

BIOLOGICAL REPLICATES ARE THE SINGLE MOST EFFECTIVE WAY TO GET GOOD GENE EXPRESSION RESULTS!

In the next section we will see that there is an almost endless list of ways to massage the data. The most heroic analytical methods are no substitute for the simple step of doing several biological replicates.

biological replicates - A complete repetition of a timepoint or treatment using the whole organism or cells, keeping the environmental and experimental conditions as uniform as possible.

Separate RNA samples are extracted for each replicate
Replicate RNA samples are never pooled. Each RNA sample is sequenced independently. You can pool the data later during the analytical phase.

Pooling throws away critical information about each replicate set of conditions.
You may need to discard an RNA sample if something went wrong in a particular sample

Biological replicates include all variation from both biological variation and experimental variation.
As the number of biological replicates increases, the influence of this variation on the estimated expression levels decreases.

technical replicates - Re-sequencing of the same RNA sample

only control for differences in handling
generally don't tell you much, since RNA populations are large enough that coverage of any one transcript will be quite consistent

Brian's Basic Rule of Experimentation: A little bit of really good data tells you more than a large amount of questionable data.

Estimated number of replicates needed for a sample dataset

	FDR = 0.10	FDR = 0.05	FDR = 0.01
Power = 0.5	3 / 3	3 / 3	5 / 5
Power = 0.6	3 / 3	3 / 4	7 / 6
Power = 0.7	3 / 4	5 / 5	10 / 9
Power = 0.8	4 / 6	9 / 8	20 / 14
Power = 0.9	13 / 11	30 / 16	75 / 27

Power is the fraction of true positives detected. FDR is the false discovery rate, i.e. the fraction of results called significant that are false positives. The numbers on either side of the slash are sample-size (i.e. biological replicate) estimates made using the sample-size estimation methods described in Ref. [8] and Ref. [10], respectively.

Tommy S. Jorstad, Mette Langaas, Atle M. Bones, Understanding sample size: what determines the required number of microarrays for an experiment?, Trends in Plant Science, Volume 12, Issue 2, February 2007, Pages 46-50, ISSN 1360-1385, DOI: 10.1016/j.tplants.2007.01.001.

Simon, S. Myths & Truths About Microarray Expression Profiling

Conclusions

To get a greater statistical power, you need to do a larger number of replicates
To get a lower false discovery rate, you need to do a larger number of replicates
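
Both conclusions can be illustrated with a rough calculation. The sketch below uses a textbook normal approximation for a two-sided, two-sample test on a single gene. It is not either of the methods used for the table above, the significance threshold alpha is a per-gene threshold rather than an FDR, and the effect size of 2 standard deviations is an assumption chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(n, effect_size, alpha):
    """Approximate power of a two-sided, two-sample z-test.

    n           -- biological replicates per group
    effect_size -- difference in means divided by the standard deviation
    alpha       -- per-gene significance threshold (not an FDR)
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n / 2)
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

def replicates_needed(target_power, effect_size, alpha, max_n=1000):
    """Smallest n per group reaching target_power, or None if max_n is not enough."""
    for n in range(2, max_n + 1):
        if approx_power(n, effect_size, alpha) >= target_power:
            return n
    return None

effect_size = 2.0    # assumed: means differ by 2 standard deviations
for power in (0.5, 0.6, 0.7, 0.8, 0.9):
    needed = [replicates_needed(power, effect_size, alpha)
              for alpha in (0.10, 0.05, 0.01)]
    print(f"power {power}: replicates per group at alpha 0.10/0.05/0.01 = {needed}")
```

The absolute numbers depend entirely on the assumed effect size, but the trend is the same as in the table: asking for more power, or for a stricter threshold, costs more replicates.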

3. RNA

While in general the quality of the RNA is important to the success of RNA-seq experiments, the parameter that has the most effect is the degree to which the sample has been enriched for mRNA, by eliminating other RNAs. Especially in eukaryotic RNA populations, mRNA usually makes up only a few percent of the total, which is predominantly rRNA. If no enrichment procedure were done, the depth of coverage of protein coding genes would be greatly compromised, because the vast majority of reads would be rRNA.
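
To put rough numbers on this, the short calculation below compares the expected number of mRNA-derived reads with and without enrichment. All three figures (20 million reads per sample, 3% mRNA before enrichment, 90% after) are assumptions picked for illustration, not measurements.

```python
def expected_mrna_reads(total_reads, mrna_fraction):
    """Expected number of reads that come from mRNA, rounded to whole reads."""
    return round(total_reads * mrna_fraction)

total_reads = 20_000_000    # assumed sequencing depth per sample

# assumed mRNA fractions, for illustration only
for label, fraction in (("no enrichment", 0.03), ("after enrichment", 0.90)):
    reads = expected_mrna_reads(total_reads, fraction)
    print(f"{label}: about {reads:,} of {total_reads:,} reads from mRNA")
```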

Although most RNA-seq library preparation protocols have a step for enriching for mRNAs, there will always be contamination from other RNAs. For this reason, it is important that there be a step in the RNA pipeline to eliminate reads that can be identified as other forms of RNA, such as rRNA or tRNA.
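
A minimal sketch of such a filtering step is shown below. It assumes that an earlier step, for example aligning the reads against rRNA and tRNA sequences, has already produced a set of read IDs to discard. The function names and the toy data are made up for the example, and real pipelines normally use dedicated tools for this.

```python
import io

def read_fastq(handle):
    """Yield (header, sequence, plus, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:               # end of file
            return
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        yield header, sequence, plus, quality

def drop_flagged_reads(handle_in, handle_out, flagged_ids):
    """Copy every record whose read ID is not in flagged_ids; return (kept, dropped)."""
    kept = dropped = 0
    for header, sequence, plus, quality in read_fastq(handle_in):
        read_id = header[1:].split()[0]      # text after '@', up to the first space
        if read_id in flagged_ids:
            dropped += 1
        else:
            handle_out.write(f"{header}\n{sequence}\n{plus}\n{quality}\n")
            kept += 1
    return kept, dropped

# Toy data: three reads, one of which was flagged as rRNA by an earlier step.
fastq_text = ("@read1\nACGT\n+\nIIII\n"
              "@read2 extra\nGGCC\n+\nIIII\n"
              "@read3\nTTAA\n+\nIIII\n")
flagged = {"read2"}

filtered = io.StringIO()
kept, dropped = drop_flagged_reads(io.StringIO(fastq_text), filtered, flagged)
print(f"kept {kept} reads, dropped {dropped}")
```

With real files, open the input and output FASTQ files and pass the file handles in place of the io.StringIO objects. For paired-end data, both mates of a flagged pair would need to be removed together.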

Image from
http://finchtalk.geospiza.com/2009/05/small-rnas-get-smaller.html

4. Sequencing technologies

There are many protocols for RNA sequencing, including Illumina GA/HiSeq and Roche 454. Although these differ, the RNA-seq procedure can be described generally as shown at right.

from
http://cmb.molgen.mpg.de/2ndGenerationSequencing/Solas/RNA-seq.html

In some protocols, RNA is sheared, followed by random hexamer priming. In other protocols, the entire mRNA transcript is used as a template for cDNA synthesis, and the cDNA is fragmented.

Adapters for PCR are ligated onto ds-cDNA, followed by PCR amplification. Sequencing reactions are done either from a single end or from both ends (paired-end).

Always go for longer reads wherever possible. Longer reads greatly decrease the number of reads that cannot be correctly assigned.
Always do paired-end reads, again to make sure that reads are correctly assigned to a gene.

Unless otherwise cited or referenced, all content on this page is licensed under the Creative Commons License Attribution Share-Alike 2.5 Canada
