PLNT3140 Introductory Cytogenetics
Lecture 17, part 4 of 4

IV. Measuring Gene Expression by RNA sequencing

Two general classes of data

transcriptome - the set of all RNA transcripts expressed in an organism

High throughput RNA sequencing can be used to measure the amount of each of thousands of distinct RNA transcripts in an RNA population.

Gene expression studies tend to generate two different types of data. Studies in which two or more conditions are compared at a single time point generate discrete state data. Often, however, it is critical to follow the expression of a gene over time after a treatment. In timecourse experiments, the expression of each gene in response to two or more treatments is measured over time. For example, in the timecourse at right, the solid blue and red dashed curves might represent the expression levels for a gene in response to two different drugs.

What we're ultimately trying to get from gene expression experiments is expression patterns for each of the thousands of transcripts in the RNA population. By identifying genes whose expression patterns are similar, we can discover which groups of genes work in concert, in response to a given stimulus.
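One simple way to decide whether two expression patterns are "similar" is to compute the correlation between their timecourse profiles. The sketch below is only an illustration: the gene names, expression values, and the 0.9 threshold are hypothetical, and real studies typically use dedicated clustering methods.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical expression levels at five time points after a treatment.
profiles = {
    "geneA": [1.0, 2.0, 4.0, 8.0, 16.0],
    "geneB": [2.0, 4.0, 8.0, 16.0, 32.0],  # same shape as geneA, twice the level
    "geneC": [16.0, 8.0, 4.0, 2.0, 1.0],   # opposite trend
}

def similar_pairs(profiles, threshold=0.9):
    """Return pairs of genes whose profiles correlate at or above the threshold."""
    names = sorted(profiles)
    pairs = []
    for i, g1 in enumerate(names):
        for g2 in names[i + 1:]:
            if pearson(profiles[g1], profiles[g2]) >= threshold:
                pairs.append((g1, g2))
    return pairs
```

Here geneA and geneB would be grouped together even though their absolute levels differ, because correlation compares the shape of the pattern rather than its magnitude.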

High-throughput RNA sequencing

There are many platforms for RNA sequencing, including Illumina GA/HiSeq, SOLiD, and Roche 454. Although these differ in detail, RNA-seq can be described generally as shown at right.
In some protocols, RNA is sheared, followed by random hexamer priming. In other protocols, the entire mRNA transcript is used as a template for cDNA synthesis, and the cDNA is fragmented.
Adapters for PCR are ligated onto ds-cDNA, followed by PCR amplification. Sequencing reactions are either done from a single end, or for both ends (paired-end).

Ideally, where a reference genome exists, all transcripts can be mapped to specific genes in the genome.

RNA-seq - introns complicate the assembly process
The illustration at right shows RNA-seq reads aligned to two eukaryotic genes A and B. Reads that fall entirely within an exon are shown as single lines, whereas reads that include parts of two adjacent exons are indicated by V-shaped lines. Because introns are spliced out of pre-mRNA transcripts, alignment programs have to check whether a read contains part of the 3' end of one exon and part of the 5' end of the next exon. We have to already have the genomic sequence to do this.
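The idea of a split (V-shaped) alignment can be sketched in a few lines. This is a toy illustration under strong assumptions: exact matches only, a tiny made-up genomic sequence, and a hypothetical function name `align_spliced`. Real spliced aligners use indexed genomes and tolerate mismatches.

```python
def align_spliced(read, genome, min_anchor=4):
    """Return (left_start, left_len, right_start) for a hit, or None.

    right_start is None when the read matches the genome without a gap.
    """
    pos = genome.find(read)
    if pos != -1:
        return (pos, len(read), None)  # read lies within a single exon
    # Try every split point that leaves at least min_anchor bases on each side.
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_pos = genome.find(left)
        while left_pos != -1:
            # The right part must map downstream of the left part, after a gap.
            right_pos = genome.find(right, left_pos + split + 1)
            if right_pos != -1:
                return (left_pos, split, right_pos)
            left_pos = genome.find(left, left_pos + 1)
    return None

# Hypothetical gene: two exons separated by a 15 nt intron.
exon1 = "ATGGCCAAGT"
intron = "GTAAGTTTTTTTCAG"
exon2 = "CCTGAAGGTC"
genome = exon1 + intron + exon2

# A read from the spliced mRNA that crosses the exon1/exon2 junction.
junction_read = exon1[-5:] + exon2[:5]
```

For `junction_read`, the function reports that the first 5 bases map to position 5 and the rest map to position 25, so the implied gap of 15 nt corresponds to the intron.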

It is also notable that most transcriptomics experiments contain reads that map in presumably non-coding intergenic regions. These can indicate either that there are previously unannotated transcribed regions in the genome, or that pseudogenes in those regions are transcribed.
Transcriptomics is also revealing that alternative splicing occurs more frequently in eukaryotic gene expression than was previously appreciated.

RNA-seq - Normalization of results to account for genes of different sizes

As shown in the illustration above, more reads will be found for larger genes than for smaller genes, even when the two are expressed at the same level. The raw measurement is the number of reads or fragments that were mapped to each gene in the genome or transcriptome. Consequently, it is necessary to correct gene expression levels for transcript length and for the total number of mapped reads in the experiment.

Depending on whether you are doing single reads or paired-end reads, there are two almost identical formulae.

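RPKM - Reads Per Kilobase of transcript per Million mapped reads (single reads)

FPKM - Fragments Per Kilobase of transcript per Million mapped reads (paired-end reads)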
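The definition of RPKM translates directly into arithmetic: divide the read count by the transcript length in kilobases and by the number of mapped reads in millions. The sketch below assumes hypothetical read counts and gene lengths. The paired-end variant (FPKM) uses the same formula, counting fragments instead of reads.

```python
def rpkm(read_count, transcript_length_nt, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads.

    Equivalent to read_count / (length in kb) / (mapped reads in millions).
    """
    return read_count * 1e9 / (transcript_length_nt * total_mapped_reads)

# Hypothetical experiment with 10 million mapped reads.
total_mapped = 10_000_000
short_gene = rpkm(500, 1000, total_mapped)    # 500 reads on a 1000 nt transcript
long_gene = rpkm(1000, 2000, total_mapped)    # 1000 reads on a 2000 nt transcript
```

Although the longer gene collects twice as many reads, both genes come out at the same RPKM (50), which is the point of the normalization.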



F. Whole Genome Shotgun Sequencing (WGS)

Ekblom R, Wolf JBW (2014) A field guide to whole-genome sequencing, assembly and annotation. Evolutionary Applications 7: 1026 - 1042.

In the past, genomes were sequenced by first making a BAC library, and then sequencing enough clones to cover the entire genome. Even today, these genomes are among the best genomic sequences. However, the time and expense of this strategy make it impractical for sequencing large numbers of genomes. Whole genome shotgun sequencing is a quicker and cheaper way to sequence genomes, but has the disadvantage that most of the time, full chromosomal sequences cannot be built for eukaryotic genomes. The trade-off, then, is between cost and speed on the one hand, and completeness of the genomic sequence on the other.


Library - Collection of DNA (or RNA) fragments modified in a way that is appropriate for downstream analyses, such as high-throughput sequencing in this case

Insert size - Length of randomly sheared fragments (from the genome or transcriptome) sequenced from both ends. This is usually several hundred to several thousand nt. eg. about 300 nt for a typical Illumina paired-end library.

Read - Short base-pair sequence inferred from the DNA/RNA template by sequencing

Paired-end sequencing - Sequence information from two ends of a short DNA insert, usually a few hundred base pairs long

Mate-pair - Sequence information from two ends of a DNA fragment, usually several thousand base-pairs long

Contig - A contiguous linear stretch of DNA or RNA consensus sequence. Constructed from a number of smaller, partially overlapping, sequence fragments (reads)

Scaffold - Two or more contigs joined together using read-pair information. Within a scaffold, the gaps between adjacent contigs are usually denoted by a run of N's.

Assembly - Computational reconstruction of a longer sequence from smaller sequence reads

Summary of Whole Genome Sequencing

1. Shotgun sequencing generates millions of reads

WGS begins by creating a library of fragments from genomic DNA. Usually a PCR step amplifies the fragments, which are of roughly uniform size, depending on the specific sequencing technology used. For example, with Illumina technology, fragments are typically about 300 nt in length. Since sequencing reads can be 150 nt or longer, the two paired-end reads can meet or overlap, covering the entire 300 nt fragment.

2. Reads are assembled into contigs

Typically, WGS sequences enough reads to cover the entire genome with 50 to 100-fold redundancy. Highly-efficient pattern matching software pieces together reads at points of overlap to form contigs. The algorithm keeps adding reads together until contigs can no longer be extended from the pool of reads.
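Fold redundancy (coverage) is simply the total number of sequenced bases divided by the genome size. The sketch below works through that arithmetic. The 120 Mb genome size is a hypothetical value chosen for illustration, and the function names are not from any particular tool.

```python
def coverage(num_reads, read_length_nt, genome_size_nt):
    """Average fold coverage: total sequenced bases / genome size."""
    return num_reads * read_length_nt / genome_size_nt

def reads_needed(target_coverage, read_length_nt, genome_size_nt):
    """Number of reads required to reach a target fold coverage (rounded up)."""
    total_bases = target_coverage * genome_size_nt
    return -(-total_bases // read_length_nt)  # ceiling division on integers

# Hypothetical 120 Mb genome sequenced with 150 nt reads to 50-fold coverage.
n_reads = reads_needed(50, 150, 120_000_000)
```

Under these assumptions, 40 million reads are needed for 50-fold coverage, and twice that for 100-fold.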

The figure at right shows assembly of a contig from many individual reads.

Generally the bigger the contigs, the better the sequence assembly.

Each contig is assembled from many overlapping reads. At this point, we have no idea which chromosome each contig comes from, or where the contigs might be placed on those chromosomes.
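The process of piecing reads together at points of overlap can be sketched as a greedy merge: repeatedly join the two sequences with the longest suffix-to-prefix overlap until nothing more can be joined. This is a deliberately naive illustration with made-up reads. It assumes error-free reads from one strand, and production assemblers use far more efficient graph-based methods.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b, or 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge reads into contigs by always joining the best-overlapping pair."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            return contigs  # no contig can be extended any further
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)

# Three overlapping reads from one region, plus one read that overlaps nothing.
reads = ["AGCTTGCA", "TGCAATCG", "ATCGGAT", "CCCCCC"]
contigs = greedy_assemble(reads)
```

The three overlapping reads collapse into the single contig AGCTTGCAATCGGAT, while the unmatched read is left over on its own.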

There are usually many unmatched reads that cannot be assembled into contigs, as well as very small contigs (eg. a few hundred or a few thousand base pairs in length) that do not contribute anything to the final genome assembly.

3. Repetitive elements make it difficult or impossible to assemble long contigs

Eukaryotic genomes are especially difficult to assemble because so much of the genome consists of repetitive elements, such as the Alu family in primates, interspersed among unique DNA. Since sequencing reads are fairly short, a high percentage of reads will have part of a repetitive element at one end. Few reads will completely span a repetitive element, with unique sequence on either side.

While it is true that repetitive sequence elements do mutate, it is often difficult or impossible for sequence assembly software to decide which copy of a repetitive element to join with a given read, when any of thousands of other copies may be identical or nearly identical to it.

Put another way, we don't know where on the chromosome each read came from. That is what we're trying to figure out. The net result is that as a growing contig encounters a repetitive element, there may be no way to extend the contig further. Consequently, most genome assemblies have a relatively small number of large contigs, and a very large number of small contigs, maybe 1000 bp or smaller.
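The ambiguity caused by repeats can be shown with a toy example. The repeat sequence, the reads, and the helper names below are all hypothetical. The point is only that a read ending inside a repeat overlaps equally well with reads from every copy, whereas a read that spans the repeat into unique sequence does not.

```python
def suffix_prefix_overlap(a, b, min_len=4):
    """Length of the longest suffix of a that equals a prefix of b, or 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def candidate_extensions(read, pool, min_len=4):
    """Reads in the pool that could extend `read` to the right."""
    return [r for r in pool
            if r != read and suffix_prefix_overlap(read, r, min_len) > 0]

REPEAT = "GATTACAG"  # hypothetical repetitive element present in two copies

pool = [
    REPEAT + "TCCGA",   # repeat copy 1, followed by unique sequence
    REPEAT + "AAGCT",   # repeat copy 2, followed by different unique sequence
    "ACGTACGTAC",       # unrelated read
]

# This read ends inside the repeat, so both copies are equally good extensions.
ends_in_repeat = "CCTTG" + REPEAT
ambiguous = candidate_extensions(ends_in_repeat, pool)

# This read spans the repeat into unique sequence, so only copy 1 fits.
spans_repeat = "TTG" + REPEAT + "TCC"
resolved = candidate_extensions(spans_repeat, pool)
```

With two candidates and no way to choose between them, an assembler has to stop extending the contig at the repeat.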

4. Mate-pair reads make it possible to join contigs together into scaffolds.

With current sequencing technologies, the best strategy for joining contigs is to do a second sequencing run, this time using libraries with large fragment sizes (eg. 3000 bp or greater). The reads from these larger insert libraries are called "mate pair" reads. If we find one read of a mate pair within one contig, and the other read of the pair somewhere within another contig, we know that the two contigs are no farther apart than the length of the large insert.

1) Large DNA fragments (eg. 3000 bp) are end-repaired using biotinylated nucleotides.
2) DNA ligase is added, causing fragments to circularize. This brings the two ends (red and green) together.
3) The circular DNA is fragmented, and the resultant fragment pool is run over a streptavidin column. Only the fragments containing the biotin are captured. The captured fragments are then eluted from the column. Each of these fragments now carries sequence from both ends of the original large fragment, joined at the biotinylated junction. Sequencing adaptors A1 and A2 are ligated to the ends of the biotinylated fragments.
4,5) Double-stranded fragments are denatured and run over a flow cell. The different cells in the flow cell have oligonucleotides complementary to either A1 or A2, so each strand will be captured in different cells. They are amplified by PCR. Fragments are sequenced, starting at A1 or A2.
6) With reference to the original large fragment from step 1, the sequence that we get comes from the two ends, starting in the interior of the fragment and reading outward.


Next-generation sequencing technologies and applications for human genetic history and forensics.

Berglund EC, Kiialainen A, Syvänen AC - Investig Genet (2011)

A scaffold may have many contigs:

Scaffolds join contigs in the order and orientation with which they appear on the chromosome.

If we're lucky, we may be able to assemble scaffolds that completely cover an entire chromosome. Most of the time, though, there are 2 or more scaffolds per chromosome, and we don't know the order and orientation of the scaffolds, relative to one another.
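Joining contigs into a scaffold can be sketched as follows. This is a simplified illustration, not the method of any particular scaffolding program. It assumes that the mate-pair links have already been reduced to an ordered "left contig, right contig, estimated gap" form, that each contig has at most one successor, and that all contigs are already in the correct orientation. The contig names and sequences are made up.

```python
def estimate_gap(insert_size, left_contig_len, left_read_start, right_read_end):
    """Gap implied by one mate pair: the insert size minus the parts of the
    insert that lie inside the two contigs."""
    inside_left = left_contig_len - left_read_start
    inside_right = right_read_end
    return insert_size - inside_left - inside_right

def build_scaffolds(contigs, links):
    """contigs: dict of name -> sequence.
    links: list of (left, right, gap) tuples derived from mate pairs.
    Returns a list of (contig order, scaffold sequence) tuples."""
    next_of = {left: (right, gap) for left, right, gap in links}
    has_predecessor = {right for _, right, _ in links}
    starts = [n for n in contigs if n in next_of and n not in has_predecessor]
    scaffolds = []
    for start in starts:
        order, seq, current = [start], contigs[start], start
        # Follow the chain of links, guarding against revisiting a contig.
        while current in next_of and next_of[current][0] not in order:
            nxt, gap = next_of[current]
            seq += "N" * max(gap, 1) + contigs[nxt]  # gap shown as a run of N's
            order.append(nxt)
            current = nxt
        scaffolds.append((order, seq))
    return scaffolds

contigs = {"c1": "ACGT", "c2": "GGCC", "c3": "TTAA"}
links = [("c1", "c2", 3), ("c2", "c3", 2)]
scaffolds = build_scaffolds(contigs, links)
```

As an example of the gap estimate: with a 3000 bp insert, a left read starting 1000 bp from the end of a 5000 bp contig, and a right read ending 500 bp into the next contig, the implied gap is 1500 bp.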

5. Limitations of Whole Genome Shotgun Sequencing
NCBI Genome Resources


  1. "Designer genomes" - Genomic maps that are saturated with markers allow plant and animal breeders to selectively breed offspring which combine desired genes from many different strains, varieties, lines, stocks etc.
  2. "Everything has already been cloned" - If a gene can be precisely mapped, the clones for that region of the genome already exist.
  3. Understanding how genomes are structured, and how structure relates to function.
  4. Raw materials for genetic engineering - With completely cloned, mapped, and in some cases sequenced genomes, we will be able to pick and choose genes of all kinds for genetic engineering purposes.

Unless otherwise cited or referenced, all content on this page is licensed under the Creative Commons License Attribution Share-Alike 2.5 Canada
