TUTORIAL: Generating Gene Expression data from RNAseq reads

June 29, 2020

References

1. Fristensky B (2019) Transcriptomics. PLNT4610 Bioinformatics, Lecture 10.
If you have never done transcriptome assembly before, this lecture is a good starting point for understanding the process, parameters and potential problems.

2. Pertea M, Kim D, Gertea GM, Leek JT, Salzberg SL (2016) Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nature Protocols 11:1650. doi:10.1038/nprot.2016.095.
This package supersedes the Tuxedo (eg. Top Hat, Bowtie, Cufflinks) pipeline.

GOAL: Starting with RNAseq reads, to create files containing gene expression data in a form that can be used by gene expression software such as BioConductor.

OVERVIEW:

This tutorial assumes that an annotated genome is available.

Pre-processing and assembly of reads

Check raw reads using FASTQC
remove sequencing adaptors using Trimmomatic
correct errors in trimmed reads using Rcorrector
assemble transcriptome from corrected reads
evaluate quality of transcriptome assembly using transrate

Generation of gene expression data

Create genome index files
map reads to genome
for each set of reads, create GTF files that link mapped reads to features (genes)
create a merged transcriptome from all reads, and verify quality relative to reference transcriptome
for each set of reads, create a final GTF file with expression data

Pre-processing and assembly of reads

A simplified workflow for transcriptome assembly is shown at right.

Transcriptome assembly is carried out within the transcriptome directory, and for each major step, a separate subdirectory is used.

Raw read files are saved in the reads directory, and symbolic links to these files, with short, meaningful names, are created.

Sequencing adaptors are removed from the raw reads by Trimmomatic and the files containing the trimmed reads are saved in reads.Trimmomatic.

The trimmed reads are used as input by Rcorrector, which corrects errors in the reads and writes the corrected reads to the reads.Trimmomatic.Rcorrector directory.

At each step, FASTQC is run to check the properties of the reads.

Finally, the assembly itself is done using programs such as Trinity which produces contig files and scaffold files. For each assembly, transrate produces a reports with extensive statistics that can guide the choice of which assembly is best, or which assemblies should be repeated with different a parameters for improvement.

Generation of gene expression data

The final transcriptome assembly can now be used to generate gene expression data from the corrected reads. This tutorial implements the Hisat, Stringtie, Ballgown gene expression pipeline.

Hisat-build creates files that index the features in the annotated genome, so that reads can be mapped to features. Mapped reads are written to .bam files.

For each read file, hisat2 maps reads to the genome. Next, stringtie assembles the reads into transcripts, including alternative splicing products. Transcripts mapped to genomic locations are written to .gtf files.

Transcripts from all sets of reads are merged into a single .gtf file by stringtie --merge.

The final step is to generate tables of gene expression values. stringtie -B reads the mapped reads from the .bam files and the mapped transcripts from the .gtf files, and writes tables of FPKM values to comma-separated value files that can be read by the BioConductor R package for further analysis.

The next 4 tutorials implement these steps: