BIRCH

Previous

TUTORIAL: Transcriptome Assembly

Preprocessing of  RNA sequencing reads


  October 31, 2021

Next page

References

FastQC - http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

TrimmomaticManual.pdf

SeqKit


0. Obtain sequencing read files

DATASET: Fakankun I et al. (2019) Ph.D. thesis, University of Manitoba (in progress) Rhodosporidium diobovatum.
This dataset is a random sample of about 5% of the reads from fungal RNAseq. Cells were grown in conditions of nitrogen starvation for 18 or 24 hours. Three replicate cultures were done for each time point.

raw read files (Illumina, paired end)
time
replicate
HI.3992.004.Index_1.18-1_R1_S5.fastq.gz
HI.3992.004.Index_1.18-1_R2_S5.fastq.gz
18 h
1
HI.3992.004.Index_3.18-2_R1_S5.fastq.gz
HI.3992.004.Index_3.18-2_R2_S5.fastq.gz
18 h
2
HI.3992.004.D701---D503.18-3_R1_S5.fastq.gz
HI.3992.004.D701---D503.18-3_R2_S5.fastq.gz
18 h
3
HI.3992.004.Index_8.24-1_R1_S5.fastq.gz
HI.3992.004.Index_8.24-1_R2_S5.fastq.gz
24 h
1
HI.3992.004.Index_10.24-2_R1_S5.fastq.gz
HI.3992.004.Index_10.24-2_R2_S5.fastq.gz
24 h
2
HI.3992.004.Index_11.24-3_R1_S5.fastq.gz
HI.3992.004.Index_11.24-3_R2_S5.fastq.gz
24 h
3

Since transcriptome assembly requires several steps, it is best to organize the files for each step into a separate directory. In addition to keeping things simple, this organization gives us more flexibility for saving files offline, compressing files, or ignoring sets of files for backup.

In your tutorials directory create a new directory called blreads/transcriptome/reads, and then go to that directory:

mkdir blreads/transcriptome
cd blreads/transcriptome
mkdir reads
cd reads


Next, download the fastq.gz files found at
https://home.cc.umanitoba.ca/~psgendb/birchhomedir/FTP/BIRCH/data/blreads/transcriptome/TranscriptomeFiles.html

These files contain raw sequencing reads, which should be saved to your reads directory, NOT to your Downloads directory. The total size of these files is around 2.23 Gb.

Check to make sure that these files have downloaded correctly

ls -l
-rw-rw-r-- 1 psgendb psgendb 168088629 Feb 26 07:57 HI.3992.004.D701---D503.18-3_R1_S5.fastq.gz
-rw-rw-r-- 1 psgendb psgendb 177271488 Feb 26 07:57 HI.3992.004.D701---D503.18-3_R2_S5.fastq.gz
-rw-rw-r-- 1 psgendb psgendb 194961311 Feb 26 07:57 HI.3992.004.Index_10.24-2_R1_S5.fastq.gz
-rw-rw-r-- 1 psgendb psgendb 204240390 Feb 26 07:57 HI.3992.004.Index_10.24-2_R2_S5.fastq.gz
-rw-rw-r-- 1 psgendb psgendb 251977279 Feb 26 07:57 HI.3992.004.Index_11.24-3_R1_S5.fastq.gz
-rw-rw-r-- 1 psgendb psgendb 270292798 Feb 26 07:57 HI.3992.004.Index_11.24-3_R2_S5.fastq.gz
-rw-rw-r-- 1 psgendb psgendb 182257703 Feb 26 07:58 HI.3992.004.Index_1.18-1_R1_S5.fastq.gz
-rw-rw-r-- 1 psgendb psgendb 180472929 Feb 26 07:58 HI.3992.004.Index_1.18-1_R2_S5.fastq.gz
-rw-rw-r-- 1 psgendb psgendb 156844236 Feb 26 07:58 HI.3992.004.Index_3.18-2_R1_S5.fastq.gz
-rw-rw-r-- 1 psgendb psgendb 167542797 Feb 26 07:58 HI.3992.004.Index_3.18-2_R2_S5.fastq.gz
-rw-rw-r-- 1 psgendb psgendb 153937166 Feb 26 07:58 HI.3992.004.Index_8.24-1_R1_S5.fastq.gz
-rw-rw-r-- 1 psgendb psgendb 161674265 Feb 26 07:58 HI.3992.004.Index_8.24-1_R2_S5.fastq.gz

Create symbolic links with short, easy to distinguish names

Sequencing read files typically have long, complex names. Aside from being difficult to work with at the command line, it takes more work to be sure that, at any given point, you are working with the correct file. At the same time, it is best not to modify the names of the original files for reference to the original sequencing runs.

A good solution is to create symbolic links with short, meaningful names, to the read files. All subsequent steps in the assembly pipeline will be done using these shorter names. Usually, this will be a 2 step process: first creating symbolic links with shorter names, and then renaming these links with even simpler names.

Launch blreads by typing 'blreads &'.
(Here, we're adding & to the blreads command to run blreads in the background. That will allow us to continue using the command line in the same terminal window in which blreads was launched.)
Note that the path for the current working directory is listed on the first line of blreads. Any line begining with a hash mark (#) is a comment, and will be ignored.


 
While this first step has to be done one file at a time, we are setting up filenames so that we can work on groups of files in subsequent steps. Choose File --> Create symbolic links with short names.


Choose the first file to which you want to make a link.

To make the name of the link shorter, type in a target pattern that is common in two or more files. At right, the pattern is "HI.3992.004.D701---D503.". Since the short pattern field is left blank, this string will simply be omitted from the link name. The link name will be 18-3_R1_S5.fastq.gz.



The name of the link will appear in the blreads window. Note that the type field of the original file has 'f', to indicate a file, and the type field for the link is 'l'. to indicate a link.


Repeat this process for HI.3992.004.D701---D503.18-3_R2_S5.fastq.gz.  In this case, the target pattern will be the same. For the next two files, the target pattern will be "HI.3992.004.Index_10.".  Because the patterns for the remaining pairs of files differs only by the number (eg. 11, 1,3, 8) we can quickly go through the rest of the files, making small changes to the pattern.

When you have completed the process for all six files, blreads should look like this:

For each set of paired-end read files (R1 and R2), there will be a corresponding pair of symbolic links with shorter names.



IMPORTANT CHECK - It is essential to make sure that the short names of the links correctly match the original long names. At the command line type 'ls -l' and compare the names of the links with the names of the files to which they point.

lrwxrwxrwx 1 psgendb psgendb        39 Feb 26 12:26 18-1_R1_S5.fastq.gz -> HI.3992.004.Index_1.18-1_R1_S5.fastq.gz
lrwxrwxrwx 1 psgendb psgendb        39 Feb 26 12:26 18-1_R2_S5.fastq.gz -> HI.3992.004.Index_1.18-1_R2_S5.fastq.gz
lrwxrwxrwx 1 psgendb psgendb        39 Feb 26 12:26 18-2_R1_S5.fastq.gz -> HI.3992.004.Index_3.18-2_R1_S5.fastq.gz
lrwxrwxrwx 1 psgendb psgendb        39 Feb 26 12:26 18-2_R2_S5.fastq.gz -> HI.3992.004.Index_3.18-2_R2_S5.fastq.gz
lrwxrwxrwx 1 psgendb psgendb        43 Feb 26 12:26 18-3_R1_S5.fastq.gz -> HI.3992.004.D701---D503.18-3_R1_S5.fastq.gz
lrwxrwxrwx 1 psgendb psgendb        43 Feb 26 12:26 18-3_R2_S5.fastq.gz -> HI.3992.004.D701---D503.18-3_R2_S5.fastq.gz
lrwxrwxrwx 1 psgendb psgendb        39 Feb 26 12:26 24-1_R1_S5.fastq.gz -> HI.3992.004.Index_8.24-1_R1_S5.fastq.gz
lrwxrwxrwx 1 psgendb psgendb        39 Feb 26 12:26 24-1_R2_S5.fastq.gz -> HI.3992.004.Index_8.24-1_R2_S5.fastq.gz
lrwxrwxrwx 1 psgendb psgendb        40 Feb 26 12:26 24-2_R1_S5.fastq.gz -> HI.3992.004.Index_10.24-2_R1_S5.fastq.gz
lrwxrwxrwx 1 psgendb psgendb        40 Feb 26 12:26 24-2_R2_S5.fastq.gz -> HI.3992.004.Index_10.24-2_R2_S5.fastq.gz
lrwxrwxrwx 1 psgendb psgendb        40 Feb 26 12:26 24-3_R1_S5.fastq.gz -> HI.3992.004.Index_11.24-3_R1_S5.fastq.gz
lrwxrwxrwx 1 psgendb psgendb        40 Feb 26 12:26 24-3_R2_S5.fastq.gz -> HI.3992.004.Index_11.24-3_R2_S5.fastq.gz
-rw-rw-r-- 1 psgendb psgendb        20 Feb 26 12:25 bio803081030103253751.tmp.tsv
-rw-rw-r-- 1 psgendb psgendb 168088629 Feb 26 11:54 HI.3992.004.D701---D503.18-3_R1_S5.fastq.gz
-rw-rw-r-- 1 psgendb psgendb 177271488 Feb 26 11:54 HI.3992.004.D701---D503.18-3_R2_S5.fastq.gz
-rw-rw-r-- 1 psgendb psgendb 194961311 Feb 26 07:57 HI.3992.004.Index_10.24-2_R1_S5.fastq.gz
-rw-rw-r-- 1 psgendb psgendb 204240390 Feb 26 07:57 HI.3992.004.Index_10.24-2_R2_S5.fastq.gz
-rw-rw-r-- 1 psgendb psgendb 251977279 Feb 26 07:57 HI.3992.004.Index_11.24-3_R1_S5.fastq.gz
-rw-rw-r-- 1 psgendb psgendb 270292798 Feb 26 07:57 HI.3992.004.Index_11.24-3_R2_S5.fastq.gz
-rw-rw-r-- 1 psgendb psgendb 182257703 Feb 26 11:54 HI.3992.004.Index_1.18-1_R1_S5.fastq.gz
-rw-rw-r-- 1 psgendb psgendb 180472929 Feb 26 11:54 HI.3992.004.Index_1.18-1_R2_S5.fastq.gz
-rw-rw-r-- 1 psgendb psgendb 156844236 Feb 26 11:54 HI.3992.004.Index_3.18-2_R1_S5.fastq.gz
-rw-rw-r-- 1 psgendb psgendb 167542797 Feb 26 11:54 HI.3992.004.Index_3.18-2_R2_S5.fastq.gz
-rw-rw-r-- 1 psgendb psgendb 153937166 Feb 26 07:58 HI.3992.004.Index_8.24-1_R1_S5.fastq.gz
-rw-rw-r-- 1 psgendb psgendb 161674265 Feb 26 07:58 HI.3992.004.Index_8.24-1_R2_S5.fastq.gz


Finally, we'd like to eliminate redundancy from the link names, which can be done in a single step. All names contain the string "_S5", so let's delete that from the link names.

 Select the links as shown, and choose File --> Rename.



Set the target pattern to '_S5', and press Run.

Each pair of read files now has short, easy to distinguish names, that will be used in all subsequent steps.


1. Check raw reads for quality using SeqKit and FASTQC

In any analytical process, it is better to have a smaller amount of high quality data than a large amount of poor data. It is important therefore to eliminate any read files of poor quality, which would otherwise lead to a low-quality transcriptome.

Some summary statistics on the read files can be produced by SeqKit. Select the links (NOT the original files themsleves) and choose Reads --> SeqKit stats.
By default, the number of threads used is 1/4 of the number of available cores, or 1, whichever is greater. For most datasets, this is not a very time consuming step. Therefore, it is usually unnecessary to use additional cores.

Simply click on Run.


Note: For this small demonstration dataset, results should pop up in a minute or two. For real datasets, processing time could go for an hour or more. You may wish to check Yes for "Notify of completion by email", and type in your email address to be notified when the results are ready.


In this case, all of the files look fairly similar in terms of the number of reads and the average length of reads.

At this point, if there was a file that looked to be aberrant and you knew that it should be discarded, simply select the name of the bad file and delete the link using File --> Deletefiles.



FASTQC performs 11 quality checks on sequencing read files. You can generate FASTQC reports for all of your read files in a single step. Select the links (NOT the original files themselves, and choose

Typical read files are larger than this sample dataset, so with larger files, you may wish to speed up FASTQC by setting a larger number of cores. In most cases, results will appear soon enough that you don't need to set notification of completion by email.

Click on Run.





blreads now lists the output files from FASTQC. Files with the .html extension are viewable in any web browser. The .zip files are the accompanying graphics used by the web files.

To view any report, select an HTML file and choose File --> View file. The HTML file will pop up in the browser.


The FASTQC report for 18-1_R1.fastq.gz is shown below:


While each of the 11 quality reports need to be examined for each set of reads, it is worth noting that the Per base sequence quality report is one of the most important. This graph shows the mean and distribution of quality scores (Y-axis) for each position in the population of thousands of reads (X-asis). Good quality data will high quality scores for all positions in the reads, although for larger reads, there may be more errors, and hence lower quality, at the 3' regions of reads.

This report has more reads below the quality level of 30, which wouldn't necessarily prevent us from using this data, but may require filtering reads by quality level later if error correction can't fix these reads. All other files in this sample dataset had higher quality scores, almost entirely above 30.

These data illustrate the importance of reviewing FASTQC reports at each step of preprocessing.

More complete information on the interpretation of FASTQC reports can be found at the FASTQC web site.

2. Remove sequencing adapters using Trimmomatic

For large RNAseq datasets, we have found that read files generated using trim_galore cause Trinity, rnaspades and SOAPdenovo-Trans to terminate without completing the assembly. Assemblies were successful using reads corrected by Trimmomatic. Therefore, this tutorial will use Trimmomatic. Trimmomatic is more complex to use than trim_galore, because it has a wider array of choices for read trimming.. Trimmomatic can be run from blreads using Reads --> Trimmomatic.


Trimmomatic searches for commonly-used sequencing adaptors and removes them for the ends of sequencing reads. This step is essential before proceeding to transcriptome assembly.

When trimming paired-end reads, Trimmomatic needs to know which files represent read pairs. This point is critical, because if one read of a read pair is discarded (eg. for low quality or being too short) the other read must be discarded too.

The script guesspairs.py partly automates selection of paired files. First select ONLY the links to the read files, and then choose File --> guesspairs.py.
Most Illumina services use R1 and R2 as the substrings which distinguish the left and right read pair files for a given library, so by default, R1 and R2 is set for "unique string for left/right reads"

For this dataset, all read files use the fastq.gz extension. (By default this will be set to .fq.gz). In some cases, you may need to specify more than one set of file extensions as a comma-separated list eg. .fq,.fastq Clicking on the Hints button will give a more detailed explanation of these parameters.



Clicking on Run will bring up a new blreads window with the best quess of file pairing, in two columns. Usually, guesspairs.py gets it right. Files for which a pair cannot be found (ie. single-end reads) would be listed in a single column.

To run Trimmomatic with these read pairs, choose Edit --> SelectAll, and then Reads --> Trimmomatic.


General - The parameters for Trimmomatic are grouped into four tabs. The General tab has basic settings. By default, output will go to the ../reads.Trimmomatic directory.

Trimmomatic performs each of the processing steps in an order specified by the user. Individual parameters can be turned on or off, and the order in which they are performed is set by the rank parameter. 


Clipadapt - By default, the ILLUMINACLIP step is performed first (ie. rank=1). The defaults are those given in the Trimmomatic manual, with the exception that keep both reads of read-throughs is set to true. This may help to avoid cases where a single read of a read pair is deleted during trimming, which can cause some transcriptome assembly programs to crash.


Quality - By default, SLIDINGWINDOW and AVGQUAL are off. Set MAXINFO to Yes. Since rank is set to 3, we don't need to change this for MAXINFO to be performed as the next step.



Cropping - It's a good idea to eliminate poor quality nucleotides from 3' and 5' ends of reads, so set LEADING and TRAILING to Yes.

Finally, the last step is MINLEN, which eliminates read pairs below a specific length. For 100 bp reads a value of 40 is a reasonable compromise between very short reads, which would be difficult or impossible to uniquely map to transcripts, and much longer reads, which might compromise coverage of reads located in the 5' ends of transcripts. No rank is assigned, because this step must be done after all other trimming steps are completed.



When Trimmomatic has processed all files, a new blreads window will pop up.  It is important to note that this new window is running in the reads.Trimmomatic directory. At this point, one can probably close the previous blreads window, which is running in the raw directory. The next steps will be done in the reads.Trimmomatic directory.

(Optional: You may wish to run FASTQC at this point to verify that the properties of the processed reads are similar to the original reads.)

For each raw read file, there are now two  output files:
  • files ending in P.fastq, which have reads in which both pairs survive the trimming process
  • files ending in U.fastq, which have reads in which only one of the pair survived trimming.
Because unpaired reads will cause some transcriptome assembly programs to crash, we will only use the paired files in the next step. Note that in all cases the unpaired files are several orders of magnitude smaller than the paired files. Therefore, the added information that might otherwise be gained from these reads is negligible.





Next: Correction of errors in RNA sequencing reads