Preprocessing of sequencing reads

November 30, 2023

References

FastQC - http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Trimmomatic User's Guide

SeqKit

0. Obtain sequencing read files

DATASET: Fakankun I et al. Ph.D. thesis, University of Manitoba (in progress) Rhodosporidium diobovatum.
This dataset is a random sample of about 5% of the reads from fungal genomic DNA.
NCBI BioSample SAMN09284860

raw read files (Illumina, paired end)	insert size (nt)
DL300_S1_L001_R1_001_sample.fastq.gz DL300_S1_L001_R2_001_sample.fastq.gz	300
DL400_S2_L001_R1_001_sample.fastq.gz DL400_S2_L001_R2_001_sample.fastq.gz	400
DL700_S3_L001_R1_001_sample.fastq.gz DL700_S3_L001_R2_001_sample.fastq.gz	700

Since genome assembly requires several steps, it is best to organize the files for each step into a separate directory. In addition to keeping things simple, this organization gives us more flexibility for saving files offline, compressing files, or ignoring sets of files for backup.

Create a new directory called blreads/genome/reads, and then go to that directory:

mkdir blreads/genomecd blreads/genome mkdir reads cd reads
Next, download the fastq.gz files found at
https://home.cc.umanitoba.ca/~psgendb/birchhomedir/FTP/BIRCH/data/blreads/genome/GenomeFiles.html
The files contain raw sequencing reads, which should be saved to your reads directory. The total size of these files is around 127 Mb.

Check to make sure that these files have downloaded correctly

ls -l-rw-rw-r-- 1 psgendb psgendb 24493198 Dec 31 14:12 DL300_S1_L001_R1_001_sample.fastq.gz-rw-rw-r-- 1 psgendb psgendb 27005117 Dec 31 14:12 DL300_S1_L001_R2_001_sample.fastq.gz-rw-rw-r-- 1 psgendb psgendb 20560594 Dec 31 14:12 DL400_S2_L001_R1_001_sample.fastq.gz-rw-rw-r-- 1 psgendb psgendb 22777206 Dec 31 14:12 DL400_S2_L001_R2_001_sample.fastq.gz-rw-rw-r-- 1 psgendb psgendb 15246265 Dec 31 14:13 DL700_S3_L001_R1_001_sample.fastq.gz-rw-rw-r-- 1 psgendb psgendb 17306339 Dec 31 14:13 DL700_S3_L001_R2_001_sample.fastq.gz

Create symbolic links with short, easy to distinguish names

Sequencing read files typically have long, complex names. Aside from being difficult to work with at the command line, it takes more work to be sure that, at any given point, you are working with the correct file. At the same time, it is best not to modify the names of the original files for reference to the original sequencing runs.

A good solution is to create symbolic links with short, meaningful names, to the read files. All subsequent steps in the assembly pipeline will be done using these shorter names. Usually, this will be a 2 step process: first creating symbolic links with shorter names, and then renaming these links with even simpler names.

Launch blreads by typing 'blreads &'.
(Here, we're adding & to the blreads command to run blreads in the background. That will allow us to continue using the command line in the same terminal window in which blreads was launched.)
Note that the path for the current working directory is listed on the first line of blreads. Any line begining with a hash mark (#) is a comment, and will be ignored.

While this first step has to be done one file at a time, we are setting up filenames so that we can work on groups of files in subsequent steps. Choose File --> Create symbolic links with short names.

Choose the first file to which you want to make a link.

To make the name of the link shorter, type in a target pattern that is common in two or more files. At right, the pattern is "_S1_L001". Since the short pattern field is left blank, this string will simply be omitted from the link name. The link name will be DL300_R1_001_sample.fastq.gz.

The name of the link will appear in the blreads window. Note that the type field of the original file has 'f', to indicate a file, and the type field for the link is 'l'. to indicate a link.

Repeat this process for DL300_S1_L001_R2_001_sample.fastq.gz. In this case, the target pattern will be the same. For the next two files, the target pattern will be _S2_L001, and for the last two files, the target will be _S3_L001.

When you have completed the process for all six files, blreads should look like this:

For each set of paired-end read files (R1 and R2), there will be a corresponding pair of symbolic links with shorter names.

Finally, we'd like to eliminate redundancy from the link names, which can be done in a single step. All names contain the string "_001_sample", so let's delete that from the link names.

To avoid accidentally changing the names of the original files, first sort the files based on the Type field, so that all links appear together in blreads. Choose Edit --> BLSORT. Set the 1st sort key to column 4 (ie. Type), and Sort order to Ascending. Choose Run:Output to this window.

All links will now be together in blreads. Select the links as shown, and choose File --> Rename.

Set the target pattern to '_001_sample', and press Run.

Each pair of read files now has short, easy to distinguish names, that will be used in all subsequent steps.

It is best to verify that the links point to the intended files by typing 'ls -l' at the command line:

On many systems, the bash shell will highlight the names of symbolic links in aqua if files that the links point to exist. If the files don't exist, the names will be in red. This acts as a check that the links are correct. If for some reason your links point to non-existent files, delete the links (not the original files!) using File --> DeleteFiles, or the command line, and then re-create the links.

1. Check reads for quality using SeqKit and FASTQC

In any analytical process, it is better to have a smaller amount of high quality data than a large amount of poor data. It is important therefore to eliminate any read files of poor quality, which would otherwise lead to a low-quality genome.

Some summary statistics on the read files can be produced by SeqKit. Select the links (NOT the original files themsleves) and choose Reads --> SeqKit stats.

By default, the number of threads used is 1/4 of the number of available cores, or 1, whichever is greater. For most datasets, this is not a very time consuming step. Therefore, it is usually unnecessary to use additional cores.

Simply click on Run.

In this case, all of the files look fairly similar in terms of the number of reads and the average length of reads.

At this point, if there was a file that looked to be aberrant and you knew that it should be discarded, simply select the name of the bad file and delete the link using File --> Deletefiles.

FASTQC performs 11 quality checks on sequencing read files. You can generate FASTQC reports for all of your read files in a single step. Select the links (NOT the original files themselves, and choose

Typical read files are larger than this sample dataset, so with larger files, you may wish to speed up FASTQC by setting a larger number of cores. In most cases, results will appear soon enough that you don't need to set notification of completion by email.

Simply click on Run.

blreads now lists the output files from FASTQC. Files with the .html extension are viewable in any web browser. The .zip files are the accompanying graphics used by the web files.

To view any report, select an HTML file and choose File --> View file. The HTML file will pop up in the browser.

The FASTQC report for DL300_R1.fastq.gz is shown below:

While each of the 11 quality reports need to be examined for each set of reads, it is worth noting that the Per base sequence quality report is one of the most important. This graph shows the mean and distribution of quality scores (Y-axis) for each position in the population of thousands of reads (X-asis). Good quality data will high quality scores for all positions in the reads, although for larger reads, there may be more errors, and hence lower quality, at the 3' regions of reads.

More complete information on the interpretation of FASTQC reports can be found at the FASTQC web site.

2. Remove sequencing adapters using Trimmomatic

Trimmomatic searches for commonly-used sequencing adaptors and removes them for the ends of sequencing reads. This step is essential before proceeding to genome assembly. First, select the links to your read files. Since these are paired-end reads, we need to be able to tell BioLegato which pairs of files are the left and right reads for each DNA sample.

The script guesspairs.py partly automates this process. In the output window from fastqc, first choose Edit --> Select All, and then choose File --> guesspairs.py.

Most Illumina services use R1 and R2 as the substrings which distinguish the left and right read pair files for a given library, so by default, R1 and R2 is set for "unique string for left/right reads"

We only want to process read files, and not the other files in the directory. For this dataset, all read files use the fastq.gz extension. Click on Yes and type in fastq.gz. Clicking on the Hints button will give a more detailed explantion of these parameters.

A new blreads window will appear with the left and right read files in two columns. Both the symbolic links and the long-named read files appear. Since we prefer to work with the short-named links, these can be selected by clicking on Selection mode: row below, and then clicking on the short-named read files. (This accomplishes the same thing as sorting did above.)

Next choose Reads --> Trimmomatic.

General - The parameters for Trimmomatic are grouped into four tabs. The General tab has basic settings. By default, output will go to the ../reads.Trimmomatic directory.

Trimmomatic performs each of the processing steps in an order specified by the user. Individual parameters can be turned on or off, and the order in which they are performed is set by the rank parameter.

Clipadapt - By default, the ILLUMINACLIP step is performed first (ie. rank=1). The defaults are those given in the Trimmomatic manual, with the exception that keep both reads of read-throughs is set to true. This may help to avoid cases where a single read of a read pair is deleted during trimming, which can cause some transcriptome assembly programs to crash.

Quality - By default, SLIDINGWINDOW and AVGQUAL are off. Set MAXINFO to Yes. Since rank is set to 3, we don't need to change this for MAXINFO to be performed as the next step.

Cropping - It's a good idea to eliminate poor quality nucleotides from 3' and 5' ends of reads, so set LEADING and TRAILING to Yes. Set the required quality score to trim at 30, rather than 20, which will shorten some reads, but the remaining bases will be of higher quality.

Finally, the last step is MINLEN, which eliminates read pairs below a specific length. For 150 bp reads a value of 100 is a reasonable compromise between very short reads, which would be difficult or impossible to uniquely map to transcripts, and much longer reads, which might compromise coverage of reads located in the 5' ends of transcripts. No rank is assigned, because this step must be done after all other trimming steps are completed.

By default, a new directory will be created for the output, specified in Name for output directory. The default will be to create a new directory called reads.Trimmomatic in the parent directory.

When Trimmomatic has processed all files, a new blreads window will pop up. It is important to note that this new window is running in the reads.Trimmomatic directory. At this point, one can probably close the previous blreads window, which is running in the reads directory. The next steps will be done in the reads.Trimmomatic directory.

First, notice that by default, trim_galore automatically runs FASTQC, so we have .zip and .html reports for each set of reads after trimming.

For each of the original read files, there are now two output files.

filenames ending in P_fastq - Where both forward and reverse reads survived trimming, the respective reads are written to files ending in 1P.fastq and 2P.fastq, respectively
filenames ending in U_fastq - Where only one of the paired reads survived trimming, the singleton reads are written to files ending in 1U.fastq and 2U.fastq, respectively.

For further steps, we will only use the P_fastq files, mainly because unpaired reads may cause some assembly programs to crash. As well, the unpaired reads are relatively small in numbers, so they wouldn't contribute much to the final assembly.

Quality of the trimmed reads

A quick indication of the trimming results can be seen in bl_trimmomatic.log. You can open this file using File --> View file, or simply open the file in a text editor.

Excerpts from this file are shown for each pair or read files:

DL300
Input Read Pairs: 248386 Both Surviving: 240957 (97.01%) Forward Only Surviving: 1024 (0.41%) Reverse Only Surviving: 319 (0.13%) Dropped: 6086 (2.45%)
DL400
Input Read Pairs: 205842 Both Surviving: 197238 (95.82%) Forward Only Surviving: 925 (0.45%) Reverse Only Surviving: 299 (0.15%) Dropped: 7380 (3.59%)
DL700
Input Read Pairs: 147405 Both Surviving: 141635 (96.09%) Forward Only Surviving: 936 (0.63%) Reverse Only Surviving: 234 (0.16%) Dropped: 4600 (3.12%)

These are good results, since only a few percent of the read pairs had to be discarded.

(Optional: You may wish to run FASTQC at this point to verify that the properties of the processed reads are similar to the original reads. Most notably, the Phred quality scores near the 3' ends of reads should be improved in the Trimmomatic output.)

Between the FASTQC results and the trimming summary, you get a pretty good idea of the quality of the trimmed dataset. If for some reason the quality of trimmed reads in pair of files is lower than expected, try rerunning Trimmomatic on that read pair with different parameter settings.

Once you are satisfied with the dataset, your reads are ready for error correction.

Next: Correction of errors in DNA sequencing reads

TUTORIAL: Genome Assembly

Preprocessing of sequencing reads

0. Obtain sequencing read files

Create symbolic links with short, easy to distinguish names

1. Check reads for quality using SeqKit and FASTQC

2. Remove sequencing adapters using Trimmomatic