BIRCH

Previous

TUTORIAL: Genome Assembly

Preprocessing of sequencing reads


  November 4, 2021

Next page


References

FastQC - http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

trim_galore User's Guide
trim_galore manual

Trimmomatic User's Guide

SeqKit


0. Obtain sequencing read files

DATASET: Fakankun I et al. Ph.D. thesis, University of Manitoba (in progress) Rhodosporidium diobovatum.
This dataset is a random sample of about 5% of the reads from fungal genomic DNA.
NCBI BioSample SAMN09284860

raw read files (Illumina, paired end)
insert size (nt)
DL300_S1_L001_R1_001_sample.fastq.gz
DL300_S1_L001_R2_001_sample.fastq.gz
300
DL400_S2_L001_R1_001_sample.fastq.gz
DL400_S2_L001_R2_001_sample.fastq.gz
400
DL700_S3_L001_R1_001_sample.fastq.gz
DL700_S3_L001_R2_001_sample.fastq.gz
700

Since genome assembly requires several steps, it is best to organize the files for each step into a separate directory. In addition to keeping things simple, this organization gives us more flexibility for saving files offline, compressing files, or ignoring sets of files for backup.

In your tutorials directory create a new directory called blreads/genome/reads, and then go to that directory:

mkdir blreads/genome
cd blreads/genome
mkdir reads
cd reads

Next, download the fastq.gz files found at
https://home.cc.umanitoba.ca/~psgendb/birchhomedir/FTP/BIRCH/data/blreads/genome/GenomeFiles.html
The files contain raw sequencing reads, which should be saved to your reads directory. The total size of these files is around 127 Mb.

Check to make sure that these files have downloaded correctly

ls -l
-rw-rw-r-- 1 psgendb psgendb 24493198 Dec 31 14:12 DL300_S1_L001_R1_001_sample.fastq.gz
-rw-rw-r-- 1 psgendb psgendb 27005117 Dec 31 14:12 DL300_S1_L001_R2_001_sample.fastq.gz
-rw-rw-r-- 1 psgendb psgendb 20560594 Dec 31 14:12 DL400_S2_L001_R1_001_sample.fastq.gz
-rw-rw-r-- 1 psgendb psgendb 22777206 Dec 31 14:12 DL400_S2_L001_R2_001_sample.fastq.gz
-rw-rw-r-- 1 psgendb psgendb 15246265 Dec 31 14:13 DL700_S3_L001_R1_001_sample.fastq.gz
-rw-rw-r-- 1 psgendb psgendb 17306339 Dec 31 14:13 DL700_S3_L001_R2_001_sample.fastq.gz


Create symbolic links with short, easy to distinguish names

Sequencing read files typically have long, complex names. Aside from being difficult to work with at the command line, it takes more work to be sure that, at any given point, you are working with the correct file. At the same time, it is best not to modify the names of the original files for reference to the original sequencing runs.

A good solution is to create symbolic links with short, meaningful names, to the  read files. All subsequent steps in the assembly pipeline will be done using these shorter names. Usually, this will be a 2 step process: first creating symbolic links with shorter names, and then renaming these links with even simpler names.

Launch blreads by typing 'blreads &'.
(Here, we're adding & to the blreads command to run blreads in the background. That will allow us to continue using the command line in the same terminal window in which blreads was launched.)
Note that the path for the current working directory is listed on the first line of blreads. Any line begining with a hash mark (#) is a comment, and will be ignored.


 
While this first step has to be done one file at a time, we are setting up filenames so that we can work on groups of files in subsequent steps. Choose File --> Create symbolic links with short names.


Choose the first file to which you want to make a link.

To make the name of the link shorter, type in a target pattern that is common in two or more files. At right, the pattern is "_S1_L001". Since the short pattern field is left blank, this string will simply be omitted from the link name. The link name will be DL300_R1_001_sample.fastq.gz.



The name of the link will appear in the blreads window. Note that the type field of the original file has 'f', to indicate a file, and the type field for the link is 'l'. to indicate a link.


Repeat this process for DL300_S1_L001_R2_001_sample.fastq.gz.  In this case, the target pattern will be the same. For the next two files, the target pattern will be _S2_L001, and for the last two files, the target will be _S3_L001.

When you have completed the process for all six files, blreads should look like this:

For each set of paired-end read files (R1 and R2), there will be a corresponding pair of symbolic links with shorter names.


Finally, we'd like to eliminate redundancy from the link names, which can be done in a single step. All names contain the string "_001_sample", so let's delete that from the link names.

To avoid accidentally changing the names of the original files, first sort the files based on the Type field, so that all links appear together in blreads. Choose Edit --> BLSORT. Set the 1st sort key to column 4 (ie. Type), and Sort order to Ascending. Choose Run:Output to this window.



All links will now be together in blreads. Select the links as shown, and choose File --> Rename.



Set the target pattern to '_001_sample', and press Run.

Each pair of read files now has short, easy to distinguish names, that will be used in all subsequent steps.


It is best to verify that the links point to the intended files by typing 'ls -l' at the command line:



On many systems, the bash shell will highlight the names of symbolic links in aqua if files that the links point to exist. If the files don't exist, the names will be in red. This acts as a check that the links are correct. If for some reason your links point to non-existent files, delete the links (not the original files!) using File --> DeleteFiles, or the command line, and then re-create the links.

1. Check  reads for quality using SeqKit and FASTQC

In any analytical process, it is better to have a smaller amount of high quality data than a large amount of poor data. It is important therefore to eliminate any read files of poor quality, which would otherwise lead to a low-quality genome.

Some summary statistics on the read files can be produced by SeqKit. Select the links (NOT the original files themsleves) and choose Reads --> SeqKit stats.
By default, the number of threads used is 1/4 of the number of available cores, or 1, whichever is greater. For most datasets, this is not a very time consuming step. Therefore, it is usually unnecessary to use additional cores.

Simply click on Run.


In this case, all of the files look fairly similar in terms of the number of reads and the average length of reads.

At this point, if there was a file that looked to be aberrant and you knew that it should be discarded, simply select the name of the bad file and delete the link using File --> Deletefiles.



FASTQC performs 11 quality checks on sequencing read files. You can generate FASTQC reports for all of your read files in a single step. Select the links (NOT the original files themselves, and choose

Typical read files are larger than this sample dataset, so with larger files, you may wish to speed up FASTQC by setting a larger number of cores. In most cases, results will appear soon enough that you don't need to set notification of completion by email.

Simply click on Run.





blreads now lists the output files from FASTQC. Files with the .html extension are viewable in any web browser. The .zip files are the accompanying graphics used by the web files.

To view any report, select an HTML file and choose File --> View file. The HTML file will pop up in the browser.


The FASTQC report for DL300_R1.fastq.gz is shown below:


While each of the 11 quality reports need to be examined for each set of reads, it is worth noting that the Per base sequence quality report is one of the most important. This graph shows the mean and distribution of quality scores (Y-axis) for each position in the population of thousands of reads (X-asis). Good quality data will high quality scores for all positions in the reads, although for larger reads, there may be more errors, and hence lower quality, at the 3' regions of reads.

More complete information on the interpretation of FASTQC reports can be found at the FASTQC web site.

2. Remove sequencing adapters using trim_galore


For most datasets, trim_galore does a good job of trimming adaptors off of reads. However, for some paired-end datasets, trim_galore may create 0-length reads, or eliminate a read altogether, which confuses some assembly programs, particularly spades. If such problems arise in downstream analysis, go back to the this step and trim using Trimmomatic. Trimmomatic is more complex to use, but results in assemblies that are about the same as those using reads trimmed by trim_galore. Trimmomatic can be run from blreads using Reads --> Trimmomatic.

Trim_galore searches for commonly-used sequencing adaptors and removes them for the ends of sequencing reads. This step is essential before proceeding to genome assembly. First, select the links to your read files. Since these are paired-end reads, we need to be able to tell BioLegato which pairs of files are the left and right reads for each DNA sample.

The script guesspairs.py partly automates this process. In the output window from fastqc, choose File --> guesspairs.py.
Most Illumina services use R1 and R2 as the substrings which distinguish the left and right read pair files for a given library, so by default, R1 and R2 is set for "unique string for left/right reads"

We only want to process read files, and not the other files in the directory. For this dataset, all read files use the .fastq.gz extension. Clicking on the Hints button will give a more detailed explantion of these parameters.


A new blreads window will appear with the left and right read files in two columns. Both the symbolic links and the long-named read files appear. Since we prefer to work with the short-named links, these can be selected by clicking on Selection mode: row below, and then clicking on the short-named read files. (This accomplishes the same thing as sorting did above.)


Next choose Reads --> trim_galore.

While trim_galore has a lot of options, in most cases you can run it with the defaults. One exception would be to set the Min. Phred quality score to trim at 30, rather than 20, which will shorten some reads, but the remaining bases will be of higher quality.

Based on the quality scores in the FASTQC files, we can also set discard reads shorter than to 100, because the number of reads in the Sequence Length Distribution reports show that the number of reads shorter than 100 are negligible.

Set compress output to No, because some read correction programs such as Pollux can't work with compressed read files.

By default, a new directory will be created for the output, specified in Name for output directory. The default will be to create a new directory called reads.trimmed in the parent directory.

It is worth reading through the Trim_galore User's Guide before running trim_galore, especially if your reads use unusual adaptors.


When trim_galore has processed all files, a new blreads window will pop up. It is important to note that this new window is running in the reads.trimmed directory. At this point, one can probably close the previous blreads window, which is running in the reads directory. The next steps will be done in the reads.trimmed directory.

First, notice that by default, trim_galore automatically runs FASTQC, so we have .zip and .html reports for each set of reads after trimming.

For each  read file, there are now three output files.

  • trimmed reads (_val_x.fq.gz)
  • summary of trimming results (_trimming_report.txt)




Between the FASTQC results and the trimming summary, you get a pretty good idea of the quality of the trimmed dataset. For example, the summary will probably tell you that a few percent of the reads were eliminated for being too short, which is a good thing. Usually some percentage of reads with adaptors (eg. 12%) will have been trimmed.

If for some reason the quality of trimmed reads in one file is lower than expected, try rerunning trim_galore with different parameter settings.

Once you are satisfied with the dataset, your reads are ready for error correction.


Next: Correction of errors in DNA sequencing reads