Read correction

June 10, 2019

References
Marnier E et al. (2015) Pollux: platform independent error correction of single and mixed genomes. BMC Bioinformatics201516:10 DOI: 10.1186/s12859-014-0435-6

Pollux manual

3. Correct errors in trimmed sequencing reads using Pollux

Pollux offers a number of improvements over previous methods for correcting DNA sequencing reads:

31 nt k-mers; so large that we don't have to worry about them occurring in the genome by random chance
larger k-word size for faster, more accurate error-checking
makes no assumptions about quality scoring
corrects errors regardless of the sequencing technology
good at correcting errors in homopolymer tracts

The goal is to perform errorr correction on the trimmed read files (_val_x_fq) from the previous step. If you closed the blreads window, simply start another instance of blreads in the reads.trim_galore directory.

OPTIONAL: HOW TO DECOMPRESS GZIPPED FILES

In the previous step, the instructions said to set compress output files using gzip to No. If, however, you did have trimmed reads in compressed form, blreads has an uncompress function. If your reads are not compressed, skip to Running Pollux below.

Unfortunately, Pollux can't read compressed fastq files, so we have to decompress these files. First, select all compressed fastq.gz files.

Next, choose File --> Decompress and click on Run.

On large files, Decompress may take a few minutes.

When complete, the window will be refreshed, showing that the .gz extension is gone from the fastq (.fq) files.

Running pollux

Your blreads window should now appear as shown at right. The read files will have the .fq file extension.

As before, we need to tell blreads which read files should be paired together for paired-end reads. Choose File --> guesspairs.py.

We only want to process read files, and not the other files in the directory. For this dataset, all read files use the .fq extension. In some cases, you may need to specify more than one set of file extensions as a comma-separated list eg. .fq,.fastq Clicking on the Hints button will give a more detailed explanation of these parameters.

Clicking on Run will bring up a new blreads window with the best quess of file pairing, in two columns. Usually, guesspairs.py gets it right. Files for which a pair cannot be found (ie. single-end reads) would be listed in a single column.

To run Pollux with these read pairs, choose Edit --> SelectAll, and then Reads --> Pollux.

By default, pollux will write corrected reads to a directory in the parent directory called reads.trimmed.pollux. To be more precise, let's call the output directory ../reads.trim_galore.pollux

Pollux may take awhile to run, so you may wish to set "Notify of completion by email" to Yes and type in an email address.

The output is written to the ../reads.trim_galore.pollux directory, whose contents is shown at right.

The report in pollux.log summarizes the numbers of errors corrected of different types eg. insertions, deletions, corrections, homopolymer corrections.

The high quality corrected reads are found in files with the .corrected.fq extension. Corrections based on low k-mer counts have the .low extension, and are generally not used.

Next: Assembly of contigs and scaffolds

TUTORIAL: Genome Assembly

Read correction

3. Correct errors in trimmed sequencing reads using Pollux