fixfq.py - remove corrupted reads from a fastq file

update June 10, 2021

NAME

fixfq.py - Divide a fastq file into two files, one with validated reads, and the other with reads that do not comply with the fastq format or are too short.

SYNOPSIS

fixfq.py fastqfile [--minseq integer]
fixfq.py --filelist namefile [--minseq integer]

DESCRIPTION

fixfq.py divides the reads from a fastq file into two files, one for valid reads (validfile) and the other for bad reads (badfile). Each read is processed in two steps. First, the next presumptive read is read from the file. A presumptive read begins with a line in which the first character is "@", followed by up to three more lines. Second, the presumptive read is checked. If it has fewer than 4 lines, it is written to badfile. If it has 4 lines, it must comply with the following criteria:

line 2, Sequence line - must contain only the characters AGCTN.

line 3, Separator line - must begin with "+"

line 4, Quality line - must be exactly the same length as the Sequence line and must contain only Phred33 quality characters (ASCII 33 - 126).

If all criteria are met, the read is written to validfile. Otherwise, the read is written to badfile. If the next line after the current read does not begin with "@", lines are written to badfile until a new presumptive read is found, beginning with "@".
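
The criteria above can be sketched as a small Python function. This is only an illustration, not the code used by fixfq.py: the name is_valid_record is hypothetical, lines are assumed to have had their newlines removed, and the sequence check is assumed to be case-sensitive (uppercase AGCTN only).

```python
def is_valid_record(lines):
    """Return True if a presumptive read meets the format criteria.

    lines: list of strings (newlines already stripped), starting
    with the "@" header line. Hypothetical helper for illustration.
    """
    # A complete read has exactly 4 lines.
    if len(lines) != 4:
        return False
    header, seq, sep, qual = lines
    # Line 1: header must begin with "@".
    if not header.startswith("@"):
        return False
    # Line 2: sequence may contain only A, G, C, T, N
    # (assumed case-sensitive here).
    if not set(seq) <= set("AGCTN"):
        return False
    # Line 3: separator must begin with "+".
    if not sep.startswith("+"):
        return False
    # Line 4: quality must match the sequence length ...
    if len(qual) != len(seq):
        return False
    # ... and contain only Phred33 characters (ASCII 33 - 126).
    return all(33 <= ord(c) <= 126 for c in qual)
```

The minimum length test (--minseq) is deliberately left out of this sketch, since it is a separate criterion from format compliance.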

It is worth noting that since "@" is a legal quality character, we can't use that as a way to prescreen the presumptive read. That is, we can't assume it is a truncated read just because the fourth line after the first @ also starts with @.
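
To show why "@" alone cannot mark a record boundary, here is a minimal Python sketch of grouping lines into presumptive reads. It is not the fixfq.py implementation. It assumes that once a header line is found, the next three lines are taken as they are (fewer only at end of file), and that lines not starting with "@" between reads are collected as stray lines. The name split_presumptive_reads is hypothetical.

```python
def split_presumptive_reads(lines):
    """Yield ("read", chunk) or ("stray", chunk) tuples.

    lines: list of strings with newlines already stripped.
    Hypothetical helper; the grouping rule is an assumption.
    """
    i = 0
    n = len(lines)
    while i < n:
        if lines[i].startswith("@"):
            # Header plus up to three following lines, taken without
            # looking at their first character, because a quality
            # line may legally begin with "@".
            chunk = lines[i:i + 4]
            yield ("read", chunk)
            i += len(chunk)
        else:
            # Collect stray lines until the next line starting with "@".
            j = i
            while j < n and not lines[j].startswith("@"):
                j += 1
            yield ("stray", lines[i:j])
            i = j


example = ["@r1", "ACGT", "+", "@III", "junk", "@r2", "AC"]
groups = list(split_presumptive_reads(example))
```

In this example the quality line "@III" stays inside the first read, "junk" is reported as a stray line, and the final read is truncated to two lines.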

OPTIONS

fastqfile - A single fastq file to be processed. The name of fastqfile is presumed to have a file extension, typically .fastq or .fq. To create a basename for the output files, the file extension is removed. Output fastq files are given the .fq file extension.

--filelist namefile - If specified, names of files to be processed are read from namefile.

namefile - a list of fastq files to be processed, one file name per line.

validfile - Reads meeting all criteria are written to this file. For example, if the original fastq file was exp1.fq, validfile would be exp1_valid.fq.

badfile - Reads that do not meet all criteria are written to this file. For example, if the original fastq file was exp1.fq, badfile would be exp1_bad.fq.
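
The naming rule for validfile and badfile can be sketched in Python as follows. This is an illustration based on the examples above, not the code from fixfq.py. The function name output_names is hypothetical, and it assumes that only the last extension is removed.

```python
import os


def output_names(fastqfile):
    """Return (validfile, badfile) names for an input fastq file.

    Hypothetical helper: strips the last file extension and
    appends _valid.fq or _bad.fq to the remaining basename.
    """
    base, _ext = os.path.splitext(fastqfile)
    return base + "_valid.fq", base + "_bad.fq"


valid_name, bad_name = output_names("exp1.fq")
```

Under these assumptions an input named exp1.fastq would give the same output names as exp1.fq, because the output files always use the .fq extension.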

--minseq integer (default: 50) - Reads shorter than minseq nucleotides are considered bad reads and are written to badfile.
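
As a rough sketch of how the command line in the SYNOPSIS and the --minseq test could be expressed in Python with argparse: the names build_parser and too_short are hypothetical, and the actual fixfq.py may parse its arguments differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the SYNOPSIS (illustrative only)."""
    parser = argparse.ArgumentParser(prog="fixfq.py")
    # fastqfile is optional here because --filelist may be given instead.
    parser.add_argument("fastqfile", nargs="?",
                        help="a single fastq file to be processed")
    parser.add_argument("--filelist", metavar="namefile",
                        help="file listing fastq files, one name per line")
    parser.add_argument("--minseq", type=int, default=50,
                        help="minimum read length in nucleotides")
    return parser


def too_short(seq, minseq=50):
    """True if the sequence has fewer than minseq nucleotides."""
    return len(seq) < minseq


args = build_parser().parse_args(["exp1.fq", "--minseq", "30"])
```

With this sketch, a read of exactly minseq nucleotides is kept, and only reads with fewer nucleotides are treated as bad.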

REFERENCES

Cock PJA et al. (2010) The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 38(6): 1767–1771.
doi: 10.1093/nar/gkp1137

AUTHOR

Dr. Brian Fristensky
Department of Plant Science
University of Manitoba
Winnipeg, MB Canada R3T 2N2
brian.fristensky@umanitoba.ca
http://home.cc.umanitoba.ca/~frist