Updated June 10, 2021
NAME
fixfq.py - Divide a fastq file into
two files, one with validated reads, and the other with reads that
do not comply with the fastq format or are too short.
SYNOPSIS
fixfq.py fastqfile [--minseq integer]
fixfq.py --filelist namefile [--minseq integer]
DESCRIPTION
fixfq.py divides the reads from a fastq file into two
files, one for valid reads (validfile) and the other for bad reads
(badfile). Processing is done in two steps. First, the next
presumptive read in the file is read. A presumptive read begins
with a line in which the first character is "@", followed by up to
three lines. Second, the presumptive read is checked. If it has
fewer than 4 lines, it is written to badfile. If it has 4 lines,
it must comply with the following criteria:
- line 2, Sequence line - must contain only the characters
AGCTN.
- line 3, Separator line - must begin with "+"
- line 4, Quality line - must be exactly the same length as
the Sequence line and must contain only Phred33 quality
characters (ASCII 33 - 126).
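The checks listed above can be sketched in Python as follows. This is a minimal illustration and not the code of fixfq.py itself: the function name is_valid_read and the constant VALID_BASES are hypothetical, lines are assumed to have had their newlines stripped already, and the length test reflects the --minseq option with its default of 50.

```python
# Hypothetical sketch of the per-read checks; not taken from fixfq.py.
VALID_BASES = frozenset("AGCTN")


def is_valid_read(lines, minseq=50):
    """Return True if a presumptive read passes every check.

    `lines` is a list of strings without trailing newlines.
    """
    # A complete read has exactly 4 lines.
    if len(lines) != 4:
        return False
    header, seq, sep, qual = lines
    # Line 1: header must begin with "@".
    if not header.startswith("@"):
        return False
    # Reads shorter than minseq nucleotides are bad reads.
    if len(seq) < minseq:
        return False
    # Line 2: only the characters AGCTN are allowed.
    if not set(seq) <= VALID_BASES:
        return False
    # Line 3: separator must begin with "+".
    if not sep.startswith("+"):
        return False
    # Line 4: same length as the sequence line.
    if len(qual) != len(seq):
        return False
    # Line 4: only Phred33 quality characters (ASCII 33 - 126).
    return all(33 <= ord(c) <= 126 for c in qual)
```

For example, is_valid_read(["@r1", "ACGT", "+", "@@@@"], minseq=4) is accepted, because "@" is a legal quality character, while the same read with the default minseq of 50 is rejected as too short.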
If all criteria are met, the read is written to validfile.
Otherwise, the read is written to badfile. If the next line after
the current read does not begin with "@", lines are written to
badfile until a new presumptive read is found, beginning with "@".
It is worth noting that since "@" is a legal quality character, we
can't use that as a way to prescreen the presumptive read. That
is, we can't assume it is a truncated read just because the fourth
line after the first @ also starts with @.
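One way to group lines into presumptive reads, consistent with the description above, is sketched below. It is an assumption about how the grouping could work, not the actual fixfq.py logic: the generator name presumptive_reads and the "read"/"stray" labels are hypothetical. After a header line, the next three lines are taken unconditionally (so a quality line starting with "@" is not mistaken for a new header), and a group shorter than 4 lines can therefore only occur at the end of the input.

```python
# Hypothetical sketch of grouping lines into presumptive reads.
def presumptive_reads(lines):
    """Yield ("read", [header + up to 3 lines]) or ("stray", [line]).

    `lines` is any iterable of strings, e.g. an open text file.
    """
    it = iter(lines)
    for line in it:
        line = line.rstrip("\n")
        if not line.startswith("@"):
            # Not a header: this line would go to badfile on its own.
            yield ("stray", [line])
            continue
        record = [line]
        # Take the following lines without inspecting their first
        # character, since "@" is a legal quality character.
        for _ in range(3):
            nxt = next(it, None)
            if nxt is None:
                break  # input ended: truncated read
            record.append(nxt.rstrip("\n"))
        yield ("read", record)
```

Each "read" group would then be checked against the criteria and written to validfile or badfile, while each "stray" line would be written directly to badfile.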
OPTIONS
fastqfile - A single fastq file to be processed.
The name of fastqfile is presumed to have a file extension, typically
.fastq or .fq. To create a basename for the output files, the file
extension is removed. Output fastq files are given the .fq file
extension.
--filelist namefile - If specified, names
of files to be processed are read from namefile.
namefile - a list of fastq files to be processed,
one file name per line.
validfile - Reads meeting all criteria are written to this
file. For example, if the original fastq file was exp1.fq,
validfile would be exp1_valid.fq.
badfile - Reads that do not meet all criteria are written
to this file. For example, if the original fastq file was exp1.fq,
badfile would be exp1_bad.fq.
--minseq integer (default: 50) - Reads shorter than
minseq nucleotides are considered bad reads, and are written to badfile.
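The naming rule for validfile and badfile, and the format of namefile, can be illustrated with the short Python sketch below. The function names output_names and read_namefile are hypothetical and are not taken from fixfq.py. Skipping blank lines in namefile is an assumption, as is removing only the last extension (so a name such as exp1.fastq.gz would keep ".fastq" in its basename).

```python
# Hypothetical helpers illustrating the naming convention.
import os


def output_names(fastq_path):
    """Return (validfile, badfile) names for one input fastq file."""
    # Remove the file extension (.fastq, .fq, ...) to get the basename.
    base, _ext = os.path.splitext(fastq_path)
    return base + "_valid.fq", base + "_bad.fq"


def read_namefile(lines):
    """Return the file names listed one per line in `lines`.

    `lines` is any iterable of strings, e.g. an open namefile.
    Blank lines are skipped (an assumption of this sketch).
    """
    return [ln.strip() for ln in lines if ln.strip()]
```

With these helpers, output_names("exp1.fq") gives ("exp1_valid.fq", "exp1_bad.fq"), matching the example in the option descriptions.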
REFERENCES
Cock PJA et al. (2010) The Sanger FASTQ file format for
sequences with quality scores, and the Solexa/Illumina FASTQ
variants. Nucleic Acids Res. 2010 Apr; 38(6): 1767–1771.
doi: 10.1093/nar/gkp1137
AUTHOR
Dr. Brian Fristensky
Department of Plant Science
University of Manitoba
Winnipeg, MB Canada R3T 2N2
brian.fristensky@umanitoba.ca
http://home.cc.umanitoba.ca/~frist