Rationale:
Some genome or transcriptome assembly programs will fail if fastq
files contain unpaired reads.
fastq_pair reads a pair of fastq files and splits each of them
into two files, one containing paired reads and the other
containing unpaired (singleton) reads.
As described in the documentation, this is a more complex problem
than you might think.
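
As a rough illustration of the pairing task, the sketch below matches
reads by ID using an in-memory Python dictionary. It is not
fastq_pair's implementation: it assumes four-line fastq records, IDs
that differ only by a trailing /1 or /2, and inputs small enough to
hold in memory. The function names are hypothetical.

```python
def parse_fastq(lines):
    """Yield (read_id, record_lines) for each 4-line fastq record."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            # Header looks like "@name/1" or "@name 1:N:0:..."; keep the shared part.
            read_id = record[0][1:].split()[0]
            if read_id.endswith(("/1", "/2")):
                read_id = read_id[:-2]
            yield read_id, record
            record = []


def split_pairs(left_lines, right_lines):
    """Return (paired_ids, left_only_ids, right_only_ids)."""
    left = dict(parse_fastq(left_lines))  # table keyed by read ID
    paired, right_only = [], []
    for read_id, _ in parse_fastq(right_lines):
        if read_id in left:
            paired.append(read_id)
            del left[read_id]  # whatever remains in `left` has no mate
        else:
            right_only.append(read_id)
    return paired, list(left), right_only


# Toy data: r2 has a mate in both files, r1 and r3 are singletons.
left = ["@r1/1", "ACGT", "+", "IIII", "@r2/1", "GGCC", "+", "IIII"]
right = ["@r2/2", "TTAA", "+", "IIII", "@r3/2", "CCGG", "+", "IIII"]
result = split_pairs(left, right)
print(result)
```

Real read files rarely fit this naive approach, which is one reason
a dedicated tool is used instead.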
Where:
Run fastq_pair in the same folder as your read files.
Input:
RNAseq reads: Before opening this menu, select the RNA-seq
read files to be processed. For paired-end reads, fastq_pair
works on a pair of files. Files can be selected in pairs using
File --> guesspairs.py.
Input files must not be compressed (e.g., with gzip).
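
If the reads are only available as gzip files, they can be
uncompressed first. The following is a minimal sketch using the
Python standard library; the function name and the example file name
are hypothetical, and it assumes there is enough disk space for the
uncompressed copy.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_fastq(gz_path):
    """Write an uncompressed copy of gz_path beside it and return its path."""
    gz_path = Path(gz_path)
    out_path = gz_path.with_suffix("")  # "reads_1.fq.gz" -> "reads_1.fq"
    # Stream the data so the whole file is never held in memory.
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path
```

For example, gunzip_fastq("reads_1.fq.gz") would write reads_1.fq in
the same folder and leave the compressed file in place.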
Parameters:
Size of hashtable to use (-t) - fastq_pair creates a hash
table of reads from the left read file, and then reads the right
read file and attempts to find a mate for each read. For large
read files with millions of reads, the size of the table matters.
If the table is too large for the available memory, or too small
for the number of reads, fastq_pair will crash. If fastq_pair
crashes, it may be necessary to re-run it with a different hash
table size.
Even for files of similar sizes, the best -t value can vary
greatly, depending on whether the files contain few or many
singletons.
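
One way to pick a first -t value is to count the reads in the left
file and start from there. That starting point is an assumption of
this sketch, not a rule stated on this page, and the value may still
need adjusting after a crash. The sketch assumes exactly four lines
per fastq record; the function names are hypothetical, and the
command is only built, not run.

```python
def count_fastq_reads(path):
    """Count reads, assuming exactly four lines per fastq record."""
    with open(path, "rb") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def fastq_pair_command(left, right, table_size=None):
    """Build the argument list for fastq_pair; nothing is executed here."""
    if table_size is None:
        table_size = count_fastq_reads(left)  # starting guess only
    return ["fastq_pair", "-t", str(table_size), left, right]
```

The returned list can be passed to subprocess.run() or joined into a
command line for the terminal.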
Print buckets (-p) - Setting this parameter causes
fastq_pair to print the number of sequences per "bucket" in the
hash table, as described in the fastq_pair documentation. It is
not set by default. Setting it adds several MB to the output log,
but the bucket counts can be useful in deciding the best hashtable
size.
Output:
Name for output directory - This defaults to a
directory in the parent directory. It is usually good
organizational practice to write the paired and single read files
somewhere other than the current directory.
Output files are named after the original fastq files, with
'.paired.fq' or '.single.fq' appended to the names.
A new blreads window will be opened in the output directory.
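
The naming rule above can be sketched as follows, for scripts that
need to locate the results afterwards. The function name and the
example paths are hypothetical, and the sketch assumes the files end
up in the chosen output directory.

```python
from pathlib import Path


def expected_outputs(fastq_path, out_dir):
    """Return the paired and single output paths for one input fastq file."""
    name = Path(fastq_path).name  # suffixes are appended to the full file name
    return {
        "paired": Path(out_dir) / (name + ".paired.fq"),
        "single": Path(out_dir) / (name + ".single.fq"),
    }
```

For an input named reads_1.fq, this gives reads_1.fq.paired.fq and
reads_1.fq.single.fq in the output directory.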