fastq_pair

Rationale:

Some genome or transcriptome assembly programs will fail if fastq files contain unpaired reads.
fastq_pair reads a pair of fastq files and, for each input file, writes two output files: one containing the reads whose mates were found (paired reads), and one containing the unpaired reads (singletons).

As described in the documentation, this is a more complex problem than you might think.

Where:

Run fastq_pair in the same folder as your read files.

Input:

RNAseq reads
: Before opening this menu, select the RNA-seq read files to be paired. For paired-end reads, fastq_pair works on a pair of files. Files can be selected in pairs using File --> guesspairs.py.

Input files must be uncompressed. Compressed files (e.g. gzip) are not accepted.
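
If your reads are stored as gzip files, they need to be uncompressed before running fastq_pair. The following is a minimal sketch using only the Python standard library. The function name gunzip_copy and the example file names are hypothetical and are not part of fastq_pair or this menu.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_copy(gz_path, out_dir="."):
    """Write an uncompressed copy of gz_path into out_dir and return its path.

    Hypothetical helper for illustration; the original .gz file is left in place.
    """
    gz_path = Path(gz_path)
    # Drop the trailing ".gz": reads_1.fastq.gz -> reads_1.fastq
    out_path = Path(out_dir) / gz_path.with_suffix("").name
    # Stream the data so large read files are never held in memory all at once.
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path
```

For example, gunzip_copy("reads_1.fastq.gz") would write reads_1.fastq to the current folder, which can then be selected as input.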


Parameters:

Size of hashtable to use (-t) - fastq_pair builds a hash table of the reads in the left read file, then reads the right read file and attempts to find a mate for each read. For large read files with millions of reads, the size of this table matters: if the table is too large for the available memory, or too small for the number of reads, fastq_pair will crash. If fastq_pair crashes, it may be necessary to re-run it with a different hash table size.
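
To make the pairing idea concrete, here is a small conceptual sketch in Python. It is not fastq_pair's implementation: it keeps whole records in memory and uses a Python dict, which resizes itself, whereas fastq_pair uses a table whose size is fixed by -t. The function names, and the assumption that mates share a read ID apart from an optional /1 or /2 suffix, are illustrative only.

```python
def read_fastq(lines):
    """Yield (read_id, record) from well-formed FASTQ lines, four lines per record."""
    it = iter(lines)
    for header in it:
        record = [header, next(it), next(it), next(it)]
        # Assumed ID convention: first word of the header, without the
        # leading '@' and without a trailing /1 or /2 mate suffix.
        read_id = header[1:].split()[0]
        if read_id.endswith(("/1", "/2")):
            read_id = read_id[:-2]
        yield read_id, record


def pair_reads(left_lines, right_lines):
    """Return (paired, left_single, right_single) for two sets of FASTQ lines."""
    # Step 1: hash every read in the left file by its ID.
    table = dict(read_fastq(left_lines))
    paired, right_single = [], []
    # Step 2: look up each read of the right file in the table.
    for read_id, record in read_fastq(right_lines):
        if read_id in table:
            paired.append((table.pop(read_id), record))
        else:
            right_single.append(record)
    # Step 3: whatever is still in the table never found a mate.
    left_single = list(table.values())
    return paired, left_single, right_single
```

The three returned lists correspond to the paired reads and the singletons from each input file.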

Even for files of similar sizes, the best -t value can vary greatly, depending on whether the files contain few or many singletons.
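
One possible starting point for -t is the number of reads in the left file, adjusted up or down if fastq_pair crashes. The sketch below counts reads by assuming four lines per fastq record and assembles a command line without running it. Treat the read-count heuristic and the argument order (options first, then the two read files) as assumptions to check against the fastq_pair documentation. The function names are hypothetical.

```python
def count_fastq_records(path):
    """Count reads in an uncompressed fastq file, assuming four lines per record."""
    with open(path, "rb") as fh:
        n_lines = sum(1 for _ in fh)
    return n_lines // 4


def build_command(left, right, table_size=None, print_buckets=False):
    """Return a fastq_pair argument list; the command is not executed here.

    Assumptions: -t defaults to the read count of the left file (a heuristic
    starting value only), and options precede the two read files.
    """
    if table_size is None:
        table_size = max(1, count_fastq_records(left))
    cmd = ["fastq_pair", "-t", str(table_size)]
    if print_buckets:
        cmd.append("-p")  # print the number of sequences per bucket
    cmd += [str(left), str(right)]
    return cmd
```

The returned list could be passed to subprocess.run(), or joined with spaces to see the command that would be run.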

Print buckets (-p) - Setting this parameter causes fastq_pair to print the number of sequences per "bucket" in the hash table, as described in the fastq_pair documentation. Not set by default. Setting this parameter adds several MB to the output log, but it can be useful in deciding the best hash table size.

Output:

Name for output directory - This defaults to a directory in the parent directory. It is usually better organizational practice to write the paired and single read files somewhere other than the current directory.

Output files are named after the original fastq files, with '.paired.fq' or '.single.fq' appended to the names.
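
The naming rule can be written out as a short Python sketch, which may be handy when scripting later steps that need to find the results. The function name expected_outputs is hypothetical, and the sketch assumes the files end up in the output directory chosen above.

```python
from pathlib import Path


def expected_outputs(fastq_path, out_dir):
    """Return the paired and single output paths expected for one input file.

    Assumes the results are placed in out_dir under the original file name
    with '.paired.fq' or '.single.fq' appended.
    """
    name = Path(fastq_path).name
    return {
        "paired": Path(out_dir) / (name + ".paired.fq"),
        "single": Path(out_dir) / (name + ".single.fq"),
    }
```

For an input named reads_1.fastq, this gives reads_1.fastq.paired.fq and reads_1.fastq.single.fq in the output directory.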

A new blreads window will be opened in the output directory.