seqkit grep extracts sequences from read files using a list of
sequence names.
Where:
Run seqkit grep in the same folder as your read files.
Input:
Before running seqkit grep, select a single read file in blreads
to process, e.g. DL300_R1.fastq.gz.
file of read names to match: Select a file containing the names
of the reads to extract from the input file. Typically, this is a file
generated by a program such as magicblast, which compares reads to
a database. For example, if you wanted to find mitochondrial reads
in DL300_R1.fastq.gz, the output from
magicblast would go to DL300_R.mito.tsv.
This is a tab-separated value file in which the first
field of each line is the read name. If the file was generated for
paired-end reads, the same file could be used to extract
reads from either file of the pair, such as DL300_R1.fastq.gz. The
name file could also be just a list of names, with one name
per line.
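The relationship between the two accepted name-file layouts can be illustrated with a short Python sketch. It reduces a tab-separated file to a plain list of unique read names by keeping only the first field of each line. The function name, the example data, and the assumption that header or comment lines start with "#" are all hypothetical; this is not necessarily how the tool itself reads the file.

```python
import io

def read_names_from_tsv(handle):
    """Collect unique read names from the first tab-separated field of each line."""
    names = []
    seen = set()
    for line in handle:
        line = line.rstrip("\n")
        # Skip blank lines and comment lines (assumed here to start with '#').
        if not line.strip() or line.startswith("#"):
            continue
        name = line.split("\t", 1)[0].strip()
        # A read may hit the database more than once; keep each name only once.
        if name and name not in seen:
            seen.add(name)
            names.append(name)
    return names

# Hypothetical example content: read name first, then other columns.
example = io.StringIO(
    "# hypothetical header line\n"
    "read_001\tmito_ref\t98.5\n"
    "read_001\tmito_ref\t97.0\n"
    "read_007\tmito_ref\t99.1\n"
)
print(read_names_from_tsv(example))  # ['read_001', 'read_007']
```

Writing the returned names to a file, one per line, would give the simpler "list of names" layout described above.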
name for output file: Reads will be written to
this file. It is best to use a name similar to the input file.
Using the example above, a good choice for the output filename
would be DL300_R1_mito.fastq.gz.
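As a small illustration of this naming convention, the following Python sketch derives an output name by inserting a short tag before the file extension. The function name and the list of recognized extensions are assumptions made for this example only.

```python
def output_name(input_name, tag):
    """Insert '_<tag>' before a recognized read-file extension."""
    # Extensions checked longest first so '.fastq.gz' wins over '.fastq'.
    for ext in (".fastq.gz", ".fq.gz", ".fastq", ".fq"):
        if input_name.endswith(ext):
            return input_name[: -len(ext)] + "_" + tag + ext
    # Unknown extension: fall back to appending the tag at the end.
    return input_name + "_" + tag

print(output_name("DL300_R1.fastq.gz", "mito"))  # DL300_R1_mito.fastq.gz
```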
Parameters:
Send to output:
- matching reads
- reads matching the names in the list will be written to the
output file.
- mismatching reads
- reads that do NOT match the names in the list will be
written to the output file.
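The difference between the two settings can be sketched in Python. This is only an illustration of the selection logic on uncompressed, four-line FASTQ text, not seqkit's implementation; the function names and example reads are hypothetical. The read ID is assumed to be the text after "@" up to the first whitespace.

```python
import io

def fastq_records(handle):
    """Yield (header, sequence, plus, quality) tuples from 4-line FASTQ text."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual

def select_reads(handle, names, send_matching=True):
    """Yield records whose ID is in names, or not in names if send_matching is False."""
    wanted = set(names)
    for rec in fastq_records(handle):
        parts = rec[0][1:].split()
        read_id = parts[0] if parts else ""
        if (read_id in wanted) == send_matching:
            yield rec

# Hypothetical two-read example.
example = "@read_001 extra\nACGT\n+\nIIII\n@read_002\nGGCC\n+\nIIII\n"
matching = [r[0] for r in select_reads(io.StringIO(example), ["read_001"])]
others = [r[0] for r in select_reads(io.StringIO(example), ["read_001"], send_matching=False)]
print(matching)  # ['@read_001 extra']
print(others)    # ['@read_002']
```

On the seqkit command line, writing the non-matching reads corresponds to the -v (--invert-match) option of seqkit grep.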
Number of threads to use:
Because seqkit uses pigz to uncompress files, the number of files
that can be uncompressed at one time depends on the number of CPUs
used.
Performance note: As the number of CPUs increases, the
load on RAM also increases, because seqkit uses pigz to
decompress each file through an I/O stream. A smaller number of
CPUs may therefore give faster results. Some experimentation may
be needed to optimize speed.
Output:
Output is written to the input directory.
The blreads window will be refreshed to show the new output file.