seqkit grep - Hints

seqkit grep

seqkit grep extracts sequences from read files using a list of sequence names.

Where:

Run seqkit grep in the same folder as your read files.

Input:

Before running seqkit grep, select a single read file in blreads to process, e.g. DL300_R1.fastq.gz.

file of read names to match: Select a file containing the names of the reads to extract from the input file. Typically, this is a file generated by a program such as magicblast, which compares reads to a database. For example, to find mitochondrial reads in DL300_R1.fastq.gz, the output from magicblast might be written to DL300_R.mito.tsv, a tab-separated value file whose first field contains the read names. If the file was generated for paired-end reads, the same file can be used to extract reads from each read file of the pair, including DL300_R1.fastq.gz. The name file can also be a plain list of names, with one name per line.
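To illustrate the two name file layouts described above, the following sketch reduces a tab-separated hit table to a plain list with one name per line. The file names example_hits.tsv and example_names.txt and the table contents are made up for illustration, and treating lines that start with # as comments is an assumption about the table layout.

```shell
# Stand-in hit table, created only so the sketch is self-contained (hypothetical data).
printf '# fields: read, subject, identity\nread1\tchrM\t99.0\nread2\tchrM\t98.5\nread1\tchrM\t97.0\n' > example_hits.tsv

# Drop comment lines, keep the first tab-separated field (the read name),
# and remove duplicate names.
grep -v '^#' example_hits.tsv | cut -f1 | sort -u > example_names.txt

cat example_names.txt
```

A read that appears in several rows of the table ends up in the list only once.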

name for output file: Reads are written to this file. It is best to use a name similar to that of the input file. Using the example above, a good choice for the output filename would be DL300_R1_mito.fastq.gz.
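One possible way to derive an output name of this form from the input name is shown below. This is only a sketch; the variable names input, tag and output are hypothetical, and the .fastq.gz extension is assumed.

```shell
input="DL300_R1.fastq.gz"
tag="mito"   # hypothetical label describing what was extracted

# Strip the .fastq.gz extension, append the tag, and restore the extension.
output="${input%.fastq.gz}_${tag}.fastq.gz"

echo "$output"
```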

Parameters:

Send to output:

matching reads - reads matching the names in the list will be written to the output file.
mismatching reads - reads that do NOT match the names in the list will be written to the output file.
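On the command line, these two choices would roughly correspond to running seqkit grep without and with its -v (--invert-match) option. The sketch below uses small stand-in files with hypothetical names, and it is an assumption that this form resembles the call made when the program is run from blreads.

```shell
# Stand-in input files, created only so the sketch is self-contained (hypothetical names).
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGG\n+\nIIII\n' > example_R1.fastq
printf 'read1\n' > example_names.txt

if command -v seqkit >/dev/null 2>&1; then
    # Matching reads: keep records whose ID appears in the name list.
    seqkit grep -f example_names.txt example_R1.fastq -o example_R1_match.fastq
    # Mismatching reads: -v inverts the match, keeping all other records.
    seqkit grep -v -f example_names.txt example_R1.fastq -o example_R1_nomatch.fastq
else
    echo "seqkit not found on PATH; skipping" >&2
fi
```

By default seqkit grep compares each name with the sequence ID, that is, the first word of the header line.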

Number of threads to use: Because seqkit uses pigz to decompress files, the number of files that can be decompressed at one time depends on the number of CPUs used.
Performance note: As the number of CPUs increases, the load on RAM also increases, because seqkit uses pigz to decompress each file through an I/O stream. A smaller number of CPUs may therefore give faster results. Some experimentation may be needed to optimize speed.
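The kind of experimentation suggested above could be sketched as a loop that runs the same extraction with several thread counts, set with the seqkit -j (--threads) option, and reports the elapsed time for each. The file names and the thread counts 1, 2 and 4 are arbitrary choices for illustration. The stand-in input is far too small to show a real difference, so substitute an actual read file and name list.

```shell
# Stand-in input files, created only so the sketch is self-contained (hypothetical names).
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGG\n+\nIIII\n' > example_R1.fastq
printf 'read1\n' > example_names.txt
: > example_timings.txt

if command -v seqkit >/dev/null 2>&1; then
    for threads in 1 2 4; do
        start=$(date +%s)
        # Output is discarded; only the run time is of interest here.
        seqkit grep -j "$threads" -f example_names.txt example_R1.fastq > /dev/null
        end=$(date +%s)
        echo "threads=$threads elapsed=$((end - start))s" >> example_timings.txt
    done
    cat example_timings.txt
else
    echo "seqkit not found on PATH; skipping" >&2
fi
```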

Output:

Output is written to the input directory.

The blreads window will be refreshed to show the new output file.