adaptercheck.py - Find potential read-through adapter contamination in sequencing reads

Updated May 12, 2021

NAME

adaptercheck.py - Find potential read-through adapter contamination in sequencing reads.

SYNOPSIS

adaptercheck.py fqfiles adapterfile olen reportfile

DESCRIPTION

For each file listed in fqfiles, adaptercheck.py searches for an oligonucleotide taken from the 5' end of each adapter in adapterfile. The length of the oligos is specified by olen. The Unix grep and wc -l commands are called to count the lines in which each oligo occurs. Results for all fastq files are summarized in a spreadsheet file, written to reportfile.
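The counting step can be illustrated in pure Python. This is a minimal sketch, not the script's actual code, which calls grep and wc -l. The function name count_lines_with_oligo is hypothetical. Like grep, it counts every line that contains the oligo, without distinguishing sequence lines from header or quality lines.

```python
def count_lines_with_oligo(fastq_path, oligo):
    """Count lines in fastq_path that contain oligo.

    Roughly equivalent to: grep oligo fastq_path | wc -l
    """
    count = 0
    with open(fastq_path) as handle:
        for line in handle:
            # A line is counted once, even if the oligo occurs several times in it.
            if oligo in line:
                count += 1
    return count
```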

Typically, adapter contamination will be due to read-through on short sequencing reads. When the oligo is found internal to a read, the beginning of the oligo likely marks both the 3' end of the insert and the beginning of the adapter sequence.
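As an illustration of this point, the position of the oligo within a read gives an estimate of the insert length. The sketch below is not part of adaptercheck.py. The function name insert_length and the example sequences are assumptions made for illustration.

```python
def insert_length(read, oligo):
    """Return the number of bases preceding the first occurrence of oligo.

    Returns None if the oligo does not occur in the read.
    """
    pos = read.find(oligo)
    return pos if pos >= 0 else None


# Illustrative read: 8 nt of insert followed by the start of an adapter.
example_read = "ACGTACGT" + "AGATCGGAAGAG"
print(insert_length(example_read, "AGATCGGAAGAG"))
```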

Another use of adaptercheck.py is to determine which adapter was used for a set of reads, if that information is not known. This might occur, for example, when re-doing an assembly using older reads. Usually, the real adapters will be found at a frequency orders of magnitude higher than false positives.
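One way to apply this rule of thumb to a table of counts is sketched below. The function name likely_adapters and the 100-fold threshold are assumptions chosen for illustration. The script itself only reports counts and leaves the interpretation to the user.

```python
def likely_adapters(counts, expected, fold=100):
    """Return names of adapters whose hit count is at least fold times expected.

    counts   - dict mapping adapter name to observed number of hits
    expected - number of hits expected by chance
    fold     - hypothetical cutoff standing in for "orders of magnitude"
    """
    return [name for name, n in counts.items() if n >= fold * expected]


# "other_adapter" is a made-up name for an adapter with background-level hits.
counts = {"PE1_rc": 446582, "PE2_rc": 446582, "other_adapter": 29}
print(likely_adapters(counts, expected=70))
```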

fqfiles - Text file containing names of fastq files to search, one per line. Fastq files must not be compressed by gzip or other compression tools.

adapterfile - Fasta-format file containing names and sequences of adapters to search for.

olen [default: 10] - Length of oligonucleotide to search for. For each adapter, the first olen nucleotides from the 5' end are taken as the search string. Olen should be chosen so that

reportfile - File for output. The output is in the form of a TSV file, directly readable by most spreadsheet programs.
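The way search strings are derived from adapterfile can be sketched as follows. This is a simplified reading of the description above, not the script's own code. The function name read_adapter_oligos is hypothetical, and using the first word of each fasta header as the adapter name is an assumption.

```python
def read_adapter_oligos(fasta_path, olen=10):
    """Map each adapter name to the first olen nucleotides of its sequence."""
    sequences = {}
    name = None
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Assumption: the first word of the header is the adapter name.
                fields = line[1:].split()
                name = fields[0] if fields else ""
                sequences[name] = ""
            elif name is not None:
                # Sequences may span several lines; join them.
                sequences[name] += line.upper()
    # Adapters shorter than olen yield their full sequence.
    return {n: seq[:olen] for n, seq in sequences.items()}
```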

OUTPUT

An example of output is shown at right.

For each fastq file, the total number of sequences is listed, along with the number of hits expected by chance for a k-mer of the specified size. In this example, out of 15,679,410 sequences, we'd expect to see about 70 chance hits for an oligo 12 nt in length. This is based on the assumption that a hit will occur by chance once every 4^k nucleotides.
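The expected-hits estimate can be written out as a short calculation: total nucleotides divided by 4^k. The read length is not stated in the text, so the 75 nt used below is a hypothetical value. With that assumption, the result comes out close to the 70 hits quoted above. The function name expected_hits is also an assumption.

```python
def expected_hits(n_reads, read_length, k):
    """Expected number of chance hits, assuming one hit every 4**k nucleotides."""
    total_nucleotides = n_reads * read_length
    return total_nucleotides / 4 ** k


# read_length=75 is an assumed value, not taken from the documentation.
print(round(expected_hits(15679410, 75, 12)))
```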

For most adapters, the observed number of hits is between 18 and 29. In the first file, we see that both PE1_rc and PE2_rc were found in 446,582 reads. These almost certainly represent real read-throughs. Dividing the number of hits by the total number of sequences, we estimate that about 2.8% of these reads are read-throughs, in which at least 12 bp of adapter are at the 3' end of the read.
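The read-through percentage is the hit count divided by the total number of sequences. The snippet below simply reproduces that arithmetic with the figures from this example. The function name readthrough_fraction is hypothetical.

```python
def readthrough_fraction(hits, n_reads):
    """Fraction of reads in which the adapter oligo was found."""
    return hits / n_reads


fraction = readthrough_fraction(446582, 15679410)
print(f"{fraction:.1%}")
```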

REFERENCES

Why are adapter sequences trimmed from only the 3' ends of reads?
https://support.illumina.com/bulletins/2016/04/adapter-trimming-why-are-adapter-sequences-trimmed-from-only-the--ends-of-reads.html

AUTHOR

Dr. Brian Fristensky
Department of Plant Science
University of Manitoba
Winnipeg, MB Canada R3T 2N2
brian.fristensky@umanitoba.ca
http://home.cc.umanitoba.ca/~frist