PLNT4610/PLNT7690 Bioinformatics
Lecture 10, part 1 of 2

November 22, 2018

Genomic Sequencing and Assembly


Ekblom R, Wolf JBW (2014) A field guide to whole-genome sequencing, assembly and annotation. Evolutionary Applications 7: 1026-1042.


1. Top Down: sequencing of BACs

2. Bottom-up: Whole Genome Shotgun Sequencing






To understand the problem of genomic sequencing, we need to understand the properties of prokaryotic and eukaryotic genomes.

                           prokaryotes                   eukaryotes
                           (bacteria, archaea)           (fungi, protists, plants, animals)

 genome structure          single circular chromosome    two or more linear chromosomes;
                                                         circular organelle genomes
                                                         (mitochondrial, plastid)

 ploidy levels             haploid                       diploid, tetraploid, hexaploid or higher

 total genome length (bp)  10^6 - 10^7                   10^7 - 10^12
Eukaryotic genome sequences fall into three copy-number classes:
  • single copy (1 - 10 copies)
    • usually only a few % of genome
    • most genes
  • middle-repetitive (10 - 500,000 copies)
    • most of the genome is mid-rep
    • transposable elements a major component
    • high-copy genes: histones, rRNA
  • highly-repetitive (>500,000 copies)
    • mostly short sequences repeated many times
    • concentrated at centromeres and telomeres
    • microsatellite sequences interspersed throughout the chromosome

As sequencing technologies have produced longer reads, combined with paired-end reads, it has become possible to assemble complete prokaryotic genomes by the whole-genome shotgun approach. However, the complexity of most eukaryotic genomes makes it impossible to obtain complete genomic sequences at today's level of technology. Because most eukaryotic genomic DNA is repetitive, the top-down strategy is still the only way to get complete chromosomal sequences as single contigs.

1. Top Down: sequencing of BACs

The first complete genomic sequences for eukaryotes were done using a "divide and conquer" strategy.

1. Construct a representative BAC library (avg. insert ~ 100 kb)
2. For each chromosome, find a set of overlapping BAC clones whose inserts cover the entire chromosome. In this way, the location of each BAC on the chromosome is known.
3. Sequence each BAC separately using shotgun sequencing.



2. Bottom-up: Whole Genome Shotgun Sequencing

Summary of Whole Genome Sequencing




Before you sequence a genome, you need to know the basic characteristics of your genome, and the basic parameters of the sequencing technology or technologies you plan to use. 

 Platform      Read length (nt)   Reads per run        Notes
 Illumina      200 - 600          10 - 375 million     highest throughput
 Ion Torrent   200 - 400          0.4 - 60 million
 Pac Bio       4600 - 14,000      22,000 - 47,000      longest reads; highest error rate
 Roche 454     400 - 700          20,000 - 350,000     lowest error rate
                                  0.8 - 266 million    A/T bias

SR - single read; PE - paired end

Data from Choosing the Right NGS Sequencing Instrument for Your Study


How many reads do we need?

Coverage - Ideally we want to cover an entire genome with at least a 50-fold redundancy of reads, meaning that every nucleotide position is represented in at least 50 reads in the population. The level of redundancy is referred to as coverage.

High coverage is needed for several reasons: reads sample the genome at random, so by chance some regions will be covered far less often than the average; and errors in individual reads can only be detected and corrected when enough overlapping reads cover the same position.

The number of reads N required to sequence a genome, such that any given base is covered at least once with probability P, is given by

N = ln(1 - P) / ln(1 - f)

where f is the fraction of the genome contained in a single read. For c-fold coverage, multiply N by c.

Example: We want to sequence a genome of 1 x 10^7 bp. If the read size is 200 nt, then f will be 200 nt/(1 x 10^7 nt) = 2 x 10^-5. If we want a probability of success of P = 99%, then for 50-fold coverage, N = 11.5 million reads. As a rule of thumb, then, the number of reads needed for 50-fold coverage is roughly the number of nucleotides in the genome. This number will go down as reads get longer.
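The example can be checked with a short Python calculation, using N = c x ln(1 - P)/ln(1 - f), where f is the fraction of the genome covered by one read (reads_needed is an illustrative name, not from any package):

```python
import math

def reads_needed(genome_size, read_length, p_success=0.99, coverage=50):
    """Number of reads for `coverage`-fold coverage such that any given
    base is sequenced at least once with probability p_success."""
    f = read_length / genome_size                          # fraction of genome per read
    return coverage * math.log(1 - p_success) / math.log(1 - f)

print(f"{reads_needed(1e7, 200) / 1e6:.1f} million reads")   # 11.5 million reads
```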


As with most things in bioinformatics, the most important factor influencing the quality of the final result is the quality of the starting material. In the case of sequencing, the most critical part is the initial cleaning of the data:

Sequencing services typically provide reads in Fastq format. Most services will do some pre-processing of reads, at least to the extent of removing adaptor sequences added during library preparation.

Cock PJA et al. (2010) The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 38(6): 1767-1771. doi: 10.1093/nar/gkp1137. PMCID: PMC2847217
Definition of Fastq format for sequencing reads

Each read is represented by four lines:

line 1 - unique identifier of read
line 2 - sequence of read
line 3 - either '+' or optional repeat of title
line 4 - quality of each base read.
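A minimal parser for this four-line layout might look like this in Python (a sketch: real-world complications such as sequences wrapped over multiple lines are ignored):

```python
import io

def parse_fastq(handle):
    """Yield (identifier, sequence, quality) tuples from a fastq file handle,
    assuming exactly four lines per read."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break                        # end of file
        seq = handle.readline().rstrip()
        handle.readline()                # '+' line (optionally repeats the title)
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual      # strip the leading '@'

example = io.StringIO("@read1\nACGT\n+\nIIII\n")
for name, seq, qual in parse_fastq(example):
    print(name, seq, qual)               # read1 ACGT IIII
```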

Quality encoding

Phred quality scores are typically calculated as Q = -10 x log10(Pe) where Pe is the estimated probability of error.

To save space in fastq files, Q values are encoded as single ASCII characters: the Q value is represented by the character whose decimal ASCII code equals Q plus an offset, usually 33 or 64. For example, a Q value of 34 with offset 33 gives the decimal value 33 + 34 = 67, ie. the capital C. A Q value of 30, plus offset 33, gives the ASCII decimal value of 63, represented by the question mark (?).
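The conversion in both directions is a one-liner in Python (offset 33 shown as the default):

```python
def phred_to_char(q, offset=33):
    """Encode a Phred quality score as its single ASCII character."""
    return chr(q + offset)

def char_to_phred(c, offset=33):
    """Decode an ASCII quality character back to a Phred score."""
    return ord(c) - offset

print(phred_to_char(34))    # C  (34 + 33 = 67)
print(phred_to_char(30))    # ?  (30 + 33 = 63)
print(char_to_phred('?'))   # 30
```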

Note: Some sequence assembly programs need to know which offset your data uses, 33 or 64. 

 Description            OBF name         ASCII range  Offset  Type    Quality score range
 Sanger standard        fastq-sanger     33 - 126     33      PHRED   0 to 93
 Solexa/early Illumina  fastq-solexa     59 - 126     64      Solexa  -5 to 62
 Illumina 1.3+          fastq-illumina   64 - 126     64      PHRED   0 to 62
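When the encoding is not documented, it can often be inferred from the characters actually present in the file. A minimal heuristic in Python (it cannot distinguish fastq-solexa from fastq-illumina, and simply assumes offset 64 when no low characters are seen):

```python
def guess_offset(quality_strings):
    """Guess the fastq quality offset from observed quality characters.
    ASCII codes below 59 can only occur with the Sanger offset of 33;
    if none are seen, fall back to assuming offset 64."""
    lowest = min(min(ord(c) for c in q) for q in quality_strings)
    return 33 if lowest < 59 else 64

print(guess_offset(["IIII", "##!I"]))   # 33  ('!' is ASCII 33)
print(guess_offset(["hhhh"]))           # 64  ('h' is ASCII 104)
```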

The first step after receiving sequencing read files should be to look at the quality of the reads using FastQC, which creates reports of the overall quality of the sequencing results. The example shown below is from a good sequencing run, in which the quality of the nucleotide calls is in the high range at all positions. For positions approaching 145 nt, the variation in quality among reads gets larger. For this data, it is probably best not to use positions above 145 - 149 for sequence assembly. Most assembly programs will automatically filter out poor-quality positions.

Per base quality graph


Trimming primers from the ends of sequencing reads
For all sequencing technologies, the first dozen or so nucleotides at the 5' ends of raw sequencing reads are the adapters added to the fragments in the library.

Illumina:   5'AGATCGGAAGAC------------------------------------------------------------------3'
Small RNA   5'TGGAATTCTCGG------------------------------------------------------------------3'
Nextera:    5'CTGTCTCTTATA------------------------------------------------------------------3'

If adapters were not removed, sequence assembly programs would find that every sequence matched every other sequence in the first 12 - 13 bases, which would confound the assembly of contigs from reads.

Programs such as trim-galore automatically detect the adaptor sequences, and remove them from the ends of reads.
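Programs like trim-galore do much more than this (partial matches, mismatch tolerance, quality trimming), but the basic idea of stripping a known adaptor from the 5' end of a read can be sketched in Python (the function name and exact-match logic are illustrative only; adapter sequences are the ones listed above):

```python
# Illumina, Small RNA, and Nextera 5' adapter sequences from the text
ADAPTERS = ["AGATCGGAAGAC", "TGGAATTCTCGG", "CTGTCTCTTATA"]

def trim_adapter(read):
    """Remove a known adaptor from the 5' end of a read, if present."""
    for adapter in ADAPTERS:
        if read.startswith(adapter):
            return read[len(adapter):]
    return read

print(trim_adapter("AGATCGGAAGACTTTTGGGG"))   # TTTTGGGG
```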

Keep in mind: When you trim reads from fastq files, you generate new fastq files that are almost the same size as the original files. Thus, if you have sequencing read files totaling 25 Gb, after trimming there will be an additional 25 Gb of fastq files with the trimmed reads. The next step, error correction, generates another 25 Gb of trimmed, corrected reads. A good rule of thumb, therefore, is that a genome sequencing project needs more than 3x the disk space required for the original raw reads.

Error correction

In sequencing, as in science in general, a smaller amount of high-quality data is usually better than a larger amount of unreliable data. No matter how good a sequence assembly algorithm is, it will not be able to do a good assembly when reads contain a lot of errors: reads with high error rates will often prevent the assembly of sequences into larger contigs. The most important single factor in getting a good assembly is having good reads. We will next look at strategies for identifying and correcting or discarding poorer-quality reads.

Most methods for error correction begin with the creation of a table of oligonucleotides of length k, ie. k-mers, listing their frequency in the dataset. Two such methods are Quake and Pollux.


Kelley DR et al. (2010) Quake: quality-aware detection and correction of sequencing errors. Genome Biology 2010, 11:R116. DOI: 10.1186/gb-2010-11-11-r116

"For sufficiently large k, almost all single-base errors alter k-mers overlapping the error to versions that do not exist in the genome. Therefore, k-mers with low coverage, particularly those occurring just once or twice, usually represent sequencing errors. For the purpose of our discussion, we will refer to high coverage k-mers as trusted, because they are highly likely to occur in the genome, and low coverage k-mers as untrusted. Based on this principle, we can identify reads containing untrusted k-mers and either correct them so that all k-mers are trusted or simply discard them."
Put another way, k-mers that are real should occur in all reads mapping to a particular site in the genome, while erroneous k-mers might be found only once in a dataset.

Choice of k-value is dependent on genome size

We want to choose a k-value such that there will be a low probability (eg. < 0.01) that a given k-mer will occur by chance in a genome of size G. So k should be chosen such that 2G/4^k = 0.01. (We use 2G to account for both strands of each chromosome.) The equation simplifies to

k = log4(200G)

 k   Max. genome size (Mb) cutoff (4^k/200)   Example species            Genome size (Mb)
                                              Escherichia coli
                                              Saccharomyces cerevisiae
                                              Leptosphaeria maculans
                                              Drosophila melanogaster
                                              Solanum tuberosum
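Since k = log4(200G), the appropriate k for any genome can be computed directly. A short Python sketch (the genome sizes used here are approximate published values, assumed for illustration, not taken from the table above):

```python
import math

def min_k(genome_size_bp, p=0.01):
    """Smallest k such that a random k-mer is expected to occur by chance
    in the double-stranded genome with probability < p, from 2G/4^k < p,
    i.e. k > log4(2G/p)."""
    return math.ceil(math.log(2 * genome_size_bp / p, 4))

# approximate genome sizes in bp (assumed values)
for species, size in [("Escherichia coli", 4.6e6), ("Solanum tuberosum", 8.4e8)]:
    print(species, min_k(size))   # Escherichia coli 15, Solanum tuberosum 19
```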

Rather than simply counting the number of occurrences of each k-mer in a genome, Quake counts q-mers. Each k-mer in a read is weighted by the quality values for each nucleotide in the k-mer. Thus, the q-count for a given k-mer is the sum of q-mer values for all instances of that k-mer in the dataset.
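As a sketch of the idea, assuming (per Quake) that each k-mer instance is weighted by the product of the probabilities that its bases were called correctly, with P = 1 - 10^(-Q/10):

```python
def qmer_weight(qual_chars, offset=33):
    """Weight of one k-mer instance: the product, over its bases, of the
    probability that each base call is correct, P = 1 - 10^(-Q/10)."""
    weight = 1.0
    for c in qual_chars:
        q = ord(c) - offset
        weight *= 1.0 - 10.0 ** (-q / 10.0)
    return weight

# four bases at Q=40 (character 'I'): each 99.99% likely to be correct
print(round(qmer_weight("IIII"), 4))   # 0.9996
```

The q-count of a k-mer is then the sum of these weights over every instance of that k-mer in the dataset.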

Localize errors. Trusted (green) and untrusted (red) 15-mers are drawn against a 36 bp read. In (a), the intersection of the untrusted k-mers localizes the sequencing error to the highlighted column. In (b), the untrusted k-mers reach the edge of the read, so we must consider the bases at the edge in addition to the intersection of the untrusted k-mers. However, in most cases, we can further localize the error by considering all bases covered by the right-most trusted k-mer to be correct and removing them from the error region as shown in (c).

Fig. 4 from Kelley et al.

To correct errors, untrusted k-mers are replaced by trusted k-mers that could correct the error. The most likely corrections are tried, based on quality information, until a set of corrections is found that makes all k-mers in a read into trusted k-mers.


Marinier E et al. (2015) Pollux: platform independent error correction of single and mixed genomes. BMC Bioinformatics 2015, 16:10. DOI: 10.1186/s12859-014-0435-6

Pollux introduced several improvements over previous methods, described below: the use of longer k-mers, detection of errors as troughs in k-mer frequency, and correction of homopolymer errors.

Pollux algorithm

# encode all k-mers from all reads into a table
for each read do
    trim N's from ends of read
    replace internal N's with an arbitrary nucleotide
    add all k-mers from read to k-mer table

# discard any k-mer that is represented only once
for each k-mer in table do
    if count(k-mer) = 1
        remove k-mer from table

# correct reads
for each read do
    find all potential error positions as localized drops in k-mer frequency
    for each error position do
        try corrections by insertion, deletion and substitution
        choose best correction
        if valid(correction)
            apply correction to read
        else
            # if substitutions and indels fail, assume a homopolymer error
            try different lengths of homopolymer
            choose best correction
            if valid(correction)
                apply correction to read

A more complete pseudocode for the algorithm is found in the Marinier et al. publication.
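A toy Python version of the first stages of this pseudocode (k-mer counting and flagging positions covered by rare k-mers; k is shortened to 5 so the example is readable) might look like this. The real programs localize errors more precisely, by intersecting the untrusted k-mers (Quake) or finding trough edges (Pollux):

```python
from collections import Counter

def kmer_table(reads, k=5):
    """Count the occurrences of every k-mer across all reads."""
    table = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            table[read[i:i+k]] += 1
    return table

def untrusted_positions(read, table, k=5, cutoff=1):
    """Positions in a read covered by at least one low-count ('untrusted')
    k-mer.  This crude union over-flags positions near the error."""
    flagged = set()
    for i in range(len(read) - k + 1):
        if table[read[i:i+k]] <= cutoff:
            flagged.update(range(i, i + k))
    return flagged

# ten copies of a correct read, plus one read with a G->C error at position 6
reads = ["ACGTACGTAA"] * 10 + ["ACGTACCTAA"]
table = kmer_table(reads)
print(sorted(untrusted_positions("ACGTACCTAA", table)))   # [2, 3, 4, 5, 6, 7, 8, 9]
```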

Better accuracy through long k-mer sizes. As read lengths increase with improving sequencing technologies, it is possible to improve the correction process using much larger k-mer sizes. Pollux uses k=31 because 31 is the longest k-mer that can be represented in a 64-bit word (2 bits per nucleotide, 62 bits total). Any given 31-mer is unlikely to be found in any genome by random chance, regardless of how big the genome is.

This requires encoding each nucleotide in 2 binary bits, for example with the conventional mapping:

A = 00, C = 01, G = 10, T = 11

2-bit encoding makes it possible to store the k-mer table in a much smaller space than would otherwise be possible. The one problem is that 2 bits can only encode four possible choices. To represent a fifth character, N, you would need 3 bits.

The solution is to replace each N in a read with an arbitrarily chosen nucleotide. That way, the N can still be treated as an error, which can be corrected like any other error.
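A sketch of the 2-bit packing in Python, using the conventional mapping A=00, C=01, G=10, T=11 (the actual bit assignment used by Pollux may differ):

```python
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack_kmer(kmer):
    """Pack a k-mer (k <= 31) into a single integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODE[base]
    return value

print(pack_kmer("ACGT"))                  # 27  (binary 00 01 10 11)
print(pack_kmer("T" * 31).bit_length())   # 62 bits: a 31-mer fits in a 64-bit word
```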

Detection of errors as troughs in k-mer frequencies - Marinier et al. recognized that the k-mer count will be high in correct regions of a read and very low in regions that contain errors. Usually the right end of the "trough" locates the error, because immediately downstream of the error the k-mers should again be high-frequency k-mers.

The authors state that the trough detection approach avoids the assumptions about sequence quality scores that are inherent in Quake.

Homopolymer correction - A second innovation of Pollux is the use of k-mer frequencies to correct errors in homopolymer runs. Ion Torrent sequencing in particular is prone to calling homopolymer lengths incorrectly.

Method: replace k-mers in a read with k-mers having different numbers of repeated nucleotides

Example: Suppose the internal C in this k-mer may be part of a homopolymer run whose length was called incorrectly:

original k-mer: CGTCATT

Generate alternate k-mers with different numbers of C's at that position, and choose the ones that occur most frequently in the dataset. For example, if the k-mers that have 4 C's at that position are all more frequent than the ones with 1, 2, 3, 5 or more C's, then we correct the read to have 4 C's.
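The generation of candidate homopolymer lengths can be sketched as follows (a simplified illustration; the function name is not from Pollux, and the real program then scores each candidate by its k-mer frequency):

```python
def homopolymer_variants(kmer, pos, max_len=5):
    """Return variants of `kmer` in which the homopolymer run containing
    position `pos` is replaced by runs of length 1..max_len."""
    base = kmer[pos]
    # find the extent of the run of identical bases around pos
    start = pos
    while start > 0 and kmer[start - 1] == base:
        start -= 1
    end = pos
    while end < len(kmer) - 1 and kmer[end + 1] == base:
        end += 1
    return [kmer[:start] + base * n + kmer[end + 1:]
            for n in range(1, max_len + 1)]

print(homopolymer_variants("CGTCATT", 3))
# ['CGTCATT', 'CGTCCATT', 'CGTCCCATT', 'CGTCCCCATT', 'CGTCCCCCATT']
```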

Unless otherwise cited or referenced, all content on this page is licensed under the Creative Commons License Attribution Share-Alike 2.5 Canada