PLNT4610/PLNT7690 Bioinformatics
Lecture 10, part 1 of 2

Genomic Sequencing and Assembly

REFERENCES

Ekblom R, Wolf JBW (2014) A field guide to whole-genome sequencing, assembly and annotation http://onlinelibrary.wiley.com/doi/10.1111/eva.12178/full

A. OVERVIEW

1. Top Down: sequencing of BACs

2. Bottom-up: Whole Genome Shotgun Sequencing

B. CRITICAL PARAMETERS OF SEQUENCING TECHNOLOGIES

C. PREPROCESSING OF READS

D. ASSEMBLY OF READS INTO CONTIGS

E. JOINING CONTIGS INTO SCAFFOLDS

A. OVERVIEW

To understand the problem of genomic sequencing, we need to understand the properties of prokaryotic and eukaryotic genomes.

	Prokaryotes	Eukaryotes
taxa	bacteria, archaea	fungi, protists, plants, animals
genome structure	single circular chromosome; plasmids	two or more linear chromosomes; circular organelle genomes: mitochondrial, plastid
ploidy levels	haploid	diploid, tetraploid, hexaploid or higher
total genome length (bp)	10⁶ - 10⁷	10⁷ - 10¹²
complexity	single copy	single copy (1 - 10 copies): usually only a few % of genome; most genes
		middle-repetitive (10 - 500,000 copies): most of the genome is mid-rep; transposable elements a major component; high-copy genes: histones, rRNA
		highly-repetitive (>500,000 copies): mostly short sequences repeated many times; concentrated at centromeres and telomeres; microsatellite sequences interspersed throughout the chromosome

As sequencing technologies have given longer reads, combined with paired-end reads, it has been possible to assemble complete prokaryotic genomes by the whole genome shotgun approach. However, the complexity of most eukaryotic genomes makes it almost impossible to get complete genomic sequences at today's level of technology.

1. Top Down: sequencing of BACs

The first complete genomic sequences for eukaryotes were done using a "divide and conquer" strategy.

1. Construct a representative BAC library (avg. insert ~ 100 kb)
2. For each chromosome, find a set of overlapping BAC clones whose inserts cover the entire chromosome. In this way, the location of each BAC on the chromosome is known.
3. Sequence each BAC separately using shotgun sequencing.

Advantages

use of BACs breaks a large, complex problem into smaller, less complex problems
small size of BAC inserts decreases by orders of magnitude the computational size of the assembly problem
small size of BAC inserts somewhat decreases the problem of overlapping reads when repetitive sequences are present

Disadvantages

many person-years of effort
enormous cost
not practical for most modern genomics projects (eg. Thousand Genomes)

2. Bottom-up: Whole Genome Shotgun Sequencing

Summary of Whole Genome Sequencing

Advantages

quick
low cost per genome
only practical way for most modern genomics projects (eg. Thousand Genomes)

Disadvantages

millions of short reads present a computational problem of enormous complexity
computationally demanding, in terms of processing time and memory
assembly of overlapping reads becomes impossible when repetitive sequences are present

B. CRITICAL PARAMETERS OF SEQUENCING TECHNOLOGIES

Before you sequence a genome, you need to know the basic characteristics of your genome, and the basic parameters of the sequencing technology or technologies you plan to use.

Platform	Read length (nt)	Reads/unit	Read Type	Error type	Comment
Illumina	200 - 600	10 - 375 million	PE, SR	substitution	highest throughput
Ion Torrent	200 - 400	0.4 - 60 million	SR	homopolymers
Pac Bio	4600 - 14,000	22,000 - 47,000	SR	indel	longest reads; highest error rate
Oxford nanopore	15,000 - 20,000		SR	homopolymers; deletions	longest reads can exceed 1000 kb
Roche 454	400 - 700	20,000 - 350,000	SR	indel	lowest error rate
SR - single read; PE - paired end

Data from Choosing the Right NGS Sequencing Instrument for Your Study http://genohub.com/ngs-instrument-guide/#pacbio

Guidelines:

Long reads are critical for assembling long contigs. There is no substitute.
High coverage improves the reliability of the sequence. 50 - 100 fold coverage is common
Best strategy is to combine two sequencing runs

paired-end with short insert sizes eg. Illumina 300 bp fragments
mate pair reads with long insert sizes, or long reads (eg. Pac Bio), to bridge gaps such as those resulting from repetitive sequences

How many reads do we need?

Coverage - Ideally we want to cover an entire genome with at least a 50-fold redundancy of reads, meaning that every nucleotide position is represented in at least 50 reads in the population. The level of redundancy is referred to as coverage.

High coverage is needed for several reasons:

Sequencing is error-prone, often giving between 0.1% and 1% errors per base
Reads are short, usually < 200 nt.
Some sites may be polymorphic. In the case of diploid genomes, there might be two alleles at any given base position
The population of reads will vary in terms of coverage, across the genome. Since the read population is supposed to be a random sampling of the genome, some sites will be underrepresented. That is, if the average coverage is 50x, then many sites will have only 45x coverage, a significant number of sites will have 40x coverage, and some sites might even have only 35x coverage. Others will be overrepresented.
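The spread in coverage described above can be estimated with a Poisson model. This is a simplifying assumption (reads landing uniformly at random across the genome), and real libraries are usually more uneven. The function names below are chosen here for illustration only.

```python
import math

def poisson_pmf(k, mean):
    """Probability that a site is covered exactly k times at the given mean coverage."""
    # Computed in log space so that large k does not overflow.
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))

def fraction_at_or_below(depth, mean):
    """Expected fraction of sites whose coverage is <= depth."""
    return sum(poisson_pmf(k, mean) for k in range(depth + 1))

# Expected fraction of sites at or below 45x, 40x and 35x when the mean is 50x.
for depth in (45, 40, 35):
    print(depth, round(fraction_at_or_below(depth, 50), 4))
```

Under this model the fraction of sites drops quickly as the depth moves away from the mean, which is why a generous average coverage is needed to keep even the poorly sampled sites well covered.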

The number of reads required to completely sequence a genome is given by

N = C x ln(1 - P) / ln(1 - f)

where

N ::= the number of reads
C ::= fold coverage
P ::= the probability that all nucleotide positions will be included at least C times
f ::= the fraction of the genome spanned by a single read

Example: We want to sequence a genome of 1 x 10⁷ bp. If the read size is 200 nt, then f will be 200 nt/(1 x 10⁷ nt) = 2 x 10⁻⁵. If we want a probability of success of P=99%, then for 50-fold coverage, N = 11.5 million reads. As a rule of thumb, then, the number of reads needed for 50-fold coverage is roughly the number of nucleotides in the genome. This number will go down as reads get longer.
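The worked example can be checked numerically. The sketch below assumes the relationship N = C x ln(1 - P) / ln(1 - f), which reproduces the 11.5 million reads quoted above; the function name and argument names are illustrative.

```python
import math

def reads_needed(genome_size, read_length, coverage, p_success):
    """Estimate the number of reads N = C * ln(1 - P) / ln(1 - f)."""
    f = read_length / genome_size              # fraction of genome spanned by one read
    # log1p(-f) is ln(1 - f), computed accurately for very small f.
    n_single = math.log(1 - p_success) / math.log1p(-f)
    return math.ceil(coverage * n_single)

n = reads_needed(genome_size=1e7, read_length=200, coverage=50, p_success=0.99)
print(n)  # on the order of 11.5 million
```

Doubling the read length roughly halves N, which is the sense in which the number of reads goes down as reads get longer.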

C. PREPROCESSING OF READS

As with most things in bioinformatics, the most important factor influencing the quality of the final result is the quality of the starting material. In the case of sequencing, the most critical part is the initial cleaning of the data:

trimming adaptors and low quality nucleotide regions from the ends of reads
error correction and elimination of low quality reads (usually done in the same step)

Sequencing services typically provide reads in Fastq format. Most services will not do pre-processing of reads, so you need to do these steps.

Cock PJA et al. (2010) The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants Nucleic Acids Res. 2010 Apr; 38(6): 1767–1771. Published online 2009 Dec 16. doi: 10.1093/nar/gkp1137 PMCID: PMC2847217

Definition of Fastq format for sequencing reads

The information for each read consists of four lines per read.

line 1 - unique identifier of read
line 2 - sequence of read
line 3 - either '+' or optional repeat of title
line 4 - quality of each base read.

(Note: 4 lines are shown at right, but the sequence lines are long, so they wrap.)
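A minimal reader for the four-line layout described above might look like the following. It assumes each record occupies exactly four unwrapped lines and that the identifier line starts with '@'; the function name `read_fastq` is hypothetical, and established libraries handle more variants of the format.

```python
import io

def read_fastq(handle):
    """Yield (identifier, sequence, quality) tuples from a 4-line-per-read FASTQ stream."""
    while True:
        title = handle.readline().rstrip("\n")
        if not title:                      # end of file
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + title)
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ: " + title)
        yield title[1:], seq, qual         # drop the leading '@'

# Two toy records; the second repeats its title on line 3.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+read2\n??I5\n")
records = list(read_fastq(example))
print(records)
```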

Quality encoding

Phred quality scores are typically calculated as Q = -10 x log₁₀(P_e) where P_e is the estimated probability of error.

To save space in fastq files, each Q value is encoded as a single ASCII character: the character whose decimal code equals the Q value plus an offset, usually 33 or 64. For example, with offset 33, a Q value of 34 is encoded as the ASCII character with the decimal value 33 + 34 = 67 ie. the capital C. A Q value of 30, plus offset 33, gives the ASCII decimal value of 63, represented by the ASCII character question mark (?).

Note: Some sequence assembly programs need to know which offset your data uses, 33 or 64.
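The encoding can be sketched in a few lines of Python. The function names are illustrative; the offset defaults to 33 and can be set to 64 for the older Illumina variants.

```python
def phred_to_char(q, offset=33):
    """Encode a Phred quality score as a single ASCII character."""
    return chr(q + offset)

def char_to_phred(c, offset=33):
    """Decode a single quality character back to its Phred score."""
    return ord(c) - offset

def error_probability(q):
    """Invert Q = -10 * log10(P_e) to recover the estimated error probability."""
    return 10 ** (-q / 10)

def decode_quality(qual_string, offset=33):
    """Decode a whole quality line into a list of Phred scores."""
    return [char_to_phred(c, offset) for c in qual_string]

print(phred_to_char(34), phred_to_char(30))   # the two examples from the text
print(decode_quality("C?"))
```

Decoding with the wrong offset shifts every score by 31, which is why assembly programs need to be told which encoding the data uses.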

Description	OBF name	ASCII range	Offset	Quality score type	Quality score range
Sanger standard	`fastq-sanger`	33–126	33	PHRED	0 to 93
Solexa/early Illumina	`fastq-solexa`	59–126	64	Solexa	−5 to 62
Illumina 1.3+	`fastq-illumina`	64–126	64	PHRED	0 to 62

The first step after receiving sequencing read files should be to look at the quality of the reads using FastQC, which creates reports of the overall quality of the sequencing results. The example shown below is for a good sequencing run, in which the quality of the nucleotide calls is in the high range at all positions. For positions approaching 145 nt, the variation in quality of the reads gets larger. For this data, it is probably best not to use positions above 145 - 149 for sequence assembly. Most assembly programs will automatically filter out poor quality positions.

Per base quality graph

Trimming adapters and removing poor quality nucleotides from the ends of sequencing reads

For all sequencing technologies, the first dozen or so nucleotides at the 5' ends of raw sequencing reads are the adapters added to the fragments in the library. As shown in the FastQC output above, nucleotide calls also tend to be more error prone near the 3' ends of reads.

If adapters were not removed, sequence assembly programs would find that every sequence matched every other sequence in the first 12 - 13 bases which would confound the assembly of contigs from reads. If poor quality nucleotides were not removed, they would confound the process of overlapping reads, leading to smaller contigs. Finally, for some short reads, the 3' end of the read will also include adapters, which must be removed.

Programs such as Trimmomatic automatically detect the adapter sequences, and remove them from the ends of reads. Trimmomatic also has settings that let the user control the quality threshold for removing poor quality nucleotides.
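To make the two trimming operations concrete, here is a deliberately simple Python sketch. It is not Trimmomatic's algorithm: it clips an adapter only when the read starts with the exact adapter string, and it trims the 3' end base by base against a fixed threshold. The function names, the threshold of Q20 and the offset of 33 are assumptions made for illustration.

```python
def clip_5prime_adapter(seq, qual, adapter):
    """Remove an exact adapter match from the 5' end of a read, if present."""
    if seq.startswith(adapter):
        return seq[len(adapter):], qual[len(adapter):]
    return seq, qual

def trim_3prime(seq, qual, min_q=20, offset=33):
    """Drop bases from the 3' end until one with quality >= min_q is reached."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

# 'I' encodes Q40; '#' and '!' encode Q2 and Q0 with offset 33.
seq, qual = clip_5prime_adapter("GATCACGTAC", "IIIIIIII#!", "GATC")
seq, qual = trim_3prime(seq, qual)
print(seq, qual)
```

Real trimmers tolerate mismatches in the adapter and typically use a sliding window of average quality rather than a per-base cutoff.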

Keep in mind: When you trim reads from fastq files, you generate new fastq files that are almost the same size as the original files. Thus, if you have sequencing read files totaling 25 Gb, after trimming there will be an additional 25 Gb of fastq files with the trimmed reads. As well, the next step is error correction, which would generate another 25 Gb of trimmed, corrected reads. A good rule of thumb therefore is that for a genome sequencing project you need more than 3x the disk space required for the original raw reads.

Error correction

In sequencing, as in science in general, a smaller amount of high quality data is usually better than a larger amount of unreliable data. Reads with high error rates will often prevent the assembly of sequences into larger contigs:

errors may prevent the overlapping of reads, preventing the extension of contigs
errors may cause erroneous overlapping of reads, creating artifactual contigs

We will next talk about various strategies for identifying and discarding poorer quality reads. No matter how good a sequence assembly algorithm is, it will not be able to do a good assembly when reads contain a lot of errors. The most important single factor in getting a good assembly is having good quality reads.

Most methods for error correction begin with the creation of a table of oligonucleotides of length k ie. k-mers, listing their frequency in the dataset. Two such methods are Quake and Pollux.

Any sequence can be thought of as a nested set of k-mers. K-mer tables are a hash (dictionary) of all k-mers in the sequence, and their counts.

The example shows a k-mer table for the trivial case of k=2. For k=2, there are 4² = 16 possible k-mers. Typical k-mer tables are made for k-mers of 21 nucleotides or larger.

Usually only a fraction of the possible k-mers are found in the sequence (blue). k-mers missing from the genome (orange) are referred to as "nullomers".

Hash tables are data structures that are highly efficient to search.

from Moeckel C et al. (2024) A survey of k-mer methods and applications in bioinformatics. Computational and Structural Biotechnology Journal 23:2289-2303. DOI: 10.1016/j.csbj.2024.05.025.
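A k-mer table of the kind described above can be built with a Python dictionary. The sketch below uses `collections.Counter` (a dictionary subclass) and a made-up 10 nt sequence; the function names are illustrative.

```python
from collections import Counter
from itertools import product

def kmer_table(sequence, k):
    """Count every overlapping k-mer in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def nullomers(table, k, alphabet="ACGT"):
    """List the possible k-mers that never occur in the table."""
    all_kmers = ("".join(p) for p in product(alphabet, repeat=k))
    return sorted(kmer for kmer in all_kmers if kmer not in table)

table = kmer_table("ACGTACGTTT", 2)
missing = nullomers(table, 2)
print(dict(table))
print(len(missing), "of 16 possible 2-mers are absent")
```

Enumerating all 4^k possible k-mers is only practical for small k; for k of 21 or more, only the observed k-mers are stored.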

Quake

Kelley DR et al. (2010) Quake: quality-aware detection and correction of sequencing errors. Genome Biology 11:R116 DOI: 10.1186/gb-2010-11-11-r116

"For sufficiently large k, almost all single-base errors alter k-mers overlapping the error to versions that do not exist in the genome. Therefore, k-mers with low coverage, particularly those occurring just once or twice, usually represent sequencing errors. For the purpose of our discussion, we will refer to high coverage k-mers as trusted, because they are highly likely to occur in the genome, and low coverage k-mers as untrusted. Based on this principle, we can identify reads containing untrusted k-mers and either correct them so that all k-mers are trusted or simply discard them."

Put another way, k-mers that are real should occur in all reads mapping to a particular site in the genome, while erroneous k-mers might be found only once in a dataset.
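The trusted/untrusted distinction can be illustrated with a toy dataset. The sketch below counts plain k-mer occurrences and flags k-mers seen fewer than `min_count` times; the threshold, the value k=4 and the reads are made up for illustration and are far smaller than anything used in practice.

```python
from collections import Counter

def count_kmers(reads, k):
    """Build a k-mer frequency table over all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def untrusted_positions(read, counts, k, min_count=2):
    """Start positions of k-mers in the read that occur fewer than min_count times."""
    return [i for i in range(len(read) - k + 1)
            if counts[read[i:i + k]] < min_count]

# Three identical reads and one read with a single substitution at index 4.
reads = ["ACGTACGGA", "ACGTACGGA", "ACGTACGGA", "ACGTTCGGA"]
counts = count_kmers(reads, 4)
print(untrusted_positions(reads[0], counts, 4))
print(untrusted_positions(reads[3], counts, 4))
```

Every k-mer that overlaps the erroneous base is untrusted, while the k-mers on either side remain trusted.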

Choice of k-value is dependent on genome size

We want to choose a k-value such that there will be a low probability (eg. < 0.01) that a given k-mer will occur by chance in a genome of size G. So k should be chosen such that 2G/4^k = 0.01. (We use 2G to account for both strands of each chromosome). The equation simplifies to

k = log₄(200G)

k	4^k	Max. genome size (Mb) cutoff (4^k/200) for this k-value	example species	Genome size (Mb)
15	1.07e9	5.37	Escherichia coli	5.43
16	4.29e9	21.5	Saccharomyces cerevisiae	12.12
17	1.72e10	85.9	Leptosphaeria maculans	45.12
18	6.87e10	344	Drosophila melanogaster	143.7
19	2.75e11	1374	Solanum tuberosum	705.9
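The rule 2G/4^k <= 0.01 can be applied directly to find the smallest suitable k for a given genome size. The sketch below is a straightforward search; the function names are illustrative, and the genome sizes are those listed in the table.

```python
def min_k_for_genome(genome_size_bp, max_chance=0.01):
    """Smallest k for which 2G / 4^k does not exceed max_chance."""
    k = 1
    while 2 * genome_size_bp / 4 ** k > max_chance:
        k += 1
    return k

def max_genome_mb(k, max_chance=0.01):
    """Largest genome (in Mb) for which this k satisfies 2G / 4^k <= max_chance."""
    return max_chance * 4 ** k / 2 / 1e6

for name, size_mb in [("Saccharomyces cerevisiae", 12.12),
                      ("Drosophila melanogaster", 143.7),
                      ("Solanum tuberosum", 705.9)]:
    print(name, min_k_for_genome(size_mb * 1e6))
```

Applied strictly, the rule gives k=16 for a 5.43 Mb genome such as the E. coli entry, since 5.43 Mb lies just above the 5.37 Mb cutoff for k=15; the table row should be read as an approximate match.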

Rather than simply counting the number of occurrences of each k-mer in a genome, Quake counts q-mers. Each k-mer in a read is weighted by the quality values for each nucleotide in the k-mer. Thus, the q-count for a given q-mer is the sum of q-mer values for all instances of that k-mer in the dataset.
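One way to picture quality-weighted counting is sketched below. It assumes that each instance of a k-mer is weighted by the product of the probabilities that its bases were called correctly (1 - P_e for each base); this weighting is an assumption made here for illustration, and the Quake paper should be consulted for the exact definition. The function name and offset of 33 are also illustrative.

```python
from collections import defaultdict

def qmer_counts(reads_with_quals, k, offset=33):
    """Sum a quality-derived weight for every instance of every k-mer."""
    counts = defaultdict(float)
    for seq, qual in reads_with_quals:
        # Probability that each base call is correct, from its Phred score.
        p_correct = [1 - 10 ** (-(ord(c) - offset) / 10) for c in qual]
        for i in range(len(seq) - k + 1):
            weight = 1.0
            for p in p_correct[i:i + k]:
                weight *= p
            counts[seq[i:i + k]] += weight
    return counts

# The same 4-mer seen once with high quality (Q40) and once with Q0.
qcounts = qmer_counts([("ACGT", "IIII"), ("ACGT", "!!!!")], 4)
print(qcounts["ACGT"])
```

A plain k-mer count would give 2 here; the weighted count is just under 1 because the low-quality instance contributes almost nothing.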

Localize errors. Trusted (green) and untrusted (red) 15-mers are drawn against a 36 bp read. In (a), the intersection of the untrusted k-mers localizes the sequencing error to the highlighted column. In (b), the untrusted k-mers reach the edge of the read, so we must consider the bases at the edge in addition to the intersection of the untrusted k-mers. However, in most cases, we can further localize the error by considering all bases covered by the right-most trusted k-mer to be correct and removing them from the error region as shown in (c).

Fig. 4 from Kelley et al.

To correct errors, untrusted k-mers are replaced by trusted k-mers that could correct the error. The most likely corrections are tried, based on quality information, until a set of corrections is found that makes all k-mers in a read into trusted k-mers.
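The idea of searching for a correction that makes every k-mer trusted can be sketched as a brute-force search over single-base substitutions. This is a simplification for illustration: it ignores quality values, tries positions left to right rather than most-likely-first, and handles only one error per read. The function names, k=4 and the toy reads are assumptions.

```python
from collections import Counter

def all_trusted(read, counts, k, min_count):
    """True if every k-mer in the read occurs at least min_count times."""
    return all(counts[read[i:i + k]] >= min_count
               for i in range(len(read) - k + 1))

def correct_read(read, counts, k, min_count=2):
    """Return the read, a single-substitution correction of it, or None."""
    if all_trusted(read, counts, k, min_count):
        return read
    for pos in range(len(read)):
        for base in "ACGT":
            if base != read[pos]:
                candidate = read[:pos] + base + read[pos + 1:]
                if all_trusted(candidate, counts, k, min_count):
                    return candidate
    return None  # no single substitution makes every k-mer trusted

reads = ["ACGTACGGA", "ACGTACGGA", "ACGTACGGA", "ACGTTCGGA"]
k = 4
counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
fixed = correct_read("ACGTTCGGA", counts, k)
print(fixed)
```

Reads for which no correction is found would be discarded rather than passed on to the assembler.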

Pollux

Marinier E et al. (2015) Pollux: platform independent error correction of single and mixed genomes. BMC Bioinformatics 16:10 DOI: 10.1186/s12859-014-0435-6

Improvements over previous methods:

31 nt k-mers; so large that we don't have to worry about them occurring in the genome by random chance
larger k-word size for faster, more accurate error-checking
makes no assumptions about quality scoring
corrects errors regardless of the sequencing technology
good at correcting errors in homopolymer tracts

Pollux algorithm

# encode all k-mers from all reads into table
for each read do
    trim N's from ends of read
    replace internal N's with an arbitrary nucleotide
    add all k-mers from read to k-mer table

# get rid of any k-mer that is only represented once
for each k-mer in table
    if count(k-mer) = 1
        remove k-mer from table

for each read do
    find all potential error positions as localized drops in k-mer frequency
    for each error position
        try correcting by insertions, deletions, substitutions
        choose best correction
        if valid(correction)
            apply correction to read
        else
            # assume homopolymer correction
            try different lengths of homopolymer
            choose best correction
            if valid(correction)
                apply correction to read

A more complete pseudo-code for the algorithm is found in the Marinier et al. publication.

Better accuracy through long k-mer sizes. As read lengths increase with improving sequencing technologies, it is possible to improve the correction process using much larger k-mer sizes. Pollux uses k=31 because 31 is the longest k-mer that can be represented in a 64-bit word. Any given 31-mer is not likely to be found in any genome, by random chance, regardless of how big the genome is.

This requires encoding each nucleotide in 2 binary bits. For example:

nucleotide	binary representation
A	00
G	01
C	10
T	11

2-bit encoding makes it possible to store the k-mer table in a much smaller space than would otherwise be possible. The one problem is that 2 bits can only encode four possible choices. To represent a fifth nucleotide N, you would need 3 bits.

The solution is to replace each N in a read with an arbitrarily chosen nucleotide. That way, the N can still be treated as an error, which can be corrected like any other error.
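The 2-bit packing can be sketched as follows, using the same bit assignments as the table above. The function names are illustrative. Python integers are not limited to 64 bits, so the sketch only demonstrates that a 31-mer needs 62 bits.

```python
# Bit assignments taken from the table: A=00, G=01, C=10, T=11.
CODE = {"A": 0b00, "G": 0b01, "C": 0b10, "T": 0b11}
BASE = {value: base for base, value in CODE.items()}

def encode_kmer(kmer):
    """Pack a k-mer into an integer, 2 bits per nucleotide."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]   # raises KeyError for N
    return value

def decode_kmer(value, k):
    """Unpack an integer produced by encode_kmer back into a k-mer of length k."""
    bases = []
    for _ in range(k):
        bases.append(BASE[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))

packed = encode_kmer("AGCT")
print(packed, decode_kmer(packed, 4))
print(encode_kmer("T" * 31).bit_length())   # bits needed for the largest 31-mer
```

Because N has no 2-bit code, `encode_kmer` fails on it, which is why each N has to be replaced by a nucleotide before the k-mers are stored.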

Detection of errors as troughs in k-mer frequencies - Marinier et al. recognized that the k-mer count could be high in correct regions of a read, and very low in regions that contain errors. Usually the right end of the "trough" would locate the error, because immediately downstream from the error, the k-mers should all be high-frequency k-mers.

The authors state that the trough detection approach avoids the assumptions about sequence quality scores that are inherent in Quake.
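Trough detection can be illustrated by walking along a read and recording the frequency of each successive k-mer. The sketch below uses a fixed count threshold to mark a trough, which is a simplification; the criteria Pollux actually uses to recognize a localized drop are described in the paper. The function names, k=4 and the toy reads are assumptions.

```python
from collections import Counter

def kmer_profile(read, counts, k):
    """Frequency of each successive k-mer along the read."""
    return [counts.get(read[i:i + k], 0) for i in range(len(read) - k + 1)]

def find_troughs(profile, min_count=2):
    """Return (first, last) index pairs of runs where the count is below min_count."""
    troughs, start = [], None
    for i, count in enumerate(profile):
        if count < min_count and start is None:
            start = i                          # trough begins
        elif count >= min_count and start is not None:
            troughs.append((start, i - 1))     # trough ended at previous position
            start = None
    if start is not None:                      # trough runs to the end of the read
        troughs.append((start, len(profile) - 1))
    return troughs

reads = ["ACGTACGGA", "ACGTACGGA", "ACGTACGGA", "ACGTTCGGA"]
k = 4
counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
profile = kmer_profile("ACGTTCGGA", counts, k)
print(profile, find_troughs(profile))
```

In this toy case the last low-frequency k-mer starts at index 4, which is the position of the substituted base, in line with the observation that the right end of the trough locates the error.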

Homopolymer correction - A second innovation of Pollux is the use of k-mer frequencies to correct errors in homopolymer runs. Ion Torrent sequencing in particular is prone to calling homopolymer lengths incorrectly.

Method: replace k-mers in a read with k-mers having different numbers of repeated nucleotides

Example: Suppose the internal C in this k-mer may be a miscalled homopolymer run.

original k-mer: CGTCATT

alternative k-mers: CGTCCATT, CGTCCCATT, CGTCCCCATT, CGTCCCCCATT etc.

Choose the alternate homopolymer-containing k-mers that occur most frequently. For example, if the k-mers that have 4 C's at that position are all more frequent than the ones with 1, 2, 3, 5 or more C's, then we correct the read to have 4 C's.
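The homopolymer search can be sketched by generating the alternative sequences and looking each one up in a frequency table. This is a simplified illustration: it compares whole alternative strings of differing lengths, as in the example above, rather than fixed-length k-mers across a window, and the frequency table is made up. The function names and `max_len` are assumptions.

```python
def homopolymer_variants(kmer, pos, max_len=6):
    """Map run length -> sequence, varying the homopolymer run that contains pos."""
    base = kmer[pos]
    start = pos
    while start > 0 and kmer[start - 1] == base:
        start -= 1                      # extend left to the start of the run
    end = pos
    while end < len(kmer) - 1 and kmer[end + 1] == base:
        end += 1                        # extend right to the end of the run
    return {n: kmer[:start] + base * n + kmer[end + 1:]
            for n in range(1, max_len + 1)}

def best_run_length(kmer, pos, counts, max_len=6):
    """Run length whose variant is most frequent in the counts table."""
    variants = homopolymer_variants(kmer, pos, max_len)
    return max(variants, key=lambda n: counts.get(variants[n], 0))

# Hypothetical frequencies: the 4-C form dominates the dataset.
toy_counts = {"CGTCATT": 1, "CGTCCCATT": 2, "CGTCCCCATT": 12}
variants = homopolymer_variants("CGTCATT", 3)
print(variants[2], variants[4])
print(best_run_length("CGTCATT", 3, toy_counts))
```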

Unless otherwise cited or referenced, all content on this page is licensed under the Creative Commons License Attribution Share-Alike 2.5 Canada

November 2024
