PLNT4610/PLNT7690 Bioinformatics
Lecture 10, part 1 of 2

November 2024

Genomic Sequencing and Assembly


REFERENCES

Ekblom R, Wolf JBW (2014) A field guide to whole-genome sequencing, assembly and annotation. Evolutionary Applications. http://onlinelibrary.wiley.com/doi/10.1111/eva.12178/full



A. OVERVIEW

1. Top Down: sequencing of BACs

2. Bottom-up: Whole Genome Shotgun Sequencing

B. CRITICAL PARAMETERS OF SEQUENCING TECHNOLOGIES

C. PREPROCESSING OF READS

D. ASSEMBLY OF READS INTO CONTIGS

E. JOINING CONTIGS INTO SCAFFOLDS


A. OVERVIEW

To understand the problem of genomic sequencing, we need to understand the properties of prokaryotic and eukaryotic genomes.


Prokaryotes
  • taxa: bacteria, archaea
  • genome structure: single circular chromosome; plasmids
  • ploidy levels: haploid
  • total genome length (bp): 10^6 - 10^7
  • complexity: single copy

Eukaryotes
  • taxa: fungi, protists, plants, animals
  • genome structure: two or more linear chromosomes; circular organelle genomes (mitochondrial, plastid)
  • ploidy levels: diploid, tetraploid, hexaploid or higher
  • total genome length (bp): 10^7 - 10^12
  • complexity:
    • single copy (1 - 10 copies)
      • usually only a few % of genome
      • most genes
    • middle-repetitive (10 - 500,000 copies)
      • most of the genome is mid-rep
      • transposable elements a major component
      • high-copy genes: histones, rRNA
    • highly-repetitive (>500,000 copies)
      • mostly short sequences repeated many times
      • concentrated at centromeres and telomeres
      • microsatellite sequences interspersed throughout the chromosome

As sequencing technologies have produced longer reads, combined with paired-end reads, it has become possible to assemble complete prokaryotic genomes by the whole genome shotgun approach. However, the complexity of most eukaryotic genomes makes it almost impossible to obtain complete genomic sequences at today's level of technology.

1. Top Down: sequencing of BACs

The first complete genomic sequences for eukaryotes were done using a "divide and conquer" strategy.

1. Construct a representative BAC library (avg. insert ~ 100 kb)
2. For each chromosome, find a set of overlapping BAC clones whose inserts cover the entire chromosome. In this way, the location of each BAC on the chromosome is known.
3. Sequence each BAC separately using shotgun sequencing.

Advantages

Disadvantages

2. Bottom-up: Whole Genome Shotgun Sequencing


Summary of Whole Genome Sequencing

Advantages

Disadvantages


B. CRITICAL PARAMETERS OF SEQUENCING TECHNOLOGIES

Before you sequence a genome, you need to know the basic characteristics of your genome, and the basic parameters of the sequencing technology or technologies you plan to use. 

Platform        | Read length (nt) | Reads/unit       | Read Type | Error type              | Comment
Illumina        | 200 - 600        | 10 - 375 million | PE, SR    | substitution            | highest throughput
Ion Torrent     | 200 - 400        | 0.4 - 60 million | SR        | homopolymers            |
Pac Bio         | 4600 - 14,000    | 22,000 - 47,000  | SR        | indel                   | longest reads; highest error rate
Oxford nanopore | 15,000 - 20,000  |                  | SR        | homopolymers; deletions | longest reads can exceed 1000 kb
Roche 454       | 400 - 700        | 20,000 - 350,000 | SR        | indel                   | lowest error rate

SR - single read; PE - paired end

Data from Choosing the Right NGS Sequencing Instrument for Your Study http://genohub.com/ngs-instrument-guide/#pacbio

Guidelines:

How many reads do we need?

Coverage - Ideally we want to cover an entire genome with at least 50-fold redundancy of reads, meaning that every nucleotide position is represented in at least 50 reads in the dataset. This level of redundancy is referred to as coverage.

High coverage is needed for several reasons: random errors in individual reads can only be recognized and corrected when many overlapping reads cover the same position, and because sampling is random, some regions will be covered by far fewer reads than the average.
The number of reads required to completely sequence a genome is given by

N = c x ln(1 - P) / ln(1 - f)

where
  N = number of reads
  c = desired coverage (e.g. 50)
  P = desired probability that any given position is sequenced
  f = fraction of the genome covered by a single read (read length / genome length)

Example: We want to sequence a genome of 1 x 10^7 bp. If the read size is 200 nt, then f will be 200 nt/(1 x 10^7 nt) = 2 x 10^-5. If we want a probability of success of P = 99%, then for 50-fold coverage, N = 50 x ln(0.01)/ln(1 - 2 x 10^-5) = 11.5 million reads. As a rule of thumb, then, the number of reads needed for 50-fold coverage is roughly the number of nucleotides in the genome. This number will go down as reads get longer.
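The calculation in the example can be sketched in Python. The formula N = c x ln(1 - P)/ln(1 - f) is assumed from the worked numbers; the function name `reads_needed` is illustrative, not from any library:

```python
import math

def reads_needed(genome_len, read_len, coverage=50, p_success=0.99):
    # f: fraction of the genome covered by a single read
    f = read_len / genome_len
    # N = c * ln(1 - P) / ln(1 - f)
    return coverage * math.log(1 - p_success) / math.log(1 - f)

# Example from the text: 1e7 bp genome, 200 nt reads, 50-fold coverage, P = 99%
n = reads_needed(1e7, 200)
```

For these values n comes out to roughly 11.5 million reads, about the number of nucleotides in the genome, as the rule of thumb states.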

C. PREPROCESSING OF READS

As with most things in bioinformatics, the most important factor influencing the quality of the final result is the quality of the starting material. In the case of sequencing, the most critical part is the initial cleaning of the data:

Sequencing services typically provide reads in Fastq format. Most services will not do pre-processing of reads, so you need to do these steps.

Cock PJA et al. (2010) The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res 38(6):1767-1771. doi:10.1093/nar/gkp1137
Definition of Fastq format for sequencing reads

The information for each read consists of four lines per read.

line 1 - unique identifier of read
line 2 - sequence of read
line 3 - either '+' or optional repeat of title
line 4 - quality of each base read.

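As a sketch, a four-line record (with a made-up read) can be split into its fields in Python:

```python
# A hypothetical FASTQ record: identifier, sequence, '+', per-base qualities.
record = "@read_001\nACGTTGCA\n+\nIIIIHHGF"

lines = record.splitlines()
identifier = lines[0][1:]   # line 1: unique identifier (leading '@' stripped)
sequence = lines[1]         # line 2: the read sequence
qualities = lines[3]        # line 4: one quality character per base
```

Note that the quality string must be exactly as long as the sequence, one character per base.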

Quality encoding

Phred quality scores are typically calculated as Q = -10 x log10(Pe) where Pe is the estimated probability of error.

To save space in fastq files, Q values are encoded as single ASCII characters: the Q value plus an offset (usually 33 or 64) gives the decimal number of the ASCII character. For example, with offset 33, a Q value of 34 is encoded by the character in the ASCII chart with decimal value 33 + 34 = 67, i.e. the capital C. A Q value of 30, plus offset 33, gives the ASCII decimal value 63, represented by the question mark (?).

Note: Some sequence assembly programs need to know which offset your data uses, 33 or 64. 
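The encoding can be checked with a few lines of Python; `decode_quality` is an illustrative helper, with the offset defaulting to the Sanger value of 33:

```python
# Convert a string of Phred quality characters back to Q values.
def decode_quality(qual_string, offset=33):
    return [ord(c) - offset for c in qual_string]

# Q = 34 encodes as chr(34 + 33) = 'C'; Q = 30 encodes as chr(63) = '?'
```

Running the same data through the function with the wrong offset (64 instead of 33) shifts every Q value by 31, which is why assemblers need to know which offset was used.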

Description            | OBF name       | ASCII range | Offset | Quality type | Quality range
Sanger standard        | fastq-sanger   | 33 - 126    | 33     | PHRED        | 0 to 93
Solexa/early Illumina  | fastq-solexa   | 59 - 126    | 64     | Solexa       | -5 to 62
Illumina 1.3+          | fastq-illumina | 64 - 126    | 64     | PHRED        | 0 to 62

The first step after receiving sequencing read files should be to look at the quality of the reads using FastQC, which creates reports of the overall quality of the sequencing results. The example shown below is for a good sequencing run, in which the quality of the nucleotide calls is in the high range at all positions. For positions approaching 145 nt, the variation in read quality becomes larger. For these data, it is probably best not to use positions beyond 145 - 149 for sequence assembly. Most assembly programs will automatically filter out poor-quality positions.


Per base quality graph

 

Trimming adapters and removing poor quality nucleotides from the ends of sequencing reads
 
For all sequencing technologies, the first dozen or so nucleotides at the 5' ends of raw sequencing reads are the adapters added to the fragments in the library. As shown in the FastQC output above, the quality of nucleotide calls is often more error prone, especially near the 3' ends of reads.



If adapters were not removed, sequence assembly programs would find that every read matched every other read in its first 12 - 13 bases, which would confound the assembly of contigs from reads. If poor-quality nucleotides were not removed, they would confound the process of overlapping reads, leading to smaller contigs. Finally, for some short fragments, the 3' end of the read will also include adapter sequence, which must likewise be removed.

Programs such as Trimmomatic automatically detect the adapter sequences, and remove them from the ends of reads. Trimmomatic also has settings that let the user control the quality threshold for removing poor quality nucleotides.


Keep in mind: When you trim reads from fastq files, you generate new fastq files that are almost the same size as the originals. Thus, if you have sequencing read files totaling 25 Gb, trimming will produce an additional 25 Gb of fastq files with the trimmed reads. The next step, error correction, generates another 25 Gb of trimmed, corrected reads. A good rule of thumb, therefore, is that a genome sequencing project needs more than 3x the disk space required for the original raw reads.


Error correction

In sequencing, as in science in general, a smaller amount of high-quality data is usually better than a larger amount of unreliable data. Reads with high error rates will often prevent the assembly of sequences into larger contigs. We will next discuss strategies for identifying and correcting or discarding poorer-quality reads. No matter how good a sequence assembly algorithm is, it will not be able to do a good assembly when reads contain a lot of errors; the most important single factor in getting a good assembly is having good-quality reads.

Most methods for error correction begin with the creation of a table of oligonucleotides of length k ie. k-mers, listing their frequency in the dataset. Two such methods are Quake and Pollux.

Any sequence can be thought of as a nested set of k-mers. A k-mer table is a hash (dictionary) of all k-mers in the sequence and their counts.

The figure shows a k-mer table for the trivial case of k = 2. For k = 2, there are 4^2 = 16 possible k-mers. Typical k-mer tables are made for k-mers of 21 nucleotides or larger.

Usually only a fraction of the possible k-mers are found in the sequence (blue). k-mers missing from the genome (orange) are referred to as "nullomers".

Hash tables are data structures that are highly efficient to search.
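A k-mer table for a short sequence can be built with a Python dictionary in a few lines; `kmer_table` is an illustrative name:

```python
from collections import Counter

def kmer_table(seq, k):
    # Hash (dictionary) of every k-mer in seq and its count
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# "ACGACGT" contains the 2-mers AC, CG, GA, AC, CG, GT
table = kmer_table("ACGACGT", 2)
```

Of the 16 possible 2-mers, only 4 distinct ones occur in this sequence; the other 12 are nullomers for it.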


from Moeckel C et al. (2024) A survey of k-mer methods and applications in bioinformatics. Computational and Structural Biotechnology Journal 23:2289-2303. DOI: 10.1016/j.csbj.2024.05.025.


Quake

Kelley DR et al. (2010) Quake: quality-aware detection and correction of sequencing errors. Genome Biology 11:R116. doi:10.1186/gb-2010-11-11-r116

"For sufficiently large k, almost all single-base errors alter k-mers overlapping the error to versions that do not exist in the genome. Therefore, k-mers with low coverage, particularly those occurring just once or twice, usually represent sequencing errors. For the purpose of our discussion, we will refer to high coverage k-mers as trusted, because they are highly likely to occur in the genome, and low coverage k-mers as untrusted. Based on this principle, we can identify reads containing untrusted k-mers and either correct them so that all k-mers are trusted or simply discard them."
Put another way, k-mers that are real should occur in all reads mapping to a particular site in the genome, while erroneous k-mers might be found only once in a dataset.

Choice of k-value is dependent on genome size

We want to choose a k-value such that there will be a low probability (eg. < 0.01) that a given k-mer will occur by chance in a genome of size G. So k should be chosen such that 2G/4^k = 0.01. (We use 2G to account for both strands of each chromosome.) The equation simplifies to

k = log4(200G)

k  | 4^k     | Max. genome size (Mb) cutoff (4^k/200) | Example species          | Genome size (Mb)
15 | 1.07e9  | 5.37                                   | Escherichia coli         | 5.43
16 | 4.29e9  | 21.5                                   | Saccharomyces cerevisiae | 12.12
17 | 1.72e10 | 85.9                                   | Leptosphaeria maculans   | 45.12
18 | 6.87e10 | 344                                    | Drosophila melanogaster  | 143.7
19 | 2.75e11 | 1374                                   | Solanum tuberosum        | 705.9
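The cutoffs in the table follow directly from k = log4(200G); as a minimal sketch (with an illustrative function name):

```python
import math

def k_for_genome(genome_size, p=0.01):
    # Smallest integer k such that 2G / 4^k <= p,
    # i.e. k >= log4(2G / p) = log4(200G) for p = 0.01
    return math.ceil(math.log(2 * genome_size / p, 4))
```

For an E. coli-sized genome (~5 Mb) this gives k = 15, and for the potato genome (~706 Mb) it gives k = 19, matching the table.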


Rather than simply counting the number of occurrences of each k-mer in the dataset, Quake counts q-mers. Each instance of a k-mer in a read is weighted using the quality values of its nucleotides, and the q-count for a given k-mer is the sum of these weights over all instances of that k-mer in the dataset.

Localize errors. Trusted (green) and untrusted (red) 15-mers are drawn against a 36 bp read. In (a), the intersection of the untrusted k-mers localizes the sequencing error to the highlighted column. In (b), the untrusted k-mers reach the edge of the read, so we must consider the bases at the edge in addition to the intersection of the untrusted k-mers. However, in most cases, we can further localize the error by considering all bases covered by the right-most trusted k-mer to be correct and removing them from the error region as shown in (c).

Fig. 4 from Kelley et al.

To correct errors, untrusted k-mers are replaced by trusted k-mers that could correct the error. The most likely corrections are tried, based on quality information, until a set of corrections is found that makes all k-mers in a read into trusted k-mers.




Pollux

Marinier E et al. (2015) Pollux: platform independent error correction of single and mixed genomes. BMC Bioinformatics 16:10. doi:10.1186/s12859-014-0435-6


Improvements over previous methods:

Pollux algorithm

# encode all k-mers from all reads into table
for each read:
    trim N's from the ends of the read
    replace internal N's with an arbitrary nucleotide
    add all k-mers from the read to the k-mer table

# discard any k-mer that is represented only once
for each k-mer in table:
    if count(k-mer) == 1:
        remove k-mer from table

# correct the reads
for each read:
    find all potential error positions as localized drops in k-mer frequency
    for each error position:
        try correcting by insertions, deletions, substitutions
        choose best correction
        if valid(correction):
            apply correction to read
        else:
            # assume a homopolymer error
            try different lengths of homopolymer
            choose best correction
            if valid(correction):
                apply correction to read
A more complete pseudo-code version of the algorithm is found in the Marinier et al. publication.
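The first two phases above (building the table and discarding singletons) might look like this in Python; the function name, the small k value, and the arbitrary N replacement are illustrative choices, not taken from the Pollux code:

```python
from collections import Counter

def build_trusted_table(reads, k=7):
    table = Counter()
    for read in reads:
        # trim N's from the ends; replace internal N's arbitrarily with A
        read = read.strip("N").replace("N", "A")
        for i in range(len(read) - k + 1):
            table[read[i:i + k]] += 1
    # k-mers seen only once are presumed to be sequencing errors
    return {kmer: n for kmer, n in table.items() if n > 1}

# Two identical reads plus one read carrying a single-base error
reads = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGAAC"]
table = build_trusted_table(reads)
```

The error in the third read creates 7-mers that occur only once in the dataset, and these drop out of the table, while the shared 7-mers survive.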


Better accuracy through long k-mer sizes. As read lengths increase with improving sequencing technologies, the correction process can be improved by using much larger k-mer sizes. Pollux uses k = 31 because a 31-mer, at 2 bits per nucleotide, occupies 62 bits and therefore fits in a 64-bit word. Any given 31-mer is unlikely to be found in any genome by random chance, regardless of how big the genome is.

This requires encoding each nucleotide in 2 binary bits. For example:

nucleotide | binary representation
A          | 00
G          | 01
C          | 10
T          | 11

2-bit encoding makes it possible to store the k-mer table in a much smaller space than would otherwise be possible. The one problem is that 2 bits can only encode four possible characters. To represent a fifth character, N, you would need 3 bits.

The solution is to replace each N in a read with an arbitrarily chosen nucleotide. That way, the N is treated as an error, which can be corrected like any other error.
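A 2-bit packing routine is a few lines of Python; `pack_kmer` is an illustrative name, and the code assignments follow the table above:

```python
# Pack a k-mer into an integer at 2 bits per nucleotide (A=00, G=01, C=10, T=11)
ENCODE = {"A": 0b00, "G": 0b01, "C": 0b10, "T": 0b11}

def pack_kmer(kmer):
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODE[base]
    return value
```

A 31-mer packs into 62 bits, so any 31-mer fits in a 64-bit word with two bits to spare.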

Detection of errors as troughs in k-mer frequencies - Marinier et al. recognized that k-mer counts are high in correct regions of a read and very low in regions that contain errors. Usually the right end of the "trough" locates the error, because immediately downstream of the error the k-mers should again be high-frequency k-mers.



The authors state that the trough detection approach avoids the assumptions about sequence quality scores that are inherent in Quake.

Homopolymer correction - A second innovation of Pollux is the use of k-mer frequencies to correct errors in homopolymer runs. Ion Torrent sequencing, in particular, is prone to calling homopolymer lengths incorrectly.

Method: replace k-mers in a read with k-mers having different numbers of the repeated nucleotide.

Example: consider the internal C in this k-mer:

original k-mer: CGTCATT

alternative k-mers: CGTCCATT, CGTCCCATT, CGTCCCCATT, CGTCCCCCATT, etc.

Choose the alternative homopolymer-containing k-mers that occur most frequently. For example, if the k-mers with 4 C's at that position are all more frequent than those with 1, 2, 3, 5, or more C's, then we correct the read to have 4 C's.
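A sketch of generating the alternative k-mers, assuming a hypothetical helper that rebuilds the homopolymer run at a given position with different lengths (not Pollux's actual code):

```python
def homopolymer_variants(kmer, pos, max_run=5):
    # Replace the run of kmer[pos] around position pos with runs of length 1..max_run
    base = kmer[pos]
    start, end = pos, pos
    while start > 0 and kmer[start - 1] == base:
        start -= 1
    while end < len(kmer) - 1 and kmer[end + 1] == base:
        end += 1
    prefix, suffix = kmer[:start], kmer[end + 1:]
    return [prefix + base * n + suffix for n in range(1, max_run + 1)]

# The internal C of CGTCATT is at index 3
variants = homopolymer_variants("CGTCATT", 3)
```

Pollux would then keep whichever variant's k-mers occur most frequently in the k-mer table.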



Unless otherwise cited or referenced, all content on this page is licensed under the Creative Commons License Attribution Share-Alike 2.5 Canada