PLNT4610/PLNT7690 Bioinformatics Lecture 10, part 2 of 2

## D. ASSEMBLY OF READS INTO CONTIGS

Jason R. Miller, Sergey Koren, Granger Sutton (2010) Assembly algorithms for next-generation sequencing data. Genomics 95, Issue 6, June 2010, Pages 315–327
http://dx.doi.org.uml.idm.oclc.org/10.1016/j.ygeno.2010.03.001

Rationale: In principle, you might wish to assemble a genome by doing pairwise Smith-Waterman alignments between all possible pairs of reads. However, current NGS technologies generate on the order of $10^8$ or more reads per genome. A complete set of pairwise comparisons would require on the order of $(10^8)^2/2 = 5 \times 10^{15}$ alignments, and that would only be the starting point of the assembly problem.
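As a quick check of the arithmetic above, the following sketch counts the read pairs for a read total of 10^8 (the order of magnitude quoted in the text; real data sets vary). The variable names are illustrative only.

```python
# Number of reads per genome: order of magnitude taken from the text.
n_reads = 10**8

# Exact number of unordered pairs of distinct reads: n(n-1)/2.
exact_pairs = n_reads * (n_reads - 1) // 2

# Approximation used in the text: n^2 / 2.
approx_pairs = n_reads**2 // 2

print(f"exact pairs:  {exact_pairs:.3e}")
print(f"approx pairs: {approx_pairs:.3e}")
```

The exact and approximate counts differ only in the eighth significant digit, so the simpler n^2/2 estimate is adequate for this argument.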

Although early sequence assembly programs attempted to do just that, modern NGS assemblers do not construct alignments directly. Instead, they build a graph of connected k-mers, which is a de facto representation of an alignment.

Goal: To assemble the reads into the largest possible contigs that are consistent with the data in the reads.

#### All NGS sequence assemblers represent assemblies as graphs of k-mers

This example shows the ideal situation, in which the reads contain no errors. (a) Two overlapping reads. (b) A k-mer graph representing all k-mers found in the two reads. The graph is a single path connecting consecutive k-mers, and it corresponds to the overlapping reads. (c) The alignment is a byproduct of graph construction: if you construct a graph through the overlapping k-mers, you implicitly construct an alignment.

If the input data were perfect, there would be only one path connecting the k-mers. The graph would then be a straight line, as in (b).
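The idea can be sketched in a few lines of Python. This is a minimal illustration, not how any particular assembler is implemented: the two toy reads, the choice of k = 4, and the function names are all assumptions made for the example. Each k-mer is a node, and an edge joins two k-mers that are adjacent in a read (so they overlap by k-1 bases).

```python
def kmers(seq, k):
    """Return every k-mer of seq, in left-to-right order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_kmer_graph(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in a read."""
    graph = {}
    for read in reads:
        ks = kmers(read, k)
        for node in ks:
            graph.setdefault(node, set())
        for left, right in zip(ks, ks[1:]):
            graph[left].add(right)
    return graph


def spell_single_path(graph):
    """Walk an unbranched graph from its only start node and spell the sequence."""
    has_predecessor = {nxt for successors in graph.values() for nxt in successors}
    starts = [node for node in graph if node not in has_predecessor]
    if len(starts) != 1:
        raise ValueError("expected exactly one start k-mer")
    node = starts[0]
    sequence = node
    visited = {node}
    while graph[node]:
        if len(graph[node]) > 1:
            raise ValueError("graph branches; not a single path")
        node = next(iter(graph[node]))
        if node in visited:
            raise ValueError("graph contains a cycle")
        visited.add(node)
        sequence += node[-1]  # each step along the path adds one new base
    return sequence


reads = ["ACGTTGCA", "TTGCATCG"]  # toy error-free reads overlapping by "TTGCA"
graph = build_kmer_graph(reads, k=4)
contig = spell_single_path(graph)
print(contig)  # ACGTTGCATCG
```

No pairwise alignment was computed, yet the k-mers shared by the two reads (TTGC and TGCA) merge them into one contig. That is the sense in which the alignment is implicit in the graph.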

#### Errors in reads and repetitive sequences result in more complex graphs

When reads contain errors, or the genome contains repeats, the graph branches: two or more alternative paths pass through different sets of k-mers.

(a) spur - a dead-end path.
(b) bubble - two alternative paths that diverge from a single k-mer and later converge on a single k-mer. Bubbles can be caused by repeats, and are often caused by an error in the middle of a read.
(c) frayed rope - two or more paths converge on a single shared path, and later diverge again. Frayed ropes can also be caused by repeats.

In the graphs illustrated above, the thickness of the arrows connecting the k-mers represents the frequency of the k-mers. It is often possible to choose between alternative paths by taking the path whose k-mers occur most frequently in the reads. This approach will correct a substantial number of errors, allowing assembly of k-mers into larger contigs.
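A minimal sketch of frequency-based path choice follows. It assumes a toy data set (three error-free copies of a read plus one copy with a single substitution in the middle, which creates a bubble) and a simple greedy rule: at each branch, follow the successor k-mer seen most often. Real assemblers use more elaborate error correction; the function names here are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count how many times each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def best_successor(kmer, counts):
    """Return the observed successor k-mer with the highest count, or None."""
    candidates = [kmer[1:] + base for base in "ACGT"]
    candidates = [c for c in candidates if counts[c] > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda c: counts[c])


def greedy_walk(start, counts):
    """Extend a contig from start, always taking the best-supported branch."""
    sequence = start
    node = start
    visited = {start}
    while True:
        nxt = best_successor(node, counts)
        if nxt is None or nxt in visited:  # dead end, or we would loop
            return sequence
        visited.add(nxt)
        sequence += nxt[-1]
        node = nxt


# Three correct reads and one read with a G->C error at the sixth base.
reads = ["ACGTTGCATCG"] * 3 + ["ACGTTCCATCG"]
kmer_counts = count_kmers(reads, k=4)

print(kmer_counts["GTTG"], kmer_counts["GTTC"])  # 3 1
print(greedy_walk("ACGT", kmer_counts))          # ACGTTGCATCG
```

At the branch after CGTT, the correct k-mer GTTG is supported by three reads and the erroneous GTTC by one, so the walk follows the correct side of the bubble.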

### Repetitive elements make it difficult or impossible to assemble long contigs

Eukaryotic genomes are especially difficult to assemble because so much of the genome consists of repetitive elements, such as the Alu family, interspersed among unique DNA. Because sequencing reads are fairly short, a high percentage of reads will have part of a repetitive element at one end. Few reads will completely span a repetitive element, with unique sequence on either side.

While it is true that repetitive sequence elements do mutate, it is often difficult or impossible for sequence assembly software to decide which copy of a repetitive element a given read should be joined to, when thousands of other copies may be identical or nearly identical to that read. Put another way, we don't know where on the chromosome each read came from. That is what we're trying to figure out. The net result is that when a growing contig encounters a repetitive element, there may be no way to extend the contig further. Consequently, most genome assemblies have a relatively small number of large contigs, and a very large number of small contigs, perhaps 1000 bp or smaller.

#### Paired-end reads and mate-pair reads can correct many types of errors, especially those caused by repeats

Most sequencing technologies (e.g. Illumina, Ion Torrent) allow the construction of paired-end libraries, with primer adaptor sites on both ends of each fragment. For each fragment, two reads are produced, one from each end. While it is not guaranteed that the reads will be long enough to overlap each other, the insert size is known. Insert sizes in current Illumina sequencing typically range between 300 and 700 nt.

Since the size of the insert is known, algorithms that resolve cases where alternative paths exist can use the known distance between the paired-end reads to constrain which path is chosen.

Mate-pair reads are simply paired-end reads with much longer insert sizes, e.g. 3000 nt. Again, assembly software can choose paths through the data such that the path from the k-mers of one read to the k-mers of its mate spans a distance consistent with the known insert size.
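The distance constraint can be illustrated with a toy graph. This sketch assumes that each edge in a k-mer graph advances the sequence by one nucleotide, so the number of edges on a path equals the distance between the two anchoring k-mers. The node labels, the expected distance, and the tolerance are invented for the example; in practice the expected distance would be derived from the library's insert size distribution.

```python
def paths_between(graph, start, end, max_edges):
    """Enumerate simple paths from start to end that use at most max_edges edges."""
    found = []
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == end:
            found.append(path)
            continue
        if len(path) - 1 >= max_edges:  # path is already as long as allowed
            continue
        for nxt in graph.get(node, ()):
            if nxt not in path:  # keep paths simple (no repeated nodes)
                stack.append((nxt, path + [nxt]))
    return found


def paths_matching_distance(graph, start, end, expected, tolerance):
    """Keep only paths whose edge count is within tolerance of the expected distance."""
    lo, hi = expected - tolerance, expected + tolerance
    return [p for p in paths_between(graph, start, end, hi)
            if lo <= len(p) - 1 <= hi]


# Toy graph: two alternative routes between the k-mer anchored by read 1 ("S")
# and the k-mer anchored by its mate ("E").
toy_graph = {
    "S": ["a", "b1"],
    "a": ["E"],                                   # short route: 2 edges
    "b1": ["b2"], "b2": ["b3"], "b3": ["b4"], "b4": ["E"],  # long route: 5 edges
}

chosen = paths_matching_distance(toy_graph, "S", "E", expected=5, tolerance=1)
print(chosen)  # [['S', 'b1', 'b2', 'b3', 'b4', 'E']]
```

Only the five-edge route is consistent with the expected spacing of the pair, so the short route is rejected even though both routes connect the same two k-mers.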

(a) When a single read spans a path consisting of several k-mers, the correct path can be chosen if the read includes all k-mers in that path.
(b) If the two reads of a paired-end read overlap the terminal k-mers in a frayed-rope ambiguity, the path that includes the correct termini can be chosen based on that overlap.
(c) Where ambiguous paths exist in a longer path, mate-pair reads can eliminate those paths that do not contain k-mers found in both reads of the mate pair. Mate-pair reads are particularly good at jumping over regions containing repeated sequences longer than the reads.

At the end of the contig assembly process, we are left with a set of contigs, and a pool of unassigned reads.

## E. JOINING CONTIGS INTO SCAFFOLDS

Eventually, no amount of error correction can extend the contigs beyond their current length. In this case, it may still be possible to join contigs together into larger scaffolds, in which the order and relative orientation of the contigs are known, even though the sequence between them is not.

If the two reads of a paired-end or mate-pair read fall on the ends of two different contigs, we may not know the sequence between the contigs, but we do know two things:
• the distance between the two contigs must be less than the insert size of the library
• the orientation of the two adjacent contigs, relative to each other.
In this way, paired-end and mate-pair reads can be used to stitch together two or more contigs into a scaffold. The sequence between the contigs is usually represented by a run of N's.
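The joining step can be sketched as follows. This is an illustration only: the contig sequences, the gap estimate, and the function names are made up, and real scaffolders estimate the gap from many read pairs rather than taking it as a given number.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (N stays N)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def join_contigs(contig_a, contig_b, gap_estimate, flip_b=False):
    """Join two contigs into a scaffold, filling the unknown gap with N's.

    flip_b=True means the linking read pairs indicate that contig_b is on
    the opposite strand and must be reverse-complemented first.
    """
    if gap_estimate < 0:
        raise ValueError("gap estimate must be non-negative")
    if flip_b:
        contig_b = reverse_complement(contig_b)
    return contig_a + "N" * gap_estimate + contig_b


# Hypothetical contigs linked by read pairs, with an estimated gap of 4 nt.
scaffold = join_contigs("ACGTAC", "GGTTA", gap_estimate=4, flip_b=True)
print(scaffold)  # ACGTACNNNNTAACC
```

The run of N's records that the two contigs are adjacent and roughly how far apart they are, without claiming any knowledge of the intervening sequence.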

#### Evaluating the Quality of an Assembly

If you wish to view the raw data for contigs, contig viewers like Tablet will list the contigs in order of descending size, and display the aligned reads, along with a graph of coverage in any window along the contig.

Assembly of paired-end Illumina data for the genome of the fungus Rhodosporidium diobovatum. Fakankun I, Fristensky B, Levin D (2016) unpublished.

Programs such as Quast can generate statistics summarizing the quality of the assembly. Ideally, each chromosome would be assembled into a single contig. In practice, with current sequencing technologies, that seldom happens.

N50 - The contig length such that contigs of that length or longer contain at least 50% of the total assembled sequence. Note that this is weighted by length, so it is not the same as the median contig length. The N50 value is the most common single number for evaluating the quality of an assembly. While a larger N50 is generally better, that is not the whole story. Misassemblies can join sequences that do not belong together, creating artifactually long contigs that inflate the N50.
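The following sketch computes N50 using the usual length-weighted definition: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly length. The contig lengths are invented for illustration. The second calculation shows how wrongly joining two contigs raises the N50 without improving the assembly.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")


# Hypothetical assembly: six contigs totalling 290 nt.
contigs = [80, 70, 50, 40, 30, 20]
print(n50(contigs))  # 70

# Same bases, but the 50 and 40 nt contigs were wrongly joined into one 90 nt contig.
misassembled = [90, 80, 70, 30, 20]
print(n50(misassembled))  # 80
```

For the first list the median contig length would be 45 nt, which shows that N50 and the median are different statistics.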

The distribution of contig sizes is generally skewed: there is a relatively small number of very large contigs, and a large number of smaller contigs. An example of an assembly for a fungal genome of approx. 21,000,000 nt is shown.
X axis - cumulative percentage of contigs
Y axis - size of contigs

Salzberg SL et al. (2012) GAGE: A critical evaluation of genome assemblies and assembly algorithms
Genome Res. 2012. 22: 557-567
doi: 10.1101/gr.131383.111

Salzberg and co-workers assembled eukaryotic and prokaryotic chromosomes with eight different assembly programs, using reads in the range of 50 to 150 nt. In each case, assemblies were done using the reads as deposited at NCBI, as well as reads corrected using Quake.

• There is tremendous variability in results from one assembler to the next, and between genomes
• Many of the assemblies contain large numbers of errors. When errors are corrected, the N50 decreases, indicating that many of the contigs were artificially long.

Current state of the art: The vast majority of genome assemblies contain numerous gaps. Many contigs are large, not because they are of high quality, but because they represent assembly artifacts that can be broken into smaller contigs using programs such as Pollux. The quality of the sequencing reads is the most important factor in getting a good assembly. As read lengths increase, the quality of assemblies will increase.