PLNT 3140 Introductory Cytogenetics - 2024

Building Genomic Maps

Learning Objectives

Be able to list the goals of genomics projects (a fully-integrated map containing genetic data, overlapping clones, DNA sequence, and other annotation)
Be able to define a genomic library, and be able to calculate the number of randomly chosen clones necessary to ensure complete coverage in a library with the Clarke-Carbon equation
Be able to explain what a BAC is, and list its advantages and its disadvantages
Explain the principles behind map-based cloning and how to construct contigs from overlapping clones
Explain the basic principles behind RNAseq analysis of gene expression
Be able to outline the process of whole genome shotgun sequencing

The goal of genome projects is a fully integrated genetic map

Let's start by giving an example of what it is most genome programs hope to accomplish. Conceptually, this can be broken down into successive steps, each at a finer level of resolution:

At the highest level, genes or molecular markers are mapped using Mendelian genetic methods
Clones that hybridize with markers in a given region are identified from Bacterial Artificial Chromosome (BAC) libraries. By identifying overlapping clones, large regions can be assembled into contigs. Contigs are long sets of contiguous overlapping clones. (Each BAC contains >100 kb of DNA.)
Sequencing:
1. Short pieces of the ends of clones can be sequenced to uniquely identify each clone. These are known as Sequence-Tagged Sites (STS).
2. cDNAs hybridizing with a given clone can be placed on that clone. Each cDNA identifies the location of a gene.
3. High-throughput sequencing methods can be used to determine the complete DNA sequence of subclones from each BAC. Overlapping sequences are assembled to give complete chromosomal sequences.

The result is an integrated model of the genome. The first genome to be entirely mapped was that of the nematode worm, C. elegans. The original 1998 sequence has since been updated to correct small gaps. The NCBI homepage for C. elegans can be found here. Some of the genome data from NCBI is presented below:

Fine-resolution map from chromosome III, showing clones and markers within a few map units of tra-1.

Low-resolution map showing a broader area, centered on tra-1 (top centre).

Full-length map of entire chromosome III.

Ideally, all genome mapping projects would be able to produce a result of this quality.

Genomic libraries are random sets of clones

A library is a random set of clones, in which the number of times each genomic sequence is represented follows (in the ideal) a Poisson distribution. In other words, because cloning is a largely random process, some sequences will be cloned several times, and others will be cloned very rarely. Consequently, it is necessary to use the Clarke-Carbon equation to determine how many clones are necessary to be sure, within a certain probability, of encountering each sequence at least once. Some sequences which are hard to clone may be statistically underrepresented in the library.

The number of clones needed is determined by the Clarke-Carbon equation

A genomic library is a population of clones, each containing a unique fragment of genomic DNA, which together, represent the entire genome.
As illustrated at right, if you have only a few clones, they are likely to be from different parts of the genome. As you keep drawing clones from the library, more and more sites in the genome are represented. Because clones are chosen at random, some parts of the genome will be overrepresented, while for other parts of the genome, no clones will have been chosen (second stage). Finally, if you choose a large enough number of clones, you can be sure that every part of every chromosome is represented in at least one of the clones (third stage).
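The three stages described above can be imitated with a small simulation. This is only a sketch under toy assumptions: the genome is a linear string of 10,000 positions, every insert is exactly 100 positions long, and clone start points are uniformly random. The function name and all numbers are made up for illustration.

```python
import random

def simulate_coverage(genome_size, insert_size, n_clones, seed=0):
    """Return the fraction of genome positions covered by n random clones."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    covered = [False] * genome_size
    for _ in range(n_clones):
        # Pick a random start so that the whole insert fits on the genome
        start = rng.randrange(genome_size - insert_size + 1)
        for pos in range(start, start + insert_size):
            covered[pos] = True
    return sum(covered) / genome_size

# One genome equivalent here is 10,000 / 100 = 100 clones
for n in (10, 100, 500):
    print(n, "clones:", simulate_coverage(10_000, 100, n))
```

With only a few clones, little of the genome is covered. At about one genome equivalent a substantial part is still missing, because some regions have been hit more than once. Only at several genome equivalents does coverage approach 100%.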

The Clarke and Carbon equation from the last reading allows us to calculate the number of genomic clones necessary to construct a genomic library:

N = ln(1 - P) / ln(1 - f)

Where:

N is the number of clones necessary to give a probability P of finding at least one clone for a given gene
f is the fraction of the genome represented by the average insert size (i.e. average insert size / genome size).

As an example, let's use the BAC average insert size (0.1 Mb) and the Arabidopsis thaliana genome (70Mb).

f = 0.1Mb / 70Mb = 1.43 x 10^-3

N = ln(1-0.99)/ln(1-1.43 x 10^-3) = 3218

Note: make sure you adjust the units so that they match!

Just to put things into perspective, you could call 1/f one genome equivalent, that is, if you could split the genome up into adjacent segments of 100 kb, you would need 1/f segments to represent each piece of the genome once. This would be 700 clones for Arabidopsis. But, we have shown that to have a 99% chance of getting a given gene, you need to screen 3218 clones, so:

3218/700 = 4.6 genome equivalents
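The worked example above can be checked with a few lines of Python. This is a minimal sketch; the function names are chosen here for illustration and are not part of any library.

```python
import math

def clarke_carbon(insert_size_mb, genome_size_mb, p=0.99):
    """Number of clones N giving probability p of recovering a given sequence."""
    f = insert_size_mb / genome_size_mb  # fraction of the genome in one insert
    return math.log(1 - p) / math.log(1 - f)

def genome_equivalents(n_clones, insert_size_mb, genome_size_mb):
    """Express a library size as multiples of one genome equivalent (1/f clones)."""
    return n_clones * insert_size_mb / genome_size_mb

# BAC inserts of 0.1 Mb and a 70 Mb genome, as in the example above
n = clarke_carbon(0.1, 70)
print(round(n), "clones")
print(round(genome_equivalents(n, 0.1, 70), 1), "genome equivalents")

# Same calculation with f rounded to 1.43 x 10^-3, as in the hand calculation
print(round(math.log(1 - 0.99) / math.log(1 - 1.43e-3)), "clones with rounded f")
```

Using the unrounded value of f gives about 3221 clones rather than 3218. The small difference comes only from rounding f before taking the logarithm, and both values work out to about 4.6 genome equivalents.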

Library sizes (N) for various species and vector systems, assuming P=0.99.
Organism name	Genome size (Mb)	N, Lambda (0.02 Mb inserts)	N, Cosmid (0.035 Mb inserts)	N, BAC (0.3 Mb inserts)	Genome equivalents needed (with BAC)
E. coli	4.5	1034	590	67	4.47
A. thaliana	70	1.6 x 10⁴	9208	1072	4.59
H. sapiens	3000	6.9 x 10⁵	3.9 x 10⁵	4.6 x 10⁴	4.6
P. sativum	4600	1.1 x 10⁶	6.1 x 10⁵	7.1 x 10⁴	4.63
The insert size of 0.3 Mb for BACs is an optimistic figure, although some libraries have inserts this large.

This table indicates that on average, around 4.5 genome equivalents are needed to have a 99% probability of getting a given sequence at least once.

Cloning in bacterial artificial chromosomes (BACs) is used to make genomic libraries

Cloning in BAC vectors has distinct advantages over YAC vectors

The bigger the insert, the fewer clones you need to span a given region. In principle, there is no upper limit to the size of inserts YACs can hold. Furthermore, YACs can replicate as a plasmid in E. coli and as a chromosome in yeast. So why not clone in YACs? YAC vectors have been created, and while the size of inserts is virtually unlimited, there are several critical problems with YACs:

Transformation of yeast with large YACs is very inefficient, resulting in libraries with small numbers of clones
Being linear, YAC DNA is very hard to isolate, because it is easily sheared
YACs are inherently unstable - inserts in YACs are often subject to recombination and deletion. This is very dangerous, because you could be working with an insert for which the specific sequence, as found in the insert, does not occur in nature!

BAC Vectors

The term "BAC" stands for Bacterial Artificial Chromosome, but it is important to remember that these are prokaryotic artificial chromosomes, that is, they are designed to replicate in bacteria, not in eukaryotic cells. While BACs are actually derived from the E. coli F' plasmid, BACs differ from ordinary plasmids in having a number of features that optimize the ability to work with large inserts.

Map of GenBank Accession U80929 created using Ugene.

In the figure above, the multiple cloning sites that we would expect to see in a vector are present, as well as an ori site. However, there are some slight differences from the vector we've seen before:

CmR gene: chloramphenicol resistance gene; typical plasmid vectors use ampicillin resistance as a selectable marker, so it's better to have a different gene for BACs
SacB gene: this gene allows selection against clones with no inserts when cells are grown on sucrose (the vector we saw before used the LacZ gene)
- PUCLINK stuffer fragment interrupts the sacB gene
- sacB encodes levansucrase, which converts sucrose to levan (toxic to E. coli)
- PUCLINK stuffer fragment can be excised with restriction enzymes NotI (5'GC^GGCCGC3'), BamHI (5'G^GATCC3') or EcoRI (5'G^AATTC3')
- If the plasmid is recircularized with itself, the sacB promoter will now be directly upstream from the sacB coding sequence, sacB will be expressed, and cells will die on sucrose
- If the plasmid ligates with an insert, then sacB will not be expressed, and cells will survive

When we are working with large inserts, it's more efficient to use restriction enzymes that have longer recognition sequences, because they cut less frequently and therefore create longer insert fragments. For example, the NotI enzyme has an 8bp recognition sequence, while BamHI and EcoRI have 6bp recognition sequences. On average, using NotI instead of BamHI or EcoRI will result in fragments that are an order of magnitude larger.

Recall learning about restriction enzymes and when they cut. Can you calculate more specifically the difference in length (on average) between fragment sizes for NotI and EcoRI? What would that mean for the number of fragments needed to cover a whole genome?
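One way to check your answer to the question above is sketched below. It assumes a random DNA sequence in which all four bases are equally frequent, so a site of n bases is expected once every 4^n bp. Real genomes are not random sequence, so actual average fragment sizes will differ. The function name is chosen here for illustration.

```python
def expected_fragment_length(site_length, base_freq=0.25):
    """Average spacing (bp) between recognition sites in random sequence.

    Assumes each base occurs independently with frequency base_freq.
    """
    return (1 / base_freq) ** site_length

noti = expected_fragment_length(8)   # 8 bp recognition sequence
ecori = expected_fragment_length(6)  # 6 bp recognition sequence

print("NotI :", noti, "bp")
print("EcoRI:", ecori, "bp")
print("ratio:", noti / ecori)

# Fragments needed to cover a hypothetical 70 Mb genome once, end to end
genome_bp = 70_000_000
print(genome_bp / noti, "NotI fragments vs", genome_bp / ecori, "EcoRI fragments")
```

Under these assumptions, each extra base in the recognition sequence multiplies the expected fragment size by 4, so two extra bases give a 16-fold difference, and 16-fold fewer fragments are needed to cover the same genome.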

While BACs have a number of advantages, namely:

Can have fragment sizes up to 350 kb
High stability compared to YACs (less risk of recombination)
Much easier to make large BAC libraries than YAC libraries

...there are also some disadvantages. Although 350 kb inserts are an improvement, a very large number of clones is still needed to cover a large genome. Additionally, some sequences are hard to clone in E. coli, and may therefore be missing from BAC libraries.

Map-based cloning strategies find a gene based on the position of an associated marker

Given a molecular marker, how can you clone the associated gene? The first step in this process is to localize the gene to a small area using genetic crosses. Map the gene you want to clone (d in the figure below) to a position between two markers (m1 and m2).

Keep in mind, though, that a map distance of even 1 cM can correspond to a large physical distance! In A. thaliana, 1 cM corresponds on average to about 160 kb, while in humans, 1 cM corresponds on average to about 883 kb. Although one map unit may seem like a small distance to examine, it's larger than it looks - and there could be a lot of genes in one map unit.

The next step is to assemble a contig by chromosome walking. Starting from one of the markers (let's use m2), you "walk" along overlapping fragments until you find the gene. It's important to note that during the initial walk, you have to walk in both directions, since you don't know whether the gene is upstream or downstream from m2. Generally, you can determine directionality in two ways: by simply walking in both directions until you find m1, or by finding an overlapping clone that detects a polymorphism. The polymorphism makes it possible to do a 3-point cross with the probe, m2 and d, which shows on which side of m2 the probe lies.

How do we find these fragments that overlap so nicely? These fragments, or probes, are called end-specific because only the ends overlap with the next probe, making it so we can assemble contigs easily while covering the most area. Here's how these end-specific probes are made:

BAC DNA is digested with a restriction enzyme, such as AluI, that cuts frequently, yielding small fragments (Note that the insert is much bigger than the BAC vector that carries it).
A primer specific for one of the BAC sequences bordering the cloning site is added, along with DNA polymerase and labeled nucleotides.
The primer is elongated, but elongation only proceeds as far as the next AluI site, at most a few hundred nucleotides downstream.
Thus, the only part of the BAC insert to be labeled is at one end of the insert.

If you use a primer synthesized to match the region of the vector immediately flanking the insert, the labeling reaction will proceed into the insert, and terminate at the end of the fragment, where AluI cut. One of the advantages of this approach is that the same primer can be used for all clones, because all inserts are in the same vector!

End-specific probes can be used to assemble a contig - a set of overlapping BAC clones. Each time an end-specific probe identifies a new clone in the library, one step in genome walking has been taken.
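The logic of a walk can be sketched in code. This is a toy model, not a lab protocol: the clone names and coordinates are hypothetical, and the coordinates are used only to imitate which clones an end-specific probe would hybridize with. In a real walk the positions are unknown. For brevity the sketch walks in one direction only.

```python
# Hypothetical BAC clones as (name, start_kb, end_kb)
CLONES = [
    ("bacA", 0, 120),
    ("bacB", 90, 210),
    ("bacC", 180, 300),
    ("bacD", 270, 400),
    ("bacE", 50, 150),
]

def hybridizing_clones(position, clones):
    """Imitate probing the library: return clones that contain the position."""
    return [c for c in clones if c[1] <= position <= c[2]]

def walk(start_marker, target, clones):
    """Walk rightward from the marker until a clone contains the target gene."""
    current = max(hybridizing_clones(start_marker, clones), key=lambda c: c[2])
    path = [current[0]]
    while not (current[1] <= target <= current[2]):
        # Probe the library with the right-hand end of the current clone
        hits = [c for c in hybridizing_clones(current[2], clones) if c[2] > current[2]]
        if not hits:
            raise ValueError("gap in library: walk cannot continue")
        current = max(hits, key=lambda c: c[2])  # take the clone reaching farthest
        path.append(current[0])
    return path

# Marker m2 at 100 kb, gene d at 350 kb (both made up)
print(walk(start_marker=100, target=350, clones=CLONES))
```

Each pass through the loop corresponds to one step of the walk: a new end-specific probe is made from the current clone and used to find the next overlapping clone.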

Image displayed from Genevogue Biotechnology

From this point on, every gene is a special case. You know that the gene you're interested in is on one of the clones, but which one?

Complementation is the method of choice for experimental systems that allow transformation. For example, disease resistance genes in plants have been cloned by transforming susceptible plants with DNA from each BAC in a contig, and screening for resistant plants.

cDNA screening is an approach that sometimes works. eg. screening for a disease gene in humans:

Screen a cDNA library to find clones that hybridize with cosmids in the contig (a handful)
Test each cDNA to see if mRNA is expressed in the appropriate tissue
Of those that are expressed in the appropriate tissue, sequence each gene in affected individuals, carriers and non-carriers. Look for a mutation that correlates with affected or carrier status.

Contigs can also be constructed with hybridization

An ordered library is a library of overlapping clones, such that each sequence outside of overlap regions is represented exactly once. Put another way, think of the chromosome as being split into short fragments laid end to end. Contigs are assembled by successive hybridization, fingerprinting, or sequencing experiments. Since thousands of clones have to be compared against each other to detect overlaps, computers are used to do pairwise comparisons.

Once you have an ordered library:

One hybridization experiment tells you location and approximate copy number of a gene.
Everything has already been cloned, so to speak, so even if you only have a partial probe (eg. oligonucleotide), you can find a clone that has the complete gene.
Because the ordered library tells us the order of the sequences in a genome, we can learn about the higher-order structure of eukaryotic genomes. Potentially, there's still a lot to discover about genome organization.
Eliminates the need for chromosome walking, because every part of the genome has already been covered with clones.

Whole genome shotgun sequencing bypasses the need for an ordered clone library

In the past, genomes were sequenced by first making a BAC library, and then sequencing enough clones to cover the entire genome. Even today, these genomes are among the best genomic sequences. However, the time and expense of this strategy makes it impractical for sequencing large numbers of genomes. Whole genome shotgun sequencing is a quicker and cheaper way to sequence genomes, but has the disadvantage that most of the time, full chromosomal sequences cannot be built for eukaryotic genomes. The trade off, then, is between cost and speed, and completeness of the genomic sequence.

Further reading: Ekblom R, Wolf JBW (2014) A field guide to whole-genome sequencing, assembly and annotation. Evolutionary Applications 7: 1026 - 1042.

An overview of whole genome shotgun sequencing. A mate-pair read is sequence information from two ends of a DNA fragment, usually several thousand base-pairs long. A scaffold is two or more contigs joined together using read-pair information. Within a scaffold, the gaps between adjacent contigs are usually denoted by a run of N's.

Shotgun sequencing creates millions of reads

WGS begins by creating a library of fragments from genomic DNA. Usually a PCR step amplifies the fragments, which are a uniform size, depending on the specific sequencing technology used. For example, with Illumina technology, fragments are typically about 300 bp in length. Since sequencing reads can be 150 bp or longer, it is possible to get two paired-end reads that overlap, to cover the entire 300 bp fragment.

Reads are then assembled into contigs

Typically, WGS sequences enough reads to cover the entire genome with 50 to 100-fold redundancy. Highly-efficient pattern matching software pieces together reads at points of overlap to form contigs. The algorithm keeps adding reads together until contigs can no longer be extended from the pool of reads. Generally the bigger the contigs, the better the sequence assembly.

The figure below shows assembly of a contig from many individual reads.

Each contig is assembled from many overlapping reads. At this point, we have no idea which chromosome each contig comes from, or where the contigs might be placed on those chromosomes. There is usually a large number of unmatched reads that cannot be assembled into contigs, and very small contigs (eg. a few hundred or a few thousand base pairs in length) that do not contribute anything to the final genome assembly.
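The idea of extending contigs at points of overlap can be illustrated with a toy greedy assembler. This is a deliberately simplified sketch, not the algorithm used by production assemblers, which are far more efficient. It assumes error-free reads, all from the same strand, and no repeats. The reads are made up.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    seqs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while True:
        best = (0, None, None)
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            return seqs  # nothing left to merge: these are the contigs
        merged = seqs[i] + seqs[j][k:]
        seqs = [s for idx, s in enumerate(seqs) if idx not in (i, j)] + [merged]

reads = ["ATGGCGT", "GCGTACC", "TACCTGA", "CTGAAGT"]
print(greedy_assemble(reads))
```

Here the four reads overlap end to end, so they collapse into a single contig. Reads that share no overlap of at least min_len bases with anything else are returned unchanged, which corresponds to the unmatched reads and very small contigs mentioned above.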

Repetitive elements make it difficult or impossible to assemble long contigs

Eukaryotic genomes are especially difficult to assemble because so much of the genome consists of repetitive elements, such as the Alu family, interspersed among unique DNA. Since the length of sequencing reads is fairly short, a high percentage of reads will have part of a repetitive element at one end. Few reads will completely span a repetitive element, with unique sequence on either side. While it is true that repetitive sequence elements do mutate, it is often difficult or impossible for sequence assembly software to decide which copy of a repetitive element to join with any of thousands of other copies that may be identical or nearly identical to a given read. Put another way, we don't know where on the chromosome each read came from. That is what we're trying to figure out. The net result is that as a growing contig encounters a repetitive element, there may be no way to extend the contig further. Consequently, most genome assemblies have a relatively small number of large contigs, and a very large number of small contigs, maybe 1000 bp or smaller.

Mate-pair reads make it possible to join contigs together into scaffolds

With current sequencing technologies, the best strategy for joining contigs is to do a second sequencing run, this time using libraries with large fragment sizes (eg. 3000 bp or greater). The reads from these larger insert libraries are called "mate pair" reads. If we find one read of a mate pair within one contig, and the other read of the pair somewhere within another contig, we know that the two contigs are no farther apart than the length of the large insert.

1. Large DNA fragments (eg. 3000 bp) are end-repaired using biotinylated nucleotides.
2. DNA ligase is added, causing fragments to circularize. This brings the two ends (red and green) together.
3. The circular DNA is fragmented, and the resultant fragment pool is run over a streptavidin column. Only the fragments containing the biotin are captured. The captured fragments are then eluted from the column. Now, each remaining fragment contains sequences from both ends of the original larger fragment. Sequencing adaptors A1 and A2 are ligated to the ends of the biotinylated fragments.
4. Double-stranded fragments are denatured and run over a flow cell. The different cells in the flow cell have oligonucleotides complementary to either A1 or A2, so each strand will be captured in different cells (5). They are amplified by PCR. Fragments are sequenced, starting at A1 or A2.
5. With reference to the original large fragment from step 1, the sequence that we get comes from the two ends, starting in the interior of the fragment and going outward.

Then, contigs can be joined together and assembled into much longer scaffolds. Ideally, we can join many different contigs together to make the scaffold as long as possible.

Scaffolds join contigs in the order and orientation with which they appear on the chromosome. If we're lucky, we may be able to assemble scaffolds that completely cover an entire chromosome. Most of the time, though, there are 2 or more scaffolds per chromosome, and we don't know the order and orientation of the scaffolds, relative to one another.
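The first step of scaffolding, deciding which contigs belong together, can be sketched as follows. This is a simplified illustration with hypothetical data: each mate pair is reduced to the names of the two contigs its reads mapped to, and contigs are grouped if enough pairs link them. Working out the order, orientation and gap sizes within each group needs read positions and strands, which this sketch leaves out. The min_links threshold is an assumption made here.

```python
from collections import Counter

# Hypothetical mate pairs: (contig where read 1 mapped, contig where read 2 mapped)
mate_pairs = [
    ("contig1", "contig2"), ("contig1", "contig2"), ("contig2", "contig3"),
    ("contig2", "contig3"), ("contig2", "contig3"), ("contig4", "contig4"),
    ("contig1", "contig5"),
]

def link_counts(pairs):
    """Count mate pairs whose two reads land in different contigs."""
    return Counter(tuple(sorted(p)) for p in pairs if p[0] != p[1])

def scaffolds(pairs, min_links=2):
    """Group contigs joined by at least min_links mate pairs (order not resolved)."""
    groups = {}  # contig name -> set of contigs in the same scaffold
    for (a, b), n in link_counts(pairs).items():
        if n < min_links:
            continue  # a single link is treated as too weak to trust
        merged = groups.get(a, {a}) | groups.get(b, {b})
        for c in merged:
            groups[c] = merged
    return sorted({tuple(sorted(g)) for g in groups.values()})

print(link_counts(mate_pairs))
print(scaffolds(mate_pairs))
```

In this example contig1, contig2 and contig3 end up in one scaffold. The pair with both reads in contig4 carries no joining information, and the single link to contig5 falls below the threshold.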

Limitations of WGS

Like all strategies, whole genome shotgun sequencing comes with some disadvantages. In most assemblies, there will be more than one scaffold per chromosome. This means that the overall assembly will have large gaps between contigs, likely missing some genes and large portions of repetitive DNA. In turn, we will likely underestimate the amount of repetitive DNA in the genome. For diploids, the reference chromosome resulting from the assembly is a composite of both copies of a chromosome pair.

If you want to look at genome sequences that have been assembled, you can view them here. Note that often there are some scaffolds without a determined location.

Gene expression can be measured by RNA sequencing, giving us another part of the fully integrated map

After building a full genetic map, we have the DNA sequence, gene locations, and a sense of the overall arrangement of the chromosomes. One of the other pieces of information that we can add to the genetic map is how much and how often genes are expressed. This information is partially contained in the transcriptome: the set of all RNA transcripts expressed in an organism. High throughput RNA sequencing can be used to measure the amount of each of thousands of distinct RNA transcripts in an RNA population.

Gene expression studies tend to generate two different types of data. Studies in which two or more conditions are compared at a time generate discrete state data. Often it is critical to follow the expression of a gene over time after a treatment. In timecourse experiments, the expression of each gene in response to two or more treatments is measured over time. For example, in the timecourse at right, the solid blue and red dashed curves might represent the expression levels for a gene in response to two different drugs.

What we're ultimately trying to get from gene expression experiments is expression patterns for each of the thousands of transcripts in the RNA population. By identifying genes whose expression patterns are similar, we can discover which groups of genes work in concert, in response to a given stimulus.

RNA transcripts can be measured by high-throughput sequencing

There are many platforms for RNA sequencing, including Illumina GA/HiSeq, SOLiD, and Roche 454. Although these differ, the RNA-seq workflow can be described generally as shown above. In some protocols, RNA is sheared, followed by random hexamer priming. In other protocols, the entire mRNA transcript is used as a template for cDNA synthesis, and the cDNA is fragmented. Adapters for PCR are ligated onto ds-cDNA, followed by PCR amplification. Sequencing reactions are either done from a single end, or for both ends (paired-end). Ideally, where a reference genome exists, all reads can be mapped to specific genes in the genome.

However, there are some complications with sequencing. One difficulty is dealing with introns. The mature mRNA transcript being sequenced will not have any introns, but the gene sequence will. Because introns are spliced out of pre-mRNA transcripts, alignment programs have to check whether a read contains part of the 3' end of one exon and part of the 5' end of the next exon. We need the genomic sequence to do this - and if we don't have it, that's an issue. Another complication is shown in the figure below - alternative splicing patterns. Transcriptomics (the study of the transcriptome) is revealing that alternative splicing occurs more frequently in eukaryotic gene expression than was previously appreciated.

The illustration shows RNA-seq reads aligned to two eukaryotic genes A and B. Reads that span part of an exon are shown as single lines, whereas reads that include parts of two adjacent exons are indicated by V-shaped lines.

Another consideration of RNAseq methods is that since each read is the same length, but genes are different lengths, longer genes will be over-represented. Therefore, we need to correct reads for:

The size of the gene, and
The total number of reads in the dataset

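This makes results comparable across experiments. Depending on whether you are doing single reads or paired-end reads, there are two almost identical formulae: RPKM counts reads, while the paired-end version, FPKM, counts fragments (read pairs) instead. The RPKM formula is: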

RPKM = C / (L x N)

Where:

RPKM is reads per kilobase of transcript per million mapped reads
C is the number of mappable reads on a feature
L is the length of the feature (in kb)
N is the total number of mappable reads (in millions)
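A short sketch of the calculation, using the symbols defined above. The read counts and gene lengths are invented for illustration.

```python
def rpkm(read_count, feature_length_kb, total_mapped_reads_millions):
    """RPKM = C / (L x N), with L in kb and N in millions of mapped reads."""
    return read_count / (feature_length_kb * total_mapped_reads_millions)

# Two hypothetical genes in a dataset of 20 million mapped reads
short_gene = rpkm(read_count=500, feature_length_kb=1.0, total_mapped_reads_millions=20)
long_gene = rpkm(read_count=2000, feature_length_kb=4.0, total_mapped_reads_millions=20)

print("short gene:", short_gene)
print("long gene :", long_gene)
```

The long gene collects four times as many reads as the short one, but it is also four times as long, so both genes get the same RPKM. This is the correction for gene size described above.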

Where do genome projects get us?

Designer genomes: Genomic maps that are saturated with markers allow plant and animal breeders to selectively breed offspring which combine desired genes from many different strains, varieties, lines, stocks etc.
Everything has already been cloned: If a gene can be precisely mapped, the clones for that region of the genome already exist.
Understanding how genomes are structured, and how structure relates to function.
Raw materials for genetic engineering: With completely cloned, mapped, and in some cases sequenced genomes, we will be able to pick and choose genes of all kinds for genetic engineering purposes.

Summary

The goal of genome mapping is a fully integrated genome map
A genomic library is a collection of different, randomly-chosen clones that together cover an entire genome
A BAC, or bacterial artificial chromosome, is used as a vector to hold fragments of a genomic library
Clones must be placed in order, as contigs, to sequence a complete genome
Whole genome shotgun sequencing creates millions of reads, assembles them from reads to contigs to scaffolds, and then annotates them
RNAseq experiments use RNA transcripts to measure gene expression in a particular tissue