PLNT3140 Introductory Cytogenetics Lecture 17, part 2 of 4

# III. GENOMIC LIBRARIES

A library is a random set of clones in which, ideally, the number of times each genomic sequence is represented follows a Poisson distribution. In other words, because cloning is a largely random process, some sequences will be cloned several times, and others will be cloned very rarely or not at all. Consequently, it is necessary to use the Clark & Carbon formula to determine how many clones are needed to have a given probability of encountering a particular sequence at least once. Sequences that are hard to clone may be statistically underrepresented in the library.

# A. DETERMINING THE NUMBER OF CLONES NEEDED

As illustrated at right, if you have only a few clones, they are likely to be from different parts of the genome. As you keep drawing clones from the library, more and more sites in the genome are represented. Because clones are chosen at random, some parts of the genome will be overrepresented, while for other parts of the genome, no clones will have been chosen. Finally, if you choose a large enough number of clones, you can be nearly certain that every part of every chromosome is represented in at least one of the clones. A genomic library is a population of clones, each containing a unique fragment of genomic DNA, which together represent the entire genome.
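The sampling argument above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not part of the lecture: the genome is treated as 700 equal, non-overlapping segments, each clone is one segment drawn uniformly at random, and the function name `fraction_covered` is made up for the illustration.

```python
import random


def fraction_covered(n_clones, n_segments, seed=0):
    """Fraction of genome segments hit by at least one of n_clones clones.

    Assumption: the genome is cut into n_segments equal pieces and every
    clone is one piece chosen uniformly at random (with replacement).
    """
    rng = random.Random(seed)
    hit = {rng.randrange(n_segments) for _ in range(n_clones)}
    return len(hit) / n_segments


if __name__ == "__main__":
    n_segments = 700  # hypothetical genome split into 700 segments
    for n_clones in (100, 700, 3500):
        covered = fraction_covered(n_clones, n_segments)
        print(f"{n_clones:5d} clones -> {covered:.3f} of the genome covered")
```

Drawing as many clones as there are segments leaves a substantial part of the genome uncovered, because many draws land on segments that were already chosen. Coverage only approaches 1 after several times that number of clones.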

The following equation [Clark & Carbon (1976) Cell 9:91] allows us to calculate the number of genomic clones necessary to construct a genomic library:

$$N = \frac{\ln(1-P)}{\ln(1-f)}$$
where
N ::= the number of clones necessary to give a probability P of finding at least one clone for a given gene
f ::= the fraction of the genome represented by the average insert size (i.e. avg. insert size/genome size).

Example: BAC insert size ~ 0.1 Mb, A. thaliana genome = 70 Mb

$$f = \frac{0.1\ \text{Mb}}{70\ \text{Mb}} = 1.43 \times 10^{-3}$$

$$N = \frac{\ln(1 - 0.99)}{\ln(1 - 1.43 \times 10^{-3})} = 3218$$
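The same calculation can be scripted. The following is a minimal sketch; the function name `clones_needed` is an illustrative choice, not something from Clark & Carbon or the lecture.

```python
import math


def clones_needed(p, f):
    """Clark & Carbon estimate N = ln(1 - P) / ln(1 - f).

    p: desired probability of finding at least one clone for a given gene
    f: average insert size divided by genome size
    """
    return math.log(1 - p) / math.log(1 - f)


if __name__ == "__main__":
    # f rounded to three significant figures, as in the worked example
    print(round(clones_needed(0.99, 1.43e-3)))
    # f computed directly from the insert and genome sizes (both in Mb)
    print(round(clones_needed(0.99, 0.1 / 70)))
```

The two results differ by a few clones because the worked example rounds f to 1.43 x 10^-3 before taking the logarithm. The difference is negligible for planning a library.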

Just to put things into perspective, you could call 1/f one genome equivalent, that is, if you could split the genome up into adjacent segments of 100 kb, you would need 1/f segments to represent each piece of the genome once. This would be 700 clones for Arabidopsis. But, we have shown that to have a 99% chance of getting a given gene, you need to screen 3218 clones, so
$$\frac{3218}{700} = 4.6\ \text{genome equivalents}$$

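Table 1. Library sizes (N) for various species and vector systems, assuming P = 0.99. Insert sizes are given in Mb.

| Species | Genome size (Mb) | Lambda (0.02 Mb) | Cosmid (0.035 Mb) | BAC (0.3 Mb) |
|---|---|---|---|---|
| E. coli | 4.5 | 1034 | 590 | 67 |
| A. thaliana | 70 | $1.6 \times 10^4$ | 9208 | 1072 |
| H. sapiens | 3000 | $6.9 \times 10^5$ | $3.9 \times 10^5$ | $4.6 \times 10^4$ |
| P. sativum | 4600 | $1.1 \times 10^6$ | $6.1 \times 10^5$ | $7.1 \times 10^4$ |
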
This table says, therefore, that you need about 4-5 genome equivalents to have a 99% probability of getting a given sequence at least once. The insert size of 0.3 Mb for BACs is an optimistic figure, although some libraries have inserts this large.
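The reason the figure comes out close to 4.6 regardless of species or vector can be seen from the formula itself. As a short sketch of the reasoning, note that for small f, $\ln(1-f) \approx -f$, so the number of genome equivalents $N f$ depends almost entirely on P:

$$N f = \frac{f\,\ln(1-P)}{\ln(1-f)} \approx -\ln(1-P) = \ln 100 \approx 4.6 \qquad (P = 0.99)$$

The approximation is slightly less accurate when the insert is a sizeable fraction of the genome. For BAC clones of E. coli in Table 1, f = 0.3/4.5 and N f comes to about 4.5 rather than 4.6.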

## B. Cloning in BACs

She K (2003) So you want to work with giants: the BAC vector. BioTech Journal 1:69-74.
First - Why not clone in YACs?

The bigger the insert, the fewer clones you need to span a given region. In principle, there is no upper limit to the size of inserts YACs can hold. Furthermore, YACs can replicate as a plasmid in E. coli and as a chromosome in yeast. So why not clone in YACs?

Well, YAC vectors have been created, and while the size of inserts is virtually unlimited, there are several critical problems with YACs:

• Transformation of yeast with large YACs is very inefficient, resulting in libraries with small numbers of clones.
• Being linear, YAC DNA is very hard to isolate, because it is easily sheared.
• "YACs are inherently unstable". Inserts in YACs are often subject to recombination and deletion. This is very dangerous, because you could be working with an insert for which the specific sequence, as found in the insert,  does not occur in nature!

BAC vectors

The term "BAC" stands for Bacterial Artificial Chromosome, but it is important to remember that these are prokaryotic artificial chromosomes, that is, they are designed to replicate in bacteria, not in eukaryotic cells. While BACs are actually derived from the E. coli F' plasmid, BACs are distinct from ordinary plasmids in having a number of features that optimize the ability to work with large inserts.

Map  of GenBank Accession U80929 created using Ugene.

Sequences for maintenance in E. coli
• ori - E. coli origin of replication, derived from the high-copy plasmid pUC9
• CM(R) - chloramphenicol resistance gene; typical plasmid vectors use ampicillin resistance as a selectable marker, so it's better to have a different gene for BACs
• cloning site - site for insert

Selection against clones with no inserts
• PUCLINK stuffer fragment interrupts the sacB gene
• sacB encodes levansucrase, which converts sucrose to levan, which is toxic to E. coli.
• PUCLINK stuffer fragment can be excised with restriction enzymes NotI (5'GC^GGCCGC3'), BamHI (5'G^GATCC3') or EcoRI (5'G^AATTC3').
• If the plasmid is recircularized with itself, the sacB promoter will now be directly upstream from the sacB coding sequence, and sacB will be expressed, so cells carrying it die when grown on sucrose.
• If the plasmid ligates with an insert, then sacB will not be expressed, and cells will survive.
Working with large inserts
• NotI site is best, because the recognition site is 8 bp (5'GC^GGCCGC3'), rather than 6. (Remember, $4^6 = 4096$; $4^8 = 65536$.) Because NotI cuts, on average, once every 65536 bp, most inserts can be excised as a single fragment, or perhaps several large fragments. 6-cutter enzymes would produce many smaller fragments.
• For physical reasons, circular BACs are much less susceptible to shearing than linear YACs.
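The arithmetic behind the preference for NotI can be sketched in a few lines. The assumptions here are idealized and not from the lecture: equal base frequencies, independent positions, and a hypothetical 100 kb insert. The function name `expected_sites` is illustrative.

```python
def expected_sites(insert_bp, site_len):
    """Expected number of recognition sites in a random sequence.

    Assumes all four bases are equally frequent and independent, so a
    specific site of length site_len occurs once every 4**site_len bp
    on average.
    """
    return insert_bp / 4 ** site_len


if __name__ == "__main__":
    insert_bp = 100_000  # hypothetical 100 kb BAC insert
    for name, site_len in (("EcoRI", 6), ("BamHI", 6), ("NotI", 8)):
        sites = expected_sites(insert_bp, site_len)
        print(f"{name}: about {sites:.1f} sites expected in the insert")
```

Under these assumptions a 6-cutter is expected to cut a 100 kb insert roughly two dozen times, while NotI is expected to cut it only once or twice. Real genomes deviate from the equal-frequency assumption, so actual site frequencies will differ.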