PLNT 3140 Introductory Cytogenetics - 2024

Mapping with Molecular Markers

Learning Objectives

Understand the distinction between genetic distance in cM and physical distance in Mb
Understand how microsatellite markers work
Be able to work with the Clarke-Carbon equation which tells how many markers must be screened to find at least one marker linked to a phenotypic trait
Understand the concept of a mapping kit and its advantages

Genetic maps represent genetic distance, not physical distance

The following table gives the physical length of each human chromosome, along with its genetic length based on recombination in males, in females, and as a sex average.

Physical and Genetic Lengths of Human Chromosomes
Chromosome	Physical map (Mb)	Genetic map, male (cM)	Genetic map, female (cM)	Genetic map, sex average (cM)	Number of markers
1	282	195	345	270	468
2	252	190	325	257	407
3	225	161	276	218	369
4	205	147	259	203	302
5	199	151	260	206	334
6	191	138	242	190	293
7	169	128	230	179	246
8	158	108	210	159	247
9	150	117	198	158	193
10	146	134	218	176	256
11	153	109	196	152	260
12	153	136	207	171	239
13	100	101	156	129	175
14	87	94	142	118	161
15	87	103	155	129	125
16	106	108	150	129	151
17	89	109	162	135	181
18	89	99	143	121	158
19	69	93	127	110	120
20	59	75	122	98	141
21	30	47	76	62	67
22	31	49	83	66	67
X	156	-	179	179	66
TOTAL	3191	2591	4460	3615	5136
Kong A et al. (2002) A high-resolution recombination map of the human genome. Nature Genetics 31: 241-247.

Why do you think this table divides the linkage distances by sex?

Total genome size: 3191 Mb
Male linkage (total): 2591 cM
Female linkage (total): 4460 cM
Sex average: 3615 cM
3191 Mb ÷ 3615 cM = 883 kb/cM

As these data show, 1 cM in the human genome corresponds to an average physical length of 883 kb. The physical distance that 1 cM corresponds to differs between organisms, because it depends on how frequently recombination occurs per unit of physical length.
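The same division can be repeated for individual chromosomes. The following is a minimal Python sketch, using values copied from the table above (sex-average column); the function name kb_per_cm and the choice of chromosomes are illustrative only.

```python
# Physical length (Mb) and sex-averaged genetic length (cM),
# copied from the table above for a few chromosomes and the whole genome.
LENGTHS = {
    "1": (282, 270),
    "21": (30, 62),
    "X": (156, 179),
    "genome": (3191, 3615),
}


def kb_per_cm(physical_mb, genetic_cm):
    """Average physical length of 1 cM, in kb."""
    return physical_mb * 1000 / genetic_cm


for name, (mb, cm) in LENGTHS.items():
    print(f"{name:>6}: {kb_per_cm(mb, cm):.0f} kb/cM")
```

Even at this coarse scale, the physical length of 1 cM differs between chromosomes: it is larger on chromosome 1 than on chromosome 21.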

Think of it like this: if an organism has more frequent recombination overall, the map distances based on recombination will increase. After all, the higher the chance of recombination the larger the relative distance. However, the physical distance will not change. Let's look at Arabidopsis thaliana as an example.

The A. thaliana genome is 7 x 10⁴ kb (70 Mb) long, physically (Nam et al. (1990) Plant Cell 1: 699-705). Its total genetic length is 437 cM. Therefore, each cM represents about 160 kb of DNA. This is a much smaller number than the 883 kb/cM calculated for the human genome above, indicating that Arabidopsis has more recombination per unit of physical length. So why does Arabidopsis have more frequent recombination?

Does it have something to do with the fact that it has very little repetitive DNA? Wouldn't that decrease recombination frequencies?
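The arithmetic behind the human versus Arabidopsis comparison can be checked with a short Python sketch. It uses only the genome sizes and map lengths quoted above; the variable names are illustrative.

```python
def kb_per_cm(physical_kb, genetic_cm):
    """Average physical length of 1 cM, in kb."""
    return physical_kb / genetic_cm


human = kb_per_cm(3191 * 1000, 3615)   # 3191 Mb, 3615 cM (sex average)
arabidopsis = kb_per_cm(7e4, 437)      # 7 x 10^4 kb, 437 cM

print(f"Human:       {human:.0f} kb/cM")
print(f"Arabidopsis: {arabidopsis:.0f} kb/cM")
print(f"1 cM spans about {human / arabidopsis:.1f} times more DNA in human")
```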

Genetic distances, measured by recombination, are not always linearly related to physical distance. Recombination frequency is expressed in a unit of genetic distance called the centimorgan (cM). A genetic distance of 1 cM represents a 1% chance that the two loci in question will be separated by recombination in one generation. In this way, centimorgans represent relative distance between markers. Absolute distance, or physical distance, is measured in the familiar unit of base pairs (bp, kb, or Mb).

For example, a comparison of physical and genetic distances on human chromosome 16 shows that the physical distance corresponding to 1 cM (map unit) varies along the length of the chromosome. The markers D165S309 and D16S83 are practically next to each other on the chromosome, but have a very high frequency of recombination between them. In contrast, the markers SPN and D16S300 have ~25 Mb between them, and almost no recombination.

Doggett, NA. et al. (1995) An integrated physical map of human chromosome 16. Nature 377:335-365.

What else besides distance might affect recombination frequency?

Polymorphic microsatellites are alleles, and can be used as molecular markers

So far, we've only been talking about one kind of molecular marker: RFLPs. In principle, any method that can identify sequence differences at a chromosomal location can be used to mark that site on a chromosome. RFLPs can be time-consuming and expensive, so it's not always feasible to use them. PCR-based methods offer an alternative, and can be used for all the same purposes as RFLPs.

Since variation in repeat number gives PCR bands of different sizes, each band represents an allele

Microsatellites, also called simple sequence repeats (SSRs) or short tandem repeats (STRs), are a class of variable number tandem repeats (VNTRs). They lend themselves to length polymorphisms, most likely because of strand slippage during DNA replication. The net result is that it is easy to find polymorphic microsatellites in which different alleles have different numbers of repeat units, so that the total length of a given microsatellite locus may vary. In other words, each length variant is a distinct allele at that locus. For a given locus containing a microsatellite, PCR primers specific for that locus are designed from the unique sequences flanking the repeats. Thus, microsatellite alleles are usually based on length polymorphism. Polymorphism at the priming sites would result in loss of bands, rather than changes in length.

Would agarose gel electrophoresis be appropriate to analyze microsatellites? Why or why not?

Example (link provided for background reading): Aarnes SG et al. (2015) Identification and evaluation of 21 novel microsatellite markers from the autumnal moth (Epirrita autumnata) (Lepidoptera: Geometridae). Int. J. Mol. Sci. 16:22541-22554. doi: 10.3390/ijms160922541.

The authors identify a number of microsatellite loci in the autumnal moth, which are polymorphic for the number of copies of short tandem repeats. Both alleles for each microsatellite locus were sequenced to determine the number of copies of tandem repeats in each, and the total length of the bands generated. For example, at locus A021 allele A contains 6 repeats of TGA, resulting in a PCR band whose total length is 92 bp, while allele B contains 7 copies of the TGA repeat, giving a PCR band of 95 bp.
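The relationship between repeat number and band size at locus A021 can be written out as a small Python sketch. The 74 bp of non-repeat sequence is not stated in the paper summary above; it is inferred here as 92 bp minus 6 x 3 bp, and the function names are illustrative.

```python
def allele_length(flank_bp, repeat_unit, n_repeats):
    """PCR product size: non-repeat sequence (primers plus flanks) plus the repeats."""
    return flank_bp + len(repeat_unit) * n_repeats


def repeats_from_length(band_bp, flank_bp, repeat_unit):
    """Infer the repeat count from an observed band size."""
    return (band_bp - flank_bp) // len(repeat_unit)


REPEAT = "TGA"
# Assumption: inferred from allele A (92 bp total, 6 repeats of 3 bp each).
FLANK_BP = 92 - 6 * len(REPEAT)

print(allele_length(FLANK_BP, REPEAT, 6))  # allele A
print(allele_length(FLANK_BP, REPEAT, 7))  # allele B
print(repeats_from_length(95, FLANK_BP, REPEAT))
```

Because alleles differ by only one repeat unit (3 bp here), the separation method must be able to resolve fragments that differ by a few base pairs.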

Typically, several microsatellite loci can be analyzed together in a single electrophoresis run. Primers for each locus are tagged with a distinct fluorescent dye, so that the microsatellite bands for each locus fluoresce at a different wavelength. PCR fragments are generated in separate reactions, one per locus, each using a different dye. After amplification, the samples are mixed and separated by capillary electrophoresis. This technique, rather than using a slab gel, runs DNA fragments through a thin capillary tube containing polyacrylamide gel. A laser detector at the end of the capillary detects the signal for each band at its characteristic wavelength.

Results appear in a chromatogram, in which each DNA band appears as a peak. Both alleles at a given locus fluoresce at the same wavelength. Homozygous loci, in which only one allele is present, give a single band, e.g. loci A019, A139 and D005. Heterozygous loci, in which both alleles are present, give two bands, e.g. A021, A016, A022 and A015.
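The scoring rule described above (one peak means homozygous, two peaks mean heterozygous) can be sketched in Python. This assumes a diploid organism and peaks that have already been sized in bp. The sizes for A021 are the ones quoted earlier; the locus name "example_locus" and its 120 bp peak are hypothetical and not taken from the paper.

```python
def call_genotype(peak_sizes_bp):
    """Score one dye channel (one locus) of a diploid sample from its peak sizes."""
    alleles = sorted(set(peak_sizes_bp))
    if len(alleles) == 1:
        return "homozygous", alleles
    if len(alleles) == 2:
        return "heterozygous", alleles
    raise ValueError(f"expected 1 or 2 peaks for a diploid locus, got {len(alleles)}")


peaks_by_locus = {
    "A021": [92, 95],        # sizes quoted in the text above
    "example_locus": [120],  # hypothetical single-peak locus
}

for locus, peaks in peaks_by_locus.items():
    status, alleles = call_genotype(peaks)
    print(locus, status, alleles)
```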

Advantages of microsatellites

Highly reproducible between labs
Large number of polymorphisms per primer set
Often multiple alleles in a population, which can be highly informative
Co-dominant

Disadvantages of microsatellites

Extremely difficult to find good microsatellite markers in each species

There are many different schemes for detecting polymorphism (i.e. differences in a given sequence among members of the population). Any of these can be used as molecular markers.

The Clarke and Carbon equation gives the number of markers needed to cover a genome

Regardless of the type of molecular marker employed, each assays a single genetic locus. That is, each marker assays a region of d cM on both sides of the marker. Another way to conceptualize it is to say that the genome is divided into G/2d segments of 2d each.

What you might be tempted to do is to say that the number of markers to score, to cover the entire genome, is the genetic length of all chromosomes added together (G), divided by the genetic distance covered by each marker (2d).

N = G/2d

Where N is the number of markers necessary to have at least one marker within d map units of any gene.

Note: As calculated here, N is referred to as "1 genome equivalent".

The problem: N randomly chosen markers will be scattered unevenly across the chromosomes, so some regions will be full of markers, and other regions will not have any markers at all.

What we have to do, then, is to saturate the genome with markers, to ensure every region is covered. An initial map might be constructed using a relatively small number of markers. Since the chosen markers represent a random sampling of loci, they will be distributed unevenly across the genome. Consequently, some regions of each chromosome will be overrepresented in the map, and others will be underrepresented. As we sample more markers and add them to the map, their map locations will also be unevenly distributed. Finally, if enough markers are used, there will always be at least one marker within a certain distance d of any part of the genome.
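To get a feel for how uneven random sampling is, the following Python sketch estimates the chance that a gene has no marker within d cM of it. It assumes markers fall independently and uniformly along the genetic map and ignores chromosome ends; the function name is illustrative. With only as many markers as there are 2d segments (G/2d, about 181 for the human genome with d = 10 cM), roughly a third of genes would still be uncovered.

```python
def prob_gene_uncovered(n_markers, d_cm, genome_cm):
    """Chance that none of n random markers lands within d cM of a given gene."""
    f = 2 * d_cm / genome_cm  # fraction of the genome within d cM of the gene
    return (1 - f) ** n_markers


G = 3615  # human sex-average map length, cM
d = 10    # cM on either side of the gene

for n in (181, 362, 830):
    print(f"{n:4d} random markers: P(no linked marker) = {prob_gene_uncovered(n, d, G):.3f}")
```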

Therefore, it is necessary to screen a large number of markers before you find one that is linked to your gene. The following equation [Clarke and Carbon (1976) Cell 9:91] allows us to calculate the number of markers necessary to find one that is linked to the gene of interest:

N = ln(1 - P) / ln(1 - f)

where:

N is the number of markers necessary to give a probability P of finding at least one marker linked within d map units of the gene
f is the fraction of the genome represented by a region of d map units on either side of the gene, that is, f = 2d/G

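For example, to find a marker in the human genome (3615 cM) linked within 10 cM of any gene, with P = 0.99:

f = (2 x 10 cM) / 3615 cM = 0.00553
N = ln(1 - 0.99) / ln(1 - 0.00553) = 830 markers (approximately)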

Now, the bad news here is that 830 is a lot of markers, and will be expensive. But the good news is that we've set the value of P high, to make sure we do find a marker linked to the gene of interest. If we set P equal to 0.5 (50%), we only have to look at about 125 markers. We can rephrase this as "50% of the time we only have to look at 125 markers before finding a marker linked to the gene". If we set P equal to 0.75, the number of markers becomes 250. So, just because 830 is our result doesn't mean that the marker linked to the gene will be the 830th marker every time. Rather, it means that to be 99% sure we're getting a marker linked to our gene we have to look at 830 markers.
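These numbers can be reproduced with a few lines of Python. This is a minimal sketch of the Clarke and Carbon calculation, taking f = 2d/G as defined above; the function name markers_needed is illustrative, and results are rounded to the nearest whole marker.

```python
import math


def markers_needed(p, d_cm, genome_cm):
    """Random markers to screen for probability p of at least one within d cM of a gene."""
    f = 2 * d_cm / genome_cm  # fraction of the genome within d cM of the gene
    return math.log(1 - p) / math.log(1 - f)


G = 3615  # human sex-average map length, cM
d = 10    # cM on either side of the gene

for p in (0.50, 0.75, 0.99):
    print(f"P = {p:.2f}: about {round(markers_needed(p, d, G))} markers")
```

Note how slowly the required number falls as P is lowered: most of the cost comes from the last few percent of certainty.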

Many times, a limiting factor for experimental design is cost. In this situation, what would you do if the cost of using 830 markers was out of budget? Are there other ways to mitigate cost that do not involve using fewer markers?

Mapping kits use a minimal number of evenly-spaced markers to ensure complete coverage of the genome

The examples above show that if you are trying to find a marker linked to a gene of interest, screening randomly chosen markers requires that a large number of markers be screened to be sure that at least one is linked to your gene. However, once a genome has been saturated with markers, you only need to search a small set of evenly spaced markers that together cover the entire genome. That way, no matter where your gene is, at least one of the markers you screen must be closely linked to your gene. A mapping kit is a set of markers that are evenly spaced on the chromosomes. If you can define a minimal set of markers, you can detect linkage by testing a minimal number of probes. One way of looking at it is that now that we have saturated the genome, we can choose a set of evenly spaced markers so that N = G/2d. For example, if the human genome is 3615 cM in length, and we have markers evenly spaced 20 cM apart (d = 10 cM), then we need only 3615/20 = 181 markers to detect any gene.
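The size of a mapping kit follows directly from N = G/2d. The Python sketch below tabulates it for a few values of d; the function name kit_size is illustrative, and rounding up to a whole marker is an assumption made here so that coverage is complete. It treats the genome as one continuous map, so a real kit would need a few extra markers to cover the ends of each chromosome.

```python
import math


def kit_size(genome_cm, d_cm):
    """Evenly spaced markers needed so that every gene is within d cM of a marker."""
    return math.ceil(genome_cm / (2 * d_cm))


G = 3615  # human sex-average map length, cM

for d in (5, 10, 20):
    print(f"d = {d:2d} cM (markers every {2 * d} cM): {kit_size(G, d)} markers")
```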

Summary

The precision with which molecular markers can be used to map a site on a genome is often better than with phenotypic markers, because you can use more of them and there are fewer errors in scoring
It is easy to establish linkage - if you have a set of markers carefully chosen from different parts of the genome, you can map any locus to a marker
Markers don't even have to be genes - they can be any piece of DNA that detects a small number of bands, and for which there is polymorphism in the population (because they need not lie within genes, they are often selectively neutral)