Home Previous Topic Next Topic
PLNT 3140 Introductory Cytogenetics - 2024

Mapping Genes to Chromosomes

Learning Objectives

Genetic analysis allows us to map a phenotypic trait to a particular chromosomal locus

Why do we still need genetic mapping? Can't we just sequence the genome?

No matter how sophisticated and rapid DNA sequencing and other genomics technologies become, there is still no getting around the fact that what we're really interested in, in practical terms, is phenotypes. We want to somehow get at traits that are responsible for diseases in humans, or for agronomically useful characteristcs in crop plants and livestock. Examples of the latter would include disease resistance, yield, protein quality, oil quality, or uniform height or flowering time. Many of these traits are quantitative traits, governed by multiple loci. In many cases, the only thing we know at the beginning is the phenotype. Consequently, Mendelian genetics is more relevant than ever, as the primary way to go from a trait we can see or measure, to a specific chromosomal locus and gene product.

Mapping genes to chromosomal loci used to be done using 3 point crosses. Now, the process is almost completely based on molecular markers. Using molecular markers provides many advantages, including:

But also -

An allele is a variant, either in a molecular or phenotypic sense

Modern genetic mapping using molecular markers still relies upon the basic principles of Mendelian genetics. We observe genetic segregation based on the segregation of alleles in genetic crosses. But for molecular markers, we need a broader molecular definition of what an allele is. The word "allele" comes from the earlier word "allelomorph", which simply means "other form". We can use allele to mean a different trait, in a phenotypic sense, or to mean a different sequence, in a molecular sense. An allele defined by molecular means should have exactly the same genetic properties as a phenotypically-defined allele.

Infinite alleles model : Mutations within protein coding sequences define molecular allelism at many levels and degrees of severity.

  1. Mutations within a protein coding sequence define molecular allelism
  2. A large population can support a large number of alleles
  3. As long as the allele does not affect fitness, there is theoretically no limit for the number of alleles that could be present at that locus in a population

Mutations can take many different forms. Some mutations may have more of an effect than others:

Mutation Type Translaton
Effects
No mutation GAU CGA UGC CAG
Asp-Arg-Cys-Gln
This produces the wild-type protein
Silent mutation GAU CGA UGU CAG
Asp-Arg-Cys-Gln
Although one base pair has changed from C to U, both codons still produce a cysteine, and the end product is unchanged. No effect on the protein.
Missense mutation GAA CGA UGC CAG
Glu-Arg-Cys-Gln
This mutation, changing a U to an A, changes the sequence of amino acids. A missense mutation can have varied effects depending on how similar the mutated amino acid is to the original. In this case, Glu and Asp are similar, so the effect may be low.
Nonsense mutation GAU CGA UGA CAG
Asp-Arg
A nonsense mutation results in the early termination of an amino acid sequence. Instead of getting UGC, for cysteine, we get UGA, a stop codon. This type of mutation can range from very severe, if it occurs early on in the sequence, to less severe when it occurs closer to the end.
Frameshift mutation GAU GAU GCC AG
Asp-Asp-Ala
A frameshift mutation results from either an insertion or deletion, and is named because it shifts the frame of reference. As you can see, that results in a completely different amino acid sequence. Sequences that suffer frameshift mutations are often heavily effected, and are likely non-functional.

In addition to these point mutations, there are also large-scale sequence changes as the result of large insertions or deletions, as well as rearrangements. Not all mutations are necessarily bad - since the eukaryotic genome contains lots of "junk" DNA, often mutations do not affect organism functioning at all. If mutations occur on a non-essential portion of a large protein, the difference might be very small.

Divergence of protein sequences during plant evolution. Amino acids from the N-terminal region of thionin proteins from several plant species are shown. Where amino acids have been inserted or deleted during the course of evolution, gap (-) characters have been inserted to optimize the alignment of homologous positions within the proteins.

Of course, part of the reason that sometimes mutations don't have a big effect on function is the degeneracy of the genetic code. By "degenerate", we mean multiple codons can code for the same amino acid:

Displayed from Hyperphysics Chemistry and Biology
What might be an advantage of having a degenerate genomic code? What might be a disadvantage?


Since only a small percentage of eukaryotic genomes codes for proteins, most mutations will occur within non-coding DNA, and will usualy be selectively neutral. In most eukaryotic genomes, the chromosome is a sea of non-coding DNA punctuated by genes. This is illustrated in an approximately 1 Mb region taken from Zea mays chromosome 6:

Link to NCBI

Additionally, genes are composed of both exons and introns. In comparison to exons, introns appear to mutate very rapidly, suggesting that there is little selective pressure against intronic mutations. In many higher eukaryotes, especially in animals, introns may be very long compared to exons (Note that this appears not to be the case in plants, in which introns tend to be short).

In conclusion:

Practical consequence: because most mutations are selectively neutral and can occur anywhere in the genome, we can use molecular alleles as markers for chromosome mapping.

Restriction fragment length polymorphisms (RFLPs) are molecular alleles

At this point, we've discussed restriction digests, and you're familiar with the idea that when we digest a sequence with a certain restriction enzyme, we get bands of different lengths. Now, what happens when that particular sequence has differences between alleles? Well, we see polymorphism in the fragment lengths. Restriction sites can be assayed for polymorphism within a population by Southern hybridization - and these restriction sites can reveal polymorphism between genomes.

Assuming that individuals 1 and 2 are homozygous at the locus detected by these probes:

Restriction sites can reveal polymorphism between genotypes.

DNA sequences are mutating all the time. As two populations diverge over time, they each accumulate different mutations at different sites. When restriction sites in the vicinity of a given gene are compared from one genotype to another, one genotype may have the site, and the other will not. This is referred to as a polymorphism. High polymorphism between two genotypes is evidence of genetic divergence. Mike Freeling and colleagues have compared restriction sites within the maize alcohol dehydrogenase 1 (Adh1) gene in several maize lines. They found that although the restriction maps are identical within the Adh1 gene for all varieties tested, very few of the restriction sites outside of the coding region are conserved.


Johns, M. A., Strommer, J.N. and Freeling, M. (1983) Exceptionally high levels of restriction site polymorphism in DNA near the maize Adh1 gene. Genetics 105:733-743.
 

Maize genomic DNA was digested usng HindIII and the bands analyzed by Southern blot hybridization. DNA from seven maize lines was compared. The blot was hybridized using the pZmL84 probe, whose location on with respect to the AdhI gene is shown in the figure below. Note that this probe overlaps the 3' half of the adh1 transcript.

Since the HindIII site is internal to the probe region, the probe detects two bands in all digests. In all maize lines, a conserved 2.5 kb fragment is seen, as well as a second band. The size of the second band is determined by how far away the next HindIII site is, downstream from the conserved HindIII site. Because numerous mutations have accumulated in the downstream region, in the different maize lines over time, the location of the downstream HindIII site differs between lines.

  A restriction map of the corresponding region of the adh1 locus is shown below.

Restriction maps of the seven Adh 1 chromosomal regions. The boundaries of the transcription units are denoted by the vertical dashed lines, and the region that hybridized with the ADH1-cDNA probe, pZML84, is also shown.

In all lines the 5' fragment detected by the pZmL84 probe is 2.5kb. The map shows that the two HindIII sites that define this fragment are conserved. The 3' fragment is different in each line, because the next distal HindIII site occurrs at a different location in each line.

Now that millions of genes have been sequenced from thousands of species, we can generalize that protein coding regions are typically slow to diverge, because there is usually selective pressure to retain a given amino acid sequence. Outside of the protein coding region, and in non-coding DNA flanking genes, there is little selection against random mutations. Hence mutations are allowed to accumulate quickly in non-coding regions.

IN-CLASS EXERCISES

EXERCISE 1 - Segregation in the F1 generation
EXERCISE 2 - Measuring Recombination in a Test Cross
EXERCISE 3 - F2 phenotypes

RFLPs behave like any other Mendelian trait

Each band seen in a Southern blot indicates the presence of one or more restriction sites in a sequence. The sequence containing a restriction site is one allele, while the corresponding sequence missing the restriction site is the other allele. The "phenotypes" of these alleles are the differences in banding patterns, due to presence or absence of bands. In most cases, both loci have bands from one parent or or the other, or both loci are heterozygous. These are parental genotypes. When one locus is heterozygous and the other is homozygous, they must be recombinant. It is important to note that the %recombinants in F2 analysis doesn't give you the recombination frequency. Consider the case in which two recombinant gametes join to form a zygote:

In this diagram, the F1 generation could have any combination of the gametes created by the parents. For example, one offspring might be aaBb, with one parental and one recombinant gamete. However, progeny resulting from two recombinant gametes is indistinguishable from progeny with two parental gametes. 

These progeny would be heterozygous at both loci, which might be naively scored as non-recombinant. This is why geneticists have traditionally gone to great extremes to construct tester-stocks for mapping by test crosses. However, that is not always possible to do, especially if you want to map hundreds or thousands of markers to construct a genomic map. Consequently, you have use analytical methods appropriate for F2 data.

Maximum likelihood methods make it possible to construct linkage maps for many loci in a single cross

In a two point test cross, calculating linkage between two loci follows this formula:

When we look at the last example, it's clear that we are not able to get accurate linkage values from a two point test cross. Where we're looking for homozygotes, we miss the recombinant progeny that are heterozygotes, and end up with an inaccurate linkage for those two loci.

How would undercounting recombinants affect the linkage value?

However, once we are dealing with multiple markers, it gets much more complicated. To get an idea about how molecular marker data can be used to build linkage maps, let's examine some molecular markers from TAIR (The Arabidopsis Information Resource) at Stanford University. One of the RFLP probes documented in AAtDB is the RFLP probe g4539, which can be examined here. (Note: In the current map, g4539 has been converted into a PCR marker.) The Southern blot for this probe, shown at right, indicates two RFLP alleles, 2L and 1C, found in ecotypes Landsberg erecta and Columbia, respectively. Note that while several bands are detected with this probe, only two are polymorphic between these two parents: 2L and 1C.

If you take marker g4539 as an example, you will see that neighboring markers most closely-linked to it also are most alike in the phenotypic scores, from plant to plant. The farther you go in either direction from g4539, the more different will be the scores. Neighboring loci in any region will always have the most similar segregation patterns. This is nothing new, it is simply a restatement of the idea of co-segregation of closely-linked loci. By definition, the more closely-linked two loci are, the more they will cosegregate.


        ------------- segregating progeny ------------------->
marker/
map posn.
g6844   HHAAAAABHHBAAAHBHHHHABHHHABBAHHBHAHHBAAHHAHHBBHH­A­HHAHHBHHHAAAHHHBBHHHHAH­HBAAAAHHABHHHHBAHBABHBHAAHBHBHAHHBAHBAHHHAAHH­­­
g3843   HHAAAAABHHBAAAHBHHHHABHHHAHBAHHBHAHHBAABAAHHBBHHAAHHHAHHBHHBAAAHHHBBHAAHAHHHBAAAAAHABHHHHBAHBABHBHAAHBHBHAHHBHHBAHHBAA­­­­­
g2616   HHAAHHHBHHBAAAHBHHH­ABHHH­­­­­­­­­­­­­­­­­­­­­­­­­­­HHHHBBHBHHAHHHHHHHHHAH­HBHAAAAHABHHAABAHBABABHAAHHABHAHHBHABAHHBAHAH­­­
m210    HHAHHBHHHHHAAAHHBHHHAHHAHAHHHAABHHAHBHABAAA­AHBHAAHHHHAHBHHBBHAHHHAHHAHHBBHHBHHHHAHHBHHAHBAHHABHABAHBHAHHHH­HHABA­ABHH­­­­­
g6837   ­­­­­­­­­­­­­­­­­­­­­­­­­­HH­AABHHAHBHHBAAAH­HH­AA­H­­­­­­­­­­­­­­­­­­­­­­­HBHHHHAHHBHHAHBAHBABHABAHBHAHHHH­HHABA­­­HHAHHBA
g10086  ­­­­­­­­­­­­­­­­­­­­­­­­­AHHHAAHHHAHBHHBAHAHAHHH­A­H­­­­­­­­­­­­­­­­­­­­­­­HBHHHHAHHBHHAHBAHBABHABAHBHAHHHHAHH­­­­­BHHAH­­­
g4564­a           HAAHHBHHHHHAAAHHBHHHAHHAHAHHHAAHHHAHBHHBAHAHAHHHAAHHHHA         HBHHBBHAHHHAAHAHHBBHHBHHHHABHBHHAHBAHBABHABAHBHAHHHHAHHABAHABHH­­­­­
g3845                      HAHHHBHHHAHAAAAHBHH­AHBAHAHHHAAHHHAHBHHHAHAHAHAAHAHHHHAHBHHBBHAHHHAAHAHHBBHHHHBH­ABHBHHAHBABBHBHABAHBHAHHHHAHHABHHABHH­­­­­ 
g4539                      ­­­­­­­­­­­­­­­­­­­­­­­­­AHHHAAHHHAHBHHHAHAHAHAAHA­HHHAHBHHBBHAHHHAAHAHHBB­HHHBHHABHBHHAHBABBHHHABAHBHAHHHHAHHABHHABHHAH­­­ 
m557    HAHHHBHHHAHAAAAHBHHAAHBAHAHHHHAHHBAHBHHHAHAHAHAAHA­HHH AHBHHBBHAHAHAAHAHHBB­HHH BHHABHBBBAHHABBHHHABAHBHAHHHHAHHAHHHABHHAH­­­
g3883   HAHHHBHHHAHAAAAHBHHHAHBAHAHHHHAHHBAHBHHHAHAHAHAAHAHHHHAHBHHBBHAHAHAAHAHHBBHHHHBHHABHBBBAHHABBHBHABAHBHAHHHHAHHAHHHABHH­­­­­
g19833  HAHAHBHH­AHAAAAHBHH­AHBAH­HH­HAHHHA­BAHHAHAH­HA­HA­HHHAHBHHBBHAHAHAAHAHHBB­HHHBHHABHBBBAHBAB­HBHABAHBH­HHHH­HHAHH­ABHHAH­HA
g19838  ­­­­­­­­­­­­­­­­­­­­­­­­­­­­­HAHH­AHB­HHAHAHAHAA­A­HHHAHBHH­BHAHAHAAH­­HBBHHHHBHHABHBBBA­BAB­HBHABAHBHAHHHHA­­AHHH­BHH­H­­­
m272    HAHAHBHHHAH­­AAHBHA­HHBAA­HBHHAHHBAAHHHHABAHAAA­HA­HHAAHBHHHBHHHAHAAHAHHBB­HHBBHHABHHHHAHHHBBHBHABAHBHAHHHHHHHA­H­AHHHBBHHA
g4513   HAHAHBHHHAHAAAAHBHAHHHBAAAHBHHAHBBAAHHHHABAHAAAAHAAHHAAHBHHHBHHHAHAAHAHHBBHHHBBHHABHHHHAHHHBBHBHABAHBHAHHHHHHHAAHAAHHH­­­­­


A,B: homozygotes for alternative alleles H: heterozygote -: ambiguous result

When we calculate linkages for multiple loci at once, we use maximum likelihood methods. These methods follow two simple steps:

  1. Make a best guess as to the order of markers on the map. The more frequently two loci co-segregate, the closer they are likely to be on the map. By doing pairwise comparisons for all possible pairs of loci, the order of loci on the map can be guessed at based on which markers tend to co-segregate with each other.
  2. Determining the spacing between each pair of adjacent loci. For any possible order and spacing of loci, it is possible to calculate the prior probability that "the given map would exactly give rise to the observed data". Note that the likelihood is necessarily very small, because it is the probability that each meiosis under study would come out exactly the same if the experiment were repeated. Thus, likelihoods are useful only for comparative purposes. For example, if an alternative map had a 1000-fold lower chance of giving rise to the data, one might choose to reject it.
Figure 1. Example of multipoint linkage analysis using EM algorithm, showing convergence to maximum-likelihood genetic map for 16 RFLPs on human chromosome 7, studied in CEPH families (see text). The initial assumption of 5% recombination between consecutive RFLPscorresponded to a log10 likelihood -351.45. After 12 iterations, the recombination fractions converged to a map that was about 10 48 times more likely to have produced the observed data. The analysis used the first genetic reconstruction algorithm discussed in the text, involving only genotype-known data, and it required ~9 sec on an HP9000 minicomputer. Analysis of the full data set, using the hidden Markov-chain reconstruction algorithm, required ~4 min and did not alter the recombination fractions significantly.

In Fig. 1, we see that MAPMAKER begins considering a particular map in the following way:

  1. Arbitrarily space all loci under consideration at a distance of 5cM.
  2. Calculate the likelihood of seeing the observed data, given the current map.
  3. Next, try another guess for map distances between markers.
  4. Calculate the likelihood of seeing the data, given the new guess. If the new guess has a higher likelihood, choose it as the current working map.

As the iterations progress, the changes to the map get smaller and smaller. The process repeats until the difference in log likelihood between the current map and the previous map is less than some threshold (eg. 0.1). In this way, the program quickly converges on the spacing that is most consistent with the data.

Summary