11.1. Recombinants and nonrecombinants
Figure 11.1. Recombinants and nonrecombinants
Alleles at two loci (locus A, alleles A1 and A2;
locus B, alleles B1 and B2) are segregating in
this family. Where this can be deduced, the combination of alleles a
person received from his or her father is boxed. Persons in generation
III who received either A1B1 or
A2B2 from their father are the product of
nonrecombinant sperm; persons who received A1B2 or
A2B1 are recombinant. The information shown
does not enable us to classify any of the individuals in generations I
and II as recombinant or nonrecombinant, nor does it identify
recombinants arising from oogenesis in individual II2.
In principle, genetic mapping in humans is exactly the same as genetic mapping in any
other sexually reproducing diploid organism. The aim is to discover how often two
loci are separated by meiotic recombination. Consider a person who is heterozygous
at two loci, and so types as A1A2 B1B2. Suppose the alleles A1 and B1 in this person
came from one parent, and A2 and B2 from the other. Any of that person's children
who inherit one of these parental combinations (A1B1 or A2B2) is nonrecombinant,
whereas children who inherit A1B2 or A2B1 are recombinant (Figure 11.1). The
proportion of children who are recombinant is the recombination fraction between
the two loci A and B.
11.1.1. The recombination fraction is a measure of genetic distance
If two loci are on different chromosomes, they will segregate independently.
Considering spermatogenesis in individual II1 in Figure 11.1, whichever cell receives
allele A1 at the end of meiosis I has a 50% chance of also receiving allele B1 and a
50% chance of receiving B2. Thus, on average, 50% of the children will be recombinant
and 50% nonrecombinant. The recombination fraction is 0.5. If the loci are syntenic,
that is, if they lie on the same chromosome, then they might be expected always to
segregate together, with no recombinants. However, this simple expectation ignores
meiotic crossovers. During prophase of meiosis I, pairs of homologous chromosomes
synapse and exchange segments (Figure 2.14). Only two of the four chromatids are
involved in any particular crossover. A crossover, if it occurs between the positions
of the two loci, will create two recombinant chromatids carrying A1B2 and A2B1, and
leave the two noninvolved chromatids nonrecombinant. Thus one crossover generates
50% recombinants between loci flanking it.
Recombination will rarely separate loci that lie very close together on a
chromosome, because only a crossover located precisely in the small space
between the two loci will create recombinants. Therefore sets of alleles on the
same small chromosomal segment tend to be transmitted as a block through a
pedigree. Such a block of alleles is known as a haplotype. Haplotypes mark recognizable chromosomal
segments which can be tracked through pedigrees and through populations. When
not broken up by recombination, haplotypes can be treated for mapping purposes
as alleles at a single highly polymorphic locus.
The further apart two loci are on a chromosome, the more likely it is that a
crossover will separate them. Thus the recombination fraction is a measure of
the distance between two loci. Recombination fractions define genetic distance, which is not the
same as physical distance. Two loci
that show 1% recombination are defined as being 1 centimorgan
(cM) apart on a genetic map.
11.1.2. Recombination fractions do not exceed 0.5 however great the physical
distance
Figure 11.2. Single and double recombinants
Each crossover involves two of the four chromatids of the two
synapsed homologous chromosomes. The black chromosome carries
alleles A1 and B1 at two loci, while the blue
chromosome carries alleles A2 and B2. Gametes
in which the chromatid is the same color at the two loci are
nonrecombinant for these loci, those where the chromatids are
different colors are recombinant. (A) A single
crossover generates two recombinant and two nonrecombinant
chromatids. (B) A two-strand double crossover leaves
flanking markers nonrecombinant on all four chromatids.
(C) A three-strand double crossover leaves flanking
markers recombinant on two of the four strands. (D) A
four-strand double crossover generates 100% recombinants. The three
types of double crossover occur in random proportions, so the
average effect of a double crossover is to give 50%
recombinants.
A single recombination event produces two recombinant and two nonrecombinant
chromatids. When loci are well separated there may be more than one crossover
between them. Double crossovers can involve two, three or four chromatids, but
Figure 11.2 shows that the overall effect, averaged over all double crossovers,
is to give 50% recombinants. Loci very far apart on the same chromosome might be
separated by three, four or more crossovers. Again, the overall effect is to give
50% recombinants. Recombination fractions never exceed 0.5, however far apart the
loci are.
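The averaging argument can be checked by exhaustive enumeration. The sketch below is illustrative only: it assumes that each crossover joins one chromatid from each homolog, chosen at random and independently of any other crossover (no chromatid interference), and the function name is hypothetical.

```python
from fractions import Fraction
from itertools import product


def mean_recombinant_fraction(n_crossovers):
    """Average fraction of recombinant chromatids for loci flanking n crossovers.

    Chromatids 0 and 1 are the sisters of one homolog, 2 and 3 of the other.
    Assumption: every crossover picks one chromatid from each homolog at random.
    """
    pairs = [(a, b) for a in (0, 1) for b in (2, 3)]
    total = Fraction(0)
    outcomes = 0
    for events in product(pairs, repeat=n_crossovers):
        # A chromatid is recombinant for the flanking loci if it takes part
        # in an odd number of the crossovers between them.
        recombinant = sum(
            sum(c in pair for pair in events) % 2 for c in range(4)
        )
        total += Fraction(recombinant, 4)
        outcomes += 1
    return total / outcomes


for n in range(0, 5):
    print(n, mean_recombinant_fraction(n))
```

Under this assumption the two-, three- and four-strand double crossovers arise in the ratio 1 : 2 : 1, and the enumeration gives the same average for any number of crossovers from one upwards.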
11.1.3. Mapping functions define the relationship between recombination fraction and
genetic distance
Because recombination fractions never exceed 0.5, they are not simply additive
across a genetic map. If a series of loci, A, B, C, … are located at
5 cM intervals on a map, locus M may be 60 cM from locus A, but the
recombination fraction between A and M will not be 60%. The mathematical
relationship between recombination fraction and genetic map distance is
described by the mapping function.
If crossovers occurred at random along a bivalent and had no influence on one
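another, the appropriate mapping function would be Haldane's function:

\[ \Theta = -\tfrac{1}{2}\ln(1 - 2\theta) \quad\text{or}\quad \theta = \tfrac{1}{2}\left[1 - \exp(-2\Theta)\right] \]

where Θ is the map distance (in morgans) and θ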
the recombination fraction; as usual ln means logarithm to the base e, and exp
means ‘e to the power of’. However, we know that crossovers
do not occur at random. The presence of one chiasma inhibits formation of a
second chiasma nearby. This phenomenon is called interference. A variety of mapping functions exist that
allow for varying degrees of interference. A widely used function for human
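mapping is Kosambi's function:

\[ \Theta = \tfrac{1}{4}\ln\left[\frac{1 + 2\theta}{1 - 2\theta}\right] \quad\text{or}\quad \theta = \tfrac{1}{2}\,\frac{\exp(4\Theta) - 1}{\exp(4\Theta) + 1} \]
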
A mapping function is needed in multipoint mapping (Section 11.4) to convert the raw data on the
recombination fraction into a genetic map. The interested reader should consult
Ott's book (see Further reading) and Broman
and Weber (1998) for a fuller discussion of mapping functions.
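To see how far the two mapping functions diverge, the following sketch converts between map distance and recombination fraction. It assumes the usual closed forms of Haldane's and Kosambi's functions with distance expressed in morgans (1 morgan = 100 cM); the function names are illustrative, not taken from any linkage package.

```python
import math


def haldane_theta(distance_morgans):
    """Recombination fraction under Haldane's function (no interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_morgans))


def haldane_distance(theta):
    """Map distance in morgans under Haldane's function (theta < 0.5)."""
    return -0.5 * math.log(1.0 - 2.0 * theta)


def kosambi_theta(distance_morgans):
    """Recombination fraction under Kosambi's function (some interference).

    0.5 * tanh(2d) is algebraically the same as 0.5 * (e^{4d} - 1) / (e^{4d} + 1).
    """
    return 0.5 * math.tanh(2.0 * distance_morgans)


def kosambi_distance(theta):
    """Map distance in morgans under Kosambi's function (theta < 0.5)."""
    return 0.25 * math.log((1.0 + 2.0 * theta) / (1.0 - 2.0 * theta))


# Loci A and M from the text, 60 cM (0.6 morgans) apart.
print(round(haldane_theta(0.6), 3), round(kosambi_theta(0.6), 3))
```

For the 60 cM example both functions give a recombination fraction well below 0.6, and both approach 0.5 as the distance grows.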
11.1.4. The relation between physical and genetic distances is not constant across
the genome
Chiasma counts in human male meiosis show an average of 49 crossovers per cell
(Morton et al.,
1982). Since each crossover gives 50% recombinants, the chiasma count
implies a total male genetic map length of 2450 cM. The current version of the
Location Database (Collins et
al., 1996) suggests a total male map length of 2851 cM.
Chiasmata are more frequent in female meiosis (exemplifying Haldane's rule that
the heterogametic sex has the lower chiasma count), and the total female map
length in the Location Database is 4296 cM (excluding the X). Thus over the 3000
Mb autosomal genome, 1 male cM averages 1.05 Mb and 1 female cM averages 0.70
Mb; the sex-averaged figure is 1 cM = 0.88 Mb.
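The arithmetic behind these figures is simple enough to reproduce. The sketch below uses only the numbers quoted above; the function names are illustrative, and the sex-averaged value is taken here as the mean of the male and female Mb-per-cM figures, which is the reading that reproduces 0.88.

```python
def map_length_from_chiasmata(mean_chiasmata_per_cell):
    """Genetic map length in cM, given that each crossover yields 50% recombinants."""
    return 50.0 * mean_chiasmata_per_cell


def mb_per_cm(genome_mb, map_length_cm):
    """Average physical size (Mb) of one centimorgan."""
    return genome_mb / map_length_cm


male = mb_per_cm(3000, 2851)      # Location Database male map length
female = mb_per_cm(3000, 4296)    # Location Database female map length
sex_averaged = (male + female) / 2

print(map_length_from_chiasmata(49))
print(round(male, 2), round(female, 2), round(sex_averaged, 2))
```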
Figure 11.3. Relation of physical and genetic maps of chromosome 19
180 markers from chromosome 19 were mapped genetically and
physically. The physical map of the 65 Mb chromosome is compared
with genetic maps separately computed for male and female meioses.
Note the uneven distribution of recombinants along the chromosome,
with more recombination towards the telomeres, and the varying
male:female recombination ratio. The female map is about 10% longer
than the male map; for most chromosomes the difference is more
marked. Data from Mohrenweiser
et al. (1998).
The approximation 1 cM = 1 Mb is a useful rule of thumb, but the
actual correspondence varies widely for different chromosomal regions. In
general, there is more recombination towards the telomeres of chromosomes in
males, while centromeric regions have recombinants in females but not in males
(see Figure 11.3 and Broman et al., 1998).
The most extreme deviation is shown by the pseudoautosomal region at the tip of
the short arms of the X and Y chromosomes (see Figure 14.7). Males have an
obligatory crossover within this 2.6 Mb region, so that it is 50 cM long. Thus,
for this region in males 1 Mb = 19 cM, whereas in females 1 Mb = 2.7 cM.
Uniquely, the Y chromosome, outside the pseudoautosomal region, has no genetic
map because it is not subject to synapsis and crossing over in normal meiosis.
The X chromosome of course undergoes normal recombination in females, and can be
genetically mapped in female meioses.
11.2. Genetic markers
11.2.1. Mapping human disease genes requires genetic markers
Since most human geneticists are interested in diseases, we would like a map to
show the order and distance apart of all disease genes. Scoring the
recombination fraction between pairs of diseases would be the obvious way to
construct such a map, but disease-disease mapping is not possible in humans.
Defining recombinants, as we have seen (Figure 11.1), requires double heterozygotes.
People heterozygous for two different diseases are extremely rare. Even if they can
be found, they will probably have no children, or be unsuitable for genetic analysis
in some other way. For this reason human genetic mapping depends on markers. Any
mendelian character can in principle be used as a genetic marker. It helps if the
character can be scored easily and cheaply using readily available material (blood
cells rather than a brain biopsy), but the crucial thing is that it should be
sufficiently polymorphic that a randomly selected person has a good chance of being
heterozygous.
Box 11.1 summarizes the development
of human genetic markers, from blood groups and polymorphisms of serum proteins
through to the present generation of DNA microsatellites and single nucleotide
polymorphisms.
Gene mappers could not set out to map a disease with a reasonable hope of success
until markers were available that were spaced throughout the genome. Disease-marker
mapping, if it is not to be a purely blind exercise, requires framework maps of
markers. These are generated by marker-marker mapping.
Although in theory linkage can be detected between loci 40 cM apart, the amount
of data required to do this is prohibitive. Ten meioses are sufficient to give
evidence of linkage if there are no recombinants, but 85 meioses would be needed
to give equally strong evidence of linkage if the recombination fraction was 0.3
(see Box 11.3 for a guide to these calculations). Obtaining enough family material
to test much more than 30 meioses can be seriously difficult for a rare disease.
Thus mapping requires markers spaced at intervals no greater than about 20 cM across
the genome. Given the genome lengths calculated above, this means that we need a
minimum of 150 markers. Allowing for imperfect informativeness (see below), we need
at least 300. In fact much denser maps, down to 1 cM or less average spacing of
markers, are needed to guide progress from initial mapping of a disease through to
cloning the gene. A major achievement of the Human Genome Project has been to
generate upwards of 10 000 highly polymorphic markers and place them on framework
maps (Collins et al., 1996; Broman et al., 1998).
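The comparison of 10 and 85 meioses can be reproduced with a short calculation. This is a sketch under stated assumptions rather than the method of Box 11.3, which is not reproduced here: every meiosis is taken to be fully informative and phase-known, and the observed proportion of recombinants is assumed to equal the true recombination fraction. The function names are illustrative.

```python
import math


def expected_lod_per_meiosis(theta):
    """Expected lod contributed by one phase-known, fully informative meiosis.

    Assumes the lod is evaluated at the true recombination fraction theta.
    """
    if theta == 0:
        return math.log10(2.0)
    return (math.log10(2.0)
            + theta * math.log10(theta)
            + (1.0 - theta) * math.log10(1.0 - theta))


def meioses_needed(theta, target_lod):
    """Smallest whole number of meioses whose expected lod reaches target_lod."""
    return math.ceil(target_lod / expected_lod_per_meiosis(theta))


# Evidence provided by ten meioses with no recombinants (lod of about 3.0).
target = 10 * expected_lod_per_meiosis(0.0)
print(round(target, 2), meioses_needed(0.3, target))
```

The expected lod per meiosis falls steeply as the recombination fraction rises, which is why loosely linked markers demand so much more family material.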
11.2.2. The heterozygosity or polymorphism information content measures how
informative a marker is
For linkage analysis we need informative meioses (see Box 11.2). The examples in
the box show that a meiosis is not informative with a given marker if the parent is
homozygous for the marker, and also in half of the cases where both parents have
the same heterozygous genotype. For most purposes the mean heterozygosity of a
marker (the chance that a randomly selected person will be heterozygous) is used as
the measure of informativeness. If there are marker alleles A1, A2, A3 … with gene
frequencies p1, p2, p3 …, then the proportion of people who are heterozygous is
1 - (p1² + p2² + p3² + …) (Section 3.1). A more sophisticated measure, the
polymorphism information content (PIC), allows for couples who are both heterozygous
A1A2. Half their children will also be A1A2 and therefore uninformative. The PIC of
a marker is given by:

\[ \mathrm{PIC} = 1 - \sum_{i=1}^{n} p_i^2 - \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} 2 p_i^2 p_j^2 \]

where pi is the frequency of the ith allele and n is the number of alleles. The
third term takes out half the matings of similar heterozygotes. For X-linked markers
the PIC and heterozygosity are the same; for autosomal markers the heterozygosity
somewhat overstates the informativeness, especially for 2-allele markers. For an
autosomal marker with two alleles of equal frequency the heterozygosity is 0.5
but the PIC is only 0.375.
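Both measures are easy to compute from a list of allele frequencies. The sketch below assumes the standard autosomal PIC formula (one minus the sum of squared allele frequencies, minus twice the product of squared frequencies summed over all pairs of distinct alleles); the function names and the example frequencies for the multi-allele marker are hypothetical.

```python
def heterozygosity(freqs):
    """Chance that a randomly selected person is heterozygous."""
    return 1.0 - sum(p * p for p in freqs)


def pic(freqs):
    """Polymorphism information content of an autosomal marker.

    Subtracts, for every pair of distinct alleles i < j, the term
    2 * p_i^2 * p_j^2 (half the matings of identical heterozygotes).
    """
    correction = sum(
        2.0 * freqs[i] ** 2 * freqs[j] ** 2
        for i in range(len(freqs))
        for j in range(i + 1, len(freqs))
    )
    return heterozygosity(freqs) - correction


two_allele = [0.5, 0.5]          # the example given in the text
five_allele = [0.2] * 5          # hypothetical microsatellite-like marker
print(heterozygosity(two_allele), pic(two_allele))
print(round(heterozygosity(five_allele), 3), round(pic(five_allele), 3))
```

The gap between heterozygosity and PIC shrinks as the number of alleles rises, because matings between identical heterozygotes become rarer.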
11.2.3. DNA polymorphisms are the basis of all current genetic markers
In the early 1980s, DNA polymorphisms provided, for the first time, a set of
markers that were sufficiently numerous and spaced across the entire genome. DNA
markers have the additional advantage that they can all be typed by the same
technique. Moreover their chromosomal location can be determined using FISH or
radiation hybrid mapping (Sections 10.1
and 10.2), allowing DNA-based genetic
maps to be cross-referenced to physical maps. This avoids the frustrating
situation that arose when the long-sought cystic fibrosis gene
(CFTR) was first mapped. Linkage was established to a
protein polymorphism of the enzyme paraoxonase, but the chromosomal location of
the paraoxonase gene was not known. The development of DNA markers allowed human
gene mapping to start in earnest.
Restriction fragment length polymorphisms (RFLPs)
The first generation of DNA markers were restriction fragment length
polymorphisms (RFLPs). RFLPs were initially typed by preparing Southern
blots from restriction digests of the test DNA, and hybridizing with
radiolabeled probes (see Figure
5.12). This technology required plenty of time, money and DNA, and
made a whole genome search a heroic undertaking. Nowadays this is less of a
problem because RFLPs can usually be typed by PCR. A sequence including the
variable restriction site is amplified, the product is incubated with the
appropriate restriction enzyme and then run out on a gel to see if it has
been cut (see Figure 6.6). A more
fundamental limitation is their limited informativeness. RFLPs have only two
alleles: the site is present or it is absent. The maximum heterozygosity is
0.5. Disease mapping using RFLPs is frustrating because all too often a key
meiosis in a family turns out to be uninformative.
Minisatellites
Minisatellite VNTR (variable number tandem repeat) markers were a great
improvement. The VNTRs have many alleles and high heterozygosity. Most
meioses are informative. However, the technical problems of Southern
blotting and radioactive probes were still an obstacle to easy mapping, and
VNTRs are not evenly spread across the genome.
Microsatellites
The advent of PCR finally made mapping relatively quick and easy.
Minisatellites are too long to amplify well, and so the standard tools for
PCR linkage analysis are microsatellites. These are mostly (CA)n repeats. Tri- and tetranucleotide repeats are gradually replacing
dinucleotide repeats as the markers of choice because they give cleaner
results - dinucleotide repeat sequences are peculiarly prone to replication
slippage during PCR amplification. Each allele gives a little ladder of
‘stutter bands’ on a gel, making it hard to read (see
Figure 6.8). Much effort has been
devoted to producing compatible sets of microsatellite markers that can be
amplified together in a multiplex PCR reaction and give nonoverlapping
allele sizes, so that they can be run in the same gel lane. With fluorescent
labeling in several colors, it is possible to score perhaps ten markers on a
sample in a single lane of an automated gel.
Single nucleotide polymorphisms (SNPs)
After 10 years of developing more and more polymorphic markers, it may seem
perverse that the newest generation of markers are 2-allele single
nucleotide polymorphisms. They include the classic RFLPs, but also
polymorphisms that do not happen to create or abolish a restriction site.
The advantage of SNPs is that they can be scored on solid-state arrays
without recourse to gel electrophoresis (Wang et al., 1998). The gain in throughput more
than offsets the lower informativeness of SNPs. Typically the test DNA is
PCR amplified in a very large multiplex and hybridized to an array
comprising a series of anchored oligonucleotide primers, each terminating
with a polymorphic nucleotide. A single primer extension step is carried out
on the array, using a mixture of four fluorescently-labeled
dideoxynucleotides. Label adds to primers that perfectly match the test DNA,
but not to those with a 3′ mismatch (see Figure 17.10). Reading the cells of the array for
presence or absence of fluorescence allows the types for every SNP on the
array to be read off. Although the technology is still being put together,
it is hoped that an array of a few thousand primers can be used to genotype
markers spaced closely across the whole genome in a single chip
hybridization.
11.3. Two-point mapping
11.3.1. Scoring recombinants in human pedigrees is not always simple
Having collected families where a mendelian disease is segregating, and typed
them with an informative marker, how do we know when we have found linkage?
There are two aspects to this question:
- How can we work out the recombination fraction?
- What statistical test should we use to see if the recombination
fraction is significantly different from 0.5, the value expected on
the null hypothesis of no linkage?
Figure 11.4. Recognizing recombinants: three versions of a family with an
autosomal dominant disease, typed for a marker A
(A) All meioses are phase-known. We can identify
III1–III5 unambiguously as
nonrecombinant and III6 as recombinant. (B) The
same family, but phase-unknown. The mother, II1, could have
inherited either marker allele A1 or A2 with the
disease; thus her phase is unknown. Either
III1–III5 are nonrecombinant and
III6 is recombinant; or
III1–III5 are recombinant and
III6 is nonrecombinant. (C) The same family
after further tracing of relatives. III7 and III8
have also inherited marker allele A1 along with the disease
from their father, but we cannot be sure whether their father's allele
A1 is identical by descent to the allele A1 in
his sister II1. Maybe there are two copies of allele
A1 among the four grandparental marker alleles. The
likelihood of this depends on the gene frequency of allele
A1. Thus although this pedigree contains linkage information,
extracting it is problematic.
In some families the first question can be answered very simply by counting
recombinants and nonrecombinants. The family shown in Figure 11.1 is one example.
There are two recombinants in seven meioses and the recombination fraction is
2/7 = 0.29. Figure 11.4A shows another example. The double heterozygote who is
informative for linkage (individual II1 in both Figure 11.1 and Figure 11.4A) is
phase-known: we know which alleles were inherited from which parent, and so we can
unambiguously score each meiosis as recombinant or nonrecombinant. In Figure 11.4B,
individual II1 is again doubly heterozygous, but this time phase-unknown. Among her
children, either there are five nonrecombinants and one recombinant, or else there
are five recombinants and one nonrecombinant. We can no longer identify recombinants
unambiguously, even if the first alternative seems much more likely than the second.
Figure 11.4C adds yet more complications, yet if this is a family with a rare
disease no researcher would be willing to discard it. Some method is needed to
extract the linkage information from a collection of such imperfect families.
11.3.2. Computerized lod score analysis is the best way to analyze complex pedigrees
for linkage between mendelian characters
In the pedigree shown in Figure 11.4B it is not possible to identify recombinants
unambiguously and count them. It is possible, however, to calculate the overall
likelihood of the pedigree, on the alternative assumptions that the loci are linked
(recombination fraction = θ) or not linked (recombination fraction = 0.5). The
ratio of these two likelihoods gives the odds of linkage, and the logarithm of the
odds is the lod score. Morton (1955) demonstrated that lod scores represent the
most efficient statistic for evaluating pedigrees for linkage, and derived
formulae to give the lod score (as a function of θ) for various standard pedigree
structures. Box 11.3 shows how this is done for simple structures. Being a function
of the recombination fraction, lod scores are calculated for a range of θ values.
In a set of families, the overall odds of linkage are the product of the odds in
each individual family; lod scores, being logarithms, can therefore be added up
across families.
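For the simplest pedigree structures the lod score can be written down directly. The following sketch is not the content of Box 11.3, which is not reproduced here; it assumes the standard likelihoods for a sibship scored from one doubly heterozygous parent, with the two possible phases given equal weight when phase is unknown. The function names are illustrative.

```python
import math


def lod_phase_known(theta, recombinants, nonrecombinants):
    """Lod score for a phase-known family at recombination fraction theta."""
    n = recombinants + nonrecombinants
    linked = theta ** recombinants * (1.0 - theta) ** nonrecombinants
    unlinked = 0.5 ** n
    # A recombinant makes theta = 0 impossible, giving a lod of minus infinity.
    return math.log10(linked / unlinked) if linked > 0 else float("-inf")


def lod_phase_unknown(theta, group_a, group_b):
    """Lod score when either group of children could be the recombinants.

    Assumption: the two possible phases are equally likely a priori.
    """
    n = group_a + group_b
    linked = 0.5 * (theta ** group_a * (1.0 - theta) ** group_b
                    + theta ** group_b * (1.0 - theta) ** group_a)
    unlinked = 0.5 ** n
    return math.log10(linked / unlinked) if linked > 0 else float("-inf")


# One recombinant and five nonrecombinants, scored at theta = 1/6.
known = lod_phase_known(1 / 6, 1, 5)
unknown = lod_phase_unknown(1 / 6, 1, 5)
# Lod scores from independent families are simply added.
total = known + unknown
print(round(known, 3), round(unknown, 3), round(total, 3))
```

The same six children yield a lower lod score when phase is unknown, which is the price of not knowing which group of children is the recombinant one.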
Calculating the full lod score for the family in Figure 11.4C is difficult. To
calculate the likelihood that III7 and III8 are recombinant or nonrecombinant, we
must take likelihoods calculated for each possible genotype of I1, I2 and II3,
weighted by the probability of that genotype. For I1 and I2, the genotype
probabilities depend on both the gene frequencies and the observed genotypes of
II1, III7 and III8. Genotype probabilities for II3 are then calculated by simple
mendelian rules. Human linkage analysis, except in the very simplest cases, is
entirely dependent on computer programs that implement algorithms for handling
these branching trees of genotype probabilities, given the pedigree data and a
table of gene frequencies.
11.3.3. Lod scores of +3 and -2 are the
criteria for linkage and exclusion (for a single test)
Figure 11.5. Lod score curves
Graphs of lod score against recombination fraction from a
hypothetical set of linkage experiments. Curve 1: evidence of
linkage (Z > 3) with no recombinants. Curve
2: evidence of linkage (Z > 3) with the most
likely recombination fraction being 0.23. Curve 3: linkage excluded
(Z < -2) for recombination fractions
below 0.12; inconclusive for larger recombination fractions. Curve
4: inconclusive at all recombination fractions.
The result of linkage analysis is a table of lod scores at various recombination
fractions, like the two tables in Box 11.3. Positive lods give evidence in favor of
linkage and negative lods give evidence against linkage. Note that only
recombination fractions between 0 and 0.5 are meaningful, and that all lod scores
are zero at θ = 0.5 (because they are then measuring the ratio of two identical
probabilities, and log10(1) = 0). The results can be plotted to give curves like
those in Figure 11.5.
Returning to the two questions posed at the start of this section, we now see
that the most likely recombination fraction is the one at which the lod score is
highest. If there are no recombinants, the lod score will be maximum at θ = 0. If
there are recombinants, Z will peak at the most likely recombination fraction
(0.167 = 1/6 for the family in Figure 11.4A, but harder to predict for
Figure 11.4B).
The second question concerned the threshold of significance. Here the answer is
at first sight surprising. Z = 3.0 is the threshold for accepting linkage, with a
5% chance of error. Linkage can be rejected if Z < -2.0. Values of Z between -2
and +3 are inconclusive. For most statistics p < 0.05 is used as the threshold of
significance, but Z = 3.0 corresponds to 1000 : 1 odds (log10(1000) = 3.0). The
reason why such a stringent threshold is chosen lies in the inherent
improbability that two loci, chosen at random, should be linked. With 22 pairs
of autosomes to choose from, it is not likely they would be located on the same
chromosome (syntenic) and, even if they were, loci well separated on a
chromosome are unlinked. Common sense tells us that if something is inherently
improbable, we require strong evidence to convince us that it is true. This
common sense can be quantified in a Bayesian calculation (see Box 11.4), which
shows that 1000 : 1 odds in fact corresponds closely to the conventional
p = 0.05 threshold of significance. The same logic suggests a threshold lod of 2.3
for establishing linkage between an X-linked character and an X-chromosome marker
(prior probability of linkage ≅ 1/10).
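The Bayesian argument can be sketched numerically. This is a simplified illustration and not the calculation of Box 11.4, which is not reproduced here. It treats the lod score as a likelihood ratio and combines it with a prior probability of linkage. The autosomal prior of 1 in 50 used below is an assumption (a conventionally quoted figure that is not stated in the text above); the X-linked prior of 1 in 10 is the one given in the text. The function name is illustrative.

```python
def posterior_probability_of_linkage(lod, prior):
    """Posterior probability of linkage from a lod score and a prior probability."""
    likelihood_ratio = 10.0 ** lod
    posterior_odds = likelihood_ratio * prior / (1.0 - prior)
    return posterior_odds / (1.0 + posterior_odds)


# Autosomal case: lod 3.0 with an ASSUMED prior of 1/50.
autosomal_error = 1.0 - posterior_probability_of_linkage(3.0, 1 / 50)
# X-linked case: lod 2.3 with the prior of 1/10 quoted in the text.
x_linked_error = 1.0 - posterior_probability_of_linkage(2.3, 1 / 10)
print(round(autosomal_error, 3), round(x_linked_error, 3))
```

With these priors both thresholds leave a chance of error a little under 5%, which is the sense in which a lod of 3.0 matches the conventional significance level.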
Confidence intervals are hard to deduce analytically, but a widely accepted
support interval extends to recombination fractions at which the lod score is 1
unit below the peak value (the lod-1 rule). Thus, curve 2 in Figure 11.5 gives
acceptable evidence of linkage (Z > 3) with the most likely recombination
fraction 0.23 and support interval 0.17–0.32. The curve will be more
sharply peaked the greater the amount of data, but in general peaks are quite
broad. It is important to remember that distances on human genetic maps are
often very imprecise estimates.
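The lod-1 rule is straightforward to apply once lod scores have been tabulated. The sketch below assumes phase-known, fully informative meioses and evaluates the lod score on a grid of recombination fractions; the data (3 recombinants in 30 meioses) are hypothetical, as are the function names.

```python
import math


def lod(theta, recombinants, nonrecombinants):
    """Lod score for phase-known meioses, valid for 0 < theta <= 0.5."""
    n = recombinants + nonrecombinants
    return (n * math.log10(2.0)
            + recombinants * math.log10(theta)
            + nonrecombinants * math.log10(1.0 - theta))


def lod_minus_one_interval(recombinants, nonrecombinants, steps=500):
    """Return (best theta, peak lod, (lower, upper)) using the lod-1 rule."""
    # Grid of theta values from 0.5/steps up to 0.5 inclusive.
    grid = [0.5 * i / steps for i in range(1, steps + 1)]
    scores = [lod(t, recombinants, nonrecombinants) for t in grid]
    peak = max(scores)
    best_theta = grid[scores.index(peak)]
    # Keep every theta whose lod is within 1 unit of the peak.
    supported = [t for t, z in zip(grid, scores) if z >= peak - 1.0]
    return best_theta, peak, (supported[0], supported[-1])


best, peak, (lower, upper) = lod_minus_one_interval(3, 27)
print(round(best, 3), round(peak, 2), round(lower, 3), round(upper, 3))
```

Even with 30 informative meioses the support interval spans a wide range of recombination fractions, which illustrates how imprecise such estimates are.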
Negative lod scores exclude linkage for the region where Z < -2. Curve 3 on
Figure 11.5 excludes the disease from 12 cM either side of the marker. While gene
mappers hope for a positive lod score, exclusions are not without value. They tell
us where the disease is not (exclusion mapping). This can exclude a possible
candidate gene, and if enough of the genome is excluded, only a few possible
locations may remain.
11.3.4. For whole genome searches a genome-wide threshold of significance must be
used
In disease studies, families are typed for marker after marker until positive
lods are obtained. The appropriate threshold for significance is a lod score
such that there is only a 0.05 chance of a false positive result occurring
anywhere during a search of the whole genome. As shown in Box 11.4, a lod score of
3.0 corresponds to a significance of 0.05 at a single point. But if 50 markers have
been used, the chance of a spurious positive result is greater than if only one
marker is used. A stringent procedure would multiply the p value by 50 before
testing its significance. The threshold lod score for a study using n markers
would be 3 + log10(n), that is a lod score of 4 for 10 markers, 5 for 100, etc.
However, this is over-stringent. Linkage data are not independent. If a character
is mendelian, then it is determined at a single chromosomal location. If it does
not map to one location, then the prior probability that it maps to another
location is raised. The threshold for a genome-wide significance level of 0.05 has
been much argued over, but a widely accepted answer for mendelian characters is 3.3
(Lander and Schork, 1994). For nonmendelian characters see Section 12.5. In
practice, lod scores below 5, whether with one marker or many, should be regarded
as provisional.
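The stringent correction described above amounts to adding the base-10 logarithm of the number of markers to the single-test threshold. A minimal sketch, with an illustrative function name:

```python
import math


def stringent_lod_threshold(n_markers, single_test_threshold=3.0):
    """Lod threshold after multiplying the p value by the number of markers."""
    return single_test_threshold + math.log10(n_markers)


for n in (1, 10, 50, 100):
    print(n, round(stringent_lod_threshold(n), 2))
```

For the 50 markers mentioned in the text this gives a threshold of about 4.7, well above the genome-wide value of 3.3 quoted for mendelian characters, which shows how over-stringent the simple correction is.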
11.4. Multipoint mapping is more efficient than two-point mapping
11.4.1. Multipoint linkage can locate a disease locus on a framework of
markers
Table 11.1. Gene ordering by three-point crosses
| Class of offspring | Position of recombination (x) | Number |
| --- | --- | --- |
| ABC/abc | Nonrecombinant | 853 |
| abc/abc | | |
| ABc/abc | (A, B)-x-C | 5 |
| abC/abc | | |
| Abc/abc | A-x-(B, C) | 47 |
| aBC/abc | | |
| AbC/abc | B-x-(A, C) | 95 |
| aBc/abc | | |
Linkage analysis can be more efficient if data for more than two loci are
analyzed simultaneously. Multilocus analysis is particularly useful for
establishing the chromosomal order of a set of linked loci. Experimental
geneticists have long used three-point crosses for this purpose. The rarest
recombinant class is that which requires a double recombination. In
Table 11.1, the gene order A-C-B is
immediately apparent. This procedure is more efficient than estimating the
recombination fractions for intervals A-B, A-C and B-C separately in a series of
two-point crosses. Ideally, in any linkage analysis the whole genome would be
screened for linkage, and the full dataset would be used to calculate the
likelihood at each location across the genome.
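The reasoning applied to Table 11.1 can be written out as a short calculation. In the sketch below each recombinant class is keyed by the locus that has been separated from the other two; the counts are those of Table 11.1, while the data layout and function names are illustrative.

```python
# Offspring counts from Table 11.1, keyed by the locus separated from the other two.
separated_counts = {
    "C": 5,    # (A, B)-x-C
    "A": 47,   # A-x-(B, C)
    "B": 95,   # B-x-(A, C)
}
nonrecombinant = 853
total = nonrecombinant + sum(separated_counts.values())


def middle_locus(counts):
    """The rarest class needs a double recombination, so its locus lies in the middle."""
    return min(counts, key=counts.get)


def recombination_fraction(locus1, locus2, counts, n_offspring):
    """Two loci are recombinant whenever either one is the separated locus."""
    return (counts[locus1] + counts[locus2]) / n_offspring


print(middle_locus(separated_counts))
for pair in (("A", "C"), ("C", "B"), ("A", "B")):
    print(pair, recombination_fraction(*pair, separated_counts, total))
```

The two adjacent intervals sum to slightly more than the recombination fraction between the outer loci, because double recombinants are counted in each interval but leave the outer loci nonrecombinant.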
A second advantage of multilocus mapping in humans is that it helps overcome
problems caused by the limited informativeness of markers. Some meioses in a
family might be informative with marker A, and others uninformative for A but
informative with the nearby marker B. Only simultaneous linkage analysis of the
disease with markers A and B extracts the full information. This is less
important for mapping using highly informative microsatellite markers rather
than two-allele RFLPs, but it will resurface if SNPs (Box 11.1) become the main
mapping tool.
11.4.2. Multipoint mapping by computer
Figure 11.6. Multipoint mapping in man
The horizontal axis is a map of markers and the vertical axis is the
lod score. The linkmap program has moved the unmapped
disease locus across the map, calculating the lod score at each
position. Lod scores dip to strongly negative values near to the
position of markers which show recombinants with the disease. The
highest peak shows the most likely location. Odds in favor of this
position are measured by the degree to which the highest peak
overtops its rivals. Redrawn from Hughes et al. (1994) with permission
from the author.
For disease-marker mapping the starting point is usually a two-point lod score
showing that the disease maps near one particular marker, plus a map of the
framework of markers. The marker map is taken as given, and the aim is to locate
the disease gene in one of the intervals of the framework. Programs such as
linkmap (part of the Linkage package) or genehunter can move the disease locus
step by step across the marker framework, calculating the overall likelihood of
the pedigree data at each position. The result (Figure 11.6) is a curve of
likelihood against map location. The y-axis is usually a lod score, the log
likelihood ratio for this location versus a location off the end of the map.
Occasionally, for reasons based on statistical theory, a location score is used.
Location scores are twice the natural logarithm of the likelihood ratio, i.e. 4.6
× the lod score. This method is also useful for exclusion mapping: if
the curve stays below a lod score of -2 across the region, then the disease
locus is excluded from that region.
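The factor of 4.6 follows directly from the change of logarithm base, as this small sketch shows (the function names are illustrative):

```python
import math


def location_score_from_lod(lod):
    """Twice the natural log of the likelihood ratio, given a base-10 lod score."""
    return 2.0 * math.log(10.0) * lod


def lod_from_location_score(location_score):
    """Inverse conversion, back to a base-10 lod score."""
    return location_score / (2.0 * math.log(10.0))


print(round(location_score_from_lod(1.0), 3))
print(round(location_score_from_lod(3.0), 2))
```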
The apparently quantitative nature of Figure 11.6 is largely spurious. Peak heights
depend crucially on the precise distances between markers and on the mapping
function (Section 11.1.3). In reality these are seldom accurately known. The
distances on marker-marker maps should be regarded as only rough guides, and
moreover none of the mapping functions in linkage programs even approximates to the
real complexities of chiasma distribution (see Figure 11.3). However, unless the
marker map is radically wrong, it remains true that the highest peak marks the
most likely location.
11.4.3. Multipoint linkage is essential for constructing marker framework
maps
Disease-marker mapping suffers from the necessity of using whatever families can
be found where the disease of interest is segregating. Such families will rarely
have ideal structures. All too often the number of meioses is undesirably small,
and missing persons mean that some meioses are phase-unknown. Marker-marker
mapping can avoid these problems. Markers can be studied in any family, so
families can be chosen that have plenty of children and ideal structures for
linkage, like the family in Figure 11.1.
Construction of marker framework maps has benefited greatly from a collection of
families (the CEPH families) assembled specifically for the purpose by the
Centre pour l'Étude du Polymorphisme Humain (now the Centre Jean
Dausset) in Paris. Immortalized cell lines from every individual ensure a
permanent supply of DNA, and sample mix-ups and nonpaternity have long since
been ruled out by typing with many markers. The first goal of the Human Genome
Project was to produce high-density framework maps of highly polymorphic
markers. This phase is now complete. As an example, the current map from CHLC
(Cooperative Human Linkage Center) is based on the results of scoring eight CEPH
families with 8325 microsatellites, resulting in over 1 million genotypes
(Broman et al., 1998).
11.4.4. Integrated maps combine genetic and physical data
Ordering the loci in multipoint mapping is not a trivial problem. There are
n!/2 possible orders for n markers, and
current maps have hundreds of markers per chromosome. Something more intelligent
than brute force computing must be used to work out the correct order. Physical
mapping information can be immensely helpful here. Markers that can be typed by
PCR can be used as sequence-tagged sites (STS) and grouped into
physically localized sets using radiation hybrids or YAC clones. Within a set,
the number of possible orders should be small enough to test against the
multipoint mapping data. As the genome is increasingly covered with clone
contigs, physical distance data are becoming available for more markers. The
overall goal of mapping is an integrated map that lists features in order of
chromosomal location, gives their distances on both genetic (preferably
separate male and female cM) and physical scales, and relates all this to the
chromosomal bands. The Location Database (Collins et al., 1996) contains such integrated
maps, and the latest versions can be consulted at http://cedar.genetics.soton.ac.uk/public_html/.
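The n!/2 figure quoted above can be checked with a short sketch. The function names here are illustrative and not taken from any mapping package. An order and its mirror image are treated as the same map, which is where the division by 2 comes from.

```python
from itertools import permutations
from math import factorial


def count_orders(n):
    """Number of distinct marker orders for n markers (n!/2 for n >= 2)."""
    if n < 2:
        return 1
    return factorial(n) // 2


def enumerate_orders(markers):
    """Brute-force listing of orders, keeping one of each mirror-image pair.

    Only practical for a handful of markers, which is the point of the text.
    """
    seen = set()
    for order in permutations(markers):
        if order[::-1] not in seen:
            seen.add(order)
    return sorted(seen)


if __name__ == "__main__":
    for n in (3, 5, 10, 20):
        print(n, count_orders(n))
```

Even 20 markers give more than 10^18 orders, which is why grouping markers into small, physically localized sets before testing orders is so helpful.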
11.5. Standard lod score analysis is not without problems
Standard lod score analysis is a tremendously powerful method for scanning the genome
in 20-Mb segments to locate a disease gene, but it can run into difficulties. These
include:
- vulnerability to errors;
- computational limits on what pedigrees can be analyzed;
- problems with locus heterogeneity;
- limits on the ultimate resolution achievable;
- the need to specify a precise genetic model, detailing the mode of
  inheritance, gene frequencies and penetrance of each genotype.
11.5.1. Errors in genotyping and misdiagnoses can generate spurious recombinants
Figure 11.7. Apparent double recombinants suggest errors in the data
Because of interference (Section
11.1.3), the probability of a true double recombinant
with markers 5 cM apart is small, well below 0.05 × 0.05
= 0.0025. Apparent double recombinants usually signal an
error in typing the markers, a clinical misdiagnosis, or locus
heterogeneity such that the disease in this case does not map to
locus D but elsewhere in the genome. Mutation in one of the genes or
germinal mosaicism are rarer causes.
With highly polymorphic markers, common errors such as misread gels, switched
samples or nonpaternity will usually result in a child being given a genotype
incompatible with the parents. The linkage analysis program will stall until
such errors have been corrected. Errors that introduce possible but wrong
genotypes are more of a problem. These include misdiagnosis of somebody's
disease status. Such errors inflate the length of genetic maps by introducing
spurious recombinants, because if a child has been assigned the wrong parental
allele, it will appear to be a recombinant. Multilocus analysis can help,
because spurious recombinants appear as close double recombinants (Figure 11.7).
Error-checking routines test the extent to which the map can be shortened by
omitting any single test result (see Broman et al., 1998). Results that
significantly lengthen the map (i.e. add recombinants) are suspect.
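The idea behind such error checking can be illustrated with a minimal sketch. It is not the routine used by Broman et al.; it uses the count of apparent recombinants in one meiosis as a crude stand-in for map length. The representation (a list giving the grandparental origin, 0 or 1, of the allele transmitted at each marker in map order, with None for uninformative markers) and both function names are assumptions made for this example.

```python
def count_recombinants(phases):
    """Count switches of grandparental origin between informative markers."""
    informative = [p for p in phases if p is not None]
    return sum(a != b for a, b in zip(informative, informative[1:]))


def suspect_results(phases, min_drop=2):
    """Flag single results whose omission removes at least min_drop recombinants.

    A drop of 2 means the result creates a close double recombinant on its own.
    Returns a list of (marker_index, drop) pairs.
    """
    baseline = count_recombinants(phases)
    flagged = []
    for i, p in enumerate(phases):
        if p is None:
            continue
        # Treat this one result as missing and recount.
        masked = phases[:i] + [None] + phases[i + 1:]
        drop = baseline - count_recombinants(masked)
        if drop >= min_drop:
            flagged.append((i, drop))
    return flagged


if __name__ == "__main__":
    meiosis = [0, 0, 1, 0, 0, None, 0]   # marker 2 looks like a double recombinant
    print(suspect_results(meiosis))
```

A genuine single crossover (for example 0, 0, 0, 1, 1, 1) is not flagged, because omitting any one result leaves the crossover in place.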
11.5.2. Computational difficulties limit the pedigrees that can be analyzed
As we saw in Section 11.3.2, human
linkage analysis depends on computer programs that implement algorithms for
handling branching trees of genotype probabilities, given the pedigree data and
gene frequencies. LIPED was the first generally useful program, and
MLINK (part of a package called LINKAGE) used the same
basic algorithm, the Elston-Stewart algorithm, but extended it to multipoint
data. The Elston-Stewart algorithm can handle arbitrarily large pedigrees, but
the computing time increases exponentially with increasing numbers of possible
haplotypes (more alleles and/or more loci). This limits the ability of
MLINK to analyse multipoint data. An alternative algorithm, the
Lander-Green algorithm, can cope with any number of genotypes but the computing
time increases exponentially with the size of the pedigree. This algorithm is
implemented in the GENEHUNTER program (see Section 12.2.4), which is particularly good for analysing
whole-genome searches of modest sized pedigrees. The general theory of linkage
analysis is excellently covered in the book by Ott (Further reading), while the
book by Terwilliger and Ott (Further reading) is full of practical advice
indispensable to anybody undertaking human linkage analysis.
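The contrasting limits of the two algorithms can be made concrete with a rough sketch of the quantities that grow. It does not implement either algorithm. The number of joint haplotypes is simply the product of the allele counts at each locus. The second function assumes the usual description of the Lander-Green approach, in which each non-founder is the product of two meioses and each meiosis has two possible outcomes at a locus; this detail is not given in the text above, and both function names are hypothetical.

```python
from math import prod


def joint_haplotypes(alleles_per_locus):
    """Number of possible multilocus haplotypes: product of allele counts."""
    return prod(alleles_per_locus)


def inheritance_patterns(non_founders):
    """Possible meiotic outcomes at one locus, assuming 2 meioses per non-founder."""
    return 2 ** (2 * non_founders)


if __name__ == "__main__":
    # More loci: the haplotype count explodes (limits Elston-Stewart style analysis).
    for loci in (2, 4, 8):
        print(loci, "loci with 6 alleles each:", joint_haplotypes([6] * loci))
    # Bigger pedigree: the pattern count explodes (limits Lander-Green style analysis).
    for n in (5, 10, 20):
        print(n, "non-founders:", inheritance_patterns(n))
```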
11.5.3. Locus heterogeneity is always a pitfall in human gene mapping
As we saw in Section 3.1.4, it is common
for mutations in several unlinked genes to produce the same clinical phenotype.
Even a dominant condition with large families can be hard to map if there is
locus heterogeneity within the collection of families studied. It took years of
collaborative work to show that tuberous sclerosis was caused by mutations at
either of two loci, TSC1 (MIM 191100) at 9q34 and TSC2
(MIM 191092) at 16p13. With recessive conditions,
the difficulty is multiplied by the need to combine many small families.
Autozygosity mapping (Section 11.5.5) is
the main solution in such cases.
GENEHUNTER or HOMOG and related programs (see Terwilliger and Ott, 1994) can compare
the likelihood of the data on the alternative assumptions of locus homogeneity
(all families map to the location under test) and heterogeneity (a proportion
α of unlinked families), and give a maximum likelihood estimate of
α.
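The comparison of homogeneity and heterogeneity can be sketched as follows. This is a simplified illustration, not the HOMOG or GENEHUNTER implementation: it assumes the standard admixture formulation, in which each family is linked with probability α and unlinked otherwise, it works at a single fixed test position, and it finds α by a coarse grid search. The per-family lod scores in the example are invented.

```python
from math import log10


def heterogeneity_lod(family_lods, alpha):
    """Lod score under admixture, given each family's lod at the test position.

    A family's likelihood ratio is alpha * 10**lod + (1 - alpha): linked with
    probability alpha, unlinked (likelihood ratio 1) otherwise.
    """
    return sum(log10(alpha * 10 ** z + (1 - alpha)) for z in family_lods)


def estimate_alpha(family_lods, steps=100):
    """Grid-search maximum likelihood estimate of alpha on [0, 1]."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda a: heterogeneity_lod(family_lods, a))


if __name__ == "__main__":
    lods = [2.5, 2.0, -3.0, -2.5]          # invented: two linked, two unlinked
    a = estimate_alpha(lods)
    print("alpha =", a)
    print("lod assuming homogeneity  =", heterogeneity_lod(lods, 1.0))
    print("lod at estimated alpha    =", heterogeneity_lod(lods, a))
```

With α fixed at 1 the function reduces to the ordinary sum of family lod scores, so a mixed collection of families can give a negative total even when some families show clear linkage.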
11.5.4. The limited resolution of human genetic mapping may be overcome by typing single sperm or by using linkage disequilibrium
Once a marker is found for which all meioses are informative and nonrecombinant,
linkage analysis comes to a halt. In typical collections of disease families,
the target region thus identified is likely to be 1 Mb or more. This is
uncomfortably large for positional cloning of an unknown disease gene. One
possible way to increase the resolution of marker-marker mapping is to type
sperm instead of children. Humans have far too few children for optimal linkage
analysis, but men produce untold millions of sperm, and modern PCR technology
allows markers to be scored on single separated sperm from a doubly heterozygous
man. Yu et al. (1996)
show examples. Apart from technical problems, one drawback is that a single
sperm cannot be resampled repeatedly to confirm interesting results, in the same
way as a child can. Whole genome amplification (Zhang et al., 1992) partially
circumvents this problem. Individual spermatozoa are subjected to whole genome
amplification followed by multiplex PCR amplification of markers from an
aliquot. Further aliquots can be used to check any recombinants. Unfortunately
sperm typing could not be used for disease-marker mapping, unless the disease
mutations were already characterized.
Linkage disequilibrium provides the best hope of narrowing down the candidate
region in disease-marker mapping. Genotypes or haplotypes for markers spread
across the candidate region are examined in a series of unrelated affected
patients. If the patients all carry independent mutations, as may very well be
the case for a dominant or X-linked disease, this exercise will reveal nothing
of interest. However, if a proportion of the disease genes in apparently
unrelated patients derive from a common ancestor, as often happens with
recessive conditions, it may be possible to find a shared ancestral haplotype
that defines a small part of the candidate region. This approach is illustrated
in Section 12.4.1.
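The search for a shared ancestral haplotype can be pictured with a deliberately simplified sketch. It assumes phased disease-bearing haplotypes, each given as a list of alleles at the same ordered markers, and it requires every patient to share the segment, whereas in practice only a proportion may do so. The function name and the example data are hypothetical.

```python
def shared_segment(haplotypes):
    """Longest run of consecutive markers at which all haplotypes are identical.

    Returns (first_index, last_index), inclusive, or None if no marker is shared.
    """
    n_markers = len(haplotypes[0])
    best = None
    start = None
    for i in range(n_markers):
        all_same = len({h[i] for h in haplotypes}) == 1
        if all_same:
            if start is None:
                start = i
            if best is None or i - start > best[1] - best[0]:
                best = (start, i)
        else:
            start = None
    return best


if __name__ == "__main__":
    patients = [
        [1, 3, 2, 5, 4],
        [2, 3, 2, 5, 1],
        [1, 3, 2, 5, 2],
    ]
    print(shared_segment(patients))   # markers 1 to 3 are shared by all three
```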
11.5.5. Autozygosity mapping can map recessive conditions efficiently in extended inbred families
Autozygosity is a term used to mean
homozygosity for markers identical by descent, inherited from a recent common
ancestor. People with rare recessive diseases in consanguineous families are
likely to be autozygous for markers linked to the disease locus. Suppose the
parents are second cousins: they would be expected to share 1/32 of all their
genes because of their common ancestry, and a child would be autozygous at only
1/64 of all loci. If a child is homozygous for a particular marker allele, this
could be because of autozygosity, or it could be because a second copy of the
same allele has entered the family independently. The rarer the allele is in the
population, the greater the likelihood that homozygosity represents
autozygosity. For an infinitely rare allele, a single homozygous affected child
born to second cousin parents generates a lod score of log10(64)
= 1.8. If there are two other affected sibs who are both also
homozygous for the same rare allele, the lod score is 3.0 (log10(64
× 4 × 4); the chance that a sib would have inherited the
same pair of parental haplotypes even if they are unrelated to the disease is 1
in 4).
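The arithmetic in this paragraph can be reproduced with a short sketch. The function names are illustrative. The formula for the autozygous fraction by degree of cousinship is the standard inbreeding coefficient for a child of cousins; the text gives only the second-cousin value. The lod calculation keeps the paragraph's assumption of an infinitely rare allele.

```python
from fractions import Fraction
from math import log10


def autozygous_fraction(cousin_degree):
    """Expected autozygous fraction in a child of first (1), second (2), ... cousins."""
    return Fraction(1, 4 ** (cousin_degree + 1))


def autozygosity_lod(cousin_degree=2, extra_affected_sibs=0):
    """Lod score for homozygosity for an infinitely rare allele.

    Each further affected sib homozygous for the same allele multiplies the
    odds by 4, the reciprocal of the 1 in 4 chance of sharing both haplotypes.
    """
    odds = (1 / autozygous_fraction(cousin_degree)) * 4 ** extra_affected_sibs
    return log10(odds)


if __name__ == "__main__":
    print(autozygous_fraction(2))                                   # 1/64
    print(round(autozygosity_lod(2), 1))                            # one affected child
    print(round(autozygosity_lod(2, extra_affected_sibs=2), 1))     # three affected sibs
```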
Figure 11.8. Autozygosity mapping
A large multiply inbred family in which several members suffer from
profound congenital deafness (filled symbols). A whole genome screen
with 160 polymorphic microsatellite markers showed that all affected
family members were homozygous for markers D2S2144 and D2S158, thus
mapping the DFNB9 locus to the 2 cM region between
the flanking markers, D2S2303 and D2S174. Redrawn from Chaib et al.
(1996) Hum. Mol. Genet. 5, 155–158, with permission from Oxford University Press.
Thus quite small inbred families can generate significant lod scores, and
autozygosity mapping becomes a powerful tool for linkage analysis if families
can be found with multiple affected people in two or more sibships, linked by
inbreeding. Suitable families may be found in Middle Eastern countries where
inbreeding is common. The method has been applied with great success to locating
genes for autosomal recessive hearing loss, which otherwise presents intractable
problems because of extensive locus heterogeneity (Guilford et al., 1994). An
example is shown in Figure 11.8.
The same principle can be extended to populations where the common ancestry is
inferred rather than demonstrated. A bold application of this principle enabled
Houwen et al. (1994) to map the rare recessive condition, benign recurrent
intrahepatic cholestasis, using only four affected individuals (two sibs and two
supposedly unrelated people) from an isolated Dutch village. The more remote the
shared ancestor, the smaller is the proportion of the genome that is shared by
virtue of that common ancestry, and therefore the greater the significance for
linkage if autozygosity can be demonstrated. But at the same time, the remoter
the common ancestor, the more chances there are for a second independent allele
to enter the family from outside, and so the less likely it is that homozygosity
represents autozygosity, either for the disease or for the markers. With remote
common ancestry, as in the study of Houwen et al., everything depends on finding
people with a very rare recessive condition who are homozygous for a very rare
marker allele or (more likely) haplotype. The power of Houwen's study seems almost
miraculous, but it is important to remember that this methodology applies only
to diseases and populations where most affected people are descended from a
common ancestor who was a carrier. The wider use of allelic association is
described in the next chapter (Sections 12.3 and 12.4).
11.5.6. Characters whose inheritance is not mendelian are not suitable for mapping by the methods described in this chapter
The methods of lod score analysis described in this chapter require a precise
genetic model that specifies the mode of inheritance, gene frequencies and
penetrance of each genotype. For mendelian characters, penetrance is the main
problem area. If no allowance is made for unaffected people being nonpenetrant
gene carriers, or affected people being phenocopies, then these people may be
wrongly scored as recombinant. On the other hand, if the penetrance is set too
low there is a reduction in the power to detect linkage, because a less precise
hypothesis is being tested. Errors in the order of markers on marker framework
maps can cause problems, but these are diminishing as genetic maps are
cross-checked against physical mapping data. Given sufficient meioses, the main
obstacle in linkage analysis of mendelian characters is locus heterogeneity.
However, for common complex diseases like diabetes or schizophrenia, the
problems are far more intractable. Any genetic model is no more than a
hypothesis: we have no real idea of the gene frequencies or penetrance of any
susceptibility alleles, or even the mode of inheritance. This makes it
near-impossible to apply the methods we have described in this chapter to such
diseases. Nevertheless, identifying the genetic components of susceptibility to
complex diseases is now a major part of human genetics research. The ways one
can attempt to do this are the subject of the next chapter.