Human Molecular Genetics 2
2nd edition
Tom Strachan (1) and Andrew P Read (2)
(1) University of Newcastle, Newcastle-upon-Tyne, UK
(2) University of Manchester, Manchester, UK
BIOS Scientific Publishers Ltd, 1999. ISBN 1-85996-202-5

Chapter 12: Genetic mapping of complex characters


As we saw at the end of Chapter 3, now that most mendelian diseases have been mapped and most genes at least partially cloned, many researchers see unraveling the genetic determinants of nonmendelian diseases as the next frontier in human genetics. For healthcare, it is certainly an important task. The main genetic contribution to morbidity and mortality in the developed world is through the genetic component of common diseases. Identifying the genes involved may suggest new means of prevention or treatment. Pharmaceutical companies, too, have shifted much of their research into genomics in the belief that this represents the best way to identify new drug targets. However, there are problems in tackling complex diseases with the methods described in Chapter 11 that served so well for mapping mendelian characters. This chapter discusses the approaches that might be used for mapping nonmendelian characters, and the second half of Chapter 19 describes how well they have worked when tried.

12.1. Parametric linkage analysis and complex diseases

12.1.1. Standard lod score analysis is usually inappropriate for nonmendelian characters

Standard lod score analysis is called parametric because it requires a precise genetic model, detailing the mode of inheritance, gene frequencies and penetrance of each genotype. As long as a valid model is available, parametric linkage provides a wonderfully powerful method for scanning the genome in 20-Mb segments to locate a disease gene. For mendelian characters, specifying an adequate model should be no great problem. Nonmendelian conditions, however, are much less tractable.

A major problem is establishing diagnostic criteria. With mendelian syndromes it is usually fairly obvious which features of a patient form part of the syndrome and which are coincidental. Different features may have different penetrances, but basically the components of the syndrome are those that cosegregate. No such check exists for nonmendelian conditions. Great efforts are made, especially with psychiatric diseases, to establish diagnostic categories that are valid, in the sense that two independent psychiatrists will agree whether or not a certain label applies to a given patient. But a diagnostic label can be valid without being biologically meaningful. Any mendelian pattern must be biologically meaningful. Without a mendelian pattern, sometimes physiology will provide an alternative reality check but, especially for psychiatric and behavioral phenotypes, the diagnostic criteria are often biologically arbitrary. Adhering to them helps make different studies comparable, but does not guarantee that the right genetic question is being asked.

Once diagnostic criteria are agreed, segregation analysis (Section 19.4) can identify the most likely mode of inheritance, gene frequencies and penetrances. However, these estimates are averages over a probably heterogeneous set of families, and over all the loci within a family, and they are rarely much use for gene mapping. In the face of all these difficulties, there are several possible ways to proceed:

  • Seek families in which the disease segregates in a near-mendelian manner.

  • Use affected pedigree members only in a parametric analysis.

  • Use a nonparametric (model-free) method of linkage analysis.

12.1.2. Near-mendelian families can be selected for parametric linkage analysis - but the results may be misleading

Both breast cancer and schizophrenia are, in most cases, nonmendelian, but rare families can be found with many affected people in a pattern consistent with autosomal dominant inheritance, albeit with reduced penetrance and, for breast cancer, sex limitation. In each case, these families have been used for a genome search using standard lod score analysis. There are two justifications for this strategy. First, the disease may be heterogeneous and include one or more mendelian conditions phenotypically indistinguishable from the nonmendelian majority. Second, the near-mendelian families may represent cases where, by chance, many determinants of the disease are already present in most people, so that the balance is tipped by the mendelian segregation of just one of the normal susceptibility factors. In the first case, identifying the mendelian subset does not necessarily cast any light on the causes of the nonmendelian disease. In the second case, the loci mapped are also susceptibility factors for the common nonmendelian disease.

The breast cancer work led to the identification of the BRCA1 and BRCA2 genes, as described in Chapter 19, whereas the first such attempt in schizophrenia produced a lod score of 6 that is now generally agreed to have been spurious. What was the difference? With hindsight, it is clear that whereas a subset of breast cancer patients really do have a mendelian form of the disease, the apparently mendelian schizophrenia families must have been chance aggregations of affected people within one family. Of itself, this should have simply produced negative lod scores across the whole genome in the schizophrenia families. The other problem (apart from bad luck) was multiple testing. Because the diagnostic criteria for schizophrenia are arbitrary, the researchers tried a number of different criteria, and checked which one gave the highest lod score. This is a perfectly valid procedure - any number of variables can be estimated from a given dataset - but each variable adds more degrees of freedom, and the raw p value needs correcting accordingly.

12.1.3. Using affected pedigree members only avoids the need to specify the penetrance

One solution to the problem of having to specify the penetrance in parametric linkage analysis is to use a parametric method but analyze only the affected family members. The penetrance is irrelevant for affected people, and unaffected members are scored as having an unknown disease phenotype. If the penetrance is low, unaffected people provide relatively little information. We can infer the genotype of affected people (they must have the susceptibility allele), but not of unaffected people, so not much is lost by ignoring unaffected family members. This strategy is useful for testing candidate susceptibility loci for oligogenic diseases. It is often sensible to check for linkage before starting to screen a candidate gene for mutations. Since a parametric analysis is used, it is still necessary to specify a genetic model, and so there is still the danger of getting meaningless results if the model is wrong. The risk of false positives is reduced if the analysis is restricted to checking a few candidate loci. It helps if the disease is rare but distinctive, so that the risk of heterogeneity and of phenocopies is minimized.

12.2. Nonparametric linkage analysis does not require a genetic model

If the need to specify a complete genetic model is too daunting, one can use model-free or nonparametric methods of linkage analysis. These methods ignore unaffected people, and look for alleles or chromosomal segments that are shared by affected individuals. Shared segment methods can be used within nuclear families (sib pair analysis, see below), within known extended families, or in whole populations. At the population level they constitute association studies, which are considered in the following section.

12.2.1. Identity by state is not the same as identity by descent

Figure 12.1. Identity by state (IBS) and identity by descent (IBD)

Both sib pairs share allele A1. The first sib pair have two independent copies of A1 (IBS but not IBD); the second sib pair share copies of the same paternal A1 allele (IBD). The difference is only apparent if the parental genotypes are known.

It is important to distinguish segments identical by descent (IBD) from those identical by state (IBS). IBS alleles look the same, and may have the same DNA sequence, but they are not derived from a known common ancestor. Alleles IBD are demonstrably copies of the same ancestral (usually parental) allele. If two sibs each have allele A1 (Figure 12.1), the shared allele is IBS, but it may or may not be IBD. For very rare alleles, two independent origins are unlikely, so IBS generally implies IBD, but this is not true for common alleles. Multiallele microsatellites are more efficient than two-allele markers for defining IBD, and multilocus multiallele haplotypes are better still, because any one haplotype is likely to be rare. Shared segment analysis can be conducted using either IBS or IBD data, provided the appropriate analysis is used. IBD is the more powerful, but requires parental samples.

12.2.2. Affected sib pairs allow model-free analysis in nuclear families

Figure 12.2. Sib pair analysis

(A) By random segregation sib pairs share 0, 1 or 2 parental haplotypes ¼, ½ and ¼ of the time, respectively. (B) Pairs of sibs who are both affected by a dominant condition share one or two parental haplotypes for the relevant chromosomal segment. (C) Pairs of sibs who are both affected by a recessive condition share both parental haplotypes for the relevant chromosomal segment.

Picking a chromosomal segment at random, pairs of sibs are expected to share 0, 1 or 2 parental haplotypes with frequency 1/4, 1/2 and 1/4, respectively. However, if both sibs are affected by a genetic disease, then they are likely to share whichever segment of chromosome carries the disease locus. On the simplest assumption, that everybody with the disease carries a mutant allele at this locus, then if the disease is dominant, they will share at least one parental haplotype, and if the disease is recessive they will share both haplotypes. This allows a simple form of linkage analysis (Figure 12.2). Affected sib pairs (ASP) are typed for markers, and chromosomal regions sought where the sharing is above the random 1 : 2 : 1 ratios of sharing 2, 1 or 0 haplotypes identical by descent. If the sib pairs are tested only for identity by state, the expected sharing on the null hypothesis is a function of the gene frequencies. Multipoint analysis is preferable to single-point analysis because it more efficiently extracts the information about IBD sharing across the chromosomal region. The mapmaker/sibs program of Kruglyak and Lander (1995) is widely used to analyze multipoint ASP data and produce nonparametric lod scores.
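The comparison of observed sharing with the random 1 : 2 : 1 expectation can be sketched as a simple chi-square goodness-of-fit test at a single marker. This is only an illustration of the idea, not the multipoint method implemented in mapmaker/sibs; the function name and the sib pair counts are hypothetical.

```python
def asp_chi_square(n0, n1, n2):
    """Chi-square statistic (2 degrees of freedom) comparing the observed
    numbers of affected sib pairs sharing 0, 1 or 2 parental haplotypes IBD
    with the 1/4 : 1/2 : 1/4 expectation under no linkage."""
    total = n0 + n1 + n2
    if total == 0:
        raise ValueError("no sib pairs supplied")
    expected = (total * 0.25, total * 0.5, total * 0.25)
    observed = (n0, n1, n2)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: 100 affected sib pairs typed at one marker.
stat = asp_chi_square(n0=15, n1=50, n2=35)
print(f"chi-square = {stat:.2f}")  # compare with 5.99 (p = 0.05, 2 df)
```

With these invented counts the excess of pairs sharing two haplotypes gives a statistic of 8.0, above the conventional single-test threshold; the genome-wide thresholds discussed in Section 12.5 are more stringent.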

Because sib pair analysis is model-free, it can be performed without making any assumptions about the genetics of the disease. Thus it has been used as one of the main tools for seeking genes conferring susceptibility to common nonmendelian diseases like diabetes or schizophrenia. One drawback is that candidate regions defined by sib pair analysis are usually uncomfortably large for positional cloning. Sib pair analysis has no process analogous to the end-game of mendelian mapping, where closer and closer markers are tested until there are no more recombinants. It is not likely that a chromosomal segment can be defined that is shared by all affected sib pairs. If a susceptibility factor is neither necessary nor sufficient for disease, then not all affected sib pairs will share the chromosomal segment that contains the susceptibility locus. Moreover, sib pairs share many segments by chance, including, perhaps, segments that coincidentally lie close to a susceptibility locus. The mathematics of ASP analysis have been detailed by Sham and Zhao (1998), and examples of some systematic applications of ASP analysis to complex diseases are given in Section 19.5.

12.2.3. Nonparametric affected pedigree member analysis generalizes affected sib pair analysis

The affected pedigree member (APM) method of Weeks and Lange (1992) extends the logic of affected sib pair analysis to other relationships. In a complex pedigree with several affected people, for each pair of affected pedigree members the distribution of alleles identical by state is observed, and compared to the expectation on the null hypothesis of no linkage. APM allows multipoint data to be analysed in large pedigrees; however, because it uses IBS and not IBD data, it does not necessarily use all the linkage information that could in theory be extracted from a pedigree.

12.2.4. The genehunter program allows nonparametric lod scores to be calculated - but they must be interpreted with care

A more radical approach to nonparametric analysis of complex pedigrees is implemented in the genehunter program of Kruglyak et al. (1996). This is based on a generalization of the mapmaker/sibs program for analysis of multipoint ASP data mentioned above. The basic algorithm in these programs is able to handle any number of loci (the computing time increases linearly with the number of loci), but is limited to fairly small pedigrees. Pedigrees contain founders (people whose parents are not included in the pedigree) and nonfounders (people whose parents are included). If somebody has a sib in the pedigree, then they must be nonfounders, because the only way to tell the computer that they are sibs is to include the parents. If a pedigree contains f founders and n nonfounders, the genehunter computing time increases exponentially with (2n - f). Current versions fail to cope with pedigrees where 2n - f > 16.
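The size limit can be checked before an analysis is attempted. The sketch below simply evaluates 2n - f for a pedigree and compares it with the limit of 16 quoted above; the function names and the example pedigrees are hypothetical and are not part of genehunter.

```python
def pedigree_complexity(founders, nonfounders):
    """Return 2n - f for a pedigree with f founders and n nonfounders."""
    return 2 * nonfounders - founders


def within_size_limit(founders, nonfounders, limit=16):
    """True if 2n - f does not exceed the limit (16 is the figure in the text)."""
    return pedigree_complexity(founders, nonfounders) <= limit


# Hypothetical pedigrees.
nuclear = pedigree_complexity(founders=2, nonfounders=4)    # two parents, four children
extended = pedigree_complexity(founders=4, nonfounders=12)  # a larger pedigree
print(nuclear, within_size_limit(2, 4))
print(extended, within_size_limit(4, 12))
```

Because computing time grows exponentially with 2n - f, each additional nonfounder roughly quadruples the work, which is why the limit is reached so quickly.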

Provided a pedigree falls within the size limit, genehunter can include any number of loci in a multipoint analysis. It is in fact able to compute parametric lod scores, if a concrete genetic model is provided. For complex characters where no model can be provided, the result is expressed as a nonparametric lod (NPL) score. These are based on calculating the extent to which affected relatives share alleles identical by descent, and comparing the result across all affected pedigree members with the null hypothesis of simple mendelian segregation (markers will segregate according to mendelian ratios unless the segregation is distorted by linkage or association). This method appears to extract the linkage information from a pedigree more efficiently than the APM method. However, the threshold of significance for an NPL score is not as obvious as with the parametric lod score that would be calculated for a single pair of mendelian characters. The significance is best expressed as a genome-wide p value, as discussed in Section 12.5.2.

12.3. Association is in principle quite distinct from linkage, but where the family and the population merge, linkage and association merge

12.3.1. Linkage is a relation between loci, but association is a relation between alleles

In principle, linkage and association are totally different phenomena. Association is simply a statistical statement about the co-occurrence of alleles or phenotypes. Allele A is associated with disease D if people who have D also have A more (or maybe less) often than would be predicted from the individual frequencies of D and A in the population. For example, HLA-DR4 is found in 36% of the general UK population but 78% of people with rheumatoid arthritis. An association can have many possible causes, not all genetic (see below). Linkage, on the other hand, is a specific genetic relationship between loci (not alleles or phenotypes). Linkage does not of itself produce any association in the general population. The STR45 locus is linked to the dystrophin locus. Within a family where a dystrophin mutation is segregating, we would expect affected people to have the same allele of STR45, but over the whole population the distribution of STR45 alleles is just the same in people with and without muscular dystrophy. Thus linkage creates associations within families, but not among unrelated people. However, if two supposedly unrelated people with disease D have actually inherited it from a distant common ancestor, they may well also tend to share particular ancestral alleles at loci closely linked to D. Where the family and the population merge, linkage and association merge.
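The strength of an association such as the HLA-DR4 example is often summarized as an odds ratio. The sketch below applies that measure to the two frequencies quoted above; using the odds ratio here is an assumption made for illustration (the text itself only gives the two percentages), and the function name is hypothetical.

```python
def odds_ratio(freq_in_patients, freq_in_population):
    """Odds of carrying the allele among patients divided by the odds
    of carrying it in the comparison population."""
    odds_patients = freq_in_patients / (1 - freq_in_patients)
    odds_population = freq_in_population / (1 - freq_in_population)
    return odds_patients / odds_population


# Frequencies quoted in the text: 78% of rheumatoid arthritis patients
# and 36% of the general UK population carry HLA-DR4.
dr4_or = odds_ratio(0.78, 0.36)
print(f"odds ratio = {dr4_or:.1f}")
```

An odds ratio of 1 would mean no association; a value well above 1, as here, means the allele is over-represented among patients, without saying anything about why.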

12.3.2. Population associations depend on population history

Figure 12.3. Merging into the gene pool

A fully-outbred person has 2^n ancestors n generations ago. If the UK population were fully outbred, two ‘unrelated’ present-day people would have shared ancestors in 1500, if not more recently. Reprinted from Read (1989) Medical Genetics: An Illustrated Outline, by permission of Mosby.

Figure 12.4. Family linkage and population association studies are two ends of a continuum

All humans are related, if we go back far enough. A simplified calculation suggests that in the UK two ‘unrelated’ people would typically share common ancestors not more than 22 generations ago. If fully outbred, they would have 2^22 ≈ 4 million ancestors each at that time. Twenty-two generations is about 500 years, and in 1500 the population of Britain was around 4 million (Figure 12.3). Therefore, if the UK population interbred freely, not more than 44 meioses would separate our two unrelated people. Suppose the two ‘unrelated’ people each inherit a disease susceptibility allele from their common ancestor. During the many generations and many meioses that separate them from their common ancestor, repeated recombination will have reduced the shared chromosomal segment to a very small region (Figure 12.4). Only alleles at loci tightly linked to the disease susceptibility locus will still be shared.

For a locus showing recombination fraction θ with the susceptibility locus, a proportion θ of ancestral chromosomes will lose the association each generation, and a proportion (1-θ) will retain it. After n generations, a fraction (1-θ)^n of chromosomes will retain the association. Considering the 44 meioses that may separate our two patients, loci showing 1% recombination per meiosis would have a better than 50% chance of remaining in the same combination, since (0.99)^44 = 0.64. Loci 3 cM apart would have (0.97)^44 = 0.26 chance of remaining together. This argument is grossly simplified because it ignores population substructure and assumes the entire British population has been one freely interbreeding unit over the past 500 years. However, for what it is worth, it suggests that allelic associations reflecting sharing of ancestral chromosomes in the British population might begin to be noticeable for loci within 2–3 cM of each other.
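The arithmetic of this simplified argument is easy to reproduce. The sketch below evaluates the number of ancestors 22 generations back and the fraction of chromosomes retaining an ancestral association after 44 meioses; the function name is hypothetical, and the model is the same grossly simplified one used in the text.

```python
def fraction_retaining(theta, meioses):
    """Fraction of chromosomes still carrying the ancestral combination
    after the given number of meioses, for recombination fraction theta."""
    return (1 - theta) ** meioses


ancestors = 2 ** 22  # ancestors of a fully outbred person 22 generations ago
print(f"ancestors 22 generations ago: {ancestors:,}")

for theta in (0.01, 0.03):
    kept = fraction_retaining(theta, meioses=44)
    print(f"theta = {theta:.2f}: fraction retained after 44 meioses = {kept:.2f}")
```

Changing the number of meioses shows how quickly the range of detectable association shrinks as the common ancestor becomes more remote.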

Although this calculation is crude, it does show how the extent of allelic association depends on the history of the population concerned. There has been considerable debate about the differences to be expected when comparing a population that has expanded rapidly from a relatively recent bottleneck with one that has maintained or only gradually increased its size over many generations. The Finns would typify the former and the British the latter. Diseases, being subject to selection, are likely to be more mutationally homogeneous in a rapidly expanded population; the mutant alleles found will reflect those present in the founder population. For selectively neutral markers the position is less clear. In a study of 20 microsatellites from chromosome 18q21 in 664 British and 430 Finnish subjects, Eaves et al. (1998) observed significant disequilibrium for all 53 pairs of loci less than 1 cM apart in both populations, for 20/75 (UK) and 61/75 (Finland) pairs 1–3 cM apart, and for 0/62 (UK) and 20/66 (Finland) pairs more than 3 cM apart. In other words, disequilibrium is present in both populations, but extends over a longer range in Finns.

Searching for population associations is an attractive option for identifying disease susceptibility genes. Association studies are easier to conduct than linkage analysis, because no multicase families or special family structures are needed. Also, because linkage disequilibrium is a short-range phenomenon, if an association is found, it defines a small candidate region in which to search for the susceptibility gene. Finally, recent work suggests that association is more powerful than linkage for detecting weak susceptibility alleles (Section 12.5.4). However, there are several pitfalls to be avoided if a claimed association is to provide a reliable pointer to a nearby susceptibility locus.

12.3.3. Not all population associations are caused by linkage disequilibrium

Linkage disequilibrium is not the only possible reason for an association between a disease D and allele A. Possible causes include the following:

  • Direct causation - having allele A makes you susceptible to disease D. Possession of A is neither necessary nor sufficient for somebody to develop D, but it increases the likelihood. In this case one would expect to see the same allele A associated with the disease in any population studied (unless the causes of the disease vary from one population to another).

  • Natural selection - people who have disease D might be more likely to survive and have children if they also have allele A.

  • Population stratification - the population contains several genetically distinct subsets. Both the disease and allele A happen to be particularly frequent in one subset. Lander and Schork (1994) give the example of the association in the San Francisco Bay area between HLA-A1 and ability to eat with chopsticks. HLA-A1 is more frequent among Chinese than among Caucasians.

  • Statistical artefact - association studies often test a range of loci, each with several alleles, for association with a disease. The raw p values need correcting for the number of questions asked (Section 12.5.1). In the past, researchers often applied inadequate corrections, and associations were reported that could not be replicated in subsequent studies.

  • Linkage disequilibrium - close linkage can produce allelic association at the population level, provided that most disease-bearing chromosomes in the population are descended from one or a few ancestral chromosomes. If linkage disequilibrium is the cause of the association, there should be a gene near to the A locus that has mutations in people with disease D. The particular allele at the A locus that is associated with disease D may be different in different populations.

Direct causation and selective advantage are unlikely if the associated allele is a variant in the noncoding DNA and not closely associated with any gene, but studies in several ethnically distinct populations are useful to help distinguish these causes of association from linkage disequilibrium. Statistical artefacts are reduced by proper correction of probabilities (see Section 12.5.1).

The choice of the control group in association studies is crucial. Many studies in the past have used published gene frequencies, often without adequate certainty that these frequencies were representative of the population from whom the patients were recruited. Alternatively, students or staff from the investigator's university may be used as a control series. Again, this is undesirable because they may well not be typical of the population from which the patients were drawn. Thus, when an association is found, it may be impossible to know whether it is caused by linkage disequilibrium with a susceptibility locus or by inadequately matched controls.

12.3.4. The transmission disequilibrium test (TDT) overcomes many of the problems of classical disease-marker association studies

Recently, a clutch of methods have been developed that largely circumvent the stratification problem. Collectively they can be called association studies with internal controls. They involve 50% more work than standard case-control studies because three people (proband and parents) are typed in each family. This seems a small price to pay for the gain in reliability. Parents must be available, which restricts the usefulness of these tests for late-onset diseases. One method, the haplotype relative risk (HRR) method, handles the data like typical case-control data, except that the control is not a real person but is made out of the two alleles that the parents did not transmit to their affected offspring.

The most popular method is the transmission disequilibrium test (TDT; Schaid, 1998). The TDT starts with couples who have one or more affected offspring. It is irrelevant whether either parent is affected or not. To test whether marker allele M1 is associated with the disease, we select those parents who are heterozygous for M1. The test simply compares the number of such parents who transmit M1 to their affected offspring with the number who transmit their other allele (Box 12.1). The result is unaffected by population stratification. The TDT can be used when only one parent is available, but this may bias the result (Schaid, 1998). There has been some argument about whether the TDT is a test of linkage or association. Since it asks questions about alleles and not loci, it is fundamentally a test of association. The associated allele may itself be a susceptibility factor, or it may be in linkage disequilibrium with a susceptibility allele at a nearby locus. The TDT cannot detect linkage if there is no disequilibrium - a point to remember when considering schemes to use the TDT for whole-genome scans.
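The comparison at the heart of the TDT can be sketched in a few lines. The example assumes the usual McNemar-type statistic, (a - b)^2 / (a + b) with 1 degree of freedom, where a and b count heterozygous parents who did and did not transmit M1 to an affected child; the function name and the counts are hypothetical.

```python
def tdt_statistic(transmitted, not_transmitted):
    """McNemar-type chi-square (1 df) for heterozygous parents who
    transmit, or fail to transmit, allele M1 to an affected offspring."""
    informative = transmitted + not_transmitted
    if informative == 0:
        raise ValueError("no heterozygous parents to score")
    return (transmitted - not_transmitted) ** 2 / informative


# Hypothetical data: 100 parents heterozygous for M1;
# 60 transmitted M1 and 40 transmitted their other allele.
tdt = tdt_statistic(transmitted=60, not_transmitted=40)
print(f"TDT chi-square = {tdt:.2f}")  # compare with 3.84 (p = 0.05, 1 df)
```

Only heterozygous parents enter the calculation, which is why the result does not depend on allele frequencies in the population and is therefore unaffected by stratification.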

12.4. Linkage disequilibrium as a mapping tool

12.4.1. Cystic fibrosis and Nijmegen breakage syndrome illustrate the use of linkage disequilibrium to narrow down a candidate region for positional cloning of a mendelian disease locus

Table 12.1. Allelic association in cystic fibrosis

Marker alleles    CF chromosomes    Normal chromosomes
X1, K1                   3                  49
X1, K2                 147                  19
X2, K1                   8                  70
X2, K2                   8                  25

Data from typing for the RFLP markers XV2.c (alleles X1 and X2) and KM19 (alleles K1 and K2) in 114 British families with a cystic fibrosis (CF) child. Chromosomes carrying the CF disease mutation tend also to carry allele X1 of XV2.c and allele K2 of KM19. Data derived from Ivinson et al. 1989.

Association studies are not restricted to nonmendelian conditions. We have already seen how an association was used to map autosomal recessive familial benign intrahepatic cholestasis (Section 11.5.5). More commonly a population association has been used to narrow down a candidate region that was initially defined by standard parametric linkage analysis. Cystic fibrosis is a disease of northern Europeans, where the sort of large multiply inbred family used for autozygosity mapping (Section 11.5.5) is excessively rare. Thus mapping CF depended on rare unfortunate nuclear families with more than one affected child. Using these, CF was mapped to 7q31.2, but after all available recombinants had been used, the candidate region was still dauntingly large. The initial markers (MET and D7S8) showed little or no linkage disequilibrium, but a new set of markers from within the candidate region, XV2.c and KM19, showed strong association between the X1, K2 (XV2.c*1, KM19*2) haplotype and CF. Typical data are shown in Table 12.1. As more markers were isolated, the gradient of linkage disequilibrium helped indicate the location of the CF gene.

Figure 12.5. An ancestral haplotype in European patients with Nijmegen breakage syndrome

Sixteen markers from 8q21, shown in chromosomal order across the top of the table, were used to generate 74 NBS-associated haplotypes in unrelated patients. Alleles attributed to an ancestral haplotype are marked A. Other alleles are shaded. Where there are no data, cells are left blank. Non-ancestral alleles are marked with the number of nucleotides by which they differ from the ancestral allele. Differences of two nucleotides are often the result of a mutation of the marker, but larger differences are likely to be the result of ancestral recombinations. All 74 haplotypes have the ancestral alleles at markers 11 and 12, which therefore indicated the likely location of the NBS gene. After Varon et al., 1998.

A more recently cloned gene, governing Nijmegen breakage syndrome, shows how ancestral haplotypes can be identified and used to define the exact position of the disease gene. Nijmegen breakage syndrome (NBS; MIM 251260) is a very rare autosomal recessive disease characterized by chromosome breakage, growth retardation, microcephaly, immunodeficiency and a predisposition to cancer. The suspected cause is a defect in DNA repair. Conventional linkage analysis in small nuclear families localized the NBS locus to an 8-Mb target region between markers D8S271 and D8S270 on chromosome 8q21. There were no recombinants within the candidate region. Fifty-one apparently unrelated patients were typed for a series of microsatellite markers spaced across the candidate region. Among the haplotypes of the patients, 74 are apparently related to a common ancestral haplotype (Figure 12.5). It is particularly associated with Slav ancestry. Some patients do not have this haplotype at all, and presumably carry independent NBS mutations. Others share only part of the haplotype, showing the effect of recombination in distant ancestors. These apparently recombinant haplotypes still share the region between polymorphic markers H4CA and H5CA (markers 11 and 12 in Figure 12.5), thus defining the likely location of the NBS gene. Subsequently a gene encoding a novel protein was cloned from this location and shown to carry mutations in NBS patients. As predicted, patients with the common haplotype all have the same mutation (Varon et al., 1998).

12.4.2. Linkage disequilibrium can be quantified, but the gradients of disequilibrium around a disease gene can be hard to understand

For positional cloning of a disease where a large number of patients are available, quantitative measures of linkage disequilibrium can be calculated for a series of markers across the target region. Hopefully the disease gene will be located at the peak of disequilibrium. The simplest measures of disequilibrium are affected by the gene frequencies. A better measure is the Yule coefficient (Krawczak and Schmidtke, 1998). For two loci A and B with alleles A1, A2, B1 and B2, this is

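\[
Y = \frac{p_{1,1}\,(1 - p_{1,2}) - p_{1,2}\,(1 - p_{1,1})}{p_{1,1}\,(1 - p_{1,2}) + p_{1,2}\,(1 - p_{1,1})}
\]

where p_{1,1} and p_{1,2} are the frequency of allele A1 on chromosomes carrying alleles B1 and B2, respectively.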
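As a rough illustration, the coefficient can be computed from the two conditional frequencies just defined. The sketch assumes the standard form of Yule's coefficient, (ad - bc)/(ad + bc) for a 2 x 2 table, rewritten in terms of those frequencies; the function name and the example frequencies are hypothetical.

```python
def yule_coefficient(p11, p12):
    """Yule coefficient from the frequency of allele A1 on chromosomes
    carrying B1 (p11) and on chromosomes carrying B2 (p12).
    Assumes the standard (ad - bc) / (ad + bc) form."""
    concordant = p11 * (1 - p12)
    discordant = p12 * (1 - p11)
    if concordant + discordant == 0:
        raise ValueError("coefficient undefined for these frequencies")
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical frequencies: A1 is present on 90% of B1 chromosomes
# but only 30% of B2 chromosomes.
y = yule_coefficient(0.9, 0.3)
print(f"Yule coefficient = {y:.2f}")
```

The value is 0 when A1 is equally frequent on both classes of chromosome and approaches +1 or -1 as the association becomes complete, which makes it less sensitive to the allele frequencies themselves than a raw difference in frequency.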

Figure 12.6. Allelic association around the locus for Huntington disease

S10, S125, etc. are shorthand for the DNA markers D4S10, D4S125, etc., shown in their map positions relative to the HD locus. The total distance represented is 2500 kb. For some loci, several different RFLPs exist, which sometimes show very different allelic association, for example marker S95 (see text). Linkage disequilibrium is measured by the Yule coefficient. From Krawczak and Schmidtke (1998) DNA Fingerprinting, 2nd edn, BIOS Scientific Publishers.

This approach was used with Huntington disease. The result was helpful, but not simple to interpret. In Figure 12.6 we can see cases of a strong association with a more distant marker and a weak association with a closer marker. Even more curious, the marker D4S95, closely linked to the HD locus, detects RFLPs with three enzymes, TaqI, MboI and AccI. Results confirmed in several independent studies show a strong association with a particular AccI and a particular MboI allele, but no association with either TaqI allele. Probably the associations reflect a complex history, with a combination of several independent mutations, recombination in one of a small founder population of disease chromosomes, and maybe an origin of some marker polymorphisms more recently than some disease mutations. Xiong and Guo (1997) give a nice overview of the problems and the literature of linkage disequilibrium mapping, before proposing a sophisticated method of analysis based on maximum likelihood estimation of multipoint data. This is one of several approaches that appear able to predict gene locations much better than the simple analysis in Figure 12.6.

12.5. Thresholds of significance are an important consideration in analysis of complex diseases

Whereas most mendelian loci localized by significant lod scores have been successfully cloned, the history of complex disease analysis has been marked by a succession of false dawns and irreproducible results. Innumerable HLA-disease associations have been reported, but few proved reproducible. Positive lod scores in families with schizophrenia turned into a serious embarrassment (reviewed by Byerley, 1989). More recently, there are a number of complex diseases where several different groups have undertaken independent large-scale sib pair analyses. The candidate regions defined in the different studies have seldom coincided. Risch and Botstein (1996) outline a typical history, that of manic-depressive psychosis, and similar results with multiple sclerosis are discussed in Section 19.5.5. Whatever the exact cause of these problems in the various cases, a clear common thread is the difficulty of deciding when to call the results of a linkage or association study significant.

12.5.1. Probabilities calculated from association studies must be corrected for the number of questions asked

Table 12.2. p values from a hypothetical association study

       D1     D2     D3     D4     D5     D6     D7     D8     D9     D10
M1     0.29   0.47   0.80   0.47   0.36   0.13   0.93   0.15   0.08   0.08
M2     0.21   0.26   0.38   0.55   0.96   0.61   0.46   0.28   0.10   0.40
M3     0.36   0.87   0.61   0.76   0.80   0.51   0.44   0.11   0.76   0.99
M4     0.12   0.77   0.20   0.68   0.88   0.47   0.39   0.05   0.50   0.53
M5     0.09   0.56   0.01   0.93   0.24   0.81   0.18   0.28   0.04   0.18
M6     0.61   0.83   0.27   0.95   0.66   0.03   0.24   0.05   0.03   0.87
M7     0.63   0.64   0.12   0.33   0.76   0.09   0.54   0.77   0.42   0.09
M8     0.24   0.12   0.06   0.65   0.98   0.52   0.91   0.63   0.68   0.23
M9     0.36   0.03   0.15   0.62   0.68   0.88   0.15   0.96   0.94   0.55
M10    0.27   0.94   0.31   0.32   0.54   0.06   0.20   0.63   0.53   0.38

Panels of patients with diseases (D1-D10) and a panel of controls were typed for markers (M1-M10). For each possible association the p value is tabulated. In reality none of the diseases is associated with any of the markers, but five of the 100 p values are significant at the 5% level, including one at the 1% level. This is of course exactly what is expected of a series of 100 random numbers. If n questions are asked, the appropriate threshold of significance is p = 0.05/n (Bonferroni correction).

A mendelian condition must map somewhere and so, in linkage analysis, no matter how many markers are used in finding the location, the risk of a false positive result remains manageably low (Section 11.3.4). This is not the case for association studies. There may well be no association to find, and so each test performed carries an independent risk of a false positive result. To avoid errors, a correction has to be applied. The threshold of significance is set, not at the conventional p = 0.05, but at p = 0.05/n, where n is the number of independent potential associations checked (Table 12.2). This is called the Bonferroni correction. All too few published disease association studies apply the rigorous correction factor, n(m - 1) for the testing of n loci with m alleles each, and all too often associations reported in one study cannot be confirmed in a second independent sample of patients.
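The effect of the correction can be checked directly on the values in Table 12.2. The sketch below counts how many of the 100 p values fall below the conventional threshold and how many survive the Bonferroni threshold of 0.05/n. The function names are illustrative, and treating "significant" as strictly less than the threshold is an assumption.

```python
# p values transcribed from Table 12.2 (rows M1-M10, columns D1-D10).
P_VALUES = [
    [0.29, 0.47, 0.80, 0.47, 0.36, 0.13, 0.93, 0.15, 0.08, 0.08],
    [0.21, 0.26, 0.38, 0.55, 0.96, 0.61, 0.46, 0.28, 0.10, 0.40],
    [0.36, 0.87, 0.61, 0.76, 0.80, 0.51, 0.44, 0.11, 0.76, 0.99],
    [0.12, 0.77, 0.20, 0.68, 0.88, 0.47, 0.39, 0.05, 0.50, 0.53],
    [0.09, 0.56, 0.01, 0.93, 0.24, 0.81, 0.18, 0.28, 0.04, 0.18],
    [0.61, 0.83, 0.27, 0.95, 0.66, 0.03, 0.24, 0.05, 0.03, 0.87],
    [0.63, 0.64, 0.12, 0.33, 0.76, 0.09, 0.54, 0.77, 0.42, 0.09],
    [0.24, 0.12, 0.06, 0.65, 0.98, 0.52, 0.91, 0.63, 0.68, 0.23],
    [0.36, 0.03, 0.15, 0.62, 0.68, 0.88, 0.15, 0.96, 0.94, 0.55],
    [0.27, 0.94, 0.31, 0.32, 0.54, 0.06, 0.20, 0.63, 0.53, 0.38],
]


def count_below(p_values, threshold):
    """Number of tests whose p value is strictly below the threshold."""
    return sum(p < threshold for row in p_values for p in row)


def bonferroni_threshold_for_loci(n_loci, m_alleles, alpha=0.05):
    """Threshold using the n(m - 1) correction for n loci with m alleles each."""
    return alpha / (n_loci * (m_alleles - 1))


n_tests = sum(len(row) for row in P_VALUES)         # 100 questions asked
nominal_hits = count_below(P_VALUES, 0.05)          # uncorrected "positives"
bonferroni_threshold = 0.05 / n_tests               # 0.05 / 100
corrected_hits = count_below(P_VALUES, bonferroni_threshold)
```

On these data the uncorrected threshold yields five apparent associations, and none of them survives the corrected threshold, which is the desired outcome given that no true association exists.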

12.5.2. Genome-wide significance levels for analysis of complex diseases are controversial

The problems of deciding appropriate thresholds of significance are partly technical and partly philosophical. We have already noted the distinction between pointwise (or nominal) and genome-wide significance (Section 11.3.4):

  • The pointwise p value of a linkage statistic is the probability of exceeding the observed value at a specified position in the genome, assuming the null hypothesis of no linkage.

  • The genome-wide p value is the probability that the observed value will be exceeded anywhere in the genome, assuming the null hypothesis of no linkage.

For a whole-genome study, the appropriate significance threshold is a value where the probability of finding a false positive anywhere in the genome is 0.05. This will be more stringent than the pointwise threshold for a single test. But suppose an association study finds a significant result (pointwise p < 0.05) with the very first marker tested. Had the result been negative, the researchers would no doubt have gone on to test marker after marker until either they found something or else they had got negative results across the whole genome. Should they apply the genome-wide threshold, even though they did only one test?

According to Lander and Kruglyak (1995), the genome-wide false-positive rate, αT*, is related to the pointwise false-positive rate, αT, by the equation

$$\alpha_T^{*} \approx [C + 9.2\,\rho\,G\,T]\,\alpha_T$$

T is the threshold lod score; C = 23, the number of chromosomes, and G = 33, the total length of the genome in Morgans. The parameter ρ measures the crossover rate, and takes different values depending on the relationship being studied, so that the formula cannot be simply applied to complex pedigrees. For affected sib pairs the formula suggests genome-wide lod score thresholds of 3.6 for IBD testing and 4.0 for IBS testing. Note that the formula applies strictly only to large samples and to stringent thresholds.
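A rough numerical check of the sib pair threshold, as a sketch under stated assumptions: the relationship is taken to be αT* ≈ [C + 9.2ρGT]αT, the crossover-rate parameter is taken as ρ = 2 for sib pairs, and the pointwise p value is obtained from the lod score by the usual large-sample, one-sided approximation in which 2 ln(10) × lod behaves as the square of a standard normal deviate. None of these details is spelled out in the text above, and the function names are illustrative.

```python
import math


def pointwise_p_from_lod(lod):
    """One-sided large-sample pointwise p value for a lod score (assumed form)."""
    z = math.sqrt(2.0 * math.log(10.0) * lod)
    # Upper tail of the standard normal distribution at z.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def genome_wide_rate(lod, rho, n_chromosomes=23, genome_morgans=33.0):
    """Approximate genome-wide false-positive rate at a lod threshold."""
    alpha_t = pointwise_p_from_lod(lod)
    return (n_chromosomes + 9.2 * rho * genome_morgans * lod) * alpha_t


# Assumed rho = 2 for affected sib pairs scored by IBD.
rate_at_3_6 = genome_wide_rate(3.6, rho=2.0)
```

Under these assumptions a threshold lod score of 3.6 gives a genome-wide rate close to 0.05, in line with the figure quoted for IBD testing.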

Because the associations underlying TDT tests operate over much shorter chromosomal distances than the linkage underlying ASP testing, and because TDT, as an association test, must be performed separately for every allele of each locus, the total number of tests needed for a genome-wide scan by TDT is huge. Risch and Merikangas (1996) considered the ultimate case of testing five diallelic polymorphisms at each of 100 000 gene loci by TDT. Applying a full Bonferroni correction for 1 million independent tests means the threshold significance for a positive result is p = 5 × 10⁻⁸.
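The arithmetic behind this threshold can be written out as follows, assuming one test per allele of each diallelic polymorphism:

```latex
\[
n = 100\,000 \times 5 \times 2 = 10^{6}, \qquad
p_{\mathrm{threshold}} = \frac{0.05}{n} = 5 \times 10^{-8}
\]
```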

Most complex disease studies avoid these theoretical approaches by basing the significance threshold on simulation. Typically, 1000 replicates of the family collection are generated by computer with random genotypes, but based on correct allele frequencies, recombination fractions, etc. A whole-genome search is conducted in each simulated dataset and the maximum lod score noted. The genome-wide threshold of significance is taken as a score that is exceeded in less than 5% of replicates.
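The logic of a simulation-based threshold can be illustrated with a deliberately simplified sketch. It does not simulate families or genotypes. Instead it assumes 300 independent markers, each contributing a standard normal test statistic under the null hypothesis, and converts positive statistics to lod units by dividing the squared statistic by 2 ln(10). Real studies simulate the actual pedigrees, allele frequencies and marker map, so the number produced here is not a usable threshold; all names and parameter values are illustrative.

```python
import math
import random


def simulate_max_lod(n_markers, rng):
    """Maximum lod-like score across one simulated null genome scan."""
    best = 0.0
    for _ in range(n_markers):
        z = rng.gauss(0.0, 1.0)  # null test statistic at one marker
        if z > 0.0:              # only excess sharing counts as evidence
            best = max(best, z * z / (2.0 * math.log(10.0)))
    return best


def empirical_threshold(n_replicates=1000, n_markers=300, alpha=0.05, seed=1):
    """Score exceeded by the scan maximum in fewer than alpha of replicates."""
    rng = random.Random(seed)
    maxima = sorted(simulate_max_lod(n_markers, rng) for _ in range(n_replicates))
    # Leave int(alpha * n) - 1 replicates above the chosen score.
    index = min(n_replicates - int(alpha * n_replicates), n_replicates - 1)
    return maxima[index]


threshold = empirical_threshold()
```

The same procedure applies unchanged when `simulate_max_lod` is replaced by a full genome scan of a simulated family collection; only the cost of each replicate changes.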

12.5.3. Criteria for suggestive and significant linkage have been suggested for complex diseases

In response to the frequent failure to replicate claimed localizations of disease susceptibility genes, Lander and Kruglyak (1995) proposed a series of thresholds:

  • Suggestive linkage is a lod score or p value that would be expected to occur once by chance in a whole genome scan.

  • Significant linkage is a lod score or p value that would be expected to occur by chance 0.05 times in a whole genome scan (i.e. the conventional p = 0.05 threshold of significance).

  • Highly significant linkage is a lod score or p value that would be expected to occur by chance 0.001 times in a whole genome scan.

  • Confirmed linkage - linkage is to be regarded as confirmed when a significant linkage observed in one study is confirmed by finding a lod score or p value that would be expected to occur 0.01 times by chance in a specific search of the candidate region.

The pointwise p values for significant linkage work out at 1–5 × 10⁻⁵ for different genome-wide study designs. Note that these values do not imply threshold lod scores of 4.3–5.0. A lod score of 5 means that the data are 10⁵ times more likely on the given linkage hypothesis than on the null hypothesis; a p value of 10⁻⁵ means that the stated lod score will be exceeded only once in 10⁵ times, given the null hypothesis. The two measures are not the same. The lod scores for genome-wide significant linkage are in the range 3.3–4.0, again depending on the study design. For some discussion of the Lander and Kruglyak criteria, see the correspondence section of the April 1996 issue of Nature Genetics.
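The distinction between a p value and a lod score can be made concrete with a short sketch. It assumes the usual large-sample, one-sided approximation in which 2 ln(10) × lod is the square of a standard normal deviate; this approximation is not stated in the text, and the function names are illustrative.

```python
import math
from statistics import NormalDist


def lod_from_pointwise_p(p):
    """Lod score whose one-sided large-sample pointwise p value is p."""
    if not 0.0 < p < 0.5:
        raise ValueError("p must lie between 0 and 0.5")
    z = NormalDist().inv_cdf(1.0 - p)  # normal deviate with upper tail p
    return z * z / (2.0 * math.log(10.0))


def naive_lod(p):
    """The tempting but incorrect conversion: lod = -log10(p)."""
    return -math.log10(p)


lod_low = lod_from_pointwise_p(5e-5)   # about 3.3
lod_high = lod_from_pointwise_p(1e-5)  # about 3.9
wrong_high = naive_lod(1e-5)           # 5.0
```

Under this approximation, pointwise p values of 1–5 × 10⁻⁵ correspond to lod scores of roughly 3.3 to 4.0, not to the 4.3–5.0 that simply taking -log10(p) would suggest.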

12.5.4. For detecting alleles of modest effect, association tests are likely to be more powerful than linkage tests

Table 12.3. Sample sizes for 80% power to detect significant linkage or association in a genome-wide search

                  ASP analysis          TDT analysis
γ      p          Y       N-ASP         P(trA)   N-TDT
5      0.01       0.534   2530          0.830    747
       0.1        0.634   161           0.830    108
       0.5        0.591   355           0.830    83
3      0.01       0.509   33797         0.750    1960
       0.1        0.556   953           0.750    251
       0.5        0.556   953           0.750    150
2      0.1        0.518   9167          0.667    696
       0.5        0.526   4254          0.667    340
1.5    0.1        0.505   115537        0.600    2219
       0.5        0.510   30660         0.600    950
1.2    0.1        0.501   3951997       0.545    11868
       0.5        0.502   696099        0.545    4606

γ is the relative risk for individuals of genotype Aa compared to aa; p is the frequency of the A susceptibility allele. For affected sib pair (ASP) analysis, Y is the expected allele sharing and N-ASP the number of pairs required for significance, based on IBD testing (α = 3 × 10⁻⁵). For transmission disequilibrium testing (TDT), P(trA) is the probability that an Aa parent will transmit A to an affected child, and N-TDT is the number of parent-child trios required for significance. After Risch and Merikangas (1996).

An important paper by Risch and Merikangas (1996) compared the power of affected sib pair and TDT testing to detect associations between a marker at a susceptibility locus and a complex disease. They calculated the number of ASPs or TDT trios required to obtain a given power and significance level in order to distinguish a genetic effect from the null hypothesis. Box 12.2 illustrates their method (consult the original paper for more detail). Table 12.3 shows typical results of applying their formulae.
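The Y and P(trA) columns of Table 12.3 can be reproduced with a few lines of code. The sketch assumes a multiplicative model (relative risks of γ for Aa and γ² for AA) and uses Y = (1 + w)/(2 + w) with w = pq(γ - 1)²/(pγ + q)², and P(trA) = γ/(1 + γ). These expressions are assumptions in the sense that they are not quoted in this section; they are used here because they match the tabulated Y values. The sample-size formulae are not attempted, and the function names are illustrative.

```python
def sharing_parameter_w(gamma, p):
    """Intermediate quantity w for a multiplicative model (assumed form)."""
    q = 1.0 - p
    return p * q * (gamma - 1.0) ** 2 / (p * gamma + q) ** 2


def expected_sharing_Y(gamma, p):
    """Expected proportion of alleles shared IBD by an affected sib pair."""
    w = sharing_parameter_w(gamma, p)
    return (1.0 + w) / (2.0 + w)


def prob_transmit_A(gamma):
    """Probability that an Aa parent transmits A to an affected child."""
    return gamma / (1.0 + gamma)


# (gamma, p) pairs taken from the rows of Table 12.3.
ROWS = [(5, 0.01), (5, 0.1), (5, 0.5), (3, 0.01), (3, 0.1), (3, 0.5),
        (2, 0.1), (2, 0.5), (1.5, 0.1), (1.5, 0.5), (1.2, 0.1), (1.2, 0.5)]
Y_COLUMN = [round(expected_sharing_Y(g, p), 3) for g, p in ROWS]
```

With γ = 1 both functions return 0.5, the null expectation. For γ = 5 the expression γ/(1 + γ) evaluates to 0.833 rather than the tabulated 0.830; the other rows agree to three decimal places.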

Their conclusion is clear: ASP analysis would require unfeasibly large samples to detect susceptibility loci conferring a relative risk of less than about 3, whereas TDT might detect loci giving a relative risk below 2 with manageable sample sizes. Susceptibility genes conferring a relative risk below 1.5 would be very hard to find by either method. Note, however, that their result is obtained with one particular genetic model, and might not apply to all. In particular, linkage disequilibrium is not necessarily present between alleles at tightly-linked loci.

12.6. Strategies for complex disease mapping usually involve a combination of linkage and association techniques

In many ways linkage and association provide complementary data. Linkage operates over a long chromosomal range. Linkage analysis, whether parametric or nonparametric, can scan the entire genome in a few hundred tests. A typical study of 250 sib pairs with 300 markers would require 1.5–3 × 10⁵ genotypes to be generated (depending whether or not the parents were typed). Such a study might be completed in a few months by a well-organized and well-funded laboratory using an automated fluorescence sequencer. However, as noted (Section 12.2.2), candidate regions defined by linkage are usually uncomfortably large for positional cloning.

Association tests like the TDT have the opposite characteristics. Linkage disequilibrium is seldom striking over more than a megabase, so a genome screen by TDT would involve huge numbers of tests; on the other hand, a positive result would localize the susceptibility factor rather accurately. A natural study design is therefore to start with a genome-wide screen by linkage, probably in affected sib pairs, and then, once an initial localization has been achieved, to narrow the candidate region by linkage disequilibrium mapping.

It is important to remember that linkage disequilibrium is not an inevitable result of tight linkage. Association due to disequilibrium will be seen only if a significant proportion of the disease chromosomes derive from one not too distant common ancestor. There is a balance in this. Some serious dominant or X-linked mendelian diseases show no linkage disequilibrium because natural selection ensures a rapid turnover of disease genes, and most affected people are the result of independent mutations. For susceptibility factors in common disease, the problem is more likely to lie at the opposite end of the spectrum. Susceptibility factors may be common variants that have existed in the population at high frequency for a very long time, and that are nonpathogenic except when they get into bad company. A very old variant may have reached linkage equilibrium with adjacent markers. Equally, if many different changes to a given gene each acts as a susceptibility factor (in the same way that many different changes can cause loss of function, see Figure 16.1), then there may be no linkage disequilibrium. Therefore even if a susceptibility factor can be roughly localized by linkage, it does not necessarily follow that it can be fine-mapped by linkage disequilibrium or a method such as TDT that relies on it.

A major problem with sib pair analysis is that it will detect only rather strong susceptibility factors. The calculations in Table 12.3 show that for γ ≤ 2 the excess of allele sharing by affected sib pairs is very small, and detecting it would require huge numbers. It appears that large-scale association testing offers the best chance of finding these weak susceptibility factors through genome screening. But this would require work on a hitherto unprecedented scale. If we needed to be within 1 cM of a susceptibility locus to detect linkage disequilibrium, we would need about 3000 markers for a genome-wide screen. Given the quirky nature of linkage disequilibrium, with its dependence on details of population history, it might be prudent to use a denser set of maybe 10 000 markers. Scoring 1000 parent-child trios with a panel of 10 000 markers would mean generating 3 × 10⁷ genotypes - an increase of two orders of magnitude on the current best technology. This may be achievable with diallelic single nucleotide polymorphisms (Section 11.2.3) scored on high-density DNA chips. For many common diseases, such a scale of operation may be necessary before susceptibility factors can be reliably identified. As will become apparent in Chapter 19, the present generation of studies lack the power to detect weak susceptibility loci. Nevertheless, for the present, TDT and other association testing is limited to testing candidate loci or regions.
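The genotyping workloads quoted in this section follow from simple multiplication, as the sketch below shows. The helper name is illustrative; the study sizes are those given in the text.

```python
def genotypes_needed(n_families, people_per_family, n_markers):
    """Total genotypes when every person is typed for every marker."""
    return n_families * people_per_family * n_markers


# Linkage screen: 250 sib pairs, 300 markers, with or without both parents.
sib_pairs_only = genotypes_needed(250, 2, 300)
sib_pairs_with_parents = genotypes_needed(250, 4, 300)

# Association screen: 1000 parent-child trios, 10 000 markers.
tdt_screen = genotypes_needed(1000, 3, 10_000)

# Scale-up relative to the larger linkage design.
fold_increase = tdt_screen / sib_pairs_with_parents
```

The ratio of the two workloads is the two orders of magnitude referred to above.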

Further reading
Ott J (1991) Analysis of Human Genetic Linkage, revised edn. Johns Hopkins University Press, Baltimore.
Terwilliger J, Ott J (1994) Handbook for Human Genetic Linkage. Johns Hopkins University Press, Baltimore.
References
Byerley W F. Genetic linkage revisited. Nature. (1989); 340: 340–341.
Eaves I A, Merriman T R, Barber R A. et al. Comparison of linkage disequilibrium in populations from the UK and Finland. Am. J. Hum. Genet. (1998); 63 (suppl.): A1212.
Ivinson A J, Read A P, Harris R, Super M, Schwarz M, Clayton Smith J, Elles R. Testing for cystic fibrosis using allelic association. J. Med. Genet. (1989); 26: 426–430.
Krawczak M, Schmidtke J (1998) DNA Fingerprinting, 2nd edn. BIOS Scientific Publishers, Oxford.
Kruglyak L, Lander E S. Complete multipoint sib-pair analysis of qualitative and quantitative traits. Am. J. Hum. Genet. (1995); 57: 439–454.
Kruglyak L, Daly M J, Reeve-Daly M P, Lander E S. Parametric and nonparametric linkage analysis: a unified multipoint approach. Am. J. Hum. Genet. (1996); 58: 1347–1363.
Lander E S, Kruglyak L. Genetic dissection of complex traits: guidelines for interpreting and reporting linkage results. Nature Genet. (1995); 11: 241–247.
Lander E S, Schork N. Genetic dissection of complex traits. Science. (1994); 265: 2037–2048.
Read A (1989) Medical Genetics: An Illustrated Outline. Mosby, London.
Risch N, Botstein D. A manic depressive history. Nature Genet. (1996); 12: 351–353.
Risch N, Merikangas K. The future of genetic studies of complex human diseases. Science. (1996); 273: 1516–1517. (See also Science. (1997); 275: 1327–1330, for discussion.)
Schaid D J. Transmission disequilibrium, family controls and great expectations. Am. J. Hum. Genet. (1998); 63: 935–941.
Sham S, Zhao J (1998) Linkage analysis using affected sib-pairs. In: Guide to Human Genome Computing, 2nd edn. (MJ Bishop ed.). Academic Press, London.
Varon R, Vissinga C, Platzer M. et al. Nibrin, a novel DNA double-stranded break repair protein, is mutated in Nijmegen Breakage Syndrome. Cell. (1998); 93: 467–476.
Weeks D E, Lange K. A multilocus extension of the affected pedigree member method of linkage analysis. Am. J. Hum. Genet. (1992); 50: 859–868.
Xiong M, Guo S -W. Fine-scale genetic mapping based on linkage disequilibrium: theory and applications. Am. J. Hum. Genet. (1997); 60: 1513–1531.