Human Molecular Genetics 2
2nd edition
Tom Strachan (1) and Andrew P Read (2)
(1) University of Newcastle, Newcastle-upon-Tyne, UK
(2) University of Manchester, Manchester, UK
BIOS Scientific Publishers Ltd, ISBN 1-85996-202-5, 1999

Chapter 15: Identifying human disease genes


A more accurate, though less snappy title for this chapter might be ‘Identifying genetic determinants of human phenotypes’. Some mendelian phenotypes, like red-green color blindness, may be regarded as normal variants rather than diseases. Nor would the many genetic variants that contribute in a minor way to our susceptibility to common nonmendelian diseases normally be called disease genes, since they are neither necessary nor sufficient for developing the disease. But for all genetically determined phenotypic variants, one can in principle use the methods described here to discover what DNA sequence variants are responsible. Such variants will be found in many but not all the 80 000 or so human genes. Some genes are indispensable to embryonic function, so that deleterious mutations result in embryonic lethality and go unrecorded in humans. In other cases, abolition of gene function may normally have no effect on the phenotype because other nonallelic genes also supply the same function (genetic redundancy).

15.1. Principles and strategies in identifying disease genes

Few areas have moved as fast as human disease gene identification. Before 1980, very few human genes had been identified as disease loci. The few early successes involved a handful of diseases with a known biochemical basis where it was possible to purify the gene product. In the 1980s, advances in recombinant DNA technology allowed a new approach, positional cloning, sometimes given the rather meaningless label ‘reverse genetics’. The number of disease genes identified started to increase, but these early successes were hard won, heroic efforts. With the advent of PCR for linkage studies and mutation screening, it all became much easier. Now the human and other genome projects have made available a vast range of resources - maps, clones, sequences, expression data and phenotypic data. Identifying novel disease genes has become commonplace and is currently occurring on a weekly basis. Soon the landscape will change again, as the complete human genome sequence becomes available, so that all genes will in theory be accessible through databases.

Figure 15.1. How to identify a human disease gene

There is no single pathway to success, but the key step is to arrive at a plausible candidate gene, which can then be tested for mutations in affected people. Note the interplay between clinical work, laboratory benchwork and computer analysis. Database searching is becoming more and more crucial as information from genome projects accumulates.

Figure 15.1 summarizes some of the routes that have been followed to identify human disease genes. If the figure seems complicated, that is because there is no standard procedure for gene identification. All pathways converge on mutation testing in a candidate gene, but there is not one single entry point, and there is no unique pathway to the candidate gene. For discussion of the principles, we can divide the methods into those that do not require us to know the chromosomal location of the disease locus (Section 15.2) and those that depend on this knowledge (Section 15.3). In reality, groups trying to identify a disease gene will use several parallel approaches, with the emphasis shifting from one candidate and one line of attack to another in response to clues that emerge from the team's own results or new external data, and to new possibilities arising from technical developments. Most genes are identified by defining a candidate gene on the basis of both its chromosomal location and its properties (the positional candidate approach; Section 15.4).

15.2. Position-independent strategies for identifying disease genes

Historically, the first disease genes were identified by purely position-independent methods, simply because no relevant mapping information existed and the techniques were not available to generate it. However, methods based on sequence homology or functional complementation, although in principle position-independent, work much better when applied to predefined candidate subchromosomal regions than to the whole human genome. Homology searches in particular become exceedingly powerful when combined with positional information. Their use in ‘cybercloning’ and positional candidate approaches to gene identification is considered in Section 15.4.

15.2.1. Identification of a disease gene through knowledge of the protein product

If the biochemical basis of an inherited disease is known, it may be possible to purify and partially characterize some of the gene product. If this can be done, gene-specific oligonucleotides or specific antibodies can be generated that can be used to identify the gene.

Use of gene-specific oligonucleotides

This approach relies on the ability to isolate sufficient protein product to permit amino acid sequencing. Specific peptide bonds in the protein product can be cleaved using proteolytic enzymes such as trypsin (cuts at the carboxyl end of lysine or arginine residues) or reagents such as cyanogen bromide (cuts at the carboxyl end of methionine residues). The amino acid sequence of each resulting peptide can be determined by chemical sequencing. This involves a repeated series of chemical reactions in an automated amino acid sequencer. In each cycle, the peptide is exposed to a chemical that covalently bonds to the N-terminal amino acid and cleaves it off, allowing it to be identified by chromatography. Sequence overlaps identify overlapping peptides, enabling longer sequences to be assembled.
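The cleavage rules described above can be illustrated with a small in silico digestion. This is a minimal sketch only: the toy sequence and function names are hypothetical, and the rules are deliberately simplified (for instance, real trypsin generally fails to cut when the next residue is proline, which is ignored here).

```python
def cleave(protein, residues):
    """Split a one-letter protein sequence after every residue in `residues`,
    i.e. cleavage at the carboxyl side of those residues."""
    peptides, current = [], ""
    for aa in protein:
        current += aa
        if aa in residues:
            peptides.append(current)
            current = ""
    if current:  # C-terminal peptide that does not end in a cleavage residue
        peptides.append(current)
    return peptides


def trypsin(protein):
    # Simplified rule: cut after lysine (K) or arginine (R).
    return cleave(protein, "KR")


def cyanogen_bromide(protein):
    # Cut after methionine (M).
    return cleave(protein, "M")


toy = "GAKLMWRTEMSK"  # hypothetical sequence, for illustration only
print(trypsin(toy))
print(cyanogen_bromide(toy))
```

Because the two reagents cut at different residues, the two peptide sets overlap one another, and it is these overlaps that allow the individual peptide sequences to be ordered into a longer sequence.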

The resulting amino acid sequence is inspected to identify regions containing amino acids with minimal codon degeneracy (e.g. methionines and tryptophans are uniquely encoded by AUG and UGG codons, respectively). Once suitable regions have been identified, combinations of oligonucleotides are synthesized to correspond to all possible codon permutations. The resulting mix of partially degenerate oligonucleotides is labeled and used as a probe to screen cDNA libraries. As only one of the oligonucleotides in the mix will correspond to the authentic sequence, it is important to keep the number of different oligonucleotides low so as to increase the chance of identifying the correct target. Once a suitable cDNA clone is isolated, it can be used to screen a genomic DNA library in order to isolate genomic DNA clones for full characterization of the gene.
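The search for a region of minimal codon degeneracy can be sketched in a few lines. The example below assumes the standard genetic code and counts how many different oligonucleotides would be needed to cover every codon permutation of a peptide window. The peptide and the function names are hypothetical and are not taken from the factor VIII work.

```python
from math import prod

# Number of codons specifying each amino acid in the standard genetic code.
CODONS_PER_AMINO_ACID = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def pool_size(peptide):
    """Number of distinct oligonucleotides needed to cover all codon choices."""
    return prod(CODONS_PER_AMINO_ACID[aa] for aa in peptide)


def best_window(peptide, length):
    """Return (start, window, pool size) for the least degenerate window.
    Ties go to the earliest window; the peptide must be at least `length` long."""
    windows = [(i, peptide[i:i + length])
               for i in range(len(peptide) - length + 1)]
    start, window = min(windows, key=lambda w: pool_size(w[1]))
    return start, window, pool_size(window)


# Hypothetical peptide: the Met-Trp pair pulls the optimum towards its end.
print(best_window("LSRMWKHF", 5))
```

Windows rich in leucine, serine or arginine (six codons each) inflate the pool rapidly, which is why regions containing methionine and tryptophan are preferred.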

Figure 15.2. The factor VIII gene, the locus for hemophilia A, was cloned by product-directed oligonucleotide screening of DNA libraries

The figure illustrates one way in which factor VIII DNA clones were obtained, following cleavage of purified porcine factor VIII protein into peptides and amino acid sequencing. The resulting sequences were inspected to identify regions with low codon redundancy. The top panel shows a sequence of 15 amino acids from His8 to Met22 in one of the peptides, with the possible codon permutations above (with variable nucleotides in color). This sequence was selected because of the generally low codon redundancy: two amino acids, Trp and Met, are specified by a single codon and another seven can be specified by just two alternative codons. A partially degenerate 45-bp antisense oligonucleotide probe was prepared and used as a primary hybridization probe to screen a porcine genomic DNA library, and thereafter secondary screening used a 15-bp antisense degenerate oligonucleotide probe corresponding to the sequence from Trp18 to Met22. The porcine factor VIII genomic clone was then used to screen human DNA libraries to identify the human gene (see Gitschier et al., 1984).

Identification of the hemophilia A gene (MIM 306700) followed this approach. Biochemical analysis of serum samples from patients had previously identified a genetic deficiency of blood clotting factor VIII. Purification of factor VIII from plasma is not straightforward, partly because it is present in very low quantities. One approach involved isolating small quantities of factor VIII from large volumes of pig blood by standard protein purification techniques. The purified product allowed the production of gene-specific oligonucleotide probes for library screening (Figure 15.2).

Library screening by hybridization can be tedious when a complex mixture of oligonucleotides is used, because the results are greatly influenced by the hybridization conditions. A more rapid alternative is to use partially degenerate oligonucleotides as PCR primers. One early strategy was to use two such sets corresponding to amino acid sequences from different regions of the protein as primers. By using total cDNA from a suitable source as a template, a specific cDNA could be amplified spanning the codons from the two different regions (Figure 6.11). However, this approach demands considerable prior information about the protein sequence. A more convenient alternative is to prepare cDNA and ligate it to vector DNA molecules. PCR can then be performed using one vector-specific primer and one primer composed of a panel of partially degenerate oligonucleotides.

Use of specific antibodies

If even small amounts of the normal protein product can be isolated, specific antibodies can be raised. The protein, or a peptide derived from it, is conjugated to a strongly immunogenic carrier protein such as keyhole limpet hemocyanin, and the compound molecule is injected into a rabbit or mouse. The carrier supplies epitopes that activate helper T lymphocytes, which in turn help B lymphocytes that recognize the protein or peptide of interest, leading to production of antibodies against it. Mouse or rabbit antibodies that are specific for the desired protein or peptide can then be used in various ways to identify a corresponding cDNA.

An early approach was to enrich for mRNA encoding the protein product in a cell-free in vitro protein synthesis system. This was how the gene that is mutated in phenylketonuria (MIM 261600) was identified in 1982 (Robson et al., 1982). Phenylketonuria was known to be caused by a lack of the enzyme phenylalanine hydroxylase (PAH). PAH enzyme was purified from rat liver, a known site of expression. Specific antibodies were raised and used to immunoprecipitate polysomes containing PAH mRNA. The purified mRNA was converted to cDNA, and a specific rat cDNA clone was isolated. This was then used as a probe to isolate the human cDNA from a human liver cDNA library.

This type of approach has been superseded by antibody screening of cDNA expression libraries. cDNA from a relevant tissue is cloned into an expression vector. Inserts within the recombinant DNA clones are expected to be expressed within the host cell to produce foreign polypeptides. Appropriate antibodies can then be used to screen colony filters from the library to identify clones encoding the product of interest.

15.2.2. Identification of a disease gene through knowledge of the DNA sequence

This most usually arises when the researcher is considering what diseases might be caused by mutations in a particular known gene. Alternatively, a novel human disease gene may be identified by homology, either to a paralogous human gene (Section 15.4.2) or to an orthologous gene in another species (Section 15.4.3). An interesting application of DNA sequence knowledge is the attempt to clone genes containing expanded trinucleotide repeats. As shown in Box 16.7, expanded trinucleotide repeats are known to cause several inherited neurological disorders. Often these disorders show anticipation - that is, the disease presents at an earlier age and with increased severity in successive generations. If a disease under investigation shows any of these features, it may be worth screening for triplet repeat expansions. The repeat expansion detection method of Schalling et al. (1993) permits detection of expanded repeats in unfractionated genomic DNA of affected patients, and methods have been developed for cloning any expanded repeats detected (Koob et al., 1998). This approach was recently used in a completely position-independent way to identify a novel repeat expansion that causes a form of spinocerebellar ataxia (SCA8) (Koob et al., 1999).
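Once a candidate sequence is in hand, the length of a triplet repeat tract can be checked computationally. The sketch below is not the repeat expansion detection method, which is a laboratory assay on unfractionated genomic DNA. It is only a simple scan of a known sequence for the longest uninterrupted run of a chosen repeat unit, with a hypothetical function name and a made-up example sequence.

```python
import re


def longest_triplet_run(seq, unit="CAG"):
    """Return (start, repeat_count) for the longest uninterrupted tandem run
    of `unit` in `seq`, or (-1, 0) if the unit does not occur."""
    best = (-1, 0)
    pattern = f"(?:{re.escape(unit)})+"
    for match in re.finditer(pattern, seq.upper()):
        count = len(match.group(0)) // len(unit)
        if count > best[1]:
            best = (match.start(), count)
    return best


# Made-up sequence: a run of 3 CAGs, an interrupting CAA, then a run of 5.
example = "TT" + "CAG" * 3 + "CAA" + "CAG" * 5 + "GG"
print(longest_triplet_run(example))
```

Interruptions such as the CAA above split the tract into separate runs, so the function reports the longest pure run rather than the total tract length.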

15.2.3. Identification of a disease gene through knowledge of its normal function

Functional cloning depends on expressing random fragments of human DNA in a cell or organism, and isolating any fragments that cause a desired change in function. The usual approach is a functional complementation assay, seeking fragments that correct a defect in the recipient. Examples include the following:

  • Functional complementation in mammalian cell lines. For example, a variety of mammalian cell lines have been generated that are deficient in DNA repair. They show abnormal responses following exposure to UV irradiation or chemical mutagens. These mutant cells, or alternatively cells derived from patients with a DNA repair deficiency, can be transformed by fragments of normal human DNA or human chromosomes in order to produce a repair-competent phenotype. This was the way in which cDNA clones for the Fanconi's anemia group C (FACC; MIM 227645) gene were first obtained (Strathdee et al., 1992). Similarly, the ability of transferred chromosomes or clones to correct the uncontrolled growth of tumor cell lines has been used to help locate and then identify tumor suppressor genes (Chapter 18).

  • Functional complementation in yeast. Innumerable yeast mutants have been defined, and genetic analysis in yeast is particularly sophisticated because of the ease of performing homologous recombination. Some proteins have been so highly conserved during evolution that the human protein can complement a yeast mutant defective in the corresponding protein. This approach has been successful in identifying the human genes that specify various enzymes of purine and pyrimidine biosynthesis, and also some crucially important transcription factors.

  • Functional complementation in transgenic mice. Occasionally a mouse gene has been identified by constructing transgenic mice, using nonmutant BAC clones from a candidate region, crossing them to mice carrying the mutation, and checking which transgene corrects the defect. This strategy was first used to identify a clock gene (Antoch et al., 1997), and more recently as a crucial step in identifying the human DFNB3 deafness gene (Probst et al., 1998; Figure 15.3). DFNB3 had been mapped to a location that corresponded in the mouse to the location of the deafness gene shaker-2. Transgenic mice were constructed using BACs from the shaker-2 candidate region, and a BAC that corrected the shaker-2 phenotype was identified. This led to identifying the shaker-2 gene as an unconventional myosin, MYO15. The human MYO15 gene was then isolated based on its close homology to the mouse gene, its position within the DFNB3 candidate region confirmed, and mutations demonstrated in DFNB3 affected people.

    Figure 15.3. Functional complementation in transgenic mice as a tool for identifying a human disease gene

    The shaker-2 mouse mutation was identified by finding a wild-type clone that corrected the defect. Human families with a similar phenotype that mapped to the corresponding chromosomal location proved to have mutations in the orthologous gene.

  • Isolation of activated oncogenes. Activated oncogenes can be isolated through their effect on the growth of mouse 3T3 fibroblasts (see Figure 18.4).

If a patient has a disease because of a chromosomal deletion, identifying genes present in a normal person but absent in the patient would pinpoint the disease gene. More generally, genes implicated in a disease may be expressed to a different degree in patients and controls (this depends on the type of mutation: missense mutations alter the function but not the expression of the mRNA, but many other types of mutation result in low or absent levels of mRNA - see Chapter 16). Methods that identify the differential presence or expression of a gene therefore provide a possible route to position-independent identification of a disease gene, although more usually they are one arm of a positional candidate strategy.

Subtraction cloning

Subtraction cloning can be used to select clones of the DNA that is deleted in an individual with a chromosomal deletion. Two DNA samples are compared, a normal ‘test’ DNA and a deleted ‘driver’ DNA. The test DNA is mixed with a large excess of driver DNA, denatured and re-annealed. By one means or another, double helices are selected in which both strands consist of test DNA. These preferentially represent sequences in the test DNA that are absent from the driver DNA. The most celebrated application of subtraction cloning was in identifying the dystrophin (DMD) gene (Section 15.3.4). The test DNA came from a normal individual, and the driver DNA from a patient who had a deletion including the dystrophin gene. The test clones remaining after subtraction were enriched for DNA derived from the region missing in the affected patient.

In the historically important case of dystrophin, subtraction cloning directly yielded clones from the desired unknown gene, but this was an exceptional success. Subtraction cloning is a very difficult technique, which has seldom succeeded with genomic DNA. A more recent approach to the same problem is representational difference analysis (RDA; Lisitsyn, 1995). RDA uses several means, including selective PCR, to enrich sequences present in the test but not the driver DNA, and has been used to isolate regions amplified or deleted in cancer cells (Schutte et al., 1995). Genes in such regions are positional candidate tumor suppressor or oncogenes (Chapter 18).

Subtractive hybridization works better with mRNA than with genomic DNA, and both subtractive hybridization and mRNA differential display (Section 20.2.4) have been used to identify differentially expressed transcripts. In the future, expression arrays (Figure 20.6) could be efficient tools for such analyses, although they will not contain any novel uncharacterized genes. Generally the role of these techniques in gene identification is not to isolate the disease gene directly, but to produce collections of sequences that may be position-independent candidates for a disease because of their pattern of expression. These can then be screened further by positional criteria.

Subtractive hybridization has also been used in several projects to produce libraries of cDNAs specifically expressed in a certain tissue, by subtraction of a tissue-specific cDNA library against one or more nonspecific libraries (Swaroop et al., 1991). Such subtraction libraries are a good source of candidate genes for diseases affecting just the tissue in question. Several disease genes have been identified by screening an appropriate subtraction library for sequences mapping to the same chromosomal location as the disease (see, for example, Yasunaga et al., 1999). This is an example of the power of the positional candidate approach (Section 15.4).

15.3. In positional cloning, disease genes are identified using only knowledge of their approximate chromosomal location

Table 15.1. Examples of disease genes identified by positional cloning

| Disease | MIM no. | Map position | Gene | Approach |
|---|---|---|---|---|
| Duchenne muscular dystrophy (Section 15.3.4) | 310200 | Xp21.3 | Dystrophin | (a) clone translocation breakpoints; (b) clone sequences missing in a patient with a deletion |
| Cystic fibrosis (Section 15.3.5) | 219700 | 7q31 | CFTR | Linkage disequilibrium |
| Branchio-oto-renal syndrome (BOR) (Section 15.3.6) | 113650 | 8q13 | EYA1 | Sequencing genomic clones; homology to Drosophila gene |
| Treacher Collins syndrome (Section 15.3.7) | 154500 | 5q32-33.1 | TCOF1 | Transcript mapping |

The disease gene was found within the minimal region defined by linkage analysis using the approaches shown. See text for details.

At the opposite pole to the position-independent gene identification strategies, positional cloning identifies a disease gene based on no information except its approximate chromosomal location. The first successful gene identifications based only on positional information, published in 1986, marked a triumphant new era for human molecular genetics. One after another, genes for important disorders such as Duchenne muscular dystrophy, cystic fibrosis, Huntington disease, adult polycystic kidney disease, colorectal cancer, breast cancer, etc. were isolated. However, positional cloning can be desperately hard work, and by 1995 only about 50 inherited disease genes had been identified by this approach (Collins, 1995). The four examples discussed in Sections 15.3.4-15.3.7 (Table 15.1) illustrate typical approaches to positional cloning. Each of these disease genes was identified by first mapping the disease as closely as possible in affected families, and then identifying a novel candidate gene and showing that patients had mutations in that gene. In all cases except Treacher Collins syndrome, there was some clue to help identify the correct candidate gene.

Figure 15.4. The difficult path from candidate region to gene

One researcher's view of the frustrations of positional cloning. Image courtesy of Dr Richard Smith, University of Iowa.

Positional cloning projects recapitulate the Human Genome Project in miniature. In both cases the researchers first produce high-resolution genetic and physical maps, then build clone contigs before identifying and sequencing transcripts. The only parts specific to positional cloning are selecting the candidate region and identifying pathogenic mutations in patients with the disease in question. Thus general progress in the Human Genome Project has had an enormous impact on positional cloning projects. Defining the right candidate region and testing candidate genes can still be long hard tasks (as Figure 15.4 makes clear) but most of the intermediate stages can be achieved by intelligent use of existing Genome Project data.

15.3.1. The first step in positional cloning is to define the candidate region as tightly as possible

The methods of mapping mendelian and nonmendelian disease genes have been described in Chapters 11 and 12, respectively, while the use of loss of heterozygosity mapping to locate tumor suppressor genes is described in Section 18.5.3. Even with all the fruits of the Human Genome Project to hand, positional cloning can still be laborious and frustrating. More than any other factor, the size of the candidate region determines the work required, and so every effort is made to define as small a candidate region as possible. Regions larger than about 1 Mb of DNA present serious obstacles to positional cloning.

Figure 15.5. Crossover analysis seeks to map a gene by defining flanking proximal and distal recombinants

The figure shows haplotype analysis in two pedigrees with a dominantly inherited skin disorder, Darier-White disease (MIM 124200), which had previously been mapped to 12q. Genotypes in II-1, II-2, II-4 and II-5 in pedigree (A) are inferred. (A) In this family the disease gene segregates with the marker haplotype 6-5-2-6-2-2 between D12S84 (proximal) and D12S79 (distal). A crossover in II-6 shows that the disease gene must map distal to D12S84. The positioning of D12S105 is ambiguous because of presumed homozygosity for allele 5 in I-1 - compare the genotypes for II-3 and II-6. (B) In this family the disease gene must lie proximal to D12S129. The combined data indicate that the Darier's disease gene must map between the proximal marker D12S84 and the distal marker D12S129. Reproduced from Carter et al. (1994) Genomics, 24, 378–382, with permission from Academic Press.

Often the initial localization from genetic mapping defines a candidate region of 10 Mb or more. The next step is to collect as many families as possible and establish a dense cover of polymorphic markers across the region. Suitable markers may be found by database searching, but if this fails then YACs, BACs and cosmids must be isolated from the candidate region and screened for polymorphisms. These might be microsatellites or single nucleotide polymorphisms, and depending on how well developed the physical map of the region is, the new markers might either be localized physically as sequence tagged sites (STS) on a contig, mapped physically using a radiation hybrid panel (Section 10.1.3), or mapped genetically in CEPH families (see Section 11.4.3). For mendelian conditions, where recombinants can be identified unambiguously, the limit of genetic resolution is reached when pairs of closely spaced markers define the positions of the closest recombinations on either side of the disease locus. This is decided by inspecting individual haplotypes rather than by statistical analysis (Figure 15.5). When mapping low-penetrance disease susceptibility loci, recombinants cannot be pinpointed in this way (Section 12.2.2), so all one can do is sharpen the lod score curve as far as possible by using the biggest possible dataset.
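The logic of defining a candidate interval from individual recombinants can be written out explicitly. The following is a minimal sketch under several assumptions that do not come from the text: markers are listed in proximal-to-distal order, only unambiguously affected individuals are scored, each individual has at most one crossover on each side of the disease locus, and the marker names M1 to M6 are hypothetical. For each marker an individual is scored True (carries the disease-haplotype allele), False (recombinant) or None (uninformative).

```python
def candidate_interval(markers, individuals):
    """Return the (proximal, distal) flanking markers that bound the disease
    locus, or None on a side where no recombinant has been observed."""
    lo, hi = -1, len(markers)  # indices of the closest flanking recombinants
    for calls in individuals:
        shared = [i for i, call in enumerate(calls) if call is True]
        if not shared:
            continue  # nothing can be inferred from this individual
        first, last = shared[0], shared[-1]
        proximal = [i for i, call in enumerate(calls)
                    if call is False and i < first]
        distal = [i for i, call in enumerate(calls)
                  if call is False and i > last]
        if proximal:
            lo = max(lo, proximal[-1])  # gene lies distal to this marker
        if distal:
            hi = min(hi, distal[0])     # gene lies proximal to this marker
    return (markers[lo] if lo >= 0 else None,
            markers[hi] if hi < len(markers) else None)


markers = ["M1", "M2", "M3", "M4", "M5", "M6"]
# Individual A: proximal crossover; M2 is uninformative (e.g. a homozygous parent).
individual_a = [False, None, True, True, True, True]
# Individual B: distal crossover between M4 and M5.
individual_b = [True, True, True, True, False, False]
print(candidate_interval(markers, [individual_a, individual_b]))
```

An uninformative marker, as with M2 here, cannot tighten the interval, so the boundary falls back to the nearest marker at which recombination is actually demonstrated.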

When single recombinants define the boundaries of a region that is to be searched, it is important to consider possible sources of error (see Section 11.5.1). Meticulous clinical diagnoses are imperative. Key recombinations are more reliable if they occur in unambiguously affected people - an unaffected individual may carry a nonpenetrant disease gene, which can lead to them being misinterpreted as recombinant when in fact they are nonrecombinant. Sometimes, despite good positive lod scores, there appear to be recombinants with every marker tried. This is usually an indication that somebody has been diagnosed wrongly (labeled as affected when unaffected, or vice versa) or else that the disease gene in one or more of the families under investigation does not map to the candidate region. Alternatively, perhaps the markers are wrongly ordered on the genetic map.

Linkage disequilibrium may allow very high resolution mapping

Linkage disequilibrium (association at the population level of a particular marker allele with a disease) can allow genetic mapping to be taken to a very high resolution, as discussed in Section 12.4. Linkage disequilibrium has been enormously valuable for guiding positional cloning (the example of cystic fibrosis is discussed below) but not all diseases show it (European Consortium on MEN1, 1997). It is seen only when many of the apparently unrelated affected people in a population in fact derive their disease chromosome from a shared ancestor. Thus the easiest diseases to map very finely are those where most affected people carry the same ancestral mutation, and an ancestral haplotype can be defined, as illustrated in Figure 12.5. In such cases there is a price to be paid later, when candidate genes are being tested for mutations, because a diversity of mutations gives a much higher chance of spotting the correct gene. Plans to identify the genetic factors responsible for susceptibility to many common diseases rest almost entirely on the hope that association studies will pinpoint the location of the susceptibility genes. If it were to turn out that at most susceptibility loci many different variants predispose to the disease, then the whole endeavor would be in serious trouble.
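The strength of association between a marker allele and disease chromosomes is usually summarized numerically. The measures themselves are not spelled out in this section, so the sketch below uses the standard population-genetic definitions of D, D' and r², with made-up haplotype counts. Here A denotes the marker allele of interest and B denotes a disease-bearing chromosome.

```python
def ld_measures(n_AB, n_Ab, n_aB, n_ab):
    """Compute D, D' and r^2 from counts of the four haplotype classes.
    D' keeps the sign of D; many reports quote its absolute value."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n
    p_A = (n_AB + n_Ab) / n   # frequency of marker allele A
    p_B = (n_AB + n_aB) / n   # frequency of disease chromosomes B
    d = p_AB - p_A * p_B      # departure from random association
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = d / d_max if d_max > 0 else 0.0
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Made-up counts: allele A is over-represented on disease chromosomes.
d, d_prime, r2 = ld_measures(40, 10, 10, 40)
print(round(d, 3), round(d_prime, 3), round(r2, 3))
```

With equal counts in all four classes, all three measures are zero, which corresponds to a marker that carries no information about the ancestral disease chromosome.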

15.3.2. Genes within the candidate region can be identified by a combination of database searching and transcript mapping

As shown in Figure 15.1, known genes from the candidate region can be found by database searching. If none of the known genes is a promising candidate, or if after testing no mutations can be found, then novel genes must be sought. The general methods for identifying unknown transcribed sequences from within a contig of genomic clones have been discussed in detail in Section 10.4, and are summarized briefly in Box 15.1. With the continuing progress of the Human Genome Project, the emphasis has moved strongly towards identifying genes from the databases. Once the human genome sequence is completed, it should in theory no longer be necessary to clone any human gene from scratch. However, present computational methods for identifying genes using genomic DNA sequence data (Section 10.4.5) are far from perfect, and for some time to come it may still be necessary to get one's hands dirty in the laboratory if one wishes to identify every expressed sequence from a candidate region.

Whenever cDNA libraries are to be screened, the question arises which libraries should be used. Often the pathology of the disease under study suggests a particular tissue. For example, when studying a neuromuscular disease it makes sense to start by screening muscle cDNA libraries. However, tissue-specific diseases are often caused by malfunction of widely expressed genes (Section 16.7.1), so if one library fails it is always worth screening others. Fetal brain is a popular choice because it has a particularly high number of expressed sequences.

15.3.3. Chromosomal aberrations can provide a useful short-cut to locating a disease gene

Table 15.2. The first ten years of positional cloning (selected highlights)

| Year | Disease | MIM number | Location | Gene | Chromosome abnormality |
|---|---|---|---|---|---|
| 1986 | Duchenne muscular dystrophy | 310200 | Xp21.3 | DMD | (a) del(X)(p21.3); (b) t(X;21)(p21.3:p13) |
| | Retinoblastoma | 180200 | 13q14 | RB | del(13)(q13.1q14.5) |
| 1989 | Cystic fibrosis | 219700 | 7q31 | CFTR | None |
| 1990 | Neurofibromatosis 1 | 162200 | 17q11.2 | NF1 | Balanced translocations: t(1;17)(p34.3:q11.2); t(17;22)(q11.2:q11.2) |
| | Wilms' tumor | 194070 | 11p13 | WT1 | del(11)(p14p13) |
| 1991 | Aniridia | 106210 | 11p13 | PAX6 | t(4;11)(q22;p13); del(11)(p13) |
| | Familial polyposis coli | 175100 | 5q21 | APC | del(5)(q15q22) |
| | Fragile-X syndrome | 309550 | Xq27.3 | FMR1 | FRAXA fragile site |
| | Myotonic dystrophy | 160900 | 19q13.3 | DMPK | None |
| 1993 | Huntington's disease | 143100 | 4p16 | HD | None |
| | Tuberous sclerosis 2 | 191092 | 16p13 | TSC2 | Microdeletions in candidate region |
| | von Hippel-Lindau disease | 193300 | 3p25 | VHL | Microdeletions in candidate region |
| 1994 | Achondroplasia | 100800 | 4p16 | FGFR3 | None |
| | Early-onset breast/ovarian cancer | 113705 | 17q21 | BRCA1 | None |
| | Polycystic kidney disease | 173900; 601313 | 16p13.3 | PKD1 | t(16;22)(p13.3;q11.21) |
| 1995 | Spinal muscular atrophy | 253300; 600354 | 5q13 | SMN1 | None |

The data illustrate how important cytogenetic abnormalities were for cloning disease genes. For TSC2 and VHL, microdeletions were identified only late in the refinement of the candidate region.

Researchers are constantly on the alert for special patients or observations that will short-cut the labor of pure positional cloning. Cancer studies in particular have relied on investigations of chromosomal abnormalities (Chapter 18), and identification of many other disease genes has also been greatly helped by finding patients with a chromosomal abnormality (Table 15.2). Alert clinicians play a crucial role in identifying such patients (Box 15.2).

Translocations and inversions

If a person with an apparently balanced translocation or inversion is phenotypically abnormal, there are three possible explanations:

  1. the finding is coincidental;

  2. the rearrangement is not in fact balanced - there is an unnoticed loss or gain of material;

  3. one of the chromosome breakpoints causes the disease.

A chromosomal break can cause a loss-of-function phenotype if it disrupts the coding sequence of a gene, or separates it from a nearby regulatory region. Alternatively, it could cause a gain of function, for example by splicing regulatory sequences from one gene to distal coding sequences from another gene, causing inappropriate expression (this is rare in inherited disease but common in tumorigenesis, see Chapter 18). In either case, the breakpoint provides a valuable clue to the exact physical location of the disease gene. The clue is valuable but not infallible: sometimes breakpoints can alter expression of a gene located hundreds of kilobases away by affecting the structure of large-scale chromatin domains (Box 15.3).

Figure 15.6. Using fluorescence in situ hybridization to define a translocation breakpoint

(A) Cytogenetically defined translocation t(8;16)(p22;q12). (B) Physical map of part of the breakpoint region in a normal chromosome 8, showing approximate locations of seven clones. (C) Results of successive FISH experiments. The breakpoint is within the sequence represented in clone D. This result would normally be confirmed using clones from chromosome 16.

The precise location of a chromosome breakpoint is most easily defined by using FISH (Figure 15.6). Alternatively, different DNA clones from the relevant region can be used in turn to see if any can identify patient-specific restriction fragments, by hybridizing each clone to the patient's genomic DNA which has been digested with a rare-cutter restriction endonuclease and subjected to pulsed field gel electrophoresis (Section 10.2.2).
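
The reasoning behind successive FISH experiments can be sketched in code. This is a minimal illustration, assuming each clone is scored by which derivative chromosome(s) carry its signal; the function name, the clone labels and the scores are hypothetical and are not data taken from Figure 15.6.

```python
def locate_breakpoint(clones, signals):
    """Return (kind, clone_names) describing where the breakpoint lies.

    clones  : clone names ordered along the normal chromosome
    signals : dict mapping clone name -> set of derivative chromosomes
              showing a FISH signal (labels such as "der(8)" are assumptions)
    """
    # A clone spanning the breakpoint gives a split signal on both derivatives.
    for name in clones:
        if len(signals[name]) == 2:
            return ("within", [name])
    # Otherwise the breakpoint lies between the adjacent pair of clones
    # whose signal switches from one derivative to the other.
    for left, right in zip(clones, clones[1:]):
        if signals[left] != signals[right]:
            return ("between", [left, right])
    return ("not in region", [])


# Illustrative scores only: seven clones ordered along normal chromosome 8.
order = list("ABCDEFG")
fish = {
    "A": {"der(16)"}, "B": {"der(16)"}, "C": {"der(16)"},
    "D": {"der(8)", "der(16)"},  # split signal
    "E": {"der(8)"}, "F": {"der(8)"}, "G": {"der(8)"},
}
print(locate_breakpoint(order, fish))
```

If no clone gives a split signal, the same function reports the flanking pair, which indicates where further clones would need to be isolated.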

Deletions and duplications

Chromosomal deletions cause abnormalities through complete loss of the deleted genes in males with X chromosome deletions, and through reduced levels of dosage-sensitive gene products in people heterozygous for autosomal deletions (Figure 16.9). Cytogenetically visible deletions involve many megabases of DNA. Such large deletions often produce rather complex, nonspecific phenotypes, but if specific elements can be seen, the deletion may provide a pointer to a broad subchromosomal localization. In the past, subtraction cloning was attempted using such deletions, as described below for the dystrophin gene.

Small-scale deletions (microdeletions) are valuable for positional cloning. Deletions of tens or hundreds of kilobases of DNA are not uncommon in some disorders (Section 16.8.1). Often they are generated by unequal recombination between flanking repeat sequences (Section 9.5.4). Microdeletions can be identified by several methods.

  • Noninheritance of marker alleles. If a microdeletion eliminates a marker locus, individuals carrying the deletion will be hemizygous. On testing, they appear to be homozygotes, and transmission of the disease chromosome is often accompanied by what appears to be nonmendelian segregation (Figure 15.7; Figure 17.12 shows an X-linked example).

    Figure 15.7. Deletions at a disease locus may result in noninheritance of closely linked markers

    The pedigree shows a family with type 2 neurofibromatosis (MIM 101000), with genotypes for a TaqI RFLP at the closely linked neurofilament heavy chain (NEFH) locus. The two affected offspring inherit no NEFH allele from their mother. The likely cause of both the disease and the anomalous inheritance pattern is a deletion encompassing both the NEFH and NF2 genes. This was confirmed by the analysis shown in Figure 15.8. Reproduced from Watson et al. (1993) by permission of Oxford University Press.

  • PCR dosage analysis. Any sequence-tagged site that maps within a microdeletion will be present in half dosage in deletion carriers. Various quantitative or semiquantitative PCR methods can be used to detect the reduced dosage.

  • FISH mapping. Once suspected, a microdeletion can also be confirmed by this method. A DNA clone from the deleted interval is used as a hybridization probe against a metaphase chromosome preparation from the patient. For autosomal microdeletions, the probe should hybridize to only one of the two homologs (Figure 15.8).

    Figure 15.8. Microdeletions can be identified by restriction mapping using PFGE and by FISH

    The panels illustrate two sets of analyses to test whether affected members in the pedigree in Figure 15.7 carried a small deletion of chromosome 22 in the vicinity of the NF2 locus. (A) PFGE analysis. A genomic DNA clone from the NF2 gene region was hybridized to a Southern blot of genomic DNA from indicated family members, which had been digested with NotI and size-fractionated by PFGE. The 600-kb NotI fragment represents wild-type alleles. Note the additional, approximately 460-kb band found only in the affected individuals, resulting from a deletion which simultaneously eliminated the NF2 gene. (B) FISH analysis. Metaphase chromosome preparations from an affected individual were hybridized with two DNA probes. A probe that hybridizes to repeat sequences found at the centromeres of chromosomes 14 and 22 produces a strong signal on the two homologs for each chromosome. A cosmid from the NF2 gene region, however, hybridizes to only one of the two chromosome 22 homologs. This is a single copy probe so it gives a fainter signal (a dot from each chromatid) than the repeat sequence probe. Reproduced from Watson et al. (1993) Hum. Mol. Genet., 2, 701–704, by permission of Oxford University Press.

  • Hybridization-based restriction mapping. Unlike the previous methods, which require markers that are deleted, this approach can be used with markers located several hundred kilobases away from the deletion. Genomic DNA from a panel of patients is digested with rare-cutter restriction endonucleases, and the fragments size-fractionated by pulsed field gel electrophoresis. DNA probes that map in the vicinity of the disease locus can then be tested in turn to see if they can detect abnormal patient-specific hybridization bands. If a probe detects abnormal size bands in a patient's DNA digested separately with two or more different enzymes, a large-scale mutation is indicated (Figure 15.8).
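
The first of these methods, apparent noninheritance of marker alleles, amounts to a simple consistency check on family genotypes. The sketch below is illustrative only: the function names, individual IDs and genotypes are hypothetical, and a flagged child could equally reflect a sample mix-up or misattributed parentage, so a flag is a prompt for confirmation by FISH or PFGE rather than proof of a deletion.

```python
def apparent_exclusion(parent, child):
    """True if parent and child appear to share no allele at the marker.

    Genotypes are given as scored on a gel, so a deletion carrier with one
    remaining allele "1" is recorded as the apparent homozygote ("1", "1").
    """
    return not (set(parent) & set(child))


def flag_possible_deletion(trios):
    """Return IDs of children whose scored genotype excludes a stated parent.

    trios: list of (child_id, mother_genotype, father_genotype, child_genotype)
    """
    flagged = []
    for child_id, mother, father, child in trios:
        if apparent_exclusion(mother, child) or apparent_exclusion(father, child):
            flagged.append(child_id)
    return flagged


# Hypothetical family: the mother is scored 1/1 but really carries 1/deletion.
family = [
    ("II-1", ("1", "1"), ("2", "2"), ("2", "2")),  # inherited the deleted chromosome
    ("II-2", ("1", "1"), ("2", "2"), ("1", "2")),  # inherited the intact chromosome
]
print(flag_possible_deletion(family))
```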

Duplications have not played any significant role in positional cloning. If unequal recombination between flanking repeats is a major cause of microdeletions, microduplications should be equally frequent, but in fact they are rarely observed. Probably most duplications are overlooked because they are nonpathogenic. If a duplication is associated with an abnormal phenotype, the cause is most likely to be a loss of function of a gene that is disrupted by the breakpoint. Occasionally there may be a dosage effect when a complete working gene is duplicated (Figure 16.7). Microduplications can be detected by careful dosage analysis, by long-range restriction mapping or by finding people who have three alleles of a marker.
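
Dosage analysis, whether used to look for a microdeletion or a microduplication, comes down to comparing a normalized signal with the values expected for one, two or three copies. The following sketch assumes a quantitative PCR readout normalized against a two-copy reference locus and a normal control; the function name, the tolerance window and the example numbers are assumptions chosen for illustration, not a validated assay protocol.

```python
def copy_number_call(test_signal, ref_signal, control_ratio, tolerance=0.2):
    """Classify dosage at a test STS relative to a two-copy reference locus.

    test_signal, ref_signal : quantified PCR product in the patient
    control_ratio           : the same test/ref ratio measured in a normal control
    tolerance               : assumed acceptance window around each expected value
    """
    normalized = (test_signal / ref_signal) / control_ratio
    expected = {
        "deletion (1 copy)": 0.5,
        "normal (2 copies)": 1.0,
        "duplication (3 copies)": 1.5,
    }
    for call, value in expected.items():
        if abs(normalized - value) <= tolerance:
            return call, normalized
    return "uninterpretable", normalized


# Hypothetical readings for three individuals.
for test, ref in [(48, 100), (101, 100), (150, 100)]:
    print(copy_number_call(test, ref, control_ratio=1.0))
```

In practice the spread of replicate measurements would determine how wide the tolerance can safely be; the default used here is arbitrary.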

15.3.4. Chromosomal deletions and translocations assisted positional cloning of the dystrophin gene

Duchenne muscular dystrophy (DMD, MIM 310200) was a major test-bed for positional cloning methods. Years of careful investigation of the pathological changes in affected muscle had failed to reveal the biochemical basis of DMD. In the early 1980s, several groups competed to clone the DMD gene, using different approaches. The pioneering work of these groups, overcoming formidable technical difficulties to clone an unprecedented gene, was probably the major inspiration for most subsequent positional cloning efforts. This work has been well reviewed by Worton and Thompson (1988).

The DMD gene was localized by linkage analysis and through X-autosome translocations

Figure 15.9. Nonrandom X inactivation occurs in female DMD patients with Xp21-autosome translocations

The translocation is balanced, but the X chromosome breakpoint disrupts the dystrophin gene. X inactivation is random, but cells which inactivate the translocated X die because of lethal genetic imbalance. The embryo develops entirely from cells where the normal X is inactivated, leading to a woman with no functional dystrophin gene. The resulting failure to produce any dystrophin causes DMD.

The DMD locus was mapped to Xp21 by linkage to a restriction fragment length polymorphism as long ago as 1982 (the first disease to be so mapped). Additional confirmation of this localization came from studies of rare affected females. These women, about 20 of whom have been described worldwide, occur sporadically in families with no history of DMD, and there is no evidence that they have inherited a conventional DMD mutation from either parent. Instead, they all carry balanced X-autosome translocations. Although each woman has a different autosomal breakpoint, and many different autosomes are involved, the X chromosome breakpoint is always at Xp21. The pathogenesis results from an unusual mechanism. X inactivation is random, but those cells in which the der(X) translocation chromosome was inactivated suffer genetic imbalances and die. In the cells that survive, the normal X is inactive, whereas the active der(X) does not produce any dystrophin because the translocation breakpoint has disrupted the dystrophin gene (Figure 15.9).

Isolation of the DMD gene by subtraction cloning

Kunkel's group in Boston used DNA from a boy ‘BB’ (Section 16.8.1) who had DMD and a cytogenetically visible Xp21 deletion. A technically very difficult subtraction cloning procedure (Section 15.2.3) was used to isolate clones from normal DNA that corresponded to sequences deleted in BB. Individual DNA clones in the subtraction library were then used as probes in Southern blot hybridization against DNA samples from normal people and DMD patients. One clone, pERT87-8, detected deletions in DNA from about 7% of cytogenetically normal DMD patients. It also detected polymorphisms that were shown by family studies to be tightly linked to DMD. These results showed that pERT87-8 was located much closer to the DMD gene than any previously isolated clones (in fact it was within the gene, in intron 13). Other nearby genomic probes were isolated by chromosome walking and used to screen muscle cDNA libraries. Given the low abundance of dystrophin mRNA and, as we now know, the small size and widely scattered location of the exons, finding cDNA clones was far from easy, but eventually clones were identified, and subsequently the whole remarkable dystrophin gene (see Figure 8.13) was characterized.

Isolation of the DMD gene by cloning a translocation breakpoint

While Kunkel's group was working on subtraction cloning, Worton's group in Toronto was successful with a different approach. One of the affected women described above had an X;21 translocation with a breakpoint in the short arm of chromosome 21. Knowing that 21p is occupied by arrays of repeated rRNA genes (Section 8.2.1), Worton's group prepared a genomic library and set out to find clones containing both rDNA and X chromosome sequences. This led to isolation of XJ (X junction) clones which, in a similar way to Kunkel's pERT87-8 probe, detected deletions and polymorphisms. XJ turned out to be located in intron 17 of the dystrophin gene.

15.3.5. Linkage disequilibrium was an important aid to positional cloning of the cystic fibrosis gene

In 1985, studies of affected sib-pairs (see Figure 12.2) showed that the gene for CF (MIM 219700) was linked to a protein polymorphism of the enzyme paraoxonase. At that time, the chromosomal location of the paraoxonase gene was not known (this illustrates one of the big advantages of using DNA rather than protein polymorphisms for mapping). A rapid mapping effort located the paraoxonase gene to chromosome 7, and a variety of DNA markers were used to show that CF mapped to 7q31-q32. The MET oncogene was established as a proximal flanking marker, and an anonymous clone D7S8 as a distal marker.

Figure 15.10. Identification of the CF gene involved laborious chromosome walking and chromosome jumping techniques

Starting from the flanking markers MET (proximal) and D7S8 (distal), an intervening region of about 500 kb was intensively mapped. Chromosome walking was used to identify overlapping λ and cosmid clones (short thin and long thick horizontal lines, respectively, above the restriction map). Chromosome jumping steps (color arcs) facilitated this process. After several false starts, the overlapping E4.3 and H1.6 clones, which contained evolutionarily conserved sequences (as detected by zoo blotting; see Figure 10.21), were used to isolate a cognate cDNA clone. The cDNA clone was then used to map back to λ genomic clones and the gene was shown to contain 24 exons. Gaps remained, however (e.g. between exons III and IV). The full structure of the gene was later shown to comprise 27 exons. Verification of the gene's involvement in CF was obtained by demonstrating patient-specific mutations (see text). Reproduced with permission from Rommens et al. (1989) Science, 245, 1059–1065. Copyright 1989 American Association for the Advancement of Science.

Despite an intensive world-wide search, no CF patients have been discovered with translocation, inversion or deletion breakpoints at 7q31-q32, nor did any microdeletions emerge during the progress of the research. Without large-scale mutations to help, identifying the CF gene proved an exceedingly arduous task, requiring extensive genetic mapping and exhaustive molecular characterization of the candidate region. In those days, before the Human Genome Project, generating clones covering the region between the flanking markers D7S8 and MET was a major effort (Figure 15.10). The techniques used for this task included:

  • Chromosome walking (Section 10.3.2). All this work was conducted before human YAC libraries were available, and so the chromosome walking used cosmid and phage λ libraries. Thus, individual steps were only about 10–20 kb, and frustrated researchers talked about chromosome crawling.

  • Chromosome microdissection. Using micromanipulation techniques, a small chromosomal segment can be physically cut out from an individual chromosome in a spread on a microscope slide (Edstrom et al., 1987). Extremely fine needles, or a laser beam, are used to perform the microdissection in a series of cells. The excised fragments, typically representing a single chromosomal band, are collected, pooled and the DNA extracted for use in constructing DNA libraries (Ludecke et al., 1989). This technical tour de force has been useful for generating DNA clones from several disease-associated chromosomal regions. However, microdissection libraries are difficult to construct; contamination by extraneous DNA has been a problem and the clone complexity (the number of different DNA sequences) has often been poor. Now that large-scale YAC and BAC mapping has made clones from all chromosomal regions available, microdissection is no longer necessary for clone generation.

  • Chromosome jumping. This now obsolete technique used circularization of large DNA fragments from the region of interest to hop from one genomic clone to another located several hundred kilobases away (Poustka et al., 1987). Successful jumps in the candidate region provided new start points for chromosome walking.

Eventually most of the region was encompassed by DNA clones. Considering the formidable labor required to do this in 1985-88 makes one appreciate the impact of Human Genome Project resources on present-day positional cloning efforts. Linkage disequilibrium then provided valuable clues about the location of the CF gene. The mutation rate is very low and CF is maintained in the population largely by heterozygote advantage (Box 3.6). CF disease chromosomes in apparently unrelated people often derive from a distant common ancestor. The original flanking markers D7S8 and MET show only extremely weak disequilibrium, but some of the newer markers such as KM19 and XV2.c generated from clones within the candidate region showed strong disequilibrium with CF (Table 12.1). Linkage disequilibrium data can be hard to interpret (see Figure 12.6) but, in the case of CF, a gradient of steadily increasing disequilibrium pointed quite effectively to the location of the 5′ end of the gene.
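
To make the idea of a disequilibrium gradient concrete, the sketch below computes three standard pairwise measures (D, D′ and r²) from haplotype counts for a disease locus and one marker. The function name and the counts are hypothetical; they are not the CF data of Table 12.1, and the original studies may have used other statistics.

```python
def linkage_disequilibrium(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, D_prime, r_squared) from counts of the four haplotypes.

    A/a : disease and normal alleles; B/b : the two marker alleles.
    """
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total
    p_B = (n_AB + n_aB) / total
    D = p_AB - p_A * p_B  # departure from random association
    if D >= 0:
        D_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        D_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    D_prime = D / D_max if D_max > 0 else 0.0
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r_squared = D * D / denom if denom > 0 else 0.0
    return D, D_prime, r_squared


# Hypothetical counts of chromosomes carrying each haplotype.
print(linkage_disequilibrium(n_AB=90, n_Ab=10, n_aB=20, n_ab=80))
```

Repeating the calculation for each marker in map order and looking for the region where the values rise and peak is one simple way to read a gradient of the kind described above.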

Eventual isolation of the CFTR gene, unaided by large-scale mutations, required extensive screening of libraries for the elusive cDNA. A further small difficulty was encountered in producing convincing evidence that the gene eventually cloned was indeed the site of mutations causing CF. Because of the powerful linkage disequilibrium, it was expected that most CF mutant chromosomes in the population would share a great deal of ancestral sequence. Therefore, showing that a particular sequence change (the F508del mutation, Table 17.3) was present on 70% of CF chromosomes did not prove in a wholly convincing manner that CFTR was the CF gene, still less that F508del caused CF. F508del could have been simply a neutral variant inherited along with CF on the ancestral disease chromosome, especially since the sequence change left the reading frame intact.

The F508del mutation is present in the heterozygous state in 3–4% of phenotypically normal individuals (we would now identify them as CF carriers). The fact that F508del homozygotes were always severely affected was persuasive but not conclusive - F508del could have been in more or less complete linkage disequilibrium with the real CF mutation. Biochemical and pathological knowledge was important, in showing that the CFTR gene encoded an ion channel, and that the pathogenesis of CF was ultimately caused by defective regulation of chloride ion transport across apical membranes. The subsequent identification of minority disease alleles like G542X, where the expected effect on gene expression was more obviously deleterious, provided further confirmation that the true disease locus had been identified.
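
The carrier figure quoted above can be related to an allele frequency and an expected homozygote frequency under Hardy-Weinberg assumptions. This is a rough sketch only: it assumes random mating and treats F508del as a single allele in an otherwise normal population, and the function names are illustrative rather than part of the original analysis.

```python
import math


def allele_frequency(carrier_freq):
    """Solve 2q(1 - q) = carrier_freq for the smaller root q."""
    return (1 - math.sqrt(1 - 2 * carrier_freq)) / 2


def homozygote_frequency(carrier_freq):
    """Expected frequency of homozygotes, q squared."""
    q = allele_frequency(carrier_freq)
    return q * q


for carriers in (0.03, 0.04):
    q = allele_frequency(carriers)
    one_in = 1 / homozygote_frequency(carriers)
    print(f"carriers {carriers:.0%}: q = {q:.4f}, homozygotes about 1 in {one_in:.0f}")
```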

15.3.6. Positional cloning of the gene causing branchio-oto-renal syndrome was achieved by large-scale sequencing of clones from the candidate region

The autosomal dominant branchio-oto-renal (BOR) syndrome (MIM 113650: branchial fistulas, malformation of the external and inner ear with hearing loss; hypoplasia or absence of kidneys) was mapped to 8q13 following a clue from an affected patient who had a rearrangement of chromosome 8. The initial interval of 7 cM was refined to an interval of 470–650 kb by further mapping and delineation of a chromosomal deletion in the patient mentioned. P1 and PAC clones were isolated by screening genomic libraries with markers within or close to the candidate region, and gaps in the contig were filled by chromosome walking. The minimum tiling path (the smallest number of clones from which a contig can be built) across the candidate region involved 3 P1 and 3 PAC clones.
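
Choosing a minimum tiling path is an interval-covering problem, and a greedy selection gives the smallest set of clones. The sketch below assumes each clone is described by start and end coordinates in kb on a common map and ignores the minimum overlap needed to join clones in practice; the function name, clone names and coordinates are hypothetical, not those of the BOR contig.

```python
def minimum_tiling_path(clones, start, end):
    """Greedy selection of the fewest clones covering the interval [start, end].

    clones : list of (name, clone_start, clone_end) in kb
    Returns the chosen clone names in map order, or None if a gap remains.
    """
    path = []
    covered_to = start
    remaining = sorted(clones, key=lambda c: c[1])
    i = 0
    while covered_to < end:
        best = None
        # Among clones starting at or before the covered point,
        # keep the one that extends furthest.
        while i < len(remaining) and remaining[i][1] <= covered_to:
            if best is None or remaining[i][2] > best[2]:
                best = remaining[i]
            i += 1
        if best is None or best[2] <= covered_to:
            return None  # gap in the contig: more walking is needed
        path.append(best[0])
        covered_to = best[2]
    return path


# Hypothetical P1 and PAC clones across a 550-kb candidate region.
contig = [
    ("P1-a", 0, 90), ("PAC-b", 60, 200), ("P1-c", 80, 150),
    ("PAC-d", 180, 330), ("P1-e", 300, 400), ("PAC-f", 310, 480),
    ("P1-g", 450, 560),
]
print(minimum_tiling_path(contig, 0, 550))
```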

It was decided to isolate genes from the contig by large-scale sequencing of plasmid subclones. Checking the sequence against the EMBL and GenBank databases revealed homology between part of the sequence obtained and the Drosophila developmental gene eyes absent (eya). Further genomic sequence was then translated and searched for homologies to the deduced amino acid sequence of the Drosophila eya gene. This identified seven putative exons showing 69% identity and 88% similarity at the amino acid level to the putative eya protein (note that amino acid sequences are more likely to show detectable homology than nucleotide sequences). The human cDNA was then isolated from a 9-week total fetal mRNA library, and seven mutations in the gene, named EYA1, were demonstrated in 42 unrelated BOR patients. Expression studies in the mouse demonstrated expression consistent with the developmental abnormalities of BOR syndrome.
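
Searching translated genomic sequence against a protein relies on conceptual translation in all six reading frames. The sketch below uses the standard genetic code and a simple ungapped identity score; the function names are illustrative, the example sequence is arbitrary, and real searches of this kind use alignment programs with substitution matrices, which is also how a similarity figure (as opposed to identity) would be obtained.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built by enumerating codons in T, C, A, G order.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate from the first base; unrecognized codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return a dict of translations for frames +1..+3 and -1..-3."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


def percent_identity(a, b):
    """Percent identical residues in an ungapped, equal-length alignment."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


# Arbitrary short sequence, for illustration only.
for frame, protein in six_frame_translations("ATGGCCTAA").items():
    print(frame, protein)
```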

This work (Abdelhak et al., 1997) unambiguously identified EYA1 as the human BOR gene, but the function of the gene product was not identified, nor was it clear why the Drosophila phenotype consists of reduced or absent compound eyes. As so often with positional cloning, identifying the gene was just the start of understanding the syndrome.

15.3.7. Identification of the gene causing Treacher Collins syndrome illustrates positional cloning in its purest form

Treacher Collins syndrome (MIM 154500) is an autosomal dominant disorder of craniofacial development with a variable phenotype including abnormalities of the external and middle ears, hypoplasia of the mandible and zygomatic complex, and cleft palate. Linkage was initially established to markers at 5q31-q34. Because the markers in that region at the time (1991-92) were not very informative, new microsatellites were isolated and used to refine the candidate region to 5q32-q33.1. A combined genetic and radiation hybrid map was constructed across this interval, and by 1994 the team had assembled a YAC contig. This was converted to a cosmid contig, and cDNA library screening and exon trapping were used to generate a transcript map. At least seven genes were identified in the critical region. Further rounds of marker isolation and crossover analysis produced a confusing picture of overlapping recombinations, but led eventually to isolation of an incomplete candidate cDNA from a placental library. Northern blotting and zoo blotting showed that the gene was widely expressed and conserved across species, but database searches revealed no significant homologies. The exon-intron structure was determined, and mutation analysis demonstrated five different mutations in unrelated patients.

Isolation of the TCOF1 gene (Treacher Collins Syndrome Collaborative Group, 1996) illustrates positional cloning in its purest form. No relevant chromosomal abnormalities were found (there were four patients with TCS who had chromosomal translocations or deletions, but markers from each of the breakpoints showed no linkage to TCS in family studies, so presumably these cases were all coincidental). There is no linkage disequilibrium - not surprisingly, since 60% of cases are new mutations. The candidate region is gene-rich, so there were many possible candidates, and the gene eventually identified had no features that made it a particularly promising candidate. The gene product is now believed to be a nucleolar phosphoprotein that is involved in some aspect of nucleolar trafficking. Why mutations should cause Treacher Collins syndrome is not yet known.

15.4. Positional candidate strategies identify candidate genes by a combination of their map position and expression, function or homology

The position-independent and positional cloning strategies described in the last two sections are in principle quite separate, but in reality most disease genes have been identified by a positional candidate strategy, using a combination of positional and nonpositional information.

15.4.1. Criteria for selecting a candidate gene: expression pattern and function

From the list of genes that map to the candidate region, one would look for a gene that shows appropriate expression and/or appropriate function. Alternatively or additionally, as discussed in the following sections, one would look for homology to some other human or non-human gene that is known to have appropriate expression or function.

Appropriate expression pattern

A good candidate gene should have an expression pattern consistent with the disease phenotype. Expression need not be restricted to the affected tissue, because there are many examples of widely expressed genes causing a tissue-specific disease (Section 16.7.1), but the candidate should at least be expressed at the time and in the place where the pathology is seen. For example, neural tube defects are likely to involve genes that are expressed during the 3rd–4th weeks of human embryonic development, shortly before or during neurulation. The expression of candidate genes can be tested by RT-PCR or Northern blotting, but the best method for revealing the exact expression pattern is in situ hybridization against mRNA in tissue sections (Figure 5.17). For embryonic stages this is most conveniently performed using sections of mouse embryos at the equivalent developmental stages (7.5–9.5 embryonic days in the case of neural tube defects). The expression in human embryos is likely to be very similar, although this cannot be guaranteed, and centralized resources of staged human embryo sections have been established to allow the equivalent analyses to be performed where necessary on human embryos.

Appropriate function

Studying the pathology of a genetic disease only rarely gives information precise enough to allow position-independent identification of the disease gene, but it often allows good positional candidates to be selected. Rhodopsin and fibrillin, mentioned above, provide typical examples.

The gene for human rhodopsin was cloned in 1984, and it was mapped to 3q21-qter in 1986. Among disorders involving hereditary retinal degeneration are the various forms of retinitis pigmentosa (RP), which are marked by progressive visual loss resulting from clumping of the retinal pigment. Although rhodopsin was a possible candidate gene for some forms of RP, it was only one of many proteins that were known to be involved in phototransduction. However, in 1989, linkage analyses in a large Irish RP family mapped their disease gene to 3q in the neighborhood of rhodopsin. Rhodopsin was now a serious candidate gene, and patient-specific mutations in the rhodopsin (RHO) gene were identified within a year (see OMIM entry 180380).

The phenotype of Marfan syndrome (MFS, MIM 154700: excessive growth of long bones; lax joints; dislocation of lenses; liability to aortic aneurysms) suggested some abnormality in a connective tissue component. Linkage analysis mapped the MFS gene to 15q, and subsequently the gene for the connective tissue protein fibrillin was localized to 15q21.1 by in situ hybridization. Fibrillin was then an obvious positional candidate, and patient-specific mutations were soon demonstrated (see McKusick 1991 for discussion of the background).

Candidate genes may also be suggested on the basis of a close functional relationship to a gene known to be involved in a similar disease. The genes could be related by encoding a receptor and its ligand, or other interacting components in the same metabolic or developmental pathway. For example, some of the genes implicated in Hirschsprung disease were identified using this logic, as described in Section 19.5.2.

15.4.2. Criteria for selecting a candidate gene: homology to a relevant human gene or EST

Preliminary identification of transcripts often comes from matching genomic sequence generated from the candidate region against unmapped ESTs in the databases. Finding a match suggests the presence of an exon in the genomic DNA, and may provide leads to identifying more of the gene or to guessing its function. Sometimes a gene in the candidate region turns out to be closely related to a known disease gene. If the diseases are similar, the new gene becomes a compelling candidate. Members of multigene families can be fairly readily assessed as positional candidates on this basis. For example, after fibrillin was identified as the gene mutated in Marfan syndrome, a second fibrillin gene was shown to map to 5q. This therefore became a candidate location for other Marfan-like phenotypes. A related condition, congenital contractural arachnodactyly (MIM 121050) was mapped to 5q and shown to be caused by mutations in the FBN2 gene (Putnam et al., 1995).

Table 15.3

Functions of some genes implicated in human non-syndromic sensorineural hearing loss
| Locus | Chromosomal location | Gene | Function |
| DFNA1 | 5q31 | DIAPH1 | Cytokinesis |
| DFNA2 | 1p34 | KCNQ4 | K+ ion channel |
| DFNA9 | 14q12-q13 | COCH | Uncertain |
| DFNA12, DFNB21 | 11q22-q24 | TECTA | Structural component of tectorial membrane |
| DFNB1 | 13q12 | Cx26 (GJB2) | Intercellular gap junction |
| DFNB2 | 11q13.5 | MYO7A | Myosin (molecular motor) |
| DFNB3 | 17p11.2 | MYO15 | Myosin (molecular motor) |
| DFNB4 | 7q31 | PDS | Chloride-iodide transporter |
| DFNB9 | 2p23 | OTOF | Vesicle-membrane fusion |
| DFN3 | Xq21.1 | POU3F4 | Transcription factor |
| 12S RNA | Mitochondrion | 12S ribosomal RNA | Energy generation by mitochondria |

Many structurally and biochemically unrelated genes are implicated in hereditary hearing loss. This diversity means that we cannot predict the nature of the many as yet unidentified deafness genes in a position-independent way. However, knowing some of the genes and pathways involved helps prioritize positional candidates.

Selecting candidate disease genes by homology is often more successful using model organisms as described below than by considering human paralogs. Many diseases show extensive locus heterogeneity, and it is not usually the case that the different genes involved are related in any obvious way, either structurally or functionally (Table 15.3). One must always bear in mind the complexity of every human organ, tissue and developmental process. Each requires very many different genes and pathways, with the result that mutations in many unrelated genes can produce similar phenotypes.

15.4.3. Criteria for selecting a candidate gene: homology to a relevant gene in a model organism

Over the past decade it has become increasingly clear how far structural and functional homologies extend across even very distantly related species. Virtually every mouse gene has an exact human counterpart, and the same is probably true of other less well explored mammalian species. More surprisingly, extensive homologies can be detected between human genes and genes in zebrafish, Drosophila, the nematode worm Caenorhabditis elegans and even yeast. Even more than gene sequences, pathways are often highly conserved, so that knowledge of a developmental or control pathway in Drosophila or yeast can be used to predict the likely working of human pathways - although mammals often have several parallel paths corresponding to a single path in lower organisms. Being able to predict the possible protein-protein interactions governing a pathway assists experimental attempts to identify novel disease genes, for example by yeast two-hybrid screening (Section 20.4.1). Perhaps the most striking demonstration of the conservation of function between distant organisms is contained in a paper by Rincón-Limas et al. (1999), which shows that a wingless Drosophila mutant called apterous can be corrected by transfection with the human apterous homolog Lhx2 - in other words, humans have a fully functional gene for making flies grow wings!

A very powerful means of selecting good candidates from among a set of human genes is therefore to search the databases for evidence of homologous genes in these well-studied model organisms, as described in Section 20.1.4. If a homolog is detected, one can see what is known about its function. Such data might include the pattern of expression and the phenotype of mutants. Additionally in the mouse, though not in nonmammalian species, the likely chromosomal location of the human ortholog can often be predicted from mouse mapping data, allowing prediction of as yet uncharacterized positional candidates.

Using clues from the mouse

Human-mouse phenotypic homologies provide particularly valuable clues towards identifying human disease genes for several reasons:

  • Of the genetically well explored organisms, mice are much the closest to humans in evolutionary terms. Therefore orthologous gene mutations are more likely to produce similar phenotypes in humans and mice than in humans and lower organisms. Not infrequently, however, despite the close evolutionary relationship, the phenotypes are considerably different (see Section 21.4.6).

  • Mouse phenotypic information often translates readily into positional candidate information. Backcross mapping (see Box 15.4) allows quick and accurate mapping in the mouse. Thus most mouse mutants have been mapped, or can easily be mapped, and there is considerable conservation of synteny between humans and mice (see Figure 15.11). Once a chromosomal location for a gene of interest is known in mouse or humans, it is usually (though not always) possible to predict the likely location of that gene in the other species. Figure 15.12 shows the example of a mouse chromosomal region where prediction of the corresponding human location is easy, and an adjoining region where prediction could be difficult. A database of human-mouse map relationships is maintained at http://www.ncbi.nlm.nih.gov/Omim/Homology (DeBry & Seldin, 1996). Table 15.4 shows an example of using mouse information to predict possible locations of human deafness genes.

    Figure 15.11. Conservation of synteny between human and mouse genetic maps

    The Oxford Grid for human and mouse shows an overall comparison of the two species. Each cell shows the number of orthologous genes mapping to particular chromosomes in mouse and man. Cells are color coded according to the number of orthologs mapped. The nonrandom distribution is obvious (Blake et al., 1999). Reproduced with permission from the Mouse Genome Database, Mouse Genome Informatics, The Jackson Laboratory, Bar Harbor, Maine (http://www.informatics.jax.org/) (25 May 1999).

    Figure 15.12. Detail of one mouse chromosome

    The map shows the part of mouse chromosome 1 lying 48 to 62 cM from the centromere. Mapped mouse genes are shown in color. Where a human ortholog has been mapped, its human map location is shown. The distal part of the mouse map shows good conservation of synteny with the distal long arm of human chromosome 2, but in the proximal part of the mouse chromosome the relationship to human chromosomes is more complex. Mouse Chromosome 1 Linkage Map reproduced with permission from the Mouse Genome Database, Mouse Genome Informatics, The Jackson Laboratory, Bar Harbor, Maine (http://www.informatics.jax.org/) (25 May 1999). See Blake et al., 1999.

    Table 15.4. Using mapped mouse mutations to predict possible locations of genes causing hearing loss in humans

    Mouse mutant  Mouse map location  Predicted human map location(s)  Possible human ortholog
    Wo            Chr 1, 25 cM        2q12; 2q31-q33; 6p11
    Sp (Pax3)     Chr 1, 54 cM        2q36                             WS1
    dr            Chr 1, 89 cM        1q23-q31                         DFNA7
    Lp            Chr 1, 92 cM        1q21-q23                         DFNA7
    fi            Chr 2, 34 cM        2q14-q37
    kr            Chr 2, 91 cM        20q11-q13
    Sig           Chr 6, 1 cM         7q21-q31                         DFNB4
    Hoxa1         Chr 6, 25 cM        7p14-p15                         DFNA5
    nv            Chr 7, 4 cM         19q13                            DFNA4
    hb            Chr 7, 65 cM        16p11-p13; 10q24-q26
    Fgf3          Chr 7, 70 cM        11q13.3
    ha1           Chr 10, 56 cM       12q22-q24.1
    Fu            Chr 17, 12 cM       6p21
    Tw            Chr 18, 9 cM        18q11-q12
    sy            Chr 18, 36 cM       5q31-q32; 18p11-q21
    Dc            Chr 19, 6 cM        11q12-q13                        DFNA2, DFNA11

    The left hand columns list a series of mouse mutants with structural abnormalities of the inner ear, and their location on the mouse map (chromosome, distance from centromere in cM). Using mouse-human synteny relationships, as illustrated in Figure 15.11, the likely locations of the human orthologs are predicted. In some cases, a human deafness gene has been mapped to the corresponding location (right hand column). Such information allows high-resolution mapping and gene cloning efforts in the two species to reinforce each other. Data from Hereditary Hearing Loss Homepage, http://dnalab-www.uia.ac.be/dnalab/hhh. Reproduced by kind permission of the MRC Institute of Hearing Research, Nottingham, UK.

  • Exon sequences are usually well conserved between orthologous human and mouse genes. Once a human or mouse gene is isolated, probes or primers can be designed to screen DNA libraries from the other species in order to identify the orthologous gene.
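The positional-candidate reasoning in the second point above can be sketched in a few lines of Python. This is only an illustration: the data are a subset of rows transcribed from Table 15.4, while the names `nearby_predictions` and `candidate_human_regions` and the 5 cM search window are assumptions introduced here, not part of any database interface.

```python
from typing import NamedTuple


class MouseMutant(NamedTuple):
    symbol: str
    chromosome: str        # mouse chromosome
    position_cm: float     # distance from the centromere in cM
    human_regions: tuple   # predicted human map location(s)
    human_locus: str       # mapped human deafness locus, "" if none listed


# Subset of rows transcribed from Table 15.4.
TABLE_15_4 = [
    MouseMutant("Wo", "1", 25.0, ("2q12", "2q31-q33", "6p11"), ""),
    MouseMutant("dr", "1", 89.0, ("1q23-q31",), "DFNA7"),
    MouseMutant("Lp", "1", 92.0, ("1q21-q23",), "DFNA7"),
    MouseMutant("nv", "7", 4.0, ("19q13",), "DFNA4"),
    MouseMutant("hb", "7", 65.0, ("16p11-p13", "10q24-q26"), ""),
    MouseMutant("Fgf3", "7", 70.0, ("11q13.3",), ""),
]


def nearby_predictions(chromosome, position_cm, window_cm=5.0):
    """Return tabulated mutants within window_cm of a mouse map position, nearest first."""
    hits = [
        m for m in TABLE_15_4
        if m.chromosome == chromosome
        and abs(m.position_cm - position_cm) <= window_cm
    ]
    return sorted(hits, key=lambda m: abs(m.position_cm - position_cm))


def candidate_human_regions(chromosome, position_cm, window_cm=5.0):
    """Collect the distinct human regions predicted by the neighbouring mutants."""
    regions = []
    for mutant in nearby_predictions(chromosome, position_cm, window_cm):
        for region in mutant.human_regions:
            if region not in regions:
                regions.append(region)
    return regions


if __name__ == "__main__":
    print(candidate_human_regions("1", 90.0))
    print(candidate_human_regions("7", 68.0))
```

The first query falls between two mutants whose predicted human locations both lie on 1q, so the prediction is coherent. The second returns regions on three different human chromosomes, which is the kind of situation in which conservation of synteny alone does not pin down a human location.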

An example: Waardenburg syndrome and the Splotch mouse

Waardenburg syndrome type 1 (WS1, MIM 193500) illustrates the value of human-mouse comparisons. A pedigree of this autosomal dominant but variable condition was shown in Figure 3.5C. The characteristic pigmentary abnormalities and hearing loss of WS1 are caused by absence of melanocytes from the affected parts (including the inner ear, where melanocytes are required in the stria vascularis of the cochlea in order for normal hearing to develop). Linkage analysis, aided by the description of a chromosomal abnormality in an affected patient, localized the gene for WS1 to the distal part of 2q. At this point, a likely mouse homolog emerged. The Splotch (Sp) mouse mutant has pigmentary abnormalities caused by patchy absence of melanocytes, and the Sp gene maps to a linkage group on mouse chromosome 1 that shows extensive conservation of synteny with distal human 2q.

Consideration of the pathogenesis provided further evidence that WS1 and Sp are orthologous genes. The root cause of the phenotype lies in the embryonic neural crest, because melanocytes originate in the neural crest and migrate out to their final locations during embryonic development. Although heterozygous Sp mice resemble WS1 patients, homozygous Sp mice have neural tube defects, and have been studied for many years as a model for human neural tube defects.

A positional candidate gene emerged when the murine Pax-3 gene was mapped to the vicinity of the Sp locus. Pax-3 is one of a family of genes (PAX genes) that encode transcription factors containing the paired box DNA-binding motif, and it is expressed in mouse embryos in the developing nervous system, including the neural crest. The sequence of Pax-3 was almost identical to the limited sequence which had previously been published for an unmapped human genomic clone, HuP2. Such observations prompted mutation screening of Pax-3 and HuP2 and led to identification of mutations in Splotch mice and humans with WS1 (reviewed by Strachan & Read, 1994). As the underlying genes, Pax-3 and HuP2, were clearly orthologs, the HuP2 gene was subsequently renamed PAX3.

Limitations of human-mouse homologies

Though enormously valuable as a guide to human-mouse homologies, conservation of synteny is not always sufficient to allow identification of positional candidates. This is illustrated by the mi/MITF locus, mutations in which cause another variant of Waardenburg syndrome, WS type 2 (MIM 193510) in man. The mi locus on mouse chromosome 6 has long been recognized as a likely candidate homolog of some form of WS, but attempts to predict the location of the human homolog failed. Genes mapping close to mi have human homologs mapping to 3p25, 3q21-q24 and 10q11.2. Each of these locations was tested for linkage to WS2, with negative results. Not until the human mi homolog, MITF, had been cloned and mapped by FISH to 3p14, and WS2 in humans had been independently mapped by linkage to the same location (see Figure 11.6), was there sufficient evidence to begin a successful search for mutations in man (Tassabehji et al., 1994).

Using clues from mutant phenotypes in lower organisms

Phenotypic homologies with lower organisms have mostly been used after a human gene and its orthologs have been identified, as part of the exploration of the gene's function. In a few cases, however, homologies have been used prospectively, to help identify a human disease gene. A particularly successful example was the identification of the MSH2 and MLH1 genes that are mutated in certain forms of hereditary colon cancer. These genes were identified after phenotypic resemblances led researchers to suspect that the cancers might be caused by mutations in the human homologs of yeast mismatch repair genes (Section 18.7.1).

Figure 15.13. A systematic database search for Drosophila mutants that could be positional candidates for human disease genes (Banfi et al., 1996)

An attempt to use Drosophila phenotypic information systematically to identify positional candidates for human diseases is the DRES database (Banfi et al., 1996). The dbEST human EST database was searched for matches to Drosophila genes with known mutant phenotypes. Sixty-six novel matches were detected. The map position of each EST was determined by both FISH and radiation hybrid mapping (Figure 15.13).
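The selection step of such a screen can be sketched as follows. This Python fragment is a hypothetical illustration, not the DRES pipeline itself: it assumes that similarity-search hits between Drosophila genes and human ESTs have already been computed elsewhere and reduced to (gene, EST, score) rows. The identifiers, scores and cutoff are invented placeholders.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    fly_gene: str    # Drosophila gene with a known mutant phenotype
    human_est: str   # matching human EST identifier
    score: float     # similarity score (higher is better)


def best_est_per_gene(hits, min_score):
    """Keep, for each Drosophila gene, the highest-scoring EST at or above min_score."""
    best = {}
    for hit in hits:
        if hit.score < min_score:
            continue  # too weak to count as a match
        current = best.get(hit.fly_gene)
        if current is None or hit.score > current.score:
            best[hit.fly_gene] = hit
    return best


# Placeholder rows: identifiers and scores are made up for this example.
EXAMPLE_HITS = [
    Hit("geneA", "EST001", 210.0),
    Hit("geneA", "EST002", 95.0),
    Hit("geneB", "EST003", 40.0),
    Hit("geneC", "EST004", 130.0),
]

if __name__ == "__main__":
    selected = best_est_per_gene(EXAMPLE_HITS, min_score=100.0)
    for gene in sorted(selected):
        print(gene, selected[gene].human_est, selected[gene].score)
```

Each EST retained in this way would then go forward to mapping, so that its chromosomal position can be compared with the map positions of human disease loci.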

Sequence homologies with lower organisms

Just as in the general Human Genome Project, genomic or cDNA sequences generated in the exploration of a candidate disease region are routinely checked against data from lower organisms. A match of genomic sequence suggests the presence of a gene, and if the match is to a known gene in the lower organism, it suggests the nature of the human gene. The example of branchio-oto-renal syndrome, above, shows how useful this approach can be.

15.5. Confirming a candidate gene

Ultimately all approaches used to identify disease genes generate candidates, which then have to be tested individually to see if there is compelling evidence that mutations in them do cause the disease in question. Demonstrating that a candidate gene is likely to be the disease locus can be done by various means.

15.5.1. Principles of mutation screening to confirm a candidate gene

However promising a candidate gene appears to be for a disease, it must be shown to be mutated in affected people. Mutation screening entails testing DNA samples from a sizable panel of unrelated patients and control individuals. The first step is to design pairs of primers for PCR-amplifying portions of the coding DNA, either from a genomic DNA sample (if the exon/intron boundaries are known), or from cDNA generated by RT-PCR from mRNA of patients (Figure 6.5). The products of individual amplification reactions are then subjected to one or more of the mutation screening procedures described in Section 17.1.4 that are designed to detect unknown point mutations.

Mutation screening is often straightforward for diseases where a good proportion of patients carry independent mutations (typically, severe early onset dominant or X-linked recessive disorders) and where the disease phenotype results from loss of function of the gene. As explained in Section 16.4, if the correct gene is tested, a panel of DNA samples from unrelated patients will usually show a variety of different mutations, including some with an obviously deleterious effect on gene expression (nonsense mutations, frameshift mutations, etc.). Figure 16.1 shows an example from the work on Waardenburg syndrome mentioned above. If the identified mutations are absent from control samples, then the conclusion that the gene being tested really is the locus for the disease becomes almost inescapable. However, other circumstances may make the identification of mutations and the interpretation of mutation screening more difficult.

  • Unsuspected locus heterogeneity. Often mutations in several different genes can give almost identical phenotypes, so that a panel of unselected patient samples may have pathogenic mutations in different genes. If the candidate gene being tested is responsible for only a small proportion of cases, most samples will show no mutation in that gene. Ideally, one would use only samples from families with demonstrated linkage to the candidate region, but this may be impracticable. Family sizes for recessive and some dominant disorders are often too small for independent linkage analyses, and in some severe dominant disorders most patients present as sporadic cases without a family history.

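  • Mutational homogeneity. This problem was discussed above in connection with CF, where most apparently unrelated patients carry the same mutation, F508del. A single recurrent change is weaker evidence than a variety of independent mutations, because it could be a neutral variant carried on the ancestral disease chromosome. See Sections 16.3.3 and 17.1.2 for further discussion and examples of mutational homogeneity.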

  • Mutations are not unambiguously pathogenic. It may be difficult to identify missense mutations as being pathogenic as opposed to being neutral variants with no major effect on gene expression. Some guidelines to help decide whether a sequence change is pathogenic are given in Box 16.4.

  • Mutations may be hard to find. Large genes are more difficult to screen for mutations, and sometimes mutations seem very hard to find. Current examples include the NF1 and PKD1 genes that are mutated in neurofibromatosis 1 (MIM 162200) and adult polycystic kidney disease (MIM 173900) respectively. Mutations in the F8C gene causing severe hemophilia A seemed to be hard to find, until it was discovered that most of the missing mutations were large inversions which disrupted the gene (see Figure 9.20) but were not detected by the PCR methods normally used.
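As a rough illustration of how a screening result might be sorted into the categories used above (nonsense, frameshift, missense), the following Python sketch compares a reference coding sequence with a patient sequence. The function names, the toy sequences and the simple classification rules are assumptions made for this example. It handles only an in-frame coding sequence under the standard genetic code, it ignores splice-site and regulatory changes, and it is no substitute for the guidelines in Box 16.4.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate an in-frame coding sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        protein.append(aa)
        if aa == "*":
            break
    return "".join(protein)


def classify_change(reference, patient):
    """Classify a patient coding sequence relative to the reference sequence."""
    if reference == patient:
        return "no change"
    if (len(patient) - len(reference)) % 3 != 0:
        return "frameshift"
    if len(patient) != len(reference):
        return "in-frame insertion/deletion"
    ref_protein, pat_protein = translate(reference), translate(patient)
    if ref_protein == pat_protein:
        return "silent"
    if len(pat_protein) < len(ref_protein) and pat_protein.endswith("*"):
        return "nonsense"
    # Everything else, including loss of the normal stop codon, lands here.
    return "missense"


# Toy sequences invented for the example (not taken from any real gene).
REFERENCE = "ATGGAATTTCAGTGGTAA"
PATIENTS = {
    "patient 1": "ATGGAATTTTAGTGGTAA",  # CAG changed to TAG
    "patient 2": "ATGGTATTTCAGTGGTAA",  # GAA changed to GTA
    "patient 3": "ATGGAATTCCAGTGGTAA",  # TTT changed to TTC
    "patient 4": "ATGGAATTCAGTGGTAA",   # one base deleted
    "patient 5": "ATGGAACAGTGGTAA",     # one whole codon deleted
}

if __name__ == "__main__":
    for name, sequence in PATIENTS.items():
        print(name, classify_change(REFERENCE, sequence))
```

The classification is deliberately crude. A missense or in-frame result from such a check says nothing about pathogenicity, which is exactly the difficulty raised in the third point above.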

15.5.2. Once a candidate gene is confirmed, the next step is to understand its function

Identifying the gene involved in a genetic disease opens the way to several lines of investigation. The ability to identify mutations should immediately lead to improved diagnosis and counseling, as described in Chapter 17. Understanding the molecular pathology (why the mutated gene causes the disease; see Chapter 16) may also lead to insight into related diseases and, it is hoped, eventually to more effective treatment, perhaps including gene therapy (Chapter 22).

A second line of enquiry concerns the normal function of the gene product. For example, until the Duchenne muscular dystrophy (DMD) gene was identified there was no knowledge of the way the contractile machinery of muscle cells is anchored to the sarcolemma. Analysis of functional domains and motifs (Mushegian et al., 1997) and the search for experimentally manipulable homologs in the mouse, fruit fly, nematode and yeast are powerful tools for this work. This large topic is covered in Chapter 20; a foretaste of the sort of information that can be generated by database searching can be seen in the following information, taken from the Hereditary Hearing Loss Homepage database (http://dnalab-www.uia.ac.be/dnalab/hhh) and describing the gene that was identified by positional cloning of an autosomal dominant hearing loss locus (DFNA1) in one large Costa Rican family (see OMIM entry 124900):

The human DFNA1 protein product DIAPH1, mouse p140mDia, and Drosophila diaphanous are homologs of Saccharomyces cerevisiae protein Bni1p. The proteins are highly conserved overall. The genes encoding these proteins are members of the formin gene family, which also includes the mouse limb deformity gene, Drosophila cappuccino, Aspergillus nidulans gene sepA, and S. pombe genes fus1 and cdc12. These genes are involved in cytokinesis and establishment of cell polarity. All formins share Rho-binding domains in their N-terminal regions, polyproline stretches in the central region of each sequence, and formin-homology domains in the C-terminal region.

Further reading
Silver LM (1995) Mouse Genetics: Concepts and Applications. Oxford University Press, Oxford.
References
Abdelhak S, Kalatzis V, Heilig R, et al. A human homologue of the Drosophila eyes absent gene underlies Branchio-Oto-Renal (BOR) syndrome and identifies a novel gene family. Nature Genet. (1997); 15: 157–164.
Antoch M P, Song E-J, Chang A-M, et al. Functional identification of the mouse circadian Clock gene by transgenic BAC rescue. Cell. (1997); 89: 655–667.
Banfi S, Borsani G, Rossi E, et al. Identification and mapping of human cDNAs homologous to Drosophila mutant genes through EST database searching. Nature Genet. (1996); 13: 167–174. Website: http://www.tigem.it/LOCAL/drosophila/dros.html
Blake J A, Richardson J E, Davisson M T, Eppig J T. Mouse Genome Database Group. The Mouse Genome Database (MGD): Genetic and genomic information about the laboratory mouse. Nucleic Acids Res. (1999); 27: 95–98.
Carter S A, Bryce S D, Munro C S, et al. Linkage analyses in British pedigrees suggest a single locus for Darier disease and narrow the location to the interval between D12S105 and D12S129. Genomics. (1994); 24: 378–382.
Collins F S. Positional cloning moves from the perditional to the traditional. Nature Genet. (1995); 9: 347–350.
Copeland N G, Jenkins N A. Development and applications of a molecular genetic linkage map of the mouse genome. Trends Genet. (1991); 7: 113–118.
DeBry R W, Seldin M F. Human/mouse homology relationships. Genomics. (1996); 33: 337–351. An updated electronic version is at http://www.ncbi.nlm.nih.gov/Omim/Homology/
Edstrom J-E, Kaiser R, Rohme D. Microcloning of mammalian metaphase chromosomes. Meth. Enzymol. (1987); 151: 503–516.
The European Consortium on MEN1. Linkage disequilibrium studies in multiple endocrine neoplasia type 1 (MEN1). Hum. Genet. (1997); 100: 657–665.
Gitschier J, Wood W I, Goralka T M, et al. Characterization of the human factor VIII gene. Nature. (1984); 312: 326–330.
Koob M D, Benzow K A, Bird T D, Day J W, Moseley M L, Ranum L P. Rapid cloning of expanded trinucleotide repeat sequences from genomic DNA. Nature Genet. (1998); 18: 72–75.
Koob M D, Moseley M L, Schut L J, et al. Untranslated CTG expansion causes a novel form of spinocerebellar ataxia (SCA8). Nature Genet. (1999); 21: 379–384.
Lisitsyn N A. Representational difference analysis: finding the differences between genomes. Trends Genet. (1995); 11: 303–307.
Ludecke H J, Senger G, Claussen U, Horsthemke B. Cloning defined regions of the human genome by microdissection of banded chromosomes and enzymatic amplification. Nature. (1989); 338: 348–350.
McKusick V A. The defect in Marfan syndrome. Nature. (1991); 352: 279–281.
Mushegian A R, Bassett D E, Boguski M S, Bork P, Koonin E V. Positionally cloned human disease genes: patterns of evolutionary conservation and functional motifs. Proc. Natl Acad. Sci. USA. (1997); 94: 5831–5836.
Poustka A, Pohl T M, Barlow D P, Frischauf A M, Lehrach H. Construction and use of human chromosome jumping libraries from NotI-digested DNA. Nature. (1987); 325: 353–355.
Probst F J, Fridell R A, Raphael Y, et al. Correction of deafness in shaker-2 mice by an unconventional myosin in a BAC transgene. Science. (1998); 280: 1444–1447. See also the accompanying paper: Wang A, Liang Y, Fridell R A, et al. Association of unconventional myosin MYO15 mutations with human nonsyndromic deafness DFNB3. Science. (1998); 280: 1447–1451.
Putnam E A, Zhang H, Ramirez F, Milewicz D M. Fibrillin-2 (FBN2) mutations result in the Marfan-like disorder, congenital contractural arachnodactyly. Nature Genet. (1995); 11: 456–458.
Rincón-Limas D E, Lu C-H, Canal I, et al. Conservation of the expression and function of apterous orthologs in Drosophila and mammals. Proc. Natl Acad. Sci. USA. (1999); 96: 2165–2170.
Robson K J H, Chandra T, MacGillivray R T A, Woo S L C. Polysome immunoprecipitation of phenylalanine hydroxylase mRNA from rat liver and cloning of its cDNA. Proc. Natl Acad. Sci. USA. (1982); 79: 4701–4705.
Rommens J M, Januzzi M C, Kerem B-S, et al. Identification of the cystic fibrosis gene: chromosome walking and jumping. Science. (1989); 245: 1059–1065.
Schalling M, Hudson T J, Buetow K H, Housman D E. Direct detection of novel expanded trinucleotide repeats in the human genome. Nature Genet. (1993); 4: 135–139.
Schutte M, da Costa L T, Hahn S A, et al. Identification by representational difference analysis of a homozygous deletion in pancreatic carcinoma that lies within the BRCA2 region. Proc. Natl Acad. Sci. USA. (1995); 92: 5950–5954.
Strachan T, Read A P. PAX genes. Curr. Opin. Genet. Dev. (1994); 4: 427–438.
Strathdee C A, Gavish H, Shannan W R, Buchwald M. Cloning of cDNAs for Fanconi's anaemia by functional complementation. Nature. (1992); 356: 763–767.
Swaroop A, Xu J, Agarwal N, Weissman S M. A simple and efficient cDNA library subtraction procedure: isolation of human retina-specific cDNA clones. Nucleic Acids Res. (1991); 19: 1954.
Tassabehji M, Newton V E, Read A P. Waardenburg syndrome type 2 caused by mutations in the human microphthalmia (MITF) gene. Nature Genet. (1994); 8: 251–255.
The Treacher Collins Syndrome Collaborative Group. Positional cloning of a gene involved in the pathogenesis of Treacher Collins syndrome. Nature Genet. (1996); 12: 130–136.
Watson C J, Gaunt L, Evans G, Patel K, Harris R, Strachan T. A disease-associated germline deletion maps the type 2 neurofibromatosis (NF2) gene between the Ewing sarcoma region and the leukaemia inhibitory factor locus. Hum. Mol. Genet. (1993); 2: 701–704.
Worton R G, Thompson M W. Genetics of Duchenne muscular dystrophy. Annu. Rev. Genet. (1988); 22: 601–629.
Yasunaga S, Grati M, Cohen-Salmon M, et al. A mutation in OTOF, encoding otoferlin, a FER-1-like protein, causes DFNB9, a nonsyndromic form of deafness. Nature Genet. (1999); 21: 363–369.