15.1. Principles and strategies in identifying disease genes
Few areas have moved as fast as human disease gene identification. Before 1980, very
few human genes had been identified as disease loci. The few early successes
involved a handful of diseases with a known biochemical basis where it was possible
to purify the gene product. In the 1980s, advances in recombinant DNA technology
allowed a new approach, positional cloning, sometimes given the rather meaningless
label ‘reverse genetics’. The number of disease genes identified
started to increase, but these early successes were hard won, heroic efforts. With
the advent of PCR for linkage studies and mutation screening, it all became much
easier. Now the human and other genome projects have made available a vast range of
resources - maps, clones, sequences, expression data and phenotypic data.
Identifying novel disease genes has become commonplace and is currently occurring on
a weekly basis. Soon the landscape will change again, as the complete human genome
sequence becomes available, so that all genes will in theory be accessible through
databases.
Figure 15.1. How to identify a human disease gene
There is no single pathway to success, but the key step is to arrive at a
plausible candidate gene, which can then be tested for mutations in
affected people. Note the interplay between clinical work, laboratory
benchwork and computer analysis. Database searching is becoming more and
more crucial as information from genome projects accumulates.
Figure 15.1 summarizes some of the routes
that have been followed to identify human disease genes. If the figure seems
complicated, that is because there is no standard procedure for gene identification.
All pathways converge on mutation testing in a candidate gene, but there is not one
single entry point, and there is no unique pathway to the candidate gene. For
discussion of the principles, we can divide the methods into those that do not
require us to know the chromosomal location of the disease locus (Section 15.2)
and those that depend on this knowledge (Section 15.3). In reality, groups
trying to identify a disease gene will use several parallel approaches, with the
emphasis shifting from one candidate and one line of attack to another in response
to clues that emerge from the team's own results or new external data, and to new
possibilities arising from technical developments. Most genes are identified by
defining a candidate gene on the basis of both its chromosomal location and its
properties (the positional candidate approach;
Section 15.4).
15.2. Position-independent strategies for identifying disease genes
Historically, the first disease genes were identified by pure position-independent
methods, simply because no relevant mapping information existed and the techniques
were not available to generate it. However, methods based on sequence homology or
functional complementation, which are in principle position-independent, work much
better when applied to predefined candidate subchromosomal regions than to
the whole human genome. Homology searches in particular become exceedingly powerful
when combined with positional information. Their use in
‘cybercloning’ and positional candidate approaches to gene
identification is considered in Section 15.4.
15.2.1. Identification of a disease gene through knowledge of the protein
product
If the biochemical basis of an inherited disease is known, it may be possible to
purify and partially characterize some of the gene product. If this can be done,
gene-specific oligonucleotides or specific antibodies can be generated that can
be used to identify the gene.
Use of gene-specific oligonucleotides
This approach relies on the ability to isolate sufficient protein product to
permit amino acid sequencing. Specific peptide bonds in the protein product
can be cleaved using proteolytic enzymes such as trypsin (cuts at the
carboxyl end of lysine or arginine residues) or reagents such as cyanogen
bromide (cuts at the carboxyl end of methionine residues). The amino acid
sequence of each resulting peptide can be determined by chemical sequencing.
This involves a repeated series of chemical reactions in an automated amino
acid sequencer. In each cycle, the peptide is exposed to a chemical that
covalently bonds to the N-terminal amino acid and cleaves it off, allowing
it to be identified by chromatography. Sequence overlaps identify
overlapping peptides, enabling longer sequences to be assembled.
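The cleavage rules and the overlap step described above can be sketched in a few lines of Python. This is a toy illustration only: the 12-residue sequence is invented, the function names are hypothetical, and known exceptions to the cleavage rules (for example, trypsin cuts poorly before proline) are ignored.

```python
def cleave_after(protein, residues):
    """Cut a protein on the carboxyl side of every residue in `residues`."""
    peptides, current = [], ""
    for aa in protein:
        current += aa
        if aa in residues:
            peptides.append(current)
            current = ""
    if current:  # C-terminal peptide that does not end in a cleavage residue
        peptides.append(current)
    return peptides


def merge_overlap(left, right, min_overlap=1):
    """Join two peptides if the end of `left` matches the start of `right`."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return left + right[k:]
    return None  # no overlap of the required length


toy_protein = "GAMKWLRDEMHK"                    # invented sequence
tryptic = cleave_after(toy_protein, "KR")       # trypsin: after Lys or Arg
cnbr = cleave_after(toy_protein, "M")           # cyanogen bromide: after Met
print(tryptic)
print(cnbr)

# A cyanogen bromide peptide bridges two tryptic peptides, so the three
# sequenced fragments can be assembled into one longer sequence.
assembled = merge_overlap(merge_overlap(tryptic[0], cnbr[1]), tryptic[2])
print(assembled)
```

In practice a single-residue overlap, as between the first two fragments here, would be weak evidence; longer overlaps from several independent digests are needed before an assembly can be trusted.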
The resulting amino acid sequence is inspected to identify regions containing
amino acids with minimal codon degeneracy (e.g. methionines and tryptophans
are uniquely encoded by AUG and UGG codons, respectively). Once suitable
regions have been identified, combinations of oligonucleotides are
synthesized to correspond to all possible codon permutations. The resulting
mix of partially degenerate oligonucleotides is labeled and used as a probe
to screen cDNA libraries. As only one of the oligonucleotides in the mix
will correspond to the authentic sequence, it is important to keep the
number of different oligonucleotides low so as to increase the chance of
identifying the correct target. Once a suitable cDNA clone is isolated, it
can be used to screen a genomic DNA library in order to isolate genomic DNA
clones for full characterization of the gene.
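The choice of a low-degeneracy region, and the enumeration of every oligonucleotide needed to cover it, can be illustrated with a short Python sketch. The peptide below is invented, the function names are hypothetical, and the oligonucleotides are written as sense-strand DNA (T in place of U); a real probe design would also weigh codon usage, length and melting temperature.

```python
from itertools import product
from math import prod

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODONS = {}
for codon, aa in zip(("".join(c) for c in product(BASES, repeat=3)), AMINO_ACIDS):
    CODONS.setdefault(aa, []).append(codon)


def degeneracy(peptide):
    """Number of different DNA sequences that could encode the peptide."""
    return prod(len(CODONS[aa]) for aa in peptide)


def best_window(peptide, length):
    """Return the window of `length` residues with the lowest degeneracy."""
    windows = [peptide[i:i + length] for i in range(len(peptide) - length + 1)]
    best = min(windows, key=degeneracy)
    return best, degeneracy(best)


def all_oligos(peptide):
    """Every codon permutation for the peptide (the degenerate probe mix)."""
    return ["".join(codons) for codons in product(*(CODONS[aa] for aa in peptide))]


toy_peptide = "LSRMWKDHFE"                 # invented sequence
window, n = best_window(toy_peptide, 5)
print(window, n)                           # region chosen for the probe
for oligo in all_oligos(window):
    print(oligo)
```

Windows rich in Leu, Ser or Arg (six codons each) are avoided automatically, while those containing Met or Trp are favored, which mirrors the reasoning in the text.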
Figure 15.2. The factor VIII gene, the locus for hemophilia A, was cloned
by product-directed oligonucleotide screening of DNA libraries
The figure illustrates one way in which factor VIII DNA clones
were obtained, following cleavage of purified porcine factor
VIII protein into peptides and amino acid sequencing. The
resulting sequences were inspected to identify regions with low
codon redundancy. The top panel shows a sequence of 15 amino
acids from His8 to Met22 in one of the peptides, with the
possible codon permutations above (with variable nucleotides in
color). This sequence was selected because of the generally low
codon redundancy: two amino acids, Trp and Met, are specified by
a single codon and another seven can be specified by just two
alternative codons. A partially degenerate 45-bp antisense
oligonucleotide probe was prepared and used as a primary
hybridization probe to screen a porcine genomic DNA library, and
thereafter secondary screening used a 15-bp antisense degenerate
oligonucleotide probe corresponding to the sequence from Trp18
to Met22. The porcine factor VIII genomic clone was then used to
screen human DNA libraries to identify the human gene (see Gitschier et
al., 1984).
Identification of the hemophilia A gene (MIM
306700) followed this approach.
Biochemical analysis of serum samples from patients had previously
identified a genetic deficiency of blood clotting factor VIII. Purification
of factor VIII from plasma is not straightforward, partly because it is
present in very low quantities. One approach involved isolating small
quantities of factor VIII from large volumes of pig blood by standard
protein purification techniques. The purified product allowed the production
of gene-specific oligonucleotide probes for library screening (Figure 15.2).
Library screening by hybridization can be tedious when a complex mixture of
oligonucleotides is used, because the results are greatly influenced by the
hybridization conditions. A more rapid alternative is to use partially
degenerate oligonucleotides as PCR primers. One early strategy was to use
two such sets corresponding to amino acid sequences from different regions
of the protein as primers. By using total cDNA from a suitable source as a
template, a specific cDNA could be amplified spanning the codons from the
two different regions (Figure 6.11).
However, this approach demands considerable prior information about the
protein sequence. A more convenient alternative is to prepare cDNA and
ligate it to vector DNA molecules. PCR can then be performed using one
vector-specific primer and one primer composed of a panel of partially
degenerate oligonucleotides.
Use of specific antibodies
If even small amounts of the normal protein product can be isolated, specific
antibodies can be raised. The protein, or a peptide derived from it, is
conjugated to a strongly immunogenic carrier protein such as keyhole
limpet hemocyanin, and the conjugate is injected into a rabbit or
mouse. B lymphocytes that recognize the protein or peptide of interest
receive help from T lymphocytes activated by the carrier, leading to production of
antibodies. Mouse or rabbit antibodies that are specific for the desired
protein or peptide can then be used in various ways to identify a
corresponding cDNA.
An early approach was to enrich for mRNA encoding the protein product in a
cell-free in vitro protein synthesis system. This was how
the gene that is mutated in phenylketonuria (MIM 261600) was identified in 1982 (Robson et al.,
1982). Phenylketonuria was known to be caused by a lack of the enzyme
phenylalanine hydroxylase (PAH). PAH enzyme was purified from rat liver, a
known site of expression. Specific antibodies were raised and used to
immunoprecipitate polysomes containing PAH mRNA. The
purified mRNA was converted to cDNA, and a specific rat cDNA clone was
isolated. This was then used as a probe to isolate the human cDNA from a
human liver cDNA library.
This type of approach has been superseded by antibody screening of cDNA
expression libraries. cDNA from a relevant tissue is cloned into an
expression vector. Inserts within the recombinant DNA clones are expected to
be expressed within the host cell to produce foreign polypeptides.
Appropriate antibodies can then be used to screen colony filters from the
library to identify clones encoding the product of interest.
15.2.2. Identification of a disease gene through knowledge of the DNA
sequence
This most usually arises when the researcher is considering what diseases might
be caused by mutations in a particular known gene. Alternatively, a novel human
disease gene may be identified by homology, either to a paralogous human gene
(Section 15.4.2) or to an
orthologous gene in another species (Section
15.4.3). An interesting application of DNA sequence knowledge is the
attempt to clone genes containing expanded trinucleotide repeats. As shown in
Box 16.7, expanded
trinucleotide repeats are known to cause several inherited neurological
disorders. Often these disorders show anticipation - that is, the disease
presents at an earlier age and with increased severity in successive
generations. If a disease under investigation shows any of these features, it
may be worth screening for triplet repeat expansions. The repeat expansion
detection method of Schalling et
al. (1993) permits detection of expanded repeats in
unfractionated genomic DNA of affected patients, and methods have been developed
for cloning any expanded repeats detected (Koob et al., 1998). This approach was recently used
in a completely position-independent way to identify a novel repeat expansion
that causes a form of spinocerebellar ataxia (SCA8) (Koob et al., 1999).
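Once sequence is available for a candidate clone, long runs of a repeated triplet can be picked out computationally. The following Python sketch is a sequence-based illustration only; it is not the repeat expansion detection method, which works on unfractionated genomic DNA at the bench. The function name, the threshold of five units and the test sequence are all assumptions made for the example.

```python
import re


def find_triplet_repeats(seq, min_units=5):
    """Return (start, unit, n_units) for each run of a repeated triplet.

    Homopolymer runs such as AAAAAA are skipped. The reported unit depends
    on the reading frame in which the run is first encountered, so CAG,
    AGC and GCA runs may describe the same tract.
    """
    pattern = re.compile(r"([ACGT]{3})\1{%d,}" % (min_units - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        if len(set(unit)) == 1:
            continue  # ignore mononucleotide runs
        hits.append((match.start(), unit, len(match.group(0)) // 3))
    return hits


toy_seq = "TT" + "CAG" * 6 + "TT" + "A" * 15   # invented sequence
print(find_triplet_repeats(toy_seq))
```

Raising `min_units` restricts the report to longer tracts, which is the relevant direction when looking for expansions rather than normal-range alleles.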
15.2.3. Identification of a disease gene through knowledge of its normal
function
Functional cloning depends on expressing random fragments of human DNA in a cell
or organism, and isolating any fragments that cause a desired change in
function. The usual approach is a functional complementation assay, seeking
fragments that correct a defect in the recipient. Examples include the
following:
- Functional complementation in mammalian cell lines. For
example, a variety of mammalian cell lines have been generated that are
deficient in DNA repair. They show abnormal responses
following exposure to UV irradiation or chemical mutagens. These mutant
cells, or alternatively cells derived from patients with a DNA repair
deficiency, can be transformed by fragments of normal human DNA or human
chromosomes in order to produce a repair-competent phenotype. This was
the way in which cDNA clones for the Fanconi's anemia group C (FACC; MIM
227645) gene were first obtained (Strathdee
et al., 1992). Similarly, the ability of
transferred chromosomes or clones to correct the uncontrolled growth of
tumor cell lines has been used to help locate and then identify tumor
suppressor genes (Chapter 18).
- Functional complementation in yeast. Innumerable yeast
mutants have been defined, and genetic analysis in yeast is particularly
sophisticated because of the ease of performing homologous
recombination. Some proteins have been so highly conserved during
evolution that the human protein can complement a yeast mutant defective
in the corresponding protein. This approach has been successful in
identifying the human genes that specify various enzymes of purine and
pyrimidine biosynthesis, and also some crucially important transcription
factors.
- Functional complementation in transgenic mice.
Occasionally a mouse gene has been identified by constructing transgenic
mice, using nonmutant BAC clones from a candidate region, crossing them
to mice carrying the mutation, and checking which transgene corrects the
defect. This strategy was first used to identify a clock gene (Antoch et al.,
1997), and more recently as a crucial step in identifying the
human DFNB3 deafness gene (Probst et al., 1998; Figure 15.3).
DFNB3 had been mapped to a location that
corresponded in the mouse to the location of the deafness gene
shaker-2. Transgenic mice were constructed using
BACs from the shaker-2 candidate region, and a BAC that
corrected the shaker-2 phenotype was identified. This
led to identifying the shaker-2 gene as an
unconventional myosin, MYO15. The human MYO15 gene was
then isolated based on its close homology to the mouse gene, its
position within the DFNB3 candidate region confirmed,
and mutations demonstrated in DFNB3 affected
people.
- Isolation of activated oncogenes. This is done by their
effect on the growth of mouse 3T3 fibroblasts (see Figure 18.4).
Figure 15.3. Functional complementation in transgenic mice as a tool for
identifying a human disease gene
The shaker-2 mouse mutation was identified by finding a
wild-type clone that corrected the defect. Human families with a similar
phenotype that mapped to the corresponding chromosomal location proved
to have mutations in the orthologous gene.
If a patient has a disease because of a chromosomal deletion, identifying genes
present in a normal person but absent in the patient would pinpoint the disease
gene. More generally, genes implicated in a disease may be expressed to a
different degree in patients and controls (this depends on the type of mutation:
missense mutations alter the function but not the expression of the mRNA, but
many other types of mutation result in low or absent levels of mRNA - see Chapter 16). Methods that identify
the differential presence or expression of a gene therefore provide a possible
route to position-independent identification of a disease gene, although more
usually they are one arm of a positional candidate strategy.
Subtraction cloning
Subtraction cloning can be used to select clones of the DNA that
is deleted in an individual with a chromosomal deletion. Two DNA samples are
compared, a normal ‘test’ DNA and a deleted
‘driver’ DNA. The test DNA is mixed with a large excess
of driver DNA, denatured and re-annealed. By one means or another, double
helices are selected in which both strands consist of test DNA. These
preferentially represent sequences in the test DNA that are absent from the
driver DNA. The most celebrated application of subtraction cloning was in
identifying the dystrophin (DMD) gene (Section 15.3.4). The test DNA came from a normal
individual, and the driver DNA from a patient who had a deletion including
the dystrophin gene. The test clones remaining after subtraction were
enriched for DNA derived from the region missing in the affected
patient.
In the historically important case of dystrophin, subtraction cloning
directly yielded clones from the desired unknown gene, but this was an
exceptional success. Subtraction cloning is a very difficult technique,
which has seldom succeeded with genomic DNA. A more recent approach to the
same problem is representational difference analysis (RDA; Lisitsyn, 1995). RDA uses several
means, including selective PCR, to enrich sequences present in the test but
not the driver DNA, and has been used to isolate regions amplified or
deleted in cancer cells (Schutte
et al., 1995). Genes in such regions are
positional candidate tumor suppressor genes or oncogenes (Chapter 18).
Subtractive hybridization works better with mRNA than with genomic DNA, and both
subtractive hybridization and mRNA differential display (Section 20.2.4) have been used to identify
differentially expressed transcripts. In the future, expression arrays
(Figure 20.6) could be efficient
tools for such analyses, although they will not contain any novel
uncharacterized genes. Generally the role of these techniques in gene
identification is not to isolate the disease gene directly, but to produce
collections of sequences that may be position-independent candidates for a
disease because of their pattern of expression. These can then be screened
further by positional criteria.
Subtractive hybridization has also been used in several projects to produce
libraries of cDNAs specifically expressed in a certain tissue, by
subtraction of a tissue-specific cDNA library against one or more
nonspecific libraries (Swaroop et
al., 1991). Such subtraction libraries are a good
source of candidate genes for diseases affecting just the tissue in
question. Several disease genes have been identified by screening an
appropriate subtraction library for sequences mapping to the same
chromosomal location as the disease (see, for example, Yasunaga et al., 1999). This is an
example of the power of the positional candidate approach (Section 15.4).
15.3. In positional cloning, disease genes are identified using only knowledge of their
approximate chromosomal location
Table 15.1. Examples of disease genes identified by positional cloning
At the opposite pole to the position-independent gene identification strategies,
positional cloning identifies a disease gene based on no information except its
approximate chromosomal location. The first successful gene identifications based
only on positional information, published in 1986, marked a triumphant new era for
human molecular genetics. One after another, genes for important disorders such as
Duchenne muscular dystrophy, cystic fibrosis, Huntington disease, adult polycystic
kidney disease, colorectal cancer, breast cancer, etc. were isolated. However,
positional cloning can be desperately hard work, and by 1995 only about 50 inherited
disease genes had been identified by this approach (Collins, 1995). The four examples
discussed in Sections 15.3.4–15.3.7 (Table 15.1) illustrate typical approaches to
positional cloning. Each of
these disease genes was identified by first mapping the disease as closely as
possible in affected families, and then identifying a novel candidate gene and
showing that patients had mutations in that gene. In all cases except Treacher
Collins syndrome, there was some clue to help identify the correct candidate
gene.
Figure 15.4. The difficult path from candidate region to gene
One researcher's view of the frustrations of positional cloning. Image
courtesy of Dr Richard Smith, University of Iowa.
Positional cloning projects recapitulate the Human Genome Project in miniature. In
both cases the researchers first produce high-resolution genetic and physical maps,
then build clone contigs before identifying and sequencing transcripts. The only
parts specific to positional cloning are selecting the candidate region and
identifying pathogenic mutations in patients with the disease in question. Thus
general progress in the Human Genome Project has had an enormous impact on
positional cloning projects. Defining the right candidate region and testing
candidate genes can still be long, hard tasks (as Figure 15.4 makes clear), but most of
the intermediate stages can be achieved by intelligent use of existing
Genome Project data.
15.3.1. The first step in positional cloning is to define the candidate region as
tightly as possible
The methods of mapping mendelian and nonmendelian disease genes have been
described in Chapters 11 and 12, respectively, while the use of
loss of heterozygosity mapping to locate tumor suppressor genes is described in
Section 18.5.3. Even with all the
fruits of the Human Genome Project to hand, positional cloning can still be
laborious and frustrating. More than any other factor, the size of the candidate
region determines the work required, and so every effort is made to define as
small a candidate region as possible. Regions larger than about 1 Mb of DNA
present serious obstacles to positional cloning.
Figure 15.5. Crossover analysis seeks to map a gene by defining flanking
proximal and distal recombinants
The figure shows haplotype analysis in two pedigrees with a
dominantly inherited skin disorder, Darier-White disease (MIM
124200),
which had previously been mapped to 12q. Genotypes in II-1, II-2,
II-4 and II-5 in pedigree (A) are inferred. (A) In this
family the disease gene segregates with the marker haplotype
6-5-2-6-2-2 between D12S84 (proximal) and
D12S79 (distal). A crossover in II-6 shows that
the disease gene must map distal to D12S84. The
positioning of D12S105 is ambiguous because of
presumed homozygosity for allele 5 in I-1 - compare the genotypes
for II-3 and II-6. (B) In this family the disease gene
must lie proximal to D12S129. The combined data
indicate that the Darier's disease gene must map between the
proximal marker D12S84 and the distal marker
D12S129. Reproduced from Carter et al. (1994)
Genomics, 24, 378–382, with
permission from Academic Press.
Often the initial localization from genetic mapping defines a candidate region of
10 Mb or more. The next step is to collect as many families as possible and
establish a dense cover of polymorphic markers across the region. Suitable
markers may be found by database searching, but if this fails then YACs, BACs
and cosmids must be isolated from the candidate region and screened for
polymorphisms. These might be microsatellites or single nucleotide
polymorphisms, and depending on how well developed the physical map of the
region is, the new markers might be localized physically as
sequence tagged sites (STS) on a contig, mapped physically using a radiation
hybrid panel (Section 10.1.3), or mapped genetically in CEPH families (see
Section 11.4.3). For mendelian conditions,
where recombinants can be identified unambiguously, the limit of genetic
resolution is reached when pairs of closely spaced markers define the positions
of the closest recombinations on either side of the disease locus. This is
decided by inspecting individual haplotypes rather than by statistical analysis
(Figure 15.5). When mapping low-penetrance disease susceptibility loci, recombinants
cannot be pinpointed in this way (Section 12.2.2), so all one
can do is sharpen the lod score curve as far as possible by using the biggest
possible dataset.
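For a mendelian condition, the haplotype inspection described above amounts to finding the closest recombinant marker on each side of the disease locus. A minimal Python sketch follows, assuming the markers are already correctly ordered from proximal to distal and that each key individual has been scored as recombinant ("R"), nonrecombinant ("N") or uninformative (missing) at each marker. The marker names M1 to M6, the individuals and the function name are hypothetical.

```python
def candidate_interval(markers, recombinants):
    """Return the (proximal, distal) flanking markers of the candidate region.

    markers      -- marker names ordered from proximal to distal
    recombinants -- one dict per individual mapping marker -> "R" or "N";
                    uninformative markers are simply left out
    """
    proximal, distal = -1, len(markers)
    for person in recombinants:
        rec = [i for i, m in enumerate(markers) if person.get(m) == "R"]
        non = [i for i, m in enumerate(markers) if person.get(m) == "N"]
        if not rec or not non:
            continue  # no crossover observed in this individual
        if max(rec) < min(non):
            # Crossover on the proximal side: gene is distal to the last R marker.
            proximal = max(proximal, max(rec))
        elif min(rec) > max(non):
            # Crossover on the distal side: gene is proximal to the first R marker.
            distal = min(distal, min(rec))
        else:
            raise ValueError("inconsistent pattern: check diagnosis or marker order")
    left = markers[proximal] if proximal >= 0 else None
    right = markers[distal] if distal < len(markers) else None
    return left, right


markers = ["M1", "M2", "M3", "M4", "M5", "M6"]
family_data = [
    {"M1": "R", "M2": "R", "M3": "N", "M4": "N", "M5": "N", "M6": "N"},
    {"M1": "N", "M2": "N", "M3": "N", "M4": "N", "M5": "R", "M6": "R"},
    {"M1": "R", "M3": "N", "M4": "N"},   # M2 uninformative in this person
]
print(candidate_interval(markers, family_data))
```

The sketch raises an error when recombinant and nonrecombinant markers are interleaved, because in practice such a pattern more often signals a misdiagnosis or a wrong marker order than a genuine double recombinant.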
When single recombinants define the boundaries of a region that is to be
searched, it is important to consider possible sources of error (see Section 11.5.1). Meticulous clinical
diagnoses are imperative. Key recombinations are more reliable if they occur in
unambiguously affected people - an unaffected individual may carry a
nonpenetrant disease gene, which can lead to them being misinterpreted as
recombinant when in fact they are nonrecombinant. Sometimes, despite good
positive lod scores, there appear to be recombinants with every marker tried.
This is usually an indication that somebody has been diagnosed wrongly (labeled
as affected when unaffected, or vice versa) or else that the disease gene in one
or more of the families under investigation does not map to the candidate
region. Alternatively, perhaps the markers are wrongly ordered on the genetic
map.
Linkage disequilibrium may allow very high resolution mapping
Linkage disequilibrium
(association at the population level of a particular marker allele with a
disease) can allow genetic mapping to be taken to a very high resolution, as
discussed in Section 12.4. Linkage
disequilibrium has been enormously valuable for guiding positional cloning
(the example of cystic fibrosis is discussed below) but not all diseases
show it (European Consortium on MEN1,
1997). It is seen only when many of the apparently unrelated
affected people in a population in fact derive their disease chromosome from
a shared ancestor. Thus the easiest diseases to map very finely are those
where most affected people carry the same ancestral mutation, and an
ancestral haplotype can be defined, as illustrated in Figure 12.5. In such cases there is a price to be
paid later, when candidate genes are being tested for mutations, because a
diversity of mutations gives a much higher chance of spotting the correct
gene. Plans to identify the genetic factors responsible for susceptibility
to many common diseases rest almost entirely on the hope that association
studies will pinpoint the location of the susceptibility genes. If it were
to turn out that at most susceptibility loci many different variants
predispose to the disease, then the whole endeavor would be in serious
trouble.
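The strength of association between a marker allele and a disease mutation is commonly summarized by the disequilibrium coefficient D and its normalized form D'. The Python sketch below computes both from counts of chromosomes; the counts are invented, the function name is hypothetical, and the pooled chromosomes are treated as if they were a random population sample, which is a simplification.

```python
def linkage_disequilibrium(n_both, n_marker_only, n_disease_only, n_neither):
    """Return (D, D') for a marker allele and a disease mutation.

    Each argument is a count of chromosomes carrying the stated combination.
    D  = P(both) - P(marker allele) * P(disease mutation)
    D' = D divided by the largest magnitude D could take for these frequencies.
    """
    total = n_both + n_marker_only + n_disease_only + n_neither
    p_both = n_both / total
    p_marker = (n_both + n_marker_only) / total
    p_disease = (n_both + n_disease_only) / total
    d = p_both - p_marker * p_disease
    if d >= 0:
        d_max = min(p_marker * (1 - p_disease), (1 - p_marker) * p_disease)
    else:
        d_max = min(p_marker * p_disease, (1 - p_marker) * (1 - p_disease))
    d_prime = d / d_max if d_max > 0 else 0.0
    return d, d_prime


# Invented counts: the marker allele is found on most disease chromosomes.
d, d_prime = linkage_disequilibrium(80, 20, 20, 80)
print(round(d, 3), round(d_prime, 3))
```

A value of D' close to 1 is what would be expected when most disease chromosomes descend from a single ancestral chromosome and little recombination has occurred between marker and mutation.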
15.3.2. Genes within the candidate region can be identified by a combination of
database searching and transcript mapping
As shown in Figure 15.1, known genes from
the candidate region can be found by database searching. If none of the known
genes is a promising candidate, or if after testing no mutations can be found,
then novel genes must be sought. The general methods for identifying unknown
transcribed sequences from within a contig of genomic clones have been discussed
in detail in Section 10.4, and are summarized briefly in Box 15.1.
With the continuing progress of the Human Genome Project, the emphasis has moved
strongly towards identifying genes from the databases. Once the human genome
sequence is completed, it should in theory no longer be necessary to clone any
human gene from scratch. However, present computational methods for identifying
genes using genomic DNA sequence data (Section 10.4.5) are far from perfect, and
for some time to come it may still
be necessary to get one's hands dirty in the laboratory if one wishes to
identify every expressed sequence from a candidate region.
Whenever cDNA libraries are to be screened, the question arises which libraries
should be used. Often the pathology of the disease under study suggests a
particular investigation. For example, when studying a neuromuscular disease it
makes sense to start by screening muscle cDNA libraries. However,
tissue-specific diseases are often caused by malfunction of widely expressed
genes (Section 16.7.1), so if one
library fails it is always worth screening others. Fetal brain is a popular
choice because it has a particularly high number of expressed sequences.
15.3.3. Chromosomal aberrations can provide a useful short-cut to locating a disease
gene
Table 15.2. The first ten years of positional cloning (selected highlights)
| Year | Disease | MIM no. | Location | Gene | Chromosomal abnormality |
|------|---------|---------|----------|------|-------------------------|
| 1986 | Duchenne muscular dystrophy | 310200 | Xp21.3 | DMD | (a) del(X)(p21.3) |
| | | | | | (b) t(X;21)(p21.3;p13) |
| | Retinoblastoma | 180200 | 13q14 | RB | del(13)(q13.1q14.5) |
| 1989 | Cystic fibrosis | 219700 | 7q31 | CFTR | None |
| 1990 | Neurofibromatosis 1 | 162200 | 17q11.2 | NF1 | Balanced translocations |
| | | | | | t(1;17)(p34.3;q11.2) |
| | | | | | t(17;22)(q11.2;q11.2) |
| | Wilms' tumor | 194070 | 11p13 | WT1 | del(11)(p14p13) |
| 1991 | Aniridia | 106210 | 11p13 | PAX6 | t(4;11)(q22;p13) |
| | | | | | del(11)(p13) |
| | Familial polyposis coli | 175100 | 5q21 | APC | del(5)(q15q22) |
| | Fragile-X syndrome | 309550 | Xq27.3 | FMR1 | FRAXA fragile site |
| | Myotonic dystrophy | 160900 | 19q13.3 | DMPK | None |
| 1993 | Huntington's disease | 143100 | 4p16 | HD | None |
| | Tuberous sclerosis 2 | 191092 | 16p13 | TSC2 | Microdeletions in candidate region |
| | von Hippel-Lindau disease | 193300 | 3p25 | VHL | Microdeletions in candidate region |
| 1994 | Achondroplasia | 100800 | 4p16 | FGFR3 | None |
| | Early-onset breast/ovarian cancer | 113705 | 17q21 | BRCA1 | None |
| | Polycystic kidney disease | 173900 | 16p13.3 | PKD1 | t(16;22)(p13.3;q11.21) |
| | | 601313 | | | |
| 1995 | Spinal muscular atrophy | 253300 | 5q13 | SMN1 | None |
| | | 600354 | | | |
Researchers are constantly on the alert for special patients or observations that
will short-cut the labor of pure positional cloning. Cancer studies in
particular have relied on investigations of chromosomal abnormalities (Chapter 18),
and identification of many other disease genes has also been greatly helped by
finding patients with a chromosomal abnormality (Table 15.2).
Alert clinicians play a crucial role in identifying such patients (Box 15.2).
Translocations and inversions
If a person with an apparently balanced translocation or inversion is
phenotypically abnormal, there are three possible explanations:
- the finding is coincidental;
- the rearrangement is not in fact balanced - there is an unnoticed
loss or gain of material;
- one of the chromosome breakpoints causes the disease.
A chromosomal break can cause a loss-of-function phenotype if it disrupts the
coding sequence of a gene, or separates it from a nearby regulatory region.
Alternatively, it could cause a gain of function, for example by splicing
regulatory sequences from one gene to distal coding sequences from another
gene, causing inappropriate expression (this is rare in inherited disease
but common in tumorigenesis, see Chapter 18). In either case, the breakpoint
provides a valuable clue to the exact physical location of the disease gene. The clue is
valuable but not infallible: sometimes breakpoints can alter expression of a
gene located hundreds of kilobases away by affecting the structure of
large-scale chromatin domains (Box 15.3).
Figure 15.6. Using fluorescence in situ hybridization to
define a translocation breakpoint
(A) Cytogenetically defined translocation
t(8;16)(p22;q12). (B) Physical map of part of the
breakpoint region in a normal chromosome 8, showing approximate
locations of seven clones. (C) Results of
successive FISH experiments. The breakpoint is within the
sequence represented in clone D. This result would normally be
confirmed using clones from chromosome 16.
The precise location of a chromosome breakpoint is most easily defined by
using FISH (Figure 15.6).
Alternatively, different DNA clones from the relevant region can be used in
turn to see if any can identify patient-specific restriction fragments, by
hybridizing each clone to the patient's genomic DNA which has been digested
with a rare-cutter restriction endonuclease and subjected to pulsed field
gel electrophoresis (Section 10.2.2).
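The logic of successive FISH experiments can be written down compactly. In the Python sketch below, each clone from the physical map is recorded with the set of derivative chromosomes on which it gives a signal; a clone that hybridizes to both derivatives spans the breakpoint. The clone names, the signal patterns and the function name are assumptions made for illustration, loosely modelled on a t(8;16) translocation studied with seven ordered clones.

```python
def locate_breakpoint(ordered_results):
    """Locate a translocation breakpoint from ordered FISH results.

    ordered_results -- list of (clone, signals) in map order, where `signals`
                       is the set of derivative chromosomes showing a signal
    Returns ("within", [clones]) if any clone gives a split signal,
    ("between", [clone1, clone2]) if adjacent clones lie on different
    derivatives, or ("not found", []) otherwise.
    """
    spanning = [clone for clone, signals in ordered_results if len(signals) == 2]
    if spanning:
        return "within", spanning
    for (c1, s1), (c2, s2) in zip(ordered_results, ordered_results[1:]):
        if s1 != s2:
            return "between", [c1, c2]
    return "not found", []


# Hypothetical results for seven clones ordered along normal chromosome 8.
fish_results = [
    ("A", {"der(8)"}), ("B", {"der(8)"}), ("C", {"der(8)"}),
    ("D", {"der(8)", "der(16)"}),            # split signal
    ("E", {"der(16)"}), ("F", {"der(16)"}), ("G", {"der(16)"}),
]
print(locate_breakpoint(fish_results))
```

If no clone gives a split signal, the breakpoint lies in the gap between the two adjacent clones that map to different derivatives, and further clones from that interval would be needed.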
Deletions and duplications
Chromosomal deletions cause abnormalities due to loss of genes in males with
X chromosome deletions, and reduced levels of dosage-sensitive gene products
in people heterozygous for autosomal deletions (Figure 16.9). Cytogenetically visible deletions
involve many megabases of DNA. Such large deletions often produce rather
complex, nonspecific phenotypes, but if specific elements can be seen, the
deletion may provide a pointer to a broad subchromosomal localization. In
the past, subtraction cloning was attempted using such deletions, as
described below for the dystrophin gene.
Small-scale deletions (microdeletions) are valuable for positional cloning.
Deletions of tens or hundreds of kilobases of DNA are not uncommon in some
disorders (Section 16.8.1). Often
they are generated by unequal recombination between flanking repeat
sequences (Section 9.5.4).
Microdeletions can be identified by several methods.
Duplications have not played any significant role in positional cloning. If
unequal recombination between flanking repeats is a major cause of
microdeletions, microduplications should be equally frequent, but in fact
they are rarely observed. Probably most duplications are overlooked because
they are nonpathogenic. If a duplication is associated with an abnormal
phenotype, the cause is most likely to be a loss of function of a gene that
is disrupted by the breakpoint. Occasionally there may be a dosage effect
when a complete working gene is duplicated (Figure 16.7). Microduplications can be detected by careful
dosage analysis, by long-range restriction mapping or by finding people who
have three alleles of a marker.
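Two of these checks are simple enough to sketch in Python. The sketch below is illustrative, not a validated protocol: the function names, the allele sizes and the tolerance of 0.2 are assumptions, and the dosage ratio is assumed to be a patient/control signal ratio normalised so that two copies give 1.0.

```python
def has_three_alleles(genotype):
    """True if an individual shows three distinct alleles at one marker."""
    return len(set(genotype)) >= 3


def classify_dosage(ratio, tolerance=0.2):
    """Classify a patient/control signal ratio (two copies = 1.0)."""
    expected = {
        "deletion (1 copy)": 0.5,
        "normal (2 copies)": 1.0,
        "duplication (3 copies)": 1.5,
    }
    for label, value in expected.items():
        if abs(ratio - value) <= tolerance:
            return label
    # Ratios falling between the expected classes are not called.
    return "uninterpretable"


# Hypothetical microsatellite allele sizes (bp) and dosage ratios.
print(has_three_alleles([152, 156, 160]))
print(classify_dosage(1.45))
```

A three-allele genotype is informative only when the marker lies within the duplicated segment and the three copies happen to carry different alleles, so a negative result does not exclude a duplication.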
15.3.4. Chromosomal deletions and translocations assisted positional cloning of the
dystrophin gene
Duchenne muscular dystrophy (DMD, MIM 310200) was a major test-bed for positional
cloning methods. Years of careful investigation of the pathological changes in
affected muscle had failed to reveal the biochemical basis of DMD. In the early
1980s, several groups competed to clone the DMD gene, using
different approaches. The pioneering work of these groups, overcoming formidable
technical difficulties to clone an unprecedented gene, was probably the major
inspiration for most subsequent positional cloning efforts. This work has been
well reviewed by Worton and Thompson (1988).
The DMD gene was localized by linkage analysis and through X-autosome
translocations
Figure 15.9
.
Nonrandom X inactivation occurs in female DMD patients with
Xp21-autosome translocations
The translocation is balanced, but the X chromosome breakpoint
disrupts the dystrophin gene. X inactivation is random, but
cells which inactivate the translocated X die because of lethal
genetic imbalance. The embryo develops entirely from cells where
the normal X is inactivated, leading to a woman with no
functional dystrophin gene. The resulting failure to produce any
dystrophin causes DMD.
The DMD locus was mapped to Xp21 by linkage to a restriction fragment length
polymorphism as long ago as 1982 (the first disease to be so mapped). Additional
confirmation of this localization came from studies of rare affected females.
These women, about 20 of whom have been described worldwide, occur sporadically
in families with no history of DMD, and there is no evidence that they have
inherited a conventional DMD mutation from either parent. Instead, they all
carry balanced X-autosome translocations. Although each woman has a different
autosomal breakpoint, and many different autosomes are involved, the X
chromosome breakpoint is always at Xp21. The pathogenesis results from an
unusual mechanism. X inactivation is random, but those cells in which the der(X)
translocation chromosome was inactivated suffer genetic imbalances and die. In
the cells that survive, the normal X is inactive, whereas the active der(X) does
not produce any dystrophin because the translocation breakpoint has disrupted
the dystrophin gene (Figure 15.9).
Isolation of the DMD gene by subtraction cloning
Kunkel's group in Boston used DNA from a boy ‘BB’ (Section 16.8.1) who had DMD and a
cytogenetically visible Xp21 deletion. A technically very difficult
subtraction cloning procedure (Section 15.2.3) was used to isolate clones from
normal DNA that corresponded to sequences deleted in BB. Individual DNA clones in the
subtraction library were then used as probes in Southern blot hybridization
against DNA samples from normal people and DMD patients. One clone,
pERT87-8, detected deletions in DNA from about 7% of cytogenetically normal
DMD patients. It also detected polymorphisms that were shown by family
studies to be tightly linked to DMD. These results showed that pERT87-8 was
located much closer to the DMD gene than any previously
isolated clones (in fact it was within the gene, in intron 13). Other nearby
genomic probes were isolated by chromosome walking and used to screen muscle
cDNA libraries. Given the low abundance of dystrophin mRNA and, as we now
know, the small size and widely scattered location of the exons, finding
cDNA clones was far from easy, but eventually clones were identified, and
subsequently the whole remarkable dystrophin gene (see Figure 8.13) was characterized.
Isolation of the DMD gene by cloning a translocation breakpoint
While Kunkel's group was working on subtraction cloning, Worton's group in
Toronto was successful with a different approach. One of the affected women
described above had an X;21 translocation with a breakpoint in the short arm
of chromosome 21. Knowing that 21p is occupied by arrays of repeated rRNA
genes (Section 8.2.1), Worton's group
prepared a genomic library and set out to find clones containing both rDNA
and X chromosome sequences. This led to isolation of XJ (X
junction) clones which, in a similar way to Kunkel's pERT87-8 probe,
detected deletions and polymorphisms. XJ turned out to be
located in intron 17 of the dystrophin gene.
15.3.5. Linkage disequilibrium was an important aid to positional cloning of the
cystic fibrosis gene
In 1985, studies of affected sib-pairs (see Figure 12.2) showed that the gene for CF (MIM 219700) was linked to a protein polymorphism
of the enzyme paraoxonase. At that time, the chromosomal location of the
paraoxonase gene was not known (this illustrates one of the big advantages of
using DNA rather than protein polymorphisms for mapping). A rapid mapping effort
located the paraoxonase gene to chromosome 7, and a variety of DNA markers were
used to show that CF mapped to 7q31-q32. The MET oncogene was
established as a proximal flanking marker, and an anonymous clone
D7S8 as a distal marker.
Figure 15.10
.
Identification of the CF gene involved laborious chromosome
walking and chromosome jumping techniques
Starting from the flanking markers MET (proximal)
and D7S8 (distal), an intervening region of about
500 kb was intensively mapped. Chromosome walking was used to
identify overlapping λ and cosmid clones (short thin and
long thick horizontal lines, respectively, above the restriction
map). Chromosome jumping steps (color arcs) facilitated this
process. After several false starts, the overlapping E4.3 and H1.6
clones, which contained evolutionarily conserved sequences (as
detected by zoo blotting; see Figure 10.21), were used to isolate a cognate cDNA
clone. The cDNA clone was then used to map back to λ genomic
clones and the gene was shown to contain 24 exons. Gaps remained,
however (e.g. between exons III and IV). The full structure of the
gene was later shown to comprise 27 exons. Verification of the
gene's involvement in CF was obtained by demonstrating
patient-specific mutations (see text). Reproduced with permission
from Rommens et al. (1989) Science, 245, 1059–1065. Copyright 1989 American
Association for the Advancement of Science.
Despite an intensive world-wide search, no CF patients have been discovered with
translocation, inversion or deletion breakpoints at 7q31-q32, nor did any
microdeletions emerge during the progress of the research. Without large-scale
mutations to help, identifying the CF gene proved an exceedingly arduous task,
requiring extensive genetic mapping and exhaustive molecular characterization of
the candidate region. In those days, before the Human Genome Project, generating
clones covering the region between the flanking markers D7S8 and MET was a major
effort (Figure 15.10). The techniques used for this task included:
-
Chromosome walking (Section 10.3.2). All this work was conducted before human
YAC libraries were available, and so the chromosome walking used cosmid
and phage λ libraries. Thus, individual steps were only about
10–20 kb, and frustrated researchers talked about chromosome
crawling.
-
Chromosome microdissection. Using micromanipulation
techniques, a small chromosomal segment can be physically cut out from
an individual chromosome in a spread on a microscope slide (Edstrom et al., 1987).
Extremely fine needles, or a laser beam, are used to
perform the microdissection in a series of cells. The excised fragments,
typically representing a single chromosomal band, are collected, pooled
and the DNA extracted for use in constructing DNA libraries (Ludecke et al., 1989).
This technical tour de force has been useful for
generating DNA clones from several disease-associated chromosomal
regions. However, microdissection libraries are difficult to construct;
contamination by extraneous DNA has been a problem and the clone
complexity (the number of different DNA sequences) has often been poor.
Now that large-scale YAC and BAC mapping has made clones from all
chromosomal regions available, microdissection is no longer necessary
for clone generation.
-
Chromosome jumping. This now obsolete technique used
circularization of large DNA fragments from the region of interest to
hop from one genomic clone to another located several hundred kilobases
away (Poustka et al., 1987). Successful jumps in the candidate region provided new
start points for chromosome walking.
Eventually most of the region was encompassed by DNA clones. The formidable
labor this required in 1985-88 makes one appreciate the impact
of Human Genome Project resources on present-day positional cloning efforts.
Linkage disequilibrium then provided valuable clues about the location of the CF
gene. The mutation rate is very low and CF is maintained in the population
largely by heterozygote advantage (Box 3.6). CF disease chromosomes in apparently
unrelated people often derive from a distant common ancestor. The original
flanking markers D7S8 and MET show only extremely weak disequilibrium, but some
of the newer markers such as KM19 and XV2.c generated from clones within the
candidate region showed strong disequilibrium with CF (Table 12.1). Linkage
disequilibrium data can be hard to interpret (see Figure 12.6) but, in the case
of CF, a gradient of steadily increasing disequilibrium pointed quite effectively
to the location of the 5′ end of the gene.
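How a gradient of disequilibrium can be read from marker data is sketched below in Python. This is a minimal illustration under stated assumptions: the marker names M1 to M4 and all chromosome counts are hypothetical (they are not the published CF data), and the statistic used is the simple sample correlation between marker allele and disease status, one of several possible disequilibrium measures.

```python
from math import sqrt


def disequilibrium(a, b, c, d):
    """Association between a marker allele and disease from a 2x2 count.

    a: disease chromosomes carrying marker allele 1
    b: disease chromosomes carrying marker allele 2
    c: normal chromosomes carrying marker allele 1
    d: normal chromosomes carrying marker allele 2
    Returns (D, r): the raw disequilibrium coefficient and the correlation r,
    both computed on the pooled sample.
    """
    n = a + b + c + d
    p_disease = (a + b) / n      # fraction of disease chromosomes in the sample
    p_allele1 = (a + c) / n      # frequency of marker allele 1 in the sample
    D = a / n - p_disease * p_allele1
    r = D / sqrt(p_disease * (1 - p_disease) * p_allele1 * (1 - p_allele1))
    return D, r


# Hypothetical counts for four markers listed in map order across a region.
markers = [
    ("M1", (55, 45, 50, 50)),
    ("M2", (70, 30, 45, 55)),
    ("M3", (85, 15, 30, 70)),
    ("M4", (92, 8, 25, 75)),
]
scores = [(name, disequilibrium(*counts)[1]) for name, counts in markers]
strongest = max(scores, key=lambda item: item[1])[0]
for name, r in scores:
    print(f"{name}: r = {r:.2f}")
print("Strongest disequilibrium at", strongest)
```

In this invented example r rises steadily from M1 to M4, which is the kind of pattern that would direct attention towards the M4 end of the region.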
Eventual isolation of the CFTR gene, unaided by large-scale
mutations, required extensive screening of libraries for the elusive cDNA. A
further small difficulty was encountered in producing convincing evidence that
the gene eventually cloned was indeed the site of mutations causing CF. Because
of the powerful linkage disequilibrium, it was expected that most CF mutant
chromosomes in the population would share a great deal of ancestral sequence.
Therefore, showing that a particular sequence change (the F508del mutation,
Table 17.3) was present on 70% of CF chromosomes did not prove in a wholly
convincing manner that CFTR was the CF gene, still less that
F508del caused CF. F508del could have been
simply a neutral variant inherited along with CF on the ancestral disease
chromosome, especially since the sequence change left the reading frame
intact.
The F508del mutation is present in the heterozygous state in
3–4% of phenotypically normal individuals (we would now identify them
as CF carriers). The fact that F508del homozygotes were always
severely affected was persuasive but not conclusive - F508del
could have been in more or less complete linkage disequilibrium with the real CF
mutation. Biochemical and pathological knowledge was important, in showing that
the CFTR gene encoded an ion channel, and that the pathogenesis
of CF was ultimately caused by defective regulation of chloride ion transport
across apical membranes. The subsequent identification of minority disease
alleles like G542X, where the expected effect on gene
expression was more obviously deleterious, provided further confirmation that
the true disease locus had been identified.
15.3.6. Positional cloning of the gene causing branchio-oto-renal syndrome was
achieved by large-scale sequencing of clones from the candidate region
The autosomal dominant branchio-oto-renal (BOR) syndrome (MIM 113650: branchial fistulas; malformation of
the external and inner ear with hearing loss; hypoplasia or absence of kidneys)
was mapped to 8q13 following a clue from an affected patient who had a
rearrangement of chromosome 8. The initial interval of 7 cM was refined to an
interval of 470–650 kb by further mapping and delineation of a
chromosomal deletion in the patient mentioned. P1 and PAC clones were isolated
by screening genomic libraries with markers within or close to the candidate
region, and gaps in the contig were filled by chromosome walking. The minimum
tiling path (the smallest number of clones from which a contig can be built)
across the candidate region involved 3 P1 and 3 PAC clones.
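Choosing a minimum tiling path is an interval-covering problem, and a greedy strategy (always extend with the overlapping clone that reaches furthest) gives the smallest set. The Python sketch below illustrates the idea; the clone names and kilobase coordinates are hypothetical and do not describe the actual BOR contig.

```python
def minimum_tiling_path(clones, start, end):
    """Return the fewest clones covering [start, end], or None if there is a gap.

    clones: list of (name, left, right) coordinates in kb.
    """
    ordered = sorted(clones, key=lambda clone: clone[1])
    path, position, i = [], start, 0
    while position < end:
        best = None
        # Among clones starting at or before the current position,
        # keep the one that extends furthest to the right.
        while i < len(ordered) and ordered[i][1] <= position:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= position:
            return None  # no clone extends the contig: a gap remains
        path.append(best[0])
        position = best[2]
    return path


# Hypothetical clone coordinates (kb) across a 650 kb candidate region.
clones = [
    ("P1-a", 0, 90), ("PAC-b", 60, 200), ("P1-c", 80, 150),
    ("PAC-d", 180, 330), ("P1-e", 300, 420), ("PAC-f", 310, 500),
    ("P1-g", 480, 650),
]
print(minimum_tiling_path(clones, 0, 650))
```

The sketch treats clones that merely abut as contiguous; a real contig would require a demonstrable overlap between neighbouring clones.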
It was decided to isolate genes from the contig by large-scale sequencing of
plasmid subclones. Checking the sequence against the EMBL and GenBank databases
revealed homology between part of the sequence obtained and the
Drosophila developmental gene eyes absent
(eya). Further genomic sequence was then translated and
searched for homologies to the deduced amino acid sequence of the
Drosophila eya gene. This resulted in the identification of
seven putative exons showing 69% identity and 88% similarity at the amino acid
level to the putative eya protein (note that amino acid sequences are more
likely to show detectable homology than nucleotide sequences). The human cDNA
was then isolated from a 9-week total fetal mRNA library, and seven mutations in
the gene, named EYA1, were demonstrated in 42 unrelated BOR
patients. Expression studies in the mouse showed a pattern consistent
with the developmental abnormalities of BOR syndrome.
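Translating genomic sequence before searching for protein homologies means translating all six reading frames, because neither the strand nor the frame of an exon is known in advance. The Python sketch below shows the idea using the standard genetic code; it is not the procedure used in the original study, the function names are illustrative, and the identity calculation assumes two protein segments that have already been aligned without gaps.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate from the first base; unknown codons become 'X'."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(seq):
    """Return translations for frames +1..+3 and -1..-3."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


def percent_identity(a, b):
    """Percent identical residues in two pre-aligned, equal-length sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


print(six_frame_translation("ATGGCCTAA"))
print(percent_identity("MKTAY", "MRTAY"))
```

Percent similarity, as quoted in the text, would additionally count conservative substitutions using an amino acid scoring matrix, which this sketch omits.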
This work (Abdelhak et al., 1997) unambiguously identified EYA1 as the human BOR
gene, but the function of the gene product was not identified, nor was it clear
why the Drosophila phenotype consists of reduced or absent
compound eyes. As so often with positional cloning, identifying the gene was
just the start of understanding the syndrome.
15.3.7. Identification of the gene causing Treacher Collins syndrome illustrates
positional cloning in its purest form
Treacher Collins syndrome (MIM 154500) is an autosomal dominant disorder of
craniofacial development with a variable phenotype including abnormalities of
the external and middle ears, hypoplasia of the mandible and zygomatic complex
and cleft palate. Linkage was initially established to markers at 5q31-q34.
Because the markers in that region at the time (1991-92) were not very
informative, new microsatellites were isolated and used to refine the candidate
region to 5q32-q33.1. A combined genetic and radiation hybrid map was constructed
across this interval, and by 1994 the team had assembled a YAC contig. This was
converted to a cosmid contig, and cDNA library screening and exon trapping were
used to generate a transcript map. At least seven genes were identified in the
critical region. Further rounds of marker isolation and crossover analysis
produced a confusing picture of overlapping recombinations, but led eventually
to isolation of a candidate incomplete cDNA from a placental library. Northern
blotting and zoo blotting showed that the gene was widely expressed and
conserved across species, but database searches revealed no significant
homologies. The exon-intron structure was determined, and mutation analysis
demonstrated five different mutations in unrelated patients.
Isolation of the TCOF1 gene (Treacher Collins Syndrome Collaborative Group, 1996) illustrates
positional cloning in its purest form. No relevant chromosomal abnormalities
were found (there were four patients with TCS who had chromosomal translocations
or deletions, but markers from each of the breakpoints showed no linkage to TCS
in family studies, so presumably these cases were all coincidental). There is no
linkage disequilibrium - not surprisingly, since 60% of cases are new mutations.
The candidate region is gene-rich, so there were many possible candidates, and
the gene eventually identified had no features that made it a particularly
promising candidate. The gene product is now believed to be a nucleolar
phosphoprotein that is involved in some aspect of nucleolar trafficking. Why
mutations should cause Treacher Collins syndrome is not yet known.
15.4. Positional candidate strategies identify candidate genes by a combination of
their map position and expression, function or homology
The position-independent and positional cloning strategies described in the last two
sections are in principle quite separate, but in reality most disease genes have
been identified by a positional candidate strategy, using a combination of
positional and nonpositional information.
-
A purely position-independent approach will rarely succeed because molecular
pathology is too complicated. Predictions of the biochemical function of an
unknown disease gene are often proved wrong once the gene is isolated.
Likewise, predicting the phenotypic effect of mutations in a known gene
usually works only in very general terms. Mutations in rhodopsin (the
pigment of the rods of the retina) will probably affect vision, but we
cannot predict which of the many forms of hereditary retinal malfunction
they might cause. Again, mutations in fibrillin (a connective tissue
component) will probably cause a connective tissue disease - but which one
of the hundreds known?
-
A purely positional approach is inefficient because candidate regions
identified by positional cloning usually contain dozens of genes. It can be
very time-consuming to identify every transcript from the region, and
excessively laborious to screen them all for mutations. Many are identified
only as cDNAs or ESTs, so that before mutation screening can be started,
their exon-intron structure must be determined. Thus it is essential to
prioritize candidates. This requires position-independent information about
their pattern of expression, likely function, or homology to genes
implicated in relevant mutants in model organisms.
15.4.1. Criteria for selecting a candidate gene: expression pattern and
function
From the list of genes that map to the candidate region, one would look for a
gene that shows appropriate expression and/or appropriate function.
Alternatively or additionally, as discussed in the following sections, one would
look for homology to some other human or non-human gene that is known to have
appropriate expression or function.
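Prioritizing positional candidates amounts to ranking the genes in the region by how much supporting nonpositional evidence each has. The Python sketch below is purely illustrative: the gene names, the three yes/no criteria and the equal weights are assumptions made for the example, not an established scoring scheme.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    expressed_in_affected_tissue: bool
    plausible_function: bool
    homolog_with_relevant_phenotype: bool


def priority(candidate, weights=(1, 1, 1)):
    """Sum the weights of the criteria that a candidate satisfies."""
    evidence = (
        candidate.expressed_in_affected_tissue,
        candidate.plausible_function,
        candidate.homolog_with_relevant_phenotype,
    )
    return sum(w for w, present in zip(weights, evidence) if present)


def rank_candidates(candidates, weights=(1, 1, 1)):
    """Order candidates from most to least supported (ties keep input order)."""
    return sorted(candidates, key=lambda c: priority(c, weights), reverse=True)


# Hypothetical genes mapping to a candidate region.
region = [
    Candidate("GENE_A", False, False, False),
    Candidate("GENE_B", True, False, True),
    Candidate("GENE_C", True, True, True),
]
for gene in rank_candidates(region):
    print(gene.name, priority(gene))
```

Such a ranking only decides which gene to screen for mutations first; as the Treacher Collins example shows, the true disease gene may have no features that make it a promising candidate.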
Appropriate expression pattern
A good candidate gene should have an expression pattern consistent with the
disease phenotype. Expression need not be restricted to the affected tissue,
because there are many examples of widely expressed genes causing a
tissue-specific disease (Section 16.7.1), but the candidate should at least be expressed at the
time and in the place where the pathology is seen. For example, neural tube
defects are likely to involve genes that are expressed during the
3rd–4th weeks of human embryonic development, shortly before or
during neurulation. The expression of candidate genes can be tested by
RT-PCR or Northern blotting, but the best method for revealing the exact
expression pattern is in situ hybridization against mRNA in
tissue sections (Figure 5.17). For
embryonic stages this is most conveniently performed using sections of mouse
embryos at the equivalent developmental stages (7.5–9.5 embryonic
days in the case of neural tube defects). The expression in human embryos is
likely to be very similar, although this cannot be guaranteed, and
centralized resources of staged human embryo sections have been established
to allow the equivalent analyses to be performed where necessary on human
embryos.
Appropriate function
Studying the pathology of a genetic disease only rarely gives information
precise enough to allow position-independent identification of the disease
gene, but it often allows good positional candidates to be selected.
Rhodopsin and fibrillin, mentioned above, provide typical examples.
The gene for human rhodopsin was cloned in 1984, and it was mapped to
3q21-qter in 1986. Among disorders involving hereditary retinal degeneration
are the various forms of retinitis pigmentosa (RP), which are marked by
progressive visual loss resulting from clumping of the retinal pigment.
Although rhodopsin was a possible candidate gene for some forms of RP, it
was only one of many proteins that were known to be involved in
phototransduction. However, in 1989, linkage analyses in a large Irish RP
family mapped their disease gene to 3q in the neighborhood of rhodopsin.
Rhodopsin was now a serious candidate gene, and patient-specific mutations
in the rhodopsin (RHO) gene were identified within a year
(see OMIM entry 180380).
The phenotype of Marfan syndrome (MFS, MIM 154700: excessive growth of long bones;
lax joints; dislocation of lenses; liability to aortic aneurysms) suggested
some abnormality in a connective tissue component. Linkage analysis mapped
the MFS gene to 15q, and subsequently the gene for the connective tissue
protein fibrillin was localized to 15q21.1 by in situ
hybridization. Fibrillin was then an obvious positional candidate, and
patient-specific mutations were soon demonstrated (see McKusick 1991 for discussion of the background).
Candidate genes may also be suggested on the basis of a close functional
relationship to a gene known to be involved in a similar disease. The genes
could be related by encoding a receptor and its ligand, or other interacting
components in the same metabolic or developmental pathway. For example, some
of the genes implicated in Hirschsprung disease were identified using this
logic, as described in Section 19.5.2.
15.4.2. Criteria for selecting a candidate gene: homology to a relevant human gene or
EST
Preliminary identification of transcripts often comes from matching genomic
sequence generated from the candidate region against unmapped ESTs in the
databases. Finding a match suggests the presence of an exon in the genomic DNA,
and may provide leads to identifying more of the gene or to guessing its
function. Sometimes a gene in the candidate region turns out to be closely
related to a known disease gene. If the diseases are similar, the new gene
becomes a compelling candidate. Members of multigene families can be fairly
readily assessed as positional candidates on this basis. For example, after
fibrillin was identified as the gene mutated in Marfan syndrome, a second
fibrillin gene was shown to map to 5q. This therefore became a candidate
location for other Marfan-like phenotypes. A related condition, congenital
contractural arachnodactyly (MIM 121050) was mapped to 5q and shown to be
caused by mutations in the FBN2 gene (Putnam et al., 1995).
Table 15.3
Functions of some genes implicated in human non-syndromic
sensorineural hearing loss
| Locus | Chromosomal location | Gene | Function of gene product |
|---|---|---|---|
| DFNA1 | 5q31 | DIAPH1 | Cytokinesis |
| DFNA2 | 1p34 | KCNQ4 | K+ ion channel |
| DFNA9 | 14q12-q13 | COCH | Uncertain |
| DFNA12, DFNB21 | 11q22-q24 | TECTA | Structural component of tectorial membrane |
| DFNB1 | 13q12 | Cx26 (GJB2) | Intercellular gap junction |
| DFNB2 | 11q13.5 | MYO7A | Myosin (molecular motor) |
| DFNB3 | 17p11.2 | MYO15 | Myosin (molecular motor) |
| DFNB4 | 7q31 | PDS | Chloride-iodide transporter |
| DFNB9 | 2p23 | OTOF | Vesicle-membrane fusion |
| DFN3 | Xq21.1 | POU3F4 | Transcription factor |
| 12S RNA | Mitochondrion | 12S ribosomal RNA | Energy generation by mitochondria |
Selecting candidate disease genes by homology is often more successful using
model organisms as described below than by considering human paralogs. Many
diseases show extensive locus heterogeneity, and it is not usually the case that
the different genes involved are related in any obvious way, either structurally
or functionally (Table 15.3). One must always bear in mind the complexity of
every human organ, tissue and developmental process. Each requires very many
different genes and pathways, with the result that mutations in many unrelated
genes can produce similar phenotypes.
15.4.3. Criteria for selecting a candidate gene: homology to a relevant gene in a
model organism
Over the past decade it has become increasingly clear how far structural and
functional homologies extend across even very distantly related species.
Virtually every mouse gene has an exact human counterpart, and the same is
probably true of other less well explored mammalian species. More surprisingly,
extensive homologies can be detected between human genes and genes in zebrafish,
Drosophila, the nematode worm Caenorhabditis
elegans and even yeast. Even more than gene sequences, pathways are
often highly conserved, so that knowledge of a developmental or control pathway
in Drosophila or yeast can be used to predict the likely
working of human pathways - although mammals often have several parallel paths
corresponding to a single path in lower organisms. Being able to predict the
possible protein-protein interactions governing a pathway assists experimental
attempts to identify novel disease genes, for example by yeast two-hybrid
screening (Section 20.4.1). Perhaps the
most striking demonstration of the conservation of function between distant
organisms is contained in a paper by Rincón-Limas et al. (1999), which shows
that a wingless Drosophila mutant called
apterous can be corrected by transfection with the human
apterous homolog Lhx2 - in other words,
humans have a fully functional gene for making flies grow wings!
A very powerful means of selecting good candidates from among a set of human
genes is therefore to search the databases for evidence of homologous genes in
these well-studied model organisms, as described in Section 20.1.4. If a homolog is detected, one can see
what is known about its function. Such data might include the pattern of
expression and the phenotype of mutants. Additionally in the mouse, though not
in nonmammalian species, the likely chromosomal location of the human ortholog
can often be predicted from mouse mapping data, allowing prediction of as yet
uncharacterized positional candidates.
Using clues from the mouse
Human-mouse phenotypic homologies provide particularly valuable clues towards
identifying human disease genes for several reasons:
-
Of the genetically well explored organisms, mice are much the closest
to humans in evolutionary terms. Therefore orthologous gene
mutations are more likely to produce similar phenotypes in humans
and mice than in humans and lower organisms. Not infrequently,
however, despite the close evolutionary relationship, the phenotypes
are considerably different (see Section 21.4.6).
-
Mouse phenotypic information often translates readily into positional
candidate information. Backcross mapping (see Box 15.4) allows quick
and accurate mapping in the mouse. Thus most mouse mutants have been
mapped, or can easily be mapped, and there is considerable
conservation of synteny between humans and mice (see Figure 15.11). Once a
chromosomal location for a gene of interest is known in mouse or
humans, it is usually (though not always) possible to predict the
likely location of that gene in the other species. Figure 15.12 shows the
example of a mouse chromosomal region where prediction of the corresponding
human location is easy, and an adjoining region where prediction
could be difficult. A database of human-mouse map relationships is
maintained at http://www.ncbi.nlm.nih.gov/Omim/Homology (DeBry & Seldin,
1996). Table 15.4 shows an example of using mouse information to predict
possible locations of human deafness genes.
-
Exon sequences are usually well conserved between orthologous human
and mouse genes. Once a human or mouse gene is isolated, probes or
primers can be designed to screen DNA libraries from the other
species in order to identify the orthologous gene.
Figure 15.11
.
Conservation of synteny between human and mouse genetic
maps
The Oxford Grid for human and mouse shows an overall comparison
of the two species. Each cell shows the number of orthologous
genes mapping to particular chromosomes in mouse and man. Cells
are color coded according to the number of orthologs mapped. The
nonrandom distribution is obvious (Blake et al., 1999).
Reproduced with permission from the Mouse Genome Database, Mouse
Genome Informatics, The Jackson Laboratory, Bar Harbor, Maine
(http://www.informatics.jax.org/) (25 May 1999).
Figure 15.12
.
Detail of one mouse chromosome
The map shows the part of mouse chromosome 1, from 48 to 62 cM
from the centromere. Mapped mouse genes are shown in color.
Where a human ortholog has been mapped, its human map location
is shown. The distal part of the mouse map shows good
conservation of synteny with the distal long arm of human
chromosome 2, but in the proximal part of the mouse chromosome
the relationship to human chromosomes is more complex. Mouse
Chromosome 1 Linkage Map reproduced with permission from the
Mouse Genome Database, Mouse Genome Informatics, The Jackson
Laboratory, Bar Harbor, Maine (http://www.informatics.jax.org/)
(25 May 1999). See Blake et al., 1999.
Table 15.4
Using mapped mouse mutations to predict possible locations of
genes causing hearing loss in humans
| Mouse mutant | Mouse map position | Predicted human location | Human hearing loss locus in that region |
|---|---|---|---|
| Wo | Chr 1, 25 cM | 2q12; 2q31-q33; 6p11 | |
| Sp (Pax3) | Chr 1, 54 cM | 2q36 | WS1 |
| dr | Chr 1, 89 cM | 1q23-q31 | DFNA7 |
| Lp | Chr 1, 92 cM | 1q21-q23 | DFNA7 |
| fi | Chr 2, 34 cM | 2q14-q37 | |
| kr | Chr 2, 91 cM | 20q11-q13 | |
| Sig | Chr 6, 1 cM | 7q21-q31 | DFNB4 |
| Hoxa1 | Chr 6, 25 cM | 7p14-p15 | DFNA5 |
| nv | Chr 7, 4 cM | 19q13 | DFNA4 |
| hb | Chr 7, 65 cM | 16p11-p13; 10q24-q26 | |
| Fgf3 | Chr 7, 70 cM | 11q13.3 | |
| ha1 | Chr 10, 56 cM | 12q22-q24.1 | |
| Fu | Chr 17, 12 cM | 6p21 | |
| Tw | Chr 18, 9 cM | 18q11-q12 | |
| sy | Chr 18, 36 cM | 5q31-q32; 18p11-q21 | |
| Dc | Chr 19, 6 cM | 11q12-q13 | DFNA2, DFNA11 |
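Predicting a human location from a mouse map position is essentially a lookup in a table of conserved segments. The Python sketch below illustrates this; the segment boundaries and the function name `predict_human_locations` are hypothetical, chosen only to show one unambiguous prediction and one ambiguous one, and they are not taken from the Mouse Genome Database.

```python
# Hypothetical conserved segments: (mouse chr, start cM, end cM, human region).
CONSERVED_SEGMENTS = [
    ("1", 20.0, 30.0, "2q12"),
    ("1", 20.0, 30.0, "6p11"),
    ("1", 45.0, 60.0, "2q35-q37"),
    ("7", 0.0, 10.0, "19q13"),
]


def predict_human_locations(mouse_chr, position_cm, segments=CONSERVED_SEGMENTS):
    """Return every human region whose conserved segment contains the position.

    One region means a clear prediction, several mean the prediction is
    ambiguous, and an empty list means no conserved segment is known there.
    """
    return sorted({
        human
        for chrom, start, end, human in segments
        if chrom == mouse_chr and start <= position_cm <= end
    })


print(predict_human_locations("1", 54.0))
print(predict_human_locations("1", 25.0))
print(predict_human_locations("3", 10.0))
```

Each predicted human region would then be compared with the map locations of relevant human disease loci, and an ambiguous prediction means that every listed region has to be tested.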
An example: Waardenburg syndrome and the Splotch mouse
Waardenburg syndrome type 1 (WS1, MIM 193500) illustrates the value of
human-mouse comparisons. A pedigree of this autosomal dominant but variable
condition was shown in Figure 3.5C.
The characteristic pigmentary abnormalities and hearing loss of WS1 are
caused by absence of melanocytes from the affected parts (including the
inner ear, where melanocytes are required in the stria vascularis of the
cochlea in order for normal hearing to develop). Linkage analysis, aided by
the description of a chromosomal abnormality in an affected patient,
localized the gene for WS1 to the distal part of 2q. At this point, a likely
mouse homolog emerged. The Splotch (Sp) mouse mutant has
pigmentary abnormalities caused by patchy absence of melanocytes, and the
Sp gene maps to a linkage group on mouse chromosome 1
that shows extensive conservation of synteny with distal human 2q.
Consideration of the pathogenesis provided further evidence that
WS1 and Sp are orthologous genes. The
root cause of the phenotype lies in the embryonic neural crest, because
melanocytes originate in the neural crest and migrate out to their final
locations during embryonic development. Although heterozygous
Sp mice resemble WS1 patients, homozygous
Sp mice have neural tube defects, and have been studied
for many years as a model for human neural tube defects.
A positional candidate gene emerged when the murine Pax-3
gene was mapped to the vicinity of the Sp locus.
Pax-3 is one of a family of genes (PAX
genes) that encode transcription factors containing the paired box
DNA-binding motif, and it is expressed in mouse embryos in the developing
nervous system, including the neural crest. The sequence of
Pax-3 was almost identical to the limited sequence
which had previously been published for an unmapped human genomic clone,
HuP2. Such observations prompted mutation screening of
Pax-3 and HuP2 and led to
identification of mutations in Splotch mice and humans with
WS1 (reviewed by Strachan & Read, 1994). As the underlying genes, Pax-3 and
HuP2, were clearly orthologs, the HuP2
gene was subsequently re-named PAX3.
Limitations of human-mouse homologies
Though enormously valuable as a guide to human-mouse homologies,
conservation of synteny is not always sufficient to allow identification of
positional candidates. This is illustrated by the
mi/MITF locus, mutations in which
cause another variant of Waardenburg syndrome, WS type 2 (MIM 193510) in man. The mi
locus on mouse chromosome 6 has long been recognized as a likely candidate
homolog of some form of WS, but attempts to predict the location of the
human homolog failed. Genes mapping close to mi have human
homologs mapping to 3p25, 3q21-q24 and 10q11.2. Each of these locations was
tested for linkage to WS2, with negative results. Not until
the human mi homolog, MITF, had been
cloned and mapped by FISH to 3p14, and WS2 in humans had
been independently mapped by linkage to the same location (see Figure 11.6), was there sufficient
evidence to begin a successful search for mutations in man (Tassabehji et al.,
1994).
Using clues from mutant phenotypes in lower organisms
Phenotypic homologies with lower organisms have mostly been used after a
human gene and its orthologs have been identified, as part of the
exploration of the gene's function. In a few cases, however, homologies have
been used prospectively, to help identify a human disease gene. A
particularly successful example was the identification of the
MSH2 and MLH1 genes that are mutated
in certain forms of hereditary colon cancer. These genes were identified
after phenotypic resemblances led researchers to suspect that the cancers
might be caused by mutations in the human homologs of yeast mismatch repair
genes (Section 18.7.1).
Figure 15.13. A systematic database search for Drosophila mutants that could be
positional candidates for human disease genes (Banfi et al., 1996).
An attempt to use Drosophila phenotypic information systematically to identify
positional candidates for human diseases is the DRES database (Banfi et al., 1996).
The dbEST human EST database was searched for matches to Drosophila genes with known
mutant phenotypes. Sixty-six novel matches were detected. The map position of each
EST was determined by both FISH and radiation hybrid mapping.
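The final step of such a search, asking which mapped ESTs fall inside a disease candidate region, can be sketched in a few lines of Python. This is only an illustration and not the DRES implementation: the record layout, the identifiers and the use of a single numeric map coordinate per chromosome are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MappedEST:
    est_id: str       # hypothetical EST identifier
    fly_gene: str     # Drosophila gene with a known mutant phenotype
    chromosome: str   # human chromosome assigned by FISH / radiation hybrid mapping
    start: float      # assumed map interval, in arbitrary map units
    end: float


@dataclass(frozen=True)
class DiseaseInterval:
    disease: str
    chromosome: str
    start: float
    end: float


def positional_candidates(ests, interval):
    """Return the ESTs whose mapped interval overlaps the disease interval."""
    return [
        est for est in ests
        if est.chromosome == interval.chromosome
        and est.start <= interval.end
        and est.end >= interval.start
    ]


# Hypothetical data, for illustration only.
ests = [
    MappedEST("EST_A", "fly_gene_a", "2", 100.0, 110.0),
    MappedEST("EST_B", "fly_gene_b", "2", 300.0, 320.0),
    MappedEST("EST_C", "fly_gene_c", "3", 100.0, 110.0),
]
region = DiseaseInterval("disease X", "2", 105.0, 150.0)
candidates = positional_candidates(ests, region)
```

An EST that passes this filter is only a positional candidate. Whether the Drosophila mutant phenotype plausibly relates to the human disease still has to be judged separately.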
Sequence homologies with lower organisms
Just as in the general Human Genome Project, genomic or cDNA sequences
generated in the exploration of a candidate disease region are routinely
checked against data from lower organisms. A match of genomic sequence
suggests the presence of a gene, and if the match is to a known gene in the
lower organism, it suggests the nature of the human gene. The example of
branchio-oto-renal syndrome, above, shows how useful this approach can
be.
15.5. Confirming a candidate gene
Ultimately all approaches used to identify disease genes generate candidates, which
then have to be tested individually to see if there is compelling evidence that
mutations in them do cause the disease in question. Demonstrating that a candidate
gene is likely to be the disease locus can be done by various means.
-
Mutation screening. Screening for patient-specific mutations
in the candidate gene is by far the most popular method, because it is
generally applicable and comparatively rapid. The reasons why particular
mutations may be found in certain diseases are discussed in Chapter 16, and the methods of
testing for mutations are described in Chapter 17. Identifying mutations in several
unrelated affected individuals strongly suggests that the correct candidate
gene has been chosen, but formal proof requires additional evidence.
-
Restoration of normal phenotype in vitro. For some
disorders, usually ones where mutations cause loss of function, the
phenotype is reversible. If a cell line that displays the mutant phenotype
can be cultured from the cells of a patient, transfection of a cloned normal
allele into the cultured disease cells may result in restoration of the
normal phenotype by complementing the genetic deficiency.
-
Production of a mouse model of the disease (Chapter 21). Once a putative
disease gene is identified, a transgenic mouse model can be constructed. If
the human phenotype is known to result from loss of function, gene targeting
can be used to generate a germline knockout mutation in the mouse ortholog.
If the human disease results from a gain of function, then attempts to make
a transgenic mouse model normally involve introducing a disease allele into
the mouse germline. The mutant mice are expected to show some resemblance to
humans with the disease, although this expectation may not always be met
even when the correct gene has been identified.
15.5.1. Principles of mutation screening to confirm a candidate gene
However promising a candidate gene appears to be for a disease, it must be shown
to be mutated in affected people. Mutation screening entails testing DNA samples
from a sizable panel of unrelated patients and control individuals. The first
step is to design pairs of primers for PCR-amplifying portions of the coding
DNA, either from a genomic DNA sample (if the exon/intron boundaries are known),
or from cDNA generated by RT-PCR from mRNA of patients (Figure 6.5). The products of individual amplification
reactions are then subjected to one or more of the mutation screening procedures
described in Section 17.1.4 that are
designed to detect unknown point mutations.
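As a rough illustration of the first step, the sketch below divides one exon into overlapping target fragments for amplification from genomic DNA. It assumes that the exon boundaries are known, that a short stretch of flanking intron is included so that the exon ends are covered, and that the chosen screening method works best on fragments of a few hundred base pairs. The function name and the default values are hypothetical, and the choice of the actual primer sequences is not addressed.

```python
def tile_amplicons(exon_start, exon_end, max_len=300, flank=20, overlap=30):
    """Split one exon plus intronic flanks into overlapping amplicon targets.

    Coordinates are 1-based, inclusive genomic positions. Returns a list of
    (start, end) tuples that together cover the exon and its flanks.
    """
    if exon_end < exon_start:
        raise ValueError("exon_end must not be smaller than exon_start")
    if max_len <= overlap:
        raise ValueError("max_len must be larger than overlap")

    region_start = exon_start - flank
    region_end = exon_end + flank

    amplicons = []
    pos = region_start
    while True:
        amp_end = min(pos + max_len - 1, region_end)
        amplicons.append((pos, amp_end))
        if amp_end == region_end:
            break
        # Start the next fragment so that it shares `overlap` bases with this one.
        pos = amp_end - overlap + 1
    return amplicons


# A short exon fits in a single fragment; a longer one needs two.
short_exon = tile_amplicons(1000, 1100)
long_exon = tile_amplicons(1000, 1500)
```

Overlapping the fragments means that a sequence change lying under a primer for one fragment falls inside the amplified portion of its neighbor.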
Mutation screening is often straightforward for diseases where a good proportion
of patients carry independent mutations (typically, severe early onset dominant
or X-linked recessive disorders) and where the disease phenotype results from
loss of function of the gene. As explained in Section 16.4, if the correct gene is tested, a panel of DNA samples
from unrelated patients will usually show a variety of different mutations,
including some with an obviously deleterious effect on gene expression (nonsense
mutations, frameshift mutations, etc.). Figure
16.1 shows an example from the work on Waardenburg syndrome mentioned
above. If the identified mutations are absent from control samples, then the
conclusion that the gene being tested really is the locus for the disease
becomes almost inescapable. However, other circumstances may make the
identification of mutations and the interpretation of mutation screening more
difficult.
-
Unsuspected locus heterogeneity. Often mutations in
several different genes can give almost identical phenotypes, so that a
panel of unselected patient samples may have pathogenic mutations in
different genes. If the candidate gene being tested is responsible for
only a small proportion of cases, most samples will show no mutation in
that gene. Ideally, one would use only samples from families with
demonstrated linkage to the candidate region, but this may be
impracticable. Family sizes for recessive and some dominant disorders
are often too small for independent linkage analyses, and in some severe
dominant disorders most patients present as sporadic cases without a
family history.
-
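Mutational homogeneity. This problem was discussed above
in connection with CF, where most apparently unrelated patients carry the same
mutation, F508del. A single recurrent change is weaker evidence than a variety of
independent mutations, because it could be a neutral variant in linkage
disequilibrium with the true mutation. See Sections 16.3.3 and 17.1.2 for further
discussion and examples of mutational homogeneity.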
-
Mutations are not unambiguously pathogenic. It may be difficult to
identify missense mutations as being pathogenic as opposed to being
neutral variants with no major effect on gene expression. Some
guidelines to help decide whether a sequence change is pathogenic are
given in Box 16.4.
-
Mutations may be hard to find. Large genes are more
difficult to screen for mutations, and sometimes mutations seem very
hard to find. Current examples include the NF1 and
PKD1 genes that are mutated in neurofibromatosis 1
(MIM 162200) and
adult polycystic kidney disease (MIM 173900) respectively. Mutations in
the F8C gene causing severe hemophilia A seemed to be
hard to find, until it was discovered that most of the missing mutations
were large inversions which disrupted the gene (see Figure 9.20) but were not detected by the PCR
methods normally used.
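The reasoning used when interpreting a screening panel can be sketched in Python as follows. This is a deliberately simplified illustration, not a description of any published tool: it compares a patient's coding sequence with a reference coding sequence, labels the change as silent, missense, nonsense, frameshift or in-frame length change, and discards changes that are also seen in controls. The function names and the toy sequences are assumptions. Splice-site and regulatory changes, and large rearrangements such as the F8C inversions, are outside its scope.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate a coding sequence, ignoring any incomplete final codon."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def classify_change(ref_cds, alt_cds):
    """Give a coarse label for the effect of alt_cds relative to ref_cds."""
    if alt_cds == ref_cds:
        return "no change"
    if (len(alt_cds) - len(ref_cds)) % 3 != 0:
        return "frameshift"
    if len(alt_cds) != len(ref_cds):
        return "in-frame length change"
    ref_prot, alt_prot = translate(ref_cds), translate(alt_cds)
    if ref_prot == alt_prot:
        return "silent"
    ref_stop, alt_stop = ref_prot.find("*"), alt_prot.find("*")
    if alt_stop != -1 and (ref_stop == -1 or alt_stop < ref_stop):
        return "nonsense"
    return "missense"


def patient_specific_changes(ref_cds, patient_alleles, control_alleles):
    """Classify each distinct patient allele that is absent from all controls."""
    controls = set(control_alleles)
    distinct = {a for a in patient_alleles if a != ref_cds and a not in controls}
    return {allele: classify_change(ref_cds, allele) for allele in distinct}


# Toy sequences, for illustration only.
REF = "ATGGCTTGGAAATAA"                      # Met Ala Trp Lys stop
patients = ["ATGGCTTGAAAATAA",               # TGG -> TGA
            "ATGGCTGGAAATAA",                # one base deleted
            "ATGGTTTGGAAATAA",               # GCT -> GTT
            "ATGGCCTGGAAATAA"]               # GCT -> GCC
controls = [REF, "ATGGCCTGGAAATAA"]
result = patient_specific_changes(REF, patients, controls)
```

In the terms used above, nonsense and frameshift changes are the obviously deleterious ones. A missense change that survives the control filter still needs the kind of assessment described in Box 16.4.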
15.5.2. Once a candidate gene is confirmed, the next step is to understand its function
Identifying the gene involved in a genetic disease opens the way to several lines
of investigation. The ability to identify mutations should immediately lead to
improved diagnosis and counseling, as described in Chapter 17. Understanding the molecular pathology (why
the mutated gene causes the disease; see Chapter 16) may also lead to insight into related diseases and,
eventually, to more effective treatment, perhaps including gene therapy
(Chapter 22).
A second line of enquiry concerns the normal function of the gene product. For
example, until the Duchenne muscular dystrophy (DMD) gene was identified there
was no knowledge of the way the contractile machinery of muscle cells is
anchored to the sarcolemma. Analysis of functional domains and motifs (Mushegian et al., 1997)
and the search for experimentally manipulable homologs in the mouse, fruit fly,
nematode and yeast are powerful tools for this work. This large topic is covered
in Chapter 20; a foretaste of the
sort of information that can be generated by database searching can be seen in
the following information, taken from the Hereditary Hearing Loss Homepage
database (http://dnalab-www.uia.ac.be/dnalab/hhh) and describing the gene
that was identified by positional cloning of an autosomal dominant hearing loss
locus (DFNA1) in one large Costa Rican family (see OMIM entry
124900):
The human DFNA1 protein product DIAPH1, mouse p140mDia, and
Drosophila diaphanous are homologs of Saccharomyces
cerevisiae protein Bni1p. The proteins are highly
conserved overall. The genes encoding these proteins are members of the formin
gene family, which also includes the mouse limb deformity gene,
Drosophila cappuccino, Aspergillus
nidulans gene sepA, and S. pombe
genes fus1 and cdc12. These genes are involved
in cytokinesis and establishment of cell polarity. All formins share Rho-binding
domains in their N-terminal regions, polyproline stretches in the central region
of each sequence, and formin-homology domains in the C-terminal region.