As its name suggests, the aim of structural genomics is to characterize the structure of the genome. Knowledge of the structure of an individual genome can be useful in manipulating genes and DNA segments in that particular species. For example, genes can be cloned on the basis of knowing where they are in the genome. When a number of genomes have been characterized at the structural level, the hope is that, through comparative genomics, it will become possible to deduce the general rules that govern the overall structural organization of all genomes.
Structural genomics proceeds through increasing levels of analytic resolution, starting with the assignment of genes and markers to individual chromosomes, then the mapping of these genes and markers within a chromosome, and finally the preparation of a physical map culminating in sequencing.
High-resolution chromosome maps
The next level of increasing resolution is to determine the position of a gene or molecular marker on the chromosome. This step is important because the genetic maps that are produced can be aligned with the physical maps considered in the next section and used to validate the physical maps. In addition, the clones generated as parts of the physical map can be used to help identify the genomic DNA corresponding to the genes on the genetic map. Several different methods are used in localizing genes or markers.
Meiotic mapping by recombination.

Meiotic linkage mapping used in genomics is based on the principles of mapping covered in Chapter 5—in other words, on the analysis of recombinant frequency in dihybrid and multihybrid crosses. In experimental organisms such as yeast, Neurospora, Drosophila, and Arabidopsis, the genes that determine qualitative phenotypic differences can be mapped in a straightforward way because of the ease with which controlled experimental crosses (such as testcrosses) can be made. Therefore, in these organisms, the chromosome maps built over the years appear to be full of genes with known phenotypic effect, all mapped to their respective loci.
This is not the case for humans. First, informative crosses are lacking. Second, progeny sample sizes are too small for accurate statistical determination of linkage. Third, the human genome is enormous. In fact, even the assignment of a human disease gene to an individual autosome by linkage analysis was a difficult task. (Most genes with known phenotypes were assigned not by RF analysis but by human–rodent cell hybrid mapping.)
Even in those organisms in which the maps appeared to be “full” of loci of known phenotypic effect, measurements showed that the recombinational intervals between known genes had to contain vast amounts of DNA. These intervals, or gaps, could not be mapped by linkage analysis, because there were no markers in those regions. Large numbers of additional genetic markers were needed, which could be used to fill in the gaps to provide a higher-resolution map. This need was met by the discovery of various kinds of molecular markers. A molecular marker is a site of heterozygosity for some type of neutral DNA variation. Neutral variation is that which is not associated with any measurable phenotypic variation. Such a “DNA locus,” when heterozygous, can be used in mapping analysis just as a conventional heterozygous allele pair can be used. Because molecular markers can be easily detected and are so numerous in a genome, when mapped by linkage analysis, they fill the voids between genes of known phenotype. Note that, in mapping, the biological significance of the DNA marker is not important in itself; the heterozygous site is merely a convenient reference point that will be useful in finding one’s way around the genome. In this way, markers are being used just as milestones were used by travelers in earlier centuries. Travelers were not interested in the milestones (markers) themselves, but they would have been disoriented without them.
Restriction fragment length polymorphisms.
RFLPs (described in Chapters 1, 5, and 13) were the first neutral DNA markers to be applied to genome mapping by recombinant frequency.
DNA markers based on variable numbers of short-sequence repeats.
Although RFLPs were the first DNA markers to have been generally used in genomic characterization, in the analysis of animal and plant genomes, they have now been largely replaced by markers based on variation in the number of short tandem repeats. These markers are collectively called simple-sequence length polymorphisms (SSLPs). SSLPs have two basic advantages over RFLPs. First, in regard to RFLPs, usually only one or two “alleles,” or morphs, are found in a pedigree or population under study. This limits their usefulness; it would be better to have a larger number of alleles that could act as specific tags for a larger variety of homologous chromosomal regions. The SSLPs fill this need because multiple allelism is much more common, and as many as 15 alleles have been found for one locus. Second, the heterozygosity for RFLPs can be low; in other words, if one allele of a locus is relatively uncommon in relation to the other allele, the proportion of heterozygotes (the crucial individuals useful in mapping) will be low. However, SSLPs, in addition to having more alleles, show much higher levels of heterozygosity, which makes them more useful than RFLPs in mapping because heterozygotes are the basis for recombination analysis. Two types of SSLPs are now routinely used in genomics.
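The effect of allele number and allele frequency on the proportion of heterozygotes can be illustrated with a short calculation. The sketch below is not part of the original analysis. It assumes random mating, so that expected heterozygosity is 1 minus the sum of the squared allele frequencies; the function name and the two sets of frequencies are hypothetical.

```python
def expected_heterozygosity(allele_freqs):
    """Expected fraction of heterozygotes under random mating: 1 - sum(p_i ** 2)."""
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_freqs)


# Hypothetical two-allele marker with one uncommon allele (RFLP-like).
rflp_like = [0.9, 0.1]
# Hypothetical marker with 15 equally frequent alleles (SSLP-like).
sslp_like = [1 / 15] * 15

print(round(expected_heterozygosity(rflp_like), 3))  # 0.18
print(round(expected_heterozygosity(sslp_like), 3))  # 0.933
```

Under these assumed frequencies, fewer than one individual in five is informative at the two-allele marker, whereas more than nine in ten are informative at the multiallelic one.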
Figure 14-3. DNA fingerprints from a bloodstain at the scene of a crime and from the blood of seven suspects. (Cellmark Diagnostics, Germantown, MD.)
Figure 14-4. Obtaining a DNA fingerprint by using a VNTR probe. (a) Preparation of the probe. The first intron of the myoglobin gene has four repeats of the sequence shown, which contains a 13-bp core sequence (shown in boldface). This core sequence is found at other VNTR loci, labeled VNTR I, II, and III in this simple diagrammatic representation. (b) The number of repeats at the three VNTR loci with the core sequence. The Southern blot has been probed with the 33-bp repeat in part a and shows the DNA fingerprints of three people. (From J. D. Watson, M. Gilman, J. Witkowski, and M. Zoller, Recombinant DNA, 2d ed. Copyright © 1992 by James D. Watson, Michael Gilman, Jan Witkowski, and Mark Zoller.)
Figure 14-5. Using DNA fingerprint bands as molecular markers in mapping. Simplified fingerprints are shown for parents and five progeny. Examples illustrate methods of linkage analysis. Molecular markers can be mapped to one another or to a locus with known phenotypic expression.
Minisatellite markers. Minisatellite markers are based on variation in the number of tandem repeats; such loci are called variable number tandem repeats (VNTRs). A DNA fingerprint is an array of bands in a Southern hybridization of a restriction digest (Figures 14-3 and 14-4). The individual bands of the DNA fingerprint represent different-sized DNA sequences at many different chromosomal positions. If parents differ for a particular band, then this difference becomes a heterozygous (“plus/minus”) locus that can be used in mapping. A simple example is shown in Figure 14-5. This same technique can be applied in most organisms with repetitive DNA.
Microsatellite markers. Recall that microsatellite DNA is a class of repetitive DNA based on dinucleotide repeats. The most common type consists of repeats of CA and its complement GT, written (CA)n/(GT)n.
Probes for detecting these segments are made with the help of the polymerase chain reaction (PCR; see Chapter 12). First, digestion of human DNA with the restriction enzyme AluI results in fragments with an average length of 400 bp, and these fragments are cloned into an M13 phage vector. Phages with (CA)n/(GT)n inserts are identified by hybridizing with a (CA)n/(GT)n probe. Positive clones are sequenced, and PCR primer pairs are designed on the basis of sequences flanking the repetitive tract.

Figure 14-6. Using microsatellite repeats as molecular markers for mapping. A hybridization pattern is shown for a family with six children, and this pattern is interpreted at the top of the illustration with the use of four different-sized microsatellite “alleles,” M′ through M′′′′, one of which (M′′) is probably linked in cis configuration to the disease allele P.
Figure 14-7. Linkage map of human chromosome 1, correlated with chromosome banding pattern. The histogram shows the distribution of all markers available for chromosome 1. Some markers are genes of known phenotype, but most are DNA markers based on neutral sequence variation.
A linkage map, based on recombinant frequency analyses of the type described in this chapter, is in the center of the illustration. It shows only some of the markers available. Map distances are shown in centimorgans (cM, or m.u.). The total length of the chromosome 1 map is 356 cM; it is the longest human chromosome.
The positions of some markers are cross-referenced to a diagram of subregions of chromosome 1 based on a standard banding pattern (such a diagram is called an idiogram). These kinds of correlations can be made only by using cytogenetic analysis (Chapter 17) and in situ hybridization. Most of the markers shown on the map are molecular, but several genes (highlighted in light green) also are included.
(From B. R. Jasney et al., “The Genome Maps 1994,” Science, 265, 1994, 2055–2070.)
The primers are used to amplify DNA with the use of genomic DNA as a substrate. An individual primer pair will amplify its own repetitive tract and any size variants of it in DNAs from different individuals. A high proportion of PCR primer pairs reveal at least three marker “alleles” as amplification products of different sizes. An example of the microsatellite mapping technique is shown in Figure 14-6. Thousands of primer pairs can be made that likewise detect thousands of marker loci. The latest microsatellite marker map of human chromosome 1 is shown in Figure 14-7.
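The reason a single primer pair distinguishes microsatellite “alleles” is that the flanking sequence is constant while the repeat number varies, so product size tracks repeat count. The following sketch makes that arithmetic explicit. The flanking sequences (which stand for everything between the outer ends of the two primers and the repeat tract), the function names, and the repeat counts are all hypothetical, chosen only for illustration.

```python
def product_length(left_flank, right_flank, repeat_unit, n_repeats):
    """Length in bp of the PCR product spanning both flanks and the repeat tract."""
    return len(left_flank) + len(repeat_unit) * n_repeats + len(right_flank)


def repeat_counts(product_sizes, left_flank, right_flank, repeat_unit="CA"):
    """Convert observed product sizes (bp) back into repeat counts."""
    flank_total = len(left_flank) + len(right_flank)
    counts = []
    for size in product_sizes:
        tract = size - flank_total
        if tract < 0 or tract % len(repeat_unit) != 0:
            raise ValueError(f"size {size} does not fit a whole number of repeats")
        counts.append(tract // len(repeat_unit))
    return sorted(counts)


# Hypothetical flanking sequences, including the primer-binding sites.
LEFT = "GATTACAGGCTTAGCCTGAA"
RIGHT = "TTCGGATCCATGCAAGTCCA"

# A hypothetical heterozygote carrying a (CA)12 allele and a (CA)17 allele.
sizes = [product_length(LEFT, RIGHT, "CA", n) for n in (12, 17)]
print(repeat_counts(sizes, LEFT, RIGHT))  # [12, 17]
```

In practice the sizes are read from a gel or capillary trace rather than computed, so the conversion back to repeat counts is only as good as the sizing resolution.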
Note some differences in the convenience of RFLP and SSLP analyses. RFLP analysis requires a specific cloned probe to be on hand in the laboratory for the detection of each individual marker locus. Microsatellite analysis requires a primer pair for each marker locus, but these primer sequences can be easily shared throughout the world—distributed by electronic mail and rapidly constructed by using a DNA synthesizer. Minisatellite analysis requires just one probe that detects the core sequence of the repetitive element at loci anywhere in the genome.
Together, the discovery of RFLP and SSLP markers has enabled the construction of a human genetic map with centimorgan [cM, or map unit (m.u.)] density. Although such resolution is a remarkable achievement, a centimorgan is still a huge segment of DNA, estimated in humans to be 1 megabase (1 Mb = 1 million base pairs, or 1000 kb). Currently, even higher resolution genetic maps are being developed on the basis of single-nucleotide polymorphisms (SNPs). An SNP is a single base-pair site within the genome at which more than one of the four possible base pairs is commonly found in natural populations. Several hundred thousand SNP sites are being identified and mapped on the sequence of the genome, providing the densest possible map of genetic differences.
MESSAGE
Meiotic recombination analysis of loci of genes with known phenotypic effect, RFLP markers, and SSLP markers has resulted in a map of the human genome that is saturated down to the 1 centimorgan (1 map unit) level. SNP analysis promises even greater resolution.
Figure 14-8. (a) Randomly amplified polymorphic DNA analysis (RAPD analysis) provides molecular chromosome markers. If another strain lacks one of the bands, this band can be considered a heterozygous marker locus and used in mapping, as shown in the example in part b. (b) RAPD analysis of a cross in a species of tree. A 10-nucleotide primer was used to amplify regions of genomic DNA in the parental trees (M, male; F, female) and 10 progeny. DNA standards for size calibration are shown in the right lane, marked “stds.” The arrows point to two bands that represent two loci in a “dihybrid” testcross. The two alleles of both these loci are expressed as the presence and the absence of a RAPD band. Of the two parents, only the male showed these bands, but the male must have been heterozygous for the presence (+ allele) and absence (− allele) of bands at both loci. The male could be designated 1+/1− · 2+/2−, and the female 1−/1− · 2−/2−. The progeny show various parental and nonparental combinations of these alleles. (John E. Carlson.)
Randomly amplified polymorphic DNAs (RAPDs).

A single PCR primer designed at random will often by chance amplify several different regions of the genome. The single sequence “finds” DNA bracketed by two inverted copies of the primer sequence. The result is a set of different-sized amplified bands of DNA (Figure 14-8). In a cross, some of the amplified bands may be unique to one parent, in which case they can be treated as heterozygous (+/−) loci and used as molecular markers in mapping analysis. Notice, too, that the set of amplified DNA fragments (called a RAPD, pronounced “rapid”) is yet another type of DNA fingerprint that can be used to characterize an individual organism. Such identity tags can be very useful in routine genetic analysis or in population studies.
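Scoring two RAPD bands as present or absent in the progeny of a testcross gives the same kind of data as a conventional dihybrid testcross, so a recombinant frequency can be estimated directly. The sketch below shows one way to do this. The progeny scores are invented for illustration and are not read from any figure, and the function name is hypothetical. Because the phase of the two bands in the heterozygous parent is not known in advance, the sketch assumes that the more frequent pair of classes is parental.

```python
from collections import Counter


def recombinant_frequency(progeny):
    """Estimate RF from testcross progeny scored as (band1_present, band2_present)."""
    if not progeny:
        raise ValueError("no progeny scored")
    counts = Counter(progeny)
    both_or_neither = counts[(True, True)] + counts[(False, False)]
    one_only = counts[(True, False)] + counts[(False, True)]
    # Phase is unknown, so the rarer pair of classes is taken as recombinant.
    recombinants = min(both_or_neither, one_only)
    return recombinants / len(progeny)


# Hypothetical scores for 10 progeny of a 1+/1- . 2+/2- male crossed
# to a 1-/1- . 2-/2- female.
progeny = (
    [(True, True)] * 4
    + [(False, False)] * 4
    + [(True, False)]
    + [(False, True)]
)
print(recombinant_frequency(progeny))  # 0.2
```

With only 10 progeny the estimate is very imprecise; a real mapping experiment would score many more individuals before accepting linkage.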
In situ hybridization.

If a cloned gene is available, it can be used to make a labeled probe for hybridization to chromosomes in situ. If the individual chromosomes of the genomic set are recognizable through their banding patterns, size, arm ratio, or other cytological feature, then the new gene can be assigned to the chromosome to which it hybridizes. Furthermore, the locus of hybridization reveals a rough chromosomal position.
Figure 14-9. FISH analysis. Chromosomes probed in situ with a fluorescent probe specific for a gene present in a single copy in each chromosome set (in this case, a muscle protein gene). Only one locus shows a fluorescent spot corresponding to the probe bound to the muscle protein gene. (From P. Lichter et al., “High-Resolution Mapping of Human Chromosome 11 by in Situ Hybridization with Cosmid Clones,” Science 247, 1990, 64.)
Figure 14-10. Chromosome painting by in situ hybridization with different-labeled probes. (Applied Imaging, Hylton Park, Wessington, Sunderland, U.K.)
Commonly used probe labels are radioactivity and fluorescence. In the process of fluorescence in situ hybridization (FISH), the clone is labeled with a fluorescent dye, and a partially denatured chromosome preparation is bathed in the probe. The probe binds to the chromosome in situ, and the location of the cloned fragment is revealed by a bright fluorescent spot (Figure 14-9). An extension of FISH is chromosome painting. Sets of cloned DNA known to be from specific chromosomes or specific chromosome regions are labeled with different fluorescent dyes. These dyes then “paint” specific regions and identify them under the microscope (Figure 14-10). If a clone of a gene of unknown location is labeled with yet another dye, its position can be established in the painted array.
Rearrangement breakpoints.

We shall see in Chapter 17 that mutant alleles giving a new observable phenotype are sometimes caused by a chromosomal rearrangement. Usually such mutations trace to a chromosome break that is part of the rearrangement and that splits the gene in two, disrupting vital coding or regulatory sequences. If the break can be seen or mapped to known markers by recombination analysis, then this information can be used to assign a gene to a position on a cytogenetic map of a chromosome. One helpful feature of rearrangement breaks is that they also serve as molecular landmarks. When cloned DNA spanning a break has been identified, the break is easily detected on Southern blots as the loss of an expected band and the appearance of two novel bands.
Radiation hybrid mapping.

Figure 14-11. Making radiation hybrids by using X rays. Fragments of human chromosomes integrate into rodent chromosomes. A panel of different radiation hybrids is analyzed for cotransfer of human markers, which can indicate linkage.
The technique that is used to localize genes to individual chromosomes can be extended to obtain map loci. One important extension is radiation hybrid mapping. This technique was designed to produce a higher-resolution map of molecular markers along a chromosome. The procedure is to X-ray treat human cells to fragment the chromosomes and then fuse the irradiated cells with rodent cells to form a panel of different hybrids. In this case, the hybrids have an assortment of fragments of human chromosomes, as diagrammed in Figure 14-11. Most of the fragments are seen to be embedded in the rodent chromosomes, but truncated human chromosomes also can be found. First, the frequency of various human molecular markers in the hybrids is calculated. The next step is to calculate the frequency of the co-occurrence of pairs of human molecular markers. Closely linked markers are assumed to be incorporated together at high frequencies because the probability that an X-irradiation-induced break will occur between the loci is low. Distant markers and markers on different chromosomes should be present together at frequencies close to the product of their individual frequencies. A mapping unit, cR3000, is calculated, which has been calibrated to approximately 0.1 cM (m.u.).
A standard panel in the range of 100 to 200 radiation hybrids is quite straightforward to obtain. Such a panel is sufficient to obtain a high-resolution cR3000 map of the human genome, which would have 10-fold greater resolution than the current centimorgan genetic map. One downside of the technique is that it is limited to those markers for which human–rodent differences are available.
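The comparison at the heart of radiation hybrid mapping, observed co-occurrence against the product of the individual marker frequencies, can be sketched in a few lines. The panel below is invented (10 hybrids rather than the 100 to 200 of a real panel), the marker names M1 to M3 and the function name are hypothetical, and the sketch stops short of converting the result into cR3000 units because the text does not give that conversion.

```python
def retention_stats(panel, marker_a, marker_b):
    """Return (freq_a, freq_b, observed_both, expected_if_independent)."""
    n = len(panel)
    freq_a = sum(marker_a in hybrid for hybrid in panel) / n
    freq_b = sum(marker_b in hybrid for hybrid in panel) / n
    observed_both = sum(marker_a in hybrid and marker_b in hybrid for hybrid in panel) / n
    return freq_a, freq_b, observed_both, freq_a * freq_b


# Hypothetical panel: each set lists the human markers retained by one hybrid.
panel = [
    {"M1", "M2"}, {"M1", "M2"}, {"M1", "M2", "M3"}, {"M3"}, set(),
    set(), {"M3"}, {"M1", "M2"}, set(), {"M1"},
]

for a, b in [("M1", "M2"), ("M1", "M3")]:
    fa, fb, observed, expected = retention_stats(panel, a, b)
    print(a, b, f"observed={observed:.2f}", f"expected={expected:.2f}")
# M1 M2 observed=0.40 expected=0.20
# M1 M3 observed=0.10 expected=0.15
```

In this made-up panel, M1 and M2 are retained together twice as often as independence predicts, which is the signature of close linkage, whereas M1 and M3 behave roughly as independent markers.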
MESSAGE
Correlation of human markers and chromosomes in hybrid rodent–human cell lines allows chromosomal assignment of the markers. The co-occurrence of different human markers in X-irradiated hybrids allows high-resolution mapping of the chromosomal loci of the markers.
In this section on marker mapping, we have encountered techniques based on widely differing premises—for example, meiotic crossover frequency and radiation-induced breakage. Hence, even though these maps give the same order of markers, distances between markers on one map may not be proportional to distances between markers on another map.
Physical mapping of genomes
A further increase in mapping resolution is accomplished by manipulating cloned DNA fragments directly. Because DNA is the physical material of the genome, the procedures are generally called physical mapping. One goal of physical mapping is to identify a set of overlapping cloned fragments that together encompass an entire chromosome or an entire genome. The resulting physical map is useful in three ways. First, the genetic markers carried on the clones can be ordered and hence contribute to the overall genome mapping process. Second, when the contiguous clones have been obtained, they represent an ordered library of DNA sequences that can be exploited for future genetic analysis—for example, to correlate mutant phenotypes with disruptions of specific molecular regions. Third, these clones form the raw material that will be sequenced in large-scale genome projects.
Figure 14-12. Structure of a bacterial artificial chromosome (BAC), used for cloning large fragments of donor DNA. CMR is a selectable marker for chloramphenicol resistance. oriS, repE, parA, and parB are F genes for replication and regulation of copy number. cosN is the cos site from λ phage. HindIII and BamHI are cloning sites at which foreign DNA is inserted. The two promoters are for transcribing the inserted fragment. The NotI sites are used for cutting out the inserted fragment.
In the preparation of physical maps of genomes, vectors that can carry very large inserts are naturally the most useful. Cosmids, YACs (yeast artificial chromosomes), BACs (bacterial artificial chromosomes), and PACs (phage P1-based artificial chromosomes) have been the main types. Cosmids and YACs were introduced in Chapters 12 and 13.
BACs (Figure 14-12) are based on the 7-kb F plasmid of E. coli. Recall that F can carry large fragments of E. coli DNA as F′ derivatives (Chapter 7). In a similar manner, as cloning vectors, they can also carry inserts of fragments of foreign DNA as large as 300 kb, although the average is about 100 kb.
PACs are produced by a type of engineering similar to that of phage P1; they carry inserts comparable to those of BACs.
Although the maximum insert sizes of BACs and PACs are not as large as those of YACs, the former types have several advantages over YACs. First, they can be amplified in bacteria and isolated and manipulated simply with basic bacterial plasmid technology. Second, BACs and PACs form fewer hybrid inserts than YACs do. Hybrid inserts are composed of several different fragments; their presence can thwart attempts to order the clones.
However, despite these useful vectors, the task of genomic cloning is a daunting one. Even so-called small genomes still contain huge amounts of DNA. Consider, for example, the 100-Mb genome of the tiny nematode Caenorhabditis elegans; because an average cosmid insert is about 40 kb, at least 2500 cosmids would be required to embrace this genome, and many more would have to be isolated and screened to arrive at such a complete set. YACs can contain on the order of 1 Mb, so here the task is somewhat simpler.
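The arithmetic behind these numbers is easy to reproduce. The sketch below computes the minimum number of clones (genome size divided by insert size) and, as an addition that does not come from the text, the Clarke-Carbon estimate of how many randomly picked clones are needed for a given probability that any particular sequence is represented. The function names are hypothetical.

```python
import math


def minimum_clones(genome_bp, insert_bp):
    """Fewest clones that could tile the genome end to end with no overlap."""
    return math.ceil(genome_bp / insert_bp)


def clones_for_probability(genome_bp, insert_bp, probability):
    """Clarke-Carbon estimate: N = ln(1 - P) / ln(1 - insert/genome)."""
    fraction = insert_bp / genome_bp
    return math.ceil(math.log(1 - probability) / math.log(1 - fraction))


C_ELEGANS_BP = 100_000_000   # 100-Mb genome, as stated in the text
COSMID_BP = 40_000           # average cosmid insert
YAC_BP = 1_000_000           # approximate YAC insert

print(minimum_clones(C_ELEGANS_BP, COSMID_BP))  # 2500
print(minimum_clones(C_ELEGANS_BP, YAC_BP))     # 100
# Random cosmids needed for a 99% chance that a given sequence is present:
print(clones_for_probability(C_ELEGANS_BP, COSMID_BP, 0.99))
```

The estimate for random clones is several times the minimum, which is why a library must be much larger than the tiling set that is eventually extracted from it.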
Cloning a whole genome begins by amassing a large number of randomly cloned inserts. The contents of these clones must be characterized in some way, and overlaps must be determined. A set of overlapping clones is called a contig. In the early phases of a genome project, contigs are numerous and represent cloned “islands” of the genome. But, as more and more clones are characterized, contigs enlarge and merge into one another, and eventually the project should end up with a set of contigs that equals the number of chromosomes.
Chromosome-specific libraries
Figure 14-13. Chromosome purification by using flow sorting. Chromosomes stained with a fluorescent dye are passed through a laser beam. Each time, the amount of fluorescence is measured, and the chromosome is deflected accordingly. The chromosomes are then collected as droplets.
If a library of clones is prepared from total genomic DNA, then contig development is relatively slow. However, if a specific chromosome can be used to develop the library of clones, contigs emerge more rapidly. Pulsed-field gel electrophoresis (PFGE) can be used to isolate individual chromosomes (if they are small) or chromosome fragments cut with “long-cutter” enzymes such as NotI.
Flow sorting is another option for preparing DNA of a specific chromosome. Chromosomes (such as human chromosomes) can be flow-sorted by fluorescence-activated chromosome sorting (FACS; Figure 14-13). In this procedure, metaphase chromosomes are stained with two dyes, one of which binds to AT-rich regions and the other to GC-rich regions. Cells are disrupted to liberate whole chromosomes into liquid suspension. This suspension is converted into a spray in which the concentration of chromosomes is such that each spray droplet contains one chromosome. The spray passes through laser beams tuned to excite the fluorescence. Each chromosome produces its own characteristic fluorescence signal, which is recognized electronically, and two deflector plates direct the droplets containing the specific chromosome needed into a collection tube.
MESSAGE
Genomic cloning proceeds by assembling clones into overlapping groups called contigs. As more data accumulate, the contigs become equivalent to whole chromosomes.
Several different techniques are used to order genomic clones into contigs. We shall consider some of the main ones.
Ordering by FISH.

Figure 14-14. Ordering BAC and PAC clones by FISH. Each color represents a type of vector; each circle represents hybridization by a specific clone; red arrows represent centromere-specific binding. (Julie Korenberg, Ph.D., M.D., and Xiao-Ning Chen, M.D., Cedars Sinai Medical Center, Los Angeles, CA.)
If good chromosomal landmarks are known, FISH analysis can be used to locate the approximate positions of the large inserts. Figure 14-14 shows results of a FISH analysis that generates a rough ordering of BACs and PACs in human chromosomes.
Ordering by clone fingerprints.

The genomic insert carried by a vector has its own unique sequence, which can be used to generate a DNA fingerprint. For example, a multiple restriction-enzyme digestion can generate a set of bands whose number and positions are a unique “fingerprint” of that clone. The different bands generated by separate clones can be aligned either visually or by using a computer program to determine if there is any overlap between the inserted DNAs. In this way, the contig can be built up.
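A computer comparison of two clone fingerprints amounts to counting bands of matching size. The sketch below is one simple way to do that, not a description of any particular fingerprinting program. The fragment sizes are invented, the 2 percent size tolerance is an arbitrary assumption standing in for gel measurement error, and the function names are hypothetical.

```python
def shared_bands(bands_a, bands_b, tolerance=0.02):
    """Count bands of clone A that match a distinct band of clone B in size.

    Two bands match when their sizes differ by no more than `tolerance`
    (a fraction) of the larger one. Each band of B is used at most once.
    """
    remaining = sorted(bands_b)
    shared = 0
    for size in sorted(bands_a):
        for index, other in enumerate(remaining):
            if abs(size - other) <= tolerance * max(size, other):
                shared += 1
                del remaining[index]
                break
    return shared


def overlap_score(bands_a, bands_b, tolerance=0.02):
    """Shared bands as a fraction of the smaller fingerprint."""
    return shared_bands(bands_a, bands_b, tolerance) / min(len(bands_a), len(bands_b))


# Hypothetical restriction fragment sizes (bp) for three clones.
clone_a = [500, 1200, 2300, 4100, 7800]
clone_b = [1210, 2290, 4100, 5600, 9000]
clone_c = [300, 650, 3300]

print(overlap_score(clone_a, clone_b))  # 0.6
print(overlap_score(clone_a, clone_c))  # 0.0
```

A high score suggests that two inserts overlap, but chance matches do occur, so real projects set the threshold from the number of bands and the sizing error.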
Ordering by sequence-tagged sites.

Figure 14-15. Using sequence-tagged sites (STSs) to order overlapping clones (YACs, in this example) into a contig. Five different YACs are tested to determine which STSs they contain (top), and these data are used to assemble a physical map (bottom).
Unique short sequences of large cloned inserts can be used as tags to align the various clones into contigs. For example, if clone A has tags 1 and 2 and clone B has tags 2 and 3, clones A and B must overlap in the region of tag 2. The practical procedure is to amass a large set of random clones with small genomic inserts (say, in λ phage) and sequence short regions of each. From these sequences, pairs of PCR primers are designed that will amplify the short specific sequence of DNA flanked by the primers. These short DNA sequences are known as sequence-tagged sites (STSs). Even though initially the location of these STSs in the genome is not known, a panel of many STSs can be used to characterize clones with large genomic inserts (such as YAC clones). The clones that are shown to have specific STSs in common must have overlapping inserts and therefore can be aligned into contigs. An example of this process is shown in Figure 14-15.
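The rule that clones sharing an STS must overlap translates directly into a grouping procedure. The sketch below joins clones into contigs whenever they have at least one STS in common. It only groups the clones and does not work out their order within a contig. The clone names, STS numbers, and function name are hypothetical.

```python
def build_contigs(clone_sts):
    """Group clones into contigs; clones sharing any STS land in the same contig."""
    # Index each STS by the clones that contain it.
    sts_to_clones = {}
    for clone, tags in clone_sts.items():
        for tag in tags:
            sts_to_clones.setdefault(tag, set()).add(clone)

    unvisited = set(clone_sts)
    contigs = []
    while unvisited:
        start = min(unvisited)
        unvisited.remove(start)
        stack, members = [start], set()
        # Follow shared STSs outward until no new clone can be reached.
        while stack:
            clone = stack.pop()
            members.add(clone)
            for tag in clone_sts[clone]:
                for neighbor in sts_to_clones[tag]:
                    if neighbor in unvisited:
                        unvisited.remove(neighbor)
                        stack.append(neighbor)
        contigs.append(sorted(members))
    return sorted(contigs)


# Clones A and B follow the example in the text; C, D, and E are added
# to show chaining and a separate "island".
clones = {
    "A": {1, 2},
    "B": {2, 3},
    "C": {3, 4},
    "D": {7, 8},
    "E": {8},
}
print(build_contigs(clones))  # [['A', 'B', 'C'], ['D', 'E']]
```

Clones A and C share no STS, yet they fall into one contig because B bridges them, which is how islands merge as more clones are characterized.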
Short stretches of sequence are sometimes obtained from cDNA clones. These stretches are known as expressed sequence tags (ESTs). ESTs are obtained by sequencing into the cDNA insert by using a primer based on the vector sequence. They can be used to align the cDNAs on the contig, thus anchoring the gene map to the physical map. Further, if part of the open reading frame (ORF) of the transcript is contained within the EST, the “virtual” translation of the ORF can provide a “sneak preview” of the function of the protein encoded by the mRNA from which the cDNA was derived.
Figure 14-16. Using an ordered array of YACs to locate the map position of a newly cloned gene in C. elegans. The YAC clones are placed in order on a polytene filter. DNA from this filter is blotted and probed with the cloned gene, giving the autoradiogram with two positive spots corresponding to two adjacent YACs numbered 332 and 333. Because of the hybridization pattern obtained, the location of the cloned gene can be narrowed down to a small region on chromosome III. (Autoradiogram from Alan Coulson.)
The combination of these physical methods has resulted in the cloning of whole genomes of several organisms. For example, the C. elegans genome is now available as sets of cosmid or YAC contigs. Furthermore, the DNA of the contigs has been arranged on nitrocellulose filters in ordered arrays; so, to find out where a specific piece of DNA of interest lies in the genome, that DNA is used as a probe on the contig filters, and a positive hybridization signal announces the precise location of the DNA (Figure 14-16).
An example: cloning and mapping the human Y chromosome
Several of the smaller human chromosomes have been fully cloned as overlapping sets of YAC clones (contigs). We shall examine the cloning of the Y chromosome as an example because it illustrates several of the techniques of physical mapping. The STS map of the Y chromosome was in fact obtained by two different methods—YAC alignment and deletion analysis.
YAC alignment.

Flow sorting yielded a sample of Y chromosomes, from which λ clones were made. From clones that did not contain repetitive DNA, STS primers were designed. In all, 160 primer pairs were made. A Y chromosome YAC library of 10,368 clones was obtained in which the average insert size was 650 kb. From these numbers, each point on the Y chromosome was estimated to have been sampled an average of four times. The YAC clones were divided into 18 pools of 576 YACs, and the pools were screened with the STS primers. Subdivision of positive pools led rapidly to the assignment of a particular STS to specific YACs. The total STS content of each YAC was assessed, and overlaps between the YACs were determined in the same way as that shown in the generalized example in Figure 14-15.
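Pooling is what makes it practical to test 10,368 clones against each STS. The sketch below simulates the idea for one STS: a PCR is run on each of the 18 pools, and only the positive pools are subdivided. The text does not say how the pools were subdivided, so the repeated halving used here is an assumption, as are the function names and the clone numbers chosen as positives.

```python
def find_positives(pool, is_positive):
    """Return (positive clones, PCR reactions used) by recursively halving a pool."""
    reactions = 1  # one PCR on DNA pooled from every clone in `pool`
    if not any(is_positive(clone) for clone in pool):
        return [], reactions
    if len(pool) == 1:
        return list(pool), reactions
    middle = len(pool) // 2
    left_hits, left_reactions = find_positives(pool[:middle], is_positive)
    right_hits, right_reactions = find_positives(pool[middle:], is_positive)
    return left_hits + right_hits, reactions + left_reactions + right_reactions


def screen_library(n_clones, pool_size, positive_clones):
    """Screen a numbered library in fixed-size pools; return hits, reactions, pool count."""
    clones = list(range(n_clones))
    pools = [clones[i:i + pool_size] for i in range(0, n_clones, pool_size)]
    hits, reactions = [], 0
    for pool in pools:
        found, used = find_positives(pool, lambda clone: clone in positive_clones)
        hits.extend(found)
        reactions += used
    return hits, reactions, len(pools)


# 10,368 clones in pools of 576, as in the text; the positives are hypothetical.
hits, reactions, n_pools = screen_library(10_368, 576, {100, 5000, 9999, 10_000})
print(n_pools)             # 18
print(hits)                # [100, 5000, 9999, 10000]
print(reactions < 10_368)  # True
```

Because most pools are negative for any one STS, the number of reactions stays far below one per clone.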
(a) The STS content of naturally occurring Y chromosome fragments was used to order the STSs on a Y chromosome map. Note: “q−” means “lacking the q arm”; “trans Yq” means “translocation of Yq to an autosome.” (b) Deep freeze containing plates of YAC clones used in the human genome project. (Part b from Roger Bessmeyer—© 1995 Corbis. All rights reserved Library of Human Genes.)
Deletion analysis.

Various types of Y chromosome deletions occur naturally. For example, some XX males contain truncated fragments of the Y, whereas some XY females have deletions of the region containing the maleness (testis-determining) gene (see Chapters 2 and 23). These Y deletions were maintained in cell culture and formed the basis for aligning the Y chromosome STSs. Each deletion was tested for STS content. Because by nature the deletions were nested sets, the STS content could be used not only to develop an STS map, but also to map the coverage of the deletions. The principle is illustrated in part a of the accompanying figure. The STS maps produced by YAC alignment and by deletion analysis were identical.
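The logic of ordering STSs from nested deletions can be sketched as follows. The simplifying assumption here, which goes beyond the text, is that every partial chromosome is a truncation sharing the same end, so that the retained STS sets nest strictly inside one another. The fragment names, STS names, and function name are hypothetical.

```python
def order_sts_from_nested_fragments(fragments):
    """Order STSs from nested chromosome fragments.

    `fragments` maps a fragment name to the set of STSs it retains.
    Returns a list of STS groups, nearest the shared end first; STSs in
    the same group cannot be ordered relative to one another.
    """
    ordered = sorted(fragments.values(), key=len)
    for smaller, larger in zip(ordered, ordered[1:]):
        if not smaller <= larger:
            raise ValueError("fragments are not nested")
    intervals, seen = [], set()
    for retained in ordered:
        new = sorted(retained - seen)  # STSs first reached by this fragment
        if new:
            intervals.append(new)
        seen |= retained
    return intervals


# Hypothetical STS content of three truncated Y chromosomes.
fragments = {
    "short": {"sY1"},
    "medium": {"sY1", "sY2", "sY3"},
    "long": {"sY1", "sY2", "sY3", "sY4"},
}
print(order_sts_from_nested_fragments(fragments))
# [['sY1'], ['sY2', 'sY3'], ['sY4']]
```

The same pass also maps the deletions themselves, because each breakpoint falls between two adjacent groups. Real collections include interstitial deletions and translocations, which need a more general interval-ordering method.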
MESSAGE
Clones can be arranged into contigs by matching DNA fingerprints, by matching short sequences within cloned segments, and by analyzing deletions.
Using genome maps for genetic analysis
Genetic and physical maps are an important starting point for several types of genetic analysis, including gene isolation (including human disease genes) and functional genomics.
Isolating human disease genes by positional cloning.

We shall follow the methods used to identify the genomic sequence of the cystic fibrosis (CF) gene as an example. No primary biochemical defect was known at the time that the gene was isolated, so it was very much a gene in search of a function. Linkage to molecular markers had located the gene to the long arm of chromosome 7, between bands 7q22 and 7q31.1. The CF gene was thought to be inside this region, flanked by the gene met (a proto-oncogene; see Chapter 22) at one end and a molecular marker, D7S8, at the other end. But between these markers lay 1.5 centimorgans (map units) of DNA, a vast uncharted terrain of 1.5 million bases. Additional markers within the region were obtained by using new probes derived from a chromosome 7 library made by flow sorting.
However, the two key techniques that were used to traverse the huge genetic distances were chromosome walking (Chapter 13) and a related technique called chromosome jumping. The latter technique provides a way of jumping across potentially unclonable areas of DNA and generates widely spaced landmarks along the sequence that can be used as initiation points for multiple bidirectional chromosomal walks.
Figure 14-19. Manipulating cloned genomic fragments for chromosome jumping, a modified type of chromosome walking that can bypass regions difficult to clone, such as those containing repetitive DNA (see text).
Chromosome jumping is illustrated in Figure 14-19. In this procedure, large fragments are created by partial restriction cleavage of the DNA in the region believed to contain the gene of interest. Each DNA fragment is then circularized, thus bringing the beginning and end of the fragment together. This junction is cut out and cloned into a phage vector; the collection of such junction clones makes up a jumping library. A probe from the beginning of the stretch of DNA under investigation can be used to screen the jumping library to find the clone that contains the beginning sequence. When this clone is found, the other end of the junction sequence is excised and used to screen the library again to make a second jump. From each jump position, chromosome walks can be made in both directions to search for genelike sequences.
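The logic of circularizing a fragment, keeping only its junction, and then probing from one junction to the next can be pictured with a toy model. The following Python sketch is illustrative only: the chromosome is reduced to a list of position labels, and the function names `make_junction` and `jump`, the fragment coordinates, and the constant `END_LEN` are hypothetical choices made for the example, not part of the laboratory procedure.

```python
# Toy model of a jumping library (all values are illustrative assumptions).
# The "chromosome" is a list of position labels; each large fragment is a
# (start, stop) pair of indices, standing in for partial restriction cleavage.

END_LEN = 2  # hypothetical: number of positions retained at each fragment end


def make_junction(chromosome, start, stop):
    """Circularize fragment [start, stop) and keep only the junction.

    Circularization brings the end of the fragment next to its beginning,
    so the cloned junction holds both ends and none of the interior.
    """
    begin = tuple(chromosome[start:start + END_LEN])
    end = tuple(chromosome[stop - END_LEN:stop])
    return begin, end


def jump(library, probe):
    """Return the far end of the junction clone whose beginning contains
    the probe, or None if no clone in the library matches."""
    for begin, end in library:
        if probe in begin:
            return end
    return None


chromosome = [f"p{i}" for i in range(40)]
fragments = [(0, 12), (10, 25), (23, 38)]  # overlapping large fragments
library = [make_junction(chromosome, s, e) for s, e in fragments]

position = "p0"          # probe from the beginning of the region
path = [position]
while True:
    far_end = jump(library, position)
    if far_end is None:
        break
    position = far_end[0]  # landing point; a bidirectional walk could start here
    path.append(position)

print(path)
```

Each entry in `path` is a landing point reached without cloning the DNA in between, which is the sense in which jumping bypasses regions that are difficult to clone.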
A restriction map of the overall region was obtained with rare-cutting restriction enzymes, and the restriction sites were used to position and orient the sequences obtained from jumping and walking. When enough sequencing had been done to cover representative parts of the overall region, the hunt for any genes along this stretch began. Genes were sought by several techniques. First, human genes were known to be generally preceded at the 5′ end by clusters of cytosines and guanines, called CpG islands, and several of these clusters were found. Second, it was reasoned that a gene would show homology to the DNA of other animals, because of evolutionary conservation, so candidate sequences were used to probe what were called zoo blots of genomic DNA from a range of animals. Third, genes should have appropriate start and stop signals. Fourth, genes should be transcribed, and transcripts should be found.
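The first criterion, a search for CpG-rich clusters, lends itself to a simple calculation. The sketch below is a minimal illustration, not the method used in the CF study: it scores a sequence by its G+C fraction and by the ratio of observed to expected CpG dinucleotides. The function names and the two cutoff values (`min_gc`, `min_ratio`) are assumptions; the cutoffs are in the range of commonly used rules of thumb, and a real search would slide a window of fixed length along the sequence.

```python
def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CpG ratio) for a DNA string.

    Expected CpG count is (number of C * number of G) / length, the count
    anticipated if C and G were placed independently along the sequence.
    """
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n if n else 0.0
    expected = c * g / n if n else 0.0
    ratio = cpg / expected if expected else 0.0
    return gc_fraction, ratio


def looks_like_cpg_island(seq, min_gc=0.5, min_ratio=0.6):
    """Apply the two (assumed) cutoffs to a candidate stretch of DNA."""
    gc_fraction, ratio = cpg_stats(seq)
    return gc_fraction > min_gc and ratio > min_ratio


cpg_rich = "CG" * 20       # artificial sequence, every dinucleotide step is CpG or GpC
cpg_poor = "ATGCAT" * 10   # artificial sequence with no CpG at all

print(cpg_stats(cpg_rich), looks_like_cpg_island(cpg_rich))
print(cpg_stats(cpg_poor), looks_like_cpg_island(cpg_poor))
```

Both test sequences are artificial; they are chosen only so that the two outcomes of the test are easy to verify by hand.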
Ultimately, a strong candidate gene was found spanning 250 kb of the region. Some CF symptoms are expressed in sweat glands; so, from cultured sweat gland cells, cDNA was prepared, and a 6500-nucleotide cDNA homologous to the candidate gene was detected. When this cDNA was sequenced from normal individuals and from CF patients, the cDNA of the patients showed a deletion of three base pairs, eliminating a phenylalanine from the protein. It was therefore very likely that this was the CF coding sequence. From its cDNA nucleotide sequence, an amino acid sequence was inferred. In turn, from this inferred sequence, the three-dimensional structure of the protein was predicted. This protein is structurally similar to ion-transport proteins in other systems, suggesting that a transport defect is the primary cause of CF. When used to transform mutant cell lines from CF patients, the wild-type gene restored normal function; this phenotypic “rescue” was the final confirmation that the isolated sequence was in fact the CF gene.
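Why a deletion of exactly three base pairs removes a single amino acid, rather than scrambling the rest of the protein, can be seen by translating a short sequence before and after the deletion. The Python sketch below is a minimal illustration: the 24-base sequence and the deletion position are hypothetical values chosen for the example, and the helper names `translate` and `delete_bases` are likewise assumptions. Only the standard genetic code is taken as given.

```python
# Build the standard genetic code from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a coding-strand DNA string, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def delete_bases(dna, start, length):
    """Return dna with `length` bases removed, beginning at index `start`."""
    return dna[:start] + dna[start + length:]


# Hypothetical coding sequence used only for illustration.
normal = "GAAAATATCATCTTTGGTGTTTCC"

# A 3-bp deletion that straddles two codons (removes C-T-T at indices 11-13).
three_bp = delete_bases(normal, 11, 3)
# A 1-bp deletion at the same place, for contrast: it shifts the reading frame.
one_bp = delete_bases(normal, 11, 1)

print(translate(normal))    # reference peptide
print(translate(three_bp))  # one residue missing, downstream residues unchanged
print(translate(one_bp))    # downstream residues altered by the frameshift
```

In this example the three-base deletion removes the single phenylalanine (F) and leaves every other residue intact, whereas the one-base deletion changes the residues that follow it.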
The candidate-gene approach
Inevitably, intensive cloning and sequence-level characterization of a chromosomal region reveal the presence of genes of unknown function. If a gene of interest such as a disease gene is mapped to that chromosomal region, then these “orphan” gene sequences become candidate genes for the disease gene. This procedure is termed the candidate-gene approach to gene isolation. Knowledge about the phenotype associated with the gene of interest, such as the biochemical defect and the pattern of tissue expression, can be matched to the sequence domains and tissue expression of each candidate gene. The method works in the opposite direction, too; the domains and tissue expression of randomly sequenced genes often suggest a possible disease-gene phenotype.
MESSAGE
Cloning is made easier by the availability of a set of overlapping genomic clones.
Genes underlying complex inheritance patterns
Most of the contrasting phenotypes analyzed in this book are determined simply by alleles of a single gene. However, many phenotypes are determined in a complex manner. Here two situations can be distinguished.
Figure 14-20. Producing lines for QTL identification and mapping. Two pure lines that differ significantly in some quantitative trait (character) are crossed, and, after many generations of inbreeding, pure recombinant lines are produced. The comparison of phenotypes of these lines leads to the identification of specific chromosomal segments that consistently contribute to the parental difference. These regions contain presumptive QTLs. (After W. N. Frankel, “Taking Stock of Complex Trait Genetics in Mice,” Trends in Genetics 12, 1995, Figure 1.)
First, the phenotypic variation may be quantitative (Chapters 1 and 25), and the characters (traits) are called quantitative traits. Examples are metric characters such as height and weight. This type of variation is thought to be based on the cumulative interaction between + and − alleles of several genes and the environment. The availability of thousands of molecular markers such as SSLPs arranged along all the chromosomes of a genome has made it possible to map some of the genes that contribute to quantitative variation; their loci are called quantitative trait loci, abbreviated QTLs. The approach is to take two lines that show widely contrasting phenotypes for a quantitative trait and to interbreed these lines to generate homozygous descendants that contain only one segment or a small number of segments from one line, as shown in Figure 14-20. (These segments can be identified by the SSLP alleles that they carry.) Such hybrid individuals are then assessed for their quantitative phenotype, and estimates are made of the contributions (or lack of contribution) of specific segments to the observed variation. The average phenotype of lines with, say, region A is compared with the average of lines lacking region A; if there is a difference, region A becomes a candidate for containing a QTL. Ideally, a derived pure line would carry only one QTL, and then in backcrosses to the appropriate parent this QTL would segregate in a monohybrid manner. The QTL can then be mapped precisely by recombination with SSLP markers.
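The comparison of average phenotypes between lines that carry a region and lines that lack it can be written out in a few lines of code. The following Python sketch is a minimal illustration under stated assumptions: the marker names (M1 to M3), the parental labels "A" and "B", and the phenotype values are invented data, and `marker_effects` is a hypothetical helper. A real analysis would also test whether each difference is statistically significant.

```python
from statistics import mean

# Invented data: each recombinant line is (marker genotypes, phenotype value).
# "A" and "B" indicate which parental line contributed the segment at a marker.
lines = [
    ({"M1": "A", "M2": "A", "M3": "B"}, 12.0),
    ({"M1": "A", "M2": "B", "M3": "B"}, 11.0),
    ({"M1": "B", "M2": "A", "M3": "A"}, 7.0),
    ({"M1": "B", "M2": "B", "M3": "A"}, 6.0),
    ({"M1": "A", "M2": "B", "M3": "A"}, 11.5),
    ({"M1": "B", "M2": "A", "M3": "B"}, 6.5),
]


def marker_effects(lines):
    """For each marker, return mean phenotype of lines carrying the A
    segment minus mean phenotype of lines carrying the B segment."""
    effects = {}
    for marker in lines[0][0]:
        with_a = [pheno for geno, pheno in lines if geno[marker] == "A"]
        with_b = [pheno for geno, pheno in lines if geno[marker] == "B"]
        effects[marker] = mean(with_a) - mean(with_b)
    return effects


effects = marker_effects(lines)
for marker, diff in effects.items():
    print(marker, round(diff, 2))

# The marker with the largest absolute difference flags a candidate QTL region.
candidate = max(effects, key=lambda m: abs(effects[m]))
print("candidate region near", candidate)
```

In this invented data set the largest difference is at M1, so the segment marked by M1 would be the first candidate for containing a QTL.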
The second situation is a type of discontinuous variation that is not inherited in a simple Mendelian manner. Examples are all-or-none phenotypes such as epilepsy, heart disease, diabetes, and Alzheimer disease. Here the model for inheritance is again alleles of one to several contributing genes plus a large environmental component. However, to produce discontinuous phenotypes, these factors seem to contribute to a type of cellular or organismal “threshold” beyond which the disorder is expressed. These conditions also are amenable to gene identification by using the approach shown in Figure 14-20, and several complex trait loci have been identified in experimental organisms and humans. In humans, studies on isolated populations with little genetic variation are particularly useful in identifying the contributing loci. In the future, SNP analysis promises to accelerate the mapping of complex traits.
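The threshold idea can be made concrete with a small simulation. The Python sketch below is illustrative only and rests on several assumptions not stated in the text: each contributing allele adds a fixed amount to an underlying score, the environment adds normally distributed noise, and an individual is affected when the score exceeds a fixed threshold. The parameter values and the function names `liability` and `affected_fraction` are hypothetical.

```python
import random


def liability(n_risk_alleles, rng, effect=1.0, env_sd=1.0):
    """Underlying score: additive allele contribution plus environmental noise."""
    return n_risk_alleles * effect + rng.gauss(0.0, env_sd)


def affected_fraction(n_risk_alleles, threshold=4.0, trials=20000, seed=0):
    """Fraction of simulated individuals whose score exceeds the threshold."""
    rng = random.Random(seed)  # seeded so that the run is repeatable
    hits = sum(
        liability(n_risk_alleles, rng) > threshold for _ in range(trials)
    )
    return hits / trials


for n in (0, 3, 6):
    print(n, "contributing alleles ->", affected_fraction(n))
```

Although the underlying score varies continuously, the phenotype is all-or-none, and the proportion of affected individuals rises steeply as contributing alleles accumulate. This is the pattern the threshold model is meant to capture.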