Molecular Cell Biology
4th edition
Harvey Lodish (1), Arnold Berk (2), Lawrence Zipursky (2), Paul Matsudaira (3), David Baltimore (4), and James Darnell (5)
(1) Whitehead Institute for Biomedical Research and Massachusetts Institute of Technology
(2) Molecular Biology Institute, University of California, Los Angeles
(3) Howard Hughes Medical Institute, School of Medicine, University of California, Los Angeles
(4) California Institute of Technology (Caltech)
(5) Rockefeller University, New York
W. H. Freeman, 2000. ISBN 0-7167-3136-3
cell biology, molecular biology

9.2 Chromosomal Organization of Genes and Noncoding DNA

Having reviewed the relationship between transcription units and genes in prokaryotes and eukaryotes, we now consider the organization of genes on chromosomes and the relationship of noncoding DNA sequences to coding sequences.

Genomes of Higher Eukaryotes Contain Much Nonfunctional DNA

Figure 9-3. Diagrams of ≈80-kb region from chromosome III of the yeast S. cerevisiae and the β-globin gene cluster on human chromosome 11

(a) In the yeast DNA, blue boxes indicate open reading frames; it is not clear whether all these potential protein-coding sequences are functional genes. (b) In the human DNA, the blue boxes represent transcribed regions that encode the indicated globin-type proteins. Each globin-type gene has a similar arrangement of exons and introns (not shown). The human β-globin gene cluster contains two pseudogenes (diagonal lines); these regions are related to the functional globin-type genes but are not transcribed. Red arrows indicate the locations of Alu sequences, an ≈300-bp noncoding repeated sequence that is abundant in the human genome. Note the much higher proportion of noncoding to coding sequences in the human DNA than in the yeast DNA. [Part (a) see S. G. Oliver et al., 1992, Nature 357:28; part (b) see F. S. Collins and S. M. Weissman, 1984, Prog. Nucl. Acid Res. Mol. Biol. 31:315.]

The abundance of noncoding sequences in the genomes of higher organisms is illustrated in Figure 9-3, which depicts the protein-coding regions in an 80-kb stretch of DNA from the yeast S. cerevisiae and in the β-globin gene cluster of humans, also about 80 kb long. Note that in the single-celled yeast, protein-coding regions are closely spaced along the DNA sequence, whereas only a small fraction of the human DNA encodes protein. DNA sequencing and identification of exons have revealed that in higher organisms there is a considerable amount of DNA that does not encode protein. In fact, the β-globin gene cluster is unusually rich in protein-coding sequences compared with other regions of vertebrate DNA. In the 60-kb region including the chicken lysozyme gene, for example, the coding exons total less than 500 base pairs. Because no function has yet been found for most of the noncoding DNA in higher eukaryotes, it is commonly referred to as nonfunctional.

Different selective pressures during evolution may account, at least in part, for this remarkable difference in the amount of nonfunctional DNA in microorganisms and multicellular organisms. For example, microorganisms must compete for limited amounts of nutrients in their environment, and metabolic economy thus is a critical characteristic. Since synthesis of nonfunctional (i.e., noncoding) DNA requires time and energy, presumably there was selective pressure to lose nonfunctional DNA during the evolution of microorganisms. On the other hand, natural selection in vertebrates depends largely on their behavior. The energy invested in DNA synthesis is trivial compared with the metabolic energy required for the movement of muscles; thus there was little selective pressure to eliminate nonfunctional DNA in vertebrates.

Cellular DNA Content Does Not Correlate with Phylogeny

The total amount of chromosomal DNA in different animals and plants does not vary in a consistent manner with the apparent complexity of the organisms. Yeasts, fruit flies, chickens, and humans have successively larger amounts of DNA in their haploid chromosome sets (0.015, 0.15, 1.3, and 3.2 picograms, respectively), in keeping with what we perceive to be the increasing complexity of these organisms. Yet the vertebrates with the greatest amount of DNA per cell are amphibians, which are surely less complex than humans in their structure and behavior. Many plant species also have considerably more DNA per cell than humans have. For example, the DNA content per cell of wheat, broad beans, and garden onions (7.0, 14.6, and 16.8 picograms, respectively) ranges from about two to more than five times that of humans, and tulips have ten times as much DNA per cell as humans.
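The comparisons in the preceding paragraph can be checked with a few lines of arithmetic. The sketch below uses only the picogram values quoted above and assumes, as the text's own comparison does, that all of the figures are measured on the same basis; the function name is ours.

```python
# DNA content in picograms, as quoted in the text.
C_VALUES_PG = {
    "yeast": 0.015,
    "fruit fly": 0.15,
    "chicken": 1.3,
    "human": 3.2,
    "wheat": 7.0,
    "broad bean": 14.6,
    "garden onion": 16.8,
}


def ratio_to_human(c_values):
    """Express each organism's DNA content as a multiple of the human value."""
    human = c_values["human"]
    return {name: pg / human for name, pg in c_values.items()}


ratios = ratio_to_human(C_VALUES_PG)

# Print the organisms from smallest to largest genome.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:>12}: {ratio:7.3f} x human")
```

Wheat comes out at a little over twice the human value and the garden onion at a little over five times, which matches the range stated in the text.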

The DNA content per cell also varies considerably among closely related species. All insects or all amphibians would appear to be similarly complex, but the amount of haploid DNA in species within each of these groups varies by a factor of 100. A similar degree of variation in DNA content per cell is common within groups of plants that have similar structures and life cycles. For example, the broad bean contains about three to four times as much DNA per cell as the kidney bean.

Table 9-1. Classification of Eukaryotic DNA

Protein-coding genes
  Solitary genes
  Duplicated and diverged genes (functional gene families and nonfunctional pseudogenes)
Tandemly repeated genes encoding rRNA, 5S rRNA, tRNA, and histones
Repetitious DNA
  Simple-sequence DNA
  Moderately repeated DNA (mobile DNA elements)
    Transposons
    Viral retrotransposons
    Long interspersed elements (LINES; nonviral retrotransposons)
    Short interspersed elements (SINES; nonviral retrotransposons)
Unclassified spacer DNA

These facts further suggest that much of the DNA in certain organisms is “extra” or expendable; that is, it does not encode RNA or have any regulatory or structural function. The total amount of DNA per haploid cell in an organism is referred to as the C value; the failure of C values to correspond to phylogenetic complexity is called the C-value paradox. This perplexing variation in genome size occurs mainly because eukaryotic chromosomes contain variable amounts of DNA with no demonstrable function, both between genes and within genes in introns. As discussed later, much of this apparently nonfunctional DNA is composed of repetitious DNA sequences, some of which are never transcribed and almost all of which are likely dispensable. The different classes of eukaryotic DNA sequences discussed in the following sections are summarized in Table 9-1.

Protein-Coding Genes May Be Solitary or Belong to a Gene Family

In multicellular organisms, roughly 25 – 50 percent of the protein-coding genes are represented only once in the haploid genome and thus are termed solitary genes. The remaining protein-coding genes belong to families comprising two or more similar genes.

Figure 9-4. The chicken lysozyme gene and its surrounding regions

This 15-kb simple transcription unit contains four exons (blue) and three introns (tan). The positions indicated by red arrows are repeated Alu sequences found at many sites elsewhere in the genome. [See P. Balducci et al., 1981, Nucleic Acids Res. 9:3575.]

A well-studied example of a solitary protein-coding gene is the chicken lysozyme gene mentioned previously. The 15-kb DNA sequence encoding chicken lysozyme constitutes a simple transcription unit (i.e., a single gene) containing four exons and three introns (Figure 9-4). The flanking regions, extending for about 20 kb upstream and downstream from the transcription unit, do not encode any detectable mRNAs. Lysozyme, an enzyme that cleaves the polysaccharides in bacterial cell walls, is an abundant component of chicken egg-white protein and also is found in human tears. Its activity helps to keep the surface of the eye and the chicken egg sterile.

Frequently, the DNA that lies within 5 – 10 kb of a particular gene contains sequences that are close but inexact copies of the gene. Such sequences, which are thought to have arisen by duplication of an ancestral gene, are referred to as duplicated protein-coding genes; duplicated genes probably constitute half of the protein-coding DNA in vertebrate genomes. A set of duplicated genes that encode proteins with similar but nonidentical amino acid sequences is called a gene family; the encoded closely related, homologous proteins constitute a protein family. A few protein families, such as protein kinases, transcription factors, and vertebrate immunoglobulins, include hundreds of members. Most families, however, include from just a few to 30 or so members; common examples are cytoskeletal proteins, 70-kDa heat-shock proteins, myosin heavy chain, chicken ovalbumin, and the α- and β-globins in vertebrates.

The genes encoding the β-like globins are a good example of a gene family. As shown in Figure 9-3b, the β-like globin gene family contains five functional genes designated β, δ, Aγ, Gγ, and ϵ; the encoded polypeptides are similarly designated. Two identical β-like globin polypeptides combine with two identical α-globin polypeptides (encoded by another gene family) and with four small heme groups to form a hemoglobin molecule (see Figure 3-10). All the hemoglobins formed from the different β-like globins carry oxygen in the blood, but they exhibit somewhat different properties that are suited to specific roles in human physiology. For example, hemoglobins containing either the Aγ or Gγ polypeptides are expressed only during fetal life. Because these fetal hemoglobins have a higher affinity for oxygen than adult hemoglobins, they can effectively extract oxygen from the maternal circulation in the placenta. The lower oxygen affinity of adult hemoglobins, which are expressed after birth, permits better release of oxygen to the tissues, especially muscles, which have a high demand for oxygen during exercise.

Figure 9-5. Gene duplication resulting from unequal crossing over

Each parental chromosome (top) contains one ancestral globin gene containing three exons and two introns. Homologous L1 repeated sequences lie 5′ and 3′ of the globin gene. The parental chromosomes are shown displaced relative to each other, so that the L1 sequences are aligned. Homologous recombination between L1 sequences as shown would generate one recombinant chromosome with two copies of the globin gene and one chromosome with a deletion of the globin gene. Subsequent independent mutations in the duplicated genes could lead to slight changes in sequence that might result in slightly different functional properties of the encoded proteins. Unequal crossing over also can result from rare recombinations between unrelated sequences. [See D. H. A. Fitch et al., 1991, Proc. Nat’l. Acad. Sci. USA 88:7396.]

The different β-globin genes probably arose by duplication of an ancestral gene, most likely as the result of an “unequal crossover” during recombination in a germ-cell (egg or sperm) precursor (Figure 9-5). Over evolutionary time the two copies of the gene that resulted accumulated random mutations; beneficial mutations that conferred some refinement in the basic oxygen-carrying function of hemoglobin were retained by natural selection. Repetitions of this process are thought to have resulted in the evolution of the contemporary globin-like genes observed in humans and other complex species today.

Two regions in the human β-like globin gene cluster contain nonfunctional sequences, called pseudogenes, similar to those of the functional β-like globin genes (see Figure 9-3b). Sequence analysis shows that these pseudogenes have the same apparent exon-intron structure as the functional β-like globin genes, suggesting that they also arose by duplication of the same ancestral gene. However, sequence drift during evolution generated sequences that either terminate translation or block mRNA processing, rendering such regions nonfunctional even if they were transcribed into RNA. Because such pseudogenes are not deleterious, they remain in the genome and mark the location of a gene duplication that occurred in one of our ancestors. As discussed in a later section, other nonfunctional gene copies can arise by reverse transcription of mRNA into cDNA and integration of this intron-less DNA into a chromosome.

Several different gene families encode the various proteins that make up the cytoskeleton. These proteins are present in varying amounts in almost all cells. In vertebrates, the major cytoskeletal proteins are the actins, tubulins, and intermediate filament proteins like the keratins (Chapters 18 and 19). Although the physiologic rationale for these protein families is not as obvious as it is for the globins, the different members of a family probably have similar but subtly different functions suited to the particular type of cell in which they are expressed.

Tandemly Repeated Genes Encode rRNAs, tRNAs, and Histones

In invertebrates and some vertebrates, the genes encoding rRNAs, tRNAs, histones (a family of proteins associated with eukaryotic nuclear DNA), and several other proteins occur as tandemly repeated arrays. These are distinguished from the duplicated genes of gene families in that the multiple tandemly repeated genes encode identical or nearly identical proteins or functional RNAs. Most often copies of a sequence appear one after the other, in a head-to-tail fashion, over a long stretch of DNA. Within a tandem array of rRNA or tRNA genes, each copy is exactly, or almost exactly, like all the others. Although the transcribed portions of rRNA genes are the same in a given individual, the nontranscribed spacer regions between the transcribed regions can vary. Arrays of tandemly repeated histone DNA are somewhat more complex; however, each histone gene, too, has multiple identical copies.

Table 9-2. Effect of Gene Copy Number and Loading with RNA Polymerase on Rate of Pre-rRNA Synthesis in Human Cells

Copies of Pre-rRNA Gene    RNA Polymerase Molecules per Gene    Molecules of Pre-rRNA Produced in 24 Hours
1                          1                                    288
1                          ≈250                                 ≈70,000
100                        ≈250                                 ≈7,000,000

The tandemly repeated rRNA, tRNA, and histone genes are needed to meet the great cellular demand for their transcripts. Most of the RNA in a cell consists of rRNA and tRNA. Assuming RNA polymerase molecules move at a fixed speed, there must be a limit to the number of RNA copies that transcription of a single gene can provide during one cell generation, even if it is fully loaded with polymerase molecules. If more RNA is required than can be transcribed from one gene, multiple copies of the gene are necessary. For example, during early embryonic development in humans, many embryonic cells have a doubling time of ≈24 hours and contain 5 – 10 million ribosomes. To produce enough rRNA to form this many ribosomes, an embryonic human cell needs at least 100 copies of the pre-rRNA gene, and most of these must be close to maximally active for the cell to divide every 24 hours (Table 9-2). That is, multiple RNA polymerases must be loaded onto and transcribing each pre-rRNA gene at the same time (see Figure 11-49). The importance of repeated rRNA genes is illustrated by Drosophila mutants called bobbed (because they have short, thin bristles), which lack a full complement of the tandemly repeated rRNA genes. A bobbed mutation that reduces the number of rRNA genes to fewer than ≈50 is a recessive lethal mutation.
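The reasoning behind Table 9-2 is simple multiplication, and a short calculation makes the gene-copy requirement explicit. The sketch below takes the table's first row (288 transcripts per polymerase per day) as given and assumes that output scales linearly with the number of polymerases and gene copies, and that each ribosome requires one pre-rRNA molecule. The function names are ours, and the computed values are slightly higher than the rounded figures in the table.

```python
import math

MINUTES_PER_DAY = 24 * 60

# Values quoted from Table 9-2 and the surrounding text.
TRANSCRIPTS_PER_POLYMERASE_PER_DAY = 288   # one gene copy, one polymerase
POLYMERASES_PER_LOADED_GENE = 250          # approximate, fully loaded gene
RIBOSOMES_PER_CELL = (5_000_000, 10_000_000)


def transcripts_per_day(gene_copies, polymerases_per_gene):
    """Pre-rRNA molecules made in 24 hours, assuming linear scaling."""
    return gene_copies * polymerases_per_gene * TRANSCRIPTS_PER_POLYMERASE_PER_DAY


def min_gene_copies(ribosomes_needed, polymerases_per_gene=POLYMERASES_PER_LOADED_GENE):
    """Smallest number of fully loaded gene copies that meets the demand."""
    per_gene = transcripts_per_day(1, polymerases_per_gene)
    return math.ceil(ribosomes_needed / per_gene)


# 288 transcripts per day corresponds to one transcript every 5 minutes.
minutes_per_transcript = MINUTES_PER_DAY / TRANSCRIPTS_PER_POLYMERASE_PER_DAY

print("minutes per transcript:", minutes_per_transcript)
print("1 copy, fully loaded:", transcripts_per_day(1, POLYMERASES_PER_LOADED_GENE))
print("100 copies, fully loaded:", transcripts_per_day(100, POLYMERASES_PER_LOADED_GENE))
for ribosomes in RIBOSOMES_PER_CELL:
    print(f"{ribosomes:>10,} ribosomes need >= {min_gene_copies(ribosomes)} gene copies")
```

Under these assumptions a cell needs roughly 70 to 140 fully loaded gene copies to make 5 to 10 million ribosomes in a day, which is consistent with the estimate of at least 100 copies given above.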

All eukaryotes, including yeasts, contain 100 or more copies of the genes encoding 5S rRNA and pre-rRNA. More than 20,000 copies of the 5S rRNA gene are present in frogs. The copy number for individual tRNA genes ranges from 10 to 100.

Reassociation Experiments Reveal Three Major Fractions of Eukaryotic DNA

Besides duplicated protein-coding genes and tandemly repeated genes, eukaryotic cells contain multiple copies of other DNA sequences in the genome, generally referred to as repetitious DNA (see Table 9-1). Some of these sequences are quite short and occur as tandem repeats; others are much longer and are interspersed at many places in the genome. The existence of these repeated sequences was first recognized in reassociation experiments in which denatured eukaryotic DNA was observed to renature nonuniformly; that is, some of it reassociated much more rapidly than the bulk of cellular DNA.

In these studies, the total DNA of an organism was broken into fragments with an average length of about a thousand base pairs. The DNA was then melted into single strands and placed under conditions that allow strand reassociation to occur (e.g., a favorable ion concentration and a favorable temperature). If none of the DNA fragments contained sequences that were repeated in the genome, they all would be expected to re-form duplexes at about the same speed. However, a fragment containing a sequence repeated many times in the genome would find a complementary partner more quickly than a fragment with a sequence that occurred only once per haploid genome, because the repeated sequence would be present at a much higher concentration. Consequently, a fragment containing a repeated sequence would reassociate faster than a fragment with a unique sequence.
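The concentration argument can be made quantitative with a simple model. The sketch below assumes ideal second-order reassociation kinetics, a standard idealization that the text itself does not state, in which the fraction of a sequence that is still single-stranded falls as 1 / (1 + k × C0t), where C0t is the product of the initial DNA concentration and time. The effective rate constant is taken to be proportional to the number of copies of the sequence per haploid genome. The rate constant `k_single_copy` and the copy numbers are illustrative values, not measurements.

```python
def fraction_single_stranded(cot, copies_per_genome, k_single_copy=1.0):
    """Fraction of a sequence class not yet reassociated at a given C0t.

    Assumes ideal second-order kinetics with a rate proportional to the
    number of copies of the sequence per haploid genome.
    """
    rate = k_single_copy * copies_per_genome
    return 1.0 / (1.0 + rate * cot)


def cot_half(copies_per_genome, k_single_copy=1.0):
    """C0t value at which half of the sequence class has reassociated."""
    return 1.0 / (k_single_copy * copies_per_genome)


# Illustrative copy numbers for a unique sequence and two repeated ones.
for copies in (1, 1_000, 1_000_000):
    print(f"{copies:>9,} copies: half-reassociated at C0t = {cot_half(copies):g}")
```

In this model a sequence present in 1000 copies reaches half-reassociation at a C0t value 1000 times lower than a single-copy sequence, which is the behavior that allows repeated and unique fractions to be distinguished.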

About 50 – 60 percent of mammalian DNA reassociates at a slow rate, indicating that it consists primarily of single-copy DNA. According to Mendelian genetics, only one copy of each gene is contained in the haploid DNA set; thus the single-copy DNA fraction is expected to contain most of the genes encoding mRNA. However, the vast majority of single-copy DNA in the mammalian genome is noncoding DNA between genes and in introns. It appears that only a small fraction of the total DNA in humans, on the order of 5 percent, actually encodes proteins or functional RNA molecules. The remainder of the single-copy DNA, which currently has no known function other than to separate functional DNA sequences, is referred to as spacer DNA.

Another 25 – 40 percent of mammalian DNA reassociates at an intermediate rate. Cloning and sequencing of this DNA fraction from many different animals and higher plants have revealed that it is composed primarily of a very large number of copies of a relatively few sequence families in any specific organism. Such repetitious DNA, termed moderately repeated DNA, or intermediate-repeat DNA, is interspersed throughout mammalian genomes. Because these sequences can be copied and reinserted into new sites in the genome, they are called mobile DNA elements, which we describe in the next section. A small portion of this fraction consists of large duplicated gene families and tandemly repeated genes discussed previously.

About 10 – 15 percent of mammalian DNA reassociates at a very rapid rate. This rapidly reassociating type of repetitious DNA, referred to as simple-sequence DNA, is composed largely of several different sets of short (5- to 10-bp) sequences repeated in long tandem arrays.

Simple-Sequence DNAs Are Concentrated in Specific Chromosomal Locations

Although much of the simple-sequence DNA of higher organisms is composed of tandemly repeated, 5- to 10-bp sequences, long tandem repeats of simple sequences containing 20 – 200 nucleotides also occur in some vertebrate and plant genomes. Such tandem repeats generally extend up to 10^5 base pairs in total length. These long stretches of simple-sequence DNA are often referred to as satellite DNA because they are separated from the bulk of cellular DNA by equilibrium density-gradient centrifugation. However, not all simple-sequence DNAs separate from the bulk of cellular DNA during centrifugation.

Figure 9-6. Use of simple-sequence DNA as chromosomal marker

Human metaphase chromosomes stained with a fluorescent dye and hybridized in situ with a particular simple-sequence DNA labeled with a fluorescent biotin derivative. When viewed under the appropriate wavelength of light, the DNA appears red and the hybridized simple-sequence DNA appears as a yellow band on chromosome 16, thus locating this particular simple sequence to one site in the genome. [See R. K. Moyzis et al., 1987, Chromosoma 95:378; courtesy of R. K. Moyzis.]

In situ hybridization studies with metaphase chromosomes have localized simple-sequence DNA to specific chromosomal regions. In most mammals, much of the simple-sequence DNA lies near centromeres, discrete chromosomal regions that attach to spindle microtubules during mitosis and meiosis (see Figure 19-39). In the chromosomes of Drosophila melanogaster, simple-sequence DNA is concentrated in both centromeres and telomeres, the ends of chromosomes. Some simple-sequence tandem arrays also are located within chromosome arms in the Drosophila genome. In humans, some simple-sequence DNAs are located at a specific location on one chromosome. These sequences are useful for identifying particular chromosomes by fluorescence in situ hybridization (FISH). For example, a particular simple sequence in the human genome is present only in the middle of the long arm of chromosome 16 (Figure 9-6).

Simple-sequence DNA located at centromeres is suspected to contribute to the structure and therefore the function of the kinetochore of metaphase chromosomes. This large nucleoprotein complex assembles at the centromere and attaches to spindle microtubules during mitosis (Chapter 19). As yet, however, there is little clear-cut experimental evidence demonstrating any function for most simple-sequence DNA.

DNA Fingerprinting Depends on Differences in Length of Simple-Sequence DNAs

Figure 9-7. Unequal crossing over during meiosis can generate differences in lengths of simple-sequence DNA tandem arrays

In this example, unequal recombination within a stretch of DNA containing six copies (1 – 6) of a particular simple-sequence repeat unit yields germ cells containing either an 8-unit or 4-unit tandem array.

Within a species, the nucleotide sequences of the repeat units composing simple-sequence DNA tandem arrays are highly conserved among individuals. In contrast, differences in the number of repeats, and thus in the length, of simple-sequence tandem arrays containing the same repeat unit are quite common among individuals. These differences in length result from unequal crossing over within regions of simple-sequence DNA during development of sperm and oocyte precursors and during meiosis (Figure 9-7). As a consequence of this unequal crossing over, the lengths of some simple-sequence tandem arrays are unique in each individual.
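The length change produced by unequal crossing over can be illustrated by treating each tandem array as a list of repeat units and exchanging the segments to the right of two misaligned break points. The sketch below starts from the six-unit array of Figure 9-7; the particular break points are chosen only so that the products match the 8-unit and 4-unit arrays described in the figure legend, and the function name is ours.

```python
def unequal_crossover(array_a, array_b, break_a, break_b):
    """Exchange the right-hand segments of two tandem arrays.

    break_a and break_b are the numbers of repeat units kept from the
    left end of each array. Equal values model a normal, aligned
    crossover; unequal values model a misaligned one.
    """
    recombinant_1 = array_a[:break_a] + array_b[break_b:]
    recombinant_2 = array_b[:break_b] + array_a[break_a:]
    return recombinant_1, recombinant_2


parent = [1, 2, 3, 4, 5, 6]  # six copies of the repeat unit, as in Figure 9-7

# Misaligned exchange: one array breaks after unit 5, the other after unit 3.
long_array, short_array = unequal_crossover(parent, parent, break_a=5, break_b=3)
print(len(long_array), long_array)
print(len(short_array), short_array)

# Aligned exchange for comparison: both products keep six units.
same_1, same_2 = unequal_crossover(parent, parent, break_a=3, break_b=3)
print(len(same_1), len(same_2))
```

The total number of repeat units is conserved, so any gain in one product is matched by a loss in the other. Repeated rounds of such exchanges over many generations are what make array lengths differ among individuals.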

Figure 9-8. Consensus sequences of the repeat unit of human minisatellites named λ33.1 and λ33.5 based on analysis of more than ten sets of repeats in each case

Red letters indicate positions in which base differences have been detected; red solid dot indicates a deletion. The 62-bp repeat unit of λ33.1 is much more highly conserved than the 17-bp unit of λ33.5. [See A. J. Jeffreys et al., 1985, Nature 314:67.]

Figure 9-9. Human DNA fingerprints

DNA samples from three individuals (1, 2, and 3) were subjected to Southern-blot analysis using the restriction enzyme HinfI and three labeled minisatellites as probes (λ33.6, 33.15, and 33.5; lanes a, b, and c, respectively). DNA from each individual produced a unique band pattern with each probe. Conditions of electrophoresis can be adjusted so that at least 50 bands can be resolved for each person with this restriction enzyme. The nonidentity of these three samples is easily distinguished. [From A. J. Jeffreys et al., 1985, Nature 316:76; courtesy of A. J. Jeffreys.]

In humans and other mammals, some of the simple-sequence DNA exists in relatively short 1- to 5-kb regions made up of 20 – 50 repeat units each containing 15 to about 100 base pairs. These regions are called minisatellites to distinguish them from the more common regions of tandemly repeated simple-sequence DNA, which are ≈100 kb in length. The sequences of the repeat unit in two human minisatellites are shown in Figure 9-8. Even slight differences in the total lengths of various minisatellites from different individuals can be detected. These differences form the basis of DNA fingerprinting, which is superior to conventional fingerprinting for identifying individuals (Figure 9-9).
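To see why repeat number translates into a distinguishable band pattern, consider a restriction fragment that spans one minisatellite array: its length is the repeat count times the repeat-unit length, plus the flanking DNA between the array and the nearest restriction sites. The sketch below uses the 62-bp repeat unit quoted for λ33.1 in Figure 9-8; the flanking length and the repeat counts assigned to the two individuals are hypothetical values chosen for illustration, and the function names are ours.

```python
def fragment_length_bp(repeat_count, repeat_unit_bp, flank_bp):
    """Length of a restriction fragment spanning one minisatellite array."""
    return repeat_count * repeat_unit_bp + flank_bp


def band_pattern(repeat_counts, repeat_unit_bp, flank_bp):
    """Sorted fragment lengths for the alleles carried by one individual."""
    return sorted(
        fragment_length_bp(count, repeat_unit_bp, flank_bp)
        for count in repeat_counts
    )


REPEAT_UNIT_BP = 62  # repeat-unit length quoted for λ33.1 in Figure 9-8
FLANK_BP = 300       # hypothetical flanking DNA up to the restriction sites

# Hypothetical repeat counts for the two alleles of each individual.
individual_1 = band_pattern([20, 35], REPEAT_UNIT_BP, FLANK_BP)
individual_2 = band_pattern([21, 48], REPEAT_UNIT_BP, FLANK_BP)

print("individual 1 bands (bp):", individual_1)
print("individual 2 bands (bp):", individual_2)
print("patterns differ:", individual_1 != individual_2)
```

A difference of a single repeat unit shifts a band by only 62 bp in this example, which is why the sensitivity of the electrophoresis conditions matters. A real fingerprint combines many such loci detected by one probe, so the chance that two unrelated individuals share every band becomes very small.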

