Having reviewed the relationship between transcription units and genes in prokaryotes
and eukaryotes, we now consider the organization of genes on chromosomes and the
relationship of noncoding DNA sequences to coding sequences.
Genomes of Higher Eukaryotes Contain Much Nonfunctional DNA
Figure 9-3. Diagrams of ≈80-kb regions from chromosome III of the yeast
S. cerevisiae and the β-globin gene cluster on human chromosome 11
(a) In the yeast DNA, blue boxes indicate open reading frames; it is
not clear whether all these potential protein-coding sequences are
functional genes. (b) In the human DNA, the blue boxes represent
transcribed regions that encode the indicated globin-type proteins.
Each globin-type gene has a similar arrangement of exons and introns
(not shown). The human β-globin gene cluster contains two
pseudogenes (diagonal lines); these regions are related to the
functional globin-type genes but are not transcribed. Red arrows
indicate the locations of Alu sequences, an ≈300-bp noncoding repeated
sequence that is abundant in the human genome. Note the much higher
proportion of noncoding to coding sequences in the human DNA than in the
yeast DNA. [Part (a) see S. G. Oliver et al., 1992, Nature 357:28;
part (b) see F. S. Collins and S. M. Weissman, 1984, Prog. Nucl. Acid
Res. Mol. Biol. 31:315.]
The abundance of noncoding sequences in the genomes of higher organisms is
illustrated in Figure 9-3, which depicts the protein-coding regions in an 80-kb
stretch of DNA from the yeast S. cerevisiae and in the β-globin gene cluster of
humans, also about 80 kb long. Note that in the single-celled yeast,
protein-coding regions are closely spaced along the DNA sequence, whereas only
a small fraction of the human DNA encodes protein. DNA sequencing and
identification of exons have revealed that in higher organisms there is a
considerable amount of DNA that does not encode protein. In fact, the β-globin
gene cluster is unusually rich in protein-coding sequences compared with other
regions of vertebrate DNA. In the 60-kb region including the chicken lysozyme
gene, for example, the coding exons total less than 500 base pairs. Because no
function has yet been found for most of the noncoding DNA in higher eukaryotes,
it is commonly referred to as nonfunctional.
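The lysozyme figures quoted above translate directly into an upper bound on the coding fraction of that region. The short sketch below only restates that arithmetic; the function name `coding_fraction` is a hypothetical helper introduced here for illustration.

```python
def coding_fraction(coding_bp: float, region_bp: float) -> float:
    """Return the fraction of a DNA region that is protein-coding."""
    if region_bp <= 0:
        raise ValueError("region_bp must be positive")
    return coding_bp / region_bp


# Chicken lysozyme region, using the numbers given in the text:
# fewer than 500 coding base pairs within a 60-kb (60,000-bp) region.
lysozyme_upper_bound = coding_fraction(500, 60_000)
print(f"Coding fraction is below {lysozyme_upper_bound:.2%}")
```

Under these numbers, less than about 1 percent of the region encodes protein.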
Different selective pressures during evolution may account, at least in part, for
this remarkable difference in the amount of nonfunctional DNA in microorganisms
and multicellular organisms. For example, microorganisms must compete for
limited amounts of nutrients in their environment, and metabolic economy thus is
a critical characteristic. Since synthesis of nonfunctional (i.e., noncoding)
DNA requires time and energy, presumably there was selective pressure to lose
nonfunctional DNA during the evolution of microorganisms. On
the other hand, natural selection in vertebrates depends largely on their
behavior. The energy invested in DNA synthesis is trivial compared with the
metabolic energy required for the movement of muscles; thus there was little
selective pressure to eliminate nonfunctional DNA in vertebrates.
Cellular DNA Content Does Not Correlate with Phylogeny
The total amount of chromosomal DNA in different animals and plants does not vary
in a consistent manner with the apparent complexity of the organisms. Yeasts,
fruit flies, chickens, and humans have successively larger amounts of DNA in
their haploid chromosome sets (0.015, 0.15, 1.3, and 3.2 picograms,
respectively), in keeping with what we perceive to be the increasing complexity
of these organisms. Yet the vertebrates with the greatest amount of DNA per cell
are amphibians, which are surely less complex than humans in their structure and
behavior. Many plant species also have considerably more DNA per cell than
humans have. For example, the DNA content per cell of wheat, broad beans, and
garden onions (7.0, 14.6, and 16.8 picograms, respectively) ranges from about
two to more than five times that of humans, and tulips have ten times as much
DNA per cell as humans.
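The comparisons in this paragraph can be checked with a few lines of arithmetic. The sketch below uses the DNA contents exactly as quoted in the text; the conversion factor from picograms to base pairs (roughly 0.978 × 10^9 bp per pg of double-stranded DNA) is not given in the text and is included here as an assumption, and the function names are hypothetical.

```python
# DNA contents in picograms, as quoted in the text.
C_VALUES_PG = {
    "yeast": 0.015,
    "fruit fly": 0.15,
    "chicken": 1.3,
    "human": 3.2,
    "wheat": 7.0,
    "broad bean": 14.6,
    "garden onion": 16.8,
}

# Assumption (not from the text): 1 pg of double-stranded DNA is
# roughly 0.978e9 base pairs.
BP_PER_PG = 0.978e9


def relative_to_human(species: str) -> float:
    """DNA content of a species as a multiple of the human value."""
    return C_VALUES_PG[species] / C_VALUES_PG["human"]


def approx_base_pairs(species: str) -> float:
    """Approximate DNA content in base pairs under the assumed conversion."""
    return C_VALUES_PG[species] * BP_PER_PG


for name in C_VALUES_PG:
    print(f"{name:13s} {relative_to_human(name):6.2f}x human, "
          f"~{approx_base_pairs(name):.2e} bp")
```

With these values wheat comes out at about 2.2 times and garden onion at about 5.3 times the human figure, matching the "two to more than five times" range stated above.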
The DNA content per cell also varies considerably among closely related species.
All insects or all amphibians would appear to be similarly complex, but the
amount of haploid DNA in species within each of these classes varies by a factor
of 100. The same variation in DNA content per cell is common within groups of
plants that have similar structures and life cycles. For example, the broad bean
contains about three to four times as much DNA per cell as the kidney bean.
Table 9-1. Classification of Eukaryotic DNA
These facts further suggest that much of the DNA in certain organisms is
“extra” or expendable; that is, it does not encode RNA or have any regulatory
or structural function. The total amount of DNA per haploid cell in an organism
is referred to as the C value; the failure of C values to correspond to
phylogenetic complexity is called the C-value paradox. This perplexing
variation in genome size occurs mainly because eukaryotic chromosomes contain
variable amounts of DNA with no demonstrable function, both between genes and
within genes in introns. As discussed later, much of this apparently
nonfunctional DNA is composed of repetitious DNA sequences, some of which are
never transcribed and almost all of which are likely dispensable. The different
classes of eukaryotic DNA sequences discussed in the following sections are
summarized in Table 9-1.
Protein-Coding Genes May Be Solitary or Belong to a Gene Family
In multicellular organisms, roughly 25 to 50
percent of the protein-coding genes are represented only once in the haploid
genome and thus are termed solitary genes. The remaining
protein-coding genes belong to families comprising two or more similar
genes.
Figure 9-4. The chicken lysozyme gene and its surrounding regions
This 15-kb simple transcription unit contains four exons (blue) and
three introns (tan). The positions indicated by red arrows are
repeated Alu sequences found at many sites
elsewhere in the genome. [See P. Balducci et al., 1981, Nucleic Acids
Res. 9:3575.]
A well-studied example of a solitary protein-coding gene is the chicken
lysozyme gene mentioned previously. The 15-kb DNA sequence encoding chicken
lysozyme constitutes a simple transcription unit (i.e., a single gene)
containing four exons and three introns (Figure 9-4). The flanking regions,
extending for about 20 kb upstream and downstream from the transcription unit,
do not encode any detectable mRNAs. Lysozyme, an enzyme that cleaves the
polysaccharides in bacterial cell walls, is an abundant component of chicken
egg-white protein and also is found in human tears. Its activity helps to keep
the surface of the eye and the chicken egg sterile.
Frequently, the DNA that lies within 5 to 10 kb of a particular gene contains
sequences that are close but inexact copies of the gene. Such sequences, which
are thought to have arisen by duplication of an ancestral gene, are referred to
as duplicated protein-coding genes; duplicated genes probably constitute half
of the protein-coding DNA in vertebrate genomes. A set of duplicated genes that
encode proteins with similar but nonidentical amino acid sequences is called a
gene family; the closely related, homologous proteins they encode constitute a
protein family. A few protein families, such as protein kinases,
transcription factors, and vertebrate immunoglobulins, include hundreds of
members. Most families, however, include from just a few to 30 or so members;
common examples are cytoskeletal proteins, 70-kDa heat-shock proteins, myosin
heavy chain, chicken ovalbumin, and the α- and β-globins in
vertebrates.
The genes encoding the β-like globins are a good example of a gene family. As
shown in Figure 9-3b, the β-like globin gene family contains five functional
genes designated β, δ, Aγ, Gγ, and ε; the encoded polypeptides are similarly
designated. Two identical β-like globin polypeptides combine with two identical
α-globin polypeptides (encoded by another gene family) and with four small heme
groups to form a hemoglobin molecule (see Figure 3-10). All the hemoglobins
formed from the different β-like globins carry oxygen in the blood, but they
exhibit somewhat different properties that are suited to specific roles in
human physiology. For example, hemoglobins containing either the Aγ or Gγ
polypeptides are expressed only during fetal life. Because these fetal
hemoglobins have a higher affinity for oxygen than adult hemoglobins, they can
effectively extract oxygen from the maternal circulation in the placenta. The
lower oxygen affinity of adult hemoglobins, which are expressed after birth,
permits better release of oxygen to the tissues, especially muscles, which have
a high demand for oxygen during exercise.
Figure 9-5. Gene duplication resulting from unequal crossing over
Each parental chromosome (top) contains one
ancestral globin gene containing three exons and two introns.
Homologous L1 repeated sequences lie 5′ and 3′
of the globin gene. The parental chromosomes are shown displaced
relative to each other, so that the L1 sequences are aligned.
Homologous recombination between L1 sequences as shown would
generate one recombinant chromosome with two copies of the globin
gene and one chromosome with a deletion of the globin gene.
Subsequent independent mutations in the duplicated genes could lead
to slight changes in sequence that might result in slightly
different functional properties of the encoded proteins. Unequal
crossing over also can result from rare recombinations between
unrelated sequences. [See D. H. A. Fitch et al., 1991, Proc. Nat’l.
Acad. Sci. USA 88:7396.]
The different β-globin genes probably arose by duplication of an ancestral
gene, most likely as the result of an “unequal crossover” during recombination
in a germ-cell (egg or sperm) precursor (Figure 9-5). Over evolutionary time
the two copies of the gene that resulted accumulated random mutations;
beneficial mutations that conferred some refinement in the basic
oxygen-carrying function of hemoglobin were retained by natural selection.
Repetitions of this process are thought to have resulted in the evolution of the
contemporary globin-like genes observed in humans and other complex species
today.
Two regions in the human β-like globin gene cluster contain nonfunctional
sequences, called pseudogenes, similar to those of the functional β-like globin
genes (see Figure 9-3b). Sequence analysis shows that these pseudogenes have
the same apparent exon-intron structure as the functional β-like globin genes,
suggesting that they also arose by duplication of the same ancestral gene.
However, sequence drift during evolution generated sequences that either
terminate translation or block mRNA processing, rendering such regions
nonfunctional even if they were transcribed into RNA. Because such pseudogenes
are not deleterious, they remain in the genome and mark the location of a gene
duplication that occurred in one of our ancestors. As discussed in a later
section, other nonfunctional gene copies can arise by reverse transcription of
mRNA into cDNA and integration of this intron-less DNA into a chromosome.
Several different gene families encode the various proteins that make up the
cytoskeleton. These proteins are present in varying amounts in almost all cells.
In vertebrates, the major cytoskeletal proteins are the actins, tubulins, and
intermediate filament proteins like the keratins (Chapters 18 and 19). Although the physiologic rationale for these protein families is
not as obvious as it is for the globins, the different members of a family
probably have similar but subtly different functions suited to the particular
type of cell in which they are expressed.
Tandemly Repeated Genes Encode rRNAs, tRNAs, and Histones
In invertebrates and some vertebrates, the genes encoding rRNAs, tRNAs, histones
(a family of proteins associated with eukaryotic nuclear DNA), and several other
proteins occur as tandemly repeated arrays. These are
distinguished from the duplicated genes of gene families in that the multiple
tandemly repeated genes encode identical or nearly identical proteins or
functional RNAs. Most often copies of a sequence appear one after the other, in
a head-to-tail fashion, over a long stretch of DNA. Within a tandem array of
rRNA or tRNA genes, each copy is exactly, or almost exactly, like all the
others. Although the transcribed portions of rRNA genes are the same in a given
individual, the nontranscribed spacer regions between the transcribed regions
can vary. Arrays of tandemly repeated histone DNA are somewhat more complex;
however, each histone gene, too, has multiple identical copies.
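Table 9-2. Effect of Gene Copy Number and Loading with RNA Polymerase on Rate of
Pre-rRNA Synthesis in Human Cells
| Copies of pre-rRNA gene | RNA polymerase molecules per gene | Pre-rRNA molecules synthesized in 24 hours |
| --- | --- | --- |
| 1 | 1 | 288 |
| 1 | ≈250 | ≈70,000 |
| 100 | ≈250 | ≈7,000,000 |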
The tandemly repeated rRNA, tRNA, and histone genes are needed to meet the great
cellular demand for their transcripts. Most of the RNA in a cell consists of
rRNA and tRNA. Assuming RNA polymerase molecules move at a fixed speed, there
must be a limit to the number of RNA copies that transcription of a single gene
can provide during one cell generation, even if it is fully loaded with
polymerase molecules. If more RNA is required than can be transcribed from one
gene, multiple copies of the gene are necessary. For example, during early
embryonic development in humans, many embryonic cells have a doubling time of
≈24 hours and contain 5 to 10 million ribosomes. To produce enough rRNA to form
this many ribosomes, an embryonic human cell needs at least 100 copies of the
pre-rRNA gene, and most of these must be close to maximally active for the cell
to divide every 24 hours (Table 9-2). That is, multiple RNA polymerases must be
loaded onto and transcribing each pre-rRNA gene at the same time (see
Figure 11-49). The importance
of repeated rRNA genes is illustrated by Drosophila mutants called bobbed
(because they have short, thin bristles), which lack a full complement of the
tandemly repeated rRNA genes. A bobbed mutation that reduces the number of rRNA
genes to fewer than ≈50 is a recessive lethal mutation.
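The argument for multiple pre-rRNA gene copies can be followed as simple arithmetic. The sketch below reads the three values in each row of Table 9-2 as gene copies, RNA polymerase molecules per gene, and pre-rRNA molecules made in 24 hours, and it assumes that output scales linearly with both gene copies and polymerases and that each ribosome needs one pre-rRNA transcript. These are simplifying assumptions for illustration, and the function names are hypothetical.

```python
# From Table 9-2: one gene carrying one polymerase yields 288 pre-rRNA
# molecules in 24 hours (one transcript every 5 minutes).
TRANSCRIPTS_PER_POLYMERASE_PER_DAY = 288


def pre_rrna_per_day(gene_copies: int, polymerases_per_gene: int) -> int:
    """Pre-rRNA molecules made in 24 h, assuming linear scaling."""
    return gene_copies * polymerases_per_gene * TRANSCRIPTS_PER_POLYMERASE_PER_DAY


def min_gene_copies(ribosomes_needed: int, polymerases_per_gene: int) -> int:
    """Smallest number of fully loaded genes that meets the demand in 24 h."""
    per_gene = polymerases_per_gene * TRANSCRIPTS_PER_POLYMERASE_PER_DAY
    return -(-ribosomes_needed // per_gene)  # ceiling division


print(pre_rrna_per_day(1, 250))         # one fully loaded gene
print(pre_rrna_per_day(100, 250))       # 100 fully loaded genes
print(min_gene_copies(5_000_000, 250))  # low end of the ribosome range
print(min_gene_copies(10_000_000, 250)) # high end of the ribosome range
```

The products (72,000 and 7,200,000) agree with the rounded table entries, and the minimum copy numbers for 5 to 10 million ribosomes bracket the figure of about 100 gene copies cited in the text.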
All eukaryotes, including yeasts, contain 100 or more copies of the genes
encoding 5S rRNA and pre-rRNA. More than 20,000 copies of the 5S rRNA gene are
present in frogs. The copy number for individual tRNA genes ranges from 10 to
100.
Reassociation Experiments Reveal Three Major Fractions of Eukaryotic DNA
Besides duplicated protein-coding genes and tandemly repeated genes, eukaryotic
cells contain multiple copies of other DNA sequences in the genome, generally
referred to as repetitious DNA (see Table 9-1). Some of these sequences are
quite short and occur as tandem repeats; others are much longer and are
interspersed at many places in the genome. The existence of these repeated
sequences was first recognized in reassociation experiments in which denatured
eukaryotic DNA was observed to renature nonuniformly; that is, some of it
reassociated much more rapidly than the bulk of cellular DNA.
In these studies, the total DNA of an organism was broken into fragments with an
average length of about a thousand base pairs. The DNA was then melted into
single strands and placed under conditions that allow strand reassociation to
occur (e.g., a favorable ion concentration and a favorable temperature). If none
of the DNA fragments contained sequences that were repeated in the genome, they
all would be expected to re-form duplexes at about the same speed. However, a
fragment containing a sequence repeated many times in the genome would find a
complementary partner more quickly than a fragment with a sequence that occurred
only once per haploid genome, because the repeated sequence would be present at
a much higher concentration. Consequently, a fragment containing a repeated
sequence would reassociate faster than a fragment with a unique sequence.
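The concentration argument can be made concrete with a minimal kinetic sketch. It assumes ideal second-order reassociation, in which the fraction of a sequence still single-stranded after time t is 1 / (1 + k·C0·t), where C0 is the starting concentration of that sequence and k is a rate constant. This model, the parameter values, and the function names are assumptions for illustration; they are not taken from the text.

```python
def fraction_single_stranded(t: float, k: float, c0: float) -> float:
    """Fraction still single-stranded at time t (ideal second-order kinetics)."""
    return 1.0 / (1.0 + k * c0 * t)


def half_time(k: float, c0: float) -> float:
    """Time at which half of the strands have reassociated."""
    return 1.0 / (k * c0)


K = 1.0             # arbitrary rate constant (hypothetical units)
UNIQUE_C0 = 1.0     # concentration of a single-copy sequence (arbitrary units)
COPIES = 1000       # hypothetical copy number of a repeated sequence
REPEAT_C0 = UNIQUE_C0 * COPIES  # repeats are present at COPIES-fold concentration

print(half_time(K, UNIQUE_C0))  # single-copy sequence
print(half_time(K, REPEAT_C0))  # repeated sequence reaches half-reassociation sooner
```

In this idealized model the half-time is inversely proportional to copy number, which is why a sequence repeated many times renatures well before single-copy DNA does.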
About 50 to 60 percent of mammalian DNA
reassociates at a slow rate, indicating that it consists
primarily of single-copy DNA. According to Mendelian genetics,
only one copy of each gene is contained in the haploid DNA set; thus the
single-copy DNA fraction is expected to contain most of the genes encoding mRNA.
However, the vast majority of single-copy DNA in the mammalian genome is
noncoding DNA between genes and in introns. It appears that only a small
fraction of the total DNA in humans, on the order of 5 percent, actually encodes
proteins or functional RNA molecules. The remainder of the single-copy DNA,
which currently has no known function other than to separate functional DNA
sequences, is referred to as spacer DNA.
Another 25 to 40 percent of mammalian DNA
reassociates at an intermediate rate. Cloning and sequencing of
this DNA fraction from many different animals and higher plants have revealed
that it is composed primarily of a very large number of copies of relatively
few sequence families in any specific organism. Such repetitious DNA, termed
moderately repeated DNA, or intermediate-repeat
DNA, is interspersed throughout mammalian genomes. Because these
sequences can be copied and reinserted into new sites in the genome, they are
called mobile DNA elements, which
we describe in the next section. A small portion of this fraction consists of
large duplicated gene families and tandemly repeated genes discussed
previously.
About 10 to 15 percent of mammalian DNA reassociates at a very rapid rate. This
rapidly reassociating type of repetitious DNA, referred to as simple-sequence
DNA, is composed largely of several different sets of short (5- to 10-bp)
sequences repeated in long tandem arrays.
Simple-Sequence DNAs Are Concentrated in Specific Chromosomal Locations
Although much of the simple-sequence DNA of higher organisms is composed of
tandemly repeated, 5- to 10-bp sequences, long tandem repeats of simple
sequences containing 20 to 200 nucleotides also
occur in some vertebrate and plant genomes. Such tandem repeats generally extend
up to 10^5 base pairs in total length. These long stretches of
simple-sequence DNA are often referred to as satellite DNA
because they are separated from the bulk of cellular DNA by equilibrium
density-gradient centrifugation. However, not all simple-sequence DNAs separate
from the bulk of cellular DNA during centrifugation.
Figure 9-6. Use of simple-sequence DNA as a chromosomal marker
Human metaphase chromosomes stained with a fluorescent dye and
hybridized in situ with a particular simple-sequence DNA labeled
with a fluorescent biotin derivative. When viewed under the
appropriate wavelength of light, the DNA appears red and the
hybridized simple-sequence DNA appears as a yellow band on
chromosome 16, thus locating this particular simple sequence to one
site in the genome. [See R. K. Moyzis et al., 1987, Chromosoma 95:378;
courtesy of R. K. Moyzis.]
In situ hybridization studies with metaphase chromosomes have localized
simple-sequence DNA to specific chromosomal regions. In most mammals, much of
the simple-sequence DNA lies near centromeres, discrete chromosomal regions
that attach to spindle microtubules during mitosis and meiosis (see
Figure 19-39). In the chromosomes of Drosophila melanogaster, simple-sequence
DNA is concentrated in both centromeres and telomeres, the ends of chromosomes.
Some simple-sequence tandem arrays also are located within chromosome arms in
the Drosophila genome. In humans, some simple-sequence DNAs are found at a
single site on one chromosome. These sequences are useful for identifying
particular chromosomes by fluorescence in situ hybridization (FISH). For
example, a particular simple sequence in the human genome is present only in
the middle of the long arm of chromosome 16 (Figure 9-6).
Simple-sequence DNA located at centromeres is suspected to contribute to the
structure and therefore the function of the kinetochore of metaphase
chromosomes. This large nucleoprotein complex assembles at the centromere and
attaches to spindle microtubules during mitosis (Chapter 19). As yet, however,
there is little clear-cut experimental evidence demonstrating any function for
most simple-sequence DNA.
DNA Fingerprinting Depends on Differences in Length of Simple-Sequence DNAs
Figure 9-7. Unequal crossing over during meiosis can generate differences in
lengths of simple-sequence DNA tandem arrays
In this example, unequal recombination within a stretch of DNA
containing six copies (1 to 6) of a particular
simple-sequence repeat unit yields germ cells containing either an
8-unit or 4-unit tandem array.
Within a species, the nucleotide sequences of the repeat units composing
simple-sequence DNA tandem arrays are highly conserved among individuals. In
contrast, differences in the number of repeats, and thus in the length, of
simple-sequence tandem arrays containing the same repeat unit are quite common
among individuals. These differences in length result from unequal crossing
over within regions of simple-sequence DNA during development of sperm and
oocyte precursors and during meiosis (Figure 9-7). As a consequence of this
unequal crossing over, the lengths of some simple-sequence tandem arrays are
unique in each individual.
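How misaligned recombination changes array length can be sketched with lists standing in for tandem arrays. The example below reproduces the case described for Figure 9-7 (two six-unit arrays giving an 8-unit and a 4-unit product). The exact breakpoints are not stated in the text, so the chosen positions are an assumption, and `unequal_crossover` is a hypothetical helper.

```python
def unequal_crossover(array_a, array_b, units_from_a, units_from_b):
    """Recombine two tandem arrays at different unit positions.

    The first recombinant takes the leftmost `units_from_a` units of
    array_a and everything after the first `units_from_b` units of
    array_b; the second recombinant is the reciprocal product.
    """
    if not 0 <= units_from_a <= len(array_a):
        raise ValueError("units_from_a is outside array_a")
    if not 0 <= units_from_b <= len(array_b):
        raise ValueError("units_from_b is outside array_b")
    recombinant_1 = array_a[:units_from_a] + array_b[units_from_b:]
    recombinant_2 = array_b[:units_from_b] + array_a[units_from_a:]
    return recombinant_1, recombinant_2


# Two homologous arrays, each with six copies of the repeat unit.
parent = [1, 2, 3, 4, 5, 6]

# Assumed misalignment: the exchange falls after unit 4 on one
# chromosome and after unit 2 on the other.
long_array, short_array = unequal_crossover(parent, parent, 4, 2)
print(long_array, len(long_array))
print(short_array, len(short_array))
```

The total number of repeat units is conserved (8 + 4 = 12); only their distribution between the two products changes, which is how length variants accumulate in a population.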
Figure 9-8. Consensus sequences of the repeat unit of human minisatellites
named λ33.1 and λ33.5 based on analysis of more than
ten sets of repeats in each case
Red letters indicate positions in which base differences have been
detected; red solid dot indicates a deletion. The 62-bp repeat unit
of λ33.1 is much more highly conserved than the 17-bp
unit of λ33.5. [See A. J. Jeffreys et al., 1985, Nature 314:67.]
Figure 9-9. Human DNA fingerprints
DNA samples from three individuals (1, 2, and 3) were subjected to
Southern-blot analysis using the restriction enzyme
HinfI and three labeled minisatellites as
probes (λ33.6, 33.15, and 33.5; lanes a, b, and c,
respectively). DNA from each individual produced a unique band
pattern with each probe. Conditions of electrophoresis can be
adjusted so that at least 50 bands can be resolved for each person
with this restriction enzyme. The nonidentity of these three samples
is easily distinguished. [From A. J. Jeffreys et al., 1985, Nature 316:76;
courtesy of A. J. Jeffreys.]
In humans and other mammals, some of the simple-sequence DNA exists in
relatively short 1- to 5-kb regions made up of 20 to 50 repeat units, each
containing 15 to about 100 base pairs. These regions are called minisatellites
to distinguish them from the more common regions of tandemly repeated
simple-sequence DNA, which are ≈100 kb in length. The sequences of the repeat
unit in two human minisatellites are shown in Figure 9-8. Even slight
differences in the total lengths of various minisatellites from different
individuals can be detected. These differences form the basis of DNA
fingerprinting, which is superior to conventional fingerprinting for
identifying individuals (Figure 9-9).
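A small numerical sketch shows why a change of a single repeat unit is enough to shift a band. It uses the 62-bp repeat-unit length given for λ33.1 in Figure 9-8 and repeat counts within the 20 to 50 range stated above; the individuals, their repeat counts, the 300-bp flanking length, and the function names are all hypothetical.

```python
UNIT_BP = 62    # repeat-unit length of minisatellite λ33.1 (Figure 9-8)
FLANK_BP = 300  # hypothetical flanking DNA between the restriction sites


def fragment_length(repeat_count: int, unit_bp: int = UNIT_BP,
                    flank_bp: int = FLANK_BP) -> int:
    """Length of a restriction fragment spanning one minisatellite array."""
    return repeat_count * unit_bp + flank_bp


def band_pattern(repeat_counts):
    """Sorted fragment lengths for the alleles carried by one individual."""
    return sorted(fragment_length(n) for n in repeat_counts)


# Hypothetical individuals, each with two alleles of the same minisatellite.
individual_1 = band_pattern([22, 41])
individual_2 = band_pattern([23, 41])

print(individual_1)
print(individual_2)
print(individual_1 == individual_2)
```

Here the two individuals share one band and differ in the other by exactly one repeat unit (62 bp); a real fingerprint compares many such bands at once, which is what makes the overall pattern individual-specific.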
SUMMARY
- The genomes of prokaryotes and lower eukaryotes contain few nonfunctional
  sequences, whereas vertebrate genomes contain many sequences that do not code
  for RNAs or have any structural or regulatory function. Only about 5 percent
  of the genomic DNA in humans encodes proteins or functional RNAs.
- The lack of a consistent relationship between the amount of DNA in the
  haploid chromosomes of an animal or plant and its phylogenetic complexity is
  called the C-value paradox.
- About half of the protein-coding genes in vertebrate genomic DNA are solitary
  genes, whose sequence occurs only once in the haploid genome. The remainder
  are duplicated genes, which arose by duplication of an ancestral gene and
  subsequent independent mutations (see Figure 9-5).
- Duplicated genes, such as those forming the β-like globin gene family, encode
  closely related proteins and generally appear as a cluster in a particular
  region of DNA (see Figure 9-3b). The proteins encoded by a gene family have
  homologous but nonidentical amino acid sequences and exhibit similar but
  slightly different properties.
- In invertebrates and vertebrates, rRNAs, tRNAs, and histone proteins are
  encoded by multiple copies of genes located in tandem arrays in genomic DNA.
- Single-copy DNA consists of solitary protein-coding genes, small duplicated
  gene families, and spacer DNA.
- Moderately repeated DNA includes the tandemly repeated genes encoding rRNAs,
  tRNAs, and histones; large duplicated gene families; and mobile DNA elements.
- Simple-sequence DNA, which consists largely of very short sequences repeated
  in long tandem arrays, is preferentially located in centromeres, telomeres,
  and specific locations within the arms of particular chromosomes.
- The length of a particular simple-sequence tandem array is quite variable
  among individuals in a species, probably because of unequal crossing over
  during meiosis (see Figure 9-7). Differences in the lengths of some
  simple-sequence tandem arrays form the basis for DNA fingerprinting.