NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Strachan T, Read AP. Human Molecular Genetics. 2nd edition. New York: Wiley-Liss; 1999.

Chapter 7. Organization of the human genome

7.1. General organization of the human genome

The human genome is the term used to describe the total genetic information (DNA content) in human cells. It really comprises two genomes: a complex nuclear genome which accounts for 99.9995% of the total genetic information, and a simple mitochondrial genome which accounts for the remaining 0.0005% (Figure 7.1). The nuclear genome provides the great bulk of essential genetic information, most of which specifies polypeptide synthesis on cytoplasmic ribosomes. Mitochondria possess their own ribosomes and the few polypeptide-encoding genes in the mitochondrial genome produce mRNAs which are translated on the mitochondrial ribosomes. However, the mitochondrial genome specifies only a very small portion of the specific mitochondrial functions; the bulk of the mitochondrial polypeptides are encoded by nuclear genes and are synthesized on cytoplasmic ribosomes, before being imported into the mitochondria (see Figure 1.11).

Figure 7.1. Organization of the human genome.

As in other complex genomes, a sizeable component of the human genome is made up of noncoding DNA. In addition, the human genome is representative of mammalian genomes and other complex genomes in having a considerable amount of repetitive DNA, including both noncoding repetitive DNA and multiple copy genes and gene fragments.

7.1.1. The mitochondrial genome consists of a small circular DNA duplex which is densely packed with genetic information

General structure and inheritance of the mitochondrial genome

The human mitochondrial genome is defined by a single type of circular double-stranded DNA whose complete nucleotide sequence has been established (Anderson et al., 1981). It is 16 569 bp in length and is 44% (G + C). The two DNA strands have significantly different base compositions: the heavy (H) strand is rich in guanines, the light (L) strand is rich in cytosines. Although the mitochondrial DNA is principally double-stranded, a small section is defined by a triple DNA strand structure. This is because a short segment of the heavy strand is replicated for a second time, giving a structure known as 7S DNA (see Figure 7.2 and Clayton, 1992 for a general review of transcription and replication of animal mitochondrial DNAs). Human cells usually contain thousands of copies of the double-stranded mitochondrial DNA molecule. Accordingly, although a single mitochondrial DNA duplex has only about 1/8000 as much DNA as an average sized chromosome, the total mitochondrial DNA complement can account for up to about 0.5% of the DNA in a nucleated somatic cell.
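The size comparison and the "up to about 0.5%" figure in this paragraph can be checked with a few lines of arithmetic. The sketch below is illustrative only: the copy numbers are hypothetical examples of "thousands of copies", and the diploid nuclear DNA content is assumed to be twice the 3300 Mb haploid figure used later in this chapter.

```python
MTDNA_BP = 16_569                 # length of one mitochondrial DNA duplex (from the text)
AVG_CHROMOSOME_BP = 130e6         # average chromosome size quoted in Section 7.1.2
DIPLOID_NUCLEAR_BP = 2 * 3300e6   # assumption: diploid somatic cell, 3300 Mb haploid genome


def mtdna_fraction(copies: int) -> float:
    """Fraction of total cellular DNA contributed by `copies` mtDNA molecules."""
    mt_total = copies * MTDNA_BP
    return mt_total / (mt_total + DIPLOID_NUCLEAR_BP)


if __name__ == "__main__":
    ratio = AVG_CHROMOSOME_BP / MTDNA_BP
    print(f"average chromosome is about {ratio:.0f} times larger than one mtDNA")
    # Hypothetical copy numbers, chosen only to span "thousands of copies".
    for copies in (1_000, 2_000, 5_000):
        print(f"{copies} copies: {100 * mtdna_fraction(copies):.2f}% of cellular DNA")
```

Under these assumptions, a copy number of roughly 2000 per cell corresponds to about 0.5% of the cellular DNA.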

Figure 7.2. The human mitochondrial genome. The D loop is marked by a triple-stranded structure and encompasses a duplicated short region of the heavy strand, 7S DNA. Transcription of the heavy (H) strand actually originates from two closely spaced promoters located …

During zygote formation, a sperm cell contributes its nuclear genome, but not its mitochondrial genome, to the egg cell. Consequently, the mitochondrial genome of the zygote is determined exclusively by that originally found in the unfertilized egg. The mitochondrial genome is therefore maternally inherited: males and females both inherit their mitochondria from their mother but males cannot transmit their mitochondria to subsequent generations. Thus mitochondrially encoded genes or DNA variants give the pedigree pattern shown in Figure 3.4. During mitotic cell division, the mitochondrial DNA molecules of the dividing cell segregate in a purely random way to the two daughter cells.

Mitochondrial genes

The human mitochondrial genome contains 37 genes: 28 are encoded by the heavy strand, and nine by the light strand (Figure 7.2). Of the 37 genes, a total of 24 specify a mature RNA product: 22 mitochondrial tRNA molecules and two mitochondrial rRNA molecules, a 16S rRNA (a component of the large subunit of mitochondrial ribosomes) and a 12S rRNA (a component of the small subunit of the mitochondrial ribosomes). The remaining 13 genes encode polypeptides which are synthesized on mitochondrial ribosomes.

Each of the 13 polypeptides encoded by the mitochondrial genome is a subunit of one of the mitochondrial respiratory complexes, the multichain enzymes of oxidative phosphorylation which are engaged in the production of ATP. Note, however, that there are a total of about 100 different polypeptide subunits in the mitochondrial oxidative phosphorylation system, and so the vast majority are encoded by nuclear genes (see Box 7.1). All other mitochondrial proteins, including numerous enzymes, transport proteins, structural proteins etc., are encoded by the nuclear genome and are translated on cytoplasmic ribosomes before being imported into the mitochondria (see Figure 1.11).

Box 7.1. The limited autonomy of the mitochondrial genome.

The mitochondrial genetic code

The mitochondrial genetic code (which is used to decipher only 13 different mitochondrial mRNAs on mitochondrial ribosomes) differs slightly from the nuclear genetic code (which specifies perhaps about 70 000–80 000 different mRNAs on cytoplasmic ribosomes). The mitochondrial genome encodes all the ribosomal RNA and tRNA molecules it needs for synthesizing proteins but relies on nuclear-encoded genes to provide all other components (such as the protein components of mitochondrial ribosomes, amino acyl tRNA synthetases, etc.).

As there are only 22 different types of human mitochondrial tRNA, individual tRNA molecules need to be able to interpret several different codons. Eight of the 22 tRNA molecules have anticodons which are each able to recognize families of four codons differing only at the third base, and 14 recognize pairs of codons which are identical at the first two base positions and share either a purine or a pyrimidine at the third base. Between them, therefore, the 22 mitochondrial tRNA molecules can recognize a total of 60 codons [(8 × 4) + (14 × 2)]. The remaining four codons, UAG, UAA, AGA and AGG cannot be recognized by mitochondrial tRNA and act as stop codons (see Figure 1.22).
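The codon bookkeeping in this paragraph can be verified directly. The following is a minimal sketch that only counts codons; it does not model the actual anticodon assignments, and the variable names are chosen here for illustration.

```python
from itertools import product

BASES = "UCAG"
FOUR_CODON_TRNAS = 8    # tRNAs that each read a family of four codons
TWO_CODON_TRNAS = 14    # tRNAs that each read a pair of codons
MITO_STOP_CODONS = {"UAA", "UAG", "AGA", "AGG"}  # stop codons listed in the text

# All 64 possible triplets over the four RNA bases.
all_codons = {"".join(triplet) for triplet in product(BASES, repeat=3)}

# Codons that must be read by a tRNA are those that are not stop codons.
sense_codons = all_codons - MITO_STOP_CODONS

# Codons covered by the 22 tRNAs according to the text's tally.
recognized = FOUR_CODON_TRNAS * 4 + TWO_CODON_TRNAS * 2

if __name__ == "__main__":
    print(len(all_codons), len(sense_codons), recognized)
```

The tally of recognized codons matches the number of non-stop codons, so the 22 tRNAs and four stop codons together account for all 64 triplets.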

In addition to their differences in genetic capacity and different genetic codes, the mitochondrial and nuclear genomes differ in many other aspects of their organization and expression (Table 7.1).

Table 7.1. The human nuclear and mitochondrial genomes.

Coding and noncoding DNA

Unlike its nuclear counterpart, the human mitochondrial genome is extremely compact: approximately 93% of the DNA sequence represents coding sequence. All 37 mitochondrial genes lack introns and they are tightly packed (on average one per 0.45 kb). The coding sequences of some genes (notably those encoding the sixth and eighth subunits of the mitochondrial ATPase) show some overlap (Figures 7.2 and 7.3) and, in most other cases, the coding sequences of neighboring genes are contiguous or separated by one or two noncoding bases. Some genes even lack termination codons; to overcome this deficiency, UAA codons have to be introduced at the post-transcriptional level (Anderson et al., 1981; see legend to Figure 7.3).

Figure 7.3. The genes for mitochondrial ATPase subunits 6 and 8 are partially overlapping and translated in different reading frames. Note that the overlapping genes share a common sense strand, the H strand. Coding sequence coordinates are as follows: ATPase subunit …

The only significant region lacking any known coding DNA is the displacement (D) loop region. This is the region in which a triple-stranded DNA structure is generated by synthesizing an additional short piece of the H-strand DNA, known as 7S DNA (see Figure 7.2). The replication of both the H and L strands is unidirectional and starts at specific origins. In the former case, the origin is in the D loop and only after about two-thirds of the daughter H strand has been synthesized (by using the L strand as a template and displacing the old H strand) does the origin for L strand replication become exposed. Thereafter, replication of the L strand proceeds in the opposite direction, using the H strand as a template (Figure 7.2). The D loop also contains the predominant promoter for transcription of both the H and L strands. Unlike transcription of nuclear genes, in which individual genes are almost always transcribed separately using individual promoters, transcription of the mitochondrial DNA starts from the promoters in the D loop region and continues, in opposing directions for the two different strands, round the circle to generate large multigenic transcripts (see Figure 7.2). The mature RNAs are subsequently generated by cleavage of the multigenic transcripts.

7.1.2. The nuclear genome is distributed between 24 different types of DNA duplex which show considerable regional variation in base composition and gene density

Size and banding patterns of human chromosomes

The nucleus of a human cell contains more than 99% of the cellular DNA. The nuclear genome is distributed between 24 different types of linear double-stranded DNA molecule, each of which has histones and other nonhistone proteins bound to it, constituting a chromosome. The 24 different chromosomes (22 types of autosome and two sex chromosomes, X and Y) can easily be differentiated by chromosome banding techniques (see Figure 2.18), and have been classified into groups largely according to size and, to some extent, centromere position (see Table 2.3). In addition to the primary constriction (centromere) present on each chromosome, the long arms of chromosomes 1, 9 and 16 possess so-called secondary constrictions (light staining, apparently uncoiled chromosomal regions) which, like the centromeres, are composed of constitutive heterochromatin (see Section 2.3.5). By comparison with the size of a mitochondrial DNA molecule, an average size human chromosome has an enormous amount of DNA, approximately 130 Mb on average, but varying between approximately 50 and 260 Mb (Table 7.2). In a 550 band metaphase chromosome preparation (see Figure 2.18), an average band corresponds to about 6 Mb of DNA.

Table 7.2. DNA content of human chromosomes.

Base composition in the human nuclear genome

Since the entire nucleotide sequence of the human mitochondrial genome is known, its precise base composition is known. The sequence of the human nuclear genome is still being established (and is not expected to be finished before 2003), but current estimates suggest a figure of about 42% GC. However, the proportion of specific combinations of nucleotides can vary considerably. Like other vertebrate nuclear genomes, for example, the human nuclear genome has a conspicuous shortage of the dinucleotide CpG (that is, neighboring cytosine and guanine residues on the same DNA strand in the 5′ → 3′ direction). Taking the average figure of 42% GC, the individual base frequencies are C = G = 0.21, and so the expected frequency for the dinucleotide CpG is (0.21)² = 0.0441. However, the observed frequency of the CpG dinucleotide is approximately one-fifth of this (see Bird, 1986).
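The expected-frequency calculation, and the observed/expected comparison it leads to, can be sketched as follows. This is an illustrative sketch rather than a standard tool: the function names are invented here, the expectation assumes that neighboring bases are independent, and only plain A/C/G/T sequences are handled.

```python
def base_fraction(seq: str, bases: str) -> float:
    """Fraction of positions in seq occupied by any of the given bases."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(seq.count(b) for b in bases) / len(seq)


def gc_fraction(seq: str) -> float:
    """GC content of a sequence, as a fraction between 0 and 1."""
    return base_fraction(seq, "GC")


def expected_cpg_from_gc(gc: float) -> float:
    """Expected CpG frequency if C and G each make up gc/2 of the bases
    and neighboring bases are independent."""
    return (gc / 2) ** 2


def observed_cpg(seq: str) -> float:
    """Observed CpG frequency: CG dinucleotides per dinucleotide position."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("need at least two bases")
    hits = sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "CG")
    return hits / (len(seq) - 1)


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio, using the sequence's own C and G content."""
    expected = base_fraction(seq, "C") * base_fraction(seq, "G")
    if expected == 0:
        raise ValueError("sequence lacks C or G; ratio undefined")
    return observed_cpg(seq) / expected


if __name__ == "__main__":
    print(expected_cpg_from_gc(0.42))   # genome-wide expectation at 42% GC
    print(gc_fraction("TTAGGG"))        # GC content of one telomere repeat unit
```

For bulk human DNA the ratio returned by a function like `cpg_obs_exp` would be expected to lie near 0.2, in line with the one-fifth figure quoted above.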

In vertebrate DNA, cytosine residues occurring in CpG dinucleotides are targets for methylation at carbon atom 5. Only about 3% of the cytosines in human DNA are methylated, but most that are methylated are found in the CpG dinucleotide, producing 5-methylcytosine. Over evolutionarily long periods of time, 5-methylcytosine spontaneously deaminates to give thymine and so CpG is continuously being depleted and replaced by TpG (or CpA on the complementary strand). Despite this overall depletion, certain small regions of DNA noted for their transcriptional activity are characterized by the expected CpG density (CpG islands; see Box 8.5).

Base composition can also show regional subchromosomal variation. For example, human telomeres are defined by numerous repeats of a 50% GC sequence, TTAGGG. Large tracts of condensed heterochromatin found at centromeric regions and in defined noncentromeric regions of many chromosomes (see Figure 2.18) are composed of specialized repetitive DNA whose sequence can vary considerably in base composition from the bulk DNA. Such variation is the basis for the ability of equilibrium density ultracentrifugation to fractionate the total DNA into subclasses (satellite DNA; see Section 7.4.1). The alternating dark and light staining chromosome bands also differ in a number of features, with dark G bands being characterized by a lower GC content than the pale staining bands (Table 7.3).

Table 7.3. Properties of chromosome bands seen with standard Giemsa staining.

Gene density in the human nuclear genome

The total number of genes in the human genome has been estimated to be about 70 000–80 000 (see Section 7.2.1). As all but 37 of these genes are located in the nuclear genome, this gives a rough estimate of about 3000 genes per chromosome. However, gene density can vary substantially between chromosomal regions and also between whole chromosomes. For example, heterochromatic regions are known to be very largely composed of repetitive noncoding DNA, and the centromeres and large regions of the Y chromosome, in particular, are notably devoid of genes.

Recently, insight into gene distribution along the lengths of the different chromosomes has been obtained by hybridizing purified CpG island fractions of the genome (which are associated with perhaps about 56% of human genes; Antequera and Bird, 1993) to metaphase chromosomes (Craig and Bickmore, 1994). On this basis, it is clear that gene density is high in subtelomeric regions and that some chromosomes (e.g. 19 and 22) are gene rich while others (e.g. 4 and 18) are gene poor (Figure 7.4).

Figure 7.4. Clustering of CpG islands in the human genome. The diagram represents FISH of a CpG island fraction from human DNA to human metaphase chromosomes (Craig and Bickmore, 1994). The Texas Red signal is derived from the CpG island probe, while the fluorescein …

Differences between pale and dark G bands, which are, respectively, gene rich and gene poor, are illustrated by the contrast between the human leukocyte antigen (HLA) complex and the dystrophin gene (DMD) regions. The former is located in the pale G band, 6p21.3: at the time of writing, the most intensively investigated region of the HLA complex, the class III region, had 70 genes in 0.9 Mb of DNA, giving a density of one per 13 kb. By contrast, a full 2.4 Mb of DNA in the dark G band region, Xp21, appears to be devoted exclusively to the dystrophin gene (Figure 7.5).

Figure 7.5. Contrasting gene densities in the HLA region and the dystrophin gene (DMD) region. For the sake of clarity, only a 900 kb segment from the class III region of the 4 Mb HLA cluster is shown. Note that the great bulk of the genes in the HLA region have …

7.2. Organization and distribution of human genes

7.2.1. The nuclear genome contains about 65 000–80 000 genes but only about 3% of the genome represents coding sequences

The number of genes in the human genome has been the subject of much speculation; while the small mitochondrial genome is known to have precisely 37 genes, the number in the nuclear genome remains unknown. Theoretical calculations based on the mutational load that a genome can tolerate and observed average mutation rates of human genes (~10⁻⁵ per gene per generation) suggest an upper limit of about 100 000. A variety of different approaches have been used to obtain more precise estimates of the total gene number. Three approaches have suggested a best estimate of about 65 000–80 000 genes:

  • Genomic sequencing. Extrapolation from sequencing of large chromosomal regions may suggest that there are about 70 000 genes (Fields et al., 1994). This is based on the observation that gene-rich regions have an average gene density of close to one per 20 kb, but gene-poor regions have a much lower density, say one-tenth of this density, and that the genome is split 50:50 into gene-rich and gene-poor regions.
  • CpG island number. Restriction enzyme analysis using the methylation-sensitive enzyme HpaII suggests that the total number of CpG islands (see Box 8.5) in the human genome is 45 000 (Antequera and Bird, 1993). Using an estimate that approximately 56% of genes are associated with CpG islands, these authors have suggested a total of about 80 000 human genes.
  • EST analysis. Large-scale random sequencing of cDNA clones provides so-called expressed sequence tags (ESTs, see Section 13.2.3). Comparison of known human EST sequences with a large set of different human genomic coding DNA sequences listed in sequence databases has suggested a figure of about 65 000 human genes (Fields et al., 1994).
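The CpG island estimate in the list above is a simple proportion, sketched here for illustration. The function name is invented for this example, and the calculation assumes that each island-associated gene corresponds to exactly one CpG island.

```python
def genes_from_cpg_islands(n_islands: float, fraction_with_island: float) -> float:
    """Estimate total gene number from the CpG island count, assuming one
    island per island-associated gene (a simplifying assumption)."""
    if not 0 < fraction_with_island <= 1:
        raise ValueError("fraction_with_island must be in (0, 1]")
    return n_islands / fraction_with_island


if __name__ == "__main__":
    # Values quoted in the text: 45 000 islands, 56% of genes island-associated.
    estimate = genes_from_cpg_islands(45_000, 0.56)
    print(f"estimated gene number: about {round(estimate, -3):.0f}")
```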

The above values suggest that the genes in the nuclear genome represent about 99.95% of the total number of cellular genes. If the average size of a human nuclear gene, including introns, is taken to be about 10–15 kb, this would mean that if the genes did not show overlaps, the total nuclear DNA occupied by genes would be about 70 000 × (10–15) kb or about 700–1050 Mb which corresponds roughly to about 25–35% of the genome. As the vast majority of nuclear genes encode polypeptides and the coding sequence required for an average size human polypeptide is taken to be about 500–600 codons, that is 1.5–1.8 kb, only about 3% of the nuclear genome (80–100 Mb of the 3300 Mb) would be expected to have a coding function.
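The order-of-magnitude reasoning in this paragraph can be reproduced as follows. This is a rough sketch using only the round numbers quoted in the text (70 000 genes, 3300 Mb, 10–15 kb per gene, 1.5–1.8 kb of coding sequence per gene); the helper names are invented here.

```python
GENOME_MB = 3300      # nuclear genome size used in the text
N_GENES = 70_000      # gene number used in the text's calculation


def total_mb(n_genes: int, kb_per_gene: float) -> float:
    """Total DNA (Mb) occupied by n_genes units of kb_per_gene each."""
    return n_genes * kb_per_gene / 1000


def genome_fraction(mb: float, genome_mb: float = GENOME_MB) -> float:
    """Share of the nuclear genome represented by `mb` megabases."""
    return mb / genome_mb


if __name__ == "__main__":
    for label, sizes_kb in (("genes incl. introns", (10, 15)),
                            ("coding sequence", (1.5, 1.8))):
        for kb in sizes_kb:
            mb = total_mb(N_GENES, kb)
            print(f"{label}, {kb} kb each: {mb:.0f} Mb, "
                  f"{100 * genome_fraction(mb):.1f}% of genome")
```

With these inputs the genic total reproduces the 700–1050 Mb range, while the coding total comes out slightly above the 80–100 Mb quoted, which is within the uncertainty of such rough inputs.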

7.2.2. RNA-encoding gene families often have numerous family members

While the great majority of human genes are expected to encode polypeptides, a significant minority encode mature RNA molecules of diverse function. The mitochondrial genome is exceptional in that 65% (24/37) of the genes encode RNA but even in the case of the nuclear genome about 5% of the genes, perhaps 3000–4000 genes in all, are expected to encode RNA molecules. In common with other cellular genomes, a considerable variety of genes in the human genome are devoted to making mature RNA molecules which assist in the general process of gene expression. Some, notably rRNA and tRNA, are involved in translation of mRNA. In addition, many other RNA families are involved in reactions leading to maturation not only of mRNA but also of rRNA, tRNA and other RNA species, involving both cleavage reactions and base-specific modification reactions. Several other RNAs have diverse functions (Table 7.4).

Table 7.4. Functional diversity of RNA.

Ribosomal RNA (rRNA) genes

There are multiple rRNA genes. In addition to the two mitochondrial rRNA molecules, the 28S, 18S and 5.8S cytoplasmic rRNAs are encoded by a single transcription unit (see Figure 8.1) which is tandemly repeated about 250 times, comprising five clusters of about 50 tandem repeats located on the short arms of human chromosomes 13, 14, 15, 21 and 22. In addition, the 5S cytoplasmic rRNA is encoded by several hundred gene copies in at least three clusters on the long arm of chromosome 1. The major rationale for the repetition of cytoplasmic rRNA genes is likely to be based on gene dosage: by having a comparatively large number of these genes, the cell can satisfy the huge demand for cytoplasmic ribosomes needed for protein synthesis.

Transfer RNA (tRNA) genes

These belong to a very large dispersed gene family, comprising more than 40 different subfamilies each with several members which encode the different species of cytoplasmic tRNA. In addition to multiple copies of genes specifying the individual cytoplasmic tRNA molecules, there are several defective gene copies (pseudogenes).

Small nuclear RNA (snRNA) genes

A heterogeneous collection of several hundred small nuclear RNA species is encoded by a large dispersed family of genes. Many of the snRNA species are uridine-rich and are named accordingly, e.g. U3 snRNA means the third uridine-rich small nuclear RNA to be classified. Individual species of RNA are associated with specific proteins to form ribonucleoprotein particles (RNPs). Some are known to be important in RNA splicing. The products of a large subfamily of perhaps about 200 genes are present in the nucleolus, and have been termed small nucleolar RNAs (snoRNAs). They have important roles in specific cleavage reactions and base-specific modifications during maturation of ribosomal RNA (see Smith and Steitz, 1997).

Other RNA genes

Additional RNA genes encode functionally diverse products, including the 7SL RNA component of the signal recognition particle which is required for protein export and the RNA component of telomerase, the enzyme required to synthesize DNA at the telomeres (Section 2.3.4). More recently, evidence has been obtained suggesting that certain RNA genes encode products that are important in gene regulation. An important example is the XIST gene. This gene is thought to be the major gene involved in initiating the process of X chromosome inactivation, being expressed exclusively from inactivated X chromosomes. No long open reading frames can be identified, and gene function is thought to be carried out through an RNA product by a mechanism that remains obscure (see Section 8.5.6).

In addition, several RNA genes have been found at a variety of chromosomal regions that are known to be imprinted (imprinted genes are normally expressed from a maternally inherited copy or a paternally inherited copy, but not both; see Section 8.5.4). For example, the H19 gene contains five exons and is expressed to give a polyadenylated cytoplasmic RNA which does not however associate with ribosomes. It shows a restricted pattern of expression during early development (fetal and neonatal liver, visceral endoderm and fetal gut) and is imprinted, since only the maternally inherited allele is expressed. Its functional significance is, however, unclear.

7.2.3. Functionally similar genes are occasionally clustered in the human genome, but are more often dispersed over different chromosomes

As seen in the previous section, some families of RNA genes are clustered. In the case of polypeptide-encoding gene families, some genes encoding identical or functionally related products are clustered, but often they are dispersed on several chromosomes.

Functionally identical genes

A very few human polypeptides are known to be encoded by two or more identical gene copies. Often, these are encoded by recently duplicated genes in a gene cluster, as in the case of the duplicated α-globin genes (see Section 7.3.4). In addition, some genes on different chromosomes encode identical polypeptides. Examples include members of histone gene subfamilies. As mentioned in Section 2.3.1, histones can be classified into five groups in terms of structure: H1 (the linker histone) and the four core histones, H2A, H2B, H3 and H4. In addition, histone genes can be classified into three groups according to expression: (i) replication-dependent (restricted to the S phase of the cell cycle); (ii) replication-independent (expressed at a low level throughout the cell cycle to give so-called replacement histones); (iii) tissue-specific, e.g. the H1t and H3t genes are expressed exclusively in the testis. There appears to be a total of 61 human histone genes which comprise several subfamilies (Albig and Doenecke, 1997; electronic reference 2). Most of the histone genes are found in two multifamily clusters on the short arm of chromosome 6, but genes on several other human chromosomes can specify identical copies of a particular histone subtype (Figure 7.6).

Figure 7.6. Chromosomal distribution of the human histone gene family. Eleven clusters comprising a total of about 60 histone genes are distributed over seven human chromosomes. The two clusters on 6p contain the great majority of histone genes. Other clusters contain …

Functionally similar genes

A large fraction of human genes are members of gene families where individual genes are closely related but not identical in sequence. In many such cases the genes are clustered and have arisen by tandem gene duplication, as in the case of the different members of each of the α-globin and β-globin gene clusters (see Section 7.3.4). Genes which encode clearly related products but which are located on different chromosomes are generally less related, as in the case of the α-globin and β-globin genes. However, in the case of the HOX homeobox gene family which consists of clusters of approximately 10 genes on each of four chromosomes, individual genes on different chromosomes may be more related to each other than they are to members of the same gene cluster (Section 14.2.2 and Figure 14.5).

In addition to the above, genes encoding closely related tissue-specific isoforms, or subcellular compartment-specific isozymes are often nonsyntenic (i.e. located on different chromosomes; see Table 7.5).

Table 7.5. Distribution of genes encoding functionally related products.

Functionally related genes

Some genes encode products which may not be so closely related in structure, but are clearly functionally related. The products may be subunits of the same protein or macromolecular structure, components of the same metabolic or developmental pathway, or may be required to specifically bind to each other as in the case of ligands and their relevant receptors. In almost all such cases, the genes are not clustered and are usually found on different chromosomes (see Table 7.5 for some examples).

7.2.4. Human genes show enormous variation in size and internal organization

Size diversity

Genes in simple organisms such as bacteria are comparatively similar in size, and usually very short. By contrast, complex organisms such as mammals show wide variation in gene size, a feature found especially in human genes which can vary in length from hundreds of nucleotides to several megabases (Figure 7.7). The enormous size of some human genes means that transcription can be time-consuming. For example, the human dystrophin gene requires about 16 hours to be transcribed, and transcripts undergo splicing before transcription is completed (Tennyson et al., 1995).
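The 16 hour figure implies an average elongation rate that can be worked out from the gene length. The sketch below is illustrative: it assumes a constant elongation rate across the whole gene, and applying that rate to the 45 kb apolipoprotein B gene (a size quoted later in this section) is a hypothetical extrapolation, not a measured value.

```python
DMD_GENE_BP = 2_400_000        # dystrophin gene length (2.4 Mb)
DMD_TRANSCRIPTION_HOURS = 16   # transcription time quoted in the text


def elongation_rate_bp_per_min(gene_bp: float, hours: float) -> float:
    """Average elongation rate implied by a gene length and transcription time."""
    return gene_bp / (hours * 60)


def transcription_minutes(gene_bp: float, rate_bp_per_min: float) -> float:
    """Time to transcribe a gene at a constant elongation rate."""
    return gene_bp / rate_bp_per_min


if __name__ == "__main__":
    rate = elongation_rate_bp_per_min(DMD_GENE_BP, DMD_TRANSCRIPTION_HOURS)
    print(f"implied rate: {rate / 1000:.1f} kb per minute")
    # Hypothetical extrapolation to a 45 kb gene at the same rate.
    print(f"45 kb gene: {transcription_minutes(45_000, rate):.0f} minutes")
```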

Figure 7.7. Human genes vary enormously in size and exon content. Exon content is shown as a percentage of the lengths of indicated genes. Note the generally inverse relationship between gene length and percentage of exon content. Asterisks emphasize that the lengths …

As one would expect, there is a direct correlation between the size of a gene and the size of its product, but there are some striking anomalies. For example, apolipoprotein B has 4563 amino acids and is encoded by a 45 kb gene while the dystrophin gene is 2.4 Mb in length and encodes a product in muscle cells of 3685 amino acids.

Diversity in internal organization

There is an inverse correlation between gene size and the proportion of the gene length which is expressed at the RNA level (Figure 7.7). A very small minority of human genes lack introns and are generally very small genes (see Table 7.6 for examples). For those that do possess introns, the exon content as a percentage of gene length tends to be very small in large genes. This does not arise because exons in large genes are smaller than those in small genes: the average exon size in human genes is about 200 bp and, although very large exons are known (see Box 7.2), exon size is comparatively independent of gene length (Table 7.7). Instead, the explanation lies in the huge variation in intron lengths: large genes tend to have very large introns (Table 7.7). The relationship between gene and intron length is not, however, without anomalies: the human type 7 collagen gene (COL7A1) is an intermediate size gene (31 kb) but has a total of 118 exons and an average intron size of only 188 bp (Christiano et al., 1994). The extraordinary number of exons in COL7A1 is not even matched by the number of exons in the giant 2.4 Mb dystrophin (DMD) gene (Table 7.7).

Table 7.6. Examples of human genes with uninterrupted coding sequences.

Box 7.2. Human gene organization.

Table 7.7. Average sizes of exons and introns in human genes.

7.2.5. Rare examples of overlapping genes and genes within genes are known in the human genome

Partially overlapping genes

The genes of simple organisms are generally more clustered than those in complex organisms. The average gene density in the human genome is about one per 40–45 kb of DNA. Assuming a mean size of, say, 10–15 kb, human genes should be separated by about 30 kb of nongenic DNA on average. By contrast, average gene densities in simple organisms are very much higher: roughly one per 1, 2 and 5 kb, respectively, for E. coli, Saccharomyces cerevisiae and Caenorhabditis elegans. Simple genomes such as those of certain phages and bacteria often show examples of partially overlapping genes which use different reading frames, sometimes from a common sense strand. The human mitochondrial genome is another example of a simple genome packed with genetic information and it too has an example of such overlapping genes (see Figure 7.3).
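The spacing estimate at the start of this paragraph follows from the genome size and gene number. The sketch below is a back-of-the-envelope check only: it assumes 75 000 genes (the midpoint of the 70 000–80 000 range used earlier) in a 3300 Mb genome, evenly spaced and non-overlapping.

```python
def mean_gene_spacing_kb(genome_mb: float, n_genes: int) -> float:
    """Average amount of DNA per gene (kb), i.e. the inverse of gene density."""
    return genome_mb * 1000 / n_genes


def mean_intergenic_kb(spacing_kb: float, gene_kb: float) -> float:
    """Average nongenic DNA between neighboring genes, ignoring overlaps."""
    return spacing_kb - gene_kb


if __name__ == "__main__":
    spacing = mean_gene_spacing_kb(3300, 75_000)   # assumed midpoint gene number
    print(f"one gene per {spacing:.0f} kb")
    for gene_kb in (10, 15):                       # mean gene size range from the text
        gap = mean_intergenic_kb(spacing, gene_kb)
        print(f"gene size {gene_kb} kb: about {gap:.0f} kb between genes")
```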

Reported occurrences of overlapping genes in the complex nuclear genomes of mammals are rare and, where they do occur, the overlapping genes are often transcribed from the two different DNA strands. As noted in Section 7.1.2, the degree of gene clustering in the nuclear genome is largely dependent on the chromosomal region, and in regions of high density occasional examples of overlapping genes have been noted. For example, the class III region of the HLA complex at 6p21.3 has an average gene density of about one gene per 13 kb, and is known to contain several examples of overlapping genes (Figure 7.5).

Genes within genes

The small nucleolar RNA (snoRNA) genes are unusual in that the majority of them are located within other genes, often ones which encode a ribosome-associated protein or a nucleolar protein. Possibly this arrangement has been maintained to permit coordinate production of protein and RNA components of the ribosome (Tycowski et al., 1993).

In addition to the snoRNA genes there are a few examples of other genes being located within the introns of larger genes, and in some cases the internal genes as well as the host genes are known to encode polypeptides. Three illustrative examples are:

  • The neurofibromatosis type I (NF1) gene. Intron 26 of the NF1 gene spans about 40 kb and contains three small genes, each with two exons, which are transcribed from the opposite strand to that used for the NF1 gene (Viskochil et al., 1991; see Figure 7.8).
  • The factor VIII gene. Intron 22 of the blood clotting factor VIII gene (F8C) contains a CpG island from which two internal genes, F8A and F8B are transcribed in opposite directions (Levinson et al., 1992). F8A is transcribed from the opposite strand to that used by the factor VIII gene. F8B is transcribed in the same direction as the factor VIII gene to give a short mRNA containing a new exon spliced on to exons 23–26 of the factor VIII gene (see Figure 9.20).
  • The retinoblastoma susceptibility gene RB1. Intron 17 of this gene is 72 kb long and contains a G protein-coupled receptor gene, U16, which is actively transcribed from the opposite strand (see Figure 7.19).

Figure 7.8. Genes within genes: intron 26 of the gene for neurofibromatosis type I (NF1) contains three internal genes each with two exons. Note that the three internal genes are transcribed from the opposing strand to that used for transcription of the NF1 gene.

Figure 7.19. Location of Alu, LINE-1 and (A)n/(T)n repeats within the human retinoblastoma susceptibility gene. The entire sequence of the 180 kb human retinoblastoma susceptibility gene, RB1, has been determined, enabling identification within the gene of many ...

7.3. Human multigene families and repetitive coding DNA

DNA sequences in the nuclear diploid genome usually exist as two allelic copies (on paternal and maternal homologous chromosomes). In addition to this degree of repetition, about 40% of the human nuclear genome in both haploid and diploid cells is composed of sets of closely related nonallelic DNA sequences (DNA sequence families or repetitive DNA). Within the considerable variety of different repetitive DNA sequences are DNA sequence families whose individual members include functional genes (multigene families), and also many examples of nongenic repetitive DNA sequence families.

7.3.1. The reassociation kinetics of human DNA suggest three broad classes of DNA sequence

Reassociation kinetics (Section 5.2.2) first suggested that complex genomes, such as the human genome, comprise different sequence classes on the basis of copy number. Typically, human DNA is randomly sheared (e.g. by sonication) to give fragments whose average size is about 500 bp, and the sheared DNA is denatured by heating to separate the complementary strands of each fragment. Thereafter the DNA is cooled to a temperature of about 20–30°C below the melting temperature, Tm (which marks the mid-point of the transition between the double-stranded and single-stranded states of DNA heated in solution). The cooled DNA renatures, but the extent to which a particular sequence has reassociated depends not only on time (t) but also on the initial concentration (Co) of that sequence, i.e. on the product of the two, the Cot value.

The above type of analysis has suggested that the human genome consists of three broad sequence components:

  • Single copy, or at least very low copy number, DNA (60%) - reassociates very slowly. A single strand from a single copy sequence will require some considerable time to find a complementary partner strand, given that the vast majority of DNA fragments are unrelated to it.
  • Moderately repetitive (30%) - intermediate speed of reassociation.
  • Highly repetitive (10%) - reassociates very rapidly. There are numerous copies of the same sequence and the chances of quickly finding complementary partners within the mass of different fragments are high.
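
The way three such components combine into a single reassociation (Cot) curve can be sketched as below. The sketch assumes ideal second-order kinetics, in which the fraction of a component remaining single-stranded is 1/(1 + Cot/Cot½). The component fractions are those listed above, but the Cot½ values are illustrative placeholders rather than measured values, and the function name is hypothetical.

```python
# (name, fraction of genome, Cot1/2 in mol*s/L)
# Fractions come from the text; the Cot1/2 values are ILLUSTRATIVE
# PLACEHOLDERS chosen only to separate the three components on the curve.
COMPONENTS = [
    ("highly repetitive", 0.10, 1e-3),
    ("moderately repetitive", 0.30, 1e0),
    ("single copy", 0.60, 1e3),
]


def fraction_single_stranded(cot: float) -> float:
    """Fraction of total DNA still single-stranded at a given Cot value,
    assuming each component follows ideal second-order kinetics."""
    return sum(frac / (1.0 + cot / cot_half) for _, frac, cot_half in COMPONENTS)


# Tabulate the curve over ten orders of magnitude of Cot.
for exponent in range(-5, 6):
    cot = 10.0 ** exponent
    remaining = fraction_single_stranded(cot)
    print(f"Cot = 1e{exponent:+d}   single-stranded fraction = {remaining:.3f}")
```

Plotted against log Cot, the output falls in three steps: the highly repetitive component reassociates first and the single copy component last, which is the pattern from which the three classes were inferred.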

7.3.2. Members of DNA sequence families can be identified by a variety of different approaches

The operational definition of a DNA sequence family is the comparatively high level of DNA sequence similarity (sequence homology) between whole family members, or components of the family members. Members of a DNA sequence family can be identified and actively sought by a variety of methods:

  • DNA sequencing - Allows direct calculations on the degree of sequence relatedness of family members.
  • DNA hybridization and cloning - A probe from a gene family member typically gives a complex band pattern when hybridized against a Southern blot of genomic DNAs. Individual family members can then be cloned by screening genomic DNA libraries.
  • PCR cloning - Permits identification of novel family members by designing degenerate primers corresponding to highly conserved nucleotide or amino acid sequences.
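
One practical consideration when designing degenerate primers from a conserved amino acid sequence is how many different nucleotide sequences could encode it. The following is a minimal sketch of that calculation, assuming the standard nuclear genetic code; the function name and the example peptides are hypothetical choices made for illustration.

```python
# Number of codons specifying each amino acid in the standard genetic code
# (one-letter amino acid codes; the 61 sense codons in total).
CODONS_PER_AMINO_ACID = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2,
    "Q": 2, "E": 2, "G": 4, "H": 2, "I": 3,
    "L": 6, "K": 2, "M": 1, "F": 2, "P": 4,
    "S": 6, "T": 4, "W": 1, "Y": 2, "V": 4,
}


def primer_degeneracy(peptide: str) -> int:
    """Number of distinct nucleotide sequences that could encode the peptide,
    i.e. the size of a fully degenerate primer pool covering it."""
    degeneracy = 1
    for residue in peptide.upper():
        degeneracy *= CODONS_PER_AMINO_ACID[residue]
    return degeneracy


# Example peptides chosen purely for illustration.
for peptide in ("DEAD", "MW", "LSR"):
    print(peptide, primer_degeneracy(peptide))
```

Stretches rich in methionine or tryptophan (one codon each) give the least degenerate primers, whereas leucine, serine and arginine (six codons each) inflate the pool quickly.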

When two members of a repetitive DNA sequence family exhibit a high degree of sequence homology, a recent common evolutionary origin is indicated. As detailed in the following sections, DNA sequence families show considerable variation in the number of different repeat unit members in the family, the size of the repeating unit, chromosomal location, mode of repetition and capacity for expression.

7.3.3. Human gene families vary in the overall sequence relatedness of different family members and the extent to which particularly conserved subgenic sequences define the family

A large percentage of actively expressed human genes are members of families of DNA sequences which show a high degree of sequence similarity. However, the extent of sequence sharing and the organization of family members can vary widely. Some family members may be nonfunctional (pseudogenes and gene fragments - see below) and rapidly accumulate sequence differences, leading to marked sequence divergence.

Classical gene families

Classical gene families are distinguished by members which exhibit a high degree of sequence homology over most of the gene length or, at least, the coding DNA component, a feature which automatically identifies such sequences as being closely related evolutionarily as well as functionally. In some cases, such as the individual rRNA gene families and individual histone gene families, there is an extremely high degree of sequence similarity between family members. Many other large gene families show a high degree of sequence similarity between family members.

Gene families encoding products with large, highly conserved domains

In some gene families there is particularly pronounced homology within specific strongly conserved regions of the genes; the corresponding sequence similarity between the remaining portion of the coding sequence in the different genes may be quite low. Often such families encode transcription factors that play important roles in early development, and the conserved sequence encodes a protein domain which is required to bind specifically to the DNA of selected target genes (see Table 7.8).

Table 7.8. Examples of human genes with sequence motifs which encode highly conserved domains.

Gene families encoding products with very short conserved amino acid motifs

The members of some gene families may not be very obviously related at the DNA sequence level, but nevertheless encode gene products that are characterized by a common general function and the presence of very short conserved sequence motifs. Examples, some of which are illustrated in Figure 7.9, include:

  • the DEAD box gene family - contains several different genes whose products appear to function as RNA helicases and are characterized by the presence of eight short amino acid sequence motifs, including the DEAD box (the sequence Asp-Glu-Ala-Asp as represented by the one-letter amino acid code);
  • the WD repeat gene family - gene products have a regulatory function, but there is considerable diversity in function. Typically characterized by tandem repeats with a central core of fixed length and containing small conserved amino acid motifs, including the WD (tryptophan-aspartate) sequence;
  • the ankyrin repeat gene family - wide functional diversity but often involved in protein-protein interactions. Characterized by tandem repeats of a 33 amino acid sequence in which select amino acids are conserved at only a few positions;
  • the LIM domain gene family - encodes a characteristic cysteine-rich 56 amino acid domain most likely involved in protein-protein interactions.

Figure 7.9. Some gene families are defined by functionally related gene products bearing very short conserved amino acid motifs: consensus motifs for the DEAD box and WD repeat families. (A) Motifs in the DEAD box family. This gene family encodes products implicated ...

Gene superfamilies

In some types of gene family, the genes encode products that are known to be functionally related in a general sense, and show only very weak sequence homology over a large segment, without very significant conserved amino acid motifs. Instead, there may be some evidence for general common structural features. Such genes, which appear to be evolutionarily related but more distantly than those in a classical or conserved domain/motif gene family, have been considered to be members of a gene superfamily. For example, in addition to the immunoglobulin gene family, other related genes such as the HLA genes, TCR genes, T4 and T8 genes are known to encode products with an immune system function and a domain structure that resembles that of immunoglobulins. Although, therefore, the level of sequence homology between such genes may be very low, the similarities in function and general domain structure have suggested the existence of a so-called Ig superfamily, in which there appears to be a distant common evolutionary relationship (Figure 7.10; see also Figure 14.16 for the globin superfamily).

Figure 7.10. Members of the Ig superfamily are surface proteins with similar types of domain structure. Most members of the Ig superfamily are dimers consisting of extracellular variable domains (V) located at the N-terminal ends and constant (C) domains, located ...

7.3.4. Human gene families can occur as closely clustered genes at specific subchromosomal locations, or as widely dispersed genes

A wide variety of human gene families have been identified and show considerable variation in both the organization of the genes and the extent to which individual genes within the gene family are related in terms of sequence and function. In terms of gene organization, two basic arrangements can be discerned: families where there is evidence of close gene clustering and families that are dispersed over several different chromosomal locations. This classification is rather arbitrary, however, since several gene families consist of multiple gene clusters at different chromosomal locations (Table 7.9).

Table 7.9. Examples of clustered and interspersed multigene families.

Gene families organized in a single cluster

Genes in an individual gene cluster are thought to arise by tandem gene duplication events. Different organizations are evident:

  • Tandem gene organization. This arrangement is exemplified by the organization of genes in individual rRNA gene clusters (see Section 7.2.2). The genes are highly related to each other in terms of both sequence and function, although certain family members may be nonfunctional (pseudogenes, see Section 7.3.5 below).
  • Close clustering. The individual genes may be more physically separate than in a tandemly repeated cluster but nevertheless may be closely clustered, even to the extent of being subject to a common regulatory mechanism (locus control region - see the example of the α- and β-globin gene clusters in Figures 7.11 and 8.23). Again the individual genes usually show a high degree of sequence and functional identity to each other, but many family members may be pseudogenes.
  • Compound clusters. In other clustered gene families, however, the physical relationship between genes in a cluster may be less close and a cluster of related genes may also contain within it genes that are unrelated in sequence and function, a compound gene cluster. For example, the HLA complex on 6p21.3 is dominated by families of genes which encode class I and II HLA antigens and various serum complement factors, but individual family members may be separated by functionally unrelated genes such as members of the steroid 21-hydroxylase gene family, etc. The gene organization in the latter case shows evidence of an ancestral duplication event that involved a segment containing different genes that are unrelated in function and sequence (see Figure 7.5).

Figure 7.11. Examples of human clustered gene families. Genes in a cluster are closely related in sequence and are often transcribed from the same DNA strand.

Gene families organized in multiple gene clusters

Some gene families are organized in clusters distributed over two or more chromosomal locations. Again different organizations are evident, with some families showing very high similarity between genes on different clusters, while others are marked by comparatively low sequence homology between genes on different clusters.

  • High cluster similarity. Here sequence homology between gene family members on different chromosomes may be very high as in the case of the different rRNA gene clusters on the short arms of chromosomes 13, 14, 15, 21 and 22. Another useful example is the olfactory receptor gene family which encodes a diverse repertoire of receptors which allow us to discriminate thousands of different odors. The family consists of perhaps 1000 genes, making it the largest in the human genome, and is organized in large clusters at more than 25 different chromosomal locations (Rouquier et al., 1998).
  • Low cluster similarity. Often sequence homology is greater within a cluster than between clusters. For example, the globin gene family includes genes at three locations: the α-globin cluster on 16p, the β-globin cluster on 11p and the myoglobin gene at 22q. Although all globin genes are clearly related, those within a cluster are more closely related to each other than they are to genes in one of the other clusters (see Figure 14.16).

Interspersed gene families

Some gene families show no obvious physical relationship between family members, which are dispersed as solitary genes at two or more different chromosomal locations. The family members may show considerable sequence divergence unless their dispersion has been a relatively recent event, or there has been considerable selection pressure to maintain sequence conservation. The following examples illustrate some of the many different types of organization.

  • Families expected to originate from two genomes. Current ideas on mitochondrial origins have envisaged that the original genome was derived from a prokaryote but that many of the genes were subsequently transferred to the nuclear genome. Some families of nuclear genes encode cytoplasm-specific and mitochondrial-specific isoforms for certain enzymes and other key metabolic products (see Table 7.5 for some examples).
  • Families originating from ancient genome duplication or gene duplication events. This class of interspersed gene families typically contains only a few members, as in the case of the PAX gene family, and appears to have evolved by a combination of gene duplication and/or genome duplication events over a long period of evolutionary time. Typically, all or most of the genes are functional, and may individually encode highly related products.
  • Families originating largely by retrotransposition events. Some gene families have expanded comparatively recently in evolutionary terms by a process whereby RNA transcribed from one or a small number of functional genes is converted by cellular reverse transcriptase into natural cDNA which then becomes integrated elsewhere in the chromosomes. Most such copies are nonfunctional (see next section).

7.3.5. Pseudogenes, truncated gene copies and gene fragments are commonly found in multigene families

Families of RNA genes or polypeptide-encoding genes are frequently characterized by defective copies of essentially all of the gene or its coding sequence (pseudogenes), or a portion of it, in some cases a single exon (gene fragments). A large variety of different classes are found (see Box 7.3). The following examples are meant to be illustrative of the types of defective gene copies found in different types of gene family.

Box 7.3. Pseudogenes and gene fragments. The processes that give rise to gene families often result in the formation of nonfunctional copies of a gene or a fragment of a gene, either a pseudogene (a nonfunctional copy of most or all of a gene, or at least its ...

Nonprocessed pseudogenes in a gene cluster

Individual gene clusters are typically characterized by the presence of defective gene copies which have been copied at the level of genomic DNA. This means that the defective gene copies can contain copies of the exons, introns and promoter regions of the functional genes (nonprocessed pseudogenes). Classical examples are the α-globin and β-globin clusters (see Figure 7.11). Only one member of the β-globin gene cluster is a nonprocessed pseudogene, but three of the seven gene copies in the α-globin gene cluster are. Another α-globin gene family member, HBQ1, is likely to be an expressed pseudogene: it encodes a type of globin polypeptide, θ-globin, but there is no evidence that the latter is ever incorporated into a haemoglobin molecule and so θ-globin is likely to lack any function (Clegg, 1987).

Truncated genes and gene fragments in a gene cluster

The class I HLA gene family at 6p21.3 is a classical example of a gene cluster which is characterized by nonprocessed pseudogenes, truncated gene copies and gene fragments. Although the number of class I HLA genes can vary between different copies of chromosome 6, comprehensive analysis of one of these identified 17 family members clustered over about 2 Mb (Geraghty et al., 1992; see Figure 7.12). Six of the genes are known to be expressed, although the precise functions of some of these are still not clearly understood. The remaining 11 members are clearly defective. Four are conventional full-length pseudogenes, another five represent truncated gene copies (lacking the 5′ end in four cases and the 3′ end in the other case) and two are fragments which contain only a small component of the gene sequence, in some cases a single exon (see Figure 7.12).

Figure 7.12. Clustered gene families often contain nonprocessed pseudogenes and truncated genes or gene fragments: example of the class I HLA gene family. (A) Structure of a class I HLA heavy chain mRNA. The full-length mRNA contains a polypeptide-encoding sequence. ...

Nonprocessed pseudogenes in an interspersed gene family

For several polypeptide-encoding genes, defective copies containing intronic sequence have been found elsewhere in the genome. Two illustrative examples are:

  • The NF1 (neurofibromatosis type I) gene. This gene is located close to the chromosome 17 centromere at 17q11.2 and has at least 11 nonprocessed pseudogene or gene fragment copies, nine of which are located at pericentromeric regions on seven different chromosomes (Regnier et al., 1997). Characterization of the chromosome 15-specific NF1 gene copies revealed that they contain copies of a segment of the NF1 gene spanning exons 8 and 27 and including the intron sequences. This type of gene duplication may be a result of what has been called pericentromeric plasticity.
  • The PKD1 (adult polycystic kidney disease) gene. This gene consists of 46 exons spanning 50 kb at 16p13.3. A truncated 5′ gene copy comprising approximately 70% of the gene and encompassing exons 1 to 34 and all the intervening introns has been faithfully replicated at least three times and inserted into a more proximal location at 16p13.1 (the European Polycystic Kidney Disease Consortium, 1994).

An apparently related case concerns the CFTR (cystic fibrosis) gene at 7q31.2, which has 24 exons spanning 250 kb. Copies of a ~30 kb sequence encompassing exon 9 and flanking intron sequences are distributed at a large number of chromosomal locations. However, in this case a likely explanation, proposed by Rozmahel et al. (1997), is a retrotransposition model. Transcription from the sense strand of the CFTR gene was envisaged to generate an antisense transcript encompassing the exon 9 sequence and flanking intron regions. The transcript could then have been converted to cDNA by cellular reverse transcriptase prior to integration at several sites in the genome.

Processed pseudogenes in an interspersed polypeptide-encoding gene family

Interspersed gene families frequently have several defective gene copies which have been copied at the cDNA level (processed pseudogenes; see Table 7.9B for examples). This form of retrotransposition is carried out by cellular reverse transcriptases which transcribe mRNA into natural cDNA which can then integrate into chromosomal DNA at sites of temporary breakage (see Figure 7.13 for one possible mechanism).

Figure 7.13. Processed pseudogenes originate by reverse transcription from RNA transcripts. The reverse transcriptase function could be provided by LINE-1 (Kpn) repeats (see Figure 7.18). The model for integration shown in the figure is only one of several possibilities ...

Although processed pseudogenes are typically not expressed, several examples are known where natural cDNA copies of a gene appear to be expressed, often in a testis-specific fashion. Often the functional intron-containing gene locus in such families is located on the X chromosome (see Section 14.3.3).

Processed pseudogenes in an RNA-encoding gene family

Although the size of some interspersed polypeptide-encoding gene families testifies to the success of retrotransposition as a mechanism for generating processed gene copies, the really successful (in terms of high copy number) retrotranspositions have been performed from RNA polymerase III transcripts. For example, the Alu repeat family (see Section 7.4.5) is considered to have arisen as processed pseudogenes copied from the 7SL RNA gene. Genes such as this which are transcribed by RNA polymerase III often contain an internal promoter which facilitates the expression of newly transposed copies (see Section 8.2.1 and Figure 8.3).

7.3.6. The coding sequences of many human genes contain repeated sequence motifs

In addition to the types of gene sequence duplication discussed in the previous sections, many human genes, like other eukaryotic genes, contain intragenic repeated sequences. Repeated sequences in coding DNA may involve different forms of repeated structure, including comparatively large sequences encoding protein domains or small sequence motifs. Different modes of repetition can be seen but tandem repetition is rather common.

  • Tandem repetition of microsatellite sequences (short sequence motifs - see Section 7.4.3) is common and may simply reflect statistically expected frequencies for certain base compositions. For example, certain genes which have a very high % GC often have long runs of single cytosines or guanines in the coding sequence. Such sequences are comparatively unstable and are prone to single nucleotide deletion or insertion events which are often pathogenic. A variety of genes are known to have long tracts of certain trinucleotides in their coding sequences, and runs of the CAG trinucleotide in the coding sequences of several genes have been found to be capable of expansion, causing pathogenesis (see Box 16.7).
  • Tandem repetition of sequences encoding known or assumed protein domains is quite common, and may be functionally advantageous in some cases by providing a more available biological target. In some cases, the sequence homology between the repeats can be very high; in other cases it may be rather low (see Table 7.10).

Table 7.10. Examples of intragenic repetitive coding DNA (see also Box 16.7).

7.4. Extragenic repeated DNA sequences and transposable elements

The human nuclear genome, like that of other complex eukaryotes, contains a large number of highly repeated DNA sequence families which are largely transcriptionally inactive. A wide variety of different repeats are known and dedicated repeat sequence databases have been established (see electronic reference 3). Like multigene families, noncoding repetitive DNA shows two major types of organization: tandemly repeated and interspersed.

Tandemly repeated noncoding DNA

Such families are defined by blocks (or arrays) of tandemly repeated DNA sequences. Individual arrays can occur at a few or many different chromosomal locations. Depending on the average size of the arrays of repeat units, highly repetitive noncoding DNA belonging to this class can be grouped into three subclasses: satellite, minisatellite and microsatellite DNA.

In addition to these three major subclasses (which are detailed in the following three sections), a fourth class has recently been recognized and described as megasatellite or macrosatellite DNA. Despite the name, this type of DNA is characterized by array lengths which can be modest compared with those of some satellite DNA arrays. Instead, the prefix mega- has been used to emphasize the large size of the repeating unit, which can be several kilobases long. This class is exemplified by the RS447 megasatellite, which consists of about 60 tandem copies of a novel 4.7 kb repeat on 4p15, plus another array of several copies on distal 8p (Gondo et al., 1998). Array lengths can be highly polymorphic and the RS447 repeat is reported to contain a putative open reading frame of 1590 bp (Gondo et al., 1998). See Table 7.11 for some other examples.

Table 7.11. Major classes of tandemly repeated human DNA.

Interspersed repetitive noncoding DNA

The individual repeat units are not clustered, but are dispersed at numerous locations in the genome, and together account for perhaps one third of the DNA in the human genome (Smit, 1996). Most of the DNA families belonging to this class contain some members that are capable of undergoing retrotransposition (i.e. transposition through an RNA intermediate).

The chromosomal locations of different types of tandemly repeated DNA can show a very restricted or highly dispersed pattern, whereas different classes of interspersed repeat DNA can show preferential location within different types of chromosome bands (see below and Figure 7.14).

Figure 7.14. Chromosomal location of major repetitive DNA classes. (A) General overview. Note the restricted locations of certain types of tandemly repeated DNA, such as satellite DNAs which are found in heterochromatin (notably at the centromeres) and minisatellite ...

7.4.1. Satellite DNA is composed of very long arrays of tandem repeats which can be separated from bulk DNA by buoyant density gradient centrifugation

Human satellite DNA comprises very large arrays of tandemly repeated DNA with the repeat unit being a simple or moderately complex sequence (Table 7.11; see Singer, 1982a). Repeated DNA of this type is not transcribed and accounts for the bulk of the heterochromatic regions of the genome, being notably found in the vicinity of the centromeres (pericentromeric heterochromatin). The base composition, and therefore density, of such DNA regions is dictated by the base composition of their constituent short repeat units and may diverge substantially from the overall base composition of bulk cellular DNA.

Isolation by buoyant density gradient centrifugation

When DNA is isolated from human cells by conventional methods, it is subject to mechanical shearing. This generates fragments from the bulk DNA (with a base composition of ~42% GC) and fragments from the satellite DNA regions, which may have a similar or a different base composition. If the base composition is significantly different, satellite DNA sequences can be separated from the bulk DNA by buoyant density gradient centrifugation. Following centrifugation, they appear as minor (or satellite) bands of different buoyant density from a major band which represents bulk DNA. Typically, human DNA is complexed with Ag+ ions and then fractionated in buoyant density gradients containing cesium sulfate, whereupon three satellite bands are identified at different densities: satellite 1 – 1.687 g cm⁻³; satellite 2 – 1.693 g cm⁻³; satellite 3 – 1.697 g cm⁻³.

Each of these satellite classes includes a number of different tandemly repeated DNA sequence families (satellite subfamilies), some of which are shared between different classes. DNA sequence analysis has revealed that some of the repetitive DNA families in the satellites are based on very simple repeat units. For example, both satellite 2 and satellite 3 contain sequence arrays which are based on tandem repetition of the sequence ATTCC. Additionally, restriction mapping has revealed satellite subfamilies which show additional higher order repeat units superimposed on the small basic repeat units. Such subfamilies are thought to arise as a result of subsequent amplification of a unit which is larger than the initial basic repeat unit and contains some diverged units (Figure 7.15).

Figure 7.15. Formation of higher order repeat units in simple sequence satellite DNA. Secondary amplification is envisaged in this case to involve a 15 bp repeat unit comprising three diverged 5 bp repeats.

Alphoid DNA and centromeric heterochromatin

Other types of satellite DNA sequence cannot easily be resolved by density gradient centrifugation. They were first identified by digestion of genomic DNA with a restriction endonuclease which typically has a single recognition site in the basic repeat unit. In addition to the basic repeat unit size (monomer), such enzymes will produce a characteristic pattern of multimers of the unit length because of occasional random loss of the restriction site in some of the repeats (Singer, 1982a). Alpha satellite (or alphoid DNA) constitutes the bulk of the centromeric heterochromatin and accounts for about 3–5% of the DNA of each chromosome. It is characterized by tandem repeats of a basic mean length of 171 bp, although higher order units are also seen. The sequence divergence between individual members of the alphoid DNA family can be so high that it is possible to isolate chromosome-specific subfamilies for each of the human chromosomes (Choo et al., 1991).
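
The origin of the multimer pattern can be illustrated with a small simulation. This is a sketch under simplifying assumptions that are not taken from the text: every repeat is exactly 171 bp, each repeat carries one site at its end, and each site is lost independently with a hypothetical probability of 0.2. The function name is also hypothetical.

```python
import random
from collections import Counter
from typing import List

MONOMER_BP = 171  # mean alphoid repeat length quoted in the text


def digest_tandem_array(n_repeats: int, site_loss_prob: float, seed: int = 0) -> List[int]:
    """Fragment lengths (bp) after complete digestion of a tandem array in
    which each repeat has one site, lost independently with site_loss_prob."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    fragments = []
    current = 0
    for _ in range(n_repeats):
        current += MONOMER_BP
        if rng.random() >= site_loss_prob:  # site intact: the enzyme cuts here
            fragments.append(current)
            current = 0
    if current:
        fragments.append(current)  # uncut remainder at the end of the array
    return fragments


# site_loss_prob=0.2 is an arbitrary illustrative value, not a measured one.
fragments = digest_tandem_array(n_repeats=1000, site_loss_prob=0.2)
ladder = Counter(length // MONOMER_BP for length in fragments)
for multiple in sorted(ladder):
    size = multiple * MONOMER_BP
    print(f"{multiple} x {MONOMER_BP} bp = {size:>5} bp: {ladder[multiple]} fragments")
```

Every fragment is a whole multiple of the monomer length, and under this model the larger multimers become progressively rarer, which corresponds to the ladder of bands seen on a gel.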

The function of satellite DNA remains unclear (see Csink and Henikoff, 1998). The centromeric DNA of human chromosomes largely consists of various families of satellite DNA (see Figure 7.14B). Of these, only the α-satellite is known to be present on all chromosomes, and its repeat units often contain a binding site for a specific centromere protein, CENP-B. Recently, cloned α-satellite arrays have been shown to seed de novo centromeres in human cells, indicating that α-satellite plays an important role in centromere function (Grimes and Cook, 1998).

7.4.2. Minisatellite DNA is composed of moderately sized arrays of tandem repeats and is often located at or close to telomeres

Minisatellite DNA comprises a collection of moderately sized arrays of tandemly repeated DNA sequences which are dispersed over considerable portions of the nuclear genome (Table 7.11). Like satellite DNA sequences, they are not normally transcribed (but see below).

Hypervariable minisatellite DNA

Hypervariable minisatellite DNA sequences are highly polymorphic and are organized in over 1000 arrays (0.1–20 kb long) of short tandem repeats (Jeffreys, 1987). The repeat units in different hypervariable arrays vary considerably in size, but share a common core sequence, GGGCAGGAXG (where X = any nucleotide), which is similar in size and in G content to the chi sequence, a signal for generalized recombination in E. coli. While many of the arrays are found near the telomeres, several hypervariable minisatellite DNA sequences occur at other chromosomal locations. The great majority of hypervariable minisatellite DNA sequences are not transcribed, except for elements occurring within noncoding intragenic sequences. Some, however, are expressed. For example, the MUC1 locus on 1q is an expressed hypervariable minisatellite locus. It encodes a glycoprotein found in several epithelial tissues and body fluids which is highly polymorphic as a result of extensive variation in the number of minisatellite-encoded repeats (Swallow et al., 1987).
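
As an illustration of how the core sequence could be located in a DNA sequence, the sketch below translates GGGCAGGAXG into a regular expression and scans both strands. It looks for exact matches only, whereas real repeat units may diverge from the core; the function names and the test sequence are hypothetical.

```python
import re
from typing import List, Tuple

CORE = "GGGCAGGAXG"  # core sequence from the text; X = any nucleotide


def core_to_regex(core: str) -> str:
    """Translate the core sequence into a regular expression."""
    return core.replace("X", "[ACGT]")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_core_matches(seq: str) -> List[Tuple[str, int]]:
    """Return (strand, start) for each exact core match; start is always the
    0-based position of the matched region on the given (forward) strand."""
    seq = seq.upper()
    # A lookahead is used so that overlapping matches are also reported.
    pattern = re.compile(f"(?=({core_to_regex(CORE)}))")
    hits = [("+", m.start()) for m in pattern.finditer(seq)]
    rc = reverse_complement(seq)
    hits += [("-", len(seq) - m.start() - len(CORE)) for m in pattern.finditer(rc)]
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical test sequence containing one copy of the core (X = T).
print(find_core_matches("AAGGGCAGGATGTT"))
```

Scanning the reverse complement as well matters because a minisatellite array can lie in either orientation relative to the sequence being searched.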

The significance of hypervariable minisatellite DNA is not clear, although it has been reported to be a ‘hotspot’ for homologous recombination in human cells (Wahls et al., 1990). Nevertheless it has found many applications. Various individual loci have been characterized and used as genetic markers, although the preferential localization in subtelomeric regions has limited their use for genome-wide linkage studies. A major application has been in DNA fingerprinting, in which a single DNA probe which contains the common core sequence can hybridize simultaneously to multiple minisatellite DNA loci on all chromosomes, resulting in a complex individual-specific hybridization pattern (see Section 17.4.1).

Telomeric DNA

Another major family of minisatellite DNA sequences is found at the termini of chromosomes, the telomeres. The principal constituent of telomeric DNA is 10–15 kb of tandem hexanucleotide repeat units, especially TTAGGG, which are added by a specialized enzyme, telomerase (see Figure 2.9). By acting as buffers to protect the ends of the chromosomes from degradation and loss and by providing a mechanism for replicating the ends of the linear DNA of chromosomes, these simple repeats are directly responsible for telomere function (see Figure 2.9 and Section 2.3.4).

7.4.3. Microsatellite DNA is defined by the presence of short arrays of tandem simple repeat units and is dispersed throughout the human genome

Microsatellite DNA families include small arrays of tandem repeats which are simple in sequence (often 1–4 bp) and are interspersed throughout the genome. Of the mononucleotide repeats, runs of A and of T are very common (see Figure 7.19) and together account for about 10 Mb, or 0.3% of the nuclear genome. By contrast, runs of G and of C are very much rarer. In the case of dinucleotide repeats, arrays of CA repeats (TG repeats on the complementary strand) are very common, accounting for 0.5% of the genome, and are often highly polymorphic (see Section 11.2.3). CT/AG repeats are also common, occurring on average once every 50 kb and accounting for 0.2% of the genome, but CG/GC repeats are very rare. This is so because C residues which are flanked at their 3′ end by a G residue (i.e. CpG) are prone to methylation and subsequent deamination, resulting in TpG (or CpA on the opposite strand, see Box 8.5). Trinucleotide and tetranucleotide tandem repeats are comparatively rare, but are often highly polymorphic and increasingly have been investigated to develop highly polymorphic markers.
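The definition above (short arrays of tandem repeats of a 1–4 bp unit) can be sketched as a simple pattern search. The Python example below is a minimal illustration, not an established microsatellite caller: the function name `find_microsatellites`, the `min_copies` threshold of five and the example sequence are all assumptions chosen for the sketch.

```python
import re

def find_microsatellites(seq, max_unit=4, min_copies=5):
    """Yield (start, unit, copies) for perfect tandem runs of a 1..max_unit bp unit.

    The repeat unit is matched lazily, so the shortest unit that reaches
    min_copies at a given position is the one reported.
    """
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        yield m.start(), unit, len(m.group(0)) // len(unit)

# Hypothetical sequence: a (CA)6 array followed by an (A)8 run.
example = "GGT" + "CA" * 6 + "TTG" + "A" * 8 + "C"
hits = list(find_microsatellites(example))
```

Only perfect (uninterrupted) arrays are found; real microsatellites often contain interruptions, which this sketch would report as separate shorter arrays or miss altogether.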

The significance of microsatellite DNA is not known. Alternating purine-pyrimidine repeats, such as tandem repeats of the dinucleotide pair CA/TG, are capable of adopting an altered DNA conformation, Z-DNA, in vitro, but there is little evidence that they do so in the cell. Although microsatellite DNA has generally been identified in intergenic DNA or within the introns of genes, a few examples have been recorded within the coding sequences of genes. Tandem repeats of three nucleotides in coding DNA may be sites that are prone to pathogenic expansions (see Section 9.5.2 and Box 16.7).

7.4.4. Highly repeated interspersed DNA families contain a small percentage of actively transposing DNA elements

Two major classes of mammalian interspersed repetitive DNA families have been discerned on the basis of repeat unit length (Singer, 1982b): SINEs and LINEs.

SINEs (short interspersed nuclear elements)

The most conspicuous human SINE is the Alu repeat family (so called because of early attempts at characterizing the sequence using the restriction nuclease AluI). The Alu repeat contains an internal RNA polymerase III promoter sequence. It has attained a very high copy number in the human genome (Table 7.12) and appears to have originated by retrotransposition from the 7SL RNA gene (see Section 7.4.5). The Alu repeat is primate-specific but other mammals have similar types of sequence derived from the 7SL RNA gene such as the B1 family in mouse.

Table 7.12. Major classes and families of interspersed human repetitive DNA (adapted from Smit, 1996).

Unlike the Alu repeat, another major human SINE family is not restricted to primates, with copies being found in marsupials and monotremes. In accordance with its distribution this family has been termed the MIR (mammalian-wide interspersed repeat) family (see Table 7.12).

LINEs (long interspersed nuclear elements)

Human LINEs are exemplified by the LINE-1 or L1 element (also called the Kpn repeat because of early attempts at characterizing this family using the restriction nuclease KpnI; see Section 7.4.6). The LINE-1 element is also found in other mammals such as the mouse.

In addition to the human Alu and LINE-1 repeat families, there are many smaller families, including the THE-1 (transposable human element) family, many MER (medium reiteration frequency) families and families of human endogenous retroviruses (HERV) or retrovirus-like elements (RTLV) (see Lower et al., 1996 and Table 7.12).

Some members of the interspersed repeat families have been considered as transposable elements, unstable DNA elements which can migrate to different regions of the genome (Figure 7.16). Rare examples are known of human DNA sequences which appear to have been copied by a DNA-mediated transposition event (transposons) and a variety of sequences have been found in the human genome which resemble known DNA transposons such as the mariner transposon and others (Smit and Riggs, 1996). However, the latter are not thought to be actively transposing. Instead, the great majority of human transposable elements undergo retrotransposition, that is their RNA transcripts can be converted within the cell to a complementary DNA form which can reinsert back into chromosomal DNA at a variety of different locations (see Figure 7.13). Three classes of mammalian sequence are known to be able to transpose through an RNA intermediate (see Box 7.4). In each case, the transposition event involves duplication of a very short sequence at the target site, causing the transposed sequence to be flanked by short repeats (see Figures 7.13 and 7.17).
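The flanking short repeats described above can be pictured at the level of sequence strings. The following Python sketch is a deliberately simplified model: it takes a target sequence, duplicates a short stretch at the chosen site and places the inserted element between the two copies. The function name `insert_with_tsd` and all example sequences are hypothetical, and the sketch makes no claim about the enzymatic steps involved.

```python
def insert_with_tsd(target, element, site, tsd_len):
    """Insert `element` into `target`, duplicating the tsd_len bases starting at `site`.

    The result has the form: upstream + repeat + element + repeat + downstream,
    so the inserted element ends up flanked by short direct repeats.
    """
    repeat = target[site:site + tsd_len]
    upstream = target[:site]
    downstream = target[site + tsd_len:]
    return upstream + repeat + element + repeat + downstream

# Hypothetical 10 bp target, with a 4 bp stretch (CCGG) duplicated on insertion.
target = "AAACCGGTTT"
element = "gattaca"  # lower case only to make the inserted element easy to see
product = insert_with_tsd(target, element, site=3, tsd_len=4)
```

The product is longer than the target by the length of the element plus the length of the duplicated stretch, which is why such flanking repeats are taken as a signature of a transposition event.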

Figure 7.16. Human transposable elements. Only a small proportion of members of any of the above families may be capable of transposing; many have lost such capacity by acquiring inactivating mutations and many are short truncated copies. See Figures 7.17 and 7.18 …

Box 7.4. Classes of mammalian sequence which undergo transposition through an RNA intermediate. Endogenous retroviruses are sequences which resemble retroviruses but which cannot infect new cells and are therefore restricted to one genome. They include sequences …

Figure 7.17. Classes of mammalian transposable elements which undergo transposition through an RNA intermediate. Blue blocks flanking the sequences represent short repeats of a sequence originally present at the target site that was duplicated during the integration …

7.4.5. The Alu repeat occurs about once every 3 kb in the human genome and includes examples that are transcribed

The Alu repeat is the most abundant sequence in the human genome, with a copy number of about 1 000 000 (see Deininger, 1989; Smit, 1996). Alu repeats have a relatively high GC content and, although dispersed mainly throughout the euchromatic regions of the genome, have been reported to be preferentially located in R chromosome bands (Korenberg and Rykowski, 1988). The latter correspond to the pale bands seen when using standard Giemsa staining (see Box 2.4) and represent the most transcriptionally active regions of the genome. The full-length Alu repeat is about 280 bp long and is usually flanked by short (often 6–18 bp) direct repeats (i.e. the repeats are in the same orientation). The typical Alu sequence is a tandemly repeated dimer, with the repeats sharing an approximately 120 bp sequence followed by a short sequence which is rich in A residues on one strand and T residues on the complementary strand. However, there is asymmetry between the tandem repeats: one repeat unit contains an internal 32-bp sequence lacking in the other (Figure 7.18). Monomers, containing only one of the two tandem repeats, and various truncated versions of dimers and monomers are also common.
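The copy number and repeat length quoted above can be turned into rough genome-wide figures. The short Python calculation below is a back-of-envelope sketch that assumes a nuclear genome of about 3000 Mb (an assumption, chosen to be consistent with the statement in Section 7.4.3 that 10 Mb is about 0.3% of the nuclear genome) and treats every copy as full length, so the resulting fraction is an upper estimate.

```python
# Assumed nuclear genome size (about 3000 Mb); not stated in this section.
GENOME_BP = 3_000_000_000

# Values quoted in the text.
ALU_COPIES = 1_000_000
ALU_FULL_LENGTH_BP = 280

# Average spacing if the copies were spread evenly along the genome.
mean_spacing_bp = GENOME_BP // ALU_COPIES

# Upper estimate of the fraction of the genome made up of Alu sequence,
# obtained by treating every copy as a full-length 280 bp repeat.
max_alu_fraction = ALU_COPIES * ALU_FULL_LENGTH_BP / GENOME_BP

print(f"mean spacing: about {mean_spacing_bp} bp")
print(f"Alu fraction: at most about {max_alu_fraction:.1%}")
```

Under these assumptions the average spacing comes out at one copy per 3 kb and the Alu family at a little under one-tenth of the genome; because many copies are truncated or monomeric, the true fraction would be somewhat lower.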

Figure 7.18. Structures of full-length Alu and LINE-1 repeats. The consensus standard Alu dimer is shown with two similar repeats terminating in an (A)n/(T)n-like sequence. They have different sizes because of the insertion of a 32 bp element within the larger …

The two repeated units of the Alu sequence show a striking resemblance to the sequence for 7SL RNA, a component of the signal recognition particle, which facilitates transport of proteins across the membrane of the endoplasmic reticulum. Because of this and the observation of the A-rich regions and the flanking direct repeats, it has been widely assumed that the Alu sequence has been propagated by retrotransposition from 7SL RNA, and therefore represents a processed 7SL RNA pseudogene (Figure 14.25). Certainly, transposition by Alu sequences is known to occur, presumably as a result of trans-acting cellular reverse transcriptases such as those encoded by LINE-1 (Kpn) elements (see below), and may occasionally cause clinical problems (Section 9.5.6). Possibly, the very high copy number achieved by this processed pseudogene is related to the presence of a promoter sequence in the 7SL RNA sequence (the 7SL RNA gene, like tRNA genes, is transcribed by RNA polymerase III from an internal promoter; see Section 8.2.1). By contrast, processed pseudogenes from RNA polymerase II transcripts lack promoter sequences and their only chance of expression is if the integration event places them next to a functional promoter sequence.

Currently, the function of the Alu sequence, if any, is unknown, although several roles have been considered (see Schmid, 1998). Although the average expected frequency is one copy per 3 kb, high density clustering of Alu repeats is known to occur in certain regions. Because of their ubiquity, Alu sequences have been considered to promote unequal recombination, a mechanism which while occasionally causing disease, may be evolutionarily advantageous in promoting gene duplication (Section 9.3.2). Although conspicuously absent from coding sequences, Alu sequences are often found in noncoding intragenic locations, notably in introns and occasionally in untranslated sequences (see Figure 7.19). Consequently, they are often represented in the primary transcript RNA from genes encoding polypeptides, and occasionally in mRNA. The Alu sequence can also be transcribed from its internal promoter by RNA polymerase III in vitro, and in vivo transcription of some Alu sequences can result in the accumulation of a small cytoplasmic RNA which can be specifically bound by two cytoplasmic signal recognition particle proteins, SRP9 and SRP14 (Chang et al., 1994).

7.4.6. The LINE-1 (Kpn repeat) includes examples that appear to encode a reverse transcriptase and are actively transposing

The human LINE-1 (L1, Kpn repeat) family is expected to consist of over 100 000 and perhaps as many as 400 000–500 000 interspersed repeats (Smit, 1996). Of these, multiple members are known to be actively transposing (Sassaman et al., 1997). The full-length consensus element is 6.1 kb long and has two open reading frames (ORFs). ORF1 is located close to one end (conventionally known as the 5′ end) of the consensus element and encodes a protein of unknown function, p40 (so called because its molecular weight is ~40 kDa). ORF2 encodes a protein with an endonuclease domain and also a reverse transcriptase domain (see Sassaman et al., 1997; Kazazian and Moran, 1998). The full-length consensus element contains an internal promoter within a region of untranslated DNA preceding ORF1 (conventionally called the 5′-UTR) while at the other end there is an (A)n/(T)n sequence, often described as the 3′ poly(A) tail. As in the case of other elements that transpose, the LINE-1 elements are flanked by short duplicated repeats (Figure 7.18).

The full-length LINE-1 element is comparatively rare and most repeats are truncated at the 5′ end, resulting in a population that is heterogeneous in length but shares a common 3′ end with the poly(A) tail. LINE-1 elements are primarily located in euchromatic regions but show an inverse relationship with Alu repeats by appearing to be located preferentially in the dark G bands (Giemsa positive) of metaphase chromosomes. Like the Alu repeats, they are conspicuously absent from coding sequences but may be found in intragenic noncoding sequences (Figure 7.19). As a result, they may be represented in the primary RNA transcript of large genes.

Further reading

  1. Gardiner K. Human genome organization. Curr. Opin. Genet. Dev. (1995);5:315–322. [PubMed: 7549425]
  2. Schmid C. Alu: structure, origin, evolution, significance and function of one-tenth of human DNA. Prog. Nucleic Acid Res. Mol. Biol. (1996);53:283–319. [PubMed: 8650306]

Electronic References (e-Refs)

  1. The Genome Monitoring Table at http://www.ebi.ac.uk/~sterk/genome-MOT.
  2. The Histone Sequence Database at http://genome.nhgri.nih.gov/histones.
  3. The Repeat Sequence Database at http://www.girinst.org/

References

  1. Albig W, Doenecke D. The human histone gene cluster at the D6S105 locus. Hum. Genet. (1997);101:284–294. [PubMed: 9439656]
  2. Anderson S. et al. Sequence and organization of the human mitochondrial genome. Nature. (1981);290:457–465. [PubMed: 7219534]
  3. Antequera F, Bird A. Number of CpG islands and genes in human and mouse. Proc. Natl Acad. Sci. USA. (1993);90:11995–11999. [PMC free article: PMC48112] [PubMed: 7505451]
  4. Bird A. CpG islands and the function of DNA methylation. Nature. (1986);321:209–213. [PubMed: 2423876]
  5. Chang D -Y, Nelson B, Bilyeu T, Hsu K, Darlington G J, Maraia R J. A human Alu-RNA binding protein whose expression is associated with accumulation of small cytoplasmic Alu RNA. Mol. Cell. Biol. (1994);14:3949–3959. [PMC free article: PMC358761] [PubMed: 8196634]
  6. Choo K H, Vissel B, Nagy A, Earle E, Kalitsis P. A survey of the genomic distribution of alpha satellite DNA on all the human chromosomes, and derivation of a new consensus sequence. Nucleic Acids Res. (1991);19:1179–1182. [PMC free article: PMC333840] [PubMed: 2030938]
  7. Christiano A, Hoffman G G, Chung-Honet L C, Lee S, Cheng W, Uitto J, Greenspan D S. Structural organization of the human type VII collagen gene (COL7A1), composed of more exons than any previously characterized gene. Genomics. (1994);21:169–179. [PubMed: 8088784]
  8. Clayton D A. Transcription and replication of animal mitochondrial DNAs. Int. Rev. Cytol. (1992);141:217–232. [PubMed: 1452432]
  9. Clegg J B. Can the product of the theta gene be a real globin? Nature. (1987);329:465–467.
  10. Craig J M, Bickmore W A. The distribution of CpG islands in mammalian chromosomes. Nature Genet. (1994);7:376–381. [PubMed: 7920655]
  11. Csink A K, Henikoff S. Something from nothing: the evolution and utility of satellite repeats. Trends Genet. (1998);14:200–204. [PubMed: 9613205]
  12. Deininger P (1989) In: Mobile DNA (DE Berg, MM Howe, eds), pp. 619–636. American Society for Microbiology, Washington, DC.
  13. European Polycystic Kidney Disease Consortium. The polycystic kidney disease 1 gene encodes a 14 kb transcript and lies within a duplicated region on chromosome 16. Cell. (1994);77:881–894. [PubMed: 8004675]
  14. Fields C, Adams M D, White O, Venter J C. How many genes in the human genome? Nature Genet. (1994);7:345–346. [PubMed: 7920649]
  15. Geraghty D E, Koller B H, Hansen J A, Orr H T. Examination of four HLA class I pseudogenes. Common events in the evolution of HLA genes and pseudogenes. J. Immunol. (1992);149:1934–1936. [PubMed: 1517564]
  16. Gondo Y, Okada T, Matsuyama N, Saitoh Y, Yanagisawa Y, Ikeda J -E. Human megasatellite DNA RS447: copy-number polymorphisms and interspecies conservation. Genomics. (1998);54:39–49. [PubMed: 9806828]
  17. Grimes B, Cooke H. Engineering mammalian chromosomes. Hum. Mol. Genet. (1998);7:1635–1640. [PubMed: 9735385]
  18. Hewitt J E, Lyle R, Clark L N, Valleley E M, Wright T J, Wijmenga C, van Deutekom J C T, Francis F, Sharpe P T, Hofker M, Frants R R, Williamson R. Analysis of the tandem repeat locus D4Z4 associated with facioscapulohumeral muscular dystrophy. Hum. Mol. Genet. (1994);3:1287–1295. [PubMed: 7987304]
  19. Jeffreys A J. Highly variable minisatellites and DNA fingerprints. Biochem. Soc. Trans. (1987);15:309–317. [PubMed: 2887471]
  20. Kazazian H H, Moran J V. The impact of L1 retrotransposons on the human genome. Nature Genet. (1998);19:19–24. [PubMed: 9590283]
  21. Korenberg J R, Rykowski M C. Human genome organization: Alu, lines, and the molecular structure of metaphase chromosome bands. Cell. (1988);53:391–400. [PubMed: 3365767]
  22. Levinson B, Kenwrick S, Gamel P, Fisher K, Gitschier J. Evidence for a third transcript from the human factor VIII gene. Genomics. (1992);14:585–589. [PubMed: 1427887]
  23. Lower R, Lower J, Kurth R. The viruses in all of us: characteristics and biological significance of human endogenous retrovirus sequences. Proc. Natl Acad. Sci. USA. (1996);93:5177–5184. [PMC free article: PMC39218] [PubMed: 8643549]
  24. Neer E J, Schmidt C J, Nambudripad R, Smith T F. The ancient regulatory-protein family of WD-repeat proteins. Nature. (1994);371:297–300. [PubMed: 8090199]
  25. Rouquier et al. Distribution of olfactory receptor genes in the human genome. Nature Genet. (1998);18:243–250. [PubMed: 9500546]
  26. Rozmahel R, Heng H H, Duncan A M, Shi X M, Rommens J M, Tsui L C. Amplification of CFTR exon 9 sequences to multiple locations in the human genome. Genomics. (1997);45:554–561. [PubMed: 9367680]
  27. Sassaman D M. Many human L1 elements are capable of retrotransposition. Nature Genet. (1997);16:37–43. [PubMed: 9140393]
  28. Schmid C W. Does SINE evolution preclude Alu function? Nucleic Acids Res. (1998);26:4541–4550. [PMC free article: PMC147893] [PubMed: 9753719]
  29. Schmid S R, Linder P. D-E-A-D protein family of putative RNA helicases. Mol. Microbiol. (1992);6:283–292. [PubMed: 1552844]
  30. Smit A F. The origin of interspersed repeats in the human genome. Curr. Opin. Genet. Dev. (1996);6:743–748. [PubMed: 8994846]
  31. Smit A F, Riggs A D. Tiggers and DNA transposon fossils in the human genome. Proc. Natl Acad. Sci. USA. (1996);93:1443–1448. [PMC free article: PMC39958] [PubMed: 8643651]
  32. Smith C M, Steitz J A. Sno storm in the nucleolus: new roles for myriad small RNPs. Cell. (1997);89:669–672. [PubMed: 9182752]
  33. Swallow D M, Gendler S, Griffiths B, Corney G, Taylor-Papadimitriou J, Bramwell M E. The human tumour-associated epithelial mucins are coded by an expressed hypervariable gene locus PUM. Nature. (1987);328:82–84. [PubMed: 3600778]
  34. Tennyson C N, Klamut H J, Worton R G. The human dystrophin gene requires 16 hours to be transcribed and is co-transcriptionally spliced. Nature Genet. (1995);9:184–190. [PubMed: 7719347]
  35. Toguchida J, McGee T L, Paterson J C, Eagle J R, Tucker S, Yandell D W, Dryja T P. Complete genomic sequence of the human retinoblastoma susceptibility gene. Genomics. (1993);17:535–543. [PubMed: 7902321]
  36. Tycowski K T, Shu M -D, Steitz J A. A small nucleolar RNA is processed from an intron of the human gene encoding ribosomal protein S3. Genes Dev. (1993);7:1176–1190. [PubMed: 8319909]
  37. Tyler-Smith C, Willard H F. Mammalian chromosome structure. Curr. Opin. Genet. Dev. (1993);3:390–397. [PubMed: 8353411]
  38. Viskochil D. et al. The gene encoding the oligodendrocyte-myelin glycoprotein is embedded within the neurofibromatosis type 1 gene. Mol. Cell. Biol. (1991);11:906–912. [PMC free article: PMC359746] [PubMed: 1899288]
  39. Wahls W P, Wallace L J, Moore P J. Hypervariable minisatellite DNA is a hotspot for homologous recombination in human cells. Cell. (1990);60:95–103. [PubMed: 2295091]
  40. Zhang M Q. Statistical features of human exons and their flanking regions. Hum. Mol. Genet. (1998);7:919–932. [PubMed: 9536098]
Copyright © 1999, Garland Science.
Bookshelf ID: NBK7587