The human genome is the term used to describe the total genetic information (DNA content) in human cells. It actually comprises two genomes: a complex nuclear genome, which accounts for 99.9995% of the total genetic information, and a simple mitochondrial genome, which accounts for the remaining 0.0005% (Figure 7.1).
Like other complex genomes, a sizeable component of the human genome is made up of noncoding DNA. In addition, the human genome is representative of mammalian genomes and other complex genomes in having a considerable amount of repetitive DNA, including both noncoding repetitive DNA and multiple copy genes and gene fragments.
During zygote formation, a sperm cell contributes its nuclear genome, but not its mitochondrial genome, to the egg cell. Consequently, the mitochondrial genome of the zygote is determined exclusively by that originally found in the unfertilized egg. The mitochondrial genome is therefore maternally inherited: males and females both inherit their mitochondria from their mother but males cannot transmit their mitochondria to subsequent generations. Thus mitochondrially encoded genes or DNA variants give the pedigree pattern shown in Figure 3.4. During mitotic cell division, the mitochondrial DNA molecules of the dividing cell segregate in a purely random way to the two daughter cells.
Each of the 13 polypeptides encoded by the mitochondrial genome is a subunit of one of the mitochondrial respiratory complexes, the multichain enzymes of oxidative phosphorylation which are engaged in the production of ATP. Note, however, that there are a total of about 100 different polypeptide subunits in the mitochondrial oxidative phosphorylation system, and so the vast majority are encoded by nuclear genes (see Box 7.1). All other mitochondrial proteins, including numerous enzymes, transport proteins, structural proteins etc., are encoded by the nuclear genome and are translated on cytoplasmic ribosomes before being imported into the mitochondria (see Figure 1.11).
The mitochondrial genetic code (which is used to decipher only 13 different mitochondrial mRNAs on mitochondrial ribosomes) differs slightly from the nuclear genetic code (which specifies perhaps about 70 000–80 000 different mRNAs on cytoplasmic ribosomes). The mitochondrial genome encodes all the ribosomal RNA and tRNA molecules it needs for synthesizing proteins but relies on nuclear-encoded genes to provide all other components (such as the protein components of mitochondrial ribosomes, aminoacyl-tRNA synthetases, etc.).
As there are only 22 different types of human mitochondrial tRNA, individual tRNA molecules need to be able to interpret several different codons. Eight of the 22 tRNA molecules have anticodons which are each able to recognize families of four codons differing only at the third base, and 14 recognize pairs of codons which are identical at the first two base positions and share either a purine or a pyrimidine at the third base. Between them, therefore, the 22 mitochondrial tRNA molecules can recognize a total of 60 codons [(8 × 4) + (14 × 2)]. The remaining four codons, UAG, UAA, AGA and AGG cannot be recognized by mitochondrial tRNA and act as stop codons (see Figure 1.22).
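The codon bookkeeping in the preceding paragraph can be checked with a few lines of Python. This is only an arithmetic sketch using the counts quoted above; the variable names are illustrative.

```python
# Codon bookkeeping for the human mitochondrial genetic code,
# using only the counts quoted in the text.
TOTAL_CODONS = 4 ** 3            # 64 possible triplets

four_codon_trnas = 8             # each reads a family of four codons
two_codon_trnas = 14             # each reads a pair of codons

recognized = four_codon_trnas * 4 + two_codon_trnas * 2
stop_codons = {"UAG", "UAA", "AGA", "AGG"}
unassigned = TOTAL_CODONS - recognized

print(recognized)    # 60
print(unassigned)    # 4
```

The four codons left over match the four stop codons listed in the text.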
| | Nuclear genome | Mitochondrial genome |
|---|---|---|
| Size | 3300 Mb | 16.6 kb |
| No. of different DNA molecules | 23 (in XX) or 24 (in XY) cells, all linear | One circular DNA molecule |
| Total no. of DNA molecules per cell | 23 in haploid cells; 46 in diploid cells | Several thousand |
| Associated protein | Several classes of histone and nonhistone protein | Largely free of protein |
| Number of genes | ~65 000–80 000 | 37 |
| Gene density | ~1/40 kb | 1/0.45 kb |
| Repetitive DNA | Large fraction, see Figure 7.1 | Very little |
| Transcription | The great bulk of genes are transcribed individually | Continuous transcription of multiple genes |
| Introns | Found in most genes | Absent |
| % of coding DNA | ~3% | ~93% |
| Codon usage | See Figure 1.22 | See Figure 1.22 |
| Recombination | At least once for each pair of homologs at meiosis | Not evident |
| Inheritance | Mendelian for sequences on X and autosomes; paternal for sequences on Y | Exclusively maternal |
Note that the overlapping genes share a common sense strand, the H strand. Coding sequence coordinates are as follows: ATPase subunit 8, 8366–8569; ATPase subunit 6, 8527–9204. The C terminus of the ATPase 6 subunit gene is defined by the post-transcriptional introduction of a UAA codon: following transcription the RNA is cleaved after position 9206 and polyadenylated, resulting in a UAA codon where the first two nucleotides are derived ultimately from the TA at positions 9205–9206 and the third nucleotide is the first A of the poly(A) tail. Other human genes are known to be overlapping but are often transcribed from opposite strands.
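The extent of the overlap follows directly from the coordinates quoted above. The sketch below treats the coordinates as inclusive intervals (an assumption about the numbering convention) and also checks whether the two coding sequences share a reading frame; the helper name `overlap` is illustrative.

```python
def overlap(a, b):
    """Return (start, end, length) of the overlap of two inclusive intervals, or None."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end, end - start + 1) if start <= end else None

# Coding sequence coordinates as given in the text (inclusive).
atp8 = (8366, 8569)
atp6 = (8527, 9204)

shared = overlap(atp8, atp6)
# Offset between the two start positions, modulo the codon length:
# a non-zero value means the genes are read in different frames.
frame_offset = (atp6[0] - atp8[0]) % 3

print(shared)         # (8527, 8569, 43)
print(frame_offset)   # 2
```

On these figures the two genes share 43 nucleotides and are read in different reading frames, so the same stretch of DNA specifies two unrelated amino acid sequences.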
| Chromosome | Amount of DNA (Mb) | Chromosome | Amount of DNA (Mb) |
|---|---|---|---|
| 1 | 263 | 13 | 114 |
| 2 | 255 | 14 | 109 |
| 3 | 214 | 15 | 106 |
| 4 | 203 | 16 | 98 |
| 5 | 194 | 17 | 92 |
| 6 | 183 | 18 | 85 |
| 7 | 171 | 19 | 67 |
| 8 | 155 | 20 | 72 |
| 9 | 145 | 21 | 50 |
| 10 | 144 | 22 | 56 |
| 11 | 144 | X | 164 |
| 12 | 143 | Y | 59 |
The DNA content is given for chromosomes prior to entering the S (DNA replication) phase of cell division (see Figure 2.2). Data abstracted from electronic reference 1.
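As a consistency check, the chromosome sizes in the table can be summed and compared with the figure of about 3300 Mb quoted for the nuclear genome. The sketch below simply transcribes the table values; the dictionary name is illustrative.

```python
# DNA content (Mb) per chromosome, transcribed from the table above.
chromosome_mb = {
    "1": 263, "2": 255, "3": 214, "4": 203, "5": 194, "6": 183,
    "7": 171, "8": 155, "9": 145, "10": 144, "11": 144, "12": 143,
    "13": 114, "14": 109, "15": 106, "16": 98, "17": 92, "18": 85,
    "19": 67, "20": 72, "21": 50, "22": 56, "X": 164, "Y": 59,
}

total_all_24 = sum(chromosome_mb.values())
autosomes = total_all_24 - chromosome_mb["X"] - chromosome_mb["Y"]

# Diploid content before S phase for XX and XY cells.
diploid_xx = 2 * autosomes + 2 * chromosome_mb["X"]
diploid_xy = 2 * autosomes + chromosome_mb["X"] + chromosome_mb["Y"]

print(total_all_24)   # 3286
print(diploid_xx)     # 6454
print(diploid_xy)     # 6349
```

The 24 different chromosomes sum to 3286 Mb, in line with the rounded value of 3300 Mb used elsewhere in this chapter.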
Since the entire nucleotide sequence of the human mitochondrial genome is known, its precise base composition is known. The sequence of the human nuclear genome is still being established (and is not expected to be finished before 2003), but current estimates suggest a figure of about 42% GC. However, the proportion of specific combinations of nucleotides can vary considerably. Like other vertebrate nuclear genomes, for example, the human nuclear genome has a conspicuous shortage of the dinucleotide CpG (that is, neighboring cytosine and guanine residues on the same DNA strand in the 5′ → 3′ direction). Taking the average figure of 42% GC, the individual base frequencies are C = G = 0.21, and so the expected frequency for the dinucleotide CpG is (0.21)² = 0.0441. However, the observed frequency of the CpG dinucleotide is approximately one-fifth of this (see Bird, 1986).
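The expected CpG frequency can be reproduced as follows. The first part uses only the figures in the text. The function `cpg_obs_exp` is an added illustration, not part of the original argument: it applies a commonly used observed/expected ratio (CpG count × length / (C count × G count)) to an arbitrary sequence.

```python
# Expected CpG frequency from base composition alone (figures from the text).
gc_fraction = 0.42
p_c = p_g = gc_fraction / 2          # C = G = 0.21
expected_cpg = p_c * p_g             # about 0.0441
observed_cpg = expected_cpg / 5      # "approximately one-fifth" of expected

def cpg_obs_exp(seq):
    """Observed/expected CpG ratio for one DNA strand (illustrative helper)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)

print(round(expected_cpg, 4))   # 0.0441
print(round(observed_cpg, 4))   # 0.0088
```

A ratio near 1 from `cpg_obs_exp` would indicate no CpG depletion, whereas bulk human DNA would be expected to give a value of roughly 0.2 on the figures above.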
In vertebrate DNA, cytosine residues occurring in CpG dinucleotides are targets for methylation at carbon atom 5. Only about 3% of the cytosines in human DNA are methylated, but most that are methylated are found in the CpG dinucleotide, producing 5-methylcytosine. Over evolutionarily long periods of time, 5-methylcytosine spontaneously deaminates to give thymine and so CpG is continuously being depleted and replaced by TpG (or CpA on the complementary strand). Despite this overall depletion, certain small regions of DNA noted for their transcriptional activity are characterized by the expected CpG density (CpG islands; see Box 8.5).
| Dark bands (G bands) | Pale bands (correspond to R bands - see Box 2.4) |
|---|---|
| Stain strongly with dyes that bind preferentially to AT-rich regions, such as Giemsa and Quinacrine | Stain weakly with Giemsa and Quinacrine |
| May be comparatively AT-rich | May be comparatively GC-rich |
| DNase insensitive | DNase sensitive |
| Condense early during the cell cycle but replicate late | Condense late during cell cycle but replicate early |
| Gene poor. Genes may be large because exons are often separated by very large introns | Gene rich. Genes are comparatively small because of close clustering of exons |
| LINE rich, but may be poor in Alu repeats | LINE poor, but may be enriched in Alu repeats |
The total number of genes in the human genome has been estimated to be about 70 000–80 000 (see Section 7.2.1). As all but 37 of these genes are located in the nuclear genome, this gives a rough estimate of about 3000 genes per chromosome. However, gene density can vary substantially between chromosomal regions and also between whole chromosomes. For example, heterochromatic regions are known to be very largely composed of repetitive noncoding DNA, and the centromeres and large regions of the Y chromosome, in particular, are notably devoid of genes.
The diagram represents FISH of a CpG island fraction from human DNA to human metaphase chromosomes (Craig and Bickmore, 1994). The Texas Red signal is derived from the CpG island probe, while the fluorescein isothiocyanate (FITC) green signal represents late replicating regions (which are mostly transcriptionally inactive), recognized by incorporation of bromodeoxyuridine (BrdU). Black regions represent overlap of the signals, indicating hybridization of the CpG island fraction to late replicating DNA. There is no counterstain, so that early replicating regions of the genome which do not have high densities of CpG islands are invisible, as are centromeres (where the anti-BrdU cannot get access). In addition to the rDNA clusters on the short arms of chromosomes 13–15, 21 and 22, high CpG island density is found on chromosomes 1, 9, 15–17, 19, 20 and 22. Adapted from Craig and Bickmore (1994) Nature Genetics, 7, pp. 376–381, with permission from Nature America Inc.
For the sake of clarity, only a 900 kb segment from the class III region of the 4 Mb HLA cluster is shown. Note that the great bulk of the genes in the HLA region have multiple exons (not shown) and this region is characterized by a very high density of exons in marked contrast to the dystrophin gene region, where there is a single gene with ~80 exons. The very high gene density in the HLA region is partly due to the presence of several overlapping genes (as indicated by internal/external boxes, e.g. at the 1400 kb position).
The number of genes in the human genome has been the subject of much speculation; while the small mitochondrial genome is known to have precisely 37 genes, the number in the nuclear genome remains unknown. Theoretical calculations based on the mutational load that a genome can tolerate and observed average mutation rates of human genes (~10⁻⁵ per gene per generation) suggest an upper limit of about 100 000. A variety of different approaches have been used to obtain more precise estimates of the total gene number. Three approaches have suggested a best estimate of about 65 000–80 000 genes:
Genomic sequencing. Extrapolation from sequencing of large chromosomal regions may suggest that there are about 70 000 genes (Fields et al., 1994). This is based on the observation that gene-rich regions have an average gene density of close to one per 20 kb, but gene-poor regions have a much lower density, say one-tenth of this density, and that the genome is split 50:50 into gene-rich and gene-poor regions.
CpG island number. Restriction enzyme analysis using the methylation-sensitive enzyme HpaII suggests that the total number of CpG islands (see Box 8.5) in the human genome is 45 000 (Antequera and Bird, 1993). Using an estimate that approximately 56% of genes are associated with CpG islands, these authors have suggested a total of about 80 000 human genes.
EST analysis. Large-scale random sequencing of cDNA clones provides so-called expressed sequence tags (ESTs, see Section 13.2.3). Comparison of known human EST sequences with a large set of different human genomic coding DNA sequences listed in sequence databases has suggested a figure of about 65 000 human genes (Fields et al., 1994).
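The CpG island estimate above is a simple proportion and can be reproduced directly. The sketch assumes one CpG island per island-associated gene, which is a simplification introduced here to make the arithmetic explicit.

```python
# Gene number estimated from CpG island count (figures from the text).
cpg_islands = 45_000
fraction_genes_with_island = 0.56

# If 56% of genes have an island, and each such gene has one island,
# then total genes = islands / 0.56.
estimated_genes = cpg_islands / fraction_genes_with_island

print(round(estimated_genes))   # 80357
```

The result rounds to the figure of about 80 000 genes quoted in the text.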
The above values suggest that the genes in the nuclear genome represent about 99.95% of the total number of cellular genes. If the average size of a human nuclear gene, including introns, is taken to be about 10–15 kb, this would mean that if the genes did not show overlaps, the total nuclear DNA occupied by genes would be about 70 000 × (10–15) kb or about 700–1050 Mb which corresponds roughly to about 25–35% of the genome. As the vast majority of nuclear genes encode polypeptides and the coding sequence required for an average size human polypeptide is taken to be about 500–600 codons, that is 1.5–1.8 kb, only about 3% of the nuclear genome (80–100 Mb of the 3300 Mb) would be expected to have a coding function.
| Class of RNA | Examples | Function |
|---|---|---|
| A. RNA classes involved in assisting general gene expression | | |
| Ribosomal RNA (rRNA) | 28S rRNA | Component of large cytoplasmic ribosomal subunit |
| | 5.8S rRNA | Component of large cytoplasmic ribosomal subunit |
| | 5S rRNA | Component of large cytoplasmic ribosomal subunit |
| | 18S rRNA | Component of small cytoplasmic ribosomal subunit |
| | 16S rRNA | Component of large mitochondrial ribosomal subunit |
| | 12S rRNA | Component of small mitochondrial ribosomal subunit |
| Transfer RNA (tRNA) | >40 different cytoplasmic tRNAs; 22 types of mitochondrial tRNA | Binding to codons in mitochondrial or nuclear-encoded mRNA |
| Small nuclear RNA (snRNA) | Many, including: | |
| | U1 snRNA | Component of major spliceosome |
| | U2 snRNA | Component of major spliceosome |
| | U4 snRNA | Component of major spliceosome |
| | U5 snRNA | Component of major and minor spliceosome |
| | U6 snRNA | Component of major spliceosome |
| | U4atac snRNA | Component of minor spliceosome |
| | U6atac snRNA | Component of minor spliceosome |
| | U11 snRNA | Component of minor spliceosome |
| | U12 snRNA | Component of minor spliceosome |
| | U7 snRNA | Histone mRNA transcriptional termination |
| Small nucleolar RNA (snoRNA) | About 200 types, including: | |
| | U3 snoRNA | rRNA processing |
| | U8 snoRNA | rRNA processing |
| | Various box C/D snoRNAs | Site-specific methylation of the 2′ OH group of rRNA |
| | Various box H/ACA snoRNAs | Site-specific rRNA modification by formation of pseudouridine |
| B. Other RNA classes | | |
| | 7SL RNA | Component of signal recognition particle for transporting proteins (see Section 1.5.4) |
| | 7SK RNA | Function uncertain |
| | Telomerase RNA | Component of telomerase (Section 2.3.4) |
| | XIST RNA | Regulatory gene imposing X-chromosome inactivation (Section 7.2.2) |
| | H19 RNA | Imprinted gene, function unclear (Section 7.2.2) |
| | SRA RNA | Encodes a steroid receptor coactivator |
There are multiple rRNA genes. In addition to the two mitochondrial rRNA molecules, the 28S, 18S and 5.8S cytoplasmic rRNAs are encoded by a single transcription unit (see Figure 8.1) which is tandemly repeated about 250 times, comprising five clusters of about 50 tandem repeats located on the short arms of human chromosomes 13, 14, 15, 21 and 22. In addition, the 5S cytoplasmic rRNA is encoded by several hundred gene copies in at least three clusters on the long arm of chromosome 1. The major rationale for the repetition of cytoplasmic rRNA genes is likely to be based on gene dosage: by having a comparatively large number of these genes, the cell can satisfy the huge demand for cytoplasmic ribosomes needed for protein synthesis.
The cytoplasmic tRNA genes belong to a very large dispersed gene family comprising more than 40 different subfamilies, each with several members, which encode the different species of cytoplasmic tRNA. In addition to multiple copies of genes specifying the individual cytoplasmic tRNA molecules, there are several defective gene copies (pseudogenes).
A heterogeneous collection of several hundred small nuclear RNA species is encoded by a large dispersed family of genes. Many of the snRNA species are uridine-rich and are named accordingly, e.g. U3 snRNA means the third uridine-rich small nuclear RNA to be classified. Individual species of RNA are associated with specific proteins to form ribonucleoprotein particles (RNPs). Some are known to be important in RNA splicing. A large subfamily of perhaps about 200 RNA species is found in the nucleolus, and these have been termed small nucleolar RNAs (snoRNAs). They have important roles in specific cleavage reactions and base-specific modifications during maturation of ribosomal RNA (see Smith and Steitz, 1997).
Additional RNA genes encode functionally diverse products, including the 7SL RNA component of the signal recognition particle which is required for protein export and the RNA component of telomerase, the enzyme required to synthesize DNA at the telomeres (Section 2.3.4). More recently, evidence has been obtained suggesting that certain RNA genes encode products that are important in gene regulation. An important example is the XIST gene. This gene is thought to be the major gene involved in initiating the process of X chromosome inactivation, being expressed exclusively from inactivated X chromosomes. No long open reading frames can be identified, and gene function is thought to be carried out through an RNA product by a mechanism that remains obscure (see Section 8.5.6).
In addition, several RNA genes have been found at a variety of chromosomal regions that are known to be imprinted (imprinted genes are normally expressed from a maternally inherited copy or a paternally inherited copy, but not both; see Section 8.5.4). For example, the H19 gene contains five exons and is expressed to give a polyadenylated cytoplasmic RNA which does not however associate with ribosomes. It shows a restricted pattern of expression during early development (fetal and neonatal liver, visceral endoderm and fetal gut) and is imprinted, since only the maternally inherited allele is expressed. Its functional significance is, however, unclear.
As seen in the previous section, some families of RNA genes are clustered. In the case of polypeptide-encoding gene families, some genes encoding identical or functionally related products are clustered, but often they are dispersed on several chromosomes.
Eleven clusters comprising a total of about 60 histone genes are distributed over seven human chromosomes. The two clusters on 6p contain the great majority of histone genes. Other clusters contain only one or two of the histone gene subtypes. Note that identical histones can be specified by genes on different chromosomes.
A large fraction of human genes are members of gene families where individual genes are closely related but not identical in sequence. In many such cases the genes are clustered and have arisen by tandem gene duplication, as in the case of the different members of each of the α-globin and β-globin gene clusters (see Section 7.3.4). Genes which encode clearly related products but which are located on different chromosomes are generally less related, as in the case of the α-globin and β-globin genes. However, in the case of the HOX homeobox gene family which consists of clusters of approximately 10 genes on each of four chromosomes, individual genes on different chromosomes may be more related to each other than they are to members of the same gene cluster (Section 14.2.2 and Figure 14.5).
| Genes which encode | Organization | Examples |
|---|---|---|
| The same product | Often clustered but may also be on different chromosomes | The two α-globin genes on 16p; genes encoding rDNA (Figure 8.1) and histones (Figure 7.6) |
| Tissue-specific protein isoforms or isozymes | Sometimes clustered; sometimes nonsyntenic | Clustering of pancreatic and salivary amylase genes (1p21); nonsynteny of α-actin genes expressed in skeletal (1p) and cardiac (15q) muscle |
| Isozymes specific for different subcellular compartments | Usually nonsyntenic | Cytoplasmic (c) and mitochondrial (m) isozymes for various enzymes e.g. aldehyde dehydrogenase: ALDH1, ALDH2 on 9q and 12q, respectively; aconitase: ACO1, ACO2 on 9p and 22q respectively; thymidine kinase: TK1, TK2 on 17q and 16 respectively. |
| Enzymes in the same metabolic pathway | Usually nonsyntenic | Genes encoding enzymes in steroidogenesis: steroid 11-hydroxylase - 8q; steroid 17-hydroxylase - 10; steroid 21-hydroxylase - 6p |
| Subunits of the same protein or enzyme | Usually nonsyntenic | Hemoglobin: α chain - 16p; β chain - 11p; collagens: α(1)I chain - 7q; α(2)I chain - 17q; ferritin: heavy chain - 11q; light chain - 22q; class I HLA: heavy chain - 6p; light chain - 15q; immunoglobulins: heavy chain - 14q; light chain - 2p or 22q |
| Ligand plus associated receptor | Usually nonsyntenic | Genes encoding insulin, interferons and their receptors: insulin, INS - 11p, but insulin receptor, INSR - 19p; interferon α, IFNA - 9p, and receptor, IFNAR - 21q; interferon β, IFNB - 9p, and receptor, IFNBR - 21q; interferon γ, IFNG - 12q, and receptor, IFNGR - 18 |
Exon content is shown as a percentage of the lengths of indicated genes. Note the generally inverse relationship between gene length and percentage of exon content. Asterisks emphasize that the lengths given for the indicated Ig heavy chain and light chain loci correspond to the germline organizations. Immunoglobulin and T-cell receptor genes have unique organizations, requiring cell-specific somatic rearrangements in order to be expressed in B or T lymphocytes respectively (see Section 8.6). Abbreviations: CFTR, cystic fibrosis transmembrane regulator; HPRT, hypoxanthine phosphoribosyl transferase; NF1, neurofibromatosis type 1.
As one would expect, there is a direct correlation between the size of a gene and the size of its product, but there are some striking anomalies. For example, apolipoprotein B has 4563 amino acids and is encoded by a 45 kb gene while the dystrophin gene is 2.4 Mb in length and encodes a product in muscle cells of 3685 amino acids.
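The contrast between these two genes is easier to appreciate when the coding requirement is expressed as a fraction of gene length. The sketch below uses only the figures quoted above and deliberately ignores stop codons and untranslated regions; the function name is illustrative.

```python
def coding_fraction(amino_acids, gene_length_bp):
    """Fraction of a gene needed to encode its polypeptide (3 bp per amino acid)."""
    return amino_acids * 3 / gene_length_bp

apob = coding_fraction(4563, 45_000)             # 45 kb gene
dystrophin = coding_fraction(3685, 2_400_000)    # 2.4 Mb gene

print(f"apolipoprotein B: {apob:.1%}")     # apolipoprotein B: 30.4%
print(f"dystrophin: {dystrophin:.2%}")     # dystrophin: 0.46%
```

Although apolipoprotein B is the larger polypeptide, its gene is more than 50 times shorter than the dystrophin gene, so intron content rather than product size accounts for most of the difference in gene length.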
| All 37 mitochondrial genes |
| Histone genes |
| Many genes encoding small RNA, e.g. most tRNA genes |
| Various neurotransmitter and hormone receptor genes, e.g. dopamine D1 and D5 receptors, 5-HT1B serotonin receptor, angiotensin II type 1 receptor, formyl peptide receptor, bradykinin B2 receptor, α2 adrenergic receptor |
| Autosomal processed copies of intron-containing X-linked genes |
| Typically have testis-specific expression patterns, e.g. PGK2 (phosphoglycerate kinase), GK (glycerol kinase), MYCL2 (myc family member), PDHA2 (pyruvate dehydrogenase E1a), GLUD2 (glutamate dehydrogenase) |
| Others, e.g. IFN-α, thrombomodulin, SRY and many SOX (SRY HMG box-related) genes, XIST, neurogenin genes |
| Gene product | Size of gene (kb) | Number of exons | Average size of exon (bp) | Average size of intron (bp) |
|---|---|---|---|---|
| tRNAtyr | 0.1 | 2 | 50 | 20 |
| Insulin | 1.4 | 3 | 155 | 480 |
| β-Globin | 1.6 | 3 | 150 | 490 |
| Class I HLA | 3.5 | 8 | 187 | 260 |
| Serum albumin | 18 | 14 | 137 | 1100 |
| Type VII collagen | 31 | 118 | 77 | 190 |
| Complement C3 | 41 | 29 | 122 | 900 |
| Phenylalanine hydroxylase | 90 | 26 | 96 | 3500 |
| Factor VIII | 186 | 26 | 375 | 7100 |
| CFTR (cystic fibrosis) | 250 | 27 | 227 | 9100 |
| Dystrophin | 2400 | 79 | 180 | 30 000 |
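The inverse relationship between gene length and exon content can be estimated from the table by multiplying the number of exons by the average exon size and dividing by the gene size. The sketch below does this for four of the rows; because the table values are rounded averages, the percentages are approximate.

```python
# (gene size in kb, number of exons, average exon size in bp), from the table.
genes = {
    "Insulin": (1.4, 3, 155),
    "Serum albumin": (18, 14, 137),
    "Factor VIII": (186, 26, 375),
    "Dystrophin": (2400, 79, 180),
}

def exon_content(size_kb, n_exons, avg_exon_bp):
    """Approximate fraction of the gene length occupied by exons."""
    return n_exons * avg_exon_bp / (size_kb * 1000)

fractions = {name: exon_content(*row) for name, row in genes.items()}
for name, value in fractions.items():
    print(f"{name}: {value:.1%}")
```

On these figures exon content falls from about one-third of the insulin gene to well under 1% of the dystrophin gene.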
The small nucleolar RNA (snoRNA) genes are unusual in that the majority of them are located within other genes, often ones which encode a ribosome-associated protein or a nucleolar protein. Possibly this arrangement has been maintained to permit coordinate production of protein and RNA components of the ribosome (Tycowski et al., 1993).
In addition to the snoRNA genes there are a few examples of other genes being located within the introns of larger genes, and in some cases the internal genes as well as the host genes are known to encode polypeptides. Three illustrative examples are:
Note that the three internal genes are transcribed from the opposing strand to that used for transcription of the NF1 gene. Genes are: OGMP, oligodendrocyte myelin glycoprotein; EVI2A and EVI2B, human homologs of murine genes thought to be involved in leukemogenesis, and located at ecotropic viral integration sites.
The factor VIII gene. Intron 22 of the blood clotting factor VIII gene (F8C) contains a CpG island from which two internal genes, F8A and F8B are transcribed in opposite directions (Levinson et al., 1992). F8A is transcribed from the opposite strand to that used by the factor VIII gene. F8B is transcribed in the same direction as the factor VIII gene to give a short mRNA containing a new exon spliced on to exons 23–26 of the factor VIII gene (see Figure 9.20).
The entire sequence of the 180 kb human retinoblastoma susceptibility gene, RB1, has been determined, enabling identification within the gene of many examples of abundant DNA repeats. Note that the 72 kb intron 17 contains a G protein-coupled receptor gene, U16, which is actively transcribed in the opposite direction to the RB1 gene. The top line (5′→ 3′) of each pair shows the repeat elements orientated in the sense direction; the bottom line (3′→ 5′) shows them in the antisense orientation. There are 46 Alu repeats corresponding to the expected frequency of one per 4 kb. There is a particularly high frequency of LINE-1 elements in this gene (one per 11 kb), but only two of the elements approach the full 6.1 kb length. The (A)n /(T)n sequences (n = 12 or greater) which are indicated are only those not found within the sequences of the interspersed repeats. No examples were found for (C)n /(G)n for n = 12 or greater. The frequencies of dinucleotide repeats with n = 6 or greater were as follows: (CG)n/(GC)n 0; (AT)n 1; (CT)n /(AG)n 2; (CA)n/(TG)n 4 (three of which are polymorphic). Adapted from Toguchida et al. (1993) Genomics, 17, 535–543, with permission from Academic Press Inc.
DNA sequences in the nuclear diploid genome usually exist as two allelic copies (on paternal and maternal homologous chromosomes). In addition to this degree of repetition, about 40% of the human nuclear genome in both haploid and diploid cells is composed of sets of closely related nonallelic DNA sequences (DNA sequence families or repetitive DNA). Within the considerable variety of different repetitive DNA sequences are DNA sequence families whose individual members include functional genes (multigene families), and also many examples of nongenic repetitive DNA sequence families.
Reassociation kinetics (Section 5.2.2) first suggested that complex genomes, such as the human genome, comprise different sequence classes on the basis of the copy number. Typically, human DNA is randomly sheared (e.g. by sonication) to give fragments whose average size is about 500 bp and the sheared DNA is denatured by heating to separate the complementary strands of each fragment. Thereafter the DNA is cooled to a temperature of about 20–30°C below the melting temperature, Tm (which marks the mid-point of the transition between the double-stranded and single-stranded states of DNA heated in solution). The cooled DNA renatures, but the extent of reassociation of a given sequence depends not only on time (t) but also on the initial concentration (C₀) of that sequence, that is, on the product C₀t (the Cot value).
The above type of analysis has suggested that the human genome consists of three broad sequence components:
Single copy, or at least very low copy number, DNA (60%) - reassociates very slowly. A single strand from a single copy sequence will require some considerable time to find a complementary partner strand, given that the vast majority of DNA fragments are unrelated to it.
Moderately repetitive (30%) - intermediate speed of reassociation.
Highly repetitive (10%) - reassociates very rapidly. There are numerous copies of the same sequence and the chances of quickly finding complementary partners within the mass of different fragments are high.
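The behavior of the three components can be illustrated with the idealized second-order reassociation equation, C/C₀ = 1/(1 + k·C₀t), which is the standard textbook model rather than anything derived in this section. The genome fractions are those listed above; the rate constants `k` are hypothetical values chosen only to separate the three classes.

```python
def fraction_single_stranded(cot, k):
    """Ideal second-order reassociation: C/C0 = 1 / (1 + k * C0t)."""
    return 1.0 / (1.0 + k * cot)

def cot_half(k):
    """C0t value at which half of a component has reassociated."""
    return 1.0 / k

# (name, fraction of genome from the text, hypothetical rate constant k)
components = [
    ("highly repetitive", 0.10, 1e2),
    ("moderately repetitive", 0.30, 1e0),
    ("single copy", 0.60, 1e-3),
]

def remaining(cot):
    """Fraction of total DNA still single-stranded at a given C0t."""
    return sum(f * fraction_single_stranded(cot, k) for _, f, k in components)

for cot in (0.01, 1, 100, 10_000):
    print(cot, round(remaining(cot), 3))
```

Because the highly repetitive class has the largest rate constant it reaches its half-reassociation point at the lowest C₀t, while the single copy class dominates what remains single-stranded at high C₀t.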
The operational definition of a DNA sequence family is the comparatively high level of DNA sequence similarity (sequence homology) between whole family members, or components of the family members. Members of a DNA sequence family can be identified and actively sought by a variety of methods:
DNA sequencing - Allows direct calculations on the degree of sequence relatedness of family members.
DNA hybridization and cloning. A probe from a gene family member typically gives a complex band pattern when hybridized against a Southern blot of genomic DNAs. Individual family members can then be cloned by screening genomic DNA libraries.
PCR cloning - Permits identification of novel family members by designing degenerate primers corresponding to highly conserved nucleotide or amino acid sequences.
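The idea of a degenerate primer can be made concrete by counting how many distinct oligonucleotides a primer written in IUPAC ambiguity codes represents. In the sketch below the primer `GAYGARGCNGAY` is a hypothetical example, back-translated from the peptide Asp-Glu-Ala-Asp using the standard genetic code; the function names are illustrative.

```python
from itertools import product

# IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def degeneracy(primer):
    """Number of distinct sequences represented by a degenerate primer."""
    n = 1
    for base in primer.upper():
        n *= len(IUPAC[base])
    return n

def expand(primer):
    """List every concrete sequence in the degenerate primer pool."""
    pools = [IUPAC[base] for base in primer.upper()]
    return ["".join(p) for p in product(*pools)]

primer = "GAYGARGCNGAY"      # hypothetical primer for Asp-Glu-Ala-Asp
print(degeneracy(primer))    # 32
```

Keeping the degeneracy low is one reason such primers are usually designed against highly conserved motifs containing amino acids with few synonymous codons.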
When two members of a repetitive DNA sequence family exhibit a high degree of sequence homology, a recent common evolutionary origin is indicated. As detailed in the following sections, DNA sequence families show considerable variation in the number of different repeat unit members in the family, the size of the repeating unit, chromosomal location, mode of repetition and capacity for expression.
A large percentage of actively expressed human genes are members of families of DNA sequences which show a high degree of sequence similarity. However, the extent of sequence sharing and the organization of family members can vary widely. Some family members may be nonfunctional (pseudogenes and gene fragments - see below) and rapidly accumulate sequence differences, leading to marked sequence divergence.
Classical gene families are distinguished by members which exhibit a high degree of sequence homology over most of the gene length or, at least, the coding DNA component, a feature which automatically identifies such sequences as being closely related evolutionarily as well as functionally. In some cases, such as the individual rRNA gene families and individual histone gene families, there is an extremely high degree of sequence similarity between family members. Many other large gene families show a high degree of sequence similarity between family members.
| Gene family | Number of genes | Sequence motif/domain |
|---|---|---|
| Homeobox genes | 30 HOX genes (see Figure 14.5) plus ~60 orphan homeobox genes | Homeobox specifies a homeodomain of ~60 amino acids. A wide variety of different subclasses have been defined |
| PAX genes | 9 | Paired box encodes a paired domain of ~130 amino acids; PAX genes often have in addition a type of homeodomain known as a paired-type homeodomain |
| SOX genes | ~15 | SRY-like HMG box which encodes a domain of ~70 amino acids |
| TBX genes | ~15 | T-Box which encodes a domain of ~170 amino acids |
| Forkhead domain genes | ~15 | The forkhead domain is about 110 amino acids long |
| POU domain genes | ~15 | The POU domain is ~150 amino acids long |
(A) Motifs in the DEAD box family. This gene family encodes products implicated in cellular processes involving alteration of RNA secondary structure, such as translation initiation and splicing. Eight very highly conserved amino acid motifs are evident, including the DEAD box (Asp-Glu-Ala-Asp). Numbers refer to frequently found size ranges for intervening amino acid sequences (see Schmid and Linder, 1992). X = any amino acid. See inside front cover for the one-letter amino acid code. (B) WD repeat family. This gene family encodes products that are involved in a variety of regulatory functions, such as regulation of cell division, transcription, transmembrane signaling, mRNA modification, etc. The gene products are characterized by between four and eight tandem repeats containing a core sequence of fixed length (from 27–45 amino acids, terminating in the dipeptide WD, i.e. Trp-Asp) preceded by a unit whose length can vary between repeats (see Neer et al., 1994).
the DEAD box gene family - contains several different genes whose products appear to function as RNA helicases and are characterized by the presence of eight short amino acid sequence motifs, including the DEAD box (the sequence Asp-Glu-Ala-Asp as represented by the one-letter amino acid code);
the WD repeat gene family - gene products have a regulatory function, but there is considerable diversity in function. Typically characterized by tandem repeats with a central core of fixed length and containing small conserved amino acid motifs, including the WD (tryptophan-aspartate) sequence;
the ankyrin repeat gene family - wide functional diversity but often involved in protein-protein interactions. Characterized by tandem repeats of a 33 amino acid sequence in which select amino acids are conserved at only a few positions;
the LIM domain gene family - encodes a characteristic cysteine-rich 56 amino acid domain most likely involved in protein-protein interactions.
Most members of the Ig superfamily are dimers consisting of extracellular variable domains (V) located at the N-terminal ends and constant (C) domains, located at the C-terminal (membrane-proximal) ends. The light chain of class I HLA antigens has a single constant domain and does not span the membrane. It associates with the transmembrane heavy chain which has two variable and one constant domain, giving an overall structure similar to that of the class II HLA antigens.
| Family | Copy no. | Organization | Chromosome location |
|---|---|---|---|
| A. Clustered gene families | |||
| Single cluster gene families | |||
| Growth hormone gene cluster | 5 | Clustered within 67 kb; one conventional pseudogene | 17q22-24 |
| α-Globin gene cluster | 7 | Clustered over ~50 kb (see Figure 8.23) | 16p |
| Class I HLA heavy chain genes | ~20 | Clustered over 2 Mb (see Figure 8.4) | 6p21.3 |
| Multiple cluster gene families | |||
| HOX genes | 38 | Organized in four clusters (Figure 9.5) | 2p, 7, 12, 17 |
| Histone gene family | 61 | Modest-sized clusters at a few locations; two large clusters on chromosome 6 | (see Figure 7.6) |
| Olfactory receptor gene family | ~1000 | About 25 large clusters scattered throughout the genome | Many |
| B. Interspersed gene families | |||
| Aldolase | 5 | Three functional genes and two pseudogenes on five different chromosomes | Many |
| PAX | 9 | All nine genes are expressed | Many |
| NF1 (Neurofibromatosis type I) | >12 | One functional gene at 17q11.2; others are defective non-processed DNA copies | Mostly pericentromeric |
| Ferritin heavy chain | >15 | One functional gene known on chromosome 11; most are processed pseudogenes | Many |
| Glyceraldehyde 3-phosphate dehydrogenase | >18 | One functional gene on 12p; many processed pseudogenes | Many |
| Actin | >20 | Four functional genes and many processed pseudogenes | Many |
Genes in an individual gene cluster are thought to arise by tandem gene duplication events. Different organizations are evident:
Tandem gene organization. This arrangement is exemplified by the organization of genes in individual rRNA gene clusters (see Section 7.2.2). The genes are highly related to each other in terms of both sequence and function, although certain family members may be nonfunctional (pseudogenes, see Section 7.3.5 below).
Genes in a cluster are closely related in sequence and are often transcribed from the same DNA strand.
Some gene families are organized in clusters distributed over two or more chromosomal locations. Again different organizations are evident, with some families showing very high similarity between genes on different clusters, while others are marked by comparatively low sequence homology between genes on different clusters.
High cluster similarity. Here sequence homology between gene family members on different chromosomes may be very high as in the case of the different rRNA gene clusters on the short arms of chromosomes 13, 14, 15, 21 and 22. Another useful example is the olfactory receptor gene family which encodes a diverse repertoire of receptors which allow us to discriminate thousands of different odors. The family consists of perhaps 1000 genes, making it the largest in the human genome, and is organized in large clusters at more than 25 different chromosomal locations (Rouquier et al., 1998).
Low cluster similarity. Often sequence homology is greater within a cluster than between clusters. For example, the globin gene family includes genes at three locations: the α-globin cluster on 16p, the β-globin cluster on 11p and the myoglobin gene at 22q. Although all globin genes are clearly related, those within a cluster are more related to each other than they are to genes in one of the other clusters (see Figure 14.16).
Some gene families show no obvious physical relationship between family members, which are dispersed as solitary genes at two or more different chromosomal locations. The family members may show considerable sequence divergence unless their dispersion has been a relatively recent event, or there has been considerable selection pressure to maintain sequence conservation. The following examples illustrate some of the many different types of organization.
Families originating from ancient genome duplication or gene duplication events. This class of interspersed gene families typically contains only a few members, as in the case of the PAX gene family, and appears to have evolved by gene duplication and/or genome duplication events over a long period of evolutionary time. Typically, all or most of the genes are functional, and may individually encode highly related products.
Families originating largely by retrotransposition events. Some gene families have expanded comparatively recently in evolutionary terms by a process whereby RNA transcribed from one or a small number of functional genes is converted by cellular reverse transcriptase into natural cDNA which then becomes integrated elsewhere in the chromosomes. Most such copies are nonfunctional (see next section).
Families of RNA genes or polypeptide-encoding genes are frequently characterized by defective copies of essentially all of the gene or its coding sequence (pseudogenes), or a portion of it, in some cases a single exon (gene fragments). A large variety of different classes are found (see Box 7.3). The following examples are meant to be illustrative of the types of defective gene copies found in different types of gene family.
(A) Structure of a class I HLA heavy chain mRNA. The full-length mRNA contains a polypeptide-encoding sequence. Blocks represent different domains as follows: L, leader sequence; α1, α2, α3 extracellular domains; TM, transmembrane sequence; CY, cytoplasmic tail and a 3′-untranslated sequence (3′-UTR). The three extracellular domains α1–α3 are each encoded essentially by a single exon. The very small 5′-UTR is not shown. (B) The class I HLA heavy chain gene cluster. The cluster is located at 6p21.3 and comprises about 20 genes. They include six expressed genes (black), four full-length nonprocessed pseudogenes (filled blue blocks) and a variety of nonfunctional truncated genes or gene fragments (open blue blocks). Some of the latter are truncated at the 5′ end (e.g. the one next to HLA-B), some are truncated at the 3′ end (e.g. the one next to HLA-F) and some contain single exons (e.g. the one next to HLA-E).
For several polypeptide-encoding genes, defective copies containing intronic sequence have been found elsewhere in the genome. Two illustrative examples are:
The NF1 (neurofibromatosis type I) gene. This gene is located close to the chromosome 17 centromere at 17q11.2 and has at least 11 nonprocessed pseudogene or gene fragment copies, nine of which are located at pericentromeric regions on seven different chromosomes (Regnier et al., 1997). Characterization of the chromosome 15-specific NF1 gene copies revealed that they contain copies of a segment of the NF1 gene spanning exons 8 and 27 and including the intron sequences. This type of gene duplication may be a result of what has been called pericentromeric plasticity.
The PKD1 (adult polycystic kidney disease) gene. This gene consists of 46 exons spanning 50 kb at 16p13.3. A truncated 5′ gene copy comprising approximately 70% of the gene and encompassing exons 1 to 34 and all the intervening introns has been faithfully replicated at least three times and inserted into a more proximal location at 16p13.1 (the European Polycystic Kidney Disease Consortium, 1994).
An apparently related case concerns the CFTR (cystic fibrosis) gene at 7q31.2, which has 24 exons spanning 250 kb. Copies of a ~30 kb sequence encompassing exon 9 and flanking intron sequences are distributed at a large number of chromosomal locations. However, in this case a likely explanation, proposed by Rozmahel et al. (1997), is a retrotransposition model. Transcription from the sense strand of the CFTR gene was envisaged to generate an antisense transcript encompassing the exon 9 sequence and flanking intron regions. The transcript could then have been converted to cDNA by cellular reverse transcriptase prior to integration at several sites in the genome.
Although processed pseudogenes are typically not expressed, several examples are known where natural cDNA copies of a gene appear to be expressed, often in a testis-specific fashion. Often the functional intron-containing gene locus in such families is located on the X chromosome (see Section 14.3.3).
Although the size of some interspersed polypeptide-encoding gene families testifies to the success of retrotransposition as a mechanism for generating processed gene copies, the really successful (in terms of high copy number) retrotranspositions have been performed from RNA polymerase III transcripts. For example, the Alu repeat family (see Section 7.4.5) is considered to have arisen as processed pseudogenes copied from the 7SL RNA gene. Genes such as this which are transcribed by RNA polymerase III often contain an internal promoter which facilitates the expression of newly transposed copies (see Section 8.2.1 and Figure 8.3).
In addition to the types of gene sequence duplication discussed in the previous sections, many human genes, like other eukaryotic genes, contain intragenic repeated sequences. Repeated sequences in coding DNA may involve different forms of repeated structure, including comparatively large sequences encoding protein domains or small sequence motifs. Different modes of repetition can be seen but tandem repetition is rather common.
Tandem repetition of microsatellite sequences (short sequence motifs - see Section 7.4.3) is common and may simply reflect statistically expected frequencies for certain base compositions. For example, certain genes which have a very high % GC often have long runs of single cytosines or guanines in the coding sequence. Such sequences are comparatively unstable and are prone to single nucleotide deletion or insertion events which are often pathogenic. A variety of genes are known to have long tracts of certain trinucleotides in their coding sequences, and runs of the CAG trinucleotide in the coding sequences of several genes have been found to be capable of expansion, causing pathogenesis (see Box 16.7).
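To make the idea of a tandem run concrete, the sketch below counts the longest uninterrupted stretch of a repeat unit (a trinucleotide such as CAG, or a single base such as C) in a DNA string. The function name `longest_tandem_run` and the toy sequence are assumptions made for illustration; the code ignores reading frame and says nothing about which run lengths are pathogenic.

```python
import re

# Hypothetical helper for illustration; it does not take reading frame into account.
def longest_tandem_run(dna, unit="CAG"):
    """Return the largest number of back-to-back copies of `unit` found in `dna`."""
    pattern = "(?:%s)+" % re.escape(unit.upper())
    # Each match is one uninterrupted run; convert its length into a copy count.
    runs = [len(m.group(0)) // len(unit) for m in re.finditer(pattern, dna.upper())]
    return max(runs, default=0)

# Invented coding-like sequence containing runs of five and of two CAG units.
toy_coding = "ATGGCC" + "CAG" * 5 + "CCGCAGCAGTAA"
print(longest_tandem_run(toy_coding))        # longest CAG run
print(longest_tandem_run("AACCCCCGT", "C"))  # longest run of single cytosines
```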
| Gene product | Size of encoded repeat in amino acids | No. of copies | Nucleotide sequence homology between copies |
|---|---|---|---|
| Ubiquitin (UbB and UbC genes) | 76 | 3 (UbB); 9 (UbC) | High homology |
| Involucrin | 10 | 59 | High homology for central 39 repeats |
| Apolipoprotein (a) | 114 | 37 | High homology; 24 of the repeats are identical in sequence |
| Plasminogen | ~75–80 | 5 | Low homology but conserved protein domains |
| Collagen | 18 | 57 | Low homology but conserved amino acid motifs based on (Gly-X-Y)6 |
| Serum albumin | 195 | 3 | Low homology |
| Proline-rich protein genes | 16–21 | 5 | Low homology |
| Tropomyosin α-chain | 42 | 7 | Low homology |
| Immunoglobulin ϵ-chain, C region | 108 | 4 | Low homology |
| Dystrophin | 109 | 24 | Low homology |
The human nuclear genome, like that of other complex eukaryotes, contains a large amount of highly repeated DNA sequence families which are largely transcriptionally inactive. A wide variety of different repeats are known and dedicated repeat sequence databases have been established (see electronic reference 3). Like multigene families, noncoding repetitive DNA shows two major types of organization: tandemly repeated and interspersed.
Such families are defined by blocks (or arrays) of tandemly repeated DNA sequences. Individual arrays can occur at a few or many different chromosomal locations. Depending on the average size of the arrays of repeat units, highly repetitive noncoding DNA belonging to this class can be grouped into three subclasses: satellite, minisatellite and microsatellite DNA.
| Class | Size of repeat | Major chromosomal location(s) |
|---|---|---|
| ‘Megasatellite’ DNA (blocks of hundreds of kb in some cases) | several kb | Various locations on selected chromosomes |
| RS447 | 4.7 kb | ~50–70 copies on 4p15 plus several copies on distal 8p |
| untitled | 2.5 kb | ~400 copies on 4q31 and 19q13 |
| untitled | 3.0 kb | ~50 copies on the X chromosome |
| Satellite DNA (blocks often from 100 kb to several Mb in length) | 5–171 bp | Especially at centromeres |
| α (alphoid DNA) | 171 bp | Centromeric heterochromatin of all chromosomes |
| β (Sau3A family) | 68 bp | Centromeric heterochromatin of 1, 9, 13, 14, 15, 21, 22 and Y |
| Satellite 1 (AT-rich) | 25–48 bp | Centromeric heterochromatin of most chromosomes and other heterochromatic regions |
| Satellites 2 and 3 | 5 bp | Most, possibly all, chromosomes |
| Minisatellite DNA (blocks often within the 0.1–20 kb range) | 6–64 bp | At or close to telomeres of all chromosomes |
| telomeric family | 6 bp | All telomeres |
| hypervariable family | 9–64 bp | All chromosomes, often near telomeres |
| Microsatellite DNA (blocks often less than 150 bp) | 1–4 bp | Dispersed throughout all chromosomes |
The individual repeat units are not clustered, but are dispersed at numerous locations in the genome, and together account for perhaps one third of the DNA in the human genome (Smit, 1996). Most of the DNA families belonging to this class contain some members that are capable of undergoing retrotransposition (i.e. transposition through an RNA intermediate).
(A) General overview. Note the restricted locations of certain types of tandemly repeated DNA, such as satellite DNAs which are found in heterochromatin (notably at the centromeres) and minisatellite DNAs which are often found at telomeres or close to them. (B) Satellite DNA organization at centromeres. The locations of different classes of satellite DNA are shown for chromosome 9 and for chromosome 21 (one of the five examples of an autosomal acrocentric chromosome). The illustration in this case is redrawn from Tyler-Smith and Willard (1993) Curr. Opin. Genet. Dev., 3, 390–397, with permission from Current Biology Ltd.
When DNA is isolated from human cells by conventional methods, it is subject to mechanical shearing. Fragments are generated both from the bulk DNA (with a base composition of ~42% GC) and from the satellite DNA regions, which may have a similar or different base composition. If the base composition is significantly different, satellite DNA sequences can be separated from the bulk DNA by buoyant density gradient centrifugation. Following centrifugation, they appear as minor (or satellite) bands of different buoyant density from a major band which represents bulk DNA. Typically, human DNA is complexed with Ag+ ions and then fractionated in buoyant density gradients containing cesium sulfate, whereupon three satellite bands are identified at different densities: satellite 1 – 1.687 g cm⁻³; satellite 2 – 1.693 g cm⁻³; satellite 3 – 1.697 g cm⁻³.
Secondary amplification is envisaged in this case to involve a 15 bp repeat unit comprising three diverged 5 bp repeats.
Other types of satellite DNA sequence cannot easily be resolved by density gradient centrifugation. They were first identified by digestion of genomic DNA with a restriction endonuclease which typically has a single recognition site in the basic repeat unit. In addition to the basic repeat unit size (monomer), such enzymes will produce a characteristic pattern of multimers of the unit length because of occasional random loss of the restriction site in some of the repeats (Singer, 1982a). Alpha satellite (or alphoid DNA) constitutes the bulk of the centromeric heterochromatin and accounts for about 3–5% of the DNA of each chromosome. It is characterized by tandem repeats of a basic mean length of 171 bp, although higher order units are also seen. The sequence divergence between individual members of the alphoid DNA family can be so high that it is possible to isolate chromosome-specific subfamilies for each of the human chromosomes (Choo et al., 1991).
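The ladder of multimers produced by such a digest can be illustrated with a small sketch. It assumes, purely for illustration, that every intact repeat unit carries the restriction site at the same position, so that two neighbouring cuts separated by k units release a fragment of k times the unit length. The function name `digest_tandem_array` and the example array are hypothetical.

```python
# Hypothetical model: one potential cut site per repeat unit, at a fixed position.
def digest_tandem_array(site_present, unit_bp=171):
    """Return fragment sizes (bp) released between successive cut sites.

    site_present[i] is True when repeat unit i still carries the restriction site.
    Fragments running off either end of the array are ignored.
    """
    cut_units = [i for i, intact in enumerate(site_present) if intact]
    # Neighbouring cuts that are k units apart release a fragment of k * unit_bp.
    return [(b - a) * unit_bp for a, b in zip(cut_units, cut_units[1:])]

# Invented array of eight repeats in which units 2, 5 and 6 have lost the site.
array = [True, True, False, True, True, False, False, True]
print(digest_tandem_array(array))
```

With the default unit length of 171 bp, the fragments come out as monomers, dimers and trimers of the basic repeat, which is the pattern the text describes.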
Hypervariable minisatellite DNA sequences are highly polymorphic and are organized in over 1000 arrays (0.1–20 kb long) of short tandem repeats (Jeffreys, 1987). The repeat units in different hypervariable arrays vary considerably in size, but share a common core sequence, GGGCAGGAXG (where X = any nucleotide), which is similar in size and in G content to the chi sequence, a signal for generalized recombination in E. coli. While many of the arrays are found near the telomeres, several hypervariable minisatellite DNA sequences occur at other chromosomal locations. The great majority of hypervariable minisatellite DNA sequences are not transcribed, except for elements occurring within noncoding intragenic sequences. Some, however, are expressed. For example, the MUC1 locus on 1q is an expressed hypervariable minisatellite locus. It encodes a glycoprotein found in several epithelial tissues and body fluids which is highly polymorphic as a result of extensive variation in the number of minisatellite-encoded repeats (Swallow et al., 1987).
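The core sequence quoted above can be written as a simple pattern in which X matches any of the four nucleotides. The sketch below searches one strand of a DNA string for exact matches; the function name `find_core_matches` and the toy sequence are hypothetical, and requiring an exact match is a simplifying assumption, since real repeat units may only resemble the core.

```python
import re

# The core GGGCAGGAXG from the text, with X expanded to any nucleotide.
CORE = re.compile(r"GGGCAGGA[ACGT]G")

# Hypothetical helper: searches the given strand only, exact matches only.
def find_core_matches(dna):
    """Return (start, matched_text) for each exact core-sequence match."""
    return [(m.start(), m.group(0)) for m in CORE.finditer(dna.upper())]

toy = "TTGGGCAGGATGACCGGGCAGGACGTT"  # invented sequence with two core copies
print(find_core_matches(toy))
```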
The significance of hypervariable minisatellite DNA is not clear, although it has been reported to be a ‘hotspot’ for homologous recombination in human cells (Wahls et al., 1990). Nevertheless it has found many applications. Various individual loci have been characterized and used as genetic markers, although the preferential localization in subtelomeric regions has limited their use for genome-wide linkage studies. A major application has been in DNA fingerprinting, in which a single DNA probe which contains the common core sequence can hybridize simultaneously to multiple minisatellite DNA loci on all chromosomes, resulting in a complex individual-specific hybridization pattern (see Section 17.4.1).
Another major family of minisatellite DNA sequences is found at the termini of chromosomes, the telomeres. The principal constituent of telomeric DNA is 10–15 kb of tandem hexanucleotide repeat units, especially TTAGGG, which are added by a specialized enzyme, telomerase (see Figure 2.9). By acting as buffers to protect the ends of the chromosomes from degradation and loss and by providing a mechanism for replicating the ends of the linear DNA of chromosomes, these simple repeats are directly responsible for telomere function (see Figure 2.9 and Section 2.3.4).
The significance of microsatellite DNA is not known. Alternating purine-pyrimidine repeats, such as tandem repeats of the dinucleotide pair CA/TG, are capable of adopting an altered DNA conformation, Z-DNA, in vitro, but there is little evidence that they do so in the cell. Although microsatellite DNA has generally been identified in intergenic DNA or within the introns of genes, a few examples have been recorded within the coding sequences of genes. Tandem repeats of three nucleotides in coding DNA may be sites that are prone to pathogenic expansions (see Section 9.5.2 and Box 16.7).
Two major classes of mammalian interspersed repetitive DNA families have been discerned on the basis of repeat unit length (Singer, 1982b): SINEs and LINEs.
| Classᵃ | Familyᵃ | Size of repeat unit | No. of copies | Percentage of genome |
|---|---|---|---|---|
| SINE | Alu family | Full length ~0.3 kb | ~1 000 000 | ~7.0% |
| | MIR families | Average size ~0.13 kb | ~400 000 | ~1.7% |
| LINE | LINE-1 (Kpn) family | Full length is 6.1 kb, but average size ~0.8 kb | ~200 000–500 000 | ~5–12% |
| | LINE-2 family | Average size ~0.25 kb | ~270 000 | ~2.1% |
| LTR | HERV | Average size ~1.3 kb | ~50 000 | ~1.3% |
| | Others | Average size ~0.5 kb | ~200 000 | ~3.3% |
| DNA transposon | Mariner & other families | Varies; perhaps average size = 0.25 kb | ~200 000 | ~1.6% |
| Others | Various | Perhaps average size of about 0.4 kb | ~60 000 | ~0.8% |

ᵃ See text.
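The percentage column can be approximated from the other two columns as copy number multiplied by average repeat size, divided by genome size. The sketch below does this arithmetic for three of the families. The genome size of about 3000 Mb is an assumption that the table itself does not state, and the function name `genome_percentage` is hypothetical, so the results should be expected to agree with the tabulated values only roughly.

```python
# Assumption: a haploid genome of roughly 3000 Mb (3 000 000 kb); not given in the table.
ASSUMED_GENOME_KB = 3_000_000

def genome_percentage(copies, mean_size_kb, genome_kb=ASSUMED_GENOME_KB):
    """Approximate share of the genome (in %) occupied by a repeat family."""
    return 100.0 * copies * mean_size_kb / genome_kb

# (copy number, average size in kb) taken from the table above.
families = {
    "MIR": (400_000, 0.13),
    "LINE-2": (270_000, 0.25),
    "DNA transposon": (200_000, 0.25),
}
for name, (copies, size_kb) in families.items():
    print(f"{name}: ~{genome_percentage(copies, size_kb):.1f}%")
```

The estimate depends on using the average size rather than the full length: for families such as Alu and LINE-1, where many copies are truncated, multiplying the copy number by the full-length size would overestimate the fraction.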
Human LINEs are exemplified by the LINE-1 or L1 element (also called the Kpn repeat because of early attempts at characterizing this family using the restriction nuclease KpnI; see Section 7.4.6). The LINE-1 element is also found in other mammals such as the mouse.
The consensus standard Alu dimer is shown with two similar repeats terminating in an (A)n/(T)n-like sequence. They have different sizes because of the insertion of a 32 bp element within the larger repeat. Alu monomers also exist in the human genome, as do various truncated copies of both monomers and dimers. The consensus full-length LINE-1 element is 6.1 kb long but most LINE-1 elements are truncated and the average size is very much smaller. ORF1, ORF2, open reading frames 1 and 2.
The two repeated units of the Alu sequence show a striking resemblance to the sequence for 7SL RNA, a component of the signal recognition particle, which facilitates transport of proteins across the membrane of the endoplasmic reticulum. Because of this and the observation of the A-rich regions and the flanking direct repeats, it has been widely assumed that the Alu sequence has been propagated by retrotransposition from 7SL RNA, and therefore represents a processed 7SL RNA pseudogene (Figure 14.25). Certainly, transposition by Alu sequences is known to occur [presumably as a result of trans-acting cellular reverse transcriptases such as those encoded by LINE-1 (Kpn) elements (see below)], and may occasionally cause clinical problems (Section 9.5.6). Possibly, the very high copy number achieved by this processed pseudogene is related to the presence of a promoter sequence in the 7SL RNA sequence (the 7SL RNA gene, like tRNA genes, is transcribed by RNA polymerase III from an internal promoter, see Section 8.2.1). By contrast, processed pseudogenes from RNA polymerase II transcripts lack promoter sequences and their only chance of expression is if the integration event places them next to a functional promoter sequence.