For more details on the anatomy of the human genome, see Section 1.2.
Life as we know it is specified by the genomes of the myriad organisms with which we share the planet. Every organism possesses a genome that contains the biological information needed to construct and maintain a living example of that organism. Most genomes, including the human genome and those of all other cellular life forms, are made of DNA (deoxyribonucleic acid) but a few viruses have RNA (ribonucleic acid) genomes. DNA and RNA are polymeric molecules made up of chains of monomeric subunits called nucleotides.
The nuclear genome comprises approximately 3 200 000 000 nucleotides of DNA, divided into 24 linear molecules, the shortest 50 000 000 nucleotides in length and the longest 260 000 000 nucleotides, each contained in a different chromosome. These 24 chromosomes consist of 22 autosomes and the two sex chromosomes, X and Y.
The mitochondrial genome is a circular DNA molecule of 16 569 nucleotides, multiple copies of which are located in the energy-generating organelles called mitochondria.
Each of the approximately 10¹³ cells in the adult human body has its own copy or copies of the genome, the only exceptions being those few cell types, such as red blood cells, that lack a nucleus in their fully differentiated state. The vast majority of cells are diploid and so have two copies of each autosome, plus two sex chromosomes, XX for females or XY for males - 46 chromosomes in all. These are called somatic cells, in contrast to sex cells or gametes, which are haploid and have just 23 chromosomes, comprising one of each autosome and one sex chromosome. Both types of cell have about 8000 copies of the mitochondrial genome, 10 or so in each mitochondrion.
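The figures quoted in the preceding paragraphs can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text; the variable names are illustrative, and the number of mitochondria per cell is simply what the two approximate mitochondrial figures imply.

```python
# Figures quoted in the text (variable names are illustrative).
NUCLEAR_GENOME_NT = 3_200_000_000     # nucleotides in the nuclear genome
AUTOSOMES = 22
MITO_GENOMES_PER_CELL = 8000          # approximate
MITO_GENOMES_PER_MITOCHONDRION = 10   # "10 or so"

distinct_chromosomes = AUTOSOMES + 2  # 22 autosomes plus X and Y
diploid_count = 2 * AUTOSOMES + 2     # two of each autosome plus XX or XY
haploid_count = AUTOSOMES + 1         # one of each autosome plus X or Y
mean_chromosome_nt = NUCLEAR_GENOME_NT / distinct_chromosomes
mitochondria_per_cell = MITO_GENOMES_PER_CELL // MITO_GENOMES_PER_MITOCHONDRION

print(distinct_chromosomes, diploid_count, haploid_count)
print(f"mean chromosome length: {mean_chromosome_nt / 1e6:.0f} million nucleotides")
print(f"implied mitochondria per cell: about {mitochondria_per_cell}")
```

The mean length of about 133 million nucleotides falls between the shortest (50 000 000) and longest (260 000 000) chromosomes, as it should.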
This book is about genomes. It explains what genomes are (Part 1), how they are studied (Part 2), how they function (Part 3), and how they replicate and evolve (Part 4). We begin our journey with our own genome, which is quite naturally the one that interests us the most. Later in this chapter we will examine how the human genome is constructed, some of this information dating from the old days when biologists studied genes rather than genomes, but much of it revealed only since the Human Genome Project was completed in the first year of the new millennium. First, however, we must understand the structure of DNA.
DNA was discovered in 1869 by Johann Friedrich Miescher, a Swiss biochemist working in Tübingen, Germany. The first extracts that Miescher made from human white blood cells were crude mixtures of DNA and chromosomal proteins, but the following year he moved to Basel, Switzerland (where the research institute named after him is now located) and prepared a pure sample of nucleic acid from salmon sperm. Miescher's chemical tests showed that DNA is acidic and rich in phosphorus, and also suggested that the individual molecules are very large, although it was not until the 1930s when biophysical techniques were applied to DNA that the huge lengths of the polymeric chains were fully appreciated.
Three years before Miescher discovered DNA, Gregor Mendel had published the results of his breeding experiments with pea plants, carried out in the monastery gardens at Brno, a central European city some 550 km from Tübingen in what is now the Czech Republic. Mendel's paper in the Proceedings of the Society of Natural Sciences in Brno describes his hypothesis that inheritance is controlled by unit factors, the entities that geneticists today call genes. It is very unlikely that Miescher and Mendel were aware of each other's work, and if either of them had happened to read about the other's discoveries then they certainly would not have made any connection between DNA and genes. To make such a connection - to infer that genes are made of DNA - would have been quite illogical in the late 19th century or indeed for many decades afterwards. The precise biological function of DNA was not known, and the supposition that it was a store of cellular phosphorus seemed entirely reasonable at the time. The chemical nature of genes was equally unknown, and indeed was an irrelevance for most geneticists, who in the years immediately after 1900, when Mendel's work was rediscovered, were able to make remarkable advances in understanding heredity without worrying about what genes were actually made of.
It was not until the 1930s that scientists began to ask more searching questions about genes. In 1944, Erwin Schrödinger, more famous for the wave equation which still terrifies many biology students taking introductory courses in physical chemistry, published a book entitled What is Life?, which encapsulated a variety of issues that were being discussed not only by geneticists but also by physicists such as Niels Bohr and Max Delbrück. These scientists were the first molecular biologists and the first to suggest that ‘life’ could be explained in molecular terms; our current knowledge of how the genome functions stems directly from their pioneering work. The starting point for the new molecular biology was to discover what genes are made of.
How could the molecular nature of the genetic material be determined? Back in 1903, WS Sutton had realized that the inheritance patterns of genes paralleled the behavior of chromosomes during cell division. This observation led to the proposal that genes are located in chromosomes and by the 1930s it was universally accepted that the chromosome theory was correct. Examination of cells by cytochemistry, after staining with dyes that bind specifically to just one type of biochemical, had shown that chromosomes are made of DNA and protein, in roughly equal amounts. Some biologists looked on the combination between the two (‘nucleoprotein’) as the genetic material, but others argued differently. From today's perspective it can be difficult to understand why these arguments favored the notion that genes were made, not of DNA, but of protein. The explanation is that, at the time, many biochemists thought that all DNA molecules were the same, which meant that DNA did not have the immense variability that was one of the postulated features of the genetic material. Billions of different genes must exist and for each one to have its own individual activity, the genetic material must be able to take many different forms. If every DNA molecule were identical then DNA could not satisfy this requirement and so genes must be made of protein. This assumption made perfect sense because proteins were known, correctly, to be highly variable polymeric molecules, each one made up of a different combination of 20 chemically distinct amino-acid monomers (Section 3.3.1).
The errors that had been made in understanding DNA structure lingered on until the late 1930s. Gradually, however, it was accepted that DNA, like protein, has immense variability. Could DNA therefore be the genetic material? The results of two experiments performed during the middle decades of the 20th century forced biologists to take this possibility seriously.
The first molecular biologists realized that the most conclusive way to identify the chemical composition of genes would be to purify some and subject them to chemical analysis. But nothing like this had ever been attempted and it was not clear how it could be done. Ironically, the experiment was performed almost unwittingly by a group of scientists who did not look upon themselves as molecular biologists and who were not motivated by a curiosity to know what genes are made of. Instead, their objective was to find a better treatment for one of the most deadly diseases of the early 20th century, pneumonia.
(A) Representation of a S. pneumoniae bacterium. A serotype is a bacterial type with distinctive immunological properties, conferred in this case by the combination of sugars present in the capsule. Avirulent types have no capsule. (B) The experiments which showed that a component of heat-killed bacteria can transform living avirulent bacteria into virulent cells. Griffith showed that the avirulent bacteria were always transformed into the same serotype as the dead cells. In other words, the living bacteria acquired the genes specifying synthesis of the capsule of the dead cells.
Avery and his colleagues showed that the transforming principle is unaffected by treatment with a protease or a ribonuclease, but is inactivated by treatment with a deoxyribonuclease.
Avery's experiments were meticulous but, because of several complicating factors, they did not immediately lead to acceptance of DNA as the genetic material. It was not clear in the minds of all microbiologists that transformation really was a genetic phenomenon, and few geneticists really understood the system well enough to be able to evaluate Avery's work. There was also some doubt about the veracity of the experiments. In particular, there were worries about the specificity of the deoxyribonuclease enzyme that he used to inactivate the transforming principle. This result, a central part of the evidence for the transforming principle being DNA, would be invalid if, as seemed possible, the enzyme contained trace amounts of a contaminating protease and hence was also able to degrade protein. These uncertainties meant that a second experiment was needed to provide more information on the chemical nature of the genetic material.
(A) The structure of a head-and-tail bacteriophage such as T2. The DNA genome of the phage is contained in the head part of the protein capsid. (B) The infection cycle. After injection into an Escherichia coli bacterium, the T2 phage genome directs synthesis of new phages. For T2, the infection cycle takes about 20 minutes at 37 °C and ends with lysis of the cell and release of 250–300 new phages. This is the lytic infection cycle. Some phages, such as λ, can also follow a lysogenic infection cycle, in which the phage genome becomes inserted into the bacterial chromosome and remains there, in quiescent form, for several generations of the bacterium (Section 4.2.1).
The bacteriophages were labeled with ³²P and ³⁵S. A few minutes after infection, the culture was agitated to detach the empty phage capsids from the cell surface. The culture was then centrifuged and the radioactive content of the bacterial pellet determined. This pellet contained most of the ³²P-labeled component of the phages (the DNA) but only 20% of the ³⁵S-labeled material (the phage protein). In a second experiment, Hershey and Chase showed that new phages produced at the end of an interrupted infection cycle contained less than 1% of the protein from the parent phages.
Hershey and Chase's results suggested that DNA was the major component of the infecting phages that entered the bacterial cell and, similarly, was the major, or perhaps only, component to be passed on to the progeny phages. These observations lent support to the view that DNA is the genetic material, but were they conclusive? Not according to Hershey and Chase (1952) who wrote ‘Our experiments show clearly that a physical separation of phage T2 into genetic and non-genetic parts is possible … . The chemical identification of the genetic part must wait, however, until some questions … have been answered.’ Even if the experiment had provided compelling evidence that the genetic material of phages was DNA, it would have been erroneous to extrapolate from these unusual life forms (which some biologists contend are not really ‘living’) to cellular organisms. Indeed, we know that some phage genomes are made of RNA. The Hershey-Chase experiment is important, not because of what it tells us, but because it alerted biologists to the fact that DNA might be the genetic material and was therefore worth studying. It was this that influenced Watson and Crick to study DNA and, as we will see below, it was their discovery of the double helix structure, which solved the puzzling question of how genes can replicate, that really convinced the scientific world that genes are made of DNA.
The names of James Watson and Francis Crick are so closely linked with DNA that it is easy to forget that, when they began their collaboration in Cambridge, England in October 1951, the detailed structure of the DNA polymer was already known. Their contribution was not to determine the structure of DNA per se, but to show that in living cells two DNA chains are intertwined to form the double helix. We will consider the two facets of DNA structure separately.
(A) The general structure of a deoxyribonucleotide, the type of nucleotide found in DNA. (B) The four bases that occur in deoxyribonucleotides.
2′-deoxyribose, which is a pentose, a type of sugar composed of five carbon atoms. These five carbons are numbered 1′ (spoken as ‘one-prime’), 2′, etc. The name ‘2′-deoxyribose’ indicates that this particular sugar is a derivative of ribose, one in which the hydroxyl (-OH) group attached to the 2′-carbon of ribose has been replaced by a hydrogen (-H) group.
A nitrogenous base, one of cytosine, thymine (single-ring pyrimidines), adenine or guanine (double-ring purines). The base is attached to the 1′-carbon of the sugar by a β- N -glycosidic bond attached to nitrogen number 1 of the pyrimidine or number 9 of the purine.
A phosphate group, comprising one, two or three linked phosphate units attached to the 5′-carbon of the sugar. The phosphates are designated α, β and γ, with the α-phosphate being the one directly attached to the sugar.
A molecule made up of just the sugar and base is called a nucleoside; addition of the phosphates converts this to a nucleotide. Although cells contain nucleotides with one, two or three phosphate groups, only the nucleoside triphosphates act as substrates for DNA synthesis. The full chemical names of the four nucleotides that polymerize to make DNA are:
2′-deoxyadenosine 5′-triphosphate
2′-deoxycytidine 5′-triphosphate
2′-deoxyguanosine 5′-triphosphate
2′-deoxythymidine 5′-triphosphate
The abbreviations of these four nucleotides are dATP, dCTP, dGTP and dTTP, respectively, or, when referring to a DNA sequence, A, C, G and T, respectively.
Synthesis occurs in the 5′→3′ direction, with the new nucleotide being added to the 3′-carbon at the end of the existing polynucleotide. The β- and γ-phosphates of the nucleotide are removed as a pyrophosphate molecule.
(A) RNA contains ribonucleotides, in which the sugar is ribose rather than 2′-deoxyribose. The difference is that a hydroxyl group rather than hydrogen atom is attached to the 2′-carbon. (B) RNA contains the pyrimidine called uracil instead of thymine.
adenosine 5′-triphosphate
cytidine 5′-triphosphate
guanosine 5′-triphosphate
uridine 5′-triphosphate
As with DNA, RNA polynucleotides contain 3′–5′ phosphodiester bonds, but these phosphodiester bonds are less stable than those in a DNA polynucleotide because of the indirect effect of the hydroxyl group at the 2′-position of the sugar. This may be one reason why the biological functions of RNA do not require the polynucleotide to be more than a few thousand nucleotides in length, at most. There are no RNA counterparts of the million-unit sized DNA molecules found in human chromosomes.
In the years before 1950, various lines of evidence had shown that cellular DNA molecules are comprised of two or more polynucleotides assembled together in some way. The possibility that unraveling the nature of this assembly might provide insights into how genes work prompted Watson and Crick, among others, to try to solve the structure. According to Watson in his book The Double Helix (see Further Reading), their work was a desperate race against the famous American biochemist, Linus Pauling (Section 3.3.1), who initially proposed an incorrect triple helix model, giving Watson and Crick the time they needed to complete the double helix structure (Watson and Crick, 1953). It is now difficult to separate fact from fiction, especially regarding the part played by Rosalind Franklin, whose X-ray diffraction studies provided the bulk of the experimental data in support of the double helix and who was herself very close to solving the structure. The one thing that is clear is that the double helix, discovered by Watson and Crick on Saturday 7 March 1953, was the single most important breakthrough in biology during the 20th century.
Watson and Crick used four types of information to deduce the double helix structure:
Biophysical data of various kinds. The water content of DNA fibers was particularly important because it enabled the density of the DNA in a fiber to be estimated. The number of strands in the helix and the spacing between the nucleotides had to be compatible with the fiber density. Pauling's triple helix model was based on an incorrect density measurement which suggested that the DNA molecule was more closely packed than it actually is.
X-ray diffraction patterns (Section 9.1.3), most of which were produced by Rosalind Franklin of King's College London, and which revealed the helical nature of the structure and indicated some of the key dimensions within the helix.
DNA was extracted from various organisms and treated with acid to hydrolyze the phosphodiester bonds and release the individual nucleotides. Each nucleotide was then quantified by chromatography. The data show some of the actual results obtained by Chargaff. These indicate that, within experimental error, the amount of adenine is the same as that of thymine, and the amount of guanine is the same as that of cytosine.
Model building, which was the only major technique that Watson and Crick made use of themselves. Scale models of possible DNA structures enabled the relative positioning of the various atoms to be checked, to ensure that pairs of groups that formed bonds were not too far apart, and that other groups were not so close together as to interfere with one another.
(A) Two representations of the double helix. On the left the structure is shown with the sugar-phosphate ‘backbones’ of each polynucleotide drawn as a red ribbon with the base pairs in black. On the right the chemical structure for three base pairs is given. (B) A base-pairs with T, and G base-pairs with C. The bases are drawn in outline, with the hydrogen bonding indicated by dotted lines. Note that a G-C base pair has three hydrogen bonds whereas an A-T base pair has just two. The structures in part (A) are redrawn from Turner et al. (1997) (left) and Strachan and Read (1999) (right).
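The pairing rules in part (B) account for the base ratios reported by Chargaff: if every A in one strand faces a T in the other, and every G faces a C, then a double-stranded molecule must contain equal amounts of A and T and equal amounts of G and C. The following minimal sketch illustrates this with a made-up sequence (not Chargaff's data); the function names are hypothetical.

```python
from collections import Counter

# Watson-Crick pairing rules: A with T, G with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def partner_strand(strand: str) -> str:
    """Return the base-paired partner of `strand`, written 5'->3'."""
    return "".join(PAIRS[base] for base in reversed(strand))

def base_counts(strand: str) -> Counter:
    """Count the bases in both strands of the double helix built on `strand`."""
    return Counter(strand) + Counter(partner_strand(strand))

# Made-up sequence, for illustration only.
example = "ATGCGGATTACAGGCTTAAC"
counts = base_counts(example)
print(counts["A"] == counts["T"], counts["G"] == counts["C"])  # True True
```

The equalities hold for any input sequence, whatever its composition, because each base in one strand is counted together with its partner in the other.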
Base-stacking, sometimes called π-π interactions, involves hydrophobic interactions between adjacent base pairs and adds stability to the double helix once the strands have been brought together by base-pairing. These hydrophobic interactions arise because the hydrogen-bonded structure of water forces hydrophobic groups into the internal parts of a molecule.
Both base-pairing and base-stacking are important in holding the two polynucleotides together, but base-pairing has added significance because of its biological implications. The limitation that A can only base-pair with T, and G can only base-pair with C, means that DNA replication can result in perfect copies of a parent molecule through the simple expedient of using the sequences of the pre-existing strands to dictate the sequences of the new strands. This is template-dependent DNA synthesis and it is the system used by all cellular DNA polymerases (Section 4.1.1). Its counterpart, template-dependent RNA synthesis, is used by RNA polymerases to make RNA copies of genes, these copies preserving the biological information contained in the sequence of the genomic DNA molecule (Section 3.2.2). The only difference between DNA and RNA syntheses is that when RNA is made, the adenines in the DNA template do not specify thymines in the RNA copy. This is because RNA does not contain thymine; instead adenine pairs with uracil in DNA-RNA hybrids and in double-stranded RNA structures.
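Template-dependent synthesis can be sketched as a simple lookup: each base in the template specifies its partner in the new strand. The example below is an illustration only, with a hypothetical template sequence and hypothetical names; it copies the same template into DNA and into RNA, the only difference being that A in the template specifies U rather than T in the RNA copy.

```python
# Base specified in the new strand by each base in a DNA template.
DNA_PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}
RNA_PAIRS = {"A": "U", "T": "A", "G": "C", "C": "G"}

def copy_strand(template: str, pairs: dict) -> str:
    """Build the strand specified by `template`.

    Both strands are written 5'->3', so the new strand is the
    complement of the template read in reverse.
    """
    return "".join(pairs[base] for base in reversed(template))

template = "TACGGATC"                      # hypothetical template, 5'->3'
new_dna = copy_strand(template, DNA_PAIRS)
new_rna = copy_strand(template, RNA_PAIRS)
print(new_dna)  # GATCCGTA
print(new_rna)  # GAUCCGUA

# Copying the new DNA strand regenerates the original template, which is
# why replication can produce perfect copies of the parent molecule.
print(copy_strand(new_dna, DNA_PAIRS) == template)  # True
```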
Reprinted with permission from Kendrew A (ed.), The Encyclopaedia of Molecular Biology, Plate 1. Copyright 1994 Blackwell Science.
| Feature | B-DNA | A-DNA | Z-DNA |
|---|---|---|---|
| Type of helix | Right-handed | Right-handed | Left-handed |
| Helical diameter (nm) | 2.37 | 2.55 | 1.84 |
| Rise per base pair (nm) | 0.34 | 0.29 | 0.37 |
| Distance per complete turn (pitch) (nm) | 3.4 | 3.2 | 4.5 |
| Number of base pairs per complete turn | 10 | 11 | 12 |
| Topology of major groove | Wide, deep | Narrow, deep | Flat |
| Topology of minor groove | Narrow, shallow | Broad, shallow | Narrow, deep |
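The rows of the table are related to one another: the pitch of a helix should equal the rise per base pair multiplied by the number of base pairs per turn. The short check below uses only the tabulated values, and also estimates how long the nuclear genome would be if its roughly 3 200 000 000 base pairs formed a single B-DNA helix. The dictionary layout and variable names are illustrative.

```python
# (rise per base pair in nm, base pairs per turn, tabulated pitch in nm)
CONFORMATIONS = {
    "B-DNA": (0.34, 10, 3.4),
    "A-DNA": (0.29, 11, 3.2),
    "Z-DNA": (0.37, 12, 4.5),
}

computed_pitch = {}
for name, (rise_nm, bp_per_turn, pitch_nm) in CONFORMATIONS.items():
    computed_pitch[name] = rise_nm * bp_per_turn
    print(f"{name}: computed pitch {computed_pitch[name]:.2f} nm, "
          f"tabulated {pitch_nm} nm")

# Length of the nuclear genome as one continuous B-DNA helix.
GENOME_BP = 3_200_000_000
length_m = GENOME_BP * 0.34e-9   # 0.34 nm per base pair, converted to metres
print(f"total length: about {length_m:.2f} m")
```

The computed pitches (3.40, 3.19 and 4.44 nm) agree with the tabulated ones to within the rounding of the published figures, and the genome works out at a little over one metre of DNA.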
The critical feature of a DNA molecule is its nucleotide sequence. If the sequence of a DNA molecule is known then the genes that it contains can be identified and the activities of those genes can be studied in detail. Since the mid-1970s, molecular biologists have been able to obtain the sequences of longer and longer stretches of DNA, culminating in the 1990s with completion of the first complete sequences of entire genomes. The most important of these projects has been the one devoted to the human genome.
The Human Genome Project was conceived in 1984 and begun in earnest in 1990 with the primary aim of determining the nucleotide sequence of the entire human nuclear genome. The much smaller mitochondrial genome had been sequenced in the early 1980s (Anderson et al., 1981). The project has been funded by governments and charities from across the world and has been the largest and most complex international collaboration ever attempted in any area of science. A second human genome project was set up by a private company - Celera Genomics of Maryland, USA - in 1998. Both projects completed a draft of the human genome sequence in 2001 and the results were published in the scientific journals Nature and Science in February of that year (IHGSC, 2001; Venter et al., 2001). These drafts were not complete sequences, each representing only 83–84% of the entire genome, but their coverage was thought to include all of the most important parts of the genome, most of the remaining 16–17% being made up of sequences at the very ends of chromosomes (the telomeres) and around the centromeres (Section 2.2.1), where few, if any, genes are located (Bork and Copley, 2001).
The map illustrates the distance that would be covered by the human genome sequence if it were printed in the typeface used in this book.
We should also bear in mind that although it is standard practice to refer to the human genome sequence, there are in fact many human genome sequences because every individual, except pairs of identical twins, has his or her own version. The differences between individual genomes are largely due to single nucleotide polymorphisms (SNPs), positions in the genome where some individuals have one nucleotide (e.g. an A) and others have a different nucleotide (e.g. a G). Over 1.4 million SNPs have been identified, an average of one for every 2.0 kb of sequence (SNP Group, 2001). On average, every 2 kb also contains a microsatellite (also called a short tandem repeat or STR), which is a series of repeated nucleotides (e.g. CACACACA) in which the number of repeats is variable in different individuals. Many of these SNPs and microsatellites have no effect on the function of the genome but many others do. For example, 60 000 SNPs lie within genes and at least some of these have an impact on the activities of these genes, leading to the variations that give each of us our own individual biological characteristics.
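In computational terms, an SNP is simply a position at which two aligned versions of the same region carry different nucleotides. The sketch below, which uses two made-up sequences and a hypothetical function name, lists such positions. It assumes that the sequences are already aligned and of equal length, so it detects only substitutions, not length differences such as those found in microsatellites.

```python
def snp_positions(seq_a: str, seq_b: str) -> list:
    """List (position, base_in_a, base_in_b) where two aligned sequences differ.

    Positions are counted from zero. The sequences must be the same length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]

# Two made-up versions of the same short region (illustration only).
person_1 = "GATTACAGATCCGTA"
person_2 = "GATTGCAGATCCGCA"
print(snp_positions(person_1, person_2))  # [(4, 'A', 'G'), (13, 'T', 'C')]
```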
This map shows the location of genes, gene segments, pseudogenes, genome-wide repeats and microsatellites in a 50-kb segment of the human β T-cell receptor locus on chromosome 7. Redrawn from Rowen et al. (1996).
One gene. This gene is called TRY4 and it contains information for synthesis of the protein called trypsinogen, the inactive precursor of the digestive enzyme trypsin. TRY4 is one of a family of trypsinogen genes present in two clusters at either end of the β T-cell receptor locus. These genes have nothing to do with the immune response; they simply share this part of chromosome 7 with the β T-cell receptor locus. TRY4 is an example of a discontinuous gene, the information used in synthesis of the trypsinogen protein being split between five exons, separated by four non-coding introns.
Two gene segments. These are V28 and V29-1, and each specifies a part of the β T-cell receptor protein after which the locus is named. V28 and V29-1 are not complete genes, only segments of a gene, and before being expressed they must be linked to other gene segments from elsewhere in the locus. This occurs in T lymphocytes and is an example of how a permanent change in the activity of the genome can arise during cellular differentiation (see Section 12.2.1). Like TRY4, both V28 and V29-1 are discontinuous.
One pseudogene. A pseudogene is a non-functional copy of a gene, usually one whose nucleotide sequence has changed so that its biological information has become unreadable (see page 22). This particular pseudogene is called TRY5 and it is closely related to the functional members of the trypsinogen gene family.
52 genome-wide repeat sequences. These are sequences that recur at many places in the genome. There are four main types of genome-wide repeat, called LINEs (long interspersed nuclear elements), SINEs (short interspersed nuclear elements), LTR (long terminal repeat) elements and DNA transposons. Examples of each type are seen in this short segment of the genome.
Two microsatellites, which, as mentioned above, are sequences in which a short motif is repeated in tandem. One of the microsatellites seen here has the motif GA repeated 16 times, giving the sequence:
5′-GAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA-3′
3′-CTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCT-5′
The second microsatellite comprises six repeats of TATT.
Finally, approximately 50% of our 50-kb segment of the human genome is made up of stretches of non-genic, non-repetitive, single-copy DNA of no known function or significance.
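Microsatellites of the kind listed above are easy to recognize computationally because they consist of one short motif repeated in tandem. The following sketch uses a regular expression with a backreference to find them. The function name, the thresholds (`max_unit`, `min_repeats`) and the flanking sequences in the example are arbitrary choices made for illustration; only the two motifs and their repeat numbers (GA × 16 and TATT × 6) come from the text.

```python
import re

def find_microsatellites(seq: str, max_unit: int = 6, min_repeats: int = 5):
    """Find tandem repeats of short motifs in `seq`.

    Returns (start, motif, repeat_count) tuples, with start counted from
    zero. The lazy quantifier makes the shortest repeating unit win.
    """
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1)
    )
    return [
        (m.start(), m.group(1), len(m.group(0)) // len(m.group(1)))
        for m in pattern.finditer(seq)
    ]

# The two microsatellites from the text, joined by made-up flanking sequence.
segment = "CCGT" + "GA" * 16 + "CTGAC" + "TATT" * 6 + "GCA"
print(find_microsatellites(segment))  # [(4, 'GA', 16), (41, 'TATT', 6)]
```

Real genome annotation uses more tolerant methods, since microsatellites often contain occasional mismatches that an exact pattern like this one would miss.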
Now we will look at these different components of the genome in greater detail.
We look on the genes as the most important part of the human genome because these are the parts that contain biological information. Most genes specify one or more protein molecules, the ‘expression’ of these genes involving an RNA intermediate, called messenger or mRNA, which is transported from the nucleus to the cytoplasm where it directs synthesis of the protein coded by the gene (Figure 1.15).
Alternative splicing results in different combinations of exons becoming linked together, resulting in different proteins being synthesized from the same pre-mRNA.
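The idea can be made concrete with a toy example. In the sketch below, a hypothetical pre-mRNA contains three exons separated by two introns, and two different mRNAs are produced by joining different combinations of exons. The sequences, coordinates and function name are all invented for illustration.

```python
# Hypothetical pre-mRNA: three exons (upper case), two introns (lower case).
PRE_MRNA = "AUGGCU" "guaaguccag" "GAACCA" "guaugacuag" "UUCUAA"

# (start, end) of each exon within PRE_MRNA; end is exclusive.
EXONS = {1: (0, 6), 2: (16, 22), 3: (32, 38)}

def splice(pre_mrna: str, exons: dict, keep: list) -> str:
    """Join the exons listed in `keep`, in order, discarding everything else."""
    return "".join(pre_mrna[slice(*exons[n])] for n in keep)

full_length = splice(PRE_MRNA, EXONS, [1, 2, 3])
exon_2_skipped = splice(PRE_MRNA, EXONS, [1, 3])
print(full_length)     # AUGGCUGAACCAUUCUAA
print(exon_2_skipped)  # AUGGCUUUCUAA
```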
This gene has two exons split by a single intron. For a protein-coding gene, the start of the biological information corresponds to the position of the initiation codon, and the end of the biological information is marked by the termination codon (Section 3.3.2). ‘Upstream’ and ‘downstream’ are two useful terms used to indicate the DNA sequences to either side of the gene.
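The way in which the initiation and termination codons delimit the biological information can be sketched as follows. The example assumes the standard genetic code (initiation codon ATG; termination codons TAA, TAG and TGA) and a made-up sequence from which any introns have already been removed; the function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def coding_region(seq: str):
    """Return (start, end) of the first ATG ... termination-codon stretch.

    `end` is the index just past the termination codon. Returns None if
    no initiation codon, or no in-frame termination codon, is found.
    """
    start = seq.find("ATG")
    if start == -1:
        return None
    # Step through the sequence one codon at a time after the ATG.
    for i in range(start + 3, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return start, i + 3
    return None

seq = "CCGGATGGCTTCAAAGTAACGT"   # made-up sequence, 5'->3'
start, end = coding_region(seq)
print("upstream:  ", seq[:start])     # CCGG
print("coding:    ", seq[start:end])  # ATGGCTTCAAAGTAA
print("downstream:", seq[end:])       # CGT
```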
The pie chart shows a categorization of the identified human protein-coding genes. It omits approximately 13 000 genes whose functions are not yet known. The segment labeled ‘various other activities’ includes, among others, proteins involved in biochemical transport processes and protein folding, immunological proteins, and structural proteins. Based on Figure 15 of Venter et al. (2001).
One thing that the gene catalog cannot tell us, and will not be able to tell us even when it is complete, is what makes a human being. The minimalist approach to molecular biology, whereby the study of individual genes or groups of genes is expected to lead eventually to a full biomolecular description of how a human being is constructed and functions, has been dealt a severe blow by the draft genome sequences. There are no amazing revelations about what makes humans different from apes. Even when the chimpanzee genome has been completely sequenced (which will not be for several years) it may still not be possible simply from genome comparisons to determine what makes us human (Baltimore, 2001). On the basis of gene number we are only three times more complex than a fruit fly and only twice as complex as the microscopic worm Caenorhabditis elegans. More detailed studies of how the human genome functions may reveal key features that underlie some of the special attributes of human beings, but genomics will never explain why a human was able to compose Mozart's 40th symphony, or indeed why it was composed by Mozart and not by an ordinary human.
A conventional pseudogene is a gene that has been inactivated because its nucleotide sequence has changed by mutation (Section 14.1). Many mutations have only minor effects on the activity of a gene but some are more important and it is quite possible for a single nucleotide change to result in a gene becoming completely non-functional. Once a pseudogene has become non-functional it will degrade through accumulation of more mutations and eventually will no longer be recognizable as a gene relic. TRY5 is an example of a conventional pseudogene.
A processed pseudogene is thought to arise by integration into the genome of a copy of the mRNA transcribed from a functional gene. The process by which mRNA is copied into DNA is called reverse transcription and the product is called complementary DNA (cDNA). The cDNA may integrate into the same chromosome as its functional parent, or possibly into a different chromosome.
As well as pseudogenes, genomes also contain other evolutionary relics in the form of truncated genes, which lack a greater or lesser stretch from one end of the complete gene, and gene fragments, which are short isolated regions from within a gene (Figure 1.20).
The draft sequences have shown that approximately 62% of the human genome comprises intergenic regions, the parts of the genome that lie between genes and which have no known function. These sequences used to be called junk DNA but the term is falling out of favor, partly because the number of surprises resulting from genome research over the last few years has meant that molecular biologists have become less confident in asserting that any part of the genome is unimportant simply because we do not currently know what its function might be. One thing that is clear is that the bulk of the intergenic DNA is made up of repeated sequences of one type or another. Because repeated sequences are important features of all genomes we will deal with them in detail during our general survey of genome anatomies in Chapter 2. Here we will limit ourselves to the key features of the human repeats.
| Type of repeat | Subtype | Approximate number of copies in the human genome |
|---|---|---|
| SINEs | | 1 558 000 |
| | Alu | 1 090 000 |
| | MIR | 393 000 |
| | MIR3 | 75 000 |
| LINEs | | 868 000 |
| | LINE-1 | 516 000 |
| | LINE-2 | 315 000 |
| | LINE-3 | 37 000 |
| LTR elements | | 443 000 |
| | ERV class I | 112 000 |
| | ERV(K) class II | 8000 |
| | ERV(L) class III | 83 000 |
| | MaLR | 240 000 |
| DNA transposons | | 294 000 |
| | hAT | 195 000 |
| | Tc-1 | 75 000 |
| | PiggyBac | 2000 |
| | Unclassified | 22 000 |
Taken from IHGSC (2001). The numbers are approximate and are likely to be under-estimates (Li et al., 2001).
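The copy numbers in the table are internally consistent: within each type, the subtype counts add up to the total given for that type. The check below demonstrates this and also sums the four totals. It assumes that the ‘Unclassified’ entry belongs with the DNA transposons, which is the only assignment under which the figures balance; the data structure itself is an illustrative choice.

```python
# type -> (stated total, {subtype: copies}); numbers are from the table.
REPEATS = {
    "SINEs": (1_558_000, {"Alu": 1_090_000, "MIR": 393_000, "MIR3": 75_000}),
    "LINEs": (868_000, {"LINE-1": 516_000, "LINE-2": 315_000,
                        "LINE-3": 37_000}),
    "LTR elements": (443_000, {"ERV class I": 112_000,
                               "ERV(K) class II": 8_000,
                               "ERV(L) class III": 83_000,
                               "MaLR": 240_000}),
    # Assumption: "Unclassified" is counted under DNA transposons.
    "DNA transposons": (294_000, {"hAT": 195_000, "Tc-1": 75_000,
                                  "PiggyBac": 2_000,
                                  "Unclassified": 22_000}),
}

for repeat_type, (total, subtypes) in REPEATS.items():
    print(repeat_type, total, sum(subtypes.values()) == total)

grand_total = sum(total for total, _ in REPEATS.values())
print(f"all genome-wide repeats: about {grand_total:,} copies")
```

Taken together, the four types account for approximately 3 163 000 copies, bearing in mind the caveat that the published numbers are likely to be under-estimates.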
| Length of repeat unit | Approximate number of copies in the human genome |
|---|---|
| 1 | 120 000 |
| 2 | 140 000 |
| 3 | 37 500 |
| 4 | 105 000 |
| 5 | 56 000 |
| 6 | 49 000 |
| 7 | 27 000 |
| 8 | 35 500 |
| 9 | 27 500 |
| 10 | 27 500 |
| 11 | 28 000 |
From IHGSC (2001).
The human mitochondrial genome is small and compact, with little wasted space, so much so that the ATP6 and ATP8 genes overlap. Abbreviations: ATP6, ATP8, genes for ATPase subunits 6 and 8; COI, COII, COIII, genes for cytochrome c oxidase subunits I, II and III; Cytb, gene for apocytochrome b; ND1–ND6, genes for NADH dehydrogenase subunits 1–6. Ribosomal RNA and transfer RNA are two types of non-coding RNA (Section 3.2.1).
The human genome has been the focus of biological research for the last decade and will continue to be the center of attention for many years to come. Why is all this activity being devoted to the human genome? There are many reasons.
First, the human gene catalog, containing a description of the sequence of every gene in the genome, will be immensely valuable, even if for many years the functions of some of the genes remain unknown. Not only will the catalog contain the sequences of the coding parts of every gene, it will also include the regulatory regions for these genes. Some of these genes are the ones that, when they function incorrectly, give rise to a genetic disease. The human gene catalog will provide rapid access to these genes, enabling the underlying basis of these diseases to be studied, hopefully leading to strategies for treatment and management.
While the catalog is being completed, attention will focus more and more on the transcriptome and proteome (Chapter 3), which are the keys to understanding how the information contained in the genome is utilized by the cell. The Human Genome Project, and the similar projects currently being carried out with other species' genomes, therefore open the way to a comprehensive description of the molecular activities of human cells and the ways in which these activities are controlled. This is central to the continued development, not only of molecular biology and genetics, but also of those areas of biochemistry, cell biology and physiology now described as the molecular life sciences.
The genome projects will have additional benefits that at present can only be guessed at. We have seen that the human genome, in common with the genomes of many other organisms, contains extensive amounts of intergenic DNA. We think that most of the intergenic DNA has no function, but perhaps this is because we do not know enough about it. Could the intergenic DNA have a role, but one that at present is too subtle for us to grasp? The first step in addressing this possibility is to obtain a complete description of the organization of the intergenic DNA in different genomes, so that common features, which might indicate a role for some or all of these sequences, can be identified.
There is one final reason for genome projects. The work stretches current technology to its limits. Genome analysis therefore represents the frontier of molecular biology, territory that was inaccessible just a few years ago and which still demands innovative approaches and a lot of sheer hard work. Scientists have always striven to achieve the almost impossible, and the motivation for many molecular biologists involved in genome projects is, quite simply, the challenge of the unknown.
Give short definitions of the following terms:
β-N-glycosidic bond
π-π interaction
3′ terminus
3′ untranslated region
5′ terminus
5′ untranslated region
DNA
Kilobase pair
Megabase pair
Messenger RNA
RNA
Short tandem repeat
Single nucleotide polymorphism
1. Give a brief description of the two components of the human genome.
2. DNA and genes were both discovered in the 1860s. Explain why the connection between the two was not made until 80 years later.
3. Why did biologists originally think that protein is the genetic material?
4. Describe the two experiments that indicated that genes are made of DNA.
5. Draw a fully annotated diagram of the structure of a short DNA polynucleotide containing each of the four nucleotides. Indicate the changes that you would have to make if the drawing were of RNA rather than DNA.
6. Describe the evidence that led Watson and Crick to deduce that a cellular DNA molecule is a double helix.
7. Distinguish between base-pairing and base-stacking. What influences do these two types of interaction have on the structural flexibility of the double helix? Your answer should include a description of the conformations of the major and minor grooves in the various forms of the double helix.
8. Return to your diagram from Question 5. Show how your DNA polynucleotide could be extended by template-dependent DNA synthesis.
9. What differences and similarities would you expect to find if you compared your genome with those of your parents?
10. Draw and annotate the structure of an ‘average’ human gene. How do the gene segments V28 and V29-1 differ from this ‘average’?
11. Outline the functional categories within the human gene catalog.
12. Distinguish between the two types of pseudogene.
13. Define the terms ‘interspersed repetitive DNA’ and ‘tandemly repeated DNA’ and give examples of both classes in the human genome.
To what extent were the results of the Avery and Hershey-Chase experiments accepted by the scientific community of the 1940s and 1950s?
The text (page 13) states that Watson and Crick discovered the double helix structure on Saturday 7 March 1953. Justify this statement.
If Watson and Crick had not existed then who would have discovered the double helix structure?