Both strands are read in the 5′→3′ direction. Each strand has three reading frames, depending on which nucleotide is chosen as the starting position.
A genome sequence is not an end in itself. A major challenge still has to be met in understanding what the genome contains and how the genome functions. The former is addressed by a combination of computer analysis and experimentation, with the primary aim of locating the genes and their control regions. The first part of this chapter is devoted to these methods. The second question - understanding how the genome functions - is, to a certain extent, merely a different way of stating the objectives of molecular biology over the last 30 years. The difference is that in the past attention has been directed at the expression pathways for individual genes, with groups of genes being considered only when the expression of one gene is linked to that of another. Now the question has become more general and relates to the expression of the genome as a whole. The techniques used to address this topic will be covered in the latter parts of this chapter.
Once a DNA sequence has been obtained, whether it is the sequence of a single cloned fragment or of an entire chromosome, then various methods can be employed to locate the genes that are present. These methods can be divided into those that involve simply inspecting the sequence, by eye or more frequently by computer, to look for the special sequence features associated with genes, and those methods that locate genes by experimental analysis of the DNA sequence. The computer methods form part of the methodology called bioinformatics, and it is with these that we begin.
Sequence inspection can be used to locate genes because genes are not random series of nucleotides but instead have distinctive features. These features determine whether a sequence is a gene or not, and so by definition are not possessed by non-coding DNA. At present we do not fully understand the nature of these specific features, and sequence inspection is not a foolproof way of locating genes, but it is still a powerful tool and is usually the first method that is applied to analysis of a new genome sequence.
The key to the success of ORF scanning is the frequency with which termination codons appear in the DNA sequence. If the DNA has a random sequence and a GC content of 50% then each of the three termination codons - TAA, TAG and TGA - will appear, on average, once every 4³ = 64 bp. If the GC content is greater than 50% then the termination codons, being AT-rich, will occur less frequently but one will still be expected every 100–200 bp. This means that random DNA should not show many ORFs longer than 50 codons in length, especially if the presence of a starting ATG is used as part of the definition of an ‘ORF’. Most genes, on the other hand, are longer than 50 codons: the average lengths are 317 codons for Escherichia coli, 483 codons for Saccharomyces cerevisiae, and approximately 450 codons for humans. ORF scanning, in its simplest form, therefore takes a figure of, say, 100 codons as the shortest length of a putative gene and records positive hits for all ORFs longer than this.
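The simplest form of ORF scanning can be sketched in a few lines of Python. The example below is a minimal illustration only: it assumes an uppercase or lowercase sequence containing just A, C, G and T, it requires an ATG at the start of each ORF, and the function name find_orfs and its parameters are hypothetical rather than taken from any published program.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=100):
    """Scan all six reading frames for ORFs of at least min_codons.

    An ORF here runs from an ATG to the first in-frame termination codon.
    Each hit is (strand, frame, start, end, n_codons); start and end are
    0-based positions on the strand being read, end includes the
    termination codon, and n_codons excludes it.
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the ATG of the ORF currently open
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None:
                    if codon == "ATG":
                        start = i
                elif codon in STOP_CODONS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        hits.append((strand, frame, start, i + 3, n_codons))
                    start = None
    return hits


# Toy sequence (invented for illustration): one short ORF in frame 2.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"
print(find_orfs(toy, min_codons=5))
```

Lowering min_codons increases the number of spurious hits, which is the trade-off described above; the default of 100 codons simply follows the figure suggested in the text.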
The diagram shows 4522 bp of the lactose operon of Escherichia coli with all ORFs longer than 50 codons marked. The sequence contains two real genes - lacZ and lacY - indicated by the red lines. These real genes cannot be mistaken because they are much longer than the spurious ORFs, shown in blue. See Figure 2.20A for the detailed structure of the lactose operon.
The nucleotide sequence of a short gene containing a single intron is shown. The correct amino acid sequence of the protein translated from the gene is given immediately below the nucleotide sequence: in this sequence the intron has been left out because it is removed from the transcript before the mRNA is translated into protein. In the lower line, the sequence has been translated without realizing that an intron is present. As a result of this error, the amino acid sequence appears to terminate within the intron. The amino acid sequences have been written using the one-letter abbreviations (see Table 3.1). The genetic code was described in Section 3.3.2; introns are covered in detail in Section 10.1.3.
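The error described in this legend can be reproduced with a toy example. The sketch below translates an invented two-exon gene twice, once after removing the intron and once without realizing that the intron is present. The sequences are hypothetical and chosen only so that the intron contains an in-frame termination codon; the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons enumerated in the order T, C, A, G.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate from the first nucleotide until a termination codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":  # termination codon
            break
        protein.append(aa)
    return "".join(protein)


# Invented sequences for illustration only.
exon1 = "ATGGCCAAG"
intron = "GTAAGTTAACTTTCAG"
exon2 = "GAATGGTAA"

correct = translate(exon1 + exon2)            # intron removed
mistaken = translate(exon1 + intron + exon2)  # intron read as coding

print("intron removed: ", correct)
print("intron ignored: ", mistaken)
```

In this toy case the mistaken translation agrees with the correct one only as far as the end of the first exon and then terminates inside the intron, which is the behavior the figure illustrates.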
Solving the problem posed by introns is the main challenge for bioinformaticists writing new software programs for ORF location. Three modifications to the basic procedure for ORF scanning have been adopted (Fickett, 1996):
Codon bias is taken into account. ‘Codon bias’ refers to the fact that not all codons are used equally frequently in the genes of a particular organism. For example, leucine is specified by six codons in the genetic code (TTA, TTG, CTT, CTC, CTA and CTG; see Figure 3.20), but in human genes leucine is most frequently coded by CTG and is only rarely specified by TTA or CTA. Similarly, of the four valine codons, human genes use GTG four times more frequently than GTA. The biological reason for codon bias is not understood, but all organisms have a bias, which is different in different species. Real exons are expected to display the codon bias whereas chance series of triplets do not. The codon bias of the organism being studied is therefore written into the ORF scanning software.
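One way in which codon bias might be written into scanning software is sketched below. This is an assumption-laden illustration, not the method of any particular program: codon frequencies are estimated from a set of known genes, and a candidate ORF is then scored by the average log ratio between those frequencies and the value expected if all 64 triplets were used equally. The function names, the pseudocount and the tiny training set are all hypothetical.

```python
import math
from collections import Counter

ALL_CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]


def split_codons(seq):
    """Split a sequence into complete triplets, ignoring any trailing bases."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def codon_frequencies(known_genes, pseudocount=1.0):
    """Estimate codon usage from known genes of the organism under study.

    The pseudocount keeps unseen codons from having zero frequency.
    """
    counts = Counter()
    for gene in known_genes:
        counts.update(split_codons(gene))
    total = sum(counts[c] + pseudocount for c in ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}


def bias_score(seq, freqs):
    """Mean log2 ratio of organism-specific to uniform codon frequency.

    Positive scores mean the triplets resemble the organism's codon usage;
    scores near or below zero are what a chance series of triplets gives.
    """
    codons = split_codons(seq)
    return sum(math.log2(freqs[c] * 64) for c in codons) / len(codons)


# Toy training set (invented): leucine as CTG and valine as GTG only.
freqs = codon_frequencies(["CTGCTGCTGGTGGTG"])
print(bias_score("CTGGTGCTG", freqs))  # uses the preferred codons
print(bias_score("TTAGTACTA", freqs))  # uses codons absent from training
```

A real implementation would be trained on many genes from the organism in question and would combine this score with the other lines of evidence discussed in this section.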
Exon-intron boundaries can be searched for as these have distinctive sequence features, although unfortunately the distinctiveness of these sequences is not so great as to make their location a trivial task. The sequence of the upstream, exon-intron boundary is usually described as:


Upstream regulatory sequences can be used to locate the regions where genes begin. This is because these regulatory sequences, like exon-intron boundaries, have distinctive sequence features that enable them to act as recognition signals for the DNA-binding proteins involved in gene expression (Chapter 9). Unfortunately, as with exon-intron boundaries, the regulatory sequences are variable, more so in eukaryotes than in prokaryotes, and in eukaryotes not all genes have the same collection of regulatory sequences. Using these to locate genes is therefore problematic (Ohler and Niemann, 2001).
These three extensions of simple ORF scanning are generally applicable to all higher eukaryotic genomes. Additional strategies are also possible with individual organisms, based on the special features of their genomes. For example, vertebrate genomes contain CpG islands upstream of many genes (Bird, 1986), these being sequences of approximately 1 kb in which the GC content is greater than the average for the genome as a whole. Some 40–50% of human genes are associated with an upstream CpG island. These sequences are distinctive and when one is located in vertebrate DNA, a strong assumption can be made that a gene begins in the region immediately downstream.
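A search for regions of elevated GC content can be sketched with a sliding window, as below. This is a deliberately simplified illustration: the window size, step and margin are arbitrary assumptions, the function names are hypothetical, and practical CpG island definitions usually also take account of how often the CG dinucleotide occurs, which this sketch ignores.

```python
def gc_content(seq):
    """Fraction of G and C nucleotides in a sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_rich_windows(seq, window=200, step=50, margin=0.10):
    """Report windows whose GC content exceeds the sequence-wide average.

    A window is a hit if its GC content is at least `margin` above the
    GC content of the whole input sequence. Each hit is returned as
    (start, end, gc) with 0-based, end-exclusive coordinates.
    """
    seq = seq.upper()
    baseline = gc_content(seq)
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        gc = gc_content(seq[start:start + window])
        if gc >= baseline + margin:
            hits.append((start, start + window, gc))
    return hits


# Toy sequence (invented): a GC-rich block flanked by AT-rich DNA.
toy = "AT" * 200 + "GC" * 100 + "AT" * 200
for start, end, gc in gc_rich_windows(toy, window=200, step=100):
    print(start, end, round(gc, 2))
```

Overlapping hits would normally be merged into a single candidate island, and the region immediately downstream then examined for the start of a gene.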
Most of the various software programs available for gene location can identify up to 95% of the coding regions in a eukaryotic genome, but even the best ones tend to make frequent mistakes in their positioning of the exon-intron boundaries (Reese et al., 2000). Identification of spurious ORFs as real genes is still a major problem. These limitations can be offset to a certain extent by the use of a homology search to test whether a series of triplets is a real exon or a chance sequence. In this analysis the DNA databases are searched to determine if the test sequence is identical or similar to any genes that have already been sequenced. Obviously, if the test sequence is part of a gene that has already been sequenced by someone else then an identical match will be found, but this is not the point of a homology search. Instead the intention is to determine if an entirely new sequence is similar to any known genes, because if it is then there is a chance that the test and match sequences are homologous, meaning that they represent genes that are evolutionarily related. The main use of homology searching is to assign functions to newly discovered genes, and we will therefore return to it when we deal with this aspect of genome analysis later in the chapter (Section 7.2.1). At this point, we will note simply that the technique is also central to gene location because it enables tentative exon sequences located by ORF scanning to be tested for functionality. If the tentative exon sequence gives one or more positive matches after a homology search then it is probably a real exon, but if it gives no match then its authenticity must remain in doubt until it is assessed by one or other of the experiment-based gene location techniques.
Most experimental methods for gene location are not based on direct examination of DNA molecules but instead rely on detection of the RNA molecules that are transcribed from genes. All genes are transcribed into RNA, and if the gene is discontinuous then the primary transcript is subsequently processed to remove the introns and link up the exons (Sections 1.2.1 and 10.1.3). Techniques that map the positions of transcribed sequences in a DNA fragment can therefore be used to locate exons and entire genes. The only problem to be kept in mind is that the transcript is usually longer than the coding part of the gene because it begins several tens of nucleotides upstream of the initiation codon and continues several tens or hundreds of nucleotides downstream of the termination codon (see Figure 1.17). Transcript analysis does not therefore give a precise definition of the start and end of the coding region of a gene, but it does tell you that a gene is present in a particular region and it can locate the exon-intron boundaries. Often this is sufficient information to enable the coding region to be delineated.
An RNA extract is electrophoresed under denaturing conditions in an agarose gel (see Technical Note 4.4). After ethidium bromide staining, two bands are seen. These are the two largest rRNA molecules (Section 3.2.1) which are abundant in most cells. The smaller rRNAs, which are also abundant, are not seen because they are so short that they run out of the bottom of the gel and, in most cells, none of the mRNAs (the transcripts of protein-coding genes) are abundant enough to form a band visible after ethidium bromide staining. The gel is blotted onto a nylon membrane and, in this example, probed with a radioactively labeled DNA fragment. A single band is visible on the autoradiograph, showing that the DNA fragment used as the probe contains part or all of one transcribed sequence.
Some individual genes give rise to two or more transcripts of different lengths because some of their exons are optional and may or may not be retained in the mature RNA (Section 10.1.3). If this is the case, then a fragment that contains just one gene could detect two or more hybridizing bands in the northern blot. A similar problem can occur if the gene is a member of a multigene family (Section 2.2.1).
With many species, it is not practical to make an mRNA preparation from an entire organism so the extract is obtained from a single organ or tissue. Consequently any genes not expressed in that organ or tissue will not be represented in the RNA population, and so will not be detected when the RNA is probed with the DNA fragment being studied. Even if the whole organism is used, not all genes will give hybridization signals because many are expressed only at a particular developmental stage, and others are weakly expressed, meaning that their RNA products are present in amounts too low to be detected by hybridization analysis.
The objective is to determine if a fragment of human DNA hybridizes to DNAs from related species. Samples of human, chimp, cow and rabbit DNAs are therefore prepared, restricted, and electrophoresed in an agarose gel. Southern hybridization is then carried out with a human DNA fragment as the probe. A positive hybridization signal is seen with each of the animal DNAs, suggesting that the human DNA fragment contains an expressed gene. Note that the hybridizing restriction fragments from the cow and rabbit DNAs are smaller than the hybridizing fragments in the human and chimp samples. This indicates that the restriction map around the transcribed sequence is different in cows and rabbits, but does not affect the conclusion that a homologous gene is present in all four species.
Northern hybridization and zoo-blotting enable the presence or absence of genes in a DNA fragment to be determined, but give no positional information relating to the location of those genes in the DNA sequence. The easiest way to obtain this information is to sequence the relevant cDNAs. A cDNA is a copy of an mRNA (see Figure 5.32) and so corresponds to the coding region of a gene, plus any leader or trailer sequences that are also transcribed. Comparing a cDNA sequence with a genomic DNA sequence therefore delineates the position of the relevant gene and reveals the exon-intron boundaries.
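The comparison between a cDNA and the genomic sequence can be sketched as follows for the idealized case in which the cDNA is a full-length, error-free copy of the exons. The approach shown (find a short seed from the cDNA in the genomic DNA, extend the match as far as it goes, then repeat with the rest of the cDNA) is a simple greedy procedure chosen for illustration; the function name, the min_exon parameter and the toy sequences are hypothetical, and real alignment programs must also cope with sequencing errors and ambiguous boundaries.

```python
def map_cdna_to_genome(genomic, cdna, min_exon=8):
    """Locate the exons of a cDNA within a genomic sequence.

    Returns a list of (start, end) genomic coordinates, 0-based and
    end-exclusive, one pair per exon. Assumes the cDNA matches the
    genomic DNA exactly and that exons appear in the same order in both.
    """
    blocks = []
    c = 0  # current position in the cDNA
    g = 0  # search the genomic sequence from here onwards
    while c < len(cdna):
        seed = cdna[c:c + min_exon]
        start = genomic.find(seed, g)
        if start == -1:
            raise ValueError(f"cDNA position {c} not found in genomic DNA")
        # Extend the exact match until the sequences diverge.
        length = 0
        while (c + length < len(cdna)
               and start + length < len(genomic)
               and genomic[start + length] == cdna[c + length]):
            length += 1
        blocks.append((start, start + length))
        c += length
        g = start + length
    return blocks


# Invented sequences for illustration only.
exon1, intron, exon2 = "ATGGCCAAGCTT", "GTAAGTTAACTTTCAG", "CAATGGTCCTAA"
genomic = "TTTT" + exon1 + intron + exon2 + "CCCC"
exons = map_cdna_to_genome(genomic, exon1 + exon2)
introns = [(a_end, b_start) for (_, a_end), (b_start, _) in zip(exons, exons[1:])]
print("exons:  ", exons)
print("introns:", introns)
```

The gaps between successive exon blocks are the introns, so the exon-intron boundaries follow directly from the coordinates returned.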
In order to obtain an individual cDNA, a cDNA library must first be prepared from all of the mRNA in the tissue being studied. Once the library has been prepared, the success of cDNA sequencing as a means of gene location depends on two factors. The first concerns the frequency of the desired cDNAs in the library. As with northern hybridization, the problem relates to the different expression levels of different genes. If the DNA fragment being studied contains one or more poorly expressed genes, then the relevant cDNAs will be rare in the library and it might be necessary to screen many clones before the desired one is identified. To get around this problem, various methods of cDNA capture or cDNA selection have been devised, in which the DNA fragment being studied is repeatedly hybridized to the pool of cDNAs in order to enrich the pool for the desired clones (Lovett, 1994). Because the cDNA pool contains so many different sequences, it is generally not possible to discard all the irrelevant clones by these repeated hybridizations, but it is possible to increase significantly the frequency of those clones that specifically hybridize to the DNA fragment. This reduces the size of the library that must subsequently be screened under stringent conditions to identify the desired clones.
A second factor that determines success or failure is the completeness of the individual cDNA molecules. Usually, cDNAs are made by copying RNA molecules into single-stranded DNA with reverse transcriptase and then converting the single-stranded DNA into double-stranded DNA with a DNA polymerase (see Figure 5.32). There is always a chance that one or other of the strand synthesis reactions will not proceed to completion, resulting in a truncated cDNA. The presence of intramolecular base pairs in the RNA can also lead to incomplete copying. Truncated cDNAs may lack some of the information needed to locate the start and end points of a gene and all its exon-intron boundaries.
The RNA being studied is converted into a partial cDNA by extension of a DNA primer that anneals at an internal position not too distant from the 5′ end of the molecule. The 3′ end of the cDNA is further extended by treatment with terminal deoxynucleotidyl transferase (Section 4.1.4) in the presence of dATP, which results in a series of As being added to the cDNA. This series of As acts as the annealing site for the anchor primer. Extension of the anchor primer leads to a double-stranded DNA molecule which can now be amplified by a standard PCR. This is 5′-RACE, so-called because it results in amplification of the 5′ end of the starting RNA. A similar method - 3′-RACE - can be used if the 3′ end sequence is desired.
This method of transcript mapping makes use of S1 nuclease, an enzyme that degrades single-stranded DNA or RNA polynucleotides, including single-stranded regions in predominantly double-stranded molecules, but has no effect on double-stranded DNA or on DNA-RNA hybrids. In the example shown, a restriction fragment that spans the start of a transcription unit is ligated into an M13 vector and the resulting single-stranded DNA hybridized with an RNA preparation. After S1 treatment, the resulting heteroduplex has one end marked by the start of the transcript and the other by the downstream restriction site (R2). The size of the undigested DNA fragment is therefore measured by gel electrophoresis in order to determine the position of the start of the transcription unit relative to the downstream restriction site.
The exon-trap vector consists of two exon sequences preceded by promoter sequences - the signals required for gene expression in a eukaryotic host (Section 9.2.2). New DNA containing an unmapped exon is ligated into the vector and the recombinant molecule introduced into the host cell. The resulting RNA transcript is then examined by RT-PCR to identify the boundaries of the unmapped exon.
Once a new gene has been located in a genome sequence, the question of its function has to be addressed. This is turning out to be an important area of genomics research, because completed sequencing projects have revealed that we know rather less than we thought about the content of individual genomes. E. coli and S. cerevisiae, for example, were studied intensively by conventional genetic analysis before the advent of sequencing projects, and geneticists were at one time fairly confident that most of their genes had been identified. The genome sequences revealed that in fact there are large gaps in our knowledge. Of the 4288 protein-coding genes in the E. coli genome sequence, only 1853 (43% of the total) had been previously identified (Blattner et al., 1997). For S. cerevisiae the figure was only 30% (Dujon, 1996).
As with gene location, attempts to determine the functions of unknown genes are made by computer analysis and by experimental studies.
We have already seen that computer analysis plays an important role in locating genes in DNA sequences, and that one of the most powerful tools available for this purpose is homology searching, which locates genes by comparing the DNA sequence under study with all the other DNA sequences in the databases. The basis of homology searching is that related genes have similar sequences and so a new gene can be discovered by virtue of its similarity to an equivalent, already sequenced, gene from a different organism. Now we will look more closely at homology analysis and see how it can be used to assign a function to a new gene.
Homologous genes are ones that share a common evolutionary ancestor, revealed by sequence similarities between the genes. These similarities form the data on which molecular phylogenies are based, as we will see in Chapter 16. Homologous genes fall into two categories:
Orthologous genes are those homologs that are present in different organisms and whose common ancestor predates the split between the species.
Paralogous genes are present in the same organism and are often members of a recognized multigene family (Section 2.2.1); their common ancestor may or may not predate the species in which the genes are now found.
Homologous genes do not usually have identical nucleotide sequences, because each gene undergoes different random changes by mutation, but they have similar sequences because these random changes have operated on the same starting sequence, the common ancestral gene. Homology searching makes use of these sequence similarities. The basis of the analysis is that if a newly sequenced gene turns out to be similar to a previously sequenced gene, then an evolutionary relationship can be inferred and the function of the new gene is likely to be the same as, or at least similar to, the function of the known gene.
It is important not to confuse the words homology and similarity. It is incorrect to describe a pair of related genes as ‘80% homologous’ if their sequences have 80% nucleotide identity (Figure 7.9
Two nucleotide sequences are shown, with nucleotides that are identical in the two sequences given in red and non-identities given in blue. The two nucleotide sequences are 76% identical, as indicated by the asterisks. This might be taken as evidence that the sequences are homologous. However, when the sequences are translated into amino acids the identity decreases to 28%. Identical amino acids are shown in brown, and non-identities in green. The comparison between the amino acid sequences suggests that the genes are not homologous, and that the similarity at the nucleotide level was fortuitous. The amino acid sequences have been written using the one-letter abbreviations (see Table 3.1).
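The difference between identity at the nucleotide level and identity at the amino acid level can be illustrated with a short calculation. The sketch below assumes two sequences of equal length that are already aligned without gaps; the sequences are invented, the function names are hypothetical, and the numbers it produces are not those of the figure.

```python
from itertools import product

# Standard genetic code, with codons enumerated in the order T, C, A, G.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate every complete codon, reading from the first nucleotide."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def percent_identity(a, b):
    """Percentage of aligned positions at which two sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Invented sequences: one nucleotide differs in every codon.
seq1 = "GCTGCTGCT"
seq2 = "GATGATGAT"

nt_identity = percent_identity(seq1, seq2)
aa_identity = percent_identity(translate(seq1), translate(seq2))
print(f"nucleotide identity: {nt_identity:.1f}%")
print(f"amino acid identity: {aa_identity:.1f}%")
```

Here two-thirds of the nucleotides are identical but none of the amino acids are, an extreme version of the situation in which nucleotide similarity overstates the relationship between two sequences.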
The top drawing shows the structure of the Drosophila tudor protein, which contains ten copies of the tudor domain. The domain is also found in a second Drosophila protein, homeless, and in the human A-kinase anchor protein (AKAP149), which plays a role in RNA metabolism. The proteins have dissimilar structures other than the presence of the tudor domains. The activity of each protein involves RNA in one way or another.
The S. cerevisiae genome project has illustrated both the potential and limitations of homology analysis as a means of assigning functions to new genes. The yeast genome contains approximately 6000 genes, 30% of which had been identified by conventional genetic analysis before the sequencing project got underway. The remaining 70% were studied by homology analysis, giving the following results (Figure 7.12
Almost another 30% of the genes in the genome could be assigned functions after homology searching of the sequence databases. About half of these were clear homologs of genes whose functions had been established previously, and about half had less striking similarities, including many where the similarities were restricted to discrete domains. For all these genes the homology analysis could be described as successful, but with various degrees of usefulness (Oliver, 1996a). For some genes the identification of a homolog enabled the function of the yeast gene to be comprehensively determined; examples included identification of yeast genes for DNA polymerase subunits. For other genes the functional assignment could only be to a broad category, such as ‘gene for a protein kinase’; in other words, the biochemical properties of the gene product could be inferred, but not the exact role of the protein in the cell. Some identifications were initially puzzling, the best example being the discovery of a yeast homolog of a bacterial gene involved in nitrogen fixation. Yeasts do not fix nitrogen so this could not be the function of the yeast gene. In this case, the discovery of the yeast homolog refocused attention on the previously characterized bacterial gene, with the subsequent realization that, although being involved in nitrogen fixation, the primary role of the bacterial gene product was in the synthesis of metal-containing proteins, which have broad roles in all organisms, not just nitrogen-fixing ones.
About 10% of all the yeast genes had homologs in the databases, but the functions of these homologs were unknown. The homology analysis was therefore unable to help in assigning functions to these yeast genes. These yeast genes and their homologs are called orphan families.
The remaining yeast genes, about 30% of the total, had no homologs in the databases. A proportion of these (about 7% of the total) were questionable ORFs which might not be real genes, being rather short or having an unusual codon bias. The remainder looked like genes but were unique. These are called single orphans.
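The approximate percentages quoted for the yeast genome can be tallied to check that the categories account for all of the genes. The short calculation below uses only the rounded figures given in the text (the text itself describes them as approximate), and the 23% for single orphans is derived here by subtracting the questionable ORFs from the genes with no homologs.

```python
TOTAL_GENES = 6000  # approximate number of yeast genes given in the text

# Rounded percentages of the total, as quoted in the text.
categories = {
    "identified before sequencing": 30,
    "function assigned by homology": 30,
    "orphan families": 10,
    "no homologs in the databases": 30,
}
QUESTIONABLE_ORFS = 7  # part of the 'no homologs' category
single_orphans = categories["no homologs in the databases"] - QUESTIONABLE_ORFS


def approx_genes(percent, total=TOTAL_GENES):
    """Convert a percentage of the genome into an approximate gene count."""
    return total * percent // 100


for name, percent in categories.items():
    print(f"{name}: {percent}% (about {approx_genes(percent)} genes)")
print(f"  questionable ORFs: {QUESTIONABLE_ORFS}% (about {approx_genes(QUESTIONABLE_ORFS)})")
print(f"  single orphans: {single_orphans}% (about {approx_genes(single_orphans)})")
print("total:", sum(categories.values()), "%")
```

On these figures, roughly four in ten yeast genes (orphan families plus genes with no homologs) could not be assigned a function by homology analysis.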
It is clear that homology analysis is not a panacea that can identify the functions of all new genes. Experimental methods are therefore needed to complement and extend the results of homology studies. This is proving to be one of the biggest challenges in genomics research, and most molecular biologists agree that the methodologies and strategies currently in use are not entirely adequate for assigning functions to the vast numbers of unknown genes being discovered by sequencing projects. The problem is that the objective - to plot a course from gene to function - is the reverse of the route normally taken by genetic analysis, in which the starting point is a phenotype and the objective is to identify the underlying gene or genes. The problem we are currently addressing takes us in the opposite direction: starting with a new gene and hopefully leading to identification of the associated phenotype.
In conventional genetic analysis, the genetic basis of a phenotype is usually studied by searching for mutant organisms in which the phenotype has become altered. The mutants might be obtained experimentally, for example by treating a population of organisms (e.g. a culture of bacteria) with ultraviolet radiation or a mutagenic chemical (see Section 14.1.1), or the mutants might be present in a natural population. The gene or genes that have been altered in the mutant organism are then studied by genetic crosses (Section 5.2.4), which can locate the position of a gene in a genome and also determine if the gene is the same as one that has already been characterized. The gene can then be studied further by molecular biology techniques such as cloning and sequencing.
The general principle of this conventional analysis is that the genes responsible for a phenotype can be identified by determining which genes are inactivated in organisms that display a mutant version of the phenotype. If the starting point is the gene, rather than the phenotype, then the equivalent strategy would be to mutate the gene and identify the phenotypic change that results. This is the basis of most of the techniques used to assign functions to unknown genes.
The chromosomal copy of the target gene recombines with a disrupted version of the gene carried by a cloning vector. As a result, the target gene becomes inactivated. For more information on recombination see Section 14.3.
The deletion cassette consists of an antibiotic-resistance gene preceded by the promoter sequences needed for expression in yeast, and flanked by two restriction sites. The start and end segments of the target gene are inserted into the restriction sites and the vector introduced into yeast cells. Recombination between the gene segments in the vector and the chromosomal copy of the target gene results in disruption of the latter. Cells in which the disruption has occurred are identified because they now express the antibiotic-resistance gene and so will grow on an agar medium containing geneticin. The gene designation ‘kanʳ’ is an abbreviation for ‘kanamycin resistance’, kanamycin being the family name of the group of antibiotics that include geneticin.
The second example of gene inactivation uses an analogous process but with mice rather than yeast. The mouse is frequently used as a model organism for humans because the mouse genome is similar to the human genome, containing many of the same genes. Identifying the functions of unknown human genes is therefore being carried out largely by inactivating the equivalent genes in the mouse, these experiments being ethically unthinkable with humans. The homologous recombination part of the procedure is identical to that described for yeast and once again results in a cell in which the target gene has been inactivated. The problem is that we do not want just one mutated cell, we want a whole mutant mouse, as only with the complete organism can we make a full assessment of the effect of the gene inactivation on the phenotype. To achieve this it is necessary to use a special type of mouse cell, an embryonic stem or ES cell (Evans et al., 1997). Unlike most mouse cells, ES cells are totipotent, meaning that they are not committed to a single developmental pathway and can therefore give rise to all types of differentiated cell. The engineered ES cell is therefore injected into a mouse embryo, which continues to develop and eventually gives rise to a chimera, a mouse whose cells are a mixture of mutant ones, derived from the engineered ES cells, and non-mutant ones, derived from all the other cells in the embryo. This is still not quite what we want, so the chimeric mice are allowed to mate with one another. Some of the offspring result from fusion of two mutant gametes, and will therefore be non-chimeric, as every one of their cells will carry the inactivated gene. These are knockout mice, and with luck their phenotypes will provide the desired information on the function of the gene being studied. This works well for many gene inactivations but some are lethal and so cannot be studied in a homozygous knockout mouse. Instead, a heterozygous mouse is obtained, the product of fusion between one normal and one mutant gamete, in the hope that the phenotypic effect of the gene inactivation will be apparent even though the mouse still has one correct copy of the gene being studied.
Recombinant DNA techniques have been used to place a promoter sequence (Section 3.2.2) that is responsive to galactose upstream of a Ty1 element in the yeast genome. When galactose is absent, the Ty1 element is not transcribed and so remains quiescent. When the cells are transferred to a culture medium containing galactose, the promoter is activated and the Ty1 element is transcribed, initiating the transposition process (Smith et al., 1995). For more information on activation of eukaryotic promoters, see Box 9.6 and for details of the retrotransposition process see Section 14.3.3.
Transposon tagging is central to the technique called genetic footprinting (Smith et al., 1995), which has been used to inactivate many of the yeast orphans as a first step to assessing their function. Transposon tagging is also important in analysis of the fruit-fly genome, using the endogenous Drosophila transposon called the P element (Engels, 2000). The weakness with transposon tagging is that it is difficult to target individual genes, because transposition is more or less a random event and it is impossible to predict where a transposon will end up after it has jumped. If the intention is to inactivate a particular gene then it is necessary to induce a substantial number of transpositions and then to screen the resulting organisms to find one with the correct insertion. Transposon tagging is therefore more applicable to global studies of genome function, in which genes are inactivated at random and groups of genes with similar functions identified by examining the progeny for interesting phenotype changes.
The double-stranded RNA molecule is broken down by the Dicer ribonuclease into ‘short interfering RNAs’ (siRNAs) of 21–25 bp in length. One strand of each siRNA base pairs to the target mRNA, which is then degraded by the RDE-1 nuclease. For more details on RNA interference, see Section 10.4.2.
RNA interference is known to occur naturally in a range of eukaryotes, but applying it to mammalian cells was expected to be difficult because these organisms display a parallel response to double-stranded RNA, in which protein synthesis is generally inhibited, resulting in cell death (Bass, 2001). These worries were unfounded, however, because it has now been shown that introduction of double-stranded RNAs into cultured human cells by fusion with liposomes (Figure 7.17
So far we have concentrated on techniques that result in inactivation of the gene being studied (‘loss of function’). The complementary approach is to engineer an organism in which the test gene is much more active than normal (‘gain of function’) and to determine what changes, if any, this has on the phenotype. The results of these experiments must be treated with caution because of the need to distinguish between a phenotype change that is due to the specific function of an overexpressed gene, and a less specific phenotype change that reflects the abnormality of the situation where a single gene product is being synthesized in excessive amounts, possibly in tissues in which the gene is normally inactive. Despite this qualification, overexpression has provided some important information on gene function.
The objective is to determine if overexpression of the gene being studied has an effect on the phenotype of a transgenic mouse. A cDNA of the gene is therefore inserted into a cloning vector carrying a highly active promoter sequence that directs expression of the cloned gene in mouse liver cells. The cDNA is used rather than the genomic copy of the gene because the former does not contain introns and so is shorter and easier to manipulate in the test tube.
Gene inactivation and overexpression are the primary techniques used by genome researchers to determine the function of a new gene, but these are not the only procedures that can provide information on gene activity. Other methods can extend and elaborate the results of inactivation and overexpression. These can be used to provide additional information that will aid identification of a gene function, or might form the basis of a more comprehensive examination of the activity of a protein whose gene has already been characterized.
Inactivation and overexpression can determine the general function of a gene, but they cannot provide detailed information on the activity of a protein coded by a gene. For example, it might be suspected that part of a gene codes for an amino acid sequence that directs its protein product to a particular compartment in the cell, or is responsible for the ability of the protein to respond to a chemical or physical signal. To test these hypotheses it would be necessary to delete or alter the relevant part of the gene sequence, but to leave the bulk unmodified so that the protein is still synthesized and retains the major part of its activity. The various procedures of site-directed or in vitro mutagenesis (Technical Note 7.1) can be used to make these subtle changes. These are important techniques whose applications lie not only with the study of gene activity but also in the area of protein engineering, where the objective is to create novel proteins with properties that are better suited for use in industrial or clinical settings.
Clues to the function of a gene can often be obtained by determining where and when the gene is active. If gene expression is restricted to a particular organ or tissue of a multicellular organism, or to a single set of cells within an organ or tissue, then this positional information can be used to infer the general role of the gene product. The same is true of information relating to the developmental stage at which a gene is expressed. This type of analysis has proved particularly useful in understanding the activities of genes involved in the earliest stages of development in Drosophila (Section 12.3.3) and is increasingly being used to unravel the genetics of mammalian development. It is also applicable to those unicellular organisms, such as yeast, which have distinctive developmental stages in their life cycle.
| Gene | Gene product | Assay |
|---|---|---|
| lacZ | β-galactosidase | Histochemical test |
| uidA | β-glucuronidase | Histochemical test |
| lux | Luciferase | Bioluminescence |
| GFP | Green fluorescent protein | Fluorescence |
The open reading frame of the reporter gene replaces the open reading frame of the gene being studied. The result is that the reporter gene is placed under control of the regulatory sequences that usually dictate the expression pattern of the test gene. For more information on these regulatory sequences, see Sections 9.2 and 9.3. Note that the reporter gene strategy assumes that the important regulatory sequences do indeed lie upstream of the gene. This is not always the case for eukaryotic genes.
The cell is treated with an antibody that is labeled with a blue fluorescent marker. Examination of the cell shows that the fluorescent signal is associated with the inner mitochondrial membrane. A working hypothesis would therefore be that the target protein is involved in electron transport and oxidative phosphorylation, as these are the main biochemical functions of the inner mitochondrial membrane.
Even if every gene in a genome can be identified and assigned a function, a challenge still remains. This is to understand how the genome as a whole operates within the cell, specifying and coordinating the various biochemical activities that take place. These global studies of genome activity must address not the genome itself but the transcriptome and proteome that are synthesized and maintained by the genome (Chapter 3). The objective is to understand the key features of the transcriptomes and proteomes that are present in different tissues and during different developmental stages and, in the case of humans, in different disease states (Section 3.2.3).
The transcriptome comprises the mRNAs that are present in a cell at a particular time. Transcriptomes can have highly complex compositions, with hundreds or thousands of different mRNAs represented, each making up a different fraction of the overall population (Section 3.2.3). To characterize a transcriptome it is therefore necessary to identify the mRNAs that it contains and, ideally, to determine their relative abundances.
The most direct way to characterize a transcriptome is to convert its mRNA into cDNA (see Figure 5.32), and then to sequence every clone in the resulting cDNA library. Comparisons between the cDNA sequences and the genome sequence will reveal the identities of the genes whose mRNAs are present in the transcriptome. This approach is feasible but it is laborious, with many different cDNA sequences being needed before a near-complete picture of the composition of the transcriptome begins to emerge. If two or more transcriptomes are being compared then the time needed to complete the project increases. Can any shortcuts be used to obtain the vital sequence information more quickly?
Serial analysis of gene expression (SAGE) provides a solution (Velculescu et al., 2000). Rather than studying complete cDNAs, SAGE yields short sequences, as short as 12 bp, each of which represents an mRNA present in the transcriptome. The basis of the technique is that these 12-bp sequences, despite their shortness, are sufficient to enable the gene that codes for the mRNA to be identified. The argument is that any particular 12-bp sequence should appear in the genome by chance, on average, once every $4^{12}$ = 16 777 216 bp. The average size of a eukaryotic mRNA is about 1500 bp, so $4^{12}$ bp is equivalent to the combined length of over 11 000 transcripts. This number is higher than the number of transcripts expected in all but the most complex transcriptomes, so the 12-bp sequence tags should be able to identify unambiguously the genes coding for all the mRNAs that are present.
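The arithmetic behind this argument can be checked directly. The following is a minimal sketch that assumes the four nucleotides occur with equal frequency and independently of one another, which real genomes only approximate; the function names are hypothetical.

```python
def expected_spacing(tag_length, alphabet_size=4):
    """Average distance in bp between chance occurrences of one specific tag."""
    return alphabet_size ** tag_length


def transcripts_spanned(tag_length, mean_mrna_bp=1500):
    """How many average-length mRNAs fit into the expected spacing of a tag."""
    return expected_spacing(tag_length) / mean_mrna_bp


if __name__ == "__main__":
    print(expected_spacing(12))     # chance spacing of a 12-bp tag, in bp
    print(transcripts_spanned(12))  # equivalent number of 1500-bp transcripts
```

Changing `tag_length` shows how quickly the discriminating power falls away with shorter tags: each nucleotide removed reduces the expected spacing fourfold.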
See the text for details. In this example, the first restriction enzyme to be used is Alu I, which recognizes the 4-bp target site 5′-AGCT-3′ (see Table 4.3). The oligonucleotide that is ligated to the cDNA contains the recognition sequence for Bsm FI, which cuts 10–14 nucleotides downstream, and so cleaves off a fragment of the cDNA. Fragments of different cDNAs are ligated to produce the concatamer that is sequenced. Using this method, the concatamer that is formed is made up partly of sequences derived from the Bsm FI oligonucleotides. To avoid this, and so obtain a concatamer made up entirely of cDNA fragments, the oligonucleotide can be designed so that the end that ligates to the cDNA contains the recognition sequence for a third restriction enzyme. Treatment with this enzyme cleaves the oligonucleotide from the cDNA fragment.
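Once the concatamer has been sequenced and split back into tags, the computational side of SAGE amounts to counting how often each tag occurs. The sketch below works directly on cDNA sequences to show the idea. It makes two assumptions that are not stated in the legend above: the tag is taken as the bases immediately downstream of the Alu I site nearest the 3′ end of the cDNA, and cDNAs with no site or too little downstream sequence are simply skipped. The function names are hypothetical.

```python
from collections import Counter

ALU_I_SITE = "AGCT"  # 4-bp recognition sequence of Alu I


def sage_tag(cdna, tag_length=12):
    """Return the tag downstream of the 3'-most Alu I site, or None if unavailable."""
    pos = cdna.rfind(ALU_I_SITE)
    if pos == -1:
        return None  # no Alu I site in this cDNA
    start = pos + len(ALU_I_SITE)
    tag = cdna[start:start + tag_length]
    return tag if len(tag) == tag_length else None


def tag_counts(cdnas, tag_length=12):
    """Count how many cDNAs yield each tag; the counts estimate relative abundance."""
    tags = (sage_tag(cdna, tag_length) for cdna in cdnas)
    return Counter(tag for tag in tags if tag is not None)


if __name__ == "__main__":
    library = ["GGGAGCTAAACCCGGGTTTAAA", "GGGAGCTAAACCCGGGTTTAAA", "TTTT"]
    print(tag_counts(library))
```

The tag counts give the relative abundances of the corresponding mRNAs, and each tag can then be matched against the genome sequence to identify its gene.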
(A) Transcriptome analysis with a DNA chip carrying oligonucleotides representing all the genes in a small genome. After adding labeled cDNA, the positions of the hybridization signals on the chip indicate which genes have contributed to the transcriptome under study. (B) With a larger genome, cDNA clones prepared from the transcriptome of one tissue are immobilized as a microarray and probed with cDNAs representing the same or a different transcriptome. By comparing the hybridization patterns, genes that are expressed differently in the tissues from which the transcriptomes are obtained can be identified.
Proteome studies are important because of the central role that the proteome plays as the link between the genome and the biochemical capability of the cell (Section 3.3). Characterization of the proteomes of different cells is therefore the key to understanding how the genome operates and how dysfunctional genome activity can lead to diseases such as cancer. Transcriptome studies can only partly address these issues. Examination of the transcriptome gives an accurate indication of which genes are active in a particular cell, but gives a less accurate indication of the proteins that are present. There are several reasons for this lack of equivalence between transcriptome and proteome, the most important being:
Not all mRNAs are actively translated at any particular time.
The protein content of the cell is determined by both synthesis of new proteins and degradation of existing ones.
Methods for studying the proteome are therefore needed in order to obtain a complete picture of genome expression.
The methodology used to study proteomes is collectively called proteomics. It is based on two techniques - protein electrophoresis and mass spectrometry - both of which have long pedigrees but which were rarely applied together in the pre-genomics era. Today they have been combined into one of the major growth areas of modern research.
(A) After two-dimensional gel electrophoresis a protein of interest is excised from the gel and digested with a protease such as trypsin, which cuts immediately after arginine or lysine amino acids. This cleaves the protein into a series of peptides which can be analyzed by MALDI-TOF. (B) In the mass spectrometer the peptides are ionized by a pulse of energy from a laser and then accelerated down the column to the reflector and onto the detector. The time of flight of each peptide depends on its mass-to-charge ratio. The data are visualized as a spectrum (C). The computer contains a database of the predicted molecular weights of every trypsin fragment of every protein encoded by the genome of the organism under study. The computer compares the masses of the detected peptides with the database and identifies the most likely source protein.
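The database comparison described in this legend can be sketched in a few lines. The example below digests each candidate protein with the simple rule given above (cut after every arginine or lysine), predicts the mass of each peptide, and reports the protein whose predicted peptides account for the most observed masses. It is an illustration under stated assumptions rather than a description of any real search program: the function names, the two-entry database and the 0.5 Da tolerance are hypothetical, the masses are neutral monoisotopic values rather than those of the ions actually detected, and missed cleavages and modifications are ignored.

```python
# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free N and C termini


def trypsin_digest(protein):
    """Cut after every K or R (simplified rule; no missed cleavages)."""
    peptides, current = [], ""
    for residue in protein:
        current += residue
        if residue in "KR":
            peptides.append(current)
            current = ""
    if current:
        peptides.append(current)  # C-terminal peptide without K or R
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of a peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def best_match(observed_masses, database, tolerance=0.5):
    """Name of the protein whose predicted peptides explain the most observed masses."""
    def score(sequence):
        predicted = [peptide_mass(p) for p in trypsin_digest(sequence)]
        return sum(
            any(abs(mass - p) <= tolerance for p in predicted)
            for mass in observed_masses
        )
    return max(database, key=lambda name: score(database[name]))


if __name__ == "__main__":
    database = {"protein_A": "GASKPVTR", "protein_B": "LLLKWWWR"}  # hypothetical
    observed = [peptide_mass("GASK"), peptide_mass("PVTR")]
    print(best_match(observed, database))
```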
Proteomics can also be taken beyond simple characterization of proteome content. For example, the compositions of the peptides derived from a single protein can be used to check a gene sequence (Mann and Pandey, 2001), and in particular to ensure that exon-intron boundaries have been correctly located. This not only helps to delineate the exact position of a gene in a genome (Section 7.1.1), it also allows differential splicing pathways to be identified in cases where two or more proteins are derived from the same gene.
Important data pertaining to genome activity can also be obtained by identifying pairs and groups of proteins that interact with one another. At a detailed level, this information is often valuable when attempts are made to assign a function to a newly discovered gene or protein (Section 7.2) because an interaction with a second well-characterized protein can often indicate the role of an unknown protein. For example, an interaction with a protein that is located on the cell surface might indicate that an unknown protein is involved in cell-cell signaling (Section 12.1.2). At a global level, the construction of protein interaction maps is looked on as an important step in linking the proteome with the cellular biochemistry.
(A) The cloning vector used for phage display is a bacteriophage genome with a unique restriction site located within a gene for a coat protein. The technique was originally carried out with the gene III coat protein of the filamentous phage called f1, but has now been extended to other phages including λ. To create a display phage, the DNA sequence coding for the test protein is ligated into the restriction site so that a fused reading frame is produced - one in which the series of codons continues unbroken from the coat protein gene into the test gene. After transformation of Escherichia coli, this recombinant molecule directs synthesis of a hybrid protein made up of the test protein fused to the coat protein. Phage particles produced by these transformed bacteria therefore display the test protein in their coats. (B) Using a phage display library. The test protein is immobilized within a well of a microtiter tray and the phage display library added. After washing, the phages that are retained in the well are those displaying a protein that interacts with the test protein.
The yeast two-hybrid system detects protein interactions in a more complex way (Fields and Sternglanz, 1994). In Section 9.3.2 we will see that proteins called activators are responsible for controlling the expression of genes in eukaryotes. To carry out this function an activator must bind to a DNA sequence upstream of a gene and stimulate the RNA polymerase enzyme that copies the gene into RNA. These two abilities - DNA-binding and polymerase activation - are specified by different parts of the activator, and some activators will work even after cleavage into two segments, one segment containing the DNA-binding domain and one the activation domain. In the cell, the two segments interact to form the functional activator.
(A) On the left, a gene for a human protein has been ligated to the gene for the DNA-binding domain of a yeast activator. After transformation of yeast, this construct specifies a fusion protein, part human protein and part yeast activator. On the right, various human DNA fragments have been ligated to the gene for the activation domain of the activator: these constructs specify a variety of fusion proteins. (B) The two sets of constructs are mixed and cotransformed into yeast. A colony in which the reporter gene is expressed contains fusion proteins whose human segments interact, thereby bringing the DNA-binding and activation domains into proximity and stimulating the RNA polymerase. See Section 9.3.2 for more information on activators.
The 5′ region of the yeast HIS2 gene is homologous to Escherichia coli his2, and the 3′ region is homologous to E. coli his10.
Each dot represents a protein, with connecting lines indicating interactions between pairs of proteins. Red dots are essential proteins: an inactivating mutation in the gene for one of these proteins is lethal. Mutations in the genes for proteins indicated by green dots are non-lethal; mutations in genes for proteins shown in orange lead to slow growth. The effects of mutation in genes for proteins shown as yellow dots are not known. From Jeong et al., Nature, 411, 41–42. Copyright 2001 Macmillan Magazines Limited.
The final method for understanding a genome sequence that we will consider is comparative genomics. We have already seen how similarities between homologous genes from different organisms provide one way of assigning a function to an unknown gene (Section 7.2.1). This is an example of how knowledge about the genome of one organism can help in understanding the genome of a second organism. The possibility that a more general comparison with other genomes might be a valuable means of deciphering the human sequence was recognized when the Human Genome Project was planned in the late 1980s, and the Project has actively stimulated the development of genome projects for model organisms such as the mouse and fruit fly. In this section we will explore the extent to which comparisons between different genomes are proving useful.
The basis of comparative genomics is that the genomes of related organisms are similar. The argument is the same one that we considered when looking at homologous genes (Section 7.2.1). Two organisms with a relatively recent common ancestor will have genomes that display species-specific differences built onto the common plan possessed by the ancestral genome. The closer two organisms are on the evolutionary scale, the more related their genomes will be (Nadeau and Sankoff, 1998).
If the two organisms are sufficiently closely related then their genomes might display synteny, the partial or complete conservation of gene order. Then it is possible to use map information from one genome to locate genes in the second genome. At one time it was thought that mapping the genomes of the mouse and other mammals, which are at least partially syntenic with the human genome, might provide valuable information that could be used in construction of the human genome map. The problem with this approach is that all the close relatives of humans have equally large genomes that are just as difficult to study, the only advantage being that a genetic map is easier to construct with an animal which, unlike humans, can be subjected to experimental breeding programs (Section 5.2.4). Despite the limitations of human pedigree analysis, progress has been more rapid in mapping the human genome than in mapping those of any of our close relatives, so in this respect comparative genomics is proving more useful in mapping the animal genomes than in mapping our own. This in itself is a useful corollary to the Human Genome Project because it is revealing animal homologs of human genes involved in diseases, providing animal models for the study of these diseases.
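Conserved gene order can be detected by comparing the two maps directly. The following is a minimal sketch that reports runs of genes appearing consecutively and in the same order in both genomes. It assumes that each gene name occurs once per genome and that homologous genes carry the same name in both lists; inverted blocks are not detected. The function name and the gene names in the example are hypothetical.

```python
def syntenic_blocks(order_a, order_b, min_genes=2):
    """Find runs of genes that are consecutive, and in the same order, in both genomes."""
    position_b = {gene: index for index, gene in enumerate(order_b)}
    blocks, current = [], []

    def close_block():
        # Keep the run only if it is long enough to count as a block.
        if len(current) >= min_genes:
            blocks.append(list(current))
        current.clear()

    for gene in order_a:
        if gene not in position_b:
            close_block()  # gene has no homolog in genome B, so the run ends
            continue
        if current and position_b[gene] == position_b[current[-1]] + 1:
            current.append(gene)  # next gene in A is also next in B
        else:
            close_block()
            current.append(gene)
    close_block()
    return blocks


if __name__ == "__main__":
    genome_a = ["g1", "g2", "g3", "x", "g4", "g5"]
    genome_b = ["g4", "g5", "y", "g1", "g2", "g3"]
    print(syntenic_blocks(genome_a, genome_b))
```

Within each block, the map position of a gene in one genome predicts the position of its homolog relative to its neighbors in the other.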
Mapping is significantly easier with a small genome than with a large one. This means that if one member of a pair of syntenic genomes is substantially smaller than the other, then mapping studies with this small genome are likely to provide a real boost to equivalent work with the larger genome. The pufferfish, Fugu rubripes, has been proposed in this capacity with respect to the human genome. The pufferfish genome is just 400 Mb, less than one-seventh the size of the human genome but containing approximately the same number of genes. The mapping work carried out to date with the pufferfish indicates that there is some similarity with the human gene order, at least over short distances. This means that it should be possible, to a certain extent, to use the pufferfish map to find human homologs of pufferfish genes, and vice versa. This may be useful in locating undiscovered human genes, but holds greatest promise in identifying essential sequences such as promoters and other regulatory signals upstream of human genes. This is because these signals are likely to be similar in the two genomes, and recognizable because they are surrounded by non-coding DNA that has diverged quite considerably by random mutations (Elgar et al., 1996; Hardison, 2000).
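The idea that conserved regulatory signals stand out against diverged non-coding DNA can be illustrated with a sliding-window comparison. The sketch below assumes that the two upstream regions have already been aligned to the same length without gaps, and reports the start positions of windows whose identity reaches a chosen threshold. The function name, window size, threshold and example sequences are illustrative assumptions.

```python
def conserved_windows(seq_a, seq_b, window=10, min_identity=0.9):
    """Start positions of windows in which two pre-aligned sequences are highly similar."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned to the same length")
    hits = []
    for start in range(len(seq_a) - window + 1):
        pairs = zip(seq_a[start:start + window], seq_b[start:start + window])
        matches = sum(a == b for a, b in pairs)
        if matches / window >= min_identity:
            hits.append(start)
    return hits


if __name__ == "__main__":
    # Hypothetical upstream regions: a shared 10-bp motif in diverged flanks.
    region_1 = "AAAAAAAAAA" + "TATAAAGGCC" + "CCCCCCCCCC"
    region_2 = "GGGGGGGGGG" + "TATAAAGGCC" + "TTTTTTTTTT"
    print(conserved_windows(region_1, region_2, window=10, min_identity=1.0))
```

Windows that score highly despite diverged flanking sequence are candidates for promoters or other regulatory signals, to be confirmed experimentally.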
One area where comparative genomics has a definite advantage is in the mapping of plant genomes. Wheat provides a good example. Wheat is the most important food plant in the human diet, being responsible for approximately 20% of the human calorific intake, and is therefore one of the crop plants that we most wish to study and possibly manipulate in the quest for improved crops. Unfortunately, the wheat genome is huge at 16 000 Mb, five times larger than even the human genome. A small model genome with a gene order similar to that of wheat would therefore be useful as a means of mapping desirable genes which might then be obtained from their equivalent positions in the wheat genome. Wheat, and other cereals such as rice, are members of the Gramineae, a large and diverse family of grasses. The rice genome is only 430 Mb, substantially smaller than that of wheat, and there are probably other grasses with even smaller genomes. Comparative mapping of the rice and wheat genomes has revealed many similarities, and the possibility therefore exists that genes from the wheat genome might be isolated by first mapping the positions of the equivalent genes in a smaller Gramineae genome (Gura, 2000).
One of the main reasons for sequencing the human genome is to gain access to the sequences of genes involved in human disease. The hope is that the sequence of a disease gene will provide an insight into the biochemical basis of the disease and hence indicate a way of preventing or treating the disease. Comparative genomics has an important role to play in the study of disease genes because the discovery of a homolog of a human disease gene in a second organism is often the key to understanding the biochemical function of the human gene. If the homolog has already been characterized then the information needed to understand the biochemical role of the human gene may already be in place; if it has not been characterized then the necessary research can be directed at the homolog.
| Human disease gene | Yeast homolog | Function of the yeast gene |
|---|---|---|
| Amyotrophic lateral sclerosis | SOD1 | Protection against superoxide (O₂⁻) |
| Ataxia telangiectasia | TEL1 | Codes for a protein kinase |
| Colon cancer | MSH2, MLH1 | DNA repair |
| Cystic fibrosis | YCF1 | Metal resistance |
| Myotonic dystrophy | YPK1 | Codes for a protein kinase |
| Type 1 neurofibromatosis | IRA2 | Codes for a regulatory protein |
| Bloom's syndrome, Werner's syndrome | SGS1 | DNA helicase |
| Wilson's disease | CCC2 | Copper transport? |
Data taken from Bassett et al. and Sinclair et al. (1997).
Give short definitions of the following terms:
cDNA capture
cDNA selection
Embryonic stem cell
Homology search
In vitro mutagenesis
Knockout mice
Matrix-assisted laser desorption ionization time-of-flight (MALDI-TOF)
Multicopy plasmid
Orthologous genes
Paralogous genes
Serial analysis of gene expression (SAGE)
Two-dimensional gel electrophoresis
Zoo-blotting
Explain why ORF scanning is a feasible way of identifying genes in a prokaryotic DNA sequence.
What modifications are introduced when ORF scanning is applied to a eukaryotic DNA sequence?
Describe how homology searching is used to locate genes in a DNA sequence and to assign possible functions to those genes.
Distinguish between northern blotting and zoo-blotting. What are the applications of these two techniques in gene location?
Explain how cDNA capture or cDNA selection is used to enrich a clone library for a particular cDNA sequence.
Draw a fully annotated diagram illustrating the procedure called 5′-RACE.
Describe how S1 nuclease is used to map the positions of the ends of a transcript on to a DNA sequence.
What experimental methods can be used to locate exon-intron boundaries in a DNA sequence?
Using the yeast genome project as an example, illustrate the strengths and weaknesses of homology analysis as a means of assigning functions to unknown genes.
Describe how gene inactivation can be used to determine the function of an unknown gene.
Give an example of the use of gene overexpression to determine the function of an unknown gene.
Describe how oligonucleotide-directed mutagenesis is carried out and outline the use of this technique in studying the activity of the protein coded by an unknown gene.
What is a reporter gene and how is it used?
Describe the methods used to study transcriptomes.
Explain how two-dimensional gel electrophoresis combined with mass spectrometry is used to study a proteome.
Draw diagrams illustrating the techniques called (a) phage display, and (b) the yeast two-hybrid system. What are the similarities and differences between these two techniques?
What is a protein interaction map? What has the yeast protein interaction map told us about the construction of the proteome of this organism?
Define the term ‘synteny’ and, using examples, explain how synteny can predict the positions of genes in a genome sequence.
Describe the applications of comparative genomics in the study of human disease genes.
Defend one of the following statements:
‘In future years it will be possible to use bioinformatics to obtain a complete description of the locations and functions of the genes in a genome sequence.’
‘In future years bioinformatics will become obsolete because of the development of rapid and effective experimental methods for locating and assigning functions to the genes in a genome sequence.’
Devise a hypothesis to explain the codon biases that occur in the genomes of various organisms. Can your hypothesis be tested?
Gene inactivation studies have suggested that at least some genes in a genome are redundant, meaning that they have the same function as a second gene and so can be inactivated without affecting the phenotype of the organism. What evolutionary questions are raised by genetic redundancy? What are the possible answers to these questions?
Explore the natural role of RNA interference in living organisms.
Gene overexpression has so far provided limited but important information on the function of unknown genes. Assess the overall potential of this approach in functional analysis.
‘Comparative genomics has an important role to play in the study of disease genes.’ Evaluate this statement.