U.S. flag

An official website of the United States government

NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Annual Reviews Collection [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2002 Nov.

Cover of Annual Reviews Collection

Annual Reviews Collection [Internet].

Show details

How Many Genes Can Make a Cell: The Minimal-Gene-Set Concept

.

Author Information and Affiliations

Reproduced from Annu. Rev. Genomics Hum. Genet. 2000. 01:99 – 116.

Key Words: comparative genomics, orthologs, nonorthologous gene displacement, genome evolution, transposon mutagenesis

Several theoretical and experimental studies have endeavored to derive the minimal set of genes that are necessary and sufficient to sustain a functioning cell under ideal conditions, that is, in the presence of unlimited amounts of all essential nutrients and in the absence of any adverse factors, including competition. A comparison of the first two completed bacterial genomes, those of the parasites Haemophilus influenzae and Mycoplasma genitalium, produced a version of the minimal gene set consisting of ~250 genes. Very similar estimates were obtained by analyzing viable gene knockouts in Bacillus subtilis, M. genitalium, and Mycoplasma pneumoniae. With the accumulation and comparison of multiple complete genome sequences, it became clear that only ~80 genes of the 250 in the original minimal gene set are represented by orthologs in all life forms. For ~15% of the genes from the minimal gene set, viable knockouts were obtained in M. genitalium; unexpectedly, these included even some of the universal genes. Thus, some of the genes that were included in the first version of the minimal gene set, based on a limited genome comparison, could be, in fact, dispensable. The majority of these genes, however, are likely to encode essential functions but, in the course of evolution, are subject to nonorthologous gene displacement, that is, recruitment of unrelated or distantly related proteins for the same function. Further theoretical and experimental studies within the framework of the minimal-gene-set concept and the ultimate construction of a minimal genome are expected to advance our understanding of the basic principles of cell functioning by systematically detecting nonorthologous gene displacement and deciphering the roles of essential but functionally uncharacterized genes.

Background and History of the Minimal-Gene-Set Concept

The numbers of genes in well-characterized genomes of cellular life forms range from as few as 480 in the parasitic bacterium Mycoplasma genitalium to ~100,000–150,000 in multicellular eukaryotes, such as humans (information on the completely sequenced genomes, including several complementary views of gene arrangement, can be found at http://www.ncbi.nlm.nih.gov/genomes/MICROBES/Complete.html). Is it possible to combine comparative genomics with biochemical and molecular-genetic data to determine the minimal number of genes required to make a modern-type cell? Furthermore, what are our chances of generating a realistic list of genes that constitute such a minimal gene set? Here I explore these questions using a comparative analysis of 21 genomes of bacteria, archaea, and eukaryotes that have been completely sequenced to date and relevant experimental data.

The idea of a minimal gene set refers to the smallest possible group of genes that would be sufficient to sustain a functioning cellular life form under the most favorable conditions imaginable, that is, in the presence of a full complement of essential nutrients and in the absence of environmental stress (5, 14, 29, 32). Deriving such a minimal gene set and examining its features are of interest both to further our understanding of the basics of cell functioning and, in a more practical perspective, to define the subset of genes that are expected to be essential in most, if not all, species. Furthermore, minimal-gene-set reconstructions are, at least in principle, experimentally testable. A first-approximation, relatively straightforward test involves knocking out the genes from the minimal set and assessing the phenotype—generally, these genes are expected to be essential, although the possibility of functional redundancy should be considered. Direct testing requires actually constructing and manipulating the hypothetical minimal genome.

The upper bound of the minimal set is given by the number of genes in the smallest known genome, that of M. genitalium, which consists of 480 genes (10). The lower bound is suggested by salient features of any modern cell—the requirements for complete systems of translation, transcription, and replication as well as integral components of the cell membrane and minimal transport systems. A crude estimate indicates that these systems cannot be supported by <100 proteins. A remarkable experimental study that resulted in an estimation of the minimal genome size was published at the end of the pregenomic era. Itaya has shown that of 79 random gene disruptions in Bacillus subtilis, only 6 were lethal (16). Furthermore, even simultaneous insertions into 33 loci have produced a viable bacterium. These findings resulted in an estimate of 318–562 kb for the minimal genome, which, given the average size of ~1 kb for a bacterial protein-coding gene, translates into 300–500 genes.

The sequencing, in 1995, of the first two complete genomes of cellular life forms, those of the parasitic bacteria Haemophilus influenzae (9) and M. genitalium (10), enabled a comparative genomic approach to the minimal-gene-set issue. This approach is based on two simple notions. (a) Cellular life forms are capable of importing a number of, if not all, metabolites and, accordingly, may dispense with the majority of metabolic enzymes; by contrast, cells, at least those of unicellular organisms, do not normally takeup proteins from the outside, and, therefore, all housekeeping proteins must be encoded in the genome. (b) Genes shared by multiple genomes are likely to be essential and therefore are good candidates for inclusion in the minimal gene set.

Generally, to apply the second notion meaningfully, one would need a number of complete genomes to compare. Any work in this direction based on the comparison of only a few genomes, let alone just two, necessarily would be preliminary, if not outright premature. The first two sequenced genomes, however, appeared to be particularly suitable for such a preliminary exercise of deriving a version of the minimal gene set, because they belong to phylogenetically distant groups of parasitic bacteria, each of which clearly has shed a number of genes in the process of its adaptation to the parasitic lifestyle. The gene losses have taken place subsequent to the divergence of these bacteria from their last common ancestor, in other words, independently; therefore, those common genes that remained, in principle, could be considered a good foundation for constructing a minimal gene set.

Based on these considerations, an attempt was made to construct a minimal gene set by comparing the H. influenzae and M. genitalium genomes (32; Figure 1). A detailed comparison of the protein sets from the two bacteria revealed 240 direct counterparts or likely orthologs (8). These genes, however, did not seem to add up to a viable minimal genome, because some of the metabolic pathways contained gaps that would preclude them from functioning in a theoretical minimal organism. To account for these gaps, one had to invoke nonorthologous gene displacement (NOD)—the situation when the same function is performed by unrelated or very distantly related and nonorthologous proteins (20). The M. genitalium/H. influenzae genome comparison produced a sketch of the minimal set of 256 genes that consisted mostly of orthologs, with NOD cases composing ~5% of these genes (32). In addition, this version of the minimal gene set was trimmed in a more arbitrary manner, namely by removing genes that appeared, at the time, to be specific for parasitic bacteria. Although undoubtedly just a crude approximation, the minimal gene set derived in this fashion appeared to correspond to a plausible minimalist bacterium. This organism would possess more or less complete systems for translation, transcription, and replication but would have all other cellular components, including the repair machinery, the set of molecular chaperones, the metabolic pathways, and particularly the signal transduction apparatus, reduced to a bare minimum.

Figure 1. A generalized procedure for constructing a version of the minimal gene set.

Figure 1

A generalized procedure for constructing a version of the minimal gene set. For simplicity, the schematic shows the derivation of a minimal gene set through a three-genome (G1, G2, G3) comparison. I, Intersection of the three genomes, which consists of (more...)

The Current Status of the Minimal Gene Set

How does the version of the minimal gene set that was derived from the comparison between M. genitalium and H. influenzae withstand the test with new genome sequences? The first such tests have shown that ~90% of the genes from the minimal set were represented in the genome of a taxonomically distant bacterium, Synechocystis sp., but, in the first sequenced eukaryotic genome, that of the yeast Saccharomyces cerevisiae, orthologs of only 40% of the minimal set genes could be identified (19). We have the opportunity, 3 years and about 25 complete genomes later (Table 1), to assess the original version of the minimal gene set in a fairly comprehensive manner, and I do so here, by using the system of clusters of orthologous groups of proteins (COGs) from 21 complete genomes (22, 35, 36).

Table 1. Coverage of completely-sequenced genomes by conserved families of orthologs.

Table 1

Coverage of completely-sequenced genomes by conserved families of orthologs.

The COG approach is based on the notion that any group of at least three proteins from distant genomes that are more similar to each other than to any other proteins from the same genomes most probably belong to a family of orthologs (36). This notion is relevant even if the absolute level of sequence similarity between the proteins in question is relatively low; thus the COG approach accommodates both slow-evolving and fast-evolving genes. The procedure for constructing the COGs involves the detection of all triangles of genome-specific best hits from the complete matrix of pairwise comparisons between proteins encoded in the analyzed set of genomes and then merging those triangles that have a common side to form the complete orthologous families. In addition, a detailed, case-by-case analysis of each COG was performed to eliminate potential false positives and to add weakly conserved proteins that had been missed by the automatic procedure, but nevertheless appeared to be orthologous to the rest of the members of a particular COG. For the latter purpose, additional searches were performed using the PSI-BLAST program. The resulting protein families capture not only one-to-one but also one-to-many and many-to-many orthologous relationships and hence clusters of orthologous groups of proteins.

The current collection of COGs shows two striking and, in a sense, opposing trends that are relevant for the discussion of the minimal-gene-set concept (35; http://www.ncbi.nlm.nih.gov/COG). First, it is notable that 55%–83% of the proteins encoded in each of the bacterial and archaeal genomes belong to the COGs, which, it should be emphasized, by definition include representatives of at least three phylogenetically distant clades (Table 1). Thus a good majority of bacterial and archaeal proteins are, in fact, highly conserved in evolution. Second, most of the COGs comprise only a few clades, whereas ubiquitous or nearly ubiquitous COGs are a small minority. The composition of protein families can be conveniently described using the language of phylogenetic patterns, that is, the patterns of species that are represented or missing in a given COG (36; http://www.ncbi.nlm.nih.gov/COG). Similar approaches to the analysis of phylogenetic representation of protein families have been developed by two other groups (11, 33). Among the 2112 COGs that comprise the current collection, as many as 1234 unique patterns are seen, which emphasizes the evolutionary plasticity of the families. The predominant evolutionary explanations for this mosaicism include clade-specific gene loss and horizontal gene transfer—phenomena that are increasingly recognized as major evolutionary factors, at least in the prokaryotic world (2, 6, 7, 21, 25, 32). On many occasions, the appearance of clade-specific gene loss may be created by rapid evolution in some of the lineages.

The phylogenetic patterns for all members of the original minimal gene set were extracted from the respective COGs. The outcome of this reanalysis is, primarily, that the role of NOD in evolution is by far more fundamental than originally imagined. The status of the members of the original minimal gene set after reassessment was performed by the COG approach can be classified as shown in Table 2 [see also supplementary material on the Annual Reviews web site (http://www.AnnualReview.org)]. Of the minimal-gene-set members, ~30% proved to be truly universal—evidently, this group should coincide with the set of 80 ubiquitous COGs, and this is indeed the case, except that two of the universal COGs were missed in the original study for very different reasons. The clamp loader ATPase that is encoded by the M. genitalium gene MG420 was not included, because the respective mycoplasmal protein was much shorter than its H. influenzae counterpart and, because of that, was not considered an ortholog. Subsequent genome comparisons, taken together with experimental data, indicate that the clamp loader is a ubiquitous and essential DNA polymerase subunit (28; http://www.ncbi.nlm.nih.gov/COG). In all likelihood, the M. genitalium sequence contains a frameshift, and the protein is a bona fide member of the minimal gene set. The second case is that of the MG046 protein, which, although it has a highly conserved ortholog in H. influenzae, has been excluded from the minimal gene set because its counterpart from Pasteurella haemolytica has been characterized as a sialoglycoprotease (26), a function that was considered parasite specific. Comparison of multiple sequenced genomes showed, however, that this protein is highly conserved in all of them; moreover, it has been shown to be essential in Escherichia coli and B. subtilis (3), and further sequence and structure analyses have led to the prediction that this is an intracellular protease that has chaperone activity (1). These examples demonstrate how comparative analysis of multiple genomes, supported by computational and experimental studies on individual protein families, can correct shortcomings of the preliminary studies and call for caution in applying straightforward biological reasoning.

Table 2. Phylogenetic patterns in the full COG a collection and in the minimal gene set classified by functional classes of proteins.

Table 2

Phylogenetic patterns in the full COG a collection and in the minimal gene set classified by functional classes of proteins.

A slightly smaller group of minimal-set members are those that are conserved in all or nearly all bacteria whose genomes have been sequenced; some of these proteins are also represented in eukaryotes and/or in a subset of the archaea (Table 2). It appears most likely that the majority of these genes encode essential functions, but archaea and bacteria or archaea and eukaryotes have evolved different, in many cases evolutionarily unrelated implementations of these functions (see below). In other words, this category of proteins is a major manifestation of NOD at a deep level of evolutionary divergence (ancient NOD cases; see next section for specific examples).

More than one third of the proteins from the original minimal gene set show a less consistent phyletic distribution (Table 2). Nearly one half of these, however (37 of 92), are missing in only one or two bacterial clades and so are highly conserved genes, even if they are not ubiquitous. It seems most likely that these genes correspond to critical functions, with NOD, at least among bacteria, being an exception. Some of the remaining genes might be NOD cases that have evolved a patchy phylogenetic pattern as a result of horizontal gene transfers, whereas others indeed are likely to be nonessential and do not belong to a true minimal gene set for cellular life. Distinguishing between these situations may be possible only by examination of the specific information on the biological functions of the respective proteins (see next section).

The predominant phylogenetic patterns are significantly different for different functional categories of proteins included in the minimal gene set; predictably, the regularities seen here are the same as those observed in the full COG collection (Table 2). The ubiquitous proteins are mostly components of the translation machinery and RNA polymerase subunits; very few are scattered among other functional categories (examples include the HSP60 chaperonin and glycine hydroxymethyltransferase). The replication-recombination-repair systems are consistently conserved among bacteria, but ubiquitous proteins are in the minority (see below). By contrast, among the metabolic functions, scattered phyletic distribution is prevalent (Table 2), which indicates both the wide spread of NOD and the loss of pathways in many organisms.

On the whole, it appears that the approach of constructing a minimal gene set by comparing the genomes of just two species, which are, however, phylogenetically distant parasites, has survived the test of multiple-genome comparison reasonably well. Indeed, this set is significantly enriched in universal and highly conserved proteins and includes a relatively small number of proteins with a scattered phyletic distribution, compared with an analogous breakdown of the full set of COGs (Table 2).

A recent major extension of minimal-gene-set studies has involved global, transposon-mediated knockout mutagenesis of M. genitalium and M. pneumoniae genes (14). Viable mutants with disruptive insertions have been obtained for 129 distinct mycoplasmal genes; an estimate based on a Poisson distribution of transposon insertion sites among genes indicates that the actual number of nonessential genes should be between 180 and 215. The upper bound on the number of nonessential genes obtained by this approach suggests a minimal gene set of 265 genes, which is remarkably close to the size of the set produced by the comparative genomic approach (30). A case-by-case examination of the list of viable knockouts shows that 38 of the genes included in the computer-derived minimal gene set have been proved nonessential (14; see supplementary material at http://www.sciencemag.org/feature/data/1042937.shl). Had the 250-gene minimal set been drawn at random from the 480 genes of M. genitalium and given the 129 nonessential genes identified, one would expect 67 hits to fall into the minimal gene set. Thus this set is clearly enriched in essential genes. Nevertheless, the significant number of viable disruptions within the theoretical minimal gene set is somewhat unexpected and could suggest that evolutionary conservation of a gene does not automatically translate into it being essential under any conditions. Among the 38 hits into the minimal gene set, 16 are into genes with a scattered phyletic distribution, 15 are into genes conserved in all bacteria, and 7 are into universal genes. For the first of these groups of genes, the results of comparative genomic analysis converge with those of global mutagenesis in indicating that the inclusion of these genes in the minimal set simply reflected the limited nature of the original genome comparison. The viability of the disruptions of conserved, particularly universal genes is, however, perplexing. Certainly, as indicated by Hutchison and coworkers (14), nonessentiality of a gene under laboratory conditions, in the absence of competition, is not a particularly good measure of its real-life importance. For example, the disruption of the ubiquitous gene for the GroEL chaperonin may not be immediately lethal, as indicated by the mutagenesis results, but the disadvantage under any limiting conditions is expected to be devastating. Similarly, it can be rationalized that disruption of the genes coding for the components of the UvrABC excinuclease, a repair enzyme present in all bacteria, as well as RecA, a ubiquitous enzyme involved in recombination and repair, does not kill the cell, but a cell with practically no capacity for DNA repair clearly is not facing a bright future. Still, for some of the ubiquitous genes, viable disruptions of which have been reported, one is hard pressed to imagine a mechanism for the cells’ survival. Cases in point are isoleucyl- and tyrosyl-tRNA synthetases. An outlandish possibility might be considered that these genes are rendered dispensable by the low level of mischarge of the respective tRNAs. It cannot be ruled out, however, that there are some unrecognized problems with the method of mutagenesis used, which result in leakage for some of the mutants.

In general, the results of the massive knockout mutagenesis of mycoplasmal genes lead to an important, even if in retrospect not particularly unexpected, conclusion. The evolutionary conservation of genes that is revealed by the comparative genomic approach reflects exactly what this approach was designed to reflect, namely the critical importance of the conserved genes for species evolution. The genes in question frequently prove to be essential under all tested conditions, but one cannot expect this correlation to be strict. Accordingly, it seems that the comparative genomic methodology is more applicable to deriving a minimal set of genes that are sufficient to sustain a robust evolutionary trajectory, for example that of the ultimate parasitism typical of the mycoplasmas (34), rather than just to support a cell under artificially favorable conditions.

Nonorthologous Gene Displacement—A Minimal Gene Set or a Minimal Set of Functions?

Probably the most notable change in our thinking about the minimal-gene-set concept, brought about by the comparison of multiple genomes, is the much greater extent of NOD than originally perceived. Indeed, only ~30% of the members of the original minimal gene set belong to ubiquitous protein families, which suggests that many of the remaining proteins from this set are responsible for critical functions but are subject to NOD. Examination of the known biological context of the 55 members of the minimal gene set that showed a scattered phyletic distribution (an analysis that inevitably includes a degree of arbitrariness), along with the trans-poson mutagenesis data, suggested that 20–30 of them are likely to be nonessential and simply should be removed from a more robust version of a minimal gene set (Table 3; see supplementary material at http://www.AnnualReview.org). The remaining proteins in this category are likely NOD cases. Moreover, whenever a protein is found in all bacteria but not in archaea and eukaryotes, or in bacteria and eukaryotes but not in archaea, NOD appears most probable. In some of the apparent NOD cases, both alternative solutions for the same functional niche are known, whereas in others, one of them remains to be identified (Table 4). Proteins that compose a NOD pair tend to display complementary phyletic patterns (Table 4). Although this complementarity may not be perfect, because it is common for some organisms to encode both members of a NOD pair, this feature can be used to predict previously undetected NOD cases (Table 4; EV Koonin & MY Galperin, unpublished data).

Table 3. Members of the original minimal gene set that show scattered phylogenetic patterns and are predicted to be dispensable (examples).

Table 3

Members of the original minimal gene set that show scattered phylogenetic patterns and are predicted to be dispensable (examples).

Table 4. Nonorthologous gene displacement (NOD) within the minimal gene set (examples).

Table 4

Nonorthologous gene displacement (NOD) within the minimal gene set (examples).

There is no functional category of proteins or functional system within the minimal gene set that would be immune to NOD, but in some systems it is a relatively rare exception, whereas in others the majority of functions are not performed by orthologs in all organisms. The translation machinery is by far more uniform in all life forms than any other functional system, but, even here, several notable examples of NOD are seen (Table 4). The cases of glutamine and asparagine activation for protein synthesis are particularly interesting, because these amino acids are linked to the cognate tRNAs by two completely different mechanisms—either via the corresponding aminoacyl-tRNA synthetases or via the transamidation mechanisms (Table 4). In these cases, as in some others, NOD is manifested as a one-to-many, rather than a one-to-one, relationship—a single gene is displaced by three unrelated genes whose products provide the same function as a complex (15, 17). The most striking display of NOD is seen in the DNA replication system, where the principal components in bacteria are not orthologous and, in some cases, appear to be unrelated to those in archaea/eukaryotes (24; Table 4).

NOD can be interpreted in a broader sense, with entire systems and pathways displacing others for a particular general role. Thus, glycolysis, or at least its lower part leading from trioses to phosphoenopyruvate, is nearly universal and, being present in the mycoplasmas, is an integral part of the minimal gene set. This pathway is, however, completely missing in Rickettsia prowazekii, which instead possess the tricarboxylic-acid cycle; the latter may be considered a displacement of glycolysis as the central loop of energy metabolism. In the same vein, no metabolite transport systems are ubiquitous, with unique solutions for metabolite intake in different organisms.

The prevalence of NOD suggests a shift in perspective on the entire concept of the minimal gene set. It seems that a more general and hence more robust idea is a minimal set of functional niches, most of which can be filled by proteins that belong to two or more distinct families of orthologs. A conserved core of functions with a single, ubiquitous solution certainly exists. The list of proteins included in this group is expected to further shrink with the accumulation of diverse genome sequences. There is, however, little doubt at this stage that a significant group of key proteins, probably at least 50, are truly universal, including several translation factors, the majority of aminoacyl-tRNA synthetases, and core RNA polymerase.

Extension and Development of the Minimal-Gene-Set Concept

With the evolution of the minimal-gene-set concept towards the more inclusive notion of a minimal set of functions, it is becoming clear that the task of delineating the minimal subset of the genome of, for example, M. genitalium by purely computational means is less straightforward than initially perceived, but also perhaps less generally important. The concept itself, however, may have considerable heuristic value. As mentioned above, constructing a minimal gene set makes no sense without explicitly defining the conditions under which the respective “minimal organism” should be expected to survive. Construction and analysis of minimal gene sets for different conditions could be a useful approach to predicting subsets of genes that are specifically required for life in the respective niches, for example, for thermophily. The conserved portions of minimal sets for different lifestyles are easy to identify using the tools for phyletic pattern analysis that are associated with the COG system (Table 5). The challenge lies in delineating the NOD cases to supplement these conserved sets of proteins, which requires careful computational analyses and biological reasoning and is beyond the scope of this review, but even examination of the conserved portions is of some interest. It shows, for example, that the gene set shared by all autotrophs whose genomes have been sequenced includes a large number of metabolic functions, indicating that these diverse organisms share a significant repertoire of biochemical pathways (Table 5).

Table 5. Conserved portions of hypothetical minimal gene sets for different lifestyles.

Table 5

Conserved portions of hypothetical minimal gene sets for different lifestyles.

What about the original quest for a “minimum minimorum” gene set for cellular life? Taken together, examination of the phylogenetic patterns, the transposon knockout data, and minimalist biochemical reasoning suggest that, in principle, a cell could be supported even by a considerably smaller number of proteins than the originally proposed 250. In addition to the proteins from the original minimal set that turned out to be poorly conserved and, in all likelihood, dispensable, a considerable number of conserved proteins also could be tentatively subtracted. These include all repair systems; some of the remaining metabolic enzymes, transport systems, and cell wall components; and even some ribosomal proteins. The result would be a bare-bones set of ~150 genes with the basal systems for translation, transcription, and replication; intermediate metabolism essentially reduced to glycolysis; a primitive transport system characterized by a broad specificity; and no cell wall [see supplementary material (http://www.AnnualReview.org)]. It remains questionable, of course, whether such a minimalist cell could survive under any realistic conditions. Although many conserved genes are individually dispensable, there are, so far, too few data on the effect of their simultaneous deletion, and the relatively small number of viable knockouts actually obtained in the recent large-scale experiment (14) calls for caution in interpretations. It is possible that concomitant removal of too many conserved and hence important genes would result in such a drop in fitness that, although theoretically possible, a bare-bones minimal cell could never be constructed in practice.

Comparative genomic approaches to the minimal-gene-set issue are straightforward but involve inherent uncertainties in that some of the widespread genes could still be dispensable, whereas identification of NOD can be ambiguous. This points to the importance of experimental approaches. With all of the advances of genomic engineering, however, the goal of actually constructing and manipulating a minimal genome still appears to be a major technical challenge. The recent global knockout studies, although an impressive scale-up of the analogous early experiments, still include only individual gene disruptions. Combining these in a single genome remains to be achieved, and, given that to actually define a minimal genome requires a number of trials, it seems that we are at least a few years away from a practicable minimal-genome technology. In principle, one could imagine a radically different approach based on selection of fast-replicating bacterial clones on a rich medium, perhaps starting from a strain with enhanced recombinational capabilities. This approach could be an attractive strategy modeled on the classic experiments of Spiegelman and coworkers with RNA bacteriophage genomes (27), but it is unclear whether such an approach could be implemented with bacteria on an acceptable timescale. In any case, the goal of experimentally constructing a minimal cell seems worth pursuing because not only will it help in verifying comparative genomic results and, accordingly, enhance our understanding of evolution, but a minimal cell also could provide a valuable model system for probing the principles of cell functioning.

The final issue to be tackled is the relevance, or lack thereof, of the minimal-gene-set concept to the reconstruction of ancestral genomes (31). The comparative procedure used to derive a hypothetical minimal gene set (Figure 1) has not been designed to retrace the actual course of evolution. Nevertheless, this shortcoming does not seem to justify a sweeping conclusion that the entire concept is evolutionarily irrelevant, as recently claimed (23). Not only are the universal genes a likely heritage of the “last universal common ancestor,” but the identified cases of NOD can be at least tentatively mapped to specific stages of life’s evolution, thus helping in the reconstruction of ancestral genomes.

Conclusions

Accumulation of multiple genome sequences provides ample material for computational approaches to minimal-genome construction. Comparative analyses of these genomes show that the majority of genes originally included in the minimal gene set derived by a comparison of the H. influenzae and M. genitalium genomes are either universal or at least conserved in all bacteria, whereas a minority show a scattered phyletic distribution. These results lead to a re-evaluation of the minimal-gene-set concept to accommodate a greater-than-originally-perceived contribution of nonorthologous gene displacement. It seems to be more appropriate to consider not a rigid minimal gene set but rather a minimal set of functional niches, some of which are occupied by members of the same orthologous family in all organisms but the majority of which allow at least two distinct solutions. Further development of the notion of the minimal gene set in which minimal gene sets are constructed for different conditions and lifestyles, for example thermophily or chemoautotrophy, seems to be a fruitful research direction.

Global knockout mutagenesis of the mycoplasmal genes, aimed at delineating a minimal gene set, has resulted in estimates that are very similar to those produced by original comparative genomic analysis but has also shown that even some of the universal or highly conserved genes can be dispensable. These results could indicate that even absolute evolutionary conservation does not automatically entail indispensability of a gene under any conditions, but their definitive interpretation requires further experiments. Actual experimental construction of a minimal genome may not be attainable in the nearest future but appears to be a goal worth pursuing.

Acknowledgments

I am grateful to Arcady Mushegian for his collaboration during the development of the minimal-gene-set approach and for numerous helpful conversations and to Clyde A Hutchinson III for critical reading of the manuscript.

Visit the Annual Reviews home page at http://www.AnnualReviews.org

Literature Cited

1.
Aravind L, Koonin EV. Gleaning nontrivial structural, functional and evolutionary information about proteins by iterative database searches. J. Mol. Biol. 1999;287:1023–40. [PubMed: 10222208]
2.
Aravind L, Tatusov RL, Wolf YI, Walker DR, Koonin EV. Evidence for massive gene exchange between archaeal and bacterial hyperthermophiles. Trends Genet. 1998;14:442–44. [PubMed: 9825671]
3.
Arigoni F, Talabot F, Peitsch M, Edgerton MD, Meldrum E. A genome-based approach for the identification of essential bacterial genes. Nat. Biotechnol. 1998;16:851–56. [PubMed: 9743119]
4.
Bellgard MI, Gojobori T. Identification of a ribonuclease H gene in both Mycoplasma genitalium and Mycoplasma pneumoniae by a new method for exhaustive identification of ORFs in the complete genome sequences. FEBS Lett. 1999;445:6–8. [PubMed: 10069363]
5.
Cho MK, Magnus D, Caplan AL, McGee D. Policy forum: genetics—ethical considerations in synthesizing a minimal genome. Science. 1999;286:2087–90. [PubMed: 10617419]
6.
Doolittle WF. Lateral genomics. Trends Cell Biol. 1999;9:M5–M8. [PubMed: 10611671]
7.
Doolittle WF. Phylogenetic classification and the universal tree. Science. 1999;284:2124–29. [PubMed: 10381871]
8.
Fitch WM. Distinguishing homologous from analogous proteins. Syst. Zool. 1970;19:99–106. [PubMed: 5449325]
9.
Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF. et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science. 1995;269:496–512. [PubMed: 7542800]
10.
Fraser CM, Gocayne JD, White O, Adams MD, Clayton RA. et al. The minimal gene complement of Mycoplasma genitalium. Science. 1995;270:397–403. [PubMed: 7569993]
11.
Gaasterland T, Ragan MA. Microbial genescapes: phyletic and functional patterns of ORF distribution among prokaryotes. Microb. Comp. Genomics. 1998;3:199–217. [PubMed: 10027190]
12.
Deleted in proof.
13.
Deleted in proof.
14.
Hutchison CA, Peterson SN, Gil SR, Cline RT, White O. et al. Global transposon mutagenesis and a minimal Mycoplasma genome. Science. 1999;286:2165–69. [PubMed: 10591650]
15.
Ibba M, Curnow AW, Soll D. Aminoacyl-tRNA synthesis: divergent routes to a common goal. Trends Biochem. Sci. 1997;22:39–42. [PubMed: 9048478]
16.
Itaya M. An estimation of minimal genome size required for life. FEBS Lett. 1995;362:257–60. [PubMed: 7729508]
17.
Koonin EV, Aravind L. Genomics: re-evaluation of translation machinery evolution. Curr. Biol. 1998;8:R266–69. [PubMed: 9550696]
18.
Deleted in proof.
19.
Koonin EV, Mushegian AR. Complete genome sequences of cellular life forms: glimpses of theoretical evolutionary genomics. Curr. Opin. Genet. Dev. 1996;6:757–62. [PubMed: 8994848]
20.
Koonin EV, Mushegian AR, Bork P. Non-orthologous gene displacement. Trends Genet. 1996;12:334–36. [PubMed: 8855656]
21.
Koonin EV, Mushegian AR, Galperin MY, Walker DR. Comparison of archaeal and bacterial genomes: computer analysis of protein sequences predicts novel functions and suggests a chimeric origin for the archaea. Mol. Microbiol. 1997;25:619–37. [PubMed: 9379893]
22.
Koonin EV, Tatusov RL, Galperin MY. Beyond complete genomes: from sequence to structure and function. Curr. Opin. Struct. Biol. 1998;8:355–63. [PubMed: 9666332]
23.
Kyrpides N, Overbeek R, Ouzounis C. Universal protein families and the functional content of the last universal common ancestor. J. Mol. Evol. 1999;49:413–23. [PubMed: 10485999]
24.
Leipe DD, Aravind L, Koonin EV. 1999. Did DNA replication evolve twice independently?Nucleic Acids Res. 27:3389–401. [PMC free article: PMC148579] [PubMed: 10446225]
25.
Logsdon JM, Faguy DM. Thermotoga heats up lateral gene transfer. Curr. Biol. 1999;9:R747–51. [PubMed: 10531001]
26.
Mellors A, Lo RY. O-sialo-glycoprotease from Pasteurella haemolytica. Methods Enzymol. 1995;248:728–40. [PubMed: 7674959]
27.
Mills DR, Peterson RL, Spiegelman S. An extracellular Darwinian experiment with a self-duplicating nucleic acid molecule. Proc. Natl. Acad. Sci. USA. 1967;58:217–24. [PMC free article: PMC335620] [PubMed: 5231602]
28.
Mossi R, Hubscher U. Clamping down on clamps and clamp loaders—the eukaryotic replication factor C. Eur. J. Biochem. 1998;254:209–16. [PubMed: 9660172]
29.
Mushegian A. The minimal genome concept. Curr. Opin. Genet. Dev. 1999;9:709–14. [PubMed: 10607608]
30.
Mushegian AR, Koonin EV. A minimal gene set for cellular life derived by comparison of complete bacterial genomes. Proc. Natl. Acad. Sci. USA. 1996;93:10268–73. [PMC free article: PMC38373] [PubMed: 8816789]
31.
Mushegian AR, Koonin EV. 1998. A minimal gene complement for cellular life and reconstruction of primitive life forms by analysis of complete bacterial genomes. In Bacterial Genomes. Physical Structure and Analysis, ed. FJ De Bruijn, JR Lupski, GM Weinstock, pp. 478–88. New York: Chapman & Hall.
32.
Nelson KE, Clayton RA, Gill SR, Gwinn ML, Dodson RJ. et al. Evidence for lateral gene transfer between archaea and bacteria from genome sequence of Thermotoga maritima. Nature. 1999;399:323–29. [PubMed: 10360571]
33.
Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl. Acad. Sci. USA. 1999;96:4285–88. [PMC free article: PMC16324] [PubMed: 10200254]
34.
Razin S, Yogev D, Naot Y. Molecular biology and pathogenicity of mycoplasmas. Microbiol. Mol. Biol. Rev. 1998;62:1094–156. [PMC free article: PMC98941] [PubMed: 9841667]
34a.
Stathopoulos C, Li T, Longman R, Vothknecht UC, Becker HD. et al. One polypeptide with two aminoacyl-tRNA synthetase activities. Scienci. 2000;287:479–82. [PubMed: 10642548]
35.
Tatusov RL, Galperin MY, Natale DA, Koonin EV. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 2000;28:33–36. [PMC free article: PMC102395] [PubMed: 10592175]
36.
Tatusov RL, Koonin EV, Lipman DJ. A genomic perspective on protein families. Science. 1997;278:631–37. [PubMed: 9381173]

Views

Related information

  • PMC
    PubMed Central citations
  • PubMed
    Links to PubMed

Similar articles in PubMed

See reviews...See all...

Recent Activity

Your browsing activity is empty.

Activity recording is turned off.

Turn recording back on

See more...