NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Koonin EV, Galperin MY. Sequence - Evolution - Function: Computational Approaches in Comparative Genomics. Boston: Kluwer Academic; 2003.

Chapter 9. Epilogue: Peering through the Crystal Ball

It's hard to make predictions, especially about the future.

Attributed to Yogi Berra, Niels Bohr, Mark Twain, and many others

If you come to a fork in the road, take it.

Yogi Berra

Throughout this book, we discussed various aspects of computational genomics, all of which, more or less, fit under the broad categories of functional prediction or evolutionary reconstruction. Before we put the pen down, it is tempting to dream of the future. We hope that, when presenting on these pages several research directions in computational genomics, be it exploitation of genomic context for prediction of gene functions or reconstruction of early evolutionary events, we managed to convey to the reader the openness of the most important problems in each of these areas. It does not seem to make sense to recapitulate these. However, it is attractive to think of some real new avenues of research, those that are in their infancy now but may be expected to flourish, say, in the year 2012. The images in the crystal ball are vague, and we may easily take them for something that they are not, but since we are trying to think big, it will not be too embarrassing to miss the target.

9.1. Functional Genomics: A Programme of Prediction-driven Research?

Improving our understanding of how the cell works through prediction of gene functions and interpretation of experimental results using genomic information is usually considered the main goal of genomics, computational genomics in particular. But what is functional genomics? Currently, the phrase is used in the broadest possible sense to describe any experimental work that involves a contribution from genome analysis. This is probably well justified, but we believe that functional genomics also could be defined in a much more specific and focused way.

Namely, functional genomics could be a coherent research programme built around predictions made by genome comparison, conceptually very much like structural genomics (see Chapter 8). Structural genomics strives to prioritize targets for protein structure determination on the basis of two principal criteria: (i) probable structural novelty and (ii) importance of the biological function. Functional genomics could select targets for experimental analysis in exactly the same way except that “functional” should be substituted for “structural” under (i). Clearly, the targets of the highest priority for detailed experimental study should be those uncharacterized genes whose functions are essential and, on top of this, are likely to reveal novel biochemical mechanisms. How do we find essential genes among those with unknown functions?

The reader who went through the previous chapters (actually Chapter 2 alone would suffice, but Chapters 5, 6, and 7 are also helpful in providing numerous cases in point) should already have the answer. We select the genes with phyletic patterns that are widespread among many diverse lineages of organisms; these are most likely to be essential. For one last time in this book, we will employ the COG database to select a set of high-priority targets for functional genomics. We have seen already that there are only ~65 truly universal COGs, and the majority of these have well-characterized functions (however, see below). Therefore, let us select those COGs that are represented in all archaea and eukaryotes (and perhaps in some of the bacteria); this will give us an extra shot at finding new important eukaryotic genes, which is what often interests a biologist the most. The search selects 186 COGs, of which ~20 have not been characterized in terms of their specific cellular function(s). Typically, the biochemical activity of these proteins has been predicted at different levels of specificity, although four remain completely mysterious.
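The selection step described above amounts to a simple set-containment query on phyletic patterns. The following is a minimal sketch of that idea, not the procedure actually used with the COG database: the genome codes, the COG labels, and the function name `select_archaeo_eukaryotic` are all hypothetical and serve only as an illustration.

```python
# Toy phyletic patterns: COG label -> set of genomes in which it is represented.
# All genome codes and COG labels are invented for illustration only.
ARCHAEA = {"arc1", "arc2", "arc3"}
EUKARYOTES = {"euk1", "euk2"}
BACTERIA = {"bac1", "bac2", "bac3"}


def select_archaeo_eukaryotic(patterns, archaea, eukaryotes):
    """Return labels of COGs present in every archaeal and every eukaryotic genome.

    Presence in bacteria is allowed but not required.
    """
    required = archaea | eukaryotes
    return sorted(cog for cog, genomes in patterns.items() if required <= genomes)


toy_patterns = {
    "COG_A": ARCHAEA | EUKARYOTES | BACTERIA,   # universal
    "COG_B": ARCHAEA | EUKARYOTES,              # strictly archaeo-eukaryotic
    "COG_C": ARCHAEA | EUKARYOTES | {"bac1"},   # plus one bacterium
    "COG_D": {"arc1", "arc2"} | EUKARYOTES,     # missing one archaeon
    "COG_E": BACTERIA,                          # bacteria only
}

selected = select_archaeo_eukaryotic(toy_patterns, ARCHAEA, EUKARYOTES)
print(selected)
```

In a real analysis, the universal COGs (such as the hypothetical `COG_A` here) and those with well-characterized functions would then be set aside by hand, leaving the uncharacterized candidates.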

In Table 9.1, we show a sample of these conserved archaeo-eukaryotic genes, which we ranked, more or less subjectively (that is, without any further computation, on the basis of “biological considerations” alone), in the decreasing order of “excitement”, in terms of identifying new, important functions. For most of the genes on this list, we could make additional predictions on the basis of genomic context (see 5.3) and the phyletic pattern itself. Not unexpectedly, most of the proteins encoded by the genes on our list are predicted to perform various roles in translation. There are several GTPases (some shown in Table 9.1), which are predicted to function as translation factors, unknown translation regulators, and RNA-binding proteins.

Table 9.1. Conserved protein families with predicted functions in search of experimental verification.

We seem to have an indisputable leader in terms of general importance and novelty, COG0533. This is one of only two universal COGs, along with COG0012 (already discussed briefly in Chapter 5), with virtually unknown functions. Since COG0012 is firmly predicted to consist of RNA-binding GTPases, i.e. ubiquitous and essential translation factors, the palm goes to COG0533. The proteins comprising this COG have a clearly predicted fold, the HSP70-like ATPase, with a metal-dependent protease active site inserted into the ATPase domain. On the basis of these features, it may be predicted that this is a previously undetected ATP-dependent protease with probable chaperone functions, perhaps involved in co-translational degradation of misfolded and/or prematurely terminated proteins. Moreover, genomic context suggests that it might be a subunit of a chaperone complex, which also includes an inactivated paralog of the predicted protease and several other proteins ([916] and E.V.K., unpublished observations). It seems that the data presented in Table 9.1 and other similar observations tell us something truly fundamental: we have so many essential pieces missing from our current picture of archaeal and eukaryotic translation that we cannot claim a proper understanding of this crucial process. A concerted, prediction-based experimental programme could go a long way toward an adequate description of translation in its real complexity. To our knowledge, such a programme does not exist.

Given the phyletic pattern we chose, the search outlined above was clearly geared toward detecting unknown, essential translation components. Similar straightforward computational approaches can be readily applied for identifying other kinds of functional systems. For example, a recent examination of the COGs has shown that, strikingly, there is only one protein that is unique to hyperthermophiles (both bacteria and archaea), the reverse gyrase [238], which leads to the inevitable conclusion that this protein is essential for life under hyperthermophilic conditions. In all likelihood, however, it is not sufficient. Given the wide spread of HGT in the prokaryotic world, other genes that are important for thermophiles have probably “leaked” into mesophiles. Therefore, a slightly more sophisticated approach can be employed to identify those additional determinants of the thermophilic phenotype. In Chapter 6, we briefly discussed a predicted repair system that is characteristic of thermophiles, although its individual components are found in some mesophiles also [541]. One could use the genes comprising this system as a training set to define the threshold phyletic pattern characteristic of functional systems important for thermophily (e.g. a pattern, in which 2/3 of the represented species are thermophiles, could be a reasonable cut-off). This approach reveals several proteins that are likely to perform thermophile-specific functions, e.g. distinct molecular chaperones (K.S. Makarova, Y.I. Wolf, and E.V.K., unpublished observations). Clearly, the inquiries into phyletic patterns can be formulated in many different ways (e.g. directed toward detection of potential drug targets, see 7.6) and, accordingly, will reveal gene sets that can be reasonably implicated in various types of biological functions.
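The threshold idea can be made concrete with a short sketch. It assumes that a COG qualifies when at least 2/3 of the species in which it is represented are thermophiles, which is only the example cut-off mentioned above; the species codes, COG labels, and function names are hypothetical, and no step for calibrating the cut-off on a training set is shown.

```python
from fractions import Fraction

# Hypothetical set of thermophilic genomes; all codes are invented.
THERMOPHILES = {"thermo1", "thermo2", "thermo3", "thermo4"}


def thermophile_fraction(genomes, thermophiles):
    """Fraction of the genomes carrying a COG that are thermophiles."""
    if not genomes:
        return Fraction(0)
    return Fraction(len(genomes & thermophiles), len(genomes))


def thermophily_candidates(patterns, thermophiles, cutoff=Fraction(2, 3)):
    """Labels of COGs whose thermophile fraction reaches the cut-off."""
    return sorted(
        cog
        for cog, genomes in patterns.items()
        if thermophile_fraction(genomes, thermophiles) >= cutoff
    )


toy_patterns = {
    "COG_X": {"thermo1", "thermo2", "thermo3", "thermo4"},           # 4/4
    "COG_Y": {"thermo1", "thermo2", "meso1"},                        # 2/3
    "COG_Z": {"thermo1", "meso1", "meso2"},                          # 1/3
    "COG_W": {"thermo1", "thermo2", "thermo3", "meso1", "meso2"},    # 3/5
}

candidates = thermophily_candidates(toy_patterns, THERMOPHILES)
print(candidates)
```

Exact rational arithmetic (`Fraction`) is used so that a pattern sitting exactly at 2/3 is not lost to floating-point rounding.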

We believe that this straightforward approach could serve as the foundation of a major direction in functional genomics. This direction certainly will not capture the entire complexity of life and is not supposed to supplant the predominant current approach based on purely biological considerations and experimental contingency. However, it could be a strong complement to the traditional methodology.

Is this vision going to materialize? The crystal ball gets dim, and we just do not know. The precedent of structural genomics is really heartening: this is where a similar rational approach clearly did work (even if the actual realization is only taking its early steps). Granted, this happened in large part because the methods for protein structure determination became much faster, and the whole enterprise is turning into a real “technology”. Perhaps prediction-directed functional genomics should await similar breakthroughs (if these ever are to come) in experimental study of protein functions. Nevertheless, it seems important for experimentalists to realize that this systematic approach to functional genomics is feasible at least in principle.

9.2. Digging Up Genomic Junkyards

We have already apologized in the Preface for dealing almost exclusively with proteins in this book. It is only fair if we devote one or two paragraphs in this epilogue to the other kind of genetic material, the non-coding sequences. Even the largest eukaryotic genomes currently known to us contain only a few times more genes than complex prokaryotes have. Moreover, and almost incredibly, a complex actinomycete like Streptomyces has almost as many genes as a fruit fly, and we can by no means rule out that bacteria exist with even larger genomes, so that the ranges of gene numbers overlap. Surely, some of the organismic complexity of eukaryotes comes from domain accretion (Chapter 8) and widespread alternative splicing [118,576]. However, an obvious and truly dramatic difference between prokaryotic and eukaryotic genomes is in the amount and fraction of non-coding DNA (Figure 9.1). It seems that the powerful pressure of selection for genome compactness, which keeps intergenic distances as short as possible in prokaryotes [709] and apparently also in unicellular eukaryotes [425], has somehow been removed in multicellular eukaryotes. Furthermore, as encapsulated in the famous C-value paradox, the complexity of a eukaryotic organism does not at all seem proportionate, even roughly, to the genome size: indeed some bony fish appear to have larger genomes than humans by a factor of hundreds.

Figure 9.1. Correlation between the genome size (in kb) and number of genes.

The number of genes in bacteria and archaea is proportional to the genome size. In eukaryotes, the gene number grows much more slowly than the genome size, resulting in a large fraction of non-coding DNA.

What is all this non-coding DNA for? An astonishing (and probably humiliating for those with a feeling of eukaryotic supremacy) explanation appeared in the classic 1980 article of W. Ford Doolittle and Carmen Sapienza [197] and was reaffirmed in the eloquent accompanying paper of Leslie Orgel and Francis Crick [634]. These researchers hypothesized that the bulk of non-coding DNA in multicellular eukaryotes is selfish, has no function useful for the organism, and is there simply because the organism does not know how to get rid of it (or finds it too expensive) and has learned to live with all that junk. This is certainly a powerful idea and, to a degree, it is definitely correct: the basic selfishness of mobile elements such as ALUs and LINEs, which comprise a significant fraction of the human genome, is beyond doubt. However, even with these selfish elements, the matter is not quite that simple.

Recent genome-wide analyses showed that a small fraction of transposable elements become exapted to function as protein-coding sequences [539,609] or, to an even greater extent, as regulatory elements of mammalian genes [409]. Obviously, these exapted sequences contribute to innovation, which may be critically important in evolution. So why are complex eukaryotes so tolerant of mobile elements: simply because they cannot get rid of them or because, despite their selfishness, these elements are repeatedly put to good use? We do not have the answer. Perhaps the most balanced viewpoint is that these explanations are not alternative but rather give us complementary aspects of the real story. The mobile elements might have started off as pure genomic parasites, but were later exploited by the host and now should be most properly regarded as symbionts.

More generally, recent comparisons of the non-coding DNA sequences in two nematode species and in humans and mice have shown that 20–30% of the “junk” non-coding DNA evolves under purifying selection, i.e. is not junk at all [762,763]. The importance of these findings, if supported by further analysis, is hard to overestimate: they indicate that ~93% of functionally important DNA in our genomes does not code for proteins. What are the functions of this “dark matter”? As of today, we are not even close to an answer (note that, if the above estimates are correct, we have to account for ~600 megabases of DNA in the human genome, an equivalent of ~50 yeast genomes!). Most likely, there is no single solution to the puzzle because the functions of non-coding DNA are likely to be versatile. Some recent hints are most tantalizing: it seems that a huge fraction of the non-coding DNA is transcribed at some level, and some of the transcripts are previously unidentified microRNAs, which are likely to perform a plenitude of regulatory functions (see [180] and the references therein). Whatever the solutions to the mystery of “junk” DNA, it seems likely that a book on functional genomics of eukaryotes to be written ten years from now is not going to be predominantly about proteins anymore.
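The round numbers quoted above can be checked with back-of-the-envelope arithmetic. The inputs below are assumptions that are not stated in the text: a human genome of ~3000 megabases, ~1.5% of it protein-coding, a yeast genome of ~12 megabases, and all coding DNA counted as functionally important. The 20% figure is the lower bound of the 20–30% range given above.

```python
# Assumed round figures (not taken from the text, except the 20% lower bound).
GENOME_MB = 3000.0                   # approximate human genome size, megabases
CODING_FRACTION = 0.015              # assumed protein-coding fraction
SELECTED_NONCODING_FRACTION = 0.20   # lower bound of the 20-30% range
YEAST_GENOME_MB = 12.0               # approximate yeast genome size, megabases

coding_mb = GENOME_MB * CODING_FRACTION
noncoding_mb = GENOME_MB - coding_mb

# Non-coding DNA inferred to be under purifying selection.
selected_noncoding_mb = noncoding_mb * SELECTED_NONCODING_FRACTION

# Share of functionally important DNA that does not code for proteins,
# treating all coding DNA as functionally important.
noncoding_share = selected_noncoding_mb / (selected_noncoding_mb + coding_mb)

# The same amount of DNA expressed in yeast-genome equivalents.
yeast_equivalents = selected_noncoding_mb / YEAST_GENOME_MB

print(f"selected non-coding DNA: {selected_noncoding_mb:.0f} Mb")
print(f"non-coding share of functional DNA: {noncoding_share:.1%}")
print(f"yeast genome equivalents: {yeast_equivalents:.0f}")
```

Under these assumptions, the results land close to the ~600 megabases, ~93%, and ~50 yeast genomes cited above; using the upper bound of 30% would raise all three figures.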

9.3. “Dreams of a final theory”1

Lord Rutherford famously (and, of course, arrogantly) quipped that “there are two kinds of science: physics and stamp collection”. We freely and humbly admit that 99% of the comparative-genomic work discussed in this book is of the second kind. For sure, what genomicists collect are not more or less useless (even if beautiful) stamps, but important empirical findings on genomes, gene functions, and homologous relationships. Subsequent analyses of these observations lead to crucial, even if, again, empirical generalizations on the evolution of life forms, such as genome fluidity caused by gene loss and horizontal gene transfer. Nevertheless, it must be admitted that this is still observational science. We think, however, that perhaps some of the analysis presented in Chapter 8 and, possibly, the genomic clock concept discussed in Chapter 6 belong to the other 1%, the part of comparative and evolutionary genomics that, in some ways, starts resembling physics. What we mean is that analysis of certain features of genomes, e.g. the size distributions of protein families, seems to reveal footprints of extremely general evolutionary processes whose mathematical form, such as power law asymptotics, recurs across an astonishing range of phenomena. These are just little steps toward general theory in biology, and there is not even a guarantee that they lead us in the right direction. In principle, however, we believe that a new paradigm of theoretical biology might not be unsustainable: through comparative genome analysis, develop a theory(ies) of the major evolutionary processes and apply them to the reconstruction of the history of life.
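To illustrate what a power law asymptotic means in practice, the sketch below estimates the exponent of a power-law tail from a list of family sizes. It is not the analysis presented in Chapter 8: it uses the standard continuous maximum-likelihood estimator, treats integer family sizes as continuous values, and works on synthetic data generated for the purpose. The function names and the choice of exponent 2.5 are illustrative assumptions.

```python
import math


def powerlaw_exponent_mle(sizes, x_min=1.0):
    """Continuous maximum-likelihood estimate of alpha for p(x) ~ x**(-alpha).

    Only values x >= x_min are used: alpha = 1 + n / sum(ln(x / x_min)).
    """
    tail = [x for x in sizes if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum <= 0.0:
        raise ValueError("need at least one value strictly above x_min")
    return 1.0 + len(tail) / log_sum


def synthetic_sizes(alpha, n, x_min=1.0):
    """Deterministic power-law sample via inverse transform on an even quantile grid."""
    return [
        x_min * (1.0 - (i + 0.5) / n) ** (-1.0 / (alpha - 1.0))
        for i in range(n)
    ]


sizes = synthetic_sizes(alpha=2.5, n=10000)
alpha_hat = powerlaw_exponent_mle(sizes)
print(f"estimated exponent: {alpha_hat:.3f}")
```

With real family-size counts, the choice of `x_min` and the discreteness of the data both affect the estimate, so the value obtained this way should be read as a rough guide only.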

Biology is an inherently historical science. In that respect, it is analogous to cosmology, which, in the last half century, has absorbed many advances of theoretical physics and, from a stamp-collection-like activity, has become a legitimate physical discipline. Indeed, the analogy seems to run deep. No theory will ever explain why our galaxy or our solar system looks exactly the way it does because this is the result of unique fluctuations, which occurred during the evolution of a tiny corner of the universe. However, theoretical studies combined with increasingly detailed observation have led to a rigorous description of many important aspects of the evolution of the universe as a whole. The famous Penrose-Hawking theorem, which proves that, under a very broad class of conditions, there was indeed a singularity in the beginning of the universe, is a good example of this. Similarly, if the evolution of life on earth could be run again or if evolution from similar initial conditions has actually run on another planet (something we would dearly love to know and one day probably will), the outcome would be quite different (and we would not be here to analyze it). The discovery of the major role of HGT in evolution emphasizes this unpredictability of life's evolution (a history with only vertical inheritance would be much easier to reconstruct but, again, we would not be around to do it). It seems almost certain that, in a rerun of life's evolution, the future eukaryotic cell would not form a symbiosis with a particular alpha-proteobacterium and eukaryotic life as we know it would not exist. However, it might not be unreasonable to hope that, eventually, the new theory of evolution might be able to answer questions such as: how likely is it that some archaeon would enter a symbiotic relationship with some complex bacterium, resulting in the emergence of new life forms of unprecedented complexity?

One may further speculate that theoretical analysis could help develop reasonable models of the earliest phase of protein evolution, preceding the “Schmalhausen threshold” postulated in Chapter 6. From there, it might even be possible to glimpse the solution to the mystery that we view as the Holy Grail of evolutionary biology: the origin of genetic coding. If that point is reached, it will be time to say that we are starting to understand life.


1 This is the title of the wonderful popular book of the Nobel Prize winner Steven Weinberg on the physicists' quest for GUTs (Grand Unified Theories) and GTEs (General Theories of Everything) [886].

Copyright © 2003, Kluwer Academic.
Bookshelf ID: NBK20252

