NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Koonin EV, Galperin MY. Sequence - Evolution - Function: Computational Approaches in Comparative Genomics. Boston: Kluwer Academic; 2003.

Sequence - Evolution - Function: Computational Approaches in Comparative Genomics.

Chapter 6. Comparative Genomics and New Evolutionary Biology

The affinities of all beings of the same class have sometimes been represented by a great tree. I believe this simile largely speaks the truth.

Charles Darwin, 1859, The Origin of Species, Chapter IV

I should infer from analogy that probably all organic beings which have ever lived on this earth have descended from some one primordial form, into which life was first breathed.

ibid, Chapter XIV

In Chapter 2, we primarily focused on the foundations of comparative genomics that come from evolutionary theory and only briefly summarized the evolutionary implications of genome comparisons. In this chapter, we address the connection between comparative genomics and evolution from a different angle. The question we ask is: how does comparative genomics affect our understanding of major aspects of the evolution of life? We believe that the effect is (or at least has the potential to be) truly profound. Perhaps most importantly, comparative genomics has already led to the reappraisal of the central trends of genome evolution. Instead of the classic concept of relatively stable genomes, which evolve through gradual changes spread through vertical inheritance, we now have the new notion of “genomes in flux” [787]. According to this concept, evolution involves gene loss and horizontal gene transfer as major forces shaping the genome, rather than isolated incidents of little consequence.

This new picture of the evolutionary process is incomparably more complicated than the classic one but, in addition to revealing the true complexity of the phenomena that need to be analyzed to understand evolution, genomics provides the data that are required for this analysis. The genomes threaten to uproot the Tree of Life [667], but in the end, they may help build a better, more realistic tree. The new methods taking full advantage of the wealth of information contained in genome sequences are only starting to emerge. Most of the theoretical and algorithmic developments clearly lie ahead, which makes the field of evolutionary genomics particularly exciting. The availability of genome sequences from many diverse phylogenetic lineages provides for the possibility of reconstructing genomes of ancestral life forms, including the Last Universal Common Ancestor (LUCA) of all extant life forms. Furthermore, even deeper reconstruction becomes feasible and we are starting to glimpse some aspects of the primordial RNA world.

6.1. The Three Domains of Life

In the mid-1970s, while studying some unusual groups of bacteria, thermophilic methanogens and halophiles, Carl Woese and colleagues came to the revolutionary conclusion that these organisms were not really bacteria but should be assigned to a separate domain (also called primary kingdom or superkingdom or urkingdom) of life with the same status as bacteria and eukaryotes. This group was originally referred to as archaebacteria and later renamed archaea [901,906]. The uniqueness of the archaea was apparent, even from some of their biochemical features, such as the unusual structure of lipids, but what really clinched the case was the topology of phylogenetic trees of 16S rRNA, which were first built using oligonucleotide catalogs (we are talking here about the pre-genomic and even pre-sequencing era!) and subsequently derived from complete RNA sequence alignments. These trees clearly indicated that archaea comprised a unique branch of life, distinct from both bacteria and eukaryotes. Furthermore, although, phenotypically, archaea are obviously prokaryotes, like bacteria, i.e. have small cells without nuclei or organelles, they are, in some important respects, closer to eukaryotes than to bacteria.

These eukaryote-like features of archaea include the structure of the ribosomes, which have a number of proteins shared with eukaryotes but not with bacteria, the presence of histones (in one of the two major branches of archaea), the organization of the basal transcriptional apparatus, with several transcription factors of the eukaryotic variety, and the organization of the DNA replication apparatus, which is also conserved in archaea and eukaryotes but not in bacteria [122].

Comparative analysis of the complete genomes of 13 archaea now available strongly supports their uniqueness as a distinct domain of life [540]. This becomes immediately apparent from a simple taxonomic classification of the COGs (Figure 6.1). As many as 315 COGs are unique to Archaea (~14% of the total number of COGs in which archaea are represented) and may be considered to compose the archaeal “genomic fingerprint” [307,540]. We should note that only 16 of these COGs are found in all archaea, a fact that will become important in the next section when we discuss horizontal gene transfer and gene loss.
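The taxonomic classification of COGs described above amounts to reducing each phyletic pattern to the set of domains it touches. The following is a minimal sketch of that bookkeeping, not the Phyletic Pattern Search tool itself. The genome codes, the `DOMAIN_OF` table, and the toy COGs are made up for illustration and are not real COG data.

```python
from collections import Counter

# Hypothetical genome codes mapped to their domain:
# A = archaea, B = bacteria, E = eukaryotes.
DOMAIN_OF = {
    "Aful": "A", "Mjan": "A",
    "Ecol": "B", "Bsub": "B",
    "Scer": "E",
}

def domain_signature(genomes):
    """Return the domains represented in a COG as a sorted string, e.g. 'AB'."""
    return "".join(sorted({DOMAIN_OF[g] for g in genomes}))

def venn_counts(cogs):
    """Count how many COGs fall into each partition of the three-domain Venn diagram."""
    return Counter(domain_signature(members) for members in cogs.values())

def archaeal_core(cogs):
    """Return the COGs that are represented in every archaeal genome of the table."""
    archaea = {g for g, d in DOMAIN_OF.items() if d == "A"}
    return sorted(name for name, members in cogs.items() if archaea <= members)

# Toy phyletic patterns (illustrative only).
toy_cogs = {
    "COG_a": {"Aful", "Mjan"},
    "COG_b": {"Aful", "Ecol", "Bsub"},
    "COG_c": {"Mjan", "Scer"},
    "COG_d": {"Aful", "Mjan", "Ecol", "Scer"},
    "COG_e": {"Ecol", "Bsub"},
}

counts = venn_counts(toy_cogs)
core = archaeal_core(toy_cogs)
```

In this toy set, `counts["A"]` plays the role of the archaea-specific "fingerprint" partition, and `core` picks out the subset found in all archaeal genomes, which is the distinction drawn above between the 315 archaea-specific COGs and the 16 of them present in every archaeon.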

Figure 6.1. Distribution of the COGs in the three domains of life. The data were obtained using the Phyletic Pattern Search tool of the COG system. The partitions of the Venn diagram are not to scale.

A complementary analysis shows that the sequence similarity of archaeal proteins to homologs from other archaea is, on average, much greater than their similarity to homologs from bacteria or eukaryotes (Figure 6.2).
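A breakdown of the kind shown in Figure 6.2 can be tabulated by counting, for a series of score cutoffs, how many database hits from each domain exceed the cutoff. The sketch below assumes the hits are already available as (domain, score) pairs; the scores and cutoffs are invented for illustration and do not come from the Archaeoglobus fulgidus analysis.

```python
def hits_above(hits, thresholds):
    """For each threshold, count the hits per domain whose score is strictly greater."""
    table = {}
    for cutoff in thresholds:
        row = {}
        for domain, score in hits:
            if score > cutoff:
                row[domain] = row.get(domain, 0) + 1
        table[cutoff] = row
    return table

# Toy (domain, score) pairs, illustrative only.
toy_hits = [
    ("archaea", 420.0),
    ("archaea", 150.0),
    ("bacteria", 180.0),
    ("bacteria", 60.0),
    ("eukaryota", 75.0),
]

table = hits_above(toy_hits, [50, 100, 200])
```

Reading the rows from the lowest cutoff to the highest shows which domain retains hits as the similarity requirement becomes more stringent.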

Figure 6.2. Taxonomic breakdown of the database hits for the proteins of the archaeon Archaeoglobus fulgidus. The vertical axis shows the number of hits with a score greater than the value indicated on the horizontal axis (from the Entrez Genomes division web site: ...

In a memorable phrase of W. Ford Doolittle and colleagues, archaea are “bacterial in shape and eukaryotic in content” [950]. “Shape” here means primarily the prokaryotic features of the cellular organization, particularly the small size and the absence of nuclei and cytoskeleton. Genomic comparisons, however, show that much of the archaeal content, i.e. the apparent phylogenetic affinities of the genes, is bacterial, too [466]. This becomes clear from the inspection of the data in Figure 6.2: there are many more bacterial similarities, and they are, on average, considerably stronger, than eukaryotic similarities, among archaeal proteins. This notion of a rift between the “bacterial” and “eukaryotic” components of the archaeal gene complement is supported in a dramatic fashion by COG analysis. The number of COGs that include archaeal and bacterial proteins but not eukaryotic ones is almost 10 times greater than the number of archaeal-eukaryotic COGs without bacterial members (Figure 6.1)! This result might be biased because the COG set we analyzed includes only three eukaryotes with small genomes (the yeasts Saccharomyces cerevisiae and Schizosaccharomyces pombe and the microsporidian Encephalitozoon cuniculi), but the bias could not be too significant because there are very few genes uniquely shared by archaea and multicellular eukaryotes.

This prevalence of archaeo-bacterial COGs seems to emphasize the existence of what might be called a common gene pool shared by these two domains of life; this has much to do with the prevalence of horizontal gene transfer as discussed in the next section.

It is also notable that, when a similar breakdown is produced for the set of 310 COGs that are represented in all 13 available archaeal genomes and thus compose the conserved core of archaeal genes, the number of COGs shared exclusively with eukaryotes becomes somewhat greater than the number of archaeo-bacterial COGs (Figure 6.3). Thus, as already noticed in the early days of archaeal genomics, there is a major “eukaryotic” component in the conserved core of the archaeal genomes, whereas the “variable shell” is overwhelmingly bacterial [540].

Figure 6.3. Taxonomic breakdown of the conserved archaeal COGs. The analysis was done for the 310 COGs represented in all 13 archaeal genomes. The purple sector on top represents 16 COGs that include exclusively archaeal proteins. Other sectors show, clockwise: COGs ...

Of course, the distinction between the “eukaryotic” and “bacterial” components of the archaeal genomes is not only quantitative. The distributions of (predicted) functions in the archaeo-eukaryotic and the archaeo-bacterial subsets of COGs are strikingly different as shown here for the conserved archaeal core of 310 COGs (Figure 6.4).

Figure 6.4. Protein functions in the archaeo-eukaryotic and archaeo-bacterial subsets of the conserved archaeal core (310 COGs total, Figure 6.3B). A, archaea; B, bacteria; E, eukaryotes. Information: proteins involved in replication, repair, transcription, and ...

The great majority of archaeo-eukaryotic COGs in the conserved core represent information processing functions, i.e. DNA replication, transcription, and translation (in reality, this fraction is probably even greater because some of the poorly characterized proteins in this set are most likely to function in translation, as discussed in Chapter 5).

In a stark contrast, the archaeo-bacterial subset is enriched in metabolic enzymes; those proteins in this subset that are implicated in information processing are largely transcription regulators: this part of archaeal biology is predominantly bacterial [42]. Thus, it appears that archaea are “eukaryotic” in their basal information processing systems and “bacterial” in metabolism and much of cell biology [540,703].

There are at least two major inferences that seem to follow from these complex relationships between the protein sets encoded in the genomes from the three domains of life: (i) archaea and bacteria share a substantial gene pool, part of which is ancient heritage of the common ancestor of these two domains, and part is the result of HGT; and (ii) there is a small but critically important core of proteins, primarily involved in information processing, that reflects shared history of archaea and eukaryotes. In the remaining sections of this chapter, we provide more perspective on each of these issues.

6.2. Prevalence of Lineage-specific Gene Loss and Horizontal Gene Transfer in Evolution

Horizontal gene transfer (HGT) and lineage-specific gene loss are tightly linked phenomena. As we will see shortly, any observable phyletic pattern could be potentially explained by gene loss, by HGT or through a combination of both types of events. However, the status of the two types of events in the molecular evolutionary literature is quite different. While gene loss is fully accepted as a common evolutionary phenomenon, widespread occurrence of HGT is still contested. No one denies that it occurs, in principle, but there is serious and sometimes heated debate on its extent [479]: is the frequency of HGT comparable to that of gene loss, or is HGT several orders of magnitude less frequent? There is, indeed, a rather good reason to assess the two phenomena differently because there are well-defined situations in evolution where gene loss cannot be reasonably questioned. These include evolution of parasites, which obviously have lost large parts of their original gene sets, and evolution of free-living heterotrophs as well [55,114,425,581] (see also 6.4 and Chapter 7). There are no such obvious smoking guns for massive HGT. Besides, gene loss presents much less of a problem from the point of view of evolutionary theory because no special selective advantage needs to be postulated for a gene loss event: as soon as a gene becomes dispensable, it may as well be eliminated. In contrast, the fixation of a gene acquired by HGT in the genome of its new host is very much in need of an adaptive explanation.

We believe, however, that strong evidence of large-scale HGT between phylogenetically distant species does exist. Such evidence seems to be provided by clear correlations between similarity in organisms' lifestyles and the apparent number of genes they exchange via HGT. This notion first came to the fore when it was shown that the hyperthermophilic bacterium Aquifex aeolicus (the first genome of a bacterial thermophile to be sequenced) contained significantly (with good statistical support) more “archaeal” genes (that is, genes that are either missing in bacteria altogether or are more closely related to archaeal than to bacterial orthologs) than any other bacterial genome available at that time [52]. Subsequently, similar observations have been reported for another bacterial hyperthermophile, Thermotoga maritima [610]. The genomes of these bacterial hyperthermophiles encode ~15%–20% proteins of probable archaeal descent compared to 1%–5% in mesophiles [462]. A lineage-specific gene loss explanation has been proposed even for these observations, under the notion that Aquifex and Thermotoga appear to be early-branching bacteria (see next section) and might have retained ancient thermophilic heritage that had been lost in the rest of bacteria as a result of only one loss event [483]. This argument does not appear to be particularly strong because it fails to account for the sharp divide, in terms of the apparent phylogenetic affinities, between these “archaeal” genes and the rest of the genomes of the bacterial hyperthermophiles, which consist of “garden variety” bacterial genes [53]. What seems to really clinch the case, however, is the fact that the same trend is seen in the recently sequenced Thermoanaerobacter tengcongensis [73]. This thermophilic bacterium belongs to the Bacillus-Clostridium group of Gram-positive bacteria but, nevertheless, has many more archaeal genes than its mesophilic cousins (Figure 6.5, see color plates). In this case, the gene loss explanation would necessarily require multiple, independent elimination of the same set of genes in many bacterial lineages and does not look plausible at all. The connection between an organism's lifestyle and HGT has been confirmed in the most dramatic fashion by the recent sequencing of the genomes of the mesophilic archaeal methanogens Methanosarcina acetivorans and Methanosarcina mazei [181]. In M. acetivorans, nearly 30% of the genes seem to be of bacterial origin, an order of magnitude more than in phylogenetically related hyperthermophilic archaeal methanogens (Figure 6.6 in color plates). As in the previous case, an explanation based on gene loss seems to be unrealistic, given the position of the methanogens in the archaeal tree [779].

Figure 6.5. Genomic maps of apparent phylogenetic affinities for two bacterial genomes. A: The hyperthermophile Thermoanaerobacter tengcongensis: 258 of 2588 proteins (10%) with significantly greater similarity to archaeal than to bacterial homologs. B: The mesophile ...

Figure 6.6. Genomic maps of apparent phylogenetic affinities for two archaeal methanogens. A: The hyperthermophile Methanopyrus kandleri. 98 of 1687 proteins (6%) with significantly greater similarity to bacterial than to archaeal homologs. B: The mesophile Methanosarcina ...

Using the significantly greater sequence similarity to homologs from a distant taxon compared to homologs from the “native” taxon, to which the given species belongs, as an argument for HGT is one of the so-called surrogate criteria for HGT detection [688]. Other such criteria include unusual phyletic patterns (more about this below), unexpected conservation of local gene order between distant species that might be indicative of transfer of entire operons, and anomalous nucleotide composition and codon usage [462,498,625,687,688]. These approaches have been dubbed surrogate because the “real” method for detecting HGT is supposed to be phylogenetic tree analysis. Suppose we have a set of orthologs (COG), which is represented in all archaea and in only one bacterial species, and, furthermore, the bacterial species is not equidistant from all archaeal orthologs but, when a tree is built, specifically clusters with a particular archaeal branch. This would have been an apparently irrefutable case for HGT. Such perfect situations are extremely rare. In the current COG database, there are only four COGs with that exact phyletic pattern, and only two of them seem to produce the desirable tree topology (EVK, unpublished observations). Thus, under a more relaxed criterion, any statistically supported “paradoxical” clustering in a tree would strongly suggest the HGT case. The general difficulty with this approach is that phylogenetic trees, even those constructed with the most powerful modern methods (see 6.3), are often ambiguous (claims of “rigor” often found in the literature notwithstanding). This becomes much more of a problem when attempts are made to construct trees (and underlying alignments) automatically and on a large scale. A recent attempt of a genome-scale phylogenetic study, aimed at the characterization of what the authors called the “phylomes” of several prokaryotic species (i.e. the sets of phylogenetic trees for all genes that are sufficiently conserved in evolution to allow tree construction), revealed many instances of unexpected clustering, which suggests widespread HGT [769]. However, it is hard to assess the reliability of this result.
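The first surrogate criterion mentioned above, greater similarity to homologs from a distant taxon than to those from the native taxon, can be sketched as a simple filter over best-hit scores. This is only an illustration under stated assumptions: the input layout, the function name `flag_candidates`, and the `min_ratio` cutoff are hypothetical, and a fixed score ratio is a crude stand-in for the statistical support that the cited analyses require.

```python
def flag_candidates(best_scores, native, distant, min_ratio=1.5):
    """Flag proteins whose best hit in the distant taxon clearly beats the native one.

    best_scores maps protein -> {taxon: best similarity score in that taxon}.
    A protein is flagged if it has a distant-taxon hit and either has no
    native-taxon hit at all or the distant score is at least min_ratio times
    the native score (min_ratio is an arbitrary, illustrative cutoff).
    """
    flagged = []
    for protein, scores in best_scores.items():
        native_score = scores.get(native)
        distant_score = scores.get(distant)
        if distant_score is None:
            continue  # no homolog in the distant taxon, nothing to flag
        if native_score is None or distant_score >= min_ratio * native_score:
            flagged.append(protein)
    return sorted(flagged)

# Toy best-hit scores for four proteins of a bacterial genome (illustrative only).
toy_scores = {
    "p1": {"bacteria": 300.0, "archaea": 90.0},   # ordinary bacterial gene
    "p2": {"bacteria": 100.0, "archaea": 220.0},  # closer to archaeal homologs
    "p3": {"archaea": 150.0},                     # no bacterial homolog at all
    "p4": {"bacteria": 80.0},                     # no archaeal homolog
}

candidates = flag_candidates(toy_scores, native="bacteria", distant="archaea")
fraction = len(candidates) / len(toy_scores)
```

The resulting `fraction` corresponds to the kind of percentage quoted for the hyperthermophilic bacteria, although any real estimate depends on the database composition and on the significance test used.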

The confidence that an unusual tree topology actually reflects HGT may be bolstered when there is support from an independent line of evidence. In a much smaller benchmarking study, Itai Yanai, Yuri Wolf, and one of the authors (EVK) investigated the evolutionary scenarios for gene fusions, aiming to distinguish cases of dissemination of fused genes via HGT from vertical inheritance accompanied by fission in some lineages and from independent evolution of the same fusion on two or more occasions [931]. To this end, fusions (two-domain proteins) were split into the component domains, and phylogenetic trees were constructed for each of the corresponding orthologous sets, including both fusion components and products of stand-alone genes from other species. The topologies of the resulting trees were compared to each other and to the topology of a tree made from a concatenated alignment of ribosomal proteins, which was treated as the species tree (see next section). The distribution of the fusion components in the phylogenetic trees for orthologous clusters would follow the phylogeny of the species that have the fusion if the fusion events occurred more than once independently or were vertically inherited, perhaps followed by fission in some lineages. In contrast, if the fusion gene disseminated via HGT, fusion components are expected to form odd clusters different from those in the species tree. When the trees for the two fusion components agree, the case for HGT becomes strong. The conclusion: of the ~50 analyzed fusion proteins that are present in both bacteria and archaea, ~2/3 have spread via HGT.
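The comparison of component trees with the species tree can be illustrated by listing the clades of each gene tree that do not occur in the species tree and then asking which of these "odd" clades are shared by both fusion components. The sketch below assumes rooted trees written as nested tuples of species names and ignores branch support, so it simplifies the analysis described above considerably; the tree shapes and species labels are invented.

```python
def clades(tree):
    """Return the set of leaf sets (as frozensets) of all internal nodes of a rooted tree."""
    found = set()

    def visit(node):
        if isinstance(node, str):          # a leaf is a species name
            return frozenset([node])
        members = frozenset().union(*(visit(child) for child in node))
        found.add(members)
        return members

    visit(tree)
    return found

def odd_clades(gene_tree, species_tree):
    """Clades of the gene tree that are absent from the species tree."""
    return clades(gene_tree) - clades(species_tree)

# Toy species tree: three bacteria (B1-B3) and two archaea (A1, A2).
species_tree = ((("B1", "B2"), "B3"), ("A1", "A2"))

# Toy trees for the two components of a fusion protein; in both,
# archaeon A1 groups with bacterium B2 instead of with A2.
tree_n = (((("B2", "A1"), "B1"), "B3"), "A2")
tree_c = ("A2", ("B3", ("B1", ("A1", "B2"))))

shared = odd_clades(tree_n, species_tree) & odd_clades(tree_c, species_tree)
```

When both component trees contain the same unexpected grouping, as with the A1 and B2 pair here, the two trees agree with each other while disagreeing with the species tree, which is the situation described above as a strong case for HGT.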

Anecdotal studies that support HGT by means of phylogenetic analysis have been sufficiently numerous to conclude that this phenomenon had a major role in evolution, at least as far as prokaryotes are concerned. It seems that HGT cuts across the range of biological functions, although, among genes coding for core proteins of translation, transcription, and replication, transfers probably are less common. Aminoacyl-tRNA synthetases (aaRS) are a notable exception: numerous cases of probable HGT have been detected for these essential enzymes involved in translation [190,330,907,909]. Although aaRS generally follow the “standard model” of evolution, with the original split leading to the separation of the bacterial and archaeo-eukaryotic lines of descent, strong evidence of HGT has been detected for at least 17 of the 20 aaRS specificities [909]. Figure 6.7 illustrates HGT for two aaRS families. The evolution of glutamate and glutamine aaRS (Figure 6.7A) involves one of the most spectacular cases of non-orthologous gene displacement (see also Table 2.3). Most of the bacteria and archaea do not have an aaRS for glutamine (Q-RS); instead, glutamine is formed by transamidation of glutamate-tRNA(Gln), whose formation is catalyzed by the so-called non-discriminating glutamate-RS (E-RS) [854,855]. Eukaryotes and at least some gamma-proteobacteria lack the enzymatic complex responsible for transamidation and instead encode Q-RS.

Figure 6.7. Phylogenetic trees for two families of aminoacyl-tRNA synthetases. A (top panel), glutamate and glutamine. B (bottom panel), tryptophan. The small red circles show bootstrap support >70%; the large black circle indicates the likely root position ...

This is one of the rather rare situations where the entire evolutionary scenario seems to be clear from the tree. Q-RS apparently evolved via a duplication of E-RS at an early stage of eukaryotic evolution and was subsequently acquired by gamma-proteobacteria, which was followed by obliteration of the ancient transamidation machinery (according to the general scenario for non-orthologous gene displacement discussed in Chapter 2). This sequence of events is dictated by the tree topology in Figure 6.7A, where the gamma-proteobacterial branch is within the archaeo-eukaryotic cluster. The opposite direction of HGT, from gamma-proteobacteria to eukaryotes, would have put eukaryotes in the midst of the bacterial cluster. The direction of the single HGT event for tryptophanyl-RS (W-RS) is equally certain from Figure 6.7B: the archaeon P. horikoshii acquired W-RS from a eukaryote via HGT; the alternative, namely acquisition of W-RS by an early eukaryote from this particular archaeal species, is unrealistic, if only because the divergence of eukaryotes certainly predates the divergence of pyrococcal species. This case is a clear manifestation of the phenomenon of xenologous gene displacement, the variant of HGT when an ortholog from a distant lineage (xenolog) displaces the “native” gene in a given genome [462]. For essential genes, xenologous gene displacement must involve acquisition of the “alien” gene, followed by a period of co-existence of the “native” and “alien” forms, and then elimination of the native one. For non-essential genes, an alternative is conceivable whereby the “native” gene is lost first and the “alien” gene is acquired subsequently.

The distribution of apparent HGT events among different functional categories of genes has been interpreted in terms of the so-called complexity hypothesis, which posits that genes coding for protein subunits of macromolecular complexes or, more generally, proteins involved in a wide range of interactions, are less subject to HGT [390].

There is indeed strong logic behind this concept because, unless subunits of a complex are encoded in the same operon and are transferred together, a gene coding for just one subunit of a complex is unlikely to get fixed in the recipient genome. In particular, there seems to be relatively little HGT involved in the evolution of the genes for ribosomal proteins, components of the paramount molecular machine of the cell. However, even in this sanctum of vertical evolution, detailed phylogenetic analysis revealed several instances of HGT, some of which involved the gene for S14, an essential protein located “in the heart of the ribosome” [119,120,542].

It seems that, as far as HGT is concerned, everything is for sale, at least in the prokaryotic world; it is just the price (i.e. the likelihood of HGT) that differs for different genes.

The apparent major role of HGT in the early evolution of eukaryotes is discussed below in 6.4. The possibility of acquisition of bacterial genes via HGT relatively late in eukaryotic evolution, namely after the emergence of vertebrates, has become a subject of controversy after it was brought up in the report of the International Human Genome Consortium on the draft sequence of the human genome [463,488,678]. In this report (with a direct contribution from one of the authors of this book, EVK), it was noticed that proteins encoded by 113 human genes either had no detectable eukaryotic homologs outside the vertebrate lineage or showed significantly higher similarity to the bacterial homologs than to homologs from non-vertebrate eukaryotes. The history of these genes was proposed to have involved either lineage-specific gene loss or HGT or both, with HGT considered to be more likely because multiple losses were required to explain the observed phyletic patterns. Moreover, the direction of transfer was thought to be from bacteria to vertebrates because most of these genes were widespread in bacteria. This hypothesis was sharply criticized by three independent groups, who observed that many of the genes in question were present in lower eukaryotes, in particular the slime mold Dictyostelium discoideum (as shown by searching the slime mold EST database that was not screened in the original study). Moreover, all eukaryotic members of the respective families tended to cluster together in phylogenetic trees ([706,737,799], see also discussion in [29]). All these authors concluded that gene loss, even in multiple lineages, was a much more likely explanation for the observed patterns, whereas acquisition of bacterial genes by vertebrates via HGT was extremely rare, if it occurred at all. 
These results certainly emphasize the caution that is needed in the interpretation of indications for HGT in the absence of a representative set of genomes from the major lineages of eukaryotes (and, we should note parenthetically, the importance of database integration; see Chapter 3). However, we believe that the jury is out on the evolutionary history of these genes, and the data can be explained by several alternative scenarios. These include multiple HGT events and a combination of a relatively early gene acquisition from bacteria, e.g. by the common ancestor of Dictyostelium and animals, if a phylogeny including such an ancestor is accepted, with subsequent multiple gene losses. It is worth mentioning that analysis of the sequence of Dictyostelium chromosome 2, which appeared after the bulk of this manuscript had been completed, showed specific, high similarity to vertebrate homologs for numerous slime mold proteins [288]. We cannot fully assess the significance of these observations because much more analysis is required, but it seems that evolution of eukaryotes has profound mysteries in store.

Figure 6.8 (see color plates) shows the phylogenetic tree for one of the genes identified as possible vertebrate-specific horizontal transfers in the human genome report. This gene attracted special attention because its product, monoamine oxidase (MAO), is an enzyme involved in the metabolism of an essential neuromediator and a drug target in psychiatric disorders [2]. With regard to the origin of the particular subfamily that includes the classic human MAO, the tree topology is indeed compatible with acquisition of a bacterial gene by a common ancestor of slime mold and animals. It is striking, however, that four distinct branches of this family of enzymes each include both bacterial and vertebrate proteins (branches 1–4 in Figure 6.8). This suggests that both multiple HGT events and multiple losses contributed to the evolution of this single family of enzymes. Such complex scenarios might not be uncommon in evolution and are likely to come up repeatedly as the collection of sequenced genomes becomes increasingly more representative.

Figure 6.8. Phylogenetic tree of monoamine oxidases. MAOx and MAOy, uncharacterized predicted monoamine oxidases; LAO, L-amino oxidase; PAO, polyamine oxidase. The names of eukaryotic species in the MAOA-MAOB-Dictyostelium group are colored red, other eukaryotic species ...

Perhaps one of the most interesting directions in the analysis of the new reality of genome evolution, which includes numerous gene loss and HGT events, involves development of algorithms for explicit reconstruction of evolutionary scenarios and ancestral gene sets. The simplest of such algorithms take into account only phyletic patterns and a species tree. The principle is illustrated in Figure 6.9, which shows a tentative species tree based on concatenated alignments of universal ribosomal proteins ([915], see next section) and the phyletic distribution of two COGs superimposed upon it.

Figure 6.9. Distributions of two COGs on a tentative species tree. The species with completely sequenced genomes (Table 1.4) are indicated by the first letter of the genus name and the first three letters of the species name. COG1747 (indicated by a rhomboid): uncharacterized ...

Consider, first, COG1747, which is represented in four species, two chlamydia and two spirochetes. Given this particular tree topology (again, see next section), there is no need to postulate any gene loss or HGT events to explain the evolution of this COG. The obvious, simplest evolutionary scenario will just hold that the COG emerged at the base of the chlamydia-spirochete clade and was inherited vertically ever since.

Consider, in contrast, COG2810. This COG does not map to a single, compact cluster of species (clade) in the tree and, accordingly, gene loss and/or HGT have to be invoked to explain its phyletic pattern, given the species tree. Suppose this protein first emerged (what exactly we mean by “emergence” of a protein family will be discussed in 6.4) in the bacterium Deinococcus radiodurans (or, more precisely, in the Deinococcus lineage). The observed distribution then could be accounted for by postulating two independent horizontal transfers into different archaeal lineages. The alternative, based solely on gene loss, would require elimination of this gene in five bacterial and five archaeal lineages, 10 evolutionary events altogether (the reader may want to reproduce this exercise if interested). Mixed scenarios are also imaginable. For example, the COG might have appeared first in the ancestral archaeal node indicated by an arrow in Figure 6.9; then, it would take three losses in archaea and one HGT event (four events total) to account for the observed phyletic pattern.
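The loss-only alternative can be counted mechanically: place the emergence of the COG at the last common ancestor of the species that carry it, then count every maximal subtree below that ancestor in which the COG is entirely absent as one loss. The sketch below does this for a rooted species tree written as nested tuples. The eight-species tree is a toy example and is not the tree of Figure 6.9, so the numbers it yields are not those quoted for COG2810.

```python
def leaf_set(tree):
    """Return the set of species names under a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaf_set(child) for child in tree))

def losses_only(tree, present):
    """Minimal number of losses if the gene arose once, at the last common
    ancestor of the species that carry it, and was never transferred."""
    if not present:
        raise ValueError("the phyletic pattern must contain at least one species")

    # Walk down from the root to the last common ancestor of the carriers.
    node = tree
    while not isinstance(node, str):
        holders = [child for child in node if leaf_set(child) & present]
        if len(holders) != 1:
            break
        node = holders[0]

    def count(n):
        if not (leaf_set(n) & present):
            return 1  # the whole subtree lacks the gene: one loss on its stem
        if isinstance(n, str):
            return 0  # a leaf that carries the gene
        return sum(count(child) for child in n)

    return count(node)

# Toy species tree with eight species in two clades of four (illustrative only).
toy_tree = ((("a", "b"), ("c", "d")), (("e", "f"), ("g", "h")))

compact = losses_only(toy_tree, {"a", "b"})    # pattern maps to a single clade
scattered = losses_only(toy_tree, {"a", "e"})  # pattern split between two clades
```

A compact pattern, like that of COG1747, requires no losses, whereas a scattered pattern forces one loss for every gene-free subtree under the common ancestor.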

In this fashion, it is possible to compute, for each COG, the minimal number of events that is required to reconcile the phyletic pattern with a given species tree; we designate this measure Incompatibility Quotient. For a given COG i,

$$I_i = l_i + g\,h_i \qquad (6.1)$$

where $l_i$ is the number of losses and $h_i$ is the number of HGT events in the minimal (most parsimonious) evolutionary scenario for the given COG, and g is the “HGT penalty”. Assuming g = 1, we will obtain the global minimum of I for each COG given a species tree.
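As a worked illustration, assume that the Incompatibility Quotient has the form I = l + g·h implied by the definitions above, with l losses, h HGT events, and HGT penalty g. The three scenarios described for COG2810 (two transfers and no losses; ten losses and no transfers; three losses and one transfer) then score as follows:

```latex
\begin{aligned}
I_{\text{HGT only}}  &= 0 + g \cdot 2 = 2g, \\
I_{\text{loss only}} &= 10 + g \cdot 0 = 10, \\
I_{\text{mixed}}     &= 3 + g \cdot 1 = 3 + g .
\end{aligned}
```

With g = 1 the values are 2, 10, and 4, so the HGT-only scenario is the most parsimonious of the three. Among these three scenarios only, the HGT-only one remains the cheapest while 2g < 3 + g, that is, for g < 3; the mixed scenario is cheapest for 3 < g < 7; and the loss-only scenario wins once g > 7. Other scenarios not listed in the text could change these thresholds.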

Such calculations for the current COG collection [951] lead to a striking result: the history of a COG, on average, apparently included 3–5 loss or HGT events; only ~14% of the COGs seem to have evolved without these events. We have already seen that most of the COGs include a rather small number of species (Figure 2.8). The I-value calculations illustrated in Figure 6.10 show that this pattern, to a large extent, has been shaped by gene loss and HGT, rather than simply by late emergence of COGs.
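The minimization behind such I values can be sketched with a small-parsimony (Sankoff-style) recursion on the species tree, in which each loss costs 1, each HGT gain costs g, and the single emergence of the gene is free. This is an assumption-laden illustration and not the procedure used for the calculations cited above: the function name `min_events`, the nested-tuple tree, and the device of trying every node as the point of emergence are all choices made here for brevity.

```python
import math

def min_events(tree, present, g=1.0):
    """Minimal l + g*h over scenarios with a single, free emergence of the gene.

    tree is a rooted tree of nested tuples with unique species names as leaves;
    present is the set of species that carry the gene (the phyletic pattern).
    """

    def subtrees(node):
        yield node
        if not isinstance(node, str):
            for child in node:
                yield from subtrees(child)

    def sankoff(node, origin):
        """Return (cost if the gene is absent at node, cost if it is present)."""
        if isinstance(node, str):
            return (math.inf, 0.0) if node in present else (0.0, math.inf)
        cost_absent = cost_present = 0.0
        for child in node:
            child_absent, child_present = sankoff(child, origin)
            # A gain on the branch into `origin` is the emergence itself: free.
            gain = 0.0 if child == origin else g
            cost_absent += min(child_absent, gain + child_present)
            cost_present += min(child_present, 1.0 + child_absent)  # 1.0 per loss
        return cost_absent, cost_present

    # Option 1: the gene was already present at the root.
    best = sankoff(tree, None)[1]
    # Option 2: the root lacks the gene and it emerges on the branch into `origin`.
    for origin in subtrees(tree):
        if origin is not tree:
            best = min(best, sankoff(tree, origin)[0])
    return best

# Toy species tree and phyletic patterns (illustrative only).
toy_tree = ((("a", "b"), ("c", "d")), (("e", "f"), ("g", "h")))

i_compact = min_events(toy_tree, {"a", "b"})          # one clade, no events needed
i_low_g = min_events(toy_tree, {"a", "e"}, g=1.0)     # one transfer is cheapest
i_high_g = min_events(toy_tree, {"a", "e"}, g=10.0)   # four losses are cheapest
```

For the scattered toy pattern the result switches from a single transfer to four losses as g grows, which is why the choice of the HGT penalty matters so much for any reconstruction of this kind.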

Figure 6.10. Number of gene loss and HGT events in most parsimonious evolutionary scenarios for COGs (I values).

Using this approach, fairly straightforward algorithms can be developed to reconstruct the ancestral gene sets for each internal node of the species tree and to assign a series of events to each branch, thus effectively producing the evolutionary scenario for the genomes themselves ([787, 951]). The difficulty lies in determining the correct value of the HGT penalty (the g parameter in equation 6.1). It appears plausible that gene loss is more likely to be fixed in evolution, as briefly discussed in the beginning of this section (see also [787]), i.e. we should assume g > 1; however, we have no data to determine, even approximately, just how much more likely one type of event is compared to the other. This seems to be a critical parameter for the study of genome evolution.
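To make the idea concrete, the minimal number of weighted events for one phyletic pattern can be found with a standard weighted-parsimony (Sankoff-style) dynamic program. The sketch below is a generic illustration, not necessarily the algorithm used in [787, 951]. It assumes a rooted binary species tree written as nested tuples of leaf names, a cost of 1 per loss and g per gain, and that the first gain (the emergence of the family) is free, so that only additional gains count as HGT. All names are hypothetical.

```python
# Weighted parsimony for one phyletic pattern on a rooted species tree.
# Assumptions (not from the text): the tree is a nested tuple of leaf names,
# a loss costs 1, a gain costs g, and the first gain (emergence) is free.

INF = float("inf")

def subtree_costs(node, present, g):
    """Return (cost if gene absent at node, cost if gene present at node)."""
    if isinstance(node, str):                      # leaf: state is observed
        return (INF, 0.0) if node in present else (0.0, INF)
    absent_cost, present_cost = 0.0, 0.0
    for child in node:
        c_absent, c_present = subtree_costs(child, present, g)
        # Parent lacks the gene: child also lacks it, or gains it (cost g).
        absent_cost += min(c_absent, c_present + g)
        # Parent has the gene: child keeps it, or loses it (cost 1).
        present_cost += min(c_present, c_absent + 1.0)
    return absent_cost, present_cost

def incompatibility_quotient(tree, present, g=1.0):
    """Minimal l + g*h over all scenarios; `present` must be non-empty."""
    root_absent, root_present = subtree_costs(tree, present, g)
    # If the gene is absent at the root, one of the gains is the emergence
    # of the family and is not charged as HGT.
    return min(root_present, root_absent - g)

tree = (("A", "B"), ("C", "D"))
print(incompatibility_quotient(tree, {"A", "B"}))         # compact clade
print(incompatibility_quotient(tree, {"A", "C"}, g=1.0))  # patchy pattern
print(incompatibility_quotient(tree, {"A", "C"}, g=3.0))
```

A pattern confined to a single clade needs no loss or HGT, whereas the patchy pattern is explained by one transfer when g = 1 and by two losses when g = 3. Recording which branch of each `min` was taken would yield the ancestral states and the events assigned to each branch.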

One way to approach the problem is to use the feedback from the results of the reconstruction of the ancestral gene sets. Snel and colleagues [787] found that, with g = 1, the reconstructed genomes of early ancestors, in particular, the Last Universal Common Ancestor (LUCA), which corresponds to the root of the species tree, would include unrealistically few genes. In contrast, at high g values, which push HGT out of the picture, the hypothetical ancestors (counter-intuitively) grow in size as one moves toward the tree root, making LUCA an “omnipotent” organism with a huge number of genes (see also discussion in the next section). Thus, in principle, one could attempt to estimate the number of genes in ancestral forms from independent biological considerations, and this could lead to reasonable estimates of the g value. Of course, as soon as “biological considerations” come into play, the speculative element of the reconstruction becomes worrisome. Another approach to resolving the loss versus HGT problem is to analyze both the phyletic patterns and the actual phylogenetic trees for each COG in conjunction. This approach is hampered by ambiguities in tree topologies and by algorithmic problems. Nevertheless, we tend to believe that eventually these and other, more sophisticated computational approaches will give us a description of genome evolution that will be immensely richer and, at the same time, more definitive than anything that could be imagined in the pre-genomic era.

6.3. The Tree of Life: Before and After the Genomes

6.3.1. Phylogenetic trees in the pre-genomic era

The concept of a tree of life depicting phylogeny (rather than simply showing classification of organisms in a convenient form), with leaves corresponding to extant species and nodes to extinct ancestors, was pioneered by Charles Darwin and is embodied in his famous single illustration of The Origin of Species [170]. The earliest attempts to populate the tree with real biological entities are associated with the name of Ernst Haeckel [327]. Haeckel's tree and other early trees were based on the general notion of a hierarchy of relationships between species and higher taxa. Gradually, quantitative criteria were developed to measure the degree of morphological difference, which was generally assumed to reflect evolutionary distance. Another major direction in evolutionary tree construction is cladistics, founded by Willi Hennig, a methodology that employs inferred shared derived characters to identify monophyletic lineages, or clades [760].

The possibility of using molecular sequences for phylogenetic tree construction was first suggested by Francis Crick in his ground-breaking 1958 paper [163] and realized by Emile Zuckerkandl and Linus Pauling, who built their trees using aligned sequences of the only two protein families for which enough sequence information was available at the time, the cytochromes c and the globins [946]. This seminal analysis was done under the molecular clock hypothesis, introduced by Zuckerkandl and Pauling. The central postulate of this hypothesis is that a given gene evolves at a constant rate as long as the function of the gene product remains unchanged. With the accumulation of numerous sequences from diverse species, phylogenetic studies showed that, even if the molecular clock could be adopted as a useful null hypothesis, it is violated all too often, which may easily result in incorrect tree topologies [310]. This inherent problem of phylogenetic tree analysis, together with the necessity for developing measures of evolutionary distance that take into account multiple substitutions in the same site and rate variation between sites within the same gene, led to a plethora of increasingly sophisticated methods for tree construction. The commonly used phylogenetic approaches, each existing in a number of flavors, are distance methods, such as neighbor-joining and the least-squares method (Fitch-Margoliash), maximum parsimony, and maximum likelihood. Any detailed discussion of these methods is beyond the scope of this book, especially as several highly informative texts on molecular phylogenetics, of both theoretical and more practical inclination, have been published in the last few years [346,607,647].

In the early days of molecular phylogenetics, a gene tree was generally equated with the species tree. This implies the possibility of finding an optimal molecular marker for deciphering the history of life, and indeed, ribosomal RNA sequences became a de facto standard in molecular phylogenetics. As we already indicated in the beginning of this chapter, phylogenetic analysis of rRNA revolutionized our understanding of the history of life by establishing the three-domain “standard model” of evolution [901,906]. In addition, phylogenetic analysis of rRNAs brought “the winds of (evolutionary) change” to taxonomy by revealing, supporting, or correcting many major clades among bacteria, archaea, and eukaryotes [630]. It had been recognized for a long time that the exact tree topology depends on the employed phylogenetic method, but the very validity of the rRNA-based approach to species phylogeny was not seriously challenged in the pre-genomic era. This appeared to be the approach of choice to produce the true Tree of Life.

6.3.2. Comparative genomics threatens the species tree concept

The disturbing signs appeared soon after the number of gene families available for phylogenetic analysis became substantial. The problem was that different genes often yielded different trees. This incongruence between tree topologies invaded even the “holy of holies” of phylogenetic taxonomy, the three-domain standard model. In particular, archaeal genes systematically showed different phylogenetic affinities, the components of information-processing systems typically affiliating with eukaryotes, whereas metabolic enzymes and structural proteins displayed bacterial connections ([292,324]; also see above). Although some of the discrepancies could be explained away by pitfalls in phylogenetic tree construction procedures, it was becoming increasingly clear that important evolutionary reality was lurking behind incompatible topologies. All this was but a foreshadowing of things to come once multiple, complete genome sequences became available for comparison.

As outlined in the preceding section, systematic comparisons of complete gene sets showed, beyond reasonable doubt, that there was much more to evolution than vertical inheritance, with lineage-specific gene loss and HGT coming to the fore as major evolutionary phenomena, at least in the prokaryotic world. Phylogenetic tree analysis of multiple gene families sends the same message. A detailed study of 28 protein families from prokaryotes suggested that, after probable HGT cases were removed, there was no reliable phylogenetic signal left in the trees [834]. Similar results were obtained for proteins that compose the conserved core of archaeal genomes: they all showed greater conservation within the archaeal domain than outside it, but no clear consensus phylogeny for the archaea could be determined [611].

Thus, the possibility has been repeatedly considered that comparative genomics might undermine the very concept of a Tree of Life, at least as far as prokaryotic life is concerned [193–195,666,667] (since prokaryotes comprise two of the three primary domains, a Tree of Life without them is out of the question). Is it necessary to replace the tree representation of life's history with a network-like scenario? In the strict sense, this is certainly the case because, technically, even one HGT event makes a tree an incomplete depiction of the real course of evolution, and we have seen in the previous section that HGT has been widespread (difficulties with more precise estimates notwithstanding). However, genomics, which seems to “uproot” any simple-minded tree of life based on a single gene or a small group of genes, might also offer a way to salvage the concept itself, at least in a “weak” form. Soon after multiple genomes of bacteria, archaea, and eukaryotes had been sequenced, the idea emerged that phylogenetic analysis could be based not on a tree for selected molecules, e.g. rRNA, but (ideally) on the entire body of information contained in the genomes or on a rationally selected, substantial part of this information [914]. Below we briefly discuss different genome-based phylogenetic approaches (for the sake of brevity, we designate them “genome-tree” methods), and the first results in large-scale prokaryotic phylogeny brought about by the application of these methods.

6.3.3. Genome-trees—can comparative genomics help build a consensus?

The approaches to genome-tree construction and the main results obtained with each are briefly summarized in Table 6.1. The most obvious criterion for genome comparison is based on the analysis of gene content. Closely related species share a large proportion of genes; in contrast, distantly related species should have lost a substantial fraction of the genes inherited from their last common ancestor, rendering the proportion of shared genes low. If this process carries on in a regular fashion (i.e. inter-genomic distances based on gene repertoires can be mapped to time scale uniformly across lineages), it could be used for phylogenetic reconstruction.

Table 6.1. Genome-trees: methods and principal results.

The latter requirement raises an obvious objection. The notorious plasticity of prokaryotic genomes (see 6.2) results in gene content being malleable by selective pressures, both in terms of gene loss (e.g. in parasites) and gene acquisition via HGT (e.g. adaptation to extreme environments as in hyperthermophilic bacteria discussed above). As first presciently noted by Charles Darwin himself, traits that are subject to strong selection are less suitable for phylogenetic reconstruction than neutral traits because of highly non-uniform rates of change and the tendency for convergent evolution among the former. This suggests that gene content comparisons could be a relatively poor tool for studying prokaryotic phylogeny, although, if treated as a means to study lifestyle-related similarities and differences between genomes, rather than evolutionary relationships per se, this approach can produce interesting results.

To build trees, data on representation of genomes in orthologous gene sets are either used directly for different variants of parsimony analysis or are converted to evolutionary distances, which are then employed for building neighbor-joining or least-squares trees [230,786,835]. Comparison of trees produced with this approach shows that enough phylogenetic information is retained in gene repertoires to provide reliable classification on both ends of the evolutionary distance scale. Gene-content trees show good separation between the three primary domains and also consistently group together closely related species. However, at intermediate distances, i.e. where the relationships between major lineages are concerned, this approach seems to be less suitable for phylogenetic inference. It appears that the topology of the gene-content trees is determined largely by the relative amount of gene loss in different genomes. In particular, the main division in the bacterial branch is between the free-living and parasitic forms, which resulted in well-defined major lineages (e.g. Proteobacteria) being broken up [915]. This is readily explained by common trends in genome reduction under the selective pressure during the adaptation to parasitism in different lineages. Attempts to overcome this effect included simple removal of parasites from the species set used for tree construction [230,358] and normalization of the intergenomic distances by the number of genes in the smaller genome in each pair [470,786]. The latter method resulted in reasonable phylogenetic reconstructions, with most of the known major prokaryotic lineages recovered. In general, however, gene content analysis seems to have less resolving power than some of the other genome-tree approaches (see below).
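The normalization by the smaller genome can be illustrated with a short sketch. It assumes that each genome is represented as a set of orthologous group identifiers and that the distance is one minus the fraction of the smaller gene set that is shared; the cited studies may use a different transform, and the genome names and gene sets below are invented toy data.

```python
# Gene-content distance between genomes given as sets of orthologous group
# identifiers (e.g. COG ids). Assumption: distance = 1 - shared / size of the
# smaller gene set; published methods may use a different transform.

from itertools import combinations

def gene_content_distance(genes_a, genes_b):
    """1 - (shared orthologous groups) / (groups in the smaller genome)."""
    smaller = min(len(genes_a), len(genes_b))
    if smaller == 0:
        raise ValueError("both genomes must contain at least one gene")
    return 1.0 - len(genes_a & genes_b) / smaller

def distance_matrix(genomes):
    """Pairwise distances for a dict {genome name: set of group ids}."""
    return {
        (a, b): gene_content_distance(genomes[a], genomes[b])
        for a, b in combinations(sorted(genomes), 2)
    }

# Toy data: the reduced "parasite" retains a subset of the genes of its
# free-living relative, so normalization gives them a distance of zero.
genomes = {
    "free_living": {"COG1", "COG2", "COG3", "COG4", "COG5", "COG6"},
    "parasite": {"COG1", "COG2", "COG3"},
    "distant": {"COG1", "COG7", "COG8", "COG9"},
}
matrix = distance_matrix(genomes)
for pair, d in matrix.items():
    print(pair, round(d, 3))
```

Dividing by the size of the smaller genome keeps a reduced genome close to its larger relative, rather than letting the sheer amount of gene loss dominate the distance. The resulting matrix could then be passed to any neighbor-joining or least-squares implementation.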

More or less the same logic applies to genome comparisons based on gene order. Rearrangements continuously shuffle the genomes, gradually breaking ancestral gene strings. The operonic organization of a prokaryotic genome makes this a complex process. On the one hand, the selective advantage of physical proximity for co-regulation renders some gene arrays less prone to break-up than others, thus extending the range of evolutionary distances over which gene order conservation is detectable [491,916] (see also Chapter 5). On the other hand, operons are especially amenable to being transferred as a whole [494,495], which could accentuate the effect of HGT on the tree topology.

Given the limited conservation of gene order between phylogenetically distant genomes, the attempts to build trees on the basis of gene order included identification of shared gene pairs. The data on presence-absence of gene pairs in genomes were then subjected to either parsimony or distance analysis [470,915].

Generally, the results obtained with this approach are similar to those of gene content analysis, with a good separation between Archaea and Bacteria and correct clustering of closely related species but poor resolution at intermediate distances. The influence of HGT on the topology of the resulting trees was readily noticeable. Given the high rate of intragenomic rearrangements, comparison of gene orders, at least in theory, should work particularly well for resolving the phylogeny of closely related species [813].

Evolutionary distances between different pairs of orthologs in the given two genomes show a broad distribution [318]. In theory, this is due to the variability of mean protein evolution rates caused by the differences in the strength of selective constraints, which act on functionally distinct proteins. In practice, several other factors add to the rate variance, including sampling errors, incorrect identification of orthologs, and HGT. Nevertheless, if, for the majority of ortholog pairs, the time of divergence coincides with the divergence of species (i.e. HGT involves a minority of genes), it is reasonable to expect that the distance distribution retains enough phylogenetic information to be used for tree construction. On this premise, parameters of this distribution (preferably the median) for all pairs of compared genomes can be transformed into intergenomic evolutionary distances, which can then be used to construct neighbor-joining or least-squares trees [148,318,915]. This approach produced trees that were fairly robust in terms of correctly reproducing well-known lineages and also suggested the existence of several new ones.
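The conversion of ortholog distance distributions into intergenomic distances can be sketched as follows. The example assumes that the per-ortholog distances have already been computed for each genome pair; the function names and the numbers are hypothetical and serve only to illustrate why the median is a convenient summary.

```python
# Intergenomic distance as the median of per-ortholog evolutionary distances.
# Assumption: distances for each genome pair are already available as lists;
# all values below are invented toy numbers.

from statistics import median

def intergenomic_distance(ortholog_distances):
    """Median of the ortholog distance distribution for one genome pair."""
    if not ortholog_distances:
        raise ValueError("no ortholog pairs for this genome pair")
    return median(ortholog_distances)

def median_distance_matrix(pairwise):
    """Map {(genome_a, genome_b): [distances]} to {(genome_a, genome_b): median}."""
    return {pair: intergenomic_distance(d) for pair, d in pairwise.items()}

pairwise = {
    # The outlier 0.05 mimics a gene recently exchanged by HGT.
    ("A", "B"): [0.30, 0.35, 0.40, 0.45, 0.05],
    ("A", "C"): [0.90, 1.10, 1.00, 1.30],
}
matrix = median_distance_matrix(pairwise)
print(matrix)
```

Unlike the mean, the median is barely affected by a minority of aberrant ortholog pairs, which fits the premise that HGT and misidentified orthologs involve only a minority of genes.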

Traditional sequence-based phylogeny relies on gradual sequence change over time. The three main problems with using single genes (more precisely, orthologous sets) to infer a species tree are an insufficient number of informative sites, variability of evolutionary rates in different lineages, and the effect of HGT. The first two factors introduce (sometimes major) uncertainty into phylogenetic reconstructions; the third leads to gene phylogenies that are genuinely different from the (hypothetical) species phylogeny. In an attempt to overcome these pitfalls, one can concatenate many sequence alignments into one and use the combined long sequence for tree reconstruction. If the likelihood of HGT is reduced by a careful choice of genes, the trees reconstructed from such an alignment have the potential to provide good resolution. In agreement with this anticipation, phylogenetic analysis of concatenated sequences of ribosomal proteins (in some cases, with the addition of other proteins involved in translation, and/or with some proteins suspected to have undergone HGT removed), performed by three independent groups using different methods, produced trees that seemed to contain a strong phylogenetic signal [119,332,552,915]. The topology of these trees was generally compatible with that of the trees constructed using the median similarity between orthologs, but with a greater resolution, which allowed more confident prediction of several new prokaryotic lineages.
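The concatenation step itself is simple and can be sketched as below. The example assumes that each gene alignment is a dictionary mapping species names to aligned sequences of equal length and that a species lacking a gene is padded with gap characters; the sequences and names are invented toy data, and real analyses would also filter out genes suspected of HGT before concatenation.

```python
# Concatenate per-gene alignments into one long alignment ("supermatrix").
# Assumptions: each alignment is a dict {species: aligned sequence} with all
# sequences of equal length; a species missing from a gene is padded with gaps.

def concatenate_alignments(alignments, gap="-"):
    """Join alignments column-wise, one concatenated row per species."""
    species = sorted({name for aln in alignments for name in aln})
    parts = {name: [] for name in species}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within an alignment must have equal length")
        width = lengths.pop()
        for name in species:
            parts[name].append(aln.get(name, gap * width))
    return {name: "".join(chunks) for name, chunks in parts.items()}

# Toy alignments of two hypothetical ribosomal protein families.
family_1 = {"Ecoli": "MAVK", "Bsub": "MAIK", "Mjan": "MGVK"}
family_2 = {"Ecoli": "GQKV", "Mjan": "GHKI"}
supermatrix = concatenate_alignments([family_1, family_2])
for name, seq in supermatrix.items():
    print(name, seq)
```

The combined alignment contains more informative sites than any single family, which is the main reason why concatenation can improve resolution.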

Figure 6.11 shows a proposed phylogenetic tree of prokaryotes, in which we combined the results of the genome-tree analyses discussed above (Table 6.1), in particular, trees made using the median similarity between orthologs and those based on concatenated alignments, and attempted to depict the apparent consensus.

Figure 6.11. A consensus of genome-trees. Although all genome-tree methods produce unrooted trees, this tree is, for the sake of clarity, shown in a rooted form, with the root position forced between bacteria and archaea. New clades considered to be firmly established ...

At least two major clades that had not been described or, to our knowledge, even suspected to exist prior to the genome-tree studies are strongly supported by different types of analysis and appear reliable: (i) chlamydia-spirochetes among bacteria; and (ii) methanogens-pyrococci among Euryarchaeota. In addition, several other major groupings were supported by some but not other approaches and should be considered tentative for the moment. These are the clade comprising Cyanobacteria, Deinococcales, and Actinobacteria, its “grand unification” with the low-GC Gram-positive bacteria (the Bacillus-Clostridium group), and the clade including the hyperthermophilic bacteria Aquifex and Thermotoga.

Furthermore, the genome-tree analysis challenges the traditional phylogeny of archaea, according to which Crenarchaeota (a distinct archaeal lineage, which includes, among the sequenced genomes, Aeropyrum, Sulfolobus, and Pyrobaculum) is a sister group of Euryarchaeota (the rest of the archaea). In addition to the separation of these major archaeal branches in rRNA trees, they differ in certain fundamental aspects of their gene content and biology, including the presence of histones in Euryarchaeota but not in Crenarchaeota. Several of the genome-tree analyses did not support this division and placed the euryarchaeon Halobacterium in the archaeal root. However, the euryarchaeal topology emerged as a strong competitor [915] or even the winner [552]. Clearly, the resolution of these major evolutionary puzzles requires many more genomes and more work on the genome-tree approaches.

The results of comparative genomics suggest that the simple notion of a single Tree of Life that would accurately and completely depict the evolution of all life forms is gone forever. Individual genes, especially those of prokaryotes, follow their unique evolutionary trajectories. However, those same comparative-genomic studies that have “uprooted” the Tree of Life provide hope that the concept could be rescued, albeit in a limited sense. Taken together, the results of genome-tree analyses indicate that there is, after all, a phylogenetic signal in the sequences of prokaryotic proteins, but it is weak because of massive gene loss and HGT and, possibly, also because of a punctuated equilibrium mode of evolution, with some of the major transitions having occurred within short time intervals. It seems that, to capture this faint signal, analysis of genome-wide protein sets or carefully selected subsets is required. The concept of the Tree of Life is bound to change in the post-genomic world. It cannot be thought of as a definitive “species tree” anymore, but only as a central trend in the rich patchwork of evolutionary history, replete with gene loss and HGT [903]. Nevertheless, we believe that Darwin's dictum used as one of the epigraphs of this chapter stands: in the epoch of complete genome sequences and “lateral genomics”, the tree simile still speaks essential truth about evolution.

6.3.4. The genomic clock

In the preceding section, we concluded that the notion of the Tree of Life as the ultimate depiction of the course of evolution has to be replaced with a softer concept of a “consensus species tree” as a central thread in the patchwork of life’s history. Is there a way to similarly reformulate the classic molecular clock concept of Zuckerkandl and Pauling such that, in spite of the fact that changes of the evolutionary rate in individual orthologous gene sets are too common to support the traditional clock (at least at large evolutionary distances), there remains a variable in genome evolution that changes linearly with time? Comparative analysis of the distributions of evolutionary distances between orthologs in pairs of genomes separated by vastly different evolutionary distances suggested a possibility of a positive answer to this question [318]. The shapes of these distributions are very similar, although the characteristic distances are dramatically different, just as expected (Figure 6.12; see color plates). It has been shown that, when multiplied by appropriate scaling factors, most of these distributions become statistically indistinguishable [318]. Thus, however dramatically evolutionary rates for individual sets of orthologous genes may change during evolution (we discuss some anecdotal examples of such drastic changes in the next section), the genome-wide distribution remains largely invariant in shape and only slides along the axis of evolutionary distance toward greater values (Figure 6.12; see color plates). The evolutionary clock concept lives, although it seems to be more like Dali's “Melting Clock” than a proper Swiss timepiece.
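The rescaling argument can be illustrated with a small numerical sketch. It assumes that the scaling factor for each genome pair is the inverse of the median distance and that the similarity of two distributions is measured by the Kolmogorov-Smirnov statistic (the largest gap between empirical cumulative distributions); the procedure actually used in [318] may differ, and the distances below are invented.

```python
# Compare ortholog distance distributions before and after rescaling.
# Assumptions: scaling factor = 1 / median; similarity is measured by the
# two-sample Kolmogorov-Smirnov statistic; all distances are toy values.

from bisect import bisect_right
from statistics import median

def rescale(distances):
    """Divide every distance by the sample median and return a sorted list."""
    m = median(distances)
    return sorted(d / m for d in distances)

def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two sorted samples."""
    return max(
        abs(bisect_right(sample_a, x) / len(sample_a)
            - bisect_right(sample_b, x) / len(sample_b))
        for x in sample_a + sample_b
    )

close_pair = [0.125, 0.25, 0.375, 0.5, 0.625]
distant_pair = [0.5, 1.0, 1.5, 2.0, 2.5]   # same shape, four times larger

raw_gap = ks_statistic(sorted(close_pair), sorted(distant_pair))
scaled_gap = ks_statistic(rescale(close_pair), rescale(distant_pair))
print(raw_gap, scaled_gap)
```

In this idealized case the raw distributions are far apart, while the rescaled ones coincide exactly. Real data would give a small but non-zero statistic, whose significance would have to be assessed with an appropriate test.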

Figure 6.12. Cumulative distributions of evolutionary distances between orthologs for different genome pairs.

6.4. The Major Transitions in Evolution: A Comparative-Genomic Perspective

In their influential book, John Maynard Smith and Eors Szathmary present the history of life as a succession of “major evolutionary transitions” [555]. This view of evolution, clearly parallel (at least in general terms) to the punctuated equilibrium of Eldredge and Gould [304], is helpful in allowing one to concentrate on those relatively brief epochs during which momentous changes occurred, resulting in the birth of new states of living matter and which were separated by long periods of relative stasis. Here, we attempt to show how comparative genomics allows us to attain a better and, above all, more concrete understanding of these transitions by considering three of them: (i) from the pure RNA world to RNA-protein life forms; (ii) from RNA to DNA as the substrate of heredity; and (iii) from the prokaryotic to the eukaryotic cell (we should note parenthetically that only the first two are transitions in the strict sense, which represent the succession of different types of living systems replacing one another; the third one is “just” the emergence of a new state, which does not negate the ancestral one; it is, however, of tremendous interest, both fundamental and parochial).

6.4.1. Ancestral life forms and evolutionary reconstructions

One of the most striking features of life on this planet is the surprising unity of the molecular framework of all living things. A human being, amoeba, E. coli, and T. acidophilum, a hyperthermophilic archaeon that lives in nearly boiling acid, may not look like close relatives, but they share highly conserved regions in numerous proteins, particularly those involved in information processing (transcription and translation), and many structural features of key macromolecular assemblies, such as the RNA polymerase, the ribosome, and the plasma membrane. And, of course, minimal variations notwithstanding, they all use the same genetic code to translate information stored in their genomes into proteins. All these common features leave no reasonable doubt that all life forms known to us have evolved from a single common ancestor, which we will call the Last Universal Common Ancestor, or LUCA, an acronym first coined by Patrick Forterre [236]. Remarkably, the conclusion that all life on Earth evolved from a single common ancestor was presciently reached by Charles Darwin in Chapter XIV of The Origin of Species (see the second epigraph to this chapter), however daring this idea must have seemed in the absence of any molecular data. We have dealt with the convergence counter-argument when we discussed the nature of similarities between individual proteins in Chapter 2, and the implausibility of convergence is only amplified when applied to multiple proteins (and the code itself) shared by all organisms. It is worth noting that the conclusion that all extant life on Earth shares a common ancestor has nothing to do with our specific ideas on the origin of life. In particular, the existence of LUCA is fully compatible with panspermia [359] and even with the somewhat less plausible idea that the first cell had been constructed by an alien genetic engineer. Nor does it exclude multiple origins of life and a primordial (pre-LUCA) diversity of biological systems. All that follows from the impressive repertoire of conserved features shared by the modern organisms is that, at some point, evolution went through a bottleneck, in which LUCA was uniquely positioned to give rise to the entire diversity of life as we know it today.

There are, of course, important, indeed dramatic differences between prokaryotes and eukaryotes and, within prokaryotes, between bacteria and archaea (see 6.1). Given the firm evidence that all extant life forms evolved from LUCA, it should be possible to study how gene duplication, divergence of functions, gene loss, and HGT have shaped the distinct gene repertoire of each major lineage. The goal of evolutionary genomics can be defined as reconstruction of ancestral genomes that existed at different stages of evolution, including LUCA. In 6.2, we briefly discussed the algorithmic aspects of this problem. In this section, we touch upon some biological features of the hypothetical ancestors that can be gleaned from comparisons of extant genomes.

LUCA and origins of DNA replication

The existence of LUCA seems to be demonstrated beyond reasonable doubt by the unity of many molecular systems of all known cells, above all, the genetic code itself. Given this conclusion and the already considerable collection of sequenced genomes, reconstruction of the LUCA genome inevitably emerges as a fundamental and tantalizing problem. The most naïve approach would simply posit that all genes inherited from LUCA are probably essential and must be represented in all existing life forms. Should that be the case, the task of reconstructing LUCA's gene repertoire would be reduced to the (at least conceptually) trivial problem of finding all universal genes. This approach is quite sensible but clearly does not work in its simplest form. Indeed, we have already seen, when introducing phyletic patterns, non-orthologous gene displacement and the “minimal genome” concept in Chapter 2, that the number of truly universal genes is almost ridiculously small, only 65 or so. Furthermore, the set of universal genes is strongly functionally skewed: the great majority of these genes encode components of translation machinery, with just a few coding for basal components of the transcription system and molecular chaperones. It is probably indisputable that all universal genes are parts of LUCA's heritage (although, in principle, HGT could have resulted in ubiquitous dissemination of some genes of later origin), but it is equally clear that these proteins alone could not even come close to forming a functional organism. Thus, LUCA encoded many proteins that are not universally present in modern cells. The underlying reasons should be clear from the above discussion (see 6.2 and Chapter 2): first, parasites and some free-living heterotrophs undergo substantial loss of genes coding for proteins that are essential in autotrophs, and second, extensive non-orthologous gene displacement results in patchy phyletic patterns even for many essential genes. These phenomena, particularly non-orthologous gene displacement, seriously confound the task of reconstructing LUCA's gene repertoire. Only careful biological reasoning, combined with detailed genome comparison, can produce defensible answers to the question of which cases of multiple, non-orthologous solutions for the same function reflect displacement of ancestral proteins that were present in LUCA, and which are later, independent inventions.

A full reconstruction of LUCA's genome is a project far beyond the scope of this chapter. However, a preliminary sketch of what can be inferred of LUCA from comparisons of modern genomes (using the COG database for information on phyletic patterns) is both feasible and instructive. Examination of phyletic patterns suggests that many functional systems of LUCA were only slightly less complex than their counterparts seen in the simplest modern archaea or bacteria. A recent detailed comparative analysis of the entire RNA metabolism machinery suggested that LUCA had not only the basal translation system but also a considerable repertory of RNA-modifying enzymes, such as methylases and pseudouridine synthases, as well as a rudimentary system for RNA polyadenylation, and the molecular system for translation-coupled protein secretion ([27]; Table 6.2). Although the diversity of RNA modification in LUCA was probably less than that in any modern organism, all principal types of modification were already in place, so that the difference between LUCA and modern life forms seems to be quantitative only.

Table 6.2. A tentative reconstruction of LUCA’s repertoire of proteins involved in RNA metabolism.

The classic early concepts of the Origin of Life, starting with Darwin's “little warm pond”, put much emphasis on the existence of a primordial (prebiotic) soup rich in all kinds of organic molecules, which allowed the first life forms to enjoy a lavish heterotrophic lifestyle [632]. More recently, this notion has been challenged on the grounds that a high concentration of abiogenic organic matter under the conditions of the primitive Earth was highly unlikely [536]. Regardless, even if a primordial soup was available to the very first, primitive life forms, it certainly must have been exhausted by the time an organism with a system for RNA metabolism as complex as outlined here for LUCA had evolved. Therefore, it appears most likely that LUCA was a chemoautotroph resembling, in terms of metabolic capabilities, the simplest modern chemoautotrophs, such as hyperthermophilic archaea (e.g. Methanococcus) and bacteria (e.g. Aquifex). Thus, LUCA must have had the central metabolic pathways, such as glycolysis and some form of the TCA cycle, as well as the main anabolic pathways, namely those for the biosynthesis of amino acids, nucleotides, lipids, and several coenzymes. The corresponding phyletic patterns, a few cases of non-orthologous displacement notwithstanding, are compatible with the notion that all these metabolic pathways in LUCA were equipped with more or less the same set of enzymes as are present in modern organisms, rather than with some radically different primitive forms. Beyond doubt, LUCA had a plasma membrane, into which it inserted proteins. This follows not only from general biological considerations, but also from the universal conservation of the protein components of the signal recognition particle (two paralogous GTPases) and of the SecY protein, which couples translation and secretion in prokaryotes. The rest of the secretory machinery, however, is not conserved, and details of this process and the biochemistry of the membrane lipids in LUCA are hard to deduce.

Thus, many functional systems in LUCA might have been nearly as complex as they are in the simplest modern cellular life forms, and the organism itself, in many respects, might have resembled modern chemoautotrophic archaea and bacteria. However, the grand mystery that remains is the nature of LUCA's genome and the mode of its replication. In striking contrast to the other central information processing systems, i.e. translation and transcription, several critical components of the DNA replication machinery are either unrelated or distantly related and apparently not orthologous in archaea-eukaryotes and in bacteria (Table 6.3). Most conspicuously, the apparently unrelated components include the elongating DNA polymerase, the principal replication enzyme. This apparent lack of homology between the DNA replication systems in the two main branches of life was noticed as soon as the first complete genomes of bacteria and archaea became available, and several hypotheses have been proposed by way of explanation [209,237]. One straightforward possibility is that replication components that appear to be unrelated are, in fact, extremely diverged orthologs. However, a detailed analysis of the sequences and structures of the key replication proteins all but refuted this hypothesis. It has been shown, in particular, that archaeo-eukaryotic and bacterial primases have completely different structures with unrelated folds [63,431,676]. Specifically, the structure of the archaeo-eukaryotic primase is distantly related to that of the palm domain that is present in a broad variety of DNA and some RNA polymerases [51]. In contrast, bacterial primases have the so-called TOPRIM catalytic domain shared with several other enzymes (topoisomerases and nucleases) that are not necessarily directly involved in replication [49].
Similarly, the replicative DNA helicases in the two branches appear to have been independently recruited from distinct groups of ATPases, which function in processes other than replication (Table 6.3).

Table 6.3. Evolutionary relationships between the major components of the DNA replication machinery in archaea-eukaryotes and bacteria.

The second explanation for the existence of the two distinct replication systems involves non-orthologous gene displacement and/or differential gene loss. The displacement hypothesis posits that one of the extant replication systems, e.g. the archaeo-eukaryotic version, was present in LUCA and has been displaced in the other branch, in this case bacteria, by the alternative system, whose components perhaps might have been acquired from viruses (bacteriophages) [226,236]. The differential loss scenario postulates that LUCA had two replication systems, one of which might have been used for repair, and that these systems have been differentially eliminated in the two branches of life.

A central aspect of each of these models is that LUCA had a double-stranded DNA genome generally similar to those of modern prokaryotes. In other words, these are continuity hypotheses, which, while postulating displacement or loss of essential genes, picture LUCA essentially as a modern-type cell.

However, these schemes seem to face serious difficulties in interpreting the evolution of the two DNA replication systems. Indeed, it is hard to imagine the evolutionary forces behind the purported displacement of a well-developed, multicomponent replication system. As for the possibility that LUCA had two distinct systems for replication, not only is this without precedent in known organisms, but it would also imply that, at least with respect to replication, LUCA's complexity substantially exceeded that of its descendants, which seems to be an unlikely proposition and poses the problem of the ultimate origin of these systems (but see discussion further in this section).

Considering these problems, a different and, in a sense, more radical proposal regarding LUCA's genome and replication system has been made [504,596]. This hypothesis postulates a sharp discontinuity between LUCA and modern cells, in that the former simply did not have a large double-stranded DNA genome and the system for its replication, which are central to all modern cells. Instead, it is proposed that LUCA's genome consisted of multiple segments of RNA, which replicated via DNA intermediates in a retrovirus-like replication cycle (Figure 6.13). This hypothesis strives to account both for the lack of conservation of several central components of modern replication systems and for the presence of some other conserved components, such as, for example, the sliding clamp, clamp-loader ATPase, and RNase H, as well as enzymes of DNA precursor biosynthesis, and the basal transcription machinery. The latter group of enzymes, which are ubiquitous in modern life forms, are thought to have been involved in the DNA-based part of the genome replication cycle in LUCA (Figure 6.13; see [504] for details). According to this hypothesis, although many functional systems of modern cells had already evolved in LUCA, the organism itself was not modern but rather a transitional form on the path from the ancient RNA world to our DNA world.

Figure 6.13. The proposed retrovirus-like genome replication cycle of LUCA. 1, reverse transcription; RT, reverse transcriptase; 2, removal of RNA strand, synthesis of second strand of DNA; 3, transcription; DdRp, DNA-dependent RNA polymerase. The points in the cycle ...

As discussed above, LUCA must have had at least several hundred protein-coding genes and 30 or so genes for structural RNAs. It appears most likely that the RNA segments of LUCA's genome were of an operon size, i.e. a typical segment carried three to five genes. Comparisons of the gene order in extant bacterial and archaeal genomes show that some operons coding for ribosomal proteins are universally conserved [250,491,916] and (there being no evidence of sweeping HGT of ribosomal protein operons) must have been inherited from LUCA.

Should this be the case, LUCA's genome might have been a collection of 200–400 RNA molecules (and the corresponding DNA intermediates if the scheme in Figure 6.13 is accepted as a working model).
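As a quick consistency check on these figures, the following minimal Python sketch multiplies the proposed number of segments by the proposed number of genes per segment, and also runs the calculation in the opposite direction. It uses only the ranges quoted above (200–400 segments, three to five genes each); the function names are illustrative, the gene total of 1,000 is a hypothetical input, and the assumption that every segment falls within the three-to-five-gene range is a simplification.

```python
def implied_gene_range(segments=(200, 400), genes_per_segment=(3, 5)):
    """Return (min, max) total gene count implied by the segment model."""
    fewest = segments[0] * genes_per_segment[0]
    most = segments[1] * genes_per_segment[1]
    return fewest, most


def segments_needed(n_genes, genes_per_segment=(3, 5)):
    """Return (min, max) number of segments needed to carry n_genes."""
    smallest, largest = genes_per_segment
    # Ceiling division: a partly filled segment still counts as a segment.
    fewest = -(-n_genes // largest)
    most = -(-n_genes // smallest)
    return fewest, most


if __name__ == "__main__":
    print(implied_gene_range())   # (600, 2000)
    print(segments_needed(1000))  # (200, 334); 1000 genes is a hypothetical input
```

Under these assumptions, 200–400 operon-sized segments correspond to roughly 600–2,000 genes in total, the lower end of which is in line with the "at least several hundred" genes inferred for LUCA.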

Such a set of genomic segments could hardly segregate with high accuracy into the daughter protocells during division, although multiploidy could have increased the likelihood that each received the complement of essential genes. Therefore, what we call LUCA inevitably must have been a collection of protocells with similar but not identical sets of genome segments, in notable agreement with the original notion of the “progenote” and subsequent conceptual developments of this idea [196,900,902]. Furthermore, in such a population, horizontal gene transfer would amount to reassortment of genome segments (not unlike what happens with certain modern viruses, e.g. influenza) and would be extremely common, allowing rapid emergence of new variants and, accordingly, rapid evolution of the LUCA population. However, reassortment could not be allowed to occur without restriction because, should that be the case, natural selection of protocellular entities would have been impossible. It is imaginable that there was a transition during evolution from the early stage of selection at the level of (self-replicating) genome segments, which, in many respects, seem to be quasi-selfish, virus-like entities, to selection at the level of primitive protocells after reassortment was partially curbed and ensembles of genome segments had become more stable. Within the framework of this concept, the distinction between two hypotheses of origin of DNA genomes and replication, one of which holds that these evolved independently in the bacterial and archaeo-eukaryotic lineages, whereas the other posits that the progenitors of these lineages merely “sorted” among pre-existing replication systems, becomes blurred and perhaps less relevant. Under either of these hypotheses, however, a replication system similar to the one shown in Figure 6.13 seems to be a likely predecessor of the modern DNA replication systems.
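The segregation argument at the beginning of the preceding paragraph can be illustrated with a minimal probabilistic sketch. It assumes (this is a simplification introduced here, not part of the model described in the text) that each distinct segment is present in the same number of copies at division and that every copy passes independently to one of the two daughter protocells with probability 1/2. The function names and the example values (300 segments, 1 to 20 copies) are hypothetical.

```python
def p_daughter_complete(n_segments, copies):
    """Probability that one given daughter gets >= 1 copy of every segment."""
    # A segment is missing from this daughter only if all its copies go elsewhere.
    p_segment_present = 1.0 - 0.5 ** copies
    return p_segment_present ** n_segments


def p_both_complete(n_segments, copies):
    """Probability that both daughters get >= 1 copy of every segment."""
    # Per segment: exclude the two outcomes in which all copies go to one side.
    p_segment_split = 1.0 - 2.0 * 0.5 ** copies
    return p_segment_split ** n_segments


if __name__ == "__main__":
    for copies in (1, 5, 10, 20):
        print(copies,
              p_daughter_complete(300, copies),
              p_both_complete(300, copies))
```

With a single copy of each of 300 segments, the chance that a daughter receives a complete set is negligible, whereas with 10 copies per segment it rises to approximately 0.75 under these assumptions. This is the sense in which multiploidy could have compensated for inaccurate segregation.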

In the parlance of evolutionary biology, the crucial transition in the early evolution of cells discussed above could be construed as a change of the agency of selection from replicators (self-replicating, virus-like segments of RNA) to interactors (protocells) [304]. Carl Woese, in his recent theoretical treatise “On the evolution of cells”, designated a similar transition the “Darwinian threshold”, meaning that speciation became possible at this point in evolution [904]. This concept seems to be quite compatible with the above notions, but the name could be somewhat misleading in its implication that evolution was “non-Darwinian” before the threshold. The essence of Darwinian evolution is natural selection, and it was definitely in place from the earliest stages of biological evolution, as soon as the first replicating and mutating entities emerged, even if the principal agency of selection changed along the way.

6.4.2. Beyond LUCA, back to the RNA world

The ideas of LUCA's gene repertoire and the nature of its genome discussed in the previous section seem to fit within another staple of Carl Woese's general concept of early evolution, the notion of “asynchronous crystallization” of different cellular systems [900]. More specifically, we address questions of the following type: given that LUCA is confidently predicted to have had a well-developed translation system and certain metabolic capabilities, what can be inferred regarding the nature of its genome and replication system? The line of reasoning developed above, albeit not definitive, suggests that the relatively advanced stage of evolution reached by LUCA could still lie on the boundary between the ancient RNA-protein world and the DNA world that replaced it. More generally, this transition is an inevitable step in the evolution of life under any model that accepts the notion of a primordial RNA world. The question then emerges: can we look beyond that transition phase, which perhaps coincides with LUCA, and trace some of the steps of evolution within the RNA world itself?

Even without a full reconstruction of LUCA's gene repertoire, evidence of complex evolution of proteins at earlier epochs is unequivocal. Clearly, paralogous protein families that can be inferred to have been present in LUCA are footprints of ancient, pre-LUCA duplication events. This fact was used in the classic work on paralogous rooting of phylogenetic trees that gave early support to the standard model of evolution [291,386]. For example, within the class of P-loop NTPases, which are the most common protein domains, at least in prokaryotes (see Chapter 8 for some details), perhaps as many as 50 distinct families appear to go back to LUCA, which, in all likelihood, encoded the ancestor of each of them. This includes about 10 families of GTPases alone [505] and about the same number of helicase families [27]. The topology of the evolutionary tree of the P-loop NTPases is fairly well resolved, and the presence of many diverse families in LUCA implies that the preceding history involved a long series of duplications, followed by divergence, between the ultimate ancestor of this protein class and the LUCA stage. The same logic applies to other widespread protein domains, such as, for example, Rossmann-fold oxidoreductases or methyltransferases, each of which was represented by multiple paralogs in LUCA [51].

Analysis of each of these domain classes gives us insights into very early stages of life's evolution, but some families are particularly important for aligning protein domain diversification with the timeline of the transition from the primordial RNA world to the modern-type cells. Consider the evolution of aminoacyl-tRNA synthetases (aaRS), crucial enzymes of the modern translation system [376,907,909]. There are 20 specific aaRS altogether, one for each amino acid (eukaryotes have all 20, whereas most bacteria and archaea encode only 17 to 19 because they have specialized mechanisms for incorporation of glutamine and asparagine that do not involve specialized aaRS, and archaeal methanogens also lack CysRS [376,907]). This (nearly) ubiquitous presence of aaRS indicates that at least 18 of them were already encoded in LUCA's genome. All aaRS are complex, multidomain proteins, but the principal catalytic domains, which are directly responsible for the formation of aminoacyl adenylates followed by coupling of amino acids to their cognate tRNAs, fall into two distinct classes, each including 10 amino acid specificities. The catalytic domains of all aaRS within each of the classes are homologous, even if some of them show only low sequence conservation, whereas the two classes are unrelated as indicated by sequence and structure comparisons. Since LUCA already had most of the aaRS of both classes, the series of duplications that led from the ancestor of each class to the complete sets of aaRS dates to earlier, pre-LUCA stages of evolution (Figure 6.14). Even more dramatically, each of the catalytic domains of the aaRS is related to several other families of nucleotide-binding domains. Figure 6.14 shows the evolutionary scenario for the HUP domain, so named after the HIGH domains, UspA and PP-loop NTPases, three protein superfamilies that have this type of nucleotide-binding domain [37]. 
The HIGH superfamily, designated after the distinct signature in the phosphate-binding loop, unites the catalytic domain of class I aaRS and a family of nucleotidyltransferases, primarily involved in coenzyme biosynthesis, whose relationship with the aaRS is readily demonstrable at the sequence level [101]. The connections between the HIGH superfamily and the rest of the HUP domains were detected primarily through structure comparisons, but the hierarchy of the observed relationships was sufficiently well resolved to infer the tree topology [37]. In this tree, the series of duplications leading to the emergence of the individual aaRS specificities is but a terminal elaboration, which was preceded by multiple ancient duplications (Figure 6.14). The most remarkable aspect of the evolution of the HUP domain class is that multiple HUP-containing proteins, namely aaRS of 9 specificities and tRNA thiouridine synthases (members of the PP-loop ATPase superfamily), are indispensable for translation in its modern form. Since a major diversification of the HUP domains preceded the radiation of the aaRS and the PP-loop ATPases, the conclusion becomes inevitable that ancestors of many modern protein families with well-defined structural and sequence features and distinct biochemical functions, e.g. the ancestral PP-ATPase, HIGH, and USPA domains, antedate the modern translation apparatus. The above analysis leads us to conclude that the series of duplications followed by diversification of each of the above domains occurred within a system in which the specificity of translation was determined, not by aaRS as in modern cells, but by RNAs. An important corollary is that such a primitive, partially RNA-based translation system was sufficiently accurate and efficient to allow complex protein evolution. It is easy to argue that this primitive, largely RNA-based translation apparatus should be classified as part of the RNA world.

Figure 6.14. A schematic tree of the evolution of the HUP domain class. The tree was derived through a combination of sequence comparison and cladistic analysis of structural features. As discussed in the text, diversification of aaRS predates LUCA (the LUCA stage ...

Very similar conclusions could be reached through analysis of the catalytic domains of Class II aaRS, which are related to the nucleotidyltransferase domain of biotin synthases [59], as well as translation factors. In the modern translation system, the latter are represented by several paralogous GTPases, each of which is indispensable for efficient and accurate translation (Table 6.2). Inevitably, the divergence of these GTPases must have occurred within the confines of a more primitive, at least partly RNA-based, translation system [27,505]. It might be harder to directly link the evolution of other protein classes to the evolution of the translation system, but on the whole, there is no doubt that the emergence of most of the major protein folds and the diversification of dozens of protein families within them are evolutionary events that map to the RNA world stage. Strikingly, it might not be an exaggeration to say that the most important stage of protein evolution, the formation of the majority of widespread protein folds, had already concluded by the time of the final “crystallization” of the translation system. Of course, from a purely logical point of view, this conclusion is almost trivial. In retrospect, it seems obvious that, in order to replace an old toolkit with a new one, the new tools had to be made using the old ones.

These reconstructions of the initial radiation of protein classes also give us, with considerable clarity, the teleology of early protein evolution. Consider, once again, the HUP class of nucleotide-binding domains. What becomes apparent when one descends the tree toward the root is the gradual loss of specificity (Figure 6.14). Indeed, the common ancestor of the HUP class, which, in all likelihood, was an ATP-binding domain with a generic ATP-pyrophosphatase and/or nucleotidyltransferase activity, could not perform the multiple, specific functions assumed by its descendants. In particular, it was impossible for this generic ATP-hydrolase to mediate processes that require highly specific molecular interactions, such as tRNA aminoacylation or thiouridylation of specific bases in tRNA. The solution to this problem, already mentioned above, is that, in the primitive biochemical system, many functions, particularly those related to translation that are performed by proteins in modern cells, relied on RNA molecules, including the ancestors of rRNA and tRNA (to our knowledge, this idea was first explicated by Francis Crick in 1968 [162]). As shown above, at this early stage of evolution, proteins already had catalytic and ligand-binding capabilities, which suggests that the RNAs were mainly responsible for the specificity. In particular, the ancestor of the HUP class probably interacted non-specifically with proto-tRNAs, thus facilitating aminoacylation, whereas the specificity of these reactions was conferred by the cognate tRNA itself, perhaps with an additional contribution from other, accessory RNA molecules. In support of this scenario, specific self-aminoacylation catalyzed by an RNA molecule has been demonstrated experimentally [377]. The same ancestral HUP domain probably functioned, in cooperation with other RNAs, to facilitate additional reactions, which require ATP hydrolysis or nucleotide transfer, including RNA modification and cofactor biosynthesis.
After the first duplication that separated the two main branches of the HUP class (Figure 6.14), the ancestor of the HIGH-USPA lineage probably took over the function of a generic nucleotidyltransferase involved in translation and cofactor biosynthesis, whereas the ancestor of the PP-loop ATPases became a generic ATPase involved in RNA modification and some metabolic functions.

6.4.3. A brief history of early life

Generalizing from these analyses, we can now sketch the path from the RNA world to the RNA-protein world to the DNA-RNA-protein world of LUCA (Figure 6.15). Proteins most likely started off as RNA-binding and nucleotide-binding cofactors for the primordial RNA catalysts. Notable relics of this ancient stage of evolution are seen in modern cellular systems, including RNA-protein enzymes, such as RNase P and, above all, the ribosome itself [143,294,620]. In these systems, to this day, RNA functions as the principal catalyst (ribozyme), whereas RNA-binding proteins are cofactors that stabilize the ribozyme and facilitate the reaction. Further evolution of these primordial proteins led, first, to the origin of generic, non-specific enzymes, such as NTPases, nucleases, and nucleotidyltransferases that gradually became the first polymerases, and, subsequently, through a series of duplications, to the emergence of specificity. The structural similarity between the palm domain of DNA and RNA polymerases and the RRM-fold RNA-binding domain [51], which might point to the route of origin of the first polymerases, supports the vector of evolution from RNA-binding protein cofactors to enzymes.

Figure 6.15. A hypothetical sequence of major events in the evolution of life from self-replicating RNA to the emergence of modern-type DNA replication.

It appears that a well-developed RNA-protein world with numerous, specific protein enzymes and some protein regulators was a distinct stage in the evolution of life. Under this view, DNA was introduced as an alternative means for storage of genetic information and subsequently as the principal chemical substrate of the genome only at a relatively late stage, after the translation mechanisms and the principal metabolic pathways had been firmly established.

A scheme like the one in Figure 6.15 could have been easily drawn from first principles at the time when the very idea of a primordial RNA world took shape, i.e. in the late 1960s [162]. However, comparative genomics makes a major difference by providing specific support for each of these hypothetical steps and giving us a fairly clear idea of the protein forms that functioned in these ancient worlds.

LUCA's existence is usually dated at 3.5–3.8 billion years ago because, by then, diverse prokaryotes seem to have already evolved, as indicated by microfossil evidence [753,754], although later dating has been proposed [189]. This leaves perhaps 500 million years between the time the Earth cooled down enough to allow RNA and protein chemistry and the emergence of LUCA. How could it be that, as outlined above, much more dramatic changes in the protein world took place in this relatively short early era than during all subsequent evolution (the logic here will not change much if LUCA is moved, say, 500 million years closer to us)? The bewilderment over this question might lead to the belief that there was simply not enough time for life to emerge on Earth and that the only explanation (barring intelligent design, of course) for its presence is panspermia, seeding of our planet with simple life forms from outer space [359].

While there is nothing intrinsically impossible about panspermia, we believe that, in the absence of direct supporting evidence, this hypothesis fails the test of Occam's razor. Instead, the general solution to the paradox seems to be that the modes of protein evolution radically differed during the early period when the major superfamilies were forming (as discussed above, it appears certain that this happened within the framework of the RNA world, with most catalytic functions performed by RNA molecules) and during the subsequent evolution, after the major protein folds had “crystallized” (Figure 6.15). The latter phase of evolution was and still is dominated by purifying selection and has been generally characterized by a relatively low rate of amino acid replacement; although the classic molecular clock seems to be a gross oversimplification even for this stage of evolution, a “soft” genomic molecular clock probably applies (see 6.3). Certainly, this conservative course of life's evolution was punctuated by numerous bursts of positive selection, which were associated with the birth of new protein superfamilies and even new folds. However, the bulk of evolution in the 3.5 billion years or so since LUCA occurred during periods of stasis, i.e. gradual functional adaptation without changes in the basic structure and biochemical activity of proteins (we deliberately adopt the language of the theory of punctuated equilibrium [304], which seems to be a good, if not necessarily precise simile for a description of the fundamentals of molecular evolution). In contrast, the early phase of evolution, during which the major protein folds emerged in a semblance of a biological Big Bang (Figure 6.15), most likely was dominated by positive selection, which caused rapid change of the sequences.
Mechanistically, this mode of evolution seems to be compatible with the high error rate of RNA replication, which reaches 10^-3 even in modern viruses with their evolved RNA-dependent RNA polymerases [882,883,888], and was probably even greater for the early RNA or RNA-protein polymerases. We may call this transition from positive selection to stabilizing selection, arguably one of the pivotal events in the evolution of life, the “Schmalhausen threshold”, after the eminent Russian evolutionist who was the first to introduce, in the 1930s, the crucial concept of stabilizing (purifying) selection [750]. It is interesting to note that this view of the earliest phases of evolution is decidedly at odds with Darwin's opinion that “…at the very first dawn of life, when very few forms of the simplest structure existed, the rate of change may have been slow in an extreme degree” [170].
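To give a sense of scale for this error rate, the sketch below computes the expected number of errors per copy and the probability that a copy is error-free. It assumes that the rate of 10^-3 is expressed per nucleotide per replication round and that errors occur independently at each position; the template lengths used (1,000 and 4,000 nucleotides) are hypothetical values chosen only for illustration, and the function names are ours.

```python
def expected_errors(error_rate, length):
    """Expected number of misincorporations in one copy of a template."""
    return error_rate * length


def p_error_free(error_rate, length):
    """Probability that one copy contains no errors (independent positions)."""
    return (1.0 - error_rate) ** length


if __name__ == "__main__":
    rate = 1e-3  # assumed to be per nucleotide per replication round
    for length in (1000, 4000):  # hypothetical template lengths
        print(length, expected_errors(rate, length), p_error_free(rate, length))
```

Under these assumptions, a template of 1,000 nucleotides would accumulate about one error per copy, and only slightly more than a third of the copies would be error-free, which illustrates how such replication could have fueled rapid sequence change.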

6.4.4. The prokaryote-eukaryote transition and origin of novelty in eukaryotes

The nature of the transition and origins of eukaryotic genes

As emphasized repeatedly and ever so eloquently by Stephen Jay Gould (e.g. [303]), the iconography of progress from inconsequential prokaryotes to spectacular bipeds, which is commonplace in biology textbooks and popular literature, is demonstrably false if construed as an objective depiction of the evolution of the biosphere on this planet. Indeed, whether we consider the energy flow in the biosphere, the spread of life over diverse habitats, including extreme ones, or the diversity of species themselves, the entire history of life is definitely the age of prokaryotes, which will continue until the ultimate demise of earthly life. It is equally undeniable, however, that this iconography does represent something important and profoundly interesting in the evolution of life, namely the tremendous increase in the maximum organizational complexity of life forms observed at a given time. Arguably, the single most important leap in the direction of greater complexity was the origin of the eukaryotic cell. Indeed, the simplest eukaryotic cell is considerably more complex than even the most advanced prokaryote [11]. To begin with, the eukaryotic cell is much larger, on average by about a factor of 1,000 (in volume). Furthermore, the eukaryotic cell has several distinct organelles and intracellular structures without counterparts in prokaryotes, the most important of which are the nucleus, the mitochondria, the endoplasmic reticulum, and the cytoskeleton. There are no candidates for evolutionary intermediates between prokaryotic and eukaryotic cells, which suggests a true “phase transition”, a distinct, dramatic evolutionary event leading to the emergence of eukaryotes. The nature of this event is one of the greatest mysteries of life's history.

Rather unexpectedly, recent studies on early-branching eukaryotes have suggested a strong candidate for this event: the mitochondrial endosymbiosis itself. The idea that endosymbiosis (although not necessarily mitochondrial) underlies the very emergence of eukaryotes was proposed by Lynn Margulis in her pioneering 1967 paper that reinstated the endosymbiotic theory [727]. This idea, however, did not win wide acceptance, and the field of eukaryotic evolutionary biology was dominated by the Archezoan hypothesis, which contends that the host of the promitochondrial endosymbiont was an amitochondrial unicellular organism that already possessed all the main features of the eukaryotic cell, the hypothetical archezoan [140,659,707]. However, the search for an archezoan branch of eukaryotes so far has been futile. Although numerous protists, such as, for example, Giardia or Trichomonas, which live under anaerobic conditions, lack regular mitochondria, most of them have hydrogenosomes, small membranous organelles that metabolize malate and pyruvate to produce ATP, H2, and CO2. Typical hydrogenosomes have no genome or translation system but are composed of proteins, which, as shown by phylogenetic analysis, have been derived from the mitochondria and are encoded by former mitochondrial genes transferred to the nuclear genome [10,90,127,203,204,356,713]. Phylogenetic studies on protists (an incredibly fascinating field in itself, but the details of which are beyond the scope of this book) indicated that, in early eukaryotic evolution, secondary loss of mitochondria was a common event, which probably occurred in at least 15 lineages [707]. The hunt for an archezoan continues, but with each failure to discover one, the odds are increasing that the very origin of the eukaryotic cell was triggered by the acquisition of the α-proteobacterial symbiont that became the progenitor of the mitochondrion. The second partner, the recipient, in this event remains an enigma.
Certainly, much evidence points to an archaeon (see 6.1), but whether this primary ancestor of eukaryotes belonged to one of the currently known archaeal lineages remains unclear.

Even if we accept that endosymbiosis triggered the emergence of the first eukaryotic cell, the nature of the transition is a great mystery. Since the cytoskeleton is one of the critical parts of the eukaryotic cell and, more specifically, the part that allows the eukaryotic cell to maintain its characteristic large size, the unusual evolutionary patterns seen among the principal cytoskeletal proteins might be directly linked to the transition. Actin, the most abundant eukaryotic cytoskeletal protein, is a highly derived ortholog of MreB and FtsA, bacterial proteins involved in cell division. These proteins are present in almost all bacteria, but only in a few scattered archaeal species, which probably acquired the respective genes via HGT. Actin and MreB/FtsA show very low sequence similarity to each other but share all diagnostic motifs of the HSP70-like ATPase-fold ATPases [106], have very similar structures, and form similar filaments [210,864,865]. The absence of these proteins in most archaea suggests that, in eukaryotes, they are an ancient bacterial acquisition, although the alternative, namely that these proteins were already present in LUCA (and then, presumably, lost in archaea), also has been considered [191]. The unusually low sequence conservation between MreB/FtsA and actin, in all likelihood, reflects the abrupt change in functional constraints that accompanied exaptation (i.e. recruitment of a protein for an entirely new function, which may be mechanistically but not biologically related to the original one [302]) of the bacterial cell division protein for functioning in the emerging cytoskeleton. 
In a striking parallel, tubulin, the other principal cytoskeleton protein of eukaryotes, also diverged from the prokaryotic ortholog, the cell division protein FtsZ, to the point of near obliteration of sequence similarity (except for the distinct phosphate-binding loop of the GTPase domain), although the structural conservation remains obvious [214,215,621]. This theme can be extended to the large subunit of dynein, a major motor ATPase, which is an extremely diverged AAA+ class ATPase whose prokaryotic progenitor could not be readily identified because of this divergence [613]. Furthermore, a similar trend is seen with one of the small dynein subunits, LC7, whose distant prokaryotic homologs appear to be involved in gliding motility and signal transduction in bacteria and archaea [455]. These dramatic changes in protein sequences that accompanied the origin of eukaryotic cytoskeleton might have been critical in the evolution of the eukaryotic cell. In particular, it seems possible that mutations that resulted in altered properties of the MreB/FtsA protein acquired from the bacterial endosymbiont initiated the entire process of transformation of the ancestral prokaryotic cell into the more complex eukaryotic cell.

Sequence similarity between actin, tubulin, LC7, and their respective prokaryotic orthologs is so weak, presumably due to a radical change in function, that these relationships escape detection in routine genome comparisons. The discovery of these crucial evolutionary connections became possible only as a result of systematic analyses of the sequences of the respective protein families or through direct comparison of protein structures (whereas each of these proteins is highly conserved within prokaryotes and eukaryotes). Case by case sequence and structural studies revealed several additional essential eukaryotic proteins that showed extreme divergence from their apparent evolutionary precursors in prokaryotes (Table 6.4). The fact that several proteins conserved in all eukaryotes and involved in quintessential eukaryotic functions, such as cytoskeleton formation (actin), chromosome separation in mitosis and meiosis (separase), and cell cycle control (Ddi1), show a bacterial provenance and might have been acquired by eukaryotes via HGT emphasizes the potential role of endosymbiosis in the very origin of eukaryotes. A further, focused search for potential prokaryotic counterparts of essential proteins that so far seem to be eukaryote-specific might help in revealing more about eukaryotic origins.

Table 6.4. Extreme divergence between some essential eukaryotic proteins and their apparent prokaryotic orthologs.

Genome-wide analyses of eukaryotic proteins and studies on evolution of specific functional systems converge on the notion that the eukaryotic proteome is a mix of proteins of apparent archaeal descent, those that seem to originate from bacteria, and eukaryote-specific ones. This is readily illustrated by the taxonomic breakdown of the database search results for proteins encoded in any eukaryotic genome as shown in Figure 6.16 for the nematode C. elegans [913]. The eukaryote-specific category is the largest, but, among those proteins that do have prokaryotic homologs, the “bacterial” group dominates. Similar conclusions can be reached by breaking down the COGs by domain-specific phyletic patterns: the number of COGs that consist of bacterial and eukaryotic proteins, to the exclusion of archaea, is more than fourfold that of archaeo-eukaryotic COGs (Figure 6.1).
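The COG breakdown described above amounts to tallying, for each cluster, which of the three domains of life are represented in it. A minimal sketch of that tally follows; the cluster identifiers and memberships in `TOY_COGS` are invented for illustration and are not real COG data, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy input: cluster id -> set of domains of life represented in it.
# These ids and memberships are invented; they are not taken from the COG database.
TOY_COGS = {
    "COG_A": {"archaea", "bacteria", "eukaryota"},
    "COG_B": {"bacteria", "eukaryota"},
    "COG_C": {"bacteria", "eukaryota"},
    "COG_D": {"archaea", "eukaryota"},
    "COG_E": {"archaea", "bacteria"},
    "COG_F": {"bacteria"},
}

# Fixed order so that every pattern gets one canonical label.
DOMAIN_CODES = [("archaea", "A"), ("bacteria", "B"), ("eukaryota", "E")]


def domain_pattern(domains):
    """Return a canonical label such as 'B+E' for a set of domains."""
    return "+".join(code for name, code in DOMAIN_CODES if name in domains)


def tally_patterns(cogs):
    """Count how many clusters show each domain-level phyletic pattern."""
    return Counter(domain_pattern(domains) for domains in cogs.values())


counts = tally_patterns(TOY_COGS)
# The comparison made in the text is the ratio of counts["B+E"] to counts["A+E"].
```

With real data, the ratio of the bacterial-eukaryotic count to the archaeo-eukaryotic count is the quantity that the text reports as more than fourfold.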

Figure 6.16. Taxonomic breakdown of the BLAST search results for the proteins of the nematode C. elegans. ‘Nematode’ (-specific) are proteins that did not show significant sequence similarity (E < 10⁻³) to proteins from any species other ...
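A taxonomic breakdown of this kind can be sketched as a simple classification of each query protein by its database hits. The only detail taken from the caption is the significance cutoff (E < 10⁻³) and the definition of the lineage-specific category; assigning every other protein to the group of its most significant foreign hit is an assumption made here for illustration, as are the toy hit lists and the function name `classify`.

```python
from collections import Counter

E_CUTOFF = 1e-3  # significance threshold quoted in the Figure 6.16 caption


def classify(hits, own_group="nematode", cutoff=E_CUTOFF):
    """Assign one query protein to a taxonomic category.

    hits: list of (taxonomic_group, e_value) pairs for the query.
    Assumption (not from the source): a protein with significant foreign hits
    is assigned to the group of its most significant foreign hit.
    """
    foreign = [(e_value, group) for group, e_value in hits
               if group != own_group and e_value < cutoff]
    if not foreign:
        return own_group + "-specific"
    return min(foreign)[1]  # smallest E-value wins


# Hypothetical hit lists, invented for illustration.
TOY_HITS = {
    "prot1": [("nematode", 1e-50)],
    "prot2": [("nematode", 1e-80), ("bacteria", 1e-20), ("archaea", 1e-5)],
    "prot3": [("eukaryota", 1e-30), ("bacteria", 0.5)],
    "prot4": [("archaea", 1e-2)],  # above the cutoff, so not significant
}

breakdown = Counter(classify(hits) for hits in TOY_HITS.values())
```

Other assignment rules are possible, for example reporting the broadest group with any significant hit, and the choice can change the relative sizes of the categories.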

The most obvious explanation of the presence of a large number of ‘bacterial’ proteins in eukaryotes is a massive acquisition of genes from the mitochondrial endosymbiont and, in the case of plants, from the chloroplast endosymbiont. Phylogenetic studies of mitochondrial rRNA and proteins have shown beyond reasonable doubt that mitochondria evolved from alpha-proteobacteria [30,313,489]. However, in phylogenetic analyses of eukaryotic proteins encoded in the nuclear genome, only a small number confidently cluster with the alpha-proteobacterial orthologs ([420,480]; K.S. Makarova, M.V. Omelchenko, and E.V. Koonin, unpublished observations). The legacy of the more recent chloroplast symbiosis in plants is easier to detect, but even in this case, only 63 unequivocal plant-cyanobacterial clades of the 386 examined phylogenies of genes shared by the cyanobacterium Synechocystis and the plant Arabidopsis were reported [720]. One cause of the problems with proving the influx of genes from endosymbionts is likely to be the accelerated evolution of the acquired eukaryotic genes, perhaps a burst of rapid evolution immediately after gene transfer from the symbiont to the nucleus. Indeed, it is often observed that proteins functioning in the mitochondria cluster within the bacterial branch in phylogenetic trees, but the specific affiliation with alpha-proteobacteria is seen only in the minority of cases; aminoacyl-tRNA synthetases are a good illustration of this trend [909]. This being the case, application (and further refinement) of phylogenetic methods that are minimally sensitive to changing evolutionary rates could clarify the situation.

Another intriguing explanation, which may be called Doolittle's ratchet (after W. Ford Doolittle, who proposed this idea in a brief article with the memorable title “You are what you eat” [192]), is that mitochondrial (and, to a lesser extent, chloroplast) affinity is hard to demonstrate for most of the “bacterial” proteins from eukaryotes simply because they largely do not originate from the mitochondrial ancestor. Under this hypothesis, protoeukaryotes (whatever the nature of these organisms) formed numerous symbiotic relationships with bacteria, the pro-mitochondrial symbiont being merely the only one that stayed on, became indispensable for the host, and was passed on to most eukaryotic lineages. The less lucky symbionts might have ended up as food for their host but, in the process, some of them probably contributed genes to the host genome, resulting eventually in the eukaryotic genome becoming “what the protoeukaryotes ate”. This idea seems to be compatible both with the behavior of many eukaryotic proteins in phylogenetic studies and with the observations of multiple losses of mitochondria in unicellular eukaryotes (see above).

The characteristic functional distinction between eukaryotic genes that have been vertically inherited according to the standard model and those that seem to have been acquired from bacteria follows from the nature of the shared archaeo-eukaryotic heritage that we already discussed in 6.1. The archaeal legacy primarily consists of genes coding for information processing system components (translation, transcription, and replication), whereas metabolic enzymes and transporters (what is sometimes collectively called operational systems) seem to be largely of bacterial origin. The core of the DNA replication machinery, which we discussed when examining the probable nature of LUCA, is the most dramatic manifestation of this dichotomy, with eukaryotes using the archaeal system whose key components are unrelated to the bacterial functional counterparts. The distinction, however, is by no means absolute. Thus, detailed computational dissection of the proteins involved in various aspects of DNA repair [54] and RNA metabolism [27] showed that, along with the ancient archaeal core, the eukaryotic systems include a considerable number of proteins that apparently have been acquired from bacteria (see also Table 6.4).

At least three distinct scenarios of bacterial gene integration into the protoeukaryotic genome should be differentiated: (i) displacement of ancestral, archaeo-eukaryotic genes by bacterial counterparts (xenologous gene displacement, see 6.2); (ii) acquisition of bacterial genes without elimination of the ancestral archaeo-eukaryotic counterparts so that eukaryotes end up with both versions of a particular protein; and (iii) evolution of new functions by utilization of bacterial proteins (exaptation). The first scenario applies primarily to essential metabolic enzymes whose single eukaryotic forms appear to be of bacterial origin. The most interesting question regarding this displacement phenomenon is what was its underlying cause or, in other words, what was the selective advantage conferred by the acquired bacterial genes onto the recipient that led to displacement of the ancestral versions. One explanation is that bacterial enzymes were generally more efficient than the original archaeo-eukaryotic ones and were selected for that reason. The alternative is Doolittle's ratchet: there might not have been any selective advantage in replacing ancestral archaeo-eukaryotic genes with bacterial ones, but rather plenty of opportunities for this to happen through symbiotic relationships between protoeukaryotes and various bacteria.

The second scenario, acquisition without displacement, applies largely to universal proteins whose distinct forms function in the eukaryotic cytosol and in the mitochondria, particularly translation system components, e.g. the aaRS. Such cases are probably the best available demonstrations of probable gene transfer from the promitochondrial endosymbiont to the nuclear genome, even in cases when phylogenetic trees do not show the alpha-proteobacterial affinity of the respective eukaryotic proteins.

Eukaryotic innovations that appear to have evolved as a result of acquisition of bacterial genes are of special interest because systematic exploration of these cases may shed light on the origins of the unique eukaryotic complexity (e.g. Table 6.4 and the relevant discussion above). Such studies became truly meaningful only after multiple genomes of both bacteria and eukaryotes had been sequenced, so, as of this writing, there simply has not been enough time to conduct the comparative studies that are required to characterize the evolutionary provenance of all unique eukaryotic functional systems. However, below we discuss one example of such a study that has been completed and revealed both prominent bacterial connections and truly unique eukaryotic inventions.

Origin and evolution of eukaryotic programmed cell death

The molecular machinery of programmed cell death (PCD, or apoptosis) appears to be a quintessential eukaryotic signaling system. Bacterial cells are known to “commit suicide” under certain circumstances, but the molecular mechanisms of these processes so far identified seem to be unrelated to those of eukaryotic PCD [512,932]. In contrast, PCD appears to be ubiquitous in multicellular eukaryotes and, indeed, should be considered one of the hallmarks of the multicellular state [778]. In any multicellular organism, PCD is indispensable for eliminating cells with impaired division control whose propagation leads to cancer and cells infected with pathogens or damaged by stress [340,701,808,935]. Programmed death of specific cells is also important in normal development [564].

The phyletic patterns for the key components of the eukaryotic PCD system show a clear ‘meta-pattern’: the enzymes involved in apoptosis tend to have a broad phyletic distribution, with bacterial homologs identifiable, whereas the non-enzymatic components typically have no bacterial homologs and, in some cases, are present in only one eukaryotic lineage (Table 6.5). With the single notable exception of the Apoptosis-Inducing Factor (AIF), the prokaryotic homologs of the proteins involved in PCD are widely represented in bacteria, but not in archaea. This pattern suggests a substantial, perhaps decisive, contribution of acquired bacterial genes to the evolution of the eukaryotic PCD system. It is instructive to further assess this hypothesis through a more detailed examination of the bacterial homologs of the eukaryotic proteins involved in PCD and phylogenetic analysis of the respective protein families [38,39,457].

Table 6.5. Phyletic distribution of domains and proteins involved in eukaryotic apoptosis and related pathways a.

The caspase superfamily proteases

Caspases are the principal proteases that are activated during animal apoptosis and cleave a variety of proteins, ultimately leading to cell death and disintegration [808,845]. The caspases have undergone remarkable proliferation and specialization in vertebrates and function in a cascade, which includes several consecutive cleavage steps. Caspases belong to a distinct class of cysteine proteases (we designate it the CHF-class, after Caspase-Hemoglobinase Fold), which also includes hemoglobinases, gingipains, clostripains, and separases, the proteases involved in chromosome segregation in mitosis and meiosis (see also Table 6.4 and discussion above). Recent sequence and structure analyses revealed a much greater diversity of caspase-related proteases than previously suspected [47]. Two families of predicted CHF-proteases were identified and shown to be more closely related to the caspases than to other proteases of this class and hence dubbed paracaspases and metacaspases [861]. A possible regulatory role for the human paracaspase in certain forms of PCD has been demonstrated [861]. More recently, the yeast metacaspase has been shown to mediate programmed cell death upon peroxide treatment and in aged cultures, which not only supports the role of metacaspases in apoptosis but also indicates that PCD occurs even in (at least some) unicellular eukaryotes via mechanisms related to those in multicellular organisms [537]. A major role for metacaspases in plant PCD is also likely, given the proliferation of the genes coding for metacaspases in plant genomes, the absence of other caspase homologs in plants, and the fusion of some of the plant metacaspases with the LSD1 Zn-finger, a regulator of plant PCD [861]. Paracaspases were detected in animals, slime mold, and one group of bacteria, the Rhizobiales, with a notable expansion in Mesorhizobium loti, whereas metacaspases are present in plants, fungi, early-branching eukaryotes, and a variety of bacteria (Table 6.5). 
Phylogenetic analysis of the caspase-like protease superfamily shows a clear affinity of the eukaryotic metacaspases, paracaspases, and the classic caspases with the corresponding predicted proteases from the Rhizobia, which belong to the α-subdivision of the Proteobacteria, the free-living ancestors of the mitochondria (Figure 6.17). This topology of the phylogenetic tree is best compatible with the origin of metacaspases from the mitochondrial endosymbiont. The case of caspases-paracaspases is more complicated because this branch of the superfamily so far has not been detected in eukaryotes other than animals and slime mold. This distribution is compatible with a second, later HGT from α-proteobacteria to eukaryotes or with independent loss of the paracaspase gene in multiple eukaryotic lineages.

Figure 6.17. Phylogenetic tree of the caspase-like protease superfamily. The proteins are indicated by their gene names and abbreviated species names. Csp, caspase; PC, paracaspase; MC, metacaspase. Circles show nodes with 75% bootstrap support. Thick red lines indicate ...

The OMI (HtrA-like) protease

The OMI protease, which is homologous to the widespread and well-characterized bacterial HtrA family of serine proteases, is a recent addition to the repertoire of PCD-associated eukaryotic proteins [336,550,815]. This protein, normally located in the mitochondria, is released into the cytoplasm during apoptosis and contributes to both caspase-dependent and caspase-independent PCD. HtrA-like membrane-associated proteases are nearly ubiquitous in bacteria, the sole exception so far being the mycoplasmas, the bacterial parasites with the smallest genomes; in contrast, these proteins are missing in most archaeal genomes sequenced to date.

Phylogenetic analysis of the HtrA-like proteases suggests a major diversification of this family into several distinct lineages in bacteria, with a prominent expansion in α-proteobacteria. This analysis strongly supports the monophyly of the eukaryotic OMI/HtrA2 proteases, which are involved in PCD, with a particular lineage of α-proteobacterial HtrA-like proteases (Figure 6.18). Clearly, this observation is compatible with a mitochondrial origin for OMI.

Figure 6.18. Phylogenetic tree of the HtrA family of proteases. All details are as in Figure 6.17.

Apoptotic (AP) ATPases and NACHT GTPases

AP-ATPases are central regulators of PCD, which interact with caspases to form the so-called apoptosome and are required for caspase activation [142,146]. AP-ATPases are present in animals; in plants, in which they are encoded by vastly proliferated pathogen and stress resistance genes; in one fungal species (Neurospora crassa); in many bacteria; and in one archaeon, P. horikoshii. Among bacteria, AP-ATPase homologs are present in α-proteobacteria, cyanobacteria, and Actinomycetes, with a particularly notable proliferation in the latter lineage (Table 6.5). Phylogenetic analysis of the ATPase domain of AP-ATPases strongly supports the monophyly of the plant and animal representatives but does not group them with any bacterial lineage in particular; in contrast, the Neurospora AP-ATPase clusters with those of Actinomycetes, as does the only archaeal member of this family (Figure 6.19). The latter two AP-ATPases appear to be obvious cases of HGT from Actinomycetes. The origin of the animal and plant AP-ATPases is less clear. However, a more detailed examination of the alignment of the AP-ATPase domain showed that a large subgroup of these proteins, including those from plants, animals, and several bacteria, primarily actinomycetes, contained a distinct C-terminal motif, which was missing in the rest of the bacterial AP-ATPase homologs, including those from alpha-proteobacteria [457]. This feature allows us to tentatively root the AP-ATPase tree and hence establish the connection between eukaryotic AP-ATPases and homologs from Actinomycetes (Figure 6.19). Given these observations and the absence of AP-ATPases from the available genome sequences of yeasts and early-branching eukaryotes, a relatively late acquisition of this gene by eukaryotes from Actinomycetes, around the time of animal-plant divergence, seems to be the most likely evolutionary scenario. 
In principle, however, transfer of the AP-ATPase from mitochondria cannot be ruled out, assuming that the alpha-proteobacterial progenitor of the mitochondria, unlike rhizobia, had an AP-ATPase with the C-terminal motif and that some eukaryotic lineages have lost this gene.

Figure 6.19. Phylogenetic tree of the apoptotic (AP) ATPase family. The inferred root position is shown by a bar. Other details are as in Figure 6.17.

The NACHT (after NAIP, CIIA, HET-E and TP1) family is another group of NTPases (primarily GTPases) with a eukaryotic-bacterial phyletic pattern [456]. So far, this family is represented in animals, one fungal species (Podospora anserina), and several bacteria (Table 6.5). A major proliferation of the NACHT family associated with an involvement in PCD and immune response against diverse viral and bacterial pathogens is observed in vertebrates, whereas other animals have only one NACHT domain that appears to be involved in the telomerase function rather than apoptosis [39]. Typically, the same bacteria that have AP-ATPases tend to encode NACHT NTPases, sometimes multiple ones (Table 6.5). In the general scheme of evolution of P-loop NTPases, the NACHT family appears to be the sister group of AP-ATPases [456]. Given the considerable diversification of each of these families in bacteria, the divergence between them should date to a relatively early stage of bacterial evolution. Although the limited sequence conservation within the NACHT family makes it a poor candidate for phylogenetic analysis, two distinct groups could be discerned within this family. The first group includes the vertebrate-specific expansion of NAIP-like proteins and several bacterial proteins from Streptomyces, Anabaena, and Rickettsia; the second group consists of the animal TP-1-like telomerase subunits and the fungal proteins Het-E-1 from Podospora anserina and B24M22.200 from Neurospora crassa. Thus, multiple HGT events might have been responsible for the introduction of these proteins in eukaryotes, one occurring early in evolution and resulting in the TP-1-like forms, and the second one occurring much later, perhaps even just prior to the emergence of the vertebrate lineage, and injecting the NAIP-like forms.

Apoptosis-inducing factor (AIF) is a mitochondrial protein, which is released into the cytoplasm during apoptosis and stimulates a caspase-independent PCD pathway essential for early morphogenesis in mammals [410]. This function of AIF is highly conserved in evolution, as indicated by the recent demonstration of the function of the AIF ortholog in PCD in the slime mold Dictyostelium discoideum [58]. AIF is a Rossmann-fold, FAD-dependent oxidoreductase, but the redox activity is not required for its pro-apoptotic function [573]. This protein is highly conserved and nearly ubiquitous in bacteria, archaea, and eukaryotes. Phylogenetic analysis showed that eukaryotic AIFs cluster with their archaeal orthologs, to the exclusion of bacterial ones, with the sole exception of T. maritima, a hyperthermophilic bacterium, which probably acquired this gene from archaea via HGT [457]. Thus, AIF seems to be the only major component of the PCD apparatus that conforms with the standard model of evolution, which is particularly notable because this ancestral protein apparently had been secondarily recruited for a mitochondrial function.

The TIR domain is the only PCD adaptor molecule that has been detected in bacteria, although not so far in fungi or early-branching eukaryotes [38]. The distribution of the TIR domain in bacteria is similar to that of caspase-related proteases, AP-ATPases and NACHT NTPases, with a notable expansion in actinomycetes. The information contained in the TIR domain alignment does not seem to be sufficient to produce a reliable phylogenetic tree. Nevertheless, given that TIR domains seem to be present only in crown-group eukaryotes, possible evolutionary scenarios include a mitochondrial acquisition with subsequent loss in multiple eukaryotic lineages or a later HGT from a bacterial source.

Domain architectures of bacterial homologs of eukaryotic PCD-associated proteins suggest functional interactions

Functional information on bacterial homologs of eukaryotic apoptotic proteins is scarce. For the AP-ATPase homologs, the only available data point to a role of some of these proteins, such as GutR from B. subtilis and AfsR from S. coelicolor, in transcription regulation [233,681]. The function of GutR, which is a regulator of the glucitol operon, has been shown to be ATP-dependent [681]. Among the numerous caspase-related proteins detected in bacteria, only one, ActD from Myxococcus xanthus, has been characterized experimentally. This protein is a regulator of the production of the sporulation morphogen, CsgA, but its mechanism of action remains unknown [319]. Comparative-genomic information partly compensates for the paucity of experimental data: examination of the domain architectures of the bacterial homologs of apoptotic components provides tantalizing functional hints. Firstly, nearly all of these proteins form complex, multidomain architectures (Figure 6.20; see color plates). Secondly, many of them contain repetitive protein-protein interaction modules, such as WD40, TPR, and Armadillo repeats, which tend to form scaffolds facilitating the formation of multisubunit complexes. Finally, and most strikingly, some of the bacterial apoptosis-related proteins are fused within multidomain proteins, suggesting functional interactions between them (recall the “guilt by association” or “Rosetta Stone” principle, see 5.2.2). Examples include the fusion of a caspase-like protease with an AP-ATPase and WD40 repeats in the cyanobacterium Anabaena and the TIR-AP-ATPase and metacaspase-protein-kinase fusions in Actinomycetes (Figure 6.21; color plates). Some of the domain architectures observed among apoptotic protein homologs in bacteria reflect specifics of prokaryotic signal transduction, e.g., the characteristic fusions of AP-ATPases with helix-turn-helix DNA-binding domains in transcription regulators. 
These peculiarities notwithstanding, the above observations are sufficient to justify the hypothesis that bacterial homologs of the eukaryotic apoptotic proteins interact functionally and, most likely, also physically in signal-transduction pathways whose exact nature remains to be determined. A bolder speculation is that, in bacteria with complex development and differentiation, such as actinomycetes, cyanobacteria, myxobacteria, and some alpha-proteobacteria, the homologs of apoptotic proteins, particularly meta- and paracaspases and AP-ATPases, form large complexes that might be functional analogs or perhaps even evolutionary predecessors of the eukaryotic apoptosome. A search for such complexes in bacteria and elucidation of their potential role in signal transduction and/or an unknown form of PCD seems to be an exciting subject for experimental studies.
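The “guilt by association” reasoning applied to domain fusions can be sketched as a search for pairs of domains that occur together in a single protein. The architectures below are loosely modelled on the fusions named in the text (caspase-like protease with AP-ATPase and WD40 repeats; TIR with AP-ATPase; metacaspase with protein kinase; AP-ATPase with a helix-turn-helix domain), but the protein identifiers and the function name `fusion_links` are hypothetical, and the domain lists are simplified illustrations rather than curated annotations.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical protein ids with simplified domain architectures (N- to C-terminus).
TOY_ARCHITECTURES = {
    "anabaena_protein_1": ["caspase-like", "AP-ATPase", "WD40"],
    "actinomycete_protein_1": ["TIR", "AP-ATPase"],
    "actinomycete_protein_2": ["metacaspase", "protein kinase"],
    "regulator_1": ["AP-ATPase", "HTH"],
}


def fusion_links(architectures):
    """Map each co-occurring domain pair to the proteins in which it is fused.

    A pair found fused in one protein is taken as a hint (not proof) that the
    two domains also interact functionally when encoded as separate proteins.
    """
    links = defaultdict(set)
    for protein, domains in architectures.items():
        # Sort so that each unordered pair has one canonical key.
        for pair in combinations(sorted(set(domains)), 2):
            links[pair].add(protein)
    return dict(links)


links = fusion_links(TOY_ARCHITECTURES)
for pair, proteins in sorted(links.items()):
    print(pair, sorted(proteins))
```

In practice, promiscuous modules such as WD40 or TPR repeats co-occur with many unrelated domains, so links involving them would normally be down-weighted or filtered before drawing functional conclusions.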

Figure 6.20. Bacterial homologs of apoptotic components have complex domain architectures pointing to roles in signal transduction. Apparently they interact even in bacteria. TIR, Toll-interleukin-receptor domain; TPR, tetratricopeptide repeats; LRR, leucine-rich ...

Evolution of eukaryotic programmed cell death: the case for multiple infusions of bacterial genes

As discussed above, the principal enzymes and at least one adaptor domain involved in eukaryotic PCD are widespread in bacteria but are conspicuously missing in archaea. Furthermore, two important lines of evidence support HGT from bacteria to eukaryotes as the principal route of evolution of these proteins. Firstly, in at least two cases, those of OMI and metacaspases, phylogenetic analysis confidently shows a specific affinity of the eukaryotic apoptotic proteins with homologs from α-proteobacteria. These observations strongly suggest a mitochondrial origin for the respective genes (see above). Secondly, in each case, and particularly for caspase-related proteases and AP-ATPases, exploration of the bacterial homologs of apoptotic proteins reveals a greater diversity, in terms of phyletic distribution, domain architectures, and sequences themselves, than seen in eukaryotes. This points to the probable direction for HGT: from bacteria to eukaryotes.

In principle, all apparent bacterial contributions to eukaryotic PCD could be explained through acquisition of mitochondrial genes. However, this would require multiple losses of the genes for apoptotic proteins in different eukaryotic lineages and, in addition, would be at odds with some phylogenetic analysis results, e.g. those that seem to link eukaryotic AP-ATPases with Actinomycetes (Figure 6.19). Thus, a different scenario, with at least two infusions of bacterial genes contributing to the origin of PCD, appears to be more parsimonious [457]. According to this hypothesis, the first influx of the relevant bacterial genes was part of the domestication of the pro-mitochondrial endosymbiont, whereas the second one probably occurred at the stage of a primitive multicellular eukaryote, perhaps the ancestor of the eukaryotic crown group. Apparently, there were at least occasional subsequent gene transfers, such as the acquisition of an AP-ATPase by the fungus Neurospora crassa, and perhaps even the less orthodox acquisition of an additional NACHT NTPase at a late stage of animal evolution. This hypothesis of multiple acquisitions of PCD-related genes by early eukaryotes from bacteria is clearly in line with Doolittle's ratchet mechanism.

Proteins encoded by scavenged bacterial genes appear to constitute the core of the ancestral eukaryotic apoptotic machinery: a caspase-like protease (probably a metacaspase), an HtrA-like protease, and AP-ATPases were the principal enzymatic components, whereas the TIR domain might have functioned as the main adaptor. These core components have undergone further, lineage-specific proliferation and specialization, such as expansion of caspases in vertebrates and of metacaspases and AP-ATPases in plants. Around this core, the outer layers of the apoptotic machinery have built up gradually from exapted domains that originally might have had different functions, such as MATH or BIR, and from newly “invented” domains, such as the six-helical adaptor domain, which subsequently gave rise to the Death, Death Effector, and CARD domains (Table 6.5).

Returning to the three routes of eukaryotic innovation mentioned in the beginning of this section, we see that routes (ii) and (iii), i.e. exaptation of bacterial and perhaps a few ancestral archaeo-eukaryotic proteins for new functions and “invention” of novel domains, contributed substantially to the evolution of PCD. Whether or not route (i), direct recruitment of a functionally analogous bacterial precursor, was also employed, remains to be established through functional characterization of the bacterial homologs of apoptotic proteins.

The observations described here emphasize the pivotal role of bacterial-eukaryotic HGT in the origin of the eukaryotic PCD system and, by implication, of the eukaryotic multicellularity itself. Indeed, much of the glory of eukaryotic ascension to the ultimate complexity of higher plants and animals might owe to a “lucky” choice of bacteria with complicated differentiation processes as the primary, promitochondrial, and perhaps subsequent symbionts.

Mitochondria appear to be among the principal (if not the principal) sensors of cell damage that trigger PCD by releasing cytochrome c, which stimulates apoptosome assembly [5,113]. Furthermore, additional proteins, such as AIF and OMI, are also released from mitochondria and contribute to PCD. Is there an intrinsic connection between the role of mitochondria in PCD and the origin of the eukaryotic apoptotic system? This is not immediately obvious, in part, because the involvement of mitochondria in apoptosis has been demonstrated primarily in the vertebrate model system, potentially allowing for the possibility that mitochondria are a late addition to the ancestral repertoire of apoptotic regulators. However, several recent studies suggest that the mitochondrial contribution to PCD is likely to be ancient, e.g. the demonstration of the role of AIF in PCD in the slime mold and the role of mitochondrial endonuclease G in apoptotic DNA degradation in the nematode [58,658]. Indications of a mitochondrial involvement in PCD in plants [487] and of a potential involvement of the metacaspase in mitochondrial biogenesis in yeast [818] add to the growing evidence of an ancient role of mitochondria in eukaryotic PCD. The other side of the problem is that mitochondrial endosymbiosis and the origin of PCD appear to be uncoupled in time because endosymbiosis, a very early event in eukaryotic evolution, apparently was followed by a lengthy age of unicellular eukaryotes, which generally are not known to have PCD. Thus, mitochondrial acquisitions, such as AIF and metacaspase, might have been “pre-adaptations” for PCD, which originally had other roles in primitive eukaryotes and were only later exapted for their functions in apoptosis. However, the recent striking experiments demonstrating the role of yeast and possibly even trypanosome (an early-branching protist) metacaspases in PCD might lead to a revision of these views [537,818]. 
Whether early-branching, truly unicellular eukaryotes have PCD is a subject of major interest; the ultimate evidence can be obtained only by direct experiments, but the conservation of the metacaspase in these primitive eukaryotes is suggestive. Regardless of the outcome of such studies, a straightforward hypothesis can connect pro-mitochondrial endosymbiosis with the origin of eukaryotic PCD. The early α-proteobacterial endosymbionts might have been using secreted and membrane proteases, such as metacaspases, paracaspases, and HtrA-like proteases, to kill the host cells once the latter became inhospitable, e.g. because of scarcity of nutrients or accumulation of free radicals. Such a mechanism could enable the endosymbionts to efficiently use the corpse of the assassinated host and move to a new one. During subsequent evolution, this weapon of aggression might have been appropriated by the host and made into a means of programmed suicide, with the subsequent addition of regulatory components [240,457].

6.5. Conclusions and Outlook: Evolution Tinkers with Fluid Genomes

In this chapter, we covered a lot of territory, at the inevitable price of sketchiness. Even so, we managed to discuss only a few of the important evolutionary issues brought into focus by comparative genomics. We already touched upon some other aspects, including evolution of genome organization, in Chapters 2 and 5; evolution of metabolic pathways and of protein domains and families will be discussed in Chapters 7 and 8, respectively. Nevertheless, we certainly cannot hope to present a comprehensive treatise on “Genomics and Evolution”. Not only is the subject vast but, more importantly, research in this area started in earnest only three to four years ago (in part, for the obvious and excusable reason that genome data simply were not around before that). Therefore, development of new approaches and systematic application of the existing ones are required to tease out answers to evolutionary puzzles hidden in the genomes. The principal notion we tried to convey is that comparative genomics has already made the picture of the evolution of life substantially more complex but also immensely richer and more interesting than anyone could have imagined in the pre-genomic era.

Three interlinked fundamental messages, we believe, are here to stay. Firstly, comparative genomics shows that genomes are much more dynamic, even volatile (on the evolutionary scale), systems than previously thought. Lineage-specific gene loss and horizontal gene transfer can no longer be treated as peripheral evolutionary phenomena that may be involved in important but specialized cases, such as evolution of parasites and antibiotic resistance. Instead, they should be accorded the status of major factors of evolution, which, at least among unicellular life forms, are ubiquitous and as important as vertical inheritance. If HGT (but not necessarily gene loss) is significantly less common in multicellular eukaryotes, they more than compensate with intragenomic mobility, including recruitment of mobile elements for coding and regulatory regions [517,539,609]. We believe that these results of comparative genomics amount to a new view of the evolutionary process. The new picture in no way contradicts what we at least consider to be the cornerstone of Darwinism: the central role of natural selection (in its substantially different forms, we must now add) in evolution. However, just like many modern developments in evolutionary biology itself [304], the new picture promulgated by genomics defies the exclusive emphasis on small, gradual mutational change, which was part of Darwin's message in The Origin of Species and had been further elevated in status by the neo-Darwinian synthesis.

Secondly, comparative genomics reaffirms, through numerous spectacular illustrations, a rather old but so far (we believe) not fully appreciated evolutionary principle captured in a brilliant metaphor of François Jacob [388]: evolution is largely a tinkerer who achieves the best feasible result by combining, sometimes in haphazard ways, whatever materials are at hand. Comparative genomics has shown in abundance how evolution tinkers with protein domains and operons (more about this in Chapters 5 and 8) to produce amazingly diverse, effective, and subtle signaling and regulatory systems. Genome flexibility, to a large extent ensured by HGT, provides ample material for this. Tinkerers are not supposed to be particularly good at inventing truly new gadgets, and evolution indeed seems to avoid this as much as possible, by reutilizing, modifying, and recombining already tried solutions. However, true novelty also emerges, particularly in complex eukaryotes. To be sure, on some occasions, what looks completely new is only the ultimate form of tinkering, when the original gadget ceases to be recognizable; we have seen such cases when discussing exaptation of prokaryotic proteins for some crucial eukaryotic functions (Table 6.4). In other situations, tinkering turns into invention, which may occur via evolution of a globular domain from a generic non-globular structure, such as a coiled coil [45], or even through emergence of coding sequences from non-coding ones. We do not discuss these processes in detail in this book, not so much for lack of space as because their extent and mechanisms are still not sufficiently clear.

Thirdly, and finally, it seems that comparative genomics not only vastly complicates the picture of life's evolution but also provides the information necessary to resolve the principles and details of this picture. The methods and concrete studies that take us in this direction are starting to appear but much more remains to be done.

6.6. Further Reading

Crick FH. The origin of the genetic code. Journal of Molecular Biology. 1968;38:367–379. [PubMed: 4887876]
Jacob F. Evolution and tinkering. Science. 1977;196:1161–1166. [PubMed: 860134]
Woese C. The universal ancestor. Proceedings of the National Academy of Sciences of the United States of America. 1998;95:6854–6859. [PMC free article: PMC22660] [PubMed: 9618502]
Woese CR. Interpreting the universal phylogenetic tree. Proceedings of the National Academy of Sciences of the United States of America. 2000;97:8392–8396. [PMC free article: PMC26958] [PubMed: 10900003]
Woese CR. On the evolution of cells. Proceedings of the National Academy of Sciences of the United States of America. 2002;99:8742–8747. [PMC free article: PMC124369] [PubMed: 12077305]
Leipe DD, Aravind L, Koonin EV. Did DNA replication evolve twice independently? Nucleic Acids Research. 1999;27:3389–3401. [PMC free article: PMC148579] [PubMed: 10446225]
Anantharaman V, Koonin EV, Aravind L. Comparative genomics and evolution of proteins involved in RNA metabolism. Nucleic Acids Research. 2002;30:1427–1464. [PMC free article: PMC101826] [PubMed: 11917006]
Snel B, Bork P, Huynen MA. Genomes in flux: the evolution of archaeal and proteobacterial gene content. Genome Research. 2002;12:17–25. [PubMed: 11779827]
Copyright © 2003, Kluwer Academic.
Bookshelf ID: NBK20254

