Copyright © 2007, Cold Spring Harbor Laboratory Press

Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world

1 Department of Crop Sciences, University of Illinois at Urbana–Champaign, Urbana, Illinois 61801, USA; 2 Department of Cell and Developmental Biology, University of Illinois at Urbana–Champaign, Urbana, Illinois 61801, USA

3 Corresponding author. E-mail gca/at/uiuc.edu; fax (217) 333-8046.

Received March 1, 2007; Accepted August 23, 2007.

Abstract

The repertoire of protein architectures in proteomes is evolutionarily conserved and capable of preserving an accurate record of genomic history. Here we use a census of protein architecture in 185 genomes that have been fully sequenced to generate genome-based phylogenies that describe the evolution of the protein world at fold (F) and fold superfamily (FSF) levels. The patterns of representation of F and FSF architectures over evolutionary history suggest three epochs in the evolution of the protein world: (1) architectural diversification, where members of an architecturally rich ancestral community diversified their protein repertoire; (2) superkingdom specification, where superkingdoms Archaea, Bacteria, and Eukarya were specified; and (3) organismal diversification, where F and FSF specific to relatively small sets of organisms appeared as the result of diversification of organismal lineages. Functional annotation of FSF along these architectural chronologies revealed patterns of discovery of biological function. Most importantly, the analysis identified an early and extensive differential loss of architectures occurring primarily in Archaea that segregates the archaeal lineage from the ancient community of organisms and establishes the first organismal divide. Reconstruction of phylogenomic trees of proteomes reflects the timeline of architectural diversification in the emerging lineages.
Thus, Archaea undertook a minimalist strategy using only a small subset of the full architectural repertoire and then crystallized into a diversified superkingdom late in evolution. Our analysis also suggests a communal ancestor to all life that was molecularly complex and adopted genomic strategies currently present in Eukarya.

The repertoire of protein structures encoded in a genome delimits the cellular functions and interactions that sustain cellular life. It also serves as an imprint of genomic history. While nucleic acid and protein sequence can be highly dynamic, domain structure in proteins is generally maintained for long periods of evolutionary time (Gerstein and Hegyi 1998; Chothia et al. 2003). For this reason, domains are considered not only units of structure but also units of evolution (Murzin et al. 1995; Orengo et al. 1997; Riley and Labedan 1997). In particular, the discovery of an architectural design, that is, an orderly and unique arrangement of protein components in three-dimensional (3D) space (herein referred to as an “architecture”), constitutes an important and rare event in protein evolution that adds new functions to the protein world. In fact, there have been very few of these finds in the history of life on earth. The number of fold (F) architectures discovered so far amounts to only ~1000, the number of fold superfamilies (FSF) to ~1500, and the number of fold families (FF) to ~2500, according to one classification (Murzin et al. 1995; Andreeva et al. 2004). F and FSF architectures are highly conserved in nature. FSF are composed of protein molecules with low sequence identity but with structures and functions indicative of a probable common evolutionary origin (they group one or more sequence-related FF). F group FSF with secondary structures that are similarly arranged in 3D space but that may not necessarily be evolutionarily related.
The vast majority of F and FSF represent highly successful architectural discoveries that have accumulated and dispersed throughout the 10⁷–10⁸ species that inhabit our planet. A delicate balance of survival and extinction of structural discoveries probably triggered propagation, but as with Galton-Watson branching processes (Harris 1963), only successful architectures are the ones represented by the >10³ proteins per genome (i.e., the complement defining a proteome) that make up the estimated ~10¹⁰–10¹⁴ proteins in existence today. Consequently, the repertoire of architectures in proteomes can be regarded as a collection of historical imprints or molecular fossils preserved in nature by successful propagation and evolutionary “lock-in” (preservation of the original architecture by “structural canalization”) (Ancel and Fontana 2000). Indeed, the occurrence and abundance of F and FSF, and their combination in proteins, has been used successfully to build reasonable universal trees of life capable of describing the history of major organismal lineages satisfactorily (Caetano-Anollés and Caetano-Anollés 2003; Yang et al. 2005; Wang and Caetano-Anollés 2006). Furthermore, the phylogenetic analysis of the architectural repertoire can dissect deep evolutionary phenomena related to the origins of life (Caetano-Anollés and Caetano-Anollés 2003, 2005; Dupont et al. 2006; Wang et al. 2006; Caetano-Anollés et al. 2007). In this study, we take advantage of this potential.

The ancestor of all organisms alive today is at the root of the universal phylogenetic tree, and its cellular and molecular organization illuminates our understanding of how life originated and evolved (Woese 1998; Penny and Poole 1999). However, its nature has been controversial. This stems from limitations and conflict in the evolutionary signals that are embedded in the limited number of molecular or cellular features that have been analyzed.
The canonical view, stemming mostly from ribosomal RNA (rRNA), elongation factors, and other molecules of the “informational” class, suggests that the ancestor was simple and prokaryotic-like (Woese et al. 1990; Woese 1998) and that horizontal gene transfer (HGT) was rampant in early evolution (Doolittle 1999). In contrast, a tracing of the origins of the tripartite world from an ancient RNA world based on DNA sequence, RNA relics, and other considerations suggests that the ancestor was eukaryotic-like and complex (Poole et al. 1998; Forterre and Philippe 1999; Penny and Poole 1999; Kurland et al. 2006). Moreover, analysis of entire genomic complements indicated that massive HGT was not warranted (e.g., Snel et al. 1999; Gough 2005) or did not impair phylogenetic reconstruction of a universal tree (Doolittle 2005). It also revealed the complexities of phylogenetic reconstruction (Delsuc et al. 2005). Despite the promises of evolutionary genomics, the nature of the universal ancestor and the universal tree has yet to be resolved (Delsuc et al. 2005; Doolittle 2005). However, phylogenetic analyses of combined or concatenated genomic sequences (e.g., Ciccarelli et al. 2006) or genomic features describing the survey (e.g., Snel et al. 1999; Yang et al. 2005; Wang and Caetano-Anollés 2006) or arrangement (e.g., Korbel et al. 2002) of genomic component parts suggest a clear tripartite division into organismal domains Archaea, Bacteria and Eukarya (herein referred to as “superkingdoms” to avoid confusion between “domains” of organisms or molecules). We recently used a genomic census of protein architecture to generate genome-based phylogenies (phylogenomic trees) that describe the evolution of the protein world at different hierarchical levels of protein structural organization (Caetano-Anollés and Caetano-Anollés 2003, 2005; Wang et al. 2006). 
These trees were used to classify proteins (mostly globular), define structural transformations, and uncover evolutionary patterns in structure. We also traced patterns of organismal distribution in these trees and found that architectures at the base were omnipresent or common to all superkingdoms and that a timeline of organismal diversification could be inferred (Caetano-Anollés and Caetano-Anollés 2005; Wang et al. 2006). The diversity of ancient architectures common to superkingdoms suggested that the universal ancestor had a complex and relatively modern eukaryotic-like organization and hinted at a prokaryotic world stemming fundamentally from reductive evolutionary processes. In this study, we embark on a systematic and global study of 185 genomes that have been fully sequenced and represent organisms from all three superkingdoms of life that exhibit free-living (FL), parasitic (P), and obligate parasitic (OP) lifestyles. We first reconstructed phylogenomic trees of F and FSF using standard phylogenetic methods. The trees uncovered congruent patterns of architectural diversification and reductive evolutionary processes. Finally, we used this information to reconstruct global trees of proteomes and to propose a scenario for the birth and diversification of the tripartite world. Results Patterns of F and FSF distribution in the proteome world: Three epochs in protein evolution We generated intrinsically rooted trees of 776 F and 1259 FSF (Fig. 1A
When these architectural chronologies were dissected for the three superkingdoms (Fig. 2
These results suggest three epochs in protein evolution, which we then subdivide into six phases, each delimited by patterns of architectural use (elaborated in the Discussion): (1) Architectural diversification (ndF < 0.40 and ndFSF < 0.49; light green areas in Fig. 1B Further evidence, presented below through the analysis of architectural distribution (Fig. 2A All basal (nd < 0.1) and many of the more recent (nd < 0.4) F and FSF architectures were shared by most if not all proteomes in all superkingdoms (Figs. 1 Decreases in architectural representation (f-value) occurred also in Eukarya and Bacteria, but involved fewer and younger architectures. Architectural loss begins at ndF = 0.399 and ndFSF = 0.391 when the Bacteria and Eukarya (BE)-specific architectures experience a notable decrease in representation (Fig. 2B The decreasing trend in architectural representation (eukaryal “loser trend”) continues until the appearance of prokaryote-specific (AB) F and FSF at ndF = 0.491 and ndFSF = 0.538 (AB bar in Fig. 2A Evolution of cellular function To explain the above trends from a functional perspective, we tallied the FSF participating in various cellular functions in every phase of the architectural chronology. Functions were defined using a hierarchical coarse-grained classification encompassing seven functional categories and 50 subcategories (Vogel et al. 2004a, 2005; Vogel and Chothia 2006). For each phase and category, the fraction (fo) of FSF used in each superkingdom was calculated (Fig. 3
Most broad functions were invented very early in phase I, and all associated FSF were necessary for cellular physiology: none of them dropped out of use (fo ~ 1) (Fig. 3 During the superkingdom specification epoch, Bacteria became specified through the invention of several highly represented FSF corresponding to “information,” “intracellular processes,” and “regulation” functions in order of decreasing representation (f = 0.8–0.9 for functions in Bacteria, and significantly higher than those in Archaea and Eukarya, f = 0.2–0.5). Interestingly, Eukarya seem to be specified earlier than suggested by the architectural chronologies (Fig. 2 Further in evolution, bacterial FSF invention is prominent in phase IV (fo = 1 for most functions), while Archaea and Eukarya follow the loser trend in parallel with each other. This loser trend turns into diversification, especially in phase V for all three superkingdoms, evidenced by low usage of all FSF (f close to 0) and incomplete retention of FSF invented in this phase (fo < 1). In phase VI, Eukarya retain (fo bars and f close to 1) and Bacteria diversify all functions (tall fo bars with very low f). Archaea substantially raise their usage of “information” FSF, corresponding mostly to unknown functions (Supplemental Fig. S2). In terms of global evolutionary patterns, functions associated with “general,” “regulation,” and “intracellular processes” were abundant early and late in evolution; “metabolism” was maximal early and decreased steadily in time; “information” peaked midway (phases II and III); and “extracellular processes” and “other” were poorly represented early but increased in time (Fig. 3 Reconstruction of proteome trees Based on previous results, we reconstructed trees of proteomes to follow the rise of three organismal superkingdoms in evolution. We excluded organisms leading parasitic lifestyles (P and OP) from further phylogenomic analysis to increase the reliability of deep branches. 
This decision was based on the massive loss of architectures in parasitic lifestyles (Fig. 2B
Effect of parasitic lifestyles Proteomes from organisms with parasitic lifestyles (both P and OP) significantly affected the distribution of protein architectures between organisms. Most prominently, parasitic organisms lack a significant number of architectures that appeared throughout evolution (depicted by gray circles in Fig. 2
Occurrence and abundance of architectures in proteomes To examine the present-day outcome of the evolutionary scenario described above, we calculated the occurrence (usage) and abundance of architectures in proteomes analyzed (Fig. 6
Discussion Phylogenomic reconstruction of the protein world Advances in structural bioinformatics have extended structural information deposited in the Protein Data Bank (PDB) to macromolecules encoded by more than half of gene complements identified in the >500 fully sequenced genomes published to date (Grant et al. 2004). In this study, we use information embedded in a structural genomic census of protein architecture to generate trees that describe the evolution of protein structure at F and FSF hierarchical levels (Fig. 1A Our analysis does not consider the increasingly important contribution of non-coding functional RNA molecules (Eddy 2001). However, it does provide a comprehensive analysis of proteins encoded in the genomes we studied. The F and FSF examined here represent our current view of the complexity of the protein and organismal world. These architectures are associated with proteins that play diverse and fundamental functional roles in the cell, such as translational and transcriptional machinery, metabolic and signaling pathways, structural scaffolds, and many other aspects important for cellular function and interaction. The proteins themselves cannot capture adequately deep phylogenetic relationships because of the erasing effects of mutation and HGT; a comparative genomic exercise therefore reveals genomes as evolutionary mosaics of genes (Lester et al. 2005). A focus on molecular designs that are immutable for extended periods of time rather than a focus on the vagaries of gene sequence uncovers here deep historical signatures. These signatures are more successfully preserved in the architectural repertoire the older the architectures studied, because older architectures are more abundant and diverse. These ancient architectural designs provide important clues related to the molecular origins of modern life. 
Thus, the conclusions of this study are independent of the outcome of major debates in the evolutionary field, including the degree of HGT in the primordial and diversifying world (Kurland 2005), the origin of the eukaryotic cell (Poole and Penny 2007), and the ability of a single bifurcating tree to represent the evolution of superkingdoms of life (Doolittle and Bapteste 2007), most of which are centered on the limitations of genomic sequence evidence. Furthermore, architectural distributions reflect evolutionary and ecological pressures on the organisms, because F and FSF represent functional units of proteins, and their function is being selected for maximum survival of an organismal lineage within its environment. Consequently, architectural distributions today carry the imprint of the adaptation strategies adopted by the three superkingdoms during their evolution, and it is the evolution of those adaptations that we infer in this study. Specifically, we infer the timing of superkingdom specification and organismal diversification based on F and FSF distribution in organisms. The differences in F and FSF distribution patterns allow us to propose a timeline and mechanisms of organismal lineage segregation from the communal ancestor, as discussed below. Mechanisms of protein architecture distribution between organisms Phylogenetic trees of architectures embed timelines of protein discovery. Along these architectural chronologies, the distribution (f) of F and FSF in the organismal world as a function of their age (nd) was variable (Figs. 1 Here we use terminology that describes decreases in f as relative “loss” of architectures. In reality, decreases in f are solely due to changes in their representation. When a new molecular design appears, it is added to the global molecular repertoire. However, when some species fail to acquire the design, it may appear as a loss from their proteome, resulting in f < 1. 
We cannot distinguish this from the possibility of an original acquisition and subsequent loss of the design owing to it being unnecessary or incompatible with the lifestyle of the organism. Proteome evolution and the birth of the three superkingdoms of life Architectural chronologies derived from F and FSF trees revealed clear and congruent evolutionary patterns of origin and diversification of organismal groups (Figs. 1 Epoch 1: Architectural diversification Phase I: Organisms at the start of the protein world were molecularly complex and part of a rich communal world (0.000 < ndF < 0.162 and 0.000 < ndFSF < 0.092) All proteomes in all superkingdoms shared ancient F and FSF that were basal in the trees, including even P and OP organisms whose genomes are highly reduced. The 53 most basal F probably encompass the proteome complexity of this evolutionary period of life (Supplemental Fig. S1). The mere number of shared architectures suggests that the primordial organisms were molecularly complex and largely similar to each other (Fig. 2A The data from this evolutionary phase are compatible with the concept of a communal world similar to the one proposed by Woese (1998). However, this world was molecularly rich and contained complex architectures that encompassed each and every one of the six major SCOP classes of protein structure. About 40% of F and ~32% of FSF were in place before any superkingdom-specific architectures emerged, setting an upper bound for the architectural repertoire of the communal world. The relative richness of the architectural repertoire in the primordial organisms does not necessarily entail a large size of the proteome in comparison with modern organisms; thus the absolute size of the ancestral proteome still remains unknown. 
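The distribution index f used throughout this discussion can be illustrated with a small sketch. It assumes that f is simply the fraction of proteomes in a set that contain at least one instance of an architecture, which is how the text uses the term (f < 1 when some proteomes lack a design). The function name, the proteome labels, and the toy architecture identifiers below are hypothetical and are not taken from the study's data.

```python
from typing import Dict, Set


def distribution_index(proteomes: Dict[str, Set[str]], architecture: str) -> float:
    """Return f: the fraction of proteomes containing the architecture.

    `proteomes` maps a proteome label to the set of architectures
    (F or FSF identifiers) detected in it. Assumed definition of f.
    """
    if not proteomes:
        raise ValueError("at least one proteome is required")
    hits = sum(1 for archs in proteomes.values() if architecture in archs)
    return hits / len(proteomes)


# Hypothetical toy census: four proteomes, four architectures.
toy_census = {
    "archaeon_1": {"arch_A", "arch_B"},
    "bacterium_1": {"arch_A", "arch_B", "arch_C"},
    "bacterium_2": {"arch_A", "arch_C"},
    "eukaryote_1": {"arch_A", "arch_B", "arch_C", "arch_D"},
}

f_values = {a: distribution_index(toy_census, a)
            for a in ("arch_A", "arch_B", "arch_C", "arch_D")}
```

In this toy census, an omnipresent design such as `arch_A` has f = 1, whereas a design restricted to one proteome has a low f. As noted above, f alone cannot tell whether a missing design was lost or never acquired. Restricting `proteomes` to the members of one superkingdom gives a superkingdom-specific f.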
Phase II: The first organismal divide produced archaeal-like ancestors with reduced proteomes and a minimalist strategy (0.162 < ndF < 0.399 and 0.092 < ndFSF < 0.391) The organismal representation of architectures that occurred later in evolution was progressively smaller. The initially moderate decrease in representation (f-values high but <1) can be explained by architectural loss due to proteome reduction, not by architectural sorting processes in lineages. Additional decreases in f were likely caused by secondary adaptations that are not contemporary to this period, for example, due to organismal-dependent P and OP lifestyles (see below). The differential loss of F and FSF was particularly extensive in Archaea—the superkingdom that was also the first to experience complete loss (or lack of appearance) of architectures. Over time, this superkingdom lost a total of 175 F and 308 FSF specific to Eukarya and Bacteria (EB), resulting in the highly compact proteomes typical of today’s Archaea (Fig. 6 Epoch 2: Superkingdom specification Phase III: Reductive tendencies in the eukaryal-like ancestor led to the first superkingdom specification event and the emergence of Bacteria (0.399 < ndF < 0.439 and 0.391 < ndFSF < 0.489) Reductive tendencies were also present in the eukaryal-like ancestor, but involved fewer and younger architectures compared to Archaea. The first superkingdom-specific architecture appeared in Bacteria, signaling the “official” start of the superkingdom specification epoch. However, the appearance of the first superkingdom-specific architecture in the trees should be regarded as upper bounds to this period. Lineage diversification in Eukarya and Bacteria may have started significantly before their specific architectures appeared, as suggested by the significant loss of earlier F and FSF in both superkingdoms. 
Phase IV: Discovery of prokaryote-specific architectures and the rise of superkingdoms Eukarya and Archaea (0.439 < ndF < 0.543 and 0.489 < ndFSF < 0.614) This evolutionary phase delimits the steady decrease of f during species diversification in Bacteria, concurrent with lineage specification in the other two superkingdoms. We propose that reduced representation of architectures among organisms at this time may have been caused by several factors, including sorting of architectures in lineages, increased fusion of domains into domain combinations (M. Wang and G. Caetano-Anollés, in prep.), and intensification of proteome reductive tendencies that started in phases II and III. The concomitant appearance of the first F and FSF unique to Archaea and Eukarya marked the start of their specification. The late specification of Archaea contrasts with the early proteome reduction that defined the primordial archaeal-like ancestor. Perhaps the rates of processes underlying the adaptation of the archaeal-like ancestor to extreme environments were very different from those operating in the ancestors of the other superkingdoms and caused a delay of the lineage specification process. Ultimately, the timing of lineage specification follows the canonical and widely accepted topology of the universal tree of life, which is also reflected in the phylogeny from architectures arising during the superkingdom specification and diversification epochs (Fig. 4C Epoch 3: Organismal diversification Phase V: A burst of architectural innovation in Bacteria and Eukarya (0.543 < ndF < 0.601 and 0.614 < ndFSF < 0.674) During this brief period, a marked burst of F and FSF architectures with low f-values was evident in Bacteria and Eukarya, associated with proteins that establish domain combinations (M. Wang and G. Caetano-Anollés, in prep.). Many architectures that originated here are unique to Bacteria or to Eukarya. 
This, combined with their low representation, suggests that this was a period of “experimentation,” when organisms “searched” through the possible protein configurations for a promising beginning of stable lineages within the recently specified superkingdoms. Phase VI: Genome expansion and homogenization of proteomes in Eukarya and genome reduction in Archaea and Bacteria (0.601 < ndF < 1.000 and 0.674 < ndFSF < 1.000) Once commitment to archaeal, bacterial, or eukaryal lifestyle was in place, the proteomes in the three superkingdoms appeared to follow divergent evolutionary paths. While Archaea and Bacteria show signs of alternating retention and loss of architectures, architectural retention was increased in eukaryal lineages. We suggest that increases in architectural representation in Eukarya were caused by genome expansion, fission, and fusion/fission of domain combinations previously generated in the burst of phase V, endosymbiotic events mostly involving Bacteria, and HGT events, in order of decreasing importance. The process continued in Eukarya until new architectures were present in most eukaryotic genomes analyzed (f close to 1 again). This striking evolutionary path peculiar to Eukarya differs notably from mechanisms operating in Archaea and Bacteria, which seem to follow lineage sorting, genome reduction tendencies, and genome expansion due to HGT events (e.g., viral or plasmid transfer). Ecological and functional mechanisms of superkingdom diversification The patterns of F and FSF acquisition and retention within each superkingdom were certainly affected by the specific needs of organisms and their adaptation to the environment. The entire history of protein architectural evolution can thus be interpreted in ecological terms. As we have seen, Archaea were the first superkingdom to segregate from the rest by adopting the minimalist approach to the molecular repertoire. 
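The six phases and three epochs described above are delimited by nd boundaries, so the assignment of an architecture to a phase can be sketched as a simple lookup. The boundary values are those quoted in the phase headings of this Discussion. The function names are hypothetical, and because the text gives only strict inequalities, assigning a value that falls exactly on a boundary to the later phase is an assumption of this sketch.

```python
from bisect import bisect_right

PHASES = ("I", "II", "III", "IV", "V", "VI")

# Upper nd boundaries of phases I-V, as quoted in the phase headings.
BOUNDARIES = {
    "F": (0.162, 0.399, 0.439, 0.543, 0.601),
    "FSF": (0.092, 0.391, 0.489, 0.614, 0.674),
}

EPOCH_OF_PHASE = {
    "I": "architectural diversification",
    "II": "architectural diversification",
    "III": "superkingdom specification",
    "IV": "superkingdom specification",
    "V": "organismal diversification",
    "VI": "organismal diversification",
}


def phase_of(nd: float, level: str = "FSF") -> str:
    """Map a node distance (0 = most ancient, 1 = most recent) to a phase.

    Assumption: a value exactly on a boundary goes to the later phase.
    """
    if not 0.0 <= nd <= 1.0:
        raise ValueError("nd must lie in [0, 1]")
    return PHASES[bisect_right(BOUNDARIES[level], nd)]


def epoch_of(nd: float, level: str = "FSF") -> str:
    """Map a node distance to the epoch that contains its phase."""
    return EPOCH_OF_PHASE[phase_of(nd, level)]
```

Note that the same nd value can fall in different phases at the two levels: for example, nd = 0.65 lies in phase VI for F but in phase V for FSF, because the two trees have separate timelines.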
This early segregation of the archaeal-like ancestor from the eukaryal-like ancestor must have been compromised by HGT, as no substantial lineage splitting was evidenced by appearance of superkingdom-specific architectures at that time. Later they may have turned into ecologically more structured populations because of both natural selection (Vetsigian et al. 2006) and adaptations to new environmental niches (L.S. Yafremava, J.E. Mittenthal, and G. Caetano-Anollés, in prep.). The archaeal-like ancestor may have been defined by adaptation to physical extremes, because extreme conditions, such as very high or very low pH, acidity, or pressure, may limit the number of functional protein variants, thus reducing the number of viable protein architectures in a cell (L.S. Yafremava, J.E. Mittenthal, and G. Caetano-Anollés, in prep.). For example, adaptation to extremely high temperatures is believed to cause proteins to be more compact and hydrophobic (structure-based thermostabilization) (Penny and Poole 1999; Berezovsky and Shakhnovich 2005). Adaptations to possible chronic energy stress in methanogens, methane oxidizers, and nitrifiers (Valentine 2007) may also have led to a limited number of protein architectures that an organism is able to support. All these processes can impose constraints on structure that lead to a reduced and highly specialized protein repertoire, resulting, for example, in loss of FSF in all biological functions in phase III; these FSF could have been unstable in harsh environments (Fig. 3 The eukaryal-like emerging lineage with its large and diverse architectural repertoire may have been better suited for K-selection by exploiting flexibility of use of environmental resources (Carlile 1982). Later, some lineages may have discovered the advantages of rapid growth in times when nutrients were accessible (possibly enabled by a DNA-binding apparatus invented in phase III and fully retained by bacteria) (Fig.
3 The first functional specification event in Eukarya seems to occur in phase II: all the cell adhesion and immune response FSF invented in that phase were retained in all modern eukaryotic organisms—the only trend that is different in Eukarya compared to other superkingdoms. It is possible that full retention of these functions allowed Eukarya-like lineages to escape the survival struggle that necessitates quick reproduction, thereby setting up the conditions for long-term growth, storage, and multicellularity peculiar to eukaryotic organisms. Proteome reductions triggered by parasitic lifestyle Analysis of F and FSF specific to organisms with FL, P, and OP lifestyles showed that architectures unique to the P and OP categories and shared by them appeared concurrently with architectures specific to Archaea and Eukarya, and once the Bacterial superkingdom was in place (Fig. 5 In addition, we observed an expected tendency of parasitic organisms to have the smallest molecular repertoire within their respective superkingdoms. This reductive tendency significantly contributed to decreases in f throughout the evolutionary timeline until ndF = 0.757 or ndFSF = 0.886, delimiting a period of development of most P and OP interactions (Figs. 1 Evolutionary impact on architectural repertoires of present-day organisms Based on the above observations, we predict that genome reductive tendencies in Archaea and Bacteria must result in a substantial reduction in size of their proteomic repertoires, compared to Eukarya. The early start and protractive tendencies of architectural loss in Archaea predict that proteome reduction must be maximal in this superkingdom. Indeed, patterns of architectural occurrence and abundance in genomes (Fig. 
6 Rooting the universal phylogenomic tree The topologies of the trees of proteomes reflect the events of the evolutionary timeline that are contemporary to the FSF architectures used in tree reconstruction and provide another tool to visualize the process of superkingdom specification and diversification, regardless of their possible ancestral relationship. Global trees of proteomes reconstructed from ancient FSF encompassing the architectural diversification epoch revealed a paraphyletic rooting in Archaea, reflecting their early segregation through the minimalist strategy. A rooting of the universal tree in Archaea supports paleobiological claims of early archaeal lipids and methanogenic activity linked to the fossil record (Chappe et al. 1979; Michaelis and Albrecht 1979; Schopf 1999) and contrasts with the canonical view of a bacterial ancestor (Woese et al. 1990). In our global trees of proteomes, ancient FSF revealed a paraphyletic rooting in Archaea and the monophyly of Eukarya, defining eukaryal-like ancestors as heirs to the rich communal world and progenitors of Eukarya and Bacteria. FSF of intermediate age revealed a strongly supported sister-clade relationship of Bacteria and Eukarya. Taken together, these trees reflected the early structuring and diversification of the communal world and the formation of archaeal-like and eukaryal-like emerging lineages during this time. In turn, a global tree reconstructed from the derived half of the FSF tree revealed the monophyletic nature of the three superkingdoms and a rooting in the Bacteria, consistent with their leading role in superkingdom specification. The inclusion of only FL organisms in this analysis minimized historical reconstruction artifacts due to parasitic lifestyle. Exclusion of problematic taxa notably enhanced the support of basal branches in the trees and minimized inconsistent placement of taxa. 
Indeed, some of the excluded taxa (e.g., Trypanosoma, Encephalitozoon, Nanoarchaeum) had highly reduced proteomes, were big losers of ancient architectures, and were generally oddly placed in trees of proteomes that have been previously reconstructed (Yang et al. 2005; Wang and Caetano-Anollés 2006). Conclusions In this study, we use an unorthodox approach to analyze the origins of the tripartite world. This approach focuses on building trees of architectures instead of universal trees of organisms and reveals evolutionary relationships at a genomic scale. The importance of the analysis presented here is that it pinpoints a possible mechanism by which superkingdoms emerged from the communal ancestor, specifically by adopting different strategies of F and FSF usage, possibly in response to different environmental pressures. These strategies involve reduction (notable in Archaea) and expansions (Bacteria and Eukarya) of the global protein repertoire:
Methods

Genomic census

We analyzed the genome sequences of 185 organisms, including 19 Archaea (A), 129 Bacteria (B), and 37 Eukarya (E). Of these, 82, 58, and 45 had free-living (FL), parasitic (P), and obligate parasitic (OP) lifestyles, respectively, assigned using the general strategy described in Supplemental Figure S5. Free-living, parasite, obligate parasite, commensal, obligate commensal, symbiotic, and other lifestyles were annotated manually using various sources of information. For convenience, we pooled genomes from organisms that established symbiotic or commensal interactions into the parasitic groups to define the FL, P, and OP lifestyles.

Structural protein domains were assigned to genome-encoded proteins at FSF level using hybrid linear hidden Markov models (HMMs) for remote homology detection in SUPERFAMILY version 1.67 (Gough et al. 2001). Genome sequences were scanned against an HMM library generated using the iterative Sequence Alignment and Modeling System (SAM) method. Each model generated by SAM-T02 identified each non-identical SCOP domain. The HMM searching protocol used a probability cutoff E of 0.02; more stringent cutoff values did not alter the topologies of the reconstructed trees (Yang et al. 2005). An internal calibration of the accuracy of HMM prediction against Protein Data Bank (PDB) records in the ASTRAL compendium (Brenner et al. 2000) showed that the method correctly identified 98% of sequences analyzed (Kim et al. 2006). The structural census assigned protein domains to ~50% of genomic sequences, ranging from 15% to 71% in individual genomes with a median of 52% (Wang et al. 2006).

FSF were assigned to F using the Structural Classification of Proteins (SCOP) database release 1.67. SCOP classifies 24,037 PDB entries into 65,122 domains, which are then grouped into 2630 FF, 1447 FSF, and 887 F architectures (Murzin et al. 1995). Biological functions associated with FSF were annotated using the coarse-grained classification described in SUPERFAMILY (Vogel et al. 2004a, 2005; Vogel and Chothia 2006). Functions related to small molecule metabolism were dissected using MANET (Kim et al. 2006). Note that FSF functions were annotated with respect to their usual role in a protein or biological network, which can be a matter of debate. Moreover, while an older FSF is likely to have generated a function at an earlier time, statistical correlations between FSF ancestry and age of the function may not necessarily be valid for individual proteins because of the vagaries of recruitment in networks (e.g., Caetano-Anollés et al. 2007). For example, a younger FSF could be recruited to perform a particular function in a protein earlier than an older FSF.

Phylogenomic analysis

The frequencies with which individual protein architectures occur in an individual genome, termed genomic abundance (G), were used to describe at global levels the popularity of F and FSF architectures. For phylogenetic analysis, G values were normalized to compensate for differences in genome size and proteome representation and were subjected to logarithmic transformation to account for unequal variance (Wang et al. 2006). We used the gap-recoding technique of Thiele (1993), developed for the analysis of morphometric data, in which a rescaling function rescores character information based on both the rank order and the size of gaps between character states. Values were range standardized to a 0–20 scale, as this range is compatible with most phylogenetic analysis programs, encoded using an alphanumeric format with numbers 0–9 and letters A–K in the NEXUS format, and subjected to phylogenetic analysis using maximum parsimony (MP) as the optimality criterion in PAUP* (Swofford 2002). Phylogenomic trees of proteomes and trees of architectures analyzed at F and FSF levels of protein classification were generated using linearly ordered multistate phylogenetic characters.
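The logarithmic transformation, 0–20 range standardization, and alphanumeric encoding described above can be illustrated with a minimal Python sketch. The function name `encode_abundance`, the use of ln(G + 1), and the scaling against the largest value in the supplied list are assumptions made for illustration; the exact normalization is given in Wang et al. (2006), and the Thiele gap-recoding step is omitted here.

```python
import math

# 21 linearly ordered character states: digits 0-9 followed by letters A-K.
SYMBOLS = "0123456789ABCDEFGHIJK"


def encode_abundance(g_values):
    """Turn raw genomic abundances into a string of ordered states.

    Assumption (not stated in the text): each value is transformed as
    ln(G + 1) and rescaled against the largest transformed value in
    g_values, so that the most abundant architecture receives state 'K'.
    """
    logs = [math.log(g + 1) for g in g_values]  # G = 0 (absence) maps to 0
    top = max(logs)
    if top == 0:
        # Every architecture is absent; all states are '0'.
        return "0" * len(g_values)
    scale = len(SYMBOLS) - 1  # highest state index, i.e. 20
    return "".join(SYMBOLS[round(scale * v / top)] for v in logs)


if __name__ == "__main__":
    # Hypothetical abundances of four architectures in one proteome.
    print(encode_abundance([0, 1, 10, 100]))
```

Strings produced this way could serve as rows of a NEXUS data matrix whose characters are declared as ordered, which is what makes the distance between states 3 and A count as seven steps under parsimony.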
Characters are observable features that distinguish one object from another and constitute hypotheses of primary homology. In our case, they display multiple numerical values and frequency distribution of values called character states. The ANCSTATES command was used to polarize characters, based on two fundamental premises: (1) that protein structure is far more conserved than sequence and carries considerable phylogenetic signal, and (2) that F and FSF architectures that are successful and popular in nature are generally more ancestral. We consider that FF that originated early in evolution are prominent in genomes and that the number of FF members increases in single steps corresponding to the addition or removal of a homologous gene in the family. We assume that this process is reversible and expresses an asymmetry, with gene duplication being favored over gene loss. Details and support for character argumentation and absence of circularity in assumptions have been described previously (Caetano-Anollés and Caetano-Anollés 2003, 2005; Wang et al. 2006). Because F and FSF are retained over long evolutionary times, their gain or loss constitutes important evolutionary events that appear to be independent of HGT and other convergent evolutionary processes (Gough 2005).

Phylogenetic reliability was evaluated by the bootstrap method in PAUP*. The structure of phylogenetic signal in the data was tested by the skewness (g1) of the length distribution of >10⁴ random trees and permutation tail probability (PTP) tests of cladistic covariation using >10³ replicates (Hillis and Huelsenbeck 1992). Ensemble consistency (CI) and retention (RI) indices were used to measure homoplasy and synapomorphy, confounding and desired phylogenetic characteristics, respectively.
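As an illustration of the statistics named above, the following Python sketch computes the g1 skewness of a sample of tree lengths and the ensemble CI and RI from per-character step counts. It uses the standard textbook definitions (g1 = m3 / m2^1.5; CI = M / S; RI = (G − S) / (G − M), where M, G, and S are the summed minimum, maximum, and observed steps). The function names and input layout are hypothetical, and PAUP* reports these values directly, so this is only a sketch of what the numbers mean.

```python
def skewness_g1(tree_lengths):
    """g1 skewness of a tree-length distribution (m3 / m2**1.5).

    Strongly negative values (a left-skewed distribution) are usually
    read as evidence of phylogenetic signal.
    """
    n = len(tree_lengths)
    mean = sum(tree_lengths) / n
    m2 = sum((x - mean) ** 2 for x in tree_lengths) / n  # second moment
    m3 = sum((x - mean) ** 3 for x in tree_lengths) / n  # third moment
    return m3 / m2 ** 1.5


def ensemble_ci(min_steps, observed_steps):
    """Ensemble consistency index: summed minimum steps / observed steps."""
    return sum(min_steps) / sum(observed_steps)


def ensemble_ri(min_steps, max_steps, observed_steps):
    """Ensemble retention index: (G - S) / (G - M)."""
    m, g, s = sum(min_steps), sum(max_steps), sum(observed_steps)
    return (g - s) / (g - m)


if __name__ == "__main__":
    # Hypothetical step counts for three characters on one tree.
    minimum, maximum, observed = [1, 1, 2], [3, 2, 5], [2, 1, 4]
    print(ensemble_ci(minimum, observed))
    print(ensemble_ri(minimum, maximum, observed))
```

A CI of 1 means no homoplasy on the tree, whereas an RI near 1 means that most of the potential synapomorphy in the data is retained as such.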
Our phylogenetic analyses depend on the accuracy and balance of genomic databases, efficient and accurate assignment of structures to protein sequences, adequate structural classification schemes in SCOP, and methods of phylogenetic tree reconstruction. For example, there are biases in the detection of FSF from proteins with PDB entries used as seed sequences of the HMMs and biases in the representation of sequences and genomes in the databases, favoring Bacteria over Eukarya and Archaea. The effects that these factors have on our approach have been discussed previously (Caetano-Anollés and Caetano-Anollés 2003, 2005) and have not been controlled in our experimental design. We do not expect that the operational definition of F and FSF will be seriously challenged, even though many F can be better described by continuous rather than discrete distributions in structure space (Harrison et al. 2002). Domain structures of globular proteins that have not been discovered to date are probably of low genomic abundance and are expected to be highly diverse (Gerstein and Hegyi 1998). Gene sequences with no structural assignments probably encode membrane proteins or globular proteins that are difficult to crystallize (Liu and Rost 2002). Future advances in structural genomics and bioinformatics will help fill structural “gaps,” will decrease the bias introduced by unassigned domains and structural elements, and will benefit our approach.

Organismal distribution analysis of F and FSF architectures

Protein architectures were classified into F and FSF distribution categories that describe their spread across the three superkingdoms of life.
Architectures appearing in all 185 organisms analyzed were assigned to the ABE0 category; those present in at least one proteome but in all superkingdoms were assigned to the ABE category; those present in two superkingdoms were assigned to the AE, AB, and BE categories; and those present in only one superkingdom were assigned to the A, B, and E categories. A distribution index (f) describing the distribution of individual architectures among proteomes was calculated. The f index represents the fraction of proteomes harboring an architecture within a category and ranges from absence (f = 0) to presence in all proteomes considered (f = 1).

Because reconstructed trees were intrinsically rooted, we used a Perl script to establish the relative age (ancestry) of individual protein architectures by measuring a distance in nodes from the hypothetical ancestral F or FSF on a relative 0–1 scale. This node distance (nd) counts the number of nodes (cladogenic events) along a lineage in the tree of architectures, starting from the root and traveling to each terminal leaf. Consequently, the nd ancestry value is 0 for the most ancient architecture and 1 for the most derived. Since rates of genetic evolution are generally linked to speciation (Webster et al. 2003), the total genetic distance from the root to its tips (path length) will be correlated with the number of nodes and consequently with nd. The contribution of processes of gradual evolution will therefore be negligible, and nd will represent a good approximation of a path length based on character state change in individual branches.

Protein classification databases are continuously updated to include more completely sequenced genomes and newly described F and FSF architectures. We cross-checked our results for several releases of the SUPERFAMILY database and found that the main overarching conclusions of this study remain the same: F and FSF distribution between organisms is preserved.
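The distribution categories, the f index, and the node distance defined in this section can be sketched in Python as follows. This is not the Perl script used in the study. The function names, the nested-tuple tree representation, the choice of proteome pool for f, and the decision to count only the nodes passed after the root (so that the most basal leaf scores 0) are all assumptions made for illustration.

```python
def distribution_category(presence, kingdom_of):
    """Assign an architecture to ABE0, ABE, AB, AE, BE, A, B, or E.

    presence:   non-empty set of genome ids in which the architecture occurs
    kingdom_of: dict mapping every analyzed genome id to 'A', 'B', or 'E'
    """
    if presence == set(kingdom_of):
        return "ABE0"  # found in every proteome analyzed
    kingdoms = {kingdom_of[g] for g in presence}
    return "".join(k for k in "ABE" if k in kingdoms)


def distribution_index(presence, pool):
    """f index: fraction of the proteomes in `pool` holding the architecture."""
    return len(presence & pool) / len(pool)


def node_distances(tree):
    """Relative node distance (nd) for each leaf of a rooted tree.

    tree: nested tuples with leaf names as strings, for example
          ("a", ("b", ("c", "d"))). Nodes passed after the root are
          counted and divided by the largest count.
    """
    counts = {}

    def walk(node, passed):
        if isinstance(node, str):
            counts[node] = passed
            return
        for child in node:
            walk(child, passed + 1)

    for child in tree:
        walk(child, 0)
    deepest = max(counts.values())
    return {leaf: (c / deepest if deepest else 0.0) for leaf, c in counts.items()}


if __name__ == "__main__":
    # Hypothetical toy data: four genomes and a four-leaf tree.
    kingdoms = {"g1": "A", "g2": "B", "g3": "E", "g4": "B"}
    print(distribution_category({"g1", "g3"}, kingdoms))
    print(distribution_index({"g2"}, {"g2", "g4"}))
    print(node_distances(("a", ("b", ("c", "d")))))
```

Counting nodes rather than summing branch lengths mirrors the argument above that nd approximates path length when change is concentrated at cladogenic events.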
Some of the details may nevertheless change as new discoveries are made. Thus, we ask the reader to be careful in interpreting the results and to focus on the general trends in the data (reductive tendencies in Archaea vs. retention tendencies in Eukarya) rather than on specifics, such as the exact number of F and FSF found in each superkingdom, which is prone to change with time. Also, the exact ancestry values (nd) that we mention in this study for easier description and reference to the graphs will change in new data sets but are not as important as the relative position of architectures on the trees of F and FSF, which will remain the same. Thus, the reader should treat nd values as relative.

Architectural use and abundance in genomes

The architectural usage in genomes, that is, the percentage of architectures used in an organism, was calculated by dividing the number of F or FSF appearing in the organism by the total number appearing in all organisms. G values were used to measure architectural abundance as the frequencies with which individual architectures occurred in individual genomes.

Acknowledgments

We thank Simina M. Boca for preliminary results on effects of parasitic lifestyle and Christine Vogel for pointing us to her FSF functional annotation scheme. M.W., L.Y., and G.C. conceived and designed experiments, generated and analyzed data, and produced figures, with significant input from J.E.M. D.C. contributed functional annotation. G.C. and L.Y. wrote the paper, with contributions from all authors. Research was supported in part with funds from UIUC and grants from NSF (MCB-0343126) and the C-FAR Sentinel Program to G.C.

Footnotes

[Supplemental material is available online at www.genome.org.]

Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.6454307.

References