Copyright © 2009 The Authors

Streamlining and Large Ancestral Genomes in Archaea Inferred with a Phylogenetic Birth-and-Death Model

*Department of Computer Science and Operations Research, University of Montréal, Montréal, Canada
†Rényi Institute of Mathematics, Hungarian Academy of Sciences, Budapest, Hungary

Corresponding author. E-mail: csuros/at/iro.umontreal.ca

Hideki Innan, Associate Editor

Accepted June 9, 2009.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Homologous genes originate from a common ancestor through vertical inheritance, duplication, or horizontal gene transfer. Entire homolog families spawned by a single ancestral gene can be identified across multiple genomes based on protein sequence similarity. The sequences, however, do not always reveal conclusively the history of large families. To study the evolution of complete gene repertoires, we propose here a mathematical framework that does not rely on resolved gene family histories. We show that so-called phylogenetic profiles, formed by family sizes across multiple genomes, are sufficient to infer principal evolutionary trends. The main novelty in our approach is an efficient algorithm to compute the likelihood of a phylogenetic profile in a model of birth-and-death processes acting on a phylogeny. We examine known gene families in 28 archaeal genomes using a probabilistic model that involves lineage- and family-specific components of gene acquisition, duplication, and loss. The model enables us to consider all possible histories when inferring statistics about archaeal evolution. According to our reconstruction, most lineages are characterized by a net loss of gene families.
Major increases in gene repertoire have occurred only a few times. Our reconstruction underlines the importance of persistent streamlining processes in shaping genome composition in Archaea. It also suggests that early archaeal genomes were as complex as typical modern ones, and even shows signs, in the case of the methanogenic ancestor, of an extremely large gene repertoire.

Keywords: gene content evolution, maximum likelihood, Last Archaeal Common Ancestor

Introduction

The evolution of homologous gene families, that is, genes of common ancestry, is enmeshed within species histories in a complex manner (Koonin 2005). Concomitantly with the diversification of organismal lineages, gene families expand by duplications, individual genes get eliminated, and new genes arrive by lateral transfer. It is now clear that de novo gene formation and vertical processes (Henikoff et al. 1997, Snel 2002), such as duplication and loss, act in concert with horizontal gene transfer (Boucher et al. 2003, Gogarten and Townsend 2005). Gene families are identified in current practice by pairwise sequence comparisons, coupled with the clustering of postulated homolog pairs (Tatusov et al. 1997, Alexeyenko et al. 2006).

The phylogenetic profile of a gene family comprises the family size across a set of organisms, that is, the number of homologs within the same family in each genome. Such profiles are extremely informative even without taking the gene sequences into account: profile data sets have been used to construct organismal phylogenies (Fitz-Gibbon and House 1999, Snel et al. 1999, Tekaia et al. 1999) and to infer ancestral gene content (Mirkin et al. 2003, Iwasaki and Takagi 2007); similar and complementary profiles hint at functional associations (Tatusov et al. 1997, Pellegrini et al. 1999).

Considering various evolutionary processes in a mathematical model of gene family evolution is challenging.
One main element that distinguishes the present study from past work is the elaboration of a likelihood framework for phylogenetic profiles that simultaneously accounts for gene duplication, loss, and acquisition. In particular, we describe an algorithm for the exact computation of the likelihood in a phylogenetic gain–loss–duplication model.

The present study uses a gain–loss–duplication model to address gene content evolution in Archaea. Relying on a complete set of known homolog families in 28 sequenced genomes, we inferred lineage- and family-specific statistics. In a precursory step, we constructed a plausible phylogeny using 88 universally conserved proteins, which we believe is a noteworthy result on its own, as the phylogeny confidently resolves some problematic euryarchaeal branching orders (involving Thermoplasmatales, Methanopyrus, and Methanobacteriales). Gene loss emerges in our analysis as the dominant force that has shaped archaeal genomes throughout their history. Apparently, genome streamlining has been an ongoing process in all lineages with a fairly constant intensity, apart from dramatic genome compactions in endosymbiotic Archaea. Our reconstruction suggests that early Archaea had a genomic complexity comparable to that of today's organisms. In particular, the euryarchaeal ancestor of two classes of methanogens had a very large genome, resulting from one of the rare upsurges in gene content, similarly to some modern lineages of Methanosarcina and Halobacteria.

Methods

Phylogenetic Profiles in Archaea

Phylogenetic profiles, sequences, and functional annotations were downloaded from the arCOG database of orthologous gene clusters in Archaea (Makarova et al. 2007) at ftp://ftp.ncbi.nih.gov/pub/wolf/COGs/arCOG. The profiles were amended with data on lineage-specific singletons and inparalog families that have no archaeal homologs outside of one genome (Wolf Y, personal communication), which were produced in the process of compiling the arCOG database.
The following organisms are included in the study: Archaeoglobus fulgidus (Arcfu), Haloarcula marismortui ATCC 43049 (Halma), Halobacterium sp. strain NRC-1 (Halsp), Methanosarcina acetivorans (Metac), Methanococcoides burtonii DSM 6242 (Metbu), Methanoculleus marisnigri JR1 (Metcu), Methanospirillum hungatei JF-1 (Methu), Methanocaldococcus jannaschii (Metja), Methanopyrus kandleri (Metka), Methanosarcina mazei (Metma), Methanococcus maripaludis S2 (Metmp), Methanosphaera stadtmanae (Metst), Methanothermobacter thermoautotrophicus (Metth), Nanoarchaeum equitans (Naneq), Picrophilus torridus DSM 9790 (Picto), Pyrococcus abyssi (Pyrab), Pyrococcus furiosus (Pyrfu), Thermoplasma acidophilum (Theac), Thermococcus kodakaraensis KOD1 (Theko), Thermoplasma volcanium (Thevo), Aeropyrum pernix (Aerpe), Caldivirga maquilingensis IC-167 (Calma), Cenarchaeum symbiosum (Censy), Hyperthermus butylicus (Hypbu), Pyrobaculum aerophilum (Pyrae), Sulfolobus solfataricus (Sulso), Sulfolobus acidocaldarius DSM 639 (Sulac), and Thermofilum pendens Hrk 5 (Thepe), with the last eight classified as Crenarchaeota. The abbreviations are those used by Makarova et al. (2007) and the arCOG database.

Reconstruction of Archaeal Phylogeny

The phylogeny was constructed using concatenated multiple alignments of selected orthologous protein sequences. The sequences were chosen from the arCOG database based on phylogenetic profiles: we selected all arCOG groups where every studied genome contained exactly one homolog. There are 88 such groups (see Supplemental Material for sequences), and 46 of those correspond to ribosomal proteins (r-proteins). Alignments were done using the program Muscle (Edgar 2004). Phylogenies were built by likelihood maximization using PhyML (Guindon and Gascuel 2003), with the Jones–Taylor–Thornton substitution model, eight discrete gamma categories, and invariant sites.
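The profile-based selection rule described above (keep only the groups in which every studied genome has exactly one homolog) can be sketched in a few lines. This is an illustrative sketch only: the function name `single_copy_families`, the dictionary layout of the profiles, and the toy family identifiers and counts are assumptions made for the example, not the arCOG file format or the authors' code.

```python
from typing import Dict, List


def single_copy_families(profiles: Dict[str, Dict[str, int]],
                         genomes: List[str]) -> List[str]:
    """Return the ids of families with exactly one homolog in every genome.

    `profiles` maps a family id to its phylogenetic profile, itself a
    mapping from genome abbreviation to family size (missing means 0).
    """
    return sorted(
        family
        for family, counts in profiles.items()
        if all(counts.get(genome, 0) == 1 for genome in genomes)
    )


# Toy data with hypothetical family ids; genome abbreviations as in the text.
genomes = ["Arcfu", "Metja", "Sulso"]
profiles = {
    "famA": {"Arcfu": 1, "Metja": 1, "Sulso": 1},  # kept
    "famB": {"Arcfu": 2, "Metja": 1, "Sulso": 1},  # duplicated in Arcfu
    "famC": {"Arcfu": 1, "Metja": 1},              # absent from Sulso
    "famD": {"Arcfu": 1, "Metja": 1, "Sulso": 1},  # kept
}
selected = single_copy_families(profiles, genomes)
print(selected)
```

Applied to the real profiles over the 28 genomes, a filter of this kind is what yields the 88 groups mentioned above.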
The expected number of substitutions per amino acid site was computed on each edge for the r-proteins in the JTT+I+Γ8 model by PhyML. Bootstrap support values for the branches were computed by PhyML, using 500 replicates.

Inference of Gene Content Evolution

We maximized the likelihood (see below for the likelihood computation) of the data set using a gain–loss–duplication model with a Poisson distribution at the root and four discrete gamma categories capturing rate variation across families, for edge length $t_f$ and duplication rate $\lambda_f$ each. For a given set of model parameters (three parameters, $\kappa_e$, $\mu_e$, and $\lambda_e$, per edge $e$; one for the root's Poisson parameter $\Gamma$; and two gamma shape parameters for rate variation), the likelihood of each family was computed using (1) with the described methods of handling rate variation and correcting for absent profiles. The data set's likelihood (i.e., the product of family likelihoods) was then maximized numerically as a function of the model parameters, using custom-made software implementing the Broyden–Fletcher–Goldfarb–Shanno quasi-Newton method and Brent's one-dimensional optimization method (Press et al. 1997). Family sizes and lineage-specific events (gains, losses, expansions, and contractions) were computed using posterior probabilities in the optimized gain–loss–duplication model.

Phylogenetic Birth-and-Death Model

A phylogenetic birth-and-death model formalizes the evolution of an organism-specific census variable along a rooted phylogeny $T$. We consider only binary phylogenies here; the full set of methods applicable to multifurcating phylogenies is described in the Supplementary Material. The model specifies edge lengths, as well as birth-and-death processes (Ross 1996, Kendall 1949) acting on the edges. Populations of identical individuals evolve along the tree from the root toward the leaves by Galton–Watson processes.
At nonleaf nodes of the tree, populations are instantaneously copied to evolve independently along the adjoining descendant edges. Let the random variable $\xi(x) \in \{0,1,2,\dots\}$ denote the population count at every node $x \in \mathcal{V}(T)$. Every edge $xy$ is characterized by a loss rate $\mu_{xy}$, a duplication rate $\lambda_{xy}$, and a gain rate $\kappa_{xy}$. If $(X(t)\colon t \ge 0)$ is a linear birth-and-death process (Kendall 1949, Takács 1962) with these rate parameters, then $\mathbb{P}\{\xi(y) = m \mid \xi(x) = n\} = \mathbb{P}\{X(t_{xy}) = m \mid X(0) = n\}$, where $t_{xy} > 0$ is the edge length, which defines the time interval during which the birth-and-death process runs. The joint distribution of $(\xi(x)\colon x \in \mathcal{V}(T))$ is determined by the phylogeny, the edge lengths and rates, along with the distribution at the root $\rho$, denoted as $\gamma(n) = \mathbb{P}\{\xi(\rho) = n\}$.

It is assumed that one can observe the population counts at the terminal nodes (i.e., leaves) but not at the inner nodes of the phylogeny. As individuals are considered identical, we are also ignorant of the ancestral relationships between individuals within and across populations. The population counts at the leaves form a phylogenetic profile, which is formally a function $\Phi\colon \mathcal{L}(T) \to \{0,1,2,\dots\}$, where $\mathcal{L}(T) \subseteq \mathcal{V}(T)$ denotes the set of leaf nodes. Our central problem is to compute the likelihood of a profile, that is, the probability of the observed counts for fixed model parameters. Define the notation $\Phi(\mathcal{L}') = (\Phi(x)\colon x \in \mathcal{L}')$ for the partial profile within a subset $\mathcal{L}' \subseteq \mathcal{L}(T)$. Similarly, let $\xi(\mathcal{L}') = (\xi(x)\colon x \in \mathcal{L}')$ denote the vector-valued random variable composed of individual population counts. The likelihood of $\Phi$ is the probability $L = \mathbb{P}\{\xi(\mathcal{L}(T)) = \Phi\}$. Let $T_x$ denote the subtree of $T$ rooted at node $x$. Define the survival count range $M_x$ for every node $x$ as $M_x = \sum_{y \in \mathcal{L}(T_x)} \Phi(y)$. Clearly, the ranges can be calculated easily in a postorder traversal.

For our discussion, we borrow standard terminology applied to homologous genes (Sonnhammer and Koonin 2002).
For every edge $xy$, the population of node $y$ can be split by ancestry at node $x$: inparalog groups are formed by the progenies of each individual at $x$, and a xenolog group is formed by the individuals whose ancestor immigrated into the population. When $\xi(x) = n$ on the edge $xy$, then $\xi(y) = \eta + \sum_{i=1}^{n} \zeta_i$, where $\eta$ is the xenolog group size, and the $\zeta_i$ are the independent and identically distributed inparalog group sizes. The distribution of xenolog and inparalog group sizes is the well-characterized transient distribution of the appropriate linear birth-and-death processes (Kendall 1949, Karlin and McGregor 1958, Takács 1962; see Supplemental Material). Namely, each $\zeta_i$ has a shifted geometric distribution, and for $\kappa > 0$, $\eta$ has a negative binomial distribution (when $\lambda > 0$) or a Poisson distribution (when $\lambda = 0$). The distributions' parameters are known functions of the edge length $t_{xy}$ and the rates $\kappa_{xy}$, $\lambda_{xy}$, $\mu_{xy}$.

Surviving Lineages

A key factor in deriving the likelihood formulas is the probability that a given individual at a tree node $x$ has no descendants at the leaves within the subtree rooted at $x$. The corresponding extinction probability is denoted by $D_x$, which can be computed in a postorder traversal (Csűrös and Miklós 2006). An individual at node $x$ is referred to as surviving if it has at least one progeny at the leaves descending from $x$. Let $\Xi(x)$ denote the number of surviving individuals at each node $x$. The numbers of surviving xenologs and inparalogs follow the same class of distributions as the total numbers of xenologs and inparalogs (see Supplemental Material). Consequently, if $\xi(x) = n$ on edge $xy$, then $\Xi(y) = \eta + \sum_{i=1}^{n} \zeta_i$, where $\eta$ is now the surviving xenolog count with a Poisson or negative binomial distribution, and the $\zeta_i$ are the surviving inparalog counts, with shifted geometric distributions. The distributions' parameters can be computed explicitly using the process parameters and the extinction probabilities.
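Both the survival count ranges and the extinction probabilities are obtained by postorder traversals, which the following minimal sketch combines into one pass. It rests on assumptions that are not taken from the paper's Supplemental Material: the inparalog group size on an edge is given the textbook transient law of a linear birth-and-death process started from one individual (zero with probability `alpha`, otherwise geometric with parameter `beta`), and the extinction probability of an inner node is the product, over its child edges, of the probability generating function of that law evaluated at the child's extinction probability. The tree encoding, the function names, and all numerical rates are hypothetical.

```python
import math
from typing import Dict, List, Tuple

# (child, duplication rate lambda, loss rate mu, edge length t)
Edge = Tuple[str, float, float, float]


def transient_params(lam: float, mu: float, t: float) -> Tuple[float, float]:
    """Transient law of a linear birth-and-death process from one individual.

    After time t the number of descendants k satisfies
    P(k = 0) = alpha and P(k) = (1 - alpha) * (1 - beta) * beta**(k - 1), k >= 1.
    """
    if lam == mu:
        beta = lam * t / (1.0 + lam * t)
        return beta, beta
    e = math.exp(-(mu - lam) * t)
    denom = mu - lam * e
    return mu * (1.0 - e) / denom, lam * (1.0 - e) / denom


def postorder_stats(tree: Dict[str, List[Edge]], counts: Dict[str, int],
                    root: str) -> Tuple[Dict[str, int], Dict[str, float]]:
    """Return (M, D): survival count ranges and extinction probabilities."""
    M: Dict[str, int] = {}
    D: Dict[str, float] = {}

    def visit(x: str) -> None:
        children = tree.get(x, [])
        if not children:          # leaf: observed count, nothing can go extinct
            M[x] = counts[x]
            D[x] = 0.0
            return
        M[x], D[x] = 0, 1.0
        for y, lam, mu, t in children:
            visit(y)              # children first (postorder)
            M[x] += M[y]
            alpha, beta = transient_params(lam, mu, t)
            d = D[y]
            # generating function of the inparalog group size, evaluated at D_y
            D[x] *= alpha + (1.0 - alpha) * (1.0 - beta) * d / (1.0 - beta * d)

    visit(root)
    return M, D


# Toy tree ((B, C)U, A)R with made-up rates and a made-up profile.
tree = {
    "R": [("A", 0.2, 0.5, 1.0), ("U", 0.2, 0.5, 0.4)],
    "U": [("B", 0.1, 0.3, 0.6), ("C", 0.0, 0.3, 0.6)],
}
counts = {"A": 2, "B": 0, "C": 3}
M, D = postorder_stats(tree, counts, "R")
print(M["R"], M["U"], round(D["U"], 4), round(D["R"], 4))
```

Gains do not enter the extinction probabilities in this sketch, because xenologs are not descendants of the individual under consideration.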
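In the formulas to follow, we use the probabilities $w_y^*[m \mid n] = \mathbb{P}\{\eta + \sum_{i=1}^{n} \zeta_i = m;\ \forall i\colon \zeta_i > 0\}$, which can be computed by dynamic programming for all $n, m \le M_y$ in $O(M_y^2)$ time (see Supplemental Material).

Computing the Likelihood

We compute the likelihood using conditional survival likelihoods, defined as the probability of observing the partial profile within $T_x$ given the number of surviving individuals $\Xi(x)$: $L_x[n] = \mathbb{P}\{\xi(\mathcal{L}(T_x)) = \Phi(\mathcal{L}(T_x)) \mid \Xi(x) = n\}$, where $\Phi$ denotes the profile and $\mathcal{L}(T_x)$ the set of leaves of $T_x$. For $m > M_x$, $L_x[m] = 0$. For values $m = 0, 1, \dots, M_x$, the conditional survival likelihoods can be computed recursively as shown below.

If node $x$ is a leaf, then every individual at $x$ is surviving, and
$$L_x[n] = \begin{cases} 1 & \text{if } n = \Phi(x), \\ 0 & \text{otherwise.} \end{cases}$$

[Equations for inner nodes missing in the source; the surviving fragments follow.] … $\mathbb{P}\{\zeta = k\}$ for a surviving inparalog group at $x_i$. In the above equations, $0 < t \le M_{x_1}$. For all $n = 0, \dots, M_x$ …

At the root $\rho$, the likelihood is obtained by summing over the number of surviving individuals:
$$L = \sum_{m=0}^{M_\rho} L_\rho[m]\, \mathbb{P}\{\Xi(\rho) = m\}.$$
In particular, if $\gamma$ is the stationary distribution for a gain–loss–duplication or a gain–loss model, then $\Xi(\rho)$ has a negative binomial or Poisson distribution, respectively. The likelihood for a Poisson distribution with mean $\Gamma$ at the root is
$$L = \sum_{m=0}^{M_\rho} L_\rho[m]\, e^{-\Gamma(1 - D_\rho)}\, \frac{\bigl(\Gamma(1 - D_\rho)\bigr)^m}{m!}, \tag{1}$$
because each individual at the root survives independently with probability $1 - D_\rho$, so that $\Xi(\rho)$ is Poisson distributed with mean $\Gamma(1 - D_\rho)$.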
The likelihood formula (1) is corrected to account for the fact that the data set does not contain all-absent profiles, that is, profiles $\Phi$ with $\Phi(x) = 0$ for all leaves $x$, in a manner analogous to Felsenstein (1992).

Family-specific rate variation is considered by computing the likelihood values for each discrete rate category $c$ characterized by factors $(t_c, \kappa_c, \mu_c, \lambda_c)$. The factors in our analysis are either constant 1, or correspond to the expected values within the four quartiles of a gamma distribution with mean 1.

Results and Discussion

Computational Analysis of Phylogenetic Profiles

Birth-and-death processes are commonly used to model a population of identical individuals (Kendall 1949, Karlin and McGregor 1958) and waiting queues (Takács 1962). Their use in modeling gene family evolution is justified by the fact that losses and duplications seem to occur independently between the members of multigene families (Nei and Rooney 2005). The most general process we consider is a gain–loss–duplication process that is characterized by the rates of gain $\kappa$, loss $\mu$, and duplication $\lambda$: a population of size $n$ grows at a rate of $(\lambda n + \kappa)$ and decreases at a rate of $\mu n$. In our context, the population comprises homologs of a given family in the genome. Gene acquisition occurs with a rate of $\kappa$, combining various means such as innovation and lateral transfer.

We model gene family evolution in a phylogenetic setting by associating gain–loss–duplication processes with the branches of a phylogenetic tree. The corresponding phylogenetic birth-and-death model defines a probabilistic framework for the evolution of gene family size. The observed family sizes at the terminal nodes form a phylogenetic profile. In principle, a phylogenetic birth-and-death model suits likelihood-based inference since it is a probabilistic graphical model (Jordan 2004) with a tree structure. The mathematical difficulties stem from the fact that the state space of the processes (i.e., family size) is infinitely large.
Consequently, routine computational techniques used to analyze molecular sequence evolution (Felsenstein 1981) are not applicable. Previously proposed likelihood methods (Hahn et al. 2005, Spencer et al. 2006, Iwasaki and Takagi 2007) have sidestepped the infinity problem by using approximate calculations with bounds on maximal family size. We previously introduced (Csűrös and Miklós 2006) a procedure for computing the likelihood in a restricted gain–loss–duplication model (assuming $0 < \kappa$ and $0 < \lambda < \mu$), without imposing artificial size bounds. The weakness of that procedure is potential numerical instability, due to the use of alternating sums in the formulas. We found practical cases (such as the archaeal gene content study we report below) where the numerical instability led to serious errors. The novel procedure presented here is numerically stable, as well as computationally efficient. It applies to arbitrary gain–loss–duplication models, including degenerate cases such as the one of Hahn et al. (2005) with $\lambda = \mu$ and $\kappa = 0$. The algorithm takes $O(M^2 n)$ time to complete for a phylogenetic profile over $n$ species and a total of $M$ genes (see Supplemental Material).

Gene Content Evolution in Archaea

Archaea constitute one of the three main domains of cellular life, and are notable for a spectacular diversity of adaptive strategies to extreme environments (Garrett and Klenk 2006). We examined gene content evolution in Archaea. For the purposes of the study, we have selected 28 completely sequenced genomes covering all major physiological and metabolic groups recognized in cultured Archaea: thermophiles, halophiles, acidophiles, nitrifiers, and methanogens (Valentine 2007). Homolog gene families were extracted from the arCOG (archaeal clusters of orthologous groups) database (Makarova et al. 2007), and combined with groupings of genes that have no archaeal homologs outside of single genomes.
The complete data set consists of 14,216 families, of which 7,461 are among the arCOGs.

Phylogenetic Relationships

Archaeal phylogenetic relationships have been resolved to an increasing degree of confidence (Forterre et al. 2006) with the aid of accumulating sequence data.

Figure 1
The observed uncertainties about euryarchaeal groups concern the placement of Thermoplasmata, and so-called Class I methanogens (Bapteste et al. 2005) comprising Methanopyrales, Methanobacteriales, and Methanococcales. Thermoplasmata were originally thought to be an early-branching lineage of Euryarchaeota (Forterre et al. 2006), but analyses of r-proteins (Matte-Tailliez et al. 2002) have provided strong evidence for their late-branching position after Class I methanogens, as in fig. 1. The correct phylogenetic position of M. kandleri (Metka) is one of the remaining puzzles in archaeal evolution. The existence of close phylogenetic relationships between Class I methanogens is fairly certain, but different protein sets and taxonomic sampling give conflicting or weak indications (Slesarev et al. 2002, Brochier et al. 2004, Brochier et al. 2005, Gao and Gupta 2007) about the exact branching order among Methanopyrales, Methanobacteriales, and Methanococcales. R-proteins in our study give weak support for the monophyly of Methanococcales and Methanobacteriales to the exclusion of Methanopyrales (49% BV) and faintly favor the paraphyly of Class I methanogens (37% BV for the immediate split of Methanopyrales between Thermococcales and Methanobacteriales/Methanococcales; see Supplemental Material). Uc-proteins, however, solidly point to the monophyly of Class I methanogens (> 97% BV). Interestingly, the maximum likelihood trees built from uc-proteins do not resolve the relationships between Halobacteriales, Methanosarcinales, and Methanomicrobiales well (see Supplemental Material), but there is little reason to doubt that r-proteins provide a genuine phylogenetic signal about the monophyly of Class II methanogens (Bapteste et al. 2005, Brochier-Armanet et al. 2008), uniting Methanosarcinales and Methanomicrobiales.

We conclude that, based on protein sequences, Thermoplasmatales constitute a late-branching euryarchaeal lineage, and their early-branching status is a long-branch attraction artifact. Furthermore, the sequences provide evidence of the monophyly of both Class I and Class II methanogens.

Evolutionary Rates: Correlations Between Sequence and Gene Content Evolution

We experimented with models of increasing complexity that combine lineage- and gene-specific factors in the gain–loss–duplication processes. Specifically, we assumed that the process for family $f$ on branch $e$ is characterized by the rates $\kappa_e \kappa_f$, $\mu_e \mu_f$, and $\lambda_e \lambda_f$, and runs for a duration of $t_e t_f$. Here, $\kappa_e, \mu_e, \lambda_e, t_e$ are branch-specific process parameters, and $t_f, \kappa_f, \mu_f, \lambda_f$ are family-specific rate variation coefficients. Starting with simple models with invariant family-specific coefficients, we introduced rate variation in a model hierarchy with increasing complexity. In more complex models, some coefficients were drawn randomly from a discretized gamma distribution (Yang 1994). Different family-specific coefficients do not have the same impact on the model fit. We found the largest improvement when introducing variation in edge length ($t_f$), followed by duplication rate variation ($\lambda_f$). Further variation in loss and gain rates led to insignificant improvements in the model fit and was not assumed in the analysis.

In the absence of extraneous scaling, we set $t_e = 1$ to examine the total rates of gene content change on each edge $e$. We found a conspicuous correlation across branches between the rate of sequence evolution (expected numbers of substitutions per site for r-proteins) and the component rates of gene content evolution: on this point, see Figure 2.

Figure 2
The apparent correlations between gene content and sequence evolution rates imply that a steady balance has been maintained between drift and natural selection in almost all lineages. Loss and duplication rates, in particular, show vagaries similar to those of amino acid substitution rates and thus provide comparable molecular clocks. We measured each terminal node's depth by summing the rates along branches from the root to the node in question. Excluding N. equitans and C. symbiosum, the coefficient of variation of the depth is 26% for protein sequences, 23% for gene loss rates, and 20% for duplication rates. Depths by gene gain rates span about a 4-fold range; for substitution, loss, and duplication, the span is close to 2-fold. Genes have thus been eliminated in all archaeal lineages with a fairly universal constancy, apart from occasional accelerations. In other words, genome degradation processes seem to persist at a fairly common intensity in every lineage (Mira et al. 2001). Conceivably, genome decay is counterbalanced by natural selection that eliminates deleterious mutations. The root cause of dramatically increased gene loss in obligate symbionts such as N. equitans (Makarova and Koonin 2005) may be reduced selection (Hershberg et al. 2007, Koonin and Wolf 2008). Principles of population genetics imply that changes in population size alone can explain rate changes (Lynch 2006): selection is weaker in a smaller population, which should manifest in accelerated evolution of sequences (Ohta 1972) and gene content. We examined the differences between evolutionary rates in sibling terminal taxa for signs of natural selection.

Figure 2

In the lineage leading to M. stadtmanae, a human commensal (Fricke et al. 2006), all rates are simultaneously larger than in its sibling lineage M. thermoautotrophicus, which may be attributed to a smaller population size for the former, which has a smaller habitat.
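The depth statistic used above (sum the rates along the branches from the root to each terminal node, then take the coefficient of variation across terminal nodes) can be sketched as follows. The tree encoding, the function names, and the toy branch values are illustrative assumptions; the text does not say whether the population or the sample standard deviation was used, so the population form chosen here is also an assumption.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def leaf_depths(tree: Dict[str, List[Tuple[str, float]]],
                root: str) -> Dict[str, float]:
    """Sum the branch values on the path from the root to every leaf.

    `tree` maps a node to its (child, branch value) pairs; the branch value
    stands for a substitution, loss, gain, or duplication rate on that edge.
    """
    depths: Dict[str, float] = {}
    stack = [(root, 0.0)]
    while stack:
        node, depth = stack.pop()
        children = tree.get(node, [])
        if not children:
            depths[node] = depth
        for child, value in children:
            stack.append((child, depth + value))
    return depths


def coefficient_of_variation(values: List[float]) -> float:
    """Population standard deviation divided by the mean."""
    return pstdev(values) / mean(values)


# Toy tree ((B, C)U, A)R with made-up branch values.
tree = {
    "R": [("A", 1.0), ("U", 0.5)],
    "U": [("B", 0.5), ("C", 1.5)],
}
depths = leaf_depths(tree, "R")
cv = coefficient_of_variation(list(depths.values()))
print(depths, round(cv, 4))
```

A coefficient of variation near zero would indicate clock-like behavior; the 20% to 26% values reported above were computed over 26 terminal taxa, with N. equitans and C. symbiosum excluded.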
Gene gain and duplication rates behave in general less predictably: numerical differences between loss, gain, and duplication rates on sibling lineages occur in almost all possible sign combinations. The observed fluctuations corroborate the intuition that selection pressures acting on gain and duplication are strong and variable (Wolf et al. 2002). It is plausible that during episodes of massive adaptation, the selective advantages of gene acquisition may outweigh possible negative consequences of an increased genome, and thus drive elevated gene gains, especially if coupled with small population sizes. In our case, unusually large gain rates are inferred on some of the deepest branches (such as the one leading to node E1 in fig. 1).

History of Archaeal Gene Census: Streamlining and Surges

We inferred a probable history of archaeal gene content using posterior probabilities for ancestral family sizes and family size changes, computed from the phylogenetic profiles in the fitted model.

Figure 3
Our reconstruction suggests a recurrent theme in archaeal evolution: a major physiological or metabolic invention leads to a successful founding population in a new environment, which then further diversifies by genomic streamlining. Notably, fig. 3 shows only a few branches where gains prevail over losses (i.e., at least twice as many gains as losses): such is the case for some deep crenarchaeal and euryarchaeal branches, and the terminal lineages for M. acetivorans and H. marismortui. About half of the remaining terminal lineages and two-thirds of the remaining deep lineages are dominated by loss. Moreover, there is only one ancestral node (the crenarchaeal ancestor) in the entire tree for which gain is dominant in both descendant lineages.

Why would gene loss be so prevalent? We speculate that the versatility of a large genome in such extant lineages as M. acetivorans (Galagan et al. 2002) and H. marismortui (Baliga et al. 2004) can be upheld for only relatively short time periods. Genetic drift already leads to the diversification of descendant lineages, which are frequently isolated, given the disconnectedness of the extreme environments they dwell in (Whitaker et al. 2003, Escobar-Páramo et al. 2005). Specialization and the loss of dispensable functions should be favorable in the descendants that are typically under significant energy stress (Valentine 2007). Genomic streamlining should also be favored by population-size effects due to the isolation (Lynch 2006), even in the case of slightly deleterious loss of function.

After the crenarchaeal split, the main euryarchaeal lineage has been characterized by the accumulation of new families, culminating in a large surge on the branch leading to node E1, where many new families appeared. The time interval (judging by sequence divergence in fig. 1) and the extent of gene gain are similar to what is seen with H. marismortui (Halma) and M. acetivorans (Metac).
The inference of large gains in the E1 lineage is due to the large number of gene families shared between multiple descendant lineages, and especially between the two classes of methanogens (Slesarev et al. 2002, Bapteste et al. 2005, Gao and Gupta 2007, Makarova et al. 2007). In fact, this lineage may very well have been where hydrogenotrophic methanogenesis was invented, which then underwent modifications, extensions, and degradations in subsequent lineages. It was noted in previous genome-scale comparisons (Bapteste et al. 2005, Gao and Gupta 2007) that euryarchaeal lineages likely acquired methanogenesis predominantly by vertical inheritance, because the associated pathways are fairly complex, and neither the sequences nor the phylogenetic profiles show evidence of substantial amounts of lateral gene transfer (LGT). Figure 3 suggests that methanogenesis appeared after the split of Thermococcales in the company of more than 760 genes. Based on extant examples of archaea with such swelled genomes (Galagan et al. 2002, Baliga et al. 2004), it is plausible that the corresponding archaeal organisms were extremely versatile.

Our inference of ancestral gene content is quite different from previous reconstructions based on parsimony principles (Makarova et al. 2007, Csűrös 2008): at deep nodes, we postulate larger genomes. Parsimonious reconstructions (Mirkin et al. 2003, Kunin et al. 2005, Csűrös 2008) aim to minimize the number of implied loss and gain events. As a consequence, parsimony inherently underestimates the age of gene families. A major concern in ancestral gene content reconstruction is that "patchy" profiles arise from a combination of lineage-specific loss events and LGT. Frequent LGT implies smaller ancestral genome sizes (Dagan and Martin 2007). Our reconstruction reveals the prevalence of differential loss, but LGT events are far from uncommon. Lineage-specific gains ("Gain" column in fig. 3) account for more than 14% of families ("Families in the genome") at half of all the lineages.

A probabilistic framework, such as a phylogenetic birth-and-death model, makes it feasible to take all possible gene family histories into consideration in a mathematically sound way. A case in point is the last archaeal common ancestor (LACA), where only about 1300 families are inferred to have been present with a posterior probability of at least 90%, which is close to a parsimony-based inference of about 1000 families (Makarova et al. 2007). Given the uncertainties of most family histories, the exact genome composition of LACA is hard to estimate, but the fractional probabilities point to a genome with slightly more than 2000 families, which is similar to such extant organisms as S. solfataricus. Such a large genome size implies that LACA's genomic complexity was even greater than previously imagined (Makarova et al. 2007), on a par with modern, moderately sized archaeal genomes.

Supplementary Material is available online at Molecular Biology and Evolution (http://www.mbe.oxfordjournals.org/).
Acknowledgments

This work has been supported by a grant from the Natural Sciences and Engineering Research Council of Canada to M.Cs., and the EU FP6 Marie Curie grant MTKD-CT-2006-042794. Part of the study was done while M.Cs. was a sabbatical visitor at the Rényi Institute of Mathematics, supported by a Marie-Curie Transfer-of-Knowledge fellowship. We are grateful to Yuri Wolf for providing data on lineage-specific gene families. We thank Igor Rogozin, Csaba Pál and Balázs Papp for informative discussions.

References