![]() | ![]() |
Formats:
|
||||||||||||||||||||||||||
Copyright © 2009 Peregrín-Alvarez et al.; licensee BioMed Central Ltd. The conservation and evolutionary modularity of metabolism 1Program in Molecular Structure and Function, Hospital for Sick Children, College Street, Toronto, ON, M5G 1L7, Canada 2Department of Molecular Biology and Biochemistry, University of Malaga, Avda. Cevantes, 29071 Malaga, Spain 3Department of Molecular Genetics, University of Toronto, King's College Circle, Toronto, ON, M5S 1A8, Canada 4Department of Biochemistry, University of Toronto, Toronto, Kings' College Circle, ON, M5S 1A8, Canada Corresponding author.José M Peregrín-Alvarez: peregrin/at/sickkids.ca; Chris Sanford: c.sanford/at/utoronto.ca; John Parkinson: jparkin/at/sickkids.ca Received February 5, 2009; Revised May 27, 2009; Accepted June 12, 2009. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Abstract Background Cellular metabolism is a fundamental biological system consisting of myriads of enzymatic reactions that together fulfill the basic requirements of life. The recent availability of vast amounts of sequence data from diverse sets of organisms provides an opportunity to systematically examine metabolism from a comparative perspective. Here we supplement existing genome and protein resources with partial genome datasets derived from 193 eukaryotes to present a comprehensive survey of the conservation of metabolism across 26 taxa representing the three domains of life. Results In general, metabolic enzymes are highly conserved. However, organizing these enzymes within the context of functional pathways revealed a spectrum of conservation from those that are highly conserved (for example, carbohydrate, energy, amino acid and nucleotide metabolism enzymes) to those specific to individual taxa (for example, those involved in glycan metabolism and secondary metabolite pathways). Applying a novel co-conservation analysis, KEGG defined pathways did not generally display evolutionary coherence. Instead, such modularity appears restricted to smaller subsets of enzymes. Expanding analyses to a global metabolic network revealed a highly conserved, but nonetheless flexible, 'core' of enzymes largely involved in multiple reactions across different pathways. Enzymes and pathways associated with the periphery of this network were less well conserved and associated with taxon-specific innovations. Conclusions These findings point to an emerging picture in which a core of enzyme activities involving amino acid, energy, carbohydrate and lipid metabolism have evolved to provide the basic functions required for life. However, the precise complement of enzymes associated within this core for each species is flexible. Background Cellular metabolism, represented by complements of enzymatic and transport reactions, is a fundamental biological system required for sustaining life. With a strong tradition in biochemistry, metabolism is probably the most widely characterized biological system studied to date. Several resources are now available that detail the metabolic complement of a wide range of organisms [1,2]. With the exception of a few model organisms, for the most part these are derived through automated methods capable of reconstructing metabolic networks through sequence similarity searches of known enzymes against their genomes [1-8]. Representation of metabolism is typically provided in the form of pathway maps that collate enzymes on the basis of their involvement in related biochemical conversions [1,2]. However, given that many pathways share common enzymes, metabolism is increasingly being viewed as an assemblage of functional modules, interconnected through common enzymes and substrates that together coordinate a cell's biochemical activities. This treatment of metabolism as a single integrated network allows the application of sophisticated graph theoretical analyses to uncover fundamental organizational principals within the network. For example, it has been suggested that metabolism displays the typical scale free behavior of small world networks [9-11] (that is, most of the enzymes within the network have only a few connections while a few enzymes, so called hubs, are very highly connected), although this has been questioned in a more recent study [12]. Recently, with the availability of increasing numbers of genomes, there has been considerable interest in examining biological networks from an evolutionary perspective [13-16]. Applied to metabolism, such studies are beginning to yield insights into the effects of selection pressures at both local (pathway) and global (network) scales. Recent studies suggest a 'core' set of metabolic reactions, conserved across many organisms, highlighting the fundamental role of metabolism that is subject to strong evolutionary constraint [17-19]. However, aside from the core, there is increasing evidence that pathway expansions have evolved through the recruitment and/or loss of functional modules of enzymes with related activities [17,18,20]. The ability of a functional module to evolve may play a key role in developing or limiting biological robustness [21]. For example, evolutionarily 'frozen' modules may be less able to withstand errors, such as mutational inactivation of gene products, within the same functional module. Focusing on metabolism, two recent studies have used Jaccard coefficients (JCs) to measure the similarity of phylogenetic profiles and identify modules of evolutionarily related enzymes [16,22]. Yamada and colleagues [16], for example, identified over 200 'phylogenetic network modules' consisting of enzymes proposed to behave in a similar way in the evolutionary process of the metabolic network. Additional studies have further suggested that enzyme modularity is affected by both environmental factors and evolutionary relationships [23,24]. For example, Kreimer and colleagues [23] noted a trend of modularity decrease from ancestors to descendants. Together these studies suggest the need to explore modularity from the perspective of known phylogenetic relationships. Due to the paucity of available eukaryotic genomes (to date the full genome sequence of only approximately 60 eukaryotes are currently available [25]), previous comparative studies of metabolism have largely focused on prokaryotes. Recently, we have collated and processed expressed sequence tag datasets to generate so called 'partial genomes' for almost 200 eukaryotes, including members of underrepresented taxonomic classes such as plants, nematodes and various groups of protists [26]. These data represent a valuable sequence resource that provides a unique opportunity to perform more comprehensive studies of metabolic pathways within Eukaryotes, allowing comparisons to be placed within a detailed taxonomic context. The inclusion of these expressed sequence tag datasets allows us to build further on previous studies of metabolic conservation by providing greater confidence to determining the taxonomic coverage afforded by individual enzymes as well as pathways. We highlight enzymes and pathways associated with taxonomic innovations and examine the evolution of these metabolic processes in the context of a novel approach to examining enzyme co-conservation. Results Conservation of enzyme activities over the three domains of life Collating 2.7 million sequences associated with 193 partial genomes [26,27], 167 fully sequenced genomes and the non-redundant protein database (nr), we performed a systematic scan to detect potential homologs of 29,893 proteins representing 1,474 distinct enzyme activities (Table 1; Additional data files 1 and 2). The accurate annotation of catalogs of enzymes from genome sequences remains a significant challenge [28]. Due to the scale of this analysis and consistent with other genome annotation schemes [2,18,29,30] we employed BLAST as an efficient means of inferring the conservation of the 1,474 enzyme activities. It is appreciated that the use of BLAST can lead to a high rate of false positive assignments. For example, the well annotated genomes of Escherchia coli strain O157:H7 and Saccharomyces cerevisiae have 567 and 426 enzymes, respectively, curated in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database resource [2]. With a bit score cutoff of 50 (equivalent to an E-value of approximately e-5), BLAST identifies an additional 512 enzymes for E. coli and an additional 501 enzymes for S. cerevisiae, resulting in a false positive rate of approximately 50%. On the other hand, the rate of false negatives is less than 2%. Increasing the cutoff to 100 (equivalent to an E-value of approximately e-20) was found to lower the false positive rate to only approximately 42%. Aside from BLAST, profile-based methods have also been employed [6,31]. However while these methods significantly reduce the incidence of false positives, they also result in a high incidence of false negatives. Finally, it is worth noting that approaches based on orthology mappings, such as that employed by KEGG, can greatly improve on BLAST annotations. However, such approaches are limited to organisms with complete genome sequences and may be problematic where such sequences are highly divergent. Here the aim is to incorporate non-genome sequence data to provide a more global survey of enzyme conservation. The effect of using BLAST with a relatively flexible score cutoff will result in an over-prediction of enzyme conservation, but has the benefit of providing highly conservative estimates of which enzymes and pathways are likely absent from specific taxa. Consequently, care must be taken when interpreting the results of these analyses.
In the following we use the term 'enzyme' to refer to a collection of isoforms associated with a single enzyme activity (defined by unique Enzyme Commission (EC) numbers). To explore the distribution of enzymes across a broad set of informative taxonomic groupings, we defined 26 distinct taxa on the basis of number of species, sequence coverage and diversity (Table 1). These taxa are broadly consistent with previous studies of molecular evolution across the three domains of life [32]. Of the 1,474 enzymes studied here, half possessed significant sequence similarity in at least 60% of the completely sequenced genomes (complete genomes) and at least 17% of the partial genomes (Figure (Figure1a;1a
Relationships between enzyme function and patterns of conservation In addition to organizing enzymes into 118 distinct biochemical pathways, the KEGG database (version 33) [2] groups these pathways into ten functional subcategories (which we term 'superclasses'): carbohydrate metabolism; energy metabolism; lipid metabolism; nucleotide metabolism; amino acid metabolism; other amino acid metabolism; glycan metabolism; co-factors and vitamins; secondary metabolites; and xenobiotics. Enzyme conservation was therefore examined in the context of enzyme superclasses (Figure 1b-d Enzymes involved in multiple metabolic superclasses had significantly higher rates of detection in both complete and partial genomes (median number of complete/partial genomes = 126/57) compared with all other enzymes (median number of complete/partial genomes = 93/30; Mann-Whitney test, P < 10-5 for both genome datasets). Conversely, those involved in glycan metabolism (median complete/partial genomes = 19/14; Mann-Whitney test, P < 10-5 for both genome datasets) and xenobiotics (median complete/partial genomes = 72.5/7; Mann-Whitney test, P < 5 × 10-5 for complete genomes, P < 10-5 for partial genomes) were significantly less conserved in both genome datasets. Between the datasets some notable differences in enzyme conservation were observed (Additional data file 3). Specific to the complete genome datasets, enzymes involved in nucleotide metabolism had significantly higher rates of detection (median number of genomes = 127; Mann-Whitney test, P < 10-4), while enzymes involved in secondary metabolites had significantly lower rates of detection (median number of genomes = 24; Mann-Whitney test, P < 10-5). Within the partial genome datasets, enzymes involved in lipid metabolism had significantly higher rates of detection (median number of partial genomes = 41; Mann-Whitney test, P < 0.01). Interestingly, enzymes associated with co-factors and vitamins had significantly higher rates of detection in the complete genomes (median number of complete genomes = 116; Mann-Whitney test, P < 0.02), but significantly lower rates of detection in partial genomes (median number of partial genomes = 18; Mann-Whitney test, P < 10-5). Differences between the two datasets are also apparent from Figure Figure1b1b Complicating factors in interpreting these results include both the incomplete nature of the partial genome datasets and their bias towards highly expressed proteins (see Additional data file 1 for further discussion). However, in a previous study of global sequence diversity, we noted that findings obtained from the use of partial genome datasets were consistent with those obtained from the use of fully sequenced eukaryotic datasets [32]. Hence, we expect the observed differences between the complete and partial genome datasets likely reflect their taxonomic bias (complete genomes are mainly derived from prokaryotes, while partial genomes are derived solely from eukaryotes). In the following sections we explore these relationships in more detail. Conservation and taxonomic distribution of metabolic pathways In the previous section we examined the conservation of individual enzymes and found that, with a few notable exceptions, enzymes are broadly conserved across the three domains of life. Given that enzymes operate within the context of biochemical pathways, we next investigated the extent to which the pathways themselves are conserved. For each of the 118 pathways defined in KEGG, we identified the proportion of enzymes with homologs in each of the 26 taxonomic groups (Figure (Figure2).2
Across the 26 taxa, pathways display a spectrum of conservation, from those that are well conserved across all groups (for example, many pathways associated with carbohydrate, energy, nucleotide, amino acid, other amino acid and co-factor and vitamin metabolism) to those that are conserved in only a few specific taxonomic groups (for example, many glycan pathways are conserved only in metazoans) (Figure (Figure2).2 Pathways limited to specific domains include C5-branched dibasic acid metabolism, lipopolysaccharide biosynthesis and peptidoglycan biosynthesis, all poorly conserved in eukaryotes. The former pathway provides alternative sources of carbon and energy, while peptidoglycan and lipopolysaccharides are important components of the bacterial cell walls and envelope, respectively. On the other hand, inositol phosphate metabolism and sphingophospholipid biosynthesis were poorly conserved in prokaryotes, consistent with their roles as important components of secondary messenger systems and eukaryotic cell membranes [34-36]. Aside from these taxa-specific pathways, we noted that many pathways involved in secondary metabolism are more highly conserved in plants, ascomycetes and certain groups of bacteria such as the actinomycetes. Such pathways are associated with defense mechanisms (plants) [37], or as a means of surviving within a highly competitive environment (ascomycetes and bacteria) [38,39]. Finally, we note that many glycan biosynthetic pathways are restricted to a limited number of taxa, in particular the metazoa. This is consistent with the role of glycans in developmental and signaling processes associated with multicellularity [40,41]. Identification of evolutionarily distinct modules of enzymes From the preceding analysis it is not clear to what extent each pathway is conserved as a single entity. For example, pathways displaying moderate levels of conservation may arise from either a few organisms possessing the entire pathway, or many organisms possessing different, limited sets of enzymes. We therefore investigated the extent to which each pathway behaves as an evolutionarily distinct 'module' through the calculation of JCs (see Materials and methods). Due to the incomplete nature of the partial and nr datasets, modularity analyses were performed using the complete genome datasets only. As for the conservation analyses, pathway modularity was calculated for a range of different taxonomic groups (Figure S4 in Additional data file 1; Additional data file 6). Similar to pathway conservation, we observe a well populated spectrum of pathway modularity with mean JCs (MJCs) for entire pathways ranging from 0.05 (type I polyketide biosynthesis) to 0.83 (inositol metabolism); where a MJC of 1.0 would indicate that the enzymes of an individual pathway are always present in the same genome. In addition, some pathways are more modular (higher MJC) in the context of specific taxa (for example, C21 steroid hormone metabolism has a MJC of 0.95 when considering only eukaryotic species, but a MJC of 0.60 when all species are considered. From Figure S4 in Additional data file 1, we see that many pathways involved in glycan metabolism (for example, glycosaminoglycan metabolism) are more modular in metazoans; certain pathways involving amino acids and lipids (for example, glutamate metabolism and fatty acid metabolism) are more modular in metazoans and plants; and several pathways involved in secondary metabolism (for example, clavulanic acid biosynthesis and diterpenoid biosynthesis) are more modular in plants and/or bacteria. From these initial investigations we noted that pathways possessing enzymes that are highly conserved are likely to also have high raw JCs (Figure S5 and Results in Additional data file 1). We therefore assessed the significance of the observed modularity scores (as defined by MJC) through comparisons with 200 sets of randomly generated enzyme profiles. To account for biases in the phylogenetic relationships of the species sampled in this study (Figure S6 in Additional data file 1) [42], we applied a novel simulated annealing protocol to generate sets of adjusted enzyme profiles (see Materials and methods). These were used to obtain distributions of MJCs for each pathway allowing the calculation of Z-scores for the observed MJC (Additional data file 6). These provide a clearer view of which pathways may be defined as modular, accounting for biases in both enzyme conservation and evolutionary relationships of the organisms in which the component enzymes are found. It should be noted that the over-prediction of enzyme conservation discussed earlier has the potential to introduce noise into these analyses. Consequently, the modularity scores presented here should be treated as conservative estimates. We found 49 pathways to be significantly modular (P < 0.01 corrected for multiple testing - see Materials and methods). These included many of the glycan metabolism pathways, several lipid pathways, monoterpenoid and diterpenoid biosynthesis, inositol phosphate metabolism, and phenylalanine and tyrosine metabolism. Many (26) of these were also associated with low MJCs (< 0.4), compared with only nine that had relatively high MJCs (> 0.6). In pathways with low MJCs, there was a significant enrichment in enzyme pairs with high JCs (> 0.6), suggesting that modularity arises through smaller sets of enzymes. To examine this in more detail, we selected three pathways with moderate to low MJCs but with a significant enrichment of high JCs (> 0.8). These were diterpenoid biosynthesis (MJC 0.27; Z-score = 17.4; 12% enzyme pairs with JC > 0.8); C21-steroid hormone metabolism (MJC 0.60; Z-score = 8.5; 26% enzyme pairs with JC > 0.8); and histidine metabolism (MJC 0.45; Z-score = 3.9; 12% enzyme pairs with JC > 0.8). For each pathway, genome profiles for each enzyme were clustered into distinct groups and mapped onto KEGG pathway maps (Figure (Figure3).3
Metabolic network reconstruction and analysis Metabolic pathways are typically defined through linked sequences of enzyme-catalyzed reactions with their boundaries determined through manual curation [43]. While such definitions can provide biological meaning, they tend to ignore the integrated nature of metabolism. Consequently, metabolism is increasingly being studied in the context of a single integrated network with nodes representing substrates and links between these nodes representing enzyme reactions (or vice versa) [4,10,16,17,22]. To examine evolutionary relationships outside pathway boundaries, we reconstructed a global metabolic network (with 1,329 nodes representing enzymes and 5,906 links representing common substrates) based on information of reaction pathways from KEGG. Consistent with previous studies [10], the network is scale-free (R = -0.92); that is, most enzymes have few links, while a few 'hub' enzymes possess many links. Note this network does not represent a true biological network in the sense of being specific to a single species, but rather represents the collection of metabolic pathways from across the three domains of life [16]. Visualization of the network (Figure (Figure4)4
We applied topological metrics in conjunction with their conservation profiles to analyze these relationships in more detail (Figure (Figure5).5 We next examined co-conservation relationships (as defined by JC), between pairs of linked enzymes, within the context of the global network (Figure S7 in Additional data file 1). Using a cutoff Z-score of 2.0, 432 links involving 422 enzymes could be defined as significantly modular compared with the shuffled enzyme profiles (although as noted above, due to over prediction of enzyme conservation, this is likely to be a conservative estimate). Within this graph, many subnetworks consisting of co-conserved enzymes can be assembled into coherent pathways - mainly associated at the periphery of the network (for example, diterpenoid biosynthesis, N-glycan metabolism, folate biosynthesis and C21 steroid hormone metabolism). These pathways appear to provide specific functions requiring specialized enzymes that utilize unique substrates. For example, the main precursor in the diterpenoid biosynthetic pathway is geranyl geranyl pyrophosphate from which tree resin, plant hormones and anti-bacterial/fungal agents are synthesized. Although co-conserved enzyme pairs associated with the core of the network are observed, it is interesting to note the relatively low ratio of links to enzymes (432:422) compared with the entire network (5,906:1,329). Hence, although many of these core enzymes tend to be highly connected, they tend only to be co-conserved with one or two other enzymes. The lack of complex networks of modularity within this core indicates that although the network core may represent the basic metabolic framework required for life, there is a degree of flexibility in the precise combination of enzymes. Discussion Here we describe a systematic and comprehensive analysis of the conservation of the universal metabolic network (defined from the KEGG database) across a range of structural and taxonomic hierarchies. The inclusion of partial genomes provides a unique opportunity to extend our knowledge of the conservation of metabolism, particularly with respect to some of the more neglected Eukaryotic taxa (Figure S1 and Results in Additional data file 1). As such, this work builds on prior studies of metabolic conservation that were more restrictive in terms of sequence coverage and phylogenetic extent [18,33,46,47]. For example, Tanaka and colleagues [46] examined enzyme gain and loss events across six eukaryotes to reveal a gain in lipid metabolic processes in vertebrates. Similarly Freilich and colleagues [33] looked at the evolution of the human metabolic network from the perspective of vertebrate evolution. In the current paper, we provide a broader perspective and examine the conservation of enzymes and pathways across 26 taxa, selected on the basis of species coverage and evolutionary relationships [32]. Consistent with previous studies, we found enzymes involved in multiple superclasses were most highly conserved and those involved in glycan metabolism were least highly conserved [18]. Observed differences between the three datasets (complete genomes, partial genomes and nr), revealed the bias towards prokaryotes in the complete genome datasets and highlight the need to consider integrative analyses such as these within a phylogenetic context. As noted earlier, the use of BLAST with a relatively flexible score cutoff will result in an over-prediction of enzymes. Consequently, the lack of homologs of enzymes involved in inositol phosphate and sphingophospholipid pathways in prokaryotes highlights the unique role of these pathways in the formation and function of lipid membranes associated with multi-compartment cells [36,48]. Similarly, many glycan biosynthetic pathways are largely restricted to the Metazoa. The expansion of glycan pathways in these organisms increases the repertoire of post-translational modifications, resulting in the production of proteins of greater complexity. This enables the addition of new functionalities and specificities that likely underlie the complex array of developmental and signaling processes that facilitate multi-cellular life [49]. To circumvent biases that may occur in pathway definitions, recent studies of metabolic conservation are beginning to adopt a network approach in which metabolic components are linked as nodes in a graph [10,16,22,50]. The use of a global metabolic network map (Figure (Figure4)4
Previous studies examining the evolution of metabolism suggest the existence of functional modules of enzymes that share similar patterns of conservation and may act as evolutionary building blocks [16,22-24,52]. Two recent studies applied JCs to measure evolutionary modularity [16,22]. However, as we have shown here, the use of JC to infer modularity can be misleading if conservation is not taken into account: a high JC may merely reflect proteins that are highly conserved rather than mutual evolutionary trajectories. Furthermore, as noted by Kreimer and colleagues, evolutionary modularity can be influenced by both phylogenetic and environmental relationships [23]. Consequently, in our analyses, we developed a novel algorithm to generate simulated metabolic complements that reflect such relationships. Comparisons with these simulated datasets enable us to identify genuine instances of co-inheritance that do not simply reflect patterns of conservation and common evolutionary histories. For the most part, co-inheritance was restricted mainly to enzymes peripheral to the global metabolic network (for example, N-glycan metabolism and diterpenoid biosynthesis), and represent functions that support taxonomic innovations [53]. The relatively low number of links between co-inherited enzymes at the core of the network or in the nucleotide pathways indicates that despite such enzymes being very highly conserved, they do not always occur together in the same genomes. Given the highly connected nature of many of these proteins, we may further infer that hub proteins may only be co-conserved with a limited number of partners. These findings point at an emerging picture in which a core of enzyme activities involving amino acid, energy, carbohydrate and lipid metabolism have evolved to provide the basic functions required for life. However, as indicated by the relatively low number of significantly modular links, the precise complement of enzymes associated within this core for each species is flexible. It is important to remember that the network view provided in Figure Figure44 These analyses provide indications on the adaptations to metabolic networks that have been acquired by different sets of organisms. Applied in the context of organisms of industrial importance, such findings could find application in metabolic engineering. For example, there is considerable interest in the production of the diterpenoid paclitaxel, an important anti-cancer therapeutic. Alternatively, applied to organisms associated with pathogenicity, these findings could be exploited for identifying pathways and enzymes that could be usefully targeted for therapeutic intervention (see Additional data file 1 for a simple example). Finally, the approaches outlined here have been limited to the study of metabolism, but may also be applied to other systems such as regulatory pathways (Additional data files 1 and 7). Conclusions Here we have combined multiple sources of sequence data to perform a systematic and comprehensive analysis of the conservation and modularity of metabolism across the three domains of life. Intriguingly, while we identified a highly conserved set of enzymes at the core of the metabolic network, there appears to be enormous flexibility in their use across different organisms. It should be appreciated that in using BLAST to infer homology, this study does not claim to reconstruct the metabolic network of any single organism, but instead focuses on levels of sequence conservation associated with specific enzymes and pathways. As such a more detailed comparison of the metabolic networks encoded by individual species would help to further understand and interpret the biological meaning of the differences we have found. Given the scope of the analysis, we have been able to report only a small fraction of our findings. To facilitate further, in-depth explorations of the conservation and modularity of the metabolic network, we make the Cytoscape [54] files generated in this analysis freely available at [55]. Materials and methods Sequence input data Three types of sequence data were used in this study (see Table 1 and Additional data file 2 for details about the number of sequences, and taxonomic information on the genome datasets analyzed in this study). The first type comprised protein coding sequences of 167 complete genomes (19 Archaea, 127 Bacteria and 21 Eukarya) were derived from the COGENT database [56,57]. While 200 datasets were originally obtained, to avoid species redundancy in bacteria, if there were different strains from the same species, only one strain (the one with the largest number of open reading frames) was considered. In addition, we also obtained the protein coding sequences of Thalassiosira pseudanana [58] and the genome contigs of Coprinopsis cinerea [59]. The second type comprised the consensus DNA sequences associated with 193 eukaryotic partial genomes obtained from our in-house database, PartiGeneDB [26,60]. The third type comprised proteins from the non-redundant (nr) protein database SWALL (Swissprot and TrEmbl) [61]. Reference enzyme datasets were obtained from the KEGG database [2,62]. We obtained 29,893 sequences associated with 1,474 distinct EC numbers from 260 species (sequences with more than one EC number per sequence were excluded to reduce the number of potential false positive assignments when performing BLAST searches). In certain cases, we identified enzymes with EC numbers that were not assigned a sequence within the KEGG dataset. For example, according to KEGG, ten enzymes participate in the retinol metabolism pathway. However, we were only able to extract two of the sequences assigned to this pathway. Therefore, in a complementary approach, enzymes associated with EC numbers with no apparent sequence in the KEGG database were used as queries to search the nr database via the sequence retrieval system (SRS) [63]. This step has previously been successfully applied to identify missing genes in predicted metabolic pathway databases such as EcoCyc and MetaCyc [3] and was also found to extend the coverage of our approach by adding an additional 1,181 new protein sequences to the reference dataset. Taxonomic divisions Species were classified into 26 taxonomic divisions on the basis of the NCBI taxonomy resource [64] (Table 1; and Additional data file 2). Each domain of life was split into the following major taxonomic divisions: Archaea (Crenarchaeaota, Euryarchaeota and Archaea_Others); Bacteria (Actinobacteridae, Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Deltaproteobacteria, Epsilonproteobacteria, Cyanobacteria, Firmicutes, Spirochaetes and Bacteria_others); and Eukarya (Protists, further subdivided into Alveolata, Euglenozoa/Haptophyceae/Stramenophiles and Protist_Others; Fungi, further subdivided into Ascomycota, Basidiomycota, Glomeromycota/Zygomycotina and Fungi_Others; Metazoa, further subdivided into Annelida/Mollusca/Platyhelminthes, Arthropoda/Tardigrada, Chordata/Echinodermata, Nematoda and Metazoa_Others; and Plantae). Note that both the Euglenozoa/Haptophyceae/Stramenophiles group and the eukaryotic groups specified as 'Others' represent paraphylyetic groups and, therefore, artificially grouped together for convenience. Conservation of metabolic enzymes and pathways The process for obtaining the metabolic complement for each species and taxonomic divisions was as follows. First, we applied BLAST [65] to identify sequences with significant sequence similarity (defined using a raw bit score threshold of 50 and E-value of 10-5) to 29,893 metabolic enzymes from the KEGG dataset and the additional 1,181 enzymes derived through the SRS procedure (see above) from the previously defined taxonomic partitions. The above threshold was considered to be a useful compromise between the rate of discovery of false negatives and false positives and is similar to the threshold adopted in previous studies [18,46]. This process results in a taxonomic profile for each of the 1,474 EC numbers analyzed in this study, displaying the presence or absence of a sequence sharing significant sequence similarity with an enzyme with the appropriate EC number within each of the 167 complete genome datasets, 196 partial genome datasets and the nr protein dataset (divided into the 26 aforementioned taxonomic divisions). Note that these profiles are different to the phylogenetic profiles traditionally applied to individual proteins, as they represent a consensus 'EC profile' that may have been derived through many different isoenzymes. The MPC of each pathway (MPCij) was calculated by:
where eij is the number of isoenzymes in the jth pathway with homologs in the ith taxonomic group, Ej the total number of isoenzymes with sequence in the jth pathway according to KEGG, and N is the number of taxonomic groups analyzed. Since the pathway content we are analyzing is based on the number of enzymes (reactions) involved in metabolic pathways, multiple isoenzymes catalyzing the same reaction were counted only once, and multifunctional enzymes were counted as many times as they catalyze different reactions. To assign levels of significance associated with the conservation of each pathway, we compared the observed MPC for each pathway with the MPCs derived from 10,000 pathways of equivalent-sized pathways generated through random sampling of enzymes. P-value thresholds associated with the derived Z-scores were then adjusted using the Bonferroni correction for multiple testing. For 116 pathways examined in the conservation analyses, a conventional P-value of 0.05 is corrected to an α-score of 4 × 10-4, resulting in a critical Z-score of 3.33. This approach is similar to that described by Lopez-Bigas and colleagues [66]. To assess whether the observed phylogenetic distribution patterns of metabolic enzymes were different from any other proteins, 25 sets of proteins of equal size to the enzyme dataset were randomly taken from proteins from the nr database with no EC numbers in their annotations (that is, potentially representing unknown metabolic enzymes) as control sets, and were subjected to an identical analysis. For the statistical analysis we used a two-tailed t-test at 99% confidence level. Evolutionary modularity The evolutionary modularity of metabolic pathways was quantified using only sequence data from complete genomes. Briefly, we calculated the relative distance of the phylogenetic profiles for pairs of enzymes [67]. For each pair of enzymes we calculated the JC, an established measure of similarity between two phylogenetic profiles [16,68]; a JC of 1 indicates 100% overlap, and a JC of 0 indicates no similarity. Finally, we assessed pathway modularity using both approaches by calculating the MJC of all pairs of enzymes considered for the pathway as follows:
where JCij is the JC measure of each pair of enzymes involved in the jth pathway, and N is the number of distinct enzyme pairs known to participate in the jth pathway. By definition, a pathway is perfectly modular (100%) if all enzymes within the pathway have exactly the same phylogenetic profile (either present or absent in a given species; that is, the modularity should be 1). To ascribe a level of significance to the calculated JCs, we created 200 sets of randomly assigned enzyme profiles. These were created to reflect the observed distribution of enzyme profiles using an approach based on simulated annealing (Additional data file 6). Three properties were measured from the original profile: number of genomes each enzyme is found in, number of enzymes each genome contains and the similarity of each genome with each other. Prior to annealing, each genome was initialized with the correct number of enzymes, but shuffled randomly. During each annealing step an enzyme was selected from one species, and swapped with a different species. The change in scoring function was evaluated and the swap accepted or rejected with the probability:
Where ΔE is an energy term associated with the change in scoring function (see the following equation) and T is a temperature term associated with the simulated annealing protocol:
where δs is the sum of the Hamming distances between the target similarity matrix (a matrix of Hamming distances between the enzyme profiles of every pair of genomes - effectively providing a measure of the relatedness of enzyme content between organisms) and the current matrix, Sw (= 1) is a weight associated with this first term, δe is the Hamming distance between the target number of genomes each enzyme is found in and the current matrix, and Ew (= 15) is a weight associated with this second term. The temperature term, T, was initialized to 1,000 and decreased in a non-uniform adaptive fashion to 0.00001. Temperature was decreased to 0.9 of its previous value only after 106 consecutive swap attempts failed to improve the overall score. The procedure was re-run for 200 different initially assigned profiles. This resulted in sets of enzyme profiles in which, although the specific enzyme membership for each genome was shuffled, the similarity between the enzyme content from closely related species was maintained. Metabolic network reconstruction and analysis Several methods and databases are available to reconstruct an organism's metabolic network from genome information, such as KEGG [2] and MetaCyc [1]. In this study, we reconstructed and represented the complete metabolic network as an undirected enzyme (or reaction) interaction graph. In this graph, enzymes (EC numbers) are represented as nodes and substrates connecting two enzymes are represented as edges in the network. Common metabolites were ignored, consistent with previous studies [4]. Pathway relationships were inferred from KEGG. Certain enzymes can catalyze several different reactions and are involved in different metabolic pathways. These enzymes are represented by single nodes in the network. Thus, these nodes are linked to all other nodes that are connected to the different reactions or pathways undertaken by these enzymes. Network properties (node degree and betweenness values) were obtained using the tYNA network analysis platform [69]. Abbreviations EC: Enzyme Commission; JC: Jaccard coefficient; KEGG: Kyoto Encyclopedia of Genes and Genomes; MJC: mean Jaccard coefficient; MPC: mean percentage of conservation; nr: non-redundant protein database; SRS: sequence retrieval system. Authors' contributions JMPA helped conceive and design the study, collated and analyzed sequence datasets and drafted the manuscript. CS developed the statistical framework for accounting for phylogenetic bias in analyzing JCs. JP conceived and designed the study, performed additional statistical analyses and wrote the paper. Additional data files The following additional data files are available with the online version of this paper: supplemental material in terms of results, discussion and methods and supplementary figures S1-S7 (Additional data file 1); a table listing the complete and partial genomes used in this study (Additional data file 2); a table listing the conservation of individual enzymes (Additional data file 3); a table listing enzymes restricted to either bacteria or eukaryotes (Additional data file 4); a table showing the conservation of 118 metabolic pathways analyzed in this study (Additional data file 5); a table listing the modularity of 166 metabolic pathways analyzed in this study (Additional data file 6); a table listing the conservation of 25 regulatory pathways defined by KEGG (Additional data file 7). Additional data file 1 Supplemental results, discussion, methods and Figures S1-S7. Click here for file(367K, pdf) Additional data file 2 The complete and partial genomes used in this study. Click here for file(72K, xls) Additional data file 4 Enzymes restricted to either bacteria or eukaryotes. Click here for file(44K, xls) Additional data file 5 Conservation of 118 metabolic pathways analyzed in this study. Click here for file(53K, xls) Additional data file 6 Modularity of 166 metabolic pathways analyzed in this study. Click here for file(90K, xls) Additional data file 7 Conservation of 25 regulatory pathways defined by KEGG. Click here for file(21K, xls) Acknowledgements We would like to thank James Wasmuth (Hospital for Sick Children, Toronto, Canada) and two anonymous reviewers for providing valuable comments during preparation of the manuscript. This work is funded by grants from the Natural Sciences and Engineering Research Council of Canada (NSERC #188266 - 04) and the Canadian Institute of Health Research (CIHR MOP #84556). JMP-A was supported in part by a Research Training award from the Hospital for Sick Children. JP is supported by a CIHR New Investigators award and an Early Researcher Award from the Ontario Ministry of Research and Innovation. References
|
PubMed related articles
Your browsing activity is empty. Activity recording is turned off. |
|||||||||||||||||||||||||
Nucleic Acids Res. 2006 Jan 1; 34(Database issue):D511-6.
[Nucleic Acids Res. 2006]Nucleic Acids Res. 2006 Jan 1; 34(Database issue):D354-7.
[Nucleic Acids Res. 2006]Bioinformatics. 2004 Mar 22; 20(5):709-17.
[Bioinformatics. 2004]BMC Bioinformatics. 2005 Jan 14; 6():8.
[BMC Bioinformatics. 2005]Nature. 2000 Jul 27; 406(6794):378-82.
[Nature. 2000]BMC Evol Biol. 2004 Nov 27; 4(1):51.
[BMC Evol Biol. 2004]BMC Bioinformatics. 2006 Mar 13; 7():130.
[BMC Bioinformatics. 2006]BMC Bioinformatics. 2006 Jul 19; 7():351.
[BMC Bioinformatics. 2006]Science. 2002 Aug 30; 297(5586):1551-5.
[Science. 2002]Genome Res. 2003 Mar; 13(3):422-7.
[Genome Res. 2003]Nucleic Acids Res. 2001 Jan 1; 29(1):126-7.
[Nucleic Acids Res. 2001]Nucleic Acids Res. 2005 Jan 1; 33(Database issue):D303-7.
[Nucleic Acids Res. 2005]Nucleic Acids Res. 2005 Jan 1; 33(Database issue):D303-7.
[Nucleic Acids Res. 2005]Bioinformatics. 2004 Jun 12; 20(9):1398-404.
[Bioinformatics. 2004]Trends Parasitol. 2009 Jan; 25(1):37-43.
[Trends Parasitol. 2009]Nucleic Acids Res. 2006 Jan 1; 34(Database issue):D354-7.
[Nucleic Acids Res. 2006]Genome Res. 2003 Mar; 13(3):422-7.
[Genome Res. 2003]Nucleic Acids Res. 2005 Jan 1; 33(Database issue):D303-7.
[Nucleic Acids Res. 2005]Bioinformatics. 2003 Jul 22; 19(11):1451-2.
[Bioinformatics. 2003]Genome Biol. 2007; 8(11):R238.
[Genome Biol. 2007]Genome Res. 2003 Mar; 13(3):422-7.
[Genome Res. 2003]Nucleic Acids Res. 2006 Jan 1; 34(Database issue):D354-7.
[Nucleic Acids Res. 2006]Genome Biol. 2007; 8(11):R238.
[Genome Biol. 2007]BMC Evol Biol. 2008 Sep 9; 8():247.
[BMC Evol Biol. 2008]EMBO Rep. 2004 Aug; 5(8):777-82.
[EMBO Rep. 2004]FEBS Lett. 2002 Oct 30; 531(1):33-7.
[FEBS Lett. 2002]Science. 1973 Jul 6; 181(4094):81-82.
[Science. 1973]Nat Biotechnol. 2003 May; 21(5):526-31.
[Nat Biotechnol. 2003]Chem Biol. 2006 Jan; 13(1):31-7.
[Chem Biol. 2006]Bioinformatics. 2005 Aug 15; 21(16):3409-15.
[Bioinformatics. 2005]Proc Natl Acad Sci U S A. 1998 Dec 8; 95(25):14863-8.
[Proc Natl Acad Sci U S A. 1998]Nucleic Acids Res. 2006; 34(13):3687-97.
[Nucleic Acids Res. 2006]Bioinformatics. 2003 Jan 22; 19(2):270-7.
[Bioinformatics. 2003]Nature. 2000 Oct 5; 407(6804):651-4.
[Nature. 2000]BMC Bioinformatics. 2006 Mar 13; 7():130.
[BMC Bioinformatics. 2006]BMC Bioinformatics. 2006 Jul 19; 7():351.
[BMC Bioinformatics. 2006]Genome Res. 2003 Mar; 13(3):422-7.
[Genome Res. 2003]Mol Biol Evol. 2005 Mar; 22(3):421-5.
[Mol Biol Evol. 2005]Nature. 2005 Feb 3; 433(7025):531-7.
[Nature. 2005]Nature. 2006 Mar 30; 440(7084):637-43.
[Nature. 2006]Genome Res. 2003 Mar; 13(3):422-7.
[Genome Res. 2003]BMC Evol Biol. 2008 Sep 9; 8():247.
[BMC Evol Biol. 2008]Gene. 2006 Jan 3; 365():88-94.
[Gene. 2006]BMC Bioinformatics. 2006 May 9; 7():252.
[BMC Bioinformatics. 2006]Genome Biol. 2007; 8(11):R238.
[Genome Biol. 2007]Genome Res. 2003 Mar; 13(3):422-7.
[Genome Res. 2003]FEBS Lett. 2002 Oct 30; 531(1):33-7.
[FEBS Lett. 2002]Nat Rev Mol Cell Biol. 2008 Feb; 9(2):151-61.
[Nat Rev Mol Cell Biol. 2008]Science. 2001 Mar 23; 291(5512):2364-9.
[Science. 2001]Nature. 2000 Oct 5; 407(6804):651-4.
[Nature. 2000]BMC Bioinformatics. 2006 Mar 13; 7():130.
[BMC Bioinformatics. 2006]BMC Bioinformatics. 2007 Aug 27; 8():311.
[BMC Bioinformatics. 2007]Genome Biol. 2008; 9(6):R95.
[Genome Biol. 2008]Genome Res. 2003 Mar; 13(3):422-7.
[Genome Res. 2003]BMC Bioinformatics. 2006 Mar 13; 7():130.
[BMC Bioinformatics. 2006]BMC Bioinformatics. 2007 Aug 27; 8():311.
[BMC Bioinformatics. 2007]Genome Res. 2004 Mar; 14(3):391-7.
[Genome Res. 2004]Proc Natl Acad Sci U S A. 2006 Jun 6; 103(23):8774-9.
[Proc Natl Acad Sci U S A. 2006]Proc Natl Acad Sci U S A. 2008 May 13; 105(19):6976-81.
[Proc Natl Acad Sci U S A. 2008]Genome Res. 2003 Nov; 13(11):2498-504.
[Genome Res. 2003]Bioinformatics. 2003 Jul 22; 19(11):1451-2.
[Bioinformatics. 2003]Nucleic Acids Res. 2005 Jan 1; 33(Database issue):D303-7.
[Nucleic Acids Res. 2005]Nucleic Acids Res. 2000 Jan 1; 28(1):45-8.
[Nucleic Acids Res. 2000]Nucleic Acids Res. 2006 Jan 1; 34(Database issue):D354-7.
[Nucleic Acids Res. 2006]Comput Appl Biosci. 1993 Feb; 9(1):49-57.
[Comput Appl Biosci. 1993]BMC Bioinformatics. 2004 Jun 9; 5():76.
[BMC Bioinformatics. 2004]Nucleic Acids Res. 1997 Sep 1; 25(17):3389-402.
[Nucleic Acids Res. 1997]Genome Res. 2003 Mar; 13(3):422-7.
[Genome Res. 2003]Gene. 2006 Jan 3; 365():88-94.
[Gene. 2006]Genome Biol. 2008; 9(2):R33.
[Genome Biol. 2008]Proc Natl Acad Sci U S A. 1999 Apr 13; 96(8):4285-8.
[Proc Natl Acad Sci U S A. 1999]BMC Bioinformatics. 2006 Mar 13; 7():130.
[BMC Bioinformatics. 2006]Genome Res. 2001 Mar; 11(3):356-72.
[Genome Res. 2001]Nucleic Acids Res. 2006 Jan 1; 34(Database issue):D354-7.
[Nucleic Acids Res. 2006]Nucleic Acids Res. 2006 Jan 1; 34(Database issue):D511-6.
[Nucleic Acids Res. 2006]Bioinformatics. 2003 Jan 22; 19(2):270-7.
[Bioinformatics. 2003]Bioinformatics. 2006 Dec 1; 22(23):2968-70.
[Bioinformatics. 2006]