![]() | ![]() |
Formats:
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||
Copyright : © 2006 Liu et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. An Integrative Genomic Approach to Uncover Molecular Mechanisms of Prokaryotic Traits 1Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, Illinois, United States of America 2Center for Biomedical Informatics, Department of Medicine, University of Chicago, Chicago, Illinois, United States of America 3Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, United States of America 4Department of Computer Science, Yale University, New Haven, Connecticut, United States of America 5Department of Biomedical Informatics, Columbia University, New York, New York, United States of America Suzanna Lewis, Editor University of California Berkeley, United States of America * To whom correspondence should be addressed. E-mail: mark.gerstein/at/yale.edu (MG); Email: lussier/at/uchicago.edu (YAL) Received May 5, 2006; Accepted October 10, 2006. This article has been cited by other articles in PMC.Abstract With mounting availability of genomic and phenotypic databases, data integration and mining become increasingly challenging. While efforts have been put forward to analyze prokaryotic phenotypes, current computational technologies either lack high throughput capacity for genomic scale analysis, or are limited in their capability to integrate and mine data across different scales of biology. Consequently, simultaneous analysis of associations among genomes, phenotypes, and gene functions is prohibited. Here, we developed a high throughput computational approach, and demonstrated for the first time the feasibility of integrating large quantities of prokaryotic phenotypes along with genomic datasets for mining across multiple scales of biology (protein domains, pathways, molecular functions, and cellular processes). Applying this method over 59 fully sequenced prokaryotic species, we identified genetic basis and molecular mechanisms underlying the phenotypes in bacteria. We identified 3,711 significant correlations between 1,499 distinct Pfam and 63 phenotypes, with 2,650 correlations and 1,061 anti-correlations. Manual evaluation of a random sample of these significant correlations showed a minimal precision of 30% (95% confidence interval: 20%–42%; n = 50). We stratified the most significant 478 predictions and subjected 100 to manual evaluation, of which 60 were corroborated in the literature. We furthermore unveiled 10 significant correlations between phenotypes and KEGG pathways, eight of which were corroborated in the evaluation, and 309 significant correlations between phenotypes and 166 GO concepts evaluated using a random sample (minimal precision = 72%; 95% confidence interval: 60%–80%; n = 50). Additionally, we conducted a novel large-scale phenomic visualization analysis to provide insight into the modular nature of common molecular mechanisms spanning multiple biological scales and reused by related phenotypes (metaphenotypes). We propose that this method elucidates which classes of molecular mechanisms are associated with phenotypes or metaphenotypes and holds promise in facilitating a computable systems biology approach to genomic and biomedical research. Synopsis A key challenge of the post-genomic era is to conceive large-scale studies of genomes and observable characteristics of organisms (phenotypes) and to interpret the data thus produced. The goal of this “phenomic” study is to improve our understanding of complex biological systems in terms of their molecular underpinnings. In this paper, Liu and colleagues present comprehensive computational and novel visualization methods for discovering biological knowledge spanning multiple scales of biology. The authors were able to predict and visualize new knowledge between clusters of microbiological phenotypes and their molecular mechanisms. To their knowledge, this is the first time this has been done. More specifically, the method integrates microbiological data with genomic-scale data from protein family databases, gene ontology, and biological pathways. Conducted over 59 fully sequenced bacteria, and including significantly more phenotypes than previous studies of its kind, this study enables a “systems biology” view across different classifications of genes and processes. This represents advancement over previous techniques, which are either limited in biological scale or analytical breadth. Visualization of the networks generated by this technique shows the common biological modules shared by related phenotypes. The results of this experiment demonstrate that the fusion of clinical data with genomic information is able to elucidate, in high throughput, a massive number of biological processes underlying phenotypes. Introduction With the completion of hundreds of prokaryotic genome sequences, computational methods in systems biology aimed at integrating genotypes and phenotypes are developed at an increasing speed. However, data integration and mining remain key challenges in bioinformatics as well as in cross-disciplinary research in biomedical informatics. In addition, a critical issue that remains unsolved is to derive meaningful general biological principles from predictions of statistically significant associations between phenotypes and different biological scales of molecular mechanisms (e.g., protein domains, cellular processes, and cellular pathways) to facilitate the understanding of a particular species. The availability of a large number of fully sequenced genomes and the relatively simple and well-characterized biological processes of prokaryotic organisms makes them ideal model organisms to demonstrate the feasibility of a computational systems biology approach to integrate, mine, and analyze genomic, phenotypic, and functional databases to derive general principles that govern the biology of prokaryotes. Prokaryotic phenotypes defined by human observations (e.g., motility), living conditions of the organism (e.g., growth at high temperature), and experimental conditions (e.g., acid production in a medium containing D-mannose) are of great interest for post-genomics–era research [1] as well as systems biology research. In clinical microbiological practice, many of these phenotypes are used to discriminate human pathogens from other microorganisms. While a great amount of effort has been devoted to the analysis of prokaryotic phenotypes, prior technologies, operated in a semi-automatic fashion, can at best only analyze a handful of phenotypes at once to deduce their associations with genotypes. In the past, functional genomic approaches predicted prokaryotic genes associated to biochemical pathways [2–4]; however, these studies did not specifically look into systems properties of these genes and were not specifically focused on phenotypes associated to these pathways or the emerging multiscale properties of their molecular mechanisms. Recently, a few studies conducted semi-automatic analyses of associations between an individual prokaryotic phenotype (e.g., hyperthermophily, motility) and its clustering with genes that have similar nucleotide sequences [5] or with Clusters of Orthologuous Groups of proteins (COGs) [5,6]. These studies, limited by their need for manual curation (phenotypic annotations to species), were designed to predict linear relationships between only one biological scale of molecular functions and a limited number of manually annotated phenotypes. To overcome the limitations of manual annotation in the creation of phenotypic datasets, others in the field conducted phenotype–genotype analyses by mining known knowledge on phenotype–genotype relationships from the scientific literature using high-throughput technologies. In this regard, Korbel et al. used a natural language processing approach to mine the MEDLINE literature and the genomic contents of prokaryotes, resulting in the identification of 2,700 statistically significant associations between COGs and words from the literature related to phenotypes [7]. In other approaches, researchers have built integrated systems to correlate phenotypes, pathways, and genes [3,8]. For example, the WIT system [8] used an integrated system to deduce metabolic systems using genomic data, genes, and pathways. Haft et al. built a Web-based system to query and display curated phenotypes and annotated prokaryotic genome properties, such as protein families, pathways, and phylogenies [9]; however, the system does not predict correlations between microbiological phenotypes and genome properties. In addition to the work being done to integrate prokaryotic phenotypes to genotypes, researchers have also made significant advances in building large-scale phenotype–genotype networks in mice, rats, and humans. The Mouse Genome Database (MGD) has structured their mouse genomic data in terms of the Mammalian Phenotype Ontology [10]. Similarly, the Rat Genome Database (RGD) [11] also developed a phenome database, integrated with its genomic data. In humans, the GeneNetwork (WebQTL) provides a database of complex traits with mappings to quantitative trait loci [12]. And several studies have focused on integrating human phenome and genome resources. For example, Butte et al. created a large-scale phenome–genome network by integrating the Unified Medical Language System with human microarray gene expression data [13]; and Aerts et al. applied a prioritization method to associate genes with human diseases and pathways [14]. We hypothesized that by automatically and simultaneously merging and analyzing massive quantities of microbiological phenotypes and their molecular datasets, we could predict both the molecular underpinnings of prokaryotic phenotypes as well as the relationships between related groups of phenotypes. Thus, this study is designed to illustrate how the big picture emerges from the network of predictions between multiple scales of molecular mechanisms and their correlations to an individual phenotype or to clusters of phenotypes. We developed a high-throughput computational approach, and for the first time, demonstrate the feasibility of integrating a large quantity of prokaryotic phenotypes with genomic datasets from various sources for large-scale data mining across different scales of molecular biology (protein domains, pathways, molecular function, and cellular processes). To analyze large quantities of prokaryotic phenotypes, we employed the Microbiology module of the Global Infectious Diseases and Epidemiology Network (GIDEON) that we refer to as the Microbiology Knowledge Dataset (MKD) as our source data on phenotypes [15,16]. MKD contains results from laboratory examinations though which users can distinguish different microorganisms. These laboratory results contain descriptions about the morphologic characteristics of microorganisms (e.g., Gram-positive, Gram-negative, motility, and cell wall deficiency), metabolic functions of microorganisms, (e.g., urea hydrolysis, acetate utilization, and gas production from glucose), and microorganisms' adaptation to extreme living conditions, (e.g., growth at 42 °C and growth in 6.5% sodium chloride). We regarded MKD laboratory test results as phenotypes or phenotypic traits, as they constitute observable physical or biochemical characteristics under certain experimental conditions that are determined by the microorganisms' genetic contents. MKD contains more than 100 phenotypic characterizations for more than 3,000 bacterial species, not only allowing us to conduct large-scale data mining on genomics data over phenotypic traits, but also enabling us to compare different phenotypic traits based on their correlations to their genetic contents. Of these 3,000 bacterial species, we included 59 species with fully sequenced genomes in our studies. To integrate phenotypes in MKD with genomic datasets, we chose to include the Protein Family Database (Pfam) [17], Clusters of Orthologous Groups (COGs) [18,19], Kyoto Encyclopedia of Genes and Genomes (KEGG) [20], and biological concepts found in the Gene Ontology (GO) [21,22] which span multiple scales of biology. Applying our method of data integration and mining, we have identified the genetic basis and molecular mechanisms underlying the many bacterial phenotypes. We revealed 3,711 significant correlations and anti-correlations (p-value < 0.05) between 63 microbiological phenotypes and 1,499 Pfam families, and identified 17 and 506 significant molecular mechanisms of phenotypes according to our analyses of KEGG's biochemical pathways, and GO's biological concepts, respectively. In addition, for the first time, a novel phenomic analysis was conducted to compare phenotypes with each other on a large scale based on their genetic contents. The original visualization of the network of relationships between one cluster of phenotypes and its significant correlations to protein families, molecular pathways, processes, and function illustrates how clusters of phenotypes (metaphenotypes) share common molecular mechanisms. Such analysis could lead to a better understanding of the molecular relationships between microbial phenotypes on a genomic scale. We believe that this computational technology holds promise in facilitating a systems biology approach to biomedical research in the post-genomic era. Results/Discussion To address our hypothesis, we first describe the results from the high throughput mapping of phenotypes with multiple databases of molecular mechanisms: first the Pfam, followed by KEGG pathways and GO terms. Then, we present results from a combined phenomic analysis of the significant molecular mechanisms across multiple biological scales. Mapping the MKD's Clinical Phenotypic Traits to Protein Families Currently, the availability of more than 208 microbial genome sequences in GenBank provides a rich source of information about the genetic contents of various microorganisms [23]. In addition, functional classification databases, such as COGs [18,19] and Pfam [17], enable us to compare conservation and divergence of functional genes across microorganisms. However, little has been done in the past to correlate the genomic data with phenotypic information. In this study, to uncover the underlying linkages between microorganism phenotypes and their genetic contents, we integrated and analyzed datasets of a microbiological phenotypic database (the MKD) and a genomic protein domains dataset (Pfam). In this study we have, by design, limited the analysis to complete genome sequences to avoid a selection bias toward genes coming from partial genomes that were preferentially sequenced. Methods to deal with organisms with partial genomic sequences will be explored in a future study. As a result, we selected each of the 59 species of microorganisms that exist in both the MKD and Pfam databases with fully deduced genome sequences (Figure 1
Taxonomical mapping between genomic and phenotypic databases. The 59 microorganisms were automatically mapped between datasets of MKD and Pfam using the National Center for Biotechnology Information's (NCBI) taxons as a reference, followed by manual examination by experts (Figure 1 Identification of Correlations between Bacterial Protein Domains and Phenotypes We applied a comprehensive statistical and visualization method based on the hypergeometric distribution to identify the correlations between phenotypic laboratory results and the genetic contents of bacteria. Details are described in Materials and Methods (Equations 1–3), and the procedure is illustrated in Figure 2
Overall, we have identified 3,711 significant correlations between 1,499 distinct Pfam and 63 phenotypes with an experiment-wide error rate of 5%, including 2,650 correlations and 1,061 anti-correlations. Here we weight the anti-correlations with the same importance as correlations, since the description of the opposite phenotypes would positively correlate with the same set of Pfam families. For example, the phenotype of vancomycin susceptiblity has an anti-correlation with the Pfam family of HlyD family secretion protein (PF00529); thus, we can also consider the converse relation to the opposite phenotype: the phenotype of vancomycin resistance has a positive correlation with the same Pfam family. We observed that while some phenotypes (i.e., motility and Gram-negative) correlate with a large number of Pfam families, others correlate with only a few families (i.e., Gelatin hydrolysis and urea hydrolysis). However, the number of correlations inferred by this method depends on both the limitations of the method [5] and the number of available phenotypic laboratory results for different species. For example, Pfam families that exist in all (or no) species would not be correlated with any laboratory results; neither would positive laboratory results that are lacking or existing in most species. Adjusting results for multiple comparisons. The resulting p-values are adjusted for multiple a posteriori comparisons to reduce the experiment-wide error rate. One of the most commonly used methods to control for experiment-wide error rate is the conservative Bonferroni-type adjustment. We used the related Šidák method, as discussed in detail in Materials and Methods (Equation 3), which provides a conservative threshold to filter out false positive results. In this approach, we regarded each phenotypic laboratory test independently, and applied the Šidák adjustment according to the total number of comparisons analyzed. For example, when comparing phenotypes to Pfam families, the number of distinct comparisons with each phenotype is 2,665; the number of proteins that can be compared with each species. Additionally, we limited the comparisons to those phenotypes and Pfam families found in more than three and fewer than 56 species. This resulted in 478 correlations with corrected p-values of no more than 0.05 (shown in the file Phenotype_Sidak_Pfam_mapping.xls at http://phenos.bsd.uchicago.edu/prok_phenotype). Since the Šidák adjustment provides a set of conservative results, many interesting correlations may be consequently filtered out due to its conservative criteria for genome-wide studies, as the variables under study are not entirely independent [27]. In this study, some laboratory tests and the organisms selected are not independent. For example, the laboratory tests Gram-negative and Gram-positive are anti-correlated. Organisms are phylogenetically related, of which Firmicutes and Proteobacteria are over-represented in the species used in this study (34% and 29% of the 59 species). Moreover, since the laboratory tests are designed to distinguish bacteria species, there is also a bias on laboratory tests being used to distinguish over-represented species. All of them are currently limitations in this study due to availability of prokaryotic phenotypes limited to MKD—a clinical microbiological database. Certainly, with more species being sequenced and more phenotypic data, we could explore using independent laboratory test results with a set of species more representative of overall prokaryotic diversity. To overcome these limitations, we applied an additional method based on statistical simulation which can stratify predicted correlations as described in detail in Materials and Methods. With this method, we conducted a permutation resampling in which we compared the number of significant correlations inferred from the original data with those inferred from an experimental control consisting of the distributions of random permutations of the data with different statistical cutoffs for the hypergeometric distribution. Since this method predicts significant correlations in comparison with randomized samples, its results have less stringent cutoffs and cover more phenotypes. Figure 3
Evaluation of the correlations between phenotypes and protein domains. We conducted an extensive manual evaluation of our predictions, which consists of five parts. First, we manually examined all the phenotypes (21 phenotypes in total) with their 478 significantly correlated Pfam families based on the Šidák adjustment (data shown in the file Phenotype_Sidak_Pfam_mapping.xls at http://phenos.bsd.uchicago.edu/prok_phenotype). One hundred distinct predictions were manually assessed and 60 were corroborated and annotated with the supporting bibliographic references (Table S2). We then analyzed each of these manually curated sets and provide a summary of the analysis of these predictions for each of the 21 phenotypes (Table S2). Overall, 67% (14) of these phenotypes have at least one Pfam association that was corroborated as shown in Table S2. Second, we randomly selected 50 positive correlations and 15 anti-correlations from the simulation method to evaluate the minimum precision of the predictions. In the evaluation process, we focused on evaluating the positively correlated phenotypes and Pfam families, since anti-correlations are often difficult to verify. Of the 50 positive correlations selected, 15 of them were confirmed by supporting literature. As future studies may provide additional corroborations, the precision of 30% (95% confidence interval: 20%–42%; n = 50) is a conservative estimate of the overall potential accuracy of the prediction method controlling for false positive rate (also known as false discovery rate) with permutation resampling. Of the 15 anti-correlations, two of them (13%) were supported by literature (95% confidence interval: 2%–40%; n = 15). A summary of this validation is provided in Table S3 with supportive references for each corroborated prediction. Third, the false negative rate of the correlations that were regarded as statistically insignificant is also estimated. We evaluated 50 random samples, and only one of them has been shown to be correlated, resulting in a 2% false negative rate (95% confidence interval: 0.1%–10%). A summary of this validation is provided in Table S4. Fourth, we conducted an in-depth evaluation of one phenotype (motility) and compared our results with those from a previously reported study [5], which used a different classification method to cluster full-length genes and interpreted the results using annotations of E. coli genes. We found the results of the two studies to be well-correlated, especially for the top-ranked genes (19 of the top 30, or about 63%, E. coli genes have corresponding Pfam families in the top 30 families in our study: see Table S1). Fifth, to further evaluate the accuracy of the method in the well-studied phenotype of motility, we performed a manual validation of the predicted results pertaining to bacterial motility mediated through flagella. In this evaluation, significant correlations between phenotypes and Pfam families using the Šidák adjustment and the simulation methods are examined. Since the Šidák adjustment is more conservative, its predicted correlations are also included in those predicted by the simulation method. The results are shown in Table 2, where 18 and 58 Pfam families are predicted by the Šidák and simulation methods, respectively. By manual examination of the annotation of Pfam families, we identified those which participate in bacterial motility, including flagellar mediated motion and chemotaxis. We manually confirmed 12 (out of 18) and 27 (out of 58) Pfam family predictions from the Šidák and permutation resampling methods, respectively. These results confirmed that the Šidák method predicts relatively conservatively, and the data-mining method works equally well to provide accurate predictions. In addition, our results could help improve the functional understanding of current Pfam annotations. For example, we discovered one of the Pfam families, PF06429, described as Domain of unknown function (DUF1078), to be correlated with bacteria motility.
Overall, the results of these evaluations indicate that our approach can faithfully identify the most significantly correlated protein families as accurately as the other classification methods. However, our approach differs from the previous studies because we compared significantly more phenotypes and extended the phenotypic analyses to KEGG pathways and GO concepts which have not previously been analyzed in other studies to our knowledge (discussed below). Limitations of the correlations of phenotypes to protein domain families and future work. In this study, we primarily used sequence-based classifications (Pfam and COGs) to correlate with phenotypes. The correlations identified by this method suggest hypothetical association based on statistical analysis. However, we limited our exploration of the converse, correlations that are not statistically significant, to the previously described one manual evaluation (Table S5). Though it is feasible to conduct studies to demonstrate that there is not a correlation between certain properties, this was not the design of this study, and therefore we cannot make conclusions about the absence of relationships between correlated elements that did not reach statistical significance in this study. Many factors could lead to statistically insignificant correlations in our approach, for example, the lack of available laboratory data could lead to poor correlations to Pfam families. In future work, it would be interesting to explore the use of structure-based classifications and databases, such as the Structural Classification of Proteins (SCOP) [28], CATH [29], or DALI [30], or using integrated structure and sequence-based classifications, such as classifications based on Pfam domains integrated with Structural Classification of Proteins domains, as studied by Pouliot et al. [31]. Furthermore, we could integrate the classified protein domains with a protein structure database, such as the Protein Data Bank (PDB) [32] or OCA (http://oca.ebi.ac.uk/oca-docs/oca-home.html), to further study their functions. Mapping Phenotypes to KEGG Pathways and GO We also applied the hypergeometric statistical and data-mining approaches to identify correlations of phenotypes with molecular pathways and GO concepts. Using existing bioinformatics resources, we integrated data using the following methods: 1) phenotypes with KEGG molecular pathways by mining their matching COG groups; and 2) phenotypes with GO concepts by mining their matching Pfam families. KEGG pathways and GO concepts significantly correlated with phenotypes were identified by their probabilities of occurrence (see Materials and Methods). This provided more correlations for the mapping, which are likely to reveal biological significance. The details of the procedure are described in Materials and Methods. We unveiled ten significant correlations and seven significant anti-correlations between phenotypes and KEGG pathways and 506 significant correlations between phenotypes and GO concepts. Complete results can be found at the study Web site, http://phenos.bsd.uchicago.edu/prok_phenotype, file Phenotype_KEGG_mapping_results.xls for KEGG mapping and file Phenotype_GO_mapping_results.xls for GO mapping. Compared with the mapping of phenotypes to Pfam families, which provides the relationships of individual protein domain families to phenotypes, the mapping of phenotypes to GO and pathways provides a systematic view of the underlying molecular mechanisms (from multiple scales of biology) related to phenotypes. Evaluation of the KEGG pathway mappings. To evaluate the accuracy of our mapping method, we conducted two evaluations: (i) we manually revised each of the 17 predictions, and eight correlations as well as two anti-correlations were found corroborated in the literature (Table S5), and (ii) we then pursued a deeper manual evaluation on the most significant mapping results in KEGG. Table 3 shows that two KEGG pathways, the Lipopolysaccharide [33] and Ubiquinone biosynthesis pathways, are significantly correlated with the Gram-negative phenotype, both of which are supported by the literature [34]. In theory, every gene family involved in the Lipopolysaccharide biosynthesis pathway should have significant correlations with the Gram-negative phenotype. Our method accurately identified 15 significantly correlated distinct COGs out of a total of 19 defined in the Lipopolysaccharide biosynthesis pathway. According to the phenotype–COG mapping described in Methods, the remaining four COGs that did not map to the phenotype are COG0438 (predicted glycosyltransferases), COG1442 (Lipopolysaccharide biosynthesis protein: glycosyltransferases), COG0451 (Nucleoside-diphosphate-sugar epimerases), and COG0515 (Serine/threonine protein kinases). This could be due to imprecise definitions in the classification method, resulting in diverse functions of the proteins in the families, as three COGs (COG0438, COG0451, and COG0515) participate in many other pathways; or it could also due to the limitation of our method by using hypergeometric function [5]. In contrast, of the 15 COGs mapped between the Lipopolysaccharide biosynthesis pathway and Gram-negative phenotype, 14 are well-defined and unique to only one pathway, with only one exception (COG0241) that exists in two pathways. This suggests that biases in classification method and gene annotation could reduce the signals for the correlations. Reduction of such biases could improve the accuracy of the prediction of correlations in future studies. Additionally, other data resources could be used in future studies for assessing prokaryotic molecular pathways, such as Metacyc and Ecocyc [35,36].
Evaluation of GO mappings. In addition to the phenotype-pathway mapping, we mapped phenotypes to GO concepts (biological processes, molecular functions, cellular components) based on their correlated groups of Pfam families. Using the Šidák adjustment for a posteriori comparisons, there are 309 significant positive correlations with 33 distinct phenotypes within 166 distinct GO terms and 197 anti-correlations of 13 unique phenotypes within 142 distinct GO terms. We also provide two evaluations of the GO–phenotype predictions: (i) a random sample of 50 predictions were manually revised and showed a precision of 72% (95% confidence interval: 60%–82%; Table S6), and (ii) a manual evaluation of two phenotypes: Gram-negative and motility. Table 4 shows the GO concepts mapped to the Gram-negative phenotype. Lipopolysaccharide biosynthesis (GO:0009103) and lipid A biosynthesis (GO:0009245) are the top-ranking GO concepts mapped in the biological process branch of GO, while cell (GO:0005623), cell envelope (GO:0030313), and periplasmic space (sensu Gram-negative bacteria) (GO:0030288) are the top-ranking concepts mapped in the cellular component of GO (there are no mappings to the molecular functions of GO). In contrast to the phenotype-pathway mapping, phenotype-GO mapping provides characterizations of phenotypes using different aspects of GO. Though the mappings of phenotypes to pathways and GO concepts were conducted through differently classified gene families (COGs or Pfam), the results are strikingly comparable.
By applying a similar mapping to the motility phenotype (Tables 5 and 6), we identified four pathways and 27 GO concepts that are closely correlated with bacterial motility. The three pathways are 1) flagellar assembly, 2) type III secretion system, and 3) bacterial chemotaxis. Bacterial flagellar assembly and chemotaxis pathways are well-known to be important for bacterial motility [37,38], functioning together to guide bacteria's direction of movement. The type III secretion system is known to share many protein structure similarities with the flagellar assembly system in structure, function, and gene sequence [39,40]. Consequently, it is also shown to be significantly correlated with bacterial motility.
These case studies demonstrate that our high-throughput automated method for mapping phenotypes to pathways and GO concepts can faithfully recapitulate known knowledge. In addition, the method has the potential to predict new correlations between phenotypes and biological systems represented in GO as shown in the complete result datasets at http://phenos.bsd.uchicago.edu/prok_phenotype. While previous correlations studies had been completed on only four phenotypes [5,6], we present an additional 38 phenotype-to-GO correlations. We propose that this method potentially enables a systems-biology approach to analyze genomic datasets by providing a systematic view of the molecular mechanisms beneath phenotypes across different classifications of genes (protein families, pathways, molecular functions, and biological processes). In future studies, we intend to further explore the meaning of directionality of correlations between molecular mechanisms and phenotypes. Indeed, three types of significant correlations can be observed using the hypergeometric distribution: either the observed molecular mechanism is (i) disproportionably associated to a phenotype, or (ii) vice versa, or (ii) both are disproportionably associated to one another. Phenomic Analysis and Visualization of Combined Genomic Information across Multiple Biological Scales The results described above systematically provide significant correlations between classes of genes (protein families, pathways, molecular function, and biological processes) and prokaryotic phenotypes. To investigate how information from these classes of genes interacts together on groups of phenotypes, we conducted a cross-phenotype comparison using their correlations to genetic contents. This analysis is anchored on our previously described correlations between prokaryotic phenotypes and Pfam families. All the phenotypes were clustered using a hierarchical average-linkage method based on their correlation scores with Pfam families. Figure 4
In the second cluster (highlighted in the blue boxes in Figure 4 The third cluster of phenotypes within the green boxes contains six phenotypes related to the catabolism of carbohydrates clustered in the following order: Glucose fermenter (fermentation in a glucose medium), Maltose (production of acid in a medium containing maltose), Facultative anaerobic, Glycerol (production of acid in a medium containing glycerol), Trehalose (production of acid in a medium containing trehalose), and D-mannose (production of acid in a medium containing D-mannose). Every one of these phenotypes is also related to glycolysis [43]. We illustrated this cluster of phenotypes with their significantly correlated Pfam families, GO concepts, and KEGG pathways in detail (shown as a multiscale network in the Figure 5
Conclusion In this study, we developed a high throughput computational approach capable of automatically integrating clinical microbiological laboratory characterizations of bacterial phenotypes with various genomic databases spanning multiple scales of molecular biology (protein domains, pathways, molecular function, and cellular processes). To our knowledge, this is the first study demonstrating the feasibility of integrating a large quantity of prokaryotic phenotypes together with genomic datasets from various sources for large-scale data mining. Furthermore, in contrast to previous predictive studies aimed at building large-scale phenotype–genotype networks, we have thoroughly elucidated systems properties involving multiple scales of molecular mechanisms underlying prokaryotic phenotypes. More specifically, we were able to achieve three objectives. First, we predicted and stratified previously unidentified and uncharacterized correlations (both positive and anti- correlations) between protein domain families (Pfam) and bacterial phenotypes using a comprehensive statistical data-mining and visualization method. Our evaluations attest that we faithfully recapitulated known biological knowledge between prokaryotic phenotypes and their molecular underpinnings, demonstrating the validity of our approach to integrate and analyze clinical and genomic datasets. Second, phenotypic information was correlated to additional biological scales such as cellular processes (GO), molecular functions (GO), and molecular pathways (KEGG). Third, the convergence of relationships in the phenomic visualization illustrates the nexus of specific biological systems shared within clusters of related phenotypes (metaphenotypes). This novel phenomic visualization analysis provides insight into the modular nature of common molecular mechanisms spanning multiple biological scales and reused by related phenotypes. We propose that this method, elucidating the relationship between classes of molecular mechanisms and their association with phenotypes or metaphenotypes, holds promise in facilitating a systems biology approach to genomic and biomedical research. Materials and Methods Datasets. In this study, we used the following six datasets. 1) Global Infectious Diseases and Epidemiology Network database (http://www.cyinfo.com) [15,16]. It contains an MKD, which contains 100 phenotypic microbiology laboratory results for more than 1,000 microorganism species (92 laboratory tests that contain test results in our 59 selected species were used in this study). The lack of data for some species laboratory tests in the MKD indicates that this knowledge is not useful in clinical bacteriology since MKD has been designed to satisfy the needs of clinical bacteriologists. It does not indicate that the knowledge does not exist elsewhere in the literature. We extracted the MKD data in December 2004. 2) Pfam dataset (release version 16, downloaded in April 2005) [17], of which the Pfam-A classifications were used in this study. 3) KEGG pathway data (KEGG Ontology file (KO), release version 31, downloaded in August 2004) [20,45]. 4) Gene Ontology Annotation (GOA) (downloaded in August 2005) [21,22]. 5) Pfam-GO mapping data, which is maintained by the Gene Ontology Consortium, (downloaded in August 2005 from http://www.geneontology.org/external2go/). 6) COGs data (downloaded in December 2004 from http://www.ncbi.nlm.nih.gov/COG/new/) [19]. Data integration. The laboratory results in the MKD are collected for bacterial species, which are primarily used for identifying bacterial strains for medical diagnostics. The MKD rarely has distinct annotations below the taxonomic level of the species according to the NCBI taxonomy. However, bacterial genomes are generally sequenced and annotated at the subspecies or strain levels according to the NCBI taxonomy. A complete list of fully sequenced prokaryotes, many of which have taxonomic annotation (NCBI Taxonomy ID) as no rank at present, was obtained from the NCBI (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/). To map them, we first identified species taxons for the fully sequenced bacteria using the taxonomy tree from NCBI, and then mapped them to the bacteria species in the MKD through computational terminology mapping of text strings [46], followed by manual examination. As a result, we examined nearly 200 bacteria species that have complete genome sequences, and mapped 70 to the bacteria species in the MKD. In the case of a bacterial species having more than one strain with genome sequences, such as B. anthracis, all the strains were considered as one species and their genomic data were merged in a lossless way. Of the mapped species, we merged the data within each species group (refer to Figure 1 Correlating the MKD's clinical diagnostic laboratory data of bacterial phenotypes with functional genomic data. To investigate whether there are correlations between the clinical diagnostic laboratory results (phenotypes) and the genomic data for bacteria species, we explored the functional classifications of genes. Based on the hypothesis that the existence of a family of genes (or the coexistence of families of genes) is responsible for a phenotype and leads to certain expressed phenotypes under controlled laboratory conditions, we calculated the probability of co-occurrence (by random chance) between a phenotypic laboratory result and presence of a certain cluster (family) of genes across species to uncover such correlations, according to the hypergeometric distribution shown and described below [5].
The hypergeometric distribution takes into account the frequencies of species within a specific Pfam to a specific phenotype association and compares it with reference frequencies of species in the entire dataset for (i) the chosen Pfam and (ii) the chosen phenotype, independently of one another. It then calculates the probability (p-value) of obtaining these frequencies by chance assuming that the species are randomly distributed across phenotypes and Pfam. A p-value smaller than 5%, when corrected for multiple comparisons, indicates that the observed frequency of species sharing a specific Pfam and phenotypes are unlikely to have occurred by chance alone. In our study, there are 59 common species that have diagnostic laboratory results in the MKD and fully sequenced genomes. For instance, the MKD dataset contains 31 (n) positive species in the phenotypic class Gram-negative out of 56 species (N) for which there are some results for that laboratory data (there is some missing data in MKD because they are not relevant for microbiological characterizations); and the Lipid-A disaccharide synthetase family (Pfam ID: PF2684) has its member domains distributed in 25 (M) species. The number of common species between Gram-negative and PF2684 is 24 (m). The resulting p-value for calculating this co-occurrence distribution by random chance according to the above hypergeometric distribution expression is 1.2 × 10−8. The above-mentioned relationship could have two possible types of correlations: 1) a correlation, referring to a positive laboratory result correlated with the existence of a Pfam family; 2) an anti-correlation, referring to a positive lab result that correlates with the absence of a Pfam family. We believe that both correlation types could be equally important for inferring gene functions. To distinguish the two types of correlation, we used the mean value (μ) of hypergeometric distribution (shown below) as a reference.
As illustrated in Figure 2 To control for multiple comparisons, we applied two methods to identify and stratify significant correlations and anti-correlations: 1) the conservative Bonferroni-type method known as the Šidák single-step adjusted p-value for multiple comparisons [26], and 2) the calculation of error rates using a less conservative data-mining algorithm allowing finding correlation with p-value < 0.05. Controlling for multiple comparison with a Bonferroni-type method. The Šidák adjustment for a posteriori comparisons, that was used to maintain an experiment-wide error rate of less than 5%, is calculated according to the following equation,
where α′ and α represent the corrected and uncorrected p-values, respectively, and k represents the number of independent tests. However, since the laboratory dataset contains missing values for some species in different tests, applying the Šidák adjustment for multiple comparisons could be overly conservative or biased toward the laboratory tests with more data. Controlling for multiple comparison with a simulation method. Therefore, to stratify our results with a less conservative method which can predict more correlations, albeit with a higher error rate, we also applied a simulation method to the datasets. Using established statistical resampling principles [48], we created random datasets for a control experiment by generating 1,000 random distributions for each combination of the laboratory results and Pfam families (keeping the total number of occurrences of each lab and Pfam constant in the datasets, while randomizing their distributions in the species—permutation resampling without replacement). For each random distribution, we then calculated the number of statistically correlated laboratory results and Pfam families from these random datasets using the previously described hypergeometric method, with different cutoffs (uncorrected p-value <= 0.05, 0.01, 0.005, 0.002, 0.001, 0.0005, and 0.0001). Rather than controlling for multiple comparisons with a statistical test, we used the statistically significant results from the random datasets to predict the number of false positive errors that we should expect in the real dataset when analyzed under the same conditions and subjected to uncorrected multiple comparisons. Since each of the 1,000 random datasets provides a slightly different interpretation using the hypergeometric statistic, we chose a threshold for the calculated hypergeometric statistics that would be observed as the worst case 99% of the time (i.e., 99% confidence). A distribution of the number of errors has been generated for each cutoff, and the numbers that are greater than 99% of the total numbers were selected as references for confidence levels. Evaluating the results. A manual examination was conducted on the predicted results of correlated phenotypes and Pfam families using the two methods. For the Šidák corrected result, we examined the correlated Pfam families for all phenotypes and summarized the results for each phenotype (Table S2). To estimate the false positive rate, we randomly selected 50 predicted phenotype–Pfam correlations from the result of the simulation method, whose Pfam families have biological annotation (i.e., Pfam families annotated as domain with unknown function are not included in this evaluation). Correlations with literature supports were identified as correct predictions from the random set. The false negative rates were also estimated by evaluating a random selection of 50 phenotype–Pfam correlations from all possible combinations between phenotypes and Pfam excluding the significant correlations predicted by the simulated method. Correlating MKD's laboratory data with KEGG's molecular pathways and GO concepts. In a previous study, we calculated correlations between COGs and phenotypes using the hypergeometric and Bonferroni-type methods [47]. In the current study, we also applied the previously described data-mining method to generate a less conservative estimate of phenotypes related to COGs (phenotype–COG dataset), which we have used as intermediary results to compare KEGG's pathways to phenotypes. We also applied the previously described hypergeometric function and Šidák adjustment (Equations 1 and 3) for a posteriori comparisons to identify significant correlations between phenotypes, and either KEGG's molecular pathways or phenotypes and GO concepts. To correlate phenotypes and KEGG molecular pathways, we integrated the correlation of COGs and pathway data from the KEGG ontology file and assigned the following numbers to the hypergeometric function: 1) the number of COGs families in the KEGG ontology file (N); 2) the number of correlated COG families for each microbial phenotype from the phenotype–COG dataset (n); 3) the number of unique COGs families in each pathway that are also used in this analysis (M); 4) the number of common COGs between 2) and 3) (m). To further identify significant correlations between phenotypes and GO concepts, we used a GO term finder software [49] to correlate phenotypes with GO using the Pfam to GO mapping data from the Gene Ontology Consortium. The GO term finder, designed for correlating genes to GO, also exploits the hypergeometric distribution function for identification of significant correlations and provides Šidák-adjusted p-values. A set of common Pfam families between the two datasets (Pfam–phenotype and Pfam–GO) was retrieved. Relevant subsets of these two datasets were generated for this study, and subsequently used in calculating phenotypes and GO correlations. The availability of data resources at the time of this study limited our method. Though we first thought to map KEGG through Pfam families, we could not find reliable resources that provide a mapping between them. However, we found a good resource for mapping KEGG and COGs and therefore used it for the study as a convenient alternative. Hierarchical clustering of Pfam families and phenotypes for phenomics analysis. We conducted hierarchical clustering using unweighted average linkage and Euclidean distance of all the phenotypes and Pfam families using normalized correlation p-values [50]. The Euclidean distance is defined as:
Table S1: Comparison of Pfam and E. coli Genes Significantly Correlated to Flagellar-Mediated Motility The top 30 E. coli genes identified by a previous study [5] that correlated with flagellar-mediated motility are compared with their corresponding Pfam families identified in this study. The significance of the correlations defined by the two studies, including uncorrected p-values, and p-values adjusted according to the Šidák and the resampling methods, are shown. (85 KB DOC) Click here for additional data file.(85K, doc) Table S2: Evaluation of Phenotypes with Their Significantly Correlated Pfam Families by the Šidák Adjustment Method We manually evaluated 21 phenotypes that have significantly correlated Pfam families by the Šidák adjustment method. Descriptions of the phenotypes are provided with summary and references. (298 KB DOC) Click here for additional data file.(299K, doc) Table S3: Manual Evaluation of a Random Sample of Correlations and Anti-Correlations of Phenotypes and Pfam Families Fifty positive correlations of phenotype–Pfam were randomly selected from the 3,711 significant correlations by the simulation method. Manual examination indicated that 15 of them have strong literature support (provided), suggesting that they are true positives. (187 KB DOC) Click here for additional data file.(187K, doc) Table S4: Manual Evaluation of Randomly Selected Correlations, Which Were Statistically Insignificant, to Estimate False Negative Rates A random selection of 50 phenotype–Pfam correlations from all possible combinations between phenotypes and Pfam excluding the significant correlations predicted by the simulated method was evaluated to estimate false negative rate. (112 KB DOC) Click here for additional data file.(112K, doc) Table S5: Manual Evaluation of Every Statistically Significant Correlation and Anti-Correlation between Phenotypes and KEGG Pathways (68 KB DOC) Click here for additional data file.(69K, doc) Table S6: Manual Evaluation of 50 Randomly Selected Significant Correlations of Phenotype and GO A random selection of 50 phenotype–GO significant correlations was evaluated. (185 KB DOC) Click here for additional data file.(186K, doc) Acknowledgments We thank H. Rosie Xing and Tara Borlawsky for their critical comments. Abbreviations
Footnotes Competing interests. The authors have declared that no competing interests exist. A previous version of this article appeared as an Early Online Release on October 10, 2006 (doi:10.1371/journal.pcbi.0020159.eor). Author contributions. YL, JL, MG, and YAL conceived and designed the experiments. YL, JL, and YAL performed the experiments. YL, JL, LS, and YAL analyzed the data. YL, JL, CSG, and YAL contributed reagents/materials/analysis tools. YL, LS, MG, and YAL wrote the paper. Funding. We acknowledge the support of the following grants: US National Institutes of Health (NIH)/National Library of Medicine 1K22 LM008308 (Semantic Approaches to Phenotypic Database Analysis), and NIH/National Cancer Institute 1U54CA121852-01A1 (National Center for the Multiscale Analysis of Genomic and Cellular Networks (MAGNet), NIH/National Institute for Allergies and Infectious Diseases 5U54 AI057158–02, and the Ruth L. Kirschtein NIH Postdoctoral fellowship. References
|
PubMed related articles
Your browsing activity is empty. Activity recording is turned off. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||
J Mol Biol. 1998 Nov 6; 283(4):707-25.
[J Mol Biol. 1998]BMC Bioinformatics. 2006 Mar 23; 7():168.
[BMC Bioinformatics. 2006]Genome Res. 2004 Jan; 14(1):109-15.
[Genome Res. 2004]Trends Genet. 2003 Apr; 19(4):172-6.
[Trends Genet. 2003]Nucleic Acids Res. 2000 Jan 1; 28(1):123-5.
[Nucleic Acids Res. 2000]Bioinformatics. 2005 Feb 1; 21(3):293-306.
[Bioinformatics. 2005]Nucleic Acids Res. 2005 Jan 1; 33(Database issue):D471-5.
[Nucleic Acids Res. 2005]Nucleic Acids Res. 2005 Jan 1; 33(Database issue):D485-91.
[Nucleic Acids Res. 2005]Neuroinformatics. 2003; 1(4):299-308.
[Neuroinformatics. 2003]Nat Biotechnol. 2006 Jan; 24(1):55-62.
[Nat Biotechnol. 2006]Nat Biotechnol. 2006 May; 24(5):537-44.
[Nat Biotechnol. 2006]Clin Infect Dis. 2005 Jan 1; 40(1):123-6.
[Clin Infect Dis. 2005]Int J Health Geogr. 2005 Apr 22; 4(1):10.
[Int J Health Geogr. 2005]Nucleic Acids Res. 2004 Jan 1; 32(Database issue):D138-41.
[Nucleic Acids Res. 2004]Science. 1997 Oct 24; 278(5338):631-7.
[Science. 1997]BMC Bioinformatics. 2003 Sep 11; 4():41.
[BMC Bioinformatics. 2003]Nucleic Acids Res. 2001 Jan 1; 29(1):123-5.
[Nucleic Acids Res. 2001]Science. 1997 Oct 24; 278(5338):631-7.
[Science. 1997]BMC Bioinformatics. 2003 Sep 11; 4():41.
[BMC Bioinformatics. 2003]Nucleic Acids Res. 2004 Jan 1; 32(Database issue):D138-41.
[Nucleic Acids Res. 2004]Nature. 1999 Oct 28; 401(6756):877-84.
[Nature. 1999]Genome Res. 2005 Jul; 15(7):954-9.
[Genome Res. 2005]Genome Res. 2004 Jan; 14(1):109-15.
[Genome Res. 2004]Proc Natl Acad Sci U S A. 2003 Aug 5; 100(16):9440-5.
[Proc Natl Acad Sci U S A. 2003]Genome Res. 2004 Jan; 14(1):109-15.
[Genome Res. 2004]Nucleic Acids Res. 1999 Jan 1; 27(1):254-6.
[Nucleic Acids Res. 1999]Nucleic Acids Res. 2003 Jan 1; 31(1):452-5.
[Nucleic Acids Res. 2003]Nucleic Acids Res. 2001 Jan 1; 29(1):55-7.
[Nucleic Acids Res. 2001]Genome Res. 2001 Oct; 11(10):1766-79.
[Genome Res. 2001]Acta Crystallogr D Biol Crystallogr. 1998 Nov 1; 54(Pt 6 Pt 1):1078-84.
[Acta Crystallogr D Biol Crystallogr. 1998]Carbohydr Res. 2003 Nov 14; 338(23):2521-9.
[Carbohydr Res. 2003]Microbiology. 1999 Aug; 145 ( Pt 8)():1817-30.
[Microbiology. 1999]Genome Res. 2004 Jan; 14(1):109-15.
[Genome Res. 2004]Nucleic Acids Res. 2000 Jan 1; 28(1):56-9.
[Nucleic Acids Res. 2000]Nucleic Acids Res. 2002 Jan 1; 30(1):59-61.
[Nucleic Acids Res. 2002]J Bacteriol. 1992 Jul; 174(13):4338-49.
[J Bacteriol. 1992]Proc Natl Acad Sci U S A. 2003 Mar 18; 100(6):3027-30.
[Proc Natl Acad Sci U S A. 2003]Genome Res. 2004 Jan; 14(1):109-15.
[Genome Res. 2004]Trends Genet. 2003 Apr; 19(4):172-6.
[Trends Genet. 2003]J Neurosurg. 2005 May; 102(5):915-7.
[J Neurosurg. 2005]Acta Otolaryngol. 1996 Nov; 116(6):836-9.
[Acta Otolaryngol. 1996]Biosci Biotechnol Biochem. 1999 Oct; 63(10):1843-6.
[Biosci Biotechnol Biochem. 1999]Clin Infect Dis. 2005 Jan 1; 40(1):123-6.
[Clin Infect Dis. 2005]Int J Health Geogr. 2005 Apr 22; 4(1):10.
[Int J Health Geogr. 2005]Nucleic Acids Res. 2004 Jan 1; 32(Database issue):D138-41.
[Nucleic Acids Res. 2004]Nucleic Acids Res. 1999 Jan 1; 27(1):29-34.
[Nucleic Acids Res. 1999]Nucleic Acids Res. 2004 Jan 1; 32(Database issue):D277-80.
[Nucleic Acids Res. 2004]Genome Res. 2004 Jan; 14(1):109-15.
[Genome Res. 2004]Bioinformatics. 2004 Dec 12; 20(18):3710-5.
[Bioinformatics. 2004]Genome Res. 2004 Jan; 14(1):109-15.
[Genome Res. 2004]