Copyright This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose.

Discovering Biological Guilds through Topological Abstraction

1 Division of Health Science and Technology, Massachusetts Institute of Technology/Harvard University, Cambridge, MA.
2 Children’s Hospital Informatics Program at the Harvard-MIT Division of Health Sciences and Technology, Boston, MA.
3 Harvard Medical School-Partners Center for Genetics and Genomics, Boston, MA

Abstract

High-throughput generation of new types of relational biomedical datasets is creating a demand for methods that provide insight into their complexity. Such networks are often too large to interpret visually and too complicated to be explained solely on the basis of local topological properties. One way to make sense of such complex networks is to transform them into discernible abstracts, or summaries, of the original networks, so that important components become more readily visible. This work presents such an approach for understanding networks via abstraction of global network connectivity using compression. The approach made possible the discovery of a new type of topological class, referred to herein as a guild, that captures similarity in global connectivity. Lastly, the correspondence of these guilds to biological function is validated via an E. coli gene regulation network. The validation yielded biological findings that could not be derived from the local topology of the original network.

Introduction

High-throughput experiments across medicine, proteomics, and genomics are producing datasets that are too large to be analyzed manually. In addition, many of these new technologies are yielding new relationship-based datasets. For example, yeast two-hybrid technology is generating new organism-level interactome-based networks 1, 2, while microarrays and other technologies have been used in automated ways to generate transcription regulation networks 3, 4.
In medicine, new methods are being used to explore medical literature networks to study everything from evidence-based medicine to interactions between research groups 5, 6. Others have been working on visualizing subsets of such medical informatics networks 7. In such networks, nodes can represent genes, proteins, or medical information. Edges encode control, interaction, or other relationships. However, the underlying data and models used often contain noise, redundancy, and artifacts. In addition, the resulting networks are often so large that it can be hard to visualize and understand the relationships (see Figure 1).
Just as signal processing and statistical techniques have been used on traditional biological datasets, so too are methodologies needed to automatically discern patterns in these new types of networks. Current approaches to the analysis of these graphical structures rest on the identification of local features of the network, such as hubs (nodes with many neighbors), modules (protein interactions from a specific biological process), or motifs (commonly recurring patterns of interconnections) 8–10. Others have sought to capture network connectivity in a few summary statistical measures 11. These notions have been invaluable in providing biological insights. The findings have ranged from linking centrality and lethality in protein interaction networks 8 to linking connectivity with evolution rates 12. However, such local topological structures do not naturally lend themselves to the development of compression algorithms that could make large networks easier to visualize and explore (by creating an abstract form of the original network).

This work seeks to expose the underlying meaning of biomedical networks using a global approach. The intuition is that new patterns can be found by taking into account distances between all nodes, rather than simply looking at local topology. Others have started to take note of the importance of looking beyond local connectivity. For example, one recent work looked at inter-module connections 13. This yielded novel findings, but required obtaining values for variables and parameters beyond topology (e.g. for flux balance analysis) that may not be available. Such approaches also do not lend themselves readily to network abstraction via compression. Here, a network transformation is derived in a compression-based process that yields an abstract of the network’s global connectivity. New topologic classes are found within such an abstract network.
This work shows how novel biological knowledge that is not discernible within the original network via local topological methods can be garnered from such abstractions. The abstraction process effectively transforms related global connectivity patterns into local ones, where they can be easily visualized and analyzed. This results in subgraphs with highly connected nodes. These new topological classes share many characteristics with classical guilds. For thousands of years, artisans in shared professions have formed guilds. Members of guilds could work independently or together. Yet, since members within an artisans’ guild shared similar trades, they often interacted with other members of society in similar ways. They would often share the same type of clients, social status, and social network. If, for example, one were to calculate the shortest path of social connections from each guild member to a town mayor (i.e. a local hub in the biomolecular network), one may expect those paths to be very similar. Likewise, genes in a ‘guild’ share similar relationships in terms of global connectivity to other nodes. Thus, even though two genes may not be directly linked in the original network, the abstract network allows for visual and analytical confirmation of similar global connectivity profiles.

Methods

Graphs help to capture the notion of biomedical networks across several domains. Here, the E. coli regulatory network, as derived from EcoCyc 14, is used as a canonical biological network. Nodes represent the E. coli genes and edges imply regulatory control. In order to capture global topologic properties, distances between all 953 gene nodes were calculated via Dijkstra’s algorithm in an undirected version of the network. The largest connected component (851 gene nodes), referred to here as the global topologic profile, was then examined in detail. In order to determine the dimensionality of the data, a scree plot 15 was calculated.
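The distance step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes every regulatory edge has unit weight, it treats each directed edge as undirected, and the function names (`build_undirected`, `dijkstra`, `global_topologic_profile`) and the toy edge list are hypothetical.

```python
import heapq


def build_undirected(edges):
    """Adjacency sets for an undirected, unit-weight view of a directed edge list."""
    adj = {}
    for regulator, target in edges:
        adj.setdefault(regulator, set())
        adj.setdefault(target, set())
        if regulator != target:  # self-regulation adds no path length
            adj[regulator].add(target)
            adj[target].add(regulator)
    return adj


def dijkstra(adj, source):
    """Shortest-path length from source to every reachable node (unit weights assumed)."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry
        for neighbor in adj[node]:
            candidate = d + 1
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


def global_topologic_profile(edges):
    """All-pairs distance matrix restricted to the largest connected component."""
    adj = build_undirected(edges)
    all_dist = {gene: dijkstra(adj, gene) for gene in adj}
    # The set of nodes reachable from any node is exactly its connected component.
    largest = max((set(d) for d in all_dist.values()), key=len)
    genes = sorted(largest)
    matrix = [[all_dist[a][b] for b in genes] for a in genes]
    return genes, matrix


# Toy edge list with made-up node names (not EcoCyc data).
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")]
genes, profile = global_topologic_profile(toy_edges)
```

With unit weights, a breadth-first search would give the same distances; Dijkstra is used here only to mirror the text.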
Based on the knee of the curve in the scree plot, only the first five principal components were used to project, and thereby topologically abstract, the global topologic profile. Several quantitative metrics were used to evaluate the topologic profile after discretization, ranging from root-mean-square error to the percentage of distances and nodes affected by the transformation. In order to examine the biologic information encoded, the topologic profile’s adjacency matrix was used to construct a new visual representation of the network. This abstract network (see Figure 2)
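The projection and evaluation steps can be sketched as below, under stated assumptions. The source does not say how the principal components were computed or how the profile was discretized, so this sketch uses power iteration with deflation as a dependency-free stand-in for a library PCA routine and assumes discretization means rounding to the nearest integer. All function names and the parameter `k` (number of components kept; five in the text) are hypothetical, and pure Python like this is only practical for small matrices.

```python
import math
import random


def center_columns(X):
    """Subtract each column mean; return (centered rows, column means)."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X], means


def top_components(Xc, k, iters=500, seed=0):
    """Up to k leading eigenvectors of Xc^T Xc via power iteration with deflation."""
    rng = random.Random(seed)
    d = len(Xc[0])
    # Scatter matrix, proportional to the covariance matrix.
    S = [[sum(r[i] * r[j] for r in Xc) for j in range(d)] for i in range(d)]
    floor = 1e-10 * sum(S[i][i] for i in range(d))  # relative to total variance
    comps = []
    for _ in range(min(k, d)):
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iters):
            w = [sum(s * x for s, x in zip(row, v)) for row in S]
            for c in comps:  # remove directions already found
                dot = sum(a * b for a, b in zip(w, c))
                w = [a - dot * b for a, b in zip(w, c)]
            norm = math.sqrt(sum(a * a for a in w))
            if norm <= floor:
                return comps  # no variance left to explain
            v = [a / norm for a in w]
        comps.append(v)
    return comps


def project_and_reconstruct(X, k):
    """Project rows of X onto the top-k components and map back to distance space."""
    Xc, means = center_columns(X)
    comps = top_components(Xc, k)
    recon = []
    for row in Xc:
        scores = [sum(a * b for a, b in zip(row, c)) for c in comps]
        recon.append([m + sum(s * c[j] for s, c in zip(scores, comps))
                      for j, m in enumerate(means)])
    return recon


def rmse(A, B):
    """Root-mean-square error over all matrix entries."""
    diffs = [(a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb)]
    return math.sqrt(sum(diffs) / len(diffs))


def fraction_changed(original, abstract):
    """Share of distances whose rounded abstract value differs from the original."""
    pairs = [(a, b) for ra, rb in zip(original, abstract) for a, b in zip(ra, rb)]
    return sum(round(b) != a for a, b in pairs) / len(pairs)
```

A call such as `project_and_reconstruct(profile, 5)` would produce the abstract profile for a distance matrix `profile`, which `rmse` and `fraction_changed` then compare against the original.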
Subnetworks (i.e. guilds in the abstract network) can readily be discerned visually (see Figure 3).
Robustness of the topological abstraction-based approach was examined as well. To study the impact of the reduction in dimensionality on abstraction, the number of principal components used in transforming the original network (inversely proportional to the level of abstraction) was varied. The Spearman correlation coefficient was then calculated to determine the gene distance correlation between the abstract and original topological profile matrices.

Results

Figure 2

Besides the visual clarity of hubs, another feature of the abstract network is the existence of decentralized, highly connected nodes. Investigation of these subnetworks suggests that the embedded nodes are guilds that share similar functional relationships. Further quantitative analysis of the nine discovered guilds was done by calculating GO enrichment as described in the methods section (see Figure 3).

It is noteworthy that none of the genes investigated in the fimbriae-enriched local group were directly linked to one another by an edge in the gene regulation network. However, since they had similar global topologic profiles, the abstraction of the network led them to be linked together so as to reduce the variance across the nodes’ inter-node distances. Mapping these nodes back to the original network revealed that all of them had a regulatory control edge linking them to the ihf and lrp gene nodes. Thus, this approach may be useful in determining joint regulatory patterns. It may also be used to select pharmaceutical targets from different global profiles, with the aim of ensuring synergistic drug activity.

Six of the other guilds (see Figure 3)

Two guilds were validated manually via Entrez Protein database annotation (due to sparse GO-based information). One included an abundance of cytochrome-related proteins.
Lastly, annotation revealed that all of the genes in the final guild were involved in protein transport, and virtually all of these could be specifically linked to phosphate transport. All guilds had novel links compared to the original network. In fact, in four of the functionally enriched guilds, none of the genes directly connected in the abstract network were linked to each other in the original network. This new local topology in the abstract network conveyed functional relationships not evident in the local connectivity of the original network. It is noteworthy that this paper’s findings were made possible using information derived solely from the topology of a given network, rather than necessitating additional measurements (e.g. gene expression, protein levels) of the actual node values.

The abstract global topologic profile also provided a number of insights on the transformed gene regulation network. A scatter plot of the abstract versus original gene distances is shown in Figure 4.
The robustness of the topological abstraction approach was examined next. The impact of dimensionality reduction on the gene distance correlation between the abstract and original profiles is depicted in Figure 5.
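The correlation used in this robustness check can be sketched as follows. This is an illustrative implementation of the Spearman coefficient (Pearson correlation of average ranks, which handles ties), not the authors' code. Comparing only the upper triangle of each symmetric distance matrix is an assumption, and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation computed as the Pearson correlation of ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def upper_triangle(M):
    """Entries above the diagonal of a square matrix, row by row."""
    return [M[i][j] for i in range(len(M)) for j in range(i + 1, len(M))]


def distance_correlation(original, abstract):
    """Spearman correlation between pairwise distances of two profiles."""
    return spearman(upper_triangle(original), upper_triangle(abstract))
```

Repeating `distance_correlation` for abstract profiles built with different numbers of principal components would trace out a curve of the kind the text attributes to Figure 5.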
Discussion and Conclusion

This paper presents an automated method for analyzing large networks using a global perspective. The work describes how a novel topological class, or guild, can be derived by transforming a network via abstraction into a representation more amenable to visualization and analysis of the original network’s global properties. By mapping highly connected groups within a network’s abstract form back to the original network, it may be possible to find patterns of regulation and control that are not readily visible in the original network, as was done here. There are a variety of applications, from biomolecular visualization to pharmaceutical drug target selection. Future work is ongoing with regard to investigating and validating guilds in other biomedical networks, including interactomes, PubMed citation links, and metabolic networks.

References

1. Uetz P, Giot L, Cagney G, et al. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature. 2000 Feb 10;403(6770):623–627.
2. Rual JF, Venkatesan K, Hao T, et al. Towards a proteome-scale map of the human protein-protein interaction network. Nature. 2005 Oct 20;437(7062):1173–1178.
3. Yu T, Li KC. Inference of transcriptional regulatory network by two-stage constrained space factor analysis. Bioinformatics. 2005 Nov 1;21(21):4033–4038.
4. Haverty PM, Frith MC, Weng Z. CARRIE web service: automated transcriptional regulatory network inference and interactive analysis. Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W213–216.
5. Aphinyanaphongs Y, Tsamardinos I, Statnikov A, Hardin D, Aliferis CF. Text categorization models for high-quality article retrieval in internal medicine. J Am Med Inform Assoc. 2005 Mar–Apr;12(2):207–216.
6. Guimera R, Uzzi B, Spiro J, Amaral LA. Team assembly mechanisms determine collaboration network structure and team performance. Science. 2005 Apr 29;308(5722):697–702.
7. Douglas SM, Montelione GT, Gerstein M. PubNet: a flexible system for visualizing literature derived networks. Genome Biol. 2005;6(9):R80.
8. Jeong H, Mason SP, Barabasi AL, Oltvai ZN. Lethality and centrality in protein networks. Nature. 2001 May 3;411(6833):41–42.
9. Han JD, Bertin N, Hao T, et al. Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature. 2004 Jul 1;430(6995):88–93.
10. Shen-Orr SS, Milo R, Mangan S, Alon U. Network motifs in the transcriptional regulation network of Escherichia coli. Nat Genet. 2002 May;31(1):64–68.
11. Yu H, Zhu X, Greenbaum D, Karro J, Gerstein M. TopNet: a tool for comparing biological subnetworks, correlating protein properties with topological statistics. Nucleic Acids Res. 2004;32(1):328–337.
12. Fraser HB, Hirsh AE, Steinmetz LM, Scharfe C, Feldman MW. Evolutionary rate in the protein interaction network. Science. 2002 Apr 26;296(5568):750–752.
13. Segre D, Deluna A, Church GM, Kishony R. Modular epistasis in yeast metabolism. Nat Genet. 2005 Jan;37(1):77–83.
14. Keseler IM, Collado-Vides J, Gama-Castro S, et al. EcoCyc: a comprehensive database resource for Escherichia coli. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D334–337.
15. Jolliffe I. Principal Component Analysis. 2nd ed. New York, NY: Springer-Verlag; 2002.
16. Watts DJ, Strogatz SH. Collective dynamics of ‘small-world’ networks. Nature. 1998 Jun 4;393(6684):440–442.
17. Ashburner M, Ball CA, Blake JA, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000 May;25(1):25–29.
18. Bland M. An Introduction to Medical Statistics. 3rd ed. Oxford University Press; 2000.
19. Michalickova K, Bader GD, Dumontier M, et al. SeqHound: biological sequence and structure database as a platform for bioinformatics research. BMC Bioinformatics. 2002 Oct 25;3(1):32.
20. Dennis G, Jr, Sherman BT, Hosack DA, et al. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 2003;4(5):P3.
21. Troyanskaya OG, Dolinski K, Owen AB, Altman RB, Botstein D. A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc Natl Acad Sci U S A. 2003 Jul 8;100(14):8348–8353.