
# Topological structure analysis of the protein–protein interaction network in budding yeast

Yi Zhao^{1}, Lun Cai^{1}, Hong Xue^{2}, Xiaopeng Zhu^{2}, Hongchao Lu^{1}, Jingfen Zhang^{1}, Shiwei Sun^{1}, Lunjiang Ling^{2}, Nan Zhang^{2}, Guojie Li^{1} and Runsheng Chen^{1,2,a}

^{1}Bioinformatics Research Group, Key Laboratory of Intelligent Information Processing, Institute of Computing Technology and

^{2}Bioinformatics Laboratory, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China

^{a}To whom correspondence should be addressed at Bioinformatics Laboratory, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China. Tel: +8610 64888543; Fax: +8610 64871293; Email: crs@sun5.ibp.ac.cn

^{b}Correspondence may also be addressed to Guojie Li. Tel: +8610 62565533; Fax: +8610 62567724; Email: lig@ict.ac.cn

^{b}The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors

## Abstract

Interaction detection methods have led to the discovery of thousands of interactions between proteins, and discerning relevance within large-scale data sets is important to present-day biology. Here, a spectral method derived from graph theory was introduced to uncover hidden topological structures (i.e. quasi-cliques and quasi-bipartites) of complicated protein–protein interaction networks. Our analyses suggest that these hidden topological structures consist of biologically relevant functional groups. This result motivates a new method to predict the function of uncharacterized proteins based on the classification of known proteins within topological structures. Using this spectral analysis method, 48 quasi-cliques and six quasi-bipartites were isolated from a network involving 11855 interactions among 2617 proteins in budding yeast, and 76 uncharacterized proteins were assigned functions.

## INTRODUCTION

With the availability of complete DNA sequence data for many prokaryotic and eukaryotic genomes, a formidable challenge of post-genomic biology is to understand how genetic information results in the concerted action of gene products both temporally and spatially to achieve biological function, as well as how they interact with each other to create an organism. It is important to develop reliable proteome-wide approaches for a better understanding of protein functions (1,2). Genomic approaches have been used to predict functions of a large number of genes based on their sequences. However, proteins rarely act alone at the biochemical level; rather, they interact with other proteins as an assembly to perform particular cellular tasks. Having systematic functions, these assemblies represent more than the sum of their parts (3). Traditionally, protein interactions were studied individually by genetic, biochemical and biophysical techniques focusing on a few proteins at a time (4). It is increasingly recognized that studying the genetic and biochemical circuitry of a cell piece by piece limits our understanding of biological processes as a whole. As basic constituents of cellular protein complexes and pathways, protein–protein interactions are key determinants of protein function. It is believed that essentially all biological processes are carried out through protein–protein interactions.

In the last 3 years, high-throughput interaction detection approaches, such as yeast two-hybrid systems (5,6), protein complex purification techniques using mass spectrometry (3,7), correlated messenger RNA expression profiles (8,9), genetic interaction data (10,11) and ‘*in silico*’ interaction predictions derived from gene context analysis [gene fusion (12,13), gene neighborhood (14,15) and gene co-occurrences or phylogenetic profiles (16,17)], have been developed and they have created a number of datasets regarding protein–protein interactions for several model organisms (*Saccharomyces cerevisiae*, *Caenorhabditis elegans* and *Helicobacter pylori*). These large-scale datasets open a door to comprehensive understanding of the genetic and biochemical phenomena in a cell. Subsequently, several promising methods have been successfully applied to this field. For instance, Schwikowski *et al*. (18) and Hishigaki *et al*. (19) predicted uncharacterized proteins based on interacting partners; Maslov and Sneppen (20) analyzed the stable topological properties of interaction networks; Ge *et al*. (21) provided the first global evidence that genes with similar expression profiles are more likely to encode interacting proteins; and Fraser *et al*. (22) revealed that the connectivity of well-conserved proteins in the network is negatively correlated with their rate of evolution. These studies revealed that the available data from protein–protein interaction networks in *S.cerevisiae* share some unexpected features with other complex networks.

The topological pattern of interactions is a rich source of biological functional information, and therefore we need to develop methods to mine and to understand the interaction networks. Here, we applied the spectral analysis method, which has been successfully used in other fields (23), to proteomics to identify topological structures of interaction networks, i.e. quasi-cliques and quasi-bipartites. Interestingly, we found that the proteins within the same group share similar biological functions. Moreover, for the one-third of proteins that are still uncharacterized in *S.cerevisiae*, this method provides a new approach to predict their functions based on topological structures.

## MATERIALS AND METHODS

### Spectral analysis

Spectral analysis is a powerful tool to reveal high-level structures underlying enormous and complicated relationships. A well-known example is the work of David Gibson, Jon Kleinberg and Prabhakar Raghavan on extracting information from the link structure of the Web (23,24). The World Wide Web is composed of an increasing number of pages with hyperlinks pointing to other pages. Despite the high complexity of the Web structure, spectral analysis was successfully used to discover ‘authoritative’ information sources and ‘hub’ pages joining authoritative ones together.

We applied the spectral analysis method to complicated protein–protein interaction networks and identified interesting topological structures. In this method, a network is represented by a bi-directed graph *G*(*V*,*E*), i.e. the vertex set includes each protein as a vertex, *V* = {*P*_{1}, *P*_{2}, …, *P*_{n}}, and the edge set is *E* = {(*P*_{i}, *P*_{j}) | there is an interaction between proteins *P*_{i} and *P*_{j}}. The symmetric *n* × *n* adjacency matrix is defined as *A* = (*a*_{ij}), where *a*_{ij} = 1 if (*P*_{i}, *P*_{j}) ∈ *E*, and *a*_{ij} = 0 if (*P*_{i}, *P*_{j}) ∉ *E*.
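As a minimal sketch of this definition, the following Python builds the symmetric 0/1 adjacency matrix from a list of interacting pairs. The function name `adjacency_matrix`, the protein labels and the toy interaction list are illustrative assumptions, not data from the study; skipping self-interactions is a further assumption.

```python
def adjacency_matrix(proteins, interactions):
    """Build the symmetric 0/1 adjacency matrix A = (a_ij) of an interaction network."""
    index = {p: i for i, p in enumerate(proteins)}
    n = len(proteins)
    A = [[0] * n for _ in range(n)]
    for p, q in interactions:
        i, j = index[p], index[q]
        if i == j:
            continue  # assumption: self-interactions are ignored
        # An interaction has no direction, so both a_ij and a_ji are set.
        A[i][j] = 1
        A[j][i] = 1
    return A


# Toy network (hypothetical): a triangle P1-P2-P3 with P4 attached to P3.
proteins = ["P1", "P2", "P3", "P4"]
interactions = [("P1", "P2"), ("P2", "P3"), ("P1", "P3"), ("P3", "P4")]
A = adjacency_matrix(proteins, interactions)
```

For a network of the size analyzed here (2617 proteins), a sparse representation would likely be preferable to nested lists; the dense form is used only to mirror the notation.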

The spectrum of the adjacency matrix *A* is a reasonable measure of node properties that can be propagated across the interactions. Consider assigning each node a score *X*_{i} that represents its intensity. A node with a high score would increase its neighbors’ scores through their interactions. In other words, the nodes are mutually reinforcing, which is by nature a cyclic definition of scores:

$$X_i \propto \sum_{j=1}^{n} a_{ij} X_j$$

The iteration method derived from Gibson *et al*. (23) and Kleinberg (24) is introduced to break such a cycle. Interestingly, *X*_{i} converges to a fixed point from almost any initial assignment, and it can be proved that the fixed point is one of the eigenvectors of matrix *A*, which means it is an intrinsic characteristic of the interactions. Moreover, since matrix *A* is symmetric, its eigenvectors can be chosen to be mutually orthogonal, which means that the corresponding properties are also mutually independent. In other words, each eigenvector represents a special property that none of the others could represent.
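The iteration can be sketched as repeated multiplication by *A* followed by rescaling (power iteration). This is an illustrative sketch rather than the authors' implementation: the function name `power_iteration`, the iteration count and the toy matrix are assumptions, and convergence to the leading eigenvector relies on the network being connected and not bipartite and on a start vector with positive entries.

```python
import math


def power_iteration(A, iterations=200):
    """Iterate x <- A x / |A x| and return (eigenvalue estimate, unit vector x)."""
    n = len(A)
    x = [1.0] * n  # assumption: a uniform positive start vector
    for _ in range(iterations):
        # Each node collects the scores of its interaction partners.
        y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    # For a unit vector x, the quadratic form x^T A x estimates the eigenvalue.
    eigenvalue = sum(A[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
    return eigenvalue, x


# Toy network (hypothetical): a triangle P1-P2-P3 with P4 attached to P3.
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
eigenvalue, x = power_iteration(A)
```

In this toy network the largest component of the fixed point belongs to P3, the protein with the most partners. The remaining eigenvectors, which the method also uses, would need a full symmetric eigendecomposition from a numerical library rather than this single-vector iteration.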

### Identification of topological structures

From a topological point of view, the spectrum helps to uncover the hidden topological structures of a complex interaction network. We found that for each eigenvector with a positive eigenvalue, the proteins whose components have larger absolute values tend to form a quasi-clique (i.e. every two of them tend to interact with each other) (Fig. 1a), whereas for each eigenvector with a negative eigenvalue, such proteins tend to form a quasi-bipartite (i.e. two disjoint subsets of proteins with a high level of connectivity between the sets rather than within them) (Fig. 1b).

Figure 1. In a quasi-clique, proteins tend to interact with each other (**a**), while in a quasi-bipartite, proteins between sets have denser interactions than those within sets (**b**).

This observation can be explained as follows. The maximal eigenvalue of an adjacency matrix is the maximal value of

$$Q = \sum_{i,j} a_{ij} x_i x_j \quad \text{subject to} \quad \sum_{i} x_i^2 = 1$$

(where *x*_{i} is the *i*th component of the eigenvector). Other positive eigenvalues can also be described as the maximal value of *Q* under an additional orthogonality condition. Since *Q* is the sum of *x*_{i}*x*_{j} over the edges *v*_{i}*v*_{j}, it is maximal when the nodes with more edges are assigned larger values of the same sign, which intuitively form a quasi-clique. Similarly, quasi-bipartites are obtained from eigenvectors with negative eigenvalues.

We applied the clustering coefficient (CC) (25,26) in our analysis to quantify a quasi-clique’s tendency to form a cluster. The ratio between the number of edges that actually exist between the *N* nodes of a quasi-clique and the total possible number *N*(*N* – 1)/2 gives its CC-value, i.e. *CC* = *E*/[*N*(*N* – 1)/2] × 100%, where *E* is the number of interactions within the clique and *N* is the number of proteins in it. As a fraction, CC lies between 0 and 1 (0–100%); a value close to 1 represents a clique close to a complete graph.
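The CC-value defined above can be computed directly from the interaction partners of each protein. The sketch below returns CC as a fraction between 0 and 1; the function name, the `neighbors` dictionary layout (protein mapped to the set of its partners, without self-interactions) and the toy network are assumptions made for illustration.

```python
def clustering_coefficient(members, neighbors):
    """Return E / [N(N - 1)/2] for the proteins in `members`."""
    members = set(members)
    n = len(members)
    if n < 2:
        raise ValueError("a quasi-clique needs at least two proteins")
    # Summing partners inside the group counts every edge twice.
    edges = sum(len(neighbors.get(p, set()) & members) for p in members) // 2
    return edges / (n * (n - 1) / 2)


# Toy network (hypothetical): a triangle A-B-C with D attached to C.
neighbors = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
cc_triangle = clustering_coefficient(["A", "B", "C"], neighbors)
cc_all = clustering_coefficient(["A", "B", "C", "D"], neighbors)
```

The triangle is a complete graph, so its CC is 1, while adding the weakly attached protein D lowers the value to 4/6.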

### Assignment of annotation and *P*-values to quasi-cliques

As an isolated quasi-clique may involve different functional categories, *P*-values (27,28) are used as criteria to assign each quasi-clique a main function. The hypergeometric distribution was applied to model the probability of observing, by chance, at least *k* proteins from a quasi-clique of size *n* in a category containing *C* proteins from a total genome size of *G* proteins, such that the *P*-value is given by

$$P = 1 - \sum_{i=0}^{k-1} \frac{\binom{C}{i}\binom{G-C}{n-i}}{\binom{G}{n}}$$

The above test measures whether a quasi-clique is enriched with proteins from a particular category more than would be expected by chance. If the *P*-value of a category is near 0, the proteins of the category in a quasi-clique will have a low probability of being chosen by chance. Here, we assigned each quasi-clique the main function with the lowest *P*-value in all categories.
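A minimal sketch of this enrichment test, using the symbols *k*, *n*, *C* and *G* from the text, is given below. It sums the upper tail of the hypergeometric distribution directly, which equals one minus the lower tail. The function names, the category dictionary and the 10-protein toy genome are hypothetical.

```python
from math import comb


def enrichment_p_value(k, n, C, G):
    """Probability of at least k category members in a quasi-clique of size n."""
    if not (0 <= k <= n <= G and 0 <= C <= G):
        raise ValueError("expected 0 <= k <= n <= G and 0 <= C <= G")
    # comb() returns 0 for impossible draws, so no extra bounds checks are needed.
    tail = sum(comb(C, i) * comb(G - C, n - i) for i in range(k, min(n, C) + 1))
    return tail / comb(G, n)


def main_function(clique, categories, genome_size):
    """Return the (category, P-value) pair with the lowest P-value."""
    scored = {
        name: enrichment_p_value(len(clique & members), len(clique),
                                 len(members), genome_size)
        for name, members in categories.items()
    }
    best = min(scored, key=scored.get)
    return best, scored[best]


# Toy data (hypothetical): a 10-protein genome and two categories.
categories = {
    "rRNA processing": {"p0", "p1", "p2", "p3"},
    "metabolism": {"p5", "p6", "p7", "p8", "p9"},
}
clique = {"p0", "p1", "p5"}
best, p = main_function(clique, categories, genome_size=10)
```

Exact integer binomials keep the tail sum accurate even for very small *P*-values; ties between categories are not handled in this sketch.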

## RESULTS

### Data source and analysis

Among the interactions produced by high-throughput methods there could be many false positives. To measure their accuracy and to identify the biases, von Mering *et al*. (4) assessed a total of 80 000 interactions among 5400 yeast proteins reported previously and assigned each interaction a confidence value. In order to reduce the interference by false positives, we focused on 11855 interactions with high and medium confidence among 2617 proteins.

To analyze the interaction dataset, we first applied the spectral method to calculate all eigenvalues and eigenvectors of the adjacency matrix corresponding to the network. The following criteria were then used to generate quasi-cliques from eigenvectors with large positive eigenvalues. (i) All the proteins were sorted by their absolute weight in an eigenvector, and the top 10% were selected. (ii) Every protein must interact with at least 20% of the members. Here, we used the CC-value to measure the degree of interconnectivity between nodes and tuned this parameter to guarantee the quality of the cliques. (iii) A quasi-clique must contain at least 10 proteins. As a result, we obtained 48 quasi-cliques, among which the largest contains 109 proteins (quasi-clique 1 in Table 1) and the smallest contains 10 proteins (quasi-clique 45 in Table 1); on average, a quasi-clique contains 26.6 proteins (a protein may appear in different quasi-cliques). A similar analysis based on eigenvectors with negative eigenvalues produced six quasi-bipartites.
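The three criteria can be sketched for a single eigenvector as follows. The text does not say how criterion (ii) is enforced, so this sketch assumes that proteins linked to fewer than 20% of the other selected members are removed repeatedly until none remain; the function name, parameter names and toy data are likewise hypothetical.

```python
def select_quasi_clique(weights, neighbors, top_fraction=0.10,
                        min_link_fraction=0.20, min_size=10):
    """Apply criteria (i)-(iii) to one eigenvector given as {protein: weight}."""
    # (i) Rank by absolute weight and keep the top fraction.
    ranked = sorted(weights, key=lambda p: abs(weights[p]), reverse=True)
    members = set(ranked[:round(top_fraction * len(ranked))])
    # (ii) Assumed pruning rule: drop weakly linked proteins until stable.
    while True:
        needed = min_link_fraction * (len(members) - 1)
        weak = {p for p in members
                if len(neighbors.get(p, set()) & members) < needed}
        if not weak:
            break
        members -= weak
    # (iii) Discard groups that end up too small.
    return members if len(members) >= min_size else None


# Toy data (hypothetical): 100 proteins, of which p0..p9 carry large
# negative weights and interact with one another.
names = [f"p{i}" for i in range(100)]
weights = {p: (-0.3 if i < 10 else 0.01) for i, p in enumerate(names)}
core = set(names[:10])
neighbors = {p: (core - {p} if p in core else set()) for p in names}
clique = select_quasi_clique(weights, neighbors)
```

Ranking by absolute value matters because the overall sign of an eigenvector is arbitrary; here the ten strongly weighted proteins are selected even though their weights are negative.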

The two topological structures show different interaction patterns. In a quasi-clique, proteins tend to interact with each other (Fig. 1a), while in a quasi-bipartite, proteins between sets have denser interactions than those within sets (Fig. 1b). Identifying these topological structures not only brings order to the representation of the complicated interaction network, but also makes the network more convenient to analyze.

### Annotation of quasi-cliques

For each of the 48 quasi-cliques, we calculated its *P*-value and annotated it based on the hierarchical functional categories of the Munich Information Center for Protein Sequences (MIPS). MIPS allows a protein to appear in more than one category, which was taken into account in the calculation of the *P*-value. As a result, 43 quasi-cliques were annotated with one functional category and the other five quasi-cliques were assigned to a set of functional categories (Table 1; see Supplementary Material for complete data sets).

We investigated the functions of individual proteins in quasi-cliques and found that most of them share common functions, including ribosome biogenesis, rRNA and tRNA synthesis, processing, transcription control and mRNA splicing, etc. (Fig. 2 and Table 1). Only a small fraction of the proteins are uncharacterized or have functions conflicting with the common function of the quasi-clique, as shown in Figure 2. This could be explained either by unavoidable false positive interactions under the current experimental conditions or by the proteins really sharing this kind of function, which has not yet been demonstrated.


To visualize protein interactions and functional annotations, we have developed a software package that, along with the complete set of data generated by our algorithm, is publicly available at http://www.bioinfo.org.cn/PIN/. Using this software, users can view topological structures and find annotations of proteins and their interactions conveniently.

### Functional prediction for uncharacterized proteins in quasi-cliques

The isolated quasi-cliques give a good clue for predicting the functions of the uncharacterized proteins. Among the 2617 proteins in the raw dataset, 555 were uncharacterized according to MIPS hierarchical functional categories (4). For each of the 76 uncharacterized proteins in the 48 quasi-cliques, we assigned a function according to the main function of its hosting quasi-clique. If a protein falls into more than one quasi-clique, the main function of the quasi-clique with the lowest *P*-value was assigned to it. If multiple hosting quasi-cliques have the lowest *P*-value, or a quasi-clique has multiple main functions, a set of functions was assigned to the protein. The 76 unknown proteins and their predicted functions with the corresponding *P*-values are listed in Table 2. There are 43 rRNA processing proteins, seven proteins related to pre-RNA processing, 11 proteins related to ribosome biogenesis, and 15 other proteins related to energy, metabolism, cytoskeleton and transcription regulation (see Table 2 for complete data).
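The assignment rule described above can be sketched as a small function. The record layout (a dictionary per quasi-clique with `members`, `functions` and `p_value`), the protein labels and the *P*-values are hypothetical and serve only to exercise the three cases: a single best hosting quasi-clique, a tie, and a protein outside every quasi-clique.

```python
def predict_functions(cliques, uncharacterized):
    """Map each uncharacterized protein to (set of functions, P-value)."""
    predictions = {}
    for protein in uncharacterized:
        hosts = [c for c in cliques if protein in c["members"]]
        if not hosts:
            continue  # no hosting quasi-clique, so no prediction is made
        best = min(c["p_value"] for c in hosts)
        functions = set()
        for c in hosts:
            # Ties, and quasi-cliques with several main functions,
            # contribute a set of functions.
            if c["p_value"] == best:
                functions |= c["functions"]
        predictions[protein] = (functions, best)
    return predictions


# Toy records (hypothetical quasi-cliques, labels and P-values).
cliques = [
    {"members": {"a", "b", "u1"}, "functions": {"rRNA processing"}, "p_value": 1e-20},
    {"members": {"c", "u1", "u2"}, "functions": {"ribosome biogenesis"}, "p_value": 1e-8},
    {"members": {"d", "u2"}, "functions": {"energy", "metabolism"}, "p_value": 1e-8},
]
predictions = predict_functions(cliques, ["u1", "u2", "u3"])
```

In this toy run `u1` inherits only the function of its lowest-*P* quasi-clique, `u2` receives the union of functions from two tied quasi-cliques, and `u3` receives no prediction.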

We assessed the ability of the *P*-value to annotate and assign functions using the same approach as Wu *et al*. (28). As a control, we created and analyzed random networks with the same interaction distribution as the original network. The results show that among the 48 quasi-cliques of our experimental data, >87.5% were significant in one or more annotation categories at *P* ≤ 0.01/*Nc* (here *Nc* is the number of categories), whereas <2.1% of quasi-cliques identified from random network met the same criteria. This means a substantial fraction of isolated quasi-cliques are likely to be biologically meaningful.
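One way to build such a control is sketched below, assuming that "the same interaction distribution" means every protein keeps its number of interaction partners. The edge-swap procedure, the function names, the swap count and the toy edge list are assumptions and may differ from the randomization actually used; `is_significant` applies the threshold *P* ≤ 0.01/*Nc* from the text.

```python
import random


def rewire(edges, swaps, seed=0):
    """Randomize a network by swapping edge endpoints; degrees are preserved."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done = attempts = 0
    while done < swaps and attempts < 100 * swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) < 4:
            continue  # the swap would create a self-interaction
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # the swap would duplicate an existing interaction
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges


def degrees(edges):
    """Count the interaction partners of every protein."""
    deg = {}
    for a, b in edges:
        deg[a] = deg.get(a, 0) + 1
        deg[b] = deg.get(b, 0) + 1
    return deg


def is_significant(p_value, n_categories, alpha=0.01):
    """Threshold from the text: P <= 0.01 / Nc."""
    return p_value <= alpha / n_categories


# Toy network (hypothetical): a six-protein ring with two chords.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"),
         ("E", "F"), ("F", "A"), ("A", "C"), ("D", "F")]
rewired = rewire(edges, swaps=20, seed=1)
```

Running the quasi-clique search and the enrichment test on many such rewired networks would give the fraction of random quasi-cliques that pass the same threshold.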

Some of our predictions were supported by recent experimental evidence. Of all the quasi-cliques, five were dominated by uncharacterized proteins (functions are unknown for at least 50% of proteins, Fig. 2), which implies that the unknown proteins in the same quasi-clique may form a large complex related to a certain cellular process. For quasi-cliques 3 and 4, most of the proteins were predicted to mediate rRNA processing, which is partly consistent with the results from recent experiments (29,30,31) (Fig. 3).

## DISCUSSION

The yeast large-scale protein–protein interaction data have broadened our view of protein functions in this proteomics era. The biological processes of a cell are controlled by interacting proteins in metabolic and signaling pathways and in complexes such as the molecular machines that synthesize and use adenosine triphosphate, replicate and transcribe genes, or build up the cytoskeletal infrastructure (32,33). Knowledge of protein–protein interactions has been accumulated by biochemical and genetic experiments, including the widely used high-throughput interaction detection methods, such as the yeast two-hybrid system and protein complex purification techniques using mass spectrometry. Now, a challenging task is to decipher the relationships between individual proteins and to understand the molecular organization of cellular networks. Here, for the first time, we analyzed the complicated protein interaction networks using the spectral analysis method. This approach is useful in revealing hidden topological structures, including quasi-cliques and quasi-bipartites, which carry meaningful information about a complex network. Figure 4a shows a part of the original interaction network, which contains 109 proteins. It looks confusing and is difficult to assimilate before analysis. In contrast, a tightly interacting quasi-clique including 68 proteins was found in this part of the network by spectral analysis. This suggests that a network is actually not as random as it appears (Fig. 4b).

Figure 4. (**a**) Part of the original interaction network. (**b**) The quasi-clique revealed in this part of the network by spectral analysis.

As part of these studies, we offered, for the first time, a flexible and promising large-scale protein function prediction system based on spectral analysis. Compared with previous approaches, the method presented here has a number of practical advantages. Previous methods used partners or neighbors alone to perform the prediction, whereas our method utilized the more informative topological structure of the whole network, and produced some results that were not covered by the previous predictions. The 76 proteins contain 43 rRNA processing proteins, seven proteins related to pre-RNA processing, 11 proteins related to ribosome biogenesis and another 15 proteins related to energy, metabolism, cytoskeleton and transcription regulation. As a control, we created and analyzed random networks with the same interaction distribution as the original network. The results show that among the 48 quasi-cliques of our experimental data, >87.5% were significant in one or more annotation categories at *P* ≤ 0.01/*Nc* (here *Nc* is the number of categories), whereas <2.1% of quasi-cliques identified from a random network met the same criteria. Some of our predictions have been supported by recently published experiments, which suggests that our prediction method can be accurate. Furthermore, the method is general and could be used to predict protein function in other organisms.

Although the initial results are promising, the current method is still far from perfect. We have not yet fully explored all quasi-cliques, because the problem has been proved to be NP-complete. Therefore, new methods should be developed to reveal more sophisticated topological features. It should be pointed out that prediction accuracy is affected by the completeness of known annotations and by false positive interactions. It is well known that annotations of proteins in databases are so far incomplete, i.e. a number of proteins with well-characterized function, or at least well-supported functional prediction, are annotated as ‘unknown function’ in MIPS. This introduces additional uncertainties into our prediction. We believe that our predictions would improve if more accurate interaction and annotation datasets were used.

## ACKNOWLEDGEMENTS

We would like to acknowledge with deep appreciation Professor Soren Norby for his examining and revising this paper. This work was supported by the Chinese Academy of Sciences Grant No. KSCX2-2-07, National Sciences Foundation of China Grant No. 39890070, the National High Technology Development Program of China under Grant No. 2002AA231031, National Key Basic Research & Development Program (973) under Grant No. 2002CB713805, the National Grand Fundamental Research 973 Program of China under Grant No. G1998030510 and Beijing Science and Technology Commission Grant No. H010210010113.

## REFERENCES

*Helicobacter pylori*. Nature, 409, 211–215. [PubMed]

*et al*. (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415, 141–147. [PubMed]

*et al*. (2000) A comprehensive analysis of protein–protein interactions in *Saccharomyces cerevisiae*. Nature, 403, 623–627. [PubMed]

*et al*. (2002) Systematic identification of protein complexes in *Saccharomyces cerevisiae* by mass spectrometry. Nature, 415, 180–183. [PubMed]

*et al*. (2000) Functional discovery via a compendium of expression profiles. Cell, 102, 109–126. [PubMed]

*et al*. (2001) Systematic genetic analysis with ordered arrays of yeast deletion mutants. Science, 294, 2364–2368. [PubMed]

*et al*. (2002) MIPS: a database for genomes and protein sequences. Nucleic Acids Res., 30, 31–34. [PMC free article] [PubMed]

*Saccharomyces cerevisiae*. Nature Genet., 29, 482–486. [PubMed]

*Proceedings of the 9th ACM Conference on Hypertext and Hypermedia.*ACM Press, New York, NY.

*Proceedings of the 9th ACM Conference on Hypertext and Hypermedia.*ACM Press, New York, NY.

*Modern Graph Theory*. Springer-Verlag, Inc., New York, NY, pp. 3–77.

*Saccharomyces cerevisiae* gene function using overlapping transcriptional clusters. Nature Genet., 31, 255–265. [PubMed]

*et al*. (2001) Composition and functional characterization of yeast 66S ribosome assembly intermediates. Mol. Cell, 8, 505–515. [PubMed]

*et al*. (2002) A large nucleolar U3 ribonucleoprotein required for 18S ribosomal RNA biogenesis. Nature, 417, 967–970. [PubMed]

*Molecular Biology of the Cell*, 3rd Edn. Garland, New York, NY.

*Molecular Cell Biology*, 3rd Edn. Scientific American Books, New York, NY.

**Oxford University Press**


Topological structure analysis of the protein–protein interaction network in budding yeast. Nucleic Acids Research. May 1, 2003; 31(9): 2443.
