J Biomed Biotechnol. 2008; 2008: 860270.
Published online Mar 12, 2008. doi:  10.1155/2008/860270
PMCID: PMC2278021

An Algorithm for Finding Functional Modules and Protein Complexes in Protein-Protein Interaction Networks

Abstract

Biological processes are often performed by groups of proteins rather than by individual proteins, and proteins in the same biological group form a densely connected subgraph in a protein-protein interaction network. Therefore, finding densely connected subgraphs provides useful information for predicting the function or protein complex of uncharacterized proteins within them. We have developed an efficient algorithm and program for finding cliques and near-cliques in a protein-protein interaction network. Analysis of the interaction network of yeast proteins using the algorithm demonstrates that 59% of the near-cliques identified by our algorithm have at least one function shared by all the proteins within a near-clique, and that 56% of the near-cliques show a good agreement with the experimentally determined protein complexes catalogued in MIPS.

1. INTRODUCTION

Proteins in a highly connected subgraph of a protein interaction network usually share a common function [1]. Therefore, a highly connected subgraph such as a clique or near-clique in a protein interaction network can be used to predict the function of the uncharacterized proteins it contains. Finding a clique of maximum size in a graph is an NP-hard problem [2]. There are several heuristic algorithms for the maximum clique problem [2, 3], but most of them focus on finding a complete subgraph (i.e., a clique) and cannot be used to find near-cliques.

Several topological analysis methods have been developed for identifying biologically meaningful groups from protein interaction networks or for assessing the reliability of protein interactions. A recent program called CFinder [4, 5] finds overlapping cliques in protein interaction networks. It allows a protein to belong to more than one clique, but cannot find near-cliques. Our study shows that the near-cliques can reveal higher functional coherence than the overlapping cliques.

The primary focus of this study is to find functional groups by identifying cliques and near-cliques in protein interaction networks. This study attempts to answer two questions: "Can we efficiently find all cliques and near-cliques?" and "Does a dense subgraph such as a clique or near-clique indeed represent a functional module or protein complex?" This study demonstrates that the answer to both questions is "yes." This paper presents an algorithm for finding near-cliques and its application to the interaction network of yeast proteins.

2. ALGORITHMS FOR FINDING NEAR-CLIQUES

A clique is a complete graph G = (N, E) in which every node is connected to every other node in the graph. In our previous work, we developed a heuristic algorithm and implemented it in a program called InterViewer [6], which identifies all edge-disjoint cliques (i.e., cliques that do not share an edge).
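The clique definition above can be checked directly. The following is a minimal sketch, not the InterViewer heuristic; the function name `is_clique` and the toy edge list are assumptions made only for illustration.

```python
from itertools import combinations

def is_clique(nodes, edges):
    """Return True if every pair of distinct nodes is joined by an edge."""
    # Undirected edges are stored as frozensets so (u, v) equals (v, u).
    edge_set = {frozenset(e) for e in edges}
    return all(frozenset(pair) in edge_set for pair in combinations(nodes, 2))

# Hypothetical toy data: a triangle, and the same triangle with one edge missing.
triangle = [("a", "b"), ("b", "c"), ("a", "c")]
print(is_clique(["a", "b", "c"], triangle))      # complete: all three pairs present
print(is_clique(["a", "b", "c"], triangle[:2]))  # edge a-c missing
```

A node set that fails this test only because of a few missing edges is the kind of subgraph the near-clique definitions below are meant to capture.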

Our experience with protein interaction networks suggests that a near-clique, like a clique, often represents a biologically meaningful unit such as a functional module or protein complex. A near-clique is almost a clique but falls short of one because a few edges are missing. We consider near-cliques of the following basic types, which are biologically meaningful clusters (see Figure 1).

Figure 1
Near-cliques of types A, B, and C. Proteins outside a clique are represented as shaded nodes.

Type A —

When a protein outside a clique interacts with two or more proteins in the clique, the protein and the clique form a near-clique.

Type B —

When a clique shares a protein with other cliques, the cliques form a near-clique.

Type C —

When two or more cliques interact with a common protein outside them and the protein has at least two interactions with each clique, the cliques and the protein form a near-clique.

The near-cliques of types A and C can be refined using the indegree and outdegree of a node (there is no change to the near-clique of type B). For a node x in a subgraph G′ ⊆ G, indegree(x, G′) is the number of edges connecting node x to other nodes in G′, and outdegree(x, G′) is the number of edges connecting node x to nodes that are in G but not in G′. We use the definition of a community in a strong sense [7] to find more near-cliques in a graph.

DEFINITION 1. —

A subgraph G′ is a community in a strong sense if indegree (x, G′) > outdegree (x, G′) for every x in G′ .

The original definition of a strong community misses many near-cliques because of a single node in the community. For example, in Figure 2a, node x cannot belong to a near-clique since indegree(x, G′) = 3 < outdegree(x, G′) = 4. Likewise, node x in Figure 2b cannot belong to a near-clique because indegree(x, G′) < outdegree(x, G′). Thus, nodes with only one edge connected to them, together with their edges, are removed from the graph when we search for near-cliques. In the graph of Figure 2a, nodes p, q, r, and s and their edges are removed. After removing them, node x and the existing clique form a near-clique of type A. A cluster that satisfies indegree(x, G′) ≥ 0.5|G′| for every x in G′, where |G′| is the number of nodes in G′, also forms a near-clique. The example shown in Figure 2b becomes a near-clique since it satisfies indegree(x, G′) ≥ 0.5|G′| even though it does not satisfy indegree(x, G′) > outdegree(x, G′).
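The pruning step described above can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, the removal is done in a single pass (the text does not say whether it is repeated), and the toy graph merely imitates the situation described for Figure 2a, with a clique {a, b, c, d}, a node x linked to three clique members, and four single-edge neighbours p, q, r, and s.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbours}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def remove_degree_one_nodes(adj):
    """Return a copy of adj without nodes that have exactly one edge (single pass)."""
    pendant = {n for n, nbrs in adj.items() if len(nbrs) == 1}
    return {n: nbrs - pendant for n, nbrs in adj.items() if n not in pendant}

# Hypothetical graph in the spirit of Figure 2a.
clique = ["a", "b", "c", "d"]
edges = [(u, v) for i, u in enumerate(clique) for v in clique[i + 1:]]
edges += [("x", "a"), ("x", "b"), ("x", "c")]              # links into the clique
edges += [("x", "p"), ("x", "q"), ("x", "r"), ("x", "s")]  # single-edge neighbours

adj = build_adjacency(edges)
pruned = remove_degree_one_nodes(adj)
print(len(adj["x"]), sorted(pruned["x"]))  # x keeps only its links into the clique
```

After pruning, the outdegree of x with respect to the clique drops from 4 to 0, so x is no longer excluded by its peripheral neighbours.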

Figure 2
(a) After removing nodes p, q, r, and s and their edges, node x forms a near-clique of type A with the remaining nodes. (b) This graph becomes a near-clique G of type C since indegree(x, G) ≥ 0.5|G|. (c) A big near-clique is too big (e.g., ...

Therefore, a near-clique G of basic types A and C should satisfy at least one of the following conditions.

  1. indegree(x, G) ≥ outdegree(x, G) for every x in G.
  2. indegree(x, G) ≥ 0.5|G| for every x in G.
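The two conditions can be written as a small predicate. The sketch below is an illustration under assumptions: the function names are hypothetical, the graph is an adjacency map of sets, and both conditions are evaluated for every node of the candidate group.

```python
def indegree(x, group, adj):
    """Number of edges from x to other nodes inside the group."""
    return len(adj[x] & group)

def outdegree(x, group, adj):
    """Number of edges from x to nodes outside the group."""
    return len(adj[x] - group)

def satisfies_near_clique_condition(group, adj):
    """True if condition 1 or condition 2 holds for every node of the group."""
    group = set(group)
    cond1 = all(indegree(x, group, adj) >= outdegree(x, group, adj) for x in group)
    cond2 = all(indegree(x, group, adj) >= 0.5 * len(group) for x in group)
    return cond1 or cond2

def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

# Hypothetical data: a 6-clique plus a node x with two links into it.
clique = ["c1", "c2", "c3", "c4", "c5", "c6"]
core = [(u, v) for i, u in enumerate(clique) for v in clique[i + 1:]]
core += [("x", "c1"), ("x", "c2")]
group = clique + ["x"]

noisy = build_adjacency(core + [("x", "p"), ("x", "q"), ("x", "r")])
clean = build_adjacency(core)
print(satisfies_near_clique_condition(group, noisy))  # x: indegree 2 < outdegree 3, and 2 < 3.5
print(satisfies_near_clique_condition(group, clean))  # condition 1 holds for every node
```

In the first graph node x fails both conditions; once its three outside edges are gone, condition 1 holds for the whole group.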

After finding all edge-disjoint cliques first, we identify near-cliques as follows. A more detailed description of finding near-cliques is outlined in Algorithms 1 and 2. In the algorithms, cIdx represents the index of a clique.

Algorithm 1
AssignNearCliqueIdx.
Algorithm 2
ExtendNearClique.

  1. Assign every node of a clique the index of the clique containing the node.
  2. When a node of a clique has already an assigned clique index, assign the index to all nodes of the clique, and merge two cliques into a near-clique of type B.
  3. When a node x outside a clique forms a basic near-clique G of type A due to interactions with two or more proteins in the clique, and either indegree(x, G) ≥ outdegree(x, G) or indegree(x, G) ≥ 0.5|G| is true, assign the index of the clique to the node.
  4. When two or more cliques form a near-clique G due to two or more interactions with a common protein x outside the cliques, and either indegree(x, G) ≥ outdegree(x, G) or indegree(x, G) ≥ 0.5|G| is true, merge the cliques and the protein into a near-clique of type C. A near-clique is formed by selecting nodes with the same clique index (cIdx) among the nodes with cIdx > 0.
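Steps 1 to 3 above can be sketched in a few lines. This is not the authors' Algorithm 1 or 2, whose listings are not reproduced here; it is a simplified illustration that assumes the edge-disjoint cliques are already given as sets of nodes, omits step 4 (type C), and assigns an outside node to the first group it qualifies for. The name `assign_clique_indices` and the toy data are hypothetical.

```python
def assign_clique_indices(cliques, adj):
    """Return {node: cIdx} for nodes that end up in a near-clique (cIdx > 0)."""
    c_idx = {}
    for i, clique in enumerate(cliques, start=1):
        # Step 2: if some node already carries an index, reuse it (type B merge).
        existing = {c_idx[n] for n in clique if n in c_idx}
        target = min(existing) if existing else i
        for n in list(c_idx):
            if c_idx[n] in existing:
                c_idx[n] = target
        # Step 1: every node of the clique receives the clique index.
        for n in clique:
            c_idx[n] = target

    # Snapshot of the groups before any outside node is added.
    groups = {}
    for node, idx in c_idx.items():
        groups.setdefault(idx, set()).add(node)

    # Step 3: attach an outside node x with >= 2 links into a group (type A)
    # when indegree >= outdegree or indegree >= 0.5 * |G|, with G = group + x.
    for x in adj:
        if x in c_idx:
            continue
        for idx, members in groups.items():
            inside = len(adj[x] & members)
            outside = len(adj[x] - members)
            if inside >= 2 and (inside >= outside or inside >= 0.5 * (len(members) + 1)):
                c_idx[x] = idx
                break
    return c_idx

# Hypothetical example: two triangles sharing node c, plus outside nodes x and y.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e"), ("c", "e"),
         ("x", "a"), ("x", "b"), ("y", "e")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

result = assign_clique_indices([{"a", "b", "c"}, {"c", "d", "e"}], adj)
print(sorted(result.items()))
```

In this toy run the two triangles merge through the shared node c, x joins the merged group through its two links, and y, with a single link, receives no index.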

Since the most relevant processes form a group of proteins of moderate size in biological networks [8], we obtain near-cliques smaller than the maximum size specified by a user. That is, when a near-clique bigger than the maximum size is found (e.g., near-clique with more than 50 nodes), it is split into smaller near-cliques (3 near-cliques in Figure 2c). The way we split a big near-clique is as follows. When our program finds a big near-clique with the minimum clique size set to k, we rerun the program on the big near-clique with the minimum clique size set to k + 1 to find a new clique and a near-clique with the clique. After removing the new near-clique from the original, big near-clique, we run the program again with the minimum clique size set to k. The big near-clique shown in Figure 2c is split into 3 small near-cliques with at least 4 proteins each.

3. RESULTS AND COMPARISON WITH EXPERIMENTAL DATA

We tested the algorithms on the data with 8,397 interactions between 4,380 yeast proteins, which is the combined data of Ito et al. [9], Uetz et al. [10], and MIPS (http://mips.gsf.de) with redundant data removed. To every protein in the near-cliques, we assigned the functional categories of the Functional Catalog (FunCat) version 2.0 [11], which includes 97 functional categories. There are six levels of hierarchy in the FunCat structure.

In the data with 8,397 interactions between 4,380 yeast proteins, we found 100 near-cliques with the minimum size of a clique set to 3 and the maximum size of a near-clique set to 40. Only one near-clique contained more than 40 proteins, so it was split into 17 small near-cliques, resulting in a total of 116 near-cliques. Figure 3 shows an example of the network of yeast protein interactions with 6 near-cliques. Proteins in each near-clique share at least one function with other proteins within the near-clique.

Figure 3
Six near-cliques found in yeast protein interaction networks. Proteins in each near-clique share at least one function with other proteins within the near-clique.

As shown in Table 1, 68 (59%) of the 116 near-cliques have at least one function shared by all the proteins in the near-clique (100% sharing), and 39 near-cliques have a function shared by more than 50% of the proteins in the near-clique. Supporting data are available at http://wilab.inha.ac.kr/ppi/homepage.mht. Only 9 near-cliques have no function shared by more than 50% of their proteins. As shown in Figure 4, the functional coherence of each near-clique is high. The functional coherence was computed as the ratio of the number of proteins having a specific functional category to the group size (i.e., the number of proteins in the group).
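The coherence ratio described above is simple to compute. The sketch below is illustrative only: the function name, the protein identifiers P1 to P4, and their annotations are made up, and the two category codes are used merely as example labels.

```python
from collections import Counter

def functional_coherence(group, annotations):
    """Return {category: fraction of proteins in the group annotated with it}."""
    counts = Counter(cat for p in group for cat in set(annotations.get(p, ())))
    return {cat: n / len(group) for cat, n in counts.items()}

# Hypothetical annotations for a four-protein group.
annotations = {
    "P1": {"32.01", "01.07.01"},
    "P2": {"32.01", "01.07.01"},
    "P3": {"32.01"},
    "P4": {"32.01", "01.07.01"},
}
group = ["P1", "P2", "P3", "P4"]

coherence = functional_coherence(group, annotations)
shared_by_all = sorted(cat for cat, ratio in coherence.items() if ratio == 1.0)
print(coherence)
print(shared_by_all)  # categories with 100% sharing
```

A group counts toward the "100% sharing" row when at least one category has a ratio of 1.0, and toward the "more than 50%" row when its best ratio lies between 0.5 and 1.0.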

Figure 4
The functional coherence in each of the 116 groups, computed as the ratio of the number of proteins having a specific functional category to the number of proteins in the group. The black, white, and grey bars represent functional categories with the ...
Table 1
Functional groups identified from the yeast protein interaction data. 68 modules have at least one function shared by all the proteins in the groups (100% sharing), and 39 groups have a function shared by more than 50% of the proteins in the groups. ...

Interestingly, most near-cliques found by our algorithm belong to multifunctional categories. For example, two functional categories are common to all the proteins in a near-clique of Figure 5. As shown in Table 2, the near-clique identified as group 93 by our program is involved in both stress response (functional category 32.01) and biosynthesis of vitamins, cofactors, and prosthetic groups (functional category 01.07.01).

Figure 5
Group 93 identified as a near-clique by our algorithm.
Table 2
Functional annotation of group 93 shown in Figure 5. The code represents functional category.

Near-cliques may correspond to protein complexes in addition to functional modules. So, we compared the near-cliques identified by our algorithms with known yeast protein complexes, which are cataloged in the MIPS Saccharomyces cerevisiae genome database (http://mips.gsf.de/genre/proj/yeast). For each near-clique, we found a best-matching protein complex by minimizing the probability of a random overlap between the two, using the following equation [4, 5]:

\[
P_{\mathrm{overlap}} = \frac{\binom{n_2}{k}\binom{N-n_2}{n_1-k}}{\binom{N}{n_1}}, \tag{1}
\]

where n1, n2 are the sizes of a known protein complex and a computed module, k is the number of their common proteins, and N is the size of the network.
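Equation (1) is a hypergeometric probability, read here as C(n2, k) · C(N − n2, n1 − k) / C(N, n1), and the best-matching complex is the one that minimizes it. The following sketch shows one way to evaluate it and to pick the best match; the function names, the complex names, and the protein labels are hypothetical, and the logarithm is taken on exact integers to avoid floating-point underflow.

```python
from math import comb, log, inf

def p_overlap(n1, n2, k, N):
    """Probability that a complex of size n1 and a module of size n2 share exactly k proteins by chance."""
    return comb(n2, k) * comb(N - n2, n1 - k) / comb(N, n1)

def ln_p_overlap(n1, n2, k, N):
    """Natural log of p_overlap, computed from exact integer binomials."""
    numerator = comb(n2, k) * comb(N - n2, n1 - k)
    if numerator == 0:
        return -inf
    return log(numerator) - log(comb(N, n1))

def best_matching_complex(module, complexes, N):
    """Return the name of the complex with the smallest ln(P_overlap) against the module."""
    module = set(module)
    def score(name):
        members = set(complexes[name])
        return ln_p_overlap(len(members), len(module), len(module & members), N)
    return min(complexes, key=score)

# Hypothetical module and complexes in a network of N = 4380 proteins.
module = {"A", "B", "C", "D", "E", "F"}
complexes = {"cx1": {"A", "B", "C", "D", "E", "Z"}, "cx2": {"A", "Y", "W"}}
best = best_matching_complex(module, complexes, 4380)
score = ln_p_overlap(len(complexes[best]), len(module), len(module & complexes[best]), 4380)
print(best, score)
```

With five of six proteins in common, the hypothetical complex cx1 gives ln(P_overlap) well below the −14 threshold used in the text.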

As shown in Table 3, 65 near-cliques (56% of the total 116 near-cliques) identified by our algorithm show a good agreement (ln(P_overlap) < −14) with the protein complexes cataloged in MIPS.

Table 3
The near-cliques matched with experimentally determined protein complexes cataloged in MIPS. The overlap column represents the number of proteins common to the near-cliques and the protein complexes.

To compare the functional coherence of the groups found by our program with that of cliques found by CFinder, we tested both programs on the same dataset. Of the groups identified by our program, 75.9% have at least two functional categories shared by all the proteins in the group, whereas the corresponding figure for CFinder is 63.1% (Table 4). This result indicates that our program finds groups with stronger functional coherence than CFinder.

Table 5 shows the actual running times of our program and CFinder on three datasets of yeast protein interactions. Our program is faster than CFinder on all datasets, and the difference in speed becomes more obvious as the dataset becomes bigger.

Table 5
Running times of the programs on 3 data sets of yeast protein interactions on a Pentium IV 3.0 GHz processor with 512 MB memory.

4. CONCLUSION

Identifying hidden topological structures of protein interaction networks often unveils biologically relevant functional groups and structural complexes. We developed an efficient heuristic algorithm for finding cliques and near-cliques in protein interaction networks. From the interaction data of yeast proteins, the algorithm identified 116 near-cliques. Comparison with the experimental data showed that 59% of the near-cliques have at least one function shared by all the proteins within a near-clique, and that 56% of the near-cliques show a good agreement with known protein complexes, which are cataloged in the MIPS Saccharomyces cerevisiae genome database.

Table 4
Comparison of our method and CFinder in terms of the number of functional categories shared by all the proteins in the groups.

ACKNOWLEDGMENTS

This work was supported by the Korea Science and Engineering Foundation (KOSEF) under Grant no. F01-2007-000-10140-0 and Grant no. F01-2007-000-10140-0 through the Systems Bio-Dynamics Research Center.

References

1. Bu D, Zhao Y, Cai L, et al. Topological structure analysis of the protein-protein interaction network in budding yeast. Nucleic Acids Research. 2003;31(9):2443–2450.
2. Battiti R, Protasi M. Reactive local search for the maximum clique problem. Algorithmica. 2001;29(4):610–637.
3. Katayama K, Hamamoto A, Narihisa H. An effective local search for the maximum clique problem. Information Processing Letters. 2005;95(5):503–511.
4. Adamcsek B, Palla G, Farkas I, Derényi I, Vicsek T. CFinder: locating cliques and overlapping modules in biological networks. Bioinformatics. 2006;22(8):1021–1023.
5. Palla G, Derényi I, Farkas I, Vicsek T. Uncovering the overlapping community structure of complex networks in nature and society. Nature. 2005;435(7043):814–818.
6. Ju B-H, Han K. Complexity management in visualizing protein interaction networks. Bioinformatics. 2003;19(1):i177–i179.
7. Radicchi F, Castellano C, Cecconi F, Loreto V, Paris D. Defining and identifying communities in networks. Proceedings of the National Academy of Sciences of the United States of America. 2004;101(9):2658–2663.
8. Spirin V, Mirny LA. Protein complexes and functional modules in molecular networks. Proceedings of the National Academy of Sciences of the United States of America. 2003;100(21):12123–12126.
9. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proceedings of the National Academy of Sciences of the United States of America. 2001;98(8):4569–4574.
10. Uetz P, Giot L, Cagney G, et al. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature. 2000;403(6770):623–627.
11. Ruepp A, Zollner A, Maier D, et al. The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Research. 2004;32(18):5539–5545.

Articles from Journal of Biomedicine and Biotechnology are provided here courtesy of Hindawi Publishing Corporation