
# Functional modules by relating protein interaction networks and gene expression

^{1}Institute for Bioinformatics, German National Center for Health and Environment, Ingolstädter Landstrasse 1, 85764 Neuherberg, Germany and

^{2}Technische Universität München, Wissenschaftszentrum Weihenstephan, Lehrstuhl für Genomorientierte Bioinformatik, Am Forum 1, 85435 Freising-Weihenstephan, Germany

^{*}To whom correspondence should be addressed. Tel: + 49 89 31873578; Fax: + 49 89 31873585; Email: sabine.tornow/at/t-online.de

## Abstract

Genes and proteins are organized on the basis of their particular mutual relations or according to their interactions in cellular and genetic networks. These include metabolic or signaling pathways and protein interaction, regulatory or co-expression networks. Integrating the information from the different types of networks may lead to the notion of a functional network and functional modules. To find these modules, we propose a new technique which is based on collective, multi-body correlations in a genetic network. We calculated the correlation strength of a group of genes (e.g. in the co-expression network) which were identified as members of a module in a different network (e.g. in the protein interaction network) and estimated the probability that this correlation strength was found by chance. Groups of genes with a significant correlation strength in different networks have a high probability that they perform the same function. Here, we propose evaluating the multi-body correlations by applying the superparamagnetic approach. We compare our method to the presently applied mean Pearson correlations and show that our method is more sensitive in revealing functional relationships.

## INTRODUCTION

Biological systems are functionally organized in different related networks defined by the type of their particular interaction, such as metabolic or signaling pathways and protein interaction, regulatory or co-expression networks. Metabolic networks are known to be subjected to conditional activity control, implemented by a variety of mechanisms such as transcript regulation, chemical modification, protein–protein interaction or signal cascades. No clearly separated networks exist in the cell. While metabolic pathways often contain protein complexes with strong protein–protein interactions, they are regulated by product feedback inhibition and are subject to common transcriptional regulation.

Single biomolecular networks are currently investigated in terms of topology (1,2), motifs (3), correlation structure (4) and modular properties (5–7) which are related to function. A functional module (5) is defined as a group of genes or their products which are related by one or more genetic or cellular interactions, e.g. co-regulation, co-expression or membership of a protein complex, of a metabolic or signaling pathway or of a cellular aggregate (e.g. chaperone, ribosome, protein transport facilitator, etc.). An important property of a module is that its function is separable from other modules (5) and that its members have more relations among themselves than with members of other modules, which is reflected in the network topology. The separability may stem from, for example, cellular localization or specific interaction of proteins or specific regulation of genes. Modules can be understood as a separated substructure of a network or pathway, e.g. the complex of fatty acid synthetase subunits may serve as an example of a module of the fatty acid biosynthesis pathway and the protein complex is a module of a protein interaction network (6). In principle, the large-scale cellular networks are robust due to their hierarchical, scale-free organization (2). Genes with related functions may have similar expression profiles, i.e. may be members of a module of the co-expression network. Co-expression is regulated in yeast by the modular action of transcription factors showing a strong correlation with gene function (8).

Presently, large amounts of data related to functional properties of genomes, e.g. gene expression and protein interaction data, are being generated. Expression data are analyzed in an unsupervised way by finding a similarity measure between gene expression profiles by clustering or biclustering the data with hierarchical, *k*-means clustering or more appropriate clustering like CLICK (9) and superparamagnetic clustering (10,11).

To obtain more reliable information than using these distinct data sets alone and to obtain insights into functional modules we integrate independent but biologically related data sets or different genetic and molecular networks (12). So far, an integrative data analysis has been performed with correlation mapping (13) and mean Pearson correlations (14,15). The main drawbacks are that the present integrative methods rely on clustering procedures which are not sufficiently robust against noise, fail for complex non-spherical data structures, or are dependent on external parameters like the predefined number of clusters. Furthermore, multi-body correlations are often estimated by averaging, but averaging does not reflect any realistic correlation structure of the data.

The integration could be done, on the one hand, by combining each binary interaction (16). On the other hand, we can find clusters, structures or modules in one particular network and see if the components (proteins or genes) of these structures are significantly related in any other network. In the present paper we propose a method for the latter strategy which can be used to integrate genetic, metabolic and regulatory information as well as functional classification (17) to find functional modules.

### Outline of the method

Our method can be used: (i) to reduce the rate of false functional assignments; (ii) to analyze expression data in a more sensible way compared to statistical evidence only; (iii) to find hypotheses for functional modules and new complexes and to assign unknown genes a function. To provide an example we integrated protein interaction and co-expression networks. Having identified a module of the protein interaction network or using a protein complex we calculated the significant correlation strength of the corresponding gene expression profiles. For this purpose we employed the definition of the correlation of superparamagnetic clustering (18), a very successful algorithm (19,20), very effective for expression analysis (10,11) and clustering of genetic networks (32). Besides its advantageous features, including its robustness against noise, it is able to define the correlation strength of a group of gene expression profiles or, more generally, of a group of nodes in a sparse network such as the co-expression network (Materials and Methods). The corresponding *P* value (probability that the correlation strength is a random coincidence) is calculated according to the rationale described in Jansen *et al*. (14). We calculated the distribution (a histogram) of the correlation strengths of all possible groups with the same number of genes. The area of the distribution greater than the correlation strength of our module consisting of the same number of genes is then the definition of our *P* value (see Materials and Methods). In contrast to Jansen *et al*. (14), we include the structure of the co-expression network. Its nodes are the genes which are connected for the most similar pairs of expression profiles. With the help of the superparamagnetic approach we looked for significant substructures. 
A highly significant correlation may result from the direct neighborhood of two nodes or from membership of a larger (dense) substructure of the co-expression network, where the gene expression profiles of the resulting complex are connected by a transitivity relation (21). In contrast to Jansen *et al*. (14), our method is not only applicable in the supervised mode, introducing prior knowledge, but can subsequently also be used in an unsupervised way. First, the protein interaction network and its clusters are tested for significant co-expression, leading to a hypothesis of protein complexes and, second, we obtain not only the most significant co-expressed protein complexes but also their corresponding gene expression profiles.

## MATERIALS AND METHODS

### Expression data and co-expression network

We used two independent yeast expression data sets to evaluate our method. The first comprises cell cycle-related profiles obtained using alpha, cdc15 and cdc28 synchronization (22,23). Each time series was used separately. The second is the Rosetta Compendium, which includes 300 deletion and drug treatment experiments (24). The expression data are available in the form of a matrix with *N* rows and *D* columns. The columns represent the samples measured under a particular condition and the rows represent the gene profiles. The data used in the calculations had already been preprocessed. We normalized them in a *z*-score fashion such that the average expression ratio of each profile is 0 and its standard deviation is 1. From the expression data a sparse co-expression network was constructed using the *K* mutual nearest neighbor criterion (25). For every gene expression profile a list of the *K* nearest neighbor profiles was produced. The nearest neighbor of an expression profile is defined as the most similar profile, measured, for example, by the Euclidean distance. Two nodes were connected if each was on the other's list. The optimal *K* is ~15, as discussed in Agrawal and Domany (26).
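The normalization and network construction described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation; the function names `zscore`, `euclidean` and `mutual_knn_edges` are hypothetical, and the brute-force neighbor search is only practical for small examples.

```python
import math

def zscore(profile):
    """Scale one expression profile to mean 0 and standard deviation 1."""
    n = len(profile)
    mean = sum(profile) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in profile) / n)
    if std == 0:
        return [0.0] * n  # flat profile: nothing to scale (assumed convention)
    return [(x - mean) / std for x in profile]

def euclidean(a, b):
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mutual_knn_edges(profiles, k):
    """Return edges (i, j), i < j, for profiles on each other's k-nearest-neighbor lists."""
    n = len(profiles)
    neighbor_lists = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: euclidean(profiles[i], profiles[j]))
        neighbor_lists.append(set(others[:k]))
    return {(i, j)
            for i in range(n)
            for j in neighbor_lists[i]
            if i < j and i in neighbor_lists[j]}

# Toy data: five made-up profiles over four conditions.
raw = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1], [8, 6, 4, 2], [1, 3, 2, 4]]
profiles = [zscore(p) for p in raw]
edges = mutual_knn_edges(profiles, k=1)
```

Requiring the neighbor relation to be mutual keeps the network sparse: a profile that merely lists a hub among its nearest neighbors is not connected to it unless the hub lists it in return.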

### Protein interaction data

The protein interaction data set was taken from the MIPS database (17). As an example we used the yeast two-hybrid (Y2H) data of Ito *et al*. (27) and von Mering *et al*. (28) as well as the complex catalog (17). The correlation structure has been investigated by Maslov and Sneppen (4). In principle, protein complexes and modules can be found by clustering the protein interaction network (29,32) or by clustering according to functional assignments (30). A protein complex is a densely connected subnetwork, but a member may interact directly with no more than a few members of the same complex.

### The superparamagnetic approach

Superparamagnetic clustering (18) has been successfully applied to artificial and real data (19,20) as well as to expression data (10,11) and is based on a physical analog, the magnetic phase transitions of spin systems. We briefly describe the algorithm which is able to partition a network into clusters, i.e. highly connected subgraphs. The algorithm is very suitable for our analysis because it establishes a hierarchy of clusters, a dendrogram. A dendrogram is formed if we are looking at a system with different resolutions: at low resolution the whole network is one cluster. At higher resolutions it decays into multiple other clusters until at the highest resolution every node is its own cluster. There exists a particular resolution where a cluster disappears, which we call the critical resolution. With the help of the algorithm we determined the correlation for a number of nodes in the network (see below), which is the probability that the nodes belong to a common cluster. We define the correlation strength of a module or group of nodes as the critical resolution where its correlation drops to zero. Finally, we calculated the distribution of the correlation strength of all pairs, triplets, etc. of nodes.
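The definition of the correlation strength as a critical resolution can be illustrated with a small sketch. It assumes that the correlation of a group has already been estimated on a grid of resolutions; the function name `critical_resolution` and the `threshold` parameter (useful when the estimate is noisy) are hypothetical and not part of the original method description.

```python
def critical_resolution(correlation_at, resolutions, threshold=0.0):
    """Return the lowest scanned resolution at which the group's correlation
    has dropped to `threshold` or below, or None if it never does."""
    for tau in sorted(resolutions):
        if correlation_at(tau) <= threshold:
            return tau
    return None

# Toy step function: the group stays together below tau = 0.35.
def toy_correlation(tau):
    return 1.0 if tau < 0.35 else 0.0

strength = critical_resolution(toy_correlation, [0.1, 0.2, 0.3, 0.4, 0.5])
```

The accuracy of the returned value is limited by the spacing of the scanned grid, so a finer grid near the drop gives a sharper estimate.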

After having constructed the co-expression network according to the *K* mutual nearest neighbor criterion (see above), we assigned every node an integer label *S*_{i} = 1…*q* (equivalent to a Potts spin with *q* different states), where *q* is an integer. The nodes representing the expression profiles *i* and *j* are connected with edges weighted with the coupling constant *J*_{ij}, for which we use a fast decreasing Gaussian decay (10)

$$J_{ij} = \frac{1}{K}\exp\left(-\frac{d_{ij}^{2}}{2a^{2}}\right).$$

Here, *d*_{ij} is the Euclidean distance between the gene expression profiles of genes or nodes *i* and *j*, *a* is the mean distance between all neighbors and *K* is the mean number of neighbors. The coupling is only non-zero for connected nodes. For unweighted networks, e.g. the protein interaction network, *J*_{ij} may be 1 or 0.
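As a sketch of how the Gaussian coupling could be evaluated for a given network, the snippet below computes one coupling per edge from the edge distances. The function name `coupling_constants` is hypothetical, the mean neighbor distance is called `a` here, and taking the mean number of neighbors as twice the edge count divided by the node count is an assumption about how that mean is formed.

```python
import math

def coupling_constants(edge_distances, n_nodes):
    """Map each edge (i, j) to (1/K) * exp(-d_ij^2 / (2 a^2)).

    edge_distances: dict {(i, j): d_ij} holding connected pairs only,
    so all unlisted pairs implicitly have coupling 0.
    """
    if not edge_distances:
        return {}
    a = sum(edge_distances.values()) / len(edge_distances)  # mean neighbor distance
    if a == 0:
        raise ValueError("all neighbor distances are zero; coupling is undefined")
    k_mean = 2 * len(edge_distances) / n_nodes  # mean number of neighbors (assumed)
    return {edge: math.exp(-d ** 2 / (2 * a ** 2)) / k_mean
            for edge, d in edge_distances.items()}

# Toy network: three nodes, two edges with made-up distances.
J = coupling_constants({(0, 1): 1.0, (1, 2): 3.0}, n_nodes=3)
```

Pairs that are closer than the mean neighbor distance keep a coupling near 1/*K*, while more distant pairs are suppressed quickly.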

We calculate the correlation of a certain number of nodes (gene expression profiles) of the network using a Monte-Carlo simulation, the Swendsen–Wang algorithm (31). Starting from a random configuration (random label *S*_{i} = 1…*q* on each node *i*), the algorithm assigns nodes *i* and *j* the same label with the probability *p*_{ij} = 1 – exp(–*J*_{ij}/τ), where *J*_{ij}/τ is the effective coupling between nodes *i* and *j* and τ (the temperature of the physical spin system) is defined as the resolution (see above) with which we investigate the system. Having gone over all edges of the network, every area with the same label forms a cluster. The integer *q* is not related to the number of clusters. A new configuration is generated by giving every node in a cluster a new random label. Averaging over several of these configurations gives the probability of a number of nodes being in the same cluster, which is defined as the correlation function. The algorithm includes a transitivity rule: if node A has the same label as B and A the same as C, then B and C also have the same label; A, B and C are strongly correlated. Increasing the resolution τ decreases the effective coupling *J*_{ij}/τ, which leads to hierarchically related nodes. The effective coupling, and thus the correlation of a group of nodes, decreases with increasing resolution. We define their correlation strength *T*_{M} as the critical resolution where the correlation of a group of nodes drops to 0 (the correlation as a function of the resolution is actually a step function). It depends on the coupling of two or more nodes and is a collective measure as well. The more densely connected these nodes are, the higher their correlation strength.
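The Monte-Carlo estimate of the group correlation can be sketched as below. This is an illustrative toy version under stated assumptions, not the authors' code: it uses the standard Swendsen–Wang rule in which only neighbors that currently carry the same label can be bonded, with bond probability 1 – exp(–*J*_{ij}/τ), and it requires τ > 0. The names `group_correlation` and `find`, and the defaults for `q`, `sweeps` and `burn_in`, are hypothetical.

```python
import math
import random

def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def group_correlation(n_nodes, couplings, group, tau,
                      q=20, sweeps=200, burn_in=50, seed=0):
    """Estimate the probability that all nodes in `group` lie in one cluster.

    couplings: dict {(i, j): J_ij} for connected pairs; tau must be positive.
    """
    rng = random.Random(seed)
    labels = [rng.randrange(q) for _ in range(n_nodes)]
    hits = 0
    for sweep in range(burn_in + sweeps):
        parent = list(range(n_nodes))
        for (i, j), coupling in couplings.items():
            # Bond neighbors with equal labels with probability 1 - exp(-J/tau).
            if labels[i] == labels[j] and rng.random() < 1.0 - math.exp(-coupling / tau):
                parent[find(parent, i)] = find(parent, j)
        roots = [find(parent, x) for x in range(n_nodes)]
        # Every cluster (connected set of bonded nodes) gets a fresh random label.
        new_label = {root: rng.randrange(q) for root in set(roots)}
        labels = [new_label[root] for root in roots]
        if sweep >= burn_in and len({roots[x] for x in group}) == 1:
            hits += 1
    return hits / sweeps

# Toy example: two strongly coupled nodes examined at a low resolution.
corr = group_correlation(2, {(0, 1): 1000.0}, (0, 1), tau=1.0, q=2)
```

Because clusters are connected components of bonded nodes, the transitivity rule comes for free: two nodes that are each bonded to a third end up in the same cluster even without a direct bond between them.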

To assess the correlation strength of genes which are members of the same module of a different genetic network we define a *P* value which gives the probability that the strength of the correlation was found by chance. Therefore, we calculate the distribution ρ or histogram of the correlation strength *T* of all possible groups with the same number of genes. The one-sided *P* value is then defined as the area of the distribution ρ above the correlation strength of the module, divided by the normalization ρ_{0} (the number of all possible modules),

$$P(T_{M}) = \frac{1}{\rho_{0}}\int_{T_{M}}^{T_{p}} \rho(T)\,\mathrm{d}T,$$

where *T*_{M} is the correlation strength of the tested module and *T*_{p} the maximal possible correlation strength in the network.
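A discrete counterpart of this one-sided *P* value can be sketched as follows. The text above considers all possible groups of a given size; drawing a random sample of groups instead, as `sample_background` does, is a simplifying assumption made here to keep the example small. Both function names and the `strength_fn` callback are hypothetical.

```python
import random

def empirical_p_value(module_strength, background_strengths):
    """Fraction of background groups whose correlation strength is
    at least as large as that of the tested module."""
    if not background_strengths:
        raise ValueError("need at least one background group")
    at_least = sum(1 for t in background_strengths if t >= module_strength)
    return at_least / len(background_strengths)

def sample_background(genes, group_size, strength_fn, n_samples, seed=0):
    """Correlation strengths of randomly drawn groups of `group_size` genes.

    strength_fn is any callable mapping a list of genes to a strength.
    """
    rng = random.Random(seed)
    return [strength_fn(rng.sample(genes, group_size)) for _ in range(n_samples)]

# Toy background with made-up strengths.
p = empirical_p_value(0.8, [0.1, 0.2, 0.9, 0.8])
```

With a sampled background the smallest non-zero *P* value that can be resolved is one over the number of sampled groups, so the sample size limits the significance level that can be tested.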

## RESULTS

We constructed a graph of co-expressed genes (see Materials and Methods) for the cell cycle as well as the Rosetta data set, in which the nodes are the genes. Two nodes are connected if they fulfill the mutual nearest neighbor criterion (see Materials and Methods). Such networks have been investigated in detail in Agrawal (25). The definition of the co-expression network is similar to the transitivity relations of Zhou *et al*. (21). We adopted the definition of the collective multi-body correlations from the superparamagnetic clustering (18) and calculated the correlation strength for modules of random genes described in detail in Materials and Methods.

A distribution or histogram of the correlation strength is displayed in Figure 1 (top) for groups of two to six gene expression profiles for the cell cycle experiment (α arrest). In Figure 1 (bottom) we show the corresponding one-sided *P* value, which is defined similarly to that in Marcotte *et al*. (12) (see Materials and Methods). As expected, the weight of the distribution is shifted to the left as the number of members of a module increases. Most of the larger modules have a lower correlation strength, so finding a large group with a high correlation strength gives a lower *P* value than finding a smaller module with the same strength.

**Figure 1.** Distribution ρ/ρ_{0} or histogram of the critical resolution or correlation strength *T* (top) and *P* value (bottom) for groups of two to six (circle, square, diamond, star and triangle) expression profiles. The weight of ...

Given the distribution of the correlation strength based on the expression data we were able to test the protein interaction data for significant co-expression. First, we tested the binary data of Ito *et al*. (27) and von Mering *et al*. (28) and, second, the complex data. Figure 2 shows the distribution of the correlation strength (co-expression) of members of Y2H data and members of a ribosome complex in comparison to the random background. The Y2H data show, in accord with Marcotte *et al*. (12), almost no deviation from the random background, although the distribution is slightly shifted to the right to a higher correlation strength. By choosing a certain *P* value it is possible to obtain a significantly co-expressed part of the protein interaction network (see below). The situation is different when only the open reading frames (ORFs) of the ribosome are chosen. As shown in Figure 2, most of the ribosome is highly co-expressed and so the weight of the distribution is shifted to a higher correlation strength. Figure 3 displays the distribution of the correlation strength of groups of six gene expression profiles (with squares representing the random background). It clearly shows that the nucleosomal complex and parts of the ribosome are highly significantly co-expressed.


Details of single complexes are annotated in Table 1 for the Rosetta Compendium and the alpha, cdc15 and cdc28 synchronization time series. Only complexes which have a *P* value < 0.001 are displayed. As expected when using expression profiles over many experiments or during the cell cycle, the significant complexes are mainly those which are constantly needed in the cell. A part of the cyclin complex is co-expressed in the α, cdc15 and cdc28 experiments. It is not significant in the Rosetta Compendium, which does not include any cell cycle synchronization and thus averages over the cell cycle. The nucleosomal protein complex is also very tightly co-expressed. It is co-regulated in all the expression experiments, as are parts of the cytochrome *c* oxidase and of the mitochondrial and cytoplasmic ribosomes. The respiratory chain complex F_{0}/F_{1} ATP synthase is only significantly co-expressed in the Rosetta Compendium. In summary, we found significant co-expression in many permanent complexes, similar to Jansen *et al*. (14). Since we did not average over the correlation of all members, we immediately obtained those parts of the complexes with the highest correlation strength. In agreement with Jansen *et al*. (14), we found that transient complexes mostly have no significant co-expression. Significant co-expression of the 20S proteasome and of the 19/22S regulator was found for the cell cycle data but not in the Rosetta data. Overall, our results are qualitatively similar to those of Jansen *et al*. (14). Quantitative differences are related to the fact that we included transitive similarities of expression profiles (not directly similar, but similar to the same set of profiles).

**Table 1.** List of protein complexes with more than two open reading frames (ORFs) with significant co-expression (*P* < 0.001)

Our approach has the advantage that it is possible: (i) to display the result as a substructure of the co-expression network, e.g. to detect those profiles which are transitively related; (ii) to find parts of the protein interaction network with a significant correlation strength in the co-expression network. As an example we show two subnetworks in Figures 4 and 5. The first is a subnetwork of the protein interaction network (28) which is related to cell cycle (alpha) expression data (Fig. 4), where only the highest significant correlation strength (lowest *P* value) is taken into account. This network is still connected and mirrors parts of an RNA metabolism complex, a nucleosomal protein complex and a protein synthesis turnover complex. Interactive interfaces to the data can be used to obtain hypotheses of protein complexes and to find essential parts of the protein interaction network, reducing its high false positive rate.


The parts of the co-expression network that we display in Figure 5 correspond to the subnetwork of the Rosetta expression data, which includes the six genes of the significant nucleosome protein complex. The correlations that we found can be mapped to known and unknown functional interactions. For instance, a non-histone protein and genes with unknown function, as well as cell cycle genes and genes related to budding, correlate with the gene expression profiles of the complex. One gene is annotated as an endochitinase (17), a function that does not fit into the experimental context.

## CONCLUSION

It remains a challenging task to interpret expression data in the context of known functional relations. Systematic approaches which integrate different types of functional information representing cellular networks are still needed in the post-genomic sector to understand the functional context of genes and to uncover functional modules. Almost no protein or gene performs its function in isolation, thus most of the existing interactions have to be discovered or confirmed. Currently, groups of gene expression profiles are clustered according to their similarity and are related to function or protein interaction afterwards. Our method starts from different groups with known interacting proteins and looks at whether they are also significantly related in other experimental data. In the case that a group or subset of the data correlates in a dense co-expression subnetwork, unknown genes that are members of such a subnetwork are candidates for interaction.

We integrated cellular networks with gene expression based on the correlation defined in the superparamagnetic approach, a very successful clustering procedure (10,11,32), which includes transitive co-expression. Furthermore, the method is robust against noise and is able to calculate multi-body correlations and their strength. Having defined in our examples a module or complex in the protein interaction network, we evaluated the correlation strength of this module in the co-expression network. By calculating the distribution of the correlation strength of all groups of gene expression profiles (nodes of the co-expression network) we were able to evaluate *P* values for any module of a given size. Since the set of the known or predicted correlations is small compared to the combinatorial number of all possible correlations, we generally avoided most false positive signals by calculating the correlation strength of all groups and comparing it to the strength of any chosen module. The *P* value is the probability that the observed strength arose by random coincidence.

The main advantage of the method is the use of multi-body correlations in contrast to the averaging used earlier (14,15). The latter mixes strong with weak correlations, which does not reflect the network structure of the data. In addition, when compared to the Pearson correlation, the superparamagnetic approach takes into account the co-expression network. For instance, some pair could have a low Pearson correlation but could be a member of the same process because the partners are related by transitivity (21). The superparamagnetic approach would not miss these correlations.
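The point about transitivity can be made concrete with a toy example. The three profiles below are invented for illustration: A and C have a Pearson correlation of zero, yet both correlate with B, so they end up connected in a thresholded similarity network. The helper names `pearson`, `threshold_edges` and `connected` and the threshold of 0.7 are hypothetical, and a simple correlation threshold stands in for the mutual nearest neighbor criterion.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def threshold_edges(profiles, cutoff):
    """Connect every pair of named profiles whose correlation reaches `cutoff`."""
    return {(u, v) for u, v in combinations(sorted(profiles), 2)
            if pearson(profiles[u], profiles[v]) >= cutoff}

def connected(edges, start, goal):
    """True if `goal` can be reached from `start` along undirected edges."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        for u, v in edges:
            if node in (u, v):
                other = v if node == u else u
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
    return False

profiles = {"A": [1, 0, -1, 0], "B": [1, 1, -1, -1], "C": [0, 1, 0, -1]}
edges = threshold_edges(profiles, 0.7)
linked = connected(edges, "A", "C")
```

A mean of pairwise Pearson correlations over the group would be pulled down by the uncorrelated A and C pair, whereas a network-based measure still sees the three profiles as one connected structure.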

We applied our new method to combine protein interaction and expression data from independent experiments. The correlation of protein complexes significantly overlapping with interaction data appears to be a logical consequence of the need to co-express tightly interacting and functionally dependent proteins. However, in most cases such correlations are found by intuition rather than statistical correlation. It has been shown that we can map correlated structures, e.g. complexes, to correlated structures of the co-expression network, which leads to the identification of functional modules. We provide a systematic, generally applicable approach which integrates different genetic information and expression profiles and which is able to test and reveal hypothetical functional modules. These features cannot be supplied by other frequently applied methods like mean Pearson correlations (14) or mapping of clusters (13). With a growing number of known interactions and co-expression, a larger number of hypotheses can be tested. In addition, we employed confidence values correlating to the nature of the interaction data (e.g. high for many known complexes but low for Y2H data).

The method is well suited to application to other combinations and is directly extendable to any set of cellular and genetic network data. Future work will be directed at systematic application of the method to the different functional classifications available (e.g. 17). Starting from well-known correlations we will attempt to define, for example, co-expressed modules exhibiting significant *P* values, and to annotate them as experimentally confirmed functional dependencies. Our method provides a framework and generator of hypotheses to be confirmed or rejected. It is applicable to the large amount of experimental high-throughput functional data to come.

## ACKNOWLEDGEMENTS

We thank A. Manolescu for his help concerning the algorithm, M. Münsterkötter and U. Güldener concerning the data, G. Kastenmüller, A. Facius and K. Mayer for useful discussions and S. Rudd for critical reading of our manuscript. The research was supported by the BMBF (FKZ01KW9928 and 031U118A).

## REFERENCES

*Escherichia coli*. Nature Genet., 31, 64–68.

*et al*. (2002) Transcriptional regulatory networks in *Saccharomyces cerevisiae*. Science, 298, 799–804.

*Saccharomyces cerevisiae*. Nature Genet., 29, 482–486.

*Saccharomyces cerevisiae* by microarray hybridization. Mol. Biol. Cell, 9, 3273–3297.

*et al*. (1998) A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell, 2, 65–73.

*et al*. (2000) Functional discovery via a compendium of expression profiles. Cell, 102, 109–126.

*et al*. (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415, 141–147.

**Oxford University Press**

Functional modules by relating protein interaction networks and gene expression. Nucleic Acids Research. Nov 1, 2003; 31(21): 6283.
