# The interactome as a tree—an attempt to visualize the protein–protein interaction network in yeast

^{2}, Xiaopeng Zhu^{1}, Haifeng Liu^{2}, Geir Skogerbø^{2}, Jingfen Zhang^{2}, Yong Zhang^{1}, Lun Cai^{2}, Yi Zhao^{2}, Shiwei Sun^{2}, Jingyi Xu^{2}, Dongbo Bu^{2,a} and Runsheng Chen^{1,2,*}

^{1}Bioinformatics Laboratory, Institute of Biophysics and ^{2}Bioinformatics Research Group, Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, People's Republic of China

^{*}To whom correspondence should be addressed. Tel: +86 10 6256 5533 5716; Fax: +86 10 6256 7724; Email: crs@sun5.ibp.ac.cn

^{a}Correspondence may also be addressed to Dongbo Bu. Tel: +86 10 6256 5533 5716; Fax: +86 10 6256 7724; Email: bdb@ncic.ac.cn

^{a}The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors

## Abstract

Refined, high-throughput methods for detecting protein interactions have produced a protein–protein interaction network for yeast. The challenge that comes with the network is to make it accessible for biological investigation. Visualization helps to extract meaningful biological information from the network, but traditional ways of visualizing it are unsuitable because of the large number of proteins. Here, we provide a simple but information-rich visualization approach that integrates topological and biological information. In our method, topological information such as quasi-cliques or spoke-like modules of the network is extracted into a clustering tree, and biological information, ranging from protein functional annotation to expression profile correlations, is annotated onto a representation of the tree. We have developed a software tool named PINC based on this approach. Compared with previous clustering methods, our clustering method ADJW performs well both in retaining a meaningful image of the protein interaction network and in enriching the image with biological information, and is therefore more suitable for visualizing the network.

## INTRODUCTION

It is now thought that the complexity of organisms arises not from the number of their macromolecules but rather from the relationships between them (1). Protein interactions are one of the major sources of this complexity. Since several high-throughput protein interaction detection approaches have already been developed (2–7), the information about protein interactions in yeast, one of the best-characterized model organisms, has grown considerably. A number of large-scale protein interaction data sets (2–5) have recently been published. These data sets provide a yeast protein–protein interaction network, in which proteins are depicted as vertices and interactions as edges.

The challenge that comes with the network is to find better ways to make it accessible for biological investigation. A visualization method would help to extract useful biological information from the network. Present algorithms represent a vertex as a point and an edge as a straight line, and focus on drawing these graphs as neatly as possible, either in a two-dimensional (2D) picture (8) or in three-dimensional (3D) space (8,9), by adjusting the positions of the points and lines. However, as there are thousands of proteins and tens of thousands of interactions in the yeast protein–protein interaction network, much information in the network remains hidden with this kind of visualization method. For instance, in areas of high linkage density such as cliques, the lines overlap in a 2D picture, and it is difficult to obtain information from the dense parts even in a 3D representation.

Since the limitations of traditional visualization methods are unavoidable, we present here an alternative visualization method, which aims to combine topological and biological information in a better way. The topological information is extracted from the network and displayed in a clustering tree. Based on the clustering tree, a graphical representation is created. The representation takes the form of a graphical adjacency matrix in which proteins are listed in the order of the clustering tree and a pixel depicts an interaction between two proteins. Biological information is then added to the graphical representation of the network. Different colors can be used to represent different biological information, ranging from protein functional annotation to expression profile correlations. We also provide a software tool (PINC, protein interaction network clustering) that can cluster and visualize protein–protein interaction networks based on our method.

Topological clustering methods have proven to be a good solution for metabolic networks (10) and for complicated networks in other areas (11,12). Recently, two research groups independently applied two clustering methods to the yeast protein interaction network (13,14). Several studies (15–19) have shown that a protein–protein interaction network contains meaningful topological information, commonly in the form of quasi-cliques (20) or spoke-like patterns (21). Both patterns can be clustered into separate branches of a topological clustering tree, thus revealing information about topological sub-modules of the network. However, different clustering methods reveal different parts of the network. Here, we provide two methods. The first is a new topological clustering method, called ADJW clustering, in which a modified adjacency matrix of the network is employed as the similarity matrix for the clustering. The second is a topological clustering method based on Hall (12), in which the proteins are first projected into Euclidean space and then clustered according to their positions using a hierarchical clustering method. We applied both to the yeast protein–protein interaction network. Compared with two previously published methods (13,14), the ADJW clustering method is more suitable for visualizing several aspects of the network. We further analyzed the distribution of protein complexes and unclassified proteins in the ADJW clustering tree. The results show that the ADJW clustering method is an efficient tool for clustering protein complexes. Further analyses show that Hall's method provides different and complementary results.

## MATERIALS AND METHODS

### Data source

This approach was applied to the yeast *Saccharomyces cerevisiae* protein–protein interaction network. The protein–protein interaction data detected by experiments, such as the yeast two-hybrid assay (5) and the HMS-PCI and TAP methods (2), were collected from MIPS (http://mips.gsf.de/), PreBIND (http://bind.ca/index2.phtml?site=prebind), BIND (http://bind.ca/) and GRID (http://biodata.mshri.on.ca/grid/servlet/Index). In a preprocessing step, self-interactions and redundant interactions were filtered out. For interactions detected by the HMS-PCI and TAP methods, the spoke model data (21,22), which assign interactions only between the bait and the associated proteins, were used. This yielded an interaction data set containing 13344 physical interactions among a total of 4537 yeast proteins (see Supplementary Material).

### Methods

The topological information of the network was represented in a clustering tree produced by a hierarchical clustering algorithm. Biological information was annotated with color onto the adjacency matrix, ordered according to the clustering tree. The functional *P*-value and the *P*-value of a complex were employed as criteria to compare our topological clustering with the methods of Brun *et al*. (14) and Rives and Galitski (13).

#### Hierarchical Clustering Algorithm

A protein–protein interaction network is represented as a bi-directed graph *G*(*V*, *E*), i.e. each protein is denoted as a vertex and each interaction between proteins as an edge between vertices. Let *A* = (*a*_{ij}) be the adjacency matrix, where *a*_{ij} = 1 when there is an edge between vertices *i* and *j*, and *a*_{ij} = 0 otherwise.

#### ADJW

The adjacency matrix *A* is employed as the similarity matrix, and average linkage hierarchical clustering is applied to this matrix. For two groups *M* and *N*, their average linkage is

$$D_{MN} = \frac{1}{|M|\,|N|} \sum_{i \in M} \sum_{j \in N} a_{ij} \tag{1}$$

Here *D*_{MN} represents the density of the edges between these two groups. Average linkage hierarchical clustering is a greedy algorithm based on *D*_{MN}: in each step, the two groups *I* and *J* that have the maximum value of *D*_{IJ} are merged into one group. By iterating these steps, a hierarchical clustering tree is generated.

At the beginning of the clustering, the method treats all edges as equal, and proteins joined by an edge are clustered together. However, the more common neighbors the two vertices of an edge share, the stronger the case for clustering that edge early. To decide which edge should be clustered first, we modified the adjacency matrix *A* and defined the similarity matrix as

$$S = A + w \cdot A^{2} \tag{2}$$

where *w* is a very small number, here assigned a value of 10^{−8}. Since entry (*i*, *j*) of *A*^{2} counts the neighbors shared by proteins *i* and *j*, the modification *w*·*A*^{2} ensures that the interacting protein pair that shares more neighbors is clustered first.

The tree based on *A* is called the ADJ Tree while the tree based on *S* is called the ADJW Tree.
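The ADJ and ADJW procedures described above can be sketched in a few lines. This is a minimal illustration, not the PINC implementation: it assumes the similarity matrix is the adjacency matrix plus *w* times its square, uses a naive search over all group pairs, and breaks ties by taking the first pair found. The function names and the four-protein toy network are hypothetical.

```python
from itertools import combinations

def adjw_similarity(A, w=1e-8):
    """Similarity S = A + w * A^2 (assumed form of the ADJW modification)."""
    n = len(A)
    # (A^2)[i][j] counts the neighbors shared by vertices i and j
    A2 = [[sum(A[i][k] * A[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    return [[A[i][j] + w * A2[i][j] for j in range(n)] for i in range(n)]

def average_linkage_merges(S):
    """Greedy average linkage: repeatedly merge the two groups with maximal D."""
    groups = {i: [i] for i in range(len(S))}
    merges = []
    next_id = len(S)  # internal tree nodes are numbered after the leaves

    def density(pair):
        M, N = groups[pair[0]], groups[pair[1]]
        return sum(S[i][j] for i in M for j in N) / (len(M) * len(N))

    while len(groups) > 1:
        # max() keeps the first pair in case of ties
        a, b = max(combinations(sorted(groups), 2), key=density)
        merges.append((a, b))
        groups[next_id] = groups.pop(a) + groups.pop(b)
        next_id += 1
    return merges

# Toy network (hypothetical): protein 0 hangs off protein 1; proteins 1, 2, 3 form a triangle
A = [[0, 1, 0, 0],
     [1, 0, 1, 1],
     [0, 1, 0, 1],
     [0, 1, 1, 0]]

merges_adj = average_linkage_merges(A)                    # ADJ: all edges tie
merges_adjw = average_linkage_merges(adjw_similarity(A))  # ADJW: triangle edges win
print(merges_adj)
print(merges_adjw)
```

In this toy case the plain adjacency matrix merges the pendant pair (0, 1) first only because it happens to be listed first, whereas the weighted matrix starts inside the triangle, where each interacting pair shares a neighbor.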

#### Hall Clustering (12)

The clustering tree was obtained through a two-step process. First, the proteins in the network were projected into Euclidean space based on an optimization. Second, a hierarchical clustering method was applied to the proteins according to their positions in the Euclidean space.

*Step 1: Projecting proteins into an r-dimensional Euclidean space*. The vertices were projected into an *r*-dimensional Euclidean space according to the principle that two vertices should be as near as possible if there is an edge between them (12). For simplicity, consider a single dimension first: the problem is to find the vector *X* = (*x*_{1}, *x*_{2},…, *x*_{n})^{T} that minimizes

$$Q = \frac{1}{2} \sum_{i} \sum_{j} a_{ij} (x_i - x_j)^{2} = X^{T} L X, \quad \text{subject to } X^{T} X = 1 \tag{3}$$

Here, *L* = *D* − *A* is the *Laplacian* matrix and *D* is the *diagonal* matrix with *D*_{ii} = ∑_{k} *a*_{ik} and *D*_{ij} = 0 (*i* ≠ *j*).

It has been proved that *Q* will reach a minimum when *X* is the eigenvector with the minimal eigenvalue of the *Laplacian* matrix *L* (12). Hence, finding the first dimension can easily be solved by setting it as the eigenvector of the minimal eigenvalue.

Notice that all eigenvectors are mutually orthogonal because *L* is a symmetric matrix. The eigenvectors are employed to span the *r*-dimensional space we aim at. That *L* is a positive semi-definite matrix (12) facilitates the computation of the eigenvectors. The rank of *L* is *n* − 1 if *G* is a connected graph, which means that exactly one eigenvalue of *L* equals 0. The corresponding eigenvector is a trivial solution and is not useful, because it would project all proteins onto one point. We therefore start from the eigenvector of the second smallest eigenvalue (the Fiedler vector) and take *r* eigenvectors for the *r* dimensions. Here, *r* = 350 was chosen.
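Step 1 can be illustrated with a small numerical sketch. It assumes that the cost being minimized is Hall's quadratic form, half the sum of *a*_{ij}(*x*_{i} − *x*_{j})^{2} under a unit-norm constraint, and it uses NumPy's symmetric eigensolver. The function names and the six-vertex toy graph are hypothetical, and *r* = 2 is used instead of the 350 dimensions chosen for the yeast network.

```python
import numpy as np

def laplacian_embedding(A, r):
    """Return the r smallest non-trivial eigenpairs of L = D - A."""
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A  # Laplacian matrix
    # eigh is meant for symmetric matrices; eigenvalues come back in ascending order
    eigenvalues, eigenvectors = np.linalg.eigh(L)
    # Skip index 0: eigenvalue 0 with a constant eigenvector (connected graph)
    return eigenvalues[1:r + 1], eigenvectors[:, 1:r + 1]

def quadratic_cost(A, x):
    """Half the sum of a_ij * (x_i - x_j)^2 over all i, j."""
    A = np.asarray(A, dtype=float)
    diff = x[:, None] - x[None, :]
    return 0.5 * np.sum(A * diff ** 2)

# Toy graph (hypothetical): triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1

values, coords = laplacian_embedding(A, r=2)
fiedler = coords[:, 0]
# For a unit-norm eigenvector the cost equals its eigenvalue (up to rounding)
print(values[0], quadratic_cost(A, fiedler))
# The first coordinate separates the two triangles by sign
print(np.sign(fiedler))
```

Each row of `coords` is the position of one protein in the projected space. The second step would then cluster these rows hierarchically.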

*Step 2: Ward clustering method*. The Ward hierarchical clustering method (23) was applied to cluster the projected vertices into a clustering tree. Euclidean distances were used as the metric for the topological distance between vertices in the network. At each step, the two groups whose merger gives the smallest increase in the within-group sum of squares form a new group, so groups that are closer in the distance space are merged earlier during the clustering process. As a result, a clustering tree is produced step by step.

The tree based on this method is called the H Tree.

#### Visualization and annotation

Using our software PINC, we drew the adjacency matrix of the interaction network with row and column protein headings ordered according to the clustering tree. A filled-in row/column entry indicates an interaction between the two proteins heading the row and column. The color of the entry was used to symbolize different biological information, such as information concerning individual proteins (function annotation, complex annotation and degree of evolutionary conservation) or protein interactions (expression profile correlations, interaction confidence and regulatory relationship). Different patterns in the picture imply different topological structures in the network: a block means that the involved proteins form a clique, while a line means a spoke module.
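The idea of reading topology from the reordered matrix can be tried on a toy example. The sketch below is not PINC: it renders the matrix as plain text, ignores colors, and uses a hypothetical six-protein network with a hand-picked leaf order standing in for the order given by a clustering tree.

```python
def adjacency_from_edges(n, edges):
    """Build a symmetric 0/1 adjacency matrix from an edge list."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = A[j][i] = 1
    return A

def ordered_matrix_rows(A, order):
    """Render the adjacency matrix with rows and columns permuted by `order`."""
    return ["".join("#" if A[i][j] else "." for j in order) for i in order]

# Toy network (hypothetical): proteins 0, 3, 5 form a clique; protein 1 is a hub for 2 and 4
edges = [(0, 3), (0, 5), (3, 5), (1, 2), (1, 4)]
A = adjacency_from_edges(6, edges)
leaf_order = [0, 3, 5, 1, 2, 4]  # assumed order of the leaves in a clustering tree

rows = ordered_matrix_rows(A, leaf_order)
print("\n".join(rows))
```

With this ordering the clique shows up as a filled block in the upper-left corner (apart from the empty diagonal), and the hub with its two partners shows up as a row and a column meeting at a right angle.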

### Comparison and validation

#### *P*-value of a branch

As a branch may involve different functional categories, *P*-values (24,25) were employed to assign each branch a main function, which serves as a criterion for the coincidence between topological clusters and biological function.

The hypergeometric cumulative distribution was applied to model the probability of observing, by chance, at least *k* proteins in a branch of size *n* belonging to a category containing *C* proteins from a total genome size of *G* proteins, such that the *P*-value is given by

$$P = 1 - \sum_{i=0}^{k-1} \frac{\binom{C}{i} \binom{G-C}{n-i}}{\binom{G}{n}} \tag{4}$$

The above test measures whether a branch is more enriched with proteins from a particular category than would be expected by chance. If the *P*-value of a category is near 0, it is unlikely that the proteins of that category were gathered in the branch by chance. The functional category with the lowest *P*-value in a branch was assigned as its main function and used to evaluate our clustering method.
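The enrichment test can be written directly from its definition as the upper tail of a hypergeometric distribution. The sketch below sums the tail from *k* upward, which is equivalent to one minus the cumulative probability of fewer than *k* hits. The function name and the small numbers in the example are hypothetical.

```python
from math import comb

def enrichment_p_value(k, n, C, G):
    """Probability of at least k category members in a branch of size n.

    C is the category size and G the total number of proteins.
    """
    upper = min(n, C)  # a branch cannot hold more members than n or C
    favorable = sum(comb(C, i) * comb(G - C, n - i) for i in range(k, upper + 1))
    # Exact integer arithmetic up to the final division
    return favorable / comb(G, n)

# Toy numbers: 10 proteins in total, a category of 4, a branch of 3 with 2 hits
p = enrichment_p_value(k=2, n=3, C=4, G=10)
print(p)  # (6*6 + 4*1) / 120 = 40/120
```

Summing integer binomial coefficients before dividing avoids the cancellation that can occur when a very small tail probability is computed as one minus a value close to one.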

#### *P*-value of a complex

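Each branch was assigned a *P*-value (see Equation 4) for a complex containing *C* proteins. The *P*-value of the complex, *P*(*C*_{j}), is the minimum *P*-value attained at any branch *B*_{i} in the hierarchical clustering tree *T*. That is,

$$P(C_j) = \min_{B_i \in T} P(B_i, C_j) \tag{5}$$

where *P*(*B*_{i}, *C*_{j}) denotes the *P*-value of branch *B*_{i} for complex *C*_{j}.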
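As a small illustration of taking the minimum over branches, the sketch below scores a complex against every branch of a toy tree. It assumes each branch is given as the set of proteins below it and that the per-branch score is the hypergeometric upper tail. All names and numbers are hypothetical.

```python
from math import comb

def tail_p_value(k, n, C, G):
    """Hypergeometric probability of at least k hits in a branch of size n."""
    return sum(comb(C, i) * comb(G - C, n - i)
               for i in range(k, min(n, C) + 1)) / comb(G, n)

def complex_p_value(complex_members, branches, G):
    """Minimum branch P-value for one complex over all branches of a tree."""
    members = set(complex_members)
    return min(
        tail_p_value(len(members & set(branch)), len(branch), len(members), G)
        for branch in branches
    )

# Toy tree over 10 proteins, listed as the leaf sets of its branches
branches = [{0, 1}, {0, 1, 2}, {3, 4}, {0, 1, 2, 3, 4}]
p_complex = complex_p_value({0, 1, 2}, branches, G=10)
print(p_complex)  # the branch {0, 1, 2} matches the complex exactly: 1/120
```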

## RESULTS

We applied our clustering method ADJW to a yeast *S.cerevisiae* protein–protein interaction data set containing 4537 yeast proteins and 13344 physical interactions (see Materials and Methods). The ADJW clustering tree was displayed using TreeView (26) (http://rana.lbl.gov/EisenSoftware.htm) with functional annotation (22,27). The outline of the tree is shown in Figure 1. The graphical representation of the interaction network according to the ADJW Tree is outlined in Figure 2a (for details see Supplementary Figure 1).

*Figure caption:* … (**a**), and a branch of it (**b**) are shown. The branch of the tree consists of a quasi-clique (see proteins in red) and a spoke-like module (see proteins in green).

This representation revealed many hidden modules in the network, such as quasi-cliques and spoke-like modules. These modules, which are not easily revealed through conventional visualization, are believed to be biologically meaningful and can be analyzed further by adding biological information. We present two examples in Figure 2b, which is a representation of a branch of the clustering tree. A densely interconnected module that was clustered together appears as a square block and forms a quasi-clique pattern (20) (Figure 2c). Proteins that were clustered together in spoke-like patterns, on the other hand, are represented by slender blocks at right angles (21) (Figure 2d). Using our software PINC, the MIPS functional annotations were added to the representation (Figure 3). This indicated that the proteins in Figure 2c belong to the *cellular fate/organization* function category and that most proteins in Figure 2d belong to the unclassified category.


Using PINC, we analyzed the quasi-cliques revealed in the ADJW Tree by adding biological annotation. Most of them are protein complexes. Details about the distribution of complexes in the ADJW Tree are shown in Supplementary Table 1 and Supplementary Figure 1. This representation also revealed a number of clusters of unclassified proteins, which are potentially new complexes or carry out special biological functions (20). A selection of clusters from the tree comprising mostly unclassified proteins is shown in Supplementary Table 2.

To illustrate the advantages of the ADJW clustering method, we compared it with the previously reported methods of Brun *et al*. (14) and Rives and Galitski (13), which have been used on the yeast protein–protein interaction network. We applied these two methods to our data set, along with the ADJ method and the Hall clustering method. The resulting trees were called the B Tree, the R Tree, the ADJ Tree and the H Tree, respectively. Together with the ADJW Tree, these trees were compared by several criteria.

A good test of how well the interaction information is retained in the clustering tree is to compute the distribution of the shortest path in the tree between interacting proteins. We computed these distributions in the five trees; the results are shown in Figure 4a. In this comparison, the ADJW Tree is as good as the R Tree and better than the B Tree and the H Tree (Figure 4a). The topology of the network is thus well preserved in the ADJW clustering tree.
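One way to compute such a distribution is sketched below. It assumes that the tree is stored as a list of merges, that internal nodes are numbered after the leaves in merge order, and that path length is counted in tree edges. These conventions, the function names and the toy data are hypothetical, and the path measure used for Figure 4a may be defined differently.

```python
from collections import Counter

def parent_map(n_leaves, merges):
    """Map each node to its parent; merge number s creates node n_leaves + s."""
    parent = {}
    for step, (a, b) in enumerate(merges):
        parent[a] = parent[b] = n_leaves + step
    return parent

def tree_path_length(u, v, parent):
    """Number of tree edges on the path between leaves u and v."""
    # Record the distance from u to each of its ancestors
    dist_u, node, d = {}, u, 0
    while True:
        dist_u[node] = d
        if node not in parent:
            break
        node, d = parent[node], d + 1
    # Climb from v until the first common ancestor is reached
    node, d = v, 0
    while node not in dist_u:
        node, d = parent[node], d + 1
    return d + dist_u[node]

def path_length_distribution(edges, n_leaves, merges):
    """Count interacting pairs by their path length in the tree."""
    parent = parent_map(n_leaves, merges)
    return Counter(tree_path_length(u, v, parent) for u, v in edges)

# Toy data: four proteins, four interactions, and a tree given as three merges
edges = [(0, 1), (1, 2), (1, 3), (2, 3)]
merges = [(1, 2), (3, 4), (0, 5)]
distribution = path_length_distribution(edges, 4, merges)
print(sorted(distribution.items()))
```

A tree that preserves the network well places most interacting pairs at short path lengths, so its cumulative curve rises early.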

*Figure caption:* (**a**) Distribution of interacting proteins according to the shortest path between them in the tree. A point on the line shows the number of interactions (*y*-axis) that have a path shorter than a certain distance (*x*-axis) between the …

Besides preservation of the network topology, enrichment of biological information is another important criterion for evaluating a clustering method. Applying the MIPS functional annotation (27), the *P*-value was introduced to measure this enrichment in a clustering tree. We calculated the *P*-value of each branch in the five trees. In total, 264 branches covering 541 proteins in the ADJW Tree had *P*-values below 0.001 (25,28). Compared with the other trees (see Table 1), the ADJW Tree is almost as good as the B Tree and much better than the R Tree, the H Tree and the ADJ Tree.

The *P*-value (see Methods) was used to measure the coincidence between the protein complexes and the branches in a tree. We calculated the coincidence between 307 complexes from MIPS (27) (or *complex categories*) and the branches in the five trees. The result is shown in Figure 4b. Among the 307 complexes, 222 had a *P*-value lower than 10^{−5} in the ADJW Tree. Measured by this criterion, the ADJW Tree is better than the other four trees (see also Table 1). As the results show, biological information is considerably enriched through clustering by ADJW.

Another important criterion for visualization is the distance between interacting proteins in the visualized adjacency matrix ordered according to the clustering tree. An interaction lies near the diagonal of the matrix if the distance between the two proteins is small. If many interacting proteins are a small distance apart in the matrix, a major part of the information is found in a relatively narrow band along the diagonal. We recorded separately the interactions in the selected area of the adjacency matrix for each of the five trees (Figure 4d), which showed that the ADJW method also performs well on the basis of this criterion (Figure 4c).
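The diagonal criterion can be made concrete with a short sketch: given a leaf order, count the share of interactions whose two proteins lie within a fixed number of positions of each other. The band width, the function name and the toy network are hypothetical, and the selected area used for Figure 4d may be defined differently.

```python
def fraction_near_diagonal(edges, order, band):
    """Share of interactions whose endpoints are at most `band` positions apart."""
    position = {protein: idx for idx, protein in enumerate(order)}
    close = sum(1 for u, v in edges if abs(position[u] - position[v]) <= band)
    return close / len(edges)

# Toy network (hypothetical): a clique on proteins 0, 3, 5 and a hub 1 with partners 2 and 4
edges = [(0, 3), (0, 5), (3, 5), (1, 2), (1, 4)]

tree_order = [0, 3, 5, 1, 2, 4]     # assumed order from a clustering tree
input_order = [0, 1, 2, 3, 4, 5]    # arbitrary order, for comparison

clustered = fraction_near_diagonal(edges, tree_order, band=2)
unclustered = fraction_near_diagonal(edges, input_order, band=2)
print(clustered, unclustered)
```

In this toy case every interaction falls inside the band for the tree order, while fewer than half do for the arbitrary order.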

In summary, the ADJW clustering method performed well both in retaining the interaction information and in enriching the network with biological information. This clustering method thus has an advantage over the other methods in visualizing the protein–protein interaction network.

However, the different clustering results revealed different information about the network. For example, the H Tree revealed an unusual, hidden module (Supplementary Figure 3) consisting of two quasi-cliques, in which the proteins were mostly unclassified according to the MIPS annotation. Classified proteins in this module were mostly related to RNA processing, and the unclassified proteins in one of the quasi-cliques have also been suggested to be involved in pre-rRNA processing (20,29). This module could not easily be seen in the other clustering trees.

## DISCUSSION

Refined, high-throughput protein interaction detection methods offer us a chance to study the protein–protein interaction network. However, because of the large number of proteins, the network is too complicated to be visualized easily. Here, we attempt to visualize this network in a simple but information-rich way. We drew the adjacency matrix of the interaction network according to the clustering tree, and used different colors to represent different biological information. Based on this approach, we have developed the software PINC. Using PINC, our approach can be applied to visualize any two or more categories of proteins in the yeast protein–protein interaction network, e.g. ‘prokaryotic’ and ‘eukaryotic’ proteins (i.e. proteins with or without a prokaryote ortholog, see Supplementary Figure 1). Other biological information, such as the expression profile (see Supplementary Figure 2), interaction confidence and regulatory relationships, can also be integrated into this approach. Given the abundant information linked to proteins and protein relationships, a versatile visualization approach such as ours should be highly useful. Compared with conventional visualization methods, our method has the advantage that it reveals more hidden modules in a large network.

However, since some information about the network is lost through our visualization method, we also integrated a conventional visualization method into our software PINC. The PINC software has a friendly graphical interface and can be downloaded from http://www.bioinfo.org.cn/clustering.

ADJW clustering is a newly developed method, whereas the application of Hall's method to the clustering of genomic networks represents an older strategy applied to a new field. The former method has the advantage of being simple and easy to implement; however, when further developed, Hall's clustering may still have potential for visualizing biological networks (see Figure 4c). These methods and all the other clustering methods mentioned in this paper are also integrated in PINC. We hope that biologists will find useful biological information with our software when studying the distribution of the proteins they focus on.

## ACKNOWLEDGEMENTS

The authors acknowledge helpful advice from Dr Yan Fu. This work was supported by the Chinese Academy of Sciences Grant No. KSCX2-2-27, National Sciences Foundation of China Grant No. 39890070, 60496320, the National High Technology Development Program of China under Grant No. 2002AA231031, National Key Basic Research & Development Program 973 under Grant No. 2002CB713805, 2003CB715900, and Beijing Science and Technology Commission Grant No. H010210010113.

## REFERENCES

*et al*. (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415, 141–147.

*et al*. (2002) Systematic identification of protein complexes in *Saccharomyces cerevisiae* by mass spectrometry. Nature, 415, 180–183.

*et al*. (2000) A comprehensive analysis of protein–protein interactions in *Saccharomyces cerevisiae*. Nature, 403, 623–627.

*r*-dimensional quadratic placement algorithm. Management Science, 17, 219–229.

*et al*. (2003) Topological structure analysis of the protein–protein interaction network in budding yeast. Nucleic Acids Res., 31, 2443–2450.

*Saccharomyces cerevisiae* gene function using overlapping transcriptional clusters. Nature Genet., 31, 255–265.

*et al*. (2004) MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res., 32 (Database issue), D41–44.

*et al*. (2002) A large nucleolar U3 ribonucleoprotein required for 18S ribosomal RNA biogenesis. Nature, 417, 967–970.

**Oxford University Press**


Nucleic Acids Research. 2004; 32(16): 4804.
