![]() | ![]() |
Formats:
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Copyright Muller et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Identification and Analysis of Co-Occurrence Networks with NetCutter 1Department of Experimental Oncology, European Institute of Oncology, Milan, Italy 2Consortium for Genomic Technologies (Cogentech), Milan, Italy Ji Zhu, Editor University of Michigan, United States of America * E-mail: heiko.muller/at/ifom-ieo-campus.it Conceived and designed the experiments: HM. Performed the experiments: HM FM. Analyzed the data: HM FM. Contributed reagents/materials/analysis tools: HM FM. Wrote the paper: HM. Received March 7, 2008; Accepted August 7, 2008. This article has been cited by other articles in PMC.Abstract Background Co-occurrence analysis is a technique often applied in text mining, comparative genomics, and promoter analysis. The methodologies and statistical models used to evaluate the significance of association between co-occurring entities are quite diverse, however. Methodology/Principal Findings We present a general framework for co-occurrence analysis based on a bipartite graph representation of the data, a novel co-occurrence statistic, and software performing co-occurrence analysis as well as generation and analysis of co-occurrence networks. We show that the overall stringency of co-occurrence analysis depends critically on the choice of the null-model used to evaluate the significance of co-occurrence and find that random sampling from a complete permutation set of the bipartite graph permits co-occurrence analysis with optimal stringency. We show that the Poisson-binomial distribution is the most natural co-occurrence probability distribution when vertex degrees of the bipartite graph are variable, which is usually the case. Calculation of Poisson-binomial P-values is difficult, however. Therefore, we propose a fast bi-binomial approximation for calculation of P-values and show that this statistic is superior to other measures of association such as the Jaccard coefficient and the uncertainty coefficient. Furthermore, co-occurrence analysis of more than two entities can be performed using the same statistical model, which leads to increased signal-to-noise ratios, robustness towards noise, and the identification of implicit relationships between co-occurring entities. Using NetCutter, we identify a novel protein biosynthesis related set of genes that are frequently coordinately deregulated in human cancer related gene expression studies. NetCutter is available at http://bio.ifom-ieo-campus.it/NetCutter/). Conclusion Our approach can be applied to any set of categorical data where co-occurrence analysis might reveal functional relationships such as clinical parameters associated with cancer subtypes or SNPs associated with disease phenotypes. The stringency of our approach is expected to offer an advantage in a variety of applications. Introduction Biological research has experienced a paradigm shift in the last decade catalyzed by the availability of genome sequences and the resulting development of high-throughput technologies. The large data volumes produced by these novel technologies are often published as supplementary material and/or stored in extensive data repositories [1]. Functional interpretation of these data is an ongoing challenge. Co-occurrence analysis, based on the hypothesis that co-occurring entities are functionally linked, is a technique that has been used in three main areas of biological research:
Co-occurrence of genes in sequenced genomes relies on the fact that proteins do not function in isolation and are dependent on other proteins, either as direct binding partners, or as catalysts of substrates. Thus, when two proteins significantly co-occur in a large number of genomes or can be observed as fusion proteins in a subset of genomes, they are likely to be binding partners or enzymes needed for a specific metabolic pathway. Examples of those studies have been reported by [2]–[7]. Text mining is a quickly evolving field that aims at developing technologies helping to cope with the functional interpretation of large volumes of publications. Co-occurrence of gene names in publication abstracts, entire publications, or other gene-related databases has been used to derive co-occurrence networks with clear evidence that edges in those networks are reflecting functionally relevant relationships [8]–[11]. Gene names have also been analyzed for co-occurrence with other entities such as mutations [12], chemical compounds [13], and disease related keywords [14]. From the resulting networks, hypotheses about candidate genes involved in inherited diseases and drug targets can be derived. Clustering of gene related publications using keywords has been applied to enhance the quality of gene expression clusters [15], [16]. More general (non gene-centric) approaches try to organize the literature into functional areas based on co-occurrence of MeSH terms, keywords, diseases, phenotypes, chemicals, and similar objects of biomedical research interest [17]–[21]. Co-occurrence analysis of transcription factor binding motifs has been carried out in a variety of slightly differing ways in a wide range of organisms, including humans. [22]–[33]. The underlying hypothesis is that co-regulated genes, identified usually by gene expression studies, should contain specific combinations of transcription factor binding motifs in their upstream regulatory regions, the identification of which would allow the reverse-engineering of transcription regulatory networks [34]. We have recently applied co-occurrence analysis to studying published gene expression signatures and showed that co-occurrence patterns of genes reflect cancer signaling pathways [35]. Although co-occurrence analysis has a respectable history, the methodologies used in the studies mentioned above could not be easily applied to studying gene expression signatures. There are three main reasons that dictated the use of a different approach. First, gene expression signatures can vary in size by orders of magnitude. Obviously, the larger a signature the more likely it is to find two or more genes co-occurring in that signature. Thus, the significance of co-occurrences must be evaluated in the presence of considerable heterogeneity of co-occurrence probabilities among gene lists. As a consequence, the statistics used to evaluate the significance of co-occurrence events must reflect this heterogeneity. In particular, it must be based on list-specific co-occurrence probabilities. Second, in the vast majority of previous studies, co-occurrence is analyzed for pair-wise combinations of co-occurring entities. We found that the resulting stringency of this approach is not adequate for the analysis of published gene expression signatures [35]. Third, the null-model against which the significance of co-occurrences is tested does not work well for gene expression signatures. A common procedure is to use generic randomization of the entire data set under analysis or to select subsets of data entries randomly for comparison purposes. However, gene expression signatures are composed of distinct gene sets and the null-model must maintain this property, which is not guaranteed using these approaches. Furthermore, the list-specific nature of co-occurrence probabilities cannot be dealt with properly. NetCutter was developed to address these challenges and to provide a generic tool for generating and analyzing co-occurrence networks. Although NetCutter has been developed for the analysis of gene expression signatures, it is based on abstract concepts that make it applicable to a wide variety of problems. The input is represented by a bipartite graph that is composed of list-entry pairs, which are stored in tab-separated text format. Co-occurrence of entries in lists is analyzed using pair-wise or higher order combinations of entries. The significance of co-occurrence is tested using a novel bi-binomial approximation of Poisson-binomial statistics (which is a binomial distribution with trial specific probabilities) that handles list-length-heterogeneity properly and provides a novel measure of association that is found to be superior to the Jaccard and the uncertainty coefficients. Occurrence probabilities are obtained from an edge-swapping procedure that maintains vertex degrees in the underlying bipartite graph and distinct sets of entries per list. As we shall see below, this procedure has a number of advantages over other possible null-models and permits co-occurrence analysis with near maximum stringency. Last but not least, NetCutter is equipped with a number of algorithms for the identification of network communities, vertex ranking, and convenience tools needed in the analysis of co-occurrence networks, or any undirected graph. We illustrate the utility of NetCutter in the identification of corresponding clusters of genes and publications from the PubLiME data set. PubLiME (Published Lists of Microarray Experiments) is a repository of published cancer related gene expression signatures (http://bio.ifom-ieo-campus.it/Publime). The concept of cluster correspondence follows from the bipartite graph representation of the data. Reversing the list-entry order in the bipartite graph permits identifying communities of entries as well as communities of lists. We show that communities of publications corresponding to communities of genes in the PubLiME data set can be used to generate hypotheses about the putative function of gene communities. Results The bipartite graph model of co-occurrence analysis Co-occurrence analysis using NetCutter is based on the abstraction of list-entry pairs. Any entity that co-occurs with some other entity must be confined to some sort of container where co-occurrence is observed. For example, in the case of gene name co-occurrence in PubMed abstracts, the abstract is the container and the gene names are the co-occurring entities. Similarly, co-occurrence of transcription factor binding motifs is observed in gene promoters. The promoters are the containers where motif entities co-occur. The containers generally host more than one entity (otherwise co-occurrence would be impossible) and can be conveniently interpreted as lists. The co-occurring entities are the list entries. Lists and entries form a bipartite graph with one part of the graph representing lists and the other part representing entries. The presence of a given entry in a given list is indicated by an edge between the corresponding list and entry vertices. It is required that each entry can be linked to the same list only once. Without loss of generality, let's consider genes as entries and PubMedID_listIDs as lists in the following, unless otherwise specified (Fig. 1A
Occurrence probabilities and null-models A prerequisite for co-occurrence analysis is the availability of occurrence probabilities of genes per list. The occurrence probabilities can be derived from randomizing the bipartite graph and are dependent on the choice of the null-model. A null-model creates an occurrence probability matrix where the occurrence probability for each list–gene pair is listed. As a general property of this matrix, the sum of all matrix elements must equal the number of edges in the bipartite graph. This is because each edge is linked to either side of the bipartite graph with certainty and therefore the sum of occurrence probabilities over all lists (which can be calculated as the row sum if genes are listed vertically or as the column sum if genes are listed horizontally) followed by summing the results over all genes must be 1 for every edge. The number of matrix elements is given by #genes*#lists and therefore the average occurrence probability for any null-model must be #edges/(#genes*#lists). As a consequence, different null-models will only be distinguished by the way they attribute occurrence probabilities to vertices with different vertex degrees but not by the average occurrence probability. We consider six different strategies to randomize the bipartite graph. First, we could reconnect all edges of the graph randomly. The probability of being connected by an edge for a given list-gene pair is given by (1/#genes)*(1/#lists). Since there are #edges edges to be reconnected, the occurrence probability for a single list-gene pair is #edges/(#genes*#lists), i.e. equal to the average occurrence probability. Thus, this model provides equal occurrence probabilities for all gene-list pairs and does not consider vertex degrees. We call this model the generic randomization (GR) model in the following. Second, we could disconnect the edges on only the list side of the bipartite graph and reconnect them randomly. The occurrence probability of a gene vertex would be given by (gene vertex degree)/#lists. The sum of these probabilities over all lists is equal to the gene vertex degree and the sum of all gene vertex degrees is equal to the total number of edges. Thus, the sum of all matrix elements is equal to the number of edges, as required. Since this model considers gene vertex degrees, we call it the gene vertex degree (GVD) model. Third, we disconnect the edges on the gene side of the bipartite graph and reconnect them randomly. The probability of a list vertex being connected to a gene would be given by (list vertex degree)/#genes. The sum of these probabilities over all genes is equal to the list vertex degree and the sum of all list vertex degrees is equal to the total number of edges. Again, the sum of all matrix elements is equal to the total number of edges. Since this model considers list vertex degrees, we call it the list vertex degree (LVD) model. In model four and five, we reconnect edges considering both gene and list vertex degrees and allow multiple edges between list-gene pairs. The occurrence probabilities in model four are calculated according to the binomial distribution. We calculate the probability of a list-gene pair for being connected as the cumulative binomial probability of the list-gene pair being chosen at least once in the process of randomly reconnecting the edges. This can be achieved by setting the number of trials equal to the gene vertex degree, the probability of success equal to the list vertex degree divided by the total number of edges, and the number of successes equal to 0. The occurrence probability of a list-gene pair is then given by the complement of this probability. This model is called the binomial (BN) model. In model five, we calculate occurrence probabilities according to the hypergeometric distribution. The number of successes in the sample is equal to 0, the sample size is equal to the gene vertex degree, the number of successes in the population is set to the list vertex degree, and the population size is the total number of edges. Again, the occurrence probability of a list-gene pair is obtained as the complement of this probability. We call this model the hypergeometric (HG) model. Calculating occurrence probabilities in this manner does not guarantee that the matrix elements add up to the total number of edges. Therefore, the matrices are normalized such that this condition is satisfied by multiplying each matrix element with the factor #edges/(observed matrix sum), which is generally quite close to 1, however. In model six, we again consider vertex degrees, but we require that each list is composed of distinct sets of genes. Thus, multiple edges are forbidden. This condition is satisfied by applying an edge-swapping procedure during graph randomization. Edge-swapping works by randomly choosing two list-gene pairs from the bipartite graph and prior to performing the edge-swap, a test is performed to ensure that the two genes are not already linked to the respective target lists. This procedure is performed a large number of times. To ensure complete randomization of the graph, the number of swaps performed should be significantly larger than the number of edges. After performing R randomizations of the graph and counting the number of times a gene has been linked to a particular list, division of this number by R gives the occurrence probability of a gene in a given list. As will be shown below, edge-swapping produces occurrence probabilities that closely approximate occurrence probabilities obtained by generating a complete permutation set of the bipartite graph, counting the number of times a gene is found part of a list, and dividing this number by the total number of permutations. In the permutation model, the sum of occurrence probabilities of a gene over all lists equals the gene vertex degree (see below) and thus the sum of all matrix elements is the number of edges. Since permutation sets of bipartite graphs are difficult to calculate, we use the edge-swapping procedure as a close approximation and call this model the edge-swapping (ES) model. Fig. 1 Expected number of co-occurrences As a general criterion for comparing the effectiveness of different null-models, we have to compare them for the number of expected co-occurrences. The most effective null-model will be the one that maximizes the expected number of co-occurrences. If the expected number of co-occurrences is larger, an observed number of co-occurrences in a real bipartite graph will be less significant and thus such a null-model permits co-occurrence analysis with higher stringency. The expected number of co-occurrences depends in an obvious fashion on the list vertex degree. If pair-wise co-occurrences are considered, the number of co-occurrences in a list of vertex degree N is given by the binomial coefficient N over 2. Larger lists will give rise to more co-occurrences and the number increases quickly with list vertex degree. The dependency of the expected number of co-occurrences on the gene vertex degree is less obvious and depends strongly on the null-model. A gene that is part of a list with vertex degree N will give rise to N-1 co-occurrences in that list. The null-model permits calculating the probability to find this gene in a given list. Thus, the expected number of co-occurrences of a gene is given by the sum of expected co-occurrences in all lists where for a single list the expected co-occurrences are given by (Nl−1)*pl. Nl is the list vertex degree and pl is the occurrence probability of the gene in that list as determined by the null-model. We used the PubLiME data set [35] to calculate the expected number of co-occurrences with different null-models. The results are shown in Fig 2A = HG<ES. The BN and the HG models perform in an essentially identical way. However, the ES model is the model that yields the largest estimates of expected co-occurrences. The results are also in line with the intuitive expectation that genes with higher vertex degree give rise to more co-occurrences. However, it can be seen that this is not true for all null-models. In particular, it is not true for the GR and the LVD models, which do not consider gene vertex degrees.
As outlined above, it is expected that the null-model that yields the highest estimates of expected co-occurrences should permit co-occurrence analysis with the highest stringency. In Fig. 2B The reason for the superior stringency of the ES model over all other models can be explained by examining the average occurrence probability per gene and list vertex degree. Fig. 2D and E The ES model as an approximation of the permutation null-model The data shown above have revealed that the ES model is the best of the models tested. One may wonder, however, whether yet more effective null-models can be found. An obvious choice would be the permutation model. In the permutation model, a complete permutation set of the bipartite graph is created such that each list is composed of distinct sets of genes. The number of graphs where a gene is present in a given list divided by the total number of permutations then provides the occurrence probability estimate. The permutation model is the ideal null-model because it is exhaustive. The problem is that a complete permutation set of bipartite graphs of some complexity is very time consuming to calculate. For example, the simple bipartite graph from Fig. 1A We generated a complete permutation set of the graph shown in Fig. 1A
As an interesting observation, we provide evidence that occurrence probabilities are non-linear functions of vertex degrees in the edge-swapping model. This is illustrated in Fig. 3C We conclude that the ES null-model is the null-model that permits co-occurrence analysis with the highest stringency among the models tested and that it closely approximates occurrence probabilities derived from an ideal permutation model. The increased stringency of the ES model over other models is a consequence of higher occurrence probabilities for genes and list with high vertex degrees, which are giving rise to a large part of all co-occurrences in the bipartite graph. Since large occurrence probabilities make co-occurrence more likely, the analysis becomes more stringent. Co-occurrence probabilities Co-occurrence analysis can be thought of as a Bernoulli experiment with a binomial outcome (a given combination of entries is either present or not present in a given list). Thus, the Binomial distribution (BD) is a natural choice for judging the significance of the number of co-occurrences. However, the BD is defined for a probability of success which is equal in all trials. The list-specific nature of occurrence probabilities is not compatible with this condition (analysis of each list represents one trial), which means that co-occurrence analysis in the presence of list-length-heterogeneity is better described as a series of Poisson trials, where the probability of success varies from trial to trial. Therefore, the significance of co-occurrences must be evaluated using a binomial distribution with trial-specific probabilities, i.e. the Poisson-binomial distribution (PBD). The probability of success in a single Poisson trial can be calculated by multiplying the list-specific occurrence probabilities for the combination of genes under study. The number of occurrence probabilities that need to be multiplied is equal to the module size, i.e. the number of genes whose combination is studied. An observed number of co-occurrences for a combination of genes can then be evaluated using the PBD, which is given by the formula [36]:
The structure of this formula is very similar to the structure of the formula used to calculate the binomial distribution, except that multiplication with a binomial coefficient is replaced by summation over individual terms, which makes calculation of P-values using (18) inefficient (note that equation numbering starts in the Materials and Methods section). Here, Ak denotes the kth set of indices of the i lists where genes are co-occurring (“success”). There are possible sets and summation is carried out accordingly. denotes the set of indices of N−i lists where genes are not co-occurring (“failure”). [36] have reported two fast procedures for calculating exact PBD P-values. However, both procedures work with probability ratios and suffer from numerical overflow/underflow problems for large numbers of trials. NetCutter uses two workarounds to circumvent this problem. One is based on using Poisson-binomial Z-scores, which can be calculated very easily instead (see below). The other relies on a fast approximation procedure for calculating Poisson-binomial P-values, which we call bi-binomial approximation (BBA) or bi-binomial distribution (BBD).Z-scores and P-values of BBD Given the mean μ (1) and variance σ2 (2) of PBD (see Materials and Methods), the Z-score associated with a given number of co-occurrences x is obtained as:
Considering the structure of formulae (1) and (2) (Materials and Methods section), PBD Z-scores can be calculated very easily and provide a simple estimate of the significance of co-occurrence modules. However, in contrast to normally distributed Z-scores, binomial and Poisson-binomial Z-scores do not correspond to the same P-value for different sets of probabilities of success. To see this, calculate for example the probability of success in a series of 100 Bernoulli trials with success probability 0.1 and 0.9 for the expectation of 10 and 90 successes, respectively. The Z-score will be 0 in both cases but the corresponding cumulative P-values are 0.5832 and 0.5487. Therefore, exact levels of significance cannot be derived from Z-scores alone. Thus, a fast and reliable procedure for calculating Poisson-binomial P-values is needed. The BBD approximation was developed to solve this problem. The BBD approximation of PBD P-values follows from the relationship between the variance of PBD and the population variance of trial-specific probabilities of success. This relationship is shown in Materials and Methods to be described by (4):
This equation shows that there is an inverse linear relationship between the population variance S2 of the N trial probabilities and the variance of PBD σ2, which means that PBD becomes increasingly narrow as the variance of trial probabilities grows. It also shows that, for constant mean μ and number of trials N, the shape of PBD depends only on the variance of trial probabilities. Therefore, relationship (4) suggests an easy way to approximate PBD P-values. The P-value can be obtained by constructing a set of trial probabilities with equal variance as the original set of trial probabilities, which, however, are not all different. In other words, the series of Poisson trials can be replaced by two sets of Bernoulli trials with trial probabilities p1 and p2 constructed such that the variance is equal to the original set of trial probabilities. This strategy is illustrated in Fig. 4
In order to evaluate whether BBD P-values as a significance measure of co-occurrence offer an advantage over other measures such as the Jaccard coefficient or the uncertainty coefficient, pair-wise co-occurrence of two genes in 200 lists with and without list-length-heterogeneity was studied (Fig. 5
Three different cases of list-length-heterogeneity are considered in Fig. 5 = 100 occurrences per gene/200 lists) ^2*200 lists. The uncertainty coefficient is found to be 0 and the Jaccard coefficient is 0.33333 at that point. When there is modest list-length-heterogeneity (standard deviation 0.283), the mean of BBD is shifting to the right. This is because the sum of squares of varying occurrence probabilities (i.e. the sum of co-occurrence probabilities used as trial probabilities, which is equal to the mean of BBD) is always larger than the sum of squares of constant occurrence probabilities with the same average occurrence probability (0.5). The corresponding BD in the presence of list-length-heterogeneity is obtained by dividing the expected number of co-occurrences by the total number of lists, which means assuming equal co-occurrences in all lists. This visualization is shown to illustrate how BBD (which is narrower than the corresponding BD) gives rise to a steeper cumulative distribution of P-values and as a consequence to more significant P-values for numbers of co-occurrence that are far from the expectation. As the level of list-length-heterogeneity grows (standard deviation of occurrence probabilities 0.401), the mean of BBD is shifted even further to the right and BBD P-values are distributed in a still steeper fashion as compared to corresponding BD P-values and the interval of non-significant co-occurrences is shrinking further. With modest list-length-heterogeneity, the expected number of co-occurrences is 66, which is associated with a Jaccard coefficient of 0.49 and an uncertainty coefficient of 0.075. In the case of large list-length-heterogeneity, the expected number of co-occurrences is 82 with J = 0.69 and UC = 0.32.Taken together, these data show that the expected number of co-occurrences varies strongly with the level of list-length-heterogeneity and that the expected number of co-occurrences is associated with different values of UC and J. To complicate matters further, 66 co-occurrences (J = 0.49, UC = 0.075) represent significant positive association (PBBD = 0.996) with equal list lengths, no significant association with modest differences in list length (PBBD = 0.536) and strongly negative association (meaning one gene excludes the other) with strong list-length-heterogeneity (PBBD = 0.00016). Thus, the same J and UC association measure is obtained for positive, negative, and absence of association. Therefore, the meaning of these measures cannot be interpreted properly in the absence of knowledge about the occurrence probabilities of the co-occurring entities. Furthermore, the data in Fig. 5Generation of co-occurrence networks and the identification of communities The procedures outlined above allow the identification of significant co-occurrence modules in any type of bipartite graph. Three user defined parameters have an impact on the stringency of co-occurrence analysis: The module size, the support, and the Z-score/P-value cutoff. The module size determines how many entries will be tested for co-occurrence, the support sets a lower boundary on the required number of co-occurrences, and the Z-score/P-value cutoff sets the significance threshold. In general, higher module size leads to more stringent co-occurrence analysis at the cost of computational complexity. The support parameter allows limiting this complexity by filtering out co-occurrence modules which co-occur less frequently than required by the support. The significance cutoff permits adjusting the signal-to-noise ratio, which is calculated as the number of co-occurrence modules observed in the real versus a randomized bipartite graph. The impact of these parameters on the stringency of co-occurrence analysis has been reported previously for the PubLiME data set [35] and is illustrated in a simulation study provided as supplementary material Simulation S1. From the set of significant co-occurrence modules, a co-occurrence network is generated by considering each entry a vertex and drawing an edge between any two vertices, which have been part of the same significant co-occurrence module [35]. An important question in the analysis of co-occurrence networks regards the presence of network communities. Communities can be understood as groups of vertices with the property that the number of edges running within groups is larger than expected by chance and that the number of edges running between groups is lower than expected by chance [37]. This problem of partitioning a graph is often referred to as the graph-cut problem (hence the name NetCutter). NetCutter is built on the Java Universal Network and Graph framework (JUNG) software package (http://jung.sourceforge.net), which provides algorithms for solving this problem. In particular, NetCutter implements the Bicomponent clustering algorithm [38], the Edge-Betweenness clustering algorithm [39], and the Exact Flow Community algorithm [40]. Furthermore, there is a clustering tool that is not part of the JUNG package, namely an algorithm identifying communities using eigenvectors of the modularity matrix [37]. The code for this algorithm was kindly provided by Mark Newman in C++ and ported to Java. In addition to these tools, NetCutter provides a number of convenience functions for the analysis of co-occurrence networks, such as testing the significance of lists reporting a set of entries making up a network community, ranking of vertices, random graph generators for topological analysis of co-occurrence networks, and others. Details on all functions are provided in the NetCutter software documentation. One of the possible applications of NetCutter is illustrated below. This application is tightly linked to the bipartite graph representation of the data. Namely, NetCutter can be used to perform co-occurrence analysis of genes or list derived from the same bipartite graph. The network communities identified in each both reflect the same underlying structure of the bipartite graph. In the case of gene expression signatures stored in PubLiME, clusters of genes correspond to clusters of publications, which can reveal possible functions of gene clusters. Cluster correspondence and association studies The co-occurrence analysis of the PubLiME data set published previously [35] identified 5 major network communities of genes with consistent functional annotations that are deregulated in cancer related gene expression signatures. This analysis was performed by considering all genes mentioned in a particular publication as a single signature, even though they might have been part of different tables and cluster analyses. Here we present an advanced analysis of the PubLiME data set where each table and/or cluster identified in a given publication is considered as a separate signature. This brings the total number of signatures to be analyzed to 1015 comprising a total of 7358 differentially regulated genes derived from 233 publications reporting cancer related signatures derived from human samples. We use this analysis to illustrate three major points: First, the set of communities reported previously is reproduced by this more fine-grained analysis. Second, the set of gene communities corresponds to a set of publication communities. Third, associations between publications and gene communities can be calculated with higher stringency using the edge-swapping null-model in conjunction with bi-binomial P-values as compared to binomial or hypergeometric statistics. The bipartite graph to be analyzed is composed of PubMedID_listID-gene pairs (see supplementary material Table S1). Co-occurrence analysis was carried out in two ways: First, gene co-occurrence was analyzed and communities of co-occurring genes were defined by edge-betweenness clustering as described in Materials and Methods. Second, co-occurrence of PubMedID_listIDs was analyzed. To this end, the order of PubMedID_listID-gene pairs was reversed to form GENE-PUBMEDID_LISTID pairs. Thus, the lists in the resulting bipartite graph are formed by genes and the entries are the PubMedID_listIDs where the genes are reported as differentially regulated. Occurrence probabilities for the reversed bipartite graph can be obtained by transposing the occurrence probability matrix of the original bipartite graph. Since the gene communities identified in gene co-occurrence analysis reflect the structure inherent in the bipartite graph (which is not affected by reversing the list-entry order), co-occurrence analysis of the reversed bipartite graph will result in PubMedID_listID communities that reflect the same underlying structure in the bipartite graph. In other words, PubMedID_listID communities correspond to gene communities. In less abstract terms, the PubMedID_listID communities should correspond to sets of publications that report similar sets of genes as differentially regulated. The identification of communities of publications can help the researcher to easily identify publications studying genes in a gene community that is of interest to the researcher. The results of both types of co-occurrence analysis are displayed in Fig. 6
The corresponding clusters of PubMedID_listIDs are shown in Fig. 6B The edge-swapping null-model determines occurrence probabilities in such a way that the number of genes in a given PubMedID_listID is associated with insignificant P-values in the context of the complete bipartite graph. However, when a subset of genes is analyzed, e.g. all the genes that are reported in a particular list, the P-value associated with the number of genes contained in this list will likely be highly significant according to how unlikely it is to obtain all the genes contained in a given list in a random draw from all genes present in the bipartite graph. Thus, PubMedID_listID association with a set of genes in the bipartite graph model can be calculated in the following way: The set of genes that is used to analyze association is used to extract a subgraph from the original bipartite graph where occurrence probabilities for each gene-PubMedID_listID pair are identical to those in the original bipartite graph (i.e. they are not recalculated by edge-swapping). The vertex degree of the PubMedID_listID vertices in the subgraph indicates the number of genes contained in each PubMedID_listID overlapping with the set of genes used to extract the subgraph. From the occurrence probabilities of the genes in a given PubMedID_listID, the bi-binomial P-value can then be calculated for every list vertex degree observed in the subgraph. In Fig. 7
The analysis of significant associations between PubMedID_listIDs and gene clusters now permit answering the question whether there is correspondence between gene clusters and PubMedID_listID clusters. The naming of PubMedID_listID clusters shown in Fig. 6B = 0.95) were identified. Second, the number of significant PubMedID_listIDs in each PubMedID_listID cluster was counted for every gene cluster. The significance of this number was then calculated using binomial statistics. The results of this analysis are shown in Table 1. Negative decadic logarithms of the binomial P-value are displayed. It is apparent that each PubMedID_listID cluster is strongly associated with at least one gene cluster, except for the “extracellular matrix-immune response” cluster, which is associated with two gene clusters. The strength of these associations suggests that the PubMedID_listID clusters are indeed corresponding to the gene clusters and that both the gene and the PubMedID_listID clusters reflect the structure of the bipartite graph representing the PubLiME data set.
Details about all the lists analyzed are attached as supplementary material Table S2. Looking at these lists, some general conclusions about the gene clusters can be drawn. Cell cycle cluster genes have been found deregulated in a wide variety of tumor types such as colon cancer, breast cancer, in biliary tract cancer, pancreatic cancer, gastric cancer, prostate cancer, T-cell leukemia, glioma, acute lymphoblastic and myeloblastic leukemias, soft tissue sarcoma, neuroblastoma, as well as in a number of cellular model systems in response to different stimuli. Thus, the cell cycle cluster seems to consist of genes with a general role in oncogenesis. The surface antigen cluster instead seems to be derived preferentially from studies on leukemia. The interferon cluster genes are found deregulated in virus induced pathologies such as papilloma virus induced cervical cancer, and viral hepatitis. Immune response cluster genes were reported as differentially regulated in inflammatory conditions such as ulcerative colitis, Crohn's disease, and Helicobacter pylori infections. Genes of the extracellular matrix cluster seem to be associated with cancer progression studies and metastatic potential. For the protein biosynthesis cluster, there are 15 signatures that are significantly enriched for those genes. The cancers studied comprise medulloblastoma, glioblastoma, pancreatic cancer, soft tissue sarcoma, lung carcinoma, breast carcinoma, prostate carcinoma, multiple myeloma, and lymphocytic leukemia. The genes are also found deregulated in response to DNA damage. Although the number of signatures is limited, the variation in conditions where the genes are deregulated is compatible with the hypothesis that protein biosynthesis genes, as cell cycle genes, are deregulated in many cancer types, which might reflect the general property of cancer cells to divide and grow in an uncontrolled fashion. Discussion Here we have investigated basic aspects of co-occurrence analysis and present a software tool, NetCutter, which can be used to identify and analyze generic co-occurrence networks. In NetCutter, a co-occurrence data set is represented as a bipartite graph with one part representing lists and the other part list entries whose co-occurrence patterns are studied. The bipartite graph representation of co-occurrence data sets allows the efficacy of different null-models to be tested systematically. We have shown that an edge-swapping procedure used to randomize the bipartite graph generates a null-model that allows co-occurrence analysis with the highest stringency. The other null-models tested here tend to underestimate occurrence probabilities of entries per list for lists and genes with high vertex degrees, i.e. for lists and genes where most co-occurrences are observed. As a result, co-occurrences are judged more significant than they really are. Co-occurrence data sets with exactly equal lists lengths are likely to be the exception from the rule. It can be assumed that some list-length-heterogeneity will be present in most circumstances. An important consequence of list-length-heterogeneity regards the co-occurrence probability distribution used to evaluate the significance of the observed number of co-occurrences. Co-occurrence analysis in the presence of list-length-heterogeneity is best performed using the Poisson-binomial distribution (a binomial distribution with trial specific probabilities). However, calculating Poisson-binomial P-values for large numbers of lists is difficult using existing procedures [36]. We have presented an approximation to the Poisson-binomial distribution, called bi-binomial distribution, which is based on replacing the set of Poisson trials by two sets of Bernoulli trials. The resulting distribution reproduces the Poisson-binomial distribution nearly exactly and its P-values can be calculated with ease even for thousands of lists (see also supplementary material Simulation S1 for details on the precision of BBD). Importantly, BBD provides a novel measure of association, which is shown to be superior to existing measures such as the Jaccard coefficient and the uncertainty coefficient, whose values cannot be interpreted properly in the absence of knowledge about the occurrence probabilities of co-occurring entities. It is worth noting that Poisson-binomial Z-scores are distinguished from Gaussian Z-scores by the fact that they do not correspond to the same P-value for different PBDs, BBDs, and even BDs. This is because the Z-score is an explicit part of the function defining the normal probability density while it is not part of the definitions of BD, PBD, and BBD densities. As a consequence, the simple Poisson-binomial Z-score based approach to evaluating significance of co-occurrence must be complemented with the BBD to approximate Poisson-binomial P-values in order to enable multiple testing corrections and to allow calculation of confidence levels in association studies precisely. However, NetCutter is equipped with a bipartite graph randomization tool that permits measuring the number of false positives due to multiple testing directly by comparing the number of significant co-occurrence modules in the real bipartite graph to the corresponding number in a randomized version thereof. Randomization is performed by edge-swapping in order to preserve vertex degrees. The resulting signal-to-noise ratios that are plotted for each Z-score/P-value cutoff provide a highly reliable and visually intuitive defense mechanism against false positives (see also supplementary material Simulation S1). In the vast majority of co-occurrence studies, pair-wise co-occurrences have been analyzed using different statistical models. We have observed that the stringency of pair-wise co-occurrence analysis is far below the stringency of co-occurrence analysis using higher order combinations of co-occurring entities [35]. In NetCutter, co-occurrence analysis is preceded by occurrence analysis, i.e. the occurrence probability of each entry in each list is determined. Starting from occurrence probabilities, co-occurrence probabilities for any size of co-occurrence modules under study can be obtained by multiplying the respective list-specific occurrence probabilities. Given the list-specific co-occurrence probabilities, bi-binomial P-values are then calculated in exactly the same way for any module size. As a consequence, NetCutter can perform co-occurrence analysis for higher order combinations of co-occurring entries (i.e. larger module sizes) using the same statistical model. One of the benefits of using higher module sizes is robustness of the analyses in the presence of noise. This is because each edge in the resulting co-occurrence network is evaluated many times since every pair of co-occurring entries can be part of many higher order co-occurrence modules [35]. Another advantage is that implicit relationships between entries, which have never occurred together [18], can be derived as a natural by-product of using module sizes larger than 2. As shown in a simulation study (supplementary material Simulation S1), the result is a dramatic reduction of misclassifications at higher module sizes. NetCutter can be used to calculate the strength of association between a subset of entries and lists reporting those entries. In this case, the analysis is performed on a subgraph of the original bipartite graph. The subgraph can correspond to communities of entries in the co-occurrence network, or any set of entries of interest. NetCutter will then calculate the significance of observing a given number of occurrences of an entry in the user defined subset of lists using bi-binomal statistics. This analysis mode corresponds to association studies with the advantage that the structure of the underlying bipartite graph (i.e. list length heterogeneity) is considered and handled appropriately using the bi-binomial distribution. As a consequence, association studies can be performed with higher stringency. This result can be understood by examining the occurrence probability matrix that is implicitly assumed in performing binomial or hypergeometric tests for the significance of overlaps. In both tests, a gene is assumed to have an equal opportunity to be present in a list. Therefore, the probability of success for a gene to be part of a list is given by the list vertex degree divided by the total number of genes. In other words, both tests are implicitly based on the list vertex degree model, which has been shown previously to underestimate the occurrence probability and the expected number of co-occurrences for genes with high vertex degree (see Fig. 2A Co-occurrence analysis of data represented as bipartite graphs permits visualizing the structure of the bipartite graph either as communities of list entries (genes) or as communities of lists (PubMedID_ListID) in co-occurrence networks. We have analyzed the PubLiME data set for the presence of corresponding gene and list clusters. In addition to previously published clusters of genes, we describe a novel gene cluster that is composed of protein biosynthesis associated genes [35]. We found that the corresponding clusters of PubMedID_ListID (gene expression signatures) are in general strongly enriched for genes reported in the corresponding gene cluster and that interrogation of corresponding clusters can be used to deduct hypotheses about the putative function of gene clusters. In addition to co-occurrence analysis, NetCutter offers a number of tools for the analysis of co-occurrence networks, or any undirected graph. In particular, community identification is supported by four different community identification algorithms. NetCutter also offers a range of convenience functions that are of help in network analysis. Worthy of mentioning are the random graph generators that can provide control graphs for topological studies. The complete set of options is described in the software documentation. In summary, we present a general framework for co-occurrence analysis with many potential applications. We illustrate a number of advantages of using the bipartite graph representation of data and the associated statistics. In particular, the identification of corresponding clusters permits the identification of functional subunits such as gene clusters on the one hand, and the generation of hypotheses about the function of those units by analyzing the corresponding list clusters on the other hand. Future developments will be directed towards the analysis of data sets that are considerably larger than the data sets analyzed so far. For example, co-occurrence analysis might be of interest for the analysis of single nucleotide polymorphism (SNP) data sets and association studies of genome variability with disease. Each patient is characterized by a specific range of SNPs. Co-occurrence patterns of patients according to their SNPs could be compared to clinical parameters with the aim of identifying genomic regions associated with disease. The increased stringency of association studies offered by NetCutter may be of use in the analysis of polygenic diseases where conventional methods fail. For being useful in this setting, NetCutter must be capable of analyzing bipartite graphs with millions instead of thousands of vertices. Materials and Methods Implementation of NetCutter NetCutter is written in Java using NetBeans6 software (http://www.netbeans.info/) and tested on the Java Runtime environment 1.6.0.0. on a Windows XP Professional computer. The Java Runtime environment, which can be downloaded from http://java.sun.com/, must be installed on a computer that is intended to run NetCutter. NetCutter is provided as a single jar file and should run by double clicking the jar file, provided that the Java runtime environment is properly installed. NetCutter makes use of the following software packages and classes: JUNG version 1.3 (http://jung.sourceforge.net/download.html), Apache Jakarta Commons Collections 3.1 (http://jakarta.apache.org/commons/collections/), Cern Colt Scientific Library 1.2.0 (http://dsd.lbl.gov/hoschek/colt/), Xerces (http://xerces.apache.org/xerces2-j/index.html), Jama (http://math.nist.gov/javanumerics/jama/), Netlib Java LAPACK (http://www.netlib.org/lapack/), JFreeChart (http://www.jfree.org/jfreechart/), partition.java (http://astro.u-strasbg.fr/fmurtagh/mda-sw/java/partition.java). Bi-binomial approximation of Poisson-binomial distribution The Poisson-binomial distribution (binomial distribution with trial specific probabilities) has recently been proposed as a statistic that properly handles largely differing sizes of gene expression signatures in meta-analysis of gene expression data [35]. Z-scores have been used to estimate the significance of co-occurrence because P-value calculation is cumbersome and error prone. Two methods reported by [36] suffer from numerical overflow/underflow problems when large numbers of Poisson trials with probabilities deviating significantly from 0.5 are being analyzed. Therefore, we propose a fast approximation of P-values based on a bi-binomial distribution. The bi-binomial distribution is a special case of the Poisson-binomial distribution where the probability of success can assume only two values. In order to achieve a good approximation of the underlying Poisson-binomial distribution, the values of these two probabilities and the number of trials where they are assumed must be determined carefully. As is shown in the following, the values of the two trial probabilities and their number of occurrences follow from the formula used to calculate the variance of the Poisson-binomial distribution and from the formula yielding the population variance of trial probabilities of the Poisson-binomial distribution to be approximated. The mean μ and the variance σ2 of the Poisson-binomial distribution are given by equation (1) and (2), respectively.
The population variance S2 of trial probabilities pi is given by equation (3).
Rearranging equation (3) considering (1) and (2) leads to (4) and (5), where pa denotes the average trial probability of success and qa its complement.
Considering (1), the average trial probability pa can then be obtained from (7).
Using (7), p1 can thus be calculated as (8).
Similarly, considering (2), the variance σ2 is given by (9).
Substituting p1 in (9) using (8) followed by substituting σ2 in (5) by (9) leads to a quadratic equation for p2 as a function of pa, N, and S2, as shown in equation (10).
The solution to (10) is given by (11).
Setting p2 to
Comparing (13a) to (12) and (12a) to (13), it can be seen that the formulae are identical except for the fact that N1 and N2 are reversed. Since the assignment of which set of trials is called N1 and which set of trials is called N2 is completely arbitrary, we can limit the remaining analysis on (12) and (13) without loss of generality. Note that (12) and (13) do not guarantee that p1 and p2 are always confined between 0 and 1 for any combination of N1 and N2. While probabilities smaller than 0 or bigger than 1 would still result in a distribution with the same overall variance as the original distribution, P-value calculation will be imprecise because the tails of the distribution will deviate significantly from the original distribution. Thus, we need to define the values N1 and N2 in such a way that p2< = 1 and p1> = 0. This can be achieved by evaluating (12) and (13).Evaluating (12) for the condition that p2< = 1, solving the resulting inequality for N2, and considering (5), which relates S2 and σ2, we obtain (14).
Similarly, evaluating (13) for the condition p1> = 0, solving the resulting inequality for N2, considering (5), which relates S2 and σ2, and defining μf the expected number of failures as N * (1−pa) (15),
The meaning of these boundaries is best illustrated by considering a Poisson-binomial distribution whose variance is 0, i.e. that assumes 1 at X = μ and 0 otherwise. In this case (14) requires N2< = μ while (16) requires N2> = μ. These conditions can only be fulfilled contemporaneously when N2 is set to μ. Intuitively, this means that there are μ trials with probability of success 1 and N−μ trials with probability of 0, resulting in a Poisson-binomial distribution with variance σ2 = 0 and mean μ. When σ2 is larger than 0, the choice of N1 and N2 is more flexible. However, since the choice of N2 = μ is valid for all possible values of σ2, this is how NetCutter determines N1 and N2. When μ is not an integer, N2 is set to the integer closest to μ.Having determined p2 (12) and p1 (13) as well a N1 and N2 (14, 16, 6), we can now calculate the bi-binomial approximation of the Poisson-binomial distribution in a fashion that is very similar to calculate the binomial P-value. With q1 = 1−p1 and q2 = 1−p2 we obtain:
The summation is necessary because i successes can be obtained from any combination of j p1 and i−j p2 trials, where j can assume any value from 0 to i. Calculating Jaccard and uncertainty coefficients For the purpose of comparing the efficacy of the bi-binomial distribution as a significance measure of co-occurrence, Jaccard and uncertainty coefficients (which are also called measures of association) were calculated using the formulae:
The Jaccard coefficient J is calculated as the number of times A and B occur together divided by the number of times A occurs without B plus the number of times B occurs without A plus the number of times A and B occur together [42]. The uncertainty coefficient [42] is calculated as:
Co-occurrence analysis of the PubLiME data set The bipartite graph to be analyzed is composed of 27619 PubMedID_listID-gene pairs (see supplementary material Table S1). Edge-swapping (1000 simulations, see above) was used to determine occurrence probabilities and gene co-occurrence was analyzed using module size 3 (co-occurrence of three genes), bi-binomial Z-score cutoff 6, bi-binomial P-value cutoff 1.0E-5, and support 5. Supplementary material Simulation S1 provides details on why module size 3 is chosen. The support parameter ensures that each 3-gene co-occurrence module is present in at least 5 signatures. We identified 1654 significant modules in the test data compared to 5 modules in a randomized bipartite graph, corresponding to a signal-to-noise ratio of 331. The co-occurrence network was generated from the significant co-occurrence modules by drawing an edge between each pair wise combination of genes that are part of the same co-occurrence module. Gene communities were identified in this network by edge-betweenness clustering removing 4 edges, which resulted in a maximal network modularity of 0.63. Modularity is calculated as described by [43]. For the identification of PubMedID_listID clusters, the PubMedID_listID-gene pairs in the original bipartite graph were reversed to form gene-PubMedID_listID pairs. Occurrence probabilities were obtained by transposing the original occurrence probability matrix determined by edge-swapping as described above. PubMedID_listID co-occurrence was analyzed using module size 5, Z-score cutoff 6, bi-binomial P-value cutoff 1.0E-5, and support 3. Please note that the choice of these parameters is dictated by the parameters used in gene co-occurrence analysis. The reversal of the bipartite graph necessitates the support parameter used in gene co-occurrence analysis (5) to be used as module size in PubMedID_listID co-occurrence analysis and the module size used in gene co-occurrence analysis (3) to be used as the support parameter in PubMedID_listID co-occurrence analysis if the scope of the analysis is the identification of PubMedID_listID clusters that correspond to gene clusters identified before. The significance cutoffs remain unchanged. PubMedID_listID co-occurrence analysis revealed 448 significant co-occurrence modules in the real bipartite graph and 6 significant co-occurrence modules in the randomized bipartite graph with a signal-to-noise ratio of 75. Communities in the resulting co-occurrence network were identified by edge-betweenness clustering removing 130 edges. The resulting maximal network modularity was found to be 0.47. Simulation S1 (0.42 MB PDF) Click here for additional data file.(413K, pdf) Acknowledgments We thank Prof. M.E.J. Newman for providing the source code for eigenvector clustering. Footnotes Competing Interests: The authors have declared that no competing interests exist. Funding: This work was supported by grants from AIRC (Italian Association of Cancer Research) and CARIPLO (Cassa di Risparmio della Regione Lombardia). References 1. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2005;33:D39–45. [PubMed] 2. Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, et al. The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res. 2001;29:22–28. [PubMed] 3. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci U S A. 1999;96:4285–4288. [PubMed] 4. Marcotte EM, Pellegrini M, Thompson MJ, Yeates TO, Eisenberg D. A combined algorithm for genome-wide prediction of protein function. Nature. 1999;402:83–86. [PubMed] 5. Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, et al. Detecting protein function and protein-protein interactions from genome sequences. Science. 1999;285:751–753. [PubMed] 6. Huynen MA, Bork P. Measuring genome evolution. Proc Natl Acad Sci U S A. 1998;95:5849–5856. [PubMed] 7. Snel B, Bork P, Huynen MA. The identification of functional modules from the genomic association of genes. Proc Natl Acad Sci U S A. 2002;99:5890–5895. [PubMed] 8. Zhu S, Okuno Y, Tsujimoto G, Mamitsuka H. Application of a New Probabilistic Model for Mining Implicit Associated Cancer Genes from OMIM and Medline. Cancer Informatics. 2006;2:361–371. 9. Schuemie MJ, Weeber M, Schijvenaars BJ, van Mulligen EM, van der Eijk CC, et al. Distribution of information in biomedical abstracts and full-text publications. Bioinformatics. 2004;20:2597–2604. [PubMed] 10. Jenssen TK, Laegreid A, Komorowski J, Hovig E. A literature network of human genes for high-throughput analysis of gene expression. Nat Genet. 2001;28:21–28. [PubMed] 11. Stapley BJ, Benoit G. Biobibliometrics: information retrieval and visualization from co-occurrences of gene names in Medline abstracts. Pac Symp Biocomput. 2000:529–540. [PubMed] 12. Rebholz-Schuhmann D, Marcel S, Albert S, Tolle R, Casari G, et al. Automatic extraction of mutations from Medline and cross-validation with OMIM. Nucleic Acids Res. 2004;32:135–142. [PubMed] 13. Zhu S, Okuno Y, Tsujimoto G, Mamitsuka H. A probabilistic model for mining implicit ‘chemical compound-gene’ relations from literature. Bioinformatics. 2005;21(Suppl 2):ii245–ii251. [PubMed] 14. Perez-Iratxeta C, Bork P, Andrade MA. Association of genes to genetically inherited diseases using data mining. Nat Genet. 2002;31:316–319. [PubMed] 15. Masys DR, Welsh JB, Lynn Fink J, Gribskov M, Klacansky I, et al. Use of keyword hierarchies to interpret gene expression patterns. Bioinformatics. 2001;17:319–326. [PubMed] 16. Chaussabel D, Sher A. Mining microarray expression data by literature profiling. Genome Biol. 2002;3:RESEARCH0055. [PubMed] 17. Ding J, Berleant D, Nettleton D, Wurtele E. Mining MEDLINE: abstracts, sentences, or phrases? Pac Symp Biocomput. 2002:326–337. [PubMed] 18. Wren JD, Bekeredjian R, Stewart JA, Shohet RV, Garner HR. Knowledge discovery by automated identification and ranking of implicit relationships. Bioinformatics. 2004;20:389–398. [PubMed] 19. Smalheiser NR, Swanson DR. Using ARROWSMITH: a computer-assisted approach to formulating and assessing scientific hypotheses. Comput Methods Programs Biomed. 1998;57:149–153. [PubMed] 20. Tanabe L, Scherf U, Smith LH, Lee JK, Hunter L, et al. MedMiner: an Internet text-mining tool for biomedical information, with application to gene expression profiling. Biotechniques. 1999;27:1210–1214, 1216–1217. [PubMed] 21. Rindflesch TC, Tanabe L, Weinstein JN, Hunter L. EDGAR: extraction of drugs, genes and relations from the biomedical literature. Pac Symp Biocomput. 2000:517–528. [PubMed] 22. Halfon MS, Grad Y, Church GM, Michelson AM. Computation-based discovery of related transcriptional regulatory modules and motifs using an experimentally validated combinatorial model. Genome Res. 2002;12:1019–1028. [PubMed] 23. Jelinsky SA, Estep P, Church GM, Samson LD. Regulatory networks revealed by transcriptional profiling of damaged Saccharomyces cerevisiae cells: Rpn4 links base excision repair with proteasomes. Mol Cell Biol. 2000;20:8157–8167. [PubMed] 24. Kel A, Kel-Margoulis O, Babenko V, Wingender E. Recognition of NFATp/AP-1 composite elements within genes induced upon the activation of immune cells. J Mol Biol. 1999;288:353–376. [PubMed] 25. Markstein M, Markstein P, Markstein V, Levine MS. Genome-wide analysis of clustered Dorsal binding sites identifies putative target genes in the Drosophila embryo. Proc Natl Acad Sci U S A. 2002;99:763–768. [PubMed] 26. Pilpel Y, Sudarsanam P, Church GM. Identifying regulatory networks by combinatorial analysis of promoter elements. Nat Genet. 2001;29:153–159. [PubMed] 27. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM. Systematic determination of genetic network architecture. Nat Genet. 1999;22:281–285. [PubMed] 28. Brazma A, Vilo J, Ukkonen E, Valtonen K. Data mining for regulatory elements in yeast genome. Proc Int Conf Intell Syst Mol Biol. 1997;5:65–74. [PubMed] 29. Wasserman WW, Fickett JW. Identification of regulatory regions which confer muscle-specific gene expression. J Mol Biol. 1998;278:167–181. [PubMed] 30. Berman BP, Nibu Y, Pfeiffer BD, Tomancak P, Celniker SE, et al. Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc Natl Acad Sci U S A. 2002;99:757–762. [PubMed] 31. Elkon R, Linhart C, Sharan R, Shamir R, Shiloh Y. Genome-wide in silico identification of transcriptional regulators controlling the cell cycle in human cells. Genome Res. 2003;13:773–780. [PubMed] 32. Manke T, Bringas R, Vingron M. Correlating protein-DNA and protein-protein interaction networks. J Mol Biol. 2003;333:75–85. [PubMed] 33. Sudarsanam P, Pilpel Y, Church GM. Genome-wide co-occurrence of promoter elements reveals a cis-regulatory cassette of rRNA transcription motifs in Saccharomyces cerevisiae. Genome Res. 2002;12:1723–1731. [PubMed] 34. Bansal M, Belcastro V, Ambesi-Impiombato A, di Bernardo D. How to infer gene networks from expression profiles. Mol Syst Biol. 2007;3:122. 35. Finocchiaro G, Mancuso FM, Cittaro D, Muller H. Graph-based identification of cancer signaling pathways from published gene expression signatures using PubLiME. Nucleic Acids Res. 2007;35:2343–2355. [PubMed] 36. Chen SX, Liu JS. Statistical Applications of the Poisson-Binomial and Conditional Bernoulli Distributions. Statistica Sinica. 1997;7:875–892. 37. Newman ME. Finding community structure in networks using the eigenvectors of matrices. Phys Rev E Stat Nonlin Soft Matter Phys. 2006;74:036104. [PubMed] 38. Tarjan RE. Depth first search and linear graph algorithms. SIAM J Comp. 1972 39. Girvan M, Newman ME. Community structure in social and biological networks. Proc Natl Acad Sci U S A. 2002;99:7821–7826. [PubMed] 40. Flake GW, Lawrence S, Giles CL, Coetzee F. Self-Organization and Identification of Web Communities. IEEE Computer. 2002;35:66–71. 41. Dennis G, Jr, Sherman BT, Hosack DA, Yang J, Gao W, et al. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 2003;4:P3. [PubMed] 42. Goodman LA, Kruskal WH. Measures of association for cross classifications. New York-Berlin: Springer-Verlag; 1979. 43. Newman ME, Girvan M. Finding and evaluating community structure in networks. Phys Rev E Stat Nonlin Soft Matter Phys. 2004;69:026113. [PubMed] |
PubMed related articles
Your browsing activity is empty. Activity recording is turned off. |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Nucleic Acids Res. 2005 Jan 1; 33(Database issue):D39-45.
[Nucleic Acids Res. 2005]Nucleic Acids Res. 2001 Jan 1; 29(1):22-8.
[Nucleic Acids Res. 2001]Proc Natl Acad Sci U S A. 2002 Apr 30; 99(9):5890-5.
[Proc Natl Acad Sci U S A. 2002]Pac Symp Biocomput. 2000; ():529-40.
[Pac Symp Biocomput. 2000]Nucleic Acids Res. 2004; 32(1):135-42.
[Nucleic Acids Res. 2004]Bioinformatics. 2005 Sep 1; 21 Suppl 2():ii245-51.
[Bioinformatics. 2005]Nat Genet. 2002 Jul; 31(3):316-9.
[Nat Genet. 2002]Bioinformatics. 2001 Apr; 17(4):319-26.
[Bioinformatics. 2001]Genome Res. 2002 Jul; 12(7):1019-28.
[Genome Res. 2002]Genome Res. 2002 Nov; 12(11):1723-31.
[Genome Res. 2002]Nucleic Acids Res. 2007; 35(7):2343-55.
[Nucleic Acids Res. 2007]Nucleic Acids Res. 2007; 35(7):2343-55.
[Nucleic Acids Res. 2007]Nucleic Acids Res. 2007; 35(7):2343-55.
[Nucleic Acids Res. 2007]Nucleic Acids Res. 2007; 35(7):2343-55.
[Nucleic Acids Res. 2007]Nucleic Acids Res. 2007; 35(7):2343-55.
[Nucleic Acids Res. 2007]Phys Rev E Stat Nonlin Soft Matter Phys. 2006 Sep; 74(3 Pt 2):036104.
[Phys Rev E Stat Nonlin Soft Matter Phys. 2006]Proc Natl Acad Sci U S A. 2002 Jun 11; 99(12):7821-6.
[Proc Natl Acad Sci U S A. 2002]Nucleic Acids Res. 2007; 35(7):2343-55.
[Nucleic Acids Res. 2007]Genome Biol. 2003; 4(5):P3.
[Genome Biol. 2003]Nucleic Acids Res. 2007; 35(7):2343-55.
[Nucleic Acids Res. 2007]Nucleic Acids Res. 2007; 35(7):2343-55.
[Nucleic Acids Res. 2007]Bioinformatics. 2004 Feb 12; 20(3):389-98.
[Bioinformatics. 2004]Nucleic Acids Res. 2007; 35(7):2343-55.
[Nucleic Acids Res. 2007]Nucleic Acids Res. 2007; 35(7):2343-55.
[Nucleic Acids Res. 2007]Phys Rev E Stat Nonlin Soft Matter Phys. 2004 Feb; 69(2 Pt 2):026113.
[Phys Rev E Stat Nonlin Soft Matter Phys. 2004]