Copyright Prieto et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Human Gene Coexpression Landscape: Confident Network Derived from Tissue Transcriptomic Profiles

Bioinformatics and Functional Genomics Research Group, Cancer Research Center (CIC-IBMCC, CSIC/USAL), Salamanca, Spain

Nicholas James Provart, Editor
University of Toronto, Canada

* E-mail: jrivas/at/usal.es

Conceived and designed the experiments: CP JDLR. Performed the experiments: CP AR CF. Analyzed the data: CP AR CF. Contributed reagents/materials/analysis tools: CP AR CF. Wrote the paper: JDLR.

Received July 3, 2008; Accepted November 5, 2008.

Abstract

Background

Analysis of gene expression data using genome-wide microarrays is a technique often used in genomic studies to find coexpression patterns and locate groups of co-transcribed genes. However, most studies done at global “omic” scale are not focused on human samples and, when they are, very often include heterogeneous datasets, mixing normal with disease-altered samples. Moreover, the technical noise present in genome-wide expression microarrays is another well reported problem that is often not addressed with robust statistical methods, and the estimation of errors in the data is not provided.

Methodology/Principal Findings

Human genome-wide expression data from a controlled set of normal-healthy tissues are used to build a confident human gene coexpression network avoiding both pathological and technical noise. To achieve this we describe a new method that combines several statistical and computational strategies: robust normalization and expression signal calculation; correlation coefficients obtained by parametric and non-parametric methods; random cross-validations; and estimation of the statistical accuracy and coverage of the data. All these methods provide a series of coexpression datasets where the level of error is measured and can be tuned. To define the errors, the rates of true positives are calculated by assignment to biological pathways. The results provide a confident human gene coexpression network that includes 3327 gene-nodes and 15841 coexpression-links, and a comparative analysis shows good improvement over previously published datasets. Further functional analysis of a subset core network, validated by two independent methods, shows coherent biological modules that share common transcription factors. The network reveals a map of coexpression clusters organized in well defined functional constellations. Two major regions in this network correspond to genes involved in nuclear and mitochondrial metabolism, and investigations on their functional assignment indicate that more than 60% are house-keeping and essential genes. The network displays new non-described gene associations and it allows the placement in a functional context of some unknown non-assigned genes based on their interactions with known gene families.

Conclusions/Significance

The identification of stable and reliable human gene to gene coexpression networks is essential to unravel the interactions and functional correlations between human genes at an omic scale. This work contributes to this aim, and we are making the validated human gene coexpression networks available to the scientific community, to allow further analyses on the network or on some specific gene associations. The data are available free online at http://bioinfow.dep.usal.es/coexpression/.
Introduction

Exploration and analysis of gene expression data using genome-wide microarrays is a technique often used in genomic studies to find coexpression patterns and locate groups of co-transcribed genes. This kind of study has been used in model organisms, like yeast [1], to discover gene functions, to define biological processes and to find related transcription factors and their products. The main features of expression patterns that give them wide utility in bioinformatic studies are: the associated functional information [2], the high conservation of gene coexpression groups along evolution [3] and the high correlation of these groups with biomolecular pathways or reactions [4]. All these features support the use of genome-wide expression profiling and make this topic an active research area.

Despite this interest, coexpression studies done at global “omic” scale are in many cases not focused on human samples [5], and, when they are, very often they include heterogeneous datasets, mixing “normal” samples with “disease altered” samples from patients suffering from some kind of pathological state. This is the case, for example, in several large human gene expression studies [2], [6]. The inclusion of many disease datasets (mainly from cancer) in such meta-analyses may introduce strong bias and produce a lot of biological noise in the results. In fact, it is well known that cancer cells have altered genomes. Therefore, this kind of study cannot be used to clarify how a normal-healthy human cellular system works, nor to draw a reliable map of the human gene coexpression landscape. The technical noise in genome-wide expression microarray studies is another well reported problem that cannot be ignored when gene coexpression studies at “omic” scale are undertaken.
Considering all these problems and knowing the interest of having a reliable normal human gene coexpression network, we have undertaken this task selecting human genome-wide expression microarrays from a controlled set of different normal tissues to build a confident human transcriptomic network using several statistical and computational methods. These methods (which include robust data normalization and signal calculation, combined parametric and non-parametric correlation and random cross-validation) help to avoid both biological and technical noise and provide a human gene coexpression network that shows good accuracy and coverage. Moreover, the network reveals well defined biological functions and pathways that map to specific coexpression clusters.

Results and Discussion

Genome-wide expression profiles from a broad set of human samples

An expression matrix was calculated for a dataset of human genome-wide microarrays hybridized with mRNA samples coming from different human tissues, glands and organs from healthy normal individuals. As indicated in Materials and Methods, the dataset included two biological replicates of samples from 24 parts of the body: adrenal gland, appendix, blood, bone marrow, brain, kidney, liver, lung, lymph node, muscle heart, ovary, pancreas, pituitary gland, prostate gland, salivary gland, skin, spinal cord, testis, thymus gland, thyroid gland, tongue, tonsil gland, trachea and uterus.

Figure 1
The heatmaps (Figures 1A and 1B = 100%). However, within the tissues and organs only two stable groups were found with both methods: the group that includes lymph node, thymus gland and tonsil gland (which gave an AU value of 0.98); and the group that includes kidney and adrenal gland (with an AU value of 0.97). These groups have clear biological meaning since they correspond to physiologically and functionally related organs (i.e. lymph node, thymus and tonsil are related to the lymphatic and immune systems). Thus the functional relationship between samples is captured by the gene expression profiles. However, all the other tree branches produced low AU values; therefore the overall sample clustering observed in the heatmaps indicates a lack of well defined and stable groups. In conclusion, these results show neat separation of most of the sample expression profiles, which is an adequate condition for the exploration of a global broad human gene expression landscape.

To check whether these observations are reliable enough, we explored the data changing some conditions, following two other strategies (data not shown). In the first strategy, the same analyses with 48 microarrays were repeated twice: once using not the total number of genes (i.e. 22,283 gene probesets) but only the 25% of the genes that showed the largest variance; and once using only the 25% of the genes that showed the highest signal. In both cases, the heatmap and trees obtained were very similar to the ones presented in Figure 1

From sample expression profiles to gene expression signatures

The main data presented so far correspond to the analysis of the genome-wide expression profiles of samples from different human normal tissues, organs or glands. These genome-wide “sample profiles” are numerical vectors including the expression values of each one of the gene probesets present in the microarray (i.e. each one of the detectable genes of the human genome).
As shown above, the “sample profiles” can resemble the physiological relationships expected between the samples (tissues, glands and organs). However, in order to achieve a mapping of the human gene coexpression landscape, we needed to move from the analysis of the “sample expression profiles” based on the genes, to the analysis of each “gene expression signature” based on the sample set. It is difficult to achieve a proper gene coexpression study due to several obstacles that have to be taken into consideration: (i) the technical noise present in the microarrays at genomic scale [10], despite the fact that the Affymetrix high density oligonucleotide genechips have been reported to be quite reliable and reproducible [11], [12]; (ii) the small number of samples used to define each gene expression signature (especially in comparison to the large number of genes); (iii) the strong heterogeneity of the datasets frequently studied, which in many cases include samples from pathological or altered states [2], [13] that are not adequate to find “normal” gene expression behavior.

The approach and strategies taken in this study to solve or minimize these problems were the following: (a) careful selection of expression samples from different parts of the human body (tissues, whole glands and whole organs) from normal healthy individuals; (b) calculation of expression signals and correlations using two different independent methods: MAS5-Spearman, RMA-Pearson; (c) use of a robust random cross-validation strategy to find the most stable correlation pairs and distinguish the consistent biological-signal from the noise-signal; (d) statistical estimation of the accuracy and the coverage for each coexpression dataset obtained. All the details and description of these strategies are presented in Materials and Methods. The results associated with them have been partially described above and are explained in the following paragraphs.
Gene pairs coexpression analyzed with cross-validated correlations

The complete expression data matrix analyzed had, as indicated, 48 samples (24 duplicates) and 22,283 gene probesets (which correspond to 13,068 distinct known human genes according to Affymetrix annotation). Therefore the global pair-wise gene coexpression matrix including all possible pairs had 248,254,903 data points (22,283 × 22,282 / 2) and was calculated twice, once for each independent method used (MAS5-Spearman and RMA-Pearson). These huge data matrices contain many false coexpression pairs, and to detect the positive gene pairs that had stable and significant correlation we used cross-validation. The results corresponding to the gene pair correlations obtained with the cross-validation method (described in Methods) are presented in Figure 2.
To demonstrate how the rN-plots represent stable and consistent correlations, we selected in the case of the red circles or dots only the gene probeset pairs that correspond to probesets assigned to “the same gene”. For example, pairs between the 4 probesets that correspond to gene ALDOB, fructose bisphosphate aldolase B (204704_s_at, 204705_x_at, 211357_s_at, and 217238_s_at in microarray HGU133A); or pairs between the 3 probesets that correspond to gene CDK10, cell division protein kinase 10 (203468_at, 203469_s_at and 210622_x_at in HGU133A). When correlation is found between this kind of “common gene probesets” they are drawn as red circles in Figure 2 The differences observed between Fig. 2A and 2B = 0.225; and this is why the red circles with high-r and low-N only appear for values N>225. By contrast, the MAS5-Spearman method does not find any red circle in the high-r and low-N region, because Spearman is a “rank correlation coefficient” which does not produce high correlation values for gene pairs that correlated in only one tissue (just once out of 6). The r value obtained with the Spearman method is proportional to the number of tissues or samples that co-express and so it is roughly proportional to N.

Data filtering to clear genes with low information content

The calculations and analysis presented in Figure 2 As described in Methods, we use a combined filter based on between-sample variability and gene minimal signal, which is designed to get rid of genes with low information content. The use of this filter with the 48 microarrays sample set gave different results for the expression data matrix obtained with the RMA method and the expression matrix obtained with the MAS5 method. In the first case the filter leaves out 6,893 gene probesets (leaving 69.06%) and in the second 3,682 (leaving 83.48%) from 22,283 total gene probesets.
The difference in these numbers shows that these two methods do not provide an equal calculation of expression signal and variance and therefore, as explained below, both methods can be considered complementary.

Analysis of accuracy and coverage along gene coexpression data

Using the filtered datasets we carried out a more thorough analysis of the coexpression distributions with respect to the parameters r and N. In the rN-plots (Fig. 2 In Figure 3
In conclusion, the study shows that the RMA-Pearson method has better coverage of the coexpression landscape and the MAS5-Spearman is more accurate to find coexpression pairs. These results support the use of both methods in order to find a confident human coexpression network, since they do not find exactly the same expression signal and both provide important and complementary data allowing a progressive improvement of the significance and confidence of the coexpression set. Moreover, a better knowledge of the strength of each method is a discovery that complements previous comparative studies about RMA [7] and MAS5 [8].

Effects of gene filtering

The original coexpression data used in Figure 2 = 0.5 and N = 200 included 15,623 positive coexpression pairs; and this number was very similar to the 15,657 pairs found for RMA-Pearson filtered.

Integration of correlation, cross-validation and PPV for datasets obtained with two balanced methods

Following the observations and arguments described above we proceed to integrate in “three-dimensions color plots” the data corresponding to the values of correlation (r), cross-validation (N) and PPV obtained with each method. The results are shown in Figure 4
The three-dimensions color plots make it possible to assess graphically the level of confidence for a given coexpression data subset. We use them to select three data subsets derived from each method at three specific PPV values: ≥0.60, ≥0.70 and ≥0.80. The values of the correlation and cross-validation coefficients that correspond to these data subsets are indicated in the table enclosed as Fig. 4C

Biological significance of the coexpression datasets: house-keeping gene pairs and tissue-specific gene pairs

Once significant human gene coexpression datasets had been found and evaluated using statistical parameters, we started exploring the biological meaning and functional consistency of these datasets. In a first approach, we investigated the location of house-keeping gene pairs in the coexpression datasets, taking two different published compendiums of human house-keeping genes [16], [17]. Hsiao et al. identified 451 genes that are expressed in all 19 different human tissue types. Eisenberg et al. identified 575 human genes that show constitutive expression in all conditions tested in several publicly available databases. Mapping these genes in the general distribution of coexpression data shows that the ratio of house-keeping genes increases at high N and r coefficient values (Fig. 5A,B
We further investigate this observation by selecting subsets of the coexpression data for genes included in specific KEGG pathways. Examples of this subsetting are presented in Fig. 5C = hsa03010), (2) oxidative phosphorylation (hsa00190), (3) proteasome (hsa03050), (4) cytokine-cytokine receptor interaction (hsa04060), (5) neuroactive ligand-receptor interaction (hsa04080), and (6) complement and coagulation cascades (hsa04610). The first three pathways can be considered general and constitutive, present in all tissues and cell types. The other three pathways are tissue-specific, only present in some cell types, such as nervous system cells in the case of the neuroactive ligand-receptor interaction pathway or blood cells in the case of the complement and coagulation cascades pathway. These differences in functional specificity are reflected in the coexpression distributions: only the three panels on the right (Fig. 5C 4,5,6

Comparison of human coexpression datasets: molecular machines and pathways consistently co-regulated

In a second approach, we investigate the functional assignment of the gene coexpression data following the strategy taken by Stuart et al. [5], who explored functional coverage on a coexpression network obtained for four organisms by looking at the percentage of genes that are connected to at least one other gene in the same “functional category”. We perform the same percentage calculation using the KEGG pathways as “functional categories”. The analysis was done for the coexpression dataset derived from the RMA-Pearson method with r>0.63 and N>500. The same functional analysis was also done using two other external human coexpression datasets previously published by Lee et al. [2] and Griffith et al. [6]. The results are presented in Table 1, which includes the top ten pathways with the best percentage of genes coexpressing within the gene groups assigned to KEGG pathways for 3 different human coexpression datasets (this work, Lee et al. and Griffith et al.).

This comparative analysis of functional coverage shows some interesting results: (i) All coexpression datasets find the most significant coexpression for 3 key molecular machines: ribosome, proteasome and oxidative phosphorylation. (ii) Genes involved in cell scaffolding and cell to cell interaction or anchoring are also found to coexpress quite often, as indicated by the presence of pathways like focal adhesion, extracellular matrix (ECM) interaction and cytoskeleton regulation. (iii) Genes involved in the cell cycle pathway are also common to the three datasets, indicating that cells keep a tight regulation of the genes involved in essential living functions (maintenance, proliferation, survival). (iv) An important difference between our coexpression dataset and the Lee et al. or Griffith et al. datasets is that this work only includes samples coming from normal non-pathological tissues, whereas the others include quite heterogeneous samples mixing normal and disease altered samples (for example, Lee et al. includes many human cancer samples). The inclusion of pathological samples can bias the results and this may be the reason for the appearance of “pathogenic infection pathways” in the Lee et al. data. (v) Finally, the data obtained in this work also include many coexpressing pairs involved in cell-cell communication like cytokine-receptor and ligand-receptor interactions.

As a general conclusion of this analysis, we can say that the KEGG pathway database proves to be a good resource to investigate the biological functions of human genes, because it includes groups of genes that really work together in well defined biomolecular processes. The comparative calculation of the coverage for the three human coexpression datasets included in Table 1 indicates that the data obtained in this work present a higher level of functional coherence than previously published datasets [2], [6].
This comparison was also done taking coexpression networks of similar sizes (including in each case around 12,000 best coexpression relations) and calculating the statistical accuracy for all of them. The result presented in Table 2 shows that the accuracy estimated as PPV was 0.61 for our dataset obtained with MAS5-Spearman, 0.56 for Lee et al. and 0.49 for Griffith et al. As a whole these numbers indicate that the human coexpression network derived from this work includes very consistent co-regulation of genes that are often involved in common pathways.

A high confidence human coexpression network reveals a map of ubiquitous biological functions

As far as we know, none of the previously published human coexpression networks [2], [5], [6] has a comprehensive calculation of the estimated statistical error in the datasets at different levels of coverage. However, following the analysis and data presented in Figure 4 Figure 6
As a whole the network is quite stringent but it is functionally very coherent. Moreover, since it comes from the intersection of two methods, it is expected to include mainly essential human genes. To test whether this network is enriched in house-keeping and essential genes we identified the nodes of the network that are included in the Hsiao human house-keeping gene set [16] and we also identified the nodes that correspond to genes that are orthologous to known essential yeast genes (taken from the SGD database). In this way, we found that the two major constellations of the network, including mainly genes involved in nuclear related and mitochondrial related metabolism, show respectively 63% and 58% of genes assigned to be house-keeping. This result reveals that the coexpression network is enriched in essential genes. In conclusion, the functional consistency observed in the constellations and regions defined by the coexpression network and the enrichment in house-keeping genes place the genes in a new integrative relational context that has strong biological coherence and, in many cases, can reveal essential or ubiquitous biological processes. The network also unravels new non-described human gene associations. All the details about this coexpression network are provided in a supplementary file for Cytoscape (Supporting Information File S1: S1_HumanCoexpNtw_615g_cys.zip; that can be downloaded and used as a .cys file to be explored interactively using Cytoscape). This file also includes information about each node with GO and KEGG functional annotations.

Analysis of the network with clustering algorithms

The network described above was analyzed using a graph theoretic clustering algorithm called MCODE [19] as indicated in Materials and Methods. The result of this analysis is presented in Figure 6 We also applied another cluster algorithm for graphs called MCL [20] (see Methods).
The analysis with MCL provided similar results to MCODE for the large clusters mentioned, although it splits the network into more clusters, the smaller ones being more coherent in functional terms than the ones found by MCODE. For example, the MCL algorithm finds another cluster formed by 15 genes, with 7 assigned to RNA binding gene products, 3 to DNA binding gene products (all included in region blue in Figure 6 These results show that the gene clusters obtained with the graph algorithms from the coexpression network can help to understand the function of many human genes and the active relations between them. As expected, we find that stable and consistent coexpression clusters of genes are involved in specific functions, at cellular or systemic level. A complete analysis of all clusters is not possible in just one article but, as indicated above, the coexpression datasets of this study are open to new studies.

Functional coherence of the coexpressing modules: finding coregulation and new biological assignments

To show some specific examples of the functional coherence of the gene coexpressing modules and the correlation of the genes with common regulatory elements (i.e. transcription factors, TFs, and corresponding promoters) we analyzed three specific clusters or modules found in the core coexpression network. The first module includes 10 genes: 8 forming a fully cross-related octagonal structure plus 2 nodes linked to them. The 8 genes are all metallothioneins: MT1E, MT1F, MT1G, MT1H, MT1L, MT1M, MT1X, MT2A. The other 2 genes are not well annotated: DDX42 (which encodes a member of the DEAD box protein family with unclear function) and LOC645745 (which has been recently and provisionally identified as a putative MT1, metallothionein 1 pseudogene 2). The coexpression of these two genes with a well defined and stable cluster of metallothioneins suggests that they are also involved in metal ion homeostasis. This module can be seen in Figure 7
A further analysis was done to find out whether these coexpressing genes have any common transcription factor (TF) that can act on their promoters and regulatory regions. Two bioinformatic tools were used to find TFs significantly associated with the coexpressing genes: PAP [21] and FactorY (see Methods). Using PAP we found that the 10 coexpressing genes of module 1 are regulated in common by the transcription factor MTF1 (found with p-value = 0.001). This result could be expected since MTF1 is a metal-regulatory transcription factor that induces expression of metallothioneins and other genes involved in metal homeostasis (such as zinc and copper). In any case, the association of MTF1 to module 1 provides strong coherence to the data, showing that this coexpression network is correlated with an underlying transcription regulatory entity.

The second module shown in Figure 7 Finally, the third module shown in Figure 7 The results presented for three coexpression modules can be extended to most of the clusters present in the network, and they indicate that the coexpression network can be correlated with an underlying regulatory network driven by specific transcription factors. This observation provides biological and functional coherence to the human gene pairwise coexpression network presented in this paper, deduced from the analysis of normal-healthy human samples (whole tissues, glands or organs).

Finally, it is clear that a complete pairwise coexpression network of human genes will only be obtained using a comprehensive and systematic set of samples including all different human cell types. This achievement is at present quite far off and difficult, since there are more than two hundred different cell types in the human body and each cell type can be at different development or differentiation stages.
Meanwhile, however, we think that the present study reports a reliable gene-gene coexpression network that includes very valuable information about many human genes, placing them in an integrated transcriptomic context. These coexpression networks selected at specific levels of confidence include a lot of information to better understand the complexity of the human expressing genome.

Materials and Methods

Sample selection: dataset of genome-wide expression microarrays from human normal whole tissues/glands/organs

The data used in this work correspond to a set of human genome-wide expression microarrays hybridized with mRNA samples coming from different human tissues, glands or organs from healthy normal individuals. The complete list of tissues, glands and organs is: adrenal gland, appendix, blood, bone marrow, brain, kidney, liver, lung, lymph node, muscle heart, ovary, pancreas, pituitary gland, prostate gland, salivary gland, skin, spinal cord, testis, thymus gland, thyroid gland, tongue, tonsil gland, trachea and uterus. These 24 samples were selected from a larger set of 68 human samples (GEO GSE1133; Su et al. 2004) that also included some cell specific sources, like: lung bronchial epithelial cells HBEC, blood B-cells CD19 and T-cells CD4. The sample selection followed the criterion of including mRNA samples from whole organs, glands or tissues covering the main parts of the human body and avoiding samples of very specific cell types within a tissue. This selection was validated by performing global expression analyses of the samples, using a series of algorithms described below. The total mRNA from these 24 different samples came from a mix of 3 different individuals: two men and one woman or one man and two women for the samples that are not sex-associated; three men for testis and prostate samples and three women for ovary and uterus samples. Moreover, two biological replicates were used in each case, producing a total set of 48 microarrays.
The microarrays used were high density oligonucleotide microarrays HGU133A GeneChips from Affymetrix, which include 22,283 probesets (corresponding to 13,068 human genes according to Affymetrix annotation).

Genome-wide sample expression profiles and gene expression signatures

The global expression matrix including the genome-wide expression profiles of each sample and the expression signature of each gene-probeset was calculated and evaluated using a set of algorithms and methods in four consecutive steps: (1st) use of two different background correction, normalization and signal calculation methods: MAS5 [8], [27] and RMA [28]; (2nd) use of two distance measuring methods based on the global gene expression profile of each sample: first, distance based on Spearman correlation coefficient applied to MAS5 data; second, distance based on Pearson correlation coefficient applied to RMA data (both methods provided robust non-parametric distance distributions); (3rd) analysis by hierarchical clustering with complete linkage of the samples using the tool hclust from R (http://www.r-project.org/), taking as distance (1−r), where r is the correlation coefficient between sample expression profiles [29]; (4th) analysis by bootstrapping of the sample hierarchical trees to assay the stability of the associations, using the tool pvclust from R. The pvclust algorithm assesses the uncertainty in hierarchical cluster analysis via multiscale bootstrap resampling. This assessment is provided by two parameters: the approximately unbiased p-value (AU) and the bootstrap probability value (BP). The maximum and optimum values of AU and BP are 1 (or 100 in %).

Gene pairs coexpression and cross-validation

As indicated above, the global gene to gene (i.e. pair-wise) coexpression matrix was calculated using two different and independent methods: MAS5-Spearman and RMA-Pearson. Furthermore, cross-validation was used to discriminate stable and significant correlations.
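As an illustration of the two pair-wise correlation matrices mentioned above, the following is a minimal sketch in Python with NumPy, not the code used in the study. It assumes a genes × samples expression matrix; the function names and the toy data are hypothetical, and the Spearman variant breaks ties arbitrarily (average ranks would be more faithful).

```python
import numpy as np

def pearson_matrix(expr):
    """Pearson correlation between rows (genes) of a genes x samples matrix."""
    return np.corrcoef(expr)

def spearman_matrix(expr):
    """Spearman correlation computed as the Pearson correlation of ranks.

    Ranks come from a double argsort, so ties are broken arbitrarily
    (a simplification assumed here for brevity).
    """
    ranks = np.argsort(np.argsort(expr, axis=1), axis=1).astype(float)
    return np.corrcoef(ranks)

# Toy data (hypothetical): 5 genes measured in 48 samples.
rng = np.random.default_rng(0)
expr = rng.normal(size=(5, 48))
expr[1] = 2.0 * expr[0] + 0.1 * rng.normal(size=48)  # gene 1 tracks gene 0

r_pearson = pearson_matrix(expr)
r_spearman = spearman_matrix(expr)

# Number of distinct gene pairs for n genes is n * (n - 1) / 2.
n_genes = expr.shape[0]
n_pairs = n_genes * (n_genes - 1) // 2
```

With the 22,283 probesets of the HGU133A array the same pair formula gives 22,283 × 22,282 / 2 = 248,254,903 pairs, so a real run would need a chunked or memory-aware implementation rather than a single dense matrix.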
The cross-validation strategy applied was a random selection, repeated 1000 times, of a 25% subset of the samples (that is, 12 samples, corresponding to 6 duplicates out of the 24 duplicated samples) and calculation of the r correlation coefficient for each gene-probeset pair in each of the 1000 samplings. Only when the r correlation coefficient in a given sampling was higher than |0.70| was it considered a positive event (positive cross-validation) and counted for the corresponding gene-probeset pair. In this way, for example, a given gene pair with N = 620 means that it was positive 620 times out of the 1000 samplings. Therefore N can be considered a cross-validation coefficient or cross-validation factor (N = 620 is equivalent to 620/1000 = 0.62).

Gene filtering method

In order to get rid of genes with low information content a combined filter based on between-sample variability and gene minimal signal was used. The filter leaves out only those gene probesets that fulfil both of the following conditions: 1st.- Genes which have an expression difference or variability between samples ($\Delta\mathrm{Exp}_{g_i}^{\text{highest-lowest}}$) lower than the median of all the expression differences calculated for each gene ($\Delta\mathrm{Exp}_{g_i}^{\text{highest-lowest}} < \mathrm{median}\,\Delta\mathrm{Exp}^{\text{highest-lowest}}$); 2nd.- Genes which have a mean expression signal between samples ($\mathrm{meanExp}_{\text{samples}}$) lower than the median of all the expression signals calculated for each gene.

Statistical estimation of accuracy and coverage of the coexpression datasets

The accuracy measured as “Positive Predictive Value” (PPV) in statistical terms is defined as the ratio TP/(TP+FP), where TP is the number of true positives and FP is the number of false positives [30], [31]. This parameter is related to “error type I”, and it is the complement of the ratio of “false positives” (i.e. FP/(TP+FP), the percentage of false positives within all the positives).
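The cross-validation counting and the combined gene filter described above can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes Pearson correlation, columns ordered as consecutive replicate pairs per tissue, and it reads the threshold “higher than |0.70|” as |r| > 0.70. All function names and the toy data are hypothetical.

```python
import numpy as np

def cross_validation_counts(expr, n_tissues=24, n_draws=1000, n_pick=6,
                            r_min=0.70, seed=0):
    """Count, per gene pair, in how many random subsets |r| exceeds r_min.

    expr: genes x samples matrix whose columns are assumed to be ordered
    tissue0_rep1, tissue0_rep2, tissue1_rep1, tissue1_rep2, ...
    Each draw picks n_pick tissues and uses both replicates of each
    (6 tissues x 2 replicates = 12 samples by default).
    """
    rng = np.random.default_rng(seed)
    n_genes = expr.shape[0]
    counts = np.zeros((n_genes, n_genes), dtype=int)
    for _ in range(n_draws):
        tissues = rng.choice(n_tissues, size=n_pick, replace=False)
        cols = np.concatenate([2 * tissues, 2 * tissues + 1])
        r = np.corrcoef(expr[:, cols])
        counts += np.abs(r) > r_min  # assumed reading of "higher than |0.70|"
    return counts

def low_information_mask(expr):
    """True for genes the combined filter removes (both conditions hold)."""
    spread = expr.max(axis=1) - expr.min(axis=1)   # highest minus lowest signal
    mean_signal = expr.mean(axis=1)                # mean signal across samples
    return (spread < np.median(spread)) & (mean_signal < np.median(mean_signal))

# Toy data (hypothetical): gene 1 follows gene 0 closely, gene 2 is independent.
rng = np.random.default_rng(1)
toy = rng.normal(size=(3, 48))
toy[1] = toy[0] + 0.01 * rng.normal(size=48)
counts = cross_validation_counts(toy, n_draws=50)

# Toy filter input: only the first gene has both low spread and low mean.
toy_filter = np.array([[1.0, 1.0, 1.0, 1.0],
                       [1.0, 5.0, 1.0, 5.0],
                       [10.0, 10.0, 10.0, 10.0],
                       [10.0, 20.0, 10.0, 20.0]])
mask = low_information_mask(toy_filter)
```

In this sketch `counts[i, j]` plays the role of N for the pair (i, j), so dividing by `n_draws` gives the cross-validation factor. For the MAS5-Spearman variant, the selected columns would be ranked per gene before calling `np.corrcoef`.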
The coverage (sometimes also named recall) can be measured as the proportion of true positives that remain in a given selected subset with respect to an initial reference set of positives. We consider both accuracy and coverage to be critical statistical parameters for evaluating the error and validity of a method. They are directly related to specificity = TN/(TN+FP), where (TN+FP) are all the “false”, and sensitivity = TP/(TP+FN), where (TP+FN) are all the “true” [30]. However, specificity and sensitivity can only be applied when the real true and real false data of a test are known, whereas the accuracy defined as “positive predictive value” and the coverage defined above can be applied when it is only possible to know or estimate the “positive data”.

Therefore, in this study, since the true data are not known (i.e. we do not know a priori which are the truly coexpressing gene pairs), a proper calculation of the sensitivity and specificity is not possible. This is the most common situation in many biological and biomolecular studies, where many of the true relations between molecules are not yet known. We therefore need a way to at least estimate the percentage or ratio of “true positives” of the method, and so estimate the accuracy and coverage. These parameters provide a good indication of how valuable the method applied to find human coexpressing gene pairs is. The estimation was based on the idea that genes that work together in the same biological pathway are much more likely to coexpress than genes that are not involved in a common biological reaction or pathway. In our case, this biomolecular axiom was tested by annotating all the genes of the microarrays to the KEGG pathway database (www.genome.jp/kegg/), which is one of the most complete expert-curated repositories of human genes involved in biological reactions or pathways [32].
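A minimal sketch of how such an estimate could be computed from pathway annotations is given here. It assumes, as the text that follows specifies, that a pair counts as a true positive when both genes share at least one pathway, and that only pairs with both genes annotated are evaluated. The function names, gene names and pathway identifiers are hypothetical, and this is not the authors' code.

```python
def shares_pathway(gene_a, gene_b, pathways_by_gene):
    """True when both genes are annotated to at least one common pathway."""
    return bool(pathways_by_gene[gene_a] & pathways_by_gene[gene_b])


def accuracy_and_coverage(selected_pairs, reference_pairs, pathways_by_gene):
    """Estimate PPV and coverage, using pathway co-membership as 'true positive'.

    selected_pairs: pairs kept at some (r, N) threshold.
    reference_pairs: the initial reference set of pairs.
    pathways_by_gene: dict mapping each annotated gene to a set of pathway ids.
    """
    def annotated(pair):
        return all(gene in pathways_by_gene for gene in pair)

    def true_positives(pairs):
        return {frozenset(p) for p in pairs
                if annotated(p) and shares_pathway(p[0], p[1], pathways_by_gene)}

    # Only pairs with both genes annotated can be judged true or false.
    evaluated = {frozenset(p) for p in selected_pairs if annotated(p)}
    tp_selected = true_positives(selected_pairs)
    tp_reference = true_positives(reference_pairs)

    ppv = len(tp_selected) / len(evaluated) if evaluated else 0.0
    coverage = (len(tp_selected & tp_reference) / len(tp_reference)
                if tp_reference else 0.0)
    return ppv, coverage


# Hypothetical toy annotation, for illustration only.
toy_pathways = {
    "geneA": {"path1"},
    "geneB": {"path1", "path2"},
    "geneC": {"path2"},
    "geneD": {"path3"},
}
toy_reference = [("geneA", "geneB"), ("geneB", "geneC"), ("geneA", "geneC"),
                 ("geneA", "geneD"), ("geneC", "geneD")]
toy_selected = [("geneA", "geneB"), ("geneA", "geneD"), ("geneA", "geneX")]
toy_ppv, toy_coverage = accuracy_and_coverage(toy_selected, toy_reference,
                                              toy_pathways)
```

Recomputing both values while tightening the r and N thresholds shows the trade-off between the two: accuracy is expected to rise as coverage falls.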
Therefore, selecting only the subset of genes annotated to KEGG, a gene coexpression pair was considered a “true positive” when both genes of the pair were included in a common KEGG human pathway. This strategy makes it possible to calculate the accuracy and coverage defined above, and therefore to explore how the values of the r and N coefficients change these parameters.

Analytic algorithms to find groups and modules in the coexpression networks

The gene-to-gene coexpression networks obtained were analyzed using a graph-theoretic clustering algorithm called MCODE (Molecular Complex Detection) [19], which detects densely connected regions in large interaction networks that may represent molecular associations. This algorithm weights vertices by local neighbourhood density and then traverses outward from locally dense seed nodes to isolate the dense regions. Furthermore, the networks were also analyzed using another graph clustering algorithm called MCL (Markov Cluster algorithm, http://micans.org/mcl/) [20], which finds cluster structure in graphs by a mathematical bootstrapping procedure. MCL has been shown to be very robust at finding relevant modules in protein interaction networks [33].

Mapping transcription factors associated to gene coexpressing modules

Two bioinformatic tools were used to find transcription factors that can be significantly associated with groups or modules of coexpressing genes: Promoter Analysis Pipeline (PAP) and Transcription Factor Enrichment Analysis (FactorY). PAP is based on a systematic, statistical model of mammalian transcriptional regulatory sequence analysis, and it is suitable for identifying the potential transcriptional regulators of co-expressed genes and the potential regulatory targets of transcription factors.
A typical PAP analysis includes input of a co-expressed gene cluster, identification of several high-scoring transcription factors and visualization of the predicted transcription factor binding sites [21]. The tool is available at: http://bioinformatics.wustl.edu/webTools/portalModule/PromoterSearch.do. FactorY is another bioinformatic tool that explores the 1000 bp upstream sequence signature of co-expressed genes to find homology with transcription factor binding sites (TFBs) based on the JASPAR and TRANSFAC databases. The tool calculates the significant enrichment in known TFBs for a group of genes, and it was used at the web site: http://www.garban.org/factory/.

File S1

Human Gene Coexpression Network. Network that corresponds to the core with the most confident human gene pairwise coexpression data and includes 615 gene-nodes and 2190 coexpression-links. This network is provided in Cytoscape format (.cys file compressed as .zip) with full annotations about the genes. The file to be run in Cytoscape should have the .cys extension: S1_HumanCoexpNtw_615g.cys (0.30 MB ZIP)

Acknowledgments

We thank the support provided by the Instituto de Salud Carlos Tercero, Ministry of Health, Spanish Government (ISCiii-FIS, MSyC) and by the Consejería de Educación, Castilla-Leon Local Government (JCyL).

Footnotes

Competing Interests: The authors have declared that no competing interests exist.

Funding: Funding and grant support was provided by the Ministry of Health, Spanish Government (ISCiii-FIS, MSyC; Project reference PI061153) and by the Ministry of Education, Castilla-Leon Local Government (JCyL; Project reference CSI03A06). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. van Noort V, Snel B, Huynen MA. The yeast coexpression network has a small-world, scale-free architecture and can be explained by a simple model. EMBO Rep.
2004;5:280–284.
2. Lee HK, Hsu AK, Sajdak J, Qin J, Pavlidis P. Coexpression analysis of human genes across many microarray data sets. Genome Res. 2004;14:1085–1094.
3. Tirosh I, Weinberger A, Carmi M, Barkai N. A genetic signature of interspecies variations in gene expression. Nat Genet. 2006;38:830–834.
4. Magwene PM, Kim J. Estimating genomic coexpression networks using first-order conditional independence. Genome Biol. 2004;5:R100.
5. Stuart JM, Segal E, Koller D, Kim SK. A gene-coexpression network for global discovery of conserved genetic modules. Science. 2003;302:249–255.
6. Griffith OL, Pleasance ED, Fulton DL, Oveisi M, Ester M, et al. Assessment and integration of publicly available SAGE, cDNA microarray, and oligonucleotide microarray expression data for global coexpression analyses. Genomics. 2005;86:476–488.
7. Bolstad BM, Irizarry RA, Astrand M, Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003;19:185–193.
8. Lim WK, Wang K, Lefebvre C, Califano A. Comparative analysis of microarray normalization procedures: effects on reverse engineering gene networks. Bioinformatics. 2007;23:i282–288.
9. Suzuki R, Shimodaira H. Pvclust: an R package for assessing the uncertainty in hierarchical clustering. Bioinformatics. 2006;22:1540–1542.
10. Wang Y, Miao ZH, Pommier Y, Kawasaki ES, Player A. Characterization of mismatch and high-signal intensity probes associated with Affymetrix genechips. Bioinformatics. 2007;23:2088–2095.
11. Barnes M, Freudenberg J, Thompson S, Aronow B, Pavlidis P. Experimental comparison and cross-validation of the Affymetrix and Illumina gene expression analysis platforms. Nucleic Acids Res. 2005;33:5914–5923.
12. Dallas PB, Gottardo NG, Firth MJ, Beesley AH, Hoffmann K, et al. Gene expression levels assessed by oligonucleotide microarray analysis and quantitative real-time RT-PCR – how well do they correlate? BMC Genomics. 2005;6:59.
13. Choi JK, Yu U, Yoo OJ, Kim S. Differential coexpression analysis using microarray data and its application to human cancer. Bioinformatics. 2005;21:4348–4355.
14. Prieto C, Rivas MJ, Sanchez JM, Lopez-Fidalgo J, De Las Rivas J. Algorithm to find gene expression profiles of deregulation and identify families of disease-altered genes. Bioinformatics. 2006;22:1103–1110.
15. Calza S, Raffelsberger W, Ploner A, Sahel J, Leveillard T, et al. Filtering genes to improve sensitivity in oligonucleotide microarray data analysis. Nucleic Acids Res. 2007;35:e102.
16. Hsiao LL, Dangond F, Yoshida T, Hong R, Jensen RV, et al. A compendium of gene expression in normal human tissues. Physiol Genomics. 2001;7:97–104.
17. Eisenberg E, Levanon EY. Human housekeeping genes are compact. Trends Genet. 2003;19:362–365.
18. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13:2498–2504.
19. Bader GD, Hogue CW. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics. 2003;4:2.
20. Enright AJ, Van Dongen S, Ouzounis CA. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 2002;30:1575–1584.
21. Chang LW, Fontaine BR, Stormo GD, Nagarajan R. PAP: a comprehensive workbench for mammalian transcriptional regulatory sequence analysis. Nucleic Acids Res. 2007;35:W238–244.
22. Falvo JV, Parekh BS, Lin CH, Fraenkel E, Maniatis T. Assembly of a functional beta interferon enhanceosome is dependent on ATF-2-c-jun heterodimer orientation. Mol Cell Biol. 2000;20:4814–4825.
23. Panne D, Maniatis T, Harrison SC. Crystal structure of ATF-2/c-Jun and IRF-3 bound to the interferon-beta enhancer. EMBO J. 2004;23:4384–4393.
24. Kypriotou M, Beauchef G, Chadjichristos C, Widom R, Renard E, et al. Human collagen Krox up-regulates type I collagen expression in normal and scleroderma fibroblasts through interaction with Sp1 and Sp3 transcription factors. J Biol Chem. 2007;282:32000–32014.
25. Magee C, Nurminskaya M, Faverman L, Galera P, Linsenmayer TF. SP3/SP1 transcription activity regulates specific expression of collagen type X in hypertrophic chondrocytes. J Biol Chem. 2005;280:25331–25338.
26. Poree B, Kypriotou M, Chadjichristos C, Beauchef G, Renard E, et al. Interleukin-6 (IL-6) and/or Soluble IL-6 Receptor Down-regulation of Human Type II Collagen Gene Expression in Articular Chondrocytes Requires a Decrease of Sp1·Sp3 Ratio and of the Binding Activity of Both Factors to the COL2A1 Promoter. J Biol Chem. 2008;283:4850–4865.
27. Liu WM, Mei R, Di X, Ryder TB, Hubbell E, et al. Analysis of high density expression microarrays with signed-rank call algorithms. Bioinformatics. 2002;18:1593–1599.
28. Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, et al. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 2003;31:e15.
29. Murtagh F. Multidimensional Clustering Algorithms. COMPSTAT Lectures. Wuerzburg: Physica-Verlag; 1985.
30. Loong TW. Understanding sensitivity and specificity with the right side of the brain. BMJ. 2003;327:716–719.
31. Suojanen JN. False false positive rates. N Engl J Med. 1999;341:131.
32. Aoki-Kinoshita KF, Kanehisa M. Gene annotation and pathway mapping in KEGG. Methods Mol Biol. 2007;396:71–91.
33. Brohee S, van Helden J. Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics. 2006;7:488.