- Journal List
- BMC Bioinformatics
- v.10; 2009
- PMC2657217

# Detecting intergene correlation changes in microarray analysis: a new approach to gene selection

^{1}Department of Biostatistics and Computational Biology, University of Rochester, 601 Elmwood Avenue, Box 630, Rochester, New York 14642, USA

^{2}Department of Probability and Statistics, Charles University, Sokolovska 83, Prague 18675, Czech Republic

^{}Corresponding author.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## Abstract

### Background

Microarray technology is commonly used as a simple screening tool with a focus on selecting genes that exhibit extremely large differential expressions between different phenotypes. It lacks the ability to select genes that change their relationships with other genes in different biological conditions (differentially correlated genes). We intend to enrich the above procedure by proposing a nonparametric selection procedure that selects differentially correlated genes.

### Results

Using both simulations and resampling techniques, we found that our procedure correctly detected genes that were not differentially expressed but differentially correlated. We also applied our procedure to a set of biological data and found some potentially important genes that were not selected by the traditional method.

### Discussion and Conclusion

Microarray technology yields multidimensional information on the function of the whole genome. Rather than treating intergene correlation as a nuisance to the traditional gene selection procedures which are essentially univariate, our method utilizes the rich information contained in the correlation as a new selection criterion. It can provide additional useful candidate genes for the biologists.

## Background

It has become common practice to use microarray technology to find "interesting" genes by comparing two or more different phenotypes. Modern methods of microarray data analysis typically employ two-sample statistical tests to test differential expression of genes, combined with multiple testing procedures to guard against Type 1 errors (see [1,2] for reviews). Such methods are biased towards selecting those genes that display the most pronounced differential expression. Once the list of genes showing statistically significant differential expression has been generated, these genes are often ranked using purely statistical criteria and this ranking is thought to reflect their relative importance. Quite typically, a certain number of genes with the smallest *p*-values are finally selected from the list of all "significant" genes. While most biologists recognize that the magnitude of differential expression does not necessarily indicate biological significance, in the absence of better methods, this remains the dominant means to initially prioritize candidate genes. From a biological perspective, the above-described paradigm is far from a perfectly valid approach, because genes are not independent entities – they can interact with each other in many ways. As an example, a "chain reaction" type of a dependence structure of gene expressions was documented in the literature [3]. In such situation, even a very small change in expression of a particular gene may have dramatic physiological consequences if the protein encoded by this gene plays a *catalytic *role in a specific cell function. Many other downstream genes may amplify the signal produced by this truly interesting gene, thereby increasing their chance to be selected by formal statistical methods. For an upstream regulatory gene, however, the chance of being selected by such methods may diminish as one keeps hunting for downstream genes that tend to show bigger changes in their expression. As a result, the initial list of candidates may be inflated with many effector genes that do little to elucidate the fundamental mechanisms of biological processes.

There are two natural ways to remedy this situation. One way is to use bioinformatics tools that utilize prior biological knowledge, such as partially known pathways, to prioritize candidate genes. This approach is now routinely used in biological studies and there are ongoing efforts to enrich it with specially designed algorithms [4]. The main weakness of the above approach is that the current biological knowledge is still quite limited and sometimes inaccurate. Another way is to extract additional information on the changes of the *relationships *between different genes from microarray data by using pertinent statistical methods.

For example, if an upstream gene ceases to be catalytic in one phenotype, or this gene is active in two *different *biological pathways in two phenotypes, a carefully designed statistical test based on the intergene dependence structure should be able to detect this change. In more general situations, intergene dependence structure alone may be insufficient to pick up those upstream genes directly, but knowing the relationship changes across conditions points out possible directions for searching the interesting genes.

Notwithstanding the importance of testing for differential expression of genes, we suggest approaching the problem of microarray data analysis from a different angle. We designed a new method to select those genes that are likely to change their relationships with other genes. More specifically, we suggest selecting candidate genes using a statistical test that detects changes in the whole correlation vector associated with each gene. This additional information will be instrumental in making the final selection of candidate genes more meaningful.

We propose to enrich the statistical inference from microarray gene expression data by testing the following hypothesis: *the ith gene does not change its relationships with all other genes across the two phenotypes (conditions) under study*. This can be accomplished by comparing the joint distribution of the correlation coefficients between this gene and other genes in different conditions.

We conducted a series of simulations with different configurations. The results obtained by our method were compared with those of a similarly designed univariate selection procedure. We observed that our method correctly selected those genes which change correlations with other genes but retained the same marginal distributions.

We also conducted various experiments with biological data. A large set of childhood leukemia data available from St. Jude Children's Research Hospital [5] were used. Our method selected some genes which were not selected by a comparable univariate approach.

## Biological data

The biological dataset used in this study was from the St. Jude Children's Research Hospital (SJCRH) Database on childhood leukemia [5]. Two groups of data were employed: patients (*n *= 88) with hyperdiploid acute lymphoblastic leukemia (**HYPERDIP**) and patients (*n *= 79) with a special translocation type of acute lymphoblastic leukemia(**TEL**). To make two data groups more comparable, only the first 79 patients in **HYPERDIP **were used.

Since the original probe set definitions in Affymetrix GeneChip data were known to be inaccurate [6], we updated them by using a custom CDF file to produce values of gene expressions. The CDF file was obtained from http://brainarray.mbni.med.umich.edu. After that, each patient was represented by an array reporting the logarithm (base 2) of expression level on the set of 7084 genes. For both datatsets, each gene was standardized so that it had zero mean and unit standard deviation. This was to avoid introducing false correlation coefficients when doing permutations.

## Methods

### Correlation vectors

Let us denote *m *as the number of genes. For the *i*th gene, we computed the (*m *- 1)-dimensional random vector **r**_{i }= (*r*_{i1}, , *r*_{i, i-1}, *r*_{i, i+1}, , *r*_{im}). Here *r*_{ij }is the sample correlation coefficient between the *i*th and the *j*th gene. This vector represents the relationships between the *i*th gene and all other genes. Denote the (*m *- 1)-dimensional joint distribution functions of **r**_{i }in two different conditions by ${F}_{{r}_{i}(A)}$(*x*) and ${F}_{{r}_{i}(B)}$(*x*). A pertinent statistical test can be designed to test the basic null hypothesis

To increase the sensitivity of our test to departures from **H**_{i}, especially when the correlation coefficients are very high, we applied the Fisher transformation to the sample correlation coefficients:

where *k *= 1, , *i *- 1, *i *+ 1, , *m*. The power improvement was confirmed by our simulation (see Table Table1).1). We denote the *correlation vectors *in two conditions by **w**_{i}(*A*) and **w**_{i}(*B*), respectively.

Instead of testing **H**_{i}, we tested

where ${F}_{{w}_{i}(A)}$(*x*) and ${F}_{{w}_{i}(B)}$(*x*) are the joint distribution functions of **w**_{i}(*A*) and **w**_{i}(*B*), respectively. If ${{H}^{\prime}}_{i}$ was rejected, we declared the *i*th gene to be a *differentially correlated *gene.

In order to test the hypotheses based on the joint distribution functions of correlation vectors, we needed to create samples of correlation vectors. The following two methods were employed for this purpose:

• Group method: Divide each dataset into 8 subgroups, each containing 10 slides (9 slides for the last subgroup of the biological data). By computing correlation vectors from each subgroup, we obtained a sample of size 8.

• Resampling method: Randomly select 60 slides to calculate the correlation vector. Repeat 20 times to get 20 correlation vectors in each group, respectively.

Through the simulations, we found that these two methods were comparable, with the resampling method being slightly better in terms of testing power (see Table Table1,1, ,2,2, ,33 and and4).4). However, the resampling method was much more computationally demanding. As an example, it took approximately 30 hours to analyze the biological data (7084 genes and 79 slides in both conditions, 10,000 permutations) with the group method.

For the resampling method, the computation time was 216 hours instead. All computations were done in a Saturn cluster computer which includes 2 nodes each with 8× AMD Opteron dual-core processors 2 GHz (16 processor cores), 16×2GB SDRAM.

Throughout this paper, we use the group method unless otherwise noted.

*N*-statistic

We chose a multivariate nonparametric test based on *N*-statistic with Euclidean kernel for testing the hypothesis ${{H}^{\prime}}_{i}$. This statistic has been successfully used to select differentially expressed genes and gene combinations in microarray data analysis [7-10]. The *N*-statistic is defined as follows:

where *n*_{s }is the number of correlation vector samples in each group, **w**_{i}(·, *k*) indicates the *k*th replication of the correlation vector (using the group or the resampling method), and $L(x,y)=\left|\right|x-y\left|\right|=\sqrt{{\displaystyle {\sum}_{s=1}^{d}{({x}_{s}-{y}_{s})}^{2}}}$ is the kernel defined by Euclidean distance.

After this step, the *i*th gene was assigned a non-negative number *N*_{i}, a measurement of how much intergene correlation structure had changed from condition *A *to condition *B*.

#### 0.1 Resampling based *p*-values

We used the following algorithm to obtain resampling based *p*-values for each gene:

1. Randomly shuffle the slides in two different conditions, then split them into two groups.

2. Compute correlation vectors for each gene by using the group method or the resampling method.

3. Compute *N*-statistic for each gene based on the correlation vectors.

4. Repeat the above steps for *K *= 10, 000 times, record the resampling based *N*-statistics as *N*_{ik}*, i *= 1, ..., *m*, *k *= 1, ..., *K*. They can be used to construct the (resampling based) null distribution for each index *i*.

5. Compute *N*_{i}, the *N*-statistic for each gene (without random shuffles).

6. Obtain the resampling based *p*-value, *p*_{i}, by comparing *N*_{i }with the null distribution constructed from *N*_{ik}. Specifically, *p*_{i }is defined to be $\frac{\#({N}_{ik}\u2a7e{N}_{i})}{K}$, the proportion of *N*_{ik }which is greater than or equal to *N*_{i}.

Finally, we applied the extended Bonferroni adjustment [11] with threshold 1.0 to control PFER (per-family error rate). Extended Bonferroni adjustment is less conservative than the FWER (familywise error rate) controlling procedures and more stable than FDR (false discovery rate) controlling procedures in the context of microarray analysis. More details about this multiple testing adjustment procedure can be found in [11].

From the computational perspective, it was very tempting to reduce the number of permutations by pooling all *N*_{ik }to construct one grand null distribution. However, we noticed that the null distributions for different genes can be very different. Based on our biological data, the density functions for the significant genes tended to shift to the left compared to those associated with the non-significant genes (see Figure Figure11).

### Univariate gene selection method

We would like to emphasize that our method (henceforward denoted as the **CV **method) is nonparametric. Because of this, we decided to compare the **CV **method to a nonparametrc univariate gene selection method: Wilcoxon rank-sum test with the same extended Bonferroni adjustment (henceforward denoted as the **WRS **method).

## Results

### Simulations

To gain better insight into the performance of the **CV **method, we simulated several sets of data. All sets had two groups of 80 arrays representing two different biological conditions (condition A and condition B). Each array had *m *= 708 genes. Denote the genes in the first condition by *x*_{i}, 1 ≤ *i *≤ *m *and genes in the second condition by *y*_{i}, 1 ≤ *i *≤ *m*. For both groups, all genes were identically distributed with marginal distribution *N*(0, 1). With different baseline correlation structure, we had the following two classes of simulated datasets:

**• SIMU1**: Any two distinct genes that were both in the set of the first 100 genes were correlated with coefficient *ρ*_{d }in condition A, 0.0 in condition B. Otherwise the correlation coefficient was 0.0. Here *ρ*_{d }was a constant taking value in {0.1, 0.2, 0.3, 0.4}. Condition B can be thought of as the control condition where genes were independent of each other. We called this dataset the independent base data.

**• SIMU2**: Any two distinct genes that were both in the set of the first 100 genes were correlated with coefficient 0.5 + *ρ*_{d }in condition A, 0.5 in condition B. Otherwise, the correlation coefficient was 0.5. Again, *ρ*_{d }was a constant taking value in {0.1, 0.2, 0.3, 0.4}. Unlike **SIMU1**, the baseline intergene correlation was 0.5. We called this dataset the dependent base data.

By this design, the differentially correlated genes were the first 100 genes for both **SIMU1 **and **SIMU2**. *ρ*_{d }can be seen as a parameter indicating how much correlation structure had changed across two conditions.

For **SIMU1 **and **SIMU2 **and every *ρ*_{d}, we applied both the **CV **method and **WRS **method, and recorded the true/false positives. We also repeated this process 20 times with different random seeds to get the mean and standard deviation of the true/false positives. The results are shown in Table Table11 and Table Table2.2. As expected, the **CV **method detected differentially correlated genes while the **WRS **method did not. The power of the **CV **method clearly increased as the effect size gets larger. Also, it was easier to detect differentially correlated genes in the independent base data than in the dependent base data. This means that high baseline correlation structure deteriorated the power of the **CV **method.

### Simulations with biological data

The difference between **SIMU1 **and **SIMU2 **was that the baseline intergene correlation was much higher in **SIMU2**. This was an attempt to model the intergene dependence structure in biological data. In some sense, a better way of modeling the actual dependence structure is through resampling from the biological data.

First, we combined **HYPERDIP **and **TEL **data and randomly permuted the slides. We then divided them into two groups of an equal number of slides, mimicking two biological conditions without differentially correlated genes. For both conditions, genes were standardized so that the sample means equaled zero and the sample standard deviations were one. We denoted the entries in two groups by *x*_{ij }and *y*_{ij}, 1 ≤ *i *≤ 7084 and 1 ≤ *j *≤ 79, and the correlation matrix of these two groups by {*ρ*_{ik}}, 1 ≤ *i*, *k *≤ 7084.

Next, we generated a 79-dimensional random vector with *i.i.d*. standard normal components. Denote this vector by **a **= {*a*_{j}}, 1 ≤ *j *≤ 79. We added **a **to the first 300 row vectors in the first condition with a tuning parameter *ρ *as follows: $\sqrt{1-\rho}{x}_{ij}+\sqrt{\rho}{a}_{j}$. These transformed entries are denoted as *x*_{ij }again, and we name this dataset **SIMU3**.

The first condition had the correlation coefficients as follows:

Noticeably, the correlation coefficients between any two of the first 300 genes of the first group differed substantially from those of the second group, and these were considered as differentially correlated genes. The correlation coefficients between any two of the remaining 6784 genes of the first group were the same as the corresponding correlation coefficients between those of the second group, so they were not considered as differentially correlated genes.

There was one caveat in this approach. Even if *x*_{i }was one of the 6784 genes that were not differentially correlated, *corr*(*x*_{i}, *x*_{k}) was still different in two conditions if *x*_{k }happened to be a gene from the first 300 genes. In other words, the first 300 differentially correlated genes induced some small changes for those genes that were not differentially correlated. In practice however, when summarized through the *N*-statistic, these differences for the latter 6784 genes were negligible and it was reasonable to view them as random fluctuations.

We set *ρ *= 0.5 in **SIMU3**. As before, we applied both the **CV **method and the **WRS **method. The above procedure was repeated 10 times, the results were summarized in Table Table55.

*ρ*= 0.5.

The **CV **method selected most of differentially correlated genes while **WRS **method did not. The **CV **method also produced fewer false positives than the **WRS **method.

### Data analysis based on biological data

Using the **WRS **method, we found 102 differentially expressed genes. Using the **CV **method (resampling method was used here to gain more power), we detected 16 differentially correlated genes. Out of these, 11 were differentially expressed and 5 were not. These 5 genes were: CD1C (antigen precursor), HDHD1A (haloacid dehalogenase-like hydrolase domain containing 1A, enzyme involved in many catalytic activities), BASP1 (brain acid-soluble protein 1), CYB5A (cytochrome b5 type A, microsomal) and TFPI (tissue factor pathway inhibitor, which helps to regulate the extrinsic blood coagulation cascade). In the original study, two of these genes were found as differentiating among two leukemia subtypes (BASP1 and CYB5A) and the other three were never mentioned. Differential correlation of these genes in two leukemia subtypes might provide some valuable information for better understanding the underlying subtypes' differences; however, these genes could not be captured by conventional tests. The results of this study (with multiple thresholds before and after the Bonferroni adjustments) can be found in Table Table66.

## Discussion and Conclusion

Our method represents a radical conceptual change from current approaches focused solely on differentially expressed genes. However, this method is not intended to replace the existing methodology but rather to provide biologists with an *additional source of information *for decision making. As an example, the univariate method failed to detect Gene CYB5A, which had a modest unadjusted *p*-value 0.168 based on Wilcoxon ranksum statistic. Yet CYB5A was detected as a differentially correlated gene with an unadjusted *p*-value 0.0 (its observed *N*-statistic was larger than all permutation *N*-statistics). To get a rough idea of how many genes had different correlation coefficients with CYB5A across two conditions, we looked at the marignal distributions of CYB5A' correlation vector. 252 genes were detected to have changed correlations with CYB5A dramatically across conditions. The selection procedure of these 252 genes can be summarized as follows: First, we splitted the slides in two conditions into 8 subgroups, respectively, as in the Group method. Second, we calculated 8 correlation vector samples of gene CYB5A in each condition. Finally, for all 7083 correlations (with 8 samples in each condition), we applied Wilcoxon rank-sum test to each of them to obtain an unadjusted *p*-value. After extended Bonferroni adjustment, 252 genes were selected at significant level 2.0. According to BioCarta, about one third of all pathways associated with these 252 genes are related to cell cycle progression, cell division and control of centrosome duplication. **HYPERDIP **phenotype is characterized by the presence of more than 50 chromosomes. As a consequence, all pathways working for cellular maintenance and proliferation should be higly activated in **HYPERDIP **phenotype. The differential correlation of genes involved in pathways, related to cell proliferation between **HYPERDIP **and **TEL **seemed reasonable and might deserve future studies.

The microarray technology yields unique multidimensional information on the functioning of the whole genome machinery at the level of transcription so that much can be learned about relationships between genes and mechanisms by which the cell assigns tasks to different genes to maintain a specific function. It is unfortunate that such an advanced technology continues to be used as a simplistic screening tool with a focus on big differences between mean values of expression measurements. The true potential of microarray technology has yet to be unveiled. It is noteworthy that recent years have seen a growing interest in correlations between gene expression levels in statistical methodologies for microarray analysis [9,12-25]. The correlation coefficient has been used extensively as a measure of similarity in gene clustering since a seminal paper by Eisen et al. [26]

However, very few studies have examined the possibility of using the intergene correlation structure to find important genes that are linked to disease. One obstacle lies in the fact that there are *m *different sample means but $\frac{m(m-1)}{2}$ different sample correlation coefficients. It is much harder to catch the differences hidden in the correlation matrix that has much higher degrees of freedom (25, 087, 986 in our study). Furthermore, it is much more computationally intensive to compute the sample correlation coefficients than the sample means. Consequently, we could not afford to use more than 10, 000 permutations to get finer *p*-value estimation, and we are thus reluctant to recommend the slower permutation based resampling method to the biologists, despite the fact that we know this method is more powerful than the group method.

As illustrated by our study on biological data (with 79 slides in each condition) we were able to identify 102 genes, which changes the medians of their (univariate) distributions, yet only 16 genes were reported as differentially correlated (see Table Table6).6). This seeming inadequacy of power was also shown in the simulation studies (Table (Table1,1, Table Table2).2). This phenomenon might be subject to a number of explanations. It might be caused by the small sample size. Due to the nature of the **CV **method, its statistical perspective is to compare the distribution of *sample correlation coefficients *instead of expression levels in different biological conditions. With the group method, we splitted 79 slides into subgroups and computed the sample correlation vectors from each group. As a result, we only had eight sample correlation vectors for each condition. We could divide the data into more subgroups, and then there would be fewer slides per subgroup so that the sample correlation coefficient computed from each subgroup would be less accurate. This was a trade-off. With the resampling method, we had 20 correlation vectors. Having more than 20 resamplings would enhance the accuracy of the estimated *N*-statistics and improve the power; meanwhile, it would demand more computing time, making this another trade-off.

The choice of the Euclidean distance as the kernel for the *N*-statistics might be another culprit. The Euclidean distance kernel is a *generic *kernel that is invariant under any orthogonal transformation. In other words, it is symmetric and indifferent to all departure from the null distribution. A specifically designed kernel that is sensitive to the likely departure from the null distribution caused by the changes of correlation might significantly increase the selection power of the **CV **method. Last, it may have been that indeed fewer genes were differentially correlated than were differentially expressed in the biological data. The hypotheses that the **CV **method was testing were entirely different from those tested by the univariate selection methods, such as the **WRS **method. It is absolutely possible that one gene is differentially expressed but not differentially correlated, or vice versa. The very fact that 11 out of 16 differentially correlated genes were differentially expressed in our study is an interesting phenomenon that is worth further investigation.

We believe many improvements can be made to enhance the selecting power of the **CV **method. We also firmly believe, as larger sets of microarray gene expression data become readily available, quantitative insights into dependencies between gene expression levels will gain increasing importance.

## Authors' contributions

The basic idea was first proposed by AY, LK and XQ. The detailed study design was developed by all members in the research team. RH carried out the needed computations and simulations and the majority of the software development. GG conducted the pathway analysis for differentially expressed and differentially correlated genes. RH and XQ were responsible for most of the write-up of the findings.

## Acknowledgements

This research is supported by NIH Grant GM079259 (X. Qiu), NIH/NIGMS grants RO1 GM075299 (A. Almudevar), project IAA101120801 and grant MSM 002160839 from the Ministry of Education, Czech Republic (L. Klebanov). We are grateful for insightful comments from Dr. David Oakes. We would also like to thank Dr. Sung Yong Park and Ms. Christine Brower for their technical assistance with computing in general and parallel computing in particular. In addition, we also appreciate Ms. Cheryl Cicero's proofreading efforts.

## References

- Dudoit S, Shaffer J, Boldrick J. Multiple hypothesis testing in microarray experiments. Statistical Science. 2003;18:71–103.
- Simon RM, Korn EL, McShane LM, Radmacher MD, Wright GW, Zhao Y. Design and Analysis of DNA Microarray Investigations. Springer Verlag; 2003.
- Klebanov L, Jordan C, Yakovlev A. A new type of stochastic dependence revealed in gene expression data. Stat Appl Genet Mol Biol. 2006;5:Article7. [PubMed]
- Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ. Discovering statistically significant pathways in expression profiling studies. Proc Natl Acad Sci USA. 2005;102:13544–13549. [PMC free article] [PubMed]
- Yeoh EJ, Ross ME, Shurtleff SA, Williams WK, Patel D, Mahfouz R, Behm FG, Raimondi SC, Relling MV, Patel A, Cheng C, Campana D, Wilkins D, Zhou X, Li J, Liu H, Pui CH, Evans WE, Naeve C, Wong L, Downing JR. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell. 2002;1:133–143. [PubMed]
- Dai M, Wang P, Boyd AD, Kostov G, Athey B, Jones EG, Bunney WE, Myers RM, Speed TP, Akil H, Watson SJ, Meng F. Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res. 2005;33:e175. [PMC free article] [PubMed]
- Szabo A, Boucher K, Carroll W, Klebanov L, Tsodikov A, Yakovlev A. Variable selection and pattern recognition with gene expression data generated by the microarray technology. Mathematical Biosciences. 2002;176:71–98. [PubMed]
- Szabo A, Boucher K, Jones D, Tsodikov AD, Klebanov LB, Yakovlev AY. Multivariate exploratory tools for microarray data analysis. Biostatistics. 2003;4:555–567. [PubMed]
- Xiao Y, Frisina R, Gordon A, Klebanov L, Yakovlev A. Multivariate search for differentially expressed gene combinations. BMC Bioinformatics. 2004;5:164. [PMC free article] [PubMed]
- Klebanov L, Gordon A, Xiao Y, Land H, Yakovlev A. A permutation test motivated by microarray data analysis. Computational Statistics and Data Analysis. 2005.
- Gordon A, Glazko G, Qiu X, Yakovlev A. Control of the Mean Number of False Discoveries, Bonferroni, and Stability of Multiple Testing. The Annals of Applied Statistics. 2007;1:179–190.
- Jaeger J, Sengupta R, Ruzzo WL. Improved gene selection for classification of microarrays. Pac Symp Biocomput. 2003:53–64. [PubMed]
- Goeman JJ, Geer SA van de, de Kort F, van Houwelingen HC. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics. 2004;20:93–99. [PubMed]
- Geman D, d'Avignon C, Naiman DQ, Winslow RL. Classifying gene expression profiles from pairwise mRNA comparisons. Stat Appl Genet Mol Biol. 2004;3:Article19. [PMC free article] [PubMed]
- Lai Y, Wu B, Chen L, Zhao H. A statistical method for identifying differential gene-gene co-expression patterns. Bioinformatics. 2004;20:3146–3155. [PubMed]
- Shedden K, Taylor J. Differential correlation detects complex associations between gene expression and clinical outcomes in lung adenocarcinomas. Methods of Microarray Data Analysis IV. 2005. pp. 121–131.
- Laan MJ van der, Birkner MD, Hubbard AE. Empirical Bayes and resampling based multiple testing procedure controlling tail probability of the proportion of false positives. Stat Appl Genet Mol Biol. 2005;4:Article29. [PubMed]
- Lu Y, Liu P, Xiao P, Deng H. Hotelling's T 2 multivariate profiling for detecting differential expression in microarrays. Bioinformatics. 2005;21:3105–3113. [PubMed]
- Qiu X, Brooks AI, Klebanov L, Yakovlev A. The effects of normalization on the correlation structure of microarray data. BMC Bioinformatics. 2005;6:120. [PMC free article] [PubMed]
- Qiu X, Klebanov L, Yakovlev A. Correlation between gene expression levels and limitations of the empirical bayes methodology for finding differentially expressed genes. Statistical Applications in Genetics and Molecular Biology. 2005;4:34. [PubMed]
- Almudevar A, Klebanov LB, Qiu X, Salzman P, Yakovlev AY. Utility of correlation measures in analysis of gene expression. NeuroRx. 2006;3:384–395. [PMC free article] [PubMed]
- Qiu X, Yakovlev A. Some comments on instability of false discovery rate estimation. J Bioinform Comput Biol. 2006;4:1057–1068. [PubMed]
- Klebanov L, Qiu X, Yakovlev A. Testing differential expression in non-overlapping gene pairs: A new perspective for the empirical Bayes method. Journal of Bioinformatics and Computational Biology. 2008;6:301–316. [PubMed]
- Klebanov L, Yakovlev A. Diverse correlation structures in gene expression data and their utility in improving statistical inference. Annals of Applied Statistics. 2008;1:538–559.
- Klebanov L, Glazko G, Salzman P, Yakovlev A, Xiao Y. A multivariate extension of the gene set enrichment analysis. J Bioinform Comput Biol. 2007;5:1139–1153. [PubMed]
- Eisen M, Spellman P, Brown P, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A. 1998;95:14863–14868. [PMC free article] [PubMed]

**BioMed Central**

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (525K) |
- Citation

- Differential prioritization between relevance and redundancy in correlation-based feature selection techniques for multiclass gene expression data.[BMC Bioinformatics. 2006]
*Ooi CH, Chetty M, Teng SW.**BMC Bioinformatics. 2006 Jun 23; 7:320. Epub 2006 Jun 23.* - The limit fold change model: a practical approach for selecting differentially expressed genes from microarray data.[BMC Bioinformatics. 2002]
*Mutch DM, Berger A, Mansourian R, Rytz A, Roberts MA.**BMC Bioinformatics. 2002 Jun 21; 3:17. Epub 2002 Jun 21.* - Chaotic genetic algorithm for gene selection and classification problems.[OMICS. 2009]
*Chuang LY, Yang CS, Li JC, Yang CH.**OMICS. 2009 Oct; 13(5):407-20.* - Microarray data analysis: from disarray to consolidation and consensus.[Nat Rev Genet. 2006]
*Allison DB, Cui X, Page GP, Sabripour M.**Nat Rev Genet. 2006 Jan; 7(1):55-65.* - Microarrays--identifying molecular portraits for prostate tumors with different Gleason patterns.[Methods Mol Med. 2008]
*Mendes A, Scott RJ, Moscato P.**Methods Mol Med. 2008; 141:131-51.*

- Pathway network inference from gene expression data[BMC Systems Biology. ]
*Ponzoni I, Nueda MJ, Tarazona S, Götz S, Montaner D, Dussaut JS, Dopazo J, Conesa A.**BMC Systems Biology. 8(Suppl 2)S7* - Evaluation of Bias-Variance Trade-Off for Commonly Used Post-Summarizing Normalization Procedures in Large-Scale Gene Expression Studies[PLoS ONE. ]
*Qiu X, Hu R, Wu Z.**PLoS ONE. 9(6)e99380* - EDDY: a novel statistical gene set test method to detect differential genetic dependencies[Nucleic Acids Research. 2014]
*Jung S, Kim S.**Nucleic Acids Research. 2014 Apr; 42(7)e60* - Identification of phenotype deterministic genes using systemic analysis of transcriptional response[Scientific Reports. ]
*Lee J, Park J, Choi C.**Scientific Reports. 44413* - Harnessing the complexity of gene expression data from cancer: from single gene to structural pathway methods[Biology Direct. ]
*Emmert-Streib F, Tripathi S, Matos Simoes RD.**Biology Direct. 744*

- PubMedPubMedPubMed citations for these articles

- Detecting intergene correlation changes in microarray analysis: a new approach t...Detecting intergene correlation changes in microarray analysis: a new approach to gene selectionBMC Bioinformatics. 2009; 10()20

Your browsing activity is empty.

Activity recording is turned off.

See more...