Copyright © 2009 Zhu; licensee BioMed Central Ltd.

Semi-supervised gene shaving method for predicting low variation biological pathways from genome-wide data

1 Department of Computer Science, University of New Orleans, New Orleans, LA 70148, USA
2 Research Institute for Children, Children's Hospital, New Orleans, LA 70118, USA
Corresponding author. Dongxiao Zhu: dzhu/at/cs.uno.edu

Supplement: Selected papers from the Seventh Asia-Pacific Bioinformatics Conference (APBC 2009). Michael Q Zhang, Michael S Waterman and Xuegong Zhang
Conference: The Seventh Asia Pacific Bioinformatics Conference (APBC 2009), 13–16 January 2009, Beijing, China

This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

The gene shaving algorithm and many other clustering algorithms identify gene clusters showing high variation across samples. However, gene expression in many signaling pathways shows only modest and concordant changes that these methods fail to identify. The increasingly available prior knowledge of signaling pathways provides a new opportunity to solve this problem.

Results

We propose an innovative semi-supervised gene clustering algorithm, in which the original gene shaving algorithm is extended and generalized so that prior knowledge of signaling pathways can be incorporated. Unlike other methods, our method identifies gene clusters showing concerted and modest expression variation as well as strong expression correlation. Using available pathway gene sets as prior knowledge, whether complete or incomplete, our algorithm is capable of forming tightly regulated gene clusters showing modest variation across samples. We demonstrate the advantages of our algorithm over the original gene shaving algorithm using two microarray data sets.
The stability of the gene clusters was assessed using a jackknife approach.

Conclusion

Our algorithm represents one of the first clustering algorithms that is specifically designed to identify signaling pathways of low and concordant gene expression variation. The discriminating power is achieved by manufacturing a principal component enriched by signaling pathways.

Background

Gene clustering, which assigns group membership(s) to each gene, is a widespread pattern extraction technique. Genes sharing the same membership are often hypothesized to be regulated by the same defined or undefined genomic influence, such as a cellular pathway. Model-free clustering techniques such as K-means and hierarchical clustering [1-3] are widely used. One limitation of these approaches, as pointed out by many researchers, e.g. [4], is that each gene can only belong to a single cluster. These types of gene clustering algorithms are thus called mutually exclusive clustering. In the context of cellular pathways, they assume that one gene can only be regulated by one pathway at a time, which is clearly not the case. Model-based clustering or soft clustering [5-8] provides mechanisms to relax this stringent assumption by introducing "probabilistic" or "fuzzy" memberships. However, these "soft" memberships do not biologically account for the fact that one gene is often simultaneously regulated by multiple genomic influences.

Singular value decomposition (SVD) [9-11] has shown great promise for deconvolving channels of genomic influence. Assuming that the rows of the data matrix correspond to genes and the columns correspond to the physiological/genetic conditions under which gene expression abundance was interrogated using gene chips, the SVD factors the data matrix into three matrices.
The first matrix, which contains most of the information, is called the gene coefficient matrix; each of its columns (principal component, PC) defines a preliminary gene cluster that might be regulated by a specific genomic influence. We describe SVD in more detail in the Methods section. SVD has been repeatedly shown to be able to deconvolve the observed gene expression signal into a composite of multiple overlapping genomic influences, many of which correspond to signaling pathways [9,11]. Thus SVD provides a methodological basis for non-mutually exclusive clustering.

The gene clusters generated by SVD are often preliminary because many non-relevant genes might contaminate the PCs that define the gene clusters. Hastie et al. [4] proposed removing non-relevant genes in an iterative fashion, in which the genes least correlated with the leading PC are treated as non-relevant. The gene shaving algorithm quickly became an important tool in the pattern discovery arsenal. It iteratively searches for clusters of genes showing high variation across the samples and strong correlation across the genes [8]. The former is achieved by working with the leading PC and the latter is achieved by iteratively discarding genes that are not relevant to the cluster. There are other types of non-mutually exclusive clustering methods as well, such as the plaid model [12].

The underlying assumption of the gene shaving algorithm is that the leading PC, which accounts for the largest portion of variation, is always of exclusive interest to the investigator [4,13]. Consequently the algorithm iteratively refines the first gene cluster, defined by the first PC, by shaving off a proportion of the genes that are least correlated with the leading PC. The second gene cluster is formed by performing the same procedure on the orthogonalized data, obtained from the residuals of regression, and so on. However, this assumption, on which the whole algorithm rests, does not hold in every case.
In fact, gene expression in many signaling pathways shows modest but concordant changes. By working exclusively with the leading PC, the gene shaving algorithm would most likely fail in these cases.

Gene set based methods, such as Gene Set Enrichment Analysis (GSEA), were designed to overcome this limitation. Since its first introduction in 2003 [14], GSEA has been widely applied to interpret genome-wide expression profiles [15,16]. However, the approach only ranks pre-compiled gene sets according to their relevance to the data and does not predict any new genes in the gene sets. Therefore, it strictly depends on the availability and validity of a priori defined gene sets. In reality a gene set is not always available in a complete and accurate form. What is typically available is a partial pathway learned from empirical experimental studies.

We seek a seamless combination of the strengths of the two methodological frameworks. We manufacture a PC that is most enriched by prior knowledge (the signaling pathway of interest). By performing the analysis iteratively, we are able to identify the gene cluster showing modest but concordant changes. In many cases, we are further interested in finding genes that are concordantly up- or down-regulated by genomic influences. Therefore, it might be beneficial to turn our attention not only to the PC in which the prior knowledge is most enriched, but also to the positive PC and the negative PC separately. This hypothesis is supported by previous work showing that positive and negative PCs can be enriched by completely different biological functions, e.g. [11].

In our work, we eliminate non-relevant genes iteratively, following and improving the procedure used in the gene shaving algorithm [4]. In each iteration, a weighted average expression profile is calculated and used as the seed profile to rank genes.
Because non-relevant genes are removed heuristically in the early iterations and some relevant genes are removed toward the end, the enrichment of prior knowledge first increases sharply and then decreases gradually. We then propose a trace-back step to retrieve the gene cluster in which the enrichment of prior knowledge is maximized (Figure 1).

Results

We aim to demonstrate that the proposed algorithm is capable of identifying tightly regulated gene sets showing modest and concerted variation using incomplete prior knowledge and real-world microarray data sets. Ground truth, meaning a "complete" gene set of the kind assumed as a precondition for applying the GSEA algorithm [14,16], is desirable to demonstrate the claimed advantages of our algorithm, but it is often not available. Therefore, we use four "high-amplitude" and four "low-amplitude" gene sets identified in [17] as ground truth to evaluate the ability of our algorithm to recover them using subsets of various sizes. The high- and low-amplitude genes used in this example are well-studied genes in the cell cycle, and many of them are co-regulated by a number of signaling pathways [17,18]. We then use incomplete prior knowledge supplied by our collaborator and apply our algorithm to predict new WNT and NOTCH pathway genes in the somitogenesis process.

Recovering low and high amplitude gene sets using incomplete prior knowledge

As a proof of concept, we first analyzed a cell cycle data set originally reported in [17]. The data set consists of whole yeast genome expression profiles interrogated over two full cell cycles (20 evenly spaced time points) synchronized by elutriation. We considered the same 308 genes as in that paper, which were derived using the Fourier transform. In each of the four gene sets, genes were further classified into high-amplitude and low-amplitude groups according to the magnitude of variation. The processed data are available from the authors' website at [19].
We treated the high-amplitude genes and low-amplitude genes in each gene set as "complete", as assumed in classical GSEA analysis. We sampled subsets of increasing sizes from 5 to complete (e.g. 40) with a step size of 5. In each step of the experiment, we generated 500 subsets of the same size (with replicates), and for each subset we applied our algorithm to demonstrate its ability to recover the full gene set, using the hypergeometric test explained in the Methods section. The P-values of the tests were used as a measure of this ability. For convenience of visualization, the P-values were negatively log-transformed, so that a higher value corresponds to better recovery of the complete gene set. The high-amplitude and low-amplitude complete gene sets were plotted in Figure 2a.
Our algorithm can be viewed as a generalization of the gene shaving algorithm. The gene shaving algorithm works exclusively with the leading PC. Therefore, it is only capable of identifying high-amplitude signaling pathways. Our algorithm adaptively works with the PC that is most enriched by prior knowledge. Therefore, it is capable of identifying either high-amplitude or low-amplitude signaling pathways wherever prior knowledge is available. Comparing Figure 2b …

Predicting WNT and NOTCH pathway genes using prior knowledge

Microarray data and prior knowledge

We then proceeded to re-analyze microarray data originally reported in Dequeant et al. [20] to predict genes in the WNT and NOTCH pathways. In this experiment, genome-wide gene expression was interrogated over 17 developmental stages using the Affymetrix GeneChip 430A. The top 687 genes ranked by the Lomb-Scargle periodogram [21] were used for gene clustering so that all prior knowledge genes were included. The microarray data are available at ArrayExpress at [22].

The prior knowledge corresponds to a list of experimentally validated cyclic genes regulated by the segmentation clock, a molecular oscillator acting during somitogenesis [20]. The segmentation clock is a set of periodic processes linked to the formation of the vertebrate embryo segments (somites) that give rise to the segments in the adult body plan of a vertebrate animal. Malfunction of cyclic genes is the direct cause of many developmental diseases, such as Noonan syndrome and truncated tail [20]. Therefore, predicted cyclic genes are potential human disease genes. In particular, we have incomplete sets of 11 genes in the WNT pathway and 9 genes in the NOTCH pathway as our prior knowledge. Our objective is to predict more WNT and NOTCH genes using the prior knowledge, the microarray data and our proposed algorithm.
Finding the most enriched PC using prior knowledge

In each iteration of our algorithm, we search for the PC that is most enriched by known WNT and NOTCH genes. We filtered the gene coefficients in each PC using the cut-off and tested the enrichment of known pathway genes using the hypergeometric test (see the Methods section). Figure 3 …
Comparing our semi-supervised algorithm with the gene shaving algorithm

We aim to show that our semi-supervised algorithm, unlike the gene shaving algorithm, is able to identify low variation signaling pathway genes. For predicting the WNT cluster, our algorithm terminates after 18 iterations, and for predicting the NOTCH cluster, it terminates after 20 iterations. We then traced back to retrieve the optimized clusters. Both the WNT and NOTCH clusters were retrieved at the 9th iteration, at which the prior knowledge is most enriched, and were the smallest clusters containing all prior knowledge genes (Figure 5c).
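The iterate-then-trace-back behavior described above can be illustrated with a minimal Python sketch. It is not the author's implementation: the function name `shave_and_trace_back`, the callables `relevance` and `enrichment`, the `min_size` stopping rule and the toy scores are all assumptions made for illustration. In the paper, relevance is the absolute correlation with the most enriched PC, enrichment comes from a hypergeometric test, and the loop stops under a different condition.

```python
import math


def shave_and_trace_back(genes, relevance, enrichment, alpha=0.10, min_size=2):
    """Iteratively shave a gene cluster, then trace back to the best iteration.

    genes      -- non-empty list of gene identifiers (the starting cluster)
    relevance  -- hypothetical callable: relevance(gene, cluster) -> float,
                  e.g. absolute correlation with the most enriched PC
    enrichment -- hypothetical callable: enrichment(cluster) -> float,
                  larger means the prior knowledge is more enriched
    alpha      -- fraction of least relevant genes discarded per iteration
    min_size   -- assumed stopping rule: stop once the cluster is smaller
    """
    cluster = list(genes)
    history = []  # (enrichment, cluster) recorded at every iteration
    while len(cluster) >= min_size:
        history.append((enrichment(cluster), list(cluster)))
        # Rank genes from most to least relevant and drop the tail.
        ranked = sorted(cluster, key=lambda g: relevance(g, cluster), reverse=True)
        n_drop = max(1, math.floor(alpha * len(ranked)))
        cluster = ranked[:len(ranked) - n_drop]
    # Trace-back: return the recorded cluster with maximal enrichment
    # (on ties, max() keeps the earliest, i.e. largest, cluster).
    best_score, best_cluster = max(history, key=lambda item: item[0])
    return best_cluster, history


# Toy usage with made-up relevance scores and a crude enrichment proxy
# (fraction of the cluster that belongs to the prior knowledge set).
scores = {g: 1.0 - 0.1 * i for i, g in enumerate("abcdefghij")}
prior = {"b", "c"}
best, history = shave_and_trace_back(
    list("abcdefghij"),
    relevance=lambda g, cluster: scores[g],
    enrichment=lambda cluster: len(prior & set(cluster)) / len(cluster),
)
print(best, len(history))
```

In this toy run the enrichment proxy rises while non-relevant genes are shaved off and falls once a prior knowledge gene is discarded, so the trace-back returns an intermediate cluster rather than the final one.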
The left panel of Figure 6 …
To make our prediction useful for improving current understanding of the mechanisms of the WNT and NOTCH pathways in somitogenesis, we performed an analysis to infer which biological functions (defined by Gene Ontology, GO) are most enriched in the pathways, and which transcription factors (inferred through ChIP-chip experiments) are most likely to be involved in regulating the two pathways. Table 1 presents the results of this enrichment analysis. The analysis was done through the web server of the Segal lab [23]. The results in Table 1 appear to be meaningful, since many significantly enriched GO terms (column 3) are related to embryonic development, and both enriched transcription factors (column 4), MyoG and MyoD, are closely related to cell differentiation [24,25]. In particular, MyoD and MyoG have distinct regulatory roles at a similar set of target genes. The role of MyoG in mediating terminal differentiation is partially to enhance the expression of a subset of genes previously turned on by MyoD [25].

Stability of clusters against perturbation of prior knowledge

Our approach predicts new pathway genes based on the available prior knowledge. It is therefore critical to investigate the sensitivity of our prediction to a modest perturbation of the prior knowledge. Since for this data set we do not know the ground truth as we did in the cell cycle data analysis, we performed a sensitivity analysis using leave-one-out and leave-two-out jackknife approaches; see the Methods section for technical details. A narrower jackknife confidence interval of the enrichment indicates better stability of our enrichment estimate against perturbation of the prior knowledge. In Figure 7a …
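The leave-one-out stability analysis described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the code used in the paper: the function name `jackknife` is hypothetical, the statistic is passed in as an arbitrary callable (a plain mean in the toy usage, standing in for the enrichment statistic), and the t critical value is supplied by the caller because the Python standard library has no t-distribution quantile function. Leave-two-out is not covered.

```python
import math


def jackknife(items, statistic, t_crit):
    """Leave-one-out jackknife estimate, standard error and confidence interval.

    items     -- list of observations (here: prior knowledge genes)
    statistic -- callable mapping a list of items to a float
    t_crit    -- critical value of the t distribution with len(items) - 1
                 degrees of freedom, supplied by the caller
    """
    n = len(items)
    if n < 2:
        raise ValueError("need at least two items")
    full = statistic(items)
    # Pseudovalue j: n * full estimate - (n - 1) * estimate without item j.
    pseudo = [
        n * full - (n - 1) * statistic(items[:j] + items[j + 1:])
        for j in range(n)
    ]
    estimate = sum(pseudo) / n
    se = math.sqrt(sum((p - estimate) ** 2 for p in pseudo) / (n * (n - 1)))
    return estimate, se, (estimate - t_crit * se, estimate + t_crit * se)


def mean(xs):
    """Toy statistic standing in for the enrichment statistic."""
    return sum(xs) / len(xs)


# Toy usage: four observations, 95% interval with 3 degrees of freedom.
est, se, ci = jackknife([1.0, 2.0, 3.0, 4.0], mean, t_crit=3.182)
print(est, se, ci)
```

For a plain mean the pseudovalues coincide with the observations, so the jackknife standard error reduces to the usual standard error of the mean, which makes the toy case easy to check by hand.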
Discussion

With the exception of a few recent works [26-28], most current clustering algorithms are unsupervised in the sense that prior knowledge is not properly utilized to guide the learning process. Instead, prior knowledge is often used in the post-learning phase, in which researchers predict the functions of unknown genes based on genes of known function lying in the same cluster. The traditional gene shaving method focuses on the leading PC, which accounts for most of the variation in the data. On the one hand, it is useful in discovering high variation pathway genes [4,29]; on the other hand, it tends to overlook essential pathway genes that have modest expression variation. We hypothesized that the highly concerted expression behavior of these genes, albeit modest in variation, may help their pattern emerge from noisy microarray data when appropriate analysis techniques, such as SVD, are used.

The main contribution of this work is an optimization algorithm combining the strengths of gene set based analysis and iterative gene selection. The iterative selection, inspired by the gene shaving algorithm, distills the desired gene cluster using prior knowledge, while the gene set based analysis enables us to discover gene clusters of modest and concerted expression change. The PCs that define gene clusters group a series of tightly regulated genes ranked by variance over samples. The orthogonality specified in SVD analysis indicates that gene clusters of different variation are regulated by orthogonal defined or undefined genomic influences (Table 1 of [11]).

Our method is particularly suitable for identifying gene clusters with modest and concerted expression change; it is therefore not limited to identifying periodically expressed gene clusters. When no prior knowledge is available, the optimization can instead be driven by the enrichment of a Gene Ontology (GO) term of interest, for example, somitogenesis [GO:0001756].
The technique for testing the enrichment of a GO term is very similar to the one used here; see also the review in [30]. A recursive dendrogram can be constructed as a foundation for generating overlapping gene clusters, from which the optimal clusters can be identified and retrieved according to the enrichment of the GO term(s) of interest [3].

Conclusion

Our algorithm represents one of the first clustering algorithms that is specifically designed to identify signaling pathways of low and concordant gene expression variation. The discriminating power is achieved by manufacturing a principal component enriched by the prior knowledge.

Methods

Singular Value Decomposition

Assume the gene expression data are in the matrix format $X_{p \times n}$, where the rows (p) correspond to genes and the columns (n) correspond to the conditions under which gene expression abundance was interrogated. The singular value decomposition (SVD) of the rectangular matrix X can be expressed as follows:

$$X_{p \times n} = U_{p \times n} \, S_{n \times n} \, V^{T}_{n \times n}$$
where $U_{p \times n}$ is the gene coefficient matrix, and $U_{ij}$ is the contribution of the ith gene, i = 1, ..., p, to the jth PC, j = 1, ..., n. If we associate each column $U_j$ with a genomic influence j, then $U_{ij}$ defines how much gene i is regulated by genomic influence j. $S_{n \times n}$ is the singular value matrix, whose diagonal contains the list of singular values; the magnitude of each singular value corresponds to the percentage of variation explained by the corresponding PC.
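As an illustration of what the gene coefficients of the leading PC look like, the following Python sketch approximates the first column of U and the largest singular value. It deliberately swaps in power iteration on $X X^{T}$ for a full SVD so that it runs with the standard library only; in practice a numerical library's SVD routine would be used. The function name `leading_pc` and the toy matrix are assumptions made for this example.

```python
import math


def leading_pc(X, iters=100):
    """Approximate the leading left singular vector and singular value of X.

    X is a list of p rows (genes), each with n values (conditions).
    Power iteration on X X^T is used here instead of a full SVD.
    Returns (u, s): u has unit length and plays the role of the first
    column of U; s approximates the largest singular value.
    """
    p, n = len(X), len(X[0])
    u = [1.0 / math.sqrt(p)] * p  # arbitrary unit-length starting vector
    s = 0.0
    for _ in range(iters):
        # v = X^T u (length n), then w = X v = X X^T u (length p).
        v = [sum(X[i][j] * u[i] for i in range(p)) for j in range(n)]
        w = [sum(X[i][j] * v[j] for j in range(n)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # starting vector carries no signal; give up
        u = [x / norm for x in w]
        s = math.sqrt(norm)  # ||X X^T u|| tends to the squared singular value
    return u, s


# Toy matrix: three genes, two conditions, each row centered to zero mean.
# Gene 3 is flat, so its coefficient on the leading PC should vanish.
X = [[3.0, -3.0],
     [4.0, -4.0],
     [0.0, 0.0]]
u, s = leading_pc(X)
print(u, s)
```

The coefficients returned for the toy matrix show the interpretation given above: the two genes that vary together receive large coefficients, while the flat gene contributes nothing to the PC.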
Refer to supplemental figure 1.

Testing gene coefficients

A small fractional value of $U_{ij}$ may indicate that the contribution of the ith gene to the jth PC is negligible. We used the cut-off value originally used in [10] to test whether $U_{ij}$ vanishes (similar to a 3σ statistical significance).

Enrichment test

For each PC j, suppose K is the set of k genes whose $U_{ij}$ is not 0 and, for a biological pathway, M is the prior knowledge gene set of m genes known to be in the pathway. Also assume there are n genes NOT in the pathway, and x is the number of common genes shared by K and M. The probability of observing exactly x common genes is:

$$P(X = x) = \frac{\binom{m}{x}\binom{n}{k-x}}{\binom{m+n}{k}}$$

To assess whether observing x or more common genes is purely due to chance, we test a one-sided hypothesis in which the null hypothesis H0 states that the overlap between K and M arises by chance. The P-value is then defined as the probability of observing x or more overlaps given that H0 is true. Therefore, it is calculated as follows:

$$P\text{-value} = \sum_{i=x}^{\min(k,m)} \frac{\binom{m}{i}\binom{n}{k-i}}{\binom{m+n}{k}}$$
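The enrichment test described above can be sketched in Python using only the standard library. The sketch assumes the standard hypergeometric form implied by the definitions of k, m, n and x in the text; the function names `hypergeom_pmf` and `enrichment_p_value` and the numbers in the usage example are illustrative and do not come from the paper.

```python
from math import comb


def hypergeom_pmf(x, k, m, n):
    """Probability that exactly x of the k selected genes are pathway genes.

    k -- number of genes with non-zero coefficient in the PC (set K)
    m -- number of prior knowledge genes in the pathway (set M)
    n -- number of genes NOT in the pathway
    """
    if x < 0 or x > min(k, m) or k - x > n:
        return 0.0
    return comb(m, x) * comb(n, k - x) / comb(m + n, k)


def enrichment_p_value(x, k, m, n):
    """One-sided P-value: probability of x or more common genes by chance."""
    return sum(hypergeom_pmf(i, k, m, n) for i in range(x, min(k, m) + 1))


# Toy usage: 4 genes pass the cut-off, 5 pathway genes, 15 other genes,
# and 2 of the 4 selected genes are pathway genes.
p = enrichment_p_value(x=2, k=4, m=5, n=15)
print(p)
```

The PC with the smallest P-value (equivalently, the largest negative log P-value) would then be taken as the one most enriched by prior knowledge.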
Semi-supervised gene shaving algorithm

Start with the centered data matrix X, in which each row has zero mean
while TRUE do
    Perform singular value decomposition of X
    for all columns of U do
        if a column element is greater than the cut-off then
            make no change
        else
            set the element to 0
        end if
    end for
    for all gene sets corresponding to the columns do
        Test the enrichment of prior knowledge in each gene set
    end for
    if two or more columns that are most enriched with prior knowledge exist then
        break
    else
        Retrieve the best PC, i.e. the one most enriched by prior knowledge
    end if
    Sort genes according to absolute correlation with the best PC
    Discard the α% least correlated genes (α = 10%, following [4])
    Assign the reduced data matrix to X
end while
Trace back to retrieve the best gene cluster

As shown in the above Algorithm and Figure 1, …

Stability analysis of gene clusters – a jackknife approach

The jackknife approach, e.g. "leave-one-out", is a resampling approach that is frequently used to assess the stability of an estimator, such as the enrichment studied here. Suppose we wish to estimate the enrichment parameter (η) as a complicated statistic (T) of the n genes in the prior knowledge as well as …
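Let the jth partial estimate of η, $\hat{\eta}_{(j)}$, be given by the estimate computed with gene j removed, and let $\hat{\eta}$ denote the estimate computed from all n genes. The jth pseudovalue is then

$$\hat{\eta}^{*}_{j} = n\,\hat{\eta} - (n-1)\,\hat{\eta}_{(j)}$$

The jackknife estimate of η is given by the average of the pseudovalues [31],

$$\hat{\eta}^{*} = \frac{1}{n}\sum_{j=1}^{n} \hat{\eta}^{*}_{j}$$

An approximate sampling error for $\hat{\eta}^{*}$ is given by

$$SE(\hat{\eta}^{*}) = \sqrt{\frac{\sum_{j=1}^{n}\left(\hat{\eta}^{*}_{j} - \hat{\eta}^{*}\right)^{2}}{n(n-1)}}$$

Likewise, an approximate 100(1 - α)% confidence interval is given by [31],

$$\hat{\eta}^{*} \pm t_{\alpha/2,\,n-1}\; SE(\hat{\eta}^{*})$$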
where $t_{\alpha/2,\,n-1}$ satisfies $\Pr(t_{n-1} \geq t_{\alpha/2,\,n-1}) = \alpha/2$, with $t_{n-1}$ denoting a t-distributed random variable with n - 1 degrees of freedom.

Competing interests

The author declares that they have no competing interests.

Author's contributions

DZ conceived and designed the method, analyzed data and drafted the manuscript.

Acknowledgements

DZ is supported by Research Start-up Grants from the University of New Orleans and the Research Institute for Children of Children's Hospital New Orleans.

This article has been published as part of BMC Bioinformatics Volume 10 Supplement 1, 2009: Proceedings of The Seventh Asia Pacific Bioinformatics Conference (APBC) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S1