- Journal List
- BMC Proc
- v.5(Suppl 2); 2011
- PMC3090761

# Nonlinear gene cluster analysis with labeling for microarray gene expression data in organ development

^{}

^{1,}

^{2}Vinodh N Rajapakse,

^{2}Barry R Zeeberg,

^{3}Brian P Brooks,

^{4}Jacob Brown,

^{4}Wojciech Czaja,

^{2}and Robert F Bonner

^{1}

^{1}National Institutes of Health, Eunice Kennedy Shriver National Institute of Child Health and Human Development, Section on Medical Biophysics, Bethesda MD 20892, USA

^{2}University of Maryland, Department of Mathematics, Norbert Wiener Center, College Park MD 20742, USA

^{3}National Institutes of Health, National Cancer Institute, Laboratory of Molecular Pharmacology, Genomics & Bioinformatics Group, Bethesda MD 20892, USA

^{4}National Institutes of Health, National Eye Institute, Ophthalmic Genetics and Visual Function Branch, Bethesda MD 20892, USA

^{}Corresponding author.

#### Supplement

#### Conference

## Abstract

### Background

The gene networks underlying closure of the optic fissure during vertebrate eye development are not well-understood. We use a novel clustering method based on nonlinear dimension reduction with data labeling to analyze microarray data from laser capture microdissected (LCM) cells at the site and developmental stages (days 10.5 to 12.5) of optic fissure closure.

### Results

Our nonlinear methods created clusters of genes that mapped onto more specific biological processes and functions related to eye development as defined by Gene Ontology at lower false discovery rates than conventional linear cluster algorithms. Our new methods build on the advantages of LCM to isolate pure phenotypic populations within complex tissues in order to identify systems biology relationships among critical gene products expressed at lower copy number.

### Conclusions

The combination of LCM of embryonic organs, gene expression microarrays, and nonlinear dimension reduction with labeling is a potentially useful approach to extract subtle spatial and temporal co-variations within the gene regulatory networks that specify mammalian organogenesis and organ function. Our results motivate further analysis of nonlinear dimension reduction with labeling within other microarray data sets from LCM dissected tissues or other cell specific samples to determine the more general utility of our method for uncovering more specific biological functional relationships.

## Background

Common variations in genetic and epigenetic patterns among humans are associated with variations in risk for developing all common chronic diseases, a few of which have been identified from genome-wide polymorphism screens [1,2]. The functional biological robustness or its failure in disease is most likely not just reflected in a few dominant components, but in many complex interactions within gene regulatory networks. Due to the overwhelming complexity, the deeper understanding of such networks remains a major challenge in modern systems biology, a field that aims to discover and iteratively refine mechanistic models of biological processes. Biological knowledge is typically encoded in the structure and parameterization of these models. The Gene Ontology project [3,4] can help to incorporate the known biological details of gene functions into such analysis. The challenge is to reasonably approximate attributes in such models using experimental data that is complex, noisy, and often incomplete. For the purpose of acquiring biologically rich data sets, laser capture microdissection (LCM) has proven a powerful tool to isolate pure cell populations from complex heterogeneous tissue specimens [5-7]. In combination with microarray technologies, that allow the simultaneous measurement of expression levels for thousands of genes, LCM enables identifying critical gene products even if expressed at low copy numbers. Our work aims to facilitate efforts in systems biology by organizing data in ways that can suppress noise and better reveal latent, biologically meaningful structure. Coloboma is a not uncommon congenital defect of human ocular development resulting in large retinal holes which often significantly affect vision. The present paper focuses on refinements in the analysis of a temporal series of microarray data obtained from microdissected sites of retinal fissure closure in normal mouse embryos. This data was previously analyzed [8] to identify a putative repressive transcription factor, nlz2 (zinc finger protein 503), which, when its expression was blocked in zebrafish embryos, led to incomplete optic fissure closure, a coloboma model. The interaction of transcription factors, binding sites and gene networks involving nlz2 and related genes, however, are poorly understood [8]. The present paper is dedicated to develop a novel pipeline for the analysis of microarray gene expression data that complements standard approaches and provides a list of candidate genes guiding further experimental analysis of genetic variations.

By developing and applying a novel clustering scheme, we have identified a 50 per cent larger gene cluster (in comparison to PCA and previous hierarchical cluster analyses [8]), whose spatio-temporal gene expressions correlate with nlz2. According to GoMiner, a computational high-throughput tool for biological interpretation of genomic, transcriptomic, and proteomic data, that identifies the biological processes, functions and components of gene clusters [9,10], this larger cluster still shows gene enrichment for its specific functions in the context of Gene Ontology.

Next, using GoMiner, we sought to identify those gene clusters whose co-expressions correlate with processes in eye development. First, we apply a novel clustering scheme that builds on the intertwining of Laplacian Eigenmaps, a nonlinear geometrical data transformation, with *k*-means and hierarchical clustering. To validate the findings, we also use two standard clustering schemes, basic *k*-means and principal component analysis combined with *k*-means and hierarchical clustering. All three methods identify gene clusters enriched for functional GoMiner categories related to eye development, but the proposed nonlinear scheme leads to lower false discovery rates. Secondly, we have proposed a mechanism that allows experts to introduce their input in form of additional, labeled information by means of a potential on a data-dependent graph in [11] to improve the dimension reduction and clustering process. Distances between certain labeled genes are forced to appear closer than normally while others are increased. In the present paper, we aim to label genes that are highly connected and thus constitute hubs within the regulatory network. Such genes appear to promote coherence within a gene cluster and would thus be ideal candidates for labeling to obtain a more meaningful and coherent clustering. There are many ways to extract genes of high connectivity, and we use the weights that are generated by the Laplacian on the regulatory network and alternatively weighted correlation networks as described in [12]. Identified gene hubs are then labeled to incorporate regulatory network characteristics into the labeled Laplacian clustering. This novel clustering scheme based on nonlinear dimension reduction and involving labeled data further improves the biological specificity according to GoMiner analysis.

Starting from experimental work based on LCM and microarray technologies in organogenesis, we obtain a list of candidate genes that could be significant in normal development of optic fissure closure and could be useful in guiding analysis of genetic variations in humans with coloboma.

## Materials and methods

The Affymetrix MOE 430 2.0 microarray datasets analyzed to develop and test our new method were for eight samples LCM microdissected from serial cryosections of the retina at the site of final optic fissure closure in the mouse embryos at specific embryonic stages 10.5 days through 12.5 days previously reported in [8]. The 8 time-points span the time just before and just after final fusion (optic fissure closure) and were expected to reveal sets of genes critical for the completion of optic fissure closure in normal development. This previous report further investigated a specific putative repressive transcription factor, nlz2 (or zinc finger protein 503), that was discovered to be highly expressed before and during fissure closure and then downregulated. Gene knockdown experiments in zebra fish of nlz2 resulted in incomplete optic fissure closure (coloboma). Our current analysis explores possible associated gene regulation patterns. Within the 8 different time-point microarrays were 8316 genes consistently identified as expressed and with greater than 2-fold variation in gene expression levels. For our clustering analysis, we chose the subset of *n* = 3416 genes whose expression levels varied between 4-fold and 26-fold over the 2 days of embryonic development. For analysis purposes, each gene of the microarray is considered as a vector of its expression levels. This perspective yields a collection of *D* = 8 dimensional vectors. Our proposed analysis relies on Laplacian Eigenmaps [13,14], a geometrical data transformation that provides a new representation of gene expressions still covering essential geometrical behaviors. The nonlinear geometric representation can be further steered by involving labels [11] that are either derived from weighted correlation networks analysis [12] or from the Laplacian analysis. We intertwine this new data representation with *k*-means [15], a widely used clustering scheme. GoMiner [9,10] is then used to identify genes within clusters that are associated with particular biological processes or function. Let us list the steps of our proposed scheme:

1. **Expression vectors:** Each gene’s expression over the 8 time points builds a vector. They constitute a collection {*x*_{1},…, *x _{n}*} of 8-dimensional vectors, where

*n*is the number of considered expressions

2. **Nonlinear dimension reduction:** Choose a target dimension *d* <*D*, and obtain a new *d*-dimensional data representation {*y*_{1},…, *y _{n}*} of the original

*D*-dimensional vectors {

*x*

_{1},…,

*x*}

_{n}3. ** k-means:** Run

*k*-means on {

*y*

_{1},

*…*,

*y*} to obtain the final clustering

_{n}4. **GoMiner:** Feed the clusters into GoMiner to evaluate their biological relevance

Step 2 in the above scheme is specified in two different ways: First, we use a nonlinear dimension reduction method without labeling (unsupervised):

2.A **Laplacian Eigenmaps:** Choose the number *m* of gene neighbors and a target dimension *d* <*D*, then apply Laplacian Eigenmaps to obtain a new *d*-dimensional data representation {*y*_{1},*…*,*y _{n}*} of the original

*D*-dimensional vectors {

*x*

_{1},…,

*x*}

_{n}Alternatively, we may want to incorporate further input into the dimension reduction process by using labeled data. We then identify step 2 with the following supervised procedure:

2.B a) **Identifying highly connected genes:** Apply an R package for weighted correlation network analysis (WGCNA) [12] to identify genes that are highly connected within the gene regulatory network and that act as hubs. Alternatively, use the Laplacian analysis to identify highly connected genes

2.B b) **Schroedinger Eigenmaps:** Gene hubs are labeled by means of a potential term. Choose the number m of gene neighbors and a target dimension *d* <*D.* The application of Laplacian Eigenmaps with potentials [11] yields a new *d*-dimensional data representation {*y*_{1},*…*,*y _{n}*} of the original

*D*-dimensional vectors {

*x*

_{1},…,

*x*}

_{n}In the following, we present the components of the above scheme in more detail. For comparison we also applied PCA and *k*-means and therefore briefly discuss these conventional methods too.

### Principal component analysis

PCA [16] is a statistical tool that linearly transforms the data into an orthogonal coordinate system whose axes correspond to the principal components in the data, i.e., the first principal component accounts for as much variance in the data as possible and, successively, further components capture the remaining variance. Through an eigenanalysis, the principal components are determined as eigenvectors of the dataset’s covariance matrix and the corresponding eigenvalues refer to the variance that is captured within each eigenvector. After subtracting the mean of the dataset, PCA is performed on vectors {*x*_{1},…,*x _{n}*} by first diagonalizing the covariance matrix cov(X) =

*E*(

*XX*

^{}), where

*X*= (

*x*

_{1}…

*x*) is the zero mean data matrix. The eigenvectors

_{n}*p*

_{1},

*…*,

*p*the principal components ordered according to the magnitude of their eigenvalues - provide the transformed data

_{D}-*Y*=

*W*, where

^{}X*W*= (

*p*

_{1}

*…p*)

_{D}*.*We obtain the collection of d-dimensional vectors {

*y*

_{1},…,

*y*} whose first entries represents the abundance of the primary principal. The second entries are each datapoint’s projection along the second eigenvector and so forth.

_{n}### Laplacian Eigenmaps

Laplacian Eigenmaps (LE) [13,14] is a nonlinear geometric tool that transforms data into a new representation in a nonlinear fashion. Given points {*x*_{1},…,*x _{n}*}

*, we assume that they are steered by*

^{D}*d*latent variables, and aim to find a new data representation {

*y*

_{1},…,

*y*}

_{n}^{d}. We briefly recall the three step procedure of Laplacian Eigenmaps.

**Step 1: Adjacency graph, m-nearest neighbors** We build a graph

*G*, whose nodes

*i*and

*j*are connected if

*x*is among the

_{i}*m*-nearest neighbors of

*x*or vice versa. The distance between data points is measured by the Euclidean metric. The graph

_{j}*G*represents the connectivity of the data vectors.

**Step 2: Heat kernel as weights** Next, we weight the edges of the graph and focus on the diffusion *weight matrix W* given by

The number of neighbors *m* controls the sparsity of *W.*

**Step3: Solving an eigenvalue problem** We denote a potential new data representation by *y* = (*y*_{1},…,*y _{n}*)

^{}, where each row is considered as a vector in

*, and we then consider the following minimization problem*

^{d}where *L* = *D — W* and *D* is the diagonal matrix *D _{i,i}* = ∑

*. The minimizer of (2) is given by the*

_{j}W_{i,j}*d*minimal eigenvalue solutions of

*Lx*=

*λDx*under the constraint

*y*

^{}

*Dy*=

*I*, where

*I*denotes the identity matrix, i.e., the minimizer

*y*’s columns are the

*d*eigenvectors with respect to the smallest eigenvalues. If the graph is connected, then

**1**= (1,…,1)

^{}is the only eigenvector with eigenvalue 0, and we exclude it. Instead of (2), we try to find the minimizer of

By applying the change of variables *z* = *D*^{1/2}*y*, this yields

where . The minimizer *z* is given by the *d* eigenvectors with smallest nonzero eigenvalue, and we obtain the *d*-dimensional representation {*y*_{1},…,*y _{n}*} from

*y*=

*D*

^{-1/2}z.### Identifying highly connected genes

Weighted gene co-expression network analysis is a systems biology tool that allows to identify highly connected genes within a regulatory network. An R package implementation WGCNA is available with an accompanying tutorial [12]. Alternatively, the matrix *D* in (2) is a measure of the connectivity within the network and can be used to identify highly connected genes within the Laplacian framework directly.

### Schroedinger Eigenmaps

Based on the Laplacian matrix *L* in (3), a flexible potential, that can capture additional labels, has been introduced in [11]. The matrix *L* is replaced with a Schroedinger type matrix *E* = *L* + *V*, where *V* is a potential matrix that encodes labels. One then aims to minimize

The result is a new Schroedinger Eigenmaps method that allows for input in an otherwise fully automated dimension reduction process [11]. Here, labels are utilized to emphasize “important” genes, and we use the connectivity of genes as a measure of their importance in the description of the regulatory network.

### Standard cluster analysis

For hierarchical clustering, we refer to [17], and we also apply a shape similarity-based clustering as introduced in [18]. *k*-means is a method of cluster analysis which aims to partition *n* observations into *k* clusters {*c*_{1},…,*c _{k}*}, where

*k*has to be chosen a-priori [15], i.e., one aims at minimizing

where is the mean of cluster *c _{j}*. The basic

*k*-means algorithm requires the target number of clusters to be specified as a parameter.

The *k*-means algorithm begins with a data set, a target number of clusters *k*, and a set of *s*_{1},…,*s _{k}* initial cluster centroids. It then iteratively assigns points to clusters by centroid proximity, and then adjusts centroids to reflect changes in cluster membership. The algorithm terminates either after a specified number of iterations, or once the cluster centroids/membership no longer change. Although optimal results cannot be guaranteed, the algorithm is quite fast, and many runs can be efficiently computed, with the best clustering taken as an overall result.

### GoMiner

GoMiner provides a quantitative and statistical analysis-tool for biological interpretation of genomic, transcriptomic, and proteomic data, commonly derived from gene expression microarray experiments. It classifies genes into biologically coherent categories and then uses the Gene Ontology project to identify the biological processes, functions and components of genes within these categories [9,10]. A one-sided Fisher’s *p*-value is used to determine the significance and biological enrichment levels within a category.

### Clustering with Genesis

Clustered image maps (CIMs) were first introduced in [19] and are produced here with the Genesis program [20]. We select the Euclidean distance metric and average linkage for hierarchal clustering. To facilitate visualization, a recently-added feature of GoMiner has been implemented that removes large generic categories from all CIMs.

### Silhouette coefficient

The silhouette coefficient is a measure for the coherence of clusters. If we take a clustering *C* to be a mapping from a data set *X* = {*x*_{1},…,*x _{n}*} to the integers 1, 2, …,

*k*(where

*k*is the total number of clusters), we can define the silhouette coefficient

*sil*(

*x*) for each point

*x*in

*X*to be

where *A*(*x*) is the average distance between *x* and other points in its cluster, and *B*(*x*) is the average distance between *x* and the points in the nearest neighboring cluster, cf. [21]. The silhouette coefficient *sil(i)* for a cluster *i* is the average of the coefficients for its constituent points. We similarly define the silhouette coefficient sil for an entire clustering to the average silhouette coefficient over all data set points. A clustering with a silhouette coefficient closer to 1 will contain more cohesive and well-separated clusters. For our experiments, we use the squared Euclidean distance for the computations indicated above, as well as for the data clustering algorithms.

### Description of the approach

Microarray data from LCM isolated cells in a mouse model of coloboma as described in the present Section are analyzed by using standard cluster analysis and a novel gene clustering scheme. We derive a coherent clustering and make use of GoMiner to identify those genes identified in public databases as being associated with eye development or function as a measure of the quality of the other members in the cluster. We used GoMiner to identify the degree of association of clusters obtained by all methods with early stage retinal development, and, in particular, with the closure of the optic fissure, see Figures Figures1,1, ,2,2, ,3,3, ,4,4, ,5.5. For *k*-means, we set the target number of clusters to be 24, based on previous work with the current data set [8] that yielded biologically meaningful (but smaller and fewer) cluster results. The maximal silhouette coefficient sil specifies the best *k*-means clustering over 100 repeated runs, starting in each case from different randomly selected initial centroids. The maximum was stable over different 100 run sets, suggesting that an at least near optimal clustering was being obtained. Since the parameter space is too big for an exhaustive search in the dimension reduction process, we fix σ = 1/8 in (1) and assess remaining parameters (number of nearest neighbors and target dimension) over *>m* = 5, …, 10, 12, 15, 20, 25, 50, 100 and *d* = 1, …, 10, 12, 16. The idea is that parameter combinations that yield better cluster structure in the mapped data {*y*_{1},…,*y _{n}*} might be better tuned to resolve possible intrinsic structure in the original data {

*x*

_{1}…,

*x*}. Silhouette coefficients suggest values

_{n}*m*= 10 and

*d*= 2, which additionally provide excellent GoMiner gene identifications.

**CIM cluster 22**CIM for LE+k-means cluster 22 with functional categories related to eye development, false discovery rate (FDR)< 0.05. The cluster is enriched for eye photoreceptor cell development and for a eye photoreceptor cell differentiation.

**...**

**CIM cluster 22 with relaxed**FDR portion CIM for LE+k-means cluster 22 with functional categories related to eye development (the entire CIM contains 74 GO categories). The input cluster for the present CIM is the same as for Figure Figure1.1.

**...**

**CIMs across methods**Enriched GoMiner categories that are shared among clustering methods. Each of the 3 different clustering methods (LEIGS KM 24, PCA KM 24, KM 24) produced 24 clusters. We picked 12 among the 72 clusters that seemed to show significant

**...**

**CIM Schroedinger Eigenmaps I**Seven highly connected genes from Figures Figures1111 and and1212 were labeled in Schroedinger Eigenmaps. After clustering, all seven labeled genes are contained in the same cluster with 145 other genes. The

**...**

**CIM Schroedinger Eigenmaps II**Five highly connected genes from Figure Figure1313 were labeled in Schroedinger Eigenmaps. After clustering, all five genes are contained in the same cluster. The GO categories for this cluster match the optic fissure

**...**

## Results

We aim to increase our understanding of the gene network underlying the closure of the optic fissure during vertebrate eye development:

### Enlarged cluster containing nlz2

We have identified a 50 per cent larger gene cluster than with hierarchical clustering in [8] whose spatio-temporal gene expressions significantly correlate with nlz2, a gene which when previously inhibited in zebrafish induced coloboma. The latter cluster is associated with 210 Affymetrix probes corresponding to 169 genes, nlz2 was among them. See Figures Figures66 and and77 for gene expression profiles and its set of enriched functional categories. GoMiner assigns the functional category of ‘gene silencing’, indicating the repressive influence of nlz2 and co-varying genes. Previous biological studies have shown nlz2 gene product to repress gene transcription of a number of genes regulated hindbrain development possibly as part of a transcription factor complex consistent with its H2N2 zinc finger domain and its binding site for histone deacetylase. Consistent with this hypothesis, we also identify an additional cluster that varies inversely with the primary ‘nlz2 cluster’ gene silencing, suggestive of the previously documented role of nlz2 in suppression of gene transcription, cf. Figure Figure88.

**nlz2 cluster profile**Profile of cluster that contains nlz2. Gene expression levels are plotted vs. 8 time-points, black circles indicate nlz2.

**CIM containing nlz2**Clustered Image Map (produced by GoMiner) showing enriched functional categories for the cluster that contains nlz2 and 168 other genes. More genes in this cluster have been associated to the 5 above GO categories (gene silencing is

**...**

### One complementary cluster

We have found a large cluster whose shape is distinct from nlz2 by applying the similarity-based shape clustering in [18]. GoMiner assigns a number of significantly associated functions to this large cluster including **retina morphogenesis** (vertebrate eye), **generation of neurons**, cellular morphogenesis during differentiation, photoreceptor differentiation, cell motility, neuron differentiation, cell projection organization, and biogenesis. The highlighted functions are specifically associated with CHX10, a gene in this cluster that has previously been identified in retinal development, see, for instance, [22,23].

### Collection of enriched clusters

We also apply *k*-means on the original data set and on PCA and LE reduced data. The selected ‘best’ *k*-means result applied directly to the original data has an overall silhouette coefficient of 0.38. To evaluate PCA+*k*-means, for each possible number of retained principal components, the mapped data is clustered, and overall silhouette scores are obtained. The best results refer to the mapping based on principal components capturing about 85% of the variance, with the best overall silhouette score being 0.698. The silhouette scores in the mapped data are substantially higher than those obtained following clustering of the original data, illustrating the fact that Laplacian Eigenmaps enhance cluster structure, see Table Table11 for more details.

*k*-means leads to higher silhouette coefficients.

We find that PCA+*k*-means, basic *k*-means, and LE+*k*-means yield several significantly enriched gene clusters (out of a total of 24) associated with developmental processes, cf. Table Table2.2. Cluster 22 of the Laplacian Eigenmaps-based approach reveals a cluster significantly enriched (with a false discovery rate (FDR) of less than 0.05) for genes specifically implicated in eye development - which is the focus of the experimental work underlying the data set considered in this study. These functional categories (in GoMiner terminology) are

*k*-means and PCA+

*k*-means do not show any eye specific enrichement in any of the clusters. Only LE+

*k*-means

**...**

(i) GO:0042462 eye photoreceptor cell development,

(ii) GO:0001754 eye photoreceptor cell differentiation,

(iii) GO:0042461 photoreceptor cell development.

When slightly relaxing the FDR up to < 0.15, this cluster 22 shows gene enrichment for further eye specific developmental functions:

(iv) GO:0048592 eye morphogenesis,

(v) GO:0001654 eye development,

(vi) GO:0046530 photoreceptor cell differentiation,

see also Figures Figures11 and and4.4. These categories are neither hit by *k*-means nor PCA+*k*-means clustering when restricting the FDR to < 0.05. By relaxing the FDR, however, both k-means and PCA+k-means clustering show gene enrichment for eye specific functions. This verifies that the eye specific functions in LE+*k*-means cluster 22 are real and have not been picked up by chance. To support the latter claim, we compare the enriched categories in the LE+*k*-means cluster 22 with the clusters of the other two clustering methods with relaxed FDR. It turns out that specific eye development functions are present in all three clustering methods, but our proposed Laplacian-based scheme leads to lower false discovery rates, see also Table Table2.2. Potential nonlinear structures in the data could be an explanation for this observation, see Figure Figure9.9. A nonlinear dimension reduction method would clearly be better suited to fit nonlinear structures than linear methods. CIMs in Figures Figures33 a)-b) indicate which clusters across the three methods share common GoMiner categories. It enables us to identify categories that are more specific to one method than to the others. Based on Table Table11 the fraction of genes, that are associated to biological functions, are computable for each cluster, method, and false discovery rate.

### Note on LE+*k*-means

We note that relatively unusual expression patterns are often mapped to distinct, outlying clusters by the Laplacian Eigenmaps approach. For example, the three expression patterns indicated in Figure Figure88 form a distinct cluster under the Laplacian Eigenmaps data representation. They are not as well separated in the original and PCA-mapped data, and are consequently misplaced in inappropriate clusters. This could be a technical explanation for greater biological specificity of Laplacian Eigenmaps clustering.

### Schroedinger Eigenmaps

We first label a collection of transcription factors that are known to be annotated to eye development. Enriched GO categories, however, appear generic when applying Schroedinger Eigenmaps with such labels, cf. Figure Figure10.10. To obtain more meaningful labels, that are directly extracted from the data rather than from the literature, we identify a set of highly connected genes through the weighted correlation analysis described in [12], see Figures Figures1111 and and12.12. These “hub genes” are then labeled by means of the potential to steer Schroedinger Eigenmaps utilizing the gene network topology. This labeling seems to further improve the biological specificity, cf. Figure Figure4.4. Alternatively, the matrix D in (2) is a natural measure of the connectivity within the Laplacian framework. According to D, we use highly connected genes as labels within the LE cluster 22, cf. Figure Figure13,13, providing the highest biological specificity, cf. Figure Figure5.5. Enriched GO categories that are derived from supervised and unsupervised dimension reduction are shown in Table Table3.3. The supervised procedure Schroedinger Eigenmaps identifies more categories specific to early retinal development and the optic fissure closure than the unsupervised approach.

**CIM Schroedinger Eigenmaps III**We have labeled transcription factors (CHX10, OTX2, PAX6) that are known from the literature to be associated to development. We intend to derive a cluster whose GO categories are specific to eye development when starting

**...**

**plot of connectivity scores in increasing order for WGCNA weights**To derive gene labels directly from the measured Affymetrix data rather than from the literature, we aim to identify co-varying genes with high connectivity in the regulatory network. The

**...**

**connectivity network for WGCNA weights**Portion of the thresholded weighted correlation network derived from WGCNA. In the entire connectivity network, each of the genes to be labeled (Cdh4, Dll1, Tox, Onecut2, Dcc, Epha5, Cadps) would have more than 16

**...**

**plot of connectivity scores in increasing order for LE weights, LE cluster 22**We aim to further improve the biological specificity of cluster 22 derived from Laplacian Eigenmaps +

*k*-means. To identify co-varying genes with high connectivity in the regulatory

**...**

## Discussion

Obtaining a clearer understanding of the gene regulatory network underlying optic fissure closure during eye development will be a long process involving genetic analysis of humans with coloboma and studies of eye development in animal models. Our present analysis and results focus on expanding a list of candidate genes that could be critical for normal fissure closure and in coloboma patients may contain mutations. Compared with conventional clustering algorithms that we tested, our new method is able to identify larger clusters associated either with the nlz2 gene expression or with a distinctly complementary pattern enriched with associations to eye development gene ontologies. It also uniquely identifies the ‘nlz2-repressed’ pattern as a distinct cluster, cf. Figure Figure2.2. The large temporally covarying gene cluster in Figure Figure77 is identified by GoMiner as being significantly associated with gene silencing, suggestive of a gene regulatory network that represses alternative fates until optic fissure closure is successfully completed (day 11.5 in the mouse). The pattern of genes in Figure Figure22 could represent such genes that are transiently repressed only when the nlz2 cluster is high. Using temporal pattern-based similarity clustering [18] allows identification of other distinct clusters (i.e., not containing nlz2) in which GoMiner identifies significant associations with specific developmental functions in databases.

Distinct biological specificity for our data set is obtained when labeling highly connected genes and encoding these labels in the potential term. The GO categories **eye morphogenesis, retina morphogenesis in camera type eye, and camera type eye morphogenes**, for instance, reflect the optic fissure closure and are identified by Schroedinger Eigenmaps suggesting that nonlinear dimension reduction with labeled data can improve the biological specificity in gene cluster analysis, cf. Figure Figure55 and Table Table33.

Clearly, our new mathematical approach to identify new components of gene regulatory networks controlling development is preliminary and would need further validation to claim its usefulness in more generality. We anticipate improvements in our analysis methods based on nonlinear dimension reduction with connectivity analysis and labeled data.

## Conclusions

Microarray data are commonly used for global searches for gene expression changes that might be associated with a perturbation of a cell state or in pathology. In organ development, temporal and spatial patterns accessible through microdissection are associated with reproducible changes in gene expression of even larger numbers of genes. More efficient analysis of microarray data from such microdissected samples could provide improved understanding of cell fate and organogenesis as well as elaboration of gene expression covariance networks. Our nonlinear analysis scheme based on Laplacian Eigenmaps and labeling highly connected genes through a potential appears to offer advantages over standard clustering algorithms in the sense of greater biological specificity and sensitivity. Our results motivate further analysis of nonlinear dimension reduction with labeling within other microarray data sets from LCM dissected tissue or other phenotypically specific cell samples to potentially validate its biological specificity in more generality. Together with LCM-focused gene expression microarray measurements, our proposed analysis could be part of an iterative process to more completely identify additional elements in gene regulatory networks underlying mammalian organogenesis.

## Competing interests

The authors declare that they have no competing interests.

## Authors’ contributions

ME, WC, and RFB made substantial contributions to conception and design of the study. BPB and JB acquired the primary expression data. ME, VNR, and BRZ analyzed the data. ME, VNR, and RFB contributed to the interpretation of the data. ME and RFB wrote the manuscript, and ME, VNR, BRZ, WC, and RFB have been involved in revising it critically.

## Acknowledgements

The research was funded by intramural research funds from the National Institute of Child Health and Human Development, National Institutes of Health. WC is supported, in part, by NSF (CBET0854233), by NGA (HM15820810009), and by ONR (N000140910144). ME is supported by the NIH/DFG Research Career Transition Awards Program (EH 405/1-1/575910).

This article has been published as part of *BMC Proceedings* Volume 5 Supplement 2, 2011: Proceedings of the 6th International Symposium on Bioinformatics Research and Applications (ISBRA'10). The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S2.

## References

- Mordantameron DJ, Yang Z, Gibbs D, Chen H, Kaminoh Y, Jorgensen A, Zeng J, Luo L, Brinton E, Brinton G, Bernstein PS JMB, Zabriskie NA, Tang S, Constantine R, Tong Z, Zhang K. HTRA1 variant confers similar risks to geographic atrophy and neovascular age-related macular degeneration. Cell Cycle. 2007;6(9):1122–1125. doi: 10.4161/cc.6.9.4157. [PubMed] [Cross Ref]
- Yang Z, Camp NJ, Sun H, Tong Z, Gibbs D, Cameron DJ, Chen H, Zhao Y, Pearson E, Li X, Chien J, Dewan A, Harmon J, Bernstein PS, Shridhar V, Zabriskie NA, Hoh J, Howes K, Zhang K. A variant of the HTRA1 gene increases susceptibility to age-related macular degeneration. Science. 2006;314(5801):992–993. doi: 10.1126/science.1133811. [PubMed] [Cross Ref]
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. 2000;25:25–9. doi: 10.1038/75556. [PMC free article] [PubMed] [Cross Ref]
- Gene Ontology Consortium. The Gene Ontology project in 2008. Nucleic Acids Res. 2008;36:D440–4. doi: 10.1093/nar/gkm883. [PMC free article] [PubMed] [Cross Ref]
- Bonner RF, Emmert-Buck M, Cole K, Pohida T, Chuaqui R, Goldstein S, Liotta LA. Laser capture microdissection: molecular analysis of tissue. Science. 1997;21(278):1481–1483. doi: 10.1126/science.278.5342.1481. [PubMed] [Cross Ref]
- Goldstein SR, McQueen OG, Bonner RF. Thermal modeling of laser capture microdissection. Appl. 1998;37(31):7378–91. doi: 10.1364/AO.37.007378. [PubMed] [Cross Ref]
- Suàrez-Quian CA, Goldstein SR, Bonner RF. Laser capture microdissection: a new tool for the study of spermatogenesis. J. 2000;21(5):601–8. [PubMed]
- Brown JD, Dutta S, Bharti K, Bonner RF, Munson PJ, Dawid IB, Akhtar AL, Onojafe IF, Alur RP, Gross JM, Hejtmancik JF, Jiao X, Chan W, Brooks BP. Expression profiling during ocular development identifies 2 Nlz genes with a critical role in optic fissure closure. Proc. Nat. Acad. Sci. USA. 2009;106(5):1462–7. doi: 10.1073/pnas.0812017106. [PMC free article] [PubMed] [Cross Ref]
- Zeeberg B, Feng W, Wang G, Wang MD, Fojo AT, Sunshine M, Narasimhan S, Kane DW, Reinhold WC, Lababidi S, Bussey KJ, Riss J, Barrett JC, Weinstein JN. GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol. 2003;4(4):R28. doi: 10.1186/gb-2003-4-4-r28. [PMC free article] [PubMed] [Cross Ref]
- Zeeberg B, Qin H, Narasimhan S, Sunshine M, Cao H, Kane DW, Reimers M, Stephens RM, Bryant D, Burt SK, Elnekave E, Hari DM, Wynn TA, Cunningham-Rundles C, Stewart DM, Nelson D, Weinstein JN. High-Throughput GoMiner, an ’industrial-strength’ integrative gene ontology tool for interpretation of multiple-microarray experiments, with application to studies of Common Variable Immune Deficiency (CVID) BMC Bioinformatics. 2005;6:168. doi: 10.1186/1471-2105-6-168. [PMC free article] [PubMed] [Cross Ref]
- Czaja W, Ehler M. Schroedinger Eigenmaps for the analysis of bio-medical data. submitted to IEEE T. 2010. [PubMed]
- Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2008;9 doi: 10.1186/1471-2105-9-559. [PMC free article] [PubMed] [Cross Ref]
- Belkin M, Niyogi P. Laplacian eigenmaps and spectral techniques for embedding and clustering. NIPS. 2002;14
- Belkin M, Niyogi P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural. 2003;15(6):1373–1396. doi: 10.1162/089976603321780317. [Cross Ref]
- Lloyd SP. Least-squares quantization in PCM. IEEE Trans. Inform. Theory. 1982;28(2):129–137. doi: 10.1109/TIT.1982.1056489. [Cross Ref]
- Pearson K. On lines and planes of closest fit to systems of points in space. Philosophical Magazine. 1901;2(7-12):559–572.
- Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. Springer; 2009.
- Hestilow T, Huang Y. Clustering Of Gene Expression Data Based On Shape Similarity. EURASIP J. 2009. [PMC free article] [PubMed]
- Weinstein JN, Myers TG, O’Connor PM, Friend SH, Fornace JAJ, Kohn KW, Fojo T, Bates SE, Rubinstein LV, Anderson NL, Buolamwini JK, van Osdol WW, Monks AP, Scudiero DA, Sausville EA, Zaharevitz DW, Bunow B, Viswanadhan VN, Johnson GS, Wittes RE, Paull KD. An information-intensive approach to the molecular pharmacology of cancer. Science. 1997;275(5298):343–9. doi: 10.1126/science.275.5298.343. [PubMed] [Cross Ref]
- Sturn A, Quackenbush J, Trajanoski Z. Genesis: cluster analysis of microarray data. Bioinformatics. 2002;18:207–8. doi: 10.1093/bioinformatics/18.1.207. [PubMed] [Cross Ref]
- Kaufman L, Rousseeuw P. Finding Groups in Data: An Introduction to Cluster Analysis. London: John Wiley and Sons; 1990.
- Reichman S, Kalathur RKR, Lambard S, Aït-Ali N, Yang Y, Lardenois A, Ripp R, Poch O, Zack DJ, Sahel J, Léveillard T. The homeobox gene CHX10/VSX2 regulates RdCVF promoter activity in the inner retina. Hum. Mol. 2010;19(2):250–61. [PMC free article] [PubMed]
- Sigulinsky CL, Green ES, Clark AM, Levine EM. Vsx2/Chx10 ensures the correct timing and magnitude of Hedgehog signaling in the mouse retina. Dev. 2008;317(2):560–75. doi: 10.1016/j.ydbio.2008.02.055. [PMC free article] [PubMed] [Cross Ref]

**BioMed Central**

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (1.8M) |
- Citation

- Neural crest and mesoderm lineage-dependent gene expression in orofacial development.[Differentiation. 2007]
*Bhattacherjee V, Mukhopadhyay P, Singh S, Johnson C, Philipose JT, Warner CP, Greene RM, Pisano MM.**Differentiation. 2007 Jun; 75(5):463-77. Epub 2007 Feb 5.* - Fuzzy ensemble clustering based on random projections for DNA microarray data analysis.[Artif Intell Med. 2009]
*Avogadri R, Valentini G.**Artif Intell Med. 2009 Feb-Mar; 45(2-3):173-83. Epub 2008 Sep 17.* - Contaminating cells alter gene signatures in whole organ versus laser capture microdissected tumors: a comparison of experimental breast cancers and their lymph node metastases.[Clin Exp Metastasis. 2008]
*Harrell JC, Dye WW, Harvell DM, Sartorius CA, Horwitz KB.**Clin Exp Metastasis. 2008; 25(1):81-8. Epub 2007 Oct 12.* - [Application of laser capture microdissection in plant genomic research].[Yi Chuan. 2006]
*Cai MH, Hu YK, Li YX, Yan YM.**Yi Chuan. 2006 Oct; 28(10):1325-9.* - Techniques for clustering gene expression data.[Comput Biol Med. 2008]
*Kerr G, Ruskin HJ, Crane M, Doolan P.**Comput Biol Med. 2008 Mar; 38(3):283-93. Epub 2007 Dec 3.*

- PubMedPubMedPubMed citations for these articles

- Nonlinear gene cluster analysis with labeling for microarray gene expression dat...Nonlinear gene cluster analysis with labeling for microarray gene expression data in organ developmentBMC Proceedings. 2011; 5(Suppl 2)S3

Your browsing activity is empty.

Activity recording is turned off.

See more...