- Journal List
- BMC Bioinformatics
- v.7; 2006
- PMC1468435

# A factor analysis model for functional genomics

^{1}Public Health Sciences, Health Sciences Bldg, University of Toronto, Toronto, ON, Canada

^{2}Department of Combinatorics and Optimization, University of Waterloo, Waterloo, ON, Canada

^{3}Department of Statistics and Actuarial Science, Universityof Waterloo, Waterloo, ON, Canada

^{}Corresponding author.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## Abstract

### Background

Expression array data are used to predict biological functions of uncharacterized genes by comparing their expression profiles to those of characterized genes. While biologically plausible, this is both statistically and computationally challenging. Typical approaches are computationally expensive and ignore correlations among expression profiles and functional categories.

### Results

We propose a factor analysis model (FAM) for functional genomics and give a two-step algorithm, using genome-wide expression data for yeast and a subset of Gene-Ontology Biological Process functional annotations. We show that the predictive performance of our method is comparable to the current best approach while our total computation time was faster by a factor of 4000. We discuss the unique challenges in performance evaluation of algorithms used for genome-wide functions genomics. Finally, we discuss extensions to our method that can incorporate the inherent correlation structure of the functional categories to further improve predictive performance.

### Conclusion

Our factor analysis model is a computationally efficient technique for functional genomics and provides a clear and unified statistical framework with potential for incorporating important gene ontology information to improve predictions.

## Background

Functional genomics is often described as one of the most important challenges in the post-genomics era. Now that many genomes have been sequenced, the next step is to understand the functions of all the gene-products. One of the most accessible genome-wide modalities that has proven useful for this task is expression microarray experiment e.g., [1-3]. The general idea is to associate genes to functional categories by comparing their expression profiles to genes with known functional associations. This is usually supported by the biological argument that genes in the same or similar functional categories are often part of a functional pathway and therefore likely to be co-regulated by the same set of transcription factors.

In a typical functional genomic experiment relying on expression data, a few dozen to a few hundred samples from an organism are processed and hybridized. The samples may be obtained from different tissues and organs, e.g., the mouse experiment of [3], or from different experimental conditions [1]. The expression data for *p *genes are then assembled in a *p *× *n *matrix where each row presents an expression profile of the corresponding gene across *n *samples.

A few distinct ways have been proposed to predict gene functions from microarray data. The authors of [1] run a multitude of clustering methods which generate thousands of (overlapping) clusters. For each functional category, every cluster is assigned an *enrichment *p-value based ona hypergeometric distribution. Those clusters with p-values below a certain threshold are used for prediction: an uncharacterized gene is localized among all enriched clusters and predicted to be involved in all functions these clusters entail. This is biologically sensible since a gene is usually involved in multiple functions. The authors of [2] use gene-gene correlations to build graphical models of gene expression data which are then used, among other things, for functional prediction. The predictions are done by first forming a set of shortest paths between genes, thresholded at some length to retain significance, and predicting with a constrained majority rule using the genes on the path.

One model-based approach, perhaps more intuitive for statisticians, is to use one of many classification methods. In such an approach, one starts with the expression matrix, where genes correspond to observations and their expression profiles correspond to features. A set of *K *functional categories is specified which span all interesting and relevant biological functions for a given genome. Since any given gene can belong to multiple functional categories, the simplest approach is to build *K *separate binary classifiers, one for each function, and use them to predict whether each of the uncharacterized genes belongs to one of the *K *functional categories.

This general paradigm permits one to use many known classification models and algorithms. The authors of [3] apply the Support Vector Machine (SVM) classifier, popular in the machine learning literature, to a set of 55 microarrays of different mouse tissues to predict functions of over 12,000 mouse genes which currently have unknown status. They use a set of 992 functional categories defined by Gene Ontology – Biological Process (GO-BP; see [4]), which is a hierarchical categorization of functions much finer in scale than the set of 42 YPD Cellular Role functions, largely independent of each other (in the sense that most genes belong to at most two functional categories), which was used by [1]. More details about GO-BP are given below.

It is important to acknowledge that functions of gene-products cannot, in many cases, be fully resolved from co-expression data alone: co-expression may be a result of other factors besides co-regulation, e.g., separate pathways that happened to be activated in the same experiment, or simply chance. Similarly, trans-regulation factors may be acting more globally inducing co-expression of genes that are weakly connected or completely disconnected on a functional level. Some of these concerns can be alleviated by using expression data that represent a wide array of conditions, so that significant observed co-expression between genes is more likely to have resulted from close functional relationship of respective gene-products than from other factors. Studies with such design, as those already cited [1,3], have been successful in that regard. As with many other genome-wide studies that use high-throughput technologies, this proposal can potentially be highly useful as an initial hypothesis-generating step.

In this paper, we propose a new approach to predicting gene functions using expression data and GO-BP annotations. It is worth noting that our framework can, in principle, be extended to utilize further sources of information, besides expression data (see "Discussion" section). Such approaches, sometimes dubbed "genomic data fusion" models, have been proposed in the literature and show some promise of improving prediction accuracy. In [5], the authors utilize phylogenetic profiles and fused domain methods in addition to experimental data that include mRNA expression to derive a large number of candidate protein-protein links and hence predict function in yeast *Saccharomyces cerevisiae*. A method which is similar in spirit but more refined in terms of an approach to assigning interactions from various genomic sources, has been proposed by [6], who use phenotype data from gene knockout experiments, cellular localization data, and information from protein-protein interaction databases in addition to expression data. Each data source contributes a bit matrix that codes all the protein-protein relationships. Depending on the manner of inferring the relationships (e.g., various p-value thresholds of Pearson correlation coefficient for expression data), and on the manner in which various bit matrices are combined, one can control the contribution of each data source and arrive at the final association matrix and hence predict gene functions. A moremodel-oriented approach to data fusion is described in [7] and [8], which propose the Bolztmann machine and SVMs, respectively, for gene-product function prediction using various genomic data sources.

### Gene Ontology – biological processes

We use the GO-BP annotations to provide us with a set of functional categories and the associations of characterized genes to these categories. Gene Ontology [9] is a popular information resource for genetic research. The Biological Process part (GO-BP), which is relevant for the expression data, has currently almost 10,000 processes arranged in a Directed Acyclic Graph (DAG) structure. The top node in this DAG (Biological Process, accession number GO:0008150) has six children (Development, Behaviour, Physiological Process, Viral Life Cycle, Cellular Process, Regulation of Biological Process) which constitute major groupings of biological processes in all organisms. Each of these is subdivided into more specialized characterizations. The seventh top child is a Biological-Process Unknown node, which has no further descendants; it is applied to currently uncharacterized genes. The arcs in the GO-BP DAG usually denote subdivisions of a process characterization (commonly called "is_a" relationships), although sometimes they also denote partitions of a process into smaller steps (commonly called "part_of" relationships). The GO-BP is a cross-organism ontology; for any specific organism only a part of its structure will usually be relevant. For example, for simple eukaryotes such as yeast studied in [1], the whole subgraphs under Viral Life Cycle and Behaviour are not relevant and thus have no genes associated to them.

In addition to providing a continually evolving ontology structure, the GO consortium also maintains and curates the association data. For most model organisms, one can obtain a table where each known gene or Open Reading Frame (ORF) is associated with a set of biological processes. These associations are built and maintained based on reliable evidence such as major publications or directly confirmable laboratory experiments, although there is a range of "credibility" scores. These scores are called Evidence Codes and are maintained alongside the association data. They could potentially be used in the analysis to weigh different associations according to their credibility, but we have chosen not to make this distinction in this research.

### Motivation and summary of our study

A problem with the typical classification approach is its need to train very large numbers of classifiers on thousands of genes. This is further complicated by the need to use multiple cross-validation experiments for each classifier to fine tune its control parameters. These requirements are computationally prohibitive for most laboratories. The authors of [3], for example, had access to a powerful computer cluster to perform extensive computations, an atypical resource in many biological labs.

We propose a factor analysis model (FAM) to predict functions for uncharacterized genes using microarray expression data and a subset of the GO-BP functional categories. We show that our factor analysis approach provides significant computational advantages without compromising prediction accuracy. It is worthwhile to notice that FAM directly models the expression-induced correlation structure among genes. This correlation structure coupled with small signal-to-noise ratios of microarray data, can negatively impact the statistical efficiency of some classifiers that treat genes as independent observations [10,11]. Finally, we show that our model has the potential for incorporating important gene ontology information and deal with a number of major challenges in using statistical classification techniques for functional genomics (see "Discussion" section).

## Results

This section gives an overview of our prediction model, experiments (including descriptions of our data, preprocessing steps, benchmarks and performance evaluation metrics) and computational results.

### The model

Our model, FAM, applies factor analysis to predict gene functions using microarray data.

#### The typical approach

We first describe the typical computational approach, e.g., [3], more formally. Let **Z **be the *p *× *n *matrix of gene expression data; the (*i*,*j*)^{th }element of that matrix, *z*_{ij}, is the expression level of gene *i *from experiment (or sample) *j*. Without loss of generality, we shall assume that the GO-BP data contain the functional annotations of the first *p*_{1 }<*p *genes. In particular, with a total of *K *different functional categories, this information can be stored in a binary *p*_{1 }× *K *matrix, which we call **Y**_{1}. Its (*i*, *k*)^{th }element *y*_{ik }equals 1 if gene *i *is associated with function *k*, and equals 0 otherwise. We use **Y**_{2 }to denote the unknown functional annotation matrix for the remaining *p*_{2 }= *p *- *p*_{1 }genes; thus **Y**_{2 }is the *p*_{2 }× *K *matrix that needs to be predicted.

Using the matrix **Y**_{1 }and the first *p*_{1 }rows of **Z**, we can build predictive models to associate gene functions (**Y**) from the microarray data (**Z**). Typically, *K *different models {*F*_{k }: *k *= 1, 2, ..., *K*} are constructed independently, one for each functional category, i.e., each column of **Y**. Since the elements of **Y**_{1 }are binary, each model *F*_{k }is simply a classifier. In theory, any classification algorithm can be used here, the most popular being SVMs, e.g., [3]. Each model *F*_{k }is then applied to the remaining *p*_{2 }rows of **Z **to predict whether or not these genes are associated with function *k*.

#### The factor analysis model (FAM)

Let ${g}_{1}^{\left(j\right)},{g}_{2}^{\left(j\right)},\dots ,{g}_{p}^{\left(j\right)}$ be the expression level of *p *different genes from the *j*^{th }microarray experiment. Using a factor analysis model [[12], Chapter 9], we model the expression level ${g}_{i}^{\left(j\right)}$ as

${g}_{i}^{\left(j\right)}={\displaystyle \sum _{l=1}^{L}{\lambda}_{il}{f}_{l}^{\left(j\right)}+{\epsilon}_{i}^{\left(j\right)},}\text{}\left(1\right)$

where ${f}_{1}^{\left(j\right)},{f}_{2}^{\left(j\right)},\dots {f}_{L}^{\left(j\right)}$ are latent variables that affect the expression level. Others have applied latent variable models to analyze microarray data as well [13,14], although most of these studies use latent variable models in unsupervised ways and not directly to predict gene function.

In the absence of experimental covariates describing each experiment, we may treat each experiment as equivalent with both random noise and experimental differences modeled by the error part, ${\epsilon}_{i}^{\left(j\right)}$. If experiments are controlled and experimental covariates are available, the above model may incorporate them by making *λ*_{il }a function of covariates. This is one of the extensions to this model being currently studied by our group.

In matrix form, Eq. (1) can be written as:

**g**^{(j) }= **Λf**^{(j) }+ *ε*^{(j)}, (2)

or

$\left[\begin{array}{c}{g}_{1}^{\left(j\right)}\\ {g}_{2}^{\left(j\right)}\\ \vdots \\ {g}_{p}^{\left(j\right)}\end{array}\right]=\left[\begin{array}{cccc}{\lambda}_{11}& {\lambda}_{12}& \dots & {\lambda}_{1L}\\ {\lambda}_{21}& {\lambda}_{22}& \dots & {\lambda}_{2L}\\ \vdots & \vdots & \ddots & \vdots \\ {\lambda}_{p1}& {\lambda}_{p2}& \cdots & {\lambda}_{pL}\end{array}\right]\left[\begin{array}{c}{f}_{1}^{\left(j\right)}\\ {f}_{2}^{\left(j\right)}\\ \vdots \\ {f}_{L}^{\left(j\right)}\end{array}\right]+\left[\begin{array}{c}{\epsilon}_{1}^{\left(j\right)}\\ {\epsilon}_{2}^{\left(j\right)}\\ \vdots \\ {\epsilon}_{p}^{\left(j\right)}\end{array}\right].$

Suppose Var(**g**) = **S**. In factor analysis, it is typically assumed that Var(**f**) = **I **is the identity matrix; Var(** ε**) =

**E**= diag{${\sigma}_{1}^{2},{\sigma}_{2}^{2},\dots ,{\sigma}_{p}^{2}$} is a diagonal matrix; and that

**and**

*ε***f**are uncorrelated. We then obtain the following equation from (2):

Var(**g**) = **Λ**Var(**f**)**Λ**^{T }+ Var(** ε**)

or

**S **= **ΛΛ**^{T }+ **E**. (3)

The matrix **S **can be estimated directly from the microarray data. Following general conventions in factor analysis, we take **S **to be the correlation matrix rather than the covariance matrix. This is equivalent to standardizing the data prior to the analysis and is customary in factor analysis [12].

#### A two-step algorithm

Our strategy for using this model to predict the gene functions is as follows: We first estimate **Λ **with $\widehat{\Lambda}$ using principal factor analysis. The number of latent variables, *L*, can be determined empirically by evaluating the goodness of $\widehat{\Lambda}$$\widehat{\Lambda}$^{T }as an approximation to **S **[[12], p. 262–267], but we simply retain all factors with non-zero singular values.

Let $\widehat{\Lambda}$_{1 }be the top *p*_{1 }rows of $\widehat{\Lambda}$ corresponding to the characterized genes and let $\widehat{\Lambda}$_{2 }be the bottom *p*_{2 }rows of $\widehat{\Lambda}$ corresponding to the uncharacterized genes. We then derive a mapping from $\widehat{\Lambda}$_{1 }to **Y**_{1}, say *F*, and apply *F *to $\widehat{\Lambda}$_{2}. We can picture the entire process as follows:

$S\begin{array}{llllllllllll}\stackrel{\text{Step1}}{\to}\hfill & \widehat{\Lambda}\hfill & =\hfill & \left[\begin{array}{c}{\widehat{\Lambda}}_{1}\\ {\widehat{\Lambda}}_{2}\end{array}\right]\hfill & \stackrel{\text{Step2}}{\to}\hfill & \widehat{\theta}\hfill & =\hfill & \underset{\theta}{\mathrm{arg}\mathrm{min}\ell}\left(F\left({\widehat{\Lambda}}_{1};\theta \right),{Y}_{1}\right)\hfill & \stackrel{\text{Predict}}{\to}\hfill & {\widehat{Y}}_{2}\hfill & =\hfill & F\left({\widehat{\Lambda}}_{2};\widehat{\theta}\right).\hfill \end{array}\text{}\left(4\right)$

In this modeling paradigm, (·,·) denotes a loss function between the known associations **Y**_{1 }and the predicted associations *F*(·;** θ**) based on the estimated loadings $\widehat{\Lambda}$

_{1}. The mapping

*F*(·;

**) is assumed to be completely specified by a collection of parameters which we simply write as**

*θ***, where**

*θ***is estimated by minimizing the loss function.**

*θ*Details on the implementation of this two-step algorithm is given in the "Method" section below.

### Experiments

In this section, we discuss the set-up of our computational experiments, including descriptions of our data set, our preprocessing steps, and how performance is evaluated. We also briefly describe a benchmark SVM algorithm that we use for the purpose of comparison.

#### Data

We applied our model to the microarray data for yeast described in [1], but with functional annotations from the December 2004 release of the GO-BP table and the *Saccharomyces *Genome Database (SGD; also part of the GO consortium). The authors of [1] used a much simpler YPD classification scheme, which is not publicly available. Their microarray data contains 300 experiments from the Rosetta Compendium using genetic variants of yeast, as well as an additional publicly available data of 124 arrays from a set of three different experiments from other labs. Thus in total, there are 424 arrays with a common set of 6064 genes and ORFs. The full description of the data can be found in [1] and references therein.

#### Data preprocessing

We obtained the data from the authors of [1], where the data were already normalized and combined into one data matrix. Our further preprocessing steps included the following:

1. We deleted ORFs which are not listed in the newest (December 2004) SGD association table maintained by GO. Most of those had a "Dubious ORF" characterization in the SGD database.

2. We ignored the uncharacterized genes (and ORFs) – genes annotated to the node "Biological Process Unknown" in SGD.

3. We imputed missing values in the expression data using an adaptation of the k-nearest neighbor imputing algorithm for expression data described in [15]. We modified it so that the putative nearest neighbour gene profile, considered for imputing a given missing value, did not need to have a full expression profile as long as: (a) it had non-missing data in the column under consideration and (b) it had at least 80% expression values that were not missing. We used k = 10 nearest neighbours.

4. We restricted the set of GO-BP functional categories to those with at least 15 but no more than 100 genes associated to it. Since the goal is to produce functional predictions for genes to enable biologists to conduct more guided confirmatory experiments, we chose not to consider categories which had too few or too many genes associated to them. On the one hand, categories with too many genes annotated to them are less informative and hence not as interesting. On the other hand, categories with too few genes annotated to them cannot be credibly predicted.

5. Restricting GO categories produced a small list of "stranded" genes – genes which were otherwise known but whose annotations happened to be exclusively in the categories deleted from the list. We removed these "stranded" genes from our data set.

This resulted in 3224 genes and 369 functional categories. Specifically, we formed a 3224 × 424 expression data matrix **Z **(see [1]) and a 3224 × 369 association (response) matrix **Y **[see Additional file 1].

#### Benchmark

To compare the performance of our method, we also applied the popular approach in functional genomics: function-wise binary classification using Support Vector Machines (SVMs). SVM is a well-known method in machine learning (see [16] for links to much of the literature, tutorials and software) and is used by [3] to classify mouse genes.

We chose Gist 2.1.1 [17] as our SVM software due to its tailored interface for microarray data and because of its use in [3]. We tested several different values via cross-validation for the following tuning parameters in SVM: the type of kernel (linear, polynomial, and radial basis functions), the misclassification penalty (diagfactor in Gist), and the width factor (widthfactor in Gist) that controls the bandwidth of the radial basis kernel, which is calculated as (width) × (width factor). The "width" is taken to be the median distance from a data point to its nearest neighbor of the opposite class; the "width factor" is specified by the user. Further details on the parameter selection process are elaborated in the "Methods" section.

#### Performance evaluation

Although classifiers are typically evaluated by the misclassification rate in the statistics and machine learning literature, it is *not *a useful criterion for functional genomics from a biologist's point of view. Since the most interesting functional categories are those that are fairly rare, the vast majority of genes would *not *be associated with such a category. This means that even poor classifiers can achieve very low misclassification rates by simply predicting that no genes are associated with that category (i.e. by classifying everything to the majority class). Instead, suppose the prediction algorithm ranks genes according to predicted likelihood of association to that category (e.g., FAM can rank genes using estimated posterior probabilities via naïve Bayes – see the "Method" section). Biologists are most interested in using a computational algorithm to generate a list of candidate genes which can be further investigated in the lab. Since such investigations are often labourious and costly, a highly relevant measure of success should determine whether the algorithm ranks the truly associated genes ahead of the unassociated ones.

In the biomedical literature, prediction accuracy is often evaluated by *sensitivity *and *specificity *[18]. Suppose we focus on *m *<*p *top-ranked genes. Sensitivity is the probability that a gene appears in this top-ranked list given that it is truly associated, whereas specificity is the probability that a gene does not appear in this list given that it is not associated. They both depend on the number *m*, the size of the top-ranked list. The 2 × 2 Table Table11 shows how these quantities can be calculated. Hence, we have:

*one*GO category.

$\begin{array}{ccc}\text{sensitivity}=\frac{a}{a+b}=\frac{a}{s}& \text{and}& \text{specificity}\end{array}=\frac{d}{c+d}=\frac{p-m-b}{p-s}.$

The number *m *is typically determined by external factors such as how much funding there is to perform confirmatory experiments. Full confirmation or rejection of a putative association predicted from a genome-wide scan will typically involve classical experiments from molecular biology or biochemistry over a longer time span. A given lab may decide to pursue a few of these candidate genes, taking into account their ranking from prediction tools as well as additional information such as specific reagents needed and lab's area of expertise. However an intermediate screening step which we have in mind, such as quantitative RT-PCR, may be performed on a larger subset of candidates.

In order to compare two prediction algorithms, one needs to decide on the threshold *m*. Objectively derived statistical thresholds are hard to estimate and do not always reflect the reality of the biologist. We decided on a fixed threshold of *m *such that the top-ranked list consists of 10% of the genes, which we felt was a realistic compromise. For example, yeast has about 2000 uncharacterized genes, and attempting to furtherscreen the top 200 of them by lower-throughput methods such as quantitative RT-PCR would be feasible in a typical lab.

For predicting rare functional categories, sensitivity is more important than specificity. Again, refer to Table Table1.1. By definition, *s *is small compared to *p *for rare functional categories. Clearly, *b *<*s*. Therefore,

$\text{specificity}=\frac{d}{c+d}=\frac{p-m-b}{p-s}\approx \frac{p-m}{p},$

which is roughly constant for fixed *m*. This means all algorithms will have very similar specificity values.

In practice, the choice of *m *must reflect an acceptable trade-off between falsely detecting an unassociated gene and failing to detect an associated gene. Given that our choice of *m *is somewhat arbitrary, it is important for us to consider a more global performance measure. According to [18], the receiver operating characteristic (ROC) curve is currently the best-developed statistical tool for evaluating ranking algorithms; it "describes the [entire] range of trade-offs that can be achieved" by a given algorithm.

The area under the ROC curve – often referred to simply as AUC in the biomedical literature – is a widely used summary measure of the ROC curve. Let *g *and $\overline{g}$ be a randomly selected pair of associated and unassociated genes, and *r*_{g }and ${r}_{\overline{g}}$ be their respective rankings. It is well-known [18,19] that AUC is equal to the probability that *g *and $\overline{g}$ are correctly ordered by the ranking algorithm, i.e., AUC = *P*(*r*_{g }<${r}_{\overline{g}}$). Therefore, AUC is a good (and threshold-free) performance measure for determining whether an algorithm can rank truly associated genes ahead of the unassociated ones. The use of AUC as a performance criterion is also becoming more popular in the machine learning community [20]. As the averaged results later illustrate, SVM and FAM tend to perform comparably on the AUC scale.

### Computational results

We validated our method with a four-fold cross-validation (C-V) experiment. We used four folds (rather than ten folds, for example) because of the need to retain a reasonable number of associated genes in each fold for each functional category, which we verified *post hoc*. It is important to note that the C-V folds were obtained by an unconstrained randomization process.

More specifically, the 3224 genes were randomly divided into 4 groups of 806 each. In each C-V run, one of the four folds is designated as the test set and the corresponding part of the association matrix, **Y**, removed. The remaining three quarters of the data were designated as the training set. Then both FAM and SVM were fitted to the training data and then applied to the test data to obtain predictions. For FAM, the prediction was based on the estimated posterior probability (see Method section for more details); for SVM, the prediction was based on the signed distance to the separating hyperplane. The test genes would then be ranked by each method for each function separately and the three performance metrics (sensitivity and specificity for the 10% top-ranked genes, and area underneath the ROC curve) calculated.

#### Computational details

FAM is implemented in R [21]. Executing the four-fold C-V experiment, run serially across the 369 functional categories, took less than 5 CPU minutes on a dual-CPU (3 GHz Pentium) Linux desktop with 2 GB of RAM. Gist is implemented in ANSI C. Executing the four-fold C-V experiment, run in *parallel *across the 369 functional categories, took at least 12 hours on a 32 CPU (l.3 GHz Itanium II each) SGI Altix server with 64 GB of RAM (where each of the processors were fully loaded for the 12 hours). This corresponds to approximately 23040 CPU minutes (12 hours × 60 × 32 CPUs). Thus, compared to FAM's running time of 5 CPU minutes, training the SVMs took approximately 4000 times longer.

#### Prediction results

Tables Tables22 and and33 show the main results from our four-fold C-V experiments. Table Table22 compares the performance of FAM and SVM, averaged across all 369 functional categories. Table Table33 makes pairwise comparisons of the predictions made by FAM and SVM for the 369 GO functions in the four C-V experiments. For each of the 369 GO functions, the performance of FAM and SVM are compared, and Table Table33 reports the number of GO functions for which FAM's predictions are better than, tied with, or worse than that of SVM. The last row, denoted by "ACV," compares FAM and SVM using averaged performance metrics over the four C-V folds, which give rise to much fewer ties.

*averaged*over all 369 GO functions.

**...**

The general conclusion is that the two methods are very similar in terms of their predictive performances (as measured by the three performance metrics). FAM is slightly better than SVM based on the ACV pairwise comparison (Table (Table3).3). These differences are not statistically significant, but we are greatly encouraged by these results given the tremendous computational advantages of FAM.

Tables Tables44 and and55 are similar to Tables Tables22 and and3,3, except the 369 functional categories are partitioned into 4 groups based on how informative they are. We say that a GO function is more informative if fewer genes are associatedwith it. The first group (H) consists of categories with only 15 to 20 associated genes (out of a total of 3224) – these functional categories are highly informative; there are 95 such categories. The second and third group consist, respectively, of categories with 21 to 40, and 41 to 60 associated genes, which we label as Medium High (MH) and Medium Low (ML) on the informativeness scale. There are 144 and 71 such functional categories in these two groups, respectively. The least informative group (L) contains functions with over 60 associated genes; there are 59 such GO categories.

**...**

**...**

To illustrate, we randomly choose one GO function from each of the four groups described above (H, MH, ML, and L) and display the corresponding sample ROC curves of the two methods in Figure Figure1.1. The curves produced by FAM and SVM are quite similar; naturally, we would expect them to have similar areas underneath as well. We also observe that the ROC curves of FAM tend to slightly dominate in the early part, where specificity is high. This explains why FAM tends to perform slightly better in terms of sensitivity at low thresholds (see Table Table2),2), although the differences here are quite small. Full prediction results for FAM and SVM are in [Additional file 2] and [Additional file 3], respectively.

## Discussion

In terms of using microarray expression data to identify uncharacterized genes that may be associated with various biological functions, FAM appearsto be as effective as SVM, a powerful machine learning tool. However, our FAM-based algorithm is faster than the SVM by a factor of 4000.

More importantly, the FAM framework has a great potential for incorporating important gene ontology information and dealing with a number of major challenges in using statistical classification techniques for functional genomics.

### FAM as a functional random-effects model

The hierarchical nature of how the functional categories are organized in GO-BP imply that some of these functional categories are highly correlated. A very promising and important extension of FAM is the possibility of taking this correlation structure into account. More specifically, with a total of *K *different functional categories, we can model the expression level ${g}_{i}^{\left(j\right)}$ explicitly as being dependent on the *K *functional categories, i.e.,

${g}_{i}^{\left(j\right)}={\displaystyle \sum _{k=1}^{K}{\lambda}_{ik}{f}_{k}^{\left(j\right)}+{\epsilon}_{i}^{\left(j\right)},\text{}\left(5\right)}$

where ${f}_{k}^{\left(j\right)}$ is now a (non-latent) random effect for function *k*. Thus, there are two random components in this model: the function-dependent random factors, ${f}_{k}^{\left(j\right)}$, realized independently for each microarray experiment *j*, and the error term, ${\epsilon}_{i}^{\left(j\right)}$, which is realized independently for each microarray (*j*) and each gene (*i*).

As stated above, these (functional) random effects can no longer be assumed to be independent. Let **Φ ** Var(**f**) ≠ **I **be the covariance matrix of these random effects. Eq. (3) then becomes

S = **ΛΦΛ**^{T }+ **E**. (6)

When properly estimated, the number *λ*_{ik }in this model can be interpreted as the "coefficient of association" between gene *i *and function *k *and used directly to predict gene functions.

Hence, FAM has the potential to explicitly account for the fact that both the gene expression profiles and the functional categories are correlated. By contrast, applying a classifier such as SVM separately for each functional category implicitly assumes that both are independent.

The challenge lies in the estimation of the covariance matrix **Φ **based on the DAG structure of GO-BP. This is a non-trivial but interesting challenge. So far, we have experimented with three different methods to obtain a *similarity *matrix: Lin similarity [22], which has previously been applied in microarray data [7]; Resnik similarity [23], which has also been applied to microarray data [24]; and a flexible Jiang and Conrath similarity [25], which has been extended to specifically apply to estimate gene-gene similarity using GO-BP by [26]. Such similarity matrices then need to be transformed to a valid (i.e., positive-definite) covariance matrix, which can be used as an estimate of **Φ**. After experimenting with these different techniques, we obtained very different estimates of **Φ**. At the current state, it is difficult to evaluate which estimate is better than others. We will report more details in our future work.

### FAM with functional constraints

If it is known that there is no association between gene *i *and function *k*, we can set *λ*_{ik }= 0 in Eq. (6) and estimate the remaining entries of **Λ **from data. This is closely related to *confirmatory factor analysis *[27]. Using the notation above, we wish to find values for the unknown entries of the matrix **Λ **that optimally "fits" Eq. (6). For example, we can use the Frobenious norm as the loss function and minimize the function:

||**S **- **ΛΦΛ**^{T }- **E**||_{F}. (7)

If *λ*_{ik }are restricted to be 0 or 1, this problem would be a nonlinear binary optimization problem, which is computationally taxing but potentially solvable due to recent breakthroughs in integer programming [28]. The continuous relaxation (when the restrictions *λ*_{ik }= 0 or 1 are removed) is still a non-trivial problem since the objective function is non-convex with respect to **Λ **[29]. Nevertheless, we explored various techniques to tackle non-convex integer programming problems, including approximating the solution by a semi-definite programming formulation (SDP) – a technique that has been successful in many non-convex optimization problems [30]. Although the Lagrangian relaxation of minimizing (7) can be modeled as an SDP problem [30], the objective value is often non-positive. Thus, this relaxation did not lead to a usable solution.

There has been some recent advancements in the area of non-convex integer programming problems, where researchers from chemical engineering and optimization have joined forces to develop a practical software [31]. In the future, improvements in computation time of these algorithms may allow us to solve our problem to global optimality.

### Other extensions

Finally, even by considering just the basic FAM framework (3) alone leads to a number of different algorithms. For example, we are currently exploring alternatives to principal factor analysis such as maximum likelihood or Bayesian factor analysis. As mentioned earlier, our FAM-based algorithm has a big computational advantage. Alternatives mentioned such as maximum likelihood and Bayesian factor analysis are computationally more demanding but we expect that they would retain their computational advantage over training hundreds of SVMs to a data set comprised of thousands of gene expression vectors.

## Conclusion

We now summarize the main contributions of our work. First, we proposed a model and an algorithm based on factor analysis (FAM) for predicting gene functions using microarray expression data. We evaluated the performance of our algorithm with data previously analyzed by [1]. By running a four-fold cross-validation experiment using only characterized genes, we showed that our method is not only orders of magnitude faster, but performs comparably against a sophisticated, state-of-the-art classifier that is being widely used in functional genomics. Second, we showed that our FAM framework will allow us to incorporate important gene ontology information and hence deal with a number of major challenges in using statistical classification techniques for functional genomics. Most importantly, it will allow us to directly account for the fact that the functional categories are inherently correlated and exploit this correlation structure for prediction.

## Method

This section elaborates on the details of our implementation both for FAM and SVM.

### Algorithm implementation

Here, we give more details for how we implemented the two-step algorithm based on our factor analysis model.

#### Step 1: Principal factor analysis

To estimate **Λ**, we choose $\widehat{\Lambda}$ and $\widehat{E}$ that minimize some loss criterion (**S**, **ΛΛ **+ **E**). For example, one may choose (**A**, **B**) to be the Frobenious norm of the difference **A **- **B **and choose $\widehat{\Lambda}$ and $\widehat{E}$ that solve:

$\underset{\Lambda ,{\rm E}}{\mathrm{min}}{\Vert S-\left(\Lambda {\Lambda}^{T}+E\right)\Vert}_{F}.\text{}\left(8\right)$

In our procedure, we use the most popular algorithm for fitting a factor analysis model known as Principal Factor Analysis [[12], p. 261], but one could also use maximum likelihood or Bayesian factor analysis to estimate these quantities (see the "Discussion" section). Principal factor analysis produces an approximate solution, $\widehat{\Lambda}$, to Eq. (8), where the columns of $\widehat{\Lambda}$ are uncorrelated.

#### Step 2: Naïve Bayes

The fact that the columns of $\widehat{\Lambda}$ are uncorrelated allows us to use very simple procedures to construct the mapping *F*, e.g., the naïve Bayes method [[32], Section 6.6.3], which fully exploits the uncorrelated nature of the inputs. In order to apply naïve Bayes, we break the mapping *F *into *L *components, *F *= {*F*_{l}; *l *= 1, 2, ..., *L*}; each *F*_{l }uses the naïve Bayes method to map $\widehat{\Lambda}$ to the *l*^{th }column of **Y**.

#### The Naïve Bayes method

Let *y * {0,1} be the class label and **x ** ^{L}, the vector of predictors. Suppose *p*_{0}(**x**) and *p*_{1}(**x**) are the density functions for the two classes. The posterior probability that an observation belongs to class 1 is simply

$P\left(y=1|x\right)=\frac{{\pi}_{1}{p}_{1}\left(x\right)}{{\pi}_{1}{p}_{1}\left(x\right)+{\pi}_{0}{p}_{0}\left(x\right)}\text{}\left(9\right)$

where *π*_{1 }and *π*_{0 }are prior probabilities. The naive Bayes method simply assumes the density functions *p*_{1 }and *p*_{0 }to be of a product form:

$\begin{array}{ccc}{p}_{0}\left(x\right)={\displaystyle \prod _{l=1}^{L}{p}_{0,l}\left({x}_{l}\right)}& \text{and}& {p}_{1}\left(x\right)=\end{array}{\displaystyle \prod _{l=1}^{L}{p}_{1,l}\left({x}_{l}\right)}.$

In other words, the elements of the input vector **x **are assumed to be independent. The marginal density functions *p*_{0,l }and *p*_{1,l }can be estimated with standard (univariate) density estimators.

For any given functional category, often only a few genes are associated with it, meaning that class 1 is relatively rare. Under such circumstances, it is often necessary to impose additional parametric assumptions on the density functions before they can be estimated. We assume (as is also commonly done) that both *p*_{0,l }and *p*_{1,l }are Gaussian density functions with the same variance. This means uncorrelated inputs are the same as independent inputs and that *p*_{0,l }and *p*_{1,l }are completely specified by three parameters: *μ*_{0,l}, *μ*_{1,l}, and ${\sigma}_{l}^{2}$. Under such assumptions, naïve Bayes becomes equivalent to Diagonal Linear Discriminant Analysis [[33], DLDA].

Other than using a penalized variance estimate for ${\sigma}_{l}^{2}$ (see below), we use the formulae derived by [33]. DLDA classifies an observation **x **to class 1 when

$\sum _{l}{v}_{l}\ge 0}\text{}\left(10\right)$

where

${v}_{l}=\frac{{\mu}_{1,l}-{\mu}_{0,l}}{{\sigma}_{l}^{2}}\left({x}_{l}-\frac{{\mu}_{0,l}+{\mu}_{1,l}}{2}\right).\text{}\left(11\right)$

The left-hand side of (10) is monotonically equivalent to the posterior probability (9).

To estimate ${\sigma}_{l}^{2}$, we use a penalized estimate, ${\tilde{\sigma}}_{l}^{2}$, which is similar in spirit to variance estimates used in [34]. In particular,

${\tilde{\sigma}}_{l}^{2}$ = *s*_{0 }+ ${\tilde{\sigma}}_{l}^{2}$ (12)

where *s*_{0 }= 0.01 is a small positive constant and ${\tilde{\sigma}}_{l}^{2}$ is the standard pooled variance estimate. As noted by [34], the resulting estimate ${\tilde{\sigma}}_{l}^{2}$ is numerically more stable than ${\tilde{\sigma}}_{l}^{2}$.

### SVM parameter selection

Due to the heavy computational burden of fine-tuning SVM parameters for each of the 369 functional categories, we tuned the parameters to a subset of the categories. Namely, we applied the same parameter values to functional categories with similar levels of informativeness. To select the parameters for SVM, we relied on the more global performance measure, AUC. As stated in the "Performance Evaluation" section, AUC does not depend on the threshold *m *and it summarizes the entire range of trade-offs that can be achieved by a ranking algorithm.

We trained SVMs on ten functional categories chosen as follows: We sorted all 369 functional categories according to their informativeness or gene association levels (again, we refer to a functional category as more informative if fewer genes are associated with it). From that sorted list, we split the categories into ten groups based on the deciles. For example, the first group is comprised of the most informative functions and the tenth group is comprised of the least informative functions. To construct the "training set", we chose one function from each of these groups, and chose the SVM parameter values corresponding to the highest AUC value for each of them.

We tested all three kernels (linear, radial basis function, and polynomial) with diagfactor = {0.01, 0.02, 0.05, 0.1, 0.5, 1, 2, 5, 10}. For radial basis function (RBF) kernels, we tested widthfactor = {0.1, 0.2, 0.25, 0.5, 0.75, 1, 1.5, 2} and for polynomial kernels, we tested power = {2, 3} which corresponds to the degree of the polynomial kernel. For each of the ten training set categories, the AUC values were plotted in contour plots to determine the highest value. If this optimal parameter set was on the boundary of the tested parameter range, the range of the parameter values were expanded. This process was repeated until we found a range of parameters in which the optimal parameter set did not lie on the boundary of our contour plots. Additional values tested include diagfactor = {0.001, 0.005, 12, 15, 20} and widthfactor = {3, 5, 7}.

To test whether the fine-tuned parameters generalizes for other functional categories, we measured their performance on a separate "test set". From each of the ten groups described above, we selected five additional functional categories as the "test set". The AUC values of the test set areshown in Table Table6.6. Two different sets of parameters were applied to the SVM:"default", which corresponds to Gist's default parameters (diagfactor = 0.1 and widthfactor = 1) with RBF kernels (we chose this kernel since it appear to perform well across all functions we tested); and "tuned", which corresponds to the parameter values fine-tuned to the training set.

**...**

Table Table66 illustrates that there is little difference between using the default and the fine-tuned parameters with respect to AUC values. Using default parameters resulted in higher AUC values than using the tuned parameters in 5 out of 10 groups, and the mean AUC across all groups were the same. Since the Gist default parameters appear to be robust across different functional categories for this particular data set, we decided to use the default parameters to train the SVMs for all 369 categories in our experiments.

### FAM parameter selection

FAM formally has one parameter: a penalty on diagonal variances, *s*_{0}, mentioned in section *The Naïve Bayes Method*. We have not performed an extensive optimization of this parameter and settled on *s*_{0 }= 0.01, since it seems that FAM is not very sensitive to its value.

## Authors' contributions

RK was responsible for defining the problem and preparing the data. All authors contributed equally to formulating the general modeling paradigm, investigating different implementations, and writing the paper. For specific tasks, RK worked on the implementation of FAM; RS worked on SVM experiments; and MZ worked on performance evaluation.

## Supplementary Material

**Additional File 1:**

**GO-BP annotations **Tab-delimited file with 3225 lines each with 370 columns, including gene names (first column) and GO accession numbers (first line), for 3224 genes and 369 GO-BP categories used in the text. Contains binary annotation entries.

^{(2.2M, txt)}

**Additional File 2:**

**FAM predictions **Tab-delimited file with 3225 lines each with 370 columns, including gene names (first column) and GO accession numbers (first line), for 3224 genes and 369 GO-BP categories used in the text. Contains FAM prediction scores realized when a gene was part of the validation set.

^{(8.6M, GZ)}

**Additional File 3:**

**FAM predictions **Tab-delimited file with 3225 lines each with 370 columns, including gene names (first column) and GO accession numbers (first line), for 3224 genes and 369 GO-BP categories used in the text. Contains SVM prediction scores realized when a gene was part of the validation set.

^{(3.7M, GZ)}

## Acknowledgements

All authors are partially supported by the Natural Science and Engineering Research Council (NSERC) of Canada. RK was also supported by the Canadian Institutes of Health and Research (CIHR) – New Emerging Team grant. We thank Tim Hughes, for providing us with the data, and Tim Hughes, Quaid Morris, Nikolaos Mitsakakis and Adam Zagdanski for helpful discussions. We would also like to thank the three anonymous referees for their helpful comments which have greatly enhanced the clarity of the paper.

## References

- Wu L, Hughes T, Davierwala A, Robinson M, Stoughton R, Altschuler S. Large Scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters. Nature Genetics. 2002;31:255–260. doi: 10.1038/ng906. [PubMed] [Cross Ref]
- Zhou X, Kao MC, Wong W. Transitive functional annotation by shortest-path analysis of gene-expression data. Proceedings of the National Academy of Sciences. 2002;99:12783–88. doi: 10.1073/pnas.192159399. [PMC free article] [PubMed] [Cross Ref]
- Zhang W, Morris Q, Chang R, Shai O, Bakowski M, Mitsakakis N, Mohammad N, Robinson M, Zirngibl R, Somogyi E, Laurin N, Eftekharpour E, Sat E, Grigull J, Pan Q, Peng W, Krogan N, Greenblatt J, Fehlings M, van derKooy D, Aubin J, Bruneau B, Rossant J, Blencowe B, Frey B, Hughes T. The functional landscape of mouse gene expression. Journal of Biology. 2004;3 [PMC free article] [PubMed]
- the Gene Ontology http://www.geneontology.org
- Marcotte E, Pellegrini M, Thompson M, Yeates T, Eisenberg D. A combined algorithm for genome-wide prediction of protein function. Nature. 1999;402:83–86. doi: 10.1038/47048. [PubMed] [Cross Ref]
- Kemmeren P, Kockelkorn T, Bijma T, Donders R, Holstege F. Predicting gene function through systematic analysis and quality assessment of high-throughput data. Bioinformatics. 2005;21:1644–1652. doi: 10.1093/bioinformatics/bti103. [PubMed] [Cross Ref]
- Chen Y, Xu D. Global protein function annotation through mining genome-scale data in yeast
*Saccharomyces cerevisiae*. Nucleic Acids Research. 2004;32:6414–6424. doi: 10.1093/nar/gkh978. [PMC free article] [PubMed] [Cross Ref] - Lanckriet G, De Brie T, Cristianini N, Jordan M, Noble W. A statistical framework for genomic data fusion. Bioinformatics. 2004;20:2626–2635. doi: 10.1093/bioinformatics/bth294. [PubMed] [Cross Ref]
- Harris M, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, Eilbeck K, Lewis S, Marshall B, Mungall C, Richter J, Rubin G, Blake J, Bult C, Dolan M, Drabkin H, Eppig J, Hill D, Ni L, Ringwald M, Balakrishnan R, Cherry J, Christie K, Costanzo M, Dwight S, Engel S, Fisk D, Hirschman J, Hong E, Nash R, Sethuraman A, Theesfeld C, Botstein D, Dolinski K, Feierbach B, Berardini T, Mundodi S, Rhee S, Apweiler R, Barrell D, Camon E, Dimmer E, Lee V, Chisholm R, Gaudet P, Kibbe W, Kishore R, Schwarz E, Sternberg P, Gwinn M, Hannick L, Wortman J, Berriman M, Wood V, de la Cruz N, Tonellato P, Jaiswal P, Seigfried T, White R. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Research. 2004;31:D258–61. [PMC free article] [PubMed]
- McLachlan G. Further results on the effect of intraclass correlation among training samples in discriminant analysis. Pattern Recognition. 1976;8:273–275. doi: 10.1016/0031-3203(76)90047-9. [Cross Ref]
- Tubbs J. Effect of autocorrelated training samples on Bayes's probabilities of misclassification. Pattern Recognition. 1980;12:351–354. doi: 10.1016/0031-3203(80)90011-4. [Cross Ref]
- Mardia KV, Kent JT, Bibby JM. Multiviariate Analysis. London, Great Britain: Academic Press; 1979.
- Rogers S, Girolami M, Campbell C, Breitling R. The Latent Process Decomposition of cDNA Microarray Data Sets. ACM/IEEE Transactions on Computational Biology and Bioinformatics. 2005;2:143–156. doi: 10.1109/TCBB.2005.29. [PubMed] [Cross Ref]
- Frey B, Mohammad N, Morris Q, Zhan W, Robinson M, Mnaimneh S, Chang R, Pan Q, Sat E, Rossant J, Bruneau B, Aubin J, Blencowe B, Hughes T. Genome-wide analysis of mouse transcript using exon microarrays and factor graphs. Nature Genetics. 2005;37:991–997. doi: 10.1038/ng1630. [PubMed] [Cross Ref]
- Troyanskaya O, Cantor M, Sherlock G, Eisen M, Brown P, Botstein D. Imputing Missing Data for Gene Expression Arrays. Bioinformatics. 2001;17:520–25. doi: 10.1093/bioinformatics/17.6.520. [PubMed] [Cross Ref]
- Kernel Machines http://www.kernel-machines.org
- Noble WS, Pavlidis P. Gist: Support Vector Machine and Kernel Principal Components Analysis Software Toolkit. Columbia University; 2002. http://microarray.genomecenter.columbia.edu/gist/
- Pepe MS. The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press; 2003.
- Hanley JA, McNeil BJ. The Meaning and Use of the Area Under an ROC curve. Radiology. 1982;143:29–36. [PubMed]
- Cortes C, Mohri M. AUC Optimization vs. Error Rate Minimization. In: Thrun S, Saul L, Schölkopf B, editor. Advances in Neural Information Processing Systems 16. Cambridge, MA: MIT Press; 2004.
- R Development Core Team R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. 2004. http://www.r-project.org [ISBN 3-900051-07-0]
- Lin D. Proceedings of 15th International Conference on Machine Learning, San Francisco. Morgan Kaufmann; 1998. An information-theoretic definition of similarity; pp. 296–304.
- Resnik P. Using information content to evaluate semantic similarity in a taxonomy. Proceedings of the 14th International Joint Conference on Artificial Intelligence. 1995. pp. 448–453.
- Bolshakova N, Azuaje F. Cluster validation techniques forgenome expression data. Signal Process. 2003;83:825–833. doi: 10.1016/S0165-1684(02)00475-9. [Cross Ref]
- Jiang J, Conrath D. Semantic similarity based on corpus statistics and lexical taxonomy. Proceedings of the International Conference on Research in Computational Linguistics, Taiwan. 1998.
- Couto F, Silva M, Coutinho P. Tech Rep DI/FCUL TR 03-29. Department of Informatics, University of Lisbon; 2003. Implementation of a Functional Semantic Similarity Measure between Gene-Products.http://www.di.fc.ul.pt/tech-reports
- Jöreskog K. A General Approach to confirmatory maximum likelihood factor analysis. Psychometrika. 1969;34:183–202. doi: 10.1007/BF02289343. [Cross Ref]
- Bertsimas D, Weismantel R. Optimization Over Integers. Belmont, MA: Dynamic Ideas; 2005.
- Bazaraa M, Sherali HD, Shetty CM. Nonlinear Programming: Theory and Algorithms. New York: John Wiley and Sons; 1993.
- Wolkowicz H, Saigal R, Vandenberghe L. Handbook of Semidefinite Programming. Norwell, MA: Kluwer Academic Press; 2000.
- Computational INfrastructure for Operations Research http://www.coin-or.org
- Hastie TJ, Tibshirani RJ, Friedman JH. The Elements of Statistical Learning: Data-Mining, Inference and Prediction. Springer-Verlag; 2001.
- Dudoit S, Fridlyand J, Speed T. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association. 2002;97:77–88. doi: 10.1198/016214502753479248. [Cross Ref]
- Tusher V, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiarion response. Proceedings of the National Academy of Sciences. 2001;98:5116–5121. doi: 10.1073/pnas.091062498. [PMC free article] [PubMed] [Cross Ref]

**BioMed Central**

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (356K) |
- Citation

- PubMedPubMedPubMed citations for these articles
- TaxonomyTaxonomyRelated taxonomy entry
- Taxonomy TreeTaxonomy Tree

- A factor analysis model for functional genomicsA factor analysis model for functional genomicsBMC Bioinformatics. 2006; 7()216

Your browsing activity is empty.

Activity recording is turned off.

See more...