- We are sorry, but NCBI web applications do not support your browser and may not function properly. More information

- Journal List
- Bioinformatics
- PMC2682520

# Network-based multiple locus linkage analysis of expression traits

## Abstract

**Motivation:** We consider the problem of multiple locus linkage analysis for expression traits of genes in a pathway or a network. To capitalize on co-expression of functionally related genes, we propose a penalized regression method that maps multiple expression quantitative trait loci (eQTLs) for all related genes simultaneously while accounting for their shared functions as specified a priori by a gene pathway or network.

**Results:** An analysis of a mouse dataset and simulation studies clearly demonstrate the advantage of the proposed method over a standard approach that ignores biological knowledge of gene networks.

**Contact:** ude.nmu.tatsoib@piew

**Supplementary information:** Supplementary data are available at *Bioinformatics* online.

## 1 INTRODUCTION

Genomic locations that control the expression levels of transcripts are called expression quantitative trait loci (eQTL) (Brem *et al.*, 2002; Schadt *et al.*, 2003). In eQTL analysis, treating the expression level of a gene as a quantitative trait, one aims to identify genomic loci associated with the trait; see Kendziorski and Wang (2006) for a recent review on statistical methods. It is likely that a gene's expression is associated with multiple regulators, hence a multiple locus linkage analysis is perhaps most suitable. A straightforward approach is to treat each expression trait independently, then apply a standard QTL mapping technique to each individual trait. For a single trait, a general approach to mapping multiple QTL is essentially to identify non-zero regression coefficients in a regression analysis (Sen and Churchill, 2001). Traditionally, such a variable selection problem can be approached by model comparison via a model selection criterion (Bogdan *et al.*, 2004; Broman and Speed, 2002), or to save computing time, by a sequential, e.g. a stepwise forward, procedure for variable selection (Storey *et al.*, 2005; Zou and Zeng, 2007). In addition to high-computational cost, a severe problem associated with these approaches is the reduced power due to multiple tests. Furthermore, such model selection is separated from parameter estimation and is inherently unstable (Breiman, 1996). Hence, simultaneous variable selection and parameter estimation has become increasingly popular as implemented in either a fully Bayesian (e.g. Xu, 2003; Yi *et al.*, 2003) or a penalization (e.g. Zhang and Xu, 2005) framework for QTL mapping. In spite of their much different implementations, the Bayesian framework and the penalization one are closely related to each other due to the connection between penalization and Bayesian modeling (Tibshirani, 1996). Although the penalized likelihood approach is computationally simpler and faster than the fully Bayesian approach, Zhang and Xu (2005) found that both approaches performed well and similarly in their simulation study for QTL mapping, as confirmed by Xu (2007), in which an empirical Bayesian method with good performance and fast computation was proposed.

As for multivariate traits, due to correlated expressions among the genes, separate analyses on individual expression traits would not be efficient. On the other hand, standard multivariate QTL techniques (Jiang and Zeng, 1995) cannot be directly applied to a relatively small sample (with sample size typically less than or around 100) with a large number of expression traits, which is in the order of thousands; furthermore, because some genes' expression levels are uncorrelated, pooling over all the genes together in a multivariate QTL analysis, even if computationally feasible, would again have reduced power. To overcome these problems while realizing the potential of combining correlated expression traits, a common strategy (Lan *et al.*, 2003; Li *et al.*, 2006, 2007) is to apply dimension reduction to a group of relevant genes based on either a gene functional annotation system, such as the Gene Ontology (GO) (Ashburner *et al.*, 2000) or KEGG (Kanehisa and Goto, 2000), or on clustering or co-expression analysis (Tseng, 2007). Albeit useful, there are various issues associated with these methods: how to select gene groups, how to conduct dimension reduction and how to interpret results, etc.; in addition, they may fail to take into account specific roles of the genes in a group or cluster. An extreme example is to use the average expression of the genes in a pathway or subnetwork to infer their common eQTL (Kliebenstein *et al.*, 2006), which will be shown to be a special case of our approach. Alternatively, another line of research is to analyze genome-wide expression traits simultaneously without depending on any gene grouping or dimension reduction. For example, Jia and Xu (2007) extended a popular Bayesian variable selection scheme for a single regression model (George and McCulloch, 1993) to multiple regression models, one for each expression trait. We argue that, although this approach takes advantage of existing multiple traits by borrowing information across them, as a generic method, by treating all the genes equally a priori, it fails to utilize biological knowledge that the genes function in different pathways. Our goal here is to incorporate biological knowledge on genes in the form of gene networks while maintaining the strength of borrowing information across the genes.

We would like to distinguish two aims of eQTL analyses. The first aim, focus of most existing works, is to identify candidate regulator–target gene pairs with eQTL suggesting locations of regulators while the genes whose expressions are analyzed as targets (Liu *et al.*, 2008). We restrict our attention to this first step in this article. Based on the inferred regulator–target pairs, a preliminary gene regulatory network can be constructed, which however may not distinguish direct and indirect effects of a regulator. The second aim is to refine a preliminary regulatory network that is either inferred from aim one or given a priori. Some typical examples are to infer causal relationships (such that network edges showing indirect effects of a regulator can be removed) or orienting edges (Aten *et al.*, 2008; Liu *et al.*, 2008; Neto *et al.*, 2008) among the genes, possibly also including some phenotypes (such as clinical traits). Focusing on aim one, we require a gene network to be given a priori; this given network may or may not be related to the network to be inferred. In our example, we use a co-expression network while inferring a transcriptional regulatory network. Any network can be used in our approach as long as it implies our key prior assumption: any two neighboring genes in the network are more likely (than a random pair of genes) to have their expression co-regulated by some common loci. For example, it is reasonable to use KEGG pathways (Kanehisa and Goto, 2000), protein–protein interaction networks, transcriptional regulatory networks such as available from RegulonDB (Salgado *et al.*, 2004), and even some computationally predicted networks, such as functional linkage (Lee *et al.*, 2004) or regulatory (Faith *et al.*, 2007) networks.

It is helpful to review a main theme of Lan *et al.* (2006) that partly motivated this research. In an eQTL study of mice related to obesity and diabetes, Lan *et al.* discovered that, by separate QTL mappings on the individual expression traits, the expression of a few, but only a minority of, GPCR genes, was statistically significantly linked to a region on chromosome two; by correlating these significant genes' expression with other genes, they found that the expanded list of co-expressed genes included all 194 GPCR genes, many of which showed secondary linkage peaks, though not statistically significant, in the same region of chromosome two. They argued that combining eQTL analysis with co-expression analysis would yield higher statistical power for biologically more meaningful discoveries. We completely agree with them; in fact, we did like to go further along the line by proposing a unified framework: rather than having two separate analysis steps of eQTL mapping and expression clustering, respectively, we would like to directly incorporate biological knowledge of GPCR genes into eQTL analysis and see whether it can improve statistical power to discover common eQTL for these functionally related genes. For example, we may first construct a co-expression network (possibly by clustering analysis), then incorporate the network information into eQTL mapping. More generally, because genes work coordinately as dictated by some pathways or networks, any two genes in the same pathway are expected to have correlated expression levels, leading to similar associations (or non-association) with a marker. Hence, to capitalize on the correlation among the genes as suggested by a gene network a priori, we construct a proper penalty function to realize the smoothness of the association parameters for neighboring genes in a general framework of penalized regression. Such an idea has been explored in the context of a single regression model by Li and Li (2008) and Pan *et al.* (2009). Here, we extend the idea to the scenario with multiple regression models, in which special characteristics derived from multiple models demand special treatments, such as in selecting penalization parameters. We will demonstrate the advantage of this approach over a standard approach that treats individual expression traits separately.

## 2 METHODS

### 2.1 Penalized regression

We first consider the most common situation with a single linear regression model:

where *Y* = (*y*_{1}, *y*_{2},…, *y*_{n})′ is a vector of trait values, *X*=(*x*_{ik}) is a design matrix, β=(β_{1},…,β_{p})′ is the vector of unknown regression coefficients and ϵ=(ϵ_{1},…, ϵ_{n})′ is a vector of noise or error terms. We assume throughout that both *Y* and each predictor have been standardized to have mean 0 and variance 1. To estimate β, it is popular to use the ordinary least square estimator (OLSE)

For variable selection, it is common to follow a stepwise-type procedure based on, e.g. *P*-values of , which however is unstable (Breiman, 1996). Furthermore, if *p*≈*n* or *p* > *n*, it may be difficult to estimate the variance, and thus *P*-value for any β_{k}; in fact, even would be unstable with large variability.

It has become increasingly popular to take a penalization approach that realizes variable selection and parameter estimation *simultaneously*, especially for ‘large *p*, small *n*’ problems, as arising from eQTL analysis. A penalized least square estimator (PLSE) is defined to be

where *p*_{λ}(β) is a penalty function with penalization or tuning parameter λ, which has to be determined based on some model selection criterion such as cross-validation (CV). There have been quite a few penalty functions proposed in the literature. For variable selection, the most popular is the *L*_{1} or Lasso (Tibshirani, 1996) penalty:

A nice feature of the Lasso penalty is its capability for variable selection: with a sufficiently large λ, some will be exactly 0, effectively excluding the corresponding predictor *x*_{k} from the model.

The whole solution path , as a function of penalization parameter λ, can be efficiently obtained by a slightly modified Lars algorithm (Efron *et al.*, 2004). Furthermore, one can take the Lars estimator as an alternative to the Lasso estimator, though the two are often similar.

In spite of many successful applications of the above methods, they are generic, possibly failing to take full advantage of prior knowledge of existing structures among the predictors. For example, in eQTL analysis, they ignore various gene functions and thus the relationships among the genes, such as represented in gene networks. To incorporate biological knowledge of gene networks, Li and Li (2008) proposed a new penalty that uses the Laplacian of a network. Specifically, given a network that describes the relationships among the predictors, denote *d*_{i} as the degree of predictor (or exchangeably, node) *i* in the network; that is, *d*_{i} is the number of direct neighbors of node *i* in the network. Li and Li's network-based penalty is

where *i* ~ *j* means that nodes *i* and *j* are direct neighbors on the network. Similar to the elastic net penalty (Zou and Hastie, 2005), the first term is used for variable selection while the second to smooth the parameters over the network. A limitation of penalty (2) is its high computational cost, which can be prohibitive for large *p*: a straightforward implementation as suggested by Li and Li (2008) converts the model fitting to a Lasso problem with *n*+*p* observations and *p* predictors. In addition, determining two tuning parameters (λ_{1}, λ_{2}) is computationally more intensive than choosing just one as in Lasso or Lars.

Pan *et al.* (2009) proposed a new network-based penalty. Similar to the second term of (2), their penalty contains multiple terms, each of which is for an edge in a network and has a form of grouped penalty (Yuan and Lin, 2006; Zhao *et al.*, 2006); it differs from existing grouped penalties for its specific reference to a network. This penalty also allows the user to choose a weight for each node, which realizes different types of shrinkages, enforcing various prior relationships among β_{i}'s. For each node *i* with degree *d*_{i}, define *w*_{i}=*g*(*d*_{i}, γ) as its weight, possibly dependent on *d*_{i} and γ. The penalty is

where γ>1 and 1/γ′ + 1/γ = 1. Each penalty term *p*(β_{i}, β_{j}) is essentially a weighted *L*_{γ} norm of vector (β_{i}, β_{j})′, and hence *p*_{λ}(**β**; λ, *w*) is convex in β. A generalized boosted Lasso algorithm (Zhao and Yu, 2004) can be applied to obtain an approximate solution path .

Pan *et al.* (2009) studied some specific choices of the weights: for example, if *w*_{i}=*d*_{i}, under some conditions, it is shown that |β_{i}|'s for neighboring genes are smoothed and shrunken to each other, which is often desired. Other choices of weights may result in different shrinkage schemes. Each term of (3) is a (weighted) grouped penalty, encouraging both β_{i} and β_{j} to be equal to 0 *simultaneously* (Yuan and Lin, 2006; Zhao *et al.*, 2006), which incorporates our prior assumption that two neighboring genes in a network should be more similar to each other and thus more likely to be both associated with the trait. Furthermore, a larger γ is chosen to impose a stronger shrinkage effect for two neighboring genes *i*~*j*.

The numerical studies in Pan *et al.* (2009) indicated some complex relationships between the choices of γ and *w* and the resulting predictive performance, while the performance in variable selection was more robust to the choices. Because the goal here is more for variable selection, we will simply use γ = 2 and *w*_{i}=*d*_{i} throughout.

### 2.2 Adapting penalized regression to eQTL analysis

In eQTL analysis, in contrast to standard regression (1), we have multiple regression models, one for each expression trait *g*:

for *g* = 1,…, *G*. Note that the design matrix *X* is related to the marker data drawn from the individuals, independent of gene *g*. Here we assume that, for simplicity, (i) eQTLs are located on markers; (ii) there are only additive effects: each marker is coded as −1, 0 and 1. It is possible to relax the above two assumptions: for the first one, we can take the general approach of Sen and Churchill (2001) by first imputing pseudo-genotypes between two neighboring markers (as shown in Section 3.4) and then proceeding as proposed here by treating pseudo-genotypes as observed markers; for the second one, we only need to create another dummy variable for the dominance effect and its associated regression coefficient for each marker, then proceed as proposed here.

To estimate gene- or trait-specific β_{g}=(β_{g1},…, β_{gp})′, a standard approach would apply a penalized method, such as Lasso or Lars, to each of the above model, which however ignores possible relationships among the genes. In particular, to capitalize on the association between co-expression and co-regulation, it is advantageous to smooth the parameters for functionally related or co-expressed genes. Here, we wish to incorporate into analysis gene function information in a general form as gene networks. As in a functional group analysis, if two genes′ expression profiles are strongly correlated, then there is a high chance that the two genes are co-regulated and share some common eQTLs. Here, we assume that the co-expression or co-regulation relationships among the genes are described by a gene network a priori. Specifically, if two genes *g*~*h* are linked in a network, we assume a priori that their regression coefficients are close: |β_{g}|≈|β_{h}|; that is, two neighboring genes in the network are more likely to share common eQTLs. We can combine the above multiple regression models into a single one with *Y*_{c}=(*Y*_{1}′,…, *Y*_{G}′)′, *X*_{c}=diag(*X*,…, *X*) and β=(β_{1}′,…, β_{G}′)′. The penalty function is

Then the same generalized boosted Lasso algorithm can be applied to obtain an approximate solution path . Note that, if the λ is large enough, then β_{g}≈β_{h}, corresponding to the network-averaging approach of Kliebenstein *et al.* (2006).

Although we obtain an approximate solution path as a function of λ simultaneously for each *g*, rather than selecting a common for all the genes, we choose gene-specific tuning parameters to account for possible heterogeneity across the models; this partially acknowledges that two different pairs of neighboring genes in the network may share their mutual similarity to varying degrees. Only gene-specific data (*Y*_{g}, *X*) are used in evaluation to select . Specifically, for a *V*-fold CV, first, we randomly partition the data (*Y*_{g}, *X*) into *V* subsets of about an equal size: (*Y*^{v}_{g}, *X*^{v}), *v*=1,…, *V*; denote by (*Y*^{−v}_{g}, *X*^{−v}) the data excluding partition *v*. Second, we fit a network-based (or other penalized) regression model to training data {(*Y*^{−v}_{g}, *X*^{−v}): *g*=1,…, *G*}, and use (*Y*^{v}_{g}, *X*^{v}) as test data to calculate the prediction mean squared error (PMSE), PMSE^{v}_{g}(λ), of the trait for each λ and *g*; iterate the process for each *v*. Third, an overall PMSE for trait *g* is PMSE_{g}(λ)=Ave_{v}PMSE^{v}_{g}(λ). We then choose the penalization parameter for trait or model *g* as . At the end, the final network-based PLSE (Net) for trait *g* is . Note that, due to the choice of gene-specific penalization parameter, the penalty function used can be also regarded as gene specific; that is, for any gene *g*, the penalization parameter λ in the penalty function (5) is replaced by a gene-specific λ_{g}.

A standard approach would fit each model separately, e.g. by the Lars algorithm. Again, CV can be used to estimate an optimal penalization parameter and the final Lars-PLSE . Note that by the use of network information, the solution path obtained by the network-based approach differs from that by the Lars method, as to be shown in the real data example. Hence, although gene-specfic penalization parameters are used by both methods, their final estimates are in general different (even with the same penalization parameter values).

Once we obtain a final PLSE , we examine the components of : a non-zero component suggests a possible linkage between trait *g* and the corresponding marker; of course, a claimed linkage is either a true or a false positive. For any penalization method, these suggested linkages by the non-zero components of the PLSE are taken as our estimated eQTLs. Finally, we note that, for either method, one can parameterize the penalization parameter as a fraction parameter *s*: for any given λ, suppose that is the PLSE; then there is a corresponding fraction , in which the denominator refers to an estimate without penalization. Throughout the below results, we use fraction *s*, which is always between 0 and 1, facilitating its use and comparison across various models. In particular, we note that for our example data, we found that for either the Lars or our network-based method, if a common penalization parameter *s*_{0} was imposed for all the traits, then ŝ_{0}=0 would be selected, leading to intercept-only models and suggesting no eQTL at all.

## 3 RESULTS

### 3.1 Example data

We analyzed a published mouse dataset deposited at the gene expression omnibus (GEO) by Lan *et al.* (2006). The data contained 60 mice in an F2 sample from the C57BL/6J (B6) and BTBR founder strains; B6 and BTBR strains, when made obese, showed different susceptibility to diabetes: B6-ob/ob mice are diabetes resistant while BTBR-ob/ob mice are not (Lan *et al.*, 2006). About 45 000 gene expression traits were obtained from Affymetrix `Moe430 Set` arrays, processed by the robust multi-array average (RMA) method (Irizarry *et al.*, 2003). Genotypes for 194 markers were distributed across 19 chromosomes with an average marker interval of ~10 cM. There were about 7% missing genotypes. We applied the imputation method of Sen and Churchill (2001), as implemented in R/qtl (Broman *et al.*, 2003), to replace any missing genotype by an imputed value; all the work was done with this imputed dataset.

Because our goal is to investigate whether and how to incorporate gene function information into analysis, for illustration, we will consider only the genes in the G protein-coupled receptor (GPCR) protein signaling pathway. For the purpose, we constructed a gene co-expression network. Although it is possible to use the same dataset to do so, to mimic the practical situation with a network given a priori, we used another mouse dataset with liver gene expression of 135 female mice from an sample of F2 intercross between inbred strains of C3H/HeJ and C57BL/6J (Ghazalpour *et al.* 2006). The data were derived from some custom ink-jet arrays; the authors provided on their web site a subset of the data consisting of the top 8000 genes with the most varying expression levels across 135 samples. By gene names, we identified 17 GPCR genes appearing on both datasets; one of the genes, Rgs3, appeared twice (with two probe sets) and were denoted as Rgs3 and Rgs3(2), respectively. Using the second dataset, we calculated pairwise Pearson correlation coefficients between any two of the 17 genes; using a cutoff at 0.4, we obtained a co-expression network (Fig. 1). We conducted an eQTL analysis for these 17 GPCR genes throughout.

### 3.2 Analysis results

We applied Lasso/Lars implemented in R package `lars` (Efron *et al.*, 2002) to each expression trait separately; there seemed to be some problems with CV for Lasso, so we used Lars throughout. Supplementary Figure 1 showed the solution paths for the genes, and the selected tuning parameters in terms of fraction *s* by a 5-fold CV; note quite different ŝ_{g} across the models. The Lars-PLSEs of β_{g}'s from the final models are shown in Supplementary Figure 2; a non-zero component of suggested a possible linkage between trait *g* and the corresponding locus. For 10 genes, ŝ_{g}=0 was selected by CV, leading to an intercept-only model, hence no eQTL was detected; for the other ones, four (Ccr5, Dok4, Rps6ka4 and 1200007D18Rik) showed a linkage at marker locus *D2Mit148* on chromosome 2, which was in agreement with Lan *et al.* (2006) who found a significant linkage peak on chromosome 2 for several GPCR genes. Nonetheless, because most of the genes did not show a linkage on chromosome 2, we would like to see whether the network-based method could improve the detection by borrowing information across the genes connected on the network. Note that the four genes whose expression traits were identified by the Lars to be linked to marker *D2Mit148* on chromosome 2 are mostly disconnected to each other in the co-expression network (Fig. 1).

The solution paths and selected penalization parameters ŝ_{g} by our network-based method (Net) with a 5-fold CV are shown in Supplementary Figure 3. It is clear that the solution paths are different from that of Lars. As a result, more linkages, either true or false positives, were identified by the network-based method (Supplementary Fig. 4). In particular, in addition to the three of the four genes identified by Lars (except 1200007D18Rik), the network-based method also identified seven other genes whose expression was linked to marker *D2Mit148* on chromosome 2: Calcr1, Cxcr3, Kcnq1, Rgs6, gs3, Rgs3(2) and Gabbr1. In contrast to the four largely disconnected genes identified by the Lars, all the genes except Calcr1 identified by the Net were well connected to each other in the co-expression network.

We also applied the network-averaging approach of Kliebenstein *et al.* (2006). It took the average expression of all 17 genes in the network as the expression trait for each individual, then applied the Lars. By a 5-fold CV, the penalization parameter was selected at 0, hence all the regression coefficient estimates were 0; that is, no eQTL was detected. The solution paths are shown in Supplementary Figure 5.

Next, we used simulated data to confirm that the network-based method could indeed improve statistical power to detecting eQTL in practical situations.

### 3.3 Simulation I

#### 3.3.1 Simulation setups

To mimic real data, we used the same genotype data and same number of individuals (i.e. *n*=60 and *p*=194 for each gene); the network was the subnetwork for the real data containing five genes: Rps6ka4, 1200007D18Rik, Apln, Rgs6 and Gabbr1. The true models for simulated data assumed that all the five genes were linked to the last marker on chromosomes 2 (*D2Mit148*), 4 (except 1200007D18Rik) linked to the first marker on chromosome 10 (*D10Mit16*), and only one (Gabbr1) linked to the first marker on chromosome 13 (*D13Mit16*). Specifically, the expression trait of gene *g* for individual *i* was simulated from a linear model

where *x*_{1i}, *x*_{2i} and *x*_{3i} are the genotypes of individual *i* at the three possibly linked markers, and the true coefficients were β_{g,1}=*r*_{g1j} × (−0.2) for *g* = 1,…, 5, β_{g,2}=*r*_{g2j}×(0.2) for *g* ≠ 2 and β_{g,2}=0 for *g*=2, and β_{g,3} = 0.2 for *g* = 5 and β_{g,2}=0 for *g*≠5; was a scaling factor used to perturb the effect size of marker *k* on trait *g* in dataset *j*. Four cases were considered with the noise SD σ_{e}=0.3, 0.5, 0.7 and 0.9, respectively; for each case, 100 simulated datasets were generated independently.

To save computing time, for tuning parameter selection, we used an independent tuning dataset of an equal size (i.e. *n*=60). The idea was similar to CV except that we only needed to fit a model once with the training data, then used the tuning data to calculate PMSE and thus selected the tuning parameter ŝ_{g} for each trait *g*.

The simulation setups reflected practical situations. First, there were multiple eQTLs, some of which were common for all the genes while others were not. Second, although the effect sizes of a common eQTL on its targets were close, they were not exactly the same. Finally, the prior belief as modeled in the penalty function was largely, but not completely, correct: some markers associated with a gene were not associated with the gene's neighbor(s). Note that, although there were at most three eQTLs for each trait, which was unknown, as usual, we fitted a model with all 194 markers for each trait.

#### 3.3.2 Simulation results

The simulation results are included in Table 1. For each gene, we considered the number of false positives (*q*_{0}) and number of true positives (*q*_{1}); ideally, we would like to have *q*_{0} as small and *q*_{1} as large as possible. From Table 1, it is clear that the network-based method in general gave a higher number of true positives while often maintaining an either smaller or comparable number of false positives as compared with the Lars, a standard approach that treated each trait separately. One can even argue that, for each trait with only 1–3 true eQTLs but with > 190 non-eQTL markers, it is affordable to have a few extra false positives as long as there is a significant improvement in detecting a true eQTL. Note that for many cases the mean differences of *q*_{1}'s between the two methods were substantial and statistically significant, as judged by the *P*-values given by paired *t*-tests. Hence, as expected, by borrowing information across the genes in a network, the network-based method gained statistical power of detecting eQTL.

### 3.4 Simulation II

The second set of simulation setups were the same as that in Simulation I except the following three modifications. First, based on the same genotype data, we used R/qtl to impute a pseodo marker every 8 cM on each chromosome, resulting in a total of 408 original or imputed markers. Second, we added two pseudogenes, say genes 6 and 7, to the original subnetwork of the five genes: gene 6 was connected to both gene 120007D18Rik and gene 7; there was no any other change with the subnetwork. Third, in addition to the three previous eQTLs, we added a new eQTL at a pseudo-marker at 80 cM on chromosome 3 (between markers D3Mit44 and D3Mit19) for the five genes with a regression coefficient of −0.2, while the two pseodo genes were not linked to any loci. In addition to the previous two methods, we also considered the network-averaging (Ave) approach of Kliebenstein *et al.* (2006): for each individual, the average expression level of the seven genes was used as the trait, then a linear model was fitted using Lars; it was implicitly assumed that all the genes in the network shared the same eQTL. Otherwise, we followed exactly the same procedures as before, yielding four setups and the corresponding results (Table 2).

*q*

_{1}) and false positives (

*q*

_{0}) by the three methods from 100 simulated datasets

It is evident that our network-based method always detected more eQTLs while giving fewer false positives than the Lars method. The network-averaging method discovered more eQTLs but at the high cost of much larger number of false positives (with statistical significance for most of them).

## 4 DISCUSSION

We have proposed a gene network-based regression approach to multiple linkage analysis of expression traits. Because the genes in the same functional group or pathway tend to be co-expressed and co-regulated, it makes sense to combine eQTL analyses for the functionally related genes; this point has been increasingly recognized (e.g. Lan *et al.*, 2006). However, there is yet any consensus on how to do so effectively; here we have outlined a novel penalized regression approach that incorporates into a penalty the prior knowledge of gene functions embedded in a network. In particular, we have formulated multiple regression models for individual traits as a single, expanded regression model so that an existing penalization technique for a single model can be applied. Although this formulation largely simplifies the problem, there are some special features associated with multiple models that distinguish the problem from the one with only a single model. For example, although the regression coefficient solution paths for all the traits are obtained as functions of a common penalization parameter, because of possibly varying effect sizes of a common eQTL across the traits or of varying degrees of co-expression for neighboring genes in the network, we allow trait-specific penalization parameters to be selected at the end; for the mouse data, the choice of such trait-specific penalization parameters seemed to be necessary.

Our work is related to the ongoing efforts of incorporating biological knowledge into genomic data analysis. For example, Wei and Pan (2008b) considered the use of gene functional groups in regression analysis of gene expression on DNA–protein binding data [or similarly on DNA sequence data as shown by Conlon *et al.* (2003)] to infer transcription factor–target relationships. The basic idea is that the genes within the same functional group share some relevant characteristics, and thus it may be beneficial to borrow information across them. Their methods can be adapted to the current context. However, a limitation of the methods is their dependence on gene group selection: for example, since there are thousands of GO categories, which one to use? There is a trade-off between group size and functional specificity: a more specific and functionally homogeneous functional group contains necessarily fewer genes, introducing estimation difficulties with a smaller sample size. We feel that a gene network is both flexible and powerful to capture biological knowledge, as compared with gene functional groups; in fact, some even argued that the former should be more suitable (Fraser and Marcotte, 2004). For gene functional groups, in addition to which group to use, there are also other issues, such as how to handle genes annotated in multiple groups and why treating the genes inside a group equally a priori.

Although we have focused on regression analysis for multiple eQTL mapping, the idea of incorporating biological knowledge can be equally applied to single eQTL analysis. For instance, to identify possible associations between any expression trait and any genomic marker, Kendziorski *et al.* (2006) proposed a mixture model (of transcripts) over markers (MOM), in which the transcripts are treated equally a priori as in a standard mixture model; Gelfond *et al.* (2007) proposed incorporating genomic location information into MOM. As developed for differential expression analysis, incorporating gene functional groups as defined by either functional annotations or clustering analysis (Pan, 2006), or by gene networks (Wei and Li, 2007; Wei and Pan, 2008a), into a mixture model for eQTL analysis is in principle straightforward, though computational challenge remains due to a large number of components in the mixture model. Finally, our current implementation based on the generalized boosted Lasso algorithm can handle dozens of genes in a pathway; for genome-wide expression traits, one may deal with individual pathways separately. Although considering pathways one by one is often acceptable, if one wants to combine thousands of expression traits simultaneously with a genome-wide network, more efficient algorithms still need to be developed. These are all interesting topics to be studied.

## ACKNOWLEDGEMENTS

The author thanks the reviewers' for helpful comments.

*Funding*: National Institutes of Health (GM081535 and HL65462).

*Conflict of Interest*: none declared.

## REFERENCES

- Ashburner M, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 2000;25:25–29. [PMC free article] [PubMed]
- Aten JE, et al. Using genetic markers to orient the edges in quantitative trait networks: the NEO software. BMC Syst. Biol. 2008;2:34. [PMC free article] [PubMed]
- Bogdan M, et al. Modifying the Schwarz Bayesian information criterion to locate multiple interacting quantitative trait loci. Genetics. 2004;167:989–999. [PMC free article] [PubMed]
- Breiman L. Heuristics of instability and stabilization in model selection. Ann. Stat. 1996;24:2350–2383.
- Brem RB, et al. Genetic dissection of transcriptional regulation in budding yeast. Science. 2002;296:752–755. [PubMed]
- Broman KW, Speed TP. A model selection approach for the identification of quantitative trait loci in experimental crosses (with discussion) J. R. Stat. Soc. Ser. B. 2002;64:641–656.
- Broman KW, et al. R/qtl: QTL mapping in experimental crosses. Bioinformatics. 2003;19:889–890. [PubMed]
- Conlon EM, et al. Integrating regulatory motif discovery and genome-wide expression analysis. Proc. Natl Acad. Sci. USA. 2003;100:3339–3344. [PMC free article] [PubMed]
- Efron B, et al. Least angle regression. Ann. Stat. 2004;32:407–499.
- Faith JJ, et al. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol. 2007;5:e8. [PMC free article] [PubMed]
- Fraser AG, Marcotte EM. A probabilistic view of gene function. Nat. Genet. 2004;36:559–564. [PubMed]
- Gelfond JAL, et al. Proximity model for expression quantitative trait loci (eQTL) detection. Biometrics. 2007;63:1108–1116. [PubMed]
- George EI, McCulloch RE. Variable selection via Gibbs sampling. J. Am. Stat. Assoc. 1993;88:881–889.
- Ghazalpour A, et al. Integrating genetic and network analysis to characterize genes related to mouse weight. PLoS Genet. 2006;2:e130. [PMC free article] [PubMed]
- Irizarry RA, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003;4:249–264. [PubMed]
- Jia Z, Xu S. Mapping quantitative trait loci for Expression Abundance. Genetics. 2007;176:611–623. [PMC free article] [PubMed]
- Jiang C, Zeng ZB. Multiple traits analysis of genetic mapping for quantitative trait loci. Genetics. 1995;140:1111–1127. [PMC free article] [PubMed]
- Kanehisa M, Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28:27–30. [PMC free article] [PubMed]
- Kendziorski CM, Wang P. A review of statistical methods for expression quantitative trait loci mapping. Mamm. Genome. 2006;17:509–517. [PubMed]
- Kendziorski CM, et al. Statistical methods for expression quantitative trait loci (eQTL) mapping. Biometrics. 2006;62:19–27. [PubMed]
- Kliebenstein DJ, et al. Identification of QTLs controlling gene expression networks defined a priori. BMC Bioinformatics. 2006;7:308. [PMC free article] [PubMed]
- Lan H, et al. Dimension reduction for mapping mRNA abundance as quantitative traits. Genetics. 2003;164:1607–1614. [PMC free article] [PubMed]
- Lan H, et al. Combined expression trait correlations and expression quantitative trait locus mapping. PLoS Genet. 2006;2:51–61. [PMC free article] [PubMed]
- Lee I, et al. Probabilistic functional network of yeast genes. Science. 2004;306:1555–1558. [PubMed]
- Li C, Li H. Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics. 2008;24:1175–1182. [PubMed]
- Li H, et al. Integrative genetic analysis of transcription modules: towards filling the gap between genetic loci and inherited traits. Hum. Mol. Genet. 2006;15:481–492. [PubMed]
- Li N, et al. Functional group-based linkage analysis of gene expression trait loci. BMC Proceedings. 2007;1(1):S117. [PMC free article] [PubMed]
- Liu B, et al. Gene network inference via structured equation modeling in genetical genomics experiments. Genetics. 2008;178:1763–1776. [PMC free article] [PubMed]
- Neto EC, et al. Inferring causal phenotype networks from segregating populations. Genetics. 2008;179:1089–1100. [PMC free article] [PubMed]
- Pan W. Incorporating gene functional annotations in detecting differential gene expression. Appl. Stat. 2006;55:301–316.
- Pan W, et al. Biometrics. in press (Available as Research Report 2009-001, Division of Biostatistics, University of Minnesota; 2009. Incorporating predictor network in penalized regression with application to microarray data. at
`http://www.biostat.umn.edu./rrs.php`last accessed date on February 15, 2009) [PMC free article] [PubMed] - Salgado H, et al. RegulonDB (version 4.0): transcriptional regulation, operon organization and growth conditions in Escherichia coli K-12. Nucleic Acids Res. 2004;32:D303–D306. [PMC free article] [PubMed]
- Schadt EE, et al. Genetics of gene expression surveyed in maize, mouse and man. Nature. 2003;422:297–302. [PubMed]
- Sen S, Churchill GA. A statistical framework for quantitative trait mapping. Genetics. 2001;159:371–387. [PMC free article] [PubMed]
- Storey JD, et al. Multiple locus linkage analysis of genomewide expression in yeast. PLoS Biol. 2005;3:e267. [PMC free article] [PubMed]
- Tibshirani R. Regression shrinkage and selection via the LASSO. J. R. Stat. Soc. B. 1996;58:267–288.
- Tseng GC. Penalized and weighted K-means for clustering with scattered objects and prior information in high-throughput biological data. Bioinformatics. 2007;23:2247–2255. [PubMed]
- Wei Z, Li H. A Markov random field model for network-based analysis of genomic data. Bioinformatics. 2007;23:1537–1544. [PubMed]
- Wei P, Pan W. Incorporating gene networks into statistical tests for genomic data via a spatially correlated mixture model. Bioinformatics. 2008a;24:404–411. [PubMed]
- Wei P, Pan W. Incorporating gene functions into regression analysis of DNA-protein binding data and gene expression data to construct transcriptional networks. IEEE/ACM Trans. Comput. Biol. Bioinform. 2008b;5:401–415. [PubMed]
- Xu S. Estimating polygenic effects using markers of the entire genome. Genetics. 2003;163:789–801. [PMC free article] [PubMed]
- Xu S. An empirical Bayes method for estimating epistatic effects of quantitative trait loci. Biometrics. 2007;63:513–521. [PubMed]
- Yi N, et al. Stochastic search variable selection for identifying quantitative trait loci. Genetics. 2003;164:1129–1138. [PMC free article] [PubMed]
- Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. B. 2006;68:49–67.
- Zhang Y-M, Xu S. A penalized maximum likelihood method for estimating epistatic effects of QTL. Heredity. 2005;95:96–104. [PubMed]
- Zhao P, Yu B. Boosted Lasso. Technical Report #678. UC-Berkeley: Department of Statistics; 2004.
- Zhao P, et al. Grouped and hierarchical model selection through composite absolute penalties. Ann. Stat. 2006 in press.
- Zou H, Hastie T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. B. 2005;67:301–320.
- Zou W, Zeng ZB. Multiple interval mapping for gene expression QTL analysis. 2007 Available at statgen.ncsu.edu/zeng/MIM-eQTL.pdf (last accessed date on Feburary 15, 2009) [PubMed]

**Oxford University Press**