• We are sorry, but NCBI web applications do not support your browser and may not function properly. More information
Logo of bioinfoLink to Publisher's site
Bioinformatics. Nov 1, 2009; 25(21): 2735–2743.
Published online Sep 7, 2009. doi:  10.1093/bioinformatics/btp531
PMCID: PMC2781753

Efficiently finding genome-wide three-way gene interactions from transcript- and genotype-data

Abstract

Motivation: We address the issue of finding a three-way gene interaction, i.e. two interacting genes in expression under the genotypes of another gene, given a dataset in which expressions and genotypes are measured at once for each individual. This issue can be a general, switching mechanism in expression of two genes, being controlled by categories of another gene, and finding this type of interaction can be a key to elucidating complex biological systems. The most suitable method for this issue is likelihood ratio test using logistic regressions, which we call interaction test, but a serious problem of this test is computational intractability at a genome-wide level.

Results: We developed a fast method for this issue which improves the speed of interaction test by around 10 times for any size of datasets, keeping highly interacting genes with an accuracy of ~85%. We applied our method to ~3 × 108 three-way combinations generated from a dataset on human brain samples and detected three-way gene interactions with small P-values. To check the reliability of our results, we first conducted permutations by which we can show that the obtained P-values are significantly smaller than those obtained from permuted null examples. We then used GEO (Gene Expression Omnibus) to generate gene expression datasets with binary classes to confirm the detected three-way interactions by using these datasets and interaction tests. The result showed us some datasets with significantly small P-values, strongly supporting the reliability of the detected three-way interactions.

Availability: Software is available from http://www.bic.kyoto-u.ac.jp/pathway/kayano/bioinfo_three-way.html

Contact: kayano/at/kuicr.kyoto-u.ac.jp

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

We address the issue of efficiently finding a three-way gene interaction, precisely two interacting genes in expression under the genotypes of a different gene, given a dataset in which both gene expressions and genotypes are measured for each individual. We illustrate our problem setting by using synthetic 2D diagrams in Figure 1, where expression values of two genes are plotted with three classes (genotypes): +, * and [big up triangle, open]. In this figure, panel (a) shows expression values being just randomly distributed; (b) shows expression values being easily categorized into three classes; and (c) shows that classes can be categorized by expressions without using two genes at the same time. We are not interested in (a–c) but in (d), which shows that the correlation in expression between two genes differs for each class. More concretely, two genes are positively correlated for one class, whereas they are negatively correlated for another. This is exactly a switching mechanism in expression between correlation and inverse-correlation of two genes, controlled by another gene. Also this is the three-way gene interaction which we attempt to find in this article. We note that this can be categorized into a general switch in biology. A simple, well-known example is Max, a transcription factor, which plays a role of an activator or a suppressor, depending on whether it binds to Myc (i.e. Myc-Max) or Mad (i.e. Mad-Max) (Ayer and Eisenman, 1993). We emphasize that this type of interaction must be a key to elucidating complex biological systems.

Fig. 1.
Synthetic examples: expressions of two genes under the three classes of another gene. (a) randomly distributed, (b,c) easily categorized into three classes and (d) a switching mechanism.

A reasonable approach to detect such three-way interactions is the likelihood ratio test for regression (LRTR). Particularly, logistic regression must be suitable the most, because of categorical responses (genotypes) in our setting (McCullagh and Nelder, 1989). The first item of note is that parameter estimation for logistic regression is based on the maximum likelihood, for which a time-consuming iterative gradient descent, Newton–Raphson, is usually used. Secondly, in our case, classes are genotypes, causing a problem of an explosive number of combinations of one SNP (genotypes) and two genes (expressions). For example, for 50 000 SNPs and 1000 genes, we have roughly 5 × 1010 (= 50 000 × 1000 × 1000) combinations, making scanning over all possible combinations intractable. In fact, >24 h are needed to run Newton–Raphson over only 107 combinations in our experiments. Thus, the main focus of this article is to speed up the procedure of finding the three-way interactions. Our strategy for this issue is to prune irrelevant combinations, such as those in which the expression values of two genes are randomly distributed as in Figure 1a, by using a hypothesis test assuming the normality of given examples.

The contribution of this article can be summarized into three folds: (i) We present a problem setting of finding a three-way gene interaction of two numerical variables and one categorical, corresponding to a biological switch in expression. (ii) LRTR and LRT of logistic regression (LRTLR) are the standard approaches for this problem, but these are computationally inefficient, particularly for a huge number of combinations that we can have. We then propose an efficient method for pruning large part of input combinations. (iii) Our experiment with a huge dataset of human brain samples showed that our method run 10 times faster than LRTLR for any data size, keeping the accuracy of detecting three-way interactions at ~85%.

2 RELATED WORK

Three-way interactions in expression have not been considered except only a few cases of using simple methods (Li et al., 2004; Zhang et al., 2007). There are two reasons for this: (i) dealing with more than two-way correlations is intractable at a genome-wide level, because of the explosive number of combinations and (ii) three-way interactions along this line can be inferred from two-way co-expression. We emphasize that our three-way interaction is different from them, in terms that correlation or inverse-correlation in expression of two genes is controlled by genotypes of another gene.

Genome-wide association studies (GWA) using genotypes, especially single nucleotide polymorphisms (SNPs), have been highlighted in these few years (McCarthy and Hirschhorn, 2008), whereas cDNA microarrays have been a standard tool for understanding gene/protein behaviors in a cell. Thus, currently a large number of studies use both gene expressions and genotypes, showing the importance of combining these two information sources (Nica and Dermitzakis, 2008). Consequently, we now have a unique dataset, in which both gene expressions and genotypes are measured at once for each individual, and this type of dataset, which we use in this article, is increasing in these few years, which makes our approach very promising (Dixon et al., 2007; Myers et al., 2007; Schadt et al., 2008).

A standard analysis in GWA is conducted between a single SNP (i.e. genotypes at a locus) and a categorical or continuous outcome (phenotype). For this analysis, the two most typical approaches are ANOVA (Analysis of Variance) and LRTR (Balding, 2006). Usually more complex analysis is multiple (usually two) SNPs with a single phenotype where two-way ANOVA or LRTR with two explanatory variables can be considered. This situation is closely related with epistasis, a general concept in modern quantitative genetics (Aylor and Zeng, 2008; Cordell, 2002), meaning the interaction between multiple loci and phenotype (Marchini et al., 2005). Our problem setting looks similar to this but interestingly in the reverse direction. That is, we consider the interaction between two expression phenotypes under categorical genotypes which thus have not been examined in GWA. We note that ANOVA cannot be applied to this issue,1 whereas LRTR can be applied as a standard manner for our setting. Another item of note is that finding three-way interactions in only SNPs exists (Lo et al., 2008), but their problem setting is straightforward and totally different from our setting.

3 METHODS

3.1 Notations and preliminaries

Let X be an input matrix, in which each row is an individual and each column is a numerical vector of gene expressions or a categorical vector of SNPs (in genes). Let E be the set of genes for which expressions are measured in X and Q be the set of SNPs in X, indicating that |E|+|Q| is the total number of columns of X. To test the three-way interaction, we choose one combination, i.e. two genes (e1 and e2) and one SNP (q) out of E and Q, respectively, and we write X(e1, e2, q) which has only three columns of X, corresponding to e1, e2 and q [we write X(e, q) when we choose only one gene e out of E and q out of Q]. Hereafter, until Section 3.6, we assume that we already choose one combination.

For gene expressions, let X=(X1,…, XK)′[set membership]RK be a K-dimensional numerical variable, taking value x=(x1,…, xK)′. We note that using two genes in expression does not necessarily mean K=2. For example, for two genes, we can set K=3, where X1, X2 and X3 correspond to one gene, the other gene and the interaction between these two genes, respectively. For genotypes, let C be the number of groups (or classes), and in fact, C=3. We denote three genotypes by G1, G2 and G3, into one of which each individual falls. Let Y be the class variable, taking value y, where Y=(Y1, Y2)′[set membership]{0, 1} × {0, 1}. Here, we note that y takes the following values: y=(1, 0)′ if x[set membership]G1, y=(0, 1)′ if x[set membership]G2 and y=(0, 0)′ if x[set membership]G3. We denote N inputs (individuals) by X=(x1,…, xN)′ and Y=(y1,…, yN)′=(y(1), y(2)), which can be classified into N1, N2 and N3 inputs for G1, G2 and G3, respectively. The average expression values can be defined for each class c and all classes: An external file that holds a picture, illustration, etc.
Object name is btp531i1.jpg and An external file that holds a picture, illustration, etc.
Object name is btp531i2.jpg, respectively, where An external file that holds a picture, illustration, etc.
Object name is btp531i3.jpg. IK is the identity matrix of size K, and 1 is an n-dimensional vector in which all elements are 1.

We incorporate some basic statistics: An external file that holds a picture, illustration, etc.
Object name is btp531i4.jpg and An external file that holds a picture, illustration, etc.
Object name is btp531i5.jpg, where T=B+W. We can further define covariance matrix Sc for class c, An external file that holds a picture, illustration, etc.
Object name is btp531i6.jpg and total covariance matrices S and ST, An external file that holds a picture, illustration, etc.
Object name is btp531i7.jpg and An external file that holds a picture, illustration, etc.
Object name is btp531i8.jpg. We note that W = ∑c=1CNcSc and An external file that holds a picture, illustration, etc.
Object name is btp531i9.jpg.

We show the multivariate normal distribution, having two parameters, μc and Σc (the mean and the covariance matrix of class c), and the log-density (log-likelihood) function of this distribution can be given as follows:

equation image
(1)

From this equation, we can see that An external file that holds a picture, illustration, etc.
Object name is btp531i10.jpg, covariance matrix Sc and covariance matrix S can be the maximum likelihood estimators of μc, Σc and Σ (=Σ1=···=ΣC), respectively.

We briefly describe likelihood ratio test (LRT), which will be used. We first assume that examples x1, x2 ,…, xn are generated according to parameter vector θ. Let H0 : θ[set membership]Ω0 be a null hypothesis and H1 : θ[set membership]Ω1 be the alternative hypothesis. The statistic λ for testing H0 against H1 can be defined as λ=L0*/L1*, where L0* and L1* are the maximum likelihoods under θ[set membership]Ω0 and θ[set membership]Ω1, respectively. Usually we can use the log-likelihood ratio (LLR), −2logλ=2([ell]1*[ell]0*), where [ell]1*=logL1* and [ell]0*=log L0*. We note that this statistic follows χqr2 distribution as N→∞, where qr is the degree of freedom (df) of the χ2 distribution.

3.2 Finding three-way interactions: interaction test (Likelihood Ratio Test of Logistic Regression, LRTLR)

A standard and exact approach for our problem is LRTLR (McCullagh and Nelder, 1989), which we simply call interaction test in this article.

3.2.1 Logistic regression

We first denote the probability that x is in G1 by p1(x), and similarly the probability that x is in G2 by p2(x), by which the probability that x is in G3 is p3(x) (=1−p1(x)−p2(x)). We use logistic regression to link these probabilities to K-dimensional input x by using weight parameters (or coefficients) w=(w1′, w2′)′, where w1=(w10, w11,…, w1(K−1))′, w2=(w20, w21,…, w2(K−1))′ as follows:

equation image
(2)

Here, we denote p1(x), p2(x) and p3(x) by p1(x; w), p2(x; w) and p3(x; w) (=1−p1(x; w)−p2(x; w)), respectively, because they can be functions of w. We can then write the likelihood of logistic regression for given N examples and parameters w, as follows:

equation image

where yi=(yi1, yi2)′.

3.2.2 Parameter estimation

We can obtain the maximum likelihood estimator ŵ for w by maximizing the log-likelihood l(w)=log L(w). A standard approach for this purpose is the Newton–Raphson method, which is an iterative gradient descent, having the following updating rule by which we can have ŵ(t+1) at the (t+1)-th iteration, using ŵ(t) of the t-th iteration:

equation image
(3)

where Hessian matrix H(w) (=[partial differential]2l/[partial differential]w[partial differential]w′) and gradient vector U(w) (=[partial differential]l/[partial differential]w) can be given in the following:

equation image

where X*=diag(X, X) (block diagonal matrix of X), a(w)=(a1(w)′, a2(w)′)′ where aj(w)=y(j)pj(w) and pj(w)=(pj(x1; w),…, pj(xN; w))′ (j=1, 2).

equation image

where N × N matrix Rjk(w) (j, k=1, 2) is given by Rjj(w)=diag{pj(w)[middle dot in circle](pj(w)−1)} and Rjk(w)=diag{pj(w)[middle dot in circle] pk(w)} (jk).

Finally, the updating rule of the Newton–Raphson method for logistic regression can be rewritten in the following:

equation image
(4)

In practise, we start with some initial values ŵ(0) and update ŵ(t+1) according to Equation (4) until the following equation is satisfied:

equation image
(5)

where δ is set at a certain value.

3.2.3 Interaction test

We then examine the significance of the interaction in expression between two genes in terms of classes of another gene. Let xi1 and xi2 be expression values of the corresponding two genes for input i. The interaction term is xi1xi2, meaning that our purpose is to find the case that the logistic model is well fitted to the data when this term is added. We then let xi=(1, xi1, xi2, xi1xi2)′ and w=(w10, w11, w12, w13, w20, w21, w22, w23)′, and the logistic model with the interaction term is given as follows:

equation image

If wc3=0, the model does not have the interaction term, meaning that the null hypothesis and w0 are given as follows:

equation image
(6)

Then the test statistic, LLR and its asymptotic distribution can be given:

equation image
(7)

where χ22i) is the χ2 distribution with the df of two, meaning that interacting genes can be obtained as those which have lower P-values under this distribution than the input significance level αi. We run interaction test 100 times over four examples in Figure 1, and Table 1 shows the average results over the 100 runs. This table clearly shows that the P-value is very large for Figure 1a–c, while that is zero for Figure 1d, indicating that intraction test can detect our target sample correctly.

Table 1.
Log-likelihoods and LLR by Newton–Raphson

Figure 2 shows a pseudocode of interaction test. We can write interaction test by function Interaction_test(e1, e2, q, αi), which outputs one if given example (e1, e2, q) has the three-way interaction; otherwise zero. A significant drawback of interaction test is computational inefficiency. In fact, Equation (6) shows K=8, meaning that Newton–Raphson needs to compute an 8 × 8 inverse-matrix at each of its iteration procedure.

Fig. 2.
Pseudocode of interaction test.

3.3 Key idea for speeding-up interaction finding

A basic idea for accelerating the finding of a three-way interactions is to prune some combinations, to which interaction test does not have to be applied. From Equation (7), we can see that the interacting genes should have a larger LLR. Figure 3 shows a schematic figure, in which we plot the log-likelihood without the interaction term in the left-hand side and with the interaction term in the right-hand side. We note that the range of the log-likelihood can be limited, because the maximum log-likelihood is zero and the minimum log-likelihood can be given by the case of the uniform distribution for pi(x). The LLR in question can be then given by the distance being shown by a dotted line in Figure 3. Thus, two interacting genes should have a long dotted line, meaning that the point in the left-hand side should be lower and that in the right-hand side should be higher. This observation indicates that we can prune the following two cases: (I) a large likelihood can be obtained without the interaction term, and (II) only a small likelihood can be obtained even if we use the interaction term. These (I) and (II) correspond to areas I and II, respectively, in Figure 3. We then attempt to efficiently detect examples in areas I and II by assuming the normality on data distribution.

Fig. 3.
LLR and its components.

3.4 Linear discriminant analysis

Area I in Figure 3 contains examples in which expressions can be easily separated into three classes without the interaction term, as shown in Figure 1b and c. Thus, in this case, we can consider a simpler, easily computable estimation method for parameters of the logistic regression model without the interaction term, and if the likelihood for a given combination by that model is high enough, this combination can be pruned. For the simpler estimation method, we use linear discriminant analysis (LDA), which assumes that x follows the normal distribution N(μ, Σ) with the same covariance Σ for all three classes (Hastie et al., 2001). We skip the detail of this method due to space limitations because in our experiment only a small part of all given examples can be pruned by LDA. Interested readers should refer the Supplementary Material. We can write LDA by function LDA(e1, e2, q, αi) [or LDA(e, q, αi)], which outputs one if given example (e1, e2, q) [or (e, q)] should be pruned; otherwise zero.

3.5 Randomness test

Area II in Figure 3 contains an example for which the maximum likelihood with the interaction term is very low, implying that expression values are almost randomly distributed in terms of classes, as shown in Figure 1a. To detect the randomness of expression values, if we use a faster hypothesis test for randomness than Newton–Raphson, we can speed up the procedure for finding the three-way interaction. We assume that expression values follow the K-dimensional normal distribution for each class of genotypes, and under this assumption, we present our approach, which combines multivariate ANOVA (MANOVA) and Box's M test (Mardia et al., 1979). We can set K=2 for our test, meaning that the largest matrix size is 2 × 2, making the computation very efficient.

3.5.1 MANOVA

MANOVA considers the following null hypotheses over the means:

equation image

For testing H0 against H1, we use LLR, −2logλ (=2([ell]1*[ell]0*)). By replacing Σc in Equation (1) with Σ and using the maximum likelihood estimators An external file that holds a picture, illustration, etc.
Object name is btp531i11.jpg and S for μk and Σ, respectively, we have the following:

equation image
(8)

On the other hand, for the log-likelihood under null hypothesis, we can use the maximum likelihood estimators An external file that holds a picture, illustration, etc.
Object name is btp531i12.jpg and ST for μk and Σ, respectively, and we have the following:

equation image
(9)

Thus, the statistic can be given as follows: An external file that holds a picture, illustration, etc.
Object name is btp531i13.jpg. We can further see that q is An external file that holds a picture, illustration, etc.
Object name is btp531i14.jpg and r is An external file that holds a picture, illustration, etc.
Object name is btp531i15.jpg.

We conducted MANOVA over four samples in Figure 1, and Table 2 shows the resultant average over 100 runs with SDs in parentheses. The P-value of MANOVA for (a) was high (0.53), whereas that for (b) [and (c)] was zero, meaning that MANOVA can discriminate (a) from (b) [and (c)]. However, the P-value of (d) was also high (0.94), meaning that MANOVA could not separate (a) from (d). Thus, we need another hypothesis test, which can distinguish (a) from (d).

Table 2.
MANOVA, Box's M test and Means-Covariances (MC) test on four examples in Figure 1

3.5.2 Box's M test

We then consider the following hypotheses over the covariance:

equation image

Here, [ell]0* can be given by [ell]1* of MANOVA [i.e. Equation (8)], and [ell]1* can be obtained by using maximum likelihood estimators An external file that holds a picture, illustration, etc.
Object name is btp531i16.jpg and Sk for μk and Σk, respectively, in Equation (1).

equation image
(10)

Thus, the statistic is −2logλ=∑c=1C Nclog det(Sc−1S). Here, q is An external file that holds a picture, illustration, etc.
Object name is btp531i17.jpg and r is An external file that holds a picture, illustration, etc.
Object name is btp531i18.jpg.

We run Box's M test over four samples in Figure 1, and Table 2 shows the results. This result shows that the P-value of (a) was high (0.70), whereas that of (d) was zero, meaning that M-test separated (a) from (d). However, this time, this test could not discriminate (a) from (b) [and (c)], since the P-value of (b) [and (c)] was also high. Thus, this result showed that Box's M test can be a complement of MANOVA, implying that we can combine these two tests for detecting random distributions such as Figure 1a.

3.5.3 MC test (MANOVA + M Test)

We finally consider the following hypotheses over both the means and covariances:

equation image

We emphasize that this test suits our purpose the most, although this is an unpopular statistic and not named. We then call this test as MC test. Interestingly, [ell]0* of this test is given by [ell]0* of MANOVA, i.e. Equation (9) and [ell]1* is given by [ell]1* of M test, i.e. Equation (10). Thus, the statistic of MC test is given as follows:

equation image
(11)

since An external file that holds a picture, illustration, etc.
Object name is btp531i19.jpg. Here, An external file that holds a picture, illustration, etc.
Object name is btp531i20.jpg and An external file that holds a picture, illustration, etc.
Object name is btp531i21.jpg, meaning that df is 10 in our case. Figure 4 shows a pseudocode of MC test. We can write MC test by function MC_test(e1, e2, q, αm), having significance level αm as an input which removes given combination (e1, e2, q) if its P-value is larger than αm, meaning that a larger number of combinations can be removed if αm is smaller. This function outputs one if (e1, e2, q) should be pruned; otherwise zero.

Fig. 4.
Pseudocode of MC test.

We checked the performance of MC test using synthetic four samples of Figure 1. Table 2 shows that all P-values are zero, except (a) with the P-value of 0.60, indicating that MC test can successfully detect (a) out of the four examples and is expected to work on real data as well.

3.6 Proposed procedure

Figure 5 shows a pseudocode of our entire procedure. We can first check each pair of a gene and a SNP by LDA, and if the log-likelihood is high, this pair is stored to be pruned. We then generate all possible combinations of two genes and a SNP out of given data. For each of these combinations, it is first pruned if it contains the stored gene–SNP pair. Then, LDA and MC test are run in sequence for pruning, and finally interaction test is applied to the remaining. Hereafter, we call our proposed procedure FTGI, standing for Fast finding Three-way Gene Interactions, whereas we call the approach of running Interaction Test Only over all possible combinations as ITO. More details of our proposed method is shown in the Supplementary Material.

Fig. 5.
Pseudocode of our entire procedure: FTGI.

4 EXPERIMENTS

4.1 Data

We used the human brain-derived dataset of Myers et al. (2007), which originally has 193 rows (individuals) and 14 078 numerical columns (corresponding to gene expressions) and 366 140 categorical columns (corresponding to SNPs). We first removed the columns containing missing values and the columns which have a genotype to which only less than 10 individuals are assigned. Our purpose is to find three-way gene interactions, and so we further removed SNPs which are neither in coding regions nor in introns, by specifying genes on sequences using the FTP site of NCBI Mapviewer for Homo sapiens. Finally, we obtained 5269 numerical vectors (in expression of genes) and 13 411 categorical vectors (in genotypes of SNPs) for 193 individuals, which we call the Source dataset. Myers et al. (2007) collected the original dataset from human brains, and so we focused on neurodegenerative diseases [including Alzheimer's disease (AD) and Parkinson's disease, etc.] out of five disease pathways in the KEGG disease database (Kanehisa et al., 2008), resulting in 142 genes which we call Neuro. All experiments were run on a machine with Dual-Core AMD Opteron 2222 SE (3.0 GHz) and 18 GB RAM. Throughout Section 4, each P-value is shown by log10(P-value).

4.2 Results and discussion

4.2.1 Speeding-up finding three-way interactions and pruning accuracy

We examined the improvement in time efficiency by FTGI over ITO. Figure 6 shows the real computation time of ITO and FTGI, when we changed the number of combinations randomly chosen from the source dataset. We here focused on Area II of Figure 3 only, since we found that in the Source dataset of Area I had only a small number of examples, which do not affect the efficiency greatly. This figure clearly shows that as αm decreased, the amount of running time of FTGI became smaller for any size of inputs, by pruning a larger number of them. In particular, at αm of 0.001, FTGI runs approximately 10 times faster than ITO, resulting in only ~2 h for 107 combinations, being a sizable improvement. This means that for 5 × 1010 (= 50 000 SNPs × 1000 genes × 1000 genes) combinations, FTGI just needs only a couple of days with 100 CPUs, while ITO needs more than a month.

Fig. 6.
Computation time improvement by reducing αm.

The αm controls the number of pruned combinations, and Table 3 shows the pruning rate, i.e. the ratio of pruned combinations to all input combinations, with varying αm for 107 input combinations. We further checked the pruning accuracy, which can be defined as the overlap between the resultant top K (set at 100) combinations by P-values of ITO and those of FTGI. Table 3 shows that for αm of 0.05, FTGI can prune around ~70% of input combinations with pruning accuracy of almost 100%. If αm is reduced to 0.001, ~94% inputs can be pruned, keeping the pruning accuracy of ~85%. This high pruning rate effects the time efficiency of FTGI.

Table 3.
Pruning rates and pruning accuracies (top 100) at three αm values of FTGI for 107 combinations

We note that all results in this section were averaged over three runs at each corresponding setting.

4.2.2 Detecting three-way interactions

We then generated all combinations from the Source dataset, focusing on the genes in Neuro, meaning that we had totally ~3 × 108 combinations (= 13 411 SNPs × 142 genes × 142 genes). We then run FTGI with αm of 0.001 over these combinations. Figure 7 shows the gene expressions of the resultant top 10 combinations in terms of P-values. We note that these P-values of interaction test were computed by the procedure in Section 3.2. Each of Figure 7 is a 2D diagram on which expression values of the corresponding two genes are plotted with Contour lines for each genotype. This figure shows that the topographical distribution of different genotypes are clearly crossed in all cases, meaning that in each of all the top 10 combinations, genes are interacting in expression, being controlled by genotypes, as shown in Figure 1d.

Fig. 7.
Expressions of two genes under three genotypes of another gene for top 10 (aj) ranked three-way interactions out of 3 × 108 combinations.

Table 4 shows the detail (Gene name for one SNP and the name with GeneID, the definition and the pathway for each of two interacting genes in expression) of the 10 three-way interactions in Figure 7, all information in this table being retrieved from KEGG.2 For example, the first interaction of Table 4 shows the switching mechanism of two genes, COX6C and UBA1, being controlled by a SNP in LARP4.

Table 4.
Details of the top 10 three-way interactions in Figure 7

4.2.3 Validating detected interactions with permutations

To confirm the statistical significance of the detected three-way interactions, we conducted permutations by measuring P-values of ‘null data’, generated in the following three manners, and comparing them with those of the interactions we detected.

  • Null data 1: we randomly chose 10 000 combinations out of all combinations using the Source dataset (13 411 SNPs × 5269 genes × 5269 genes) and randomly permuted the genotypes of these combinations 100 times. Totally, we had one million null examples.
  • Null data 2: we randomly chose 10 000 combinations out of all combinations using the Neuro dataset (13 411 SNPs × 142 genes × 142 genes) and randomly permuted the genotypes of these combinations 100 times. Totally, we had one million null examples.
  • Null data 3: we permuted the genotypes of each of the detected top 10 interactions in Figure 7 one million times, resulting in one million null examples for each combination.

We first show the results of permutation tests when we use Null data 1 and 2. Figure 8 shows the distribution of P-values of null examples, being located in the right side, for Null data 1 and 2. In this figure, the distribution of P-values for the top 10 000 interactions detected by FTGI is located in the left side. This figure shows that the red-colored distribution is clearly separated from the black-colored one, meaning that the detected three-way interactions have significantly small P-values. For Null data 3, we show the result, focusing on two cases (the top and the 10th interactions), since the trend of results was kept the same for all 10 interactions in Table 4. Figure 9 shows the distribution of P-values of null examples generated from the top interaction (or the 10th), with the P-value of the top (or the 10th) interaction by an arrow. This figure indicates that the P-value of the top (or the 10th) interaction is clearly distant from the P-value distribution of null examples, implying that P-values of the detected interactions are statistically significant.

Fig. 8.
Distributions (left side) of P-values of the top 10 000 interactions detected by FTGI, with those (right side) of Null data (a) 1 and (b) 2.
Fig. 9.
The P-values of the (a) top and (b) 10th interactions (shown by arrows) and the distributions of P-values of the corresponding null examples generated.

4.2.4 Validating detected interactions with GEO

To confirm the reliability of the interactions in Table 4, we tried to find, for each gene pair, the switching mechanism in expression which can be controlled by some experimental condition of gene expression. This is because if found, this directly means that the corresponding gene pair can be controlled by another categorical factor, such as genotypes of another gene.

For this purpose, we used GEO (version of June 1, 2009; Barrett et al., 2007), from which we found 2089 GDSs (gene datasets) which are annotated. Out of the 2089 datasets, we selected datasets which satisfy all the following four conditions for each gene pair in Table 4: (i) expression values of the corresponding gene pair are contained; (ii) the total number of experiments is ≥50; (iii) experimental conditions can be divided into two or more classes; and (iv) each class has 10 or more experiments. We then obtained 36 datasets.3 For each gene pair of the top 10 list, we conducted interaction test by using pairwise (binary) classes in each dataset and ranked them according to P-values of interaction test. Table 5 shows a list of datasets, each giving the lowest P-value for each gene pair of the 10 interactions in Table 4. For example, for COX6C and UBA1, the gene pair of the first interaction of Table 4, we found a switching mechanism in GDS2960_1 with the P-value of −3.9532, showing the statistical significance of this mechanism. This directly indicates that there must exist a switching mechanism in expression between these two genes under the alteration of experimental conditions which is specified by the annotation of GDS2960_1. In fact, Table 5 indicates that the switching mechanism happens between patients of Marfan syndrome and controls. This type of explanation is possible for all 10 GDSs in Table 5 by using annotations in this table. As well all P-values shown in Table 5 are small enough,4 supporting the reliability of the three-way interactions in Table 4 which our method detected. Furthermore, Figure 10 shows the real expression values of two genes, being categorized into two classes, for each GDS of Table 5. These orthogonal Contour plots also assist the reliability of three-way interactions that we detected in Table 4.

Fig. 10.
(aj) Expressions of two genes which give the smallest P-value of interaction test in the corresponding GDS of GEO.
Table 5.
Results of interaction test over the datasets from GEO

We further briefly checked the genes having SNPs in the first and the third interactions in Table 4: (i) the first interaction in Table 4 has two genes, COX6C and UBA1, which is controlled by a SNP in LARP4, i.e. La ribonucleoprotein domain family member 4. This gene was already known as an important gene in both AD and aging, being already pointed out that LARP4 increases expression with increasing AD progression and normal aging (Miller et al., 2008). As our focus was on 142 genes on neurodegenerative diseases including AD, the known function on LAPR4 is consistent enough with the interaction with COX6C and UBA1, being possibly in the switching mechanism. (ii) The gene with the SNP in the second interactions in Table 4 was a hypothetical one, but the third interaction has two genes, ATP5D and ITCH, being controlled by a SNP in RPS3AP5, which is a pseudogene of RPS3A, i.e. ribosomal protein S3A. This gene is known to be downregulated in the same manner as some genes in oxidative phosphorylation pathway (Welle et al., 2003), which includes ATP5D. Thus, these observations reveal the possibility that the third interaction also may exist as the switching mechanism in expression of two genes, i.e. ATP5D and ITCH.

Overall our extensive analysis has implied that the detected three-way interactions can exist. These results show the potential of our approach to explicate complex biological systems appearing in modern biology and medical sciences.

5 CONCLUDING REMARKS

We have presented a fast method for finding three-way gene interactions from transcript-and genotype-data and showed experimental results obtained by applying this method to ~3 × 108 human brain samples. In our experiments, we confirmed the three-way interactions that we found in various manners. Possible future work would be to apply our approach to various types of transcript- and genotype-data further to uncover three-way gene interactions, i.e. biological switches by genotypes.

Supplementary Material

[Supplementary Data]

ACKNOWLEDGEMENTS

The authors would like to thank anonymous reviewers for their helpful comments and advice.

Funding: Grant-in-Aid for Young Scientists 20700134, 20700269, 21680025 from Ministry of Education, Culture, Sports, Science and Technology (MEXT, in parts); the Functional RNA Project of New Energy and Industrial Technology Development Organization (NEDO, in parts).

Conflict of Interest: none declared.

Footnotes

1ANOVA can be applied only to the case with a single continuous response (phenotype) and one or more discrete explanatory variables (genotypes).

2The Supplementary Material shows annotations by Reactome (Vastrik et al., 2007) for interacting genes.

3In each GDS, if it has more than two classes or replicated experiments, we consider all possible pairwise combinations of them. We then name generated multiple datasets from one GDS (e.g. GDS2960) those like GDS2960_1, GDS2960_2, etc. This results in that the number of datasets we used could be >36. The actual number of datasets for each gene pair is shown in Table 5.

4For each gene pair, not only the dataset giving the top P-value but also 10 datasets providing the top 10 P-values are shown in the Supplementary Material. All P-values in the Supplement Material are small, showing the statistical significance of the switching mechanism of each gene pair.

REFERENCES

  • Ayer DE, Eisenman RN. A switch from myc:max to mad:max heterocomplexes accompanies monocyte/macrophage differentiation. Genes Dev. 1993;7:2110–2119. [PubMed]
  • Aylor DL, Zeng Z-B. From classical genetics to quantitative genetics to systems biolog: modern epistasis. PLoS Genet. 2008;4:e1000029. [PMC free article] [PubMed]
  • Balding DJ. A tutorial on statistical methods for population association studies. Nat. Genet. 2006;7:781–791. [PubMed]
  • Barrett T, et al. NCBI GEO: mining tens of millions of expression profiles.database and tools update. Nucleic Acids Res. 2007;35:D760–D765. [PMC free article] [PubMed]
  • Cordell HJ. Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans. Hum. Mol. Genet. 2002;11:2463–2468. [PubMed]
  • Dixon AL, et al. A genome-wide association study of global gene expression. Nat. Genet. 2007;39:1202–1207. [PubMed]
  • Hastie T, et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer; 2001.
  • Kanehisa M, et al. KEGG for linking genomes to life and the environment. Nucleic Acids Res. 2008;36:D480–D484. [PMC free article] [PubMed]
  • Li K-C, et al. A system for enhancing genome-wide coexpression dynamics study. Proc. Natl Acad. Sci. USA. 2004;101:15561–15566. [PMC free article] [PubMed]
  • Lo S-H, et al. Discovering interactions among brca1 and other candidate genes associated with sporadic breast cancer. Proc. Natl Acad. Sci. USA. 2008;105:12387–12392. [PMC free article] [PubMed]
  • Marchini J, et al. Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat. Genet. 2005;37:413–417. [PubMed]
  • Mardia KV, et al. Multivariate Analysis. New York: Academic Press; 1979.
  • McCarthy MI, Hirschhorn JN. Genome-wide association studies: Past, present and future. Hum. Mol. Genet. 2008;17(Review Issue 2):R100–R101. [PubMed]
  • McCullagh P, Nelder J. Generalized Linear Models. 2. Boca, Raton: Chapman & Hall CRC; 1989.
  • Miller JA, et al. A systems level analysis of transcriptional changes in Alzheimer's disease and normal aging. J. Neurosci. 2008;28:1410–1420. [PMC free article] [PubMed]
  • Myers AJ, et al. A survey of genetic human cortical gene expression. Nat. Genet. 2007;39:1494–1499. [PubMed]
  • Nica AC, Dermitzakis ET. Using gene expression to investigate the genetic basis of complex disorders. Hum. Mol. Genet. 2008;17(Review Issue 2):R129–R134. [PMC free article] [PubMed]
  • Schadt EE, et al. Mapping the genetic architecture of gene expression in human liver. PLoS Biol. 2008;6:e107. [PMC free article] [PubMed]
  • Vastrik I, et al. Reactome: a knowledge base of biologic pathways and processes. Genome Biol. 2007;8:R39. [PMC free article] [PubMed]
  • Welle S, et al. Gene expression profile of aging in human muscle. Physiol. Genomics. 2003;14:149–159. [PubMed]
  • Zhang J, et al. Extracting three-way gene interactions from microarray data. Bioinformatics. 2007;23:2903–2909. [PubMed]

Articles from Bioinformatics are provided here courtesy of Oxford University Press

Formats:

Related citations in PubMed

See reviews...See all...

Cited by other articles in PMC

See all...

Links

Recent Activity

Your browsing activity is empty.

Activity recording is turned off.

Turn recording back on

See more...