Bioinformatics. Jul 15, 2010; 26(14): 1752–1758.
Published online May 26, 2010. doi:  10.1093/bioinformatics/btq257
PMCID: PMC2894507

On safari to Random Jungle: a fast implementation of Random Forests for high-dimensional data

Abstract

Motivation: Genome-wide association (GWA) studies have proven to be a successful approach for helping unravel the genetic basis of complex genetic diseases. However, the identified associations are not well suited for disease prediction, and only a modest portion of the heritability can be explained for most diseases, such as Type 2 diabetes or Crohn's disease. This may partly be due to the low power of standard statistical approaches to detect gene–gene and gene–environment interactions when small marginal effects are present. A promising alternative is Random Forests, which have already been successfully applied in candidate gene analyses. Important single nucleotide polymorphisms are detected by permutation importance measures. Until now, applying Random Forests to GWA data has been highly cumbersome with existing implementations because of the high computational burden.

Results: Here, we present the new freely available software package Random Jungle (RJ), which facilitates the rapid analysis of GWA data. The program yields valid results and computes up to 159 times faster than the fastest alternative implementation, while still maintaining all options of other programs. Specifically, it offers the different permutation importance measures available. It includes new options such as the backward elimination method. We illustrate the application of RJ to a GWA of Crohn's disease. The most important single nucleotide polymorphisms (SNPs) validate recent findings in the literature and reveal potential interactions.

Availability: The RJ software package is freely available at http://www.randomjungle.org

Contact: inke.koenig@imbs.uni-luebeck.de; ziegler@imbs.uni-luebeck.de

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Genome-wide association (GWA) studies have become a standard approach for helping unravel the genetic basis of complex genetic diseases. The recent successes are tremendous, and a series of new loci have been identified using single marker analyses (McCarthy et al., 2008; Samani et al., 2007; Wellcome Trust Case Control Consortium, 2007). Unfortunately, only a small portion of the heritability was explained by corresponding single nucleotide polymorphisms (SNPs) for most diseases such as Type 2 diabetes (6%) or Crohn's disease (20%) (Manolio et al., 2009). Furthermore, SNPs identified by GWA studies for various diseases make poor classifiers (Jakobsdottir et al., 2009).

For overcoming such drawbacks and recognizing the complexity of the underlying biology, further mechanisms such as gene–gene interaction need to be taken into account (Moore et al., 2010). However, the discovery of gene–gene interactions using GWA studies remains challenging with traditional statistical approaches (Cordell, 2009; Moore et al., 2010). Given genotype data at different loci, an exhaustive search of interactions between all loci is the obvious way of testing interactions. Testing all two-locus interactions is computationally feasible although time demanding (Marchini et al., 2005). However, an exhaustive search of higher order interactions is computationally impractical because the number of tests increases exponentially with the order of interaction (Cordell, 2009).

One approach to deal with such large numbers of SNPs is to first perform univariate tests on each SNP, discard SNPs with high P-values and apply interaction methods, e.g. within logistic regressions, to SNP subsets afterwards (Hoh et al., 2000; Marchini et al., 2005). Unfortunately, such approaches may result in low power for SNP–SNP interactions with very small marginal effects.

Another concern is genetic heterogeneity, i.e. different subsets of genes affect the same disease, and traditional statistical methods show limitations when genetic heterogeneity is present (Province et al., 2001).

A promising alternative is Random Forests (RFs; Breiman, 2001). RF was applied successfully to genetic data in various studies (Bureau et al., 2005; Chang et al., 2008; Jiang et al., 2009; McKinney et al., 2009; Sun et al., 2007), and it is anticipated that RF will help to detect gene–gene interactions in genome-wide data (Moore et al., 2010). It has been shown that RF can be substantially more efficient than standard statistical methods in ranking the true disease-associated SNPs in order to detect SNP–SNP interaction (Lunetta et al., 2004). The method is able to detect SNPs with small effects and to deal with genetic heterogeneity because separate models are automatically fit to subsets of data defined by early splits in the tree (Lunetta et al., 2004; Province et al., 2001). In addition, RF is able to handle SNPs that are associated in a non-linear fashion.

The RF method is a specific data mining method. In data mining, in general, algorithms attempt to identify an unknown concept based on randomly chosen examples of the collected data. The aim is to find a prediction rule that correctly classifies new instances of the concept (Breiman, 2001). Thus, RF makes fewer assumptions about the functional form of the model than are required by standard statistical tests (McKinney et al., 2006).

A grown tree in a forest is often graphically represented by an upside down tree. Multiple paths lead through the tree from the root to different leaves via various nodes. Each node corresponds to a specific predictor variable. Thus, a path is a sequence of predictor variables (for details, see Section 2.1 and König et al., 2008). Such a predictor variable sequence includes potential interactions between them in terms of hierarchical dependencies (Cordell, 2009; Moore et al., 2010). Thus, the RF method allows for interactions between SNPs.

RF yields a classification result and a measure of the importance for each variable. Variable importance (VI) quantifies the impact of a SNP in predicting the response and may reflect a causal effect. In turn, it can be used to select the relevant SNPs from a GWA study (Ziegler et al., 2007).

Although appealing, RF has rarely been applied on the genome-wide level. In analogy to standard statistical approaches, this is due to the computational intensity and memory requirements (Zhang et al., 2009; Ziegler et al., 2007). The original RF implementation, termed RF in Fortran, by Breiman and Cutler (2004) was designed to analyze low-dimensional data, i.e. a low number of SNPs with a large number of observations, e.g. 100 SNPs and 10 000 observations. It has been successfully used, e.g. by Bureau et al. (2005) in a candidate gene case–control study involving 42 SNPs. However, it is inefficient in both computation and memory, so that no more than 10 000 SNPs can be analyzed on a standard machine within reasonable time and memory usage (Ziegler et al., 2007). Furthermore, the code is not user-friendly because the program has to be modified and compiled anew whenever a new dataset is used.

An alternative implementation is the randomForest package for the programming language R (R Development Core Team, 2009) by Liaw and Wiener (2002). It is user-friendly, and it has been often used in applications (Ziegler et al., 2007). The source code of the package randomForest consists of R, C and Fortran source code. Elementary subroutines were left in Fortran code. However, the same computational and memory limitations apply as to RF in Fortran.

One approach to overcome the memory issue is to split up the GWA data into small chunks, which are subsequently analyzed separately (Jiang et al., 2009; Schwarz et al., 2007). The results of all processed chunks are finally combined. Through this, main effects are detected, but one may fail to discover some important interaction effects due to data separation. Thus, these approaches do not overcome the restrictions in detecting complex interactions.

An alternative has recently been presented by Zhang et al. (2009). In their package Willows, they compress the GWA data internally and subsequently apply RF. However, it is slow for large values of the mtry parameter (see Section 2) as recommended for datasets with many noise variables such as GWA study data (Breiman and Cutler, 2004; Liaw and Wiener, 2002). Tuning mtry for optimizing the performance of the forest is also strongly recommended (Breiman and Cutler, 2004), which can hardly be done with Willows. Using this program can be computationally intensive, thus time demanding.

Here, a novel software package called Random Jungle (RJ) is presented, which has been specifically tailored for the large-scale analysis of GWA studies. This computational and memory efficient implementation of RF is able to analyze hundreds and thousands of samples and SNPs.

In the following, we first briefly introduce the RF methodology, including the growing procedure and essential features. Next, we describe the estimation of various VI measures. Specifically, we show differences between importance scores of randomForest and RF in Fortran. After these theoretical considerations, we describe the RJ software and demonstrate its superior computational performance when compared with other implementations. Finally, we illustrate its use with data from a GWA study on Crohn's disease.

2 METHODS

2.1 Random forests

An RF is an ensemble consisting of multiple classification and regression trees (CART), each grown without pruning on a bootstrap sample of the given data. In general, an ensemble is a group of classifiers in which each classifier is only required to perform slightly better than random guessing or coin flipping. This property is fulfilled by many base classifiers, such as CART (Breiman, 1996; Schapire, 1990). With a CART as base classifier, a sample is classified by taking the majority vote over all tree classifiers in a forest (Breiman, 2001). RF has been shown to provide good accuracy, robustness to noise, internal estimation of error, stable classifiers and VI (Breiman, 2001; Breiman and Cutler, 2004; Meng et al., 2009). The RF procedure takes the following steps (Breiman, 2001):

  1. Consider a dataset X, termed training data, consisting of one response variable and many predictor variables from N samples. The total count of predictor variables is M, with M being substantially larger than N.
  2. A bootstrap sample X* consisting of N samples is drawn with replacement from the original training data X. On average, one-third of all samples are left out due to the bootstrapping process. These samples are called ‘out-of-bag’ (OOB) data X\X*.
  3. A CART t is grown using the bootstrap dataset X*. The CART is constructed by recursively splitting data into distinct subsets, so that one parent node leads to two child nodes. For splitting data, an appropriate split rule has to be selected so that the subsets of each child node are purer than the subset of the corresponding parent node. The goodness of the split is defined to be the decrease in impurity: $\Delta i = i_{\text{parent}} - (p_{\text{left}} \cdot i_{\text{left}} + p_{\text{right}} \cdot i_{\text{right}})$. The proportions of samples in the left and right nodes are given by $p_{\text{left}}$ and $p_{\text{right}}$, respectively. The impurities $i_{\text{parent}}$, $i_{\text{left}}$ and $i_{\text{right}}$ of the parent node and of the left and right child nodes are determined by the Gini index, i.e. $i = 1 - \sum_j p(j)^2$, where $p(j)$ is the proportion of samples that are labeled with class j in that node. At each node, a random subset of all predictor variables is chosen without replacement to determine the best split. The size of the subset is given by the parameter mtry. Although different variables might be selected at each node to be tested, the number mtry is held constant during the procedure, and the default setting is $\text{mtry} = \lceil \sqrt{M} \rceil$, where $\lceil \cdot \rceil$ denotes the next larger integer.
  4. The tree t is grown to its largest extent, and no pruning proceeds. The final nodes are called terminal nodes.
  5. Steps 2 to 4 are repeated to grow a specific number of trees, and, for classification, the majority vote over all trees in the resulting forest is used.
  6. Finally, the OOB error fraction is calculated by classifying each sample of the OOB. Each observation is predicted by the trees for which it is an OOB observation. The prediction accuracy of the classifier is estimated by subtracting the OOB error fraction from its maximum, which is one. The prediction accuracy estimation method is a suitable surrogate for cross-validation (Breiman, 2001).
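The quantities used in the steps above can be illustrated with a minimal Python sketch. The function names are hypothetical and do not come from RJ or any other implementation, and the default mtry is assumed here to be the square root of M rounded up, which is a common choice for classification forests.

```python
import math
import random
from collections import Counter


def gini(labels):
    """Gini impurity i = 1 - sum_j p(j)^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Goodness of a split: i_parent - (p_left * i_left + p_right * i_right)."""
    p_left = len(left) / len(parent)
    p_right = len(right) / len(parent)
    return gini(parent) - (p_left * gini(left) + p_right * gini(right))


def default_mtry(n_predictors):
    """Assumed default subset size: square root of M, rounded up."""
    return math.ceil(math.sqrt(n_predictors))


def bootstrap_indices(n_samples, rng):
    """Draw N indices with replacement; return (in-bag, out-of-bag) indices."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


# A perfect split of a balanced two-class node removes all impurity.
parent = ["case"] * 4 + ["control"] * 4
left, right = parent[:4], parent[4:]
print(impurity_decrease(parent, left, right))  # 0.5

# The OOB fraction is expected to be close to (1 - 1/N)^N, about one-third.
in_bag, oob = bootstrap_indices(1000, random.Random(1))
print(len(oob) / 1000)
```

The OOB fraction printed at the end varies with the random seed but should stay near the one-third stated in step 2.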

A special feature of RF is the calculation of proximities between samples. For this, after the forest is grown, every subject is classified by each tree. Then, each pair of subjects is compared with regard to its final stopping point. That is, if both subjects are assigned to the same terminal node in a single tree of the forest, the proximity between them is increased by one. The proximity matrix is useful, e.g. for replacing missing data, imputing data, identifying outliers and finding class representative samples called prototypes.
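The proximity count can be sketched as follows. This is an illustration only: the input format `leaf_ids`, which holds the terminal node reached by each sample in each tree, is an assumption made for the example and not the data structure of any particular RF program.

```python
def proximity_matrix(leaf_ids):
    """Count, for every pair of samples, the trees in which both samples
    end in the same terminal node.

    leaf_ids[t][s] is the terminal node reached by sample s in tree t
    (hypothetical input format).
    """
    n_samples = len(leaf_ids[0])
    prox = [[0] * n_samples for _ in range(n_samples)]
    for tree_leaves in leaf_ids:
        for a in range(n_samples):
            for b in range(n_samples):
                if tree_leaves[a] == tree_leaves[b]:
                    prox[a][b] += 1
    return prox


# Two trees, three samples: samples 0 and 1 share a leaf in the first tree,
# samples 1 and 2 share a leaf in the second tree.
leaf_ids = [[0, 0, 1], [2, 3, 3]]
print(proximity_matrix(leaf_ids))  # [[2, 1, 0], [1, 2, 1], [0, 1, 2]]
```

Dividing each entry by the number of trees would give proximities between 0 and 1; whether a given program normalizes in this way is not stated here.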

RF can be turned into an unsupervised learning method. To initialize the process, the original dataset is considered as Class 1. A new synthetic dataset of the same number of samples and predictor variables is created and labeled as Class 2. This synthetic data is created by sampling at random without replacement from the univariate distributions of the original data. The original and the synthetic data are merged. The resulting artificial two-class dataset is analyzed by RF in order to produce sample proximities as described above. The 2D multidimensional scaling (MDS) technique (Cox and Cox, 2001) is subsequently applied to the proximity matrix. The method yields a 2D graphical representation of the underlying sample structure. To identify clusters in the sample structure, the graphical representation has to be investigated by standard clustering techniques, such as k-means clustering (Macqueen, 1967).
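The construction of the artificial two-class dataset can be sketched in Python as below. The sketch interprets sampling without replacement from the univariate distributions as an independent permutation of each predictor column; the function name and the list-of-rows data layout are assumptions for illustration.

```python
import random


def make_two_class_data(rows, rng):
    """Return the merged dataset and class labels (1 = original, 2 = synthetic).

    Each synthetic column is an independent permutation of the original
    column, so the marginal distributions are kept while the dependencies
    between predictor variables are broken.
    """
    n_cols = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_cols)]
    shuffled = [rng.sample(col, len(col)) for col in columns]
    synthetic = [[shuffled[j][i] for j in range(n_cols)] for i in range(len(rows))]
    data = [list(row) for row in rows] + synthetic
    labels = [1] * len(rows) + [2] * len(rows)
    return data, labels


genotypes = [[0, 1], [1, 2], [2, 0]]  # toy data: 3 samples, 2 SNPs
data, labels = make_two_class_data(genotypes, random.Random(0))
print(labels)  # [1, 1, 1, 2, 2, 2]
```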

A further feature is the computation of sample margins. A sample margin is the difference between the proportion of votes for the correct class and the maximum proportion of votes for any of the remaining classes. Sample margins range from −1 to 1. A high positive sample margin means a coherent and correct classification.
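As a small illustration, the margin of one sample can be computed from its vote counts. The dictionary input format is an assumption chosen for this example.

```python
def sample_margin(votes, true_class):
    """Proportion of votes for the true class minus the largest proportion
    of votes for any other class.

    votes maps each class label to the number of trees voting for it.
    """
    total = sum(votes.values())
    correct = votes.get(true_class, 0) / total
    others = [count / total for label, count in votes.items() if label != true_class]
    return correct - max(others, default=0.0)


# 400 of 500 trees vote for the true class "case".
print(sample_margin({"case": 400, "control": 100}, "case"))
```

The example yields a margin of about 0.6; swapping the true class to "control" would give about −0.6, i.e. a coherent but wrong classification.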

The standard RF methodology can be extended by a flexible backward elimination procedure. The procedure identifies small sets of variables that can achieve good predictive performance. To select a small subset of variables, RFs are fitted iteratively. Specifically, at each iteration step an RF is grown and the importance of each variable (see Section 2.2) for classification is calculated. Variables that yield small VI scores are subsequently discarded. The elimination procedure is stopped when the number of remaining variables falls below a specific threshold or when the OOB accuracy is maximized (Diaz-Uriarte and Alvarez de Andres, 2006).
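The iteration can be sketched as a loop around a forest-fitting routine. Everything below is a hypothetical illustration: `fit_forest` stands in for any routine that returns the OOB accuracy and the VI scores, and the parameters `drop_fraction` and `min_vars` are assumed names, not options of RJ.

```python
def backward_elimination(variables, fit_forest, drop_fraction=0.5, min_vars=2):
    """Iteratively discard the least important variables.

    fit_forest(variables) must return (oob_accuracy, {variable: importance}).
    Returns the variable set with the best OOB accuracy seen, and that accuracy.
    """
    current = list(variables)
    best_vars, best_acc = list(current), float("-inf")
    while True:
        acc, importance = fit_forest(current)
        if acc >= best_acc:  # on ties, prefer the smaller variable set
            best_vars, best_acc = list(current), acc
        n_keep = int(len(current) * (1 - drop_fraction))
        if n_keep < min_vars:
            break
        ranked = sorted(current, key=lambda v: importance[v], reverse=True)
        current = ranked[:n_keep]
    return best_vars, best_acc


def toy_fit(variables):
    """Toy stand-in for an RF fit: noise variables lower the accuracy."""
    importance = {"snp_a": 0.3, "snp_b": 0.2, "noise_1": 0.0, "noise_2": -0.01}
    n_noise = sum(1 for v in variables if v.startswith("noise"))
    return 0.8 - 0.05 * n_noise, {v: importance[v] for v in variables}


print(backward_elimination(["noise_1", "snp_a", "noise_2", "snp_b"], toy_fit))
# (['snp_a', 'snp_b'], 0.8)
```

Keeping the best set seen so far, rather than the last one, corresponds to the stopping rule based on maximal OOB accuracy; stopping only on the size threshold would be an equally valid reading.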

2.2 Importance

An essential standard feature of RF is that the importance of each predictor variable can be estimated. The RF approach provides two fundamentally different VI measures, the Gini importance and the permutation importance. The Gini importance of a predictor variable $X_i$ is the total decrease in impurity $\Delta I = \sum_k \Delta i_k$. It is obtained by adding up the impurity decreases $\Delta i_k$ of all nodes k in a forest at which the corresponding predictor variable was selected for splitting. The Gini importance has been shown to be biased when the number of categories differs between predictor variables (Archer and Kimes, 2008; Strobl et al., 2007). Moreover, drawing observations without replacement instead of bootstrapping yields a less biased VI (Strobl et al., 2007).
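The summation behind the Gini importance can be sketched in a few lines. The input format, one (variable, impurity decrease) pair per splitting node of the forest, is an assumption made for the example.

```python
from collections import defaultdict


def gini_importance(split_records):
    """Sum the impurity decreases of all nodes that split on each variable.

    split_records is an iterable of (variable, impurity_decrease) pairs,
    one pair per splitting node across all trees of the forest.
    """
    totals = defaultdict(float)
    for variable, decrease in split_records:
        totals[variable] += decrease
    return dict(totals)


records = [("snp1", 0.25), ("snp2", 0.125), ("snp1", 0.5)]
print(gini_importance(records))  # {'snp1': 0.75, 'snp2': 0.125}
```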

Another VI is the unscaled permutation importance, which is the mean decrease of accuracy for a predictor variable. This VI is calculated as follows: first, the prediction accuracy $A_t$ is estimated for each tree t in forest T using OOB samples; second, the values of the corresponding predictor variable are randomly permuted; third, the prediction accuracy $A_t^{*}$ is estimated using OOB samples again; and finally, the difference in accuracy, averaged over all trees in the forest, gives the unscaled permutation importance of the predictor variable

\[ \mathrm{VI}_{\text{unscaled}} = \frac{1}{|T|} \sum_{t \in T} \left( A_t - A_t^{*} \right) \]
(1)

The scaled permutation importance, often called z-score, is calculated by dividing the mean decrease of accuracy by its standard error over all trees in the RF

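\[ \mathrm{VI}_{\text{scaled}} = \frac{\mathrm{VI}_{\text{unscaled}}}{\widehat{\mathrm{SE}}\left( \mathrm{VI}_{\text{unscaled}} \right)} \]
(2)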
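Both permutation importance scores can be sketched from per-tree accuracies. Note the assumption: the standard error below uses the ordinary sample variance of the per-tree accuracy decreases, which is a generic choice and not necessarily the variance estimator of randomForest or of RF in Fortran.

```python
import math


def permutation_importance(acc_original, acc_permuted):
    """Return (unscaled importance, z-score) from per-tree OOB accuracies.

    acc_original[t] and acc_permuted[t] are the OOB accuracies of tree t
    before and after permuting the predictor variable of interest.
    """
    diffs = [a - b for a, b in zip(acc_original, acc_permuted)]
    n_trees = len(diffs)
    unscaled = sum(diffs) / n_trees
    # Generic sample variance of the per-tree decreases (an assumption).
    variance = sum((d - unscaled) ** 2 for d in diffs) / (n_trees - 1)
    std_error = math.sqrt(variance / n_trees)
    z_score = unscaled / std_error if std_error > 0 else float("nan")
    return unscaled, z_score


unscaled, z = permutation_importance([0.75, 0.75, 0.75, 0.75], [0.5, 0.25, 0.5, 0.25])
print(unscaled)  # 0.375
```

With a different variance estimator the unscaled score is unchanged, while the z-score changes, which is why scaled scores can differ between programs.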

The variance estimators differ between randomForest and RF in Fortran. As a result, both programs provide different scaled permutation importance scores. The estimator of randomForest is defined as

equation image
(3)

where $N_{\text{OOB},t}$ denotes the number of OOB samples of the current tree t. The variance estimator of RF in Fortran is defined as

equation image
(4)

VI measures as described above can be biased in the presence of correlated predictor variables such as SNPs in linkage disequilibrium (Meng et al., 2009; Nicodemus and Malley, 2009; Nicodemus et al., 2010; Strobl et al., 2008). Permuting a predictor variable using the usual permutation scheme disrupts a potential dependency structure between the permuted variable and the other predictor variables. The disruption entails an inflation of the importance value of the predictor variable when predictors are associated with the outcome (Nicodemus et al., 2010; Strobl et al., 2008).

The conditional VI (CVI) is an approach to solve this problem (Strobl et al., 2008). To preserve the dependency structure between a specific predictor variable and the other predictor variables, the predictor variable in question is permuted only within groups of observations. Group assignment is determined by analyzing the corresponding dependency structure as described in detail by Strobl et al. (2008). It has been shown that the CVI reflects the importance of correlated predictors more reliably than the usual importance measures (Strobl et al., 2008). Therefore, the CVI should be applied to data that contain correlated predictor variables such as SNPs in linkage disequilibrium.
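The core of the conditional scheme, permuting only within groups, can be sketched as follows. The group labels are taken as given input here; deriving them from the dependency structure, as described by Strobl et al. (2008), is not shown, and the function name is hypothetical.

```python
import random
from collections import defaultdict


def permute_within_groups(values, groups, rng):
    """Permute values only among observations that share a group label."""
    members = defaultdict(list)
    for index, group in enumerate(groups):
        members[group].append(index)
    permuted = list(values)
    for indices in members.values():
        shuffled = [values[i] for i in indices]
        rng.shuffle(shuffled)
        for i, value in zip(indices, shuffled):
            permuted[i] = value
    return permuted


genotype = [0, 0, 1, 2, 2, 1]
group = ["a", "a", "a", "b", "b", "b"]
print(permute_within_groups(genotype, group, random.Random(0)))
```

Because values never leave their group, the association between the permuted variable and the variables that define the groups is retained.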

3 IMPLEMENTATION

3.1 Random jungle

The novel software package RJ implements all features of the reference implementation randomForest such as various tuning parameters, prediction of new datasets using previously grown forests, sample proximities and imputation. Commonly used VI measures are implemented, such as Gini importance, permutation importance and conditional importance measures. The features of RJ are shown in Supplementary Table 1. RJ additionally implements backward elimination of variables. When multiple CPUs are available, RJ is able to perform RF on multiple CPUs simultaneously using multithreading and Message Passing Interface (MPI) parallelization.

RF in Fortran and randomForest grow ensembles of CARTs, but the RF method is not restricted to CART. Therefore, RJ provides a generalized framework for tree growing, which can be utilized to extend the set of tree types. RJ currently implements CART, and several further tree types such as conditional trees (Hothorn et al., 2006) are under construction and will be added to RJ in the future.

RJ is written in the C++ language, and the program structure fundamentally differs from the randomForest and RF in Fortran implementations. A comparison of importance scores, computing time and memory consumption across different implementations is given in Section 3.2.

The software is freely available on www.randomjungle.org, where a detailed documentation of RJ can be found.

3.2 Comparison of importance values

A simulation study was set up for comparing importance score ranks of RJ with the reference implementation randomForest. To this end, we used the simulated data for rheumatoid arthritis (RA) that were provided for the Genetic Analysis Workshop (GAW) 15 (Miller et al., 2007). Several loci contribute to the susceptibility, and nine major gene effects (Locus A–H and Locus DR) and three key covariate effects (smoking, age and sex) were simulated. The first replicate of the genome-wide SNP dataset, RA affection status and gender was utilized for the purpose of comparison. To mimic a case–control study, one affected sibling per affected pair for the cases and one unaffected sibling per control family for the controls were randomly selected. The dataset for application comprises 1500 cases and 2000 controls genotyped at 9187 SNPs. Ranks of Gini and unscaled permutation importance scores were investigated by applying RJ and randomForest to a subset of the GAW15 data. The subset comprised three informative predictor variables, i.e. sex, Locus C/DR (SNP6_153) and Locus D (SNP6_162), and three uninformative predictor variables (SNP2_394, SNP3_481, SNP1_98) that were randomly selected out of the set of all uninformative predictor variables (Miller et al., 2007). Each program was applied 500 times in order to capture the variation of importance ranks. Data were analyzed using default parameter settings, i.e. 500 trees and the default value of mtry. Finally, importance ranks of predictor variables yielded by randomForest and RJ were compared using boxplots. All applications were performed on computers running the SUSE Linux operating system with a 2.33 GHz Intel dual quad-core processor (8 CPUs) and 16 GB memory.

3.3 Performance and application to Crohn's disease

A real dataset was used to compare the different implementations and to find potential interactions.

The performance of RJ, randomForest, RF in Fortran and Willows was compared in terms of the computing time and memory consumption of each program. RJ was run in two different modes, namely in a single CPU mode and in a 40-CPU mode using multithreading and MPI. All applications were performed on computers running the SUSE Linux operating system with a 2.33 GHz Intel dual quad-core processor (8 CPUs) and 16 GB memory. In the 40-CPU mode, five processes were distributed among five computers. Each process ran on eight CPUs simultaneously using multithreading.

The real dataset is from a Crohn's disease GWA study, which has been described previously in detail (Duerr et al., 2006). In brief, data of 513 Crohn's disease affected Caucasian cases and 515 Caucasian controls were analyzed. The samples were genotyped on the Illumina HumanHap300 Genotyping BeadChip (317 503 SNPs). The GWA study was funded by NIDDK IBD Genetics Consortium. Samples were visualized using MDS plots and outlying persons were excluded from MDS clusters by visual inspection of two experienced experts, resulting in 1006 persons (501 cases and 505 controls). Sex was the only covariate in the analysis in addition to the SNPs.

SNPs with a call rate <0.98 per study group, a MAF <0.05 in the cases and controls combined or a P-value <0.0001 for deviation from Hardy–Weinberg expectations in the control group were excluded, resulting in 275 153 SNPs. The RJ software can handle missing data, i.e. imputing internally and analyzing data subsequently, but it is advised to impute data using standard imputing tools (Schwarz et al., 2009). Missing genotypes were imputed by the IMPUTE program (Marchini et al., 2007) using default parameters. Imputation uncertainty cannot be taken into account. Each implementation performed an RF analysis using the default forest size of 500 trees. To optimize the performance of the forest, the parameter mtry was tuned as recommended (Breiman and Cutler, 2004). The RF manual recommends choosing the mtry value that minimizes the OOB prediction error fraction. The parameter mtry was optimized by using several candidate values based on the formula $\text{mtry} = \lfloor M/20 \cdot k \rfloor$ for $k = 1, \ldots, 19$, where M is the number of predictor variables, i.e. the SNPs and sex. The results are shown in Supplementary Table 2. The minimal OOB prediction error was obtained for mtry = 247 638.
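The candidate grid for mtry can be reproduced with a short Python sketch, taking M as the 275 153 SNPs plus sex. The function name is hypothetical; integer arithmetic is used so that the floor is exact.

```python
def mtry_candidates(n_predictors, n_steps=20):
    """Candidate values floor(M / n_steps * k) for k = 1, ..., n_steps - 1."""
    return [(n_predictors * k) // n_steps for k in range(1, n_steps)]


M = 275_153 + 1  # SNPs plus the covariate sex
candidates = mtry_candidates(M)
print(len(candidates))  # 19
print(candidates[17])   # k = 18 gives 247638
print(candidates[1])    # k = 2 gives 27515
```

The value for k = 18 matches the optimum reported for 500 trees, and the value for k = 2 matches the mtry of 27 515 reported for the forest of 100 000 trees.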

To investigate the genetic relevance of SNPs and their interactions, the data were analyzed by RJ using a forest of 100 000 trees. The parameter mtry was optimized for 100 000 trees by comparing the different values shown in Supplementary Table 2. The optimal mtry was found to be 27 515. The CVI was calculated for each SNP. To compare the results with a standard univariate method, the 275 153 SNPs were also analyzed using the common trend test, which tests SNPs for association with a disease (Ziegler and König, 2010). The P-values and their ranks were compared with the results of the RJ analysis.
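For orientation, a trend test for a single SNP can be sketched as below. This is a textbook form of the Armitage trend test with additive weights (0, 1, 2) and a chi-square reference distribution with one degree of freedom; the exact variant used in the analysis is not specified in the text, so the details are assumptions.

```python
import math


def trend_test(cases, controls, weights=(0, 1, 2)):
    """Trend test on genotype counts; returns (chi-square, two-sided P-value).

    cases and controls hold the counts of the three genotypes
    (0, 1 or 2 copies of the minor allele). The SNP must be polymorphic,
    otherwise the variance term is zero.
    """
    totals = [r + s for r, s in zip(cases, controls)]
    n = sum(totals)
    n_cases = sum(cases)
    n_controls = n - n_cases
    sum_wr = sum(w * r for w, r in zip(weights, cases))
    sum_wn = sum(w * t for w, t in zip(weights, totals))
    sum_w2n = sum(w * w * t for w, t in zip(weights, totals))
    numerator = n * (n * sum_wr - n_cases * sum_wn) ** 2
    denominator = n_cases * n_controls * (n * sum_w2n - sum_wn ** 2)
    chi2 = numerator / denominator
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


chi2, p = trend_test(cases=[10, 20, 30], controls=[30, 20, 10])
print(chi2)  # 20.0
```

Applied to every SNP in turn, such a test yields the univariate P-values whose ranks can be set against the importance ranks of the forest.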

A network was created using the top 10 genes. The network was generated through the use of Ingenuity Pathways Analysis (Ingenuity Systems, www.ingenuity.com).

4 RESULTS

Comparison of importance scores shows that RJ and randomForest rank all variables in the same order. Results of Gini importance score and permutation importance scores investigation are shown in Supplementary Figure 1a, b and c. Both programs yield similar scores for all importance measures.

All implementations are able to handle the real dataset, but computing time and memory consumption differed substantially (Fig. 1). Specifically, the randomForest and RF in Fortran analyzed the dataset in 88.8 and 84.1 h, respectively, whereas RJ performed the same analysis in only 0.53 h using 40 CPUs in parallel. RF in Fortran is the fastest alternative tool. In comparison to RF in Fortran, RJ performed 159 times faster. With RJ running in a single CPU mode, a speed up of seven was still obtained. Willows required 1750 h for the analysis. RJ turned out to be the fastest program for real data analysis. The computing time of all implementations is depicted in Figure 1a.
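The reported speed-up factor follows directly from the computing times given above, as this short calculation shows. The dictionary keys are labels chosen for the example.

```python
# Computing times in hours as reported in the text.
hours = {
    "randomForest": 88.8,
    "RF in Fortran": 84.1,
    "Willows": 1750.0,
    "RJ (40 CPUs)": 0.53,
}

# Speed-up of RJ in the 40-CPU mode relative to each program.
speedup = {name: t / hours["RJ (40 CPUs)"] for name, t in hours.items()}
print(round(speedup["RF in Fortran"]))  # 159
```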

Fig. 1.
Comparison of computing time and memory usage of several RF implementations. Each program analyzed the Crohn's disease dataset comprising 1006 samples genotyped at 275 153 SNPs. A short bar indicates a fast implementation of RF in comparison to other programs: ...

The randomForest package and RF in Fortran consumed 9805 and 5421 MB memory, respectively (Fig. 1b). Considerably less memory was used by Willows, which consumed 136 MB memory. RJ used 179 MB on one CPU. When using multiple CPUs, RJ distributed five processes among five computers using the MPI mode. Each process consumed 303 MB and utilized eight CPUs by using multithreading. RJ consumed more memory in the multi-processor mode because auxiliary data structures have to be provided for every CPU.

The importance scores of the SNPs and their chromosomal positions are depicted in Figure 2. The two highest peaks are located on chromosomes 1 and 16, which correspond to genes IL23R and NOD2, respectively.

Fig. 2.
CVI scores of SNPs and their chromosomal position. The axis of CVI scores was log transformed. Small and negative CVI values were omitted.

A comparison of positive CVI scores and two-sided P-values from the Cochran–Armitage trend test is shown in Supplementary Figure 2. The smallest P-value among all SNPs with positive CVI scores is 2 × 10⁻⁸. The Pearson's correlation coefficient between scores and P-values is 0.38, showing a moderate association between importance scores and P-values. Corresponding negative CVI scores are shown in Supplementary Figure 3.

The 10 most important genes for the Crohn's disease data are displayed in Table 1. The genes were derived by evaluating the most important SNPs that are located within genes as shown in Supplementary Table 3. The first 10 unique genes were selected for further investigations.

Table 1.
Top 10 most important genes identified by RJ, which was performed on Crohn's disease GWA study data

The four most important SNPs yielded the smallest P-values in the trend test. The interleukin 23 receptor (IL23R) and nucleotide-binding oligomerization domain containing 2 (NOD2) were found to be the most important genes by RJ analysis. IL23R and NOD2 genes were also identified by several GWA studies and corresponding SNPs are known to be strongly associated with susceptibility to Crohn's disease (Barrett et al., 2008; Duerr et al., 2006; Rioux et al., 2007; Wellcome Trust Case Control Consortium, 2007).

A moderate association of calsyntenin 2 (CLSTN2) (P = 6 × 10⁻⁵) with Crohn's disease was reported in the literature (Rioux et al., 2007). The tumor necrosis factor superfamily member 10 (TNFSF10) was ranked highly by the RJ analysis. The traditional trend test of TNFSF10 yielded P = 0.001576. TNFSF10 is involved in apoptosis and proliferation of human colon cancer cells (Baader et al., 2005; Saaf et al., 2007; Tang et al., 2002; Tillman et al., 2003). Colorectal cancer and Crohn's disease are related because the relative risk of colorectal cancer is significantly raised in Crohn's disease (Canavan et al., 2006; Ekbom et al., 1990).

Five of the 10 most important genes can be linked to a potential pathway by consulting additional genes, proteins and transcripts. The potential pathway is shown in Figure 3. The IL23R and TNFSF10 potentially interact via the signal transducer and activator of transcription 3 (STAT3; Niu et al., 2001; Parham et al., 2002). TNFSF10 possibly interacts with PRKG1 via Sp1 transcription factor (SP1; Sellak et al., 2002; Xu et al., 2008). TNFSF10 contingently interacts with NOD2 via nuclear factor of kappa light polypeptide gene enhancer in B-cells 1 (NFKB1; Baetu et al., 2001; Gutierrez et al., 2002). Finally, NOD2 conceivably interacts with CDK5 regulatory subunit associated protein (CDKAL1) via hepatocyte nuclear factor 4 alpha (HNF4A, Odom et al., 2004; Rioux et al., 2007).

Fig. 3.
Five of the 10 most important genes can be combined to a potential pathway. TNFSF10 potentially interacts with IL23R, NOD2 and PRKG1. NOD2 conceivably interacts with CDKAL1.

5 DISCUSSION

The RF method was applied to genome-wide data using RJ and assigned a score of importance to each SNP. The resulting list of SNPs was investigated for potential interactions.

The results of the real data analysis validate findings of GWA studies such as NOD2 and IL23R. The results also provide evidence of new potential interactions between genes that are associated with Crohn's disease. Specifically, TNFSF10 was not found to be strongly associated with Crohn's disease by traditional statistical tests. In contrast, the RJ analysis detected that TNFSF10 potentially interacts with NOD2, PRKG1 and IL23R. STAT3 is considered to be a link between TNFSF10 and IL23R, which was shown to be moderately associated with Crohn's disease. TNFSF10 is involved in apoptosis of human colon cancer cells and might explain a part of the high risk of colorectal cancers in Crohn's disease patients.

However, interpreting results and assessing biological plausibility is challenging. Results may show false positives. Thus, further investigation is needed to validate this specific causal relationship. Furthermore, although RF has the ability to detect very small main effects, its power to identify interactions depends on the presence of main effects. Thus, gene–gene interaction with no marginal effects might be left unrevealed when RF is applied.

Nevertheless, the computational efficiency and memory management of RJ allow for analyzing high-dimensional data in an acceptable amount of time. Analyzing GWA data comprising thousands of observations and a million SNPs seems to be feasible with respect to time and memory consumption. The software computes up to 159 times faster than the fastest alternative implementation, and the program shows the same importance ranking as the reference program. RJ offers various features such as RF growing, prediction, parameter tuning and imputation.

In summary, RJ is a promising software package for applying the RF method to high-dimensional data such as GWA data. The application of RF to GWA data may help to identify potentially interacting SNPs that were not found by traditional statistical approaches.

Supplementary Material

[Supplementary Data]

ACKNOWLEDGMENTS

We thank the anonymous reviewers for their helpful suggestions and comments. The NIDDK IBDGC Crohn's Disease GWA Study was conducted by the NIDDK IBDGC Crohn's Disease GWA Study Investigators and supported by the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK). This manuscript was not prepared in collaboration with Investigators of the NIDDK IBDGC Crohn's Disease GWA Study and does not necessarily reflect the opinions or views of the NIDDK IBDGC Crohn's Disease GWA Study or the NIDDK.

Funding: DFG (KO 2250/3-1); intramural funding from Medical Faculty of the University at Lübeck (E32-2009, SPP2); NIH grants 5RO1-HL049609-14, 1R01-AG021917-01A1; University of Minnesota; Minnesota Supercomputing Institute; GAW grant, R01-GM031575; ENGAGE (grant agreement number 201413); Atherogenomics (grant agreement number 01GS0831).

Conflict of Interest: none declared.

REFERENCES

  • Archer KJ, Kimes RV. Empirical characterization of random forest variable importance measures. Comput. Stat. Data Anal. 2008;52:2249–2260.
  • Baader E, et al. Tumor necrosis factor-related apoptosis-inducing ligand-mediated proliferation of tumor cells with receptor-proximal apoptosis defects. Cancer Res. 2005;65:7888–7895. [PubMed]
  • Baetu TM, et al. Disruption of NF-kappaB signaling reveals a novel role for NF-kappaB in the regulation of TNF-related apoptosis-inducing ligand expression. J. Immunol. 2001;167:3164–3173. [PubMed]
  • Barrett JC, et al. Genome-wide association defines more than 30 distinct susceptibility loci for Crohn's disease. Nat. Genet. 2008;40:955–962. [PMC free article] [PubMed]
  • Breiman L. Bagging predictors. Mach. Learn. 1996;24:123–140.
  • Breiman L. Random Forests. Mach. Learn. 2001;45:5–32.
  • Breiman L, Cutler A. Random Forests 5.1. 2004 Available at http://www.stat.berkeley.edu/~breiman/RandomForests/cc_software.htm (last accessed date April 16, 2010)
  • Bureau A, et al. Identifying SNPs predictive of phenotype using random forests. Genet. Epidemiol. 2005;28:171–182. [PubMed]
  • Canavan C, et al. Meta-analysis: colorectal and small bowel cancer risk in patients with Crohn's disease. Aliment Pharmacol. Ther. 2006;23:1097–1104. [PubMed]
  • Chang JS, et al. Pathway analysis of single-nucleotide polymorphisms potentially associated with glioblastoma multiforme susceptibility using random forests. Cancer Epidemiol. Biomarkers Prev. 2008;17:1368–1373. [PubMed]
  • Cordell HJ. Genome-wide association studies: detecting gene-gene interactions that underlie human diseases. Nat. Rev. Genet. 2009;10:392–404. [PMC free article] [PubMed]
  • Cox TF, Cox M.AA. Multidimensional Scaling. Monographs on Statistics and Applied Probability. Boca Raton: Chapman & Hall/CRC; 2001.
  • Diaz-Uriarte R, Alvarez de Andres S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics. 2006;7:3. [PMC free article] [PubMed]
  • Duerr RH, et al. A genome-wide association study identifies IL23R as an inflammatory bowel disease gene. Science. 2006;314:1461–1463. [PubMed]
  • Ekbom A, et al. Increased risk of large-bowel cancer in Crohn's disease with colonic involvement. Lancet. 1990;336:357–359. [PubMed]
  • Gutierrez O, et al. Induction of NOD2 in myelomonocytic and intestinal epithelial cells via nuclear factor-kappaB activation. J. Biol. Chem. 2002;277:41701–41705. [PubMed]
  • Hoh J, et al. Selecting SNPs in two-stage analysis of disease association data: a model-free approach. Ann. Hum. Genet. 2000;64:413–417. [PubMed]
  • Hothorn T, et al. Unbiased recursive partitioning. J. Comput. Graph. Stat. 2006;15:651–674.
  • Jakobsdottir J, et al. Interpretation of genetic association studies: markers with replicated highly significant odds ratios may be poor classifiers. PLoS Genet. 2009;5:e1000337. [PMC free article] [PubMed]
  • Jiang R, et al. A random forest approach to the detection of epistatic interactions in case-control studies. BMC Bioinformatics. 2009;10(Suppl. 1):S65. [PMC free article] [PubMed]
  • König IR, et al. Patient-centered yes/no prognosis using learning machines. Int. J. Data Min. Bioinform. 2008;2:289–341. [PMC free article] [PubMed]
  • Liaw A, Wiener M. Classification and Regression by randomForest. R News. 2002;2:18–22.
  • Lunetta KL, et al. Screening large-scale association study data: exploiting interactions using random forests. BMC Genet. 2004;5:32. [PMC free article] [PubMed]
  • Macqueen JB. Proceedings of the 5th Berkeley Symposium on Mathamatical Statistics and Probability. Berkeley and Los Angeles, California: University of California Press; 1967. Some methods of classification and analysis of multivariate observations; pp. 281–297.
  • Manolio TA, et al. Finding the missing heritability of complex diseases. Nature. 2009;461:747–753. [PMC free article] [PubMed]
  • Marchini J, et al. Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat. Genet. 2005;37:413–417. [PubMed]
  • Marchini J, et al. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat. Genet. 2007;39:906–913. [PubMed]
  • McCarthy MI, et al. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat. Rev. Genet. 2008;9:356–369. [PubMed]
  • McKinney BA, et al. Machine learning for detecting gene-gene interactions: a review. Appl. Bioinformatics. 2006;5:77–88. [PMC free article] [PubMed]
  • McKinney BA, et al. Capturing the spectrum of interaction effects in genetic association studies by simulated evaporative cooling network analysis. PLoS Genet. 2009;5:e1000432. [PMC free article] [PubMed]
  • Meng Y, et al. Performance of random forest when SNPs are in linkage disequilibrium. BMC Bioinformatics. 2009;10:78. [PMC free article] [PubMed]
  • Miller MB, et al. Genetic Analysis Workshop 15: simulation of a complex genetic model for rheumatoid arthritis in nuclear families including a dense SNP map with linkage disequilibrium between marker loci and trait loci. BMC Proc. 2007;1(Suppl. 1):S4. [PMC free article] [PubMed]
  • Moore JH, et al. Bioinformatics challenges for genome-wide association studies. Bioinformatics. 2010;26:445–455. [PMC free article] [PubMed]
  • Nicodemus KK, Malley JD. Predictor correlation impacts machine learning algorithms: implications for genomic studies. Bioinformatics. 2009;25:1884–1890. [PubMed]
  • Nicodemus KK, et al. The behaviour of random forest permutation-based variable importance measures under predictor correlation. BMC Bioinformatics. 2010;11:110. [PMC free article] [PubMed]
  • Niu G, et al. Overexpression of a dominant-negative signal transducer and activator of transcription 3 variant in tumor cells leads to production of soluble factors that induce apoptosis and cell cycle arrest. Cancer Res. 2001;61:3276–3280. [PubMed]
  • Odom DT, et al. Control of pancreas and liver gene expression by HNF transcription factors. Science. 2004;303:1378–1381. [PMC free article] [PubMed]
  • Parham C, et al. A receptor for the heterodimeric cytokine IL-23 is composed of IL-12Rbeta1 and a novel cytokine receptor subunit, IL-23R. J. Immunol. 2002;168:5699–5708. [PubMed]
  • Province MA, et al. Classification methods for confronting heterogeneity. Adv. Genet. 2001;42:273–286. [PubMed]
  • R Development Core Team. R: a language and environment for statistical computing. 2009 Available at http://www.r-project.org (last accessed date April 16, 2010)
  • Rioux JD, et al. Genome-wide association study identifies new susceptibility loci for Crohn disease and implicates autophagy in disease pathogenesis. Nat. Genet. 2007;39:596–604. [PMC free article] [PubMed]
  • Saaf AM, et al. Parallels between global transcriptional programs of polarizing Caco-2 intestinal epithelial cells in vitro and gene expression programs in normal colon and colon cancer. Mol. Biol. Cell. 2007;18:4245–4260. [PMC free article] [PubMed]
  • Samani NJ, et al. Genomewide association analysis of coronary artery disease. N. Engl J. Med. 2007;357:443–453. [PMC free article] [PubMed]
  • Schapire RE. The strength of weak learnability. Mach. Learn. 1990;5:197–227.
  • Schwarz DF, et al. Picking single-nucleotide polymorphisms in forests. BMC Proc. 2007;1(Suppl. 1):S59. [PMC free article] [PubMed]
  • Schwarz DF, et al. Evaluation of single-nucleotide polymorphism imputation using random forests. BMC Proc. 2009;3:S65. [PMC free article] [PubMed]
  • Sellak H, et al. Sp1 transcription factor as a molecular target for nitric oxide- and cyclic nucleotide-mediated suppression of cGMP-dependent protein kinase-Ialpha expression in vascular smooth muscle cells. Circ. Res. 2002;90:405–412. [PubMed]
  • Strobl C, et al. Conditional variable importance for random forests. BMC Bioinformatics. 2008;9:307. [PMC free article] [PubMed]
  • Strobl C, et al. Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics. 2007;8:25. [PMC free article] [PubMed]
  • Sun YV, et al. Classification of rheumatoid arthritis status with candidate gene and genome-wide single-nucleotide polymorphisms using random forests. BMC Proc. 2007;1(Suppl. 1):S62. [PMC free article] [PubMed]
  • Tang X, et al. Cyclooxygenase-2 overexpression inhibits death receptor 5 expression and confers resistance to tumor necrosis factor-related apoptosis-inducing ligand-induced apoptosis in human colon cancer cells. Cancer Res. 2002;62:4903–4908. [PubMed]
  • Tillman DM, et al. Rottlerin sensitizes colon carcinoma cells to tumor necrosis factor-related apoptosis-inducing ligand-induced apoptosis via uncoupling of the mitochondria independent of protein kinase C. Cancer Res. 2003;63:5118–5125. [PubMed]
  • Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. [PMC free article] [PubMed]
  • Xu J, et al. Sp1-mediated TRAIL induction in chemosensitization. Cancer Res. 2008;68:6718–6726. [PMC free article] [PubMed]
  • Zhang H, et al. Willows: a memory efficient tree and forest construction package. BMC Bioinformatics. 2009;10:130. [PMC free article] [PubMed]
  • Ziegler A, et al. Data mining, neural nets, trees–problems 2 and 3 of Genetic Analysis Workshop 15. Genet. Epidemiol. 2007;31(Suppl. 1):S51–S60. [PubMed]
  • Ziegler A, König IR. A Statistical Approach to Genetic Epidemiology: Concepts and Applications. Weinheim: Wiley-VCH; 2010.

Articles from Bioinformatics are provided here courtesy of Oxford University Press