
# Ranking causal variants and associated regions in genome-wide association studies by the support vector machine and random forest

^{1}Department of Computer Science, New Jersey Institute of Technology, Newark, NJ,

^{2}Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, PA,

^{3}Psychiatry & the Behavioral Sciences, Zilkha Neurogenetic Institute, University of Southern California, CA and

^{4}Center for Applied Genomics, The Childrens Hospital of Philadelphia, Philadelphia, PA, USA

## Abstract

We study the number of causal variants and associated regions identified by top SNPs in rankings given by the popular 1 df chi-squared statistic, support vector machine (SVM) and the random forest (RF) on simulated and real data. If we apply the SVM and RF to the top 2*r* chi-square-ranked SNPs, where *r* is the number of SNPs with *P*-values within the Bonferroni correction, we find that both improve the ranks of causal variants and associated regions and achieve higher power on simulated data. These improvements, however, as well as stability of the SVM and RF rankings, progressively decrease as the cutoff increases to 5*r* and 10*r*. As applications we compare the ranks of previously replicated SNPs in real data, associated regions in type 1 diabetes, as provided by the Type 1 Diabetes Consortium, and disease risk prediction accuracies as given by top ranked SNPs by the three methods. Software and webserver are available at http://svmsnps.njit.edu.

## INTRODUCTION

Genome-wide association studies aim to identify genetic variants associated with disease, drug response and various phenotypes (1). The standard method of ranking SNPs from genome-wide association studies is the one or two degree of freedom chi-squared test (2).

Previous studies have examined the performance of the chi-squared statistic in ranking SNPs (3), proposed techniques to improve the rankings under two-stage designs (4), and to correct for overestimated significance values and apply the false discovery rate control method thereafter (5). Other approaches instead of chi-square have also been proposed for ranking SNPs. These include the trend test (6), Bayes factors (1), random forests (RFs) (7–9), support vector machine (SVM; 10), L1 penalized logistic regression (11,12) and a hidden Markov model method (13).

Chi-square-based rankings have been found similar to other univariate tests such as Bayes factors and likelihood ratios (1,3). Our experiments with information theoretic methods (14) and the MAX-rank trend test (6) also show strong similarity to chi-square-based rankings on simulated data.

In this article, we study the number of causal variants and associated regions identified by top SNPs in rankings given by the popular 1 df chi-squared statistic and two popular multivariate feature selection methods: the SVM (15) and the RF method (16). Both have been studied extensively for the problem of gene selection from microarray data (17–20). While the SVM and RF have previously been applied to genome-wide association studies (7–10), here we explicitly study their performance in ranking causal SNPs and those from associated regions and their performance as a function of the input as explained below.

We apply each of the two methods to the top *kr* chi-square-ranked SNPs where *r* is the number of SNPs with *P*-values at most 0.05 divided by the total number of SNPs. This corrected *P*-value threshold is also known as the Bonferroni correction (21) for multiple hypothesis and is a common cutoff in genome-wide association studies (22,23). We show that the SVM(2*r*) method, which is basically the SVM method applied to the top 2*r* chi-square-ranked SNPs, and RF(2*r*) contain more causal variants and those from associated regions compared to chi-square when we examine the top-ranked SNPs at the Bonferroni threshold. The SVM performs the best followed by RF and chi-square. However, at the 5*r* and 10*r* thresholds the improvement is less, if at all, in both methods. We also show that SVM(2*r*) has the highest power followed by RF(2*r*) and chi-square, but this progressively decreases at the 5*r* and 10*r* threshold.

As applications on real data, we show that both SVM(2*r*) and RF(2*r*) improve the ranks of previously replicated SNPs on Wellcome Trust Case Control Consortium (WTCCC) (1) genome-wide studies and identify more known associated type 1 diabetes regions—as given by the Type 1 Diabetes Consortium (24)—than chi-square at the Bonferroni cutoff. We also show that top ranked SVM(2*r*) SNPs achieve the highest AUC for type 1 diabetes, arthritis and simulated data disease risk prediction in testing across independent cohorts and in cross-validation studies.

In the rest of the article, we provide brief descriptions of the SVM and RF, followed by the real and simulated data used in our study. We then present detailed experimental results including an empirical power study to compare the three methods followed by results on disease risk prediction. Software and data to reproduce the results in this article along with a webserver are available at http://svmsnps.njit.edu.

## METHODS

Here, we describe the SVM and RF methods along with their implementations used in this study. The input to each method is a case-control study restricted to the top *kr* chi-square-ranked SNPs, where *r* is the number of SNPs with *P*-values within the Bonferroni correction (0.05 divided by the total number of SNPs *m*) and *k* is an integer ≥1. Before applying each method, we encode each genotype in the real data as the number of copies of the major allele. We use our own implementation of the one degree of freedom chi-squared statistic in C, which we provide freely.
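The input construction described above can be sketched in a few lines of Python. This is an illustrative sketch, not the C implementation used in the study: it assumes the 1 df statistic is the allelic test on the 2×2 table of allele counts, and the function names `allelic_chi2`, `chi2_pvalue_1df` and `top_kr` are hypothetical.

```python
import math

def allelic_chi2(case_counts, control_counts):
    """1-df chi-squared statistic on the 2x2 allele-count table (assumed form).

    Each argument is (n_AA, n_Aa, n_aa): genotype counts for one SNP.
    """
    # Each homozygote carries two copies of an allele, each heterozygote one.
    a = 2 * case_counts[0] + case_counts[1]        # major alleles in cases
    b = 2 * case_counts[2] + case_counts[1]        # minor alleles in cases
    c = 2 * control_counts[0] + control_counts[1]  # major alleles in controls
    d = 2 * control_counts[2] + control_counts[1]  # minor alleles in controls
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # monomorphic SNP: no evidence of association
    return n * (a * d - b * c) ** 2 / denom

def chi2_pvalue_1df(stat):
    """Upper-tail probability of a chi-squared variable with 1 df."""
    return math.erfc(math.sqrt(stat / 2.0))

def top_kr(pvalues, k, alpha=0.05):
    """Return (r, indices of the top k*r SNPs by ascending P-value)."""
    m = len(pvalues)
    r = sum(1 for p in pvalues if p <= alpha / m)  # SNPs within Bonferroni
    order = sorted(range(m), key=lambda j: pvalues[j])
    return r, order[: k * r]
```

When *r* is 0 the returned index list is empty, which mirrors the datasets that have to be excluded when no SNP passes the Bonferroni threshold.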

### Support vector machine

The SVM is the optimally separating hyperplane between two sets of points, each belonging to a different class. The sign of the SVM discriminant *w*^{T}*x*+*w*_{0} determines the class of input *x*, and the distance of *x* to the hyperplane is given by |*w*^{T}*x*+*w*_{0}|/||*w*|| (25). The SVM can be represented by the vector *w* and scalar *w*_{0} that minimize (1/2)||*w*||^{2} + *C*∑_{i} max(0, 1−*y*_{i}(*w*^{T}*x*_{i}+*w*_{0})), where *x*_{i} is the genotype vector of the *i*-th individual, *y*_{i} is an integer specifying case (+1) or control (−1), max(0, 1−*y*_{i}(*w*^{T}*x*_{i}+*w*_{0})) is the hinge loss function, and *C* is the loss-complexity tradeoff parameter. The solution *w* and *w*_{0} is obtained by applying Lagrange multipliers to obtain the dual problem, which is a quadratic program and can be solved by standard methods (see 25,26 for details). In the SVM optimization criterion, the term max(0, 1−*y*_{i}(*w*^{T}*x*_{i}+*w*_{0})) measures how well the discriminant *w*, which is basically our model, fits the training data, and ||*w*||^{2} measures the complexity of the model. We set the parameter *C* to its default value of [(1/*n*)∑_{i} ||*x*_{i}||^{2}]^{−1}, where *i* loops over all *n* subjects in the training set, as provided in the SVM-light software package. We obtain a SNP ranking from the SVM discriminant *w* as described below.
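To make the optimization criterion concrete, the following Python sketch evaluates it for a given *w* and *w*_{0}. It assumes the standard soft-margin form with a factor of 1/2 on ||*w*||^{2}, and it takes the default *C* to be the inverse of the mean squared norm of the training vectors, as documented for SVM-light. The function names are hypothetical, and no optimization is performed here.

```python
def default_C(X):
    """Inverse of the mean squared norm of the rows of X (SVM-light style default)."""
    n = len(X)
    mean_sq_norm = sum(sum(v * v for v in x) for x in X) / n
    return 1.0 / mean_sq_norm

def svm_objective(w, w0, X, y, C):
    """0.5 * ||w||^2 + C * (sum of hinge losses over all subjects)."""
    complexity = 0.5 * sum(wj * wj for wj in w)
    loss = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + w0)
        loss += max(0.0, 1.0 - margin)  # hinge loss of one subject
    return complexity + C * loss
```

Because every genotype vector gains entries as more SNPs are included, `default_C` shrinks as the input grows, which matters when the SVM is applied to many SNPs at once.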

#### Obtaining a SNP ranking from the SVM discriminant vector *w*

We can obtain an ordering of the SNPs using the absolute values of the entries of *w*. The input to the discriminant is the set of top *kr* chi-square-ranked SNPs *s*_{1}, *s*_{2},…,*s*_{kr}. The *i*-th entry of *w* represents the weight of the *i*-th SNP in the input. Let *w*=(*w*_{1},…,*w*_{kr}) and |*w*|=(|*w*_{1}|,…,|*w*_{kr}|). Now consider the entries of |*w*| sorted in descending order. We denote this ordering by the vector *p*=(*p*_{1},…,*p*_{kr}) such that |*w*_{p_{1}}| ≥ |*w*_{p_{2}}| ≥ … ≥ |*w*_{p_{kr}}|. Using *p* we obtain an ordering on the input SNP identifiers, which gives us the SNP ranking.
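The ranking step amounts to a sort on absolute weights. A minimal Python sketch, with a hypothetical function name and made-up SNP identifiers:

```python
def rank_snps_by_weight(w, snp_ids):
    """Order SNP identifiers by decreasing absolute SVM weight."""
    # p[0] is the index of the largest |w_j|, p[1] the next largest, and so on.
    p = sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)
    return [snp_ids[j] for j in p]

# Illustrative weights only; the sign of a weight does not affect the ranking.
ranking = rank_snps_by_weight([0.1, -0.7, 0.3], ["rs1", "rs2", "rs3"])
```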

#### Implementation

We use the popular and freely available *SVM-light* SVM implementation (27). We run it with the linear kernel and all other parameters set to their default values.

### Random forest

A classification tree is built by the recursive partitioning method (25). At each step the feature, which is a SNP in this study, that gives the largest reduction in impurity, usually measured by entropy or the Gini index, is selected and the node is split into *k* children, where *k* is the number of values the SNP can take. This process is repeated until all nodes are pure, meaning that all sets of decisions leading to that node result in the same class. In an RF (16), several classification trees are created, each by drawing *n* subjects with replacement from the original data, where *n* is the total number of case and control subjects. The SNPs are then ranked by a classification-based variable importance index that considers interactions between the SNPs (28).
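As a small illustration of the two ingredients named above, the Python sketch below scores a candidate SNP by how much a split on its genotype values lowers the Gini index, and draws one bootstrap sample of subjects. This is a generic sketch under common conventions, not the procedure implemented in the willows package; all function names are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels (0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(genotypes, labels):
    """Gini decrease from splitting on one SNP, one child per genotype value."""
    n = len(labels)
    children = {}
    for g, y in zip(genotypes, labels):
        children.setdefault(g, []).append(y)
    weighted = sum(len(ys) / n * gini(ys) for ys in children.values())
    return gini(labels) - weighted

def bootstrap_indices(n, rng):
    """Draw n subject indices with replacement for one tree."""
    return [rng.randrange(n) for _ in range(n)]
```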

#### Implementation

We use the freely available willows software package (28) for generating random forests and obtaining variable importance indices. We set the number of trees to 10000 and use the default values for other parameters as provided by the program.

### Datasets

#### Real data

We use real data from two sources: the Wellcome Trust Case Control Consortium (WTCCC) and The Genetics of Kidneys in Diabetes (GoKinD). The WTCCC provides two sets of controls and one set of cases each for type 1 diabetes, rheumatoid arthritis, Crohn's disease and type 2 diabetes (1), and GoKinD provides a type 1 diabetes case set (29,30). The WTCCC also provides case subjects for bipolar disorder, hypertension and coronary artery disease. However, we omit them from this study for two reasons. First, far fewer replicated SNPs are catalogued for them than for the other four. In Ref. (31), from which we obtain these SNPs, there are just three listed for bipolar disorder, one for coronary artery disease and none for hypertension. Second, none of the listed SNPs or those in linkage disequilibrium with them are captured by twice or even five times the number of top chi-square-ranked SNPs within the Bonferroni correction for bipolar disorder, whereas for coronary artery disease the single previously replicated SNP is already ranked as the top one by chi-square.

We follow the standard protocol of removing SNPs with >1% missing entries and those that deviate from Hardy–Weinberg equilibrium with *P*-values below 5×10^{−7} (1). See Supplementary Table S6 for more details and the number of subjects and SNPs in each dataset.
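The two filters can be sketched as follows. This Python sketch assumes a 1 df chi-squared goodness-of-fit test for Hardy–Weinberg equilibrium; the original protocol may use an exact test instead, and the function names and the use of `None` for a missing genotype are hypothetical conventions.

```python
import math

def hwe_pvalue(n_AA, n_Aa, n_aa):
    """P-value of a 1-df chi-squared test for Hardy-Weinberg equilibrium."""
    n = n_AA + n_Aa + n_aa
    if n == 0:
        return 1.0
    p = (2 * n_AA + n_Aa) / (2.0 * n)  # frequency of allele A
    q = 1.0 - p
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    if min(expected) == 0:
        return 1.0  # monomorphic SNP: the test is not informative
    observed = [n_AA, n_Aa, n_aa]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return math.erfc(math.sqrt(stat / 2.0))

def passes_qc(genotypes, max_missing=0.01, hwe_threshold=5e-7):
    """genotypes: list of 0/1/2 allele counts, with None for a missing call."""
    missing = sum(1 for g in genotypes if g is None) / len(genotypes)
    if missing > max_missing:
        return False
    called = [g for g in genotypes if g is not None]
    counts = [called.count(v) for v in (2, 1, 0)]
    return hwe_pvalue(*counts) >= hwe_threshold
```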

#### Simulated data

The GWAsimulator program (32) produces case and control genome-wide SNP genotypes under a logistic regression disease model. It takes as input a control file that specifies various parameters such as relative risk and sample size and phased genotype data, and simulates SNP genotypes with the same linkage disequilibrium structure as the input genotype data. It outputs data in a numerical format as the number of copies of the causal allele. We use the HapMap CEU phased genotypes provided with the software package as input. These genotypes were produced by the Illumina HumanHap300 SNP chip. The program generates one causal SNP on a specified position of a chromosome and then simulates remaining SNPs according to a moving window algorithm (33).

We simulated data across several different parameters from the following sets of control files. In each case, each causal SNP follows a multiplicative model. This means that if λ is the relative risk for one copy of the risk allele, then λ^{2} is the relative risk for two copies of that allele. Except for the power study, we generate one simulated study per control file.

- General performance on different relative risks: 50 control files each for relative risk 1.25, 1.5 and 2. Each control file contains 15 randomly selected SNPs as causal, one on each of chromosomes 1 through 15, each with a specified relative risk. The disease prevalence is set to 0.01, and case and control sample sizes each to 1000. We simulate 1000 SNPs on either side of each causal one, which adds up to a total of approximately 30000 SNPs per dataset.
- Performance as a function of sample size: 50 control files of relative risk 1.25 and two additional case and control sizes of 2000 and 4000. Remaining parameters same as above.
- Performance on low causal allele frequencies: 10 control files each for relative risk 1.25 and 1.5 and two case and control sizes of 2000 and 4000, and causal allele frequencies of at most 5%. Other parameters as above in the general performance setting.
- Power study: First five control files for relative risks 1.25, 1.5 and 2. We simulated 50 studies for each control file thereby giving a total of 250 simulated studies for each relative risk setting. Remaining parameters same as the general performance setting.
- Disease risk prediction: 50 control files each for relative risk 1.25, 1.5 and 2 and same settings as the general performance case except that 100 case and 100 control subjects are generated instead of 1000 each.
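The multiplicative model used in all of these settings can be illustrated with a short Python sketch. It is a simplification that works directly with relative risks and assumes Hardy–Weinberg genotype frequencies; it is not the logistic regression procedure that GWAsimulator itself applies, and the function names are hypothetical.

```python
def genotype_relative_risks(lam):
    """Relative risks for 0, 1 and 2 copies of the risk allele."""
    return [1.0, lam, lam ** 2]

def baseline_risk(prevalence, risk_allele_freq, lam):
    """Risk for zero copies such that the population prevalence is matched."""
    q = risk_allele_freq
    # Genotype frequencies for 0, 1 and 2 risk alleles under Hardy-Weinberg.
    geno_freqs = [(1 - q) ** 2, 2 * q * (1 - q), q ** 2]
    mean_rr = sum(f * rr for f, rr in zip(geno_freqs, genotype_relative_risks(lam)))
    return prevalence / mean_rr
```

With 15 causal SNPs and 1000 simulated SNPs on either side of each, a dataset holds 15 × 2001 = 30015 SNPs, in line with the approximate total of 30000 stated above.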

We provide all simulated studies, input control files and HapMap CEU phased genotypes to the GWAsimulator program at http://www.cs.njit.edu/usman/SVMSNPs.

## RESULTS

We are interested in measuring the number of causal variants and associated regions identified by top SNPs in a given ranking. In simulated data, we define an associated region as the set of SNPs in linkage disequilibrium (34) with the causal one, that is, the SNPs whose squared correlation coefficient with the causal SNP is at least 0.05, which is a standard threshold for defining associated regions (12,35). In order to simulate a scenario where causal variants are not necessarily genotyped, we make a copy of each simulated study without the causal SNPs and then compute chi-square, SVM and RF rankings. We also compute rankings on the original studies with the causal SNPs.
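A region defined this way can be computed directly from genotype vectors. The Python sketch below uses the squared Pearson correlation of allele counts as the linkage disequilibrium measure, which is an assumption (a haplotype-based *r*^{2} could be used instead), and the function names are hypothetical. The causal SNP itself is left out of the returned set.

```python
def r_squared(g1, g2):
    """Squared Pearson correlation between two genotype vectors (0/1/2 counts)."""
    n = len(g1)
    m1, m2 = sum(g1) / n, sum(g2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    v1 = sum((a - m1) ** 2 for a in g1)
    v2 = sum((b - m2) ** 2 for b in g2)
    if v1 == 0 or v2 == 0:
        return 0.0  # a monomorphic SNP is uncorrelated with everything
    return cov * cov / (v1 * v2)

def associated_region(causal, genotypes_by_snp, threshold=0.05):
    """SNPs whose r^2 with the causal SNP is at least the threshold."""
    ref = genotypes_by_snp[causal]
    return {s for s, g in genotypes_by_snp.items()
            if s != causal and r_squared(ref, g) >= threshold}
```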

It is straightforward to measure the number of causal variants among the top-ranked SNPs given by a method. To measure the number of associated regions, we count the number of unique regions covered by the top-ranked SNPs. For example, consider a ranking of five SNPs: *s*_{1},*s*_{2},*s*_{3},*s*_{4},*s*_{5}. Suppose that SNPs *s*_{1} and *s*_{3} belong to region *r*_{1}, SNPs *s*_{4} and *s*_{5} belong to region *r*_{2}, and *s*_{2} does not belong to any known region. In this ranking, we have two regions covered by the five SNPs.
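The five-SNP example can be written out as a short Python sketch; the function name `regions_covered` is hypothetical.

```python
def regions_covered(ranking, region_of, top):
    """Unique regions hit by the first `top` SNPs of a ranking."""
    return {region_of[s] for s in ranking[:top] if s in region_of}

ranking = ["s1", "s2", "s3", "s4", "s5"]
# s2 is absent from the map because it belongs to no known region.
region_of = {"s1": "r1", "s3": "r1", "s4": "r2", "s5": "r2"}
covered = regions_covered(ranking, region_of, top=5)  # two regions: r1 and r2
```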

We first study the effect of the *P*-value threshold on the two methods including both methods applied to all SNPs in GWAS, effect of sample sizes at relative risk 1.25 and performance on data with low causal allele frequencies. We then compare the power of the three methods, their running times on simulated and real data, and stability of SNP rankings given by the different methods. Finally, we study ranks of previously replicated SNPs on real data, associated regions in type 1 diabetes, and disease risk prediction accuracies of logistic regression as given by top ranked SNPs by the methods.

### Effect of *P*-value threshold on the support vector machine and random forest

Let *r* be the number of SNPs with *P*-values within the Bonferroni threshold. We reorder the top 2*r*, 5*r* and 10*r* chi-square-ranked SNPs with the SVM and RF separately. At the relative risk 1.25 setting *r* is 0 for some datasets and so we exclude them from the analysis.

In Table 1, we show the mean number of causal variants identified by the SVM and RF when applied to the top 2*r*, 5*r* and 10*r* chi-square-ranked SNPs as input as well as the entire GWAS. We examine the top *r* ranked SNPs in each method. The larger input to the SVM and RF contains many more false positives and this clearly deteriorates their SNP rankings. Similarly, the larger *P*-value threshold also reduces the number of associated regions detected by the two methods, as shown in Table 1.

Table 1. Top *r* SNPs given by each method at the three different relative risks

A comparison of the SVM and RF shows that while SVM(2*r*) is the best performing method, RF(5*r*) and RF(10*r*) are better than the SVM counterparts. This holds true for detecting causal variants and associated regions at all three relative risks.

The improvement given by SVM(2*r*) and RF(2*r*) is small at relative risk 1.25 but increases as we move to relative risk 1.5 and 2. However, the signal in these studies depends upon sample size among other things. If we increase the total sample size to 4000 and 8000, each containing half cases and half controls, then *r*, which is the number of SNPs with *P*-values within Bonferroni, increases and so does the improvement given by SVM(2*r*), as shown in Table 2 below. In Supplementary Figures S2–S4, we report the mean number of causal variants and associated regions for different thresholds of top-ranked SNPs instead of just the top *r* ranked SNPs. There too we find patterns similar to those reported here.

Table 2. Top *r* ranked SNPs given by each method on different sample sizes at relative risk 1.25

The simulated data above has random causal allele frequencies. In Table 3, we compare the methods on relative risk 1.25 and 1.5 but with low causal allele frequencies of at most 5% and with 4000 and 8000 subjects each containing half cases and half controls. We find both SVM(2*r*) and RF(2*r*) to perform better than chi-square at these settings.

Table 3. Top *r* ranked SNPs by each method on data with causal allele frequencies at most 5% and two different sample sizes and relative risks

We also studied the SVM and RF applied to the entire GWAS and found that they perform worse than chi-square and SVM(2*r*) and RF(2*r*) (Table 4). Note that we use the SVM with an automatic setting of the value of *C* which controls the tradeoff between error on training data and model complexity. To be fair to the SVM we ran it on all the simulated studies of relative risk 1.5 with fixed values of *C* from the set {10^{−6}, 10^{−5}, 10^{−4}, 10^{−3}, 10^{−2}, 10^{−1}}. At *C*=10^{−3} the SVM identifies the same mean number of causal variants as chi-square which is 8.9 and at the remaining values it is lower.

### Power study

We now compare the empirical power of the chi-square, SVM and RF to rank causal variants from simulated data of relative risk 1.5. We define the empirical power of a method to be the percentage of simulated datasets where the top *r* ranked SNPs given by the method, where *r* is the number of SNPs with *P*-values within Bonferroni correction, contain *k* causal variants. In Figure 1, we plot this value for *k* ranging from 1 to 15 which is the total number of causal SNPs in the simulated data. We see that SVM(2*r*) has the highest power followed by RF(2*r*).
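The definition of empirical power can be sketched in Python as below. The sketch reads "contain *k* causal variants" as "contain at least *k*", which is an assumption, and the function name and input layout are hypothetical.

```python
def empirical_power(rankings, causal_sets, r_values, k):
    """Percentage of datasets whose top-r SNPs contain at least k causal variants.

    rankings[d] is the ranked SNP list of dataset d, causal_sets[d] its set of
    causal SNPs, and r_values[d] its number of SNPs within Bonferroni.
    """
    hits = 0
    for ranking, causal, r in zip(rankings, causal_sets, r_values):
        found = len(set(ranking[:r]) & causal)
        if found >= k:
            hits += 1
    return 100.0 * hits / len(rankings)
```

Evaluating this for *k* from 1 to 15 gives one power curve per ranking method.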

In Supplementary Figure S5, we compare the empirical power of the three methods on simulated data of relative risk 1.25 and 2. At the 1.25 setting chi-square has the highest power for detecting one causal variant, RF(2*r*) and SVM(2*r*) both have the highest power for detecting two and three causal variants, and after that all three methods have the same power. At relative risk 2 all three methods have the same power up to *k*=12. After that SVM(2*r*) has the highest power.

### Running time comparisons

In Supplementary Tables S1–S3, we show running times on real data and on simulated data with different relative risks, sample sizes and causal allele frequencies. This running time includes the time for chi-square since both methods require a chi-square ranking to start with. These were measured on AMD Opteron model 2218 machines, each with a 2.6 GHz processor and 8 GB RAM. Our results show that the running time of all methods depends, unsurprisingly, on the sample size. However, the running time of SVM(2*r*) and RF(2*r*) also depends on the value of *r*, which in turn depends on the relative risk and causal allele frequency. We also see that both RF(2*r*) and SVM(2*r*) are much faster than their 10*r* counterparts and that the running time for SVM(2*r*) is comparable to that of chi-square.

### Stability of rankings

In line with recent studies that examine the stability of ranked gene and SNP lists (36,37), we do the same for the SVM and RF methods on our simulated data. Following these studies we create a jackknifed dataset by randomly removing 10% of the subjects from a given simulated study. In this manner, we create 100 jackknifed studies and compute chi-square, SVM and RF rankings with the four *P*-value thresholds of *r*, 2*r*, 5*r* and 10*r* on each one. As before, *r* denotes the number of SNPs with *P*-values within the Bonferroni correction. We perform this process on simulated studies one through five. For each of the three methods we then compute the correlation coefficient between the ranks of the top *r* SNPs captured by chi-square on the original datasets and their mean rank in the jackknifed studies.
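The stability measure can be sketched as follows in Python. The sketch assumes a Pearson correlation coefficient and takes the rankings as precomputed dictionaries from SNP identifier to rank; both choices, and all function names, are assumptions made for illustration.

```python
import math
import random

def jackknife_subjects(n_subjects, rng, drop_fraction=0.10):
    """Indices of the subjects kept after randomly removing a fraction of them."""
    keep = n_subjects - int(round(drop_fraction * n_subjects))
    return sorted(rng.sample(range(n_subjects), keep))

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def rank_stability(original_ranks, jackknifed_ranks):
    """Correlate original ranks with mean ranks over the jackknifed studies.

    original_ranks: dict SNP -> rank on the full study (top-r SNPs only).
    jackknifed_ranks: list of dicts SNP -> rank, one per jackknifed study.
    """
    snps = sorted(original_ranks)
    mean_ranks = [sum(jr[s] for jr in jackknifed_ranks) / len(jackknifed_ranks)
                  for s in snps]
    return pearson([original_ranks[s] for s in snps], mean_ranks)
```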

We study two variants of the random forest method. In RF(100), we set the number of trees in the forest to 100 and in RF(10000) we set this to 10000. Note that the latter setting is the one we used in the experiments throughout this paper. In Table 5, we see that the correlation is high at the *r* threshold for all methods but progressively decreases as the *P*-value threshold increases. We also find that the random forest with 10000 trees has much better stability than with just 100 trees even though the former has a higher running time.

Calle *et al.* (36) report a low correlation when they study the stability of the RF applied to all SNPs in a real study. In agreement with their results, we find that the correlations of RF(100) and RF(10000) are both very low when applied to the entire GWAS. In Supplementary Tables S4 and S5, we show the stability at relative risk 1.25 and 2. There too we find high stability at the *r* and 2*r* thresholds and RF(10000) doing much better than RF(100).

### Applications on real data

We demonstrate some applications of our work by studying ranks of previously replicated SNPs, associated regions in type 1 diabetes and prediction of disease risk as given by top-ranked SNPs by the three methods.

#### Ranks of previously replicated SNPs in WTCCC studies

The values of *r*, the number of SNPs with *P*-values within the Bonferroni correction, for the WTCCC type 1 diabetes, arthritis, Crohn's disease and type 2 diabetes studies are 452, 176, 63 and 14, respectively. We compute SVM and RF rankings with the three thresholds and show results in Supplementary Tables S7 and S8. In type 1 diabetes and arthritis, we see a clear improvement in rank by SVM(2*r*) and a less pronounced one by RF(2*r*). As the threshold increases from 2*r* to 5*r* and 10*r* the ranking given by SVM and RF deteriorates. In Crohn's disease and type 2 diabetes the rankings are comparable. Note that the value of *r* for these two diseases is also much smaller than the ones for the other two.

#### Type 1 diabetes-associated regions

As in the simulated data, we define an associated region for each replicated SNP as the set of all SNPs with squared correlation coefficient at least 0.05 with the replicated SNP. We also examine associated regions defined by the Type 1 Diabetes Consortium (24) and list SNPs and boundaries of both sets of regions in Section 8 of the Supplementary Data. We consider a region as detected if the top *r* SNPs of a ranking contain at least one SNP from the region. The Bonferroni-corrected *P*-value threshold is about 10^{−7}, which yields 452 SNPs. If we double this to 904 SNPs the *P*-value threshold increases to about 0.002 and includes many more regions not detected by chi-square. Table 6 shows that SVM(2*r*) and RF(2*r*) can lift the ranks of many SNPs from these undetected regions to above 452.

#### Prediction of type 1 diabetes risk on independent studies

The previous few sections have demonstrated that SVM(2*r*) and RF(2*r*) can lift the ranks of causal SNPs and those from associated regions compared to chi-square on simulated data. We expect that top-ranked SNPs given by the SVM should be enough for predicting disease risk accurately since they are mostly causal and cover several associated regions. To test this hypothesis, we measure the area under the ROC curve (AUC) of a logistic regression-based composite odds ratio score (31,38,39) for predicting type 1 diabetes risk as a function of top-ranked SNPs given by the three methods including chi-square. See the Supplementary Data for details of the composite odds ratio score (Section 1), cross-validation results on the WTCCC arthritis study (Supplementary Figure S9) and risk prediction on simulated data with this risk estimator (Supplementary Figures S10 and S11).

We compute SNP rankings on the WTCCC study and then classify subjects in the GoKinD study, with WTCCC coronary artery disease samples added as controls, using top-ranked SNPs from the three different methods. We also repeat these steps by computing SNP rankings on the GoKinD study and predicting on the WTCCC one. In Figure 2, we show the composite odds ratio AUC as a function of top-ranked SNPs in the three rankings. SVM(*r*) achieves the highest AUC of 0.83 with 21 SNPs followed by random forest and chi-square AUCs of 0.81 each with 29 and 17 SNPs, respectively. See Supplementary Figure S8 for graphs comparing AUCs of the SVM score for predicting disease risk (40).
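For reference, the AUC of any risk score can be computed as the probability that a randomly chosen case scores higher than a randomly chosen control. The Python sketch below implements only this generic definition; the composite odds ratio score itself is described in the Supplementary Data and is not reproduced here, so the scores passed in are placeholders.

```python
def auc(case_scores, control_scores):
    """Fraction of case-control pairs ranked correctly; ties count one half."""
    wins = 0.0
    for c in case_scores:
        for d in control_scores:
            if c > d:
                wins += 1.0
            elif c == d:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))
```

An AUC of 0.5 corresponds to a score that separates cases from controls no better than chance, and 1.0 to perfect separation.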

We make similar observations if we compute the rankings on the GoKinD study and predict risk on the WTCCC study. Figure 2 and its reverse counterpart in Supplementary Figure S7 show that many initial thresholds of top SVM-ranked SNPs are consistent in their prediction AUC. With chi-square the AUC is highest for a few top ranked SNPs after which it begins to fall quickly. In fact, this also happens for arthritis as shown in Supplementary Figure S9. The rapid drop in type 1 diabetes and arthritis prediction with chi-square-ranked SNPs is also observed by Evans *et al.* (31) who use a composite odds ratio score similar to ours.

## DISCUSSION

The work presented here sheds light on the performance of the SVM and the RF method for ranking SNPs in genome-wide association studies. As the *P*-value threshold increases, the ranking of causal SNPs and those from associated regions by each method deteriorates, suggesting that non-causal SNPs and those not from associated regions affect the performance of these two discriminative multivariate methods.

In unpublished work, we make similar observations with three other multivariate feature selection methods: L2 norm regularized logistic regression (41), the weighted maximum margin criterion (42) and ridge regression (25). We use the Bundle software package (41) for regularized logistic regression and our own implementations of the latter two methods. After cross-validating parameters in each method, we find that at the 2*r* threshold level all methods improve upon chi-square to different degrees but the improvement decreases at higher thresholds.

The strategy of removing features in high-dimensional data using a simple statistic before applying a more sophisticated method has been studied previously but not exactly in the manner that we do and not on genome-wide studies. Take the winners of the NIPS 2003 feature selection contest. They used a simple univariate statistic to obtain a smaller input size before applying a more sophisticated neural network plus tree-based multivariate procedure for final feature selection (43). Fan and Lv (44) provide theoretical and empirical arguments for the same idea: removing many features with a simple statistic before applying a sophisticated multivariate one for final selection. Finally, Chen and Lin (45) show that removing features with a simple univariate ‘*F*-score’ statistic improves classification performance of the SVM on all but one of the datasets used in the NIPS 2003 contest. This is not the same as SVM-based feature selection but is relevant to our work because it says something about SVM discriminant computed with full versus a culled set of features.

The SVM has previously been shown to rank non-causal variables higher than causal ones empirically (46) and theoretically (47). In both studies, culling the input dataset by a univariate filter was not considered. In light of our results here and the studies cited above it is possible that the SVM may yield better results in those studies if the input is first culled.

When the SVM is applied to the entire GWAS we find its ranking of causal variants and associated regions to be similar to chi-square and better than SVM(5*r*) and SVM(10*r*). This can be explained by the automatic setting of the SVM loss-complexity tradeoff parameter *C*=[(1/*n*)∑_{i} ||*x*_{i}||^{2}]^{−1}, where *i* loops over all *n* subjects in the training set. When the entire GWAS is considered each ||*x*_{i}||^{2} is large, which makes *C* a very small number, particularly in comparison to the value obtained in SVM(5*r*) and SVM(10*r*). This affects the discriminant *w*, which is given by *w*=∑_{i} α_{i}*y*_{i}*x*_{i}, where each α_{i}≤*C* (25) and *y*_{i} is +1 if *x*_{i} is a case and −1 otherwise. With a very small value of *C* all of the α_{i} are small and equal, effectively reducing the discriminant value of the *j*-th SNP to *w*_{j}=α∑_{i=1}^{n} *y*_{i}*x*_{ij}, where α is the common value of the α_{i}, *n* is the total number of subjects in the study, *y*_{i} is as defined above, and *x*_{ij} is the *j*-th encoded SNP of the *i*-th subject. We verify this manually on simulated study number zero and find the SVM ranking to be similar to the one obtained using the above formula.
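The reduced discriminant is simple enough to compute directly. The Python sketch below drops the common positive factor shared by all α_{i}, which does not change the ranking; the function name is hypothetical and the data are made up.

```python
def small_C_weights(X, y):
    """Per-SNP weight sum_i y_i * x_ij, up to a common positive factor."""
    n_snps = len(X[0])
    return [sum(yi * xi[j] for xi, yi in zip(X, y)) for j in range(n_snps)]

# Three subjects (two cases, one control) genotyped at two SNPs.
X = [[2, 0], [1, 1], [0, 2]]
y = [1, 1, -1]
weights = small_C_weights(X, y)
```

Each weight is the total encoded genotype among cases minus the total among controls, so ranking by its absolute value is a univariate comparison of cases and controls at each SNP.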

It is important to note that our work with the SVM presented here does not cross-validate the tuning parameter *C* which controls tradeoff between error on the training data and the classifier complexity. As mentioned earlier, we use a default value of *C* provided by the SVM-light software. We did perform the same experiments by cross-validating *C* and found no difference in the performance of the support vector machine at the 2*r* threshold. At the larger 5*r* and 10*r* thresholds the SVM performs better with the cross-validated *C* than the automatic one. The improvement, however, is not large enough to justify an expensive cross-validation procedure which is why we omit the procedure from this study altogether.

The less pronounced differences between the multivariate methods and chi-square on the WTCCC Crohn's disease and type 2 diabetes studies as well as on simulated data of relative risk 1.25 suggest that the advantage from multivariate methods over univariate in genome-wide studies may be gained only on studies where the value of *r*, which is the number of SNPs with *P*-values within Bonferroni, is non-trivial. This becomes clear if we compare the values of *r* across the three different relative risks at fixed sample sizes of 2000 and across the different sample sizes at relative risk 1.25.

Our risk prediction results show limited improvement with SVM-ranked SNPs compared to chi-square and RF ones. However, there are several aspects of this improvement that are noteworthy: (i) we see it consistently on many simulated datasets, (ii) it becomes larger at higher relative risks, and (iii) the AUC peaks earlier with a few top SVM ranked SNPs when compared to the chi-square ranked ones. In Supplementary Figures S10 and S11, we provide risk prediction results on simulated data that support the above observations. There we see that the improvement given by a few top SVM-ranked SNPs is highest at relative risk 2 and progressively decreases at lower relative risks. This suggests that there is a potential to gain higher risk prediction accuracy with SVM ranked SNPs if there is sufficient signal in a GWAS. This may very well be the case with GWAS that have larger sample sizes and more SNPs than current ones. It is part of our ongoing research to test these methods on such GWAS.

Although we have not explored this in detail here, the SVM(2*r*) and RF(2*r*) methods both have the potential to detect interacting SNPs. A straightforward, yet computationally expensive, solution would be to first recode all pairs of SNPs into new numerical values between 0 and 8 instead of encoding each SNP as 0, 1 or 2. Then we would apply SVM(2*r*) and RF(2*r*) in the same way as done in this article and examine the top *r* ranked variables for interacting SNPs.
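One possible recoding is sketched below in Python. The mapping 3·*g*_{1}+*g*_{2} is an assumption chosen because it assigns a distinct value from 0 to 8 to each of the nine genotype combinations; the text does not fix a particular mapping, and the function names are hypothetical.

```python
from itertools import combinations

def encode_pair(g1, g2):
    """Map two genotypes in {0, 1, 2} to a single value in 0..8."""
    return 3 * g1 + g2

def recode_pairs(genotypes):
    """Recode one subject's genotype vector into one value per SNP pair."""
    return {(i, j): encode_pair(genotypes[i], genotypes[j])
            for i, j in combinations(range(len(genotypes)), 2)}
```

The number of pairs grows quadratically with the number of SNPs, which is the source of the computational expense noted above.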

We also rank all SNPs in the GWAS by the SVM and apply chi-square to the top 100 ranked ones to determine if it would improve upon the support vector machine ranking. Supplementary Figure S6 shows that this offers no improvement over chi-square applied to the entire GWAS.

Finally, it is straightforward to incorporate non-genetic variables such as age, sex and principal components for population substructure into the SVM(*kr*) and RF(*kr*) methods. They would simply be additional columns in the culled data matrix that is given as input.
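A minimal sketch of appending such covariates as extra columns is given below. The function name `append_covariates` and the toy age and sex values are hypothetical, and the sketch makes no claim about how the covariates should be scaled before being passed to the SVM or RF.

```python
def append_covariates(snp_matrix, covariates):
    """Append covariate columns to each individual's row of SNP genotypes.

    snp_matrix: list of rows of 0/1/2 genotypes for the culled SNPs.
    covariates: list of rows with the same number of individuals.
    """
    if len(snp_matrix) != len(covariates):
        raise ValueError("SNP and covariate matrices must have the same number of rows")
    return [list(snps) + list(covs) for snps, covs in zip(snp_matrix, covariates)]


# Two hypothetical individuals: three culled SNPs, then age and sex.
snps = [[0, 1, 2], [2, 0, 1]]
covs = [[54.0, 1], [37.0, 0]]
combined = append_covariates(snps, covs)
```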

## CONCLUSION

We find that the support vector machine ranks causal SNPs and those from associated regions higher than the random forest and chi-square when applied to the top 2*r* chi-square-ranked SNPs, where *r* is the number of SNPs with *P*-values within the Bonferroni correction, provided that the value of *r* is sufficiently large.

## SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

## FUNDING

The CIPRES cluster supported by the National Science Foundation (EF0331654) and the Kong cluster at NJIT; Wellcome Trust under award 076113. Funding for open access charge: U.S. National Science Foundation and U.S. National Institutes of Health.

*Conflict of interest statement*. None declared.

## ACKNOWLEDGEMENTS

This study makes use of data generated by the Wellcome Trust Case-Control Consortium. A full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk.


**Oxford University Press**
