# A Fast Method for Computing High-Significance Disease Association in Large Population-Based Studies

## Abstract

Because of rapid progress in genotyping techniques, many large-scale, genomewide disease-association studies are now under way. Typically, the disorders examined are multifactorial, and, therefore, researchers seeking association must consider interactions among loci and between loci and other factors. One of the challenges of large disease-association studies is obtaining accurate estimates of the significance of discovered associations. The linkage disequilibrium between SNPs makes the tests highly dependent, and dependency worsens when interactions are tested. The standard way of assigning significance (*P* value) is by a permutation test. Unfortunately, in large studies, it is prohibitively slow to compute low *P* values by this method. We present here a faster algorithm for accurately calculating low *P* values in case-control association studies. Unlike with several previous methods, we do not assume a specific distribution of the traits, given the genotypes. Our method is based on importance sampling and on accounting for the decay in linkage disequilibrium along the chromosome. The algorithm is dramatically faster than the standard permutation test. On data sets mimicking medium-to-large association studies, it speeds up computation by a factor of 5,000–100,000, sometimes reducing running times from years to minutes. Thus, our method significantly increases the problem-size range for which accurate, meaningful association results are attainable.

Linking genetic variation to personal health is one of the major challenges and opportunities facing scientists today. It was recently listed as 1 of the 125 “big questions” that face scientific inquiry over the next quarter century.^{1} The accumulating information about human genetic variation has paved the way for large-scale, genomewide disease-association studies that can find gene factors correlated with complex disease. Preliminary studies have shown that the cumulative knowledge about genome variation is, indeed, highly instrumental in disease-association studies.^{2–4}

The next few years hold the promise of very large association studies that will use SNPs extensively.^{5} There are already reported studies with 400–800 genotypes,^{6} and studies with thousands of genotypes are envisioned.^{6} High-throughput genotyping methods are progressing rapidly.^{7} The number of SNPs typed is also likely to increase with technological improvements: DNA chips with >100,000 SNPs are in use,^{8} and chips with 500,000 SNPs are already commercially available (^{Affymetrix}). Hence, it is essential to develop computational methods to handle such large data sets. Our focus here is on improving a key aspect in the mathematical analysis of population-based disease-association studies.

The test for association is usually based on the difference in allele frequency between case and control individuals. For a single SNP, a common test suggested by Olsen et al.^{9} is based on building a contingency table of alleles compared with disease phenotypes (i.e., case/control) and then calculating a χ^{2}-distributed statistic. When multiple markers in a chromosomal region are to be tested, several studies suggested the use of generalized linear models.^{10–12} Such methods must assume a specific distribution of the trait, given the SNPs, and this assumption does not always hold. Typically, a Bonferroni correction for the *P* value is employed to account for multiple testing. However, this correction does not take into account the dependence of strongly linked marker loci and may lead to overconservative conclusions. This problem worsens when the number of sites increases.

To cope with these difficulties, Zhang et al.^{13} suggested a Monte Carlo procedure to evaluate the overall *P* value of the association between the SNP data and the disease: the χ^{2} value of each marker is calculated, and the maximum value over all markers, denoted by *CC*_{max}, is used as the test statistic. The same statistic is calculated for many data sets with the same genotypes and with randomly permuted labels of the case and control individuals. The fraction of permutations for which this value exceeds the original *CC*_{max} is used as the *P* value. A clear advantage of this test is that no specific distribution function is assumed. Additionally, the test handles multiple testing directly and avoids correction bias. Consequently, it is widely used and, for instance, is implemented in the state-of-the-art software package, ^{Haploview}, developed in the HapMap project.
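As a rough illustration of the Monte Carlo procedure just described, the following Python sketch runs the permutation test for biallelic markers. The function names (`chi2_2x2`, `cc_max`, `permutation_p_value`), the list-of-columns data layout, and the choice to count a permutation as a hit when its score is at least as large as the observed one are assumptions made for this example; they are not taken from Zhang et al. or from Haploview.

```python
import random

def chi2_2x2(marker, status):
    """Pearson chi-square score of one biallelic marker (0/1) against case/control status (0/1)."""
    n = len(marker)
    table = [[0, 0], [0, 0]]
    for allele, label in zip(marker, status):
        table[allele][label] += 1
    score = 0.0
    for i in range(2):
        for j in range(2):
            row_total = table[i][0] + table[i][1]
            col_total = table[0][j] + table[1][j]
            expected = row_total * col_total / n
            if expected > 0:  # cells of an empty row or column contribute nothing
                score += (table[i][j] - expected) ** 2 / expected
    return score

def cc_max(markers, status):
    """Maximum single-marker score over all columns (the CC_max statistic)."""
    return max(chi2_2x2(column, status) for column in markers)

def permutation_p_value(markers, status, num_permutations, seed=0):
    """Fraction of label permutations whose CC_max reaches the observed CC_max."""
    rng = random.Random(seed)
    observed = cc_max(markers, status)
    labels = list(status)
    hits = 0
    for _ in range(num_permutations):
        rng.shuffle(labels)  # genotypes stay fixed, only the labels move
        if cc_max(markers, labels) >= observed:
            hits += 1
    return hits / num_permutations
```

Each permutation rescans every marker for every individual, which is the source of the running-time problem discussed next.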

The permutation test can be readily generalized to handle association between haplotypes and the disease, for example, by adding artificial loci of block haplotypes,^{14–16} with states corresponding to common haplotypes. Similarly, one can represent loci interactions as artificial loci whose states are the allele combinations.

Running time is a major obstacle in performing permutation tests. The time complexity of the algorithm is *O*(*N*_{S}*nm*), where *N*_{S} is the number of permutations, *n* is the number of samples, and *m* is the number of loci. To search for *P* values as low as *p*, at least 1/*p* permutations are needed (see appendix A for details). Therefore, the time complexity can be written as *O*(*nm*/*p*). For instance, to reach a *P* value of 10^{-6} in a study that contains 1,000 cases and 1,000 controls with 10,000 loci, 10^{6}×2,000×10,000 = 2×10^{13} basic computer operations are required, with a running time of >30 d on a standard computer. Scaling up to larger studies with 100,000 loci is completely out of reach.

When complex diseases are being studied, SNP interactions should also be considered, and, then, time complexity is an even greater concern. Several statistical studies focus on modeling loci interactions that have little or no marginal effects at each locus.^{17–19} Recently, Marchini et al.^{20} addressed the issue of designing association studies, given the plausibility of interactions between genetic loci with nonnegligible marginal effects. In all of these studies, the multiple-testing cost of fitting interaction models is much larger than that of the single-locus analysis. Furthermore, the dependency among different tests is higher, so the disadvantage of the conservative Bonferroni correction is exacerbated. For example, when all possible pairwise loci interactions are tested, the number of tests grows quadratically with the number of loci, and applying Bonferroni correction would artificially decrease the test power. In this case, the permutation test is of even higher value. Unfortunately, the running time grows linearly with the number of tests, which causes this algorithm to become prohibitively slow, even with a few hundred SNPs.

In this study, we present a faster algorithm for computing accurate *P* values in disease-association studies. We apply a well-known statistical technique, importance sampling, to considerably decrease the number of sampled permutations. We also use the linkage disequilibrium (LD) decay property of SNPs, to further improve the running time. These two elements are incorporated in a new sampling algorithm called “^{RAT} (Rapid Association Test).” Accounting for decay in LD has already been employed by several studies, for the development of more-efficient and more-accurate algorithms. For example, by using this property, Halperin et al.^{21} reported a more accurate and faster method for tagging-SNP selection, and Stephens et al.^{22} presented an algorithm that improves the phasing accuracy. To the best of our knowledge, LD decay has not yet been exploited in permutation tests.

In the standard permutation test (SPT), when *y* permutations are performed and *z* successes are obtained, the *P* value is estimated as *z*/*y*. However, when *z*=0, we know only that *p*<1/*y*. Therefore, to obtain small *P* value bounds, one has to expend a lot of computational effort. In contrast, our method provides an estimate of the true *P* value, with a guaranteed narrow error distribution around it. The distribution gets narrower as the *P* value decreases, and, therefore, much less effort is needed to achieve accurate, very low *P* values.

Our method has a running time of *O*(*n*β+*N*_{R}*nc*), where *N*_{R} is the number of permutations drawn by ^{RAT}, β is a predefined sampling constant, and *c* is the upper bound on the distance in SNPs between linked loci. Put differently, any two SNPs that have *c* typed SNPs between them along the chromosome are assumed to be independent. In appendix A, we analyze *N*_{R} in terms of the needed accuracy and the true *P* value.

We compared the performance of our algorithm with that of the regular permutation test, on simulated data generated under the coalescent model with recombination^{23} (ms software) and on real human data. For both algorithms, we measured accuracy by the SD of the measured *P* value. We required an accuracy of 10^{−6} and compared the times to convergence in both algorithms. On realistic-sized data sets, ^{RAT} runs 3–5 orders of magnitude faster. For example, it would take ~30 d for the SPT to evaluate 10,000 SNPs in a study with 1,000 cases and 1,000 controls, whereas RAT needs ~2 min. When marker-trait association is tested in simulations with 3,000 SNPs from 1,000 cases and 1,000 controls, it is >5,000 times faster. With 10,000 SNPs from chromosome 1, a speed-up of >20,000 is achieved. With 30,556 simulated SNPs from 5,000 cases and 5,000 controls, it would take 4.62 years for the SPT to achieve the required accuracy, whereas RAT requires 24.3 min. Hence, our method significantly increases the problem-size range for which accurate, meaningful association results are attainable.

This article is organized as follows: in the “Methods” section, we formulate the problem and present the mathematical details of the algorithm. In the “Results” section, results for simulated and real data are presented. The “Discussion” section discusses the significance of the results and future plans. Some mathematical analysis and proofs are deferred to appendix A.

## Methods

### Problem Formulation

Let *n* be the number of individuals tested, and let *m* be the number of markers. The input to our problem is a pair (*M*,**d**), where *M* is an *n*×*m* “markers matrix” and **d** is an *n*-dimensional “disease-status” vector. When haplotype data are used, the dimensions of the matrix may be 2*n*×*m*. The possible types (alleles) a marker may attain are denoted by 0,1,…,*s*-1. Hence, *M*(*i*,*j*)=*k* if the *i*th individual has type *k* in the *j*th marker. Each component of the disease vector is either 0 (for an unaffected individual) or 1 (for an affected individual). An “association score” *S*(**d**) between *M* and **d** is defined below. Let π(**d**) be a permutation of the vector **d**. The goal is to calculate the *P* value of *S*(**d**), that is, the probability of obtaining an association score of at least *S*(**d**) under the random model, in which all instances (*M*, π(**d**)) are equiprobable.

Let $\xi=\sum_{i=1}^{n}d_i$ (i.e., ξ is the number of affected individuals). In this article, the set of all possible permutations of the binary vector **d** is defined by {*v*|*v* is a binary vector that contains exactly ξ 1s}. In other words, two permutations cannot have the same coordinates set to 1. Following this definition, there are $\binom{n}{\xi}$ possible permutations of **d**, instead of *n*! possibilities by the standard definition of a permutation. Notice that, since all (standard) permutations are equiprobable, our definition for a permutation is equivalent to the standard one from a probabilistic view: for each of the permutations with the use of our definition, there are exactly ξ!(*n*-ξ)! permutations with the use of the standard definition.
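The counting argument in this paragraph can be checked numerically. The short Python sketch below uses only `math.comb` and `math.factorial`; the function names are chosen for this illustration.

```python
import math

def num_label_vectors(n, xi):
    """Number of binary vectors of length n with exactly xi ones."""
    return math.comb(n, xi)

def orderings_per_label_vector(n, xi):
    """Number of standard permutations that map to the same binary vector."""
    return math.factorial(xi) * math.factorial(n - xi)

# Small check of the equivalence: C(n, xi) * xi! * (n - xi)! = n!
n, xi = 6, 2
print(num_label_vectors(n, xi) * orderings_per_label_vector(n, xi) == math.factorial(n))

# Size of the space for 1,000 cases and 1,000 controls, reported as a digit count.
print(len(str(num_label_vectors(2000, 1000))))
```

For 1,000 cases and 1,000 controls the count has several hundred decimal digits, which is why exhaustive enumeration is not an option.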

For two marker vectors **x** and **y** of size *n*, let *T* denote their contingency table. *T* is built as follows: *T*_{i,j}=|{*k*|*x*(*k*)=*i*, *y*(*k*)=*j*}|. Let *T*_{E} be its expected contingency table, assuming the vectors **x** and **y** are independent, that is, $T_E(i,j)=\frac{1}{n}\left(\sum_{k}T_{i,k}\right)\left(\sum_{k}T_{k,j}\right)$. The Pearson χ^{2} score of the table *T* is $S(T)=\sum_{i,j}\frac{\left(T_{i,j}-T_E(i,j)\right)^2}{T_E(i,j)}$. We also use *S*(*x*,*y*) to denote *S*(*T*).
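The definitions of the contingency table, its expected table under independence, and the Pearson score translate almost directly into code. The following Python sketch is one way to write them; the function names and the convention of skipping cells whose expected count is zero are assumptions of this example.

```python
def contingency_table(x, y, num_x, num_y):
    """T[i][j] = number of positions k with x[k] == i and y[k] == j."""
    table = [[0] * num_y for _ in range(num_x)]
    for xi, yi in zip(x, y):
        table[xi][yi] += 1
    return table

def expected_table(table):
    """Expected counts under independence: row total * column total / n."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return [[r * c / n for c in col_totals] for r in row_totals]

def pearson_score(table):
    """Pearson chi-square score of the table."""
    expected = expected_table(table)
    score = 0.0
    for row, exp_row in zip(table, expected):
        for observed, exp in zip(row, exp_row):
            if exp > 0:  # assumed convention: empty rows/columns contribute nothing
                score += (observed - exp) ** 2 / exp
    return score

# A marker identical to the disease vector gives the table [[2, 0], [0, 2]].
print(pearson_score(contingency_table([0, 0, 1, 1], [0, 0, 1, 1], 2, 2)))  # 4.0
```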

The *j*th column of the matrix *M* is denoted by *M*_{·,j}. We use the notation *S*_{j}(*x*) for the score *S*(*M*_{·,j},*x*). Hence, *S*_{j}(*d*) is the Pearson score of marker *j* and the disease vector **d**. Under the random model described above, the asymptotic distribution of *S*_{j}(*d*) is χ^{2}, with *s*-1 df.^{24} For a vector **x**, let $S(\mathbf{x})=\max_{1\le j\le m}S_j(\mathbf{x})$, that is, the highest Pearson score of any marker in *M* with the disease vector **x**. *S*(*d*) is called the “association score” for (*M*,*d*). We would like to calculate the probability that *S*(*x*)>*S*(*d*), where **x** is a random permutation of the vector **d**.

Let $\mathcal{F}$ be the event space of all possible permutations of the vector **d**. The probability measure of $\mathcal{F}$ is defined as $\Pr_{\mathcal{F}}(d_i)=1/\binom{n}{\xi}$ for every $d_i\in\mathcal{F}$, where ξ is the number of affected individuals. We use *f*(·) to denote $\Pr_{\mathcal{F}}(\cdot)$. Let $\mathcal{H}$ be the subset of $\mathcal{F}$, such that $\mathcal{H}=\{d_i\mid d_i\in\mathcal{F},\ S(d_i)\ge S(d)\}$. (Note that throughout we denote, by *d*_{i}, the *i*th permutation and not the *i*th component of the vector **d**.) Then, $p=|\mathcal{H}|/|\mathcal{F}|$ is the desired *P* value. Zhang et al.^{13} proposed a Monte Carlo sampling scheme of the space $\mathcal{F}$. This test will be referred to as the “SPT.” The running time of this algorithm is *O*(*nmN*_{S}), where *N*_{S} is the number of permutations of the standard algorithm. We use *p*_{S} to denote the calculated *P* value of SPT and *p*_{R} to denote the calculated *P* value of our algorithm.

### Importance Sampling

We now describe our sampling method. We use the methodology of importance sampling.^{25} Informally, in SPT, sampling is done from all possible permutations of the labels of the case and control individuals. This is very computationally intensive, since the number of all possible permutations can be very large. For example, the number of all possible permutations for 1,000 cases and 1,000 controls is $\binom{2000}{1000}\approx 2\times 10^{600}$. In our method, instead of sampling from this huge space, sampling is done from the space of all “important permutations,” namely, all possible permutations that give association scores at least as large as the original one. To achieve this goal, we first define this probability space (i.e., define a probability measure for each of these permutations) and then show how to correctly sample from it. This sampling is done in three steps: (1) a column (or a SNP) is sampled, (2) a contingency table is sampled for that column from the set of all possible contingency tables that are induced by this column and whose association score is at least as large as the original one, and (3) an important permutation that is induced by this contingency table is sampled.

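We construct an event space $\mathcal{G}$, which contains the same events as $\mathcal{H}$ (the permutations whose association score is at least $S(\mathbf{d})$) but with a different probability measure that will be defined below. $\mathcal{G}$ has three important properties: (1) one can efficiently sample from $\mathcal{G}$, (2) the probability of each event in $\mathcal{G}$ can be readily calculated, and, (3) for each $d_i\in\mathcal{H}$, $g(d_i)>0$. The probability function over $\mathcal{G}$ is denoted by *g*(·).

We use *N*_{R} to denote the number of permutations drawn by the RAT algorithm. With the use of property 3, if *N*_{R} samples are drawn from $\mathcal{G}$ instead of from $\mathcal{F}$ (the space of all permutations, with uniform measure *f*), then

$$p=\sum_{d_i\in\mathcal{H}}f(d_i)=\sum_{d_i\in\mathcal{G}}\frac{f(d_i)}{g(d_i)}\,g(d_i)=E_g\!\left[\frac{f}{g}\right]\approx\frac{1}{N_R}\sum_{i=1}^{N_R}\frac{f(d_i)}{g(d_i)}.\qquad(1)$$

We now define the probability measure on $\mathcal{G}$. For a permutation $e\in\mathcal{G}$, let $Q(e)=|\{j\mid 1\le j\le m,\ S_j(e)\ge S(\mathbf{d})\}|$. Namely, *Q*(*e*) is the number of columns in *M* whose Pearson score with the disease vector **e** is at least *S*(**d**). Observe that, since $e\in\mathcal{G}$, $Q(e)\ge 1$. The probability of **e** in $\mathcal{G}$ is defined as:

$$g(e)=\frac{Q(e)}{\sum_{j=1}^{m}|\mathcal{H}_j|},\qquad(2)$$

where $\mathcal{H}_j$ is the set of permutations whose Pearson score with column *j* is at least *S*(**d**).

Let $\mathcal{T}_j$ be the set of all possible contingency tables that correspond to column *j* of *M* and to different permutations of the vector **d**. The number of different permutations of **d** that correspond to a specific contingency table *T* is denoted by μ_{j}(*T*) and can be calculated directly as follows:

$$\mu_j(T)=\prod_{i=0}^{s-1}\binom{T_{i,0}+T_{i,1}}{T_{i,1}}.\qquad(3)$$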
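To make the count μ_{j}(*T*) concrete, the Python sketch below evaluates it as a product of binomial coefficients, one per allele: among the individuals carrying allele *i*, it counts the ways to choose which of them are labeled as cases. The representation of a table as a list of `(controls, cases)` pairs and the function name `mu` are assumptions of this example.

```python
import math

def mu(table):
    """Number of case/control label vectors that produce `table`.

    table[i] = (controls, cases) among the individuals carrying allele i.
    """
    count = 1
    for controls, cases in table:
        count *= math.comb(controls + cases, cases)
    return count

# Column with 3 individuals of allele 0 and 2 of allele 1, and 2 cases overall.
tables = [[(3 - a, a), (2 - (2 - a), 2 - a)] for a in range(0, 3)]
print([mu(t) for t in tables])       # [1, 6, 3]
print(sum(mu(t) for t in tables))    # 10, which equals C(5, 2)
```

Summing over every table that is consistent with the column recovers the total number of label vectors, which is a convenient sanity check.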

Let *T* be a contingency table that fits column *j.* Define

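Let $\mathcal{H}_j$ be the set $\mathcal{H}_j=\{d_i\mid S_j(d_i)\ge S(d)\}$. Observe that $\mathcal{H}=\bigcup_{j=1}^{m}\mathcal{H}_j$, where $\mathcal{H}$ is the set of all permutations whose association score is at least *S*(*d*). Define $\hat{\mathcal{T}}_j=\{T\in\mathcal{T}_j\mid S(T)\ge S(d)\}$, where $\mathcal{T}_j$ is the set of contingency tables of column *j*.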

The following sampling algorithm from $\mathcal{G}$ will be referred to as the “$\mathcal{G}$-sampler”:

1. Sample a column *j* with probability $|\mathcal{H}_j|/\sum_{k=1}^{m}|\mathcal{H}_k|$.
2. Sample a contingency table *T* from $\hat{\mathcal{T}}_j$ with probability $\mu_j(T)/|\mathcal{H}_j|$.
3. Sample a permutation that fits the contingency table *T* uniformly, that is, with probability $1/\mu_j(T)$.
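The three sampling steps, together with the importance-sampling estimate of the *P* value, can be sketched in Python for biallelic markers (*s*=2) on toy-sized data. This is a minimal illustration under stated assumptions, not the RAT implementation: tables are enumerated exactly for every column, a table is indexed by `a`, the number of cases among the allele-0 individuals, and all function names are hypothetical. `random.choices` converts the weights to floating point, so the sketch is only suitable for small *n*.

```python
import itertools
import math
import random

def table_score(a, r0, r1, xi):
    """Pearson score of the 2x2 table with `a` cases among the r0 allele-0 individuals."""
    n = r0 + r1
    cells = [(r0 - a, r0, n - xi), (a, r0, xi),
             (r1 - (xi - a), r1, n - xi), (xi - a, r1, xi)]
    score = 0.0
    for observed, row_total, col_total in cells:
        expected = row_total * col_total / n
        if expected > 0:
            score += (observed - expected) ** 2 / expected
    return score

def column_score(column, labels):
    r0 = column.count(0)
    a = sum(1 for allele, lab in zip(column, labels) if allele == 0 and lab == 1)
    return table_score(a, r0, len(column) - r0, sum(labels))

def rat_p_value(columns, status, num_samples, seed=0):
    rng = random.Random(seed)
    n, xi = len(status), sum(status)
    threshold = max(column_score(col, status) for col in columns)
    # Per column: tables scoring at least the threshold, weighted by mu_j(T).
    important = []
    for col in columns:
        r0 = col.count(0)
        r1 = n - r0
        important.append([(a, math.comb(r0, a) * math.comb(r1, xi - a))
                          for a in range(max(0, xi - r1), min(r0, xi) + 1)
                          if table_score(a, r0, r1, xi) >= threshold])
    sizes = [sum(w for _, w in tables) for tables in important]  # |H_j|
    gamma = sum(sizes)
    total = 0.0
    for _ in range(num_samples):
        j = rng.choices(range(len(columns)), weights=sizes)[0]           # step 1
        a = rng.choices([t for t, _ in important[j]],
                        weights=[w for _, w in important[j]])[0]         # step 2
        zeros = [k for k, allele in enumerate(columns[j]) if allele == 0]
        ones = [k for k, allele in enumerate(columns[j]) if allele == 1]
        cases = set(rng.sample(zeros, a)) | set(rng.sample(ones, xi - a))  # step 3
        labels = [1 if k in cases else 0 for k in range(n)]
        q = sum(1 for col in columns if column_score(col, labels) >= threshold)
        total += 1.0 / q  # f/g is proportional to 1/Q
    return gamma * total / (math.comb(n, xi) * num_samples)

def exact_p_value(columns, status):
    """Brute-force reference that enumerates every label vector (tiny inputs only)."""
    n, xi = len(status), sum(status)
    threshold = max(column_score(col, status) for col in columns)
    hits = 0
    for cases in itertools.combinations(range(n), xi):
        labels = [0] * n
        for k in cases:
            labels[k] = 1
        if max(column_score(col, labels) for col in columns) >= threshold:
            hits += 1
    return hits / math.comb(n, xi)
```

On inputs small enough for `exact_p_value` to finish, the two functions should agree up to sampling noise, which gives a simple way to check the weighting by 1/*Q*.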

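Theorem 1: *The probability for a vector* **d**_{i} *to be sampled in the* $\mathcal{G}$*-sampler algorithm is* g(**d**_{i}).

Proof: Let $d_i\in\mathcal{G}$, and suppose that *Q*(*d*_{i})=*q*. Let *j* be one of the *q* columns for which $S_j(d_i)\ge S(d)$, and let *T* be the corresponding contingency table of *d*_{i}. With the use of the $\mathcal{G}$-sampler, there is a probability of

$$\frac{|\mathcal{H}_j|}{\sum_{k=1}^{m}|\mathcal{H}_k|}\cdot\frac{\mu_j(T)}{|\mathcal{H}_j|}\cdot\frac{1}{\mu_j(T)}=\frac{1}{\sum_{k=1}^{m}|\mathcal{H}_k|}$$

to choose the element *d*_{i} from $\mathcal{H}_j$. Since *d*_{i} belongs to exactly *q* of the sets $\mathcal{H}_j$, the probability is $q/\sum_{k=1}^{m}|\mathcal{H}_k|=g(d_i)$.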

Step 3 in the $\mathcal{G}$-sampler can be easily performed, given *T*. For example, in the case of binary traits, one has to randomly select *T*_{0,0} out of the controls and *T*_{0,1} out of the cases. When performing steps 1 and 2, there are two computational challenges: (1) calculating $\sum_{j=1}^{m}|\mathcal{H}_j|$ and (2) sampling a contingency table *T* from $\hat{\mathcal{T}}_j$ with probability $\mu_j(T)/|\mathcal{H}_j|$. We present two different schemes for these problems: an exact algorithm and a faster approximation algorithm.

#### An exact algorithm

For a column *j*, we enumerate all *O*(*n*^{s-1}) possible contingency tables and construct the set $\hat{\mathcal{T}}_j$. For each table *T*, we calculate μ_{j}(*T*) according to formula (3), and $|\mathcal{H}_j|$ is calculated by $\sum_{T\in\hat{\mathcal{T}}_j}\mu_j(T)$. The total time complexity of this algorithm is *O*(*n*^{s-1}*m*+*N*_{R}*nm*).

#### An approximation algorithm

To approximate $\sum_{j=1}^{m}|\mathcal{H}_j|$, let β be a constant. We randomly sample a set *L* of β columns and calculate $|\mathcal{H}_j|$ for each of the columns in set *L*, by using the exact algorithm. The sum is then approximated by $\frac{m}{\beta}\sum_{j\in L}|\mathcal{H}_j|$. In practice, for the problem sizes we tested, this approach was very accurate when used with β=100. The running time of this step is *O*(*n*^{s-1}β), and the total running time of the algorithm is *O*(*n*^{s-1}β+*N*_{R}*nm*). In our case, *s*=2, since there are two possible alleles in each position, so the time complexity becomes *O*(*n*β+*N*_{R}*nm*).
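The column-sampling approximation amounts to scaling the average of β sampled column sizes up to all *m* columns. A minimal Python sketch follows; the callable `column_size` (which would return the exact size of the important set for one column), the function name, and sampling without replacement are assumptions of this illustration.

```python
import random

def approximate_normalizer(num_columns, column_size, beta, seed=0):
    """Estimate the sum of column_size(j) over all columns from beta sampled columns."""
    rng = random.Random(seed)
    sampled = rng.sample(range(num_columns), min(beta, num_columns))
    # Scale the sampled total by m / beta.
    return num_columns / len(sampled) * sum(column_size(j) for j in sampled)

# If every column has the same size, the estimate is exact.
print(approximate_normalizer(50, lambda j: 7, beta=10))  # 350.0
```

The estimate is exact when β is at least *m*; otherwise its quality depends on how much the column sizes vary.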

For sampling a contingency table *T* from $\hat{\mathcal{T}}_j$ with probability $\mu_j(T)/|\mathcal{H}_j|$, we use a Metropolis-Hastings sampling algorithm.^{26,27} We define a directed graph with nodes corresponding to Markov states and with edges corresponding to transitions between states. Each state represents a specific contingency table *T* and is denoted by St(*T*). Let $\pi(T)=\mu_j(T)/|\mathcal{H}_j|$. Our goal is to sample a state St(*T*) with probability π(*T*). We do this by generating a random walk that has a stationary distribution π[St(*T*)].

To define the edges in the graph, we first need some definitions. We say that a row is “extreme” if one of its cells has value 0. *T* is a “boundary table” if it has fewer than two nonextreme rows. A “tweak” to a contingency table is obtained by taking a 2×2 submatrix, decreasing by one the elements on one diagonal, and increasing by one the elements on the other diagonal. A tweak is “legal” if the resulting table is nonnegative.

Let *N*_{g}(*T*) be the set of all contingency tables that can be obtained by a legal tweak of *T.* In addition, if *T* is a boundary table, then *N*_{g}(*T*) also contains all other possible boundary tables that maintain π[*St*(*T*)]>0. The resulting set *N*_{g}(*T*) constitutes the possible transitions from St(*T*).

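Let *J*(*T*_{old},*T*_{new}), the probability of proposing a move from *T*_{old} to *T*_{new}, be defined as: $J(T_{old},T_{new})=1/|N_g(T_{old})|$ if $T_{new}\in N_g(T_{old})$, and 0 otherwise.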

The sampling algorithm, which will be called “*T*-sampler,” is as follows:

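1. Start with an arbitrary table $T_{old}\in\hat{\mathcal{T}}_j$.
2. Choose an arbitrary table $T_{new}\in N_g(T_{old})$, and calculate $h=\min\left\{1,\ \frac{\pi(T_{new})\,J(T_{new},T_{old})}{\pi(T_{old})\,J(T_{old},T_{new})}\right\}$.
3. With probability *h*, set *T*_{old}=*T*_{new}.
4. Return to step 2.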
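The acceptance rule of a Metropolis-Hastings walk of this kind can be sketched in a few lines of Python. The example below is a deliberately simplified stand-in, not the T-sampler itself: the candidate tables are assumed to be arranged in a line and indexed 0, 1, 2, …, the neighbours of a table are the adjacent indices (there are no boundary-table jumps), proposals are uniform over neighbours, and `weights[i]` plays the role of an unnormalized target such as μ_{j}(*T*). All names are hypothetical.

```python
import random

def t_sampler_sketch(weights, steps, rng):
    """Metropolis-Hastings walk over indices 0..len(weights)-1 (needs at least two states).

    The stationary distribution is proportional to `weights`, so the
    normalizing constant never has to be computed.
    """
    def neighbours(i):
        return [k for k in (i - 1, i + 1) if 0 <= k < len(weights)]

    current = 0
    for _ in range(steps):
        options = neighbours(current)
        proposal = rng.choice(options)
        forward = 1.0 / len(options)                # J(old, new)
        backward = 1.0 / len(neighbours(proposal))  # J(new, old)
        h = min(1.0, (weights[proposal] * backward) / (weights[current] * forward))
        if rng.random() < h:
            current = proposal
    return current

rng = random.Random(1)
weights = [1, 2, 4, 2, 1]
draws = [t_sampler_sketch(weights, 100, rng) for _ in range(3000)]
print([round(draws.count(i) / len(draws), 2) for i in range(len(weights))])
```

Because only ratios of the target appear in *h*, the walk can be run with the unnormalized counts, which is the practical appeal of this approach when the normalizer is expensive.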

The T-sampler algorithm is stopped after a predefined constant number of steps, denoted by ζ, and outputs the final contingency table *T.* It is guaranteed that, when ζ is large enough, *T* is sampled with probability close to π(*T*). The last sentence holds true, since the sampler is irreducible (this is proved in appendix A). The running time of the T-sampler algorithm is bounded by a constant, since ζ is a predefined constant.

Once a permutation *d*_{i} is drawn, calculating *Q*(*d*_{i}) takes *O*(*nm*), so the total running time of the algorithm (applying the $\mathcal{G}$-sampler for *N*_{R} permutations) is *O*(*N*_{R}*nm*). We note that, when *n* is not too large, the sampling of the contingency table can be done by calculating the probability of all *O*(*n*^{s-1}) possible contingency tables. This is relevant, in particular, when testing individual SNPs (and, thus, *s*=2).

#### Calculating *g*(**d**_{i}) and the *P* value

After a random permutation *d*_{i} is drawn from $\mathcal{G}$, *g*(*d*_{i}) is calculated in the following way: according to equation (2), we need to calculate both *Q*(*d*_{i}) and $\sum_{j=1}^{m}|\mathcal{H}_j|$. The second term, $\sum_{j=1}^{m}|\mathcal{H}_j|$, equals $\sum_{j=1}^{m}\sum_{T\in\hat{\mathcal{T}}_j}\mu_j(T)$ and is calculated only once, as a preprocessing step. We denote this value by Γ. The first term is calculated in *O*(*nm*) time, by going over all columns and counting $Q(d_i)=|\{j\mid 1\le j\le m,\ S_j(d_i)\ge S(d)\}|$.

To calculate the *P* value, define

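The *P* value is calculated using equations (1) and (2):

$$p\approx\frac{1}{N_R}\sum_{i=1}^{N_R}\frac{f(d_i)}{g(d_i)}=\frac{1}{N_R}\sum_{i=1}^{N_R}\frac{1/\binom{n}{\xi}}{Q(d_i)/\Gamma}=\frac{\Gamma}{\binom{n}{\xi}N_R}\sum_{i=1}^{N_R}\frac{1}{Q(d_i)}.\qquad(4)$$

Hence, *p*_{R} is calculated by $\frac{\Gamma}{\binom{n}{\xi}N_R}\sum_{i=1}^{N_R}\frac{1}{Q(d_i)}$.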

It follows from equation (4) (with the assumption that Γ was correctly computed) that the only factor that determines the accuracy of the importance sampling is the *variance* of 1/*Q*(*e*) and not whether it is small or large. The smaller the variance, the better the accuracy. This relationship is discussed theoretically in appendix A. In practice, as described in the “Results” section, the variance of 1/*Q*(*e*) (or of the calculated *P* value) was small, though not zero, when real data were used. Intuitively, this can be explained by the limited range of linkage between markers: if the linkage is limited to, at most, *c* markers, *Q*(*e*) will not be much larger than *c,* and, hence, *Var*[ 1/*Q*(*e*)] will be bounded.

### Using LD Decay to Improve Time Complexity

In this section, we show how to improve time complexity, under assumptions of biological properties of the data. Assume that two SNPs separated by *c* SNPs along the genome are independent, because of the LD decay along the chromosome. *c* is called the “linkage upper bound” of the data. Hence, when calculating *Q*(*d*_{i}) for each permutation *d*_{i}, it is unnecessary to go over all *m* SNPs. Let *b*_{i} be the position of the SNP that induces the permutation *d*_{i} that achieves maximum score. Only SNPs within a distance of *c*—that is, SNPs whose positions are between *b*_{i}-*c* and *b*_{i}+*c*—are checked.

The remaining *m*-2*c*-1 SNPs are independent of *b*_{i}, so the expected number of columns that give scores >*S*(*d*) is (*m*-2*c*-1)*q*, where *q* is the probability for a single column to result with a score >*S*(*d*). *q* can be calculated only once at the preprocessing step. Consequently, only *O*(*cn*) operations are needed to calculate *Q*(*d*_{i}), instead of *O*(*nm*). Since *O*(*n*β) operations are needed for the preprocessing phase, the total time complexity is *O*(*n*β+*N*_{R}*nc*). Observe that, by increasing the value of *c,* one can improve the accuracy of the procedure at the expense of longer run time.
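The windowed computation of *Q*(*d*_{i}) can be sketched as follows in Python. The sketch assumes that *c* is measured in SNPs, that `score_of_column(j)` returns the score of column *j* for the current permutation, and that the window is clipped at the ends of the chromosome (so the number of remote SNPs equals *m*-2*c*-1 only away from the ends). These choices and all names are assumptions of this example.

```python
def estimate_q(score_of_column, b, c, m, threshold, q_single):
    """Approximate Q for one permutation using a window of +/- c SNPs around b.

    Columns inside the window are scored explicitly; every remote column
    contributes its precomputed exceedance probability q_single.
    """
    lo, hi = max(0, b - c), min(m - 1, b + c)
    local_hits = sum(1 for j in range(lo, hi + 1)
                     if score_of_column(j) >= threshold)
    remote_columns = m - (hi - lo + 1)
    return local_hits + remote_columns * q_single

# Toy usage: only columns 500 and 505 reach the threshold.
scores = {500: 12.0, 505: 11.0}
q = estimate_q(lambda j: scores.get(j, 0.0), b=500, c=10, m=1000,
               threshold=10.0, q_single=0.001)
print(q)  # 2 local hits plus 979 remote columns times 0.001
```

Only the 2*c*+1 columns in the window are scored per permutation, which is where the reduction from *O*(*nm*) to *O*(*cn*) comes from.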

It should be pointed out that, by using this scheme, the correct expectation of *Q*(*d*_{i}) is obtained, since the remote markers are independent of *b*_{i}. Theoretically, the remote markers need not necessarily be independent of each other, and, hence, the calculated *Q*(*e*) may be biased. In practice, as we shall show (in the “Results” subsection “Real Biological Data”), this is a faithful approximation. Upper bounds on the number of permutations required to search for a *P* value *p* are derived in appendix A, both for SPT and RAT.

## Results

We implemented our algorithm in the software package ^{RAT} in C++ under LINUX.

### Simulated Data

To simulate genotypes, we used Hudson’s program that assumes the coalescent process with recombination^{23} (^{ms software}). We followed Nordborg et al.,^{28} using a mutation rate of 2.5×10^{-8} per nucleotide per generation, a recombination rate of 10^{-8} per pair of nucleotides per generation, and an effective population size of 10,000. Of all the segregating sites, only the ones with minor-allele frequency >5% were defined as SNPs and were used in the rest of the analysis. We used the strategy described elsewhere^{29} in choosing the disease marker—that is, we chose a SNP locus as the disease locus if it satisfied two conditions: (1) the frequency of the minor allele is between 0.125 and 0.175, and (2) the relative position of the marker among the SNPs is between 0.45 and 0.55 (i.e., the disease locus is approximately in the middle). The chosen disease SNP was removed from the SNP data set. We then generated case-control data according to a multiplicative disease model. The penetrances of genotypes aa, aA, and AA are λ, λγ, and λγ^{2}, respectively, where λ is the phenocopy rate and γ is the genotype relative risk. As in Zhang et al.,^{29} we set γ=4 and λ=0.024, which corresponds to a disease prevalence of 0.05 and a disease-allele frequency of 0.15. Finally, *N* cases and *N* controls were randomly chosen for each experiment.
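The multiplicative disease model and the stated prevalence can be checked with a short calculation. The Python sketch below assumes Hardy-Weinberg genotype frequencies at the disease locus, with A denoting the risk allele; both that assumption and the function names are introduced for this illustration.

```python
import random

def penetrances(lam, gamma):
    """Penetrance by number of risk alleles: aa, aA, AA."""
    return [lam, lam * gamma, lam * gamma ** 2]

def prevalence(lam, gamma, risk_allele_freq):
    """Population prevalence under Hardy-Weinberg proportions (assumed)."""
    p = risk_allele_freq
    genotype_freqs = [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]
    return sum(f * pen for f, pen in zip(genotype_freqs, penetrances(lam, gamma)))

def draw_status(num_risk_alleles, lam, gamma, rng):
    """Return 1 (affected) with the penetrance of the given genotype, else 0."""
    return 1 if rng.random() < penetrances(lam, gamma)[num_risk_alleles] else 0

print(round(prevalence(0.024, 4, 0.15), 4))  # 0.0505
```

With γ=4, λ=0.024, and a disease-allele frequency of 0.15, the prevalence comes out at about 0.05, in line with the value quoted above.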

We compared the times until convergence in both algorithms, where convergence was declared when the SD of the computed *P* value drops below 10^{-6}. In all our tests, the actual *P* values were 10^{-6} (see the “Discussion” section). We set *c*=*m* in ^{RAT}, so no LD decay is assumed, and the running time is measured using only the importance-sampling component. The approximation algorithm was used in all cases, with the parameter β set to 100. The running times of SPT are very large and, therefore, were extrapolated as follows: since at least 10^{6} permutations are needed to achieve an accuracy of 10^{-6} (see appendix A), we measured the running time for 100 permutations, excluding the setup cost (e.g., loading the files and memory allocation), and multiplied by 10^{4} to obtain the evaluated running time. We validated this extrapolation by conducting several experiments with 1,000 permutations. The differences between different runs of 1,000 permutations were <1.5%. All runs were done on a Pentium 4 2-GHz machine with 0.5 gigabytes of memory.

In the first setup, we simulated 20,000 haplotypes in a region of 1 Mb. Overall, 3,299 SNPs were generated. We compared the running times when varying three parameters: (1) the number of SNPs (100, 200,…, 2,000), (2) the number of sampled cases and controls (*N*=500, 1,000,…, 5,000), and (3) the SNP density. We chose every *i*th SNP, where *i* varies from 1 to 10 (this corresponds to SNP densities between 3,299 and 329 SNPs/Mb). The results are summarized in figure 1. On average, ^{RAT} is faster than SPT by a factor of >5,000. For example, it would take ~62 d for SPT to evaluate all 3,299 SNPs for 5,000 cases and 5,000 controls, whereas RAT needs 13 min to obtain the result.

Figure 1. Running times of RAT and SPT on simulated data under the coalescent model with recombination.

We also tested both algorithms on a very large data set consisting of 10 different regions of 1 Mb each. This data set, generated as described above, contained 5,000 cases and 5,000 controls with 30,556 SNPs. For ^{RAT}, we used a linkage upper-bound value of *c*=100 kb, on the basis of our observations of LD decay in real biological data (see the “Real Biological Data” subsection). The evaluation of the running time of SPT was performed by the same extrapolation method described above. For this data set, SPT would take 4.62 years to achieve the required accuracy of 10^{-6}, whereas RAT’s running time is 24.3 min (i.e., 100,000 times faster).

Since ^{RAT} and SPT are both based on sampling, their computed *P* values are distributed around the exact one. Does RAT provide accuracy similar to SPT, in terms of the spread of their distributions? To answer this question, we tested whether RAT converges to the *P* value obtained by SPT. To obtain a reliable estimate of the *P* value obtained by SPT, we used a relatively small number of cases and controls and ran SPT for a large number of permutations. We simulated five different data sets, each with 3,299 SNPs and with 100 cases and 100 controls. We ran SPT for 10,000 permutations, to calculate 95% CIs of the “true” *P* values. Since a small *P* value was obtained (<.001) in two of these experiments, we increased the number of permutations to 100,000, to improve the accuracy. The results are summarized in figure 2. In all five cases, convergence of the *P* value calculated by RAT to the CI was obtained after <100 permutations.

Figure 2. Convergence of RAT to the SPT *P* value. Each of the five panels represents a different experiment with 100 controls and 100 cases of simulated SNPs in a 1-Mb region (~3,000 SNPs), under the coalescent model.

Our theoretical analysis (see appendix A) shows that, when ^{RAT} with a linkage upper bound is used, the accuracy (measured by SD) increases as the *P* value decreases. For evaluation of the actual connection between these two measures, we used simulated data of a 1-Mb region, as described above. We conducted several experiments with different values of *N,* to obtain a range of *P* values. In each experiment, we generated 100 permutations, to estimate the SD. The results are presented in figure 3. For the whole range of *P* values, the SD is, on average, 1/15 of the *P* value.

Figure 3. SD of the RAT estimate as a function of the *P* value. Data sets were simulated SNPs under the coalescent model with recombination of a 1-Mb region. To obtain different *P* values, we performed the simulations with different numbers of cases and controls ranging from 50 …

The complexity analysis of both algorithms (see table A1) shows the theoretical advantage of ^{RAT} over SPT when the required *P* value is sufficiently small. At what level of *P* value does RAT have an advantage in practice? To answer this question, we tested both algorithms on data generated by the simulation described above. The data contain ~3,300 SNPs from 5,000 cases and 5,000 controls. To obtain different *P* values, the simulations were performed with different phenocopy rates (λ parameter) of the multiplicative disease model. The results are presented in figure 4. A shorter running time for RAT can be observed, starting from *P*=10^{-2}.

### Real Biological Data

We also tested ^{RAT} on HapMap project data. We used SNPs from chromosomes 1–4 of 60 unrelated individuals in the CEPH population. We used the GERBIL algorithm and trios information^{15}^{,}^{30} to phase and complete missing SNPs in the data. We amplified the number of samples by adapting the stochastic model of Li and Stephens for haplotype generation.^{31} When there are *k* haplotypes, the (*k*+1)st haplotype is generated as follows: first, the recombination sites are determined assuming a constant recombination rate along the chromosome (we used 10^{-8} per pair of adjacent nucleotides). Second, for each stretch between two neighboring recombination sites, one of the *k* haplotypes is chosen, with probability 1/*k.* The process is repeated until the required number of haplotypes is achieved. After amplification of the number of samples, cases and controls were chosen as described in the “Simulated Data” subsection.
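The amplification step can be sketched as a simple copying process. The Python example below follows the two-stage description in the paragraph above under a few stated assumptions: haplotypes are lists of alleles, `positions` gives the base-pair coordinate of each SNP, a recombination between two adjacent SNPs occurs if at least one of the nucleotide pairs in the gap recombines, and the source haplotype is redrawn uniformly after each recombination. The function names are hypothetical, and this is not the authors' code.

```python
import random

def next_haplotype(haplotypes, positions, recomb_rate, rng):
    """Build one mosaic haplotype by copying stretches from existing haplotypes."""
    source = rng.randrange(len(haplotypes))
    new = []
    for idx, pos in enumerate(positions):
        if idx > 0:
            gap = pos - positions[idx - 1]
            # Probability of at least one recombination site within the gap.
            if rng.random() < 1.0 - (1.0 - recomb_rate) ** gap:
                source = rng.randrange(len(haplotypes))  # each haplotype with probability 1/k
        new.append(haplotypes[source][idx])
    return new

def amplify(haplotypes, positions, target, recomb_rate=1e-8, seed=0):
    """Grow the pool one haplotype at a time until it holds `target` haplotypes."""
    rng = random.Random(seed)
    pool = [list(h) for h in haplotypes]
    while len(pool) < target:
        pool.append(next_haplotype(pool, positions, recomb_rate, rng))
    return pool

pool = amplify([[0, 1, 0], [1, 1, 1]], [100, 5000, 90000], target=6, recomb_rate=1e-5)
print(len(pool))  # 6
```

Each new haplotype is added to the pool before the next one is generated, matching the "(*k*+1)st haplotype from *k* haplotypes" wording.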

We wanted to test the effect of the linkage upper bound of the algorithm on real data. Different linkage upper bounds ranging from 1 to 500 kb were checked. For each of the four chromosomes, we used the first 10,000 SNPs (~85 Mb) in 200 cases and 200 controls and applied ^{RAT} with varying values of *c.* The results are presented in figure 5. A linkage upper bound of 75 kb (which corresponds to 9 SNPs, on average) appears to be enough to obtain very accurate evaluation of the *P* value.

*P* value calculated by RAT. Data sets *A*–*D* are the first 10,000 SNPs in chromosomes 1–4, respectively, of 200 cases and 200 controls, which were amplified from 60 unrelated individuals (the CEPH **...**

For a scenario of genomewide association studies that requires typing and checking numerous sites, we used the first 10,000 SNPs of chromosome 1, which span ~84 Mb. We used 1,000 cases and 1,000 controls. For this data set, the running time of RAT for testing disease association of individual SNPs was 361 s (~6 min), compared with the 2.6×10^{6} s (~30 d) needed for SPT.

The contribution of the LD decay property is larger when the data set contains more SNPs. To evaluate it, we measured the running times of RAT while using different linkage upper bounds, with 1,000 cases and 1,000 controls for the 10,000 SNPs of chromosome 1. The permutation phase of RAT takes 7 s when the linkage upper bound is 1,000 kb and <2 s when it is set to 200 kb (fig. 6). Without the use of this property, 265 s are required (a factor of 132). An additional preprocessing time of 96 s is needed in both cases.

## Discussion

The faithful calculation of disease association is becoming more important as more large-scale studies involving thousands of persons and thousands of SNPs are conducted. Testing not only individual SNPs but also haplotypes and loci interactions will further increase this need. Unfortunately, as the size of the data increases, the running time of SPT becomes prohibitively long. In this work, we present an algorithm called “RAT” that dramatically reduces the running time. Our analysis shows that RAT indeed calculates the permutation test *P* value with the same level of accuracy as SPT, but much faster. Our experiments illustrate that our algorithm is faster by 4–5 orders of magnitude on realistic data sets. This vast difference in the running time enables an evaluation of high-significance association for larger data sets, including evaluations of possible loci interactions and haplotypes.

It is important to emphasize that the advantage of RAT over SPT applies only when the sought *P* value is low. Consider a case-control–labeled data set of SNPs, and suppose there is no association with the disease (e.g., *P*=.5). Using SPT, one can halt the test after very few permutations and conclude that no association exists.
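The early-halting behavior can be made concrete with a minimal sketch of a permutation test that stops once enough permuted scores reach the observed one. The function name, the `max_hits` stopping rule, the use of `>=` in the comparison, and the toy `allele_gap` score are illustrative assumptions, not the SPT implementation benchmarked in this article.

```python
import random

def permutation_p_value(score, labels, n_perm, max_hits=None, rng=None):
    """Estimate a permutation P value for `score` on case-control labels.

    Returns (estimate, permutations_used). If `max_hits` is given, sampling
    stops as soon as that many permuted scores reach the observed score.
    Requires n_perm >= 1.
    """
    rng = rng or random.Random()
    observed = score(labels)
    shuffled = list(labels)
    hits = 0
    for used in range(1, n_perm + 1):
        rng.shuffle(shuffled)              # random reassignment of case/control labels
        if score(shuffled) >= observed:
            hits += 1
            if max_hits is not None and hits >= max_hits:
                return hits / used, used   # clearly non-significant: halt early
    return hits / n_perm, n_perm

# Toy data: minor-allele counts at one SNP and case (1) / control (0) labels.
genotypes = [0, 1, 2, 1, 0, 2, 1, 1]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

def allele_gap(lab):
    """Absolute difference in allele count between cases and controls."""
    cases = sum(g for g, l in zip(genotypes, lab) if l == 1)
    controls = sum(g for g, l in zip(genotypes, lab) if l == 0)
    return abs(cases - controls)

estimate, used = permutation_p_value(allele_gap, labels, 10_000,
                                     max_hits=20, rng=random.Random(1))
```

When the data carry no signal, the hit counter fills up quickly and only a handful of permutations are consumed; for a very low *P* value, the loop runs to `n_perm`, which is the regime where a faster method pays off.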

An important reason for achieving high-significance results was presented by Ioannidis et al.,^{32} who asked why different studies on the same genetic association sometimes have discrepant results. Their aim was to assess how often large studies arrive at conclusions different from those of smaller studies and whether this situation arises more often when there is a contradiction between the first study and subsequent works. They examined the results of 55 meta-analyses of genetic association and tested whether the magnitude of the genetic effect differs in large, as opposed to smaller, studies. They showed that, in only 16% of the meta-analyses, the genetic association was significant and the same result was obtained independently by several studies, without bias. In a later work, Ioannidis^{33} discussed possible reasons for bias in relatively small association studies. He argued that, when many research groups conduct similar association studies, the negative results in studies that do not reach a sufficient significance might never be published. Hence, the scientific literature may be biased. It is hard, or maybe impossible, to correct this multiple-testing effect, since a researcher may not be aware of other groups that study the same question. The solution to this problem is to conduct larger association studies, which, one would hope, would yield lower *P* values. In that sense, knowing that the *P* value is below, say, 10^{-2} is not sufficient, and obtaining the most accurate evaluation possible of the *P* value is crucial.

Our procedure also has an advantage in testing a large population for more than a single disease, where different diseases may be associated with the genotypes at different intensities. Here, one also has to correct for testing multiple diseases. Consider a study that addresses 100 diseases. In such a scenario, a *P* value of .01 for a specific phenotype obtained by SPT with 100 permutations is not sufficient. In this case, a more accurate evaluation of the significance of association for each of the phenotypes is required. This can be done either by increasing the number of permutations of SPT, which may be time prohibitive, or by using RAT.
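A back-of-the-envelope calculation shows why 100 permutations fall short. The sketch assumes a Bonferroni correction at a family-wise level of .05 and an estimator whose smallest nonzero value is 1/*N* for *N* permutations; neither choice is prescribed by the text, and `min_permutations` is a hypothetical helper.

```python
from fractions import Fraction
import math

def min_permutations(n_tests, alpha=Fraction(1, 20)):
    """Fewest permutations whose resolution 1/N reaches the Bonferroni threshold."""
    threshold = alpha / n_tests      # per-test significance level
    return math.ceil(1 / threshold)  # need 1/N <= threshold

needed = min_permutations(100)       # 100 diseases tested in one study
```

Under these assumptions, a study of 100 diseases needs on the order of thousands of permutations per phenotype before a single estimate can fall below the corrected threshold.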

Unlike several previous methods, we do not assume any distribution function of the trait, given the SNPs. The random model (adopted from Zhang et al.^{13}) assumes only that the cases and controls are sampled independently from a specific population, without any additional requirements about the distribution. However, even this assumption does not always hold. One of the crucial problems in drawing causal inferences from case-control studies is the confounding caused by the population structure. Differences in allele frequencies between cases and controls may be due to systematic differences in ancestry rather than to association of genes with disease.^{34}^{}^{–}^{36} In this article, this issue is not addressed, and we intend to study it in the future. We believe that this problem can be solved by incorporating methods for population structure inference^{37}^{,}^{38} into RAT.

Using the LD decay property improves the theoretical running time of our method, from *O*(*n*β+*N*_{R}*nm*) to *O*(*n*β+*N*_{R}*nc*). This improvement is meaningful when the tested region is much larger than *c,* the linkage upper bound. In practice, in our experiments, the reduction in the running time due to the importance sampling was much more prominent. We are not aware of a method that can take advantage of LD decay to reduce the running time in SPT. As we show, the importance-sampling approach can readily exploit the LD decay property. Since each drawn permutation in the importance-sampling procedure is induced by a known locus, testing only 2*c* neighboring loci is possible.
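Restricting the test to nearby loci amounts to a window lookup on the SNP coordinates. The helper below is a minimal sketch, assuming sorted base-pair positions and a linkage upper bound expressed in the same units; `neighbors_within` is a hypothetical name.

```python
from bisect import bisect_left, bisect_right

def neighbors_within(positions, locus, bound):
    """Indices of SNPs whose coordinate lies within `bound` of SNP `locus`.

    positions must be sorted in ascending order; the result includes `locus`.
    """
    center = positions[locus]
    lo = bisect_left(positions, center - bound)
    hi = bisect_right(positions, center + bound)
    return range(lo, hi)

# Toy coordinates: only SNPs 1-3 fall within 10 units of SNP 2.
window = list(neighbors_within([0, 10, 20, 30, 100], 2, 10))
```

Only the loci in the returned window would need to be scored for a permutation induced by `locus`; under the LD decay assumption, the remaining loci are treated as independent of it.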

RAT can also expedite association analysis when the phenotypic information available for each individual is more complex. For instance, there may be several additional phenotype columns in the input that describe smoking status, sex, age group, or existence of another specific disease. Obviously, with certain factors one cannot use the property of LD decay, but the speed-up due to the importance-sampling algorithm still applies.

We have focused here on the problem of finding association between a genotype matrix and a binary trait (cases and controls), but our algorithm can easily be adapted to also handle continuous traits. A possible score function for a specific column *j* can be the score used in the ANOVA model, denoted by *F*_{j}. The statistic is max_{j} *F*_{j}, and the *P* value can be calculated by permuting the trait values of individuals, similarly to the binary-traits case. We can use the same methodologies presented here to efficiently calculate the *P* value.

This work improves the methodologies for the upcoming large-scale association problems. We achieve a dramatic reduction in the time complexity, enabling us to evaluate low-probability (and high-significance) associations with many loci, which was previously time prohibitive. Nevertheless, much more research should be done in this direction. If the number of loci is in the tens of thousands, testing all pairwise interactions is too time consuming, even with our algorithm. If one wants to examine *k* loci interactions, the running time increases exponentially with *k* and becomes prohibitive, even for a relatively small number of SNPs. Additional assumptions, such as nonnegligible marginal effects,^{20} may help to reduce complexity. We hope that, eventually, combining such assumptions with faster algorithms like RAT may facilitate better analysis of very large association studies.

## Acknowledgments

R.S. was supported by a grant from the German-Israeli Fund (grant 237/2005). We thank Jacqui Beckmann (Lausanne), Irit Gat-Viks, Isaac Meilijson (Tel Aviv University), and Jonathan Marchini (Oxford University) for fruitful discussions.

## Appendix A

#### Theoretical Upper Bounds on the Accuracy

We use the SD of the estimated *P* value as a measure of accuracy for both algorithms. Obviously, in both algorithms, when more permutations are sampled, the SD is lower. Here, we provide mathematical analysis that relates the number of permutations, the data parameters, and the accuracy.

For SPT, given that *N*_{S} permutations are performed, if none of the permutations yields a score *S*(*d*_{i})>*S*(*d*), we can evaluate the SD by

which implies that, to achieve an accuracy of ε, ~1/ε permutations are needed. In particular, when an accuracy equal to the true *P* value *p* is desired, *N*_{S}≈1/*p*.
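The relation *N*_{S}≈1/*p* can be checked numerically under the usual binomial model, in which each permutation exceeds the observed score independently with probability *p*. That model and the helper name `spt_sd` are assumptions of this sketch; it covers only the case in which the target accuracy equals *p*.

```python
import math

def spt_sd(p, n_perm):
    """Binomial SD of the fraction of permutations exceeding the observed score."""
    return math.sqrt(p * (1.0 - p) / n_perm)

p = 1e-4
sd_at_one_over_p = spt_sd(p, round(1 / p))   # N_S = 1/p permutations
ratio = sd_at_one_over_p / p                 # close to 1 when p is small
```

With *N*_{S}=1/*p* permutations, the SD is *p* times the square root of (1-*p*), so the ratio approaches 1 as *p* shrinks.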

For the RAT algorithm, let =||, and let *c*_{i} be *Q*(*d*_{i}), where *d*_{i} is the *i*th permutation out of all possible permutations in . Let *Q* denote the random variable *Q*(*e*), where *e* is a permutation sampled from .

The expectation of 1/*Q* is

and the variance of the calculated *P* value is

Observe that

Substituting equation (A3) into equation (A2) yields

where the last inequality follows from .

Without additional assumptions, the expectation of 1/*Q* is ≥1/*m*. Substituting in equation (A4), we have

Hence, to obtain accuracy *p, m* permutations are needed.

This bound can be improved if we exploit the LD decay property of biological data. Since LD decay is limited to 100 kb (see the “Real Biological Data” subsection in the “Results” section) and the SNP density is, at most, 1:300 bases, *c*<350 in practice. With the assumption of a linkage upper bound *c* for a specific locus *l*, there are, at most, 2*c* loci that may depend on *l.* For each of the other loci, the probability that its score with a permutation of the vector at locus *l* is >*S*(*d*) is *p.* Hence, we can write

Since 1/[*E*(1/*Q*)]≤*E*(*Q*) always holds true because of Jensen’s inequality, when substituting equation (A5) in equation (A4), we get

Equation (A6) establishes the connection between the data’s parameters and the accuracy. A prominent difference from the accuracy of SPT, described in equation (A1), is the strong dependence on *p.* Interestingly, when all other data parameters and *N*_{R} are fixed, the smaller *p* is, the more accurate the RAT algorithm is. In other words, as *p* decreases, the convergence rate of RAT increases.

Arranging equation (A6) differently,

If we set the required accuracy, SD(*p*_{R}), to be *p,* we have

Hence, to search for *P* values as low as *p,* the number of required permutations is <(2*c*+*mp*). In that case, the time complexity of RAT can be written as *O*(*n*β+*nc*^{2}+*pcnm*). The theoretical complexity of the algorithms is summarized in table A1.
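The chain of bounds leading to fewer than 2*c*+*mp* permutations can be retraced in a few lines. The sketch below rests on assumptions: a permutation *d*_{i} whose score reaches *S*(*d*) is drawn with probability proportional to *c*_{i}=*Q*(*d*_{i})≥1, and the estimate *p*_{R} is a constant *A* times the mean of 1/*Q* over *N*_{R} independent draws, so that *p*=*A*·*E*(1/*Q*). The symbols *M* (the number of such permutations) and *A* are notation introduced here for illustration.

```latex
\begin{align*}
E\!\left(\tfrac{1}{Q}\right)
  &= \sum_{i=1}^{M}\frac{c_i}{\sum_k c_k}\cdot\frac{1}{c_i}
   = \frac{M}{\sum_k c_k},\\[4pt]
E\!\left(\tfrac{1}{Q^{2}}\right)
  &= \frac{1}{\sum_k c_k}\sum_{i=1}^{M}\frac{1}{c_i}
   \;\le\; \frac{M}{\sum_k c_k}
   = E\!\left(\tfrac{1}{Q}\right)
   \qquad (c_i \ge 1),\\[4pt]
\operatorname{Var}(p_R)
  &= \frac{A^{2}}{N_R}\left[E\!\left(\tfrac{1}{Q^{2}}\right)-E\!\left(\tfrac{1}{Q}\right)^{2}\right]
   \;\le\; \frac{p^{2}}{N_R}\left[\frac{1}{E(1/Q)}-1\right],\\[4pt]
\frac{1}{E(1/Q)} &\le E(Q) \le 2c+mp
  \;\Longrightarrow\;
  \mathrm{SD}(p_R) < p\,\sqrt{\frac{2c+mp}{N_R}} .
\end{align*}
```

Setting SD(*p*_{R})=*p* in the last line gives *N*_{R}<2*c*+*mp*; replacing the LD bound with *E*(1/*Q*)≥1/*m* in the third line gives the weaker requirement of about *m* permutations.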

#### Proof of Irreducibility of the T-Sampler Algorithm

We provide a proof that the T-sampler algorithm presented in the subsection “An approximation algorithm” (in the “Methods” section) is irreducible. Consider two tables, *T*_{1} and *T*_{2}, from the sample space, such that π(*T*_{1})>0 and π(*T*_{2})>0. Our goal is to show that there is a path with probability >0 between *T*_{1} and *T*_{2}.

If both *T*_{1} and *T*_{2} are boundary tables, then *T*_{2}∈*N*_{g}(*T*_{1}) and, hence, *J*(*T*_{1},*T*_{2})>0, and there is positive probability to move from *T*_{1} directly to *T*_{2}.

Suppose that, without loss of generality, *T*_{1} is not a boundary table. In that case, there are at least two nonextreme rows *a* and *b* in *T*_{1}. There are two tables in *N*_{g}(*T*_{1}) that are created by legal tweaks on the submatrix

We use *T*_{x} to denote the table in which *T*_{a,0} is increased by one and *T*_{y} to denote the other table. The difference in the Pearson score of the tables *T*_{x} and *T*_{1} is

*S*(*T*_{x})-*S*(*T*_{1})=δ+ψ,

where

and

Similarly, *S*(*T*_{y})-*S*(*T*_{1})=-δ+ψ.

Since ψ>0, at least one of the expressions *S*(*T*_{x})-*S*(*T*_{1}) and *S*(*T*_{y})-*S*(*T*_{1}) is positive. Suppose that, without loss of generality, δ>0. Then, *S*(*T*_{x})>*S*(*T*_{1})≥*S*(*d*), and π(*T*_{x})>0. This means that the probability that the sampler moves from *T*_{1} to *T*_{x} is positive.

If rows *a* and *b* still do not have extreme values in *T*_{x}, the exact same procedure can be repeated again and again, until we obtain a table *T*^{*}_{1} in which at least one of these rows has an extreme value.

Suppose α steps were performed, generating a sequence of tables *T*_{1},*T*_{2},…,*T*_{α+1}=*T*^{*}_{1}. A straightforward inductive argument shows that, for all *k,* *S*(*T*_{k+1})-*S*(*T*_{k})=δ+2*k*ψ+ψ>0. The last inequality follows by the assumption that δ>0. Hence, all the tables in the sequence have positive probability. The same argument is repeated with additional nonextreme rows until a boundary table is reached.
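The growth of the score along a sequence of tweaks can be verified numerically. The sketch below assumes a table with one row per genotype class and two columns, the standard Pearson score over expected counts, and a tweak that adds one to cell (*a*,0) and (*b*,1) and subtracts one from (*a*,1) and (*b*,0) so that all margins are preserved. Under these assumptions, ψ is the sum of the reciprocal expected counts of the four affected cells; the function names are hypothetical.

```python
def margins(table):
    """Row totals, column totals, and grand total of a contingency table."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    return rows, cols, sum(rows)

def pearson_score(table):
    """Pearson chi-square score: sum of (observed - expected)^2 / expected."""
    rows, cols, total = margins(table)
    score = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = rows[i] * cols[j] / total
            score += (obs - exp) ** 2 / exp
    return score

def tweak(table, a, b):
    """Shift one count within rows a and b; row and column totals are unchanged."""
    new = [list(r) for r in table]
    new[a][0] += 1; new[a][1] -= 1
    new[b][0] -= 1; new[b][1] += 1
    return new

def psi(table, a, b):
    """Sum of 1/expected over the four cells touched by the tweak."""
    rows, cols, total = margins(table)
    return sum(total / (rows[i] * cols[j]) for i in (a, b) for j in (0, 1))

def score_steps(table, a, b, steps):
    """Score differences S(T_{k+1}) - S(T_k) over successive tweaks."""
    diffs, current = [], table
    for _ in range(steps):
        nxt = tweak(current, a, b)
        diffs.append(pearson_score(nxt) - pearson_score(current))
        current = nxt
    return diffs

start = [[30, 20], [25, 25], [10, 40]]
diffs = score_steps(start, 0, 1, 5)
gaps = [later - earlier for earlier, later in zip(diffs, diffs[1:])]
```

Each entry of `gaps` equals 2ψ up to rounding, which matches the linear growth δ+2*k*ψ+ψ used in the inductive argument: once the first step increases the score, every further step in the same direction does too.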

Consequently, there is a path with positive probability from any nonboundary table to some boundary table. Since, by definition, transitions between boundary tables have positive probability, it follows that there is a path of positive probability between any two tables with π(*T*)>0, which proves the irreducibility of the sampler.

### Table A1.

| Algorithm | Preprocessing Phase | Permutations Phase | No. of Permutations^{a} | Total Running Time^{a} |
|---|---|---|---|---|
| SPT | … | Θ(N_{S}nm) | 1/p | Θ(nm/p) |
| RAT (no assumptions) | O(nβ) | O(N_{R}nm) | m | O(nβ+nm^{2}) |
| RAT (LD decay assumption) | O(nβ) | O(N_{R}nc) | 2c+mp | O(nβ+nc^{2}+pcnm) |

Note: For RAT with LD decay, as the true *P* value decreases, fewer permutations are needed, and the relative weight of the preprocessing phase increases.

^{a} Needed to achieve accuracy *p*.
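The asymptotic entries of table A1 can be turned into rough operation counts to see where the crossover between the methods lies. The functions below drop all constant factors, and the value of β in the usage example is a placeholder chosen purely for illustration, so the numbers indicate orders of magnitude only.

```python
def spt_operations(n, m, p):
    """Theta(nm/p): about 1/p permutations, each costing n*m."""
    return n * m / p

def rat_operations(n, m, p, c, beta):
    """O(n*beta + n*c^2 + p*c*n*m): RAT with the LD decay assumption."""
    return n * beta + n * c ** 2 + p * c * n * m

# Illustrative sizes: 2,000 individuals, 10,000 SNPs, c = 9 SNPs, placeholder beta.
n, m, c, beta = 2000, 10_000, 9, 100
low_p = (spt_operations(n, m, 1e-6), rat_operations(n, m, 1e-6, c, beta))
high_p = (spt_operations(n, m, 0.5), rat_operations(n, m, 0.5, c, beta))
```

For the low *P* value, the SPT count exceeds the RAT count by many orders of magnitude, whereas, for *P*=.5, the ordering is reversed, in line with the remark in the Discussion that the advantage applies only when the sought *P* value is low.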

## References

*APOE* in Alzheimer’s disease. Am J Hum Genet 67:383–394

**American Society of Human Genetics**

A Fast Method for Computing High-Significance Disease Association in Large Population-Based Studies. American Journal of Human Genetics. Sep 2006; 79(3):481.
