# Multimarker analysis and imputation of multiple platform pooling-based genome-wide association studies

^{1,2}, Waibhav D. Tembe^{1}, Szabolcs Szelinger^{1}, Margot Redman^{1}, Dietrich A. Stephan^{1}, John V. Pearson^{1}, Stanley F. Nelson^{2} and David Craig^{1,*}

^{1}Translational Genomics Research Institute (TGen), Phoenix, AZ 85004 and

^{2}Department of Computer Science, University of California, Los Angeles CA 90095-7088, USA

^{*}To whom correspondence should be addressed.

## Abstract

**Summary:** For many genome-wide association (GWA) studies individually genotyping one million or more SNPs provides a marginal increase in coverage at a substantial cost. Much of the information gained is redundant due to the correlation structure inherent in the human genome. Pooling-based GWA studies could benefit significantly by utilizing this redundancy to reduce noise, improve the accuracy of the observations and increase genomic coverage. We introduce a measure of correlation between individual genotyping and pooling, under the same framework that *r*^{2} provides a measure of linkage disequilibrium (LD) between pairs of SNPs. We then report a new non-haplotype multimarker multi-loci method that leverages the correlation structure between SNPs in the human genome to increase the efficacy of pooling-based GWA studies. We first give a theoretical framework and derivation of our multimarker method. Next, we evaluate simulations using this multimarker approach in comparison to single marker analysis. Finally, we experimentally evaluate our method using different pools of HapMap individuals on the Illumina 450S Duo, Illumina 550K and Affymetrix 5.0 platforms for a combined total of 1 333 631 SNPs. Our results show that use of multimarker analysis reduces noise specific to pooling-based studies, allows for efficient integration of multiple microarray platforms and provides more accurate measures of significance than single marker analysis. Additionally, this approach can be extended to allow for imputing the association significance for SNPs not directly observed using neighboring SNPs in LD. This multimarker method can now be used to cost-effectively complete pooling-based GWA studies with multiple platforms across over one million SNPs and to impute neighboring SNPs weighted for the loss of information due to pooling.

**Contact:** dcraig@tgen.org

**Supplementary information:** Supplementary data are available at *Bioinformatics* online.

## 1 INTRODUCTION

Genome-wide association (GWA) studies have emerged as a new and powerful tool to detect genetic predisposition to complex diseases. Frequently, upwards of thousands of individuals are genotyped for several hundred thousand SNPs in order to find the single most significant SNP using a genotype or an allele-based χ^{2}-test. Considering the cost of such an experiment is several hundred thousand dollars with no guarantee of success, it is of high importance to identify cost-effective methods for completing GWA studies. Pooling genomic DNA and assaying on a few replicate arrays is such an approach, and it has yielded new candidate associations in situations where individual genotyping of samples was not possible (Brown *et al.*, 2008; Hanson *et al.*, 2007; Johnson, 2007; McGhee *et al.*, 2005; Melquist *et al.*, 2007; Papassotiropoulos *et al.*, 2006; Steer *et al.*, 2007).

Genotype multiplexing has reached the level of one million largely non-redundant SNPs for two separate technologies developed in parallel, Illumina and Affymetrix. For many populations, such as Caucasian or Asian, the additional information gained from genotyping 1 000 000 or more SNPs versus 500 000 SNPs is limited and largely redundant due to high correlation (LD) between neighboring SNPs. However, in the context of a pooling-based GWA study, this redundancy in coverage should theoretically allow for a reduction in assay noise and a substantial improvement in performance, thus increasing the number of true associations found and reducing the number of false positives. Furthermore, one should be able to utilize multiple platforms within a single study in order to improve overall resolution and increase genomic coverage. To date, numerous papers have examined the efficacy of pooling-based GWA studies (Barratt *et al.*, 2002; Craig *et al.*, 2005; Pearson *et al.*, 2007; Yang *et al.*, 2005), including study design (Barratt *et al.*, 2002; Sham *et al.*, 2002; Zou and Zhao, 2005; Zuo *et al.*, 2006), accounting for sources of errors (Barratt *et al.*, 2002; Macgregor, 2007; Yang *et al.*, 2005; Zou and Zhao, 2004), and cost/power analysis (Hinds *et al.*, 2004; Law *et al.*, 2004; Macgregor, 2007; Meaburn *et al.*, 2005; Pearson *et al.*, 2007; Yang *et al.*, 2006; Zuo *et al.*, 2006). Some of these studies have explored the use of multimarker statistics (Hinds *et al.*, 2004; Kirkpatrick *et al.*, 2007; Wang *et al.*, 2003), though in large part the effectiveness of these approaches has not been explored in the context of a GWA study. Methods have been developed to leverage the correlation structure between SNPs with respect to haplotype analysis and individual genotyping (Kirkpatrick *et al.*, 2007; Zaitlen *et al.*, 2007), but methods for pooling-based studies have been largely limited. Additionally, methods have been developed that are able to impute or estimate unobserved genotypes, increasing the power to detect association (Dai *et al.*, 2006; Marchini *et al.*, 2007; Servin and Stephens, 2007). Nevertheless, their application to pooling-based studies where individual genotypes are unknown has not been fully explored.

To this end, we develop and evaluate a new method of analysis and demonstrate the effectiveness of this approach using 1 333 631 SNPs combined from the Affymetrix 5.0, Illumina 550K and Illumina 450S Duo arrays. Our approach leverages the correlation structure between SNPs to reduce Type I and Type II errors, combines multiple platforms, and increases the accuracy and power of pooling-based GWA studies. We introduce the concept of a pooling correlation coefficient, the square root of *r*_{p}^{2}, analogous to the LD correlation coefficient (the square root of *r*^{2}) in individual genotyping, where *r*_{p}^{2} is the amount of information recovered from an allele-based test of association from individual genotype data by using a pooling-based method. Our multimarker test statistic utilizes both the pooling correlation coefficient and the correlation or LD between neighboring SNPs to combine data from multiple neighboring SNPs and from multiple platforms (Affymetrix and Illumina) within a single pooling-based GWA study. Therefore, we more accurately determine the significance of association for each SNP and obtain greater coverage of the human genome. Additionally, our method lends itself to imputation, where an unobserved SNP is given a significance value based on directly observed neighboring SNPs in LD. Combining the Illumina 450S Duo and Illumina 550K v3 platforms, we are able to accurately impute 748 348 additional SNPs from the HapMap that are not present on any of the three platforms. In genomic regions of high LD, the number of proxies for an imputed SNP can exceed 50, but an imputed SNP typically has only a few (fewer than 5) proxies. Therefore, as microarray technologies probe more SNPs, our method will have more observations with which to reduce noise, further increasing the accuracy of the observations.

## 2 METHODS

### 2.1 Experimental

Two pools (A and B) were constructed (see Supplementary Table 1). Both pools (cohorts) were run in duplicate on Illumina 550K v3 arrays, Illumina 450S Duo arrays and Affymetrix 5.0 arrays (see Supplementary Methods).

### 2.2 Derivation of a pooling-based test statistic

We assume Hardy–Weinberg equilibrium (HWE). Let the probability of having allele *A* (the population frequency) be *p*_{A} and let *q*_{A}=1−*p*_{A}, so that (*p*_{A})^{2}+2*p*_{A}*q*_{A}+(*q*_{A})^{2}=1. We denote variables belonging to the cases with a ‘+’ and variables belonging to the controls with a ‘−’. For individual genotyping, suppose we observe *N*_{A} *A* alleles, where *N*_{A}=*N*_{A}^{+}+*N*_{A}^{−}, *N*_{A}^{+} is the number of case *A* alleles, and *N*_{A}^{−} is the number of control *A* alleles. Then from individual genotyping the frequency or probability of allele *A* in the cases is *p*_{A}^{+}=*N*_{A}^{+}/(*N*_{A}^{+}+*N*_{a}^{+}) and in the controls is *p*_{A}^{−}=*N*_{A}^{−}/(*N*_{A}^{−}+*N*_{a}^{−}), where *a* is the other allelic variant. We assume in practice that *p*_{A}^{−}≈*p*_{A} since typically *p*_{A} is not known. To test for association we used a two-sample test of proportions, which is equivalent to a *t*-test under HWE, as shown in Equation (1) where *T*_{A} is the test statistic.

$$T_A = \frac{p_A^{+} - p_A^{-}}{\sqrt{\mathrm{Var}\left(p_A^{+} - p_A^{-}\right)}} \qquad (1)$$

Under the null hypothesis, we have the expected value *E*(*p*_{A}^{+}−*p*_{A}^{−})=0, and the variance *Var*(*p*_{A}^{+}−*p*_{A}^{−})=*p*_{A}^{+}(1−*p*_{A}^{+})/*N*_{A}^{+}+*p*_{A}^{−}(1−*p*_{A}^{−})/*N*_{A}^{−}. If we approximate *Var*(*p*_{A}^{+}−*p*_{A}^{−}) with 2*p*_{A}*q*_{A}/*N*_{A} then *T*_{A} is expected to follow the normal distribution under HWE. In a pooling-based estimate of allele frequency, we do not observe the allele counts but instead indirectly observe an allelic frequency for each pool by measuring pooled amplified genomic DNA, labeled with a fluorophore, and hybridized to an oligonucleotide probe, though not in that order. Typically, a predicted allelic frequency is calculated based on the observed relative probe intensity of the oligonucleotide probes interrogating both SNP alleles. Here, we are more concerned with predicting allele frequency differences than with accurately predicting the allele frequencies themselves, as will become evident from the definition of our pooling test statistic below. We define *p̂*_{A}^{+}, *p̂*_{A}^{−} and *p̂*_{A} as the respective measured frequencies for the *A* allele in the case, control, and combined populations through pooling. We consequently define an analogous test statistic, *T̂*_{A}, for our measurement of pooled DNA:

$$\hat{T}_A = \frac{\hat{p}_A^{+} - \hat{p}_A^{-}}{\sqrt{\dfrac{2 p_A q_A}{N_A} + \dfrac{2\sigma^{2}}{M}}} \qquad (2)$$

Here, we have that 2σ^{2}/*M* is the variance of *N*_{A} alleles with *M* replicate measurements, where σ^{2} is the measurement variance. In order to simplify our discussion in later sections, we denote the total variance from the sample mean with a defined set of individuals as *V*_{t}=*V*_{s}+*V*_{p}, where *V*_{s}=2*p*_{A}*q*_{A}/*N*_{A} and *V*_{p}=2σ^{2}/*M*. There potentially exist other sources of bias and variance, including systematic biases in the measured allele frequencies, additional sources of variance from the arrays, use of multiple sub-pools, and experimental variance. Previous studies have investigated the relative sources of variation in pooled experiments and have shown that the variance from the measured arrays is significantly larger than all other sources of variance (Barratt *et al.*, 2002; Macgregor, 2007).
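As a rough illustration of the two statistics discussed in this section, the following Python sketch computes an allele-based test of proportions and its pooled analogue using the variance terms *V*_{s}=2*p*_{A}*q*_{A}/*N*_{A} and *V*_{p}=2σ^{2}/*M*. The function and argument names are hypothetical, the control frequency stands in for *p*_{A} as the text suggests, and the sketch is not the authors' implementation.

```python
import math


def sample_variance_term(p_a, n_a):
    """V_s = 2 * p_A * q_A / N_A, the sampling variance of the frequency difference."""
    return 2.0 * p_a * (1.0 - p_a) / n_a


def individual_statistic(p_case, p_ctrl, n_a):
    """Allele-based statistic from individually genotyped frequencies.

    The control frequency is used in place of the unknown population
    frequency p_A (the approximation p_A^- ~ p_A mentioned in the text).
    """
    return (p_case - p_ctrl) / math.sqrt(sample_variance_term(p_ctrl, n_a))


def pooled_statistic(phat_case, phat_ctrl, n_a, sigma2, m):
    """Pooled analogue: the denominator adds V_p = 2 * sigma^2 / M."""
    v_s = sample_variance_term(phat_ctrl, n_a)
    v_p = 2.0 * sigma2 / m
    return (phat_case - phat_ctrl) / math.sqrt(v_s + v_p)


# Made-up numbers, chosen only to show the effect of measurement variance.
t_ind = individual_statistic(0.6, 0.5, 100)
t_pool = pooled_statistic(0.6, 0.5, 100, sigma2=0.0025, m=2)
```

For the same frequency difference the pooled statistic is smaller in magnitude, because the measurement variance inflates the denominator.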

### 2.3 Derivation of a pooling-based quality control statistic (*r*_{p}^{2})

To mathematically investigate the relationship between individual genotyping and pooling-based tests, we introduce a measure of correlation between individual genotyping and pooling under the same framework that *r*^{2} provides a measure of LD between pairs of SNPs. Briefly, we compute the sum of squared deviations to determine the correlation between the pooling test statistic and a modified individual genotyping test statistic that has a shifted mean. The shift in the mean comes from the introduction of errors due to pooling. Next, we repeat the calculation to determine the correlation between the individual genotyping test statistic with a shifted mean and the standard individual genotyping test statistic *T*_{A}. Finally, we combine these correlations to obtain a theoretical pooling correlation coefficient that is simply the correlation between the pooling test statistic and the individual genotyping test statistic *T*_{A}. A detailed derivation can be found in the Supplementary Methods, where we derive the relationship:

We can therefore view pooling-based experiments according to their theoretical pooling correlation coefficient. This value could be used as a measure of the ability of a SNP to resolve allelic associations. It also allows us to correlate our test statistics with individual genotyping, which is critical for the development of a multimarker statistic. An alternative viewpoint is that the pooling correlation gives us a measure of the loss of power due to pooling when compared to individual genotyping. For clarity, and since it has a similar theoretical basis to the term *r*^{2} for LD, we refer to this value as *r*_{p}^{2} in the discussion sections.
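To make the idea of a pooling correlation concrete, the sketch below assumes that *r*_{p}^{2} takes the form *V*_{s}/(*V*_{s}+*V*_{p}), which is the usual squared correlation between a quantity and a noisy measurement of it with independent additive error. This form is an assumption made for illustration; the relationship actually derived in the Supplementary Methods may differ. Function names are hypothetical.

```python
def pooling_r2(v_s, v_p):
    """Assumed form of r_p^2: the share of total variance due to sampling."""
    return v_s / (v_s + v_p)


def replicates_for_target(v_s, sigma2, target_r2):
    """Replicates M for which v_s / (v_s + 2*sigma2/M) equals target_r2.

    Returned as a float; in practice one would round up to a whole array.
    """
    if not 0.0 < target_r2 < 1.0:
        raise ValueError("target_r2 must lie strictly between 0 and 1")
    return 2.0 * sigma2 * target_r2 / (v_s * (1.0 - target_r2))


# Made-up variance terms: V_s = 0.005 and V_p = 0.0025.
r2 = pooling_r2(0.005, 0.0025)
m_needed = replicates_for_target(0.005, 0.0025, 0.8)
```

Under this assumed form, adding replicate arrays shrinks *V*_{p} and moves *r*_{p}^{2} toward 1, which is one way to read the statement that the coefficient measures the power lost to pooling.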

### 2.4 Development of a multimarker test statistic

To develop a multimarker test statistic for pooling-based GWA studies, we use the previously derived pooling correlation coefficient, the square root of *r*_{p}^{2}, and the measured LD between two different SNPs, measured as *r*^{2} or the coefficient of determination between a typed and an un-typed marker. From indirect association, we know that the power of observing an association at marker *A* for a causal mutation at marker *B* is simply scaled by the correlation between SNP *A* and SNP *B* (Pritchard and Przeworski, 2001). Combining this correlation with our pooling correlation, we create a multimarker test statistic that combines the information from neighboring SNPs to give more accurate and meaningful association values.

It has been previously shown that the test statistics of two neighboring SNPs *A* and *B* are equivalent when scaled by the correlation *r*_{AB}^{2} between the two SNPs (Pritchard and Przeworski, 2001). We give a formal derivation in the Supplementary Methods. Now, suppose we have a causal mutation in SNP *A*, and a set *S*_{A} of other SNPs in LD with *A*. We distinguish three statistics for SNP *A*: the test statistic for the true genotypes but with a shifted mean as above, the pooling test statistic, and the multimarker test statistic. We propose the following multimarker test statistic:

where *r*_{AB}^{2} is the coefficient of determination (or LD) between SNP *A* and SNP *B* and *r*_{pB}^{2} is the square of the pooling correlation for SNP *B*. Essentially, using the square root of *r*_{pB}^{2} and the square root of *r*_{AB}^{2}, we transform multiple indirect observations of SNP *A* into equivalent measurements and take the weighted average of those observations. Note that if we assume σ^{+} ≈ σ^{−} then:

Otherwise, we have:

To compute these quantities in practice, we estimate *V*_{s}=2*p*_{A}*q*_{A}/*N*_{A} and *V*_{p}=2σ^{2}/*M*. We use the approximation *p*_{A}^{−}≈*p*_{A} for computing *V*_{s}. To estimate σ^{2}, we simply sum the variance from each cohort, where the variance from a cohort is the sum of the variance between the microarrays in that cohort and the variance within each microarray. In practice, the number of individuals within each cohort (or pool) may not be equal, which is adjusted for by substituting 2*N*_{A}^{+}*N*_{A}^{−}/(*N*_{A}^{+}+*N*_{A}^{−}) for *N*_{A}.
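The weighted averaging described in this section can be sketched in Python as follows. Each neighboring SNP's pooling statistic is divided by the square roots of its LD and pooling correlations to put it on the scale of SNP *A*, and the rescaled values are then averaged. The weights used here (the product *r*_{AB}^{2}*r*_{pB}^{2}) and all names are assumptions for illustration, not the weights of the authors' equation; the sketch also assumes alleles are oriented so that all correlations are positive.

```python
import math


def effective_n(n_case, n_ctrl):
    """Unequal-pool adjustment from the text: 2 * N+ * N- / (N+ + N-)."""
    return 2.0 * n_case * n_ctrl / (n_case + n_ctrl)


def multimarker_statistic(observations):
    """Weighted average of rescaled pooling statistics.

    observations: iterable of (pooling_statistic, r2_ld, r2_pool), one entry
    per SNP B in S_A. SNP A itself can be included with r2_ld = 1.0.
    """
    numerator = 0.0
    denominator = 0.0
    for t_pool, r2_ld, r2_pool in observations:
        scale = math.sqrt(r2_ld) * math.sqrt(r2_pool)
        if scale == 0.0:
            continue  # an uninformative proxy contributes nothing
        equivalent = t_pool / scale   # observation on the scale of SNP A
        weight = r2_ld * r2_pool      # assumed weight, for illustration only
        numerator += weight * equivalent
        denominator += weight
    if denominator == 0.0:
        raise ValueError("no informative observations")
    return numerator / denominator


# SNP A observed directly, plus one proxy at r^2 = 0.25 (made-up values).
t_mm = multimarker_statistic([(2.0, 1.0, 1.0), (1.0, 0.25, 1.0)])
```

In this toy input the proxy's statistic of 1.0 rescales to 2.0, so both observations agree and the combined value is 2.0; with noisy data the weights decide how much each proxy is trusted.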

### 2.5 Imputation using the multimarker test statistic

Imputing the significance of association for SNPs that are not directly observed is achieved by using the derived multimarker test statistic. For a given unobserved SNP *A*, we simply have the set *S*_{A} of other observed SNPs in LD with *A*, excluding the (unobserved) test statistic for SNP *A* itself. The SNPs in *S*_{A} simply act as proxies for SNP *A*. The main advantage of using the multimarker test statistic is that we have multiple proxies from which we measure significance. Modifying Equation (4), we obtain the following multimarker test statistic for SNP *A*:

Intuitively, as the size of *S*_{A} increases so does the accuracy of the multimarker since we then have more than one proxy for the given SNP. The variance may increase as well but is determined by the accuracy of the pooling correlation and LD estimates.

### 2.6 Combining multiple platforms using the multimarker test statistic

The multimarker test statistic can also be used to combine data from multiple SNP microarray platforms, even when the platforms contain common SNPs. To combine the data we first calculate the pooling test statistic and pooling correlation for each SNP and each platform separately. Let *B*_{i} denote a SNP *B* in the set *S*_{A} as measured on the *i*-th microarray platform. Then from Equation (4), the multimarker test statistic is simply:

If SNP *A* is not directly observed, we can impute SNP *A* from observations on multiple platforms with the following test statistic:
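A hedged Python sketch of how observations from several platforms might be gathered for one target SNP, covering both the directly observed and the imputed case. Every (SNP, platform) pair is treated as a separate observation, so a SNP present on two platforms contributes twice. The weights (*r*_{AB}^{2}*r*_{pB}^{2}) are an assumption for illustration, and the data layout (`platform_stats`, `ld_with_target`), the `min_proxies` argument and the SNP names are hypothetical.

```python
import math


def combine_platforms(target, platform_stats, ld_with_target, min_proxies=1):
    """Combine pooling statistics for `target` across platforms.

    platform_stats: {platform: {snp: (pooling_statistic, pooling_r2)}}
    ld_with_target: {snp: r2 with the target SNP}; the target itself,
        when it was measured, counts with r2 = 1.0.
    Returns None when fewer than `min_proxies` usable observations exist.
    """
    numerator = 0.0
    denominator = 0.0
    used = 0
    for stats in platform_stats.values():
        for snp, (t_pool, r2_pool) in stats.items():
            r2_ld = 1.0 if snp == target else ld_with_target.get(snp, 0.0)
            weight = r2_ld * r2_pool  # assumed weight, for illustration only
            if weight <= 0.0:
                continue
            # weight * (t_pool / sqrt(weight)) simplifies to sqrt(weight) * t_pool
            numerator += math.sqrt(weight) * t_pool
            denominator += weight
            used += 1
    if used < max(min_proxies, 1):
        return None
    return numerator / denominator


# "rs_a" is unobserved; "rs_b" is a perfect proxy measured on two platforms.
stats = {
    "illumina": {"rs_b": (1.8, 0.81)},
    "affymetrix": {"rs_b": (1.6, 0.64), "rs_c": (0.5, 0.49)},
}
ld = {"rs_b": 1.0, "rs_c": 0.0}
imputed = combine_platforms("rs_a", stats, ld)
```

Because the noisier platform carries a lower pooling correlation, its observation is both rescaled more strongly and weighted less, which is the mechanism that lets platforms of different quality share one analysis.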

## 3 RESULTS

To experimentally evaluate the efficacy of our multimarker method, we used the HapMap dataset to compare individual genotyping and pooling under an example GWA study. From the HapMap project, we are able to retrieve the genotypes for the CEU population. We randomly split CEU trios into two separate pools, consisting of 41 individuals in pool A and 47 individuals in pool B, to create a model GWA study in which the genotypes of every individual were known. Due to sample quality, we excluded one individual from a given trio from each pool. Both pools were run in duplicate on Illumina 550K v3 arrays, Illumina 450S Duo arrays and Affymetrix 5.0 arrays. For each microarray, we removed the lowest 1% of raw intensity values and normalized the microarray by dividing by the mean channel intensity.
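The preprocessing step in the last sentence can be sketched as below. This is a minimal reading of it, assuming the trimming is applied per channel and that the mean is taken over the values that remain after trimming; neither detail is stated in the text, and the function name is hypothetical.

```python
def normalize_channel(intensities, drop_percent=1):
    """Drop the lowest `drop_percent` % of values, then divide by the mean.

    The original order of the retained values is preserved.
    """
    n_drop = len(intensities) * drop_percent // 100
    order = sorted(range(len(intensities)), key=lambda i: intensities[i])
    dropped = set(order[:n_drop])
    kept = [x for i, x in enumerate(intensities) if i not in dropped]
    mean = sum(kept) / len(kept)
    return [x / mean for x in kept]


raw = [float(v) for v in range(200, 0, -1)]  # 200 made-up intensities
normalized = normalize_channel(raw)
```

After this step every channel has mean 1, so intensities from different arrays are on a comparable scale before allele frequencies are predicted.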

We were able to probe 504 604 SNPs on the Illumina 550K v3 arrays, with 487 723 (~96.6%) of those SNPs having associated genotypes in the HapMap dataset. On the Illumina 450S Duo arrays, we were able to probe 510 506 SNPs, with 493 495 (96.7%) of those SNPs having associated genotypes in the HapMap dataset. Finally, we were able to probe 440 729 SNPs on the Affymetrix 5.0 arrays, with 427 254 (~96.9%) of those SNPs having associated genotype information in the HapMap dataset. There were two replicates for each pool and platform.

To evaluate our multimarker method, we used all SNPs on the arrays, filtering out those that could not generate a pooling test statistic due to errors or insufficient data. However, it has been found that there is an enrichment of false positives due to genotyping error among the most significant SNPs when individually genotyping (Hua *et al.*, 2007). In order to accurately and fairly assess our approach, it is necessary to remove these false positives. In other words, the fact that a SNP is identified as the single most associated SNP by individual genotyping does not mean that the result is free of a calling problem, copy number variant or assay problem. While there is no perfect method to screen out SNPs that give rise to false positives, the most accepted approach is a series of filters. Thus, only SNPs passing the following filters, applied in successive order, were used for evaluation:

- All SNPs that had an individual genotyping minor allele frequency >0.05.
- All SNPs that had less than two no calls in both case and control pools, respectively, with HapMap genotypes.
- All SNPs that when tested for Hardy–Weinberg equilibrium with a χ^{2}-test had a *P*-value ≥ 0.01 across cohorts.
- Only autosomal SNPs were used.
- All SNPs that had at least one other SNP in LD with value of *R*^{2}≥0.8.
- All SNPs that had genotype data in the HapMap.
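The filters above can be expressed as an ordered chain of checks. The Python sketch below is one possible encoding; the record fields and function name are hypothetical, and the thresholds are copied from the list.

```python
from dataclasses import dataclass

AUTOSOMES = {str(c) for c in range(1, 23)}


@dataclass
class SnpRecord:
    maf: float            # minor allele frequency from individual genotypes
    no_calls_case: int    # missing genotype calls among case individuals
    no_calls_ctrl: int    # missing genotype calls among control individuals
    hwe_p: float          # P-value of the chi-square HWE test
    chromosome: str
    best_ld_r2: float     # highest r^2 with any other SNP
    in_hapmap: bool


def first_failed_filter(snp):
    """Return the name of the first filter the SNP fails, or None if it passes all."""
    checks = [
        ("maf", snp.maf > 0.05),
        ("no_calls", snp.no_calls_case < 2 and snp.no_calls_ctrl < 2),
        ("hwe", snp.hwe_p >= 0.01),
        ("autosomal", snp.chromosome in AUTOSOMES),
        ("ld", snp.best_ld_r2 >= 0.8),
        ("hapmap", snp.in_hapmap),
    ]
    for name, passed in checks:
        if not passed:
            return name
    return None


# Made-up records for illustration.
kept = SnpRecord(0.20, 0, 1, 0.50, "7", 0.95, True)
no_partner = SnpRecord(0.20, 0, 1, 0.50, "7", 0.40, True)
```

Reporting the first failed filter, rather than a plain pass or fail, makes it easy to tally how many SNPs each criterion removes when the filters are applied in order.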

We define the true rank of a SNP to be the rank of the SNP according to the Fisher's exact *P*-value from individual genotype data. Additionally, we define the top X truly associated SNPs as those SNPs whose true rank is in the top X, inclusive. We adopt these filters because the remaining SNPs allow us to better assess the performance of our method. A detailed explanation of these filters can be found in the Supplementary Results. After these filters, we are left with 139 202 SNPs (~29.1%) for the Illumina 550K v3 arrays, 87 678 SNPs (~27.6%) for the Illumina 450S Duo arrays and 194 074 SNPs (~44.0%) for the Affymetrix 5.0 arrays, with the overwhelming majority filtered by the fifth criterion in all three cases.

For the analysis of Illumina 550K data alone, we used previous individual genotype data to correct for preferential amplification during the PCR process. This was done through a traditional *k*-correction factor (Hoogendoorn *et al.*, 2000; Le Hellard *et al.*, 2002). This type of correction can significantly reduce the bias between true and observed allelic frequencies. However, for both the Illumina 450S Duo analysis and the Affymetrix 5.0 analysis there was no previous individual genotype data associated with the version of arrays used. Additionally, when combining platforms, *k*-correction was not used. It is also interesting to note that because the HapMap CEU individuals are composed of trios, the number of independent chromosomes per trio is four, instead of six for three unrelated individuals, which may cause our variance estimates to be less accurate than when using unrelated individuals.
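The *k*-correction mentioned above is commonly written as *A*/(*A*+*kB*), where *A* and *B* are the two allele signals measured in the pool and *k* is the mean *A*/*B* signal ratio seen in heterozygous individuals, whose true allele ratio is 1:1. The Python sketch below follows that common formulation; the source does not spell out the formula, so the details and the function names should be read as assumptions.

```python
def k_factor(het_signals):
    """Mean A/B signal ratio across heterozygotes (true allele ratio 1:1)."""
    ratios = [a / b for a, b in het_signals]
    return sum(ratios) / len(ratios)


def k_corrected_frequency(a_signal, b_signal, k):
    """Estimated frequency of allele A in a pool after k-correction."""
    return a_signal / (a_signal + k * b_signal)


# Made-up signals: allele A reads 25% too strong on average.
k = k_factor([(1.5, 1.0), (1.0, 1.0)])   # k = 1.25
uncorrected = 1.25 / (1.25 + 1.0)
corrected = k_corrected_frequency(1.25, 1.0, k)
```

In this toy pool the two alleles are equally frequent, and the correction moves the estimate from above 0.5 back to 0.5. Estimating *k* requires individually genotyped heterozygotes measured on the same array version, which is why it was unavailable for two of the three platforms.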

### 3.1 Analysis improvement by a multimarker statistic

A comparison between single marker (SA) and multimarker (MM) analysis of a pooled GWA study is shown in Figure 1 and Supplementary Figures 1 and 2 under several scenarios. Supplementary Figures 1 and 2 show the multimarker analysis for Affymetrix 5.0 arrays and Illumina arrays, respectively, and consider the trade-off between restricting our analysis to a single platform (alone) and combining platforms (combined). To analyze the data on the Illumina platform, we combined the data from the Illumina 550K v3 arrays and the Illumina 450S Duo arrays for a total of 1 015 110 SNPs before filtering, and 309 688 SNPs (30.5%) after filtering. Figure 1 shows a combined multimarker analysis when data are merged from the three different microarrays. When combining the Illumina 550K v3 arrays, Illumina 450S Duo arrays and Affymetrix 5.0 arrays, the total number of SNPs was 1 333 631 before filtering and 560 202 (~42.0%) after filtering. We completed the same analysis within each figure (Fig. 1A-E). There are various methods by which one could evaluate the performance of a multimarker statistic, and this choice is largely arbitrary. We observe that our test statistic may not follow a chi-square distribution and therefore we choose a rank-based evaluation (Spearman rank correlation). We also wish to evaluate how a researcher would use the data, and so we focus on the number of true associations identified from individual genotyping that would be carried forward in a two-stage design. We define the analysis of our initial pooling test statistic as single marker analysis (SA) and the analysis of our multimarker test statistic as multimarker analysis (MM).

#### 3.1.1 Evaluation metric 1—identification of the most associated SNPs within a two-stage design [Fig. 1A and B]

Within a GWA study, typically the first objective of the researcher is to identify those SNPs exhibiting the largest change in allelic association. Typically, and especially with two-stage GWA designs, a somewhat arbitrary number of SNPs are taken forward for individual genotyping in order to accurately calculate significance, reducing the dataset from 500K+ to a few hundred or a few thousand. Suppose we wish to carry forward as few as 100 and at most 5000 SNPs for validation. It is then important to consider what percentage of the truly associated SNPs is present in the set of SNPs carried forward. For this analysis, we consider the top 100 truly associated SNPs. In Figure 1A and Supplementary Figures 1 and 2, we plot, for a given observed rank threshold (the number of SNPs to be carried forward; *x*-axis), the percentage of the top 100 truly associated SNPs that fall within that threshold. We plot this percentage for both the Affymetrix and Illumina platforms, and for both the single marker and multimarker analysis. In Figure 1B, we simply look at the difference, or improvement, in the percentages when moving from the single marker ranks to the multimarker ranks.
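A small Python sketch of this first metric, under the assumption that ranks are stored as dictionaries mapping a SNP identifier to its rank (1 is the most associated). The function name and the toy data are hypothetical.

```python
def percent_top_true_carried(true_rank, observed_rank, threshold, top_x=100):
    """Percentage of the top_x truly associated SNPs with observed rank <= threshold."""
    top_true = {snp for snp, rank in true_rank.items() if rank <= top_x}
    carried = {snp for snp, rank in observed_rank.items() if rank <= threshold}
    return 100.0 * len(top_true & carried) / len(top_true)


# Toy example with four SNPs, using a "top 2" instead of a top 100.
true_rank = {"s1": 1, "s2": 2, "s3": 3, "s4": 4}
single_marker = {"s1": 1, "s2": 4, "s3": 2, "s4": 3}
multimarker = {"s1": 1, "s2": 2, "s3": 3, "s4": 4}

sa = percent_top_true_carried(true_rank, single_marker, threshold=2, top_x=2)
mm = percent_top_true_carried(true_rank, multimarker, threshold=2, top_x=2)
improvement = mm - sa
```

Sweeping `threshold` over the range of SNPs carried forward traces out a curve like the one described for Figure 1A, and `improvement` corresponds to the difference described for Figure 1B.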

#### 3.1.2 Evaluation metric 2—rank correlation [Fig. 1C and D]

Another measure of significance is the correlation between the true ranks and our observed single marker or multimarker ranks. In Figure 1C and Supplementary Figures 1 and 2, we plot the Spearman rank correlation between the true and observed ranks considering the SNPs within the true rank threshold (*x*-axis). We see in Figure 1D the improvement in correlation between the single marker ranks and the multimarker ranks.
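The rank correlation can be computed with the standard library alone. The sketch below restricts attention to SNPs within the true rank threshold, re-ranks their observed positions within that subset, and applies the classical Spearman formula 1 − 6Σ*d*^{2}/(*n*(*n*^{2}−1)), which assumes there are no tied ranks. The function name and the re-ranking step are assumptions about how the metric might be implemented.

```python
def spearman_within_threshold(true_rank, observed_rank, threshold):
    """Spearman correlation between true and observed order for SNPs
    whose true rank is <= threshold (no ties assumed)."""
    snps = sorted((s for s, r in true_rank.items() if r <= threshold),
                  key=true_rank.get)
    n = len(snps)
    if n < 2:
        raise ValueError("need at least two SNPs within the threshold")
    # Position of each SNP when the subset is ordered by observed rank.
    by_observed = sorted(snps, key=observed_rank.get)
    observed_pos = {s: i for i, s in enumerate(by_observed)}
    d2 = sum((i - observed_pos[s]) ** 2 for i, s in enumerate(snps))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


true_rank = {"a": 1, "b": 2, "c": 3, "d": 4}
same_order = {"a": 10, "b": 20, "c": 30, "d": 40}
reversed_order = {"a": 40, "b": 30, "c": 20, "d": 10}
rho_same = spearman_within_threshold(true_rank, same_order, 4)
rho_reversed = spearman_within_threshold(true_rank, reversed_order, 4)
```

Only the relative order of the observed ranks matters here, so the observed ranks can come from the full genome-wide list without renumbering.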

#### 3.1.3 Evaluation metric 3—identification of top SNPs and directionality of change [Fig. 1E]

In Figure 1E, we plot the percentage of SNPs that fall within one of two criteria: either the SNP was both observed in the top 100 and within the true rank threshold (*x*-axis) or the SNP was moved in the correct direction by the multimarker analysis. A SNP moves in the correct direction if the multimarker rank is closer to the true rank than the single marker rank. Our main goal is to improve the correspondence between the true and observed ranks and thus an improvement in the observed ranks should be found.
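One reading of this third metric is sketched below: among SNPs within the true rank threshold, count those whose multimarker rank is in the observed top 100 or whose multimarker rank is closer to the true rank than the single marker rank. Which observed ranking defines the top 100 is an assumption here (the multimarker ranking is used), and all names are hypothetical.

```python
def moved_in_correct_direction(true_rank, sa_rank, mm_rank):
    """True if the multimarker rank is closer to the true rank than the single marker rank."""
    return abs(mm_rank - true_rank) < abs(sa_rank - true_rank)


def percent_top_or_improved(true_rank, sa_rank, mm_rank, threshold, top_observed=100):
    """Percentage of SNPs with true rank <= threshold that are either in the
    observed top `top_observed` or moved toward their true rank."""
    considered = [s for s, r in true_rank.items() if r <= threshold]
    hits = 0
    for snp in considered:
        in_top = mm_rank[snp] <= top_observed
        improved = moved_in_correct_direction(true_rank[snp], sa_rank[snp], mm_rank[snp])
        if in_top or improved:
            hits += 1
    return 100.0 * hits / len(considered)


# Made-up ranks for three SNPs.
true_rank = {"a": 1, "b": 2, "c": 3}
sa_rank = {"a": 5, "b": 2, "c": 400}
mm_rank = {"a": 2, "b": 300, "c": 350}
pct = percent_top_or_improved(true_rank, sa_rank, mm_rank, threshold=3)
```

In the toy data SNP `a` counts because it is in the observed top 100, SNP `c` counts because it moved toward its true rank, and SNP `b` does not count because it moved away.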

We clearly see in Figure 1 and Supplementary Figures 1 and 2 that the multimarker rank improves on the single marker rank under all scenarios. As an example (see Supplementary Table 2), consider the single marker and multimarker ranks for the top 100 truly associated SNPs on the Illumina platform when considering just the 450S Duo and 550K platforms (alone) and when considering all three microarray types (combined). We notice that the improvement is greater on the Affymetrix platform, which is expected since there is greater noise and there are fewer probes per SNP than on the Illumina platform. The improvement is significant since, using the multimarker method [Fig. 1B], we potentially increase the number of the top 100 truly associated SNPs carried forward by 5–35%, depending on the number of observed SNPs to be carried forward for validation. Furthermore, if we combine the information from all platforms, we can include 100% of the top 100 truly associated SNPs by carrying forward the top 2500 observed SNPs, and 90% by carrying forward the top 1000 observed SNPs. We see that the correlation between the true ranks and the observed ranks is higher in the multimarker analysis [Fig. 1C and D], and we verify this in Figure 1E by seeing the directionality of that change. Additionally, in analyzing Supplementary Figures 1 and 2, we see there is an improvement in combining both the Affymetrix and Illumina data versus considering them separately, suggesting that our method improves further when data from multiple platforms are combined.

### 3.2 Simulation

We performed a simulation of a pooling study using pools composed by randomly sampling individuals from the 1958 Control Cohort of the Wellcome Trust dataset (The Wellcome Trust Case Control Consortium, 2007). Excluding duplicates, relatives and other data anomalies left a total of 1423 individuals. The genotype calls for these individuals were provided by the WTCCC; the individuals had previously been genotyped on the Affymetrix 500K platform. Using this dataset, we simulated the Affymetrix 5.0 arrays by using four probes per SNP and by adding a mean zero error with variance 0.006 to the value of each probe. The probe variance of 0.006 matches our observed probe variance for Affymetrix 5.0 arrays. Our simulated study design consisted of pools of one hundred cases and one hundred controls with four replicate arrays for each cohort, using a total of 500 567 SNPs. As in our experimental analysis, we assumed a correlation structure similar to that of the HapMap and used the correlation structure (LD) found in the HapMap as input to our method. The correlation structure used in the simulations was therefore not trained directly on the WTCCC data. We evaluated the simulation results using the same metrics used to evaluate our experimental results. The results showed almost exactly the same improvements as the empirical results, including a noticeable increase in Spearman rank correlation for our multimarker method over the single marker method (figures omitted). In particular, we found that an additional 5–35% of truly associated SNPs would be carried forward for validation when carrying forward between 100 and 5000 SNPs, which matches the result observed experimentally.
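The measurement model of the simulation (four probes per SNP, mean-zero probe error with variance 0.006, four replicate arrays per cohort, pools of 100 individuals) can be sketched as follows. The Gaussian error distribution, the averaging over probes and arrays, the made-up genotypes and all names are assumptions for illustration; the real study drew genotypes from the WTCCC data.

```python
import math
import random

PROBE_VARIANCE = 0.006   # per-probe error variance quoted in the text
PROBES_PER_SNP = 4
REPLICATE_ARRAYS = 4


def pool_frequency(genotypes):
    """True allele frequency in a pool; genotypes are counts (0, 1, 2) of allele A."""
    return sum(genotypes) / (2.0 * len(genotypes))


def simulate_pool_measurement(true_freq, rng):
    """Average noisy probe readings over all probes and replicate arrays."""
    sd = math.sqrt(PROBE_VARIANCE)
    array_means = []
    for _ in range(REPLICATE_ARRAYS):
        probes = [true_freq + rng.gauss(0.0, sd) for _ in range(PROBES_PER_SNP)]
        array_means.append(sum(probes) / PROBES_PER_SNP)
    return sum(array_means) / REPLICATE_ARRAYS


rng = random.Random(0)
population = [rng.choice([0, 1, 1, 2]) for _ in range(1423)]  # made-up genotypes
chosen = rng.sample(population, 200)          # disjoint cases and controls
cases, controls = chosen[:100], chosen[100:]
measured_case = simulate_pool_measurement(pool_frequency(cases), rng)
measured_ctrl = simulate_pool_measurement(pool_frequency(controls), rng)
```

With independent probe errors, averaging 16 readings reduces the error variance of each pooled estimate to 0.006/16, which shows why both more probes per SNP and more replicate arrays reduce pooling noise.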

### 3.3 Imputation

To test the efficacy of our imputation method, we performed the same experimental analysis described above, including a list of SNPs to be imputed. We imputed only SNPs from the HapMap that had an LD value of at least 0.8 with an observed SNP on one of the two Illumina platforms (450S Duo and 550K v3). Additionally, we used the same filters previously stated and only used observed SNPs to impute if the observed SNPs had a true rank from genotypes as described above. Supplementary Figure 3 shows Evaluation Metrics 1 and 2, as in Figure 1A and C. We compare four methods: a baseline where we perform no imputation as in the previous analyses, and three methods where the minimum number of proxies required for imputation is one, two and three, respectively. Requiring at least one, two and three proxies, we imputed 748 348, 544 041 and 424 314 SNPs, respectively. As the minimum number of proxies required is increased, our imputation method performs slightly better and approaches the results obtained when the given SNP is measured directly [see Supplementary Fig. 3A and C]. This is expected since the proxies are in strong LD with the unobserved SNP and the number of indirect observations increases as the minimum number of proxies required increases. For some imputed SNPs in stretches of high LD, we see over a hundred proxies, which opens the possibility of recovering a great deal of information but also raises the problem of overfitting. Nevertheless, the imputation method achieves a high rank correlation between the true ranks and imputed ranks, and maintains a large number of truly associated SNPs within the rank threshold.

## 4 DISCUSSION

In this article, we developed theoretically, and demonstrated through simulation and experiment, a multimarker analysis method that improves the power of pooling-based GWA studies. We first formalized a model for pooling-based studies with errors and gave a basic description of a suitable test statistic for both individual genotyping and pooling-based studies. We then tested this model using experimental data on multiple platforms, including the Illumina 550K v3, Illumina 450S Duo and Affymetrix 5.0 microarrays, and validated our results using simulations.

Perhaps more importantly, we demonstrate that combining Affymetrix and Illumina data is feasible and noticeably improves the assessment of association significance. Combining platforms increases genomic coverage and provides more measurements for SNPs present on more than one platform; after combining platforms, our experiments measured a total of 1 333 631 SNPs. Potentially, the number of truly associated SNPs (true positives) selected for a second stage of validation can be increased by 5–35% when fewer than 5000 SNPs are carried forward. Additionally, 100% of the truly associated SNPs are carried forward if the observed top 2500 SNPs are chosen for validation; this falls to 90% if the observed top 1000 SNPs are chosen.

In our analysis, we examined only those SNPs with at least one other SNP in LD. Under this criterion, we show that using LD information improves the accuracy of assessing the significance of SNPs. Denser microarrays measure more SNPs and therefore yield more pairwise correlations between SNPs; because our method takes advantage of this additional LD, its performance should continue to improve as denser microarray technologies become available. The ability to combine platforms also permits new considerations when designing a pooling-based study: if arrays from one platform are significantly noisier than those from another, more replicates can be run on the noisier platform to compensate for the difference in noise. It is clear from our theoretical framework that increasing the number of replicates and the number of probes per SNP reduces the noise associated with pooling. The presented multimarker method could also improve results in settings where individual genotype data are not available and only allele counts or allele frequencies are present, thereby extending the utility of this method.
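The replicate trade-off described above can be illustrated with a minimal sketch. It assumes that array noise is independent between replicates, so that averaging *n* replicates divides the noise variance by *n*; the function names and the variance figures in the usage lines are hypothetical and are not taken from our experiments or from the model derived in this article.

```python
from math import ceil


def variance_of_mean(per_array_variance, n_replicates):
    """Variance of the average of n independent replicate arrays."""
    if n_replicates < 1:
        raise ValueError("need at least one replicate")
    return per_array_variance / n_replicates


def replicates_to_match(noisy_variance, quiet_variance, quiet_replicates):
    """Smallest replicate count on the noisier platform whose averaged
    noise variance does not exceed that of the quieter platform."""
    return ceil(quiet_replicates * noisy_variance / quiet_variance)


def inverse_variance_combine(estimates, variances):
    """Inverse-variance weighted mean of per-platform allele frequency
    estimates, returned together with the variance of that mean."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    combined = sum(w * x for w, x in zip(weights, estimates)) / total
    return combined, 1.0 / total


# Hypothetical example: platform B has four times the per-array noise
# variance of platform A, which was run with three replicates.
n_b = replicates_to_match(noisy_variance=4.0, quiet_variance=1.0,
                          quiet_replicates=3)
freq, var = inverse_variance_combine([0.30, 0.34], [0.0004, 0.0004])
print(n_b, freq, var)
```

Inverse-variance weighting is only one simple way to merge estimates for a SNP measured on both platforms; it is shown here because it makes the effect of unequal noise explicit.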

Finally, a novel application of our method is to impute the significance of association for unobserved SNPs. Imputing an unobserved SNP does not in itself add information, since no additional observations are gathered. However, when an unobserved SNP has more than one proxy, the accuracy of imputation increases. This type of imputation is a useful tool for evaluating associations within a specific region, since the imputed SNP *A* can serve as a bridge between its neighboring SNPs. Moreover, as the number of SNPs measured increases, so will the number of proxies, thereby increasing the accuracy of our method when assessing the strength of association for an imputed SNP.
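To make the role of multiple proxies concrete, the following sketch imputes a z-score for an unobserved SNP as the conditional mean under a multivariate normal model of the proxy z-scores. This is a generic textbook construction and not the statistic derived in this article: it ignores pooling noise, and the function names, inputs and example correlations are all hypothetical.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(map(float, row)) + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("proxy correlation matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def impute_z(proxy_z, r_target, r_proxies):
    """Conditional-mean z-score of an unobserved SNP given its proxies.

    proxy_z   : z-scores observed at the proxy SNPs
    r_target  : signed correlation r between each proxy and the target SNP
    r_proxies : correlation matrix among the proxies (ones on the diagonal)
    Returns (imputed z, fraction of information retained), where the
    second value is the multiple-marker r^2 of the proxy set.
    """
    weights = solve_linear(r_proxies, r_target)
    z_hat = sum(w * z for w, z in zip(weights, proxy_z))
    retained = sum(w * r for w, r in zip(weights, r_target))
    return z_hat, retained


# Hypothetical example: one proxy versus two mutually uncorrelated proxies.
print(impute_z([2.0], [0.6], [[1.0]]))
print(impute_z([2.0, 3.0], [0.6, 0.6], [[1.0, 0.0], [0.0, 1.0]]))
```

Under this model the retained fraction rises from 0.36 with one proxy to 0.72 with two uncorrelated proxies of the same strength, which mirrors the gain in accuracy from additional proxies. Under the null hypothesis the imputed value has variance equal to the retained fraction, so it would need to be rescaled before being compared with directly observed z-scores.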

With emerging technologies from Affymetrix and Illumina offering >1 million SNPs, we hope to gain considerable power from the increased LD when combining data from both platforms. Additionally, our method uses LD without considering the underlying haplotype structure. It is feasible to adapt current haplotype-based methods such as WHAP (Zaitlen *et al.*, 2007) to also increase power in pooling-based studies (Hinds *et al.*, 2004). A strength of this LD-based approach is that it yields a correlation measure (*r*_{p}^{2}) describing the information lost by pooling relative to individual genotyping. This measure is analogous to the power lost in individual genotyping when the causal SNP is not directly observed. Consequently, it is theoretically straightforward to combine both measures in a hybrid multimarker statistic. This method will be particularly powerful where phased data are unavailable or unreliable. Frequently, this may be the case when HapMap populations are not used, or are seen as underpowered compared with larger LD databases derived from case-control association studies. Regardless of approach, as the densities of SNP microarray platforms increase, the accuracy and utility of our method will only improve, encouraging wider adoption of pooling-based GWA as a cost-effective alternative to individual genotyping with minimal loss in power.

## ACKNOWLEDGEMENTS

This study makes use of data generated by the Wellcome Trust Case Control Consortium. A full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk.

*Funding*: We wish to provide acknowledgement of funding from NIH (1I24MS-43581), the Stardust foundation (DWC, WT), and the University of California Systemwide Biotechnology Research & Education Program GREAT Training Grant 2007-10 (NH).

*Conflict of Interest*: none declared.

## REFERENCES

- Barratt BJ. Identification of the sources of error in allele frequency estimations from pooled DNA indicates an optimal experimental design. Ann. Hum. Genet. 2002;66:393–405. [PubMed]
- Brown KM, et al. Common sequence variants on 20q11.22 confer melanoma susceptibility. Nat. Genet. 2008 [PMC free article] [PubMed]
- Craig DW, et al. Identification of disease causing loci using an array-based genotyping approach on pooled DNA. BMC Genomics. 2005;6:138. [PMC free article] [PubMed]
- Dai JY, et al. Imputation methods to improve inference in SNP association studies. Genet. Epidemiol. 2006;30:690–702. [PubMed]
- Hanson RL, et al. A potential locus for end-stage renal disease in type 2 diabetes identified by a pooling-based genome-wide association study. Diabetes. 2007 in press.
- Hinds DA, et al. Application of pooled genotyping to scan candidate regions for association with HDL cholesterol levels. Hum. Genomics. 2004;1:421–434. [PMC free article] [PubMed]
- Hoogendoorn B, et al. Cheap, accurate and rapid allele frequency estimation of single nucleotide polymorphisms by primer extension and DHPLC in DNA pools. Hum. Genet. 2000;107:488–493. [PubMed]
- Hua J, et al. SNiPer-HD: improved genotype calling accuracy by an expectation-maximization algorithm for high-density SNP arrays. Bioinformatics. 2007;23:57–63. [PubMed]
- Johnson T. Bayesian method for gene detection and mapping, using a case and control design and DNA pooling. Biostatistics. 2007;8:546–565. [PubMed]
- Kirkpatrick B, et al. HAPLOPOOL: improving haplotype frequency estimation through DNA pools and phylogenetic modeling. Bioinformatics. 2007 [PubMed]
- Law GR, et al. Application of DNA pooling to large studies of disease. Stat. Med. 2004;23:3841–3850. [PubMed]
- Le Hellard S, et al. SNP genotyping on pooled DNAs: comparison of genotyping technologies and a semi automated method for data storage and analysis. Nucleic Acids Res. 2002;30:e74. [PMC free article] [PubMed]
- Macgregor S. Most pooling variation in array-based DNA pooling is attributable to array error rather than pool construction error. Eur. J. Hum. Genet. 2007;15:501–504. [PubMed]
- Macgregor S, et al. Analysis of pooled DNA samples on high density arrays without prior knowledge of differential hybridization rates. Nucleic Acids Res. 2006;34:e55. [PMC free article] [PubMed]
- Marchini J, et al. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat. Genet. 2007;39:906–913. [PubMed]
- McGhee KA, et al. Investigation of the apolipoprotein-L (APOL) gene family and schizophrenia using a novel DNA pooling strategy for public database SNPs. Schizophr. Res. 2005;76:231–238. [PubMed]
- Meaburn E, et al. Genotyping DNA pools on microarrays: tackling the QTL problem of large samples and large numbers of SNPs. BMC Genomics. 2005;6:52. [PMC free article] [PubMed]
- Melquist S, et al. Identification of a novel risk locus for progressive supranuclear palsy by a pooled genomewide scan of 500,288 single-nucleotide polymorphisms. Am. J. Hum. Genet. 2007;80:769–778. [PMC free article] [PubMed]
- Papassotiropoulos A, et al. Common Kibra alleles are associated with human memory performance. Science. 2006;314:475–478. [PubMed]
- Pearson JV, et al. Identification of the genetic basis for complex disorders by use of pooling-based genomewide single-nucleotide-polymorphism association studies. Am. J. Hum. Genet. 2007;80:126–139. [PMC free article] [PubMed]
- Pritchard JK, Przeworski M. Linkage disequilibrium in humans: models and data. Am. J. Hum. Genet. 2001;69:1–14. [PMC free article] [PubMed]
- Servin B, Stephens M. Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genet. 2007;3:e114. [PMC free article] [PubMed]
- Sham P, et al. DNA Pooling: a tool for large-scale association studies. Nat. Rev. Genet. 2002;3:862–871. [PubMed]
- Steer S, et al. Genomic DNA pooling for whole-genome association scans in complex disease: empirical demonstration of efficacy in rheumatoid arthritis. Genes Immun. 2007;8:57–68. [PubMed]
- The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. [PMC free article] [PubMed]
- Wang S, et al. On the use of DNA pooling to estimate haplotype frequencies. Genet. Epidemiol. 2003;24:74–82. [PubMed]
- Yang HC, et al. New adjustment factors and sample size calculation in a DNA-pooling experiment with preferential amplification. Genetics. 2005;169:399–410. [PMC free article] [PubMed]
- Yang HC, et al. PDA: pooled DNA analyzer. BMC Bioinformatics. 2006;7:233. [PMC free article] [PubMed]
- Zaitlen N, et al. Leveraging the HapMap correlation structure in association studies. Am. J. Hum. Genet. 2007;80:683–691. [PMC free article] [PubMed]
- Zou G, Zhao H. The impacts of errors in individual genotyping and DNA pooling on association studies. Genet. Epidemiol. 2004;26:1–10. [PubMed]
- Zou G, Zhao H. Family-based association tests for different family structures using pooled DNA. Ann. Hum. Genet. 2005;69:429–442. [PubMed]
- Zuo Y, et al. Two-stage designs in case-control association analysis. Genetics. 2006;173:1747–1760. [PMC free article] [PubMed]

**Oxford University Press**

Multimarker analysis and imputation of multiple platform pooling-based genome-wide association studies. Bioinformatics. Sep 1, 2008; 24(17):1896.
