- Hum Hered
- PMC2868915

# Optimal DNA Pooling-Based Two-Stage Designs in Case-Control Association Studies

## Abstract

Study cost remains the major limiting factor for genome-wide association studies due to the necessity of genotyping a large number of SNPs for a large number of subjects. Both DNA pooling strategies and two-stage designs have been proposed to reduce genotyping costs. In this study, we propose a cost-effective, two-stage approach with a DNA pooling strategy. During stage I, all markers are evaluated on a subset of individuals using DNA pooling. The most promising set of markers is then evaluated with individual genotyping for all individuals during stage II. The goal is to determine the optimal parameters (π^{p}_{sample}, the proportion of samples used during stage I with DNA pooling; and π^{p}_{marker}, the proportion of markers evaluated during stage II with individual genotyping) that minimize the cost of a two-stage DNA pooling design while maintaining a desired overall significance level and achieving a level of power similar to that of a one-stage individual genotyping design. We considered the effects of three factors on optimal two-stage DNA pooling designs. Our results suggest that, under most scenarios considered, the optimal two-stage DNA pooling design may be much more cost-effective than the optimal two-stage individual genotyping design, which uses individual genotyping during both stages.

**Key Words:** Two-stage design, DNA-pooling, Genome-wide association study, Measurement errors, Optimal design

## Introduction

A genome-wide association (GWA) study is a powerful approach to detect susceptibility genes with small effects associated with complex diseases [1,2,3,4]. Recent advances in genotyping technologies have made GWA studies possible; however, even though genotyping costs have dropped significantly, they remain prohibitively high for many GWA studies due to the necessity of genotyping a large number of SNPs from a large number of case and control subjects.

DNA pooling [5] and two-stage designs [6,7,8,9,10,11] have been proposed to reduce genotyping costs. Using a DNA pooling strategy, pooled DNA samples from a group of case subjects and pooled DNA samples from a group of control subjects are formed. The allele frequency of each marker is then estimated by genotyping case pools and control pools instead of individual subjects. This reduces the number of genotyping assays substantially. In a recent publication, Pearson et al. [12] suggested that pooling-based GWA studies might provide a viable alternative for screening disorders with common variations of large effect. Another recent study has demonstrated the applicability of using DNA pools and 500 K Affymetrix GeneChip mapping arrays as a cost-effective, reliable and valid initial screening tool [13]. DNA pooling has also been utilized in a whole-genome association study of neuroticism [14] and a genome-wide scan of progressive supranuclear palsy [15], using Affymetrix 100 and 500 K mapping arrays. In well-developed two-stage designs using individual genotyping, all *M* markers are evaluated on a subset of *N* total subjects (*N* × π_{sample}) during stage I, and the most promising subset of *M* markers (*M* × π_{marker}) is then evaluated in the remaining *N* × (1 – π_{sample}) subjects during stage II. This greatly reduces the total amount of genotyping and therefore leads to a more cost-effective design. Zuo et al. [16] combined DNA pooling and a two-stage design. In their design, all *M* markers and all *N* subjects were genotyped with DNA pooling during stage I. A subset of the most promising markers was selected and individually genotyped on a set of newly collected subjects during stage II. However, how to define an optimal cost-effective two-stage design with DNA pooling and compare the cost-effectiveness with the widely studied optimal two-stage individual genotyping design under a fixed study power has not been investigated.

In this study, we propose a cost-effective two-stage approach with a DNA pooling strategy. During stage I, all *M* markers are evaluated on a subset of subjects using pooled DNA samples. The most promising set of markers is then evaluated with individual genotyping in all subjects during stage II. The goal is to determine the optimal parameters (π^{p}_{sample}, the proportion of samples used during stage I with DNA pooling; and π^{p}_{marker}, the proportion of markers evaluated during stage II) that minimize the cost of a two-stage DNA pooling design while maintaining a desired overall significance level and achieving a level of power similar to that of a one-stage individual genotyping design. We compare the costs of the optimal two-stage DNA pooling designs to those of the optimal two-stage individual genotyping designs.

We investigate the effects of three factors on the optimal two-stage DNA pooling designs. The first factor involves the DNA pooling-related measurement errors including, but not limited to, the following: accuracy of pool construction by pipetting, integrity of the pooled genomic DNA and possible unequal amplification of one allele over another [5, 12,17,18,19]. For array-based DNA pooling, additional variation may be introduced by the array itself [20, 21]. In this study, we have grouped the above-mentioned DNA pooling-related measurement errors into two types. The first type is the pool-specific error, σ_{P}, which includes the pool-construction errors and the array errors known for their confounding effects [20]. The second type of error is due to unequal allele amplification, which has been shown to cause a systematic bias in testing allele frequency differences between pools [20]. If it were possible for there to be no pooling-related errors, studies using DNA pooling would be able to achieve the same level of power as studies using individual genotyping. However, in reality, pooling-related errors always exist.

The second factor, denoted as *R*, is the ratio of the cost per genotyping assay performed in stage II to that of stage I. A recent study [10] of the impact of *R* on the optimal two-stage individual genotyping design suggests that, despite the markedly higher per-genotyping cost during stage II (*R* ranges from 15 to 20), two-stage individual genotyping designs are more cost-effective than one-stage designs. In an optimal two-stage individual genotyping design, it has been shown that the optimal proportion of samples (π_{sample}) used during stage I increases as *R* increases, and the optimal proportion of markers (π_{marker}) evaluated during stage II decreases as *R* increases [10].

The third factor is the DNA pool size, denoted as *s*. The pool size is the number of individuals used in constructing a pool that will be genotyped on a single array. Given pool size *s*, we can obtain the number of pools with a fixed sample size. Based on simulation studies, Pearson et al. [12] suggest a limiting point beyond which increasing the pool size becomes less efficient.

Our results suggest that, under most scenarios considered, the optimal two-stage DNA pooling designs are much more cost-effective than the optimal two-stage individual genotyping designs. When the pool-specific error is small, and the pool size is as small as 20, the cost of the two-stage DNA pooling design is only 4–16% that of the two-stage individual genotyping design. When the pool-specific error is large, the two-stage DNA pooling design may not be as cost-effective as the two-stage individual genotyping design, especially when the genetic relative risk (GRR) is small and the allele responsible for the risk is rare. The optimal design parameters and the optimal cost fractions of the two-stage DNA pooling designs and the two-stage individual genotyping designs compared to the cost of the one-stage individual genotyping designs are also provided. We believe that with the prohibitive costs of genome-wide association scans and the great demand for conducting hypothesis-free genome-wide scans, the optimal two-stage DNA pooling design can be a cost-effective high-throughput screening tool for evidence of association. (A program written in R is available upon request from the authors.)

## Methods

### Notations and Assumptions

We consider *M* markers and 2*N* (a balanced design with *N* cases and *N* controls) samples in a case-control study. Among the *M* markers, we assume *D* (≥1) markers are the true disease markers.

Assume bi-allelic markers with alleles *A* and *a*, which have frequencies *p* and 1 – *p*, respectively. Let *X*_{i} and *Y*_{i} be the number of copies of the risk-allele *A* that the *i*-th case and the *i*-th control carry. Under the Hardy-Weinberg equilibrium assumption, *X*_{i} or *Y*_{i} = 2 has a probability of *p*^{2}; *X*_{i} or *Y*_{i} = 1 has a probability of 2*p*(1 – *p*); and *X*_{i} or *Y*_{i} = 0 has a probability of (1 – *p*)^{2}. Let *f*_{0} = Pr(*affected* | *aa*), and ψ_{1}, ψ_{2} be the genetic relative risk (GRR) for genotypes *Aa* and *AA*, where ψ_{1} = Pr(*affected* | *Aa*)/*f*_{0} and ψ_{2} = Pr(*affected* | *AA*)/*f*_{0}.

The risk allele frequencies within case group (P_{A}) and control group (P_{U}) are functions of *p*, ψ_{1}, ψ_{2}, and *f*_{0}; that is,
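These frequencies follow from Bayes' rule applied to the Hardy-Weinberg genotype frequencies and the penetrances *f*_{0}, ψ_{1}*f*_{0} and ψ_{2}*f*_{0}. The following is a minimal Python sketch of that calculation; the function name is illustrative and not taken from the authors' R program.

```python
def risk_allele_frequencies(p, psi1, psi2, f0):
    """Return (P_A, P_U): risk-allele frequency in cases and in controls."""
    # Genotype frequencies under Hardy-Weinberg equilibrium.
    g_AA, g_Aa, g_aa = p * p, 2 * p * (1 - p), (1 - p) ** 2
    # Penetrances implied by the baseline risk f0 and the two GRRs.
    f_AA, f_Aa, f_aa = psi2 * f0, psi1 * f0, f0
    prevalence = g_AA * f_AA + g_Aa * f_Aa + g_aa * f_aa
    # Allele frequency = Pr(AA | status) + 0.5 * Pr(Aa | status).
    p_case = (g_AA * f_AA + 0.5 * g_Aa * f_Aa) / prevalence
    p_ctrl = (g_AA * (1 - f_AA) + 0.5 * g_Aa * (1 - f_Aa)) / (1 - prevalence)
    return p_case, p_ctrl


# Multiplicative model from the GRR = 1.5 set, with p = 0.1 and f0 = 0.01.
P_A, P_U = risk_allele_frequencies(0.1, 1.5, 2.25, 0.01)
```

When ψ_{1} = ψ_{2} = 1 both frequencies reduce to *p*, which is a convenient sanity check.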

### Study Costs

To investigate the cost-effectiveness of two-stage DNA pooling designs, we compared the optimal cost of the two-stage DNA pooling design and the optimal cost of the two-stage individual genotyping design relative to the cost of the one-stage individual genotyping design.

Let *C*_{ind} and *C*_{pool} be the per-genotyping-array costs of individual genotyping and DNA pooling during stage I, respectively. Without considering the cost of constructing pools, we assume *C*_{pool} = *C*_{ind}. With *M* markers and 2*N* samples, the total cost of the one-stage individual genotyping design, which serves as the reference, is:

Given the ratio of the per-genotyping cost during stage II to that during stage I, *R*, the per-genotyping cost during stage II, will be *R* × *C*_{ind} for both the two-stage DNA pooling design and the two-stage individual genotyping design under the assumption of *C*_{pool} = *C*_{ind}. Therefore, the total cost of the two-stage individual genotyping design is:

The total cost of the two-stage DNA pooling design is:

CFI is the ratio of the cost of the two-stage individual genotyping design to that of the one-stage design, and CFP is the ratio of the cost of the two-stage DNA pooling design to that of the one-stage design. That is,
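Because *C*_{pool} = *C*_{ind} is assumed, the unit cost, *M* and 2*N* cancel in both ratios. The sketch below writes the two cost fractions as they follow from the verbal description of the designs (stage II of the individual design uses the remaining subjects; stage II of the pooling design uses all subjects; stage I of the pooling design needs one assay per pool of *s* subjects). The function names are illustrative.

```python
def cost_fraction_individual(pi_sample, pi_marker, R):
    """CFI: two-stage individual genotyping cost over one-stage cost."""
    stage1 = pi_sample                        # all markers, a fraction of subjects
    stage2 = R * pi_marker * (1 - pi_sample)  # selected markers, remaining subjects
    return stage1 + stage2


def cost_fraction_pooling(pi_sample_p, pi_marker_p, R, s):
    """CFP: two-stage DNA pooling cost over one-stage cost."""
    stage1 = pi_sample_p / s   # one assay per pool of s subjects
    stage2 = R * pi_marker_p   # selected markers, all subjects, individually
    return stage1 + stage2
```

Under this form, pooling every subject in pools of 20 costs 1/20 = 5% of the one-stage design before any stage II genotyping, which matches the CFP of around 5% reported in the Results for π^{p}_{sample} = 100%.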

### Test Statistics

#### One-Stage Individual Genotyping Designs

The one-stage designs with individual genotyping serve as reference designs. Let $\stackrel{\circ}{P}$_{A} and $\stackrel{\circ}{P}$_{U} be the estimated risk-allele frequencies in case and control groups. The test statistic to test the association between a marker and the disease is:

Under the null hypothesis, *T* follows approximately a normal distribution with mean 0 and variance 1. Under the alternative hypothesis, *T* follows approximately a normal distribution with mean

and the variance is 1. When the total sample size is fixed, the power of the one-stage design for a one-sided test is 1 – β = 1 – Φ(*z*_{1 – α/M} – μ), where we control the marker-wise false positive rate at α/*M* using a Bonferroni correction.
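The power expression above can be evaluated directly. The sketch below assumes the usual allelic test, in which each group's allele-frequency estimate has variance *P*(1 – *P*)/(2*N*), so that μ = (P_{A} – P_{U}) / sqrt([P_{A}(1 – P_{A}) + P_{U}(1 – P_{U})]/(2*N*)); this form of μ is an assumption consistent with the surrounding text, and the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def one_stage_power(P_A, P_U, N, alpha=0.05, M=500_000):
    """One-sided power of the one-stage design with N cases and N controls."""
    std = NormalDist()
    # Mean of T under the alternative (assumed allelic-test form).
    mu = (P_A - P_U) / sqrt((P_A * (1 - P_A) + P_U * (1 - P_U)) / (2 * N))
    # Bonferroni-corrected critical value z_{1 - alpha/M}.
    z_crit = std.inv_cdf(1 - alpha / M)
    return 1 - std.cdf(z_crit - mu)
```

With P_{A} = P_{U} the function returns the marker-wise false positive rate α/*M*, as expected under the null hypothesis.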

#### Two-Stage Individual Genotyping Designs

In two-stage individual genotyping designs, all *M* markers are evaluated during stage I using *N* × π_{sample} cases and *N* × π_{sample} controls. The most promising set of markers (*M* × π_{marker}) is selected and genotyped in the remaining cases and controls during stage II. We focus on the joint analysis, and follow the test statistics defined by Skol et al. [22]. Let $\stackrel{\circ}{P}$_{A1} and $\stackrel{\circ}{P}$_{U1} be the estimated risk-allele frequencies in case and control groups using *N* × π_{sample} cases and *N* × π_{sample} controls, and $\stackrel{\circ}{P}$_{A2} and $\stackrel{\circ}{P}$_{U2} be the estimated risk-allele frequencies using the remaining *N* × (1 – π_{sample}) cases and *N* × (1 – π_{sample}) controls. The test statistics *T*_{1} during stage I and *T*_{all} during stage II are defined as:

Under the null hypothesis of no association, *T*_{1} and *T*_{all} follow an approximate bivariate normal distribution *N*(0, Σ), where

Under the alternative hypothesis, *T*_{1} and *T*_{all} follow an approximate bivariate normal distribution *N*($\underset{~}{\mu}$_{ind}, Σ), where

Given the sample size 2*N*, π_{sample} and π_{marker} must be such that the following two equations are satisfied:

and

This is to achieve an overall study power of 1 – β (which will be less than, but close to, the power of the one-stage individual genotyping design), while controlling the marker-wise false positive rate at α/*M*. Here the critical values *K*_{1} and *K*_{all} of stage I and stage II are functions of π_{sample}, π_{marker}, GRR, and risk-allele frequency. Note that we are able to find multiple pairs (π_{sample}, π_{marker}) of estimates using a grid search method such that the above conditions are met. Among those pairs of parameter estimates, the optimal set (π_{sample}, π_{marker}) is the one that minimizes CFI for a given R.
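The grid search can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: tests are one-sided; the joint statistic is taken as *T*_{all} = sqrt(π_{sample}) *T*_{1} + sqrt(1 – π_{sample}) *T*_{2} with independent unit-variance stage statistics, in the spirit of Skol et al. [22]; μ is the one-stage mean; the grid values are arbitrary; and all function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()


def joint_tail(K1, K_all, pi_s, mu=0.0, n=400):
    """P(T1 > K1 and T_all > K_all) for the assumed joint statistic."""
    mu1, mu2 = sqrt(pi_s) * mu, sqrt(1 - pi_s) * mu
    lo, hi = K1, max(K1, mu1) + 10.0
    h = (hi - lo) / n

    def integrand(t):
        # Stage II succeeds when T2 > (K_all - sqrt(pi_s) * t) / sqrt(1 - pi_s).
        need = (K_all - sqrt(pi_s) * t) / sqrt(1 - pi_s)
        return STD.pdf(t - mu1) * STD.cdf(mu2 - need)

    # Composite Simpson rule over the stage I statistic (n must be even).
    total = integrand(lo) + integrand(hi)
    total += sum((4 if i % 2 else 2) * integrand(lo + i * h) for i in range(1, n))
    return total * h / 3


def critical_values(pi_s, pi_m, alpha_marker):
    """K1 passes a fraction pi_m of null markers; K_all fixes the joint error."""
    K1 = STD.inv_cdf(1 - pi_m)
    lo, hi = -10.0, 10.0
    for _ in range(50):  # bisection: joint_tail decreases as K_all grows
        mid = (lo + hi) / 2
        if joint_tail(K1, mid, pi_s) > alpha_marker:
            lo = mid
        else:
            hi = mid
    return K1, (lo + hi) / 2


def optimal_design(mu, R, alpha_marker=1e-7, keep=0.99):
    """Return (CFI, pi_sample, pi_marker) with the smallest CFI on a coarse grid."""
    target = keep * STD.cdf(mu - STD.inv_cdf(1 - alpha_marker))
    best = None
    for pi_s in (0.3, 0.4, 0.5):
        for pi_m in (0.001, 0.01, 0.1):
            K1, K_all = critical_values(pi_s, pi_m, alpha_marker)
            if joint_tail(K1, K_all, pi_s, mu) < target:
                continue  # power falls below 99% of the one-stage power
            cfi = pi_s + R * pi_m * (1 - pi_s)
            if best is None or cfi < best[0]:
                best = (cfi, pi_s, pi_m)
    return best
```

A call such as `optimal_design(mu=6.0, R=15)` returns the cheapest feasible grid point; a finer grid would be needed to approach the optimal values reported in the Results.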

#### Two-Stage DNA Pooling Designs

In the two-stage DNA pooling designs, all *M* markers are genotyped using pooled DNA samples during stage I using *N* × π^{p}_{sample} cases and *N* × π^{p}_{sample} controls. The most promising set of markers (*M* × π^{p}_{marker}) is individually genotyped during stage II using all cases and controls. Let $\stackrel{\circ}{P}$_{Ap} and $\stackrel{\circ}{P}$_{Up} be the estimated risk-allele frequencies using DNA pooling of case and control groups using *N* × π^{p}_{sample} cases and *N* × π^{p}_{sample} controls during stage I. Consider *m* case pools, where each pool contains *s* cases, and *m* control pools, where each pool contains *s* controls, and *m* × *s* = *N* × π^{p}_{sample}. Assuming a simple linear model for the estimated risk-allele frequencies with DNA pooling, we have:

and

where *X*_{ij} and *Y*_{ij} denote the number of copies of the risk allele *A* carried by the *j*-th case in the *i*-th case pool and the *j*-th control in the *i*-th control pool; ε_{Ai} and ε_{Ui} are the independent pool-specific errors associated with the *i*-th case pool and the *i*-th control pool, both of which follow a normal distribution with mean 0 and variance σ^{2}_{p}; ε_{AKi} and ε_{UKi} are the errors due to unequal allele amplification in case and control groups, which are introduced in the allele frequency estimates through error in estimating the correction factor *k* [17], the parameter that adjusts for unequal allele amplification. Two steps are involved. First, the coefficient of preferential amplification *k*, also called the k-correction factor, is estimated. For each SNP marker, $\stackrel{\circ}{k}$ is estimated by the ratio of *A* to *B* from a number of independent heterozygotes [14], where *A* and *B* are the fluorescent signal intensities of alleles *A* and *B*, respectively. Therefore, for a particular marker locus, $\stackrel{\circ}{k}$ is approximately normally distributed with mean *k* and variance σ^{2}_{k}. Following the methods of Visscher and Le Hellard [17], we define the coefficient of variation of $\stackrel{\circ}{k}$ as *CVK* = σ_{k}/*k* = *s.e.*($\stackrel{\circ}{k}$)/*k*. In the second step, the allele frequency estimates in case and control groups are corrected with the estimated k-correction factor applied,

where *A* and *B* are the observed peak heights of alleles *A* and *B* from a DNA pool. Note that the same k-correction factor is applied to the correction of the allele frequency estimates for both case and control groups at a given marker locus. Thus, the errors introduced to the allele frequency estimates in the case and control groups due to unequal allele amplification are not independent. ε_{AKi} and ε_{UKi} follow an approximate bivariate normal distribution *N*(0, Σ_{k}) [17], where
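To make the k-correction concrete: if *k* is the *A*-to-*B* signal ratio seen in heterozygotes, the corrected allele ratio in a pool is *A*/(*kB*), which gives a corrected frequency of *A*/(*A* + *kB*). A first-order (delta-method) expansion of that expression in $\stackrel{\circ}{k}$ gives an error with standard deviation of roughly *p*(1 – *p*) × *CVK*. The sketch below encodes both; it is an illustration derived from the definitions in the text, not a reproduction of Σ_{k}, and the function names are hypothetical.

```python
def k_corrected_frequency(A, B, k_hat):
    """Frequency of allele A in a pool from signals A and B, corrected by k_hat."""
    return A / (A + k_hat * B)


def k_error_sd(p, cvk):
    """Delta-method s.d. of the corrected frequency due to error in k_hat."""
    # d/dk [A / (A + k B)] = -p (1 - p) / k, and s.d.(k_hat) = CVK * k.
    return p * (1 - p) * cvk
```

For example, a pool whose signals show the same 2:1 ratio as the heterozygotes (`k_corrected_frequency(2.0, 1.0, 2.0)`) is corrected to a frequency of 0.5.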

When the test statistic during stage I is *T*_{pool}, we have:

The test statistic *T* during stage II is defined similarly as in the one-stage design. Under the null hypothesis of no association, *T*_{pool} and *T* follow an approximate bivariate normal distribution *N*(0, Σ_{0}), where

and

Under the alternative hypothesis, *T*_{pool} and *T* follow an approximate bivariate normal distribution *N*($\underset{~}{\mu}$_{2}, Σ_{1}), where

and

Given the sample size 2*N*, π^{p}_{sample} and π^{p}_{marker} must be such that the following two equations are satisfied:

and

This is to achieve an overall study power of 1 – β (which will be less than, but close to, the power of the one-stage individual genotyping design), while controlling the marker-wise false positive rate at α/*M*. Here the critical values *K*_{pool} and *K* of stage I and stage II are functions of π^{p}_{sample}, π^{p}_{marker}, GRR, pool size, and risk-allele frequency. Note that we are able to find multiple pairs of (π^{p}_{sample}, π^{p}_{marker}) estimates using a grid search method such that the above conditions are met. Among those pairs of parameter estimates, the optimal set (π^{p}_{sample}, π^{p}_{marker}) is the one that minimizes CFP for a given *R*.

#### Fixed Power

The above scenarios can be readily adapted to a situation where a constant study power (e.g., 80% power), rather than a sample size, is fixed. Let α/*M* be the marker-wise false positive rate and 1 – β be the desired study power. Assume a balanced case-control design with *N* samples in each group. The sample size needed to have 1 – β power in a one-stage design is given by:

With the calculated *N*, the optimal CFP and CFI can be obtained following the same procedure as described above with a fixed sample size.
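One way to obtain *N* is to invert the one-stage power expression 1 – β = 1 – Φ(*z*_{1 – α/M} – μ), that is, to set μ = *z*_{1 – α/M} + *z*_{1 – β} and solve for *N*. The sketch below does this under the assumed allelic-test form of μ, in which each group's allele-frequency estimate has variance *P*(1 – *P*)/(2*N*); the function name is illustrative.

```python
from math import ceil
from statistics import NormalDist


def one_stage_sample_size(P_A, P_U, power=0.80, alpha=0.05, M=500_000):
    """Smallest N (cases = controls = N) reaching the requested one-sided power."""
    std = NormalDist()
    z_alpha = std.inv_cdf(1 - alpha / M)  # Bonferroni-corrected critical value
    z_beta = std.inv_cdf(power)
    v = P_A * (1 - P_A) + P_U * (1 - P_U)
    # Solve (P_A - P_U) * sqrt(2N / v) = z_alpha + z_beta for N.
    return ceil((z_alpha + z_beta) ** 2 * v / (2 * (P_A - P_U) ** 2))
```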

## Results

### Parameter Settings

We considered two sets of GRRs and four genetic models. The first GRR set is: ψ_{1} = ψ_{2} = 1.5 for the dominant models; ψ_{1} = 1, ψ_{2} = 1.5 for the recessive models; ψ_{1} = 1.5, ψ_{2} = 2.25 for the multiplicative models and ψ_{1} = 1.5, ψ_{2} = 3 for the additive models. For illustration simplicity, we refer to this set as GRR = 1.5. The second set is: ψ_{1} = ψ_{2} = 4 for the dominant models; ψ_{1} = 1, ψ_{2} = 4 for the recessive models; ψ_{1} = 4, ψ_{2} = 16 for the multiplicative models and ψ_{1} = 4, ψ_{2} = 8 for the additive model. We refer to this set as GRR = 4. We set *f*_{0} to 0.01 for all models considered. We also considered risk-allele frequencies of *p* = 0.05, 0.1, 0.2, 0.3 and 0.4; the ratio of the per-genotyping cost during stage II to that during stage I being *R* = 1, 10 and 15; the pool-specific errors of σ_{p} = 0, 0.005, 0.01, 0.03 and 0.05 and the pool sizes of *s* = 20, 40, 60, and 100. We considered three *CVK* levels, 0.05, 0.2 and 0.5, which were chosen based on suggestions from a simulation study by Yang et al. [23]. The total sample size was fixed at 2*N* = 2,000, with 1,000 cases and 1,000 controls. For the fixed-power scenario, the power was fixed at 80%. The total number of markers was fixed at *M* = 500,000. For simplicity, among 500,000 markers, one marker was assumed to be the true disease marker (*D* = 1). The overall Type I error rate was controlled at α = 0.05 level by controlling the marker-wise false positive rate at α_{marker} = 1 × 10^{−7}. The optimal cost fractions were evaluated when the powers of the two-stage designs reached at least 99% of the powers of the one-stage designs.

### Effect of R on the Optimal Cost Fractions

The optimal cost fractions of the two-stage DNA pooling designs (CFP) and the optimal cost fractions of the two-stage individual genotyping designs (CFI) under the multiplicative model are summarized in table 1 with different risk-allele frequencies, *p*, and different ratios of the per-genotyping cost during stage II to that during stage I, *R*. For the two-stage DNA pooling designs, the pool size *s* was fixed at 20, the pool-specific error σ_{P} = 0.03, and *CVK* = 0.2. We noticed that under the current parameter setting, the cost of the two-stage DNA pooling design is around 10-fold lower than that of the two-stage individual genotyping design no matter what the *R* or *p* values are. Similar trends were also observed in the other genetic models (data not shown).

Table 1. Effect of *R* on the optimal cost fractions, CFP and CFI, and the distribution of the optimal parameters in the two-stage DNA pooling designs and the two-stage individual genotyping ...

Although increasing trends were observed in both CFI and CFP as *R* increases (table 1), CFP increases only slightly on the absolute scale. For example, under the multiplicative model when GRR = 1.5, *p* = 0.05 and σ_{p} = 0.03, when *R* increases from 1 to 15, CFP increases from 4.7 to 8.4%, while CFI increases from 36 to 55%. One explanation could be that when the per-genotyping cost during stage II is very high, the DNA pooling design tends to use more samples during stage I; however, increasing the number of samples does not increase the number of genotyping assays as dramatically as with the individual genotyping design. A similar effect of *R* on CFP and CFI was also observed under the dominant model and when the GRR and pool-specific error, σ_{p}, are large (fig. 1).

### Distributions of the Optimal Parameters

Also presented in table 1 are the optimal proportions of samples used during stage I (π^{p}_{sample} or π_{sample}) and the optimal proportions of markers evaluated during stage II (π^{p}_{marker} or π_{marker}) for the two-stage DNA pooling designs or the two-stage individual genotyping designs under the multiplicative model. For the two-stage DNA pooling designs, π^{p}_{sample} ranges from 56 to 100% and π^{p}_{marker} ranges from 0.01 to 0.42%. For the two-stage individual genotyping designs, a smaller proportion of samples is genotyped during stage I, with π_{sample} ranging from 27 to 50%, and a larger proportion of markers is evaluated during stage II, with π_{marker} ranging from 0.69 to 11%. Similar trends were also observed under the other genetic models (data not shown). This observation is consistent with what we expected because the cost savings of the two-stage DNA pooling designs mainly come from the usage of DNA pooling during stage I. If all samples are pooled for genotyping during stage I (i.e., π^{p}_{sample} = 100%), with a pool size of 20 and the pool-specific error less than 0.05, CFP is always around 5%, regardless of the genetic models, risk-allele frequencies, and *R* (data not shown).

### Effect of DNA Pooling-Related Measurement Errors on the Optimal Cost Fractions

Table 2 summarizes the effect of the pool-specific errors, σ_{P}, on CFP under the multiplicative model when GRR = 1.5, *R* = 15 and the pool size *s* = 20. There is an increasing trend in CFP as σ_{P} increases. However, the magnitude of increase in CFP decreases as the risk-allele frequency increases. For example, when the risk-allele frequency is 0.05, CFP increases from 3.9 to 67% as σ_{P} increases from 0 to 0.05. When the risk-allele frequency is 0.4, CFP increases only slightly from 3.6 to 4.9%. One possible explanation is that, as the true risk-allele frequency increases, the relative influence of σ_{P} on the allele frequency estimates, and hence on the study power, decreases. Similar trends were also observed in the other genetic models (data not shown).

Table 2. Effect of the pool-specific error, σ_{p}, on the optimal cost fractions (CFP) under the multiplicative model with different levels of the risk-allele frequency when GRR = 1.5, *R* = 15, and the pool size *s* = 20

Figure 2 shows the effect of *CVK* on CFP, which suggests that *CVK* has little impact on CFP when the risk-allele frequency is high. When the risk allele is rare, increasing *CVK* increases CFP. This is because the errors due to unequal allele amplification (introduced in the allele frequency estimates through errors in estimating the k-correction factor) involve not only *CVK* but also risk-allele frequency (see Σ_{k}). When the risk allele is rare, the relative influence of *CVK* on the allele frequency estimates, and hence on the study power, increases. Similar effects of *CVK* on CFP were also observed in the other genetic models (data not shown).

### Effect of the Pool Sizes on the Optimal Cost Fractions

The total costs of stage I in the two-stage DNA pooling designs mainly depend on the number of DNA pools formed, which is determined by the pool size when the sample size is fixed. The effects of the pool sizes on CFP under the multiplicative model when *R* = 15, *p* = 0.1, and GRR = 1.5 are summarized in table 3. The effect of the pool sizes on CFP depends heavily on σ_{P}. When σ_{P} = 0.01, there is a consistent decreasing trend in CFP as the pool size is increased from 20 to 100. However, when σ_{P} = 0.05, there is a consistent increasing trend in CFP as the pool size is increased from 20 to 100. When σ_{P} = 0.03, CFP decreases first, then increases when the pool size is increased from 20 to 100. The different effects of the pool sizes at different levels of σ_{P} are due to the small number of pools generated during stage I when the pool size is very large. This leads to a relatively larger effect of σ_{P} on the allele-frequency estimates and the study power when σ_{P} is large. Similar patterns were also observed in π^{p}_{marker}, as the pool size was increased at different levels of σ_{P}, but π^{p}_{sample} increased consistently as the pool size was increased at all levels of σ_{P} considered. Similar trends were also observed in the other genetic models (data not shown).
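The interaction between pool size and σ_{P} can be seen from the variance of the stage I frequency difference under the linear model in the Methods. With independent pool-specific errors of variance σ^{2}_{p}, averaging over *m* pools per group contributes 2σ^{2}_{p}/*m* to that variance, and *m* shrinks as *s* grows. The sketch below is a simplified illustration that ignores the unequal-amplification term; the function name and the example inputs are assumptions.

```python
def stage1_variance(P_A, P_U, n_stage1, s, sigma_p):
    """Variance of the pooled case-control frequency difference (no k term)."""
    m = n_stage1 // s  # pools per group
    # Binomial sampling of 2 * m * s alleles in each group.
    sampling = (P_A * (1 - P_A) + P_U * (1 - P_U)) / (2 * m * s)
    # One independent pool-specific error per pool, in each of two groups.
    pooling = 2 * sigma_p ** 2 / m
    return sampling + pooling


# Stage I with 1,000 subjects per group, for two pool sizes and two error levels.
for sigma_p in (0.01, 0.05):
    for s in (20, 100):
        print(sigma_p, s, stage1_variance(0.14, 0.10, 1000, s, sigma_p))
```

The sampling term is unchanged when the same subjects are split into fewer, larger pools, whereas the pooling term grows in proportion to *s*, which is why large pools become costly when σ_{P} is large.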

### Effects of All Three Factors on the Optimal Cost Fractions when the Study Power Is Fixed

We also investigated the effects of *R*, pooling-related measurement errors and pool sizes on the optimal cost fractions when the study power was fixed at 0.8. The patterns of the effects of *R*, the pool-specific errors and the pool sizes on CFP were similar to those observed when the sample size was fixed. A set of tables and figures is provided in the supplementary materials (www.karger.com/doi/10.1159/000164398).

## Discussion

We have proposed an optimal two-stage DNA pooling design as a screening tool to identify genetic susceptibility loci in GWA studies, where we focused on the joint analysis in the two-stage design given its demonstrated greater power than the replication analysis [22]. The effects of the following three factors on the cost of the optimal two-stage DNA pooling designs were studied: (1) DNA pooling-related measurement errors; (2) the ratio of the per-genotyping assay cost during stage II to that during stage I, and (3) the DNA pool sizes. We considered a wide range of parameter settings. Our results suggest that the optimal two-stage DNA pooling designs are much more cost-effective than the optimal two-stage individual genotyping designs under most scenarios considered.

The effects of the three factors on the optimal study cost of the two-stage DNA pooling design can be summarized as follows. First, increasing the pool-specific errors σ_{p} increases the optimal cost fraction CFP, the proportion of samples evaluated during stage I with DNA pooling (π^{p}_{sample}) and the proportion of markers evaluated during stage II with individual genotyping (π^{p}_{marker}). In most scenarios, CFP is smaller than CFI, providing a more cost-effective design. In some scenarios, when σ_{p} is large and the risk allele is rare, or when σ_{p} is moderate and the pool size is large, the relative influence of σ_{p} on the allele-frequency estimates, and hence on the study power, increases. In these scenarios, the two-stage DNA pooling designs might not be as cost-effective as the two-stage individual genotyping designs. The coefficient of variation of $\stackrel{\circ}{k}$, *CVK*, hardly affects CFP when the risk allele frequency is high. When the risk allele is rare, increasing *CVK* increases CFP. This is because the errors introduced by estimating the k-correction factor involve not only *CVK* but also the true risk-allele frequency. When the risk allele is rare, the relative influence of *CVK* on the allele-frequency estimates, and hence on the study power, increases.

Second, increasing the ratio of the per-genotyping cost during stage II to that during stage I increases both CFP and CFI. Note that when *R* increases from 1 to 15, the absolute increase of 17.5% for CFI might have a huge impact on the actual study cost, while the absolute increase of 1.4% for CFP will have much less impact on the actual study cost. *R* has a smaller effect on CFP than on CFI because, when *R* increases, more samples could be evaluated during stage I with DNA pooling, and increasing the number of samples does not increase the number of genotyping assays as dramatically as with individual genotyping. For example, as shown in table 1, when the risk-allele frequency is 0.1, about 20% more samples will be evaluated during stage I in both the two-stage DNA pooling designs and the two-stage individual genotyping designs when *R* increases from 1 to 15. However, the increase in the number of genotyping assays during stage I with 20% more samples will be much smaller with DNA pooling than with individual genotyping.

Third, increasing the pool size does not always decrease CFP. The effect of pool size on CFP is heavily dependent on σ_{p}. When σ_{p} is smaller than 0.05, CFP is always lower than CFI if the pool size is between 20 and 60. When σ_{p} is 0.05, CFP is lower than CFI only when the pool size is 20. Therefore, in the DNA pooling-based GWA studies when σ_{p} is unknown, a pool size of 20 may be the best choice.

Although our results clearly suggest the cost-effectiveness of the two-stage DNA pooling designs, there are disadvantages to using DNA pooling as well. Note that since no genotypic data is available with DNA pooling-based studies, we could not perform tests of genotypic association based on the assumptions of different genetic models such as dominant or recessive models. We could only perform tests of allelic association. In addition, it is common for large-scale genotyping studies to have some missing genetic markers in some subjects. Existing imputation methods to infer missing genotypes such as impute [24] and TUNA [25], which utilize haplotype information, were developed for individual genotyping and provide benefits to individual genotyping but not to DNA pooling. We expect more effort from researchers to develop new imputation methods for DNA pooling. Moreover, pooled genotype data pose even greater challenges in haplotype-based analysis than individual genotype data. Several studies have been done on the efficiency of haplotype estimation and on haplotype-disease associations with pooled DNA [26,27,28]. Although software has been developed for haplotype estimation with pooled DNA [27] and for directly analyzing SNP array intensity data [12], the lack of software for analyzing pooled data with more sophisticated methods makes the amount of time spent analyzing pooled data much greater than that for individual genotyping data. More research effort is needed in developing software to further promote DNA pooling designs.

We did not take linkage disequilibrium (LD) into account. We assumed all SNP markers to be independent. In analysis of disease association with real data, the standard significance tests assuming independence may not be valid. A permutation procedure that maintains the autocorrelation between test statistics of the adjacent SNP markers could be applied to obtain the empirical Type I error. We did not consider technical replicates of DNA pools but assumed only one replicate per pool. When pooling error is quite large, running two or more arrays per pool might be helpful. This point was demonstrated nicely in the work of Ji et al. [29]. However, in array-based DNA pooling, a recent study [20] suggests that most of the pooling variation is attributable to array errors rather than pool-construction errors. We only considered a rather stringent Type I error rate with the Bonferroni correction. We believe a similar relationship between CFP and CFI would be observed when applying other types of multiple comparison adjustments, such as the False Discovery Rate (FDR). We assumed genetic models with both DNA pooling strategy and individual genotyping in the design stage. If a certain genetic model is assumed, the genetic relative risks of genotypes AA and Aa will follow some relationship (e.g., under the dominant model, GRR of genotype AA will be equal to GRR of genotype Aa). If no genetic model is assumed, the genetic relative risks of genotypes AA and Aa will not necessarily follow any relationship. We believe a similar relationship between CFP and CFI would be observed when no genetic model is assumed.

In summary, the optimal two-stage DNA pooling designs are more cost-effective than the optimal two-stage individual genotyping designs under most scenarios considered. Because cost remains the major limiting factor for GWA studies, applying the optimal two-stage DNA pooling designs will make hypothesis-free GWA studies much more feasible, especially in the early stages of disease mapping, when interest is focused on detecting differences in allele frequencies between the case and control groups. Our results suggest that the optimal two-stage DNA pooling designs have only 1–5% of the costs of the one-stage individual genotyping designs, leading to a 4- to 20-fold decrease in costs compared to the optimal two-stage individual genotyping designs, which have 20–50% of the costs of the one-stage individual genotyping designs. When the pool-specific error σ_{p} is large, a relatively small pool size should be considered. Nevertheless, a pool size of 20 or smaller will lead to a cost-effective design even when pooling error is large. In general, two-stage DNA pooling designs use a relatively large proportion of samples during stage I (at least 70% when the per-genotype cost is higher during stage II) and evaluate a relatively small proportion of markers (less than 0.5%) during stage II. To obtain the optimal two-stage DNA pooling design for a specific scenario, we recommend that readers use our R program.
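As a rough plausibility check on these figures, the sketch below computes the cost of a two-stage pooling design as a fraction of the one-stage individual genotyping cost. It is a simplified accounting, not the cost function used in this study or in our R program: it assumes one array per pool in stage I, individual genotyping of all samples at the selected markers in stage II, and hypothetical per-assay cost ratios `stage1_ratio` and `stage2_ratio` relative to the one-stage per-genotype cost.

```python
def pooling_cost_fraction(pi_sample, pi_marker, pool_size,
                          stage1_ratio=1.0, stage2_ratio=1.0):
    """Approximate cost relative to one-stage individual genotyping.

    pi_sample    : proportion of samples pooled in stage I
    pi_marker    : proportion of markers carried into stage II
    pool_size    : number of individuals per pool
    stage1_ratio : assumed cost of one pooled assay relative to one
                   individual genotype (hypothetical parameter)
    stage2_ratio : assumed stage-II per-genotype cost relative to the
                   one-stage per-genotype cost (hypothetical parameter)
    """
    if pool_size < 1:
        raise ValueError("pool_size must be at least 1")
    # Stage I: one assay per pool and marker instead of one per individual.
    stage1 = (pi_sample / pool_size) * stage1_ratio
    # Stage II: every individual is genotyped at the selected markers.
    stage2 = pi_marker * stage2_ratio
    return stage1 + stage2

# 70% of samples in pools of 20 and 0.5% of markers in stage II
# give a cost fraction of roughly 0.04 under these assumptions.
example_fraction = pooling_cost_fraction(0.70, 0.005, 20)
```

Under these assumptions the stage I term shrinks in proportion to the pool size, which is why the total can fall to a few percent of the one-stage cost even when most samples are used in stage I.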

## Acknowledgements

This work was supported in part by grant 5-T32-CA09529 from the National Cancer Institute. We thank Dr. Susan E. Hodge for helpful discussions.

## References
