- Journal List
- Genetics
- v.168(4); 2004 Dec
- PMC1448705

# An Efficient Resampling Method for Assessing Genome-Wide Statistical Significance in Mapping Quantitative Trait Loci

^{*}Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina 27599-7420

^{†}Department of Statistics, University of Wisconsin, Madison, Wisconsin 53706

^{1}

*Corresponding author:*Department of Biostatistics, University of North Carolina, 3107D McGavran-Greenberg Hall, CB 7420, Chapel Hill, NC 27599-7420. E-mail: ude.cnu.soib@uozf

## Abstract

Assessing genome-wide statistical significance is an important and difficult problem in multipoint linkage analysis. Due to multiple tests on the same genome, the usual pointwise significance level based on the chi-square approximation is inappropriate. Permutation is widely used to determine genome-wide significance. Theoretical approximations are available for simple experimental crosses. In this article, we propose a resampling procedure to assess the significance of genome-wide QTL mapping for experimental crosses. The proposed method is computationally much less intensive than the permutation procedure (in the order of 10^{2} or higher) and is applicable to complex breeding designs and sophisticated genetic models that cannot be handled by the permutation and theoretical methods. The usefulness of the proposed method is demonstrated through simulation studies and an application to a Drosophila backcross.

A number of statistical methods are available for mapping QTL in experimental populations, such as backcrosses (BCs) and F_{2}'s. The interval-mapping method of Lander and Botstein (1989) uses two markers flanking a region where the QTL may fall and evaluates the LOD score at each genome position. This method has been implemented in several freely distributed software packages (Lincoln* et al.* 1993; Basten* et al.* 1997; Manly and Olson 1999) and is commonly used in practice. Various extensions, including composite-interval mapping (CIM; Zeng 1993, 1994), the multiple-QTL model (Jansen and Stam 1994), and multiple-interval mapping (MIM; Kao and Zeng 1997; Kao* et al.* 1999), can be used to map multiple QTL. Broman (2001) and Doerge (2001) provided excellent reviews of QTL-mapping methods.

All the aforementioned methods entail a common problem: how to determine the threshold of the test statistic. This is not a trivial problem. Many factors, such as genome size, genetic map density, informativeness of markers, and proportion of missing data, may affect the distribution of the test statistic. The usual pointwise significance level based on the chi-square approximation is inadequate because the entire genome (or at least several regions) is tested for the presence of QTL and the test statistics are not independent among loci.

Theoretical approximations have been developed to determine threshold and power (Lander and Botstein 1989; Dupuis and Siegmund 1999; Rebai* et al.* 1994, 1995) in some standard designs. For backcross populations, Lander and Botstein (1989) showed that with an infinitely dense map, the LOD score may be approximated in large samples by an Ornstein-Uhlenbeck diffusion process. Dupuis and Siegmund (1999) derived a similar result for F_{2}. Zou* et al.* (2001) extended the results to more general experimental designs. The asymptotic calculations are straightforward, but require relatively dense maps with fairly evenly spaced markers. The parameters needed in the calculations are model specific and are difficult to determine for complicated designs. Furthermore, the calculations are applicable only to single-QTL models and not to multiple-QTL mapping.

Rebai* et al.* (1994)(1995) noted that, for interval mapping, the position of the QTL is a parameter that presents only under alternative hypotheses. On the basis of this observation, they found an explicit formula for the upper bound for the BC and F_{2} populations and derived a conservative threshold using the results of Davies (1977)(1987). The formula is algebraically involved and an approximation to the upper bound is usually necessary. Piepho (2001) proposed an efficient numerical method to compute the thresholds in Rebai* et al*. (1994)(1995) for general designs. His simulation results indicate that the approximation is generally conservative when markers are relatively dense.

To avoid asymptotic approximations, one may use permutation testing (Churchill and Doerge 1994). The idea is to replicate the original analysis many times on data sets generated by randomly reshuffling the original trait data while leaving the marker data unchanged. This approach accounts for missing marker data, actual marker densities, and nonrandom segregation of marker alleles. However, this method is computationally intensive. For MIM, where model selection is involved, Zeng* et al.* (1999) proposed using a bootstrap resampling method for hypothesis testing. However, the heavy computational burden has limited the use of the bootstrap test (Z-B. Zeng, personal communication). Furthermore, permutation testing is limited to situations in which there is complete exchangeability under the null hypothesis. It is this exchangeability that ensures the validity of inference based on the permutation distribution. It is unclear how to apply the bootstrap method in Zeng* et al.* (1999) to the situation where a nonlinear model, such as logistic regression or a Poisson model, is used to map multiple QTL with MIM, since the bootstrap procedure in Zeng* et al.* (1999) is performed on model-based residuals.

In this article, we propose a resampling method to assess the genome-wide significance for QTL mapping. The method is less computationally demanding than permutation tests, more accurate than theoretical approximations when rigid requirements of theoretical approximations are not satisfied, and applicable to more complicated designs and models than the theoretical and permutation methods are. The performance of the proposed method is assessed through simulation studies. An illustration with data from the Drosophila backcross of Zeng* et al.* (2000) is provided.

## METHODS

We use single-QTL mapping in an F_{2} population as a working example to illustrate the rationale of the proposed method. We assume that the trait is normally distributed. Extensions to nonnormal traits and other mapping populations as well as MIM/CIM are described later.

### Resampling methods for single-QTL models:

For mapping a quantitative trait, a series of genetic markers are observed over the entire genome or in some specific regions depending on the purpose of the experiment. Specifically, we observe *L* genetic markers {*M _{il}*;

*l*= 1, … ,

*L*} located at positions {

*s*;

_{l}*l*= 1, … ,

*L*} along the genome for subject

*i*(

*i*= 1, … ,

*n*). We also observe the trait value

*y*for the

_{i}*i*th subject (

*i*= 1, … ,

*n*). The goal of the QTL mapping is to use the marker information to search for QTL associated with the trait.

At each fixed location *d* of the genome, the conditional probabilities of the unobserved QTL genotypes can be inferred using flanking markers and the distribution of the quantitative trait given the markers follows a discrete mixture model. Specifically, for a given locus *d*, the log-likelihood for θ = (*a*, *b*, μ, σ^{2}) takes the form

where *l _{i}*(θ;

*d*) = log[π

*(*

_{i}*d*)((

*y*− μ +

_{i}*a*)/σ) + π

*(*

_{i}*d*)((

*y*− μ −

_{i}*b*)/σ) + π

*(*

_{i}*d*)((

*y*− μ −

_{i}*a*)/σ)], (

*x*) is the density of a standard normal random variable, μ is the grand mean,

*a*and

*b*are the additive and dominant effects of the QTL, respectively, and π

*(*

_{i}*k*;

*d*) = Pr(QTL genotype at locus

*d*of subject

*i*is

*k*|subject

*i*'s marker genotypes) with

*k*=

_{2}). Note that the conditional probability π

*(*

_{i}*k*;

*d*) depends on the flanking marker genotypes, the distance between the two flanking markers, as well as the distances between the putative QTL locus

*d*and the right and left markers; see chapter 15 of Lynch and Walsh (1998) for details. For other mapping populations, such as advanced intercrosses and advanced backcrosses, the likelihood still takes the form of (1), although the conditional probabilities π

*(*

_{i}*k*;

*d*) are calculated differently (Lynch and Walsh 1998, Chap. 15). With missing markers, π

*(*

_{i}*k*;

*d*) is conditional on the genotypes of the two closest flanking markers if both are available or on the genotype of the single marker if only one flanking marker is available.

The maximum-likelihood estimator (MLE) (*â*, , , ^{2}) can be obtained by maximizing *l*(θ; *d*) directly or by using the EM algorithm (Dempster* et al.* 1977) in which the unknown QTL genotypes are treated as missing data. Let the maximum-likelihood estimator of θ under H_{0}: *a* = *b* = 0 be denoted by (0, 0, , ^{2}). Then the likelihood-ratio test statistic (LRT) for testing H_{0}: *a* = *b* = 0 against H_{1}: *a* ≠ 0 and/or *b* ≠ 0 at location *d* takes the form

which is approximately chi-square distributed with 2 d.f. We can replace in (1) with a nonnormal density function, such as exponential (continuous phenotype) and binomial or Poisson (discrete phenotype). In practice, the true distribution is often unknown and the normal model is used almost exclusively with continuous phenotypes. As is shown later in simulation studies, the proposed method is quite robust to model misspecification. When the normal model is used to fit nonnormal data (chi-square data as in our simulation), the empirical type I error based on our proposed method is well controlled at the targeted level (Table 1).

**Comparison of the proposed, theoretical, and empirical thresholds in an F**

_{2}population and an advanced intercross F_{3}populationIn multipoint linkage analysis, we maximize LRT(*d*) or LOD(*d*) over all possible values of *d* in the genome. Thus, it is necessary to derive the distribution of LRT(*d*) as a stochastic process indexed by the genome location *d*. To this end, it is more convenient to work with the score test statistic for testing the same hypothesis. The equivalence between the likelihood ratio and score test statistics in large samples is shown in mathematical statistics texts, such as Cox and Hinkley [1974, Sect. 9.3 (iii)]. The reason for working with the score test statistic is that it can be approximated by a sum of independent random vectors so that its large-sample distribution, when regarded as a stochastic process in the genome location, can be readily derived. The same large-sample distribution also applies to the likelihood-ratio statistic since it is equivalent to the score test statistic in large samples.

For the interval mapping of the F_{2} population, we write θ = (*a*, *b*, μ, σ^{2}) and we are interested in testing the null hypothesis H_{0}: β (*a*, *b*) = 0 in the presence of the nuisance parameter η (μ, σ^{2}). In the sequel, the general notation of θ will be used, where β pertains to the parameter of primary interest and η to the nuisance parameter, so that general QTL models other than the specific F_{2} model are encompassed.

Let *U*_{β,}* _{i}*(θ;

*d*) =

*l*(β, η;

_{i}*d*)/β and

*U*

_{η,}

*(θ;*

_{i}*d*) =

*l*(β, η;

_{i}*d*)/η. These are the contributions of the

*i*th subject to the score functions for β and η. Further, let

*U*(

*d*) = ∑

_{i}U_{β,}

*(0, ;*

_{i}*d*), where is the restricted MLE of η under H

_{0}: β = 0,

*i.e.*, the solution of the equation ∑

_{i}U_{η,}

*(0, η;*

_{i}*d*) = 0. Note that

*U*(

*d*) is the score function for β evaluated at β = 0 and η = . It follows from Taylor series expansions and the law of large numbers that

*n*

^{−1/2}

*U*(

*d*) has the same asymptotic distribution as , where

and ∑_{βη}(β, η; *d*) and ∑_{ηη}(β, η; *d*) are, respectively, the limits of *n*^{−1}^{2}*l*(β, η; *d*)/βη and *n*^{−1}^{2}*l*(β, η; *d*)/η^{2} as *n* goes to infinity [Cox and Hinkley 1974, Sect. 9.3 (iii)]. Since *U _{i}*(

*d*) involves only the information from the

*i*th subject, the

*U*(

_{i}*d*) (

*i*= 1, … ,

*n*) are independent zero-mean random variables for any given

*d*. Thus, it follows from the multivariate central limit theorem that the process

*n*

^{−1/2}

*U*(

*d*) is asymptotically a zero-mean Gaussian process, where the covariance between

*n*

^{−1/2}

*U*(

*d*

_{1}) and

*n*

^{−1/2}

*U*(

*d*

_{2}) at any two given positions

*d*

_{1}and

*d*

_{2}is Ξ(

*d*

_{1},

*d*

_{2}), the limit of

*n*

^{−1}∑

_{i}

*U*

_{i}

*d*

_{1}

*U*

^{T}

_{i}

*d*

_{2}. The replacement of the unknown parameters in (3) by their sample estimators yields

The restricted MLE in (4) may be replaced by the unrestricted MLE . By the law of large numbers and the consistency of the maximum-likelihood estimators, Ξ(*d*_{1}, *d*_{2}) can be consistently estimated by .

The score test statistic for H_{0}: β = 0 against H_{1}: β ≠ 0 at location *d* takes the form

where *Û*(*d*) = ∑* _{i}Û_{i}*(

*d*) and (

*d*) =

*n*(

*d*,

*d*) [Cox and Hinkley 1974, Sect. 9.3 (iii)]. It can be shown that

*W*(

*d*) is asymptotically equivalent to LRT(

*d*) [Cox and Hinkley 1974, Sect. 9.3 (iii)]. To assess the genome-wide statistical significance, we need to evaluate the distribution of max

*(*

_{d}W*d*). In general, this is not analytically tractable. We propose a resampling method similar to that of Lin

*et al.*(1993) to approximate the distribution of max

*(*

_{d}W*d*). The idea of simulating thresholds using a score test statistic is mentioned in Rebai

*et al.*(1994).

Define

where *G _{i}* (

*i*= 1, … ,

*n*) are independent standard normal random variables. Let

In (5), we regard the *Û _{i}*(

*d*) in

*U**(

*d*) and as fixed and the

*G*in

_{i}*U**(

*d*) as random. Conditional on the observable data,

*U**(

*d*) is normal with mean 0 at each location

*d*and the covariance between

*n*

^{−1/2}

*U**(

*d*

_{1}) and

*n*

^{−1/2}

*U**(

*d*

_{2}) equals (

*d*

_{1},

*d*

_{2}), which converges to Ξ(

*d*

_{1},

*d*

_{2}). It follows that the conditional distribution of

*n*

^{−1/2}

*U**(

*d*) given the observed data converges to the same limiting distribution of

*n*

^{−1/2}

*Û*(

*d*). Consequently, the distribution of

*W*(

*d*) can be approximated by that of

*W**(

*d*). Our resampling method is essentially a parametric bootstrap.

We have shown that, under the null hypothesis, the test statistics are functions of certain zero-mean Gaussian processes over the genome positions and the realizations from the Gaussian processes can be generated by Monte Carlo simulations. In practice, the resampling procedure is as follows:

- Sample
*G*,_{i}*i*= 1, 2, … ,*n*, from*N*(0, 1). - Calculate and
*S** = max*(_{d}W*d*). - Repeat steps 1 and 2 a large number of times, say
*R*times. - For a given genome-wide type I error rate α, calculate the 100(1 − α)th percentile of the
*R*values of the*S**. If the observed value of the LRT exceeds this threshold, then reject the null hypothesis.

The above calculations are based on the score function and the observed information matrix from the original data. These quantities are evaluated once and used repeatedly in step 2. Since it does not involve refitting the model in each iteration, the proposed method is computationally much more efficient than the permutation method. This is important with complex breeding designs and sophisticated QTL models (*e.g.*, CIM and MIM), where the likelihood calculations are time consuming.

### Extensions to MIM/CIM:

In this section, we show how to apply the resampling method to MIM and CIM. Suppose that we are searching for multiple QTL in a backcross population. Given the genotypes of *K* QTL, the normal regression model takes the form

where *x _{k}* is the QTL genotype indicator variable, which takes the value −1 or 1 when the

*k*th QTL is heterozygote or homozygote, respectively, μ is the grand mean, γ

*is the main effect of the*

_{k}*k*th QTL, γ

*is the interaction between the*

_{jk}*j*th and

*k*th QTL, and

*e*is a zero-mean normal error with variance σ

^{2}.

As in the case of the single-QTL analysis, the QTL genotypes are generally unobservable but the conditional probabilities of the QTL can be calculated given flanking markers. This results in the following mixture-model likelihood for *K* putative QTL loci *d*_{1}, … , *d _{K}*,

where *l _{i}*(θ;

*d*

_{1}, … ,

*d*) =

_{K}and π* _{i}*(

*a*

_{1}, … ,

*a*;

_{K}*d*

_{1}, … ,

*d*) = Pr(

_{K}*x*

_{1}=

*a*

_{1}, … ,

*x*=

_{K}*a*|subject

_{K}*i*'s marker genotypes), which is the conditional probability of the joint genotypes of

*K*QTL given the marker genotypes of the

*i*th subject. Let θ = (β, η), where β = (γ

_{1}, … , γ

*, γ*

_{K}_{12}, … , γ

_{K}_{−1,}

*) and η = (μ, σ*

_{K}^{2}). We test the null hypothesis H

_{0}: β = 0 against the alternative hypothesis H

_{1}: β ≠ 0. Note that for MIM, the profile likelihood is calculated in

*K*-dimensional space (

*d*

_{1}, … ,

*d*). Once the likelihood is obtained, the resampling procedure above can be applied to the resulting score test statistic with

_{K}*d*= (

*d*

_{1}, … ,

*d*).

_{K}For CIM, the model is essentially the same as MIM except that, for a given putative QTL position *d*, (*x*_{1}, … *x _{K}*

_{−1}) corresponding to the selected marker genotypes are known and only

*x*corresponding to the putative QTL genotype is unobservable. Also, in CIM the interaction terms are generally ignored. Thus for CIM, our mixture model will have likelihood (1) but and π

_{K}*(*

_{i}*a*;

_{K}*d*) = Pr(

*x*=

_{K}*a*|subject

_{K}*i*'s marker genotypes), the conditional probability of the genotypes of the putative QTL given the marker genotypes of the

*i*th subject. In this situation, β = γ

*and η = (μ, γ*

_{K}_{1}, … , γ

_{K}_{−1}, σ

^{2}).

## SIMULATION STUDIES

Simulations were conducted to study the behavior of the proposed method in an F_{2} population. One chromosome with a total length of 100 cM was simulated. The markers were evenly spaced with a marker distance of 2, 10, or 20 cM. The null and alternative models were simulated to investigate the type I error and power. Under the null hypothesis, the trait was randomly sampled from the standard normal distribution. Under the alternative, a QTL was simulated at 40 cM with different additive and dominant effects. We set the sample size to 200. We simulated 10,000 data sets for each combination of the marker distance and QTL effects. For each simulated data set, we set *R* = 10,000 and α = 0.05 or α = 0.01. The step width of the QTL scan is set to 1 cM for all simulations. To compare the resampling method with the theoretical method, we also calculated the thresholds on the basis of the dense-map and sparse-map approximations of Dupuis and Siegmund (1999) as well as the corresponding type I error and power. The results are summarized in Tables 1 and and2.2. To demonstrate the generality of the proposed method, simulations were also performed on an advanced intercross F_{3} (see Tables 1 and and22).

**Empirical type I error and power of the proposed and theoretical methods in an F**

_{2}population and an advanced intercross F_{3}populationThe thresholds based on the proposed method match the empirical thresholds reasonably well, and the thresholds are similar when the data are generated from the null and alternative hypotheses. The corresponding tests have proper type I error and power. The theoretical thresholds based on the dense-map assumption are too conservative while those based on the sparse-map approximation tend to be too liberal, especially for sparse maps. The results based on the method of Piepho (2001) are also included in Table 2. As mentioned before, Piepho's method is generally conservative when the marker density is high. In contrast, the proposed method is somewhat on the liberal side in small samples with dense maps. We may combine the proposed method with Piepho's method when the marker density is relatively high.

To further assess the proposed method, we simulated a backcross population in searching for multiple QTL. Again, one chromosome with a total length of 100 cM was simulated. Markers are evenly distributed with a distance of 2, 10, or 20 cM. The sample size is 300. A single QTL is located at 20 cM.

When mapping multiple QTL, the analysis is done either sequentially so that we search for the next most significant QTL after accounting for the effects of the identified QTL or jointly so that we search all QTL simultaneously. In the former, we search for a new gene conditional on previously identified genes.

For the sequential analysis, either we assumed that the position of the first QTL is known (at 20 cM), and given this QTL, we searched for the second QTL, or we assumed that the position of the first QTL is unknown and the marker closest to the locus with the maximum LOD is selected as the locus for the first QTL. Regardless of the method used to choose the position of the first QTL, we tested the null hypothesis H_{0}: γ_{2} = γ_{12} = 0 under model (6) against the alternative hypothesis H_{1}: γ_{2} ≠ 0 or/and γ_{12} ≠ 0 across the whole chromosome. We treated *x*_{1} as fixed and calculated the profile likelihood at all possible loci for the second QTL.

If the putative QTL locus is very close to the primary QTL, the collinearity between *x*_{1} and *x*_{2} will be very strong, which may result in relatively high LOD scores in a region very close to the primary QTL. To investigate this, we simulated another chromosome that is also 100 cM long and searched the second QTL only on the second chromosome.

The above two cases are examples of the CIM analysis in which only one marker, instead of several, is used as the covariate in the analysis. To show the strength of the proposed method in multiple-QTL mapping (MIM), where the computational demand for permutation tests is very high, we also simulated two 100-cM chromosomes and fit a two-QTL model to investigate how the type I errors are controlled under the global null hypothesis of no QTL present. For simplicity, we restricted our profile likelihood calculation to one QTL on each chromosome.

The results of the sequential analysis under γ_{1} = 1 and γ_{2} = γ_{12} = 0 are summarized in Table 3. The proposed thresholds are again close to the empirical levels and have proper control of the type I error regardless of whether the first QTL locus is fixed at its true position or selected with the results of the single-QTL interval mapping. Additional simulations (not shown) demonstrate that the resampling thresholds for data generated under alternatives with two QTL are similar to those of Table 3, so that the resampling method yields adequate power.

**Simulation results on the proposed thresholds and the corresponding empirical type I error in a backcross population**

As shown in Table 3, when we search for the second gene on a different chromosome from the chromosome where the first gene resides, the thresholds are slightly lower than when we search for the second gene on the same chromosome as that of the first gene. This suggests that to retain the power to detect genes not linked to the primary gene, we may partition the whole genome into two groups, one linked with the primary QTL and one unlinked with the primary QTL. The LOD scores within each group can then be compared to the corresponding threshold. We can also exclude a small region, say 10 cM to the left and to the right of the primary QTL to break down the high collinearity between *x*_{1} and *x*_{2}, as in the case with CIM.

For MIM, we fit model (6), where neither *x*_{1} nor *x*_{2} is observed and the profile likelihood is calculated for all possible locus combinations of the first and second QTL. The number of testing positions for MIM is on the order of *L ^{K}*, where

*L*is the total number of loci in the single-QTL analysis and

*K*is the total number of QTL fitted in MIM, which is a dramatic increase relative to CIM. We performed 1000 simulations for MIM. As shown in Table 3, the proposed method works reasonably well for MIM mapping and the type I errors are well controlled.

To investigate the robustness of the proposed method, we also simulated situations with smaller sample sizes, missing marker genotypes, and χ^{2}_{1}-distributed traits. The results for χ^{2}_{1} traits are presented in the bottom three rows of Tables 1 and and2.2. The type I error is only slightly inflated for α = 0.05. As shown in Table 4, the performance of the proposed method is also fairly insensitive to missing genotype data and small sample sizes. With sample size 100 and 10% missing marker genotypes, the type I error is still close to the nominal level.

## APPLICATION TO A DROSOPHILA BACKCROSS

We use a Drosophila data set (Zeng* et al.* 2000) to compare the permutation procedure with the proposed method. Two closely related allopatric species, *Drosophila simulans* and *D. mauritiana*, differ dramatically in the size and shape of the posterior lobe of the male genital arch. To investigate the genetic architecture of the morphometric difference between the two species, female *D. simulans* were crossed to males of *D. mauritiana* to generate an F_{1} population. The F_{1} females were backcrossed to parental line *D. simulans* and 299 backcross males were produced. A morphometric descriptor, referred to as PC1 by Zeng* et al.* (2000), is the average over both sides of the first principal components of the Fourier coefficients of the posterior lobe and is used to quantify both the size and shape variation. There are 42 markers unevenly distributed on the X chromosome and on chromosomes 2 and 3. Interval mapping was performed across all three chromosomes. The step size of the QTL scan was 1 cM. Threshold calculations were based on 10,000 permutations and resamples. Our recorded running time showed that the proposed method is several hundred times faster than the permutation procedure (the recorded CPU times of the resampling and permutation procedures running on an IBM BladeCenter HS20 machine are 13 and 6000 sec, respectively).

The derived 95% thresholds are 10.08 and 9.96 from the proposed and permutation methods, respectively. The corresponding 99% thresholds are 13.49 and 13.46. The two procedures result in very similar thresholds, but the proposed method takes far less computing time. The LOD score profile of the original data and the estimated 95% threshold are plotted in Figure 1. The genetic signals on all three chromosomes are very strong. As suggested in Zeng* et al.* (2000), as many as 19 different QTL controlling this morphometric descriptor may exist. For these complicated real data, where some of our assumptions, such as normality, are likely to fail, the thresholds from permutation and our proposed procedures agree very well. Since the permutation procedure is known to be robust to those violations, this real example further demonstrates the usefulness of the proposed method. To further compare the permutation and proposed method, we provided a QQ-plot (Figure 2) of the permutation- and the resample-based null distribution estimates of the maximum profile likelihood-ratio test statistic. The two estimated distributions match rather well up to the 99.5th percentile. The discrepancy in the tails of the distributions may be due, in part, to the limited number of resamplings and permutations. To improve the accuracy of the estimates of the null distribution in the tail, a larger number of resamplings and permutations are necessary. For comparison, we also calculated the 95 and 99% thresholds by Piepho's (2001) method, which are 11.23 and 14.44, respectively. Those thresholds are slightly larger than both the permutation-based and our resampling-based thresholds.

## DISCUSSION

In this article, we propose a new empirical method to calculate the threshold for QTL mapping. The method is far more efficient than the popular permutation procedure since the proposed method needs to maximize the likelihood of the observed data only once with no need to maximize the likelihood in each resampling iteration any more. For standard interval mapping with simple crosses, the resampling method is several hundred times faster than the permutation procedure. Furthermore, the proposed method is applicable to more complicated designs and models that cannot be handled by the permutation procedure. For example, for MIM where the model selection is involved, the bootstrap resampling method of Zeng* et al.* (1999) is applicable to the linear regression model but may not be applicable to nonlinear models, such as logistic regression and Poisson regression. The proposed method also avoids the derivation of parameters in the Ornstein-Uhlenbeck diffusion approximations, which can be a difficult task when the model is complicated.

The computational advantage of the proposed method over the permutation procedure depends on how complex the original model is. The more complicated the model is, the more there is to be gained from the proposed method. In the Drosophila analysis, where a simple interval-mapping model was fitted on three chromosomes, there was a decrease in computing time in the order of 10^{2}. If more complicated models, such as multiple-QTL mapping or CIM, are used, where maximization via the EM algorithm is more time consuming, the proposed method may be thousands of times faster than the permutation procedure. With the recent efforts to map the gene expression levels of thousands of genes via microarrays (Lan* et al.* 2003), an efficient way to compute thresholds in a large number of screens is critical, even with the current trend in computing power.

The simulations indicated that for simple interval mapping with F_{2} or backcross, either the restricted or unrestricted estimator of η can be used and the two estimators tend to give very similar thresholds. However, for the two-gene model, we found that the unrestricted estimator of η works slightly better than the restricted one. For this reason, we suggest the use of the unrestricted estimator of the nuisance parameters in evaluating the thresholds, and the simulation results presented in this article are based on an unrestricted estimator of η.

The simulations also showed that the proposed method is robust to nonnormality as well as missing data. Though the normal model is used to fit the nonnormal chi-square data, the empirical type I error from the proposed method is reasonably controlled at the targeted level. However, it is unclear how this method will work for data with segregation distortion, which is a complex phenomenon. Due to different mechanisms of segregation distortion, it is difficult to predict the performance of the method. If the existence of segregation distortion is suspected, a simple solution is to remove those markers that are in segregation distortion from the analysis. Including markers in segregation distortion in the analysis will result in a distorted map estimate and may give biased mapping results, regardless of the method used to compute the thresholds. Last, in contrast to Piepho's (2001) method, which is generally conservative when the marker density is high, the proposed method is somewhat on the liberal side in small samples with dense maps. We may consider combining the proposed method with Piepho's method when the marker density is relatively high.

## References

- Basten, C. J., B. S. Weir and Z-B. Zeng, 1997
*QTL Cartographer: A Reference Manual and Tutorial for QTL Mapping.*Department of Statistics, North Carolina State University, Raleigh, NC. - Broman, K. W., 2001. Review of statistical methods for QTL mapping in experimental crosses. Lab Anim. 30
**:**44–52. [PubMed] - Churchill, G. A., and R. W. Doerge, 1994. Empirical threshold values for quantitative trait mapping. Genetics 138
**:**963–971. [PMC free article] [PubMed] - Cox, D., and C. Hinkley, 1974
*Theoretical Statistics.*Chapman & Hall, London. - Davies, R. B., 1977. Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika 64
**:**247–254. [PubMed] - Davies, R. B., 1987. Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika 74
**:**33–43. [PubMed] - Dempster, A. P., N. M. Laird and D. B. Rubin, 1977. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. 39
**:**1–38. - Doerge, R. W., 2001. Mapping and analysis of quantitative trait loci in experimental populations. Nat. Rev. Genet. 3
**:**43–52. [PubMed] - Dupuis, J., and D. Siegmund, 1999. Statistical methods for mapping quantitative trait loci from a dense set of markers. Genetics 151
**:**373–386. [PMC free article] [PubMed] - Jansen, R. C., and P. Stam, 1994. High resolution of quantitative traits into multiple quantitative trait in line crosses using flanking markers. Genetics 136
**:**1447–1455. [PMC free article] [PubMed] - Kao, C. H., and Z-B. Zeng, 1997. General formulas for obtaining the maximum likelihood estimates and the asymptotic variance-covariance matrix in QTL mapping when using the EM algorithm. Biometrics 53
**:**653–665. [PubMed] - Kao, C. H., Z-B. Zeng and R. D. Teasdale, 1999. Multiple interval mapping for quantitative trait loci. Genetics 152
**:**1203–1216. [PMC free article] [PubMed] - Lan, H., J. P. Stoehr, S. T. Nadler, K. L. Schueler, B. S. Yandell
*et al.*, 2003. Dimension reduction for mapping mRNA abundance as quantitative traits. Genetics 164**:**1607–1614. [PMC free article] [PubMed] - Lander, E. S., and D. Botstein, 1989. Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121
**:**185–199. [PMC free article] [PubMed] - Lin, D. Y., L. J. Wei and Z. Ying, 1993. Checking the Cox model with cumulative sums of martingale-based residuals. Biometrika 80
**:**557–572. - Lincoln, S. E., M. J. Daly and E. S. Lander, 1993
*A Tutorial and Reference Manual for MAPMAKER/QTL*. Whitehead Institute, Cambridge, MA. - Lynch, M., and B. Walsh, 1998
*Genetics and Analysis of Quantitative Traits.*Sinauer Asociates, Sunderland, MA. - Manly, K. F., and J. M. Olson, 1999. Overview of QTL mapping software and introduction to Map Manager QT. Mamm. Genome 10
**:**327–334. [PubMed] - Piepho, H. P., 2001. A quick method for computing approximate thresholds for quantitative trait loci detection. Genetics 157
**:**425–432. [PMC free article] [PubMed] - Rebai, A., B. Goffinet and B. Mangin, 1994. Approximate thresholds of interval mapping tests for QTL detection. Genetics 138
**:**235–240. [PMC free article] [PubMed] - Rebai, A., B. Goffinet and B. Mangin, 1995. Comparing power of different methods for QTL detection. Biometrics 51
**:**87–99. [PubMed] - Zeng, Z-B., 1993. Theoretical basis of separation of multiple linked gene effects on mapping quantitative trait loci. Proc. Natl. Acad. Sci. USA 90
**:**10972–10976. [PMC free article] [PubMed] - Zeng, Z-B., 1994. Precision mapping of quantitative traits loci. Genetics 136
**:**1457–1468. [PMC free article] [PubMed] - Zeng, Z-B., C. H. Kao and C. J. Basten, 1999. Estimating the genetic architecture of quantitative traits. Genet. Res. 74
**:**279–289. [PubMed] - Zeng, Z-B., L. Liu, L. F. Stam, C. H. Kao, J. M. Mercer
*et al.*, 2000. Genetic architecture of a morphological shape difference between two Drosophila species. Genetics 154**:**299–310. [PMC free article] [PubMed] - Zou, F., B. S. Yandell and J. P. Fine, 2001. Statistical issues in the analysis of quantitative traits in combined crosses. Genetics 158
**:**1339–1346. [PMC free article] [PubMed]

**Genetics Society of America**

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (105K) |
- Citation

- Linkage analysis of quantitative trait loci in multiple line crosses.[Genetica. 2002]
*Yi N, Xu S.**Genetica. 2002 Apr; 114(3):217-30.* - Assessing genome-wide statistical significance for large p small n problems.[Genetics. 2013]
*Diao G, Vidyashankar AN.**Genetics. 2013 Jul; 194(3):781-3. Epub 2013 May 11.* - Genome Reshuffling for Advanced Intercross Permutation (GRAIP): simulation and permutation for advanced intercross population analysis.[PLoS One. 2008]
*Peirce JL, Broman KW, Lu L, Chesler EJ, Zhou G, Airey DC, Birmingham AE, Williams RW.**PLoS One. 2008 Apr 23; 3(4):e1977. Epub 2008 Apr 23.* - Novel linkage mapping approach using DNA pooling in human and animal genetics. II. Detection of quantitative traits loci in dairy cattle.[J Appl Genet. 2002]
*Pareek CS, Pareek RS, Walawski K.**J Appl Genet. 2002; 43(3):309-18.* - Linkage mapping of quantitative trait loci in humans: an overview.[Ann Hum Genet. 2002]
*Ghosh S, Reich T, Majumder PP.**Ann Hum Genet. 2002 Nov; 66(Pt 5-6):431-8.*

- Mapping epistatic quantitative trait loci[BMC Genetics. ]
*Laurie C, Wang S, Carlini-Garcia LA, Zeng ZB.**BMC Genetics. 15(1)112* - Efficient Calculation of Empirical P-values for Genome-Wide Linkage Analysis Through Weighted Permutation[Behavior genetics. 2009]
*Medland SE, Schmitt JE, Webb BT, Kuo PH, Neale MC.**Behavior genetics. 2009 Jan; 39(1)91-100* - Meta-Analysis of Gene Level Tests for Rare Variant Association[Nature genetics. 2014]
*Liu DJ, Peloso GM, Zhan X, Holmen OL, Zawistowski M, Feng S, Nikpay M, Auer PL, Goel A, Zhang H, Peters U, Farrall M, Orho-Melander M, Kooperberg C, McPherson R, Watkins H, Willer CJ, Hveem K, Melander O, Kathiresan S, Abecasis GR.**Nature genetics. 2014 Feb; 46(2)200-204* - Multiple trait multiple interval mapping of quantitative trait loci from inbred line crosses[BMC Genetics. ]
*Costa E Silva LD, Wang S, Zeng ZB.**BMC Genetics. 1367* - Dynamic semiparametric Bayesian models for genetic mapping of complex trait with irregular longitudinal data[Statistics in medicine. 2013]
*Das K, Li J, Fu G, Wang Z, Li R, Wu R.**Statistics in medicine. 2013 Feb 10; 32(3)509-523*

- MedGenMedGenRelated information in MedGen
- PubMedPubMedPubMed citations for these articles
- TaxonomyTaxonomyRelated taxonomy entry
- Taxonomy TreeTaxonomy Tree

- An Efficient Resampling Method for Assessing Genome-Wide Statistical Significanc...An Efficient Resampling Method for Assessing Genome-Wide Statistical Significance in Mapping Quantitative Trait LociGenetics. 2004 Dec; 168(4)2307

Your browsing activity is empty.

Activity recording is turned off.

See more...