# Statistical Methods for Mapping Multiple QTL

^{1}Bioinformatics Research Center, North Carolina State University, Raleigh, NC 27695, USA

^{2}Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA

^{3}Department of Genetics, North Carolina State University, Raleigh, NC 27695, USA

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## Abstract

Since Lander and Botstein proposed the interval mapping method for QTL mapping data analysis in 1989, tremendous progress has been made in developing new and powerful statistical methods for QTL analysis. Recent research has focused on statistical methods and issues for mapping multiple QTL together. In this article, we review this progress. We focus the discussion on statistical methods for mapping multiple QTL by maximum likelihood and Bayesian methods, and on determining appropriate thresholds for the analysis.

## 1. INTRODUCTION

Quantitative genetics studies the variation of quantitative traits and their genetic basis. When R. A. Fisher laid down the basic theoretical foundations of quantitative genetics, the focus of study was to partition the overall variation into genetic and environmental ones. With the development of polymorphic markers for many species, current research interest is to partition genetic variation to individual quantitative trait loci (QTL) in the genome as well as interaction among them [1]. A QTL is a chromosomal region that is likely to contain causal genetic factors for the phenotypic variation under study.

The basic principle of QTL mapping was established in Sax's work [2] in beans. If there is linkage disequilibrium (LD) between the causal factor and a marker locus, mean values of the trait under study will differ among subject groups with different genotypes at the marker locus [3]. Though this idea is still directly used in certain settings (e.g., LD-based QTL mapping in unrelated humans), the advance of QTL mapping methodology has allowed simultaneous use of multiple marker information to improve the accuracy and power to estimate QTL locations and effects. Lander and Botstein [4] presented a likelihood-based framework for interval mapping (IM), where the putative QTL genotype was conditional upon a pair of flanking markers' genotypes as well as the phenotype. A least squares equivalent of IM [5] was also proposed, where phenotypic values were regressed onto expected genetic coefficients of a putative QTL. Motivated by the conditional independence between marker genotypes, composite interval mapping [6] introduced additional flanking markers as covariates into the likelihood function to reduce the confounding effects from nearby QTL when scanning the current interval. However, most of these methods were still designed to detect a single QTL at a time, based on a statistical test of whether a candidate position for a QTL has a significant effect. The test was applied at each position in the genome, thus producing a genome scan for QTL analysis.

Though intuitive and widely used, these methods are still insufficient to study the genetic architecture of complex quantitative traits that are affected by multiple QTL. When a trait is affected by multiple loci, it is more efficient statistically to search for those QTL together. Also in order to study epistasis of QTL, multiple QTL need to be analyzed together. In this setting, QTL analysis is basically a model-selection problem. In this paper, we discuss recent research progress and outstanding statistical issues associated with mapping multiple QTL in experimental cross populations.

## 2. MULTIPLE INTERVAL MAPPING (MIM)

Multiple interval mapping is targeted to analyze multiple QTL with epistasis together through a model selection procedure to search for the best genetic model for the quantitative trait [1, 7, 8].

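For *m* putative causal genes for the trait, the MIM model is specified as

$$y_i = u + \sum_{r=1}^{m} \alpha_r x_{ir}^{*} + \sum_{r \neq s \subset (1,\ldots,m)}^{t} \beta_{rs}\,(x_{ir}^{*} x_{is}^{*}) + e_i, \tag{1}$$

where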

- *y*_{i} is the phenotypic value of individual *i*, *i* = 1, 2,…, *n*;
- *u* is the mean of the model;
- *α*_{r} is the main effect of the *r*th putative causal gene, *r* = 1,…, *m*;
- *x*_{ir}^{∗} is an indicator variable denoting the genotype of the *r*th putative causal gene, which follows a multinomial distribution conditional upon flanking marker genotypes and genetic distances;
- *β*_{rs} is the possible epistatic effect between the *r*th and the *s*th putative causal genes, assuming there are *t* such effects;
- *e*_{i} is an environmental effect assumed to be normally distributed.

As shown by Kao and Zeng [7] and Kao et al. [8], given a genetic model (number, location, and interaction of multiple QTL), this linear model suggests a likelihood function similar to that in IM but with more complexity. An expectation-maximization (EM) algorithm can be used to maximize the likelihood and obtain maximum likelihood estimates (MLE) of the parameters.

The following model-selection method is used to traverse the genetic model space in QTL Cartographer [1, 9, 10].

- Forward selection of QTL main effects sequentially. In each cycle of selection, pick the best position of an additional QTL, and then perform a likelihood ratio test for its main effect. If a test statistic exceeds the critical value, this effect is retained in the model. Stop when no more QTL can be found.
- Search for epistatic effects between QTL main effects included in the model, and perform likelihood ratio tests on them. If a test statistic exceeds the critical value, the epistatic effect is retained in the model. Repeat the process until no more significant epistatic effects can be found.
- Reevaluate the significance of each QTL main effect in the model. If the test statistic for a QTL falls below the significance threshold conditional on the other retained effects, this QTL is removed from the model. However, if a QTL is involved in a significant epistatic effect with other QTL, it is not subject to this backward elimination process. This process is performed stepwise until no more effects can be dropped.
- Optimize estimates of QTL positions based on the currently selected model. Instead of performing a multidimensional search around the regions of current estimates of QTL positions, estimates of QTL positions are updated in turn for each region. For the *r*th QTL in the model, the region between its two neighboring QTL is scanned to find the position that maximizes the likelihood (conditional on the current estimates of positions of other QTL and QTL epistasis). This refinement process is repeated sequentially for each QTL position until there is no change in the estimates of QTL positions.

An important issue in model selection is the significance level used to include or eliminate effects. In regression analysis, such a threshold is usually decided based on information criteria, which have the following general form:

$$IC = -2\log L_k + k\,c(n), \tag{2}$$

where *L*_{k} is the likelihood of the data given a genetic model with *k* parameters, and *c*(*n*) is a penalty function that can take a variety of forms, such as

- *c*(*n*) = *log*(*n*), which is the classical Bayesian information criterion (BIC);
- *c*(*n*) = 2, which is the Akaike information criterion (AIC);
- *c*(*n*) = 2*log*(*log*(*n*));
- *c*(*n*) = 2*log*(*n*);
- *c*(*n*) = 3*log*(*n*).

When the penalty for an additional parameter in the model is low, more QTL and epistatic effects are likely to be included in the model. Thus it is particularly important to determine an appropriate penalty function for model selection.
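The effect of the penalty choice can be illustrated with a minimal sketch. It assumes the usual penalized-likelihood form, IC = −2 log *L*_{k} + *k* *c*(*n*), under which one extra parameter is retained only if its likelihood ratio statistic exceeds *c*(*n*). The function names and dictionary keys are hypothetical and are not taken from QTL Cartographer.

```python
import math

# Penalty functions c(n) from the list above; the keys are illustrative labels.
PENALTIES = {
    "BIC": lambda n: math.log(n),
    "AIC": lambda n: 2.0,
    "2loglog": lambda n: 2.0 * math.log(math.log(n)),
    "2log": lambda n: 2.0 * math.log(n),
    "3log": lambda n: 3.0 * math.log(n),
}


def information_criterion(log_lik, k, n, penalty="BIC"):
    """Assumed form: IC = -2 log L_k + k * c(n); smaller values are preferred."""
    return -2.0 * log_lik + k * PENALTIES[penalty](n)


def keeps_extra_parameter(log_lik_small, log_lik_big, n, penalty="BIC"):
    """True if one additional parameter lowers the criterion.

    This is the same as asking whether the likelihood ratio statistic
    2 * (log L_big - log L_small) exceeds the penalty c(n).
    """
    lrt = 2.0 * (log_lik_big - log_lik_small)
    return lrt > PENALTIES[penalty](n)
```

With *n* = 200 and a likelihood ratio statistic of 6, for example, the extra effect would be kept under AIC and BIC but dropped under the 2*log*(*n*) penalty, which shows how strongly the selected model can depend on *c*(*n*).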

Sequential genome scans require detectable main effects for the components of an interaction effect. An alternative approach, an exhaustive search of all marker combinations, poses computational and statistical problems even in two dimensions. Using yeast eQTL mapping data with over 6000 expression traits and 112 individuals [11], Storey et al. [12] showed that the sequential search was more powerful than the exhaustive one for detecting pairwise QTL main effects and interaction effects. However, in a different setting using simulations under a series of quantitative trait model assumptions, Evans et al. [13] showed that the exhaustive search can be more powerful than the sequential one with over 1000 individuals in the mapping population and a Bonferroni correction for 100 000 tests. The inconsistency is partially related to sample size: a larger sample can make unadjusted *P* values more significant. Witte et al. [14] showed that, with a simple Bonferroni correction, the required sample size increases roughly linearly with the logarithm of the number of tests.

## 3. THRESHOLD TO CLAIM QTL

We need to decide the threshold for declaring QTL from the profile of test statistics across the genome. General asymptotic results for regression and likelihood ratio tests are not directly applicable for genome scans given the large number of correlated tests performed in the scans and the limited sample size.

### 3.1. Type I error rate control

When markers are dense and the sample size is large, Lander and Botstein [4] showed that an appropriate threshold for the LOD score is *t*_{α}/(2 log 10), where log is the natural logarithm and *t*_{α} solves the equation *α* = (*C* + 2*G**t*_{α})*χ*^{2}(*t*_{α}). Here *C* is the number of chromosomes of the organism, *G* is the length of the genetic map, measured in Morgans, and *χ*^{2}(*t*_{α}) is the probability that a random variable from a *χ*_{1}^{2} distribution is greater than *t*_{α}.
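As a minimal numerical sketch, the dense-map threshold can be found by bisection. The code reads the equation as *α* = (*C* + 2*G**t*) Pr(*χ*_{1}^{2} > *t*), which is the reading under which the right-hand side is a genome-wide error rate, and converts the likelihood ratio threshold to the LOD scale by dividing by 2 ln 10. Function names and the example genome (20 chromosomes, 16 Morgans) are illustrative assumptions.

```python
import math


def chi2_1_upper_tail(t):
    """Pr(chi-square with 1 df > t), which equals erfc(sqrt(t / 2))."""
    return math.erfc(math.sqrt(t / 2.0))


def genome_wide_error(t, n_chrom, map_morgans):
    """Dense-map approximation (C + 2 G t) * Pr(chi2_1 > t)."""
    return (n_chrom + 2.0 * map_morgans * t) * chi2_1_upper_tail(t)


def lr_threshold(alpha, n_chrom, map_morgans, low=2.0, high=200.0):
    """Solve genome_wide_error(t) = alpha for t by bisection.

    The approximation is decreasing in t for t >= 2, so the search is
    restricted to that range; it is only meaningful for small alpha.
    """
    if genome_wide_error(low, n_chrom, map_morgans) < alpha:
        raise ValueError("alpha is too large for the dense-map approximation")
    for _ in range(200):
        mid = 0.5 * (low + high)
        if genome_wide_error(mid, n_chrom, map_morgans) > alpha:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)


def lod_threshold(alpha, n_chrom, map_morgans):
    """Convert the likelihood ratio threshold to the LOD scale."""
    return lr_threshold(alpha, n_chrom, map_morgans) / (2.0 * math.log(10.0))


# Hypothetical genome: 20 chromosomes, 16 Morgans in total.
example_lod = lod_threshold(0.05, 20, 16.0)
```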

Churchill and Doerge [15] proposed a method based on permutation tests to find an empirical threshold specifically for a QTL mapping study. Data were shuffled by randomly pairing one individual's genotypes with another's phenotypes, in order to simulate the null hypothesis of no intrinsic relationship between genotypes and phenotypes. Thus, this method takes into account sample size, genome size of the organism under study, genetic marker density, segregation ratio distortions, and missing data.

According to Churchill and Doerge [15], the genome-wide threshold to control the type I error rate for mapping a single trait can be found by the following procedure.

- Shuffle the data *N* times by randomly pairing trait values with genotypes. When there are multiple traits under study, these phenotypes should be shuffled together to keep their correlation structure.
- Perform the mapping analysis and obtain the maximum test statistic in each of the *N* shuffled data sets. This provides an empirical distribution *F*_{M} of the test statistic for the genome scan under the null.
- The 100(1 − *α*) percentile of *F*_{M} provides an estimated critical value.
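The three steps can be sketched as follows. This is a simplified illustration, not the procedure as implemented in any mapping package: it assumes 0/1 marker genotypes (as in a backcross) and uses *n* times the squared marker-trait correlation as a stand-in for a LOD or likelihood ratio statistic. All function names are hypothetical.

```python
import math
import random


def marker_statistic(genotypes, phenotypes):
    """n * r^2 between one 0/1 marker and the trait (stand-in test statistic)."""
    n = len(phenotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    s_gg = sum((g - mean_g) ** 2 for g in genotypes)
    s_yy = sum((y - mean_y) ** 2 for y in phenotypes)
    if s_gg == 0.0 or s_yy == 0.0:
        return 0.0  # a constant marker or trait carries no information
    s_gy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, phenotypes))
    return n * s_gy * s_gy / (s_gg * s_yy)


def genome_scan_max(marker_columns, phenotypes):
    """Maximum statistic over all markers, i.e. the peak of one genome scan."""
    return max(marker_statistic(col, phenotypes) for col in marker_columns)


def permutation_threshold(marker_columns, phenotypes, n_perm=1000, alpha=0.05, seed=0):
    """Empirical 100(1 - alpha) percentile of the null maximum statistic."""
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    null_max = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the genotype-phenotype pairing
        null_max.append(genome_scan_max(marker_columns, shuffled))
    null_max.sort()
    rank = math.ceil((1.0 - alpha) * n_perm)
    return null_max[min(n_perm, max(1, rank)) - 1]
```

Shuffling only the phenotype vector keeps the marker genotypes of each individual together, so marker density, segregation distortion, and missing genotype patterns are preserved in every permuted data set.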

This permutation procedure is essentially equivalent to the Bonferroni correction for multiple testing when the test statistics are independent. Suppose there are *n* such statistics *t*_{i} (for *i* = 1,…, *n*) from a null distribution *F*. *F*_{M}(*T*), the distribution function of the maximum of the *n* statistics, can be expressed as Pr(max(*t*_{i}) < *T*) = *F*(*T*)^{n}. Finding a threshold *T* such that Pr(max(*t*_{i}) > *T*) = 1 − Pr(max(*t*_{i}) < *T*) ≤ *α* is equivalent to requiring 1 − *F*(*T*)^{n} = 1 − (1 − Pr(*t*_{i} > *T*))^{n} ≤ *α*, that is, Pr(*t*_{i} > *T*) ≤ 1 − (1 − *α*)^{1/n} ≈ *α*/*n* for small *α*, which is the Bonferroni-adjusted threshold. When the test statistics are correlated, the permutation method provides a threshold that is less stringent than the Bonferroni one.
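The closeness of the two per-test levels under independence can be checked numerically. In this small sketch, the exact level implied by the maximum-statistic argument, 1 − (1 − *α*)^{1/n}, is compared with the Bonferroni level *α*/*n*; the function names are illustrative.

```python
import math


def exact_per_test_level(alpha, n):
    """Per-test level 1 - (1 - alpha)^(1/n) for n independent tests.

    Written with log1p/expm1 so that it stays accurate for very large n.
    """
    return -math.expm1(math.log1p(-alpha) / n)


def bonferroni_per_test_level(alpha, n):
    """Bonferroni per-test level alpha / n."""
    return alpha / n


# Relative gap between the two levels at alpha = 0.05 for several n.
relative_gaps = {
    n: exact_per_test_level(0.05, n) / bonferroni_per_test_level(0.05, n) - 1.0
    for n in (10, 1000, 100000)
}
```

At *α* = 0.05 the exact level exceeds the Bonferroni level by less than 3% across these values of *n*, so the Bonferroni threshold is only marginally more conservative when the tests are independent.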

A related permutation procedure was also suggested by Doerge and Churchill [16] for mapping procedures in which QTL are declared sequentially using forward selection. Two methods were suggested to find a genome-wide threshold for the second QTL while controlling for the effects of the first QTL.

- Conditional empirical threshold (CET). Mapping subjects are put into blocks according to the genotype of the marker identified as (or closest to) the first QTL. Permutation is applied within each block. Following the procedure described above by Churchill and Doerge [15], the maximal test statistic of each genome scan is collected and the CET is obtained. One problem of the CET is that markers linked to the first QTL will continue to show association with the trait variation as in the original data. To avoid the CET being elevated by such markers, it is suggested to exclude the complete chromosome where the first QTL is located when collecting null statistics.
- Residual empirical threshold (RET). The residuals from the genetic model with the first QTL are used as new phenotypic values to be permuted. Maximal null statistics from genome-wide scans are then collected to find the RET.

For multiple interval mapping, Zeng et al. [1] also suggested using a residual permutation or bootstrap test to estimate an appropriate threshold for model selection in each step of the sequential test. In this test, after fitting a model with the identified QTL effects, the residuals of individuals are calculated and permuted or bootstrapped to generate a null model for the conditional test to identify additional QTL. This threshold is more appropriate for the conditional search, but is computationally more intensive.

### 3.2. Score-statistic-based resampling methods

Permutation is a computationally intensive method for generating an empirical threshold. Zou et al. [17] suggested that the genome-wide threshold could be computed more efficiently based on a score statistic and a resampling method. A score statistic can be computed at each genome position. If we multiply the score function by a standard normal random variable (mean zero and variance one), the resulting score statistic mimics that under the null hypothesis. Thus, by repeatedly drawing standard normal variables, we can very efficiently generate an empirical distribution of the score statistic under the null. This method is flexible and can be used to test a null hypothesis in a complex model. A similar algorithm was also suggested by Seaman and Müller-Myhsok [18]. Conneely and Boehnke [19] extended the approach by replacing the resampling step with a high-dimensional numerical integration.

Unlike these approaches, Nyholt [20] suggested another method that addresses the multiple testing issue: the number of independent tests across the genome can be approximately estimated as a function of eigenvalues derived from the correlation matrix of marker genotypes.

### 3.3. False discovery rate (FDR)

In QTL mapping, as marker coverage in a genome increases, it is less likely that a causal variant is not in LD with any marker and may be missed. On the other hand, the number of markers showing significant correlation with the phenotype by chance is also expected to grow, if the type I error rate for each test is controlled at a preset level. To handle this multiple testing problem, stringent family-wise control of type I error is usually applied, which is designed to control the probability of making at least one false discovery in a genome-wide test. However, a more powerful approach may be to control the false discovery rate (FDR) [21], that is, to control the expected proportion of false discoveries among all the markers passing a threshold. This essentially allows multiple false positive declarations when many “significant” test statistics are found. Such a relaxation is driven by the nature of the problem under study. “It is now often up to the statistician to find as many interesting features in a data set as possible rather than test a very specific hypothesis on one item” [22].

Following the notation of Storey [22], Table 1 shows the possible outcomes when *m* hypotheses *H*_{1}, *H*_{2},…, *H*_{m} are tested; *R* denotes the number of rejected hypotheses, *V* the number of true null hypotheses among them, and *m*_{0} the total number of true null hypotheses. For independent tests, Benjamini and Hochberg [21] provided a procedure (known as the linear step-up procedure or BH procedure) to control the expected FDR, that is, *E*[(*V*/*R*) ∣ *R* > 0] × Pr(*R* > 0), at the desired level *α*(*m*_{0}/*m*) or *α* (since *m*_{0} is generally unknown, the conservative upper-bound estimate *m*_{0} = *m* is used) as follows:

- sort the *P* values from the smallest to the largest such that *P*_{(1)} ≤ *P*_{(2)} ≤ … ≤ *P*_{(m)};
- starting from *P*_{(m)}, compare *P*_{(i)} with *α*(*i*/*m*);
- let *k* be the first *i* for which *P*_{(i)} ≤ *α*(*i*/*m*), and reject all hypotheses with *P* values *P*_{(1)} through *P*_{(k)}.
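A minimal sketch of the linear step-up procedure is given below. The function name and return convention are illustrative choices; the logic follows the three steps above, scanning from the largest *P* value downward.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected."""
    m = len(p_values)
    # Indices of the hypotheses ordered from smallest to largest P value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank in range(m, 0, -1):  # start from P_(m) and step down
        if p_values[order[rank - 1]] <= alpha * rank / m:
            k = rank  # first (largest) rank meeting the criterion
            break
    rejected = [False] * m
    for index in order[:k]:
        rejected[index] = True
    return rejected
```

Because the scan stops at the largest rank that meets the criterion, a hypothesis can be rejected even when its own *P* value exceeds *α*(*i*/*m*), provided that a larger *P* value further up the list satisfies its own bound.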

Benjamini and Yekutieli [23] showed that “if test statistics are positively
regression dependent on each hypothesis from the subset corresponding to true
null hypotheses (PRDS), the BH procedure controls FDR at level *α*(*m*_{0}/*m*).” For QTL mapping, PRDS can be interpreted as
follows [24]: if two markers have correlated allele frequencies
and neither is related biologically to the trait, test statistics associated
with the two markers should be positively correlated. Such positive correlation
is intuitively correct and supported by simulation results [24].

To check the performance of the BH procedure for FDR control in a genome-wide QTL scan for a single trait, Sabatti et al. [24] considered a simulated case-control study in humans. Three susceptibility genes were simulated to affect the disease status. The genes were assumed to be additive and located on different chromosomes. The results showed that the BH procedure can control the expected value of the FDR for a single-trait genome-wide scan. For multiple-trait QTL analysis, Benjamini and Yekutieli [25] considered 8 positively or negatively correlated traits. Using simulation, they showed that the BH approach also seemed to work for multiple-trait analysis.

According to Benjamini and Yekutieli [25], to control FDR for QTL analysis in each trait at
level *α* does not always mean that the overall FDR for these
multiple traits is also *α* :
if there are *k* independent and nonheritable traits, the
overall FDR should be 1 − (1 − *α*)^{k} ≈ *k**α*.
It is safer to control FDR for all the tests simultaneously.

Yekutieli and Benjamini [26] suggested making use of the dependency structure in the data, rather than treating it as a nuisance. They expected an increase in testing power when an empirical distribution of true null statistics is used to obtain *P* values instead of an assumed theoretical one. Empirical null distributions are used extensively in pFDR and local FDR, as discussed below.

Though the BH approach is a handy and intuitive tool, it should be used with caution when applied to QTL mapping. First, the BH approach controls the expected value of FDR. Simulation studies showed that the actual FDR for a particular QTL mapping dataset can be higher [24]. Second, FDR = *E*[(*V*/*R*) ∣ *R* > 0] Pr(*R* > 0) may be tricky to interpret when Pr(*R* > 0) is far below 1. Weller et al. [27] were the first to apply FDR criteria in QTL mapping. They claimed that 75% of the 10 QTL declared in their study were probably true by controlling FDR at 25% using the BH approach. Zaykin et al. [28], however, objected to this interpretation because *E*[(*V*/*R*) ∣ *R* > 0] could be much higher than FDR = 25% when Pr(*R* > 0) was much smaller than 1. It is *E*[(*V*/*R*) ∣ *R* > 0], also known as pFDR and discussed in Section 3.4, that really contains the information about the proportion of false discoveries. Weller [29] further argued that if one assumes *R* follows a Poisson distribution, Pr(*R* > 0) should be very close to 1 when *R* = 10 is observed. In later FDR literature, the assumption that Pr(*R* > 0) ≈ 1 has been widely adopted [25, 30].

### 3.4. Positive false discovery rate (pFDR)

pFDR, or *E*[(*V*/*R*) ∣ *R* > 0],
was considered less favorable than FDR [21] without the additional term Pr(*R* > 0):
we cannot decide an arbitrary threshold *α*, 0 < *α* < 1, and guarantee that pFDR ≤ *α* regardless of the actual proportion of true
null hypothesis in all tests. For example, when *m*_{0}/*m* = 1,
pFDR is always 1 and cannot be controlled at a small *α*.
In this case, however, FDR = pFDR × Pr(*R* > 0) can be controlled at *α* by reducing the rejection region to push Pr(*R* > 0),
and then FDR, towards *α*.
Thus, Benjamini and Yekutieli [25] considered that pFDR should be only estimated after a
fixed rejection threshold was decided; or in QTL mapping, pFDR should estimate
(instead of control) the proportion of true/false QTL, after the null
hypothesis was rejected at *R* linkage peaks by a certain rule (e.g., a type-I
error control procedure).

From the discussion in Section 3.3 we know that allowing Pr(*R* > 0) to be much smaller than 1 might cause trouble in interpreting the results; and when *m*_{0}/*m* ≈ 1, we might want to see an FDR measure that is close to 1 [31]. This helped to raise interest in pFDR. Storey [31] presented a Bayesian interpretation of pFDR based on *P* values. Assuming there are *m* tests *H*_{1}, *H*_{2},…, *H*_{m} with associated *P* values *P*_{1}, *P*_{2},…, *P*_{m}, and

- *H*_{i} = 0 denotes that the *i*th null hypothesis is true, and *H*_{i} = 1 otherwise;
- to model the uncertainty in each hypothesis test, each *H*_{i} is viewed as an independent and identically distributed random variable from a Bernoulli distribution with Pr(*H*_{i} = 0) = *π*_{0} and Pr(*H*_{i} = 1) = *π*_{1} = 1 − *π*_{0};
- *P*_{i} follows a distribution *F*_{0} if *H*_{i} = 0 and *F*_{1} if *H*_{i} = 1; unconditionally, it follows the mixture distribution *F*_{m} = *π*_{0}*F*_{0} + *π*_{1}*F*_{1};
- Γ = [0, *γ*] is the common rejection region for all *H*_{i},

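then

$$\mathrm{pFDR}(\Gamma) = \Pr(H = 0 \mid P \in \Gamma) \tag{3}$$

$$= \frac{\pi_0 \Pr(P \le \gamma \mid H = 0)}{\Pr(P \le \gamma)} = \frac{\pi_0\,\gamma}{F_m(\gamma)}, \tag{4}$$

where the last equality holds when *P* values are uniformly distributed under the null. With *R* denoting the number of *P* values falling in Γ, this can be estimated by

$$\widehat{\mathrm{pFDR}}(\Gamma) = \frac{\widehat{\pi}_0\,\gamma}{(R/m)\,\{1 - (1 - \gamma)^{m}\}}. \tag{5}$$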

Notice that, when *π*_{0} = 1 and *m* is reasonably large, formula (5) is roughly equal to *γ*(*m*/*R*), which is the BH approach to estimating FDR. We can also see from formula (5) that, instead of assuming *π*_{0} = 1, a data-driven estimate of *π*_{0} gives a smaller pFDR estimate, and hence a more powerful procedure, whenever *π*_{0} is estimated to be smaller than 1. Efron et al. [32] gave an upper bound of *π*_{0} by assuming an “accept region” *A* = [*λ*, 1] such that Pr(*H* = 0 ∣ *P* ∈ *A*) ≈ 1:

$$\frac{\int_A f_m(p)\,dp}{\int_A f_0(p)\,dp} = \pi_0 + \pi_1 \frac{\int_A f_1(p)\,dp}{\int_A f_0(p)\,dp} \ge \pi_0, \tag{6}$$

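where *f*_{0}, *f*_{1}, *f*_{m} are the density functions of *F*_{0}, *F*_{1}, *F*_{m}, respectively. The left-hand side of this equation is equivalent to the following estimator [30]:

$$\widehat{\pi}_0(\lambda) = \frac{\#\{P_i > \lambda;\ i = 1, \ldots, m\}}{m(1 - \lambda)}. \tag{7}$$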

These formulas lead to the *q* value, the minimal pFDR at which a hypothesis with *P* value *P*_{i} is rejected. This approach is more powerful than the BH approach [21] in genomics applications. An R package, “qvalue”, is available to convert a list of *P* values to *q* values [30]. Besides estimating pFDR for a given Γ, the *q* value approach can also find Γ while controlling pFDR at a specified level. However, as discussed above, FDR control in QTL mapping should be applied with caution.
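The conversion from *P* values to *q* values can be sketched as follows. This is a simplified illustration and not the algorithm of the R package: it assumes a single fixed tuning value *λ* = 0.5 for estimating *π*_{0} as the proportion of *P* values above *λ* divided by 1 − *λ*, and it omits any finite-sample correction. Function names are hypothetical.

```python
def estimate_pi0(p_values, lam=0.5):
    """Proportion of true nulls: #{P_i > lam} / (m * (1 - lam)), capped at 1."""
    m = len(p_values)
    above = sum(1 for p in p_values if p > lam)
    return min(1.0, above / (m * (1.0 - lam)))


def q_values(p_values, lam=0.5):
    """q value of each test: smallest estimated pFDR at which it is rejected."""
    m = len(p_values)
    pi0 = estimate_pi0(p_values, lam)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotone q values.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pi0 * m * p_values[index] / rank)
        q[index] = running_min
    return q
```

Rejecting every hypothesis whose *q* value is at most a chosen level then defines the rejection region Γ implicitly, which is the sense in which the approach can find Γ for a target pFDR.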

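When the *P*_{i} are independent and identically distributed, pFDR becomes identical to the proportion of false positives (PFP) [22, 33]:

$$\mathrm{PFP} = \frac{E(V)}{E(R)} \tag{8}$$

$$= \frac{m\,\pi_0 \Pr(P_i \le \gamma \mid H_i = 0)}{m \Pr(P_i \le \gamma)} \tag{9}$$

$$= \frac{\pi_0 \Pr(P_i \le \gamma \mid H_i = 0)}{\pi_0 \Pr(P_i \le \gamma \mid H_i = 0) + \pi_1 \Pr(P_i \le \gamma \mid H_i = 1)}. \tag{10}$$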

PFP is also estimated similarly to pFDR (cf. formula (5)) [33]:

$$\widehat{\mathrm{PFP}} = \frac{\widehat{\pi}_0\,\gamma}{R/m}. \tag{11}$$

Contrary to FDR discussed in
Section 3.3, PFP has a property that when it is controlled at *α* for each of *n* sets of independent tests, the
overall PFP is still *α*. *P*_{i} from different sets of tests can have
different distributions. In this case, PFP can be different from pFDR [33].

Currently available procedures to control or estimate pFDR or PFP may have variable utility in various mapping designs. Chen and Storey [34] noted that the threshold to control FDR at the marker level from a one-dimensional genome scan for a single trait could be “dubious” because FDR is affected by the marker densities along the chromosomes. Since a true discovery is to claim a QTL at a marker that is in strong LD with a causal polymorphism, markers that are in strong LD with the true discovery can be additional true discoveries. Thus, FDR decreases when we genotype more markers around a true linkage peak. Using simulations, Chen and Storey [34] showed that the threshold obtained by controlling FDR varied with the marker density. However, in real applications, people generally consider all markers surrounding a test statistic peak as parts of one QTL, rather than as distinct positive discoveries. On the other hand, we can still estimate FDR from *P* values at linkage peaks for different traits that pass a certain cutoff value. This is a common situation in an expression QTL study where thousands of traits are analyzed together.

Zou and Zuo [35] showed that family-wise error rate control via Bonferroni correction can be more powerful than PFP control. In their simulation, they assumed 1 to 5 true non-null hypotheses out of 1000 independent tests, corresponding to *π*_{1} ≤ 0.5%, which might be too pessimistic for certain QTL mapping studies. As we can see from formula (10), when *π*_{1} is so small, Pr(*P*_{i} ≤ *γ* ∣ *H*_{i} = 1)/Pr(*P*_{i} ≤ *γ* ∣ *H*_{i} = 0) has to be very large so that its product with *π*_{1} is considerably larger than *π*_{0} and there is an acceptable PFP level. When the density function of *P*_{i} ∣ *H*_{i} = 1 is monotonically decreasing in [0, *γ*], which is quite common in reality, *γ* has to be very small to increase the ratio of power over type I error. Thus, it is not surprising that such a *γ* from PFP would result in a more stringent test than that under Bonferroni correction. A similar argument was made in [36], which pointed out that family-wise error rate control is effective when *π*_{1} is relatively low, and that the PFP or pFDR approach can be more powerful when *π*_{1} is high. Again, expression QTL studies stand out as an example where pFDR is more favorable: in the yeast eQTL mapping data [11], Storey et al. [12] estimated that *π*_{1} = 0.85 among the genome-wide maximal test statistics for each expression trait.

### 3.5. Local FDR

The Bayesian interpretation of pFDR extends naturally to local FDR [32], denoted as FDR here,

$$\mathrm{FDR}(T_i) = \Pr(H_i = 0 \mid T_i) = \frac{\pi_0 f_0(T_i)}{f_m(T_i)}, \tag{12}$$

where *T*_{i} is the test statistic associated with *H*_{i}; *H*_{i} = 0 denotes that the *i*th null hypothesis is true, and *H*_{i} = 1 otherwise; *T*_{i} has a density *f*_{0} if *H*_{i} = 0,
and *f*_{1} if *H*_{i} = 1;
or *f*_{m} = *π*_{0}*f*_{0} + *π*_{1}*f*_{1} when hypotheses are mixed together.

There is great similarity between formula (3) and formula (12). In fact, Efron [37] showed that the *q* value associated with *T*_{i} is equivalent to *E*_{t≥Ti}FDR(*t*). Storey et al. [12] showed that FDR could be estimated within each cycle of a sequential genome scan for thousands of expression traits; the average of the FDR in the rejection region was then an estimate of pFDR.

The key part in estimating FDR is to estimate *f*_{0}/*f*_{m}.
It is possible to assume certain standard distribution under the null
hypothesis as *f*_{0} and estimate *f*_{m} from nonparametric regression [37]. The following procedure is modified from [12, 32]:

- permute data under the null hypothesis *B* times, and obtain test statistics *z*_{ij}, *i* = 1,…, *m*, *j* = 1,…, *B*;
- estimate *f*_{0}/*f*_{m} from *T*_{i} and *z*_{ij} (see below);

*f*_{0}/*f*_{m} is estimated in the following way:

- pool all *T*_{i} and *z*_{ij} into bins;
- create an indicator variable *y*, with *y* = 1 for each *T*_{i} and *y* = 0 for each *z*_{ij}; thus, Pr(*y* = 1) = *f*_{m}/(*f*_{m} + *B**f*_{0}) in each bin;
- obtain a smooth estimate $\widehat{\mathrm{Pr}}(y=1)$ in each bin from an overall regression curve across bins, by combining natural cubic splines with generalized linear models for binomially distributed response variables;
- equate $\widehat{\mathrm{Pr}}(y=1)$ with *f*_{m}/(*f*_{m} + *B**f*_{0}), and get a moment estimate of *f*_{0}/*f*_{m}.
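The moment estimate in the last step can be sketched with raw bin proportions. This is a deliberate simplification: it skips the spline-based smoothing across bins and uses the observed proportion in each bin as the estimate of Pr(*y* = 1), so that *f*_{0}/*f*_{m} = (1 − Pr(*y* = 1))/(*B* Pr(*y* = 1)). The function names, the equal-width binning, and the truncation of the local FDR at 1 are illustrative assumptions.

```python
def binned_null_ratio(observed, permuted, n_perm, n_bins=10):
    """Per-bin moment estimate of f0/fm without smoothing.

    observed: statistics T_i from the real data
    permuted: all null statistics z_ij pooled over the B permutations
    n_perm:   B, the number of permutations
    Bins that contain no observed statistic get None.
    """
    pooled = list(observed) + list(permuted)
    low, high = min(pooled), max(pooled)
    width = (high - low) / n_bins
    if width == 0.0:
        width = 1.0  # all statistics identical: everything lands in bin 0

    def bin_of(value):
        return min(n_bins - 1, int((value - low) / width))

    obs_counts = [0] * n_bins
    null_counts = [0] * n_bins
    for value in observed:
        obs_counts[bin_of(value)] += 1
    for value in permuted:
        null_counts[bin_of(value)] += 1

    ratios = []
    for n_obs, n_null in zip(obs_counts, null_counts):
        if n_obs == 0:
            ratios.append(None)
            continue
        p_hat = n_obs / (n_obs + n_null)  # raw estimate of Pr(y = 1)
        ratios.append((1.0 - p_hat) / (n_perm * p_hat))
    return ratios


def local_fdr(ratio, pi0=1.0):
    """Local FDR pi0 * f0/fm, truncated at 1."""
    return min(1.0, pi0 * ratio)
```

With sparse bins the raw proportions are noisy, which is the practical reason for fitting a smooth regression curve across bins as described in the third step.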

It is noticed that *H*_{i} and its associated *T*_{i} or *P*_{i} are assumed to be from a mixture distribution
in both pFDR and FDR estimation. Thus, as pointed out by Storey [22], there is a connection between multiple hypothesis
testing and classification. For each test, the test procedure is to classify *H*_{i} as 0 or 1, or accepted or rejected.
Classification decisions can be made based upon *T*_{i},
with a rejection region Γ:
if *T*_{i} ∈ Γ, we classify *H*_{i} as 1.

## 4. BAYESIAN METHODS

Bayesian QTL mapping methods try to take full account of the uncertainties in QTL number, location, and effects by studying their joint distributions. Such a method takes the prior knowledge about these parameters as a prior distribution, reduces the uncertainty by integrating the information from the data, and expresses the remaining uncertainty as a posterior distribution of parameters.

Satagopan et al. [38] illustrated the use of Markov chain Monte Carlo (MCMC) algorithm to generate a sequence of samples from the joint posterior distribution of QTL parameters conditional upon the number of QTL. Gibbs sampling algorithm approximates the joint distribution by sampling each parameter in turn, conditional upon the current values of the rest of parameters. Conjugate priors can be chosen so that most of these conditional distributions have explicit distribution functions and can be sampled directly. Otherwise, Metropolis-Hastings algorithm is to be used. Point estimates of individual parameters are obtained by taking the averages of the corresponding marginal distributions. Confidence intervals are obtained as high posterior density regions.

Sen and Churchill [39] proposed to sample QTL genotypes from their posterior distribution conditional upon marker genotypes and trait phenotypes. This multiple imputation method offers a framework to handle several issues in QTL mapping: nonnormal and multivariate phenotypes, covariates, and genotyping errors. It differs from usual MCMC procedures in that a two-step approach is used: first, QTL genotypes are sampled conditioning only on marker genotypes; then, weights can be calculated as the likelihood of the phenotypes given a genotype. These genotypes and weights are then used to estimate posterior distributions.

In both studies, MCMC simulations were conditional
upon the number of QTL (*L*) underlying the phenotype. Different values
of *L* were tried and Bayes factors were used to
decide which value of *L* was more plausible. Bayes factor is the ratio
of the probability density of data under two hypotheses. It is a likelihood
ratio in some simple cases. In Bayesian QTL mapping, however, likelihoods are
weighted by prior distributions of parameters under different hypotheses to get
Bayes factors [38].

The development of the reversible jump MCMC algorithm [40] suggested one way to treat *L* as a parameter and generate its posterior distribution. Satagopan and Yandell [41] introduced this method to QTL mapping: when updating the current value of *L*, a single QTL may be added to or deleted from the model. Yi et al. [42] extended this method to model interacting QTL by allowing 3 types of transdimensional exploration: a main effect or epistatic effect, a single QTL with its effects, or a pair of QTL may be added or deleted in one updating step. However, they also observed low acceptance rates for transdimensional jump proposals, and hence slow mixing of the Markov chains and a high computational demand associated with the algorithm.

Yi [43] introduced the composite model space approach as a unified Bayesian QTL mapping framework. The approach incorporates reversible jump MCMC as a special case, and turns the transdimensional search into a fixed-dimensional one. The central idea is to use a fixed-length vector to record the current locations of *L* putative QTL and another indicator vector to record whether a QTL's main or interaction effect is included or excluded. These two vectors decide the actual number of QTL. The fixed length sets an upper bound for the number of detectable QTL. This approach has been implemented in the R/qtlbim package [44, 45].

Without considering computational efficiency, the upper bound can actually be fairly large. Unlike multiple regression analysis based upon least square criteria, which gets in trouble when the number of explanatory variables is larger than the number of observations, Bayesian analysis can handle a large number of explanatory variables through a large number of cycles within each step of Gibbs sampling. Xu [46] showed an example where markers across genome were used simultaneously in Bayesian linear regression and suggested that the Bayesian method could result in model selection naturally: markers with weak effects are “shrunk” so that their posterior mean effects are around zero, markers with strong effects remain strong.

A closely related procedure is ridge regression, where the parameters of the usual linear model *Y* = *X**β* + *ϵ* are estimated as $\widehat{\beta} = (X^{T}X + K)^{-1}X^{T}Y$, with *K* = *λ**I*, *ϵ* ∼ *N*(0, *σ*^{2}*I*), and *I* an identity matrix. The estimator turns out to be identical to a Bayesian one which assumes that *β* has a prior normal distribution *N*(0, *σ*^{2}/*λ*) [47]. Ridge regression was initially applied to marker-assisted selection by Whittaker et al. [48] to reduce the standard errors in estimation. Xu [46] found that ridge regression was not applicable to genome-wide markers simultaneously because each marker demands a different *λ* in the variance of its prior distribution. This problem of ridge regression can be fixed in a generalized ridge estimator where *K* is a diagonal matrix with different diagonal elements *λ*_{1}, *λ*_{2}, … [49]. However, an iterative procedure is required to find these *λ* values, which is similar to the MCMC sampling in Bayesian analysis.
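The ridge estimator and its Bayesian reading can be checked numerically with a small sketch. The data below are simulated, and the sample size, marker count, effect sizes, and *λ* values are arbitrary assumptions made for illustration; the per-marker *λ* values in the generalized version are fixed by hand here rather than found by the iterative procedure mentioned above.

```python
import numpy as np

def ridge_estimate(X, y, K):
    """Solve (X^T X + K) beta = X^T y for beta."""
    return np.linalg.solve(X.T @ X + K, X.T @ y)

def posterior_mean(X, y, sigma2, lam):
    """Posterior mean of beta under y ~ N(X beta, sigma2 I) and
    an independent prior beta_j ~ N(0, sigma2 / lam)."""
    p = X.shape[1]
    precision = X.T @ X / sigma2 + (lam / sigma2) * np.eye(p)
    return np.linalg.solve(precision, X.T @ y / sigma2)

rng = np.random.default_rng(0)
n, p = 20, 50                                  # more markers than individuals
X = rng.choice([-1.0, 1.0], size=(n, p))       # simulated marker codes
beta_true = np.zeros(p)
beta_true[[3, 27]] = [2.0, -1.5]               # two markers with real effects
sigma2 = 0.25
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# Ordinary ridge: a single lambda shared by all markers (K = lambda * I).
lam = 1.0
beta_ridge = ridge_estimate(X, y, lam * np.eye(p))

# Generalized ridge: a diagonal K with a separate lambda per marker.
lams = np.full(p, 10.0)
lams[[3, 27]] = 0.1
beta_gen = ridge_estimate(X, y, np.diag(lams))

# The ridge estimate coincides with the Bayesian posterior mean.
same = np.allclose(beta_ridge, posterior_mean(X, y, sigma2, lam))
```

Because *K* makes the matrix being inverted positive definite, the system has a unique solution even though there are more markers than observations, which is where ordinary least squares fails.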

A Bayesian QTL mapping analysis usually results in a posterior distribution over the QTL model space. Numerical summaries of such a distribution provide estimates for parameters. Such estimation is based upon the entire model space, weighted by the posterior probabilities, and hence is more robust than the usual MLE. In terms of hypothesis testing, Bayes factors, together with prior odds, are used to compare two hypotheses. Unlike *P* values, Bayes factors are calculated under both competing hypotheses; like *P* values, they have to be compared with some commonly used cutoff values (like 0.05 for *P* values) to decide which hypothesis to prefer [50].
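How Bayes factors combine with prior odds can be shown with a short arithmetic sketch. The relationship used is the standard one (posterior odds equal the Bayes factor multiplied by the prior odds); the function names and the numerical values are hypothetical and do not come from any particular QTL analysis.

```python
def posterior_odds(bayes_factor, prior_odds):
    """Posterior odds of H1 versus H0 = Bayes factor x prior odds."""
    return bayes_factor * prior_odds

def odds_to_probability(odds):
    """Convert odds in favor of H1 into the probability of H1."""
    return odds / (1.0 + odds)

# Hypothetical numbers: the data favor H1 by a factor of 100,
# but H1 was given prior odds of only 1 to 1000.
bf = 100.0
prior = 1.0 / 1000.0

post = posterior_odds(bf, prior)
prob_h1 = odds_to_probability(post)
```

With these made-up values the posterior odds are still below one, so even a Bayes factor of 100 does not make the alternative hypothesis more probable than the null when the prior odds are very low.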

In a single-QTL scan, there is clearly one null hypothesis and one alternative hypothesis [4]. In model selection for a multiple QTL model, however, the number of hypotheses in the model space is huge, and the data may well support many models for a complex trait. The posterior distribution emphasizes such uncertainty. Picking a single better-supported model to interpret mapping results may not fully convey the uncertainty of inference. On the other hand, a pragmatic mathematical model may choose to simplify the complexity and present an inference with basic structural characteristics. See Kass and Raftery [50] for more discussion.

Bayesian QTL mapping does not have the multiple testing issue discussed above. Bayesians agree that very strong evidence should be required to call QTL from the genome. However, they attribute this requirement to the size of the genome: it is so large that the prior probability that any one variant or variant combination is causative is very low. Thus, in Bayesian QTL mapping, the multiple testing issue is handled by the low prior density on any one marker or the low prior odds for any one hypothesis, and such a stringent requirement is recommended even when exploring a very limited QTL model space, unless there is strong prior knowledge against it [51].

## 5. CONCLUSION

To document the evolution of the statistical approaches for QTL mapping, we have attempted to follow some threads of methodological development in multiple QTL mapping, threshold determination, and Bayesian QTL mapping methods. This area has been advanced greatly by the interaction between genotyping technologies and statistical methodologies in the last several years, and it will continue to be so in the future. The availability of efficient computational algorithms and software is essential to the QTL mapping community. However, it is equally important that these tools are applied with a thorough understanding of the genetic data and the tools themselves.

## ACKNOWLEDGMENT

This work was partially supported by the USDA Cooperative State Research, Education and Extension Service, Grant no. 2005-00754.

## References

*Genetical Research*. 1999;74(3):279–289. [PubMed]

*Genetics*. 1923;8(6):552–560. [PMC free article] [PubMed]

*Annual Review of Genetics*. 2001;35:303–339. [PubMed]

*Genetics*. 1989;121(1):185–199. [PMC free article] [PubMed]

*Heredity*. 1992;69(4):315–324. [PubMed]

*Proceedings of the National Academy of Sciences of the United States of America*. 1993;90(23):10972–10976. [PMC free article] [PubMed]

*Biometrics*. 1997;53(2):653–665. [PubMed]

*Genetics*. 1999;152(3):1203–1216. [PMC free article] [PubMed]

*Genetics*. 2000;154(1):299–310. [PMC free article] [PubMed]

*QTL Cartographer, Version 1.17*. Raleigh, NC, USA: North Carolina State University; 2002.

*Science*. 2002;296(5568):752–755. [PubMed]

*PLoS Biology*. 2005;3(8):1–11. [PMC free article] [PubMed]

*PLoS Genetics*. 2006;2(9):1424–1432. [PMC free article] [PubMed]

*Statistics in Medicine*. 2000;19(3):369–372. [PubMed]

*Genetics*. 1994;138(3):963–971. [PMC free article] [PubMed]

*Genetics*. 1996;142(1):285–294. [PMC free article] [PubMed]

*Genetics*. 2004;168(4):2307–2316. [PMC free article] [PubMed]

*The American Journal of Human Genetics*. 2005;76(3):399–408. [PMC free article] [PubMed]

*The American Journal of Human Genetics*. 2007;81(6):1158–1168. [PMC free article] [PubMed]

*The American Journal of Human Genetics*. 2004;74(4):765–769. [PMC free article] [PubMed]

*Journal of the Royal Statistical Society: Series B*. 1995;57(1):289–300.

*q*-value.

*Annals of Statistics*. 2003;31(6):2013–2035.

*Annals of Statistics*. 2001;29(4):1165–1188.

*Genetics*. 2003;164(2):829–833. [PMC free article] [PubMed]

*Genetics*. 2005;171(2):783–790. [PMC free article] [PubMed]

*Journal of Statistical Planning and Inference*. 1999;82(1-2):171–196.

*Genetics*. 1998;150(4):1699–1706. [PMC free article] [PubMed]

*Genetics*. 2000;154(4):1917–1918. [PMC free article] [PubMed]

*Genetics*. 2000;154(4):1919 pages. [PMC free article] [PubMed]

*Proceedings of the National Academy of Sciences of the United States of America*. 2003;100(16):9440–9445. [PMC free article] [PubMed]

*Journal of the Royal Statistical Society: Series B*. 2002;64(3):479–498.

*Journal of the American Statistical Association*. 2001;96(456):1151–1160.

*Genetics*. 2004;166(1):611–619. [PMC free article] [PubMed]

*Genetics*. 2006;173(4):2371–2381. [PMC free article] [PubMed]

*Genetics*. 2006;172(1):687–691. [PMC free article] [PubMed]

*Genome Research*. 2004;14(6):997–1001. [PubMed]

*Genetics*. 1996;144(2):805–816. [PMC free article] [PubMed]

*Genetics*. 2001;159(1):371–387. [PMC free article] [PubMed]

*Biometrika*. 1995;82(4):711–732.

*Genetics*. 2003;165(2):867–883. [PMC free article] [PubMed]

*Genetics*. 2004;167(2):967–975. [PMC free article] [PubMed]

*Genetics*. 2005;170(3):1333–1344. [PMC free article] [PubMed]

*Bioinformatics*. 2007;23(5):641–643. [PubMed]

*Genetics*. 2003;163(2):789–801. [PMC free article] [PubMed]

*The Statistician*. 1975;24(4):267–268.

*Genetical Research*. 2000;75(2):249–252. [PubMed]

*Journal of Animal Science*. 2005;83(8):1788–1800. [PubMed]

*Journal of the American Statistical Association*. 1995;90(430):773–795.

*Nature Reviews Genetics*. 2006;7(10):781–791. [PubMed]

**Hindawi Publishing Corporation**


Statistical Methods for Mapping Multiple QTL. International Journal of Plant Genomics. 2008.
