Logo of geneticsGeneticsCurrent IssueInformation for AuthorsEditorial BoardSubscribeSubmit a Manuscript
Genetics. May 2005; 170(1): 465–480.
PMCID: PMC1449727

Bayesian Shrinkage Estimation of Quantitative Trait Loci Parameters

Abstract

Mapping multiple QTL is a typical problem of variable selection in an oversaturated model because the potential number of QTL can be substantially larger than the sample size. Currently, model selection is still the most effective approach to mapping multiple QTL, although further research is needed. An alternative approach to analyzing an oversaturated model is the shrinkage estimation in which all candidate variables are included in the model but their estimated effects are forced to shrink toward zero. In contrast to the usual shrinkage estimation where all model effects are shrunk by the same factor, we develop a Bayesian method that allows the shrinkage factor to vary across different effects. The new shrinkage method forces marker intervals that contain no QTL to have estimated effects close to zero whereas intervals containing notable QTL have estimated effects subject to virtually no shrinkage. We demonstrate the method using both simulated and real data for QTL mapping. A simulation experiment with 500 backcross (BC) individuals showed that the method can localize closely linked QTL and QTL with effects as small as 1% of the phenotypic variance of the trait. The method was also used to map QTL responsible for wound healing in a family of a (MRL/MPJ × SJL/J) cross with 633 F2 mice derived from two inbred lines.

GENETIC variance of a quantitative trait, by definition, is controlled by the segregation of multiple loci, namely quantitative trait loci (QTL). Densely distributed molecular markers along the genome allow us to localize these QTL and eventually clone the actual genes. Detection and localization of QTL using molecular markers is called QTL mapping. Interval mapping (Lander and Botstein 1989) and various modified versions of this approach (Jansen 1993; Zeng 1994) have been the common methods of QTL analysis. The basic idea of these methods is to divide the entire genome into a finite number of points 1 or 2 cM apart. These points are subject to statistical test and evaluation and are thus called putative QTL. These putative loci follow Mendel's law of segregation. Their genotypes are not observable but can be inferred from the genotypes of flanking markers. Two flanking markers define an interval that may contain several putative QTL. This explains why the method is called interval mapping. Interval mapping is a one-dimensional search algorithm in the sense that it tests one putative position at a time and a search for the entire genome requires multiple tests for a series of putative positions in the genome. A smoothed plot of the test-statistic value against the genome position forms a continuous profile. A significant peak of the profile indicates a QTL located in the neighborhood of the peak. Multiple peaks may imply existence of multiple QTL. As a one-dimensional search algorithm, the interval-mapping procedure handles only models with a single QTL. Only effects of the putative QTL at the current position are included in the model and all other QTL effects are ignored. Effects of QTL not included in the model are actually lumped into the residual error. Composite interval mapping (Jansen 1993; Zeng 1994), a variant of the interval-mapping procedure, treats QTL effects ignored by the interval mapping as background effects, which are absorbed by selective markers (called the cofactors) outside the tested interval. Composite interval mapping can substantially improve the efficiency of QTL mapping if the cofactors are properly chosen. When multiple QTL are detected, one may have to rewrite the model to include all the significant intervals in a single multiple-QTL model and reestimate the QTL effects with QTL positions fixed at their estimated values (Yano et al. 1997; Hunt et al. 1999; Bunyamin et al. 2002). This two-step approach may not be optimal and is being replaced by a one-step multiple-QTL mapping strategy in which locations and effects are estimated simultaneously (Kao et al. 1999).

Multiple-QTL mapping has become the state-of-the-art gene mapping procedure. However, implementation of the multiple-QTL model is difficult. A major hurdle comes from problems associated with variable selection. The number of QTL defines the dimension of the model, but it is also an unknown parameter of interest. Several approaches have been taken for selecting the optimal set of putative QTL. The earliest work can be found in Jansen (1993) and Jansen and Stam (1994). Kao et al. (1999) adopted a stepwise regression approach to adding and deleting QTL progressively until the model is stabilized. The Bayesian information criterion (BIC) has been investigated by Ball (2001), Piepho and Gauch (2001), and Broman and Speed (2002). Recently, a Bayesian method implemented via the Markov chain Monte Carlo (MCMC) algorithm has been developed for mapping multiple QTL (Satagopan et al. 1996; Heath 1997; Sillanpaa and Arjas 1998). The number of QTL is determined either by the Bayes factor (Kass and Raftery 1995; Satagopan et al. 1996) or by reversible-jump MCMC (Green 1995; Sillanpaa and Arjas 1998). It has been noted that the reversible-jump MCMC for model selection is usually subject to poor mixing, i.e., slow convergence. Most recently, Yi et al. (2003) applied a stochastic search variable selection (SSVS) method to mapping multiple QTL. The original SSVS was developed by George and McMulloch (1993), who tried to avoid changing the dimension of the model yet still implemented the idea of variable selection. In the SSVS analysis, each regression coefficient is assigned a mixture of two normal distributions. One distribution has a zero mean with a tiny variance and the other has a zero mean with a large variance. If the posterior probability that a model effect belongs to the distribution with a large variance is high, this effect is considered as selected. Otherwise, it is considered as excluded. This semimodel selection approach has avoided many problems associated with any full-model selection approach.

There is another class of methods for handling models with a large number of model effects that require no variable selection. These methods start with the ridge regression method (Hoerl and Kennard 1970; Whittaker et al. 2000), where all potential model effects are included in the model but the estimated effects are forced to shrink toward zero. This family of estimates is called shrinkage estimates (Breiman 1995; Tibshirani 1996) or model-selection-free estimates because variable selection is not required. Recently, Xu (2003) showed that the usual ridge regression method can fail if the number of model effects is too large. Ridge regression actually has a Bayesian analogy (Hoerl and Kennard 1970). Therefore, the small positive number added to the diagonal elements of the coefficient matrix in the ridge regression is equivalent to the ratio of the residual variance to the variance parameter of the QTL effects. Xu (2003) then modified the variance parameter of the prior distribution of the QTL effects and let the variance parameter vary across QTL. This method also is similar to the method of Sauerbrei (1999), who allowed the value added to the diagonals of the coefficient matrix to vary. The modified method of Xu (2003) is based on the random-model approach of Meuwissen et al. (2001). A similar random-model approach was also developed by Gianola et al. (2003) from a marker-assisted selection perspective. As a result, the method can handle models with the number of effects many times larger than the number of observations.

Xu (2003) satisfactorily applied the Bayesian shrinkage method to study the genetic architecture of a number of traits in barley. Kopp et al. (2003) applied the same method to identify many important markers associated with the variation of bristle number in fruit flies. The method of Xu (2003) adopted the idea of shrinkage analysis in statistics. Each marker was treated as a putative QTL and thus included in the model as one variable. The incidence matrix (determined by the marker genotypes) is fully observed if there are no missing genotypes of markers. QTL mapping, however, is further complicated by the uncertainty of genotypes of QTL, and thus the incidence matrix is no longer observed. Furthermore, QTL positions are additional parameters of interest, estimation of which requires more complicated techniques beyond the shrinkage estimation. Therefore, the objective of this study is to extend the Bayesian shrinkage estimation of Xu (2003) to QTL mapping in which the positions and effects of QTL are estimated simultaneously. We first develop the theory and method and then demonstrate the advantages of this method over existing methods using simulated data. Finally, we apply the method to map QTL for the genetic difference of wound healing between two inbred lines of laboratory mice (Masinde et al. 2001).

METHODS

Linear model:

The method is developed on the basis of a backcross (BC) design. Extension to an F2 mating design is mentioned briefly in a subsequent section. Let yi be the phenotypic value of the ith progeny for i = 1, 2, … , n. The linear model for yi is

equation M1
1

where b0 is the population mean, p is the number of QTL included in the model, xij is a dummy variable defined as xij = 1 for the heterozygote and xij = −1 for the homozygote (there are only two possible genotypes at any locus in a BC family), bj is the jth QTL effect, and ei is the residual error with a N0, σ20 distribution. The genotype indicator variables, xij's, are not observed but their probability distributions can be inferred from marker information and the positions (λj's) of the QTL relative to the marker map.

In a typical Bayesian mapping implemented via a reversible-jump MCMC algorithm (Sillanpaa and Arjas 1998), p (the dimension of the model) is a parameter of interest. In this study, however, we treat p as a constant. It may be interpreted as the maximum number of QTL, which is fixed and depends on the method chosen in the analysis. In marker analysis, each marker is assumed to be associated with a QTL and thus p equals the number of markers. This method was developed by Xu (2003) and herein is called method II. Note that we reserve method I for the single-marker analysis for the sake of comparison. If the marker map is sparse and markers are unevenly distributed throughout the genome, we insert one or more virtual markers in large intervals (larger than a predetermined constant). Genotypes of the virtual markers are treated as missing values. Again, we assume that each marker (true or virtual) is associated with a QTL, and thus p is equivalent to the total number of markers (true plus virtual markers). This marker insertion approach (Sen and Churchill 2001) is referred to as method III. In this study, we develop a new method, namely method IV, that allows a QTL to take a position varying within a marker interval, rather than fixed at a marker. This is the real multiple-QTL-mapping analysis. We assume that each marker interval is associated with a QTL, and thus p is identical to the number of intervals.

We define p as the maximum number of QTL allowed in the model. In reality, most marker intervals are not associated with any QTL, and these intervals are called null intervals. With the Bayesian shrinkage analysis, we are able to force the estimated genetic effects of null intervals to shrink toward zero, and thus their inclusion in the model has negligible effect on other intervals. This approach largely avoids all problems involved in variable selection.

Bayesian estimation:

Our Bayesian regression model differs from the usual regression model in that each bj is assumed to be sampled from a normal distribution with mean zero and variance σ2j. Herein, we classify variables into observables and unobservables. The observables include y = {yi} for i = 1, 2, … , n and marker information denoted by m = {mik} for i = 1, 2, … , n and k = 1, 2, … , q, where q is the number of markers. In general, pq − 1, and the equality holds only if there is one chromosome in the genome. The unobservables include the regression coefficients, b = {bj}; the variances, equation M2 for j = 0, 2, … , p; the QTL positions, λ = {λj}, for j = 0, 2, … , p; and the QTL genotype indicator variables, x = {xij}, for i = 1, 2, … , n and j = 1, 2, … , p. Both m and x are matrices whose rows are denoted by mi.= {mik} ∀ k = 1, 2, … , q and xi. = {xij} ∀ j = 1, 2, … , p, respectively. The corresponding columns of these two matrices are denoted by m.k = {mik} ∀ i = 1, 2, … , n and x.j = {xij} ∀ i = 1, 2, … , n, respectively. The purpose of Bayesian analysis is to infer the posterior distribution of the unobservables conditional on the observables.

In this study, we choose the following prior distributions, equation M3 and pj) = 1/dj, where dj is the length of the marker interval where the jth QTL resides. The prior probability of xij is p(xijj) = 1/2 for i = 1, 2, … , n and j = 1, 2, … , p. The joint prior for all the unobservables is

equation M4
2

The joint probability p(xi.|λ) can be derived from a Markov model provided that segregation interference is absent.

The probability distribution of the data conditional on the unobservables is called the likelihood. There are two sources of data, the phenotypic values and the marker information (marker map and marker genotypes). The two sources of data are conditionally independent. Therefore,

equation M5
3

where

equation M6
4

and

equation M7
5

Both p(mi., xi.|λ) and p(xi.|λ) are derived from a Markov model under the assumption of no segregation interference. For example, consider three markers (i.e., two intervals) with a QTL in each interval. The Markov models for p(mi., xi.|λ) and p(xi.|λ) are

equation M8

and

equation M9

respectively, where p(mi1) = p(xi1) = 1/2 and the conditional probabilities depend on the genotypes of the two loci in question and the recombination fraction between the two loci. The conditional probability equals the recombination fraction if the two loci have different genotypes or equals the complementary of the recombination fraction if the two loci have the same genotype.

Combining the joint prior with the likelihood, we get the joint distribution of all variables, which is proportional to the posterior distribution of the unobservables,

equation M10
6

This is the target distribution, from which inferences are drawn. In the MCMC-implemented Bayesian analysis, we sample the unobservables from the above joint posterior distribution. In the following, we present the MCMC steps that are based largely on the Gibbs sampling approach.

Step 1. Initialization:

The intercept (b0) is initialized with the mean of the phenotypic values of the trait. The genetic effects of all QTL (b) are initialized with zero. The residual variance σ20 is initialized with the phenotypic variance of the trait. All other σ2j's are initialized with 0.5. The initial value of λj takes the middle point of the interval where the jth QTL resides. The initial value of genotype indicator xij is sampled from the probability of xij conditional on the flanking marker genotypes. Missing marker genotypes are sampled first from their prior distributions. The initial values of all the unobservables are superscripted with (0).

Step 2. Updating the population mean:

The conditional posterior distribution of b0 is normal with mean equation M11 and variance equation M12 from which a new b0 is sampled. The sampled b0 is denoted by b10, which will replace b00 in all subsequent steps of the sampling.

Step 3. Updating QTL effects:

The conditional posterior distribution for bj is normal with mean

equation M13
7

and variance

equation M14
8

which are used to sample bj. The newly sampled bj is denoted by b1j and will replace b0j in all subsequent sampling steps. Equations 7 and 8 are the keys to the Bayesian shrinkage analysis. The special forms of the posterior mean and posterior variance allow the method selectively shrink QTL effects. For example, if a QTL has a large effect, its variance parameter σ2j will be relatively large. A large σ2j means a negligible σ20/σ2j, and thus shrinkage will have a negligible effect on the posterior distribution of bj. However, if a QTL has a small effect, we will expect a small σ2j which will generate a large ratio σ20/σ2j. This large ratio will dominate over equation M15 and lead to a posterior distribution with approximately zero mean and zero variance; i.e., bj will be shrunk toward zero.

Step 4. Updating residual variance:

The residual variance is sampled from a scaled inverted chi-square distribution, pσ20|… ~ Inv − χ2 n, SS, that is equation M16 where equation M17 and χ2n is a random number sampled from a chi-square distribution with n degrees of freedom. The variance is immediately updated, equation M18

Step 5. Updating variance parameters of QTL:

We sample σ2j from a scaled inverted chi-square distribution, pσ2j|… ~ Inv − χ21, b20j, that is, equation M19 where χ21 is a random number sampled from a chi-square distribution with 1 d.f.

Step 6. Update QTL genotypes:

The QTL genotype xij is updated one individual and one locus at a time, using flanking marker information. It is sampled from the conditional probability distribution

equation M20
9

where miL and miR are the genotypes of the left and right markers flanking the QTL, and pmiL, miR|xij, λ0j is obtained from the Markov model. The probability density of yi is

equation M21
10

Step 7. Sampling missing marker genotypes:

Unless a marker is located in one end of a chromosome, every marker is bracketed by two QTL on the basis of our assumption of one QTL within each interval. Assume that the kth marker of individual i is missing and the marker is bracketed by QTL j and j + 1. Given the QTL genotypes, the missing marker genotype is sampled from the probability distribution

equation M22
11

where px0ij, mik, x0ij+10j, λ0j+1 is obtained from the Markov model. If the kth marker is in one end of a chromosome, the missing genotype of mik is sampled from pmik|x0ij, λ0j, assuming that the jth QTL is the closest locus to marker k.

Drawing missing marker genotypes conditioning on flanking QTL genotypes is a full Bayesian treatment. QTL parameters provide useful information for generating missing marker genotypes, but this information can be used only through the dependence of marker genotypes on QTL genotypes. This approach does not preclude the use of information from other markers because that information has been used when we draw the genotypes of the QTL. Therefore, as the chain extends, information from all markers will be used jointly. This approach of sampling missing marker genotypes will not work for organisms with an extremely dense marker map. In that case, we adopt a block updating approach, which is discussed later.

Step 8. Update QTL positions:

There is no closed form for the conditional posterior probability density of a QTL position. Therefore, we take the Metropolis-Hastings (Metropolis et al. 1953; Hastings 1970) approach for sampling the position of a QTL. First, we sample a new position around the existing one from a uniform distribution, [λ0j − δ, λ0j + δ), where δ is a tuning parameter, usually taking a value of 1 or 2 cM. We also sample genotypes of the QTL corresponding to the new position, denoted by x*.j, from an appropriate proposal distribution. The sampled new position, denoted by λ*j, is only a proposed position, which is accepted with a probability equal to min(1, α), where

equation M23
12

If the new position is accepted, the sampled genotypes are also accepted simultaneously. The first and second ratios in (12) are called the posterior and the proposal ratios, respectively. If neither the old nor the new position is close to the ends of the interval, equation M24 However, if either one is close to an end of the interval, we need to use the general formulas

equation M25
13

where τ0j is the distance of the old position from the nearest end of the interval and τ*j is the distance of the new position from the nearest end of the interval. Note that the general formulas are required because the QTL position must be sampled within the interval. The proposal distribution from which the QTL genotypes are sampled takes the posterior distribution obtained from Equation 9, i.e.,

equation M26

and

equation M27

The posterior ratio of the new to the old positions is

equation M28
14

At this moment, we have completed one sweep of the MCMC and are ready to continue our sampling for the next round by repeating steps 2–8. When the chain converges to the stationary distribution, the sampled parameters actually follow the joint posterior distribution. Likewise, a sample of any single parameter is a draw from its marginal posterior density. To ensure the convergence of the Markov chain, we conduct several (usually three) independent runs and monitor the traces of a few parameters, e.g., the population mean, the residual variance, and the QTL effects of a few intervals. A trace of a parameter is defined as the sampled parameter value plotted against the iteration number of the MCMC process. If all the independent runs generate the same patterns of the traces for each parameter evaluated, convergence is considered satisfactory. Our experience indicates that the MCMC shrinkage analysis converges faster than that of the reversible-jump MCMC.

Extension to F2 progeny:

In an F2 population, there is a dominance effect associated with each QTL. The linear model given by Xu (2003) applies here except that the additive-effect indicator variable xij is defined as 1, 0, and −1, respectively, for the three genotypes A1A1, A1A2, and A2A2. This is consistent with the traditional notation of QTL mapping in F2 populations. The method developed for BC progeny applies to F2 progeny by increasing the number of genotypes per locus from two to three and updating the genotype of a locus with two genetic effects (additive and dominance) separately.

APPLICATIONS

Simulation study:

Experimental design:

A backcross population of 500 individuals was simulated. We investigated a single large chromosome of 24 M, which is equivalent to 12 chromosomes, each 200 cM long. This giant chromosome was covered by 121 evenly spaced markers (120 intervals, each 20 cM long). We put 20 QTL along the genome with positions and effects listed in Table 1. The proportion of phenotypic variance contributed by an individual QTL ranged from 0.50 to 20.0% (see Table 1, column 4). The total genetic variance contributed by all 20 QTL was 93.52. The population mean and the environmental (residual) variance were set at b0 = 5.0 and equation M29 respectively. On the basis of the backcross genetic model described in this study, the proportion of the phenotypic variance contributed by all 20 QTL was 93.52/(93.52 + 10.00) = 90.34%. Positions of these simulated QTL varied in terms of distances from the nearest markers, some overlapping with a marker and some residing in the middle of an interval.

TABLE 1
QTL parameters and their estimates obtained from the simulated data

Analysis:

Three methods were used to analyze the data. Method I was the traditional individual-marker analysis (one-dimensional genome scan) where one marker was analyzed at a time and the entire genome scan required 121 separate analyses (121 markers). Method II was a multiple-marker Bayesian analysis where all 121 markers were included in a single linear model (Xu 2003). Method IV was the method developed in this study where each marker interval was assumed to contain one QTL and the position of the QTL varied within the interval. We did not use method III with pseudo-markers for the simulated data because markers were simulated with equal distance. Method III, however, was used for the real data analysis described later. Method IV requires a single linear model with 120 QTL effects. In the MCMC-implemented Bayesian analysis, the length of a Markov chain consisted of 20,000 sweeps. The first 4000 sweeps (burn-in period) were deleted and thereafter the chain was trimmed by keeping one observation in every 20 sweeps, and thus the posterior samples contained 800 observations for post-MCMC analysis. The MCMC experiment with the same data was repeated a few times to make sure that the chain had converged to the stationary distribution. We present one of the replicates to demonstrate the general behavior of the methods.

Result:

Figure 1 shows the estimated QTL-effect profiles of the three methods along with the true locations and effects of the simulated QTL. The behavior of method I (Figure 1a) shows overestimation of the QTL effects, low resolution to separate loosely linked QTL, and no power to separate closely linked QTL. Method II (Figure 1b), in general, underestimated the QTL effects and improved the resolution of separating loosely linked QTL, but still had no power to separate closely linked QTL. Both methods I and II deal with situations where QTL positions are fixed at marker positions. Therefore, there is no estimation of the QTL position if the QTL was actually located between two markers. In contrast, method IV (Figure 1c) estimates both effects and locations of the QTL. In general, method IV provided more accurate estimates of the QTL effects and improved resolution of separating closely linked QTL. For example, QTL 3 and 4 were previously inseparable using methods I and II, but now they have been separated by the new method. This argument also holds for QTL 11 and 12 and other pairs of QTL. Some simulated QTL were located within the same interval or shared a common marker, for which the method was still incapable of resolving them. In fact, no method is available to resolve those QTL in such a situation because there is simply not enough information. The only way to solve the problem is to increase marker density and population size. In addition, if two closely linked QTL have effects with opposite directions, the effects will most likely cancel out each other and both QTL may be missed. This has been demonstrated by the failure to detect QTL 7 and 8 (see Figure 1).

Figure 1.
The estimated QTL-effect profiles for the entire genome of the simulated data. (a) Method I (single-marker analysis); (b) method II (multiple-marker analysis); (c) method IV (multiple-QTL analysis). The true positions and effects of the ...

The vertical axis of the QTL-effect profile of method IV (Figure 1c) was defined as the weighted effect. The weighted effect is the actual effect estimated at that location multiplied by the relative frequency of that location hit by the QTL. The relative frequency itself forms a QTL intensity profile (Figure 2). The intensity of a particular location of the genome was defined as the relative number of hits by a QTL to a 2-cM interval covering that location. In this simulation experiment, most of the marker intervals did not contain any QTL and thus the QTL intensity was uniform within these marker intervals. The actual value of intensity for these marker intervals was ~0.1 because each marker interval was divided into 10 smaller intervals (0.1 × 10 = 1). For those intervals with QTL, the values of the intensities varied from 0.15 to 0.79. The QTL intensity profile does not provide clean signals for QTL. It was used only to generate the weighted QTL-effect profile (Figure 1c), which shows clear signals of QTL in terms of locations and effects.

Figure 2.
QTL intensity profile of the simulated data estimated with method IV from the simulated data.

We now zoom in on the QTL-rich region (between 7 and 14 M) to show the advantage of method IV over methods I and II (see Figure 3). Not only did method IV provide more accurate estimates of the QTL effects, but also it gave a high resolution to separate closely linked QTL.

Figure 3.
The estimated QTL-effect profiles of the three methods for the QTL-rich region of the genome (7–14 M) for the simulated data. The asterisks along the zero-reference line of the vertical axis point to the marker positions. See Figure 1 legend.

The estimated QTL locations and effects obtained from method IV are summarized in Table 1 along with the true parameters. Overall, we detected ~16 QTL regions and these regions collectively contributed 90% of the phenotypic variance, which is very close to the true value used in the simulation. Note that the proportion of phenotypic variance contributed by an individual QTL was defined as equation M30 where bj is the QTL effect and σ2P is the total phenotypic variance. In the simulation study, the sum of all estimated QTL variances was 116.42 and the estimated residual variance was 13.37, leading to equation M31 In summary, the new method missed four QTL, two of which were tightly linked and had effects with opposite directions; the remaining two had very small effects (explaining <1% of the phenotypic variance).

Replicated simulations:

In Bayesian mapping of multiple QTL, simulations are usually replicated a few times, simply for the sake of testing convergence of the algorithm and making sure the program is bug free. The number of replications rarely suffices to allow accurate assessment of the bias of parameter estimation and the statistical power. This is largely due to (1) high computational demands and (2) difficulty in summarizing results from replicated simulations. That the estimated position of a simulated QTL varies across replicates and a single chromosome may contain many QTL makes the assessment of bias very difficult. Nevertheless, we simulated 20 independent samples using the parameters described in Experimental design (the master data set). Although 20 samples are not sufficiently large to allow accurate estimation of statistical power (at least a few hundred replicates may be needed to obtain accurate estimation of the power), we may get a general idea about the performance of the method. The QTL-effect profile of each replicate is depicted in a figure given in supplementary material (http://www.genetics.org/supplemental/). The empirical statistical power of each simulated QTL was calculated as the ratio of the number of samples in which the QTL was “detected” to the total number of replicates (i.e., 20 in this report). A QTL was claimed as detected if there was a notable peak, or a bump, around the simulated position. This approach is somewhat arbitrary, but is not hard to do because the background (with no simulated QTL) is largely flat, which makes the comparison of the peaks with the background easy. Figure 4 presents the average QTL effect profile of the 20 replicates along with the empirical power of each simulated QTL. All QTL that explain >2% of the phenotypic variance have virtually reached a perfect power (100%) unless a QTL is very close to a larger one that tends to absorb the QTL nearby. Even the smallest QTL (QTL 19, explaining 0.48% of the variance) has a power of 65%. QTL 7 and 8, which are closely linked and have the same size but with opposite directions, were not detected in almost all samples. A much denser marker map along with a larger population size is required to resolve QTL with that tight linkage. The estimated QTL effect of a fixed position is not an unbiased estimate of the QTL effect of that position because the effect of position with a QTL tends to be underestimated and positions surrounding the QTL tend to be overestimated (see the profile plotted in Figure 4). We took the following approach to assess the bias of a simulated QTL. In each replicated simulation, we took the position in the neighborhood of the simulated QTL (not necessarily at the QTL position) that has a local maximum estimated effect. This location was treated as the estimated position of the QTL. The local maximum estimated QTL effect was treated as the estimated effect of this QTL. We then took the average of estimated positions and the average of estimated effects of this simulated QTL across replicates as the expected values of the estimates, which were compared with the true position and effect of the QTL to assess the biases. Because of the complexity of multiple closely linked QTL in the study, we hand picked these estimated values for each QTL in each sample. The results are summarized in Table 2. In general, larger QTL tend to have less bias than smaller QTL. This is because the shrinkage method discriminates against smaller QTL. In addition, parameters of closely linked QTL are hard to estimate. For example, the following QTL pairs, (5, 6), (10, 11), and (16, 17), were so tightly linked that they were inseparable in all the replicates. We then combined the closely linked QTL pairs and reported them as a single large one for each pair of QTL. The estimated effects are indeed close to the sum of two effects (see Table 2). QTL 7 and 8 have opposite effects and the sum of the two effects is close to zero. Neither QTL 7 nor QTL 8 was detected in the replicated samples.

Figure 4.
The average effects and locations of QTL estimated with method IV from 20 replicated simulations. The numerical percentage (%) on each tip of the true QTL needle is the empirical power obtained from the 20 replicates.
TABLE 2
QTL parameters and their average estimates obtained from replicated simulations

Additional analyses:

We also analyzed the simulated BC data (see Experimental design and Table 1 for the parameters of the simulation) with the multiple-interval mapping (MIM) procedure developed by Kao et al. (1999). The QTL Cartographer program (Basten et al. 2002) has an option to implement MIM. MIM is a maximum-likelihood-based method, which involves model selection (implemented with the stepwise regression approach). Therefore, only large QTL can pass the criterion of test and be included in the final model. Six of the 20 simulated QTL were sufficiently large to be included in the final model in MIM. The estimated effects and genome locations of the six QTL (blue diamonds) are depicted in Figure 5a along with the true effects and locations of the 20 simulated QTL (black triangles). The first and second simulated QTL (counted from the left to the right) were detected around the true locations with approximately the true effects. The third and fourth QTL were combined as a single QTL by the program and detected at a location close to the third QTL with an estimated effect 30% larger than that of the third QTL. These two QTL, however, were successfully separated by the new method (method IV, see Figure 1c). QTL 5 and 6 are actually located in the same marker interval and thus are not separable with any methods. They were detected as a single large QTL by both the new method and MIM. QTL 10, 11, and 12 are tightly linked and were detected as a single large QTL by MIM. However, the new method (method IV) was able to separate QTL 12 from the other two QTL (see Figure 1c). The last QTL detected by MIM is QTL 15, which is closely linked to some QTL with opposite effects. In addition to the 6 QTL (detected by MIM), the new approach (method IV) detected 7 other QTL with moderate effects (see Figure 1c), clearly demonstrating the advantage of the new method over MIM.

Figure 5.
The effects and locations of QTL estimated by three additional methods: (a) multiple-interval mapping (MIM, blue diamonds) compared with the 20 simulated true QTL (black triangles), (b) stochastic search variable selection (SSVS, blue curve), and (c) ...

We also analyzed the simulated BC data (see Experimental design and Table 1 for the parameters of the simulation) with SSVS developed by Yi et al. (2003). This method requires a mixture of two normal prior distributions for each QTL effect. Both prior distributions have mean zero, but one prior has a “tiny” variance and one has a “large” variance. Yi et al. (2003) suggested three (tiny, large) variance pairs, (0.001, 10), (0.01, 10), and (0.01, 100). We took the (0.01, 10) pair. The method of Yi et al. (2003) is a marker analysis, similar to the method of Xu (2003). Although Yi (2004) recently extended the SSVS approach to allow QTL positions to vary within intervals, no program is available. The theory of Yi (2004) has not been tested with either simulated or real data. Therefore, we had to modify our variable-position method to incorporate SSVS and rewrote the program to implement the variable-position SSVS. The results of SSVS are given in Figure 5b. The variable-position SSVS analyses had comparable results to our method (see Figure 1c for comparison). However, SSVS generated some spurious peaks between positions 14 and 15 M. In addition, the background noise of the variable-position SSVS seems to be larger than that of our method (see Figure 1c for comparison). A general conclusion from the simulation study was that SSVS and our method are comparable.

Finally, we modified the Bayesian shrinkage method by allowing QTL to be sampled from variable intervals. This modification enables the method to analyze data more efficiently with a large genome and high marker density. When the marker density is high and/or the genome is large, the number of intervals defined by the markers can be very large. It is neither reasonable nor efficient to put one QTL in every marker interval when the intervals are so short. We now assume that, on average, there is one QTL in every d cM, say d = 45 cM, regardless of how many markers are within the d-cM interval. The number of intervals along the genome is fixed, but the size of each interval can vary from one cycle of the MCMC to another. The positions of the fixed number of QTL are disjoint. This can be achieved by introducing a nonindependent prior for QTL positions. The prior for λj is defined as pjj−1, λj+1) = (λj+1 − λj−1)−1 so that λj is sampled only from (λj−1, λj+1). Incorporating this dependent prior will not change the algorithm described earlier except that the computer program needs to be modified slightly to reflect this variable-interval feature. We analyzed the same data generated earlier (see Experimental design and Table 1 for the parameters of the simulation), using the variable-interval version of the Bayesian shrinkage. We set the average d = 5, 10, 20, 40, … , 100 cM and found that when the average d < 50 cM, the results are all similar to what we observed when the interval sizes were determined by the flanking markers. When d > 50 cM, however, some QTL detected with the original method were missed. Figure 5c illustrates the result when the average d = 45 cM. By increasing the sizes of QTL-searching intervals, the computing time has been saved by an order of magnitude.

Wound-healing QTL in the mouse:

Masinde et al. (2001) published results of interval mapping for the difference of wound healing in a line-crossing experiment of laboratory mice. The authors genotyped 633 F2 mice from the cross of MRL/MP × SJL/J for 119 codominant markers along the genome. These markers covered ~1100 cM of the genome with an average marker interval of ~15 cM. A total of 10 QTL were identified to six chromosomes for the trait. The sizes of the identified QTL varied from 3 to 13% of the phenotypic variance (Masinde et al. 2001). The overall contribution of the 10 QTL to the phenotypic variance was ~65%.

This data set was reanalyzed in this study using three methods. Method I was the single-marker analysis (one-dimensional genome scan), method III (the pseudo-marker approach) was the multiple-marker analysis with a slight modification by inserting virtual markers into all the marker intervals that are >15 cM, and method IV was the method developed in this project by allowing QTL position to vary within marker intervals. In method III, some marker intervals were as large as 35 cM and thus three virtual markers were inserted into such a large interval. The MCMC experiment was set up in the same way as that in the simulation study. We did not use method II (Xu 2003) for the real data analysis.

The estimated additive QTL-effect profiles for the three methods are shown in Figure 6, where the 20 chromosomes were joined in a single genome. The marker locations shown in the genome were recalculated from their corresponding locations on the chromosomes. The length of each chromosome was redefined as the distance between the two markers in the ends of the chromosome. The first marker on the first chromosome started at position zero. For example, the first marker on chromosome 1 is located at position 3.3 cM and the last marker on the same chromosome is at position 110.4 cM. Therefore, the length of the first chromosome was defined as 107.1 cM. In the genome, the locations of the two markers become 0 and 107.1 cM, respectively. The positions of the first and second markers on the second chromosome are 2.2 and 15.3 cM, respectively. In the genome, the new positions of the two markers become 107.1 cM (overlapping with the last marker of the first chromosome) and 120.2 cM, respectively.

Figure 6.
The estimated QTL-effect profiles for the entire genome of the mouse data. (a) Method I (single-marker analysis); (b) method III (multiple-marker analysis); (c) method IV (multiple-QTL analysis). The dotted reference (vertical) lines on ...

The single-marker analysis (method I) shows a profile with too much noise (Figure 6a). This method is almost equivalent to interval mapping except that the positions of the QTL were always fixed to the nearest markers. The multiple-marker analysis (method III) has significantly improved the QTL signals (Figure 6b), where we identified 8 regions with clear evidence of QTL. Note that the interval mapping published by Masinde et al. (2001) showed 10 QTL. The 2 additional QTL identified by interval mapping appear to be false positive. The new approach (method IV) identified the same 8 QTL as those identified by method III. However, the new method provides additional information about the positions of the identified QTL (Figure 6c). The 8 QTL have estimated locations closely matching the locations obtained from interval mapping.

The obvious differences between the new method and interval mapping are that the new method (i) generated much clearer signals than interval mapping and (ii) QTL effects were simultaneously estimated in a single model. The estimated effects of the eight QTL obtained from method IV are given in Table 3 along with the estimates by interval mapping (Masinde et al. 2001). The estimated population mean and the residual error variance of method IV are 0.7367 ± 0.0131and 0.0787 ± 0.0069, respectively. The total estimated genetic variance of the eight QTL is 0.0357, leading to a phenotypic variance of 0.1144. Therefore, the overall contribution of the eight QTL to the phenotypic variance is 31.2%. The QTL contribution estimated by interval mapping was 65% (Masinde et al. 2001), twice the value of the new estimate. Note that in the F2 mating design, the proportion of phenotypic variance contributed by an individual additive QTL is equation M32 Dominance effects were also included in the model, but the estimated effects were close to zero throughout the entire genome (data not shown).

TABLE 3
Estimated QTL parameters obtained from the mouse data

DISCUSSION

The Bayesian shrinkage method for multiple-QTL mapping developed in this study appears to outperform most of the existing QTL-mapping methods, although the ultimate validity of the new method needs to be established in future studies using QTL data for a phenotype in which the gene has been identified. The new method assumes that every marker interval contains a QTL. Some QTL have large effects and most of the assumed QTL have zero effect. A QTL with zero effect is equivalent to no QTL. Instead of excluding QTL with zero effects from the model, as done by any variable-selection approach, the method developed here includes these zero-effect QTL in the model. The mechanism that allows effects of these null QTL to be estimated close to zero is the selective shrinkage estimation approach. Each QTL effect is assumed to be a random realization sampled from a normal distribution with an unknown variance. We must emphasize that the QTL variance parameter varies from one locus to another, and it must be estimated from the data rather than given as a parameter in the prior distribution. If all the QTL were assumed to be sampled from a common distribution, as usually treated by traditional Bayesian regression analysis, we would not have the power to discriminate QTL effects and thus would not generate clear signals of QTL along the genome. This treatment appears to have violated the parsimonious requirement for model fitting, but it works very well. Although we have not encountered any problem in both the simulated and real data analyses, a potential technical problem may occur when the marker intervals are too short. The problem is not due to too many QTL effects but due to the fact that the genotypic indicator variables of nearby QTL may be highly correlated (complete cosegregation of nearby markers). This potential problem may be reduced by increasing the sample size or deleting markers that are too close to any existing markers. In the mouse data of 633 individuals, the closest markers were 1.1 cM apart and the data were handled very well with the method. An alternative way of handling a dense marker map is to assume one QTL in every d cM, regardless the marker density, where d depends on the sample size and the type of mating design. This variant of the Bayesian shrinkage method has been applied to the simulated data, where we set d = 45 cM and obtained an equivalent result to the fixed-interval approach.

Regardless of the high power of the Bayesian method for handling complicated models, one criticism of the method may be the lack of statistical testing on the detected QTL. The simulation study shows that a small QTL, even as small as 1% of the phenotypic variance, can generate a notable bump in the QTL-effect profile (Figure 1c). Nowhere in the genome that does not contain a QTL is a high peak in the QTL-effect profile shown. Results of the mouse data analysis also show notable peaks for almost all regions that have been detected with a QTL, using likelihood-ratio test statistics. This further demonstrates that statistical tests in Bayesian analysis may not be as important as maximum-likelihood analysis. In practice, investigators may not be interested in the small QTL, which are usually arguable for significance. These QTL, even if statistically significant, may not be biologically “significant.” Any actions, e.g., marker-assisted selection or gene cloning, may be done only on the large QTL, which are usually detectable with clear evidence. However, it is still important to include these small QTL in the model because these QTL, collectively, can make a significant contribution to the variance of the trait. If they are ignored, as done by any model-selection approaches, the residual variance will be inflated, which is undesirable. If one wants to really have a decision rule, the 90% credibility interval (drawn from the posterior probability) of the QTL effects may be used for this purpose (see Tables 1 and and3).3). The credibility intervals of QTL effects reflect the precision of the estimates and should be used with caution as a way to test the significance of QTL effects. A credibility interval has a quite different meaning from a confidence interval. A confidence interval can be used as a tool for a significance test, but a credibility interval is determined by the posterior distribution conditional on the current data and no future experiments are implied. If one uses the credibility interval to test the significance of a QTL, many QTL reported will not be significant. Take the simulation experiment described in this study as an example (Table 1 and Figure 1c). Many QTL reported in Table 1 show a credibility interval covering zero and would not be declared as significant. But the posterior distributions of these QTL have nonnegligible probability mass away from zero. These QTL were simulated and we know that they exist. Therefore, we see the peaks in the profile (Figure 1c). No notable peaks occur at any places in the genome where there is no simulated QTL nearby.

The method developed in this study applies to QTL mapping with intermediate marker density. Although the results of our simulation studies appear to coincide with fine mapping, the method requires large samples to achieve an equivalent resolution to fine mapping. In addition, the method of sampling missing marker genotypes requires some modification before it is used for QTL mapping with high marker density. This is because when the marker density is extremely high, sampling a missing marker genotype conditional on genotypes of neighboring loci may cause the chain to be trapped locally and it may be hard to jump out of the local domain. This problem may be solved by sampling all missing marker genotypes jointly (a blockwise sampling), where we first sample all the missing marker genotypes sequentially from the left to the right of a chromosome and then use the Metropolis-Hastings rule to accept the sampled genotypes. To incorporate information from QTL, the optimal sampling strategy should have the sampling of QTL genotypes embedded in the sampling process of missing marker genotypes. In other words, genotypes of all missing loci (QTL and missing markers) should be sampled sequentially from the left to the right and the sampled genotypes are accepted or rejected simultaneously on the basis of the Metropolis-Hastings rule. Further investigation is required to formulate the Metropolis-Hastings acceptance probability.

The proposed method is simpler and easier to program than any model selection methods. As a result, it may be widely applied to QTL mapping. However, the simplicity may also pose some problems. First, there is no flexibility to incorporate prior knowledge about the number of QTL and the positions of the QTL, whereas the reversible-jump MCMC allows such information to be incorporated. Second, there is no explicit probabilistic statement about how often a QTL is included in the model, although the posterior density of a QTL effect can be used to extract such information. The SSVS method (Yi 2004), however, utilizes a mixture prior and explicitly makes a probabilistic statement about the inclusion of a QTL. Sharing the same disadvantage as the SSVS, the shrinkage method must update every putative QTL in each round of the iterations, and thus the amount of computing time can be intensive within each sweep of the updating process, compared to that of a model selection or reversible-jump algorithm. The time saved in the smaller number of iterations is compensated by the extra time cost within each sweep of the iterations. Finally, there is a potential risk that the chain may be trapped for some small QTL at variances near zero for a long time (Gelman et al. 2004), although we have not encountered such a difficulty in our simulations and real data analysis. The reason for the so-called slow mixing is that, if the current draw of a variance parameter is near zero, then in the next updating step the corresponding effect parameter will itself be shrunk to very close to zero (the prior mean). Then, in turn, the variance parameter will be estimated to be close to zero. This drawback can be overcome by using the parameter expansion technique proposed by Gelman et al. (2004). Therefore, some room is left for improvement. The proposed method is considered as an alternative, yet very efficient, method for mapping multiple QTL. Extensive simulation experiments will be conducted in the future to compare all these alternative methods. Hopefully, a “best” method exists to recommend after the simulations.

Finally, the computing time required for the Bayesian shrinkage estimation is usually lengthy because of the large number of QTL effects included in the model. The simulated and the mouse data analyses each took approximately 6 hr on a Pentium IV PC. No comparison has been made with the model selection method implemented via the reversible-jump MCMC (Sillanpaa and Arjas 1998). The new method converges to stationary distribution much more quickly than the reversible-jump MCMC, but each sweep takes a longer time than the reversible-jump MCMC. Therefore, the overall time taken by the two methods may be comparable. Anyway, computing time is less of a concern than the result given the fast pace of updating in computing technology. A computer program in C++ is available to interested users on request.

Acknowledgments

This research was supported by the National Institutes of Health grant R01-GM55321 to S.X. The work was also supported in part by the Assistance Award no. DAMD17-99-1-9571. The U.S. Army Medical Research Acquisition Activity, 820 Chandler St., Fort Detrick, MD 21702-5014 is the awarding and administering acquisition office for the award. The information contained in this publication does not necessarily reflect the position or the policy of the U.S. Government and no official endorsement should be inferred.

References

  • Ball, R. D., 2001. Bayesian methods for quantitative trait loci mapping based on model selection: approximate analysis using the Bayesian information criterion. Genetics 159: 1351–1364. [PMC free article] [PubMed]
  • Basten, C. J., B. S. Weir and Z-B. Zeng, 2002 QTL Cartographer. North Carolina State University, Raleigh, NC.
  • Breiman, L., 1995. Better subset selection using the nonnegative garotte. Technometrics 37: 373–384.
  • Broman, K. W., and T. P. Speed, 2002. A model selection approach for the identification of quantitative trait loci in experimental crosses. J. R. Stat. Soc. 64: 641–656.
  • Bunyamin, T., T. E. Michaels and K. P. Pauls, 2002. Genetic mapping of agronomic traits in common bean. Crop Sci. 42: 544–556.
  • Gelman, A., J. B. Carlin, H. S. Stern and D. B. Rubin, 2004 Bayesian Data Analysis, Ed. 2. Chapman & Hall/CRC Press, New York/Boca Raton, FL.
  • Gianola, D., M. Perez-Enciso and M. A. Toro, 2003. On marker-assisted prediction of genetic value: beyond the ridge. Genetics 163: 347–365. [PMC free article] [PubMed]
  • George, E. I., and R. E. McMulloch, 1993. Variable selection via Gibbs sampling. J. Am. Stat. Assoc. 91: 883–904.
  • Green, P., 1995. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82: 711–732.
  • Hastings, W. K., 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57: 97–109.
  • Heath, S. C., 1997. Markov chain Monte Carlo segregation and linkage analysis of oligogenic models. Am. J. Hum. Genet. 61: 748–760. [PMC free article] [PubMed]
  • Hoerl, A. E., and R. W. Kennard, 1970. Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12: 55–67.
  • Hunt, G. J., A. M. Collins, R. Rivera, R. E. Page and E. Guzman-Novoa, 1999. Quantitative trait loci influencing honeybee alarm pheromone levels. J. Hered. 90: 585–589. [PubMed]
  • Jansen, R. C., 1993. Interval mapping of multiple quantitative trait loci. Genetics 135: 205–211. [PMC free article] [PubMed]
  • Jansen, R. C., and P. Stam, 1994. High resolution of quantitative traits into multiple loci via interval mapping. Genetics 136: 1447–1455. [PMC free article] [PubMed]
  • Kao, C. H., Z-B. Zeng and R. D. Teasdale, 1999. Multiple interval mapping for quantitative trait loci. Genetics 152: 1203–1216. [PMC free article] [PubMed]
  • Kass, R., and A. Raftery, 1995. Bayes factors. J. Am. Stat. Assoc. 90: 773–795.
  • Kopp, A., R. M. Graze, S. Xu, S. B. Carroll and S. V. Nuzhdin, 2003. Quantitative trait loci responsible for variation in sexually dimorphic traits in Drosophila melanogaster. Genetics 163: 771–787. [PMC free article] [PubMed]
  • Lander, E. S., and D. Botstein, 1989. Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121: 185–199. [PMC free article] [PubMed]
  • Masinde, G. L., X. Li, W. Gu, H. Davidson, S. Mohan et al., 2001. Identification of wound healing/regeneration quantitative trait loci (QTL) at multiple time points that explain seventy percent of variance in (MRL/MpJ and SJL/J) mice F2 population. Genome Res. 11: 2027–2033. [PMC free article] [PubMed]
  • Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller and E. Teller, 1953. Equations of state calculations by fast computing machines. J. Chem. Phys. 21: 1087–1091.
  • Meuwissen, T. H. E., B. J. Hayes and M. E. Goddard, 2001. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157: 1819–1829. [PMC free article] [PubMed]
  • Piepho, H. P., and H. G. Gauch, Jr., 2001. Marker pair selection for mapping quantitative trait loci. Genetics 157: 433–444. [PMC free article] [PubMed]
  • Satagopan, J. M., B. S. Yandell, M. A. Newton and T. C. Osborn, 1996. A Bayesian approach to detect quantitative trait loci using Markov chain Monte Carlo. Genetics 144: 805–816. [PMC free article] [PubMed]
  • Sauerbrei, W., 1999. The use of resampling methods to simplify regression models in medical statistics. Appl. Stat. 48: 313–329.
  • Sen, S., and G. A. Churchill, 2001. A statistical framework for quantitative trait mapping. Genetics 159: 371–387. [PMC free article] [PubMed]
  • Sillanpaa, M. J., and E. Arjas, 1998. Bayesian mapping of multiple quantitative trait loci from incomplete inbred line cross data. Genetics 148: 1373–1388. [PMC free article] [PubMed]
  • Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. 58: 267–288.
  • Whittaker, J. C., R. Thompson and M. C. Denham, 2000. Marker-assisted selection using ridge regression. Genet. Res. 75: 249–252. [PubMed]
  • Xu, S., 2003. Estimating polygenic effects using markers of the entire genome. Genetics 163: 789–801. [PMC free article] [PubMed]
  • Yano, M., Y. Harushima, Y. Nagamura, N. Kurata and Y. Minobe, 1997. Identification of quantitative trait loci controlling heading date in rice using a high density linkage map. Theor. Appl. Genet. 95: 1025–1032.
  • Yi, N., 2004. A unified Markov chain Monte Carlo framework for mapping multiple quantitative trait loci. Genetics 167: 967–975. [PMC free article] [PubMed]
  • Yi, N., V. George and D. B. Allison, 2003. Stochastic search variable selection for identifying multiple quantitative trait loci. Genetics 164: 1129–1138. [PMC free article] [PubMed]
  • Zeng, Z-B., 1994. Precision mapping of quantitative trait loci. Genetics 136: 1457–1468. [PMC free article] [PubMed]

Articles from Genetics are provided here courtesy of Genetics Society of America
PubReader format: click here to try

Formats:

Related citations in PubMed

See reviews...See all...

Cited by other articles in PMC

See all...

Links

  • MedGen
    MedGen
    Related information in MedGen
  • PubMed
    PubMed
    PubMed citations for these articles

Recent Activity

Your browsing activity is empty.

Activity recording is turned off.

Turn recording back on

See more...