Genetics. May 2009; 182(1): 375–385.
PMCID: PMC2674834

Predicting Quantitative Traits With Regression Models for Dense Molecular Markers and Pedigree

Abstract

The availability of genomewide dense markers brings opportunities and challenges to breeding programs. An important question concerns the ways in which dense markers and pedigrees, together with phenotypic records, should be used to arrive at predictions of genetic values for complex traits. If a large number of markers are included in a regression model, marker-specific shrinkage of regression coefficients may be needed. For this reason, the Bayesian least absolute shrinkage and selection operator (LASSO) (BL) appears to be an interesting approach for fitting marker effects in a regression model. This article adapts the BL to arrive at a regression model where markers, pedigrees, and covariates other than markers are considered jointly. Connections between BL and other marker-based regression models are discussed, and the sensitivity of BL with respect to the choice of prior distributions assigned to key parameters is evaluated using simulation. The proposed model was fitted to two data sets from wheat and mouse populations, and evaluated using cross-validation methods. Results indicate that inclusion of markers in the regression further improved the predictive ability of models. An R program that implements the proposed model is freely available.

GENOMEWIDE dense marker maps are now available for many plant and animal species (e.g., WANG et al. 2005). An important challenge is how this information should be incorporated into statistical models for prediction of genetic values in animal and plant breeding programs, or for prediction of disease risk.

A standard quantitative genetic model assumes that genetic (g) and environmental (ε) effects act additively, to produce phenotypic outcomes y according to the rule y = g + ε. The information set now available for predicting genetic values may include, in addition to phenotypic records, a pedigree, molecular markers, or both.

Several methodologies have been proposed for incorporating dense marker data into regression models. A distinction can be made between methods that explicitly regress phenotypic records on markers via the regression function x_i′β, where x_i is a vector of marker covariates and β is a vector of regression coefficients, e.g., g_i = x_i′β, and those that view genetic values as a function of the subject and use marker information to build a (co)variance structure between subjects. The first group of methods includes standard Bayesian regression (BR) with random coefficients, i.e., a Bayesian model where regression coefficients are assigned the same Gaussian prior, and other shrinkage methods such as Bayes A or Bayes B of Meuwissen et al. (2001), and specifications described in Gianola et al. (2003). The second type of approach was suggested by Gianola et al. (2006) and Gianola and van Kaam (2008), who proposed using reproducing kernel Hilbert spaces regression (RKHS), with the information set consisting of SNP (single-nucleotide polymorphism) genotypes, possibly supplemented by genealogies. As discussed in De los Campos et al. (2009), in this approach, marker information is used to create a prior (co)variance structure between genomic values g_i, Cov(g_i, g_{i′}) = K(x_i, x_{i′})σ_g², where K(·, ·) is some positive-definite function and σ_g² is a parameter to be estimated from the data.

The two types of approaches lead to predictions of genomic values for quantitative traits. An advantage of explicitly regressing phenotypes on marker covariates is that the model can produce information about genomic regions that may affect the trait of interest. However, a main difficulty is that the number of regression coefficients (p) is typically large, often much larger than the number of records (n), with p ≫ n. Therefore, a crucial aspect is how this methodology can cope with the curse of dimensionality and with collinearity.

With whole-genome scans, many markers are likely to be located in regions that are not involved in the determination of traits of interest. On the other hand, some markers may be in linkage disequilibrium with some QTL or in regions harboring genes involved in the infinitesimal component of the trait. This suggests that differential shrinkage of marker effects should be a feature of the model, as noted by Meuwissen et al. (2001). Tibshirani (1996) proposed a regression method (least absolute shrinkage and selection operator, LASSO) that combines the good features of subset selection (i.e., variable selection) with the shrinkage produced by BR. Recently, Park and Casella (2008) presented a Bayesian version of the LASSO method (Bayesian LASSO, BL) and suggested a Gibbs sampler for its implementation. Alternatives to the Gibbs sampler of Park and Casella are discussed in Hans (2008). While the BL described by Park and Casella is appealing for the reasons mentioned above, it does not accommodate pedigree information or regression on (co)variates other than the markers for which a different shrinkage approach may be desired.

Several authors have considered combining pedigree and marker data into a single model in the context of QTL analysis (e.g., Fernando and Grossman 1989; Bink et al. 2002, 2008). Here, in this spirit, the BL is modified and extended to accommodate pedigree information as well as covariates other than markers.

The main objectives of this article are to (1) discuss the use of BL and related methods in the context of linear regression of quantitative traits on molecular markers, (2) evaluate the sensitivity of BL with respect to the choice of the prior for the regularization parameter, (3) extend the BL so that pedigrees or regressions on covariates other than markers can also be included in the model, and (4) evaluate the methodology using data from a self-pollinated wheat population and an outcross mouse population. The article is organized as follows: the first section, The Bayesian LASSO, introduces the BL as presented in Park and Casella (2008) and discusses connections between BL and closely related methods, such as those proposed by Meuwissen et al. (2001) or variants proposed by other authors. Monte Carlo Study evaluates the sensitivity of BL with respect to the choice of prior for the regularization parameter. Bayesian Regression Coupled With LASSO presents an extension of BL, treating effects of different types of regressors with different priors. In Data Analysis, the proposed methodology is applied to two data sets representing a collection of wheat lines and a population of mice. Concluding remarks are provided in the final section of the article. An R function (R Development Core Team 2008) that fits the model, and the data sets used in this article, are made available (see supporting information, File S1 and File S2).

THE BAYESIAN LASSO

Tibshirani (1996) proposed using the sum of the absolute values of the regression coefficients (or L1 norm) as a penalty in regression models, to simultaneously produce variable selection and shrinkage of coefficients; the proposed methodology was termed LASSO. In LASSO, estimates are obtained by solving the constrained optimization problem

min_β Σ_{i=1}^n (y_i − x_i′β)²  subject to  Σ_{j=1}^p |β_j| ≤ t
(1)

where x_i is a vector of covariates, β is the corresponding vector of regression coefficients, and t is an arbitrary positive constant. Above, it is assumed that data are centered, i.e., y has zero mean. Optimization problem (1) is equivalent to

min_β {Σ_{i=1}^n (y_i − x_i′β)² + λ Σ_{j=1}^p |β_j|}
(2)

(e.g., Tibshirani 1996), for some value of the smoothing parameter λ. It is known that the solution to (2) may involve zeroing out some elements of β, and there are many ways of illustrating why this may be so. One manner is to examine the shape of the feasible set in (1) (e.g., Tibshirani 1996); another is to consider the Bayesian interpretation of the LASSO. From (2), it follows that the solution can be viewed as the posterior mode in a Bayesian model with Gaussian likelihood, p(y | β, σ²) = Π_{i=1}^n N(y_i | x_i′β, σ²), and a prior on β that is the product of p independent, zero-mean, double-exponential (DE) densities; that is, p(β | λ) = Π_{j=1}^p (λ/2) exp(−λ|β_j|). In contrast, BR is obtained by assuming the same likelihood and a prior on β that is the product of p independent normal densities; that is, p(β | σ_β²) = Π_{j=1}^p N(β_j | 0, σ_β²), where σ_β² is a variance parameter common to all regression coefficients. The difference between these two priors is illustrated in Figure 1: the DE density places more mass at zero and has thicker tails than the Gaussian distribution. From this perspective, relative to BR, LASSO produces stronger shrinkage of regression coefficients that are close to zero and less shrinkage of those with large absolute values.
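The contrast between the two priors is easy to check numerically. The following sketch (plain Python; both densities scaled to unit variance, as in Figure 1) evaluates the two densities at a few points:

```python
import math

def gaussian_pdf(x, var=1.0):
    # N(0, var) density
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def de_pdf(x, var=1.0):
    # Double-exponential (Laplace) density; variance var implies scale b = sqrt(var/2)
    b = math.sqrt(var / 2.0)
    return math.exp(-abs(x) / b) / (2.0 * b)

for x in (0.0, 1.0, 3.0):
    print(f"x={x}: normal={gaussian_pdf(x):.4f}, DE={de_pdf(x):.4f}")
```

At x = 0 and in the tail (x = 3) the DE density is the larger of the two; at intermediate values (x = 1) the Gaussian is larger, which is exactly the shrinkage pattern described above.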

Figure 1.
Densities of a normal and of a double-exponential distribution (both with null mean and with unit variance).

The parameter λ, sometimes referred to as a regularization parameter, plays a central role: as it approaches zero, the solution to (2) tends to ordinary least squares, while large values of λ penalize the L1 norm of β, Σ_j |β_j|, heavily. In the Bayesian view of the LASSO, λ controls the prior on β, with large values of this parameter associated with more informative (sharper) priors.

By construction, the non-Bayesian LASSO solution admits at most n − 1 nonzero regression coefficients (e.g., Park and Casella 2008). This is not desirable in models with dense marker-based regressions since, a priori, there is no reason why the number of markers with effectively nonzero effects should be smaller than the number of observations. This problem does not arise in BL, which is discussed next.

A computationally convenient hierarchical formulation of a DE distribution is obtained by exploiting the fact that the DE density can be represented as a mixture of scaled Gaussian densities (e.g., Andrews and Mallows 1974; Rosa 1999), where the mixing process of the variances is an exponential distribution. Following Park and Casella (2008),

(λ/2) exp(−λ|β_j|) = ∫_0^∞ N(β_j | 0, τ_j²) Exp(τ_j² | λ²/2) dτ_j²

Above, β_j is the unknown effect of the jth marker and τ_j² is a variance parameter (measuring prior uncertainty) associated with β_j. Using this, Park and Casella (2008) suggested the following hierarchical model (BL),

p(y | μ, β, σ²) = Π_{i=1}^n N(y_i | μ + x_i′β, σ²)
(3)
p(μ, β, τ², σ², λ²) ∝ [Π_{j=1}^p N(β_j | 0, τ_j²σ²) Exp(τ_j² | λ²/2)] χ⁻²(σ² | df, S) G(λ² | α1, α2)
(4)

Above, N(y_i | μ + x_i′β, σ²) and N(β_j | 0, τ_j²σ²) are normal densities centered at μ + x_i′β and 0, with variances σ² and τ_j²σ², respectively; χ⁻²(σ² | df, S) is a scaled-inverted chi-square density with degrees of freedom df and scale S (in this parameterization, E(σ²) = S/(df − 2)); Exp(τ_j² | λ²/2) is an exponential density indexed by a single parameter, λ²/2; and G(λ² | α1, α2) is a Gamma density with shape parameter α1 and rate parameter α2.

The role of the τ_j²'s becomes clearer by changing variables in (3) and (4) from β_j to b_j = β_j/τ_j. After this change of variables, the product of the likelihood function and of the joint prior for the regression coefficients, p(y | μ, β, σ²) Π_j N(β_j | 0, τ_j²σ²), becomes p(y | μ, X̃, b, σ²) Π_j N(b_j | 0, σ²), where X̃ = X diag{τ_1, …, τ_p} and b = (b_1, …, b_p)′ is a vector of regression coefficients with homogeneous variance. Thus, one way of viewing this class of regression models is as a standard BR model with additional unknowns, the τ_j's, which assign different weights to the columns of X, with τ_j = 0 being equivalent to removing the jth covariate from the model.
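The scale-mixture representation can be verified by simulation. In this sketch (λ = 1 is an arbitrary illustrative choice), τ_j² is drawn from an exponential distribution with rate λ²/2 and then β_j from N(0, τ_j²); the resulting marginal should match the DE, whose variance is 2/λ²:

```python
import math
import random

random.seed(1)
lam = 1.0            # DE parameter; arbitrary illustrative value
n_draws = 200_000

betas = []
for _ in range(n_draws):
    tau2 = random.expovariate(lam ** 2 / 2.0)          # tau_j^2 ~ Exp(rate = lam^2/2)
    betas.append(random.gauss(0.0, math.sqrt(tau2)))   # beta_j | tau_j^2 ~ N(0, tau_j^2)

mc_var = sum(b * b for b in betas) / n_draws
print(mc_var)  # close to the DE variance 2 / lam**2 = 2
```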

Park and Casella (2008) presented a set of fully conditional distributions that allows fitting the BL model via the Gibbs sampler. Some of these distributions are discussed next, to illustrate main features of the algorithm.

Location parameters:

In the Gibbs sampler of Park and Casella (2008), the fully conditional distribution of the regression coefficients is multivariate normal with mean (covariance matrix) equal to the solution (inverse of the coefficient matrix) of the system of equations,

[X′X + diag(τ_1⁻², …, τ_p⁻²)] β̂ = X′(y − 1μ)
(5)

Recall that ordinary least-squares estimates are obtained by solving X′Xβ̂ = X′y and that the counterpart of (5) in BR is [X′X + I(σ²/σ_β²)]β̂ = X′(y − 1μ). A key aspect of BL is that it produces shrinkage that is marker specific, contrary to BR. Since σ² is a scaling factor common to all regression coefficients, the differential shrinkage is due to the τ_j²'s. If τ_j² is large, i.e., a large variance is associated with the effect of the jth marker, the quantity added to the jth diagonal element will be small. Conversely, if a small variance is associated with the effect of the jth coefficient, τ_j⁻² will be large. Adding a large constant to the jth diagonal element shrinks the least-squares estimate toward zero and reduces the variance of its fully conditional distribution.
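The marker-specific shrinkage in (5) can be illustrated with a single-coefficient version of the update; the cross-products below are hypothetical numbers, not taken from the article:

```python
def shrunken_coef(xtx, xty, tau2):
    # Mean of the conditional distribution of beta_j when all other effects
    # are fixed: solution of (x_j'x_j + 1/tau_j^2) * beta_j = x_j'y
    return xty / (xtx + 1.0 / tau2)

xtx, xty = 50.0, 25.0                       # hypothetical cross-products
print(xty / xtx)                            # ordinary least squares: 0.5
print(shrunken_coef(xtx, xty, tau2=100.0))  # large tau_j^2: close to OLS
print(shrunken_coef(xtx, xty, tau2=0.01))   # small tau_j^2: strongly shrunk to 0
```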

Variances of the regression coefficients:

An important aspect of the algorithm is how samples of the regression coefficients affect realizations of the variances of marker effects. In BL, the fully conditional posterior distributions of the τ_j⁻²'s can be shown to be inverse Gaussian (e.g., Chhikara and Folks 1989), with mean (λ²σ²/β_j²)^{1/2} and scale parameter λ². For a given λ, a small absolute value of β_j will lead to a fully conditional distribution of τ_j⁻² with a large mean, which in turn will generate relatively small values of τ_j².
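Sampling this fully conditional distribution is straightforward. The sketch below uses the Michael-Schucany-Haas transformation for inverse Gaussian draws and illustrates the feedback just described (the values λ² = σ² = 1 and the two trial values of β_j are arbitrary):

```python
import math
import random

def rinvgauss(mu, lam, rng=random):
    # Inverse Gaussian draw via the Michael-Schucany-Haas (1976) transformation
    y = rng.gauss(0.0, 1.0) ** 2
    x = mu + (mu * mu * y) / (2.0 * lam) \
        - (mu / (2.0 * lam)) * math.sqrt(4.0 * mu * lam * y + (mu * y) ** 2)
    return x if rng.random() <= mu / (mu + x) else mu * mu / x

random.seed(2)
lam2 = sigma2 = 1.0   # arbitrary illustrative values

def draw_tau2(beta_j):
    # 1/tau_j^2 | rest ~ InvGauss(mean = sqrt(lam2*sigma2/beta_j^2), scale = lam2)
    inv_tau2 = rinvgauss(math.sqrt(lam2 * sigma2 / beta_j ** 2), lam2)
    return 1.0 / inv_tau2

small = sum(draw_tau2(0.01) for _ in range(5000)) / 5000  # tiny |beta_j|
large = sum(draw_tau2(2.0) for _ in range(5000)) / 5000   # large |beta_j|
print(small, large)  # the tiny effect receives a much smaller variance
```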

The λ parameter of the exponential prior:

In the standard LASSO, λ controls the trade-off between goodness of fit and model complexity, and this may be crucial in defining the ability of a model to uncover signal. Small values of λ produce better fit, in the sense of the residual sum of squares (λ = 0 gives ordinary least squares); as λ increases, the penalty on model complexity increases (in optimization problem (1) the feasible set is smaller). On the other hand, in BL, λ controls the shape of the prior distribution assigned to the τ_j²'s. In general, the exponential prior assigns more density to small values of the τ_j²'s than to large ones, and this may be reasonable for most SNPs under the expectation that most of their effects are nil.

In BL, λ can be treated as any other unknown. If, as in (4), a Gamma prior is assigned to λ², the fully conditional posterior distribution of λ² is also Gamma, with shape and rate parameters equal to α1 + p and α2 + Σ_j τ_j²/2, respectively. The expectation of this Gamma distribution is (α1 + p)/(α2 + Σ_j τ_j²/2), so a large value of Σ_j τ_j² will lead to a relatively small λ², and the opposite will occur if the sum of the variances of the regression coefficients is small.
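This update is a one-liner once the τ_j²'s are in hand. The sketch below (hyperparameter values and τ_j² configurations are arbitrary) shows how the sum of the variances pulls λ² down or up:

```python
import random

def sample_lambda2(tau2, alpha1=0.1, alpha2=0.1, rng=random):
    # Fully conditional draw of lambda^2: Gamma with shape alpha1 + p and
    # rate alpha2 + sum(tau_j^2)/2; gammavariate takes (shape, scale)
    shape = alpha1 + len(tau2)
    rate = alpha2 + sum(tau2) / 2.0
    return rng.gammavariate(shape, 1.0 / rate)

random.seed(3)
p = 100
small_tau2 = [0.01] * p   # coefficients with tiny variances
large_tau2 = [2.0] * p    # coefficients with large variances

m_small = sum(sample_lambda2(small_tau2) for _ in range(2000)) / 2000
m_large = sum(sample_lambda2(large_tau2) for _ in range(2000)) / 2000
print(m_small, m_large)  # small variances pull lambda^2 up, large ones down
```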

Relationship between LASSO and other regression models used in genomic selection:

Standard BR may not be suitable for regressing phenotypes on a large number of markers because shrinkage of regression coefficients is homogeneous across markers (Fernando et al. 2007). In contrast, in BL the variance is marker specific, producing shrinkage whose extent is related to the absolute value of the estimated regression coefficient.

Meuwissen et al. (2001) recognized that marker-specific variances may be needed and suggested regression models based on marginal priors that are also mixtures of scaled-Gaussian distributions (“Bayes A”) or mixtures of scaled-Gaussian distributions and of a point mass at zero (“Bayes B”). In these models, the likelihood is as in (3) and, in Bayes A, the prior is

p(β, σ², σ_β1², …, σ_βp²) ∝ [Π_{j=1}^p N(β_j | 0, σ_βj²)] χ⁻²(σ² | df, S) Π_{j=1}^p χ⁻²(σ_βj² | df_β, S_β)
(6)

The first two components of (6), Π_j N(β_j | 0, σ_βj²) χ⁻²(σ² | df, S), are the counterparts of the first two components of (4), Π_j N(β_j | 0, τ_j²σ²) χ⁻²(σ² | df, S), with σ_βj² = τ_j²σ². The difference between BL and Bayes A (or Bayes B) is how the priors of the variances of the marker-specific regression coefficients (σ_βj² in Bayes A and τ_j² in BL) are specified. At this level, Bayes A and BL differ in two respects:

  1. In Bayes A, the prior assumption is that the marker-specific variances are independent random variables following the same scaled-inverted chi-square distribution with known prior degrees of belief df_β and scale S_β. In BL, the assumption is that these variances are independent as well, but each follows the same exponential distribution with unknown parameter λ. The conditional (given λ) marginal prior of β_j in BL, p(β_j | λ), is DE, while in Bayes A p(β_j | df_β, S_β) is a t-distribution. Although a t-distribution may place more density at zero than the Gaussian prior of BR, the density at zero is larger in the DE. This issue was recognized by Meuwissen et al. (2001), leading to the development of Bayes B.
     Xu (2003) employed an improper prior for the marker-specific variances: letting df_β → 0 and S_β → 0 yields p(σ_βj²) ∝ σ_βj⁻². Similar to the exponential prior, this density decreases monotonically with σ_βj². However, unlike the exponential distribution, where λ can be used to "tune" the shape of the distribution, this prior has no parameters allowing any control. In addition, as noted by Ter Braak et al. (2005), p(σ_βj²) ∝ σ_βj⁻² yields an improper posterior. As an alternative, these authors suggested using p(σ_βj²) ∝ (σ_βj²)^{−γ} with γ < 1; although this prior is improper, it does not yield an improper posterior. As with the exponential prior, this density is decreasing in σ_βj². Ter Braak (2006) further discussed the role of γ, which, as λ in the BL, controls the shape of the prior density on the variance of the regression coefficients.
  2. A second difference is that, in Bayes A, the values of the parameters df_β and S_β are specified as known a priori. On the other hand, in BL there is an extra level in the model: λ² is assigned a Gamma distribution, and information from all regression coefficients is pooled. This difference is illustrated in Figure 2: in Bayes A, df_β and S_β control, as λ does in BL, the trade-off between goodness of fit and model complexity.
    Figure 2.
    Graphical representation of the hierarchical structure of the Bayesian LASSO (top) and Bayes A (bottom). In the Bayesian LASSO, the variances of the marker effects are τ_j²σ², j = 1, …, p, with counterparts σ_βj² in Bayes A.

Yi and Xu (2008) discuss an extension of Bayes A where a prior is assigned to df_β and S_β, and these quantities are treated as nuisances, as λ is in BL. However, as argued earlier, the DE seems to be a better choice if the assumption is that most markers have no effect on the trait of interest.

MONTE CARLO STUDY

Although in the BL λ can be treated as unknown, it is not clear how sensitive results might be with respect to the choice of the hyperparameters α1 and α2. Park and Casella (2008, p. 683) recognized that this may be an issue: "The prior density for λ should approach 0 sufficiently fast as λ → ∞ (to avoid mixing problems) but should be relatively flat and place high probability near the maximum likelihood estimate." The main problem with applying this recommendation is that one does not know in advance what the maximum-likelihood estimate is.

The sensitivity of BL with respect to the choice of the prior distribution of λ was investigated here by fitting the model under different priors to simulated data. In addition to the conjugate Gamma prior, we also considered (see File S1 and File S2)

p(λ | λ_max, a1, a2) ∝ (λ/λ_max)^(a1 − 1) (1 − λ/λ_max)^(a2 − 1),  λ ∈ (0, λ_max)
(7)

The above distribution gives great flexibility for specifying a relatively flat prior over a wide range of values; the uniform prior appears as a special case when a1 = a2 = 1. When this Beta-type prior is used, the fully conditional distribution of λ does not have a closed form; however, draws from this distribution can be obtained using the Metropolis–Hastings algorithm (see File S1 and File S2).
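A Metropolis–Hastings step of this kind can be sketched as follows; the random-walk proposal, the hyperparameter values, and the τ_j² configuration are all illustrative choices, not those of the article:

```python
import math
import random

def log_target(lam, tau2, lam_max, a1, a2):
    # Log of the unnormalized fully conditional density of lambda:
    # prod_j Exp(tau_j^2 | lambda^2/2) times a Beta(a1, a2) prior on lambda/lam_max
    if lam <= 0.0 or lam >= lam_max:
        return -math.inf
    p = len(tau2)
    log_lik = p * math.log(lam * lam / 2.0) - (lam * lam / 2.0) * sum(tau2)
    u = lam / lam_max
    return log_lik + (a1 - 1.0) * math.log(u) + (a2 - 1.0) * math.log(1.0 - u)

def mh_step(lam, tau2, lam_max=200.0, a1=1.1, a2=1.1, step=0.3, rng=random):
    prop = lam + rng.gauss(0.0, step)  # symmetric random-walk proposal
    lr = log_target(prop, tau2, lam_max, a1, a2) \
        - log_target(lam, tau2, lam_max, a1, a2)
    return prop if lr >= 0.0 or rng.random() < math.exp(lr) else lam

random.seed(4)
tau2 = [random.expovariate(2.0) for _ in range(200)]  # tau_j^2 ~ Exp(rate 2), i.e. lambda = 2
lam, chain = 1.0, []
for it in range(4000):
    lam = mh_step(lam, tau2)
    if it >= 2000:
        chain.append(lam)
print(sum(chain) / len(chain))  # concentrates near lambda = 2
```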

Data-generating process:

Data were simulated in a simple setting, such that problems could be identified easily, while the phenotypic and genotypic structure attempted to resemble those encountered in real data sets.

Data were generated under the additive model,

y_i = Σ_{j=1}^{280} x_ij β_j + ε_i

where y_i is the phenotype of individual i, β_j is the effect of allele substitution at marker j (j = 1, …, 280), and x_ij is the genotype code of subject i at marker j (i = 1, …, n). Residuals were independently sampled from a standard normal distribution; that is, ε_i ~ N(0, 1).
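This data-generating process can be mimicked in a few lines. The marker positions, effect sizes, sample size, and 0/1/2 genotype coding below are hypothetical stand-ins (the positions and effects actually used are those of Figure 3), and genotypes are sampled independently, as in the low-LD scenario X0:

```python
import random

random.seed(5)
n, p = 300, 280                 # records (arbitrary) and markers (280, as in Figure 3)
beta = [0.0] * p                # 270 null effects ...
nonnull = {5: 1.0, 35: -1.0, 65: 0.5, 95: -0.5, 125: 0.8,
           155: -0.8, 185: 0.3, 215: -0.3, 245: 0.6, 275: -0.6}
for j, effect in nonnull.items():
    beta[j] = effect            # ... and 10 markers with nonnull effects (hypothetical)

# Low-LD scenario: genotype codes sampled independently across markers
X = [[random.choice((0, 1, 2)) for _ in range(p)] for _ in range(n)]
y = [sum(x * b for x, b in zip(row, beta)) + random.gauss(0.0, 1.0) for row in X]
print(len(y), sum(1 for b in beta if b != 0.0))
```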

Two scenarios regarding the genotypic distribution were considered. In scenario X0, markers were in low linkage disequilibrium (LD), with almost no correlation between adjacent markers (Table 1). In scenario X1 a relatively high LD was considered (Table 1).

TABLE 1
Absolute values of the correlation between marker genotypes (average across markers and 100 Monte Carlo simulations) by scenario (X0, low linkage disequilibrium; X1, high linkage disequilibrium)

The effects of allele substitution, β_j, were kept constant across simulations and set to zero for all markers except 10 (Figure 3). The locations of markers with nonnull effects were chosen such that different situations regarding effects of linked markers were represented.

Figure 3.
Positions (chromosome and marker number) and effects of markers (there were 280 markers, 270 of them with no effect).

Choice of prior distribution of λ:

For each Monte Carlo (MC) replicate, five variations of BL were fitted. Four (BL1–BL4) involved a Gamma prior, G(λ² | α1, α2), each with a different choice of values of (α1, α2). In BL5, the prior on λ was the Beta-based prior (7) (Figure 4).

Figure 4.
Unnormalized density of the five priors evaluated in the MC study (BL1–BL4 use Gamma priors on λ², and BL5 uses a prior for λ based on a Beta distribution; the densities in this figure are the corresponding densities of λ).

Results:

Table 2 shows the average (across 100 MC replicates) of posterior means of the residual variance, the regularization parameter, and the correlation between the true and the estimated value of several features (phenotypes, genomic values, and marker effects). Cor(y, ŷ) is a goodness-of-fit measure, Cor(g, ĝ) measures how well the model estimates genomic values, and Cor(β, β̂) evaluates how well the model estimates marker effects.

TABLE 2
Posterior mean of the residual variance, σ², and of the regularization parameter, λ, and correlation between the true and estimated value for several items (y, phenotypes; g, true genomic values; β, marker effects); all quantities averaged over 100 MC replicates

The posterior mean and standard deviation of λ were influenced by the prior (Table 2). The posterior mean was shrunk toward the prior mode, and the posterior standard deviation was larger for more dispersed priors (see Table 2 and Figure 4). These results suggest that there is not much information about λ in the type of samples evaluated. On the other hand, model goodness of fit and the ability of the model to uncover signal were not affected markedly by the choice of prior. This suggests that, while it may be difficult to learn about λ from the data, inferences on quantities of interest (e.g., genetic values) may be robust with respect to values of λ over a fairly wide range. For example, differences in Cor(g, ĝ) or in Cor(β, β̂) were small when the prior was changed.

A relatively flat prior based on a Beta distribution (BL5) produced a more dispersed posterior distribution of λ, and mixing was not as good as when the sharper Gamma priors (BL1–BL4) were used. For example, the average (across MC replicates) effective sample sizes (e.g., Plummer et al. 2008) for the residual variance were 1468, 1155, 1091, 1138, and 578 for BL1–BL5, respectively.
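Effective sample size is computed from a chain's autocorrelations. The sketch below uses a simple truncation rule (stop at the first non-positive autocorrelation); it is a simplified stand-in for what the coda package computes, not its exact algorithm:

```python
import random

def effective_sample_size(x, max_lag=200):
    # ESS = n / (1 + 2 * sum of lag-k autocorrelations), summing until
    # the first non-positive autocorrelation
    n = len(x)
    m = sum(x) / n
    c0 = sum((v - m) ** 2 for v in x) / n
    acc = 0.0
    for k in range(1, min(max_lag, n - 1)):
        ck = sum((x[i] - m) * (x[i + k] - m) for i in range(n - k)) / n
        if ck / c0 <= 0.0:
            break
        acc += ck / c0
    return n / (1.0 + 2.0 * acc)

random.seed(6)
iid = [random.gauss(0.0, 1.0) for _ in range(2000)]   # well-mixing chain
ar = [0.0]
for _ in range(1999):
    ar.append(0.9 * ar[-1] + random.gauss(0.0, 1.0))  # autocorrelated chain
print(effective_sample_size(iid), effective_sample_size(ar))
```

The autocorrelated chain yields a far smaller effective sample size than the well-mixing one, which is the pattern reported above for BL5 relative to BL1–BL4.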

BAYESIAN REGRESSION COUPLED WITH LASSO

In practice, the information set available for prediction of genomic values may include components other than genetic markers. For example, data may cluster into known contemporary groups (e.g., individuals may be measured under different experimental conditions), or a pedigree may be available in addition to genetic markers. It is natural to treat the various classes of predictors in a different way. From a penalized-likelihood point of view, this amounts to using penalty functions that are specific to each class of predictors. From a Bayesian standpoint, treating predictors differently may be achieved by assigning different priors. A straightforward extension of the BL is described next.

The data structure is denoted as {y_i, x_Fi, x_Li, i}, where y_i is the phenotype of subject i; x_Fi is a vector of covariates whose effects are treated as in a standard BR, with a normal prior and a variance common to all regressions; x_Li is a vector of covariates whose effects are assigned a double-exponential prior, as in BL; and i is a label that allows tracking subjects in a pedigree. The equation for the data is

y_i = μ + x_Fi′β_F + x_Li′β_L + u_i + ε_i

where μ is an intercept; β_F and β_L are regressions of y_i on x_Fi and x_Li, respectively; u_i is an infinitesimal genetic effect pertaining to individual i, for which the prior (co)variance structure is determined by a pedigree; and ε_i is a model residual, assumed to be distributed identically to and independently of the other residuals. The likelihood function is

p(y | μ, β_F, β_L, u, σ_ε²) = Π_{i=1}^n N(y_i | μ + x_Fi′β_F + x_Li′β_L + u_i, σ_ε²)
(8)

Prior specification (4) is modified as

p(μ, β_F, β_L, u, τ², λ, σ_F², σ_u², σ_ε²) ∝ [Π_j N(β_Fj | 0, σ_F²)] χ⁻²(σ_F² | df_F, S_F) [Π_k N(β_Lk | 0, τ_k²σ_ε²) Exp(τ_k² | λ²/2)] p(λ) N(u | 0, Aσ_u²) χ⁻²(σ_u² | df_u, S_u) χ⁻²(σ_ε² | df_ε, S_ε)
(9)

where σ_F², σ_u², and σ_ε² are the variances of β_F, u, and ε, respectively; the df's and S's are the prior degrees of freedom and scale parameters of the corresponding distributions; A is a (co)variance structure computed from the genealogy (for example, a numerator-relationship matrix); and p(λ) is the prior on λ, which may be as in (4) or in (7).

In the model defined by (8) and (9), all fully conditional distributions (except that of λ, if a nonconjugate prior is chosen for this parameter) have closed form, so a Gibbs sampler (with a Metropolis–Hastings step) can be used to draw samples from the joint posterior distribution (see File S1 and File S2). To distinguish the above model from the standard BL, we refer to it as Bayesian regression coupled with LASSO (BRL).
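As a minimal illustration of the BRL data equation, the sketch below assembles the linear predictor for one subject; all numbers (cage indicators, marker codes, effect values) are hypothetical:

```python
def brl_predictor(mu, xF, betaF, xL, betaL, u_i):
    # y_hat_i = mu + xF'betaF + xL'betaL + u_i: the conditional expectation
    # of y_i in the BRL model, given all location parameters
    dot = lambda a, b: sum(p * q for p, q in zip(a, b))
    return mu + dot(xF, betaF) + dot(xL, betaL) + u_i

# Hypothetical subject: two cage indicators, three marker genotype codes
print(brl_predictor(mu=1.0,
                    xF=[1.0, 0.0], betaF=[0.2, -0.1],
                    xL=[0, 1, 2], betaL=[0.05, 0.0, -0.3],
                    u_i=0.15))  # 1.0 + 0.2 - 0.6 + 0.15 = 0.75
```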

DATA ANALYSIS

Two data sets were analyzed with the BRL model. The first set pertains to a collection of wheat lines (see File S1 and File S2); the second set contains information from a population of mice (publicly available at http://gscan.well.ox.ac.uk).

The wheat data set is from the Global Wheat program of the International Maize and Wheat Improvement Center (CIMMYT). This program conducted several international trials across a wide variety of environments. For this study, we took a subset of 599 wheat lines derived from 25 years of Elite Spring Wheat Yield Trials (ESWYT) conducted from 1979 through 2005. The environments represented in these trials were grouped into four macroenvironments. The phenotype considered here was average grain yield performance of the 599 wheat lines evaluated in one of the macroenvironments. An association mapping study based on a reduced number of these ESWYT trials is presented in Crossa et al. (2007).

The Browse application of the International Crop Information System (ICIS), as described in http://cropwiki.irri.org/icis/index.php/TDM_GMS_Browse (McLaren et al. 2005), was used for deriving the relationship matrix A between the 599 lines, and it accounts for selection and inbreeding.

A total of 1447 Diversity Array Technology (DArT) markers were generated by Triticarte (Canberra, Australia; http://www.triticarte.com.au). DArT markers are binary: each takes one of two values, denoting presence or absence of the marker.

The mouse data come from an experiment carried out to detect and locate QTL for complex traits in a mouse population (Valdar et al. 2006a,b). These data have already been analyzed for comparing genome-assisted genetic evaluation methods (Legarra et al. 2008). The data file consists of 1884 individuals (168 full-sib families), each genotyped for 10,946 polymorphic markers. The trait analyzed here was body mass index (BMI), precorrected for body weight, season, month, and day. Mice were housed in 359 cages; on average, each litter was allocated into 2.84 cages.

Three models were fitted to each of the data sets: P (standing for pedigree) is a pedigree-based model where markers were not included; M is a model where the only genetic component is the regression on markers; P&M (standing for pedigree and markers) includes regressions on markers and an additive effect with (co)variance structure computed from the pedigree. For both data sets, phenotypes were standardized to have a sample variance equal to one, so that results are easily compared across data sets.

In the mouse data set, the elements of β_F were the effects of the cages where groups of mice were reared. In the wheat data set, the component x_Fi′β_F was omitted because there was no such set of regressors.

Models were first fitted to the entire data set. Subsequently, a fivefold cross-validation (CV) was carried out, with random assignment of individuals to folds. The CV yields predictions of phenotypes, ŷ_f (f = 1, …, 5), each obtained from a model in which all observations in the fth fold were excluded. The ability of each model to predict out-of-sample data was evaluated via the correlation between phenotypes and predictions from CV. Inferences for each fit were based on 70,000 samples (after 5000 were discarded as burn-in). Convergence was checked by inspection of trace plots and with estimates of effective sample size for (co)variance components computed using the coda package of R (Plummer et al. 2008). Hyperparameters of the prior distributions of the variance components were set to weakly informative values, and the prior assigned to λ was flat over a wide range of values of this parameter.
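The cross-validation bookkeeping can be sketched as follows; a train-set mean stands in for the fitted model (P, M, or P&M), since only the fold logic and the evaluation metric are of interest here:

```python
import random

def assign_folds(n, k=5, rng=random):
    # Balanced random assignment of n individuals to k folds
    folds = [i % k for i in range(n)]
    rng.shuffle(folds)
    return folds

def pearson(u, v):
    # Correlation between observed phenotypes and CV predictions
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = (sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v)) ** 0.5
    return num / den

random.seed(7)
y = [random.gauss(0.0, 1.0) for _ in range(100)]  # placeholder phenotypes
folds = assign_folds(len(y))
y_hat = [0.0] * len(y)
for f in range(5):
    train = [i for i in range(len(y)) if folds[i] != f]
    pred = sum(y[i] for i in train) / len(train)  # placeholder: train-set mean
    for i in range(len(y)):
        if folds[i] == f:
            y_hat[i] = pred  # predict held-out fold without using its records
print(pearson(y, y_hat))
```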

RESULTS AND DISCUSSION

Table 3 shows summaries of the posterior distributions of the variance components and of λ by model and data set. In both populations, a moderate reduction in the posterior mean of the residual variance was observed when the P&M model was fitted, relative to P. Using model P in the wheat population gave a posterior mean of the heritability of grain yield of 0.34, while in the mouse population the posterior mean of h² of body-mass index was 0.11. These results are in agreement with previous reports (Valdar et al. 2006b and Legarra et al. 2008 for the mouse data and Crossa et al. 2007 for the wheat data) for these traits and populations. The inclusion of markers (P&M) reduced the estimate of the variance of the infinitesimal additive effect, relative to P. This happens because, in P&M, part of the infinitesimal additive effect is captured by the regression on markers (e.g., Habier et al. 2007; Bink et al. 2008). In model M for body-mass index, the variance between cages, σ_F², was reduced only slightly when the effects of the markers were fitted.

TABLE 3
Posterior means (standard deviations) of variance components for yield in wheat and body-mass index in mice, and of λ, for each of the models, by data set

Figure 5 gives absolute values of the posterior means of marker effects. In the mouse data there are several regions showing groups of markers with relatively large estimated effects. This is not evident in the wheat data set where fewer markers were available.

Figure 5.
Absolute values of the posterior means of effects of allele substitution in a model including markers and pedigree information (P&M), by data set.

From a breeder's perspective, a relevant question is whether or not the P, M, and P&M models lead to different rankings of individuals on the basis of estimated genetic values. Table 4 shows the rank (Spearman) correlations between estimated genetic values. As expected, these correlations were high, but not perfect. The correlation between predicted genetic values from M and P&M was larger than that between the estimates from P and P&M, suggesting that markers made the larger contribution to the predictions of P&M. This was clearer in the mouse data set, where (a) the extent of additive relationships was not as strong as in the wheat population and (b) a much larger number of markers was available.

TABLE 4
Rank correlation (Spearman) between genetic values estimated from models including different sources of genetic information (pedigree, markers, and pedigree and markers), by data set (mouse data set above diagonal, wheat data set below diagonal)

Figure 6 shows scatter plots of predicted genomic values in P and P&M for both data sets. Although the correlation between genetic values estimated from different models was high, using P and P&M would lead to different sets of selected individuals. The difference was more marked in the mouse data set, illustrating that the impact of considering markers in breeding decisions depends on the data structure and on how informative the pedigree and markers are. Also, the dispersion of predicted genetic values was larger when markers were fitted, and this is consistent with the smaller posterior mean of the residual variance observed for P&M (Table 3). An interpretation of this result is that, in certain contexts, markers may help to uncover genetic variance that would not be captured if only pedigree-based predictions were used.

Figure 6.
Predicted genetic value using markers and pedigree (P&M) vs. using pedigree only (P), by data set.

The aforementioned results indicate that incorporation of markers into a genetic model can influence inferences and breeding decisions. Cross-validation, in turn, allows models to be compared from the standpoint of their ability to predict future outcomes. Table 5 shows the correlation between phenotypic records and predictions from cross-validation. Two CV correlations were considered:

  1. r_y is the correlation between phenotypic records and their prediction from CV, that is, Cor(y_i, ŷ_i), where the CV prediction ŷ_i comprises the estimated nongenetic effects plus the estimated infinitesimal additive effect in P, the estimated regression on markers in M, and both genetic components in P&M.
  2. r_g is the correlation between the CV estimate of the genetic value, ĝ_i, and phenotypic records adjusted with CV estimates of nongenetic effects; the adjustment removed the estimated mean in the wheat population and the estimated mean plus cage effects in the mouse data set.
TABLE 5
Rank correlation (Spearman) between phenotypic values or corrected phenotypic records and predictions from cross-validation, by population and model (P, pedigree-based model; P&M, pedigree and marker information)
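As a numerical illustration, assuming vectors of CV predictions, CV genetic values, and CV estimates of nongenetic effects are available as arrays (the function and variable names below are illustrative, not taken from the paper's R program), the two correlations can be computed along these lines:

```python
import numpy as np

def cv_correlations(y, y_hat, g_hat, nongenetic_hat):
    """First criterion: correlation between phenotypes and their CV predictions.
    Second criterion: correlation between CV estimates of genetic values and
    phenotypes adjusted for CV estimates of nongenetic effects."""
    r_pred = np.corrcoef(y, y_hat)[0, 1]
    r_gen = np.corrcoef(g_hat, y - nongenetic_hat)[0, 1]
    return r_pred, r_gen
```

For the wheat analysis the nongenetic adjustment would be the estimated mean; for the mouse analysis it would also include the estimated cage effects.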

Overall, P&M models had better predictive ability than models based on pedigrees or markers only. In the wheat data set, the increases in correlation observed when markers were included in the model were 13% for the correlation between phenotypes and CV predictions and 42% for the correlation between estimated genetic values and adjusted phenotypic records. In the mouse data set the corresponding relative increases were 14 and 100%, respectively. We conclude that there are sizable benefits from using markers for breeding decisions and that the relative impact of their contribution depends upon the data structure and on how informative the pedigree and the set of markers are.

CONCLUDING REMARKS

Additive models with infinitesimal effects are ubiquitous in animal and plant breeding. For many decades, predictions of genetic values have been made using phenotypic records and pedigrees, i.e., some sort of family-based evaluation. Markers capture Mendelian segregation and may enhance prediction of genomic values, independently of the mode of gene action.

With highly dense markers, marker-specific shrinkage may be needed. Priors on marker effects based on scale mixtures of Gaussian distributions allow this type of shrinkage and constitute a promising tool for genomic-based additive models. This family of models includes, among others, those based on the t or the double-exponential (DE) distributions. Models based on marginal priors that belong to the t family have been proposed for marker-based regressions (e.g., Meuwissen et al. 2001).

If the hypothesis that most markers do not have any effect holds, a DE prior may be a better choice than the t. For this reason, the Bayesian LASSO appears to be an interesting alternative for performing regressions on markers, at least under an additive model.
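The scale-mixture representation behind the BL (Andrews and Mallows 1974; Park and Casella 2008) can be checked numerically: drawing τ² from an exponential distribution with rate λ²/2 and then β | τ² from N(0, τ²) yields a marginal DE(λ) distribution, for which E|β| = 1/λ and Var(β) = 2/λ². A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(42)
lam = 2.0
n = 200_000

# Mixing step: tau^2 ~ Exponential(rate = lam^2 / 2); numpy takes scale = 1/rate.
tau2 = rng.exponential(scale=2.0 / lam**2, size=n)

# Conditional step: beta | tau^2 ~ N(0, tau^2).
beta = rng.normal(0.0, np.sqrt(tau2))

# Marginally, beta ~ DE(lam), so these should be close to 1/lam and 2/lam^2.
print(np.mean(np.abs(beta)))  # ≈ 0.5
print(np.var(beta))           # ≈ 0.5
```

The sharper peak at zero and heavier tails of the DE, relative to a Gaussian with the same variance, are what produce the stronger shrinkage of small effects.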

Our results indicate that in the type of samples that are relevant for genomic selection (i.e., p ≫ n) the choice of prior for λ matters in terms of inferences about this unknown. However, estimates of genetic values and of marker effects may be robust with respect to the choice of prior over a wide range. To circumvent the potential influence of the prior, we proposed an alternative formulation of the BL in which the prior on λ is built from a Beta distribution. Unlike the Gamma prior, this formulation allows vague prior preferences to be expressed over a wide range of values of λ.
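One way to read the contrast above (the bound and shape values below are hypothetical, not the exact parameterization used in the analyses): a prior of the form λ/λ_max ~ Beta(a, b) with a = b = 1 is flat on (0, λ_max), whereas a Gamma prior always concentrates its mass around shape/rate:

```python
import numpy as np

rng = np.random.default_rng(1)
lam_max = 200.0   # hypothetical upper bound for lambda
a, b = 1.0, 1.0   # a = b = 1: flat over (0, lam_max)

# Scaled-Beta prior draws for lambda: flat over (0, lam_max).
lam_beta = lam_max * rng.beta(a, b, size=100_000)

# A Gamma(shape, rate) prior, by contrast, concentrates around shape/rate.
shape, rate = 2.0, 0.1
lam_gamma = rng.gamma(shape, 1.0 / rate, size=100_000)

print(np.mean(lam_beta))   # ≈ lam_max / 2
print(np.mean(lam_gamma))  # ≈ shape / rate
```

Choosing a and b away from 1 lets the same formulation express mild preferences within (0, λ_max) without ever forcing the sharp concentration that a Gamma prior imposes.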

Two data analyses carried out with the proposed model showed that (a) markers may capture fractions of additive variance that would be lost if pedigrees were the only source of genetic information used, (b) considering markers has a sizable impact on selection decisions, and (c) models including marker and pedigree information had better predictive ability than pedigree-based or marker-based models.

Acknowledgments

We greatly appreciate the suggestions of two anonymous reviewers and of the Associate Editor. The Wellcome Trust Centre for Human Genetics, Oxford, is gratefully acknowledged for making the mouse data available at http://gscan.well.ox.ac.uk. Vivi Arief from the School of Land, Crop and Food Sciences of the University of Queensland, Australia, is thanked for assembling the historical wheat phenotypic and molecular marker data and for computing additive relationships between wheat lines. Financial support by the Wisconsin Agriculture Experiment Station, National Science Foundation grant DMS-044371, and the Chaire d'Excellence Pierre de Fermat program of the Midi-Pyrénées Region, France, is acknowledged.

Notes

Supporting information is available online at http://www.genetics.org/cgi/content/full/genetics.109.101501/DC1.

References

  • Andrews, D. F., and C. L. Mallows, 1974. Scale mixtures of normal distributions. J. R. Stat. Soc. Ser. B 36 99–102.
  • Bink, M. C. A. M., P. Uimari, M. J. Sillanpää, L. L. G. Janss and R. C. Jansen, 2002. Multiple QTL mapping in related plant populations via a pedigree-analysis approach. Theor. Appl. Genet. 104 751–762. [PubMed]
  • Bink, M. C. A. M., M. P. Boer, C. J. F. Ter Braak, J. Jansen, R. E. Voorrips et al., 2008. Bayesian analysis of complex traits in pedigreed populations. Euphytica 161 85–96.
  • Chhikara, R. S., and J. L. Folks, 1989. The Inverse Gaussian Distribution: Theory, Methodology and Applications. Marcel Dekker, NY.
  • Crossa, J., J. Burgueño, S. Dreisigacker, M. Vargas, S. A. Herrera-Foessel et al., 2007. Association analysis of historical bread wheat germplasm using additive genetic covariance of relatives and population structure. Genetics 177 1889–1913. [PMC free article] [PubMed]
  • De los Campos, G., D. Gianola and G. J. M. Rosa, 2009. Reproducing kernel Hilbert spaces regression: a general framework for genetic evaluation. J. Anim. Sci. (in press). [PubMed]
  • Fernando, R. L., and M. Grossman, 1989. Marker assisted selection using best linear unbiased prediction. Genet. Sel. Evol. 21 467–477.
  • Fernando, R. L., D. Habier, C. Stricker, J. C. M. Dekkers and L. R. Totir, 2007. Genomic selection. Acta Agric. Scand. Sect. A 57 192–195.
  • Gianola, D., and J. B. van Kaam, 2008. Reproducing kernel Hilbert spaces regression methods for genomic assisted prediction of quantitative traits. Genetics 178 2289–2303. [PMC free article] [PubMed]
  • Gianola, D., M. Perez-Enciso and M. A. Toro, 2003. On marker-assisted prediction of genetic value: beyond the ridge. Genetics 163 347–365. [PMC free article] [PubMed]
  • Gianola, D., R. Fernando and A. Stella, 2006. Genomic-assisted prediction of genetic value with semiparametric procedures. Genetics 173 1761–1776. [PMC free article] [PubMed]
  • Habier, D., R. L. Fernando and J. C. M. Dekkers, 2007. The impact of genetic relationship information on genome-assisted breeding values. Genetics 177 2389–2397. [PMC free article] [PubMed]
  • Hans, C., 2008. Bayesian LASSO regression. Technical Report No. 810. Department of Statistics, Ohio State University, Columbus, OH. (http://www.stat.osu.edu/~hans/Papers/blasso.pdf).
  • Legarra, A., C. Robert-Granié, E. Manfredi and J. M. Elsen, 2008. Performance of genomic selection in mice. Genetics 180 611–618. [PMC free article] [PubMed]
  • McLaren, C. G., R. Bruskiewich, A. M. Portugal and A. B. Cosico, 2005. The international rice information system. A platform for meta-analysis of rice crop data. Plant Physiol. 139 637–642. [PMC free article] [PubMed]
  • Meuwissen, T. H. E., B. J. Hayes and M. E. Goddard, 2001. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157 1819–1829. [PMC free article] [PubMed]
  • Park, T., and G. Casella, 2008. The Bayesian LASSO. J. Am. Stat. Assoc. 103 681–686.
  • Plummer, M., N. Best, K. Cowles and K. Vines, 2008. coda: output analysis and diagnostics for MCMC. http://cran.r-project.org/web/packages/coda/index.html
  • R Development Core Team, 2008. R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. http://www.R-project.org.
  • Rosa, G. J. M., 1999. Robust mixed linear models in quantitative genetics: Bayesian analysis via Gibbs sampling. International Symposium on Animal Breeding and Genetics, September 21–24, Viçosa, Minas Gerais, Brazil, pp. 133–159.
  • Ter Braak, C. J. F., 2006. Bayesian sigmoid shrinkage with improper variance priors and an application to wavelet denoising. Comput. Stat. Data Anal. 51 1232–1242.
  • Ter Braak, C. J. F., M. P. Boer and M. C. A. M. Bink, 2005. Extending Xu's Bayesian model for estimating polygenic effects using markers of the entire genome. Genetics 170 1435–1438. [PMC free article] [PubMed]
  • Tibshirani, R., 1996. Regression shrinkage and selection via the LASSO. J. R. Stat. Soc. B 58 267–288.
  • Valdar, W., L. C. Solberg, D. Gauguier, S. Burnett, P. Klenerman et al., 2006a. Genome-wide genetic association of complex traits in heterogeneous stock mice. Nat. Genet. 38 879–887. [PubMed]
  • Valdar, W., L. C. Solberg, D. Gauguier, W. O. Cookson, J. N. P. Rawlins et al., 2006b. Genetic and environmental effects on complex traits in mice. Genetics 174 959–984. [PMC free article] [PubMed]
  • Wang, W. Y., B. J. Barratt, D. G. Clayton and J. A. Todd, 2005. Genome-wide association studies: theoretical and practical concerns. Nat. Rev. Genet. 6 109–118. [PubMed]
  • Xu, S., 2003. Estimating polygenic effects using markers of the entire genome. Genetics 163 789–801. [PMC free article] [PubMed]
  • Yi, N., and S. Xu, 2008. Bayesian LASSO for quantitative trait loci mapping. Genetics 179 1045–1055. [PMC free article] [PubMed]

Articles from Genetics are provided here courtesy of Genetics Society of America