Genetics. Oct 2010; 186(2): 713–724.
PMCID: PMC2954475

Prediction of Genetic Values of Quantitative Traits in Plant Breeding Using Pedigree and Molecular Markers

Abstract

The availability of dense molecular markers has made possible the use of genomic selection (GS) for plant breeding. However, the evaluation of models for GS in real plant populations is very limited. This article evaluates the performance of parametric and semiparametric models for GS using wheat (Triticum aestivum L.) and maize (Zea mays) data in which different traits were measured in several environmental conditions. The findings, based on extensive cross-validations, indicate that models including marker information had higher predictive ability than pedigree-based models. In the wheat data set, and relative to a pedigree model, gains in predictive ability due to inclusion of markers ranged from 7.7 to 35.7%. Correlation between observed and predictive values in the maize data set achieved values up to 0.79. Estimates of marker effects were different across environmental conditions, indicating that genotype × environment interaction is an important component of genetic variability. These results indicate that GS in plant breeding can be an effective strategy for selecting among lines whose phenotypes have yet to be observed.

PEDIGREE-BASED prediction of genetic values based on the additive infinitesimal model (Fisher 1918) has played a central role in genetic improvement of complex traits in plants and animals. Animal breeders have used this model for predicting breeding values either in a mixed model (best linear unbiased prediction, BLUP) (Henderson 1984) or in a Bayesian framework (Gianola and Fernando 1986). More recently, plant breeders have incorporated pedigree information into linear mixed models for predicting breeding values (Crossa et al. 2006, 2007; Oakey et al. 2006; Burgueño et al. 2007; Piepho et al. 2007).

The availability of thousands of genome-wide molecular markers has made possible the use of genomic selection (GS) for prediction of genetic values (Meuwissen et al. 2001) in plants (e.g., Bernardo and Yu 2007; Piepho 2009; Jannink et al. 2010) and animals (Gonzalez-Recio et al. 2008; VanRaden et al. 2008; Hayes et al. 2009; de los Campos et al. 2009a). Implementing GS poses several statistical and computational challenges, such as how models can cope with the curse of dimensionality, colinearity between markers, or the complexity of quantitative traits. Parametric (e.g., Meuwissen et al. 2001) and semiparametric (e.g., Gianola et al. 2006; Gianola and van Kaam 2008) methods address these problems differently.

In standard genetic models, phenotypic outcomes, $y_i$, are viewed as the sum of a genetic value, $g_i$, and a model residual, $\varepsilon_i$; that is, $y_i = g_i + \varepsilon_i$. In parametric models for GS, $g_i$ is described as a regression on marker covariates $x_{ij}$ ($j = 1, \ldots, p$ molecular markers) of the form $g_i = \sum_{j=1}^{p} x_{ij}\beta_j$, such that

$$y_i = \sum_{j=1}^{p} x_{ij}\beta_j + \varepsilon_i$$

(or $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, in matrix notation), where $\beta_j$ is the regression of $y_i$ on the jth marker covariate $x_{ij}$.

Estimation of the marker effects via multiple regression by ordinary least squares (OLS) is not feasible when p > n, because the number of unknowns exceeds the number of records. A commonly used alternative is to estimate marker effects jointly using penalized methods such as ridge regression (Hoerl and Kennard 1970) or the Least Absolute Shrinkage and Selection Operator (LASSO) (Tibshirani 1996) or their Bayesian counterparts. This approach yields greater accuracy of estimated genetic values and can be coupled with geostatistical techniques commonly used in plant breeding to model multienvironment trials (Piepho 2009).
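The point about p > n can be illustrated with a minimal sketch. It assumes simulated 0/1 marker data and an arbitrary penalty value; the function name `ridge_marker_effects` and the value `lam=10.0` are illustrative choices and do not come from the study.

```python
import numpy as np

def ridge_marker_effects(X, y, lam):
    """Ridge estimate of marker effects: solve (X'X + lam * I) beta = X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Simulated example with more markers than lines (hypothetical sizes).
rng = np.random.default_rng(0)
n, p = 50, 200
X = rng.integers(0, 2, size=(n, p)).astype(float)
X -= X.mean(axis=0)                      # center marker covariates
beta_true = np.zeros(p)
beta_true[:5] = 1.0                      # only five markers have an effect
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# X'X has rank below p, so the OLS normal equations have no unique solution.
rank_XtX = np.linalg.matrix_rank(X.T @ X)

# Adding the ridge penalty makes the system solvable.
beta_hat = ridge_marker_effects(X, y, lam=10.0)
g_hat = X @ beta_hat                     # estimated genetic values
```

The penalty shrinks all marker effects by the same amount, which is the homogeneous shrinkage that marker-specific methods try to relax.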

In ridge regression (or its Bayesian counterpart) the extent of shrinkage is homogeneous across markers, which may not be appropriate if some markers are located in regions that are not associated with genetic variance, while markers in other regions may be linked to QTL (Goddard and Hayes 2007). To overcome this limitation, many authors have proposed methods that use marker-specific shrinkage. In a Bayesian setting, this can be implemented using priors of marker effects that are mixtures of scaled-normal densities. Examples of this are methods Bayes A and Bayes B of Meuwissen et al. (2001) and the Bayesian LASSO of Park and Casella (2008).

An alternative to parametric regressions is to use semiparametric methods such as reproducing kernel Hilbert spaces (RKHS) regression (Gianola and van Kaam 2008). The Bayesian RKHS regression regards genetic values as random variables coming from a Gaussian process centered at zero and with a (co)variance structure that is proportional to a kernel matrix K (de los Campos et al. 2009b); that is, equation M14, where equation M15, equation M16 are vectors of marker genotypes for the ith and jth individuals, respectively, and equation M17 is a positive definite function evaluated in marker genotypes. In a finite-dimensional setting this amounts to modeling the vector of genetic values, equation M18, as multivariate normal; that is, equation M19 where equation M20 is a variance parameter. One of the most attractive features of RKHS regression is that the methodology can be used with almost any information set (e.g., covariates, strings, images, graphs). A second advantage is that with RKHS the model is represented in terms of n unknowns, which gives RKHS a great computational advantage relative to some parametric methods, especially when p ≫ n.

This study presents an evaluation of several methods for GS, using two extensive data sets. One contains phenotypic records of a series of wheat trials and recently generated genomic data. The other data set pertains to international maize trials in which different traits were measured in maize lines evaluated under severe drought and well-watered conditions.

MATERIALS AND METHODS

Experimental data:

Two distinct data sets were used: the first one comprises information from a collection of 599 historical CIMMYT wheat lines, and the second one includes information on 300 CIMMYT maize lines.

Wheat data set:

This data set includes 599 wheat lines developed by the CIMMYT Global Wheat Breeding program. Environments were grouped into four target sets of environments (E1–E4). The trait was grain yield (GY). Hereinafter we refer to this data set as wheat-grain yield (W-GY). A pedigree was used for deriving the additive relationship matrix A among the 599 lines, as described in http://cropwiki.irri.org/icis/index.php/TDM_GMS_Browse (McLaren et al. 2005). The entries of this matrix equal twice the kinship coefficient (or coefficient of parentage) between pairs of lines.

Wheat lines were genotyped using 1447 Diversity Array Technology markers (hereinafter generically referred to as markers) generated by Triticarte Pty. Ltd. (Canberra, Australia; http://www.triticarte.com.au). These markers may take on two values, denoted by their presence (1) or absence (0). In this data set, the overall mean frequency of the allele coded as 1 was 0.561, with a minimum of 0.008 and a maximum of 0.987. Markers with allele frequency <0.05 or >0.95 were removed. Missing genotypes were imputed using samples from the marginal distribution of marker genotypes, that is, equation M21, where equation M22 is the estimated allele frequency computed from the nonmissing genotypes. After editing, 1279 markers were retained.
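The editing steps described above can be sketched as follows. This is an illustration under stated assumptions: missing genotypes are coded as NaN, imputed values are drawn as Bernoulli variables with the estimated allele frequency, and filtering is applied before imputation. The function name `edit_markers` and the toy matrix are hypothetical.

```python
import numpy as np

def edit_markers(M, lower=0.05, upper=0.95, seed=0):
    """Filter 0/1 markers by allele frequency, then impute missing values.

    M is an n x p array of 0/1 genotypes with np.nan marking missing entries.
    Returns the edited matrix and a boolean mask of retained markers.
    """
    rng = np.random.default_rng(seed)
    freq = np.nanmean(M, axis=0)              # frequency from nonmissing genotypes
    keep = (freq >= lower) & (freq <= upper)  # drop markers outside [lower, upper]
    M = M[:, keep].copy()
    freq = freq[keep]
    missing = np.isnan(M)
    # Assumed imputation rule: draw each missing genotype as Bernoulli(freq_j).
    draws = rng.binomial(1, np.broadcast_to(freq, M.shape))
    M[missing] = draws[missing]
    return M, keep

# Toy example: the first marker is monomorphic and is removed.
M = np.array([[1, 0, 1, np.nan],
              [1, 1, 1, 0],
              [1, 0, np.nan, 1],
              [1, 1, 0, 1]], dtype=float)
clean, keep = edit_markers(M)
```

Sampling from the marginal distribution keeps the allele frequency of each marker roughly unchanged, at the cost of ignoring any correlation between neighboring markers.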

Maize data set:

The maize data set is from the Drought Tolerance Maize for Africa project of CIMMYT's Global Maize Program. The original data set included 300 tropical lines genotyped with 1148 single-nucleotide polymorphisms (hereinafter generically referred to as markers). For each marker, the allele with lowest frequency was coded as one.

No pedigree was available for these data. Traits analyzed for this study were GY, female flowering (FFL) (or days to silking), male flowering (MFL) (or days to anthesis), and the anthesis-silking interval (ASI), each evaluated under severe drought stress (SS) and well-watered (WW) conditions. Hereinafter we refer to these data sets as maize-grain yield (M-GY) and maize-flowering (M-F), respectively. The number of lines in the M-F data set was 284, whereas 264 lines were available in M-GY. The average minor allele frequency in these data sets was 0.20. After editing (with the same procedures as those described above), the numbers of markers available for analysis were 1148 and 1135 in M-F and M-GY, respectively.

Statistical models:

This study evaluated several models for GS that differ depending on the type of information used for constructing predictions (pedigree, markers, or both) and on how molecular markers were incorporated into the model (parametric vs. semiparametric). All the unknowns in the model were trait–environment specific. Consequently, separate models were fitted to each trait–environment combination. For ease of presentation, models are described for a generic trait–environment.

Likelihood function:

In all models, phenotypic records were described as

equation M23

where equation M24 is the average performance of the ith line, equation M25 is the number of replicates used for computing the mean value of the ith genotype, equation M26 is an intercept, equation M27 is the genetic value of the ith genotype, and equation M28 is a model residual. In all environments, the response variable was standardized to a sample variance equal to one. The joint distribution of model residuals was equation M29. With this assumption, the likelihood function becomes

equation M30
(1)

Models differed on how pedigree and molecular marker information was included in equation M31.

Standard infinitesimal model:

In this model, denoted as pedigree (P), equation M32 and equation M33, where equation M34 is the additive relationship matrix computed from the pedigree and equation M35 is the infinitesimal additive genetic variance. Following standard assumptions, the joint prior of model unknowns in P was

equation M36
(2a)

where equation M37 are scaled inverse chi-square priors assigned to the variance parameters. The prior scale and degrees of freedom parameters were set to equation M38 and equation M39, respectively. This prior has finite variance and an expectation of 0.5. Combining (1) and (2a), the joint posterior density of P is

equation M40
(2b)

Above, equation M41 denotes all hyperparameters indexing the prior distribution. This posterior distribution does not have a closed form; however, samples from the above model can be obtained from a Gibbs sampler, as described, for example, in Sorensen and Gianola (2002). No pedigree data were available for the maize data set; therefore, this model was fitted only to the wheat data set.

Parametric genomic models:

For parametric regression, we use the Bayesian LASSO (BL) (Park and Casella 2008), extended by inclusion of an infinitesimal effect, as described in de los Campos et al. (2009a). In this model,

equation M42

and the joint prior density of the model unknowns (upon assigning a flat prior to equation M43) is

equation M44
(3a)

Above, marker effects are assigned independent Gaussian priors with marker-specific variances (equation M45). At the next level of the hierarchical model, the equation M46's are assigned iid exponential priors equation M47. At a deeper level of the hierarchy equation M48 is assigned a Gamma prior with rate (δ) and shape (r), which in this study were set to equation M49 and equation M50, respectively. Finally, independent scaled inverse chi-square priors were assigned to the variance parameters, and the scale and degrees of freedom parameters were set to equation M51 and equation M52, respectively. The above model is referred to as pedigree plus markers BL (PM-BL).

The effect of the prior choice for equation M53 in the BL has been addressed in de los Campos et al. (2009a). These authors studied the influence of the choice of hyperparameters for equation M54 on inference of several items and concluded that, even when the prior for equation M55 had influence on inferences about this unknown, model goodness-of-fit and estimates of genetic values were robust with respect to the choice of equation M56. Figure A1 (appendix a) depicts the prior density of λ, equation M57, corresponding to the hyperparameter values used in this study; this prior gave a high density over a wide range of values of equation M58. Also, as shown later, the posterior mean of λ changed between traits and data sets, indicating that Bayesian learning took place.

Figure A1.
Prior density of the regularization parameter, p(λ), used to fit the Bayesian LASSO.

Combining the assumptions of the likelihood (1) and the prior described in (3a), the joint posterior density is

equation M59
(3b)

This density does not have a closed form; however, samples from the above model can be obtained from a Gibbs sampler, as described in de los Campos et al. (2009a). Inferences for the regularization parameter are presented in terms of equation M60, which were obtained by taking the positive square root of samples from the posterior distribution of equation M61.

A marker-based model, M-BL, can be obtained from (3b) by setting equation M62, which implies equation M63.

BLUP using marker genotypes:

Prediction of genetic values using BLUP (e.g., Robinson 1991) of marker effects is commonly used in GS (e.g., Meuwissen et al. 2001; Bernardo and Yu 2007). We include this method as a reference. BLUP estimates are derived from the model

equation M64
equation M65

where D = equation M66. From these assumptions, the BLUP estimates of marker effects are

equation M67

Computation of BLUPs requires knowledge of equation M68. To this end, we fitted a random-effects model

equation M69

where equation M70 is the observed phenotype of the kth replicate of the ith genotype (equation M71; equation M72), equation M73 and equation M74. This model yields estimates of equation M75, where equation M76. An estimate of equation M77 was obtained by plugging the estimate of equation M78 in equation M79 (e.g., Meuwissen et al. 2001; VanRaden 2007), where equation M80 is the estimated allelic frequency of the jth marker, and equation M81 is the average (across markers) allele frequency, which in our case was estimated from the marker data.

Semiparametric models (RKHS):

In RKHS, genetic values are viewed as a Gaussian process. When markers and a pedigree are available, genetic values can be modeled as the sum of two components

equation M82

where equation M83 is as before and equation M84 is a Gaussian process with a (co)variance function proportional to the evaluations of a reproducing kernel, equation M85, evaluated in marker genotypes; here equation M86 and equation M87 are vectors of marker genotype codes for the ith and jth individuals, respectively. The joint prior distribution of equation M88, equation M89, and the associated variance parameters equation M90, equation M91, equation M92, and equation M93, are as follows:

equation M94
(4a)

Above, K is a kernel matrix, which is symmetric and positive definite. In this study, the entries of these matrices were the evaluations of a Gaussian kernel, equation M95, where equation M96 is a squared-Euclidean distance, and equation M97 is a bandwidth parameter that controls how fast the prior correlation drops as lines get farther apart in the sense of equation M98. The values of the distance function depend on p, on allele frequencies, and on how related the lines are. The choice of the bandwidth parameter should consider the observed distribution of equation M99 to avoid situations where K is either a matrix full of ones or an identity matrix. In this study we chose equation M100, where equation M101 is the sample median of equation M102. This choice yields equation M103 at the median distance. Higher (lower) prior correlation is assigned to pairs of lines that are closer (farther apart) than equation M104, as measured by equation M105. Addressing the optimal choice of bandwidth parameter is not within the scope of this study; see de los Campos et al. (2010). The scale and degree of freedom parameters of the prior described in (4a) were equation M106 and equation M107.
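A minimal sketch of how such a kernel matrix could be built from marker genotypes is given below. The function name `gaussian_kernel` is hypothetical, and the default bandwidth rule (2 divided by the median squared distance) is an illustrative assumption that should not be read as the exact value used in the study.

```python
import numpy as np

def gaussian_kernel(X, theta=None):
    """Gaussian kernel K[i, j] = exp(-theta * d[i, j]) on marker genotypes.

    d[i, j] is the squared Euclidean distance between rows i and j of X.
    If theta is None, it is set from the median off-diagonal distance
    (assumed rule: theta = 2 / median).
    """
    sq_norms = (X ** 2).sum(axis=1)
    D = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    D = np.maximum(D, 0.0)                     # guard against tiny negative values
    if theta is None:
        off_diag = D[np.triu_indices_from(D, k=1)]
        theta = 2.0 / np.median(off_diag)
    return np.exp(-theta * D), theta

# Simulated 0/1 genotypes for 30 lines and 100 markers (hypothetical sizes).
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(30, 100)).astype(float)
K, theta = gaussian_kernel(X)
```

Tying the bandwidth to the observed distances keeps K away from the two degenerate cases mentioned above: a very small bandwidth pushes every entry toward one, and a very large one pushes K toward the identity matrix.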

Combining the assumptions in (1) and (4a), the joint posterior density of this marker and pedigree RKHS model (PM-RKHS) is

equation M108
(4b)

This density does not possess a closed form; however, samples from this posterior distribution can be obtained using a slightly modified version of the Gibbs sampler that implements the pedigree model in (2a).

In the RKHS regression of (4b), the variances of equation M109 and equation M110 can gauge the relative contribution of each of these components to the conditional expectation function. From (4a), equation M111, where equation M112 is the ith diagonal element of matrix A, and equation M113. Here, equation M114 is a standardized kernel, with equation M115. This does not occur in equation M116; here equation M117, where equation M118 is the coefficient of inbreeding of the ith individual. In the wheat population, the average value of equation M119 was 1.98.

As with parametric methods, a marker-based model, M-RKHS, can be obtained as a particular case of (4b), with equation M120, which implies equation M121.

Data analysis:

Full-data analysis:

Models were first fitted using all lines in the data set, and inferences for each fit were based on 30,000 samples (obtained after discarding 5000 samples as burn-in). Convergence was checked by inspecting trace plots of variance parameters.

Cross-validation:

Prediction of performance of lines whose phenotypes are yet to be observed is a central problem in plant breeding. Such prediction can be used, for example, to decide which of the newly generated lines will be evaluated in field trials. Cross-validation (CV) methods were used to evaluate the ability of a model to predict future outcomes. To this end, data were divided into 10 folds; this was done by using an index variable, equation M122, i = 1, …, n, that randomly assigns observations to 10 disjoint folds, equation M123, j = 1, …, 10. CV predictions of the observations in the first fold, equation M124, are obtained by omitting phenotypic data on all lines in the first fold. This yields CV predictions of lines in the first fold, that is, equation M125. Repeating this exercise for the second, third, …, 10th folds yields a whole set of CV predictions equation M126 that can be compared with actual observations equation M127 to assess predictive ability.
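The cross-validation scheme described above can be sketched as follows. This is a sketch under stated assumptions: the data are simulated, and a simple ridge regression stands in for the BL and RKHS models actually evaluated in the study. The names `assign_folds`, `cv_predictions`, and `ridge_fit_predict` are hypothetical.

```python
import numpy as np

def assign_folds(n, n_folds=10, seed=0):
    """Randomly assign n observations to n_folds disjoint, near-equal folds."""
    rng = np.random.default_rng(seed)
    return rng.permutation(np.arange(n) % n_folds)

def cv_predictions(X, y, fit_predict, folds):
    """Predict each fold from a model trained without that fold's phenotypes."""
    y_hat = np.empty_like(y, dtype=float)
    for j in np.unique(folds):
        test = folds == j
        y_hat[test] = fit_predict(X[~test], y[~test], X[test])
    return y_hat

def ridge_fit_predict(X_train, y_train, X_test, lam=10.0):
    """Stand-in prediction model (ridge regression); lam is an arbitrary choice."""
    x_mean = X_train.mean(axis=0)
    mu = y_train.mean()
    Xc = X_train - x_mean
    p = Xc.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ (y_train - mu))
    return mu + (X_test - x_mean) @ beta

# Simulated data (hypothetical sizes and effect distribution).
rng = np.random.default_rng(42)
n, p = 100, 300
X = rng.integers(0, 2, size=(n, p)).astype(float)
y = X @ rng.normal(scale=0.2, size=p) + rng.normal(size=n)

folds = assign_folds(n)
y_cv = cv_predictions(X, y, ridge_fit_predict, folds)
cv_correlation = np.corrcoef(y, y_cv)[0, 1]   # measure of predictive ability
```

Because every line is predicted exactly once from a model that never saw its phenotype, the resulting correlation mimics prediction of lines that have not yet been evaluated in the field.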

Principal component analysis of estimated marker effects:

Parametric models such as the BL yield estimates of marker effects, which, in our case, are environment specific. These estimates can be used to assess and visualize genetic effect × environment interaction. Biplots from principal component analysis of the matrix of estimated marker effects in each trait–environment combination were obtained. The methodology is briefly explained in appendix b. Use of biplots to assess genetic effect × environment interaction is further described in Cornelius et al. (2001).
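One common way to obtain biplot coordinates is through the singular value decomposition of the column-centered matrix of estimated effects. The sketch below assumes this approach and one particular scaling (marker scores carry the singular values); the procedure in appendix b may differ in centering or scaling. The function name `biplot_coordinates` and the simulated effects are hypothetical.

```python
import numpy as np

def biplot_coordinates(B, k=2):
    """PCA of a markers x environments matrix of estimated effects via SVD.

    Returns marker scores (points), environment vectors (arrows), and the
    proportion of variability explained by each principal component.
    """
    Bc = B - B.mean(axis=0)                    # center each environment column
    U, s, Vt = np.linalg.svd(Bc, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    marker_scores = U[:, :k] * s[:k]           # one point per marker
    env_vectors = Vt[:k].T                     # one vector per environment
    return marker_scores, env_vectors, explained

# Simulated effects for 50 markers in 4 environments (hypothetical values).
rng = np.random.default_rng(7)
B = rng.normal(size=(50, 4))
scores, env_vectors, explained = biplot_coordinates(B, k=2)

# Projecting marker points onto environment vectors approximates the
# centered effects; with all components retained it reproduces them exactly.
approx_effects = scores @ env_vectors.T
```

Under this scaling, the projection of a marker point onto an environment vector approximates that marker's centered effect in that environment, and the quality of the approximation depends on how much variability the first two components explain.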

RESULTS

This section begins by presenting estimates of variance parameters and of the regularization parameters of BL and RKHS that were obtained when models were fitted using all available records (i.e., full data analysis). Next, results from the principal components analysis of estimated marker effects (also obtained from the full data analysis) for the W-GY data set are given (results for the maize data set are provided in appendix c). Subsequently, estimates of measures of predictive ability obtained from cross-validation are presented.

Variance and regularization parameters:

Tables 1 and 2 give the estimates of posterior means of variance parameters and of λ in the BL. The posterior mean of the residual variance (equation M128) can be used to assess model goodness-of-fit. Since the response variable was standardized within trait–environment combinations, the estimate of equation M129 gives an indication of the fraction of the phenotypic variance that can be attributable to model residuals. In the W-GY data set (Table 1), RKHS models fitted data markedly better (smaller equation M130) than P, M-BL, or PM-BL. Model M-BL had a posterior mean of residual variance that was either similar to or slightly larger than that of P, while PM-BL fitted the data better than P. Results from the maize data sets (Table 2) were mixed: M-BL fitted the data much better than M-RKHS for FFL and MFL, regardless of environmental conditions, but the opposite was observed (i.e., M-RKHS fitted data better than M-BL) for ASI and GY (Table 2).

TABLE 1
Estimates of posterior mean of parameters $\sigma^2_\varepsilon$, $\sigma^2_u$, $\sigma^2_f$, and λ from the full-data analysis of grain yield (GY) of 599 wheat lines genotyped with 1279 molecular markers
TABLE 2
Estimates of posterior means of parameters $\sigma^2_\varepsilon$, $\sigma^2_f$, and λ from the full-data analysis of female flowering time (FFL), male flowering time (MFL), the MFL to FFL interval (ASI) of 284 maize genotypes and 1148 markers, and ...

For the W-GY data set, the posterior means of equation M145 in PM-BL and PM-RKHS were smaller than that obtained in P (Table 1). This indicates that the inclusion of markers reduces the relative contribution of the regression on the pedigree, equation M146. In PM-RKHS, the ratio equation M147, evaluated at equation M148 and at the posterior mean of equation M149 and equation M150, was always >2 (Table 1), indicating that in PM-RKHS models, the regression on the markers made a much more important contribution to the conditional expectation than the regression on the pedigree.

Marker effects:

Estimated marker effects obtained from PM-BL are provided in supporting information, Table S1, Table S2, and Table S3.

The multivariate analysis of estimated marker effects for the W-GY data set indicated that the first two principal components explained 74% of the total variability in estimated marker effects (Figure 1). Sample correlations between phenotypes in the four environments (E) showed that E2 and E3 had a correlation of 0.661, whereas E2 and E4 and E3 and E4 had correlations of 0.411 and 0.388, respectively. The correlation patterns of estimated marker effects were similar, but the strength of the association was slightly weaker. For instance, the correlations between estimates of marker effects were 0.633 (E2–E3), 0.388 (E2–E4), and 0.384 (E3–E4). Correlations between E1 and the other environments were low and negative for phenotypic and estimated marker effect data.

Figure 1.
Biplot of the first two principal components (Comp. 1 and Comp. 2) of estimates of marker effects on grain yield (GY) in wheat evaluated in four environments (E1–E4). Marker effects were obtained from a full-data analysis and using a pedigree ...

The variance of estimated marker effects was slightly smaller in E4; this can be inferred by the length of the corresponding vector in Figure 1. The vast majority of the estimated effects are located around the center of Figure 1 (i.e., estimated effects were small, in absolute value), which reflects shrinkage of the BL model. However, some markers had estimated effects that were large in absolute value; some of those markers are identified by their name in Figure 1, and the estimated effects are given in Table S1. An approximation to the estimated effect of the presence of a marker in GY for a given environment can be obtained by orthogonal projection of the marker effect displayed in Figure 1 on the vector of the corresponding environment. To illustrate this, consider E1, where the presence of markers wPt.9256, wPt.6047, and wPt.3904 is expected to increase GY (Figure 1); in contrast, the presence of markers wPt.3462, wPt.3922, and wPt.4988 (located in the opposite direction of E1) is expected to reduce GY.

The multivariate analysis of estimated marker effects allows identifying which markers contribute to positive/negative genetic correlation between environments. Markers whose presence is expected to increase or decrease GY across environments can be viewed as contributing to positive genetic correlations in GY between environments. Examples of this group are markers wPt.9256, wPt.6047, and c.373879, whose presence increased GY in the four environments, and wPt.3393, c.380591, and c.381717, whose presence decreased GY in all environments. However, some markers act in an “antagonistic” fashion; that is, the presence of a marker increases (decreases) GY in some environments and decreases (increases) GY in others.

Results from the multivariate analysis of marker effects in the maize data sets (M-F and M-GY) were similar to those observed in the wheat data set in regard to the following: (1) the first two principal components explained a large proportion (85.8%) of the observed variability of estimated marker effects; (2) due to shrinkage, most estimated marker effects clustered around zero; and (3) although the overall correlation patterns between estimated marker effects reflected the type of association observed between phenotypes, it was possible to identify subsets of markers that contributed to positive genetic correlation and others that induced negative genetic associations. A detailed discussion of these results is given in appendix c.

Predictive ability:

Tables 3 and 4 show the estimated correlations between phenotypic outcomes and CV predictions for W-GY, M-F, and M-GY data sets. Overall, the values of these correlations, especially those obtained with BL or RKHS methods, were large for all models, data sets, and traits, indicating that genomic selection can be effective for predicting the performance of lines with yet-to-be observed phenotypes. Predictive ability was different between models and data sets: for W-GY correlations ranged from 0.355 to 0.608, for M-F correlations varied from 0.464 to 0.79, and for M-GY they ranged from 0.415 to 0.514.

TABLE 3
Cross-validation (CV) correlation between predicted and observed phenotypes, obtained in a 10-fold CV conducted for grain yield (GY) records of 599 wheat lines genotyped with 1279 molecular markers
TABLE 4
Cross-validation (CV) correlation between predicted and observed phenotypes, obtained in a 10-fold CV conducted for female flowering (FFL), male flowering (MFL), the MFL to FFL interval (ASI) of 284 maize lines genotyped for 1148 markers, and grain yield ...

Wheat data set:

In the W-GY, correlations ranged from 0.355 (BLUP in E3) to 0.608 (PM-RKHS in E1) (Table 3), and relative to the P model, the PM-RKHS model produced the highest relative gain in CV correlation in three of four environments. BLUP was outperformed by BL and RKHS methods across environments. In these data, PM models had better predictive ability than P models, and the magnitude of the gain in predictive ability attained by including markers in the model varied from a modest 7.7% (PM-BL in GY-E3) to a very important 35.7% (PM-RKHS in GY-E1) (Table 3). In general, RKHS outperformed BL both in M and PM, and BLUP outperformed P models in three of four environments (all but E3); however, as stated, BLUP was outperformed by BL and RKHS.

Maize flowering:

In the M-F, correlations ranged from 0.464 (BLUP for MFL-SS) to 0.790 (M-BL for MFL-WW) (Table 4). For these traits, BLUP was systematically outperformed by BL and RKHS. Also for these traits, M-BL yielded better predictions than M-RKHS, with relatively high correlation values that ranged from 0.774 to 0.790. However, for ASI under severe drought stress and well-watered conditions, correlations were not as strong as those found for the other flowering-time traits, and M-RKHS outperformed M-BL, with correlation values of 0.547 and 0.572, respectively (Table 4).

Maize grain yield:

Predictive correlations in M-GY (Table 4) were smaller than those obtained in flowering traits, and the differences between methods were not as clear as in the M-F data set. Here, CV correlations ranged from 0.415 (M-BL GY under drought stress) to 0.525 (M-BL GY well watered). These traits did not yield a clear ranking of models: BL was best for GY under well-watered conditions, and RKHS was best for GY under drought stress. However, as stated, in M-GY the differences in predictive ability between models were not large.

DISCUSSION

Several simulation studies (Bernardo and Yu 2007; Wong and Bernardo 2008; Mayor and Bernardo 2009; Zhong et al. 2009) have reported important gains in genetic progress associated with the use of GS in plant breeding. Recently, Heffner et al. (2009) concluded that the high correlation between true breeding values and the genomic estimated breeding values found in several simulation studies is sufficient for considering selection based on molecular markers alone; however, evaluation of these methods with real plant data is still very limited.

Empirical evaluation of GS:

The results of this study indicate that, even with a modest number of molecular markers, models for GS can attain relatively high predictive ability for genetic values of traits of economic interest in contrasting environmental conditions. These findings are in agreement with simulation-based studies such as those mentioned above and with empirical evidence reported in animal breeding (e.g., Gonzalez-Recio et al. 2008; VanRaden et al. 2008; Hayes et al. 2009; Weigel et al. 2009).

Evaluation of predictive ability indicated that models using marker and pedigree data jointly (PM) outperformed pedigree models (P) across traits and environments, regardless of the choice of model (BL, RKHS). These results are consistent with those reported by Crossa et al. (2010), who evaluated P, M, and PM models using the BL and RKHS for grain yield in wheat (n = 170) and several disease traits in maize.

Despite the gains in predictive ability obtained with PM models, our results suggest that there is room for improving predictive ability even further. To illustrate this, and as an exercise, let us assume that the model equation M152 holds, and consider as the best (unlikely) scenario that CV predictions, equation M153, are such that equation M154. If so, the maximum attainable correlation is equation M155, where h is the square root of the heritability of the trait. Thus, if heritability is 0.5, then the maximum correlation is 0.707. This will hold if only one replicate is available; for data involving repeated measures, as was the case in this study, the maximum correlation is equation M156. CV correlations in this study ranged from 0.40 to 0.79; these values are well below the theoretical maxima given the heritability of the traits and the number of replicates available. We therefore conclude that larger gains in predictive ability can be expected (1) when more markers are available or (2) by improving upon the methods used to implement GS.
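The upper bound discussed above can be checked numerically. The sketch assumes perfect predictions of genetic values and defines heritability on a single-record basis. The form for means of several replicates is derived here from a residual variance that shrinks in proportion to the number of replicates, and may be written differently in the original equation. The function name `max_cv_correlation` is hypothetical.

```python
import math

def max_cv_correlation(h2, n_reps=1):
    """Upper bound on the correlation between phenotypes and predictions.

    h2 is the single-record heritability, var(g) / (var(g) + var(e)).
    With means of n_reps replicates the residual variance is var(e) / n_reps,
    so the bound is sqrt(h2 / (h2 + (1 - h2) / n_reps)).
    """
    return math.sqrt(h2 / (h2 + (1.0 - h2) / n_reps))

single_record = max_cv_correlation(0.5)        # about 0.707, the value quoted in the text
four_replicates = max_cv_correlation(0.5, 4)   # higher, because averaging reduces noise
```

With one replicate the bound reduces to h, the square root of heritability. More replicates raise the bound because the phenotypic mean becomes a less noisy measure of the genetic value.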

Choice of model:

There are different ways of incorporating markers into models for GS. Here we evaluated the BL, BLUP, and RKHS methods. BLUP and BL use parametric regression on marker covariates, whereas RKHS is a semiparametric method. In general, BL outperformed BLUP, which may be attributed to at least two reasons: (1) similar to other methods for GS such as methods Bayes A and Bayes B of Meuwissen et al. (2001), BL performs marker-specific shrinkage of effects, whereas BLUP penalizes all marker effects equally; and (2) in BL, variance parameters and marker effects are inferred jointly, whereas BLUP typically involves two steps (a first one in which variance parameters are inferred and a second one in which marker effects are estimated).

The comparison between BL and RKHS yielded mixed results; this finding is in agreement with those of Zhong et al. (2009), who evaluated different models in different scenarios (mating systems) and did not find one method that performed best across scenarios. For grain yield and anthesis-silking interval, RKHS methods performed either similarly to or better than the BL; however, for female and male flowering traits in maize, BL outperformed RKHS markedly. The BL is an additive model, whereas RKHS may be able to capture complex epistatic interactions better (e.g., Gianola and van Kaam 2008). Therefore, one could expect the BL to perform well in traits where additive effects play a central role and RKHS to perform better in traits where epistasis is more relevant. Buckler et al. (2009) provide evidence suggesting that female and male flowering traits in maize are, for the most part, additive traits. The good performance of the BL observed in this study for those traits is consistent with this finding.

Marker vs. pedigree plus marker models:

In general, PM models in W-GY had a slight but consistent superiority in all four environments for predictive ability as compared to the M model; this is in agreement with previous findings (e.g., de los Campos et al. 2009a). The advantage of considering pedigree and markers jointly is small because there is some redundancy between regression on the pedigree and regression on markers (e.g., Habier et al. 2009). It is reasonable to expect that as the number of molecular markers increases, the relative contribution of pedigree information will decrease.

Assessment of genetic effect × environment interaction with estimates of marker effects:

Parametric methods such as M-BL, PM-BL, or BLUP provide estimates of “marker effects” that may be used to gain a better understanding of the underlying architecture of the traits. The results obtained here with W-GY are consistent with those reported by Crossa et al. (2007) and indicate that markers such as wPt.6047, wPt.3393, wPt.3462, and wPt.3904 (located on chromosome 3B, the long arm of chromosome 7A, chromosome 1A, and the short arm of chromosome 1A, respectively) are indeed associated with GY in wheat.

Estimates of marker effects can also be used to gain insight into the sources of genetic effect × environment interaction. Here, we used principal component analysis of estimates of marker effects as a way of assessing sources of marker effect × environment interaction. Overall, the correlation patterns of estimated marker effects were similar to those observed at the phenotypic level; however, in all trait–environment combinations it was possible to detect markers that made contributions to positive or negative genetic correlation. For example, for the M-F data set, results indicate important molecular marker effect × environment interactions, which translate into genotype × environment interaction. In this respect, our results are different from those of Buckler et al. (2009), who reported low levels of genotype × environment interaction for the same traits.

Conclusion:

Results of this study showed that models including markers or markers and pedigrees yield relatively high correlations between predicted and observed phenotypic outcomes. The superiority of models using markers or markers and pedigree was clear regardless of the choice of method (BL, RKHS). Moreover, we did not find a method (BL or RKHS) that was consistently superior across environments and traits. Differences in the underlying genetic architecture of the traits may well explain these results.

The relatively promising results from RKHS indicate that designing methods to address the problem of kernel choice is a relevant area of research in the context of semiparametric models for GS. In this study, separate models were fitted to each trait–environment combination. Multiple-environment (multiple-trait) models are ubiquitous in plant and animal breeding, and the development and evaluation of multiple-environment models for GS where marker effects and genomic values for several traits are estimated jointly appears to be a relevant area of research.

The Bayesian LASSO was fitted using the BLR package which is available in R (R Development Core Team 2010; G. de los Campos and P. Pérez) and described in Pérez et al. (2010). The wheat and maize experimental data, and other computer programs written in R for fitting the RKHS models using the Gibbs sampler described in this article, are available in File S1.

Acknowledgments

This article benefited from valuable comments from two associate editors and two anonymous reviewers. The maize data set used in this study comes from the Drought Tolerance Maize for Africa project financed by the Bill and Melinda Gates Foundation. We thank the numerous cooperators in national agricultural research institutes who carried out the maize trials in Africa and the Elite Spring Wheat Yield Trials and provided the phenotypic data analyzed in this article. We also thank the International Nursery and Seed Distribution Units in the International Maize and Wheat Improvement Center (CIMMYT, Mexico), for preparing and distributing the seed and digitalizing the data. Gustavo de los Campos and Daniel Gianola acknowledge support by the Wisconsin Agriculture Experiment Station and from grant DMS-11044371 made by the Division of Mathematical Sciences of the National Science Foundation.

APPENDIX A:

APPENDIX B: MULTIVARIATE ANALYSIS OF ESTIMATED MARKER EFFECTS

Consider a matrix of estimated molecular marker effects, $\hat{B} = [\hat{\beta}_1, \ldots, \hat{\beta}_q]$, whose columns, $\hat{\beta}_j$, $j = 1, \ldots, q$, are estimates of the effects of p markers in q different environments. The singular value decomposition of this matrix is $\hat{B} = UDV'$, where $U$ and $V$ are orthonormal matrices that span the row (marker) and column (environment) spaces of $\hat{B}$, respectively, and $D$ is a diagonal matrix whose nonnull entries are the singular values of $\hat{B}$; that is, $D = \mathrm{diag}(\lambda_1, \ldots, \lambda_q)$.

The biplot is constructed using the first two principal component axes of equation M167 (equation M168, equation M169 and equation M170, equation M171). Points in the biplot are the marker effects projected in the first two components and are displayed using the coordinates provided by equation M172 and equation M173. The “environmental effects” are displayed as vectors whose coordinates are given by equation M174 and equation M175. The length of the vectors approximates the variance accounted for by the specific molecular marker and environmental effect. Molecular markers represented in the same direction as the environments had positive effects on those environments, whereas molecular markers located in the opposite direction to the environmental vectors had negative effects on those environments. The cosine of the angle between the vectors representing a pair of environments (or molecular marker effects) approximates the correlation of the two environments (or molecular markers), with an angle of zero indicating a correlation of +1, an angle of 90° (or −90°) a correlation of 0, and an angle of 180° a correlation of −1.
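The decomposition and the angle-to-correlation reading described above can be sketched as follows. The example builds a simulated matrix of marker effects, centers each environment (column), and places the singular values entirely on the environment coordinates. The centering, this particular scaling, and all variable names are assumptions made for illustration; other biplot scalings are in common use and may differ from the one used for Figure C1.

```python
import numpy as np

# Simulated effects (hypothetical sizes): 100 markers in 4 environments,
# generated from two underlying factors plus a small amount of noise.
rng = np.random.default_rng(1)
p, q = 100, 4
B = rng.normal(size=(p, 2)) @ rng.normal(size=(2, q)) \
    + 0.1 * rng.normal(size=(p, q))

Bc = B - B.mean(axis=0)                              # center each environment
U, d, Vt = np.linalg.svd(Bc, full_matrices=False)    # Bc = U @ diag(d) @ Vt

marker_xy = U[:, :2]            # points: one row per marker
env_xy = Vt[:2].T * d[:2]       # vectors: one row per environment


def cosine_matrix(vectors):
    """Pairwise cosines of the angles between row vectors."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T


cos_2d = cosine_matrix(env_xy)          # what the two-axis biplot shows
cos_full = cosine_matrix(Vt.T * d)      # same, using all q components
corr = np.corrcoef(B, rowvar=False)     # correlations between environments

explained = (d[:2] ** 2).sum() / (d ** 2).sum()
print("share of variability on two axes:", round(float(explained), 3))
print("largest |cosine - correlation| on two axes:",
      float(np.abs(cos_2d - corr).max()))
```

Under this scaling the cosines computed from all components reproduce the correlations between environments exactly, and the two-axis biplot approximates them to the extent that the first two components account for most of the variability.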

APPENDIX C

Marker effects for maize flowering data:

The display of the first two component axes (accounting for 85.79% of the total variability in estimated marker effects) on estimated effects of the markers in the six trait–environment combinations (MFL-SS, MFL-WW, FFL-SS, FFL-WW, ASI-SS, and ASI-WW) of the M-F data set obtained from the BL model is depicted in Figure C1. Clearly the two groups of trait–environment combinations are dominated more by the trait (ASI vs. FFL and MFL) and less by the environmental condition (SS and WW). Phenotypic outcomes and estimates of marker effects for ASI showed relatively small correlations with those of FFL and MFL. Phenotypic correlations between MFL in WW and SS, ASI in WW and SS, and FFL in SS and WW were positive and high, ranging from 0.686 to 0.728. Correlations ASI-MFL and ASI-FFL at the different water regimes (SS and WW) ranged from −0.123 to 0.446.

Figure C1.
Biplot of the first two principal components (Comp. 1 and Comp. 2) of estimates of marker effects for female flowering (FFL), male flowering (MFL), and the FFL-MFL interval (ASI) evaluated under well-watered (WW) and drought-stress (SS) conditions. Estimates ...

Interpretation of the estimated marker effect on these traits should be different from that for grain yield. For FFL and MFL, the favorable allele is the one whose estimated effect is negative (i.e., it decreases FFL and MFL), whereas for ASI, selection seeks to set this trait as close to zero as possible. Alleles coded as 1 of markers whose estimated effects are located on the left side and in the top left corner of Figure C1 (i.e., PZA03551.1, PZA03578.1, PZA03222.1, PZA03385.1, PZB01201.1, and PZB00118.2) increase FFL, MFL, and ASI (they all have positive effects in all trait–environment combinations), whereas those markers located on the opposite side of the biplot (bottom right corner) (i.e., PZA02587.16, PZA00236.7, PZB0255.1, and PZA00676.2) decrease the value of FFL, MFL, and ASI. Those markers whose presence is expected to increase or decrease traits across environments can be viewed as contributing to positive genetic correlations in FFL, MFL, and ASI between environments.

Despite the high heritability (between 0.74 and 0.87) found for flowering time and ASI in this maize trial, results show substantial interaction between molecular marker effects and environment. The biplot in Figure C1 shows markers that had very contrasting effects across environments. For example, the minor alleles of markers whose estimated effects are located in the top right corner of the biplot (PZA03592.3, PZB01077.3, and PZB02076.1) increase the anthesis-silking interval under drought and well-watered conditions, but decrease days to male and female flowering. In contrast, the minor alleles of markers whose estimated effects are located in the opposite quadrant of the biplot (bottom left corner) (PZB00592.1, PHM13183.12, and PZB01964.5) showed a complete rank reversal with respect to the effects of markers PZA03592.3, PZB01077.3, and PZB02076.1 on those trait–environment combinations, i.e., a decrease in ASI under SS and WW and an increase in male and female flowering times.

The estimated effects used to perform the multivariate analysis included in this section are provided in Table S2.

Marker effects for maize grain yield under stress and well-watered environments:

Since only two trait–environment combinations (GY-WW and GY-SS) are available for the M-GY data set, no principal component analysis was performed. The phenotypic correlations between GY-WW and GY-SS (0.260), as well as the correlations between the estimated marker effects for grain yield (0.251), were low. Also, none of the 10 markers with the largest/smallest estimated effects in GY-WW was among those with the largest/smallest effects under GY-SS conditions. This indicates important context-dependent effects due to genotype × environment interaction. Estimates of marker effects for GY-WW and GY-SS are provided in Table S3.
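A comparison of this kind can be sketched as below: the correlation between two vectors of estimated marker effects, together with the markers that rank among the k largest (or k smallest) estimates in both environments. The effect vectors are simulated, and the helper `shared_extremes` and its reading of "largest/smallest" (largest compared with largest, smallest with smallest) are assumptions for illustration; the actual estimates are those referred to in Table S3.

```python
import numpy as np


def shared_extremes(effects_a, effects_b, k=10):
    """Indices of markers ranked among the k largest, and among the
    k smallest, estimated effects in both environments.

    Returns a tuple (shared_largest, shared_smallest) of index sets.
    """
    order_a = np.argsort(effects_a)     # ascending order of effects
    order_b = np.argsort(effects_b)
    shared_smallest = set(order_a[:k].tolist()) & set(order_b[:k].tolist())
    shared_largest = set(order_a[-k:].tolist()) & set(order_b[-k:].tolist())
    return shared_largest, shared_smallest


# Simulated effects for two environments sharing a weak common signal.
rng = np.random.default_rng(2)
p = 1000
common = rng.normal(size=p)
effects_ww = 0.5 * common + rng.normal(size=p)
effects_ss = 0.5 * common + rng.normal(size=p)

r = np.corrcoef(effects_ww, effects_ss)[0, 1]
largest, smallest = shared_extremes(effects_ww, effects_ss, k=10)
print("correlation of effects:", round(float(r), 3))
print("shared among 10 largest:", sorted(largest))
print("shared among 10 smallest:", sorted(smallest))
```

Empty intersections, combined with a low correlation, correspond to the pattern described in the text, in which the markers with the most extreme estimates differ between the two water regimes.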

Notes

Supporting information is available online at http://www.genetics.org/cgi/content/full/genetics.110.118521/DC1.

References

  • Bernardo, R., and J. Yu, 2007. Prospects for genome-wide selection for quantitative traits in maize. Crop Sci. 47 1082–1090.
  • Buckler, E. S., J. B. Holland, P. J. Bradbury, C. B. Acharya, P. J. Brown et al., 2009. The genetic architecture of maize flowering time. Science 325 714–718.
  • Burgueño, J., J. Crossa, P. L. Cornelius, R. Trethowan, G. McLaren et al., 2007. Modeling additive × environment and additive × additive × environment using genetic covariances of relatives of wheat genotypes. Crop Sci. 47 311–320.
  • Cornelius, P. L., J. Crossa, M. S. Seyedsadr, G. Liu and K. Viele, 2001. Contributions to multiplicative model analysis of genotype-environment data. Statistical Consulting Section, American Statistical Association, Joint Statistical Meetings, August 7, Atlanta, GA.
  • Crossa, J., J. Burgueño, P. L. Cornelius, G. McLaren, R. Trethowan et al., 2006. Modeling genotype × environment interaction using additive genetic covariances of relatives for predicting breeding values of wheat genotypes. Crop Sci. 46 1722–1733.
  • Crossa, J., J. Burgueño, S. Dreisigacker, M. Vargas, S. A. Herrera-Foessel et al., 2007. Association analysis of historical bread wheat germplasm using additive genetic covariance of relatives and population structure. Genetics 177 1889–1913.
  • Crossa, J., P. Perez, G. de los Campos, G. Mahuku, S. Dreisigacker et al., 2010. Genomic selection and prediction in plant breeding. Quantitative Genetics, Genomics, and Plant Breeding, Ed. 2, edited by M. S. Kang. CABI Publishing, New York (in press). http://genomics.cimmyt.org/.
  • de los Campos, G., H. Naya, D. Gianola, J. Crossa, A. Legarra et al., 2009a. Predicting quantitative traits with regression models for dense molecular markers and pedigree. Genetics 182 375–385.
  • de los Campos, G., D. Gianola and G. J. M. Rosa, 2009b. Reproducing kernel Hilbert spaces regression: a general framework for genetic evaluation. J. Anim. Sci. 87 1883–1887.
  • de los Campos, G., D. Gianola, G. J. M. Rosa, K. A. Weigel and J. Crossa, 2010. Semi-parametric genomic-enabled prediction of genetic values using reproducing kernel Hilbert spaces methods. Genet. Res. (in press). http://genomics.cimmyt.org/.
  • Fisher, R. A., 1918. The correlation between relatives on the supposition of Mendelian inheritance. Trans. R. Soc. Edinb. 52 399–433.
  • Gianola, D., and R. L. Fernando, 1986. Bayesian methods in animal breeding theory. J. Anim. Sci. 63 217–244.
  • Gianola, D., and J. B. C. H. M. van Kaam, 2008. Reproducing kernel Hilbert spaces regression methods for genomic assisted prediction of quantitative traits. Genetics 178 2289–2303.
  • Gianola, D., R. L. Fernando and A. Stella, 2006. Genomic-assisted prediction of genetic value with semiparametric procedures. Genetics 173 1761–1776.
  • Goddard, M. E., and B. J. Hayes, 2007. Genomic selection. J. Anim. Breed. Genet. 124 323–330.
  • Gonzalez-Recio, O., D. Gianola, N. Long, K. Weigel, G. J. M. Rosa et al., 2008. Non parametric methods for incorporating genomic information into genetic evaluation: an application to mortality in broilers. Genetics 178 2305–2313.
  • Habier, D., R. L. Fernando and J. C. M. Dekkers, 2009. Genomic selection using low-density marker panels. Genetics 182 343–353.
  • Hayes, B. J., P. J. Bowman, A. J. Chamberlain and M. E. Goddard, 2009. Invited review: genomic selection in dairy cattle: progress and challenges. J. Dairy Sci. 92 433–443.
  • Heffner, E. L., M. R. Sorrells and J.-L. Jannink, 2009. Genomic selection for crop improvement. Crop Sci. 49 1–12.
  • Henderson, C. R., 1984. Application of Linear Models in Animal Breeding. University of Guelph, Guelph, Ontario, Canada.
  • Hoerl, A. E., and R. W. Kennard, 1970. Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12 55–67.
  • Jannink, J.-L., A. J. Lorenz and H. Iwata, 2010. Genomic selection in plant breeding: from theory to practice. Brief. Funct. Genomics 9 166–177.
  • Mayor, P. J., and R. Bernardo, 2009. Genome-wide selection and marker-assisted recurrent selection in double haploid versus F2 population. Crop Sci. 49 1719–1725.
  • McLaren, C. G., R. Bruskiewich, A. M. Portugal and A. B. Cosico, 2005. The International Rice Information System. A platform for meta-analysis of rice crop data. Plant Physiol. 139 637–642.
  • Meuwissen, T. H. E., B. J. Hayes and M. E. Goddard, 2001. Prediction of total genetic values using genome-wide dense marker maps. Genetics 157 1819–1829.
  • Oakey, H., A. Verbyla, W. Pitchford, B. Cullis and H. Kuchel, 2006. Joint modeling of additive and non-additive genetic line effects in single field trials. Theor. Appl. Genet. 113 809–819.
  • Park, T., and G. Casella, 2008. The Bayesian LASSO. J. Am. Stat. Assoc. 103 681–686.
  • Pérez, P., G. de los Campos, J. Crossa and D. Gianola, 2010. Genomic-enabled prediction based on molecular markers and pedigree using the BLR package in R. Plant Genome (in press). http://genomics.cimmyt.org/.
  • Piepho, H. P., 2009. Ridge regression and extensions for genome-wide selection in maize. Crop Sci. 49 1165–1176.
  • Piepho, H. P., J. Möhring, A. E. Melchinger and A. Büchse, 2007. BLUP for phenotypic selection in plant breeding and variety testing. Euphytica 161 209–228.
  • Robinson, G. K., 1991. That BLUP is a good thing: the estimation of random effects. Stat. Sci. 6 15–51.
  • R Development Core Team, 2010. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna. http://www.R-project.org.
  • Sorensen, D., and D. Gianola, 2002. Likelihood, Bayesian, and MCMC Methods in Quantitative Genetics. Springer-Verlag, New York.
  • Tibshirani, R., 1996. Regression shrinkage and selection via the LASSO. J. R. Stat. Soc. B 58 267–288.
  • vanRaden, P. M., 2007. Genomic measures of relationship and inbreeding. Interbull Annual Meeting Proceedings, Interbull Bulletin, Vol. 37, pp. 33–36.
  • vanRaden, P. M., C. P. Van Tassell, G. R. Wiggans, T. S. Sonstegard, R. D. Schnabel et al., 2008. Invited review: reliability of genomic predictions for North American Holstein bulls. J. Dairy Sci. 92 16–24.
  • Weigel, K. A., G. de los Campos, O. González-Recio, H. Naya, X. L. Wu et al., 2009. Predictive ability of direct genomic values for lifetime net merit of Holstein sires using selected subsets of single nucleotide polymorphism markers. J. Dairy Sci. 92 5248–5257.
  • Wong, C., and R. Bernardo, 2008. Genome-wide selection in oil palm: increasing selection gain per unit time and cost with small populations. Theor. Appl. Genet. 116 815–824.
  • Zhong, S., J. C. M. Dekkers, R. L. Fernando and J.-L. Jannink, 2009. Factors affecting accuracy from genomic selection in populations derived from multiple inbred lines: a barley case study. Genetics 182 355–364.

Articles from Genetics are provided here courtesy of Genetics Society of America