• We are sorry, but NCBI web applications do not support your browser and may not function properly. More information
Logo of geneticsGeneticsCurrent IssueInformation for AuthorsEditorial BoardSubscribeSubmit a Manuscript
Genetics. Feb 2005; 169(2): 1045–1059.
PMCID: PMC1449126

Joint Mapping of Quantitative Trait Loci for Multiple Binary Characters

Abstract

Joint mapping for multiple quantitative traits has shed new light on genetic mapping by pinpointing pleiotropic effects and close linkage. Joint mapping also can improve statistical power of QTL detection. However, such a joint mapping procedure has not been available for discrete traits. Most disease resistance traits are measured as one or more discrete characters. These discrete characters are often correlated. Joint mapping for multiple binary disease traits may provide an opportunity to explore pleiotropic effects and increase the statistical power of detecting disease loci. We develop a maximum-likelihood method for mapping multiple binary traits. We postulate a set of multivariate normal disease liabilities, each contributing to the phenotypic variance of one disease trait. The underlying liabilities are linked to the binary phenotypes through some underlying thresholds. The new method actually maps loci for the variation of multivariate normal liabilities. As a result, we are able to take advantage of existing methods of joint mapping for quantitative traits. We treat the multivariate liabilities as missing values so that an expectation-maximization (EM) algorithm can be applied here. We also extend the method to joint mapping for both discrete and continuous traits. Efficiency of the method is demonstrated using simulated data. We also apply the new method to a set of real data and detect several loci responsible for blast resistance in rice.

MULTIPLE traits are measured virtually in all line-crossing experiments of QTL mapping. Yet, almost all data collected for multiple traits are analyzed separately for different traits. Joint analysis for multiple traits has shed new light in QTL mapping by improving the statistical power of QTL detection and increasing the accuracy of QTL localization when different traits segregating in the mapping population are genetically related. Joint analysis for multiple traits is defined as a method that includes all traits simultaneously in a single model, rather than analyzing one trait at a time and reporting the results in a format that appears to be multiple-trait analysis. In addition to the increased power and resolution of QTL detection, joint mapping can provide insights into fundamental genetic mechanisms underlying trait relationships such as pleiotropy vs. close linkage and genotype-by-environment (G × E) interaction, which would otherwise be difficult to address if traits are analyzed separately.

Substantial work has been done in joint mapping for multiple quantitative traits (Jiang and Zeng 1995; Korol et al. 1995, 2001; Mangin et al. 1998; Henshall and Goddard 1999; Williams et al. 1999; Knott and Haley 2000; Hackett et al. 2001). In general, there are two ways to perform a joint mapping. One way is the true multivariate analysis in which a multivariate normal distribution is assumed for the multiple traits, and thus a multivariate Gaussian model is applied to construct the likelihood function. Parameter estimation is conducted via either the expectation-maximization (EM) algorithm (Dempster et al. 1977) or the multiple-traits least-squares method (Knott and Haley 2000). One problem with these multivariate analyses is that if the number of traits is large, there will be too many hypotheses to test and interpretation of the results will become cumbersome. The other way of multiple-trait analysis is to utilize a dimension reduction technique, e.g., the principal component analysis, to transform the data into fewer variables, i.e., “super traits,” that explain the majority of the total variation of the entire set of traits. Analyzing the super traits requires little additional work (Korol et al. 1995, 2001; Mangin et al. 1998) compared to that for the single-trait genetic mapping statistics. However, as pointed out by Hackett et al. (2001), inferences based on the super traits might result in detection of spurious QTL. Furthermore, parameters of the super traits are often difficult to interpret biologically. Nevertheless, joint mapping provides a good opportunity to answer more questions about the genetic architecture of complex trait sets and deserves continued efforts from investigators in the QTL mapping community.

In contrast to that for multiple quantitative traits, relatively little work has been done in joint mapping for multiple discrete traits (Lange and Whittaker 2001). In fact, multiple discrete traits, especially multiple binary disease traits, are frequently collected in plants and laboratory animals. Most disease resistance traits are measured as one or more dichotomous characters. For example, in experiments mapping disease resistance loci, multiple pathogen races or strains are commonly used to determine the number of race-specific resistance loci involved. In such cases, infection by each strain is measured as a binary trait and the overall infection spectrum is a vector of several binary measurements. In practice, scientists may be less interested in identifying resistance loci to individual strains, but more interested in loci with a wide spectrum of resistance, which, in principle, can be better addressed with the joint-mapping strategy. Unfortunately, there has been no report on such a joint-mapping analysis for multiple binary traits. Recently, Lange and Whittaker (2001) applied the generalized estimating equations (GEE; Liang and Zeger 1986) method to mapping multiple discrete trait loci. Results of GEE are hardly compared with those of single-trait mapping because there is no univariate version of the GEE.

Furthermore, it is not uncommon that investigators may collect both continuous and discrete traits in a single mapping experiment. For example, disease resistance characters may be measured in a QTL mapping experiment for yield traits, or vice versa. Combining the disease resistance traits (discrete) with the yield traits (continuous) may allow investigators to answer some important questions such as possible fitness penalty of resistance loci. Even if the associated quantitative traits are not directly responsible for the disease status but linked to the disease loci, joint analysis will still provide additional power in identifying the disease loci (Williams et al. 1999; Huang and Jiang 2003). So far, joint analysis of mixed types of traits has been attempted only in pedigree analysis of human populations under the random model methodology.

The goal of this study is to develop a formal multivariate version of the maximum-likelihood methodology for joint mapping of QTL underlying multiple binary traits and mixed types of traits in line-crossing experiments under the fixed-model framework. We analyzed both simulated data and data collected from field experiments.

METHODS

Joint mapping for multiple binary traits:

Statistical model:

Suppose that we have a sample of n individuals from an F2 population derived from the cross of two inbred lines with observation on m binary traits. Assume that we also genotype a number of codominant molecular markers with known map positions for the species in question. Let wjk denote the phenotype of the kth binary trait on the jth individual and wjk = 1 if individual j is affected and wjk = 0 if j is unaffected. Further define yjk as the underlying latent variable, i.e., the liability, for the kth binary trait on the jth individual. The relationship between the liability and the discrete phenotype is described by the following threshold model,

equation M1
1

where the threshold 0 is chosen in an arbitrary fashion. Each of the m liabilities is a continuous variable, similar to the phenotypic value of a quantitative trait. The difference between a liability and a quantitative trait is that the former is not observable but inferred from the discrete phenotype. The liabilities are described by the following linear models,

equation M2
2

where b0k is the mean effect (intercept) for trait k in the scale of liability; b1k and b2k are, respectively, the additive and dominance effects of the putative QTL; x0j, x1j, and x2j are the incidence variables for the mean effect and the additive and dominance effects, respectively; and ejk is the residual error for trait k of individual j. We assume that the residual errors are independent among individuals but correlated among traits within individuals. In matrix notation, model (2) can be written as

equation M3
3

where yj = [yj1yjm] is a 1 × m vector for the liabilities; xj = [x0j x1j x2j], which is defined as xj = h1 = [1 1 0] if individual j takes genotype QQ, xj = h2 = [1 0 1] if j has a genotype Qq, and xj = h3 = [1 −1 0] if j is qq at the putative QTL position; B is a 3 × m matrix defined as

equation M4
4

and ej is a 1 × m vector for the residual errors, which has a covariance matrix

An external file that holds a picture, illustration, etc.
Object name is 19406.e5.jpg
5

Note that the variances are estimable from the latent variables but not from the observed data. Therefore, some restrictions are required when the binary data are taken into account (McCulloch 1994). The probability that individual j is affected by all the m binary diseases is

equation M5
6

where

equation M6
7

is the multivariate normal probability density. The probabilities for other joint binary phenotypes are derived similarly. For m dichotomous traits, there will be 2m possible joint binary phenotypes. With the threshold model, mapping binary traits has been formulated as a problem of mapping quantitative trait loci. Therefore, methods of QTL mapping for multiple quantitative traits (Jiang and Zeng 1995) may be adopted here in multiple binary trait mapping.

Model (3) is a general multiple linear model (GMLM) with missing values in xj because the QTL genotypes are not observable. The next step of the GMLM analysis with missing values is to infer the probabilities of QTL genotypes conditional on the marker information, denoted by pjq = Pr(xj = hq|IM) for q = 1, 2, 3, where IM denotes marker information. We can compute pjq using Table 1 of Jiang and Zeng (1995) if double recombinants are not considered or using Table 1 of Luo and Kearsey (1992) if no crossover interference is assumed. However, in general, we need to adopt the multipoint method (Jiang and Zeng 1997) to handle missing or dominance markers.

TABLE 1
QTL parameters used in the simulation experiments

Maximum-likelihood estimation:

Let us denote the parameters by a vector θ = {B, V}. The likelihood function for the jth individual conditional on xj is

equation M7
8

where wj = [wj1, … , wjm], g1(wjk) = (wjk − 1) × ∞, and g2(wjk) = wjk × ∞, for k = 1, … , m. Note that g1(wjk) = (wjk − 1) × ∞ = −∞ and g2(wjk) = wjk × ∞ = 0 if wjk = 0, whereas g1(wjk) = (wjk − 1) × ∞ = 0 and g2(wjk) = wjk × ∞ = ∞ if wjk = 1. The m-dimensional integral Φm(wj; xjB, V) may be calculated with some special algorithms or by executing an intrinsic function from some software packages. A two-dimensional integral can be found in SAS (SAS Institute 1999). The probability Pr(wj|xj, θ) is also called the penetrance of the QTL with genotype xj.

Since xj is missing and only pjq is calculated, the actual likelihood function for the jth individual is

equation M8
9

The overall log-likelihood for the entire mapping population is

equation M9
10

Solving the above log-likelihood function is tedious. We now introduce a simple EM algorithm to find the solution, which takes advantage of the simplicity of the original linear model with both equation M10 and equation M11 being treated as missing values.

Instead of directly maximizing the log likelihood given in Equation 10, the EM algorithm deals with the complete-data log-likelihood function,

equation M12
11

Because X and Y are treated as missing values, the EM algorithm starts with maximizing the expectation of the complete-data log-likelihood,

equation M13
12

where the expectation is taken with respect to X and Y, conditional on the data equation M14 and the parameters at the current values θ(t). For brevity, we use E[ξ(X, Y)] to denote the conditional expectation throughout the entire text, but the full notation for the expectation should be

equation M15

where ξ(X, Y) represents the term whose expectation is required in the EM algorithm. Note that Equation 12 differs from Equation 10 in two aspects: (i) Equation 10 takes the expectation before the log transformation whereas Equation 12 takes the expectation after the log transformation; and (ii) the expectations are taken using different probability distributions for the two equations. In Equation 12, the expectation is taken using the probability distribution conditional on the current parameter values and the phenotypic values. Maximizing Equation 12 with respect to the parameters, we get

equation M16
13
equation M17
14

The MLE of B and V in the complete-data situation (both X and Y are observed) can be found in Anderson (1984)(Sect. 8.2). Giri (1996)(pp. 92–98) also provided the derivation in the simple case where xjB = μ is a 1 × m vector of means. The results shown in Equations 13 and 14 are extensions of the results in the complete-data situation by adding the symbols of conditional expectation. appendix C gives a step-by-step derivation of Equations 13 and 14.

In binary data analysis under the liability model, the usual restriction is to let all the variances (diagonal elements of matrix V) equal unity (McCulloch 1994). If we had maximized the complete-data log likelihood function with such a restriction, the maximization step would be extremely complicated because there is no easy way to find a closed-form expression for V. Trying to find an explicit form for V, one would lose the advantage of the EM algorithm. Fortunately, Gueorguieva and Agresti (2001) showed that maximizing Equation 12 with V unrestricted, the EM algorithm converges to unique estimates of the fully identifiable ratios, BS−1, where

equation M18
15

Therefore, Gueorguieva and Agresti (2001) maximized the log-likelihood function with V unrestricted and then standardized the model effects by taking B* = BS−1. The standardized covariance matrix became R = S−1VS−1. At each EM iteration, Gueorguieva and Agresti (2001) estimated B and V and immediately replaced these two parameters by their standardized versions, B* and R, before entering the next EM iteration to make sure that the EM converges. In our EM algorithm, we have already defined B as a standardized vector of genetic effects. We simply need to standardize V during each iteration of the EM algorithm to ensure the convergence of the iterations. The estimated correlation matrix R is indeed the MLE of R based on the invariance property of the MLE (DeGroot 1986) because V is the MLE of V and R is a function of V. Equations 13 and 14 represent the maximization step of the EM algorithm. We now investigate the expectation step of the EM algorithm. Recall that the probability of xj conditional on marker information is denoted by pjq. This probability may be called the prior probability. After incorporating the phenotypic value and the parameters, we obtain the posterior probability, denoted by

equation M19
16

Note that the V matrix has been replaced by the R matrix to reflect the standardization. In fact, the unrestricted covariance matrix V is used only once when we try to estimate it. Once V is estimated, it is immediately standardized into the form of R, which is then used in all steps of the EM iterations. The expectations are actually obtained using the posterior probabilities rather than the prior probabilities. Therefore, the conditional expectations given the data W are

equation M20
17

where E(yj|wj, hq, θ) is the expectation of a truncated multivariate normal distribution. The residual error covariance matrix in Equation 14 becomes

equation M21
18

where μjq = E(yj|wj, hq, θ) is a short notation for the conditional expectation and Ujq = Var(yj|wj, hq, θ) denotes the conditional covariance matrix of a truncated multivariate normal distribution. Neither μjq nor Ujq has an explicit expression except in the special case when m = 2 and 3. These expectation and covariance matrices are calculated using the moment-generating function (Tallis 1963) or the Gibbs sampler (Chan and Kuk 1997). The moment-generating function approach needs multidimensional integrals and cannot be implemented easily in practice when m is large (see appendix A for the special case when m = 2). The Gibbs sampling approach requires Monte Carlo simulation, which is suitable for large m. Details of the Monte Carlo method are given in appendix B.

The EM algorithm may be summarized in the following steps:

  • Step 1. Choose the initial values for θ, θ(0) = {B(0), R(0)}.
  • Step 2. Calculate the posterior probabilities of the QTL genotype given the current values of all unknowns using Equation 16.
  • Step 3. Calculate the expectations using Equation 17, a process in the E-step.
  • Step 4. Calculate μjq = E(yj|wj, hq, θ) and Ujq = Var(yj|wj, hq, θ) using the moment-generating function or the Gibbs sampler (see appendix A and appendix B), another process of the E-step.
  • Step 5. Update parameter B using Equation 13, update parameter V using Equation 14, and convert V into R.
  • Step 6. Replace the initial parameters by the updated values and repeat steps 2–5 until convergence.

Likelihood-ratio test:

Define the log-likelihood function evaluated at the maximum-likelihood estimate (MLE) of parameters as

equation M22
19

where equation M23 and [theta w/ hat] is the MLE of θ. This is also called the likelihood value under the full model. We need the likelihood values under various restricted models to test different hypotheses.

The overall null hypothesis is “no effect of QTL at the locus of interest,” denoted by H0: b1k = b2k = 0 for k = 1, … , m or H0: LB = 0, where

equation M24

If we solve the MLE of the parameters under the restriction of LB = 0 and evaluate the likelihood function evaluated at the solutions with this restriction, we have

equation M25
20

where [theta w/ tilde] is the MLE of θ under the restricted model. The likelihood-ratio test statistic is

equation M26
21

Under the null hypothesis, this test statistic will approximately follow a chi-square distribution with 2(m − 1) d.f.

Various other test statistics may be defined by choosing different L matrices, as given by Jiang and Zeng (1995). For example, to test the additive effects for all traits, the L matrix is defined as L = [0 1 0]. The degrees of freedom for this test are m − 1. To test the dominance effects for all traits, we use L = [0 0 1] with m − 1 d.f. for the test statistic.

To test trait-specific effects, we need to introduce another matrix, T, which is used to postmultiply matrix B. For example, to test QTL effects (both additive and dominance) for the kth trait, the null hypothesis is H0: LBT = 0, where

equation M27

and T is an m × 1 vector with the kth element being one and all the remaining elements being zeros. The test statistic will be

equation M28
22

with 2 d.f. In general, the L matrix controls the type of effects (population mean, additive, and dominance) being tested and the T matrix controls the traits (from 1 to m) being tested.

The position of the QTL is another parameter of interest. However, in the one-dimensional genome scan, the position is first treated as fixed and then the entire genome is tested for every putative position with a 1- or 2-cM increment. The likelihood-ratio test statistic forms a test statistic profile. The position corresponding to the highest peak is declared as the estimated QTL position if the peak surpasses a given critical value (Churchill and Doerge 1994; Diggle et al. 1996; Piepho 2001).

Joint mapping for mixed types of traits:

We now describe a statistical model and likelihood analysis for joint mapping of loci that affect one binary trait and multiple quantitative traits. Let wj1 be the phenotype of the binary trait for the jth individual and defined as wj1 = 1 if individual j is affected and wj1 = 0 if it is not affected. Further define yjk as the value of the kth observed quantitative trait, for k = 2, … , m, on the jth individual. We also define yj1 as the liability for the binary trait,

equation M29

The liability of the binary trait and the phenotypic values of the quantitative traits are arranged in a vector called

An external file that holds a picture, illustration, etc.
Object name is 19406ea.jpg

where equation M30 is a special notation for a subvector of yj that excludes the first element. The yj vector is described by the same model as given in Equation 3. Let us further partition matrix B into equation M31, where equation M32 is the first column of matrix B and equation M33 is the remaining columns of B. The residual errors are joint normal with mean zero and a covariance matrix V, which can be partitioned into

An external file that holds a picture, illustration, etc.
Object name is 19406.e23.jpg
23

where equation M34, and

An external file that holds a picture, illustration, etc.
Object name is 19406eb.jpg

is the lower right block of matrix V. The variance of the liability for the disease trait, however, is restricted to unity. Therefore, the standardized form of the V matrix is R = S−1VS−1, which is a function of the unrestricted covariance matrix V, where S = diag(σ1 1 … 1). Similar partitioning given in Equation 23 also applies to matrix R. The joint distribution of the phenotype for individual j is

equation M35
24

where

equation M36
25

and

equation M37
26

The probability density [var phi]yj1; yjequation M38, xjb1, R11 within the integral is a conditional density of yj1 given yjequation M39. It is a univariate normal with mean

equation M40
27

and variance

equation M41
28

Therefore, Φ(wj1; yj(−1), xjb1, R11) is a truncated univariate normal probability.

The likelihood function for the jth individual is

equation M42
29

The overall log-likelihood of the entire mapping population is

equation M43
30

Again, we adopt the EM algorithm to find the MLE of parameters. The maximization step is accomplished through Equations 13 and 14. The expectation step requires calculation of the posterior probabilities of QTL genotypes and then uses these probabilities to find various expectations. The posterior probability of a QTL genotype is

equation M44
31

from which we get the expectations

equation M45
32

where

equation M46
33

is a 1 × m vector, which has been partitioned into a scalar Eyj1|wj1, hq, θ, yjequation M47 and a vector yjequation M48. The expectation is taken only for the liability of the binary trait. The remaining traits already take the observed values and thus no expectations are taken. The expectation for the liability, Eyj1|wj1, hq, θ, yjequation M49, is obtained from the truncated normal distribution with mean given by Equation B5 of appendix B and variance given by Equation B6 of appendix B. The residual error covariance matrix is given by Equation 18. However, the conditional expectation is replaced by

equation M50
34

and the conditional variance by

An external file that holds a picture, illustration, etc.
Object name is 19406.e35.jpg
35

where Varyj1|wj1, hq, θ, yjequation M51 is the variance of a truncated normal distribution. Both Eyj1|wj1, hq, θ, yjequation M52 and Varyj1|wj1, hq, θ, yjequation M53 can be found from the truncated normal distribution (Cohen 1991) and no Gibbs sampler is required.

RESULTS

Simulation studies:

To further evaluate the properties of the proposed method, we conducted two simulation experiments. For the sake of simplicity, we designed one experiment to evaluate the performance of joint mapping for two binary traits and another experiment for the mixture of one binary and one quantitative trait. In each experiment, one chromosome with 11 evenly distributed markers was simulated for an F2 population. We simulated a single QTL located at 35 cM of the chromosome and the QTL effects of both traits under three different levels of heritability (proportion of variance in liability explained by the QTL). The effects of the QTL used in the simulation experiments are given in Table 1. The correlation coefficient between the residuals of the liabilities for the two traits was chosen at 0.25. Under these settings, both traits had the same heritability. The three levels of the QTL effects led to three different levels of the heritability: 5, 10, and 15%. The genetic correlation between the two traits was expected to be 1.0, −0.446, and 0.423, respectively, for the different chosen levels of the heritability. The sample size of the simulated F2 population was n = 200. Each simulation experiment was replicated 100 times.

The first simulation experiment was to evaluate the efficiency of the joint binary trait mapping. We first simulated the liabilities of the two traits and then artificially truncated the continuous liabilities into two binary phenotypes using a threshold of zero. In the second simulation experiment, we truncated only the liability of the first trait to generate a binary phenotype but left the second trait intact so that we had one binary trait and one continuous trait.

Each data sample was analyzed using both the joint-mapping and single-trait-mapping statistics. For the single-trait analysis, we used Lander and Botstein's (1989) method for the continuous trait and the method of Xu et al. (2003) for the binary trait. Since we considered only two traits in the joint mapping, explicit formulas were used in each of the EM iteration steps (see appendix A for the explicit formulas). The critical values for the chromosomewise type I error rate of 5% were determined by the approximate method of Piepho (2001). In real data analysis, one should use the permutation test (Churchill and Doerge 1994) to obtain the most appropriate critical values for significance tests. For the joint analysis, the empirical power was determined by the proportion of the replicated samples (out of 100) whose highest test statistic values along the chromosome were greater than Piepho's (2001) critical value. The peak where the highest test statistic occurred was usually close to the true QTL position. However, a significant QTL was declared even if the peak was not exactly at the true position. For the separate analyses of individual traits, the statistical power was determined for the analysis of each trait as in the joint analysis. The critical value was recalculated for each trait in each replicate.

Tables 2 and and33 show the observed powers of QTL detection, the mean, and standard deviations (SD) of the estimated QTL locations and effects obtained from 100 replicated simulations. We compared the power of the joint analysis with that of a single-trait analysis for each trait separately. Joint analysis has a substantially higher power than either single-trait analysis. We understand that this may not be a fair comparison because joint analysis uses two traits while the single-trait analysis uses only one trait. However, this has been the standard way for comparison of joint mapping with separate mapping (Jiang and Zeng 1997). One may want to redefine the power for the separate analyses as the ability to detect at least one QTL effect (either additive or dominance) in at least one trait. Under this definition of the power, results of the two separate analyses may be combined so that the power is recalculated in the combined result. The combined power analysis requires either redefining the critical values by taking into account the multiple tests, which is difficult because the two separate analyses may be highly correlated, or simply using the sum of the powers of separate analyses (with an appropriate adjustment) as the combined power, which is p1 + p2p12, where p1 and p2 are the powers for traits 1 and 2, respectively, and p12 is the proportion of the replicated simulations in which both traits are significant. For example, among the 100 replicates, if a significant QTL effect is detected in 50 samples for the first trait and a significant QTL effect is detected in 80 samples for the second trait, then p1 = 0.5 and p2 = 0.8. If QTL effects for both traits are detected in 40 samples, then p12 = 0.4. The combined power for the separate analyses will be 0.5 + 0.8 − 0.4 = 0.9. Using this approach to calculating the power, the combined power of the separate trait analyses was almost identical to that of the joint analysis (data not shown). Therefore, power increase in joint mapping as opposed to separate mapping depends on how one defines the power in the separate analyses. From the traditional definition of statistical power for single-trait analysis (Jiang and Zeng 1997), joint analysis has higher power than single-trait analysis, i.e., joint power greater than p1 and joint power greater than p2. But the joint analysis has an equivalent power to the combined power for separate analyses if the combined power is defined as p1 + p2p12, i.e., joint power ≈ p1 + p2p12.

TABLE 2
Comparison of joint mapping with single trait mapping from the first simulation experiment (both traits are binary)
TABLE 3
Comparison of joint mapping with single trait mapping from the second simulation experiment (one binary trait and one quantitative trait)

The QTL effects and their standard deviations estimated from the joint mapping are comparable to those obtained from separate analyses. No obvious advantage of the joint mapping has been demonstrated from the simulation studies with respect to the estimates of QTL effects. The real advantage of the joint mapping over separate analyses has been demonstrated by the increased precision of the QTL position estimates in all situations examined (see Tables 2 and and33).

Overall, the parameter estimates are fairly close to the true parametric values. The general trend follows our expectation: high heritability tends to produce more accurate estimates than low heritability. If we compare the joint mapping of two binary traits with that of one binary and one continuous trait, we will note the power difference between the two experiments. Experiment 2 shows higher powers than experiment 1. This observation also follows our expectation because binary data are not as informative as continuously distributed data.

Mapping rice blast resistance loci:

Developing blast resistance cultivars is one of the major objectives in rice (Oryza sativa L.) breeding in both tropical and temperate countries. The causal organism of the rice blast, Pyricularia grisea, is known for its high genetic variability, allowing it to overcome the resistance of the host plant. A framework linkage map was developed using 284 F10 recombinant inbred lines (RILs) from a “Lemont” × “Teqing” rice cultivar cross. A subset of 245 RILs innoculated with two rice blast races, IB54 and IG1, was used to map loci responsible for the hypersensitive reaction. Details of the experimental design, the measurements of phenotypes, and genotypes can be found in the original article by Tabien et al. (2000). The phenotypes were evaluated using a completely randomized design with three replicates. In other words, each line was evaluated three times for its reaction to each of the phathogen infections. The original scores of the plant response were measured from grade 0 to grade 5. The average score of the three replicates for each line was recorded as the raw data observation. The binary phenotype was defined as w = 0 if the average score was within the range 0–3 and w = 1 if the average score was 4–5. We were provided only with the binary data, not the original scores. The breeders were more interested in the genetic study of the qualitative dichotomous trait than in the genetic study of the numerical scores. This explains why we were approached by the breeders to analyze their data using the new methods.

Since the mapping population was a RIL population, a slight modification of our method for F2 was required. We replaced the probability transition matrix of F2 by that of F10 in calculating the conditional probability of QTL genotype (Jiang and Zeng 1997). There was still a 4% residual heterozygosity in the RIL lines (due to F10 instead of F), which is sufficiently high to allow the dominance effects to be estimated. We treated the plant responses to blast pathogen races IB54 and IG1 as two separate binary traits. Therefore, joint mapping for both traits and separate mappings for individual traits were conducted for comparisons. The critical values of test statistics used to declare QTL were calculated using the method of Piepho (2001).

Table 4 shows the results of joint mapping and separate analyses. The joint mapping may have a greater power than separate analyses, as demonstrated by more detected QTL and higher test statistic values. A total of five resistance loci were identified by the joint mapping (qtl1–qtl5), but only four of them were detected with separate analyses (qtl4 was missed). Of these detected QTL, three of them (qtl1, qtl3, and qtl4) corresponded to Pi-tq5, Pi-lm2, and Pi-tq6 detected previously on the basis of chi-square tests of individual marker-trait associations (Tabien et al. 2000). Two additional loci (qtl2 and qtl5) were detected on chromosomes 3 and 12, and they were not reported in the previous study (Tabien et al. 2000). For each of the two loci, the allele carried by the Lemont parent was responsible for the resistance. None of the genetic parameters, e.g., the QTL effects and positions, were estimable in the previous chi-square tests conducted by the original authors (Tabien et al. 2000). The most striking result from the joint mapping was that all five resistance loci showed fairly consistent effects against both P. grisea races, while different resistance loci were detected separately by the single-trait analyses.

TABLE 4
QTL mapping result for rice blast resistance in the “Lemont” × “Teqing” crossing experiment

It is worth mentioning that results of joint mapping and separate mapping do not seem to be consistent in the real data analyses. This inconsistency, however, did not occur in the simulation studies. The reason for this is that we have taken a one-dimensional genome-scan approach, which uses a single-QTL model. In the simulation studies, we indeed simulated a single QTL. Therefore, the model adequately described the data. In the real data analysis, however, we used the single-QTL model to fit data controlled by apparently multiple QTL. The remaining QTL not fitted in the model may have caused all the inconsistencies observed between the joint and the separate analyses. In addition, the background QTL also have caused the observed high residual correlation. These problems can be solved by fitting a multiple-QTL model (see the discussion in a later section).

Table 5 shows the probabilities of the four possible phenotypic combinations under different genotypes of the identified QTL. For a single disease trait, penetrance is defined as the probability that a specific QTL genotype shows the affected phenotype. Penetrance has not been defined for multiple disease traits. Therefore, we listed the probabilities of all the four possible phenotype combinations for all genotypes of each detected QTL in Table 5. The penetrances of any particular genotypes for each QTL may be calculated from this table. For example, if we define the penetrance of a genotype as the probability that a plant with this genotype is affected by either of the two pathogens, the penetrance should be calculated using 1 − Pr(IB54 = R and IG1 = R). On the other hand, if we define the penetrance as the probability that the plant is affected by both pathogens, then we should use Pr(IB54 = S and IG1 = S). The marginal penetrance for one pathogen, say pathogen IB54, should be defined as

equation M54

Taking the first genotype of the first QTL, for example, we may be able to find penetrances defined in all possible ways, as shown below,

equation M55

and

equation M56
TABLE 5
The penetrances of QTL genotypes for rice blast resistance in the “Lemont” × “Teqing” crossing experiment

Interested rice geneticists and breeders may want to find out all kinds of penetrances of interest from Table 5. This table may also help rice breeders develop an optimal marker-assisted seletion scheme to improve blast resistance in rice.

DISCUSSION

Joint mapping offers several advantages over single-trait analyses. First, joint mapping may increase statistical power of QTL detection compared to single-trait analyses. Second, joint analysis can improve the precision of parameter estimation. Third, joint mapping provides an opportunity to answer more questions related to the genetic architecture of complex traits. These have been discussed by many authors (Jiang and Zeng 1995; Korol et al. 1995; Mangin et al. 1998; Henshall and Goddard 1999; Knott and Haley 2000) in multiple quantitative traits QTL mapping. Similar advantages also have been demonstrated here in the joint mapping for multiple binary traits. In this study, we paid more attention to the development of the EM algorithm rather than to various hypotheses tests, because the latter have been fully addressed by Jiang and Zeng (1995). In addition, the method was derived in the context of interval mapping. Extension to composite interval mapping should be preferred in practice, but this is simply a matter of implementation. Furthermore, the proposed method for F2 populations can be easily extended to other types of populations, e.g., backcrosses or four-way crosses, as demonstrated by the extension from F2 to RILs described in this study. The method differs from one mating design to another only by the possible different number of genotypes and different transition matrix from one locus to another.

In fact, there has been much work on single binary trait mapping (Hackett and Weller 1995; Xu and Atchley 1996; Yi and Xu 1999, 2000; Xu et al. 2003), using likelihood-based methods or Bayesian methods. However, the method of separate analyses of individual binary traits is, so far, the only approach currently available. For the first time, we developed the full probability model for joint mapping of multiple binary traits. The method requires numerical multiple integrals, as we know that high-dimensional numerical integration cannot be implemented easily in practice. Therefore, we presented the method using two traits as examples. In real data analysis, one may pay more attention to the information extracted from the data and thus may wish to perform joint mapping for more than two traits using the general algorithm developed here. Two factors may limit the number of traits included in the analysis. One is the computing time and the other is the difficulty in interpreting the results. For the rice blast data analysis with two binary traits, QTL search for the entire rice genome took ~10 min, which is quite reasonable. For more than two traits, computing time is a big factor of concern. We highly recommended using a different but fast numerical integration algorithm specially designed for high-dimensional integration, e.g., Monte Carlo integration. The Bayesian method implemented via Markov chain Monte Carlo (MCMC) is an ideal tool to accomplish this. In addition, the Bayesian method can handle the multiple-QTL model with ease. To deal with the problem of interpretation, one must have some intuitive knowledge about the trait relationships and hypotheses underlying the traits. In the disease-resistance case, one would be interested not only in the number of loci involved, but also in the level of race specificity of individual resistance loci, since the hypersensitive response of rice to P. grisea is known to be controlled by the gene-for-gene system (Silué et al. 1992). However, this gene-for-gene system normally assumes that only two consequences, resistance or susceptibility, would result from interactions between alleles at a resistance locus of host plants and alleles at its corresponding avirulence locus in pathogens, which may not be always true, as is discussed in the following section; imperfect penetrance appears to be an important feature of resistance loci involved in the gene-for-gene interactions between host plants and their pathogens.

We took the maximum-likelihood approach and implemented the method via the EM algorithm. This is different from the GEE method described earlier. We favor the EM algorithm because it was developed on the basis of all existing theory and methods currently used for mapping loci of regular quantitative traits. In single binary trait mapping, one of the most frequently asked questions is “What is the advantage of using the probability model over the simple analysis that treats the binary traits as if they were continuous traits?” (Visscher et al. 1996). The same question also may be asked here for joint mapping of multiple binary traits. Although we did not try the joint analysis of binary traits by ignoring the binary nature of the traits, we predict that treating binary traits as if they were continuous traits may result in similar power in most situations. In some special cases, the probabilistic model may provide higher power than the simplified analysis, and we have not figured out the parameter range in which this will happen. The probabilistic model approach enables estimation of penetrance of a particular QTL genotype, which is an important property of genes involved in human diseases (Terwilliger and Weiss 1998) and plant disease resistance, as we demonstrated here. McIntyre et al. (2001) developed a different probabilistic model for QTL mapping of binary traits. The method also allows calculation of penetrance. Extension of their model to multiple binary trait analysis is another alternative approach. We did not choose this extension because the threshold model via the EM algorithm has a natural connection to existing methods of QTL mapping.

Finally, with the successful development of joint mapping of both multiple quantitative and qualitative traits, an important but largely unexploited area in genetic mapping begins to emerge, i.e., the joint analysis of mixed types of traits. Although some authors (Williams et al. 1999; Huang and Jiang 2003) already exploited this idea in the context of human genetic mapping under the IBD-based random model framework, it has never been explored in QTL mapping of experimental populations. The method may be particularly useful in situations where mapping qualitative disease resistance of the gene-for-gene system is the primary objective while traits associated with quantitative resistance to the same or different pathogens are measured as by-products in the experiment. This coexistence of qualitative and quantitative resistance is widely present in many relationships between host plants and their pathogens in natural and agricultural systems (Leonard and Czochor 1980). Joint analyses of the correlated qualitative and quantitative phenotypes may substantially increase the power of detecting disease resistance loci and allow exploration of new features of loci involved.

Acknowledgments

We are grateful to three anonymous reviewers for their suggestive comments on early versions of the manuscript. This research was supported by the National Institutes of Health grants R01-GM55321 and the United States Department of Agriculture National Research Initiative competitive grants program 00-35300-9245 to S.X.

APPENDIX A:

MOMENTS OF TRUNCATED BIVARIATE NORMAL DISTRIBUTION

Let zT = [z1 z2] be a vector of two variables distributed as a standardized bivariate normal distribution with correlation ρ. Let [a1, c1] and [a2, c2] be the double truncation points on variables z1 and z2, respectively, and define α1 = Pr(z1 > c1, z2 > c2) as the area (integral) within the domain. Let us further define the first and second moments of the truncated standardized bivariate normal distribution at z1 > c1 and z2 > c2 as

equation M57
A1

where

equation M58
A2

and

equation M59
A3

Equations A1 can be found from Tallis (1963). Similarly, we can calculate the first and second moments of truncated standardized bivariate normal distribution at z1 > a1 and z2 > c2, z1 > a2 and z2 > c1, and z1 > a1 and z2 > a2, respectively. We further denote the above four truncated domains by 1, 2, 3, and 4, respectively. The following formula is used to calculate the moments under the double truncation with [a1, c1] and [a2, c2],

equation M60
A4

where Ti, i = 1, 2, 3, 4, represents the arbitrary first moment of (A1), αi represents the probability under the corresponding truncated domain, and

equation M61
A5

APPENDIX B:

CONDITIONAL EXPECTATION AND VARIANCE VIA GIBBS SAMPLER

The basic idea of the Gibbs sampler is to find the distribution of one element, say yjk, conditional on the remaining components in vector yj and sample yjk from the conditional distribution. Under the assumption of multivariate normality for the liability vector, i.e., yj ~ Nm(xjB, R), the conditional density of a single component is univariate normal with mean and variance described as follows. First, let us make the following matrix partitioning, equation M62 where

equation M63
B1

is a special notation for a subset of vector yj that excludes yjk; i.e., the subscript equation M64 indexes all elements except k. Using this special notation we can partition matrix B into equation M65, where equation M66 is the kth column of matrix B and

equation M67
B2

is a submatrix of B with the kth column left out. Let us further partition matrix R into

equation M68
B3

where Rkk = 1, equation M69, and

An external file that holds a picture, illustration, etc.
Object name is 19406.x4.jpg
B4

Note that Requation M70 is the submatrix of R with the kth row and kth column removed. The above matrix partitionings allow us to define the conditional mean of yjk as

equation M71
B5

and the conditional variance as

equation M72
B6

Having found the distribution of one component conditional on the remaining components, one can easily sample each element from its perspective univariate normal distribution. The binary phenotype for each trait has not played a role in the above sampling scheme. To incorporate this information, we need to sample each liability from a truncated normal distribution with the mean and variance given above. For example, if wjk = 0, yjk should be sampled only if yjk ≤ 0. If wjk = 1, however, yjk should be sampled only if yjk > 0. In fact, we adopted the algorithm of Devroye (1986) to simulate a variable from a truncated normal distribution. This special algorithm has a 100% rate of acceptance. The Monte Carlo sampling process is repeated many times with the simulated yj forming a large sample, y1j, y2j, … , yMj, where M is a large number. Discarding the observations during the burn-in period and thereafter saving one observation every few cycles, we get a sample containing roughly independent observations, from which the sampled mean vector and the covariance matrix are calculated. The sampled mean and covariance matrix are used in place of μjq = E(yj|wj, hq, θ) and Ujq = Var(yj|wj, hq, θ).

APPENDIX C:

DERIVATION OF THE EM ALGORITHM

The expected likelihood function:

The complete-data log likelihood is

equation M73
C1

where θ = {B, V} is the vector of parameters and X and Y are the missing values. The data are the phenotypes of multiple binary traits, denoted by W.

The expectation of the complete-data log-likelihood function conditional on the current parameter values and the data is

equation M74
C2

where the expectation is taken with respect to the missing values, X and Y, conditional on the current parameters θ(t) = {B(t), V(t)} and the data W. Note that we use a special notation EY|X to denote conditional expectation with respect to Y given X. The expectation of the complete-data log-likelihood (C2) is the target function subject to maximization in the EM algorithm.

Maximization with respect to B:

The expectation of the complete-data log-likelihood function relevant to B is

equation M75
C3

The partial derivative of L(B|θ(t)) with respect to B is

equation M76
C4

Setting (C4) equal to zero and solving for B, we obtain

equation M77
C5

In the main text, we used the following simple notation for the conditional expectations,

equation M78

With this short notation, the solution for B becomes

equation M79
C6

which concludes the proof of Equation 13 of the main text.

Maximization with respect to V:

The expectation of the complete-data log-likelihood function relevant to V is

equation M80
C7

The partial derivative of this likelihood function with respect to V is complicated, but the derivative of L with respect to V−1 is straightforward. On the basis of the invariance property of ML analysis, if equation M81 is the MLE of V−1, then equation M82 should be the MLE of V. Therefore, we set the partial derivative of L with respect to V−1 equalto zero and solve for V, as

equation M83
C8

Setting (C8) equal to zero and solving for V, we get

equation M84
C9

In the main text, we adopted a short notation for the expectation and denoted Equation C9 by

equation M85
C10

This proves Equation 14 of the main text of the article.

References

  • Anderson, T. W., 1984 An Introduction to Multivariate Statistical Analysis, Ed. 2. Wiley, New York.
  • Chan, J. S. K., and A. C. Kuk, 1997. Maximum likelihood estimation for probit-linear mixed models with correlated random effects. Biometrics 88 86–97.
  • Churchill, G. A., and R. W. Doerge, 1994. Empirical threshold values for quantitative trait mapping. Genetics 138 963–971. [PMC free article] [PubMed]
  • Cohen, A. C., 1991 Truncated and Censored Samples. Marcel Dekker, New York.
  • DeGroot, M. H., 1986 Probability and Statistics. Addison-Wesley, Reading, MA.
  • Dempster, A. P., N. M. Laird and D. B. Rubin, 1977. Maximum likelihood from incomplete data via the EM-algorithm. J. R. Stat. Soc. 39 1–38.
  • Devroye, T., 1986 Non-Uniform Random Variable Generation. Springer- Verlag, New York.
  • Diggle, A. P., K.-Y. Liang and S. L. Zeger, 1996. Permutation tests for multiple loci affecting a quantitative character. Genetics 142 285–294. [PMC free article] [PubMed]
  • Giri, N. C., 1996 Multivariate Statistical Analysis. Marcel Dekker, New York.
  • Gueorguieva, R. V., and A. Agresti, 2001. A correlated probit model for joint modeling of clustered binary and continuous responses. J. Am. Stat. Assoc. 96 1102–1112.
  • Hackett, C. A., and J. I. Weller, 1995. Genetic mapping of quantitative trait loci for traits with ordinal distributions. Biometrics 51 1252–1263. [PubMed]
  • Hackett, C. A., R. C. Meyer and W. T. B. Thomas, 2001. Multi-trait QTL mapping in barley using multivariate regression. Genet. Res. 77 95–106. [PubMed]
  • Henshall, J. M., and M. E. Goddard, 1999. Multiple trait mapping of quantitative trait loci after selective genotyping using logistic regression. Genetics 151 885–894. [PMC free article] [PubMed]
  • Huang, J., and Y. M. Jiang, 2003. Genetic linkage analysis of a dichotomous trait incorporating a tightly linked quantitative trait in affected sib pairs. Am. J. Hum. Genet. 72 946–960. [PMC free article] [PubMed]
  • Jiang, C., and Z-B. Zeng, 1995. Multiple trait analysis of genetic mapping for quantitative trait loci. Genetics 140 1111–1127. [PMC free article] [PubMed]
  • Jiang, C., and Z-B. Zeng, 1997. Mapping quantitative trait loci with dominant and missing markers in various crosses from two inbred lines. Genetica 101 47–58. [PubMed]
  • Knott, S. A., and C. S. Haley, 2000. Multitrait least squares for quantitative trait loci detection. Genetics 156 899–911. [PMC free article] [PubMed]
  • Korol, A. B., Y. T. Ronin and V. M. Kirzhner, 1995. Interval mapping of quantitative trait loci employing correlated trait complexes. Genetics 140 1137–1147. [PMC free article] [PubMed]
  • Korol, A. B., Y. T. Ronin, A. M. Itskovich, J. Peng and E. Nevo, 2001. Enhanced efficiency of quantitative trait loci mapping analysis based on multivariate complexs of quantitative traits. Genetics 157 1789–1803. [PMC free article] [PubMed]
  • Lander, E. S., and S. D. Botstein, 1989. Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121 185–199. [PMC free article] [PubMed]
  • Lange, C., and J. C. Whittaker, 2001. Mapping quantitative trait loci using generalized estimating equations. Genetics 159 1325–1337. [PMC free article] [PubMed]
  • Leonard, K. J., and R. J. Czochor, 1980. Theory of genetic interactions among populations of plants and their pathogens. Ann. Rev. Phytopathol. 18 237–258.
  • Liang, K. Y., and S. L. Zeger, 1986. Longitudinal data analysis using generalized linear models. Biometrika 73 13–22.
  • Luo, Z. W., and M. J. Kearsey, 1992. Interval mapping of quantitative trait loci in an F2 population. Heredity 69 236–242.
  • Mangin, B., P. Thoquet and N. Grimslev, 1998. Pleiotropic QTL analysis. Biometrics 54 88–99.
  • McCulloch, C. E., 1994. Maximum likelihood variance components estimation for binary data. J. Am. Stat. Assoc. 89 330–335.
  • McIntyre, L. M., C. Coffman and R. W. Doerge, 2001. Detection and location of a single binary trait locus in experimental populations. Genet. Res. 78 79–92. [PubMed]
  • Piepho, H. P., 2001. A quick method for computing approximate thresholds for quantitative trait loci detection. Genetics 157 425–432. [PMC free article] [PubMed]
  • SAS Institute, 1999 SAS/IML User's Guide Version 8. SAS Institute, Cary, NC.
  • Silué, D., J. L. Notteghem and D. Tharreau, 1992. Evidence of a gene-for-gene relationships in the Oryza sativa-Magnaporthe grisea pathosystem. Phytopathology 82 577–580.
  • Tabien, R. E., Z. Li, A. H. Paterson, M. A. Marchetti, J. W. Stansel et al., 2000. Mapping of four major rice blast resistance genes from ‘Lemont’ and ‘Teqing’, and evaluation of their combinatorial effect for field resistance. Theor. Appl. Genet. 101 1215–1225.
  • Tallis, G. M., 1963. The moment generating function of the truncated multi-normal distribution. J. R. Stat. Soc. Ser. B 23 223–229.
  • Terwilliger, J. D., and K. M. Weiss, 1998. Linkage disequilibrium mapping of complex disease: Fantasy or reality? Curr. Opin. Biotechnol. 9 578–594. [PubMed]
  • Visscher, P. M., C. S. Haley and S. A. Knott, 1996. Mapping QTLs for binary traits in backcross and F2 populations. Genet. Res. 68 55–63.
  • Williams, J. T., P. Van Eerdewegh, L. Almasy and J. Blangero, 1999. Joint multipoint linkage analysis of multivariate qualitative and quantitative traits. I. Likelihood formulation and simulation results. Am. J. Hum. Genet. 65 1134–1147. [PMC free article] [PubMed]
  • Xu, S., and W. R. Atchley, 1996. Mapping quantitative trait loci for complex binary diseases using line crosses. Genetics 143 1417–1424. [PMC free article] [PubMed]
  • Xu, S., N. Yi, D. Burke, A. Galecki and R. A. Miller, 2003. An EM algorithm for mapping binary disease loci: application to fibrosarcoma in a four-way cross mouse family. Genet. Res. 82 127–138. [PubMed]
  • Yi, N., and S. Xu, 1999. Mapping quantitative trait loci for complex binary traits in outbred populations. Heredity 82 668–676. [PubMed]
  • Yi, N., and S. Xu, 2000. Bayesian mapping of quantitative trait loci for complex binary traits. Genetics 155 1391–1403. [PMC free article] [PubMed]

Articles from Genetics are provided here courtesy of Genetics Society of America
PubReader format: click here to try

Formats:

Related citations in PubMed

See reviews...See all...

Cited by other articles in PMC

See all...

Links

Recent Activity

Your browsing activity is empty.

Activity recording is turned off.

Turn recording back on

See more...