NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Hong H, Carlin BP, Chu H, et al. A Bayesian Missing Data Framework for Multiple Continuous Outcome Mixed Treatment Comparisons [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2013 Jan.



OA Data

We reviewed publications in English after 1979 that examined physical therapy interventions for community dwelling adults with knee pain secondary to osteoarthritis. A total of 4,266 references were retrieved.12 After screening out studies that contained no eligible exposure, target population, outcomes, or associative hypothesis tested, 422 references were included in our review. Knee pain, disability, quality of life, and functional outcomes after physical therapy interventions were reported in 193 RCTs; 84 of those met the study inclusion/exclusion criteria given in the next paragraph. Because definitions of physical therapy interventions and outcomes varied dramatically among studies, only a small proportion of comparisons met these criteria.

The inclusion/exclusion criteria were as follows. First, comparators could include no active treatment, usual care (education), sham stimulation (placebo), or another therapy intervention (that is, active-active trials were not excluded). Eligible patient-centered outcomes were knee pain, disability, quality of life, perceived health status, and global assessments of treatment effectiveness. The target population was adults with knee pain secondary to knee osteoarthritis in outpatient settings, including home-based therapy. Chronic OA was defined as meeting diagnostic criteria and having symptoms of OA for >2 months. We excluded populations with knee OA who had knee arthroplasty on the “study limb” within 6 months before the study, osteonecrosis, acute knee injuries, inflammatory arthritis, arthritis secondary to systemic disease, and physical therapy treatment combined with drug treatments. Because all included studies were held to the same inclusion and exclusion criteria, we assume that their populations are similar to each other.

For the present analysis, we selected the pain and disability outcomes as primary and secondary outcomes, respectively, resulting in the inclusion of 54 RCTs. Table 1 displays the data from these 54 RCTs, comprising aggregated continuous outcomes (sample mean and standard deviation [SD]) measuring the level of pain and disability after physical therapies using various standard scores. The OA data compare eight physical therapies (low intensity diathermy, high intensity diathermy, electrical stimulation, aerobic exercise, aquatic exercise, strength exercise, proprioception exercise, and ultrasound treatment) and three reference therapies (no treatment, placebo, and education). Under proprioception exercise, we also included tai chi and balance exercise. Most studies reported treatment outcomes at a single followup time, but when a study investigated outcomes at multiple followup times, we selected the one most commonly reported for that treatment. To measure the pain outcome, the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC), Visual Analogue Scale (VAS), Arthritis Impact Measurement Scale (AIMS), and other standard scores were used. For the disability outcome, the measurement tools included the WOMAC total, Medical Outcome Study (MOS) 36-Item Short-Form Health Survey (SF-36 physical function), AIMS, Health Assessment Questionnaire (HAQ), and Knee Injury and Osteoarthritis Outcome Score (KOOS). Although these scores do not share the same scale and differ in a few details, they measure the outcomes in broadly equivalent ways, and all of their scales cover the same qualitative ranges (from “no pain” to “extreme pain” for pain measurements, and from “no impairment” to “profound impairment” for disability). The scores they yield also tend to be highly correlated when reported for the same subjects.16-18 Because the scores' different scales make their values incomparable, we rescaled the mean scores to range from 0 to 10, where smaller values indicate a better condition, and called this the rescaled score. We also recalculated the SDs based on the transformation of the mean score, and call this the rescaled SD. We have no reason to doubt the appropriateness of a linear transformation here, but our methods apply equally well under nonlinear transformations if those are more appropriate clinically.
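The linear rescaling described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the `higher_is_better` flag, and the example scale bounds (a 0 to 100 instrument) are assumptions introduced here, and the report does not state how scales that run in the opposite direction were handled.

```python
def rescale_to_0_10(mean, sd, scale_min, scale_max, higher_is_better=False):
    """Linearly map a mean and SD from [scale_min, scale_max] onto [0, 10].

    Hypothetical helper: the report only says scores were rescaled to 0-10
    with small values indicating a better condition.
    """
    width = scale_max - scale_min
    if width <= 0:
        raise ValueError("scale_max must exceed scale_min")
    factor = 10.0 / width
    rescaled_mean = (mean - scale_min) * factor
    if higher_is_better:
        # Assumed handling: flip the scale so that small values are better.
        rescaled_mean = 10.0 - rescaled_mean
    # A linear map multiplies the SD by the absolute slope only.
    rescaled_sd = sd * factor
    return rescaled_mean, rescaled_sd
```

For example, a mean of 50 with SD 20 on a 0 to 100 scale maps to a rescaled score of 5 with rescaled SD 2.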

Table 1. Raw OA data.


Among the 54 studies, 51 measure the pain outcome, 26 measure the disability outcome, and 23 include both outcomes. Figure 1 shows the trial network among therapies for each outcome. The size of each node represents the number of studies investigating the therapy, and the thickness of each edge denotes the total sample size for that comparison. The numbers on the edges indicate the number of studies investigating the comparison. For example, for the pain outcome, five studies compare no treatment with proprioception exercise, yet that edge is thinner than the edge between education and strength exercise, which is based on only three studies. The network features are similar for both outcomes, but we have limited information on the disability outcome, with fewer connections between therapies and smaller total sample sizes overall than for the pain outcome.

Figure 1 is a network graph depicting the relations of physical therapies among studies in the OA data for each outcome; (a) pain and (b) disability. In this figure, each therapy is listed with the proper size of node indicating the number of studies investigating the therapy, and some nodes are connected if the data include the relations. The thickness of each edge implies the total number of samples for the relation, and the number of studies for the relation is on the line. Panel (a) has thick edges between no treatment and strength exercise, no treatment and aerobic exercise, and education and aerobic exercise with more than eight studies investigating these relations. In Panel (b), the network is simpler than in the pain outcome because only about half the studies reported the disability outcome. The edges between no treatment and strength exercise, no treatment and aerobic exercise, and education and aerobic exercise are thickest with four studies examining these relations.

Figure 1

Network graphs of OA data for each outcome; (a) pain and (b) disability. Note: The size of each node represents the number of studies investigating the therapy, and the thickness of each edge implies the total number of samples for the relation. The number (more...)


In MTCs, we must carefully distinguish between the terms treatment and arm. The former refers to a drug or device being tested, while the latter is the data on patients randomized to a particular drug or device in a single study. We must also distinguish between reference and baseline treatments. The reference treatment is a standard control treatment (often placebo, or simply no treatment) which can be compared with other active treatments. In our OA data, we select “no treatment” as the reference treatment among three possibilities (no treatment, education, and placebo). The baseline treatment is defined as the treatment assigned to the control arm in each study. That is, each study has its own baseline treatment, which is often the same as the reference treatment, but could differ. In this report, we assume there is no inconsistency, defined as discrepancy in treatment effects arising from direct and indirect comparisons.8

Suppose we are comparing K treatments from I studies in terms of L outcomes. For a continuous outcome, we assume that the data for a specific outcome from each study follow a normal distribution. That is,

$$\bar{y}_{ikl} \sim N\left(\Delta_{ikl},\ \sigma_{ikl}^{2} / n_{ikl}\right),$$

where $\bar{y}_{ikl}$ is the observed sample mean of the measurements, $\Delta_{ikl}$ is the unknown true population mean, $\sigma_{ikl}^{2}$ is the known sample variance, and $n_{ikl}$ is the number of subjects in the kth treatment arm from the ith study with respect to the lth continuous outcome. For simplicity, we consider k = 1 as the reference treatment. Generally, in meta-analysis, we cannot estimate within-study correlations because we have only aggregated data.19 We assume the $\bar{y}_{ikl}$ are independent across arms and outcomes in study i, since within-study correlations are not observed in every study.
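As a small numerical illustration of this arm-level likelihood, the sketch below evaluates the standard error of a sample mean and the normal log-likelihood contribution of one arm. The function names are hypothetical, and the sample variance is treated as known, as in the text.

```python
import math


def standard_error(sd, n):
    """Standard error of an arm-level sample mean with known SD."""
    return sd / math.sqrt(n)


def arm_log_likelihood(y_bar, delta, sd, n):
    """Log density of y_bar under N(delta, sd^2 / n)."""
    var = sd ** 2 / n
    return -0.5 * (math.log(2.0 * math.pi * var) + (y_bar - delta) ** 2 / var)
```

With SD 2 and 100 subjects, the standard error is 0.2, so an observed mean one unit away from the true mean is five standard errors out and contributes a much lower log-likelihood.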

Existing Lu and Ades-Style Model

Fixed Effects Model

For meta-analysis, a fixed effects model, which assumes no variability between studies, is easy to implement. Following Lu and Ades,7, 8 the model can be written as

$$\Delta_{ikl} = \begin{cases} \alpha_{iBl}, & k = B, \\ \alpha_{iBl} + \eta_{Bkl}, & k \neq B, \end{cases} \tag{1}$$

where B indicates the baseline treatment in study i. Here, $\alpha_{iBl}$ is the effect of the baseline treatment in study i, and $\eta_{Bkl}$ is the mean difference between treatment k and the baseline treatment (B) for outcome l. We define $d_{kl}$ as the mean difference between treatment k and the reference treatment for outcome l, with $d_{1l} = 0$. Thus, $\eta_{Bkl}$ can be calculated as $d_{kl} - d_{Bl}$, and we infer the treatment effects in terms of $d_{kl}$; that is, we assign a prior distribution to $d_{kl}$ rather than to $\eta_{Bkl}$. We denote this model as the Lu and Ades (LA)-style fixed effects model (LAFE). In this approach, the baseline treatment effect $\alpha_{iBl}$ must be interpreted with care, because not all studies have the same baseline treatment.
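The relation between the baseline contrasts and the reference contrasts can be made concrete with a short sketch. The dictionary layout and function names are illustrative assumptions; only the identity that the contrast of treatment k against baseline B equals d_k minus d_B (with d_1 = 0) comes from the text.

```python
def baseline_contrast(d, baseline, treatment):
    """Contrast of `treatment` versus `baseline`, from reference contrasts d.

    `d` maps treatment index -> mean difference versus the reference
    treatment (index 1), so d[1] must be 0.
    """
    return d[treatment] - d[baseline]


def fixed_effect_mean(alpha_baseline, d, baseline, treatment):
    """Arm mean under the fixed effects model for one study and outcome."""
    if treatment == baseline:
        return alpha_baseline
    return alpha_baseline + baseline_contrast(d, baseline, treatment)
```

For instance, with hypothetical values d = {1: 0.0, 2: -1.0, 3: -2.5}, a study whose baseline is treatment 2 has a treatment 3 contrast of -1.5.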

Random Effects Model

Next, to allow variability between studies, we introduce random effects $\delta_{iBkl}$ in place of the $\eta_{Bkl}$. Specifically, model (1) is respecified as

$$\Delta_{ikl} = \alpha_{iBl} + \delta_{iBkl}, \tag{2}$$

where we can assume homogeneous variance across random effects for all arms, i.e.,

$$\delta_{iBkl} \sim N\left(d_{kl} - d_{Bl},\ \tau_{l}^{2}\right). \tag{3}$$

Here, $\delta_{iBkl}$ is 0 when k = B, and $\tau_{l}$ is the standard deviation of the random effects for each outcome l. We denote this model as the Lu and Ades-style homogeneous random effects model (LAREhom). For multi-arm trials, Lu and Ades specify a correlation of 0.5 between arm contrasts, as a consequence of the homogeneous variance and their consistency equation.8 The $\delta_{iBkl}$ in (3) are then replaced by a vector $\delta_{il}$ that follows a multivariate normal distribution with dimension equal to the number of arms in study i minus one, for each outcome l.
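The covariance structure implied for a multi-arm trial under the homogeneous random effects model can be sketched as below: each contrast has variance equal to the squared random effects SD, and each pair of contrasts has correlation 0.5. The function name is hypothetical and the matrix is returned as plain nested lists.

```python
def multiarm_contrast_cov(tau, n_arms):
    """Covariance of the (n_arms - 1) baseline contrasts in one study.

    Diagonal entries are tau^2; off-diagonal entries are tau^2 / 2,
    which corresponds to a between-contrast correlation of 0.5.
    """
    if n_arms < 2:
        raise ValueError("a study needs at least two arms")
    m = n_arms - 1
    return [
        [tau ** 2 if r == c else 0.5 * tau ** 2 for c in range(m)]
        for r in range(m)
    ]
```

A two-arm study reduces to a single variance, so the correlation only matters for studies with three or more arms.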

Allowing for Missing Data and Correlations Between Outcomes

Contrast-Based Approach

We denote a model that parameterizes relative effects (e.g., the $\eta_{Bkl}$ and $\delta_{iBkl}$ in (1) and (2), respectively) as a contrast-based (CB) model. Lu and Ades-style models use such a CB approach. Note that the mean effect difference between treatment k and the reference treatment for outcome l ($d_{kl}$) is the parameter of interest in CB models. In MTCs, the number of treatments compared in the ith study is commonly smaller than the complete collection of K treatments. Since each study contributes to the likelihood for a different set of treatments, using only the observed measurements can complicate estimation of the covariance matrix for the $\delta_{il}$ and lead to difficulties in prior assignment and parameter inference. In addition, it is plausible that researchers select study arms based on previously conducted trials, which statisticians call “nonignorable missingness.” In this case, ignoring the missing treatment arms can lead to biased parameter estimates.15

To remedy this, we assume that all studies can in principle contain every treatment as an arm, but in practice much of this information is missing for various reasons. Under this assumption, all studies share a common (though possibly missing) baseline treatment, B = 1, and the distribution for the random effects $\delta_{iBkl}$ in (3) can be replaced with a matrix form as follows:

$$\delta_{il} \sim MVN\left(d_{l},\ \Sigma_{l}^{Trt}\right), \tag{4}$$

where $\delta_{il} = (\delta_{i12l}, \ldots, \delta_{i1Kl})^{T}$, $d_{l} = (d_{2l}, \ldots, d_{Kl})^{T}$, and $\Sigma_{l}^{Trt}$ is a (K − 1) × (K − 1) unstructured covariance matrix for l = 1, …, L. Note that since $\delta_{i11l}$ and $d_{1l}$ are always 0, they are not included in $\delta_{il}$ and $d_{l}$. Here, $\Sigma_{l}^{Trt}$ captures the relations among all random treatment contrasts within each outcome l. We refer to this model as a contrast-based random effects model assuming independence between outcomes (CBRE1).

To allow correlations among outcomes, the distribution of $\delta_{il}$ in (4) needs to be respecified as

$$\delta_{ik} \sim MVN\left(d_{k},\ \Sigma_{k}^{Out}\right), \tag{5}$$

where $\delta_{ik} = (\delta_{i1k1}, \ldots, \delta_{i1kL})^{T}$, $d_{k} = (d_{k1}, \ldots, d_{kL})^{T}$, and $\Sigma_{k}^{Out}$ is an L × L unstructured covariance matrix for k = 2, …, K. In this model, we assume independent random contrasts between treatments but incorporate the correlation structure of those contrasts between outcomes through $\Sigma_{k}^{Out}$. We call this model CBRE2. Alternatively, we can use the same $\Sigma^{Out}$ for all k, if such an assumption is sensible.

In this approach, the vector $\delta_{il}$ or $\delta_{ik}$ always has the same length in each study i, and we incorporate all sources of uncertainty by treating unobserved arms as missing data to be imputed by our MCMC algorithm using Gibbs-Metropolis sampling. For example, suppose Study 1 compares treatments 1, 2, and 3, giving information about two contrasts, $\delta_{i12l}$ and $\delta_{i13l}$, whereas Study 2 compares only treatments 1 and 2, and Study 3 includes only treatments 1 and 3. We can impute the missing contrasts $\delta_{i13l}$ and $\delta_{i12l}$ in Studies 2 and 3, respectively, by using the information on these contrasts observed in Study 1. In the LA models, the baseline treatment effect $\alpha_{iBl}$ in (2) is hard to interpret because each study may have a different baseline treatment. In our CB approach, however, $\alpha_{iBl}$ becomes meaningful because the baseline treatment is the same (B = 1) across all studies.

Although we introduced only the LA homogeneous random effects model, a heterogeneous random effects model can also be applied, provided the covariance matrices are constructed rigorously to satisfy the positive definiteness condition under the consistency assumption.20 However, our approach does not lead to this same set of consistency equations; the imputation allows us to independently estimate all possible contrasts in every study.

Arm-Based Approach

The CB method estimates the treatment contrasts, that is, the mean difference between treatment k and the reference treatment. However, the approach's singular focus on relative treatment effects leads to several limitations. First, although we may resolve the incomparable baseline treatment problem by imputing the missing arms in our CB models, LA models still need complex model parameterizations for studies with incomparable baseline treatments. Second, correlations between treatments or outcomes are difficult to interpret when expressed in terms of relative effects. For example, we cannot directly calculate the correlation between treatments from the correlation between differences of treatment effects. Furthermore, our CB model restricts the variance of a baseline effect to be no larger than that of the other treatments. That is, the variance of the population mean of the baseline treatment, $\Delta_{iBl}$, is $Var(\alpha_{iBl})$, whereas for other treatments we have $Var(\alpha_{iBl}) + Var(\delta_{iBkl})$, which is never smaller than $Var(\alpha_{iBl})$.

As an alternative, we introduce an arm-based (AB) approach10, 21 by respecifying mean structure (2) as

$$\Delta_{ikl} = \mu_{kl} + \nu_{ikl}, \tag{6}$$

where $\mu_{kl}$ is the fixed mean effect of treatment k with respect to outcome l and $\nu_{ikl}$ is the study-specific random effect. In this approach, we estimate the absolute treatment effect size, $\mu_{kl}$, not the relative effect size, $d_{kl}$.

If we begin by assuming independent random effects between outcomes, then the random effects $\nu_{ikl}$ in (6) can be structured as $(\nu_{i1l}, \ldots, \nu_{iKl})^{T} \sim MVN(0, \Lambda_{l}^{Trt})$, with $\Lambda_{l}^{Trt}$ a K × K unstructured covariance matrix capturing the relations of random effects between treatments, for l = 1, …, L. We denote this model as ABRE1. Alternatively, we can allow dependence of random effects between outcomes but independence between treatments by defining $(\nu_{ik1}, \ldots, \nu_{ikL})^{T} \sim MVN(0, \Lambda_{k}^{Out})$, where $\Lambda_{k}^{Out}$ is an L × L unstructured covariance matrix capturing the relations between outcomes, for k = 1, …, K. We refer to this model as ABRE2. Again, we can use the same $\Lambda^{Out}$ for all k when it is reasonable to do so.

The parameters in arm-based models permit more straightforward interpretation, especially in estimating a pure treatment effect. However, these models do require strong assumptions regarding the similarity and exchangeability of all populations, in order to preserve the randomization and permit meaningful clinical inference. Note that in AB models, there is no restriction on the variances of the random effects because all of our covariance matrices are unstructured. That is, AB models are less constrained, but as a result have a slightly larger number of parameters than CB models.

Choice of Priors

Lu and Ades assume a noninformative prior on each parameter, in order to let the data dominate the posterior calculation. For $\alpha_{iBl}$ and $d_{kl}$, a normal distribution with mean 0 and variance $100^{2}$ is used, and a Uniform(0.01, 10) is assigned for $\tau$ in LAREhom. In all CB models, we assume $\alpha_{iBl}$ follows a $N(a_{l}, \xi_{l}^{2})$ rather than a $N(0, 100^{2})$ distribution, where $a_{l}$ is the mean reference treatment effect, with noninformative priors for $a_{l}$ and $\xi_{l}$; namely, $N(0, 100^{2})$ and Uniform(0.01, 10), respectively. Throughout all CB and AB models, the fixed effects ($d_{kl}$ and $\mu_{kl}$, respectively) follow a $N(0, 100^{2})$ distribution, while the inverse covariance matrices follow a Wishart($\Omega$, $\gamma$) having mean $\gamma\Omega^{-1}$, with the matrix dimension usually chosen for the degrees of freedom parameter $\gamma$ because it is the smallest value that will still yield a proper prior.22 We can select $\Omega$ to be $\gamma$ times a prior guess for the covariance matrix ($\Omega_{0}$). Since we do not know the true covariance matrices, we begin with a vague Wishart prior having $\Omega_{0} = \begin{pmatrix} 5 & 0 \\ 0 & 5 \end{pmatrix}$, and later investigate more informative Wishart priors in a sensitivity analysis.
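To see what this Wishart specification implies, the sketch below computes the prior mean of a 2 × 2 precision matrix when the scale matrix is the degrees of freedom times a prior covariance guess. It assumes the parameterization stated in the text (mean equal to the degrees of freedom times the inverse scale matrix) and reads the prior guess as a 2 × 2 diagonal matrix with entries 5; the helper names are hypothetical.

```python
def inverse_2x2(m):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def wishart_prior_mean_precision(omega0, gamma):
    """Prior mean of the precision matrix under Wishart(Omega, gamma).

    Omega is taken to be gamma * omega0, and the mean is gamma * Omega^{-1},
    which simplifies to the inverse of the prior covariance guess omega0.
    """
    omega = [[gamma * x for x in row] for row in omega0]
    inv = inverse_2x2(omega)
    return [[gamma * x for x in row] for row in inv]
```

With a diagonal guess of 5 and two degrees of freedom, the prior mean precision is 0.2 on the diagonal, that is, the inverse of the prior guess regardless of the degrees of freedom.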


Regarding Bayesian model choice, we adopt the Deviance Information Criterion (DIC).22,23 DIC is a generalization of the Akaike Information Criterion to hierarchical models, and is the sum of $\bar{D}$, a measure of goodness of fit, and $p_{D}$, a measure of complexity. For all CB and AB models we implement, we insist that only the observed data contribute to the calculation of $\bar{D}$.24
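As a rough sketch of how DIC is assembled from MCMC output, the function below takes posterior draws of the deviance and the deviance evaluated at the posterior mean of the parameters. It uses the usual definition of the complexity term as the mean deviance minus the deviance at the posterior mean; the function name and inputs are illustrative, and restricting the deviance to observed data is left to the caller.

```python
def dic(deviance_draws, deviance_at_posterior_mean):
    """Return (D_bar, p_D, DIC) from posterior deviance draws.

    D_bar is the posterior mean deviance (fit), p_D is the effective
    number of parameters (complexity), and DIC = D_bar + p_D.
    """
    if not deviance_draws:
        raise ValueError("need at least one deviance draw")
    d_bar = sum(deviance_draws) / len(deviance_draws)
    p_d = d_bar - deviance_at_posterior_mean
    return d_bar, p_d, d_bar + p_d
```

Smaller DIC values indicate a better trade-off between fit and complexity when comparing models fitted to the same observed data.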

We can identify the best treatments based on a reasonable measurement of the effect size.25 For instance, we can calculate the probability of being the best or second best treatment, which we call the “Best12” probability. Suppose $\Delta_{kl}$ is the marginal mean effect for outcome l under treatment k. In CB models, it is obtained from (2) using the posterior of $d_{kl}$ and the posterior mean of $\alpha_{i1l}$ across studies, instead of $\delta_{iBkl}$ and $\alpha_{iBl}$. For AB models, we can obtain $\Delta_{kl}$ by plugging the posterior of $\mu_{kl}$ into (6), noting that the prior mean of $\nu_{ikl}$ is 0. Denoting the data on outcome l by $y_{l}$, we then define the “Best12” probability under each outcome as

$$\Pr\{k \text{ is the best or second best treatment} \mid y_{l}\} = \Pr\{\operatorname{rank}(\Delta_{kl}) = 1 \text{ or } 2 \mid y_{l}\}. \tag{7}$$
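The Best12 probability can be estimated from MCMC output by ranking the treatments within each posterior draw. The sketch below assumes that smaller values are better (as with the rescaled scores) and that each draw is a list of marginal mean effects, one per treatment; the function name and the tie handling (ties broken by treatment order) are assumptions.

```python
def best12_probabilities(draws):
    """Estimate Pr(rank = 1 or 2) for each treatment from posterior draws.

    `draws` is a list of MCMC draws; each draw is a list holding the
    marginal mean effect of every treatment for one outcome. Smaller
    values are treated as better.
    """
    if not draws:
        raise ValueError("need at least one posterior draw")
    n_treat = len(draws[0])
    counts = [0] * n_treat
    for draw in draws:
        # Treatment indices ordered from best (smallest) to worst.
        order = sorted(range(n_treat), key=lambda k: draw[k])
        for k in order[:2]:
            counts[k] += 1
    return [c / len(draws) for c in counts]
```

Because two treatments are counted in every draw, the estimated probabilities sum to 2 across treatments rather than to 1.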

To integrate these univariate probabilities over all the outcomes and obtain one omnibus measure of “best,” we propose an overall, weighted score denoted by $S_{k}$. If all measurements have the same directionality, that is, small values indicate a better condition for all outcomes, our overall score is defined as

$$S_{k} = \sum_{l} w_{l} \Delta_{kl}, \tag{8}$$

where $w_{l}$ is the weight for outcome l, and $\sum_{l} w_{l} = 1$. This score can be used to obtain overall Best12 probabilities by replacing $\Delta_{kl}$ with $S_{k}$ in (7). The weights can be chosen by physicians or public health professionals based on their preferences (say, for weighting safety versus efficacy).
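A minimal sketch of the weighted overall score follows. The input layout (one row of outcome-specific effects per treatment) and the function name are assumptions; the weights must sum to 1 as stated in the text.

```python
def overall_scores(delta, weights):
    """Weighted overall score for each treatment.

    `delta[k][l]` is the marginal mean effect of treatment k on outcome l
    (all outcomes oriented so that small values are better), and
    `weights[l]` is the weight of outcome l.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return [sum(w * x for w, x in zip(weights, row)) for row in delta]
```

Applying this to each posterior draw and then ranking the resulting scores gives the overall Best12 probabilities; changing the weights can change which treatment looks best, which is why the choice of weights is left to clinical judgment.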

Simulation Study Settings

In this simulation, we generate 1,000 data pairs (ȳik1, ȳik2) and fit the LAREhom, CBRE2, and ABRE2 models to investigate how the missingness in our design affects 5 percent two-sided Type I error, power, and the rates of incorrect decisions when the correlation between outcomes is incorporated into the models (CBRE2 and ABRE2) or not (LAREhom). Figure 2 illustrates the design of the simulated complete and partially missing data. For the “complete” data, we generate artificial data from 40 studies having two treatments and two outcomes featuring moderate positive correlation between outcomes, but independence between arms. In panel (b), we drop 20 studies in the first outcome; that is, we mimic our OA data, in which only half the studies report the disability outcome. For simplicity, we assume that every study has sample size 100 and standard deviation of 2 for every arm.

Figure 2 shows our simulation settings graphically; (a) complete data and (b) partially missing data. Panel (a) has four vertically long rectangles representing 40 studies. The first two rectangles are for the first outcome and the next two are for the second outcome with two treatments. Panel (b) has the same format as panel (a) but the lengths of first two rectangles are half of the original length representing 20 studies.

Figure 2

Data structure for simulation; (a) complete data and (b) partially missing data.

To sample the partially missing data, we compare the results under missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) mechanisms. The MCAR mechanism assumes that the missingness does not depend on the data, so we choose 20 studies randomly and make ȳi11 and ȳi21 missing for those studies. The MAR mechanism assumes that the missingness depends only on the observed data, but not on the missing data, whereas MNAR missingness can depend on both observed and unobserved data. To generate partially missing data under the MAR and MNAR mechanisms, we first calculate the ‘probability of missing’ (pi,mis) for study i by applying a logit model with the observed or missing data as covariates. Here ȳi12 and ȳi22 are considered as observed data, and ȳi11 and ȳi21 are missing data since they are not fully observed in our design. We use the following two logit models:

MAR: $\operatorname{logit}(p_{i,mis}) = 2 + \bar{y}_{i12} - \bar{y}_{i22}$
MNAR: $\operatorname{logit}(p_{i,mis}) = -4 - \bar{y}_{i11} + \bar{y}_{i22}$.

The coefficients are selected to result in a mean pi,mis of about 30 to 40 percent. Given pi,mis, we generate the missingness indicator vector until 20 studies are selected as missing data.
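One way to read this selection step is sketched below: compute each study's probability of missingness from a logit model, then redraw the whole indicator vector until exactly 20 studies are flagged. The redraw-until-exact rule is an assumption about the procedure, not something the text spells out, and the function names are hypothetical. Only the MNAR model, with intercept −4, a negative sign on the first-outcome mean of treatment 1, and a positive sign on the second-outcome mean of treatment 2, is shown.

```python
import math
import random


def inv_logit(x):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))


def mnar_probability(y_i11, y_i22):
    """Probability of missingness under the MNAR logit model."""
    return inv_logit(-4.0 - y_i11 + y_i22)


def draw_missing_studies(probs, n_missing, rng, max_tries=100000):
    """Redraw Bernoulli indicators until exactly n_missing studies are flagged.

    Assumed reading of the selection step; `rng` is a random.Random instance.
    """
    for _ in range(max_tries):
        flags = [rng.random() < p for p in probs]
        if sum(flags) == n_missing:
            return flags
    raise RuntimeError("no draw produced the requested number of missing studies")
```

At the true means used in the simulation (0 for the first outcome of treatment 1 and 3 for the second outcome of treatment 2), the MNAR linear predictor is −1, which corresponds to a missingness probability of roughly 0.27 before averaging over sampling variability.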

For the true parameters, we choose $(\mu_{11}^{*}, \mu_{21}^{*}, \mu_{12}^{*}, \mu_{22}^{*}) = (0, 0, 0, 3)$ in (6), yielding $d_{21}^{*} = 0$ and $d_{22}^{*} = 3$ in the LAREhom and CBRE models, where the superscript * indicates the truth. We calculate Type I error in terms of the parameter $d_{21}$ in the three models. To estimate power at two particular alternatives, we select $(\mu_{11}^{*}, \mu_{21}^{*}, \mu_{12}^{*}, \mu_{22}^{*}) = (0, 1, 0, 3)$ and $(0, 2, 0, 3)$, giving $d_{21}^{*} = 1$ and $2$, respectively, which we denote “Power1” and “Power2.” We also calculate the rate of incorrectly selecting the best treatment, given as $\Pr(\hat{\mu}_{11} > \hat{\mu}_{21})$, under the Power1 and Power2 scenarios, because the truth is that $\mu_{11}^{*} < \mu_{21}^{*}$. This rate should be around 0.5 under the Type I error setting.

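For the random effect parameters in (6), we generate them from

$$\begin{pmatrix} \nu_{i11}^{AB} \\ \nu_{i21}^{AB} \end{pmatrix} \sim MVN\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho^{AB} \\ \rho^{AB} & 1 \end{pmatrix} \right) \quad \text{and} \quad \begin{pmatrix} \nu_{i12}^{AB} \\ \nu_{i22}^{AB} \end{pmatrix} \sim MVN\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 3 & 3\rho^{AB} \\ 3\rho^{AB} & 3 \end{pmatrix} \right),$$

which on the CB scale corresponds to

$$\begin{pmatrix} \nu_{i21}^{CB} \\ \nu_{i22}^{CB} \end{pmatrix} \sim MVN\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 2 & 3\rho^{AB} \\ 3\rho^{AB} & 2 \end{pmatrix} \right).$$

Here, the superscripts AB and CB on $\nu_{ikl}$ and $\rho$ indicate the model used. From the covariance matrix of the random effects in the CB model, we can easily calculate the true correlation in the CB model, $\rho^{CB} = \frac{3}{2}\rho^{AB}$. To ensure a positive definite covariance matrix for the random effects in the CB model, $\rho^{AB}$ should therefore be between $-\frac{2}{3}$ and $\frac{2}{3}$. We set $\rho^{AB} = 0.6$ and $0.0$, which induces $\rho^{CB} = 0.9$ and $0.0$.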
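The stated link between the two correlations can be checked numerically. The sketch below assumes the CB-scale covariance matrix has variances 2 and covariance equal to three times the AB correlation, which is the reading consistent with the CB correlation being 3/2 times the AB correlation; the function names are hypothetical.

```python
import math


def cb_covariance(rho_ab):
    """Assumed CB-scale covariance of the two outcome contrasts."""
    return [[2.0, 3.0 * rho_ab], [3.0 * rho_ab, 2.0]]


def correlation_from_cov(cov):
    """Correlation implied by a 2 x 2 covariance matrix."""
    return cov[0][1] / math.sqrt(cov[0][0] * cov[1][1])


def is_positive_definite_2x2(cov):
    """A symmetric 2 x 2 matrix is positive definite iff a > 0 and det > 0."""
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    return cov[0][0] > 0 and det > 0
```

An AB correlation of 0.6 gives a CB correlation of 0.9 and a valid covariance matrix, whereas an AB correlation of 0.7 exceeds the 2/3 bound and the matrix is no longer positive definite.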

For the OA data analysis, WinBUGS is used to generate two parallel chains of 50,000 MCMC samples after a 50,000-sample burn-in. To check MCMC convergence, we used standard diagnostics, including trace plots and lag 1 sample autocorrelations. The WinBUGS codes are now publicly available at

We used the R2WinBUGS package26 in R to perform our simulation studies, where we call WinBUGS27 1,000 times from R, once for each simulated data set. In each case, we obtain 20,000 samples, after a 20,000 sample burn-in, and collect medians of parameters across 1,000 simulated datasets, then estimate Type I error and power.
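The summary step at the end of the simulation can be sketched as follows. The text does not state the exact rejection rule, so this example assumes that each simulated dataset yields a 95 percent posterior interval for the contrast of interest and that the null is rejected when the interval excludes 0; the function names and input layouts are hypothetical.

```python
def rejection_rate(intervals, null_value=0.0):
    """Share of simulated datasets whose interval excludes the null value.

    Under the null scenario this estimates Type I error; under an
    alternative scenario it estimates power. Assumed decision rule.
    """
    if not intervals:
        raise ValueError("need at least one interval")
    rejected = sum(1 for lo, hi in intervals if not (lo <= null_value <= hi))
    return rejected / len(intervals)


def incorrect_best_rate(estimates):
    """Share of datasets where the estimate for treatment 1 exceeds that for treatment 2.

    `estimates` is a list of (mu11_hat, mu21_hat) pairs, one per dataset.
    """
    if not estimates:
        raise ValueError("need at least one pair of estimates")
    wrong = sum(1 for mu11_hat, mu21_hat in estimates if mu11_hat > mu21_hat)
    return wrong / len(estimates)
```

With 1,000 simulated datasets per scenario, each of these rates is a simple proportion, so its Monte Carlo standard error is at most about 0.016.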

