NIH Public Access Author Manuscript; accepted for publication in a peer-reviewed journal.
J Off Stat. Author manuscript; available in PMC Nov 26, 2009.
Published in final edited form as:
J Off Stat. Dec 1, 2008; 24(4): 517–540.
PMCID: PMC2783643

Model Averaging Methods for Weight Trimming


In sample surveys where sampled units have unequal probabilities of inclusion, associations between the inclusion probabilities and the statistic of interest can induce bias. Weights equal to the inverse of the probability of inclusion are often used to counteract this bias. Highly disproportional sample designs have highly variable weights, which can introduce undesirable variability in statistics such as the population mean or linear regression estimates. Weight trimming reduces large weights to a fixed maximum value, reducing variability but introducing bias. Most standard approaches are ad-hoc in that they do not use the data to optimize bias-variance tradeoffs. This manuscript develops variable selection models, termed “weight pooling” models, that extend weight trimming procedures in a Bayesian model averaging framework to produce “data driven” weight trimming estimators. We develop robust yet efficient models that approximate fully-weighted estimators when bias correction is of greatest importance, and approximate unweighted estimators when variance reduction is critical.

Keywords: Sample survey, sampling weights, Bayesian population inference, weight pooling, variable selection, fractional Bayes Factors

1 Introduction

Analyses of data from samples designed to have differential probabilities of inclusion typically use case weights equal to the inverse of the probability of inclusion to reduce bias in the estimators of population quantities of interest. An example is the Horvitz-Thompson estimator (Horvitz and Thompson 1952) of a population mean $\bar{Y} = N^{-1}\sum_{i=1}^{N} y_i$, given by $\hat{\bar{Y}} = N^{-1}\sum_{i \in s} w_i y_i$, where $w_i = 1/\pi_i$, $\pi_i$ is the probability of inclusion, and $s$ is the subset of the population units sampled. This fully-weighted estimator is unbiased for the population mean. For the wide class of non-linear estimators such as ratio estimators or linear regression slopes that are functions of linear statistics, bias can be reduced and consistent estimates of population values obtained by replacing implicit means or totals with their weighted equivalents (Binder 1983).
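As a minimal illustration of the Horvitz-Thompson mean described above, the following Python sketch computes the weighted and unweighted means for a toy sample. The function name `horvitz_thompson_mean` and the toy data are illustrative assumptions and do not come from the manuscript.

```python
def horvitz_thompson_mean(y_sampled, pi_sampled, population_size):
    """Estimate the population mean as N^{-1} * sum over the sample of y_i / pi_i."""
    if len(y_sampled) != len(pi_sampled):
        raise ValueError("y_sampled and pi_sampled must have the same length")
    weights = [1.0 / p for p in pi_sampled]  # case weights w_i = 1 / pi_i
    return sum(w * y for w, y in zip(weights, y_sampled)) / population_size


# Hypothetical toy sample from a population of N = 6 units:
# three units drawn with pi = 0.75 and one unit drawn with pi = 0.5.
y_sampled = [1.0, 2.0, 3.0, 10.0]
pi_sampled = [0.75, 0.75, 0.75, 0.5]

ht_mean = horvitz_thompson_mean(y_sampled, pi_sampled, population_size=6)
unweighted_mean = sum(y_sampled) / len(y_sampled)
print(ht_mean, unweighted_mean)
```

In this toy sample the undersampled large value receives the larger weight, so the weighted mean exceeds the unweighted mean.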

There is little debate that sampling weights should be utilized when considering descriptive statistics such as means and totals, although even here, highly variable probabilities of selection can give rise to bias-variance tradeoffs and the desire to employ weight trimming (Little et al. 1997). However, when estimating “analytical” models (Cochran 1977, p. 4) that focus on associations between, e.g., risk factors and health outcomes estimated via linear and generalized linear models, the decision to use sampling weights is less definitive (cf. Korn and Graubard 1999, p. 180–182). Consider a population generated from

$$Y_i \mid X_i \sim N(A + BX_i + CX_i^2, 1), \quad i = 1, \dots, N, \qquad (1.1)$$


while the superpopulation model of interest is the conditional distribution of $Y_i$ given $X_i$ modeled by

$$Y_i \mid X_i \sim N(\alpha + \beta X_i, \sigma^2) \qquad (1.2)$$


The superpopulation model is correctly specified when C = 0 and misspecified when C ≠ 0. We consider two sampling schemes: an ignorable sampling scheme that oversamples large values of $X_i$, and a non-ignorable scheme that oversamples large values of $Y_i$ at a given value of $X_i$. The sampling scheme is ignorable in the regression context when the sampling probability is a function of $X_i$ only, so that the inclusion indicator $I_i$ is independent of $Y_i \mid X_i$ (our goal being to determine the distribution of $Y \mid X$); non-ignorable designs in the regression setting retain an association between $Y_i$ and $I_i$ even conditional on $X_i$. Of course, designs in which $I_i$ depends on $X_i$ are non-ignorable for parameters that describe the marginal distribution of $Y_i$, unless $Y_i \perp X_i$ (see Section 2). We assume that the goal of the modeler is to describe the association between Y and X using the regression slope β from the superpopulation model. If the superpopulation model is correctly specified, the target quantity of interest could be either the superpopulation slope or the population slope defined by $B = \sum_{i=1}^{N} A_i (Y_i - \bar{Y})$, where $A_i = (X_i - \bar{X}) / \sum_{i=1}^{N} (X_i - \bar{X})^2$, $\bar{Y} = N^{-1}\sum_{i=1}^{N} Y_i$, and $\bar{X} = N^{-1}\sum_{i=1}^{N} X_i$ (the “corresponding descriptive population quantity” in Pfeffermann [1993]). If the superpopulation model is misspecified, then only the population slope makes sense as a target quantity. The unweighted ordinary least squares (OLS) estimator and (case-)weighted least squares (WLS) estimator of α and β, respectively, are given by


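$$(\hat{\alpha}, \hat{\beta})^T = (X^T S X)^{-1} X^T S y, \qquad (\hat{\alpha}_w, \hat{\beta}_w)^T = (X^T S_w X)^{-1} X^T S_w y,$$

where the $i$th row of $X$, $X_i^T$, is given by $(1 \;\; X_i)$, $S$ is a diagonal matrix of sample inclusion indicators $S_i$, and $S_w$ replaces $S_i$ in $S$ with $S_i/\pi_i$ (equivalently, $S_i w_i$). Thus, the WLS estimator replaces the means and totals in the unweighted estimator with the Horvitz-Thompson equivalents.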
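For a single covariate with an intercept, the unweighted and case-weighted least squares estimators reduce to expressions in (weighted) means and cross-products, which the following Python sketch computes. The function name `weighted_least_squares` and the toy data, in which the outcome is quadratic in the covariate and large covariate values are oversampled, are illustrative assumptions only.

```python
def weighted_least_squares(x, y, w):
    """Return (intercept, slope) minimizing sum_i w_i * (y_i - a - b * x_i)^2."""
    total_w = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total_w  # weighted mean of x
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total_w  # weighted mean of y
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical sampled units: y = x^2, with large x oversampled.
x = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 1.0, 4.0, 9.0]
pi = [0.25, 0.25, 0.5, 1.0]  # assumed inclusion probabilities

ols = weighted_least_squares(x, y, [1.0] * len(x))         # unweighted (OLS)
wls = weighted_least_squares(x, y, [1.0 / p for p in pi])  # case-weighted (WLS)
print("OLS slope:", ols[1], "WLS slope:", wls[1])
```

Because the linear model is misspecified for these data, the two slopes differ: the weighted fit gives more influence to the undersampled small values of x.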

Table 1 shows the results from 500 simulations for equivalent populations of N = 10000, under correctly specified and misspecified models and ignorable and non-ignorable sample designs, for sample sizes of n = 50 and n = 500. When the sample design is ignorable (probability of selection depends only on X) and the mean model is correctly specified, both the unweighted and fully-weighted estimators are essentially unbiased, and the larger variance of the weighted estimator results in a larger mean square error (MSE). When the sampling is ignorable but the mean model is incorrectly specified (linear instead of quadratic), the weighted estimator provides protection against model misspecification, but can introduce large variability into the estimator (note the larger MSEs for the weighted estimator when n = 50). When the sample design is non-ignorable for the population slope, the weighted population slope estimator $\hat{\beta}_w$ accounts for the underrepresentation of smaller values of Y when X is small, reducing the negative bias in the slope; in these simulations this bias in the unweighted estimator was a greater contributor to MSE than variance from the weighted estimator.

Table 1
% Bias (MSE in parentheses) for the population slope when the population is generated under $Y_i \mid X_i \sim N(A + BX_i + CX_i^2, 1)$, i = 1,…, 10000, and the superpopulation model is given by $Y_i \mid X_i \sim N(\alpha + \beta X_i, \sigma^2)$: correctly specified ...

The fully-weighted estimators $(\hat{\alpha}_w, \hat{\beta}_w)^T$ are sometimes termed “pseudo-maximum likelihood” estimators (PMLEs) (Binder 1983, Pfeffermann 1993) because they are “design consistent” for the MLEs that would solve the score equations under the sampling model defined in (1.2) if we had observed data for the entire population:


In brief, design consistency implies that the difference between the population target quantity and the estimate derived from the sample tends to zero as the sample size and population size jointly increase, or that this difference will on average tend to 0 under repeated sampling of the population, where samples are selected in an identical fashion from t → ∞ replicates of the population: see Sarndal (1980) or Isaki and Fuller (1982).

1.1 Weight Trimming

While PMLEs are popular in practice for the reasons discussed above, their bias reduction typically comes at the cost of increased variance. This increase can overwhelm the reduction in bias, so that the mean square error (MSE) actually increases under a weighted analysis, as in the example in Table 1. Even in cases where disproportional sample designs do reduce variance, as in “optimal allocation” where strata with more variable outcomes are oversampled (Kish 1965), designs that are optimal for one outcome may not be optimal for another, or for examination of associations (e.g., regression models).

Perhaps the most common approach to dealing with this problem is weight trimming or winsorization (Potter 1990, Kish 1992, Alexander et al. 1997), in which weights larger than some value w0 are fixed as w0. Thus bias is introduced to reduce variance, with the goal of an overall reduction in MSE. This manipulation of the weights reflects a traditional design-based approach to survey inference.

Other design-based methods have been considered in the literature. Potter (1990) discusses systematic methods for choosing $w_0$, including the weight distribution and MSE trimming procedures. The weight distribution technique assumes that the weights follow an inverted and scaled beta distribution; the parameters of the inverse-beta distribution are estimated by method-of-moments estimators, and weights from the upper tail of the distribution, say where $1 - F(w_i) < .01$, are trimmed to $w_0$ such that $1 - F(w_0) = .01$. The MSE trimming procedure (Cox and McGrath 1981) determines the empirical MSE at a variety of trimming levels t = 1,…, T under the assumption that the true population mean is given by the fully weighted estimate: $\widehat{MSE}_t = (\hat{\theta}_t - \hat{\theta}_T)^2 + \hat{V}(\hat{\theta}_t)$, where t = 1 corresponds to the unweighted data and t = T to the fully-weighted data, and $\hat{\theta}_t$ is the value of the statistic using the trimmed weights at level t. The trimming level is then given by the level t that minimizes $\widehat{MSE}_t$.
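The MSE trimming procedure can be sketched in Python as follows for a weighted mean. This is an illustration under stated assumptions, not the procedure as implemented by Cox and McGrath: the trimming levels are a user-supplied list of weight caps, the variance estimate is a simple with-replacement linearization formula chosen here for brevity, and the function names and toy data are hypothetical.

```python
def trimmed_weighted_mean(y, w, cutoff):
    """Weighted mean after capping every weight at `cutoff`; also returns the capped weights."""
    capped = [min(wi, cutoff) for wi in w]
    total = sum(capped)
    return sum(ci * yi for ci, yi in zip(capped, y)) / total, capped


def linearization_variance(y, capped, estimate):
    """With-replacement linearization variance of a weighted mean (an assumed choice)."""
    n = len(y)
    total = sum(capped)
    z = [ci * (yi - estimate) / total for ci, yi in zip(capped, y)]
    return n / (n - 1) * sum(zi ** 2 for zi in z)


def mse_trimming(y, w, cutoffs):
    """Return (best_cutoff, table); table rows are (cutoff, estimate, estimated MSE).

    Assumes the largest cutoff is at least max(w), so the last level is fully weighted.
    """
    cutoffs = sorted(cutoffs)
    full_estimate, _ = trimmed_weighted_mean(y, w, cutoffs[-1])
    table = []
    for cutoff in cutoffs:
        estimate, capped = trimmed_weighted_mean(y, w, cutoff)
        bias_sq = (estimate - full_estimate) ** 2  # bias measured against the full estimate
        mse = bias_sq + linearization_variance(y, capped, estimate)
        table.append((cutoff, estimate, mse))
    best = min(table, key=lambda row: row[2])
    return best[0], table


# Hypothetical data: the smallest cap equals min(w), which gives the unweighted mean.
y = [1.0, 2.0, 3.0, 4.0, 20.0]
w = [1.0, 1.0, 2.0, 2.0, 10.0]
cutoffs = [1.0, 2.0, 5.0, 10.0]
best_cutoff, table = mse_trimming(y, w, cutoffs)
for row in table:
    print(row)
print("selected cutoff:", best_cutoff)
```

Note that the squared-bias term is zero by construction at the fully weighted level, so the procedure inherits any sampling error in the fully weighted estimate.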

In addition to adjusting for unequal probabilities of selection, case weights are also used to calibrate sample elements to known control totals in the population (Deville and Sarndal 1992), either jointly (poststratification weights) or marginally (raking weights). In the calibration literature, techniques have been developed that allow generalized poststratification or raking adjustments to be bounded to prevent the construction of extreme weights (Folsom and Singh 2000). Beaumont and Alavi (2004) extend this idea to develop estimators that focus on trimming large weights of highly influential or outlying observations. While these bounds trim extreme weights to a fixed cutpoint value, the choice of this cutpoint remains arbitrary. Another approach is to consider robust regression estimates (Hampel 1986) that downweight highly influential observations, although applications that consider downweighting influence statistics as an alternative to weight trimming in the context of survey designs are limited (Zaslavsky et al. 2001 considered their use with ratio estimators).

This manuscript develops an alternative approach to weight trimming that considers the case weights as stratifying variables within strata defined by the probability of inclusion. These “inclusion strata” may correspond to formal strata from a disproportional stratified sample design, or may be “pseudo-strata” based on collapsed or pooled weights derived from selection, poststratification, and/or non-response adjustments. Ordering these weight strata by the inverse of the probability of selection and collapsing together the largest valued strata mimics weight trimming by assuming the underlying data from these combined strata are exchangeable (conditional on any covariates of interest). In a regression setting, this model can be posed as a variable selection problem, where dummy variables for the inclusion strata interact with the regression parameters; subtracting from or adding to the inclusion strata design matrix allows for a greater or lesser degree of weight trimming. By averaging over all of these possible “weight pooling” models, we can compute an estimator of the population parameter of interest whose bias-variance tradeoff is data-driven. By allowing all contiguous inclusion strata (strata whose weights are closest in value) to be considered for pooling, we induce a high degree of robustness into our model, protecting against the “over-pooling” that simpler models suffered from (Elliott and Little 2000). We embed this model in a Bayesian framework, as we believe it provides a natural setting for model averaging, as well as a proper framework for population inference.

Section 2 reviews Bayesian finite population inference. Section 3 develops the weight pooling models for linear regression models in a fully Bayesian setting. Section 4 provides simulation results to determine the repeated sampling properties of the weight pooling estimators of linear regression parameters in a disproportional-stratified sample design and compares them with standard design-based estimators. Section 5 illustrates the use of the weight pooling estimator using data from the National Health and Nutrition Examination Survey to consider evidence for “Barker’s Hypothesis” that low birth weight babies are at greater risk for cardiovascular disease later in life. Section 6 summarizes the results of the simulations and considers extensions to generalized linear models.

2 Bayesian Finite Population Inference

Let the population data for a population with i = 1,…, N units be given by $Y = (y_1, \dots, y_N)$, with associated covariate vectors $X = (x_1, \dots, x_N)$ and sampling indicator variable $I = (I_1, \dots, I_N)$, where $I_i = 1$ if the ith element is sampled and 0 otherwise. Similar to design-based population inference, Bayesian population inference focuses on population quantities of interest Q(Y), such as the population mean $Q(Y) = \bar{Y}$. In contrast to design-based inference, however, one posits a model for the population data Y as a function of parameters θ: Y ~ f(Y | θ). Inference about Q(Y) is made based on the posterior predictive distribution $p(Y_{nob} \mid Y_{obs}, I)$, where $Y_{nob}$ consists of the elements of Y for which $I_i = 0$:


where ϕ models the inclusion indicator. If we assume that ϕ and θ are a priori independent and if the distribution of sampling indicator I is independent of Y, the sampling design is said to be “unconfounded” or “noninformative”; if the distribution of I depends only on Yobs, then the sampling mechanism is said to be “ignorable” (Rubin 1987), equivalent to the standard missing data terminology (the unobserved elements of the population can be thought of as missing by design). Under ignorable sampling designs p(θ, ϕ) = p (θ)p(ϕ) and p (I | Y, θ, ϕ) = p (I | Yobs, ϕ), and thus (2.1) reduces to


allowing inference about Q(Y) to be made without explicitly modeling the sampling inclusion parameter I (Ericson 1969, Holt and Smith 1979, Little 1993, Rubin 1987, Skinner et al. 1989). In the regression setting, where inference is desired about parameters that govern the distribution of Y conditional on fixed and known covariates X, (2.1) becomes


which reduces to


if and only if I depends only on (Yobs, X), of which dependence on X only is a special case. Thus if inference is desired about a regression parameter Q(Y, X)|X, then a noninformative or more generally ignorable sample design can allow inclusion to be a function of the fixed covariates.

2.1 Accommodating Unequal Probabilities of Selection

Maintaining the ignorability assumption for the sampling mechanism often requires accounting for the sample design in both the likelihood and prior model structure. In the case of disproportional probability-of-inclusion sample designs, this can be accomplished by developing an index h = 1,…, H of the probability of inclusion (Little 1983, 1991); this could either be a one-to-one mapping of the case weight order statistics to their rankings, or a preliminary “pooling” of the case weights using, e.g., the 100/H percentiles of the case weights. Let $n_h$ be the number of included units and $N_h$ the population size in weight stratum h, so that $w_h = N_h/n_h$ for h = 1,…, H. We assume here that $N_h$ is known, as when the weight strata come from a stratified random sample. (If $N_h$ is unknown, as would be the case when the weights are constructed from estimated probabilities of inclusion via calibration or non-response adjustments, it can be replaced with $N_h = n_h w_h$. $N_h$ can then be treated as known, or if the underlying within-stratum samples are small, uncertainty in $N_h$ can be incorporated into the model by treating $n_1, \dots, n_H$ as a multinomial distribution of size n parameterized by unknown inclusion stratum probabilities $q_1, \dots, q_H$ with, e.g., a Dirichlet prior [Lu and Gelman 2003]. Draws of $N_h$ could then be obtained as $N w_h q_h / n$, where $q_h$ is drawn from the Dirichlet posterior for q. If the weights within a stratum are not all equal, then $w_h$ can be approximated by the inverse of the mean probability of inclusion within the stratum, given by $n_h / \sum_{i \in h} w_{hi}^{-1}$.) The data are then modeled by


for all elements in the hth inclusion stratum, where θh allows for an interaction between the model parameter(s) θ and the inclusion stratum h. Putting a noninformative prior distribution on θh then reproduces a fully-weighted analysis with respect to the expectation of the posterior predictive distribution of Q(Y).
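The construction of inclusion strata from case weights can be sketched as follows. This is a simplified illustration: the rank-based grouping stands in for the percentile-based pooling mentioned above and may split tied weights across strata, and the function names are hypothetical. The stratum weight uses the inverse of the mean inclusion probability, $n_h / \sum_{i \in h} w_{hi}^{-1}$, and the stratum size is taken as $N_h = n_h w_h$.

```python
def pool_weights_into_strata(weights, n_strata):
    """Assign unit indices to `n_strata` groups of near-equal size by weight rank."""
    if n_strata > len(weights):
        raise ValueError("cannot form more strata than there are units")
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    strata = [[] for _ in range(n_strata)]
    for rank, i in enumerate(order):
        strata[rank * n_strata // len(weights)].append(i)
    return strata


def stratum_summaries(weights, strata):
    """Return n_h, w_h, and N_h = n_h * w_h for each inclusion stratum."""
    summaries = []
    for members in strata:
        n_h = len(members)
        # Inverse of the mean inclusion probability within the stratum.
        w_h = n_h / sum(1.0 / weights[i] for i in members)
        summaries.append({"n_h": n_h, "w_h": w_h, "N_h": n_h * w_h})
    return summaries


# Hypothetical case weights for eight sampled units, pooled into H = 4 strata.
weights = [1.0, 1.0, 2.0, 2.0, 4.0, 4.0, 8.0, 8.0]
strata = pool_weights_into_strata(weights, 4)
summaries = stratum_summaries(weights, strata)

# A stratum with unequal weights 2 and 4: w_h = 2 / (1/2 + 1/4).
mixed = stratum_summaries([2.0, 4.0], [[0, 1]])
print(summaries, mixed)
```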

3 Weight Pooling Models

Weight trimming effectively pools units with high weights by assigning them a common, trimmed weight. The untrimmed (design-based) weighted mean estimator in a disproportionally stratified design is $\bar{y}_w = \frac{\sum_h \sum_i w_h y_{hi}}{\sum_h \sum_i w_h} = \sum_h \frac{N_h}{N_+} \bar{y}_h$, where $N_+ = \sum_h N_h$ is the total population size. Weight trimming typically proceeds by establishing an a priori cutpoint $w_0$, say 3 for the normalized weights, and multiplying the remaining (untrimmed) weights by a normalizing constant $\gamma = (N_+ - \sum_i \kappa_i w_0) / \sum_i (1 - \kappa_i) w_i$, where $\kappa_i$ is an indicator variable for whether or not $w_i \ge w_0$. The trimmed mean estimator is thus given by


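where $\gamma = \frac{N_+ - w_0 \sum_{h=l}^{H} n_h}{\sum_{h=1}^{l-1} N_h}$ and $\bar{y}_{(l)} = \left(1 / \sum_{h=l}^{H} n_h\right) \sum_{h=l}^{H} n_h \bar{y}_h$. Choosing $w_0 = \sum_{h=l}^{H} N_h / \sum_{h=l}^{H} n_h$ yields γ = 1 and $\bar{y}_{wt} = \sum_{h=1}^{l-1} \frac{N_h}{N_+} \bar{y}_h + \frac{\sum_{h=l}^{H} N_h}{N_+} \bar{y}_{(l)}$, which corresponds to the estimate for a model that assumes distinct stratum means for the smaller weight strata and a common mean for the larger weight strata, that is: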


Elliott and Little (2000) considered an extension of this model where we no longer assume the cutpoint l is known:


where μ1 = β0,…, μl = β0 + βl−1. This “weight pooling” model averages the estimators obtained from all possible weight trimming cutpoints, where each estimator contributes to the final average based on the probability that the cutpoint is “correct”. This posterior probability is determined via Bayesian variable selection models that determine the posterior probability of each cutpoint model conditional on the observed data.
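The trimmed-mean identity used above (γ = 1 when $w_0$ equals the pooled weight of strata l,…, H) can be checked numerically. The sketch below assumes the trimmed estimator is the weighted mean in which strata below the cutpoint carry weight $\gamma w_h$ and strata at or above it carry the common weight $w_0$; that form is derived here from the surrounding definitions, and the function name and stratum values are hypothetical.

```python
def trimmed_stratified_mean(N_h, n_h, ybar_h, l, w0=None):
    """Trimmed weighted mean with strata l..H (1-based) sharing the weight w0.

    If w0 is None (or l == 1), the pooled weight sum(N_h[l..H]) / sum(n_h[l..H]) is used.
    Returns (estimate, w0, gamma).
    """
    lo = l - 1  # Python index of the first pooled stratum
    N_plus = sum(N_h)
    n_top = sum(n_h[lo:])
    if w0 is None or lo == 0:
        w0 = sum(N_h[lo:]) / n_top
    # Normalizing constant applied to the untrimmed strata.
    gamma = (N_plus - w0 * n_top) / sum(N_h[:lo]) if lo > 0 else 1.0
    ybar_top = sum(n * yb for n, yb in zip(n_h[lo:], ybar_h[lo:])) / n_top
    lower = sum(gamma * N * yb for N, yb in zip(N_h[:lo], ybar_h[:lo]))
    return (lower + w0 * n_top * ybar_top) / N_plus, w0, gamma


# Hypothetical design with H = 4 strata ordered by increasing weight N_h / n_h.
N_h = [100, 200, 300, 400]
n_h = [50, 50, 30, 20]
ybar_h = [1.0, 2.0, 3.0, 4.0]

pooled = trimmed_stratified_mean(N_h, n_h, ybar_h, l=3)      # pool strata 3 and 4
full = trimmed_stratified_mean(N_h, n_h, ybar_h, l=4)        # no pooling
unweighted = trimmed_stratified_mean(N_h, n_h, ybar_h, l=1)  # pool everything
print(pooled, full, unweighted)
```

With l = H the function returns the fully weighted mean and with l = 1 the unweighted mean, so the cutpoint index spans the range of estimators that the weight pooling model averages over.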

3.1 Weight Pooling Models for Linear Regression

This manuscript extends Elliott and Little (2000) in two ways. First, we consider the linear regression of Yi on fixed covariates xi. Thus the most general model must allow for interactions between the probability of selection and the linear regression slopes; the full interaction model (a different slope within each probability-of-selection stratum, equivalent to no pooling) approximately reproduces the fully-weighted estimator, while the minimal model (a single slope across all probability-of-selection strata, equivalent to full pooling) approximately reproduces the unweighted estimator. Pooling of some, but not all, of the strata, reproduces the trimmed estimator where the degree of trimming is determined by the degree to which the data suggest that distinct probability-of-selection strata have similar linear regression slopes. Second, we allow for the pooling of all conterminous inclusion strata. This increases the robustness of the model, by permitting the lowest probability-of-selection strata to interact with the linear regression slopes even when higher probability-of-selection strata are pooled. Thus


where $Z_{li} = D_{hl} \otimes x_{hi}$ and $D_{hl}$ is a vector of dummy variables that pool the appropriate conterminous inclusion strata based on the lth pooling pattern.

Table 2 shows the set of pooling patterns when H = 4. Under weak or non-informative priors, the first four pooling patterns mimic standard weight trimming estimators, with L = 1 corresponding to an unweighted analysis and L = 4 corresponding to a fully-weighted analysis.

Table 2
The set of {Dhl} when 4 weight strata are present: all patterns of pooling coterminous strata.
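The set of pooling patterns can be enumerated programmatically. The sketch below assumes that every partition of the ordered strata into runs of adjacent strata is allowed, which gives $2^{H-1}$ patterns (8 when H = 4), and that stratum H carries the largest weight. The ordering of patterns produced here is arbitrary and need not match the ordering in Table 2; all function names are hypothetical.

```python
from itertools import product


def pooling_patterns(n_strata):
    """List every way to pool adjacent strata; each pattern is a list of groups."""
    patterns = []
    # breaks[j] == 1 means a new group starts at stratum j + 2.
    for breaks in product([0, 1], repeat=n_strata - 1):
        groups, current = [], [1]
        for h, is_break in enumerate(breaks, start=2):
            if is_break:
                groups.append(current)
                current = [h]
            else:
                current.append(h)
        groups.append(current)
        patterns.append(groups)
    return patterns


def dummy_matrix(groups, n_strata):
    """H x (number of groups) indicator matrix: row h has a 1 in its group's column."""
    matrix = [[0] * len(groups) for _ in range(n_strata)]
    for col, group in enumerate(groups):
        for h in group:
            matrix[h - 1][col] = 1
    return matrix


patterns = pooling_patterns(4)
# Patterns that pool only the highest-weight strata correspond to weight trimming.
trimming_like = [p for p in patterns if all(len(g) == 1 for g in p[:-1])]

for p in patterns:
    print(p, dummy_matrix(p, 4))
```

Four of the eight patterns for H = 4 pool only the top strata, ranging from full pooling (unweighted) to no pooling (fully weighted); the remaining four pool interior or low-weight strata.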

Our population quantity of interest B = (B1,…, Bp)T is the slope that solves the population score equation (1.3) where




Note that the quantity B such that U(B) = 0 is always a meaningful population quantity of interest even if the model is misspecified (i.e., yi is not exactly linear with respect to the covariates), since it is the linear approximation of xi to E(Yi | xi).

The posterior predictive distribution of B is then given by


for $\theta_l = (\beta_l, \sigma^2, L = l)$. Simulations from p(B | y, X) can be obtained by first obtaining a draw from $p(\theta_l \mid y, X)$, and then computing $B = \left[\sum_{h=1}^{H} W_h \sum_{i=1}^{n_h} Z_{li} Z_{li}^T\right]^{-1} \left[\sum_{h=1}^{H} W_h \left(\sum_{i=1}^{n_h} Z_{li} Z_{li}^T\right) \beta_l\right]$, where $W_h = N_h/n_h$ for the population size $N_h$ and sample size $n_h$ in the hth inclusion stratum. Note that this preserves the distribution of the covariates under the sample design while allowing the slopes to still be fully modeled.

A direct draw from p(θl | y, X) = p(βl | σ2, L = l, y, X)p(σ2|L = l, y, X)p(L = l | y, X) is possible if H is of modest size; otherwise a Metropolis step can be run to obtain an approximation to the marginal posterior of p(L = l | y, X), and direct draws obtained accordingly. Details are provided in the Appendix.

3.2 Fractional Bayes Factors

In the absence of strong prior information to define $p(\theta_l)$, the Bayes Factors comparing weight pooling model l with weight pooling model l′


can be quite sensitive to the choice of $p(\theta_l)$ (Kass and Raftery 1995). We have a similar issue in our weight pooling model, since our marginal pooling probabilities are simply Bayes Factors converted from the odds to the probability scale. To counter this, we consider the “fractional Bayes factor” approach proposed in O’Hagan (1995). The concept extends the training-sample idea first proposed in Spiegelhalter and Smith (1982). A fraction b of the sample is set aside to provide a data-based proper prior for $\theta_l$. O’Hagan (1995) shows that the resulting Bayes factor for comparing model l with model l′ using the data-based prior, which he terms a fractional Bayes factor (FBF), is of the form $BF_b(y, X) = \frac{q_l(f, y, X)\,P(L = l)}{q_{l'}(f, y, X)\,P(L = l')}$, where


Small values of b should be most efficient at choosing correct models, while larger values of b are protective against outliers (data generated under a model not in the classes considered). O’Hagan proposed $n^{-1}\log n$ and $n^{-1/2}$ as increasingly “robust” choices of b. O’Hagan assumes a non-informative prior $h(\theta_l)$ in contrast to our proper prior, but very weakly informative priors, as we use in simulations and examples below, can be used as well. The Appendix provides details describing the use of FBF in the weight pooling application.
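To give a sense of scale for the two training fractions, the following Python sketch evaluates $n^{-1}\log n$ and $n^{-1/2}$ at a few sample sizes. The function name and the chosen sample sizes are illustrative assumptions; the natural logarithm is assumed.

```python
import math


def training_fractions(n):
    """Return the two training fractions b discussed above for sample size n."""
    return {"log_n_over_n": math.log(n) / n, "inv_sqrt_n": n ** -0.5}


fractions = {n: training_fractions(n) for n in (100, 500, 1000)}
for n, b in fractions.items():
    # b * n approximates the number of observations "spent" on the prior.
    print(n, round(b["log_n_over_n"], 4), round(b["inv_sqrt_n"], 4))
```

For these sample sizes the $n^{-1}\log n$ fraction is the smaller of the two, so it devotes less of the sample to the data-based prior.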

4 Simulation Results

4.1 Mean Models

We consider the repeated sampling properties of our proposed models for estimating population means given by $\bar{Y} = N^{-1}\sum_{i=1}^{N} Y_i$ (i.e., $x_i = 1$ for all i). We generated data under the following model:


The population sizes of the H = 10 selection strata were as follows:


from which disproportional samples of size 500 and 100 were drawn:


(maximum normalized weight=13.9).

We consider two patterns for the means across 10 inclusion strata:

  1. μC = (22.5, 14.4, 9.0, 4.8, 1.8, −1.2, −1.8, −2.16, −1.92, −1.8)′
  2. μD = (−1.8, −1.92, −2.16, −1.8, −1.2, 1.8, 4.8, 9.0, 14.4, 22.5)′

and considered values of $\sigma^2 = 10^l$, l = −1, 0,…, 3; 200 simulations were generated for each value of $\sigma^2$. The mean pattern μC would generally be favorable for weight trimming, since the means for the low probability-of-selection weight strata are approximately equal; μD would generally be unfavorable for weight trimming, since the means for the low probability-of-selection weight strata differ substantially. Generally, weight trimming should be more favorable as $\sigma^2 \to \infty$ and the effect of the bias correction is minimized; the fully-weighted estimator will generally be favored as $\sigma^2 \to 0$, and bias correction is paramount.

For priors, we considered $\mu_0 = \hat{\mu} = (\bar{y}_1, \dots, \bar{y}_H)'$, $\Sigma_0 = c\sum_{h=1}^{H}\sum_{i=1}^{n_h}(y_{hi} - \bar{y}_h)^2$, and $a = s = 10^{-8}$ (see (3.2)). This is a “data-based” prior that centers all the inclusion means at their unweighted sample values, with a variance scaled by the sample size n so that it is equivalent to a variance estimate based on a single observation. We further scale this prior by a factor c ≥ 1 to allow for reduced informativeness; we consider c = 1000 in the simulations below, making the prior effectively non-informative. We term the estimator of $\bar{Y}$ obtained under this model PWT. We also consider two Fractional Bayes Factor data-based priors: PWTF1, which uses a training fraction of $\log n / n$, and PWTF2, which uses a larger training fraction of $n^{-1/2}$. O’Hagan suggests that PWTF1 will be more efficient at choosing the correct model when the true model is among the models considered, whereas PWTF2 will be more robust (have better repeated sampling properties when the true model is not among the models considered).

In addition to these three weight pooling models, we consider the standard design-based (fully weighted) estimator (FWT), as well as two trimmed weight (TWT3, TWT7) and unweighted (UNWT) estimators. The TWT3 estimator is obtained by replacing the weights $w_{hi}$ with trimmed values $w_{hi}^t$ that set the maximum normalized value to 3: $w_{hi}^t = N\tilde{w}_{hi} / \sum_{h=1}^{H} n_h \tilde{w}_h$, where $\tilde{w}_{hi} = \min(w_{hi}, 3N/n)$; this approximately corresponds to the weight pooling model (3.1) with l = 6. The TWT7 estimator uses trimmed values that set the maximum normalized value to 7, approximately corresponding to the weight pooling model (3.1) with l = 8. The UNWT estimator is obtained by fixing $w_{hi} = N/n$ for all h, i. We estimate their variance using the Taylor Series (linearization) approximation (Binder 1983) that accounts for weighting and stratification.
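The trimming rule behind TWT3 and TWT7 can be sketched in Python as a cap at a multiple of N/n followed by rescaling so that the trimmed weights again sum to N. The function name `trim_and_renormalize` and the toy weights are illustrative assumptions.

```python
def trim_and_renormalize(weights, max_normalized, population_size):
    """Cap unit weights at max_normalized * N / n, then rescale to sum to N."""
    n = len(weights)
    cap = max_normalized * population_size / n
    capped = [min(w, cap) for w in weights]
    scale = population_size / sum(capped)
    return [scale * c for c in capped]


# Hypothetical sample of n = 5 units from a population of N = 20.
weights = [2.0, 2.0, 2.0, 2.0, 12.0]
trimmed = trim_and_renormalize(weights, max_normalized=2, population_size=20)
unweighted = [20 / len(weights)] * len(weights)  # UNWT: every weight equals N / n
print(trimmed, unweighted)
```

Because the rescaling step inflates all weights by a common factor, the largest trimmed weight can end up slightly above the nominal cap (here 10 rather than 8), which is one reason the correspondence with a particular pooling cutpoint is only approximate.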

Table 3 shows the root mean square error (RMSE) relative to the fully-weighted estimator and nominal 95% coverage for the three design-based and three model-based estimators of the population mean, as a function of the variance σ2, under μC, the structure that favors weight trimming. Table 4 shows the equivalent measures under μD, the structure that is not consistent with weight trimming.

Table 3
Square root of mean square error (RMSE) relative to RMSE of fully-weighted estimator, and true coverage of the 95% CI or PPI of population mean estimator under the model μC that is consistent with weight trimming.
Table 4
Square root of mean square error (RMSE) relative to RMSE of fully-weighted estimator, and true coverage of the 95% CI or PPI of population mean estimator under the model μD that is not consistent with weight trimming. (Dotted line indicates 95% ...

Even when the mean structure is favorable for weight trimming, the unweighted estimator (UNWT) and crude trimming estimators (TWT3, TWT7) behave poorly when $\sigma^2$ is small, but have better MSE properties than the fully weighted estimator and conservative coverage when the within-stratum variance is considerably greater than the between-stratum variance. The trimmed estimator requires a smaller residual variance to have better MSE properties than the fully weighted estimator, but the unweighted estimator has the best MSE properties for the largest residual variance. The fully weighted estimator is design-unbiased; coverage is approximately correct for n = 500, but anti-conservative when n = 100 due to the poor asymptotic approximation. The pooled weight estimator under the flat prior nearly dominates the fully-weighted estimator with respect to MSE and has approximately correct coverage when n = 100, since asymptotic assumptions are not necessary for the Bayesian estimator. Similar results are found for the pooled weight estimator using the fractional Bayes Factor priors, except that the increase in efficiency is greater for larger $\sigma^2$ (RMSE reductions of nearly 30%).

When the mean structure is not favorable for weight trimming, the UNWT and TWT estimators both have larger MSE than the FWT estimator and very poor coverage except for very large $\sigma^2$. The pooled weight estimators are fairly robust, with slightly increased MSE relative to the fully weighted estimator for intermediate values of $\sigma^2$, and improved MSE relative to the fully weighted estimator for large values of $\sigma^2$. The true coverage of the pooled weight estimator is somewhat less than the nominal coverage when n = 100 but is still better than that of the fully weighted estimator, again reflecting the lack of need for asymptotic assumptions in the Bayesian paradigm.

4.2 Linear Regression Models

For the linear regression model, we generated population data under a linear spline as follows:


where (x)+ = x if x ≥ 0 and (x)+ = 0 if x < 0. A noninformative, disproportionally stratified sampling scheme sampled elements as a function of Xi (Ii equals 1 if sampled and 0 otherwise):


This created 10 strata, defined by the integer portions of the $X_i$ values. A total of n = 1000 elements were sampled without replacement for each simulation (maximum normalized weight ≈ 14.9). The object of the analysis is to obtain the population slope $B_1 = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{N} (X_i - \bar{X})^2}$.

We considered three patterns for β:

  1. βC = (0, 0, 0, 0, .5, .5, 1, 1, 2, 2, 4)′
  2. βD = (0, 11, −4, −2, −2, −1, −1, −.5, −.5, 0, 0)′
  3. βE = (0, 2, 0, 0, 0, 0, 0, 0, 0, 0,)′

and considered values of $\sigma^2 = 10^l$, l = 1,…, 5; 200 simulations were generated for each value of $\sigma^2$. The effect of model misspecification increases as $\sigma^2 \to 0$ as the bias of the estimators becomes larger relative to the variance, and conversely decreases as $\sigma^2 \to \infty$. Under βC, weight trimming is likely to be a productive strategy under smaller values of $\sigma^2$ than under βD, since the low probability-of-selection slopes are equal. Under βE, the linear regression model for the population is correctly specified, and the unweighted estimator should be most efficient.

We use priors equivalent to the “data-based” priors we used for population means, extended to population slopes: $\beta_0 = \hat{\beta} = (X^T X)^{-1} X^T y$, $\Sigma_0 = cn\,\mathrm{Var}(\hat{\beta})$ for $\mathrm{Var}(\hat{\beta}) = \hat{\tau}^2 (X^T X)^{-1}$, $\hat{\tau}^2 = (n - p)^{-1}(y - X\hat{\beta})^T (y - X\hat{\beta})$, $a = s = 10^{-8}$, and c = 1000. We again consider Fractional Bayes Factor priors with training fractions of $\log n / n$ and $n^{-1/2}$.

As in the population mean evaluation, we consider the FWT, TWT3, TWT7 and UNWT estimators, again estimating their variance using the Taylor Series (linearization) approximation that accounts for weighting and stratification. As in the mean model, TWT3 approximately corresponds to the weight pooling model (3.1) with l = 6, and TWT7 approximately corresponds to the weight pooling model (3.1) with l = 8.

Table 5 shows the root mean square error (RMSE) relative to the fully-weighted estimator and nominal 95% coverage for the three design-based and three model-based estimators of the population slope (second component of $\hat{B}$) as a function of the variance $\sigma^2$, under βC, the structure that favors weight trimming for smaller values of $\sigma^2$; Tables 6 and 7 show the equivalent measures under βD and βE, the structures that respectively favor weight trimming for only larger values of $\sigma^2$, and the correctly specified linear model. Under all three models, the nominal coverage of the 95% CI of the fully weighted estimator is approximately correct.

Table 5
Square root of mean square error (RMSE) relative to RMSE of fully-weighted estimator, and true coverage of the 95% CI or PPI of population linear regression slope estimator under the misspecified model βC that favors weight trimming. (Dotted line ...
Table 6
Square root of mean square error (RMSE) relative to RMSE of fully-weighted estimator, and true coverage of the 95% CI or PPI of population linear regression slope estimator under the misspecified model βD that is not consistent with weight trimming. ...
Table 7
Square root of mean square error (RMSE) relative to RMSE of fully-weighted estimator, and true coverage of the 95% CI or PPI of population linear regression slope estimator under the correctly specified model βE. (Dotted line indicates 95% interval ...

Under the misspecified models, the unweighted and trimmed estimators are always biased, although for large $\sigma^2$ the reduction in variance outweighs the bias, yielding approximately correct coverage of the nominal 95% CI and smaller MSEs than the fully weighted estimator. When the model is correctly specified, the unweighted and trimmed estimators reduce RMSE by 35–45%, and coverage of the nominal 95% CI is correct.

The weight pooling estimator with non-informative prior generally tracks the fully weighted estimator in the presence of model misspecification, although for large $\sigma^2$ there is a 10% reduction in RMSE. Coverage of the nominal 95% interval is correct except for small values of $\sigma^2$ under $\beta_D$, the model least favorable to weight trimming. Under the correctly specified model, the weight pooling estimator with non-informative prior has a 5–10% reduction in RMSE, with correct coverage of the nominal 95% PPI.

The weight pooling estimator with the smaller training fraction FBF prior (PWTF1) has RMSE equivalent to that of the fully-weighted estimator when $\sigma^2$ is small under $\beta_C$ and weight trimming is not warranted, but has RMSE equivalent to that of the unweighted estimator when $\sigma^2$ is large and weight trimming is appropriate. A similar pattern is seen under $\beta_D$, except that PWTF1 “overpools” somewhat for intermediate levels of $\sigma^2$, leading to slightly higher RMSE than the fully-weighted estimator. Under the correctly specified model $\beta_E$, PWTF1 has RMSE properties similar to those of TWT3, with a 35–45% reduction in RMSE. There is modest undercoverage of the nominal 95% PPI when $\sigma^2$ is small and the model is misspecified.

The weight pooling estimator with the larger training fraction FBF prior (PWTF2) is more robust than PWTF1, with little increase in RMSE over the fully-weighted estimator even when the model is misspecified and $\sigma^2$ is small, while retaining substantial RMSE reductions (over 30%) when bias correction is unimportant or the model is correctly specified. Coverage properties of the 95% PPI are correct, except for modest undercoverage under the “worst case” model ($\beta_D$ with small $\sigma^2$).

5 Application: Consideration of the Barker Hypothesis using NHANES data

Barker et al. (1993) described an association between low birth weight and adult cardiovascular disease (CVD) and type 2 diabetes. It was postulated that, in the face of a nutritionally stressed fetal environment, the fetus adapts in a manner that predisposes it to the development of insulin resistance and increased CVD risk factors in later life. This hypothesis has been evaluated by a number of others (Curhan et al. 1996, Rich-Edwards et al. 1997, among many), but usually in convenience samples. A few analyses have considered whether evidence of the “Barker Hypothesis” exists in children (Forrester et al. 1996, Matthes et al. 1994), again with convenience samples and limited ethnic diversity.

To evaluate the Barker hypothesis in children using a population-based sample, we use the National Health and Nutrition Examination Survey III (NHANES III). NHANES III (U.S. Department of Health and Human Services 1997) is a US-wide survey designed to collect information about the diet and health status of the US population. The survey was conducted between 1988 and 1994 with 33,394 subjects, drawn from a probability sample of the US population with a complex sample design. The primary sampling units (PSUs), consisting of standard metropolitan statistical areas (SMSAs), counties, or groups of counties, were collapsed into strata. Strata containing a single large SMSA had that PSU selected with probability 1. Two PSUs were selected from each of the remaining strata using a “controlled selection” process that selected PSUs with probability proportional to population while assuring balance on key covariates such as region and socioeconomic status. Within each PSU, clusters of dwelling units were sampled using controlled selection as well, and a systematic sample of addresses was then selected from each cluster. Oversamples of minorities (African- and Mexican-Americans) and of the young (<6) and old (60+) were also obtained. The NHANES III sampling weights are highly variable: $215 \le w_i \le 79{,}382$, and 8% of the weights have normalized values greater than 3. The weights include a non-response adjustment as well as a post-stratification adjustment to known Census age-sex-geographic-ethnicity (non-Hispanic Caucasian, non-Hispanic African-American, Mexican-American, and other) totals that also accounts for the age-ethnicity oversampling, and included crude trimming adjustments at each step. (No detail is provided about the weight trimming procedures except that fewer than 1% of cases have trimmed weights [Mohadjer et al., 1996].) In the analysis below, the weights were grouped into 10 strata for the weight-pooling model.
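The weight handling described above (normalizing, trimming at a fixed cap, and grouping weights into strata) can be sketched as follows. This is a minimal illustration, not the survey's or the author's procedure: it assumes that "normalized" means dividing by the mean weight, that trimming simply caps the normalized weight without redistributing the excess, and that the 10 weight strata are formed from weight deciles. The function names are hypothetical.

```python
import numpy as np

def normalize_weights(w):
    """Rescale weights to have mean 1 (assumed meaning of 'normalized')."""
    w = np.asarray(w, dtype=float)
    return w / w.mean()

def trim_weights(w, cap=3.0):
    """Cap normalized weights at `cap`; the trimmed excess is not redistributed."""
    return np.minimum(normalize_weights(w), cap)

def weight_strata(w, n_strata=10):
    """Label each case 0..n_strata-1 by weight quantile (assumed grouping rule)."""
    w = np.asarray(w, dtype=float)
    # Interior quantile cut points, e.g. deciles when n_strata = 10
    cuts = np.quantile(w, np.linspace(0, 1, n_strata + 1)[1:-1])
    return np.searchsorted(cuts, w, side="right")
```

With a cap of 3, `trim_weights` corresponds to the TWT3-style rule of limiting the normalized weight to 3; other grouping rules for the strata would work equally well in this sketch.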

To evaluate this hypothesis using the population-based estimates in NHANES, we regress non-HDL cholesterol on birth weight and birth weight squared among 4–12 year olds, unadjusted and adjusting for age, gender, age × gender, and current body-mass index (BMI). Table 8 shows the unweighted, fully-weighted, weight trimming (to a maximum normalized value of 3), pooled weight, and fractional Bayes factor pooled weight estimators for both the unadjusted and adjusted models, along with estimates of bias and mean squared error under the assumption that the fully weighted estimator is unbiased. Because a fully-weighted regression estimator $\hat\beta_w$ is unbiased only in expectation, the estimated squared bias of a regression estimator $\hat\beta^*$ is given by $\max\left((\hat\beta^* - \hat\beta_w)^2 - \hat V_{01},\, 0\right)$, where $\hat V_{01} = \widehat{\mathrm{Var}}(\hat\beta^*) + \widehat{\mathrm{Var}}(\hat\beta_w) - 2\,\widehat{\mathrm{Cov}}(\hat\beta^*, \hat\beta_w)$ (Kish 1992). To account for the effects of clustering and stratification in the multi-stage sample design, the variances of the regression estimators were calculated using a bootstrap (Davison and Hinkley 1997, pp. 92–102) in which PSUs were resampled with replacement within strata. For each resampled dataset, the unweighted and fully-weighted estimates were computed as $B_u = (X^T X)^{-1} X^T y$ and $B_w = (X^T W X)^{-1} X^T W y$ respectively, where $W$ is the $n \times n$ diagonal case weight matrix. Point estimates under the weight pooling method were computed as $B_p = \sum_{l=1}^{H} \hat B_l\, P(L = l \mid y, X)$, where $\hat B_l = \left[\sum_{h=1}^{H} W_h \sum_{i=1}^{n_h} Z_{li} Z_{li}^T\right]^{-1} \left[\sum_{h=1}^{H} W_h \left(\sum_{i=1}^{n_h} Z_{li} Z_{li}^T\right) \hat\beta_l\right]$, for $\hat\beta_l = (Z_l^T Z_l)^{-1} Z_l^T y$, where $Z_l$ consists of the stacked vectors $Z_{li}$. To compute $P(L = l \mid y, X)$ we used a Fractional Bayes Factor data-based prior with a training fraction of $n^{-1/2}$ (PWTF2).
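The unweighted and fully-weighted estimators, the truncated squared-bias estimate, and one bootstrap replicate of PSUs within strata can be sketched in a few lines. This is an illustrative sketch rather than the code used for Table 8; the function names, and the assumption that `strata` and `psu` are integer label arrays with one entry per case, are hypothetical.

```python
import numpy as np

def ols(X, y):
    """Unweighted estimate B_u = (X'X)^{-1} X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def wls(X, y, w):
    """Fully-weighted estimate B_w = (X'WX)^{-1} X'Wy with W = diag(w)."""
    Xw = X * np.asarray(w, dtype=float)[:, None]  # rows of X scaled by case weights
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)

def squared_bias(est, est_w, var_est, var_w, cov):
    """max((est - est_w)^2 - V01, 0), V01 = Var(est) + Var(est_w) - 2 Cov."""
    v01 = var_est + var_w - 2.0 * cov
    return max((est - est_w) ** 2 - v01, 0.0)

def bootstrap_psu(strata, psu, rng):
    """Row indices for one replicate: PSUs resampled with replacement within strata."""
    rows = []
    for s in np.unique(strata):
        ids = np.unique(psu[strata == s])
        for p in rng.choice(ids, size=len(ids), replace=True):
            rows.append(np.flatnonzero((strata == s) & (psu == p)))
    return np.concatenate(rows)
```

Repeating `bootstrap_psu` many times and recomputing `ols` and `wls` on each replicate gives the bootstrap variances and covariance that enter `squared_bias`.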

Table 8
Change in non-HDL cholesterol (mg/dL) associated with each 1 lb. change in birth weight, among US 4–12 year-olds, using unweighted (UNWT), fully-weighted (FWT), trimmed weight (TWT3), pooled weight (PWT), and fractional Bayes factor pooled weight ...

In this example, the unweighted estimator appears to have better RMSE properties than the fully-weighted estimator, particularly for the linear term; the unweighted and weighted quadratic terms are approximately equal under both models. The weight pooling estimator tracks the unweighted estimator in the unadjusted model and compromises between the weighted and unweighted estimators in the adjusted model; the fractional Bayes factor weight pooling estimator compromises between the unweighted and fully-weighted estimators for the unadjusted linear term, but tracks the unweighted estimator in the adjusted model. The weight pooling estimator has the best MSE properties, with MSE somewhat smaller than that of the unweighted estimator; the variance of the fractional Bayes factor weight pooling estimator is somewhat greater than that of the unweighted estimator. The crude trimming estimator has the next-best MSE properties, with the fully-weighted estimator having the largest MSE for both the unadjusted and adjusted models.

Both the unadjusted and adjusted estimates suggest that a quadratic effect might be present, with extremely underweight children and normal and above-normal weight children having lower levels of non-HDL cholesterol than moderately underweight children. However, the trends were not jointly significant in a Wald test with 2 degrees of freedom under any of the unweighted, fully-weighted, or weight-pooling estimators.

6 Discussion

In this manuscript we have developed a “weight smoothing” methodology that allows the data to make a principled tradeoff between bias and variance, approximating the fully weighted estimator when bias is of great importance but moving toward the unweighted estimator when variance overwhelms the square of the bias correction factor. This model generalizes the work of Elliott and Little (2000), where population inference was restricted to population means using a weight pooling model that mimicked weight trimming. A shortcoming of the previous model was lack of robustness: by considering submodels that pooled only the largest weight strata, data structures that favored fully-weighted estimators were “overpooled” and the resulting bias yielded MSEs that were larger than the fully-weighted estimators’ MSEs. Here we consider a model that allows for the pooling of all conterminous inclusion strata. This yields weight pooling estimators that are protected against overpooling, but have limited efficiency gains over fully-weighted estimators. By considering the “Fractional Bayes Factors” of O’Hagan (1995), in which a fraction b of the sample is set aside to provide a data-based proper prior, we showed that our resulting estimators retained their robustness properties while gaining considerable efficiency over standard fully-weighted estimators. This manuscript also extends the weight pooling method to consider population linear regression slopes as well as population means.

We also applied the methods to assess “Barker’s Hypothesis,” an association between low birth weight and adult cardiovascular disease and type 2 diabetes (Barker et al. 1993), using the nationally representative National Health and Nutrition Examination Survey III. In this situation, the unweighted estimates of the quadratic effect of birth weight on non-HDL cholesterol generally had the best RMSE properties; however, the weight pooling estimators outperformed the fully-weighted estimators.

When sampling weights are used to account for misspecification of the mean in a regression setting, it could be argued that the correct approach is to specify the mean correctly, eliminating discrepancies between the fully-weighted and unweighted estimates of the regression parameters. However, perfect specification is an unattainable goal, and when the sampling probabilities are highly variable even good approximations might be highly biased if case weights are ignored, even if the sampling itself is noninformative. In the informative sampling setting, it may be impossible to determine whether discrepancies between weighted and unweighted estimates are due to model misspecification or to the sample design itself. Finally, even misspecified regression models have the attractive feature in the finite population setting of yielding a unique target population quantity. Consequently, accounting for the probability of inclusion in linear model settings continues to be advised, and methods that balance between a low-bias, high-variance fully-weighted analysis and a high-bias, low-variance unweighted analysis remain useful.

The next logical extension of the weight pooling methods is into the generalized linear model setting. The situation is complicated here by the lack of a closed-form solution for $p(y \mid L = l, X)$ outside of the Gaussian special case, making it difficult to compute a Fractional Bayes Factor to enhance efficiency. One possibility is to utilize Laplace approximations (Tierney and Kadane 1986). In general, we have $p(L = l \mid y) = C_l / \sum_l C_l$, where $C_l = \int f(y \mid \theta_l)\, p(\theta_l)\, d\theta_l$. By approximating the posterior with a normal distribution, we estimate $C_l$ with $(2\pi)^{l/2}\, |\Sigma|^{1/2} f(y \mid \hat\theta_l)\, p(\hat\theta_l)$, where $\hat\theta_l$ is a value with high posterior probability (a median or mode). DiCiccio et al. (1997) discuss improvements on this approximation that may be utilized as well.
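A generic version of this Laplace calculation can be sketched on the log scale. The sketch makes several assumptions that the text does not state: the exponent of $2\pi$ is taken to be half the dimension $d$ of $\theta_l$, $\Sigma$ is taken to be the inverse of the negative Hessian of $\log f(y \mid \theta_l) p(\theta_l)$ at the mode, and the mode and Hessian are supplied by the caller (for example from a numerical optimizer). The function names are hypothetical.

```python
import numpy as np

def laplace_log_evidence(log_joint_at_mode, neg_hessian):
    """log[(2*pi)^{d/2} |Sigma|^{1/2} f(y|theta_hat) p(theta_hat)].

    Sigma is assumed to be the inverse of `neg_hessian`, the negative Hessian
    of log f(y|theta) + log p(theta) evaluated at theta_hat.
    """
    d = neg_hessian.shape[0]
    sign, logdet = np.linalg.slogdet(neg_hessian)
    if sign <= 0:
        raise ValueError("negative Hessian must be positive definite at the mode")
    # log|Sigma|^{1/2} = -0.5 * log|neg_hessian|
    return 0.5 * d * np.log(2.0 * np.pi) - 0.5 * logdet + log_joint_at_mode

def model_probabilities(log_c):
    """p(L = l | y) = C_l / sum_l C_l, computed stably from log C_l."""
    log_c = np.asarray(log_c, dtype=float)
    z = np.exp(log_c - log_c.max())  # subtract the max to avoid overflow
    return z / z.sum()
```

For an exactly Gaussian integrand the approximation is exact, which gives a convenient check of an implementation before applying it to a generalized linear model.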


This research was supported by National Institute of Heart, Lung, and Blood grant R01-HL-068987-01. The author acknowledges Jack Chen for his assistance with programming, and Dr. Andrew Tershakovec for his assistance with the Barker’s Hypothesis analysis, as well as the Editor, Associate Editor, and three anonymous reviewers whose comments improved the manuscript.

7 Appendix

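From (3.2), we obtain a direct draw from the posterior $p(\beta_l, \sigma^2, L = l \mid y, X)$ as follows:

  1. $p(L = l \mid y, X) = \dfrac{p(y \mid L = l, X)\, P(L = l)}{\sum_{l'} p(y \mid L = l', X)\, P(L = l')}$, where $p(y \mid L = l, X) \propto |\Psi_l|^{1/2} \left[\Delta_l - \theta_l^T \Psi_l \theta_l\right]^{-(n+a)/2}$ for $\Psi_l = \left((Z_l^T Z_l) + \Sigma_0^{-1}\right)^{-1}$, $\theta_l = (Z_l^T Z_l)\, b + \Sigma_0^{-1} \beta_0$, $\Delta_l = b^T (Z_l^T Z_l)\, b + \beta_0^T \Sigma_0^{-1} \beta_0 + Q_l^2 + a s^2$, $b = (Z_l^T Z_l)^{-1} Z_l^T y$, $Q_l^2 = y^T (I_n - H_l)\, y$, and $H_l = Z_l (Z_l^T Z_l)^{-1} Z_l^T$.
  2. $\sigma^2 \mid L = l, y, X \sim \text{Inv-}\chi^2\left(n + a,\; \Delta_l - \theta_l^T \Psi_l \theta_l\right)$
  3. $\beta_l \mid \sigma^2, L = l, y, X \sim N\left(\Gamma_l A_l,\; \sigma^2 \Gamma_l\right)$, where $A_l = Z_l^T y + \Sigma_0^{-1} \beta_0$ and $\Gamma_l = \left[\Sigma_0^{-1} + (Z_l^T Z_l)\right]^{-1}$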
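The three steps above can be sketched as a direct sampler. This is a minimal sketch under stated assumptions, not the author's implementation: each pooling model is given its own design matrix `Z_list[l]` and its own prior pair `(beta0, Sigma0_inv)` of matching dimension, the marginal in step 1 is used exactly as written (up to proportionality), and the draw in step 2 is taken as $(\Delta_l - \theta_l^T \Psi_l \theta_l)/\chi^2_{n+a}$, which matches the kernel derived for $\sigma^2$. All function and argument names are hypothetical.

```python
import numpy as np

def pooling_terms(Z, y, beta0, Sigma0_inv, a, s2):
    """Return (Psi_l, theta_l, Delta_l - theta_l' Psi_l theta_l) for one model."""
    ZtZ = Z.T @ Z
    b = np.linalg.solve(ZtZ, Z.T @ y)        # least-squares estimate b
    r = y - Z @ b
    Q2 = r @ r                               # Q_l^2 = y'(I - H_l) y
    Psi = np.linalg.inv(ZtZ + Sigma0_inv)    # equals Gamma_l in step 3
    theta = ZtZ @ b + Sigma0_inv @ beta0     # equals A_l in step 3
    Delta = b @ ZtZ @ b + beta0 @ Sigma0_inv @ beta0 + Q2 + a * s2
    return Psi, theta, Delta - theta @ Psi @ theta

def model_probs(Z_list, y, priors, model_prior, a, s2):
    """Step 1: posterior probabilities P(L = l | y, X)."""
    n = len(y)
    logp = []
    for Z, (beta0, S0inv), pl in zip(Z_list, priors, model_prior):
        Psi, theta, resid = pooling_terms(Z, y, beta0, S0inv, a, s2)
        logp.append(0.5 * np.linalg.slogdet(Psi)[1]
                    - 0.5 * (n + a) * np.log(resid) + np.log(pl))
    logp = np.array(logp)
    p = np.exp(logp - logp.max())            # stabilize before normalizing
    return p / p.sum()

def draw_posterior(Z_list, y, priors, model_prior, a, s2, rng):
    """One joint draw (l, sigma^2, beta_l) following steps 1-3."""
    probs = model_probs(Z_list, y, priors, model_prior, a, s2)
    l = int(rng.choice(len(Z_list), p=probs))
    beta0, S0inv = priors[l]
    Psi, theta, resid = pooling_terms(Z_list[l], y, beta0, S0inv, a, s2)
    sigma2 = resid / rng.chisquare(len(y) + a)            # step 2
    beta = rng.multivariate_normal(Psi @ theta, sigma2 * Psi)  # step 3
    return l, sigma2, beta
```

Repeated calls to `draw_posterior` with a `numpy.random.default_rng` generator give independent draws from the joint posterior, so no Markov chain is needed.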

We derive these marginal and conditional distributions in reverse order to simplify computation and notation.

3. is derived by noting that




and thus by standard results (Gelman et al., 2004, p. 85–86)


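where the posterior mean is $\tilde\beta = \left[(\sigma^2 \Sigma_0)^{-1} + \left(\sigma^2 (Z_l^T Z_l)^{-1}\right)^{-1}\right]^{-1} \left[\left(\sigma^2 (Z_l^T Z_l)^{-1}\right)^{-1} b + (\sigma^2 \Sigma_0)^{-1} \beta_0\right] = \left[\Sigma_0^{-1} + Z_l^T Z_l\right]^{-1} \left[Z_l^T y + \Sigma_0^{-1} \beta_0\right]$ and the posterior covariance is $\tilde\Sigma = \sigma^2 \left[\Sigma_0^{-1} + Z_l^T Z_l\right]^{-1}$.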

2. is derived by






from the normalizing constant for a N (μ, Σ) distribution, and thus


which is the kernel of a scaled inverse chi-square distribution with $n + a$ degrees of freedom and scaling factor $\Delta_l - \theta_l^T \Psi_l \theta_l$.

1. then follows from 2.:




from the normalizing constant for the $\text{Inv-}\chi^2(n, s^2)$ distribution.

7.1 Fractional Bayes Factors

To implement O’Hagan’s (1995) Fractional Bayes Factors for the marginal weight pooling selection probability, we replaced




where 0 < b < 1 represents a “training fraction” of the data set aside to provide prior information for the parameters for the lth pooling model. From the derivation of 1. above we have


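for $\Psi_{bl} = \left((b\, Z_l^T Z_l) + \Sigma_0^{-1}\right)^{-1}$, $\theta_{bl} = b\, (Z_l^T Z_l)\, \mathbf{b} + \Sigma_0^{-1} \beta_0$, and $\Delta_{bl} = b \left[\mathbf{b}^T (Z_l^T Z_l)\, \mathbf{b} + Q_l^2\right] + \beta_0^T \Sigma_0^{-1} \beta_0 + a s^2$, where $\mathbf{b} = (Z_l^T Z_l)^{-1} Z_l^T y$ is the least-squares estimate from step 1, written in bold here to distinguish it from the scalar training fraction $b$. Thus using FBF, we have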


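A minimal sketch of the fractional marginal for one pooling model follows. It assumes the standard Fractional Bayes Factor construction of O'Hagan (1995), in which the full-data marginal is divided by the marginal computed with the likelihood raised to the training fraction, and it assumes that the fractional kernel carries the exponent $-(bn + a)/2$; the displayed expressions are not available in this copy of the text, so both points are assumptions. The function name, `frac` (the training fraction $b$), and `bhat` (the least-squares estimate) are hypothetical.

```python
import numpy as np

def log_fractional_marginal(Z, y, beta0, Sigma0_inv, a, s2, frac):
    """log of the full-data marginal over the training-fraction marginal,
    up to an additive constant shared by all pooling models."""
    n = len(y)
    ZtZ = Z.T @ Z
    bhat = np.linalg.solve(ZtZ, Z.T @ y)   # least-squares estimate
    r = y - Z @ bhat
    Q2 = r @ r                              # residual sum of squares Q_l^2
    quad = bhat @ ZtZ @ bhat

    def log_kernel(f):
        # f = 1.0 gives the full-data terms (Psi_l, theta_l, Delta_l);
        # f = frac gives the fractional terms (Psi_bl, theta_bl, Delta_bl).
        Psi = np.linalg.inv(f * ZtZ + Sigma0_inv)
        theta = f * (ZtZ @ bhat) + Sigma0_inv @ beta0
        Delta = f * (quad + Q2) + beta0 @ Sigma0_inv @ beta0 + a * s2
        resid = Delta - theta @ Psi @ theta
        return 0.5 * np.linalg.slogdet(Psi)[1] - 0.5 * (f * n + a) * np.log(resid)

    return log_kernel(1.0) - log_kernel(frac)
```

Adding the log prior model probability to this quantity and normalizing across models gives fractional posterior model probabilities; a training fraction of `len(y) ** -0.5` corresponds to the $n^{-1/2}$ choice used for PWTF2.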

  • Alexander CH, Dahl S, Weidman L. Making estimates from the American Community Survey. Proceedings of the Social Statistics Section, American Statistical Association. 1997;2000:88–97.
  • Barker DJP, Gluckman PD, Godfrey KM, Harding JE, Owens JA, Robinson JS. Fetal Nutrition and Cardiovascular Disease in Adult Life. Lancet. 1993;341:938–941.
  • Beaumont J-F, Alavi A. Robust Generalized Regression Estimation. Survey Methodology. 2004;30:195–208.
  • Binder DA. On the Variances of Asymptotically Normal Estimators from Complex Surveys. International Statistical Review. 1983;51:279–292.
  • Cox BG, McGrath DS. An Examination of the Effect of Sample Weight Truncation on the Mean Square Error of Survey Estimates. Paper presented at the 1981 Biometric Society ENAR meeting; Richmond, VA. 1981.
  • Curhan GC, Willett WC, Rimm EB, Spiegelman D, Ascherio AL, Stampfer MJ. Birth Weight and Adult Hypertension, Diabetes Mellitus and Obesity in US Men. Circulation. 1996;94:3246–3250.
  • Davison AC, Hinkley DV. Bootstrap Methods and their Application. Cambridge University Press; Cambridge: 1997.
  • Deville JC, Sarndal CE. Calibration Estimators in Survey Sampling. Journal of the American Statistical Association. 1992;87:376–382.
  • DiCiccio TJ, Kass RE, Raftery AE, Wasserman L. Computing Bayes Factors by Combining Simulation and Asymptotic Approximations. Journal of the American Statistical Association. 1997;92:903–915.
  • Elliott MR, Little RJA. Model-based Approaches to Weight Trimming. Journal of Official Statistics. 2000;16:191–210.
  • Ericson WA. Subjective Bayesian Modeling in Sampling Finite Populations. Journal of the Royal Statistical Society. 1969;B31:195–234.
  • Folsom RE, Singh AC. The Generalized Exponential Model for Sampling Weight Calibration for Extreme Values, Nonresponse, and Poststratification. Proceedings of the Survey Research Methods Section, American Statistical Association. 2000;2000:598–603.
  • Forrester TE, Wilks RJ, Bennett FI, Simeon D, Osmond C, Allen M, Chung AP, Scott P. Fetal Growth and Cardiovascular Risk Factors in Jamaican Schoolchildren. British Medical Journal. 1996;312:156–160.
  • Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis. 2nd ed. Boca Raton, FL: Chapman and Hall/CRC; 2004.
  • Hampel FR, Ronchetti EM, Rousseeuw PJ, Stahel WA. Robust Statistics: The Approach Based on Influence Functions. New York: Wiley; 1986.
  • Holt D, Smith TMF. Poststratification. Journal of the Royal Statistical Society. 1979;A142:33–46.
  • Horvitz DG, Thompson DJ. A Generalization of Sampling Without Replacement from a Finite Universe. Journal of the American Statistical Association. 1952;47:663–685.
  • Isaki CT, Fuller WA. Survey Design under a Regression Superpopulation Model. Journal of the American Statistical Association. 1982;77:89–96.
  • Kass RE, Raftery AE. Bayes Factors. Journal of the American Statistical Association. 1995;90:773–795.
  • Kish L. Survey Sampling. New York: John Wiley and Sons; 1965.
  • Kish L. Weighting for Unequal Pi. Journal of Official Statistics. 1992;8:183–200.
  • Korn EL, Graubard BI. Analysis of Health Surveys. Wiley; New York: 1999.
  • Little RJA. Estimating a Finite Population Mean from Unequal Probability Samples. Journal of the American Statistical Association. 1983;78:596–604.
  • Little RJA. Inference with Survey Weights. Journal of Official Statistics. 1991;7:405–424.
  • Little RJA. Poststratification: A Modeler’s Perspective. Journal of the American Statistical Association. 1993;88:1001–1012.
  • Little RJA, Lewitzky S, Heeringa S, Lepkowski J, Kessler RC. Assessment of Weighting Methodology for the National Comorbidity Survey. American Journal of Epidemiology. 1997;146:439–449.
  • Lu H, Gelman A. A Method for Estimating Design-based Sampling Variances for Surveys with Weighting, Poststratification, and Raking. Journal of Official Statistics. 2003;19:133–152.
  • Matthes JW, Lewis PA, Davies DP, Bethel JA. Relation between Birth Weight at Term and Systolic Blood Pressure in Adolescence. British Medical Journal. 1994;308:1074–1077.
  • Mohadjer L, Montaquila J, Waksberg J, Bell B, James P, Flores-Cervantes I, Montes M. National Health and Nutrition Examination Survey III: Weighting and Estimation Methodology, Executive Summary. Prepared by Westat, Inc., for the National Center for Health Statistics; Hyattsville, MD: 1996. http://www.cdc.gov/nchs/data/nhanes/nhanes3/cdrom/NCHS/MANUALS/WGTEXEC.PDF.
  • O’Hagan A. Fractional Bayes Factors for Model Comparison. Journal of the Royal Statistical Society. 1995;B57:99–138.
  • Potter F. A Study of Procedures to Identify and Trim Extreme Sample Weights. Proceedings of the Survey Research Methods Section, American Statistical Association. 1990;1990:225–230.
  • Pfeffermann D. The Role of Sampling Weights when Modeling Survey Data. International Statistical Review. 1993;61:317–337.
  • Pfeffermann D. The Use of Sampling Weights for Survey Data Analysis. Statistical Methods in Medical Research. 1996;5:239–261.
  • Rich-Edwards JW, Stampfer MJ, Manson JE, Rosner B, Hankinson SE, Colditz GA, Willett WC, Hennekens CH. Birth Weight and Risk of Cardiovascular Disease in a Cohort of Women Followed up since 1976. British Medical Journal. 1997;315:396–400.
  • Rubin DB. Multiple Imputation for Non-Response in Surveys. New York: Wiley; 1987.
  • Sarndal CE. On π-inverse Weighting Versus Best Linear Unbiased Weighting in Probability Sampling. Biometrika. 1980;67:639–650.
  • Skinner CJ, Holt D, Smith TMF. Analysis of Complex Surveys. Wiley; New York: 1989.
  • Spiegelhalter DJ, Smith AFM. Bayes factors for Linear and Log-linear Models with Vague Prior Information. Journal of the Royal Statistical Society. 1982;B44:377–387.
  • Tierney L, Kadane J. Accurate Approximations for Posterior Moments and Marginal Densities. Journal of the American Statistical Association. 1986;81:82–86.
  • Zaslavsky AM, Schenker N, Belin TR. Downweighting Influential Clusters in Surveys: Application to the 1990 Post Enumeration Survey. Journal of the American Statistical Association. 2001;96:858–869.