
# Modeling Data with Excess Zeros and Measurement Error: Application to Evaluating Relationships between Episodically Consumed Foods and Health Outcomes

^{1,*} Douglas Midthune,^{1} Dennis W. Buckman,^{2} Kevin W. Dodd,^{1} Patricia M. Guenther,^{3} Susan M. Krebs-Smith,^{4} Amy F. Subar,^{4} Janet A. Tooze,^{5} Raymond J. Carroll,^{6} and Laurence S. Freedman^{7}

^{1}Biometry, Division of Cancer Prevention, National Cancer Institute, 6130 Executive Boulevard, EPN-3131, Bethesda, Maryland 20892-7354, U.S.A

^{2}Information Management Services, Inc., 12501 Prosperity Drive, Silver Spring, Maryland 20904, U.S.A

^{3}Center for Nutrition Policy and Promotion, U.S. Department of Agriculture, 3101 Park Center Drive, Ste 1034, Alexandria, Virginia 22302, U.S.A

^{4}Applied Research Program, Division of Cancer Control and Population Sciences, National Cancer Institute, 6130 Executive Boulevard, EPN-4005, Bethesda, Maryland 20892, U.S.A

^{5}Department of Biostatistical Sciences, Wake Forest University, School of Medicine, Medical Center Boulevard, Winston-Salem, North Carolina 27157, U.S.A

^{6}Department of Statistics, Texas A&M University, 3143 TAMU, College Station, Texas 77843-3143, U.S.A

^{7}Gertner Institute for Epidemiology and Health Policy Research, Sheba Medical Center, Tel Hashomer 52161, Israel

## Summary

Dietary assessment of episodically consumed foods gives rise to nonnegative data that have excess zeros and measurement error. Tooze et al. (2006, *Journal of the American Dietetic Association* **106,** 1575–1587) describe a general statistical approach (National Cancer Institute method) for modeling such food intakes reported on two or more 24-hour recalls (24HRs) and demonstrate its use to estimate the distribution of the food’s usual intake in the general population. In this article, we propose an extension of this method to predict individual usual intake of such foods and to evaluate the relationships of usual intakes with health outcomes. Following the regression calibration approach for measurement error correction, individual usual intake is generally predicted as the conditional mean intake given 24HR-reported intake and other covariates in the health model. One feature of the proposed method is that additional covariates potentially related to usual intake may be used to increase the precision of estimates of usual intake and of diet-health outcome associations. Applying the method to data from the Eating at America’s Table Study, we quantify the increased precision obtained from including reported frequency of intake on a food frequency questionnaire (FFQ) as a covariate in the calibration model. We then demonstrate the method in evaluating the linear relationship between log blood mercury levels and fish intake in women by using data from the National Health and Nutrition Examination Survey, and show increased precision when including the FFQ information. Finally, we present simulation results evaluating the performance of the proposed method in this context.

**Keywords:** Dietary measurement error, Dietary survey, Episodically consumed foods, Excess zero models, Food frequency questionnaire, Fish, Individual usual intake, Mercury, Nonlinear mixed models, Regression calibration, 24-hour recall

## 1. Introduction

U.S. national nutritional surveys have traditionally used the 24-hour recall (24HR) as the primary assessment instrument for collecting information on food intake (Dwyer et al., 2003). The main purposes of such surveys are to estimate the distribution of usual (that is, average long-term) intake of nutrients and foods in the population, and to monitor such intakes over time. Another important purpose is to relate individual usual intakes to health outcomes such as blood pressure.

Even assuming unbiasedness of the 24HR, there has been concern over its use for assessing intake of foods that are not typically consumed every day. Many consumers of such episodically consumed foods report zero intake on the 24HR if the report happens to be on a nonconsumption day. Consequently, with typically only one or two administrations of a 24HR in surveys, usual intake of such foods is difficult to estimate. For this reason, an additional instrument, a food frequency questionnaire (FFQ) that queries frequency of consumption over the past year, was included in the National Health and Nutrition Examination Survey (NHANES) conducted in 2003–2006 (Subar et al., 2006). Although the FFQ leads to biased reporting of intake of energy (Kipnis et al., 2003) and therefore of at least some foods, it might nevertheless provide valuable information together with the 24HR to improve estimates of usual intake.

Dodd et al. (2006) reviewed the methods for estimating distributions of usual intake. Tooze et al. (2006) proposed a new method, called the NCI method, to handle nonnegative data with excess zeros that occur in 24HR reports on episodically consumed foods, and demonstrated its use for estimating distributions of usual intakes. Generalizing the two-part modeling approach to longitudinal semicontinuous observations (Olsen and Schafer, 2001; Tooze, Grunwald, and Jones, 2002) to include latent variables, the NCI method uses a two-part nonlinear mixed effects *measurement error* model with correlated random effects, where both parts may incorporate covariates, including other dietary-assessment instruments, such as a FFQ. In this article, we propose an extension of the NCI method to estimate an individual’s usual intake of episodically consumed foods using 24HR data with covariate information. The method may fill a void in analyzing relationships between usual intake and health outcomes for these foods.

All dietary assessment methods based on self-report, including the 24HR, fail to measure true usual intake precisely. The measurement error distorts diet-health outcome relationships, often attenuating them. A popular method of correcting for measurement error is regression calibration (Carroll et al., 2006), which uses, in place of the unknown usual intake, its best mean square error (MSE) predictor, that is, its estimated conditional expectation given the observed 24HRs and other covariates in the health outcome model.
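The attenuation and its correction can be illustrated with a small simulation. The sketch below is not the method of this article: it assumes the simplest classical error model, normally distributed usual intake with mean zero, known variance components, and hypothetical parameter values (`alpha_t`, `var_u`, `var_e`) chosen only for illustration.

```python
import random

def ols_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def simulate(n=20000, n_recalls=2, alpha_t=1.0, var_u=1.0, var_e=3.0, seed=1):
    """Return (naive slope, calibrated slope, attenuation factor)."""
    rng = random.Random(seed)
    sd_u, sd_e = var_u ** 0.5, var_e ** 0.5
    t = [rng.gauss(0.0, sd_u) for _ in range(n)]           # true usual intake, mean 0
    y = [alpha_t * ti + rng.gauss(0.0, 1.0) for ti in t]   # outcome from a linear health model
    # Mean of n_recalls error-prone reports per person.
    rbar = [ti + sum(rng.gauss(0.0, sd_e) for _ in range(n_recalls)) / n_recalls
            for ti in t]
    w = var_u / (var_u + var_e / n_recalls)                # attenuation factor, assumed known
    calibrated = [w * r for r in rbar]                     # E(T | mean recall) when E(T) = 0
    return ols_slope(rbar, y), ols_slope(calibrated, y), w

naive, calibrated, w = simulate()
print(f"attenuation factor {w:.2f}, naive slope {naive:.2f}, "
      f"calibrated slope {calibrated:.2f}")
```

Under these assumptions the naive slope should land near `alpha_t * w` and the calibrated slope near `alpha_t`, at the cost of a larger sampling variance.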

In this article, we derive this conditional expectation for episodically consumed foods. In Section 2, we describe the measurement error model and derive the corresponding regression calibration predictor. The approach allows conditioning on additional covariates related to intake, which may increase the precision of estimates of usual intake and diet-health outcome associations. In Section 3, using data from the Eating at America’s Table Study (EATS), we quantify the increased precision obtained from including a FFQ report as a covariate. In Section 4, we demonstrate the method for evaluating the relationship between blood mercury and fish intake based on NHANES data and show increased precision of the estimated association from incorporating the FFQ. In Section 5, we present simulations to evaluate the finite-sample performance of the method. Section 6 contains discussion.

## 2. Measurement Error and Regression Calibration Models

For individual $i$ on day $j$, $i = 1, \ldots, n$; $j = 1, \ldots, J_i$, let $T_{ij}$ and $R_{ij}$ denote true intake of a nutrient or food and the corresponding 24HR, respectively. We define true individual usual intake $T_i$ as the within-person expectation of true daily intake, $T_i = E(T_{ij} \mid i)$. The notation $E(\cdot \mid i)$ indicates that the expectation is conditional on the $i$th individual. Let $Y_i$ denote a health outcome possibly related to $T_i$ through the regression model

$$E(Y_i \mid T_i, \mathbf{Z}_i) = m(\alpha_0 + \alpha_T T_i + \boldsymbol{\alpha}_Z^t \mathbf{Z}_i), \qquad (1)$$

where $\mathbf{Z}_i = (Z_{i1}, \ldots, Z_{ip})^t$ is a vector of covariates measured without error, and $m^{-1}$ is a link function. For example, $m(v) = v$ leads to linear regression, while $m(v) = H(v) = 1/(1 + e^{-v})$ leads to logistic regression. Our main interest is estimation of the dietary effect

$$m^{-1}\{E(Y \mid T_1, \mathbf{Z})\} - m^{-1}\{E(Y \mid T_0, \mathbf{Z})\} = \alpha_T (T_1 - T_0) \qquad (2)$$

when dietary intake changes from $T_0$ to $T_1$. For example, for linear regression, expression (2) represents a change in the mean outcome, while, for logistic regression, it represents a log odds ratio.

Let ${\mathbf{X}}_{i}={({\mathbf{Z}}_{i}^{t},{\mathbf{C}}_{i}^{t})}^{t}$ be a vector of covariates related to usual intake. It generally includes covariates $\mathbf{Z}_i$ in the health outcome model (1) and some additional factors $\mathbf{C}_i = (C_{i1}, \ldots, C_{iq})^t$ that are related to usual intake $T_i$ given $\mathbf{Z}_i$, but unrelated to outcome $Y_i$ given $T_i$ and $\mathbf{Z}_i$. Although $\mathbf{C}_i$ formally follows the definition of instrumental variables, its use here is different.

Regression calibration requires evaluation of the best MSE predictor $E(T_i \mid \mathbf{R}_i, \mathbf{X}_i)$ given the reported intakes $\mathbf{R}_i = (R_{i1}, \ldots, R_{iJ_i})^t$ and covariates $\mathbf{X}_i$ for individual $i$. Assuming that error in $\mathbf{R}_i$ is nondifferential with respect to $Y_i$, i.e., the conditional distribution of $Y_i$ given $(T_i, \mathbf{X}_i)$ is independent of $\mathbf{R}_i$, using predictor $E(T_i \mid \mathbf{R}_i, \mathbf{X}_i)$ in model (1) in place of the unknown $T_i$ retains the same (e.g., linear regression), or approximately the same (e.g., logistic regression), regression coefficient $\alpha_T$. When predictor $E(T_i \mid \mathbf{R}_i, \mathbf{X}_i)$ is consistently estimated, regression calibration yields a consistent (approximately consistent for most nonlinear models) estimate of $\alpha_T$. It is required that $\mathbf{X}_i$ include all covariates $\mathbf{Z}_i$ in the health outcome model (1). We show theoretically in Web Appendix A and through examples in the main text that inclusion of additional predictors $\mathbf{C}_i$ in $\mathbf{X}_i$ can improve both the MSE of predicted usual intake and the precision of the estimated coefficient $\hat{\alpha}_T$.

Before introducing the measurement error model for episodically consumed foods, we present the model for nutrients and foods consumed daily.

### 2.1 Statistical Model for Daily Consumed Nutrients or Foods

#### 2.1.1 Classical measurement error model

Following convention, the 24HRs are assumed unbiased for individual usual intake, i.e.,

$$R_{ij} = T_i + \epsilon_{ij}, \qquad (3)$$

where within-person random errors $\epsilon_{ij}$ reflect the daily variation in an individual’s intake and other sources of random error. The model requires that $\epsilon_{ij}$ be independent of $T_i$ and each other and have a constant variance ${\sigma}_{\epsilon}^{2}$. In addition, it is often assumed that $\epsilon_{ij} \sim \text{Normal}(0,{\sigma}_{\epsilon}^{2})$.

Assume the linear regression of $T_i$ on covariates $\mathbf{X}_i = (X_{i1}, \ldots, X_{ik})^t$, $k = p + q$, i.e.,

$$T_i = \beta_0 + \boldsymbol{\beta}_X^t \mathbf{X}_i + u_i, \quad u_i \sim \text{Normal}(0, \sigma_u^2). \qquad (4)$$

From equations (3)–(4), the 24HR-reported intake follows the linear mixed measurement error model

$$R_{ij} = \beta_0 + \boldsymbol{\beta}_X^t \mathbf{X}_i + u_i + \epsilon_{ij}. \qquad (5)$$

In addition to the fixed effect population-level parameters $\boldsymbol{\beta} = (\beta_0, \boldsymbol{\beta}_X^t)^t$, model (5) includes the random effect $u_i$ representing person-specific deviations of usual intake from the population profile defined by covariates $\mathbf{X}_i$, and within-person random error $\epsilon_{ij}$. Two or more 24HRs on a number of individuals are required to distinguish between- and within-person variation and uniquely estimate all model parameters.

Evaluation of the regression calibration predictor $E(T_i \mid \mathbf{R}_i, \mathbf{X}_i)$ is a well-known procedure in the theory of mixed models (McCulloch and Searle, 2001). Let $\boldsymbol{\theta} = (\boldsymbol{\beta}^t, \sigma_u^2, \sigma_\epsilon^2)^t$ be the vector of parameters in model (5) and $f$ denote a probability density function. Because, from equation (4), $T_i$ is a function of $\mathbf{X}_i$ and $u_i$, denoted as $T_i = T_i(\mathbf{X}_i, u_i; \boldsymbol{\theta})$, we have

$$E(T_i \mid \mathbf{R}_i, \mathbf{X}_i; \boldsymbol{\theta}) = \int T_i(\mathbf{X}_i, u_i; \boldsymbol{\theta})\, f(u_i \mid \mathbf{R}_i, \mathbf{X}_i; \boldsymbol{\theta})\, du_i, \qquad (6)$$

where, according to Bayes’ theorem,

$$f(u_i \mid \mathbf{R}_i, \mathbf{X}_i; \boldsymbol{\theta}) = \frac{f(\mathbf{R}_i \mid \mathbf{X}_i, u_i; \boldsymbol{\theta})\, f(u_i; \boldsymbol{\theta})}{\int f(\mathbf{R}_i \mid \mathbf{X}_i, u_i; \boldsymbol{\theta})\, f(u_i; \boldsymbol{\theta})\, du_i}. \qquad (7)$$

When parameters in $\boldsymbol{\theta}$ are estimated by fitting model (5) to the data, predicted usual intake $\hat{T}_i(\hat{\boldsymbol{\theta}})$ is known as the empirical Bayes (EB) estimator, which for the linear model (5) is also known as the best linear unbiased predictor (BLUP), given by the weighted average

$$\hat{T}_i(\hat{\boldsymbol{\theta}}) = \hat{w}_i \bar{R}_{i\cdot} + (1 - \hat{w}_i)(\hat{\beta}_0 + \hat{\boldsymbol{\beta}}_X^t \mathbf{X}_i) \qquad (8)$$

of the mean of the $J_i$ reported intakes, $\bar{R}_{i\cdot}$, and the covariate predictor ${\widehat{\beta}}_{0}+{\widehat{\boldsymbol{\beta}}}_{X}^{t}{\mathbf{X}}_{i}$, with weights ${\widehat{w}}_{i}=\frac{{\widehat{\sigma}}_{u}^{2}}{{\widehat{\sigma}}_{u}^{2}+{\widehat{\sigma}}_{\epsilon}^{2}/{J}_{i}}$ and $1 - \hat{w}_i$, respectively (McCulloch and Searle, 2001).

This methodology is already known and was previously suggested for classical measurement error correction (e.g., Whittemore, 1989; Tsiatis, DeGruttola, and Wulfsohn, 1995), but is applicable only to reported intakes that follow the classical error model on the original scale.
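As a minimal sketch of the weighted average in (8), the function below shrinks a person's mean recall toward the covariate predictor. The function name `blup` and the numeric inputs are hypothetical, and the variance components are taken as already estimated.

```python
def blup(recalls, covariate_pred, var_u, var_e):
    """Weighted average (8) of the mean recall and the covariate predictor.

    recalls        -- the J_i reported intakes for one person
    covariate_pred -- beta_0 + beta_X' X_i, evaluated at the estimates
    var_u, var_e   -- between- and within-person variance estimates
    """
    n_recalls = len(recalls)
    mean_recall = sum(recalls) / n_recalls
    weight = var_u / (var_u + var_e / n_recalls)   # w_i in (8)
    return weight * mean_recall + (1.0 - weight) * covariate_pred

# Two recalls averaging 3.0, covariate predictor 1.0, weight 1 / (1 + 2/2) = 0.5.
print(blup([2.0, 4.0], covariate_pred=1.0, var_u=1.0, var_e=2.0))  # prints 2.0
```

More recalls per person, or a smaller within-person variance, move the weight toward 1 and the prediction toward the person's own mean recall.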

#### 2.1.2 Classical error model on transformed scale

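Often, within-person random error in the 24HR reported intake is dependent on the individual mean and has a skewed distribution, violating the classical error model assumptions. The most common fix has been to monotonically transform the intakes $R_{ij}$ to values ${R}_{ij}^{\ast}=g({R}_{ij})$ that more closely follow the classical model with normally distributed error (Eckert, Carroll, and Wang, 1997). In most cases, this may be achieved using the Box–Cox family of transformations (Box and Cox, 1964)

$$g(v, \lambda) = \begin{cases} (v^{\lambda} - 1)/\lambda, & \lambda \neq 0, \\ \log v, & \lambda = 0. \end{cases}$$

We assume that such a transformation exists and that, on the transformed scale, we have

$$R_{ij}^{\ast} \equiv g(R_{ij}, \lambda_R) = \mu_i^{\ast} + \epsilon_{ij}, \quad \epsilon_{ij} \sim \text{Normal}(0, \sigma_\epsilon^2),$$

where $\epsilon_{ij}$ are independent of the individual mean ${\mu}_{i}^{\ast}=E({R}_{ij}^{\ast}\mid i)$ and of each other. Assuming the regression of ${\mu}_{i}^{\ast}$ on $\mathbf{X}_i$ is linear with a normally distributed regression error $u_i$, i.e.,

$$\mu_i^{\ast} = \beta_0 + \boldsymbol{\beta}_X^t \mathbf{X}_i + u_i, \quad u_i \sim \text{Normal}(0, \sigma_u^2),$$

the reported intakes follow the *nonlinear* mixed effects measurement error model

$$g(R_{ij}, \lambda_R) = \beta_0 + \boldsymbol{\beta}_X^t \mathbf{X}_i + u_i + \epsilon_{ij}. \qquad (10)$$

Following convention (Dodd et al., 2006), we continue to assume that 24HRs are unbiased for true individual intake *on the original scale*, so the usual intake of person $i$ is:

$$T_i = E(R_{ij} \mid i) = E\{g^{-1}(\beta_0 + \boldsymbol{\beta}_X^t \mathbf{X}_i + u_i + \epsilon_{ij}, \lambda_R) \mid i\} \approx g^{\ast}(\beta_0 + \boldsymbol{\beta}_X^t \mathbf{X}_i + u_i, \lambda_R, \sigma_\epsilon^2), \qquad (11)$$

where, from Taylor’s expansion,

$$g^{\ast}(v, \lambda, \sigma^2) = g^{-1}(v, \lambda) + \frac{\sigma^2}{2}\, \frac{\partial^2 g^{-1}(v, \lambda)}{\partial v^2}. \qquad (12)$$

Following equation (11), the best predictor of individual usual intake is given by

$$E(T_i \mid \mathbf{R}_i, \mathbf{X}_i) = E\{g^{\ast}(\beta_0 + \boldsymbol{\beta}_X^t \mathbf{X}_i + u_i, \lambda_R, \sigma_\epsilon^2) \mid \mathbf{R}_i, \mathbf{X}_i\}.$$

When $g$ is the identity function, this conditional expectation reduces to the BLUP (8). For general $g$, it is different and needs to be evaluated according to equations (6)–(7). Because $u_i$ and $\epsilon_{ij}$ are independent and normally distributed, and because conditioning on $(\mathbf{R}_i, \mathbf{X}_i; \boldsymbol{\theta})$ is the same as conditioning on $({\mathbf{R}}_{i}^{\ast},{\mathbf{X}}_{i};\boldsymbol{\theta})\equiv \{g({\mathbf{R}}_{i},{\lambda}_{R}),{\mathbf{X}}_{i};\boldsymbol{\theta}\}$, $\boldsymbol{\theta}={({\beta}_{0},{\boldsymbol{\beta}}_{X}^{t},{\sigma}_{\epsilon}^{2},{\sigma}_{u}^{2},{\lambda}_{R})}^{t}$, we have:

$$f(\mathbf{R}_i^{\ast} \mid \mathbf{X}_i, u_i; \boldsymbol{\theta})\, f(u_i; \boldsymbol{\theta}) = \left\{ \prod_{j=1}^{J_i} \frac{1}{\sigma_\epsilon}\, \phi\!\left( \frac{R_{ij}^{\ast} - \beta_0 - \boldsymbol{\beta}_X^t \mathbf{X}_i - u_i}{\sigma_\epsilon} \right) \right\} \frac{1}{\sigma_u}\, \phi\!\left( \frac{u_i}{\sigma_u} \right),$$

so that

$$E(T_i \mid \mathbf{R}_i, \mathbf{X}_i; \boldsymbol{\theta}) = \frac{\int g^{\ast}(\beta_0 + \boldsymbol{\beta}_X^t \mathbf{X}_i + u_i, \lambda_R, \sigma_\epsilon^2) \prod_{j=1}^{J_i} \phi\!\left( \frac{R_{ij}^{\ast} - \beta_0 - \boldsymbol{\beta}_X^t \mathbf{X}_i - u_i}{\sigma_\epsilon} \right) \phi\!\left( \frac{u_i}{\sigma_u} \right) du_i}{\int \prod_{j=1}^{J_i} \phi\!\left( \frac{R_{ij}^{\ast} - \beta_0 - \boldsymbol{\beta}_X^t \mathbf{X}_i - u_i}{\sigma_\epsilon} \right) \phi\!\left( \frac{u_i}{\sigma_u} \right) du_i}, \qquad (13)$$

where $\phi$ is the standard normal distribution density. Following the EB approach, the integrals in expression (13) are evaluated by substituting in the estimates $\hat{\boldsymbol{\theta}}$ after fitting model (10) to the data.

As far as we know, this method has not previously been described and should be useful in itself for intakes of nutrients or foods that are daily consumed. It is also an intermediate step to the estimation of usual intake for episodically consumed foods.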
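The predictor for daily consumed foods on a transformed scale can be sketched numerically. The code below is an illustration under stated assumptions rather than the authors' implementation: it takes the parameter estimates as given, uses the second-order Taylor back-transformation $g^{-1}(v) + (\sigma_\epsilon^2/2)\,\partial^2 g^{-1}(v)/\partial v^2$, replaces quadrature by a plain equally spaced grid over the random effect, and treats values outside the range of the Box–Cox transformation as zero intake. All function names are hypothetical.

```python
import math

def boxcox(v, lam):
    """Box-Cox transformation g(v, lambda) for v > 0."""
    return math.log(v) if lam == 0 else (v ** lam - 1.0) / lam

def g_star(x, lam, var_e):
    """Taylor back-transformation: g^{-1}(x) + (var_e / 2) * d^2 g^{-1}(x) / dx^2."""
    if lam == 0:
        return math.exp(x) * (1.0 + 0.5 * var_e)
    base = lam * x + 1.0
    if base <= 0.0:          # outside the range of g; treated as zero intake (assumption)
        return 0.0
    inverse = base ** (1.0 / lam)
    second = (1.0 - lam) * base ** (1.0 / lam - 2.0)
    return inverse + 0.5 * var_e * second

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def predict_usual_intake(recalls, mu, lam, var_u, var_e, n_grid=2001, width=10.0):
    """Posterior mean of g*(mu + u) given the recalls, on a grid for u.

    mu is the covariate predictor beta_0 + beta_X' X_i on the transformed scale.
    """
    sd_u, sd_e = math.sqrt(var_u), math.sqrt(var_e)
    r_star = [boxcox(r, lam) for r in recalls]
    num = den = 0.0
    for k in range(n_grid):
        u = sd_u * (-width + 2.0 * width * k / (n_grid - 1))
        weight = normal_pdf(u / sd_u)                  # prior density of u (up to a constant)
        for rs in r_star:
            weight *= normal_pdf((rs - mu - u) / sd_e)  # likelihood of each transformed recall
        num += g_star(mu + u, lam, var_e) * weight
        den += weight
    return num / den

# Log-transformed intakes (lambda = 0) with two recalls.
print(predict_usual_intake([math.exp(1.0), math.exp(2.0)], mu=1.0, lam=0.0,
                           var_u=1.0, var_e=2.0))
```

With `lam=1.0` the transformation is a shift by one unit and the grid result agrees with the shrinkage predictor for the untransformed recalls, which is a convenient check on the numerical integration.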

### 2.2 Statistical Model for Episodically Consumed Foods

#### 2.2.1 The measurement error model

Generalizing the approach developed at Iowa State University (Nusser, Fuller, and Guenther, 1997) to deal with episodically consumed foods, Tooze et al. (2006) considered two components of usual intake in the NCI method. The first is the individual *probability* to consume a food on a given day, $p_i = P(T_{ij} > 0 \mid i)$. The second is the usual intake *amount* on a consumption day, $A_i = E(T_{ij} \mid i; T_{ij} > 0)$. It follows that usual intake is

$$T_i = p_i A_i, \qquad (14)$$

the product of the probability to consume and the usual amount on consumption days.

To specify a measurement error model in this case requires modifying the assumptions. Following Tooze et al. (2006), we assume that (i) a food is reported on the 24HR as consumed on a certain day if and only if it *was* consumed on that day, so that $P(R_{ij} > 0 \mid i) = P(T_{ij} > 0 \mid i) \equiv p_i$, and (ii) the 24HR is unbiased for true usual intake on consumption days, $E(R_{ij} \mid i; R_{ij} > 0) = A_i$. From this it follows that overall the 24HR is unbiased for true usual intake:

$$E(R_{ij} \mid i) = P(R_{ij} > 0 \mid i)\, E(R_{ij} \mid i; R_{ij} > 0) = p_i A_i = T_i. \qquad (15)$$

Following the NCI method, we consider a two-part measurement error model for the 24HR. In the first part, we model the consumption probability as the mixed effects logistic regression

$$P(R_{ij} > 0 \mid i) = H(\beta_{10} + \boldsymbol{\beta}_{X1}^t \mathbf{X}_{1i} + u_{1i}), \qquad (16)$$

where ${u}_{1i}\sim \text{Normal}(0;{\sigma}_{u1}^{2})$ is independent of $\mathbf{X}_{1i}$. In addition to fixed effect population-level parameters ${\boldsymbol{\beta}}_{1}={({\beta}_{10},{\boldsymbol{\beta}}_{X1}^{t})}^{t}$, the model includes the random effect $u_{1i}$ allowing an individual’s consumption probability to deviate from the population profile defined by $\mathbf{X}_{1i}$. In the second part, the measurement error model for the positive reported intake is the same as equation (10), except that it relates only to consumption days. Specifically, we assume that the Box–Cox transformed positive intake follows the nonlinear mixed effects measurement error model

$$g(R_{ij}, \lambda_R) = \beta_{20} + \boldsymbol{\beta}_{X2}^t \mathbf{X}_{2i} + u_{2i} + \epsilon_{ij}, \quad R_{ij} > 0, \qquad (17)$$

where $u_{2i}$ and $\epsilon_{ij}$ are independent of $(\mathbf{X}_{1i}, \mathbf{X}_{2i})$, each other, and are normally distributed.

The two parts of the model are linked in two ways. First, the random effects, $u_{1i}$ and $u_{2i}$, may be correlated, so that

$$\operatorname{corr}(u_{1i}, u_{2i}) = \rho. \qquad (18)$$

Second, both parts of the model may share common covariates among the components of $\mathbf{X}_{1i}$ and $\mathbf{X}_{2i}$, also inducing correlation between probability and amount.

In our model, the probability of consumption for an individual may be arbitrarily small, but is always positive. The model therefore allows for any finite number of days with zero intakes, but does not incorporate never-consumers, if they exist. We discuss this further in Section 6.
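To make the two-part structure concrete, the sketch below simulates many daily reports for a single person and compares their mean with the product of consumption probability and mean amount. It is an illustration only: it assumes a log transformation for positive amounts, and the linear-predictor values `eta1` and `eta2` (fixed effects plus the person's random effects) are hypothetical.

```python
import math
import random

def expit(v):
    return 1.0 / (1.0 + math.exp(-v))

def simulate_recalls(n_days, eta1, eta2, var_e, rng):
    """Simulate daily reports for one person under a two-part model.

    eta1 -- logit of the person's consumption probability
    eta2 -- the person's mean log amount on consumption days
    """
    p = expit(eta1)
    sd_e = math.sqrt(var_e)
    recalls = []
    for _ in range(n_days):
        if rng.random() < p:     # consumption day: positive, right-skewed amount
            recalls.append(math.exp(eta2 + rng.gauss(0.0, sd_e)))
        else:                    # non-consumption day: exact zero
            recalls.append(0.0)
    return recalls

eta1, eta2, var_e = math.log(0.25), 1.0, 0.5      # consumption probability 0.2
recalls = simulate_recalls(200_000, eta1, eta2, var_e, random.Random(2024))

p_i = expit(eta1)
amount_exact = math.exp(eta2 + var_e / 2.0)          # exact lognormal mean
amount_taylor = math.exp(eta2) * (1.0 + var_e / 2.0)  # second-order Taylor approximation
mean_recall = sum(recalls) / len(recalls)
print(f"share of zero days {recalls.count(0.0) / len(recalls):.3f}")
print(f"long-run mean recall {mean_recall:.3f}, p*A exact {p_i * amount_exact:.3f}, "
      f"p*A Taylor {p_i * amount_taylor:.3f}")
```

For a log transformation the Taylor form slightly understates the exact lognormal mean, and the gap grows with the within-person variance.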

#### 2.2.2 Regression calibration model

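According to equation (15), usual intake $T_i$ on the original scale is

$$T_i = H(\beta_{10} + \boldsymbol{\beta}_{X1}^t \mathbf{X}_{1i} + u_{1i})\; g^{\ast}(\beta_{20} + \boldsymbol{\beta}_{X2}^t \mathbf{X}_{2i} + u_{2i}, \lambda_R, \sigma_\epsilon^2), \qquad (19)$$

where $g^{\ast}$ is defined by equation (12). As before, we follow formulas (6)–(7) to obtain:

$$E(T_i \mid \mathbf{R}_i, \mathbf{X}_i; \boldsymbol{\theta}) = \frac{\iint T_i(\mathbf{X}_i, u_{1i}, u_{2i}; \boldsymbol{\theta})\, f(\mathbf{R}_i^{\ast} \mid \mathbf{X}_i, u_{1i}, u_{2i}; \boldsymbol{\theta})\, f(u_{1i}, u_{2i}; \boldsymbol{\theta})\, du_{1i}\, du_{2i}}{\iint f(\mathbf{R}_i^{\ast} \mid \mathbf{X}_i, u_{1i}, u_{2i}; \boldsymbol{\theta})\, f(u_{1i}, u_{2i}; \boldsymbol{\theta})\, du_{1i}\, du_{2i}}, \qquad (20)$$

where

$$f(\mathbf{R}_i^{\ast} \mid \mathbf{X}_i, u_{1i}, u_{2i}; \boldsymbol{\theta}) = \prod_{j=1}^{J_i} \{1 - H(\beta_{10} + \boldsymbol{\beta}_{X1}^t \mathbf{X}_{1i} + u_{1i})\}^{I(R_{ij} = 0)} \left\{ H(\beta_{10} + \boldsymbol{\beta}_{X1}^t \mathbf{X}_{1i} + u_{1i})\, \frac{1}{\sigma_\epsilon}\, \phi\!\left( \frac{R_{ij}^{\ast} - \beta_{20} - \boldsymbol{\beta}_{X2}^t \mathbf{X}_{2i} - u_{2i}}{\sigma_\epsilon} \right) \right\}^{I(R_{ij} > 0)},$$

with $R_{ij}^{\ast} = g(R_{ij}, \lambda_R)$ for $R_{ij} > 0$, $I(\cdot)$ the indicator function, and $f(u_{1i}, u_{2i}; \boldsymbol{\theta})$ the bivariate normal density of the random effects.

Following the EB approach, the integrals in equation (20) may be evaluated using adaptive Gaussian quadrature (Liu and Pierce, 1994) by substituting in the maximum likelihood estimates of parameters $\boldsymbol{\theta}$ after simultaneously fitting models (16)–(18) to the data.

When usual intake has a skewed distribution with high leverage points, one might transform it to a more appropriate scale before relating to a health outcome in model (1). Assuming that such transformation is given by $g(T_i, \lambda_T)$, the only change in the methodology is in substituting

$$g\{H(\beta_{10} + \boldsymbol{\beta}_{X1}^t \mathbf{X}_{1i} + u_{1i})\; g^{\ast}(\beta_{20} + \boldsymbol{\beta}_{X2}^t \mathbf{X}_{2i} + u_{2i}, \lambda_R, \sigma_\epsilon^2), \lambda_T\} \qquad (21)$$

instead of $T_i$ in formula (20).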
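The conditional expectation for the two-part model can be approximated numerically once parameter estimates are available. The sketch below is a simplified stand-in, not the authors' software: it assumes a log transformation for positive amounts (so the Taylor back-transformation is $e^{x}(1 + \sigma_\epsilon^2/2)$), uses a fixed rectangular grid in place of adaptive Gaussian quadrature, and takes the fixed-effect linear predictors `eta1` and `eta2` as given. The function name and all numeric inputs are hypothetical.

```python
import math

def expit(v):
    return 1.0 / (1.0 + math.exp(-v))

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def predict_two_part(recalls, eta1, eta2, var_u1, var_u2, rho, var_e,
                     n_grid=81, width=6.0):
    """Posterior mean of T_i = H(eta1 + u1) * g*(eta2 + u2) given the recalls.

    Assumes a log transformation, so g*(x) = exp(x) * (1 + var_e / 2).
    eta1, eta2 are the fixed-effect parts of the probability and amount models.
    """
    sd1, sd2, sd_e = math.sqrt(var_u1), math.sqrt(var_u2), math.sqrt(var_e)
    grid = [-width + 2.0 * width * k / (n_grid - 1) for k in range(n_grid)]  # SD units
    log_pos = [math.log(r) for r in recalls if r > 0]
    n_zero = sum(1 for r in recalls if r == 0)
    num = den = 0.0
    for z1 in grid:
        p = expit(eta1 + sd1 * z1)
        lik_prob = (1.0 - p) ** n_zero * p ** len(log_pos)   # zero / nonzero pattern
        for z2 in grid:
            x2 = eta2 + sd2 * z2
            lik_amount = 1.0
            for lr in log_pos:                               # positive amounts only
                lik_amount *= normal_pdf((lr - x2) / sd_e)
            # Bivariate normal prior for the standardized random effects
            # (normalizing constants cancel in the ratio).
            prior = math.exp(-(z1 * z1 - 2.0 * rho * z1 * z2 + z2 * z2)
                             / (2.0 * (1.0 - rho * rho)))
            weight = lik_prob * lik_amount * prior
            num += p * math.exp(x2) * (1.0 + 0.5 * var_e) * weight
            den += weight
    return num / den

# One zero day and one day with 3 units reported.
print(predict_two_part([0.0, 3.0], eta1=-1.0, eta2=0.5,
                       var_u1=1.0, var_u2=0.5, rho=0.3, var_e=0.4))
```

A person with two zero recalls still receives a positive prediction, driven by the covariates and the prior for the random effects, which is how the approach handles excess zeros.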

## 3. Contribution of FFQ Data

Because the FFQ report is related to true food intake and is independent of the health outcome given true intake and other covariates in model (1), it may be included as an additional covariate (a component of vector $\mathbf{C}_i$) in the calibration model. In this section, we quantify the contribution of the FFQ to the prediction of individual usual intake, using the EATS data (Subar et al., 2001). We considered 965 respondents who successfully completed four 24HRs and a FFQ.

We assumed a simple univariate model relating a transformed food intake to either a hypothetical continuous health outcome (linear regression) or to a dichotomous outcome (logistic regression). For a given food, we fit models (16)–(18) to the data, estimated the distribution of usual intake by applying the NCI method (Tooze et al., 2006), found the best Box–Cox transformation of true intake to approximate normality, and finally predicted individual usual intake on the transformed scale. We assumed the same set of additional covariates for both parts of the model and considered three different sets. The first set was empty; the second contained age, body mass index (BMI), and education (no college, some college, and college graduate); the third set additionally contained the FFQ report. We Box–Cox transformed FFQ positive values to improve linearity and homoscedasticity of model (17), and used an indicator variable for zero FFQ reports.

Finally, we calculated the variances *V*_{1}, *V*_{2}, *V*_{3} of the predicted usual intakes corresponding to each of the three scenarios described above. We quantified the contribution of the additional set of covariates (age, BMI, education) by the ratio *V*_{2}/*V*_{1}, and the additional contribution of the FFQ as *V*_{3}/*V*_{2}. As we prove in Web Appendix A, the larger the ratios *V*_{2}/*V*_{1} and *V*_{3}/*V*_{2} are, the larger is the contribution of the covariate(s) to the precision of the predictor and the greater is the precision of estimated exposure effects in the health outcome model.
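The variance ratios are simple to compute once predicted intakes are available under each covariate set. The sketch below uses made-up predictions purely to show the arithmetic; the function name is hypothetical, and reading a ratio as a percentage gain via (ratio − 1) × 100 is an assumption about how such ratios are typically reported.

```python
import statistics

def variance_ratios(pred_1, pred_2, pred_3):
    """Return (V2/V1, V3/V2) for predicted usual intakes under three covariate sets."""
    v1, v2, v3 = (statistics.pvariance(p) for p in (pred_1, pred_2, pred_3))
    return v2 / v1, v3 / v2

# Made-up predictions for four people: no covariates, demographics,
# demographics plus FFQ. A better predictor spreads people further apart.
no_covariates = [1.0, 2.0, 3.0, 4.0]
demographics = [0.0, 2.0, 3.0, 5.0]
with_ffq = [0.0, 1.0, 4.0, 5.0]

r21, r32 = variance_ratios(no_covariates, demographics, with_ffq)
print(f"V2/V1 = {r21:.2f}, V3/V2 = {r32:.2f}, "
      f"gain from FFQ = {(r32 - 1.0) * 100:.0f}%")
```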

We present results of our analyses in Table 1, separately for men and women, for five selected food groups: dark green vegetables, tomatoes and tomato products, fruit, milk and milk products, and fish. The variance ratios show that the FFQ contributed considerably more to the estimation of usual intake than did the other covariates, but the size of contribution varied substantially across foods. For dark green vegetables, tomatoes (for women), and fish intake, the increase in efficiency due to the FFQ was large and ranged from 21% (dark green vegetables for women) to 137% (dark green vegetables for men). For other foods, the increase in efficiency from the FFQ was more modest, ranging from 3% (fruit for men) to 13% (tomatoes for men).

## 4. Relationship between Fish Intake and Mercury Level in Women in NHANES

To further illustrate our methods, we examined the relationship between fish intake and blood mercury levels in women of child-bearing age by using NHANES data. This subject is important because it has been shown that high levels of mercury can increase complications of pregnancy (Xue et al., 2007). We analyzed 1605 females, aged 12–49 years, who participated in the 2003–2004 round of NHANES, and provided at least one 24HR, one FFQ, and a blood sample for measurement of serum mercury. Among these participants, 1206 (75%) reported no fish consumption on either 24HR, 342 (21%) reported fish consumption on one of the two days, and 57 (3.6%) reported consuming fish on both days. In view of the large proportion of nonconsumption days, one might expect that the FFQ, with 1553 (96.8%) women reporting fish consumption over the past year, would add useful information.

We examined the linear regression of log serum mercury level (*μ*g/l) on usual fish intake (oz/day), obtained after a suitable Box–Cox transformation. We compared three regressions: one where individual fish intake was represented by the average of the 24HRs (the “naïve” analysis), and two that used regression calibration to adjust for measurement error in the 24HR using the proposed method. In the first calibration model, individual fish intake was predicted using age, race (White, African American, other), and education (no college, some college, college graduate) as additional covariates (components of vector $\mathbf{C}_i$). In the second, fish intake was predicted with the same three additional covariates plus the FFQ frequencies, which were handled the same way as described above for the EATS example. Standard errors (SEs) of the estimated regression slopes were estimated using the balanced repeated replication method (Wolter, 1995). Results are presented in Table 2.

The naïve analysis indicates a clear relationship between serum mercury and fish intake (Wald’s *z* = 0.33/0.038 = 8.7), but quantifies the effect as a 39% increase (exp(0.33)) in serum mercury between those who consume an average of 0.1 oz of fish per day and those who consume 1 oz per day. (These consumption levels were approximately the 10th and 90th percentiles in the population.) Such a small increase might only be of modest public health concern. The proposed method estimates the increase in serum mercury to be approximately fivefold (exp(1.58)) to sevenfold (exp(1.97)) with/without the FFQ in the calibration model, respectively, values that certainly would warrant concern. Note that the addition of the FFQ increases efficiency in the estimated effect approximately twofold yielding a SE of 0.38 compared with 0.54 for the regression calibration without the FFQ.
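The quoted ratios follow from exponentiating differences on the log scale, and the efficiency gain from squaring the ratio of standard errors. A short check, using only the figures given in the text (variable names are illustrative):

```python
import math

# Naive analysis: estimated difference in log mercury and its standard error.
diff_naive, se_naive = 0.33, 0.038
wald_z = diff_naive / se_naive
fold_naive = math.exp(diff_naive)          # multiplicative change in mercury

# Regression calibration with and without the FFQ in the calibration model.
fold_with_ffq = math.exp(1.58)
fold_without_ffq = math.exp(1.97)

# Relative efficiency is the squared ratio of standard errors.
rel_efficiency = (0.54 / 0.38) ** 2

print(f"Wald z = {wald_z:.1f}, naive increase = {(fold_naive - 1) * 100:.0f}%")
print(f"fold change with FFQ = {fold_with_ffq:.2f}, without = {fold_without_ffq:.2f}")
print(f"relative efficiency from adding the FFQ = {rel_efficiency:.2f}")
```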

To assess model fit, we applied an informal graphical approach, as described in Web Appendix B. The results (Web Figures 1–4) did not exhibit obvious model misspecification.

## 5. Simulation Results

We conducted a simulation study to evaluate the performance of the proposed method in a finite sample. We designed our simulation to mimic the investigation of the relationship between serum mercury levels and usual fish intake in women in NHANES, described in Section 4. In the simulation, the true relationship was specified as the simple linear regression

$$Y_i = \alpha_0 + 0.8\, T_i^{\ast} + \delta_i, \qquad (22)$$

where $Y$ represented log mercury concentration (*μ*g/l), and $T^{\ast}$ represented Box–Cox transformed usual fish intake (oz/day) with parameter $\lambda_T = 0.17$, chosen to improve the linearity and homoscedasticity of the model. The coefficient of 0.8 leads to a true 1.52 increase in log mercury level between persons who consume 0.1 oz of fish compared with 1 oz per day, similar to the 1.58 estimated from the NHANES data (Table 2). We used $\delta_i \sim \text{Normal}(0, 1.07)$ so that the linear regression (22) would explain 25% of the total variation of $Y$.

The details of the simulations are provided in Web Appendix C. Three different sets of additional covariates (components of vector $\mathbf{C}_i$) were used in the calibration model: (a) empty set; (b) age, BMI, and education; and (c) age, BMI, education, and Box–Cox transformed FFQ report. Table 3 presents the overall results of applying this procedure to 250 simulated data sets.

**...**

The results show that, due to measurement error, the naïve approach using the average of the 24HRs grossly underestimates the true value, as expected by theory. On average, estimates based on the proposed method have negligible bias, although, again as expected by theory, their precision is poorer than that of the naïve estimate. Importantly, the precision improves with the inclusion of additional covariates for predicting an individual’s usual intake. The estimate based on demographic covariates and the FFQ report is four times more efficient ([0.018/0.009]^{2}) than the estimate based on no covariates and 2.42 times more efficient than the estimate based on the demographic covariates only. The latter effect is equivalent to reducing the sample size of the study by approximately 60% ([1 − 1/2.42] × 100%), illustrating the potential gains from including the FFQ in the prediction of an individual’s usual intake.

## 6. Discussion

We have presented a method of predicting an individual’s usual intake of an episodically consumed food and relating it to a health outcome. The method is based on regression calibration prediction applied to short-term repeat observations of intake that contain measurement error and excess zeros, under two important assumptions. First, the fact of short-term consumption is assumed to be correctly classified. Second, the reported intake on consumption days is assumed unbiased for true intake. In our method, information from the main dietary instrument may be combined with that from another longer-term, presumably less precise and even biased, report using an auxiliary instrument. We have demonstrated, through real data and simulations, that the gain from combining two instruments may be substantial, with increases in the precision of the predicted usual intake and of the estimated diet-health outcome relationship.

In our applications, the main instrument was a 24HR and the auxiliary instrument a FFQ. Unfortunately, the assumption of unbiasedness of the main instrument does not strictly apply to the 24HR. Recent biomarker studies (Kipnis et al., 2003) have shown that, for total energy, the 24HR also involves systematic error related to true usual intake. Such biases in reporting energy intake indicate bias also in the reporting of at least some energy-contributing foods. On the other hand, these same studies confirmed that the bias in 24HR reports is considerably less than that in FFQs. Thus, in the absence of any accurate biomarker for most foods and nutrients, using the 24HR in our proposed method may provide the best available approximation.

Our method appears to fill a gap in the analytic tools of nutritional epidemiologists estimating food and health outcome associations. Use of 24HRs alone is known to be problematic when there is a large number of zero values, whereas use of the FFQ alone is marred by the large reporting biases of this instrument. Our examples have demonstrated that the proposed method is feasible to implement and produces nearly unbiased estimates of associations of intakes of episodically consumed foods with health outcomes. The method outperformed the “naïve” approach even without the FFQ in the calibration model, giving an estimate with a much reduced MSE. However, use of the FFQ greatly increased the precision of the estimate.

As shown in Section 3, use of the FFQ will not have a large impact for all foods. Probably the most important factor that determines the impact of the FFQ is the overall probability to consume the food on a given day. For foods with a relatively low probability of consumption (e.g., fish and dark green vegetables in Table 1), the FFQ will most likely provide a larger increase in efficiency. However, a larger sample size (or, alternatively, more repeat 24HRs) is required to obtain reliable estimates of the model parameters when the consumption probability is very low. This is because a substantial number of individuals with at least two consumption days are needed to estimate properly the within-person variance in the second part of the model. In our NHANES example, there were 57 women (out of 1605) who consumed fish on both days. We would not expect reliable fits for very rarely consumed foods (e.g., organ meats or yogurt in NHANES) with considerably fewer than 50 individuals with two positive intakes and indeed we have encountered some convergence problems in simulations of such cases.

In our two-part model, the first part specifies the probability of the point mass at zero, and the second part *conditionally* models the continuous variable given that it is positive. Another potential approach to modeling semicontinuous data with measurement error was proposed by Li, Shao, and Palta (2005). It is based on the sample selection model that posits an underlying continuous variable censored by a random mechanism. Using our notation, true long-term and reported intakes are specified as $T_i = \max(0, V_i)$ and $R_{ij} = \max(0, V_i + \epsilon_{ij})$, respectively, with the underlying variable ${V}_{i}={\beta}_{0}+{\boldsymbol{\beta}}_{X}^{t}{\mathbf{X}}_{i}+{u}_{i}$. The use of the same linear function of covariates and the same random effect to specify the censoring mechanism and the positive observations makes this model less flexible than ours. Its advantage is formal modeling of never-consumers.

Our two-part model assumes that each food is ultimately consumed by all individuals, so that $T_i > 0$. This derives from specifying the random effect in the probability part as a continuous variable. In a similar situation, Olsen and Schafer (2001) suggested a two-part mixture for the distribution of this random effect, where the status of a “teetotaler” is specified by a latent class classification variable, but did not provide any details of fitting such a model.

We considered adding a third part to our model, which specifies for each person the probability to be a never-consumer by using fixed-effect logistic regression on a vector of covariates $\mathbf{X}_{3i}$. We have fitted this model to the data on fish intake in EATS among 515 women, including 30 who reported zero intakes on the FFQ. An indicator variable of whether fish consumption was reported on the FFQ was used as a covariate in $\mathbf{X}_{3i}$. In a simulation study similar to the one described in Section 5 (but this time simulating never-consumers), we investigated cases where the number of 24HRs was 2, 4, or 6. With only two 24HRs, the model fit was unstable in 64 out of 250 simulated data sets, although the problem disappeared when we increased the number of 24HRs to four or more. Modeling never-consumers is an area for further research, but, with only two 24HRs, the two-part model seems the most feasible approach.

Our methodology is suitable for analysis of a particular food and its relationship with a health outcome that involves no other dietary factors. An extension to a multivariate case with several foods and nutrients requires conditioning in formula (20) on potentially correlated random effects for all considered dietary factors simultaneously and is another area for future research.
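Returning to the never-consumer extension discussed above, one possible intuition for why more than two 24HRs help is sketched below. It is offered as an assumption, not as an explanation given in this article: if a true consumer eats the food on any given day with probability `p`, independently across days, then that person reports zero intake on all `J` recalls with probability `(1 - p) ** J`, and is indistinguishable in the 24HR data from a never-consumer. The function name and the value of `p` are hypothetical.

```python
def prob_all_zero_recalls(p: float, n_recalls: int) -> float:
    """Probability that a true consumer reports zero intake on every recall.

    Illustrative assumption: a fixed daily consumption probability p and
    independence between recall days.
    """
    return (1.0 - p) ** n_recalls


if __name__ == "__main__":
    # Hypothetical daily consumption probability for a rarely eaten food.
    p = 0.25
    for n_recalls in (2, 4, 6):
        print(n_recalls, prob_all_zero_recalls(p, n_recalls))
```

Under these assumptions the share of consumers who look like never-consumers shrinks geometrically as recalls are added, which is at least consistent with the simulations becoming stable at four or more 24HRs.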

Although we concentrated on dietary surveys, the proposed method can also be applied to cohort studies of associations between episodically consumed foods and disease. Currently, most such studies use a FFQ as the main dietary-assessment instrument, while a more precise short-term reference instrument is available only in a calibration substudy. In such cases, the regression calibration is based on estimating

which involves conditioning on the FFQ and other covariates, but not on the 24HR (and therefore random effects) as in formulas (6)–(7). This simplifies the method and, more importantly, allows its application to a multivariate case with several foods and nutrients by considering regression calibration of each dietary factor, one at a time.

In the future, as automated 24HRs become available, our methodology could combine multiple administrations of this instrument with the FFQ to achieve more precise results.

## 7. Supplementary Materials

Web Appendices A–C, Web Figures 1–4, and NHANES example data, referenced in Sections 2, 4, and 5, as well as the SAS program implementing the proposed method are available under the Paper Information link at the *Biometrics* web-site http://www.biometrics.tibs.org.

## Supplementary Material

- Data Example (249K, csv)
- SAS Program as a text file (29K, txt)
- SAS macro as a text file (35K, txt)

## Acknowledgments

R.J.C.’s research was supported by a grant from the National Cancer Institute (CA57030) and by Award KUS-CI-016-04, made by King Abdullah University of Science and Technology.

## References

- Box GEP, Cox DR. An analysis of transformations. Journal of the Royal Statistical Society, Series B. 1964;26:211–252.
- Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement Error in Nonlinear Models: A Modern Perspective. 2nd ed. Boca Raton, Florida: Chapman and Hall/CRC Press; 2006.
- Dodd K, Guenther PM, Freedman LS, Subar AF, Kipnis V, Midthune D, Tooze JA, Krebs-Smith SM. Statistical methods for estimating usual intake of nutrients and foods: A review of the theory. Journal of the American Dietetic Association. 2006;106:1640–1650.
- Dwyer J, Picciano MF, Raiten DJ. Members of the Steering Committee. Collection of food and dietary supplement intake data: What we eat in America—NHANES. The Journal of Nutrition. 2003;133:590S–600S.
- Eckert RS, Carroll RJ, Wang N. Transformations to additivity in measurement error models. Biometrics. 1997;53:262–272.
- Kipnis V, Subar AF, Midthune D, Freedman LS, Ballard-Barbash R, Troiano R, Bingham S, Schoeller DA, Schatzkin A, Carroll RJ. The structure of dietary measurement error: Results of the OPEN biomarker study. American Journal of Epidemiology. 2003;158:14–21.
- Li L, Shao J, Palta M. A longitudinal measurement error model with a semicontinuous covariate. Biometrics. 2005;61:824–830.
- Liu Q, Pierce DA. A note on Gauss-Hermite quadrature. Biometrika. 1994;81:624–629.
- McCulloch CE, Searle SR. Generalized, Linear, and Mixed Models. New York: Wiley; 2001.
- Nusser SM, Fuller WA, Guenther PM. Estimating usual dietary intake distributions: Adjusting for measurement error and nonnormality in 24-hour food intake data. In: Lyberg L, Biemer P, Collins M, Deleeuw E, Dippo C, Schwartz N, Trewin D, editors. Survey Measurement and Process Quality. New York: Wiley; 1997. pp. 670–689.
- Olsen MK, Schafer JL. A two-part random-effects model for semicontinuous longitudinal data. Journal of the American Statistical Association. 2001;96:730–745.
- Subar AF, Thompson FE, Kipnis V, Midthune D, Hurwitz P, McNutt S, McIntosh A, Rosenfeld S. Comparative validation of the Block, Willett, and National Cancer Institute food frequency questionnaires: The Eating at America’s Table Study. American Journal of Epidemiology. 2001;154:1089–1099.
- Subar AF, Dodd KW, Guenther PM, Kipnis V, Midthune D, McDowell M, Tooze JA, Freedman LS, Krebs-Smith SM. The Food Propensity Questionnaire (FPQ): Concept, development and validation for use as a covariate in model to estimate usual food intake. Journal of the American Dietetic Association. 2006;106:1556–1563.
- Tooze JA, Grunwald GK, Jones RH. Analysis of repeated measures data clumping at zero. Statistical Methods in Medical Research. 2002;11:341–355.
- Tooze JA, Midthune D, Dodd KW, Freedman LS, Krebs-Smith SM, Subar AF, Carroll RJ, Kipnis V. A new statistical method for estimating the usual intake of episodically consumed foods with application to their distribution. Journal of the American Dietetic Association. 2006;106:1575–1587.
- Tsiatis AA, DeGruttola V, Wulfsohn MS. Modeling the relationship of survival to longitudinal data measured with error. Application to survival and CD4 counts in patients with AIDS. Journal of the American Statistical Association. 1995;90:27–37.
- Whittemore AS. Errors-in-variables regression using Stein estimates. The American Statistician. 1989;43:226–228.
- Wolter KM. Introduction to Variance Estimation. New York: Springer-Verlag; 1995.
- Xue F, Holzman C, Rahbar MH, Trosko K, Fischer L. Maternal fish consumption, mercury levels, and risk of preterm delivery. Environmental Health Perspectives. 2007;115:42–47.

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (768K) |
- Citation

- A new statistical method for estimating the usual intake of episodically consumed foods with application to their distribution.[J Am Diet Assoc. 2006]
*Tooze JA, Midthune D, Dodd KW, Freedman LS, Krebs-Smith SM, Subar AF, Guenther PM, Carroll RJ, Kipnis V.**J Am Diet Assoc. 2006 Oct; 106(10):1575-87.* - The food propensity questionnaire: concept, development, and validation for use as a covariate in a model to estimate usual food intake.[J Am Diet Assoc. 2006]
*Subar AF, Dodd KW, Guenther PM, Kipnis V, Midthune D, McDowell M, Tooze JA, Freedman LS, Krebs-Smith SM.**J Am Diet Assoc. 2006 Oct; 106(10):1556-63.* - Taking advantage of the strengths of 2 different dietary assessment instruments to improve intake estimates for nutritional epidemiology.[Am J Epidemiol. 2012]
*Carroll RJ, Midthune D, Subar AF, Shumakovich M, Freedman LS, Thompson FE, Kipnis V.**Am J Epidemiol. 2012 Feb 15; 175(4):340-7. Epub 2012 Jan 24.* - Statistical methods for estimating usual intake of nutrients and foods: a review of the theory.[J Am Diet Assoc. 2006]
*Dodd KW, Guenther PM, Freedman LS, Subar AF, Kipnis V, Midthune D, Tooze JA, Krebs-Smith SM.**J Am Diet Assoc. 2006 Oct; 106(10):1640-50.* - Estimation of usual intakes: What We Eat in America-NHANES.[J Nutr. 2003]
*Dwyer J, Picciano MF, Raiten DJ, Members of the Steering Committee, National Health and Nutrition Examination Survey.**J Nutr. 2003 Feb; 133(2):609S-23S.*

- Use of Two-Part Regression Calibration Model to Correct for Measurement Error in Episodically Consumed Foods in a Single-Replicate Study Design: EPIC Case Study[PLoS ONE. ]
*Agogo GO, van der Voet H, Veer PV, Ferrari P, Leenders M, Muller DC, Sánchez-Cantalejo E, Bamia C, Braaten T, Knüppel S, Johansson I, van Eeuwijk FA, Boshuizen H.**PLoS ONE. 9(11)e113160* - Soda and Cell Aging: Associations between Sugar-Sweetened Beverage Consumption and Leukocyte Telomere Length in Healthy Adults from the National Health and Nutrition Examination Surveys[American journal of public health. 2014]
*Leung CW, Laraia BA, Needham BL, Rehkopf DH, Adler NE, Lin J, Blackburn EH, Epel ES.**American journal of public health. 2014 Dec; 104(12)2425-2431* - Bayesian Semiparametric Density Deconvolution in the Presence of Conditionally Heteroscedastic Measurement Errors[Journal of computational and graphical stat...]
*Sarkar A, Mallick BK, Staudenmayer J, Pati D, Carroll RJ.**Journal of computational and graphical statistics : a joint publication of American Statistical Association, Institute of Mathematical Statistics, Interface Foundation of North America. 2014 Oct 1; 23(4)1101-1125* - A Quantile Regression Approach Can Reveal the Effect of Fruit and Vegetable Consumption on Plasma Homocysteine Levels[PLoS ONE. ]
*Verly-Jr E, Steluti J, Fisberg RM, Marchioni DM.**PLoS ONE. 9(11)e111619* - Estimating the Distribution of Dietary Consumption Patterns[Statistical science : a review journal of t...]
*Carroll RJ.**Statistical science : a review journal of the Institute of Mathematical Statistics. 2014; 29(1)2-8*

- Modeling Data with Excess Zeros and Measurement Error: Application to Evaluating...Modeling Data with Excess Zeros and Measurement Error: Application to Evaluating Relationships between Episodically Consumed Foods and Health OutcomesNIHPA Author Manuscripts. Dec 2009; 65(4)1003

Your browsing activity is empty.

Activity recording is turned off.

See more...