# A NEW MULTIVARIATE MEASUREMENT ERROR MODEL WITH ZERO-INFLATED DIETARY DATA, AND ITS APPLICATION TO DIETARY ASSESSMENT

^{*}This paper forms part of Zhang's Ph.D. dissertation at Texas A&M University. Zhang and Carroll's research was supported by a grant from the National Cancer Institute (CA57030). This work was also supported by National Science Foundation Instrumentation grant number 0922866.

^{†}Corresponding Author.

## Abstract

In the United States the preferred method of obtaining dietary intake data is the 24-hour dietary recall, yet the measure of most interest is usual or long-term average daily intake, which is impossible to measure. Thus, usual dietary intake is assessed with considerable measurement error. Also, diet represents numerous foods, nutrients and other components, each of which has distinctive attributes. Sometimes, it is useful to examine intake of these components separately, but increasingly nutritionists are interested in exploring them collectively to capture overall dietary patterns. Consumption of these components varies widely: some are consumed daily by almost everyone, while others are episodically consumed so that 24-hour recall data are zero-inflated. In addition, they are often correlated with each other. Finally, it is often preferable to analyze the amount of a dietary component relative to the amount of energy (calories) in a diet because dietary recommendations often vary with energy level. The quest to understand overall dietary patterns of usual intake has to this point reached a standstill. There are no statistical methods or models available to model such complex multivariate data with its measurement error and zero inflation. This paper proposes the first such model, and it proposes the first workable solution to fit such a model. After describing the model, we use survey-weighted MCMC computations to fit the model, with uncertainty estimation coming from balanced repeated replication.

The methodology is illustrated through an application to estimating the population distribution of the Healthy Eating Index-2005 (HEI-2005), a multi-component dietary quality index involving ratios of interrelated dietary components to energy, among children aged 2-8 in the United States. We pose a number of interesting questions about the HEI-2005 and provide answers that were not previously within the realm of possibility, and we indicate ways that our approach can be used to answer other questions of importance to nutritional science and public health.

**Keywords:** Bayesian methods, Dietary assessment, Latent variables, Measurement error, Mixed models, Nutritional epidemiology, Nutritional surveillance, Zero-Inflated Data

## 1. INTRODUCTION

This paper presents statistical models and methodology to overcome a major stumbling block in the field of dietary assessment. More nutritional background is provided in Section 2; a summary of the key conceptual issues follows.

- Nutritional surveys conducted in the United States typically use 24-hour (24hr) dietary recalls to obtain intake data, i.e., an assessment of what was consumed in the past 24 hours.
- Because dietary recommendations are intended to be met over time, nutritionists are interested in “usual” or long-term average daily intake.
- Dietary intake is thus assessed with considerable measurement error.
- Consumption patterns of dietary components vary widely; some are consumed daily by almost everyone, while others are episodically consumed so that 24-hour recall data are zero-inflated. Further, these components are correlated with one another.
- Nutritionists are interested in dietary components collectively to capture patterns of usual dietary intake, and thus need multivariate models for usual intake.
- These multivariate models for usual intakes, taking into account episodically consumed foods, do not exist, nor do methods exist for fitting them.

One way to capture dietary patterns is by scores, although our work is not limited to scores. The Healthy Eating Index-2005 (HEI-2005), described in detail in Section 2, is a scoring system based on a priori knowledge of dietary recommendations, and is on a scale of 0 to 100. Ideally, it consists of the usual intake of 6 episodically consumed and thus 24hr-zero inflated foods, 6 daily-consumed dietary components, adjusts these for energy (caloric) intake, and gives a score to each component. The total score is the sum of the individual component scores. Higher scores indicate greater compliance with dietary guidelines and, therefore, a healthier diet. Here are a few questions that nutritionists have not been able to answer, and that our approach can address.

- What is the distribution of the HEI-2005 total score, and what percentage of Americans are eating a healthier diet, defined, for example, by a total score exceeding 80?
- What is the correlation between the individual score on each dietary component and the scores of all other dietary components?
- Among those whose total HEI-2005 score is > 50 or ≤ 50, what is the distribution of usual intake of whole grains, whole fruits, dark green and orange vegetables and legumes (DOL) and calories from solid fats, alcoholic beverages and added sugars (SoFAAS)?
- What % of Americans exceed the median score on all 12 HEI-2005 components?

In this paper, to answer public health questions such as these that can have policy implications, we build a novel multivariate measurement error model for estimating the distributions of usual intakes, one that accounts for measurement error and zero-inflation, and has a special structure associated with the zero-inflation. Previous attempts to fit even simple versions of this model, using nonlinear mixed effects software, failed because of the complexity and dimensionality of the model. We use survey-weighted Monte Carlo computations to fit the model with uncertainty estimation coming from balanced repeated replication. The methodology is illustrated using the HEI-2005 to assess the diets of children aged 2-8 in the United States. This work represents the first analysis of joint distributions of usual intakes for multiple food groups and nutrients.

The paper is outlined as follows. In Section 2 we give the background for the data we observe. In particular, we provide more information about the HEI-2005. Section 3 describes our model, which is a highly nonlinear, zero-inflated, repeated measures model with multiple latent variables. The model also has a patterned covariance matrix with structural zeros and ones. We derive a parameterization that allows estimated covariance matrices to be actual covariance matrices. We also define technically what we mean by usual intake, and illustrate how simulation methods can be used to answer the questions posed above, as well as many others.

Section 4 describes our estimation procedure. Previous attempts using nonlinear mixed effects models to estimate the distribution of episodically consumed food groups (Tooze, et al., 2006; Kipnis, et al., 2009) do not work here because of the high dimensionality of the problem. We instead develop a Monte Carlo strategy based on the idea of Gibbs sampling; although because of sampling weights, we treat the method as a frequentist (non-Bayesian) one. This section describes some of the basics of the methodology; the full technical details of implementation are given in an appendix.

Section 5 describes the analysis of the HEI-2005 components using the 2001-2004 National Health and Nutrition Examination Survey (NHANES) for children ages 2-8. Important contextual points arise because of the nature of the data. For example, if whole grains are consumed, then necessarily total grains are consumed with probability one, a restriction that a naive use of our model cannot handle. We develop a simple novel device to uncouple consumption variables that are tightly linked in this way. Finally in this section, we provide the first answers to the four questions we have posed. In Section 6, we discuss various additional aspects of the problem and the data analysis. Concluding remarks and a policy application are given in Section 7.

There are a number of general reviews of the measurement error field (Fuller, 1987; Gustafson, 2003; Carroll, et al., 2006; Buonaccorsi, 2010). Recent papers that focus on estimating the density function of a univariate continuous random variable subject to measurement error include Delaigle (2008), Delaigle and Hall (2008, 2010), Delaigle and Meister (2008), Delaigle, et al. (2008), Staudenmayer, et al. (2008) and Wand (1998). The field of measurement error in regression continues to expand rapidly, with some recent contributions including Küchenhoff, et al (2006), Guolo (2008), Liang, et al. (2008), Messer and Natarajan (2008) and Natarajan (2009). There is also a large statistical literature on measurement error as it relates to public health nutrition: some recent papers relevant to our work include Carriquiry (1999, 2003), Ferrari, et al. (2009), Fraser and Shavlik (2004), Kott, et al. (2009), Nusser, et al. (1996, 1997), Prentice (1996, 2003), and Tooze, et al. (2003, 2006).

## 2. Data and the HEI-2005 Scores

Here we give more detail about the nutrition context that motivates this work.

In surveys conducted in the United States, the preferred method of obtaining intake data is the 24-hour dietary recall because it limits respondent burden and facilitates accurate reporting; yet the measure of greatest interest is “usual” or long-term average daily intake. Thus dietary intake is assessed with considerable measurement error. Also, diets comprise numerous foods, nutrients, and other components, each of which may have distinctive attributes and effects on nutritional health. Sometimes, it is useful to examine intake of these components separately, but increasingly nutritionists are interested in exploring them collectively to capture patterns of dietary intake. Consumption patterns of these components vary widely; some are consumed daily by almost everyone while others are episodically consumed so that 24-hour recall data are zero-inflated. In addition, these various components are often correlated with one another. Finally, it is often preferable to analyze the amount of a dietary component relative to the amount of energy (calories) in a diet because dietary recommendations often vary with energy level, and this approach provides a way of standardizing dietary assessments.

One of the US Department of Agriculture's (USDA's) strategic objectives is “to promote healthy diets” and it has developed an associated performance measure, the Healthy Eating Index-2005 (HEI-2005, http://www.cnpp.usda.gov/HealthyEatingIndex.htm). The HEI-2005 is based on the key recommendations of the 2005 Dietary Guidelines for Americans (http://www.health.gov/dietaryguidelines/dga2005/document/default.htm). The index includes ratios of interrelated dietary components to energy. The HEI-2005 comprises 12 distinct component scores and a total summary score. See Table 1 for a list of these components and the standards for scoring, and see Guenther et al. (2008) for details. Intakes of each food or nutrient, represented by one of the 12 components, are expressed as a ratio to energy intake, assessed, and ascribed a score.

**...**

The HEI-2005 is used to evaluate the diets of Americans to assess compliance with the 2005 Dietary Guidelines, yet use of the HEI-2005 is limited by the challenges described above. Until recently, there have been no solutions to these challenges, so published evaluations have been limited to analyses of mean scores for the population and various subgroups. Freedman, et al. (2010) have described a method of estimating the population distribution of a single component of HEI-2005, and the prevalence of high or low scores on that component; but there has been to date no satisfactory way to determine the prevalence of high or low total HEI-2005 scores, considering all of its interrelated components simultaneously. In addition, answers to the complex questions posed in the Introduction remain unavailable. This paper aims to provide a means to do these crucial evaluations.

The 12 HEI-2005 components represent 6 episodically consumed food groups (total fruit, whole fruit, total vegetables, dark green and orange vegetables and legumes or DOL, whole grains and milk), 3 daily-consumed food groups (total grains, meat and beans and oils), and 3 other daily-consumed dietary components (saturated fat; sodium; and calories from solid fats, alcoholic beverages and added sugars, or SoFAAS). The classification of food groups as “episodically” and “daily” consumed is based on the number of individuals who report them on 24hr recalls. If there are only a few zeros for a component, we treat that as a daily-consumed food, and replace all zeros with 1/2 the minimum value of the non-zeros for that food. However, the crucial statistical aspect of the data is that six of the food groups are zero-inflated. The percentages of reported non-consumption of total fruit, whole fruit, whole grains, total vegetables, DOL, and milk on any single day are 17%, 40%, 42%, 3%, 50% and 12%, respectively.
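The zero-replacement rule for daily-consumed components described above can be sketched as follows. This is a minimal illustration, not the authors' code; the function name `replace_sparse_zeros` and the example values are hypothetical.

```python
def replace_sparse_zeros(amounts):
    """Replace zeros by half of the smallest non-zero reported amount."""
    nonzero = [a for a in amounts if a > 0]
    if not nonzero:
        raise ValueError("no positive intakes to anchor the replacement")
    floor = 0.5 * min(nonzero)
    return [a if a > 0 else floor for a in amounts]

# Hypothetical reported amounts for one daily-consumed component.
reported = [2.0, 0.0, 1.5, 3.0, 0.0]
cleaned = replace_sparse_zeros(reported)
```

Here the smallest non-zero amount is 1.5, so both zeros become 0.75 and the positive values are left unchanged.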

We are interested in the usual intake of foods for children aged 2-8. The data available to us, described in more detail in Section 5, came from the National Health and Nutrition Examination Survey, 2001-2004 (NHANES). The data used here consisted of $n = 2{,}638$ children, each of whom had a survey weight $w_i$ for $i = 1, \dots, n$. In addition, one or two 24hr dietary recalls were available for each individual. Along with the dietary variables, there are covariates such as age, gender, ethnicity, family income and dummy variables that indicate a weekday or a weekend day, and whether the recall was the first or second reported for that individual.

Using the 24hr recall data reported, for each of the episodically consumed food groups, two variables are defined: (a) whether a food from that group was consumed; and (b) the amount of the food that was reported on the 24hr recall. For the 6 daily-consumed food groups and nutrients, only one variable indicating the consumption amount is defined. In addition, the amount of energy that is calculated from the 24hr recall is of interest. The number of dietary variables for each 24hr recall is thus 12+6+1 = 19. The observed data are $Y_{ijk}$ for the $i^{th}$ person, the $j^{th}$ variable and the $k^{th}$ replicate, $j = 1, \dots, 19$ and $k = 1, \dots, m_i$. In the data set, at most two 24hr recalls were observed, so that $m_i \le 2$. Set $\tilde{Y}_{ik} = (Y_{i1k}, \dots, Y_{i,19,k})^T$, where

- $Y_{i,2\ell-1,k}$ = Indicator of whether dietary component $\#\ell$ is consumed, with $\ell = 1,2,3,4,5,6$.
- $Y_{i,2\ell,k}$ = Amount of food $\#\ell$ consumed. This equals zero, of course, if none of food $\#\ell$ is consumed, with $\ell = 1,2,3,4,5,6$.
- $Y_{i,\ell+6,k}$ = Amount of non-episodically consumed food or nutrient $\#\ell$, with $\ell = 7,8,9,10,11,12$.
- $Y_{i,19,k}$ = Amount of energy consumed as reported by the 24hr recall.
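The indexing of the 19 dietary variables can be made concrete with a short sketch. The helper `recall_vector` and the example amounts are hypothetical; the only thing taken from the text is the layout: indicator and amount in positions $2\ell-1$ and $2\ell$ for the six episodically consumed foods, then the six daily-consumed components, then energy.

```python
def recall_vector(episodic, daily, energy):
    """Arrange one 24hr recall into the 19 dietary variables.

    episodic: 6 reported amounts of the episodically consumed foods
    daily:    6 reported amounts of the daily-consumed foods/nutrients
    energy:   reported energy intake
    """
    if len(episodic) != 6 or len(daily) != 6:
        raise ValueError("expected 6 episodic and 6 daily components")
    y = []
    for amount in episodic:
        y.append(1.0 if amount > 0 else 0.0)  # Y_{i,2l-1,k}: consumption indicator
        y.append(float(amount))               # Y_{i,2l,k}: amount, zero if not consumed
    y.extend(float(a) for a in daily)         # Y_{i,l+6,k}, l = 7, ..., 12
    y.append(float(energy))                   # Y_{i,19,k}
    return y

# Hypothetical recall: foods 2, 4 and 5 were not consumed on this day.
y = recall_vector([1.2, 0, 0.5, 0, 0, 2.0], [5, 4, 3, 20, 2500, 600], 1800)
```

Python lists are 0-based, so the variable written $Y_{i,j,k}$ in the text sits at `y[j - 1]`.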

## 3. Model and Methods

### 3.1. Basic Model Description

Our model is a generalization of work by Tooze et al. (2006) and Kipnis, et al. (2009) for a single food and Kipnis, et al. (2010) and Zhang, et al. (2010) for a single food and nutrient. Observed data will be denoted as *Y*, and covariates in the model will be denoted as *X*. As is usual in measurement error problems, there will also be latent variables, which will be denoted by *W*.

We use a probit threshold model. Each of the 6 episodically consumed foods will have 2 sets of latent variables, one for consumption and one for amount, while the 6 daily-consumed foods and nutrients as well as energy will have 1 set of latent variables, for a total of 19. The latent random variables are $\epsilon_{ijk}$ and $U_{ij}$, where $(U_{i1}, \dots, U_{i,19}) = \text{Normal}(0, \Sigma_u)$ and $(\epsilon_{i1k}, \dots, \epsilon_{i,19,k}) = \text{Normal}(0, \Sigma_\epsilon)$ are mutually independent. In this model, food $\ell = 1, \dots, 6$ being consumed on day $k$ is equivalent to observing the binary $Y_{i,2\ell-1,k}$, where

$$Y_{i,2\ell-1,k} = 1 \iff W_{i,2\ell-1,k} = X_{i,2\ell-1,k}^T \beta_{2\ell-1} + U_{i,2\ell-1} + \epsilon_{i,2\ell-1,k} > 0.$$

If the food is consumed we model the amount reported $Y_{i,2\ell,k}$ as

$$g_{\mathrm{tr}}(Y_{i,2\ell,k}, \lambda_\ell) = W_{i,2\ell,k} = X_{i,2\ell,k}^T \beta_{2\ell} + U_{i,2\ell} + \epsilon_{i,2\ell,k},$$

where $g_{\mathrm{tr}}(y, \lambda) = \sqrt{2}\{g(y, \lambda) - \mu(\lambda)\}/\sigma(\lambda)$, $g(y, \lambda)$ is the usual Box-Cox transformation with transformation parameter $\lambda$, and $\{\mu(\lambda), \sigma(\lambda)\}$ are the sample mean and standard deviation of $g(y, \lambda)$, computed from the non-zero food data. This standardization is simply a convenient device to improve the numerical performance of our algorithm without affecting the conclusions of our analysis.

The reported consumption of daily-consumed foods or nutrients $\ell = 7, \dots, 12$ is modeled as

$$g_{\mathrm{tr}}(Y_{i,\ell+6,k}, \lambda_\ell) = W_{i,\ell+6,k} = X_{i,\ell+6,k}^T \beta_{\ell+6} + U_{i,\ell+6} + \epsilon_{i,\ell+6,k}.$$

Finally, energy is modeled as

$$g_{\mathrm{tr}}(Y_{i,19,k}, \lambda_{13}) = W_{i,19,k} = X_{i,19,k}^T \beta_{19} + U_{i,19} + \epsilon_{i,19,k}.$$

As seen in (3.3)-(3.4), different transformations $(\lambda_1, \dots, \lambda_{13})$ are allowed to be used for the different types of dietary components, see Section A.12.
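The standardized Box-Cox transformation $g_{\mathrm{tr}}$ can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, the example amounts are made up, and the sample standard deviation is taken with the $n-1$ divisor, which the text does not specify.

```python
import math
import statistics

def box_cox(y, lam):
    """Box-Cox transformation g(y, lambda) for y > 0."""
    return math.log(y) if lam == 0 else (y ** lam - 1.0) / lam

def make_g_tr(nonzero_amounts, lam):
    """Build g_tr(., lambda), centered and scaled on the non-zero data."""
    g = [box_cox(y, lam) for y in nonzero_amounts]
    mu = statistics.mean(g)
    sigma = statistics.stdev(g)  # n - 1 divisor: an assumption of this sketch
    return lambda y: math.sqrt(2.0) * (box_cox(y, lam) - mu) / sigma

amounts = [0.5, 1.0, 2.0, 4.0, 8.0]  # hypothetical non-zero reported amounts
g_tr = make_g_tr(amounts, lam=0.25)
z = [g_tr(y) for y in amounts]
```

By construction, the transformed non-zero data have sample mean 0 and sample standard deviation $\sqrt{2}$, whatever the value of $\lambda$.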

In summary, there are latent variables $\tilde{W}_{ik} = (W_{i1k}, \dots, W_{i,19,k})^T$, latent random effects $\tilde{U}_i = (U_{i1}, \dots, U_{i,19})^T$, fixed effects $(\beta_1, \dots, \beta_{19})$, and design matrices $(X_{i1k}, \dots, X_{i,19,k})$. Define $\tilde{\epsilon}_{ik} = (\epsilon_{i1k}, \dots, \epsilon_{i,19,k})^T$. The latent variable model is

$$W_{ijk} = X_{ijk}^T \beta_j + U_{ij} + \epsilon_{ijk}, \quad j = 1, \dots, 19,$$

where $\tilde{U}_i = \text{Normal}(0, \Sigma_u)$ and $\tilde{\epsilon}_{ik} = \text{Normal}(0, \Sigma_\epsilon)$ are mutually independent.
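To make the latent structure concrete, the following sketch simulates repeated recalls for a single episodically consumed food, i.e., the two-variable special case (consumption and amount) of the model above. It is a simplified illustration, not the fitting procedure: the linear predictors are held constant, the parameter values are invented, and the function name `simulate_recalls` is hypothetical. The consumption error has variance 1 and is independent of the amount error.

```python
import math
import random

def simulate_recalls(eta_consume, eta_amount, sigma_u, var_eps_amount,
                     n_people, n_recalls, seed=0):
    """Simulate (consumed, W_amount) pairs for one episodically consumed food.

    eta_consume, eta_amount: linear predictors X^T beta (held fixed here).
    sigma_u: 2x2 covariance matrix of the person-level effects (U_1, U_2).
    var_eps_amount: variance of the within-person error for the amount.
    """
    rng = random.Random(seed)
    (s11, s12), (_, s22) = sigma_u
    a = math.sqrt(s11)            # entries of the Cholesky factor of sigma_u
    b = s12 / a
    c = math.sqrt(s22 - b * b)
    data = []
    for _ in range(n_people):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        u1, u2 = a * z1, b * z1 + c * z2   # correlated random effects
        person = []
        for _ in range(n_recalls):
            w1 = eta_consume + u1 + rng.gauss(0, 1)
            w2 = eta_amount + u2 + rng.gauss(0, math.sqrt(var_eps_amount))
            # The transformed amount is observed only when the food is consumed.
            person.append((w1 > 0, w2 if w1 > 0 else None))
        data.append(person)
    return data

data = simulate_recalls(0.3, 1.0, ((0.5, 0.2), (0.2, 0.4)), 0.5,
                        n_people=20000, n_recalls=2)
flags = [consumed for person in data for consumed, _ in person]
frac_consumed = sum(flags) / len(flags)
# Marginally W_1 ~ Normal(0.3, 0.5 + 1), so P(W_1 > 0) = Phi(0.3 / sqrt(1.5)).
expected = 0.5 * (1.0 + math.erf(0.3 / math.sqrt(1.5) / math.sqrt(2.0)))
```

The simulated fraction of recalls with consumption should be close to `expected`, and recalls from the same person are correlated through the shared random effects.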

### 3.2. Restriction on the Covariance Matrix

Two necessary restrictions are set on $\Sigma_\epsilon$. First, following Kipnis, et al. (2009, 2010), $\epsilon_{i,2\ell-1,k}$ and $\epsilon_{i,2\ell,k}$ $(\ell = 1, \dots, 6)$ are set to be independent. Second, in order to technically identify $\beta_{2\ell-1}$ and the distribution of $U_{i,2\ell-1}$ $(\ell = 1, \dots, 6)$, we require that $\text{var}(\epsilon_{i,2\ell-1,k}) = 1$, because otherwise the marginal probability of consumption of dietary component $\#\ell$ would be $\Phi\{(X_{i,2\ell-1,k}^T \beta_{2\ell-1} + U_{i,2\ell-1})/\text{var}^{1/2}(\epsilon_{i,2\ell-1,k})\}$, and thus components of $\beta$ and $\Sigma_u$ would be identified only up to the scale $\text{var}^{1/2}(\epsilon_{i,2\ell-1,k})$.
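The identifiability argument can be checked numerically: multiplying the fixed effect, the random effect and the error standard deviation by a common constant leaves the consumption probability unchanged. The function name and the numbers below are hypothetical.

```python
import math

def probit_prob(linear_predictor, random_effect, eps_sd):
    """P(W > 0) for W = linear_predictor + random_effect + eps, eps ~ N(0, eps_sd^2)."""
    z = (linear_predictor + random_effect) / eps_sd
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

p_unit = probit_prob(0.4, -0.1, 1.0)
# Multiply the fixed effect, the random effect and the error sd by c = 3.
p_scaled = probit_prob(3 * 0.4, 3 * -0.1, 3.0)
```

Both calls return the same probability, so the data cannot distinguish the two parameter sets; fixing the error variance at 1 removes this ambiguity.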

So that we can handle any number of episodically consumed dietary components and any number of daily consumed components, suppose that there are $J$ episodically consumed dietary components, and $K$ daily consumed dietary components, and in addition there is energy. Then the restrictions defined above lead to the covariance matrix

The difficulty with parameterizations of (3.2) is that the cells that are not constrained to be 0 or 1 cannot be left unconstrained, otherwise (3.2) need not be a covariance matrix, i.e., positive semidefinite.

We have developed an unconstrained parameterization that results in the structure (3.2). Consider an unconstrained lower triangular matrix $V$ and define $\Sigma_\epsilon = VV^T$. This is positive semidefinite and therefore qualifies $\Sigma_\epsilon$ as a proper covariance matrix. The form of $V$ is

To achieve the desired pattern (3.2), we derive the following four restrictions:

The third restriction can be ensured by the further parameterization

where $q = 2, 3, \dots, J-1$; $|r_t| \le 1$, $t = 1, \dots, J-1$, and $|\theta_s| \le \pi$, $s = 1, \dots, (J-1)^2$.

Similarly, the fourth restriction can be further expressed by setting

where $q = 3, 5, \dots, 2J-1$. Note that $|\Sigma_\epsilon| = |V|^2 = \prod_{q=1}^{2J+K+1} v_{qq}^2 = \prod_{q=1}^{J} v_{2q,2q}^2 \prod_{q=2J+1}^{2J+K+1} v_{q,q}^2 \prod_{q=1}^{J-1} (1 - r_q^2)$.
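Two generic facts used in this section, that $VV^T$ is always positive semidefinite and that its determinant is the squared product of the diagonal of $V$, can be checked with a small sketch. It uses a random, otherwise unrestricted lower triangular matrix; the additional restrictions that produce the structural zeros and ones are not imposed here, and all helper names are hypothetical.

```python
import random

def lower_triangular(p, rng):
    """Random lower triangular matrix with unconstrained entries."""
    return [[rng.uniform(-1, 1) if c <= r else 0.0 for c in range(p)]
            for r in range(p)]

def times_transpose(v):
    """Return V V^T."""
    p = len(v)
    return [[sum(v[r][k] * v[c][k] for k in range(p)) for c in range(p)]
            for r in range(p)]

def determinant(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n, det = len(a), 1.0
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(a[r][i]))
        if a[pivot][i] == 0.0:
            return 0.0
        if pivot != i:
            a[i], a[pivot] = a[pivot], a[i]
            det = -det
        det *= a[i][i]
        for r in range(i + 1, n):
            f = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= f * a[i][c]
    return det

rng = random.Random(1)
V = lower_triangular(4, rng)
Sigma = times_transpose(V)

diag_product_sq = 1.0
for q in range(4):
    diag_product_sq *= V[q][q] ** 2

# Quadratic forms x^T Sigma x for random x; each equals |V^T x|^2 >= 0.
quad_forms = []
for _ in range(100):
    x = [rng.gauss(0, 1) for _ in range(4)]
    quad_forms.append(sum(x[r] * Sigma[r][c] * x[c]
                          for r in range(4) for c in range(4)))
```

Because each diagonal entry of $VV^T$ is the sum of squares of the corresponding row of $V$, a unit-variance restriction amounts to a unit-norm constraint on that row.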

### 3.3. The Use of Sampling Weights

As described in the Appendix, we used the survey sample weights from NHANES both in the model fitting procedure and, after having fit the model, in estimating the distributions of usual intake.

While not displayed here, we redid the model fitting calculations without weighting, because the covariates we use are major players in determining the sampling weights, hence it is reasonable to believe that the model in Section 3 holds both in the sample and in the population. When we did this, the parameter estimates were essentially unchanged.

Thus, we use the sampling weights only for estimation of the population distributions. We actually did this for the purpose of handling the clustering in the sample design. For such a complex statistical procedure as ours, we knew we could not do theoretical standard errors, so we thought about the bootstrap, and realized that putting together a bootstrap for the complex survey would be nearly impossible. However, we already had developed a set of Balanced Repeated Replication (BRR) weights (Wolter, 1995), see Section 5.7 for details. These BRR weights have the property that, in the frequentist survey sampling sense, they appropriately reflect the clustering in the standard error calculations.
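For readers unfamiliar with BRR, the textbook variance estimator can be sketched as follows. The statistic is recomputed under each set of replicate weights, and the squared deviations from the full-sample estimate are averaged. This is the generic estimator, with an optional Fay adjustment; the exact variant used in this paper is described in Section 5.7 and is not reproduced here. The function name, the `fay_factor` argument and the numbers are assumptions made for illustration.

```python
import math

def brr_standard_error(full_estimate, replicate_estimates, fay_factor=0.0):
    """Textbook BRR standard error.

    full_estimate:       statistic computed with the full-sample weights
    replicate_estimates: the same statistic under each set of replicate weights
    fay_factor:          Fay perturbation factor in [0, 1); 0 gives standard BRR
    """
    r = len(replicate_estimates)
    if r == 0:
        raise ValueError("need at least one replicate estimate")
    if not 0.0 <= fay_factor < 1.0:
        raise ValueError("fay_factor must lie in [0, 1)")
    ss = sum((est - full_estimate) ** 2 for est in replicate_estimates)
    return math.sqrt(ss / (r * (1.0 - fay_factor) ** 2))

# Made-up full-sample and replicate estimates of some summary statistic.
full = 10.0
replicates = [9.6, 10.4, 9.9, 10.1]
se = brr_standard_error(full, replicates)
se_fay = brr_standard_error(full, replicates, fay_factor=0.3)
```

In an application, each replicate estimate would come from rerunning the entire estimation with one set of replicate weights, which is why the approach is convenient for a procedure with no tractable analytic standard errors.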

Of course, the use of sampling weights in the modeling provides unbiased estimates of the (super) population parameters of interest. In addition, the use of sampling weights in the distribution estimation provides an estimated distribution that is representative of the US population, not just the sample.

### 3.4. Distribution of Usual Intake and the HEI-2005 Scores

We assume here that estimates of $\Sigma_u$, $\Sigma_\epsilon$ and $\beta_j$ for $j = 1, \dots, 19$ have been constructed, see Section 4. Here we discuss what we mean by usual intake for an individual, how to estimate the distribution of usual intakes, how to convert usual intakes into HEI-2005 scores, and how to assess uncertainty.

Consider the first episodically consumed dietary component, a food group, with reporting being done on a weekend. Set $X_{i1,\text{wkend}}$ and $X_{i2,\text{wkend}}$ to be the versions of $X_{i1k}$ and $X_{i2k}$ in which the dummy variables indicate a weekend day and the first recall. Following Kipnis, et al. (2009), we define the usual intake for an individual on the weekend to be the expectation of the reported intake conditional on the person's random effects $\tilde{U}_i$. Let the $(q, p)$ element of $\Sigma_\epsilon$ be denoted as $\Sigma_{\epsilon,q,p}$. As in Kipnis, et al. define

Detailed formulas for this are given in Appendix A.11. Then, following the convention of Kipnis, et al. (2009), the person's usual intake of the first episodically consumed dietary component on the weekend is defined as

Similarly, let *X*_{i1,wkday} and *X*_{i2,wkday} be as above but the dummy variable is appropriate for a weekday. Then the person's usual intake of the first episodically consumed food group on weekdays is defined as

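Finally, the usual intake of the first episodically consumed food for the individual is

$$T_{i1} = (4\,T_{i1,\text{wkday}} + 3\,T_{i1,\text{wkend}})/7,$$

where $T_{i1,\text{wkday}}$ and $T_{i1,\text{wkend}}$ denote the weekday and weekend usual intakes just defined, since Fridays, Saturdays and Sundays are considered to be weekend days. Usual intake for the other episodically consumed food groups is defined similarly.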

A person's usual intake of a daily-consumed food group/nutrient and energy on the original scale is defined similarly. Consider, for example, energy, which is the 13$^{th}$ dietary component and the 19$^{th}$ set of terms in the model. Let $X_{i,19,\text{wkend}}$ and $X_{i,19,\text{wkday}}$ be the versions of $X_{i,19,k}$ where the dummy variable has the indicator of the weekend or weekday, respectively, and that the recall is the first one. Then

Similar formulae are used for the other daily-consumed foods and nutrients.

Finally, the energy-adjusted usual intakes and the HEI-2005 scores are then obtained as in Table 1, using the estimated usual intakes of the dietary components.

To find the joint distribution of usual intakes of the HEI-2005 scores, it is convenient to use Monte-Carlo methods. Recall that $w_i$ is the sampling weight for individual $i$. Let $B$ be a large number: we set $B = 5{,}000$. Generate $b = 1, \dots, B$ observations $\tilde{U}_{bi} = \text{Normal}(0, \Sigma_u)$ and then obtain $\tilde{T}_{bi} = (T_{bi\ell})_{\ell=1}^{13}$ by replacing $U_{ij}$ in their formulae by $U_{bij}$. With appropriate sample weighting, the $\tilde{T}_{bi}$ can be used to estimate joint and marginal distributions. Thus, for example, consider the total HEI-2005 score, which is a deterministic function of the usual intakes, say $G(\tilde{T}_i)$. Its cumulative distribution function is estimated as

$$\hat{F}(x) = \frac{\sum_{i=1}^{n} \sum_{b=1}^{B} w_i\, I\{G(\tilde{T}_{bi}) \le x\}}{B \sum_{i=1}^{n} w_i}.$$

Frequentist standard errors of derived quantities such as the mean, median and quantiles can be estimated using the Balanced Repeated Replication (BRR) method (Wolter, 1995), see Section 5.7 for details.
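The survey-weighted Monte Carlo estimate of a distribution function described in this section can be sketched as follows. The sketch assumes the scores $G(\tilde{T}_{bi})$ have already been simulated. It also includes the weekday/weekend averaging that follows from treating Friday through Sunday as weekend days. Function names and numbers are hypothetical.

```python
def usual_intake(weekday_intake, weekend_intake):
    """Average over a week with 4 weekdays and 3 weekend days (Fri, Sat, Sun)."""
    return (4.0 * weekday_intake + 3.0 * weekend_intake) / 7.0

def weighted_cdf(scores, weights, x):
    """Estimate P(G <= x) from Monte Carlo draws.

    scores[i][b]: G evaluated at the b-th simulated usual intake of person i
    weights[i]:   survey weight w_i of person i
    """
    total = 0.0
    for person_scores, w in zip(scores, weights):
        below = sum(1 for s in person_scores if s <= x)
        total += w * below / len(person_scores)
    return total / sum(weights)

# Two hypothetical people with B = 2 simulated scores each.
scores = [[40.0, 60.0], [70.0, 90.0]]
weights = [3.0, 1.0]
cdf_65 = weighted_cdf(scores, weights, 65.0)
cdf_75 = weighted_cdf(scores, weights, 75.0)
```

With these values the first person, who carries three quarters of the total weight, lies entirely below 65, so `cdf_65` is 0.75; one of the second person's two draws lies below 75, giving `cdf_75` of 0.875.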

## 4. Comments on the Approach to Estimation

Our model (3.3)-(3.4) is a highly nonlinear, mixed effects model with many latent variables and nonlinear restrictions on the covariance matrix $\Sigma_\epsilon$. As seen in Section 3.4, we can estimate relevant distributions of usual intake in the population if we can estimate $\Sigma_u$, $\Sigma_\epsilon$ and $\beta_j$ for $j = 1, \dots, 19$. We have found that working within a pseudo-likelihood Bayesian paradigm is a convenient way to do this computation. We emphasize, however, that we are doing this only to get frequentist parameter estimates based on the well-known asymptotic equivalence of frequentist likelihood estimators and Bayesian posterior means, and especially the consistency of both (Lehmann and Casella, 1998). We are specifically not doing Bayesian posterior inference, since valid Bayesian inference in a complex survey such as NHANES is an immensely challenging task, and because frequentist estimation and inference are the standard in the nutrition community.

Kipnis, et al. (2009) were able to get estimates of parameters separately for each food group using the nonlinear mixed effects program NLMIXED in SAS with sampling weights. While this gives estimates of $\beta_j$ for $j = 1, \dots, 19$, it only gives us parts of the covariance matrices $\Sigma_u$ and $\Sigma_\epsilon$, and not all the entries. Using the 2001-2004 NHANES data, we have verified that our estimates and the subset of the parameters that can be estimated by one food group at a time using NLMIXED are in close agreement, and that estimates of the distributions of usual intake and HEI-2005 component scores are also in close agreement. We expect this because of the rather large sample size in our data set. Zhang, et al. (2010) have shown that even considering a single food group plus energy is a challenge for the NLMIXED procedure, both in time and in convergence, and using this method for the entire HEI-2005 constellation of dietary components is impossible.

Full technical details of the model fitting procedure are given in Appendices A.1-A.10.

Of course, our model has assumptions, e.g., additivity and homoscedasticity on a transformed scale for observed and latent variables, normality of person-specific random effects and normality of day-to-day variability on the transformed scale. These assumptions are clearly not exactly correct, although our marginal model-checking suggests to us that they are mostly not disastrously wrong. Some reasons for this conclusion include the facts that we reproduce the marginal distributions of the components, that comparison with 24hr recalls shows differences that decrease when moving from one 24hr recall to two 24hr recalls, that q-q plots of the data are fairly satisfactory, etc. Thinking, as we do, of our work as a first step, and not a last step, it would be extremely interesting to make the model more general, e.g., skew-normal, skew-t or Dirichlet process distributions after transformation, and possibly directly modeling heteroscedasticity. Such generalizations will require effort to implement, but will speak to the robustness of the results and would be a useful future step.

## 5. Empirical Work

### 5.1. Basic Analysis

We analyzed data from the 2001-2004 National Health and Nutrition Examination Survey (NHANES) for children aged 2-8. The study sample consisted of 2,638 children, among whom 1,103 children have two 24hr recalls and the rest have only one. We used the dietary intake data to calculate the 12 HEI-2005 components plus energy. In addition, besides age, gender, race and interaction terms, two covariates were employed, along with an intercept. The first was a dummy variable indicating whether or not the recall was for a weekend day (Friday, Saturday, or Sunday) because food intakes are known to differ systematically on weekends and weekdays. The second was a dummy variable indicating whether the 24hr recall was the first or second such recall, the idea being that there may be systematic differences attributable to the repeated administration of the instrument.
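The two recall-level dummy variables can be constructed as in the sketch below. The coding (1 for a weekend day, 1 for the second recall) is an assumption made for illustration, as is the function name; the text specifies only that Friday, Saturday and Sunday count as weekend days.

```python
from datetime import date

def recall_dummies(recall_date, recall_number):
    """Return (weekend, second_recall) indicators for one 24hr recall."""
    if recall_number not in (1, 2):
        raise ValueError("recall_number must be 1 or 2")
    # date.weekday(): Monday = 0, ..., Friday = 4, Saturday = 5, Sunday = 6.
    weekend = 1 if recall_date.weekday() >= 4 else 0
    second = 1 if recall_number == 2 else 0
    return weekend, second

friday_first = recall_dummies(date(2003, 6, 6), 1)   # 6 June 2003 was a Friday
monday_second = recall_dummies(date(2003, 6, 9), 2)  # 9 June 2003 was a Monday
```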

### 5.2. Contextual Information

When we ran our program based on the variables in Table 1, the results were disastrous. Mixing of the MCMC sampler was very poor, with long sojourns in different regions.

The reason for this failure to converge depends on the context of the dietary variables. For example, whole grains are a subset of total grains. Thus, if someone consumes any whole grains, then necessarily, with probability 1.0, that person also consumes total grains. Such a restriction cannot be handled by our model, because it would force one of the random effects *U* to equal infinity. A similar thing happens for energy. Calories coming from saturated fat are a subset of total calories as are calories from SoFAAS, so there is a restriction that total calories must be greater than calories from saturated fat and also greater than calories from SoFAAS. Since the latter sum makes up a significant portion of calories, this restriction is not something that our model can handle well.

Luckily, there is an easy and natural context-based solution. Instead of using total grains in the model, we used grains that are not whole grains, i.e., refined grains, thus decoupling whole grains and total grains, and removing the restriction mentioned above. Similarly, instead of using total fruit, we use fruit that is not whole fruits, i.e., fruit juices. Additionally, instead of using total vegetables, we use total vegetables excluding dark green and orange vegetables and legumes. Finally, instead of total energy, we use total energy minus the sum of energy from saturated fat (11% of mean energy) and from SoFAAS (35% of mean energy). We recognize that there is overlap of energy from saturated fat and energy from solid fat, but this has no impact on our analysis since total energy has sources other than these two. An alternative, of course, would have been to simply use total energy minus energy from SoFAAS.

This is sufficient to estimate the distributions of interest. If, for example, in the new data set *T*_{i1} represents usual intake of non-whole fruits, and *T*_{i2} is usual intake of whole fruits, then the usual intake of total fruits is *T*_{i1} + *T*_{i2}. Similar remarks apply for total grains and total vegetables.
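The decoupling device and its reversal can be sketched as follows. The function names and amounts are hypothetical; the sketch shows only the bookkeeping, not the model fit that happens between the two steps.

```python
def decouple(total, subset):
    """Split a reported total into (remainder, subset).

    For example, total fruit splits into non-whole fruit (juice) and whole fruit.
    """
    if subset > total:
        raise ValueError("the subset cannot exceed the total")
    return total - subset, subset

def recombine(remainder_usual, subset_usual):
    """Usual intake of the total is the sum of the two modeled usual intakes."""
    return remainder_usual + subset_usual

# Hypothetical reported amounts on one recall: 1.5 of total fruit, 0.9 of whole fruit.
juice, whole = decouple(1.5, 0.9)
total_again = recombine(juice, whole)
```

The remainder and the subset can each be zero on a given day without forcing the other to be positive, which is what removes the probability-one restriction.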

With these new variables, our model mixed well and gave reasonable looking answers that, as mentioned in Section 4, give similar results to other methods employed with smaller parts of the data set.

### 5.3. Estimation of the HEI-2005 Scores

In the introduction, we posed 4 questions to which answers had not been possible previously. The first open question concerned the distribution of the HEI total score. Along the way towards this, Table 2 presents the energy-adjusted distributions of the dietary components used in the HEI-2005. Table 3 presents the distributions of the HEI-2005 individual component scores and the total score, with a graphical view given in Figure 1.

**...**

**...**

Table 3 presents the first estimates of the distribution of HEI-2005 scores for a vulnerable subgroup of the population, namely children aged 2-8 years. A previous analysis of 2003-04 NHANES data, looking separately at 2-5 year olds and 6-11 year olds, was limited to estimates of mean usual HEI-2005 scores (59.6 and 54.7, respectively, see Fungwe, et al., 2009). The mean scores noted here are comparable to those and reinforce the notion that children's diets, on average, are far from ideal. However, this analysis provides a more complete picture of the state of US children's diets. By including the scores at various percentiles, we estimate that only 5% of children have a score of 69 or greater and another 10% have scores of 41 or lower. While not in the Table, we also estimate that the 99^{th} percentile is 74. This analysis suggests that virtually all children in the US have suboptimal diets and that a sizeable fraction (10%) have alarmingly low scores (41 or lower).

We have also considered whether our multivariate model fitting procedure gives reasonable marginal answers. To check this, we note that it is possible to use the SAS procedure NLMIXED *separately for each component* to fit a model with one episodically consumed food group or daily consumed dietary component together with energy. The marginal distributions of each such component done separately are quite close to what we have reported in Table 3, as is our mean, which is 53.50 compared to the mean of 53.25 based on analyzing one HEI-2005 component at a time with the NLMIXED procedure. The only case where there is a mild discrepancy is in the estimated variability of the energy-adjusted usual intake of oils, where the NLMIXED procedure gives an estimated variance 9 times greater than our estimated variance; this is likely caused by the NLMIXED procedure itself.

Of course, it is the distribution of the HEI-2005 total score that cannot be estimated by analysis of one component at a time.

There are other things that have not been computed previously that are simple by-products of our analysis. For example, the correlations among energy-adjusted usual intakes involving episodically consumed foods have not been estimated previously, but this is easy for us, see Table 4. The estimated correlation of –0.64 between energy-adjusted total fruit and energy-adjusted SoFAAS, and the –0.47 correlation between DOL and SoFAAS are surprisingly high.

### 5.4. Component Scores and Other Scores

As described in the introduction, an open problem has been to estimate the correlation between the individual score on each dietary component and the scores of all other dietary components. In their Table 3, Guenther, et al. (2008b) consider this problem, but of course they did not have a model for usual energy adjusted intakes, and instead they used a single 24hr recall. In Table 5, we show the resulting correlations using (a) a single 24hr recall; (b) the mean of two 24hr recalls for those who have two 24hr recalls; and (c) our model for usual intake. The numbers for (a) differ from those of Guenther, et al. (2008b) because we are considering here a different population than they do. A striking and not unexpected aspect of this table is that for those components with non-trivial correlations, the correlations all increase as one moves from a single 24hr recall to the mean of two 24hr recalls and then finally to estimated usual intake. Thus, for example, the correlation between the HEI-2005 score for total fruit and its difference with the total score is 0.38 for a single 24hr recall, 0.44 for the mean of two 24hr recalls and then finally 0.62 for usual intake.
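The correlation described above, between one component score and the sum of the remaining scores, can be sketched as follows. This is a minimal illustration on made-up numbers: the function names are hypothetical, and the correlation is unweighted, which may differ from how Table 5 was actually computed.

```python
import math

def pearson(x, y):
    """Unweighted Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def component_vs_rest(component_scores, total_scores):
    """Correlate a component score with (total score - component score)."""
    rest = [t - c for c, t in zip(component_scores, total_scores)]
    return pearson(component_scores, rest)

# Hypothetical toy scores for four children (not NHANES data).
fruit = [1.0, 2.0, 3.5, 5.0]
total = [40.0, 48.0, 55.0, 70.0]
r = component_vs_rest(fruit, total)
```

Subtracting the component from the total before correlating avoids the spurious positive correlation that arises when a score is correlated with a sum that contains it.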

### 5.5. Distributions of Intakes for Subsets of HEI Total Scores

A third open question is: among those whose total HEI-2005 score is > 50 or ≤ 50, what is the distribution of energy-adjusted usual intake of whole grains, whole fruits, dark green and orange vegetables and legumes (DOL) and calories from solid fats, alcoholic beverages and added sugars (SoFAAS)? This follows naturally from our method. Following (3.8), let $G_{1}(\tilde{T}_{bi})$ be energy adjusted usual intake and let $G_{2}(\tilde{T}_{bi})$ be the HEI total score. Then the distributions in question for when the total HEI-2005 score is > 50 can be estimated as

$$\widehat{F}(x)=\frac{\sum_{i=1}^{n}\sum_{b=1}^{B}w_{i}I\{G_{1}(\tilde{T}_{bi})\le x\}I\{G_{2}(\tilde{T}_{bi})>50\}}{\sum_{i=1}^{n}\sum_{b=1}^{B}w_{i}I\{G_{2}(\tilde{T}_{bi})>50\}}.$$

The results are provided in Table 6, with a graphical view in Figure 2. The results show that those who have poorer diets with usual HEI-2005 total score ≤ 50 are consistently eating poorer diets, i.e., less whole fruits, less whole grains and less DOL, but higher SoFAAS.
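The weighted estimator of the conditional distribution can be sketched in a few lines. This is only an illustration: the function name and the nested-list layout of the simulated draws are assumptions, and the toy numbers are invented.

```python
def conditional_cdf(x, weights, g1, g2, threshold=50.0):
    """Weighted estimate of P(G1 <= x | G2 > threshold).

    weights[i] : survey weight w_i of person i
    g1[i][b]   : G1 evaluated at the b-th simulated usual intake of person i
    g2[i][b]   : G2 (total score) evaluated at the same draw
    """
    num = 0.0
    den = 0.0
    for w, g1_i, g2_i in zip(weights, g1, g2):
        for intake, score in zip(g1_i, g2_i):
            if score > threshold:
                den += w
                if intake <= x:
                    num += w
    if den == 0.0:
        raise ValueError("no simulated draw has a total score above the threshold")
    return num / den

# Hypothetical toy input: n = 3 people, B = 2 draws each.
weights = [2.0, 1.0, 3.0]
g1 = [[0.2, 0.8], [0.5, 0.1], [0.9, 0.4]]
g2 = [[55.0, 45.0], [60.0, 52.0], [48.0, 70.0]]
F = conditional_cdf(0.45, weights, g1, g2)  # equals 6/7 for these numbers
```

The distribution for scores ≤ 50 follows by flipping the inequality on the total score.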

**...**

### 5.6. Dietary Consistency

We stated in the introduction that it is interesting to understand the percentage of children whose usual intake HEI score exceeds the median HEI score on all 12 HEI components. Those median scores, say (*κ*_{1}, ..., *κ*_{12}), are estimated in Table 3. If $G_{j}(\tilde{T}_{bi})$ is the HEI component score for episodically consumed food *j*, then following (3.8) the quantity in question can be estimated as

$$\frac{\sum_{i=1}^{n}\sum_{b=1}^{B}w_{i}\prod_{j=1}^{6}I\{G_{j}(\tilde{T}_{bi})\ge \kappa_{j}\}}{\sum_{i=1}^{n}\sum_{b=1}^{B}w_{i}}.$$

We estimate that the percentage is 6%, woefully small. The percentage of children whose usual intake HEI score exceeds the median HEI score on all 12 HEI components is 0.24%. Figure 3 gives the estimated probabilities of exceeding the *κ*^{th} percentile on all 12 HEI components simultaneously, for *κ* = 1, 2, ..., 99.
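The joint exceedance proportion is a weighted average of products of indicators. A minimal sketch, with a hypothetical function name and invented toy scores, is:

```python
def joint_exceedance(weights, scores, kappas):
    """Weighted share of simulated draws whose every component score is >= its cutoff.

    weights[i]      : survey weight w_i
    scores[i][b][j] : score on component j for draw b of person i
    kappas[j]       : cutoff for component j (e.g. the estimated median)
    """
    num = 0.0
    den = 0.0
    for w, draws in zip(weights, scores):
        for draw in draws:
            den += w
            # The product of indicators is 1 only if every component clears its cutoff.
            if all(s >= k for s, k in zip(draw, kappas)):
                num += w
    return num / den

# Hypothetical toy input: 2 people, 2 draws each, 2 components.
weights = [1.0, 3.0]
scores = [[[5.0, 4.0], [2.0, 4.0]],
          [[5.0, 5.0], [1.0, 1.0]]]
share = joint_exceedance(weights, scores, kappas=[3.0, 4.0])  # (1 + 3) / 8 = 0.5
```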

### 5.7. Uncertainty Quantification

The BRR standard errors of HEI-2005 components’ adjusted usual intakes and scores are shown in Tables 2 and 3. The BRR weights are only used in variance calculations. Once we have estimated some quantity, say $\widehat{\theta}$, from the sample using the sample weights, we compute the same quantity using, in succession, the 32 BRR weights. This gives 32 estimates $\widehat{\theta}_{1},\widehat{\theta}_{2},\dots ,\widehat{\theta}_{32}$. The BRR estimate for the variance of $\widehat{\theta}$ is $(32\times 0.49)^{-1}\sum_{p=1}^{32}(\widehat{\theta}_{p}-\widehat{\theta})^{2}$. The 32 in the denominator is for the 32 different estimates from the 32 different sets of weights, and the 0.49 is the square of the perturbation factor used to construct the BRR weight sets (Wolter, 1995).
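The BRR variance formula is simple to compute once the replicate estimates are available. The sketch below uses a hypothetical function name and made-up replicate values; only the constants 32 and 0.49 come from the text.

```python
import math

def brr_variance(theta_hat, replicate_estimates, perturbation_sq=0.49):
    """BRR variance: sum of squared deviations / (number of replicates * perturbation_sq)."""
    r = len(replicate_estimates)
    ss = sum((t - theta_hat) ** 2 for t in replicate_estimates)
    return ss / (r * perturbation_sq)

# Hypothetical example: 32 replicate estimates that each differ from theta_hat by 0.7.
theta_hat = 53.5
replicates = [theta_hat + (0.7 if p % 2 == 0 else -0.7) for p in range(32)]
variance = brr_variance(theta_hat, replicates)
std_error = math.sqrt(variance)
```

In this toy case every squared deviation is 0.49, so the variance and the standard error are both 1 up to rounding.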

## 6. Further Discussion of the Analysis

### 6.1. Never Consumers

An aspect of the modeling that we have not discussed is the possibility that some people never, ever consume an episodically consumed dietary component. Our model does not allow for this, for general reasons and for reasons that are specific to our data analysis.

It is in principle possible to add an additional modeling step for non-consumers, via fixed effects probit regression, but we do not think this is a practical issue in our case, for two reasons.

- The first is that the HEI-2005 is based on 6 episodically consumed dietary components, namely total fruit, whole fruit, whole grains, total vegetables, DOL, and milk, the latter of which includes cheese, yogurt and soy beverages. None of these are “lifestyle adverse”, unlike say alcohol. While 40% of the responses for whole fruits, for example, equal zero, the percentage of children who never eat any whole fruits at all is likely to be minuscule.
- Even if one disputes whether there are very few individuals who never consume one of the dietary components, then it necessarily follows that we have overestimated the HEI-2005 total scores, and hence the estimates of the proportion of individuals with alarmingly low HEI scores are deflated, and not inflated. The reason is that our model suggests everyone has a positive usual intake of the 6 episodically consumed dietary components. Since the HEI-2005 score components are nondecreasing functions of usual intake of the episodically consumed dietary components, this would mean that we overestimate the HEI-2005 total score.

### 6.2. Computing and Data

Our programs were written in Matlab. The programs, along with the NHANES data we used, are available in the *Annals of Applied Statistics* online archive. Although a much smaller amount of computing effort yields similar results, using 70,000 MCMC steps with a burn-in of 20,000 takes approximately 10 hours on a Linux server.

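We also estimated the Monte Carlo standard error, which is defined by Flegal, et al. (2008) as $\widehat{\sigma}_{g}/\sqrt{n}$, where *n* is the total number of iterations, and *n* = *ab*, where *a* is the number of blocks and *b* is the block size, and where, writing $g_{t}$ for the value of the quantity of interest at iteration *t*, the block means are

$$\bar{Y}_{k}=\frac{1}{b}\sum_{t=(k-1)b+1}^{kb}g_{t},\qquad k=1,\dots ,a.$$

The batch means estimate of $\sigma_{g}^{2}$ is

$$\widehat{\sigma}_{g}^{2}=\frac{b}{a-1}\sum_{k=1}^{a}(\bar{Y}_{k}-\bar{g}_{n})^{2},\qquad \bar{g}_{n}=\frac{1}{n}\sum_{t=1}^{n}g_{t}.$$

The ratio of the Monte Carlo standard error to the estimated standard deviation of the estimated parameters averages 3.4% for Σ_{u} and 1.7% for *β*.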
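A batch means calculation of the Monte Carlo standard error can be sketched as follows. The formulas are the usual batch means ones (block means, then their scaled variance); the function name and the short toy chain are assumptions for illustration.

```python
import math

def batch_means_mcse(chain, num_batches):
    """Monte Carlo standard error sigma_hat / sqrt(n) from non-overlapping batches."""
    b = len(chain) // num_batches          # block size
    if num_batches < 2 or b < 1:
        raise ValueError("need at least two batches of at least one draw")
    used = chain[: num_batches * b]        # drop a remainder so that n = a * b
    n = len(used)
    overall_mean = sum(used) / n
    block_means = [sum(used[k * b:(k + 1) * b]) / b for k in range(num_batches)]
    sigma2 = b / (num_batches - 1) * sum((m - overall_mean) ** 2 for m in block_means)
    return math.sqrt(sigma2 / n)

# Hypothetical toy chain with a = 4 blocks of size b = 2.
chain = [1.0, 3.0, 2.0, 4.0, 3.0, 5.0, 4.0, 6.0]
mcse = batch_means_mcse(chain, num_batches=4)
```

For the toy chain the block means are 2, 3, 4 and 5, so the estimate of the variance is (2/3) × 5 and the standard error is the square root of that divided by 8.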

Because of the public health importance of the problem, the National Cancer Institute has contracted for the creation of a SAS program that performs our analysis. It will allow any number of episodically and daily consumed dietary components. The first draft of this program, written independently in a different programming language, gives almost identical results to what we have obtained, at least suggesting that our results are not the product of a programming error.

## 7. Discussion

### 7.1. Transformations

In Section A.12, we describe how we estimated the transformation parameters as a separate component-wise calculation. We have done some analyses where we simultaneously transform each component, and found very little difference with our results. However, the computing time to implement this is extremely high, because of the fact that different transformations make data on different scales, so we have to compute the usual intakes at each step in the MCMC, and not just at the end.

### 7.2. What Have We Learned That Is New

There are many important questions in dietary assessment that have not been able to be answered because of a lack of multivariate models for complex, zero-inflated data with measurement errors and a lack of ability to fit such multivariate models. Nutrients and foods are not consumed in isolation, but rather as part of a broader pattern of eating. There is reason to believe that these various dietary components interact with one another in their effect on health, sometimes working synergistically and sometimes in opposition. Nonetheless, simply characterizing various patterns of eating has presented enormous statistical challenge. Until now, descriptive statistics on the HEI-2005 have been limited to examination of either the total scores or only a single energy-adjusted component at a time. This has precluded characterization of various patterns of dietary quality as well as any subsequent analyses of how such patterns might relate to health.

The methodology presented in this paper provides a workable solution to these problems, one that has already proven valuable. In May 2010, just as we were submitting the paper, a White House Task Force on Childhood Obesity created a report. They had wanted to set a goal of all children having a total HEI score of 80 or more by 2030, but when they learned we estimated only 10% of the children ages 2-8 had a score of 66 or higher, they decided to set a more realistic target. The facility to estimate distributions of the multiple component scores simultaneously will be important in tracking progress toward that goal.

### 7.3. In What Other Arenas Will Our Work Have Impact?

There are many other important problems where multivariate models such as ours will be important. One such problem arises when studying the relationship between multiple dietary components or dietary patterns and health outcomes. Traditionally, for cost reasons, large cohort studies have used a food frequency questionnaire (FFQ) to measure dietary intake, sometimes with a small calibration study including short-term measures such as 24hr recalls. However, there is a new web-based instrument called the Automated Self-administered 24-hour Dietary Recall (ASA24™), see http://riskfactor.cancer.gov/tools/instruments/asa24, which has been proposed to replace or at least supplement the FFQ and which is currently undergoing extensive testing. The dietary data we will see then is what we have called *Y*_{ijk}, i.e., 24hr recall data. In order to correct relative risk estimates for the measurement error inherent in the ASA24™, regression calibration (Carroll, et al., 2006) will almost certainly be the method of choice, as it is in most of nutritional epidemiology. This method attempts to produce an estimate of the regression of usual intake on the observed intakes, and then to use these estimates in Cox and logistic regression for the health outcome. In order to perform this regression, a multivariate measurement error model will be required, since the regression is on all the observed dietary intake components in the regression model measured by the ASA24™, and not on each individual component. Our methodology is easily extended to address this problem.

## ACKNOWLEDGMENTS

This paper forms part of Zhang's Ph.D. dissertation at Texas A&M University. Zhang and Carroll's research was supported by a grant from the National Cancer Institute (CA57030). This work was also supported by National Science Foundation Instrumentation grant number 0922866.

## APPENDIX A: DETAILS OF THE FITTING PROCEDURE

In this Appendix we give the full details of the model fitting procedure.

#### A.1. Notational Convention

In our example, age was standardized to have mean 0.0 and variance 1.0, to improve numerical stability.

As described in Section 3.1, the observed, transformed non-zero 24hr recalls were standardized to have mean 0.0 and variance 2.0. More precisely, for $\ell =1,2,\dots ,6$, we first transformed the non-zero food group data as $Z_{i,2\ell ,k}=g(Y_{i,2\ell ,k},\lambda_{\ell})$, and then we standardized these data as $Q_{i,2\ell ,k}=\sqrt{2}\{Z_{i,2\ell ,k}-\mu (\lambda_{\ell})\}/\sigma (\lambda_{\ell})$, where $\{\mu (\lambda_{\ell}),\sigma (\lambda_{\ell})\}$ are the mean and standard deviation of the non-zero food intakes $Z_{i,2\ell ,k}$. Similarly, for non-episodically consumed dietary components and energy we transformed to $Z_{i,6+\ell ,k}=g(Y_{i,6+\ell ,k},\lambda_{\ell})$ for $\ell =7,\dots ,13$, and then standardized to $Q_{i,6+\ell ,k}=\sqrt{2}\{Z_{i,6+\ell ,k}-\mu (\lambda_{\ell})\}/\sigma (\lambda_{\ell})$. Of course, whether the food group is consumed or not is $Q_{i,2\ell -1,k}=Y_{i,2\ell -1,k}$ for $\ell =1,\dots ,6$. Collected, the data are $\tilde{Q}_{ik}=(Q_{ijk})_{j=1}^{19}$. The terms $\{\mu (\lambda_{\ell}),\sigma (\lambda_{\ell})\}$ are not random variables but are merely constants used for standardization, and we need not consider inference for them. Back-transformation is discussed in Appendix A.11.
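The transform-then-standardize step can be sketched as below. Two things here are assumptions rather than statements from this appendix: the transformation $g$ is taken to be the Box-Cox family, and the standard deviation is the population (divide by the count) version. The function names and toy recalls are hypothetical.

```python
import math
import statistics

def box_cox(y, lam):
    """Assumed form of g(y, lambda): log(y) if lambda = 0, else (y**lambda - 1) / lambda."""
    if y <= 0:
        raise ValueError("the transformation is applied to positive intakes only")
    return math.log(y) if lam == 0 else (y ** lam - 1.0) / lam

def standardize_nonzero(recalls, lam):
    """Transform the non-zero recalls and rescale them to mean 0 and variance 2."""
    z = [box_cox(y, lam) for y in recalls if y > 0]
    mu = statistics.fmean(z)
    sigma = statistics.pstdev(z)  # assumption: population standard deviation
    q = [math.sqrt(2.0) * (v - mu) / sigma for v in z]
    return q, mu, sigma

# Hypothetical recalls for one food group; zeros are non-consumption days.
recalls = [0.0, 1.5, 0.0, 2.0, 4.0, 0.5]
q, mu, sigma = standardize_nonzero(recalls, lam=0.33)
```

The constants `mu` and `sigma` are returned because they are needed again when mapping results back to the original scale.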

#### A.2. Prior Distributions

Because the data were standardized, we used the following conventions.

- The priors for all *β*_{j} were normal with mean zero and variance 100.
- The prior for Σ_{u} was exchangeable with diagonal entries all equal to 1.0 and correlations all equal to 0.50. There were 21 degrees of freedom in the inverse Wishart prior, i.e., *m*_{u} = 21. Thus, the prior is IW{(*m*_{u} – 19 – 1)Σ_{u,prior}, *m*_{u}}. We experimented with this prior by using zero correlation, and the results were essentially unchanged.
- The prior for *r*_{k} is Uniform[-1, 1]. Set the initial value: *r*_{k} = 0, *k* = 1, . . . , 5.
- The prior for *θ*_{k} is Uniform[–*π, π*]. Set the initial value: *θ*_{k} = 0, *k* = 1, . . . , 25.
- The priors for *v*_{22}, *v*_{44}, . . . , *v*_{12,12} and *v*_{13,13}, . . . , *v*_{19,19} were Uniform[-3,3]. Set the initial values: *v*_{22} = *v*_{44} = . . . = *v*_{12,12} = *v*_{13,13} = . . . = *v*_{19,19} = 1.
- For the rest of the non-diagonal *v*_{ij}'s which could not be determined by the restrictions, we used Uniform[-3,3] priors. Set the initial values to be 0.

The constraints on Σ_{ε} are nonlinear, and our parameterization enforces them easily without having to have prior distributions for the original parameterization that satisfy the nonlinear constraints.

The key thing that makes things work well with the other components of the matrix *V* with Σ_{ε} = *VV*^{T} is that we have standardized the data as described in Section A.1. With this standardization, things become much nicer. For example, the variance of the *ε*'s for energy is $\sum_{j=1}^{19}v_{19,j}^{2}$. However, since the sample variance for energy is standardized to equal 2.0, we simply need to make the priors for *v*_{19,j} uniform on a modest range to have real flexibility.
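Two pieces of arithmetic in this section can be checked directly: the scale matrix of the inverse Wishart prior, and the variance implied by one row of *V*. The sketch below uses plain lists; the helper names and the example row of *V* are hypothetical, and it relies on the expectation Ω/(*m* – *J* – 1) stated in Section A.6.

```python
J = 19      # dimension of the covariance matrices
M_U = 21    # prior degrees of freedom m_u
RHO = 0.5   # prior correlation

def exchangeable(dim, rho):
    """Matrix with unit diagonal and a common off-diagonal correlation rho."""
    return [[1.0 if r == c else rho for c in range(dim)] for r in range(dim)]

sigma_u_prior = exchangeable(J, RHO)

# Scale matrix of the prior IW{(m_u - 19 - 1) * Sigma_u_prior, m_u}.
factor = M_U - J - 1
scale = [[factor * v for v in row] for row in sigma_u_prior]

# With expectation Omega / (m - J - 1), the prior mean is Sigma_u_prior itself.
prior_mean = [[v / factor for v in row] for row in scale]

def error_variance(v_row):
    """Diagonal entry of V V^T for one row of V: the sum of squared entries."""
    return sum(v * v for v in v_row)

# Hypothetical last row of V (energy): four entries of 0.5 and v_{19,19} = 1.
v_energy = [0.5, 0.5, 0.5, 0.5] + [0.0] * 14 + [1.0]
energy_var = error_variance(v_energy)  # 4 * 0.25 + 1 = 2.0
```

The example row shows how modest values of *v*_{19,j}, well inside [-3, 3], already reach the standardized variance of 2.0.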

#### A.3. Generating Starting Values for the Latent Variables

While we observe $\tilde{Q}_{ik}$, in the MCMC we need to generate starting values for the latent variables $\tilde{W}_{ik}=(W_{ijk})_{j=1}^{19}$ to initiate the MCMC.

- For nutrients and energy, *Q*_{ijk} = *W*_{ijk}, no data need be generated, *j* = 13, . . . , 19.
- For the amounts, *Q*_{i2k}, *Q*_{i4k}, *Q*_{i6k}, *Q*_{i8k}, *Q*_{i,10,k} and *Q*_{i,12,k}, we set *W*_{i2k} = *Q*_{i2k}, *W*_{i4k} = *Q*_{i4k}, *W*_{i6k} = *Q*_{i6k}, *W*_{i8k} = *Q*_{i8k}, *W*_{i,10,k} = *Q*_{i,10,k} and *W*_{i,12,k} = *Q*_{i,12,k}.
- For consumption, we generate *Ũ*_{i} as normally distributed with mean zero and covariance matrix given as the prior covariance matrix for Σ_{u}. For $\ell =1,\dots ,6$, we also compute $z_{ik}=\mid X_{i,2\ell -1,k}^{\mathrm{T}}\beta_{2\ell -1,\text{prior}}+U_{i,2\ell -1}+\mathcal{Z}_{ik}\mid $, where $\mathcal{Z}_{ik}=\text{Normal}(0,1)$ are generated independently. We then set $W_{i,2\ell -1,k}=z_{ik}Q_{i,2\ell -1,k}-z_{ik}(1-Q_{i,2\ell -1,k})$.
- Finally, we then updated $\tilde{W}_{ik}$ by a single application of the updates given in Appendix A.9.
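The starting value for a latent consumption variable is a random magnitude whose sign is set by the observed indicator. A minimal sketch for a single person, recall and food follows; the function name is hypothetical, and the linear predictor and random effect are passed in as plain numbers rather than generated from the prior.

```python
import random

def initial_latent_consumption(xb_prior, u, consumed, rng):
    """Return W = z*Q - z*(1 - Q) with z = |x'beta_prior + u + Normal(0, 1)|.

    consumed is the indicator Q (1 if the food was reported, 0 otherwise),
    so W is non-negative for consumers and non-positive for non-consumers.
    """
    z = abs(xb_prior + u + rng.gauss(0.0, 1.0))
    return z * consumed - z * (1 - consumed)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
w_consumed = initial_latent_consumption(0.3, -0.1, 1, rng)
w_not_consumed = initial_latent_consumption(0.3, -0.1, 0, rng)
```

Taking the absolute value first guarantees that the starting values already agree in sign with the observed indicators.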

#### A.4. Complete Data Loglikelihood

Let *J* = 19. The complete data include the indicators of whether a food was consumed, the *W* variables, and the random effect *U* variables. The loglikelihood of the complete data is

We used Gibbs sampling to update this complete data loglikelihood, the details for which are given in subsequent appendices. The weights *w*_{i} are integers and are used here in a pseudo-likelihood fashion. One can also think of this as expanding each individual into *w*_{i} individuals, each with the same observed data but different latent variables. For computational convenience, since we are only asking for a frequentist estimator and not doing full Bayesian inference, the latent variables in the process are generated once for each individual. Estimates of Σ_{u}, Σ_{ε} and *β*_{j} for *j* = 1, ..., *J* were computed as the means from the Gibbs samples. Once again, we emphasize that we are not doing a proper Bayesian analysis, but only using MCMC techniques to obtain a frequentist estimate, with uncertainty assessed using the frequentist BRR method.

#### A.5. Complete Conditionals for *r*_{q}, *θ*_{q} and *v*_{pq}

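Except for irrelevant constants, the complete conditional for *r*_{q} (*q* = 1, . . . , 5) is

$$[r_{q}\mid \text{rest}]\propto (1-r_{q}^{2})^{-\frac{1}{2}\sum_{i=1}^{n}w_{i}m_{i}}\times \exp\left[-\frac{1}{2}\sum_{i=1}^{n}w_{i}\sum_{k=1}^{m_{i}}\{\tilde{W}_{ik}-(X_{i1k}^{\mathrm{T}}\beta_{1},...,X_{i,19,k}^{\mathrm{T}}\beta_{19})^{\mathrm{T}}-\tilde{U}_{i}\}^{\mathrm{T}}\Sigma_{\epsilon}^{-1}(\bullet)\right],$$

where, for any *A*, $A^{\mathrm{T}}\Sigma_{\epsilon}^{-1}(\bullet)=A^{\mathrm{T}}\Sigma_{\epsilon}^{-1}A$. In each case Σ_{ε} on the right-hand side is evaluated at the value of the parameter being updated.

Except for irrelevant constants, the complete conditionals for *v*_{qq} (*q* = 2, 4, 6, 8, 10, 12, 13, . . . , 19) are

$$[v_{qq}\mid \text{rest}]\propto v_{qq}^{-\sum_{i=1}^{n}w_{i}m_{i}}\times \exp\left[-\frac{1}{2}\sum_{i=1}^{n}w_{i}\sum_{k=1}^{m_{i}}\{\tilde{W}_{ik}-(X_{i1k}^{\mathrm{T}}\beta_{1},...,X_{i,19,k}^{\mathrm{T}}\beta_{19})^{\mathrm{T}}-\tilde{U}_{i}\}^{\mathrm{T}}\Sigma_{\epsilon}^{-1}(\bullet)\right].$$

Except for irrelevant constants, the complete conditionals for *θ*_{q} (*q* = 1, . . . , 25) and non-diagonal free parameters *v*_{pq} are

$$\propto \exp\left[-\frac{1}{2}\sum_{i=1}^{n}w_{i}\sum_{k=1}^{m_{i}}\{\tilde{W}_{ik}-(X_{i1k}^{\mathrm{T}}\beta_{1},...,X_{i,19,k}^{\mathrm{T}}\beta_{19})^{\mathrm{T}}-\tilde{U}_{i}\}^{\mathrm{T}}\Sigma_{\epsilon}^{-1}(\bullet)\right].$$

The full conditionals do not have an explicit form, so we use Metropolis-Hastings steps within the Gibbs sampler to generate from them.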

- *r*_{q} (*q* = 1, . . . , 5)
  - We discretize the values of *r*_{q} to the set {–0.99 + 2 × 0.99(*j* – 1)/(*M* – 1)}, where *j* = 1, ..., *M* and we choose *M* = 41.
  - Proposal: The current value is *r*_{q,t}. The proposed value of *r*_{q,t+1} is selected randomly from the current value and the two nearest neighbors of *r*_{q,t}. Then *r*_{q,t+1} is accepted with probability min{1, *g*(*r*_{q,t+1})/*g*(*r*_{q,t})}, where

    $$g(y)\propto (1-y^{2})^{-\frac{1}{2}\sum_{i=1}^{n}w_{i}m_{i}}\times \exp\left[-\frac{1}{2}\sum_{i=1}^{n}w_{i}\sum_{k=1}^{m_{i}}\{\tilde{W}_{ik}-(X_{i1k}^{\mathrm{T}}\beta_{1},...,X_{i,19,k}^{\mathrm{T}}\beta_{19})^{\mathrm{T}}-\tilde{U}_{i}\}^{\mathrm{T}}\Sigma_{\epsilon}^{-1}(\bullet)\right],$$

    where here and in what follows, for any *A*, $A^{\mathrm{T}}\Sigma_{\epsilon}^{-1}(\bullet)=A^{\mathrm{T}}\Sigma_{\epsilon}^{-1}A$.
- *θ*_{q} (*q* = 1, . . . , 25)
  - We discretize similarly as above.
  - Proposal: The current value is *θ*_{q,t}. The proposed value *θ*_{q,t+1} is selected randomly from the current value and the two nearest neighbors of *θ*_{q,t}. Then *θ*_{q,t+1} is accepted with probability min{1, *g*(*θ*_{q,t+1})/*g*(*θ*_{q,t})}, where

    $$g(y)\propto \exp\left[-\frac{1}{2}\sum_{i=1}^{n}w_{i}\sum_{k=1}^{m_{i}}\{\tilde{W}_{ik}-(X_{i1k}^{\mathrm{T}}\beta_{1},...,X_{i,19,k}^{\mathrm{T}}\beta_{19})^{\mathrm{T}}-\tilde{U}_{i}\}^{\mathrm{T}}\Sigma_{\epsilon}^{-1}(\bullet)\right].$$
- *v*_{qq} (*q* = 2, 4, 6, 8, 10, 12, 13, . . . , 19)
  - Proposal: The current value is *v*_{qq,t}. A candidate *v*_{qq,t+1} is generated from the Uniform distribution of length 0.4 with mean *v*_{qq,t}. The candidate value *v*_{qq,t+1} is accepted with probability min{1, *g*(*v*_{qq,t+1})/*g*(*v*_{qq,t})}, where

    $$g(y)\propto y^{-\sum_{i=1}^{n}w_{i}m_{i}}\times \exp\left[-\frac{1}{2}\sum_{i=1}^{n}w_{i}\sum_{k=1}^{m_{i}}\{\tilde{W}_{ik}-(X_{i1k}^{\mathrm{T}}\beta_{1},...,X_{i,19,k}^{\mathrm{T}}\beta_{19})^{\mathrm{T}}-\tilde{U}_{i}\}^{\mathrm{T}}\Sigma_{\epsilon}^{-1}(\bullet)\right].$$
- Non-diagonal free parameters *v*_{pq}
  - Proposal: The current value is *v*_{pq,t}. The candidate value *v*_{pq,t+1} is generated from the Uniform distribution of length 0.4 with mean *v*_{pq,t}. The candidate value is accepted with probability min{1, *g*(*v*_{pq,t+1})/*g*(*v*_{pq,t})}, where

    $$g(y)\propto \exp\left[-\frac{1}{2}\sum_{i=1}^{n}w_{i}\sum_{k=1}^{m_{i}}\{\tilde{W}_{ik}-(X_{i1k}^{\mathrm{T}}\beta_{1},...,X_{i,19,k}^{\mathrm{T}}\beta_{19})^{\mathrm{T}}-\tilde{U}_{i}\}^{\mathrm{T}}\Sigma_{\epsilon}^{-1}(\bullet)\right].$$
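The two kinds of Metropolis-Hastings proposals described in this section, a nearest-neighbor move on a grid and a uniform random walk of length 0.4, can be sketched as follows. This is not the authors' Matlab code. The target `toy_log_g` is a stand-in for log *g*, and rejecting proposals that fall off the grid or outside [-3, 3] is an assumption about how the boundaries are handled.

```python
import math
import random

def make_grid(m=41, bound=0.99):
    """Grid {-bound + 2*bound*(j - 1)/(m - 1)}, j = 1, ..., m."""
    return [-bound + 2.0 * bound * (j - 1) / (m - 1) for j in range(1, m + 1)]

def accept(log_g_new, log_g_old, rng):
    """Accept with probability min{1, g(new)/g(old)}, computed on the log scale."""
    u = 1.0 - rng.random()  # lies in (0, 1], so log(u) is finite
    return math.log(u) <= log_g_new - log_g_old

def grid_step(idx, grid, log_g, rng):
    """Propose the current grid point or one of its two nearest neighbors."""
    cand = idx + rng.choice((-1, 0, 1))
    if cand < 0 or cand >= len(grid):
        return idx  # assumption: proposals off the grid are rejected
    return cand if accept(log_g(grid[cand]), log_g(grid[idx]), rng) else idx

def uniform_step(current, log_g, rng, length=0.4, lower=-3.0, upper=3.0):
    """Propose from a Uniform distribution of the given length centered at current."""
    cand = current + rng.uniform(-length / 2.0, length / 2.0)
    if not lower <= cand <= upper:
        return current  # outside the support of the Uniform[-3, 3] prior
    return cand if accept(log_g(cand), log_g(current), rng) else current

def toy_log_g(y):
    """Hypothetical log target (normal shape, center 0.3, spread 0.2)."""
    return -0.5 * ((y - 0.3) / 0.2) ** 2

rng = random.Random(2011)
grid = make_grid()
idx, grid_path = 20, []  # index 20 is the middle grid point, value 0
for _ in range(5000):
    idx = grid_step(idx, grid, toy_log_g, rng)
    grid_path.append(grid[idx])

v, v_path = 1.0, []
for _ in range(5000):
    v = uniform_step(v, toy_log_g, rng)
    v_path.append(v)
```

Both proposals are symmetric, which is why the acceptance probability reduces to the ratio of *g* values given in the text.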

#### A.6. Complete Conditionals for Σ_{u}

The dimension of the covariance matrices is *J* = 19. By inspection, the complete conditional for Σ_{u} is

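where IW denotes the Inverse-Wishart distribution. Up to a normalizing constant, the density of IW(Ω, *m*) for a *J × J* random variable Σ is $f(\Sigma \mid \Omega ,m)\propto \mid \Sigma \mid ^{-(m+J+1)/2}\exp\{-\frac{1}{2}\mathrm{tr}(\Omega \Sigma^{-1})\}$.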

This has expectation Ω/(*m* – *J* – 1).

#### A.7. Complete Conditionals for *β*

Let the elements of $\Sigma_{\epsilon}^{-1}$ be $\sigma_{\epsilon}^{j\ell}$. For any *j*, except for irrelevant constants,

which implies $[{\beta}_{j}\mid \text{rest}]=\text{Normal}({\mathcal{C}}_{2}{\mathcal{C}}_{1},{\mathcal{C}}_{2})$, where

#### A.8. Complete Conditionals for *Ũ*_{i}

The NHANES 2001-2004 weights are integers, representing the number of children that each sampled child represents. Thus, as described therein, the loglikelihood in Section A.4 could also be rewritten equivalently by developing *w*_{i} pseudo-children, each with the same observed data values. It thus does not make sense to use the weights to generate an individual *Ũ*_{i}. Instead, as described in Section A.4, for computational convenience for generating a *Ũ*_{i} to represent *w*_{i} children, we set the weight for that child temporarily = 1.0. Then, except for irrelevant constants,

Remembering that for purposes of this section we are setting *w*_{i} = 1.0, this implies that $[\tilde{U}_{i}\mid \text{rest}]=\text{Normal}(\mathcal{C}_{2}\mathcal{C}_{1},\mathcal{C}_{2})$, where
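As a hedged sketch of what such a normal update looks like, suppose (an assumption inferred from the form of *g* in Section A.5, not a formula taken from this section) that $\tilde{W}_{ik}=(X_{i1k}^{\mathrm{T}}\beta_{1},...,X_{i,19,k}^{\mathrm{T}}\beta_{19})^{\mathrm{T}}+\tilde{U}_{i}+\tilde{\epsilon}_{ik}$ with $\tilde{\epsilon}_{ik}\sim \text{Normal}(0,\Sigma_{\epsilon})$ and $\tilde{U}_{i}\sim \text{Normal}(0,\Sigma_{u})$. Writing $R_{ik}$ (a symbol introduced here only for illustration) for the residual $\tilde{W}_{ik}-(X_{i1k}^{\mathrm{T}}\beta_{1},...,X_{i,19,k}^{\mathrm{T}}\beta_{19})^{\mathrm{T}}$, completing the square with the weight set to 1.0 gives

$$[\tilde{U}_{i}\mid \text{rest}]\propto \exp\left[-\frac{1}{2}\sum_{k=1}^{m_{i}}(R_{ik}-\tilde{U}_{i})^{\mathrm{T}}\Sigma_{\epsilon}^{-1}(R_{ik}-\tilde{U}_{i})-\frac{1}{2}\tilde{U}_{i}^{\mathrm{T}}\Sigma_{u}^{-1}\tilde{U}_{i}\right],$$

$$\mathcal{C}_{2}=(m_{i}\Sigma_{\epsilon}^{-1}+\Sigma_{u}^{-1})^{-1},\qquad \mathcal{C}_{1}=\Sigma_{\epsilon}^{-1}\sum_{k=1}^{m_{i}}R_{ik}.$$

In the scalar case with $\Sigma_{\epsilon}=\Sigma_{u}=1$, two recalls and residuals 1 and 3, this gives $\mathcal{C}_{2}=1/3$ and a conditional mean of $\mathcal{C}_{2}\mathcal{C}_{1}=4/3$, which shrinks the average residual of 2 toward zero.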

#### A.9. Complete Conditional for ${W}_{i\ell k}$, $\ell =1$, 3, 5, 7, 9, 11

Here we do the complete conditional for ${W}_{i\ell k}$ with $\ell =1$, 3, 5, 7, 9, 11. Except for irrelevant constants,

where, using the convention of Section A.8,

If we use the notation TN_{+}(*μ, σ, c*) for a normal random variable with mean *μ* and standard deviation *σ* that is truncated from the left at *c*, and similarly use TN_{–}(*μ, σ, c*) when truncation is from the right at *c*, then it follows that with $\mu =\mathcal{C}_{2}\mathcal{C}_{1}$ and $\sigma =\mathcal{C}_{2}^{1/2}$,

Generating TN_{+}(0, 1, *c*) is easy: if *c* < 0, simply do rejection sampling of a Normal(0, 1) until you get one that is > *c*. If *c* > 0, there is an adaptive rejection scheme (Robert, 1995).
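The truncated normal draws can be sketched as below. For *c* > 0 the code uses a translated-exponential rejection sampler, which is one common form of the scheme attributed to Robert (1995); whether it matches the authors' implementation in every detail is an assumption. The function names are hypothetical.

```python
import math
import random

def tn_plus_std(c, rng):
    """Draw from Normal(0, 1) truncated from the left at c."""
    if c <= 0.0:
        # Plain rejection: at least half of all standard normal draws exceed c.
        while True:
            z = rng.gauss(0.0, 1.0)
            if z > c:
                return z
    # For c > 0, propose c + Exponential(alpha) and accept with exp(-(z - alpha)^2 / 2).
    alpha = (c + math.sqrt(c * c + 4.0)) / 2.0
    while True:
        z = c + rng.expovariate(alpha)
        if rng.random() <= math.exp(-0.5 * (z - alpha) ** 2):
            return z

def tn_plus(mu, sigma, c, rng):
    """TN_+(mu, sigma, c): Normal(mu, sigma^2) truncated from the left at c."""
    return mu + sigma * tn_plus_std((c - mu) / sigma, rng)

def tn_minus(mu, sigma, c, rng):
    """TN_-(mu, sigma, c): truncated from the right at c, obtained by reflection."""
    return -tn_plus(-mu, sigma, -c, rng)

rng = random.Random(1)
left_draws = [tn_plus(0.0, 1.0, 2.5, rng) for _ in range(1000)]
right_draws = [tn_minus(0.0, 1.0, -1.0, rng) for _ in range(1000)]
```

The exponential proposal matters when *c* is far in the right tail, where plain rejection from Normal(0, 1) would almost never succeed.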

#### A.10. Complete Conditionals for *W*_{i2k}, *W*_{i4k}, *W*_{i6k}, *W*_{i8k}, *W*_{i,10,k} and *W*_{i,12,k} When Not Observed

For *p* = 2, 4, 6, 8, 10, 12, the variable *W*_{ipk} is not observed when *Q*_{i,p–1,k} = 0, or, equivalently, when *W*_{i,p–1,k} < 0. Except for irrelevant constants,

where, using the convention of Section A.8,

Therefore,

#### A.11. Usual Intake, Standardization and Transformation

Here we present detailed formulas for functions defined in Section 3.4. When *λ* = 0, the back-transformation is

When *λ* ≠ 0, the back-transformation is

#### A.12. Transformation Estimation

As part of an earlier project (Freedman, et al., 2009), we estimated the transformations for one food/nutrient at a time using the method of Kipnis, et al. (2009), both for the data and also for each BRR weighted data set. To facilitate comparison with the one food/nutrient at a time analysis, in our analysis of all HEI-2005 components, we used these transformations as well. Of course, our methods can be generalized to allow for estimation of the transformations as well. By allowing a different transformation for each BRR weighted data set, we have captured the variation due to estimation of the transformations.

## SUPPLEMENTARY MATERIAL

Included in the supplementary materials (Zhang, et al., 2011) are (a) additional tables in a pdf file; (b) data files of the NHANES data used in the analysis; and (c) Matlab programs for the data analysis.

## Contributor Information

Saijuan Zhang, Department of Statistics Texas A&M University 3143 TAMU College Station, Texas 77843-3143 U.S.A.

Douglas Midthune, Biometry Research Group Division of Cancer Prevention National Cancer Institute 6130 Executive Boulevard EPN-3131 Bethesda, Maryland 20892-7354 U.S.A.

Patricia M. Guenther, Center for Nutrition Policy and Promotion U.S. Department of Agriculture 3101 Park Center Drive, Ste. 1034 Alexandria, Virginia 22302 U.S.A. Email: Patricia.Guenther@cnpp.usda.gov.

Susan M. Krebs-Smith, Applied Research Program Division of Cancer Control and Population Sciences National Cancer Institute 6130 Executive Boulevard, EPN-4005 Bethesda, Maryland 20892, U.S.A. Email: krebssms@mail.nih.gov.

Victor Kipnis, Biometry Research Group Division of Cancer Prevention National Cancer Institute 6130 Executive Boulevard EPN-3131 Bethesda, Maryland 20892-7354 U.S.A.

Kevin W. Dodd, Biometry Research Group Division of Cancer Prevention National Cancer Institute 6130 Executive Boulevard EPN-3131 Bethesda, Maryland 20892-7354 U.S.A.

Dennis W. Buckman, Information Management Services, Inc. 12501 Prosperity Drive Silver Spring, Maryland 20904, U.S.A. Email: BuckmanD@imsweb.com.

Janet A. Tooze, Department of Biostatistical Sciences Wake Forest University, School of Medicine Medical Center Boulevard Winston-Salem, North Carolina 27157, U.S.A. Email: jtooze@wfubmc.edu.

Laurence Freedman, Gertner Institute for Epidemiology and Health Policy Research Sheba Medical Center Tel Hashomer 52161, Israel. Email: lsf@actcom.co.il.

Raymond J. Carroll, Department of Statistics Texas A&M University 3143 TAMU College Station, Texas 77843-3143 U.S.A.

## REFERENCES

- Buonaccorsi J. Measurement Error: Models, Methods and Applications. Chapman and Hall/CVRC Press; 2010.
- Carriquiry AL. Assessing the prevalence of nutrient inadequacy. Public Health Nutrition. 1999;2:23–33. [PubMed]
- Carriquiry AL. Estimation of usual intake distributions of nutrients and foods. Journal of Nutrition. 2003;133:601–608. [PubMed]
- Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement Error in Nonlinear Models: A Modern Perspective. Second Edition Chapman and Hall CRC Press; 2006.
- Delaigle A. An alternative view of the deconvolution problem. Statistica Sinica. 2008;18:1025–1045.
- Delaigle A, Hall P. Estimation of observation-error variance in errors-in-variables regression. Statistica Sinica. 2010 to appear.
- Delaigle A, Hall P. Using SIMEX for smoothing-parameter choice in errors-in-variables problems. Journal of the American Statistical Association. 2008;103:280–287.
- Delaigle A, Hall P, Meister A. On deconvolution with repeated measurements. Annals of Statistics. 2008;36:665–685.
- Delaigle A, Meister A. Density estimation with heteroscedastic error. Bernoulli. 2008;14:562–579.
- Ferrari P, Roddam A, Fahey MT, Jenab M, Bamia C, Ock M, Amiano P, Hjartker A, Biessy C, Rinaldi S, Huybrechts I, Tjnneland A, Dethlefsen C, Niravong M, Clavel-Chapelon F, Linseisen J, Boeing H, Oikonomou E, Orfanos P, Palli D, Santucci de Magistris M, Bueno-de-Mesquita HB, Peeters PH, Parr CL, Braaten T, Dorronsoro M, Berenguer T, Gullberg B, Johansson I, Welch AA, Riboli E, Bingham S, Slimani N. A bivariate measurement error model for nitrogen and potassium intakes to evaluate the performance of regression calibration in the European Prospective Investigation into Cancer and Nutrition study. European Journal of Clinical Nutrition. 2009;63(Supplement 4):S179–187. [PubMed]
- Flegal JM, Haran M, Jones GL. Markov Chain Monte Carlo: can we trust the third significant figure? Statistical Science. 2008;23:250–260.
- Fraser GE, Shavlik DJ. Correlations between estimated and true dietary intakes. Annals of Epidemiology. 2004;14:287–95. [PubMed]
- Freedman LS, Guenther PM, Krebs-Smith SM, Dodd KW, Midthune D. A population's distribution of Healthy Eating Index-2005 component scores can be estimated when more than one 24-hour recall is available. Journal of Nutrition. 2010;140:1529–1534. [PMC free article] [PubMed]
- Fuller WA. Measurement Error Models. Wiley; New York: 1987.
- Fungwe T, Guenther PM, Juan WY, Hiza H, Lino M. Nutrition Insight. Vol. 43. USDA Center for Nutrition Policy and Promotion; 2009. The quality of children's diets in 2003-04 as measured by the Healthy Eating Index-2005.
- Guolo A. A flexible approach to measurement error correction in casecontrol studies. Biometrics. 2008;64:1207–1214. [PubMed]
- Gustafson P. Measurement Error and Misclassi cation in Statistics and Epidemiology: Impacts and Bayesian Adjustments. Chapman and Hall/CRC Press; 2003.
- Guenther PM, Reedy J, Krebs-Smith SM. Development of the Healthy Eating Index-2005. Journal of the American Dietetic Association. 2008a;108:1896–1901. [PubMed]
- Guenther PM, Reedy J, Krebs-Smith SM, Reeve BB. Evaluation of the Healthy Eating Index-2005. Journal of the American Dietetic Association. 2008b;108:1854–1864. [PubMed]
- Kipnis V, Midthune D, Buckman DW, Dodd KW, Guenther PM, Krebs-Smith SM, Subar AF, Tooze JA, Carroll RJ, Freedman LS. Modeling data with excess zeros and measurement error: application to evaluating relationships between episodically consumed foods and health outcomes. Biometrics. 2009;65:1003–1010. [PMC free article] [PubMed]
- Kipnis V, Freedman LS, Carroll RJ, Midthune D. A measurement error model for episodically consumed foods and energy. 2010. Preprint.
- Kott PS, Guenther PM, Wagstaff DA, Juan WY, Kranz S. Fitting a linear model to survey data when the long-term average daily intake of a dietary component is an explanatory variable. Survey Research Methods. 2009;3(3):157–165.
- Küchenhoff H, Mwalili SM, Lesaffre E. A general method for dealing with misclassification in regression: The misclassification SIMEX. Biometrics. 2006;62:85–96. [PubMed]
- Lehmann EL, Casella G. Theory of Point Estimation. Springer; New York: 1998.
- Liang H, Thurston S, Ruppert D, Apanasovich T, Hauser R. Additive partial linear models with measurement errors. Biometrika. 2008;95:667–678.
- Messer K, Natarajan L. Maximum likelihood, multiple imputation and regression calibration for measurement error adjustment. Statistics in Medicine. 2008;27:6332–6350. [PMC free article] [PubMed]
- Natarajan L. Regression Calibration for Dichotomized Mismeasured Predictors. International Journal of Biostatistics. 2009;5:nihpa121098. [PMC free article] [PubMed]
- Nusser SM, Carriquiry AL, Dodd KW, Fuller WA. A semiparametric approach to estimating usual intake distributions. Journal of the American Statistical Association. 1996;91:1440–1449.
- Nusser SM, Fuller WA, Guenther PM. Estimating usual dietary intake distributions: Adjusting for measurement error and non-normality in 24-hour food intake data. In: Lyberg L, Biemer P, Collins M, Deleeuw E, Dippo C, Schwartz N, Trewin D, editors. Survey Measurement and Process Quality. Wiley; New York: 1997. pp. 670–689.
- Prentice RL. Measurement error and results from analytic epidemiology: dietary fat and breast cancer. Journal of the National Cancer Institute. 1996;88:1738–47. [PubMed]
- Prentice RL. Dietary assessment and the reliability of nutritional epidemiology reports. Lancet. 2003;362:182–183. [PubMed]
- Staudenmayer J, Ruppert D, Buonaccorsi JP. Density estimation in the presence of heteroskedastic measurement error. Journal of the American Statistical Association. 2008;103:726–736.
- Tooze JA, Grunwald GK, Jones RH. Analysis of repeated measures data clumping at zero. Statistical Methods in Medical Research. 2002;11:341–355. [PubMed]
- Tooze JA, Midthune D, Dodd KW, Freedman LS, Krebs-Smith SM, Subar AF, Guenther PM, Carroll RJ, Kipnis V. A new statistical method for estimating the distribution of usual intake of episodically consumed foods. Journal of the American Dietetic Association. 2006;106:1575–1587. [PMC free article] [PubMed]
- Wand MP. Finite sample performance of deconvolving kernel density estimators. Statistics and Probability Letters. 1998;37:131–139.
- Wolter KM. Introduction to Variance Estimation. Springer-Verlag; New York: 1995.
- Zhang S, Midthune D, Pérez A, Buckman DW, Kipnis V, Freedman LS, Dodd KW, Krebs-Smith SM, Carroll RJ. A bivariate measurement error model for episodically consumed dietary components. 2010. [PMC free article] [PubMed]
- Zhang S, Midthune D, Guenther PM, Krebs-Smith SM, Kipnis V, Dodd KW, Buckman DW, Tooze JA, Freedman LS, Carroll RJ. Supplement to “A new multivariate measurement error model with zero-inflated dietary data, and its application to dietary assessment”. 2011. [PMC free article] [PubMed]

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (1.3M) |
- Citation

- Estimating the Distribution of Dietary Consumption Patterns.[Stat Sci. 2014]
*Carroll RJ.**Stat Sci. 2014; 29(1):2-8.* - Fitting a bivariate measurement error model for episodically consumed dietary components.[Int J Biostat. 2011]
*Zhang S, Krebs-Smith SM, Midthune D, Perez A, Buckman DW, Kipnis V, Freedman LS, Dodd KW, Carroll RJ.**Int J Biostat. 2011; 7(1):1. Epub 2011 Jan 6.* - A population's distribution of Healthy Eating Index-2005 component scores can be estimated when more than one 24-hour recall is available.[J Nutr. 2010]
*Freedman LS, Guenther PM, Krebs-Smith SM, Dodd KW, Midthune D.**J Nutr. 2010 Aug; 140(8):1529-34. Epub 2010 Jun 23.* - Statistical methods for estimating usual intake of nutrients and foods: a review of the theory.[J Am Diet Assoc. 2006]
*Dodd KW, Guenther PM, Freedman LS, Subar AF, Kipnis V, Midthune D, Tooze JA, Krebs-Smith SM.**J Am Diet Assoc. 2006 Oct; 106(10):1640-50.* - Estimating energy and nutrient intakes in studies of human fertility.[J Biosoc Sci. 1992]
*Ulijaszek SJ.**J Biosoc Sci. 1992 Jul; 24(3):335-45.*

- Use of Two-Part Regression Calibration Model to Correct for Measurement Error in Episodically Consumed Foods in a Single-Replicate Study Design: EPIC Case Study[PLoS ONE. ]
*Agogo GO, van der Voet H, Veer PV, Ferrari P, Leenders M, Muller DC, Sánchez-Cantalejo E, Bamia C, Braaten T, Knüppel S, Johansson I, van Eeuwijk FA, Boshuizen H.**PLoS ONE. 9(11)e113160* - Daughters and Mothers Against Breast Cancer (DAMES): Main outcomes of a randomized controlled trial of weight loss in overweight mothers with breast cancer and their overweight daughters[Cancer. 2014]
*Demark-Wahnefried W PhD, RD, Jones LW PhD, Snyder DC MS, RD, Sloane RJ MPH, Kimmick GG MD, MS, Hughes DC PhD, Badr HJ PhD, Miller PE PhD, RD, Burke LE PhD, Lipkus IM PhD.**Cancer. 2014 Aug 15; 120(16)2522-2534* - Estimating the Distribution of Dietary Consumption Patterns[Statistical science : a review journal of t...]
*Carroll RJ.**Statistical science : a review journal of the Institute of Mathematical Statistics. 2014; 29(1)2-8* - The Healthy Eating Index-2010 Is a Valid and Reliable Measure of Diet Quality According to the 2010 Dietary Guidelines for Americans[The Journal of Nutrition. 2014]
*Guenther PM, Kirkpatrick SI, Reedy J, Krebs-Smith SM, Buckman DW, Dodd KW, Casavale KO, Carroll RJ.**The Journal of Nutrition. 2014 Mar; 144(3)399-407* - Using surrogate biomarkers to improve measurement error models in nutritional epidemiology[Statistics in Medicine. 2013]
*Keogh RH, White IR, Rodwell SA.**Statistics in Medicine. 2013 Sep 30; 32(22)3838-3861*

- PubMedPubMedPubMed citations for these articles

- A NEW MULTIVARIATE MEASUREMENT ERROR MODEL WITH ZERO-INFLATED DIETARY DATA, AND ITS APPLICATION TO DIETARY ASSESSMENT. NIHPA Author Manuscripts. 2011 Jun 1; 5(2B):1456.
