# A multiple imputation approach to disclosure limitation for high-age individuals in longitudinal studies

## Abstract

Disclosure limitation is an important consideration in the release of public use data sets. It is particularly challenging for longitudinal data sets, since information about an individual accumulates with repeated measures over time. Research on disclosure limitation methods for longitudinal data has been very limited. We consider here problems created by high ages in cohort studies. Because of the risk of disclosure, ages of very old respondents can often not be released; in particular this is a specific stipulation of the Health Insurance Portability and Accountability Act (HIPAA) for the release of health data for individuals. Top-coding of individuals beyond a certain age is a standard way of dealing with this issue, and it may be adequate for cross-sectional data, when a modest number of cases are affected. However, this approach leads to serious loss of information in longitudinal studies when individuals have been followed for many years. We propose and evaluate an alternative to top-coding for this situation based on multiple imputation (MI). This MI method is applied to a survival analysis of simulated data, and data from the Charleston Heart Study (CHS), and is shown to work well in preserving the relationship between hazard and covariates.

**Keywords:** confidentiality, disclosure protection, longitudinal data, multiple imputation, survival analysis

## 1 Introduction

Statistical disclosure control (SDC) procedures deliberately alter data collected by statistical agencies before release to the public, to prevent the identity of survey respondents from being revealed. These methods have increased in importance, with the extensive use of computers and the internet. The goal of SDC methods is to reduce the risk of disclosure to acceptable levels, while releasing a dataset that provides as much useful information as possible for researchers. One aspect of this is the ability to draw valid statistical inferences from the altered data.

Top-coding is a simple and common SDC method that seeks to prevent disclosure on the basis of extreme values of a variable, by censoring values above a pre-chosen “top-code”. For example, in surveys that include income, extremely high income values are considered to be sensitive and to have the potential to reveal the identity of respondents. By recoding income values greater than a selected “top-code” value to that value, the disclosure risk of respondents with very high income is reduced.

It is left to the analyst to decide how top-coded data are analyzed. One approach is to categorize the variable so that top-coded cases all fall in one category – this is sensible, but does not work for analyses that treat the variable as continuous. Another approach is to ignore the top-coding and treat the top-coded values as the truth. This method is straightforward, but clearly the data distribution is distorted and biased estimates will be obtained. A better method is to treat the extreme values as censored. Under an assumed statistical model, maximum likelihood (ML) estimates can be obtained using algorithms such as the Expectation-Maximization (EM) algorithm [1]. This method is model-based, and should yield good inferences if the model is correctly specified. But we expect this method to be quite sensitive to model misspecification, especially when the upper tail of the assumed distribution differs markedly from that of the true distribution. The data users can also apply an imputation method to the top-coded dataset and fill in the censored values. A limitation is that the imputed data fail to reflect imputation uncertainty, and imputations are sensitive to assumptions about the right tail of the distribution. An and Little [2] propose an alternative to top-coding based on multiple imputation (MI), which allows valid inferences to be created based on applying multiple imputation combining rules described by Reiter [3], while preserving the SDC benefits of top-coding; for other discussions of MI in the disclosure control setting, see Little [4]; Rubin [5]; Raghunathan, Reiter and Rubin [6]; Little, Liu and Raghunathan [7]; Reiter [8, 9, 10]. The methods in An and Little [2] are extended to handle covariate information in An and Little (2007, unpublished).

We propose here MI for disclosure control in the context of the treatment of age in longitudinal data sets. Because of the risk of disclosure, ages of very old respondents can often not be released; in particular this is a specific stipulation of HIPAA regulations [11, 12] for the release of health data for individuals. Top-coding of individuals beyond a certain age (say 80) is a standard way of dealing with this issue, and it may be adequate for cross-sectional data, since the number of cases affected may be modest. However, this approach has severe limitations in longitudinal studies, when individuals have been in the study for many years; for example, consider an individual in a 40-year longitudinal study, who enters the study at age 42 at time *t* and is still in the study at age 82 at time *t*+40. The age at time *t*+40 cannot simply be replaced by a top code of 80, since age at time *t*+40 can be inferred by simply adding 40 to the age at time *t*. A strict application of top-coding would replace all individuals aged 40 or older at time *t* by a top code of 40, but this strategy seriously limits the ability to do longitudinal analysis, particularly survival analyses where chronological age is a key variable of interest. In particular, since age at entry is a marker for cohorts, differences in outcomes between cohorts aged 40 or greater at entry can no longer be estimated, since these cohorts are all top-coded to the same value.
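The arithmetic behind strict top-coding can be illustrated with a minimal sketch. The function name `top_code_ages` and its parameters are hypothetical, chosen only for this illustration; the numbers are those of the 40-year example above.

```python
def top_code_ages(entry_age, final_age, top_code, study_length):
    """Strict top-coding of both age variables in a longitudinal study.

    Final age is capped at `top_code`; entry age is capped at
    `top_code - study_length`, so that adding the study length to the
    released entry age can never reveal an age above the top code.
    """
    entry_cap = top_code - study_length
    return min(entry_age, entry_cap), min(final_age, top_code)


# The individual from the text: enters at 42 and is still in the study at 82.
released = top_code_ages(42, 82, top_code=80, study_length=40)

# A member of a different entry cohort (entry at 48) gets the same released values,
# which is why cohort differences can no longer be estimated.
other_cohort = top_code_ages(48, 88, top_code=80, study_length=40)

print(released, other_cohort)
```

Both calls return the same pair of capped ages, which shows how distinct entry cohorts collapse into one released value.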

This problem arises in the Charleston Heart Study [13], a longitudinal study that collected data over 40 years (1960-2000). The study was originally conducted to understand the natural aging process in a community-based cohort. The data include baseline characteristics such as age, race, gender, occupation, and education, as well as death information for respondents. For longitudinal data from this study to be included in the National Archive of Computerized Data on Aging (NACDA), the gerontological data archive at the University of Michigan, individual ages beyond age 80 cannot be disclosed because of HIPAA regulations, given the geographic specificity of the respondents. Also, given the longitudinal nature of the data, a top-coding approach would need to be applied to all individuals aged 40 or older in 1960, which has the limitation discussed above.

The goal of this research is to develop MI methods that adequately limit disclosure risk and preserve the relationship between hazard and covariates in survival analysis. We propose a non-parametric MI method, specifically a stratified hot-deck procedure, where we create strata and draw deleted ages with replacement from each stratum. Our method multiply imputes values of two age variables – entry age and final age (age at death or age at last contact).

To assess the proposed method, we apply a proportional hazard (PH) model to the multiply-imputed datasets, calculate estimates of regression coefficients for putative risk factors, and compare these estimates, and corresponding estimates from top-coded data, with estimates from the PH model applied to the original data prior to SDC. We also present simulation studies where data are simulated according to a known survival model, and inferences for parameters of this model are compared with the true values.

The rest of this paper is organized as follows. Section 2 presents our SDC approaches for longitudinal data and describes corresponding methods of inference for regression coefficients. Section 3 describes a simulation study to evaluate the approaches in Section 2, and Section 4 applies the methods to CHS data. Section 5 gives discussion and future work.

## 2 Methods

### 2.1 SDC methods for longitudinal data

An and Little [2] propose SDC methods for a single variable with extreme values. In this paper, we investigate a more complicated situation with longitudinal data, where two age variables are subject to top-coding.

Let *Y*_{end} denote participants’ age at the end of study (referred to as final age) and *Y*_{start} denote their entry age. Let *C* be the censoring indicator. Let *L* represent the length of study and *S* denote survival time. Individuals with *S* ≥ *L* are treated as censored (*C* = 1), and otherwise as having died (*C* = 0). We consider individuals with values of *Y*_{end} greater than a particular value *y*_{0} to be at risk of disclosure, and refer to these individuals as sensitive cases. Thus values of *Y*_{end} and *Y*_{start} of the sensitive cases are treated as sensitive values. We consider the following SDC approaches:

1. **Top-coding**. Replace values of *Y*_{end} greater than *y*_{0} by *y*_{0} and replace values of *Y*_{start} greater than *y*_{0} – *L* by *y*_{0} – *L*. The resulting dataset is referred to as “top-coded” data.
2. **Hot-deck MI (HDMI)**. Classify sensitive and non-sensitive values into strata, to be defined below. Then delete the values of *Y*_{end}, *Y*_{start}, and *C* for sensitive cases and replace them with random draws from the set of deleted values in the same stratum. Our stratified HDMI method is similar to the approach described in An and Little (2007, unpublished), where we assign the deleted data into strata based on predicted values of either age variable from regression on other variables, and apply HDMI within each stratum to impute deleted values. The following choices of strata are considered here:
    - (i) **HD1**. Strata are defined by predicted values of the logarithm of hazard computed from a proportional hazards model for survival. This choice is motivated by the idea that if survival analysis is a primary analysis involving the imputed age data, imputing within strata defined by predicted hazard will minimize distortions of survival analysis applied to the imputed data.
    - (ii) **HD2**. We develop a two-way stratification, where strata are defined by both predicted values of the logarithm of hazard as in (i), and predicted values of entry-age from the regression of entry-age on other variables involved.
    - (iii) **HD3**. Stratification depends on the value of *C*. For individuals that are censored, strata are defined by predicted values of entry-age; and for those not censored, strata are defined by both predicted values of the logarithm of hazard and predicted values of entry-age as in (ii).
    - (iv) **HDU**. Unstratified HDMI, which we include as a baseline for comparison with the stratified methods.
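A single stratified hot-deck imputation can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function `stratified_hot_deck` is hypothetical, the stratification scores (for example predicted log-hazards) are assumed to have been computed beforehand, and strata are formed simply by sorting on one score and cutting into roughly equal groups. It jointly imputes the triple (*Y*_{start}, *Y*_{end}, *C*), as in a one-way stratification; a variant that retains *C* would need separate handling.

```python
import random


def stratified_hot_deck(records, scores, stratum_size, rng):
    """Return one hot-deck imputation of the sensitive cases.

    records:      list of (y_start, y_end, c) tuples for the sensitive cases.
    scores:       one stratification score per record (assumed precomputed).
    stratum_size: target minimum number of records per stratum.
    rng:          a random.Random instance.

    Each record is replaced by a draw, with replacement, from the deleted
    records that fall in the same stratum.
    """
    n = len(records)
    order = sorted(range(n), key=lambda i: scores[i])
    n_strata = max(1, n // stratum_size)
    # Integer cut points that split the sorted cases into n_strata groups.
    bounds = [j * n // n_strata for j in range(n_strata + 1)]
    imputed = [None] * n
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        stratum = order[lo:hi]
        donors = [records[i] for i in stratum]
        for i in stratum:
            imputed[i] = rng.choice(donors)
    return imputed


# Toy data: ten sensitive cases with made-up ages and scores.
rng = random.Random(0)
records = [(40 + i, 80 + i, i % 2) for i in range(10)]
scores = [float(i) for i in range(10)]
imputations = [stratified_hot_deck(records, scores, 5, rng) for _ in range(5)]
```

Repeating the draw *D* times, as in the last line, gives the multiply-imputed datasets. Because donors are real deleted values, every released value still occurs somewhere in the original data.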

Note that for methods HD1 and HD2, we delete values of *Y*_{end}, *Y*_{start}, and *C* of sensitive cases and jointly impute these values. HD3 retains values of *C* and imputes *Y*_{end} and *Y*_{start} only.

It is worth noting that for the stratified methods above, we obtain predicted values by fitting the regression only to the set of sensitive cases whose values are deleted. We also considered an alternative way of creating strata, where we perform regression on the complete data, and then stratify the sensitive cases for imputation. These methods did not perform as well in terms of empirical bias, root mean squared error and confidence coverage in the simulation study reported in Section 3, so we do not consider them further.

### 2.2 Methods of inference

We consider the properties of the SDC methods for inferences about the regression coefficient, where a PH model is fitted to the dataset before and after imputation. The following estimates and associated standard errors are considered:

1. **Before Deletion (BD)** – the estimates of regression coefficients calculated from original data prior to SDC, used as a benchmark for comparing SDC methods.
2. **Top-coding (TC)** – the estimates of regression coefficients calculated from the top-coded dataset.

The standard errors for methods (1) and (2) are computed by the bootstrap.
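A generic bootstrap standard error can be sketched as below. The helper `bootstrap_se` is hypothetical, and the estimator passed to it is a placeholder: in the setting of this paper it would refit the PH model on each resample and return one regression coefficient, which is not shown here.

```python
import random
from statistics import stdev


def bootstrap_se(data, estimator, n_boot, rng):
    """Bootstrap standard error of `estimator` applied to `data`.

    data:      list of records (one per individual).
    estimator: function mapping a list of records to a single number.
    n_boot:    number of bootstrap resamples (at least 2).
    rng:       a random.Random instance.
    """
    n = len(data)
    replicates = []
    for _ in range(n_boot):
        # Resample individuals with replacement, keeping the sample size.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        replicates.append(estimator(resample))
    # Standard deviation of the replicate estimates.
    return stdev(replicates)


def sample_mean(xs):
    return sum(xs) / len(xs)


se = bootstrap_se([1.0, 2.0, 3.0, 4.0], sample_mean, 200, random.Random(1))
```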

The four remaining methods HD1 – HD3 and HDU are as described in Section 2.1, yielding *D* MI datasets. The MI estimate is calculated as

$${\widehat{\theta}}_{MI}=\frac{1}{D}\sum_{d=1}^{D}{\widehat{\theta}}^{\left(d\right)},\qquad (1)$$

where ${\widehat{\theta}}^{\left(d\right)}$ is the parameter estimate from the *d*th data set. The MI estimate of variance is

$${T}_{MI}=\overline{W}+B/D,\qquad (2)$$

where $\overline{W}={\Sigma}_{d=1}^{D}{W}^{\left(d\right)}/D$ is the average of the within-imputation variances *W*^{(d)} for imputed data set *d*, and $B={\Sigma}_{d=1}^{D}{({\widehat{\theta}}^{\left(d\right)}-{\widehat{\theta}}_{MI})}^{2}/(D-1)$ is the between-imputation variance. The formula (2) differs from the original MI formula for missing data (where *B* is multiplied by a factor (*D*+1)/*D*, see e.g. Little and Rubin [14], p86), for reasons discussed in Reiter [3].
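The combining step can be sketched in a few lines. This sketch assumes the variance rule for this setting adds the between-imputation variance divided by *D* to the average within-imputation variance, in contrast to the factor (*D*+1)/*D* used for missing data; the function name `combine_mi` is hypothetical.

```python
from statistics import mean, variance


def combine_mi(estimates, within_variances):
    """Combine D point estimates and their within-imputation variances.

    Returns (theta_mi, t_mi), where theta_mi is the average of the D
    estimates and t_mi = W_bar + B / D (assumed rule, see lead-in).
    """
    d = len(estimates)
    theta_mi = mean(estimates)
    w_bar = mean(within_variances)
    b = variance(estimates)  # sample variance, divisor D - 1
    return theta_mi, w_bar + b / d


theta_mi, t_mi = combine_mi([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

A normal-approximation 95% interval is then `theta_mi ± 1.96 * t_mi ** 0.5`.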

## 3 Simulation study

A simulation study was carried out to evaluate the SDC methods in Section 2. We computed estimates of regression coefficients, their corresponding variances and confidence intervals from the imputed and top-coded datasets, and compared them with those calculated from the original dataset prior to SDC.

### 3.1 Study design

For simplicity we simulated survival data with just two binary covariates, representing gender (male and female) and entry age (say 30 - 40 and 40 - 50). Values of these variables were simulated from a multinomial distribution for the 4 categories. Values of entry-age were generated from a uniform distribution. Survival times (in years) were generated from piece-wise exponential distributions with hazard rates specified in Tables 1 and 2. An individual was treated as censored if (s)he survived more than 40 years from age at entry. We investigated the following three scenarios.
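One way to generate such data is sketched below. The hazard rates and interval breakpoints in the example call are made up for illustration and are not the values in the tables; the function names are hypothetical. Survival times are drawn by inverting the cumulative hazard, and the sensitivity threshold of 75 follows the definition used later in this section.

```python
import math
import random

STUDY_LENGTH = 40   # years of follow-up
SENSITIVE_AGE = 75  # final ages at or above this are treated as sensitive


def draw_piecewise_exponential(breaks, rates, rng):
    """Draw a survival time with piecewise-constant hazard.

    breaks[i] is the start of interval i (breaks[0] == 0); rates[i] > 0 is
    the hazard on that interval; the last interval is open-ended.
    """
    target = rng.expovariate(1.0)  # cumulative hazard at the event time
    for i, rate in enumerate(rates):
        start = breaks[i]
        end = breaks[i + 1] if i + 1 < len(breaks) else math.inf
        segment_hazard = rate * (end - start)
        if target <= segment_hazard:
            return start + target / rate
        target -= segment_hazard
    raise ValueError("the last rate must cover an open-ended interval")


def simulate_subject(entry_range, breaks, rates, rng):
    """Simulate entry age, final age, censoring and sensitivity for one person."""
    entry_age = rng.uniform(*entry_range)
    survival = draw_piecewise_exponential(breaks, rates, rng)
    censored = survival >= STUDY_LENGTH
    final_age = entry_age + min(survival, STUDY_LENGTH)
    return {
        "entry_age": entry_age,
        "final_age": final_age,
        "censored": censored,
        "sensitive": final_age >= SENSITIVE_AGE,
    }


# Hypothetical hazards: 0.01 per year for the first 20 years, 0.05 afterwards.
subject = simulate_subject((30, 40), [0, 20], [0.01, 0.05], random.Random(7))
```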

#### Scenario I

Males and females have the same entry-age distribution. Entry age values are generated from Uniform distributions with the ranges 30~40 and 40~50. For both males and females, the number of values drawn from the former distribution is 1.5 times the number drawn from the latter. Gender and entry-age have additive effects on the log-hazard of survival; hazards are shown in Table 1.

#### Scenario II

The distribution of entry age differs for males and females. Males have the same distribution of entry age as in Scenario I. Entry age values for females are generated in a similar manner, except that 70% of the values lie within the range of 35~45. Gender and entry-age effects on survival are as for Scenario I.

#### Scenario III

Males and females have the same entry-age distribution as specified in Scenario I, and there is interaction between entry age and gender on the log-hazard of survival; hazards are shown in Table 2.

In this study we considered individuals with final age greater than or equal to 75 years to be at risk of disclosure, and refer to these individuals as sensitive cases. About 25% of the cases have sensitive values, and about one-third of the cases are censored. For each simulated dataset, we applied the stratified HDMI methods to both final age and entry age variables for sensitive cases as described in Section 2. We also applied the top-coding method, with top-code being 75 for final age and 35 for entry age (as the length of study is 40 years). We then calculated estimates of regression coefficients from the PH model, the corresponding empirical bias and root mean squared error (RMSE) of the estimates, average width of the 95% confidence intervals (CI’s) based on a normal approximation relative to the CI from the data before deletion, and the confidence coverage of these intervals.
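The performance measures listed above can be computed from the replicate estimates as in the following sketch. The function `summarize_simulation` is hypothetical; it takes one estimate and one standard error per simulated dataset and uses a normal-approximation interval. The relative CI width is then the ratio of `avg_ci_width` for a method to that for the before-deletion analysis.

```python
import math


def summarize_simulation(estimates, std_errors, truth, z=1.96):
    """Empirical bias, RMSE, CI coverage and average CI width.

    estimates, std_errors: one value per simulated dataset.
    truth: the true parameter value used to generate the data.
    """
    n = len(estimates)
    bias = sum(estimates) / n - truth
    rmse = math.sqrt(sum((e - truth) ** 2 for e in estimates) / n)
    # An interval covers the truth when |estimate - truth| <= z * SE.
    covered = sum(abs(e - truth) <= z * s for e, s in zip(estimates, std_errors))
    avg_width = sum(2 * z * s for s in std_errors) / n
    return {
        "bias": bias,
        "rmse": rmse,
        "coverage": covered / n,
        "avg_ci_width": avg_width,
    }


summary = summarize_simulation([1.0, 3.0], [1.0, 0.1], truth=2.0)
```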

### 3.2 Results

Simulation results are based on 500 datasets of sample size 2000. We set the number of bootstraps *B* to be 100 for calculating standard errors of BD and TC estimates; and create *D* = 5 imputed datasets. For stratified HDMI methods, we create strata with stratum size around 25.

Table 3 presents results from scenario I, where entry-age and gender are independent and their effects on the log-hazard are additive. TC yields estimates of regression coefficients with serious empirical bias and high RMSE, and zero confidence coverage for the entry-age variable. The TC estimate of the gender coefficient is less biased, but it still has sizeable empirical bias, and the CI has below nominal coverage. All stratified HDMI methods produce quite satisfactory results for the coefficient of entry-age, with negligible empirical bias and close to nominal confidence coverage. The unstratified method, HDU, also works well in terms of empirical bias and coverage, but it is somewhat less efficient than the stratified HD methods. HD3 works best for the gender coefficient, yielding an estimate with minimal empirical bias and good confidence coverage. Estimates from the other HD methods are also acceptable, though they have slightly higher empirical bias and below nominal confidence coverage. When males and females have different entry-age distributions as in scenario II (Table 4), most methods perform as in the first scenario, except that HD2 yields larger empirical bias, larger RMSE and lower coverage for the estimate of the regression coefficient of gender. In fact, its results are even worse than those of the TC method.

Table 5 displays results from scenario III, where there is interaction between the age and gender variables. TC yields estimates with considerable empirical bias and poor coverage for regression coefficients of age, gender and the interaction between these two variables. Among stratified HD methods, HD3 has the best performance and yields estimates with good inferences for both variables and the age-gender interaction. Estimates from HD1 and HD2 methods have satisfactory results for all three terms, though they have more empirical bias than HD3. Estimates from HDU have larger empirical bias and smaller confidence coverage than the stratified HD methods.

In summary, HD3 performs best under all circumstances. Other stratified HD methods yield estimates of regression coefficients with good inferential properties for the entry-age variable. These methods also provide satisfactory results for gender, except for HD2 in scenario II. With the presence of interaction between age and gender, estimates for the interaction term from HD1 and HD2 methods do not have sufficient coverage. HDU tends to be slightly less efficient than the stratified HD methods, but it works surprisingly well in the first two scenarios, indicating stratification may not be necessary in these settings. For the more complicated situation (scenario III), it yields biased estimates with low confidence coverage.

## 4 Application to Charleston Heart Study data

We chose a subset of the CHS data and studied the relationship between hazard rate and certain risk factors. Since an intact data file prior to disclosure control was available to us, the effectiveness of our SDC methods can be readily assessed.

### 4.1 Primary data analysis

After deletion of missing values and recoding of some variables, our sample included 1344 individuals, of whom 303 survived the study. The variables involved were entry-age, final-age, censoring indicator, race/gender, education level, current cigarette smoking status, history of myocardial infarction (MI), history of diabetes, history of hypertension, electro-cardiographic interpretation (EKG), place of residence between ages 20 and 65, and body mass index (BMI). For the PH regression model, final-age instead of survival time was treated as the time-scale variable.

To examine effects of our chosen risk factors, we applied the PH model to the dataset prior to SDC. Table 6 displays results from the regression. All factors have a significant effect on the hazard except BMI and entry-age (overall). Compared with individuals who enter the study between 35 and 40 years old, those with entry-age greater than 50 have about a 30% increase in risk of death. White females tend to have 34% less risk than white males. Education beyond high school reduces the hazard by 30% compared with less than a high school education. Smoking cigarettes increases death risk by 76%. Participants with a definite history of myocardial infarction have twice the risk of death of those without a history. History of diabetes as well as EKG problems increases the hazard by over 50%, while history of hypertension increases risk of death by 17%. Rural residents have 25% less hazard than urban residents. Most of these coefficients are in the expected direction.
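Percentage statements of this kind correspond to exponentiated PH coefficients: a coefficient β implies a hazard ratio exp(β), that is, a change in hazard of 100·(exp(β) − 1) percent. The helper names below are hypothetical and the sketch only converts between the two scales; the coefficients themselves are those reported in Table 6.

```python
import math


def pct_change_in_hazard(beta):
    """Percentage change in hazard implied by a PH regression coefficient."""
    return 100.0 * (math.exp(beta) - 1.0)


def beta_from_pct_change(pct):
    """PH coefficient implied by a percentage change in hazard."""
    return math.log(1.0 + pct / 100.0)


# A 76% increase (smoking) corresponds to a hazard ratio of 1.76;
# a 34% reduction (white females vs. white males) to a hazard ratio of 0.66.
beta_smoking = beta_from_pct_change(76.0)
beta_white_female = beta_from_pct_change(-34.0)
```

A positive coefficient therefore indicates increased risk and a negative coefficient reduced risk.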

### 4.2 Results from SDC methods

As described earlier, the variables subject to disclosure limitation are entry-age and final-age. Respondents with final-age greater than or equal to 80 years are considered to be sensitive cases, which leads to top-code values of 40 for entry-age and 80 for final-age. For this dataset, top-coding the age variables has great impact on the analysis, since the entry-age variable is recoded into only two categories (40 or below 40), in contrast to the five categories for entry-age in the original data. We applied HDMI methods with *D* = 5 imputed datasets to the data and computed estimates of regression coefficients from a PH model.

Table 7 shows results from original, top-coded and imputed datasets based on 500 replications. Figure 1 summarizes these results with box plots of the percentage deviations of the TC and HD estimates of the regression coefficients from the BD estimates. Estimates of coefficients of entry-age variable have not been plotted as TC method cannot differentiate between the age categories. Predictably, TC considerably alters the relationship between hazard and covariates and yields estimates of the regression coefficients with serious bias, especially for the entry-age variable. The unstratified hot deck method HDU yields better estimates than TC for some covariates, but one coefficient is seriously underestimated. The stratified methods all do considerably better, yielding box-plots with narrower inter-quartile ranges and less extreme outliers than TC and HDU. There is not much to choose between the stratified hot deck methods – HD2 and HD3 yield better estimates of the entry-age coefficients than HD1, but HD1 provides better estimates of the regression coefficient for gender than HD2 and HD3. Overall, the stratified HD methods all work better than top-coding in preserving the relationship between risk of death and the covariates on this dataset.

**Figure 1. Percentage deviation from BD estimates for TC and HD estimates of the regression coefficients from PH model, CHS data after SDC**

## 5 Discussion

Longitudinal data raise particular confidentiality concerns, because potentially extensive information about each individual is gathered over time. We consider a specific application concerning disclosure risk caused by some participants attaining high ages because of prolonged participation in a longitudinal study, as in the Charleston Heart Study. One of the authors (McNally) has the responsibility to prepare a public use version of this data set through NACDA that meets HIPAA regulations. As discussed earlier, the standard approach of top-coding age has severe limitations in this longitudinal setting, especially for survival analyses with age being a key variable of interest. HIPAA restrictions make a full public release impossible and require a formal Limit Use Agreement, which imposes significant barriers to accessing the data. We develop MI-based SDC methods for this particular data setting. Similar to the methods in An and Little (2007, unpublished), our proposed MI methods are based on stratification, with strata defined by the predicted values of the age variables from a regression model.

Given the longitudinal nature of the dataset in this study, we have focused on inference about regression coefficients from Cox’s proportional hazard model for survival. As expected, the top-coding method yields seriously biased estimates, especially for the entry-age variable. In principle, it is possible to improve statistical performance of the top-coding method by treating top-coded values as censored, but this yields a non-standard problem of survival analysis with censored covariates, and does not address the problem of severe loss of information from top-coding in this setting.

Among our stratified HDMI methods, HD3 has the best performance and yields results close to those before deletion in simulation studies. The other stratified methods also work well overall, except that sometimes they do not quite attain the nominal confidence coverage. When there are fewer censored cases, as with the CHS data (the number of censored cases is about one fourth of the total sample size), HD3 does not have an obvious advantage over the other methods, though it still yields satisfactory results. The unstratified method HDU works almost as well as the stratified HD methods in simple data settings. In situations with more covariates and a larger number of sensitive cases, it yields biased estimates with below-nominal confidence coverage.

An and Little [2] present two versions of MI methods, the “C” method which is based on a model fitted to the complete data; and the “D” method based on a model fitted to the deleted values alone. The “D” method is somewhat less efficient than the “C” method, but it is more robust to model misspecification, since the model is fitted to the data that are being deleted. As mentioned in Section 2, we present results for the “D” method here, since the “C” method was inferior in simulations.

Note that in this study, the predicted logarithm of the hazard is considered an appropriate factor for stratification, as the primary focus is survival analysis and preserving the original relationship between the hazard and covariates. The current size of strata was selected based on empirical experience, by trying to maintain a good balance between limiting disclosure risk and retaining the utility of the data. It is therefore not a universal recommendation. Statistical agencies and data producers are encouraged to make a reasonable choice of the stratification factor and the size of strata, based on the analyses of interest for the specific data.

Our stratified HDMI methods produce excellent inferences, but as SDC methods they arguably have the limitation that original values in the dataset are retained, although not attached to the right records. Although multiply-imputed datasets protect an individual with an extremely high age value from being linked to a specific record, a potential data snooper may still recognize that this individual is included in the dataset, especially for data with geographic specificity. To address this concern, we will develop parametric MI methods in our future work.

Reiter (2005) proposes the use of classification and regression trees (CART) to generate partially synthetic data. For the CART approach, subpopulations with relatively homogeneous outcome (imputation classes) are created, by partitioning the predictor space. The imputation model is fit on the cases with sensitive values only, and sensitive values of the outcome can be replaced by random draws from the same class according to the predictor values by Bayesian bootstrap. This is in spirit quite similar to the stratified hot-deck methods in this paper, where we create strata (imputation classes) based on some predicted values from a regression model fitted to the cases with sensitive values, and replace sensitive values with random draws from a set of sensitive values in the same stratum. Both methods are non-parametric, and have been shown to have good repeated sampling properties. Reiter also suggested an alternative method of drawing samples from a kernel density estimator based on the random draws from the first step, which yields added protection since it avoids releasing real data values.

An important issue that is not addressed in this article is quantifying the reduction in disclosure risk from multiple imputation of the ages of high-age individuals, compared with alternatives such as top-coding. This is a complex question which depends on the set of “key” variables available to the intruder from external databases that include the target individuals, the probability that target individuals are in the sample, and the joint distribution of the key variables for the high-age individuals. Reiter and Mitra (2009) and Drechsler and Reiter (2008) describe approaches for addressing this issue with partially synthesized data, and future research should address how these methods translate into the longitudinal data setting.

We have confined attention here to imputing age-related variables for individuals with high-age values, and SDC methods for other types of variables (such as geography) in longitudinal health data like the CHS data remain a topic for future research.

## Acknowledgments

This work was supported by a National Institute of Child Health and Human Development grant (P01HD045753). The Charleston Heart Study is supported by National Institute on Aging grants (P30AG004590 and R03AG021162). The authors thank Trivellore Raghunathan, Michael Elliott, and Myron Gutmann for useful comments.

## Contributor Information

Di An, Merck Research Laboratories, Merck & Co., Inc., P.O. Box 1000, Upper Gwynedd, PA 19454, Office: (267) 305-1628; Cell: (517) 230-3572; FAX: (267) 305-6395, Email: di_an@merck.com.

Roderick J.A. Little, Department of Biostatistics, University of Michigan, 1420 Washington Heights M4045, Ann Arbor, Michigan 48109-2029, Office: (734) 936-1003; Fax: (734) 763-2215.

James W. McNally, Institute for Social Research, University of Michigan, 330 Packard Street, Ann Arbor, Michigan 48109, Office: (734) 615-9520; Fax: (734) 998-9889.

## References
