NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Gliklich RE, Dreyer NA, editors. Registries for Evaluating Patient Outcomes: A User's Guide. 2nd edition. Rockville (MD): Agency for Healthcare Research and Quality (US); 2010 Sep.


Chapter 7. Linking Registry Data: Technical and Legal Considerations


The purpose of this chapter is to identify important technical and legal considerations and provide guidance to researchers and research sponsors who are interested in linking data held in a health information registry with additional data, such as data from claims or other administrative files or from another registry. Its goals are to help investigators find an appropriate way to address their critical research questions, remain faithful to the conditions under which the data were originally collected, and protect individual patients by safeguarding their privacy and maintaining the confidentiality of the data under applicable law.

There are two equally important questions to address in the planning process: (1) What is a feasible technical approach to linking the data, and (2) Is the linkage legally feasible under the permissions, terms, and conditions that applied to the original compilations of each dataset? Legal feasibility depends on how Federal and State legal protections for the confidentiality of health information and participation in human research apply to the specific purpose of the data linkage, and also on any specific permissions obtained from individual patients for the use of their health information. These projects require a great deal of analysis and planning, because the technical approach chosen may be influenced by permitted uses of the data under applicable regulations, while the legal assessment may change depending on how the linkage needs to be performed and the nature and purpose of the resulting linked dataset. Tables 9 and 10, respectively, list regulatory and technical questions for data linkage project leaders to consider during the planning of a project. The questions are intended to assist in organizing the resources needed to implement the project, including the statistical, regulatory, and collegial advice that might prove helpful in navigating the complexities of data linkage projects.

This chapter presumes that the investigators have identified an explicit purpose for the data linkage in the form of a scientific question they are trying to answer. The nature of this objective is critical to an assessment of the applicable regulatory requirements for uses of the data. Investigators should assign the goal of the data linkage project to one of the following categories as defined by the Health Insurance Portability and Accountability Act of 1996 (HIPAA) Privacy Rule: health care operations (including health care quality-related activities), public health practice, research, or some combination of these purposes. If research is one purpose of the project, then the Common Rule (Federal human subjects protection regulations) is likely to apply to the project. More information on HIPAA and the Common Rule is provided in Chapter 8.

Table 9. Legal Planning Questions.

Table 10. Technical Planning Questions.

The application of the HIPAA Privacy and Security Rules depends on the origins of the datasets being linked, and such origins may also influence the feasibility of making the data linkage. Investigators should know the source of the original data, the conditions under which they were compiled, and what kinds of permissions, from both individual patients and the custodial institutions, apply to the data. Health information is most often data that have two sources: individual and institutional; these sources may have legal rights and continuing interests in the use of the data.

It is important to be aware that the legal requirements may not remain stable and that the protections limiting the research use of health information are likely to change in response to continued development of electronic health information technologies.

This chapter provides eight sections focusing on core issues in three major parts: Technical Aspects of Data Linkage Projects, Legal Aspects of Data Linkage Projects, and Risk Mitigation for Data Linkage Projects. The Technical Aspects of Data Linkage Projects section discusses the reasons for and technical methods of linking datasets containing health information, including data held in registries. It should be noted that this list of techniques is not intended to be comprehensive, and the techniques presented have limitations for certain types of studies. The reader is referred to the published literature on linkage for alternative techniques. The Legal Aspects of Data Linkage Projects section defines important concepts, including the different definitions of “disclosure” as used by statisticians and in the HIPAA Privacy Rule. This section also discusses the risks of identification of individuals inherent in data linkage projects and describes the legal standards of the HIPAA Privacy Rule that pertain to these risks. Finally, the Risk Mitigation for Data Linkage Projects section summarizes both recognized and developing technical methods for mitigating the risks of identification. In addition, Appendix D consists of a hypothetical data linkage project intended to provide context for the technical and legal information presented below. Case Examples 20, 21, and 22 describe registry-related data linkage activities. While some of the concepts presented are applicable to other important nonpatient identities that might be at risk in data linkage, such as provider identities, those issues are beyond the scope of the discussion below.

Case Example 20

Linking Registries at the International Level. The number of options for systemic treatment of psoriasis has greatly increased in recent years. Because psoriasis is a chronic disease involving lifelong treatment, data on long-term effectiveness and safety (more...)

Case Example 21

Linking a Procedure-Based Registry With Claims Data To Study Long-Term Outcomes. The registry sponsor was interested in studying long-term patient outcomes for diagnostic cardiac catheterizations and percutaneous coronary interventions (PCI), but longitudinal (more...)

Case Example 22

Linking Registry Data To Examine Long-Term Survival. In 2002, approximately 1,500 children in the United Kingdom (UK) were diagnosed with cancer. Previous estimates of malignant bone tumors in children have been approximately 5 per million person-years (more...)

Technical Aspects of Data Linkage Projects

Linking Records for Research and Improving Public Health

Data in registries regarding the health of individuals come in a wide variety of forms. Most of these data have been gathered originally for the delivery of clinical services or payment for those services, and under promises or legal guarantees of confidentiality, privacy, and security. The sources of data may include individual doctors’ records, billing information, vital statistics on births and deaths, health surveys, and data associated with biospecimens, among other sources.

The broad goal of registries is to amass data from potentially diverse sources to allow researchers to explore and evaluate alternative health outcomes in a systematic fashion. This goal is usually accomplished by gathering data from multiple sources and linking the data across sources, either with explicit identifiers designed for linking, or in a probabilistic fashion via the characteristics of the individuals to whom the data correspond. From the research perspective, the more data included, the better, both in terms of the number of cases and the details and the extent of the health information. The richer the database, the more likely it is that data analysts will be able to discover relationships that might affect or improve health care. On the other hand, many discussions about privacy protection focus on limiting the level of detail available in data to which others have access.

There is an ethical obligation to protect patient interests when collecting, sharing, and studying person-specific biomedical information.1 Many people fear that information derived from their medical or biological records will be used against them in employment decisions, result in limitations to their access to health or life insurance, or cause social stigma.2 These fears are not unfounded, and there have been various cases in which it was found that an individual’s genetic characteristics or clinical manifestations were used in a manner inconsistent with an individual’s expectations of privacy and fair use.3 If individuals are afraid that their health-related information may be associated with them or used against them, they may be less likely to seek treatment in a clinical context or participate in research studies.4

A tension exists between the broad goals of registries and regulations protecting individually identifiable information. Approaches and formal methodologies that help mediate this tension are the principal technical focus of this chapter. To understand the extent to which these tools can assist data linkages involving registry data, one needs to understand the risks of identification in different types of data.

There is a large body of Federal law relating to privacy. A recent comprehensive review of privacy law and its effects on biomedical research identified no fewer than 15 separate Federal laws pertaining to health information privacy.5 There are also special Federal laws governing health information related to substance abuse.6 A full review of all laws related to privacy, confidentiality, and security of health information also would consider separate State privacy protections, as well as State laws pertaining to the confidentiality of data. Nevertheless, the legal aspects of this chapter focus only on the Federal regulations commonly referred to as the HIPAA Privacy Rule.

What Do Privacy, Disclosure, and Confidentiality Mean?

Privacy is a term whose definition varies with context.7 In the HIPAA Privacy Rule, the term applies to protected health information (PHI); specifically, to permitted uses and disclosures of individually identifiable health information. The Privacy Rule addresses to whom the custodian of PHI, a covered entity, may transmit the information and under what conditions. It establishes three basic concepts of health information: identifiable data; data that lack certain direct identifiers, otherwise known as a limited dataset; and de-identified data. Registries commonly acquire identifiable data and may create the last two categories of data. Along this spectrum of data, the HIPAA Privacy Rule applies different legal standards and protections.5 Not all registries contain PHI; Chapter 8 provides more information on how PHI is defined under HIPAA.

Disclosure has two different meanings: one is technical and the other is a HIPAA Privacy Rule definition.

Technical Definition

Technically, disclosure relates to the attribution of information to the source of the data, regardless of whether the data source is an individual or an organization. There are basically three types of disclosure of data that possess the capacity to make the identity of particular individuals known: identity disclosure, attribute disclosure, and inferential disclosure.

Identity disclosure occurs when the data source becomes known from the data release itself.8,9

Attribute disclosure occurs when the released data make it possible to infer the characteristics of an individual data source more accurately than would have otherwise been possible.8,9 The usual way to achieve attribute disclosure is through identity disclosure. First, one identifies an individual through some combination of variables and then learns about the values of additional variables included in the released data. Attribute disclosure may occur, however, without identity disclosure, such as when all people from a population subgroup share a characteristic and this quantity becomes known for any individual in the subgroup.

Inferential disclosure relates to the probability of identifying a particular attribute of a data source. Because almost any data release can be expected to increase the likelihood of an attribute being associated with a data source, the only way to guarantee protection is to release no data at all. It is for this reason that researchers use certain methods not to prevent disclosure, but to limit or control the nature of the disclosure. These methods are known as disclosure limitation methods or statistical disclosure control.10

HIPAA Privacy Rule Definitions

Disclosure according to the HIPAA Privacy Rule means the release, transfer, provision of, access to, or divulging in any other manner of information outside of the entity holding the information.11

Confidentiality broadly refers to a quality or condition of protection accorded to statistical information as an obligation not to permit the transfer of that information to an unauthorized party.5 Confidentiality can be owed to both individuals and health care organizations. A different notion of confidentiality, arising from the special relationship between a clinician and patient, refers to the ethical, legal, and professional obligation of those who receive information in the context of a clinical relationship to respect the privacy interests of their patients. Most often the term is used in the former sense and not in the latter, but these two meanings inevitably overlap in a discussion of health information as data. The methods for disclosure limitation described here have been developed largely in the context of confidentiality protection, as defined by laws, regulations, and especially by the practices of statistical agencies.

Linking Records and Probabilistic Matching

Computer-assisted record linkage goes back to the 1950s, and was put on a firm statistical foundation by Fellegi and Sunter.12 Most common techniques for record linkage either rely on the existence of unique identifiers or utilize a structure similar to the one Fellegi and Sunter described with the incorporation of formal statistical modeling and methods, as well as new and efficient computational tools.13,14 The simplest way to match records from separate databases is to use a so-called “deterministic” method of linking the databases employing unique identifiers contained in each record. In the United States, these identifiers might be names or Social Security Numbers; however, these particular identifiers may not in fact be unique. As a result, some form of probabilistic approach is typically used to match the records. Thus, there is little actual difference between methods using deterministic vs. probabilistic linkage, except for the explicit representation of uncertainty in the matching process in the latter.

The now-standard approach to record linkage is built on five key components for identifying matching pairs of records across two databases:13

  1. Represent every pair of records using a vector of features (variables) that describe similarity between individual record fields. Features can be Boolean, discrete, or continuous.
  2. Place feature vectors for record pairs into three classes: matches (M), nonmatches (U), and possible matches (P). These correspond to “equivalent,” “nonequivalent,” and “possibly equivalent” (e.g., requiring human review) record pairs, respectively.
  3. Perform record-pair classification by calculating the ratio P(γ | M) / P(γ | U) for each candidate record pair, where γ is a feature vector for the pair and P(γ | M) and P(γ | U) are the probabilities of observing that feature vector for a matched and nonmatched pair, respectively. Two thresholds based on desired error levels, T_μ and T_λ, optimally separate the ratio values for equivalent, possibly equivalent, and nonequivalent record pairs.
  4. When no training data in the form of duplicate and nonduplicate record pairs are available, matching can be unsupervised; that is, conditional probabilities for feature values are estimated using observed frequencies in the records to be linked.
  5. Most record pairs are clearly nonmatches, so one need not consider them for matching. This situation is managed by “blocking,” or partitioning the databases, for example, based on geography or some other variable in both databases, so that only records in comparable blocks are compared. Such a strategy significantly improves efficiency.
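
The components listed above can be illustrated with a minimal Python sketch. It is not the implementation used by any particular agency or registry. The field names, the agreement probabilities in `M_PROB` and `U_PROB`, the two thresholds, and the `zip3` blocking key are all hypothetical values chosen for illustration, and the sketch assumes that fields agree or disagree independently of one another. The estimation described in component 4 is not shown; the probabilities are simply fixed.

```python
import math
from collections import defaultdict

# Hypothetical probabilities that a field agrees given a true match (m)
# or a true nonmatch (u). In practice these would be estimated from data.
M_PROB = {"surname": 0.95, "birth_year": 0.98, "sex": 0.99}
U_PROB = {"surname": 0.01, "birth_year": 0.05, "sex": 0.50}


def compare(rec_a, rec_b):
    """Component 1: Boolean feature vector of field-by-field agreement."""
    return {field: rec_a[field] == rec_b[field] for field in M_PROB}


def log_likelihood_ratio(gamma):
    """Component 3: log of P(gamma | M) / P(gamma | U), assuming fields
    agree or disagree independently."""
    total = 0.0
    for field, agrees in gamma.items():
        m, u = M_PROB[field], U_PROB[field]
        total += math.log(m / u) if agrees else math.log((1 - m) / (1 - u))
    return total


def classify(score, upper, lower):
    """Component 2: match (M), possible match (P), or nonmatch (U)."""
    if score >= upper:
        return "M"
    if score <= lower:
        return "U"
    return "P"


def link(file_a, file_b, block_key="zip3", upper=5.0, lower=0.0):
    """Component 5: compare only record pairs that share a blocking key."""
    blocks = defaultdict(list)
    for rec in file_b:
        blocks[rec[block_key]].append(rec)
    results = []
    for rec_a in file_a:
        for rec_b in blocks.get(rec_a[block_key], []):
            score = log_likelihood_ratio(compare(rec_a, rec_b))
            results.append((rec_a["id"], rec_b["id"], classify(score, upper, lower)))
    return results


# Fictitious records for illustration only.
registry = [
    {"id": "r1", "surname": "lee", "birth_year": 1970, "sex": "F", "zip3": "152"},
    {"id": "r2", "surname": "khan", "birth_year": 1985, "sex": "M", "zip3": "606"},
]
claims = [
    {"id": "c1", "surname": "lee", "birth_year": 1970, "sex": "F", "zip3": "152"},
    {"id": "c2", "surname": "lee", "birth_year": 1971, "sex": "F", "zip3": "152"},
    {"id": "c3", "surname": "khan", "birth_year": 1985, "sex": "M", "zip3": "100"},
]
results = link(registry, claims)
print(results)
```

In this toy example, records r2 and c3 agree on every compared field but are never compared because their blocking keys differ, which illustrates how approximate blocking can cost accuracy.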

The first four components lay the groundwork for accuracy of record-pair matching using statistical or machine learning prediction models, such as logistic regression. The fewer identifiers used in steps 1 and 2, the poorer the match is likely to be. Accuracy is well known to be high when there is a 1–1 match between records in the two databases, and accuracy deteriorates as the overlap between the files decreases and the measurement error in the feature values consequently increases.

The fifth component provides for efficiently processing large databases, but to the extent that blocking is approximate and possibly inaccurate, its use decreases the accuracy of record-pair matching. The less accurate the matching, the more error (i.e., records not matched or matched inappropriately) there will be in the merged registry files. This error will impede quality analyses and findings from the resulting data.15,16

This standard approach has problems when there are lists or files with little overlap, when there are undetected duplications within files, and when one needs to link three or more lists. In the last case, one essentially matches all lists in pairs and then resolves discrepancies. Unfortunately, there is no single agreed-upon way to do this.

Record linkage methodology has been widely used by statistical agencies, especially in the U.S. Census Bureau. The methodology has been combined with disclosure limitation techniques such as the addition of “noise” to variables in order to produce public use files that the agencies believe cannot be linked back to the original databases used for the record linkage. Another technique involves protecting individual databases by stripping out identifiers and then attempting record linkage. This procedure has two disadvantages: first, the quality of matches is likely to decrease markedly; and second, the resulting merged records will still need to be protected by some form of disclosure limitation. Therefore, as long as there are no legal restrictions against the use of identifiers for record linkage purposes, it is preferable to use detailed identifiers to the extent possible and to remove them following the matching procedure.

Currently there are no special features of registry data known to enhance or inhibit matching. Registry data may be easier targets for re-identification because the specifics of diseases or conditions help to define the registries. In the United States, efforts are often made to match records using Social Security Numbers. There are large numbers of entry errors for these numbers in many databases, and there are problems associated with multiple people using one number and some people using multiple numbers.17 Lyons et al. describe a very large-scale matching exercise in the United Kingdom linking multiple health care and social services datasets using National Health Service numbers and various alternative sets of matching variables in the spirit of the record linkage methods described above. They report achieving accurate matching at rates of only about 95 percent.18

Procedural Issues in Linking Datasets

It is important to understand that neither data nor link can be unambiguously defined. For instance, a dataset may be altered by the application of tools for statistical disclosure limitation, in which case it is no longer the same dataset. Linkage need not mean, as it is customarily construed, “bringing the two (or more) datasets together on a single computer.” Many analyses of interest can be performed using technologies that do not require literal integration of the datasets. Even the relationship between datasets can vary. Two datasets can hold the same attributes for different individuals (horizontal partitioning), different attributes for the same individuals (vertical partitioning), or a complex combination of the two.

The process of linking horizontally partitioned datasets engenders little incremental risk of re-identification. There is, in almost all cases, no more information about a record on the combined dataset than was present in the individual dataset containing it. Moreover, any analysis requiring only data summaries (i.e., in technical terms, sufficient statistics) that are additive across the datasets can be performed using tools based on the computer science concept of secure summation.19 Examples of analyses for which this approach works include creation of contingency tables, linear regression, and some forms of maximum likelihood estimation.
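
The idea of secure summation can be sketched in a few lines of Python. The following is a single-process simulation of one common ring-based variant, offered under stated assumptions rather than as the protocol of the cited work: the modulus, the function names, and the contingency table layout are hypothetical, all counts are nonnegative integers smaller than the modulus, and the sites are assumed not to collude.

```python
import secrets

MODULUS = 2**61 - 1  # hypothetical public modulus, larger than any true total


def secure_sum(local_values):
    """Simulate ring-based secure summation of one integer per site.

    Site 0 adds a random mask to its own value. Each later site adds its
    value to the running total it receives. Site 0 finally removes the mask.
    """
    mask = secrets.randbelow(MODULUS)
    running = (mask + local_values[0]) % MODULUS
    messages = [running]  # the masked totals passed from site to site
    for value in local_values[1:]:
        running = (running + value) % MODULUS
        messages.append(running)
    total = (running - mask) % MODULUS
    return total, messages


def pooled_table(site_tables):
    """Sum each cell of identically shaped contingency tables across sites."""
    cells = site_tables[0].keys()
    return {cell: secure_sum([t[cell] for t in site_tables])[0] for cell in cells}


# Fictitious 2x2 cell counts held by three custodians.
sites = [
    {"exposed_event": 3, "exposed_none": 40, "unexposed_event": 1, "unexposed_none": 56},
    {"exposed_event": 5, "exposed_none": 22, "unexposed_event": 2, "unexposed_none": 71},
    {"exposed_event": 0, "exposed_none": 18, "unexposed_event": 4, "unexposed_none": 28},
]
print(pooled_table(sites))
```

Because contingency table cells are additive across sites, the pooled table equals the table that would have been computed on the combined records, yet no site sees another site's unmasked counts.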

Only in a few cases have comparable techniques for vertically partitioned data been well enough understood to be employed in practice.20 Instead, it is usually necessary to actually link individual subjects’ records that are contained in two or more datasets. This process is inherently and unavoidably risky because the combined dataset contains more information about each subject than either of the components.

Discussed below is a preferred approach that is complex, but that attenuates or can even obviate other problems. Suppose that each of the two datasets to be linked contains the same unique identifiers (for individuals, an example is Social Security Numbers) in all of the records. In this case, there exist techniques based on cryptography (homomorphic encryption21 and hash functions) that enable secure determination of which individuals are common to both datasets and assignment of unique but uninformative identifiers to the shared records. Each dataset can then be purged of individual identifiers and altered to further limit re-identification, following which error-free and risk-free linkage can be performed.
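
As a rough illustration of the hash-function variant of this approach, the Python sketch below has each custodian replace the shared identifier with a keyed hash (HMAC-SHA256 from the standard library), after which a linking party matches the hashes and assigns random study identifiers. The function names, field names, and sample records are hypothetical. The sketch assumes that the custodians share a secret key that is withheld from the linking party, and that each identifier occurs at most once per dataset. Homomorphic encryption is not shown.

```python
import hashlib
import hmac
import secrets


def pseudonym(identifier, key):
    """Keyed hash (HMAC-SHA256) of an identifier after light normalization."""
    normalized = "".join(ch for ch in identifier if ch.isalnum()).upper()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def blind(records, key, id_field="ssn"):
    """Custodian side: drop the identifier and index records by its keyed hash.

    Assumes each identifier appears at most once in the dataset.
    """
    return {
        pseudonym(rec[id_field], key): {k: v for k, v in rec.items() if k != id_field}
        for rec in records
    }


def link_blinded(blinded_a, blinded_b):
    """Linking side: match on hashes and issue uninformative study IDs."""
    linked = {}
    for token in sorted(blinded_a.keys() & blinded_b.keys()):
        study_id = secrets.token_hex(8)  # random, carries no information
        linked[study_id] = {**blinded_a[token], **blinded_b[token]}
    return linked


# Fictitious identifiers and values for illustration only.
shared_key = secrets.token_bytes(32)  # held by the custodians, not the linker
registry = [
    {"ssn": "123-45-6789", "diagnosis": "psoriasis"},
    {"ssn": "987-65-4321", "diagnosis": "asthma"},
]
claims = [{"ssn": "123456789", "total_paid": 1200}]
linked = link_blinded(blind(registry, shared_key), blind(claims, shared_key))
print(linked)
```

The key matters: because the space of possible Social Security Numbers is small, an unkeyed hash could be reversed by hashing every possible number, so anyone holding the key must be treated as able to recover the identifiers.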

Such techniques are computationally very complex, and may need to involve trusted third parties that do not have access to information in either dataset other than the common identifier. Therefore, in many cases the database custodian may prefer to remove identifiers and carry out statistical disclosure limitation prior to linkage. It is important to understand that this latter approach compromises, perhaps irrevocably, the linkage process, and may introduce errors into the linked dataset that later—perhaps dramatically—alter the results of statistical analyses.

Many techniques for record linkage depend at some level on the presence of sets of attributes in both databases that are unique to individuals but do not lead to re-identification—a combination that may be difficult to find. For instance, the combination of date of birth, gender, and ZIP Code of residence might be present in both databases. It is estimated that this combination of attributes uniquely characterizes a significant portion of the U.S. population—somewhere between 65 and 87 percent, or even higher for certain subpopulations—so re-identification would only require access to a suitable external database.22,23 Other techniques such as the Fellegi-Sunter record linkage methods described above are more probabilistic in nature. They can be effective, but as noted, they also introduce data quality effects that cannot readily be characterized.
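
A custodian can gauge how identifying such a combination is in its own data by counting how many records are unique on it. The Python sketch below is one simple way to do so; the field names and sample records are hypothetical, and a real assessment would also need to consider the external databases against which the data could be matched.

```python
from collections import Counter


def uniqueness_rate(records, quasi_identifiers=("birth_date", "sex", "zip")):
    """Fraction of records whose quasi-identifier combination is unique."""
    if not records:
        return 0.0
    keys = [tuple(rec[q] for q in quasi_identifiers) for rec in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(records)


# Fictitious records for illustration only.
sample = [
    {"birth_date": "1970-01-15", "sex": "F", "zip": "15213"},
    {"birth_date": "1970-01-15", "sex": "F", "zip": "15213"},
    {"birth_date": "1982-06-30", "sex": "M", "zip": "15213"},
    {"birth_date": "1982-06-30", "sex": "M", "zip": "60614"},
]
print(uniqueness_rate(sample))
print(uniqueness_rate(sample, quasi_identifiers=("sex",)))
```

Coarsening the combination (for example, using only sex, or year of birth instead of full date) lowers the rate, which is the basic trade-off between linkage accuracy and re-identification risk discussed in this section.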

No matter how linkage is performed, a number of other issues should be addressed. For instance, comparable attributes should be expressed in the same units of measure in both datasets (e.g., English or metric values for weight). Also, conflicting values of attributes for each individual common to both databases need reconciliation. Another issue involves the management of records that appear in only one database; the most common decision is to drop them. Data quality provides another example; it is one of the least understood statistical problems and has multiple manifestations.24 Even assuming some limited capability to characterize data quality, the relationship between the quality of the linked dataset and the quality of each component should be considered. The linkage itself can produce quality degradation. The best way to address these issues is not clear, and intuition can be faulty. For example, there is reason to believe that the quality of a linked dataset is strictly less than that of either component, and not, as might be supposed, somewhere between the two.
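
The first three issues in the preceding paragraph (units of measure, conflicting values, and records present in only one database) can be made concrete with a small Python sketch. The function names, the one-kilogram tolerance, and the rule of keeping the first dataset's value when the two agree are hypothetical choices made for illustration, not recommendations.

```python
LB_TO_KG = 0.45359237  # kilograms per pound (exact by definition)


def to_kg(value, unit):
    """Express a weight in kilograms regardless of the source unit."""
    if unit == "kg":
        return float(value)
    if unit == "lb":
        return float(value) * LB_TO_KG
    raise ValueError(f"unknown weight unit: {unit!r}")


def merge_weights(a_by_id, b_by_id, tolerance_kg=1.0):
    """Merge two {id: (weight, unit)} mappings after harmonizing units.

    Returns merged values, IDs with conflicting values, and IDs that
    appear in only one dataset.
    """
    merged, conflicts, unmatched = {}, [], []
    for pid in sorted(a_by_id.keys() | b_by_id.keys()):
        if pid not in a_by_id or pid not in b_by_id:
            unmatched.append(pid)  # appears in only one database
            continue
        weight_a = to_kg(*a_by_id[pid])
        weight_b = to_kg(*b_by_id[pid])
        if abs(weight_a - weight_b) > tolerance_kg:
            conflicts.append(pid)  # needs reconciliation by a reviewer
        else:
            merged[pid] = weight_a  # hypothetical rule: keep dataset A's value
    return merged, conflicts, unmatched


# Fictitious values for illustration only.
registry_weights = {"p1": (70.0, "kg"), "p2": (80.0, "kg"), "p3": (60.0, "kg")}
claims_weights = {"p1": (154.0, "lb"), "p2": (200.0, "lb")}
merged, conflicts, unmatched = merge_weights(registry_weights, claims_weights)
print(merged, conflicts, unmatched)
```

Reporting the conflicting and unmatched records explicitly, rather than silently dropping them, makes it easier to judge how much the linkage itself has affected data quality.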

Finally, it is important to understand that there exist endemic risks to data linkage. Anyone with access to one of the original datasets and the linked dataset may learn, even if imperfectly, the values of attributes in the other. It may not be possible to determine what knowledge the linkage will create without actually executing the linkage. For these reasons, strong consideration should be given to forms of data protection such as licensing and restricted access in research data centers, where both analyses and results can be controlled.

Legal Aspects of Data Linkage Projects

Risks of Identification

The HIPAA Privacy Rule describes two methods for de-identifying health information.25 One method requires the removal of certain data elements. The other method requires a qualified statistician to certify that the potential for identifying an individual from the data elements is negligible. (See Chapter 8 for more information.) The data removal process alone may not be sufficient. Residual data especially vulnerable to disclosure threats include (1) geographic detail, (2) longitudinal information, and (3) extreme values (e.g., income). Population health data are clearly more vulnerable than sample data, and variables that are available in other accessible databases pose special risks.

Statistical organizations such as the National Center for Health Statistics have traditionally focused on the issue of identity disclosure and thus refuse to report information in which individuals or institutions can be identified. This situation occurs, for example, when a data source is unique in the population for the characteristics under study, and is directly identifiable in the database to be released. But such uniqueness and subsequent identity disclosure may not reveal any information other than the association of the source with the data collected in the study. In this sense, identity disclosure may only be a technical violation of a promise of confidentiality. Thus, uniqueness only raises the issue of possible confidentiality problems resulting from identification. A separate issue is whether the release of information is one that is permitted by the HIPAA Privacy Rule or is authorized by the data source.

The foregoing discussion implicitly introduces the notion of “harm,” which is not the same as a breach of confidentiality. For example, it is possible for a pledge of confidentiality to be technically violated, but produce no harm to the data source because the information is “generally known” to the public. In this case, some would argue that additional data protection is not required. Conversely, if one attempts to match records from one file to another file which is subject to a pledge of confidentiality, and an “incorrect” match is made, there is no breach of confidentiality, but there is the possibility of harm if the match is assumed to be correct. Furthermore, information on individuals or organizations in a release of sample statistical data may well increase the information about characteristics of individuals or organizations not in the sample. This information may produce an inferential disclosure for such individuals or organizations and cause them harm, even though there was no confidentiality obligation. Figure 2 depicts the overlapping relationships among confidentiality, disclosure, and harm.

Figure 2. Relationships Among Confidentiality, Disclosure, and Harm.

Some people believe that the way to ensure confidentiality and prevent identity disclosure is to arrange for individuals to participate in a study anonymously. In many circumstances, such a belief is misguided, because there is a key distinction between collecting information anonymously and ensuring that personal identifiers are not inappropriately made available. Moreover, clinical health care data are simply not collected anonymously. Not only do patient records come with multiple identifiers crucial to ensuring patient safety for clinical care, but they also contain other information that may allow the identification of patients even if direct identifiers are stripped from the records.

Moreover, health- or medical-related data may also come from sample surveys in which the participants have been promised that their data will not be released in ways that would allow them to be individually identified. Disclosure of such data can produce substantial harm to the personal reputations or financial interests of the participants, their families, and others with whom they have personal relationships. For example, in the pilot surveys for the National Household Seroprevalence Survey, the National Center for Health Statistics moved to make responses during the data collection phase of the study anonymous because of the harm that could potentially result from information that an individual had an HIV infection or engaged in high-risk behavior. But such efforts still could not guarantee that one could not identify a participant in the survey database. This example also raises an interesting question about the confidentiality of registry data after an individual’s death, in part because of the potential for harm to others. The health information of decedents is subject to the HIPAA Privacy Rule, and several statistical agencies explicitly treat the identification of a deceased individual as a violation of their confidentiality obligations.

Examples of Patient Re-Identification

For years, the confidentiality of health information has been protected through a process of “de-identification.” This protection entails the removal of person-specific features such as names, residential street addresses, phone numbers, and Social Security Numbers. However, as discussed above, de-identification does not guarantee that individuals may not be identified from the resulting data. On multiple occasions, it has been shown that de-identified health information can be “re-identified” to a particular patient without hacking or breaking into a private health information system. For instance, in the mid-1990s Latanya Sweeney, then a graduate student at the Massachusetts Institute of Technology, showed that de-identified hospital discharge records, which were made publicly available at the State level, could be linked to identifiable public records in the form of voter registration lists. Her demonstration received notoriety because it led to the re-identification of the medical status of the then-governor in the Commonwealth of Massachusetts.26 This result was achieved by linking the data resources on their common fields of patient’s date of birth, gender, and ZIP Code. As noted earlier, this combination identifies unique individuals in the United States at a rate estimated at somewhere between 65 and 87 percent or even higher in certain subpopulations.
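
The mechanics of this kind of linkage are simple, which is part of why the risk is serious. The Python sketch below shows how a custodian might test a planned release against a public list before sharing it. All records, names, and field names are fictitious and hypothetical, and the sketch is only a simplified illustration of matching on common fields, not a reconstruction of the original demonstration.

```python
from collections import defaultdict


def reidentify(deidentified, public, keys=("birth_date", "sex", "zip")):
    """Return {record_id: name} for records matching exactly one public entry."""
    index = defaultdict(list)
    for person in public:
        index[tuple(person[k] for k in keys)].append(person["name"])
    hits = {}
    for rec in deidentified:
        candidates = index.get(tuple(rec[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match on the common fields
            hits[rec["record_id"]] = candidates[0]
    return hits


# Fictitious data for illustration only.
discharges = [
    {"record_id": "d1", "birth_date": "1945-07-31", "sex": "M", "zip": "02138", "dx": "I21"},
    {"record_id": "d2", "birth_date": "1980-02-02", "sex": "F", "zip": "02139", "dx": "J45"},
    {"record_id": "d3", "birth_date": "1990-09-09", "sex": "F", "zip": "02140", "dx": "E11"},
]
voters = [
    {"name": "Voter A", "birth_date": "1945-07-31", "sex": "M", "zip": "02138"},
    {"name": "Voter B", "birth_date": "1980-02-02", "sex": "F", "zip": "02139"},
    {"name": "Voter C", "birth_date": "1980-02-02", "sex": "F", "zip": "02139"},
]
print(reidentify(discharges, voters))
```

In this toy example only record d1 matches a single voter. Record d2 matches two voters and record d3 matches none, so neither is uniquely re-identified by this test.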

High-Risk Identifiers

One response to the Sweeney demonstration was the HIPAA Privacy Rule method for de-identification by removal of data elements. This process requires the removal of explicit identifiers such as names, dates, geocodes (for populations of less than 20,000 inhabitants), and other data elements that, in combination, could be used to ascertain an individual’s identity. In all, the de-identification standard enumerates 18 features that should be removed from patient information prior to data sharing. (See Chapter 8.)27
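
The flavor of de-identification by removal of data elements can be sketched in Python as follows. This is not a compliance tool: it handles only an illustrative subset of the enumerated features, the field names are hypothetical, and the `zip3_population` lookup is an assumed input that the custodian would have to supply from census data. The sketch drops direct identifiers, reduces a date to its year, and keeps only the first three ZIP Code digits, substituting "000" when the corresponding area does not have more than 20,000 inhabitants.

```python
# Illustrative subset of direct identifiers, not the full enumerated list.
DIRECT_IDENTIFIERS = {"name", "street_address", "phone", "ssn", "email"}


def generalize(record, zip3_population):
    """Remove or coarsen selected identifying elements of one record."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "birth_date" in out:
        # Keep only the year of an ISO-formatted date (YYYY-MM-DD).
        out["birth_year"] = out.pop("birth_date")[:4]
    if "zip" in out:
        zip3 = out.pop("zip")[:3]
        populous = zip3_population.get(zip3, 0) > 20000
        out["zip3"] = zip3 if populous else "000"
    return out


# Fictitious record and hypothetical population counts.
population = {"152": 350000, "036": 12000}
patient = {
    "name": "Pat Example",
    "ssn": "123-45-6789",
    "birth_date": "1948-03-17",
    "zip": "15213",
    "diagnosis": "I10",
}
print(generalize(patient, population))
```

Even after this kind of processing, the remaining fields may still single out some individuals, which is the concern taken up in the next paragraph.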

Nonetheless, even the removal of these data elements may fail to prevent re-identification. In many instances, there are residual features that can lead to identification. The extent to which residual features can be used for re-identification depends on the availability of relevant data fields. Thus, one can roughly partition identifiers into “high” and relatively “low” risk features. The high-risk features are the sort that are documented in multiple environments and are publicly available. These are features that could be exploited by any recipient of such records. For instance, patient demographics are high-risk identifiers. Even de-identified health information permitted under the HIPAA Privacy Rule may leave certain individuals in a unique status, and thus at high risk for identification through public data resources containing similar features, such as public records containing birth, death, marriage, voter registration, and property assessment information.

Relatively Low-Risk Identifiers

In contrast, lower-risk data elements are those that do not appear in public records and are less available. For instance, clinical features, such as an individual’s diagnosis and treatments, are relatively static because they are often mapped to standard codes for billing purposes. These features might appear in de-identified information, such as hospital discharge databases, as well as in identified resources such as electronic medical records. While combinations of diagnostic and treatment codes might uniquely describe an individual patient in a population, the identifiable records are available to a much smaller group than the general public. Moreover, these select individuals, such as the clinicians and business associates of the custodial organization for the records, are ordinarily considered to be trustworthy, because they owe independent ethical, professional, and legal duties of confidentiality to the patients.

Special Issues With Linkages to Biospecimens

Health care is increasingly moving towards evidence-based and personalized systems. In support of this trend, there is a growing focus on associations between clinical and biological phenomena. In particular, the decreasing cost of genome sequencing technology has facilitated a rapid growth in the volume of biospecimens and derived DNA sequence data. As much of this research is sponsored through Federal funding, it is subject to Federal data sharing requirements. However, biospecimens, and DNA in particular, are inherently unique and there are a number of routes by which DNA information can be identified to an individual.28 For instance, there are over a million single nucleotide polymorphisms (SNPs) in the human genome; these little snippets of DNA are often used to make genetic correlations with clinical conditions. Yet it is estimated that fewer than one hundred SNPs can uniquely represent an individual.29 Thus, if de-identified biological information is tied to sensitive clinical information, it may provide a match to the identified biological information—as, for example, in a forensic setting.30
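A rough back-of-the-envelope calculation suggests why a modest number of SNPs can suffice. The sketch below is not the calculation in the cited work; it assumes independent SNPs in Hardy-Weinberg equilibrium with a common minor allele frequency, unrelated individuals, and no genotyping error, and the function names are hypothetical. It asks how many SNPs are needed before the expected number of chance matches in a population falls below one.

```python
import math


def match_probability(maf):
    """Chance that two unrelated people share a genotype at one SNP
    (assumes Hardy-Weinberg proportions for the three genotypes)."""
    p, q = maf, 1.0 - maf
    genotype_freqs = (p * p, 2 * p * q, q * q)
    return sum(f * f for f in genotype_freqs)


def snps_needed(population, maf):
    """Smallest k such that population * match_probability(maf)**k < 1,
    assuming the SNPs are statistically independent."""
    per_snp = match_probability(maf)
    return math.floor(math.log(population) / -math.log(per_snp)) + 1


print(snps_needed(7e9, 0.5))  # most informative case: both alleles equally common
print(snps_needed(7e9, 0.1))  # less informative SNPs require more of them
```

Under these simplifying assumptions the answer stays below one hundred for both allele frequencies shown, which is consistent in order of magnitude with the estimate quoted above. Real SNPs are correlated and relatives share genotypes, so the figures are illustrative only.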

Biospecimens and information derived from them are of particular concern because they can convey knowledge not only about the individual from whom they are derived, but also about other related individuals. For instance, it is possible to derive estimates about the DNA sequence of relatives.31 If the genetic information is predictive or diagnostic, it can adversely affect the ability of family members to obtain insurance and employment, or it may cause social stigmatization.32,33,34 The Genetic Information Nondiscrimination Act of 2008 (GINA) prohibits health insurers from using genetic information about individuals or their family members, whether collected intentionally or incidentally, in determining eligibility and coverage, or in underwriting and premium setting. Insurers may, in collaboration with external research entities, request that policyholders undergo genetic testing, but a refusal to do so cannot be permitted to affect the premium or result in medical underwriting.35

Risk Mitigation for Data Linkage Projects

Methodology for Mitigating the Risk of Re-Identification

The disclosure limitation methods briefly described in this section are designed to protect against identification of individuals in statistical databases, and are among the techniques that data linkage projects involving registries are most likely to use. One problem these methods do not address is the simultaneous protection of individual and institutional data sources. The discussion here also relates to the problems addressed by secure computation methodologies, which are explored in the next section.

Basic Methodology for Statistical Disclosure Limitation

Duncan36 categorizes the methodologies used for disclosure limitation in terms of disclosure limiting masks, i.e., transformations of the data where there is a specific functional relationship (possibly stochastic) between the masked values and the original data. The basic idea of masking involves data transformations. The goal is to transform an n × p data matrix Z through pre- and post-multiplication and the possible addition of noise, such as depicted in Equation (1):

$$Z \rightarrow AZB + C \qquad (1)$$

where A is a matrix that operates on cases, B is a matrix that operates on variables, and C is a matrix that adds perturbations or noise to the original information. Matrix masking includes a wide variety of standard approaches to disclosure limitation:

  • Adding noise,
  • Releasing a subset of observations (deleting rows from Z),
  • Cell suppression for cross-classifications,
  • Including simulated data (adding rows to Z),
  • Releasing a subset of variables (deleting columns from Z), and
  • Switching selected column values for pairs of rows (data swapping).

This list omits some methods, such as micro-aggregation and doubly random swapping, but it provides a general idea of the types of techniques being developed and applied in a variety of contexts, including medicine and public health.
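Several of the listed operations can be expressed as one matrix mask of the form AZB + C. The sketch below is a minimal illustration with hypothetical data: A deletes a row (case), B deletes a column (variable), and C adds Gaussian noise. The noise scale and the helper names `matmul` and `mask` are assumptions made for the example.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def mask(Z, A, B, C):
    """Return the masked matrix A Z B + C."""
    AZB = matmul(matmul(A, Z), B)
    return [[v + c for v, c in zip(r1, r2)] for r1, r2 in zip(AZB, C)]


# Hypothetical data: 3 cases (rows) by 3 variables (columns).
Z = [[34, 120, 1],
     [51, 135, 0],
     [47, 128, 1]]

A = [[1, 0, 0],
     [0, 0, 1]]          # keep cases 0 and 2, i.e., delete row 1 of Z
B = [[1, 0],
     [0, 1],
     [0, 0]]             # keep variables 0 and 1, i.e., delete column 2 of Z

rng = random.Random(0)   # fixed seed so the example is reproducible
C = [[rng.gauss(0, 2.0) for _ in range(2)] for _ in range(2)]  # additive noise

masked = mask(Z, A, B, C)
no_noise = mask(Z, A, B, [[0, 0], [0, 0]])
print(no_noise)  # the retained submatrix of Z before noise is added
```

Data swapping fits the same form by choosing a different transformation for each column, so it is not shown here.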

The possibilities of both identity and attribute disclosure remain even when a mask is applied to a dataset, although the risks may be substantially diminished.

Duncan suggests that we can categorize most disclosure-limiting masks as suppressions (e.g., cell suppression), recodings (e.g., collapsing rows or columns, or swapping), or samplings (e.g., releasing subsets), although he also allows for simulations as discussed below. Further, some masking methods alter the data in systematic ways (e.g., through aggregation or through cell suppression), whereas others do it through random perturbations, often subject to constraints for aggregates. Examples of perturbation methods are controlled random rounding, data swapping, and the post-randomization method (PRAM) of Gouweleeuw,37 which has been generalized by Duncan and others. One way to think about random perturbation methods is as restricted simulation tools. This characterization connects them to other types of simulation approaches.
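As an illustration of a random perturbation method, the following is a minimal sketch of the idea behind post-randomization: each categorical value is replaced by a draw from a known transition distribution that depends on the true category. The categories, the transition probabilities, and the function name `pram` are hypothetical and are not taken from the cited implementation.

```python
import random

# Hypothetical transition probabilities: outer key = true category,
# inner keys = released category. Each row sums to 1.
TRANSITION = {
    "never":   {"never": 0.90, "former": 0.05, "current": 0.05},
    "former":  {"never": 0.05, "former": 0.90, "current": 0.05},
    "current": {"never": 0.05, "former": 0.05, "current": 0.90},
}


def pram(values, transition, rng):
    """Replace each value by a random draw from its row of the transition table."""
    released = []
    for value in values:
        categories = list(transition[value])
        weights = [transition[value][c] for c in categories]
        released.append(rng.choices(categories, weights=weights, k=1)[0])
    return released


rng = random.Random(42)  # fixed seed so the example is reproducible
original = ["never", "current", "former", "never", "current"]
released = pram(original, TRANSITION, rng)
print(released)
```

Because the transition probabilities are known to the analyst, aggregate frequencies can in principle be corrected for the perturbation, while any single released value no longer reveals the true category with certainty.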

Various authors pursue simulation strategies and present general approaches to “simulating” from a constrained version of the cumulative, empirical distribution function of the data. In 1993, Rubin asserted that the risk of identity disclosure could be eliminated by the use of synthetic data (in his case using Bayesian methodology and multiple imputation techniques) because there is no direct function link between the original data and the released data.38 Said another way, the data remain confidential because simulated individuals have replaced all of the real ones. Raghunathan, Reiter, and Rubin39 provide details on the implementation of this approach. Abowd and Woodcock (for their chapter in Doyle et al., 2001)40 describe a detailed application of multiple imputation and related simulation technology for a longitudinally linked individual and work history dataset. With both simulation and multiple-imputation methodology, however, it is still possible that the data values of some simulated individuals remain virtually identical to those in the original sample, or at least close enough that the possibility of both identity and attribute disclosure remain. As a result, checks should be made for the possibility of unacceptable disclosure risk.

Another important feature of the statistical simulation approach is that information on the variability of the dataset is directly accessible to the user. For example, in the Fienberg, Makov, and Steele41 approach for categorical data, the data user can begin with the reported table and information about the margins that are held fixed, and then run the Diaconis-Sturmfels Markov chain Monte Carlo algorithm to regenerate the full distribution of all possible tables with those margins. This technique allows the user to make inferences about the added variability in a modeling context that is similar to the approach to inference in Gouweleeuw et al.37 Similarly, Raghunathan and colleagues proposed the use of multiple imputations to directly measure the variability associated with the posterior distribution of the quantities of interest.39 As a consequence, Rubin showed that simulation and perturbation methods represent a major improvement in access to data over cell suppression and data swapping without sacrificing confidentiality. These methods also conform to the statistical principle allowing the user of released data to apply standard statistical operations without being misled.
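For a two-way table, the random walk over tables with fixed margins can be sketched as follows. This is a simplified illustration rather than the cited authors' implementation: it uses the basic moves that add +1 and -1 in a checkerboard pattern to a 2 × 2 subtable and accepts every move that keeps all counts nonnegative, so it explores tables uniformly. Targeting another distribution would require an additional Metropolis acceptance step. The table values and function names are hypothetical.

```python
import random


def margins(table):
    """Return (row totals, column totals) of a two-way table."""
    return [sum(r) for r in table], [sum(c) for c in zip(*table)]


def sample_tables(table, steps, rng):
    """Random walk over nonnegative integer tables sharing the margins of `table`."""
    t = [row[:] for row in table]
    n_rows, n_cols = len(t), len(t[0])
    visited = []
    for _ in range(steps):
        i, k = rng.sample(range(n_rows), 2)   # two distinct rows
        j, l = rng.sample(range(n_cols), 2)   # two distinct columns
        sign = rng.choice((1, -1))
        # Basic move: +sign at (i, j) and (k, l), -sign at (i, l) and (k, j).
        proposal = {
            (i, j): t[i][j] + sign, (k, l): t[k][l] + sign,
            (i, l): t[i][l] - sign, (k, j): t[k][j] - sign,
        }
        if min(proposal.values()) >= 0:       # reject moves that go negative
            for (r, c), value in proposal.items():
                t[r][c] = value
        visited.append([row[:] for row in t])
    return visited


observed = [[3, 1, 4], [2, 5, 0], [1, 2, 6]]  # hypothetical counts
chain = sample_tables(observed, steps=500, rng=random.Random(7))
print(margins(chain[-1]) == margins(observed))
```

Each move leaves every row and column total unchanged, which is why the user needs only the released margins to run the walk.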

There has been considerable research on disclosure limitation methods for tabular data, especially in the form of multiway tables of counts (contingency tables). The most popular methods include a process known as cell suppression, which systematically deletes the values in selected cells in the table and collapses categories. This process is a form of aggregation. While cell suppression methods have been very popular among the U.S. Government statistical agencies, and are useful for tables with nonnegative entries rather than simple counts, they also have major drawbacks. First, good algorithms do not yet exist for the methodology when it is associated with high-dimensional tables. More importantly, the methodology systematically distorts the information about the cells in the table for users, and, as a consequence, makes it difficult for secondary users to draw correct statistical inferences about the relationships among the variables in the table. For further discussion of cell suppression and extensive references, see the various chapters in Doyle et al.,40 notably the one by Duncan and his collaborators.
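One practical difficulty with cell suppression is that a suppressed cell can often be recovered by subtraction when the margins are also published, which is why additional (complementary) cells usually have to be suppressed as well. The sketch below illustrates this with hypothetical counts, a hypothetical threshold of 5, and hypothetical function names; it is not an agency algorithm.

```python
def suppress_small_cells(table, threshold):
    """Primary suppression: replace counts strictly between 0 and threshold by None."""
    return [[None if 0 < v < threshold else v for v in row] for row in table]


def recover_from_margins(suppressed, row_totals, col_totals):
    """Fill in any suppressed cell that is the only gap in its row or column."""
    t = [row[:] for row in suppressed]
    changed = True
    while changed:
        changed = False
        for i, row in enumerate(t):
            gaps = [j for j, v in enumerate(row) if v is None]
            if len(gaps) == 1:
                t[i][gaps[0]] = row_totals[i] - sum(v for v in row if v is not None)
                changed = True
        for j in range(len(t[0])):
            col = [t[i][j] for i in range(len(t))]
            gaps = [i for i, v in enumerate(col) if v is None]
            if len(gaps) == 1:
                t[gaps[0]][j] = col_totals[j] - sum(v for v in col if v is not None)
                changed = True
    return t


table = [[20, 2, 15], [30, 25, 18], [12, 14, 40]]   # hypothetical counts
row_totals = [sum(r) for r in table]
col_totals = [sum(c) for c in zip(*table)]

suppressed = suppress_small_cells(table, threshold=5)
recovered = recover_from_margins(suppressed, row_totals, col_totals)
print(suppressed[0][1], recovered[0][1])
```

Here the single small cell is suppressed but can be restored exactly from its row total, so primary suppression alone gives no protection in this table.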

A special example of collapsing categories involves summing over variables to produce marginal tables. Instead of reporting the full multiway contingency table, one or more collapsed versions of it might be reported. The release of multiple sets of marginal totals has the virtue of allowing statistical inferences about the relationships among the variables in the original table using log-linear model methods (e.g., see Bishop, Fienberg, and Holland).42 With multiple collapsed versions, statistical theory makes it clear that one may have highly accurate information about the actual cell entries in the original table. As a result, the possibility of disclosures still requires investigation. In part to address this problem, a number of researchers have recently worked on the problem of determining upper and lower bounds for the cells of a multi-way table given a set of margins; however, other measures of risk may clearly be of interest. The problem of computing bounds is in one sense an old one, at least for two-way tables, but it is also deeply linked to recent mathematical developments in statistics and has generated a flurry of new research.43,44
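For a two-way table the classical bounds are simple to compute: given row total r, column total c, and grand total n, a cell must lie between max(0, r + c - n) and min(r, c). The sketch below applies these bounds to hypothetical margins; the function name is an assumption, and bounds for higher-dimensional tables require the more elaborate methods in the cited literature.

```python
def two_way_bounds(row_totals, col_totals):
    """Return a grid of (lower, upper) bounds for each cell of a two-way table,
    given only its row and column totals."""
    n = sum(row_totals)
    if n != sum(col_totals):
        raise ValueError("row and column totals must have the same grand total")
    return [
        [(max(0, r + c - n), min(r, c)) for c in col_totals]
        for r in row_totals
    ]


# Hypothetical margins for a 2 x 2 table with grand total 100.
bounds = two_way_bounds(row_totals=[3, 97], col_totals=[95, 5])
for row in bounds:
    print(row)
```

Narrow intervals, such as a cell known to lie between 0 and 3, indicate that the released margins alone reveal a small count, which is the kind of disclosure risk that bounds are used to screen for.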

The Risk-Utility Tradeoff

Common to virtually all the methodologies discussed in the preceding section is the notion of a risk-utility tradeoff, in which the risk of disclosure is balanced against the utility of the released data (e.g., see Duncan,36 Fienberg,45 and their chapter with others in Doyle et al.40). Keeping this risk low requires more extensive data masking, which limits the utility of what is released. Advocates for the use of simulated data often claim that this use eliminates the risk of disclosure, but others dispute this claim.

Privacy-Preserving Data Mining Methodologies

With the advances in data mining and machine learning over the past two decades, there have been a large number of methods introduced under the banner of privacy-preserving computation. The methodologies vary, and many of them focus on standard tools such as the addition of noise or data swapping of one sort or another. But the claims of identity protection in this literature are often exaggerated or unverifiable. For a discussion of some of these ideas and methods, see Fienberg and Slavkovic.44 For two recent interesting examples explicitly set in the context of medical data, see Malin and Sweeney46 and Boyens, Krishnan, and Padman.47

The common message of this literature is that privacy protection has costs measured in the lack of availability of research data. To increase the utility of released data for research, some measure of privacy protection, however small, needs to be sacrificed. It is nonetheless still possible to optimize utility, subject to predefined upper bounds on what is considered to be acceptable risk of identification. See a related discussion in Fienberg.48

Cryptographic Approaches to Privacy Protection

While the current risks of identification in modern databases are similar for statistical agencies and biomedical researchers, there are also new challenges from contemporary information repositories that store social network data (e.g., cell phone, MySpace, and Facebook data), product preferences data (e.g., Amazon), Web search data, and other sources of information not previously archived in a digital format. A recent literature emanating from cryptography focuses on algorithmic aspects of this problem with an emphasis on automation and scalability of a process for conferring anonymity. Automation, in turn, presents a fundamentally different perspective on how privacy is defined and provides for both a formal definition of privacy and proofs for how it can be protected. By focusing on the properties of the algorithm for anonymity, it is possible to formally guarantee the degree of privacy protection and the quality of the outputs in advance of data collection and publication.

This new approach, known as differential privacy, limits the incremental information a data user might learn beyond that which is known before exposure to the released statistics. No matter what external information is available, the differential privacy approach guarantees that essentially the same information is learned about an individual, whether or not information about the individual is present in the database. The papers by Dwork et al.49,50 provide an entry point to this literature. Differential privacy, as these authors describe it, works primarily through the addition of specific forms of noise to all data elements and the summary information reported, but it does not address issues of sampling or access to individual-level microdata. While these methods are intriguing, their utility for data linkages with registry data remains an open issue.
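The noise-addition idea can be illustrated for a single counting query. The following is a minimal sketch under stated assumptions, not a production mechanism: it adds Laplace noise with scale 1/epsilon to a count, on the assumption that adding or removing one person changes the count by at most 1. The function name `private_count`, the parameter values, and the count itself are hypothetical, and floating-point subtleties that matter in real deployments are ignored.

```python
import random


def private_count(true_count, epsilon, rng):
    """Return the count plus Laplace noise of scale 1/epsilon.

    Assumes the query is a simple count, so one individual's presence or
    absence changes the true answer by at most 1.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    # The difference of two independent exponential draws with rate epsilon
    # follows a Laplace distribution with scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


rng = random.Random(2010)  # fixed seed so the example is reproducible
releases = [private_count(100, epsilon=1.0, rng=rng) for _ in range(20000)]
mean_release = sum(releases) / len(releases)
print(round(mean_release, 1))
```

Smaller values of epsilon give stronger protection but noisier answers, which is one concrete form of the risk-utility tradeoff discussed earlier in this chapter.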

Security Practices, Standards, and Technologies

In general, people adopt two different philosophical positions about how the confidentiality associated with individual-level data should be preserved: (1) by “restricted or limited information,” that is, restrictions on the amount or format of the data released, and (2) by “restricted or limited access,” that is, restrictions on the access to the information itself.

If registry data are a public health good, then restricted access is justifiable only in situations where the confidentiality of data in the possession of a researcher cannot be protected through some form of restriction on the information released. Restricted access is intended to allow use of unaltered data by imposing certain conditions on users, analyses, and results that limit disclosure risk. There are two primary forms of restricted access. The first is through licensing, whereby users are legally bound by certain conditions, such as agreeing not to use data for re-identification and to accept advance review of publications. The licensure approach allows users to transfer data to their sites and use the software of their choice. The second approach is exemplified by research data centers, discussed in more detail below, and remote analysis servers, which are conceptually similar to data centers: users, and sometimes analyses, are evaluated in advance. The results are reviewed, and often limited, in order to limit risk of disclosure. The data remain at the holder’s site and computers; the difference is whether access is in person at a data center or using a remote analysis center via the World Wide Web.

Registries as Data Enclaves

Many statistical agencies have built enclaves, often referred to as research data centers, where users can access and use data in a regulated environment. In such settings, the security of computer systems is controlled and managed by the agency providing the data. Such environments may maximize data security. For a more extensive discussion of the benefits of restricted access, see the chapter by Dunne in Doyle et al.40

These enclaves incur considerable costs associated with their establishment and upkeep. A further limitation is that the enclave may require the physical presence of the data user, which also increases the overall cost to researchers working with the data. Moreover, such environments often prevent users from executing specialized data analyses, which may require programming and other software development beyond the scope of traditional statistical software packages made available in the enclave.

The process for access to data in enclaves or restricted centers involves an examination of the research credentials of those seeking access. In addition, these centers control the physical access to confidential data files and they review the materials that data users wish to take from the centers and to publish. Researchers who are accustomed to reporting residual plots and other information that allows for a partial reconstruction of the original data, at least for some variables, will encounter difficulties, because restricted data centers typically do not allow users to remove such information.


To limit the possibility of re-identification, data can be manipulated by the above techniques to mitigate risk. At the same time, it is important to ensure that researchers are accountable for the use of the datasets that are made available to them. Best practices in data security should be adopted with specific emphasis on authentication, authorization, access control, and auditing. In particular, each data recipient should be assigned a unique login identification (ID), or, if the data are made available online, access may be provided through a query-response server. Prior to each session of data access, data custodians should authenticate the user’s identity. Access to information should be controlled either in a role-based or information-based manner. Each user access and query to the data should be logged to enable auditing functions. If there is a breach in data protection, the data custodian can investigate the potential cause and make any required notifications.
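The combination of role-based authorization and audit logging described above might be sketched as follows. This is an illustrative outline only: the roles, permission names, and the `AuditedRegistry` class are hypothetical, authentication of the user is assumed to have happened before `query` is called, and a real system would write the log to protected, tamper-evident storage.

```python
from datetime import datetime, timezone

# Hypothetical mapping from role to the levels of data that role may query.
ROLE_PERMISSIONS = {
    "analyst": {"aggregate"},
    "custodian": {"aggregate", "record_level"},
}


class AuditedRegistry:
    """Checks each request against the user's role and logs every attempt."""

    def __init__(self, users):
        self._users = users      # unique login ID -> role
        self.audit_log = []      # one entry per access attempt

    def query(self, user_id, level, description):
        role = self._users.get(user_id)
        allowed = role is not None and level in ROLE_PERMISSIONS.get(role, set())
        # Log before answering so that denied attempts are also recorded.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user_id,
            "level": level,
            "query": description,
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"{user_id} may not run {level} queries")
        return f"results for: {description}"


registry = AuditedRegistry({"user001": "analyst"})
print(registry.query("user001", "aggregate", "counts by age group"))
try:
    registry.query("user001", "record_level", "all records for one clinic")
except PermissionError as error:
    print("denied:", error)
print(len(registry.audit_log))
```

Recording denied requests as well as granted ones gives the custodian a complete trail to review if a breach is suspected.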

Layered Restricted Access to Databases

In many countries, the traditional arrangement for data use involves restrictions on both information and access, with only highly aggregated data and summary statistics released for public use.

One potential strategy for privacy protection for the linkage of registries to other confidential data is a form of layered restrictions that combines two approaches with differing levels of access at different levels of detail in the data. The registry might function as an enclave, similar to those described above, and in addition, public access might be limited to only aggregate data. Between these two extremes there might be several layers of restricted access. An example is licensing that includes privacy protection, requiring greater protection as the potential for disclosure risk increases.

Such a layered approach might require a broader interpretation of the HIPAA Privacy Rule restrictions for certain kinds of medical records5 or different forms of releases for patient records. The HIPAA Privacy Rule’s detailed approach to releasing data can be shown to protect individual data only partially, and at the same time, to unnecessarily restrict access to medical record data for research purposes. As a result, there is a need to develop a clearer sense of how health information subject to the HIPAA Privacy Rule might be linked with registry data and subsequently protected. Such clarifications could allow for more complete research data while offering protection against the risks of identity disclosure to individuals and health care providers.


This chapter describes technical and current legal considerations for researchers interested in creating data linkage projects involving registry data. The discussion of the HIPAA Privacy Rule provides a basis for understanding the conditions under which the use and disclosure of protected health information (PHI) is permitted for research and other purposes relevant to registries. These conditions determine whether and how the linkage of certain datasets may be legally feasible. In addition, the chapter presents typical methods for record linkage that are likely to form the basis for the construction of data linkage projects. It also discusses both the hazards for re-identification created by data linkage projects, and the statistical methods used to minimize the risk of re-identification. Two topics not covered in this chapter are: (1) considerations about linking data from public and private sectors, where different, perhaps conflicting, ethical and legal restrictions may apply, and (2) the risks involved in identifying the health care providers that collect and provide data.

Dataset linkage entails the risks of loss of reliable confidential data management and of identification or re-identification of individuals and institutions. Recognized and developing statistical methods and secure computation may limit these risks and allow the public the health benefits that registries linked to other datasets have the potential to contribute.

Summary of Legal and Technical Planning Questions

The questions in Tables 9 and 10 are intended to assist in the planning of data linkage projects that involve using registry data plus other files. Registry operators should use the answers to these questions to assemble necessary information and other resources to guide planning for their data linkage projects. Like the preceding discussion, this section considers regulatory and technical questions.

The assumptions listed below in Table 9 apply to the regulatory questions that follow. Their application to the proposed data linkage project should be confirmed or determined.

  • The HIPAA Privacy Rule applies to the initial data sources.
  • Other laws may restrict access or use of the initial data sources.
  • The Common Rule or FDA regulations may or may not apply to data linkage.
  • The Common Rule or FDA regulations may or may not apply to the original datasets.

Different regulatory concerns arise depending on the answers to each category of the following questions. Consult as necessary with experienced health services, social science, or statistician colleagues; and with regulatory personnel (e.g., the agency Privacy Officer) or legal counsel to clarify answers for specific data linkage projects.

References for Chapter 7

Clayton E. Ethical, legal, and social implications of genomic medicine (Review) New England Journal of Medicine. 2003;349:562–9. [PubMed: 12904522]
Louis Harris and Associates. Health Care Information Privacy: A Survey of the Public and Leaders. Conducted for EQUIFAX Inc; 1993.
Gottlieb S. US employer agrees to stop genetic testing –Burlington Northern Santa Fe News. British Medical Journal. 2001;322:449. [PMC free article: PMC1119680] [PubMed: 11222414]
Sterling R, Henderson G, Corbie-Smith G. Public willingness to participate in and public opinions about genetic variation research: a review of the literature. American Journal of Public Health. 2006;96:1971–8. [PMC free article: PMC1751820] [PubMed: 17018829]
Institute of Medicine, National Academy of Science. Beyond the HIPAA Privacy Rule: Enhancing Privacy, Improving Health Through Research. In: Nass SJ, et al., editors. Committee on Health Research and the Privacy of Health Information. Washington, DC: National Academies Press; 2009. [PubMed: 20662116]
Beckerman JZ, Pritts J, Goplerud E, et al. Health Information Privacy, Patient Safety, and Health Care Quality: Issues and Challenges in the Context of Treatment for Mental Health and Substance Use. BNA’s Health Care Policy Report. 2008 Jan 14;16(2):3–10.
Solove D. A taxonomy of privacy. University of Pennsylvania Law Review. 2006;154:477–560.
Duncan GT, Jabine TB, de Wolf VA, editors. Committee on National Statistics. Washington, DC: National Research Council and the Social Science Research Council, National Academy Press; 1993. Private Lives and Public Policies: Confidentiality and Accessibility of Government Statistics. Panel on Confidentiality and Data Access.
Fienberg SE. Encyclopedia of Social Measurement. Vol. 2. Academic Press; 2005. Confidentiality and disclosure limitation; pp. 463–9.
Federal Committee on Statistical Methodology: Report on statistical disclosure limitation methodology. Statistical Policy Working paper. 2005. [Accessed May 15, 2010]. Publication No. NTIS PB94-165305. Available at http://www​​/spwp22.html.
45 C.F.R. 160.103.
Fellegi IP, Sunter AB. A Theory for Record Linkage. Journal of the American Statistical Association. 1969;64:1183–1210.
Bilenko M, Mooney R, Cohen WW, et al. Adaptive name matching in information integration. IEEE Intelligent Systems. 2003;18(5):16–23.
Herzog TN, Scheuren FJ, Winkler WE. Data Quality and Record Linkage Techniques. New York: Springer-Verlag; 2007.
Winkler WE. Publication No. RR 2006/02. US Census Bureau; Overview of record linkage and current research directions.
Christen P, Churches T, Hegland M. A parallel open source data linkage system. 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining; Sydney, AUS. May 2004.
Abowd J, Vilhuber L. The Sensitivity of Economic Statistics to Coding Errors in Personal Identifiers (with discussion) Journal of Business and Economics Statistics. 2005;23(2):133–165.
Lyons RA, Jones KH, John G, et al. The SAIL databank: linking multiple health and social care datasets. BMC Med Inform Decis Mak. Jan 16, 2009. [Accessed 28 June 2010]. p. 3. Available at http://www.biomedcentral.com/1472-6947/9/3. [PMC free article: PMC2648953] [PubMed: 19149883]
Karr AF, Fulp WJ, Lin X, et al. Secure, privacy-preserving analysis of distributed databases. Technometrics. 2007;49(3):335–45.
Karr AF, Lin X, Sanil AP, Reiter JP. Privacy-preserving analysis of vertically partitioned data using secure matrix products. Journal of Official Statistics. 2009;25(1):125–138.
Rivest RL, Adleman L, Dertouzos ML. On data banks and privacy homomorphisms. In: DeMillo R, editor. Foundations of Secure Computation. New York: Academic Press; 1978.
Golle P. Revisiting the uniqueness of simple demographics in the U.S. population. ACM Workshop on Privacy in the Electronic Society; 2006. pp. 77–80.
Sweeney L. Uniqueness of simple demographics in the US population. Carnegie Mellon University Data Privacy Laboratory; Pittsburgh, PA: 2000. Report Number LIDAP-WP04.
Karr AF, Banks DL, Sanil AP. Data quality: A statistical perspective. Statistical Methodology. 2006;3(2):137–73.
45 CFR 164.514(b).
Sweeney L. Weaving technology and policy together to maintain confidentiality. Journal of Law, Medicine, and Ethics. 1997;25:98–110. [PubMed: 11066504]
45 CFR 164.514(b)(2)(i).
Malin B. An evaluation of the current state of genomic data privacy protection technology and a roadmap for the future. Journal of the American Medical Informatics Association. 2005;12:28–34. [PMC free article: PMC543823] [PubMed: 15492030]
Lin Z, Owen A, Altman R. Genetics: genomic research and human subject privacy. Science. 2004;305:183. [PubMed: 15247459]
Homer N, Szelinger S, Redman M, et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genetics. 2008;4:e1000167. [PMC free article: PMC2516199] [PubMed: 18769715]
Cassa C, Schmidt B, Kohane I, et al. My sister’s keeper? genomic research and the identifiability of siblings. BMC Medical Genomics. 2008;1:32. [PMC free article: PMC2503988] [PubMed: 18655711]
Rothstein MA. Genetic secrets: promoting privacy and confidentiality in the genetic era. New Haven: Yale University Press; 1997.
Kass N, Medley A. Genetic screening and disability insurance: what can we learn from the health insurance experience. Journal of Law, Medicine, and Ethics. 2007;35:66–73. [PubMed: 17543060]
Phelan JC. Geneticization of deviant behavior and consequences for stigma: the case of mental illness. Journal of Health and Social Behavior. 2005;46:307–22. [PubMed: 16433278]
Pub. L. 110–233.
Duncan GT. Confidentiality and statistical disclosure limitation. In: Smelser N, Baltes P, editors. International Encyclopedia of the Social and Behavioral Sciences. Vol. 4. New York: Elsevier; 2001. pp. 2521–5.
Gouweleeuw JM, Kooiman P, Willenborg LCRJ, et al. Post randomization for statistical disclosure control: Theory and implementation. Journal of Official Statistics. 1998;14:463–78.
Rubin Donald B. Discussion: Statistical Disclosure Limitation. Journal of Official Statistics. 1993;9(2):461–8.
Raghunathan TE, Reiter J, Rubin DB. Multiple imputation for statistical disclosure limitation. Journal of Official Statistics. 2003;19:1–16.
Doyle P, Lane J, Theeuwes J, et al., editors. Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies. New York: Elsevier; 2001.
Fienberg SE, Makov UI, Steele RJ. Disclosure Limitation Using Perturbation and Related Methods for Categorical Data (with discussion) Journal of Official Statistics. 1998;14(4):485–511.
Bishop YMM, Fienberg SE, Holland PW. Discrete Multivariate Analysis: Theory and Practice. Cambridge, MA: MIT Press; New York: Springer-Verlag; 1995. Reprinted 2007.
Dobra Adrian, Fienberg Stephen E. Bounds for cell entries in contingency tables given marginal totals and decomposable graphs. Proceedings of the National Academy of Sciences. 2000;97(22):11885–92. [PMC free article: PMC17264] [PubMed: 11050222]
Fienberg SE, Slavkovic AB. Preserving the confidentiality of categorical data bases when releasing information for association rules. Data Mining and Knowledge Discovery. 2005;11:155–80.
Fienberg SE. Statistical perspectives on confidentiality and data access in public health. Statistics in Medicine. 2001;20:1347–56. [PubMed: 11343356]
Malin B, Sweeney L. A secure protocol to distribute unlinkable health data. Proceedings of the American Medical Informatics Association Annual Meeting Symposium; Washington, DC: American Medical Informatics Association; 2005. [PMC free article: PMC1560734] [PubMed: 16779087]
Boyens C, Krishnan R, Padman R. On Privacy-Preserving Access to Distributed Heterogeneous Healthcare Information. Publication No. HICSS-37 2004); Proceedings of 37th Hawaii International Conference on System Sciences; 2009.
Fienberg SE. Privacy and Confidentiality in an e-Commerce World: Data Mining, Data Warehousing, Matching and Disclosure Limitation. Statistical Science. 2006;21:143–54.
Dwork C, McSherry F, Nissim K, et al. Calibrating noise to sensitivity in private data analysis. In: Halevi S, Rabin T, editors. TCC, Lecture Notes in Computer Science. Berlin: Springer-Verlag; 2006a. pp. 3876–84.
Dwork C, Kenthapadi K, McSherry F, et al. Our data, ourselves: Privacy via distributed noise generation. EUROCRYPT. 2006:486–503.

