NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Gliklich RE, Dreyer NA, Leavy MB, editors. Registries for Evaluating Patient Outcomes: A User's Guide [Internet]. 3rd edition. Rockville (MD): Agency for Healthcare Research and Quality (US); 2014 Apr.

16. Linking Registry Data With Other Data Sources To Support New Studies

1. Introduction

The purpose of this chapter is to identify important technical and legal considerations for researchers and research sponsors interested in linking data in a patient registry with additional data, such as data from claims or other administrative files or from another registry. Its goal is to help these researchers find an appropriate way to address their critical research questions, remain faithful to the conditions under which the data were originally collected, and protect individual patients by safeguarding their privacy and maintaining the confidentiality of the data under applicable law.

There are two equally important questions to address in the planning process: (1) What is a feasible technical approach to linking the data, and (2) is the linkage legally feasible under the permissions, terms, and conditions that applied to the original compilations of each data set? Legal feasibility depends on how Federal and State legal protections for the confidentiality of health information and for participation in human research apply to the specific purpose of the data linkage, and also on any specific permissions obtained from individual patients for the use of their health information. These projects require a great deal of analysis and planning, as the technical approach chosen may be influenced by permitted uses of the data under applicable regulations, while the legal assessment may change depending on how the linkage needs to be performed and the nature and purpose of the resulting linked data set.

Tables 16–1 and 16–2, respectively, list regulatory and technical questions for the consideration of data linkage project leaders during the planning of a project. The questions are intended to assist in organizing the resources needed to implement the project, including the statistical, regulatory, and collegial advice that might prove helpful in navigating the complexities of data linkage projects.

This chapter presumes that investigators have identified an explicit purpose for the data linkage in the form of a scientific question they are trying to answer. The nature of this objective is critical to an assessment of the applicable regulatory requirements for uses of the data. For example, to the extent the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule applies, the use or disclosure of protected health information (PHI) for the registry purpose will need to fall into one of the specific regulatory permissions and comply with the relevant requirements for the permission (e.g., health care quality–related activities, public health practice, research, or some combination of these purposes), or individual authorization must be obtained. If research is one purpose of the project, then the Common Rule (Federal human subjects protection regulations) is also likely to apply to the project. More information on HIPAA and the Common Rule is provided in Chapter 7.

Table 16–1. Legal planning questions.

Table 16–2. Technical planning questions.

The application of the HIPAA Privacy and Security Rules depends on the origins of the data sets being linked, and such origins may also influence the feasibility of making the data linkage. Investigators should know the source of the original data, the conditions under which they were compiled, and what kinds of permissions, from both individual patients and the custodial institutions, apply to the data. Health information is most often data with two sources, individual and institutional; these sources may have legal rights and continuing interests in the use of the data.

Legal requirements may change. In particular, the protections limiting the research use of health information are likely to change in response to the continued development of electronic health information technologies.

This chapter has six sections focusing on core issues in three major parts: Technical Aspects of Data Linkage Projects, Legal Aspects of Data Linkage Projects, and Risk Mitigation for Data Linkage Projects. The Technical Aspects section discusses the reasons for and technical methods of linking data sets containing health information, including data held in registries. It should be noted that this list of techniques is not intended to be comprehensive, and these techniques have limitations for certain types of studies. The Legal Aspects section defines important concepts, including the different definitions of “disclosure” as used by statisticians and in the HIPAA Privacy Rule. This section also discusses the risks of identification of individuals inherent in data linkage projects and describes the legal standards of the HIPAA Privacy Rule that pertain to these risks. Finally, the Risk Mitigation section summarizes both recognized and developing technical methods for mitigating the risks of identification. Appendix D consists of a hypothetical data linkage project intended to provide context for the technical and legal information presented below. Case Examples 36, 37, 38, and 39 describe registry-related data linkage activities. Chapter 18 provides information on analyzing linked data sets. While some of the concepts presented are applicable to other important nonpatient identities that might be at risk in data linkage, such as provider identities, those issues are beyond the scope of the discussion below.

2. Technical Aspects of Data Linkage Projects

2.1. Linking Records for Research and Improving Public Health

Data in registries regarding the health of individuals come in many forms. Most of these data were originally gathered for the delivery of clinical services or payment for those services, and under promises or legal guarantees of confidentiality, privacy, and security. The sources of data may include individual doctors' records, billing information, vital statistics on births and deaths, health surveys, and data associated with biospecimens, among other sources.

The broad goal of registries is to amass data from potentially diverse sources to allow researchers to explore and evaluate alternative health outcomes in a systematic fashion. This goal is usually accomplished by gathering data from multiple sources and linking the data across sources, either with explicit identifiers designed for linking, or in a probabilistic fashion via the characteristics of the individuals to whom the data correspond. From the research perspective, the more data included the better, both in terms of the number of cases and the details and the extent of the health information. The richer the database, the more likely it is that data analysts will be able to discover relationships that might affect or improve health care. On the other hand, many discussions about privacy protection focus on limiting the level of detail available in data to which others have access.

There is an ethical obligation to protect patient interests when collecting, sharing, and studying person-specific biomedical information.1 Many people fear that information derived from their medical or biological records will be used against them in employment decisions, result in limitations to their access to health or life insurance, or cause social stigma.2, 3 These fears are not unfounded, and there have been various cases in which it was found that an individual's genetic characteristics or clinical manifestations were used in a manner inconsistent with an individual's expectations of privacy and fair use.4 If individuals are afraid that their health-related information may be associated with them or used against them, they may be less likely to seek treatment in a clinical context or participate in research studies.5

A tension exists between the broad goals of registries and regulations protecting individually identifiable information. Approaches and formal methodologies that help mediate this tension are the principal technical focus of this chapter. To understand the extent to which these tools can assist data linkages involving registry data, one needs to understand the risks of identification in different types of data.

There is a large body of Federal law relating to privacy. A recent comprehensive review of privacy law and its effects on biomedical research identified no fewer than 15 separate Federal laws pertaining to health information privacy.6 There are also special Federal laws governing health information related to substance abuse.7 A full review of all laws related to privacy, confidentiality, and security of health information would also consider separate State privacy protections as well as State laws pertaining to the confidentiality of data. Nevertheless, the legal aspects of this chapter focus only on the Federal regulations commonly referred to as the HIPAA Privacy Rule.

2.2. What Do Privacy, Confidentiality, and Disclosure Mean?

Privacy is a term whose definition varies with context.8 In the HIPAA Privacy Rule, the term refers to protected health information (PHI), that is, individually identifiable health information transmitted or maintained by a covered entity or business associate, which is to be used or disclosed only as expressly permitted or required by the Rule and which is safeguarded against inappropriate uses and disclosures. The Privacy Rule addresses to whom the custodian of PHI, a covered entity or its business associate, may transmit the information and under what conditions. The Rule also establishes three levels of identifiability of health information: (1) fully identifiable data; (2) data that lack certain direct identifiers, otherwise known as a limited data set; and (3) de-identified data. Registries commonly acquire identifiable data and may create the last two categories of data in accordance with the Privacy Rule. Along this spectrum of data identifiability, the HIPAA Privacy Rule applies different legal standards and protections,5 extending the most stringent protections to data containing direct identifiers and none to de-identified information, which is not considered PHI. Not all registries contain PHI; Chapter 7 provides more information on how PHI is defined under HIPAA.

Confidentiality broadly refers to a quality or condition of protection accorded to statistical information as an obligation not to permit the transfer of that information to an unauthorized party.5 Confidentiality can be afforded to both individuals and health care organizations. A different notion of confidentiality, arising from the special relationship between a clinician and patient, refers to the ethical, legal, and professional obligation of those who receive information in the context of a clinical relationship to respect the privacy interests of their patients. Most often the term is used in the former sense and not in the latter, but these two meanings inevitably overlap in a discussion of health information as data. The methods for disclosure limitation described here have been developed largely in the context of confidentiality protection, as defined by laws, regulations, and especially by the practices of statistical agencies.

Disclosure for purposes of this discussion has two different meanings: one is technical and the other is regulatory and contained in the HIPAA Privacy Rule.

In the field of statistics, disclosure relates to the attribution of information to the source of the data, regardless of whether the data source is an individual or an organization. Three types of disclosure of data possess the capacity to make the identity of particular individuals known: identity disclosure, attribute disclosure, and inferential disclosure.

Identity disclosure occurs when the data source becomes known from the data release itself.9, 10

Attribute disclosure occurs when the released data make it possible to infer the characteristics of an individual data source more accurately than would have otherwise been possible.8, 9 The usual way to achieve attribute disclosure is through identity disclosure. First, one identifies an individual through some combination of variables and then learns about the values of additional variables included in the released data. Attribute disclosure may occur, however, without identity disclosure, such as when all people from a population subgroup share a characteristic and this quantity becomes known for any individual in the subgroup.
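The last case above, attribute disclosure without identity disclosure, can be screened for mechanically. The following is a minimal sketch, assuming records are held as Python dictionaries; the function name `homogeneous_groups`, the field names, and the data are hypothetical and purely synthetic. It flags subgroups in which every member shares the same value of a sensitive attribute, so that knowing someone belongs to the subgroup reveals the attribute.

```python
from collections import defaultdict

def homogeneous_groups(records, group_keys, sensitive_key):
    """Return subgroups in which all records share one sensitive value."""
    values_by_group = defaultdict(set)
    for rec in records:
        group = tuple(rec[k] for k in group_keys)
        values_by_group[group].add(rec[sensitive_key])
    # A group with exactly one distinct sensitive value discloses that
    # value for every member, even if no individual record is identified.
    return {g: next(iter(v)) for g, v in values_by_group.items() if len(v) == 1}

# Synthetic, illustrative records only.
records = [
    {"age_band": "60-69", "county": "A", "diagnosis": "diabetes"},
    {"age_band": "60-69", "county": "A", "diagnosis": "diabetes"},
    {"age_band": "60-69", "county": "A", "diagnosis": "diabetes"},
    {"age_band": "30-39", "county": "B", "diagnosis": "asthma"},
    {"age_band": "30-39", "county": "B", "diagnosis": "diabetes"},
]
risky = homogeneous_groups(records, ["age_band", "county"], "diagnosis")
```

Here only the ("60-69", "A") subgroup is flagged, because all of its members share a single diagnosis. A check of this kind is a screening aid, not a substitute for a formal disclosure risk assessment.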

Inferential disclosure relates to the probability of identifying a particular attribute of a data source. Because almost any data release can be expected to increase the likelihood of an attribute being associated with a data source, the only way to guarantee protection is to release no data at all. It is for this reason that researchers use certain methods not to prevent disclosure, but to limit or control the nature of the disclosure. These methods are known as disclosure limitation methods or statistical disclosure control.11

Disclosure under the HIPAA Privacy Rule means the release, transfer, provision of, access to, or divulging in any other manner, of information outside of the entity holding the information.12

2.3. Linking Records and Probabilistic Matching

Computer-assisted record linkage goes back to the 1950s, and was put on a firm statistical foundation by Fellegi and Sunter.13 Most common techniques for record linkage either rely on the existence of unique identifiers or use a structure similar to the one Fellegi and Sunter described with the incorporation of formal statistical modeling and methods, as well as new and efficient computational tools.14, 15 The simplest way to match records from separate databases is to use a so-called “deterministic” method of linking databases in which unique identifiers exist for each record. In the United States, when these identifiers exist they might be names or Social Security numbers; however, these particular identifiers may not in fact be unique.16 As a result, some form of probabilistic approach is typically used to match the records. Thus, there is little actual difference between methods using deterministic versus probabilistic linkage, except for the explicit representation of uncertainty in the matching process in the latter.

The now-standard approach to record linkage is built on five key components for identifying matching pairs of records across two databases:13

  1. Represent every pair of records using a vector of features (variables) that describe similarity between individual record fields. Features can be Boolean, discrete, or continuous.
  2. Place feature vectors for record pairs into three classes: matches (M), non-matches (U), and possible matches (P). These correspond to “equivalent,” “nonequivalent,” and “possibly equivalent” (e.g., requiring human review) record pairs, respectively.
  3. Perform record-pair classification by calculating the likelihood ratio P(γ | M) / P(γ | U) for each candidate record pair, where γ is a feature vector for the pair and P(γ | M) and P(γ | U) are the probabilities of observing that feature vector for a matched and non-matched pair, respectively. Two thresholds based on desired error levels, Tμ and Tλ, optimally separate the ratio values for equivalent, possibly equivalent, and nonequivalent record pairs.
  4. When no training data in the form of duplicate and nonduplicate record pairs are available, matching can be unsupervised; that is, conditional probabilities for feature values are estimated using observed frequencies in the records to be linked.
  5. Most record pairs are clearly nonmatches, so one need not consider them for matching. This situation is managed by “blocking,” or partitioning the databases based on geography or some other variable in both databases, so that only records in comparable blocks are compared. Such a strategy significantly improves efficiency.

The first four components lay the groundwork for accurate record-pair matching using statistical or machine-learning prediction models such as logistic regression. The fewer identifiers used in steps 1 and 2, the poorer the match is likely to be. Accuracy is well known to be high when there is a one-to-one match between records in the two databases, and it deteriorates as the overlap between the files decreases and as the measurement error in the feature values increases.

The fifth component provides for efficient processing of large databases, but to the extent that blocking is approximate and possibly inaccurate, its use decreases the accuracy of record-pair matching. The less accurate the matching, the more error (i.e., records not matched or matched inappropriately) there will be in the merged registry files. This error will impede the quality of analyses and findings from the resulting data.17-19
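The components described above can be illustrated with a small sketch. It is not a production linkage system and makes several loud assumptions: the per-field agreement probabilities in `M_PROB` and `U_PROB` are invented rather than estimated (so component 4 is not implemented), fields are treated as conditionally independent so that the likelihood ratio factors by field, the cutoffs `UPPER` and `LOWER` merely stand in for the two thresholds, and the blocking variable `zip3` and all records are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical agreement probabilities per field (assumptions, not estimates):
# m = P(field agrees | true match), u = P(field agrees | non-match).
M_PROB = {"surname": 0.95, "birth_year": 0.98, "sex": 0.99}
U_PROB = {"surname": 0.01, "birth_year": 0.02, "sex": 0.50}

# Illustrative upper and lower cutoffs on the log2 likelihood ratio.
UPPER, LOWER = 5.0, 0.0

def compare(rec_a, rec_b):
    """Component 1: Boolean feature vector of field-by-field agreement."""
    return {f: rec_a[f] == rec_b[f] for f in M_PROB}

def log_ratio(gamma):
    """Component 3: log2 of P(gamma | M) / P(gamma | U), assuming
    conditional independence of the fields."""
    total = 0.0
    for field, agrees in gamma.items():
        m, u = M_PROB[field], U_PROB[field]
        total += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return total

def classify(score):
    """Component 2: assign a pair to match, non-match, or possible match."""
    if score >= UPPER:
        return "match"
    if score <= LOWER:
        return "non-match"
    return "possible match"

def link(file_a, file_b, block_key="zip3"):
    """Component 5: compare only pairs that fall in the same block."""
    blocks = defaultdict(list)
    for rec_b in file_b:
        blocks[rec_b[block_key]].append(rec_b)
    results = {}
    for rec_a in file_a:
        for rec_b in blocks.get(rec_a[block_key], []):
            score = log_ratio(compare(rec_a, rec_b))
            results[(rec_a["id"], rec_b["id"])] = classify(score)
    return results

# Synthetic records for illustration.
file_a = [
    {"id": "a1", "surname": "smith", "birth_year": 1950, "sex": "F", "zip3": "021"},
    {"id": "a2", "surname": "jones", "birth_year": 1962, "sex": "M", "zip3": "021"},
]
file_b = [
    {"id": "b1", "surname": "smith", "birth_year": 1950, "sex": "F", "zip3": "021"},
    {"id": "b2", "surname": "jonas", "birth_year": 1962, "sex": "M", "zip3": "021"},
    {"id": "b3", "surname": "smith", "birth_year": 1950, "sex": "F", "zip3": "100"},
]
labels = link(file_a, file_b)
```

In this toy run the pair (a1, b1) is labeled a match and (a2, b2), which differs only in the spelling of the surname, falls into the possible-match band that would be sent for human review. The pair (a1, b3) is never compared because the two records fall in different blocks, which illustrates how approximate blocking can cost accuracy.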

This standard approach has problems when (1) there are lists or files with little overlap, (2) there are undetected duplications within files, and (3) one needs to link three or more lists. In the last case, one essentially matches all lists in pairs, and then resolves discrepancies. Unfortunately, there is no single agreed-upon way to do this, but some principled approaches have recently been suggested.20

Record linkage methodology has been widely used by statistical agencies, especially the U.S. Census Bureau. The methodology has been combined with disclosure limitation techniques such as the addition of “noise” to variables in order to produce public use files that the agencies believe cannot be linked back to the original databases used for the record linkage. Another technique involves protecting individual databases by stripping out identifiers and then attempting record linkage. This procedure has two disadvantages: first, the quality of matches is likely to decrease markedly; and second, the resulting merged records will still need to be protected by some form of disclosure limitation. Therefore, as long as there are no legal restrictions against the use of identifiers for record linkage purposes, it is preferable to use detailed identifiers to the extent possible and to remove them following the matching procedure.

Currently there are no special features of registry data known to enhance or inhibit matching. Registry data may be easier targets for re-identification because the specifics of diseases or conditions usually help to define the registries. In the United States, efforts are often made to match records using Social Security numbers. There are large numbers of entry errors for these numbers in many databases, and there are problems associated with multiple people using one number and some people using multiple numbers.16 Lyons and colleagues describe a very large-scale matching exercise in the United Kingdom linking multiple health care and social services data sets using National Health Service numbers and various alternative sets of matching variables in the spirit of the record linkage methods described above. They report achieving accurate matching at rates of only about 95 percent.21

2.4. Procedural Issues in Linking Data Sets

It is important to understand that neither data nor link can be unambiguously defined. For instance, a data set may be altered by the application of tools for statistical disclosure limitation, in which case it is no longer the same data set. Linkage need not mean, as it is customarily construed, “bringing the two (or more) data sets together on a single computer.” Many analyses of interest can be performed using technologies that do not require literal integration of the data sets.

Even the relationship between data sets can vary. Two data sets can hold the same attributes for different individuals (horizontal partitioning); for example, one data set may contain information for individuals born before a certain date, while a second data set contains the same information for individuals born after that date. Or, two data sets may contain different attributes for the same individuals (vertical partitioning); for example, one data set may contain clinician-reported information for a set of individuals, while a second data set contains laboratory data for the same individuals. Finally, some data sets may contain a complex combination of different attributes for different individuals.

The process of linking horizontally partitioned data sets engenders little incremental risk of re-identification. There is, in almost all cases, no more information about a record on the combined data set than was present in the individual data set containing it. Moreover, any analysis requiring only data summaries (i.e., in technical terms, sufficient statistics) that are additive across the data sets can be performed using tools based on the computer science concept of secure summation.22 Examples of analyses for which this approach works include creation of contingency tables, linear regression, and some forms of maximum likelihood estimation.
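As a rough illustration of the secure summation idea, the sketch below simulates a common ring-style version of the protocol in a single process. The first site adds a random mask to its local summary, each later site adds its own value to the masked running total, and the first site removes the mask at the end. The function name `secure_sum`, the modulus, and the counts are assumptions made for this example; a real deployment would also need secure channels between sites and an assumption that sites do not collude.

```python
import secrets

# Assumed to be larger than any true total, so the result is not wrapped.
MODULUS = 2**61 - 1

def secure_sum(local_values):
    """Simulate ring-based secure summation of integer local summaries."""
    mask = secrets.randbelow(MODULUS)             # known only to the first site
    running = (mask + local_values[0]) % MODULUS  # first site sends a masked value
    for value in local_values[1:]:
        # Each later site sees only a masked partial sum, never a raw value.
        running = (running + value) % MODULUS
    return (running - mask) % MODULUS             # first site removes its mask

# Synthetic counts for one contingency-table cell held by three registries.
cell_counts = [120, 75, 310]
total = secure_sum(cell_counts)
```

Because the summaries are additive, the same routine could be applied cell by cell to a contingency table, or element by element to the cross-product matrices used in linear regression, without any site revealing its own counts.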

Only in a few cases have comparable techniques for vertically partitioned data been well enough understood to be employed in practice.23 Instead, it is usually necessary to actually link individual subjects' records that are contained in two or more data sets. This process is inherently and unavoidably risky because the combined data set contains more information about each subject than either of the components.

Suppose that each of the two data sets to be linked contains the same unique identifiers (for individuals, an example is Social Security numbers) in all of the records. In this case, techniques based on cryptography (e.g., homomorphic encryption24 and hash functions) enable secure determination of individuals common to both data sets and assignment of unique but uninformative identifiers to the shared records. The combined data set can then be purged of individual identifiers and altered to further limit re-identification. These alterations will of necessity reduce the accuracy of standard statistical analyses compared with an unaltered data set.
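The hash-function variant of this idea can be sketched as follows. This is a simplified illustration, not homomorphic encryption and not a vetted protocol: it assumes both custodians (or a trusted third party) share a secret key, uses a keyed hash (HMAC-SHA-256 from the Python standard library) because the space of possible Social Security numbers is small enough that an unkeyed hash could be reversed by exhaustive guessing, and replaces the shared identifier with an arbitrary study number. The function names, field names, key, and records are all hypothetical.

```python
import hashlib
import hmac

def pseudonym(identifier, secret_key):
    """Keyed hash of a normalized identifier (hyphens and spaces removed)."""
    normalized = identifier.replace("-", "").strip()
    return hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()

def link_on_hashed_ids(file_a, file_b, secret_key, id_field="ssn"):
    """Find individuals common to both files and merge their records
    under an uninformative study identifier."""
    hashed_a = {pseudonym(r[id_field], secret_key): r for r in file_a}
    hashed_b = {pseudonym(r[id_field], secret_key): r for r in file_b}
    shared = sorted(hashed_a.keys() & hashed_b.keys())
    linked = []
    for study_id, token in enumerate(shared, start=1):
        # Drop the original identifier from both sides before merging.
        merged = {k: v for k, v in hashed_a[token].items() if k != id_field}
        merged.update({k: v for k, v in hashed_b[token].items() if k != id_field})
        merged["study_id"] = study_id
        linked.append(merged)
    return linked

# Synthetic records with obviously fictitious identifiers.
secret_key = b"example-key-held-by-trusted-party"
registry = [
    {"ssn": "000-00-0001", "diagnosis": "X"},
    {"ssn": "000-00-0002", "diagnosis": "Y"},
]
claims = [
    {"ssn": "000000002", "paid": 100},
    {"ssn": "000-00-0003", "paid": 50},
]
linked = link_on_hashed_ids(registry, claims, secret_key)
```

Only the one individual present in both files appears in the output, and the merged record carries a study number rather than the original identifier. Exact matching on hashes tolerates no entry errors in the identifier, which is one reason probabilistic methods remain in use.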

Such linkage techniques are computationally very complex, and may need to involve trusted third parties without access to information in either data set other than the common identifier.25 Therefore, in many cases the database custodian may prefer to remove identifiers and carry out statistical disclosure limitation prior to linkage. It is important to understand that this latter approach compromises, perhaps irrevocably, the linkage process, and may introduce substantial errors into the linked data set that later alter, perhaps dramatically, the results of statistical analyses.

Many techniques for record linkage depend at some level on the presence of combinations of attributes in both databases that are unique to individuals but do not lead to re-identification—a combination that may be difficult to find. For instance, the combination of date of birth, gender, and ZIP Code of residence might be present in both databases. It is estimated that this combination of attributes uniquely characterizes a significant portion of the U.S. population—somewhere between 65 and 87 percent, or even higher for certain subpopulations—so that re-identification would only require access to a suitable external database.26, 27 Other techniques such as the Fellegi-Sunter record linkage methods described above are more probabilistic in nature. They can be effective, but they also introduce data quality effects that cannot readily be characterized, and the intrinsic error associated with the matching will need to be accounted for in some fashion when the linked data set is analyzed. Simulations and sensitivity analyses may help clarify the extent of the issues here, but will rarely be sufficient.
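The uniqueness of an attribute combination in a particular file can be measured directly before that combination is used for linkage or released. The sketch below is a minimal illustration using synthetic records; the function name `fraction_unique` and the field names are assumptions. It computes the share of records whose combination of quasi-identifiers appears exactly once, and shows how coarsening date of birth to year of birth lowers that share.

```python
from collections import Counter

def fraction_unique(records, quasi_identifiers):
    """Share of records whose quasi-identifier combination occurs once."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    unique = sum(1 for count in combos.values() if count == 1)
    return unique / len(records)

# Synthetic, illustrative records only.
records = [
    {"dob": "1950-03-14", "sex": "F", "zip": "02138"},
    {"dob": "1950-03-14", "sex": "F", "zip": "02138"},
    {"dob": "1950-07-02", "sex": "F", "zip": "02138"},
    {"dob": "1971-11-30", "sex": "M", "zip": "02139"},
]
detailed = fraction_unique(records, ["dob", "sex", "zip"])

# Coarsen date of birth to year of birth and measure again.
coarsened = [dict(r, birth_year=r["dob"][:4]) for r in records]
coarse = fraction_unique(coarsened, ["birth_year", "sex", "zip"])
```

In this toy file, half of the records are unique on full date of birth, sex, and ZIP Code, and one quarter remain unique after coarsening. Uniqueness within a file understates or overstates population uniqueness depending on how the file was sampled, so figures of this kind are a starting point rather than a risk estimate.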

No matter how linkage is performed, a number of other issues should be addressed. For instance, comparable attributes should be expressed in the same units of measure in both data sets (e.g., English or metric values for weight). Also, conflicting values of attributes for each individual common to both databases need reconciliation.

Another issue involves the management of patient records that appear in only one database; the most common decision is to drop them. Data quality provides another example; it is one of the least understood statistical problems and has multiple manifestations.28 Even assuming some limited capability to characterize data quality, the relationship between the quality of the linked data set and the quality of each component should be considered. The linkage itself can produce quality degradation. For example, there is reason to believe that the quality of a linked data set is strictly less than that of either component, and not, as might be supposed, somewhere between the two.

Finally, it is important to understand that there exist endemic risks to data linkage. Anyone with access to one of the original data sets and the linked data set may learn, even if imperfectly, the values of attributes in the other. It may not be possible to determine what knowledge the linkage will create without actually executing the linkage. For these reasons, strong consideration should be given to forms of data protection such as licensing and restricted access in research data centers, where both analyses and results can be controlled.

3. Legal Aspects of Data Linkage Projects

3.1. Risks of Identification

The HIPAA Privacy Rule describes two methods for de-identifying health information.29 One method requires a formal determination by a qualified expert (e.g., a qualified statistician) that the risk is very small that an individual could be identified. The other method requires the removal of 18 specified identifiers of the individual and of the individual's relatives, household members, and employers, as well as no actual knowledge that the remaining information could be used alone or in combination with other information to identify the individual. (See Chapter 7 for more information.) For more information about methods of de-identification under the HIPAA Privacy Rule, see the recent HHS guidance published on this topic.30

The data removal process alone may not be sufficient to remove risks of re-identification.

Residual data especially vulnerable to disclosure threats include (1) geographic detail, (2) longitudinal information, and (3) extreme values (e.g., income). In addition, variables that are available in other accessible databases pose special risks.

Statistical organizations such as the National Center for Health Statistics have traditionally focused on the issue of identity disclosure and thus refused to report information in which individuals or institutions can be identified. Concerns about identity disclosure arise, for example, when a data source is unique in the population for the characteristics under study, and is directly identifiable in the database to be released. But such uniqueness and subsequent identity disclosure may not reveal any information other than the association of the source with the data collected in the study. In this sense, identity disclosure may only be a technical violation of a promise of confidentiality. Thus, uniqueness only raises the issue of possible confidentiality problems resulting from identification. A separate issue is whether the release of information is one that is permitted by the HIPAA Privacy Rule or is authorized by the data source.

The foregoing discussion implicitly introduces the notion of “harm,” which is not the same as a breach of confidentiality. For example, it is possible for a pledge of confidentiality to be technically violated, but produce no harm to the data source because the information is “generally known” to the public. In this case, some would argue that additional data protection is not required. Conversely, information on individuals or organizations in a release of sample statistical data may well increase the information about characteristics of individuals or organizations not in the sample. This information may produce an inferential disclosure for such individuals or organizations and cause them harm, even though there was no confidentiality obligation. Skinner31 suggests the separation of assessment of disclosure potential from harm.

Figure 16–1 depicts the overlapping relationships among confidentiality, disclosure, and harm.

The figure uses three overlapping circles to depict the relationships among confidentiality, disclosure, and harm. The circles are labeled “Disclosure,” “Confidentiality Obligations,” and “Harm.” The circles all overlap in the center of the figure.

Figure 16–1. Relationships among confidentiality, disclosure, and harm.

Some people believe that the way to ensure confidentiality and prevent identity disclosure is to arrange for individuals to participate in a study anonymously. In many circumstances, such a belief is misguided, because there is a key distinction between collecting information anonymously and ensuring that personal identifiers are not inappropriately made available. Moreover, clinical health care data are simply not collected anonymously. Not only do patient records come with multiple identifiers crucial to ensuring patient safety for clinical care, but they also contain other information that may allow the identification of patients even if direct identifiers are stripped from the records.

Moreover, health- or medicine-related data may also come from sample surveys in which the participants have been promised that their data will not be released in ways that would allow them to be individually identified. Disclosure of such data can produce substantial harm to the personal reputations or financial interests of the participants, their families, and others with whom they have personal relationships. For example, in the pilot surveys for the National Household Seroprevalence Survey, the National Center for Health Statistics moved to make responses during the data collection phase of the study anonymous because of the harm that could potentially result from information that an individual had an HIV infection or engaged in high-risk behavior. But such efforts still could not guarantee that one could not identify a participant in the survey database.

The question about the confidentiality of registry data persists after an individual's death, in part because of the potential for harm to others. The health information of decedents is subject to the HIPAA Privacy Rule until 50 years after their death (see Chapter 7 for more information), and several statistical agencies explicitly treat the identification of a deceased individual as a violation of their confidentiality obligations.

3.1.1. Examples of Patient Re-Identification

For years, the confidentiality of health information has been protected through a process of “de-identification.” This protection entails the removal of person-specific features such as names, residential street addresses, phone numbers, and Social Security numbers. However, as discussed above, de-identification does not guarantee that individuals cannot be identified from the resulting data. On multiple occasions, it has been shown that de-identified health information can be “re-identified” to a particular patient without hacking or breaking into a private health information system. For instance, before the HIPAA de-identification standards were created, Latanya Sweeney, a graduate student at the Massachusetts Institute of Technology in the mid-1990s, showed that de-identified hospital discharge records, which were made publicly available at the State level, could be linked to identifiable public records in the form of voter registration lists. Her demonstration drew wide attention because it led to the re-identification of the medical status of the then-governor in the Commonwealth of Massachusetts.32 This result was achieved by linking the data resources on their common fields of patient's date of birth, gender, and ZIP Code. As noted earlier, this combination identifies unique individuals in the United States at a rate estimated at somewhere between 65 and 87 percent or even higher in certain subpopulations.27
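For a data custodian assessing this kind of risk before a release, the linkage described above amounts to an exact join on the shared fields. The sketch below is a hypothetical risk check on entirely synthetic data, not a reconstruction of the original demonstration; the function name `unique_links`, the field names, and the records are assumptions. It reports only those de-identified records that match exactly one entry in an identified list.

```python
from collections import defaultdict

KEYS = ("dob", "sex", "zip")

def unique_links(deidentified, identified, keys=KEYS):
    """Pair each de-identified record with a named record when exactly
    one named record shares its values on the linking fields."""
    by_key = defaultdict(list)
    for person in identified:
        by_key[tuple(person[k] for k in keys)].append(person)
    links = []
    for rec in deidentified:
        candidates = by_key.get(tuple(rec[k] for k in keys), [])
        if len(candidates) == 1:  # ambiguous or absent matches are not counted
            links.append((candidates[0]["name"], rec["diagnosis"]))
    return links

# Synthetic records; names and diagnoses are placeholders.
discharge = [
    {"dob": "1945-07-31", "sex": "M", "zip": "02138", "diagnosis": "condition X"},
    {"dob": "1980-01-15", "sex": "F", "zip": "02139", "diagnosis": "condition Y"},
]
voter_list = [
    {"name": "Person A", "dob": "1945-07-31", "sex": "M", "zip": "02138"},
    {"name": "Person B", "dob": "1980-01-15", "sex": "F", "zip": "02139"},
    {"name": "Person C", "dob": "1980-01-15", "sex": "F", "zip": "02139"},
]
at_risk = unique_links(discharge, voter_list)
```

The first discharge record links to a single named person and is therefore at risk, while the second matches two people and remains ambiguous. Running such a check against plausible external files is one way to decide which fields need coarsening or suppression before release.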

3.1.2. High-Risk Identifiers

One response to the Sweeney demonstration was the HIPAA Privacy Rule method for de-identification by removal of data elements. This process requires the removal of 18 explicit identifiers from patient information before it is considered de-identified, including dates of birth and ZIP Codes. (See Chapter 7.)33 Nonetheless, even the removal of these data elements may fail to prevent re-identification, as there may be residual features that can lead to identification. The extent to which residual features can be used for re-identification depends on the availability of relevant data fields. Thus, one can roughly partition identifiers into “high-risk” and relatively “low-risk” features. The high-risk features are documented in multiple environments and publicly available. These features could be exploited by any recipient of such records. For instance, patient demographics are high-risk identifiers. Even de-identified health information permitted under the HIPAA Privacy Rule may leave certain individuals at risk for identification if the data are combined with public data resources containing similar features, such as public records containing birth, death, marriage, voter registration, and property assessment information.30, 34-36
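One rough way to gauge how much identifying power residual features retain is to count the records that are unique on a chosen set of fields. The sketch below assumes hypothetical field names (`birth_year`, `gender`, `zip3`) and toy records; it is an illustration of the idea, not a method prescribed by the HIPAA Privacy Rule.

```python
from collections import Counter


def unique_fraction(records, fields):
    """Share of records whose combination of `fields` occurs exactly once."""
    keys = [tuple(record[f] for f in fields) for record in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(records)


# Hypothetical records after removal of explicit identifiers.
records = [
    {"birth_year": "1950", "gender": "M", "zip3": "021"},
    {"birth_year": "1950", "gender": "M", "zip3": "021"},
    {"birth_year": "1982", "gender": "F", "zip3": "021"},
    {"birth_year": "1975", "gender": "F", "zip3": "946"},
]

risk_all = unique_fraction(records, ("birth_year", "gender", "zip3"))
risk_gender = unique_fraction(records, ("gender",))
```

Here half of the records are unique on all three fields, while none is unique on gender alone, which is the sense in which combinations of demographic fields carry higher risk than any single one.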

3.1.3. Relatively Low-Risk Identifiers

In contrast, lower risk data elements do not appear in public records and are less available. For instance, clinical features, such as an individual's diagnosis and treatments, are relatively static because they are often mapped to standard codes for billing purposes. These features might appear in de-identified information, such as hospital discharge databases, as well as in identified resources such as electronic medical records. While combinations of diagnostic and treatment codes might uniquely describe an individual patient in a population, the identifiable records are available to a much smaller group than the general public. Moreover, these select individuals, such as the clinicians and business associates of the custodial organization for the records, are ordinarily considered to be trustworthy, because they owe independent ethical, professional, and legal duties of confidentiality to the patients.

3.1.4. Special Issues With Linkages to Biospecimens

Health care is increasingly moving towards evidence-based and personalized systems. In support of this trend, there is a growing focus on associations between clinical and biological phenomena. In particular, the decreasing cost of genome sequencing technology has facilitated a rapid growth in the volume of biospecimens and derived DNA sequence data. As much of this research is sponsored through Federal funding, it is subject to Federal data sharing requirements. However, biospecimens, and DNA in particular, are inherently unique and there are a number of routes by which DNA information can be identified to an individual.37 For instance, there are over 1 million single nucleotide polymorphisms (SNPs) in the human genome; these little snippets of DNA are often used to make genetic correlations with clinical conditions. Yet it is estimated that fewer than 100 SNPs can uniquely represent an individual.38 Thus, if de-identified biological information is tied to sensitive clinical information, it may provide a match to the identified biological information—as, for example, in a forensic setting.39
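A back-of-the-envelope calculation shows why a small number of SNPs can suffice. The sketch below is an illustration only and rests on strong assumptions that are not from the source: SNPs are independent, genotypes follow Hardy-Weinberg proportions, every SNP has a minor allele frequency of 0.5, and the reference population is 7 billion unrelated people.

```python
def genotype_match_probability(maf):
    """Chance that two unrelated people share a genotype at one biallelic SNP."""
    p, q = 1.0 - maf, maf
    genotype_freqs = (p * p, 2 * p * q, q * q)  # Hardy-Weinberg proportions
    return sum(f * f for f in genotype_freqs)


def snps_needed(population, maf):
    """Smallest number of SNPs for which fewer than one chance match is expected."""
    match = genotype_match_probability(maf)
    n = 0
    expected_matches = float(population)
    while expected_matches >= 1.0:
        n += 1
        expected_matches *= match
    return n


needed = snps_needed(7_000_000_000, 0.5)
```

Under these assumptions the per-SNP match probability is 0.375 and `needed` is 24, which is in line with the estimate that fewer than 100 SNPs can single out an individual. Real SNPs are correlated and relatives share genotypes, so practical figures are larger.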

Biospecimens and information derived from them are of particular concern because they can convey knowledge not only about the individual from whom they are derived, but also about other related individuals. For instance, it is possible to derive estimates about the DNA sequence of relatives.40

If the genetic information is predictive or diagnostic, it can adversely affect the ability of family members to obtain insurance and employment, or it may cause social stigmatization.41-43 The Genetic Information Nondiscrimination Act of 2008 (GINA) prohibits health insurers from using genetic information about individuals or their family members, whether collected intentionally or incidentally, in determining eligibility and coverage, or in underwriting and setting premiums.44 Insurers, in collaboration with external research entities, may request that policyholders undergo genetic testing, but a refusal to do so cannot be permitted to affect the premium or result in medical underwriting.

4. Risk Mitigation for Data Linkage Projects

4.1. Methodology for Mitigating the Risk of Re-Identification

The disclosure limitation methods briefly described in this section are designed to protect against identification of individuals in statistical databases, and are among the techniques that data linkage projects involving registries are most likely to use. One problem these methods do not address is the simultaneous protection of individual and institutional data sources. The discussion here also relates to the problems addressed by secure computation methodologies, which are explored in the next section.

4.1.1. Basic Methodology for Statistical Disclosure Limitation

Duncan45 categorizes the methodologies used for disclosure limitation in terms of disclosure-limiting masks, i.e., transformations of the data where there is a specific functional relationship (possibly stochastic) between the masked values and the original data. The basic idea of masking involves data transformations. The goal is to transform an n × p data matrix Z through pre- and post-multiplication and the possible addition of noise, such as depicted in Equation (1):

$$Z \rightarrow AZB + C \qquad (1)$$

where A is a matrix that operates on cases, B is a matrix that operates on variables, and C is a matrix that adds perturbations or noise to the original information. Matrix masking includes a wide variety of standard approaches to disclosure limitation:

  • Addition of noise
  • Release of a subset of observations (deleting rows from Z)
  • Cell suppression for cross-classifications
  • Inclusion of simulated data (addition of rows to Z)
  • Release of a subset of variables (deletion of columns from Z)
  • Switching of selected column values for pairs of rows (data swapping)

This list omits some methods, such as micro-aggregation and doubly random swapping, but it provides a general idea of the types of techniques being developed and applied in a variety of contexts, including medicine and public health. The possibilities of both identity and attribute disclosure remain even when a mask is applied to a data set, although the risks may be substantially diminished.
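Several items in the list can be written as special cases of a matrix mask. The sketch below assumes the mask has the form AZB + C, which is how the definitions of A, B, and C above read, and uses a small hypothetical data matrix. The helper names are illustrative.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def zeros(rows, cols):
    return [[0] * cols for _ in range(rows)]


def mask(Z, A, B, C):
    """Apply the mask AZB + C to the data matrix Z."""
    AZB = matmul(matmul(A, Z), B)
    return [[v + c for v, c in zip(row, noise)] for row, noise in zip(AZB, C)]


# Hypothetical data: 3 cases (rows) by 2 variables (columns).
Z = [[34, 120], [51, 135], [47, 128]]

# Release a subset of observations: A keeps rows 0 and 2.
A_subset = [[1, 0, 0], [0, 0, 1]]
rows_released = mask(Z, A_subset, identity(2), zeros(2, 2))

# Release a subset of variables: B keeps column 0.
B_subset = [[1], [0]]
cols_released = mask(Z, identity(3), B_subset, zeros(3, 1))

# Addition of noise: A and B are identities, C holds random perturbations.
rng = random.Random(0)
C_noise = [[rng.gauss(0, 1.0) for _ in range(2)] for _ in range(3)]
noisy = mask(Z, identity(3), identity(2), C_noise)
```

Data swapping and simulated rows can be expressed in the same framework with other choices of A and C; they are omitted here to keep the sketch short.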

Duncan suggests that we can categorize most disclosure-limiting masks as suppressions (e.g., cell suppression), recodings (e.g., collapsing of rows or columns, or swapping), or samplings (e.g., release of subsets), although he also allows for simulations as discussed below. Further, some masking methods alter the data in systematic ways (e.g., through aggregation or through cell suppression), whereas others do it through random perturbations, often subject to constraints for aggregates. Examples of perturbation methods are controlled random rounding, data swapping, and the post-randomization method (PRAM) of Gouweleeuw,46 which has been generalized by Duncan and others. One way to think about random perturbation methods is as restricted simulation tools. This characterization connects them to other types of simulation approaches.
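As an illustration of a random perturbation method, the following sketch applies the post-randomization idea to one categorical variable: each recorded value is replaced by a draw from a row of a transition matrix. The categories and transition probabilities are hypothetical, and the sketch does not reproduce any particular published implementation.

```python
import random

# Hypothetical transition probabilities: TRANSITION[original][published].
TRANSITION = {
    "never":   {"never": 0.90, "former": 0.05, "current": 0.05},
    "former":  {"never": 0.05, "former": 0.90, "current": 0.05},
    "current": {"never": 0.05, "former": 0.05, "current": 0.90},
}


def pram(values, transition, rng):
    """Replace each value with a draw from its row of the transition matrix."""
    masked = []
    for value in values:
        categories = list(transition[value].keys())
        weights = list(transition[value].values())
        masked.append(rng.choices(categories, weights=weights, k=1)[0])
    return masked


rng = random.Random(0)
original = ["never", "former", "current", "never"]
masked = pram(original, TRANSITION, rng)

# With an identity transition matrix nothing changes.
identity_transition = {c: {c: 1.0} for c in TRANSITION}
unchanged = pram(original, identity_transition, random.Random(1))
```

Because the transition matrix is known, an analyst can in principle correct aggregate estimates for the perturbation, which is what makes such methods "restricted simulation tools" rather than arbitrary noise.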

Various authors pursue simulation strategies and present general approaches to “simulating” from a constrained version of the cumulative, empirical distribution function of the data. In 1993, Rubin asserted that the risk of identity disclosure could be eliminated by the use of synthetic data (in his case using Bayesian methodology and multiple imputation techniques) because there is no direct functional link between the original data and the released data.47 Said another way, the data remain confidential because simulated individuals have replaced all of the real ones. Raghunathan, Reiter, and Rubin48 provide details on the implementation of this approach. Abowd and Woodcock (in their chapter in Doyle et al., 2001)49 describe a detailed application of multiple imputation and related simulation technology for a longitudinally linked individual and work history data set. With both simulation and multiple-imputation methodology, however, it is still possible that the data values of some simulated individuals remain virtually identical to those in the original sample, or at least close enough that the possibility of both identity and attribute disclosure remains. As a result, checks should be made for the possibility of unacceptable disclosure risk.
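One simple form such a check could take is a search for synthetic records that fall within a small distance of a real record. The sketch below is a hypothetical illustration: the records, the numeric fields, and the tolerance are all assumptions, and a real assessment would use a risk measure suited to the data.

```python
def close_matches(original, synthetic, tolerance):
    """Return (synthetic index, original index) pairs that nearly coincide."""
    flagged = []
    for j, synth in enumerate(synthetic):
        for i, orig in enumerate(original):
            if all(abs(a - b) <= tolerance for a, b in zip(orig, synth)):
                flagged.append((j, i))
                break  # one near-duplicate is enough to flag this record
    return flagged


# Hypothetical (age, body mass index) records.
original = [(54, 27.1), (61, 31.4)]
synthetic = [(54, 27.0), (40, 22.0)]

flagged = close_matches(original, synthetic, tolerance=0.5)
```

Flagged records would then be reviewed, regenerated, or withheld before release.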

Another important feature of the statistical simulation approach is that information on the variability of the data set is directly accessible to the user. For example, in the Fienberg, Makov, and Steele50 approach for categorical data, the data user can begin with the reported table and information about the margins that are held fixed, and then run the Diaconis-Sturmfels Markov chain Monte Carlo algorithm to regenerate the full distribution of all possible tables with those margins. This technique allows the user to make inferences about the added variability in a modeling context that is similar to the approach to inference in Gouweleeuw and colleagues.46 Similarly, Raghunathan and colleagues proposed the use of multiple imputations to directly measure the variability associated with the posterior distribution of the quantities of interest.48 As a consequence, Rubin showed that simulation and perturbation methods represent a major improvement in access to data over cell suppression and data swapping without sacrificing confidentiality. These methods also conform to the statistical principle allowing the user of released data to apply standard statistical operations without being misled.

There has been considerable research on disclosure limitation methods for tabular data, especially in the form of multidimensional tables of counts (contingency tables). The most popular methods include a process known as cell suppression, which systematically deletes the values in selected cells in the table and collapses categories. This process is a form of aggregation. While cell suppression methods have been very popular among the U.S. Government statistical agencies, and are useful for tables with nonnegative entries rather than simple counts, they also have major drawbacks. First, good algorithms do not yet exist for the methodology when it is associated with high-dimensional tables. More importantly, the methodology systematically distorts the information about the cells in the table for users, and, as a consequence, makes it difficult for secondary users to draw correct statistical inferences about the relationships among the variables in the table. For further discussion of cell suppression, and for extensive references, see the various chapters in Doyle et al.,49 notably the one by Duncan and his collaborators.
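A minimal sketch of primary cell suppression follows. The table and the threshold of 5 are hypothetical, and the second function illustrates a well-known weakness: when marginal totals are published and only one cell in a row is suppressed, the value can be recovered by subtraction, which is why additional (complementary) suppressions are normally required.

```python
def suppress_small_cells(table, threshold):
    """Replace nonzero counts below `threshold` with None (suppressed)."""
    return [[None if 0 < count < threshold else count for count in row]
            for row in table]


def recover_single_suppressed(row, row_total):
    """Recover a lone suppressed cell from a published row total."""
    missing = [i for i, count in enumerate(row) if count is None]
    if len(missing) != 1:
        return None  # zero or several suppressed cells: no direct recovery
    return row_total - sum(count for count in row if count is not None)


# Hypothetical two-way table of counts.
table = [[12, 3, 20],
         [15, 9, 30]]

published = suppress_small_cells(table, threshold=5)
recovered = recover_single_suppressed(published[0], row_total=sum(table[0]))
```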

A special example of collapsing categories involves summing over variables to produce marginal tables. Instead of reporting the full multidimensional contingency table, one or more collapsed versions of it might be reported. The release of multiple sets of marginal totals has the virtue of allowing statistical inferences about the relationships among the variables in the original table using log-linear model methods (e.g., see Bishop, Fienberg, and Holland).51 With multiple collapsed versions, statistical theory makes it clear that one may have highly accurate information about the actual cell entries in the original table. As a result, the possibility of disclosures still requires investigation. In part to address this problem, a number of researchers have recently worked on the problem of determining upper and lower bounds for the cells of a multi-way table given a set of margins; however, other measures of risk may clearly be of interest. The problem of computing bounds is in one sense an old one, at least for two-way tables, but it is also deeply linked to recent mathematical developments in statistics and has generated a flurry of new research.52, 53
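For a two-way table, the classical bounds on a cell given its row and column totals are max(0, r + c - n) and min(r, c), where n is the grand total. The sketch below computes them for hypothetical margins; bounds for higher-dimensional tables with overlapping margins are harder and are not covered here.

```python
def frechet_bounds(row_totals, col_totals):
    """Lower and upper bounds for each cell of a two-way table of counts."""
    n = sum(row_totals)
    if n != sum(col_totals):
        raise ValueError("row and column totals must sum to the same n")
    return [[(max(0, r + c - n), min(r, c)) for c in col_totals]
            for r in row_totals]


# Hypothetical margins of a 2 x 2 table with n = 50.
bounds = frechet_bounds(row_totals=[3, 47], col_totals=[4, 46])
```

With these margins the last cell is bounded between 43 and 46, so releasing the margins alone already reveals that cell to within three counts. Narrow bounds of this kind are one signal of disclosure risk.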

4.1.2. The Risk-Utility Tradeoff

Common to virtually all the methodologies discussed in the preceding section is the notion of a risk-utility tradeoff, in which the risk of disclosure is balanced with the utility of the released data (e.g., see Duncan,45 Fienberg,54 and their chapter with others in Doyle et al.49). To keep this risk at a low level requires applying more extensive data masking, which limits the utility of what is released. Advocates for the use of simulated data often claim that this use eliminates the risk of disclosure, but others dispute this claim. See also the recent discussion of risk-utility paradigms by Cox and colleagues.55

4.1.3. Privacy-Preserving Data Mining Methodologies

With advances in data mining and machine learning over the past two decades, a large number of methods have been introduced under the banner of privacy-preserving computation. The methodologies vary, and many of them focus on standard tools such as the addition of noise or data swapping of one sort or another. But the claims of identity protection in this literature are often exaggerated or unverifiable. For a discussion of some of these ideas and methods, see Fienberg and Slavkovic53 and El Emam and colleagues.34 For two recent interesting examples explicitly set in the context of medical data, see Malin and Sweeney56 and Boyens, Krishnan, and Padman.57

The common message of this literature is that privacy protection has costs measured in the lack of availability of research data. To increase the utility of released data for research, some measure of privacy protection, however small, needs to be sacrificed. It is nonetheless still possible to optimize utility, subject to predefined upper bounds on what is considered to be acceptable risk of identification. See a related discussion in Fienberg.58
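The constrained optimization described above can be sketched as a simple selection rule: among candidate releases, keep those whose estimated risk is at or below the predefined bound and pick the one with the highest utility. The candidates and their risk and utility scores below are hypothetical placeholders; how to measure either quantity is itself a research question.

```python
def best_release(candidates, max_risk):
    """Pick the highest-utility candidate whose risk does not exceed max_risk."""
    acceptable = [c for c in candidates if c["risk"] <= max_risk]
    if not acceptable:
        return None  # nothing can be released under this risk bound
    return max(acceptable, key=lambda c: c["utility"])


# Hypothetical candidate maskings with illustrative scores in [0, 1].
candidates = [
    {"name": "raw microdata",       "risk": 0.40, "utility": 1.00},
    {"name": "coarsened microdata", "risk": 0.08, "utility": 0.80},
    {"name": "noisy microdata",     "risk": 0.04, "utility": 0.65},
    {"name": "aggregate tables",    "risk": 0.01, "utility": 0.30},
]

choice = best_release(candidates, max_risk=0.05)
```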

4.1.4. Cryptographic Approaches to Privacy Protection

While the current risks of identification in modern databases are similar for statistical agencies and biomedical researchers, there are also new challenges from contemporary information repositories that store social network data (e.g., cell phone, Twitter, and Facebook data), product preferences data (e.g., Amazon), Web search data, and other sources of information not previously archived in a digital format. A recent literature emanating from cryptography focuses on algorithmic aspects of this problem with an emphasis on automation and scalability of a process for conferring anonymity. Automation, in turn, presents a fundamentally different perspective on how privacy is defined and provides for both a formal definition of privacy and proofs for how it can be protected. By focusing on the properties of the algorithm for anonymity, it is possible to formally guarantee the degree of privacy protection and the quality of the outputs in advance of data collection and publication.

This new approach, known as differential privacy, limits the incremental information a data user might learn beyond that which is known before exposure to the released statistics. No matter what external information is available, the differential privacy approach guarantees that essentially the same information is learned about an individual, whether or not information about the individual is present in the database. The papers by Dwork and colleagues59, 60 provide an entry point to this literature. Differential privacy, as these authors describe it, works primarily through the addition of specific forms of noise to all data elements and the summary information reported, but it does not address issues of sampling or access to individual-level microdata. While these methods are intriguing, their utility for data linkages with registry data remains an open issue.61
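A common instance of this noise addition is the Laplace mechanism applied to a counting query. The sketch below is a minimal illustration, assuming a query whose answer changes by at most 1 when one person is added or removed (sensitivity 1) and a hypothetical privacy parameter `epsilon`; it is not a vetted implementation and ignores issues such as floating-point leakage and repeated queries.

```python
import random


def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two exponentials."""
    # expovariate takes a rate, so rate = 1 / scale gives mean `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng):
    """Release a count with Laplace noise of scale sensitivity / epsilon."""
    sensitivity = 1.0  # adding or removing one person changes a count by 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(42)
releases = [private_count(100, epsilon=1.0, rng=rng) for _ in range(20000)]
average = sum(releases) / len(releases)
```

Smaller values of `epsilon` give stronger protection and noisier answers, which is the risk-utility tradeoff in a formal guise.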

4.1.5. Security Practices, Standards, and Technologies

In general, people adopt two different philosophical positions about how the confidentiality associated with individual-level data should be preserved: (1) by “restricted or limited information,” that is, restrictions on the amount or format of the data released, and (2) by “restricted or limited access,” that is, restrictions on the access to the information itself.

If registry data are a public health good, then restricted access is justifiable only in situations where the confidentiality of data in the possession of a researcher cannot be protected through some form of restriction on the information released. Restricted access is intended to allow use of unaltered data by imposing certain conditions on users, analyses, and results that limit disclosure risk. There are two primary forms of restricted access. The first is through licensing, whereby users are legally bound by certain conditions, such as agreeing not to use data for re-identification and to accept advance review of publications. The licensure approach allows users to transfer data to their sites and use the software of their choice. The second approach is exemplified by research data centers, discussed in more detail below, and remote analysis servers, which are conceptually similar to data centers: users, and sometimes analyses, are evaluated in advance. The results are reviewed, and often limited, in order to limit risk of disclosure. The data remain at the holder's site and computers; the difference between a research data center and a remote analysis server is whether access is in person at a data center or via the Internet using a remote analysis server.

4.1.6. Registries as Data Enclaves

Many statistical agencies have built enclaves, often referred to as research data centers, where users can access and use data in a regulated environment. In such settings, the security of computer systems is controlled and managed by the agency providing the data. Such environments may maximize data security. For a more extensive discussion of the benefits of restricted access, see the chapter by Dunne in Doyle et al.49

These enclaves incur considerable costs associated with their establishment and upkeep. A further limitation is that the enclave may require the physical presence of the data user, which also increases the overall cost to researchers working with the data. Moreover, such environments often prevent users from executing specialized data analyses, which may require programming and other software development beyond the scope of traditional statistical software packages made available in the enclave.

The process for granting access to data in enclaves or restricted centers involves an examination of the research credentials of those wishing to do so. In addition, these centers control the physical access to confidential data files and they review the materials that data users wish to take from the centers and to publish. Researchers who are accustomed to reporting residual plots and other information that allows for a partial reconstruction of the original data, at least for some variables, will encounter difficulties, because restricted data centers typically do not allow users to remove such information.

4.1.7. Accountability

Data can be manipulated by the above techniques to limit the possibility of re-identification. At the same time, it is important to ensure that researchers are accountable for the use of the data sets that are made available to them. Best practices in data security should be adopted with specific emphasis on authentication, authorization, access control, and auditing. In particular, each data recipient should be assigned a unique login identification, or, if the data are made available online, access may be provided through a query response server. Prior to each session of data access, data custodians should authenticate the user's identity. Access to information should be controlled either in a role-based or information-based manner. Each user access and query to the data should be logged to enable auditing functions. If there is a breach in data protection, the data custodian can investigate the potential cause and make any required notifications.
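The four practices named above (authentication, authorization, access control, and auditing) can be sketched together in a few lines. The class, role names, and permission levels below are hypothetical, and the sketch leaves out much that a production system needs, such as session handling, lockouts, and tamper-resistant log storage.

```python
import hashlib
import hmac
import secrets
from datetime import datetime, timezone

# Hypothetical roles and the levels of detail each may query.
ROLE_PERMISSIONS = {
    "analyst": {"aggregate"},
    "linkage_staff": {"aggregate", "record_level"},
}


class DataCustodian:
    def __init__(self):
        self._users = {}     # login -> (salt, password hash, role)
        self.audit_log = []  # one entry per access attempt

    @staticmethod
    def _hash(password, salt):
        return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

    def register(self, login, password, role):
        if login in self._users:
            raise ValueError("each recipient needs a unique login")
        salt = secrets.token_bytes(16)
        self._users[login] = (salt, self._hash(password, salt), role)

    def _authenticate(self, login, password):
        if login not in self._users:
            return False
        salt, stored, _role = self._users[login]
        return hmac.compare_digest(stored, self._hash(password, salt))

    def request(self, login, password, level, query):
        """Authenticate, check the role's permissions, and log the attempt."""
        authenticated = self._authenticate(login, password)
        role = self._users[login][2] if authenticated else None
        allowed = authenticated and level in ROLE_PERMISSIONS.get(role, set())
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "login": login, "level": level, "query": query, "allowed": allowed,
        })
        return allowed


custodian = DataCustodian()
custodian.register("analyst01", "correct horse", "analyst")
ok = custodian.request("analyst01", "correct horse", "aggregate", "count by year")
denied = custodian.request("analyst01", "correct horse", "record_level", "list rows")
```

Denied requests are logged as well as permitted ones, so that an investigation after a suspected breach can see what was attempted and by whom.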

4.1.8. Layered Restricted Access to Databases

In many countries, the traditional arrangement for data use involves restrictions on both information and access, with only highly aggregated data and summary statistics released for public use.

One potential strategy for privacy protection for the linkage of registries to other confidential data is a form of layered restrictions that combines two approaches with differing levels of access at different levels of detail in the data. The registry might function as an enclave, similar to those described above, and in addition, public access might be limited to only aggregate data. Between these two extremes there might be several layers of restricted access. An example is licensing that includes privacy protection, requiring greater protection as the potential for disclosure risk increases.
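A layered arrangement of this kind can be pictured as a function that returns progressively more detail as the access tier becomes more restrictive. The tier names, fields, and records below are hypothetical and serve only to make the layering concrete.

```python
from collections import Counter


def release(records, tier):
    """Return the level of detail permitted for the given access tier."""
    if tier == "public":
        # Aggregate counts only.
        return dict(Counter(r["diagnosis"] for r in records))
    if tier == "licensed":
        # Record-level data with coarsened dates and no location.
        return [{"birth_year": r["dob"][:4], "diagnosis": r["diagnosis"]}
                for r in records]
    if tier == "enclave":
        # Full records, usable only inside the controlled environment.
        return [dict(r) for r in records]
    raise ValueError(f"unknown tier: {tier}")


records = [
    {"dob": "1950-03-14", "zip": "02139", "diagnosis": "I21"},
    {"dob": "1982-11-02", "zip": "02139", "diagnosis": "J45"},
    {"dob": "1971-07-30", "zip": "94607", "diagnosis": "I21"},
]

public_view = release(records, "public")
licensed_view = release(records, "licensed")
```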

5. Legal and Technical Planning Questions

The questions in Tables 16–1 and 16–2 are intended to assist in the planning of data linkage projects that involve using registry data plus other files. Registry operators should use the answers to these questions to assemble necessary information and other resources to guide planning for their data linkage projects. Like the preceding discussion, this section considers regulatory and technical questions.

Several assumptions underlie the regulatory questions that follow in Table 16–1. Their application to the proposed data linkage project should be confirmed or determined. These assumptions are listed here:

  • The HIPAA Privacy Rule applies to the entities that first collect data from individuals/subjects.
  • Other laws may restrict access or use of the initial data sources.
  • The Common Rule or FDA regulations may or may not apply to data linkage.
  • The Common Rule or FDA regulations may or may not apply to the original data sets.

Different regulatory concerns arise depending on the answers to each category of the following questions. Consult as necessary with experienced health services, social science, or statistician colleagues, and with regulatory personnel (e.g., the agency Privacy Officer) or legal counsel to clarify answers for specific data linkage projects.

6. Summary

This chapter describes technical and current legal considerations for researchers interested in creating data linkage projects involving registry data. In addition, the chapter presents typical methods for record linkage that are likely to form the basis for the construction of data linkage projects. It also discusses both the hazards for re-identification created by data linkage projects, and the statistical methods used to minimize the risk of re-identification. Two topics not covered in this chapter are (1) considerations about linking data from public and private sectors, where different, perhaps conflicting, ethical and legal restrictions may apply, and (2) the risks involved in identifying the health care providers that collect and provide data.

Data set linkage entails the risks of loss of reliable confidential data management and of identification or re-identification of individuals and institutions. Recognized and developing statistical methods and secure computation may limit these risks and allow the public the health benefits that registries linked to other data sets have the potential to contribute.

Case Examples for Chapter 16

Case Example 36. Linking registries at the international level

Description: Psonet is an investigator-initiated, international scientific network of coordinated population-based registries; its aim is to monitor the long-term effectiveness and safety of systemic agents in the treatment of psoriasis.
Sponsor: Supported initially by a grant from the Italian Medicines Agency (AIFA); supported since 2011 by a grant from the European Academy of Dermatology and Venereology (EADV) and coordinated by the Centro Studi GISED.
Year Started: 2005
Year Ended: Ongoing
No. of Sites: 9 registries across Europe and an Australasian registry
No. of Patients: 27,800


Challenge

The number of options for systemic treatment of psoriasis has greatly increased in recent years. Because psoriasis is a chronic disease requiring lifelong treatment, data on long-term effectiveness and safety are needed for both old and new treatments. Several European countries have established patient registries for surveillance of psoriasis treatments and outcomes. However, these registries tend to have small patient populations and little geographic diversity, limiting their strength as surveillance tools for rare or delayed adverse events.

Proposed Solution

Combining the results from nation-based registries would increase statistical power and may enable investigators to conduct analyses that would not be feasible at a single-country level. Psonet was established in 2005 as a network of European registries of psoriasis patients being treated with systemic agents. The goal of the network is to improve clinical knowledge of prognostic factors and patient outcomes, thus improving treatment of psoriasis patients. An International Coordinating Committee (ICC), including representatives of the national registries and some national pharmacovigilance centers, oversees the network activities, including data management, publications, and ethical or privacy issues. The ICC has appointed an International Safety Review Board, whose job is to review safety data, prepare periodic safety reports, and set up procedures for the prompt identification and investigation of unexpected adverse events. Informed consent for data sharing is obtained before patients are enrolled in participating registries.

When drafting the registry protocol, member registries agreed to a common set of variables and procedures to be included and implemented in the national registries. However, some registries were already active at the time the draft was written, and harmonization is not perfect. Although inclusion criteria, major outcomes, and followup schedules are quite similar among registries, there are some differences. There are also differences in terms of software used, data coding, and data ownership arrangements. These factors made sharing individual patient data complicated, and an alternate solution was identified: meta-analysis of summary measures from each registry. As summary measures (or effect measures) are calculated, the methods used to obtain them are decided in advance, including methods used to control for confounding and methods used to temporally link exposures and events.
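One standard way to combine registry-level summary measures is inverse-variance (fixed-effect) pooling, in which each estimate is weighted by the reciprocal of its squared standard error. The source does not state which meta-analytic model Psonet uses, so the method and the numbers below are assumptions for illustration only.

```python
import math


def pool_fixed_effect(estimates):
    """Inverse-variance pooling of (estimate, standard error) pairs."""
    weights = [1.0 / (se ** 2) for _est, se in estimates]
    total_weight = sum(weights)
    pooled = sum(w * est for w, (est, _se) in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Hypothetical log hazard ratios and standard errors from two registries.
registry_estimates = [(0.2, 0.1), (0.4, 0.1)]

pooled, pooled_se = pool_fixed_effect(registry_estimates)
ci_low = pooled - 1.96 * pooled_se
ci_high = pooled + 1.96 * pooled_se
```

With two equally precise estimates the pooled value is their simple average (0.3), and its standard error is smaller than either input, which is the gain in statistical power that motivates combining registries. A random-effects model would be a natural alternative when registries differ systematically.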


Results

Ten national and local registries at different stages of development are associated with the registry to date, contributing a total of about 27,800 patients. While the registry is too new to have published results, planned activities and analyses include comparative data on treatment strategies for psoriasis in Europe, rapid alerts on newly recognized unexpected events, regular reports on effectiveness and safety data, and analyses of risk factors for lack of response as a preliminary step to identifying relevant biomarkers.

Key Point

Data from multiple registries in different countries may be combined to provide larger patient populations for study of long-term outcomes and surveillance for rare or delayed adverse events. Meta-analysis of prospectively calculated summary measures can be a useful tool.

For More Information

Psonet: European Registry of Psoriasis. http://www.psonet.eu/cms/.

Lecluse LLA, Naldi L, Stern RS, et al. National registries of systemic treatment for psoriasis and the European ‘Psonet’ initiative. Dermatology. 2009;218(4):347–56. [PubMed: 19077384].

Naldi L. The search for effective and safe disease control in psoriasis. Lancet. 2008;371:1311–2. [PubMed: 18424307].

Case Example 37. Linking a procedure-based registry with claims data to study long-term outcomes

Description: The CathPCI Registry measures the quality of care delivered to patients receiving diagnostic cardiac catheterizations and percutaneous coronary interventions (PCI) in both inpatient and outpatient settings. The primary outcomes evaluated by the registry include the quality of care delivered, outcome evaluation, comparative effectiveness, and postmarketing surveillance.
Sponsor: American College of Cardiology Foundation through the National Cardiovascular Data Registry. Funded by participation dues from catheterization laboratories.
Year Started: 1998
Year Ended: Ongoing
No. of Sites: 1,450 catheterization laboratories
No. of Patients: 12.7 million patient records; 4.5 million PCI procedures


Challenge

The registry sponsor was interested in studying long-term patient outcomes for diagnostic cardiac catheterizations and PCI, but longitudinal data are not collected as part of the registry. Rather than create an additional registry, it was determined that the most feasible option was linking the registry data with available third-party databases such as Medicare.

Before the linkage could occur, however, several legal questions needed to be addressed, including what identifiers could be used for the linkage and whether institutional review board (IRB) approval was necessary.

Proposed Solution

The registry developers explored potential issues relating to the use of protected health information (under the Federal HIPAA [Health Insurance Portability and Accountability Act] law) to perform the linkage; the applicability of the Common Rule (protection of human subjects) to the linkage; and the contractual obligations of the individual legal agreement with each participating hospital with regard to patient privacy. The registry gathers existing data, including direct patient identifiers collected as part of routine health care activities. Informed consent is not required. The registry sponsor has business associate agreements in place with participating catheterization laboratories for which the registry conducts the outcomes evaluations.

After additional consultation with legal counsel, the registry sponsor concluded that the linkage of data could occur under two conditions: (1) that the data sets used in the merging process must be in the form of a limited data set (see Chapter 7), and (2) that an IRB must evaluate such linkage. The decision to implement the linkage was based on two key factors. First, the registry participant agreement includes a data use agreement, which permits the registry sponsor to perform research on a limited data set but also requires that no attempt be made to identify the patient. Second, since there was uncertainty as to whether the proposed data linkage would meet the definition of research on human subjects, the registry sponsor chose to seek IRB approval, along with a waiver of informed consent.


Results

The registry data were linked with Medicare data, using probabilistic matching techniques to link the limited data sets. A research protocol describing the need for linkage, the linking techniques, and the research questions to be addressed was approved by an IRB. Researchers must reapply for IRB approval for any new research questions they wish to study in the linked data.
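Probabilistic matching is often implemented with agreement weights in the style of Fellegi and Sunter, where each compared field adds log2(m/u) on agreement and log2((1-m)/(1-u)) on disagreement. The case example does not say which algorithm or fields were used, so the fields, the m and u probabilities, and the threshold in the sketch below are hypothetical.

```python
import math

# Hypothetical per-field probabilities:
# m = P(fields agree | records are the same person)
# u = P(fields agree | records are different people)
M_U = {
    "dob": (0.98, 0.01),
    "sex": (0.99, 0.50),
    "admit_date": (0.95, 0.02),
}


def match_weight(rec_a, rec_b, m_u=M_U):
    """Sum of log2 agreement or disagreement weights over the compared fields."""
    total = 0.0
    for field, (m, u) in m_u.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def classify(weight, threshold=8.0):
    """Label a pair as a link when its weight reaches a hypothetical threshold."""
    return "link" if weight >= threshold else "non-link"


registry_rec = {"dob": "1950-03-14", "sex": "M", "admit_date": "2009-06-01"}
claim_same = {"dob": "1950-03-14", "sex": "M", "admit_date": "2009-06-01"}
claim_other = {"dob": "1949-01-20", "sex": "F", "admit_date": "2009-08-15"}

weight_same = match_weight(registry_rec, claim_same)
weight_other = match_weight(registry_rec, claim_other)
```

In practice the m and u probabilities are estimated from the data, and pairs with intermediate weights are usually set aside for clerical review rather than forced into one class.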

Results of the linkage analyses were used to develop a new measure, “Readmission following PCI,” for the Centers for Medicare & Medicaid Services' hospital inpatient quality pay-for-reporting program.

Key Point

There are many possible interpretations of the legal requirements for linking registry data with other data sources. The interpretation of legal requirements should include careful consideration of the unique aspects of the registry, its data, and its participants. In addition, clear documentation of the way the interpretation occurred and the reasoning behind it will help to educate others about such decisions and may allay anxieties among participating institutions.

Case Example 38. Linking registry data to examine long-term survival

DescriptionThe Yorkshire Specialist Register of Cancer in Children and Young People is a population-based registry that collects data on children and young adults diagnosed with a malignant neoplasm or certain benign neoplasms, living within the Yorkshire and Humber Strategic Health Authority (SHA). The goals of the registry are (1) to serve as a data source for research at local, national, and international levels on the causes of cancer in children, teenagers, and young people, and (2) to evaluate the delivery of care provided by clinical and other health service professionals.
SponsorPrimary funding is provided by the Candlelighters Trust, Leeds.
Year Started1974
Year EndedOngoing
No. of Sites18 National Health Service (NHS) Trusts
No. of Patients7,728


In 2002, approximately 1,500 children in the United Kingdom were diagnosed with cancer. Previous estimates of malignant bone tumors in children have been approximately 5 per million person-years in the United Kingdom. The registry collects data on individuals younger than 30 living within the Yorkshire and Humber SHA, and diagnosed with a malignant neoplasm or certain benign neoplasms by pediatric oncology and hematology clinics or teenage and young adult cancer clinics. Primary patient outcomes of the registry include length of survival, access to specialist care, late effects following cancer treatment, and hospital activity among long-term survivors. While bone cancer is ranked as the seventh most common malignancy in the United Kingdom, the relative rarity of this type of childhood cancer makes it difficult to gather sufficient data to evaluate incidence and survival trends over time.

Proposed Solution

The registry participated in a collaborative effort to combine its data with three other population-based registries—the Northern Region Young Persons' Malignant Disease Registry, the West Midlands Regional Children's Tumour Registry, and the Manchester Children's Tumour Registry. Together, the four population-based registries represented approximately 35 percent of the children in England.


Between 1981 and 2002, 374 cases of malignant bone tumors were identified in children ages 0 to 14 years. The age-standardized incidence rate for all types of bone cancers (i.e., osteosarcoma, chondrosarcoma, Ewing sarcoma, and “other”) was reported to be 4.84 per million person-years. For the two most common types of bone cancer, osteosarcoma and Ewing sarcoma, the incidence rates were 2.63 cases per million person-years (95% confidence interval [CI], 2.27 to 2.99) and 1.90 cases per million person-years (95% CI, 1.58 to 2.21), respectively. While an improvement in survival was observed in patients with Ewing sarcoma, no survival improvement was detected in patients with osteosarcoma. The 5-year survival rate for children with all types of diagnoses observed in the study was an estimated 57.8 percent (95% CI, 52.5 to 63).

Key Point

In the analysis of rare diseases, the number of cases and deaths included in the study determines the statistical power for examining survival trends and significant risk factors, and the precision in estimating the incidence rate or other parameters of disease. Where it is difficult to obtain a large enough sample within a single study, consideration should be given to combining registry data collected among similar patient populations.
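The effect of pooling on precision can be illustrated with a small calculation. The sketch below is not the method used in the study; it assumes a crude (not age-standardized) rate, a log-scale normal approximation for the 95% CI of a Poisson count, and entirely hypothetical case counts and person-years. The function name `rate_with_ci` is introduced here for illustration only.

```python
import math

def rate_with_ci(cases, person_years, per=1_000_000, z=1.96):
    """Crude incidence rate per `per` person-years with an approximate CI.

    Uses the log-scale approximation rate * exp(+/- z / sqrt(cases)),
    which is an assumption of this sketch, not the study's method.
    """
    rate = cases / person_years * per
    factor = math.exp(z / math.sqrt(cases))
    return rate, rate / factor, rate * factor

# Hypothetical inputs: one registry versus four pooled registries
# with the same underlying rate but four times the person-years.
single = rate_with_ci(cases=90, person_years=18_000_000)
pooled = rate_with_ci(cases=360, person_years=72_000_000)

for label, (rate, low, high) in (("single", single), ("pooled", pooled)):
    print(f"{label}: {rate:.2f} per million person-years "
          f"(approx. 95% CI, {low:.2f} to {high:.2f})")
```

Both inputs give the same point estimate, but the interval from the pooled data is narrower because its width depends on the number of cases.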

For More Information

Eyre R, Feltbower RG, Mubwandarikwa E, et al. Incidence and survival of childhood bone tumours in Northern England and the West Midlands, 1981–2002. Br J Cancer. 2009;100(1):188–93. [PMC free article: PMC2634696] [PubMed: 19127271]

Case Example 39. Linking longitudinal registry data to Medicaid Analytical Extract files

Description: The Cystic Fibrosis Foundation (CFF) Patient Registry is a rare-disease registry that collects data from clinical visits, hospitalizations, and care episodes to track national trends in morbidity and mortality, assess the effectiveness of treatments, and drive quality improvement in patients with cystic fibrosis (CF).
Sponsor: Cystic Fibrosis Foundation
Year Started: 1986
Year Ended: Ongoing
No. of Sites: 110 CFF-accredited care centers in the United States
No. of Patients: More than 26,000


Clinical services and health information generated outside of clinic visits and hospitalizations at accredited care centers may or may not be captured in the CFF Patient Registry. Therefore, administrative claims data such as Medicaid Analytical Extract (MAX), with comprehensive information on reimbursed health services, are necessary to completely evaluate drug exposure for epidemiological studies. To protect patient information, the CFF Patient Registry only collects the last four digits of the Social Security number, gender, and date of birth as direct patient identifiers. Since these identifiers are largely non-unique, linkage of the registry data to other data sources presents a challenge.

Proposed Solution

A deterministic patient matching algorithm, or linkage rule, was developed to link the CFF Patient Registry and MAX data using non-unique patient identifiers. MAX patients (with at least two inpatient or outpatient claims with a diagnosis of CF) and CFF registry patients born between January 1, 1981, and December 31, 2006, were included. We examined the following variables for linking plausibility: date of birth, last four digits of the Social Security number, ZIP Code, gender, date of sweat test, date of gene testing, and date of hospital admission. Specifically, we determined the percentage of unique records for each selected variable or combination of variables in the MAX data set and the registry data set. Only variable combinations with a 99 percent level of uniqueness (99 percent of records unique) were considered for the deterministic rule definitions. We then examined the linkage performance of each rule and compared the validation parameters of each rule (i.e., sensitivity, specificity, and positive predictive value [PPV]) against the selected gold standard (defined as the rule with the highest linkage performance).
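The two steps described above, checking how unique a combination of variables is and then linking on exact agreement, can be sketched as follows. This is a minimal illustration, not the registry's actual algorithm: the record layout, the field names (`dob`, `ssn4`, `sex`), the helper names, and the toy records are all hypothetical, and the choice to link only keys that are unique in both files is an assumption of this sketch.

```python
from collections import Counter

def key_of(record, fields):
    """Tuple of the record's values on the chosen linkage fields."""
    return tuple(record[f] for f in fields)

def percent_unique(records, fields):
    """Percentage of records whose key occurs exactly once in the file."""
    counts = Counter(key_of(r, fields) for r in records)
    unique = sum(1 for r in records if counts[key_of(r, fields)] == 1)
    return 100.0 * unique / len(records)

def deterministic_link(left, right, fields):
    """Pair records that agree exactly on every field.

    Assumption: keys that are duplicated within either file are skipped,
    because an exact match on them would be ambiguous.
    """
    def unique_index(records):
        counts = Counter(key_of(r, fields) for r in records)
        return {key_of(r, fields): r["id"]
                for r in records if counts[key_of(r, fields)] == 1}
    li, ri = unique_index(left), unique_index(right)
    return sorted((li[k], ri[k]) for k in li.keys() & ri.keys())

# Hypothetical toy records standing in for registry and claims files.
registry = [
    {"id": "R1", "dob": "1990-01-01", "ssn4": "1234", "sex": "F"},
    {"id": "R2", "dob": "1990-01-01", "ssn4": "5678", "sex": "F"},
    {"id": "R3", "dob": "1985-06-15", "ssn4": "1234", "sex": "M"},
    {"id": "R4", "dob": "1990-01-01", "ssn4": "5678", "sex": "F"},
]
claims = [
    {"id": "C1", "dob": "1990-01-01", "ssn4": "1234", "sex": "F"},
    {"id": "C2", "dob": "1985-06-15", "ssn4": "1234", "sex": "M"},
    {"id": "C3", "dob": "1990-01-01", "ssn4": "5678", "sex": "F"},
]

rule = ("sex", "dob", "ssn4")
print(percent_unique(registry, rule))   # uniqueness of the combination
print(percent_unique(registry, ("dob",)))  # a single variable is less unique
print(deterministic_link(registry, claims, rule))
```

In the toy data, records R2 and R4 share the same key, so the rule leaves them unlinked even though a matching claims record exists. This is the kind of loss that a uniqueness threshold is meant to keep small.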


We assessed 14,515 and 15,446 patient records in MAX and CF registry data sets, respectively. A total of nine linkage rules were established. The linkage rule including gender, date of birth, and Social Security number had the highest performance with 32.04 percent successfully linked records and was considered the gold standard. Linkage rule performance ranged from 1.4 percent (95% CI, 1.2 to 1.6) to 32.0 percent (95% CI, 31.3 to 32.8). As expected, rules with lower linkage performance had fewer or no matching records. Compared with the selected gold standard, the sensitivity of the other linkage rules ranged from 4.3 percent (95% CI, 3.8 to 4.9) to 73.3 percent (95% CI, 72.0 to 74.6); the specificity ranged from 88.2 percent (95% CI, 87.6 to 88.9) to 99.9 percent (95% CI, 99.8 to 99.9); and the PPV ranged from 68.2 percent (95% CI, 62.6 to 73.4) to 99.0 percent (95% CI, 96.5 to 99.8).
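Validation against a gold-standard rule amounts to a two-by-two comparison of the links each rule produces. The sketch below shows one way to compute the three parameters. The definitions are assumptions made for illustration, since the case example does not state them: a true positive is a pair produced by both rules, a false positive is a pair produced only by the candidate rule, a false negative is a pair produced only by the gold standard, and a true negative is a registry record linked by neither rule. The function name and the toy identifiers are hypothetical.

```python
def ratio(numerator, denominator):
    """Safe division that returns NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")

def validate_rule(candidate_pairs, gold_pairs, registry_ids):
    """Sensitivity, specificity, and PPV of a candidate rule versus gold."""
    candidate, gold = set(candidate_pairs), set(gold_pairs)
    tp = len(candidate & gold)   # pairs found by both rules
    fp = len(candidate - gold)   # pairs found only by the candidate
    fn = len(gold - candidate)   # pairs found only by the gold standard
    linked = {reg_id for reg_id, _ in candidate | gold}
    tn = len(set(registry_ids) - linked)  # records linked by neither rule
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
    }

# Hypothetical example: ten registry records, four gold-standard links.
registry_ids = [f"R{i}" for i in range(1, 11)]
gold = [("R1", "C1"), ("R2", "C2"), ("R3", "C3"), ("R4", "C4")]
candidate = [("R1", "C1"), ("R2", "C2"), ("R5", "C9")]

metrics = validate_rule(candidate, gold, registry_ids)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.1f} percent")
```

Because the gold standard here is itself a linkage rule rather than verified truth, these figures describe agreement with that rule, not absolute accuracy.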

Key Point

The defined linkage rules exhibited varying operational characteristics of sensitivity, specificity, and PPV. When using deterministic linkage methods to link registry data with administrative claims data, relying on multiple linkage rules may be necessary to optimize linkage performance. Applying probabilistic record linkage methods should be considered when deterministic linkage methods are likely to fail; however, the absence of a set criterion for establishing probability weights could pose a challenge for their implementation.
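To make the point about probability weights concrete, the sketch below scores a candidate record pair in the style of the Fellegi-Sunter model, one common formulation of probabilistic linkage. Each field contributes log2(m/u) when it agrees and log2((1 - m)/(1 - u)) when it disagrees, where m is the probability of agreement among true matches and u is the probability of agreement among non-matches. Every number here is hypothetical: the m and u values and the two thresholds are invented for illustration, and choosing them in practice is exactly the difficulty the Key Point describes.

```python
import math

# Hypothetical m- and u-probabilities for three fields.
M = {"dob": 0.95, "ssn4": 0.90, "sex": 0.98}
U = {"dob": 0.001, "ssn4": 0.0001, "sex": 0.5}

def match_weight(agreements, m=M, u=U):
    """Sum of per-field log2 likelihood ratios for one record pair."""
    total = 0.0
    for field, agrees in agreements.items():
        if agrees:
            total += math.log2(m[field] / u[field])
        else:
            total += math.log2((1 - m[field]) / (1 - u[field]))
    return total

def classify(weight, lower=0.0, upper=10.0):
    """Three-way decision using hypothetical thresholds."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"

all_agree = {"dob": True, "ssn4": True, "sex": True}
partial = {"dob": True, "ssn4": False, "sex": True}
none_agree = {"dob": False, "ssn4": False, "sex": False}

for pattern in (all_agree, partial, none_agree):
    w = match_weight(pattern)
    print(f"{pattern}: weight {w:.2f} -> {classify(w)}")
```

Pairs that fall between the two thresholds are typically set aside for clerical review, which is one reason the choice of weights and cutoffs has practical consequences.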


