NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Dusetzina SB, Tyree S, Meyer AM, et al. Linking Data for Health Services Research: A Framework and Instructional Guide [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2014 Sep.


5. Evaluation of Methods Linking Health Registry Data to Insurance Claims in Scenarios of Varying Available Information

Objective

In this chapter, we expand upon our Chapter 4 discussion of linkage methods through an empirical linkage demonstration and evaluation using registry and insurance claims data. Here, we evaluate a set of linkage algorithms for registry-to-claims linkages covering scenarios of varying unique identifier availability and incorporate encryption algorithms to allow linkage without Protected Health Information transfer. We evaluated test algorithms against a gold standard used by the National Cancer Institute’s Surveillance, Epidemiology and End Results (SEER)-Medicare program. More specifically, we examined linkage algorithms first with full identifying information including name and Social Security Number (SSN), then iterated through scenarios with decreasing numbers of unique identifiers and increasing reliance on nonunique information such as date of birth and sex. Given the exceptionally limited availability of practical empirical examples researchers can use to inform their own data linkages, this examination articulates much-needed specific details of the steps researchers may take and what they may expect to find given each study’s unique scenario of data availability and quality.

Methods

Approach Overview

We compared four approaches:

1.

Employment of the current gold-standard linking algorithm, first with full identifying information and subsequently with partial identifiers in place of their full counterparts

2a.

Evaluation of deterministic approaches, modeling scenarios of decreasing individually identifiable information

2b.

Evaluation of deterministic approaches in the context of encrypted individual identifiers, simulating a scenario of restrictions on identifier release to researchers

3.

Evaluation of probabilistic approaches, modeling scenarios of decreasing individually identifiable information

Experimental linkage sets start with full available information and iteratively reduce the available information, in the end simulating a scenario in which unique identifiers are not available. To both streamline this examination and test the robustness of the algorithms in the context of the rareness of the condition of interest, we begin by focusing on a sample (subpopulation) with colon cancer, a common sex-neutral cancer. Next, we examine cancers that are rarer and sex specific. Finally, we evaluate algorithm performance in the context of all cancers simultaneously in the full health-plan claims population. Table 5.1 provides an overview of our approach.

Table 5.1. Overview of experimental linkage approach.


Data Sources and Patient Populations

Case data: Individuals in the North Carolina Central Cancer Registry (NCCCR) diagnosed with colon cancer in the period 2007–08 (n = 6,444 unique individuals)

Claims data: Enrollment and claims data for beneficiaries in privately insured health plans in North Carolina (PAYER) for the period 2006–09 (n = 3,747,250 unique beneficiaries)

We selected these datasets because they have full information (i.e., all identifying variables) commonly captured in constituent datasets for linkages of this nature. The variables in the claims data are the same as those available in the Federal payer/claims data. Registry records were matched to all years of PAYER claims. The PAYER claims data can be restricted to simulate practical scenarios of comparatively limited linking information experienced by researchers.

Table 5.2 shows identifiers available in both datasets.

Table 5.2. Variables available for linkage and their completeness in study datasets.


Data Cleaning and Standardization

Before the linkage, variables were cleaned and standardized as follows:

  1. All string variables were converted to uppercase and stripped of all punctuation and digits, and hyphenated names were broken out into two different name fields.
  2. All date variables were converted to the SAS DATE9. format (e.g., 01SEP2013).
  3. ZIP Codes were limited to the first 5 digits.
  4. FIPS (Federal Information Processing Standards) codes were broken out into State (first 2 digits) and county (last 3 digits) codes.
  5. Invalid SSNs were flagged and treated as missing.
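The cleaning steps listed above could be sketched in Python roughly as follows. This is an illustration only, not the code used in the study (which was run in SAS). All function names are hypothetical, and the SSN validity rules are an assumption, since the chapter does not state which rules were used to flag invalid SSNs.

```python
import re
from datetime import date

MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")

def clean_name(raw):
    """Uppercase a name, split it on hyphens, and drop punctuation and digits."""
    parts = raw.upper().split("-")
    cleaned = [re.sub(r"[^A-Z ]", "", part).strip() for part in parts]
    return [part for part in cleaned if part]

def format_date9(d):
    """Render a date as DATE9.-style text, e.g. 01SEP2013."""
    return f"{d.day:02d}{MONTHS[d.month - 1]}{d.year:04d}"

def clean_zip(raw):
    """Keep only the first 5 digits of a ZIP Code."""
    return re.sub(r"\D", "", raw)[:5]

def split_fips(fips):
    """Split a 5-digit FIPS code into (state, county) codes."""
    return fips[:2], fips[2:5]

def clean_ssn(raw):
    """Return a 9-digit SSN string, or None (missing) when it looks invalid.

    Assumed rules: exactly 9 digits, not one repeated digit, area not
    000/666/9xx, group not 00, serial not 0000.
    """
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) != 9 or len(set(digits)) == 1:
        return None
    if digits[:3] in ("000", "666") or digits[0] == "9":
        return None
    if digits[3:5] == "00" or digits[5:] == "0000":
        return None
    return digits
```

Applying the same functions to both datasets before linkage keeps the compared fields consistent in case, length, and format.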

Data Linkage

Blocking Phase. Rather than consider the Cartesian product of all possible matches between NCCCR and PAYER, we identified a subset of potential matches during an initial blocking phase. Two records were included in the subset of potential matches if they agreed on any of the following:

  1. SSN
  2. Date of birth, first name initial, and sex
  3. Date of birth, last name initial, and sex
  4. Last name, first name, and sex
  5. Date of birth, county, and sex

The blocking phase identified 104,360 possible matches.
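A minimal Python sketch of a blocking pass of this kind is shown below. It takes the union of pairs that agree on any one of the five blocking keys. The record layout and field names (`id`, `first_initial`, and so on) are hypothetical, and treating a missing field as "no agreement" is an assumption rather than a documented detail of the study.

```python
from collections import defaultdict

# Each blocking key is a tuple of field names; a pair becomes a candidate
# when it agrees on every field of at least one key.
BLOCKING_KEYS = [
    ("ssn",),
    ("dob", "first_initial", "sex"),
    ("dob", "last_initial", "sex"),
    ("last_name", "first_name", "sex"),
    ("dob", "county", "sex"),
]

def key_value(record, fields):
    """Return the tuple of values for a key, or None if any field is missing."""
    values = tuple(record.get(f) for f in fields)
    return None if any(v in (None, "") for v in values) else values

def candidate_pairs(registry, claims):
    """Return the set of (registry_id, claims_id) pairs sharing any blocking key."""
    pairs = set()
    for fields in BLOCKING_KEYS:
        # Index the larger (claims) file once per key, then probe it.
        index = defaultdict(list)
        for rec in claims:
            value = key_value(rec, fields)
            if value is not None:
                index[value].append(rec["id"])
        for rec in registry:
            value = key_value(rec, fields)
            if value is not None:
                for claims_id in index.get(value, []):
                    pairs.add((rec["id"], claims_id))
    return pairs
```

Indexing each key avoids forming the full Cartesian product, which is the point of the blocking phase.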

Step 1. Application of a Gold-Standard Algorithm. Because there is presently no definitive gold-standard algorithm for registry linkages, we used the linkage algorithm developed by the National Cancer Institute’s SEER-Medicare program as a gold standard.82 The iterative deterministic approach employed in this algorithm has demonstrated high validity and reliability in previous registry-to-claims linkages, has been employed successfully in numerous updates of the SEER-Medicare linked dataset, and is generally perceived to be strong in scenarios of high data quality and identifier completeness.80,82

Individuals in the NCCCR database were linked to beneficiaries in the PAYER database using the SEER-Medicare algorithm, which consists of a sequence of deterministic matches using different match criteria in each successive round:

In the first step, records were declared a match if they agreed on SSN and one of the following:

  • First and last name (allowing for fuzzy matches, such as nicknames);
  • Last name, month of birth, and sex; or
  • First name, month of birth, and sex.

If SSN was missing or did not match, or two records failed to meet the initial match criteria, they were subjected to a second round of deterministic linkages. In the second round, records were declared a match if they agreed on last name, first name, month of birth, sex, and one of the following:

  • 7–9 digits of the SSN; or
  • Two or more of: year of birth, day of birth, middle initial, or date of death.

For each pair of records, match or nonmatch status was determined using the rules above, and match markers were generated indicating agreement or disagreement on each individual identifier. The SEER-Medicare algorithm classified 1,189 record pairs as matches and 103,171 record pairs as nonmatches. Based on prior knowledge of cancer incidence and insurance coverage in North Carolina, we expected that approximately 20 percent of individuals in the NCCCR database with colon cancer would be insured by PAYER. The 1,189 uniquely matched individuals represent 18.5 percent of the individuals with colon cancer identified in the NCCCR database.
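The two rounds described above can be sketched as a single Python predicate over per-identifier agreement markers. This is an illustrative reading of the rules as stated in this chapter, not the SEER-Medicare program's own code. The marker names are hypothetical, fuzzy first-name agreement is assumed to be computed upstream, and "7–9 digits of the SSN" is interpreted here as at least 7 of 9 digit positions agreeing.

```python
def ssn_digits_agreeing(ssn_a, ssn_b):
    """Count positions at which two 9-digit SSN strings agree (0 if unusable)."""
    if not ssn_a or not ssn_b or len(ssn_a) != 9 or len(ssn_b) != 9:
        return 0
    return sum(a == b for a, b in zip(ssn_a, ssn_b))

def seer_medicare_style_match(m, ssn_a=None, ssn_b=None):
    """Apply the two deterministic rounds to a dict of boolean agreement markers."""
    # Round 1: full SSN agrees, plus one supporting combination.
    if m["ssn"] and (
        (m["first_name"] and m["last_name"])
        or (m["last_name"] and m["birth_month"] and m["sex"])
        or (m["first_name"] and m["birth_month"] and m["sex"])
    ):
        return True
    # Round 2: names, month of birth, and sex agree, plus extra evidence.
    if m["last_name"] and m["first_name"] and m["birth_month"] and m["sex"]:
        if ssn_digits_agreeing(ssn_a, ssn_b) >= 7:
            return True
        extra = sum(
            m[k] for k in ("birth_year", "birth_day", "middle_initial", "death_date")
        )
        return extra >= 2
    return False
```

Keeping the markers separate from the decision rule mirrors the chapter's approach, in which the same markers are reused by the later test algorithms.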

All subsequent test algorithms were evaluated against the SEER-Medicare algorithm. The match decisions made by each algorithm were compared with the match decisions made through use of the gold-standard algorithm. Pairs identified as matches by both the SEER-Medicare algorithm and the test algorithm were declared to be “true matches.” Pairs identified as matches by the SEER-Medicare algorithm and nonmatches by the test algorithm were declared to be “false nonmatches.” Pairs identified as nonmatches by both the SEER-Medicare algorithm and the test algorithm were declared to be “true nonmatches.” Pairs identified as nonmatches by the SEER-Medicare algorithm and matches by the test algorithm were declared to be “false matches.”

To assess the success of each algorithm, we calculated sensitivity, positive predictive value, and F-measure (where beta was set at 1.0, giving equal weight to sensitivity and positive predictive value) using SAS (version 9.3; SAS Institute, Cary, NC).
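As a rough illustration, the comparison against the gold standard and the resulting measures can be computed as below. The function names and the toy pair identifiers are hypothetical. The F-measure uses the standard definition, which combines positive predictive value (precision) and sensitivity (recall); specificity is included because it is reported in the results tables.

```python
def confusion_counts(candidate_pairs, gold_matches, test_matches):
    """Cross-classify candidate pairs by gold-standard and test-algorithm decisions."""
    tp = fp = fn = tn = 0
    for pair in candidate_pairs:
        in_gold = pair in gold_matches
        in_test = pair in test_matches
        if in_gold and in_test:
            tp += 1      # true match
        elif in_gold:
            fn += 1      # false nonmatch
        elif in_test:
            fp += 1      # false match
        else:
            tn += 1      # true nonmatch
    return tp, fp, fn, tn

def linkage_metrics(tp, fp, fn, tn, beta=1.0):
    """Return sensitivity, specificity, PPV, and F-measure as proportions."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    b2 = beta * beta
    f_measure = (1 + b2) * ppv * sensitivity / (b2 * ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "f_measure": f_measure,
    }
```

With beta set to 1.0 the F-measure reduces to the harmonic mean of sensitivity and positive predictive value. The sketch does not guard against empty denominators, which a production version would need.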

Step 2a. Comparison Approach 1—Deterministic Linking. Deterministic linkage strategies have been recommended for situations in which the data are of high quality and/or many identifier variables are available.59 Research has also shown that deterministic linkage on a sufficient number of partial and/or indirect identifiers, such as initials, year of birth, and county of residence, can provide sufficient discriminatory power to classify matches and nonmatches with good sensitivity and specificity.42,72

Using the matching markers generated above, we developed and tested a set of deterministic algorithms, using match variable combinations of full and partial identifiers. We covered information-rich situations in which full direct identifiers (e.g., SSNs and full names) are available, as well as information-poor situations in which only indirect or partial identifiers are available. Only record pairs that were uniquely identified by the given variable combination were considered potential matches. When multiple record pairs matched on the values of a given variable combination, the record pairs were flagged as ties and classified as nonmatches. To account for minor typographical errors in names, we used the Soundex algorithm to generate a code consisting of the first initial and up to three digits representing consonant sounds in the name, and matched on the Soundex values. By doing so, we were able to explore the possibility of linking algorithms that do not require the release of full or actual names.
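The following Python sketch illustrates the two ideas in this paragraph: Soundex coding of names and the rule that tied record pairs are treated as nonmatches. It implements the common American Soundex convention (first letter plus three digits, zero padded), which may differ in detail from the SAS function used in the study. The function names and the triple layout are hypothetical.

```python
from collections import Counter

_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit

def soundex(name):
    """Return the 4-character American Soundex code for a name."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate repeated codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code   # vowels reset prev, so repeated codes across a vowel count
    return (result + "000")[:4]

def unique_matches(agreeing_pairs):
    """Split (registry_id, claims_id, key) triples into unique matches and ties.

    A key shared by more than one agreeing pair is a tie; tied pairs are
    returned separately so they can be classified as nonmatches.
    """
    counts = Counter(key for _, _, key in agreeing_pairs)
    matches = [(r, c) for r, c, key in agreeing_pairs if counts[key] == 1]
    ties = [(r, c) for r, c, key in agreeing_pairs if counts[key] > 1]
    return matches, ties
```

For example, `soundex("SMITH")` and `soundex("SMYTH")` both give `S530`, so a spelling variant of this kind would still agree on the coded name.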

Following Roos and Wajda, we determined the percentage of records identified uniquely by each combination of variables.63 Given the exploratory nature of this study, we relaxed the recommended threshold of ~1.00 record per unique value (100% uniqueness) and included for testing all variable combinations that identified at least 85 percent of records uniquely. Using this method, we selected 398 variable combinations for testing. To simulate scenarios of decreasing information availability, the variables with the largest number of unique values were removed in a stepwise manner. The first group of algorithms used all identifiers. The second group of algorithms excluded SSN. Finally, the third group of algorithms excluded SSN and name.
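One plausible way to compute the uniqueness screen is sketched below in Python: for each combination of variables, count the share of records whose combined value occurs exactly once in the file. This is an interpretation of the description in this chapter, not a reproduction of the Roos and Wajda procedure. The function names and the toy field names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def percent_unique(records, fields):
    """Percentage of records whose values on `fields` occur exactly once."""
    keys = [tuple(rec[f] for f in fields) for rec in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return 100.0 * unique / len(keys)

def select_combinations(records, fields, threshold=85.0):
    """Return every combination of `fields` meeting the uniqueness threshold."""
    selected = []
    for size in range(1, len(fields) + 1):
        for combo in combinations(fields, size):
            if percent_unique(records, combo) >= threshold:
                selected.append(combo)
    return selected
```

The number of combinations grows quickly with the number of candidate variables, so in practice the search would likely be limited to combinations of practical interest.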

Step 2b. Comparison Approach 1—Encryption Variation. In situations where full identifiers or partial identifiers are available but may not be released or transmitted, research has shown that records can be successfully linked via deterministic algorithms using identifiers encrypted before release.45,47,85,86 To simulate the application of a hash encryption method before release, we converted the variable combinations presented in Step 2a to 128-bit hash values using the md5 algorithm. Each conversion was performed using the md5 function in SAS 9.3. It is important to note that the length, format, order, and content of the strings in the two datasets have to be perfectly consistent before the conversion. If there is even a slight difference between the two strings, the md5 algorithm will generate two different values, as shown in Table 5.3 below.

Table 5.3. Example md5 algorithm values from inconsistent strings.


While the two records in this example clearly match on date of birth, first name, and last name, the md5 hash values for the two concatenated strings are very different due to the different casing on the names. Both strings would need to be ordered, formatted, and spaced uniformly for the md5 algorithm to generate the same value for the two strings. Using the example above, the best approach would be to standardize nicknames, concatenate the three identifiers, remove all spaces, and convert the case of the names to uppercase (i.e., ‘12312013BILLSMITH’) before applying the md5 algorithm. We performed a deterministic match on the hash values for each variable combination presented in Step 2a.
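The sensitivity of the hash to formatting can be demonstrated with Python's standard `hashlib` module, as a stand-in for the SAS md5 function used in the study. The input strings below are hypothetical examples modeled on the one in the text, not the actual contents of Table 5.3, and the helper names are illustrative. Nickname standardization is not implemented here and is assumed to happen before this step.

```python
import hashlib

def normalize_key(dob, first_name, last_name):
    """Concatenate identifiers after removing all spaces and uppercasing."""
    parts = (dob, first_name, last_name)
    return "".join("".join(part.split()).upper() for part in parts)

def md5_hex(text):
    """Return the 128-bit md5 digest of a string as 32 hexadecimal characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# Same person, different casing: the digests differ.
raw_a = "12312013BillSmith"
raw_b = "12312013BILLSMITH"
different = md5_hex(raw_a) != md5_hex(raw_b)

# After identical standardization on both sides, the digests agree.
key_a = normalize_key("12312013", "Bill", "Smith")
key_b = normalize_key("12312013", "BILL ", " SMITH")
same = md5_hex(key_a) == md5_hex(key_b)
```

Because each data holder hashes its own file, the normalization function has to be agreed on and applied identically by both parties before any values are exchanged.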

Step 3. Comparison Approach 2—Probabilistic Linking. Probabilistic linkage strategies have been recommended for situations in which the data contain many coding errors and/or only a few identifiers are available.63 Using the match markers generated earlier, we developed and tested a set of probabilistic algorithms using the match variable combinations in each group of full and partial identifiers that performed best in Step 2a. We covered information-rich situations in which full direct identifiers (e.g., SSNs and full names) are available, as well as information-poor situations in which only indirect or partial identifiers are available. Only record pairs that were uniquely identified by the given variable combination were considered potential matches. When multiple record pairs matched on the values of a given variable combination, the record pairs were flagged as ties and classified as nonmatches. The goal in this step is to improve on the match results in Step 2a by making use of the information ignored in deterministic algorithms. A summary of steps for probabilistic record linkage is provided in Chapter 4.

For each matched pair, we calculated agreement weights and disagreement weights for each identifier. Following the Fellegi and Sunter model,33 agreement weights were calculated by dividing the probability that true matches agree on the specific value of the identifier by the probability that nonmatches randomly agree on the specific value of the identifier, and taking the log2 of the quotient. For example, if the probability that true matches agree on month of birth is 97 percent and the probability that nonmatches randomly agree on month of birth is 8.3 percent (1/12), then the agreement weight for month of birth would be log2(.97/.083), or 3.54. Disagreement weights were calculated by dividing 1 minus the probability that true matches agree on the specific value of the identifier by 1 minus the probability that nonmatches agree on the specific value of the identifier, and taking the log2 of the quotient.
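The weight calculation can be reproduced in a few lines of Python. The parameter names `m` (probability of agreement among true matches) and `u` (probability of chance agreement among pairs that are not true matches) follow common record-linkage usage and are not taken from the chapter. The month-of-birth figures are the ones given in the text, with 1/12 used for the chance-agreement probability.

```python
import math

def agreement_weight(m, u):
    """log2 of the ratio of agreement probabilities (matches vs. chance)."""
    return math.log2(m / u)

def disagreement_weight(m, u):
    """log2 of the ratio of disagreement probabilities (matches vs. chance)."""
    return math.log2((1 - m) / (1 - u))

# Month of birth: m = 0.97, u = 1/12
month_agree = agreement_weight(0.97, 1 / 12)        # about 3.54
month_disagree = disagreement_weight(0.97, 1 / 12)  # about -4.93
```

Agreement on an identifier therefore adds to the evidence for a match, while disagreement on the same identifier subtracts from it.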

To allow for comparisons across linkage strategies, we tested the same 398 variable combinations that were selected for testing in Step 2a. The linkage score for each matched pair was then computed as the sum of the weights. Using the method developed by Cook et al.,44 we calculated the threshold weight needed to achieve a 95-percent probability that two matched records are a true match. Matched pairs with a linkage score greater than the threshold weight were declared “matches,” while matched pairs with a linkage score less than the threshold weight were declared “nonmatches.” We present results from the top five algorithms.
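Scoring and classification can be sketched as follows. The per-identifier probabilities and the threshold value in the example are hypothetical placeholders. The threshold calculation of Cook et al. is not reproduced here, so the threshold is simply passed in as a number.

```python
import math

def linkage_score(agreement, probabilities):
    """Sum agreement or disagreement weights over all identifiers.

    `agreement` maps identifier name -> bool; `probabilities` maps
    identifier name -> (m, u), the agreement probabilities among true
    matches and among pairs that are not true matches.
    """
    score = 0.0
    for field, (m, u) in probabilities.items():
        if agreement[field]:
            score += math.log2(m / u)
        else:
            score += math.log2((1 - m) / (1 - u))
    return score

def classify(score, threshold):
    """Declare a match only when the score exceeds the threshold weight."""
    return "match" if score > threshold else "nonmatch"

# Hypothetical probabilities for two identifiers and a hypothetical threshold.
PROBS = {"birth_month": (0.97, 1 / 12), "sex": (0.99, 0.5)}
THRESHOLD = 3.0
both_agree = linkage_score({"birth_month": True, "sex": True}, PROBS)
sex_differs = linkage_score({"birth_month": True, "sex": False}, PROBS)
```

In this toy example, disagreement on sex carries a large negative weight that pulls the pair below the threshold even though month of birth agrees.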

Results

Gold-Standard Linkage

The SEER-Medicare algorithm, using full identifiers, classified 1,189 record pairs as matches and 103,171 record pairs as nonmatches. In a stepwise fashion, we replaced the full identifiers with partial identifiers to determine whether the algorithm can work in the absence of full identifiers. Selected results of the SEER-Medicare iterative deterministic algorithm with full identifiers replaced with the indicated partial identifiers are presented in Table 5.4. The results indicate that the sensitivity of the algorithm was largely unaffected in the five examples presented in the table. The replacement of full identifiers with partial identifiers, however, did slightly increase the number of false matches. (Note that manual review confirmed that the additional matches identified by the algorithms with partial identifiers were in fact nonmatches.) Despite the small decrease in the specificity of the algorithm (not shown), these results indicate that the SEER-Medicare linkage can perform effectively in the absence of full identifiers.

Table 5.4. Selected results of gold-standard linkage algorithm with partial identifiers.


Descriptions of the gold-standard algorithms follow.

Algorithm 1: Criteria for classifying a match–

Individuals match on last 4 of SSN and one of the following sets of criteria:

  1. First name Soundex, last name Soundex, 2 out of 3 DOB (date of birth) parts
  2. Last name Soundex, 2 out of 3 DOB parts, sex
  3. First name Soundex, 2 out of 3 DOB parts, sex

OR

Individuals do not match on last 4 digits of SSN, but match on last name Soundex, first name Soundex, 2 out of 3 DOB parts, sex, and one of the following sets of criteria:

  1. Middle initial or date of death
  2. ZIP Code or county
  3. Primary cancer site

Algorithm 2: Criteria for classifying a match–

Individuals match on last 4 of SSN and one of the following sets of criteria:

  1. First name Soundex, last name Soundex, 2 out of 3 DOB parts
  2. Last name Soundex, 2 out of 3 DOB parts, sex
  3. First name Soundex, 2 out of 3 DOB parts, sex

OR

Individuals do not match on last 4 digits of SSN, but match on last name Soundex, first name Soundex, 2 out of 3 DOB parts, sex, and one of the following sets of criteria:

  1. Middle initial or date of death
  2. County

Algorithm 3: Criteria for classifying a match–

Individuals match on last 4 of SSN and one of the following sets of criteria:

  1. First name Soundex, last name Soundex
  2. Last name Soundex, 2 out of 3 DOB parts, sex
  3. First name Soundex, 2 out of 3 DOB parts, sex

OR

Individuals do not match on last 4 digits of SSN, but match on last name Soundex, first name Soundex, month of birth, sex, and one of the following sets of criteria:

  • Two of the following match: year of birth, day of birth, middle initial, or date of death

Algorithm 4: Criteria for classifying a match–

Individuals match on last 4 of SSN and one of the following sets of criteria:

  1. First name Soundex, last name Soundex, 2 out of 3 DOB parts
  2. Last name Soundex, 2 out of 3 DOB parts, sex
  3. First name Soundex, 2 out of 3 DOB parts, sex

OR

Individuals do not match on last 4 digits of SSN, but match on one of the following sets of criteria:

  1. Last name Soundex, first name Soundex, DOB, sex
  2. Last name Soundex, 2 of 3 DOB parts, ZIP Code, sex, (middle initial or date of death)
  3. First name Soundex, 2 of 3 DOB parts, ZIP Code, sex, (middle initial or date of death)

Algorithm 5: Criteria for classifying a match–

Individuals match on last 4 digits of SSN and one of the following sets of criteria:

  1. First name Soundex, last name Soundex, 2 out of 3 DOB parts
  2. Last name Soundex, 2 out of 3 DOB parts, sex
  3. First name Soundex, 2 out of 3 DOB parts, sex

OR

Individuals do not match on last 4 digits of SSN, but match on one of the following sets of criteria:

  1. Last name Soundex, first name Soundex, DOB, sex
  2. Last name Soundex, 2 of 3 DOB parts, ZIP Code, sex
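As an illustration, Algorithm 5 above can be written as a Python predicate over two records. The field names (`ssn4`, `first_sdx`, `last_sdx`, and so on) are hypothetical, dates of birth are assumed to be stored as (year, month, day) tuples, and treating a missing value as a disagreement is an assumption not stated in the chapter.

```python
def agree(a, b, field):
    """True when both records have a non-missing, equal value for `field`."""
    return a.get(field) is not None and a.get(field) == b.get(field)

def dob_parts_agreeing(a, b):
    """Count agreement across the year, month, and day of birth."""
    if a.get("dob") is None or b.get("dob") is None:
        return 0
    return sum(x == y for x, y in zip(a["dob"], b["dob"]))

def algorithm5_match(a, b):
    """Partial-identifier variant: last 4 SSN digits, Soundex names, DOB, ZIP, sex."""
    first = agree(a, b, "first_sdx")
    last = agree(a, b, "last_sdx")
    sex = agree(a, b, "sex")
    parts = dob_parts_agreeing(a, b)
    two_of_three = parts >= 2
    if agree(a, b, "ssn4"):
        return (
            (first and last and two_of_three)
            or (last and two_of_three and sex)
            or (first and two_of_three and sex)
        )
    # Last 4 SSN digits missing or different: require stronger agreement.
    full_dob = parts == 3
    return (
        (last and first and full_dob and sex)
        or (last and two_of_three and agree(a, b, "zip") and sex)
    )
```

The other four algorithms differ only in the supporting criteria, so the same helper functions could be reused with different final conditions.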

Deterministic Linkage

Results of the deterministic linkages are presented in Table 5.5. The relatively lower sensitivity scores (87.13–88.39) for algorithms using SSN reflect the fact that only 89 percent of the private payer’s members had a valid SSN listed. As expected, algorithms using SSN have very high specificity (99.99–100.00) and positive predictive value (99.33–99.90).

Table 5.5. Selected results of deterministic linkage algorithms.


When we excluded SSN, the best performing algorithms were able to identify correctly more matches (85.70–92.26) without sacrificing specificity (99.99–100.00), and with only minor decreases in positive predictive value (99.03–100.00). The most encouraging result is the finding that DOB, last name Soundex, first name Soundex, and sex correctly and uniquely identified 92 percent of matches identified by the SEER-Medicare algorithm, with specificity and positive predictive value over 99 percent. Preferably, all values would be greater than 95 percent, but this finding demonstrates that a good linkage can be performed in the absence of SSN or actual name. Importantly, exclusion of a linkage variable may reduce the number of matches in some cases if it results in a greater number of ties. This is demonstrated in Table 5.5 when comparing algorithms without unique identifiers such as SSN, name, and DOB, where inclusion of the variable “sex” resulted in 552 true matches and exclusion resulted in only 541 matches.

The sensitivity of algorithms that did not include SSN or name was significantly lower than that of algorithms that did include SSN and/or name. However, algorithms that blocked on primary site (e.g., diagnosis code for colon cancer) demonstrated high specificity (99.96–99.99) and high positive predictive value (95.09–99.59) (data not shown).

Results of the deterministic linkage approaches using encryption are presented in Table 5.6. The results for each algorithm were consistent with the previous results (Table 5.5), indicating that a deterministic match on identifiers encrypted before release can be successful in instances where identifiers are available but not releasable.

Table 5.6. Selected results of deterministic linkage algorithms using encrypted data.


Probabilistic Linkage

As shown in Table 5.7, the probabilistic approach improved the performance of all algorithms. When all identifiers were included, the sensitivity improved from ~87 percent to 97.92 percent, because many of the ~13 percent of the private payer’s members with missing SSNs were matched using the information provided by agreement on other identifiers.

Table 5.7. Selected results of probabilistic linkage algorithms.


This demonstrates the ability of probabilistic algorithms to perform well when data quality for some identifiers is poor. In this instance, missing information in one important identifier was overcome by information provided in other identifiers, thus improving the sensitivity and accuracy of the probabilistic approach compared with a deterministic approach. While the iterative deterministic approach used in the SEER-Medicare algorithm is similarly able to overcome poor data quality in an important identifier such as SSN, it relies on SSN and full name. Conversely, probabilistic algorithms can be effective in scenarios where SSN and full name are unavailable, as demonstrated by the second probabilistic algorithm reported in Table 5.7. Using only DOB, first and last name Soundex values, residence, diagnosis, and sex, the probabilistic approach was able to identify correctly 96.67 percent of true matches and 99.99 percent of true nonmatches. Thus, if confidentiality concerns block the release of SSN and full name in the future, registry data can still be linked successfully to claims using the probabilistic approach. The final results reported in Table 5.7 show that algorithms relying solely on DOB, residence, diagnosis, and sex were unsuccessful, although the probabilistic approach showed some improvement over the deterministic approach. Additional information not used in this study (e.g., service dates) may provide a probabilistic approach with the added power needed for a successful linkage without SSN or name Soundex values.

Discussion

The results of this study indicate that a successful linkage is possible in the absence of full identifying information. We found that straightforward and easy-to-employ deterministic algorithms using DOB and Soundex codes for names demonstrated high specificity and positive predictive value with acceptable sensitivity. In situations where identifiers are available, but not allowed to be released, we found that deterministic matching on hash-encrypted variable combinations performed as well as deterministic matching on the same combination of unencrypted variables. However, the performance of the hash-encrypted deterministic match requires that the variables within each dataset be cleaned and standardized in exactly the same way, and the exact linkage method needs to be known ahead of time. Thus, data-providing organizations have to commit more effort and coordination to collaboratively determine the best standardization methods and combination of variables to use for the linkage before the encryption can be carried out in each of the organizations.

In information-rich scenarios where identifiers are available for release, iterative deterministic approaches such as the SEER-Medicare algorithm are highly effective, and much more time and resource efficient than probabilistic approaches, which can be highly complex and difficult to implement. However, when unique identifiers such as SSN and full name are unavailable, the probabilistic approach consistently outperforms the deterministic approach. These findings are particularly important, as confidentiality concerns are making it increasingly difficult to obtain identifying information for linkage projects.

Appendix 5.1. SEER-Medicare Algorithm With Partial Identifiers

Abbreviations: DOB = date of birth; SEER = Surveillance, Epidemiology and End Results; SSN = Social Security Number.

Algorithm 1

If last 4 digits of SSN match, then
     if first name Soundex, last name Soundex, 2 out of 3 DOB parts match
     or
     if last name Soundex, 2 out of 3 DOB parts, sex match
     or
     if first name Soundex, 2 out of 3 DOB parts, sex match,
          then it’s a match.
If last 4 digits of SSN do not match, then
     if last name Soundex, first name Soundex, 2 out of 3 DOB parts, sex match, then
          if sum(of middle initial, date of death) >= 1
          or
          if sum(of ZIP, county) >= 1
          or
          if primary_site match,
               then it’s a match.

Algorithm 2

If last 4 digits of SSN match, then
     if first name Soundex, last name Soundex, 2 out of 3 DOB parts match
     or
     if last name Soundex, 2 out of 3 DOB parts, sex match
     or
     if first name Soundex, 2 out of 3 DOB parts, sex match,
          then it’s a match.
If last 4 digits of SSN do not match, then
     if last name Soundex, first name Soundex, 2 out of 3 DOB parts, sex match, then
          if sum(of middle initial, date of death) >= 1
          or
          if county match,
               then it’s a match.

Algorithm 3

If last 4 digits of SSN match, then
     if first name Soundex, last name Soundex match
     or
     if last name Soundex, 2 out of 3 DOB parts, sex match
     or
     if first name Soundex, 2 out of 3 DOB parts, sex match,
          then it’s a match.
If last 4 digits of SSN do not match, then
     if last name Soundex, first name Soundex, month of birth, sex match, then
          if sum(of year of birth, day of birth, middle initial, date of death) >= 2,
               then it’s a match.

Algorithm 4

If last 4 digits of SSN match, then
     if first name Soundex, last name Soundex, 2 out of 3 DOB parts match
     or
     if last name Soundex, 2 out of 3 DOB parts, sex match
     or
     if first name Soundex, 2 out of 3 DOB parts, sex match,
          then it’s a match.
If last 4 digits of SSN do not match, then
     if last name Soundex, first name Soundex, DOB, sex match
     or
     if last name Soundex, 2 out of 3 DOB parts, ZIP, sex, (middle initial or date of death) match
     or
     if first name Soundex, 2 out of 3 DOB parts, ZIP, sex, (middle initial or date of death) match,
          then it’s a match.

Algorithm 5

If last 4 digits of SSN match, then
     if first name Soundex, last name Soundex, 2 out of 3 DOB parts match
     or
     if last name Soundex, 2 out of 3 DOB parts, sex match
     or
     if first name Soundex, 2 out of 3 DOB parts, sex match,
          then it’s a match.
If last 4 digits of SSN do not match, then
     if last name Soundex, first name Soundex, DOB, sex match
     or
     if last name Soundex, 2 out of 3 DOB parts, ZIP, sex match,
          then it’s a match.
