U.S. flag

An official website of the United States government

NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Dusetzina SB, Tyree S, Meyer AM, et al. Linking Data for Health Services Research: A Framework and Instructional Guide [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2014 Sep.

Cover of Linking Data for Health Services Research

Linking Data for Health Services Research: A Framework and Instructional Guide [Internet].

Show details

4An Overview of Record Linkage Methods


Randomized controlled trials (RCTs) remain the gold standard for assessing intervention efficacy; however, RCTs are not always feasible or sufficiently timely. Perhaps more importantly, RCT results often cannot be generalized due to a lack of inclusion of “real-world” combinations of interventions and heterogeneous patients.14 With recent advances in information technology, data, and statistical methods, there is tremendous promise in leveraging ever-growing repositories of secondary data to support comparative effectiveness and public health research.2,6,39 While there are many advantages to using these secondary data sources, they are often limited in scope, which in turn limits their utility in addressing important questions in a comprehensive manner. These limitations can be overcome by linking data from multiple sources such as health registries and administrative claims data.7,1416

An extensive and complex process, record linkage is both a science and an art.59 While the process can be difficult to navigate, many effective strategies have been developed and documented in the health services literature. However, new policies and concerns over data security are making it more challenging for investigators to link data using traditional methods. Of particular importance are increasingly restrictive policies governing Protected Health Information that severely limit access to the unique identifiers on which many documented strategies rely. In light of these challenges, there is a need to increase understanding and extend capacity to perform reliable linkages in varying scenarios of data availability.

In this chapter, we guide the reader through an overview of data linkage methods and discuss the strengths and weaknesses of various linkage strategies in the effort to develop and document a set of best practices for conducting data linkages with optimal validity and reliability, and minimal risk to privacy and confidentiality.

Data Cleaning and Standardization

Data come in different shapes, sizes, and quality, creating scenarios that must be considered in building a linkage algorithm. For instance, demographic information often contains typographical and data entry errors, such as transposed Social Security Number (SSN) digits and misspellings. An individual’s information sometimes changes over time, with life changes such as marriage or moving leading to changes in one’s name or address. People sometimes deliberately report false information to defraud insurance providers or to avoid detection. Twins can have very similar information. Finally, the spouses and/or children of the family’s primary health insurance subscriber sometimes use the primary subscriber’s information. These idiosyncrasies are what make data linkage difficult, so the more work done upfront to clean and standardize the data, the better the chances of a successful linkage.

With this in mind, the first step after data delivery is to examine the nature of the data, paying particular attention to the way information is stored, the completeness of the identifying information, the extent to which information overlaps, and the presence of any idiosyncrasies in the data. By doing so, steps can be taken to clean and standardize the available information across data sources to minimize false matches attributable to typographical errors. Table 4.1 shows some of the commonly seen issues to clean, account for, and/or standardize before beginning the linkage process.

Table 4.1. Common variations found in selected linkage identifiers.

Table 4.1

Common variations found in selected linkage identifiers.

Many data manipulation techniques are available in commonly used software (e.g., SAS and Stata) to facilitate the data cleaning and standardization process. Using these techniques renders all linkage variables the same across data sources—that is, variables that will be compared are forced into the same case (e.g., all uppercase), the same format (e.g., 01SEP2013), the same content (e.g., stripped of all punctuation and digits), and the same length.

Additionally, identifiers can be parsed into separate pieces of information. For instance, full names can be parsed into first, middle, last, and maiden names; dates of birth can be parsed into month, day, and year of birth; and addresses can be parsed into street, city, State, and ZIP Code. Parsing identifiers into separate pieces where possible allows the researcher to maximize the amount of available information and give credit for partial agreement when record pairs do not agree character for character. This is particularly important in accounting for changes across time, such as a name change after marriage or an address change after a move. In such cases, matching on the separate pieces allows for the possibility of partial credit that, when combined with other information, may provide sufficient evidence that the records being compared represent the same person.

Other techniques have been developed to account for minor misspellings and typographical errors. Here are three examples.

  1. Strings can be converted to phonetic codes (e.g., Soundex or New York State Immunization Information System) before comparison.
  2. Strings can be compared using editing distance techniques to determine how many steps (insertions, deletions, transpositions, etc.) are required to get from String A to String B. For example, it would require one step—one character deletion—to get from “Billy” to “Bill.”
  3. Names can be linked to an array of synonyms (e.g., “William” and “Bill”) to account for the use of nicknames.6062

Each of these techniques has been shown to improve the accuracy of string comparisons.61,62 In Appendix 4.1 we have provided sample SAS code for performing these functions.

Finally, it is important to consider informational overlap in the available linkage identifiers. As in statistical modeling, where two strongly correlated variables should not be included together in the model, variables with overlapping information should not be included in the same linkage algorithm. In such cases, assigning credit for matches on both variables (e.g., ZIP Code and county) is redundant, leading to an overestimate of the extent to which two records agree. Similarly, assigning separate penalties for nonmatches on overlapping variables (e.g., first name and first name initial) is redundant, leading to an overestimate of the extent to which two records disagree. In the case of overlapping variables, researchers can pick either one or take the most informative match; for example, if two records match on ZIP Code and county, then the match on ZIP Code is more informative.

Alternatively, if there are concerns regarding the quality of two overlapping variables, the researcher may wish to explore the use of both variables to improve match rates. For example, if county to county matches are made but ZIP is off by one digit due to a typo, the researcher may choose to confirm the match based on the county variable. This redundancy in information can be used to account for random errors, if used properly. The researcher must weigh the extent to which the inclusion of additional variables is thought to improve the match (without resulting in additional false matches). Similarly, the researcher may also desire to retain overlapping variables to perform iterative deterministic matching (match first on ZIP Code for exact matches and then on county when ZIP Codes do not match exactly).

How much data cleaning and standardization are necessary depends on the quality of the data, the research question, and the discretion of the researcher. While cleaning can improve linkage rates, the cleaning process can be quite labor intensive, so researchers should consider the cost-benefit analysis before investing a significant amount of time on cleaning the data. Cleaning has been highly recommended if the data quality is poor and/or only a few identifiers are available.63 When the data quality is relatively good or many identifiers are available to override the errors, the cost usually outweighs the reward.64 The scope of the project should be considered as well. If the project is exploratory in nature, then the potential harm of false positives and false negatives may be negligible. Conversely, a hypothesis test may be significantly biased by false positives and false negatives.65 Each of these considerations should be weighed against one another in the effort to determine whether data cleaning is prudent.

Linkage Methods

There are two main types of linkage algorithms: deterministic and probabilistic. Both have been successfully implemented in previous research studies.11,21,24,35,36,44,59,6680 Choosing the best algorithm to use in a given situation depends on many interacting factors, such as time; resources; the research question; and the quantity and quality of the variables available to link, including the degree to which they individually and collectively are able to identify an individual uniquely. With this in mind, it is important that researchers be equipped with data linkage algorithms for varying scenarios. The key is to develop algorithms to extract and make use of enough meaningful information to make sound decisions. In this section, we will review the main algorithm types and discuss the strengths and weaknesses of each in an effort to derive a set of guidelines describing which algorithms are best in varying scenarios of data availability, data quality, and investigator goals.

Deterministic Linkage Methods

Deterministic algorithms determine whether record pairs agree or disagree on a given set of identifiers, where agreement on a given identifier is assessed as a discrete—“all-or-nothing”—outcome. Match status can be assessed in a single step or in multiple steps. In a single-step strategy, records are compared all at once on the full set of identifiers. A record pair is classified as a match if the two records agree, character for character, on all identifiers and the record pair is uniquely identified (no other record pair matched on the same set of values). A record pair is classified as a nonmatch if the two records disagree on any of the identifiers or if the record pair is not uniquely identified. In a multiple-step strategy (also referred to as an iterative or stepwise strategy), records are matched in a series of progressively less restrictive steps in which record pairs that do not meet a first round of match criteria are passed to a second round of match criteria for further comparison. If a record pair meets the criteria in any step, it is classified as a match. Otherwise, it is classified as a nonmatch. These two approaches to deterministic linkage can also be called “exact deterministic” (requiring an exact match on all identifiers) and “approximate or iterative deterministic” (requiring an exact match on one of several rounds of matching but not on all possible identifiers).

While the existence of a gold standard in registry-to-claims linkages is a matter of debate, the iterative deterministic approach employed by the National Cancer Institute to create the SEER (Surveillance, Epidemiology and End Results)-Medicare linked dataset81,82 has demonstrated high validity and reliability and has been employed successfully in multiple updates of the SEER-Medicare linked dataset.83,84 The algorithm consists of a sequence of deterministic matches using different match criteria in each successive round.

In the first step, two records must match on SSN and one of the following:

  • First and last name (allowing for fuzzy matches, such as nicknames)
  • Last name, month of birth, and sex
  • First name, month of birth, and sex

If SSN is missing or does not match, or two records fail to meet the initial match criteria, they may be declared a match if they agree on the match criteria in a second round of deterministic linkages, in which two records must match on last name, first name, month of birth, sex, and one of the following:

  • Seven to eight digits of the SSN
  • Two or more of the following: year of birth, day of birth, middle initial, or date of death

In situations in which full identifiers or partial identifiers are available but may not be released or transmitted, a deterministic linkage on encrypted identifiers may be employed. Quantin and colleagues33,4647 have developed procedures for encrypting identifiers using cryptographic hash functions so identifiers needed for linkage can be released directly to researchers without compromising patient confidentiality. A cryptographic hash function, such as the Secure Hash Algorithm version 2 (SHA-2), released by the National Security Agency and published by the National Institute of Standards and Technology, is a deterministic procedure that takes input and returns an output that was intentionally changed by an algorithm. Due to its deterministic attributes and the inability to reverse-engineer a hashed value, it has been widely adopted for security applications and procedures. Quantin’s research shows that records can successfully be linked via deterministic algorithms using identifiers encrypted before release.46,8586 It should be noted, however, that record pairs matched on encrypted identifiers cannot be manually reviewed or validated.

Probabilistic Linkage Methods

The deterministic approach ignores the fact that certain identifiers or certain values have more discriminatory power than others do. Probabilistic strategies have been developed to assess (1) the discriminatory power of each identifier and (2) the likelihood that two records are a true match based on whether they agree or disagree on the various identifiers.

According to the model developed by Fellegi and Sunter,33 matched record pairs can be designated as matches, possible matches, or nonmatches based on the calculation of linkage scores and the application of decision rules. Say we have two files, A and B, where file A contains 100 records and File B contains 1,000 records. The comparison space is the Cartesian product made up of all possible record pairs (A * B), or 100 * 1,000 = 100,000 possible matches. Each pair in the comparison space is either a true match or a true nonmatch.

When dealing with large files (e.g., claims files), considering the entire Cartesian product is often computationally impractical. In these situations, it is advisable to reduce the comparison space to only those matched pairs that meet certain basic criteria. This is referred to as “blocking,” which serves to functionally subset a large dataset into a smaller dataset of individuals with at least one common characteristic, such as geographic region or a specific clinical condition. For instance, the number of matched pairs to be considered may be limited to only those matched pairs that agree on clinical diagnosis or on both month of birth and county of residence. Those record pairs that do not meet the matching criteria specified in the blocking phase are automatically classified as nonmatches and removed from consideration. To account for true matches that were not blocked together (due to data issues), typically multiple passes are used so that rows that were not blocked together in one pass have the potential to be blocked and compared in another pass to avoid automatic misclassification. Since two records cannot be matched on missing information, the variables chosen for the blocking phase should be relatively complete, having few missing values. Blocking strategies such as this reduce the set of potential matches to a more manageable number. Because blocking strategies can influence linkage success, Christen and Goiser recommend that researchers report the specific steps of their blocking strategy.87

The two records in every matched pair identified in the blocking phase are compared on each linkage identifier, producing an agreement pattern. The weight assigned to agreement or disagreement on each identifier is assessed as a likelihood ratio, comparing the probability that true matches agree on the identifier (“m-probability”) to the probability that false matches randomly agree on the identifier (“u-probability”).

The m-probability can be estimated based on values reported in published linkage literature or by taking a random sample of pairs from the comparison space, assigning match status via manual review, and calculating the probability that two records agree on a particular identifier when they are true matches. The u-probability can be calculated by observing the probability that two records agree on a particular identifier merely by chance; for example, the u-probability for month of birth is 1/12, or .083. Calculating value-specific u-probabilities for an identifier based on the frequency of each value and the likelihood that two records would agree on a given value simply by chance yields additional information. For instance, a match on a rare surname such as Lebowski is less likely to occur by chance, and is thereby assigned greater weight than a match on a common surname such as Smith. This principle can be applied to any linkage identifier for which values are differentially distributed.

When two records agree on an identifier, an agreement weight is calculated by dividing the m-probability by the u-probability and taking the log2 of the quotient. For example, if the probability that true matches agree on month of birth is 97 percent and the probability that false matches randomly agree on month of birth is 8.3 percent (1/12), then the agreement weight for month of birth would be calculated as (log2[.97/.083]), or 3.54. When two records disagree on an identifier, a disagreement weight is calculated by dividing 1 minus the m-probability by 1 minus the u-probability. For example, the disagreement weight for month of birth would be calculated as (log2 [(1-.97)/(1-.083)]), or - 4.93.

While the method above accounts for the discriminatory power of the identifier, it does not yet take into account the degree to which records agree on a given identifier. This is important for fields where typographical errors are likely to occur (e.g., names, addresses). Assigning partial agreement weights in situations where two strings do not match character for character can account for minor typographical errors, including spelling errors in names or transposed digits in dates or SSNs.6062 Partial agreement weights for string comparators can account for both the length of the string and common human errors made in alphanumeric strings. If all of the characters in a string are matched character by character across two files, then the agreement weight is maximized (set at the full agreement weight). If there are no characters in common, where common characters are defined by Winkler as being no further than ([m/2] – 1) where m = (max [length of string in file 1, length of string in file 2]), then the agreement weight is zero.62 String comparator values take into account characteristics that would reduce the likelihood of matching, such as string length, number of transpositions of characters, number of characters in common, and location on the string where the nonmatch occurs. For example, lower weights would be assigned for short names or when the first characters of the string are not matched between the two files. From Winkler’s prior example, larger string comparator values were assigned when comparing “Shackleford” and “Shackelford” (string comparator value = 0.9848) than “Lampley” and “Campley” (string comparator value = 0.9048). The full agreement weight for the identifier can then be multiplied by the string comparator value to generate a partial agreement weight. For example, if the full agreement weight for first name is 12 and the string comparator value is 0.95 that the first name on one record matches the first name on another record, then the partial agreement weight would be equal to 12*0.95, or 11.4. Once the weights, full and partial, for each identifier have been calculated, the linkage score for each matched pair is equal to the sum of the weights across all linkage identifiers. Use of string comparator methods may significantly improve match rates if a large number of typographical errors are expected.

An initial assessment of linkage quality can be gained by plotting the match scores in a histogram. If the linkage algorithm is working properly, then the plot should show a bimodal distribution of scores, with one large peak among the lower scores for the large proportion of likely nonmatches and a second smaller peak among the higher scores for the smaller set of likely matches. The cutoff threshold for match/nonmatch status will be a score somewhere in the trough between the two peaks. Depending on the research question and the nature of the study, the initial threshold can be adjusted to be more conservative (higher score) or more liberal (lower score). A more conservative threshold will maximize the specificity of the linkage decision, as only those record pairs with a high score will be counted as matches. Conversely, a more liberal threshold will maximize the sensitivity of the linkage decision to possible matches.

Cook and colleagues44 define the cutoff threshold as the difference between the desired weight and the starting weight. Given two files, A and B, the starting weight for each record pair is equal to the log2 of the odds of picking a true match by chance,

log2(E / ((A × B) − E)),

where E is the number of expected matches, A is the number of records in File A, and B is the number of records in File B. If the number of records in File A is 1,000, the number of records in File B is 1,000,000, and the expected number of matches is 10 percent of 1,000 records in File A, or 100, then the starting weight will be −23.25. The “expected” number of matches can be determined by prior research, prior knowledge, or an educated guess if there is no precedent.

If P is the desired probability that two records were not matched together by chance (i.e., the desired positive predictive value), then the desired weight for each record pair is equal to the log2 of the odds associated with P:

log2(P / (1 − P)).

If the desired value of P is 0.95, then the desired weight is log2(0.95 / (1 – 0.95)) = 4.25. The cutoff threshold, or score needed to have a false-match rate of <0.05, is the difference between the desired weight, 4.25, and the starting weight, −23.25, which is 27.50. If the computed linkage score is greater than or equal to the cutoff threshold, then the record pair is classified as a match. If the computed linkage score is less than the cutoff threshold, then the record pair is classified as a nonmatch. Researchers wishing to maximize the sensitivity of the algorithm to potential matches can relax this threshold somewhat and manually review all record pairs with scores near the calculated cutoff.

Once this process is complete, a sample of the match decisions made by the linkage algorithm should be reviewed to ensure that the algorithm performed as intended. By reviewing match decisions, you can often identify conditions in which the algorithm could use some tweaking to account for difficult cases. For instance, frequently the children and/or spouses of the primary subscriber use the primary subscriber’s SSN, thereby making it difficult to identify them as unique individuals given the large weight often assigned to agreement on SSN. Twins are another difficult case, as they have the same birthdate, frequently have similar names, and often have SSNs that differ on only one to two digits. Records for a married woman can sometimes be difficult to match when marriage leads to changes in the woman’s last name and/or address. Finally, tweaks to the algorithm can often improve the performance of the algorithm on the “fuzzy” or borderline cases. By reviewing a sample of match decisions, you can tweak your algorithm to account for each of these cases, thereby improving the performance of the algorithm. This will be particularly important for researchers who hope to reuse the algorithm for future linkages (e.g., matching a new year of registry cases to a new year of administrative claims data). A summary of steps for performing probabilistic record linkage is provided below.

Summary of Steps for Probabilistic Record Linkage

  1. Estimate the m and u probabilities for each linking variable using the observed frequency of agreement and disagreement patterns among all pairs, commonly generated using the EM (expectation-maximization) algorithm described by Fellegi-Sunter.
  2. Calculate agreement and disagreement weights using the m and u probabilities.
  3. Calculate a total linking weight for each pair by summing the individual linking weights for each linkage variable.
  4. Compare the total linkage weight to a threshold above which pairs are considered a link. The threshold is set using information generated in step 1.

Alternative Linkage Methods

The methods presented above are those most commonly used in registry-to-claims linkages. Other methods are available for researchers who have more challenging linkage scenarios. The EM algorithm61,62 is an iterative approach to estimating m- and u-probabilities. According to Winkler, the EM algorithm provides very accurate estimates of m- and u-probabilities in situations where the amount of typographical error in the identifiers is minimal, but performs poorly when the identifiers contain numerous typographical errors. In the sorted-neighborhood approach,88 the data sources are stacked and sorted by various combinations of the available identifiers. For each sort, all records within a window of n-records are compared. The Bayesian approach89 is an alternative to the frequentist approach presented above. The computer science literature also includes distance-based techniques90 as well as unsupervised and supervised machine-learning approaches.91,92

Selecting a Linkage Method

What kind of linkage method to employ in a given situation depends on a variety of factors, some of which are scientific and some of which are more subjective. In information-rich scenarios where direct identifiers are available and of good quality, deterministic methods have been recommended.42 In these scenarios, deterministic methods are easy to implement, easy to interpret, and effective. In scenarios that are information poor (where direct identifiers are unavailable) and/or the data are of poor quality, probabilistic methods consistently outperform deterministic methods and thus merit the extra time and resources required to implement them. Beyond these broad guidelines, the decision is left to the researchers and their goals for the project. Herein lies the “art” of record linkage.59 For instance, researchers studying a rare disease may want to employ probabilistic methods even in information-rich scenarios in the effort to identify every possible match and maximize sample size. Ultimately, every researcher must weigh the pros and cons of the available methods in the context of the project and choose the method that best fits the budget, timeline, allotted resources, and research question.

Evaluating Linkage Algorithms

In record linkage, there are two types of accuracy errors. A Type I error occurs when a true nonmatch is classified as a match. A Type II error occurs when a true match is classified as a nonmatch. Minimizing these errors is critical, particularly when the product will be used in a cohort study,65 where linkage error can introduce bias into analyses. In the health sciences literature, the four metrics most often used to evaluate the accuracy of a linkage algorithm are: (1) sensitivity, (2) specificity, (3) positive predictive value, and (4) negative predictive value. Table 4.2 shows all possible outcomes of a linkage decision.

Table 4.2. True match status by algorithm output.

Table 4.2

True match status by algorithm output.

“A” represents all true matches that are correctly classified as matches; “B” represents all true matches that are incorrectly classified as nonmatches; “C” represents all true nonmatches that are incorrectly classified as matches; and “D” represents all true nonmatches that are correctly classified as nonmatches.

Sensitivity = A / (A + B). Sensitivity measures the ability of an algorithm to correctly classify true matches as matches.

Specificity = C / (C + D). Specificity measures the ability of an algorithm to correctly classify true nonmatches as nonmatches.

Positive predictive value (PPV) = A / (A + C). PPV represents the proportion of matched pairs classified by the algorithm as matches that are true matches.

Negative predictive value (NPV) = D / (B + D). NPV represents the proportion of matched pairs classified by the algorithm as nonmatches that are true nonmatches.

These metrics measure an algorithm’s ability to classify correctly true matches as matches, and true nonmatches as nonmatches.

Due to the large number of potential matches identified during the blocking phase, the bulk of the comparison space will be made up of true nonmatches. For this reason, Christen and Goiser87 argue that linkage metrics that include true nonmatches (e.g., specificity and NPV) in the equation will be skewed. Instead, they recommend metrics such as the f-measure,92 which represents the harmonic mean of the sensitivity, and PPV, which is not influenced by the large number of true nonmatches. The f-measure is calculated as

((β2 + 1.0) * Sensitivity * PPV) / (β2 * Sensitivity + PPV),

where β is equal to the user-assigned relative importance of sensitivity over PPV. If the user wishes to assign equal weight to sensitivity and PPV, then β = 1.0. If the user wishes to assign sensitivity twice the weight of PPV, then β = 2.0.

These metrics represent tradeoffs between sensitivity and specificity. Investigators hoping to maximize sensitivity may increase their false-positive rate, subsequently resulting in a smaller PPV. Those wishing to maximize specificity may increase their false-negative rate. If the goals of the investigation are to estimate all or most true cases, then investigators may want to focus on sensitivity. Otherwise, if the goals are to reduce the likelihood of including false positives, the researcher may focus on specificity. While there is no hard rule, a good linkage algorithm will typically have values of sensitivity, PPV, and the f-measure in excess of 95 percent. However, what is acceptable depends on the context of the study. If the study involves the testing of hypotheses and/or the results will have significant practical implications (e.g., findings will be incorporated into clinical practice or influence policy), a higher percent match is more desirable. On the other hand, if a study is exploratory, a lower percent match may be acceptable. What is acceptable may also vary depending on the frequency of the outcome. Researchers studying a rare disease may seek to emphasize sensitivity to maximize the sample size, while a researcher studying a more frequently occurring disease may want to emphasize PPV to ensure that matches identified by the algorithm are true matches.

Validating Linkage Results

The final step of the linkage process is the validation of the match results. Initial steps for determining linkage validity are to look for ties in which multiple record pairs identified as matches by the algorithm have exactly the same set of values. For example, common names such as “Mary Smith” or “Mike Brown” are typically repeated in large datasets, and thus have multiple matches. Where possible, ties should be adjudicated by reference to additional information. If no additional information is available, then the record pairs should be classified as nonmatches. If an algorithm is successful, there will be few to no ties.

The next step is to assess the extent to which your matched sample reflects the target population. For instance, in a study linking a single State’s cancer registry to Medicare administrative claims for that State, researchers may use estimates of the percentage of cancer patients age 65 and older to determine what percentage of patients in the cancer registry would be expected to be linked to the Medicare data. If estimates indicate that 60 percent of cancer patients in the State are 65 and older, then it is reasonable to expect that 60 percent of the patients in the cancer registry will be matched with Medicare. If, instead, the researcher finds that only 30 percent of patients in the cancer registry are successfully matched, this may serve as a signal that there is a problem with the matching algorithm.

While not well documented in the literature, some form of manual review is typically employed to check the results. Before starting the manual review process, a set of decision rules is developed to standardize the decision process across reviewers. Next, a random sample is drawn from the set of all potential matches identified during the blocking phase. Following the decision rules, one or more reviewers then determine whether each potential match is a match or nonmatch. Finally, the decisions documented during the manual review process are used as a gold standard against which the decisions made by the algorithm are compared, allowing for the calculation of the sensitivity, PPV, and f-measure of the algorithm. A good algorithm should have scores of 95 percent or better across the three metrics.

Final Remarks

In this chapter, we have provided an overview of data linkage methodology from the point of data delivery to the reporting of the linkage results. We noted that if data quality is good (e.g., unique identifiers with few typographical errors and/missing values), the cost of the labor-intensive process of cleaning and standardizing the data is not worth the reward, and therefore not recommended. In this scenario, deterministic linkage methods are accurate, straightforward, and easy to implement.

If data quality is poor (e.g., little identifying information and/or numerous typographical errors), cleaning and standardizing the data before linkage can greatly improve linkage rates. In these scenarios, iterative deterministic techniques or more sophisticated probabilistic techniques are recommended. Combining deterministic and probabilistic methods can improve efficiency and save computational resources. When combining methods, a deterministic match on all identifiers can be executed first to identify certain matches. The remaining record pairs that disagree on at least one of the identifiers can be submitted for probabilistic matching. Using deterministic and probabilistic methods in this stepwise fashion reduces the number of record pairs that will be processed in the more resource-intensive probabilistic matching phase.

In order to limit the computational resources required to compare the Cartesian product of all possible matches, blocking should be implemented to reduce the comparison space to record pairs that agree on some basic criteria (e.g., date of birth and county or clinical diagnosis). This process can improve computational efficiency and performance substantially. It is important to do multiple blocking passes to account for data issues on the blocking variable. Blocking techniques can have a significant effect on the linkage results, and thus researchers should report the blocking method.

Both deterministic procedures and probabilistic procedures should be considered iterative. After completing the initial linkage, a random sample of match decisions should be reviewed to ensure that the algorithm is performing as intended. If the review process reveals opportunities for improvement, then the algorithm should be adjusted to account for the identified weaknesses.

Once the linkage process is complete, the results should be compared to known metrics. For instance, if it is known that 20 percent of cancer patients in the State are covered by private insurance, then roughly 20 percent of the records in a State cancer-registry database should match to private insurance claims. If the observed match rate differs substantially from the expected results, then the linkage method should be reevaluated and repeated.

When reporting linkage results, estimates of the sensitivity, PPV, and f-measure of the algorithm should be reported to provide readers with a characterization of the validity and reliability of the linkage product. Due to the disproportionately large number of nonmatched pairs identified during the blocking phase, measures that include the number of nonmatched pairs in the calculation (e.g., specificity and NPV) should not be reported.

In the next chapter, we will empirically demonstrate the approaches described in this chapter and develop and test a series of deterministic and probabilistic algorithms for scenarios of varying unique identifier availability.

Researchers who wish to learn more about data linkage approaches and techniques beyond those covered in this chapter are referred to Dr. William Winkler’s list of Statistical Data Editing References: http://citeseerx.ist.psu.edu/viewdoc/summary;jsessionid=BAA7B495D9CFBEB3276C67AB96BFFA6D?doi=

Appendix 4.1. Useful SAS Functions and Procedures

Data Cleaning and Standardization

Remove all special characters

Syntax: compress(varname, “.,’{}[]/\()+~`!&#$%^&*_<>?”);

Ex. Remove all special characters from the string “(John_Doe!)”
newname = compress(name, “.,’{}[]/\()+~`!&#$%^&*_<>?”);
put name;

Remove a specific character

Syntax: compress(varname, “-”);

Ex. Remove all dashes from the string “999-99-9999”
newssn = compress(ssn, “-”);
put newssn;

Remove all punctuation

Syntax: compress(varname, ,“P”);

Ex. Remove all punctuation from “O’Brien-Smith”
newname = compress(name, ,“P”);
put newname;

Remove all digits

Syntax: compress(varname, ,“D”);

Ex. Remove all digits from the string “J2ohn”
newname = compress(name, ,“D”);
put newname;

Remove all spaces

Syntax: compress(varname, “ ”);

Ex. Remove all spaces from the string “Van Slyke”
newname = compress(name, “ ”);
put newname;

Remove leading and trailing spaces

Syntax: strip(varname);

Ex. Remove all leading and trailing spaces from the string “ John “
newname = strip(firstname);
put newname;

Remove extra spaces with a single space

Syntax: compbl(varname);

Ex. Remove all extra spaces from the string “Van Slyke”
newname = compbl(name);
put newname;
Van Slyke

Extracting data from a substring

Syntax: substr(varname, start, count);

Ex. Extract first initial from the name “John”
initial = substr(name, 1, 1);
put initial;

Search for and replace a string

Syntax: tranwrd(varname, string_to_be_replaced, replacement_string);

Ex. Find all instances of the string “Road” in the string “Anystreet Road” and replace with “Rd.”
newstreet = tranwrd(street, “Road”, “Rd.”);
put newstreet;
Anystreet Rd.

Make all characters in a string uppercase

Syntax: upcase(varname);

Ex. Make all characters in the string “John” uppercase
newname = upcase(name);
put newname;

Make all characters in a string lowercase

Syntax: lowcase(varname);

Ex. Make all characters in the string “JOHN” lowercase
newname = lowcase(name);
put newname;

Make all characters in a string proper case (1st character uppercase, remaining characters lowercase)

Syntax: propcase(varname);

Ex. Convert the string “JOHN” to proper case
newname = propcase(name);
put newname;

Concatenate two strings

Syntax: varname1||varname2;
Syntax: cat(varname1,varname2);

Ex. Concatenate first name (“JOHN”) and last name (“DOE”)
newname = firstname||lastname;
newname = cat(firstname, lastname);
put newname;

For more information on CAT functions see:
“Purrfectly fabulous feline functions” by Louise S. Hadden

Parse out pieces in a string

Syntax: scan(varname, count, delimiter);

Ex. Parse out first, middle and last name from “JOHN SMITH DOE”
firstname = scan(fullname, 1, “ “);
put firstname;

middlename = scan(fullname, 2, “ “);
put middlename;

lastname = scan(fullname, 3, “ “);
put lastname;

Convert string to phonetic code

Syntax: soundex(varname);

Ex. Convert the first name “JOHN” to phonetic code
newname = soundex(firstname);
put newname;

Calculate editing distance

Syntax: spedis(varname);
Syntax: complev(varname);
Syntax: compged(varname);

Ex. Determine the editing distance between ‘CHARLES’ and ‘CHARLIE’
name1 = ‘CHARLES’;
name2 = ‘CHARLIE’;

spedis = spedis(name1, name2);
put spedis;

complev = complev(name1, name2);
put complev;

compged = compged(name1, name2);
put compged;

Note: For a comparison of the editing distance functions, see “Fuzzy Matching using the COMPGED
Function” by Paulette Staum

Social Security Number (SSN) Validation

See “Identifying Invalid Social Security Numbers” by Paulette Staum and Sally Dai in the NorthEast SAS Users Group (2007).

Variable Encryption

Syntax: md5(varname);

Ex. Encrypt the name “JOHNDOE”
newname = md5(name);
put newname;

General Matching Techniques

Loop to determine how many SSN digits match

attrib SSN_DIG_MATCH format = 1. label = ‘# of SSN Digits that Match’;

do i = 1 to 9; if substr(put(CCR_SSN,$9.),i,1) = substr(put(PAYER_SSN,$9.),i,1) then SSN_DIG_
MATCH + 1;

Compare beginnings of strings, up to length of shorter string

Syntax: compare(compress(string1,’ ‘), compress(string2,’ ‘), ‘:’);

Ex. Compare the last names ‘WILLIAMS-SMITH’ and ‘WILLIAMS’ character-for-character up to
length of shorter name

if compare(compress(lastname1,’ ‘), compress(lastname2,’ ‘), ‘:’) = 0
then ln_compare = ‘match’;

*** Use with caution. This technique is a useful way to identify cases to manually review (e.g., names
reported differently over time).

Compare ends of strings (in reverse order), up to length of shorter string

Syntax: compare(compress(reverse(string1),’ ‘),
compress(reverse(string2),’ ‘), ‘:’);

Ex. Compare the last names ‘WILLIAMS-SMITH’ and ‘SMITH’ character-for-character up to length of
shorter name

if compare(compress(reverse(lastname1),’ ‘),
compress(reverse(lastname2),’ ‘), ‘:’) = 0
then ln_compare = ‘match’;

*** Use with caution. This technique is a useful way to identify cases to manually review (e.g., names
reported differently over time).

Surveillance, Epidemiology and End Results (SEER)-Medicare Algorithm

attrib seermedicare format = 1. length = 3.;
seermedicare = 0;

if ssn then do;
            if ( firstname and lastname )
             (lastname and birthmonth and gender )
             (firstname and birthmonth and gender )
            then seermedicare = 1;


/* AND GENDER MATCH                                                                                                      */

if not ssn and ( lastname and firstname and birthmonth and gender )
then do;

/* MIDDLE INITIAL OR DATE OF DEATH MATCH                                           */
if (ssn_dig_match >= 7) or
           (sum(of birthyear, birthday, middleinitial, deathdate) >= 2)
           then seermedicare = 1;


Probabilistic Matching Techniques

Calculate u-probability

Ex. Generate u-probability for firstname

proc freq data = lib.dataset_a (keep = firstname) noprint;
table firstname / norow nocol out = lib.fnfreq;

data lib.fnfreq (keep = firstname fn_uprob);
set lib.fnfreq;
attrib fn_uprob length = 8. format = 10.9;
fn_uprob = percent / 100;

Calculate probabilistic weights


if firstname1 = firstname2 then fn_match = ‘match’;
else fn_match = ‘nonmatch’;

/* LOG2(M_PROB/U_PROB)                                                                                     */
/* ASSUME M-PROBABILITY FOR FIRST NAME = 0.95                                   */

if fn_match = ‘match’ then do;
  fn_agree_weight = log2(0.95 / fn_uprob);

/* LOG2( (1-M_PROB) / (1-U_PROB) )                                                                     */
/* ASSUME M-PROBABILITY = 0.95                                                                       */

else if fn_match = ‘nonmatch’ then do;
  fn_disagree_weight = log2( (1-0.95) / (1-fn_uprob) );

Use editing distance method to calculate probability that two strings are a match

if mean(

    1 - (length(firstname1) * spedis(firstname1, firstname2)) / 2400),
    1 - (length(firstname2) * spedis(firstname2, firstname1)) / 2400))
    >= .95
then firstname_spedis = ‘match’;

*** Source: “Fuzzy Key Linkage” by Sigurd Hermansen in Southeast SAS Users Group (2001)

Efficiency and Optimization Techniques

proc sql;
create index varname on datasetname (varname);

Ex. Create index on SSN to improve performance of match on SSN
proc sql;
create index ssn on dataset_a (ssn);

options msglevel = I;
proc sql;
create table matches as
select *
from dataset_a as f1, dataset_b as f2
where f1.ssn = f2.ssn;

Appendix 4.2. Data Linkage Software Packages

Manually writing the code to perform each step of the data linkage process in software packages such as SAS, MS SQL Server, or R gives the user full control over the entire process. While manual coding is ideal, the person writing the code must be familiar with linkage theory and must possess a great deal of programming expertise, qualities that may require funding for an expensive programmer. Furthermore, the amount of time required to write the code and the amount of computer resources needed to execute the linkage can be substantial. For researchers who lack the time, computer resources, expertise, or personnel required to write the needed code manually, many public-domain and commercial products are available to streamline and simplify the linkage process. Here, we present several commonly used products.

Publicly Available Packages

The following data linkage packages are available at no charge to the public:

  • Link Plus, developed by the Centers for Disease Control, is available at www.cdc.gov/cancer/npcr/tools/registryplus/lp.htm. The package provides a graphical user interface that is straightforward and easy to use, requiring only beginner-level knowledge of the linkage process. While Link Plus is easy to use, it may not be efficient or capable of processing large datasets (those with > 1 million records). This makes it difficult for researchers attempting to link claims datasets, which typically contain millions of records.
  • The Link King, developed by Washington State’s Division of Alcohol and Substance Abuse, is available at www.the-link-king.com. The product itself is free, but it requires a license for base SAS, which currently costs ~ $2,000. Like Link Plus, the package provides a graphical user interface that is straightforward and easy to use, requiring only beginner-level knowledge of the linkage process.. While the Link King is capable of handling larger datasets, it requires first and last name, as well as Social Security Number or date of birth, which means that it can be used only in information-rich scenarios.
  • ChoiceMaker 2 (developed by ChoiceMaker Technologies and available at www.sourceforge.net/projects/oscmt/) and FEBRL (developed by the ANU Data Mining Group and available at www.sourceforge.net/projects/febrl/) are two products that health services researchers have used frequently in recent years. The authors are not aware of any scholarly evaluations of these products.

Commercially Available Products

Selected commercially available products are listed below:


  • PubReader
  • Print View
  • Cite this Page
  • PDF version of this title (1.0M)

Recent Activity

Your browsing activity is empty.

Activity recording is turned off.

Turn recording back on

See more...