J Biomed Inform. 2014 Dec;52:43-54. doi: 10.1016/j.jbi.2014.01.016. Epub 2014 Feb 10.

Improving record linkage performance in the presence of missing linkage data.

Author information

1. University of Colorado, Denver, Business School, Denver, CO, USA; Department of Medicine, School of Medicine, University of Colorado, Anschutz Medical Campus, Aurora, CO, USA; Colorado Clinical and Translational Sciences Institute, University of Colorado, Anschutz Medical Campus, Aurora, CO, USA. Electronic address: toan.ong@ucdenver.edu.
2. University of Colorado, Denver, Business School, Denver, CO, USA.
3. Department of Medicine, School of Medicine, University of Colorado, Anschutz Medical Campus, Aurora, CO, USA.
4. Department of Pediatrics, School of Medicine, University of Colorado, Anschutz Medical Campus, Aurora, CO, USA; Colorado Clinical and Translational Sciences Institute, University of Colorado, Anschutz Medical Campus, Aurora, CO, USA.

Abstract

INTRODUCTION:

Existing record linkage methods do not handle missing linking field values in an efficient and effective manner. The objective of this study is to investigate three novel methods for improving the accuracy and efficiency of record linkage when record linkage fields have missing values.

METHODS:

By extending the Fellegi-Sunter scoring implementations available in the open-source Fine-grained Record Linkage (FRIL) software system, we developed three novel methods to solve the missing data problem in record linkage: Weight Redistribution, Distance Imputation, and Linkage Expansion. Weight Redistribution removes fields with missing data from the set of quasi-identifiers and redistributes the weight of each missing field across the remaining available linkage fields in proportion to their existing weights. Distance Imputation imputes the distance between the missing data fields rather than imputing the missing data value itself. Linkage Expansion adds fields not previously used for linkage to the linkage field set to compensate for the missing information in a linkage field. We tested the linkage methods using simulated data sets with varying field value corruption rates.
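
As a rough illustration of the first two ideas, the sketch below scores a record pair with a weighted sum of per-field similarities. It is not the authors' FRIL implementation: the function names, the exact-match similarity, the example weights, and the imputed distance of 0.5 are all assumptions made for illustration, since the abstract does not specify how weights are set or how distances are imputed.

```python
from typing import Dict, Optional

Record = Dict[str, Optional[str]]  # None marks a missing field value


def field_distance(a: str, b: str) -> float:
    """Toy distance: 0.0 for an exact match, 1.0 otherwise (assumption)."""
    return 0.0 if a == b else 1.0


def redistribute_weights(weights: Dict[str, float],
                         rec_a: Record, rec_b: Record) -> Dict[str, float]:
    """Drop fields missing in either record and rescale the remaining
    weights proportionally so they sum to the original total."""
    available = [f for f in weights
                 if rec_a.get(f) is not None and rec_b.get(f) is not None]
    available_total = sum(weights[f] for f in available)
    if available_total == 0:
        return {}
    scale = sum(weights.values()) / available_total
    return {f: weights[f] * scale for f in available}


def score_weight_redistribution(weights: Dict[str, float],
                                rec_a: Record, rec_b: Record) -> float:
    """Weighted similarity computed over the non-missing fields only."""
    adjusted = redistribute_weights(weights, rec_a, rec_b)
    return sum(w * (1.0 - field_distance(rec_a[f], rec_b[f]))
               for f, w in adjusted.items())


def score_distance_imputation(weights: Dict[str, float],
                              rec_a: Record, rec_b: Record,
                              imputed_distance: float = 0.5) -> float:
    """Weighted similarity where a missing field contributes an imputed
    distance instead of an imputed value (0.5 is an arbitrary placeholder)."""
    score = 0.0
    for f, w in weights.items():
        a, b = rec_a.get(f), rec_b.get(f)
        d = imputed_distance if a is None or b is None else field_distance(a, b)
        score += w * (1.0 - d)
    return score


# Hypothetical field weights and records, for illustration only.
weights = {"first_name": 0.3, "last_name": 0.4, "dob": 0.3}
rec_a = {"first_name": "ann", "last_name": "lee", "dob": None}
rec_b = {"first_name": "ann", "last_name": "lee", "dob": "1980-01-01"}

wr_score = score_weight_redistribution(weights, rec_a, rec_b)
di_score = score_distance_imputation(weights, rec_a, rec_b)
```

In this toy pair the date of birth is missing from one record. Weight Redistribution spreads its 0.3 weight over the two name fields, while Distance Imputation keeps the original weights and lets the missing field contribute a neutral distance.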

RESULTS:

The methods developed had sensitivity ranging from 0.895 to 0.992 and positive predictive values (PPV) ranging from 0.865 to 1 in data sets with low corruption rates. Increased corruption rates led to decreased sensitivity for all methods.
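
For readers unfamiliar with the two metrics, the following minimal sketch shows their standard definitions in terms of true-positive, false-negative, and false-positive link counts. The function names and the counts are hypothetical and are not taken from the study.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of true matches that the linkage method found."""
    return true_pos / (true_pos + false_neg)


def positive_predictive_value(true_pos: int, false_pos: int) -> float:
    """Fraction of reported links that are true matches."""
    return true_pos / (true_pos + false_pos)


# Made-up counts for illustration: 90 correct links, 10 missed, 5 wrong.
example_sensitivity = sensitivity(true_pos=90, false_neg=10)
example_ppv = positive_predictive_value(true_pos=90, false_pos=5)
```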

CONCLUSIONS:

These new record linkage algorithms show promise in terms of accuracy and efficiency and may be valuable for combining large data sets at the patient level to support biomedical and clinical research.

KEYWORDS:

Comparative effectiveness research; Data quality; Missing data; Quasi-identifiers; Record linkage

PMID:
24524889
DOI:
10.1016/j.jbi.2014.01.016
[Indexed for MEDLINE]
Free full text
