BMC Med Inform Decis Mak. 2017 Apr 13;17(1):40. doi: 10.1186/s12911-017-0429-1.

Automatic identification of variables in epidemiological datasets using logic regression.

Author information

1. Department of Neurology, University Clinic Frankfurt, Schleusenweg 2-16, D-60528, Frankfurt/Main, Germany. Matthias.lorenz@em.uni-frankfurt.de.
2. Faculty of Computer Science and Engineering, Frankfurt University of Applied Sciences, Frankfurt/Main, Germany.
3. Department of Neurology, University Clinic Frankfurt, Schleusenweg 2-16, D-60528, Frankfurt/Main, Germany.
4. IRCSS Multimedica, Milan, Italy.
5. Department of Pharmacological and Biomolecular Sciences, University of Milan, Milan, Italy.
6. Institute of Clinical Sciences, University of Oslo, Oslo, Norway.
7. Department of Cardiology, Oslo University Hospital Ullevål, Oslo, Norway.
8. Atherosclerosis Department, Cardiology Research Center, Moscow, Russia.
9. University Medical Center Utrecht, Utrecht, The Netherlands.
10. Department of Epidemiology and Biostatistics, Erasmus Medical Center, Rotterdam, The Netherlands.
11. Department of Neurology, Medical University Innsbruck, Innsbruck, Austria.

Abstract

BACKGROUND:

For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve the data quality. For semi-automation, high sensitivity in the recognition of matching variables is particularly important: it allows software to present, for each target variable, a short list of candidate source variables from which a user picks the matching one, with only a low risk that the correct source variable is missing from the list.

METHODS:

For each variable in a set of target variables, a number of simple rules were manually created. With logic regression, an optimal Boolean combination of these rules was searched for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). In a second subset of this database (validation subset), these optimal combinations of rules were validated.
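The idea of combining simple rules into one Boolean expression per target variable can be illustrated with a minimal sketch. It is not the authors' implementation: logic regression searches a much larger space of Boolean expressions, whereas this sketch only tries each single rule and every AND/OR of two rules exhaustively. The target variable ("age"), the three rules, the variable descriptions and the function names are all hypothetical and chosen only for illustration.

```python
from itertools import combinations
from typing import Any, Callable, Dict, List, Tuple

Variable = Dict[str, Any]          # hypothetical description of one source variable
Rule = Callable[[Variable], bool]  # a simple rule answers yes/no for a variable

# Hypothetical hand-written rules for the target variable "age".
RULES: Dict[str, Rule] = {
    "name_contains_age": lambda v: "age" in v["name"].lower(),
    "is_numeric": lambda v: v["numeric"],
    "range_0_120": lambda v: v["min"] >= 0 and v["max"] <= 120,
}

def misclassified(pred: List[bool], truth: List[bool]) -> int:
    """Count variables for which the prediction differs from the truth."""
    return sum(p != t for p, t in zip(pred, truth))

def best_combination(variables: List[Variable],
                     truth: List[bool]) -> Tuple[str, int]:
    """Evaluate single rules and all two-rule AND/OR combinations.

    Returns the label of the expression with the fewest misclassifications
    on the given (construction) data, together with that error count.
    """
    candidates: Dict[str, List[bool]] = {
        name: [rule(v) for v in variables] for name, rule in RULES.items()
    }
    for (a, ra), (b, rb) in combinations(RULES.items(), 2):
        candidates[f"({a} AND {b})"] = [ra(v) and rb(v) for v in variables]
        candidates[f"({a} OR {b})"] = [ra(v) or rb(v) for v in variables]
    best = min(candidates, key=lambda k: misclassified(candidates[k], truth))
    return best, misclassified(candidates[best], truth)

# Invented construction data: only the first variable really is "age".
VARIABLES: List[Variable] = [
    {"name": "AGE", "numeric": True, "min": 35, "max": 89},
    {"name": "dosage", "numeric": True, "min": 0, "max": 500},
    {"name": "sex", "numeric": True, "min": 0, "max": 1},
    {"name": "bmi", "numeric": True, "min": 15, "max": 50},
]
TRUTH = [True, False, False, False]

expression, errors = best_combination(VARIABLES, TRUTH)
```

In this toy data no single rule separates "age" from the other variables ("dosage" also contains the string "age", and "sex" and "bmi" also fall in the plausible range), but the conjunction of the name rule and the range rule does. An expression found this way would then be applied unchanged to a separate validation subset.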

RESULTS:

In the construction sample, 41 target variables were allocated with an average positive predictive value (PPV) of 34% and an average negative predictive value (NPV) of 95%. In the validation sample, the average PPV was 33% and the average NPV 94%. PPV was 50% or less for 63% of all variables in the construction sample and for 71% of all variables in the validation sample.
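For readers less familiar with these measures, the following sketch shows how PPV, NPV and sensitivity are computed for one target variable from predicted and true matches. The function name and the example data are assumptions made for illustration; they are not taken from the study.

```python
from typing import Dict, List, Optional

def predictive_values(pred: List[bool],
                      truth: List[bool]) -> Dict[str, Optional[float]]:
    """Compute PPV, NPV and sensitivity from boolean predictions.

    pred[i] is True if the rule combination flags source variable i as a
    match for the target variable; truth[i] is True if it really is one.
    A measure is None when its denominator is zero.
    """
    tp = sum(p and t for p, t in zip(pred, truth))          # true positives
    fp = sum(p and not t for p, t in zip(pred, truth))      # false positives
    tn = sum(not p and not t for p, t in zip(pred, truth))  # true negatives
    fn = sum(not p and t for p, t in zip(pred, truth))      # false negatives

    def ratio(num: int, den: int) -> Optional[float]:
        return num / den if den else None

    return {
        "ppv": ratio(tp, tp + fp),          # flagged variables that are correct
        "npv": ratio(tn, tn + fn),          # unflagged variables correctly rejected
        "sensitivity": ratio(tp, tp + fn),  # true matches that were flagged
    }

# Invented example: 8 source variables, 3 flagged, 2 true matches.
pred = [True, True, True, False, False, False, False, False]
truth = [True, False, False, False, False, False, False, True]
values = predictive_values(pred, truth)
```

In this invented example one of three flagged variables is correct (PPV 1/3) and four of five unflagged variables are correctly rejected (NPV 0.8).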

CONCLUSIONS:

We demonstrated that applying logic regression to a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, so backup strategies may be required.

KEYWORDS:

Data management; Epidemiology; Logic regression; Meta-analysis

PMID:
28407816
PMCID:
PMC5390441
DOI:
10.1186/s12911-017-0429-1
[Indexed for MEDLINE]
Free PMC Article
