Optimizing large real-world data analysis with parquet files in R: A step-by-step tutorial

Pharmacoepidemiol Drug Saf. 2024 Mar;33(3):e5728. doi: 10.1002/pds.5728. Epub 2023 Nov 20.

Abstract

Purpose: The use of open-source programming languages can facilitate open science practices in real-world evidence (RWE) studies. Real-world studies often rely on big data, which complicates the use of such languages. We demonstrate an efficient approach that enables RWE researchers to use R for RWE analysis tasks, from cohort building to reporting.

Methods: Using the Merative MarketScan data (2017-2019), we developed an R function to transform the data into parquet format for use in R. We then compared data size before and after transformation. We also compared the transformed data in R with the original data in terms of numerical consistency and the running time required to complete simple exploratory tasks. To show how the transformed databases can be used in practice, we conducted a simplified replication of an active comparator new user study from the literature. All code is available on GitHub.
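
The conversion-and-query workflow described in the Methods can be illustrated with a minimal sketch. It is not the authors' function: it assumes the `arrow` and `dplyr` packages are installed, and the toy table `claims`, its column names, and the output directory are hypothetical stand-ins for a real claims file.

```r
# Minimal sketch, not the authors' function; assumes 'arrow' and 'dplyr'.
library(arrow)
library(dplyr)

# Hypothetical toy claims table standing in for a large source file.
claims <- data.frame(
  patient_id = c(1L, 1L, 2L, 3L, 3L, 3L),
  year       = c(2017L, 2018L, 2017L, 2018L, 2019L, 2019L),
  drug_code  = c("A", "B", "A", "A", "B", "A"),
  stringsAsFactors = FALSE
)

# Write the table as a parquet dataset, one partition per year.
out_dir <- file.path(tempdir(), "claims_parquet")
unlink(out_dir, recursive = TRUE)  # start from an empty directory
write_dataset(claims, out_dir, format = "parquet", partitioning = "year")

# Open the dataset lazily; no rows are loaded into memory at this point.
ds <- open_dataset(out_dir)

# Filter and aggregate on the dataset; collect() pulls only the result into R.
users_a <- ds %>%
  filter(drug_code == "A", year >= 2018) %>%
  group_by(year) %>%
  summarise(n_claims = n()) %>%
  collect() %>%
  arrange(year)

print(users_a)
```

Partitioning by a commonly filtered column such as year lets the query skip files it does not need, which is one plausible reason parquet-backed workflows can run quickly on large data.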

Results: Our approach was highly efficient in data storage: the converted data ranged from 10% to 43% of the size of the original data files. The exploratory tasks generally ran faster in R on the converted data than in SAS on the original data. We showed, through example, how the converted data can be efficiently used to implement an RWE study.

Conclusion: We demonstrate a free and efficient solution to facilitate the use of open-source programming languages with big real-world databases, which can in turn support the adoption of open science practices.

Keywords: R; big data; cohort building; open science; pharmacoepidemiology; real-world data.

MeSH terms

  • Data Analysis*
  • Databases, Factual
  • Humans
  • Information Storage and Retrieval*