Curating a longitudinal research resource using linked primary care EHR data—a UK Biobank case study

Abstract Primary care EHR data are often of clinical importance to cohort studies; however, they require careful handling. Challenges include determining the periods during which EHR data were collected. Participants are typically censored when they deregister from a medical practice, but cohort studies wish to follow participants longitudinally, including those who change practice. Using UK Biobank as an exemplar, we developed methodology to infer continuous periods of data collection and maximize follow-up in longitudinal studies. This resulted in longer follow-up for around 40% of participants with multiple registration records (mean increase of 3.8 years from the first study visit). The approach did not sacrifice phenotyping accuracy when comparing agreement between self-reported and EHR data. A diabetes mellitus case study illustrates how the algorithm supports longitudinal study design and provides further validation. Although we use UK Biobank data, the tools provided can be used for other conditions and studies with minimal alteration.

Table S1 summarises data standards developed from the Clinical Practice Research Datalink (CPRD) Aurum "acceptable patient flag" [1]. Data quality was assessed against these standards for each data provider. Coding completeness was also assessed and results are summarised in tables S2 to S4.
Records with missing dates (as defined in table S1) or codes were excluded. Dates were estimated where records had been de-identified by UK Biobank [2]. Birth-dated records (dated 02/02/1902) were set to the estimated date of birth (see footnote 1). Records recorded during the birth year (dated 03/03/1903) were assumed to have been recorded at age six months.
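The placeholder date handling described above can be sketched in Python as follows. This is an illustrative sketch only, not the authors' R implementation: the function name `resolve_record_date`, the constant names and the approximation of six months as 183 days are assumptions, and the estimated date of birth is taken as an input because its derivation is not described here.

```python
from datetime import date, timedelta

# Placeholder dates named in the text (UK Biobank de-identification).
BIRTH_PLACEHOLDER = date(1902, 2, 2)       # record dated on the date of birth
BIRTH_YEAR_PLACEHOLDER = date(1903, 3, 3)  # record dated during the birth year

# Assumption: "age six months" is approximated as 183 days.
SIX_MONTHS = timedelta(days=183)


def resolve_record_date(record_date, estimated_dob):
    """Replace de-identified placeholder dates with estimated dates."""
    if record_date == BIRTH_PLACEHOLDER:
        return estimated_dob
    if record_date == BIRTH_YEAR_PLACEHOLDER:
        return estimated_dob + SIX_MONTHS
    return record_date
```

Records with any other date pass through unchanged in this sketch.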
Data providers 1 to 3 had large numbers of registration records with missing de-registration dates. These were assumed to be open periods of registration at the date of data extract. Provider 4 (Wales) appeared to use a future date placeholder (07/07/2037) for open periods of registration.

Table S1: Data standards developed from the CPRD Aurum "acceptable patient flag" [1].

CPRD Aurum exclusion criteria | Adapted UK Biobank criteria
Year of birth is empty. | Available for all participants.
Gender other than male, female or indeterminate. | Available for all participants.
Age is greater than 115 at end of follow-up (based on registration end date, death or last collection date). | Same criteria applied.
Patients are not permanently registered. | All participants with registration records were assumed to be permanently registered.

c) record date was prior to birth.

1 Participants were censored at the earlier of the data extract date (the start of the date range provided by UK Biobank [2]) and the date of death in linked death registry data when cleaning registration record data.
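The treatment of open registration periods and censoring described above might look like the following Python sketch. It is not the authors' R implementation: the function name `clean_registration`, its arguments and the tuple representation of a period are assumptions made for illustration.

```python
from datetime import date

# Future-dated placeholder that provider 4 (Wales) appeared to use
# for open periods of registration.
OPEN_PLACEHOLDER = date(2037, 7, 7)


def clean_registration(start, end, extract_date, death_date=None):
    """Return a (start, end) registration period with open periods closed.

    `end` is None when the de-registration date is missing.
    """
    # Censor at the earlier of the data extract date and the date of death.
    censor = extract_date if death_date is None else min(extract_date, death_date)
    if end is None or end == OPEN_PLACEHOLDER:
        end = censor  # treat as still registered at the censoring date
    return (start, min(end, censor))
```

The sketch does not handle registrations that start after the censoring date; how those are treated is not stated in the text.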
To address the limitations of using practice registration histories to estimate periods of data collection, a rule-based approach was developed to determine the period of EHR data collection for each participant. This is summarised in algorithm A1 and figure S1. An R implementation is provided at https://github.com/philipdarke/ukbb-ehr-data.
The algorithm is applied separately to each data provider if a participant has data from multiple providers (for example, a participant who transfers from a medical practice in England to one in Wales). The resultant period(s) of data collection for each data provider are combined and data collection is assumed to be continuous during gaps of less than 1 year.

Algorithm A1: Algorithm used to identify periods of EHR data collection (continued).

Rationale (step 5): Participants with multiple periods of practice registration (e.g. as a result of changing GP on relocation) may not be fully captured in the data extract. These participants may feature discontinuous periods of data collection. All additional periods during which EHR data may have been collected are therefore identified.

6. Identify the date of the first and last record within each G_i where i = 1, ..., n. Set R_i to the period between the first and last record for G_i.
   Rationale: The gaps between registration periods are examined to determine whether any records have been recorded outside of periods of practice registration.

7. Active EHR data collection is assumed to take place during:
   a) P_1.
   b) P_2, ..., P_n.
   c) All periods R_i containing at least one non-prescription record.
   Rationale: Active EHR data collection is assumed to take place:
   a) During the period from the start of data collection identified in step 4.
   b) During subsequent periods of practice registration (i.e. it is assumed that a participant did not move from a practice using an EHR system to a paper-based one).
   c) Between the first and last record during periods where a participant is not registered with a GP practice. Un-registered periods that only contain prescription records are ignored (see footnote e).

8. Include gaps between the periods identified in step 7 if they are 1 year or less in length.
   Rationale: Participants who move GP practice may not have continuous periods of registration. Gaps of 1 year or less are included to reflect this.
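Steps 6 to 8 can be illustrated with a short Python sketch. This is a simplified illustration under stated assumptions, not the authors' R implementation (which is available at the repository given above): the names `Record`, `gaps_between`, `record_periods`, `merge_with_tolerance` and `collection_periods` are hypothetical, the registration periods passed in are assumed to already reflect steps 1 to 5 (so the first period starts at the start of data collection from step 4), and 1 year is approximated as 365 days.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    day: date
    kind: str  # e.g. "diagnosis", "test" or "prescription"


def gaps_between(periods):
    """Gaps G_i between consecutive (start, end) registration periods."""
    ordered = sorted(periods)
    return [(a_end, b_start)
            for (_, a_end), (b_start, _) in zip(ordered, ordered[1:])
            if b_start > a_end]


def record_periods(gaps, records):
    """Step 6 and 7c: R_i spans the first to last record inside gap G_i,
    kept only if the gap holds at least one non-prescription record."""
    kept = []
    for gap_start, gap_end in gaps:
        inside = [r for r in records if gap_start < r.day < gap_end]
        if any(r.kind != "prescription" for r in inside):
            days = [r.day for r in inside]
            kept.append((min(days), max(days)))
    return kept


def merge_with_tolerance(periods, max_gap_days=365):
    """Step 8: join periods separated by max_gap_days or less."""
    merged = []
    for start, end in sorted(periods):
        if merged and (start - merged[-1][1]).days <= max_gap_days:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def collection_periods(registrations, records, max_gap_days=365):
    """Steps 6 to 8 applied to one participant and one data provider."""
    extra = record_periods(gaps_between(registrations), records)
    return merge_with_tolerance(list(registrations) + extra, max_gap_days)
```

For a participant registered from 1990 to mid-2000 and again from 2004, a diagnosis record in 2001 and a prescription in 2003 would bridge the gap into a single period, whereas prescription records alone would leave two separate periods.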
e Complete data collection is unlikely to have taken place during un-registered periods that only feature prescription records.

Figure S1: Application of algorithm A1. This synthetic participant corresponds to example 4 in figure 1 in the main manuscript. Each panel shows a timeline from 1980 to 2020. Panel titles: Step 1: ignore records dated before 1 January 1985; Step 5: identify other potential periods of data collection; Step 6: R_1 = period between first and last records during gap between registered periods; Step 7: select each period of data collection; Step 8: join together periods P_1 and R_1 as separated by less than one year. Legend: Registered with GP; Diagnosis/event; Test/observation; Prescription.

Summary of approach
Observations and biomarkers were recorded in up to three value fields. Each data provider adopted a different approach [3], with numeric test results extracted primarily from the first and second value fields. Units, where available, were typically recorded in the value3 field.

d) The median value was taken where multiple test results were recorded on the same day, e.g. blood pressure measurements.
e) Measurements and biomarkers recorded at UK Biobank assessment visits were extracted, outliers removed and added to those from the EHR data.

Unit harmonisation
f) The EHR value was discarded when both a UK Biobank and EHR observation were recorded on the same day.
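The same-day rules in items d) to f) can be sketched in Python as below. This is a minimal illustration rather than the authors' R code: the function name `combine_measurements` and the `(date, value)` input format are assumptions, and outlier removal is assumed to have been applied to the UK Biobank values beforehand.

```python
from collections import defaultdict
from datetime import date
from statistics import median


def combine_measurements(ehr, ukb):
    """Combine EHR and UK Biobank measurements of one biomarker.

    Both inputs are lists of (date, value) pairs.
    """
    # d) Take the median where several EHR results share a day.
    by_day = defaultdict(list)
    for day, value in ehr:
        by_day[day].append(value)
    daily = {day: median(values) for day, values in by_day.items()}

    # e) Add UK Biobank values; f) on a shared day the EHR value is discarded.
    for day, value in ukb:
        daily[day] = value

    return sorted(daily.items())
```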
Additional processing was required where multiple measurements were recorded under the same entity type in the Vision practice management system (for example, weight in value1 and BMI in value2 under code 22A..).

[...] became 0205021. All Read v2 codes were trimmed to length 5 (see footnote 3).
The majority of prescription records can only be resolved to BNF subparagraph. This is insufficient for some use cases, for example identifying atypical anti-psychotic medications, which are included under BNF subparagraph 0402010 along with other anti-psychotic medications.
Tables S45 and S46 illustrate how the drug name field can be used to identify drugs where insufficient detail is included in BNF coding.

Table S7: UK Biobank prescription data. X indicates the coding terminology used by the data provider.
[Table S7 columns: Country; Data provider; Practice system; Code (Read v2, BNF, dm+d).]

Prescription records were searched for the relevant prescription codes. Figure S3 illustrates the time between prescription records for a selection of drug types. A weekly repeat pattern is present. Previous work based on EMIS data [5] used a 28 day cut-off to determine "active" prescriptions, i.e. a prescription within 28 days of the date of interest evidenced a current drug prescription. Based on figure S3, a 90 day cut-off was used for our analysis.

Figure S3: Time between prescriptions in days for a range of drugs. 28 days is the most common interval (except for anti-psychotics) but gaps of 56 days and beyond are common.

Tables S8 to S11 show the agreement between a selection of self-reported conditions and medications and the processed EHR data. Comparison is made as at the first UK Biobank study visit as set out in the main manuscript.
Hypertension, diabetes and myocardial infarction showed high levels of agreement across evaluation metrics. Transient ischaemic attack (TIA) and mental health conditions showed a high number of "false positives" (low precision) where EHR codes were present but participants did not self-report the condition. Potential reasons for this include reluctance to self-report or erroneous code recording (for example a suspected TIA where diagnosis codes are rarely removed if a TIA is later ruled out).
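The notion of agreement used above can be made concrete with a small Python sketch that treats self-report as the reference and the EHR as the test. The function name `agreement` and the choice of precision and sensitivity as the reported metrics are assumptions for illustration; the metrics actually tabulated are those in tables S8 to S11.

```python
def agreement(self_reported, ehr):
    """Compare paired boolean flags for one condition across participants.

    A "false positive" is an EHR code without a matching self-report.
    """
    pairs = list(zip(self_reported, ehr))
    tp = sum(1 for s, e in pairs if s and e)
    fp = sum(1 for s, e in pairs if not s and e)
    fn = sum(1 for s, e in pairs if s and not e)
    tn = sum(1 for s, e in pairs if not s and not e)
    # Metrics are None when their denominator is zero.
    precision = tp / (tp + fp) if tp + fp else None
    sensitivity = tp / (tp + fn) if tp + fn else None
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "sensitivity": sensitivity}
```

A condition with many EHR codes that participants did not self-report, as described for TIA, would show a high `fp` count and therefore low precision.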

Diabetes incidence criteria
Participants were assumed to enter a diabetic state at the date of the first diagnosis code. Where multiple diabetes sub-types were present in the data, participants were assumed to have type 2 diabetes, with the exception of those under age 35 with an insulin prescription dated earlier than one year after diagnosis, who were assumed to have type 1 diabetes.
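The sub-type rule can be written as a short Python sketch. The function name `assign_diabetes_type` and its arguments are hypothetical, and two details are assumptions rather than statements from the text: age is taken to mean age at diagnosis, and one year is approximated as 365 days.

```python
from datetime import date, timedelta


def assign_diabetes_type(subtypes, age_at_diagnosis, diagnosis_date, insulin_dates):
    """Resolve the diabetes sub-type for a participant.

    `subtypes` is the non-empty set of sub-types coded in the data.
    """
    if len(subtypes) == 1:
        return next(iter(subtypes))

    # Multiple sub-types: default to type 2 unless the participant is
    # young and received insulin before one year after diagnosis.
    cutoff = diagnosis_date + timedelta(days=365)
    early_insulin = any(day < cutoff for day in insulin_dates)
    if age_at_diagnosis < 35 and early_insulin:
        return "type 1"
    return "type 2"
```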

Diabetes remission criteria
Remission was defined as the cessation of all diabetes medication followed by two sub-diabetic blood glucose test results (glycated hemoglobin (HbA1c) < 48 mmol/mol or fasting plasma glucose < 7.0 mmol/l) separated by at least six months [7].
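A minimal Python sketch of the test-based part of this definition follows. It is an illustration under assumptions, not the authors' implementation: the function name `remission_test_pair`, the `(date, kind, value)` test format, the use of the last medication date to represent cessation and the approximation of six months as 182 days are all hypothetical choices. The sketch also does not check for diabetic-range results between the two qualifying tests, since the text does not say how these are handled.

```python
from datetime import date

# Upper limits for a sub-diabetic result (HbA1c in mmol/mol, FPG in mmol/l).
SUB_DIABETIC_LIMITS = {"hba1c": 48.0, "fpg": 7.0}
MIN_SEPARATION_DAYS = 182  # assumption: six months as 182 days


def remission_test_pair(medication_end, tests):
    """Return the first pair of qualifying test dates, or None.

    `tests` is a list of (date, kind, value) with kind "hba1c" or "fpg".
    """
    normal = sorted(day for day, kind, value in tests
                    if day > medication_end and value < SUB_DIABETIC_LIMITS[kind])
    # The earliest sub-diabetic test gives the widest separation, so it is
    # sufficient to pair it with each later sub-diabetic test in turn.
    for later in normal[1:]:
        if (later - normal[0]).days >= MIN_SEPARATION_DAYS:
            return (normal[0], later)
    return None
```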

Pre-diabetes criteria
Normoglycaemic participants were deemed to enter a pre-diabetic state on the first date they met National Institute for Health and Clinical Excellence PH38 criteria [8]:
a) HbA1c ≥ 42 mmol/mol or fasting plasma glucose ≥ 5.5 mmol/l (two-hour oral glucose tolerance test results ≥ 7.8 mmol/l were also included to capture those with impaired glucose tolerance [9])
b) no previous diabetes diagnosis
c) no diagnosis in the subsequent three months (excluding gestational diabetes) to allow for the delayed recording of a clinical code following a diagnosis based on a blood glucose test.
Glucose tests during periods of gestational diabetes or anti-diabetes medication were not tested against these criteria.
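The pre-diabetes criteria can be sketched in Python as follows. This is a hedged illustration, not the authors' code: the function name `prediabetes_onset`, the `(date, kind, value)` test format and the approximation of three months as 91 days are assumptions. The caller is assumed to have already removed gestational diabetes diagnoses and any tests taken during gestational diabetes or anti-diabetes medication.

```python
from datetime import date, timedelta

# Lower thresholds from the criteria above
# (HbA1c in mmol/mol, glucose results in mmol/l).
PRE_DIABETIC_THRESHOLDS = {"hba1c": 42.0, "fpg": 5.5, "ogtt_2h": 7.8}


def prediabetes_onset(tests, diagnosis_dates, lookahead_days=91):
    """Return the date a participant enters a pre-diabetic state, or None."""
    first_diagnosis = min(diagnosis_dates, default=None)
    for day, kind, value in sorted(tests):
        if value < PRE_DIABETIC_THRESHOLDS[kind]:
            continue  # criterion a) not met
        # Criteria b) and c): no diagnosis on or before this date, and
        # none in the following three months. If the earliest diagnosis
        # blocks this test it also blocks every later test.
        window_end = day + timedelta(days=lookahead_days)
        if first_diagnosis is not None and first_diagnosis <= window_end:
            return None
        return day
    return None
```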

QDiabetes-2018
QDiabetes-2018 was evaluated in line with Hippisley-Cox and Coupland [5] using a study start of 1 January 2005 and the same inclusion criteria. Performance was evaluated using our algorithm to identify periods of data collection and our phenotyping approach to determine predictors and outcomes (tables S12 and S14). For comparison, performance was also evaluated assuming complete data collection during the periods in the GP registration records, with predictors and outcomes determined as in Hippisley-Cox and Coupland [5] (tables S13 and S15).
Our algorithm results in broadly the same number of eligible participants (205,901 vs 205,290) but longer post-visit follow-up (mean of 10.6 vs 9.7 years).

Periods of data collection identified using our algorithm and outcomes using our phenotyping approach.

Figure S5: Calibration of QDiabetes-2018 model on UK Biobank data (all data providers). Periods of data collection identified using GP registration records and outcomes as in Hippisley-Cox and Coupland [5].

Leicester risk score
The Leicester risk score was calculated as in Gray et al. [10]. Performance was evaluated when predicting the 5 year incidence of diabetes at the first UK Biobank visit. 5 year performance was used due to the relatively small number of UK Biobank participants with 10 years of EHR follow-up following the first study visit. Our algorithm was used to identify periods of data collection and our phenotyping approach to determine outcomes (table S16). For comparison, the evaluation was also carried out assuming complete data collection during the periods in the GP registration records, with outcomes determined as in Hippisley-Cox and Coupland [5] (table S17).
Results are shown in table S18.
Our algorithm results in a larger number of participants deemed to have EHR data collection from the first UK Biobank study visit (186,208 vs 181,233 when using registration record data) and longer post-visit follow-up (mean of 7.4 vs 6.9 years). Leicester score performance for the 5 year incidence of diabetes was generally higher using our algorithm and phenotyping approach.

Table S18: Confusion matrices for the 5 year incidence of diabetes following the first UK Biobank visit using the Leicester score. For comparison, Barber et al. [11] reported sensitivity of 89.2% and specificity of 42.3% for 10 year diabetes prediction.

[Table S18 columns: active data collection/diabetes outcomes determined using our algorithm/phenotyping tool (Score < 16, Score ≥ 16) and using GP registration records/as QDiabetes [5] (Score < 16, Score ≥ 16).]

[...] substance therefore drugs were identified by searching the drug name field in all TPP records with BNF code starting 0402 for these generic and brand terms. Fuzzy matching was used e.g.

[Table columns: Drug; Search terms (case insensitive).]

UK BIOBANK CODING
Demographic inputs were taken from UK Biobank visit data as these were generally unavailable in the linked EHR data. Relevant biomarkers measured by UK Biobank were used to augment those extracted from the EHR data. The fields used are summarised in table S47. Mappings for ethnicity and smoking are summarised in tables S48 to S50.
The codes used to identify self-reported conditions are in table S51. Self-reported medications were identified from field 20003 by searching for drug names (both generic and brand names). The search terms are available at https://github.com/philipdarke/ehr-codesets. Any self-reported medications matching a steroid term were excluded if they also included the terms "eye", "ear" or "cream", as the aim was to identify "regular steroid tablets" as defined in the QDiabetes-2018 model.
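The steroid exclusion rule can be sketched in Python as below. This is an illustrative sketch only: the function name `is_regular_steroid` is hypothetical, the steroid search terms are supplied by the caller (the actual terms are those in the repository above), and matching the excluded terms as whole words is an assumption about how "included the terms" was applied.

```python
import re

# Terms that mark a topical or local preparation rather than tablets.
# Assumption: matched as whole words, case insensitive.
EXCLUDED_FORMS = re.compile(r"\b(eye|ear|cream)\b", re.IGNORECASE)


def is_regular_steroid(description, steroid_terms):
    """Return True if a self-reported medication counts as a regular steroid."""
    text = description.lower()
    if not any(term.lower() in text for term in steroid_terms):
        return False  # no steroid term matched
    return EXCLUDED_FORMS.search(text) is None
```

For example, with `["prednisolone"]` as the only search term, "prednisolone 5mg tablet" is retained while "prednisolone eye drops" is excluded.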