Transforming and evaluating the UK Biobank to the OMOP Common Data Model for COVID-19 research and beyond

Table 1.

Patient demographic and clinical characteristics presented for the source population, the OMOP CDM transformed population and the subset of the transformed population with COVID-19

	Source UK Biobank data	OMOP-transformed UK Biobank data	Transformed UK Biobank COVID-19 positive sub population
Patients	502 505	502 504	3086
% Female	54.4	54.4	48.76
Median age (IQR)	58 (13)	58 (13)	58 (15)
Median Townsend deprivation index (IQR)	−2.135 (4.18)	−2.135 (4.18)	−1.111 (5.19)
BMI median—baseline (IQR)	26.652 (5.72)	26.65 (5.70)	27.7 (6.21)
BMI median—GP EMIS (IQR)	27.2 (6.9)	27.3 (6.84)	28.89 (8)
SBP median—baseline (IQR)	136 (26)	136 (26)	136 (25)
DBP median—baseline (IQR)	81 (14)	81 (14)	82 (14)
Smoking status
Not answered	2276	Not mapped	Not mapped
Never	317 891	317 891	1676
Previous	197 949	197 949	1323
Current	55 676	55 676	395
Comorbidities
T2DM	40 433 (8.04%)	40 476 (8.05%)	453 (14.67%)
HF	8068 (1.6%)	8053 (1.6%)	140 (4.53%)
AMI	10 593 (2.1%)	10 749 (2.13%)	110 (3.56%)
COPD	22 364 (4.45%)	22 367 (4.45%)	328 (10.62%)
HT	175 449 (34.91%)	175 539 (34.93%)	1571 (50.9%)

Note: Age, Townsend deprivation index, Body Mass Index (BMI), Systolic Blood Pressure (SBP) and Diastolic Blood Pressure (DBP) values collected at first assessment center visit.

T2DM: type-II diabetes; HF: heart failure; AMI: acute myocardial infarction; COPD: chronic obstructive pulmonary disease; HT: hypertension.

Baseline and EHR data mapping

In the baseline data, we processed events from 1 127 434 self-reported noncancer illness entries (field id 20002), 53 384 cancer illness entries (field id 20001), 1 381 148 medication entries (field id 20003), and 994 355 procedure entries (field id 20004). We mapped 946 053 (83.91%), 37 802 (70.81%), 1 218 935 (88.25%), and 864 788 (86.96%) of these entries, respectively, in addition to 45 629 849 (74.65%) hematology entries (Table 2).

Table 2.

Mapping coverage for terms in the baseline and EHR data relating to ethnic status, noncancer/cancer diseases, medication usage and surgical procedures in the UK Biobank and converted to the OMOP CDM standard vocabulary

Source vocab	Used source terms #	Mapped used terms # (%)	Events #	Mapped event # (%)
Baseline ethnic status	22	10 (45.45%)	533 612	512 158 (95.97%)
Self-reported noncancer illness	446	351 (78.69%)	1 127 434	946 053 (83.91%)
Self-reported cancer	82	48 (58.53%)	53 384	37 802 (70.81%)
Self-reported medication	3737	1100 (29.43%)	1 381 148	1 218 935 (88.25%)
Self-reported procedures	254	128 (50.39%)	994 355	864 788 (86.96%)
Hematology samples	124	93 (75%)	61 119 731	45 629 849 (74.65%)
Hospital EHR admission source	86	44 (51.16%)	3 541 594	282 505 (7.97%)
Hospital EHR admission method	63	58 (92.06%)	3 541 610	3 540 046 (99.95%)
Hospital EHR discharge destination	91	56 (61.53%)	3 484 435	3 189 509 (91.53%)

Note: Coverage is given as both the number of unique terms mapped and as the number of events mapped.

EHR: electronic health records.
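
The two coverage measures in Table 2 can diverge widely: a small share of mapped terms can still cover most events when the mapped terms are the frequently used ones. The following is a minimal sketch of how both percentages can be derived from the four counts in a table row. The class and method names are illustrative assumptions and are not taken from the ETL pipeline; the counts are copied from the self-reported medication row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VocabCoverage:
    """Counts for one source vocabulary (one row of a coverage table)."""

    name: str
    used_terms: int      # distinct source terms that occur in the data
    mapped_terms: int    # distinct source terms with a standard-concept mapping
    events: int          # recorded events using any of the source terms
    mapped_events: int   # recorded events whose source term is mapped

    def term_coverage(self) -> float:
        """Percentage of distinct used source terms that are mapped."""
        return 100.0 * self.mapped_terms / self.used_terms

    def event_coverage(self) -> float:
        """Percentage of recorded events that are mapped."""
        return 100.0 * self.mapped_events / self.events


# Counts copied from the "Self-reported medication" row of Table 2.
medication = VocabCoverage(
    name="Self-reported medication",
    used_terms=3737,
    mapped_terms=1100,
    events=1_381_148,
    mapped_events=1_218_935,
)

print(
    f"{medication.name}: "
    f"{medication.term_coverage():.2f}% of terms, "
    f"{medication.event_coverage():.2f}% of events"
)
```

Printed values may differ from the table in the last decimal place, depending on whether percentages are rounded or truncated.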

In hospitalization EHR (Table 3), we processed 12 962 292 diagnoses recorded using ICD-10 and mapped 12 961 962 (99.99%). Additionally, we processed 7 220 399 procedure events recorded using the OPCS-4 classification and mapped 6 449 843 (89.32%). A much smaller number of clinical events were recorded using deprecated terminologies (eg, ICD-9 and OPCS-3); of these, 91.64% and 77.48% of events, respectively, were mapped (Table 3). Finally, 77 127 (99.95%) of all death events recorded in mortality registers across the 3 countries were successfully mapped (cause of death recorded using ICD-10).

Table 3.

Mapping and event coverage for UK Biobank vocabularies for diagnoses, procedures, and death electronic health records

Source vocab	Used source terms #	Mapped used terms # (%)	Events #	Mapped event # (%)
ICD-10 diagnoses	12 094	12 088 (99.95%)	12 962 292	12 961 962 (99.99%)
ICD-9 diagnoses	3337	2847 (85.31%)	72 256	66 220 (91.64%)
OPCS-3 procedures	883	221 (25.02%)	20 077	15 556 (77.48%)
OPCS-4 procedures	8324	8276 (99.42%)	7 220 399	6 449 843 (89.32%)
ICD-10 Death Cause	1962	1961 (99.94%)	77 161	77 127 (99.95%)

Note: Coverage is given as both the number of unique terms mapped and as the number of events mapped.

ICD: International Classification of Diseases; OPCS: OPCS Classification of Interventions and Procedures.

In primary care EHR (Table 4), we processed 212 828 306 clinical events from EMIS and 133 092 016 clinical events from TPP. These were recorded using 51 160 SNOMED-CT and 82 669 Clinical Terms Version 3 (CTV3) terms, respectively. In EMIS data, 49 968 (97.67%) SNOMED-CT concepts were mapped, resulting in 207 756 102 (97.62%) clinical events mapped. In TPP data, a smaller share of CTV3 concepts was mapped (73 683; 89.13%), but the proportion of mapped clinical events remained similarly high (97.78%; n = 130 140 231). Measurement units in EMIS for relevant clinical events (eg, mmHg for blood pressure) were recorded using 55 terms of which 44 were mapped resulting in 31.27% of events successfully transformed. We processed 141 752 534 medication prescription events, which were recorded using dm+d. Overall, 30 859 (99.85%) dm+d terms were mapped and 139 966 587 (98.74%) events were successfully transformed. Finally, we mapped 41 unique COVID-19-related proprietary codes used by primary care EHR software vendors.

Table 4.

Mapping and event coverage for UK Biobank primary care electronic health records

Source vocab	Used source terms #	Mapped used terms # (%)	Events #	Mapped event # (%)
EMIS units	4544	44 (0.96%)	94 623 584	82 517 900 (87.2%)
SNOMED-CT (EMIS)	51 160	49 968 (97.67%)	212 828 306	207 756 102 (97.62%)
dm+d	30 903	30 859 (99.85%)	141 752 534	139 966 587 (98.74%)
CTV3 (TPP)	82 669	73 683 (89.13%)	133 092 016	130 140 231 (97.78%)
TPP and EMIS proprietary codes	20 990	41 (0.19%)	19 554 574	37 882 (0.19%)

Lists of top 10 most frequently used mapped and unmapped terms can be found in Supplementary Table S4a–S4k.

Evaluation and validation

We identified 40 433 T2DM, 8068 HF, 10 593 AMI, 22 364 COPD and 175 449 HT patients in the source data and observed similar estimates in the converted data (Table 1). A small number of patients (43 T2DM, 15 HF, 157 AMI, 6 COPD and 94 HT) were identified only in the converted data and not in the source data.

DataQualityDashboard was used to verify and validate the plausibility, conformance, and completeness of the transformed dataset. On the final run, 3399 checks passed and 18 failed (Supplementary Figure S2). All failed checks were investigated, and each failure was expected. Seven completeness checks failed because the percentage of records with a value of 0 in the standard concept field exceeded the 20% threshold owing to missing mappings. Two plausibility checks failed because of a gender incompatible with a gender-specific clinical code, eg, 41 records with concept 198197 (male infertility) belong to participants not identified as male; these inconsistencies originate in the source data. Nine conformance checks failed because a standard Concept ID value in a table did not conform with the corresponding domain (eg, 0.2% of unit Concept ID values in the Measurement table do not conform with the Unit domain); these failures stem from errors introduced during the manual mapping process (ie, incorrect mapping selections using USAGI).
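
The completeness failures described above follow a simple rule: a check fails when the share of records whose standard concept field is 0 exceeds a threshold. The sketch below illustrates that rule only; it is not DataQualityDashboard's implementation, the function names are hypothetical, and the non-zero concept IDs in the toy column are made-up placeholders.

```python
from typing import Iterable

# In the OMOP CDM, a standard concept field set to 0 marks a record
# for which no standard concept could be assigned.
UNMAPPED_CONCEPT_ID = 0


def percent_unmapped(concept_ids: Iterable[int]) -> float:
    """Return the percentage of records whose concept ID is 0."""
    ids = list(concept_ids)
    if not ids:
        return 0.0
    unmapped = sum(1 for cid in ids if cid == UNMAPPED_CONCEPT_ID)
    return 100.0 * unmapped / len(ids)


def completeness_check_fails(
    concept_ids: Iterable[int], threshold_percent: float = 20.0
) -> bool:
    """A check fails when the unmapped share exceeds the threshold."""
    return percent_unmapped(concept_ids) > threshold_percent


# Toy column: 3 of 10 records are unmapped (placeholder concept IDs).
toy_column = [0, 0, 0, 1001, 1002, 1003, 1004, 1005, 1006, 1007]

print(percent_unmapped(toy_column))          # 30.0
print(completeness_check_fails(toy_column))  # True
```

Under this rule a column with exactly 20% unmapped records passes, because the comparison is strict.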

DISCUSSION

We have extracted and transformed the UKB, a complex large-scale biobank cohort study of 502 504 middle-aged individuals from England, Scotland, and Wales. The study combined self-reported data from questionnaires collected during recruitment with longitudinal EHR from primary care consultations, hospital admissions, cancer registrations, and mortality, recorded using 8 different clinical terminologies. Overall, >1.3 billion rows of data were processed and transformed to the OMOP CDM. Transformation to the OMOP CDM has enabled UKB to take part in federated analyses of 17 health data sources on adverse events of special interest (AESIs) associated with COVID-19 vaccination, and many other studies are ongoing.³⁷,³⁹

Representing data collected through questionnaires in the CDM was a challenging task and required a significant amount of preprocessing and consolidation across multiple fields. Eight custom mapping tables together with vocabularies from the existing OMOP vocabularies were used to map data fields and data values to standard OMOP concepts. Each type of data required a different mapping approach. One challenge was that OMOP measurement records have few attributes. For example, for the hemoglobin concentration (field id 30020), the freeze-thaw cycles data field (field id 30021) and the device ID (field id 30023) had to be mapped as a separate observation record and device record, respectively.

In line with previous studies³⁸ that used similar controlled clinical terminologies for EHR, our approach achieved high mapping coverage (>97% coverage) across established systems, eg, SNOMED-CT, ICD-10. Similarly, 89% of surgical procedure events recorded in OPCS-4 were transformed. Older terminologies used in historic data, eg, ICD-9 and OPCS-3, had lower event coverage: 91% and 77%, respectively. In contrast with previous research using prescription information in primary care EHR, the establishment of dm+d as the standard has led to a markedly improved event mapping coverage of 98.7%. Using USAGI, we mapped a small subset of the proprietary TPP and EMIS codes related to COVID-19 (41, 0.19%). The mapping of these proprietary codes had a substantial impact on COVID-19 case ascertainment, as it captured ∼60% of unique identified cases in primary care data and 28% in all sources.

We observed good overall concordance when comparing key demographics, risk factors and clinical comorbidities between the source and converted data. Broadly, we observed 2 classes of problems. Firstly, not all patients identified by comorbidity in the source data were identified in the transformed data. One cause is semantically unmapped diagnosis codes that are used for cohort identification and appear in patients’ clinical records (eg, CTV3 code X73lE (Coronavirus), used for identification of COVID-19 cases; n = 15). A second cause is restrictions imposed by the ETL (eg, diagnosis codes outside the observation period window).

Secondly, a very small number of patients were only identified as cases in the transformed data (n = 3 in the case of COVID-19). This occurs when 2 or more distinct source codes are mapped onto the same target code. If the source comorbidity definition uses one code and not the other, it is not possible to separate these using the target code (Supplementary Figure S3). Mapping of 2 or more source codes onto the same target concept could be a result of: (1) an incorrectly specified mapping, (2) specific source codes being mapped onto a more general target code, or (3) synonymous source codes. In the latter case, the source comorbidity definition should take both source codes into account.
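
The many-to-one effect described above can be reproduced with a toy example. This is a minimal sketch under stated assumptions: the source codes, target concepts, patient identifiers, and function names are all hypothetical placeholders, not real vocabulary entries or part of the ETL.

```python
# Hypothetical mapping: two distinct source codes share one target concept.
SOURCE_TO_TARGET = {
    "SRC_A": "TARGET_1",  # code used by the source comorbidity definition
    "SRC_B": "TARGET_1",  # different source code mapped onto the same target
    "SRC_C": "TARGET_2",
}

# Hypothetical patient records holding source codes.
patient_records = {
    "p1": ["SRC_A"],
    "p2": ["SRC_B"],
    "p3": ["SRC_C"],
}


def cases_in_source(records, definition_codes):
    """Patients with at least one source code from the definition."""
    return {
        pid for pid, codes in records.items()
        if definition_codes & set(codes)
    }


def cases_in_target(records, mapping, definition_concepts):
    """Patients with at least one mapped concept from the definition."""
    return {
        pid for pid, codes in records.items()
        if definition_concepts & {mapping[c] for c in codes if c in mapping}
    }


source_definition = {"SRC_A"}
# Translate the definition through the same mapping used for the data.
target_definition = {SOURCE_TO_TARGET[c] for c in source_definition}

source_cases = cases_in_source(patient_records, source_definition)
target_cases = cases_in_target(patient_records, SOURCE_TO_TARGET, target_definition)

# p2 carries SRC_B, which is indistinguishable from SRC_A after mapping.
print(sorted(target_cases - source_cases))  # ['p2']
```

Adding "SRC_B" to the source definition, as suggested for synonymous codes, makes the two case sets identical in this toy example.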

Our study has limitations. Not all available data could be mapped to the OMOP CDM, and the remainder must be handled separately. For example, genomic data (eg, SNPs) cannot be integrated within the OMOP CDM as the data model has been developed for routinely collected healthcare and claims data. This adds a layer of complexity when creating studies that need to combine information across phenotypic and genomic sources. Information collected via questionnaires is also challenging to include as it differs from typical OMOP CDM data: it uses local coding systems, is stored in a wide format, and is cross-sectional in nature. In addition, questionnaire data often captures negation and data missingness explicitly (eg, patient did not answer or refused to answer), which by convention is not stored in the OMOP CDM. As with previous studies, the OMOP CDM definition of an observation period (the period for which the data capture of a person is considered complete) causes some discrepancies between analyses on the source data and on the OMOP CDM, as historical medical events fall outside the observation period. It should be noted that this has been recently revised, and events outside the observation period are allowed in the OMOP CDM for some use cases.
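
The observation period discrepancy can be illustrated with a short sketch that separates events falling inside a person's observation period from those falling outside it. The record types, codes, and dates below are hypothetical, and the sketch assumes a single observation period per person; it is not the ETL's actual logic.

```python
from datetime import date
from typing import NamedTuple


class Event(NamedTuple):
    person_id: int
    code: str
    event_date: date


class ObservationPeriod(NamedTuple):
    person_id: int
    start: date
    end: date


def split_by_observation_period(events, periods):
    """Split events into (inside, outside) the person's observation period.

    Assumes at most one observation period per person; events for a
    person without any period are treated as outside.
    """
    by_person = {p.person_id: p for p in periods}
    inside, outside = [], []
    for event in events:
        period = by_person.get(event.person_id)
        if period is not None and period.start <= event.event_date <= period.end:
            inside.append(event)
        else:
            outside.append(event)
    return inside, outside


# Hypothetical data: one historical diagnosis predates the period.
periods = [ObservationPeriod(1, date(2010, 1, 1), date(2020, 12, 31))]
events = [
    Event(1, "DX_OLD", date(2005, 6, 1)),
    Event(1, "DX_RECENT", date(2015, 6, 1)),
    Event(2, "DX_NO_PERIOD", date(2015, 6, 1)),
]

inside, outside = split_by_observation_period(events, periods)
print([e.code for e in inside])   # ['DX_RECENT']
print([e.code for e in outside])  # ['DX_OLD', 'DX_NO_PERIOD']
```

A cohort definition applied only to the inside events would miss a patient whose qualifying diagnosis is historical, which is one way source and converted counts can differ.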

Finally, our study findings are potentially generalizable to other large datasets consisting of research-driven questionnaires and EHR linkage that require conversion to the OMOP CDM. The UK Biobank contains detailed phenotypic data sourced from different data modalities (eg, patient-reported questionnaire data, research data, claims data and EHR) combined with deep genotypic information. This resulted in a challenging technical implementation, including the use of a custom OMOP vocabulary. Other resources of similar complexity, such as All of Us⁴⁰ and the MVP⁴¹ in the United States, can potentially benefit from our findings when undergoing similar conversion to the OMOP CDM for participation in OHDSI studies.

CONCLUSION

Our study demonstrated that the OMOP CDM can be successfully leveraged to harmonize complex large-scale biobanked studies. Our study did uncover several challenges when transforming data collected using bespoke questionnaires from patients to the OMOP CDM which require further research. The transformed UK Biobank resource is a valuable research tool that can enable large-scale research in COVID-19 and other diseases.

FUNDING

This study was supported by a European Health Data & Evidence Network (EHDEN) project grant. This project has received funding from the Innovative Medicines Initiative 2 Joint Undertaking (JU) under grant agreement No 806968. The JU receives support from the European Union’s Horizon 2020 research and innovation program and EFPIA. The grant was for the institute. SD, RJBD, and VP are funded by the UCLH NIHR Biomedical Research Centre (BRC). SD is supported by BHF Data Science Centre led by Health Data Research UK (BHF Grant no. SP/19/3/34678); the COVID-19 Longitudinal Health and Wellbeing National Core Study funded by the Medical Research Council (MC_PC_20030; MC_PC_20059) and Health Data Research UK, which receives its core funding from the UK Medical Research Council, Engineering and Physical Sciences Research Council, Economic and Social Research Council, Department of Health and Social Care (England), Chief Scientist Office of the Scottish Government Health and Social Care Directorates, Health and Social Care Research and Development Division (Welsh Government), Public Health Agency (Northern Ireland), British Heart Foundation (BHF) and the Wellcome Trust. RJBD is supported by the following: (1) NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King’s College London, London, UK; (2) Health Data Research UK, which is funded by the UK Medical Research Council, Engineering and Physical Sciences Research Council, Economic and Social Research Council, Department of Health and Social Care (England), Chief Scientist Office of the Scottish Government Health and Social Care Directorates, Health and Social Care Research and Development Division (Welsh Government), Public Health Agency (Northern Ireland), British Heart Foundation and Wellcome Trust; (3) The BigData@Heart Consortium, funded by the Innovative Medicines Initiative-2 Joint Undertaking under grant agreement No. 116074. This Joint Undertaking receives support from the European Union’s Horizon 2020 research and innovation program and EFPIA; it is chaired by DE Grobbee and SD Anker, partnering with 20 academic and industry partners and ESC; (4) the National Institute for Health Research University College London Hospitals Biomedical Research Centre; (5) the National Institute for Health Research (NIHR) Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King’s College London; (6) the UK Research and Innovation London Medical Imaging & Artificial Intelligence Centre for Value Based Healthcare; (7) the National Institute for Health Research (NIHR) Applied Research Collaboration South London (NIHR ARC South London) at King’s College Hospital NHS Foundation Trust. FA is supported by UCL Hospitals NIHR Biomedical Research Centre.

AUTHOR CONTRIBUTIONS

SD conceived and designed the study. MM, EAV, SB, AVW, AP, SP implemented the ETL pipeline. SD, MM, VP reviewed and revised ETL mapping files. VP executed the ETL pipeline, extracted data and conducted the analyses. VP and SD analyzed and interpreted the results. SD, VP and MM wrote the report. All authors reviewed and interpreted the results, commented on the report, contributed to revisions, and read and approved the final version.

SUPPLEMENTARY MATERIAL

Supplementary material is available at Journal of the American Medical Informatics Association online.

CONFLICT OF INTEREST STATEMENT

EAV is an employee of Janssen Research and Development LLC and a shareholder of Johnson & Johnson (J&J) stock. Prof. Prieto-Alhambra’s research group has received grant support from Amgen, Chesi-Taylor, Novartis, and UCB Biopharma. His department has received advisory or consultancy fees from Amgen, Astellas, AstraZeneca, Johnson and Johnson, and UCB Biopharma and fees for speaker services from Amgen and UCB Biopharma. Janssen, on behalf of IMI-funded EHDEN and EMIF consortiums, and Synapse Management Partners have supported training programs organized by DP-A’s department and open for external participants outside this work.

REFERENCES

WHO Coronavirus (COVID-19) Dashboard. https://covid19.who.int/. Accessed June 25, 2021.

Thygesen, Tomlinson, Hollings, et al. COVID-19 trajectories among 57 million adults in England: a cohort study using electronic health records. Lancet Digit Health 2022; 4 (7): e542–

Raventós, Roel, et al. Association between covid-19 vaccination, SARS-CoV-2 infection, and risk of immune mediated neurological events: population based cohort and self-controlled case series analysis. BMJ 2022; 376: e068373.

Kostka, Duarte-Salles, Prats-Uribe, et al. Unraveling COVID-19: a large-scale characterization of 4.5 million COVID-19 cases using CHARYBDIS. Clin Epidemiol 2022; 369–

Bradwell, Wooldridge, Amor, et al.; N3C Consortium. Harmonizing units and values of quantitative data elements in a very large nationally pooled electronic health record (EHR) dataset. J Am Med Inform Assoc 2022; 1172–

Ostropolets, Makadia, et al. Characterising the background incidence rates of adverse events of special interest for covid-19 vaccines in eight countries: multinational network cohort study. BMJ 2021; 373: n1435.

Burn, Kostka, et al. Background rates of five thrombosis with thrombocytopenia syndromes of special interest for COVID-19 vaccine safety surveillance: incidence between 2017 and 2019 and patient profiles from 38.6 million people in six European countries. Pharmacoepidemiol Drug Saf 2022; 495–510.

Williams, Markus, Yang, et al. Seek COVER: using a disease proxy to rapidly develop and validate a personalized risk calculator for COVID-19 outcomes in an international network. BMC Med Res Methodol 2022.

Hripcsak, Duke, Shah, et al. Observational Health Data Sciences and Informatics (OHDSI): opportunities for observational researchers. Stud Health Technol Inform 2015; 216: 574–578.

European Health Data Evidence Network (EHDEN). ehden.eu. April 27, 2022. https://www.ehden.eu/. Accessed May 25, 2022.

Sudlow, Gallacher, Allen, et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med 2015; e1001779.

UK Biobank Data-Field 20002. https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=20002. Accessed July 12, 2022.

SNOMED Home Page. SNOMED [Internet]. http://snomed.org. Accessed May 23, 2022.

Read Codes – NHS Digital. https://digital.nhs.uk/services/terminology-and-classifications/read-codes. Accessed March 5, 2021.

Spiers, Goulding, Arrowsmith. Clinical terminologies in the NHS: SNOMED CT and dm+d. Br J Pharm 2017.

ICD-10 Version: 2019. https://icd.who.int/browse10/2019/en#/. Accessed May 23, 2022.

World Health Organization Staff, World Health Organization, Jack, Percy, Sobin, Whelan. International Classification of Diseases for Oncology: ICD-O. World Health Organization; 2000. https://play.google.com/store/books/details?id=2FVdGxRhsoIC. Accessed May 23, 2022.

http://amisha.pragmaticdata.com/units/UCUM/UCUM.pdf

Morley, Wallace, Denaxas, et al. Defining disease phenotypes using national linked electronic health records: a case study of atrial fibrillation. PLoS One 2014; e110900.

OMOP CDM Specification v5.3. https://ohdsi.github.io/CommonDataModel/cdm53.html. Accessed May 25, 2022.

OHDSI WhiteRabbit Tool. Github. https://github.com/OHDSI/WhiteRabbit. Accessed May 23, 2022.

Schadow, McDonald CJ. The Unified Code for Units of Measure. Indianapolis, IN: Regenstrief Institute and UCUM Organization; 2009.

OHDSI Athena. https://athena.ohdsi.org/search-terms/start. Accessed November 11, 2020.

OHDSI USAGI Tool. Github. https://github.com/OHDSI/Usagi. Accessed May 23, 2022.

NHS Digital TRUD. https://isd.digital.nhs.uk/trud3/user/guest/group/0/home. Accessed March 5, 2021.

Liu, Moore, Ganesan, Nelson. RxNorm: prescription for electronic drug information exchange. IT Prof 2005.

COVID-19 Data. https://www.ukbiobank.ac.uk/enable-your-research/about-our-data/covid-19-data. Accessed May 23, 2022.

Wood, Denholm, Hollings, et al. Linked electronic health records for research on a nationwide cohort of more than 54 million people in England: data resource. BMJ 2021; 373: n826.

Denaxas, Gonzalez-Izquierdo, Direk, et al. UK phenomics platform for developing and validating electronic health record phenotypes: CALIBER. J Am Med Inform Assoc 2019; 1545–

Denaxas. Tofu: Tofu Is a Python Tool for Generating Synthetic UK Biobank Data. Github. https://github.com/spiros/tofu.

OHDSI Achilles Tool. https://ohdsi.github.io/Achilles/. Accessed May 23, 2022.

OHDSI DataQualityDashboard Tool. Github. https://github.com/OHDSI/DataQualityDashboard. Accessed May 23, 2022.

OHDSI CdmInspection Tool. Github. https://github.com/EHDEN/CdmInspection. Accessed May 23, 2022.

Kuan, Denaxas, Gonzalez-Izquierdo, et al. A chronological map of 308 physical and mental health conditions from 4 million individuals in the English National Health Service. Lancet Digit Health 2019; e63–

OHDSI ATLAS Tool. https://www.ohdsi.org/atlas-a-unified-interface-for-the-ohdsi-tools/. Accessed May 23, 2022.

Mapping UK Biobank to the OMOP CDM: Challenges and Solutions Using The Delphyne ETL Framework. https://ohdsi.org/2021-global-symposium-showcase-3/. Accessed June 20, 2022.

OHDSI Athena – UK Biobank Vocabulary. https://athena.ohdsi.org/search-terms/terms?vocabulary=UK+Biobank&page=1&pageSize=15&query=. Accessed May 23, 2022.

Shoaibi, Rao, Voss, et al. Phenotype algorithms for the identification and characterization of vaccine-induced thrombotic thrombocytopenia in real world data: a multinational network cohort study. Drug Saf 2022; 685–

Papez, Moinat, Payralbe, et al. Transforming and evaluating electronic health record disease phenotyping algorithms using the OMOP common data model: a case study in heart failure. JAMIA Open 2021; ooab001.

Voss, Shoaibi, Ostropolets, et al. [RESEARCH PROTOCOL] Adverse Events of Special Interest within COVID-19 Subjects. GitHub. https://ohdsi-studies.github.io/Covid19SubjectsAesiIncidenceRate/Protocol.html. Accessed July 12, 2022.

The All of Us research program investigators. The “All of Us” research program. N Engl J Med 2019; 381: 668–