Abstract
Background Clinically coded long COVID cases in electronic health records (EHRs) are incomplete, despite reports of rising cases of long COVID.
Aim To determine patient characteristics associated with clinically coded long COVID.
Design & setting With the approval of NHS England, we conducted a cohort study using EHRs within the OpenSAFELY-TPP platform in England, to study patient characteristics associated with clinically coded long COVID from 29 January 2020 to 31 March 2022.
Method We summarised the distribution of characteristics for people with clinically coded long COVID. We estimated age–sex adjusted hazard ratios (aHRs) and fully aHRs for coded long COVID. Patient characteristics included demographic factors, and health behavioural and clinical factors.
Results Among 17 986 419 adults, 36 886 (0.21%) were clinically coded with long COVID. Patient characteristics associated with coded long COVID included female sex, younger age (aged <60 years), obesity, living in less deprived areas, ever smoking, greater consultation frequency, and history of diagnosed asthma, mental health conditions, pre-pandemic post-viral fatigue, or psoriasis. These associations were attenuated following two doses of COVID-19 vaccines compared with before vaccination. Differences in the predictors of coded long COVID between the pre-vaccination and post-vaccination cohorts may reflect the different patient characteristics in these two cohorts rather than the vaccination status. Incidence of coded long COVID was higher in those with hospitalised COVID-19 than in those with non-hospitalised COVID-19.
Conclusion We identified variation in coded long COVID by patient characteristic. Results should be interpreted with caution as long COVID was likely under-recorded in EHRs.
How this fits in
Electronic health records (EHRs) for long COVID are incomplete. It is important to understand the characteristics of people who have had their long COVID coded in EHRs. This study identified a set of patient characteristics associated with clinically coded long COVID. These include the frequency of prior GP–patient interaction, sociodemographic variables, history of diagnosed diseases, and SARS-CoV-2 severity.
Introduction
Long COVID,1 also known as post-acute sequelae of SARS-CoV-2 (PASC)2, or post-COVID-19 syndrome,3 is an overarching term for the persistent symptoms for weeks, months,4 or years, following the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection. National Institute for Health and Care Excellence (NICE) guidance on supporting patients with long COVID includes assessing people with symptoms after acute SARS-CoV-2, investigations, and referrals.5
Understanding risk factors for long COVID is a public health priority. Counts and rates of people having a long COVID code in English primary care varied by demographic factors but also considerably by the practice clinical software system.6 UK longitudinal cohort studies reported that risk factors for having long COVID included increasing age, female sex, obesity, poor pre-pandemic general and mental health, and asthma.7,8
Previous electronic health records (EHRs) analyses were based on the study period from 1 February 2020 to 9 May 2021,7 during which 4189 long COVID cases were clinically coded. This represents considerable under-reporting, compared with the Office for National Statistics' (ONS') estimate of 1.0 million people with self-reported long COVID9 in the UK in May 2021. The usage of long COVID codes has improved with time.10 General practice services were encouraged to enhance their knowledge on assessing and referring patients with long COVID as set out in NHS actions on long COVID for 2021–2022.11
We conducted a cohort study within the OpenSAFELY-TPP database (https://www.opensafely.org/), which includes detailed linked data on around 24 million people registered with an English GP using TPP SystmOne EHR software (see ‘Data source’). We aimed to quantify associations of patient characteristics, including vaccination status, COVID-19 severity, and history of a range of disease diagnoses, with coded long COVID in English primary care.
Method
Data source
We used patient data from primary care records managed by the GP software provider, TPP SystmOne, covering around 40% of the population in England. These data include clinically coded long COVID, information on sociodemographics, pre-existing health conditions, and frequencies of GP–patient interactions, which may be consultations or any practice contacts. Data were linked to national SARS-CoV-2 testing records (Second Generation Surveillance System), vaccination data (National Immunisation Management Service), Index of Multiple Deprivation (IMD), and the ONS death registry. Admitted Patient Care Spells (APCS) is part of Hospital Episode Statistics (HES) and is provided to OpenSAFELY via NHS Digital’s Secondary Uses Service (SUS). OpenSAFELY includes pseudonymised data such as coded diagnoses, medications, and physiological parameters, but does not include free-text data.
Study population and cohort definitions
Our study population consisted of adults aged between 18 years and 105 years, with known sex and region, who were registered as active patients in a TPP GP on 29 January 2020 (the date when the first two SARS-CoV-2 cases were reported in the UK) and had at least 1 year of prior follow-up in a general practice, to ensure that baseline characteristics could be adequately captured.
We constructed four cohorts (Supplement, Figure S1, Table S1): (1) a primary general population cohort, with follow-up start date on 29 January 2020 and end date the earliest of first record of any long COVID code, death date, or 31 March 2022 (the day before free SARS-CoV-2 testing in England ended);12 (2) a post-COVID diagnosis cohort, defined regardless of vaccination status, with follow-up start date the first recorded COVID-19 diagnosis and end date the earliest of first record of any long COVID code, death date, or 31 March 2022; (3) a pre-vaccination cohort with follow-up start date on 29 January 2020 and end date the earliest of first record of any long COVID code, date of receipt of first COVID-19 vaccine dose, death date, or 31 March 2022; (4) a post-vaccination cohort, with follow-up start date 14 days after receipt of second COVID-19 vaccine dose and end date the earliest of first record of any long COVID code, death date, or 31 March 2022. In each cohort, people with a history of SARS-CoV-2 infection, and/or long COVID code before their follow-up start date, were excluded.
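The follow-up rules above can be illustrated with a minimal Python sketch. It is not the study's analysis code (which is available separately17); the function names, argument names, and the use of None for an event that never occurred are assumptions made for illustration.

```python
from datetime import date, timedelta
from typing import Optional

STUDY_END = date(2022, 3, 31)  # study end date stated in the text


def follow_up_end(long_covid: Optional[date],
                  death: Optional[date],
                  first_vaccine: Optional[date] = None,
                  censor_at_vaccination: bool = False) -> date:
    """Return the earliest candidate end date.

    None means the event was never recorded (an illustrative convention).
    censor_at_vaccination=True mimics the pre-vaccination cohort, where
    follow-up also stops at the first vaccine dose.
    """
    candidates = [STUDY_END, long_covid, death]
    if censor_at_vaccination:
        candidates.append(first_vaccine)
    return min(d for d in candidates if d is not None)


def post_vaccination_start(second_dose: date) -> date:
    """Post-vaccination follow-up starts 14 days after the second dose."""
    return second_dose + timedelta(days=14)
```

For example, a person first vaccinated on 1 February 2021 and coded with long COVID on 1 May 2021 would be censored on 1 February 2021 in the pre-vaccination cohort, but would contribute an outcome event on 1 May 2021 in the primary cohort.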
Outcomes
The outcome was clinically coded long COVID, constructed from the date of the first record of any of the 15 UK SNOMED-CT codes for long COVID6 in English primary care records, consisting of two diagnostic codes, three referral codes, and 10 assessment codes (Supplement, Table S2). Time to the outcome event was defined as days from the participant-specific follow-up start date (Supplement, Table S1).
COVID-19 diagnosis
Date of COVID-19 diagnosis was defined as the earliest of: record of a positive SARS-CoV-2 polymerase chain reaction or antigen test; confirmed COVID-19 diagnosis in primary care or secondary care hospital admission records; or death certificate with SARS-CoV-2 infection listed as primary or underlying cause.
Patient characteristics
Patient characteristics included demographic variables, and health behavioural and clinical factors that may be associated with coded long COVID,6,7 and the frequency of GP–patient interactions, which could be an indicator of patient access to care and ability to interact with general practice. There is only one entry for sex in the EHR for each patient. All other coded values were the latest record on or before the cohort- and participant-specific follow-up start date. A full description of patient characteristics is in the Supplement, Table S2.
Demographic variables included age, sex, obesity, ethnicity, region, and deprivation. Where categorised, age groups were: 18–39 years, 40–59 years, 60–79 years, 80–105 years. Obesity was grouped based on body mass index (BMI kg/m2) using categories derived from the World Health Organization (WHO):13 no evidence of obesity, BMI <30 kg/m2; obese class I, BMI 30–34.9 kg/m2; obese class II, BMI 35–39.9 kg/m2; and obese class III, BMI ≥40 kg/m2. Ethnic groups were White, Mixed, Asian or Asian British, Black or Black British, and Chinese or other ethnic groups. All nine regions in England were included (East, London, East Midlands, North East, North West, West Midlands, Yorkshire and the Humber, South East, and South West).14 IMD was determined based on residential area and categorised into quintiles of relative disadvantage, with quintile 1 (Q1) being the most deprived, and quintile 5 (Q5) being the least deprived.
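As an illustration of the age and obesity groupings, the following Python sketch maps raw values to the categories listed above. The function names are hypothetical, the half-open BMI boundaries (for example, 34.95 falling in class I) are an assumption, and treating a missing BMI as 'no evidence of obesity' is an assumption consistent with covariates being defined by presence versus absence of codes.

```python
from typing import Optional


def obesity_category(bmi: Optional[float]) -> str:
    """Map BMI (kg/m2) to the WHO-derived categories used in the text.

    Assumption: a missing BMI (None) is treated as no evidence of obesity.
    """
    if bmi is None or bmi < 30:
        return "no evidence of obesity"
    if bmi < 35:
        return "obese class I"
    if bmi < 40:
        return "obese class II"
    return "obese class III"


def age_group(age: int) -> str:
    """Map age in years to the four study age groups (18 to 105 only)."""
    if not 18 <= age <= 105:
        raise ValueError("age outside the study range of 18-105 years")
    if age < 40:
        return "18-39"
    if age < 60:
        return "40-59"
    if age < 80:
        return "60-79"
    return "80-105"
```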
Health behavioural and clinical factors included smoking status, frequency of GP–patient interaction and history of disease diagnoses. Smoking status was grouped into current-, ever-, and never-smokers. Frequency of GP–patient interaction was defined during the 12 months before participants’ follow-up start date, and categorised as: without any interaction; 1–3; 4–8; 9–12; and ≥13 interactions. History of the disease diagnoses, chosen based on previous literature on risk factors for long COVID7 and defined on or before the cohort- and participant-specific follow-up start date, was coded as separate indicator variables: asthma, cancer, chronic cardiac disease, chronic kidney disease, chronic liver disease, chronic obstructive pulmonary disease (COPD), chronic respiratory disease, dementia, diabetes, dysplenia (dysfunctional spleen), haematological cancer, heart failure, hypertension, mental health condition, organ transplant, other immunosuppressive condition, other neurological condition, post-viral fatigue, psoriasis, rheumatoid arthritis, systemic lupus erythematosus (SLE), and stroke. History of diagnosed post-viral fatigue was defined before 29 January 2020 owing to the potential use of the corresponding codes as a proxy for long COVID before the introduction of long COVID clinical codes in December 2020.
Hospitalisation for COVID-19 was defined as a hospital admission record with confirmed COVID-19 diagnosis in primary position within 28 days of the first COVID-19 diagnosis and COVID-19 without hospitalisation as a COVID-19 diagnosis that was not followed by hospitalisation within 28 days.15
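The severity definition can be expressed as a short Python sketch. The function name is hypothetical, and two details are assumptions rather than statements from the study: the admission date passed in is taken to be that of a record with confirmed COVID-19 in primary position, and the 28-day window is treated as inclusive of day 0 and day 28.

```python
from datetime import date
from typing import Optional


def covid_severity(first_diagnosis: date, covid_admission: Optional[date]) -> str:
    """Classify a first COVID-19 diagnosis as hospitalised or non-hospitalised.

    covid_admission is the date of a hospital admission with COVID-19 in
    primary position, or None if no such admission exists (assumed input).
    """
    if covid_admission is not None:
        days_to_admission = (covid_admission - first_diagnosis).days
        if 0 <= days_to_admission <= 28:  # inclusive window is an assumption
            return "hospitalised"
    return "non-hospitalised"
```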
Statistical analyses
Rates of coded long COVID were quantified as the number of first long COVID events per 1000 person-years. The cumulative probability of coded long COVID was estimated, using the Kaplan–Meier approach, by age group and sex. In each cohort, hazard ratios with 95% confidence intervals (CIs) for each patient characteristic were estimated from age-and-sex adjusted Cox proportional hazards (PH) models, and then all patient characteristics were included in a multivariable Cox PH model. Age was modelled using a restricted cubic spline, and estimated log hazard ratios against continuous age were plotted. In the post-COVID diagnosis cohort, we included COVID-19 severity (hospitalised versus non-hospitalised COVID-19) as an additional factor. Hazard ratios by age group (40–59 years, 60–79 years, and 80–105 years compared with 18–39 years [reference]), were estimated from models including age as a categorical variable, instead of a cubic spline.
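To make the first two quantities concrete, the sketch below computes an incidence rate per 1000 person-years and a Kaplan–Meier cumulative probability in plain Python. It is an illustrative implementation under simplifying assumptions (365.25 days per year; follow-up times in days; an event flag of 1 for coded long COVID and 0 for censoring) and is not the code used in the study, which fitted models in R.

```python
from typing import List, Sequence, Tuple

DAYS_PER_YEAR = 365.25  # assumed conversion from days to years


def rate_per_1000_person_years(n_events: int, person_days: float) -> float:
    """Number of first events per 1000 person-years of follow-up."""
    return 1000.0 * n_events / (person_days / DAYS_PER_YEAR)


def kaplan_meier_cumulative(times: Sequence[float],
                            events: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (time, cumulative probability) at each distinct event time.

    events[i] is 1 if person i had the outcome at times[i], 0 if censored.
    Cumulative probability is 1 minus the Kaplan-Meier survival estimate.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_coded = 0    # outcome events at time t
        n_leaving = 0  # everyone whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            n_coded += data[i][1]
            n_leaving += 1
            i += 1
        if n_coded > 0:
            survival *= 1.0 - n_coded / at_risk
            curve.append((t, 1.0 - survival))
        at_risk -= n_leaving
    return curve
```

Applied within strata of age group and sex, the second function would give curves of the kind summarised in the supplementary figures.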
For computational efficiency, we used the full population with coded long COVID and a randomly sampled population without coded long COVID with a ratio of 1:20. We used inverse probability weighting and robust standard errors to account for the sampling approach. The discriminative ability of the fitted model was quantified using C-statistics.16
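The sampling and weighting step can be sketched as follows. This is a simplified illustration, assuming that 20 people without coded long COVID are drawn for each person with it and that each sampled person is weighted by the inverse of their sampling probability; the function name, identifiers, and seed are hypothetical, and the weighted Cox model and robust standard errors are not shown.

```python
import random
from typing import Dict, Hashable, Sequence


def sample_with_weights(cases: Sequence[Hashable],
                        non_cases: Sequence[Hashable],
                        ratio: int = 20,
                        seed: int = 0) -> Dict[Hashable, float]:
    """Keep all cases, sample non-cases, and return inverse probability weights.

    Cases are kept with probability 1 (weight 1). Non-cases are sampled
    without replacement, so each sampled non-case gets weight
    len(non_cases) / n_sampled.
    """
    if not cases:
        raise ValueError("at least one case is required")
    rng = random.Random(seed)
    n_sampled = min(len(non_cases), ratio * len(cases))
    sampled = rng.sample(list(non_cases), n_sampled)
    weight_non_case = len(non_cases) / n_sampled
    weights = {person: 1.0 for person in cases}
    weights.update({person: weight_non_case for person in sampled})
    return weights
```

A useful check on such weights is that they sum to the size of the full population, so weighted analyses of the sample estimate the same quantities as analyses of everyone.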
We included a missing category for ethnicity, smoking status, and IMD. All other covariates were defined using the presence versus absence of specific codes, and thus have no identifiable missing values.
Data management and analysis were conducted using Python (version 3.8) and R (version 4.2.1) according to a prespecified protocol. Our protocol, analysis code, and code lists are available.17
Results
Study population
In total, 17 986 419 adults were included in the primary and pre-vaccination cohorts, 13 401 208 in the post-vaccination cohort and 3 507 738 in the post-COVID diagnosis cohort (Table 1). In the primary cohort, there were missing data for ethnicity (4 809 699, 26.74%), smoking status (744 851, 4.14%), and IMD (298 586, 1.66%). There were 1 855 613 (10.32%) people with ethnicity recorded as from minority groups, including Asian or Asian British, Black or Black British, Chinese or other ethnic groups, or Mixed. People in the post-vaccination and post-COVID diagnosis cohorts were more likely to have had at least one GP interaction in the 12 months before follow-up than those in the primary cohort. In each cohort, the most prevalent previous diagnoses were of asthma, chronic cardiac disease, diabetes, hypertension, and mental health conditions. People in the post-vaccination cohort were older, less likely to be recorded as from a minority ethnic group, and more likely to have a history of prior disease diagnoses than those in the pre-vaccination cohort. People in the post-COVID diagnosis cohort were younger, more likely to be male, and more likely to be recorded as from a minority ethnic group than those in the primary cohort. This motivates future research on increasing vaccine uptake in minority ethnic groups.
The numbers of people with coded long COVID were 36 886 (0.2%), 7155 (0.04%), 17 376 (0.1%), and 29 268 (0.8%) in the primary, pre-vaccination, post-vaccination, and post-COVID diagnosis cohorts, respectively (Table 2). The corresponding incidence rates of coded long COVID were 1.0, 0.3, 1.6, and 12.8 per 1000 person-years, respectively. In the primary cohort, the rate was highest in people aged 40–59 years (1.4), females (1.2), and people with BMI ≥40 kg/m2 (1.8). In the post-COVID diagnosis cohort, the incidence rate was highest in people aged 40–59 years (17.0), females (14.8), people with BMI ≥40 kg/m2 (20.2), people of White ethnicity (14.0), and people living in less deprived areas (IMD Q4: 14.7).
In the primary cohort, the overall cumulative probability of coded long COVID was less than 0.1% in people aged ≥80 years, rising to around 0.4% and 0.2%, respectively, in women and men aged 40–59 years (Supplement, Figure S2). In the post-COVID diagnosis cohort, the overall cumulative probability of coded long COVID was <0.5% in people aged ≥80 years, rising to around 1.3% and 0.9%, respectively, in women and men aged 40–59 years (Supplement, Figure S3). The low cumulative probability of coded long COVID for people aged ≥80 years may have been owing to higher risk of mortality, which can censor diagnosis of long COVID.
Demographic factors: primary and post-COVID diagnosis cohorts
Fully adjusted hazard ratios (aHRs) for sex, obesity, and ethnicity were generally attenuated towards 1, compared with age–sex aHRs (Figure 1). The incidence of coded long COVID declined markedly with age in the primary cohort (aHRs 0.51 [95% confidence interval {CI} = 0.43 to 0.60] and 0.19 [95% CI = 0.15 to 0.24] for age groups 60–79 years and 80–105 years, respectively, compared with age group 18–39 years). This decline was less marked in the post-COVID diagnosis cohort. The aHRs comparing age groups were consistent with those when age was modelled by restricted cubic spline (Supplement, Figure S4). The incidence of coded long COVID was higher in females than males (aHRs 1.33 [95% CI = 1.27 to 1.39] and 1.20 [95% CI = 1.14 to 1.27] in the primary and post-COVID diagnosis cohorts, respectively). In the primary cohort, the incidence of coded long COVID was lower in people from Black or Black British ethnicity (aHR 0.84 [95% CI = 0.74 to 0.96]) and Chinese or other ethnic groups (aHR 0.66 [95% CI = 0.56 to 0.77]), compared with those of White ethnicity. These differences were attenuated towards 1 in the post-COVID diagnosis cohort. In each cohort, the incidence of coded long COVID was higher in the North East, and increased with increasing obesity and decreasing deprivation.
Demographic factors: pre-vaccination and post-vaccination cohorts
Fully aHRs for sex and BMI were generally attenuated towards 1, compared with age–sex aHRs (Figure 2) in both pre-vaccination and post-vaccination cohorts. The incidence of coded long COVID declined in older adults in the post-vaccination cohort (aHRs 0.36 [95% CI = 0.30 to 0.44] and 0.12 [95% CI = 0.09 to 0.16] for age groups 60–79 years and 80–105 years, respectively, compared with younger adults aged 18–39 years). This decline was less marked in the pre-vaccination cohort. The incidence of coded long COVID was higher in females than males (aHRs 1.31 [95% CI = 1.22 to 1.41] and 1.23 [95% CI = 1.16 to 1.30] in the pre-vaccination and post-vaccination cohorts, respectively). In the pre-vaccination cohort, the incidence of coded long COVID increased with increasing obesity. This pattern was less clear in the post-vaccination cohort. In both cohorts, the incidence of coded long COVID was lower in people of Chinese or other ethnic groups (aHRs 0.63 [95% CI = 0.50 to 0.81] and 0.72 [95% CI = 0.56 to 0.92] in the pre-vaccination and post-vaccination cohorts, respectively), compared with those of White ethnicity. The incidence of coded long COVID was lower in people of Black or Black British ethnicity compared with White ethnicity in the post-vaccination cohort (aHR 0.67 [95% CI = 0.56 to 0.81]), but not in the pre-vaccination cohort (aHR 1.10 [95% CI = 0.93 to 1.30]). In the pre-vaccination cohort, the incidence of coded long COVID was slightly higher in the North East. In the post-vaccination cohort, it was slightly higher in the North West. In each cohort, the incidence of coded long COVID increased with decreasing deprivation.
Health behavioural and clinical factors: primary and post-COVID diagnosis cohorts
In the primary cohort, the incidence of coded long COVID was lower in current smokers and people with a missing smoking status, compared with people who never smoked (Figure 3). These differences were attenuated towards 1 in the post-COVID diagnosis cohort. In each cohort, the incidence of coded long COVID increased with increasing frequency of GP–patient interactions during the 12 months before the follow-up start date. The aHRs for GP–patient interaction were generally attenuated, compared with age–sex aHRs.
In the primary cohort, the incidence of coded long COVID was higher in people with than without a history of diagnosed asthma, chronic cardiac disease, chronic respiratory disease, haematological cancer, mental health conditions, pre-pandemic post-viral fatigue, psoriasis, or rheumatoid arthritis. These differences were generally attenuated in the post-COVID diagnosis cohort. In both cohorts, aHRs for these diseases were attenuated towards 1, compared with age–sex aHRs. The largest aHRs were for pre-pandemic post-viral fatigue (pre-vaccination cohort: 2.01, 95% CI = 1.72 to 2.35; post-vaccination cohort: 1.96, 95% CI = 1.63 to 2.35). In the primary cohort, the incidence of coded long COVID was lower in people with than without a history of diagnosed cancer, COPD, diabetes, heart failure, hypertension, or other neurological disorders. In the post-COVID diagnosis cohort, incidence of coded long COVID was similar in people with and without a history of diagnosed hypertension (aHR 1.00 [95% CI = 0.97 to 1.04]). In the post-COVID diagnosis cohort, people with hospitalised COVID-19 had higher incidence of coded long COVID (aHR 1.37 [95% CI = 1.21 to 1.55]) than those with non-hospitalised COVID-19.
Health behavioural and clinical factors: pre-vaccination and post-vaccination cohorts
In the pre-vaccination cohort, the incidence of coded long COVID was lowest in current smokers and people with a missing smoking status, and highest in ever smokers, compared with people who never smoked (Figure 4). The aHRs for smoking status were attenuated towards 1 in the post-vaccination cohort, compared with the pre-vaccination cohort. The incidence of coded long COVID increased with increasing frequency of GP–patient interaction, although aHRs were attenuated towards 1 in the post-vaccination cohort, compared with the pre-vaccination cohort. The aHRs for GP–patient interaction were generally attenuated, compared with age–sex adjusted hazard ratios.
In the pre-vaccination cohort, the incidence of coded long COVID was higher in people with than without a history of diagnosed asthma, mental health conditions, pre-pandemic post-viral fatigue, and psoriasis. These differences were attenuated in the post-vaccination cohort, compared with the pre-vaccination cohort. The aHRs for these diseases were attenuated, compared with age–sex aHRs. In the post-vaccination cohort, but not the pre-vaccination cohort, the incidence of coded long COVID was higher in people with than without a history of organ transplant. The incidence of coded long COVID was higher in people with than without a history of diagnosed pre-pandemic post-viral fatigue, in both the pre-vaccination and post-vaccination cohorts.
Discussion
Summary
Despite an estimated 2.8% of the UK population having self-reported symptoms of long COVID18 as of 3 April 2022, only 36 886 (0.2%) of the eligible general adult population in this study of 17 986 419 adults had a diagnosis of long COVID recorded in their primary care record.
Patient characteristics associated with higher incidence of coded long COVID included female sex, younger age (aged <60 years), greater BMI, ever having smoked, and a history of diagnosed asthma, mental health conditions, and psoriasis. The incidence of coded long COVID was higher with increasing GP–patient interaction. Coded long COVID was more than twice as likely in people with than without a diagnosis of post-viral fatigue before the pandemic. The incidence of coded long COVID was higher after hospitalised than non-hospitalised COVID-19.
Differences between factors associated with coded long COVID in the four cohorts studied may reflect differences between risk factors for infection with SARS-CoV-2, developing severe COVID-19, and developing long COVID having been infected with SARS-CoV-2. They may also reflect the influence of vaccination on developing long COVID, and changes in primary care coding practice and healthcare-seeking behaviours during the pandemic. There were only minor differences between the cohorts in associations of demographic factors with coded long COVID (for example, lower incidence compared with White ethnicity for Chinese or other ethnic groups apart from the post-COVID diagnosis cohort, and for Asian or Asian British only in the post-vaccination cohort). Similarly, there were inverse associations with coded long COVID of current smoking compared with never smoking, and positive associations with number of previous GP–patient interactions, across the four cohorts, although the magnitude of this association was lower in the post-COVID diagnosis and post-vaccination cohorts than in the primary and pre-vaccination cohorts. Associations with previous disease diagnoses were also broadly consistent across the four cohorts. Further, COVID-19 vaccination did not substantially modify associations of factors with coded long COVID, although it is likely to have substantially attenuated the overall incidence of COVID-19.19
Strengths and limitations
A key strength of this study is its use of the data from the OpenSAFELY-TPP platform, which includes more than 40% of the English population.20 We analysed data from all eligible adults with follow-up of up to 26 months. The prevalence of coded long COVID was higher in people registered in an NHS primary care GP using EMIS EHR software than in practices using TPP software.6 The type of EHR software used is geographically clustered.21 However, we were not able to access data from practices using EMIS software.
The prevalence of coded long COVID in English primary care records was substantially lower than that found in population surveys. There is likely to be considerable under-ascertainment of long COVID in these records owing to difficulties in accessing care during the pandemic. Future research can investigate if access to free-text records might help decrease under-ascertainment.22 However, free text was not available for our analyses.
Fully aHRs quantify the contribution of each patient characteristic to predicting the outcome, having accounted for the value of each other patient characteristic. However, they do not have causal interpretations, because they do not distinguish between adjustment for confounders and mediators. Such misinterpretation of multiple adjusted effect estimates presented in a single table has been referred to as the 'table 2 fallacy'.23
As described in the 'COVID-19 diagnosis' section, we used data from multiple sources to capture COVID-19 diagnosis as accurately as possible. Patient characteristics were determined using primary care records; additional data from secondary care could improve their completeness, and an updated analysis could then be conducted.
In the pre-vaccination cohort, follow-up was censored at the time of vaccination. Such censoring could lead to bias if it was informative, which would be the case if the incidence of coded long COVID differed systematically between people who were and were not vaccinated, having accounted for baseline covariates. We believe that our analyses adjusted for the major predictors of COVID-vaccination (for example, age, sex, ethnicity, IMD), which should have limited informative censoring, but cannot exclude the possibility that estimated associations were biased because of the censoring.
Comparison with existing literature
Similar to other studies,7,24,25 we found positive associations of coded long COVID with female sex, obesity, mental health conditions, and living in less deprived areas. The latter association contrasts with the increased risk of SARS-CoV-2 infection with increasing deprivation, and illustrates the distinction between long COVID and coded long COVID, which depends on the ability of people with long COVID to access health care for their condition at a time of extreme pressure on health services. A previous EHR analysis also found that people living in less deprived areas had higher incidence of coded long COVID. However, in the same study, the analysis of longitudinal cohort studies found no association between IMD and self-reported long COVID.7
Among the general population, the incidence for coded long COVID was lower in people of Black ethnicity, similar to a previous study.7 We found similar incidence of coded long COVID in Asian and Asian British people and people of White ethnicity. Additionally, the incidence of coded long COVID was lower in Chinese or other ethnic groups, compared with people of White ethnicity. In general, the incidence of coded long COVID was higher in ever smokers but lower in current smokers, compared with never smokers. A previous study7 included only two categories for smoking status, and found no difference in the incidence of coded long COVID between current smokers and non-smokers. Smoking status in EHR may not be up to date, especially for people who had less frequent interaction with their GP.
A study in Moscow identified that pre-existing hypertension was associated with higher risk of long COVID 12 months after discharge from hospital.26 Our fully adjusted model in the primary cohort showed that the incidence of coded long COVID was lower for people with a history of diagnosed hypertension, although the incidence was higher when only adjusted for age and sex. In the other three cohorts, no association with hypertension was observed from the fully adjusted models. In the Moscow study, long COVID was assessed by clinicians after hospitalised COVID, while our study relied on people getting access to their GP and the diagnosis then being recorded.
A previous report to the UK Government’s Scientific Advisory Group for Emergencies found that the risk of coded long COVID was higher in adults with hospitalised than non-hospitalised COVID-19.27 Our study was restricted to adults. Other studies report that hospitalised COVID was also associated with higher risk of long COVID in children.25,28,29 A systematic review of 20 studies identified higher risk of long COVID with female sex, mental health conditions, fatigue, and acute disease severity with respiratory symptoms.30
Implications for practice
A potentially large proportion of people with long COVID did not have a long COVID code in their health records. The incidence of coded long COVID was influenced by the frequency of previous GP–patient interaction. Despite controlling for this factor, we identified a set of patient characteristics associated with coded long COVID, including sociodemographical variables, history of diagnosed diseases, and SARS-CoV-2 severity.
Information governance
NHS England is the data controller for OpenSAFELY-TPP; TPP is the data processor; all study authors using OpenSAFELY have the approval of NHS England.31 This implementation of OpenSAFELY is hosted within the TPP environment which is accredited to the ISO 27001 information security standard and is NHS IG Toolkit compliant.32
Patient data has been pseudonymised for analysis and linkage using industry standard cryptographic hashing techniques; all pseudonymised datasets transmitted for linkage onto OpenSAFELY are encrypted; access to the platform is via a virtual private network (VPN) connection, restricted to a small group of researchers; the researchers hold contracts with NHS England and only access the platform to initiate database queries and statistical models; all database activity is logged; only aggregate statistical outputs leave the platform environment following best practice for anonymisation of results such as statistical disclosure control for low cell counts.33
The service adheres to the obligations of the UK General Data Protection Regulation (UK GDPR) and the Data Protection Act 2018. The service previously operated under notices initially issued in February 2020 by the Secretary of State under Regulation 3(4) of the Health Service (Control of Patient Information) Regulations 2002 (COPI Regulations), which required organisations to process confidential patient information for COVID-19 purposes; this set aside the requirement for patient consent.34 As of 1 July 2023, the Secretary of State has requested that NHS England continue to operate the Service under the COVID-19 Directions 2020.35 In some cases of data sharing, the common law duty of confidence is met using, for example, patient consent or support from the Health Research Authority Confidentiality Advisory Group.36
Taken together, these provide the legal bases to link patient datasets on the OpenSAFELY platform. GP practices, from which the primary care data are obtained, are required to share relevant health information to support the public health response to the pandemic and have been informed of the OpenSAFELY analytics platform.
Notes
Funding
This research was funded by a UKRI MRC Fellowship awarded to YW (MR/W021358/1). YW received funding from UKRI EPSRC Impact Acceleration Account (EP/X525789/1) and Health Data Research UK. The Longitudinal Health and Wellbeing UK COVID-19 National Core Study was funded by the UKRI Medical Research Council (MC_PC_20059) and the NIHR CONVALESCENCE study (COV-LT-0009). The OpenSAFELY software platform was funded by Wellcome and by the Data and Connectivity COVID-19 National Core Study, led by Health Data Research UK in partnership with the Office for National Statistics and funded by UK Research and Innovation (MC_PC_20058). TPP provided technical expertise and infrastructure within their data centre pro bono in the context of a national emergency. This research used data assets made available as part of the Data and Connectivity National Core Study, led by Health Data Research UK in partnership with the Office for National Statistics and funded by UK Research and Innovation (grant ref MC_PC_20058). In addition, the OpenSAFELY Platform is supported by grants from the Wellcome Trust (222097/Z/20/Z); MRC (MR/V015757/1, MC_PC-20059, MR/W016729/1); NIHR (NIHR135559, COV-LT2-0073), and Health Data Research UK (HDRUK2021.000, 2021.0157). JACS, EH and RD are supported by the NIHR Bristol Biomedical Research Centre. JACS, RD and YW are supported by Health Data Research UK (HDRUK2023.0022). AMW is supported by the NIHR Cambridge Biomedical Research Centre and by Health Data Research UK. BG has also received funding from: the Bennett Foundation, the Wellcome Trust, NIHR Oxford Biomedical Research Centre, NIHR Applied Research Collaboration Oxford and Thames Valley, the Mohn-Westlake Foundation; all Bennett Institute staff are supported by BG’s grants on this work. JM is partly funded by the National Institute for Health and Care Research Applied Research Collaboration West (NIHR ARC West).
VW also receives support from the MRC Integrative Epidemiology Unit at the University of Bristol (MC_UU_00011/4). SD is supported by a) the BHF Data Science Centre led by HDR UK (grant SP/19/3/34678), b) BigData@Heart Consortium, funded by the Innovative Medicines Initiative-2 Joint Undertaking under grant agreement 116074, c) the NIHR Biomedical Research Centre at University College London Hospital NHS Trust (UCLH BRC), d) a BHF Accelerator Award (AA/18/6/24223), e) the CVD-COVID-UK/COVID-IMPACT consortium and f) the Multimorbidity Mechanism and Therapeutic Research Collaborative (MMTRC, grant number MR/V033867/1). The views expressed are those of the authors and not necessarily those of the NIHR, NHS England, UK Health Security Agency (UKHSA) or the Department of Health and Social Care. Funders had no role in the study design, collection, analysis, and interpretation of data; in the writing of the report; and the decision to submit the article for publication.
Ethical approval
This study was approved by NHS London - Harrow Research Ethics Committee (IRAS reference: 310808, NHS REC reference: 22/LO/0105); and by the University of Plymouth Research Ethics and Integrity Panel (reference: 3193).
Provenance
Freely submitted; externally peer reviewed.
Acknowledgements
We are very grateful for all the support received from the TPP Technical Operations team throughout this work, and for generous assistance from the information governance and database teams at NHS England and the NHS England Transformation Directorate. We thank the CONVALESCENCE Study Long COVID PPIE group for their input and for sharing their experiences and expertise throughout the duration of the project.
Competing interests
Over the past five years BG has received research funding from the Laura and John Arnold Foundation, the NHS National Institute for Health Research (NIHR), the NIHR School of Primary Care Research, the NIHR Oxford Biomedical Research Centre, the Mohn-Westlake Foundation, NIHR Applied Research Collaboration Oxford and Thames Valley, the Wellcome Trust, the Good Thinking Foundation, Health Data Research UK (HDRUK), the Health Foundation, and the World Health Organization; he also receives personal income from speaking and writing for lay audiences on the misuse of science.
- Received June 11, 2024.
- Revision received December 17, 2024.
- Accepted April 30, 2025.
- Copyright © 2025, The Authors
This article is Open Access: CC BY license (https://creativecommons.org/licenses/by/4.0/)











