Intended for Healthcare Professionals
BJGP Open
Research

UK research data resources based on primary care electronic health records: review and summary for potential users

Lara Edwards, James Pickett, Darren M Ashcroft, Hajira Dambha-Miller, Azeem Majeed, Christian Mallen, Irene Petersen, Nadeem Qureshi, Tjeerd van Staa, Gary Abel, Chris Carvalho, Rachel Denholm, Evangelos Kontopantelis, Ayoyemi Macaulay and John Macleod
BJGP Open 2023; 7 (3): BJGPO.2023.0057. DOI: https://doi.org/10.3399/BJGPO.2023.0057
Lara Edwards
1 Health Data Research UK (HDR UK), London, UK
James Pickett
1 Health Data Research UK (HDR UK), London, UK
Darren M Ashcroft
2 Centre for Pharmacoepidemiology and Drug Safety, NIHR Greater Manchester Patient Safety Translational Research Centre, School of Health Sciences, Faculty of Biology, Medicine and Health, The University of Manchester, Manchester, UK
Hajira Dambha-Miller
3 Primary Care Research Centre, University of Southampton, Southampton, UK
Azeem Majeed
4 Primary Care and Public Health, Imperial College London, London, UK
Christian Mallen
5 Institute for Global Health, Keele University, Keele, UK
Irene Petersen
6 Department of Primary Care & Population Health, Institute of Epidemiology & Health, University College London, London, UK
Nadeem Qureshi
7 Centre for Academic Primary Care, University of Nottingham, Nottingham, UK
Tjeerd van Staa
8 Health eResearch Centre, University of Manchester, Manchester, UK
Gary Abel
9 Department of Health and Community Sciences (Medical School), Faculty of Health and Life Sciences, University of Exeter, Exeter, UK
Chris Carvalho
10 Clinical Effectiveness Group, Queen Mary University of London, London, UK
Rachel Denholm
11 Population Health Sciences, Bristol Medical School, University of Bristol, Bristol, UK
12 Centre for Academic Primary Care, University of Bristol, Bristol, UK
13 NIHR Bristol Biomedical Research Centre, Bristol, UK
14 Health Data Research UK South-West, Bristol, UK
15 NIHR Applied Research Collaboration (ARC) West, Bristol, UK
Evangelos Kontopantelis
16 Division of Informatics, Imaging and Data Sciences, University of Manchester, Manchester, UK
Ayoyemi Macaulay
1 Health Data Research UK (HDR UK), London, UK
John Macleod
11 Population Health Sciences, Bristol Medical School, University of Bristol, Bristol, UK
15 NIHR Applied Research Collaboration (ARC) West, Bristol, UK
  • For correspondence: John.Macleod{at}bristol.ac.uk

Abstract

Background The range and scope of electronic health record (EHR) data assets in the UK have recently increased, mainly in response to the COVID-19 pandemic. Summarising and comparing the large primary care resources will help researchers to choose the data resources most suited to their needs.

Aim To describe the current landscape of UK EHR databases and considerations of access and use of these resources relevant to researchers.

Design & setting Narrative review of EHR databases in the UK.

Method Information was collected from the Health Data Research Innovation Gateway, publicly available websites and other published data, and from key informants. The eligibility criteria were population-based open-access databases sampling EHRs across the whole population of one or more countries in the UK. Published database characteristics were extracted and summarised, and these were corroborated with resource providers. Results were synthesised narratively.

Results Nine large national primary care EHR data resources were identified and summarised. These resources are enhanced by linkage to other administrative data to a varying extent. Resources are mainly intended to support observational research, although some can support experimental studies. There is considerable overlap of populations covered. While all resources are accessible to bona fide researchers, access mechanisms, costs, timescales, and other considerations vary across databases.

Conclusion Researchers are currently able to access primary care EHR data from several sources. Choice of data resource is likely to be driven by project needs and access considerations. The landscape of data resources based on primary care EHRs in the UK continues to evolve.

  • electronic health records
  • primary care databases
  • population level linked data
  • population
  • primary health care

How this fits in

This narrative review provides an update on the continually evolving UK landscape of primary care EHR-linked databases available for research. Similar reviews have been conducted previously; however, with the emergence of newer linked data assets, this update gives a current view of these assets, detailing the scale, scope, and data sources of each, how researchers can access them, their costing models, and the training and accreditation required.

Introduction

Information held in EHRs is a valuable research resource, particularly where the source data systems have near universal, longitudinal population coverage, as is the case with UK primary care EHRs. Because the main purpose of EHRs is clinical management, great care in interpretation is needed when the data are used for research. Many issues of data completeness and quality, alongside the biases inherent in observational epidemiology, attach to analyses based on them; these are discussed below. This notwithstanding, EHRs have supported observational research for several decades.1,2

The range and scope of EHR-based data assets in the UK have recently increased, primarily in response to the COVID-19 pandemic. Newer data assets may be less familiar to researchers, making it difficult for them to choose the data resource most appropriate for their intended study. This review aimed to summarise the current major primary care EHR data resources in the UK, alongside the key characteristics of these that are relevant to potential users. It is hoped this information will help researchers choose the data resource most suited to their needs.

The review focused exclusively on UK EHR resources. Global resources, their development, and their uses are discussed elsewhere.3,4 Similarly, discussion of important issues, such as controversy around data sharing and patient perspectives, is beyond the scope of this article but these are discussed elsewhere.5

Historical context

In the UK, primary medical care moved progressively from paper-based to electronic records from the late 1980s. Record-keeping in UK primary care is now almost exclusively electronic.6 A variety of commercially supplied clinical software systems are used in primary care. Currently, the following three vendors dominate the UK market: EMIS Health; SystmOne (provided by The Phoenix Partnership; TPP) and Vision (Cegedim Healthcare Solutions). Partnerships between practices, system vendors, academics, and for-profit companies subsequently made subsets of electronic primary care health records available for research.

These partnerships led to the formation of the General Practice Research Database now known as the Clinical Practice Research Datalink (CPRD),7,8 QResearch,9 The Health Improvement Network (THIN)10 database, and Optimum Patient Care Research Database (OPCRD).11 The Royal College of General Practitioners (RCGP) has supported practice-based infectious disease surveillance since 1957.12 This system is now electronic and supports a broader Research and Surveillance Centre (RCGP RSC).13 More recently, other partnerships have arisen (see below).

The population coverage of each database reflects the popularity and geographical reach of the parent systems6 as well as the practices that opt into them. EMIS Health is the most common provider to practices across the UK, and EMIS Health and TPP cover more than 90% of practices in England.

Initially, the major focus of EHR research was pharmaco-epidemiology but their research use now encompasses most aspects of observational epidemiology, including risk prediction,14–17 health services research,18–20 and clinical trials.21 This expansion has been facilitated by enhancement of EHR resources through linkage to other administrative data and to data collected in research studies and clinical audit.

Whole population coverage

Statistical power in EHR research reflects sample size, whereas external validity depends on how representative the sample is of the target population. Whole population coverage of an EHR database in a single nation has proved difficult to achieve for technical, socio-political, and legal reasons. Under the EU General Data Protection Regulation (GDPR) and the UK Data Protection Act (DPA) 2018, the legal data controller of primary care data is the GP practice, which is responsible for the legal use of data and can decide whether data from practice patients may be processed for research purposes.22

Pre-COVID-19, Wales was the only UK nation to achieve near full-population coverage in a primary care EHR research database. The Welsh Longitudinal General Practice Dataset (WLGP),23 hosted by SAIL Databank,24,25 provides coverage of 83% of the population of Wales and 80% of Welsh GP practices. It is linked to other routine health and administrative datasets.26

COVID-19 pandemic response

The COVID-19 pandemic created a situation where observational research based on EHR data at scale became a public health and policy priority, to identify risk factors for and sequelae of infection, and to investigate the effects of treatment and prevention measures. To enable this, in March 2020 the Secretary of State for Health issued a Notice under Regulation 3(4) of the Health Service (Control of Patient Information) Regulations 2002 (COPI), covering England and Wales, which directed general practices to provide primary care information deemed essential for the COVID-19 response.27

New EHR-based UK data resources have been enabled by the pandemic response, including a minimised primary care data extract, GP Data for Pandemic Planning and Research (GDPPR). A partnership between Health Data Research UK (HDR UK), NHS Digital, and the British Heart Foundation (BHF) formed the BHF Data Science Centre-led CVD-COVID-UK/COVID-IMPACT Consortium.28 This project resulted in the NHS Digital Trusted Research Environment (TRE), now the NHS England Secure Data Environment (SDE), and enabled research relevant to COVID-19 with linkage to other datasets held by NHS England. The Consortium also includes other national TREs: SAIL Databank and the Scottish National Data Safe Haven.29 OpenSAFELY is a new TRE project created in collaboration across the Bennett Institute at the University of Oxford, the EHR Research Group at the London School of Hygiene and Tropical Medicine (LSHTM), the EHR suppliers TPP SystmOne and EMIS Health, and NHS England. The open source OpenSAFELY software tools are implemented inside the data centres of TPP and EMIS to enable secure and federated analysis of all structured GP data without the need for raw data to be extracted and disseminated.30,31

Other UK nations established large EHR-based data resources to support COVID-19-related research, including the Early Pandemic Evaluation and Enhanced Surveillance of COVID-19 (EAVE II) database in Scotland.32

Given this evolving landscape, the review aims to provide a summary and comparison of the current UK-based large primary care EHR data resources, as a guide to researchers.

Method

The Health Data Research Innovation Gateway33 was searched with the term 'primary care'. This search was supplemented with information from key informants in the National Institute for Health and Care Research (NIHR) School for Primary Care Research34 and the wider primary care research community. Consideration was restricted to datasets openly accessible to external researchers.

Sources of primary care data for research purposes

National resources

Nine data resources were identified, which are described in Supplementary Tables S1 and S2. Each includes patients resident in one or more of the UK nations. The summary characteristics tabulated were obtained via publicly accessible websites and published data. Data providers were contacted to confirm accuracy and completeness of information.

Regional data sources

Some UK regions have developed local EHR databases, with linkage to primary care data, to support care delivery and planning; NHS business intelligence; and research. Some of these resources are accessible to researchers, although this has been generally restricted, to date, to local analysts. Because of this, these resources are not described in detail. Examples include a regional network of TREs in Scotland35 such as DataLoch;36 and others across England such as Combined Intelligence for Population Health (CIPHA),37 HDR UK hub Discover-NOW,38 the Bristol, North Somerset and South Gloucestershire systemwide dataset,39 and the Connected Bradford database.40 Regional EHR data will eventually become more accessible for research through the current NHS England Data for Research & Development (R&D) Programme to develop an interoperable network of NHS-owned subnational SDEs across England.41

Discussion

National primary care EHR data resources

Researcher-relevant characteristics of the nine data resources identified are described below.

1. Scope, scale, and data source

CPRD, QResearch, and THIN work with software suppliers to aggregate EHRs from practices that opt in. RCGP RSC and OPCRD hold agreements at the practice level to provide data and create resources that include records from different EHR vendors. Individuals can opt out of data sharing by contacting their practice.

OpenSAFELY provides secure access to full de-identified EHR records held by TPP and EMIS (>99% of patients in England, combined),31 and enables consistent, federated analysis across the two. A GDPPR extract is available from NHS England Data Access Request Service,42 in addition to access via the NHS England SDE. Use of these resources is currently enabled by COPI transitionary provision. General use beyond the pandemic is under negotiation.

These large data resources include records from between 3 and 70 million individuals with varying person follow-up time (see Supplementary Tables S1 and S2). Reported size of the data resource may include historic patients now deceased or embarked (that is, patients who have left the geographical catchment area of the resource) such that the number of live, registered patients may be lower than total numbers reported. For example, as of November 2022 CPRD reports 60 million patients, of which 18 million are currently registered active patients, with at least 20 years of follow-up for 25% of the patients.43 There is substantial overlap of patients represented between data resources.
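The gap between a headline patient count and the number of currently registered patients can be made concrete with simple arithmetic. The sketch below uses only the CPRD figures quoted above; the function name is illustrative and not part of any data provider's tooling.

```python
def active_fraction(total_patients: int, active_patients: int) -> float:
    """Share of all reported patient records that are currently registered."""
    if total_patients <= 0:
        raise ValueError("total_patients must be positive")
    if not 0 <= active_patients <= total_patients:
        raise ValueError("active_patients must lie between 0 and total_patients")
    return active_patients / total_patients


# CPRD figures as quoted in the text (November 2022).
cprd_total = 60_000_000
cprd_active = 18_000_000

share = active_fraction(cprd_total, cprd_active)
print(f"{share:.0%} of reported CPRD patients are currently registered")
```

On these figures, fewer than one in three reported patients is currently registered, which matters when estimating the sample available for a prospective or current-practice study.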

All resources identified have been enhanced through linkage to other administrative data to a varying extent. Typically linkage is to secondary care records, death records, cancer registrations, and census-derived sociodemographic measures. More recently, linkage has been expanded to other datasets such as COVID-19 testing, immunisation, and intensive care.

Users typically must demonstrate a level of skills and experience appropriate to their intended research before gaining access, and may have to evidence completion of specific training, in addition to information governance and data security training.

In addition to supporting observational research, some resources offer extra research services; for example, to facilitate data-enabled trials.21

Refer to Supplementary Table S1: Scope, scale and restrictions on use of UK primary care data resources, for a detailed description of the scope, scale, and data sources of data assets.

Mechanisms of data access

Access models

Across these resources, data are accessed either through provision of a study-specific extract with assurances around security, appropriate handling, and data deletion or via a TRE or SDE. In both cases, the process typically involves several steps.

Steps and timescales

Typically, potential users are required to submit a proposal to an oversight committee. 'Access times' often describe time to this approval rather than time to data access, which can be misleading. Time to data access depends on multiple considerations that can incur considerable delays; these include the following:

  • Ethical and other approvals: access to some resources requires prior ethical and R&D approvals to be in place. Some data resources have pre-approval from research ethics committees for particular types of research. Complex linked data applications and non-observational studies are more likely to require prior ethical approvals.

  • Accreditation: this may be at the institutional or individual level. Some resources require organisations to have the NHS Data Security and Protection Toolkit44 in place, in line with GDPR. Individual users may be required to complete specific training such as Safe Researcher Training offered by the UK Data Service. Some resources provide training for main users of the data, with the expectation that knowledge is passed on within the user institute. Some resources do not specify particular training requirements but expect applicants to evidence specific competencies.

  • Application process: beyond completion of an application form, the application process may necessitate engagement with the data provider to discuss the proposed research; for example, to estimate feasibility and statistical power. The more elaborate this process, the greater time required.

  • Linked data: this typically requires additional permissions, causing delays particularly when the linkages sought are new rather than established. New linkages, where available, will generally incur greater costs and delays.

  • Data preparation and processing: depending on the data resource and project, preparation of a suitable extract or pre-processing of data made available through a TRE or SDE may incur further delays.

2. Funding models for access

Data are made accessible through the following three main funding models: (a) an annual licence (some negotiated at an organisation level); (b) per project, which may include a base cost with additional charges representing resources in preparing bespoke or complex data requests or linkages; (c) on an academic collaboration basis.

Refer to Supplementary Table S2: Access processes and requirements for primary care data resources, for detailed description of data access mechanisms and processes across data assets in scope.

3. Analysis of primary care EHRs

Once data have been accessed as above, several considerations apply to the analysis process.

Data wrangling and curation

Data wrangling and curation describe the processes of preparing the data before they can be analysed. The readiness of data for analysis varies depending on the data resource. Resources generally provide some form of data dictionary or data notes describing metadata and provenance of the data. Clinical and prescription data are commonly provided in a structured clinical vocabulary agnostic to the source system. Common formats include SNOMED CT (Systematized Nomenclature of Medicine Clinical Terms), Read Codes, ICD-10 (International Classification of Diseases, Tenth Revision) codes, as well as local codes (which may be less interoperable). Sometimes a combination of these is used.

The extent of curation needed varies with study design, but may include manipulating tables, deriving variables, linking data sources, and identifying study cohorts. Where several studies require similar manipulation of data, reusing common code is helpful. Some resources require users to share code, using repositories such as GitHub.45 OpenSAFELY requires all code to be posted on GitHub before execution and publishes links to all executed code automatically at jobs.opensafely.org; analysts use standardised OpenSAFELY dataset building tools, which are integrated with the codelist development and sharing tools at OpenCodeLists.org.46

The CVD-COVID-UK/COVID-IMPACT Consortium publishes protocols, code, and phenotype code lists via the HDR UK Gateway and GitHub.

Another common step required of analysts is to create EHR phenotypes that describe clinical concepts. Phenotype libraries and other resources to support standardisation and reproducibility have also been developed.47–51 Publishers may expect authors to provide code lists, algorithms, and programme files as supplements in published articles.
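As an illustration of how a code list turns coded events into an EHR phenotype, the following minimal sketch finds each patient's earliest event matching a code list. It is not the method of any resource described here: the record layout, the function name, and the codes are all hypothetical placeholders (they are not real SNOMED CT or Read codes), and real phenotype definitions usually involve further rules.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, FrozenSet, Iterable


@dataclass(frozen=True)
class CodedEvent:
    """One coded entry in a patient record (hypothetical layout)."""
    patient_id: str
    code: str          # clinical code as supplied by the data resource
    event_date: date


def first_matching_event(
    events: Iterable[CodedEvent], codelist: FrozenSet[str]
) -> Dict[str, date]:
    """Map each patient to the earliest date on which a listed code appears."""
    first_seen: Dict[str, date] = {}
    for ev in events:
        if ev.code not in codelist:
            continue  # event is not part of the phenotype
        previous = first_seen.get(ev.patient_id)
        if previous is None or ev.event_date < previous:
            first_seen[ev.patient_id] = ev.event_date
    return first_seen


# Placeholder codes for illustration only; NOT real clinical codes.
HYPOTHETICAL_CODELIST = frozenset({"CODE_A1", "CODE_A2"})

events = [
    CodedEvent("p1", "CODE_A2", date(2019, 5, 1)),
    CodedEvent("p1", "CODE_A1", date(2018, 3, 9)),
    CodedEvent("p2", "CODE_X9", date(2020, 1, 2)),  # not in the code list
]

cohort = first_matching_event(events, HYPOTHETICAL_CODELIST)
print(cohort)
```

Keeping the code list as a separate, named object mirrors the practice of publishing code lists alongside analysis code, so that the same definition can be reused and inspected across studies.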

Using a Trusted Research Environment (TRE) or Secure Data Environment (SDE)

Several data resources provide access via a TRE or SDE. Models vary in several ways, including the following:

  • the prepackaged tools and software available in the analytical environment;

  • the ability to import a user’s own code or software;

  • availability of code for common data management tasks;

  • the degree to which previous users’ data curation, variable derivation, and documentation is available to new users;

  • threshold of small number suppression to protect against risk of patient reidentification;

  • the level of user support available;

  • ease of use;

  • cost of use.
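Small number suppression, one of the points listed above, can be sketched as follows. This is a simplified illustration under stated assumptions: the threshold of 5 in the example is hypothetical (each TRE or SDE sets its own), leaving zero counts unsuppressed is a policy choice that also varies, and the function name is illustrative.

```python
from typing import Dict, Union


def suppress_small_counts(
    counts: Dict[str, int], threshold: int
) -> Dict[str, Union[int, str]]:
    """Replace non-zero counts below `threshold` with a marker such as '<5'."""
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    marker = f"<{threshold}"
    return {
        group: (marker if 0 < n < threshold else n)
        for group, n in counts.items()
    }


# Hypothetical output table prepared for release from a secure environment.
table = {"age 18-39": 3, "age 40-64": 12, "age 65+": 0}
released = suppress_small_counts(table, threshold=5)
print(released)
```

The sketch handles primary suppression only. It does not address secondary disclosure, for example when a suppressed cell can be recovered by subtracting the released cells from a published total.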

Some models allow data curation, novel variable derivation, and associated documentation to be stored beyond the life of a single project or analysis and made available to future users, increasing the value of the resource. A UK Health Data Research Alliance White Paper52 has set out guidelines and principles for TRE and SDE good practice structured around the 'Five Safes' framework,53 and the Goldacre Review recommended use of TREs and SDEs as the norm for analysis of health data.54

Methodological and other considerations for working with primary care EHR data resources

Clinical context

Primary care EHRs are created primarily to support continuity in clinical care, as a medico-legal document, and to support payment systems. Their use in research needs to take into consideration why and how the data were collected. Because of this, experience of creating EHRs can help in guiding and interpreting analysis of them. Data recording and coding are influenced by many considerations. Understanding these, how they influence the content of the record, and the potential for bias to be introduced is essential to making valid inferences.55

Analytic and epidemiological considerations

Working with these data requires considerable epidemiological and analytical experience, including knowledge of common analytical tools and experience in handling large data resources. Access may be contingent on evidencing these competencies.

Population-level data also have characteristics that can make them challenging to use.56 Missing data and misclassification are key issues. Data are unlikely to be missing completely at random. Multiple imputation can be used to address this; however, it may introduce additional bias if used inappropriately.57 Sometimes missingness can be addressed through linkage to other data, facilitating the assessment of the extent of potential bias.57 Research questions must be evaluated for feasibility against the quality of the available data. For example, recording and management of many chronic conditions, risk markers, and other aspects of care have been incentivised in UK primary care, potentially introducing variations in data quality between information whose recording is or is not incentivised.58
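When multiple imputation is used, the estimates from each imputed dataset have to be combined. The text does not specify how; a common approach is Rubin's rules, sketched below. The function name and the example numbers are hypothetical, and the sketch covers only the pooling step, not the imputation model itself, which is where inappropriate use is most likely to introduce bias.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def pool_rubin(
    estimates: Sequence[float], variances: Sequence[float]
) -> Tuple[float, float]:
    """Pool per-imputation estimates with Rubin's rules.

    Returns the pooled estimate and its standard error.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    pooled = mean(estimates)            # average of the m estimates
    within = mean(variances)            # average within-imputation variance
    between = variance(estimates)       # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Hypothetical log hazard ratios and their variances from three imputed datasets.
estimates = [0.10, 0.14, 0.12]
variances = [0.0004, 0.0004, 0.0004]

pooled_estimate, pooled_se = pool_rubin(estimates, variances)
print(pooled_estimate, pooled_se)
```

The between-imputation term is what carries the uncertainty due to missing data into the final standard error; analysing a single imputed dataset would omit it.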

Other epidemiological considerations are those attached to the difficulty of making valid causal inference in observational data where exposure allocation is non-random. The main issue is confounding by indication, where risk of exposure is associated with risk of outcome through a pathway independent of exposure.59 Collider bias60 and immortal time bias61 are also frequently important. The nature of causes, causal inference, and addressing bias attached to this endeavour have been discussed elsewhere, both in general terms62 and in the context of EHRs.63
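Immortal time bias typically arises when follow-up before a patient's first exposure is counted as exposed time. One common safeguard is to treat exposure as time-varying, so that person-time before the first exposure is classified as unexposed. The sketch below illustrates that allocation for a single patient; the function name and dates are hypothetical, and it is not drawn from the cited references.

```python
from datetime import date
from typing import Optional, Tuple


def split_person_days(
    entry: date, end: date, first_exposure: Optional[date]
) -> Tuple[int, int]:
    """Return (unexposed_days, exposed_days) between cohort entry and end of follow-up.

    Time before the first exposure is counted as unexposed rather than exposed.
    """
    if end < entry:
        raise ValueError("end of follow-up precedes cohort entry")
    total_days = (end - entry).days
    if first_exposure is None or first_exposure >= end:
        return total_days, 0            # never exposed during follow-up
    exposure_start = max(first_exposure, entry)
    exposed_days = (end - exposure_start).days
    return total_days - exposed_days, exposed_days


# Hypothetical patient: enters 1 Jan 2020, first prescription 1 Jul 2020,
# follow-up ends 1 Jan 2021.
unexposed, exposed = split_person_days(
    date(2020, 1, 1), date(2021, 1, 1), date(2020, 7, 1)
)
print(unexposed, exposed)
```

Classifying this patient as exposed from cohort entry would credit the exposed group with 182 days in which the outcome could not, by definition, have occurred while exposed.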

Future work and future developments

Models and mechanisms for accessing primary care EHRs, enhanced through linkage to other information, continue to evolve. This information is likely to include non-health administrative data, research data, patient-reported data, and data from patient-based and other sensors. Eventually this evolution may lead to near-whole population, real-time data from across the health and care system, linked to multimodal data from other sources being readily, securely, and acceptably available for analysis. Multiple biases will attach to these analyses and appreciation of their possible influence is important, particularly when analysis is genuinely intended to inform policy choices. Strategies to address these biases will also evolve. The broad term 'artificial intelligence' is currently applied to a variety of automated analytical approaches (including machine learning and deep learning) intended to make the extraction of useful inference from multimodal data more efficient and reliable.64,65 Linkage-enhanced data from health and care systems is likely to increasingly provide the substrate for such methods. Ultimately, this may lead to better understanding of the forces shaping human health and wellbeing, both in individuals and between social groups. This may support action to reduce inequities in these outcomes.

This article summarises major UK primary care data resources in terms of their strengths, weaknesses, and the opportunities they provide for researchers. Securing access to an appropriate dataset for research is often a complex transaction, for reasons described above. This article is intended to help researchers navigate that complexity. This is also a rapidly evolving landscape, shaped by multiple social, technical, and political considerations. In general, the trend is towards more streamlined, secure, and transparent access to better data, with the ambition that this will ultimately lead to health improvement for individuals and populations.

Notes

Funding

No specific funding was awarded to complete this work. However, LE and JP acknowledge funding from the Data and Connectivity National Core Study, led by Health Data Research UK in partnership with the Office for National Statistics and funded by UK Research and Innovation (MC_PC_20058). DMA is funded by the National Institute for Health and Care Research (NIHR) through the Greater Manchester Patient Safety Translational Research Centre (NIHR Greater Manchester PSTRC, Grant number: PSTRC-2016-003). CM is funded by the NIHR School for Primary Care Research and NIHR Applied Research Collaboration (ARC) West Midlands. GA is supported by the NIHR ARC South West Peninsula. AM is supported by the NIHR ARC NW London. JM is supported by the NIHR Health Research ARC West and the NIHR Bristol Biomedical Research Centre. RD is supported by the NIHR Bristol Biomedical Research Centre and Health Data Research UK South West. The views expressed in this publication are those of the author(s) and not necessarily those of the National Institute for Health and Care Research or the Department of Health and Social Care, or UK Research and Innovation. All stated funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Ethical approval

Not applicable for this review.

Provenance

Freely submitted; externally peer reviewed.

Acknowledgements

The authors would like to thank the representatives of the data custodians who validated the summary information in Supplementary Tables S1 and S2: Chris Orton, Ashley Akbari (WLGP, SAIL Databank), Rouven Priedon and John Nolan (CVD-COVID-UK/COVID-IMPACT), Kimberley Watson (GDPPR, NHS England), Catheryn Evans (CPRD), Pete Stokes, Brian MacKenna, Ben Goldacre (OpenSAFELY), Julia Hippisley-Cox, Rebekah Burrow (QResearch), Simon de Lusignan (RCGP RSC), Samir Dhalla (THIN database).

Competing interests

Hajira Dambha-Miller is the Editor-in-Chief of BJGP Open, but had no involvement in the peer review process or decision on this manuscript.

  • Received April 2, 2023.
  • Revision received June 12, 2023.
  • Accepted July 7, 2023.
  • Copyright © 2023, The Authors

This article is Open Access: CC BY license (https://creativecommons.org/licenses/by/4.0/)


BJGP Open
Vol. 7, Issue 3
September 2023

UK research data resources based on primary care electronic health records: review and summary for potential users
Lara Edwards, James Pickett, Darren M Ashcroft, Hajira Dambha-Miller, Azeem Majeed, Christian Mallen, Irene Petersen, Nadeem Qureshi, Tjeerd van Staa, Gary Abel, Chris Carvalho, Rachel Denholm, Evangelos Kontopantelis, Ayoyemi Macaulay, John Macleod
BJGP Open 2023; 7 (3): BJGPO.2023.0057. DOI: 10.3399/BJGPO.2023.0057

Keywords

  • electronic health records
  • primary care databases
  • population level linked data
  • population
  • Primary Health Care

BJGP Open is an editorially-independent publication of the Royal College of General Practitioners

© 2025 BJGP Open

Online ISSN: 2398-3795