When we use the NHS, data is collected about our health and the treatments we receive. It is widely acknowledged that this data, suitably anonymised and combined with AI techniques, has the potential to transform both the efficiency and the quality of health care. However, this vision remains far off: at present the data is both incomplete and very difficult to use. Working with practical examples, this project will provide a clearer understanding of the problems, some initial solutions and a road map for further work.

Explaining the science

Data is stored in databases, commonly with a relational structure. In principle, the relational structure of the data should tell users about the meaning of the data. However, clinical data is entered using configurable form-based systems, with the clinical user able to select which fields to complete. This results in a very general relational structure with correspondingly less information about the meaning of the data. Moreover, the actual data entered differs between clinical data systems and from one user to another.
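To make the point about a "very general relational structure" concrete, the following is a minimal sketch, assuming a generic table that stores one row per completed form field. The table name form_entry, its columns and all of the records are invented for illustration and are not taken from any real clinical system.

```python
import sqlite3


def build_generic_store():
    """Create an in-memory database with one generic table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    # Every field of every form lands in the same four columns, so the schema
    # itself says nothing about what a blood pressure or a prescription is.
    conn.execute(
        "CREATE TABLE form_entry ("
        " patient_id INTEGER, form_name TEXT, field_name TEXT, value TEXT)"
    )
    rows = [
        # Invented records: the two users completed different fields.
        (1, "consultation", "systolic_bp", "142"),
        (1, "consultation", "prescription", "drug_a"),
        (2, "consultation", "prescription", "drug_a"),
    ]
    conn.executemany("INSERT INTO form_entry VALUES (?, ?, ?, ?)", rows)
    return conn


def fields_completed(conn, patient_id):
    """Return the sorted field names that were filled in for one patient."""
    cur = conn.execute(
        "SELECT field_name FROM form_entry"
        " WHERE patient_id = ? ORDER BY field_name",
        (patient_id,),
    )
    return [row[0] for row in cur.fetchall()]


conn = build_generic_store()
print(fields_completed(conn, 1))
print(fields_completed(conn, 2))
```

In this sketch the meaning of each value lives in the field_name strings rather than in the table structure, and the set of fields present varies from one record to the next, which is the situation described above.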

The primary solution attempting to address these issues is to use medical ontologies (Read codes, SNOMED-CT, ICD and more) to tag data fields. Despite being large and complex, these systems do not provide very much information about the relationships between data fields (such as the reason for a prescription). Importantly, these systems do allow data from different systems to be linked, but much more expressive power is needed, notably if other data sources (such as free text) are to be included. These other sources are vital for supervised learning: at present, data typically says little about the outcome of treatments.

The data potentially available to researchers is several steps removed from the actual data entered. This is necessary, for example to ensure privacy is respected, but the lack of description of all stages of the process (the data entry, its transformation and linking) creates ambiguity that could undermine the usefulness of any analysis or AI. It also presents a practical barrier: researchers, who must operate at some distance from the raw data, cannot specify the data selection or models in detail without improved descriptions.

Project aims

The increasing use and capability of Electronic Health Record (EHR) systems has made available large collections of data about patients’ use of different health services, treatments and prescriptions. This data has many potential uses (discovering causes, optimising health delivery, choosing treatment and more). However, there are challenges to overcome before these benefits can be achieved.

The overall objective of the project is to lay the foundations for a transformative approach to patient-linked health data, making it accessible for both medical and data science researchers to fully exploit.

There are a number of related challenges:

  • Understanding the data and its potential use - The data arise from the operation of health services, so the understanding of the data contents is embedded in the health community and not accessible to the wider AI and machine learning communities
  • Knowledge elicitation and modelling - Achieving the full potential from the data requires knowledge of health care processes, so these processes may need to be modelled for data analysis
  • Statistical modelling - Existing data analysis approaches often extract a ‘flat’ dataset from the underlying ‘relational’ structure of the data. New techniques avoiding this (e.g. statistical relational learning) might allow new types of queries but their practical applicability is unknown
  • Efficient and acceptable data handling - Existing studies using data from EHR systems require extensive ‘data wrangling’ to extract and cleanse a usable dataset. This work is largely manual and very time consuming: can this be improved?
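
The ‘flat’ extraction mentioned under statistical modelling can be illustrated with a minimal sketch, assuming two linked collections of records (patients and their prescriptions). All field names, codes and values below are invented placeholders, and the flatten function is a hypothetical helper rather than part of any existing tool.

```python
from collections import defaultdict

# Invented example records; drug and diagnosis codes are placeholders.
patients = [
    {"patient_id": 1, "age": 67},
    {"patient_id": 2, "age": 45},
]
prescriptions = [
    {"patient_id": 1, "drug": "A", "diagnosis_code": "X"},
    {"patient_id": 1, "drug": "B", "diagnosis_code": "Y"},
    {"patient_id": 2, "drug": "A", "diagnosis_code": "Y"},
]


def flatten(patients, prescriptions):
    """Collapse the one-to-many prescription records into one row per patient."""
    by_patient = defaultdict(list)
    for item in prescriptions:
        by_patient[item["patient_id"]].append(item)

    rows = []
    for patient in patients:
        items = by_patient[patient["patient_id"]]
        rows.append({
            "patient_id": patient["patient_id"],
            "age": patient["age"],
            "n_prescriptions": len(items),
            # Drugs and diagnoses are summarised separately, so the pairing
            # between a drug and the diagnosis it was given for is lost.
            "drugs": sorted({i["drug"] for i in items}),
            "diagnoses": sorted({i["diagnosis_code"] for i in items}),
        })
    return rows


flat = flatten(patients, prescriptions)
for row in flat:
    print(row)
```

In the flat row for patient 1, both drugs and both diagnoses appear, but the row no longer records which drug went with which diagnosis. Queries that depend on such links are the kind that approaches working directly on the relational structure might support.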



This project is primarily relevant to the health domain and involves working with two separate groups who have developed clinical datasets linking data from different services. One group focuses on optimising resources allocated to different health services and is currently looking at the impact of funding of mental health services on the use of other services. The other group focuses on population health issues, such as the implementation of guidelines for screening and prevention. Although the project will not work directly on either of these topics, they show the areas in which solutions to data handling issues will be applied.

Very similar issues with data apply in many industries managing and maintaining complex assets (such as railways). In the longer term, it is hoped that these industries could benefit too.


Contact info

William Marsh, [email protected]