Health data, whether from electronic health records (EHRs) or from other sources such as smart wearables or the internet, are generated by complex processes. This growing complexity presents both opportunities and challenges for clinical prediction. The challenges include 'missingness' (missing values in a dataset), informative presence (what the presence of a particular observation says about a person), selection bias, and data heterogeneity. There is enormous potential to advance predictive models, particularly by combining conventional statistical analysis with the strengths of machine learning.

Explaining the science

Clinical prediction models (CPMs)

CPMs are tools that diagnose current outcomes or predict future outcomes for individuals, based on what is known about each individual and their environment. CPMs are widely used in clinical research and practice to help understand and improve the prognosis of a disease or health condition.

In public health, prediction models will help to target preventive interventions to subjects at relatively high risk of developing a disease. In clinical practice, prediction models may inform patients and their treating physicians on the probability of a diagnosis or a prognostic outcome. In research, prediction models may assist in the design and analysis of randomised trials.

Electronic health records (EHRs)

EHRs are an increasingly common source of data for clinical risk prediction. One of the inherent problems in EHRs is ‘informative observation’, which results from the way that records are created. Each observation in an EHR is the result of a patient engaging with health services, most likely because of ill health. The data collected within EHRs is therefore systematically biased towards sicker patients, as these patients are more likely to be in regular contact with healthcare services.

Informative observation presents various challenges in prediction modelling and is likely to result in biased predictions if not handled correctly. One promising approach is to develop longitudinal models with a linked ‘observation’ process model, which allows measurement or treatment times to be both sparse and heterogeneous across patients.


To enable 'what-if' queries, a potential outcomes (causal) framework can be used to incorporate counterfactual scenarios involving interventions, such as medication use, surgery, and lifestyle changes, into a clinical prediction model. Initial work has explored the use of marginal structural models to infer potential outcomes for patients at risk of cardiovascular events, where the intervention is a statin prescription. More advanced models are needed to support ‘what-if’ prediction modelling with messy observational data and, furthermore, to enable dynamic treatment allocation.
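As a rough illustration of the weighting idea behind marginal structural models, the sketch below estimates risk under 'treat everyone' and 'treat no one' by inverse probability weighting. It is a deliberately simplified, single-time-point version: a full marginal structural model handles treatments and confounders that change over time. The function name `ipw_risks`, the single binary confounder, and the toy cohort are all assumptions made for illustration and are not taken from the project.

```python
from collections import defaultdict


def ipw_risks(records):
    """Estimate risk under 'treat all' (key 1) and 'treat none' (key 0).

    records: list of (confounder, treated, outcome) tuples, where treated
    and outcome are 0 or 1. The propensity P(treated | confounder) is
    estimated as the treated proportion within each confounder stratum.
    """
    n = defaultdict(int)
    n_treated = defaultdict(int)
    for c, a, _ in records:
        n[c] += 1
        n_treated[c] += a
    propensity = {c: n_treated[c] / n[c] for c in n}

    weighted_events = {0: 0.0, 1: 0.0}
    weight_total = {0: 0.0, 1: 0.0}
    for c, a, y in records:
        # Probability of the treatment this patient actually received.
        p_received = propensity[c] if a == 1 else 1 - propensity[c]
        w = 1 / p_received
        weighted_events[a] += w * y
        weight_total[a] += w
    # Normalised weighted event rate in each treatment arm.
    return {a: weighted_events[a] / weight_total[a] for a in (0, 1)}


# Hypothetical cohort of (high_risk, statin, event) triples, invented for
# illustration: high-risk patients are more likely to receive a statin.
cohort = (
    [(1, 1, 1)] * 3 + [(1, 1, 0)] * 3 + [(1, 0, 1)] * 2
    + [(0, 1, 0)] * 2 + [(0, 0, 1)] * 3 + [(0, 0, 0)] * 3
)
risks = ipw_risks(cohort)

# Unweighted risk among the treated, for comparison with risks[1].
treated_outcomes = [y for _, a, y in cohort if a == 1]
naive_treated_risk = sum(treated_outcomes) / len(treated_outcomes)
```

In this toy cohort the weighted estimates differ from the unweighted ones because treatment was more common among high-risk patients. The weighting only removes bias from confounders that are measured and included in the propensity estimate.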

Project aims

This project will focus on three key issues.

Informative presence

Routine health data is subject to 'informative presence', whereby the presence of a particular observation (e.g. a blood test) is informative about a person, independently of the actual result of the test. For example, it gives information about an individual’s tendency to engage with the healthcare system, and/or about a clinician’s prior beliefs about a patient’s condition that led them to order particular tests.

This is challenging to model because it essentially corresponds to a ‘missing not at random’ scenario, in which whether a value is recorded depends on information that is itself unobserved. There are, however, opportunities to exploit the way in which patients interact with health services to improve predictive performance, by drawing information from the frequency and timing of the data collected within EHRs.
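One simple way to draw on frequency and timing is to turn the dates of observations into predictors, ignoring the recorded values entirely. The sketch below does this for a single patient. The function name `presence_features`, the one-year look-back window, and the example dates are hypothetical choices for illustration, not the project's method.

```python
from datetime import date


def presence_features(test_dates, index_date, window_days=365):
    """Summarise when tests were ordered, ignoring their results.

    Only tests in the look-back window ending at index_date are counted.
    """
    days_ago = [
        (index_date - d).days
        for d in test_dates
        if 0 <= (index_date - d).days <= window_days
    ]
    return {
        "n_tests": len(days_ago),            # how often the patient was tested
        "any_test": int(len(days_ago) > 0),  # presence indicator
        "days_since_last": min(days_ago) if days_ago else None,
    }


# Hypothetical patient with three blood tests, two within the last year.
example = presence_features(
    [date(2017, 3, 1), date(2018, 6, 15), date(2018, 11, 20)],
    index_date=date(2019, 1, 1),
)
```

Features like these could then be entered into a prediction model alongside the test results themselves, so that the model can learn from the pattern of contact as well as from the measured values.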

Incorporating counterfactuals to enable ‘what-if’ queries

There is often a division between the objectives of predictive modelling and those of causal inference. This project hypothesises that the interplay between these two objectives is essential to producing usable risk prediction models. A key limitation of existing risk prediction models is that they model prognostic risk only and do not allow consideration of ‘what-if’ scenarios. A potential outcomes (causal) framework allows for these considerations: through principled modelling of the underlying causal structure, it is possible to infer risk under different intervention scenarios, both at individual patient level and at population level.

Dealing with systematic bias and heterogeneity

There is potential to improve the accuracy and generalisability of risk prediction models by accounting for heterogeneity (differences) from various sources, such as variable data quality between clinical sites, systematic differences in important risk factors not included in the models, and systematic differences in coding practices. Because of this heterogeneity, it is unclear how well risk prediction models generalise to other settings and to other EHR software systems (which store data differently). The objective of this part of the work will be to develop and implement methods that adjust for this heterogeneity in risk prediction models.
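Before adjusting for heterogeneity, a common first check is whether an existing model is miscalibrated differently at different sites. The sketch below computes an observed-to-expected (O/E) event ratio per site. It is a diagnostic under simple assumptions, not an adjustment method, and the function name, site labels, and toy data are hypothetical.

```python
from collections import defaultdict


def observed_expected_by_site(rows):
    """Return the observed/expected event ratio for each site.

    rows: iterable of (site, predicted_risk, outcome) with outcome 0 or 1.
    A ratio near 1 suggests the model is well calibrated on average at that
    site; a ratio above 1 means more events occurred than were predicted.
    """
    observed = defaultdict(float)
    expected = defaultdict(float)
    for site, predicted_risk, outcome in rows:
        observed[site] += outcome
        expected[site] += predicted_risk
    return {site: observed[site] / expected[site] for site in observed}


# Hypothetical data: the same predicted risk at two sites, different event rates.
rows = (
    [("site_a", 0.2, 1)] + [("site_a", 0.2, 0)] * 4
    + [("site_b", 0.2, 1)] * 2 + [("site_b", 0.2, 0)] * 3
)
ratios = observed_expected_by_site(rows)
```

Large differences in these ratios between sites would indicate that a single model does not transfer equally well everywhere, for example because of differences in coding practices or unmeasured risk factors.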


Case study: Risk prediction models for prevention and early intervention in Alzheimer’s disease (AD) and other dementias, and for the identification of high-risk groups

Alzheimer’s disease (AD), the most common form of dementia, may be prevented using interventions based on known risk factors such as diet, exercise and other lifestyle factors. Identifying individuals in the community at higher risk using risk prediction models is a very promising approach that has been successfully used in other chronic conditions. Developing and implementing such a model for AD, aimed at helping to educate individuals on appropriate and effective healthy ageing, therefore represents an important approach.

Furthermore, there are great opportunities for research into the radiological assessment of early disease using the imaging data collected on the 100,000 UK Biobank participants, together with the identification of genetic profiles of higher-risk populations.

Recent updates

March 2019

  • Literature reviews on methods for handling informative presence and for counterfactual reasoning in risk prediction models are in progress.

