Most disease risk prediction models built from electronic health records (EHR) use only a fraction of the available data, rely on clinical features defined a priori, and are created using survival analyses that depend heavily on clinical input to model nonlinearity. In this project, we will create new methods for extracting features from longitudinal EHR and evaluate supervised learning methods for creating interpretable disease prediction models. Accurate and interpretable risk prediction models can be translated into clinical care and lead to improved health and healthcare.

Explaining the science

When patients see a physician or are admitted to hospital, information on their symptoms, diagnoses, laboratory test results, and prescriptions is collected electronically. This information is stored in Electronic Health Records (EHR) and is a valuable resource for researchers and clinicians for improving health and healthcare. Risk prediction models are tools that doctors use to predict a patient's risk of having an adverse event, such as a heart attack. Although hundreds of data points may be recorded for each patient (e.g. a patient can have hundreds of blood pressure measurements over the years), the majority of risk prediction tools use only a single measurement and could potentially be improved by using more. This project will try out different approaches for learning from large amounts of information, known collectively as machine learning, and identify which method is best for creating risk prediction tools that clinicians can use to predict the risk of bleeding or heart disease. We will particularly focus on creating tools that are interpretable, so that they can be adopted more quickly and patients can benefit sooner.

Project aims

The overarching aim of the proposed research is to undertake novel methodological research across the machine learning pipeline, from data normalisation to feature engineering, model training and evaluation, and to comprehensively evaluate supervised learning algorithms for risk prediction in EHR. Chronic and acute medical conditions present different methodological challenges, so I will use two exemplar conditions for methods development: coronary artery disease (chronic) and bleeding (acute). The aims of the proposed work are to:
a) evaluate clinical data modelling and harmonisation algorithms for normalising structured EHR data, with a particular focus on integrating different data modalities such as genetic data;
b) develop and evaluate feature extraction approaches and define a novel temporal abstraction framework for incorporating longitudinal EHR data into machine learning methods. A particular focus of this work will be time-series data as input to recurrent neural network architectures, and different latent representations of EHR data such as word embeddings;
c) evaluate automated or semi-automated feature selection approaches for identifying and prioritising the most predictive features; and
d) compare supervised learning methods (e.g. neural networks, XGBoost and others) for creating risk prediction models from multi-modal EHR data. A particular focus of this work will be the interpretability of the developed models, as this is one of the main barriers to clinical adoption.
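
As a rough illustration of what a very simple temporal abstraction could look like, the sketch below collapses a patient's repeated measurements (for example, blood pressure readings) within a look-back window into a handful of summary features that a supervised learning method could take as input. It is not the framework proposed here: the function name `abstract_measurements`, the 365-day window, the chosen summary statistics, and the example readings are all hypothetical and chosen only for illustration.

```python
from datetime import date
from statistics import mean


def abstract_measurements(readings, index_date, window_days=365):
    """Summarise repeated (date, value) readings from the window before index_date.

    Hypothetical helper for illustration only. Returns None when no
    reading falls inside the look-back window.
    """
    # Keep readings taken on or before the index date and inside the window,
    # sorted chronologically.
    in_window = sorted(
        (d, v) for d, v in readings
        if 0 <= (index_date - d).days <= window_days
    )
    if not in_window:
        return None

    values = [v for _, v in in_window]
    first_day, first_value = in_window[0]
    last_day, last_value = in_window[-1]
    span_days = (last_day - first_day).days

    return {
        "count": len(values),
        "latest": last_value,
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
        # Crude trend: change between first and last reading, scaled to a year.
        "change_per_year": (
            (last_value - first_value) / span_days * 365 if span_days else 0.0
        ),
    }


# Made-up systolic blood pressure readings for one patient.
readings = [
    (date(2019, 1, 10), 150.0),  # falls outside the one-year window
    (date(2020, 3, 1), 140.0),
    (date(2020, 9, 1), 135.0),
    (date(2021, 1, 1), 130.0),
]
features = abstract_measurements(readings, index_date=date(2021, 1, 15))
print(features)
```

A model that uses only the most recent value would see just `latest`; the remaining fields are one possible way of carrying the measurement history forward as additional features.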


Applying supervised learning algorithms to EHR data will result in robust risk stratification and classification models that accurately reflect the complex and longitudinal nature of disease (e.g. multiple morbidities) and provide insight into the underlying prognostic factors. Provided that they are sufficiently evaluated, validated and interpretable, these models can then be translated into clinical care, leading to more timely diagnoses and eventually to improved human health and healthcare.