The Turing-RSS Health Data Lab: Project overviews

A jargon-free overview of projects within the Turing-RSS Health Data Lab


Our aim is to ensure our research is accessible to all audiences. These summaries give a plain English, jargon-free overview of our past and present research projects.

1. Biomedical acoustic markers 

The biomedical acoustic markers project aims to develop a process to identify features in audio signals (voice and speech sounds) that are caused by COVID-19. This has the potential to be a fast, easy-to-use early test, paving the way for mass testing. It also has the potential, in due course, to be used for the early detection of other diseases.

Our work builds on early-stage research from Cambridge and other research groups, which reported how an algorithm (a sequence of rules that a computer uses to complete a task) could accurately identify COVID-19 positive patients who had no symptoms, using audio recordings of coughs from a small test group.

To evaluate the possibility that COVID-19 results in unique features in individuals’ speech and airway sounds, we have collected a world-leading respiratory sounds COVID-19 dataset. It is superior to previous datasets thanks to the number of recordings collected, the richness of the metadata (information about the dataset) and the quality of the ground truth labels (data confirming that the recorded COVID-19 status is the correct diagnosis). On top of this, we have carefully created two subsets of the dataset to evaluate the performance of the model. The first set, known as the training set, is what we let the model see and learn from.

The second set, known as the test set, is how we evaluate the performance of the model. We have carefully created these partitions to address the bias in the dataset. Most importantly, the test set is curated to feature matched pairs of individuals. The two individuals in each pair share all the same characteristics (for example, symptoms) except for their COVID-19 status. Therefore, if we classify a pair correctly, we are more confident that this is due to true COVID-19 audio signals rather than other symptoms.
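For readers who would like to see the matched-pair idea in code, the following is a minimal sketch in Python, not the Lab's actual pipeline. The field names (`age_band`, `cough`, `covid_positive`, `model_score`), the example records and the 0.5 decision threshold are all assumptions made up for illustration. It pairs each COVID-19 positive record with a negative record that has the same characteristics, then counts a pair as correct only when both members are classified correctly.

```python
from collections import defaultdict


def make_matched_pairs(records, match_keys):
    """Pair each positive record with a negative record that has identical
    values for every key in match_keys (hypothetical field names)."""
    groups = defaultdict(lambda: {True: [], False: []})
    for rec in records:
        profile = tuple(rec[key] for key in match_keys)
        groups[profile][rec["covid_positive"]].append(rec)

    pairs = []
    for bucket in groups.values():
        # zip stops at the shorter list, so records without a match are dropped.
        pairs.extend(zip(bucket[True], bucket[False]))
    return pairs


def pair_accuracy(pairs, threshold=0.5):
    """Fraction of pairs in which BOTH members are classified correctly.
    A score at or above the (assumed) threshold is read as 'positive'."""
    if not pairs:
        return 0.0
    correct = 0
    for positive, negative in pairs:
        positive_ok = positive["model_score"] >= threshold
        negative_ok = negative["model_score"] < threshold
        if positive_ok and negative_ok:
            correct += 1
    return correct / len(pairs)


# Made-up records purely for illustration.
records = [
    {"id": 1, "age_band": "30-39", "cough": True, "covid_positive": True, "model_score": 0.9},
    {"id": 2, "age_band": "30-39", "cough": True, "covid_positive": False, "model_score": 0.4},
    {"id": 3, "age_band": "50-59", "cough": False, "covid_positive": True, "model_score": 0.3},
    {"id": 4, "age_band": "50-59", "cough": False, "covid_positive": False, "model_score": 0.6},
    {"id": 5, "age_band": "60-69", "cough": True, "covid_positive": True, "model_score": 0.8},
]

pairs = make_matched_pairs(records, ["age_band", "cough"])
accuracy = pair_accuracy(pairs)
print(f"{len(pairs)} matched pairs, pair accuracy = {accuracy:.2f}")
```

Because both people in a pair share the same symptoms, a model that has only learned to recognise a cough cannot get the pair right; it has to pick up on something specific to COVID-19.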

If our study proves positive, this algorithm, released as a smartphone app, has the potential to be a rapid and affordable screening tool for COVID-19 and possibly other diseases.


2. De-biasing

Global and national monitoring of COVID-19 is mostly based on targeted plans, which test individuals who show symptoms or work in high-risk environments such as hospitals and care homes. These targeted individuals are often unrepresentative of the wider population and have positivity rates that are higher than the true rate of COVID-19 infection in the population. The data are commonly used to count the number of infections in populations and calculate the effective reproduction number (known as the R number), which measures how fast the virus is spreading in the population.

This information is used to make decisions about public health policy and it is therefore important to interpret these numbers in the most appropriate way. Our debiasing study has combined targeted testing counts with data from a randomised monitoring study (the UK REACT study) to estimate the true number of people with COVID-19 at different times of the pandemic and in different local authorities around the United Kingdom.

Our model takes into account how likely infected people are to get tested compared with non-infected people (a model is a mathematical representation of how variables in a dataset come about and how they are related to one another).
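To make this idea concrete, here is a deliberately simplified sketch in Python. It is not the model used in our study, which is considerably richer; it only shows the basic arithmetic of correcting a positivity rate when infected people are more likely to be tested. The function names and the `testing_ratio` parameter (how many times more likely an infected person is to be tested than a non-infected person) are assumptions introduced for this illustration.

```python
def positivity_from_prevalence(prevalence, testing_ratio):
    """Positivity rate we would expect among tested people, given the true
    share of the population infected and how much more likely infected
    people are to be tested (testing_ratio, an illustrative assumption)."""
    tested_infected = prevalence * testing_ratio
    tested_not_infected = 1.0 - prevalence
    return tested_infected / (tested_infected + tested_not_infected)


def debiased_prevalence(positivity, testing_ratio):
    """Invert the relationship above: recover the share of the population
    infected from the positivity rate seen in targeted testing."""
    if not 0.0 <= positivity <= 1.0:
        raise ValueError("positivity must be between 0 and 1")
    if testing_ratio <= 0:
        raise ValueError("testing_ratio must be positive")
    return positivity / (positivity + (1.0 - positivity) * testing_ratio)


# Made-up numbers: 10% of targeted tests are positive, and infected people
# are assumed to be five times more likely to be tested.
estimate = debiased_prevalence(0.10, 5.0)
print(f"Estimated share of population infected: {estimate:.3f}")
```

With these made-up numbers, a 10% positivity rate corresponds to roughly 2% of the population being infected. In practice the testing ratio is unknown, which is why data from a randomised study are needed to pin it down.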

Our approach was tested by comparing the estimates it produces with another set of randomly collected REACT data that was not used in building our model.

We found that our local estimates of the disease reproduction number can be used to predict changes in COVID-19 positive cases one week and two weeks ahead. We also saw increases in estimated local frequencies of COVID-19 and in the disease reproduction number that matched the spread of the Alpha and Delta COVID-19 variants.

Our results demonstrate how randomised testing for disease can add to targeted testing schemes to improve accuracy in monitoring the spread of new and ongoing infectious diseases.


3. Health inequalities

This project aimed to investigate how ethnicity and socio-economic deprivation have affected chances of becoming infected with COVID-19 for people in England (socio-economic deprivation is the disadvantage an individual or group experiences in terms of access and control over money, material or social resources and opportunities).

Our study used two sources of data: weekly COVID-19 positive test rates and estimated debiased disease frequency (see our debiasing study summary for details of what this is), both at Lower Tier Local Authority level. For ethnicity we considered the percentage of BAME (Black, Asian and Minority Ethnic) population in each geographical area, while for socio-economic deprivation we used the Index of Multiple Deprivation (this measure includes income, employment, education, health, crime, barriers to housing and services, and living environment). We analysed the test rates and debiased disease frequency separately, but in both cases we considered other factors in our analysis, such as the type of area (urban or rural), policy intervention (vaccination rollout) and the age structure of the population.

Our analyses cover the period from 1 June 2020 to 19 September 2021, and we allowed the effect of deprivation and ethnicity to vary throughout the period. Our results show that mostly non-White and most deprived areas have increased weekly positivity rates compared to the least deprived and mostly White areas of the UK. Deprivation had the stronger effect until October 2020; ethnicity then had the greater effect on chances of infection during the peak of the second wave and again in May-June 2021.

We also considered ethnicity subgroups and found evidence that in the second wave, areas with large South Asian populations were most affected, whereas areas with large Black populations did not show increased infection for either set of data during the entire period under analysis.

In summary, we have found that area-level deprivation and proportion of BAME population are both linked to increased COVID-19 infection, but this has varied over the course of the pandemic and for different ethnic subgroups. This evidence highlights the importance of continually monitoring how different communities are responding, in order to inform relevant policies aimed at eliminating social inequality in COVID-19 burden.


4. Interoperability

Interoperability is a guiding framework for statistical thinking to assist policy-makers asking multiple questions, using combinations of different datasets, when decisions need to be made fast in response to the current and future pandemics. The framework allows us to build statistical models for quick analysis of data and urgent action; such models are mathematical representations of data that enable conclusions to be drawn and decisions to be made in the real world.

Our interoperable approach provides an important set of principles for future pandemic preparedness. It does this through the design and deployment of multiple models at the same time (joint design) and the use of a modular system that means each model can be easily adapted for use in future disease monitoring in tandem with other models.


5. Wastewater

People infected with COVID-19, with or without symptoms, release the virus through their digestive systems or during daily activities, and it ends up in wastewater. This process is called shedding. It is now known that as the number of COVID-19 patients in an area increases, the amount of virus particles detected in wastewater (the viral load) also increases. Wastewater is tested at wastewater plants as part of the regular testing of wastewater samples.

The Environmental Monitoring for Health Protection (EMHP) wastewater monitoring program, led by the UK Health Security Agency, tests wastewater on a daily basis. It started in mid-2020 and continues to gather data from 270 sites across England.

This project seeks to use these data to address research questions such as:

  • How can estimates of disease frequency from wastewater data at specific points in time be used alongside more commonly used health monitoring data?
  • Does wastewater data add value to monitoring diseases?
  • How can we best design cost-effective wastewater sampling schemes for real-time monitoring, either using only wastewater data or combined with traditional monitoring data?

During the first phase of the project, the team will focus on the first of these research questions and also work to identify priorities for future research. The data will first be explored, and the team will then go on to conduct analysis related to different time periods and different places in the United Kingdom.

We are illustrating our interoperability approach through several case studies aimed at finding out the number of people infected with coronavirus (SARS-CoV-2) in local authorities across England at different times, and at working out the rate of transmission of the disease in these local areas.


6. Transmission and mobility

This project aims to improve our understanding of how people's movement affects the spread of the COVID-19 virus. This work has the potential to provide insight for policy-makers on, for example, the likely impact of travelling outside of a person's local area on controlling the spread of the virus. We will create a high-quality infectious disease transmission model (a model is a framework to show the relationship between variables in a dataset) that uses real-time mobility data.

This work builds on a space and time model (the Epimap model), previously developed by our team, which produces local estimates of transmission. The model includes consideration of other factors such as population density and data that captures social and economic deprivation, vaccination coverage and information on variants. 

The work from this project, such as the model, will be open-source (accessible to all) to generate discussion and increase the transparency of this work for greater future reusability by other researchers and for greater access for policy-makers.


7. Genomics+

Genomics (the study of an organism’s DNA and how that information is used) is a particular strength of the UK, with several laboratory and computational hubs capable of delivering fast and large-scale investigations.

The academic and public health communities moved quickly during the first phase of the COVID-19 pandemic to provide key resources and infrastructure to allow research in this area. The UK COVID-19 gene sequencing initiative is internationally recognised as one of the best schemes in the world and provided critical assessments of new variants of the disease. Sequencing DNA means determining the order of the four chemical building blocks (called "bases") that make up a DNA molecule, and is most often applied to work out entire genomes, in this case the genome of the COVID-19 virus.

The datasets created by genome sequencing are vital for the accurate assessment of disease transmission and other important biological features related to the COVID-19 virus, as well as allowing extensive monitoring of global disease transmission. These datasets can be reused in several ways: to look back on aspects of the spread of the disease and also for future health monitoring.

Bringing together the genomic datasets with other datasets from the UK Health Security Agency (UKHSA) will further develop analysis of disease transmission, such as the examination of gaps in our knowledge of how the disease spread over space and time.

Our study, therefore, aims to inform public policy for responses to future health emergencies by providing more detailed analyses of how COVID-19 spread over the course of the pandemic.

