Ground truth for mental health data science

Linking digital footprint data in the UK birth cohorts to guide the next generation of algorithms for mental health and wellbeing research

As we go about our daily lives, each of us lays down digital footprints, such as through our interactions on social media, our search behaviour, and the things we buy. This vast repository of data on real-life behaviour has huge potential to inform research into the causes of poor health. However, we very rarely have reference or “ground truth” data available to tell us whether our interpretation of these footprints is accurate. Fortunately, the UK is home to world-leading birth cohorts: samples of thousands of people recruited at birth, who have been donating their data to research their whole lives. Linking digital footprint data in these cohorts could allow researchers to validate their algorithms against existing gold standard measures, improving the accuracy and reliability of interpreting these digital data, and helping to realise their potential in improving our understanding of mental health and wellbeing.

Explaining the science

Although it is straightforward to collect social media data from consenting study participants via the application programming interfaces (APIs) that social media sites make available, there are challenges specific to linking these data in birth cohorts. For example, cohorts have a responsibility to protect the identity of their participants. To do this, they maintain data safe havens certified by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC), and ensure that participants’ personal and identifiable data remain inside these havens. This means that, rather than the project providing a centralised service, it is easier for cohorts to run social media linking software themselves, within their own data safe havens. This introduces the challenge of developing software that is not only secure, but also straightforward to run in heterogeneous computing environments, and robust enough to run long-term without too much intervention.

For researchers to be able to validate algorithms for making inferences from digital footprint data, it is necessary to run these algorithms on the original social media data held by the cohorts and compare the results to more traditional measures. Because the original data cannot leave the data safe haven, this project is developing a framework for researchers to submit algorithms that the cohorts run themselves, returning anonymised summary results on the performance of each algorithm to the researchers. To supplement this approach of bringing the analysis to the data, the project has also been developing machine learning models to generate synthetic data sets that can be shared directly with digital footprint researchers, allowing them to develop and train better algorithms for later validation on the real cohort data.
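The idea of bringing the analysis to the data can be illustrated with a minimal sketch. It is not the project's actual framework: the function names, the data layout (one text per participant with a yes/no gold standard label), and the minimum group size of 10 are all assumptions made for illustration. The sketch runs a submitted algorithm inside the safe haven, compares its output with the gold standard measure, and releases only aggregate performance figures, withholding them entirely when a group is too small to report safely.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Hypothetical disclosure-control threshold: smaller groups are not reported.
MIN_GROUP_SIZE = 10


@dataclass(frozen=True)
class ValidationSummary:
    """Aggregate, non-identifying figures that may leave the safe haven."""
    n_participants: int
    accuracy: float
    sensitivity: float
    specificity: float


def validate_in_haven(
    algorithm: Callable[[str], bool],
    posts: Sequence[str],
    ground_truth: Sequence[bool],
    min_group_size: int = MIN_GROUP_SIZE,
) -> Optional[ValidationSummary]:
    """Run a submitted algorithm on private data and return summary metrics only.

    Returns None when either ground-truth group is smaller than the threshold,
    so that no result based on a handful of participants is released.
    """
    if len(posts) != len(ground_truth):
        raise ValueError("posts and ground_truth must have the same length")

    tp = fp = tn = fn = 0
    for text, truth in zip(posts, ground_truth):
        predicted = bool(algorithm(text))
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif not predicted and truth:
            fn += 1
        else:
            tn += 1

    positives = tp + fn  # participants with the outcome per the gold standard
    negatives = tn + fp  # participants without it
    # Require at least one participant per group to avoid dividing by zero.
    if min(positives, negatives) < max(min_group_size, 1):
        return None

    total = positives + negatives
    return ValidationSummary(
        n_participants=total,
        accuracy=round((tp + tn) / total, 3),
        sensitivity=round(tp / positives, 3),
        specificity=round(tn / negatives, 3),
    )


def keyword_classifier(text: str) -> bool:
    """Toy stand-in for a researcher-submitted algorithm (illustrative only)."""
    return "sad" in text.lower()
```

In this arrangement the researcher only ever sees the returned summary object. The posts and the individual-level labels are arguments supplied by the cohort inside its own environment and are never part of the return value.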

Project aims

This Turing project builds on previous work with cohort participants and leaders to develop the processes and software frameworks that allow studies to easily and securely collect social media data. It is making it possible for digital footprint researchers to submit mental health coding algorithms for validation against the gold standard ground truth data that already exist in the cohorts, while keeping personal data secure. This will allow researchers to use the cohorts as a platform for developing better algorithms for inferring mental health and wellbeing from digital footprints, helping the data to realise their potential for improving our understanding of mental health and wellbeing. The project is also considering the ethical and social implications of improving our ability to infer mental health and wellbeing from digital data, and incorporating these considerations into the design of the framework.


Alongside these technical challenges there are important social challenges. Cohorts rely on long-term trusting relationships with their participants, so it is crucial that cohorts link and use digital footprint data in a way that is acceptable to those participants, such as by maintaining copies of the full social media records within the data safe havens while making anonymised information derived from these data available for health and social research. This project is working closely with cohort leaders and participants from the Avon Longitudinal Study of Parents and Children (ALSPAC) and others from the CLOSER group of 19 UK birth cohorts to co-produce these approaches, making sure that the outputs are both useful to the cohorts and acceptable to their contributors. With these collaborations, and the ongoing dedication of tens of thousands of study participants across the country, the project will provide both unique data on how mental health changes across time, and the opportunity to improve how the whole field of research interprets the digital footprints we leave behind.


Dr Valerio Maggio

Senior Research Associate in Data Science and Artificial Intelligence, Bristol Medical School and the MRC Integrative Epidemiology Unit, University of Bristol

Nina Di Cara

MRC GW4 BioMed Data Science PhD student, Bristol Medical School and the MRC Integrative Epidemiology Unit, University of Bristol