Even before the COVID-19 pandemic, the shift towards ‘big data’ research had presented health and social care researchers with a problem: what is the best way to conduct collaborative projects involving sensitive data? Similar questions face researchers in fields as diverse as finance and law. One answer has come in the form of trusted research environments (TREs) – highly secure platforms that allow authorised researchers to access sensitive data along with analysis tools, software and associated libraries and programming languages (an example is The Alan Turing Institute’s own data safe haven platform).
At the Turing, one of our missions is to document and promote the best tools, systems and practices for conducting data science and publishing research software. Resources such as The Turing Way aim to capture this knowledge and make it accessible to researchers worldwide, enabling them to collaborate effectively and ensure their research is reusable, transparent and reproducible. But one thing that has been overlooked until now is how these practices apply to research on sensitive data that are being accessed via TREs.
One healthcare research project that is utilising a TRE is the Wales Multimorbidity Machine Learning (WMML) project, in which researchers at Swansea University and the University of Manchester are working on machine learning (ML) methods for identifying common multimorbidities (i.e. the co-occurrence of two or more long-term health conditions in the same person) in an anonymised, individual-level, population-scale collection of healthcare data. These data are only accessible within the ‘SAIL Databank’ TRE at Swansea University. The researchers hope to uncover new links between conditions associated with the multimorbidities that cause the largest problems for the NHS and/or individuals.
In my recent collaboration with the WMML team, as a member of the Turing’s Research Engineering team, I noticed a gap in the literature on best practices and ways of working with TREs, specifically around developing research code and sharing it as quickly but securely as possible with other researchers who do not have access to the TRE. Crucially, the WMML team wanted to be able to publish its code alongside its results to maximise the findability, cite-ability and reproducibility of its research methods. Research involving sensitive data can be crucial in determining real-world outcomes, as we saw with healthcare policies during the pandemic, so being able to reproduce studies is of paramount importance.
The scenario faced by the WMML researchers provided an interesting opportunity for me and the Turing to take a first step towards documenting best practices for conducting and publishing research done with a TRE. The result is this report, which is aimed at all TRE users, whether in health data science (where TREs are becoming increasingly common) or another field.
The suggestions in the report include, but are not limited to:
- Developing research code as modular scripts and using version control with Git, hosted on a platform such as GitLab or GitHub.
- Considering the advantages and disadvantages associated with where research code destined for publication is developed, either inside or outside the TRE.
- Using unit testing, quality checks and continuous integration to ensure that research code works as expected.
- Using Jupyter or R Markdown notebooks for data analysis and, where appropriate, for drafting research papers that include an online executable version.
- Exploring the use of synthetic data for scenarios where access to sensitive data is not possible.
- Making use of DOIs (digital object identifiers), citation files and software licences when publishing research software.
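To give a flavour of the unit testing and continuous integration suggestion, here is a minimal sketch in Python. The function and the data are hypothetical, not taken from the WMML codebase: a small, modular analysis function that flags patients recorded with two or more distinct long-term conditions, together with a unit test that could run automatically (e.g. under pytest in a CI pipeline) using made-up, non-sensitive data rather than TRE-held records.

```python
# Hypothetical example: a small, testable analysis function of the kind
# that might live in a modular research script developed for use in a TRE.

def multimorbid_patients(records):
    """Return the IDs of patients recorded with two or more distinct
    long-term conditions.

    `records` is an iterable of (patient_id, condition) pairs.
    """
    conditions_by_patient = {}
    for patient_id, condition in records:
        conditions_by_patient.setdefault(patient_id, set()).add(condition)
    return {pid for pid, conds in conditions_by_patient.items()
            if len(conds) >= 2}


# A unit test of the kind that continuous integration could run on every
# commit, using synthetic (non-sensitive) data.
def test_multimorbid_patients():
    records = [
        ("p1", "diabetes"),
        ("p1", "hypertension"),
        ("p2", "asthma"),
        ("p3", "asthma"),
        ("p3", "asthma"),  # a duplicate condition should not count twice
    ]
    assert multimorbid_patients(records) == {"p1"}
    assert multimorbid_patients([]) == set()


test_multimorbid_patients()
```

Because the test exercises only made-up data, it can live alongside the code in version control and run outside the TRE, giving collaborators without data access some assurance that the published methods behave as described.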
To find out more about these suggestions, read the full report or contact Ed Chalstrey (The Alan Turing Institute). To learn more about the WMML project, contact Niels Peek (University of Manchester) or Ronan Lyons (Swansea University).
Top image: Andrew Ruiz / Unsplash