Introduction
The data science research community frequently encounters a need for:
- Secure environments for analysis of sensitive datasets
- High performance computing capability
- A productive environment for curiosity-driven research
These requirements have, to our knowledge, not yet been realised simultaneously.
Some solutions render research unproductive by making it hard to author new code while engaging with the data, or to experiment with the many software libraries realising new analytical techniques. Others lack access to significant GPU compute power.
This perversely reduces security: researchers route around carefully constructed secure environments to avoid a perceived loss of productivity and revert to 'folk security practices' (such as over-reliance on imperfect anonymisation), increasing the risk of a breach.
An organisation in this position lacks a clear inventory of the datasets it is handling and of the risk profile of each, in terms of consequences and threat actors. Email- and document-based processes create a confused environment in which reporting and audit are costly.
Project aims
This project is building a system with:
- Clearly defined security classification tiers for the Institute's data, corresponding to the government system but extending to research activities
- A clean, easy-to-use, web-based system for the management, tracking, review and classification of datasets, and for the allocation of users and datasets to projects
- Automated creation of multiple, independent secure environments, each configured according to the appropriate classification tier for the data
A cloud-based approach to secure infrastructure means every aspect of the system is defined by fully scripted configuration manifests, and these can be interrogated by any data provider to audit the system. This also provides scalable high performance computing.
The web-based management solution will provide clear reporting of data inventory across all tiers.
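As a rough illustration of what reporting a data inventory across tiers could involve, the sketch below groups dataset records by classification tier. It is not taken from the project's web application: the `Dataset` record, its fields, the example dataset and project names, and the 0 to 4 tier numbering are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """Hypothetical inventory record; field names are illustrative only."""
    name: str
    project: str
    tier: int  # assumed numbering: 0 (least sensitive) to 4 (most sensitive)


def inventory_by_tier(datasets):
    """Group dataset names by tier, with tiers and names in sorted order."""
    grouped = defaultdict(list)
    for ds in datasets:
        grouped[ds.tier].append(ds.name)
    return {tier: sorted(names) for tier, names in sorted(grouped.items())}


# Made-up example records, for illustration only.
datasets = [
    Dataset("survey-responses", "project-a", 2),
    Dataset("public-census-extract", "project-a", 0),
    Dataset("clinical-notes", "project-b", 3),
    Dataset("sensor-logs", "project-c", 2),
]

report = inventory_by_tier(datasets)
for tier, names in report.items():
    print(f"Tier {tier}: {len(names)} dataset(s): {', '.join(names)}")
```

A real inventory would also record reviewers, classification decisions and the users allocated to each project; those are omitted here to keep the sketch short.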
Applications
The framework developed in this project combines prevailing data security threat and risk profiles into five sensitivity tiers, and, at each tier, specifies recommended policies for data reclassification, data ingress, software ingress, data egress, user access, user device control, and analysis environments. The framework presents design patterns for security choices for each tier, and uses software defined infrastructure so that a different, independent, secure analysis environment can be instantiated for each project appropriate to its classification. The aim is to maximise researcher productivity and minimise risk, allowing research organisations to operate with confidence.
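The idea of instantiating an independent environment per project, configured from its tier, can be sketched as a lookup from tier to policy followed by the generation of a per-project configuration. This is a minimal sketch under stated assumptions, not the project's implementation: the tier numbering (0 to 4), the three boolean policy fields, their values, and the `build_environment` helper are all hypothetical placeholders and do not reproduce the framework's recommended policies.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TierPolicy:
    """Hypothetical subset of per-tier policy choices (placeholder fields)."""
    internet_access: bool          # stands in for software ingress policy
    egress_requires_review: bool   # stands in for data egress policy
    managed_devices_only: bool     # stands in for user device control


# Placeholder values for five tiers; NOT the framework's actual policies.
TIER_POLICIES = {
    0: TierPolicy(internet_access=True, egress_requires_review=False, managed_devices_only=False),
    1: TierPolicy(internet_access=True, egress_requires_review=False, managed_devices_only=False),
    2: TierPolicy(internet_access=False, egress_requires_review=True, managed_devices_only=False),
    3: TierPolicy(internet_access=False, egress_requires_review=True, managed_devices_only=True),
    4: TierPolicy(internet_access=False, egress_requires_review=True, managed_devices_only=True),
}


def build_environment(project: str, tier: int) -> dict:
    """Return a configuration dict for one project's independent environment."""
    if tier not in TIER_POLICIES:
        raise ValueError(f"unknown tier: {tier}")
    return {
        "name": f"{project}-tier{tier}",
        "project": project,
        "tier": tier,
        "policy": asdict(TIER_POLICIES[tier]),
    }


# Two projects at different tiers get separate, differently configured environments.
env_a = build_environment("project-a", 2)
env_b = build_environment("project-b", 3)
print(env_a)
print(env_b)
```

Keeping the tier-to-policy mapping as plain data means the configuration that would be applied to any environment can be inspected directly, which is in the spirit of the auditable, scripted approach described above.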
As well as this design research, the project is developing, and will publish, a reference implementation of the design based on Microsoft Azure.
Recent updates
- One-page overview: Poster with overview of our data classification approach, security measures, data management and technical architecture. This is the best one-page high-level overview of our systems and process. Presented at the poster session at the second annual workshop of the Research Software London and South East community on 06 February 2020.
- Two-minute video overview: Lightning talk about the Data Safe Haven. Presented at AI UK 2022.
- In-depth overview: Slides and recording of an in-depth overview of the project (50 minutes). Presented at the Warwick Research Software Engineering in Data and AI Workshop on 15 February 2023.
- Demonstration: An overview presentation that also demonstrates our data classification web application and using the Data Safe Haven as a researcher. Presented at the Software Sustainability Institute's Collaborations Workshop on 01 April 2020.
- Design choices: Our preprint "Design choices for productive, secure, data-intensive research at scale in the cloud", outlining our policies, processes and design decisions for the Safe Haven. Preprint initially submitted to arXiv on 23 August 2019, last revised on 15 September 2019.
- Open source TRE implementation: Our implementation is currently available under an open source BSD licence. If you would like to evaluate it or collaborate in its development, take a look at the public GitHub repository containing the code and the documentation to deploy our reference implementation to Microsoft's Azure cloud (or contact James Robinson or Martin O'Reilly).
- Community engagement: Along with the University of Dundee, we’ve co-established the RSE TRE community, bringing together TRE teams and people around the UK to collaborate on building and discussing TRE infrastructure. Get involved by joining the mailing list.
- Current focus: We're currently working with the University of Dundee, UCL, Ulster University, Health Data Research UK and Research Data Scotland as part of the SATRE project to capture common principles and requirements for Trusted Research Environments (TREs) like our Data Safe Haven, as well as community-accepted approaches for meeting these standards. As part of this work we will be validating our Data Safe Haven and two other open source TRE implementations against these standards, bringing each of them into closer alignment. This work is being funded by the DARE UK programme as one of its five Driver Projects. Check out our blog and keep an eye on our GitHub repository, where we will be publishing our outputs. If you'd like to get involved, please get in contact at [email protected]