The data science research community frequently encounters a need for:
- Secure environments for analysis of sensitive datasets
- High performance computing capability
- A productive environment for curiosity-driven research
These requirements have, to our knowledge, not yet been realised simultaneously.
Some solutions render research unproductive by making it hard to author new code while engaging with the data, or to experiment with the many software libraries that implement new analytical techniques. Others lack access to significant GPU compute power.
This perversely reduces security: researchers route around carefully constructed secure environments to avoid perceived productivity loss, reverting to 'folk security practices' (such as over-reliance on imperfect anonymisation) and increasing the risk of a breach.
An organisation is then left without a clear inventory of the datasets it handles, or of each dataset's risk profile in terms of consequences and threat actors. Email- and document-based processes create a confused environment with costly reporting and audit.
This project is building a system with:
- Clearly defined security classification tiers for the Institute's data, corresponding to the government system but extending to research activities
- A clean, easy-to-use, web-based system for management, tracking, review and classification of datasets, and the allocation of users and datasets to projects
- Automated creation of multiple, independent secure environments, each configured according to the appropriate classification tier for the data
A cloud-based approach to secure infrastructure means every aspect of the system is defined by fully scripted configuration manifests, and these can be interrogated by any data provider to audit the system. This also provides scalable high performance computing.
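As a rough illustration of what "fully scripted configuration manifests" means in practice, the sketch below parameterises a secure analysis environment by sensitivity tier. The field names, tier thresholds, and policy rules here are hypothetical illustrations, not the project's actual schema or policies.

```python
# Hypothetical sketch of a tier-parameterised environment manifest.
# All field names and tier thresholds are illustrative assumptions,
# not the Safe Haven's real configuration schema.
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvironmentManifest:
    tier: int                 # sensitivity tier (0 = open data, higher = more sensitive)
    internet_access: bool     # can the environment reach the wider internet?
    copy_paste_allowed: bool  # clipboard sharing between user device and environment
    gpu_nodes: int            # HPC capacity provisioned for the project


def manifest_for_tier(tier: int, gpu_nodes: int = 0) -> EnvironmentManifest:
    """Derive environment settings from a dataset's classification tier."""
    return EnvironmentManifest(
        tier=tier,
        internet_access=(tier <= 1),      # only the lowest tiers reach the internet
        copy_paste_allowed=(tier <= 2),   # clipboard locked down at higher tiers
        gpu_nodes=gpu_nodes,
    )
```

Because every setting is derived from declarative code like this, a data provider can audit exactly what controls apply at each tier by reading the manifests themselves.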
The web-based management solution will provide clear reporting of data inventory across all tiers.
The framework developed in this project combines prevailing data security threat and risk profiles into five sensitivity tiers, and, at each tier, specifies recommended policies for data reclassification, data ingress, software ingress, data egress, user access, user device control, and analysis environments. The framework presents design patterns for security choices for each tier, and uses software defined infrastructure so that a different, independent, secure analysis environment can be instantiated for each project appropriate to its classification. The aim is to maximise researcher productivity and minimise risk, allowing research organisations to operate with confidence.
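To make the tiering idea concrete, here is a minimal sketch of mapping simple risk attributes to a sensitivity tier. The attributes, thresholds, and rules are invented for illustration; the framework's actual classification questions and tier definitions are set out in the preprint below.

```python
# Illustrative only: map simple risk attributes of a dataset to a
# sensitivity tier in the range 0-4. These rules are hypothetical
# assumptions, not the framework's real classification logic.
def classify_tier(personal_data: bool,
                  pseudonymised: bool,
                  severe_harm_if_breached: bool) -> int:
    """Return a sensitivity tier (0 = open, 4 = most sensitive)."""
    if severe_harm_if_breached:
        return 4                       # breach could cause severe harm
    if personal_data and not pseudonymised:
        return 3                       # identifiable personal data
    if personal_data:
        return 2                       # pseudonymised personal data
    return 0                           # open data (tier 1 omitted for brevity)
```

Each project's environment would then be instantiated from the policies associated with the tier this kind of classification assigns.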
As well as this design research, the project is developing, and will publish, a reference implementation of the design based on Microsoft Azure.
One-page overview: Poster with overview of our data classification approach, security measures, data management and technical architecture. This is the best one-page high-level overview of our systems and process. Presented at the poster session at the second annual workshop of the Research Software London and South East community on 06 February 2020.
Overview presentation: Slides from a presentation about the Safe Haven giving a more in-depth overview. Presented at the UKRI Cloud Workshop on 03 March 2020.
Overview video: An extended version of our overview presentation that also demonstrates our data classification web application and using the environment as a researcher. Presented at the Software Sustainability Institute's Collaborations Workshop on 01 April 2020.
Design choices: Our preprint "Design choices for productive, secure, data-intensive research at scale in the cloud", outlining our policies, processes and design decisions for the Safe Haven. Pre-print initially submitted on arXiv on 23 August 2019, last revised on 15 September 2019.
Reference implementation: This is currently available through a semi-open beta programme, under an open source MIT licence. If you would like to evaluate our implementation or collaborate in its development, please contact James Robinson or Martin O'Reilly for access to the GitHub repository containing the code and documentation to deploy our reference implementation to Microsoft's Azure cloud. We plan to make the GitHub repository for our implementation fully open to the public later in 2020.