Sensitive datasets – for example healthcare or census microdata – are often too inaccessible to be used to their full potential. Synthetic data – artificially generated data that replicates the statistical properties of real-world data without containing any identifiable information – offers an alternative. However, synthetic data is poorly understood in terms of both how well it preserves the privacy of the individuals whose data the synthesis is based on, and its utility (i.e. how representative of the underlying population the data are).
This project is evaluating how well data synthesis methods preserve both individuals’ privacy and the utility of datasets for analysis. Where methods sufficiently preserve both privacy and utility, then synthetic datasets could be made available in more accessible environments.
Generating synthetic data introduces additional uncertainty compared to the original dataset. This project is quantifying this uncertainty and understanding how it propagates through space and time when synthetic data is used for modelling purposes. A range of data synthesis techniques are being evaluated in the contexts of both health and urban analytics, and an open-source pipeline is being developed for data owners to easily use to evaluate the potential effectiveness and impact of using synthetic data for their own applications.
Explaining the science
There are a variety of methods for generating synthetic microdata, and this project's goal is to include as wide a range of these as possible in an evaluation platform. It will be straightforward for any user to extend the platform by adding new methods to the benchmarking pipeline. For an overview of some key data synthesis methods, see this review from the ONS and this review of microsimulation methods.
Once data has been synthesised, the project aims to answer two questions:
- Is it useful for replacing real individual-level data in a given application?
- What is the privacy risk to the individuals in the original dataset?
The first of these questions, on the utility of the data, strongly depends on the nature of the application, which means the platform must be flexible in how such measures are specified and quantified.
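As a concrete illustration of one simple utility measure – this is a generic sketch, not the project's own implementation – the code below compares a single variable's distribution in the real and synthetic data using the two-sample Kolmogorov–Smirnov statistic: the maximum gap between the two empirical CDFs, where 0 means a perfect match and 1 means complete separation.

```python
import numpy as np

def ks_statistic(real, synth):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the empirical CDFs of the real and synthetic samples."""
    grid = np.sort(np.concatenate([real, synth]))
    cdf_real = np.searchsorted(np.sort(real), grid, side="right") / len(real)
    cdf_synth = np.searchsorted(np.sort(synth), grid, side="right") / len(synth)
    return float(np.max(np.abs(cdf_real - cdf_synth)))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=1000)
good = rng.normal(0.0, 1.0, size=1000)  # synthetic data from the right distribution
bad = rng.normal(1.0, 1.0, size=1000)   # synthetic data with a shifted mean
# ks_statistic(real, good) is small; ks_statistic(real, bad) is much larger
```

In practice a benchmarking platform would apply many such measures – per-variable distances like this one, but also measures of cross-variable structure and of performance on the downstream analysis task itself.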
Regarding the second question, the privacy-preservation literature offers a wealth of methods, some with strong theoretical foundations and others more ad hoc. These ultimately aim to measure the risk to an individual when certain information is released. This applies to synthetic records too, since the synthesis depends on some individuals' data, which may be sensitive. The underlying idea is often to perturb the data, or the parameters of a model fitted to it, with noise. The results of these methods can sometimes be related to one another, but often they are incompatible, and not every measure of privacy applies to every synthesis method. This makes it important to present any privacy results clearly, state their implications, and link them to their theoretical underpinnings where possible.
The project is exploring existing and novel techniques to quantify the privacy of a synthesised data set. These include data-driven methods which are independent of the synthesis algorithm used, as well as methods that embed privacy in the synthesis process. A few examples include:
- Differential privacy
- Plausible deniability
- Adversarial methods
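To make the "perturb with noise" idea concrete, here is a minimal sketch of the Laplace mechanism from differential privacy, applied to a counting query. The dataset and query are invented for illustration; the mechanism itself is standard: a count has sensitivity 1 (adding or removing one person changes it by at most 1), so adding Laplace noise of scale 1/ε makes the released count ε-differentially private.

```python
import numpy as np

def laplace_count(data, predicate, epsilon, rng):
    """Release a count with epsilon-differential privacy via the
    Laplace mechanism (sensitivity of a count is 1, noise scale 1/epsilon)."""
    true_count = sum(1 for x in data if predicate(x))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

rng = np.random.default_rng(42)
ages = [23, 45, 67, 34, 29, 71, 52, 38]  # toy data for illustration
# Hypothetical query: how many individuals are over 40? (true answer: 4)
noisy = laplace_count(ages, lambda a: a > 40, epsilon=1.0, rng=rng)
```

Smaller ε means stronger privacy but noisier answers – the privacy/utility trade-off the project is quantifying. Synthesis methods that embed privacy typically apply noise of this kind to the parameters of the generative model rather than to individual query answers.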
This project is evaluating a range of existing synthetic data generation techniques to understand how they ensure the privacy of individuals yet remain useful for answering the same research questions as the original sensitive dataset. This will involve evaluating these data synthesis techniques against a range of privacy and utility measures.
The project is also evaluating the uncertainty introduced by these data synthesis methods and how this uncertainty propagates when the synthetic data is combined with other data and is used in simulations that evolve in space and time.
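One common way to quantify the uncertainty a synthesiser introduces – shown here as a generic sketch with a toy parametric synthesiser, not the project's own method – is to generate several synthetic replicates, compute the estimate of interest on each, and examine the spread across replicates.

```python
import numpy as np

def synthesise(real, rng):
    """Toy parametric synthesiser (an assumption for illustration):
    fit a normal distribution to the real data and sample from it."""
    return rng.normal(real.mean(), real.std(ddof=1), size=len(real))

rng = np.random.default_rng(7)
real = rng.normal(50.0, 10.0, size=500)  # stand-in for a sensitive variable

m = 20  # number of synthetic replicates
estimates = np.array([synthesise(real, rng).mean() for _ in range(m)])

point = estimates.mean()             # combined point estimate
between_var = estimates.var(ddof=1)  # extra variance due to the synthesis step
```

Full combining rules for synthetic data (in the style of multiple imputation) also account for the within-replicate sampling variance; the between-replicate term above isolates the uncertainty added by synthesis, which is what then propagates through any downstream simulation.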
In addition to making the analysis reproducible and openly available, the project will also ensure that the work is reusable by others by packaging evaluation code into robust, reliable, reusable benchmarking tools. These tools can be used by researchers, practitioners, data holders and other stakeholders to evaluate these data synthesis techniques for their own datasets and application contexts.
The project's open benchmarking tools will be developed such that it is easy for others to add additional data synthesis methods, evaluation datasets and measures for evaluating privacy, utility and uncertainty.
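One way such extensibility could work – the names and structure below are purely illustrative, not the project's actual API – is a simple registry pattern: synthesis methods and evaluation measures register themselves under a name, and the benchmarking loop runs every combination automatically.

```python
import random

# Hypothetical plugin registries (illustrative, not the project's real API).
SYNTHESISERS = {}
MEASURES = {}

def register(registry, name):
    """Decorator that adds a function to a registry under the given name."""
    def wrap(fn):
        registry[name] = fn
        return fn
    return wrap

@register(SYNTHESISERS, "bootstrap")
def bootstrap_synth(data, rng):
    """Baseline synthesiser: resample the real records with replacement."""
    return [data[rng.randrange(len(data))] for _ in data]

@register(MEASURES, "mean_gap")
def mean_gap(real, synth):
    """Toy utility measure: absolute difference of the means."""
    return abs(sum(real) / len(real) - sum(synth) / len(synth))

rng = random.Random(0)
real = [1.0, 2.0, 3.0, 4.0, 5.0]
results = {
    (s_name, m_name): measure(real, synth)
    for s_name, synthesiser in SYNTHESISERS.items()
    for synth in [synthesiser(real, rng)]
    for m_name, measure in MEASURES.items()
}
```

Under this design, a user adds a new synthesis method or measure by writing one decorated function, and it is picked up by every existing benchmark without modifying the pipeline itself.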
Sensitive data is critical for research in a range of fields, including healthcare, finance and government. The ability to reliably generate less sensitive synthetic datasets that are still useful in answering research questions of interest would enable this data to be made more widely available and enable a wider community of researchers to engage with these problems.
The project's researchers are working with colleagues developing a triage dashboard for hospital emergency departments, evaluating the extent to which synthetic data can be used in place of sensitive patient data when developing algorithms that support the patient triage process.
They are also collaborating with researchers from the University College London Critical Care Health Informatics Collaborative to evaluate whether data synthesis techniques can generate privacy-preserving versions of their critical care patient data.
In many cases, researchers already work with datasets that have been de-identified or otherwise altered to preserve the privacy of the individuals to whom the data pertain, often at the cost of reduced resolution in properties such as location, age or timing. Data synthesis techniques are often used to recreate the detail lost in this de-identification process, so understanding the uncertainty these techniques introduce is important when interpreting the data and any simulations that rely on it.
The researchers are also working with colleagues on the Turing's SPENSER project, who are using synthetic data generation techniques to produce more fine-grained regional and local census datasets from national-level microdata samples. They are helping the SPENSER team evaluate additional data synthesis techniques that support dynamic population models, and understand how the uncertainty introduced by these synthesis techniques propagates through their simulations as they evolve in space and time.