Synthetic data and privacy preservation

Exploring the use of synthetic data generators, which offer a naturally private way to generate high-quality data that preserves the statistical features of the original dataset

Project status

Ongoing

Introduction

Datasets are often stored in silos spread across organisations and are not easy to share with outside entities (e.g. the academic community) or between departments within an organisation. The main roadblocks to sharing are privacy constraints and regulatory requirements.

Synthetic data generators (SDGs) enable users to share and link data, to work with data in safe environments, to fix structural deficiencies in data, to increase the size of the data, and to validate machine learning systems by generating adversarial scenarios.

This project aims to produce state-of-the-art data generators for both structured and unstructured datasets, as well as metrics for evaluating the utility and privacy of synthetic datasets across multiple use cases.

Explaining the science

This project will draw on recent methodological developments in network modelling and the application of the signature method for data description. Combining these developments with classical and deep data generation processes ensures that the data generated is not just accurate but also efficient and explainable. Many SDGs can also be made private using differentially private mechanisms at the optimisation and model parameter level. Unlike de-identification methods such as data masking, shuffling, and encryption, SDGs minimise the scope for adversaries to recover personal information. SDGs use algorithms that preserve the original data's statistical features while producing new data points, without the one-to-one matching between original and released records that is seen with de-identification methods.
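As a minimal illustration of that last point, the sketch below fits a simple bivariate Gaussian to a made-up dataset and samples new records from it. This is not one of the project's generators, which the text does not specify; the dataset, the function names `fit_gaussian` and `sample_synthetic`, and the choice of a Gaussian model are all assumptions made for the example, and no differential privacy mechanism is applied.

```python
import math
import random


def fit_gaussian(rows):
    """Estimate means, variances and covariance of two numeric columns."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in rows) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in rows) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (n - 1)
    return mean_x, mean_y, var_x, var_y, cov_xy


def sample_synthetic(params, n, rng):
    """Draw n new records from the fitted Gaussian; none is copied from the input."""
    mean_x, mean_y, var_x, var_y, cov_xy = params
    # Cholesky factor of the 2x2 covariance matrix, written out by hand.
    a = math.sqrt(var_x)
    b = cov_xy / a
    c = math.sqrt(max(var_y - b * b, 0.0))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        rows.append((mean_x + a * z1, mean_y + b * z1 + c * z2))
    return rows


def correlation(params):
    """Pearson correlation implied by the fitted parameters."""
    _, _, var_x, var_y, cov_xy = params
    return cov_xy / math.sqrt(var_x * var_y)


rng = random.Random(0)

# Hypothetical "original" dataset: two correlated numeric columns.
original = []
for _ in range(5000):
    x = rng.gauss(50.0, 5.0)
    original.append((x, 0.5 * x + rng.gauss(5.0, 3.0)))

original_params = fit_gaussian(original)
synthetic = sample_synthetic(original_params, 5000, rng)
synthetic_params = fit_gaussian(synthetic)

print("original correlation: ", round(correlation(original_params), 3))
print("synthetic correlation:", round(correlation(synthetic_params), 3))
```

The synthetic records reproduce the means and correlation of the original columns to within sampling error, yet no synthetic row corresponds to a particular original row. A Gaussian fit only captures first and second moments, so it stands in here for the richer generators the project describes.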

Project aims

Synthetic data has many possible use cases, such as increasing the size of the data, fixing structural deficiencies, or enabling researchers to test the functionality of machine learning algorithms without access to the full datasets.

Alongside this, synthetic data has the potential to enable easier access to synthetic versions of sensitive datasets, democratising research and allowing greater sharing of data between (and within) organisations.

Working with the Office for National Statistics (ONS), we seek to create state-of-the-art synthetic data generators, alongside metrics for assessing the utility and privacy of synthetic data, to bolster data sharing within the ONS and across the research community engaging with it.

Applications

The potential applications of SDGs are numerous, ranging from simple synthetic datasets for software development to giving researchers access to synthesised versions of highly sensitive datasets while reassuring data controllers about privacy.

This project is concerned with building a useful framework for generating synthetic data, as well as assessing its privacy and utility. Doing this in collaboration with the ONS enables the adoption of this framework in researcher safe havens across the ONS and other government departments.

Organisers

Researchers and collaborators

Contact info

Priscila Lopez-Beltran

Funders