Differential privacy is a statistical technique that aims to maximise the accuracy of queries on statistical databases while minimising the privacy impact on the individuals whose information they contain. This project surveys and compares, both analytically and empirically, different methods for generating differentially private 'synthetic' data. It focuses on a series of case studies on real-world applications and datasets, and will deliver both a report on the results and re-usable code (in the form of libraries) developed as part of the project.
Explaining the science
In this project, the focus is on machine learning algorithms known as generative neural networks. These model a data-generating distribution by training on the original data, making it possible to generate new data that resemble the data they were trained on. The intuition, therefore, is that an entity trains and publishes the model, but not the original data, so that anybody can generate a synthetic dataset resembling the original data.
This project uses differential privacy to guarantee that the synthetic data generation does not reveal any sensitive information about the data used to train the generative model.
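As an illustration of how such a guarantee can be enforced during training, the sketch below shows the core of a DP-SGD-style update: each example's gradient is clipped to a fixed L2 norm, the clipped gradients are averaged, and Gaussian noise calibrated to the clipping norm is added. This is a minimal, self-contained toy (privately estimating a mean); the function and parameter names are illustrative assumptions, not the API of any real library:

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, clip_norm, noise_multiplier, lr, rng):
    """One step of differentially private SGD (illustrative sketch):
    1. clip each example's gradient to L2 norm <= clip_norm,
    2. average the clipped gradients,
    3. add Gaussian noise with scale noise_multiplier * clip_norm / batch size.
    """
    clipped = [g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
               for g in per_example_grads]
    avg = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm / len(clipped),
                       size=avg.shape)
    return params - lr * (avg + noise)

# Toy use: privately fit the mean of a small dataset by minimising
# squared error; the per-example gradient is simply (theta - x).
rng = np.random.default_rng(0)
data = np.array([4.0, 5.0, 6.0, 5.0])
theta = np.zeros(1)
for _ in range(300):
    grads = [np.array([theta[0] - x]) for x in data]
    theta = dp_sgd_step(theta, grads, clip_norm=1.0,
                        noise_multiplier=0.1, lr=0.1, rng=rng)
```

The same clip-and-noise recipe is what makes a neural generative model's training differentially private: the noise masks any single example's contribution to each gradient update.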
Differential privacy formalises the idea that a query should not reveal whether any one person is present in a dataset, much less what their data are. It addresses the paradox of learning nothing about an individual while learning useful information about a population. It aims to provide rigorous, statistical guarantees against what an adversary can infer from learning the result of some randomised algorithm.
It is a formal mathematical model of privacy protection, and is primarily used when analysing and releasing statistics over sensitive data.
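As a concrete example of these guarantees, the classic Laplace mechanism answers a numerical query by adding noise calibrated to the query's sensitivity. A counting query has sensitivity 1 (adding or removing one person changes the count by at most 1), so Laplace noise with scale 1/epsilon suffices for epsilon-differential privacy. The sketch below is illustrative, not part of the project's codebase:

```python
import numpy as np

def laplace_count(data, predicate, epsilon):
    """Answer a counting query with the Laplace mechanism.

    The query's sensitivity is 1, so noise drawn from
    Laplace(scale=1/epsilon) yields epsilon-differential privacy.
    """
    true_count = sum(1 for row in data if predicate(row))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# E.g. a noisy answer to "how many people are over 40?"
ages = [34, 45, 29, 61, 52, 38]
noisy_answer = laplace_count(ages, lambda a: a > 40, epsilon=1.0)
```

Smaller values of epsilon give stronger privacy but noisier answers; this accuracy/privacy trade-off is exactly what the project evaluates for synthetic data generators.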
In a number of realistic settings, organisations might need or want to provide access to their datasets, e.g. to monetise them or to allow third parties with the appropriate expertise to analyse them. A common approach is to share an 'anonymised' version of the dataset, which replaces the original data in potentially privacy-sensitive analytics tasks.
Unfortunately, traditional anonymisation models, such as k-anonymity and its extensions, are not effective on high-dimensional data, providing either poor utility or insufficient privacy guarantees.
A more promising approach is offered by generative models based on deep neural networks: these model the data-generating distribution by training on the original data. Thus, entities can publish the model but not the original data, and anybody can generate a synthetic dataset resembling the original (training) data as closely as possible. However, off-the-shelf generative models provide no privacy guarantees: they overfit to specific training samples by implicitly memorising them.
This project plans to investigate the feasibility of supporting the release of models as well as the generation of synthetic datasets while providing strong, differentially private guarantees. The main objectives include:
- Survey existing work on privacy-preserving synthetic data release and identify candidate techniques
- For each candidate, add differential privacy guarantees to the model and the data release where these are missing
- Compare, both analytically and empirically, different methods for differentially private synthetic data generation
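For the empirical comparison, one simple utility metric is the total variation distance between the one-way marginals of a categorical column in the real versus the synthetic data: a distance of 0 means the synthetic data reproduces that column's distribution exactly. The helper below is an illustrative sketch of such a metric, not part of the project's codebase:

```python
import numpy as np

def marginal_tvd(real, synth):
    """Total variation distance between the 1-way marginals of a
    categorical column in real vs. synthetic data (0 = identical
    distributions, 1 = disjoint support)."""
    cats = sorted(set(real) | set(synth))
    p = np.array([real.count(c) / len(real) for c in cats])
    q = np.array([synth.count(c) / len(synth) for c in cats])
    return 0.5 * np.abs(p - q).sum()

# E.g. synthetic data that over-samples category 'b':
dist = marginal_tvd(['a', 'a', 'b', 'b'], ['a', 'b', 'b', 'b'])  # → 0.25
```

Metrics of this kind, computed per attribute and averaged, give one axis of the empirical comparison; downstream task accuracy (e.g. training a classifier on synthetic data and testing on real data) gives another.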
Public entities and private organisations alike may often be willing, or required, to provide access to their datasets, enabling analysis by third-parties with the appropriate expertise. Government bodies might need to share data with other organisations that can extract knowledge from it (e.g. for informing policy decisions, health and safety monitoring, security intelligence, etc), while companies might be interested in monetising rich datasets or simply 'donating' them to science.
On the one hand, the ability to share datasets enables a number of compelling applications and analytics. On the other hand, useful datasets are likely to be sensitive in nature, contain 'personally identifiable information' (PII), or otherwise be vulnerable to inferences that might endanger users' privacy. Disclosing such information might therefore violate data protection statutes such as the GDPR. The work in this project will help counteract these potential pitfalls.
- Evaluated the models on the Adult and German Credit datasets from the UCI Machine Learning Repository
October - November 2018
- Identified candidate models for evaluation
- Identified and obtained the datasets for evaluation