Gaining access to data, and sharing it, is fundamental to all AI research, but privacy concerns can make doing so extremely difficult. Synthetically generated data offers a potential solution.
There are challenges, though. Synthetic data (SD) is not inherently private; generation methods must actively enforce privacy to achieve it, and enforcing privacy usually comes at a cost to utility. These trade-offs can only be understood in context: what is acceptable in one field may be actively harmful in another. The interest group aims to provide a platform for exchanging knowledge across the several active synthetic data generation projects at the Turing, and to enable discussion of what different fields with an interest in synthetic data find acceptable (and unacceptable).
Beyond privacy, SD has the potential to correct bias in data, to create larger datasets from small sample sizes, and to simulate plausible scenarios beyond historical data, which are needed to validate machine learning pipelines and increase their robustness.
The group aims to bring together teams interested in the development and science of synthetic data, to develop a shared framework of understanding, and to spread learnings among practitioners.
Explaining the science
As a leading privacy technology, differential privacy is of particular interest to the group, as are alternatives to it and criticisms of it. Efforts have been made to incorporate differential privacy into synthetic data generators such as GANs (generative adversarial networks), VAEs (variational auto-encoders), and Bayesian networks. The interest group will explore the variety of methods available.
Moreover, privacy has been shown to have some unexpected/undesirable side effects, such as a disproportionate impact on outliers and minority groups. The interest group will explore these types of drawbacks to existing privacy approaches.
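To make the core idea concrete, below is a minimal sketch of the Laplace mechanism, the simplest differentially private primitive. The function name and parameters are illustrative only, not taken from QUIPP or any other Turing project: the point is just that a query with bounded sensitivity can be released privately by adding calibrated noise, and that stronger privacy (smaller epsilon) means a noisier, less useful answer.

```python
import numpy as np

def dp_count(records, epsilon, rng=None):
    """Release the size of `records` with epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one record
    changes the true answer by at most 1), so Laplace noise with scale
    1/epsilon satisfies the epsilon-DP guarantee for this query.
    """
    rng = np.random.default_rng() if rng is None else rng
    return len(records) + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Smaller epsilon -> stronger privacy -> noisier (less useful) answer.
rng = np.random.default_rng(42)
weak_privacy = dp_count(range(100), epsilon=1.0, rng=rng)    # usually near 100
strong_privacy = dp_count(range(100), epsilon=0.01, rng=rng) # often far from 100
```

The same privacy-utility trade-off governs full synthetic data generators: each query against (or parameter learned from) the real data spends part of a privacy budget, and the noise required to stay within that budget degrades the fidelity of the resulting synthetic dataset.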
The objectives of the interest group are as follows:
- To connect synthetic data researchers with commercial and public sector practitioners so that academics can best understand the needs of real-world practitioners and so that industry partners can understand the limits and 'impossibilities' of SD
- To encourage researchers to build a common framework of understanding
- To identify and propagate opportunities for collaboration between projects and help researchers to build upon each other’s learnings
- To design an open source codebase building on preliminary work done within REG/TPS project on ‘Quantifying utility and preserving privacy in synthetic data sets’ (QUIPP)
The group will also explore open questions such as:
- Does synthetic data need to be created for specific purposes?
- Can you create synthetic data that is both useful and private?
- What are the benefits to researchers of synthetic data?
- How can synthetic data be used to correct bias in data?
- Can empirical evaluations of privacy be sufficient?
How to get involved