Synthetic data generators (SDGs) use algorithms to produce entirely new data points that preserve the statistical features of the original data. SDGs offer a naturally private way to generate high-quality data. Among other benefits, they enable users to share data, to work with data in safe environments, to fix structural deficiencies in data, to increase the size of datasets, and to validate machine learning systems by generating adversarial scenarios.
Explaining the science
At the Turing we place special emphasis on combining conventional models with deep generative models. This enables us to develop data generation processes that are not just accurate but also efficient and explainable.
Among other models, we develop generative adversarial networks, variational autoencoders, recurrent neural networks, and autoregressive models for different structured data formats such as cross-sectional, time-series, and graph data.
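The deep models named above are beyond a short snippet, but the underlying fit-then-sample idea can be illustrated with a much simpler autoregressive model. The sketch below, which is illustrative only and not one of the Turing's actual generators, fits an AR(1) process to a time series and then samples an entirely new synthetic series with the same dynamics (`fit_ar1` and `generate_ar1` are hypothetical helper names).

```python
import numpy as np

def fit_ar1(series):
    """Estimate AR(1) coefficient phi and noise scale by least squares."""
    x, y = series[:-1], series[1:]
    phi = np.dot(x, y) / np.dot(x, x)
    sigma = (y - phi * x).std()
    return phi, sigma

def generate_ar1(phi, sigma, n, x0=0.0, seed=None):
    """Sample a brand-new synthetic series from the fitted process."""
    rng = np.random.default_rng(seed)
    out = np.empty(n)
    prev = x0
    for t in range(n):
        prev = phi * prev + rng.normal(0.0, sigma)
        out[t] = prev
    return out

# Simulate a "real" AR(1) series with phi = 0.8, fit it, then sample.
rng = np.random.default_rng(0)
real = np.empty(500)
real[0] = 0.0
for t in range(1, 500):
    real[t] = 0.8 * real[t - 1] + rng.normal(0.0, 1.0)

phi, sigma = fit_ar1(real)
synthetic = generate_ar1(phi, sigma, 500, seed=1)
```

The synthetic series shares the real series' autocorrelation structure while containing none of its actual values, which is the property that makes generated data shareable.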
Many SDGs can also be made private by applying differentially private mechanisms at the optimisation and model-parameter level. Unlike de-identification methods such as data masking, shuffling, and encryption, SDGs leave little scope for adversaries to recover personal information.
SDGs use algorithms that preserve the original data’s statistical features while producing new data points, without the one-to-one matching seen with de-identification methods.
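As a minimal sketch of how a generator can be made differentially private, the example below applies the Laplace mechanism to a single summary statistic (a clipped mean) and then drives generation from the noisy statistic rather than the raw records. This is illustrative only; real DP training (e.g. noising gradients during optimisation) is more involved, and `dp_mean` is a hypothetical helper.

```python
import numpy as np

def dp_mean(values, lower, upper, epsilon, seed=None):
    """Release a differentially private mean via the Laplace mechanism.

    After clipping n values to [lower, upper], the sensitivity of the
    mean is (upper - lower) / n, so Laplace noise with scale
    sensitivity / epsilon gives epsilon-differential privacy.
    """
    rng = np.random.default_rng(seed)
    clipped = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(clipped)
    return clipped.mean() + rng.laplace(0.0, sensitivity / epsilon)

rng = np.random.default_rng(0)
incomes = rng.normal(30_000, 5_000, size=10_000)  # toy "real" data

# The generator only ever sees the private estimate, never raw records.
mu = dp_mean(incomes, 0, 100_000, epsilon=1.0, seed=1)
synthetic = np.random.default_rng(2).normal(mu, 5_000, size=10_000)
```

Because the synthetic records are sampled from a noisy summary rather than matched one-to-one to real individuals, no single record in the output corresponds to a real person.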
Regardless of function and purpose, a synthetic data generation system includes the pre-processing of data, the development of synthesisers, and a feedback mechanism in the form of utility, similarity, and privacy measures. In our view, the pipeline should be considered in its entirety to identify the best models, model parameters, and evaluation metrics.
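The feedback mechanism can be as simple as a distributional distance between real and synthetic marginals. The sketch below, an illustrative similarity measure rather than the project's actual metric suite, computes a two-sample Kolmogorov–Smirnov statistic: the maximum gap between the two empirical CDFs, where 0 means identical distributions.

```python
import numpy as np

def ks_statistic(real, synthetic):
    """Two-sample KS statistic: max gap between the empirical CDFs."""
    grid = np.sort(np.concatenate([real, synthetic]))
    cdf_r = np.searchsorted(np.sort(real), grid, side="right") / len(real)
    cdf_s = np.searchsorted(np.sort(synthetic), grid, side="right") / len(synthetic)
    return np.max(np.abs(cdf_r - cdf_s))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 2_000)
good = rng.normal(0.0, 1.0, 2_000)  # well-calibrated generator output
bad = rng.normal(2.0, 1.0, 2_000)   # poorly calibrated generator output
```

Feeding such a score back into model selection is what closes the loop between the synthesiser and its evaluation.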
SDGs are an emerging technology and user acceptance remains low, even though they have shown numerous quantifiable benefits. This project aims to increase industry adoption using ground-breaking research from within the Turing network.
Anonymisation and data generation methods can be costly. This project seeks to lower the barrier to entry, both in terms of cost and usability, to further advance industry adoption.
Datasets are often stored in silos spread across organisations and are not easy to share with outside entities (e.g. the academic community) or with different departments within organisations. The main roadblocks to sharing are privacy constraints and regulatory requirements. It is therefore critical to investigate methods for synthesising datasets that mirror the properties of the original data.
- Develop state-of-the-art data generators for both structured and unstructured datasets.
- Develop metrics for evaluating the utility, similarity, and privacy of synthetic datasets across multiple use cases.
- Provide methodologies for assessing utility and privacy trade-offs.
- Enable data sharing between organisations and different departments within an organisation.
- Establish systems to train and validate machine learning models under adversarial scenarios.
- Communicate the benefits of data generators and reduce the overall barrier to entry.
The Turing Institute is in the process of partnering with a number of institutions to develop generative models. Collaborating with a wide range of partners will enable us to develop and validate SDGs on common industry problems, allowing us to compare and promote the best solutions.
Among other benefits, synthetic data helps users to share data with others, use data in unsafe environments, validate models, increase dataset size, and fix structural deficiencies in data.
Synthetic data can be shared between companies, departments and research units for synergistic benefits.
By using synthetic data, organisations can store the relationships and statistical patterns of their data, without having to store individual level data.
The data can be used in unsafe environments to test systems, validate models, or develop applications and dashboards.
Samples can be generated for extreme scenarios that are rare in current datasets; these can help rebalance the data, leading to improved prediction models.
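A minimal form of this rebalancing idea is to oversample the rare class with small perturbations, a crude stand-in for SMOTE-style interpolation rather than a full conditional generator. In the sketch below (`oversample_minority` is a hypothetical helper), a rare scenario making up 2% of the data is boosted to parity with the majority class.

```python
import numpy as np

def oversample_minority(X, n_new, noise_scale=0.05, seed=None):
    """Generate synthetic minority rows by resampling with Gaussian jitter.

    Each new point is an existing minority point plus small noise, so the
    augmented class stays in the same region of feature space.
    """
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X), size=n_new)
    return X[idx] + rng.normal(0.0, noise_scale, size=(n_new, X.shape[1]))

rng = np.random.default_rng(0)
majority = rng.normal(0.0, 1.0, size=(1_000, 2))
minority = rng.normal(3.0, 1.0, size=(20, 2))  # rare scenario: 2% of rows

extra = oversample_minority(minority, n_new=980, seed=1)
balanced_minority = np.vstack([minority, extra])
```

Training a classifier on the balanced set typically improves recall on the rare scenario, at the cost of some added variance from the jitter.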
Instead of looking backwards at historical data, organisations can simulate customer-level transactions using conditional sequential generative models, which would ordinarily not be allowed without user consent under GDPR.
Synthetic data can be used to train data-hungry algorithms in small-data environments, or for data sets with severe imbalances. It can also be used to train and validate machine learning models under adversarial scenarios.