Introduction
In this project, the team will build on long-standing expertise in synthetic data generation and develop novel metrics and methods for evaluating synthetic data. The metrics will quantitatively and robustly characterise the fidelity, diversity, and generalisation performance of any synthetic data model in a domain-agnostic fashion.
Explaining the science
Machine learning has the potential to catalyse complete transformations in many critical domains, such as healthcare. However, researchers are often hamstrung by a lack of access to high-quality data due to (perfectly valid) concerns regarding privacy. Synthetic data is an extremely promising but as-yet underexplored solution to this problem.
Current notions of ‘quality’ for synthetic data are poorly defined. They are further complicated by the many use cases for such data, each of which comes with its own performance metrics and its own quality and privacy requirements.
Project aims
This project will:
- Develop new metrics for quantifying the performance and privacy of synthetic data
- Focus on metrics suitable for assessing synthetic time-series data
- Develop principled ways of deriving metrics for specific use cases
- Create novel privacy measures that provide probabilistic guarantees for data leakage
- Develop methods and pipelines for auditing synthetic data models to ensure that they fulfil desired requirements in terms of performance and privacy
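To make the idea of separate fidelity and diversity scores concrete, the sketch below shows one simple, generic way such scores could be computed. It is not the project's method: it is a nearest-neighbour coverage heuristic chosen purely for illustration. The function names (`knn_radii`, `coverage`, `fidelity_and_diversity`) and the choice of `k` are assumptions introduced here. Fidelity is taken to be the share of synthetic points that fall near the real data, and diversity the share of real points that are covered by the synthetic data.

```python
import math


def knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbour in the same set.

    Assumes len(points) > k. Illustrative helper, not part of any library.
    """
    radii = []
    for i, p in enumerate(points):
        # Distances to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii


def coverage(queries, support, k=3):
    """Fraction of queries lying inside at least one k-NN ball of a support point."""
    radii = knn_radii(support, k)
    covered = sum(
        any(math.dist(q, s) <= r for s, r in zip(support, radii))
        for q in queries
    )
    return covered / len(queries)


def fidelity_and_diversity(real, synthetic, k=3):
    """Return (fidelity, diversity) under the coverage heuristic above."""
    fidelity = coverage(synthetic, real, k)   # synthetic samples that look real
    diversity = coverage(real, synthetic, k)  # real samples the generator reaches
    return fidelity, diversity


if __name__ == "__main__":
    # Toy data: ten real points on a line, and a "collapsed" generator that
    # only ever reproduces the first real point.
    real = [(float(i), 0.0) for i in range(10)]
    collapsed = [real[0]] * 5
    print(fidelity_and_diversity(real, collapsed))
```

In the toy example the collapsed generator scores well on fidelity, because every sample it emits sits on the real data, but poorly on diversity, because it reaches only one of the real points. A single combined score would hide that difference, which is one reason to report the two quantities separately. The same example also hints at why a privacy measure is needed alongside them: a generator that copies real records can look perfect on fidelity.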
Applications
- Model development with synthetic data addresses key privacy concerns (e.g. for patient or financial transaction records)
- Synthetic data could be used where limited data is available (e.g. smaller datasets with high utility could be expanded into much larger synthetic datasets)
- Simulation of forward-looking data (e.g. stock prices, sales, transactions) for novel entities that do not possess historical data
Contact info
Tony Zemaitis
[email protected]