There are many forms of anonymisation, but none can remove the threat of an attacker recreating personal information using external resources. Choosing which technique to use requires an understanding of the attacker threat, what the shared data is to be used for, and the context in which it was both gathered and released. Anonymity is not solely a property of the data, but a function of the 'data environment' in which it is held. 'Provenance' is the record of creation and modification of data and processes. The goal is to use provenance information to identify and describe data environments so that the appropriate anonymisation techniques can be applied.
Explaining the science
Recent work (Elliot et al. 2018) has introduced the concept of 'functional anonymisation', which states that risk lies not in the properties of the data on their own, but in the relationship between the data and their context, called the 'data environment'. A data environment can be characterised by four parameters: the agents with access to the data; the supplementary data which can be integrated with the data; the infrastructure in which the data is stored and processed; and the governance of the data. Anonymity is therefore not solely a property of the data, but a function of the data environment(s) in which it is held. Anonymisation can be reversed when someone with appropriate supplementary data gains access and performs the data integration needed to re-identify some or all of the people in the dataset.
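The re-identification route described above can be illustrated with a toy linkage check. This is a minimal sketch and not part of the cited work: the `DataEnvironment` class, the `reidentifiable` function, the field names, and the sample records are all hypothetical, and the four fields simply mirror the four parameters of a data environment. It treats a released record as at risk when it matches exactly one supplementary record on a chosen set of quasi-identifiers, which is a deliberate simplification of real disclosure-risk assessment.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class DataEnvironment:
    """Hypothetical container for the four data-environment parameters."""
    agents: list          # who can access the data
    supplementary: list   # other records (dicts) that could be linked to the data
    infrastructure: str   # where the data is stored and processed
    governance: str       # rules under which the data is held


def reidentifiable(released, env, keys):
    """Return released records matching exactly one supplementary record on `keys`."""
    def signature(record):
        return tuple(record.get(k) for k in keys)

    counts = Counter(signature(r) for r in env.supplementary)
    return [r for r in released if counts[signature(r)] == 1]


# The same released data, placed in two different (invented) environments.
released = [
    {"age_band": "30-39", "postcode_area": "M13", "diagnosis": "asthma"},
    {"age_band": "40-49", "postcode_area": "M14", "diagnosis": "diabetes"},
]
register = [
    {"name": "A. Example", "age_band": "30-39", "postcode_area": "M13"},
    {"name": "B. Example", "age_band": "40-49", "postcode_area": "M14"},
    {"name": "C. Example", "age_band": "40-49", "postcode_area": "M14"},
]
open_env = DataEnvironment(["public"], register, "open download", "none")
secure_env = DataEnvironment(["vetted researcher"], [], "secure lab", "sharing agreement")

quasi_identifiers = ["age_band", "postcode_area"]
at_risk_open = reidentifiable(released, open_env, quasi_identifiers)
at_risk_secure = reidentifiable(released, secure_env, quasi_identifiers)
print(len(at_risk_open), len(at_risk_secure))  # prints "1 0"
```

The released data is identical in both cases; only the environment differs, which is the point functional anonymisation makes.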
'Provenance' is the record of creation and modification of data and processes. It has many uses, including debugging, scientific reproducibility, and establishing trust in data. The Alan Turing Institute's 'Symposium on Reproducibility for Data Intensive Research' Final Report notes that Turing should investigate low-overhead provenance collection systems and utilise W3C PROV (Groth & Moreau 2013). W3C PROV is an interoperability standard for provenance that defines agents, entities, activities, and the relationships between them. Using provenance (as described by PROV), it is possible to trace where data came from and how it was processed. Recently, the notion of 'prescriptive provenance' has gained traction; prescriptive provenance describes how the data should flow, based on instances of the current data flow.
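Tracing where data came from can be sketched with a small in-memory graph. This is an illustrative toy, not the W3C PROV data model itself and not a real library API: the `ProvGraph` class and its methods are hypothetical, and only the relation names (`used`, `wasGeneratedBy`, `wasAssociatedWith`) are borrowed from PROV. The entity, activity, and agent names in the usage lines are invented.

```python
from dataclasses import dataclass, field


@dataclass
class ProvGraph:
    """Toy store for PROV-style statements about entities, activities and agents."""
    entities: set = field(default_factory=set)
    activities: set = field(default_factory=set)
    agents: set = field(default_factory=set)
    used: list = field(default_factory=list)                 # (activity, entity)
    was_generated_by: list = field(default_factory=list)     # (entity, activity)
    was_associated_with: list = field(default_factory=list)  # (activity, agent)

    def record(self, activity, agent, inputs, outputs):
        """Record one processing step: who ran it, what it read, what it wrote."""
        self.activities.add(activity)
        self.agents.add(agent)
        self.was_associated_with.append((activity, agent))
        for entity in inputs:
            self.entities.add(entity)
            self.used.append((activity, entity))
        for entity in outputs:
            self.entities.add(entity)
            self.was_generated_by.append((entity, activity))

    def lineage(self, entity):
        """Return every entity that `entity` was directly or indirectly produced from."""
        seen = set()
        frontier = [entity]
        while frontier:
            current = frontier.pop()
            for produced, activity in self.was_generated_by:
                if produced != current:
                    continue
                for act, source in self.used:
                    if act == activity and source not in seen:
                        seen.add(source)
                        frontier.append(source)
        return seen


graph = ProvGraph()
graph.record("extract", "hospital_dba", ["patient_db"], ["raw_extract"])
graph.record("anonymise", "data_manager", ["raw_extract"], ["research_release"])

sources = graph.lineage("research_release")
print(sorted(sources))  # prints ['patient_db', 'raw_extract']
```

A production system would more likely serialise such statements in one of the PROV formats so that other tools can consume them; the plain lists here only keep the example self-contained.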
Hence the intention of functional anonymisation is to configure anonymisation as a means of privacy protection through risk management, and the intention of provenance is to facilitate warranted trust in the processes of data creation and management. Putting the two together enables provenance to be used as a means of modelling the data environment(s), making it a key risk management tool for functional anonymisation.
The aim of this work is to provide an understanding of how data environments can be expressed and reasoned over using the PROV interoperability standard. These findings will be applied to real use cases of data environments, data flows, and anonymisation requirements to test whether these techniques can be used to automatically identify appropriate anonymisation techniques for a given situation. A further aim is to identify how prescriptive provenance can serve as a contractual agreement on how data will be used after sharing, and how it can support compliance with GDPR and other data privacy legislation.
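One way to picture provenance acting as a contractual agreement is to compare the flows that were agreed in advance with the flows actually recorded after sharing. The following is a minimal sketch under that assumption, not a description of the project's method: the `Flow` tuple, the `violations` function, and all names in the sample data are hypothetical.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    """One permitted or observed use of data: who did what to which entity."""
    agent: str
    activity: str
    entity: str


def violations(permitted, observed):
    """Return observed flows that the agreement does not cover, in the order seen."""
    allowed = set(permitted)
    return [flow for flow in observed if flow not in allowed]


# Agreed uses of the shared data (invented for illustration).
agreement = [
    Flow("researcher", "aggregate", "research_release"),
    Flow("researcher", "publish", "summary_table"),
]
# Provenance actually recorded after sharing (also invented).
log = [
    Flow("researcher", "aggregate", "research_release"),
    Flow("researcher", "link", "research_release"),
]

breaches = violations(agreement, log)
for flow in breaches:
    print(f"{flow.agent} performed unagreed activity '{flow.activity}' on {flow.entity}")
```

Exact matching is the simplest possible policy; a realistic agreement would probably need patterns or classes of agents and activities rather than an explicit list.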
Expected outcomes include:
- Tutorial on using provenance to improve data protection and anonymisation decisions at ProvenanceWeek2020.
- Workshops with data providers and other stakeholders, in order to disseminate and critique early results in a critical-friendly environment, and to ensure the practicality and real-world focus of the work.
- Publication, in both provenance and privacy/data protection venues, of work detailing the mapping of data environments using W3C PROV.
- Set of compiled and released use cases that highlight requirements for functional anonymisation with associated provenance.
- A refinement of the Anonymisation Decision-Making Framework (ADF), which is currently being updated for GDPR. The work reported here will feed into many of the components of the ADF to add formalisms and expressivity. The result should be an extended ADF with expressive tools to accompany the process.
All corporate and governmental entities that wish to share data but must protect the privacy of the data subjects should utilise the approaches developed in this work.
For example, a university hospital system holds patient information that must be protected. Much of that information can nevertheless be made available for research with appropriate protections, such as anonymisation. Unfortunately, anonymisation is never perfect, and the choice of technique depends on how and where the data was collected and how and where it is being sent. The techniques provided by this research will enable data administrators to manage the risks and select the most suitable techniques, by facilitating more accurate modelling of both the threat and the risk.