A look at ‘privacy preserving data analysis’ from Turing Fellow Jon Crowcroft and Research Fellow Adria Gascon.
Decisions informed by the analysis of personal data are becoming a ubiquitous part of modern life: what advertisements we see online, which of our incoming emails are discarded, what conditions are attached to our insurance, which new drugs are developed. This data is collected not just from social media and online services, but from sensors in smart homes, cars, cities, and health devices.
While many of these processes have reasonable commercial goals, the risks of leakage or misuse are growing, as the recent Cambridge Analytica scandal shows. As researchers at the Alan Turing Institute, we are at the forefront of developing and evaluating the tools and techniques that are attempting to limit these risks. This article takes a look at the current state of so-called ‘privacy-preserving data analysis’ and how it can potentially be implemented, and is taken from our forthcoming paper Analytics without tears, written for the Institute of Electrical and Electronics Engineers.
There are currently many efforts to regulate the protection of sensitive information, such as the EU’s General Data Protection Regulation, which aim to place obligations on data controllers and data processors, and to specify users’ rights. Various Turing researchers are producing influential work on data governance and how such regulation should be implemented. However, specific algorithms are rarely mentioned in these regulations and we are far from effective standardisation guidelines.
The issue is that privacy is a slippery concept, and pinning it down requires a robust, mathematically rigorous approach. In response to this need, privacy-preserving analytics has emerged as a very active research topic, which aims to ensure personal data can be used to its fullest potential without compromising our privacy. The research spans several fields, including machine learning, databases, cryptography, hardware systems, and statistics.
Research in these fields has already produced viable technical solutions for encryption – changing data from a readable form to a protected form – for data at rest on disk and data as it’s being transmitted. The key challenges currently lie in preserving privacy during the actual processing of data.
Let’s say you upload your data encrypted to the cloud, but still allow computations to be performed on it by service providers, such as for training machine learning models, or for selecting ads tailored to you. How do we ensure that these computations protect your private data from breaches? A number of techniques are emerging to address this, many of which may only be viable in combination with each other.
Secure enclaves – Trusted hardware that provides a secure container into which the cloud user can upload encrypted private data, securely decrypt it, and compute on it. Both the decryption and the computation run in a processor which, in principle, cannot be broken into even by its owner. Whilst promising, the approach still has limitations in its security guarantees and scalability.
Homomorphic encryption – Allows a cloud provider to compute on encrypted data – as if it were computing blindfolded – and return the encrypted results to the data owner, who alone can decrypt them. It does not currently scale to massive input sizes, but it is a very useful tool in privacy-preserving data analysis pipelines. Libraries are available, but standardisation and systematic evaluation are needed.
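As a concrete illustration, here is a minimal sketch of the Paillier cryptosystem, a classic additively homomorphic scheme. The parameters here are deliberately tiny and completely insecure (real deployments use keys of thousands of bits); the point is only the homomorphic property itself.

```python
import math
import random

# Toy Paillier keypair from tiny, insecure primes -- illustration only.
p, q = 101, 103
n = p * q
n2 = n * n
lam = math.lcm(p - 1, q - 1)   # Carmichael's function of n
mu = pow(lam, -1, n)           # modular inverse of lam, valid since gcd(lam, n) == 1

def encrypt(m: int) -> int:
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:             # randomness must be a unit mod n
        r = random.randrange(2, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    return (pow(c, lam, n2) - 1) // n * mu % n

# The homomorphic property: multiplying ciphertexts adds the plaintexts,
# so a server can total encrypted values without ever seeing them.
c1, c2 = encrypt(42), encrypt(58)
print(decrypt((c1 * c2) % n2))  # 100
```

A server holding only `c1` and `c2` can compute the ciphertext of the sum; only the holder of the secret key (`lam`, `mu`) can read the result.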
Multi-Party Computation (MPC) – Rather than providing cryptographic guarantees in the cloud, this provides them for computations over data held by several parties. MPC enables, for example, distributed computations where a set of hospitals compute models using their patients’ data without disclosing their respective datasets to each other or any other party.
» Related Turing publication about developing efficient dedicated protocols with better performance than generic MPC techniques
MPC is related to the challenge of edge computing, where instead of moving all the data to a single server (where it might be leaked), we can leave data on people’s devices (smart homes, smart TVs, cars, tablets, etc) and distribute the programmes that do the analytics in a privacy-preserving way. This moves only the results (e.g. market segment statistics) to the businesses that wish to exploit them, without ever moving the raw personal data anywhere at all. MPC techniques are now quite efficient, with many available libraries and applications, and even commercial products.
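The hospitals scenario above can be sketched with additive secret sharing, one of the simplest building blocks of MPC. The party count and patient numbers here are made up for illustration: each hospital splits its private count into random shares, so no single compute server ever sees a real value.

```python
import random

PRIME = 2**61 - 1  # all arithmetic happens in a finite field

def share(value, n_parties):
    """Split `value` into n_parties random shares that sum to it mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

# Three hypothetical hospitals, each holding a private patient count.
counts = [120, 340, 95]
all_shares = [share(c, 3) for c in counts]

# Compute server i receives only column i: one random-looking share per hospital.
partial_sums = [sum(column) % PRIME for column in zip(*all_shares)]

# Recombining the partial sums reveals the aggregate, and nothing else.
total = sum(partial_sums) % PRIME
print(total)  # 555
```

Each individual share is uniformly random, so a server learns nothing about any hospital's count; only the recombined total is disclosed.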
The techniques above do not address the problem of quantifying how much information about a person is disclosed by the results of a computation. Simply removing identifying fields (e.g. name, birthday, postcode) from a database and replacing them with pseudo-random values, so-called ‘de-identification’, doesn’t work in general: because many different organisations hold overlapping records, it’s often possible to link data from different sources and infer who a subject is.
For example, in 2007 Netflix released a large collection of its viewers’ film ratings as part of a competition to optimise its recommendations, removing people’s names and other identifying details and publishing only the ratings. However, researchers were able to cross-reference the Netflix data with public review data on IMDb, matching up similar patterns of ratings between the sites to add names back into Netflix’s supposedly anonymous database.
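A toy version of this linkage attack, with entirely made-up data, shows how little is needed once two sources share a distinctive pattern:

```python
# Hypothetical "de-identified" ratings, keyed by pseudonym, alongside a
# public review site where the same people rated films under their names.
anon_ratings = {
    "user_093": {"Heat": 5, "Alien": 4, "Up": 2},
    "user_117": {"Heat": 1, "Alien": 5, "Up": 5},
}
public_reviews = {
    "Alice": {"Heat": 5, "Alien": 4, "Up": 2},
    "Bob":   {"Heat": 1, "Alien": 5, "Up": 5},
}

# Linkage attack: re-attach names by matching rating patterns across sources.
reidentified = {
    pseud: name
    for pseud, ratings in anon_ratings.items()
    for name, reviews in public_reviews.items()
    if ratings == reviews
}
print(reidentified)  # {'user_093': 'Alice', 'user_117': 'Bob'}
```

No identifying field is needed at all: the rating pattern itself acts as a fingerprint, which is exactly what made the Netflix dataset linkable.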
The development of differential privacy (DP) aims to tackle this issue by mathematically quantifying how much a given data analysis reveals about any individual. Mechanisms include filtering the data collected, or adding carefully calibrated random noise that obscures each person’s sensitive information while preserving aggregate statistics.
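One standard DP building block is the Laplace mechanism, which perturbs a query answer with noise scaled to the query's sensitivity and the privacy parameter epsilon. This sketch assumes a simple counting query over made-up data:

```python
import math
import random

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Release true_value plus Laplace noise of scale sensitivity/epsilon,
    sampled via the inverse CDF of the Laplace distribution."""
    scale = sensitivity / epsilon
    u = random.random() - 0.5
    return true_value - scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

# Counting query over a toy dataset: how many people are over 30?
ages = [23, 35, 41, 29, 52, 38]
true_count = sum(1 for a in ages if a > 30)   # sensitivity 1: one record
                                              # changes the count by at most 1
noisy_count = laplace_mechanism(true_count, sensitivity=1, epsilon=0.5)
print(round(noisy_count, 2))
```

Smaller values of epsilon give stronger privacy at the cost of noisier answers; the released value is close to the true count on average, but no single answer pins down whether any one person is in the data.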
There have been some applications of differential privacy by big data controllers such as Google and Apple, but a clear path to standardisation doesn’t exist yet, often due to issues with modelling and parameterisation choices. Turing Fellow Graham Cormode is currently conducting work on potential approaches to DP, with applications to telecommunications and social data.
» Related Turing publication on combining MPC and DP
The technologies mentioned above are not yet mature enough to be fully standardised, and related issues like personal data management and consent need standards too. Even if we agree on which protocols to use, and how to implement them, deploying such protocols in practice always requires a set of supporting services, and a reasonable incentive system must be in place for parties to provide them.
Furthermore, what one means by ‘safe’ needs to be not only rigorously established, but also effectively communicated by, for example, a ‘privacy level’ equivalent of the British Standards Kitemark.
It is also important to note that technical advances in general do not solve all ethical issues, and privacy is no exception. Every data analysis raises some ethical questions about privacy, and these must be confronted as such.
Several of the privacy-enhancing techniques discussed here have the potential to revolutionise the data landscape. These techniques have different trade-offs, maturity levels, and privacy guarantees, and in some cases solve slightly different problems. A fully fledged approach to privacy-preserving data analysis requires a significant interdisciplinary effort.
Fortunately, at the Turing we have researchers from a broad range of disciplines working across data privacy, encryption, and governance. Through multi-institution research projects, working with industry partners such as Intel, and sharing skills and knowledge in research interest groups, we will be at the forefront of developing a unified approach to secure, privacy-preserving data analysis as well as finding an effective, mathematically robust definition of privacy.
Together, we need to mitigate the growing risks of privacy failures, but also enable the exciting opportunities that computing on private data can yield.