Private information retrieval with SHEEP: A Homomorphic Encryption Evaluation Platform

Tuesday 11 Dec 2018

Research area

In this post we describe the SHEEP project and platform by means of an example. SHEEP is a recursive acronym for “SHEEP is a Homomorphic Encryption Evaluation Platform,” and is under development by researchers at the Turing Institute.

Consider the following problem: I want to host a database remotely, but the data within it are sensitive and must be encrypted. In addition, I want clients who access the database not to reveal which records were accessed. This last requirement especially sounds impossible, short of transmitting the entire database. This is in fact a classical problem in cryptography known as Private Information Retrieval (PIR) and there are several ways to solve it. It turns out that we can construct a protocol for private information retrieval based on a technology known as Homomorphic Encryption.

Homomorphic Encryption (HE) allows us to compute on encrypted data without decrypting it. Let's give an example. You can encrypt the number 23 and obtain an encryption ? ? ?, and encrypt 19 and obtain an encryption ? ? ?. Homomorphic encrytion schemes then provide a way to add the two encrypted values to obtain a new encryption, say ? ? ? + ? ? ? = ? ? ?, that when decrypted equals 23 + 19 = 42. Crucially, whoever operated on the encrypted values (the fruit) did not have to be able to decrypt them or otherwise make sense of them in order to perform the addition correctly.

This enables secure outsourced computations, where Alice can encrypt her data as ? ? ? ? ? ? ?, upload to a private computing platform (let's call it BlindFruit™), and obtain the result of a computation again encrypted. This differs from existing cloud computing frameworks in that BlindFruit™ never learnt anything about Alice's data despite computing on it. This is as awesome as it sounds. Of course, in practice it is also way more costly than computing in the clear. Helping to easily determine how much more costly is one of the goals of the SHEEP project.

In the latest HE schemes, ciphertexts are not strings of fruit but polynomials, but we don't have cute emojis for those...

We will use the property of Homomorphic Encryption to compute on private data to construct a Private Information Retrieval protocol. Schematically, our PIR protocol with Homomorphic Encryption will look something like this:

At the Alan Turing Institute we have been experimenting with Homomorphic Encryption for almost a year now and contributing to its more practical engineering aspects. The outcome is SHEEP: a platform for practitioners to evaluate the state-of-the-art of (fully) homomorphic encryption technology in the context of their own concrete application.

Encryption schemes with homomorphic properties have been around for a while. Some schemes are additive homomorphic (meaning we can operate on encrypted values to obtain the encryption of their sum, e.g. Paillier), and others are multiplicative homomorphic. Some encryption schemes are both additive and multiplicative, and we call those Fully Homomorphic Encryption (FHE) schemes. Before FHE schemes were known, several Somewhat Homomorphic Encryption schemes were described, which can support only a limited fixed number of multiplications (see this presentation and article). SHEEP supports a number of fully homomorphic encryption libraries (currently HElib, SEAL and TFHE) as well as libpallier.

Returning to the Private Information Retrieval problem, the version of PIR in SHEEP's native benchmarks works like this: a server holds a dataset $d$ of $n$ integers encrypted under the client's key. For the server to select an element obliviously, the client sends an encrypted selection binary vector $s$ of length $n$ containing $0$s everywhere except for a $1$ in the position $i$ of the element to be selected. The server then computes the dot product of $d$ and $s$ using the homomorphism, with the encrypted result $r = \sum_{j = 0}^n d_j\cdot s_j$ and sends it to the client, who can decrypt it.

Note that this protocol requires an amount of communication linear in the size of the database (since the selection vector must have as many elements as the database has records). To achieve sublinear communication at the cost of additional computation on the server one can use a nice trick, which we will describe a little later.

We can express this simple dot-product protocol for a database of a specific size (here, eight records) in an assembly-like language which can be run by the SHEEP platform:

INPUTS d_0 d_1 d_2 d_3 d_4 d_5 d_6 d_7 s_0 s_1 s_2 s_3 s_4 s_5 s_6 s_7
s_0 d_0 MULTIPLY c_0
s_1 d_1 MULTIPLY c_1
s_2 d_2 MULTIPLY c_2
s_3 d_3 MULTIPLY c_3
s_4 d_4 MULTIPLY c_4
s_5 d_5 MULTIPLY c_5
s_6 d_6 MULTIPLY c_6
s_7 d_7 MULTIPLY c_7
c_0 c_1 ADD p_1
p_1 c_2 ADD p_2
p_2 c_3 ADD p_3
p_3 c_4 ADD p_4
p_4 c_5 ADD p_5
p_5 c_6 ADD p_6
p_6 c_7 ADD r

This results in a single value (r) being returned that holds the selected element from the database.

The meaning of this program should be fairly self-explanatory: after a list of the inputs and outputs of the program are a list of statements of the form a1 a2 ... aN OP b, which defines b to be the result of evaluating the operation OP with a1 a2 ... aN as arguments. Variables are required to be defined before they are used and never redefined (requirements that mean it is in SSA form). One usually wants to generate such code programatically.

PIR with sublinear communication

First, the server organizes the database $d$ into an $\alpha$-dimensional $\lambda_1 \times \cdots \times \lambda_\alpha$ hyperrectangle $d'$. Think of $d'$ as the result of folding the database $d$ repeatedly $\alpha$ times, where the $j$th folding arranges the database into $\lambda_j$ parts of equal size. In this way, every entry in the database $d$ is addressed in $d'$ by an array $q$ of length $\alpha$, where $q_i\in\{0, \ldots, \lambda_i - 1\}$. Note that if $\lambda_1 = \cdots = \lambda_\alpha = 2$ then $q$ is a binary array of length $\log(|d|)$.

Hence, a query $q = (q_1, \ldots, q_\alpha)$ on the database $d$ can be recursively answered as a query $q_{rec} = (q_2, \ldots, q_\alpha)$ on a database $d_{rec}$ defined as follows:

$$d_{rec} = \left(\sum_{i=0}^{\lambda_1} d_{i\cdot c + j}\cdot s_i\right)_{j\in\{1, \ldots, c\}}$$

Here, the selection vector $s\in\{0,1\}^{\lambda_1}$ is the one-hot encoding of the number $q_1$, i.e. a binary array of length $\lambda_1$ with $0$s everywhere except for the $q_1$th position, and $c = \frac{|d|}{\lambda_0}$. In the base case, i.e. when $|q| = 0$, the answer to the query is simply the whole database.

The general idea behind the recursive approach to PIR is to have the client encode its query $q$ as selection vectors $s_1, \ldots, s_\alpha$: one-hot encodings of $q_1, \ldots, q_\alpha$. The client sends the $s_i$s encrypted using an encryption scheme $E$ for which only the client knows the private key. Then, the server runs the whole recursive evaluation described in the equation above using the encrypted $s_i$s, and returns the encrypted answer to the query to the client, who can decrypt it. This approach gives sublinear communication, and in fact logarithmic communication in the case where $\lambda_1 = \cdots = \lambda_\alpha = 2$.

Finally, note that this approach imposes some constraints on $E$. In particular, an encryption scheme supporting homomorphic addition and $\alpha$ homomorphic multiplications is required.

How to do it with SHEEP

As an example, we show the SHEEP code for the case $|d| = 8, \vec{\lambda} = (2, 2, 2)$:

INPUTS d_0 d_1 d_2 d_3 d_4 d_5 d_6 d_7 s_00 s_01 s_10 s_11 s_20 s_21
s_00 d_0 MULTIPLY c_00
s_00 d_1 MULTIPLY c_01
s_00 d_2 MULTIPLY c_02
s_00 d_3 MULTIPLY c_03
s_01 d_4 MULTIPLY c_04
s_01 d_5 MULTIPLY c_05
s_01 d_6 MULTIPLY c_06
s_01 d_7 MULTIPLY c_07
c_00 c_04 ADD e_00
c_01 c_05 ADD e_01
c_02 c_06 ADD e_02
c_03 c_07 ADD e_03
s_10 e_00 MULTIPLY c_10
s_10 e_01 MULTIPLY c_11
s_11 e_02 MULTIPLY c_12
s_11 e_03 MULTIPLY c_13
c_10 c_12 ADD e_10
c_11 c_13 ADD e_11
s_20 e_10 MULTIPLY c_20
s_21 e_11 MULTIPLY c_21
c_20 c_21 ADD r

where as before d is the database and s is the selection vector and the result is returned in r. Note that there are now only six selection elements (the ss) instead of eight in the simpler scheme, although there are more arithmetic operations.

In this article we have given some simple examples of using the SHEEP language to describe privacy-preserving computation. In the next article in this series, we will describe the various components of the SHEEP platform and how they fit together, and how they can be used to evaluate several built-in benchmarks.

This work was supported by the Defence and Security, and the Data Science at Scale research programmes.

Cover photo by Sam Carter on Unsplash.