The Alan Turing Institute

# Private information retrieval with SHEEP: A Homomorphic Encryption Evaluation Platform

Tuesday 11 Dec 2018

## Research area

In this post we describe the SHEEP project and platform by means of an example. SHEEP is a recursive acronym for “SHEEP is a Homomorphic Encryption Evaluation Platform,” and is under development by researchers at the Turing Institute.

Consider the following problem: I want to host a database remotely, but the data within it are sensitive and must be encrypted. In addition, I want clients who access the database not to reveal which records were accessed. This last requirement especially sounds impossible, short of transmitting the entire database. This is in fact a classical problem in cryptography known as Private Information Retrieval (PIR) and there are several ways to solve it. It turns out that we can construct a protocol for private information retrieval based on a technology known as Homomorphic Encryption.

Homomorphic Encryption (HE) allows us to compute on encrypted data without decrypting it. Let's give an example. You can encrypt the number 23 and obtain an encryption ? ? ?, and encrypt 19 and obtain an encryption ? ? ?. Homomorphic encrytion schemes then provide a way to add the two encrypted values to obtain a new encryption, say ? ? ? + ? ? ? = ? ? ?, that when decrypted equals 23 + 19 = 42. Crucially, whoever operated on the encrypted values (the fruit) did not have to be able to decrypt them or otherwise make sense of them in order to perform the addition correctly.

This enables secure outsourced computations, where Alice can encrypt her data as ? ? ? ? ? ? ?, upload to a private computing platform (let's call it BlindFruit™), and obtain the result of a computation again encrypted. This differs from existing cloud computing frameworks in that BlindFruit™ never learnt anything about Alice's data despite computing on it. This is as awesome as it sounds. Of course, in practice it is also way more costly than computing in the clear. Helping to easily determine how much more costly is one of the goals of the SHEEP project.

In the latest HE schemes, ciphertexts are not strings of fruit but polynomials, but we don't have cute emojis for those...

We will use the property of Homomorphic Encryption to compute on private data to construct a Private Information Retrieval protocol. Schematically, our PIR protocol with Homomorphic Encryption will look something like this:

At the Alan Turing Institute we have been experimenting with Homomorphic Encryption for almost a year now and contributing to its more practical engineering aspects. The outcome is SHEEP: a platform for practitioners to evaluate the state-of-the-art of (fully) homomorphic encryption technology in the context of their own concrete application.

Encryption schemes with homomorphic properties have been around for a while. Some schemes are additive homomorphic (meaning we can operate on encrypted values to obtain the encryption of their sum, e.g. Paillier), and others are multiplicative homomorphic. Some encryption schemes are both additive and multiplicative, and we call those Fully Homomorphic Encryption (FHE) schemes. Before FHE schemes were known, several Somewhat Homomorphic Encryption schemes were described, which can support only a limited fixed number of multiplications (see this presentation and article). SHEEP supports a number of fully homomorphic encryption libraries (currently HElib, SEAL and TFHE) as well as libpallier.

Returning to the Private Information Retrieval problem, the version of PIR in SHEEP's native benchmarks works like this: a server holds a dataset $d$ of $n$ integers encrypted under the client's key. For the server to select an element obliviously, the client sends an encrypted selection binary vector $s$ of length $n$ containing $0$s everywhere except for a $1$ in the position $i$ of the element to be selected. The server then computes the dot product of $d$ and $s$ using the homomorphism, with the encrypted result $r = \sum_{j = 0}^n d_j\cdot s_j$ and sends it to the client, who can decrypt it.

Note that this protocol requires an amount of communication linear in the size of the database (since the selection vector must have as many elements as the database has records). To achieve sublinear communication at the cost of additional computation on the server one can use a nice trick, which we will describe a little later.

We can express this simple dot-product protocol for a database of a specific size (here, eight records) in an assembly-like language which can be run by the SHEEP platform:

INPUTS d_0 d_1 d_2 d_3 d_4 d_5 d_6 d_7 s_0 s_1 s_2 s_3 s_4 s_5 s_6 s_7
OUTPUTS r
s_0 d_0 MULTIPLY c_0
s_1 d_1 MULTIPLY c_1
s_2 d_2 MULTIPLY c_2
s_3 d_3 MULTIPLY c_3
s_4 d_4 MULTIPLY c_4
s_5 d_5 MULTIPLY c_5
s_6 d_6 MULTIPLY c_6
s_7 d_7 MULTIPLY c_7

This results in a single value (r) being returned that holds the selected element from the database.

The meaning of this program should be fairly self-explanatory: after a list of the inputs and outputs of the program are a list of statements of the form a1 a2 ... aN OP b, which defines b to be the result of evaluating the operation OP with a1 a2 ... aN as arguments. Variables are required to be defined before they are used and never redefined (requirements that mean it is in SSA form). One usually wants to generate such code programatically.

## PIR with sublinear communication

First, the server organizes the database $d$ into an $\alpha$-dimensional $\lambda_1 \times \cdots \times \lambda_\alpha$ hyperrectangle $d'$. Think of $d'$ as the result of folding the database $d$ repeatedly $\alpha$ times, where the $j$th folding arranges the database into $\lambda_j$ parts of equal size. In this way, every entry in the database $d$ is addressed in $d'$ by an array $q$ of length $\alpha$, where $q_i\in\{0, \ldots, \lambda_i - 1\}$. Note that if $\lambda_1 = \cdots = \lambda_\alpha = 2$ then $q$ is a binary array of length $\log(|d|)$.

Hence, a query $q = (q_1, \ldots, q_\alpha)$ on the database $d$ can be recursively answered as a query $q_{rec} = (q_2, \ldots, q_\alpha)$ on a database $d_{rec}$ defined as follows:

$$d_{rec} = \left(\sum_{i=0}^{\lambda_1} d_{i\cdot c + j}\cdot s_i\right)_{j\in\{1, \ldots, c\}}$$

Here, the selection vector $s\in\{0,1\}^{\lambda_1}$ is the one-hot encoding of the number $q_1$, i.e. a binary array of length $\lambda_1$ with $0$s everywhere except for the $q_1$th position, and $c = \frac{|d|}{\lambda_0}$. In the base case, i.e. when $|q| = 0$, the answer to the query is simply the whole database.

The general idea behind the recursive approach to PIR is to have the client encode its query $q$ as selection vectors $s_1, \ldots, s_\alpha$: one-hot encodings of $q_1, \ldots, q_\alpha$. The client sends the $s_i$s encrypted using an encryption scheme $E$ for which only the client knows the private key. Then, the server runs the whole recursive evaluation described in the equation above using the encrypted $s_i$s, and returns the encrypted answer to the query to the client, who can decrypt it. This approach gives sublinear communication, and in fact logarithmic communication in the case where $\lambda_1 = \cdots = \lambda_\alpha = 2$.

Finally, note that this approach imposes some constraints on $E$. In particular, an encryption scheme supporting homomorphic addition and $\alpha$ homomorphic multiplications is required.

### How to do it with SHEEP

As an example, we show the SHEEP code for the case $|d| = 8, \vec{\lambda} = (2, 2, 2)$:

INPUTS d_0 d_1 d_2 d_3 d_4 d_5 d_6 d_7 s_00 s_01 s_10 s_11 s_20 s_21
OUTPUTS r
s_00 d_0 MULTIPLY c_00
s_00 d_1 MULTIPLY c_01
s_00 d_2 MULTIPLY c_02
s_00 d_3 MULTIPLY c_03
s_01 d_4 MULTIPLY c_04
s_01 d_5 MULTIPLY c_05
s_01 d_6 MULTIPLY c_06
s_01 d_7 MULTIPLY c_07
s_10 e_00 MULTIPLY c_10
s_10 e_01 MULTIPLY c_11
s_11 e_02 MULTIPLY c_12
s_11 e_03 MULTIPLY c_13