

# Hardware Challenges in Homomorphic Encryption

Sujoy Sinha Roy sujoy.sinharoy@iaik.tugraz.at



## **Privacy-Preserving Outsourcing of Computation**



#### Diabetic Retinopathy [Chao et al., 2019]

User wants to compute *foo(data*) in the cloud without loosing privacy.

# **Definition: Homomorphic Encryption Scheme**

An encryption scheme  $Enc(\cdot, \cdot)$  is homomorphic for an operation  $\Box$  on the message space iff

$$Enc(m_1 \square m_2, k_E) = Enc(m_1, k_E) \circ Enc(m_2, k_E)$$

with o operation on the ciphertext.

- If  $\Box$  = + then Enc( $\cdot$ ,  $\cdot$ ) is additively homomorphic.
- If  $\Box = \times$  then Enc( $\cdot$ ,  $\cdot$ ) is multiplicatively homomorphic.

#### **Example: Textbook RSA is multiplicatively homomorphic**

• You have encryption of two messages m<sub>1</sub> and m<sub>2</sub> where

 $c_1 = m_1^e \mod N$  $c_2 = m_2^e \mod N$ 

• By multiplying  $c_1$  and  $c_2$  you get  $c_3 = c_1 \cdot c_2 = (m_1 \cdot m_2)^e \mod N$ 

• Hence,  $c_3$  is encryption of  $m_1 \cdot m_2$ 

# **Fully Homomorphic Encryption (FHE)**

An encryption scheme  $Enc(\cdot, \cdot)$  is homomorphic for an operation  $\Box$  on the message space iff

$$Enc(m_1 \square m_2, k_E) = Enc(m_1, k_E) \circ Enc(m_2, k_E)$$

An encryption scheme is called Fully Homomorphic Encryption (FHE) when it supports both + and × on ciphertexts.

- If  $\Box = +$  then Enc( $\cdot$ ,  $\cdot$ ) is additively homomorphic.
- If  $\Box = \times$  then Enc( $\cdot$ ,  $\cdot$ ) is multiplicatively homomorphic.

### **Recap -- Ring LWE Public-Key Encryption (PKE)**

#### **Generation:**

**Output:** public key (pk), secret key (sk)



Arithmetic operations are performed in a polynomial ring R<sub>q</sub> **Public Key (pk):** (a,b) **Secret Key (sk):** (s)

V. Lyubashevsky, C. Peikert, and O. Regev. "On Ideal Lattices and Learning with Errors Over Rings". IACR ePrint 2012/230.

## **Recap -- Ring LWE Public-Key Encryption (PKE)**



#### **Recap -- Ring LWE Public-Key Encryption (PKE)**



Select most significant bit of each coefficient as the message bits

#### **Ring-LWE PKE – Written with different symbols**

Secret key: polynomial s Public-key: polynomials  $(p_0, p_1)$ Plaintext modulus: 2 Ciphertext modulus: q Scale factor:  $\Delta = q/2$ 

#### Encryption

Decryption

 $e_0, e_1, u \leftarrow error();$   $ct_0 = p_0 \cdot u + e_1 + \Delta \cdot m$  $ct_1 = p_1 \cdot u + e_2$ 



Polynomials are in blue Scalars are in red

#### **Ring-LWE PKE shows Homomorphism**



Now consider two ciphertexts  $Ct_A = \{ct_{A0}, ct_{A1}\}$  and  $Ct_B = \{ct_{B0}, ct_{B1}\}$ 

 $e_{A0}, e_{A1}, u_A \leftarrow error();$   $ct_{A0} = p_0 \cdot u_A + e_{A1} + \Delta \cdot m_A$  $ct_{A1} = p_1 \cdot u_A + e_{A2}$   $e_{B0}, e_{B1}, u_{B} \leftarrow error();$   $ct_{B0} = p_{0} \cdot u_{B} + e_{B1} + \Delta \cdot m_{B}$  $ct_{B1} = p_{1} \cdot u_{B} + e_{B2}$ 

#### **Ring-LWE PKE: Additive Homomorphism**





















# **The Biggest Problem in FHE**



#### **Parameters for PQC and FHE**



Like Public-key encryption, FHE does lots of polynomial arithmetic.

#### How to design a hardware accelerator for FHE?

## What makes implementation of FHE very challenging?

- Lots of polynomial arithmetic operations
  - Large degree polynomial arithmetic
  - Long integer arithmetic
- Big operands
  - Ciphertexts could be several MBs
- Memory management in HW accelerators
  - On-Chip memory is limited
  - Off-Chip data transfer is very slow

## What makes implementation of FHE very challenging?

- Lots of polynomial arithmetic operations
  - Large degree polynomial arithmetic
  - Long integer arithmetic
     This problem is solved using CRT
- Big operands
  - Ciphertexts could be several MBs
- Memory management in HW accelerators
  - On-Chip memory is limited
  - Off-Chip data transfer is very slow

## **Dealing with long-int coefficients using RNS**

We can take a modulus  $q = \prod q_i$  where  $q_i$  are coprime. Then we can work with Residue Number System (RNS). Chinese Arithmetic *mod*  $q_{0}$ Arithmetic *mod*  $q_1$ Remainder Arithmetic *mod q* Theorem ... Arithmetic *mod*  $q_1$ (CRT) **RNS** arithmetic Result mod q Small coefficients

• Parallel computation

## What makes implementation of FHE very challenging?

- Lots of polynomial arithmetic operations
  - Large degree polynomial arithmetic
  - Long integer arithmetic
- Big operands
  - Ciphertexts could be several MBs
- Memory management in HW accelerators
  - On-Chip memory is limited
  - Off-Chip data transfer is very slow

# How to multiply two very large polynomials?

- Schoolbook multiplication:  $O(n^2)$
- Karatsuba multiplication:  $O(n^{1.585})$
- Toom-Cook (generalization of Karatsuba)
- Fast Fourier Transform (FFT) multiplication: O(n log n)

#### Which one is the best choice?

# How to multiply two very large polynomials?

- Schoolbook multiplication:  $O(n^2)$
- Karatsuba multiplication:  $O(n^{1.585})$
- Toom-Cook (generalization of Karatsuba)
- Fast Fourier Transform (FFT) multiplication: O(n log n)

#### Which one is the best choice?

Asymptotic complexity plays its role.

## **NTT-based Polynomial Multiplication**



#### NTT or Number Theoretic Transform

Let's consider an application example.

Polynomial size  $n = 2^{15}$ Log(q) = 60

#### NTT and of a polynomial A[] Simplified NTT loops

```
A[n-1]
A[n-2]
 A[3]
 A[2]
 A[1]
 A[0]
```

for (m=2; m<=n; m=2m) {</pre> for (j=0; j<=m/2-1; j++) {</pre> for(k=0; j<n; k=k+m) {</pre> index = f(m, j, k); Butterfly(A[index],A[index+m/2]);

Simplified NTT loops



Simplified NTT loops



Simplified NTT loops



Simplified NTT loops



**Simplified NTT loops** 



# Can we speedup polynomial multiplication using several NTT cores in parallel?

Answer: Yes

Can we speedup polynomial multiplication using several NTT cores in parallel?

Answer: Yes

Is parallel NTT easy to implement?

Answer: Complexity of implementation increases with number of cores

#### Parallel NTT Challenge: Port limitation in BRAM or SRAM



Problem:

- One BRAM has only two ports.
- Each NTT core needs two ports

#### Parallel NTT Challenge: Port limitation in BRAM or SRAM



#### Problem:

- One BRAM has only two ports.
- Each NTT core needs two ports

To get parallel NTT, designers instantiate *parallel* BRAMs in parallel.

#### Memory access conflict

• Two or more cores try to read/write the same BRAM element. But BRAM has a limited number of ports to satisfy one core.



#### Memory access conflict

• Two or more cores try to read/write the same BRAM element. But BRAM has a limited number of ports to satisfy one core.

> BRAM ... BRAM Two cores are trying to access the same BRAM. BRAM

Solution: Cores generate addresses such that they are mutually exclusive.

#### Long data routing

- Core requires data from distant BRAM memory
  - Long routing of data wires  $\rightarrow$  slow clock frequency



Core is reading data from far memory.

#### Long data routing

- Core requires data from distant BRAM memory
  - Long routing of data wires  $\rightarrow$  slow clock frequency



Core is reading data from far memory.

This paper localizes the read operation.

BRAM is exclusively read by only one core.



#### **Next, FHE accelerator**



#### **High level computation flow**



Each thread perform arithmetic in residue polynomial ring  $R_{qi}$ 

#### **High level computation flow**



### **High level computation flow**



#### **High level accelerator architecture**



 $\frac{1}{RPAU_{0}()} \qquad RPAU_{1}() \qquad \dots L \text{ parallel modules} \qquad \frac{1}{RPAU_{L-1}()} \qquad \text{diagram}$ 

\*RPAU stands for Residue Polynomial Arithmetic Unit

# RPAU ()

module RPAU ( ) Each RPAU() module must support arithmetic modulo q\_i

- NTT
- INTT
- Modular reduction by q\_i
- Coefficient-wise modular addition
- Coefficient-wise modular multiplication

# RPAU ()



Example RPAU. It uses 16 NTT butterfly cores and 4 coefficient-wise (dyadic) arithmetic cores. Polynomials are stored in 'Memory' made of BRAMs.

# Instruction Parallelism in RPAU ()

 $\tilde{d}_{0,j} \leftarrow \tilde{c}_{0,j} \star \tilde{c}'_{0,j}$ HE. Mult  $ilde{d}_{1,j} \leftarrow ilde{c}_{0,j} \star ilde{c}_{1,j}' + ilde{c}_{1,j} \star ilde{c}_{0,j}'$  $\tilde{d}_{2,j} \leftarrow \tilde{c}_{1,j} \star \tilde{c}_{1,j}'$  $\{c_{0,i}'',c_{1,i}''\} \leftarrow 0$ HE. Relin  $d_{2,i} \leftarrow \text{INTT}(d_{2,i})$ for i = 0 to L - 1 do Obtain  $d_{2,i}$  from RPAU<sub>i</sub>  $r_{2,i} \leftarrow \texttt{Coeff.Reduce}(d_{2,i}, q_j)$  $\tilde{t} \leftarrow \operatorname{NTT}(r_{2,i})$  $c_{0,i}'' \gets c_{0,i}'' + \mathtt{KSK}_{0,i} \star \tilde{t}$  $c_{1,i}'' \leftarrow c_{1,i}'' + \texttt{KSK}_{1,i} \star \tilde{t}$ end for  $(d_{0,i}, d_{1,i}) \leftarrow |c'' \cdot p^{-1}|$ 

Homomorphic multiplication & key-switching. (The most expensive operation) Parallel execution of instructions



This reduces 40% cycle count

#### **Placement of RPAUs**

CRT requires combining the residues.

 $\rightarrow$  Therefore, RPAUs need to communicate with each other

How to interconnect the RPAUs in large 3D FPGAs?



#### Large SLR FPGA

Large FPGAs are multi-die

- > The FPGA is split into four SLRs.
- Connected by a limited number of wires.



#### Large SLR FPGA – top view



# There are a limited number of interconnects.

Large design cannot be spread arbitrarily across SLRs.

Xilinx Alveo U250 FPGA. This FPGA is 1000x larger than the FPGA used in this course.

- FPGA Constraints
  - > The FPGA is split into four SLRs.
  - Connected by a limited number of wires.
- Some operations require exchanging the residue polynomials between RPAUs
- Naïve solution: A "star-like" network





- FPGA Constraints
  - > The FPGA is split into four SLRs.
  - Connected by a limited number of wires.
- Some operations require exchanging the residue polynomials between RPAUs
- Naïve solution: A "star-like" network

#### Each RPAU has its own connections





- FPGA Constraints
  - > The FPGA is split into four SLRs.
  - Connected by a limited number of wires.
- Some operations require exchanging the residue polynomials between RPAUs
- Naïve solution: A "star-like" network

- Complicates the routing
- Large number of nets crossing the SLRs
- Reduces the clock frequency to around 50 MHz or less



- FPGA Constraints
  - > The FPGA is split into four SLRs.
  - Connected by a limited number of wires.
- Some operations require exchanging the residue polynomials between RPAUs
- Solution: A "ring" interconnection of RPAUs

- Only two neighbour RPAUs are connected.
- Data sent to an RPAU through a chain of RPAUs.
- No additional computation overhead



- FPGA Constraints
  - > The FPGA is split into four SLRs.
  - Connected by a limited number of wires.
- Some operations require exchanging the residue polynomials between RPAUs
- Placement of 10 RPAUs using "ring" interconnect





#### **Floorplan of the design**





#### **Full system overview**



Figure 8: CPU-FPGA interface and software stack

FPGA is used as an accelerator card of a server. HW/SW codesign is used to run applications.

#### **FPGA Acceleration results**



# **Our Group's research: Open Problems in FHE**

- 1. How to make hardware accelerators for larger parameter sets?
- 2. How to support different parameters?
- 3. How to support different FHE schemes?
- 4. How to implement FHE Bootstrapping?
- 5. From FPGA to ASIC accelerators
  - More parallel processing
  - Custom memory
  - Higher clock frequency and lower power consumption