

# Hardware Acceleration Opportunities in Homomorphic Encryption

Cryptography on Hardware Platform 2023 Sujoy Sinha Roy <u>sujoy.sinharoy@iaik.tugraz.at</u>

## **Privacy-Preserving Outsourcing of Computation**



### Diabetic Retinopathy [Chao et al., 2019]

User wants to compute *foo(data*) in the cloud without loosing privacy.

# **Fully Homomorphic Encryption (FHE)**

### FHE enables computation on encrypted data



Dec() gives foo(data)

Cloud *homomorphically* evaluates *foo*()

### **Tutorial outline**

- 1. FHE concepts
- 2. Parallel processing opportunities in FHE (from high-level)
- 3. Hardware architecture design challenges and methods
- 4. Results

# **Definition: Homomorphic Encryption**

An encryption scheme  $Enc(\cdot, \cdot)$  is homomorphic for an operation  $\Box$  on the message space iff

$$Enc(m_1 \square m_2, k_E) = Enc(m_1, k_E) \circ Enc(m_2, k_E)$$

with o operation on the ciphertext.

- If  $\Box$  = + then Enc( $\cdot$ ,  $\cdot$ ) is additively homomorphic.
- If  $\Box = \times$  then Enc( $\cdot$ ,  $\cdot$ ) is multiplicatively homomorphic.

### **Example: Textbook RSA is multiplicatively homomorphic**

• You have encryption of two messages m<sub>1</sub> and m<sub>2</sub> where

 $c_1 = m_1^e \mod N$  $c_2 = m_2^e \mod N$ 

• By multiplying  $c_1$  and  $c_2$  you get  $c_3 = c_1 \cdot c_2 = (m_1 \cdot m_2)^e \mod N$ 

• Hence,  $c_3$  is encryption of  $m_1 \cdot m_2$ 

Can we get 'Additive & Multiplicative' Homomorphic Encryption?

Popular constructions of FHE use augmented Ring-LWE public-key encryption

### **Recap -- Ring LWE Public-Key Encryption (PKE)**

**Encryption:** 

□ Input:  $pk = (p_0, p_1)$ , message m □ Output:  $ct = (ct_0, ct_1)$ 



### **Recap -- Ring LWE Public-Key Encryption (PKE)**



$$ct_0 + ct_1.s = m' = Enc(m) + (e.s' + e'' + e'.s)$$
$$= Enc(m) + e_{small}$$

### **Recap -- Ring LWE Public-Key Encryption (PKE)**



$$ct_0 + ct_1.s = m' = Enc(m) + (e.s' + e'' + e'.s)$$
$$= Enc(m) + e_{small}$$

Equivalently,

$$\frac{\operatorname{ct}_0 + \operatorname{ct}_1 \cdot \mathrm{s}}{\mathrm{q}/2} \mod 2 = \mathrm{m}$$

### **Ring-LWE PKE – Written with different symbols**

Let scale factor  $\Delta = q/t$  and t be plaintext modulus, e.g., t = 2. Scalars are in red. All polynomials are in blue.



### **Ring-LWE PKE shows Homomorphism**



Now consider two ciphertexts  $Ct_A = \{ct_{A0}, ct_{A1}\}$  and  $Ct_B = \{ct_{B0}, ct_{B1}\}$ 

 $e_{A0}, e_{A1}, u_A \leftarrow error();$   $ct_{A0} = p_0 \cdot u_A + e_{A0} + \Delta \cdot m_A$  $ct_{A1} = p_1 \cdot u_A + e_{A1}$   $e_{B0}, e_{B1}, u_{B} \leftarrow error();$   $ct_{B0} = p_{0} \cdot u_{B} + e_{B0} + \Delta \cdot m_{B}$  $ct_{B1} = p_{1} \cdot u_{B} + e_{B1}$ 

### **Ring-LWE PKE: Additive Homomorphism**





















## **The Biggest Problem in FHE**



## **Polynomial size**



#### FHE does lots of (large) polynomial arithmetic.

How to accelerate FHE?

### **Tutorial outline**

- 1. FHE concepts
- 2. Parallel processing opportunities in FHE (from high-level)
- 3. Hardware architecture design challenges and methods
- 4. Results

### What makes acceleration of FHE very challenging?

- Lots of polynomial arithmetic operations
  - Large degree polynomial arithmetic
  - Long integer arithmetic
- Memory management
  - Ciphertexts could be several MBs
  - On-Chip memory is limited
  - Off-Chip data transfer is very slow

### What makes acceleration of FHE very challenging?

- Lots of polynomial arithmetic operations
  - Large degree polynomial arithmetic
  - Long integer arithmetic This problem is solved using CRT
- Memory management
  - Ciphertexts could be several MBs
  - On-Chip memory is limited
  - Off-Chip data transfer is very slow

### **Dealing with long-int coefficients using RNS**

- 1. Take a modulus  $Q = \prod_{i=1}^{L-1} q_i$  where  $q_i$  are coprime.
- 2. Use Residue Number System (RNS).



### E.g., Parallel computation flow with CRT



### ... (1) Residue polynomial arithmetic layer



\*RPAU stands for 'Residue Polynomial Arithmetic Unit'

### ... (2) Residue polynomial layer $\iff$ CRT layer



Each thread in CRT layer combines *all threads* from previous layer.

### ... (3) Residue polynomial layer $\iff$ CRT layer

Therefore, threads or RPAUs need to exchange data with each other.



### **Tutorial outline**

- 1. FHE concepts
- 2. Parallel processing opportunities in FHE (from high-level)

### 3. Hardware architecture design challenges and methods

4. Results

## **System-level view**





# How to multiply two very large polynomials?

- Schoolbook multiplication:  $O(n^2)$
- Karatsuba multiplication:  $O(n^{1.585})$
- Toom-Cook (generalization of Karatsuba)
- Fast Fourier Transform (FFT) multiplication: O(n log n)

### FFT is the best choice

Asymptotic complexity plays its role.

### **NTT-based Polynomial Multiplication**



#### NTT or Number Theoretic Transform is special FFT with integers.

Let's consider an application example.

Polynomial size n =  $2^{15}$ And *log*( $q_i$ ) = 60

#### NTT and of a polynomial A[] Simplified NTT loops



### **NTT and Memory access**

**Simplified NTT loops** 


Simplified NTT loops



Simplified NTT loops



Simplified NTT loops



Simplified NTT loops



# Can we speedup polynomial multiplication using several NTT cores in parallel?

Answer: Yes

Can we speedup polynomial multiplication using several NTT cores in parallel?

Answer: Yes

Is parallel NTT easy to implement?

Answer: Complexity of implementation increases with number of cores

#### Parallel NTT Challenge 1: Port limitation in BRAM or SRAM



#### Problem:

- One BRAM has only two ports.
- Each NTT core needs two ports

#### Parallel NTT Challenge 1: Port limitation in BRAM or SRAM



#### Problem:

- One BRAM has only two ports.
- Each NTT core needs two ports

#### Solution: Use BRAMs in parallel.

New problem: How to distribute data?

## Parallel NTT Challenge 2: Memory access conflicts

Two or more cores try to read/write the same BRAM element. But BRAM has a limited number of ports to satisfy one core.



## Parallel NTT Challenge 2: Memory access conflicts

Two or more cores try to read/write the same BRAM element. But BRAM has a limited number of ports to satisfy one core.



## Parallel NTT Challenge 3: Data routing

Core requires data from distant BRAM memory

- Long routing of data wires  $\rightarrow$  slow clock frequency



## **Parallel NTT Challenge 3: Data routing**

Core requires data from distant BRAM memory

Long routing of data wires  $\rightarrow$  slow clock frequency



Solution: There is no solution to this problem. Localizing read or write (not both) the read operation.

Compute Core-C \* Bus Switching Matrix Data-write paths are . heavily pipelined. This paper localizes Compute Core-1 BRAM is exclusively read by only one Pipeline register core. Compute Coefficient of a polynomial \* Core-0 \* NTT Cores **BRAMs** 

Mert et al. "Medha: Microcoded Hardware Accelerator for computing on Encrypted Data". TCHES 2023

## **System-level view: Main challenges**



Next topic: Memory management

#### **Memory organization and management**



**Common techniques** 

• Lots of on-chip memory (BRAM/SRAM) for storing operands

#### **Memory organization and management**



**Common techniques** 

- Lots of on-chip memory (BRAM/SRAM) for storing operands
- Perform communication-computation parallelism using cache

### **Memory organization and management**



**Common techniques** 

- Lots of on-chip memory (BRAM/SRAM) for storing operands
- Perform communication-computation parallelism using cache
- High-bandwidth off-chip memory and with multiple channels

#### **Tutorial outline**

- 1. FHE concepts
- 2. Parallel processing opportunities in FHE (from high-level)
- 3. Hardware architecture design challenges and methods
  Implementation
- 4. Results

#### Implementations

There are two main tracks

- 1. True accelerator prototype in ASIC/FPGA
- 2. Simulation-based modelling of accelerator

Real HW prototypes: HEAWS<sup>[1]</sup>, HEAX<sup>[2]</sup>, CoFHEE<sup>[3]</sup>, Medha<sup>[4]</sup>

Simulation-based works: F1<sup>[5]</sup>, BTS<sup>[6]</sup>, CraterLake<sup>[7]</sup>, ...

[1] Furkan Turan et al. HEAWS: an accelerator for homomorphic encryption on the amazon AWS FPGA. IEEE ToC, 2020.
 [2] M. Sadegh Riazi et al. HEAX: an architecture for computing on encrypted data. ASPLOS 2020.
 [3] Mohammed Nabeel et al. CoFHEE: A Co-processor for Fully Homomorphic Encryption Execution. DATE 2023.
 [4] Mert et al. Medha: Microcoded Hardware Accelerator for computing on Encrypted Data. CHES 2023.
 [5] Axel Feldmann et al. F1: A fast and programmable accelerator for fully homomorphic encryption. MICRO 2021.
 [6] Sangpyo Kim et al. BTS: An Accelerator for Bootstrappable Fully Homomorphic Encryption. ISCA 2022.
 [7] Samardzic et al. CraterLake: A Hardware Accelerator for Efficient Unbounded Computation on Encrypted Data. ISCA 2022.

Briefly talk about

#### **Next, FHE accelerator**



### **High level computation flow**



#### ... Residue polynomial arithmetic layer



\*RPAU stands for Residue Polynomial Arithmetic Unit

#### ... Residue polynomial layer $\iff$ CRT layer



Each thread in CRT layer combines *all threads* from previous layer.

#### ... Residue polynomial layer 🔶 CRT layer

Therefore, RPAUs need to exchange data with each other.



## RPAU ()



Example RPAU. It uses 16 NTT butterfly cores and 4 coefficient-wise (dyadic) arithmetic cores. Polynomials are stored in 'Memory' made of BRAMs.

Mert et al. "Medha: Microcoded Hardware Accelerator for computing on Encrypted Data". TCHES 2023

## Instruction Parallelism in RPAU ()

 $\tilde{d}_{0,j} \leftarrow \tilde{c}_{0,j} \star \tilde{c}_{0,j}'$ HE. Mult  $ilde{d}_{1,j} \leftarrow ilde{c}_{0,j} \star ilde{c}_{1,j}' + ilde{c}_{1,j} \star ilde{c}_{0,j}'$  $\tilde{d}_{2,j} \leftarrow \tilde{c}_{1,j} \star \tilde{c}_{1,j}'$  $\{c_{0,i}'',c_{1,i}''\} \leftarrow 0$ HE. Relin  $d_{2,i} \leftarrow \text{INTT}(d_{2,i})$ for i = 0 to L - 1 do Obtain  $d_{2,i}$  from RPAU<sub>i</sub>  $r_{2,i} \leftarrow \texttt{Coeff.Reduce}(d_{2,i}, q_j)$  $\tilde{t} \leftarrow \operatorname{NTT}(r_{2,i})$  $c_{0,i}'' \gets c_{0,i}'' + \mathtt{KSK}_{0,i} \star \tilde{t}$  $c_{1,i}'' \leftarrow c_{1,i}'' + \texttt{KSK}_{1,i} \star \tilde{t}$ end for  $(d_{0,i}, d_{1,i}) \leftarrow |c'' \cdot p^{-1}|$ 

Homomorphic multiplication & key-switching. (The most expensive operation)

#### **Parallel execution of instructions**



This reduces 40% cycle count

#### **Placement of RPAUs**

CRT requires combining the residues.

 $\rightarrow$  Therefore, RPAUs need to communicate with each other

How to interconnect the RPAUs in large 3D FPGAs?



### Large SLR FPGA

Large FPGAs are multi-die

- > The FPGA is split into four SLRs.
- Connected by a limited number of wires.



#### Large SLR FPGA – top view



## There are a limited number of interconnects.

Large design cannot be spread arbitrarily across SLRs.

Xilinx Alveo U250 FPGA. This FPGA is 1000x larger than the FPGA used in this course.

- FPGA Constraints
  - > The FPGA is split into four SLRs.
  - Connected by a limited number of wires.
- Some operations require exchanging the residue polynomials between RPAUs
- Naïve solution: A "star-like" network





- FPGA Constraints
  - > The FPGA is split into four SLRs.
  - Connected by a limited number of wires.
- Some operations require exchanging the residue polynomials between RPAUs
- Naïve solution: A "star-like" network

#### Each RPAU has its own connections





- FPGA Constraints
  - > The FPGA is split into four SLRs.
  - Connected by a limited number of wires.
- Some operations require exchanging the residue polynomials between RPAUs
- Naïve solution: A "star-like" network

- Complicates the routing
- Large number of nets crossing the SLRs
- Reduces the clock frequency to around 50 MHz or less



- FPGA Constraints
  - > The FPGA is split into four SLRs.
  - Connected by a limited number of wires.
- Some operations require exchanging the residue polynomials between RPAUs
- Solution: A "ring" interconnection of RPAUs

- Only two neighbour RPAUs are connected.
- Data sent to an RPAU through a chain of RPAUs.
- No additional computation overhead



#### Mert et al. "Medha: Microcoded Hardware Accelerator for computing on Encrypted Data". TCHES 2023

- FPGA Constraints
  - > The FPGA is split into four SLRs.
  - Connected by a limited number of wires.
- Some operations require exchanging the residue polynomials between RPAUs
- Placement of 10 RPAUs using "ring" interconnect





#### **Floorplan of the design**




## **Full system overview**



Figure 8: CPU-FPGA interface and software stack

FPGA is used as an accelerator card of a server. HW/SW codesign is used to run applications.

Mert et al. "Medha: Microcoded Hardware Accelerator for computing on Encrypted Data". TCHES 2023

## **FPGA Acceleration results**



Mert et al. "Medha: Microcoded Hardware Accelerator for computing on Encrypted Data". TCHES 2023

## **Our Group's research: Open Problems in FHE**

- 1. How to make hardware accelerators for larger parameter sets?
- 2. How to support different parameters?
- 3. How to support different FHE schemes?
- 4. How to implement FHE Bootstrapping?
- 5. From FPGA to ASIC accelerators
  - More parallel processing
  - Custom memory
  - Higher clock frequency and lower power consumption