

# Side-Channel Security

Chapter 2: Caches & Cache Attacks

#### **Daniel Gruss**

March 14, 2024

Graz University of Technology



## TLB and Paging

- Paging: memory translated page-wise from virtual to physical
- TLB (translation lookaside buffer) caches virtual to physical mapping
- TLB has some latency

## **TLB** and Paging

- Paging: memory translated page-wise from virtual to physical
- TLB (translation lookaside buffer) caches virtual to physical mapping
- TLB has some latency
- Worst case for Cache: mapping not in TLB, need to load mapping from RAM
- Solution: Use virtual addresses instead of physical addresses

## **Cache indexing methods**

- VIVT: Virtually indexed, virtually tagged
- PIPT: Physically indexed, physically tagged
- PIVT: Physically indexed, virtually tagged
- VIPT: Virtually indexed, physically tagged

## **VIVT**



## **VIVT**



• Shared memory more than once in cache

## **PIPT**



## **PIPT**



## (PIVT)



## (PIVT)



## **VIPT**



#### **VIPT**



- Using more bits is unpractical (like VIVT)
- $\rightarrow$  Cache size < # ways  $\cdot$  page size

#### Remarks

- L1 caches: VIVT or VIPT
- L2/L3 caches: PIPT



 $\textbf{step 0} : \textbf{attacker maps shared library} \rightarrow \textbf{shared memory, shared in cache}$ 



 $\textbf{step 0} : \mathsf{attacker \ maps \ shared \ library} \to \mathsf{shared \ memory}, \ \mathsf{shared \ in \ cache}$ 



 $\textbf{step 0} : \textbf{attacker maps shared library} \rightarrow \textbf{shared memory, shared in cache}$ 

step 1: attacker flushes the shared line



 $\textbf{step 0} : \mathsf{attacker \ maps \ shared \ library} \to \mathsf{shared \ memory}, \ \mathsf{shared \ in \ cache}$ 

step 1: attacker flushes the shared line

step 2: victim loads data while performing encryption



 $\textbf{step 0} \colon \mathsf{attacker \ maps \ shared \ library} \to \mathsf{shared \ memory, \ shared \ in \ cache}$ 

step 1: attacker flushes the shared line

step 2: victim loads data while performing encryption

**step 3**: attacker reloads data  $\rightarrow$  fast access if the victim loaded the line

Pros: fine granularity (1 line)

Cons: restrictive

- 1. needs clflush instruction (not available e.g., in JS)
- 2. needs shared memory

## Variants of Flush+Reload

- Flush+Flush [2]
- Evict+Reload [3] on ARM [5]



 $\textbf{step 0} : \mathsf{attacker fills the cache (prime)}$ 



 $\textbf{step 0} : \mathsf{attacker fills the cache (prime)}$ 



 $\textbf{step 0} : \mathsf{attacker fills the cache (prime)}$ 



step 0: attacker fills the cache (prime)



step 0: attacker fills the cache (prime)



step 0: attacker fills the cache (prime)



step 0: attacker fills the cache (prime)



 $\textbf{step 0} : \mathsf{attacker fills the cache (prime)}$ 



step 0: attacker fills the cache (prime)

 $\textbf{step 1}: \ \mathsf{victim} \ \mathsf{evicts} \ \mathsf{cache} \ \mathsf{lines} \ \mathsf{while} \ \mathsf{performing} \ \mathsf{encryption}$ 

step 2: attacker probes data to determine if the set was accessed



 $\textbf{step 0} : \mathsf{attacker fills the cache (prime)}$ 

 $\textbf{step 1}: \ \mathsf{victim} \ \mathsf{evicts} \ \mathsf{cache} \ \mathsf{lines} \ \mathsf{while} \ \mathsf{performing} \ \mathsf{encryption}$ 

step 2: attacker probes data to determine if the set was accessed



 $\textbf{step 0} : \mathsf{attacker fills the cache (prime)}$ 

 $\textbf{step 1}: \ \mathsf{victim} \ \mathsf{evicts} \ \mathsf{cache} \ \mathsf{lines} \ \mathsf{while} \ \mathsf{performing} \ \mathsf{encryption}$ 

step 2: attacker probes data to determine if the set was accessed

Pros: less restrictive

- 1. no need for clflush instruction (not available e.g., in JS)
- 2. no need for shared memory

Cons: coarser granularity (1 set)

#### Issues with Prime+Probe

We need to evict caches lines without clflush or shared memory:

- 1. which addresses do we access to have congruent cache lines?
- 2. without any privilege?
- 3. and in which order do we access them?

#### #1.1: Which physical addresses to access?



#### "LRU eviction":

- assume that cache uses LRU replacement
- ullet accessing n addresses from the same cache set to evict an n-way set
- ullet eviction from last level o from whole hierarchy (it's inclusive!)

#### #1.2: Which addresses map to the same set?



- function H that maps slices is undocumented
- reverse-engineered by [4, 7, 9]

#### #1.2: Which addresses map to the same set?



- function H that maps slices is undocumented
- reverse-engineered by [4, 7, 9]
- hash function basically an XOR of address bits

#### #1.2: Which addresses map to the same set?

#### 3 functions, depending on the number of cores

|         |       |          | Address bit |          |          |          |          |          |          |          |          |          |          |          |          |          |          |          |          |          |          |          |          |          |          |          |          |          |          |   |          |          |          |
|---------|-------|----------|-------------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|---|----------|----------|----------|
|         |       | 3        | 3           | 3        | 3        | 3        | 3        | 3        | 3        | 2        | 2        | 2        | 2        | 2        | 2        | 2        | 2        | 2        | 2        | 1        | 1        | 1        | 1        | 1        | 1        | 1        | 1        | 1        | 1        | 0 | 0        | 0        | 0        |
|         |       | 7        | 6           | 5        | 4        | 3        | 2        | 1        | 0        | 9        | 8        | 7        | 6        | 5        | 4        | 3        | 2        | 1        | 0        | 9        | 8        | 7        | 6        | 5        | 4        | 3        | 2        | 1        | 0        | 9 | 8        | 7        | 6        |
| 2 cores | 00    |          |             |          |          |          | $\oplus$ |          | $\oplus$ |          | $\oplus$ | $\oplus$ | $\oplus$ | $\oplus$ | $\oplus$ |          | $\oplus$ |          | $\oplus$ |          | $\oplus$ | $\oplus$ | $\oplus$ |          | $\oplus$ |          | $\oplus$ |          | $\oplus$ |   |          |          | $\oplus$ |
| 4 cores | 00    |          |             |          |          | $\oplus$ | $\oplus$ |          | $\oplus$ |          | $\oplus$ | $\oplus$ | $\oplus$ | $\oplus$ | $\oplus$ |          | $\oplus$ |          | $\oplus$ |          | $\oplus$ | $\oplus$ | $\oplus$ |          | $\oplus$ |          | $\oplus$ |          | $\oplus$ |   |          |          | $\oplus$ |
|         | $o_1$ |          |             |          | $\oplus$ | $\oplus$ |          | $\oplus$ |          | $\oplus$ | $\oplus$ |          | $\oplus$ |          | $\oplus$ | $\oplus$ | $\oplus$ | $\oplus$ | $\oplus$ | $\oplus$ |          |   |          | $\oplus$ |          |
|         | 00    |          | $\oplus$    | $\oplus$ |          | $\oplus$ | $\oplus$ |          | $\oplus$ |          | $\oplus$ | $\oplus$ | $\oplus$ | $\oplus$ | $\oplus$ |          | $\oplus$ |          | $\oplus$ |          | $\oplus$ | $\oplus$ | $\oplus$ |          | $\oplus$ |          | $\oplus$ |          | $\oplus$ |   |          |          | $\oplus$ |
| 8 cores | $o_1$ | $\oplus$ |             | $\oplus$ | $\oplus$ | $\oplus$ |          | $\oplus$ |          | $\oplus$ | $\oplus$ |          | $\oplus$ |          | $\oplus$ | $\oplus$ | $\oplus$ | $\oplus$ | $\oplus$ | $\oplus$ |          |   |          | $\oplus$ |          |
|         | 02    | $\oplus$ | $\oplus$    | $\oplus$ | $\oplus$ |          |          | $\oplus$ |          |          | $\oplus$ |          |          | $\oplus$ | $\oplus$ |          |          |   | $\oplus$ |          |          |
|         |       |          |             |          |          |          |          |          |          |          |          |          |          |          |          |          |          |          |          |          |          |          |          |          |          |          |          |          |          |   |          |          |          |

• last-level cache is physically indexed

- last-level cache is physically indexed
- root privileges needed for physical addresses

- last-level cache is physically indexed
- root privileges needed for physical addresses
- ullet use 2 MB pages o lowest 21 bits are the same as virtual address

- last-level cache is physically indexed
- root privileges needed for physical addresses
- ullet use 2 MB pages o lowest 21 bits are the same as virtual address
- $\rightarrow$  enough to compute the cache set



"LRU eviction" memory accesses

| cache set |  |  |  |  |
|-----------|--|--|--|--|
|           |  |  |  |  |

• LRU replacement policy: oldest entry first



- LRU replacement policy: oldest entry first
- timestamps for every cache line



- LRU replacement policy: oldest entry first
- timestamps for every cache line
- access updates timestamp



- LRU replacement policy: oldest entry first
- timestamps for every cache line
- access updates timestamp



- LRU replacement policy: oldest entry first
- timestamps for every cache line
- access updates timestamp



- LRU replacement policy: oldest entry first
- timestamps for every cache line
- access updates timestamp



- LRU replacement policy: oldest entry first
- timestamps for every cache line
- access updates timestamp



- LRU replacement policy: oldest entry first
- timestamps for every cache line
- access updates timestamp



- LRU replacement policy: oldest entry first
- timestamps for every cache line
- access updates timestamp



- LRU replacement policy: oldest entry first
- timestamps for every cache line
- access updates timestamp

"LRU eviction" memory accesses



"LRU eviction" memory accesses



"LRU eviction" memory accesses



"LRU eviction" memory accesses



"LRU eviction" memory accesses



"LRU eviction" memory accesses



"LRU eviction" memory accesses



"LRU eviction" memory accesses



"LRU eviction" memory accesses





- no LRU replacement
- only 75% success rate on Haswell



- no LRU replacement
- only 75% success rate on Haswell
- more accesses → higher success rate, but too slow

#### **#3.3:** Cache eviction strategy



**Figure 1:** Fast and effective on Haswell. Eviction rate >99.97%.

Cache covert channels

#### Side channels vs covert channels

- side channel: attacker spies a victim process
- covert channel: communication between two processes
  - that are not supposed to communicate
  - that are collaborating

#### 1-bit cache covert channels

ideas for 1-bit channels:

#### 1-bit cache covert channels

ideas for 1-bit channels:

- Prime+Probe: use one cache set to transmit
  - 0: sender does not access the set ightarrow low access time in receiver
  - 1: sender does access the set  $\rightarrow$  high access time in receiver

#### 1-bit cache covert channels

#### ideas for 1-bit channels:

- Prime+Probe: use one cache set to transmit
  - 0: sender does not access the set  $\rightarrow$  low access time in receiver
  - 1: sender does access the set  $\rightarrow$  high access time in receiver
- Flush+Reload/Flush+Flush/Evict+Reload: use one address to transmit
  - 0: sender does not access the address  $\rightarrow$  high access time in receiver
  - 1: sender does access the address  $\rightarrow$  low access time in receiver

#### 1-bit covert channels

• 1 bit data, 0 bit control?

#### 1-bit covert channels

- 1 bit data, 0 bit control?
- idea: divide time into slices (e.g.,  $50\mu s$  frames)
- synchronize sender and receiver with a shared clock

#### Problems of 1-bit covert channels

• errors?

#### Problems of 1-bit covert channels

- ullet errors? o error-correcting codes
- retransmission may be more efficient (less overhead)
- desynchronization
- optimal transmission duration may vary

#### Multi-bit covert channels

• combine multiple 1-bit channels

#### Multi-bit covert channels

- combine multiple 1-bit channels
- avoid interferences
- $\rightarrow \ \text{higher performance}$

#### Multi-bit covert channels

- combine multiple 1-bit channels
- avoid interferences
- $\rightarrow$  higher performance
  - use 1-bit for sending = true/false

## Packets / frames

Organize data in packets / frames:

- some data bits
- check sum
- sequence number
- → keep sender and receiver synchronous
- $\rightarrow$  check whether retransmission is necessary

How can the sender know when to retransmit?

• idea: acknowledge packets (requires a backward channel)

How can the sender know when to retransmit?

- idea: acknowledge packets (requires a backward channel)
- use some bits as a backward channel
- use the same bits as a backward channel (sender sending bit/receiver sending bit)

How can the sender know when to retransmit?

- idea: acknowledge packets (requires a backward channel)
- use some bits as a backward channel
- use the same bits as a backward channel (sender sending bit/receiver sending bit)
- why wait for retransmission?

How can the sender know when to retransmit?

- idea: acknowledge packets (requires a backward channel)
- use some bits as a backward channel
- use the same bits as a backward channel (sender sending bit/receiver sending bit)
- why wait for retransmission?
- → sender should retransmit until receiver acknowledged

## Raw capacity C

- number of bits per second
- measure over  $\geq 1$  minute

s bits transmitted in 1 minute:

$$C = \frac{s}{60}$$

### Bit error rate p

- ullet count bits that are wrong w
- count total bits sent  $b_s$
- ullet count total bits received  $b_r$

Error rate:

$$p = \frac{w + |b_r - b_s|}{\max(b_s, b_r)},$$

or if  $b_r = b_s$ :

$$p = \frac{w}{b_r},$$

# **Capacity**

True capacity T:

$$T = C \cdot (1 + ((1 - p) \cdot \log_2 (1 - p) + p \cdot \log_2 (p)))$$

C is the raw capacity and p is the bit error rate.

# Capacity



## State of the art

| method  | raw capacity | err. rate | true capacity | env.   |
|---------|--------------|-----------|---------------|--------|
| F+F [2] | 3968Kbps     | 0.840%    | 3690Kbps      | native |
| F+R [2] | 2384Kbps     | 0.005%    | 2382Kbps      | native |
| E+R [5] | 1141Kbps     | 1.100%    | 1041Kbps      | native |
| P+P [8] | 601Kbps      | 0.000%    | 601Kbps       | native |
| P+P [6] | 600Kbps      | 1.000%    | 552Kbps       | virt   |
| P+P [8] | 362Kbps      | 0.000%    | 362Kbps       | native |
|         |              |           |               |        |

- State of the art: cache attacks are powerful
- Problem: manual identification of attack targets

- State of the art: cache attacks are powerful
- Problem: manual identification of attack targets
- Solution: Cache Template Attacks
- Automatically find any secret-dependent cache access
- Can be used for attacks and to improve software

- State of the art: cache attacks are powerful
- Problem: manual identification of attack targets
- Solution: Cache Template Attacks
- Automatically find any secret-dependent cache access
- Can be used for attacks and to improve software
- Examples:
  - Cache-based keylogger
  - Automatic attacks on crypto algorithms

## Challenges

• How to locate key-dependent memory accesses?

## **Challenges**

- How to locate key-dependent memory accesses?
- It's complicated:
  - Large binaries and libraries (third-party code)
  - Many libraries (gedit: 60MB)
  - Closed-source / unknown binaries
  - Self-compiled binaries

## **Challenges**

- How to locate key-dependent memory accesses?
- It's complicated:
  - Large binaries and libraries (third-party code)
  - Many libraries (gedit: 60MB)
  - Closed-source / unknown binaries
  - Self-compiled binaries
- Difficult to find all exploitable addresses

- Preprocessing step to find exploitable addresses automatically
  - w.r.t. "events" (keystrokes, encryptions, ...)
  - called "Cache Template"

#### **Profiling Phase**

- Preprocessing step to find exploitable addresses automatically
  - w.r.t. "events" (keystrokes, encryptions, ...)
  - called "Cache Template"

#### **Exploitation Phase**

• Monitor exploitable addresses

| Attacker address space |       | Victim address space |
|------------------------|-------|----------------------|
|                        | Cache |                      |
|                        |       |                      |
| Shared 0x0             |       |                      |
|                        |       |                      |
|                        |       | Shared 0x0           |
|                        |       |                      |
|                        |       |                      |

Cache is empty



Attacker triggers an event



Attacker checks one address for cache hits ("Reload")



Update cache hit ratio (per event and address)



Attacker flushes shared memory



Repeat for higher accuracy



Repeat for all events

# **Profiling Phase**



Repeat for all events

# **Profiling Phase**



Continue with next address

# **Profiling Phase**



Continue with next address



# Profiling Phase: 1 Event, 1 Address



KEY

## Profiling Phase: 1 Event, 1 Address



Example: Cache Hit Ratio for (0x7c800, n): 200 / 200

## Profiling Phase: All Events, 1 Address



### Profiling Phase: All Events, 1 Address

```
SS ghijklmnopqrstuvwxyz
```

Example: Cache Hit Ratio for (0x7c800, u): 13 / 200

### **Profiling Phase: All Events, 1 Address**



Distinguish n from other keys by monitoring 0x7c800

### **Profiling Phase: All Events, All Addresses**



#### AES uses T-Tables (precomputed from S-Boxes)

4 T-Tables

•

$$T_0\left[k_{\{0,4,8,12\}} \oplus p_{\{0,4,8,12\}}\right]$$
  
 $T_1\left[k_{\{1,5,9,13\}} \oplus p_{\{1,5,9,13\}}\right]$ 

...

- If we know which entry of T is accessed, we know the result of  $k_i \oplus p_i$ .
- Known-plaintext attack  $(p_i \text{ is known}) \rightarrow k_i$  can be determined

- Most addresses in two groups:
  - Cache hit ratio 100% (always cache hits)
  - Cache hit ratio 0% (no cache hits)

- Most addresses in two groups:
  - Cache hit ratio 100% (always cache hits)
  - Cache hit ratio 0% (no cache hits)
- One 4096 byte memory block:
  - $\bullet$  Cache hit ratio of 92%
  - Cache hits depend on key value and plaintext value
  - The T-Tables

- Known-plaintext attack
- Events: encryption with only one fixed key byte

- Known-plaintext attack
- Events: encryption with only one fixed key byte
- Profile each event

- Known-plaintext attack
- Events: encryption with only one fixed key byte
- Profile each event
- Exploitation phase:
  - Eliminate key candidates

- Known-plaintext attack
- Events: encryption with only one fixed key byte
- Profile each event
- Exploitation phase:
  - Eliminate key candidates
  - Reduction of key space in first-round attack:
    - 64 bits after 16–160 encryptions

- Known-plaintext attack
- Events: encryption with only one fixed key byte
- Profile each event
- Exploitation phase:
  - Eliminate key candidates
  - Reduction of key space in first-round attack:
    - 64 bits after 16–160 encryptions
  - State of the art: full key recovery after 30000 encryptions

## Attack 4: AES T-Table Template



 $({\sf transposed})$ 

- Novel technique to find any cache side-channel leakage
  - Attacks
  - Detect vulnerabilities

- Novel technique to find any cache side-channel leakage
  - Attacks
  - Detect vulnerabilities
- Works on virtually all Intel CPUs
- Works even with unknown binaries

- Novel technique to find any cache side-channel leakage
  - Attacks
  - Detect vulnerabilities
- Works on virtually all Intel CPUs
- Works even with unknown binaries
- Marks a change of perspective:

- Novel technique to find any cache side-channel leakage
  - Attacks
  - Detect vulnerabilities
- Works on virtually all Intel CPUs
- Works even with unknown binaries
- Marks a change of perspective:
  - Large scale analysis of binaries
  - Large scale automated attacks

#### Evict+Reload

- Variant of Flush+Reload with cache eviction instead of clflush
- Works on ARMv7
- Applicable to millions of devices
- Cache Template Attacks using Evict+Reload
- ARMageddon: Last-Level Cache Attacks on Mobile Devices [5]

#### Flush+Flush: Motivation

- $\bullet \ \ \text{cache attacks} \to \text{many cache misses}$
- detect via performance counters

#### Flush+Flush: Motivation

- ullet cache attacks o many cache misses
- detect via performance counters
- $\rightarrow\,$  good idea, but is it good enough?

#### Flush+Flush: Motivation

- ullet cache attacks o many cache misses
- detect via performance counters
- $\rightarrow$  good idea, but is it good enough?
  - ullet causing a cache flush eq causing a cache miss

#### clflush execution time













#### Flush+Flush: Conclusion

- attacker causes no direct cache misses
  - ightarrow fast
  - $\rightarrow \ \text{stealthy}$

### Flush+Flush: Conclusion

- attacker causes no direct cache misses
  - $\rightarrow$  fast
  - $\rightarrow \ \text{stealthy}$
- same side channel targets as Flush+Reload

### Flush+Flush: Conclusion

- attacker causes no direct cache misses
  - $\rightarrow$  fast
  - ightarrow stealthy
- same side channel targets as Flush+Reload
- 496 KB/s covert channel

### Cache Attacks on mobile devices?

- powerful cache attacks on Intel x86 in the last 10 years
- nothing like Flush+Reload or Prime+Probe on mobile devices

### Cache Attacks on mobile devices?

- powerful cache attacks on Intel x86 in the last 10 years
- nothing like Flush+Reload or Prime+Probe on mobile devices

 $\rightarrow$  why?

# Prefetch Side-Channel Attacks

### **Overview**

- prefetch instructions don't check privileges
- prefetch instructions leak timing information

### **Overview**

- prefetch instructions don't check privileges
- prefetch instructions leak timing information

### exploit this to:

- locate a driver in kernel = defeat KASLR
- translate virtual to physical addresses

# NOTE

#### **NOTE**

Using the PREFETCH instruction is recommended only if data does not fit in cache.

#### NOTE

Using the PREFETCH instruction is recommended only if data does not fit in cache. Use of software prefetch should be limited to memory addresses that are managed or owned within the application context.

#### NOTE

Using the PREFETCH instruction is recommended only if data does not fit in cache. Use of software prefetch should be limited to memory addresses that are managed or owned within the application context. Prefetching to addresses that are not mapped to physical pages can experience non-deterministic performance penalty.

#### NOTE

Using the PREFETCH instruction is recommended only if data does not fit in cache. Use of software prefetch should be limited to memory addresses that are managed or owned within the application context. Prefetching to addresses that are not mapped to physical pages can experience non-deterministic performance penalty. For example specifying a NULL pointer (0L) as address for a prefetch can cause long delays.

### **CPU Caches**

Memory (DRAM) is slow compared to the CPU

- buffer frequently used memory
- every memory reference goes through the cache
- based on physical addresses

# Memory Access Latency



# Unprivileged cache maintainance

### Optimize cache usage:

- prefetch: suggest CPU to load data into cache
- clflush: throw out data from all caches

... based on virtual addresses

# **Software prefetching**

prefetch instructions are somewhat unusual

- hints can be ignored by the CPU
- do not check privileges or cause exceptions

but they do need to translate virtual to physical

# Kernel must be mapped in every address space

#### **Today's operating systems:**



### Address translation on x86-64



### **Address Translation Caches**



# Address Space Layout Randomization (ASLR)



# Address Space Layout Randomization (ASLR)



Same library – different offset!

# Kernel Address Space Layout Randomization (KASLR)



# Kernel Address Space Layout Randomization (KASLR)



Same driver - different offset!

# Kernel direct-physical map



# Kernel direct-physical map



OS X, Linux, BSD, Xen PVM (Amazon EC2)

### **Translation-Level Oracle**













# Timing the prefetch instruction

The CPU may reorder instructions

instruction 1

cpuid

instruction 2

cpuid

instruction 3

but not over cpuid!

# **Breaking KASLR with Prefetch**



# Last-level cache addressing



# Last-level cache addressing

- ullet last-level cache o as many slices as cores
- undocumented hash function that maps a physical address to a slice
- designed for performance



### Prime+Probe on recent procesors?

Complex addressing  $\rightarrow$  impossible to target a set



## Reverse engineering method

1. find some way to map one address to one slice

## Reverse engineering method

- 1. find some way to map one address to one slice
- 2. repeat for every address with a 64B stride

#### Reverse engineering method

- 1. find some way to map one address to one slice
- 2. repeat for every address with a 64B stride
- 3. infer a function out of it

• with performance counters

- with performance counters
- with a timing attack

- with performance counters
- with a timing attack
  - ullet using clflush

- with performance counters
- with a timing attack
  - using clflush
  - using memory access

• event UNC\_CBO\_CACHE\_LOOKUP counts accesses to a slice



• event UNC\_CBO\_CACHE\_LOOKUP counts accesses to a slice



UNC\_CBO\_CACHE\_LOOKUP

• event UNC\_CBO\_CACHE\_LOOKUP counts accesses to a slice



• event UNC\_CBO\_CACHE\_LOOKUP counts accesses to a slice



1. translate virtual to physical address with /proc/pid/pagemap

- 1. translate virtual to physical address with /proc/pid/pagemap
- 2. set up monitoring session

- 1. translate virtual to physical address with /proc/pid/pagemap
- 2. set up monitoring session
- 3. repeat access to a single address

- 1. translate virtual to physical address with /proc/pid/pagemap
- 2. set up monitoring session
- 3. repeat access to a single address
  - → hint: clflush is already counted as an access

- 1. translate virtual to physical address with /proc/pid/pagemap
- 2. set up monitoring session
- 3. repeat access to a single address
  - $\rightarrow$  hint: clflush is already counted as an access
- 4. read UNC\_CBO\_CACHE\_LOOKUP event for each CBo

- 1. translate virtual to physical address with /proc/pid/pagemap
- 2. set up monitoring session
- 3. repeat access to a single address
  - → hint: clflush is already counted as an access
- 4. read UNC\_CBO\_CACHE\_LOOKUP event for each CBo
- 5. slice is the one that has the maximum lookup events

#### Mapping physical addresses to slices



#### Method #2: Flush+Flush [2]

• side channel similar to Flush+Reload, using only clflush execution time differences

#### Method #2: Flush+Flush [2]

- side channel similar to Flush+Reload, using only clflush execution time differences
- time difference between cached and not cached

### Method #2: Flush+Flush [2]

- side channel similar to Flush+Reload, using only clflush execution time differences
- time difference between cached and not cached
- time difference between slices

#### Last-level cache and Ring interconnect

Intel Optimization Manual:

"The LLC hit latency [...] depends on the core location relative to the LLC block, and how far the request needs to travel on the ring."

#### clflush time difference between slices



 ${\tt clflush}$  histogram for an address in slice 1 on different cores

### Inferring the function

#### Two cases:

- 1. linear function:  $2^n$  number of cores
- 2. non-linear function: the rest

#### Methods

- brute-force
- smarter method
- let a solver do it! [1]

Perspective: the first two methods assume that we know what the functions look like (XORs of address bits)

#### **Brute force**

For each bit of output  $o_0, \ldots, o_{k-1}$ 

- 1. try one function
- $\rightarrow$  e.g.,  $b_6 \oplus b_7 \oplus \ldots \oplus b_{32}$
- 2. test if the function corresponds to the mapping you already have

#### **Brute force**

For each bit of output  $o_0, \ldots, o_{k-1}$ 

- 1. try one function
- $\rightarrow$  e.g.,  $b_6 \oplus b_7 \oplus \ldots \oplus b_{32}$
- 2. test if the function corresponds to the mapping you already have
- 3. if not  $\rightarrow$  try another function until it works!

#### Brute force

For each bit of output  $o_0, \ldots, o_{k-1}$ 

- 1. try one function
- $\rightarrow$  e.g.,  $b_6 \oplus b_7 \oplus \ldots \oplus b_{32}$
- 2. test if the function corresponds to the mapping you already have
- 3. if not  $\rightarrow$  try another function until it works!

Fail fast, but still tolerate some errors

# Finding linear function (1/2) [7]

 $\bullet$  XOR is commutative:  $A \oplus B \Leftrightarrow B \oplus A$ 

# Finding linear function (1/2) [7]

- XOR is commutative:  $A \oplus B \Leftrightarrow B \oplus A$
- XOR is associative:  $A \oplus (B \oplus C) \Leftrightarrow (A \oplus B) \oplus C$

# Finding linear function (1/2) [7]

- XOR is commutative:  $A \oplus B \Leftrightarrow B \oplus A$
- XOR is associative:  $A \oplus (B \oplus C) \Leftrightarrow (A \oplus B) \oplus C$
- you can treat all the input bits independently

## Finding linear function (2/2) [7]

For each bit of output  $o_0, \ldots, o_{k-1}$ 

- ullet find two addresses that differ by a single bit  $b_i$
- compare output
  - $\bullet$  if different:  $b_i$  is part of the function
  - if equal:  $b_i$  is not part of the function



# Side-Channel Security

Chapter 2: Caches & Cache Attacks

#### **Daniel Gruss**

March 14, 2024

Graz University of Technology

#### References

- [1] Gerlach, L., Schwarz, S., Faroß, N., and Schwarz, M. (2024). Efficient and Generic Microarchitectural Hash-Function Recovery. In *S&P*.
- [2] Gruss, D., Maurice, C., Wagner, K., and Mangard, S. (2016). Flush+Flush: A Fast and Stealthy Cache Attack. In *DIMVA*.
- [3] Gruss, D., Spreitzer, R., and Mangard, S. (2015). Cache Template Attacks: Automating Attacks on Inclusive Last-Level Caches. In *USENIX Security Symposium*.
- [4] Inci, M. S., Gulmezoglu, B., Irazoqui, G., Eisenbarth, T., and Sunar, B. (2015). Seriously, get off my cloud! cross-vm rsa key recovery in a public cloud. *Cryptology ePrint Archive, Report 2015/898*.

- [5] Lipp, M., Gruss, D., Spreitzer, R., Maurice, C., and Mangard, S. (2016). ARMageddon: Cache Attacks on Mobile Devices. In *USENIX Security Symposium*.
- [6] Liu, F., Yarom, Y., Ge, Q., Heiser, G., and Lee, R. B. (2015). Last-Level Cache Side-Channel Attacks are Practical. In *S&P*.
- [7] Maurice, C., Le Scouarnec, N., Neumann, C., Heen, O., and Francillon, A. (2015). Reverse Engineering Intel Complex Addressing Using Performance Counters. In *RAID*.
- [8] Maurice, C., Weber, M., Schwarz, M., Giner, L., Gruss, D., Alberto Boano, C., Mangard, S., and Römer, K. (2017). Hello from the Other Side: SSH over Robust Cache Covert Channels in the Cloud. In NDSS.
- [9] Yarom, Y., Ge, Q., Liu, F., Lee, R. B., and Heiser, G. (2015). Mapping the Intel Last-Level Cache. *Cryptology ePrint Archive, Report 2015/905*.