

# **Block Ciphers**

- A family of cryptographic functions that map an n-bit plaintext block into n-bit ciphertext block.
  - It is parameterized by its key bit length, K.



# **Block Ciphers**

- A family of cryptographic functions that map an n-bit plaintext block into n-bit ciphertext block.
  - It is parameterized by its key bit length, K.



Rijndael is selected as AES in 2000.



| Key Length ( <i>K</i> ) | Nr |
|-------------------------|----|
| 128                     | 10 |
| 192                     | 12 |
| 256                     | 14 |

#### **AES Overview**

- 128-bit (16 bytes) input is arranged into a 4x4 matrix in column-major order.
  - Each matrix entry is an element of  $GF(2^8)$  with  $x^8+x^4+x^3+x+1$ .



$$\begin{pmatrix} a_{0,0} & a_{0,1} & a_{0,2} & a_{0,3} \\ a_{1,0} & a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,0} & a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,0} & a_{3,1} & a_{3,2} & a_{3,3} \end{pmatrix}$$

#### **AES Overview**

- 128-bit (16 bytes) input is arranged into a 4x4 matrix in column-major order.
  - Each matrix entry is an element of  $GF(2^8)$  with  $x^8+x^4+x^3+x+1$ .





#### **AES Overview**

- Rijndael has four main operations:
  - AddRoundKey: XORing the block with the round key.
  - <u>SubBytes</u>: Substitute a byte with another byte.
  - ShiftRows: Each row of the block is rotated.
  - MixColumns: Each column of the block is multiplied with a polynomial.
- Rijndael also has a key scheduling mechanism to generate round keys.

# **AES Encryption**



# **AES Decryption**



# Arithmetic in GF(28)

- GF(2<sup>k</sup>) is a Galois field of 2<sup>k</sup> elements.
  - Also called binary fields.
- GF(2<sup>k</sup>) elements in polynomial basis
  - Every element can be represented as a polynomial of power k-1.

$$E = (E_{k-1}E_{k-2} \dots E_1E_0) = E_{k-1}X^{k-1} + E_{k-2}X^{k-2} + \dots + E_1X + E_0$$

$$E_i$$
: {0, 1}

# **Arithmetic in GF(28)**

- GF(2<sup>k</sup>) is a Galois field of 2<sup>k</sup> elements.
  - Also called binary fields.
- GF(2<sup>k</sup>) elements in polynomial basis
  - Every element can be represented as a polynomial of power k-1.

$$E = (E_{k-1}E_{k-2} \dots E_1E_0) = E_{k-1}x^{k-1} + E_{k-2}x^{k-2} + \dots + E_1x + E_0$$

E<sub>i</sub>: {0, 1}

AES is using GF(2<sup>8</sup>) with irreducible polynomial x<sup>8</sup> + x<sup>4</sup> + x<sup>3</sup> + x + 1.

# Arithmetic in GF(28): Addition

- Addition in GF(28) with irreducible polynomial  $x^8 + x^4 + x^3 + x + 1$ .
  - GF(2) addition of the individual bits.
  - GF(2) addition corresponds to the XOR operation in Boolean logic.
- A, B, C in GF(28):

$$C_i = A_i + B_i \pmod{2}$$
, for  $i = 0, ..., k-1$ 

Subtraction is the same as addition



# Arithmetic in GF(28): Multiplication

- Multiplication in GF(28) with irreducible polynomial  $x^8 + x^4 + x^3 + x + 1$ .
  - Polynomial multiplication (each coefficient is in GF(2)).
  - Reduction with irreducible polynomial.
- Example:

201. 2 = 
$$(11001001)_2$$
.  $(00000010)_2$   
=  $(x^7 + x^6 + x^3 + 1)$ .  $(x)$   
=  $x^8 + x^7 + x^4 + x \pmod{x^8 + x^4 + x^3 + x + 1}$   
=  $x^7 + x^4 + x - x^4 - x^3 - x - 1$   
=  $x^7 + x^3 + 1$   
=  $(10001001)_2 = 129$ 

# **AES Key Schedule**

- AES takes a single key and generates round keys with the input key and its key scheduling (expansion) algorithm.
  - RotWord: Cyclic left shift
  - SubWord: AES S-box for each byte
  - Rcon: Add with [rc<sub>i</sub> 00 00 00]
    - $rc_i = x^{i-1}$  is round constant (can be stored as a table)

| i   | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 |
|-----|----|----|----|----|----|----|----|----|----|----|
| rci | 01 | 02 | 04 | 08 | 10 | 20 | 40 | 80 | 1B | 36 |



# **AddRoundKey**

- Round Key Addition.
  - Addition of the current state with the round key in GF(28).
  - Simple bit-wise addition (XOR) of state bytes with round key bytes.

$$\begin{pmatrix} a_{0,0} & a_{0,1} & a_{0,2} & a_{0,3} \\ a_{1,0} & a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,0} & a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,0} & a_{3,1} & a_{3,2} & a_{3,3} \end{pmatrix} \oplus \begin{pmatrix} k_0^i & k_4^i & k_8^i & k_{12}^i \\ k_1^i & k_5^i & k_9^i & k_{13}^i \\ k_2^i & k_6^i & k_{10}^i & k_{14}^i \\ k_3^i & k_7^i & k_{11}^i & k_{15}^i \end{pmatrix}$$

# **SubBytes**

- Byte Substitution (Forward S-box).
  - First, GF(2<sup>8</sup>) multiplicative inverse of each byte in round state is computed. Then, an affine transformation is applied to each byte.



# **Inv SubBytes**

- Inverse Byte Substitution (Inverse S-box).
  - An inverse affine transform is followed by multiplicative inverse operation in GF(28) for each state byte.



# **SubBytes and Inv SubBytes**

• You can use a table (S-box) to combine affine transformation and  $GF(2^8)$  inverse.

#### Forward S-Box



#### **Inverse S-Box**



- Shift Row Layer
  - Four rows of the state matrix are shifted cyclically to the left by offsets of
    - (

- Shift Row Layer
  - Four rows of the state matrix are shifted cyclically to the left by offsets of
    - 0, 1

$$\begin{pmatrix} b_0 & b_4 & b_8 & b_{12} \\ b_5 & b_9 & b_{13} & b_1 \\ b_{10} & b_{14} & b_2 & b_6 \\ b_{15} & b_3 & b_7 & b_{11} \end{pmatrix} \leftarrow \begin{pmatrix} b_0 & b_4 & b_8 & b_{12} \\ b_1 & b_5 & b_9 & b_{13} \\ b_2 & b_6 & b_{10} & b_{14} \\ b_3 & b_7 & b_{11} & b_{15} \end{pmatrix}$$

- Shift Row Layer
  - Four rows of the state matrix are shifted cyclically to the left by offsets of
    - 0, 1, 2

$$\begin{pmatrix} b_0 & b_4 & b_8 & b_{12} \\ b_5 & b_9 & b_{13} & b_1 \\ b_{10} & b_{14} & b_2 & b_6 \\ b_{15} & b_3 & b_7 & b_{11} \end{pmatrix} \leftarrow \begin{pmatrix} b_0 & b_4 & b_8 & b_{12} \\ b_1 & b_5 & b_9 & b_{13} \\ b_2 & b_6 & b_{10} & b_{14} \\ b_3 & b_7 & b_{11} & b_{15} \end{pmatrix}$$

- Shift Row Layer
  - Four rows of the state matrix are shifted cyclically to the left by offsets of
    - 0, 1, 2, 3

$$\begin{pmatrix} b_0 & b_4 & b_8 & b_{12} \\ b_5 & b_9 & b_{13} & b_1 \\ b_{10} & b_{14} & b_2 & b_6 \\ \hline b_{15} & b_3 & b_7 & b_{11} \end{pmatrix} \leftarrow \begin{pmatrix} b_0 & b_4 & b_8 & b_{12} \\ b_1 & b_5 & b_9 & b_{13} \\ b_2 & b_6 & b_{10} & b_{14} \\ \hline b_3 & b_7 & b_{11} & b_{15} \end{pmatrix}$$

- Shift Row Layer
  - Four rows of the state matrix are shifted cyclically to the left by offsets of
    - 0, 1, 2, 3

$$\begin{pmatrix} b_0 & b_4 & b_8 & b_{12} \\ b_5 & b_9 & b_{13} & b_1 \\ b_{10} & b_{14} & b_2 & b_6 \\ b_{15} & b_3 & b_7 & b_{11} \end{pmatrix} \leftarrow \begin{pmatrix} b_0 & b_4 & b_8 & b_{12} \\ b_1 & b_5 & b_9 & b_{13} \\ b_2 & b_6 & b_{10} & b_{14} \\ b_3 & b_7 & b_{11} & b_{15} \end{pmatrix}$$

Inv ShiftRows performs right circular shift.

#### **MixColumns**

- Mix Column Layer
  - Each column of the state (4-bytes) is considered as a degree-3 polynomial in  $GF(2^8)[x]/x^4+1$
  - Then, each polynomial is multiplied with a constant polynomial in the same ring
    - $03.x^3 + 01.x^2 + 01.x + 02$
    - This multiplication can be written as a matrix-vector multiplication

$$\begin{pmatrix} d_0 & d_4 & d_8 & d_{12} \\ d_1 & d_5 & d_9 & d_{13} \\ d_2 & d_6 & d_{10} & d_{14} \\ d_3 & d_7 & d_{11} & d_{15} \end{pmatrix} = \begin{pmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{pmatrix} \cdot \begin{pmatrix} c_0 & c_4 & c_8 & c_{12} \\ c_1 & c_5 & c_9 & c_{13} \\ c_2 & c_6 & c_{10} & c_{14} \\ c_3 & c_7 & c_{11} & c_{15} \end{pmatrix}$$

- Inverse Mix Column layer uses the inverse of  $03.x^3 + 01.x^2 + 01.x + 02$ 
  - $0B. x^3 + 0D. x^2 + 09. x + 0E$

#### **AES Round Overview**

AES Round.



<sup>\*</sup> Image source: https://tratliff.webspace.wheatoncollege.edu/2022\_Fall/math202/index.html

# **Block Cipher/AES Modes**

- In order to efficiently and securely use a block cipher, one must use the cipher in an appropriate mode of operation [H2020].
  - Electronic CodeBook Mode (ECB)
  - Cipher Block Chaining Mode (CBC)
  - Counter Mode (CTR)



# **Block Cipher/AES Modes**



<sup>\*</sup> Image source: https://medium.com/@TalBeerySec/zooming-on-zoom-5-encryption-cc7e9b710b9f

# **AES Implementations**

Parallelism dimensions [AGS2014]



# **AES Implementations**

Efficiency parameters:Latency

P<sub>i</sub>

AES Enc/Dec Time to

encrypt/decrypt

a single plaintext.

 $C_i$ 

Throughput

 $P_{i+1}$ 

 $P_i$ 

AES Enc/Dec Number of plaintext encrypted/decrypted in a unit of time.

 $C_{i}$ 

 $C_{i+}$ 

# **Block Cipher Implementations: Iterative Approach**

- Implement the combinational logic required for one round (supplemented with register and multiplexers). Then, use it repeatedly.
  - Only one block of data is encrypted at a time.
  - The number of clock cycles necessary to encrypt a single block of data is equal to the number of cipher rounds.



- Initialization
- Round (repeated Nr-1 times):
  - SubBytes
  - ShiftRows
  - MixColumns
  - AddRoundKey
- Final Round
  - SubBytes
  - ShiftRows
  - Add Round Key





- SubBytes is expensive and it is instantiated twice.
  - Can we do better?

- SubBytes is expensive and it is instantiated twice.
  - Can we do better?
- See the order for a toy example: Nr = 3

```
AddRoundKey with ARK[0]
SubBytes
ShiftRows
MixColumns
AddRoundKey with ARK[1]
SubBytes
ShiftRows
MixColumns
AddRoundKey with ARK[2]
SubBytes
ShiftRows
AddRoundKey with ARK[3]
```

- SubBytes is expensive and it is instantiated twice.
  - Can we do better?
- See the order for a toy example: Nr = 3

```
AddRoundKey with ARK[0]
SubBytes
ShiftRows
MixColumns
AddRoundKey with ARK[1]
SubBytes
ShiftRows
MixColumns
AddRoundKey with ARK[2]
SubBytes
ShiftRows
AddRoundKey with ARK[3]
```

- SubBytes and AddRoundKey are instantiated twice.
  - Can we do better?
- See the order for a toy example: Nr = 3

```
AddRoundKey with ARK[0]
SubBytes
ShiftRows
MixColumns
AddRoundKey with ARK[1]
SubBytes
ShiftRows
MixColumns
AddRoundKey with ARK[2]
SubBytes
ShiftRows
AddRoundKey with ARK[3]
```

Round (repeated Nr times): AddRoundKey RK[i] Add Round Key SubBytes i=0,...,Nr-1 **ShiftRows** SubBytes **MixColumns ShiftRows** or Nr AddRoundKey Rounds i==Nr-1 Add Round Key MixColumns RK[Nr] ·

#### **AES Implementations: Iterative Approach**

High-level diagram of the architecture



#### **AES Implementations: Iterative Approach**

- High-level diagram of the architecture
  - What happens if we divide a round into multiple stages?



What about decryption?



Can we make Enc. and Dec. look similar?



Swap InvShiftRows and InvSubBytes



Push InvShiftRows and InvSubBytes down



# **Block Cipher Implementations: Partial Loop Unrolling**

- K round out of Nr round functions are implemented in combinational part.
  - Partial loop unrolling.



# **Block Cipher Implementations: Partial Loop Unrolling**

- K round out of Nr round functions are implemented in combinational part.
  - Partial loop unrolling



# **Block Cipher Implementations: Loop Unrolling**

- All round functions are implemented in combinational part.
  - Full loop unrolling



# **Block Cipher Implementations: Loop Unrolling**

- All round functions are implemented in combinational part
  - Full loop unrolling



Without pipelining, unrolling offers no throughput improvement.

- A traditional methodology for design of high-performance implementations.
  - Partial or full outer-loop pipelining (i.e., K=2 with Nr=4 rounds)



- A traditional methodology for design of high-performance implementations.
  - Partial or full outer-loop pipelining (i.e., K=2 with Nr=4 rounds)



- A traditional methodology for design of high-performance implementations.
  - Partial or full outer-loop pipelining (i.e., K=2 with Nr=4 rounds)



- A traditional methodology for design of high-performance implementations.
  - Partial or full outer-loop pipelining (i.e., K=2 with Nr=4 rounds)



- A traditional methodology for design of high-performance implementations.
  - Partial or full outer-loop pipelining.
  - Inner-loop pipelining.
  - Partial or full outer-loop pipelining with inner loop pipelining.



## **Block Cipher Implementations: Summary**

- Summary of implementation methods
  - Iterative
  - Partial unroll
  - Fully unroll



## **Block Cipher Implementations: Summary**

- Summary of implementation methods
  - Iterative
  - Partial unroll
  - Fully unroll
  - Pipelining
    - Inner
    - Outer





#### **AES Implementations: I/O**

- Assume that the input data rate is 100 Mb/sec (1Mb = 1,000,000 bits), the input and output buffers can store 128-bits each.
  - What would be your design strategy?



- It takes one byte as input and produces one byte output. It has two components:
  - Multiplicative inverse in GF(2<sup>8</sup>)
    - Complex operation
  - Affine transformation
- Three different approaches for implementation:
  - Look-up table
  - Look-up table + logic
  - Logic-only

- Look-up table:
  - Pre-compute and store SubBytes results for all possible inputs (0 to 255).
  - Each round state has 16 bytes, so 16 256x8 bits (2 Kbits) table is required.
- For a merged enc/dec design, table size is doubled.
  - i.e., use most significant bit of table address to distinguish forward and inverse conversions.

- Look-up table and logic:
  - InvSubBytes and SubBytes operations can share the same table.



 Then, affine and inverse affine transformation operations can be implemented using XOR gates.

- Look-up table and logic:
  - InvSubBytes and SubBytes operations can share the same table.

$$q = \operatorname{aff\_trans}(a) \qquad (13) \qquad q = \operatorname{aff\_trans}^{-1}(a) \qquad (14)$$

$$\overline{a_A = a_0 \oplus a_1, a_B = a_2 \oplus a_3,} \qquad \overline{a_A = a_0 \oplus a_5, a_B = a_1 \oplus a_4,}$$

$$a_C = a_4 \oplus a_5, a_D = a_6 \oplus a_7 \qquad a_C = a_2 \oplus a_7, a_D = a_3 \oplus a_6$$

$$q_0 = \overline{a_0} \oplus a_C \oplus a_D \qquad q_0 = \overline{a_5} \oplus a_C$$

$$q_1 = \overline{a_5} \oplus a_A \oplus a_D \qquad q_1 = a_0 \oplus a_D$$

$$q_2 = a_2 \oplus a_A \oplus a_D \qquad q_2 = \overline{a_7} \oplus a_B$$

$$q_3 = a_7 \oplus a_A \oplus a_B \qquad q_3 = a_2 \oplus a_A$$

$$q_4 = a_4 \oplus a_A \oplus a_B \qquad q_4 = a_1 \oplus a_D$$

$$q_5 = \overline{a_1} \oplus a_B \oplus a_C \qquad q_5 = a_4 \oplus a_C$$

$$q_6 = \overline{a_6} \oplus a_B \oplus a_C \qquad q_6 = a_3 \oplus a_A$$

$$q_7 = a_3 \oplus a_C \oplus a_D. \qquad q_7 = a_6 \oplus a_B.$$

- Logic:
  - Table-based implementations can be costly for ASIC.
  - It also can limit maximum clock frequency in deeply-pipelined architectures.
- What are our options?
  - Construct truth-table and derive Boolean expression.
    - Very inefficient even with Boolean minimization techniques.

- Logic:
  - Table-based implementations can be costly for ASIC.
  - It also can limit maximum clock frequency in deeply-pipelined architectures.
- What are our options?
  - Construct truth-table and derive Boolean expression.
    - Very inefficient even with Boolean minimization techniques.
  - Implement multiplicative inverse operation for Rijndael's finite field.
    - Extended Euclidean Algorithm
    - Map operations to GF(2<sup>4</sup>)

Map operations to GF(2<sup>4</sup>). [WOL2002]

#### An ASIC Implementation of the AES SBoxes<sup>⋆</sup>

Johannes Wolkerstorfer<sup>1</sup>, Elisabeth Oswald<sup>1</sup>, and Mario Lamberger<sup>2</sup>

<sup>1</sup> Institute for Applied Information Processing and Communications, Graz University of Technology, Inffeldgasse 16a, A-8010 Graz, Austria Johannes.Wolkerstorfer@iaik.at, http://www.iaik.at

<sup>2</sup> Department of Mathematics Graz University of Technology, Steyrergasse 30, A-8010 Graz, Austria

Abstract. This article presents a hardware implementation of the S-Boxes from the Advanced Encryption Standard (AES). The SBoxes substitute an 8-bit input for an 8-bit output and are based on arithmetic operations in the finite field  $GF(2^8)$ . We show that a calculation of this function and its inverse can be done efficiently with combinational logic. This approach has advantages over a straight-forward implementation using read-only memories for table lookups. Most of the functionality is used for both encryption and decryption. The resulting circuit offers low transistor count, has low die-size, is convenient for pipelining, and can be realized easily within a semi-custom design methodology like a standard-cell design. Our standard cell implementation on a 0.6  $\mu$ m CMOS process requires an area of only 0.108 mm<sup>2</sup> and has delay below 15 ns which equals a maximum clock frequency of 70 MHz. These results were achieved without applying any speed optimization techniques like pipelining.

[WOL2002] J. Wolkerstorfer et al., An ASIC Implementation of AES SBoxes, CT-RSA, 2002.

- Map operations to GF(2<sup>4</sup>). [WOL2002]
  - We can represent an element in GF(2<sup>8</sup>) as a two-term polynomial with coefficients in GF(2<sup>4</sup>).

$$a = a_h \cdot x + al$$
,  $a \in GF(2^8)$  and  $a_h$ ,  $al \in GF(2^4)$ 

Example: Inversion in GF(2<sup>4</sup>)

$$q(x) = a(x)^{-1} \mod m_4(x), \quad q(x), a(x) \in GF(2^4)$$

$$a_A = a_1 \oplus a_2 \oplus a_3 \oplus a_1 a_2 a_3$$

$$q_0 = a_A \oplus a_0 \oplus a_0 a_2 \oplus a_1 a_2 \oplus a_0 a_1 a_2$$

$$q_1 = a_0 a_1 \oplus a_0 a_2 \oplus a_1 a_2 \oplus a_3 \oplus a_1 a_3 \oplus a_0 a_1 a_3$$

$$q_2 = a_0 a_1 \oplus a_2 \oplus a_0 a_2 \oplus a_3 \oplus a_0 a_3 \oplus a_0 a_2 a_3$$

$$q_3 = a_A \oplus a_0 a_3 \oplus a_1 a_3 \oplus a_2 a_3.$$

$$(9)$$

<sup>\*</sup> Equations are retrieved from [WOL2002]

 Inversion in GF(2<sup>8</sup>) can be re-written in terms of addition, multiplication, squaring and inversion in GF(2<sup>4</sup>):

$$(a_h x + al) \otimes (a'_h x + a'_l) = 0.x + 1, \quad a_h, a_l, a'_h, a'_l \in GF(2^4)$$

• Then, inversion can be derived as:

$$(a_h.x + al)^{-1} = (a_h'.x + a_l') = (a_h \otimes d).x + (a_h \oplus al) \otimes d$$
 where

$$d = \left( (a_h^2 \otimes E) \oplus (a_h \otimes a_l) \oplus a_l^2 \right)^{-1}$$

• All operations for calculating d is in GF(2<sup>4</sup>).

• We can further map operations in GF(2<sup>4</sup>) to GF(2). [C2005]

A Very Compact S-box for AES

D. Canright

Naval Postgraduate School, Monterey CA 93943, USA, dcanright@nps.edu

Abstract. A key step in the Advanced Encryption Standard (AES) algorithm is the "S-box." Many implementations of AES have been proposed, for various goals, that effect the S-box in various ways. In particular, the most compact implementations to date of Satoh et al.[1] and Mentens et al.[2] perform the 8-bit Galois field inversion of the S-box using subfields of 4 bits and of 2 bits. Our work refines this approach to achieve a more compact S-box. We examined many choices of basis for each subfield, not only polynomial bases as in previous work, but also normal bases, giving 432 cases. The isomorphism bit matrices are fully optimized, improving on the "greedy algorithm." Introducing some NOR gates gives further savings. The best case improves on [1] by 20%. This decreased size could help for area-limited hardware implementations, e.g., smart cards, and to allow more copies of the S-box for parallelism and/or pipelining of AES.

[C2005] D. Canright, A Very Compact S-Box for AES, CHES, 2005.

- The MixColumn Layer
  - It can be expressed as matrix multiplication
  - Each element of the matrix is a byte

| $02\ 03\ 01\ 01$ | $\lceil a_0 \rceil$ |
|------------------|---------------------|
| $01\ 02\ 03\ 01$ | $a_1$               |
| 01 01 02 03      | $a_2$               |
| $03\ 01\ 01\ 02$ | $a_3$               |

- Each polynomial coefficient will be multiplied with a matrix element. Then, the resulting four bytes will be added (XORed)
  - 2 layers of XOR gates: 3-input XOR gates + 4-input XOR gates
  - i.e.,  $b_0 = 2 \otimes a_0 \oplus 3 \otimes a_1 \oplus 1 \otimes a_2 \oplus 1 \otimes a_3$
- Each coefficient multiplication can also be implemented using look-up tables

- The Inverse MixColumn Layer
  - It can be expressed as matrix multiplication
  - Each element of the matrix is a byte

| 0E 0B 0D 09    |   | $\lceil a_0 \rceil$ |
|----------------|---|---------------------|
| 09  0E  0B  0D |   | $a_1$               |
| 0D 09 0E 0B    | • | $a_2$               |
| 0B 0D 09 0E    |   | $a_3$               |

- Inverse MixColumn has larger coefficients.
  - 2 layers of XOR gates: up to 6-input XOR gates + 4-input XOR gates
- Inverse MixColumn implementation will have larger area and longer critical path.

- Since the hardware implementing inverse MixColumn layer is always larger, there are works targeting resource sharing between MixColumn and inverse MixColumn for reducing hardware cost. [W2001]
- Inverse matrix can be expressed as: [GC2009]

$$\begin{bmatrix} 0\text{C} & 08 & 0\text{C} & 08 \\ 08 & 0\text{C} & 08 & 0\text{C} \\ 0\text{C} & 08 & 0\text{C} & 08 \\ 08 & 0\text{C} & 08 & 0\text{C} \end{bmatrix} + \begin{bmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{bmatrix}$$



- Since the hardware implementing inverse MixColumn layer is always larger, there are works targeting resource sharing between MixColumn and inverse MixColumn for reducing hardware cost. [W2001]
- Inverse matrix/polynomial d(x) can be expressed as  $c(x)^3$  where c(x) is forward matrix/polynomial:

$$d(x) = c(x) \cdot c(x)^2$$

•  $c(x)^2$ :

$$\begin{bmatrix} 05 & 00 & 04 & 00 \\ 00 & 05 & 00 & 04 \\ 04 & 00 & 05 & 00 \\ 00 & 04 & 00 & 05 \end{bmatrix}$$



<sup>\*</sup> Image resource: [GC2009]

#### References

[H2020] H. M. Heys, A Tutorial on the Implementation of Block Ciphers: Software and Hardware Applications, 2020, IACR ePrint 2020/1545.

[AGS2014] A. Aysu et al., SIMON Says, Break the Area Records for Symmetric Key Block Ciphers on FPGAs, ESL, 2014.

[WOL2002] J. Wolkerstorfer et al., An ASIC Implementation of AES SBoxes, CT-RSA, 2002.

[C2005] D. Canright, A Very Compact S-Box for AES, CHES, 2005.

[W2001] J. Wolkerstorfer. *An ASIC implementation of the AES MixColumn operation*. In Proc. Austrochip 2001

[GC2009] K. Gaj, FPGA and ASIC Implementations of AES, Cryptographic Engineering, 2009.