Hardware and Embedded Systems Security

Implementation of Symmetric Cryptography

Block Ciphers

Encrypts fixed block size (n in bits) plaintext (p) to fixed block size (n) ciphertext (c) using a key (k) of fixed size (m)
Examples:

Block Cipher	n	m
AES-128	128	128
DES	64	56
3DES (Encrypt-Decrypt-Encrypt)	64	112 or 168
PRESENT	64	80 or 128

Round Function

Block ciphers are usually iterated ciphers (a round function is applied many times to map to ciphertext)
Round function operates on state of cipher
- State usually initialised with plaintext
In a round (i), a round key (k_i) enters the function with the current state (s_i)
The final state is the ciphertext

Building Blocks

Diffusion: change in single single input bit affects all output bits with 50% probability
Confusion: relationship between input and output is "sufficiently complex"
Key addition: key is combined with the state through addition-like op, e.g. XOR
S-Box: substitution box providing a non-linear mapping over small no. of bits e.g. 8-to-8 or 6-to-4
- Provides confusion
- DES has many seemingly random S-boxes
- AES has one algebraic structure S-box
Permutation: permutation providing a linear mapping such that S-box output affects many S-box inputs in next round.
- Provides diffusion
- DES has bitwise permutations
- AES has bytewise/wordwise operations

PRESENT Pseudocode

n = 64, m = 80
addRoundKey: bitwise XOR of state s and round key k_i
sboxLayer: 4-to-4 PRESENT S-Box to state in groups of 4 bits
- 64 / 4 = 16 times S-Box applied
- Storing a-to-b table requires b * 2^a bits
pLayer: bitwise permutation
- bit at pos j is permuted to new position:
- bit 0 to pos 0, 1 to 16, 2 to 32, etc.

Bitwise Operations in C

getbit: return (word >> bit) & 0x1
setbit: *word |= (1 << bit)
clrbit: *word &= ~(1 << bit)

S-Box Layer Optimisations

Normally implemented as a lookup table, so usually fast
(Time) Can merge multiple S-Boxes to one lookup table e.g. 2x 4-to-4 -> 1x 8-to-8
(Time) Can load S-Box into RAM
(Memory) Can have smaller S-Box and lookup lower and upper nibble
- Likely will cause additional time overhead
Implement S-Box as combinatorial logic
- Each output bit is represented by a boolean expression on input bits

Permutation Layer Optimisations

Permutations on byte level usually faster in software than bitwise
- Architectures don't always have bitwise operations
Bitwise permutations may be faster on hardware-based implementations

Table-based Software Optimisations

Combine S-Box and permutation layer
- In PRESENT, instead of 4-to-4 bit, we use 4-to-64
- Table size expands linearly e.g. 64 * 2⁴ = (b * 2^a)
- Linear expansion raises fewer memory concerns

Example (saving 4 write ops to the state):

// Apply S−Box
for (uint8t i = 0 ; i < 4 , i ++) {
  s[i] = sbox[s[i]];
}

// Permute bytes 0, 1, 2, 3 −> 2, 3, 1, 0
const uint8 t tmp = s[0];
s[0] = s[2];
s[2] = s[1];
s[1] = s[3];
s[3] = tmp;

vs

// Permute bytes 0, 1, 2, 3 −> 2, 3, 1, 0 and apply S−Box
const uint8t tmp = s[0];
s[0] = sbox[s[2]];
s[2] = sbox[s[1]];
s[1] = sbox[s[3]];
s[3] = sbox[tmp];

Minor effect with byte-oriented permutations (ShiftRows in AES)
Cannot be applied directly to bitwise permutations since each S-Box output affects many bytes

Creating the Larger Table (PRESENT S-Box Example)

S-Box output bits are stored at their permuted positions with remaining bits set to zero

SP₀

S-Box	3 2 1 0	… 48 … 32 … 16 … 0
0xC	1 1 0 0	… ..1 … ..1 … ..0 … 0
0x5	0 1 0 1	… ..0 … ..1 … ..0 … 1
0x6	0 1 1 0	… ..0 … ..1 … ..1 … 0
…	…	…

SP₁

S-Box	3 2 1 0	… 49 … 33 … 17 … 1 0
0xC	1 1 0 0	… ..1 … ..1 … ..0 … 0 0
0x5	0 1 0 1	… ..0 … ..1 … ..0 … 1 0
0x6	0 1 1 0	… ..0 … ..1 … ..1 … 0 0
…	…	…

For the first S-Box, only bits 0, 16, 32 and 48 can be non-zero, for the second S-Box, only bits 1, 17, 33 and 49, and so on
Tables for each S-Box instance are re-combined into one state using bitwise XOR or OR
There will only be at most one 1 at each bit position

Efficiency and Cost

Speed-up by avoiding bit permutations
Combined table requires more memory
Program code memory saved from removing permutation implementation
Structure of PRESENT permutation means we can save memory by storing the combined table once, looking up a value and shifting it left by one for each S-Box
Could combine table approach with S-Box merging, approx halving S-Box lookup and permutation

Bitslicing

Smallest addressable unit on processor, normally a byte
Could store each bit in a byte but with significant wasted space
Bitslicing works on idea that we usually encrypt many blocks
Bitslicing allows to "reclaim" some bits that would be lost in approach above

Expanded Form

Where each bit is represented in a byte.

Key Addition

Both key and state are in "expanded" form, can XOR them byte-wise

Permutation Layer

Becomes byte-wise permutation
Example of byte-wise PRESENT permutation:

state_out[0] = state[0];
state_out[16] = state[1];
state_out[32] = state[2];
state_out[48] = state[3];
state_out[1] = state[4];
// …

S-Box Layer

Can no longer be easily implemented as a LUT
Instead, represent outputs as boolean functions in input bits
- We Mainly look at Algebraic Normal Form

Modification to "Reclaim" Bits

Store plaintext in bitsliced state s'[0...63]
- First bit of first block p₀ goes into bit 0 of s'[0]
- First bit of second block p₁ goes into bit 1 of s'[0]
- Second bit of p₁ goes into bit 1 of s'[1]
- etc.
Apply key addition, S-Box and permutation
Map s' back to normal representation

Complexity

Significant performance increase due to operations saved for permutations
Key addition layer keeps similar performance
S-Box more costly, depends on number of boolean operations (more = bad)
Memory overhead since multiple cipher states stored at once
Code size may increase

Algebraic Normal Form

Representation of boolean functions
Use Fast Fourier Transform (FTT)-like algorithm to compute functions
- Write the truth table for function inputs
- Apply "butterfly" n times, with spacing σ = 2⁰ = 1, 2¹, 2², ..., 2ⁿ⁻¹

Butterfly for computing ANF:

A full ANF butterfly table:

We take the rows that are set to one in the final column and use the inputs of these rows to form the boolean function
If 000 is set, the result is inverted (+1)
Example table above results in: x₀ + x₁ + x₀ * x₁... + 1

Implementation of Asymmetric (Public Key) Cryptography

Higher computational and storage requirements
Longer keys
Easy for PCs, harder for embedded systems

RSA

l-bit RSA
Pick secret primes p,q, each about ^l/₂ size
n = p * q
Pick public exponent e
Public key: (n, e)
Private Key: (p, q, d) where d = e^-1 mod 𝜙(n)
Encrypt x: y = x^e mod n
Decrypt y: x = y^d mod n

Long-Number Arithmetic

Required for computations with numbers > width w of a single CPU register

Number Representation

With a processor with w-bit registers, given an l-bit number u, we store the number in k = ceiling(^l/_w)
Example:
- Take 12345678 in binary this is a 24-bit number u = (101111000110000101001110)₂
- We have a CPU with 5-bit registers
- ceil(24/5) = 5
- u = (01011 11000 11000 01010 01110)_2⁵ (added a leading zero)
- u = (a₄, a₃, a₂, a₁, a₀)_2⁵

Addition

mod: take lower limb of result t
floor(t/b): take upper limb of result t
t must be a two limb number for algorithm to work
Some architectures have an add-with-carry (ADDC) instruction
Use workarounds to implement add-with-carry in higher-level languages
- e.g. if t >= b: c = 1 else c = 0
- In this example, execution time is dependent on inputs (less secure)

Subtraction

Done similar to addition, carry becomes "borrow" bit
Can work in two's complement representation
- Negative sign by negating each bit and adding 1

Complexity

k base-b additions-with-carry
2k if no ADDC instruction
Complexity O(k)

Multiplication

Product of k-limb and m-limb numbers is at most k + m length

Complexity

Requires k + m base-b multiplications and 2(b + m) base-b shifts
If k = m, complexity O(k²)

Karatsuba Multiplication

Split two k-digit numbers u, v into halves of equal size k = ceil(^k/₂)
Write values as:
- u = u_H * b^k + u_L
- v = v_H * b^k + v_L
u * v written as:

(u_H * b^k + u_L) * (v_H * b^k + v_L) = u_Hv_Hb^2k + (u_Hv_L +u_Lv_H)b^k + u_Lv_L
Middle part (u_Hv_L +u_Lv_H) can be expressed as:

(u_H + u_L) * (v_L + v_H) − u_Hv_H − u_Lv_L = u_Hv_L + u_Lv_H
u_Hv_H and u_Lv_L are already computed once, so we save one multiplication
Method expressed as:
- D₀ = u_L * v_L
- D₂ = u_H * v_H
- D₁ = (u_H + u_L) * (v_L + v_H)
- Hence:
- u * v = D₂b^2k + [D₁ - D₂ - D₀]b^k + D₀

Recursion

Algorithm should be implemented recursively to compute sub-products
In practice, algorithm applied until threshold, then schoolbook algorithm used
- At some point, additional additions/management become more expensive than the saved multiplication

Complexity

O((³/₄)ⁱ k²)
Refactor to O(k^1.585) for two numbers of length k = 2ⁱ

Modulo Arithmetic

Complicated algorithms
Dividing 2k-digit number by a k-digit number, requires k * (k - 2) multiplications and k divisions
Barrett reduction is an optimisation by performing an expensive pre-computation floor(b^2k / n)

Exponentiation Algorithms

Square-and-Multiply (SAM)

Interleaves modular reduction and multiplications
- Square first, then reducing would require huge storage/computation

t = exponent length
Scans binary exponent e from left to right
Squares for each bit and multiplies if exponent bit = 1

Complexity

On average, assume half of the bits are set
We need (t-1)/2 multiplications and t-1 squares

Elliptic Curve Cryptography

2048-bit keys recommended in RSA, for ECC 256-bit is sufficient
Faster long-number ops due to shorter key inputs
Single group operation requires multiple long-number operations
ECC better suited for constrained embedded devices

Implementation Attacks

With physical access, fault injection (FI) and passive side-channel analysis (SCA) shown to break analytically secure ciphers (DES, AES, RSA)
Exploit properties of implementation

Fault Injection

Devices require certain conditions to function normally:
- Stable supply voltage
- Stable clock signal
- Normal operating temperature
- No strong magnetic or electric fields in vicinity
- etc.
If conditions are not met, device may produce incorrect results
Example faults induced:
- Expose device to high/low temperatures
  - Could reduce entropy of random number generator
  - Could disable memory writes
- Over/Undervoltage device
  - Can affect single or few instructions
- Temp increase clock frequency
  - Can affect single or few instructions
- Expose the integrated circuit (IC) to UV-C light
  - Can clear flash memory
- Target parts of IC with laser
  - Can precisely modify single signal in device
- Tap microprobes on IC
  - Read keys from memory
  - Change internal signals

Generic FI Attacks

Some attacks apply to all ciphers
In the code below, a fault could be induced to skip the return -1 instruction, bypassing the PIN check

int c = memcmp(enteredpin, storedpin, pinlen);
if(c != 0) {
  return −1;
}
// Code continues ...

A fault can be induced on a single bit to always set it to 0 ("stuck-at" fault)
If adversary can fault each bit of the key, they can observe a change in ciphertext to establish if the key bit was a 1 or a 0
If ciphertext changed, the key bit was 1, otherwise key bit was 0
Algorithm shown below

Both generic attacks require high control over fault, timing and space

FI on CRT-RSA

RSA with CRT

Split one RSA computation mod k-digit number n into two operations mod p and q
Allows to reduce computational complexity by a factor of 4

Bellcore Attack

Assume adversary can inject any fault into s_p = x^dp_p
Given correct signature s and fault signature s^¬, mod n can be factored
q = gcd(s - s^¬, n), p = ⁿ/_q
Disadvantage: requires correct and fault signature on same plaintext
Example:
- Correct signature 141, faulty signature 115
- n = 143

Recover q:

  q = gcd(141 - 115, 143) = gcd(26, 143)
    = gcd(26, 143 - 5*26) = gcd(26, 13) = 13

Lenstra Attack

Advantage: "Emulate" correct signature value using signature verification
Since a fault was induced on s_p, result is still correct modulo q
Verification can be factored to: q = gcd(s^¬e - x, n), p = ⁿ/_q
Example:
- n = 143, e = 7, x = 15
- Faulty signature s^¬= 115

Compute 115⁷ mod 143 = 80 (using SAM)
Recover q:

  q = gcd(80 - 15, 143) = gcd(65, 143)
    = gcd(65, 13) = 13

Countermeasures

Two classes: detection-based and algorithmic
Detection-based often implemented on hardware or low software level
- Detect that FI has taken or is taking place
- Examples:
  - Monitor power supply/clock signals for glitches
  - Monitor environment (temperature)
  - Sensors for light/laser detection
  - Compute on two identical CPUs
  - Check for control flow manipulation
  - Checksums and parity bits on internal signals and data bus
Algorithmic countermeasures use properties of algorithm to detect faults
- Examples:
  - Compute algorithm inverse, e.g. verify or decrypt ciphertext before output
  - Insert dummy operations to make FI harder at a specific position
Logging of fault injection attempts must be done carefully
- Obvious attempt of simply incrementing a counter could be detected by an adversary, where they then remove power before the counter write and adjust their FI parameters
- Correct method is to increment a counter before operation, after success, decrement the counter
  - Increases execution time
  - Flash memory etc. have limited write cycles

Side-Channel Analysis

Passive measurement of physical properties of an implementation
Signal is usually called a trace
Examples:
- Execution time: leak information about branching etc.
- Power consumption: leak information about data processed
- Electro-magnetic emanation: similar to power consumption
- Photonic emissions: precise information on internal processes
- Sound: vibrating circuit components can leak information, especially for slow algorithms (RSA)

Measurement Setup

Left shows resistor in ground path of device
- Supply current I_CC flows through the resistor R
- Derive I_CC from voltage drop V_R over R
Right shows R in V_CC path
- Measure drop over R
- Signal is DC-shifted, must remove DC constant by measuring AC-coupled
V_R can be measured directly using a differential probe

Simple Power Analysis (SPA)

Attack by inspecting one or a few traces
In SAM, can distinguish squaring and multiply to reconstruct the secret exponent

Differential Power Analysis (DPA)

Can exploit small leakages
Detect small leakage in large amount of noisy traces
Idea is, power consumption is different for processing a 1 or 0
Measurement and evaluation phase

Measurement

n traces p_i (t) with T sample points each
Encrypting plaintext X_i using key k
When referring to a byte, X⁰_i means byte 0 of X_i

Evaluation

Assume key byte k₀ = 00
Compute LSB(S(x_i XOR k))
Put p₀ trace in bit 0 or bit 1 heap
Repeat for different inputs
Repeat with key candidates k₀ = 01, 02, … , FF
Look at difference of means of traces to find peaks indicating correct key

Why DPA Works

Averaging many signals reduces noise relative to signal

Correlation Power Analysis (CPA)

Reduces number of required traces
We can approximate how a value affects leakage signal

Image shows how a value with fewer ones may result in a lower amplitude
The leakage model is the "Hamming Weight" (HW) model
Hamming weight is equal to number of bits set to 1 in b
Hamming Distance (HD) model assumes leakage depends on the previous and new value of a register
HD is equal to number of bits changed between the old and new value
In CPA, assume a key for first S-Box for input byte x_i, then predict the intermediate result as (S(x_i XOR k^{^}))
Convert value into hypothetical power consumption using HW model
Determine the "match" between prediction and reality of traces

Rules of Thumb

CPA is normalised to range of -1 <= p <= 1, making correlation easier to define
{Some other stuff that seems complicated for an exam}

Countermeasures

Amplitude-based Countermeasures

Signal to Noise Ratio (SNR) determines success rate of side-channel attacks
- Lower the SNR, the more measurements needed

Balanced Logic Styles

Reduce power of leakage signal by balancing power consumption
E.g. Use a and a^¬ at the same time
- Adversary will be observing HW(a) + HW(a^¬)
Always do an operation, e.g. in SAM, always square and multiply but discard unwanted results

Noise Generation

Will increase number of traces required to remove noise in averaging
Used in conjunction with other countermeasures can make averaging out noise hard

Masking

Combine sensitive value x with a random mask value m e.g. using XOR
Device performs all computation on (x XOR m)
Device unmasks before outputting
Non-linear components such as S-Box have to be re-generated for a masked version
Higher-order SCA can leak value and mask

Timing-based Countermeasures

Spreading leakage over multiple clock cycles
Can make clock signal unstable on purpose
Can randomly modulate frequency so operations are at different times in different executions
Adversary may be able to re-align traces using peaks etc.
Dummy operations can be added to control flow to move real operations for different executions
Can randomly change order of operations e.g. S-Box instances in AES
Timing-based countermeasures generally not enough on their own
The actual algorithm should be in constant time

System-level Countermeasures

Diversify keys, every device gets a unique key
- If one key is extracted, only one device is affected
Backend system could check for and block inconsistencies/unexpected use
Rate limiting
System-level countermeasures depend on application, should be second-line defense

Files

notes.md

Latest commit

History

notes.md

File metadata and controls

Hardware and Embedded Systems Security

Implementation of Symmetric Cryptography

Block Ciphers

Round Function

Building Blocks

PRESENT Pseudocode

Bitwise Operations in C

S-Box Layer Optimisations

Permutation Layer Optimisations

Table-based Software Optimisations

Bitslicing

Expanded Form

Algebraic Normal Form

Implementation of Asymmetric (Public Key) Cryptography

RSA

Long-Number Arithmetic

Number Representation

Addition

Multiplication

Karatsuba Multiplication

Modulo Arithmetic

Exponentiation Algorithms

Elliptic Curve Cryptography

Implementation Attacks

Fault Injection

Generic FI Attacks

FI on CRT-RSA

Countermeasures

Side-Channel Analysis

Simple Power Analysis (SPA)

Differential Power Analysis (DPA)

Correlation Power Analysis (CPA)

Countermeasures

Amplitude-based Countermeasures

Timing-based Countermeasures

System-level Countermeasures