- Encrypts fixed block size (n in bits) plaintext (p) to fixed block size (n) ciphertext (c) using a key (k) of fixed size (m)
- Examples:
Block Cipher | n | m |
---|---|---|
AES-128 | 128 | 128 |
DES | 64 | 56 |
3DES (Encrypt-Decrypt-Encrypt) | 64 | 112 or 168 |
PRESENT | 64 | 80 or 128 |
- Block ciphers are usually iterated ciphers (a round function is applied many times to map to ciphertext)
- Round function operates on state of cipher
- State usually initialised with plaintext
- In a round (i), a round key (ki) enters the function with the current state (si)
- The final state is the ciphertext
-
Diffusion: change in single single input bit affects all output bits with 50% probability
-
Confusion: relationship between input and output is "sufficiently complex"
-
Key addition: key is combined with the state through addition-like op, e.g. XOR
-
S-Box: substitution box providing a non-linear mapping over small no. of bits e.g. 8-to-8 or 6-to-4
- Provides confusion
- DES has many seemingly random S-boxes
- AES has one algebraic structure S-box
-
Permutation: permutation providing a linear mapping such that S-box output affects many S-box inputs in next round.
- Provides diffusion
- DES has bitwise permutations
- AES has bytewise/wordwise operations
- n = 64, m = 80
- addRoundKey: bitwise XOR of state s and round key ki
- sboxLayer: 4-to-4 PRESENT S-Box to state in groups of 4 bits
- 64 / 4 = 16 times S-Box applied
- Storing a-to-b table requires b * 2a bits
- pLayer: bitwise permutation
- getbit: return (word >> bit) & 0x1
- setbit: *word |= (1 << bit)
- clrbit: *word &= ~(1 << bit)
- Normally implemented as a lookup table, so usually fast
- (Time) Can merge multiple S-Boxes to one lookup table e.g. 2x 4-to-4 -> 1x 8-to-8
- (Time) Can load S-Box into RAM
- (Memory) Can have smaller S-Box and lookup lower and upper nibble
- Likely will cause additional time overhead
- Implement S-Box as combinatorial logic
- Each output bit is represented by a boolean expression on input bits
- Permutations on byte level usually faster in software than bitwise
- Architectures don't always have bitwise operations
- Bitwise permutations may be faster on hardware-based implementations
- Combine S-Box and permutation layer
- In PRESENT, instead of 4-to-4 bit, we use 4-to-64
- Table size expands linearly e.g. 64 * 24 = (b * 2a)
- Linear expansion raises fewer memory concerns
Example (saving 4 write ops to the state):
// Apply S−Box
for (uint8t i = 0 ; i < 4 , i ++) {
s[i] = sbox[s[i]];
}
// Permute bytes 0, 1, 2, 3 −> 2, 3, 1, 0
const uint8 t tmp = s[0];
s[0] = s[2];
s[2] = s[1];
s[1] = s[3];
s[3] = tmp;
vs
// Permute bytes 0, 1, 2, 3 −> 2, 3, 1, 0 and apply S−Box
const uint8t tmp = s[0];
s[0] = sbox[s[2]];
s[2] = sbox[s[1]];
s[1] = sbox[s[3]];
s[3] = sbox[tmp];
- Minor effect with byte-oriented permutations (ShiftRows in AES)
- Cannot be applied directly to bitwise permutations since each S-Box output affects many bytes
Creating the Larger Table (PRESENT S-Box Example)
- S-Box output bits are stored at their permuted positions with remaining bits set to zero
SP0
S-Box | 3 2 1 0 | … 48 … 32 … 16 … 0 |
---|---|---|
0xC | 1 1 0 0 | … ..1 … ..1 … ..0 … 0 |
0x5 | 0 1 0 1 | … ..0 … ..1 … ..0 … 1 |
0x6 | 0 1 1 0 | … ..0 … ..1 … ..1 … 0 |
… | … | … |
SP1
S-Box | 3 2 1 0 | … 49 … 33 … 17 … 1 0 |
---|---|---|
0xC | 1 1 0 0 | … ..1 … ..1 … ..0 … 0 0 |
0x5 | 0 1 0 1 | … ..0 … ..1 … ..0 … 1 0 |
0x6 | 0 1 1 0 | … ..0 … ..1 … ..1 … 0 0 |
… | … | … |
- For the first S-Box, only bits 0, 16, 32 and 48 can be non-zero, for the second S-Box, only bits 1, 17, 33 and 49, and so on
- Tables for each S-Box instance are re-combined into one state using bitwise XOR or OR
- There will only be at most one 1 at each bit position
Efficiency and Cost
- Speed-up by avoiding bit permutations
- Combined table requires more memory
- Program code memory saved from removing permutation implementation
- Structure of PRESENT permutation means we can save memory by storing the combined table once, looking up a value and shifting it left by one for each S-Box
- Could combine table approach with S-Box merging, approx halving S-Box lookup and permutation
- Smallest addressable unit on processor, normally a byte
- Could store each bit in a byte but with significant wasted space
- Bitslicing works on idea that we usually encrypt many blocks
- Bitslicing allows to "reclaim" some bits that would be lost in approach above
Where each bit is represented in a byte.
Key Addition
- Both key and state are in "expanded" form, can XOR them byte-wise
Permutation Layer
- Becomes byte-wise permutation
- Example of byte-wise PRESENT permutation:
state_out[0] = state[0];
state_out[16] = state[1];
state_out[32] = state[2];
state_out[48] = state[3];
state_out[1] = state[4];
// …
S-Box Layer
- Can no longer be easily implemented as a LUT
- Instead, represent outputs as boolean functions in input bits
- We Mainly look at Algebraic Normal Form
Modification to "Reclaim" Bits
- Store plaintext in bitsliced state s'[0...63]
- First bit of first block p0 goes into bit 0 of s'[0]
- First bit of second block p1 goes into bit 1 of s'[0]
- Second bit of p1 goes into bit 1 of s'[1]
- etc.
- Apply key addition, S-Box and permutation
- Map s' back to normal representation
Complexity
- Significant performance increase due to operations saved for permutations
- Key addition layer keeps similar performance
- S-Box more costly, depends on number of boolean operations (more = bad)
- Memory overhead since multiple cipher states stored at once
- Code size may increase
- Representation of boolean functions
- Use Fast Fourier Transform (FTT)-like algorithm to compute functions
- Write the truth table for function inputs
- Apply "butterfly" n times, with spacing σ = 20 = 1, 21, 22, ..., 2n−1
Butterfly for computing ANF:
A full ANF butterfly table:
- We take the rows that are set to one in the final column and use the inputs of these rows to form the boolean function
- If 000 is set, the result is inverted (+1)
- Example table above results in: x0 + x1 + x0 * x1... + 1
- Higher computational and storage requirements
- Longer keys
- Easy for PCs, harder for embedded systems
- l-bit RSA
- Pick secret primes p,q, each about l/2 size
- n = p * q
- Pick public exponent e
- Public key: (n, e)
- Private Key: (p, q, d) where d = e-1 mod 𝜙(n)
- Encrypt x: y = xe mod n
- Decrypt y: x = yd mod n
- Required for computations with numbers > width w of a single CPU register
- With a processor with w-bit registers, given an l-bit number u, we store the number in k = ceiling(l/w)
- Example:
- Take 12345678 in binary this is a 24-bit number u = (101111000110000101001110)2
- We have a CPU with 5-bit registers
- ceil(24/5) = 5
- u = (01011 11000 11000 01010 01110)25 (added a leading zero)
- u = (a4, a3, a2, a1, a0)25
-
mod: take lower limb of result t
-
floor(t/b): take upper limb of result t
-
t must be a two limb number for algorithm to work
-
Some architectures have an add-with-carry (ADDC) instruction
-
Use workarounds to implement add-with-carry in higher-level languages
- e.g. if t >= b: c = 1 else c = 0
- In this example, execution time is dependent on inputs (less secure)
Subtraction
- Done similar to addition, carry becomes "borrow" bit
- Can work in two's complement representation
- Negative sign by negating each bit and adding 1
Complexity
- k base-b additions-with-carry
- 2k if no ADDC instruction
- Complexity O(k)
- Product of k-limb and m-limb numbers is at most k + m length
Complexity
- Requires k + m base-b multiplications and 2(b + m) base-b shifts
- If k = m, complexity O(k2)
-
Split two k-digit numbers u, v into halves of equal size k = ceil(k/2)
-
Write values as:
- u = uH * bk + uL
- v = vH * bk + vL
-
u * v written as:
(uH * bk + uL) * (vH * bk + vL) = uHvHb2k + (uHvL +uLvH)bk + uLvL
-
Middle part (uHvL +uLvH) can be expressed as:
(uH + uL) * (vL + vH) − uHvH − uLvL = uHvL + uLvH
-
uHvH and uLvL are already computed once, so we save one multiplication
-
Method expressed as:
- D0 = uL * vL
- D2 = uH * vH
- D1 = (uH + uL) * (vL + vH)
- Hence:
- u * v = D2b2k + [D1 - D2 - D0]bk + D0
Recursion
- Algorithm should be implemented recursively to compute sub-products
- In practice, algorithm applied until threshold, then schoolbook algorithm used
- At some point, additional additions/management become more expensive than the saved multiplication
Complexity
- O((3/4)i k2)
- Refactor to O(k1.585) for two numbers of length k = 2i
- Complicated algorithms
- Dividing 2k-digit number by a k-digit number, requires k * (k - 2) multiplications and k divisions
- Barrett reduction is an optimisation by performing an expensive pre-computation floor(b2k / n)
Square-and-Multiply (SAM)
- Interleaves modular reduction and multiplications
- Square first, then reducing would require huge storage/computation
- t = exponent length
- Scans binary exponent e from left to right
- Squares for each bit and multiplies if exponent bit = 1
Complexity
- On average, assume half of the bits are set
- We need (t-1)/2 multiplications and t-1 squares
- 2048-bit keys recommended in RSA, for ECC 256-bit is sufficient
- Faster long-number ops due to shorter key inputs
- Single group operation requires multiple long-number operations
- ECC better suited for constrained embedded devices
- With physical access, fault injection (FI) and passive side-channel analysis (SCA) shown to break analytically secure ciphers (DES, AES, RSA)
- Exploit properties of implementation
- Devices require certain conditions to function normally:
- Stable supply voltage
- Stable clock signal
- Normal operating temperature
- No strong magnetic or electric fields in vicinity
- etc.
- If conditions are not met, device may produce incorrect results
- Example faults induced:
- Expose device to high/low temperatures
- Could reduce entropy of random number generator
- Could disable memory writes
- Over/Undervoltage device
- Can affect single or few instructions
- Temp increase clock frequency
- Can affect single or few instructions
- Expose the integrated circuit (IC) to UV-C light
- Can clear flash memory
- Target parts of IC with laser
- Can precisely modify single signal in device
- Tap microprobes on IC
- Read keys from memory
- Change internal signals
- Expose device to high/low temperatures
- Some attacks apply to all ciphers
- In the code below, a fault could be induced to skip the
return -1
instruction, bypassing the PIN check
int c = memcmp(enteredpin, storedpin, pinlen);
if(c != 0) {
return −1;
}
// Code continues ...
- A fault can be induced on a single bit to always set it to 0 ("stuck-at" fault)
- If adversary can fault each bit of the key, they can observe a change in ciphertext to establish if the key bit was a 1 or a 0
- If ciphertext changed, the key bit was 1, otherwise key bit was 0
- Algorithm shown below
- Both generic attacks require high control over fault, timing and space
RSA with CRT
- Split one RSA computation mod k-digit number n into two operations mod p and q
- Allows to reduce computational complexity by a factor of 4
Bellcore Attack
- Assume adversary can inject any fault into sp = xdpp
- Given correct signature s and fault signature s¬, mod n can be factored
- q = gcd(s - s¬, n), p = n/q
- Disadvantage: requires correct and fault signature on same plaintext
- Example:
- Correct signature 141, faulty signature 115
- n = 143
Recover q:
q = gcd(141 - 115, 143) = gcd(26, 143)
= gcd(26, 143 - 5*26) = gcd(26, 13) = 13
Lenstra Attack
- Advantage: "Emulate" correct signature value using signature verification
- Since a fault was induced on sp, result is still correct modulo q
- Verification can be factored to: q = gcd(s¬e - x, n), p = n/q
- Example:
- n = 143, e = 7, x = 15
- Faulty signature s¬= 115
- Compute 1157 mod 143 = 80 (using SAM)
- Recover q:
q = gcd(80 - 15, 143) = gcd(65, 143)
= gcd(65, 13) = 13
- Two classes: detection-based and algorithmic
- Detection-based often implemented on hardware or low software level
- Detect that FI has taken or is taking place
- Examples:
- Monitor power supply/clock signals for glitches
- Monitor environment (temperature)
- Sensors for light/laser detection
- Compute on two identical CPUs
- Check for control flow manipulation
- Checksums and parity bits on internal signals and data bus
- Algorithmic countermeasures use properties of algorithm to detect faults
- Examples:
- Compute algorithm inverse, e.g. verify or decrypt ciphertext before output
- Insert dummy operations to make FI harder at a specific position
- Examples:
- Logging of fault injection attempts must be done carefully
- Obvious attempt of simply incrementing a counter could be detected by an adversary, where they then remove power before the counter write and adjust their FI parameters
- Correct method is to increment a counter before operation, after success,
decrement the counter
- Increases execution time
- Flash memory etc. have limited write cycles
- Passive measurement of physical properties of an implementation
- Signal is usually called a trace
- Examples:
- Execution time: leak information about branching etc.
- Power consumption: leak information about data processed
- Electro-magnetic emanation: similar to power consumption
- Photonic emissions: precise information on internal processes
- Sound: vibrating circuit components can leak information, especially for slow algorithms (RSA)
Measurement Setup
- Left shows resistor in ground path of device
- Supply current ICC flows through the resistor R
- Derive ICC from voltage drop VR over R
- Right shows R in VCC path
- Measure drop over R
- Signal is DC-shifted, must remove DC constant by measuring AC-coupled
- VR can be measured directly using a differential probe
- Attack by inspecting one or a few traces
- In SAM, can distinguish squaring and multiply to reconstruct the secret exponent
- Can exploit small leakages
- Detect small leakage in large amount of noisy traces
- Idea is, power consumption is different for processing a 1 or 0
- Measurement and evaluation phase
Measurement
- n traces pi (t) with T sample points each
- Encrypting plaintext Xi using key k
- When referring to a byte, X0i means byte 0 of Xi
Evaluation
- Assume key byte k0 = 00
- Compute LSB(S(xi XOR k))
- Put p0 trace in bit 0 or bit 1 heap
- Repeat for different inputs
- Repeat with key candidates k0 = 01, 02, … , FF
- Look at difference of means of traces to find peaks indicating correct key
Why DPA Works
- Averaging many signals reduces noise relative to signal
- Reduces number of required traces
- We can approximate how a value affects leakage signal
-
Image shows how a value with fewer ones may result in a lower amplitude
-
The leakage model is the "Hamming Weight" (HW) model
-
Hamming weight is equal to number of bits set to 1 in b
-
Hamming Distance (HD) model assumes leakage depends on the previous and new value of a register
-
HD is equal to number of bits changed between the old and new value
-
In CPA, assume a key for first S-Box for input byte xi, then predict the intermediate result as (S(xi XOR k^))
-
Convert value into hypothetical power consumption using HW model
-
Determine the "match" between prediction and reality of traces
Rules of Thumb
- CPA is normalised to range of -1 <= p <= 1, making correlation easier to define
- {Some other stuff that seems complicated for an exam}
- Signal to Noise Ratio (SNR) determines success rate of side-channel attacks
- Lower the SNR, the more measurements needed
Balanced Logic Styles
- Reduce power of leakage signal by balancing power consumption
- E.g. Use a and a¬ at the same time
- Adversary will be observing HW(a) + HW(a¬)
- Always do an operation, e.g. in SAM, always square and multiply but discard unwanted results
Noise Generation
- Will increase number of traces required to remove noise in averaging
- Used in conjunction with other countermeasures can make averaging out noise hard
Masking
- Combine sensitive value x with a random mask value m e.g. using XOR
- Device performs all computation on (x XOR m)
- Device unmasks before outputting
- Non-linear components such as S-Box have to be re-generated for a masked version
- Higher-order SCA can leak value and mask
- Spreading leakage over multiple clock cycles
- Can make clock signal unstable on purpose
- Can randomly modulate frequency so operations are at different times in different executions
- Adversary may be able to re-align traces using peaks etc.
- Dummy operations can be added to control flow to move real operations for different executions
- Can randomly change order of operations e.g. S-Box instances in AES
- Timing-based countermeasures generally not enough on their own
- The actual algorithm should be in constant time
- Diversify keys, every device gets a unique key
- If one key is extracted, only one device is affected
- Backend system could check for and block inconsistencies/unexpected use
- Rate limiting
- System-level countermeasures depend on application, should be second-line defense