AES support in Westmere
In one of the crypto-class lectures, professor Dan Boneh mentioned that Intel's Westmere architecture has native instructions that help with the AES cipher.
Or, formally speaking, newer CPUs support the AES-NI instruction set.
I decided to play with these instructions for a while and implement a simple AES-128 CTR stream cipher.
I'm not experienced in writing assembly, so I mostly derived the code from the aesni-intel_asm.S file from the Linux kernel.
Additionally, Intel's original AES-NI documentation is quite useful.
I'll be using AT&T syntax and targeting x86-64.
Key expansion
First, AES takes a fixed-length key (128 bits in our case) and expands it into round keys. For AES-128 there are eleven of them: the original key itself, plus one for each of the ten rounds.
Intel supplies a helper instruction for that: aeskeygenassist.
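If you want something to compare the assembly output against, the same eleven round keys can be computed in portable C. The following is only a sketch of the standard FIPS-197 key schedule, not code from this post: the names `expand_key128_ref`, `init_sbox`, `xtime` and `rotl8` are made up for illustration, and the S-box is generated at run time to keep the example short.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Multiply by 2 in GF(2^8) using the AES reduction polynomial. */
static uint8_t xtime(uint8_t v) {
    return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1b : 0x00));
}

static uint8_t rotl8(uint8_t v, int s) {
    return (uint8_t)((v << s) | (v >> (8 - s)));
}

/* Build the AES S-box instead of embedding a 256-byte table. */
static void init_sbox(uint8_t sbox[256]) {
    uint8_t p = 1, q = 1;
    do {
        p = (uint8_t)(p ^ xtime(p));   /* p = p * 3 */
        q ^= (uint8_t)(q << 1);        /* q = q / 3, so q stays the inverse of p */
        q ^= (uint8_t)(q << 2);
        q ^= (uint8_t)(q << 4);
        if (q & 0x80)
            q ^= 0x09;
        /* affine transformation of the inverse */
        sbox[p] = (uint8_t)(q ^ rotl8(q, 1) ^ rotl8(q, 2) ^
                            rotl8(q, 3) ^ rotl8(q, 4) ^ 0x63);
    } while (p != 1);
    sbox[0] = 0x63;                    /* zero has no inverse */
}

/* Hypothetical reference routine: rk[0] is the key, rk[1..10] are derived. */
void expand_key128_ref(uint8_t rk[11][16], const uint8_t key[16]) {
    uint8_t sbox[256];
    uint8_t rcon = 0x01;               /* 0x01, 0x02, ..., 0x80, 0x1b, 0x36 */

    assert(rk != NULL && key != NULL);
    init_sbox(sbox);
    memcpy(rk[0], key, 16);

    for (int r = 1; r <= 10; r++) {
        const uint8_t *prev = rk[r - 1];
        uint8_t *cur = rk[r];

        /* First word: previous first word xor SubWord(RotWord(last word)) xor rcon */
        cur[0] = (uint8_t)(prev[0] ^ sbox[prev[13]] ^ rcon);
        cur[1] = (uint8_t)(prev[1] ^ sbox[prev[14]]);
        cur[2] = (uint8_t)(prev[2] ^ sbox[prev[15]]);
        cur[3] = (uint8_t)(prev[3] ^ sbox[prev[12]]);

        /* Remaining words: previous word at the same position xor the word just made */
        for (int i = 4; i < 16; i++)
            cur[i] = (uint8_t)(prev[i] ^ cur[i - 4]);

        rcon = xtime(rcon);
    }
}
```

The `rcon` values produced here are the same constants that show up as immediates to aeskeygenassist in the assembly version.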
C header for our key expansion routine:
    struct key128_ctx {
        u8 key[11][16];
    } __attribute__((aligned(16)));  // the assembly accesses round keys with movaps

    // Take 16-byte key and expand it to key128_ctx
    void expand_key128(struct key128_ctx *ctx, u8 *key);
And the assembly code:
    ENTRY(expand_key128)
        # %rdi - ctx pointer
        # %rsi - key pointer
        movups (%rsi), %xmm0        # move key to xmm0
        movups %xmm0, (%rdi)        # save key as the first round key
        add $0x10, %rdi             # move to next slot
        pxor %xmm4, %xmm4           # _key_expansion_128 assumes xmm4 starts as zero

        aeskeygenassist $0x1, %xmm0, %xmm1
        call _key_expansion_128
        aeskeygenassist $0x2, %xmm0, %xmm1
        call _key_expansion_128
        aeskeygenassist $0x4, %xmm0, %xmm1
        call _key_expansion_128
        aeskeygenassist $0x8, %xmm0, %xmm1
        call _key_expansion_128
        aeskeygenassist $0x10, %xmm0, %xmm1
        call _key_expansion_128
        aeskeygenassist $0x20, %xmm0, %xmm1
        call _key_expansion_128
        aeskeygenassist $0x40, %xmm0, %xmm1
        call _key_expansion_128
        aeskeygenassist $0x80, %xmm0, %xmm1
        call _key_expansion_128
        aeskeygenassist $0x1b, %xmm0, %xmm1
        call _key_expansion_128
        aeskeygenassist $0x36, %xmm0, %xmm1
        call _key_expansion_128
        ret

    .align 4,0x90
    _key_expansion_128:
        # xmm0 - previous key, xmm1 - assist result
        # magical bitshifting: xor every word with all the words before it
        pshufd $0b11111111, %xmm1, %xmm1
        shufps $0b00010000, %xmm0, %xmm4
        pxor %xmm4, %xmm0
        shufps $0b10001100, %xmm0, %xmm4
        pxor %xmm4, %xmm0
        pxor %xmm1, %xmm0
        movaps %xmm0, (%rdi)        # save the key
        add $0x10, %rdi             # move to next slot
        ret
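The shuffle sequence in `_key_expansion_128` is easier to follow when modelled on plain 32-bit words. The sketch below is my reading of those instructions rather than anything from Intel or the kernel: `shufps_model` and `key_expansion_step_model` are invented names, lane 0 is the lowest dword of the register, and `%xmm4` is assumed to hold zero on entry (the kernel's version of this routine makes the same assumption).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of `shufps $imm, src, dst`: the two low result lanes are picked
 * from dst, the two high result lanes from src. */
static void shufps_model(uint32_t dst[4], const uint32_t src[4], unsigned imm) {
    uint32_t r[4];
    r[0] = dst[imm & 3];
    r[1] = dst[(imm >> 2) & 3];
    r[2] = src[(imm >> 4) & 3];
    r[3] = src[(imm >> 6) & 3];
    memcpy(dst, r, sizeof r);
}

/* k models %xmm0 (previous round key, updated in place).
 * t models the single word that `pshufd $0b11111111` broadcasts
 * from the aeskeygenassist result into every lane of %xmm1. */
void key_expansion_step_model(uint32_t k[4], uint32_t t) {
    uint32_t x4[4] = {0, 0, 0, 0};          /* %xmm4, assumed zero */
    int i;

    assert(k != NULL);

    shufps_model(x4, k, 0x10);              /* x4 = (0, 0, k1, k0)          */
    for (i = 0; i < 4; i++) k[i] ^= x4[i];  /* k  = (k0, k1, k1^k2, k0^k3)  */

    shufps_model(x4, k, 0x8c);              /* x4 = (0, k0, k0, k1^k2)      */
    for (i = 0; i < 4; i++) k[i] ^= x4[i];  /* k  = running xor of k0..k3   */

    for (i = 0; i < 4; i++) k[i] ^= t;      /* pxor %xmm1, %xmm0            */
}
```

So each new word ends up as the xor of all previous-key words up to its position, combined with the word derived by aeskeygenassist, which is what the AES-128 key schedule asks for.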
CTR-mode encryption
    /*
     * Fills `out` with the stream generated by CTR.
     * `len` must be a multiple of 16 bytes.
     * `iv` is 16 bytes long.
     */
    void ctr_stream(struct key128_ctx *ctx, u8 *out, u32 len, u8 *iv);
    ENTRY(ctr_stream)
        # %rdi - ctx
        # %rsi - out
        # %edx - len (u32, so only the low half of %rdx is defined)
        # %rcx - iv

        # Load round keys from CTX
        movaps (%rdi), %xmm5
        movaps 0x10(%rdi), %xmm6
        movaps 0x20(%rdi), %xmm7
        movaps 0x30(%rdi), %xmm8
        movaps 0x40(%rdi), %xmm9
        movaps 0x50(%rdi), %xmm10
        movaps 0x60(%rdi), %xmm11
        movaps 0x70(%rdi), %xmm12
        movaps 0x80(%rdi), %xmm13
        movaps 0x90(%rdi), %xmm14
        movaps 0xa0(%rdi), %xmm15

        # Load IV to xmm0
        movups (%rcx), %xmm0

    .loop:
        # Exit loop if done
        cmp $16, %edx
        jb .loopexit

        # Encrypt xmm0, put the result in xmm1
        movdqa %xmm0, %xmm1
        pxor %xmm5, %xmm1           # Whitening step (Round 0)
        aesenc %xmm6, %xmm1         # Round 1
        aesenc %xmm7, %xmm1         # Round 2
        aesenc %xmm8, %xmm1         # Round 3
        aesenc %xmm9, %xmm1         # Round 4
        aesenc %xmm10, %xmm1        # Round 5
        aesenc %xmm11, %xmm1        # Round 6
        aesenc %xmm12, %xmm1        # Round 7
        aesenc %xmm13, %xmm1        # Round 8
        aesenc %xmm14, %xmm1        # Round 9
        aesenclast %xmm15, %xmm1    # Round 10

        movups %xmm1, (%rsi)        # Save generated stream
        add $16, %rsi               # Move pointer
        sub $16, %edx               # Reduce len
        call _inc_xmm0              # Increase IV
        jmp .loop
    .loopexit:
        ret

    # Increment IV in %xmm0
    .align 4,0x90
    _inc_xmm0:
        # Use 16 bytes from the red zone
        lea -0x10(%rsp), %r8        # Load the address to r8
        movups %xmm0, (%r8)         # Save xmm0 there
        mov 0x8(%r8), %rax          # Load the last 8 bytes to rax
        bswap %rax                  # Counter is big-endian
        inc %rax
        bswap %rax
        mov %rax, 0x8(%r8)          # Save it back
        movups (%r8), %xmm0         # Reload xmm0
        ret
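Stripped of the AES rounds, the loop above has a very simple shape: encrypt the counter, write the result, bump the counter. Here is a hedged C model of that structure. The names `block_fn`, `ctr_increment`, `ctr_stream_model` and `copy_block` are hypothetical, and `copy_block` is only a stand-in that copies its input so the loop can be exercised without AES-NI; it is of course not a cipher.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical callback that encrypts a single 16-byte block. */
typedef void (*block_fn)(const void *ctx, const uint8_t in[16], uint8_t out[16]);

/* Mirrors _inc_xmm0: big-endian increment of the last 8 bytes only.
 * A carry out of byte 8 is dropped, so the first 8 bytes never change. */
static void ctr_increment(uint8_t counter[16]) {
    for (int i = 15; i >= 8; i--) {
        if (++counter[i] != 0)
            break;
    }
}

/* Fills `out` with E(iv), E(iv + 1), E(iv + 2), ... */
void ctr_stream_model(block_fn encrypt, const void *ctx,
                      uint8_t *out, size_t len, const uint8_t iv[16]) {
    uint8_t counter[16];

    assert(len % 16 == 0);          /* same restriction as ctr_stream */
    memcpy(counter, iv, 16);

    for (size_t off = 0; off < len; off += 16) {
        encrypt(ctx, counter, out + off);
        ctr_increment(counter);
    }
}

/* Stand-in "cipher" for demonstration: just copies the block. */
void copy_block(const void *ctx, const uint8_t in[16], uint8_t out[16]) {
    (void)ctx;
    memcpy(out, in, 16);
}
```

With `copy_block` plugged in, the output is simply the sequence of counter values, which makes the increment behaviour easy to inspect.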
This code is not optimal. The encryption xor-ing could be done much
faster on the vector registers. Some of the loads could be assumed to
be aligned, rather than unaligned. Additionally, it looks like on AMD
Bulldozer (via Agner Fog's instruction tables) aesenc can be executed
on two pipelines in parallel.
Usage
And finally, an example usage of our ctr_stream function:
    int main() {
        u8 *key = hex("36f18357be4dbd77f050515c73fcf9f2", NULL);
        u8 *iv = hex("69dda8455c7dd4254bf353b773304eec", NULL);

        struct key128_ctx ctx;
        expand_key128(&ctx, key);

        u8 out[512];
        ctr_stream(&ctx, out, sizeof(out), iv);

        int msg_len = 0;
        u8 *msg = hex("0ec7702330098ce7f7520d1cbbb20fc3", &msg_len);
        xor(out, msg, msg_len);
        out[msg_len] = '\0';
        printf("out=%s\n", out);
        return 0;
    }
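The `hex` and `xor` helpers used above are not shown in this post. Judging only by how they are called, they could look roughly like the sketch below. This is a guess and not the actual implementation: the names `hex_decode` and `xor_bytes` are deliberately different to mark them as stand-ins, and the error handling (returning NULL on bad input) is my own assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Value of one hex digit, or -1 if the character is not a hex digit. */
static int nibble(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decode a hex string into a freshly malloc'ed buffer.
 * Stores the byte count in *out_len when out_len is not NULL.
 * Returns NULL on odd length, invalid digits or allocation failure. */
uint8_t *hex_decode(const char *s, int *out_len) {
    size_t n;
    uint8_t *buf;

    assert(s != NULL);
    n = strlen(s);
    if (n % 2 != 0)
        return NULL;

    buf = (uint8_t *)malloc(n / 2 + 1);
    if (buf == NULL)
        return NULL;

    for (size_t i = 0; i < n / 2; i++) {
        int hi = nibble(s[2 * i]);
        int lo = nibble(s[2 * i + 1]);
        if (hi < 0 || lo < 0) {
            free(buf);
            return NULL;
        }
        buf[i] = (uint8_t)((hi << 4) | lo);
    }

    if (out_len != NULL)
        *out_len = (int)(n / 2);
    return buf;
}

/* XOR `len` bytes of src into dst, in place. */
void xor_bytes(uint8_t *dst, const uint8_t *src, int len) {
    for (int i = 0; i < len; i++)
        dst[i] ^= src[i];
}
```

In the usage example, xor-ing the ciphertext into the generated stream is what turns `out` into the decrypted message, since CTR encryption and decryption are the same operation.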
The full code, as usual, is available on GitHub.
And remember, I'm not an assembly guru, so please review this code
carefully before reusing it. Or even better, reuse the routines from
the kernel: aesni-intel_asm.S.