Bitsliced SipHash
Few days ago I presented a Python and a C implementation of SipHash. This time for no reason whatsoever I implemented a bitsliced version of it.
Bitslicing a crypto algorithm is usually done to speed it up when doing massively parallel operations. For example when trying to find a collision with brute force. Bitsliced implementation is only useful if you have a large number of exactly the same mathematical operations to be computed in parallel.
Let me try to describe how bitslicing works.
Intro to bitslicing
In essence to bitslice an algorithm is to rewrite it in a lowest possible level, using simplest logical operations on every bit of the input. The idea is to perform those simple computations on large number of bits in parallel using long SIMD vectors.
For example, say we define a mathematical function that does a
bitwise or
on two 4 bit variables. Our non-bitsliced function will
look like:
u4 formula_raw(u4 a, u4 b) { return a | b; }
Now, if we have 256 different inputs we'd need to call this function 256 times to generate all the results.
An simplistic illustration of two calls to our formula:
On the other hand a bitsliced version will look like:
void formula_bs(vec dst[4], vec a_bits[4], vec b_bits[4]) { unsigned bit_no; for (bit_no = 0; bit_no < 4; bit_no ++) { dst[bit_no] = a_bits[bit_no] | b_bits[bit_no]; } }
Where a_bits[0]
is a vector (say an
avx
256-bit wide
register) that contains all zeroth bits of our 256 inputs a
, a_bits[1]
contains all first bits, and so on.
To count all 256 outputs we need to call the function only once as it operates on 256 bit vectors already.
Illustration of the bitsliced version of our formula:
As you can see, we need a single or
operation for every bit of
the input, which is constant.
To give you an approximation of the benefit - on my i7 CPU counting our function on 256 inputs took 506 cycles for the normal implementation and only 39 cycles for the bitsliced one (see the sources of the test).
The hidden cost
There is a major cost in practical usage of bitslicing - it is
neccesary to pass bit vectors as the input, not the a
and b
the
end user cares about. This is not hard - all is needed is a bit matrix
transposition. Unfortunately, although SSE instruction
PMOVMSKB
/_mm_movemask_epi8
helps a lot, it's computationally
expensive.
To illustrate the transposition for our formula: two 4-bit inputs are transposed into four 2-bit vectors, one for each bit:
On my i7 CPU it takes around 1040 CPU cycles to transpose 8x256 bit
matrix. We need three transpositions for our or
formula (two for
inputs, and one to decipher the output), that's 3120 cycles for
transpositions only. Bear in mind - I didn't optimise
the transposition code, just borrowed it from Mischasan.
Rewriting SipHash
The whole fun in playing with bitslicing is to rewrite a non-trivial algorithm in terms of bitwise operations.
Let's focus on the core of the SipHash - a single "round".
Here's a half-round of SipHash from my non-bitsliced code:
#define ROTATE(x,b) \ (u64)( ((x) << (b)) | ( (x) >> (64 - (b))) ) #define HALF_ROUND(a,b,c,d,s,t) \ a += b; \ b = ROTATE(b, s) ^ a; \ a = ROTATE(a, 32); \ c += d; \ d = ROTATE(d, t) ^ c;
Nothing fancy. Just some bit rotations, two additions and two xor
operations. It's worth noting that SipHash state is encapsulated in
four 64 bit variables a
- d
.
In the bitsliced implementation we'll be using
gcc
vector notation,
based on 256 bit vectors. Being an optimist I assume the compiler can
take advantage of avx
.
First, let's define the vector type and constants:
/* 32 bytes, that's 256 bits and 4*8 u64 values */ typedef uint64_t vec __attribute__ ((vector_size (32))); const vec zero = {0ULL, 0ULL, 0ULL, 0ULL}; const vec one = {0xffffffffffffffffULL, 0xffffffffffffffffULL, 0xffffffffffffffffULL, 0xffffffffffffffffULL};
Now, let's define an addition operation bitwise. That takes three
inputs - carry
, a
and b
and returns new carry
and the result.
Following the adder logic gate schema:
our function requires 5 operations:
/* Add with carry */ #define ADD(a, b, out, carry_in, carry_out) do { vec __a = a, __b = b; vec _axb = __a ^ __b; out = carry_in ^ _axb; carry_out = (_axb & carry_in) | (__a & __b); } while (0)
Having that we can implement the half-round of SipHash:
#define M(e) (e) & 0x3f #define HALF_ROUND(a,b,c,d, s,t, _a,_b,_c,_d) do { unsigned o; vec tmp, carry; carry = zero; for (o = 0; o < 64; o++) { ADD(a[o], b[o], tmp, carry, carry); _a[M(o-32)] = tmp; _b[o] = b[M(o-s)] ^ tmp; } carry = zero; for (o = 0; o < 64; o++) { ADD(c[o], d[o], tmp, carry, carry); _c[o] = tmp; _d[o] = d[M(o-t)] ^ tmp; } } while (0)
Where a
- d
vectors are inputs and _a
- _d
are used for
storing the results.
Notice we don't need rotate operations - instead we just address
different bit in a vector: _a[M(o-32)]
instead of _a[o]
.
Complexity
Normally a half-round of SipHash requires:
- 2 additions
- 2 xors
- 3 bit rotations
That's 1792 operations to count 256 inputs.
The bitsliced code on the other hand does:
- 64 * 2 additions, 5 operations each
- 64 * 2 xors
That's 768 operations to count 256 inputs.
In order to compute a full SipHash function one needs to run at least twelve half-rounds.
Results
Unfortunately in the end my bitsliced implementation is slower:
- The cost of transposing inputs for the bitsliced algorithm is significant.
- Bitsliced algorithm uses two sets of four vectors 256x64 bits. That's 16KiB of internal state! I'm afraid it may not fit into the CPU registers.
On my i7 CPU computing 256 SipHash hashes the normal implementation requires about:
computation: 9768 cycles
Compiled with clang
and avx
extensions the bitsliced code runs in:
computation: 11278 cycles transpose: 13900 cycles
In other words - even if we ignore the cost of transposition the code still runs a bit slower than the non-parallelised version. With hand crafted assembler we should be able to do better.
Most crucially: first I need to work on reducing the cost of the bit transposition!
The code used in this article: