Reinventing spinlocks
In the previous article I was playing with the implementation of the concurrent Queue in C.
During the experiments I tried to outsmart pthread
library and beat
the speed of their spinlock implementation.
CAS/CAS
My first attempt was to use compare-and-swap CAS
instruction for the
spinlock. Here's the code, basically we spin till the lock is acquired:
static inline void _lock(unsigned int *lock) { while (1) { int i; for (i=0; i < 10000; i++) { if (__sync_bool_compare_and_swap(lock, 0, 1)) { return; } } sched_yield(); } } static inline void _unlock(unsigned int *lock) { __sync_bool_compare_and_swap(lock, 1, 0); }
This approach works well, but we can still do better.
CAS/store
It's easy to notice that the unlock
doesn't actually need the heavy
CAS
write. We know that the value in the lock must be 1 and we can
just unconditionally write 0.
static inline void _unlock(unsigned int *lock) { __asm__ __volatile__ ("" ::: "memory"); *lock = 0; }
The asm line is required to prevent GCC from reordering the write. Additionally, in x86 the stores aren't reordered, but in other architectures a write barrier might be required above the assignment.
Benchmark
I used the benchmark from previous blog post to get the numbers. Beware, the locks are only a fraction of the measured complexity in these benchmarks.
The numbers for four threads spinning on four cores (ie: contended case):
And a single thread (uncontended case):
The CAS/store
method is as fast as pthread implementation of
spinlocks. That's nothing really surprising considering that pthread
also uses the same approach
(source).
So now you know how to implement reasonable spinlocks. This may come
handy for example when you're writing for a platform with glibc that
doesn't have pthread_spin_lock
, like Mac OS X. Here's the shim for
that case:
int pthread_spin_init(pthread_spinlock_t *lock, int pshared) { __asm__ __volatile__ ("" ::: "memory"); *lock = 0; return 0; } int pthread_spin_destroy(pthread_spinlock_t *lock) { return 0; } int pthread_spin_lock(pthread_spinlock_t *lock) { while (1) { int i; for (i=0; i < 10000; i++) { if (__sync_bool_compare_and_swap(lock, 0, 1)) { return 0; } } sched_yield(); } } int pthread_spin_trylock(pthread_spinlock_t *lock) { if (__sync_bool_compare_and_swap(lock, 0, 1)) { return 0; } return EBUSY; } int pthread_spin_unlock(pthread_spinlock_t *lock) { __asm__ __volatile__ ("" ::: "memory"); *lock = 0; return 0; }
Alternatively, ConcurrencyKit is a high-quality library that provides loads of concurrency primitives, including spinlocks: