Epoll is fundamentally broken 1/2
I/O multiplexing part #3
In previous articles we talked about:
This time we'll focus on Linux's select(2)
successor - the
epoll(2)
I/O
multiplexing syscall.
Epoll is relatively young. It was created by Davide Libenzi in 2002. For comparison: Windows did IOCP in 1994 and FreeBSD's kqueue was introduced in July 2000. Unfortunately, even though epoll is the youngest in the advanced IO multiplexing family, it's the worse in the bunch.
Comparison with /dev/poll
Bryan Cantrill of
Joyent is known for bashing epoll()
. Here's one of the more
entertaining interviews:
He mentions two defects.
First he describes "a fatal flaw, that is subtle" in the Solaris
/dev/poll
model. He starts by describing the "thundering herd"
problem (which
we discussed earlier). Then
he moves on to the real issue. In a multithreaded scenario, when the
/dev/poll
descriptor is shared, it is impossible to deliver events
on one file descriptor to precisely one worker thread. He
explains that band aids to level triggered /dev/poll
model and naive
edge-triggered won't work in multithreaded case1.
This argument is indeed subtle, but since epoll
has semantics close
to /dev/poll
, it's safe to say it wasn't designed to work in
multithreaded scenarios.
In the video Mr Cantrill raised a second argument against epoll
: the
events registered in epoll aren't associated with file descriptor, but
with the underlying kernel object referred to by the file descriptor
(let's call this the file description). He mentions the "stunning"
effect of forking and closing an fd. We will leave this problem for
now and describe it in another blog post.
Why the critique?
Most of the epoll
critique is based on two fundamental design
issues:
1) Sometimes it is desirable to scale application by using multi
threading. This was not supported by early implementations of epoll
and was fixed by EPOLLONESHOT
and EPOLLEXCLUSIVE
flags.
2) Epoll registers the file descripton, the kernel data structure, not file descriptor, the userspace handler pointing to it.
The debate is heated because it's technically possible to avoid both
pitfalls with careful defensive programming. If you can you
should avoid using epoll for load balancing across
threads. Avoid sharing epoll file descriptor across
threads. Avoid sharing epoll-registered file
descriptors. Avoid forking, and if you must: close all
epoll-registered file descriptors before calling execve
. Explicitly
deregister affected file descriptors from epoll set before calling
dup
/dup2
/dup3
or close
.
If you have simple code and follow the advice above you might be fine. The problem starts when your epoll program gets complex.
Let's dig deeper. In this blog post I'll focus on the load balancing argument.
Load balancing
There are two distinct load balancing scenarios:
- scaling out
accept()
calls for a single bound TCP socket - scaling usual
read()
calls for large number of connected sockets
Scaling out accept()
Sometimes it's necessary to serve lots of very short
TCP connections. A high throughput HTTP 1.0 server is one such
example. Since the rate of inbound connections is high, you want to
distribute the work of accept()
ing connections across multiple
CPU's.
This is a real problem happening in large deployments. Tom Herbert reported an application handling 40k connections per second. With such a volume it does makes sense to spread the work across cores.
But it's not that simple. Up until kernel 4.5 it wasn't possible to use epoll to scale out accepts.
Level triggered - unnecessary wake-up
A naive solution is to have a single epoll file descriptor shared across worker threads. This won't work well, neither will sharing bound socket file descriptor and registering it in each thread to unique epoll instance.
This is because "level triggered" (aka: normal) epoll inherits the
"thundering herd" semantics from select()
. Without special flags, in
level-triggered mode, all the workers will be woken up on each and
every new connection. Here's an example:
- Kernel: Receives a new connection.
- Kernel: Notifies two waiting threads A and B. Due to "thundering herd" behavior with level-triggered notifications kernel must wake up both.
- Thread A: Finishes
epoll_wait()
. - Thread B: Finishes
epoll_wait()
. - Thread A: Performs
accept()
, this succeeds. - Thread B: Performs
accept()
, this fails with EAGAIN.
Waking up "Thread B" was completely unnecessary and wastes precious resources. Epoll in level-triggered mode scales out poorly.
Edge triggered - unnecessary wake-up and starvation
Okay, since we ruled out naive level-triggered setup, maybe "edge triggered" could do better?
Not really. Here is a possible pessimistic run:
- Kernel: Receives first connection. Two threads, A and B, are waiting. Due to "edge triggered" behavior only one is notified - let's say thread A.
- Thread A: Finishes
epoll_wait()
. - Thread A: Performs
accept()
, this succeeds. - Kernel: The accept queue is empty, the event-triggered socket moved from "readable" to "non readable", so the kernel must re-arm it.
- Kernel: Receives a second connection.
- Kernel: Only one thread is currently waiting on the
epoll_wait()
. Kernel wakes up Thread B. - Thread A: Must perform
accept()
since it does not know if kernel received one or more connections originally. It hopes to get EAGAIN, but gets another socket. - Thread B: Performs
accept()
, receives EAGAIN. This thread is confused. - Thread A: Must perform
accept()
again, gets EAGAIN.
The wake-up of Thread B was completely unnecessary and is confusing. Additionally, in edge triggered mode it's hard to avoid starvation:
- Kernel: Receives two connections. Two threads, A and B, are waiting. Due to "edge triggered" behavior only one is notified - let's say thread A.
- Thread A: finished
epoll_wait()
. - Thread A: performs
accept()
, this succeeds - Kernel: Receives third connection. The socket was "readable", still is "readable". Since we are in "edge triggered" mode, no event is emitted.
- Thread A: Must perform
accept()
, hopes to get EGAIN, but gets another socket. - Kernel: Receives fourth connection.
- Thread A: Must perform
accept()
, hopes to get EGAIN, but gets another socket.
In this case the socket moved only once from "non-readable" to "readable" state. Since the socket is in edge-triggered mode, the kernel will wake up epoll exactly once. In this case all the connections will be received by Thread A and load balancing won't be achieved.
Correct solution
There are two workarounds.
The best and the only scalable approach is to use recent Kernel 4.5+ and
use level-triggered events with EPOLLEXCLUSIVE
flag. This will
ensure only one thread is woken for an event, avoid "thundering herd"
issue and scale properly across multiple CPU's
Without EPOLLEXCLUSIVE
, similar behavior it can be emulated with
edge-triggered and EPOLLONESHOT
, at a cost of one extra epoll_ctl()
syscall after each event. This will distribute load across multiple
CPU's properly, but at most one worker will call accept()
at a time, limiting throughput.
- Kernel: Receives two connections. Two threads, A and B, are waiting. Due to "edge triggered" behavior only one is notified - let's say thread A.
- Thread A: Finishes epoll_wait().
- Thread A: Performs
accept()
, this succeeds. - Thread A: Performs
epoll_ctl(EPOLL_CTL_MOD)
, this will reset theEPOLLONESHOT
and re-arm the socket.
It's worth noting there are other ways to scale accept()
without
relying on epoll. One option is to use
SO_REUSEPORT
and create multiple
listen sockets sharing the same port number. This approach has
problems though - when one of the file descriptors is closed, the
sockets already waiting in the accept queue will be dropped. Read more
in this
Yelp blog post
and this LWN comment.
Kernel 4.4
introduced SO_INCOMING_CPU
to further improve locality of SO_REUSEPORT
sockets. I wasn't
able to find a good documentation of this very new feature.
Even better, kernel 4.5
introduced SO_ATTACH_REUSEPORT_CBPF
and SO_ATTACH_REUSEPORT_EBPF
socket options. When
used properly, with a bit of magic, it should be possible to substitute
SO_INCOMING_CPU
and overcome the usual SO_REUSEPORT
dropped
connections on rebalancing problem.
Scaling out read()
Apart from scaling accept()
there is a second use case for scaling
epoll
across many cores. Imagine a situation when you have a large
number of HTTP client connections and you want to serve them as
quickly as the data arrives. Each connection may require some
unpredictable processing, so sharding them into equal buckets across worker threads
will worsen mean latency. It's better to use "the combined queue" queuing
model - have one epoll set and use multiple threads to pull active
sockets and perform the work.
Here's The Engineer Guy explaining the combined queue model:
In our case the shared queue is an epoll descriptor, the tills are worker threads and the jobs are readable sockets.
Epoll level triggered
We don't want to use the level triggered model due to the "thundering
herd" behavior. Additionally the EPOLLEXCLUSIVE
won't help since
there is a race condition possible. Here's how it may
materialize:
- Kernel: receives 2047 bytes of data
- Kernel: two threads are waiting on epoll, kernel wakes up due to
EPOLLEXCLUSIVE
behavior. Let's say kernel woke up Thread A. - Thread A: finishes
epoll_wait()
- Kernel: receives 2 bytes of data
- Kernel: one thread is waiting on epoll, kernel wakes up Thread B.
- Thread A: performs
read(2048)
and reads full buffer of 2048 bytes. - Thread B: performs
read(2048)
and reads remaining 1 byte of data
In this situation the data is split across two threads and without using mutexes the data may be reordered.
Epoll edge triggered
Okay, so maybe edge triggered model will do better? Not really. The same race condition occurs:
- Kernel: receives 2048 bytes of data
- Kernel: two threads are waiting for the data: A and B. Due to the "edge triggered" behavior only one is notified.
- Thread A: finishes
epoll_wait()
- Thread A: performs a
read(2048)
and reads full buffer of 2048 bytes - Kernel: the buffer is empty so the kernel arms the file descriptor again
- Kernel: receives 1 byte of data
- Kernel: one thread is currently waiting in epoll_wait, wakes up Thread B
- Thread B: finished
epoll_wait()
- Thread B: performs
read(2048)
and gets 1 byte of data - Thread A: retries
read(2048)
, which returns nothing, gets EAGAIN
Correct solution
The correct solution is to use EPOLLONESHOT
and re-arm the file
descriptor manually. This is the only way to guarantee that the data
will be delivered to only one thread and avoid race conditions.
Conclusion
Using epoll()
correctly is hard. Understanding extra flags
EPOLLONESHOT
and EPOLLEXCLUSIVE
is necessary to achieve load
balancing free of race conditions.
Considering that EPOLLEXCLUSIVE
is a very new epoll flag, we may
conclude that epoll
was not originally designed for balancing load
across multiple threads.
In the next blog post in this series we will describe the epoll
"file descriptor vs file description" problem which occurs when used
with close()
and fork
calls.
-
The only solution is to remove the active file descriptor from the event set and have user re-arm it manually. ↩
or leave a comment here.