Logging best practices

31 December 2013

Every serious program has some kind of logging infrastructure. Sometimes it's trivial (stderr); sometimes it's sophisticated and highly configurable.

Unfortunately, many programs use logging inconsistently. The logging infrastructure is usually grown organically and not thought through. Many programs repeat the same mistakes with regard to logging.

Let's take some time to talk about logging in larger systems.

Before we start it's necessary to recognise that logging serves many different purposes. Generally speaking, I think log messages fit into distinct categories, each useful on a different occasion:

Critical errors describing why the system wasn't able to start. They are useful for the system administrator trying to get the program to run.

Debug messages used mostly post mortem to discover why something went wrong. They are used by developers to understand what happened to a system on a higher level. Users are often asked to share these logs but rarely have enough knowledge to understand what they say.

Access logs related to end-user actions; for example HTTP requests in an HTTP server. Useful for statistics and to debug end-to-end system behaviour.

1) Critical failures are special

Errors of the first type are critical errors describing why the system wasn't able to start. In my opinion these messages are a special case. They are often produced before a proper logging system started; for example when configuration can't be parsed. I think these messages require separate treatment.

The Linux kernel is a good example. It can print all logs on the console before the system is properly configured. If the kernel fails to start a system administrator is able to quickly debug it.

2) Don't produce too many logs

It's often useful to have a verbosity setting for debug messages - though I think it shouldn't be overcomplicated. Three verbosity settings are in my opinon enough:

"Quiet" mode, useful when the server is running in batch mode.
"Default", that prints logs most useful for normal users.
"Debug" mode, with verbose logs for debugging.

In my opinion even the "debug" should be readable. These logs are going to be read by a human and it's counterproductive to produce megabytes of verbose logs that nobody can digest. Furthermore, useless debug statements spoil the code without creating any value.

Let's look at the Linux kernel again. Kernel programmers avoid logging too much - dmesg usually contains only the most important information. To log more than that is treated as a potential denial of service attack vector.

3) Don't mix debug and access logs

I believe that access logs are fundamentally different from debug messages and should be dealt with separately. Access logs, as opposed to debug messages, are mostly used for statistics and represent a well-defined and valid system behaviour.

In practice the distinction is not that simple. Ideally debug logs should be only about a meaningful state changes in the full system. Things like start, stop, configuration reload, credential changes or disk errors, completely unrelated to end-user activity. Web servers are a good example. They treat access logs separately from whole-program debug messages. A configuration file allows the user to specify access log granularity per vhost.

This time, the Linux kernel is a bad example. By design it doesn't have an "access log" equivalent. Unfortunately, it's possible to bend the rules and get it log end-user activity in dmesg. For example one can get iptables to do logging by using -j LOG like this:

iptables -A INPUT -p tcp --dport 22 --syn \
         -j LOG --log-prefix "iptables: "

As a result the "access" logs produced by iptables are mixed with potentially crucial debug messages in dmesg. Without a special rule in syslogd.conf, the /var/log/kern.log file becomes a mess.

4) Rate-limit debug logs

Related to the previous point: I think it's best to avoid producing debug messages that contain user requests or are derived from them.

As mentioned above, the Linux kernel by default avoids logging end-user actions and keeps dmesg clean. But it's not always easy to distinguish what is an "end-user" action and what is not. For example, receiving a network packet clearly shouldn't be logged in dmesg, but how about the action of mounting a CD? For this reason Linux kernel has a special printk_ratelimited function. It's used to log a thing, but if it happens too often the kernel is free to skip it and avoid overwhelming the syslog.

5) Heighten your hygiene

When software matures and the code stabilises, logging statements can age. I've seen this many times - recent parts of the code produce reasonable debugging logs while old code has some really weird verbose stuff that nobody understands and everybody is afraid to remove. My advice: be bold, remove stupid logs and make a constant effort to keep all logging levels consistent and sensible.

Let me use Linux as an example again. Its logging is generally in good shape, but in my opinion it could be simplified. I really don't understand the difference between "EMERGENCY", "ALERT" and "CRITICAL"; or "DEBUG" and "INFO" error levels. But all of these seem to be used:

linux$ for type in _DEFA _DEBUG _INFO _WARN _NOTI _ERR _CRIT _ALER _EMER;
do echo -en "$type\t"; find . -name \*c -a -type f| xargs cat|egrep "\sprintk"|grep KERN$type|wc -l; done
_EMERG   266
_ALERT   287
_CRIT    397
_ERR     10697
_WARNING 4811
_NOTICE  822
_INFO    6706
_DEBUG   4478

Summary

The main message is simple: don't be a slave of your logging system. Tweak it, refactor it, maybe even rewrite it. Most importantly: think it through. In the end it's a part of the code you ship.