From years of bitter experience staring at log files trying to work out what turned the servers into a pile of molten rubble, I’ve built up a list of what I really like to see when a process logs its activity. Prompted by some discussions at work, I’d like to share it — in the hope of raising the quality of logging in our software and saving someone a lot of stress at 3am when their project absolutely refuses to work and they can’t figure out why.
This is explicitly not about logging frameworks. I don’t particularly care how logging is implemented in code, since that will necessarily differ by language and application architecture. I just care about the outcome.
Each process logs to one log file
I don’t want to jump back and forth between log files, interleaving lines, trying to reconstruct what happened when. It’s hard enough to work out what’s going on at the best of times. Let’s not make it harder.
Each log file has one process, one thread writing to it
When two or more processes or threads write to the same file, it’s difficult to isolate what the process you care about actually did. You can work around this by adding a token to each log line — a process or thread ID, for instance — but there are deeper complications.
Say thread A logs something at exactly the same time as thread B. Halfway through thread A writing its log entry, thread B becomes active and starts logging. Thread B eventually yields, and thread A resumes. What a mess. Thread A’s log entry now has thread B’s log entry spliced right through the middle. Who said what? Nightmare.
When you can’t avoid having multiple threads or processes writing to a log file, they should talk to a logging arbitrator service that manages the file and ensures entries are written atomically.
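The post is framework-agnostic, but the single-writer pattern is easy to sketch. Here's a minimal illustration using Python's stdlib (all names — the `app.log` path, the `workers` logger — are mine, not from the post): worker threads put records on a queue, and one listener thread owns the file and writes each entry whole, so splicing can't happen.

```python
# Sketch of a single-writer "arbitrator": many threads log, one thread writes.
import logging
import logging.handlers
import queue
import threading

log_queue = queue.Queue()

# The only handler that ever touches the file -- owned by the listener thread.
file_handler = logging.FileHandler("app.log")
file_handler.setFormatter(
    logging.Formatter("%(asctime)s [%(threadName)s] %(message)s")
)

listener = logging.handlers.QueueListener(log_queue, file_handler)
listener.start()

# Worker threads log through the queue, never to the file directly.
logger = logging.getLogger("workers")
logger.setLevel(logging.INFO)
logger.addHandler(logging.handlers.QueueHandler(log_queue))

def worker(n):
    for i in range(3):
        logger.info("worker %d, message %d", n, i)

threads = [threading.Thread(target=worker, args=(n,)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
listener.stop()  # drains the queue before shutting the listener down
```

Because only the listener thread writes, each line lands in the file intact — the thread-ID token then tells you who said what, rather than papering over spliced entries.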
Logging is not buffered
When it comes to logging, I prefer completeness over speed. If the process dies or is killed, I don’t want the last few log entries sitting in a memory buffer somewhere — I want them on disk where I can read them. If my process does something, I should be able to see it immediately, not after waiting for a buffer to fill or a flush interval to expire.
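In concrete terms (a sketch in Python, with a hypothetical `unbuffered.log` path): flush after every entry, and if you really care, fsync too, so the entry survives even an abrupt kill.

```python
# Sketch: force each log entry to disk immediately.
import os

log = open("unbuffered.log", "a", buffering=1)  # line-buffered: flushed on "\n"

def write_entry(message: str) -> None:
    log.write(message + "\n")
    log.flush()             # drain any userspace buffer
    os.fsync(log.fileno())  # ask the OS to commit the entry to disk

write_entry("server started")
write_entry("listening on port 8080")
```

The fsync costs you throughput, which is exactly the completeness-over-speed trade the post is arguing for.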
Each log line has a timestamp
This seems obvious, but I’m amazed by how often it doesn’t happen. A log file without timestamps is useless unless you happen to be watching it when something goes wrong.
Each timestamp is at sub-second resolution
Logging with timestamps accurate only to one second is maddening when you’re dealing with thousands of entries per second. Which of those caused the issue? Good luck.
Given the choice, I’d rather be allowed to configure the logger to print to STDOUT so I can use svlogd to add tai64n timestamps (and do much more besides). But just do something reasonably sane and I’ll be happy.
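If the process must stamp its own lines rather than deferring to svlogd, "reasonably sane" is cheap to achieve. A sketch with Python's stdlib logging (format string and logger name are mine): the default `%(asctime)s` only gives seconds, but `%(msecs)03d` tacks on milliseconds.

```python
# Sketch: millisecond-resolution timestamps on every log line.
import logging

handler = logging.StreamHandler()  # stands in for whatever sink you use
handler.setFormatter(logging.Formatter(
    fmt="%(asctime)s.%(msecs)03d %(levelname)s %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S",
))

logger = logging.getLogger("timestamps")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.info("request handled")  # e.g. 2024-01-01T12:00:00.123 INFO request handled
```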
Each log file has a guaranteed maximum size
Ever seen what happens when a process tries to start and the disk is full of old logs? Generally, it doesn’t work. That’s annoying.
Older logs should be archived elsewhere. Only recent logs should live on local disk for easy debugging. Your definition of “older” and “recent” will vary, but knowing you’re keeping, say, 1GB of logs on disk means you can ensure there’s always enough space.
Ad hoc or interval-based log rotation doesn’t help here, because by the time the clock triggers a rotation the disk may already be full. Rotation needs to be triggered by size, not by time. Processes like svlogd and some syslog implementations handle this properly.
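If you can't put svlogd in front of the process, the same guarantee can be sketched in-process. Here's an illustration with Python's `RotatingFileHandler` (the `capped.log` path and the exact numbers are mine, echoing the post's 1GB figure): ten files of at most 100MB each means the logs can never occupy more than about 1GB, and rotation fires the moment the size limit is hit, not on a timer.

```python
# Sketch: a hard cap on total log size via size-triggered rotation.
import logging
import logging.handlers

handler = logging.handlers.RotatingFileHandler(
    "capped.log",
    maxBytes=100 * 1024 * 1024,  # rotate the moment a file reaches 100MB
    backupCount=9,               # keep capped.log plus capped.log.1 .. .9
)

logger = logging.getLogger("capped")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.info("disk usage is now bounded")
```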
There’s some way of processing a log file once it reaches a known size
At some point I’ll want to analyse and archive log files. It’s annoying to miss a rotation trigger and discover I’ve lost several hours of entries. Please don’t make me track file rotation myself. I’ll do it badly, and that will make me sad.
Most of these preferences are legacies of working with (and being spoiled by) DaemonTools and Runit, coming from a Rails-centric, Mongrel-running world where there’s typically one process logging to one file. If I’ve missed something, please let me know — email address is below.