An otter, please (or, a better notification system)
2013-06-18

Recently, there have been discussions on IRC and the debian-devel mailing list about how to notify users, typically from a cron script or a system daemon that needs to tell the user their hard drive is about to expire. The current way is generally “send email to root” and, for some bits, “pop up a notification bubble, hoping the user will see it”. Emailing me means I get far too many notifications. They’re often not actionable (apt-get update failed two days ago) and they’re not aggregated.

I think we need a system that at its core has level and edge triggers and some way of doing flap detection. A level trigger means “tell me if a disk is full right now”. An edge trigger means “tell me if the checksums have changed, even if they now look ok”. Flap detection means “tell me if the nightly apt-get update fails more often than once a week”. It would be useful if it could extrapolate some notifications too, so it could tell me “your disk is going to be full in $period unless you add more space”.
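To make the distinction concrete, here is a minimal Python sketch of the three trigger types plus a naive extrapolation. Every name in it (`level_trigger`, `EdgeTrigger`, `FlapDetector`, `seconds_until_full`), the 90% threshold and the straight-line forecast are assumptions made up for illustration; none of this comes from an existing tool.

```python
from collections import deque


def level_trigger(used_fraction, threshold=0.9):
    """Level: fires for as long as the condition holds right now."""
    # The 0.9 threshold is an arbitrary illustrative default.
    return used_fraction >= threshold


class EdgeTrigger:
    """Edge: fires whenever the observed value changes, even if it changes back."""

    def __init__(self, initial):
        self.last = initial

    def observe(self, value):
        changed = value != self.last
        self.last = value
        return changed


class FlapDetector:
    """Flap: fires when more than `max_failures` failures fall inside a sliding window."""

    def __init__(self, max_failures=1, window_seconds=7 * 24 * 3600):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self.failures = deque()

    def record_failure(self, timestamp):
        self.failures.append(timestamp)
        # Drop failures that have aged out of the window.
        while timestamp - self.failures[0] >= self.window_seconds:
            self.failures.popleft()
        return len(self.failures) > self.max_failures


def seconds_until_full(samples, capacity):
    """Extrapolate linearly from the first and last (timestamp, bytes_used) samples.

    Returns None when usage is not growing.
    """
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    if t1 <= t0 or u1 <= u0:
        return None
    rate = (u1 - u0) / (t1 - t0)  # bytes per second
    return (capacity - u1) / rate
```

With the defaults, `FlapDetector` stays quiet for a single failed apt-get update and fires on the second failure within seven days. A real system would want something better than a two-point straight line for the forecast, but the shape of the answer (“full in $period”) is the same.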

The system needs to be able to accept input in a variety of formats: syslog, unstructured output from cron scripts (including their exit codes), snmp, nagios notifications, sockets and fifos and so on. Based on those inputs and any correlations it can pull out of them, it should try to reason about what’s happening on the system. If the conclusion is “something is broken”, it should see if it’s something that it can reasonably fix by itself. If so, it should fix it and record the fix (so it can be used for notification if appropriate: I want to be told if you restart apache every two minutes). If it can’t fix it, it should notify the admin.
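The fix-or-notify flow could look roughly like the following Python sketch. It only covers one input type (a cron job's exit code and output) and skips the correlation step entirely. The `Event` shape, `event_from_cron`, the `Dispatcher` class and its limits are all hypothetical names and values chosen for this example.

```python
import time
from dataclasses import dataclass


@dataclass
class Event:
    source: str   # where the input came from, e.g. "cron" or "syslog"
    check: str    # hypothetical name of the thing being checked
    ok: bool
    detail: str = ""


def event_from_cron(job_name, exit_code, output):
    """Normalise a cron job result into an Event; a non-zero exit code means broken."""
    return Event(source="cron", check=job_name, ok=(exit_code == 0), detail=output.strip())


class Dispatcher:
    """Try an automatic fix first, record it, and notify when fixing fails or repeats."""

    def __init__(self, notify, max_fixes=3, window_seconds=3600):
        self.notify = notify              # callable taking one message string
        self.max_fixes = max_fixes        # illustrative limit, not a recommendation
        self.window_seconds = window_seconds
        self.fixers = {}                  # check name -> callable(event) -> bool
        self.fix_log = []                 # (timestamp, check name) of applied fixes

    def register_fixer(self, check, fixer):
        self.fixers[check] = fixer

    def handle(self, event, now=None):
        if now is None:
            now = time.time()
        if event.ok:
            return "ok"
        fixer = self.fixers.get(event.check)
        if fixer is None:
            self.notify(f"{event.check} is broken: {event.detail}")
            return "notified"
        recent = [t for t, check in self.fix_log
                  if check == event.check and now - t < self.window_seconds]
        if len(recent) >= self.max_fixes:
            # Fixing the same thing over and over is itself worth a notification.
            self.notify(f"{event.check} keeps breaking ({len(recent)} recent fixes)")
            return "notified"
        if fixer(event):
            self.fix_log.append((now, event.check))
            return "fixed"
        self.notify(f"could not fix {event.check}: {event.detail}")
        return "notified"
```

The fix log is what makes the “restart apache every two minutes” case visible: the first few restarts happen silently, and once the limit inside the window is reached the admin hears about it instead.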

It should also group similar messages so a single important message doesn’t drown in a million unimportant ones. Ideally, this aggregation should work across hosts. It should be possible to escalate notifications that aren’t handled within some time period.
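As a sketch of what grouping and escalation might mean in practice, the Python below folds identical messages from several hosts into one entry and picks who to bother based on how long a notification has gone unhandled. The message tuple layout, the numeric severities, the escalation chain and the 30-minute step are all assumptions for illustration.

```python
from collections import defaultdict


def group_messages(messages):
    """Group (host, severity, text) tuples by (severity, text), most severe first.

    Returns a list of (severity, text, sorted_hosts) tuples.
    """
    groups = defaultdict(set)
    for host, severity, text in messages:
        groups[(severity, text)].add(host)
    grouped = [(severity, text, sorted(hosts))
               for (severity, text), hosts in groups.items()]
    # Higher severity first, then alphabetically by text for a stable order.
    return sorted(grouped, key=lambda g: (-g[0], g[1]))


def escalation_target(raised_at, now, acknowledged, chain, step_seconds=1800):
    """Return who should be notified now, or None once the notification is handled.

    Moves one step up `chain` for every `step_seconds` without acknowledgement
    and stays at the last entry after that.
    """
    if acknowledged:
        return None
    step = int((now - raised_at) // step_seconds)
    return chain[min(step, len(chain) - 1)]
```

Grouping by exact message text is the crudest possible notion of “similar”; anything real would need fuzzier matching, but even this keeps one “disk full” from being buried under a hundred identical apt-get failures.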

I’m not aware of such a tool. Maybe one could be rigged together by careful application of logstash, nagios, munin/ganglia/something and sentry. If anybody knows of such a tool, or is working on one, please let me know.
