Decoupling Nagios Host and Service check events for fun and profit
Nagios does a pretty good job of watching over my services and hosts, but I want to do a little more with the events it creates when it checks a service and something is wrong, or when something recovers. In particular I want to give my clients the ability to select in fine detail what sort of notifications they get, for what services, how often, and at what level of technical detail. Coupled with this I want to up-sell the services that Xeriom offers: if the disk is getting full, or the transfer quota is being consumed so fast that it won’t last until the end of the month, I want to make it easy to upgrade plans. I’d also like to be able to try out some fun things (iPhone push notifications, SMS gateways, audible alarms, whatever) without worrying that I might destroy Nagios and bring my monitoring setup to its knees.
Message queues are a great way of decoupling systems, moving risk and complexity elsewhere. Nagios shouldn’t have to worry about all of the stuff I want to build around the monitoring system, it should focus just on the core features that I like it for: monitoring my hosts and services.
Luckily, I already have ActiveMQ running for other tasks, writing a STOMP client using SMQueue is pretty trivial, and Nagios has several ways to execute external commands when events happen including the global host and service event handlers. All I need is a command to have Nagios run that’ll accept a bunch of information from Nagios and stick it on the message queue.
Here’s what I came up with:
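#! /usr/bin/env ruby
# Arguments: broker host, destination, then the Nagios macros in the
# order used by the command definition.
require 'rubygems'
require 'smqueue'
require 'json'

# Details of the service event, as passed in by Nagios.
message = {
  :hostname     => ARGV[2],
  :service      => ARGV[3],
  :state        => ARGV[4],
  :state_type   => ARGV[5],
  :state_time   => ARGV[6].to_i,
  :attempt      => ARGV[7].to_i,
  :max_attempts => ARGV[8].to_i,
  :time_t       => Time.now.to_i
}

# Where to send it: the broker host and the topic or queue name.
configuration = {
  :host    => ARGV[0],
  :name    => ARGV[1],
  :adapter => :StompAdapter
}

broadcast = SMQueue(configuration)
broadcast.put message.to_json, "content-type" => "application/json"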
You’ll need Ruby and RubyGems installed. Once you have those, install the script like this:
sudo su -
gem sources -a http://gems.github.com/
gem install seanohalpin-smqueue json --no-ri --no-rdoc
cd /usr/bin
wget http://gist.github.com/raw/306765/2a3e9cbade88b4c6dd430e108bc8a28f95047462/notify-service-by-stomp.rb
chmod +x notify-service-by-stomp.rb
Once it's installed tell Nagios to use it by adding this to your Nagios configuration:
define command {
command_name notify-service-by-stomp
command_line /usr/bin/notify-service-by-stomp.rb mq.example.com /topic/foo.bar.baz.quux $HOSTADDRESS$ "$SERVICEDESC$" $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEDURATIONSEC$ $SERVICEATTEMPT$ $MAXSERVICEATTEMPTS$
}
global_service_event_handler=notify-service-by-stomp
Change mq.example.com to be the hostname of your message broker, and /topic/foo.bar.baz.quux to be the topic or queue that you’d like notifications to be sent to. Restart Nagios and you should start receiving notifications on that queue or topic.
If you don’t receive notifications from Nagios very often then a simple way to test that this is working is to attach stompcat (a cat-like tool that uses STOMP as a source) to the topic or queue, then send a few test notifications to the same topic or queue by manually running the same command that Nagios would.
Here’s a simple stompcat tool in case you don’t have one handy:
#! /usr/bin/env ruby
# Run me like this:
#
# ./stompcat.rb mq.example.com /topic/foo.bar.baz.quux
#
require 'rubygems'
require 'smqueue'
configuration = {
:host => ARGV[0],
:name => ARGV[1],
:adapter => :StompAdapter
}
source = SMQueue(configuration)
source.get do |m|
payload = m.body
puts ">>> #{payload}"
end
Here’s how to send notifications to the queue or topic:
/usr/bin/notify-service-by-stomp.rb mq.example.com \
/topic/foo.bar.baz.quux service-host.example.com "SERVICE NAME" \
WARNING HARD 86492 6 6
If it’s working you should get an entry like this showing up where you’re running the stompcat:
{
"time_t":1266427384,
"state":"WARNING",
"state_type":"HARD",
"state_time":86492,
"attempt":6,
"hostname":"service-host.example.com",
"max_attempts":6,
"service":"SERVICE NAME"
}
You should be able to change the stompcat example to perform more complex and interesting actions – looking up clients in a database, sending text messages if an account has enough credit, whatever you fancy. If you come up with something fun, please let me know!
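As one possible starting point, the block passed to source.get could hand each payload to a small function that decides whether anyone should be told. This is only a sketch: worth_notifying? and describe are made-up helper names, and the rule of acting only on HARD states is my assumption rather than anything Nagios requires. The field names match the JSON shown above.

```ruby
require 'json'

# Hypothetical rule: only HARD states are worth telling a client about,
# because SOFT states are still being retried by Nagios.
def worth_notifying?(event)
  event['state_type'] == 'HARD'
end

# Hypothetical helper: turn an event into a one-line summary.
def describe(event)
  format('%s on %s is %s (attempt %d of %d, in this state for %d seconds)',
         event['service'], event['hostname'], event['state'],
         event['attempt'], event['max_attempts'], event['state_time'])
end

# Same shape as the JSON produced by notify-service-by-stomp.rb.
payload = '{"time_t":1266427384,"state":"WARNING","state_type":"HARD",' \
          '"state_time":86492,"attempt":6,"hostname":"service-host.example.com",' \
          '"max_attempts":6,"service":"SERVICE NAME"}'

event = JSON.parse(payload)
puts describe(event) if worth_notifying?(event)
```

Inside stompcat you would call these from the source.get block, using JSON.parse(m.body) in place of the hard-coded payload.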
The correct OID for system uptime
I use SNMP to track system uptime so I know when my hosts have recently rebooted, but I keep making a mistake when picking which OID to monitor. I keep using sysUpTime.0, which is wrong. I should be using hrSystem.hrSystemUptime.0.
- sysUpTime.0: Timeticks (0.01s) since snmpd started.
- hrSystem.hrSystemUptime.0: Timeticks (0.01s) since the hardware started.
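Since Timeticks are hundredths of a second, spotting a recent reboot means pulling the raw value out of the SNMP output and dividing by 100. Here's a rough sketch of that. The helper names and the ten minute threshold are my own assumptions, and the sample line is illustrative rather than captured from a real host.

```ruby
# Extract the raw Timeticks value from a line of snmpget output.
# Returns nil if the line does not contain a Timeticks value.
def timeticks(line)
  match = line.match(/Timeticks: \((\d+)\)/)
  match && match[1].to_i
end

# Timeticks are hundredths of a second.
def uptime_seconds(line)
  ticks = timeticks(line)
  ticks && ticks / 100
end

# Hypothetical rule: anything under ten minutes counts as a recent reboot.
def recently_rebooted?(line, threshold = 600)
  seconds = uptime_seconds(line)
  !seconds.nil? && seconds < threshold
end

# Illustrative sample line, not captured from a real host.
sample = 'HOST-RESOURCES-MIB::hrSystemUptime.0 = Timeticks: (8649200) 1 day, 0:01:32.00'
puts uptime_seconds(sample)
puts recently_rebooted?(sample)
```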
Running Starling under DaemonTools
I've been playing with Starling quite a bit recently. As with most of my deployed tools, I like to be sure that it's running. Here's a run script for Starling under DaemonTools:
#!/bin/sh
# This is /home/starling/service/run
exec 2>&1
echo "Starting..."
PORT=22122
IP=0.0.0.0
USER=starling
HOME=/home/starling
exec setuidgid $USER \
  starling -v -v -v -h $IP -p $PORT -P $HOME/starling.pid -q $HOME/queue 2>&1
You'll want to keep the logs too. Here's the log/run script:
#!/bin/sh
# This is /home/starling/service/log/run
exec multilog t s1000000 n10 ./main
Note that you'll have to create the starling user to use these scripts (or just change the scripts).
Posting to IRC using ActiveMQ
Previously I wrote about querying your app using IRC and IRCCat. Well, that's not all IRCCat can be used for. It can also allow your application or platform to talk to you and let you know what's up. A source code commit, a user logging in or a server dying can all be pretty interesting information, and it's really easy to hook these into IRC using IRCCat.
The IRCCat examples of sending notifications to IRC all seem to revolve around using netcat to pipe data across the network to the IRCCat process. My preference is to use a small Ruby script and a message bus to deliver messages to the process. Of course, I already have ActiveMQ set up so I don't have much extra overhead doing things this way.
#! /usr/bin/env ruby
STDOUT.sync = true
require 'rubygems'
require 'smqueue'
require 'yaml'
require 'socket'

puts "Starting..."
messages = SMQueue(:name => "/queue/irc.outgoing", :host => "mq.domain.com", :reliable => true, :adapter => "StompAdapter")
messages.get do |job|
  # The body is YAML with a 'text' key; the headers belong to the job itself.
  message = YAML.parse(job.body).transform
  puts "Posting #{message['text']} in #{job.headers['message-id']}."
  irc = TCPSocket.open('localhost', '12345')
  irc.send("#{message['text']}\r\n", 0)
  irc.close
  puts "Posted #{job.headers['message-id']}."
end
With that running on the same box as IRCCat your other processes can now just put messages onto the IRC queue and they'll be posted to IRC. If IRCCat isn't running, they'll be stored in the queue until it is and then get posted.
I like this approach because the various other processes don't need to know which server / port IRCCat is running on, they just talk to the message queue - which is made easy by SMQueue.
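For the sending side, the script above expects each message body to be YAML with a text key. Here's a minimal sketch of building such a body. The irc_payload helper and the example text are made up for illustration.

```ruby
require 'yaml'

# Hypothetical helper: build the YAML body that the consumer script
# reads back via message['text'].
def irc_payload(text)
  { 'text' => text }.to_yaml
end

body = irc_payload('Deploy finished on web-01')
puts body

# The consumer side gets the text back out like this.
puts YAML.load(body)['text']
```

To actually post it you'd put the body onto /queue/irc.outgoing using an SMQueue client configured like the one in the script above. I'm assuming its put method here.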
Query your applications using IRC
IRC. Most of you know what it is. For those that don't, it stands for Internet Relay Chat - think of it as a geeky group-chat and you won't be too far off the mark.
There's a long tradition of using bots - automated processes - to provide various services in IRC channels. There are, for example, bots that help people access Paste Bin services in IRC channels so that they don't have to paste hundreds of lines of code into the channel, or there are bots which take messages for users that are currently offline then replay those messages when they come online. These bots are really useful because they simply and easily enhance a medium which is naturally used for communication without requiring any additional software to use.
Last.fm use IRC as an internal communication mechanism. They've written (and released under the GPL - thanks!) IRCCat which allows us to write simple bots to answer queries or perform commands given in IRC channels.
I've set up IRCCat and written a few simple scripts for it. It's pretty easy to get started. First you'll need to get Java and Ant set up. I'm on a Mac running OS X 10.4 so I already have Java, and MacPorts provides an Ant port which is easy to install.
After Java and Ant you'll need to clone the IRCCat source from GitHub.
git clone git://github.com/RJ/irccat.git
Now you can compile and package the bot by running ant dist in the directory created by the git clone.
Once it's packaged, create a directory called config/ and copy the example configuration from examples/irccat.xml there. We'll use this to set up the bot the way we want it to work.
The configuration file is reasonably well commented so open it up and run through each of the sections filling in the appropriate details.
- Provide the details of the IRC server you'd like to connect to. I use an internal server but if you don't have one then there are plenty of IRC networks out there - a quick Google should get the details.
- Tell the bot which username it should use.
- Change the external scripts hook to scripts/run and up the max response lines to 30.
- Choose which channels you'd like the bot to join. If they don't exist then they'll be created when the bot joins them (depending on IRC network policy).
With the configuration out of the way you can now launch your bot.
ant -Dconfigfile=./config/irccat.xml
If you're in one of the channels that you've asked the bot to connect to then you should see it join. If you're not in the channel, now would be a great time to join it. You can check the bot is working properly by asking it which channels it's running in. Type !channels into any of the channels that it's joined. It should respond with a list of channels that it's currently active in.
CraigW: !channels
bot: I am in 2 channels: #foo #bar
There are a few built-in commands such as !channels. Built-in commands always start with an exclamation mark.
| Command | Description |
|---|---|
| !join #channel password | Make the bot join another channel. Password is optional. |
| !part #channel | Make the bot leave a channel |
| !channels | List all channels the bot is in |
| !spam message | Send message to all channels the bot is in |
| !exit | Make the bot quit IRC and shutdown |
The really interesting commands are the externals. Externals are called by starting a command with a question mark. You get to write external commands yourself and they can do anything you want.
Remember the configuration file had a cmdhandler value that I set to scripts/run? That's the first port of call for externals. I use this script to launch a router which loads and executes other commands stored in the scripts/ directory.
If you'd like to do the same thing as me, my scripts/run script looks like this:
#!/bin/bash
# This script handles ?commands to irccat
exec ruby ./scripts/router "$@" 2>&1
That's executable (chmod +x scripts/run). The code in scripts/router looks like the following and does the routing to and invocation of the correct command.
#! /usr/bin/env ruby
# Commands live in the same directory as this router.
COMMANDS = File.expand_path(File.dirname(__FILE__))
name, channel, username, command, arguments = *ARGV[0].split(/ /, 5)
# File.basename stops a command like ../../foo escaping the scripts directory.
command_script = File.join(COMMANDS, File.basename(command))
if File.exist?(command_script) && !%w(run router).include?(command)
  load command_script
  puts Command.execute(name, channel, username, arguments).strip
else
  desired_command = "#{command} #{arguments}".strip
  puts "Sorry #{name}, I don't understand `#{desired_command}`."
end
To write externals we now just need to write a short script that implements a Command class.
The name of the file which the command should be implemented in is based on what you'd like to type into the IRC channel. If you'd like to query SNMP on a certain host under a certain OID you might like to write something like ?snmp xeriom-vm-host-06 .1.3.6.1.2.1.1.1. The script name in this case would be scripts/snmp. Here's a very simple implementation of that command which just executes snmpwalk and returns the results.
require 'shellwords'

class Command
  class << self
    def execute(name, channel, username, arguments)
      hostname, oid, _remainder = arguments.to_s.split(/ /, 3)
      # Escape anything typed in IRC so it can't inject shell commands.
      target = Shellwords.escape("#{hostname}.core.xeriom.net")
      `snmpwalk -c public -v 1 #{target} #{Shellwords.escape(oid.to_s)}`
    end
  end
end
Now entering ?snmp xeriom-vm-host-06 .1.3.6.1.2.1.1.1 in IRC will run this script and print the results straight back into the IRC channel the command was typed in. Note that there's no need to restart the bot for changes to take effect.
CraigW: ?snmp xeriom-vm-host-06 .1.3.6.1.2.1.1.1
bot: SNMPv2-MIB::sysDescr.0 = STRING: Linux xeriom-vm-host-06.core.xeriom.net 2.6.24-17-xen #1 SMP Thu May 1 15:55:31 UTC 2008 x86_64
There's great potential in having such powerful scripts available directly in the same communication channel that you'd use to discuss development or customer relations - a simple command can quickly pull up customer records or statistics without the need to drop to the terminal and fire up the application console or load a web page.
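To show how little a new external needs, here's a sketch of a second command. The file name scripts/utc and what it reports are made up for illustration. The only real requirement, going by the router above, is a Command class whose execute method takes the same four arguments and returns a string.

```ruby
require 'time'

# Hypothetical scripts/utc: answers "?utc" with the bot host's current UTC time.
class Command
  class << self
    # Same signature the router calls: execute(name, channel, username, arguments).
    def execute(name, channel, username, arguments)
      "#{name}: it is #{Time.now.utc.iso8601} on the bot's host"
    end
  end
end

# Calling it by hand, the way scripts/router would.
puts Command.execute('CraigW', '#foo', 'craigw', nil)
```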
Ruby vs Java
I've just found out that there's a Ruby port of IRCCat. I'll be swapping to use that since I find it easier to maintain and fork a Ruby project than a Java project. Your mileage may vary.
Running Daemontools under Ubuntu 8.10
Daemontools is a collection of tools that help manage processes. It's great for keeping daemons running - if they ever die then Daemontools just restarts them. Unfortunately the package for Ubuntu is a little broken because it relies on /etc/inittab and Ubuntu hasn't used this file for a long time. Here's how to install Daemontools and fix the problem.
Installing Daemontools
This bit is easy.
sudo apt-get install daemontools
Daemontools is installed. Easy, huh? Unfortunately it won't start after a reboot. That's bad. The daemontools-run package was meant to make Daemontools start at system bootup but unfortunately it relies on a system that uses init... and Ubuntu doesn't. It uses upstart instead.
Make Daemontools run at system startup
Create the file /etc/event.d/svscanboot with the following content.
start on runlevel 2
start on runlevel 3
start on runlevel 4
start on runlevel 5
stop on runlevel 0
stop on runlevel 1
stop on runlevel 6
respawn
exec /usr/bin/svscanboot
You'll also need to mkdir /etc/service since this is where the Ubuntu-installed Daemontools looks for service definitions.
Now tell the system to start the process.
sudo initctl start svscanboot
Other distributions
Plenty of other distributions use upstart instead of init. Getting DaemonTools running on them is pretty similar.
