Technical recipes for frequently and infrequently recurring problems
Production services need to be running all the time, and if they go down they should restart automatically without human input. Here’s how to use our Nagios infrastructure to do that. This guide assumes you’re using a Nagios system (in our case, nagios.curationexperts.com) to monitor and restart a service running on a different system, via nrpe.
This guide borrows heavily from this blog post: https://kakoma.ug/code/2014/12/nagios-event-handlers-nrpe
On nagios.curationexperts.com, edit /usr/local/nagios/etc/objects/remotehosts.cfg and add an event_handler to the nagios service in question. In the example below, nrpe_event_handler!restart_sidekiq will be called whenever the check_sidekiq command reports that sidekiq is down.
define service{
    use                  dce-service
    hostgroup_name       sidekiq
    service_description  sidekiq
    check_command        check_nrpe!check_sidekiq
    event_handler        nrpe_event_handler!restart_sidekiq
}
Add something like the following to /usr/local/nagios/etc/objects/commands.cfg:
################################################################################
# Restart sidekiq
define command{
    command_name  nrpe_event_handler
    command_line  /usr/local/nagios/libexec/event_handlers/nrpe_event_handler -s $SERVICESTATE$ -t $SERVICESTATETYPE$ -a $SERVICEATTEMPT$ -H $HOSTADDRESS$ -c $ARG1$
}
On nagios.curationexperts.com, there is already a general-purpose nrpe event handler (copied from Kakoma’s blog post, above). Make sure the path is right and that the file is executable. Note that the restart_sidekiq parameter gets passed to nrpe_event_handler as ARG1.
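For orientation, here is a minimal sketch of what a handler with this command line might do. It is not the script installed on the nagios server. It assumes the handler acts only when the service is CRITICAL in a HARD state, and that check_nrpe lives at /usr/local/nagios/libexec/check_nrpe. The function names should_restart and main are hypothetical.

```shell
#!/bin/bash
# Sketch of an NRPE event handler; NOT the script installed on the nagios server.
# CHECK_NRPE is an assumed path; override it to match your install.
CHECK_NRPE="${CHECK_NRPE:-/usr/local/nagios/libexec/check_nrpe}"

# Succeeds only when the service is CRITICAL and in a HARD state.
should_restart() {
  local state="$1" state_type="$2"
  [ "$state" = "CRITICAL" ] && [ "$state_type" = "HARD" ]
}

# Parse the flags passed by the command definition in commands.cfg and,
# if a restart is warranted, run the named NRPE command on the remote host.
main() {
  local state="" state_type="" attempt="" host="" cmd="" opt
  local OPTIND=1
  while getopts "s:t:a:H:c:" opt; do
    case "$opt" in
      s) state="$OPTARG" ;;
      t) state_type="$OPTARG" ;;
      a) attempt="$OPTARG" ;;
      H) host="$OPTARG" ;;
      c) cmd="$OPTARG" ;;
      *) echo "usage: $0 -s STATE -t STATETYPE -a ATTEMPT -H HOST -c COMMAND" >&2
         return 2 ;;
    esac
  done

  if should_restart "$state" "$state_type"; then
    echo "attempt $attempt: running $cmd on $host"
    "$CHECK_NRPE" -H "$host" -c "$cmd"
  else
    echo "attempt $attempt: state is $state/$state_type, nothing to do"
  fi
}

# Only run when arguments are supplied, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
  main "$@"
fi
```

With the service definition above, nagios would invoke this with -c restart_sidekiq, which in turn asks nrpe on the remote host to run its restart_sidekiq command.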
Restart nagios:
sudo systemctl restart nagios
Make sure it came back up correctly. Nagios won’t restart if there are syntax errors in the config files. Check /usr/local/nagios/var/nagios.log for errors.
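Because a syntax error keeps nagios from coming back up, it can help to run the config check first and restart only if it passes. The sketch below assumes a source install with the binary at /usr/local/nagios/bin/nagios and the main config at /usr/local/nagios/etc/nagios.cfg. The safe_restart function and the two variables are hypothetical names.

```shell
#!/bin/bash
# Assumed paths for a source install under /usr/local/nagios; adjust as needed.
NAGIOS_VERIFY="${NAGIOS_VERIFY:-sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg}"
NAGIOS_RESTART="${NAGIOS_RESTART:-sudo systemctl restart nagios}"

# Restart nagios only if the config check (-v) passes.
# The variables are left unquoted on purpose so they split into command + args.
safe_restart() {
  if $NAGIOS_VERIFY; then
    $NAGIOS_RESTART
  else
    echo "Config check failed; nagios was not restarted." >&2
    return 1
  fi
}
```

After sourcing the file, calling safe_restart prints the verification report and leaves the running nagios untouched if the check fails.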
On the remote host (where nrpe is running), edit /usr/local/nagios/etc/nrpe.cfg and add a line like this:
# Restart sidekiq if it isn't running
command[restart_sidekiq]=/usr/local/nagios/libexec/event_handlers/restart_sidekiq
Create a file called /usr/local/nagios/libexec/event_handlers/restart_sidekiq (or whatever filename you used in the nrpe config above). This is the file that will be executed by the nagios user if nagios detects the service is down. Make sure it’s executable and that the nagios user has the permissions it needs to run it, including permission to run the sudo commands inside it. It should look something like this:
#!/bin/bash
# Uncomment the next two lines for debugging. Check the logs in /tmp to see how execution is being done.
#exec 2> /tmp/nagioslog."$$"
#set -x
sudo /bin/systemctl restart sidekiq
sudo systemctl restart xinetd
To test the setup, stop the monitored service. Run tail -f /usr/local/nagios/var/log/nagios.log on the nagios server to watch it run. You should see nagios detect the down service, then try again twice more (depending on how many tries you have configured; 3 is the default). On each try it will call the nrpe_event_handler, but only on the third try (when the service is in a HARD down state) will it actually trigger a restart.
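To review the handler calls after the fact rather than watching the log live, a small filter like the one below may help. It assumes event handler logging is enabled in nagios.cfg and that those log entries contain the text EVENT HANDLER followed by semicolon-separated fields that include the service description. The handler_events function is a hypothetical helper, and the log path is passed in rather than assumed.

```shell
#!/bin/bash
# Print event handler log entries that mention a given service.
# Usage: handler_events <path-to-nagios.log> <service_description>
# Assumes semicolon-delimited fields in the log entry (an assumption,
# check a real entry in your log before relying on it).
handler_events() {
  local log_file="$1" service="$2"
  grep -F 'EVENT HANDLER' "$log_file" | grep -F ";${service};"
}
```

For example, handler_events followed by the path to your nagios log and sidekiq would list one entry per handler call, which should line up with the check attempts described above.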