Archive for the ‘Sensu’ Category

Why Sensu is a monitoring router – some cool handlers

I have just finished a couple of handlers that fit really well into the Sensu routing model.

The Execution handler

It will automatically execute things triggered by alerts, for example restarting a service or enabling debug mode. By using the great tool mcollective, the handler can execute tasks on other servers. One example I show here is restarting an Apache service if the web application doesn't respond; there are a lot of nice things that could be done to automate the handling of unexpected events in the system. Our conclusion is that many events can be handled in an automatic or semi-automatic way, and by doing the handling via the Sensu router you will be alerted when something happens, you have the history of the event, and actions can be triggered on other instances than the one that originally raised the alert.

This is an example of a check that verifies that the foo web site is responding. If the site doesn't work and an alert is triggered, the execute handler will run the task(s) defined in execute. In this case all servers with the role Foowebserver will restart the apache2 service.

"foo_website": {
      "handlers": [
        "rfdefault",
        "execute"
      ],
      "command": "/opt/sensu/plugins/http/check-http.rb -h www.foo.com -p / -P 80",
      "execute": [
        {
          "scope": "CLASS",
          "class": "role.Foowebserver",
          "application": "service",
          "execute_cmd": "apache2 restart"
        }
}

This will run the following mcollective command on the sensu-server and restart the apache2 services:

mco service -C role.Foowebserver apache2 restart

The execute handler and mcollective could be used for a lot of great stuff, like restarting services, shutting down or starting servers, enabling or disabling debug mode for applications, or gathering more data from the servers. The only thing needed is to extend mcollective with more agents. One of the great advantages of mcollective is that it is agent based, so it feels safer than just executing remote ssh commands.
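To make the check-to-command mapping concrete, here is a minimal sketch of how such a handler could translate one entry of the `execute` array from the check definition above into an mcollective command line. The function name and the simplified field handling are my own illustration, not the actual handler implementation:

```ruby
require 'json'

# Build the mcollective command line from one "execute" entry in the
# check definition. Field names mirror the foo_website example above;
# this is a simplified sketch, not the full execute handler.
def mco_command(entry)
  case entry['scope']
  when 'CLASS'
    # Target all nodes carrying the given class (e.g. a role)
    "mco #{entry['application']} -C #{entry['class']} #{entry['execute_cmd']}"
  else
    # Fall back to running against all discovered nodes
    "mco #{entry['application']} #{entry['execute_cmd']}"
  end
end

# A real handler would read the event JSON from STDIN and shell out:
#   event = JSON.parse(STDIN.read)
#   event['check']['execute'].each { |e| system(mco_command(e)) }
```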

The graphite notify handler

The graphite notify handler is quite simple: it just sends a 1 when an event occurs and a 0 when it is resolved. That makes it easy to get statistics on how often an error occurs and on which machines.

http://graphite.recfut.com/render/?width=1548&height=786&_salt=1353966035.28&from=17%3A00_20121126&until=18%3A00_20121126&target=keepLastValue(r13b.sensu.events.adm3_recfut_net.chef_client_fatal_error)&hideLegend=true&hideGrid=false&lineWidth=4&width=400&height=280

In this case we had one event that lasted for 20 minutes, starting at 17:10.

The only thing that needs to be done is to add some Graphite config and then add the graphite_notify handler:

  "graphite_notify":{
    "host":"graphite.foo.com",
    "prefix":"sensu.events"
  }
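The core of the idea can be sketched in a few lines of Ruby: map a Sensu event to one line of the Graphite plaintext protocol, `path value timestamp`, using the prefix from the config above. The function name and the dot-escaping of client names are my own illustration of the approach, not the handler's exact code:

```ruby
require 'socket'

# Sketch: map a Sensu event to one Graphite plaintext-protocol line,
# "prefix.client.check value timestamp". Value is 1 on an alert and 0
# on resolve. Dots in the client name are replaced with underscores so
# they don't create extra path segments in Graphite.
def graphite_line(prefix, event, now = Time.now.to_i)
  value = event['action'] == 'resolve' ? 0 : 1
  client = event['client']['name'].gsub('.', '_')
  "#{prefix}.#{client}.#{event['check']['name']} #{value} #{now}"
end

# Delivery is then a plain TCP write to the Graphite line receiver
# (default port 2003), e.g.:
#   TCPSocket.open('graphite.foo.com', 2003) do |s|
#     s.puts graphite_line('sensu.events', event)
#   end
```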

Graphite and Sensu

I will give a short ignite talk at the DevOps conference in Rome about how we implemented Sensu at Recorded Future. This is a more detailed description of how we use Sensu and Graphite.

At Recorded Future we have created a monitoring pipeline with some very useful components; I will describe one of the most useful pipelines we have implemented. I think one of the great ideas in Sensu is to use it as a router and not build a lot of functionality into the tool itself; instead, Sensu is easy to integrate with other tools like Graphite. Graphite by itself is a really good tool and database for handling and storing metrics, and one of its main advantages is much better performance for handling metrics compared with traditional databases.

Metric collection – Graphite – Sensu pipeline

We have built a pipeline where we send metrics to Graphite from different sources, such as Sensu, applications, and specific tools or scripts. At the end of the pipeline we have Sensu scripts that pull the data out of Graphite (in many cases using built-in Graphite functions to process the data), and based on that data the Sensu client sends an OK or an alert to the Sensu server. This means that we can monitor trends, and we can alert on averages, max, and mean values for time series without needing to store the data in our monitoring system. In Graphite we can also view graphs of the data, which means we alert on the same data that we graph, and we only need to collect the data once. It is easy for everyone to understand what we are monitoring, and to discuss and find trigger levels by simply graphing the data. When we find a problem and realize we need to monitor for it, we very often already have the data in Graphite, and then it is easy to create an alert on that data as well.

The benefits of this pipeline

  • It has been much easier to get the developers to add the code to send data to Graphite, as it has a simple-to-use API. They can also use the metrics to see graphs by themselves.
  • The history is stored, and you can use it to evaluate triggers and see the previous behavior of your system.
  • You can easily graph the metrics
  • You can do a lot of calculations on the data. One of my favorites is the derivative of a metric, to see whether a metric is increasing or declining over a period of time, and then, for example, alert if it isn't growing along the path it should.
  • You don’t store any data in your monitoring system, so you get much less load on the monitoring system and its databases.
  • You can use your data for graphs and dashboards
  • If you need to monitor a metric, you very often find that you already have the data in Graphite.

Collecting data

How is this pipeline set up? The first step is to collect the data and send it to Graphite; we are doing this in 3 different ways.

From Sensu scripts

Create a Sensu script of type Graphite and use the output function for your Graphite metric. The script is run by the local Sensu client, which takes the output and sends it via RabbitMQ to the Sensu server, which in turn sends it via RabbitMQ to Graphite. (You need to configure the Graphite server to accept metrics via RabbitMQ.)

class RFAPIMetrics < Sensu::Plugin::Metric::CLI::Graphite
  option :host,
    :short => "-h HOST",
    :long => "--host HOST",
    :description => "The API host to do the check against, including port",
    :required => true

  def run
    api_server = RFAPI.new
    api_server.base_uri = config[:host]
    api_stat = api_server.api_stat
    api_stat.each do |line|
      stat_name, stat_value = line.split("=")
      output "api.#{stat_name}", stat_value
    end
    ok
  end
end

From applications

We send a lot of metrics directly from the applications and processes. That could be usage, number of documents handled, execution times, and many other things. We use libraries like Metrics for Java (http://metrics.codahale.com/) to send metrics data to Graphite without any obscure path through some monitoring application with a clumsy API that the developers hate. Instead, it makes it very easy for developers to add metrics, which can then be used during development, bug fixing, and in production. The format that Graphite uses is very simple and easy to understand, so it's easy to work with. By letting the application send the data directly, there is no need for any extra data-gathering processes, and the data can be calculated and formatted directly in the application code to fulfil the needs.
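To show just how simple the format is, here is a minimal sketch of sending one metric straight from application code over the Graphite plaintext protocol. The host name and metric path are made-up examples:

```ruby
require 'socket'

# The Graphite plaintext protocol is one line per metric:
# "path value timestamp\n", written to Graphite's line receiver
# (default TCP port 2003).
def format_metric(path, value, timestamp = Time.now.to_i)
  "#{path} #{value} #{timestamp}\n"
end

def send_metric(host, path, value, port = 2003)
  TCPSocket.open(host, port) { |sock| sock.write(format_metric(path, value)) }
end

# Example (hypothetical host and path):
#   send_metric('graphite.foo.com', 'app.docs.processed', 42)
```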

From other tools/scripts

We use some tools that gather data and send it directly to Graphite, and we also have some scripts that run on a schedule, collect data from logs, processes, etc., and then send the data directly to Graphite.

Using the Graphite data in Sensu

We use Graphite data in different ways in Sensu: the Sensu client script fetches the data from Graphite, acts on it, and responds with an OK or creates an alert.

Last value

Just grab the last value from Graphite for the metric, compare it with the threshold, and trigger an alert if it is out of bounds. This is a small example.

params = {
  :target => target,
  :from => "-#{@period}",
  :format => 'json'
}
resp = Net::HTTP.post_form(graphite_url, params)
data = JSON.parse(resp.body)
if data.size > 0
  # Graphite returns datapoints as [value, timestamp] pairs; take the
  # most recent non-nil value.
  last_value = data.first['datapoints'].map(&:first).compact.last
  if last_value && last_value > TRIGGER_VALUE
    warning "Metric #{target} has value #{last_value} that is larger than #{TRIGGER_VALUE}"
  end
else
  critical "No data found for graphite metric #{target}"
end
ok

Time series

We can use Graphite to store our time series and calculate averages, max values, aggregated values from many monitored instances, etc. We can then alert on those computed values, for example when the total CPU usage for some kind of process running on many machines is too high.
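Whether you aggregate with Graphite's built-in functions or in the check script, the reduction step is simple. A small sketch of averaging a Graphite JSON series (with the null buckets Graphite can return) before comparing it against a threshold; the function name is my own:

```ruby
# Graphite's JSON output gives datapoints as [value, timestamp] pairs,
# where value may be null for empty retention buckets. Reduce such a
# series to its average, skipping nulls; returns nil for an all-null
# series so the caller can distinguish "no data" from a real value.
def series_average(datapoints)
  values = datapoints.map(&:first).compact
  return nil if values.empty?
  values.sum.to_f / values.size
end

# A check could then do:
#   avg = series_average(data.first['datapoints'])
#   critical "no data" if avg.nil?
#   warning "average too high" if avg > TRIGGER_VALUE
```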

Trends, statistical methods

We grab a time series from Graphite and use it to alert on trends; we use different statistical methods, including calculations in R, to be able to alert on anomalous values.
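The R-based methods we use are more elaborate, but the simplest statistical trigger of this kind can be sketched as a plain z-score: flag the latest value if it sits too many standard deviations from the mean of the rest of the window. This is my own illustrative sketch, not our production check:

```ruby
# Flag the newest value in a window as anomalous when it is more than
# `threshold` standard deviations away from the mean of the preceding
# values (a plain z-score test on a Graphite-style series).
def anomalous?(values, threshold = 3.0)
  history, latest = values[0..-2], values[-1]
  mean = history.sum.to_f / history.size
  variance = history.map { |v| (v - mean)**2 }.sum / history.size
  std = Math.sqrt(variance)
  return false if std.zero?  # flat history: nothing to compare against
  ((latest - mean) / std).abs > threshold
end
```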

Sensu script to check Graphite data

You can find a Sensu script for checking metrics based on data in Graphite here: https://github.com/portertech/sensu-community-plugins/blob/master/plugins/graphite/check-data.rb