Graphite and Sensu

I will give a short ignite talk at the DevOps conference in Rome about how we implemented Sensu at Recorded Future. This is a more detailed description of how we use Sensu and Graphite.

At Recorded Future we have created a monitoring pipeline with some very useful components, and I will describe one of the most useful pipelines that we have implemented. I think one of the great ideas behind Sensu is to use it as a router rather than building a lot of functionality into the tool itself; instead, Sensu is easy to integrate with other tools like Graphite. Graphite by itself is a really good tool and database for handling and storing metrics, and one of its main advantages is much better performance for metrics than traditional databases.

Metric collection – Graphite – Sensu pipeline

We have built a pipeline where we send metrics to Graphite from different sources: Sensu, applications, and specific tools or scripts. At the end of the pipeline we have Sensu scripts that pull the data out of Graphite (in many cases using built-in Graphite functions to refine the data), and based on that data the Sensu client sends an OK or an alert to the Sensu server. This means we can monitor trends, and we can alert on averages, max, or mean values for time series without having to store the data in our monitoring system. In Graphite we can also view graphs of the data, which means we alert on the same data as we graph and only need to collect the data once. It is easy for everyone to understand what we are monitoring and to discuss and find trigger levels simply by graphing the data. When we find a problem and realize that we need to monitor it, we very often already have the data in Graphite, and then it is easy to create an alert on that data as well.

The benefits of this pipeline

  • It has been much easier to get the developers to add code that sends data to Graphite, as it has a simple-to-use API. They can also use the metrics to view graphs themselves.
  • The history is stored, and you can use it to evaluate triggers and see the previous behavior of your system.
  • You can easily graph the metrics
  • You can do a lot of calculations on the data; one of my favorites is the derivative of a metric, to see if a metric is increasing or declining over a period of time and then, for example, alert if it isn't growing the way it should (see the example query after this list).
  • You don't store any data in your monitoring system, which puts much less load on the monitoring system and its databases.
  • You can use your data for graphs and dashboards
  • If you need to monitor a metric, you very often find that you already have the data in Graphite.
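As a sketch of the derivative idea above: Graphite's render API can apply functions such as derivative() to a stored series on the fly, so an alert script only has to compare the returned values against a threshold. The host and metric name below are made up for illustration.

http://graphite.example.com/render?target=derivative(api.documents.processed)&from=-1h&format=json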

Collecting data

How is this pipeline set up? The first thing is to collect the data and send it to Graphite. We do this in three different ways:

From Sensu scripts

Create a Sensu script of type Graphite and use the output function for your Graphite metric. The script is run by the local Sensu client, which takes the output and sends it via RabbitMQ to the Sensu server, which in turn sends it via RabbitMQ to Graphite. (You need to configure the Graphite server to accept metrics via RabbitMQ.)

require 'sensu-plugin/metric/cli'

class RFAPIMetrics < Sensu::Plugin::Metric::CLI::Graphite
  option :host,
         :short => "-h HOST",
         :long => "--host HOST",
         :description => "The API host to do the check against, including port",
         :required => true

  def run
    # Ask our internal API for its statistics (one "name=value" pair per line)
    api_server = RFAPI.new
    api_server.base_uri = config[:host]
    api_stat = api_server.api_stat
    api_stat.each do |line|
      stat_name, stat_value = line.split("=")
      # output emits Graphite's plaintext format: "api.<name> <value> <timestamp>"
      output "api.#{stat_name}", stat_value
    end
    ok
  end
end
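To have the Sensu client run a script like this on a schedule, a check definition along these lines can be used. The command, interval, and subscription are placeholders here, and the graphite handler on the server side is assumed to be the one that forwards results to Graphite via RabbitMQ:

{
  "checks": {
    "rf_api_metrics": {
      "type": "metric",
      "command": "rf-api-metrics.rb -h localhost:8080",
      "subscribers": ["api"],
      "interval": 60,
      "handlers": ["graphite"]
    }
  }
}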

From applications

We are sending a lot of metrics directly from the applications and processes. That could be usage, number of documents handled, execution times, and many other things. We use libraries like Metrics for Java (http://metrics.codahale.com/) to send metric data to Graphite without any obscure detour via some monitoring application with a clumsy API that the developers hate. Instead it becomes very easy for developers to add metrics, which can then be used during development, bug fixing, and in production. The format that Graphite uses is very simple and easy to understand, so it is easy to use. By letting the application send the data directly there is no need for any extra data-gathering processes, and the data can be calculated and formatted directly in the application code to fit our needs.
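For code that has no metrics library at hand, the plaintext protocol Graphite listens on (port 2003 by default) is simple enough to speak directly. A minimal sketch in Ruby; the host name and metric path are just placeholders:

require 'socket'

# Send a single value to Graphite's plaintext listener.
# The wire format is simply "metric.path value unix_timestamp\n".
def send_to_graphite(metric, value, host = 'graphite.example.com', port = 2003)
  TCPSocket.open(host, port) do |sock|
    sock.puts "#{metric} #{value} #{Time.now.to_i}"
  end
end

send_to_graphite('app.documents.processed', 42)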

From other tools/scripts

We are using some tools that gather data and send it directly to Graphite, and we also have some scripts that run on a schedule, collect data from logs, processes, etc., and then send the data directly to Graphite.

Using the Graphite data in Sensu

We use Graphite data in different ways in Sensu: the Sensu client script fetches the data from Graphite, acts on it, and responds with an OK or creates an alert.

Last value

Just grab the last value from Graphite for the metric, compare it with the threshold, and trigger an alert if it is outside the bounds. Here is a small example.

require 'net/http'
require 'json'
require 'uri'

# Ask Graphite's render API for the datapoints of the target metric
params = {
  :target => target,
  :from   => "-#{@period}",
  :format => 'json'
}
resp = Net::HTTP.post_form(URI(graphite_url), params)
data = JSON.parse(resp.body)

if data.size > 0
  # Each datapoint is a [value, timestamp] pair; use the most recent non-nil value
  last_value = data.first['datapoints'].map(&:first).compact.last
  if last_value && last_value > TRIGGER_VALUE
    warning "Metric #{target} has value #{last_value} that is larger than #{TRIGGER_VALUE}"
  end
else
  critical "No data found for graphite metric #{target}"
end
ok

Time series

We can use Graphite to store our time series and calculate averages, max values, values aggregated from many monitored instances, and so on. We can then alert on those values, for example when the total CPU usage for some kind of process running on many machines is too high.
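As an illustration (the metric name and threshold are made up), a check script can ask Graphite to sum a per-host series with sumSeries() and then alert on the average of the returned window; as in the earlier example, warning/critical/ok are the usual Sensu check plugin methods:

require 'net/http'
require 'json'
require 'uri'

# Let Graphite aggregate across machines: sum the per-host CPU series for the process
target = 'sumSeries(servers.*.myprocess.cpu.user)'
resp = Net::HTTP.post_form(URI('http://graphite.example.com/render'),
                           :target => target, :from => '-30min', :format => 'json')
data = JSON.parse(resp.body)

points = data.empty? ? [] : data.first['datapoints'].map(&:first).compact
critical "No data found for #{target}" if points.empty?

# Average the summed series over the window and alert if the total CPU is too high
avg = points.reduce(:+) / points.size.to_f
critical "Total CPU for myprocess is #{avg.round(1)}%" if avg > 80
ok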

Trends, statistical methods

We grab a time series from Graphite and use it to alert on trends; we use different statistical methods, including calculations in R, to be able to alert on anomalous values.
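Our own checks combine several methods (including the calculations in R), but as a minimal hypothetical sketch of the idea: given an array of recent values fetched from Graphite, as in the sketch above, compare the latest value with the mean and standard deviation of the window:

# points is an array of recent metric values pulled from Graphite
mean     = points.reduce(:+) / points.size.to_f
variance = points.map { |v| (v - mean)**2 }.reduce(:+) / points.size.to_f
stddev   = Math.sqrt(variance)

# Flag the newest value if it sits more than three standard deviations from the mean
latest = points.last
if stddev > 0 && (latest - mean).abs > 3 * stddev
  warning "Latest value #{latest} deviates more than 3 sigma from the mean #{mean.round(2)}"
end
ok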

Sensu script to check Graphite data

You can find a Sensu script for checks based on metric data in Graphite here: https://github.com/portertech/sensu-community-plugins/blob/master/plugins/graphite/check-data.rb

8 comments so far

  1. Ryan on

    Very nice architecture. I was planning for that exact setup and was looking for a relevant case study.

    Did you take a look at rearview from LivingSocial? It's actually a very impressive tool that does all the alerting based on Graphite's data. The part where it shines is that it has a nice web UI for you to manage all your monitoring alerts. That is the part where I think it is better than Sensu for this purpose.

    • Michael Owings on

      I had a look at rearview, but it really doesn't seem very far along. I think it has promise, but it didn't look very useful.

    • Jaime Gago (@JaimeGagoTech) on

      Rearview does look interesting, but IMHO the beauty of this pipeline is its simplicity; personally I'd rather stick to Sensu since it does have a decent dashboard. Ultimately, fewer moving parts means less room for Murphy to show his damned face.

  2. Jaime Gago (@JaimeGagoTech) on

    Thanks for sharing, this pipeline is simply genius (currently in the implementation phase); so glad I found your post when looking at potential architectures leveraging Sensu and Graphite. I only wish you more exposure, and I'll share as much as I can once it's live. You should totally submit this to Monitorama!

    • Jaime Gago (@JaimeGagoTech) on

      What do you think of using a TCP socket to pass data from Sensu to Graphite (i.e. using a Sensu TCP handler)? It seems like a better way than using RabbitMQ, according to this post http://joemiller.me/2013/12/07/sensu-and-graphite-part-2/

      • ulfmansson on

        Yes, the approach Joe describes is better.

        Actually we send most of our Graphite data directly from the applications or from Logstash; in most cases we send the data to statsd.

        Then we use Sensu checks to alert on the data in Graphite.

  3. Jaime Gago (@JaimeGagoTech) on

    Have you thought about integrating Flapjack into this pipeline? After reading this http://holmwood.id.au/~lindsay/2014/01/03/the-how-and-why-of-flapjack/ I think we're going to need it.

  4. […] currently implementing a monitoring pipeline based on Sensu for the monitoring data routing and Graphite for the actual stora…calculations, it wouldn't be complete without a notification systems to it so I thought I would […]

