Why Sensu is a monitoring router – some cool handlers

I have just finished a couple of handlers that fit really well into the Sensu routing model.

The Execution handler

This handler automatically executes actions triggered by alerts, for example restarting a service or enabling debug mode. By using the great tool mcollective, the handler can execute tasks on other servers. The example I show here restarts an Apache service if the web application doesn’t respond, and there are a lot of other nice things that could be done to automate the handling of unexpected events in the system. Our conclusion is that many events can be handled in an automatic or semi-automatic way, and by doing the handling via the Sensu router you will be alerted when something happens, you have the history of the event, and actions can be triggered on other instances than the one where the alert was originally raised.

This is an example of a check that checks whether the foo web site is responding. If the site doesn’t work and an alert is triggered, the execute handler will run the task(s) defined in execute. In this case all servers with the role Foowebserver will restart the apache2 service.

"foo_website": {
      "handlers": [
        "rfdefault",
        "execute"
      ],
      "command": "/opt/sensu/plugins/http/check-http.rb -h www.foo.com -p / -P 80",
      "execute": [
        {
          "scope": "CLASS",
          "class": "role.Foowebserver",
          "application": "service",
          "execute_cmd": "apache2 restart"
        }
}

This will actually run the following mcollective command on the Sensu server and restart apache2 on the matching servers:

mco service -C role.Foowebserver apache2 restart

The execute handler and mcollective can be used for a lot of great stuff like restarting services, shutting down or starting servers, enabling or disabling debug mode for applications, or gathering more data from the servers. The only thing needed is to extend mcollective with more agents. One of the great advantages of mcollective is that it is agent based, so it feels safer than just executing remote SSH commands.

The graphite notify handler

The graphite notify handler is quite simple: it just sends a 1 when an event occurs and a 0 when it is resolved. That makes it easy to get statistics on how often an error occurs and on which machines.

http://graphite.recfut.com/render/?width=1548&height=786&_salt=1353966035.28&from=17%3A00_20121126&until=18%3A00_20121126&target=keepLastValue(r13b.sensu.events.adm3_recfut_net.chef_client_fatal_error)&hideLegend=true&hideGrid=false&lineWidth=4&width=400&height=280

In this case we had one event that lasted for 20 minutes starting at 17:10.

The only thing needed is to add some Graphite config and then add the graphite_notify handler:

  "graphite_notify":{
    "host":"graphite.foo.com",
    "prefix":"sensu.events"
  }
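
As an illustration, a handler that does this could look roughly like the following sketch. It is simplified and hypothetical, not the exact code we run; it only assumes the graphite_notify settings shown above.

#!/usr/bin/env ruby
# Hypothetical, simplified sketch of a graphite_notify-style handler.
# It sends 1 when an event is created and 0 when it is resolved, using
# the host and prefix from the graphite_notify settings above.
require 'sensu-handler'
require 'socket'

class GraphiteNotify < Sensu::Handler
  def handle
    prefix = settings['graphite_notify']['prefix']
    client = @event['client']['name'].gsub('.', '_')
    check  = @event['check']['name']
    value  = @event['action'] == 'resolve' ? 0 : 1
    metric = "#{prefix}.#{client}.#{check} #{value} #{Time.now.to_i}"
    # Graphite's carbon listener accepts plaintext metrics on port 2003
    TCPSocket.open(settings['graphite_notify']['host'], 2003) do |socket|
      socket.puts metric
    end
  end
end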

#monitoringlove – a true story

This is a write-up of the ignite talk I gave at devopsdays 2012 in Rome. This story is not only about monitoring, it’s about #dadops as well 😉 – my 2 daughters Olivia and Agnes helped me out with the creative drawings for the presentation.

Monitoring doesn’t need to suck these days; a lot has happened in the last year that has changed the game, so let 2013 be the year of #monitoringlove.

I will tell the story of how we went from #monitoringsucks to #monitoringlove at Recorded Future, and also how monitoring went from being a pure ops thing to a business thing as well.

The monitoring dragon – throwing fire everywhere

All of us were tired of our old monitoring solution based on Hyperic, which we chose a couple of years ago as state-of-the-art monitoring at that time. We had been discussing replacing Hyperic with another solution for a long time, but things like Zabbix, Zenoss and Nagios are, as @obfuscurity told us, “fucking terrible” as well, so we stayed put.

One day in May, the 4 of us in the operations team took the train down to Varberg for a 3-day hackathon to get rid of our old monitoring solution and install a new one.

The train to Varberg with Erik, Martin, Anders and me

We worked hard for 2 days and late in the evening of the 2nd day we could shut down our old monitoring solution and launch our new one, and we celebrated with a bottle of Champagne.

What we installed as the monitoring solution for Recorded Future was Sensu with friends. The first day of this exercise we actually spent trying to understand how Sensu worked and how plugins and handlers worked. We did an installation of Sensu and the infrastructure via Chef and tested different plugins. Everyone in the team created their first Sensu plugin as well. Already after the first day we were convinced that this was the way to go: Sensu was easy to understand, it was easy to create plugins, and the server code was simple and easy to follow. The third day we used to clean things up and add more monitoring. All in all it took the 4 of us 3 days to completely replace the old monitoring system. We now have about 60 different plugins (Sensu check scripts), 5 different handlers, 200 different checks and 2 500 000 metrics.

An image of Sensu according to my daughter

Sensu is just one part of a monitoring solution; Sensu needs friends to be really great. Of course there are the obvious ones like RabbitMQ and Redis, as Sensu will not work without them, but then there are the ones that turn Sensu into a complete monitoring solution, like Graphite, Pagerduty, Gdash etc. (you can add your own favourite here).

Sensu with RabbitMQ the friend, and Pagerduty

Graphite – the very best friend of Sensu

Graphite is a really great tool and together with Sensu it’s awesome. Graphite is splendid at storing high volumes of metrics and showing graphs, but it is also very useful to alert on metric data in Graphite – read more in a previous blog post here.

I would say it’s about using the right tool for the right thing. Most monitoring solutions in the past have been big monoliths that should solve all your monitoring problems, but they don’t; instead they give you a monitoring headache and you really start to feel that #monitoringsucks.

The really great thing with Sensu is that it doesn’t try to solve everything; instead it’s decomposed into different components and is easy to extend and integrate with other tools. So instead of using one swiss-army-knife tool to fix my house, I prefer to use proper tools made for the type of task that I need to do.

Why we are happier at Recorded Future

The new monitoring approach we use has lowered barriers between different teams and also makes it easier to communicate within the company, no matter if it’s developers communicating with ops or ops communicating with business.

The developer

By lowering the barrier for the developers to interact with the monitoring system, they have become much more interested in data and monitoring. It has become much easier to add metric data, and the main reason is the integration with Graphite: the developers just send data to Graphite and there are many libs supporting this. The data is also useful for the developers themselves, as they can use the data and graphs in Graphite during development and test and to hunt down bugs. Today when we need some more monitoring data we discuss it with the developer, and normally it’s just a couple of lines they add to the code to get the metric; this was much harder to do before and therefore it wasn’t done.

It’s also possible to trigger an alert directly from an application by sending an event to the local Sensu client on port 3030.
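
For example, a small sketch like this (the check name and message are made-up examples) would raise a warning event through the local client socket:

# Minimal sketch: raise a warning event via the local Sensu client socket.
# The check name and output below are invented for illustration.
require 'socket'
require 'json'

event = {
  'name'   => 'foo_import_queue',   # hypothetical check name
  'output' => 'import queue is growing faster than expected',
  'status' => 1                     # 0 = OK, 1 = warning, 2 = critical
}

TCPSocket.open('localhost', 3030) do |socket|
  socket.puts event.to_json
end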

The business user

Business users like it as well

The data in Graphite is used all over the business: sales people can follow the usage of the site, finance can see the Amazon costs, analysts are monitoring the flow and the data volumes stored in the system. Everyone is looking at the performance. So the graphs make it much easier to work together and get the same common perspective. It is also of great benefit that it’s easy to add new monitoring and metrics; today when we talk with business people and they would like to add monitoring or metrics we can easily fulfil their needs, which makes them happier. It’s also much easier to ask them what they think is important to monitor and use the graphs as a base for that discussion – at which level should we trigger an alert? What are the right data volumes? This dialogue is much better today than before.

Graphs makes people happier

Just being able to show nice-looking graphs makes people happier, but being able to mangle the metrics in many ways to create smart alerts is also important functionality. Many times when we create a new alert we look at the graphs and test different functions like timeshifts, moving averages, filtering etc. to get the metric that we would like to alert on. For example, by using the timeShift function in Graphite we can compare the values with the same time the week before and alert if they differ too much. Another example is to use the derivative function to alert if the growth of a specific metric is outside our desired values. I haven’t seen any other monitoring solution with calculations as powerful as what you can do in Graphite.
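
To give a flavour of this, here are the kinds of render targets we end up alerting on (the metric names are invented for illustration):

# Illustrative Graphite render targets of the kind we alert on.
# The metric names are invented examples, not our real ones.

# Compare with the same time last week and alert if the difference is too big:
week_over_week = "absolute(diffSeries(stats.docs_per_sec, timeShift(stats.docs_per_sec, '7d')))"

# Alert if a counter stops growing the way it should:
growth = "derivative(stats.documents_stored)"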

What are we measuring and monitoring?

We are measuring a lot: we have about 2 500 000 metrics and we try to store a lot of metric information. Many of those metrics are slow-moving metrics that we only update every 5 minutes or so. The idea is to store as many metrics as possible to help find causes when we have problems, to see bottlenecks, to do capacity planning and to see user behaviour. Already having a lot of metrics makes it easier to add new things to monitor, as we often have the needed metric in our system.

Monitor usage not number of web server requests

Only a few of the metrics are actually alerted on, and we also try to monitor at a higher level compared to before. We are running in the cloud, so network, routers etc. have never been of any interest to us, but we are also monitoring less on the OS level, like load and disk performance, because that is not of much interest either. What is of interest for us is the performance for the end user, how many documents we can analyse per second, usage etc., and therefore we try to monitor at higher levels. Instead of monitoring the load on a database machine we monitor how long it takes to insert a record in our database. It doesn’t matter if the load is > 10 or disk I/O is high; what matters is the performance of the database for our application. Then if we get alerted that the database starts to slow down, it can be good to look at the graphs of the load or the I/O metrics, or maybe look at how many instances we have in the database etc. So instead of alerting on the behaviour of the OS we look at higher levels; this approach saves us time and gives us much better quality. Sensu makes this approach much easier with its easy integration with other tools and its good support for configuring checks and handlers via configuration management tools like Chef.
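
As an illustration, a check at that level can be as simple as timing the operation itself. The sketch below is hypothetical, with made-up thresholds and a placeholder insert instead of a real database call.

#!/usr/bin/env ruby
# Hypothetical sketch: alert on how long a database insert takes,
# instead of on load or disk I/O. Thresholds and the insert helper
# are made-up examples.
require 'sensu-plugin/check/cli'
require 'benchmark'

class CheckDbInsertTime < Sensu::Plugin::Check::CLI
  WARN_SECONDS = 0.5
  CRIT_SECONDS = 2.0

  # Placeholder: replace with a real insert against your database
  def insert_test_record
    sleep 0.1
  end

  def run
    elapsed = Benchmark.realtime { insert_test_record }
    if elapsed > CRIT_SECONDS
      critical "Insert took #{elapsed.round(2)}s"
    elsif elapsed > WARN_SECONDS
      warning "Insert took #{elapsed.round(2)}s"
    else
      ok "Insert took #{elapsed.round(2)}s"
    end
  end
end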

Monitor performance – not load on a server

The architecture – so elegant

Simple architecture and you can extend it as quick as the flash

The nice thing with Sensu is the architecture, with separated services and messaging based on RabbitMQ, and it’s easy to integrate with other services and components. There is a small step to get an understanding of how it works, but when you have grasped Sensu it’s easy to work with and easy to extend with new checks and handlers and to integrate with other tools.

Drawbacks with Sensu

Sensu is not perfect and will probably never be, but it is so much better than our previous solution. What is missing in Sensu is documentation, although it has improved with the wiki. The UI also needs development and more advanced functionality, but as the community is growing very quickly I am sure those problems will be overcome; there are actually already 3 user interfaces developed.

I am happy

A portrait of me – Ulf

All of us in the operations team are happier with our new monitoring solution, and I especially am much more satisfied with it. We had been discussing implementing Sensu all spring but kept postponing it. I am glad we took the decision to go away for a couple of days to do the monitoring makeover, and also to tell everyone in advance that when we came home there would be a new monitoring solution (pressure is good sometimes). This was also a very good team exercise.

Let 2013 be the year of #monitoringlove

So stop complaining, the future is here – have a look and see if it fits your needs.

I would like to thank Erik, Martin and Anders in the operations team who made this change happen.
The slides are available here

Graphite and Sensu

I will give a short ignite talk at the DevOps conference in Rome about how we implemented Sensu at Recorded Future. This is a more detailed description of how we use Sensu and Graphite.

At Recorded Future we have created a monitoring pipeline with some very useful components, and I will describe one of the most useful pipelines that we have implemented. I think one of the great ideas with Sensu is to use it as a router and not stuff a lot of functionality into the tool; instead Sensu is easy to integrate with other tools like Graphite. Graphite by itself is a really good tool and database for handling and storing metrics; one of its main advantages is much better performance for handling metrics compared with traditional databases.

Metric collection – Graphite – Sensu pipeline

We have built a pipeline where we send metrics to Graphite from different sources like Sensu, applications and specific tools or scripts. At the end of the pipeline we have Sensu scripts that pull the data out from Graphite (in many cases using built-in Graphite functions to massage the data) and, based on that data, the Sensu client sends an OK or an alert to the Sensu server. This means that we can monitor trends, and we can monitor averages, max and mean values for time series without the need to save the data in our monitoring system. In Graphite we can also view graphs of the data, which means that we alert on the same data as we graph and we need to collect the data only once. It’s easy for everyone to understand what we are monitoring and to discuss and find trigger levels by simply graphing the data. When we find a problem and realize that we need to monitor it, we very often already have the data in Graphite, and then it’s easy to create an alert on that data as well.

The benefits of this pipeline

  • It has been much easier to get the developers to add code that sends data to Graphite, as it has a simple-to-use API. They can also use the metrics to see graphs by themselves.
  • The history is stored and you can use it to evaluate triggers etc. and see the previous behaviour of your system.
  • You can easily graph the metrics.
  • You can do a lot of calculations on the data; one of my favourites is the derivative of a metric to see if a metric is increasing or declining over a period of time, and then for example alert if it isn’t growing along the path it should.
  • You don’t store any data in your monitoring system, so you get much less load on the monitoring system and its databases.
  • You can use your data for graphs and dashboards.
  • If you need to monitor a metric, you very often find that you already have the data in Graphite.

Collecting data

How is this pipeline set up? The first thing is to collect the data and send it to Graphite; we are doing this in 3 different ways.

From Sensu scripts

Create a Sensu script of type Graphite and use the output function for your Graphite metric. The script is run by the local Sensu client, which takes the output and sends it via RabbitMQ to the Sensu server, which in turn sends it via RabbitMQ to Graphite. (You need to configure the Graphite server to accept metrics via RabbitMQ.)

require 'sensu-plugin/metric/cli'
# RFAPI is our internal API client library (its require is omitted here)

class RFAPIMetrics < Sensu::Plugin::Metric::CLI::Graphite
  option :host,
    :short => "-h HOST",
    :long => "--host HOST",
    :description => "The API host to do the check against, including port",
    :required => true

  def run
    api_server = RFAPI.new
    api_server.base_uri = config[:host]
    api_stat = api_server.api_stat
    # Each line looks like "name=value"; emit it as a Graphite metric
    api_stat.each do |line|
      stat_name, stat_value = line.split("=")
      output "api.#{stat_name}", stat_value
    end
    ok
  end
end

From applications

We are sending a lot of metrics directly from the applications and processes. That could be usage, number of documents handled, execution times and many other things. We are using libs like Metrics for Java http://metrics.codahale.com/ to send metrics data to Graphite without any obscure path via some monitoring application with a clumsy API that the developers hate. Instead it makes it very easy for developers to add metrics, which can then be used during development, bug fixing and in production. The format that Graphite uses is very simple and easy to understand, so it’s easy to use. By letting the application send the data directly there is no need for any extra data-gathering processes, and the data can be calculated and formatted directly in the application code to fulfil the needs.
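
For anything that doesn’t have a library, sending a metric straight to Graphite’s plaintext listener is tiny. Here is a sketch; the carbon host and metric name are just examples:

# Sketch of sending one metric straight to Graphite's plaintext
# listener (carbon) on port 2003. Host and metric name are examples.
require 'socket'

TCPSocket.open('graphite.foo.com', 2003) do |socket|
  socket.puts "app.documents_analysed 42 #{Time.now.to_i}"
end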

From other tools/scripts

We are using some tools that gather data and send it directly to Graphite, and we also have some scripts that do scheduled runs, collect data from logs, processes etc. and then send the data directly to Graphite.

Using the Graphite data in Sensu

We use Graphite data in different ways in Sensu: the Sensu client script fetches the data from Graphite, acts on the data and responds with an OK or creates an alert.

Last value

Just grab the last value from Graphite for the metric, compare it with the threshold and trigger an alert if it is outside the bound. This is a small example.

require 'net/http'
require 'json'

# Fragment from inside a Sensu check script: target, @period,
# TRIGGER_VALUE and graphite_url (a URI to the render API) are set elsewhere.
params = {
  :target => target,
  :from   => "-#{@period}",
  :format => 'json'
}
resp = Net::HTTP.post_form(graphite_url, params)
data = JSON.parse(resp.body)

if data.size > 0
  # datapoints is a list of [value, timestamp] pairs; grab the last value
  last_value = data.first['datapoints'].map(&:first).compact.last
  if last_value && last_value > TRIGGER_VALUE
    warning "Metric #{target} has value #{last_value} that is larger than #{TRIGGER_VALUE}"
  end
else
  critical "No data found for graphite metric #{target}"
end
ok

Time series

We can use Graphite to store our time series and calculate averages, max values, aggregated values from many monitored instances etc. We can then alert on those values, for example when the total CPU usage for some kind of process running on many machines is too high.
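
For example, a single Graphite target can aggregate a process’s CPU usage across all machines running it (the metric path below is an invented example):

# Invented example of an aggregated Graphite target: total CPU used by
# one kind of process across all machines running it.
total_cpu = "sumSeries(servers.*.fooworker.cpu_percent)"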

Trends, statistical methods

We grab a time series from Graphite and then use it to alert on trends; we use different statistical methods, including calculations in R, to be able to alert on anomalous values.

Sensu script to check Graphite data

You can find a Sensu script for checking metrics based on data in Graphite here: https://github.com/portertech/sensu-community-plugins/blob/master/plugins/graphite/check-data.rb

Everyone running a service is doing operation

NoOps and DevOps have been hot topics on Twitter in the last couple of months, and a couple of days ago @adrianco published a blog entry saying that Netflix is doing NoOps for the part of the service they run on AWS. I think this discussion is important: how will operations evolve today and in the future, what will the role be and what skills will be needed? There will always be operations – I can access the site at www.netflix.com so the site seems to be operated 😉 The question is which approach to take and what the need for operations and development will be in the future.

Clearly, as @lusis points out in this reply, http://blog.lusis.org/blog/2012/03/20/it-sucks-to-be-right, Netflix does operations, or the site as well as the business would be dead.

One of the issues I think @adrianco got wrong is the way you do operations today. Operations in businesses utilizing virtual environments and working with continuous delivery is different compared to operations running on your own physical hardware with scheduled releases. Of course you still do operations, but you do it on another level and partly you need other skills.

(Maybe there are some improvements that could be done on the operation of netflix.com, earlier today I tried to access their site and got this:

This part of the site could of course be the stuff running on the physical hardware and operated by another team, but anyway Netflix is one company with a common goal to deliver services to their customers.)

Operation – the move from hardware to programming

Operations is getting more and more of a development approach, and not just developing scripts; instead we see a movement where operations is developing, maintaining and operating 🙂 a system of its own, utilizing most of the fancy things used in application development like databases, messaging, indexing technologies and GUI development. Operations also needs to take a greater part in handling the rest of the development infrastructure like build systems, version management, test tools etc., as those tools are becoming more and more important in an agile world.

This is driven by the movements toward virtualization (public or private clouds), automation and the agile movement in development. It is a driver for a major change in the role of operations and also in the skills needed for the operations girl or guy to do a great job.

How to handle the challenges

The new way of working puts a lot of challenges on the organization, and there are several ways you can handle this. One way is, I think, how Netflix handled it: by separating the operations duty into two different teams, one team with the “old” operations staff handling everything running on hardware owned by Netflix, and a new team of developers that operate everything running on hardware owned by Amazon. I think this is what happens if you have a company culture of us and them with no common goals. The development team gets frustrated by the old way of running operations on physical hardware, so they take the company credit card, sign up for a service, deploy their applications and take responsibility for operating the system. I don’t think Netflix is the only organization where you see this happen, and as we see more virtualization and automation we will see it happen in many organizations that can’t get operations and development together.

Instead, successful organizations will handle this in another way, by deeply integrating the operations teams with the development teams and creating a culture that makes the teams work smoothly together, or even by creating mixed teams. This creates a new way of working. One good example is the cultural change at Nokia Entertainment UK, presented at the DevOps conference in Göteborg: by inclusion they went from 6 releases/year and 50 persons working with releases to 246 releases/year with only 4 persons, see http://www.slideshare.net/pswartout/devopsorg-how-we-are-including-almost-everyone. That story was impressive.

Why are there so many operations people involved in the DevOps movement? I think this is natural: the tools used by operations are the ones that encounter the most dramatic changes, and the evolution needs to come from this change of reality. The real driver is the opportunity for the business to reduce time to market and lower costs. As more and more development organizations work with agile development, this puts pressure on the operations organization to become more agile and work in a different way. This different way involves a lot of development, and therefore I think we already see a movement where more people with a developer background work in operations. At Recorded Future we are hiring developers for operations work, as 70% of our work today is programming a fairly complex system.

So one of the major challenges is to transform the operations team into a team doing development of the operations system. Those working in operations who understand this change and make the operations team more of a development team will be the ones that succeed. There is a need both to develop operations people into more of programmers and, when hiring new people, to focus on programming skills and the skills to interact with other persons and teams.

DevOps is about culture

DevOps is about emphasizing a culture of common goals and getting things done together. Even if you have a great development team and a skilled and great operations team, you will not have success until they work closely together, and for this to happen you must work with the organization, the values and the culture. As many have written before, you need to break the silos, not build higher silos.

There is a movement started by the introduction of new technologies and driven by business requirements to be faster to market and to reduce cost. We see a lot of services delivered at different levels of XaaS, but we will still have to cope with how we operate our service in the best way, even when we don’t handle the hardware or even the operating system anymore. We have to improve our way of working to meet the business needs and to be competitive, and for me one of the most important tools is DevOps.

(And if you have read this far you deserve not only a cake but a beer 😉 )

How we use environments in Chef at Recorded Future

A couple of months ago Opscode introduced the environment concept in Chef, a long-awaited feature. It is still not perfect, but very useful. At Recorded Future we are using Subversion as our version management system and trying to keep the work in trunk, but from time to time we also have one or two feature branches, and until we are in trunk-only heaven (read more about why to avoid feature branches in this blog post by Jez Humble) we have to cope with managing branches for different Chef environments.

Today we have different environments for trunk and the branches, but also for different setups of our system: we have our main production system, but we are also managing our system for specific customers on separate machines with separate configurations. We also use blue-green deployment, which means that for some time we live with two environments, one released and one to be released.

An example of how our setup of environments can look:

Chef environment management

A cookbook in Chef has a version number with three parts, like 1.0.1: major, minor and patch version. We are using different major numbers for trunk and the different branches. So cookbooks in trunk start with 12.x.x and in a branch they could start with 13.x.x, and after we have merged our branch back into trunk the cookbooks in trunk will start with 13.x.x and the next branch will have 14.x.x as its major version number. Then we use the second number to determine which environment within the trunk or branch the cookbook is valid for.

An example of our environments and the cookbook versions used for each environment:

  • r12a – released main production system; Chef recipes from trunk, a specific revision; cookbook versions 12.0.x
  • r12b – main production system to be released; Chef recipes from trunk, head (latest version in trunk); cookbook versions 12.1.x
  • cust_a – a system for a specific customer; Chef recipes from trunk, a specific revision; cookbook versions 12.2.x
  • r12test – a test environment for manual tests; cookbook versions 12.98.x
  • r12build – a test environment for unit, integration and system tests used in our build process; Chef recipes from trunk, head; cookbook versions 12.99.x
  • r13a – a branch to be released; Chef recipes from a branch, a specific revision; cookbook versions 13.0.x
  • r13build – a test environment for unit, integration and system tests used in our build process; Chef recipes from a branch, head; cookbook versions 13.99.x

Then in Chef we set the version of the cookbook in the different environments with the version constraint ~>, for example rfcommon ~> 12.0.0 for r12a. We do this with a script that generates an environment specification file for all cookbooks, which we upload to the Chef server for each environment.
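
A generated environment file ends up looking roughly like this sketch (the cookbook names are made-up examples):

# Rough sketch of a generated Chef environment file for r12a.
# The cookbook names are made-up examples.
name "r12a"
description "Released main production system"
cookbook_versions(
  "rfcommon"    => "~> 12.0.0",
  "rfwebserver" => "~> 12.0.0",
  "rfdatabase"  => "~> 12.0.0"
)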

But now we have the problem of how to maintain the version number for the cookbook in the metadata.rb file. We had earlier created a script (no source published yet) for uploading cookbooks, to save some typing and to do some checks, like whether the cookbook is checked in to Subversion, and then the script runs a knife cookbook upload command. We started with the script before version 0.10 of knife, when Opscode introduced the plugin concept, so maybe we will later create a knife plugin instead.

Now we have also extended our script to handle the version management for the cookbooks. We run the script with an environment parameter, e.g. r12a, and the script temporarily changes the metadata.rb file for the cookbook to be uploaded. The script sets the correct version number for the environment by replacing the major and minor version numbers with the numbers for the environment, e.g. 12.0 for environment r12a, but it doesn’t touch the last number, the patch number. After the cookbook is uploaded the script changes the metadata.rb file back to the original, to avoid a lot of mismatches when merging branches etc.

The script will also validate that the path given to the script matches the branch/trunk used for the environment, to avoid uploading code from the wrong branch/trunk to an environment.
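
The version rewrite itself is just a small text substitution, roughly like this simplified sketch (not the actual script):

# Simplified sketch of the version rewrite: replace major.minor in
# metadata.rb with the environment's numbers and keep the patch part.
metadata = File.read('metadata.rb')
env_major_minor = '12.0'  # e.g. for environment r12a
patched = metadata.gsub(/^version\s+['"](\d+)\.(\d+)\.(\d+)['"]/) do
  "version '#{env_major_minor}.#{$3}'"
end
File.write('metadata.rb', patched)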
