#monitoringlove – a true story

This is a write-up of the ignite talk I gave at devopsdays 2012 in Rome. This story is not only about monitoring, it’s about #dadops as well ;). My 2 daughters, Olivia and Agnes, helped me out with the creative drawings for the presentation.

Monitoring doesn’t need to suck these days. A lot has happened in the last year that has changed the game, so let #2013 be the year of #monitoringlove.

I will tell the story of how we at Recorded Future went from #monitoringsucks to #monitoringlove, and also how monitoring went from being a pure ops thing to being a business thing as well.

The monitoring dragon – throwing fire everywhere

All of us were tired of our old monitoring solution, based on Hyperic, which we chose a couple of years ago as the state-of-the-art monitoring at that time. We had been discussing replacing Hyperic with another solution for a long time, but things like Zabbix, Zenoss and Nagios are, as @obfuscurity told us, “The fucking terrible” as well, so we stayed calm.

One day in May, the 4 of us in the operations team took the train down to Varberg for a 3-day hackathon to get rid of our monitoring solution and install a new one.

The train to Varberg with Erik, Martin, Anders and me

We worked hard for 2 days, and late in the evening of the 2nd day we could shut down our old monitoring solution and launch our new solution. We celebrated with a bottle of Champagne.

What we installed as the monitoring solution for Recorded Future was Sensu with friends. We actually spent the first day of this exercise trying to understand how Sensu worked, and how plugins and handlers worked. We did an installation of Sensu and the infrastructure via Chef and tested different plugins. Everyone in the team created their first Sensu plugin as well. Already after the first day we were convinced that this was the way to go. Sensu was easy to understand, it was easy to create plugins, and the server code was simple and easy to read. We used the third day to clean things up and add more monitoring. All in all it took 4 of us 3 days to completely replace the old monitoring system. We now have about 60 different plugins (Sensu check scripts), 5 different handlers, 200 different checks and 2 500 000 metrics.
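To give a feel for how small a plugin can be, here is a minimal sketch of a check script in Python. It relies on the convention that Sensu checks share with Nagios plugins: print one line of output and exit with status 0 (OK), 1 (warning), 2 (critical) or 3 (unknown). The disk usage check, the thresholds and the function names are assumptions made up for this example, not one of the plugins mentioned above.

```python
#!/usr/bin/env python3
"""Minimal sketch of a Sensu-style check script (illustrative only)."""
import shutil
import sys

# Exit statuses follow the Nagios plugin convention used by Sensu checks.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_disk_usage(used_percent, warn=80.0, crit=90.0):
    """Map a disk usage percentage to (exit status, one-line message).

    The thresholds are hypothetical defaults chosen for this example.
    """
    if used_percent >= crit:
        return CRITICAL, "CheckDisk CRITICAL: %.1f%% used" % used_percent
    if used_percent >= warn:
        return WARNING, "CheckDisk WARNING: %.1f%% used" % used_percent
    return OK, "CheckDisk OK: %.1f%% used" % used_percent


def main(path="/"):
    """Run the check against one mount point and return the exit status."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as err:
        print("CheckDisk UNKNOWN: %s" % err)
        return UNKNOWN
    status, message = evaluate_disk_usage(100.0 * usage.used / usage.total)
    print(message)
    return status


# When used as a real check, end the script with: sys.exit(main())
```

Keeping the decision logic in a plain function like evaluate_disk_usage makes the plugin easy to test without touching a real disk.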

An image of Sensu according to my daughter

Sensu is just one part of a monitoring solution; Sensu needs friends to be really great. Of course there are the obvious ones like RabbitMQ and Redis, as Sensu will not work without them, but then there are the ones that make Sensu into a complete monitoring solution, like Graphite, Pagerduty, Gdash etc. (you can add your own favorite here).

Sensu with RabbitMQ the friend, and Pagerduty

Graphite – the very best friend of Sensu

Graphite is a really great tool, and together with Sensu it’s awesome. Graphite is splendid at storing high volumes of metrics and showing graphs, but it is also very useful to alert on the metrics data in Graphite; read more in a previous blog post here.

I would say it’s about using the right tool for the right thing. Most monitoring solutions in the past have been about having a big monolith that should solve all your monitoring problems. But those solutions don’t solve all your monitoring problems; instead they give you a monitoring headache and you really start to feel that #monitoringsucks.

The really great thing with Sensu is that it doesn’t try to solve everything. Instead it’s decomposed into different components and is easy to extend and integrate with other tools. So instead of using one Swiss-army-knife tool to fix my house, I prefer to use proper tools made for the type of task that I need to do.

Why we are happier at Recorded Future

The new monitoring approach we use has lowered the barriers between different teams and also makes it easier to communicate within the company, no matter if it’s developers communicating with ops or ops communicating with the business.

The developer

Since we lowered the barrier for the developers to interact with the monitoring system, they have become much more interested in data and monitoring. It has become much easier to add metric data, and the main reason is the integration with Graphite: the developers just send data to Graphite, and there are many libs supporting this. The data is also useful for the developers themselves, as they can use the data and graphs in Graphite during development, in test and to hunt down bugs. Today when we need to add some more monitoring data we discuss it with the developer, and normally it’s just a couple of lines that they add to the code to get the metric. This was much harder to do before, and therefore it wasn’t done.
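As an illustration of what “a couple of lines” can look like, here is a minimal sketch that sends one value over Graphite’s plaintext protocol, where each line is a metric path, a value and a Unix timestamp. It assumes a Carbon listener on its default plaintext port 2003; the host name and the metric name are hypothetical placeholders, and in practice one of the existing client libs would do the same job.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one plaintext protocol line: path, value and timestamp, newline-terminated."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, timestamp)


def send_metric(path, value, host="localhost", port=2003, timestamp=None):
    """Send a single metric to a Carbon listener over TCP.

    host is a placeholder; 2003 is Carbon's default plaintext port.
    """
    line = format_metric(path, value, timestamp)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
```

A call such as send_metric("app.documents.analysed", 42) from the application code is then enough to get a new graph.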

It’s also possible to trigger an alert directly from an application by sending it to the local Sensu client on port 3030.
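A minimal sketch of how an application could do this, assuming a Sensu client is running on the same machine with its socket listening on 127.0.0.1:3030. The client socket accepts a JSON document with a name, an output and a status; the check name and message used below are hypothetical.

```python
import json
import socket


def build_check_result(name, output, status):
    """Serialise a check result as JSON for the Sensu client socket."""
    return json.dumps({"name": name, "output": output, "status": status})


def send_check_result(name, output, status, host="127.0.0.1", port=3030):
    """Push a check result to the local Sensu client over TCP.

    status uses the usual check exit codes: 0 OK, 1 warning, 2 critical.
    """
    payload = build_check_result(name, output, status)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload.encode("utf-8"))
```

For example, an import job could call send_check_result("import_job", "import failed", 2) in its error handling, and the result would then go through the same handlers as any other check.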

The business user

Business users like it as well

The data in Graphite is used all over the business: salespeople can follow the usage of the site, finance can see the Amazon costs, and analytics monitor the flow and the data volumes stored in the system. Everyone is looking at the performance. So the graphs make it much easier to work together and get the same common perspective. It is also of great benefit that it’s easy to add new monitoring and metrics. Today when we talk with business people and they would like to add monitoring or metrics, we can easily fulfil their needs, which makes them happier. It’s also much easier to ask them what they think is important to monitor and use the graphs as a base for that discussion: At which level should we trigger an alert? What are the right volumes of data? This dialogue is much better today than before.

Graphs make people happier

Just being able to show nice-looking graphs makes people happier, but being able to manipulate the metrics in many ways to create smart alerts is important functionality as well. Many times when we create a new alert we look at the graphs and test different functions like time shifts, moving averages, filtering etc. to get the metric that we would like to alert on. For example, by using the timeShift function in Graphite we can compare the values with the same time the week before and alert if they differ too much. Another example is to use the derivative function to alert if the growth of a specific metric is outside our desired values. I haven’t seen any other monitoring solution with calculations as powerful as the ones you can do in Graphite.
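The week-over-week comparison could be sketched roughly like this. The code builds a URL for Graphite’s render API that asks for a metric and for the same metric shifted 7 days with timeShift, and then compares the latest values from the JSON datapoints, which come as [value, timestamp] pairs. The base URL, the metric name and the 30 % tolerance are assumptions picked for the example, and fetching the URL is left out.

```python
from urllib.parse import urlencode


def render_url(base_url, metric, window="-30min"):
    """Build a render API URL for a metric and the same metric one week earlier."""
    query = urlencode(
        [
            ("target", metric),
            ("target", 'timeShift(%s,"7d")' % metric),
            ("from", window),
            ("format", "json"),
        ]
    )
    return "%s/render?%s" % (base_url.rstrip("/"), query)


def last_value(datapoints):
    """Return the most recent non-null value from [value, timestamp] pairs."""
    for value, _timestamp in reversed(datapoints):
        if value is not None:
            return value
    return None


def differs_too_much(current, week_ago, tolerance=0.3):
    """True when the latest value deviates from last week's by more than tolerance."""
    now, before = last_value(current), last_value(week_ago)
    if now is None or before is None or before == 0:
        return False  # not enough data to decide; a real check might report unknown
    return abs(now - before) / abs(before) > tolerance
```

Wrapped in a check script, differs_too_much decides whether the check returns OK or critical. The same pattern works with derivative(...) as the target to alert on growth instead.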

What are we measuring and monitoring?

We are measuring a lot. We have about 2 500 000 metrics and we try to store a lot of metric information; many of those metrics are slow-moving metrics that we only update every 5 minutes or so. The idea is to store as many metrics as possible to help us find causes when we have problems, see bottlenecks, do capacity planning or see user behaviour. Already having a lot of metrics makes it easier to add new things to monitor, as we often have the needed metric in our system.

Monitor usage not number of web server requests

We actually monitor only a few of these metrics, and we also try to monitor on a higher level than before. We are running in the cloud, so network, routers etc. have never been of any interest to us, but we are even monitoring less on the OS level, like load, disk performance etc., because that is not of much interest. What is of interest to us is the performance for the end user, how many documents we can analyse per second, usage etc., and therefore we try to monitor on higher levels. Instead of monitoring the load on the database machine we monitor how long it takes to insert a record in our database. It doesn’t matter if the load is > 10 or disk I/O is high; what matters is the performance of the database for our application. Then, if we get alerted that the database is starting to slow down, it could be good to look at the graphs of the load or the I/O metrics, or maybe look at how many instances we have in the database etc. So instead of alerting on the behaviour of the OS we look at higher levels; this approach saves us time and gives us much better quality. Sensu makes this approach much easier with its easy integration with other tools and its good support for configuring checks and handlers via configuration management tools like Chef.
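Measuring how long an operation takes can be sketched as a small context manager that times the wrapped block and hands a line in Graphite’s plaintext format to a callback. This is only an illustration of the idea: the metric name, the emit callback and the injectable clock are assumptions of the example, not a description of our actual code.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(metric_path, emit, clock=time.time):
    """Time the wrapped block and pass a Graphite plaintext line to emit.

    emit is any callable that takes the formatted line, for example a
    function that writes it to a Carbon socket.
    """
    start = clock()
    try:
        yield
    finally:
        end = clock()
        elapsed_ms = (end - start) * 1000.0
        emit("%s %.3f %d\n" % (metric_path, elapsed_ms, int(end)))
```

Used as `with timed("db.records.insert_ms", send_line): db.insert(record)`, where send_line and db are hypothetical, it gives a metric that an alert on insert time can be built on, and the load and I/O graphs are only needed once that alert fires.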

Monitor performance – not load on a server

The architecture – so elegant

Simple architecture and you can extend it as quick as the flash

The nice thing with Sensu is the architecture, with separated services and messaging based on RabbitMQ, which makes it easy to integrate with other services and components. It takes a little effort to understand how it works, but once you have got the hang of Sensu it’s easy to work with, easy to extend with new checks and handlers, and easy to integrate with other tools.

Drawbacks with Sensu

Sensu is not perfect and probably never will be, but it is so much better than our previous solution. What is missing in Sensu is documentation, although it has improved with the wiki. The UI also needs development and more advanced functionality, but as the community is growing very quickly I am sure those problems will be overcome; there are actually already 3 user interfaces developed.

I am happy

A portrait of me – Ulf

All of us in the operations team are happier with our new monitoring solution, and I especially am much more satisfied with it. We discussed implementing Sensu all spring but kept postponing it. I am glad we took the decision to go away for a couple of days to do the monitoring makeover, and also to tell everyone in advance that there would be a new monitoring solution when we came home (pressure is good sometimes). This was also a very good team exercise.

Let 2013 be the year of #monitoringlove

So stop complaining, the future is here. Have a look and see if it fits your needs.

I would like to thank Erik, Martin and Anders in the operations team, who made this change happen.
The slides are available here

3 comments so far

  1. Pascal (@jpascalw) on

    Love the drawing of Sensu with RabbitMQ the friend, and Pagerduty :-)

    You mention capacity planning. What things do you care about specifically with capacity planning given that your entire(?) infrastructure is AWS?

    • ulfmansson on

      Thank you, I like the rabbit too.

      Even if we are on Amazon we need to do capacity planning.
      1. We try to optimize the running infrastructure: we check the load, I/O usage etc. to see if we can move processes and reduce the number of instances.
      2. The volume of data in our system is growing, and we use the metrics and graphs to analyze the future need for instances when planning our budget, but also to see if there is a part of the architecture/infrastructure that is starting to add too much cost.

  2. Why Monitoring Sucks on

    [...] as a counter point, read this post entitled #monitoringlove – a true story by Ulf [...]

