I’d recently ordered a new round of servers and was positively dreading having to set up Nagios & Munin on them. This is where the fact that I’m a “born & raised” developer really shines through. The configuration of Nagios is simply beyond me. No matter how much documentation I read, I just can’t get all the pieces moving right. Try to bolt Munin on top of this and I simply walk away in frustration. There had to be a better way…
The Competition
I had actually outsourced the initial setup on my existing servers, so of course I toyed around with the idea of having the same guys do it again on the new ones. But then I thought to myself, why should I have to do this at all? Aren’t there “all-in-one” products that can do this for me? I started with Centreon and then Zenoss. But I didn’t understand SNMP well enough to have a single monitor cover my disparate network of servers.
Arrgggh – weren’t there companies out there that did this sort of thing? Why do I have to reinvent the wheel here?
The Tweet Heard ‘Round the World
LogicMonitor to the Rescue
Over the weekend, I dug around a little on the LogicMonitor site and figured I’d give them a shot. On Monday I was contacted by their sales team and we scheduled a desktop session for that evening. I was asked to install a lightweight Java agent on my test server and configure SNMP properly. They guided me through that process and within a few minutes, all the basic datapoints were being gathered and graphed. All well and fine, but was it worth the cost?
Later, I started to get interesting alert emails – warnings about excess TcpRetrans and query cache prunes:
www2 has 21.42 query cache prunes per second due to low memory. The Query cache has a hit ratio of greater than 50% - so it is likely to benefit more from increasing the cache size to alleviate this memory pressure.
This state started at 2010-07-23 20:36:35 CEST and has been going on for 0h 12m.
Wow – not just a plain alert because a given threshold was tripped, but a bit of background information and even a suggestion on how to fix it! Suddenly, the light went on about LogicMonitor! I now had a virtual system administrator watching my servers and pointing me in the right direction when it noticed something strange. Just yesterday, I created my first custom datasources with graphs to keep track of my website response time for different pages.
It took about 10 minutes to set up two of these response-time graphs, and it’s the first step to being able to easily gather and plot relevant business data. Next steps could be questions like: how many people have logged in within the past 5 minutes? How many people have posted articles or questions? And the “holy grail”: how do our response times (and load) compare when plotted against these other metrics?
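To give a feel for what a response-time datasource has to collect, here is a minimal sketch in Python. It is my own illustration, not how LogicMonitor implements its datasources: the function names (`time_call`, `fetch_page`, `measure_pages`) and the example URLs are made up, and it simply times a full page download with the standard library.

```python
import time
import urllib.request


def time_call(fetch, url):
    """Return how long fetch(url) takes, in milliseconds."""
    start = time.perf_counter()
    fetch(url)
    return (time.perf_counter() - start) * 1000.0


def fetch_page(url, timeout=10):
    """Download the full page body so the timing covers the whole response."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def measure_pages(urls, fetch=fetch_page):
    """Map each URL to its response time in ms, or None if the request failed."""
    results = {}
    for url in urls:
        try:
            results[url] = time_call(fetch, url)
        except OSError:  # urllib.error.URLError and socket timeouts are OSErrors
            results[url] = None
    return results


# Example usage (hypothetical pages; substitute the ones you care about):
#   timings = measure_pages(["http://www.example.com/",
#                            "http://www.example.com/login"])
#   for url, ms in timings.items():
#       print(url, "FAILED" if ms is None else f"{ms:.1f} ms")
```

Run on a schedule, numbers like these are what end up as one line per page on a graph. Passing the `fetch` function in as a parameter is just a convenience so the timing logic can be tried out without touching the network.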
Naturally, I’ve had second thoughts about pushing such a core business value like server metrics out to a SaaS provider. But are the metrics themselves a core value? Or is it the visualization and business decisions that are made after comprehending these metrics that are the real value?
I’m going with the latter and am extremely happy to not have to worry about how to set up my own metrics system.
Understand that LogicMonitor is based in California, and this caused some minor headaches during the sales & support process as I’m located here in Munich, Germany. The 9-hour time difference basically meant my problems and questions were answered the next business day. Nevertheless, within a week I had full monitoring set up and configured on my test servers. The week after that I moved it into the production systems and haven’t looked back since.