The Order of the Blue Polo - Member 000010

The first time I heard of OpenNMS, I was a consultant looking for a network management solution in a pinch. I was stuck with a 300-node "black network"; that is to say, there were no records of what should and shouldn't be present, how the network and its nodes were performing, or why it kept dropping (the whole thing would drop around 10:15am and 3:00pm). I had come from the world of Nagios and needed something much less hands-on to configure. SNMP was the only service I had to work with.

The client still runs OpenNMS to this day. It helped us fix that first problem by narrowing our search to a single core switch. It's been just as valuable many more times since then. The client swears by it - but you'll never get a testimonial out of them, so you get the consultant's account of what happened instead.

Today, I work as a senior member of the operations team at Halogen Software. We currently monitor about 20 switching devices, a dozen layer 3 to 7 network appliances, and about a hundred servers, plus the SAN that powers it all. We’re collecting hundreds of metrics per server per polling interval (5 minutes).

We have worked with many different free and non-free management tools over the years, but I keep coming back to OpenNMS for the critical projects. There are several reasons for this, but the most important factors guiding that decision are customizability (suitability to task) and the ability to integrate with third-party products.

I use OpenNMS for performance and availability monitoring. Its ability to auto-discover and support all the different classes of device and host in my environment is impressive. Out of the box, over 90% of our hardware is supported without additional configuration.

It uses the agent-less paradigm. In my experience, agent-based systems just don't scale in a large environment: they become too hard to maintain and upgrade, and licensing costs can get ridiculous. Few agent-based platforms have agents for my switches, firewalls, SAN, load balancers, etc.

OpenNMS has many different ways of collecting data, and that makes it extremely powerful.

Of particular note, it has a native JMX client for data collection. We manage and monitor over 300 HotSpot JVMs. While there are many products designed to manage and monitor a JVM, none, in my experience, gives us the same ability to customize the collections to suit our environment. We only monitor the parts of the JVM that we need to and leave the rest. With JVM monitoring-type products it's pretty much all or nothing, and 'all' can mean a lot of different things, including the all-too-common phrase: "we don’t support that service on a non-standard port". Phooey -- OpenNMS does -- just tell it what port to look for the service on.

We also need to monitor 1000+ websites - and not just simple websites, but Web 2.0 applications. The monitoring needs to include logins to verify the functionality of the application beyond the login screen. OpenNMS provides a simple mechanism for this type of monitoring, and having all my monitoring within a single system that has an SLA manager gives me a much more holistic overview of our operational environment.

Alerting escalation paths - it's easy to define the logic for escalating an outage or alert and to match the escalation paths with an on-call rotation schedule.

Reporting - Being able to report on the metrics and availability data collected is critical to the overall effectiveness of a monitoring solution. Too many tools lack this feature, but it is critical for things like capacity planning, quality control, troubleshooting, and performance envelope testing and mapping.

Customizable reports: customizable reporting is invaluable, as it gives one the ability to present metrics visually and in compelling ways. For example:

In a single server report, I can show CPU metrics, RAM metrics, JVM metrics, network metrics, etc., and get a complete view of this server's activity profile.

In a cluster performance report, I can show CPU, JVM and network metrics for each cluster member side-by-side, allowing for a very fast way to assess load distribution, performance trends and trouble spots.

Extensible:

In every environment I've ever worked in, there are custom metrics that do not feed into a commercial product's monitoring scheme. This data is often not visible or presentable to said products.

Well, years ago I got tired of this dilemma. As most of my clients used Linux, which uses net-snmp almost exclusively as the SNMP agent, I wrote a Python script that can present any metric through the local SNMP agent using a custom OID tree (see man 5 snmpd.conf: pass_persist). Then I configured OpenNMS to collect these custom OIDs (no need to write an ugly MIB definition - yuck) and voila - they're graphed out in my weekly reports along with the standard metrics that came pre-configured to be collected.
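
For readers who have not seen pass_persist before, here is a minimal sketch of how such a script might look. It is not the script described above, only an illustration of the line-based protocol documented in snmpd.conf(5): snmpd sends PING, get, getnext or set on the script's standard input and reads the answer from its standard output. The base OID, the metric name and its value are all placeholders invented for this example.

```python
#!/usr/bin/env python3
"""Sketch of a net-snmp pass_persist handler (illustrative only)."""
import sys

# Placeholder subtree; substitute an OID from your own enterprise arc.
BASE_OID = ".1.3.6.1.4.1.99999.1"


def read_metrics():
    """Hypothetical metric source: returns {sub-id: (snmp_type, value)}."""
    return {1: ("gauge", 42)}  # e.g. a queue depth; hard-coded for the sketch


def oid_key(oid):
    """Turn '.1.3.6' into (1, 3, 6) so OIDs sort numerically."""
    return tuple(int(part) for part in oid.strip(".").split("."))


def build_table():
    """Map full OID strings to (type, value) pairs."""
    return {f"{BASE_OID}.{sub}": tv for sub, tv in read_metrics().items()}


def answer_get(table, oid):
    """Exact-match lookup: reply with OID, type, value, or NONE."""
    if oid not in table:
        return "NONE\n"
    snmp_type, value = table[oid]
    return f"{oid}\n{snmp_type}\n{value}\n"


def answer_getnext(table, oid):
    """Reply with the first OID that sorts strictly after the requested one."""
    wanted = oid_key(oid)
    for candidate in sorted(table, key=oid_key):
        if oid_key(candidate) > wanted:
            return answer_get(table, candidate)
    return "NONE\n"


def serve(stdin, stdout):
    """Answer requests until the agent closes the pipe."""
    while True:
        line = stdin.readline()
        if not line:  # EOF: snmpd went away
            break
        command = line.strip()
        if command == "PING":
            stdout.write("PONG\n")
        elif command == "get":
            stdout.write(answer_get(build_table(), stdin.readline().strip()))
        elif command == "getnext":
            stdout.write(answer_getnext(build_table(), stdin.readline().strip()))
        elif command == "set":
            stdin.readline()  # OID
            stdin.readline()  # type and value
            stdout.write("not-writable\n")  # this sketch is read-only
        else:
            stdout.write("NONE\n")
        stdout.flush()  # snmpd waits for each reply, so never buffer
```

To run it under the agent, the script would end with a call to serve(sys.stdin, sys.stdout) and be registered in snmpd.conf with a line such as "pass_persist .1.3.6.1.4.1.99999.1 /usr/local/bin/custom_metrics.py" (the path is hypothetical). The values are rebuilt on every request so that each poll sees fresh data.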

I've never used any commercial product that was able to accommodate so many different customizations and custom data collection policies nearly so well. I have replaced commercial products with OpenNMS installations on 7 separate occasions in my career.

In summary: OpenNMS is, hands-down, one of the best network monitoring packages I've ever used. I swear by it, and I depend on it for my sanity. You should too!

- Steve Hillier, Halogen Software, Canada