User talk:Jonathan


CPU Statistics using net-snmp

8th March 2006

Well, I've about completed messing around with graphing the systemStats table on net-snmp 5.3. You can see the results in net-snmp 5.3 CPU collections.

Where did February go?

6th March 2006

I can't believe that it's been a couple of months since my last post here. February has been incredibly busy. Apart from my day job, I've been spending a lot of time trying to make our old cricket and orca installations obsolete. Obviously, I'm replacing them with OpenNMS. This has meant a lot of time wrangling collectd-configuration.xml, datacollection-config.xml and snmp-graph.properties. It's also resulted in a couple of interesting (at least for me) threads on the "discuss" list and on #opennms. I must thank Robert and JS for listening to my musings on the subject.

Whilst I've been busy trying to consolidate our data collection, I've been dealing with an increasingly overloaded OpenNMS server. There's been a lot of discussion regarding scaling OpenNMS, and the (dis)advantages of different RAID configurations. Hopefully I'll have some hard data on this in a week or so, as we install a new disk array for our primary OpenNMS machine. Watch this space for the before and after performance stats.....

Using iplike in SQL

17th January 2006

OpenNMS uses a custom postgres function "iplike" for IP address comparisons. It's something of a bête noire amongst the community as it ties OpenNMS to Postgres as a back end database. It does have its uses however....

My employers make extensive use of asset categories, in particular pollercategory and notifycategory, to customise poller packages and notification paths for various devices. The downside to this is that each node needs to be categorised via the webUI and OpenNMS restarted before any new nodes will get polled. We got around this by running a small chunk of SQL daily to identify "uncategorised nodes" thus:

SELECT t1.nodeid AS "nodeid", t0.nodelabel AS "node", t1.displaycategory,
       t1.pollercategory, t1.notifycategory, t1.thresholdcategory
FROM node t0 LEFT OUTER JOIN assets t1
ON (t0.nodeid = t1.nodeid)
WHERE (t1.pollercategory IS NULL
       OR t1.notifycategory IS NULL
       OR t1.displaycategory IS NULL)
ORDER BY t1.nodeid ASC;

You'll notice that this does not check the thresholdcategory. Adding this would be simple though.
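For anyone who wants to try the report without touching a live database, here is a minimal sketch that runs a variant of it against an in-memory SQLite database. The two-table schema is a cut-down stand-in (only the columns the query uses), the sample nodes are made up, and SQLite is standing in for Postgres. It also adds the thresholdcategory check mentioned above and selects the node id from the node table rather than from assets, so that a node with no assets row still shows its id. Both of those are my changes for illustration, not part of the original report.

```python
import sqlite3

# Stand-in schema: only the columns the report touches.
SCHEMA = """
CREATE TABLE node   (nodeid INTEGER PRIMARY KEY, nodelabel TEXT);
CREATE TABLE assets (nodeid INTEGER, displaycategory TEXT, pollercategory TEXT,
                     notifycategory TEXT, thresholdcategory TEXT);
"""

# Variant of the report: also checks thresholdcategory, and reads nodeid
# from node (t0) so nodes with no assets row are still identified.
REPORT = """
SELECT t0.nodeid, t0.nodelabel
FROM node t0 LEFT OUTER JOIN assets t1 ON (t0.nodeid = t1.nodeid)
WHERE (t1.pollercategory IS NULL
       OR t1.notifycategory IS NULL
       OR t1.displaycategory IS NULL
       OR t1.thresholdcategory IS NULL)
ORDER BY t0.nodeid ASC;
"""


def uncategorised(conn):
    """Return (nodeid, nodelabel) for every node missing a category."""
    return conn.execute(REPORT).fetchall()


def demo():
    """Build a tiny made-up data set and run the report against it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO node VALUES (?, ?)",
        [(1, "web01"), (2, "db01"), (3, "vm-test")],
    )
    conn.executemany(
        "INSERT INTO assets VALUES (?, ?, ?, ?, ?)",
        [
            (1, "Production", "Servers", "Ops", "Default"),  # fully categorised
            (2, "Production", "Servers", "Ops", None),       # no threshold category
            # node 3 has no assets row at all
        ],
    )
    return uncategorised(conn)
```

Calling `demo()` lists db01 (missing thresholdcategory) and vm-test (no assets row), while web01 is left out because every category is set.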

This whole scheme started to become a real pain for some nodes, in particular VMware virtual machines. We have an entire network of nodes that are virtual machines. These are often transient, running for a few days or weeks only. Continually having to categorise these nodes via the webUI became a chore. To avoid this we added some display categories, notices and poller packages based on the node's IP address, instead of asset categories. This all worked fine apart from the fact that these transient nodes were appearing on our "uncategorised nodes" report. The report was only intended to indicate nodes that required asset data updates before they would be polled, and therefore should not contain these nodes. This is where iplike came in handy. We modified our SQL thus:

SELECT t1.nodeid AS "nodeid", t0.nodelabel AS "node", t1.displaycategory,
       t1.pollercategory, t1.notifycategory, t1.thresholdcategory
FROM node t0 LEFT OUTER JOIN assets t1
ON (t0.nodeid = t1.nodeid)
WHERE (t1.pollercategory IS NULL
       OR t1.notifycategory IS NULL
       OR t1.displaycategory IS NULL)
      AND t1.nodeid NOT IN (SELECT nodeid FROM ipinterface
                            WHERE iplike(ipinterface.ipaddr, '192.168.1.*'))
ORDER BY t1.nodeid ASC;

You can see the use of iplike to exclude any node with an IP address in the 192.168.1.* range, which happens to contain all the VMware virtual machines. This is probably not the most efficient piece of SQL ever (I'm no DBA), but it does the trick. Now our report only shows nodes that will not be displayed, will not be polled, or will not cause notices to be generated.
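To make the matching idea concrete, here is a rough Python approximation of octet-by-octet pattern matching in the style of iplike. The real iplike is a compiled Postgres function and this is not its code. The query above only demonstrates the `*` wildcard; the range (`1-9`) and list (`1,3,7`) forms handled below are how I understand the pattern syntax and should be treated as an assumption. The function names are made up for the sketch.

```python
def octet_matches(octet: int, spec: str) -> bool:
    """Match one octet against '*', 'N', 'A-B', or a comma-separated mix of those."""
    for part in spec.split(","):
        if part == "*":
            return True
        if "-" in part:
            low, high = part.split("-", 1)
            if int(low) <= octet <= int(high):
                return True
        elif int(part) == octet:
            return True
    return False


def iplike_sketch(addr: str, pattern: str) -> bool:
    """Approximate an iplike-style match for dotted-quad IPv4 addresses only."""
    octets = addr.split(".")
    specs = pattern.split(".")
    if len(octets) != 4 or len(specs) != 4:
        return False
    # Every octet must satisfy the pattern element in the same position.
    return all(octet_matches(int(o), s) for o, s in zip(octets, specs))
```

With this sketch, `iplike_sketch("192.168.1.42", "192.168.1.*")` is true, while an address in 192.186.1.* does not match, which is exactly the kind of transposition that is easy to make when typing these patterns.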

Monitoring tomcat using JMX

14th December 2005

My employers are big tomcat users. OpenNMS 1.3 is a big deal for us as it adds the ability to monitor key system attributes from a running Java 5 VM via JMX. It's taken me a long time to get around to implementing it though. This week finally saw some results for my efforts.

It's gonna be a short blog this week. Instead, I'd like to point you to my adventures with JMX in the Tomcat 5.5 JMX How-To. Big thanks to Mike Jamison, who wrote the code that does the data collection, for checking this document. Hopefully it will give people enough information to get started with this useful and unique OpenNMS feature.

Jabber Festival

21st October 2005

After my blog on Jabber, and a request on the discuss list for details on using the festival speech synthesis system with OpenNMS, it occurred to me that it may be possible to combine the two.

A quick google for "festival" and "XMPP" led me to festival-gaim, a plugin for the gaim IM Client. It was five minutes' work to have festival-gaim's plugin installed. Now all my XMPP notifications from OpenNMS are uttered by a deadpan American lady through my workstation speakers.

Enjoy ....

Jabber Jabber Jabber

14th October 2005

Many moons ago, I built a jabber XMPP server for our organisation. We needed a communications channel to remote workers, and didn't want them to use a public instant messaging system (like MSN messenger or AIM) for company communications. We settled on jabberd2 for a number of reasons. It is nicely documented, can run with a postgres backend database, and could integrate with Active Directory via LDAP for authentication and access control.

I put together some notes on using jabberd2 with Psi (a very nice client that runs on all the platforms we use), informed the IT department, then promptly forgot about the whole thing. From time to time, I'd look at the people logged in to the Jabber server. It seemed particularly popular with the development teams.

At my employer, we have made a big effort to use OpenNMS to collect operational information from not just network devices and server platforms, but from applications too. Once you do this, OpenNMS can move out of the Operations realm and be a useful tool for the entire IT department. Given the development teams' enthusiasm for instant messaging, I thought that the jabber server might be a useful tool to distribute OpenNMS notices.

Chris Abernethy had written a notification strategy to allow sending XMPP messages to individual users. I thought that maybe I could use this to send notices to a group. The trouble with this was that I didn't want to bypass our Operations team. I didn't want to have to make all the developers OpenNMS users and have them acknowledge notices. I just wanted to copy them on all notices, so that they would be aware that their application may be in trouble. After a bit of head scratching I added a new XMPP group notification strategy to go alongside the one that Chris wrote. Now we have an OpenNMS group chat room. By prepending delivery to that channel to all my destination paths, every notice now appears there. Anyone who has joined that room can see all the notices generated by OpenNMS.

Of course, one bright idea leads to another. Now what I really need is an xmpp client for all the alpha messaging LED signs lying unused in our office....

Lessons (Re)learned

5th September 2005

We've recently taken delivery of a new dual Xeon em64t machine for OpenNMS. The upgrade taught me a useful lesson about life at the leading edge that I thought I'd share.

I've always been a build-from-source kind of guy, and I must have built OpenNMS hundreds of times now. I assumed it would just be the usual:

  1. Install Postgres (from source)
  2. Install rrdtool (from source)
  3. Install tomcat.
  4. Edit build.properties to point OpenNMS at the right libraries.
  5. Install OpenNMS.
  6. Import the database from backup.
  7. Start OpenNMS and tomcat.

Which looked to be true, but just like some others who subscribe to opennms-discuss, I ran into problems immediately. Discovery refused to start.

The result was a quick trip to 32/64 bit hell....

It didn’t take me long to see the symptoms of my problem:

2005-09-05 13:49:26,545 ERROR main Discovery: Failed to create ping manager
java.lang.NullPointerException
       at org.opennms.netmgt.discovery.Discovery.start(Discovery.java:434)
       at org.opennms.netmgt.discovery.jmx.Discovery.start(Discovery.java:43)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
       at java.lang.reflect.Method.invoke(Method.java:324)
       at mx4j.MBeanIntrospector.invoke(MBeanIntrospector.java:214)
       at mx4j.MBeanServerImpl.invoke(MBeanServerImpl.java:158)
       at org.opennms.netmgt.vmmgr.Manager.start(Manager.java:208)
       at org.opennms.netmgt.vmmgr.Manager.main(Manager.java:436)

I think I've seen this on the mailing list. Hang on a minute, this is an x86_64 machine. I'm running a Sun 1.4.2 JVM (which is a 32 bit executable). Let's have a quick look at the libjicmp JNI library.

root@opennms lib# file libjicmp.so
libjicmp.so: ELF 64-bit LSB shared object, AMD x86-64, version 1 (SYSV), not stripped

There's my problem. I'm trying to load 64 bit code from a 32 bit executable. Well, the forward step would be to get a 64 bit JVM. Unfortunately Sun do not have a 64 bit 1.4 JVM for linux x86-64. I could try a 1.5 VM, but I'd be moving into uncharted waters. This is intended to be a backup for the production machine (which is a 32 bit machine), so let's try to keep it like the live one.

Now I could just dive in and change the compiler and linker options for libjicmp, but is that the only C code that gets compiled? Having a look around OpenNMS's libraries reveals libjicmp.so (which I already know about), libjrrd.so, and iplike.so. Running "file" on them shows that they've all been built as 64 bit shared objects. Now, iplike.so is used by Postgres, not Java. My postgres installation was compiled on this platform and is 64 bit too, so I need to leave that alone. "All" I have to fix is libjicmp.so and libjrrd.so.
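If you want to check a batch of libraries in one go, the part of "file" that matters here boils down to reading the first few bytes of each one. The sketch below is not how "file" itself is implemented; it only looks at the ELF magic number and the class byte that follows it (1 for 32 bit, 2 for 64 bit, per the ELF specification). The function names are mine.

```python
ELF_MAGIC = b"\x7fELF"
# e_ident[EI_CLASS], the fifth byte of an ELF file: 1 = ELFCLASS32, 2 = ELFCLASS64
EI_CLASS = {1: "32-bit", 2: "64-bit"}


def elf_class(path: str) -> str:
    """Return '32-bit', '64-bit', 'unknown' or 'not ELF' for the file at path."""
    with open(path, "rb") as handle:
        ident = handle.read(5)
    if len(ident) < 5 or ident[:4] != ELF_MAGIC:
        return "not ELF"
    return EI_CLASS.get(ident[4], "unknown")


def report(paths):
    """Map each path to its ELF class, e.g. for libjicmp.so, libjrrd.so and iplike.so."""
    return {path: elf_class(path) for path in paths}
```

Running `report(["libjicmp.so", "libjrrd.so", "iplike.so"])` from the library directory gives a quick view of which objects would need rebuilding for a 32 bit JVM.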

This is beginning to look a bit more complex than I thought. I've got two tasks on my hands: rebuild libjicmp and rebuild libjrrd.

Rebuilding libjicmp for 32 bit

This should just be a matter of getting gcc to use the -m32 option during the compile and link steps for libjicmp. I can modify the jicmp.compile target in build.xml like this:


<compiler name="gcc">
  <compilerarg value="-D${build.platform.define}"/>
  <!-- added: compile as 32 bit -->
  <compilerarg value="-m32"/>
</compiler>
<!-- added: link as 32 bit -->
<linker name="gcc">
  <linkerarg value="-m32"/>
</linker>


I do so and rerun build.sh.

root@opennms opennms# file ./work/jicmp/libjicmp.so
./work/jicmp/libjicmp.so: ELF 32-bit LSB shared object, Intel 80386, version 1 (SYSV), not stripped

So far, so good.

Rebuilding libjrrd for 32 bit

Now for libjrrd. I can edit build.xml in the same way as for libjicmp, modifying the jrrd.compile target like this:


     <compiler name="gcc">
       <compilerarg value="-D${build.platform.define}"/>
       <!-- added: compile as 32 bit -->
       <compilerarg value="-m32"/>
     </compiler>
     <linker name="gcc">
       <!-- added: link as 32 bit -->
       <linkerarg value="-m32"/>
       <linkerarg value="${build.jrrd.linker.arg}"
                  if="build.jrrd.linker.arg"/>
     </linker>

Time for another build. This time I get the following:


jrrd.compile:

      cc 1 total files to be compiled.


      cc Starting link
      cc /usr/bin/ld: skipping incompatible /usr/local/rrdtool/lib/librrd.a when searching for -lrrd
      cc /usr/bin/ld: cannot find -lrrd
      cc collect2: ld returned 1 exit status

BUILD FAILED
/home/src/linux/OPENNMS_1_2_4_RELEASE/opennms/build.xml:1177: gcc failed with return code 1

Total time: 3 seconds


Ugh. It looks like I need a 32 bit rrdtool library as well. As I said, I'm a build-from-source kind of person, so time to go back to the rrdtool source.

Build 32 bit rrdtool

Running "configure --help" in the rrdtool source directory tells me that rrdtool will honour any compiler flags set using the CFLAGS environment variable. I need to set -m32 for gcc to compile for 32 bit. So I go ahead and do this:

root@opennms rrdtool-1.0.50# export CFLAGS=-m32
root@opennms rrdtool-1.0.50# sh ./configure --prefix=/usr/local/rrdtool-1.0.50-32 --enable-shared

Hmm, no complaints so far....

root@opennms rrdtool-1.0.50# make

This doesn't work too well. Make complains when trying to build the shared perl modules. Well, unsurprisingly my Perl is 64 bit too. I could spend a long time working on this, but actually, I don't particularly care about Perl support right now. So let's see if I've got an rrdtool library out of this. A quick find shows that I got a librrd.0 and a librrd.a, so it looks okay. A make install puts rrdtool into the directory specified by the --prefix option in the configure step. Back to libjrrd.

Back to libjrrd again

I need to point the OpenNMS compilation/link step to my new rrdtool installation. I also need OpenNMS to find the rrdtool binary, so that it can draw those graphs. Instead of hacking build.xml, I can just add a few lines to build.properties:

build.rrdtool.bin = /usr/local/rrdtool-1.0.50-32/bin/rrdtool
build.rrdtool.include.dir = /usr/local/rrdtool-1.0.50-32/include
build.rrdtool.lib.dir = /usr/local/rrdtool-1.0.50-32/lib

(I need to remember that build.rrdtool.bin will modify configuration files in $OPENNMS_HOME/etc, so if I want to reuse any existing configurations, I'll need to edit them by hand).

Now build again:

jrrd.compile:

      cc 1 total files to be compiled.


      cc Starting link


    copy Copying 1 file to /home/src/linux/OPENNMS_1_2_4_RELEASE/opennms/work/jrrd

Much better. Now I'm not sure what state I've got into with the number of builds I've been through, so I'd better reinstall from scratch:

  1. back up my $OPENNMS_HOME/etc and $OPENNMS_HOME/share
  2. rm -rf $OPENNMS_HOME
  3. build.sh clean
  4. build.sh install
  5. restore my $OPENNMS_HOME/etc and $OPENNMS_HOME/share.
  6. adjust the following config files to point to the 32 bit rrdtool:
    1. response-graph.properties
    2. response-adhoc-graph.properties
    3. snmp-graph.properties
    4. snmp-adhoc-graph.properties
  7. rerun the installer
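Step 6 is the sort of thing that is easy to get wrong by hand across four files, so here is a small sketch of how it could be scripted. It makes no assumption about which property holds the rrdtool path: it simply swaps one binary path for another wherever it appears and keeps a .bak copy of each file it changes. The old path in the usage note is a guess based on where my 64 bit rrdtool lived; substitute whatever your restored files actually contain. The function name is made up.

```python
from pathlib import Path

# The four files listed in step 6.
GRAPH_FILES = [
    "response-graph.properties",
    "response-adhoc-graph.properties",
    "snmp-graph.properties",
    "snmp-adhoc-graph.properties",
]


def repoint_rrdtool(etc_dir, old_bin, new_bin):
    """Replace old_bin with new_bin in each graph properties file.

    Returns the names of the files that were changed. Files that are
    missing, or that do not mention old_bin, are left alone.
    """
    changed = []
    for name in GRAPH_FILES:
        path = Path(etc_dir) / name
        if not path.is_file():
            continue
        text = path.read_text()
        if old_bin in text:
            # Keep the original alongside, in case the edit needs undoing.
            path.with_name(name + ".bak").write_text(text)
            path.write_text(text.replace(old_bin, new_bin))
            changed.append(name)
    return changed
```

Usage would be along the lines of `repoint_rrdtool("/opt/OpenNMS/etc", "/usr/local/rrdtool/bin/rrdtool", "/usr/local/rrdtool-1.0.50-32/bin/rrdtool")`, where both the etc directory and the old path are assumptions. Pass full binary paths rather than directory prefixes, otherwise a second run could match inside the path it has already rewritten.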

Finally, I've got a working OpenNMS installation.

Lessons (re)learned

This whole experience taught me a couple of valuable lessons and reminded me of something that I'd forgotten in my enthusiasm:

  1. Java good, JNI bad, especially when you decide to change architectures.
  2. Building from source is all well and good, but it can eat time. The nice OpenNMS people have created packages for us, so we should use them.
  3. Yet again, the leading edge is not a place where I want to spend too much time.

KSC reports to the rescue

30th August 2005

Like most of the community, we did not start with OpenNMS as our first systems management tool, nor is it the only one that we use now. I wrote a piece for Planet about the whole experience a few weeks ago. At the end of the piece, I noted that I was in the process of moving our existing SNMP performance monitoring from those legacy tools into OpenNMS. The truth of the matter is that I hadn't progressed far.

One of the biggest obstacles we perceived to removing some of the other tools has been the wide variety of different devices we have, and the pain we went through configuring [1] and others to collect from them.

We just took delivery of some spiffing new devices for web-based application acceleration and load balancing. In the interests of practicing what I preach, I decided to use OpenNMS alone to collect stats from them. The devices present multiple virtual IP addresses and load balance across multiple target web servers (much like Local Director, Big IP, F5 et al.). A quick browse of the supplied MIB files indicated that all the information we needed was in there. This included throughput and sessions on the VIP, and crucially, throughput and sessions on the target web servers across which the VIP load is balanced.

The usual procedure after this would have been a tedious mix of walking the MIB, typing in OIDs and cursing the vendor until I'd got something resembling a decent set of configuration files put together. Luckily OpenNMS has some tools available to help me. A quick search of the wiki and I turned up Tarus's SnmpInformantHowto. This doc is particularly useful if you're trying to configure SNMP data collection across tables indexed by something other than interface.

I'm not going to outline all the steps here, as Tarus's FAQ entry covers it much better than I could hope to. My strategy was:

Collect the data

This was pretty simple, and involved a couple of iterations of the following.....

  1. Decide which OIDs I wanted to collect from which tables.
  2. Put together a datacollection-config.xml to get one instance of it.
  3. Run xmllint to make sure that I hadn't broken the config file.
  4. Restart OpenNMS, then see that my data is being collected, RRDs generated and that they contain at least some data.
  5. Add the other instances that I wanted to datacollection-config.xml
  6. Repeat from step 3 until I had all the RRDs that I wanted.
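I used xmllint for step 3, but if it isn't to hand, a few lines of Python will catch the same class of mistake (an unclosed or mismatched tag) before a restart does. This is only a sketch using the standard library parser. It checks that the file is well-formed XML and nothing more; it does not know anything about what OpenNMS expects to find in datacollection-config.xml. The function name is mine.

```python
import xml.etree.ElementTree as ET


def check_well_formed(path):
    """Return None if the file parses as XML, otherwise a short error string."""
    try:
        ET.parse(path)
    except ET.ParseError as err:
        # The parser's message includes the line and column of the problem.
        return f"{path}: {err}"
    return None
```

Calling `check_well_formed("datacollection-config.xml")` after each edit and only restarting OpenNMS when it returns None saves a fair few wasted restarts.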

Graph the data

There's no way around it, snmp-graph.properties is one of the harder config files to wrangle. Check blog for musings about this file. In actual fact, the reality of editing it was not as bad as I'd feared. All it required was care, particularly in the placement of backslashes, and I was soon in possession of some comprehensive graphs for my devices. Problem was, I now had huge SNMP performance pages for these devices (I'd chosen to collect everything that looked of potential interest). I needed to see a few stats only, at least for the time being, but I still wanted to collect a more comprehensive set. Luckily KSC reports could help me here.

Configure the KSC reports

After looking at SNMP OIDs for a morning, this was a doddle. I pointed and clicked. Soon I had a view of crucial stats for all the devices in the clusters, showing both VIP and load balanced target IO stats. This was my "eureka" moment. I suddenly "got" the real point of the KSC reports. They enabled me to impose a different view on the hierarchical data that I was collecting and to present it in a functionally oriented, rather than a device oriented way. Where the SNMP performance pages showed me all data for a single node, the KSC reports could show me key performance data across multiple nodes. I like this, the Operations Team like this, and more worryingly for me, the Management Team like this too.

Now, of course, I have a real problem. I simply have no excuse not to start migrating the rest of our SNMP performance monitoring into OpenNMS.....

Another day, another device

19th August 2005

What, an End-User blog? Well, I thought that it would be a good idea to share some of my day-to-day experiences of running OpenNMS. I'll freely admit that it's not quite as glamorous as winning awards at Linux Expo, but maybe my random thoughts may prove of interest ....

It seems like every week brings a new device past my desk with a request to monitor its health/function before it gets deployed. Some devices play nicely and are only a moderate pain in the neck to instrument, some don't. When they don't play nicely, you can lose weeks trying to drag something useful out of them.

My top list of SNMP implementation hates are:

  1. Enormous, machine generated mib files that don't even pass smilint.
  2. Walking a useful looking mib to find out none of the objects actually contain any data.
  3. Getting wildly different data for the same metric from a device's SNMP agent, its web UI or console.
  4. Comments in mib files that bear no relation whatsoever to the objects that they purport to describe.
  5. A single generic SNMP trap that covers every event that might occur (from the trivial to the fatal) with no unambiguous identification of the event's severity.

I could go on, but it's Friday, and, allegedly, I have a life.

It seems that vendors are still shipping devices with buggy SNMP implementations. If any hardware vendors (and you know who you are) are reading this, please, please, please: SNMP support is not the last thing you add after you've finished writing the web GUI. Considering this as an important deliverable would considerably enhance my quality of life, not to mention that of all the other poor souls that get to deploy your products.

Thank you.