User talk:Mhuot
From OpenNMS

My OpenNMS installation

I work for a hospital in the Midwest of the U.S. We have been using OpenNMS, and paying for OpenNMS support, since the fall of 2001. We have about 4000 users and about 400 servers. The servers are a mix of Windows and Unix. Our network consists of about 7000 separate SNMP interfaces on about 200 Ethernet switches. We have three primary locations and about 15 smaller remote sites. We install SNMP agents on all of our servers and enable SNMP on all network devices. We have over 30 gigabytes of jRobin/RRDTool data stored on the system. This represents 366 days of data for all of our devices.

We use OpenNMS to collect a minimum of six SNMP statistics on each of these interfaces: ifInOctets, ifOutOctets, ifInErrors, ifOutErrors, ifInDiscards, and ifOutDiscards. In total, 128,000 different statistics are collected every 5 minutes. This number includes the minimum six per interface, plus multiple statistics collected at the host/node level. We have almost 1000 hosts defined in OpenNMS. These hosts have multiple SNMP statistics, for example load average, CPU utilization, memory utilization, number of users connected, and hard drive usage. All of the data collected is stored in jRobin/RRDTool files. There is a set of standard "reports" to view this data via the web interface. Additional reports can be created using a text file. Our experience has shown that data collection is the most resource-intensive piece of OpenNMS because it generates a high amount of I/O. The OpenNMS project has recently addressed this issue, but we have not yet implemented the change.

In addition to data collection, OpenNMS also polls over 30 services by default. Additional TCP services can be easily configured via the web interface. These service polls vary in complexity depending on the service. Some are simple checks to see if the port is responding. We use this feature for a very limited set of services: ICMP, HTTP, FTP, DNS, NTP, and RADIUS. OpenNMS discovers these services and begins polling them automatically. On occasion we have added services, such as Oracle, for polling. We have not found much need in our environment to expand this. The poller also tracks response times, and the information collected is viewable via jRobin/RRDTool graphs. As with the SNMP data collection, standard reports are available to view this data.

Using the polling response time data and the SNMP data, we can set up threshold alarms. These can be either high or low thresholds. Each has a separate "rearm" value, as well as a requirement to see the violation one or multiple times prior to alarming. Currently we use this for temperature alarms and free disk space. We plan to extend this to check Ethernet interface errors to better isolate user problems. This may require a lot of research after implementation, because I believe a large number of misconfigured workstations are producing these errors. Eventually we would also like to use the response time data for thresholds. The OpenNMS project has talked about using some kind of statistical analysis approach for this. Until that is available, I think this would be difficult to maintain in our environment.
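To make the rearm and trigger-count idea concrete, here is a minimal sketch of a high threshold. It is not OpenNMS code. The class name, the field names, and the choice of inclusive comparisons (>= for the threshold value, <= for the rearm value) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class HighThreshold:
    """Hypothetical high threshold with a rearm value and a trigger count."""
    value: float   # a sample at or above this counts as a violation
    rearm: float   # alarming is re-enabled once a sample is at or below this
    trigger: int   # consecutive violations required before alarming
    _hits: int = 0
    _armed: bool = True

    def evaluate(self, sample: float) -> str:
        """Return 'triggered', 'rearmed', or 'none' for one sample."""
        if self._armed:
            if sample >= self.value:
                self._hits += 1
                if self._hits >= self.trigger:
                    # Enough consecutive violations: alarm once, then disarm.
                    self._armed = False
                    self._hits = 0
                    return "triggered"
            else:
                # A good sample breaks the run of consecutive violations.
                self._hits = 0
            return "none"
        if sample <= self.rearm:
            # Back below the rearm value: allow the next alarm.
            self._armed = True
            return "rearmed"
        return "none"
```

With value=90, rearm=75, and trigger=2, a single spike to 95 does nothing, two violations in a row raise one alarm, and no further alarm is raised until a sample drops to 75 or below. The gap between the two values is what keeps a reading that hovers around 90 from alarming over and over.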

One of the other large pieces of OpenNMS that we use is the notification/events system. All of our SNMP traps are sent to OpenNMS for processing. We have created notifications based on the traps most important to us. The notifications include some information on how to troubleshoot the problem associated with the trap. This is all customizable through XML files and, to a limited extent, via the web interface. Our installation has processed over 16 million events. The number of events has grown greatly in the last 12-18 months, and OpenNMS has shown no signs of being strained. Recently a syslogd has been added to OpenNMS. It is still in the early stages of development, but it looks very promising. At some point I will add this to our installation.

The events system feeds the important events into a notification process. The notification process can send a page, an e-mail, a Jabber message, or use other methods of notification. Notifications use the concept of a notification path. A path consists of an initial notification specifying who to contact, a single person or a group, and how to contact them. Escalations can be added at time intervals. We generally specify the operations group on the first round. 15 minutes later we contact the on-call person in addition to operations, and after 30 minutes we contact the manager for operations along with the others. This continues every 15 minutes for an hour and a half. I mentioned above the ability to contact the on-call person. OpenNMS has a concept of a role. A role can have a single user scheduled to be contacted at scheduled times. This can be set for different groups, and it defaults to contacting the manager for a group if no one is scheduled. The on-call scheduling was a key feature for my company. It allowed us to move to OpenNMS as the primary source for most notifications.
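The escalation path and the role fallback described above can be sketched in a few lines. This is only an illustration, not how OpenNMS stores or evaluates paths. The function names and target names are hypothetical, and treating the 90-minute mark as the last round is an assumption.

```python
def escalation_schedule(total_minutes=90, step=15):
    """Return (minute, targets) pairs for the path described above."""
    schedule = []
    for minute in range(0, total_minutes + 1, step):
        targets = ["operations"]                  # first round: operations group
        if minute >= 15:
            targets.append("on-call")             # 15 minutes in: add the on-call role
        if minute >= 30:
            targets.append("operations-manager")  # 30 minutes in: add the manager
        schedule.append((minute, targets))
    return schedule


def resolve_role(role_schedule, hour, default_user):
    """Return the user scheduled for `hour`, or the group's default contact.

    role_schedule is a list of (start_hour, end_hour, user) tuples, where
    end_hour is exclusive. This layout is an assumption for the sketch.
    """
    for start, end, user in role_schedule:
        if start <= hour < end:
            return user
    return default_user
```

For example, resolve_role([(8, 17, "alice")], 20, "manager") falls through to "manager" because nobody is scheduled at hour 20, which mirrors the default-to-manager behavior of a role.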

Data collection via JMX has recently been added. This is a very powerful feature that I have not implemented. My understanding is that Java applications can expose information via JMX. OpenNMS can then retrieve this information, much as it does for SNMP, and store it in an RRD file. Our HIS, Cerner, is beginning to expose more information about the application using JMX. I see this as critical to us in the near future.

OpenNMS has also added the ability to poll NSClients and NRPE. This will be extended to include data collection as well. This could be very useful to users of Nagios. I am not sure if this is needed by us. It would seem that we get enough data from our SNMP agents. The NSClient would possibly allow access to WMI on Windows boxes.

OpenNMS has been very reliable for us. Most of its outages have been caused by user error or hardware failure. OpenNMS is one of the critical applications that we use. It has been able to alert us to almost all network problems we have had in the last 5 years. Its scalability has covered our needs and, based on discussions with others, it scales well beyond any needs we may ever have.


Interesting Graphs

Thought I might start by just showing some interesting looking graphs from my OpenNMS installation.

CPU statistics of a Linux box
CPU statistics of a Linux box (jRobin)
1 year of one gigabit Ethernet link
The other end of the same link
SMTP response times
Main page of production system

Helpful SQL

Find all current outages -

select * from outages where ifregainedservice is null;

Find all current outages with the nodelabel and service name -

select outages.outageid, node.nodelabel, outages.ipaddr, service.servicename, outages.iflostservice, outages.svclosteventid from outages, node, service where node.nodeid=outages.nodeid and outages.serviceid=service.serviceid and outages.ifregainedservice is null;

Manually acknowledge a notification where the notification ID is 85879 -

update notifications set answeredby='admin',respondtime =now() where notifyid=85879;

Find notifications from the last 24 hours -

select * from notifications where pagetime > now() - interval '24 hours';

Working with jrobin files

echo -e dump multiicmp.jrb\\n . | java -jar /home/mhuot/JRobinLite-1.5.2/lib/jrobin-1.5.2.jar 
  • Run the jrobin inspector. This is a GUI that lets you look at the data stored in jrobin files. You may need to export your display.
java -jar inspector-1.5.4.jar
  • Open a jrobin file
Select file
  • The data source (DS) structure should now be in the "RRD File Inspector" window
View Structure - 1
View Structure - 2
  • You can then select the individual RRAs, and the "Archive Data" tab will show the actual data stored in the file.
View Data

Thoughts on daemons in OpenNMS

OpenNMS daemons have two primary types of behavior, active and passive. Most daemons are ruled by one or the other. Some daemons, like importer, can be configured to act in either a passive or an active manner. Understanding the behaviors helps us understand how to refactor the daemons to reduce duplication.

An example of active behavior is pollerd. It works on a schedule. Events may alter the schedule, but primarily it works based on an interval specified in its configuration. These kinds of daemons create events that may be acted upon.

Passive behavior is the primary behavior of notifd. Passive daemons wait for an event to arrive and then take action based on that event. They may take that event and transform it into another format, like an e-mail or pager message. They can also create other events, as the event translator does.
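A rough sketch of the two behaviors, using a toy in-process event bus, may help show where the shared code lives. None of these classes exist in OpenNMS. The names EventBus, ActivePoller, and PassiveNotifier, and the event name used here, are purely illustrative.

```python
from collections import defaultdict


class EventBus:
    """Toy event bus: daemons publish named events and subscribe to them."""

    def __init__(self):
        self._listeners = defaultdict(list)

    def subscribe(self, name, handler):
        self._listeners[name].append(handler)

    def publish(self, name, **params):
        for handler in list(self._listeners[name]):
            handler(name, params)


class ActivePoller:
    """Active behavior: runs a check on a fixed interval and creates events."""

    def __init__(self, bus, interval, check):
        self.bus = bus
        self.interval = interval
        self.check = check      # callable returning True when the service is up
        self.last_run = None

    def tick(self, now):
        """Run the check if the interval has elapsed since the last run."""
        if self.last_run is None or now - self.last_run >= self.interval:
            self.last_run = now
            if not self.check():
                self.bus.publish("nodeLostService", time=now)


class PassiveNotifier:
    """Passive behavior: waits for events and transforms them into messages."""

    def __init__(self, bus):
        self.sent = []
        bus.subscribe("nodeLostService", self.on_event)

    def on_event(self, name, params):
        self.sent.append(f"PAGE: {name} at t={params['time']}")
```

In this sketch the poller owns its own clock and only talks to the bus, while the notifier has no schedule at all. A daemon that can act either way would combine a tick method with an event handler.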

Restful thoughts

In REST, all URLs represent a resource. A resource is -

  • Actual object, file, html static page
  • List or details for a node, outage, service, IP interface
  • Results of an algorithm

Use the four operations of HTTP to manipulate and view resources -

  1. GET, show a resource as XML, HTML
  2. POST, create a new entity from XML, HTML form
  3. PUT, modify an entity from XML, HTML form
  4. DELETE, delete an entity
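The mapping of the four operations onto a resource can be sketched with an in-memory dispatcher. This is a thought experiment only. The /nodes path, the NodeCollection class, and the status codes chosen are assumptions, not an existing OpenNMS interface.

```python
class NodeCollection:
    """In-memory stand-in for a /nodes resource (hypothetical)."""

    def __init__(self):
        self._nodes = {}
        self._next_id = 1

    def handle(self, method, path, body=None):
        """Dispatch one request and return (status_code, payload)."""
        parts = [p for p in path.split("/") if p]
        if not parts or parts[0] != "nodes" or len(parts) > 2:
            return 404, None
        if len(parts) == 1:
            if method == "GET":       # list the collection
                return 200, [self._nodes[k] for k in sorted(self._nodes)]
            if method == "POST":      # create a new entity
                node = dict(body or {}, id=self._next_id)
                self._nodes[node["id"]] = node
                self._next_id += 1
                return 201, node
            return 405, None
        if not parts[1].isdigit():
            return 404, None
        node_id = int(parts[1])
        if node_id not in self._nodes:
            return 404, None
        if method == "GET":           # show one entity
            return 200, self._nodes[node_id]
        if method == "PUT":           # modify an existing entity
            self._nodes[node_id] = dict(body or {}, id=node_id)
            return 200, self._nodes[node_id]
        if method == "DELETE":        # remove the entity
            del self._nodes[node_id]
            return 204, None
        return 405, None
```

A POST to /nodes creates node 1, a GET on /nodes/1 shows it, a PUT on /nodes/1 replaces it, and a DELETE on /nodes/1 removes it, after which a GET on the same URL returns 404. The URL names the resource and the HTTP method names the action, so no verbs are needed in the path.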

Start from the top

Graphing: I have a node and I want to show its graphs. I need to know the URL for the node. Then I have to know what graphs to show. What is useful? We don't persist attributes for a node, and we don't persist the graph definitions well. If we ignore the persistence level, we can start from the UI and decide how we want it to work from the user's perspective. Not just in the display, but also in how easy it is to access. Imagine being able to post graphs to some summary page.

JAX-RS just uses annotations on an object to create the REST URLs and declare parameters.