Difference between revisions of "Thresholding"

(Added information on supported threshold filter data types)

Revision as of 16:32, 10 May 2012

OpenNMS has a thresholding daemon called "threshd" that allows events to be created when a performance metric exceeds a certain value. This can be used to create events (and notifications) when, for example, the response time for a monitored service is too high, when bandwidth utilization exceeds a certain amount, or when available disk space falls under a threshold or grows or shrinks too quickly. Thresholds can be checked on most types of numerical data that OpenNMS collects and stores in RRD files (with the goal being to eventually support all collected data). In newer versions of OpenNMS (1.3.3 and later), additional threshold types are available, such as relative change (e.g. did the available disk space change by more than 5% since the last poll?), and it is relatively easy to write your own custom threshold evaluator.

Why use thresholding?

Thresholding is an important part of automating network management to alert operators (and planners) to warning and critical issues.

In the case of a critical issue, an online services company might want to ensure that their customers are getting fast response from their service, so they want to ensure that an individual HTTP request takes no longer than two seconds to complete. This could be done in the poller by setting the service timeout to two seconds; however, when the timeout is reached, the poller will simply give up. Administrators won't know if the service is down or if it is just taking a long time to complete, so further manual investigation would be needed, and the results of this investigation will most likely not be recorded in a standard way for later retrieval. A "high" threshold could be used along with a longer poller timeout to notify administrators when the service is responsive but is taking too long to complete the request.

Similarly, thresholds can be used as a warning mechanism to operators and planners that a service is performing within SLA parameters, but might be getting uncomfortably close to the SLA. Continuing with the above example, a threshold could be used to send an email to the administrators if the service poll takes 1.5 seconds to complete, giving administrators time to react to the issue before it becomes a problem. If the service normally completes in well under a second, having another threshold at 1.0 seconds that causes a notification to administrators and/or planners will allow them to identify any persistent performance problems and plan upgrades to the service to improve performance before users ever notice an issue.

These examples used response time data, but thresholding can also be done with any numerical data collected by collectd, including SNMP-collected node-level and interface-level data like CPU statistics, memory statistics, bandwidth utilization, etc., as well as JMX-collected data.

How does it work?

  • First, you have to collect the data for which you want to monitor thresholds.
  • Next, you have to define the variables/values for which you want thresholds to be monitored (in the file thresholds.xml). This is described in this document.
  • Then you have to define the devices / IP ranges / interfaces to which you want the thresholds to be applied (in the file threshd-configuration.xml).
  • Now you can create alarms or notifications based on threshold events.

See Creating_Threshold_Alarms (under Docu-overview/Thresholds) for a complete example of this.

What data can be thresholded?

Anything numeric, collected by any of the OpenNMS data collectors.

What kind of thresholds can be checked?

High threshold

A "high" threshold triggers when the value of the data source exceeds the threshold value ("value"), and is re-armed when it drops below the re-arm value ("rearm").

Low threshold

A "low" threshold triggers when the value of the data source drops below the threshold value ("value"), and is re-armed when it comes back up above the re-arm value ("rearm").
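
As a minimal sketch of these two types, the pair below uses only attributes described on this page. The data source name ns-disk-1-pct is borrowed from the standard threshold example on this page; freeMemPct is a hypothetical data source name, and all numeric values are illustrative assumptions.

```xml
<!-- High: triggers when the value exceeds 90, re-arms once it drops below 80 -->
<threshold type="high" ds-type="node" ds-name="ns-disk-1-pct"
           value="90" rearm="80" trigger="1"/>

<!-- Low: triggers when the value drops below 10, re-arms once it rises above 20.
     "freeMemPct" is a hypothetical data source name used for illustration. -->
<threshold type="low" ds-type="node" ds-name="freeMemPct"
           value="10" rearm="20" trigger="1"/>
```

Keeping a gap between "value" and "rearm" means a metric hovering around the threshold value does not trigger and re-arm on every sample.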

Relative change threshold

Note: this feature is available from OpenNMS 1.3.3.

This is used for detecting a change between two samples when the size of the change relative to one of the samples is important, as opposed to the absolute size of the change. For example, a 5% change in outbound traffic can be represented with a single threshold that will work on both a 1200bps dial-up link and a 10Gbps backbone link.

This only uses one configuration option, the 'value', which acts as a multiplier on the previous sample: a value less than 1.0 sets the relative level the sample must fall to (or below) for the threshold to trigger, and a value greater than 1.0 sets the relative level it must rise to (or above). For example, a value of "0.95" will trigger if the sample goes down at least 5% from the previous sample (so it will trigger at -5%, -10%, etc., but not -4% or +5%). A value of "1.05" will trigger if the sample goes up at least 5% from the previous sample (it will trigger at +5%, +10%, etc., but not +4% or -5%). Every time the relative threshold is exceeded, a threshold event is triggered. Note that there is currently no support for triggering only after a certain number of individual exceeded thresholds or disabling further events from being sent until the threshold "rearms."

Finally, if you want to support both a positive and a negative threshold on a data source, for example a change of more than +/-5%, two thresholds will need to be configured: one with "0.95" to catch <= -5%, and one with "1.05" to catch >= +5%.
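
A sketch of such a pair, assuming the ifInOctets interface data source used elsewhere on this page. The "rearm" and "trigger" attributes are not used by relativeChange thresholds; they are included only to mirror the configuration example on this page.

```xml
<!-- Triggers when ifInOctets drops by at least 5% from one sample to the next -->
<threshold type="relativeChange" ds-name="ifInOctets" ds-type="if"
           value="0.95" rearm="1.0" trigger="1"/>

<!-- Triggers when ifInOctets rises by at least 5% from one sample to the next -->
<threshold type="relativeChange" ds-name="ifInOctets" ds-type="if"
           value="1.05" rearm="1.0" trigger="1"/>
```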

Configuration example:

<!-- Note: the "rearm" and "trigger" values are not currently used. -->
<threshold type="relativeChange" ds-name="ifInOctets"  ds-type="if" value="1.5" rearm="1.0" trigger="1"/>

The above configuration example will trigger if ifInOctets increases by at least 50% from one sample to the next. Note that the 'rearm' and 'trigger' values are currently unused.

Negative values

Note that negative numbers are handled by taking the absolute value of the incoming value (as of 1.6, when released). Behaviour when the value crosses from positive to negative (or vice versa) is defined, but may not do what you expect. If you have such a scenario, please contact the developers with details and we'll look at handling it better.

Absolute change threshold

Note: this feature will be available in OpenNMS 1.6, or shortly after.

This is used for detecting a change between two samples when the absolute change between the samples is important. For example, on a fiber-optic link, a change in loss of anything greater than (say) 3 dB is a problem, no matter what the value was to start with, nor what the final value is. This is likely to be more useful on Gauge type data items than counters, but YMMV.

This only uses one configuration option, the 'value', which is set to the size of the change that will trigger the threshold. For example, a value of "3" will trigger if the sample goes up by at least 3 compared to the previous sample. A value of "-3" will trigger if the sample goes down by at least 3 compared to the previous sample. Every time the absolute threshold is exceeded, a threshold event is triggered. Note that there is currently no support for triggering only after a certain number of individual exceeded thresholds or disabling further events from being sent until the threshold "rearms." If you have such a scenario, please contact the developers with details of your situation.

Finally, if you want to support a positive and negative threshold on a data source, for example greater than +/-5, two thresholds will need to be configured.
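
A sketch of the +/-5 case, assuming a node-level data source named "loss" as in the configuration example on this page. As with relativeChange, the "rearm" and "trigger" attributes are present but unused.

```xml
<!-- Triggers when "loss" goes up by at least 5 compared to the previous sample -->
<threshold type="absoluteChange" ds-name="loss" ds-type="node"
           value="5" rearm="1.0" trigger="1"/>

<!-- Triggers when "loss" goes down by at least 5 compared to the previous sample -->
<threshold type="absoluteChange" ds-name="loss" ds-type="node"
           value="-5" rearm="1.0" trigger="1"/>
```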

Configuration example:

<!-- Note: the "rearm" and "trigger" values are not currently used. -->
<threshold type="absoluteChange" ds-name="loss"  ds-type="node" value="3" rearm="1.0" trigger="1"/>

The above configuration example will trigger if loss increases by at least 3 from one sample to the next. Note that the 'rearm' and 'trigger' values are currently unused.

Expression-based threshold

Note: this feature is available from OpenNMS 1.3.3.

This is very similar to a regular threshold, except it allows you to use a mathematical expression and multiple data sources. See below for a configuration example. Expressions use the JEP expression library, and are in traditional math format.

Note: Since 1.9.0 the JEP expression library, which is no longer open source, has been replaced by Apache's JEXL (see Bug 3413). JEXL has a slightly different syntax, especially in if() clauses; see http://commons.apache.org/jexl/ for details.

How is this configured?

Attributes of a threshold

  • type: A "high" threshold triggers when the value of the data source exceeds the "value", and is re-armed when it drops below the "rearm" value. Conversely, a "low" threshold triggers when the value of the data source drops below the "value", and is re-armed when it exceeds the "rearm" value. "relativeChange" is for thresholds that trigger when the data source value changes from one collection to the next by at least the ratio given in "value". "absoluteChange" is for thresholds that trigger when it changes by at least the amount given in "value".
  • expression: A mathematical expression involving datasource names which will be evaluated and compared to the threshold values. This is used in "expression" thresholding (supported from 1.3.3).
  • ds-name: The name of the variable to be monitored.
  • ds-type: Data source type. "node" for node-level data items, and "if" for interface-level items.
  • ds-label: Data source label. The name of the collected "string" type data item to use as a label when reporting this threshold. Note: this is a data item whose value is used as the label, not the label itself.
  • value: The value that must be exceeded (either above or below, depending on whether this is a high or low threshold) in order to trigger. In the case of relativeChange thresholds, this is the ratio between the new and the previous sample at which the threshold triggers (e.g. 'value="1.5"' means a 50% increase).
  • rearm: The value at which the threshold will reset itself. Not used for relativeChange or absoluteChange thresholds.
  • trigger: The number of times the threshold must be "exceeded" in a row before the threshold will be triggered. Not used for relativeChange or absoluteChange thresholds.
  • triggeredUEI: A custom UEI to send into the events system when this threshold is triggered. If left blank, it defaults to the standard thresholds UEIs.
  • rearmedUEI: A custom UEI to send into the events system when this threshold is re-armed. If left blank, it defaults to the standard thresholds UEIs.
  • filters: Filters are written as regular expression matches, without the surrounding /. They use Java regular expression syntax, which includes character classes such as \w and \d.
  • filterOperator: Defines the operator that combines the filters. Allowed values are "and" and "or". Default is "or".

Custom UEIs in thresholds

When setting up your thresholding configuration, it is much easier to create the notifications if each threshold has a custom UEI. If you do not use a custom UEI, events with the generic "high/low/relativeThresholdExceeded" UEI are created, and you have to do other filtering on the notifications in order to select a particular threshold event (e.g. using varbinds). As an alternative, for any given threshold you can create a custom UEI for both the triggering and rearming events. A notification can then be created for just that UEI. A typical UEI is of the format "uei.opennms.org/<category>/<name>". It is recommended that when creating custom UEIs for thresholds, you use a one-word version of your company name as the category to avoid name conflicts. The "name" portion is up to you.

Examples of fully fleshed-out thresholds

Standard threshold

<threshold ds-type="node" type="high" ds-name="ns-disk-1-pct" ds-label="ns-disk-1-name" 
           value="90" rearm="80" trigger="1"
           triggeredUEI="uei.opennms.org/company/highdisk-trigger"
           rearmedUEI="uei.opennms.org/company/highdisk-rearm" />

Latency thresholds

Note: latencies are in microseconds (for ICMP) or milliseconds (for other services)! If you configure thinking in seconds you will be flooded with events...

ICMP latency thresholds

This threshold will be exceeded when ICMP latency exceeds 200 milliseconds for four consecutive polls and rearmed when it falls back to 80 milliseconds.

   <group name="icmp-latency" rrdRepository="/opt/opennms/share/rrd/response/">
       <threshold type="high" ds-type="if" value="200000.0"
           rearm="80000.0" trigger="4" ds-name="icmp"/>
   </group>

HTTP latency thresholds

This threshold will be exceeded when HTTP latency exceeds one second, and rearmed when it falls back below 0.5 seconds:

   <group name="http-latency" rrdRepository="/opt/opennms/share/rrd/response/">
       <threshold type="high" ds-type="if" value="1000.0"
           rearm="500.0" trigger="1" ds-name="http"/>
   </group>

Expression-based threshold

Note: Applicable to 1.3.3+, 1.5.9x

Monitor data coming from the hrStorageIndex SNMP collection (mib2-host-resources-storage):

       <expression type="high" ds-type="hrStorageIndex" value="91.0"
           rearm="89.0" trigger="1" ds-label="hrStorageDescr"
           triggeredUEI="uei.opennms.org/company/disk-high"
           rearmedUEI="uei.opennms.org/company/disk-rearm" expression="(hrStorageUsed / hrStorageSize) * 100">
           <resource-filter field="hrStorageDescr">^/$</resource-filter>
           <resource-filter field="hrStorageDescr">^\w\:</resource-filter>
       </expression>


The resource-filter lines define regular expression matches that are used by OpenNMS to determine whether the expression should be tested against the current data input. Because Windows machines will return items like Virtual Memory as hrStorageDescrs, the ^\w\: filter is needed to pick up only entries that are drive-letter references like C:\ Label:.... The ^/$ filter is used to match only the file system on / for a *nix machine (modern Linux boxes will also return entries for /sys and Real Memory). Obviously, this filter set can be modified to suit your setup, and it should be possible to write more complex entries that would match everything beginning with a / but not containing /sys.
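
One way such a more complex filter might look, as an untested sketch: a negative lookahead rejects any description containing /sys, and the rest of the pattern requires a leading /. The surrounding expression and UEIs are copied from the example above; only the filter pattern is new.

```xml
<expression type="high" ds-type="hrStorageIndex" value="91.0"
    rearm="89.0" trigger="1" ds-label="hrStorageDescr"
    triggeredUEI="uei.opennms.org/company/disk-high"
    rearmedUEI="uei.opennms.org/company/disk-rearm" expression="(hrStorageUsed / hrStorageSize) * 100">
    <!-- Matches "/", "/var", "/home"; rejects "/sys", "/sys/kernel", "Real Memory" -->
    <resource-filter field="hrStorageDescr">^(?!.*/sys)/.*$</resource-filter>
</expression>
```

Note that this pattern also rejects any path that merely contains the string /sys, such as /system, so it may need tightening for your file system layout.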

Another example, on 1.8.11, with a regexp negation. The servers I monitor have varied storage configurations, so in my snmpd.conf I use the directive includeAllDisks. This allows all disks to be polled via SNMP, but it also means I collect data and have alerts on /proc, /sys and so forth. So I have the following in thresholds.xml, which matches all disks except those containing the strings in the resource-filter:

       <threshold type="high" ds-type="dskIndex" value="90.0"
           rearm="75.0" trigger="2" ds-label="ns-dskPath"
           filterOperator="or" ds-name="ns-dskPercent">
           <resource-filter field="ns-dskPath">^(?!.*(proc|sys|pts|binfmt|rpc_pipefs)).*$</resource-filter>
       </threshold>

Another Expression-based threshold

Note: Applicable to 1.3.3+, 1.5.9x

Trigger a custom UEI when a Linux host uses more than 1.5GB of swap space (the values are in kilobytes):

   <expression type="high" ds-type="node" value="1500000" rearm="500000"
       trigger="1" ds-label="memUsedSwap" expression="memTotalSwap - memAvailSwap"
       triggeredUEI="uei.opennms.org/custom/highSwapThresholdExceeded"
       rearmedUEI="uei.opennms.org/custom/highSwapThresholdRearmed">
   </expression>

Changing thresholds

Once you have configured threshold groups in threshd-configuration.xml and thresholds.xml, you can use the GUI (starting with version 1.6.x) and go to "Admin / Manage thresholds" to modify thresholds or add new thresholds to existing groups.

Configuring notifications for thresholds

After setting up a threshold, you'll likely want to configure event notifications. There are four configured in the default notifications.xml that you can turn on. When customizing the messages associated with threshold notifications, you can use these extra variables:

  • %parm[ds]%: The SNMP variable name specified in "ds-name".
  • %parm[label]%: The current string value of the SNMP variable name specified in "ds-label".
  • %parm[value]%: The current value of the datasource.
  • %parm[threshold]%: The "value" set in the threshold itself.
  • %parm[rearm]%: The "rearm" set in the threshold itself.
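
As a sketch of how these variables might appear in a notification, the fragment below is modelled on the general shape of entries in notifications.xml. Treat the element layout, the destination path name "Email-Admin" and the %nodelabel% token as assumptions to check against the notifications.xml shipped with your version; the UEI is the custom one from the standard threshold example on this page.

```xml
<!-- Sketch only: verify element names against your own notifications.xml -->
<notification name="Company high disk threshold" status="on">
    <uei>uei.opennms.org/company/highdisk-trigger</uei>
    <rule>IPADDR IPLIKE *.*.*.*</rule>
    <!-- "Email-Admin" is an assumed destination path name -->
    <destinationPath>Email-Admin</destinationPath>
    <text-message>High threshold exceeded on %nodelabel%: %parm[ds]% (%parm[label]%) is at %parm[value]%, threshold is %parm[threshold]%, re-arms at %parm[rearm]%.</text-message>
    <subject>High threshold for %parm[ds]% on %nodelabel%</subject>
</notification>
```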

Notes on using <resource-filter> in thresholds

Supported resource filter field values

Only single metrics are supported as the filter field data object, not arithmetic expressions.

Supported resource filter field data types

For generic resources (see below for interface resources), the resource-filter field is any mibObj defined in datacollection-config.xml.

Before OpenNMS 1.10.2, this value could only be a string. As of OpenNMS 1.10.2, the field can be any metric defined in datacollection-config.xml including numeric values or strings.

Filtering using interface data sources

For interface resources (ds-type="if"), resource-filters are based on snmpinterface table columns. If you want to do something like:

 <resource-filter field="ifDescr">Gigabit.+</resource-filter>

It needs to be done like this instead:

 <resource-filter field="snmpifdescr">Gigabit.+</resource-filter>
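
Placed in context, a complete interface threshold using this filter might look like the sketch below. The ifInOctets data source is used elsewhere on this page; the value, rearm and trigger settings are illustrative assumptions, not recommendations.

```xml
<!-- Sketch: numeric values are illustrative assumptions -->
<threshold type="high" ds-type="if" ds-name="ifInOctets"
           value="90000000" rearm="75000000" trigger="2">
    <!-- For ds-type="if" the field is an snmpinterface column name, not a mibObj alias -->
    <resource-filter field="snmpifdescr">Gigabit.+</resource-filter>
</threshold>
```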

The "range" parameter

In OpenNMS 1.2 and 1.3 (up to 1.3.9, but not from 1.3.10 onwards) collection and thresholding are separate and are not synchronised. At some point the collector will write an RRD and at some point threshd may read that RRD and compare it to a threshold value. The collector and the threshold package generally operate on the same interval (by default), but there's no guarantee that thresholding will take place immediately after collection (they're _asynchronous_).

By default threshd will compare the threshold trigger value to the last possible PDP (primary data point) retrieved from the RRD referenced in the threshold. That's not the value from the last PDP that was updated by collectd, but the value of the last possible PDP before now. There is a potential problem with this. If the RRD has been updated long enough before threshd runs, then the last possible PDP may not actually have been updated (remember RRDs will _interpolate_ update values in order to fit them into PDPs with fixed time intervals). If this happens, then the last value will be NaN (not a number) and thresholding will write something like this into threshd.log:

2006-02-17 16:23:09,914 DEBUG [ThreshdScheduler-5 Pool-fiber1]JRobinRrdStrategy: rrd last updateime: 1140193180 collect time 1140193200
2006-02-17 16:23:09,915 DEBUG [ThreshdScheduler-5 Pool-fiber1] SnmpThresholder: checkNodeDir: got dsValue of: NaN

If this happens, then your thresholds will not trigger.

The cure for this is the range parameter. This is a little hack that works for 1.2.8 and above in the 1.2 line and 1.3.2 and above in the 1.3 line. The range parameter is applied to each service within a package, as shown in the snippet from threshd-configuration.xml below.

   <package name="net-snmp-disk">
       <filter>IPADDR IPLIKE *.*.*.*</filter>
       <include-url>file:/opt/OpenNMS/etc/net-snmp-disk</include-url>
       <service name="SNMP" interval="300000" user-defined="false" status="on">
           <parameter key="thresholding-group" value="net-snmp-disk"/>
           <parameter key="range" value="300000"/>
       </service>
   </package>

Adding the range parameter tells threshd that, should it get a NaN value when asking for the last PDP from an RRD, it should walk back through the RRD until it finds a value that's not NaN. The value of the range parameter determines the number of milliseconds back in time threshd will go to try to get a non-NaN value. This is the period between the PDP time and the current time, and represents the maximum age of a PDP before it is considered "stale" and unfit for thresholding. In this case it's 300000 ms (or five minutes). In general, with the default collection and thresholding interval, a range value equal to the interval value should be adequate, but in some cases a value of 2x the collection interval (600000 in the above example) seems to be necessary.
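
For the 2x case, the service definition from the package above would change as sketched here; only the range value differs from the original snippet.

```xml
<service name="SNMP" interval="300000" user-defined="false" status="on">
    <parameter key="thresholding-group" value="net-snmp-disk"/>
    <!-- 2 x the 300000 ms interval: accept PDPs up to ten minutes old -->
    <parameter key="range" value="600000"/>
</service>
```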

This behaviour was identified (and corrected by introducing the range parameter) in bug 1432. The range parameter should definitely be in all of your threshd-configuration.xml package definitions if you expect thresholding to work reliably.

This has been rectified in a more permanent way in the current version in trunk (what will be 1.3.10 when it's released), where thresholding is moved into collectd, and is done immediately as data is collected.

Merge into collectd

As of revision 8097, and as part of the OpenNMS 1.3.10 release, thresholding has been merged into collectd. This means that thresholding occurs on data as it is collected, not read from disk at some point later. This seems so far to have a reasonably significant positive effect on load (reducing it), and also eliminates the need for the whole "range" parameter (see previous section).

Notice: This applies only to data collected via collectd. Other data not collected via collectd, for example ICMP response times, are still written directly to the RRD files and thresholding on them has to be done by reading the RRD files as before.

Thresholding merge with collectd for versions between 1.3.10 and 1.5.90

If you are on OpenNMS between 1.3.10 and 1.5.90 and want to use this new code, note that:

  1. You should disable the SnmpThresholder configuration in threshd-configuration.xml for the SNMP service (remove or comment out the line starting with <thresholder service="SNMP" ...). This way you can still use the Threshd service for response time data via the LatencyThresholder class.
  2. You need to add a thresholding-group parameter (not an attribute!) to each <service> in collectd-configuration.xml on which you want to do thresholding. This corresponds directly with the old thresholding-group parameter in threshd-configuration.xml:
<service name="SNMP" interval="30000" user-defined="false" status="on">
  <parameter key="collection" value="default"/>
  <parameter key="thresholding-group" value="default-snmp"/>
</service>

If you do not add the thresholding-group parameter, the new code will not be used at all (it'll never find a set of thresholds to use for a given collection). You could, in theory, leave threshd turned on, but that would be pointless unless you are still doing latency thresholding.


Thresholding merge with collectd for versions 1.5.91 and later

As of revision 8557, and as part of the OpenNMS 1.5.91 release, a new feature was added to this implementation.

Now you can use the filters defined in threshd-configuration.xml without enabling Threshd. The first implementation enabled a thresholding-group for every SNMP service defined in a collectd package, so you had to create different collectd packages to apply different thresholding groups.

With the new version you can keep the same collectd package definitions and use the thresholding-group and filters defined in threshd-configuration.xml.

The difference between this and the old version is the way you enable thresholds in collectd. By default this is disabled. You can enable it by adding the thresholding-enabled parameter to your collectd-configuration.xml like this:

<service name="SNMP" interval="300000" user-defined="false" status="on">
   <parameter key="collection" value="default"/>
   <parameter key="thresholding-enabled" value="true"/>
</service>

Note that thresholding-group is not used here. The filters and thresholding-group definitions are still to be configured in threshd-configuration.xml while the thresholds have to be configured in thresholds.xml.
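
The matching side in threshd-configuration.xml might then look like the sketch below. The package name "example-snmp" is hypothetical, the group name "default-snmp" is borrowed from the earlier collectd example on this page, and the package layout follows the "range" example above; whether your version requires additional include elements in the package is something to verify against your shipped configuration.

```xml
<!-- Sketch: "example-snmp" is a hypothetical package name -->
<package name="example-snmp">
    <!-- The filter selects the interfaces the thresholding group applies to -->
    <filter>IPADDR IPLIKE *.*.*.*</filter>
    <service name="SNMP" interval="300000" user-defined="false" status="on">
        <!-- Names a group of thresholds defined in thresholds.xml -->
        <parameter key="thresholding-group" value="default-snmp"/>
    </service>
</package>
```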

You should disable the SnmpThresholder configuration in threshd-configuration.xml for the SNMP service - remove or comment out the line

 <thresholder service="SNMP" class-name="org.opennms.netmgt.threshd.SnmpThresholder"/>

This way you can still use the Threshd service for response time data via the LatencyThresholder class.

Added to pollerd

As of revision 14075, and as part of the OpenNMS 1.7.6 release, thresholding of latency data has been added into pollerd. This means that latency thresholding for polled services like ICMP occurs on data as it is polled, not read from disk at some point later. This seems so far to have a reasonably significant positive effect on load (reducing it), and also eliminates the need for the whole "range" parameter (see previous section). This is the preferred method of thresholding for most non-SNMP services.

If you are on OpenNMS 1.7.6 or newer, you should use the method described further down on this page.

If you want to use this new code, note that:

  • You should disable the LatencyThresholder configuration in threshd-configuration.xml for the service you are interested in. For example, if you are interested in the ICMP service, remove or comment out the line:
 <thresholder service="ICMP" class-name="org.opennms.netmgt.threshd.LatencyThresholder"/>

This way you can still use the Threshd service for collected SNMP data via the SnmpThresholder class, or other latency services via the LatencyThresholder class.

  • You need to add a thresholding-enabled parameter (not an attribute!) to the service in poller-configuration.xml on which you want to do thresholding. For example, if you are interested in thresholding on the ICMP service:
<service name="ICMP" interval="600000" user-defined="false" status="on">
   ...
   <parameter key="thresholding-enabled" value="true"/>
</service>

The filters and thresholding-group definitions are still to be configured in threshd-configuration.xml while the thresholds have to be configured in thresholds.xml.


Version History/Availability