Creating Threshold Alarms

From OpenNMS

Overview

Thresholds are a useful part of the collectd functionality in OpenNMS. They let you define triggers against any data retrieved by the SNMP collector and, from those triggers, generate Events, Notifications and Alarms. Bear in mind that in OpenNMS 1.5.9x, thresholds fire only when the trigger or re-arm boundary is crossed; they do not fire every time the collected SNMP data matches the trigger or re-arm condition.
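
The crossing behaviour can be pictured as a small state machine. The sketch below is a toy model, not OpenNMS code: the class name HighThreshold, the evaluate method and the exact comparison operators (>= for the trigger, <= for the re-arm) are assumptions made for illustration, using the 91/89 values from the sample threshold later on this page.

```python
from dataclasses import dataclass


@dataclass
class HighThreshold:
    """Toy model of a 'high' threshold (hypothetical, not OpenNMS code)."""
    value: float          # trigger boundary
    rearm: float          # re-arm boundary
    trigger: int = 1      # consecutive samples needed before firing
    armed: bool = True
    exceeded_count: int = 0

    def evaluate(self, sample):
        """Return 'triggered', 'rearmed' or None for one collected sample."""
        if self.armed:
            if sample >= self.value:
                self.exceeded_count += 1
                if self.exceeded_count >= self.trigger:
                    # Boundary crossed: fire once, then stay quiet until re-armed.
                    self.armed = False
                    self.exceeded_count = 0
                    return "triggered"
            else:
                self.exceeded_count = 0
        elif sample <= self.rearm:
            # Fell back below the re-arm boundary: fire the re-arm event once.
            self.armed = True
            return "rearmed"
        return None


def events_for(samples, threshold):
    """Evaluate a series of samples and list the event (or None) for each."""
    return [threshold.evaluate(s) for s in samples]


disk = HighThreshold(value=91.0, rearm=89.0, trigger=1)
events = events_for([85, 92, 95, 93, 90, 88, 92], disk)
```

In this model the samples 95 and 93 produce no event even though they are above the trigger value, and 90 produces none because it has not yet dropped to the re-arm value.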

Concept

The general concept of threshold alarms is to monitor Performance and Response Time data, generate an Event when a trigger or re-arm value is met, and subsequently generate a Notification and an Alarm from that Event. To do this, several pieces of the puzzle are needed.

  • A threshold, either Basic or Expression-based.
  • A modified Event based on the Threshold trigger and re-arm UEI
  • A Destination Path for the Notification
  • A Notification attached to the Event

An optional part of this process is an Automation that reads the Alarms table for unacknowledged alarms matching the Event and re-raises the Event, which generates a new Notification and increments the Alarm counter.

Implementation

This sample implementation assumes that you are using the MIB2 Host Storage collection, not the NetSNMP Storage collection.

Collection Daemon

File: $OPENNMS_HOME/etc/collectd-configuration.xml

This method uses the 'new' integrated-in-collectd method. Ensure your collection package has <parameter key="thresholding-enabled" value="true"/> included in the configuration.

<collectd-configuration threads="100">
       <package name="example1">
               <filter>IPADDR != '0.0.0.0'</filter>
               <include-range begin="1.1.1.1" end="254.254.254.254"/>
               <service name="SNMP" interval="300000" user-defined="false" status="on">
                       <parameter key="collection" value="default"/>
                       <parameter key="thresholding-enabled" value="true"/>
                       <parameter key="retry" value="2"/>
                       <parameter key="timeout" value="1000"/>
               </service>
       </package>
       <collector service="SNMP" class-name="org.opennms.netmgt.collectd.SnmpCollector"/>
</collectd-configuration>

Threshold

File: $OPENNMS_HOME/etc/thresholds.xml

Next, a threshold is needed. In this case, the threshold is set against the disk monitoring ability using hrStorageIndex.

<expression type="high" ds-type="hrStorageIndex" value="91.0"
    rearm="89.0" trigger="1" ds-label="hrStorageDescr"
    triggeredUEI="uei.opennms.org/Company/disk-high"
    rearmedUEI="uei.opennms.org/Company/disk-rearm" expression="(hrStorageUsed / hrStorageSize) * 100">
    <resource-filter field="hrStorageDescr">^/$</resource-filter>
    <resource-filter field="hrStorageDescr">^\w\:</resource-filter>
</expression>

This threshold,

  • calculates the amount of used disk as a percentage value,
  • checks the hrStorageIndex collection,
  • only requires one occurrence of the data collection to be above the trigger limit,
  • creates two custom UEIs (rather than using the defaults),
  • and has two filter regexes applied to stop it matching descriptions like 'Real Memory'.

Where possible, create this threshold using the web interface, as this propagates the threshold change into the collector daemon immediately and creates the related threshold events. Adjust the filters to suit your environment: this sample only matches entries like D:\ for Windows systems and / for *nix-like systems (bounding the first filter with ^ and $ means that entries like /proc get ignored).
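
A quick way to sanity-check filters before putting them in thresholds.xml is to run them over a few sample descriptions. The sketch below uses Python's re module rather than the Java regex engine that OpenNMS uses, and assumes that a resource passes when any one filter finds a match (search semantics, filters OR-ed together). Both points are assumptions, and the sample descriptions are illustrative.

```python
import re

# The two resource-filter patterns from the sample threshold.
FILTERS = [r"^/$", r"^\w\:"]


def passes(descr, filters=FILTERS):
    """True if any filter pattern is found in the description (assumed OR logic)."""
    return any(re.search(pattern, descr) for pattern in filters)


samples = ["/", "/proc", "D:\\ Label fred", "Real Memory", "/usr/local/tomcat/share"]
matched = [s for s in samples if passes(s)]
```

With these filters only / and the D:\ entry pass, so a mount point such as /usr/local/tomcat/share would need a filter of its own.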

An alternative, untested-in-OpenNMS regular expression for the resource-filter is

^[/|\w\:][^proc|dev|boot|sys|var/lib/nfs/rpc_pipefs](.*)|/$

Tested against

/proc
/dev
/boot
/sys_fs
/var/lib/nfs/rpc_pipefs
Real memory
/usr/local/tomcat/share
/
d:\ Label fred

only the last 3 lines matched (in Visual Regexp).
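
The same test can be repeated outside Visual Regexp. The sketch below runs the alternative expression over the nine test lines with Python's re module, which is an assumption (OpenNMS itself uses Java regular expressions, and the helper name matching is made up). It also shows a side effect worth knowing: the bracketed part is a character class, so it rejects paths by their second character rather than by whole directory name.

```python
import re

# The alternative resource-filter, copied verbatim from the text above.
ALTERNATIVE = r"^[/|\w\:][^proc|dev|boot|sys|var/lib/nfs/rpc_pipefs](.*)|/$"

TEST_LINES = [
    "/proc",
    "/dev",
    "/boot",
    "/sys_fs",
    "/var/lib/nfs/rpc_pipefs",
    "Real memory",
    "/usr/local/tomcat/share",
    "/",
    "d:\\ Label fred",
]


def matching(lines, pattern=ALTERNATIVE):
    """Return the lines in which the pattern is found (search semantics)."""
    return [line for line in lines if re.search(pattern, line)]


matched = matching(TEST_LINES)

# '/data' is rejected too, because 'd' is one of the characters in the class.
data_result = matching(["/data"])
```

Because any path whose second character appears in the class is excluded, a legitimate mount point such as /data would never be checked with this expression.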

As of 1.7.7-1 (may apply to earlier versions), it's confirmed that full regular expressions (including back-references) work as intended. For instance,

       <group name="NetAppVols" rrdRepository = "/opt/opennms/share/rrd/snmp/">
               <threshold type="high" ds-type="naDfIndex" ds-label="naDfFileSys" value="95.0" rearm="90.0" trigger="1" ds-name="naDfPctKB" triggeredUEI="uei.opennms.org/MyCompany/NetAppVolSpaceWarning" rearmedUEI="uei.opennms.org/MyCompany/NetAppVolSpaceWarning-rearmed">
               <resource-filter field="naDfFileSys">(.(?!(aggr0)))*</resource-filter>
               </threshold>
       </group>

will filter out the root volumes on NetApps.

Event

File: $OPENNMS_HOME/etc/events/programmatic.events.xml, $OPENNMS_HOME/etc/eventconf.xml

In support of the new threshold event, a file should be created in $OPENNMS_HOME/etc/events, called programmatic.events.xml. The default will look something like

<?xml version="1.0" encoding="UTF-8"?>
<events xmlns="http://xmlns.opennms.org/xsd/eventconf">
   <event>
       <uei xmlns="">uei.opennms.org/Company/disk-high</uei>
       <event-label xmlns="">User-defined threshold event uei.opennms.org/Company/disk-high</event-label>
       <descr xmlns="">Threshold exceeded for %service% datasource %parm[ds]% on interface %interface%, parms: %parm[all]%</descr>
       <logmsg dest="logndisplay">Threshold exceeded for %service% datasource %parm[ds]% on interface %interface%, parms: %parm[all]%</logmsg>
       <severity xmlns="">Warning</severity>
   </event>
   <event>
       <uei xmlns="">uei.opennms.org/Company/disk-rearm</uei>
       <event-label xmlns="">User-defined threshold event uei.opennms.org/Company/disk-rearm</event-label>
       <descr xmlns="">Threshold rearmed for %service% datasource %parm[ds]% on interface %interface%, parms: %parm[all]%</descr>
       <logmsg dest="logndisplay">Threshold rearmed for %service% datasource %parm[ds]% on interface %interface%, parms: %parm[all]%</logmsg>
       <severity xmlns="">Warning</severity>
   </event>
</events>

To support an Alarm, new parameters have to be added

<?xml version="1.0" encoding="UTF-8"?>
<events xmlns="http://xmlns.opennms.org/xsd/eventconf">
   <event>
       <uei xmlns="">uei.opennms.org/Company/disk-high</uei>
       <event-label xmlns="">User-defined threshold event uei.opennms.org/Company/disk-high</event-label>
       <descr xmlns="">Threshold exceeded for %service% datasource %parm[ds]% on interface %interface%, parms: %parm[all]%</descr>
       <logmsg dest="logndisplay">Threshold exceeded for %service% datasource %parm[ds]% on interface %interface%, parms: %parm[all]%</logmsg>
       <severity xmlns="">Minor</severity>
        <alarm-data 
                reduction-key="%uei%!%nodeid%!%parm[label]%" alarm-type="1" auto-clean="false" />
   </event>
   <event>
       <uei xmlns="">uei.opennms.org/Company/disk-rearm</uei>
       <event-label xmlns="">User-defined threshold event uei.opennms.org/Company/disk-rearm</event-label>
       <descr xmlns="">Threshold rearmed for %service% datasource %parm[ds]% on interface %interface%, parms: %parm[all]%</descr>
       <logmsg dest="logndisplay">Threshold rearmed for %service% datasource %parm[ds]% on interface %interface%, parms: %parm[all]%</logmsg>
       <severity xmlns="">Normal</severity>
        <alarm-data
               clear-key="uei.opennms.org/Company/disk-high!%nodeid%!%parm[label]%"
               reduction-key="%uei%:%nodeid%:%parm[label]%" alarm-type="2" auto-clean="false" />
   </event>
</events>

The key here is the <alarm-data> annotation. The first entry creates a reduction key that says 'all events with this UEI, NodeID and Label should be collated', sets the Alarm type to problem (1), and states that old events for this alarm will not be deleted (auto-cleaned) when a new event arrives with the same reduction key as an existing alarm. The second entry reduces resolution (2) alarm events separately from problem events, and specifies that events of this type should set the severity of any existing alarm matching the clear-key to "Cleared". See Configuring alarms for more details on reduction keys and alarm types. The severities have also been changed, as a 91% disk used threshold breach is a minor problem (you might think it's major or critical, but see this discussion of severities).
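
To see why the clear-key on the re-arm event has to mirror the reduction-key on the disk-high event, it can help to expand both templates by hand. The sketch below is illustrative only: expand is a hypothetical helper that handles just the %name% and %parm[name]% tokens used here, and node id 88 and label / are made-up sample values.

```python
import re


def expand(template, event):
    """Replace %name% and %parm[name]% tokens with values from an event dict."""
    def substitute(match):
        token = match.group(1)
        parm = re.fullmatch(r"parm\[(.+)\]", token)
        if parm:
            return str(event["parms"][parm.group(1)])
        return str(event[token])
    return re.sub(r"%([^%]+)%", substitute, template)


high_event = {
    "uei": "uei.opennms.org/Company/disk-high",
    "nodeid": 88,
    "parms": {"label": "/"},
}
rearm_event = {
    "uei": "uei.opennms.org/Company/disk-rearm",
    "nodeid": 88,
    "parms": {"label": "/"},
}

# Reduction key of the problem alarm, as configured on the disk-high event.
problem_key = expand("%uei%!%nodeid%!%parm[label]%", high_event)

# Clear key of the re-arm event; it must expand to the same string.
clear_key = expand(
    "uei.opennms.org/Company/disk-high!%nodeid%!%parm[label]%", rearm_event
)
```

The two strings are identical for the same node and label, which is what lets the re-arm event find and clear the matching problem alarm.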

Note the use of ! as the reduction key separator: this allows the PostgreSQL query used in the Automation trigger to parse the reduction key field easily. If the de facto standard of : is used, the SQL in the Automation trigger cannot parse the label field properly, because the reduction-key for the disk-high event would then generate values like

org.opennms.uei:88:D:\ Label Fred

Since the trigger uses split_part(reductionkey,'<separator>',3) to extract the label from the key (it cannot be extracted from the logmsg due to truncation issues), using a colon means that the extraction will get D, not the required D:\ Label Fred. This incorrect extraction would, in turn, create a new Event with a different label, and then a new Alarm, with the final result being two Notifications for every Threshold breach when the Automation runs.
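
The difference between the two separators is easy to reproduce. The sketch below mimics PostgreSQL's split_part in Python (the Python function is a stand-in written for this illustration, not part of OpenNMS) and applies it to the example key with each separator.

```python
def split_part(text, delimiter, n):
    """Mimic PostgreSQL split_part: 1-based field index, '' when out of range."""
    parts = text.split(delimiter)
    return parts[n - 1] if 1 <= n <= len(parts) else ""


colon_key = "org.opennms.uei:88:D:\\ Label Fred"
bang_key = "org.opennms.uei!88!D:\\ Label Fred"

# With ':' the drive letter's own colon splits the label in two.
label_from_colon = split_part(colon_key, ":", 3)

# With '!' the whole label survives as the third field.
label_from_bang = split_part(bang_key, "!", 3)
```

A label that itself contains ! would break the second form in the same way, so the separator should be a character that never appears in your storage descriptions.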

$OPENNMS_HOME/etc/eventconf.xml should also have a line towards the bottom that looks like

    <event-file xmlns="">events/programmatic.events.xml</event-file>

NB: all the handling of programmatic.events.xml happens automatically when the web UI is used as recommended.

Destination Path

File: $OPENNMS_HOME/etc/destinationPaths.xml

For the CompanyEmail destination, the target is based on a User in the OpenNMS system.

   <path name="CompanyEmail" initial-delay="0s">
       <target interval="0s">
           <name xmlns="">username</name>
           <autoNotify xmlns="">auto</autoNotify>
           <command xmlns="">email</command>
       </target>
   </path>

Notification

File: $OPENNMS_HOME/etc/notifications.xml

For this example, the notification will be an e-mail notification.

   <notification name="Threshold: Storage 91%" status="on" writeable="yes">
       <uei xmlns="">uei.opennms.org/Company/disk-high</uei>
       <description xmlns="">Storage calculation says 91% is exceeded</description>
       <rule xmlns="">(IPADDR IPLIKE *.*.*.*)</rule>
       <destinationPath xmlns="">CompanyEmail</destinationPath>
       <text-message xmlns="">Disk: %parm[label]% &#xd;
Usage: %parm[value]%&#xd;
&#xd;
http://opennms:8980/opennms/graph/chooseresource.htm?parentResourceType=node&amp;parentResource=%nodeid%&amp;reports=all</text-message>
       <subject xmlns="">Threshold Breach: Storage Capacity</subject>
   </notification>

The message sent via e-mail links the user straight to the resource graphs for the device. Use the web interface, where possible, to create the Notification.

Automation of Repeat Notifications (optional)

File: $OPENNMS_HOME/etc/vacuumd-configuration.xml

The Automation of repeat Notifications is performed in the vacuumd-configuration.xml file. Props to http://bugzilla.opennms.org/show_bug.cgi?id=2642 for providing clues. All of the blocks below occur in vacuumd-configuration, but are split up for ease of explanation.

<automations>
 ...
   <automation name="DiskHighThresholdAlarmReoccuring" interval="86400000" active="true"
               trigger-name="triggerDiskHighThresholdReoccuring"
               action-name="actionDiskHighThresholdReoccuring"
               action-event="eventDiskHighThresholdAlarmReoccuring" />
 ...
</automations>

This is the master definition of the automation. The interval is in milliseconds, and in this example is 24 hours (in other words, the automation runs every 24 hours, which is how often repeated notifications are desired for this particular issue).
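
Since the interval is easy to get wrong by a factor of 1000, a small conversion can be used to derive the value. This is only a convenience sketch; the helper name to_millis is made up for the example.

```python
from datetime import timedelta


def to_millis(**duration):
    """Convert timedelta keywords (hours=, minutes=, ...) to whole milliseconds."""
    return timedelta(**duration) // timedelta(milliseconds=1)


# Value for the automation's interval attribute: once every 24 hours.
daily = to_millis(hours=24)

# The collectd service interval used earlier on this page: every 5 minutes.
five_minutes = to_millis(minutes=5)
```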

<triggers>
 ...
    <trigger name="triggerDiskHighThresholdReoccuring" operator="&gt;=" row-count="1" >
      <statement>
    SELECT
        e.nodeid AS _nodeid,
        e.eventuei AS _eventuei,
        e.ipaddr AS _ipaddr,
        s.servicename AS _servicename,
        regexp_replace(e.ds, '[[:space:]][[:alnum:]_]+$', '') AS _ds,
        regexp_replace(e.n_value, '[[:space:]][[:alnum:]_]+$', '') AS _value,
        regexp_replace(e.n_threshold, '[[:space:]][[:alnum:]_]+$', '') AS _thresh,
        regexp_replace(e.n_trig, '[[:space:]][[:alnum:]_]+$', '') AS _trig,
        regexp_replace(e.n_rearm, '[[:space:]][[:alnum:]_]+$', '') AS _rearm,
        e.n_label AS _label
    FROM (
        SELECT nodeid, eventuei, ipaddr, serviceid,
            split_part(reductionkey,'!',3) as n_label,
            replace(split_part(logmsg,'=',2), '"', '') as ds,
            replace(split_part(logmsg,'=',3), '"', '') as n_value,
            replace(split_part(logmsg,'=',4), '"', '') as n_threshold,
            replace(split_part(logmsg,'=',5), '"', '') as n_trig,
            replace(split_part(logmsg,'=',6), '"', '') as n_rearm
        FROM alarms
        WHERE
            eventuei='uei.opennms.org/Company/disk-high'
            AND alarmacktime is null
        ) as e
    LEFT OUTER JOIN service AS s ON (s.serviceid = e.serviceid)
 
     </statement>
   </trigger>
 ...
</triggers>

This query is written against PostgreSQL 8.3, so check the functions and syntax for compatibility. It extracts data from the original reduction key and log message to generate parameters for the action-event that raises a new Event, and thus Notification and Alarm (which gets reduced back on top of the original due to the Alarm configuration in the Event). The sub-select syntax is not required, though it does make it a bit easier to read due to all of the function use. The check for alarmacktime means that acknowledging the alarm will stop further notifications.

<actions>
 ...
   <action name="actionDiskHighThresholdReoccuring" >
     <statement>
     </statement>
   </action>
 ...
</actions>

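The action statement is left empty, so the automation runs no SQL of its own; its only effect is the action-event that follows.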

<action-events>
 ...
  <action-event name="eventDiskHighThresholdAlarmReoccuring" for-each-result="true" >
               <assignment type="field" name="uei" value="${_eventuei}" />
               <assignment type="field" name="nodeid" value="${_nodeid}" />
               <assignment type="field" name="service" value="${_servicename}" />
               <assignment type="field" name="interface" value="${_ipaddr}" />
               <assignment type="parameter" name="ds" value="${_ds}" />
               <assignment type="parameter" name="value" value="${_value}" />
               <assignment type="parameter" name="threshold" value="${_thresh}" />
               <assignment type="parameter" name="trigger" value="${_trig}" />
               <assignment type="parameter" name="rearm" value="${_rearm}" />
               <assignment type="parameter" name="label" value="${_label}" />
  </action-event>
 ...
</action-events>

This action-event assigns parameters from the SQL query in the trigger, and generates a new Event. Note that the parameter values must be in the same sequence as the original from the log message, otherwise the SQL in the trigger will extract the wrong values on subsequent runs (since the Alarm logmsg field gets updated by the new Event). The log message being used looks like

  • Threshold exceeded for SNMP datasource (hrStorageUsed / hrStorageSize) * 100 on interface 127.0.0.1, parms: ds="(hrStorageUsed / hrStorageSize) * 100" value="99.00568295556656" threshold="91.0" trigger="1" rearm="89.0" label="D:\ Label:New Volume S...
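
Applying the trigger's extraction steps to this message shows why the parameter order matters. The sketch below re-implements the split_part / replace / regexp_replace chain in Python as an approximation of the PostgreSQL behaviour; the helper names are made up, and the message is the truncated sample above.

```python
import re

LOGMSG = (
    'Threshold exceeded for SNMP datasource (hrStorageUsed / hrStorageSize) * 100 '
    'on interface 127.0.0.1, parms: ds="(hrStorageUsed / hrStorageSize) * 100" '
    'value="99.00568295556656" threshold="91.0" trigger="1" rearm="89.0" '
    'label="D:\\ Label:New Volume S...'
)


def split_part(text, delimiter, n):
    """Mimic PostgreSQL split_part: 1-based field index, '' when out of range."""
    parts = text.split(delimiter)
    return parts[n - 1] if 1 <= n <= len(parts) else ""


def extract(logmsg, n):
    """Take the n-th '='-separated field, strip quotes, drop the trailing name."""
    field = split_part(logmsg, "=", n).replace('"', "")
    # Same idea as regexp_replace(x, '[[:space:]][[:alnum:]_]+$', '').
    return re.sub(r"\s[A-Za-z0-9_]+$", "", field)


names = ["ds", "value", "threshold", "trigger", "rearm"]
fields = {name: extract(LOGMSG, n) for n, name in enumerate(names, start=2)}
```

Each value is identified purely by its position after an = sign, so if the action-event emitted the parameters in a different order, the next run would read, for example, the threshold into _value.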

An alternative approach for the Automation is to use an SQL statement in the <actions> area to fire the re-arm event. This would cause the Threshold to be breached on a subsequent collection. (This is a theory, and not tested.)

Display

After restarting OpenNMS, when a threshold is breached, you should be able to find an Event, a Notification and an Alarm in their respective locations. If you're using the default Dashboard view, then there should be both Alarms (with counts of 1) and Notifications displayed in red.

Caveats

  • OpenNMS needs to be informed every time you change the vacuumd configuration file. This can be done either by restarting OpenNMS or by using a tool such as $OPENNMS_HOME/bin/send-event.pl to send an event with UEI uei.opennms.org/internal/reloadVacuumdConfig.
  • On OpenNMS restart, if an Alarm is raised by an Event, the Automation will detect it immediately and generate another Event. This leads to two notifications. The trigger SQL could be amended to read the lasteventtime column in the alarms table, and only fire if the lasteventtime is more than n minutes old.
  • The Threshold calculation will be performed to 5+ significant digits, leading to rather precise percentage calculations. The Automation to generate additional Events can be modified with round((...)::numeric,2) on the _value extraction to round to two decimal places.
  • This method does not detect changes over the set threshold once a breach has occurred. This means that your Notifications will always show the same breach value, whether it's climbed or not. For example, if a disk breaches a 91% full Threshold with 92%, a Notification gets sent, plus the Alarm. On the next run of the Automation, even if the polled value is 94%, the Notification will indicate 92%.
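
For the rounding caveat above, the effect of round((...)::numeric,2) on the sample value can be approximated as follows. This is a Python stand-in using the decimal module, assuming half-up rounding like PostgreSQL's numeric round; the helper name round_numeric is made up.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_numeric(text, places=2):
    """Round a decimal string half-up to the given number of decimal places."""
    quantum = Decimal(10) ** -places   # e.g. Decimal('0.01') for places=2
    return Decimal(text).quantize(quantum, rounding=ROUND_HALF_UP)


# The value from the sample log message, rounded for the Notification text.
rounded = round_numeric("99.00568295556656")
```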

Troubleshooting

Finding threshold events in the logs:

(5:14:05 PM) alonrb: can you run me through which logs should have events and in which order?
(5:14:13 PM) bajan: Whee :)
(5:14:15 PM) bajan: Lemme think
(5:14:22 PM) bajan: we rotate through our logs in about a day or so
(5:15:04 PM) bajan: If you set DEBUG in log4j.properties for the collectd.log file
(5:15:17 PM) bajan: and wait a few minutes (or restart - waiting works for me, not for Abracadabra)
(5:15:32 PM) bajan: Then a tail -f collectd.log | grep ThresholdingVisitor
(5:15:41 PM) bajan: should show the entries where the thresholding is being run
(5:15:52 PM) alonrb: thanks
(5:15:58 PM) bajan: You -may- need to set debug on the thresholding.log entry, even though it won't write to that file

The log file says

  • 2008-09-02 12:10:57,935 WARN [CollectdScheduler-50 Pool-fiber0] ThresholdingVisitor: createThresholdingVisitor: Thresholds processing is not enabled. Check thresholding-enabled param on collectd package

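This warning means thresholding is not enabled for the collection package: the collectd-configuration.xml stage at the beginning of this document was missed, so add the thresholding-enabled parameter to the SNMP service.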