Auto-acknowledge and match event parameters

From OpenNMS

Problem description

Sometimes vendors use SNMP traps in their very own way. A common practice seems to be to send exactly one trap OID for every type of error or clear event. Whether the trap indicates an error or the clearing of an error, and which type of error it refers to, is only visible in the event parameters.

Unfortunately, as of this writing (version 1.8.3), OpenNMS is not able to analyse event parameters when it comes to auto-acknowledging.

In my case, there were 150 different items the trap could send an error or a clear message for. Translating the MIB only led to one event definition instead of the 300 different meanings the trap could have. So the first step was to find out which parameter indicates whether the trap is an error or a clear event, and to create one event definition for the error and one for the clear event. In my case it was parameter number 4.

Even with this configuration, simply configuring notifd to auto-acknowledge the error event with the clear event leads to various problems. For example, if several errors are open at the same time, a clear event always clears the latest error event, even though it might belong to a completely different error. And since the people receiving the notifications only wanted to know about errors that lasted more than 5 minutes, things became a bit interesting.

So the goal of this document is to show how to filter auto-acknowledgement on event parameters in order to acknowledge an error only through its actual clear event.

Simple, but annoying approach

  • Configure 150 event definitions, one per error event, each with its own UEI
  • Configure 150 event definitions, one per clear event, each with its own UEI
  • Configure 150 auto-acknowledge definitions in notifd, one for each error event and its corresponding clear event
  • Configure 150 notifications

That would probably do it.
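If you do go this way, the 300 event definitions at least need not be typed by hand. The following is only a sketch: the list of error identifiers and the UEI naming scheme are hypothetical, and it assumes the identifier arrives in varbind 3 and the error/clear flag in varbind 4, as in my traps. The matching auto-acknowledge and notification entries could be generated by the same kind of loop.

```python
from xml.sax.saxutils import escape

TRAP_OID = ".1.3.6.1.4.1.1.1.1"   # enterprise id from the examples on this page
SPECIFIC = "202"                   # specific trap number from the examples


def mask_element(name, value):
    return (f"    <maskelement><mename>{name}</mename>"
            f"<mevalue>{escape(value)}</mevalue></maskelement>\n")


def varbind(number, value):
    return (f"    <varbind><vbnumber>{number}</vbnumber>"
            f"<vbvalue>{escape(value)}</vbvalue></varbind>\n")


def event_definition(error_id, clear):
    """Build one <event> for error_id; the UEI naming scheme is hypothetical."""
    kind = "clear" if clear else "error"
    uei = f"uei.opennms.org/example/{kind}/{error_id}"
    label = escape(f"{kind} {error_id}")
    return (
        "<event>\n  <mask>\n"
        + mask_element("id", TRAP_OID)
        + mask_element("generic", "6")
        + mask_element("specific", SPECIFIC)
        + varbind(3, error_id)                        # which error
        + varbind(4, "5" if clear else "~[01234]")    # clear or error
        + "  </mask>\n"
        + f"  <uei>{escape(uei)}</uei>\n"
        + f"  <event-label>{label}</event-label>\n"
        + f"  <logmsg dest=\"logndisplay\">{label}</logmsg>\n"
        + f"  <severity>{'Cleared' if clear else 'Major'}</severity>\n"
        + "</event>\n"
    )


def generate(error_ids):
    """Return an error and a clear definition for every identifier."""
    return "".join(event_definition(e, c)
                   for e in error_ids for c in (False, True))


if __name__ == "__main__":
    print(generate(["a", "b"]))   # placeholder ids; the real list has 150 entries
```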

An OpenNMS way to do this

In order for the example to make more sense, before I start digging into the configuration I need to mention how I could distinguish my traps.

First of all, I need to look at parameter number 4. If there is a 5 in there, this indicates a clear event. Anything but 5 is an error.

The second parameter I need to look at is number 3. This holds a device-internal unique identifier for the error.

So if there is a trap with parameter 3 being "a" and parameter 4 being "1", this is an error for "a". This would be acknowledged by a trap with parm3 = "a" and parm4 = "5".
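To make the matching rule concrete, here is a small sketch of the behaviour I am after. It is not OpenNMS code; the `Trap` class and the `apply_trap` helper are made up for illustration. It shows that a clear for "a" must only close the error for "a", even if "b" was the latest error.

```python
from dataclasses import dataclass

CLEAR_VALUE = "5"   # parm4 value that marks a clear event


@dataclass(frozen=True)
class Trap:
    parm3: str   # device-internal error identifier
    parm4: str   # "5" means clear, anything else means error


def is_clear(trap):
    return trap.parm4 == CLEAR_VALUE


def apply_trap(open_errors, trap):
    """Update the set of open error ids; return the id that was cleared, if any."""
    if is_clear(trap):
        if trap.parm3 in open_errors:
            open_errors.remove(trap.parm3)
            return trap.parm3
        return None
    open_errors.add(trap.parm3)
    return None


open_errors = set()
apply_trap(open_errors, Trap("a", "1"))            # error for "a"
apply_trap(open_errors, Trap("b", "2"))            # error for "b", the latest one
cleared = apply_trap(open_errors, Trap("a", "5"))  # clear for "a"
print(cleared, sorted(open_errors))                # a ['b']
```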

Abstract

Abstract steps to achieve the goal

  • Create a notification for the error event
    • Put parm3 into the numericmsg of the notification
    • Use a destination path with an initial delay of 5 minutes
  • Translate the clear event using the event translator
    • If the clear event was received within 5 minutes after the error event, acknowledge the notification and then translate the event into clear event 1
    • If the clear event was not received within 5 minutes, acknowledge the notification and translate the event into clear event 2
  • Create a notification for clear event 2 so that the notification recipient receives a "resolved" message
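The decision between clear event 1 and clear event 2 can be sketched as follows. This is only an illustration of the intended logic, not of how the event translator evaluates it; the function name and the timestamps are made up, and treating an error that lasted exactly 5 minutes as "not notified" is an arbitrary choice for the sketch.

```python
from datetime import datetime, timedelta

INITIAL_DELAY = timedelta(minutes=5)   # initial delay of the destination path


def translate_clear(error_time, clear_time):
    """Pick the translated UEI for a clear event that acknowledges an error."""
    outstanding = clear_time - error_time
    if outstanding > INITIAL_DELAY:
        # The error notification already went out, so send a "resolved" message.
        return "uei.opennms.org/translator/clearevent2"
    # Acknowledged before the delay expired: nobody was notified, stay silent.
    return "uei.opennms.org/translator/clearevent"


t0 = datetime(2010, 1, 1, 12, 0, 0)
print(translate_clear(t0, t0 + timedelta(minutes=2)))   # uei.opennms.org/translator/clearevent
print(translate_clear(t0, t0 + timedelta(minutes=9)))   # uei.opennms.org/translator/clearevent2
```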

Actual configuration

events
   <event>
       <mask>
           <maskelement>
               <mename>id</mename>
               <mevalue>.1.3.6.1.4.1.1.1.1</mevalue>
           </maskelement>
           <maskelement>
               <mename>generic</mename>
               <mevalue>6</mevalue>
           </maskelement>
           <maskelement>
               <mename>specific</mename>
               <mevalue>202</mevalue>
           </maskelement>
           <varbind>
               <vbnumber>4</vbnumber>
               <vbvalue>~[01234]</vbvalue>
           </varbind>
       </mask>
       <uei>uei.opennms.org/example/errorevent</uei>
       <event-label>errorevent</event-label>
       <logmsg dest="logndisplay">errorevent %parm[#3]%</logmsg>
       <severity>Major</severity>
   </event>
   <event>
       <mask>
           <maskelement>
               <mename>id</mename>
               <mevalue>.1.3.6.1.4.1.1.1.1</mevalue>
           </maskelement>
           <maskelement>
               <mename>generic</mename>
               <mevalue>6</mevalue>
           </maskelement>
           <maskelement>
               <mename>specific</mename>
               <mevalue>202</mevalue>
           </maskelement>
           <varbind>
               <vbnumber>4</vbnumber>
               <vbvalue>5</vbvalue>
           </varbind>
       </mask>
       <uei>uei.opennms.org/example/clearevent</uei>
       <event-label>clearevent</event-label>
       <logmsg dest="logonly">clearevent %parm[#3]%</logmsg>
       <severity>Cleared</severity>
   </event>
notifications
   <notification name="errorevent" status="on" writeable="yes">
       <uei xmlns="">uei.opennms.org/example/errorevent</uei>
       <rule xmlns="">(IPADDR IPLIKE *.*.*.*)</rule>
       <destinationPath xmlns="">Email-Admin</destinationPath>
       <text-message xmlns="">Notice #%noticeid% errorevent %parm[#3]%</text-message>
       <subject xmlns="">Notice #%noticeid% errorevent %parm[#3]%</subject>
       <numeric-message xmlns="">%parm[#3]%</numeric-message>
   </notification>
   <notification name="clearevent2" status="on" writeable="yes">
       <uei xmlns="">uei.opennms.org/translator/clearevent2</uei>
       <rule xmlns="">(IPADDR IPLIKE *.*.*.*)</rule>
       <destinationPath xmlns="">Email-Admin</destinationPath>
       <text-message xmlns="">RESOLVED Notice #%noticeid% clearevent2 %parm[#3]%</text-message>
       <subject xmlns="">RESOLVED Notice #%noticeid% clearevent2 %parm[#3]%</subject>
   </notification>
postgres plpgsql procedures
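-- Acknowledges every open errorevent notification for which a clearevent
-- carrying the same parm3 value (stored in numericmsg) exists.
create or replace function ackunackderrorevents() returns void as $body$
declare
 i record;
begin
 -- loop over all errorevent notifications that are not acknowledged yet
 for i in select numericmsg from notifications where answeredby is null and respondtime is null and eventuei like '%errorevent' order by numericmsg LOOP
   -- acknowledge them if a clearevent exists whose eventparms contain the numericmsg value followed by "("
   execute $$update notifications set answeredby='admin', respondtime=now() where respondtime is null and answeredby is null and numericmsg=$$ || quote_literal(i.numericmsg) || $$ and exists ( select eventid from events where eventparms ~ '^.*$$ || i.numericmsg || $$[(].*' and eventuei = 'uei.opennms.org/example/clearevent' )$$;
 END LOOP;
END
$body$  LANGUAGE 'plpgsql';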
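The exists clause builds a regular expression from numericmsg and matches it against events.eventparms. The sketch below replays that pattern in Python to show what it accepts. It assumes that eventparms stores each parameter as name=value(type,encoding), separated by semicolons; the sample strings and OIDs are made up. It also shows two weaknesses worth knowing about: the pattern matches any value that merely ends with the identifier, and the identifier is not escaped before it goes into the regex. The stricter variant is a hypothetical alternative, not what the function above does.

```python
import re


def sql_pattern(numericmsg):
    """Same pattern the function concatenates: ^.*<id>[(].*"""
    return "^.*" + numericmsg + "[(].*"


def stricter_pattern(numericmsg):
    """Hypothetical variant: escape the id and require '=' right before it."""
    return "=" + re.escape(numericmsg) + "[(]"


# Made-up eventparms values in the assumed name=value(type,encoding) layout.
parms_a = ".1.3.6.1.4.1.1.1.1.3.0=a(string,text);.1.3.6.1.4.1.1.1.1.4.0=5(int32,text)"
parms_ba = ".1.3.6.1.4.1.1.1.1.3.0=ba(string,text);.1.3.6.1.4.1.1.1.1.4.0=5(int32,text)"

print(bool(re.search(sql_pattern("a"), parms_a)))        # True
print(bool(re.search(sql_pattern("a"), parms_ba)))       # True, "ba(" also ends in "a("
print(bool(re.search(stricter_pattern("a"), parms_ba)))  # False
```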

In the following translator config, the parameter name \.1\.3\.6\.1\.4\.1\.1\.1\.1\.1\.1\.1\.1\.1\.0 corresponds to parameter 3, the unique error identifier from the device.

translator configuration (see in-line comments)
    <event-translation-spec uei="uei.opennms.org/example/clearevent">
      <mappings>
        <mapping>
          <assignment name="uei" type="field" >
            <value type="constant" result="uei.opennms.org/translator/clearevent2" />
          </assignment>
          <assignment name="sleep" type="parameter">
            <value type="sql" result="select pg_sleep(3)" />
          </assignment>
          <!-- this is where the error notification will get acknowledged -->
          <assignment name="doesntmatter" type="parameter">
            <value type="sql" result="select ackunackderrorevents()" />
          </assignment>
          <assignment name="duration" type="parameter">
            <!-- this returns a result only if the error notification with this clear event's parm3 was open for more than 5 minutes before it was acknowledged, so we translate into an event that we will notify about -->
            <value type="sql" result="select duration from (select respondtime-pagetime duration from notifications where notifyid=(select notifyid from notifications where eventid=(SELECT eventid FROM events WHERE eventparms ~ ? and eventuei ='uei.opennms.org/example/errorevent'))) duration where duration > '5 minutes'" >
              <value type="parameter" name="~^\.1\.3\.6\.1\.4\.1\.1\.1\.1\.1\.1\.1\.1\.1\.0$" matches=".*" result="${0}" />
            </value>
          </assignment>
        </mapping>
        <mapping>
          <assignment name="uei" type="field" >
            <value type="constant" result="uei.opennms.org/translator/clearevent" />
          </assignment>
          <assignment name="sleep" type="parameter">
            <value type="sql" result="select pg_sleep(3)" />
          </assignment>
          <assignment name="doesntmatter" type="parameter">
            <value type="sql" result="select ackunackderrorevents()" />
          </assignment>
          <assignment name="duration" type="parameter">
            <!-- this returns a result only if the error notification with this clear event's parm3 was open for less than 5 minutes before it was acknowledged, so we translate into an event that we will *not* notify about -->
            <value type="sql" result="select duration from (select respondtime-pagetime duration from notifications where notifyid=(select notifyid from notifications where eventid=(SELECT eventid FROM events WHERE eventparms ~ ? and eventuei ='uei.opennms.org/example/errorevent'))) duration where duration &lt; '5 minutes'" >
              <value type="parameter" name="~^\.1\.3\.6\.1\.4\.1\.1\.1\.1\.1\.1\.1\.1\.1\.0$" matches=".*" result="${0}" />
            </value>
          </assignment>
        </mapping>
      </mappings>
    </event-translation-spec>