Some notes on possible implementation
T: = Tarus
Add a default pathTestIp to a configuration file
Which config file?
T: We could put it in opennms-server.xml. Perhaps the default gateway?
Done. Added default critical path IP, service, timeout and retries. (ICMP is the only service supported now)
T: I would suggest a new table, something like PathOutage, made up of nodeid, service, and IP. For Phase 1 service would be ICMP.
Done.
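As a rough illustration of the table Tarus describes, here is a minimal sketch using an in-memory SQLite database. The table and column names (path_outage, node_id, service, ip) and the sample address are assumptions made for this example only, not the schema that was actually added to OpenNMS.

```python
import sqlite3

# Hypothetical schema: one critical-path row per node.
# Names are illustrative; the real OpenNMS table may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE path_outage (
        node_id INTEGER PRIMARY KEY,
        service TEXT NOT NULL DEFAULT 'ICMP',  -- Phase 1: ICMP only
        ip      TEXT NOT NULL
    )
    """
)

# Register a critical path for node 42 (documentation-range address).
conn.execute(
    "INSERT INTO path_outage (node_id, ip) VALUES (?, ?)",
    (42, "192.0.2.1"),
)

row = conn.execute(
    "SELECT node_id, service, ip FROM path_outage WHERE node_id = ?",
    (42,),
).fetchone()
```

Keeping this in its own table means the node and asset tables stay untouched, which is the point made in the discussion below.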
Add a new column to the node table for pathTestIp
Instead of pathTestIp, perhaps the asset table?
T: See above. We shouldn't need to touch the node or asset table.
Set/clear pathTestIp from web interface
Using the asset table would make this easy.
T: Should be able to model this on that code, but I'd keep it separate.
Done
Test the pathTestIp for a node when polling failures indicate a node down condition
If no pathTestIp is set, test the default pathTestIp instead
T: Cool.
Done.
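The lookup-with-fallback described above could be sketched roughly as follows. This is not the poller code; CriticalPath, resolve_critical_path, is_path_outage and the probe callback are hypothetical names used only to show the decision, and the probe stands in for whatever ICMP test the poller performs.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class CriticalPath:
    """Hypothetical record: the IP (and service) to test for a node."""
    ip: str
    service: str = "ICMP"  # Phase 1 supports ICMP only


def resolve_critical_path(
    node_id: int,
    configured: Dict[int, CriticalPath],
    default: CriticalPath,
) -> CriticalPath:
    """Return the node's own critical path, or the default if none is set."""
    return configured.get(node_id, default)


def is_path_outage(
    node_id: int,
    configured: Dict[int, CriticalPath],
    default: CriticalPath,
    probe: Callable[[CriticalPath], bool],
) -> bool:
    """Call this once polling indicates the node is down.

    `probe` returns True when the critical path responds. If it does
    not respond, the node-down condition is attributed to a path outage.
    """
    path = resolve_critical_path(node_id, configured, default)
    return not probe(path)
```

For instance, a node with no entry in `configured` is tested against `default`, which in these notes would be the gateway address from opennms-server.xml.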
Take appropriate action if pathTestIp fails to respond.
This could be implemented at the polling level, the outage level or the notification level. Doing it at the notification level might be the easiest and would impose the least overhead. Doing it at the outage level would allow for the possibility of marking the state of the node as unknown instead of down. Need to investigate the possibilities and amount of work involved here.
T: This is where it gets tricky. If we discover a node is down due to a path outage, what do we do? I'm thinking that for Phase 1 we change the nodeDown event to something like sympatheticNodeDown or NodeImpacted, so that no nodeDown notification is sent. Everything else, for Phase 1, will remain the same.
Changing the nodeDown event to something else caused some other things to break. So instead I added parms to the node down event indicating it was caused by a path outage. Notifd checks the parms on the nodeDown event to see if it should suppress nodeDown notification or not. I also created a pathoutage event so you could take separate action on that if you wanted to. (like email instead of page)
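A minimal sketch of the suppression check, assuming the event arrives as a name plus a dictionary of parms. The parm name "pathOutage", the value "true" and the short event names are placeholders for illustration; the notes do not say what the real parms are called, and this is not Notifd's actual code.

```python
from typing import Mapping

# Hypothetical parm name; the real parm added to nodeDown may differ.
PATH_OUTAGE_PARM = "pathOutage"


def should_notify(event_name: str, parms: Mapping[str, str]) -> bool:
    """Decide whether an event should trigger its normal notification.

    A nodeDown event flagged as caused by a path outage is suppressed.
    Every other event, including the separate pathOutage event, is
    left alone so it can be routed on its own (email instead of page).
    """
    if event_name == "nodeDown" and parms.get(PATH_OUTAGE_PARM) == "true":
        return False
    return True
```

The event itself is still generated and stored, so anything that depends on nodeDown keeps working; only the notification is skipped.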
T: But what about outages? Well, if the OpenNMS server cannot reach a service, that is an outage. In other words, it doesn't matter if I can't get to example.com's website because Apache is down or the router is down. In future phases we can refine this.
There are enough parms on the nodeDown and pathOutage events to calculate things like percent availability either including or excluding path outages. In phase I, outages won't change. (As you suggest above, this could be refined later).
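To show what "including or excluding path outages" could mean in practice, here is a small sketch. It assumes each outage is available as a (start, end, caused_by_path_outage) tuple of timestamps in seconds and that outages do not overlap one another; both the tuple layout and the function name are assumptions for this example, not how OpenNMS computes availability.

```python
from typing import Iterable, Tuple

# (start, end, caused_by_path_outage), times in seconds. Hypothetical layout.
Outage = Tuple[float, float, bool]


def percent_availability(
    outages: Iterable[Outage],
    window_start: float,
    window_end: float,
    exclude_path_outages: bool = False,
) -> float:
    """Percent of the window during which the node counts as available.

    With exclude_path_outages=True, downtime attributed to a path
    outage is not counted against the node.
    """
    window = window_end - window_start
    if window <= 0:
        raise ValueError("window_end must be after window_start")

    down = 0.0
    for start, end, path_caused in outages:
        if exclude_path_outages and path_caused:
            continue
        # Count only the part of the outage that falls inside the window.
        overlap = min(end, window_end) - max(start, window_start)
        if overlap > 0:
            down += overlap
    return 100.0 * (window - down) / window
```

With one 10-second ordinary outage and one 20-second path outage in a 100-second window, the two modes give 70% and 90% respectively.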
T: So, if an interface on a router goes down, the nodes behind that router will generate NodeImpacted events. The router will generate an interfaceDown event, which should send a page.
T: As for the default IP, I would suggest the gateway since on most systems I can still ping the server interface even if the cable is pulled.
The administrator will have to put their own specific gateway address in opennms-server.xml
What Next?
T: I would love to do a lot more with this, but for Phase 1 this would be enough. Perhaps first thing in Phase 2 we could generate a node impacted table for display on the webUI.
Got that in phase I. I'm not happy with this yet, but have an improvement in mind. Also need to fix a minor bug I found last night.
OK, I fixed the bug and made some improvements to the appearance and functionality of the display pages. It would also be nice to have a page allowing the admin to set the path outage config for a group of nodes based on a rule, for example nodes matching an 'iplike' rule or a 'nodelabel like' rule.
Update 4/27/06
Tarus asked for an on/off switch. That has been added as an option in poller-configuration.xml as pathOutageEnabled="true" or "false". Default is "false". Also, I've added a rule-based configuration capability in Home > Admin > Configure Notifications > Configure Path Outages.
After wrestling with the best way to display the status of nodes dependent on a specific critical path, I finally had a light-bulb moment tonight, and I think it's laid to rest, for phase I anyway.






