30 second outage

From OpenNMS

What is it?

A poll consists of a connection to a particular port on a remote interface, followed by a test to see whether the service on that port returns an expected response. If the response is not received within the timeout and retries are configured, the poll is attempted again. Once the number of retries is exceeded, the service is considered down. In some networks, however, short, intermittent failures are common. With the default downtime model, a failed service is polled again after 30 seconds. If that second poll succeeds, the service is marked up again and the outage is recorded as having lasted 30 seconds. This is what is known as a "30 second outage".
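The 30 second repoll comes from the downtime model in poller-configuration.xml. As a rough sketch, the model shipped in a typical default installation looks like the fragment below; the exact intervals are an assumption and may differ in your version, so check your own file.

```xml
<!-- Downtime model inside a <package> element; all times are in milliseconds. -->
<!-- First 5 minutes of an outage: repoll every 30 seconds. -->
<downtime interval="30000" begin="0" end="300000"/>
<!-- From 5 minutes to 12 hours: repoll every 5 minutes. -->
<downtime interval="300000" begin="300000" end="43200000"/>
<!-- From 12 hours to 5 days: repoll every 10 minutes. -->
<downtime interval="600000" begin="43200000" end="432000000"/>
<!-- After 5 days: stop polling and delete the service. -->
<downtime begin="432000000" delete="true"/>
```

The first line is the one that matters here: a service that fails one poll and answers the next is reported as down for exactly one 30 second interval.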

Note that this is a real problem: a user attempting to access that resource at the same moment would also have experienced a timeout. In some networks, however, these 30 second outages are annoying yet hard to correct.

Possible causes

XXX discuss how packet loss affects TCP connections, and in particular, how duplex mismatches can significantly increase the chances of packet loss in TCP connections.

Packet loss due to Ethernet duplex mismatch 
Packet loss due to collisions 
Packet loss due to other factors (wireless interference, congestion, intermittent link failure, etc.) 
DNS name resolution 
High CPU load on the device to be tested
Service is recycling
Routing problems
SSH key generation (XXX is this really a possibility?)
SMTP greylisting or other funky feature

Workarounds

Fixing the root of the problem

We're serious. You should seriously look into fixing the root of the problem instead of implementing a workaround. See the list of possible causes above.

Increase timeout and retries

If you cannot fix the problem, you can increase the timeout and/or the number of retries before the service is considered to be down. See the Polling Configuration How-To for details.
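As an illustration, timeout and retries are set per service in poller-configuration.xml. The fragment below is a minimal sketch modeled on a stock HTTP service definition. The values are examples only, not recommendations, and the parameters available depend on the monitor in use.

```xml
<!-- Example service entry inside a <package>; interval and timeout are in milliseconds. -->
<service name="HTTP" interval="300000" user-defined="false" status="on">
    <!-- Number of additional attempts before the service is declared down (example value). -->
    <parameter key="retry" value="3"/>
    <!-- Time to wait for a response on each attempt (example value: 5 seconds). -->
    <parameter key="timeout" value="5000"/>
    <parameter key="port" value="80"/>
    <parameter key="url" value="/"/>
</service>
```

Keep in mind that a longer timeout combined with more retries lengthens the time it takes to detect a genuine outage, since each attempt can wait up to the full timeout.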

Initial delay in notifications

If 30 second outages persist in spite of your efforts to track down the root cause, you may set an initial time delay for notifications. This is done in the configuration file destinationPaths.xml. For example:

   <path name="Email-Admin" initial-delay="1m">
       <target>
               <name xmlns="">Admin</name>
               <command xmlns="">javaEmail</command>
       </target>
   </path>

This will delay the notification for 1 minute. If the service is restored in less than 1 minute, the notification will be cancelled.

Setting serviceUnresponsiveEnabled in poller-configuration.xml

This option was added so that a service is treated as failed only when the connection to the port fails, not when the connection succeeds but the expected response is missing. With it enabled, an unresponsive service does not generate an outage, but only a "service unresponsive" event. To enable this behavior, set the value to "true". See the Polling Configuration How-To for details.
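A minimal sketch of where the setting lives, assuming it is an attribute on the root element of poller-configuration.xml as in stock configurations. The threads value is illustrative, and any other attributes already on your root element should be left as they are.

```xml
<!-- Root element of poller-configuration.xml; only the relevant attribute is changed. -->
<poller-configuration threads="30" serviceUnresponsiveEnabled="true">
    <!-- existing <package> and <monitor> elements remain unchanged -->
</poller-configuration>
```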
