From OpenNMS
What is it?
A poll consists of a connection to a particular port on a remote interface, followed by a test to see whether the service on that port returns an expected response. If the response is not received within the timeout and retries are configured, the poll is tried again. Once the number of retries is exceeded, the service is considered down. In some networks, however, short, intermittent failures are common. With the default downtime model, a failed service is polled again after 30 seconds. If that next poll succeeds, the result is what is known as a "30 second outage".
Note that this is a real problem: a user attempting to access that resource would also have experienced a timeout. But in some networks these 30 second outages can be annoying yet hard to correct.
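The 30-second re-poll comes from the downtime model in poller-configuration.xml. As a rough sketch, assuming a package named example1 and the interval values commonly shipped as defaults (all in milliseconds; check your own file, since the values may differ):

```xml
<package name="example1">
  <!-- filter, address ranges and service definitions omitted -->
  <!-- first 5 minutes of an outage: poll every 30 seconds -->
  <downtime interval="30000" begin="0" end="300000"/>
  <!-- from 5 minutes to 12 hours: poll every 5 minutes -->
  <downtime interval="300000" begin="300000" end="43200000"/>
  <!-- from 12 hours to 5 days: poll every 10 minutes -->
  <downtime interval="600000" begin="43200000" end="432000000"/>
</package>
```

With a model like this, a service that fails one poll and answers the next is recorded as down for roughly 30 seconds.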
Possible causes
XXX discuss how packet loss affects TCP connections, and in particular, how duplex mismatches can significantly increase the chances of packet loss in TCP connections.
- Packet loss due to Ethernet duplex mismatch
- Packet loss due to collisions
- Packet loss due to other factors (wireless interference, congestion, intermittent link failure, etc.)
- DNS name resolution
- High CPU load on the device being tested
- Service is recycling
- Routing problems
- SSH key generation
- (XXX is this really a possibility?)
- SMTP Greylisting or other funky feature
Workarounds
Fixing the root of the problem
We're serious. You should seriously look into fixing the root of the problem instead of implementing a workaround. See the list of possible causes above.
Increase timeout and retries
If you cannot fix the problem, you can increase the timeout and/or the number of retries before the service is considered down. See the Polling Configuration How-To for details.
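As a sketch of what this might look like for a single service in poller-configuration.xml: the service name HTTP, the port and the values below are illustrative assumptions, not recommendations. The timeout is given in milliseconds.

```xml
<service name="HTTP" interval="300000" user-defined="false" status="on">
  <!-- number of additional attempts before the service is declared down -->
  <parameter key="retry" value="3"/>
  <!-- time to wait for a response on each attempt, in milliseconds -->
  <parameter key="timeout" value="5000"/>
  <parameter key="port" value="80"/>
</service>
```

With these values, a poll that never gets an answer takes roughly (3 + 1) x 5 seconds = 20 seconds before the service is marked down, so larger values also delay the detection of real outages.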
Initial delay in notifications
If 30 second outages persist in spite of your efforts to track down the root cause, you may set an initial time delay for notifications. This is done in the configuration file destinationPaths.xml. For example:
<path name="Email-Admin" initial-delay="1m">
  <target>
    <name xmlns="">Admin</name>
    <command xmlns="">javaEmail</command>
  </target>
</path>
This delays the notification by 1 minute. If the service is restored in less than 1 minute, the notification is cancelled.
Setting serviceUnresponsiveEnabled in poller-configuration.xml
This option was added to distinguish a failed port connection from a missing response. When it is enabled, a service that accepts the connection but does not return the expected response is treated as unresponsive: it does not generate an outage, only a "service unresponsive" event. To enable this behavior, set the value to "true". See the Polling Configuration How-To for details.
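A minimal sketch of where the setting lives, assuming it is an attribute on the root element of poller-configuration.xml. The threads value is an illustrative assumption; leave any other attributes and child elements as they are in your file.

```xml
<poller-configuration threads="30" serviceUnresponsiveEnabled="true">
  <!-- node-outage, package and monitor definitions unchanged -->
</poller-configuration>
```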






