FAQ-Troubleshooting

Q: Why do I see a frightening number of Java processes/memory allocated to Java with ps or top?

A: From Pete Siemsen and Ben Reed's emails from 3/12/2001:

Pete:

> 3. With the software up, a "ps" command is a little 
> frightening, as it looks like there are 400+ processes
> on your machine. Don't worry, it's an artifact of
> Java's thread support, and your system will run fine.

Ben: Yup. The more technical explanation (as I understand it) is that there is only one type of process inside the Linux kernel, a kernel process. While regular user processes map 1-to-1 to kernel processes, each thread in a threaded application gets its own kernel process, even though, from a programming perspective, they are still running in the same userland process. That's why applications like the Java JVM or Mozilla appear to have many more processes than you would think they should. ps is also very bad at enumerating memory usage for the same reason. Threads for the most part share memory, but each one is listed as if it had allocated that much memory itself.

Mike from OpenNMS pointed out that if you want a quick summary of the number of threads in each VM, you can use the command:

ps axww | awk '/java/ { print $NF }' | sort | uniq -c | sort -n

Or you can use (on Linux):

pstree root | grep 'java'

However, this command won't tell you which VM has which threads, just the number of threads in each.

Remember that the thread counts won't match the configuration exactly. This has something to do with the fact that Java has a ThreadManager thread of its own, etc.

With NPTL (the Native POSIX Thread Library), you will see better thread handling in Linux.
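On a current Linux system, a less alarming view is one line per JVM with its thread count. The following is a minimal sketch, assuming the procps version of ps (which offers the nlwp column, the number of threads in a process) and assuming the JVM's command name is exactly "java". The function name java_thread_counts is made up for this example.

```shell
# Print "PID THREADS" for every process whose command name is "java".
# Expects lines of the form "PID NLWP COMMAND" on stdin.
java_thread_counts() {
    awk '$3 == "java" { print $1, $2 }'
}

# Typical use on Linux with procps ps (nlwp = number of threads):
#   ps -e -o pid= -o nlwp= -o comm= | java_thread_counts
```

Keeping the filter separate from the ps call makes it easy to try against saved ps output from another machine.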

Q: Why are jar_cacheXXXX.tmp files filling up my /tmp?

A: If you have been doing a kill from the command line to stop Tomcat, you may see lots of jar_cacheXXXX.tmp files in your /tmp directory (or whatever your $TMPDIR is set to for Tomcat). When Tomcat is not running, you can safely delete these files.

Tomcat will open 10 to 100 of these files when running, but cleans them all up when shut down properly. If Tomcat is stopped prematurely (by, say, a kill -9), then the files aren't deleted, and they can fill up your /tmp partition.

This problem affects Tomcat version 4.0.1 and later. It actually originates in the JDK's JarURLConnection class, which Tomcat uses in its internal classloaders.
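The cleanup described above can be wrapped in a small helper. This is only a sketch: the function name clean_jar_cache is made up here, it assumes a find that supports -maxdepth (GNU and BSD find do), and it does not check whether Tomcat is running, so stop Tomcat first.

```shell
# Remove leftover jar_cache*.tmp files from one directory
# (default: $TMPDIR if set, otherwise /tmp). Subdirectories are not touched.
# Run this only while Tomcat is stopped; a running Tomcat still uses its files.
clean_jar_cache() {
    dir=${1:-${TMPDIR:-/tmp}}
    find "$dir" -maxdepth 1 -type f -name 'jar_cache*.tmp' -exec rm -f {} +
}
```

Call it as clean_jar_cache with no argument for the default temporary directory, or pass the directory Tomcat's $TMPDIR points to.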

Q: Why do I get "can't parse argument 'RRA:AVERAGE:0.5:1:8928'"?

A: Currently, OpenNMS does not support localization, thus we recommend setting the locale to "en_US".

However, RRDTool, which we use, does. Thus, if your locale is not "en_US" you need to make a small change to the datacollection-config.xml file in /opt/OpenNMS/etc:

Change every line like

RRA:AVERAGE:0.5:12:8928

where there is a 0.5 (zero - dot - five) to:

RRA:AVERAGE:0,5:12:8928

(zero - comma - five).
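Doing this by hand for every RRA line is tedious, so the substitution can be scripted. The following is a sketch using sed; the function name rra_decimal_comma is made up for this example, and it only touches a "0.5" that directly follows an "RRA:<NAME>:" prefix, leaving other numbers in the file alone.

```shell
# Rewrite "RRA:<NAME>:0.5:" to "RRA:<NAME>:0,5:" (decimal comma).
# Reads the configuration on stdin and writes the result to stdout.
rra_decimal_comma() {
    sed 's/\(RRA:[A-Z]*:0\)\.\(5:\)/\1,\2/g'
}

# Typical use, keeping a backup of the original file:
#   cd /opt/OpenNMS/etc
#   cp datacollection-config.xml datacollection-config.xml.bak
#   rra_decimal_comma < datacollection-config.xml.bak > datacollection-config.xml
```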

Q: I installed OpenNMS, and admin/admin Does Not Log Me On, Why?

A: The OpenNMS install process has several steps. First, the product dependencies are installed and configured, such as Postgres, RRDTool and Tomcat. Then the OpenNMS packages are installed (the core, webapps and docs). As these OpenNMS packages are being installed, modifications are made to both Postgres and Tomcat. When everything is complete, you should be able to start up OpenNMS and go.

There are several ways to install OpenNMS, and some people have found that they get the login prompt when going to http://localhost:8080/opennms/ after install, but username="admin", password="admin" doesn't seem to work.

This is due to something going wrong with the install. It has nothing to do with the username and password being wrong (in case you are wondering, they are stored in /opt/OpenNMS/etc/users.xml).

Look in your install.log (it should either be in root's home directory or where you ran the install). Read it. Note any errors.

The most common error has to do with PostgreSQL. First, make sure it is running. "ps -ef | grep postmaster" should return a running postmaster process. If not, start Postgres (on Red Hat Linux you can use /sbin/service postgresql start).

If it is running, attempt to access it with "psql -U opennms opennms". If you can log in, make sure all the tables are there:

 opennms=# \d
            List of relations
       Name       |   Type   |  Owner
 -----------------+----------+----------
 assets          | table    | postgres
 distpoller      | table    | postgres
 events          | table    | postgres
 eventsnxtid     | sequence | postgres
 ifservices      | table    | postgres
 ipinterface     | table    | postgres
 node            | table    | postgres
 nodenxtid       | sequence | postgres
 notifications   | table    | postgres
 notifynxtid     | sequence | postgres
 outagenxtid     | sequence | postgres
 outages         | table    | postgres
 service         | table    | postgres
 servicenxtid    | sequence | postgres
 snmpinterface   | table    | postgres
 usersnotified   | table    | postgres
 vulnerabilities | table    | postgres
 vulnnxtid       | sequence | postgres
 vulnplugins     | table    | postgres
 (19 rows)

If not, you probably forgot to change the security settings when running Postgres 7.2 or higher.

If everything looks good, make sure the install.pl script modified the Tomcat server.xml file (the date and time will be different than the rest of the files in that directory).

I found that the failure of the DBI and DBD::Pg modules caused the admin/admin combo not to work (I was installing using RPMs). Since I hadn't discovered the FAQ-o-matic yet, I just installed these modules the Perl way (i.e., perl -MCPAN -e 'install DBI' and perl -MCPAN -e 'install DBD::Pg'). The DBD::Pg install did a lot of complaining about the POSTGRES_INCLUDE and POSTGRES_LIB environment variables, but once it installed cleanly, and I had removed and reinstalled the OpenNMS RPMs, everything seemed to work fine (at least, I am able to get into the OpenNMS web page with admin/admin).

Also, you can check:

grep -i opennms /var/tomcat4/conf/server.xml

The install.pl script is supposed to modify this file so that Tomcat knows how to authenticate requests to the OpenNMS webapp. If this file doesn't include an opennms entry, try running install.pl again:

$OPENNMS_HOME/bin/install.pl -q $OPENNMS_HOME/etc/create.sql

Given that the install fails to set up Postgres, is there a script to just set up the database? Yes, run "/opt/OpenNMS/bin/install.pl etc/create.sql".

The reason for the logon failure is probably that PostgreSQL wasn't accepting local requests to access databases when you installed OpenNMS.

(/var/pgsql/data/...) Adapt postgresql.conf so that TCP requests are allowed. Adapt the pg_hba.conf file to allow local and network access, and then:

  • either re-install OpenNMS
  • or create the OpenNMS databases yourself

I had the same problem.

Just a note on the two install commands above:

$OPENNMS_HOME/bin/install.pl -q $OPENNMS_HOME/etc/create.sql is the one that works.
/opt/OpenNMS/bin/install.pl etc/create.sql doesn't.

(Ed. note: install.pl is deprecated in 1.1.4)

If you build from source, you will need to put some links in the $TOMCAT4_HOME/server/lib directory:

lrwxrwxrwx castor-0.9.3.9.jar -> /opt/OpenNMS/lib/castor-0.9.3.9.jar
lrwxrwxrwx castor-0.9.3.9-xml.jar -> /opt/OpenNMS/lib/castor-0.9.3.9-xml.jar
lrwxrwxrwx log4j.jar -> /opt/OpenNMS/lib/log4j.jar
lrwxrwxrwx opennms_core.jar -> /opt/OpenNMS/lib/opennms_core.jar
lrwxrwxrwx opennms_services.jar -> /opt/OpenNMS/lib/opennms_services.jar
lrwxrwxrwx opennms_web.jar -> /opt/OpenNMS/lib/opennms_web.jar


Updated: --Edavison 11:46, 15 June 2006 (CDT)

Another possible problem here is that if you run Tomcat as any user other than root, which is a good security practice, then the tomcat user may not have access to read/write the config files in /opt/OpenNMS/etc. If that is the case, you cannot log in because Tomcat cannot read the usernames and passwords from users.xml.

So one solution is to:

a) add the tomcat user to the opennms group

b) add read/write permissions to the /opt/OpenNMS/etc/ and /opt/OpenNMS/log directories.

Q: Tomcat won't start, complains about JAVA_HOME, why?

A: Sun released a new version of the 1.4 JDK after the 1.0.0 RPMs were built. If you install that JDK, Tomcat will probably complain about JAVA_HOME. Edit /etc/tomcat4/conf/tomcat4.conf so that JAVA_HOME shows the correct path to your JDK.

Q: Why do I get 'FATAL 1: IDENT authentication failed for user "postgres"'?

A: An IDENT error means that PostgreSQL is using "ident" authentication, under which a Unix user other than postgres is not allowed to connect to the database as the postgres user.

The quick fix is to add the line:

host all 127.0.0.1 255.255.255.255 trust

to your "/var/lib/pgsql/data/pg_hba.conf" file. This means that anyone on the local machine is able to connect over TCP/IP as another username (i.e., 'ben' the Unix user can connect as 'postgres' the PostgreSQL user). Note that "trust" does not ask for a password.

Make sure your pg_hba.conf looks like this. There should not be two "local" entries. Once OpenNMS is installed, you can change the pg_hba.conf to your liking assuming you understand the consequences.

 # TYPE DATABASE IP_ADDRESS MASK AUTH_TYPE AUTH_ARGUMENT

 local all trust
 host all 127.0.0.1 255.255.255.255 trust

 # Using sockets credentials for improved security. Not available everywhere,
 # but works on Linux, *BSD (and probably some others)

 #local all ident sameuser

Or, if you like a more secure setup try the following:

host all 127.0.0.1 255.0.0.0 password

This will require a password for all connections over TCP/IP from localhost (such as the connection that opennms uses).

The following has worked with SLES 10:

# TYPE  DATABASE    USER        CIDR-ADDRESS          METHOD
 
# "local" is for Unix domain socket connections only
local   all         all                               trust
# IPv4 local connections:
host    all         all         127.0.0.1/32          trust
# IPv6 local connections:
host    all         all         ::1/128               trust

Q: Why does apt complain about zebra and gated?

A: If you get the following error:

 Sorry, but the following packages have unmet dependencies:
 gated: Obsoletes: zebra but 0.91a-6 is to be installed
 zebra: Obsoletes: gated but 3.6-12 is to be installed
 E: Unmet dependencies. Try 'apt-get -f install' with no packages (or specify a solution).

Uninstall either gated or zebra. Some distros (e.g., Red Hat Linux) will allow you to install these conflicting packages, which will confuse apt.

Q: Why Does OpenNMS Say My DNS Server Is Down, When It Is Up?

A: Problem: After installing OpenNMS, it discovers DNS servers, but then says they are down.

By default, OpenNMS does a lookup on "localhost". While this returns an error from most DNS servers, the receipt of the error proves that the DNS server is running.

Note: the default configuration of OpenNMS 1.9.x makes response codes 3 (NXDOMAIN) and 5 (REFUSED) fatal to the DNS poller, but one of these is likely to be the desired response to a query containing "localhost." Edit the "fatal-response-codes" parameter in poller-configuration.xml to correct this behavior.

However, Microsoft DNS servers behave differently. In order to get OpenNMS to work with Microsoft DNS servers, edit the poller-configuration.xml file, and change the value of the DNS poller "lookup" parameter to something other than "localhost", such as "opennms.org".

Q: Why are some of my XML files all one line?

A: Why are some of the files in the /opt/OpenNMS/etc directory all one line, instead of being indented? I swear they were indented at one time.

OpenNMS uses Castor to parse certain XML files. Any file that gets changed via the GUI, such as the poller and the notifications configuration files, will be written back as a single line.

It is possible to get Castor to indent the lines, but it then adds whitespace that causes OpenNMS to fail, such as adding a carriage return after an opening tag and before its closing tag.

The task of fixing that is currently available to whoever wants it (grin)

In the meantime, use /opt/OpenNMS/bin/xml.reader.pl to fix your files. The syntax would be something like:

/opt/OpenNMS/bin/xml.reader.pl -w /opt/OpenNMS/etc/poller-configuration.xml

The "-w" option overwrites the file; without it, xml.reader.pl sends the output to stdout.

Q: Why Don't My Linux Servers with the Net-SNMP Agent Show Up in Performance Reports?

A: The SNMP Data Collection How-To has a lot of information about how data collection works in OpenNMS.

A common problem is that the Net-SNMP (formerly UCD-SNMP) agent appears to be installed correctly, and snmpwalk system or some such query works fine, but data is not collected.

OpenNMS requires that it can relate an IP address (from the ipAddrTable) to an interface index (from the ifTable) before data collection can occur.

By default, the Net-SNMP agent allows only the system tree to be visible. To change this, modify the /etc/snmp/snmpd.conf file. A line like:

view all included .1 80

will pretty much open up all SNMP queries to everyone, but you can read the comments in that file and tailor it to your needs.

If that doesn't fix the issue, try the tips in the "How-To".

Q: opennms.sh status returns nothing, what's happening?

A: Make sure that you have curl installed on your system.

Q: Why does an RPM install hang on RedHat 8.0?

A: The stock RPM 4.1 package released with RH 8.0 has growing pains.

RPM activities, especially upgrades, may hang indefinitely.

See the "Status and Versions" section of http://www.rpm.org/ for the latest information and to find out whether a newer bugfix version than 4.1 has been released.

If not, that section will refer you to this bugfix document:

http://www.rpm.org/hintskinks/repairdb/

Q: Why does the webUI give me an "Unable to compile class for JSP" exception?

A: If you get an exception like the following:

 exception org.apache.jasper.JasperException: Unable to compile class for JSP

 An error occurred at line: 53 in the jsp file: /includes/header.jsp

 Generated servlet error:
 javac Compiling 1 source file
 javac /usr/share/tomcat/work/Standalone/localhost/opennms/includes/header_jsp.java: In class `org.apache.jsp.header_jsp':
 javac /usr/share/tomcat/work/Standalone/localhost/opennms/includes/header_jsp.java: In method `org.apache.jsp.header_jsp._jspService (javax.servlet.http.HttpServletRequest,javax.servlet.http.HttpServletResponse)':
 javac /usr/share/tomcat/work/Standalone/localhost/opennms/includes/header_jsp.java:97: internal compiler error: in emit_store, at java/jcf-write.c:981

and at the bottom something like:

 at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(org.apache.tomcat.util.net.TcpConnection, java.lang.Object[]) (/usr/lib/lib-org-apache-coyote-http11-4.1.27.so)
 at org.apache.tomcat.util.net.TcpWorkerThread.runIt(java.lang.Object[]) (/usr/lib/lib-org-apache-tomcat-util-4.1.27.so)
 at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run() (/usr/lib/lib-org-apache-tomcat-util-4.1.27.so)
 at java.lang.Thread.run() (/usr/lib/libgcj.so.5.0.0)
 at _Jv_ThreadRun(java.lang.Thread) (/usr/lib/libgcj.so.5.0.0)
 at GC_start_routine (/usr/lib/libgcj.so.5.0.0)
 at __clone (/lib/tls/libc-2.3.3.so)

Then you have Tomcat configured to use gcj instead of Sun's Java. Check your tomcat4.conf file and make sure JAVA_HOME is set properly.

Q: Why do I see JDBC related Exceptions in the log files?

A: There have been a couple of problems reported regarding exceptions using Postgres. These are caused by problems in the JDBC Driver for Postgres.

The first exception occurs in 1.1.3 and looks something like the following:

 java.lang.StringIndexOutOfBoundsException: String index out of range: 23
 at java.lang.String.charAt(String.java:444)
 at org.postgresql.jdbc2.ResultSet.toTimestamp(Unknown Source)
 at org.postgresql.jdbc2.ResultSet.getTimestamp(Unknown Source)

The other occurs when using OpenNMS with Postgres 7.4 and looks like the following:

java.sql.SQLException: ERROR: SET AUTOCOMMIT TO OFF is no longer supported

In both cases you need to upgrade the JDBC driver for Postgres. Go to [1].

If you are using Postgres 7.2 or 7.3, download the 7.3.x JDBC2 driver.

If you are using PostgreSQL 7.4, use its JDBC2 or JDBC3 driver.

You need to name the driver postgresql.jar and put it in the following directories:

$OPENNMS_HOME/lib
$TOMCAT/webapps/opennms/WEB-INF/lib

On Debian you also need to replace postgresql.jar in /usr/share/java. (Thanks John.)

Then restart OpenNMS and then Tomcat.

Q: Why do I get node level SNMP information, but no interface level information?

A: I was noticing this on a Fedora Core 1 machine with multiple NICs. The node level information would be produced just fine, such as "Number of Users" and "CPU Utilization", but there was no interface information, such as traffic.

I performed some snmpwalk commands:

 snmpwalk -v 1 -c public localhost iftable
 IF-MIB::ifIndex.1 = INTEGER: 1
 IF-MIB::ifIndex.2 = INTEGER: 2
 IF-MIB::ifIndex.3 = INTEGER: 3
 IF-MIB::ifIndex.4 = INTEGER: 4
 IF-MIB::ifDescr.1 = STRING: lo
 IF-MIB::ifDescr.2 = STRING: eth0
 IF-MIB::ifDescr.3 = STRING: eth1
 IF-MIB::ifDescr.4 = STRING: eth2

So far so good, but looking at the ipAddrTable:

 snmpwalk -v 1 -c public localhost ipaddrtable
 IP-MIB::ipAdEntAddr.10.1.4.12 = IpAddress: 10.1.4.12
 IP-MIB::ipAdEntAddr.127.0.0.1 = IpAddress: 127.0.0.1
 IP-MIB::ipAdEntIfIndex.10.1.4.12 = INTEGER: 6
 IP-MIB::ipAdEntIfIndex.127.0.0.1 = INTEGER: 1

You'll see the ifIndex of 6, which doesn't exist anywhere in the ifTable.
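Spotting a mismatch like this by eye gets harder on devices with many interfaces. The comparison can be sketched with awk, assuming the two walks have been saved to files and that the output uses the IF-MIB::/IP-MIB:: symbolic names shown above. The function name orphan_ifindexes is made up for this example.

```shell
# Print "address ifIndex" for every ipAddrTable entry whose ifIndex does not
# appear in the ifTable.
#   $1 = saved output of the ifTable walk
#   $2 = saved output of the ipAddrTable walk
orphan_ifindexes() {
    awk '
        # First file: remember every ifIndex the agent reports.
        NR == FNR { if ($1 ~ /^IF-MIB::ifIndex\./) known[$NF] = 1; next }
        # Second file: check each ipAdEntIfIndex against that set.
        $1 ~ /^IP-MIB::ipAdEntIfIndex\./ {
            addr = $1
            sub(/^IP-MIB::ipAdEntIfIndex\./, "", addr)
            if (!($NF in known)) print addr, $NF
        }
    ' "$1" "$2"
}
```

With the two walks above saved to files, this prints "10.1.4.12 6". No output means every address maps to an interface the agent actually lists.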

It turns out that this is a bug in net-snmp 5.1-2.1 which is the latest net-snmp for Fedora Core 1. I downloaded the source RPMs for 5.2.1, but was unable to get all of the dependencies worked out, so I found some rpms that did the trick. Now the snmpwalk commands return useful information and I get interface statistics.

Q: OpenNMS stops working after about 1 hour or intermittent servlet crashes are seen in the Web GUI.

A: This issue should be fixed in 1.1.4, but since it had a great impact on OpenNMS the original FAQ entry is left here as a reference.

OpenNMS will stop working after a period of time - this varies with the number of threads configured for the various daemons.

Intermittent servlet errors are encountered in the Web GUI - this seems to vary with the amount of concurrent usage of the GUI.

This has so far been reported with:

  • Debian Woody/Sid, RedHat 8.0/(7.x?), SuSE 8.1
  • OpenNMS 1.0.1
  • Sun JDK 1.4.0, 1.4.1, 1.4.1_01
  • Tomcat 4.0.3-0

Check for a file like hs_err_pid9499.log in the directory from which OpenNMS or Tomcat was launched.

It contains:

 Unexpected Signal : 11 occurred at PC=0x404D4324
 Function=size_given_klass__7oopDescP5Klass+0x44
 Library=/usr/java/j2sdk1.4.1_01/jre/lib/i386/server/libjvm.so

Looking at the bottom of the dump file shows this:

 # 
 # HotSpot Virtual Machine Error : 11
 # Error ID : 4F530E43505002E6
 # Please report this error at
 # http://java.sun.com/cgi-bin/bugreport.cgi
 #
 # Java VM: Java HotSpot(TM) Server VM (1.4.1_01-b01 mixed mode)

Doing a Google search on the error ID shows that it appears with other apps as well.

Here is an excerpt from one article:

I have seen this problem come up with a variety of applications, most notably JBoss 3.X. The way I got around it was to specify the -Xrs and -Xint options to the VM before running any application.

 Doing a "man java" gives:
       -Xint               Operates in interpreted-only mode.  Compilation  to
                           native code is disabled, and all bytecodes are exe-
                           cuted by the interpreter.  The performance benefits
                           offered  by the Java HotSpot VMs adaptive compiler
                           will not be present in this mode.


       -Xrs                Reduce usage of operating-system  signals  by  Java
                           virtual machine (JVM).

                           Sun's  JVM  catches  signals  to implement shutdown
                           hooks for abnormal JVM termination.  The  JVM  uses
                           SIGHUP, SIGINT, and SIGTERM to initiate the running
                           of shutdown hooks. The JVM uses SIGQUIT to  perform
                           thread dumps.

                           Applications  that embed the JVM frequently need to
                           trap signals like SIGINT or SIGTERM,  and  in  such
                           cases  there  is  the  possibility  of interference
                           between the applications' signal handlers  and  the
                           JVM shutdown-hooks facility.

                           To  avoid such interference, the -Xrs option can be
                           used to turn off the  JVM  shutdown-hooks  feature.
                           When  -Xrs  is  used,  the signal masks for SIGINT,
                           SIGTERM, SIGHUP, and SIGQUIT are not changed by the
                           JVM,  and signal handlers for these signals are not
                           installed.

 Note that -X options are non-standard and may change in the future.

Running with "-Xint -Xrs" results in stable operation.

In /opt/OpenNMS/bin/opennms.sh, find "HOTSPOT" and add the "X" flags like this:

 if [ -n "$HOTSPOT" -a "$HOTSPOT" = true ] ; then
        JAVA_CMD="$JAVA_CMD -server -Xint -Xrs"
 fi

Also in /etc/tomcat4/conf/tomcat4.conf change the CATALINA_OPTS line like:

 export CATALINA_OPTS="-Xint -Xrs -DTOMCATLAUNCH=true...

That's it.

Here's an update from DJ Gregor:

I just received an email from Sun saying that the crashing problem with Java 1.4.1 (and other versions, too) has been assigned a bug number and is being worked on.

See the attached message for details, and here are the useful excerpts (including a workaround):

 > This bug is being tracked under the following Bug-ID: 4724356
 > ...
 > Feel free to check the status of Java(TM) bugs via the JDC at:
 > http://developer.java.sun.com/developer/bugParade/index.html
 >
 > The work around is to increase the amount of memory available
 > in the permanent generation (used to store class objects and
 > related metadata). Add this specification to the command line
 > that is used to launch the JVM:
 > -XX:MaxPermSize=128m
 > Use larger sizes if necessary.
 >
 > For more information, refer to this document:
 > http://wireless.java.sun.com/midp/articles/garbagecollection2

Over the last few months I have seen the occasional post on OpenNMS dying with a Java HotSpot error in output.log complaining of an "Unexpected Signal 11".

I was recently on a machine that was producing these errors frequently, and the problem had to do with memory.

By default, OpenNMS allocates 256MB to the Java Heap Size for OpenNMS. This, combined with the default 64MB for Tomcat4 can exceed the available memory on some systems (systems with 256MB or 512MB and other processes).

To correct this, edit $OPENNMS_HOME/bin/opennms.sh, search for "HEAP" and lower it. This fixed the problem for me. I was not out of RAM, but was getting a lot of "Too many open files" messages in my logs, as well as the above crashes.

As well as the above two fixes, I changed in /opt/OpenNMS/bin/opennms.sh the line:

ulimit -s 2048

to:

ulimit -s 8192
ulimit -n 10240

which combined with the above two fixes cleared everything up.

Q: How Can I Best Test My XML Files?

A: From Eric Burghard on the OpenNMS Discuss mailing list.

I played with xml and xsd files this weekend (taken from OpenNMS' CVS tree). My main goal was to be able to validate each of my .xml files. I had to read some specs about XML and XSDs because:

  • the XSDs contain several little misspellings and syntax errors
  • the XML files were not associated with their schemas
  • I'm an XML newbie (and still am)

For those who are interested here is the procedure:

First, you need an XML validator of some kind. I know of two of them:

Second, you need all the XSD files. Get the source package or checkout the CVS tree. I put all the .xsd files for convenience in the same place as my .xml: /opt/OpenNMS/etc

schema validation: To be sure that your xsd file is valid, you need to... validate it. Load the xsd with Oxygen and push the validate button. Here are the errors I encountered in various .xsd files:

unqualified references

error message: error: E src-resolve.4: Components from namespace 'http://www.w3.org/2001/XMLSchema' are not referenceable from schema document 'file:/opt/OpenNMS/etc/event.xsd'.

description: Local types are not referenceable because of the lack of a namespace definition.

workaround: Add the attributes elementFormDefault="qualified" and xmlns:evt="http://xmlns.opennms.org/xsd/event" to the schema root tag (change the name to reflect the target's namespace). Then change all references by prefixing them with the new namespace's alias: ref="evt:sometype"

unnecessary type specification

error message: E src-attribute.4: Attribute 'type' have both a type attribute and a anonymous simpleType child..

description: <attribute ... type="string"> <simpleType> ..<restriction base="string"> </simpleType> </attribute>

workaround: remove the type="string" attribute

misplaced include statement

error message: E sch-props-correct.1: schema components of type 'include' cannot occur after declarations or are not permitted as children of a element.

description: self-explained

workaround: guess what? Put the include tag at the beginning.

minOccurs misspelled

error message: E s4s-att-not-allowed: Attribute 'minOcccurs' cannot appear in element 'element'.

description: nothing

workaround: replace by minOccurs

pattern syntax error

error message: error: E cvc-pattern-valid: Value '12:30:00' is not facet-valid with respect to pattern '(?:(?:^[0-9]{1,2}-[A-Za-z]{3}-[1-2][0-9]{3}[ ][ ]*(?:[0-9]{1,2}:){2}[0-9]{2})|(?:^(?:[0-9]{1,2}:){2}[0-9]{2}))'.

description: Some regexps (the value attribute of the pattern tag) contain ^ and $ (which mean start and end of line). They are not supported by the XML Schema spec and are not necessary anyway, since a pattern always has to match the whole value.

workaround: Just remove all leading ^ and trailing $.

value and default confusion

errors: E s4s-att-not-allowed: Attribute 'value' cannot appear in element 'attribute'.

description: I think this is because value attribute is used instead of default.

workaround: replace value with default if use="optional"; delete the value attribute if use="required"

XML validation:

Now it's time to validate your .xml file with your valid and well-formed .xsd file. You have to specify, inside the .xml, the schema file that will be used during validation. Add these attributes to the root tag (change the namespace to reflect the one defined in the .xsd):

xmlns="http://xmlns.opennms.org/xsd/config/poller" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://xmlns.opennms.org/xsd/config/poller file:/opt/OpenNMS/etc/poller-configuration.xsd"

Push the validate button and say goodbye to parse errors and log scanning. libxml (1 or 2) includes xmllint, which will also perform validation.
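For those who prefer the command line, the xmllint route can be scripted. This is a minimal sketch, assuming each schema has been saved next to its XML file under the same base name (poller-configuration.xsd beside poller-configuration.xml). That pairing is this example's convention, not something OpenNMS guarantees, and the function name validate_configs is made up here.

```shell
# Validate every .xml file in a directory against the .xsd of the same base
# name, skipping files that have no schema next to them.
# Returns non-zero if any file fails validation.
validate_configs() {
    dir=${1:-/opt/OpenNMS/etc}
    status=0
    for xml in "$dir"/*.xml; do
        xsd=${xml%.xml}.xsd
        [ -f "$xml" ] && [ -f "$xsd" ] || continue
        # --noout suppresses the document dump; --schema selects the XSD.
        xmllint --noout --schema "$xsd" "$xml" || status=1
    done
    return "$status"
}
```

xmllint reports each file's result on stderr, so a run over the whole directory shows at a glance which files need attention.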

Q: Why Do I Get an Invalid ifIndex Error?

A: In order for OpenNMS to collect SNMP information, the IP address must map to a valid ifIndex. Sometimes this fails:

2002-12-27 20:13:01,310 WARN main Collectd: Schedule Interface: Unable to schedule 192.168.2.20 for service SNMP, reason: Unable to retrieve ifIndex for interface 192.168.2.20

In case of a failure, one of two things is happening:

  • The entry in the ipAddrTable does not map to a valid ifIndex
  • More probable: certain SNMP agents fail when an snmpgetnext command is issued.

To test the first scenario, simply do an snmpwalk on the ipAddrTable and ensure that, for the interface you are trying to collect on (ipAdEntAddr), there is an ifIndex (ipAdEntIfIndex) that matches an ifIndex in the ifTable.

If one exists, you may have the second issue. Here is a test you can try.

 $ snmpgetnext ip.ipAddrTable.ipAddrEntry.ipAdEntAddr \
       ip.ipAddrTable.ipAddrEntry.ipAdEntNetMask \
       ip.ipAddrTable.ipAddrEntry.ipAdEntBcastAddr

If the SNMP agent is handling these requests correctly you should see something like:

 ip.ipAddrTable.ipAddrEntry.ipAdEntAddr.127.0.0.1 = 127.0.0.1
 ip.ipAddrTable.ipAddrEntry.ipAdEntNetMask.127.0.0.1 = 255.0.0.0
 ip.ipAddrTable.ipAddrEntry.ipAdEntBcastAddr.127.0.0.1 = 1

If the SNMP agent is NOT handling these requests correctly you will see something like:

 ip.ipAddrTable.ipAddrEntry.ipAdEntAddr.127.0.0.1=127.0.0.1
 ip.ipAddrTable.ipAddrEntry.ipAdEntNetMask.127.0.0.1=127.0.0.1
 ip.ipAddrTable.ipAddrEntry.ipAdEntBcastAddr.127.0.0.1=127.0.0.1

If this is the case, then at the moment there is nothing you can do. The way OpenNMS works is that all the necessary information we need for data collection is contained in one "get" request. The request is sent and the thread closes. When the reply comes back, a new thread is started and the information is added to the database.

Note that this method works fine if the vendor supports SNMP correctly. If you have support with them, please open a ticket and see if they will correct the problem. Making these requests individually would require quite a bit of new code to be written, and since the problem is rare (it occurs mainly on older HP printers) we haven't been able to spare the time.

Q: How are node labels determined?

A: Node labels in OpenNMS are determined in the following order:

  • User Defined
  • DNS lookup
  • SMB (NetBIOS)
  • SNMP
  • IP Address

All node labels can be set by the user on the node's page in OpenNMS, and a user defined label supersedes all other methods.

For devices with more than one interface, the lowest numbered interface is used.

If a node supports SNMP, the Primary SNMP interface will be used as the IP address to look up when determining the node label. Also, in version 1.1 and beyond, the lowest non-127.*.*.* software loopback address will be set as the Primary SNMP interface.

If a node label changes, check out the provisiond.log. You should see a database dump listing what was known about the node before the node scan and what was determined after it.

Note: A node can be written to the database before the SNMP service is discovered. New nodes might see their labels change with the first rescan.
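The precedence order above amounts to "take the first source that yields a value". The following sketch illustrates the ordering only; it is not how OpenNMS implements it, and the function name pick_node_label and the sample labels are made up for this example.

```shell
# Pick a node label by precedence: user-defined, DNS, SMB (NetBIOS), SNMP,
# then the IP address. An empty argument means "that source found nothing".
# Usage: pick_node_label USER DNS SMB SNMP IP
pick_node_label() {
    for candidate in "$@"; do
        if [ -n "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# pick_node_label '' 'router1.example.org' '' 'rtr-sysname' '10.1.4.12'
# prints router1.example.org, because no user-defined label is set.
```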

Q: How Do I Log Out of the webUI?

A: Through OpenNMS 1.2, BASIC authentication is used. In order to log out and log back in, you have to close all instances of the browser and start a new session. This issue is also explained in further detail in the archives.

Starting in OpenNMS 1.3.2, OpenNMS uses form-based authentication, and once logged in there is a link in the upper right-hand corner of the screen to log out next to your username.

Q: I upgraded to 1.1.1. Why does "Manage/Unmanage" not work?

A: If you upgrade to OpenNMS 1.1.1, you may get this error when trying to access the Admin page to manage and unmanage interfaces and services:

javax.servlet.ServletException: ERROR: Function 'inet(varchar)' does not exist

This is because new code was added to sort the interfaces by IP address using the inet function that was introduced in Postgres 7.2.

While Postgres 7.2 is included with Red Hat Linux 7.3 and beyond, it was not listed as a requirement for OpenNMS in the package file, so the package will let you install and run on the old Postgres 7.1.

To fix this, you need to upgrade to Postgres 7.2, but it is not just a case of installing a new RPM. You have to export your database, upgrade, then import it back.

 su - postgres
 pg_dumpall > /tmp/old_data
 exit 
 # upgrade postgresql to 7.2 or beyond
 su - postgres
 psql -U postgres -f /tmp/old_data template1

Q: Why doesn't the dhcpd process ever start?

A: OpenNMS is best run on a server with a static IP address. There are a number of reasons for this (setting the trap destination, for example) but also since the OpenNMS dhcpd process acts like a user, it has to bind to the same port that would be used to set a DHCP address on the server itself.

If you must run using DHCP, edit service-configuration.xml in $OPENNMS_HOME/etc and comment out the section:

<service>
        <name>OpenNMS:Name=Dhcpd</name>
        <class-name>org.opennms.netmgt.dhcpd.jmx.Dhcpd</class-name>
        <invoke pass="1" method="start"/>
        <invoke at="status" pass="0" method="status"/>
        <invoke at="stop" pass="0" method="stop"/>
</service>

Then restart OpenNMS. Note that without dhcpd you will not be able to monitor DHCP servers.

Q: I can snmpwalk a device, but OpenNMS won't collect data on it, why?

A1: While walking a device's MIB using the snmpwalk utility is a good initial test for whether the device's SNMP agent is configured appropriately, it's important to note that the snmpwalk utility included in most modern UNIX-like operating system distributions uses the Net-SNMP libraries, which are much more forgiving of all kinds of SNMP protocol violations than is the SNMP4J library that OpenNMS uses by default. Therefore a successful snmpwalk result is not 100% indicative that an SNMP agent is behaving correctly.

A2: OpenNMS was originally designed to monitor IP services, and as such tends to be IP centric. SNMP data, however, can be pretty free-form, so there had to be a way to associate SNMP data with a particular IP interface.

The way this was done was to use the ipAddrTable and map the ifIndex given there to the ifTable.

In addition, since there is only one SNMP agent per device (usually), rather than poll for SNMP data through each available interface, the concept of a "primary" SNMP interface was introduced. This interface would be used for all SNMP requests to the device.

In order to be a primary SNMP interface, several things must occur.

  • the IP address for the interface must exist in a collectd package.
  • the IP address must map to a valid ifIndex (originally in order to map the data to a particular IP address).
  • if more than one interface qualifies to be a primary interface, the lowest numbered interface is marked as "primary" and the others as "secondary", unless a loopback address exists with a non 127.*.*.* IP address and meets the qualifications above, in which case it is chosen.
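
The selection rules above can be sketched in Python. This is only an illustration, not the OpenNMS implementation: the Interface type and its field names are hypothetical, "lowest numbered" is assumed to mean the lowest IP address, and only IPv4 addresses are considered.

```python
from dataclasses import dataclass
from ipaddress import ip_address
from typing import Iterable, Optional


@dataclass(frozen=True)
class Interface:
    """Hypothetical record for one IP interface on a node."""
    ip: str                    # IP address as listed in ipAddrTable
    if_index: Optional[int]    # ifIndex mapped from ipAddrTable, None if unmapped
    in_collectd_package: bool  # True if a collectd package covers this address
    is_loopback: bool          # True if the interface is a loopback interface


def choose_primary(interfaces: Iterable[Interface]) -> Optional[Interface]:
    """Return the primary SNMP interface candidate, or None if none qualifies."""
    # Must be in a collectd package and map to a valid ifIndex.
    candidates = [
        i for i in interfaces
        if i.in_collectd_package and i.if_index is not None and i.if_index > 0
    ]
    if not candidates:
        return None
    # Exception: a qualifying loopback interface outside 127.0.0.0/8 wins.
    loopbacks = [
        i for i in candidates
        if i.is_loopback and not ip_address(i.ip).is_loopback
    ]
    pool = loopbacks or candidates
    # Assumption: "lowest numbered" means the lowest IP address.
    return min(pool, key=lambda i: ip_address(i.ip))
```

If choose_primary returns None for a node, that node has no valid primary interface candidate, which is the situation described in the next paragraph.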

However, several people have reported the need to monitor SNMP on a device that either does not have a valid primary interface candidate or they wish to use another address altogether. At the moment a solution does not exist, although we hope to have something in place soon.

Possible workarounds include directly modifying the database and substituting in an ifIndex. This will work for a while, but may be overwritten during the next Provisiond node scan.

Q: Why Does My Windows DHCP Server Show as Down?

A: To monitor Windows DHCP servers, you need to edit the dhcpd-configuration.xml file and put in the MAC address of the OpenNMS server in the macAddress field.

On *nix machines, /sbin/ifconfig -a will usually show you the MAC address.

Q: Why do I get opennms startup failed?

A: More recent versions of OpenNMS have a "START_TIMEOUT" value set. It can be found either in $OPENNMS_HOME/bin/opennms.sh or in $OPENNMS_HOME/etc/opennms.conf (it is not certain which one applies, so check both). If you see "opennms startup failed", check the status using "opennms.sh -v status". If it reports start_pending, you will likely need to increase the START_TIMEOUT value; 60-75 seconds should work on slower machines.

Q: Looking in output.log I see lots of references to 'java.lang.Exception' that appears to be 'Caused by: org.jrobin.core.RrdException: Bad sample timestamp ..... Last update time was ....., at least one second step is required'

A: Taken from a message from Tarus Balog at: http://sourceforge.net/mailarchive/message.php?msg_id=11898415

OpenNMS stores RRD data in the following manner:

Node level data is stored in

$OPENNMS_HOME/share/rrd/snmp/[nodeid]

and interface level data is stored in

$OPENNMS_HOME/share/rrd/snmp/[nodeid]/[ifdescr+MAC]

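If you have two interfaces with the same ifDescr and same MAC address, OpenNMS will collect data on both of them, but then try to write it to the same file, say ifInOctets.rrd. The second write does not carry a timestamp at least one second later than the first, so JRobin rejects it with the error above.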

You can usually safely ignore this error.
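
If you want to know which interfaces collide, you can group them by (ifDescr, MAC). The Python sketch below assumes you already have the ifIndex, ifDescr and MAC address of each interface (for example from an SNMP walk of the ifTable); the function name and the input format are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_rrd_collisions(
    interfaces: Iterable[Tuple[int, str, str]]
) -> Dict[Tuple[str, str], List[int]]:
    """Group ifIndex values by (ifDescr, MAC); keep groups with more than one member."""
    groups = defaultdict(list)
    for if_index, if_descr, mac in interfaces:
        # MAC addresses are compared case-insensitively.
        groups[(if_descr, mac.lower())].append(if_index)
    return {key: idxs for key, idxs in groups.items() if len(idxs) > 1}
```

Every key in the result is an ifDescr and MAC pair shared by two or more interfaces, so those interfaces would be written to the same directory.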

Q: I switched to JRobin and now no graphs show up. Trying to view the graphs directly gives me an exception. I already switched on java.awt.headless; what gives?

A: Taken from IRC troubleshooting with DJ Gregor and Mike Huot

Search for the string 'x11' in this page (DJ's suggestion that put me on the right track)

If you do not have a full installation of the X Window System on your OpenNMS server, the Java runtime cannot access some graphics routines that it needs even in "headless" mode. Check whether your JAVA_HOME/jre/lib/PLATFORM/libawt.so can find all the X libraries it needs. On my Linux Ubuntu 5.10 (Breezy Badger) system, server install type, I had most of the X libraries but was missing libXp.so:

$ ldd /opt/j2sdk1.4.2_11/jre/lib/i386/libawt.so
        linux-gate.so.1 =>  (0xffffe000)
        libmlib_image.so => not found
        libjvm.so => not found
  ====> libXp.so.6 => not found                             <====
        libXt.so.6 => /usr/lib/libXt.so.6 (0xb7c84000)
        libXext.so.6 => /usr/lib/libXext.so.6 (0xb7c77000)
        libXtst.so.6 => /usr/lib/libXtst.so.6 (0xb7c72000)
        libX11.so.6 => /usr/lib/libX11.so.6 (0xb7bb2000)
        libm.so.6 => /lib/tls/i686/cmov/libm.so.6 (0xb7b8f000)
        libdl.so.2 => /lib/tls/i686/cmov/libdl.so.2 (0xb7b8b000)
        libjava.so => not found
        libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0xb7a5c000)
        libSM.so.6 => /usr/lib/libSM.so.6 (0xb7a55000)
        libICE.so.6 => /usr/lib/libICE.so.6 (0xb7a3c000)
        libXau.so.6 => /usr/lib/libXau.so.6 (0xb7a39000)
        libXdmcp.so.6 => /usr/lib/libXdmcp.so.6 (0xb7a34000)
        /lib/ld-linux.so.2 (0x80000000)

The missing libmlib_image.so and libjvm.so seem benign. Running apt-get install libxp6 resolved the libXp.so.6 link failure. After restarting both OpenNMS and Tomcat, JRobin graphs work beautifully. I would expect to see this problem also on Debian GNU/Linux server-profile systems and heavily minimalized Solaris ones.

Another symptom that may indicate you are having this issue is that the outage graphs in the front page in OpenNMS 1.3.0 and later are absent and replaced with text labels.
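
To pick out the unresolved libraries from ldd output like the listing above, a small Python helper is enough. This is a sketch; the function name is hypothetical, and it only looks for lines of the form "name => not found".

```python
import re
from typing import List


def missing_libraries(ldd_output: str) -> List[str]:
    """Return the library names that ldd reported as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Matches e.g. "libXp.so.6 => not found".
        match = re.search(r"(\S+)\s+=>\s+not found", line)
        if match:
            missing.append(match.group(1))
    return missing
```

You can feed it the output of subprocess.run(["ldd", path_to_libawt], capture_output=True, text=True).stdout. As noted above, some entries (libjvm.so, for instance) may be reported as missing yet be benign, so treat the result as a list of things to check rather than a list of errors.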

Q: SNMP data collection fails when I try to read from port 260/udp in order to collect Check Point data

A: It's likely that the SNMP daemon listening on port 260/udp is providing only the Check Point private enterprise MIB (rooted at 1.3.6.1.4.1.2620) and absolutely nothing else. That means no MIB-2 system table, no ifTable, no ipAddrTable, nothing but Check Point information. The critical item that is missing (from the system table) is the sysObjectID object, which would tell OpenNMS what kind of device it's dealing with. Without this information, there is no way for OpenNMS to determine what data it should collect from the agent.

You can work around this problem by manually hacking the database. DO NOT ATTEMPT UNLESS YOU KNOW EXACTLY WHAT YOU ARE DOING. Update the node table in the OpenNMS database, setting the nodesysoid column for your Check Point hosts to e.g. .1.3.6.1.4.1.2620.1.1. Do not ask me to give you the exact SQL statement to do this -- if you can't figure it out, you need to have a better understanding before you try something like this. If you have a datacollection package properly configured that matches on this system OID, it should start working after a rescan.

Note that the Check Point SNMP agent (cpsnmpd) is meant to be used only by Check Point SmartView Status, which is why it is so lacking in information that would be useful to OpenNMS or a like product. Also note that the cpsnmpd distributed with Firewall-1 versions prior to R55 (NG with AI) is very fragile and should not be used for anything at all -- the agent is likely to fall over when walked. On FW-1/VPN-1 releases R55 or later running on SecurePlatform (but not Red Hat / RHEL or Crossbeam XSLinux) and possibly Nokia, you may find that there is a master agent on udp/161 that can "pull in" the Check Point MIBs by running cpsnmpd as an AgentX subagent. You can try enabling this functionality in cpconfig. Crossbeam X-Series devices have a separate issue that causes all the VAP interfaces to get reparented to the CPM; you can avoid it by never letting OpenNMS discover the CPM or (advanced topic) by restricting the view exposed by the SNMP daemon on the CPM.

Q: I get lots of 30 second outages

A: See 30 second outage.

Q: OpenNMS 1.2.x won't start on a Linux system, I get "Could not initialize IcmpSocket" errors

A: If you get messages like these in output.log:

Could not initialize IcmpSocket: null
java.lang.NoSuchFieldError
<<No stacktrace available>>

The likely problem is that OpenNMS is attempting to run under GIJ, the GNU interpreter for Java. GIJ is not suitable for running OpenNMS; you can verify that this is the problem by running the following command:

`head -1 /opt/OpenNMS/etc/java.conf` -version

If you see gij (GNU libgcj) in the output, then OpenNMS is using GIJ. If you do not have a Sun JDK installed on your system, you will need to get one from Sun (just the latest SE version is fine, you don't need EE or NetBeans). Then re-run /opt/OpenNMS/bin/runjava -s or /opt/OpenNMS/bin/runjava -S to set the Java interpreter to be used for OpenNMS. The startup should now work.
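
The same check can be scripted. The Python sketch below assumes, as the command above implies, that the first line of java.conf is the path to the Java binary; the function names are hypothetical.

```python
import subprocess


def uses_gij(version_output: str) -> bool:
    """Return True if the '-version' output identifies GIJ."""
    return "gij" in version_output.lower()


def java_version_output(java_conf: str = "/opt/OpenNMS/etc/java.conf") -> str:
    """Run '<java> -version' for the interpreter named on the first line of java.conf."""
    with open(java_conf) as fh:
        java_bin = fh.readline().strip()
    result = subprocess.run([java_bin, "-version"], capture_output=True, text=True)
    # Java interpreters commonly print version information on stderr, so return both streams.
    return result.stdout + result.stderr
```

Calling uses_gij(java_version_output()) on the OpenNMS server returns True when the configured interpreter is GIJ.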

Q: Tomcat fails to start properly on 1.3.2 and I see an "out of PermGen space" error in catalina.out (or similar)

A: You probably need to tweak the heap size on Tomcat startup. See the Java 5 Heap Notes.

Q: I upgraded to 1.3.2 and now my resource graphs aren't showing up

A: If you were using rrdtool via JNI (as opposed to JRobin), it's likely that the change in the default RRD strategy has bitten you. There are two solutions for this problem. The first is to change the RRD strategy for your upgraded installation back to rrdtool / JNI. Edit $OPENNMS_HOME/etc/rrd-configuration.properties and uncomment the line that reads:

# To switch to the JNI implementation uncomment the following line:
#org.opennms.rrd.strategyClass=org.opennms.netmgt.rrd.rrdtool.JniRrdStrategy

The second solution is to use the JRobinConverter to convert your RRD files, as JRobin is unable to read ones created by the rrdtool / JNI strategy.


Q: My thresholds never trigger, even when the value clearly exceeds the threshold

A: If you are using 1.2.8 and above in the 1.2 line or 1.3.2 in the 1.3 line, you may need to use the range parameter in your thresholding package definition.

Q: I see an ERROR (PollerBackEnd failing to init) in manager.log and OpenNMS won't start.

A: If you see the following:

ERROR [Main] Invoker: An error occurred invoking operation init on MBean OpenNMS:Name=PollerBackEnd: javax.management.RuntimeMBeanException: RuntimeException thrown in operation init
javax.management.RuntimeMBeanException: RuntimeException thrown in operation init

And further down in the dump:

Caused by: org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'pollerBackEnd-rmi' defined in class path resource [META-INF/opennms/applicationContext-exportedPollerBackEnd.xml]: Invocation of init method failed; nested exception is java.rmi.server.ExportException: internal error: ObjID already in use

Check your host name resolution (/etc/hosts, etc.). Your hostname must resolve to your IP address and vice versa. You can use "getent hosts <host name | IP address>" on most modern UNIXes to look this up.
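
The forward and reverse lookup can also be checked from Python. This is a sketch under some assumptions: the function name is hypothetical, names are compared by their first label only, and the check does not verify which address the name resolves to (a loopback address would still pass if it maps back to the same name).

```python
import socket
from typing import Callable


def resolves_both_ways(
    hostname: str,
    forward: Callable[[str], str] = socket.gethostbyname,
    reverse: Callable[[str], str] = lambda ip: socket.gethostbyaddr(ip)[0],
) -> bool:
    """Return True if hostname -> IP -> name leads back to the same host name."""
    try:
        ip = forward(hostname)
        name = reverse(ip)
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError.
        return False

    def short(n: str) -> str:
        # Compare only the first label, ignoring case and domain suffix.
        return n.split(".")[0].lower()

    return short(name) == short(hostname)
```

On the OpenNMS server, resolves_both_ways(socket.gethostname()) should return True. The forward and reverse parameters exist only so the lookup functions can be replaced when testing.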

Q: I try to start OpenNMS but I get an error "Caused by: java.lang.OutOfMemoryError: unable to create new native thread"

A: Try commenting-out these lines in $OPENNMS_HOME/bin/opennms.sh:

ulimit -s 8192 > /dev/null 2>&1
ulimit -n 10240 > /dev/null 2>&1

Q: Why is the time zone wrong in OpenNMS?

A: See this article. Additional information can be found at: this article and this article.

Q: OpenNMS detects snmp on some of my systems, but says "not monitored". What's wrong?

A: Nothing is wrong--this is normal and is due to the default configuration. OpenNMS will detect SNMP on nodes, but by default the poller isn't configured to monitor the SNMP service. In general, you don't care if the SNMP service fails on a node, other than the fact that you can't collect data over SNMP. It's a bit redundant, anyway, since the data collector ("collectd") sends events that can generate notifications if it fails to collect data from a node.

Note that polling is separate from data collection. The poller monitors a service for simple up or down status, creates an outage when a service goes down, and the outage can trigger notifications to be sent (e.g.: the "service HTTP is down on node BigImportantWebServer" notification that you want to get on your pager). Data collection is separate and collects performance data from SNMP (and other protocols, too, but SNMP is by far the most common).

It is possible to not monitor a service with the poller, like SNMP, and still collect data via that service, which is indeed the default configuration for SNMP.

Q: OpenNMS doesn't start up, and I get a "pollerBackEnd" error: "Port already in use: 1099"

A: A legacy version of the remote poller back-end that runs in the OpenNMS daemon is trying to use port 1099 and you already have some process listening on port 1099 (likely another Java process). If you aren't doing Remote Monitoring, disable the poller back end and you'll be fine. Edit service-configuration.xml and comment-out the entire <service> section for "OpenNMS:Name=PollerBackEnd". It should look something like this:

<!--
         <service>
                 <name>OpenNMS:Name=PollerBackEnd</name>
                 <class-name>org.opennms.netmgt.poller.jmx.RemotePollerBackEnd</class-name>
                 <invoke at="start" pass="0" method="init"/>
                 <invoke at="start" pass="1" method="start"/>
                 <invoke at="status" pass="0" method="status"/>
                 <invoke at="stop" pass="0" method="stop"/>
         </service>
-->

The legacy version of the remote poller interface inside OpenNMS uses Java RMI in the poller back-end to listen for and respond to requests from remote pollers. The trick comes in where other Java programs may also be using RMI on the same host, and they are likely using the same port, 1099. These ports can be reconfigured inside opennms.properties with the following system properties:

# Specify the RMI ports when using the legacy RMI IPC interface
#opennms.poller.server.serverPort=1199
#opennms.poller.server.registryPort=1099
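
Before starting OpenNMS, you can check whether something is already listening on a port such as 1099. The Python sketch below is one way to do it; the function name is hypothetical, and it only tests TCP on the given address (127.0.0.1 by default).

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        # connect_ex returns 0 on success and an errno value otherwise.
        return s.connect_ex((host, port)) == 0
```

If port_in_use(1099) returns True while OpenNMS is stopped, another process holds the port, and you can either disable the poller back end or move it to other ports as described above.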

Q: OpenNMS keeps on running out of memory with errors like "java.lang.OutOfMemoryError: Java heap space" in output.log

A: It can be one of a few things, so the possibilities are listed below in order of likelihood:

  1. Your disk subsystem can't keep up with RRD file writes and the in-memory RRD update queue in the OpenNMS Java daemon grows until the Java heap is full.
  2. Your CPU or disk subsystem can't keep up with event processing or event/alarm persistence/de-duplication and the in-memory event queue in the OpenNMS Java daemon grows until the Java heap is full.
  3. Your OpenNMS system is handling a large number of nodes and interfaces and needs a larger heap to handle everything that's cached in memory within the OpenNMS Java daemon.

For the RRD problem, you can verify if this is the case (and it almost always is the problem) by enabling DEBUG logging for queued.log and looking for lines containing "QS" in that log file. Look for totalOperationsPending and if that's large (in the tens of thousands to hundreds of thousands), then queued RRD updates are likely your problem. A few possible fixes:

  1. Persist less to RRD files. E.g.: collect on fewer nodes, fewer interfaces, collect fewer data points.
  2. Persist less often.
  3. Speed up your disk subsystem. On Linux, try to stay away from LVM, put RRD files on their own filesystem with RAID 0+1 (NOT RAID-5!!!). Battery-backed write cache may help if the cache is large enough to cache writes to the same file across multiple update periods (5 minutes by default).
  4. Use storeByGroup.
  5. Increase the size of the heap and the multiple-updates-per-write feature of queued might allow the disks to keep up. Note: this generally gives a 1.5x-2.0x increase, but not much more. Also, you will need to have enough physical RAM and the heap must be large enough to cache samples for multiple sampling periods.
  6. Delete .jrb (or .rrd) files that are not being used. It is unknown why this decreases system load, but it has been demonstrated to decrease load from 20 to 3 on one customer's installation. (find /opt/opennms/share/rrd/snmp/ -name "*.jrb" -mtime +30 -exec rm -rf {} \;)
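
Before deleting anything as in item 6, it may be worth listing what would be removed. The Python sketch below only reports files and never deletes them. The function name is hypothetical, and the age test is a plain "modified more than N days ago" comparison, which is close to but not exactly how find rounds -mtime +30.

```python
import os
import time
from typing import List, Optional


def stale_files(
    root: str,
    suffix: str = ".jrb",
    days: int = 30,
    now: Optional[float] = None,
) -> List[str]:
    """List files under root ending in suffix whose mtime is older than 'days' days."""
    cutoff = (time.time() if now is None else now) - days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if name.endswith(suffix) and os.path.getmtime(path) < cutoff:
                stale.append(path)
    return sorted(stale)
```

For example, stale_files("/opt/opennms/share/rrd/snmp") lists the .jrb files that have not been updated in the last 30 days, which you can review before running the find command.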

Q: Is there any way to prevent false outages?

A: See False Outages

Q: Why do OpenNMS SNMP graphs show higher traffic levels than Netflow-based graphs?

A: (Note: The below answer is not specific to OpenNMS, but applies to any SNMP-based data collection system.)

Short answer: When counting bytes, SNMP interface counters look at layer 2, while Netflow looks at layer 3.

Long answer:

Consider a 40 byte IP packet arriving at a VLAN-tagged Ethernet interface on a router. The IP packet is encapsulated inside an Ethernet frame.

An SNMP interface counter (e.g. ifHCInOctets) counts:

18 bytes of Ethernet header (src mac [6], dst mac [6], ethtype [2], VLAN [4])
40 byte IP payload
4 byte checksum
= 62 bytes

Netflow counts:

40 byte IP payload
= 40 bytes

So if your traffic consists entirely of 40 byte IP packets, OpenNMS (or any SNMP-based data collection) will report roughly 50% more bits per second than Netflow.

(Note: A good way to determine whether your Netflow and SNMP graphs are lining up is to compare packet counts - they should be roughly the same.)
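
The arithmetic above can be reproduced for other packet sizes with a few lines of Python. This sketch follows the same simplified accounting as the example (header, optional VLAN tag, payload, checksum) and ignores details such as minimum-frame padding; the function names are hypothetical.

```python
def snmp_octets(ip_bytes: int, vlan_tagged: bool = True) -> int:
    """Bytes counted at layer 2 for one IP packet of ip_bytes bytes."""
    header = 6 + 6 + 2 + (4 if vlan_tagged else 0)  # src MAC, dst MAC, ethtype, VLAN tag
    checksum = 4                                    # Ethernet frame check sequence
    return header + ip_bytes + checksum


def overhead_ratio(ip_bytes: int, vlan_tagged: bool = True) -> float:
    """Ratio of the SNMP byte count to the Netflow byte count for one packet."""
    return snmp_octets(ip_bytes, vlan_tagged) / ip_bytes
```

For the 40 byte packet in the example, snmp_octets(40) is 62 and overhead_ratio(40) is 1.55. For a 1500 byte packet the ratio drops to about 1.015, so the gap between the two graphs depends heavily on the packet size mix.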

Q: How do I force an SNMP rescan of a node?

A: Go to the node's page. There should be an "Update SNMP" link at the top of the page. If it's not there, SNMP was not discovered on your node. In this case, see the SNMP troubleshooting information.

If the link is there, just click on it. This marks the node, and during the next rescan OpenNMS will try to reread (poll) SNMP information for this node.

You can trigger this to run immediately by selecting "Rescan" from the node's page.