Performance tuning

From OpenNMS


Hardware considerations

On systems that collect a lot of RRD data, probably the biggest performance improvement comes from moving PostgreSQL and Tomcat to a separate system from the OpenNMS daemons. The difference is huge.

On a server with hardware RAID, consider investing in a battery-backed write cache (BBWC). On an HP DL380 G4, the server's average I/O wait dropped from 15% to almost nil after adding a 128 MB BBWC.

For a small number of monitored nodes, moving the RRD data area onto a tmpfs RAM drive may also relieve the I/O wait caused by all the RRD writes. The trade-off is that a server crash or power loss will lose the RRD files unless you regularly sync the RAM drive to a backup on disk.
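A minimal sketch of such a sync, assuming the RRD area is mounted as tmpfs and copied to and from an on-disk directory. Both paths, the tmpfs size, and the sync_rrd helper name are hypothetical. Note that cp -a does not remove files deleted from the source, and a file copied while OpenNMS is writing it may be inconsistent.

```shell
#!/bin/sh
# Sketch only: hypothetical paths, adjust to your installation.
RRD_RAM=/opt/opennms/share/rrd          # directory OpenNMS writes RRDs to (tmpfs)
RRD_DISK=/var/backups/opennms-rrd       # on-disk copy that survives a reboot

# One-time setup, as root (512m is an example size):
#   mount -t tmpfs -o size=512m tmpfs "$RRD_RAM"

# Hypothetical helper: copy everything under $1 into $2,
# preserving timestamps and permissions (does not delete extra files).
sync_rrd() {
    src=$1
    dst=$2
    mkdir -p "$dst" && cp -a "$src/." "$dst/"
}

# At boot, before starting OpenNMS:        sync_rrd "$RRD_DISK" "$RRD_RAM"
# Periodically from cron, and at shutdown: sync_rrd "$RRD_RAM" "$RRD_DISK"
```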

Operating system

  1. Don't run in a VM.
  2. Don't put DB or RRD data on file systems managed by LVM.
  3. Don't put DB or RRD data on file systems on RAID-5.
  4. Do put OpenNMS logs, RRDs, and PostgreSQL data on separate spindles or separate RAID sets. See the PostgreSQL and RRD details below.
  5. Do run on a modern kernel: Linux 2.6 or later, or Solaris 10 or later. In particular, avoid Linux 2.4.
  6. Do set the noatime mount flag on the file systems holding the data listed in item 4.
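To confirm that item 6 took effect, the sketch below checks a mount point's options in /proc/mounts (Linux). The has_noatime helper is hypothetical, and the device and path in the fstab comment are examples only.

```shell
#!/bin/sh
# Example /etc/fstab entry (device and mount point are hypothetical):
#   /dev/sdb1  /var/lib/pgsql  ext3  defaults,noatime  0 2

# Hypothetical helper: succeed if mount point $1 has the noatime option.
# Reads /proc/mounts by default; another file may be passed as $2.
has_noatime() {
    mnt=$1
    mounts=${2:-/proc/mounts}
    # Fields: device, mount point, fs type, options, dump, pass
    awk -v m="$mnt" '$2 == m {
            n = split($4, o, ",")
            for (i = 1; i <= n; i++) if (o[i] == "noatime") found = 1
        }
        END { exit !found }' "$mounts"
}

# Example: has_noatime /var/lib/pgsql && echo "noatime is set"
```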

Java Virtual Machine (JVM)

PostgreSQL

The default shared_buffers setting in postgresql.conf is extremely conservative. On most modern servers, raising it significantly gives a big performance boost and reduces I/O wait. The kernel's shmmax parameter must be raised to match. See this PostgreSQL performance page for recommendations on this and other PostgreSQL settings.

If you put PostgreSQL on a different server, change the database host in opennms-datasources.xml. The PostgreSQL server also needs iplike installed and configured.

To clean up old events in the database, see Event_Configuration_How-To#The_Database.

PostgreSQL 8.1

These changes to postgresql.conf will probably improve your database performance if you have enough RAM to support them (about 2 GB installed for a dedicated server; YMMV). You will probably need to raise the shmmax kernel parameter on your system.

shared_buffers = 20000
work_mem = 16348
maintenance_work_mem = 65536
vacuum_cost_delay = 50
checkpoint_segments = 20
checkpoint_timeout = 900
wal_buffers = 64
stats_start_collector = on
stats_row_level = on
autovacuum = on

I've also set these higher values on *bigger* systems:

wal_buffers = 256
work_mem = 32768
maintenance_work_mem = 524288
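In PostgreSQL 8.1 these settings are bare numbers: shared_buffers and wal_buffers count 8 kB pages (assuming the default block size), while work_mem and maintenance_work_mem are in kilobytes. The hypothetical helper below converts them to bytes. For example, shared_buffers = 20000 is 163,840,000 bytes, most of the 170,639,360-byte segment requested in the shmmax error further down. Keep in mind that work_mem applies per sort or hash operation, so several concurrent queries can each use that much.

```shell
#!/bin/sh
# Hypothetical helper: convert a PostgreSQL 8.1 memory setting to bytes,
# assuming the default 8 kB block size.
pg81_bytes() {
    case $1 in
        shared_buffers|wal_buffers)    echo $(( $2 * 8192 )) ;;  # 8 kB pages
        work_mem|maintenance_work_mem) echo $(( $2 * 1024 )) ;;  # kilobytes
        *) echo "unknown setting: $1" >&2; return 1 ;;
    esac
}

# Example: pg81_bytes maintenance_work_mem 524288   (512 MB)
```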

Systems with lots of RAM and PostgreSQL 8.2

Recently, we've found that raising max_fsm_pages and max_fsm_relations tenfold on systems with plenty of memory (4 GB or more) improves performance dramatically.

#max_fsm_pages = 204800		# min max_fsm_relations*16, 6 bytes each
max_fsm_pages = 2048000
#max_fsm_relations = 1000		# min 100, ~70 bytes each
max_fsm_relations = 10000

We also raised these substantially:

work_mem = 100MB
maintenance_work_mem = 128MB
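The comments in the max_fsm settings above give their memory cost: 6 bytes per page slot and about 70 bytes per relation, with max_fsm_pages at least 16 times max_fsm_relations. The hypothetical helper below applies those figures. The tenfold values come to roughly 13 MB (2,048,000 × 6 + 10,000 × 70 = 12,988,000 bytes), compared with about 1.3 MB for the commented-out defaults.

```shell
#!/bin/sh
# Hypothetical helper: estimate free space map memory in bytes from
# max_fsm_pages ($1) and max_fsm_relations ($2), using the per-entry
# costs quoted in postgresql.conf, after checking the documented minimum.
fsm_bytes() {
    pages=$1
    relations=$2
    if [ "$pages" -lt $(( relations * 16 )) ]; then
        echo "max_fsm_pages must be at least $(( relations * 16 ))" >&2
        return 1
    fi
    echo $(( pages * 6 + relations * 70 ))
}

# Example: fsm_bytes 2048000 10000
```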

Note: To make adjustments to shmmax, do the following:

Start postgresql from the command line:

sudo -u postgres pg_ctl -D /var/lib/pgsql/data start

(adjusting paths as necessary) and look at the error message:

FATAL:  could not create shared memory segment: Invalid argument
DETAIL:  Failed system call was shmget(key=5432001, size=170639360, 03600).
HINT:  This error usually means that PostgreSQL's request for a shared memory segment exceeded your kernel's SHMMAX parameter. You can either reduce the request size or reconfigure the kernel with larger SHMMAX. To reduce the request size (currently 170639360 bytes), reduce PostgreSQL's shared_buffers parameter (currently 20000) and/or its max_connections parameter (currently 100).

Note the value of "size" (here, 170639360 bytes).

Then, as root, raise shmmax to at least that value:

sysctl -w kernel.shmmax=170639360

Start PostgreSQL again using your normal method (for example, "service postgresql start").

Finally, edit /etc/sysctl.conf and add the line

kernel.shmmax=170639360

so it will survive a reboot.
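If you saved the error message to a file, the sketch below extracts the requested size and prints the two commands described above so you can review them before running them as root. The helper names and the file name pg-error.txt are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read PostgreSQL's "could not create shared memory
# segment" error on stdin and print the requested segment size in bytes.
shm_request_size() {
    sed -n 's/.*shmget(key=[0-9]*, size=\([0-9]*\),.*/\1/p' | head -n 1
}

# Hypothetical helper: print (not run) the commands that raise shmmax
# now and make the setting survive a reboot.
shmmax_commands() {
    echo "sysctl -w kernel.shmmax=$1"
    echo "echo 'kernel.shmmax=$1' >> /etc/sysctl.conf"
}

# Example:
#   size=$(shm_request_size < pg-error.txt)
#   shmmax_commands "$size"
```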

PostgreSQL *any* Version

One additional change that seems to improve performance tremendously is putting the write-ahead logs on a separate spindle (or, even better, a separate disk controller/channel). To do this:

  1. Shut down OpenNMS / Tomcat
  2. Shut down PostgreSQL
  3. cd to $PGDATA (the PostgreSQL data directory)
  4. mv pg_xlog <file system on different spindle>
  5. ln -s <file system on different spindle>/pg_xlog pg_xlog
  6. Restart PostgreSQL, then OpenNMS / Tomcat

Make sure the PostgreSQL data and write-ahead logs do not live on a RAID-5 disk subsystem.
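Steps 4 and 5 above can be sketched as a small function that refuses to run twice. It assumes PostgreSQL and OpenNMS are already stopped, that it runs as the postgres user, and that the target directory exists on the other spindle. The relocate_xlog name is hypothetical. PostgreSQL 10 and later call this directory pg_wal instead of pg_xlog.

```shell
#!/bin/sh
# Hypothetical helper: move $1/pg_xlog into directory $2 and leave a
# symlink behind. Run only while PostgreSQL is stopped.
relocate_xlog() {
    pgdata=$1      # PostgreSQL data directory, e.g. $PGDATA
    target=$2      # existing directory on the other spindle
    if [ -L "$pgdata/pg_xlog" ]; then
        echo "$pgdata/pg_xlog is already a symlink" >&2
        return 1
    fi
    if [ -e "$target/pg_xlog" ]; then
        echo "$target/pg_xlog already exists" >&2
        return 1
    fi
    mv "$pgdata/pg_xlog" "$target/" &&
        ln -s "$target/pg_xlog" "$pgdata/pg_xlog"
}

# Example: relocate_xlog "$PGDATA" /wal-disk
```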

iplike stored procedure

See the iplike documentation to make sure you are running the best version of iplike.

RRDTool/JRobin

See RRD performance fundamentals.

Disk Tuning

Because OpenNMS gathers and records a great deal of detail about network and system performance and behavior, it tends to be a write-heavy application. If your environment has a very large number of data points to manage, you will benefit from a high degree of spindle separation. In particular, and where possible, place each of the following on a separate spindle, and in some cases on separate drives or devices:

  • OpenNMS SNMP collection
  • OpenNMS response time collection
  • OpenNMS (and system) logging
  • PostgreSQL database
  • PostgreSQL write-ahead logging

Further, in a *nix environment, consider putting the RRDs on their own mounts, so that you can mount them with the noatime and nodiratime options without compromising other aspects of the system configuration.

The default locations of the OpenNMS directories mentioned above are

 /opt/opennms/share/rrd/snmp
 /opt/opennms/share/rrd/response
 /opt/opennms/logs or /var/log/opennms

but watch out for symbolic links!
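To see where each directory really lives, the sketch below follows symbolic links with readlink -f (GNU coreutils assumed) and reports the mount point from df -P. Directories on the same mount point certainly share a device. Different mount points may still be partitions of the same disk. The show_storage name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: for each directory given, print its real path
# (symlinks resolved) and the mount point that holds it.
show_storage() {
    for dir in "$@"; do
        real=$(readlink -f "$dir") || continue
        [ -d "$real" ] || { echo "$dir: not found" >&2; continue; }
        # df -P prints a header line, then one line per file system;
        # field 6 of the data line is the mount point.
        mnt=$(df -P "$real" | awk 'NR == 2 { print $6 }')
        echo "$dir -> $real (mounted on $mnt)"
    done
}

# Example:
#   show_storage /opt/opennms/share/rrd/snmp /opt/opennms/share/rrd/response \
#                /opt/opennms/logs /var/log/opennms "$PGDATA"
```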

Tomcat (if not using built-in Jetty server)

Since OpenNMS 1.3.7 there is no need to use Tomcat unless you have a specific requirement that the built-in Jetty server cannot meet.

If you did not do so at installation time, allow Tomcat to use more memory than the default. The easiest way is through the CATALINA_OPTS environment variable. If your Tomcat package has its own configuration file, set the variable there. Otherwise, add it to catalina.sh:

CATALINA_OPTS="-Xmx1024m"

The -Xmx option allows Tomcat to use up to 1 GB of heap memory. Of course, this assumes that 1 GB of memory is available on the system. Tune the value for the particular server in use.

OpenNMS daemon

OpenNMS webapp

Logging

By default the daemons and webapp log at DEBUG level. This causes a lot of extra disk I/O. You can reduce the logging substantially by setting the level to WARN in /opt/opennms/etc/log4j.properties and /opt/opennms/webapps/opennms/WEB-INF/log4j.properties. Just add this line:

   log4j.threshold=WARN

There is also /opt/opennms/jetty-webapps/opennms/WEB-INF/log4j.properties, but even though this file is read on startup, it seems not to matter; I didn't need to modify it.

After restarting, you should no longer see messages labelled DEBUG or INFO in /opt/opennms/logs/daemon/* and /opt/opennms/logs/webapp/*, except for the startup log (/opt/opennms/logs/daemon/output.log).
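Adding the line by hand is simple. The sketch below does the same for several files, skips any file that already has a log4j.threshold line, and first adds a newline if a file lacks one at the end. The set_log_threshold name is hypothetical. Back up the files before running it.

```shell
#!/bin/sh
# Hypothetical helper: append "log4j.threshold=$1" to each file given,
# unless that file already sets log4j.threshold.
set_log_threshold() {
    level=$1
    shift
    for f in "$@"; do
        if grep -q '^log4j\.threshold=' "$f"; then
            echo "$f: already has a log4j.threshold line; left unchanged" >&2
            continue
        fi
        # Ensure the file ends with a newline before appending.
        if [ -s "$f" ] && [ -n "$(tail -c 1 "$f")" ]; then
            echo >> "$f"
        fi
        echo "log4j.threshold=$level" >> "$f"
    done
}

# Example, using the paths from the text:
#   set_log_threshold WARN /opt/opennms/etc/log4j.properties \
#       /opt/opennms/webapps/opennms/WEB-INF/log4j.properties
```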

Poller threads

If you have good hardware and find your pollers are not completing in time, you can turn up the maximum number of poller threads at the top of poller-configuration.xml.

To find out how many threads are actually being used, make sure DEBUG level logging is enabled for daemon/poller.log, then run:

   $ tail -f poller.log | egrep 'PollerScheduler.*adjust:'
   ...
   2007-09-05 10:30:32,755 DEBUG [PollerScheduler-45 Pool] RunnableConsumerThreadPool$SizingFifoQueue:
       adjust: started fiber PollerScheduler-45 Pool-fiber2 ratio = 1.0227273, alive = 44
   
   ...
   
   2007-09-05 10:30:12,783 DEBUG [PollerScheduler-45 Pool-fiber29] RunnableConsumerThreadPool$SizingFifoQueue:
       adjust: calling stop on fiber PollerScheduler-45 Pool-fiber3

Watch the output for a while after startup. The "alive" count shows the number of active poller threads (minus one -- the new thread isn't counted). If the number of threads is continually pegged at the maximum (default 30), you might want to add more threads.
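Rather than watching tail -f, you can scan a saved poller.log for the highest "alive" value. The sketch below assumes each adjust: message sits on a single log line, as the egrep above implies. The peak_alive name is hypothetical. As described above, the thread count is this value plus one.

```shell
#!/bin/sh
# Hypothetical helper: print the highest "alive = N" value found in
# PollerScheduler adjust: messages (files as arguments, or stdin).
peak_alive() {
    grep 'adjust:' "$@" | grep -o 'alive = [0-9]*' |
        awk 'BEGIN { max = 0 } $3 + 0 > max { max = $3 + 0 } END { print max }'
}

# Example: peak_alive /opt/opennms/logs/daemon/poller.log
```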

Event Archiving

In the OpenNMS "contrib" directory, there is a small script that helps performance by archiving events into a historical event table and pointing references to each archived event at an event placeholder.

You can download the latest version of the script here.

We recommend first running the script with a timestamp argument, archiving one day's worth of events at a time, starting with the oldest day and working forward to the point from which you want to keep live events (the default is 9 weeks). From then on, run the script from cron without a timestamp argument as often as you like.

./maint_events.sh "2008/01/01"
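To drive the day-by-day catch-up, the sketch below prints one date per day in the YYYY/MM/DD form used in the example above. It assumes GNU date (for -d) and assumes that calling maint_events.sh once per successive day is the intended usage. Check the script itself before relying on this. The archive_dates name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print YYYY/MM/DD for every day from $1 to $2
# inclusive (both given as YYYY-MM-DD). Requires GNU date.
archive_dates() {
    day=$1
    end=$2
    while [ "$(date -d "$day" +%Y%m%d)" -le "$(date -d "$end" +%Y%m%d)" ]; do
        date -d "$day" +%Y/%m/%d
        day=$(date -d "$day + 1 day" +%Y-%m-%d)
    done
}

# Example: archive the first week of 2008, one day at a time.
#   for d in $(archive_dates 2008-01-01 2008-01-07); do
#       ./maint_events.sh "$d" || break
#   done
```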
