Net-snmp 5.3 CPU collections

From OpenNMS

Introduction

There's been quite a lot of discussion regarding collecting CPU performance data from Linux and Solaris hosts. The example collections and graphs in OpenNMS do a pretty good job of collecting at least some data from these. Digging a little further, though, reveals that the task is not as simple as it seems. It's all made a little trickier by having Unix sysadmins peering over your shoulder and pointing out where your graphs differ from top, vmstat, NMON, topaz etc. I thought it would be worth spending a couple of days getting to the bottom of the problem. Here's what I found.

Operating systems, kernels and agents

We've got a variety of different Operating Systems, kernel versions and SNMP agents installed. To have an effective monitoring approach, I needed to be able to deal with:

  • RedHat AS 3 Update 1 with the redhat supplied snmpd
  • RedHat AS 3 Update 1 with net-snmp 5.3
  • RedHat AS 4 (em64T) with net-snmp 5.3
  • Solaris 2.6 and Solaris 8 with net-snmp 5.3

I'm interested in the ssCpuRaw counters from the systemStats group of the UCD MIB. First, I need to see what is available from each operating system/agent combination:

Stock RedHat AS 3 Update 1

UCD-SNMP-MIB::ssCpuRawUser.0 = Counter32: 444784210
UCD-SNMP-MIB::ssCpuRawNice.0 = Counter32: 11880
UCD-SNMP-MIB::ssCpuRawSystem.0 = Counter32: 610533695
UCD-SNMP-MIB::ssCpuRawIdle.0 = Counter32: 567577498

RedHat AS 3 Update 1 with net-snmp 5.3

UCD-SNMP-MIB::ssCpuRawUser.0 = Counter32: 345719282
UCD-SNMP-MIB::ssCpuRawNice.0 = Counter32: 11702
UCD-SNMP-MIB::ssCpuRawSystem.0 = Counter32: 104620072
UCD-SNMP-MIB::ssCpuRawIdle.0 = Counter32: 3838639234
UCD-SNMP-MIB::ssCpuRawWait.0 = Counter32: 653078644
UCD-SNMP-MIB::ssCpuRawKernel.0 = Counter32: 96654888
UCD-SNMP-MIB::ssCpuRawInterrupt.0 = Counter32: 457959
UCD-SNMP-MIB::ssCpuRawSoftIRQ.0 = Counter32: 7507225

RedHat AS 4 (em64T) with net-snmp 5.3

UCD-SNMP-MIB::ssCpuRawUser.0 = Counter32: 142774
UCD-SNMP-MIB::ssCpuRawNice.0 = Counter32: 122
UCD-SNMP-MIB::ssCpuRawSystem.0 = Counter32: 138703
UCD-SNMP-MIB::ssCpuRawIdle.0 = Counter32: 455120583
UCD-SNMP-MIB::ssCpuRawWait.0 = Counter32: 52299
UCD-SNMP-MIB::ssCpuRawKernel.0 = Counter32: 78633
UCD-SNMP-MIB::ssCpuRawInterrupt.0 = Counter32: 49665
UCD-SNMP-MIB::ssCpuRawSoftIRQ.0 = Counter32: 10405

Solaris 2.6 and Solaris 8 with net-snmp 5.3

UCD-SNMP-MIB::ssCpuRawUser.0 = Counter32: 538275716
UCD-SNMP-MIB::ssCpuRawSystem.0 = Counter32: 68950117
UCD-SNMP-MIB::ssCpuRawIdle.0 = Counter32: 2615758019
UCD-SNMP-MIB::ssCpuRawWait.0 = Counter32: 3681861
UCD-SNMP-MIB::ssCpuRawKernel.0 = Counter32: 65268256

So I've got a variety of different counters available. The stock RedHat AS 3 snmpd is the most terse, and net-snmp 5.3 on RedHat (both versions) the most verbose.
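To make the differences easier to see, here is a small Python sketch that parses the snmpwalk output from three of the listings above and reports which counters each agent is missing. The names `WALKS` and `parse_walk` are made up for this example; only the counter names and values come from the listings.

```python
import re

# snmpwalk output copied from the listings above.
WALKS = {
    "stock RedHat AS 3": """
UCD-SNMP-MIB::ssCpuRawUser.0 = Counter32: 444784210
UCD-SNMP-MIB::ssCpuRawNice.0 = Counter32: 11880
UCD-SNMP-MIB::ssCpuRawSystem.0 = Counter32: 610533695
UCD-SNMP-MIB::ssCpuRawIdle.0 = Counter32: 567577498
""",
    "RedHat AS 3 + net-snmp 5.3": """
UCD-SNMP-MIB::ssCpuRawUser.0 = Counter32: 345719282
UCD-SNMP-MIB::ssCpuRawNice.0 = Counter32: 11702
UCD-SNMP-MIB::ssCpuRawSystem.0 = Counter32: 104620072
UCD-SNMP-MIB::ssCpuRawIdle.0 = Counter32: 3838639234
UCD-SNMP-MIB::ssCpuRawWait.0 = Counter32: 653078644
UCD-SNMP-MIB::ssCpuRawKernel.0 = Counter32: 96654888
UCD-SNMP-MIB::ssCpuRawInterrupt.0 = Counter32: 457959
UCD-SNMP-MIB::ssCpuRawSoftIRQ.0 = Counter32: 7507225
""",
    "Solaris + net-snmp 5.3": """
UCD-SNMP-MIB::ssCpuRawUser.0 = Counter32: 538275716
UCD-SNMP-MIB::ssCpuRawSystem.0 = Counter32: 68950117
UCD-SNMP-MIB::ssCpuRawIdle.0 = Counter32: 2615758019
UCD-SNMP-MIB::ssCpuRawWait.0 = Counter32: 3681861
UCD-SNMP-MIB::ssCpuRawKernel.0 = Counter32: 65268256
""",
}

LINE_RE = re.compile(r"UCD-SNMP-MIB::(ssCpuRaw\w+)\.0 = Counter32: (\d+)")


def parse_walk(text):
    """Return {counter name: value} for every ssCpuRaw line in the text."""
    return {name: int(value) for name, value in LINE_RE.findall(text)}


# Counter names seen per agent, plus the intersection and union across agents.
available = {agent: set(parse_walk(text)) for agent, text in WALKS.items()}
common = set.intersection(*available.values())
everything = set.union(*available.values())

for agent, names in available.items():
    print(agent, "is missing", sorted(everything - names))
print("common to all agents:", sorted(common))
```

Only user, system and idle are reported by all three agents, which already hints that a single collection will not suit every host.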

What do those counters mean anyway?

The ssCpuRaw counters are actually the number of time ticks (usually 10 ms, i.e. 1/100 of a second) that the CPU spent performing different kinds of work (from the Linux 2.6 kernel docs):

  • user: normal processes executing in user mode
  • nice: niced processes executing in user mode
  • system: processes executing in kernel mode
  • idle: twiddling thumbs
  • iowait: waiting for I/O to complete
  • irq: servicing interrupts
  • softirq: servicing softirqs

A single-CPU, single-core Linux box will have 100 timeticks available per second. A two-CPU, dual-core Linux box will have 400 timeticks available per second.
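The arithmetic is simple enough to write down. This is only a sketch: `tick_budget` is a name invented here, and the 100 ticks per second per core figure is the one quoted above, so check it against your own agent before relying on it.

```python
# Ticks per second per core, as quoted in the text above (an assumption
# about the agent; verify it on your own hosts).
TICKS_PER_SECOND_PER_CORE = 100


def tick_budget(cpus, cores_per_cpu, ticks_per_core=TICKS_PER_SECOND_PER_CORE):
    """Total timeticks per second that the counters can account for."""
    return cpus * cores_per_cpu * ticks_per_core


single = tick_budget(1, 1)   # single CPU, single core
two_way = tick_budget(2, 2)  # two CPUs, dual core each
print(single, two_way)       # prints "100 400"
```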

Reading the docs, earlier kernel versions report the stats in different ways; for example, the 2.4 kernel does not differentiate between iowait and idle time. This confused me for a while, as the RedHat AS 3 kernel reports itself as 2.4 yet separates iowait and idle. I can only assume that the RedHat engineers backported some 2.5/2.6 kernel code into their 2.4 version.

Does one size fit all?

One of the beauties of OpenNMS is that, out of the box, it will just start collecting data. All you need to do is get the SNMP community right in snmp-config.xml and OpenNMS will collect what it can. By default, OpenNMS will try to collect user, nice, system, idle and kernel time, and it will try to plot user, nice, system and idle (not a bad set for 2.4 Linux kernels). I was curious as to whether graphing only a subset of the total available would be viable.

I created some collections and some graphs from those collections. My target machine was a two-way, dual-core Xeon box with RedHat AS 3 and net-snmp 5.3 compiled from source (i.e. one of the most complete sets of counters available). This box is our backup OpenNMS server.

raw data using a minimal systemStats collection

The graph below shows the raw data from our machine (i.e. the total timeticks in each state), when only graphing a small subset of the total number of counters.

Redhat net snmp unadjusted.png

This is about what I would expect. The maximum on the graph is 400 (remember the machine is two-way, dual-core). Over time, you can see that the height of the stacked graph doesn't reach 400. There is some unaccounted-for time here, which is hardly surprising given that we're only graphing 4 out of 8 total counters.
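The "raw" values on these graphs are per-second rates derived from successive Counter32 samples. The sketch below shows roughly how such a rate can be computed, including a single 32-bit wrap. It is an illustration rather than the actual JRobin code; `counter_rate` and the sample values are invented for the example, and it assumes at most one wrap and no agent restart between samples.

```python
COUNTER32_MODULUS = 2 ** 32  # a Counter32 wraps back to 0 after 2**32 - 1


def counter_rate(previous, current, interval_seconds):
    """Ticks per second between two Counter32 samples.

    Assumes the counter wrapped at most once and the agent did not
    restart between the two samples.
    """
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    delta = (current - previous) % COUNTER32_MODULUS
    return delta / interval_seconds


# Hypothetical samples taken 300 seconds apart (one collection interval).
plain = counter_rate(1000, 31000, 300)
wrapped = counter_rate(2 ** 32 - 10000, 20000, 300)
print(plain, wrapped)  # prints "100.0 100.0"
```

A rate of 100 ticks per second for one counter means one full core spent the whole interval in that state.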

raw data using a larger systemStats collection

I know that iowait time may represent a substantial proportion of my CPU time. Maybe increasing my set of counters to include iowait would help. This time I'm graphing a subset of systemStats counters that includes all the counters available from Solaris boxes running net-snmp 5.3 (hence the graph title).

Solaris net snmp unadjusted.png

Wow, so that's where my missing timeticks went: they're all in I/O wait (the red stacked data). I also notice that some counters may overlap (I've got more than 400 ticks in total), but this could also be a result of JRobin's interpolation of my data points. As the excess is quite small, I'm ignoring it for now.

raw data using a full systemStats collection

Finally, let's graph all of the systemStats counters that are available from this net-snmp/OS/kernel combination.

Linux net snmp unadjusted.png

Again, there seems to be an overlap in counters, but again it's sufficiently small to be ignored. I suspect the Kernel and System counters may overlap, but without looking at the kernel code, I can't be sure.

But shouldn't these graphs be a percent figure?

Our sysadmins like to see percentages, not timeticks, though, and this is where things really get out of hand...

I created some more graph definitions to turn each counter into a percentage of the total ticks on each graph.

CPU percent using a minimal systemStats collection

Redhat net snmp adjusted.png

CPU percent using a larger systemStats collection

Solaris net snmp adjusted.png

CPU percent using a full systemStats collection

Linux net snmp adjusted.png

Now let's look at one of these figures in detail. Remember these graphs are all from the same machine. Let's take the maximum value for user time:

The minimal graph has it at 51%, the slightly more complete graph at 15% and the full graph at 15%. That's a difference of 36 percentage points between the one-size-fits-all graph and the customised one! This difference is the result of the machine spending a lot of its time in a state that's not accounted for by the counters that we're collecting. It's strikingly obvious here because the machine is spending a lot of time waiting on I/O (OpenNMS on one spindle is a bad idea). You can extend this to other "unaccounted for" time. For example, if the machine spent a high proportion of its time in soft IRQ, then only the full graph would accurately represent the percent time in each state.
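The effect is easy to reproduce with made-up numbers. In the sketch below the per-second tick rates are hypothetical, chosen only to resemble an I/O-bound four-core box, and `percentages` is a name invented for the example. Each "graph" divides by the sum of only the counters it plots, which is what a percentage graph built from a partial collection ends up doing.

```python
# Hypothetical per-second tick rates for a 4-core box (400 ticks/s total).
# These are NOT measurements from the article's machine.
rates = {
    "user": 60.0, "nice": 0.0, "system": 20.0, "idle": 40.0,
    "wait": 270.0, "interrupt": 2.0, "softirq": 8.0,
}


def percentages(rates, graphed):
    """Percent of the *graphed* total spent in each graphed state."""
    total = sum(rates[name] for name in graphed)
    return {name: 100.0 * rates[name] / total for name in graphed}


minimal = percentages(rates, ["user", "nice", "system", "idle"])
larger = percentages(rates, ["user", "nice", "system", "idle", "wait"])
full = percentages(rates, list(rates))

for label, result in (("minimal", minimal), ("larger", larger), ("full", full)):
    print(f"{label:8s} user = {result['user']:.1f}%")
```

With these numbers the minimal graph reports user time at 50.0%, the larger one at 15.4% and the full one at 15.0%, even though the machine is doing exactly the same work in all three cases.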

So one size does not fit all then?

No, it doesn't, as we can clearly see from the graphs. You can just about get away with it if you don't convert timeticks to a percentage figure for graphing purposes, but you need to be aware of what you're looking at, what the graphs contain and, crucially, what the graphs do not contain.

So what's the solution?

The only real way around this is multiple collections for the systemStats counters, each tailored to the counters available from a given OS/kernel/agent combination.

Here's one for net-snmp 5.3

Bear in mind that the addresses or ranges you list in this package must not also appear in any other net-snmp CPU package, unless you're happy with double collections and double graphs.

collectd-configuration.xml

    <package name="enhanced-linux-net-snmp">
        <filter>IPADDR IPLIKE *.*.*.*</filter>
        <specific>your ip address here</specific>
        <service name="SNMP" interval="300000" user-defined="false" status="on">
            <parameter key="collection" value="enhanced-linux-net-snmp"/>
            <parameter key="port" value="161"/>
            <parameter key="retry" value="3"/>
            <parameter key="timeout" value="3000"/>
        </service>
        <outage-calendar>zzz from poll-outages.xml zzz</outage-calendar>
    </package>

datacollection-configuration.xml

        <snmp-collection name="enhanced-linux-net-snmp"
                maxVarsPerPdu = "10"
                snmpStorageFlag = "select">
                <rrd step = "300">
                        <rra>RRA:AVERAGE:0.5:1:8928</rra>
                        <rra>RRA:AVERAGE:0.5:12:8784</rra>
                        <rra>RRA:MIN:0.5:12:8784</rra>
                        <rra>RRA:MAX:0.5:12:8784</rra>
                </rrd>
                <groups>
                        <group name="enhanced-linux-ucd-sysstat-cpu" ifType="ignore">
                                <mibObj oid=".1.3.6.1.4.1.2021.11.50" instance="0" alias="eCpuRawUser" type="counter" />
                                <mibObj oid=".1.3.6.1.4.1.2021.11.51" instance="0" alias="eCpuRawNice" type="counter" />
                                <mibObj oid=".1.3.6.1.4.1.2021.11.52" instance="0" alias="eCpuRawSystem" type="counter" />
                                <mibObj oid=".1.3.6.1.4.1.2021.11.53" instance="0" alias="eCpuRawIdle" type="counter" />
                                <mibObj oid=".1.3.6.1.4.1.2021.11.54" instance="0" alias="eCpuRawWait" type="counter" />
                                <mibObj oid=".1.3.6.1.4.1.2021.11.55" instance="0" alias="eCpuRawKernel" type="counter" />
                                <mibObj oid=".1.3.6.1.4.1.2021.11.56" instance="0" alias="eCpuRawInterrupt" type="counter" />
                        </group>
                </groups>
                <systems>
                        <systemDef name = "Net-SNMP">
                                <sysoidMask>.1.3.6.1.4.1.8072.3.</sysoidMask>
                                <collect>
                                        <includeGroup>enhanced-linux-ucd-sysstat-cpu</includeGroup>
                                </collect>
                        </systemDef>
                </systems>
        </snmp-collection>
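As a quick sanity check on the rrd section above, the retention of each RRA can be worked out from the step and the RRA definition: retention is step × steps per row × rows. The helper names below are invented for this sketch; the RRA strings and the 300 second step are the ones from the collection.

```python
STEP_SECONDS = 300  # <rrd step = "300"> in the collection above
RRAS = [
    "RRA:AVERAGE:0.5:1:8928",
    "RRA:AVERAGE:0.5:12:8784",
    "RRA:MIN:0.5:12:8784",
    "RRA:MAX:0.5:12:8784",
]


def parse_rra(rra):
    """Split 'RRA:CF:xff:steps:rows' into typed fields."""
    _, function, xff, steps, rows = rra.split(":")
    return function, float(xff), int(steps), int(rows)


def retention_days(rra, step_seconds=STEP_SECONDS):
    """How many days of history one RRA can hold."""
    _, _, steps, rows = parse_rra(rra)
    return step_seconds * steps * rows / 86400


for rra in RRAS:
    print(rra, "keeps", retention_days(rra), "days")
```

So the first archive keeps 31 days of 5 minute samples, and the other three keep 366 days of hourly average, minimum and maximum values.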

snmp-graph.properties

In the reports section, include:

netsnmp.enhanced.cpuUsage,

Then define the report itself....


report.netsnmp.enhanced.cpuUsage.name=enhanced net-snmp CPU Usage
report.netsnmp.enhanced.cpuUsage.columns=eCpuRawUser,eCpuRawNice,eCpuRawSystem,eCpuRawKernel,eCpuRawWait,eCpuRawIdle
report.netsnmp.enhanced.cpuUsage.type=nodeSnmp
report.netsnmp.enhanced.cpuUsage.command=--title="linux net-snmp CPU Usage" \
 --upper-limit 100 --lower-limit 0 --units-exponent 0 --vertical-label="percent usage" \
 DEF:ssCpuRawUser={rrd1}:eCpuRawUser:AVERAGE \
 DEF:ssCpuRawNice={rrd2}:eCpuRawNice:AVERAGE \
 DEF:ssCpuRawSystem={rrd3}:eCpuRawSystem:AVERAGE \
 DEF:ssCpuRawKernel={rrd4}:eCpuRawKernel:AVERAGE \
 DEF:ssCpuRawWait={rrd5}:eCpuRawWait:AVERAGE \
 DEF:ssCpuRawIdle={rrd6}:eCpuRawIdle:AVERAGE \
 CDEF:numCpu=ssCpuRawUser,ssCpuRawNice,+,ssCpuRawSystem,+,ssCpuRawKernel,+,ssCpuRawWait,+,ssCpuRawIdle,+,100,/ \
 CDEF:ssCpuUser=ssCpuRawUser,numCpu,/ \
 CDEF:ssCpuNice=ssCpuRawNice,numCpu,/ \
 CDEF:ssCpuSystem=ssCpuRawSystem,numCpu,/ \
 CDEF:ssCpuKernel=ssCpuRawKernel,numCpu,/ \
 CDEF:ssCpuWait=ssCpuRawWait,numCpu,/ \
 CDEF:ssCpuIdle=ssCpuRawIdle,numCpu,/ \
 AREA:ssCpuUser#00ff00:"User  " \
 GPRINT:ssCpuUser:AVERAGE:"Avg \\: %10.2lf %s" \
 GPRINT:ssCpuUser:MIN:"Min \\: %10.2lf %s" \
 GPRINT:ssCpuUser:MAX:"Max \\: %10.2lf %s\\n" \
 STACK:ssCpuNice#ffff00:"Nice  " \
 GPRINT:ssCpuNice:AVERAGE:"Avg \\: %10.2lf %s" \
 GPRINT:ssCpuNice:MIN:"Min \\: %10.2lf %s" \
 GPRINT:ssCpuNice:MAX:"Max \\: %10.2lf %s\\n" \
 STACK:ssCpuSystem#FF7F50:"System" \
 GPRINT:ssCpuSystem:AVERAGE:"Avg \\: %10.2lf %s" \
 GPRINT:ssCpuSystem:MIN:"Min \\: %10.2lf %s" \
 GPRINT:ssCpuSystem:MAX:"Max \\: %10.2lf %s\\n" \
 STACK:ssCpuKernel#F5F5DC:"Kernel" \
 GPRINT:ssCpuKernel:AVERAGE:"Avg \\: %10.2lf %s" \
 GPRINT:ssCpuKernel:MIN:"Min \\: %10.2lf %s" \
 GPRINT:ssCpuKernel:MAX:"Max \\: %10.2lf %s\\n" \
 STACK:ssCpuWait#ff0000:"Wait  " \
 GPRINT:ssCpuWait:AVERAGE:"Avg \\: %10.2lf %s" \
 GPRINT:ssCpuWait:MIN:"Min \\: %10.2lf %s" \
 GPRINT:ssCpuWait:MAX:"Max \\: %10.2lf %s\\n" \
 STACK:ssCpuIdle#800080:"Idle  " \
 GPRINT:ssCpuIdle:AVERAGE:"Avg \\: %10.2lf %s" \
 GPRINT:ssCpuIdle:MIN:"Min \\: %10.2lf %s" \
 GPRINT:ssCpuIdle:MAX:"Max \\: %10.2lf %s\\n"
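The CDEF lines in the report are written in RPN (reverse Polish notation), which can be hard to read at a glance. The following Python sketch evaluates the same expressions with a tiny RPN evaluator that supports only the `+` and `/` operators used here. It is not how rrdtool or JRobin is implemented, `eval_rpn` is a name made up for the example, and the input rates are hypothetical.

```python
def eval_rpn(expression, variables):
    """Evaluate a comma-separated RPN expression using only + and /."""
    stack = []
    for token in expression.split(","):
        if token in ("+", "/"):
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right if token == "+" else left / right)
        elif token in variables:
            stack.append(variables[token])
        else:
            stack.append(float(token))  # numeric literal such as 100
    if len(stack) != 1:
        raise ValueError("malformed RPN expression: " + expression)
    return stack[0]


# Hypothetical per-second tick rates standing in for the six DEFs.
values = {
    "ssCpuRawUser": 60.0, "ssCpuRawNice": 0.0, "ssCpuRawSystem": 20.0,
    "ssCpuRawKernel": 10.0, "ssCpuRawWait": 270.0, "ssCpuRawIdle": 40.0,
}

# The CDEFs from the report, in the same order.
CDEFS = [
    ("numCpu", "ssCpuRawUser,ssCpuRawNice,+,ssCpuRawSystem,+,"
               "ssCpuRawKernel,+,ssCpuRawWait,+,ssCpuRawIdle,+,100,/"),
    ("ssCpuUser", "ssCpuRawUser,numCpu,/"),
    ("ssCpuNice", "ssCpuRawNice,numCpu,/"),
    ("ssCpuSystem", "ssCpuRawSystem,numCpu,/"),
    ("ssCpuKernel", "ssCpuRawKernel,numCpu,/"),
    ("ssCpuWait", "ssCpuRawWait,numCpu,/"),
    ("ssCpuIdle", "ssCpuRawIdle,numCpu,/"),
]

for name, expression in CDEFS:
    values[name] = eval_rpn(expression, values)
    print(name, "=", values[name])
```

Note that numCpu is really "graphed ticks per second divided by 100". It only equals the number of cores when the six counters cover all of the CPU time without overlapping, and dividing each counter by it gives that counter's percentage of the graphed total.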