Do Sysadmins Dream Of Electric Sheep?

A collection of random mutterings and mumblings about Windows and other technologies…

Diagnosing NetApp CPU Issues – Kahuna Bottlenecks

Posted by Joe Thompson on January 24, 2013

I’m working on a client site at the moment where one of their filers frequently exhibits slow console response. As yet (fortunately), no end users have reported performance issues. The client has asked me to investigate further to find out what’s going on before someone does start to complain.

Filer: FAS3240
OnTAP Version: 8.1.0 7-mode
Clustered: Active/Active

Notes: The cluster partner is not displaying the same symptoms. OpsManager is showing a “CPU Too Busy” warning.

1) Run sysstat -m

Running sysstat -x or -u shows a stat called “CPU”. On this system, it’s almost solidly fixed at 99%. Not good, but the system’s still operating. If the CPU’s running so hot, how can this be?

CPU NFS CIFS HTTP Net kB/s Disk kB/s Tape kB/s Cache
        in out read write read write age
99% 3484 373 0 25505 19574 102993 57939 0 0 0s
99% 3304 463 0 23901 14424 104082 48239 0 0 9s
99% 2675 511 0 35202 9846 101772 47316 0 0 0s
99% 1615 243 0 18633 6818 75278 18690 0 0 0s
99% 1950 570 0 17120 6956 61602 53518 0 0 11s
99% 2177 357 0 30155 8587 96486 58358 0 0 7s
99% 2690 388 0 23931 5428 77650 36123 0 0 0s
99% 2781 280 0 33375 10736 97023 24737 0 0 0s
99% 3055 230 0 18215 14255 73371 77096 0 0 7s
99% 2431 357 0 10294 11330 68075 47320 0 0 0s
99% 2363 258 0 11092 19261 62122 63 0 0 0s

The FAS3240 is a multiprocessor beastie running 4 concurrent CPUs. The CPU stat you’re seeing isn’t the combined or averaged CPU time across all four CPUs; it’s the percentage of time that at least one of the four processors is in use.

This can be a little misleading at first, as this value would show 100% if only one of the CPUs was running at 100% and the other three were idle, or if each of them were running at 25% but were running (by some freak occurrence) in sequence. Likewise, if all four CPUs were running at 100%, you would get the same value.
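To make the distinction concrete, here’s a throwaway sketch (plain Python with made-up samples – purely illustrative, not how ONTAP derives its counters) showing how four CPUs that happen to take turns report ANY = 100% while AVG sits at 25%:

# Made-up busy flags for four CPUs across four sample intervals
# (1 = busy, 0 = idle). Each CPU takes its turn, so at every moment
# exactly one CPU is working.
samples = [
    (1, 0, 0, 0),
    (0, 1, 0, 0),
    (0, 0, 1, 0),
    (0, 0, 0, 1),
]

# ANY: fraction of samples in which at least one CPU was busy.
any_busy = sum(1 for s in samples if any(s)) / len(samples)

# AVG: mean utilisation across all CPUs and all samples.
avg_busy = sum(sum(s) for s in samples) / (len(samples) * len(samples[0]))

print(f"ANY = {any_busy:.0%}, AVG = {avg_busy:.0%}")   # ANY = 100%, AVG = 25%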

Running sysstat -m gives us a little more detail on which of these might be the case:

ANY AVG CPU0 CPU1 CPU2 CPU3
100% 74% 67% 72% 73% 82%
100% 70% 62% 64% 69% 85%
100% 65% 57% 62% 63% 75%
100% 63% 55% 62% 59% 77%
100% 64% 60% 62% 61% 74%
100% 65% 60% 64% 63% 74%
100% 68% 58% 65% 66% 85%
100% 56% 41% 48% 48% 86%
100% 48% 30% 39% 37% 86%
100% 49% 31% 37% 39% 88%
100% 47% 28% 33% 37% 88%
100% 49% 33% 36% 42% 83%
100% 45% 30% 34% 36% 78%
100% 51% 35% 41% 43% 83%
100% 45% 30% 34% 35% 81%

A few moments later, the ANY stat (which correlates to the CPU stat in a sysstat -x or -u) has jumped from 99% to 100%, but we can clearly see why the rest of the filer is operating normally. The AVG column shows the average CPU usage generally sitting between 50-70% – high, but not critically so – while a scan of the individual CPU columns shows a much more balanced picture. Not one of these is maxed out, so business is continuing as usual.

All of which is lovely, but then why is our console running slowly?

2) Run sysstat -M

To drill down a bit further, we need to use a privileged command. sysstat -M (note the upper case M) is only available in diag mode (priv set diag), and there is virtually no documentation I can find about it on the NetApp support site.

Running this command gives an extended CPU output which is too long for this display, so I’m going to cut it into two parts:

ANY1+ ANY2+ ANY3+ ANY4+ AVG CPU0 CPU1 CPU2 CPU3
100% 61% 43% 25% 58% 41% 47% 51% 93%
100% 62% 44% 27% 59% 43% 48% 52% 93%
100% 64% 45% 26% 60% 47% 49% 54% 88%
100% 75% 55% 36% 67% 53% 57% 64% 94%
100% 67% 49% 33% 63% 48% 52% 58% 93%
100% 67% 46% 30% 62% 47% 52% 56% 91%

The left hand side of the display shows an extended version of the same CPU info as sysstat -m gave us, with additional columns for the amount of time any one or more, two or more, three or more, or four or more CPUs are in use. (We only have four CPUs, so more is pretty much impossible!)
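Before we look at the second half of the output, the ANYn+ figures fall out of the same per-sample counting as the sketch above – the fraction of samples in which at least n CPUs were busy at the same moment (illustrative Python again, not ONTAP’s internal method):

# Reusing the made-up samples from the earlier sketch.
def any_n_plus(samples, n):
    # Fraction of samples in which at least n CPUs were busy at once.
    return sum(1 for s in samples if sum(s) >= n) / len(samples)

for n in range(1, 5):
    print(f"ANY{n}+ = {any_n_plus(samples, n):.0%}")

# With those take-it-in-turns samples: ANY1+ = 100%, ANY2+ to ANY4+ = 0%.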

Network Protocol Cluster Storage Raid Target Kahuna WAFL_Ex (Kahu) WAFL_XClean SM_Exempt Cifs Exempt Intr Host Ops/s CP
24% 0% 0% 10% 19% 0% 64% 77% (36%) 4% 0% 0% 28% 5% 1% 3454 85%
27% 0% 0% 10% 19% 0% 64% 77% (36%) 4% 0% 0% 29% 5% 1% 4524 87%
31% 0% 0% 10% 18% 0% 57% 82% (42%) 4% 0% 1% 29% 6% 1% 4409 61%
35% 0% 0% 13% 25% 0% 57% 88% (42%) 7% 0% 1% 34% 6% 1% 5057 99%
26% 0% 0% 11% 23% 0% 56% 84% (43%) 6% 0% 1% 38% 5% 1% 3743 99%
33% 0% 0% 10% 18% 2% 55% 83% (44%) 4% 0% 1% 31% 7% 1% 4788 68%

The right hand side of the display shows what NetApp calls “domains” – groups of related processes which can help clearly show which part of the system may be causing problems.

More detailed information on how NetApp filers handle processes across multiple CPUs can be found here – https://kb.netapp.com/support/index?page=content&id=3010150 – but the key information to know is that although different domains can run simultaneously on different processors, each individual domain can only exist on a single CPU at any one time. This is useful, because it means that any domain showing 100% usage indicates a CPU bottleneck for that bundle of related processes. Happily, there’s none of that here.

(Obviously there are exceptions to any rule, and so the above does not apply to any domains with Ex or Exempt in the name.)

To make matters a little more complicated, Kahuna processes and WAFL_Ex(Kahu) processes (the bit in brackets) cannot run simultaneously, so a potential Kahuna bottleneck occurs when the Kahuna value and the (Kahu) value add up to 100%.
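Putting those last two rules together as a quick sanity check (a hypothetical Python sketch – the 99% threshold, the function name and the hand-transcribed figures are my own assumptions; in practice you’d just read the output by eye):

def check_bottlenecks(domains, threshold=99):
    # Flag any non-exempt domain pegged at ~100%, plus the Kahuna + (Kahu) pairing.
    warnings = []
    for name, pct in domains.items():
        # The single-CPU rule doesn't apply to the Ex/Exempt domains,
        # and (Kahu) is covered by the pair check below.
        if "Ex" in name or name == "Kahu":
            continue
        if pct >= threshold:
            warnings.append(f"{name} pegged at {pct}% - domain bottleneck")
    kahuna_total = domains.get("Kahuna", 0) + domains.get("Kahu", 0)
    if kahuna_total >= threshold:
        warnings.append(f"Kahuna + (Kahu) = {kahuna_total}% - potential Kahuna bottleneck")
    return warnings

# First line of the sysstat -M output above, transcribed by hand (a subset of the columns):
line = {"Network": 24, "Storage": 10, "Raid": 19, "Target": 0,
        "Kahuna": 64, "Kahu": 36, "WAFL_XClean": 4, "Exempt": 28}
print(check_bottlenecks(line))
# ['Kahuna + (Kahu) = 100% - potential Kahuna bottleneck']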

Looking again at the above sysstat output, it’s pretty clear that this is what’s happening…

Kahuna WAFL_Ex (Kahu) Kahuna + (Kahu)
64% 77% (36%) 100%
64% 77% (36%) 100%
57% 82% (42%) 99%
57% 88% (42%) 99%
56% 84% (43%) 99%
55% 83% (44%) 99%

So the Kahuna domain has hit a CPU bottleneck. What does this mean?

Well – once again, information on the NetApp support site about which processes sit within the Kahuna domain is scant, but from what’s left over once you take away the other domains (Network, Protocol, Cluster, Storage, RAID, WAFL, SnapMirror) we can reasonably infer that what we’re really talking about are the top-level system management processes: console management, performance logging, de-dupe (as it’s visible nowhere else) and so on.

So – what could be causing this bottleneck?

Fortunately, a quick scan of the NetApp support site pulls up a bug report that seems to fit: http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=210480

Bug Detail

Bug ID: 210480
Title: Block reclamation scanners cause kahuna bottleneck.
Bug Severity: 2 – System barely usable
Bug Status: Fixed
Product: Unknown
Bug Type: Unknown

Description:
Block reclamation scanner returns unused blocks held by a container back to the aggregate. The amount of work the scanner does depends on the size of the flexvol it is running on and the number of blocks that are allocated in the flexvol but not in the aggregate. Some operations (e.g., snapshot deletion) can free a large number of blocks and cause the scanner to run for a long period of time. This in turn can cause high CPU utilization and a significant increase in response times.

This problem is most likely to affect low-end platforms such as FAS270.

This problem is present in 7.0.4 and later releases.

Workaround: None.

Fixed-In Version:
Data ONTAP 7.0.5P8 (First Fixed) – Fixed
Data ONTAP 7.0.7 (GD) – Fixed
Data ONTAP 7.1.3 (GD) – Fixed
Data ONTAP 7.2.7 (GD) – Fixed
Data ONTAP 7.3.3 (GD) – Fixed
Data ONTAP 7.3.7 (GA) – Fixed
Data ONTAP 7.3.7P1 (GA) – Fixed
Data ONTAP 8.0.4 (GA) – Fixed
Data ONTAP 8.1.2 (GA) – Fixed

A complete list of releases where this bug is fixed is available here.

Related Bugs: 168255, 212265, 185422, 299001

This bug matches our findings and our software level, and even neatly fits in with a planned upgrade to 8.1.2 we have scheduled for next weekend.

I’ll report back after and let y’all know if the fix worked…

UPDATE: The upgrade to 8.1.2 seems to have resolved the CPU issues we were having. CPU usage in sysstat is down to around 50%, and the combined Kahuna values are peaking at around 60-70%.

It’s worth noting that Mike (in the comments section) quite rightly points out that the bug I’ve highlighted above is listed by NetApp as being fixed in the version of OnTAP we were running prior to the upgrade (8.1.0 – see here for the full list). However, as the symptoms fit the description pretty accurately and the “fix” worked, I’ve decided to leave this post as is for now in case it helps anyone else out.


6 Responses to “Diagnosing NetApp CPU Issues – Kahuna Bottlenecks”

  1. Ian said

    Hello Joe.

    How was your holiday? I had to do a double take when I read this post. The symptoms are so similar to a problem we have here it was almost like you were sat next to me again.

    How did the OnTap upgrade go? We are due to do ours in a few weeks.

    Cheers
    Ian

    • Hi Ian – the trip was great thanks, we had a fantastic time. Coming back is hard though – if we do it again I’m gonna try and get back in spring time – the British weather in Dec / Jan is a bit of a comedown after the Brazilian beach life :-)

      I had a funny feeling I wouldn’t be the only one seeing this particular fault, although I’m more than a little amused you and I are still putting out the same fires… Just goes to show – little changes from one end of the country to the other!

      Our OnTAP upgrade is scheduled in for this weekend (3rd Feb) so I’ll let you know how we get on.

      Say hi to the guys for me!
      Joe

  2. Mike said

    Hello, you need to view the *full* list of versions that bug is fixed in. By default, NetApp only displays the most recent version for that series. That bug has been fixed for a long time. However, there are numerous other bugs that exist in 8.1 and 8.1.1 that cause similar symptoms:

    For example, http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=568758 and http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=510586

    • Hi Mike,

      Thanks for the tip – I had a look at the full list and it does appear as though that bug should be fixed for the version of OnTAP we were running prior to the weekend, it’s just that the symptoms for this particular bug seemed to fit more closely than any of the other bug reports I’d seen (in particular the Kahuna processes locking at 100%).

      It’s interesting to note that since we upgraded from 8.1.0 to 8.1.2 over the weekend the symptoms appear to have subsided considerably. Obviously we’ll continue to monitor closely over the next few days and weeks as this could either be a coincidence or a false positive, but I’m reasonably happy that whatever the true cause of the over-utilisation of the CPU was, it appears to have been resolved.

      Thanks
      Joe

  3. […] http://dosysadminsdream.wordpress.com/2013/01/24/diagnosing-netapp-cpu-issues-kahuna-bottlenecks/ […]

  4. confuSAN said

    great. thanks. i was also hoping for some reference to the 3 cpu measurements from netapp cpu_busy/avg_processor_busy/total_processor_busy. cpu_busy is the System CPU resource utilization, avg_processor_busy is Average processor utilization across all processors in the system and total_processor_busy is the sum of processor utilization of all processors in the system. I don’t understand the significance of total_processor_busy though. not sure how it is useful if at all
