Luke Gorrie's blog

21st Century Network Hacking

Intel's Performance Counter Monitor

Wanna find the limiting factor in I/O performance on an Intel Nehalem server app? Here’s one idea.

The Nehalem architecture looks like this:
[Figure: Nehalem architecture]

So the main potential bottlenecks are:

  1. RAM<->CPU @ approx 256Gbps to each CPU’s RAM bank. (Too many memory copies?)
  2. CPU<->CPU @ approx 200Gbps for sharing RAM over QPI. (NUMA all mucked up?)
  3. PCIe @ approx 72Gbps. (Out of ports? Crappy motherboard?)
  4. Storage disk/SSD. (Waiting all the time?)
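To make those numbers concrete, here is a minimal Python sketch that turns "GBytes moved in some interval" into Gbps and a fraction of each link's budget. The capacities are the approximate figures from the list above. The function name `link_utilization`, the one-second interval, and the decimal reading of a GByte (10**9 bytes) are assumptions made for illustration, not anything the tool defines.

```python
# Approximate link capacities in Gbps, copied from the list above.
LINK_CAPACITY_GBPS = {
    "ram": 256.0,   # RAM<->CPU, per CPU's RAM bank
    "qpi": 200.0,   # CPU<->CPU over QPI
    "pcie": 72.0,   # PCIe
}


def link_utilization(gbytes, seconds, link):
    """Return (Gbps, fraction of capacity) for traffic moved over `link`.

    Hypothetical helper: assumes 1 GByte = 10**9 bytes.
    """
    gbps = gbytes * 8 / seconds
    return gbps, gbps / LINK_CAPACITY_GBPS[link]


if __name__ == "__main__":
    # Hypothetical sample: 6.51 GB read plus 2.85 GB written in one second.
    gbps, fraction = link_utilization(6.51 + 2.85, 1.0, "ram")
    print(f"{gbps:.1f} Gbps, {fraction:.0%} of the RAM link")
```

With those made-up inputs the memory link would be roughly a third full, so RAM bandwidth would probably not be the first suspect.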

Intel’s Performance Counter Monitor can tell you the utilization of these links. It’s like top for memory bandwidth, inter-processor data shuffling, PCIe load, etc. I used this tool to verify that PCs can push 40Gbps of full-duplex Ethernet traffic without any hassle. Intel make some really fine tech.

Give it a try! It’s really easy to install.

Here’s how it looks:

 EXEC  : instructions per nominal CPU cycle
 IPC   : instructions per CPU cycle
 FREQ  : relation to nominal CPU frequency='unhalted clock ticks'/'invariant timer ticks' (includes Intel Turbo Boost)
 AFREQ : relation to nominal CPU frequency while in active state (not in power-saving C state)='unhalted clock ticks'/'invariant timer ticks while in C0-state'  (includes Intel Turbo Boost)
 L3MISS: L3 cache misses 
 L2MISS: L2 cache misses (including other core's L2 cache *hits*) 
 L3HIT : L3 cache hit ratio (0.00-1.00)
 L2HIT : L2 cache hit ratio (0.00-1.00)
 L3CLK : ratio of CPU cycles lost due to L3 cache misses (0.00-1.00), in some cases could be >1.0 due to a higher memory latency
 L2CLK : ratio of CPU cycles lost due to missing L2 cache but still hitting L3 cache (0.00-1.00)
 READ  : bytes read from memory controller (in GBytes)
 WRITE : bytes written to memory controller (in GBytes)


 Core (SKT) | EXEC | IPC  | FREQ  | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3CLK | L2CLK  | READ  | WRITE 

   0    0     0.59   0.60   0.98    1.00    1743 K   3978 K    0.56    0.62    0.15    0.07     N/A     N/A
   1    0     0.61   0.62   0.97    1.00    2595 K   3356 K    0.23    0.64    0.23    0.02     N/A     N/A
   2    0     0.49   0.59   0.83    1.00    2205 K   3198 K    0.31    0.60    0.22    0.03     N/A     N/A
   3    0     0.06   0.32   0.18    1.00     715 K    921 K    0.22    0.35    0.34    0.02     N/A     N/A
------------------------------------------------------------------------------------------------------------
 TOTAL  *     0.43   0.59   0.74    1.00    7259 K     11 M    0.37    0.61    0.21    0.04    6.51    2.85

 Instructions retired: 3707 M ; Active cycles: 6317 M ; Time (TSC): 2134 Mticks ; C0 (active,non-halted) core residency: 73.99 %

 PHYSICAL CORE IPC                 : 0.59 => corresponds to 14.67 % utilization for cores in active state
 Instructions per nominal CPU cycle: 0.43 => corresponds to 10.86 % core utilization over time interval
----------------------------------------------------------------------------------------------
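The summary lines can be cross-checked against the raw counters printed above them. The sketch below is my reading of how the figures relate, not the tool's source code: the helper name `derive_summary` is made up, and the divide-by-4 in the utilization figures is inferred from the numbers in this output (presumably a peak of 4 instructions per cycle).

```python
def derive_summary(instructions, active_cycles, tsc_ticks, cores, max_ipc=4):
    """Recompute the TOTAL row and utilization lines from raw counters.

    max_ipc=4 is an assumption inferred from the sample output.
    """
    # Cycles all cores would have run at nominal frequency, never halted.
    nominal_cycles = tsc_ticks * cores
    ipc = instructions / active_cycles        # IPC column
    exec_ = instructions / nominal_cycles     # EXEC column
    freq = active_cycles / nominal_cycles     # FREQ column
    return {
        "IPC": ipc,
        "EXEC": exec_,
        "FREQ": freq,
        "active_utilization_pct": 100 * ipc / max_ipc,
        "overall_utilization_pct": 100 * exec_ / max_ipc,
    }


if __name__ == "__main__":
    # Counters from the sample: 3707 M instructions, 6317 M active cycles,
    # 2134 Mticks of TSC, 4 cores.
    summary = derive_summary(3707e6, 6317e6, 2134e6, cores=4)
    for name, value in summary.items():
        print(f"{name}: {value:.2f}")
```

Rounded to two decimals this gives IPC 0.59, EXEC 0.43, FREQ 0.74, and utilizations of 14.67 % and 10.86 %, which line up with the TOTAL row and the two summary lines.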