Instructions Per Cycle (IPC) Flamegraphs

In the first part of this post, I describe the Instructions-Per-Cycle (IPC) Flamegraph I ended up creating as part of my PCat perf optimization project. In the second part, I describe Flamegraph's differential mode and how its metrics are computed. If you're more interested in differential mode, you might want to skip to the second part, on what the metrics in Differential Flamegraphs really mean.

If you've read through Brendan Gregg's great resources, you may have run across his CPI Flamegraphs (CPI being the inverse of IPC: Cycles Per Instruction). Gregg's flamegraph and my flamegraph are named after the same concept, but they really are different beasts.

Were you to generate a "vanilla" flamegraph, you'd end up with something like the PCat-DNest flamegraph above: no color coding (okay yes, it's orange, but the orange shades are meaningless; they are picked at random). That seems like a waste, so Gregg gave Flamegraph the ability to color based on the difference between two datasets. In the colored "Differential Flamegraph" below, you can see not only that SloanModel::pixelLogLikelihood represents 32% of all cycles, but also, from the red color, that it has an IPC < 1. Good idea!

Gregg also proposes a specific measurement, the "CPI Flamegraph", where parts of the code with low instruction throughput are discernible by color. If we dig into Gregg's graph, he compares two measurements: RESOURCE_STALLS.ANY and CPU_CLK_UNHALTED.THREAD_P. In the comments, Gregg suggests an alternative would be to look at CPU_CLK_UNHALTED.CORE. These are interesting metrics to look at, but I don't see how this is an IPC/CPI Flamegraph. It's a resource_stalls graph! It seems to me that a set of FP-divides that aren't memory constrained would score well on Gregg's metric, when in reality they retire instructions slowly and are actually low IPC.

Instead I suggest measuring with:

perf record -e cpu-cycles,instructions -c 3333333 -g bin_under_test

That is, measure cpu-cycles and instructions_retired directly.

To actually generate the Flamegraphs:

# Make all relevant PMC and debugging info available.
sudo sysctl kernel.kptr_restrict=0
sudo sysctl kernel.perf_event_paranoid=0

perf record -e cpu-cycles,instructions -c 3333333 -g bin_under_test
perf script > perf_script.data
stackcollapse-perf.pl --event-filter=instructions perf_script.data > folded_insts
stackcollapse-perf.pl --event-filter=cpu-cycles perf_script.data > folded_cycles
difffolded.pl folded_insts folded_cycles \
| flamegraph.pl --countname cycles --title "Cycles w/ Instruct-Cycle-Delta%%" \
                --subtitle "red: IPC &lt;1" > ipc_flamegraph.svg

Apparently the PMCs can come up with counts that are slightly off when cores Turbo Boost, so I disable Turbo Boost for all measurements:

sudo sh -c 'echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo'

For calculation heavy codebases (like scientific computing) I believe this approach gives a better estimate of IPC.
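
To make the link between the two folded files and IPC concrete, here is a minimal Python sketch. It assumes the folded format that stackcollapse-perf.pl emits (a semicolon-separated stack, a space, then a sample count) and that both events were sampled with the same -c period, so the period cancels out of the ratio. The function names and the sample data are made up for illustration; this is not part of the Flamegraph toolchain.

```python
from collections import Counter


def parse_folded(lines):
    """Parse folded-stack lines ("frame;frame;... count") into a Counter."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The count is the last space-separated field; frames may contain spaces.
        stack, _, count = line.rpartition(" ")
        counts[stack] += int(count)
    return counts


def ipc_by_stack(insts, cycles):
    """Approximate IPC per stack as instruction samples / cycle samples.

    Assumes both events used the same sampling period. Stacks with no
    cycle samples are skipped to avoid dividing by zero.
    """
    return {stack: insts.get(stack, 0) / n for stack, n in cycles.items() if n > 0}


# Hypothetical folded data, standing in for folded_insts and folded_cycles.
folded_insts = ["main;compute 30", "main;wait_on_memory 5"]
folded_cycles = ["main;compute 20", "main;wait_on_memory 25"]

ipc = ipc_by_stack(parse_folded(folded_insts), parse_folded(folded_cycles))
for stack, value in sorted(ipc.items()):
    print(f"{stack}: IPC ~ {value:.2f}")
```

With so few samples per stack the ratio is noisy, so treat per-stack values as rough estimates that only become meaningful for stacks with many samples.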

So why would Gregg look at resource_stalls.any and cpu_clk_unhalted.thread_p? I assume he was concerned about frequency scaling and scheduling interfering with the visibility of IO blocking. When you trigger based on cycles, you 1) trigger only as rapidly as the CPU core frequency and 2) trigger only when your thread is executing. When waiting on memory or IO, the core can change frequency, and your thread might be unscheduled until data is available. I assume this makes wait periods less visible and skews the data. This is a real limitation of measuring IPC the way I have done.

I assume Gregg believed memory / IO blocking was the main cause of low IPC for his application, so he measures it directly. That makes sense: if you know the patient's problem is low blood sugar, measure that directly instead of asking general questions (which can be inaccurate). For calculation-heavy software, I believe my metrics give a more accurate estimate of IPC. So if your patient is PCat-DNest, use my flamegraph.

For reasons you'll learn in a second, the color gradient in IPC Flamegraphs is not very informative. You can really only tell one thing: does the function have an IPC < 1 (red) or not (blue)? The color gradient is relative, so it may mean totally different things from one graph to the next, and the percentages don't help indicate the IPC either. I will explain.

What the Metrics in Differential Flame Graphs Really Mean

I've optimized pixelLogLikelihood and now it takes 36.7% fewer cycles than before. Great! But when I look at my Differential Flamegraph it reports -11.59%, dramatically different. The calculation isn't documented so what the #*%@ is going on?

Differential Flamegraphs show two percentages on hover (click any Flamegraph in this post and start hovering): 1) the percentage of the resource used by the function, and 2) the before-to-after difference as a percentage of the global total. That second one is not documented anywhere, but the formula is:

(after_fn_usage - before_fn_usage) / (after_total_usage * scale_factor)

I've never seen scale_factor anything other than 1, so you can safely assume it is 1.
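
As a sanity check on the formula above, here is a minimal Python sketch that evaluates it next to the "local" percentage most people would compute by hand. The function names and the cycle counts are made up for illustration.

```python
def delta_pct(before_fn, after_fn, after_total, scale_factor=1.0):
    """Global delta percentage, following the formula above."""
    return 100.0 * (after_fn - before_fn) / (after_total * scale_factor)


def local_pct(before_fn, after_fn):
    """The function's own speedup, relative to its 'before' cost."""
    return 100.0 * (after_fn - before_fn) / before_fn


# Hypothetical numbers: a function drops from 50 to 30 cycles,
# and the whole program uses 400 cycles after the change.
print(delta_pct(50, 30, 400))  # -5.0  (what the hover text would report)
print(local_pct(50, 30))       # -40.0 (what you would call "40% fewer cycles")
```

The same change reads as -40% locally and -5% globally, which is the kind of gap described at the start of this section.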

Differential Flamegraph wants to tell you the difference before and after, but it wants to give you a "big picture" view, so it outputs difference metrics that are scaled against total resources used. So what if pixelLogLikelihood is 36.7% faster, if that only represents 11.6% of all resources used? Go find a different function that can cut global resources by more!

Once Differential Flamegraph has difference values, it generates a color gradient for the cells. 0 (no difference) is pegged white. max(diff) is red and min(diff) is blue.

Some implications:

The raw difference in number of cycles is nowhere on the graph.
The raw difference in number of cycles is hard to calculate (must multiply by number of cycles, can't do in head).
In common usage, percent difference is usually computed by dividing by before_total, not after_total. This means for a binary that went from 200 to 100 cycles (which most would call a decrease of 50%), Flamegraph will call it -100%. You would probably want to speak of this value as "100% faster."
The color gradient does not reference absolute values and cannot be compared directly across Flamegraphs.
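
To see why colors can't be compared across graphs, here is a small Python sketch of a relative gradient as described above: 0 maps to white, the graph's largest positive diff to full red, and its most negative diff to full blue. The exact RGB arithmetic is my assumption for illustration and is not copied from flamegraph.pl.

```python
def diff_color(delta, max_delta, min_delta):
    """Map a delta to an (r, g, b) tuple relative to this graph's extremes."""
    if delta > 0 and max_delta > 0:
        fade = int(255 * (1 - delta / max_delta))
        return (255, fade, fade)  # white -> red as delta approaches max_delta
    if delta < 0 and min_delta < 0:
        fade = int(255 * (1 - delta / min_delta))
        return (fade, fade, 255)  # white -> blue as delta approaches min_delta
    return (255, 255, 255)        # no difference: white


# The same +10 delta in two hypothetical graphs with different extremes.
color_a = diff_color(10, max_delta=10, min_delta=-5)
color_b = diff_color(10, max_delta=100, min_delta=-5)
print(color_a)  # (255, 0, 0): the reddest cell in graph A
print(color_b)  # (255, 229, 229): barely pink in graph B
```

An identical delta is saturated red in one graph and nearly white in the other, purely because the extremes differ.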

Differential Flamegraph can also normalize the two distributions with difffolded.pl's -n option (the before distribution is scaled linearly to match after). This is handy, and is intended for in-situ measurements where load may change between measurements. Some gotchas though:

Constant work can get skewed. Say before is scaled larger because the after measurement had higher load. Now it appears the constant work decreased in runtime when really, that constant work was, well, constant.
Uniform slowdowns disappear. Say you have a binary that takes 100 cycles. You modify it, and every function now takes twice as many cycles as before. Normalized, the two appear identical.
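
Both gotchas can be reproduced with a few lines of Python. This sketch assumes normalization is a plain linear rescale of the before counts so that their total matches the after total, as described above; it may differ in detail (rounding, for instance) from what the tool does. All function names and counts are made up.

```python
def normalize_before(before, after):
    """Linearly scale 'before' counts so their total matches the 'after' total."""
    scale = sum(after.values()) / sum(before.values())
    return {fn: count * scale for fn, count in before.items()}


def diffs(before, after):
    """Per-function difference (after - normalized before)."""
    normalized = normalize_before(before, after)
    return {fn: after[fn] - normalized[fn] for fn in after}


# Gotcha 1: constant work looks faster when the 'after' run had higher load.
load_before = {"housekeeping": 10, "requests": 90}
load_after = {"housekeeping": 10, "requests": 190}
skewed = diffs(load_before, load_after)
print(skewed)  # housekeeping shows -10.0 although it did not change

# Gotcha 2: every function takes 2x as long, yet the diff is all zeros.
slow_before = {"parse": 40, "compute": 60}
slow_after = {fn: 2 * count for fn, count in slow_before.items()}
hidden = diffs(slow_before, slow_after)
print(hidden)  # {'parse': 0.0, 'compute': 0.0}
```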

Ultimately, normalization is complicated and linear scaling is going to have some issues.

So go forth, use Differential Flamegraphs. Avoid the normalize option unless you are measuring server performance that changes in load. Understand that the percentages are really a globally-scaled diff. Don't look for raw differences (not available). Don't compare colors across graphs.