Linux perf howto.

Installation

sudo apt install linux-tools-common

Setup

These settings let perf collect performance counters and kernel events. If sampling takes too much time, lower the sampling rate to 10000. When a perf interrupt takes too long, the kernel automatically lowers perf_event_max_sample_rate on its own. To check whether that happened, look in the dmesg output for "perf: interrupt took too long, lowering kernel.perf_event_max_sample_rate to <value>".

sudo dmesg -wH | grep perf:

sudo sysctl -w kernel.perf_event_paranoid=-1 # beware!
sudo sysctl -w kernel.kptr_restrict=0
sudo sysctl -w kernel.perf_event_max_sample_rate=16250

To persist the changes across reboots, add them to /etc/sysctl.d/local.conf

kernel.perf_event_paranoid=-1
kernel.kptr_restrict=0
kernel.perf_event_max_sample_rate=16250
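A minimal sketch of one way to create that file and reload it without rebooting. The helper name write_perf_sysctl_conf and the /tmp staging path are made up for illustration; the three values are the ones listed above.

```shell
# Hypothetical helper: writes the three settings to the file given as $1.
write_perf_sysctl_conf() {
    conf_path=$1
    cat > "$conf_path" <<'EOF'
kernel.perf_event_paranoid=-1
kernel.kptr_restrict=0
kernel.perf_event_max_sample_rate=16250
EOF
}

# Typical use (needs root, so left commented out here):
#   write_perf_sysctl_conf /tmp/local.conf
#   sudo install -m 644 /tmp/local.conf /etc/sysctl.d/local.conf
#   sudo sysctl --system    # reload all sysctl.d files without rebooting
```

Staging the file in a temporary location first keeps the root-only step down to a single install command.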

Reduce variance of the system https://gist.github.com/pankkor/b0970eb28547f5afa6776f8a8a143dfa

perf stat

Quick stats, similar to GNU time -v

perf stat -ddd -- ./app -with -args

-ddd - very detailed statistics (can be -d, -dd or -ddd)

NOTE: CPUs can only service a limited number of counters at once (Skylake has 4 with HyperThreading, or 8 without HyperThreading). When you request more events than that, perf time-shares the counters, so if your program doesn't run for long, some statistics will not be gathered. Picking a few counters and having fewer stats is actually preferable. Choose wisely.

perf stat -e dTLB-store-misses,iTLB-load-misses ./app

-e - specify which counters to track. This example chooses 2 counters: data TLB store misses and instruction TLB load misses (the instruction TLB has no store events).
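To see whether the counters you picked were time-shared, you can ask perf stat for CSV output with -x, and check how long each counter actually ran. The sketch below assumes the field order described in the perf-stat man page for a run without -r or -I (value, unit, event name, counter run time, percentage of time running); the helper name flag_multiplexed and the sample lines are made up.

```shell
# Reads `perf stat -x,` lines on stdin and prints the events that were
# multiplexed (counter enabled for less than 100% of the run).
# Assumes field 3 is the event name and field 5 the running percentage.
flag_multiplexed() {
    awk -F, '$5 != "" && $5 + 0 < 100 { printf "%s ran %s%% of the time\n", $3, $5 }'
}

# Made-up sample lines in the assumed format, not real measurements:
printf '%s\n' \
    '52341,,dTLB-load-misses,1000231,100.00,,' \
    '1873,,dTLB-store-misses,498112,49.80,,' | flag_multiplexed

# With real data (perf stat writes its statistics to stderr):
#   perf stat -x, -e dTLB-load-misses,dTLB-store-misses ./app 2>&1 >/dev/null | flag_multiplexed
```

If an event shows up here, its count was extrapolated from a partial run, which is one more reason to request fewer events.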

NOTE: you can restrict counting to user space or kernel space by appending the :u or :k suffix to an event name

perf stat -e page-faults:u ./app

Precise number of page faults

If you microbenchmark page faults, you'll notice that perf stat reports a few extra page faults here and there between runs. You might need to disable ASLR to make the page-fault counter consistent across runs.

Disable ASLR globally (BEWARE!)

sudo sysctl -w kernel.randomize_va_space=0

Run <my_binary> with ASLR disabled (-R option)

setarch "$(uname -m)" -R <my_binary>
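Before and after changing the global setting it helps to check what it currently is. A small sketch, assuming the value meanings from the kernel sysctl documentation (0 off, 1 randomizes stack, mmap base and vDSO, 2 additionally randomizes the heap); the helper name describe_aslr is made up.

```shell
# Hypothetical helper: maps a kernel.randomize_va_space value to a label.
describe_aslr() {
    case "$1" in
        0) echo "disabled" ;;
        1) echo "partial (stack, mmap, vDSO)" ;;
        2) echo "full (partial plus heap)" ;;
        *) echo "unknown" ;;
    esac
}

# Print the current state if the knob is readable on this system.
knob=/proc/sys/kernel/randomize_va_space
if [ -r "$knob" ]; then
    echo "ASLR is currently: $(describe_aslr "$(cat "$knob")")"
fi
```

If you disabled ASLR globally, remember to write the previous value back when you are done (2 on most distributions, as far as I know).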

perf list

List all available Performance Monitoring Unit (PMU) events

perf list

perf record (one process)

Records the session

perf record -Fmax -g --call-graph=lbr ./binary -arg1 -arg2

--call-graph=fp - (default: frame pointer). Requires all the code to be built with -fno-omit-frame-pointer. For a quick hack I built only the agent with this flag; the libs from conan were not rebuilt, and that confused perf.

--call-graph=dwarf - uses debug data to determine the call stack. Produces at least 10x more data than lbr. Use with -F99 to reduce the sampling rate. It takes about 4 minutes to load a 10 MB report from a 5 sec run, and >40 min to load a 30 MB report. Basically unusable with a high enough sampling rate (>1 kHz).

--call-graph=lbr - on Haswell and newer architectures we have LBR (Last Branch Record: model-specific registers that record recent branches and behave like a ring buffer). Works well on amd64. Unfortunately, armv7 doesn't have LBR. Pretty fast: a 144 MB report loads in 10 sec.
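The trade-offs above can be folded into a small wrapper that pairs each call-graph mode with a sampling rate. This is only a sketch: the helper name perf_record_cmd is made up, it prints the command instead of running it, and the 99 Hz / max pairing simply follows the notes above.

```shell
# Hypothetical helper: prints a perf record command line for the given
# call-graph mode, pairing dwarf with a low sampling rate.
perf_record_cmd() {
    mode=$1
    shift
    case "$mode" in
        dwarf)  freq=99 ;;     # keep the report small enough to load
        fp|lbr) freq=max ;;    # cheap unwinding, sample as fast as allowed
        *) echo "unknown call-graph mode: $mode" >&2; return 1 ;;
    esac
    echo "perf record -F $freq --call-graph=$mode -- $*"
}

perf_record_cmd dwarf ./binary -arg1 -arg2
perf_record_cmd lbr ./binary -arg1 -arg2
```

Printing the command first makes it easy to eyeball the flags before committing to a long recording.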

perf report

Generates a report from the recording

perf report -g -M intel -i <perf.data>

-g - call graph
-M intel - Intel assembly syntax
-i - custom input file

Other options

-g 'graph,0.5,caller' --no-children

graph - percentages are absolute (fractal - relative to the caller)
0.5 - filter value
caller - show caller on top (callee - invert call graph)
--no-children - report self time

perf report for stripped binaries

To generate a report from stripped Release binaries, put the non-stripped binaries at the same paths where the stripped binaries were located when they were recorded, mirroring the directory structure under /.

Alternatively, you can use a symfs directory as the root and place the binaries there

debug_symbols_fs/
    opt/
        my_prog/
            bin/
                binary

Then run perf with

perf report --symfs debug_symbols_fs
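Building that tree by hand is error-prone, so here is a minimal sketch of a helper that copies an unstripped binary to the mirrored location. The name stage_symfs and the build/binary source path are made up; the target path is the one from the example tree above.

```shell
# Hypothetical helper: copies an unstripped binary into a symfs tree at the
# same absolute path the stripped binary had on the recorded machine.
stage_symfs() {
    symfs_root=$1      # e.g. debug_symbols_fs
    unstripped=$2      # local unstripped build
    recorded_path=$3   # absolute path at record time, e.g. /opt/my_prog/bin/binary
    mkdir -p "$symfs_root$(dirname "$recorded_path")" &&
        cp "$unstripped" "$symfs_root$recorded_path"
}

# Usage (paths are placeholders):
#   stage_symfs debug_symbols_fs build/binary /opt/my_prog/bin/binary
#   perf report --symfs debug_symbols_fs
```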

Perf archive

Perf record on one machine and investigate on another machine https://gist.github.com/pankkor/6354d3c155fa82f93bca30ca200b6864

