sudo apt install linux-tools-common
Allow perf to capture performance and kernel events. If sampling takes too much time, lower the sampling rate (for example to 10000). When sampling interrupts take too long, the kernel automatically lowers perf_event_max_sample_rate. Check the dmesg output for a message of the form "perf: interrupt took too long, lowering kernel.perf_event_max_sample_rate to <value>".
sudo dmesg -wH | grep perf:
sudo sysctl -w kernel.perf_event_paranoid=-1 # beware!
sudo sysctl -w kernel.kptr_restrict=0
sudo sysctl -w kernel.perf_event_max_sample_rate=16250
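To see which rate the kernel settled on, the throttling message can be parsed from the kernel log. This is a minimal sketch, assuming the message ends with "lowering kernel.perf_event_max_sample_rate to <number>" as quoted above. The function name lowered_sample_rate is hypothetical.

```shell
# Read kernel log lines on stdin and print the most recent rate the kernel
# lowered perf_event_max_sample_rate to. Prints nothing if no such message
# is present. Assumes the message wording quoted in the notes above.
lowered_sample_rate() {
  sed -n 's/.*lowering kernel\.perf_event_max_sample_rate to \([0-9][0-9]*\).*/\1/p' | tail -n 1
}

# Usage (reading the kernel log usually needs root):
#   sudo dmesg | lowered_sample_rate
```

If this prints a value, passing the same value to sysctl -w kernel.perf_event_max_sample_rate avoids further throttling messages during the session.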
To persist changes to your system edit /etc/sysctl.d/local.conf
kernel.perf_event_paranoid=-1
kernel.kptr_restrict=0
kernel.perf_event_max_sample_rate=16250
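One possible way to script this step is sketched below. It writes the three settings to a file of your choice, which can then be installed as /etc/sysctl.d/local.conf. The function name write_perf_sysctl_conf and the /tmp staging path are hypothetical.

```shell
# Write the perf-related sysctl settings listed above to the given file,
# one key=value pair per line.
write_perf_sysctl_conf() {
  conf_path=$1
  cat > "$conf_path" <<'EOF'
kernel.perf_event_paranoid=-1
kernel.kptr_restrict=0
kernel.perf_event_max_sample_rate=16250
EOF
}

# Usage:
#   write_perf_sysctl_conf /tmp/local.conf
#   sudo install -m 644 /tmp/local.conf /etc/sysctl.d/local.conf
#   sudo sysctl --system   # reload all sysctl configuration files
```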
Reduce variance of the system https://gist.github.com/pankkor/b0970eb28547f5afa6776f8a8a143dfa
Quick stats, similar to GNU time -v
perf stat -ddd -- ./app -with -args
-ddd - very detailed statistics (can be -d, -dd or -ddd)
NOTE: CPUs can only service a limited number of counters at once (Skylake has 4 with Hyper-Threading, or 8 without). If your program doesn't run for long, some statistics won't be gathered. So picking a few counters and getting fewer stats is actually preferable. Choose wisely.
perf stat -e dTLB-store-misses,iTLB-load-misses ./app
-e - specify which counters to track. This example chooses 2 counters: data and instruction TLB misses.
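When more events are needed than the CPU has counters, one option is to measure them in several short runs instead of one. The sketch below splits an event list into batches. The batch size of 4 is taken from the Skylake note above and is an assumption for other CPUs. The function name batch_events is hypothetical.

```shell
# Print the given events in comma-separated batches of at most $1 events,
# one batch per line, ready to pass to perf stat -e.
batch_events() {
  size=$1
  shift
  batch=""
  count=0
  for ev in "$@"; do
    if [ "$count" -eq "$size" ]; then
      printf '%s\n' "$batch"
      batch=""
      count=0
    fi
    batch="${batch:+$batch,}$ev"
    count=$((count + 1))
  done
  if [ -n "$batch" ]; then
    printf '%s\n' "$batch"
  fi
}

# Usage: one perf stat run per batch of 4 events
#   batch_events 4 cycles instructions cache-misses branch-misses \
#       dTLB-load-misses dTLB-store-misses |
#     while read -r evs; do perf stat -e "$evs" ./app; done
```

This trades extra runs for counters that are active the whole time, which only makes sense if the workload is repeatable between runs.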
NOTE: append the :u or :k suffix to a counter to track it only in user space (:u) or only in the kernel (:k)
perf stat -e page-faults:u ./app
Precise number of page faults
If you microbenchmark page faults, you'll notice that perf stat reports a few extra page faults here and there between runs. You might need to disable ASLR to make the page fault counter consistent across runs.
Disables ASLR globally (BEWARE!)
sudo sysctl -w kernel.randomize_va_space=0
Run <my_binary> with ASLR disabled (-R option)
setarch "$(uname -m)" -R <my_binary>
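To check that the count really is stable with ASLR disabled, the number can be pulled out of perf stat's machine-readable output and compared across runs. This is a sketch that assumes the -x, CSV layout documented in the perf-stat man page (counter value first, then unit, then event name). The function name page_fault_count is hypothetical.

```shell
# Read perf stat -x, output on stdin and print the page-faults counter value.
# Assumed field order: value, unit, event name, ...
page_fault_count() {
  awk -F, '$3 ~ /^page-faults/ { print $1 }'
}

# Usage: perf stat writes its report to stderr, so route stderr into the pipe
#   for i in 1 2 3; do
#     perf stat -x, -e page-faults:u -- setarch "$(uname -m)" -R ./app 2>&1 >/dev/null |
#       page_fault_count
#   done
```

If the three printed numbers match, the remaining variance in your benchmark is not coming from address space randomization.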
To list all available Performance Monitor Unit (PMU) events
perf list
Records the session
perf record -Fmax -g --call-graph=lbr ./binary -arg1 -arg2
--call-graph=fp - frame pointer (default). Requires all the code to be built with -fno-omit-frame-pointer. As a quick hack I built only the agent with this flag; the libs from conan were not rebuilt, which confused perf.
--call-graph=dwarf - uses debug data to determine the call stack. Produces at least 10x more data than lbr. Use with -F99 to reduce the sampling rate. Takes about 4 minutes to load a 10MB report from a 5 sec run, and >40 min to load a 30MB report. Basically unusable at a high enough sampling rate (>1 kHz).
--call-graph=lbr - on Haswell and newer architectures there is LBR (Last Branch Records: model-specific registers that record branch jumps and behave like a ring buffer). Works well on amd64. Unfortunately, armv7 doesn't have LBR. Pretty fast: 144MB loads in 10 sec.
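The trade-offs above can be condensed into a small helper that picks perf record flags per call-graph mode. This is only a sketch of the pairings described in these notes (-F99 with dwarf, -Fmax with lbr, perf's default rate with fp), not a general recommendation. The function name perf_record_args is hypothetical.

```shell
# Print the perf record flags these notes pair with each call-graph mode.
# Returns non-zero for an unknown mode.
perf_record_args() {
  case $1 in
    fp)    printf '%s\n' '--call-graph=fp' ;;           # default sampling rate
    dwarf) printf '%s\n' '-F99 --call-graph=dwarf' ;;   # low rate keeps reports loadable
    lbr)   printf '%s\n' '-Fmax --call-graph=lbr' ;;    # cheap enough for the max rate
    *)     printf 'unknown call-graph mode: %s\n' "$1" >&2; return 1 ;;
  esac
}

# Usage (unquoted on purpose so the flags split into separate arguments):
#   perf record $(perf_record_args dwarf) ./binary -arg1 -arg2
```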
Generates a report from the record
perf report -g -M intel -i <perf.data>
-g - call graph
-M intel - Intel assembly syntax
-i - custom input file
Other options
-g 'graph,0.5,caller' --no-children
graph - percentages are absolute (fractal - relative to the caller)
0.5 - filter value
caller - show caller on top (callee - invert call graph)
--no-children - report self time
perf report for stripped binaries
To generate a report from stripped Release binaries, put the non-stripped binaries at the same paths (relative to /) where the stripped binaries were located when they were recorded.
Alternatively you can use symfs as a root and place binaries there
debug_symbols_fs/
  opt/
    my_prog/
      bin/
        binary
Then run perf with
perf report --symfs debug_symbols_fs
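Populating the symfs root can be scripted. The sketch below copies one unstripped binary into the root at the absolute path it had during recording, mirroring the tree above. The function name stage_symfs_binary and the ./build/binary source path are hypothetical.

```shell
# Copy an unstripped binary into a symfs root, at the same absolute path
# the stripped binary had on the machine where perf record ran.
stage_symfs_binary() {
  symfs_root=$1      # e.g. debug_symbols_fs
  recorded_path=$2   # absolute path of the stripped binary when recorded
  unstripped=$3      # local unstripped build of the same binary
  dest="$symfs_root$recorded_path"
  mkdir -p "$(dirname "$dest")" && cp "$unstripped" "$dest"
}

# Usage:
#   stage_symfs_binary debug_symbols_fs /opt/my_prog/bin/binary ./build/binary
#   perf report --symfs debug_symbols_fs
```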
Perf record on one machine and investigate on another machine https://gist.github.com/pankkor/6354d3c155fa82f93bca30ca200b6864