uptime
dmesg | tail
vmstat 1
mpstat -P ALL 1
pidstat 1
iostat -xz 1
free -m
sar -n DEV 1
sar -n TCP,ETCP 1
top
perf
Load averages: the number of processes wanting to run, averaged over 1, 5, and 15 minutes. On Linux this also includes processes blocked in uninterruptible IO.
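A quick way to see whether load is trending up or down is to rerun the command over time; as a minimal sketch, watch can repeat it every second:
watch -n 1 uptime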
Look for errors that can cause performance problems. "Out of memory", "TCP: ... dropping request", etc.
dmesg | grep oom-killer
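If your dmesg is from util-linux, -T prints human-readable timestamps, which makes it easier to match errors to an incident window:
dmesg -T | tail -n 20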
- r: Number of processes running or waiting for a CPU. Doesn't include processes waiting on IO.
IF r > num_cpu THEN saturation.
- b: Number of processes blocked by IO.
- free: Free memory in kB.
- buffers: buffer cache, used for block device I/O.
- cached: page cache, used by file systems.
- si, so: Swap-ins and swap-outs.
IF si,so != 0 THEN out_of_memory.
- us, sy, id, wa, st: breakdown of CPU time, averaged across all CPUs.
- us: user time.
- sy: system time.
IF sy > 20% THEN kernel is processing IO inefficiently.
- id: idle.
- wa: IO wait.
- st: stolen time, time spent by the hypervisor servicing other VMs.
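A bounded run is often enough for a first look; for example, five one-second samples (note that the first line shows averages since boot):
vmstat 1 5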
Look out for a single hot CPU, which can indicate a single-threaded bottleneck.
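As with vmstat, you can cap the number of samples; five one-second snapshots of every CPU are usually enough to spot an imbalance:
mpstat -P ALL 1 5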
Per-process statistics; like top, but prints a rolling summary instead of clearing the screen.
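pidstat defaults to CPU usage; sysstat's -r and -d switches give the same rolling view of per-process memory and disk IO (a sketch, pick the resource you suspect):
pidstat -u 1    # per-process CPU
pidstat -r 1    # per-process memory (page faults, RSS)
pidstat -d 1    # per-process disk IO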
- r/s: delivered reads per second
- rkB/s: kB read per second
- w/s: delivered writes per second
- wkB/s: kB write per second
- await: the average time for the IO in milliseconds. Includes both time queued and time being serviced.
IF high THEN device_saturation | device_problems.
- avgqu-sz: the average number of requests issued to the device.
IF avgqu-sz > 1 THEN could_be saturation. However, multiple back-end disk devices can service requests in parallel, so a queue length above 1 is not conclusive on its own.
- %util: device utilization (busy %).
IF %util > 60% THEN poor_performance (double-check with await). IF %util ~= 100% THEN saturation.
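For example, five one-second samples of extended stats, skipping idle devices; note that newer sysstat releases rename some of these columns (e.g. avgqu-sz becomes aqu-sz):
iostat -xz 1 5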
- buffers: buffer cache, used for block device I/O.
- cached: page cache, used by file systems.
- buff/cache: sum of buffers and cached.
- available: used for caches but could be quickly made available for the application.
IF buffers or cached ~= 0 THEN higher disk IO (the file system caches are nearly empty).
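free can also repeat on an interval, which makes cache growth or shrinkage easier to watch; -h prints human-readable units:
free -h -s 5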
- rxpck/s: number of packets received per second.
- txpck/s: number of packets transmitted per second.
- rxkB/s: number of kilobytes received per second.
- txkB/s: number of kilobytes transmitted per second.
- %ifutil: utilization percentage of the network interface. Could be unreliable.
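As with the other sysstat tools, a sample count bounds the run; for example, five one-second samples of all interfaces:
sar -n DEV 1 5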
- active/s: number of locally-initiated TCP connections per second (e.g., via connect()). (~ outbound)
- passive/s: number of remotely-initiated TCP connections per second (e.g., via accept()). (~ inbound)
- retrans/s: number of TCP retransmits per second. Sign of network or server issue.
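The keywords can be combined in one invocation if you want connection rates, retransmits, and interface throughput side by side:
sar -n DEV,TCP,ETCP 1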
You'll need a call graph:
- --call-graph lbr: aka Last Branch Record, uses special hardware registers that store a limited call graph of the last branching instructions (expect around ~32 entries). Very fast, but requires modern hardware (> Haswell, > ARMv9.2-A).
- --call-graph fp: uses the frame pointer to determine the call graph; use it if your binary is built with frame pointers (-fno-omit-frame-pointer).
- --call-graph dwarf: saves 8k of call stack to be analyzed later together with debug info. Produces large perf.data records, which are extremely slow to perf report. Practically unusable with a high sampling rate, so limit the sampling rate to 99 Hz with -F99.
Example commands:
Attach to a running process and sample it for 10 seconds with a 1000 Hz sample rate and an LBR call graph. Creates a perf.data record.
perf record -p <pid> --call-graph lbr -F1000 -- sleep 10
Sample the whole system for 10 seconds with DWARF debug info, limiting the sampling rate to 99 Hz.
perf record -a --call-graph dwarf -F99 -- sleep 10
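For completeness, the same attach-and-sample pattern with frame-pointer unwinding (assumes the target binary was built with -fno-omit-frame-pointer):
perf record -p <pid> --call-graph fp -F1000 -- sleep 10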
If running on a remote system, pack all the information needed to analyze the profile later on a host system. This creates a .tar archive; copy it together with perf.data to the host system.
perf archive
On the host system, unpack the .tar archive. This extracts it to ~/.debug.
perf archive --unpack
You can then generate a report from the remote system's perf.data:
perf report
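perf report reads perf.data from the current directory by default; -i selects another file and --stdio produces a plain-text report that is easy to save or diff:
perf report -i perf.data --stdio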
perf-archive is missing from the Ubuntu perf packages. Get it from the Linux source tree:
mkdir -p /usr/libexec/perf-core/
wget -O /usr/libexec/perf-core/perf-archive https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/tools/perf/perf-archive.sh
chmod +x /usr/libexec/perf-core/perf-archive