For my work on proxy-exec, I’ve built up a collection of tests that I use to try to strain things and uncover issues. I’m sure there are lots of better ways; however, I am but a simple caveman. Folks have asked about my process, so I figured I’d try to document it. I’m likely forgetting things, but I’ll try to update this as I think of them.
(I'll usually keep a custom test defconfig in my kernel trees that has the options I see as useful/helpful enabled, so others can re-create the same config easily; one quick way to apply them over an existing config is sketched just after the test list below):
CONFIG_PROVE_LOCKING
CONFIG_DEBUG_RT_MUTEXES
CONFIG_DEBUG_SPINLOCK
CONFIG_DEBUG_MUTEXES
CONFIG_DEBUG_WW_MUTEX_SLOWPATH
CONFIG_DEBUG_RWSEMS
CONFIG_DEBUG_LOCK_ALLOC
CONFIG_LOCKUP_DETECTOR
CONFIG_SOFTLOCKUP_DETECTOR
CONFIG_HARDLOCKUP_DETECTOR
CONFIG_LOCK_TORTURE_TEST

I then test using the following boot parameters for the locktorture configuration:
"torture.random_shuffle=1 locktorture.writer_fifo=1 locktorture.torture_type=mutex_lock locktorture.nested_locks=8 locktorture.rt_boost=1 locktorture.rt_boost_factor=50 locktorture.stutter=0"

- CONFIG_WW_MUTEX_SELFTEST to exercise the ww-mutex die/wound logic. With my extension patches (hopefully to land upstream soon), I can trigger them to run repeatedly in a loop:
# while true; do echo 1 > /sys/kernel/test_ww_mutex/run_tests ; sleep 5; done
- My currently out-of-tree [ksched_football test](https://github.com/johnstultz-work/linux-dev/commit/b28fa89f27b3d8466fe3f8374aa3ed76c79dde75) (CONFIG_SCHED_RT_INVARIANT_TEST). This can often starve the system and gives the dl_server a workout. Re-run repeatedly in a loop:
# while true; do echo 10 > /sys/kernel/ksched_football/start_game; sleep 120; done
- rt-tests: A collection of tests for exercising the RT scheduling class. Usually I’ll run cyclictest to add some frequent RT preemptions via:
# ./cyclictest -t -p99
- Priority-inversion-demo: A userspace test that demonstrates cgroup-caused priority inversions and lets you create and compare histograms. Often I will run this in a loop indefinitely:
# while true; do ./run.sh ; sleep 1; done
- Kselftest cpu-hotplug test: Found in the kernel source under tools/testing/selftests/cpu-hotplug/. I’ll usually run it in a loop like:
# while true; do ./cpu-on-off-test.sh -a; sleep 120; done
- stress-ng: An intense system stressor. It may effectively DoS your system, so it's not always great for distinguishing between system overload and an actual bug. I’ll add it in when other stress testing hasn’t found anything. Run in a loop via:
# while true; do stress-ng -r `nproc` --timeout 300; sleep 90; done
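As mentioned above, a quick way to flip the debug options from the top of this post on over an existing config is the kernel's scripts/config helper. This is just a sketch showing a few of the options (repeat --enable for the rest; the helper accepts option names with or without the CONFIG_ prefix):

# ./scripts/config --enable PROVE_LOCKING --enable DEBUG_RT_MUTEXES --enable DEBUG_SPINLOCK
# make olddefconfig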
I don’t run all of the above together all the time. I tend to pick a collection of four or so to run in parallel, choosing a mix that doesn't completely overwhelm the system.
I sort of treat my trees by grades of stability: 10 mins, 1+ hrs, 6+ hrs, 12+ hrs, 48+ hrs. When I'm actively hacking on things, I usually only run for ~10 minutes; off-for-lunch is an hour; then there's running for a chunk of the work day, running overnight, and running over the weekend. I try to take every opportunity to leave tests running when I can't be actively working on things. In a few situations where I had really tricky issues to debug, I'd have to leave things running for 70+ hours to trip the problem, so this stress testing isn't always the fastest way to find issues.
- I tend to do most of my testing in an x86 QEMU environment. I’m lucky to be able to run my QEMU VM with 64 cores, which creates a lot of parallelism and makes it easier to trip races.
- I will sometimes drop the "-enable-kvm" flag to QEMU. This really slows down the test environment (taking >20 minutes to boot with many of the boot-time in-kernel stress tests enabled). However, the combination of high CPU counts and very slow execution seems to open up a number of races, and this has been helpful in finding problems. Testing this way takes a lot of patience that I don't usually have, though.
- I usually run qemu with arguments to ensure I always have the serial console logged to a file (a full example invocation is sketched at the end of this post):
"-chardev stdio,id=char0,mux=on,logfile=serial.log,signal=off -serial chardev:char0 -mon chardev=char0"
- Always run qemu with the "-gdb tcp::1234" option. Also pass "nokaslr" as a boot option to the kernel. Then if a hang or other problem arises, you can easily debug the kernel by running gdb on the host machine:
$ gdb vmlinux -ex "target remote localhost:1234"
- "printk.synchronous=1" as a boot argument has also been helpful when trying to chase down rare issues where printk loses lines.
- trace_printk() is your friend. Make sure you also have "ftrace_dump_on_oops" as a kernel boot argument.
- When I suspect I’m hitting a race, or am worried that there may be one present, I’ll add udelay(500); (sometimes going as high as 2000), or sometimes something like udelay(100*raw_smp_processor_id());, after a lock is released, to try to open up the windows where races might occur.
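To make that last trick concrete, here's a minimal sketch of the idea. The struct, lock, and function names are all made up for illustration, and this is debug-only hackery you'd never commit:

#include <linux/delay.h>
#include <linux/smp.h>
#include <linux/spinlock.h>

/* Hypothetical data and lock, purely for illustration. */
struct foo {
	int state;
};
static DEFINE_SPINLOCK(foo_lock);

static void foo_update(struct foo *f)
{
	spin_lock(&foo_lock);
	f->state = 1;
	spin_unlock(&foo_lock);
	/*
	 * DEBUG ONLY: stall right after dropping the lock so racing
	 * CPUs get a much wider window to interleave here. Staggering
	 * the delay by CPU id perturbs the timing differently on each
	 * CPU, which can shake out different orderings.
	 */
	udelay(500);	/* sometimes as high as 2000 */
	/* or: udelay(100 * raw_smp_processor_id()); */
}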
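And for reference, here is roughly what a full QEMU invocation pulling the flags above together looks like. This is a sketch: the kernel image path, memory size, core count, and console boot options are assumptions to adapt, and you'd add your usual disk or initrd arguments. Drop -enable-kvm for the slow-motion variant:

# qemu-system-x86_64 -enable-kvm -smp 64 -m 16G -display none \
    -kernel arch/x86/boot/bzImage \
    -append "console=ttyS0 nokaslr ftrace_dump_on_oops printk.synchronous=1" \
    -chardev stdio,id=char0,mux=on,logfile=serial.log,signal=off \
    -serial chardev:char0 -mon chardev=char0 \
    -gdb tcp::1234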