1. Load the new kernel + initrd from files.
2. Call reboot with LINUX_REBOOT_CMD_KEXEC.
(2) means that kexec is only available if reboot() is available as well.
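The two steps above can be sketched in Go roughly as follows. This is a minimal illustration, not how machined does it: it uses only the standard library, the syscall number 320 is an assumption that holds for x86_64 only, and the helper names (cmdlineBytes, kexecFileLoad, run) are hypothetical. It needs CAP_SYS_BOOT, and step 2 replaces the running kernel without a clean shutdown.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"unsafe"
)

// sysKexecFileLoad is the kexec_file_load syscall number.
// Assumption: x86_64 only; other architectures use a different number.
const sysKexecFileLoad = 320

// cmdlineBytes returns the kernel command line with a trailing NUL byte;
// kexec_file_load expects the length it is given to include that byte.
func cmdlineBytes(cmdline string) []byte {
	return append([]byte(cmdline), 0)
}

// kexecFileLoad is step 1: pass kernel and initrd file descriptors to the kernel.
func kexecFileLoad(kernel, initrd *os.File, cmdline string) error {
	buf := cmdlineBytes(cmdline)
	_, _, errno := syscall.Syscall6(sysKexecFileLoad,
		kernel.Fd(), initrd.Fd(),
		uintptr(len(buf)), uintptr(unsafe.Pointer(&buf[0])),
		0, 0) // flags = 0, last argument unused
	if errno != 0 {
		return fmt.Errorf("kexec_file_load: %w", errno)
	}
	return nil
}

func run(kernelPath, initrdPath, cmdline string) error {
	kernel, err := os.Open(kernelPath)
	if err != nil {
		return err
	}
	defer kernel.Close()

	initrd, err := os.Open(initrdPath)
	if err != nil {
		return err
	}
	defer initrd.Close()

	if err := kexecFileLoad(kernel, initrd, cmdline); err != nil {
		return err
	}
	// Step 2: jump into the loaded kernel. Does not return on success.
	return syscall.Reboot(syscall.LINUX_REBOOT_CMD_KEXEC)
}

func main() {
	if len(os.Args) != 4 {
		fmt.Println("usage: kexec-sketch <kernel> <initrd> <cmdline>")
		return
	}
	if err := run(os.Args[1], os.Args[2], os.Args[3]); err != nil {
		fmt.Fprintln(os.Stderr, "kexec failed:", err)
		os.Exit(1)
	}
}
```

If step 1 succeeds but the process is not allowed to call reboot, the loaded image simply stays unused, which is why kexec depends on reboot() being available.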
There are two kexec_load-related syscalls:
- kexec_load, which takes arbitrary memory (enabled via CONFIG_KEXEC)
- kexec_file_load, which takes file descriptors and might do signature validation (enabled via CONFIG_KEXEC_FILE)
KSPP talks only about CONFIG_KEXEC, not about CONFIG_KEXEC_FILE. At the same time, it recommends a sysctl (kernel.kexec_load_disabled) to disable kexec_load, and that sysctl disables both flavors.
kexec source code.
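To check whether that sysctl is already set on a running system, something like the following could be used. It is a small sketch that assumes the sysctl in question is kernel.kexec_load_disabled, exposed at /proc/sys/kernel/kexec_load_disabled; the function name parseKexecLoadDisabled is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// Assumption: the sysctl is kernel.kexec_load_disabled, exposed under procfs.
const kexecLoadDisabledPath = "/proc/sys/kernel/kexec_load_disabled"

// parseKexecLoadDisabled interprets the raw sysctl file content:
// any non-zero integer means loading new kexec images is disabled.
func parseKexecLoadDisabled(raw string) (bool, error) {
	v, err := strconv.Atoi(strings.TrimSpace(raw))
	if err != nil {
		return false, fmt.Errorf("unexpected sysctl value %q: %w", raw, err)
	}
	return v != 0, nil
}

func main() {
	raw, err := os.ReadFile(kexecLoadDisabledPath)
	if err != nil {
		// The file is typically absent when the kernel is built without kexec.
		fmt.Println("cannot read sysctl:", err)
		return
	}
	disabled, err := parseKexecLoadDisabled(string(raw))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("kexec loading disabled:", disabled)
}
```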
kexec_file_load and reboot both require the CAP_SYS_BOOT capability.
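Whether a given process still holds that capability can be inspected from its effective capability mask. The sketch below parses the CapEff line of /proc/self/status and tests bit 22, which is the number of CAP_SYS_BOOT in linux/capability.h; the function name hasCapSysBoot is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// capSysBoot is the bit index of CAP_SYS_BOOT in the capability masks.
const capSysBoot = 22

// hasCapSysBoot scans the text of /proc/<pid>/status for the CapEff line
// (a hexadecimal bitmask) and reports whether the CAP_SYS_BOOT bit is set.
func hasCapSysBoot(status string) (bool, error) {
	for _, line := range strings.Split(status, "\n") {
		if !strings.HasPrefix(line, "CapEff:") {
			continue
		}
		field := strings.TrimSpace(strings.TrimPrefix(line, "CapEff:"))
		mask, err := strconv.ParseUint(field, 16, 64)
		if err != nil {
			return false, fmt.Errorf("parsing CapEff %q: %w", field, err)
		}
		return mask&(1<<capSysBoot) != 0, nil
	}
	return false, errors.New("CapEff line not found")
}

func main() {
	status, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("cannot read status:", err)
		return
	}
	ok, err := hasCapSysBoot(string(status))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("CAP_SYS_BOOT in effective set:", ok)
}
```

Note that this only reports the mask of the current process; inside a user namespace the bit may be set while still applying only to resources owned by that namespace.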
reboot() inside a user namespace doesn't reboot the system; it reboots the namespace, killing it (proof). If LINUX_REBOOT_CMD_KEXEC is used, the call fails with EINVAL. This in turn means that a container can't actually use kexec unless it breaks out of the user namespace (and if it does, security is compromised anyway).
We can further limit kexec by dropping the CAP_SYS_BOOT capability for any process forked from machined (init). The path towards that is not yet totally clear to me, but here are some pointers:
- Go issue about the runtime, threads, and global state such as setting capabilities
- runc setting capabilities
- os/exec can set Ambient capabilities
- PR_SET_NO_NEW_PRIVS
- article on capabilities
Creating a user namespace re-enables all the capabilities, but capabilities inside the user namespace are limited to the resources scoped under that namespace (more info).
In other words, on protecting kexec from being used by processes other than machined:
- For processes directly forked from machined (which include udevd, containerd, etc.): we can try to drop capabilities as we fork into those processes.
- For containers created by containerd (both system and k8s), kexec shouldn't be available as they reside in a user namespace.