https://github.com/cncf/k8s-ai-conformance
The following blog post from mid-2025, and the related talk, give a great explanation of how AI/ML workloads differ from non-AI/ML workloads on Kubernetes. AI and ML workloads run on specialized hardware such as GPUs and other accelerators, and Kubernetes has implemented changes to accommodate these new devices.
Navigating Failures in Pods With Devices, by Sergey Kanzhelev (Google) and Mrunal Patel (Red Hat) | Thursday, July 03, 2025
Based on the KubeCon NA SLC 2024 talk given by Sergey Kanzhelev (Google) and Mrunal Patel (Red Hat):
https://www.youtube.com/watch?v=-YCnOYTtVO8&list=PLj6h78yzYM2Pw4mRw4S-1p_xLARMqPkA7&index=151
And for completeness, here are the slides from the talk.
The blog post lays out a roadmap. Below I have attempted (using agentic AI) to provide status updates on the issues mentioned in that post. The top two issues are closed, and I link to the merge commits associated with their closure.
The issues quoted below have been worked on since the blog post was written:
| Issue | Status | Target Release | SIG | Labels | Summary |
|---|---|---|---|---|---|
| integrate kubelet with the systemd watchdog · Issue #127460 | Closed | v1.32 | SIG Node | sig/node, kind/feature, priority/important-soon | Closed by PR #127566, merged as commit 7fff5b6. Adds systemd watchdog integration to the kubelet so systemd can restart it if it becomes unresponsive, improving node self-healing and failure detection. |
| DRA: detect stale DRA plugin sockets · Issue #128696 | Closed | v1.34 | SIG Node | sig/node, sig/scheduling, kind/bug, priority/important-soon | Closed by PR #133152, merged as commit 837b739. The kubelet now detects and cleans up stale Dynamic Resource Allocation (DRA) plugin sockets, preventing infinite retries against dead gRPC endpoints. |
| Support takeover for devicemanager/device-plugin · Issue #127803 | Open | TBD | SIG Node | sig/node, kind/feature, priority/important-longterm | Design discussion ongoing. Proposes allowing a new device plugin instance to take over device ownership from a previous instance without forcing deregistration, enabling safe plugin restarts and upgrades without pod disruption. No merged implementation yet. |
| Kubelet plugin registration reliability · Issue #127457 | Open | TBD | SIG Node | sig/node, kind/bug, priority/important-longterm | Tracks reliability gaps in kubelet’s plugin registration flow, including missing retries on GetInfo and GetDevicePluginOptions and slow detection of plugin restarts. Accepted as long-term reliability work; no PR merged yet. |
| Recreate the Device Manager gRPC server if failed · Issue #128167 | Open | TBD | SIG Node | sig/node, kind/feature, help-wanted | Proposes restarting the device manager’s gRPC server if it crashes instead of leaving kubelet in a degraded state. Also suggests tying failure into kubelet health reporting (e.g. systemd watchdog) so nodes are marked unhealthy when device management is broken. |
| Retry pod admission on device plugin grpc failures · Issue #128043 | Open | TBD | SIG Node | sig/node, kind/bug, priority/important-longterm | Pods that require device plugins may fail permanently if kubelet starts before the plugin. The issue proposes retrying pod admission or deferring failure until device plugin gRPC endpoints are available. Still under discussion. |