https://github.com/cncf/k8s-ai-conformance
The following blog post from mid-2025, and the related talk, give a great explanation of how AI/ML workloads differ from non-AI/ML workloads on Kubernetes. AI and ML workloads run on specialized hardware such as GPUs and other accelerators, and Kubernetes has implemented changes to accommodate these new devices.
Navigating Failures in Pods With Devices, by Sergey Kanzhelev (Google) and Mrunal Patel (Red Hat) | Thursday, July 03, 2025
Based on the KubeCon NA SLC 2024 talk given by Sergey Kanzhelev (Google) and Mrunal Patel (Red Hat):
https://www.youtube.com/watch?v=-YCnOYTtVO8&list=PLj6h78yzYM2Pw4mRw4S-1p_xLARMqPkA7&index=151
And for completeness, here are the slides from the talk.
The blog post lays out a roadmap. Below I have attempted (using agentic AI) to provide status updates on the issues mentioned in that post. The top two issues are closed, and I link to the merge commits associated with their closure.
The issues quoted below have been worked on since the blog post was written:
| Issue | Status | Target Release | SIG | Labels | Summary |
|---|---|---|---|---|---|
| integrate kubelet with the systemd watchdog · Issue #127460 | Closed | v1.32 | SIG Node | sig/node, kind/feature, priority/important-soon | Closed by PR #127566, merged as commit 7fff5b6. Adds systemd watchdog integration to the kubelet so systemd can restart it if it becomes unresponsive, improving node self-healing and failure detection. |
| DRA: detect stale DRA plugin sockets · Issue #128696 | Closed | v1.34 | SIG Node | sig/node, sig/scheduling, kind/bug, priority/important-soon | Closed by PR #133152, merged as commit 837b739. The kubelet now detects and cleans up stale Dynamic Resource Allocation (DRA) plugin sockets, preventing infinite retries against dead gRPC endpoints. |
| Support takeover for devicemanager/device-plugin · Issue #127803 | Open | TBD | SIG Node | sig/node, kind/feature, priority/important-longterm | Design discussion ongoing. Proposes allowing a new device plugin instance to take over device ownership from a previous instance without forcing deregistration, enabling safe plugin restarts and upgrades without pod disruption. No merged implementation yet. |
| Kubelet plugin registration reliability · Issue #127457 | Open | TBD | SIG Node | sig/node, kind/bug, priority/important-longterm | Tracks reliability gaps in kubelet’s plugin registration flow, including missing retries on GetInfo and GetDevicePluginOptions and slow detection of plugin restarts. Accepted as long-term reliability work; no PR merged yet. |
| Recreate the Device Manager gRPC server if failed · Issue #128167 | Open | TBD | SIG Node | sig/node, kind/feature, help-wanted | Proposes restarting the device manager’s gRPC server if it crashes instead of leaving kubelet in a degraded state. Also suggests tying failure into kubelet health reporting (e.g. systemd watchdog) so nodes are marked unhealthy when device management is broken. |
| Retry pod admission on device plugin grpc failures · Issue #128043 | Open | TBD | SIG Node | sig/node, kind/bug, priority/important-longterm | Pods that require device plugins may fail permanently if kubelet starts before the plugin. The issue proposes retrying pod admission or deferring failure until device plugin gRPC endpoints are available. Still under discussion. |