@RobertKielty
Last active January 26, 2026 12:54
AI/ML Kubernetes References

AI and ML on Kubernetes

Conformance Repo

https://github.com/cncf/k8s-ai-conformance

How Kubernetes has Evolved to handle AI/ML workloads

Navigating Failures in Pods with Devices

The following blog post from mid-2025 and the related talk give a great explanation of how AI/ML workloads differ from other workloads on Kubernetes. AI and ML workloads run on specialized hardware such as GPUs and other accelerators, and Kubernetes has implemented changes to accommodate these devices.

Blog Post

Navigating Failures in Pods With Devices, by Sergey Kanzhelev (Google) and Mrunal Patel (Red Hat) | Thursday, July 03, 2025

KubeCon + CloudNativeCon talk on YouTube

The blog post is based on the KubeCon NA 2024 (Salt Lake City) talk given by Sergey Kanzhelev (Google) and Mrunal Patel (Red Hat).

https://www.youtube.com/watch?v=-YCnOYTtVO8&list=PLj6h78yzYM2Pw4mRw4S-1p_xLARMqPkA7&index=151

Slides

And for completeness, here are the slides from the talk:

https://static.sched.com/hosted_files/kccncna2024/b9/KubeCon%20NA%202024_%20Navigating%20Failures%20in%20Pods%20With%20Devices_%20Challenges%20and%20Solutions.pptx.pdf

The blog post lays out a roadmap. Here I have attempted (using agentic AI) to provide updates on the issues mentioned in that blog post. The top two issues are closed, and I provide links to the merge commits associated with their closure.

The quoted issues have been worked on since the blog post was written.

Updated Road Map from the blog post
| Issue | Status | Target Release | SIG | Labels | Summary |
| --- | --- | --- | --- | --- | --- |
| integrate kubelet with the systemd watchdog · Issue #127460 | Closed | v1.32 | SIG Node | sig/node, kind/feature, priority/important-soon | Closed by PR #127566, merged as commit 7fff5b6. Adds systemd watchdog integration to the kubelet so systemd can restart it if it becomes unresponsive, improving node self-healing and failure detection. |
| DRA: detect stale DRA plugin sockets · Issue #128696 | Closed | v1.34 | SIG Node | sig/node, sig/scheduling, kind/bug, priority/important-soon | Closed by PR #133152, merged as commit 837b739. The kubelet now detects and cleans up stale Dynamic Resource Allocation (DRA) plugin sockets, preventing infinite retries against dead gRPC endpoints. |
| Support takeover for devicemanager/device-plugin · Issue #127803 | Open | TBD | SIG Node | sig/node, kind/feature, priority/important-longterm | Design discussion ongoing. Proposes allowing a new device plugin instance to take over device ownership from a previous instance without forcing deregistration, enabling safe plugin restarts and upgrades without pod disruption. No merged implementation yet. |
| Kubelet plugin registration reliability · Issue #127457 | Open | TBD | SIG Node | sig/node, kind/bug, priority/important-longterm | Tracks reliability gaps in the kubelet's plugin registration flow, including missing retries on GetInfo and GetDevicePluginOptions and slow detection of plugin restarts. Accepted as long-term reliability work; no PR merged yet. |
| Recreate the Device Manager gRPC server if failed · Issue #128167 | Open | TBD | SIG Node | sig/node, kind/feature, help-wanted | Proposes restarting the device manager's gRPC server if it crashes instead of leaving the kubelet in a degraded state. Also suggests tying failure into kubelet health reporting (e.g. systemd watchdog) so nodes are marked unhealthy when device management is broken. |
| Retry pod admission on device plugin grpc failures · Issue #128043 | Open | TBD | SIG Node | sig/node, kind/bug, priority/important-longterm | Pods that require device plugins may fail permanently if the kubelet starts before the plugin. The issue proposes retrying pod admission or deferring failure until device plugin gRPC endpoints are available. Still under discussion. |
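
To make the "stale plugin socket" row above more concrete, here is a minimal Python sketch of the general idea: a Unix socket file can remain on disk after the process that created it has exited, and a connection attempt then fails with "connection refused". This is only an illustration under that assumption, not the kubelet's actual implementation (which is written in Go). The function names `is_stale_unix_socket` and `remove_if_stale` are hypothetical.

```python
import os
import socket
import stat


def is_stale_unix_socket(path: str, timeout: float = 1.0) -> bool:
    """Return True if `path` is a Unix socket file that nothing is listening on."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return False  # nothing on disk, so nothing to clean up
    if not stat.S_ISSOCK(mode):
        return False  # not a socket file; leave it alone

    probe = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    probe.settimeout(timeout)
    try:
        probe.connect(path)
    except ConnectionRefusedError:
        return True  # the file exists, but no process is accepting connections
    except OSError:
        return False  # timeout, permissions, etc.: unknown state, do not treat as stale
    finally:
        probe.close()
    return False  # the connection succeeded, so the endpoint is live


def remove_if_stale(path: str) -> bool:
    """Delete the socket file only when it is stale; report whether it was removed."""
    if is_stale_unix_socket(path):
        os.unlink(path)
        return True
    return False
```

A caller could run `remove_if_stale` on each known socket path before attempting to reconnect, so that a dead endpoint is dropped instead of being retried indefinitely. Treating ambiguous errors as "not stale" is a deliberately conservative choice in this sketch, since deleting a live plugin's socket would be worse than leaving a dead one in place.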