@fizz
Last active March 1, 2026 04:41
Kubeflow prod incident update: reconciliation + RBAC recovery (2026-02-27)

Date: February 27, 2026
Clusters: mlinfra-prod, mlinfra-29

Three things broke; all are fixed now. Both clusters are stable.

  1. KFP frontend images kept reverting after manual edits.
  2. workflow-controller and kserve-controller-manager were in CrashLoopBackOff on prod.
  3. Prod and dev had drifted apart on controller RBAC and KFP config.

What happened and what I fixed

KFP frontend reconciliation

Manual edits to ml-pipeline-ui and ml-pipeline-ui-artifact kept reverting. Metacontroller/profile-controller was reconciling them back from the parent Namespace resource.

Fix: patched the profile-controller env ConfigMap to set FRONTEND_IMAGE=ghcr.io/kubeflow/kfp-frontend and FRONTEND_TAG=2.5.0, then restarted profile-controller and triggered a namespace reconcile. Now the controller itself writes the correct image — no more manual edits to revert.
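Roughly what that fix looked like as commands. This is a sketch: the ConfigMap and deployment names (`kubeflow-pipelines-profile-controller-env`, `kubeflow-pipelines-profile-controller`) and the `admin` profile namespace are assumptions, not taken from the incident itself; only the two env keys and values are from the report.

```shell
# Names below are illustrative - match them to your profile-controller install.
kubectl --context mlinfra-prod -n kubeflow patch configmap \
  kubeflow-pipelines-profile-controller-env \
  --type merge \
  -p '{"data":{"FRONTEND_IMAGE":"ghcr.io/kubeflow/kfp-frontend","FRONTEND_TAG":"2.5.0"}}'

# Restart so the controller picks up the new env.
kubectl --context mlinfra-prod -n kubeflow rollout restart \
  deployment/kubeflow-pipelines-profile-controller

# Touch the profile namespace to nudge a reconcile.
kubectl --context mlinfra-prod annotate namespace admin \
  reconcile-nudge="$(date +%s)" --overwrite
```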

CrashLoopBackOff controllers

workflow-controller and kserve-controller-manager were crashing on prod with forbidden list/watch errors. RBAC bindings had drifted.

Fixes:

  • Restored cluster-scope RBAC for workflow controller (kubeflow/argo SA).
  • Fixed KServe manager binding subjects to include expected service accounts.
  • Restarted both deployments, both rolled out healthy.
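A minimal sketch of the restored workflow-controller binding. Only the `kubeflow/argo` ServiceAccount is from the incident; the binding and ClusterRole names here are illustrative placeholders.

```yaml
# Illustrative - binding and role names are assumptions; the subject is
# the kubeflow/argo SA named in the incident report.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: argo-cluster-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: argo-cluster-role
subjects:
  - kind: ServiceAccount
    name: argo
    namespace: kubeflow
```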

Argo CD cleanup

Argo CD was not running (no namespace, no pods, no apps) but had leftover CRDs: applications.argoproj.io, applicationsets.argoproj.io, appprojects.argoproj.io. Removed those. Kept the Argo Workflows CRDs — Kubeflow Pipelines needs those.

For anyone confused by this in the future: KFP depends on Argo Workflows (execution engine), not Argo CD (GitOps controller). Different projects, similar names.
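The cleanup amounts to deleting only the Argo CD CRD group and verifying the Argo Workflows CRDs survive. The CRD names are from this report; the verification grep is just one way to eyeball the remainder.

```shell
# Remove the leftover Argo CD CRDs (Argo CD itself was not installed).
kubectl --context mlinfra-prod delete crd \
  applications.argoproj.io applicationsets.argoproj.io appprojects.argoproj.io

# Argo Workflows CRDs (workflows.argoproj.io and friends) must remain - KFP needs them.
kubectl --context mlinfra-prod get crd | grep argoproj.io
```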

Follow-up changes (post-stabilization)

Timebomb removed: pipeline-install-config.appVersion

It was 2.0.5 in both clusters, lagging the deployed frontend; it is now 2.5.0 in both.
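The bump is a one-key ConfigMap patch; `pipeline-install-config` is the standard KFP ConfigMap that carries `appVersion`. A sketch applying it to both clusters:

```shell
# Apply the appVersion bump to both contexts.
for ctx in mlinfra-prod mlinfra-29; do
  kubectl --context "$ctx" -n kubeflow patch configmap pipeline-install-config \
    --type merge -p '{"data":{"appVersion":"2.5.0"}}'
done
```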

Profile-controller pins in both clusters

Set explicit image pins so reconciliation writes the right thing going forward:

  • FRONTEND_IMAGE=ghcr.io/kubeflow/kfp-frontend
  • FRONTEND_TAG=2.5.0
  • VISUALIZATION_SERVER_IMAGE=gcr.io/ml-pipeline/visualization-server
  • VISUALIZATION_SERVER_TAG=2.0.5 (pinned during the appVersion bump — will update separately)
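Declaratively, the four pins are just ConfigMap data. A sketch, with the ConfigMap name assumed (match it to your profile-controller deployment's env source):

```yaml
# ConfigMap name is illustrative; the four keys/values are the pins above.
apiVersion: v1
kind: ConfigMap
metadata:
  name: kubeflow-pipelines-profile-controller-env
  namespace: kubeflow
data:
  FRONTEND_IMAGE: ghcr.io/kubeflow/kfp-frontend
  FRONTEND_TAG: "2.5.0"
  VISUALIZATION_SERVER_IMAGE: gcr.io/ml-pipeline/visualization-server
  VISUALIZATION_SERVER_TAG: "2.0.5"
```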

Dev/prod parity restored

  • admin/ml-pipeline-ui-artifact runs ghcr.io/kubeflow/kfp-frontend:2.5.0 in both clusters.
  • kserve-manager-rolebinding includes both subjects in both clusters:
    • ServiceAccount:kserve:kserve-controller-manager
    • ServiceAccount:kubeflow:kserve-controller-manager
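As a manifest, that binding looks roughly like the following. The two subjects are from this report; the `roleRef` name is an assumption based on KServe's default manifests.

```yaml
# Sketch of kserve-manager-rolebinding with both subjects.
# roleRef name assumed from upstream KServe defaults.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kserve-manager-rolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kserve-manager-role
subjects:
  - kind: ServiceAccount
    name: kserve-controller-manager
    namespace: kserve
  - kind: ServiceAccount
    name: kserve-controller-manager
    namespace: kubeflow
```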

New operational scripts

kubeflow-version-snapshot.sh: a one-command dump of Kubeflow/KFP state for a given context. It reports component images, pipeline-install-config.appVersion, the profile-controller frontend overrides, and infers the Kubeflow release line.

./scripts/kubeflow-version-snapshot.sh mlinfra-prod kubeflow
./scripts/kubeflow-version-snapshot.sh mlinfra-29 kubeflow
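A hypothetical core of such a snapshot script, not the actual script contents; the jsonpath queries and the first-container assumption are illustrative.

```shell
# Sketch: dump deployment images and the KFP appVersion for one context.
ctx="$1"; ns="${2:-kubeflow}"

echo "== component images =="
kubectl --context "$ctx" -n "$ns" get deploy \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'

echo "== pipeline-install-config appVersion =="
kubectl --context "$ctx" -n "$ns" get configmap pipeline-install-config \
  -o jsonpath='{.data.appVersion}{"\n"}'
```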

kubeflow-rbac-smoke.sh: runs kubectl auth can-i checks against the service accounts that broke during this incident: the workflow controller (kubeflow/argo), the KServe controller (both SA locations), and scheduledworkflow.

./scripts/kubeflow-rbac-smoke.sh mlinfra-prod
./scripts/kubeflow-rbac-smoke.sh mlinfra-29
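Representative checks of the kind the smoke script performs. The SA names are from the incident; the specific verbs and resources here are assumptions about which permissions were missing.

```shell
# Impersonate each SA and verify the list/watch permissions that failed.
ctx="$1"

kubectl --context "$ctx" auth can-i list workflows.argoproj.io \
  --as=system:serviceaccount:kubeflow:argo --all-namespaces

kubectl --context "$ctx" auth can-i watch inferenceservices.serving.kserve.io \
  --as=system:serviceaccount:kserve:kserve-controller-manager --all-namespaces

kubectl --context "$ctx" auth can-i watch inferenceservices.serving.kserve.io \
  --as=system:serviceaccount:kubeflow:kserve-controller-manager --all-namespaces
```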

Both clusters pass all checks as of this writing.

Current versions

  • Kubernetes: v1.32.11-eks-* (both clusters)
  • Kubeflow control plane: v1.8.x (core components on v1.8.0 tags)
  • KFP frontend: 2.5.0 (both clusters)

Upgrade path (EKS + Kubeflow)

How to do the next upgrade without repeating this:

  1. Freeze the baseline. Export critical Kubeflow manifests and the RBAC deltas from this incident into source control before touching anything.

  2. Pick target versions up front. Choose the target EKS minor and compatible Kubeflow/KServe/KFP versions as a set. Don't mix and match.

  3. Do dev first, fully. Control plane → add-ons → nodegroups → Kubeflow validation. The whole sequence in dev before touching prod.

  4. Make the hotfixes declarative. The profile-controller env pins and RBAC fixes from this incident need to live in manifests, not be things I patched by hand.

  5. Gate each phase. Don't move to the next step until these are healthy:

    • workflow-controller
    • kserve-controller-manager
    • ml-pipeline-ui
    • ml-pipeline-ui-artifact
    • A sample pipeline run completes
  6. Collapse the duplicate KServe controller. Right now it exists in both kserve and kubeflow namespaces. Pick one, remove the other.

  7. Run the smoke scripts after each phase. Version snapshot + RBAC smoke. That's what they're for.

  8. Don't do a one-shot full-stack upgrade on prod. Same staged sequence as dev, with rollback points between phases.
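The health gate in step 5 can be mechanized as rollout checks plus a pipeline run. A sketch under assumptions: ml-pipeline-ui-artifact is checked in the `admin` profile namespace, the KServe controller in `kserve`, and the sample-run step is left as a placeholder since the report does not name one.

```shell
# Gate: block until the incident's deployments are rolled out on dev.
ctx=mlinfra-29

for d in workflow-controller ml-pipeline-ui; do
  kubectl --context "$ctx" -n kubeflow rollout status deploy/"$d" --timeout=120s
done
kubectl --context "$ctx" -n admin rollout status \
  deploy/ml-pipeline-ui-artifact --timeout=120s
kubectl --context "$ctx" -n kserve rollout status \
  deploy/kserve-controller-manager --timeout=120s

# Final gate: submit a known-good sample pipeline and wait for it to succeed
# (mechanism depends on how you submit runs - KFP SDK, UI, or kfp CLI).
```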
