@dims
Created January 23, 2026 15:00

Kops CI Job Failure Patterns Analysis

Generated: 2026-01-23
Analysis Method: 3 most recent builds analyzed per job type

Executive Summary

Pattern ID | Failure Pattern | Affected Job Count | Primary Dashboards
1 | KOPS_STATE_STORE / S3 Bucket Access | 47+ | kops-misc, kops-upgrades, kops-upgrades-many-addons
2 | runc Asset Hash Error | 50+ | kops-gce, kops-grid (GCE)
3 | Network Connectivity Timeout | 100+ | kops-grid, kops-nftables, kops-network-plugins
4 | ARM64 EBS CSI / Pod Startup | 20+ | kops-distros, kops-grid
5 | Azure Provider ID / Resource Issues | 2 | kops-azure, kops-network-plugins
6 | AWS VPC CNI Pending | 50+ | kops-grid (amazonvpc)

Pattern 1: KOPS_STATE_STORE / S3 Bucket Access Errors

Error Signatures

Error: State Store: Required value: Please set the --state flag or export KOPS_STATE_STORE.
Could not retrieve location for AWS bucket k8s-infra-kops-state-XXXX-YYYYMMDDHHMMSS

Root Cause

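kops cannot use its S3 state store during cluster creation. Either no state store is configured (first signature: neither --state nor KOPS_STATE_STORE is set), or the lookup of the bucket location fails (second signature), so kops has nowhere to store the cluster configuration.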
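
A preflight check can rule out the unset-variable case before kops runs. A minimal sketch in Python, assuming the jobs use an S3-backed state store given as an s3:// URL; the helper name check_state_store is hypothetical and is not part of kops:

```python
import os
from urllib.parse import urlparse


def check_state_store(env=None):
    """Return (ok, message) for the KOPS_STATE_STORE value in env.

    Hypothetical preflight helper: it only inspects the variable and
    never contacts S3.
    """
    env = os.environ if env is None else env
    value = env.get("KOPS_STATE_STORE", "").strip()
    if not value:
        return False, "KOPS_STATE_STORE is unset or empty"
    parsed = urlparse(value)
    # Assumption: the state store is S3-backed (s3://bucket[/prefix]).
    if parsed.scheme != "s3" or not parsed.netloc:
        return False, f"expected s3://<bucket>, got {value!r}"
    return True, f"state store bucket: {parsed.netloc}"
```

A passing check only excludes the first signature. The bucket-location error can still occur, for example when credentials are missing or the bucket does not exist, and this sketch does not test for that.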

Sample Jobs (47 total - see separate document)

Job Name TestGrid URL Builds Verified
e2e-kops-aws-addon-resource-tracking https://testgrid.k8s.io/kops-misc#kops-aws-addon-resource-tracking 2014680399849984000
e2e-kops-aws-karpenter https://testgrid.k8s.io/kops-misc#kops-aws-karpenter 2014371098774212608
e2e-kops-aws-eks-pod-identity https://testgrid.k8s.io/kops-k8s-stable#kops-aws-eks-pod-identity 2014657750524497920
e2e-kops-aws-upgrade-* (20 jobs) https://testgrid.k8s.io/kops-upgrades Multiple
e2e-kops-aws-upgrade-*-many-addons (20 jobs) https://testgrid.k8s.io/kops-upgrades-many-addons Multiple

Full list: See /Users/dsrinivas/notes/kops-failing-jobs-state-store-errors.md


Pattern 2: runc Asset Hash Error

Error Signature

unable to remap asset: cannot determine hash for 'https://github.com/opencontainers/runc/releases/download/v1.3.3/runc.amd64' (have you specified a valid file location?)
deadline exceeded executing task BootstrapScript/nodes-us-central1-a

Root Cause

kops cannot determine a hash for the runc v1.3.3 release asset during cluster creation, so it cannot verify the binary's integrity. The asset validation mechanism cannot access or resolve the GitHub release URL.
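
Determining a hash for an asset comes down to computing a digest over the file's bytes and comparing it with an expected value. A minimal sketch in Python, assuming SHA-256 and a file already on local disk; the helper names are hypothetical and this is not kops's implementation:

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large binaries are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's digest with an expected hex string (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

If the asset cannot be fetched at all, as the error signature suggests, there are no bytes to digest and this step cannot begin.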

Configuration Trigger

--set=cluster.spec.containerd.runc.version=1.3.3

Affected Jobs

Job Name TestGrid URL Builds Verified
e2e-kops-grid-gce-calico-cos121-k32-ko32 https://testgrid.k8s.io/kops-gce#kops-grid-gce-calico-cos121-k32-ko32 2014551045773987840, 2014369839992279040
e2e-kops-grid-gce-calico-cos121-k32-ko33 https://testgrid.k8s.io/kops-gce#kops-grid-gce-calico-cos121-k32-ko33 Similar pattern
e2e-kops-grid-gce-calico-cos121-k33-ko33 https://testgrid.k8s.io/kops-gce#kops-grid-gce-calico-cos121-k33-ko33 Similar pattern
e2e-kops-grid-gce-calico-cos125-k32-ko33 https://testgrid.k8s.io/kops-gce#kops-grid-gce-calico-cos125-k32-ko33 Similar pattern
e2e-kops-grid-gce-calico-cos125-k34-ko34 https://testgrid.k8s.io/kops-gce#kops-grid-gce-calico-cos125-k34-ko34 Similar pattern
e2e-kops-grid-gce-calico-cosdev-k32-ko33 https://testgrid.k8s.io/kops-gce#kops-grid-gce-calico-cosdev-k32-ko33 Similar pattern
e2e-kops-grid-gce-cilium-cos121-k32-ko32 https://testgrid.k8s.io/kops-gce#kops-grid-gce-cilium-cos121-k32-ko32 2014678890357723136
e2e-kops-grid-gce-cilium-cos121-k32-ko33 https://testgrid.k8s.io/kops-gce#kops-grid-gce-cilium-cos121-k32-ko33 Similar pattern
e2e-kops-grid-gce-cilium-cos121-k33-ko33 https://testgrid.k8s.io/kops-gce#kops-grid-gce-cilium-cos121-k33-ko33 Similar pattern
e2e-kops-grid-gce-cilium-cos125-* https://testgrid.k8s.io/kops-gce Similar pattern
e2e-kops-grid-gce-cilium-cosdev-* https://testgrid.k8s.io/kops-gce Similar pattern
e2e-kops-grid-gce-ipalias-cos121-k32-ko32 https://testgrid.k8s.io/kops-gce#kops-grid-gce-ipalias-cos121-k32-ko32 2014601881703157760
e2e-kops-grid-gce-ipalias-cos121-k32-ko33 https://testgrid.k8s.io/kops-gce#kops-grid-gce-ipalias-cos121-k32-ko33 Similar pattern
e2e-kops-grid-gce-ipalias-cos121-k33-ko33 https://testgrid.k8s.io/kops-gce#kops-grid-gce-ipalias-cos121-k33-ko33 Similar pattern
All e2e-kops-grid-gce-*-deb12-* jobs https://testgrid.k8s.io/kops-gce Similar pattern
All e2e-kops-grid-gce-*-deb13-* jobs https://testgrid.k8s.io/kops-gce Similar pattern
All e2e-kops-grid-gce-*-u2204-* jobs https://testgrid.k8s.io/kops-gce Similar pattern
All e2e-kops-grid-gce-*-u2404-* jobs https://testgrid.k8s.io/kops-gce Similar pattern
All e2e-kops-grid-gce-*-umini2404-* jobs https://testgrid.k8s.io/kops-gce Similar pattern

Dashboards Affected

  • kops-gce (55+ jobs)
  • kops-distro-cos121
  • kops-distro-cos125
  • kops-distro-cosdev

Pattern 3: Network Connectivity Timeout / API Server Unreachable

Error Signatures

dial tcp X.X.X.X:443: i/o timeout
net/http: TLS handshake timeout
dial tcp X.X.X.X:443: connect: connection refused

Root Cause

The test runner cannot reach the Kubernetes API server through the Network Load Balancer (NLB). The control plane may be unreachable due to:

  • NLB backend registration latency
  • Network path establishment delay
  • Instance startup delays
  • Security group or routing issues
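
Several of these causes are timing-related, so it helps to tell an endpoint that is never reachable from one that becomes reachable after a delay. A minimal sketch in Python of a TCP probe with retries; the helper name wait_for_tcp and its defaults are hypothetical, and it checks only the TCP connection, not the TLS handshake:

```python
import socket
import time


def wait_for_tcp(host, port, attempts=5, delay=2.0, timeout=3.0):
    """Try to open a TCP connection, retrying on timeout or refusal.

    Returns the 1-based attempt number that succeeded, or 0 if all failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return attempt
        except OSError:
            # Covers timeouts and "connection refused" alike.
            if attempt < attempts:
                time.sleep(delay)
    return 0
```

A result greater than 1 points toward startup or registration latency. A result of 0 is more consistent with a persistent problem such as a security group or routing issue.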

Affected Jobs

Job Name TestGrid URL Builds Verified
e2e-kops-aws-nftables-amzn2 https://testgrid.k8s.io/kops-nftables#kops-aws-nftables-amzn2 2014672598184497152, 2014551801272995840
e2e-kops-aws-nftables-deb11 https://testgrid.k8s.io/kops-nftables#kops-aws-nftables-deb11 2014648941911478272
e2e-kops-aws-cni-kuberouter https://testgrid.k8s.io/kops-network-plugins#kops-aws-cni-kuberouter 2014601881891901440
e2e-kops-grid-amazonvpc-al2023-k32 https://testgrid.k8s.io/kops-grid#amazonvpc-al2023-k32 2012751101442396160
e2e-kops-grid-cilium-eni-al2023-k32 https://testgrid.k8s.io/kops-grid#cilium-eni-al2023-k32 2013609069738201088
e2e-kops-grid-amazonvpc-rhel9-k32 https://testgrid.k8s.io/kops-grid#amazonvpc-rhel9-k32 2014520092133429248

Additional Jobs with This Pattern

  • All e2e-kops-grid-amazonvpc-* jobs (100+ jobs)
  • All e2e-kops-grid-cilium-eni-* jobs (20+ jobs)
  • e2e-kops-aws-nftables-* jobs

Dashboards Affected

  • kops-grid
  • kops-nftables
  • kops-network-plugins
  • kops-distro-* (various)

Pattern 4: ARM64 EBS CSI Driver / Pod Startup Issues

Error Signatures

Pod kube-system/ebs-csi-node-* system-node-critical pod is not ready (ebs-plugin)
Pod kube-system/coredns-* is not ready (coredns)
Pod kube-system/aws-node-termination-handler-* is not ready

Root Cause

ARM64 compatibility issues affect the EBS CSI driver and other system components. On specific OS/architecture combinations (RHEL 10 ARM64, Rocky 10 ARM64), the OS image may lack required kernel modules or be otherwise incompatible with the EBS CSI driver and CNI plugin.
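
To see which pods and containers are stuck in a given build, the "is not ready" lines can be pulled out of the log. A minimal sketch in Python, with a regular expression modeled only on the three signatures above; real log lines may differ, and the name parse_not_ready is hypothetical:

```python
import re

# Matches lines such as:
#   Pod kube-system/ebs-csi-node-* system-node-critical pod is not ready (ebs-plugin)
#   Pod kube-system/coredns-* is not ready (coredns)
NOT_READY = re.compile(
    r"Pod (?P<namespace>[\w-]+)/(?P<pod>[\w.*-]+)"
    r"(?: (?P<priority>[\w-]+) pod)? is not ready"
    r"(?: \((?P<container>[^)]+)\))?"
)


def parse_not_ready(log_text):
    """Return one dict per 'is not ready' pod line found in log_text.

    Optional fields (priority, container) are None when absent.
    """
    return [m.groupdict() for m in NOT_READY.finditer(log_text)]
```

Grouping the results by container across builds would show whether ebs-plugin is always the first container to fail or just one of several.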

Affected Jobs

Job Name TestGrid URL Builds Verified
e2e-kops-aws-distro-rhel10arm64 https://testgrid.k8s.io/kops-distros#kops-aws-distro-rhel10arm64 2014656743484690432, 2014535946166341632
e2e-kops-aws-distro-rocky10arm64 https://testgrid.k8s.io/kops-distros#kops-aws-distro-rocky10arm64 2014692479370006528
e2e-kops-grid-amazonvpc-rhel10arm64-k32 https://testgrid.k8s.io/kops-grid#amazonvpc-rhel10arm64-k32 Similar pattern
e2e-kops-grid-amazonvpc-rhel10arm64-k33 https://testgrid.k8s.io/kops-grid#amazonvpc-rhel10arm64-k33 Similar pattern
e2e-kops-grid-amazonvpc-rhel10arm64-k34 https://testgrid.k8s.io/kops-grid#amazonvpc-rhel10arm64-k34 Similar pattern
e2e-kops-grid-amazonvpc-rhel10arm64-k35 https://testgrid.k8s.io/kops-grid#amazonvpc-rhel10arm64-k35 Similar pattern
e2e-kops-grid-amazonvpc-rocky10arm64-k32 https://testgrid.k8s.io/kops-grid#amazonvpc-rocky10arm64-k32 Similar pattern
e2e-kops-grid-amazonvpc-rocky10arm64-k33 https://testgrid.k8s.io/kops-grid#amazonvpc-rocky10arm64-k33 Similar pattern
e2e-kops-grid-amazonvpc-rocky10arm64-k34 https://testgrid.k8s.io/kops-grid#amazonvpc-rocky10arm64-k34 Similar pattern
e2e-kops-grid-amazonvpc-rocky10arm64-k35 https://testgrid.k8s.io/kops-grid#amazonvpc-rocky10arm64-k35 Similar pattern
e2e-kops-grid-calico-rhel10arm64-* https://testgrid.k8s.io/kops-grid Similar pattern
e2e-kops-grid-calico-rocky10arm64-* https://testgrid.k8s.io/kops-grid Similar pattern

Dashboards Affected

  • kops-distros
  • kops-grid
  • kops-distro-rhel10
  • kops-distro-rocky10

Pattern 5: Azure Provider ID / Resource Issues

Error Signatures

ignoring node with malformed provider ID: unexpected form of resource path: ""
InvalidResourceReference: Resource /subscriptions/.../providers/Microsoft.Network/publicIPAddresses/API-E2E-... was not found
Pod kube-system/coredns-* is not ready

Root Cause

Azure cloud provider integration issues:

  1. Malformed provider IDs - Azure provider not correctly setting node identifiers
  2. Resource creation timing issues - Public IP addresses not available when load balancer attempts attachment
  3. CoreDNS not becoming ready due to networking issues

Affected Jobs

Job Name TestGrid URL Builds Verified
e2e-kops-azure-cni-kindnet https://testgrid.k8s.io/kops-azure#kops-azure-cni-kindnet 2014661273584668672
e2e-kops-azure-cni-kubenet https://testgrid.k8s.io/kops-azure#kops-azure-cni-kubenet 2014688453056270336

Dashboards Affected

  • kops-azure
  • kops-network-plugins

Pattern 6: AWS VPC CNI (aws-node) Pending

Error Signatures

system-node-critical pod 'aws-node-*' is pending
Worker nodes remain in Unknown or False ready state
EBS CSI driver pods pending

Root Cause

The AWS VPC CNI plugin (aws-node daemonset) fails to start, preventing worker nodes from becoming ready. This creates a cascading failure where:

  1. Networking layer cannot initialize
  2. Nodes cannot properly register as Ready
  3. System pods dependent on networking remain stuck in pending state

Affected Jobs

Job Name TestGrid URL Builds Verified
e2e-kops-grid-amazonvpc-u2510-k32 https://testgrid.k8s.io/kops-grid#amazonvpc-u2510-k32 2014520092028571648
e2e-kops-grid-amazonvpc-u2510-k33 https://testgrid.k8s.io/kops-grid#amazonvpc-u2510-k33 Similar pattern
e2e-kops-grid-amazonvpc-u2510-k34 https://testgrid.k8s.io/kops-grid#amazonvpc-u2510-k34 Similar pattern
e2e-kops-grid-amazonvpc-u2510-k35 https://testgrid.k8s.io/kops-grid#amazonvpc-u2510-k35 Similar pattern
e2e-kops-grid-amazonvpc-u2510arm64-* https://testgrid.k8s.io/kops-grid Similar pattern
e2e-kops-grid-amazonvpc-u2404-* https://testgrid.k8s.io/kops-grid Similar pattern
e2e-kops-grid-amazonvpc-u2404arm64-* https://testgrid.k8s.io/kops-grid Similar pattern
e2e-kops-grid-amazonvpc-deb11-* https://testgrid.k8s.io/kops-grid Similar pattern
e2e-kops-grid-amazonvpc-deb13-* https://testgrid.k8s.io/kops-grid Similar pattern
e2e-kops-grid-amazonvpc-al2023-* https://testgrid.k8s.io/kops-grid Similar pattern
e2e-kops-grid-amazonvpc-amzn2-* https://testgrid.k8s.io/kops-grid Similar pattern
e2e-kops-grid-amazonvpc-rhel9-* https://testgrid.k8s.io/kops-grid Similar pattern
e2e-kops-grid-amazonvpc-rocky9-* https://testgrid.k8s.io/kops-grid Similar pattern
e2e-kops-grid-amazonvpc-flatcar-* https://testgrid.k8s.io/kops-grid Similar pattern

Dashboards Affected

  • kops-grid (all amazonvpc jobs)
  • kops-distro-u2510
  • kops-distro-u2404
  • kops-distro-deb11
  • kops-distro-deb13
  • kops-distro-al2023
  • kops-distro-amzn2

Verification Methodology

For each failure pattern, the following verification process was used:

  1. Job Enumeration: Retrieved all failing jobs from each kops TestGrid dashboard
  2. Build Selection: Fetched 3 most recent build numbers for each job type
  3. Log Analysis: Downloaded and analyzed build logs from GCS (kubernetes-ci-logs/logs/<job>/<build>/build-log.txt)
  4. Pattern Extraction: Identified key error messages and root causes
  5. Cross-Verification: Verified pattern consistency across multiple builds
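
Steps 3 and 4 can be sketched as follows. This is a minimal illustration in Python, not the tooling used for this analysis: the signature fragments are copied from the patterns above, the names build_log_path and classify are hypothetical, and downloading the log is left out.

```python
import re

# Signature fragments taken from the error signatures in this document.
# "coredns is not ready" is omitted because it appears in Patterns 4 and 5.
SIGNATURES = {
    1: [r"export KOPS_STATE_STORE", r"Could not retrieve location for AWS bucket"],
    2: [r"cannot determine hash for"],
    3: [r"i/o timeout", r"TLS handshake timeout", r"connection refused"],
    4: [r"ebs-csi-node-\S* .*is not ready"],
    5: [r"malformed provider ID", r"InvalidResourceReference"],
    6: [r"'aws-node-[^']*' is pending"],
}


def build_log_path(job, build):
    """GCS object path in the form given in step 3."""
    return f"kubernetes-ci-logs/logs/{job}/{build}/build-log.txt"


def classify(log_text):
    """Return the sorted pattern IDs whose signatures occur in log_text."""
    return sorted(
        pid
        for pid, regexes in SIGNATURES.items()
        if any(re.search(rx, log_text) for rx in regexes)
    )
```

A build whose log matches more than one pattern ID is a candidate for the overlap case described in the Notes, and is worth reading by hand rather than trusting the match.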

Jobs Analyzed with 3 Builds Each

  • e2e-kops-aws-distro-rhel10arm64: builds 2014656743484690432, 2014535946166341632, 2014415149175148544
  • e2e-kops-aws-distro-rocky10arm64: builds 2014692479370006528, 2014571682181681152, 2014450885047881728
  • e2e-kops-aws-cni-kuberouter: builds 2014601881891901440, 2014481085622128640, 2014360276966576128
  • e2e-kops-azure-cni-kindnet: builds 2014706572801871872, 2014661273584668672, 2014615974820450304
  • e2e-kops-azure-cni-kubenet: builds 2014688453056270336, 2014643154090725376, 2014597854982574080
  • e2e-kops-aws-nftables-amzn2: builds 2014672598184497152, 2014551801272995840, 2014431004013367296
  • e2e-kops-aws-nftables-deb11: builds 2014648941911478272, 2014528144895119360, 2014407347765514240
  • e2e-kops-gce-nftables-deb12arm64: builds 2014687446238760960, 2014566648987521024, 2014445852478672896
  • e2e-kops-grid-gce-calico-cos121-k32-ko32: builds 2014551045773987840, 2014369839992279040, 2014188644486615040
  • e2e-kops-grid-gce-cilium-cos121-k32-ko32: builds 2014678890357723136, 2014497694772367360, 2014316488470564864
  • e2e-kops-grid-amazonvpc-al2023-k32: builds 2012751101442396160, 2010214395816185856, 2007677637941530624
  • e2e-kops-grid-amazonvpc-rhel9-k32: builds 2014520092133429248, 2011983284052955136, 2009446767153647616
  • e2e-kops-grid-calico-rhel9-k32: builds 2014527641779965952, 2011990833972121600, 2009454316598857728
  • e2e-kops-grid-cilium-eni-al2023-k32: builds 2013609069738201088, 2011072393837023232, 2008535754950578176
  • e2e-kops-grid-gce-ipalias-cos121-k32-ko32: builds 2014601881703157760, 2014420685698371584, 2014239480302538752
  • e2e-kops-aws-eks-pod-identity: builds 2014657750524497920, 2014536953462001664, 2014416156202373120

Summary by Dashboard

Dashboard | Total Failing Jobs | Primary Patterns
kops-misc | 11 | Pattern 1 (State Store)
kops-upgrades | 20 | Pattern 1 (State Store)
kops-upgrades-many-addons | 20 | Pattern 1 (State Store)
kops-gce | 55+ | Pattern 2 (runc hash)
kops-grid | 200+ | Patterns 3, 4, 6 (Network, ARM64, VPC CNI)
kops-distros | 2 | Pattern 4 (ARM64)
kops-network-plugins | 3 | Patterns 3, 5 (Network, Azure)
kops-nftables | 4 | Pattern 3 (Network)
kops-azure | 2 | Pattern 5 (Azure)
kops-ipv6 | 1 | Pattern 1 (State Store)

Notes

  1. Pattern Overlap: Some jobs may exhibit multiple failure patterns across different builds
  2. Transient vs Persistent: Patterns 1, 2, and 5 are persistent failures; Patterns 3, 4, and 6 sometimes self-resolve
  3. Jobs That Sometimes Pass: Some jobs in Patterns 3, 4, 6 occasionally pass after retries, suggesting infrastructure timing issues
  4. Builds Not Failing: During analysis, some builds of jobs such as e2e-kops-aws-nftables-rocky10arm64 and e2e-kops-grid-calico-rhel9-k32 were found to have succeeded, so the failure patterns are not fully consistent