
@jmealo
Last active March 11, 2026 16:37

RCA: search-retry-feeder CrashLoopBackOff (Production)

Date: 2026-03-11
Reported by: Diego via Slack/Datadog
Investigated by: Jeff Mealo (with Claude Code)
Severity: Low (self-recovering, no data loss)
Status: Root cause identified, fix committed and pending deploy

Summary

Datadog alerted on search-retry-feeder-intel-requests-shard-32 entering CrashLoopBackOff in production. Root cause: a known Stakater Reloader bug (#299, #810, #1089) where reloadOnCreate=true and syncAfterRestart=true cause mass rolling restarts of all watched deployments on every Reloader pod restart or leader election — even when ConfigMap/Secret content is unchanged.

This has been affecting redis-mirror-envoy (and all downstream Redis consumers) since May 20, 2025 (~10 months).

Total Impact (~10 months)

  • ~85 cluster-wide Redis disruption events (Reloader leader elections)
  • ~169 spurious envoy rollouts (2 per event: ConfigMap + Secret)
  • ~30,000+ estimated search-retry-feeder restarts
  • ~66,000+ estimated Redis connection errors
  • All from content that never actually changed — hashes are identical across all 169 rollouts

How we calculated this

| Metric | Value (source) |
| --- | --- |
| Envoy deployment revision | 179 (from `deployment.kubernetes.io/revision`) |
| Legitimate chart deploys (git commits) | 10 (since 2025-05-20) |
| Spurious Reloader rollouts | ~169 (179 - 10) |
| Reloader leader elections | ~85 (169 ÷ 2, since each event produces 2 rollouts) |
| Frequency | ~2 per week |
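The derivation above is plain arithmetic; a quick sketch using the numbers from the table:

```python
# Numbers from the table above; reconstructs the spurious-rollout estimate.
total_revisions = 179      # deployment.kubernetes.io/revision on redis-mirror-envoy
legit_deploys = 10         # chart-touching git commits since 2025-05-20
spurious_rollouts = total_revisions - legit_deploys

# Each leader election rolls the deployment twice (ConfigMap + Secret);
# ceiling division because one event (Mar 6) produced only a single rollout.
leader_elections = -(-spurious_rollouts // 2)

print(spurious_rollouts, leader_elections)   # 169 85
```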

Measured: 7-day window (Mar 4–11)

| Date | Pod Restarts | Redis Errors |
| --- | --- | --- |
| Mar 4 | 32 | 380 |
| Mar 5 | 10 | 115 |
| Mar 6 | 294 | 241 |
| Mar 7 | 100 | 45 |
| Mar 8 | 18 | 303 |
| Mar 9 | 42 | 20 |
| Mar 10 | 93 | 440 |
| Mar 11 (partial) | 150 | |
| 7-day total | 707 restarts | 1,544 errors |

Extrapolated over ~43 weeks: ~30,000 restarts, ~66,000 Redis errors.
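The extrapolation is the measured week multiplied out over the affected period:

```python
# Extrapolating the measured 7-day window across ~43 weeks since 2025-05-20.
weekly_restarts, weekly_errors, weeks = 707, 1544, 43
print(weekly_restarts * weeks)   # 30401 -> "~30,000 restarts"
print(weekly_errors * weeks)     # 66392 -> "~66,000 Redis errors"
```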

No data loss — the task runner republishes stored messages before shutdown.

Architecture: redis-mirror-envoy

redis-mirror-envoy is an Envoy Redis proxy in the infra namespace (6 replicas) that splits Redis traffic by key prefix:

  • lock:* keys → redis_secondary (gisual-production-secondary.redis.cache.windows.net) with dual-write mirror to redis_primary
  • All other keys → redis_primary (gisual-production.redis.cache.windows.net)

This is a standalone deployment that all backend services connect to at redis-mirror-envoy.infra.svc.cluster.local:6379. It is not a sidecar.
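This prefix split maps onto Envoy's redis_proxy `prefix_routes`. A sketch of what that filter config looks like (field names are from Envoy's Redis proxy filter; the cluster names and exact layout here are assumptions, not the production ConfigMap):

```yaml
# Illustrative sketch only — not the production redis-lock-secondary-envoy-config.
prefix_routes:
  routes:
    - prefix: "lock:"
      cluster: redis_secondary          # lock keys go to the secondary cache
      request_mirror_policy:
        - cluster: redis_primary        # dual-write mirror back to primary
  catch_all_route:
    cluster: redis_primary              # everything else
```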

Root Cause: Stakater Reloader Bug

The bug

Reloader stores change state only in memory. On pod restart or leader election, the Kubernetes informer cache delivers all existing resources as "Add" events. With reloadOnCreate=true or syncAfterRestart=true, these are treated as changes, triggering rolling restarts for every watched deployment — even though content is identical.

Reloader already writes a SHA1 hash of the watched resource to the deployment annotation (reloader.stakater.com/last-reloaded-from), but never compares it against the replayed content on startup. That missing comparison is the unfixed design flaw.
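The missing guard is small: before reloading, compare the hash of the incoming resource against the annotation already on the deployment. A Python sketch of the idea (Reloader itself is Go, and the annotation actually stores a small JSON payload; the bare-hash format here is a simplification):

```python
import hashlib

# Annotation name from the RCA; storing a bare SHA1 hex digest is a simplification.
ANNOTATION = "reloader.stakater.com/last-reloaded-from"

def should_reload(deploy_annotations: dict, content: bytes) -> bool:
    """Skip reloads when the informer replays unchanged content."""
    return deploy_annotations.get(ANNOTATION) != hashlib.sha1(content).hexdigest()

config = b"static_resources: ..."
anns = {ANNOTATION: hashlib.sha1(config).hexdigest()}
print(should_reload(anns, config))                 # False: replayed "Add" event, no rollout
print(should_reload(anns, b"something: changed"))  # True: real change, reload
```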

Proof: Content hashes are identical across all rollouts

Every ReplicaSet created by Reloader has the exact same hashes — confirming no actual content change:

| Resource | Hash (SHA1) | Same across all rollouts? |
| --- | --- | --- |
| ConfigMap `redis-lock-secondary-envoy-config` | `354c95fd2dd2ca9de2a8801ef7b07b1fe3444826` | Yes |
| Secret `redis-mirror-envoy-secret` | `74a81cde887a3d3f14af3cffd2c2929eb240ecde` | Yes |

Running config (all 4 clusters)

All 4 clusters (demo, staging, production, infra) are running with both bad flags:

--reload-on-create=true      ← triggers spurious restarts on startup
--sync-after-restart=true    ← triggers spurious restarts on leader election
--enable-ha=true             ← safe, but amplifies the above bugs via leader elections

Trigger pattern

Each Reloader restart/leader election produces two overlapping envoy rollouts ~52s apart (one for ConfigMap, one for Secret):

| Date | ReplicaSets Created | Trigger |
| --- | --- | --- |
| Mar 6 01:38 | 1 RS | Reloader leader election |
| Mar 7 03:52 + 03:53 | Pair (52s gap) | Reloader leader election |
| Mar 9 17:23 + 17:24 | Pair (53s gap) | Reloader leader election |
| Mar 9 19:37 + 19:38 | Pair (51s gap) | Reloader leader election |
| Mar 11 08:22 + 08:23 | Pair (54s gap) | Reloader leader election |
| Mar 11 14:35 + 14:36 | Pair (52s gap) | Reloader leader election |

Rolling update strategy (maxSurge: 0, maxUnavailable: 1) cycles all 6 envoy pods sequentially, resetting active downstream Redis connections.
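That strategy, as it would appear on the deployment (values from the RCA; surrounding fields elided):

```yaml
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0          # no replacement pod started ahead of time
      maxUnavailable: 1    # pods cycle one at a time: six sequential connection resets per rollout
```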

Reloader log evidence (Mar 11)

14:34:52 level=info msg="Successfully acquired lease"
14:34:52 level=info msg="became leader, starting controllers"
14:35:19 level=info msg="Changes detected in 'redis-lock-secondary-envoy-config' of type 'CONFIGMAP' in namespace 'infra'; updated 'redis-mirror-envoy'"
14:36:11 level=info msg="Changes detected in 'redis-mirror-envoy-secret' of type 'SECRET' in namespace 'infra'; updated 'redis-mirror-envoy'"

Azure Redis is healthy

All envoy upstream error metrics are at zero for both clusters over the last 7 days:

| Metric | Value |
| --- | --- |
| `envoy_cluster_upstream_cx_connect_fail` | 0 |
| `envoy_cluster_upstream_cx_connect_timeout` | 0 |
| `envoy_cluster_health_check_failure` | 0 |
| `envoy_cluster_upstream_cx_protocol_error` | 0 |

Verified: Shard-32 Crash Timeline (Mar 11)

Correlated Reloader logs with pod-specific logs for search-retry-feeder-intel-requests-shard-32-c5ccb79d-s9zpm:

| Time (UTC) | Event |
| --- | --- |
| 14:34:52 | Reloader pod acquires leader lease |
| 14:35:19 | Reloader: "Changes detected" in ConfigMap → envoy rollout #1 |
| 14:36:11 | Reloader: "Changes detected" in Secret → envoy rollout #2 |
| 14:37:50 | Shard-32 crashes at iteration 107: `redis.exceptions.ConnectionError: Error while reading from redis-mirror-envoy.infra.svc.cluster.local:6379 : (104, 'Connection reset by peer')` |
| 14:38:24 | Shard-32 crashes again at iteration 7 (barely restarted, envoy still mid-rollout) |
| 14:38:43 | Datadog CrashLoopBackOff alert fires |
| 14:45:48 | Pod receives SIGTERM at iteration 391 — new deployment replaces pod, graceful shutdown |

Full Timeline

| Date | Event |
| --- | --- |
| 2025-05-20 | Reloader deployed with `reloadOnCreate=true`, `syncAfterRestart=true`, `enableHA=true` (commit 69f146b1) |
| 2025-05-20 → 2026-03-11 | ~85 Reloader leader elections cause ~169 spurious envoy rollouts, each resetting Redis connections cluster-wide (~10 months, ~2/week) |
| 2025-12-17 | Separate incident: Envoy OOMKills from 1GB per-connection buffers (unrelated to Reloader) |
| 2026-03-05 20:26 | Reloader redeployed (chart version update), creating fresh pods — still with both bad flags |
| 2026-03-06 | `syncAfterRestart` fixed to `false` in git (commit 0cd46f44a) — not deployed |
| 2026-03-06 → 03-11 | 8 more spurious envoy rollouts from Reloader leader elections |
| 2026-03-11 14:35 | Shard-32 CrashLoopBackOff alert — triggers this investigation |
| 2026-03-11 | `reloadOnCreate` fixed to `false` in git (commit 335044c69) — pending deploy to all 4 clusters |

Fix

Both flags fixed in the base values.yaml (applies to all 4 clusters — per-env overrides are empty):

reloader:
  reloadOnCreate: false   # was true since 2025-05-20 (fixed: 335044c69)
  syncAfterRestart: false # was true since 2025-05-20 (fixed: 0cd46f44a)
  enableHA: true          # safe to keep

Trade-off: If a ConfigMap/Secret changes during the brief (~15s) leader election window, that change won't trigger a restart. Acceptable risk given the short window and the alternative (mass restarts on every leader election).

Upstream issues (all OPEN, unfixed as of Reloader v1.4.14):

  • #299 — reloadOnCreate causes mass restarts on pod restart (filed 2022)
  • #810 — HA leader election triggers unnecessary syncs (filed 2024)
  • #1089 — Mass pod restarts when controller restarts or leader changes (filed 2026)

Why Task Runners Crash on Redis Errors

This is intentional design, not a bug. From gisual_task_runner/app.py:245:

async def run(self) -> None:
    await self._runtime.__aenter__()
    try:
        await self.start()
        await self.execute()
    except Exception as error:
        self.logger.exception(str(error))
        await self.on_exception()
        raise  # Intentional: crash for clean reconnection
    finally:
        if self.amqp is not None:
            await self.amqp.maybe_republish_stored_messages()
        await self.stop()

Messages are republished before shutdown. No data loss.
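The contract is easy to demonstrate in miniature (a toy stand-in for the pattern, not the real gisual_task_runner classes):

```python
import asyncio

class ToyRunner:
    """Mimics run(): on any error, republish stored messages in `finally`,
    then let the exception propagate so the pod exits and reconnects cleanly."""
    def __init__(self) -> None:
        self.stored = ["msg-1", "msg-2"]
        self.republished: list = []

    async def run(self) -> None:
        try:
            raise ConnectionError("Connection reset by peer")  # simulated envoy reset
        finally:
            # Runs on the error path too; this is why no data is lost.
            self.republished.extend(self.stored)

runner = ToyRunner()
try:
    asyncio.run(runner.run())
except ConnectionError:
    pass  # the crash is intentional; Kubernetes restarts the pod
print(runner.republished)   # ['msg-1', 'msg-2']
```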

Investigation Notes

Searching logs for sharded deployments

service_name:~"search-retry-feeder.*" | stats count() by (service_name)

Historical context

Dec 17, 2025: Envoy OOMKill incident caused by 1GB per-connection buffers. Remediation: reduced buffer to 100MB, enabled flush controls, increased memory to 1Gi/2Gi. See ENVOY_OOMKILL_INVESTIGATION.md.
