
@jmealo
Last active March 11, 2026 16:37

RCA: search-retry-feeder CrashLoopBackOff (Production)

Date: 2026-03-11
Reported by: Diego via Slack/Datadog
Investigated by: Jeff Mealo (with Claude Code)
Severity: Low (self-recovering, no data loss)
Status: Root cause identified, fix committed and pending deploy

Summary

Datadog alerted on search-retry-feeder-intel-requests-shard-32 entering CrashLoopBackOff in production. Root cause: a known Stakater Reloader bug (#299, #810, #1089) where reloadOnCreate=true and syncAfterRestart=true cause mass rolling restarts of all watched deployments on every Reloader pod restart or leader election — even when ConfigMap/Secret content is unchanged.

This has been affecting redis-mirror-envoy (and all downstream Redis consumers) since May 20, 2025 (~10 months).

Total Impact (~10 months)

  • ~85 cluster-wide Redis disruption events (Reloader leader elections)
  • ~169 spurious envoy rollouts (2 per event: ConfigMap + Secret)
  • ~30,000+ estimated search-retry-feeder restarts
  • ~66,000+ estimated Redis connection errors
  • All from content that never actually changed — hashes are identical across all 169 rollouts

How we calculated this

| Metric | Value (source) |
| --- | --- |
| Envoy deployment revision | 179 (from `deployment.kubernetes.io/revision`) |
| Legitimate chart deploys (git commits) | 10 (since 2025-05-20) |
| Spurious Reloader rollouts | ~169 (179 - 10) |
| Reloader leader elections | ~85 (169 ÷ 2, since each event produces 2 rollouts) |
| Frequency | ~2 per week |
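The derivation above is plain arithmetic; a quick sketch using the numbers from the table:

```python
# Numbers from the table above; reconstructs the spurious-rollout estimate.
total_revisions = 179      # deployment.kubernetes.io/revision on redis-mirror-envoy
legit_deploys = 10         # chart-touching git commits since 2025-05-20
spurious_rollouts = total_revisions - legit_deploys

# Each leader election rolls the deployment twice (ConfigMap + Secret);
# ceiling division because one event (Mar 6) produced only a single rollout.
leader_elections = -(-spurious_rollouts // 2)

print(spurious_rollouts, leader_elections)   # 169 85
```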

Measured: 7-day window (Mar 4–11)

| Date | Pod Restarts | Redis Errors |
| --- | --- | --- |
| Mar 4 | 32 | 380 |
| Mar 5 | 10 | 115 |
| Mar 6 | 294 | 241 |
| Mar 7 | 100 | 45 |
| Mar 8 | 18 | 303 |
| Mar 9 | 42 | 20 |
| Mar 10 | 93 | 440 |
| Mar 11 (partial) | 150 | |
| 7-day total | 707 restarts | 1,544 errors |

Extrapolated over ~43 weeks: ~30,000 restarts, ~66,000 Redis errors.
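The extrapolation is the measured week multiplied out over the affected period:

```python
# Extrapolating the measured 7-day window across ~43 weeks since 2025-05-20.
weekly_restarts, weekly_errors, weeks = 707, 1544, 43
print(weekly_restarts * weeks)   # 30401 -> "~30,000 restarts"
print(weekly_errors * weeks)     # 66392 -> "~66,000 Redis errors"
```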

No data loss — the task runner republishes stored messages before shutdown.

Architecture: redis-mirror-envoy

redis-mirror-envoy is an Envoy Redis proxy in the infra namespace (6 replicas) that splits Redis traffic by key prefix:

  • lock:* keys → redis_secondary (gisual-production-secondary.redis.cache.windows.net) with dual-write mirror to redis_primary
  • All other keys → redis_primary (gisual-production.redis.cache.windows.net)

This is a standalone deployment that all backend services connect to at redis-mirror-envoy.infra.svc.cluster.local:6379. It is not a sidecar.
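This prefix split maps onto Envoy's redis_proxy `prefix_routes`. A sketch of what that filter config looks like (field names are from Envoy's Redis proxy filter; the cluster names and exact layout here are assumptions, not the production ConfigMap):

```yaml
# Illustrative sketch only — not the production redis-lock-secondary-envoy-config.
prefix_routes:
  routes:
    - prefix: "lock:"
      cluster: redis_secondary          # lock keys go to the secondary cache
      request_mirror_policy:
        - cluster: redis_primary        # dual-write mirror back to primary
  catch_all_route:
    cluster: redis_primary              # everything else
```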

Root Cause: Stakater Reloader Bug

The bug

Reloader stores change state only in memory. On pod restart or leader election, the Kubernetes informer cache delivers all existing resources as "Add" events. With reloadOnCreate=true or syncAfterRestart=true, these are treated as changes, triggering rolling restarts for every watched deployment — even though content is identical.

Reloader already writes a SHA1 hash of the watched resource to the deployment annotation (reloader.stakater.com/last-reloaded-from), but never compares it against the replayed content on startup. That missing comparison is the unfixed design flaw.
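The missing guard is small: before reloading, compare the hash of the incoming resource against the annotation already on the deployment. A Python sketch of the idea (Reloader itself is Go, and the annotation actually stores a small JSON payload; the bare-hash format here is a simplification):

```python
import hashlib

# Annotation name from the RCA; storing a bare SHA1 hex digest is a simplification.
ANNOTATION = "reloader.stakater.com/last-reloaded-from"

def should_reload(deploy_annotations: dict, content: bytes) -> bool:
    """Skip reloads when the informer replays unchanged content."""
    return deploy_annotations.get(ANNOTATION) != hashlib.sha1(content).hexdigest()

config = b"static_resources: ..."
anns = {ANNOTATION: hashlib.sha1(config).hexdigest()}
print(should_reload(anns, config))                 # False: replayed "Add" event, no rollout
print(should_reload(anns, b"something: changed"))  # True: real change, reload
```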

Proof: Content hashes are identical across all rollouts

Every ReplicaSet created by Reloader has the exact same hashes — confirming no actual content change:

| Resource | Hash (SHA1) | Same across all rollouts? |
| --- | --- | --- |
| ConfigMap `redis-lock-secondary-envoy-config` | `354c95fd2dd2ca9de2a8801ef7b07b1fe3444826` | Yes |
| Secret `redis-mirror-envoy-secret` | `74a81cde887a3d3f14af3cffd2c2929eb240ecde` | Yes |

Running config (all 4 clusters)

All 4 clusters (demo, staging, production, infra) are running with both bad flags:

--reload-on-create=true      ← triggers spurious restarts on startup
--sync-after-restart=true    ← triggers spurious restarts on leader election
--enable-ha=true             ← safe, but amplifies the above bugs via leader elections

Trigger pattern

Each Reloader restart/leader election produces two overlapping envoy rollouts ~52s apart (one for ConfigMap, one for Secret):

| Date | ReplicaSets Created | Trigger |
| --- | --- | --- |
| Mar 6 01:38 | 1 RS | Reloader leader election |
| Mar 7 03:52 + 03:53 | Pair (52s gap) | Reloader leader election |
| Mar 9 17:23 + 17:24 | Pair (53s gap) | Reloader leader election |
| Mar 9 19:37 + 19:38 | Pair (51s gap) | Reloader leader election |
| Mar 11 08:22 + 08:23 | Pair (54s gap) | Reloader leader election |
| Mar 11 14:35 + 14:36 | Pair (52s gap) | Reloader leader election |

Rolling update strategy (maxSurge: 0, maxUnavailable: 1) cycles all 6 envoy pods sequentially, resetting active downstream Redis connections.
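That strategy, as it would appear on the deployment (values from the RCA; surrounding fields elided):

```yaml
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0          # no replacement pod started ahead of time
      maxUnavailable: 1    # pods cycle one at a time: six sequential connection resets per rollout
```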

Reloader log evidence (Mar 11)

14:34:52 level=info msg="Successfully acquired lease"
14:34:52 level=info msg="became leader, starting controllers"
14:35:19 level=info msg="Changes detected in 'redis-lock-secondary-envoy-config' of type 'CONFIGMAP' in namespace 'infra'; updated 'redis-mirror-envoy'"
14:36:11 level=info msg="Changes detected in 'redis-mirror-envoy-secret' of type 'SECRET' in namespace 'infra'; updated 'redis-mirror-envoy'"

Azure Redis is healthy

All envoy upstream error metrics are at zero for both clusters over the last 7 days:

| Metric | Value |
| --- | --- |
| `envoy_cluster_upstream_cx_connect_fail` | 0 |
| `envoy_cluster_upstream_cx_connect_timeout` | 0 |
| `envoy_cluster_health_check_failure` | 0 |
| `envoy_cluster_upstream_cx_protocol_error` | 0 |

Verified: Shard-32 Crash Timeline (Mar 11)

Correlated Reloader logs with pod-specific logs for search-retry-feeder-intel-requests-shard-32-c5ccb79d-s9zpm:

| Time (UTC) | Event |
| --- | --- |
| 14:34:52 | Reloader pod acquires leader lease |
| 14:35:19 | Reloader: "Changes detected" in ConfigMap → envoy rollout #1 |
| 14:36:11 | Reloader: "Changes detected" in Secret → envoy rollout #2 |
| 14:37:50 | Shard-32 crashes at iteration 107: `redis.exceptions.ConnectionError: Error while reading from redis-mirror-envoy.infra.svc.cluster.local:6379 : (104, 'Connection reset by peer')` |
| 14:38:24 | Shard-32 crashes again at iteration 7 (barely restarted, envoy still mid-rollout) |
| 14:38:43 | Datadog CrashLoopBackOff alert fires |
| 14:45:48 | Pod receives SIGTERM at iteration 391 — new deployment replaces pod, graceful shutdown |

Full Timeline

| Date | Event |
| --- | --- |
| 2025-05-20 | Reloader deployed with `reloadOnCreate=true`, `syncAfterRestart=true`, `enableHA=true` (commit 69f146b1) |
| 2025-05-20 → 2026-03-11 | ~85 Reloader leader elections cause ~169 spurious envoy rollouts, each resetting Redis connections cluster-wide (~10 months, ~2/week) |
| 2025-12-17 | Separate incident: Envoy OOMKills from 1GB per-connection buffers (unrelated to Reloader) |
| 2026-03-05 20:26 | Reloader redeployed (chart version update), creating fresh pods — still with both bad flags |
| 2026-03-06 | `syncAfterRestart` fixed to `false` in git (commit 0cd46f44a) — not deployed |
| 2026-03-06 → 03-11 | 8 more spurious envoy rollouts from Reloader leader elections |
| 2026-03-11 14:35 | Shard-32 CrashLoopBackOff alert — triggers this investigation |
| 2026-03-11 | `reloadOnCreate` fixed to `false` in git (commit 335044c69) — pending deploy to all 4 clusters |

Fix

Both flags fixed in the base values.yaml (applies to all 4 clusters — per-env overrides are empty):

reloader:
  reloadOnCreate: false   # was true since 2025-05-20 (fixed: 335044c69)
  syncAfterRestart: false # was true since 2025-05-20 (fixed: 0cd46f44a)
  enableHA: true          # safe to keep

Trade-off: If a ConfigMap/Secret changes during the brief (~15s) leader election window, that change won't trigger a restart. Acceptable risk given the short window and the alternative (mass restarts on every leader election).

Upstream issues (all OPEN, unfixed as of Reloader v1.4.14):

  • #299 — reloadOnCreate causes mass restarts on pod restart (filed 2022)
  • #810 — HA leader election triggers unnecessary syncs (filed 2024)
  • #1089 — Mass pod restarts when controller restarts or leader changes (filed 2026)

Why Task Runners Crash on Redis Errors

This is intentional design, not a bug. From gisual_task_runner/app.py:245:

async def run(self) -> None:
    await self._runtime.__aenter__()
    try:
        await self.start()
        await self.execute()
    except Exception as error:
        self.logger.exception(str(error))
        await self.on_exception()
        raise  # Intentional: crash for clean reconnection
    finally:
        if self.amqp is not None:
            await self.amqp.maybe_republish_stored_messages()
        await self.stop()

Messages are republished before shutdown. No data loss.
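The contract is easy to demonstrate in miniature (a toy stand-in for the pattern, not the real gisual_task_runner classes):

```python
import asyncio

class ToyRunner:
    """Mimics run(): on any error, republish stored messages in `finally`,
    then let the exception propagate so the pod exits and reconnects cleanly."""
    def __init__(self) -> None:
        self.stored = ["msg-1", "msg-2"]
        self.republished: list = []

    async def run(self) -> None:
        try:
            raise ConnectionError("Connection reset by peer")  # simulated envoy reset
        finally:
            # Runs on the error path too; this is why no data is lost.
            self.republished.extend(self.stored)

runner = ToyRunner()
try:
    asyncio.run(runner.run())
except ConnectionError:
    pass  # the crash is intentional; Kubernetes restarts the pod
print(runner.republished)   # ['msg-1', 'msg-2']
```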

Investigation Notes

Searching logs for sharded deployments

service_name:~"search-retry-feeder.*" | stats count() by (service_name)

Historical context

Dec 17, 2025: Envoy OOMKill incident caused by 1GB per-connection buffers. Remediation: reduced buffer to 100MB, enabled flush controls, increased memory to 1Gi/2Gi. See ENVOY_OOMKILL_INVESTIGATION.md.
