Date: 2026-03-11
Reported by: Diego via Slack/Datadog
Investigated by: Jeff Mealo (with Claude Code)
Severity: Low (self-recovering, no data loss)
Status: Root cause identified, fix committed and pending deploy
Datadog alerted on `search-retry-feeder-intel-requests-shard-32` entering CrashLoopBackOff in production. Root cause: a known Stakater Reloader bug (issues #299, #810, #1089) where `reloadOnCreate=true` and `syncAfterRestart=true` cause mass rolling restarts of all watched deployments on every Reloader pod restart or leader election, even when ConfigMap/Secret content is unchanged.
This has been affecting redis-mirror-envoy (and all downstream Redis consumers) since May 20, 2025 (~10 months).
- ~85 cluster-wide Redis disruption events (Reloader leader elections)
- ~169 spurious envoy rollouts (2 per event: ConfigMap + Secret)
- ~30,000+ estimated search-retry-feeder restarts
- ~66,000+ estimated Redis connection errors
- All from content that never actually changed — hashes are identical across all 169 rollouts
| Metric | Value |
|---|---|
| Envoy deployment revision | 179 (from deployment.kubernetes.io/revision) |
| Legitimate chart deploys (git commits) | 10 (since 2025-05-20) |
| Spurious Reloader rollouts | ~169 (179 - 10) |
| Reloader leader elections | ~85 (169 ÷ 2, since each event produces 2 rollouts) |
| Frequency | ~2 per week |
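The derivation above is simple arithmetic; a quick sanity check using the figures from the table:

```python
# Figures from the table above.
revisions = 179          # deployment.kubernetes.io/revision on the envoy deployment
legit_deploys = 10       # git-tracked chart deploys since 2025-05-20

spurious_rollouts = revisions - legit_deploys
leader_elections = spurious_rollouts // 2   # each election rolls ConfigMap + Secret

print(spurious_rollouts)   # 169
print(leader_elections)    # 84 (~85; the odd count reflects the occasional single-rollout event)
```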
| Date | Pod Restarts | Redis Errors |
|---|---|---|
| Mar 4 | 32 | 380 |
| Mar 5 | 10 | 115 |
| Mar 6 | 294 | 241 |
| Mar 7 | 100 | 45 |
| Mar 8 | 18 | 303 |
| Mar 9 | 42 | 20 |
| Mar 10 | 93 | 440 |
| Mar 11 (partial) | 150 | — |
| 7-day total | 707 restarts (Mar 5–11) | 1,544 errors (Mar 4–10; Mar 11 not yet available) |
Extrapolated over ~43 weeks: ~30,000 restarts, ~66,000 Redis errors.
No data loss — the task runner republishes stored messages before shutdown.
`redis-mirror-envoy` is an Envoy Redis proxy in the `infra` namespace (6 replicas) that splits Redis traffic by key prefix:

- `lock:*` keys → `redis_secondary` (gisual-production-secondary.redis.cache.windows.net), with a dual-write mirror to `redis_primary`
- All other keys → `redis_primary` (gisual-production.redis.cache.windows.net)
This is a standalone deployment that all backend services connect to at redis-mirror-envoy.infra.svc.cluster.local:6379. It is not a sidecar.
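The prefix split above maps onto Envoy's `redis_proxy` filter with `prefix_routes`. A minimal sketch for illustration — cluster names follow the report, but this is not the deployed config:

```yaml
# Sketch only — illustrates the routing described above, not the production config.
filters:
  - name: envoy.filters.network.redis_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.network.redis_proxy.v3.RedisProxy
      stat_prefix: redis_mirror
      settings:
        op_timeout: 5s
      prefix_routes:
        routes:
          - prefix: "lock:"                # lock:* keys
            cluster: redis_secondary
            request_mirror_policy:
              - cluster: redis_primary     # dual-write mirror
        catch_all_route:
          cluster: redis_primary           # all other keys
```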
Reloader stores change state only in memory. On pod restart or leader election, the Kubernetes informer cache delivers all existing resources as "Add" events. With reloadOnCreate=true or syncAfterRestart=true, these are treated as changes, triggering rolling restarts for every watched deployment — even though content is identical.
The Reloader already writes a SHA1 hash to the deployment annotation (reloader.stakater.com/last-reloaded-from), but does not compare it on startup. This is the unfixed design flaw.
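The missing startup guard is straightforward to express. A hypothetical sketch — the annotation name comes from the report, but the function and its shape are illustrative, not Reloader's actual (Go) code:

```python
import hashlib

# Annotation named in the report; Reloader already writes the SHA1 here.
ANNOTATION = "reloader.stakater.com/last-reloaded-from"

def should_reload(deployment_annotations: dict, content: bytes) -> bool:
    """Return True only if the watched ConfigMap/Secret content actually changed.

    Comparing the stored hash on startup would suppress the spurious
    restarts caused by informer-cache "Add" events replaying unchanged
    resources.
    """
    new_hash = hashlib.sha1(content).hexdigest()
    return deployment_annotations.get(ANNOTATION) != new_hash

# Unchanged content → no rollout, even when replayed as an "Add" event.
anns = {ANNOTATION: hashlib.sha1(b"cfg-v1").hexdigest()}
print(should_reload(anns, b"cfg-v1"))  # False — skip reload
print(should_reload(anns, b"cfg-v2"))  # True — content actually changed
```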
Every ReplicaSet created by Reloader has the exact same hashes — confirming no actual content change:
| Resource | Hash (SHA1) | Same across all rollouts? |
|---|---|---|
| ConfigMap `redis-lock-secondary-envoy-config` | `354c95fd2dd2ca9de2a8801ef7b07b1fe3444826` | Yes |
| Secret `redis-mirror-envoy-secret` | `74a81cde887a3d3f14af3cffd2c2929eb240ecde` | Yes |
All 4 clusters (demo, staging, production, infra) are running with both bad flags:
- `--reload-on-create=true` ← triggers spurious restarts on startup
- `--sync-after-restart=true` ← triggers spurious restarts on leader election
- `--enable-ha=true` ← safe on its own, but amplifies the above bugs via leader elections
Each Reloader restart/leader election produces two overlapping envoy rollouts ~52s apart (one for ConfigMap, one for Secret):
| Date | ReplicaSet Created | Trigger |
|---|---|---|
| Mar 6 01:38 | 1 RS | Reloader leader election |
| Mar 7 03:52 + 03:53 | Pair (52s gap) | Reloader leader election |
| Mar 9 17:23 + 17:24 | Pair (53s gap) | Reloader leader election |
| Mar 9 19:37 + 19:38 | Pair (51s gap) | Reloader leader election |
| Mar 11 08:22 + 08:23 | Pair (54s gap) | Reloader leader election |
| Mar 11 14:35 + 14:36 | Pair (52s gap) | Reloader leader election |
Rolling update strategy (maxSurge: 0, maxUnavailable: 1) cycles all 6 envoy pods sequentially, resetting active downstream Redis connections.
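The strategy behind that sequential cycling, as a deployment fragment — values are from the report, surrounding fields elided:

```yaml
# Fragment — only the fields relevant to the rollout behavior described above.
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0          # never start a replacement before terminating a pod
      maxUnavailable: 1    # so pods are cycled strictly one at a time
```

With `maxSurge: 0`, each of the 6 pods is terminated before its replacement becomes ready, so every spurious rollout drops downstream connections six times in sequence.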
```
14:34:52 level=info msg="Successfully acquired lease"
14:34:52 level=info msg="became leader, starting controllers"
14:35:19 level=info msg="Changes detected in 'redis-lock-secondary-envoy-config' of type 'CONFIGMAP' in namespace 'infra'; updated 'redis-mirror-envoy'"
14:36:11 level=info msg="Changes detected in 'redis-mirror-envoy-secret' of type 'SECRET' in namespace 'infra'; updated 'redis-mirror-envoy'"
```
All envoy upstream error metrics are at zero for both clusters over the last 7 days:
| Metric | Value |
|---|---|
| `envoy_cluster_upstream_cx_connect_fail` | 0 |
| `envoy_cluster_upstream_cx_connect_timeout` | 0 |
| `envoy_cluster_health_check_failure` | 0 |
| `envoy_cluster_upstream_cx_protocol_error` | 0 |
Correlated Reloader logs with pod-specific logs for search-retry-feeder-intel-requests-shard-32-c5ccb79d-s9zpm:
| Time (UTC) | Event |
|---|---|
| 14:34:52 | Reloader pod acquires leader lease |
| 14:35:19 | Reloader: "Changes detected" in ConfigMap → envoy rollout #1 |
| 14:36:11 | Reloader: "Changes detected" in Secret → envoy rollout #2 |
| 14:37:50 | Shard-32 crashes at iteration 107: redis.exceptions.ConnectionError: Error while reading from redis-mirror-envoy.infra.svc.cluster.local:6379 : (104, 'Connection reset by peer') |
| 14:38:24 | Shard-32 crashes again at iteration 7 (barely restarted, envoy still mid-rollout) |
| 14:38:43 | Datadog CrashLoopBackOff alert fires |
| 14:45:48 | Pod receives SIGTERM at iteration 391 — new deployment replaces pod, graceful shutdown |
| Date | Event |
|---|---|
| 2025-05-20 | Reloader deployed with reloadOnCreate=true, syncAfterRestart=true, enableHA=true (commit 69f146b1) |
| 2025-05-20 → 2026-03-11 | ~85 Reloader leader elections cause ~169 spurious envoy rollouts, each resetting Redis connections cluster-wide (~10 months, ~2/week) |
| 2025-12-17 | Separate incident: Envoy OOMKills from 1GB per-connection buffers (unrelated to Reloader) |
| 2026-03-05 20:26 | Reloader redeployed (chart version update), creating fresh pods — still with both bad flags |
| 2026-03-06 | syncAfterRestart fixed to false in git (commit 0cd46f44a) — not deployed |
| 2026-03-06 → 03-11 | 8 more spurious envoy rollouts from Reloader leader elections |
| 2026-03-11 14:35 | Shard-32 CrashLoopBackOff alert — triggers this investigation |
| 2026-03-11 | reloadOnCreate fixed to false in git (commit 335044c69) — pending deploy to all 4 clusters |
Both flags fixed in the base values.yaml (applies to all 4 clusters — per-env overrides are empty):
```yaml
reloader:
  reloadOnCreate: false     # was true since 2025-05-20 (fixed: 335044c69)
  syncAfterRestart: false   # was true since 2025-05-20 (fixed: 0cd46f44a)
  enableHA: true            # safe to keep
```

Trade-off: if a ConfigMap/Secret changes during the brief (~15s) leader-election window, that change won't trigger a restart. This is an acceptable risk given the short window and the alternative (mass restarts on every leader election).
Upstream issues (all OPEN, unfixed as of Reloader v1.4.14):
- #299 — `reloadOnCreate` causes mass restarts on pod restart (filed 2022)
- #810 — HA leader election triggers unnecessary syncs (filed 2024)
- #1089 — Mass pod restarts when controller restarts or leader changes (filed 2026)
The crash-on-connection-error behavior is intentional design, not a bug. From gisual_task_runner/app.py:245:
```python
async def run(self) -> None:
    await self._runtime.__aenter__()
    try:
        await self.start()
        await self.execute()
    except Exception as error:
        self.logger.exception(str(error))
        await self.on_exception()
        raise  # Intentional: crash for clean reconnection
    finally:
        if self.amqp is not None:
            await self.amqp.maybe_republish_stored_messages()
        await self.stop()
```

Messages are republished before shutdown. No data loss.
Datadog query used to count feeder restarts:

```
service_name:~"search-retry-feeder.*" | stats count() by (service_name)
```
Dec 17, 2025: Envoy OOMKill incident caused by 1GB per-connection buffers. Remediation: reduced buffer to 100MB, enabled flush controls, increased memory to 1Gi/2Gi. See ENVOY_OOMKILL_INVESTIGATION.md.