Date: 2026-03-11
Reported by: Diego via Slack/Datadog
Investigated by: Jeff Mealo (with Claude Code)
Severity: Low (self-recovering, no data loss)
Status: Root cause identified, fix committed and pending deploy
Two incidents, one root cause chain, 42 documented false signals across 2 days of investigation.
Actual root cause: AAA API at 8 replicas (some crash-looping) couldn't serve permissions queries from intel-requests-api fast enough. This caused cascading request queuing through intel-requests-api and incidents-api, starving notification-sender of API capacity. SQL queries were sub-millisecond throughout.
Source: Production logs, 2026-03-06 08:44 UTC
Charter org_id: 9de5d801-235a-4451-8c89-d2c3974c71e8
All queries use real IDs extracted from prod notification-sender logs.
WARNING: Run these inside a BEGIN; ... ROLLBACK; transaction, on a read replica if possible. EXPLAIN ANALYZE actually executes the query, so any writes take effect unless the transaction is rolled back.
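As a minimal sketch of the warning above, the helper below wraps a statement so that EXPLAIN ANALYZE runs between BEGIN and ROLLBACK. The function name `wrap_explain` is hypothetical and not part of the original investigation; it only builds the SQL text and assumes a database that accepts `EXPLAIN ANALYZE` (PostgreSQL, for example). Connecting to the replica and running the result is left to whatever client is already in use.

```python
def wrap_explain(query: str) -> str:
    """Return SQL that runs EXPLAIN ANALYZE on `query` inside a rolled-back transaction.

    Hypothetical helper for illustration: it only builds text, it does not
    connect to any database.
    """
    # Drop surrounding whitespace and a trailing semicolon so we can add our own.
    stmt = query.strip().rstrip(";").strip()
    if not stmt:
        raise ValueError("query must not be empty")
    # EXPLAIN ANALYZE executes the statement, so ROLLBACK discards any side effects.
    return f"BEGIN;\nEXPLAIN ANALYZE {stmt};\nROLLBACK;"


if __name__ == "__main__":
    # Placeholder statement; substitute one of the real queries under investigation.
    print(wrap_explain("SELECT 1;"))
```

Keeping the wrapper as plain text means the same output can be pasted into any SQL console without extra dependencies.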
Context: Production SLA breach. notification-sender queue peaked at 7,617 msgs. Per-message processing: median 7.8s, max 14s. Charter webhook delivery is <10ms — the bottleneck is entirely internal API calls.
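To put the figures above in perspective, here is a rough back-of-the-envelope sketch of how long the peak backlog would take to drain. It assumes every message costs the median 7.8s and that no new messages arrive. The worker count is a hypothetical parameter, since the actual notification-sender concurrency is not stated here. With a single serial worker the estimate comes to roughly 16.5 hours.

```python
def drain_time_seconds(queue_depth: int, seconds_per_message: float, workers: int = 1) -> float:
    """Estimate the time to drain a backlog with no new arrivals.

    Assumes every message costs `seconds_per_message` and that `workers`
    (a hypothetical figure) process messages fully in parallel.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return queue_depth * seconds_per_message / workers


if __name__ == "__main__":
    # Peak queue depth and median per-message time taken from the incident context.
    for workers in (1, 4, 16):  # illustrative worker counts, not measured values
        hours = drain_time_seconds(7617, 7.8, workers) / 3600
        print(f"{workers} worker(s): {hours:.1f} h")
```

The estimate is deliberately simple: it ignores arrivals during the drain and the 14s worst case, so real recovery time would be longer.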
[package]
name = "wasm-h3-map"
version = "0.1.0"
edition = "2021"

[lib]
crate-type = ["cdylib"]

[dependencies]
wasm-bindgen = "0.2"
FROM python:3.9-slim

# Set environment variables
ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1

# Install the Kubernetes Python client
RUN pip install --no-cache-dir kubernetes==32.0.1
{
  "__inputs": [
    {
      "name": "DS_PROMETHEUS",
      "label": "Prometheus",
      "description": "",
      "type": "datasource",
      "pluginId": "prometheus",
      "pluginName": "Prometheus"
    },
brew install sleepwatcher
# write restart_apps.sh to ~/.wakeup
chmod +x ~/.wakeup
brew services start sleepwatcher
# your mileage may vary getting sleepwatcher to run the script, but, the script should work if executed manually