Date: March 6, 2026 (~04:00–08:30 UTC)
Environment: Production
Service: notification-sender (v0.5.45)
Severity: SLA breach — notifications delayed beyond 15-minute target
The notification-sender queue backed up to 7,617 messages at peak (07:30 UTC), with per-message processing times of 7-13 seconds (median 7.8s). This is a recurrence of the March 5 queue backup incident, driven by the same root cause: internal API latency under Charter-scale load.
All 75 consumers remained online with 0 restarts. The queue is self-recovering (currently ~4,819 and declining), but messages at the back of the queue during peak were likely delivered 15–25+ minutes late, breaching the 15-minute SLA.
| Metric | Value | Assessment |
|---|---|---|
| Pods | 75 (RS 7589654495) | Normal |
| Restarts | 0 | Healthy |
| Memory | ~1% across all pods | Fine |
| CPU | 2.72 cores total | Light |
| Queue depth (peak) | 7,617 msgs @ 07:30 UTC | SLA breach |
| Queue depth (current) | ~4,819 msgs | Draining |
| Process time (median) | 7.77s | Degraded |
| Process time (P95) | 12.21s | Degraded |
| Process time (max) | 13.98s | Degraded |
| Error-level logs (4h) | 33 | Low |
| Warning-level logs (4h) | 50+ | Moderate |
Queue depth over time (15-min buckets):
04:00 10
04:15 1
04:30 5
04:45 3,631 ████████████████████
05:00 1,927 ██████████▋
05:15 5
05:30 87
05:45 74
06:00 568 ███
06:15 1,134 ██████▎
06:30 2,482 █████████████▊
06:45 5,409 ██████████████████████████████
07:00 6,273 ██████████████████████████████████▊
07:15 6,518 ████████████████████████████████████▏
07:30 7,617 ██████████████████████████████████████████▎ <- PEAK
07:45 6,275 ██████████████████████████████████▊
08:00 4,866 ███████████████████████████
Processing time distribution (50 samples from last 4 hours):
| Percentile | Duration |
|---|---|
| Min | 1.39s |
| Median | 7.77s |
| P90 | 11.44s |
| P95 | 12.21s |
| Max | 13.98s |
Effective throughput: 75 consumers / 6.69s avg = ~11.2 msgs/sec = ~672 msgs/min
Estimated end-to-end latency at peak (7,617 msgs):
| Scenario | Queue Wait | + Processing | Total Latency |
|---|---|---|---|
| Optimistic (avg processing) | ~11.3 min | + 6.7s | ~12 min |
| Realistic (median processing) | ~11.3 min | + 7.8s | ~12 min |
| Worst case (P95 processing) | ~11.3 min | + 12.2s | ~13.5 min |
| Degraded APIs (like March 5) | ~25 min | + 15s | ~25+ min |
Messages at the back of the queue during the 06:30–08:00 UTC window almost certainly exceeded the 15-minute SLA.
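The queue-wait arithmetic behind these estimates can be written down as a small model. The function name and shape are illustrative, not from the codebase; it only encodes the throughput math stated above:

```python
# A minimal model of the latency math above. Assumes the 75 consumers drain
# the queue concurrently at (consumers / avg_process_time) msgs/sec, and a
# message at the back of the queue waits for the entire backlog ahead of it.

def estimated_latency_s(queue_depth: int, consumers: int,
                        avg_process_s: float, own_process_s: float) -> float:
    throughput = consumers / avg_process_s   # fleet-wide msgs/sec (~11.2)
    queue_wait_s = queue_depth / throughput  # time to drain the backlog
    return queue_wait_s + own_process_s

# Peak backlog of 7,617 msgs, 75 consumers, 6.69s avg, 7.77s median own time:
peak = estimated_latency_s(7_617, 75, 6.69, 7.77)
print(f"{peak / 60:.1f} min")  # -> 11.5 min, before any degraded-API slowdown
```

With March 5-style API degradation (~15s per message), the same model yields the ~25+ minute worst case in the table.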
| Count | Pattern | Root Cause |
|---|---|---|
| 11 | ServiceNow 500 — gocotest.service-now.com Script Evaluation Exception | External: ServiceNow bug |
| 11 | HTTPStatusError for same ServiceNow endpoints | External: ServiceNow bug |
| 9 | Server error 500 from ServiceNow callback URLs | External: ServiceNow bug |
| 2 | Drop message: Incident does not exist | Stale references |
All errors are externally caused. ServiceNow's GOCO scripted REST API is returning 500s with Script Evaluation Exception: Cannot conv.... These auto-retry via the rpc-error exchange.
| Count | Pattern | Assessment |
|---|---|---|
| 30 | Repeated Payload: http.webhook payload is repeated | Dedup working correctly |
| 10 | Lock race condition detected | Expected with 75 competing consumers |
| ~7 | Timing: process took 10-15s | Slow internal API calls |
| 1 | Email template / outage status mismatch | One-off |
Error-level log distribution (15-min buckets):
04:00-04:30   0
04:45         6
05:00        22  <- ServiceNow spike
05:15-05:30   0
05:45         1
06:00-06:30   0
06:45         1
07:00-07:30   0
07:45         3
| Operation | Duration | % of Total | Source |
|---|---|---|---|
| get_intel_request (recursive deps) | 5–7s | 50–60% | intel_requests_api |
| get_intel_requests (bulk) | 0.2–5.7s | 15–30% | intel_requests_api |
| generate_context_data (total) | 1.7–2.0s | 15–20% | Multiple APIs |
| is_throttled | 0.35s | ~3% | Redis |
| idempotent_session_lock | 0.1s | ~1% | Redis |
| get_outbound_token | 0.09s | ~1% | AAA API |
| get_organization | 0.05s | <1% | AAA API |
| generate (Jinja template) | 0.01s | <1% | Local |
| send (HTTP POST to Charter) | <0.01s | <1% | External |
The external webhook POST to Charter is essentially instant (<10ms). The bottleneck is entirely internal API calls.
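Per-operation timings like those in the table are straightforward to capture. A hypothetical sketch, not the service's actual instrumentation (the names `timed` and `timings` are illustrative):

```python
# Hypothetical sketch of how per-operation timings like the table above can
# be collected: a decorator that records wall-clock duration per call. The
# names (timed, timings) are illustrative, not the service's actual code.
import time
from collections import defaultdict
from functools import wraps

timings: dict[str, list[float]] = defaultdict(list)

def timed(fn):
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            timings[fn.__name__].append(time.perf_counter() - start)
    return wrapper

@timed
def get_intel_request(request_id: str) -> dict:
    time.sleep(0.01)  # stand-in for the real 5-7s recursive-dependency call
    return {"id": request_id}

get_intel_request("ir-123")
print(sorted(timings))  # -> ['get_intel_request']
```

Emitting these per-call durations as metrics (rather than log lines) would also feed the queue-throughput alerting recommended below.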
process() [app.py:223]
|
+-- generate_context_data() [intel.py:627]
| |
| +-- get_intel_request() [intel.py:749, 902-918]
| | GET intel_requests_api/intel-requests/{id}
| | ?get_recursive_dependencies=True
| | &recursive_dependency_type=dependents
| | &recursive_record_types=intel_requests
| | >>> 5-7s <<< #1 BOTTLENECK
| |
| +-- get_intel_requests() bulk [intel.py:764, 877-900]
| | POST intel_requests_api/intel-requests/bulk
| | (fetches dependent intel request IDs from #1)
| | >>> 0.2-5.7s <<< #2 BOTTLENECK
| |
| +-- get_assets() bulk [intel.py:798, 859-875]
| | POST assets_api/assets/bulk
| | ~0.5s
| |
| +-- redis set_state [intel.py:835]
| ~0.05s
|
+-- get_organization() [app.py:375]
| GET aaa_api/organizations/{id}
| ~0.05s
|
+-- For each channel (email + webhook):
+-- is_throttled() ~0.35s (Redis)
+-- is_payload_repeated() ~0.05s (Redis)
+-- idempotent_session_lock() ~0.1s (Redis)
+-- get_outbound_token() [app.py:1109] ~0.09s
+-- generate() (Jinja) ~0.01s
+-- send() (HTTP POST) <0.01s <- Charter webhook is FAST
generate_context_data() with incident_id [intel.py:639-696]
|
+-- get_incident() [intel.py:641-658]
| GET incidents_api/incidents/{id}
| ?include_deleted_dependencies=True
| &get_recursive_dependencies=True
| &recursive_dependency_type=dependents
| &recursive_record_types=intel_requests
| >>> variable, was 25-54s during March 5 <<<
|
+-- reverse_geocode() [intel.py:668] (if no admin area)
+-- get_intel_requests() bulk [intel.py:687]
+-- get_assets() bulk [intel.py:798]
| File | Key Lines | What |
|---|---|---|
| notification_sender/app.py | L223-419 | process() — main entry point, orchestrates generation + delivery |
| notification_sender/app.py | L846+ | send_payload() — idempotent delivery with Redis lock |
| notification_sender/app.py | L1109 | get_outbound_token() |
| notification_sender/generators/intel.py | L627-857 | generate_context_data() — THE HOT PATH, all API calls |
| notification_sender/generators/intel.py | L902-918 | get_intel_request() — #1 bottleneck, recursive deps |
| notification_sender/generators/intel.py | L877-900 | get_intel_requests() bulk — #2 bottleneck |
| notification_sender/generators/intel.py | L859-875 | get_assets() bulk |
| notification_sender/generators/intel.py | L639-696 | generate_context_data() incident path with get_incident |
| API | Repo Path | Called By | Latency |
|---|---|---|---|
| intel-requests-api | backend/apis/intel-requests-api/ | get_intel_request(), get_intel_requests() | 5–7s (recursive deps) |
| incidents-api | backend/apis/incidents-api/ | get_incident() (incident path) | Variable, 25–54s during March 5 |
| assets-api | backend/apis/assets-api/ | get_assets() | ~0.5s |
| aaa-api | backend/apis/aaa-api/ | get_organization() | ~0.05s |
| Target | Details | Expected Impact |
|---|---|---|
| Optimize recursive dependency CTE in intel-requests-api | get_intel_request() calls GET /intel-requests/{id}?get_recursive_dependencies=True, which triggers a recursive CTE that balloons under Charter-scale load. This single call is 5–7s = 50–60% of total processing time. | 5–7s → <1s per message |
| Enable pg_stat_statements on app database | Cannot identify the exact slow query without this. March 5 P0 action item — still open. | Enables root cause analysis |
| Target | Details | Expected Impact |
|---|---|---|
| Add depth limit to recursive dependency query | March 5 post-mortem P1 — prevents unbounded recursion in the CTE | Caps worst-case query time |
| Cache intel_request data across messages in same batch | 75 pods independently fetch the same data for the same org. Robin's cache MR (#44) addresses this. | Reduces redundant API calls |
| Add queue depth / consumer throughput alert | March 5 P0 action item — still open. The 7,617 peak was not alerted on. | Earlier detection |
| Parallelize independent calls | get_organization() (app.py:375) runs after generate_context_data() — these are independent and could run with asyncio.gather() | Save ~0.05s (minor) |
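The asyncio.gather() idea in the last row can be sketched as follows. The function bodies are stand-ins with sleeps in place of real API calls; only the concurrency structure is the point:

```python
# Sketch of the "parallelize independent calls" row: get_organization() does
# not depend on generate_context_data()'s output, so the two awaits can run
# concurrently with asyncio.gather(). Bodies are stand-ins, not real calls.
import asyncio

async def generate_context_data(incident_id: str) -> dict:
    await asyncio.sleep(0.05)  # stand-in for the internal API work
    return {"incident": incident_id}

async def get_organization(org_id: str) -> dict:
    await asyncio.sleep(0.05)  # stand-in for the ~0.05s AAA API call
    return {"org": org_id}

async def process(incident_id: str, org_id: str) -> list[dict]:
    # Sequential awaits would take ~0.10s here; gather overlaps them (~0.05s).
    return await asyncio.gather(
        generate_context_data(incident_id),
        get_organization(org_id),
    )

context, org = asyncio.run(process("inc-1", "org-1"))
print(context, org)  # -> {'incident': 'inc-1'} {'org': 'org-1'}
```

The saving is small today because get_organization() is already fast; the same pattern pays off more if applied to the bulk fetches inside generate_context_data().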
| Target | Details | Expected Impact |
|---|---|---|
| Reduce is_throttled Redis latency | Currently 0.35s × 2 channels = 0.7s per message | Save ~0.5s |
| Batch Redis operations | is_throttled, is_payload_repeated, idempotent_session_lock could potentially be pipelined | Save ~0.3s |
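The pipelining row can be sketched as below. With redis-py the pipeline would come from a real `redis.Redis` client; the `FakeRedis` stand-in here exists only so the sketch runs without a server, and the key names are illustrative, not the service's actual schema:

```python
# Sketch of batching the three per-channel Redis round trips into one
# pipeline call. FakeRedis/FakePipeline are in-memory stand-ins for a
# redis-py client; only preflight_checks() shows the intended pattern.

class FakePipeline:
    def __init__(self, store):
        self.store, self.ops = store, []
    def get(self, key):
        self.ops.append(("get", key))
        return self
    def set(self, key, value, nx=False, ex=None):
        self.ops.append(("set", key, value, nx))
        return self
    def execute(self):
        results = []
        for op in self.ops:
            if op[0] == "get":
                results.append(self.store.get(op[1]))
            else:  # set with nx=True only succeeds if the key is absent
                _, key, value, nx = op
                if nx and key in self.store:
                    results.append(None)
                else:
                    self.store[key] = value
                    results.append(True)
        return results

class FakeRedis:
    def __init__(self):
        self.store = {}
    def pipeline(self):
        return FakePipeline(self.store)

def preflight_checks(client, channel, payload_hash, session_id):
    # One round trip instead of three; if network latency dominates the
    # 0.35s/0.05s/0.1s timings above, this is where the ~0.3s comes from.
    pipe = client.pipeline()
    pipe.get(f"throttle:{channel}")                    # is_throttled
    pipe.get(f"payload:{payload_hash}")                # is_payload_repeated
    pipe.set(f"lock:{session_id}", 1, nx=True, ex=60)  # idempotent_session_lock
    throttled, repeated, lock_ok = pipe.execute()
    return throttled is not None, repeated is not None, bool(lock_ok)

client = FakeRedis()
print(preflight_checks(client, "email", "abc123", "sess-1"))  # (False, False, True)
```

Whether the three checks can actually share one round trip depends on their real implementations; if is_throttled uses Lua or multiple keys, the pipeline shape would differ.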
This is the same failure mode documented in the March 5 post-mortem:
| Aspect | March 5 | Today (March 6) |
|---|---|---|
| Queue peak | 20,241 | 7,617 |
| Processing time | 25–54s per API call | 5–13s per message |
| Duration | ~5 hours | ~4 hours (ongoing) |
| Consumers online | 75 | 75 |
| Restarts | 0 | 0 |
| Root cause | Internal API latency under Charter load | Same |
| Concurrent infra events | Dragonfly eviction, Reloader cascade, Vector drops | None |
| Self-recovered | Yes (spontaneous) | Yes (in progress) |
Key P0 action items from March 5 that remain open:
- Install `pg_stat_statements` on `app` database
- Add alert rules for `pg_activity_max_tx_duration` on `app` DB
- Add notification-sender queue depth / consumer throughput alert
- Investigate root cause of API latency under Charter-scale load
Both goco.service-now.com and gocotest.service-now.com are returning HTTP 500 with:
Script Evaluation Exception: Cannot conv...
This is a ServiceNow-side bug in their scripted REST API (/api/go2/gisual_callback). Messages auto-retry via the rpc-error exchange. Consider notifying Charter's ServiceNow team.