@jmealo
Created March 6, 2026 08:36
notification-sender prod triage 2026-03-06 — SLA breach analysis and optimization targets

Notification-Sender Production Triage — 2026-03-06

**Date:** March 6, 2026 (~04:00–08:30 UTC)
**Environment:** Production
**Service:** notification-sender (v0.5.45)
**Severity:** SLA breach — notifications delayed beyond 15-minute target


Executive Summary

The notification-sender queue backed up to 7,617 messages at peak (07:30 UTC), with per-message processing times of 7–13 seconds (median 7.8s). This is a recurrence of the March 5 queue-backup incident, driven by the same root cause: internal API latency under Charter-scale load.

All 75 consumers remained online with 0 restarts. The queue is self-recovering (currently ~4,819 and declining), but messages at the back of the queue during peak were likely delivered 15–25+ minutes late, breaching the 15-minute SLA.


Key Metrics

| Metric | Value | Assessment |
|---|---|---|
| Pods | 75 (RS 7589654495) | Normal |
| Restarts | 0 | Healthy |
| Memory | ~1% across all pods | Fine |
| CPU | 2.72 cores total | Light |
| Queue depth (peak) | 7,617 msgs @ 07:30 UTC | SLA breach |
| Queue depth (current) | ~4,819 msgs | Draining |
| Process time (median) | 7.77s | Degraded |
| Process time (P95) | 12.21s | Degraded |
| Process time (max) | 13.98s | Degraded |
| Error-level logs (4h) | 33 | Low |
| Warning-level logs (4h) | 50+ | Moderate |

Queue Trend (15-min buckets, UTC)

```
04:00     10
04:15      1
04:30      5
04:45  3,631  ████████████████████
05:00  1,927  ██████████▋
05:15      5
05:30     87
05:45     74
06:00    568  ███
06:15  1,134  ██████▎
06:30  2,482  █████████████▊
06:45  5,409  ██████████████████████████████
07:00  6,273  ██████████████████████████████████▊
07:15  6,518  ████████████████████████████████████▏
07:30  7,617  ██████████████████████████████████████████▎  <- PEAK
07:45  6,275  ██████████████████████████████████▊
08:00  4,866  ███████████████████████████
```

SLA Breach Analysis

Processing time distribution (50 samples from last 4 hours):

| Statistic | Duration |
|---|---|
| Min | 1.39s |
| Median | 7.77s |
| P90 | 11.44s |
| P95 | 12.21s |
| Max | 13.98s |

Effective throughput: 75 consumers ÷ 6.69s average processing time ≈ 11.2 msgs/sec ≈ 672 msgs/min

Estimated end-to-end latency at peak (7,617 msgs):

| Scenario | Queue Wait + Processing | Total Latency |
|---|---|---|
| Optimistic (avg processing) | ~11.3 min + 6.7s | ~12 min |
| Realistic (median processing) | ~11.3 min + 7.8s | ~12 min |
| Worst case (P95 processing) | ~11.3 min + 12.2s | ~13.5 min |
| Degraded APIs (like March 5) | ~25 min + 15s | ~25+ min |

Messages at the back of the queue during the 06:30–08:00 UTC window almost certainly exceeded the 15-minute SLA.
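The queue-wait figures above follow from simple Little's-law-style arithmetic on the measured numbers; a minimal sketch, with every constant taken from this report:

```python
# Back-of-envelope model of queue wait at peak, using figures measured above.
CONSUMERS = 75            # pods, all online
AVG_PROCESS_S = 6.69      # average per-message processing time
PEAK_DEPTH = 7_617        # queue depth at 07:30 UTC

# Fleet throughput: each consumer completes 1 / AVG_PROCESS_S msgs/sec.
throughput_per_s = CONSUMERS / AVG_PROCESS_S        # ~11.2 msgs/sec
throughput_per_min = throughput_per_s * 60          # ~672 msgs/min

# A message at the back of the queue waits for the whole backlog to drain.
queue_wait_min = PEAK_DEPTH / throughput_per_min    # ~11.3 min
total_latency_min = queue_wait_min + AVG_PROCESS_S / 60

print(f"{throughput_per_s:.1f} msg/s, wait {queue_wait_min:.1f} min, "
      f"total {total_latency_min:.1f} min")
```

At the March 5 processing times (~25s per call) the same arithmetic pushes queue wait well past the 15-minute SLA, which is the "Degraded APIs" row above.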


Errors

Error-level (33 entries in 4 hours)

| Count | Pattern | Root Cause |
|---|---|---|
| 11 | ServiceNow 500 — gocotest.service-now.com `Script Evaluation Exception` | External: ServiceNow bug |
| 11 | `HTTPStatusError` for same ServiceNow endpoints | External: ServiceNow bug |
| 9 | Server error 500 from ServiceNow callback URLs | External: ServiceNow bug |
| 2 | `Drop message: Incident does not exist` | Stale references |

All errors are externally caused. ServiceNow's GOCO scripted REST API is returning 500s with `Script Evaluation Exception: Cannot conv...`. These messages auto-retry via the `rpc-error` exchange.

Warning-level (50+ entries in 4 hours)

| Count | Pattern | Assessment |
|---|---|---|
| 30 | `Repeated Payload: http.webhook payload is repeated` | Dedup working correctly |
| 10 | `Lock race condition detected` | Expected with 75 competing consumers |
| ~7 | `Timing: process took 10-15s` | Slow internal API calls |
| 1 | Email template / outage status mismatch | One-off |
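The lock races above are the expected outcome of 75 pods competing for the same key. A minimal sketch of how an idempotent session lock is typically built on Redis `SET NX EX` — the key naming, TTL, and the in-memory stand-in are illustrative assumptions, not the actual `send_payload()` implementation:

```python
# Sketch of an idempotent session lock on Redis SET NX EX (atomic:
# only the first of N competing pods succeeds). With redis-py this is
# r.set(key, "1", nx=True, ex=ttl). A tiny in-memory stand-in lets the
# race semantics be demonstrated without a live server.

class FakeRedis:
    def __init__(self):
        self._store = {}

    def set(self, key, value, nx=False, ex=None):
        if nx and key in self._store:
            return None          # lost the race: another pod holds the lock
        self._store[key] = value
        return True

def idempotent_session_lock(client, session_id: str, ttl_s: int = 300) -> bool:
    """True if this consumer won the lock; False is the benign
    'Lock race condition detected' path seen in the warning logs."""
    return bool(client.set(f"notif:lock:{session_id}", "1", nx=True, ex=ttl_s))

r = FakeRedis()
first = idempotent_session_lock(r, "msg-123")   # this pod wins the lock
second = idempotent_session_lock(r, "msg-123")  # a competing pod loses
```

The TTL matters: it bounds how long a crashed consumer can hold the lock before another pod may retry the message.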

Error-level log distribution (15-min buckets):

```
04:00-04:30   0 errors
04:45         6 errors
05:00        22 errors  <- ServiceNow spike
05:15-05:30   0
05:45         1
06:00-06:30   0
06:45         1
07:00-07:30   0
07:45         3
```

Hot Path Analysis: Charter Webhook Generation

Per-Operation Timing Breakdown

| Operation | Duration | % of Total | Source |
|---|---|---|---|
| `get_intel_request` (recursive deps) | 5–7s | 50–60% | intel_requests_api |
| `get_intel_requests` (bulk) | 0.2–5.7s | 15–30% | intel_requests_api |
| `generate_context_data` (total) | 1.7–2.0s | 15–20% | Multiple APIs |
| `is_throttled` | 0.35s | ~3% | Redis |
| `idempotent_session_lock` | 0.1s | ~1% | Redis |
| `get_outbound_token` | 0.09s | ~1% | AAA API |
| `get_organization` | 0.05s | <1% | AAA API |
| `generate` (Jinja template) | 0.01s | <1% | Local |
| `send` (HTTP POST to Charter) | <0.01s | <1% | External |

The external webhook POST to Charter is essentially instant (<10ms). The bottleneck lies entirely in internal API calls.

Sequential Call Chain

```
process()                                    [app.py:223]
  |
  +-- generate_context_data()                [intel.py:627]
  |     |
  |     +-- get_intel_request()              [intel.py:749, 902-918]
  |     |     GET intel_requests_api/intel-requests/{id}
  |     |       ?get_recursive_dependencies=True
  |     |       &recursive_dependency_type=dependents
  |     |       &recursive_record_types=intel_requests
  |     |     >>> 5-7s  <<<  #1 BOTTLENECK
  |     |
  |     +-- get_intel_requests() bulk        [intel.py:764, 877-900]
  |     |     POST intel_requests_api/intel-requests/bulk
  |     |       (fetches dependent intel request IDs from #1)
  |     |     >>> 0.2-5.7s <<<  #2 BOTTLENECK
  |     |
  |     +-- get_assets() bulk                [intel.py:798, 859-875]
  |     |     POST assets_api/assets/bulk
  |     |     ~0.5s
  |     |
  |     +-- redis set_state                  [intel.py:835]
  |           ~0.05s
  |
  +-- get_organization()                     [app.py:375]
  |     GET aaa_api/organizations/{id}
  |     ~0.05s
  |
  +-- For each channel (email + webhook):
        +-- is_throttled()                   ~0.35s (Redis)
        +-- is_payload_repeated()            ~0.05s (Redis)
        +-- idempotent_session_lock()        ~0.1s  (Redis)
        +-- get_outbound_token()             [app.py:1109] ~0.09s
        +-- generate() (Jinja)               ~0.01s
        +-- send() (HTTP POST)               <0.01s  <- Charter webhook is FAST
```

For incident-based notifications, there's an additional upstream call:

```
generate_context_data() with incident_id     [intel.py:639-696]
  |
  +-- get_incident()                         [intel.py:641-658]
  |     GET incidents_api/incidents/{id}
  |       ?include_deleted_dependencies=True
  |       &get_recursive_dependencies=True
  |       &recursive_dependency_type=dependents
  |       &recursive_record_types=intel_requests
  |     >>> variable, was 25-54s during March 5 <<<
  |
  +-- reverse_geocode()                      [intel.py:668] (if no admin area)
  +-- get_intel_requests() bulk              [intel.py:687]
  +-- get_assets() bulk                      [intel.py:798]
```
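Per-operation durations like those in the breakdown above are typically attributed with a timing wrapper around each call. The decorator below is a hypothetical sketch mirroring the `Timing: process took ...` warnings seen in the logs, not the service's actual instrumentation:

```python
import functools
import logging
import time

log = logging.getLogger("notification_sender.timing")

def timed(threshold_s: float = 5.0):
    """Log any wrapped call that exceeds `threshold_s`, in the style of
    the 'Timing: process took ...' warnings from the triage logs."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed = time.monotonic() - start
                if elapsed >= threshold_s:
                    log.warning("Timing: %s took %.2fs", fn.__name__, elapsed)
        return wrapper
    return decorator

@timed(threshold_s=0.0)          # threshold 0 only for demonstration
def get_intel_request(intel_id):
    time.sleep(0.01)             # stand-in for the 5-7s recursive-deps call
    return {"id": intel_id}
```

Wrapping each step in the call chain this way is what makes the "50–60% in `get_intel_request`" attribution possible without a full tracing stack.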

Code References

notification-sender (consumer)

| File | Key Lines | What |
|---|---|---|
| `notification_sender/app.py` | L223-419 | `process()` — main entry point, orchestrates generation + delivery |
| `notification_sender/app.py` | L846+ | `send_payload()` — idempotent delivery with Redis lock |
| `notification_sender/app.py` | L1109 | `get_outbound_token()` |
| `notification_sender/generators/intel.py` | L627-857 | `generate_context_data()` — THE HOT PATH, all API calls |
| `notification_sender/generators/intel.py` | L902-918 | `get_intel_request()` — #1 bottleneck, recursive deps |
| `notification_sender/generators/intel.py` | L877-900 | `get_intel_requests()` bulk — #2 bottleneck |
| `notification_sender/generators/intel.py` | L859-875 | `get_assets()` bulk |
| `notification_sender/generators/intel.py` | L639-696 | `generate_context_data()` incident path with `get_incident()` |

Backend APIs (where the latency actually lives)

| API | Repo Path | Called By | Latency |
|---|---|---|---|
| intel-requests-api | backend/apis/intel-requests-api/ | `get_intel_request()`, `get_intel_requests()` | 5–7s (recursive deps) |
| incidents-api | backend/apis/incidents-api/ | `get_incident()` (incident path) | Variable; 25–54s during March 5 |
| assets-api | backend/apis/assets-api/ | `get_assets()` | ~0.5s |
| aaa-api | backend/apis/aaa-api/ | `get_organization()` | ~0.05s |

Optimization Targets (Prioritized)

P0 — Single biggest win

| Target | Details | Expected Impact |
|---|---|---|
| Optimize recursive dependency CTE in intel-requests-api | `get_intel_request()` calls `GET /intel-requests/{id}?get_recursive_dependencies=True`, which triggers a recursive CTE that balloons under Charter-scale load. This single call is 5–7s = 50–60% of total processing time. | 5–7s → <1s per message |
| Enable pg_stat_statements on the app database | The exact slow query cannot be identified without it. March 5 P0 action item — still open. | Enables root-cause analysis |

P1 — Important

| Target | Details | Expected Impact |
|---|---|---|
| Add depth limit to recursive dependency query | March 5 post-mortem P1 — prevents unbounded recursion in the CTE | Caps worst-case query time |
| Cache intel_request data across messages in the same batch | 75 pods independently fetch the same data for the same org. Robin's cache MR (#44) addresses this. | Reduces redundant API calls |
| Add queue depth / consumer throughput alert | March 5 P0 action item — still open. The 7,617 peak was not alerted on. | Earlier detection |
| Parallelize independent calls | `get_organization()` (app.py:375) runs after `generate_context_data()` — these are independent and could run with `asyncio.gather()` | Saves ~0.05s (minor) |
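The `asyncio.gather()` suggestion above can be sketched as follows. The coroutines are hypothetical stand-ins — the real functions in `app.py` and `intel.py` may not be async today, so this assumes an async migration:

```python
import asyncio

# Hypothetical async stand-ins for the two independent calls.
async def generate_context_data(intel_id):
    await asyncio.sleep(0.01)    # stands in for the 5-7s intel-requests call
    return {"intel_id": intel_id}

async def get_organization(org_id):
    await asyncio.sleep(0.01)    # stands in for the ~0.05s AAA call
    return {"org_id": org_id}

async def process(intel_id, org_id):
    # Run both concurrently instead of sequentially: total wall time
    # becomes max(a, b) rather than a + b.
    context, org = await asyncio.gather(
        generate_context_data(intel_id),
        get_organization(org_id),
    )
    return context, org

context, org = asyncio.run(process("ir-1", "org-1"))
```

As the table notes, the saving here is small because `get_organization()` is already fast; the pattern matters more if other independent calls are folded in later.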

P2 — Nice to have

| Target | Details | Expected Impact |
|---|---|---|
| Reduce `is_throttled` Redis latency | Currently 0.35s × 2 channels = 0.7s per message | Saves ~0.5s |
| Batch Redis operations | `is_throttled`, `is_payload_repeated`, `idempotent_session_lock` could be pipelined | Saves ~0.3s |
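The pipelining idea above can be sketched with the redis-py pipeline call shape — queue several commands, then send them in one round trip. The key names are assumptions, and a tiny in-memory pipeline stands in for a live server:

```python
# Sketch of batching the three per-channel Redis round trips into one
# pipeline, in the redis-py style: pipe.get(...).set(...).execute().

class FakePipeline:
    """Collects commands and answers them all in one 'round trip'."""
    def __init__(self, store):
        self._store, self._ops = store, []

    def get(self, key):
        self._ops.append(("get", key))
        return self              # chainable, like redis-py pipelines

    def set(self, key, value, nx=False, ex=None):
        self._ops.append(("set", key, value, nx))
        return self

    def execute(self):
        results = []
        for op in self._ops:
            if op[0] == "get":
                results.append(self._store.get(op[1]))
            else:
                _, key, value, nx = op
                if nx and key in self._store:
                    results.append(None)   # SET NX loses the race
                else:
                    self._store[key] = value
                    results.append(True)
        return results

store = {}
pipe = FakePipeline(store)
# One round trip instead of three: throttle check, dedup check, lock.
throttled, repeated, lock_won = (
    pipe.get("notif:throttle:org-1")
        .get("notif:payload:abc123")
        .set("notif:lock:msg-1", "1", nx=True)
        .execute()
)
```

The saving is roughly the two eliminated network round trips per channel, which matches the ~0.3s estimate in the table.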

Relationship to March 5 Incident

This is the same failure mode documented in the March 5 post-mortem:

| Aspect | March 5 | Today (March 6) |
|---|---|---|
| Queue peak | 20,241 | 7,617 |
| Processing time | 25–54s per API call | 5–13s per message |
| Duration | ~5 hours | ~4 hours (ongoing) |
| Consumers online | 75 | 75 |
| Restarts | 0 | 0 |
| Root cause | Internal API latency under Charter load | Same |
| Concurrent infra events | Dragonfly eviction, Reloader cascade, Vector drops | None |
| Self-recovered | Yes (spontaneous) | Yes (in progress) |

Key P0 action items from March 5 that remain open:

  1. Install pg_stat_statements on app database
  2. Add alert rules for pg_activity_max_tx_duration on app DB
  3. Add notification-sender queue depth / consumer throughput alert
  4. Investigate root cause of API latency under Charter-scale load

Additional Issue: ServiceNow 500 Errors

Both goco.service-now.com and gocotest.service-now.com are returning HTTP 500 with:

```
Script Evaluation Exception: Cannot conv...
```

This is a ServiceNow-side bug in their scripted REST API (/api/go2/gisual_callback). Messages auto-retry via the rpc-error exchange. Consider notifying Charter's ServiceNow team.
