@jmealo
Created March 6, 2026 08:36
notification-sender prod triage 2026-03-06 — SLA breach analysis and optimization targets

Notification-Sender Production Triage — 2026-03-06

**Date:** March 6, 2026 (~04:00–08:30 UTC)
**Environment:** Production
**Service:** notification-sender (v0.5.45)
**Severity:** SLA breach — notifications delayed beyond 15-minute target


Executive Summary

The notification-sender queue backed up to 7,617 messages at peak (07:30 UTC), with per-message processing times of 7–13 seconds (median 7.8s). This is a recurrence of the March 5 queue-backup incident, driven by the same root cause: internal API latency under Charter-scale load.

All 75 consumers remained online with 0 restarts. The queue is self-recovering (currently ~4,819 and declining), but messages at the back of the queue during peak were likely delivered 15–25+ minutes late, breaching the 15-minute SLA.


Key Metrics

| Metric | Value | Assessment |
|---|---|---|
| Pods | 75 (RS 7589654495) | Normal |
| Restarts | 0 | Healthy |
| Memory | ~1% across all pods | Fine |
| CPU | 2.72 cores total | Light |
| Queue depth (peak) | 7,617 msgs @ 07:30 UTC | SLA breach |
| Queue depth (current) | ~4,819 msgs | Draining |
| Process time (median) | 7.77s | Degraded |
| Process time (P95) | 12.21s | Degraded |
| Process time (max) | 13.98s | Degraded |
| Error-level logs (4h) | 33 | Low |
| Warning-level logs (4h) | 50+ | Moderate |

Queue Trend (15-min buckets, UTC)

```
04:00     10
04:15      1
04:30      5
04:45  3,631  ████████████████████
05:00  1,927  ██████████▋
05:15      5
05:30     87
05:45     74
06:00    568  ███
06:15  1,134  ██████▎
06:30  2,482  █████████████▊
06:45  5,409  ██████████████████████████████
07:00  6,273  ██████████████████████████████████▊
07:15  6,518  ████████████████████████████████████▏
07:30  7,617  ██████████████████████████████████████████▎  <- PEAK
07:45  6,275  ██████████████████████████████████▊
08:00  4,866  ███████████████████████████
```

SLA Breach Analysis

Processing time distribution (50 samples from last 4 hours):

| Statistic | Duration |
|---|---|
| Min | 1.39s |
| Median | 7.77s |
| P90 | 11.44s |
| P95 | 12.21s |
| Max | 13.98s |

Effective throughput: 75 consumers ÷ 6.69s average processing time ≈ 11.2 msgs/sec ≈ 672 msgs/min

Estimated end-to-end latency at peak (7,617 msgs):

| Scenario | Queue Wait + Processing | Total Latency |
|---|---|---|
| Optimistic (avg processing) | ~11.3 min + 6.7s | ~12 min |
| Realistic (median processing) | ~11.3 min + 7.8s | ~12 min |
| Worst case (P95 processing) | ~11.3 min + 12.2s | ~13.5 min |
| Degraded APIs (like March 5) | ~25 min + 15s | ~25+ min |

Messages at the back of the queue during the 06:30–08:00 UTC window almost certainly exceeded the 15-minute SLA.
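The queue-wait figures above follow from simple Little's-law-style arithmetic on the measured numbers; a minimal sketch, with every constant taken from this report:

```python
# Back-of-envelope model of queue wait at peak, using figures measured above.
CONSUMERS = 75            # pods, all online
AVG_PROCESS_S = 6.69      # average per-message processing time
PEAK_DEPTH = 7_617        # queue depth at 07:30 UTC

# Fleet throughput: each consumer completes 1 / AVG_PROCESS_S msgs/sec.
throughput_per_s = CONSUMERS / AVG_PROCESS_S        # ~11.2 msgs/sec
throughput_per_min = throughput_per_s * 60          # ~672 msgs/min

# A message at the back of the queue waits for the whole backlog to drain.
queue_wait_min = PEAK_DEPTH / throughput_per_min    # ~11.3 min
total_latency_min = queue_wait_min + AVG_PROCESS_S / 60

print(f"{throughput_per_s:.1f} msg/s, wait {queue_wait_min:.1f} min, "
      f"total {total_latency_min:.1f} min")
```

At the March 5 processing times (~25s per call) the same arithmetic pushes queue wait well past the 15-minute SLA, which is the "Degraded APIs" row above.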


Errors

Error-level (33 entries in 4 hours)

| Count | Pattern | Root Cause |
|---|---|---|
| 11 | ServiceNow 500 — gocotest.service-now.com `Script Evaluation Exception` | External: ServiceNow bug |
| 11 | `HTTPStatusError` for same ServiceNow endpoints | External: ServiceNow bug |
| 9 | Server error 500 from ServiceNow callback URLs | External: ServiceNow bug |
| 2 | `Drop message: Incident does not exist` | Stale references |

All errors are externally caused. ServiceNow's GOCO scripted REST API is returning 500s with `Script Evaluation Exception: Cannot conv...`. These messages auto-retry via the `rpc-error` exchange.

Warning-level (50+ entries in 4 hours)

| Count | Pattern | Assessment |
|---|---|---|
| 30 | `Repeated Payload: http.webhook payload is repeated` | Dedup working correctly |
| 10 | `Lock race condition detected` | Expected with 75 competing consumers |
| ~7 | `Timing: process took 10-15s` | Slow internal API calls |
| 1 | Email template / outage status mismatch | One-off |
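The lock races above are the expected outcome of 75 pods competing for the same key. A minimal sketch of how an idempotent session lock is typically built on Redis `SET NX EX` — the key naming, TTL, and the in-memory stand-in are illustrative assumptions, not the actual `send_payload()` implementation:

```python
# Sketch of an idempotent session lock on Redis SET NX EX (atomic:
# only the first of N competing pods succeeds). With redis-py this is
# r.set(key, "1", nx=True, ex=ttl). A tiny in-memory stand-in lets the
# race semantics be demonstrated without a live server.

class FakeRedis:
    def __init__(self):
        self._store = {}

    def set(self, key, value, nx=False, ex=None):
        if nx and key in self._store:
            return None          # lost the race: another pod holds the lock
        self._store[key] = value
        return True

def idempotent_session_lock(client, session_id: str, ttl_s: int = 300) -> bool:
    """True if this consumer won the lock; False is the benign
    'Lock race condition detected' path seen in the warning logs."""
    return bool(client.set(f"notif:lock:{session_id}", "1", nx=True, ex=ttl_s))

r = FakeRedis()
first = idempotent_session_lock(r, "msg-123")   # this pod wins the lock
second = idempotent_session_lock(r, "msg-123")  # a competing pod loses
```

The TTL matters: it bounds how long a crashed consumer can hold the lock before another pod may retry the message.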

Error-level log distribution (15-min buckets):

```
04:00-04:30   0 errors
04:45         6 errors
05:00        22 errors  <- ServiceNow spike
05:15-05:30   0
05:45         1
06:00-06:30   0
06:45         1
07:00-07:30   0
07:45         3
```

Hot Path Analysis: Charter Webhook Generation

Per-Operation Timing Breakdown

| Operation | Duration | % of Total | Source |
|---|---|---|---|
| `get_intel_request` (recursive deps) | 5–7s | 50–60% | intel_requests_api |
| `get_intel_requests` (bulk) | 0.2–5.7s | 15–30% | intel_requests_api |
| `generate_context_data` (total) | 1.7–2.0s | 15–20% | Multiple APIs |
| `is_throttled` | 0.35s | ~3% | Redis |
| `idempotent_session_lock` | 0.1s | ~1% | Redis |
| `get_outbound_token` | 0.09s | ~1% | AAA API |
| `get_organization` | 0.05s | <1% | AAA API |
| `generate` (Jinja template) | 0.01s | <1% | Local |
| `send` (HTTP POST to Charter) | <0.01s | <1% | External |

The external webhook POST to Charter is essentially instant (<10ms). The bottleneck lies entirely in internal API calls.

Sequential Call Chain

```
process()                                    [app.py:223]
  |
  +-- generate_context_data()                [intel.py:627]
  |     |
  |     +-- get_intel_request()              [intel.py:749, 902-918]
  |     |     GET intel_requests_api/intel-requests/{id}
  |     |       ?get_recursive_dependencies=True
  |     |       &recursive_dependency_type=dependents
  |     |       &recursive_record_types=intel_requests
  |     |     >>> 5-7s  <<<  #1 BOTTLENECK
  |     |
  |     +-- get_intel_requests() bulk        [intel.py:764, 877-900]
  |     |     POST intel_requests_api/intel-requests/bulk
  |     |       (fetches dependent intel request IDs from #1)
  |     |     >>> 0.2-5.7s <<<  #2 BOTTLENECK
  |     |
  |     +-- get_assets() bulk                [intel.py:798, 859-875]
  |     |     POST assets_api/assets/bulk
  |     |     ~0.5s
  |     |
  |     +-- redis set_state                  [intel.py:835]
  |           ~0.05s
  |
  +-- get_organization()                     [app.py:375]
  |     GET aaa_api/organizations/{id}
  |     ~0.05s
  |
  +-- For each channel (email + webhook):
        +-- is_throttled()                   ~0.35s (Redis)
        +-- is_payload_repeated()            ~0.05s (Redis)
        +-- idempotent_session_lock()        ~0.1s  (Redis)
        +-- get_outbound_token()             [app.py:1109] ~0.09s
        +-- generate() (Jinja)               ~0.01s
        +-- send() (HTTP POST)               <0.01s  <- Charter webhook is FAST
```

For incident-based notifications, there's an additional upstream call:

```
generate_context_data() with incident_id     [intel.py:639-696]
  |
  +-- get_incident()                         [intel.py:641-658]
  |     GET incidents_api/incidents/{id}
  |       ?include_deleted_dependencies=True
  |       &get_recursive_dependencies=True
  |       &recursive_dependency_type=dependents
  |       &recursive_record_types=intel_requests
  |     >>> variable, was 25-54s during March 5 <<<
  |
  +-- reverse_geocode()                      [intel.py:668] (if no admin area)
  +-- get_intel_requests() bulk              [intel.py:687]
  +-- get_assets() bulk                      [intel.py:798]
```
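Per-operation durations like those in the breakdown above are typically attributed with a timing wrapper around each call. The decorator below is a hypothetical sketch mirroring the `Timing: process took ...` warnings seen in the logs, not the service's actual instrumentation:

```python
import functools
import logging
import time

log = logging.getLogger("notification_sender.timing")

def timed(threshold_s: float = 5.0):
    """Log any wrapped call that exceeds `threshold_s`, in the style of
    the 'Timing: process took ...' warnings from the triage logs."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed = time.monotonic() - start
                if elapsed >= threshold_s:
                    log.warning("Timing: %s took %.2fs", fn.__name__, elapsed)
        return wrapper
    return decorator

@timed(threshold_s=0.0)          # threshold 0 only for demonstration
def get_intel_request(intel_id):
    time.sleep(0.01)             # stand-in for the 5-7s recursive-deps call
    return {"id": intel_id}
```

Wrapping each step in the call chain this way is what makes the "50–60% in `get_intel_request`" attribution possible without a full tracing stack.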

Code References

notification-sender (consumer)

| File | Key Lines | What |
|---|---|---|
| `notification_sender/app.py` | L223-419 | `process()` — main entry point, orchestrates generation + delivery |
| `notification_sender/app.py` | L846+ | `send_payload()` — idempotent delivery with Redis lock |
| `notification_sender/app.py` | L1109 | `get_outbound_token()` |
| `notification_sender/generators/intel.py` | L627-857 | `generate_context_data()` — THE HOT PATH, all API calls |
| `notification_sender/generators/intel.py` | L902-918 | `get_intel_request()` — #1 bottleneck, recursive deps |
| `notification_sender/generators/intel.py` | L877-900 | `get_intel_requests()` bulk — #2 bottleneck |
| `notification_sender/generators/intel.py` | L859-875 | `get_assets()` bulk |
| `notification_sender/generators/intel.py` | L639-696 | `generate_context_data()` incident path with `get_incident()` |

Backend APIs (where the latency actually lives)

| API | Repo Path | Called By | Latency |
|---|---|---|---|
| intel-requests-api | backend/apis/intel-requests-api/ | `get_intel_request()`, `get_intel_requests()` | 5–7s (recursive deps) |
| incidents-api | backend/apis/incidents-api/ | `get_incident()` (incident path) | Variable; 25–54s during March 5 |
| assets-api | backend/apis/assets-api/ | `get_assets()` | ~0.5s |
| aaa-api | backend/apis/aaa-api/ | `get_organization()` | ~0.05s |

Optimization Targets (Prioritized)

P0 — Single biggest win

| Target | Details | Expected Impact |
|---|---|---|
| Optimize recursive dependency CTE in intel-requests-api | `get_intel_request()` calls `GET /intel-requests/{id}?get_recursive_dependencies=True`, which triggers a recursive CTE that balloons under Charter-scale load. This single call is 5–7s = 50–60% of total processing time. | 5–7s → <1s per message |
| Enable pg_stat_statements on the app database | The exact slow query cannot be identified without it. March 5 P0 action item — still open. | Enables root-cause analysis |

P1 — Important

| Target | Details | Expected Impact |
|---|---|---|
| Add depth limit to recursive dependency query | March 5 post-mortem P1 — prevents unbounded recursion in the CTE | Caps worst-case query time |
| Cache intel_request data across messages in the same batch | 75 pods independently fetch the same data for the same org. Robin's cache MR (#44) addresses this. | Reduces redundant API calls |
| Add queue depth / consumer throughput alert | March 5 P0 action item — still open. The 7,617 peak was not alerted on. | Earlier detection |
| Parallelize independent calls | `get_organization()` (app.py:375) runs after `generate_context_data()` — these are independent and could run with `asyncio.gather()` | Saves ~0.05s (minor) |
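The `asyncio.gather()` suggestion above can be sketched as follows. The coroutines are hypothetical stand-ins — the real functions in `app.py` and `intel.py` may not be async today, so this assumes an async migration:

```python
import asyncio

# Hypothetical async stand-ins for the two independent calls.
async def generate_context_data(intel_id):
    await asyncio.sleep(0.01)    # stands in for the 5-7s intel-requests call
    return {"intel_id": intel_id}

async def get_organization(org_id):
    await asyncio.sleep(0.01)    # stands in for the ~0.05s AAA call
    return {"org_id": org_id}

async def process(intel_id, org_id):
    # Run both concurrently instead of sequentially: total wall time
    # becomes max(a, b) rather than a + b.
    context, org = await asyncio.gather(
        generate_context_data(intel_id),
        get_organization(org_id),
    )
    return context, org

context, org = asyncio.run(process("ir-1", "org-1"))
```

As the table notes, the saving here is small because `get_organization()` is already fast; the pattern matters more if other independent calls are folded in later.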

P2 — Nice to have

| Target | Details | Expected Impact |
|---|---|---|
| Reduce `is_throttled` Redis latency | Currently 0.35s × 2 channels = 0.7s per message | Saves ~0.5s |
| Batch Redis operations | `is_throttled`, `is_payload_repeated`, `idempotent_session_lock` could be pipelined | Saves ~0.3s |
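The pipelining idea above can be sketched with the redis-py pipeline call shape — queue several commands, then send them in one round trip. The key names are assumptions, and a tiny in-memory pipeline stands in for a live server:

```python
# Sketch of batching the three per-channel Redis round trips into one
# pipeline, in the redis-py style: pipe.get(...).set(...).execute().

class FakePipeline:
    """Collects commands and answers them all in one 'round trip'."""
    def __init__(self, store):
        self._store, self._ops = store, []

    def get(self, key):
        self._ops.append(("get", key))
        return self              # chainable, like redis-py pipelines

    def set(self, key, value, nx=False, ex=None):
        self._ops.append(("set", key, value, nx))
        return self

    def execute(self):
        results = []
        for op in self._ops:
            if op[0] == "get":
                results.append(self._store.get(op[1]))
            else:
                _, key, value, nx = op
                if nx and key in self._store:
                    results.append(None)   # SET NX loses the race
                else:
                    self._store[key] = value
                    results.append(True)
        return results

store = {}
pipe = FakePipeline(store)
# One round trip instead of three: throttle check, dedup check, lock.
throttled, repeated, lock_won = (
    pipe.get("notif:throttle:org-1")
        .get("notif:payload:abc123")
        .set("notif:lock:msg-1", "1", nx=True)
        .execute()
)
```

The saving is roughly the two eliminated network round trips per channel, which matches the ~0.3s estimate in the table.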

Relationship to March 5 Incident

This is the same failure mode documented in the March 5 post-mortem:

| Aspect | March 5 | Today (March 6) |
|---|---|---|
| Queue peak | 20,241 | 7,617 |
| Processing time | 25–54s per API call | 5–13s per message |
| Duration | ~5 hours | ~4 hours (ongoing) |
| Consumers online | 75 | 75 |
| Restarts | 0 | 0 |
| Root cause | Internal API latency under Charter load | Same |
| Concurrent infra events | Dragonfly eviction, Reloader cascade, Vector drops | None |
| Self-recovered | Yes (spontaneous) | Yes (in progress) |

Key P0 action items from March 5 that remain open:

  1. Install pg_stat_statements on app database
  2. Add alert rules for pg_activity_max_tx_duration on app DB
  3. Add notification-sender queue depth / consumer throughput alert
  4. Investigate root cause of API latency under Charter-scale load

Additional Issue: ServiceNow 500 Errors

Both goco.service-now.com and gocotest.service-now.com are returning HTTP 500 with:

```
Script Evaluation Exception: Cannot conv...
```

This is a ServiceNow-side bug in their scripted REST API (/api/go2/gisual_callback). Messages auto-retry via the rpc-error exchange. Consider notifying Charter's ServiceNow team.
