
Notification-Sender Bottleneck Deep Dive — 2026-03-06

Context: Production SLA breach. notification-sender queue peaked at 7,617 msgs. Per-message processing: median 7.8s, max 14s. Charter webhook delivery is <10ms — the bottleneck is entirely internal API calls.


Table of Contents

  1. Hot Path Summary
  2. Bottleneck #1: Recursive Dependencies CTE (shared library)
  3. Bottleneck #2: intel-requests-api
  4. Bottleneck #3: incidents-api
  5. assets-api (not a bottleneck)
  6. Notification-Sender Optimization Opportunities
  7. Current Incidents Feature — Potential Fast Path
  8. Prioritized Recommendations
  9. File Reference Index

1. Hot Path Summary

Per-operation timing from production logs (08:26 UTC, March 6):

| Operation | Duration | % of Total | API Called |
| --- | --- | --- | --- |
| get_intel_request (recursive deps) | 5-7s | 50-60% | intel-requests-api |
| get_intel_requests (bulk) | 0.2-5.7s | 15-30% | intel-requests-api |
| generate_context_data (total) | 1.7-2.0s | 15-20% | Multiple APIs |
| is_throttled | 0.35s | ~3% | Redis (lock contention) |
| idempotent_session_lock | 0.1s | ~1% | Redis |
| get_outbound_token | 0.09s | ~1% | AAA API |
| get_organization | 0.05s | <1% | AAA API |
| generate (Jinja template) | 0.01s | <1% | Local |
| send (HTTP POST to Charter) | <0.01s | <1% | External webhook |

The webhook POST to Charter is essentially instant; effectively all of the latency is internal.


2. Bottleneck #1: Recursive Dependencies CTE

This is the single biggest finding. Both intel-requests-api and incidents-api use the same shared library for recursive dependency resolution. The slow query lives in:

The Recursive CTE Query

File: backend/libraries/gisual-data-mixins/gisual_data_mixins/models.py (lines 219-272)

WITH RECURSIVE dependents (
    dependent_type, dependent_id, blocker_type, blocker_id,
    created, deleted, dependency_distance, dependency_path
) AS (
    -- Base case: direct dependencies
    SELECT d.dependent_type, d.dependent_id, d.blocker_type, d.blocker_id,
           d.created, d.deleted,
           1 AS dependency_distance,
           ARRAY[ROW(d.dependent_type, d.dependent_id,
                     d.blocker_type, d.blocker_id)] AS dependency_path
      FROM dependencies.dependencies AS d
     WHERE {upstream_id_field} = %(root_dependency_id)s
       AND {upstream_type_field} = %(root_dependency_type)s
       AND d.deleted IS NULL

    UNION ALL

    -- Recursive case: traverse the graph
    SELECT d.dependent_type, d.dependent_id, d.blocker_type, d.blocker_id,
           d.created, d.deleted,
           rd.dependency_distance + 1 AS dependency_distance,
           rd.dependency_path || ARRAY[ROW(...)] AS dependency_path
      FROM dependencies.dependencies AS d
      JOIN dependents AS rd
           ON (d.{upstream_id} = rd.{downstream_id}
               AND d.{upstream_type} = rd.{downstream_type})
     WHERE NOT ROW(d.dependent_type, d.dependent_id,
                   d.blocker_type, d.blocker_id) = ANY(rd.dependency_path)
       AND d.deleted IS NULL
)
SELECT * FROM dependents

Why It's Slow

  1. Unbounded recursion: max_distance parameter exists (line 214) but is never passed by any caller. Defaults to None = unlimited depth.
  2. Array-based cycle detection: dependency_path array grows with each level. The NOT ROW(...) = ANY(path) check is O(n) per row per level.
  3. Per-record execution: daos.py:147-157 runs this CTE once per record in a Python loop — N intel requests = N separate recursive queries.

Database Schema & Indexes

Table: dependencies.dependencies (app database, dependencies schema) DDL: ddl/app/schemata/dependencies/tables/00_dependencies.sql

Current indexes:

-- Single-column indexes
CREATE INDEX dependencies_dependent_type_idx ON dependencies.dependencies USING btree(dependent_type);
CREATE INDEX dependencies_blocker_type_idx ON dependencies.dependencies USING btree(blocker_type);
CREATE INDEX dependencies_dependent_id_idx ON dependencies.dependencies USING btree(dependent_id);
CREATE INDEX dependencies_blocker_id_idx ON dependencies.dependencies USING btree(blocker_id);

-- Partial composite indexes (good, but see issues below)
CREATE INDEX dep_blocker_live_idx ON dependencies.dependencies(blocker_type, blocker_id) WHERE deleted IS NULL;
CREATE INDEX dep_dependent_live_idx ON dependencies.dependencies(dependent_type, dependent_id) WHERE deleted IS NULL;

Index issues:

  • The direct dependencies query uses an OR condition across both blocker and dependent sides — the planner may struggle to use indexes efficiently for this pattern.
  • Missing 4-column composite index for the full (blocker_type, blocker_id, dependent_type, dependent_id) lookup pattern.
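A sketch of the missing index, following the existing partial-index style. The column order is an assumption and should be validated against EXPLAIN ANALYZE before shipping:

```sql
-- Candidate: covers the recursive join's (blocker_type, blocker_id) probe
-- and lets the planner read the dependent columns straight from the index.
-- CONCURRENTLY avoids blocking writes during creation on a hot table.
CREATE INDEX CONCURRENTLY dep_blocker_dependent_live_idx
    ON dependencies.dependencies (blocker_type, blocker_id, dependent_type, dependent_id)
 WHERE deleted IS NULL;
```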

The Fix Is Already Partially Built

The max_distance parameter is implemented in models.py:214-217 and generates a WHERE dependency_distance <= %(max_distance)s clause. No caller ever passes it:

DAO call chain (daos.py:147-157):

if get_recursive_dependencies:
    for record in records:
        record[recursive_dependency_type] = await self.recursive_get_dependencies(
            record_type, record[id_field],
            recursive_dependency_type,
            recursive_record_types=recursive_record_types,
            max_distance=recursive_max_distance,  # ← Always None
        )

Quick win: Pass max_distance=5 (or even 3) to cap recursion depth.
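To illustrate what capping depth buys, here is a minimal pure-Python sketch of the same traversal semantics (function name and edge representation are invented for illustration; the real logic lives in the SQL CTE above):

```python
from collections import deque

def bounded_dependents(edges, root, max_distance):
    """Breadth-first dependency walk with path-based cycle detection,
    mirroring the CTE's semantics when max_distance is set.
    edges: dict mapping blocker -> list of dependents (hypothetical shape)."""
    results = []
    queue = deque([(root, 0, (root,))])  # (node, distance, path so far)
    while queue:
        node, distance, path = queue.popleft()
        if distance >= max_distance:     # WHERE dependency_distance <= max_distance
            continue
        for dep in edges.get(node, ()):
            if dep in path:              # NOT ROW(...) = ANY(dependency_path)
                continue
            results.append((dep, distance + 1))
            queue.append((dep, distance + 1, path + (dep,)))
    return results
```

With an unbounded walk, a deep or densely connected graph forces the cycle check to scan an ever-growing path array at every level; the cap turns worst-case blowups into a bounded amount of work.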


3. Bottleneck #2: intel-requests-api

GET /intel-requests/{id}

  • File: backend/apis/intel-requests-api/intel_requests_api/intel_requests.py
  • Handler: RecordHandler.get() (line 2650)
  • Key call: self.incorporate_dependency_data('intel_requests', intel_request) (line 2672)

When notification-sender calls with ?get_recursive_dependencies=True&recursive_record_types=intel_requests:

  1. Fetches the intel_request record (fast, indexed PK lookup)
  2. Fetches direct dependencies via models.Dependency.get_dependencies() (fast, uses partial indexes)
  3. Runs the recursive CTE for each record (SLOW — 5-7s, unbounded depth)

POST /intel-requests/bulk

  • File: Same file
  • Handler: BulkHandler.post() (line 2133)
  • Key call: self.incorporate_dependency_data('intel_requests', filtered) (line 2157)

Same dependency incorporation logic. For N intel request IDs → N separate recursive CTE queries.
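The N separate queries could collapse into one by seeding the CTE's base case with an array of root IDs and carrying the root through the recursion. A sketch (column names follow the blocker-to-dependent direction; the real library templates both directions, so this is illustrative, not the exact refactor):

```sql
-- One query for many roots: group rows by root_id in Python afterwards.
WITH RECURSIVE dependents AS (
    SELECT d.dependent_type, d.dependent_id, d.blocker_type, d.blocker_id,
           d.blocker_id AS root_id,
           1 AS dependency_distance,
           ARRAY[ROW(d.dependent_type, d.dependent_id,
                     d.blocker_type, d.blocker_id)] AS dependency_path
      FROM dependencies.dependencies AS d
     WHERE d.blocker_id = ANY(%(root_dependency_ids)s)   -- array of roots
       AND d.blocker_type = %(root_dependency_type)s
       AND d.deleted IS NULL

    UNION ALL

    SELECT d.dependent_type, d.dependent_id, d.blocker_type, d.blocker_id,
           rd.root_id,
           rd.dependency_distance + 1,
           rd.dependency_path || ROW(d.dependent_type, d.dependent_id,
                                     d.blocker_type, d.blocker_id)
      FROM dependencies.dependencies AS d
      JOIN dependents AS rd
        ON d.blocker_type = rd.dependent_type
       AND d.blocker_id   = rd.dependent_id
     WHERE NOT ROW(d.dependent_type, d.dependent_id,
                   d.blocker_type, d.blocker_id) = ANY(rd.dependency_path)
       AND d.deleted IS NULL
)
SELECT * FROM dependents
```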


4. Bottleneck #3: incidents-api

GET /incidents/{id}

  • File: backend/apis/incidents-api/incidents_api/incidents.py
  • Handler: RecordHandler.get() (line ~1868)
  • Key call: incorporate_incident_dependency_data() (line ~595)

Uses the same shared gisual-data-mixins library:

RecordHandler.get()
  → incorporate_incident_dependency_data() [incidents.py:~595]
  → incorporate_dependency_data() [daos.py:41]
  → recursive_get_dependencies() [daos.py:182]
  → Dependency.get_recursive_dependencies() [models.py:174]  ← SAME recursive CTE

When notification-sender calls with ?get_recursive_dependencies=True&recursive_dependency_type=dependents&recursive_record_types=intel_requests:

  • Fetches incident with location (fast)
  • Runs recursive CTE to find all dependent intel_requests (SLOW — was 25-54s on March 5)

5. assets-api (Not a Bottleneck)

POST /assets/bulk

  • File: backend/apis/assets-api/assets_api/assets.py
  • Handler: BulkHandler.post() (line 504)
  • Query: Single JOIN query, no recursion
SELECT a.asset_id, a.asset_name, ..., t.asset_type_name,
       public.ST_AsGeoJSON(l.location) AS geojson, l.thoroughfare, ...
  FROM assets.assets AS a
  JOIN assets.asset_types AS t ON t.asset_type_id = a.asset_type_id AND t.deleted IS NULL
  JOIN locations.locations AS l ON l.location_id = a.location_id AND l.organization_id = a.organization_id AND l.deleted IS NULL
 WHERE a.asset_id = ANY(%(asset_id)s)
   AND a.deleted IS NULL

Assessment: Well-optimized. Single query, proper indexes (PK on asset_id, btree on organization_id, location_id). ~0.5s is acceptable. No action needed.


6. Notification-Sender Optimization Opportunities

6a. Parallelization (consumer-side)

Current: All API calls are sequential. Several are independent and could run concurrently.

| Opportunity | Files | Current Flow | Savings |
| --- | --- | --- | --- |
| get_organization() + generate_context_data() | app.py:337,375 | Sequential | ~50ms |
| reverse_geocoding + get_intel_requests() | intel.py:666,687 | Sequential | ~300ms |
| get_intel_requests() + get_assets() | intel.py:687,798 | Sequential | ~200-300ms |

Total parallelization savings: ~0.5-0.7s per message (modest but free).
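The gather pattern, sketched with stub coroutines (the names and sleep durations are stand-ins for the real API calls in generators/intel.py, whose signatures will differ):

```python
import asyncio

async def get_intel_requests(ids):
    await asyncio.sleep(0.3)          # simulates a ~300ms API call
    return {"intel_requests": ids}

async def get_assets(ids):
    await asyncio.sleep(0.3)
    return {"assets": ids}

async def build_context(intel_ids, asset_ids):
    # Independent calls run concurrently: total wall time is roughly the
    # slowest call, not the sum of both.
    intel, assets = await asyncio.gather(
        get_intel_requests(intel_ids),
        get_assets(asset_ids),
    )
    return {**intel, **assets}

result = asyncio.run(build_context([1, 2], [3]))
```

Since the calls are already coroutines, this is mostly a rearrangement: awaiting them together instead of one after the other.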

6b. 75-Pod Redundant Work

When 75 consumers process the same Charter batch:

  • All 75 independently call get_incident(), get_intel_requests(), get_assets() for the same data
  • Only 1 successfully sends each notification (idempotent lock)
  • 74 pods do redundant API work = 222+ wasted API calls per work unit

Cache mechanism (app.py:302): Per-work_unit_id Redis cache exists with fingerprint-based freshness check. But:

  • Cache is set AFTER generation (line 417-435), so the first pod pays full cost
  • Other pods racing on the same work_unit don't benefit until the first completes
  • No organization-level or incident-level cache to share across work units in the same batch

Robin's cache MR (#44) addresses this but is still open.

6c. is_throttled() — 0.35s Latency

Root cause: Distributed lock acquisition with 75 pods competing.

The GCRA rate limiter acquires a Redis lock per call. With 75 consumers hitting the same throttle key:

  • Lock contention causes retry loops
  • Each is_throttled() call averages 0.35s
  • Called twice per message (email + webhook channels) = 0.7s/msg

Fix: Replace lock-based GCRA with atomic Lua script — eliminates lock contention entirely.
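For reference, the GCRA update itself is a few lines of arithmetic, shown here in pure Python with a dict standing in for Redis. In production the same body would run as a single atomic Lua script, which is what removes the need for a lock (function and parameter names are illustrative):

```python
import time

def gcra_allow(state, key, period, burst, now=None):
    """Single-shot GCRA check: returns True if the request is allowed.
    period: seconds between permitted requests; burst: extra requests
    allowed beyond the steady rate. `state` maps key -> theoretical
    arrival time (TAT)."""
    now = time.monotonic() if now is None else now
    tat = state.get(key, now)            # TAT of the next conforming request
    allow_at = tat - burst * period      # earliest moment this request fits
    if now < allow_at:
        return False                     # throttled
    state[key] = max(tat, now) + period  # advance TAT for the next request
    return True
```

Because the read-compare-write happens inside one Lua script on the Redis server, 75 pods can call it concurrently with no retry loops.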


7. Current Incidents Feature — Potential Fast Path

What It Is

The incidents-api has a pre-computed cache of current incidents per organization (GET /incidents/current). This cache includes:

  • Incident data with location
  • Outages (with locations)
  • Assets
  • Intel requests (direct dependents)
  • Parent intel requests (2nd-level dependents)
  • Internet outages

This is almost exactly the data that generate_context_data() spends 5-13s assembling from sequential API calls.

Current State

  • Endpoint: GET /incidents/current?organization_id=X (incidents.py:1054, CurrentHandler)
  • Cache storage: Redis hash, msgpack-encoded (incidents.py:1388-1396)
  • Cache population: On every incident update via upsert_to_current_incidents_cache() (incidents.py:433)
  • Per-org toggle: organization.metadata.use_current_incidents_cache (incidents.py:402)
  • SQL feature flag: current-sql enables the optimized LATERAL query path (incidents.py:1383)

feat/optimize-query Branch

The feat/optimize-query branch (commit ccc1c51) rewrites the get_current_incidents() SQL query using LATERAL subqueries:

Before (flat LEFT JOINs):

  • Cross-products across outages, assets, dependencies, internet outages
  • Charter (org 9de5d801, ~3K incidents): 123K intermediate rows, 148MB disk sort, 566K buffer hits
  • 4,240ms

After (LATERAL subqueries):

  • Each relationship aggregated independently, 1 row per incident
  • 121K buffer hits, no temp I/O
  • 655ms (6.5x faster)
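Schematically, the rewrite looks like this (table and column names are abridged and illustrative, not the branch's exact SQL):

```sql
-- Flat LEFT JOINs multiply rows across every relationship (cross-product),
-- then need a large sort/de-dup. LATERAL aggregates each relationship
-- independently and emits exactly one row per incident.
SELECT i.incident_id, i.status,
       o.outages, a.assets
  FROM incidents.incidents AS i
  LEFT JOIN LATERAL (
       SELECT jsonb_agg(o) AS outages
         FROM outages.outages AS o
        WHERE o.incident_id = i.incident_id AND o.deleted IS NULL
  ) AS o ON TRUE
  LEFT JOIN LATERAL (
       SELECT jsonb_agg(a) AS assets
         FROM assets.assets AS a
        WHERE a.incident_id = i.incident_id AND a.deleted IS NULL
  ) AS a ON TRUE
 WHERE i.organization_id = %(organization_id)s
   AND i.deleted IS NULL
```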

How This Could Help notification-sender

Option A — Short-circuit via cache lookup: If the current incidents cache is enabled for Charter, notification-sender could look up the incident directly from Redis instead of calling GET /incidents/{id}?get_recursive_dependencies=True. The cache already contains dependencies, assets, and outage data.
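A minimal sketch of that lookup, with a plain dict standing in for the Redis hash and JSON standing in for msgpack so the example stays dependency-free (the function name is invented):

```python
import json

def get_current_incident(redis_hash, incident_id):
    """Option A sketch: read one incident straight from the pre-computed
    current-incidents hash instead of calling
    GET /incidents/{id}?get_recursive_dependencies=True.
    `redis_hash` stands in for an HGET against the per-org hash; the real
    cache is msgpack-encoded."""
    raw = redis_hash.get(incident_id)
    if raw is None:
        return None          # cache miss: fall back to the slow API path
    return json.loads(raw)

# Stubbed hash as it might look for one organization
cache = {"inc-1": json.dumps({"incident_id": "inc-1", "intel_requests": ["ir-9"]})}
```

The fallback on miss matters: the cache is only populated for orgs with the toggle enabled, so the recursive API path must remain as a backstop.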

Option B — New bulk lookup endpoint: Add a method to fetch a single incident from the current incidents cache by ID. notification-sender calls this instead of the expensive recursive API.

Option C — Enable feature + merge optimize-query: Enable use_current_incidents_cache for Charter + merge feat/optimize-query. The cache stays warm (populated on every incident update). Other consumers of the cache (e.g., the web app) benefit from the LATERAL optimization. notification-sender would need modification to use it.


8. Prioritized Recommendations

Immediate (hours, not days)

| # | Action | Impact | Effort |
| --- | --- | --- | --- |
| 1 | Pass max_distance=5 to recursive CTE in daos.py:155 | Caps unbounded recursion. Won't fix root cause but prevents worst-case blowup. | 1-line change in gisual-data-mixins |
| 2 | Merge feat/optimize-query into incidents-api main | 6.5x faster current incidents query for Charter | Branch ready, needs review |
| 3 | Enable use_current_incidents_cache for Charter | Pre-computes incident+dependency data on every update | Org metadata toggle |

Short-term (days)

| # | Action | Impact | Effort |
| --- | --- | --- | --- |
| 4 | Modify notification-sender to use current incidents cache | Eliminates get_incident() + recursive deps call entirely (~5-7s) | Medium — need to add cache lookup path in generators/intel.py:627 |
| 5 | Batch recursive CTE — single query for multiple root IDs instead of per-record Python loop | Eliminates N sequential DB round-trips in daos.py:147-157 | Medium — SQL refactor in models.py |
| 6 | Parallelize independent API calls in intel.py with asyncio.gather() | Save ~0.5-0.7s per message | Small — rearrange existing async calls |
| 7 | Replace is_throttled() lock with Lua script | Eliminate 0.35s lock contention (x2 channels = 0.7s) | Small — Redis Lua script |

Medium-term (weeks)

| # | Action | Impact | Effort |
| --- | --- | --- | --- |
| 8 | Enable pg_stat_statements on app database | Finally identify exact slow queries | DBA coordination |
| 9 | Robin's cache MR (#44) — org-level context caching | Eliminates 74/75 redundant API calls per work unit | Review + merge |
| 10 | Add 4-column composite index on dependencies table | Faster OR-condition lookups in direct deps query | DDL migration |
| 11 | Add queue depth / throughput alerts | Detect this before SLA breach | Alert rule config |

9. File Reference Index

Notification-Sender (Consumer)

| File | Lines | What |
| --- | --- | --- |
| backend/consumers/notification-sender/notification_sender/app.py | 223-419 | process() — main entry, orchestrates generation + delivery |
| backend/consumers/notification-sender/notification_sender/app.py | 302-305 | get_cached_payloads_for_work_unit() — Redis cache check |
| backend/consumers/notification-sender/notification_sender/app.py | 337-346 | generate_context_data() call |
| backend/consumers/notification-sender/notification_sender/app.py | 375 | get_organization() — could parallelize with above |
| backend/consumers/notification-sender/notification_sender/app.py | 846+ | send_payload() — idempotent delivery |
| backend/consumers/notification-sender/notification_sender/generators/intel.py | 627-857 | generate_context_data() — ALL internal API calls |
| backend/consumers/notification-sender/notification_sender/generators/intel.py | 640-663 | get_incident() with recursive deps |
| backend/consumers/notification-sender/notification_sender/generators/intel.py | 749-770 | get_intel_request() + bulk dependent fetch |
| backend/consumers/notification-sender/notification_sender/generators/intel.py | 798-808 | get_assets() bulk |
| backend/consumers/notification-sender/notification_sender/generators/intel.py | 877-900 | get_intel_requests() bulk method |
| backend/consumers/notification-sender/notification_sender/generators/intel.py | 902-918 | get_intel_request() single method |

Shared Library (gisual-data-mixins)

| File | Lines | What |
| --- | --- | --- |
| backend/libraries/gisual-data-mixins/gisual_data_mixins/daos.py | 41-223 | incorporate_dependency_data() — dependency resolution orchestrator |
| backend/libraries/gisual-data-mixins/gisual_data_mixins/daos.py | 147-157 | Per-record recursive loop (N+1 pattern) |
| backend/libraries/gisual-data-mixins/gisual_data_mixins/daos.py | 182+ | recursive_get_dependencies() |
| backend/libraries/gisual-data-mixins/gisual_data_mixins/models.py | 157-171 | Direct dependencies SQL query |
| backend/libraries/gisual-data-mixins/gisual_data_mixins/models.py | 174-278 | get_recursive_dependencies() — THE recursive CTE |
| backend/libraries/gisual-data-mixins/gisual_data_mixins/models.py | 214-217 | max_distance parameter (exists but never used) |

intel-requests-api

| File | Lines | What |
| --- | --- | --- |
| backend/apis/intel-requests-api/intel_requests_api/intel_requests.py | 2644-2685 | RecordHandler.get() — GET /intel-requests/{id} |
| backend/apis/intel-requests-api/intel_requests_api/intel_requests.py | 2672 | incorporate_dependency_data() call |
| backend/apis/intel-requests-api/intel_requests_api/intel_requests.py | 2129-2159 | BulkHandler.post() — POST /intel-requests/bulk |

incidents-api

| File | Lines | What |
| --- | --- | --- |
| backend/apis/incidents-api/incidents_api/incidents.py | ~1868 | RecordHandler.get() — GET /incidents/{id} |
| backend/apis/incidents-api/incidents_api/incidents.py | ~595 | incorporate_incident_dependency_data() |
| backend/apis/incidents-api/incidents_api/incidents.py | 1054-1396 | CurrentHandler — current incidents feature |
| backend/apis/incidents-api/incidents_api/incidents.py | 1061-1370 | get_current_incidents() — the LATERAL-optimized query |
| backend/apis/incidents-api/incidents_api/incidents.py | 1383 | current-sql feature flag check |
| backend/apis/incidents-api/incidents_api/incidents.py | 390-402 | uses_current_incidents_cache() per-org toggle |

assets-api

| File | Lines | What |
| --- | --- | --- |
| backend/apis/assets-api/assets_api/assets.py | 504-524 | BulkHandler.post() — single JOIN, no issues |

Database DDL

| File | What |
| --- | --- |
| ddl/app/schemata/dependencies/tables/00_dependencies.sql | Dependencies table + indexes |
| ddl/app/schemata/intel_requests/tables/00_intel_requests.sql | Intel requests table |
| ddl/app/schemata/assets/tables/01_assets.sql | Assets table + indexes |

Root Cause Update: AAA API Capacity (2026-03-06)

Appended to: Bottleneck Deep Dive


Root Cause Found: AAA API at 8 Replicas

The SQL queries were a red herring. Every query in the notification-sender hot path executes in <1ms (verified via EXPLAIN ANALYZE on prod). The 9-20s API response times were caused by a cascading bottleneck in the AAA API.

The Cascade

notification-sender (75 pods)
  -> intel-requests-api (60 pods) — request_time: 9-20s
    -> aaa-api (8 pods, some crash-looping) — request_time: 7-19s
      -> Azure Flex aaa DB — 2 simple SQL queries (sub-ms)

Why AAA Was Slow

  1. Only 8 replicas serving the entire platform
  2. Some pods crash-looping due to missing APP_DB_URL in Kubernetes secret
  3. Intel-searcher's user-token requests trigger UserPermissionsHandler on AAA for every request to intel-requests-api
  4. Same permissions checked repeatedly — user 944c9c57 with org 3f8bea44 queried 4 times in the same second across 4 AAA pods, each taking 7-19s
  5. No permissions caching — identical user+org+resource lookups hit AAA's DB every time
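A read-through permissions cache in intel-requests-api (the P0 item under Remaining Issues) could look like this sketch. Class and parameter names are invented; production would use Redis with a short TTL rather than an in-process dict:

```python
import time

class PermissionsCache:
    """Read-through cache in front of the AAA permissions lookup.
    A fresh entry short-circuits the 7-19s AAA round-trip for repeated
    identical (user, org) lookups."""

    def __init__(self, fetch, ttl=30.0, clock=time.monotonic):
        self.fetch = fetch          # e.g. the real AAA HTTP call
        self.ttl = ttl
        self.clock = clock
        self._store = {}            # (user_id, org_id) -> (expires_at, value)

    def get(self, user_id, org_id):
        key = (user_id, org_id)
        hit = self._store.get(key)
        if hit and hit[0] > self.clock():
            return hit[1]           # fresh: skip the AAA round-trip
        value = self.fetch(user_id, org_id)
        self._store[key] = (self.clock() + self.ttl, value)
        return value
```

A short TTL (tens of seconds) bounds the window in which a revoked permission is still honored, while collapsing the "same user queried 4 times in the same second" pattern to one AAA call.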

Why This Affected Notification-Sender

Notification-sender uses service tokens, which bypass the AAA permissions check (allow_service_access_tokens at aaa_lib/tokens.py:1562). But notification-sender's requests to intel-requests-api queue behind intel-searcher's requests on the same Tornado event loop. Intel-searcher's requests are blocked waiting for AAA, holding event loop slots and DB connections.

Evidence

Intel-requests-api request_time (from notification-sender):

| Endpoint | request_time |
| --- | --- |
| GET /intel-requests/{id}?recursive_deps=true | 9.3-20.1s |
| POST /intel-requests/bulk | 0.85-1.75s |

AAA API request_time (from intel-requests-api):

| Endpoint | request_time |
| --- | --- |
| GET /internal/permissions/intel_requests/users/{id} | 7.1-19.3s |

AAA UserPermissionsHandler code (aaa-api/aaa_api/permissions.py:792):

  • Just 2 SQL queries: User.get(user_id) + Permission.get_user_permission_assignments()
  • No HTTP calls to other services
  • The 7-19s is entirely DB connection pool saturation

Resolution

  1. Fixed AAA crash-loop (deleted/recreated secret with APP_DB_URL)
  2. Scaled AAA from 8 to 20 replicas
  3. Queue drained from ~7,885 to 2 messages within ~2 hours

Queue Trend (5-min resolution)

07:30  3,631   Queue building
08:30  5
09:00  568
10:00  2,482
11:00  6,273
11:30  8,569   PEAK
12:00  6,518
13:00  7,885   Second peak
14:00  3,881   AAA scaled to 20 — inflection point
14:30  1,965
15:00  39      DRAINED
15:10  2       CLEAR

Remaining Issues

| Priority | Issue | Status |
| --- | --- | --- |
| P0 | AAA needs HPA or higher minimum replicas | Open |
| P0 | AAA permissions should be cached in intel-requests-api (Redis, short TTL) | Open |
| P1 | MR #46 — retry caching for 25% of messages | Ready for review |
| P1 | Context-level read-through cache for first-delivery thundering herd (75% case) | Not started |
| P1 | incidents-api feat/optimize-query — LATERAL join optimization | Ready to merge |
| P2 | ServiceNow 500s on goco/gocotest callback API | Notify Charter |
| P2 | March 5 P0s still open (pg_stat_statements, queue depth alerts) | Open |