Context: Production SLA breach. notification-sender queue peaked at 7,617 msgs. Per-message processing: median 7.8s, max 14s. Charter webhook delivery is <10ms — the bottleneck is entirely internal API calls.
- Hot Path Summary
- Bottleneck #1: Recursive Dependencies CTE (shared library)
- Bottleneck #2: intel-requests-api
- Bottleneck #3: incidents-api
- assets-api (not a bottleneck)
- Notification-Sender Optimization Opportunities
- Current Incidents Feature — Potential Fast Path
- Prioritized Recommendations
- File Reference Index
## Hot Path Summary

Per-operation timing from production logs (08:26 UTC, March 6):

| Operation | Duration | % of Total | API Called |
|---|---|---|---|
| `get_intel_request` (recursive deps) | 5-7s | 50-60% | intel-requests-api |
| `get_intel_requests` (bulk) | 0.2-5.7s | 15-30% | intel-requests-api |
| `generate_context_data` (total) | 1.7-2.0s | 15-20% | Multiple APIs |
| `is_throttled` | 0.35s | ~3% | Redis (lock contention) |
| `idempotent_session_lock` | 0.1s | ~1% | Redis |
| `get_outbound_token` | 0.09s | ~1% | AAA API |
| `get_organization` | 0.05s | <1% | AAA API |
| `generate` (Jinja template) | 0.01s | <1% | Local |
| `send` (HTTP POST to Charter) | <0.01s | <1% | External webhook |
The webhook POST to Charter is essentially instant. 100% of the latency is internal.
## Bottleneck #1: Recursive Dependencies CTE (shared library)

This is the single biggest finding. Both intel-requests-api and incidents-api use the same shared library for recursive dependency resolution. The slow query lives in `backend/libraries/gisual-data-mixins/gisual_data_mixins/models.py` (lines 219-272):
```sql
WITH RECURSIVE dependents (
    dependent_type, dependent_id, blocker_type, blocker_id,
    created, deleted, dependency_distance, dependency_path
) AS (
    -- Base case: direct dependencies
    SELECT d.dependent_type, d.dependent_id, d.blocker_type, d.blocker_id,
           d.created, d.deleted,
           1 AS dependency_distance,
           ARRAY[ROW(d.dependent_type, d.dependent_id,
                     d.blocker_type, d.blocker_id)] AS dependency_path
    FROM dependencies.dependencies AS d
    WHERE {upstream_id_field} = %(root_dependency_id)s
      AND {upstream_type_field} = %(root_dependency_type)s
      AND d.deleted IS NULL

    UNION ALL

    -- Recursive case: traverse the graph
    SELECT d.dependent_type, d.dependent_id, d.blocker_type, d.blocker_id,
           d.created, d.deleted,
           rd.dependency_distance + 1 AS dependency_distance,
           rd.dependency_path || ARRAY[ROW(...)] AS dependency_path
    FROM dependencies.dependencies AS d
    JOIN dependents AS rd
      ON (d.{upstream_id} = rd.{downstream_id}
          AND d.{upstream_type} = rd.{downstream_type})
    WHERE NOT ROW(d.dependent_type, d.dependent_id,
                  d.blocker_type, d.blocker_id) = ANY(rd.dependency_path)
      AND d.deleted IS NULL
)
SELECT * FROM dependents
```

Problems:

- **Unbounded recursion:** a `max_distance` parameter exists (line 214) but is never passed by any caller. It defaults to `None`, meaning unlimited depth.
- **Array-based cycle detection:** the `dependency_path` array grows with each level, and the `NOT ROW(...) = ANY(path)` check is O(n) per row per level.
- **Per-record execution:** `daos.py:147-157` runs this CTE once per record in a Python loop, so N intel requests mean N separate recursive queries.
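If the traversal ever moved to the application side, cycle detection would not need a growing array at all: a visited set gives O(1) membership checks and a natural depth cap. A minimal sketch under that assumption (the edge map, function name, and shapes below are hypothetical, not the library's API):

```python
from collections import deque

def resolve_dependents(edges, root, max_depth=None):
    """Breadth-first walk over (blocker -> dependents) edges.

    `edges` maps a (type, id) node to its direct dependents. The `visited`
    set replaces the SQL array-based cycle check: membership is O(1)
    instead of O(depth) per row. `max_depth` bounds the traversal the way
    the unused `max_distance` parameter would.
    """
    visited = {root}
    result = []
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue
        for dep in edges.get(node, []):
            if dep in visited:  # cycle or diamond: already seen, skip
                continue
            visited.add(dep)
            result.append((dep, depth + 1))
            queue.append((dep, depth + 1))
    return result
```

For example, a three-node graph with a back-edge to the root terminates immediately instead of relying on a per-row path scan.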
Table: `dependencies.dependencies` (app database, `dependencies` schema)
DDL: `ddl/app/schemata/dependencies/tables/00_dependencies.sql`

Current indexes:

```sql
-- Single-column indexes
CREATE INDEX dependencies_dependent_type_idx ON dependencies.dependencies USING btree(dependent_type);
CREATE INDEX dependencies_blocker_type_idx ON dependencies.dependencies USING btree(blocker_type);
CREATE INDEX dependencies_dependent_id_idx ON dependencies.dependencies USING btree(dependent_id);
CREATE INDEX dependencies_blocker_id_idx ON dependencies.dependencies USING btree(blocker_id);

-- Partial composite indexes (good, but see issues below)
CREATE INDEX dep_blocker_live_idx ON dependencies.dependencies(blocker_type, blocker_id) WHERE deleted IS NULL;
CREATE INDEX dep_dependent_live_idx ON dependencies.dependencies(dependent_type, dependent_id) WHERE deleted IS NULL;
```

Index issues:

- The direct-dependencies query uses an `OR` condition across both the blocker and dependent sides, a pattern the planner may struggle to serve efficiently from these indexes.
- There is no 4-column composite index for the full `(blocker_type, blocker_id, dependent_type, dependent_id)` lookup pattern.
The `max_distance` parameter is implemented in `models.py:214-217` and generates a `WHERE dependency_distance <= %(max_distance)s` clause. It's just never passed.

DAO call chain (`daos.py:147-157`):

```python
if get_recursive_dependencies:
    for record in records:
        record[recursive_dependency_type] = await self.recursive_get_dependencies(
            record_type, record[id_field],
            recursive_dependency_type,
            recursive_record_types=recursive_record_types,
            max_distance=recursive_max_distance,  # <- always None
        )
```

Quick win: pass `max_distance=5` (or even 3) to cap recursion depth.
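The effect of the cap can be demonstrated with a standalone depth-limited recursive CTE. This sketch uses SQLite for portability (the production query is PostgreSQL, and the table and column names below are illustrative, not the real schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dependencies (blocker_id INTEGER, dependent_id INTEGER);
    -- A linear dependency chain: 1 -> 2 -> 3 -> ... -> 10
    INSERT INTO dependencies
    WITH RECURSIVE seq(value) AS (
        SELECT 1 UNION ALL SELECT value + 1 FROM seq WHERE value < 9
    )
    SELECT value, value + 1 FROM seq;
""")

def get_dependents(root_id, max_distance=None):
    """Recursive walk with an optional depth cap -- the equivalent of
    passing max_distance instead of leaving it None."""
    rows = conn.execute("""
        WITH RECURSIVE dependents(dependent_id, dependency_distance) AS (
            SELECT d.dependent_id, 1
            FROM dependencies d WHERE d.blocker_id = ?
            UNION ALL
            SELECT d.dependent_id, rd.dependency_distance + 1
            FROM dependencies d
            JOIN dependents rd ON d.blocker_id = rd.dependent_id
            -- NULL cap falls through to an effectively unbounded limit
            WHERE rd.dependency_distance < COALESCE(?, 1e9)
        )
        SELECT dependent_id FROM dependents ORDER BY dependent_id
    """, (root_id, max_distance)).fetchall()
    return [r[0] for r in rows]
```

With no cap, the walk visits all nine downstream nodes; with `max_distance=3`, it stops after three levels, bounding both row count and path-array growth on pathological graphs.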
## Bottleneck #2: intel-requests-api

- File: `backend/apis/intel-requests-api/intel_requests_api/intel_requests.py`
- Handler: `RecordHandler.get()` (line 2650)
- Key call: `self.incorporate_dependency_data('intel_requests', intel_request)` (line 2672)

When notification-sender calls with `?get_recursive_dependencies=True&recursive_record_types=intel_requests`, the handler:

1. Fetches the intel_request record (fast, indexed PK lookup)
2. Fetches direct dependencies via `models.Dependency.get_dependencies()` (fast, uses the partial indexes)
3. Runs the recursive CTE for each record (SLOW: 5-7s, unbounded depth)
- File: same file
- Handler: `BulkHandler.post()` (line 2133)
- Key call: `self.incorporate_dependency_data('intel_requests', filtered)` (line 2157)

Same dependency-incorporation logic: N intel request IDs produce N separate recursive CTE queries.
## Bottleneck #3: incidents-api

- File: `backend/apis/incidents-api/incidents_api/incidents.py`
- Handler: `RecordHandler.get()` (line ~1868)
- Key call: `incorporate_incident_dependency_data()` (line ~595)

Uses the same shared gisual-data-mixins library:

```
RecordHandler.get()
  → incorporate_incident_dependency_data()        [incidents.py:~595]
    → incorporate_dependency_data()               [daos.py:41]
      → recursive_get_dependencies()              [daos.py:182]
        → Dependency.get_recursive_dependencies() [models.py:174]  ← SAME recursive CTE
```

When notification-sender calls with `?get_recursive_dependencies=True&recursive_dependency_type=dependents&recursive_record_types=intel_requests`, the handler:

1. Fetches the incident with location (fast)
2. Runs the recursive CTE to find all dependent intel_requests (SLOW: was 25-54s on March 5)
## assets-api (not a bottleneck)

- File: `backend/apis/assets-api/assets_api/assets.py`
- Handler: `BulkHandler.post()` (line 504)
- Query: single JOIN query, no recursion

```sql
SELECT a.asset_id, a.asset_name, ..., t.asset_type_name,
       public.ST_AsGeoJSON(l.location) AS geojson, l.thoroughfare, ...
FROM assets.assets AS a
JOIN assets.asset_types AS t ON t.asset_type_id = a.asset_type_id AND t.deleted IS NULL
JOIN locations.locations AS l ON l.location_id = a.location_id AND l.organization_id = a.organization_id AND l.deleted IS NULL
WHERE a.asset_id = ANY(%(asset_id)s)
  AND a.deleted IS NULL
```

Assessment: well-optimized. A single query with proper indexes (PK on asset_id, btree on organization_id and location_id); ~0.5s is acceptable. No action needed.
## Notification-Sender Optimization Opportunities

Currently all API calls are sequential. Several are independent and could run concurrently:

| Opportunity | Files | Current Flow | Savings |
|---|---|---|---|
| `get_organization()` + `generate_context_data()` | `app.py:337,375` | Sequential | ~50ms |
| reverse_geocoding + `get_intel_requests()` | `intel.py:666,687` | Sequential | ~300ms |
| `get_intel_requests()` + `get_assets()` | `intel.py:687,798` | Sequential | ~200-300ms |

Total parallelization savings: ~0.5-0.7s per message (modest but free).
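The shape of the change is standard `asyncio.gather()`. A minimal sketch, assuming the calls are already coroutines (the function and parameter names follow the table above, but the signatures are illustrative):

```python
import asyncio
import time

async def build_context(incident_id, org_id, *, api):
    # The three calls are independent, so run them concurrently:
    # total latency becomes the max of the three, not the sum.
    organization, intel_requests, assets = await asyncio.gather(
        api.get_organization(org_id),
        api.get_intel_requests(incident_id),
        api.get_assets(incident_id),
    )
    return {"organization": organization,
            "intel_requests": intel_requests,
            "assets": assets}

# Demo with stub endpoints that each take ~50ms.
class StubAPI:
    async def get_organization(self, _):
        await asyncio.sleep(0.05)
        return {"name": "Charter"}

    async def get_intel_requests(self, _):
        await asyncio.sleep(0.05)
        return [{"intel_request_id": 1}]

    async def get_assets(self, _):
        await asyncio.sleep(0.05)
        return [{"asset_id": 9}]

start = time.monotonic()
context = asyncio.run(build_context("inc-1", "org-1", api=StubAPI()))
elapsed = time.monotonic() - start  # ~0.05s concurrent, vs ~0.15s sequential
```

Note that `gather()` preserves argument order in its results, so the unpacking above is deterministic.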
When 75 consumers process the same Charter batch:

- All 75 independently call `get_incident()`, `get_intel_requests()`, and `get_assets()` for the same data
- Only 1 successfully sends each notification (idempotent lock)
- The other 74 pods do redundant API work: 222+ wasted API calls per work unit

Cache mechanism (`app.py:302`): a per-work_unit_id Redis cache exists with a fingerprint-based freshness check. But:

- The cache is set AFTER generation (lines 417-435), so the first pod pays full cost
- Other pods racing on the same work_unit don't benefit until the first completes
- There is no organization-level or incident-level cache to share across work units in the same batch

Robin's cache MR (#44) addresses this but is still open.
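The redundant-work problem is a classic thundering-herd shape: many consumers racing for the same key should share one in-flight generation rather than each recomputing. A single-process sketch of that idea (hypothetical; MR #44's actual design may differ, and a cross-pod version would coordinate through Redis rather than an in-memory dict):

```python
import asyncio

class SharedContextCache:
    """Consumers racing on the same key await one in-flight generation
    instead of each recomputing it."""

    def __init__(self):
        self._futures = {}
        self.generations = 0  # how many times generate() actually ran

    async def get(self, key, generate):
        if key not in self._futures:
            # First caller kicks off generation; later callers await the
            # same future, so the expensive work happens once per key.
            self._futures[key] = asyncio.ensure_future(self._generate(generate))
        return await self._futures[key]

    async def _generate(self, generate):
        self.generations += 1
        return await generate()

async def _demo():
    cache = SharedContextCache()

    async def generate_context():
        await asyncio.sleep(0.01)  # stand-in for the expensive API fan-out
        return {"incident": "inc-1"}

    # 75 consumers race on the same work unit.
    results = await asyncio.gather(
        *[cache.get("work-unit-1", generate_context) for _ in range(75)])
    return cache.generations, results

generations, results = asyncio.run(_demo())
```

All 75 callers get the same payload, but generation runs exactly once; the remaining 74 API fan-outs are avoided entirely.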
Root cause: distributed lock acquisition with 75 pods competing.

The GCRA rate limiter acquires a Redis lock per call. With 75 consumers hitting the same throttle key:

- Lock contention causes retry loops
- Each `is_throttled()` call averages 0.35s
- It is called twice per message (email + webhook channels), costing 0.7s/msg

Fix: replace the lock-based GCRA with an atomic Lua script, eliminating the lock contention entirely.
## Current Incidents Feature — Potential Fast Path

The incidents-api has a pre-computed cache of current incidents per organization (`GET /incidents/current`). This cache includes:
- Incident data with location
- Outages (with locations)
- Assets
- Intel requests (direct dependents)
- Parent intel requests (2nd-level dependents)
- Internet outages
This is almost exactly the data that `generate_context_data()` spends 5-13s assembling from sequential API calls.
- Endpoint: `GET /incidents/current?organization_id=X` (`incidents.py:1054`, `CurrentHandler`)
- Cache storage: Redis hash, msgpack-encoded (`incidents.py:1388-1396`)
- Cache population: on every incident update via `upsert_to_current_incidents_cache()` (`incidents.py:433`)
- Per-org toggle: `organization.metadata.use_current_incidents_cache` (`incidents.py:402`)
- SQL feature flag: `current-sql` enables the optimized LATERAL query path (`incidents.py:1383`)
The `feat/optimize-query` branch (commit ccc1c51) rewrites the `get_current_incidents()` SQL query using LATERAL subqueries.

Before (flat LEFT JOINs):

- Cross-products across outages, assets, dependencies, and internet outages
- Charter (org `9de5d801`, ~3K incidents): 123K intermediate rows, 148MB disk sort, 566K buffer hits
- 4,240ms

After (LATERAL subqueries):

- Each relationship aggregated independently, 1 row per incident
- 121K buffer hits, no temp I/O
- 655ms (6.5x faster)
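The row explosion is easy to reproduce: with flat LEFT JOINs, an incident with M outages and N assets produces M×N intermediate rows, while aggregating each relationship independently yields one row per incident. A sketch with an illustrative schema (SQLite here, using correlated subqueries in place of PostgreSQL's LATERAL, which SQLite lacks; same aggregate-per-relationship idea):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE incidents (incident_id INTEGER PRIMARY KEY);
    CREATE TABLE outages (incident_id INTEGER, outage_id INTEGER);
    CREATE TABLE assets  (incident_id INTEGER, asset_id INTEGER);
    INSERT INTO incidents VALUES (1);
    -- One incident with 40 outages and 30 assets.
    INSERT INTO outages
    WITH RECURSIVE s(value) AS
        (SELECT 1 UNION ALL SELECT value + 1 FROM s WHERE value < 40)
    SELECT 1, value FROM s;
    INSERT INTO assets
    WITH RECURSIVE s(value) AS
        (SELECT 1 UNION ALL SELECT value + 1 FROM s WHERE value < 30)
    SELECT 1, value FROM s;
""")

# Flat joins: the two one-to-many relationships cross-multiply.
flat_rows = conn.execute("""
    SELECT i.incident_id, o.outage_id, a.asset_id
    FROM incidents i
    LEFT JOIN outages o ON o.incident_id = i.incident_id
    LEFT JOIN assets  a ON a.incident_id = i.incident_id
""").fetchall()

# Each relationship aggregated independently: one row per incident.
agg_rows = conn.execute("""
    SELECT i.incident_id,
           (SELECT COUNT(*) FROM outages o WHERE o.incident_id = i.incident_id),
           (SELECT COUNT(*) FROM assets  a WHERE a.incident_id = i.incident_id)
    FROM incidents i
""").fetchall()
```

One incident already yields 1,200 flat rows versus 1 aggregated row; at Charter's ~3K incidents this is the same effect as the 123K-row intermediate result the branch eliminates.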
- **Option A — short-circuit via cache lookup:** if the current incidents cache is enabled for Charter, notification-sender could look the incident up directly in Redis instead of calling `GET /incidents/{id}?get_recursive_dependencies=True`. The cache already contains dependencies, assets, and outage data.
- **Option B — new bulk lookup endpoint:** add a method to fetch a single incident from the current incidents cache by ID; notification-sender calls this instead of the expensive recursive API.
- **Option C — enable the feature and merge optimize-query:** enable `use_current_incidents_cache` for Charter and merge `feat/optimize-query`. The cache stays warm (populated on every incident update), and other consumers of the cache (e.g., the web app) benefit from the LATERAL optimization. notification-sender would need modification to use it.
## Prioritized Recommendations

| # | Action | Impact | Effort |
|---|---|---|---|
| 1 | Pass `max_distance=5` to the recursive CTE in `daos.py:155` | Caps unbounded recursion; won't fix the root cause but prevents worst-case blowup | 1-line change in gisual-data-mixins |
| 2 | Merge `feat/optimize-query` into incidents-api main | 6.5x faster current incidents query for Charter | Branch ready, needs review |
| 3 | Enable `use_current_incidents_cache` for Charter | Pre-computes incident+dependency data on every update | Org metadata toggle |

| # | Action | Impact | Effort |
|---|---|---|---|
| 4 | Modify notification-sender to use the current incidents cache | Eliminates the `get_incident()` + recursive deps call entirely (~5-7s) | Medium: add a cache lookup path in `generators/intel.py:627` |
| 5 | Batch the recursive CTE: one query for multiple root IDs instead of the per-record Python loop | Eliminates N sequential DB round-trips in `daos.py:147-157` | Medium: SQL refactor in `models.py` |
| 6 | Parallelize independent API calls in `intel.py` with `asyncio.gather()` | Saves ~0.5-0.7s per message | Small: rearrange existing async calls |
| 7 | Replace the `is_throttled()` lock with a Lua script | Eliminates 0.35s lock contention (x2 channels = 0.7s) | Small: Redis Lua script |

| # | Action | Impact | Effort |
|---|---|---|---|
| 8 | Enable `pg_stat_statements` on the app database | Finally identify the exact slow queries | DBA coordination |
| 9 | Robin's cache MR (#44): org-level context caching | Eliminates 74/75 redundant API calls per work unit | Review + merge |
| 10 | Add a 4-column composite index on the dependencies table | Faster OR-condition lookups in the direct-deps query | DDL migration |
| 11 | Add queue depth / throughput alerts | Detect this before an SLA breach | Alert rule config |
## File Reference Index

| File | Lines | What |
|---|---|---|
| `backend/consumers/notification-sender/notification_sender/app.py` | 223-419 | `process()` — main entry, orchestrates generation + delivery |
| `backend/consumers/notification-sender/notification_sender/app.py` | 302-305 | `get_cached_payloads_for_work_unit()` — Redis cache check |
| `backend/consumers/notification-sender/notification_sender/app.py` | 337-346 | `generate_context_data()` call |
| `backend/consumers/notification-sender/notification_sender/app.py` | 375 | `get_organization()` — could parallelize with above |
| `backend/consumers/notification-sender/notification_sender/app.py` | 846+ | `send_payload()` — idempotent delivery |
| `backend/consumers/notification-sender/notification_sender/generators/intel.py` | 627-857 | `generate_context_data()` — ALL internal API calls |
| `backend/consumers/notification-sender/notification_sender/generators/intel.py` | 640-663 | `get_incident()` with recursive deps |
| `backend/consumers/notification-sender/notification_sender/generators/intel.py` | 749-770 | `get_intel_request()` + bulk dependent fetch |
| `backend/consumers/notification-sender/notification_sender/generators/intel.py` | 798-808 | `get_assets()` bulk |
| `backend/consumers/notification-sender/notification_sender/generators/intel.py` | 877-900 | `get_intel_requests()` bulk method |
| `backend/consumers/notification-sender/notification_sender/generators/intel.py` | 902-918 | `get_intel_request()` single method |

| File | Lines | What |
|---|---|---|
| `backend/libraries/gisual-data-mixins/gisual_data_mixins/daos.py` | 41-223 | `incorporate_dependency_data()` — dependency resolution orchestrator |
| `backend/libraries/gisual-data-mixins/gisual_data_mixins/daos.py` | 147-157 | Per-record recursive loop (N+1 pattern) |
| `backend/libraries/gisual-data-mixins/gisual_data_mixins/daos.py` | 182+ | `recursive_get_dependencies()` |
| `backend/libraries/gisual-data-mixins/gisual_data_mixins/models.py` | 157-171 | Direct dependencies SQL query |
| `backend/libraries/gisual-data-mixins/gisual_data_mixins/models.py` | 174-278 | `get_recursive_dependencies()` — THE recursive CTE |
| `backend/libraries/gisual-data-mixins/gisual_data_mixins/models.py` | 214-217 | `max_distance` parameter (exists but never used) |

| File | Lines | What |
|---|---|---|
| `backend/apis/intel-requests-api/intel_requests_api/intel_requests.py` | 2644-2685 | `RecordHandler.get()` — GET /intel-requests/{id} |
| `backend/apis/intel-requests-api/intel_requests_api/intel_requests.py` | 2672 | `incorporate_dependency_data()` call |
| `backend/apis/intel-requests-api/intel_requests_api/intel_requests.py` | 2129-2159 | `BulkHandler.post()` — POST /intel-requests/bulk |

| File | Lines | What |
|---|---|---|
| `backend/apis/incidents-api/incidents_api/incidents.py` | ~1868 | `RecordHandler.get()` — GET /incidents/{id} |
| `backend/apis/incidents-api/incidents_api/incidents.py` | ~595 | `incorporate_incident_dependency_data()` |
| `backend/apis/incidents-api/incidents_api/incidents.py` | 1054-1396 | `CurrentHandler` — current incidents feature |
| `backend/apis/incidents-api/incidents_api/incidents.py` | 1061-1370 | `get_current_incidents()` — the LATERAL-optimized query |
| `backend/apis/incidents-api/incidents_api/incidents.py` | 1383 | `current-sql` feature flag check |
| `backend/apis/incidents-api/incidents_api/incidents.py` | 390-402 | `uses_current_incidents_cache()` per-org toggle |

| File | Lines | What |
|---|---|---|
| `backend/apis/assets-api/assets_api/assets.py` | 504-524 | `BulkHandler.post()` — single JOIN, no issues |

| File | What |
|---|---|
| `ddl/app/schemata/dependencies/tables/00_dependencies.sql` | Dependencies table + indexes |
| `ddl/app/schemata/intel_requests/tables/00_intel_requests.sql` | Intel requests table |
| `ddl/app/schemata/assets/tables/01_assets.sql` | Assets table + indexes |