
Notification-Sender Bottleneck Deep Dive — 2026-03-06

Context: Production SLA breach. notification-sender queue peaked at 7,617 msgs. Per-message processing: median 7.8s, max 14s. Charter webhook delivery is <10ms — the bottleneck is entirely internal API calls.


Table of Contents

  1. Hot Path Summary
  2. Bottleneck #1: Recursive Dependencies CTE (shared library)
  3. Bottleneck #2: intel-requests-api
  4. Bottleneck #3: incidents-api
  5. assets-api (not a bottleneck)
  6. Notification-Sender Optimization Opportunities
  7. Current Incidents Feature — Potential Fast Path
  8. Prioritized Recommendations
  9. File Reference Index

1. Hot Path Summary

Per-operation timing from production logs (08:26 UTC, March 6):

| Operation | Duration | % of Total | API Called |
| --- | --- | --- | --- |
| get_intel_request (recursive deps) | 5-7s | 50-60% | intel-requests-api |
| get_intel_requests (bulk) | 0.2-5.7s | 15-30% | intel-requests-api |
| generate_context_data (total) | 1.7-2.0s | 15-20% | Multiple APIs |
| is_throttled | 0.35s | ~3% | Redis (lock contention) |
| idempotent_session_lock | 0.1s | ~1% | Redis |
| get_outbound_token | 0.09s | ~1% | AAA API |
| get_organization | 0.05s | <1% | AAA API |
| generate (Jinja template) | 0.01s | <1% | Local |
| send (HTTP POST to Charter) | <0.01s | <1% | External webhook |

The webhook POST to Charter is essentially instant; effectively all of the latency is internal.


2. Bottleneck #1: Recursive Dependencies CTE

This is the single biggest finding. Both intel-requests-api and incidents-api use the same shared library for recursive dependency resolution. The slow query lives in:

The Recursive CTE Query

File: backend/libraries/gisual-data-mixins/gisual_data_mixins/models.py (lines 219-272)

WITH RECURSIVE dependents (
    dependent_type, dependent_id, blocker_type, blocker_id,
    created, deleted, dependency_distance, dependency_path
) AS (
    -- Base case: direct dependencies
    SELECT d.dependent_type, d.dependent_id, d.blocker_type, d.blocker_id,
           d.created, d.deleted,
           1 AS dependency_distance,
           ARRAY[ROW(d.dependent_type, d.dependent_id,
                     d.blocker_type, d.blocker_id)] AS dependency_path
      FROM dependencies.dependencies AS d
     WHERE {upstream_id_field} = %(root_dependency_id)s
       AND {upstream_type_field} = %(root_dependency_type)s
       AND d.deleted IS NULL

    UNION ALL

    -- Recursive case: traverse the graph
    SELECT d.dependent_type, d.dependent_id, d.blocker_type, d.blocker_id,
           d.created, d.deleted,
           rd.dependency_distance + 1 AS dependency_distance,
           rd.dependency_path || ARRAY[ROW(...)] AS dependency_path
      FROM dependencies.dependencies AS d
      JOIN dependents AS rd
           ON (d.{upstream_id} = rd.{downstream_id}
               AND d.{upstream_type} = rd.{downstream_type})
     WHERE NOT ROW(d.dependent_type, d.dependent_id,
                   d.blocker_type, d.blocker_id) = ANY(rd.dependency_path)
       AND d.deleted IS NULL
)
SELECT * FROM dependents

Why It's Slow

  1. Unbounded recursion: max_distance parameter exists (line 214) but is never passed by any caller. Defaults to None = unlimited depth.
  2. Array-based cycle detection: dependency_path array grows with each level. The NOT ROW(...) = ANY(path) check is O(n) per row per level.
  3. Per-record execution: daos.py:147-157 runs this CTE once per record in a Python loop — N intel requests = N separate recursive queries.

Database Schema & Indexes

Table: dependencies.dependencies (app database, dependencies schema) DDL: ddl/app/schemata/dependencies/tables/00_dependencies.sql

Current indexes:

-- Single-column indexes
CREATE INDEX dependencies_dependent_type_idx ON dependencies.dependencies USING btree(dependent_type);
CREATE INDEX dependencies_blocker_type_idx ON dependencies.dependencies USING btree(blocker_type);
CREATE INDEX dependencies_dependent_id_idx ON dependencies.dependencies USING btree(dependent_id);
CREATE INDEX dependencies_blocker_id_idx ON dependencies.dependencies USING btree(blocker_id);

-- Partial composite indexes (good, but see issues below)
CREATE INDEX dep_blocker_live_idx ON dependencies.dependencies(blocker_type, blocker_id) WHERE deleted IS NULL;
CREATE INDEX dep_dependent_live_idx ON dependencies.dependencies(dependent_type, dependent_id) WHERE deleted IS NULL;

Index issues:

  • The direct dependencies query uses an OR condition across both blocker and dependent sides — the planner may struggle to use indexes efficiently for this pattern.
  • Missing 4-column composite index for the full (blocker_type, blocker_id, dependent_type, dependent_id) lookup pattern.
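A sketch of the missing index, following the existing partial-index style. The column order is an assumption and should be validated against EXPLAIN ANALYZE before shipping:

```sql
-- Candidate: covers the recursive join's (blocker_type, blocker_id) probe
-- and lets the planner read the dependent columns straight from the index.
-- CONCURRENTLY avoids blocking writes during creation on a hot table.
CREATE INDEX CONCURRENTLY dep_blocker_dependent_live_idx
    ON dependencies.dependencies (blocker_type, blocker_id, dependent_type, dependent_id)
 WHERE deleted IS NULL;
```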

The Fix Is Already Partially Built

The max_distance parameter is implemented in models.py:214-217 and generates a WHERE dependency_distance <= %(max_distance)s clause. No caller ever passes it:

DAO call chain (daos.py:147-157):

if get_recursive_dependencies:
    for record in records:
        record[recursive_dependency_type] = await self.recursive_get_dependencies(
            record_type, record[id_field],
            recursive_dependency_type,
            recursive_record_types=recursive_record_types,
            max_distance=recursive_max_distance,  # ← Always None
        )

Quick win: Pass max_distance=5 (or even 3) to cap recursion depth.
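To illustrate what capping depth buys, here is a minimal pure-Python sketch of the same traversal semantics (function name and edge representation are invented for illustration; the real logic lives in the SQL CTE above):

```python
from collections import deque

def bounded_dependents(edges, root, max_distance):
    """Breadth-first dependency walk with path-based cycle detection,
    mirroring the CTE's semantics when max_distance is set.
    edges: dict mapping blocker -> list of dependents (hypothetical shape)."""
    results = []
    queue = deque([(root, 0, (root,))])  # (node, distance, path so far)
    while queue:
        node, distance, path = queue.popleft()
        if distance >= max_distance:     # WHERE dependency_distance <= max_distance
            continue
        for dep in edges.get(node, ()):
            if dep in path:              # NOT ROW(...) = ANY(dependency_path)
                continue
            results.append((dep, distance + 1))
            queue.append((dep, distance + 1, path + (dep,)))
    return results
```

With an unbounded walk, a deep or densely connected graph forces the cycle check to scan an ever-growing path array at every level; the cap turns worst-case blowups into a bounded amount of work.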


3. Bottleneck #2: intel-requests-api

GET /intel-requests/{id}

  • File: backend/apis/intel-requests-api/intel_requests_api/intel_requests.py
  • Handler: RecordHandler.get() (line 2650)
  • Key call: self.incorporate_dependency_data('intel_requests', intel_request) (line 2672)

When notification-sender calls with ?get_recursive_dependencies=True&recursive_record_types=intel_requests:

  1. Fetches the intel_request record (fast, indexed PK lookup)
  2. Fetches direct dependencies via models.Dependency.get_dependencies() (fast, uses partial indexes)
  3. Runs the recursive CTE for each record (SLOW — 5-7s, unbounded depth)

POST /intel-requests/bulk

  • File: Same file
  • Handler: BulkHandler.post() (line 2133)
  • Key call: self.incorporate_dependency_data('intel_requests', filtered) (line 2157)

Same dependency incorporation logic. For N intel request IDs → N separate recursive CTE queries.
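The N separate queries could collapse into one by seeding the CTE's base case with an array of root IDs and carrying the root through the recursion. A sketch (column names follow the blocker-to-dependent direction; the real library templates both directions, so this is illustrative, not the exact refactor):

```sql
-- One query for many roots: group rows by root_id in Python afterwards.
WITH RECURSIVE dependents AS (
    SELECT d.dependent_type, d.dependent_id, d.blocker_type, d.blocker_id,
           d.blocker_id AS root_id,
           1 AS dependency_distance,
           ARRAY[ROW(d.dependent_type, d.dependent_id,
                     d.blocker_type, d.blocker_id)] AS dependency_path
      FROM dependencies.dependencies AS d
     WHERE d.blocker_id = ANY(%(root_dependency_ids)s)   -- array of roots
       AND d.blocker_type = %(root_dependency_type)s
       AND d.deleted IS NULL

    UNION ALL

    SELECT d.dependent_type, d.dependent_id, d.blocker_type, d.blocker_id,
           rd.root_id,
           rd.dependency_distance + 1,
           rd.dependency_path || ROW(d.dependent_type, d.dependent_id,
                                     d.blocker_type, d.blocker_id)
      FROM dependencies.dependencies AS d
      JOIN dependents AS rd
        ON d.blocker_type = rd.dependent_type
       AND d.blocker_id   = rd.dependent_id
     WHERE NOT ROW(d.dependent_type, d.dependent_id,
                   d.blocker_type, d.blocker_id) = ANY(rd.dependency_path)
       AND d.deleted IS NULL
)
SELECT * FROM dependents
```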


4. Bottleneck #3: incidents-api

GET /incidents/{id}

  • File: backend/apis/incidents-api/incidents_api/incidents.py
  • Handler: RecordHandler.get() (line ~1868)
  • Key call: incorporate_incident_dependency_data() (line ~595)

Uses the same shared gisual-data-mixins library:

RecordHandler.get()
  → incorporate_incident_dependency_data() [incidents.py:~595]
  → incorporate_dependency_data() [daos.py:41]
  → recursive_get_dependencies() [daos.py:182]
  → Dependency.get_recursive_dependencies() [models.py:174]  ← SAME recursive CTE

When notification-sender calls with ?get_recursive_dependencies=True&recursive_dependency_type=dependents&recursive_record_types=intel_requests:

  • Fetches incident with location (fast)
  • Runs recursive CTE to find all dependent intel_requests (SLOW — was 25-54s on March 5)

5. assets-api (Not a Bottleneck)

POST /assets/bulk

  • File: backend/apis/assets-api/assets_api/assets.py
  • Handler: BulkHandler.post() (line 504)
  • Query: Single JOIN query, no recursion
SELECT a.asset_id, a.asset_name, ..., t.asset_type_name,
       public.ST_AsGeoJSON(l.location) AS geojson, l.thoroughfare, ...
  FROM assets.assets AS a
  JOIN assets.asset_types AS t ON t.asset_type_id = a.asset_type_id AND t.deleted IS NULL
  JOIN locations.locations AS l ON l.location_id = a.location_id AND l.organization_id = a.organization_id AND l.deleted IS NULL
 WHERE a.asset_id = ANY(%(asset_id)s)
   AND a.deleted IS NULL

Assessment: Well-optimized. Single query, proper indexes (PK on asset_id, btree on organization_id, location_id). ~0.5s is acceptable. No action needed.


6. Notification-Sender Optimization Opportunities

6a. Parallelization (consumer-side)

Current: All API calls are sequential. Several are independent and could run concurrently.

| Opportunity | Files | Current Flow | Savings |
| --- | --- | --- | --- |
| get_organization() + generate_context_data() | app.py:337,375 | Sequential | ~50ms |
| reverse_geocoding + get_intel_requests() | intel.py:666,687 | Sequential | ~300ms |
| get_intel_requests() + get_assets() | intel.py:687,798 | Sequential | ~200-300ms |

Total parallelization savings: ~0.5-0.7s per message (modest but free).
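The gather pattern, sketched with stub coroutines (the names and sleep durations are stand-ins for the real API calls in generators/intel.py, whose signatures will differ):

```python
import asyncio

async def get_intel_requests(ids):
    await asyncio.sleep(0.3)          # simulates a ~300ms API call
    return {"intel_requests": ids}

async def get_assets(ids):
    await asyncio.sleep(0.3)
    return {"assets": ids}

async def build_context(intel_ids, asset_ids):
    # Independent calls run concurrently: total wall time is roughly the
    # slowest call, not the sum of both.
    intel, assets = await asyncio.gather(
        get_intel_requests(intel_ids),
        get_assets(asset_ids),
    )
    return {**intel, **assets}

result = asyncio.run(build_context([1, 2], [3]))
```

Since the calls are already coroutines, this is mostly a rearrangement: awaiting them together instead of one after the other.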

6b. 75-Pod Redundant Work

When 75 consumers process the same Charter batch:

  • All 75 independently call get_incident(), get_intel_requests(), get_assets() for the same data
  • Only 1 successfully sends each notification (idempotent lock)
  • 74 pods do redundant API work = 222+ wasted API calls per work unit

Cache mechanism (app.py:302): Per-work_unit_id Redis cache exists with fingerprint-based freshness check. But:

  • Cache is set AFTER generation (line 417-435), so the first pod pays full cost
  • Other pods racing on the same work_unit don't benefit until the first completes
  • No organization-level or incident-level cache to share across work units in the same batch

Robin's cache MR (#44) addresses this but is still open.

6c. is_throttled() — 0.35s Latency

Root cause: Distributed lock acquisition with 75 pods competing.

The GCRA rate limiter acquires a Redis lock per call. With 75 consumers hitting the same throttle key:

  • Lock contention causes retry loops
  • Each is_throttled() call averages 0.35s
  • Called twice per message (email + webhook channels) = 0.7s/msg

Fix: Replace lock-based GCRA with atomic Lua script — eliminates lock contention entirely.
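For reference, the GCRA update itself is a few lines of arithmetic, shown here in pure Python with a dict standing in for Redis. In production the same body would run as a single atomic Lua script, which is what removes the need for a lock (function and parameter names are illustrative):

```python
import time

def gcra_allow(state, key, period, burst, now=None):
    """Single-shot GCRA check: returns True if the request is allowed.
    period: seconds between permitted requests; burst: extra requests
    allowed beyond the steady rate. `state` maps key -> theoretical
    arrival time (TAT)."""
    now = time.monotonic() if now is None else now
    tat = state.get(key, now)            # TAT of the next conforming request
    allow_at = tat - burst * period      # earliest moment this request fits
    if now < allow_at:
        return False                     # throttled
    state[key] = max(tat, now) + period  # advance TAT for the next request
    return True
```

Because the read-compare-write happens inside one Lua script on the Redis server, 75 pods can call it concurrently with no retry loops.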


7. Current Incidents Feature — Potential Fast Path

What It Is

The incidents-api has a pre-computed cache of current incidents per organization (GET /incidents/current). This cache includes:

  • Incident data with location
  • Outages (with locations)
  • Assets
  • Intel requests (direct dependents)
  • Parent intel requests (2nd-level dependents)
  • Internet outages

This is almost exactly the data that generate_context_data() spends 5-13s assembling from sequential API calls.

Current State

  • Endpoint: GET /incidents/current?organization_id=X (incidents.py:1054, CurrentHandler)
  • Cache storage: Redis hash, msgpack-encoded (incidents.py:1388-1396)
  • Cache population: On every incident update via upsert_to_current_incidents_cache() (incidents.py:433)
  • Per-org toggle: organization.metadata.use_current_incidents_cache (incidents.py:402)
  • SQL feature flag: current-sql enables the optimized LATERAL query path (incidents.py:1383)

feat/optimize-query Branch

The feat/optimize-query branch (commit ccc1c51) rewrites the get_current_incidents() SQL query using LATERAL subqueries:

Before (flat LEFT JOINs):

  • Cross-products across outages, assets, dependencies, internet outages
  • Charter (org 9de5d801, ~3K incidents): 123K intermediate rows, 148MB disk sort, 566K buffer hits
  • 4,240ms

After (LATERAL subqueries):

  • Each relationship aggregated independently, 1 row per incident
  • 121K buffer hits, no temp I/O
  • 655ms (6.5x faster)
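Schematically, the rewrite looks like this (table and column names are abridged and illustrative, not the branch's exact SQL):

```sql
-- Flat LEFT JOINs multiply rows across every relationship (cross-product),
-- then need a large sort/de-dup. LATERAL aggregates each relationship
-- independently and emits exactly one row per incident.
SELECT i.incident_id, i.status,
       o.outages, a.assets
  FROM incidents.incidents AS i
  LEFT JOIN LATERAL (
       SELECT jsonb_agg(o) AS outages
         FROM outages.outages AS o
        WHERE o.incident_id = i.incident_id AND o.deleted IS NULL
  ) AS o ON TRUE
  LEFT JOIN LATERAL (
       SELECT jsonb_agg(a) AS assets
         FROM assets.assets AS a
        WHERE a.incident_id = i.incident_id AND a.deleted IS NULL
  ) AS a ON TRUE
 WHERE i.organization_id = %(organization_id)s
   AND i.deleted IS NULL
```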

How This Could Help notification-sender

Option A — Short-circuit via cache lookup: If the current incidents cache is enabled for Charter, notification-sender could look up the incident directly from Redis instead of calling GET /incidents/{id}?get_recursive_dependencies=True. The cache already contains dependencies, assets, and outage data.
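A minimal sketch of that lookup, with a plain dict standing in for the Redis hash and JSON standing in for msgpack so the example stays dependency-free (the function name is invented):

```python
import json

def get_current_incident(redis_hash, incident_id):
    """Option A sketch: read one incident straight from the pre-computed
    current-incidents hash instead of calling
    GET /incidents/{id}?get_recursive_dependencies=True.
    `redis_hash` stands in for an HGET against the per-org hash; the real
    cache is msgpack-encoded."""
    raw = redis_hash.get(incident_id)
    if raw is None:
        return None          # cache miss: fall back to the slow API path
    return json.loads(raw)

# Stubbed hash as it might look for one organization
cache = {"inc-1": json.dumps({"incident_id": "inc-1", "intel_requests": ["ir-9"]})}
```

The fallback on miss matters: the cache is only populated for orgs with the toggle enabled, so the recursive API path must remain as a backstop.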

Option B — New bulk lookup endpoint: Add a method to fetch a single incident from the current incidents cache by ID. notification-sender calls this instead of the expensive recursive API.

Option C — Enable feature + merge optimize-query: Enable use_current_incidents_cache for Charter + merge feat/optimize-query. The cache stays warm (populated on every incident update). Other consumers of the cache (e.g., the web app) benefit from the LATERAL optimization. notification-sender would need modification to use it.


8. Prioritized Recommendations

Immediate (hours, not days)

| # | Action | Impact | Effort |
| --- | --- | --- | --- |
| 1 | Pass max_distance=5 to recursive CTE in daos.py:155 | Caps unbounded recursion. Won't fix root cause but prevents worst-case blowup. | 1-line change in gisual-data-mixins |
| 2 | Merge feat/optimize-query into incidents-api main | 6.5x faster current incidents query for Charter | Branch ready, needs review |
| 3 | Enable use_current_incidents_cache for Charter | Pre-computes incident+dependency data on every update | Org metadata toggle |

Short-term (days)

| # | Action | Impact | Effort |
| --- | --- | --- | --- |
| 4 | Modify notification-sender to use current incidents cache | Eliminates get_incident() + recursive deps call entirely (~5-7s) | Medium — need to add cache lookup path in generators/intel.py:627 |
| 5 | Batch recursive CTE — single query for multiple root IDs instead of per-record Python loop | Eliminates N sequential DB round-trips in daos.py:147-157 | Medium — SQL refactor in models.py |
| 6 | Parallelize independent API calls in intel.py with asyncio.gather() | Save ~0.5-0.7s per message | Small — rearrange existing async calls |
| 7 | Replace is_throttled() lock with Lua script | Eliminate 0.35s lock contention (x2 channels = 0.7s) | Small — Redis Lua script |

Medium-term (weeks)

| # | Action | Impact | Effort |
| --- | --- | --- | --- |
| 8 | Enable pg_stat_statements on app database | Finally identify exact slow queries | DBA coordination |
| 9 | Robin's cache MR (#44) — org-level context caching | Eliminates 74/75 redundant API calls per work unit | Review + merge |
| 10 | Add 4-column composite index on dependencies table | Faster OR-condition lookups in direct deps query | DDL migration |
| 11 | Add queue depth / throughput alerts | Detect this before SLA breach | Alert rule config |

9. File Reference Index

Notification-Sender (Consumer)

| File | Lines | What |
| --- | --- | --- |
| backend/consumers/notification-sender/notification_sender/app.py | 223-419 | process() — main entry, orchestrates generation + delivery |
| backend/consumers/notification-sender/notification_sender/app.py | 302-305 | get_cached_payloads_for_work_unit() — Redis cache check |
| backend/consumers/notification-sender/notification_sender/app.py | 337-346 | generate_context_data() call |
| backend/consumers/notification-sender/notification_sender/app.py | 375 | get_organization() — could parallelize with above |
| backend/consumers/notification-sender/notification_sender/app.py | 846+ | send_payload() — idempotent delivery |
| backend/consumers/notification-sender/notification_sender/generators/intel.py | 627-857 | generate_context_data() — ALL internal API calls |
| backend/consumers/notification-sender/notification_sender/generators/intel.py | 640-663 | get_incident() with recursive deps |
| backend/consumers/notification-sender/notification_sender/generators/intel.py | 749-770 | get_intel_request() + bulk dependent fetch |
| backend/consumers/notification-sender/notification_sender/generators/intel.py | 798-808 | get_assets() bulk |
| backend/consumers/notification-sender/notification_sender/generators/intel.py | 877-900 | get_intel_requests() bulk method |
| backend/consumers/notification-sender/notification_sender/generators/intel.py | 902-918 | get_intel_request() single method |

Shared Library (gisual-data-mixins)

| File | Lines | What |
| --- | --- | --- |
| backend/libraries/gisual-data-mixins/gisual_data_mixins/daos.py | 41-223 | incorporate_dependency_data() — dependency resolution orchestrator |
| backend/libraries/gisual-data-mixins/gisual_data_mixins/daos.py | 147-157 | Per-record recursive loop (N+1 pattern) |
| backend/libraries/gisual-data-mixins/gisual_data_mixins/daos.py | 182+ | recursive_get_dependencies() |
| backend/libraries/gisual-data-mixins/gisual_data_mixins/models.py | 157-171 | Direct dependencies SQL query |
| backend/libraries/gisual-data-mixins/gisual_data_mixins/models.py | 174-278 | get_recursive_dependencies() — THE recursive CTE |
| backend/libraries/gisual-data-mixins/gisual_data_mixins/models.py | 214-217 | max_distance parameter (exists but never used) |

intel-requests-api

| File | Lines | What |
| --- | --- | --- |
| backend/apis/intel-requests-api/intel_requests_api/intel_requests.py | 2644-2685 | RecordHandler.get() — GET /intel-requests/{id} |
| backend/apis/intel-requests-api/intel_requests_api/intel_requests.py | 2672 | incorporate_dependency_data() call |
| backend/apis/intel-requests-api/intel_requests_api/intel_requests.py | 2129-2159 | BulkHandler.post() — POST /intel-requests/bulk |

incidents-api

| File | Lines | What |
| --- | --- | --- |
| backend/apis/incidents-api/incidents_api/incidents.py | ~1868 | RecordHandler.get() — GET /incidents/{id} |
| backend/apis/incidents-api/incidents_api/incidents.py | ~595 | incorporate_incident_dependency_data() |
| backend/apis/incidents-api/incidents_api/incidents.py | 1054-1396 | CurrentHandler — current incidents feature |
| backend/apis/incidents-api/incidents_api/incidents.py | 1061-1370 | get_current_incidents() — the LATERAL-optimized query |
| backend/apis/incidents-api/incidents_api/incidents.py | 1383 | current-sql feature flag check |
| backend/apis/incidents-api/incidents_api/incidents.py | 390-402 | uses_current_incidents_cache() per-org toggle |

assets-api

| File | Lines | What |
| --- | --- | --- |
| backend/apis/assets-api/assets_api/assets.py | 504-524 | BulkHandler.post() — single JOIN, no issues |

Database DDL

| File | What |
| --- | --- |
| ddl/app/schemata/dependencies/tables/00_dependencies.sql | Dependencies table + indexes |
| ddl/app/schemata/intel_requests/tables/00_intel_requests.sql | Intel requests table |
| ddl/app/schemata/assets/tables/01_assets.sql | Assets table + indexes |

Root Cause Update: AAA API Capacity (2026-03-06)

Appended to: Bottleneck Deep Dive


Root Cause Found: AAA API at 8 Replicas

The SQL queries were a red herring. Every query in the notification-sender hot path executes in <1ms (verified via EXPLAIN ANALYZE on prod). The 9-20s API response times were caused by a cascading bottleneck in the AAA API.

The Cascade

notification-sender (75 pods)
  -> intel-requests-api (60 pods) — request_time: 9-20s
    -> aaa-api (8 pods, some crash-looping) — request_time: 7-19s
      -> Azure Flex aaa DB — 2 simple SQL queries (sub-ms)

Why AAA Was Slow

  1. Only 8 replicas serving the entire platform
  2. Some pods crash-looping due to missing APP_DB_URL in Kubernetes secret
  3. Intel-searcher's user-token requests trigger UserPermissionsHandler on AAA for every request to intel-requests-api
  4. Same permissions checked repeatedly — user 944c9c57 with org 3f8bea44 queried 4 times in the same second across 4 AAA pods, each taking 7-19s
  5. No permissions caching — identical user+org+resource lookups hit AAA's DB every time
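A read-through permissions cache in intel-requests-api (the P0 item under Remaining Issues) could look like this sketch. Class and parameter names are invented; production would use Redis with a short TTL rather than an in-process dict:

```python
import time

class PermissionsCache:
    """Read-through cache in front of the AAA permissions lookup.
    A fresh entry short-circuits the 7-19s AAA round-trip for repeated
    identical (user, org) lookups."""

    def __init__(self, fetch, ttl=30.0, clock=time.monotonic):
        self.fetch = fetch          # e.g. the real AAA HTTP call
        self.ttl = ttl
        self.clock = clock
        self._store = {}            # (user_id, org_id) -> (expires_at, value)

    def get(self, user_id, org_id):
        key = (user_id, org_id)
        hit = self._store.get(key)
        if hit and hit[0] > self.clock():
            return hit[1]           # fresh: skip the AAA round-trip
        value = self.fetch(user_id, org_id)
        self._store[key] = (self.clock() + self.ttl, value)
        return value
```

A short TTL (tens of seconds) bounds the window in which a revoked permission is still honored, while collapsing the "same user queried 4 times in the same second" pattern to one AAA call.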

Why This Affected Notification-Sender

Notification-sender uses service tokens, which bypass the AAA permissions check (allow_service_access_tokens at aaa_lib/tokens.py:1562). But notification-sender's requests to intel-requests-api queue behind intel-searcher's requests on the same Tornado event loop. Intel-searcher's requests are blocked waiting for AAA, holding event loop slots and DB connections.

Evidence

Intel-requests-api request_time (from notification-sender):

| Endpoint | request_time |
| --- | --- |
| GET /intel-requests/{id}?recursive_deps=true | 9.3-20.1s |
| POST /intel-requests/bulk | 0.85-1.75s |

AAA API request_time (from intel-requests-api):

| Endpoint | request_time |
| --- | --- |
| GET /internal/permissions/intel_requests/users/{id} | 7.1-19.3s |

AAA UserPermissionsHandler code (aaa-api/aaa_api/permissions.py:792):

  • Just 2 SQL queries: User.get(user_id) + Permission.get_user_permission_assignments()
  • No HTTP calls to other services
  • The 7-19s is entirely DB connection pool saturation

Resolution

  1. Fixed AAA crash-loop (deleted/recreated secret with APP_DB_URL)
  2. Scaled AAA from 8 to 20 replicas
  3. Queue drained from ~7,885 to 2 messages within ~2 hours

Queue Trend (5-min resolution)

07:30  3,631   Queue building
08:30  5
09:00  568
10:00  2,482
11:00  6,273
11:30  8,569   PEAK
12:00  6,518
13:00  7,885   Second peak
14:00  3,881   AAA scaled to 20 — inflection point
14:30  1,965
15:00  39      DRAINED
15:10  2       CLEAR

Remaining Issues

| Priority | Issue | Status |
| --- | --- | --- |
| P0 | AAA needs HPA or higher minimum replicas | Open |
| P0 | AAA permissions should be cached in intel-requests-api (Redis, short TTL) | Open |
| P1 | MR #46 — retry caching for 25% of messages | Ready for review |
| P1 | Context-level read-through cache for first-delivery thundering herd (75% case) | Not started |
| P1 | incidents-api feat/optimize-query — LATERAL join optimization | Ready to merge |
| P2 | ServiceNow 500s on goco/gocotest callback API | Notify Charter |
| P2 | March 5 P0s still open (pg_stat_statements, queue depth alerts) | Open |