Sunday morning. Merchants can't load review widgets. Dashboard inaccessible. Churn reported. The legacy .NET monolith (app-backend-api) running on a self-managed SQL Server (r6a.16xlarge — 64 vCPUs, 512 GB RAM) was completely unresponsive.
What followed was a 12-hour debugging marathon that uncovered five cascading failures, a year-old regression, and a SQL Server death spiral that required a force-kill to resolve.
Symptom: stamped.io/api/widget returning 504 after exactly 15 seconds.
```shell
$ curl -s -w "%{http_code} %{time_total}s" "https://stamped.io/api/widget?productId=...&sId=6134&take=5"
upstream request timeout
504 15.585350s
```

Root cause: The widget proxy (svc-backend-widget-proxy-new) runs behind ECS Service Connect, which injects an Envoy sidecar. Envoy's default route_timeout is 15 seconds. The legacy backend needed 30-60s for uncached widget queries. Envoy was killing every request before it could complete.
The kicker: Because requests never completed, the Redis cache never got populated. Every request was a cache miss → hit the backend → timeout → cache stays empty → next request does the same thing. An infinite loop of failure.
Fix: Increased Envoy timeout to 120s in Terraform:
```hcl
timeout = {
  per_request_timeout_seconds = 120
  idle_timeout_seconds        = 120
}
```

Symptom: Even with the timeout fixed, the legacy backend was returning 500 after 59 seconds. We found active lock chains in SQL Server:
```sql
SELECT r.session_id, r.blocking_session_id, r.wait_type, r.command
FROM sys.dm_exec_requests r WHERE r.blocking_session_id > 0
-- session 512 blocked by 182 (MERGE)
-- session 182 blocked by 3453 (UPDATE)
-- session 3453 blocked by 2035 (INSERT)
-- session 2035 blocked by 512 (SELECT)
-- DEADLOCK CYCLE
```

Widget SELECTs were taking shared locks, blocking on concurrent UPDATEs from the review censor processor. The `UPDATE [dbo].[Review] SET [Body]` from censoring was holding IX locks that blocked the widget's S locks.
The smoking gun: git blame revealed that ReadUncommitted isolation was commented out in October 2023:
```
027775ca8 (anyuzer 2023-10-28) // using (var tx = new TransactionScope(ReadUncommitted))
```
Commit message: "change health check, change hardware profile, remove tx, remove async"
Every other widget endpoint (badges, questions) still had ReadUncommitted. This was the only one missing it. For over a year.
Fix: Uncommented the transaction scope. One line change.
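The effect of that one line can be modeled in miniature. The sketch below is a toy analogy (a Python lock standing in for SQL Server's row locks, not how the engine actually works): a READ COMMITTED-style reader must wait on the writer's exclusive lock, while a READ UNCOMMITTED-style reader returns the in-flight ("dirty") value immediately:

```python
# Toy model (not SQL Server internals): a writer holds an exclusive row
# lock; a "READ COMMITTED" reader must wait for it, while a
# "READ UNCOMMITTED" reader skips the lock and dirty-reads the row.
import threading

row_lock = threading.Lock()
row = {"Body": "original"}

def read(isolation: str, timeout_s: float = 0.2):
    if isolation == "READ UNCOMMITTED":
        return row["Body"]               # no shared lock taken: dirty read
    # "READ COMMITTED": shared lock blocks behind the writer's lock
    if not row_lock.acquire(timeout=timeout_s):
        return None                      # lock wait timeout (the 59s stalls)
    try:
        return row["Body"]
    finally:
        row_lock.release()

row_lock.acquire()                       # censor UPDATE in progress
row["Body"] = "censored (uncommitted)"
print(read("READ UNCOMMITTED"))          # -> censored (uncommitted)
print(read("READ COMMITTED"))            # -> None (blocked, timed out)
row_lock.release()
```

The trade-off is the usual one: dirty reads can surface a censored-but-uncommitted body for a moment, which is acceptable for a public review widget and far better than a 59-second lock wait.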
Symptom: The widget proxy had a Redis lock to deduplicate concurrent requests (so only one hits the backend for the same widget). But the lock TTL was 30 seconds, and the backend took 60s+. The lock expired mid-flight.
```
Request A: acquires lock, calls backend...
    [30s passes, lock expires]
Request B: no lock found, acquires new lock, calls backend...
    [30s passes, lock expires]
Request C: acquires lock, calls backend...
```
Every 30 seconds, a new request stampeded the already-struggling backend. With 12 ECS tasks, this meant dozens of concurrent requests pounding the same slow query.
Fix: Increased lock TTL from 30s to 90s.
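The stampede cadence follows directly from the TTL arithmetic. A simulated timeline (illustrative only; the real deduplication presumably uses a Redis `SET NX` with an expiry) counts how many requests reach the backend during a slow query:

```python
# Illustrative simulation (not the real proxy code): a dedup lock whose
# TTL is shorter than the backend call expires mid-flight, letting the
# next request through to the already-struggling backend.
def backend_calls(lock_ttl_s: int, request_times: list[int]) -> int:
    lock_expires_at = -1
    calls = 0
    for t in request_times:
        if t >= lock_expires_at:   # no live lock: acquire it, hit the backend
            lock_expires_at = t + lock_ttl_s
            calls += 1
    return calls                   # requests that reached the backend

requests = list(range(0, 120, 5))  # one request every 5s for 2 minutes

# 30s TTL against a 60s backend query: the lock lapses twice per call,
# so 4 requests pound the backend over 2 minutes instead of 2.
print(backend_calls(30, requests))  # -> 4
# 90s TTL comfortably outlives the 60s query:
print(backend_calls(90, requests))  # -> 2
```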
Symptom:
```
System.InvalidOperationException: Timeout expired. The timeout period elapsed
prior to obtaining a connection from the pool. This may have occurred because
all pooled connections were in use and max pool size was reached.
```
The numbers:
- 45 ECS tasks × 100 connections per pool = 4,500 max connections
- SQL Server showing 5,049 active connections from 131 hosts
- Every host maxed at exactly 100 connections
With queries blocked on locks for 59 seconds each, connections were held open and never returned to the pool. New requests couldn't get a connection.
What we tried:
- Increased `Max Pool Size` to 200 → made it worse (more connections competing for locked rows)
- Increased to 150 → still bad
- Reverted to 100
- Reduced ECS task count from 45 to 20
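Why did a bigger pool make it worse? Little's law (`L = λ × W`): connections in use equal the request rate times the time each connection is held. With ~59s hold times, demand exceeds any realistic pool ceiling, and raising the ceiling only adds more sessions contending for the same locked rows. A back-of-the-envelope check using the incident's task counts (the request rate below is a hypothetical figure for illustration):

```python
# Back-of-the-envelope pool math.
tasks = 45
pool_size = 100
max_connections = tasks * pool_size
print(max_connections)                    # -> 4500 app-side ceiling

# Little's law: L = rate × hold time. Even a modest 80 req/s
# (hypothetical) held for 59s each while blocked on locks needs
# ~4,720 concurrent connections — already over the ceiling.
request_rate = 80   # requests/sec, illustrative
hold_time_s = 59    # seconds each blocked query holds its connection
print(round(request_rate * hold_time_s))  # -> 4720

# Raising Max Pool Size to 200 just doubles the ceiling to 9,000
# potential connections, all queuing on the same locked rows.
print(tasks * 200)                        # -> 9000
```

The real fix is reducing `W` (the lock waits), not raising `L`'s ceiling.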
This is where things got truly ugly.
Symptom: SQL Server at 100% CPU, 100% memory, completely unresponsive. Even sqlcmd -S localhost from the RDP session was hanging.
```sql
SELECT TOP 5 wait_type, COUNT(*) as cnt
FROM sys.dm_exec_requests WHERE status = 'suspended'
GROUP BY wait_type ORDER BY cnt DESC
-- RESOURCE_SEMAPHORE_QUERY_COMPILE 2574
-- QDS_STMT 334
```

2,574 sessions stuck waiting to compile queries. SQL Server has a compile gate that limits concurrent compilations. Entity Framework generates unique query shapes — different parameter counts, different TOP N, different WHERE clause structures. Each unique shape needs a fresh compilation.
We measured: 2,200–4,600 compilations/sec — that's 1 in 3 queries needing compilation instead of reusing a cached plan.
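The shape explosion is easy to reproduce in miniature. This is not EF's actual SQL generation, just the principle: parameter lists of different lengths (e.g. a LINQ `Contains` becoming `WHERE Id IN (...)`) each produce a distinct SQL text, and each distinct text needs its own compiled plan:

```python
# Toy illustration of why varying parameter counts defeat plan reuse.
# Each distinct SQL text is a separate plan-cache entry; IN-lists of
# different lengths produce different texts.
def build_query(ids: list[int]) -> str:
    placeholders = ", ".join(f"@p{i}" for i in range(len(ids)))
    return f"SELECT * FROM Review WHERE Id IN ({placeholders})"

batches = [[1], [1, 2], [1, 2, 3], [7], [8, 9]]
shapes = {build_query(b) for b in batches}

# 5 queries but only 3 reusable shapes (IN-list lengths 1, 2, 3):
# every previously unseen list length forces a fresh compilation.
print(len(batches), len(shapes))  # -> 5 3
```

At production volume, with hundreds of distinct shapes in flight, that "fresh compilation" tax is what saturated the compile gate.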
The amplifier: Query Store (QDS) was enabled. It intercepts every compilation to record the plan and stats. With thousands of compilations per second, QDS writes became the bottleneck on top of the compile gate.
Worker thread exhaustion: Each session waiting to compile holds a worker thread. SQL Server auto-configured 2,400 max workers for 64 CPUs. With 2,574 sessions in the compile queue alone:
```sql
SELECT SUM(work_queue_count) as queued, SUM(active_workers_count) as active_workers
FROM sys.dm_os_schedulers WHERE status = 'VISIBLE ONLINE'
-- queued: 1817, active_workers: 2510
```

1,817 tasks queued with no workers available. SQL Server couldn't even process new login handshakes — which is why external connections were timing out.
```
More connections → more unique queries → more compilations needed
→ compile gate full → sessions wait holding workers
→ no workers available → can't process logins → new connections retry
→ connection pool pressure → more connections
→ REPEAT
```
```sql
-- Increased max worker threads from 2400 to 3000
EXEC sp_configure 'max worker threads', 3000; RECONFIGURE;
-- Result: workers filled up to 3,109 immediately

-- Evicted 2000 single-use cached plans
DECLARE c CURSOR FOR SELECT plan_handle FROM sys.dm_exec_cached_plans
WHERE objtype = 'Prepared' AND usecounts = 1;
-- (open/fetch loop calling DBCC FREEPROCCACHE(plan_handle) per row elided)
-- Result: no improvement, compile waits stayed at ~1,600

-- Tried to clear all cached plans
DBCC FREEPROCCACHE
-- Result: HUNG. No workers available to execute it.

-- Tried to disable Query Store
ALTER DATABASE shopry_shopifyproductaddons SET QUERY_STORE = OFF
-- Result: HUNG. Needs exclusive DB access, can't get it.

-- Tried read-only mode
ALTER DATABASE shopry_shopifyproductaddons SET QUERY_STORE (DESIRED_STATE = READ_ONLY)
-- Result: HUNG.
```

Every remediation command requires a worker thread to execute, and all workers were consumed. The only way out was to restart SQL Server.
```powershell
Restart-Service MSSQLSERVER -Force
```

This also hung — SQL Server couldn't gracefully shut down with 5,000+ stuck sessions. Had to force-kill:

```powershell
Stop-Process -Name sqlservr -Force
Start-Service MSSQLSERVER
```

The database went through recovery (shopry_shopifyproductaddons showed RECOVERING state). Meanwhile, all 40 ECS tasks immediately tried to reconnect — 7,662 connections hit the server within seconds of it coming online, threatening to spiral again.
We had a narrow window after restart to:
- Scale down ECS to 10 tasks — reduce connection flood
- Scale down k8s consumers — censor from 8→4, webhooks from 5→2
- Disable Query Store while connections were low: `ALTER DATABASE shopry_shopifyproductaddons SET QUERY_STORE = OFF`
- Gradually scale back up — 10 → 20 tasks, watching compile waits
Result after stabilization:
```sql
SELECT TOP 5 wait_type, COUNT(*) as cnt
FROM sys.dm_exec_requests WHERE status = 'suspended'
GROUP BY wait_type ORDER BY cnt DESC
-- ASYNC_NETWORK_IO 13 (normal)
-- PAGEIOLATCH_SH 7 (normal disk reads)
-- CXCONSUMER 5 (normal parallelism)
-- LCK_M_IX 4 (minor, acceptable)
```

Zero compile waits. The server was still at 100% CPU, but doing actual work rather than stuck in a spiral.
The full cascade, start to finish:

```
1. ReadUncommitted removed (Oct 2023)
   → Widget SELECTs take shared locks
2. Censor UPDATEs block widget SELECTs
   → Queries wait 59 seconds holding connections
3. Connection pools exhaust
   → "Max pool size reached" errors
4. Envoy 15s timeout kills requests before backend responds
   → Cache never populates
5. Every request is a cache miss → hammers backend
   → More connections, more lock contention
6. Entity Framework generates unique query shapes
   → Thousands of compilations/sec
7. Query Store adds write overhead to each compilation
   → Compile gate saturates
8. Compile queue consumes all worker threads
   → Server can't process logins
   → Can't run diagnostic queries
   → Can't run remediation commands
   → GAME OVER
```
For the next incident, the diagnostics and recovery runbook:

```sql
-- Check what sessions are waiting on
SELECT TOP 10 wait_type, COUNT(*) as cnt
FROM sys.dm_exec_requests WHERE status = 'suspended'
GROUP BY wait_type ORDER BY cnt DESC

-- Check worker thread exhaustion (THIS IS THE KEY ONE)
SELECT SUM(work_queue_count) as queued,
       SUM(active_workers_count) as active_workers
FROM sys.dm_os_schedulers WHERE status = 'VISIBLE ONLINE'

-- Check connection count
SELECT COUNT(*) as connections FROM sys.dm_exec_connections

-- Check blocking chains
SELECT r.session_id, r.blocking_session_id, r.wait_type,
       r.wait_time/1000 as wait_secs,
       SUBSTRING(t.text,1,100) as query_text
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t
WHERE r.blocking_session_id > 0 ORDER BY r.wait_time DESC

-- Check plan cache bloat
SELECT objtype, COUNT(*) as cnt,
       SUM(CAST(size_in_bytes AS BIGINT))/1024/1024 as size_mb
FROM sys.dm_exec_cached_plans GROUP BY objtype ORDER BY size_mb DESC

-- Check compilation rate (run twice, 10s apart, compare)
SELECT * FROM sys.dm_exec_query_optimizer_info WHERE counter = 'optimizations'
```

```powershell
# Check connection count from OS level
(netstat -an | findstr :1433 | findstr ESTABLISHED | measure).Count

# Check pending connections
(netstat -an | findstr :1433 | findstr SYN_RECEIVED | measure).Count

# Restart SQL Server (graceful)
Restart-Service MSSQLSERVER -Force

# If graceful restart hangs (check status)
Get-Service MSSQLSERVER | Select Status

# If StopPending for >5 minutes:
Stop-Process -Name sqlservr -Force
Start-Service MSSQLSERVER

# Check error log after restart
Get-Content "C:\Program Files\Microsoft SQL Server\MSSQL16.MSSQLSERVER\MSSQL\Log\ERRORLOG" -Tail 30
```

```shell
# Scale down ECS tasks
aws ecs update-service --cluster shared --service app-backend-api-prod-new --desired-count 10

# Scale down k8s consumers
kubectl scale deployment app-backend-cron-review-censor -n prod --replicas=0
kubectl scale deployment app-backend-cron-shopify-webhooks -n prod --replicas=0

# Check ECS task count
uv run find_ecs_service.py app-backend-api-prod
```

```sql
-- Disable Query Store while connections are low
ALTER DATABASE shopry_shopifyproductaddons SET QUERY_STORE = OFF

-- Verify clean state
SELECT TOP 5 wait_type, COUNT(*) as cnt
FROM sys.dm_exec_requests WHERE status = 'suspended'
GROUP BY wait_type ORDER BY cnt DESC
-- Should show ASYNC_NETWORK_IO, PAGEIOLATCH_SH — NOT RESOURCE_SEMAPHORE_QUERY_COMPILE
```

| Change | Type | Status |
|---|---|---|
| Restore ReadUncommitted for CreateWidget | Code (PR #2156) | Deployed |
| Increase Envoy timeout 15s → 120s | Terraform | Applied |
| Increase Redis lock TTL 30s → 90s | Code (PR #256) | In PR |
| max worker threads 2400 → 3000 | SQL Server config | Applied |
| Query Store disabled | SQL Server config | Applied |
| Cache TTL 8h → 12h | Code | Deployed |
| Widget proxy added to k8s | k8s-mvp | Deployed |
| OTEL collector scaled to 5 replicas, 2Gi | k8s-mvp | Deployed |
| Consolidated 13 CI workflows into 1 | GitHub Actions | Deployed |
- ReadUncommitted matters for read-heavy endpoints — one commented-out line caused a year of intermittent lock contention that only became critical under high load
- Query Store on high-compilation workloads is dangerous — it amplifies the compile gate bottleneck
- The compile death spiral is self-reinforcing — once worker threads are exhausted, you can't even run the commands to fix it
- Connection pool × task count = total connections — autoscaling ECS to 45 tasks with 100-connection pools means 4,500 connections competing for SQL Server resources
- Envoy Service Connect has a 15s default timeout — if your backend is slower than that, nothing gets cached and every request is a miss
- Always have RDP access to self-managed databases — when SQL Server is unresponsive, kubectl/remote queries won't work