Battle Report: SQL Server Meltdown & Recovery — March 9, 2026

The Scene

Sunday morning. Merchants can't load review widgets. Dashboard inaccessible. Churn reported. The legacy .NET monolith (app-backend-api) running on a self-managed SQL Server (r6a.16xlarge — 64 vCPUs, 512 GB RAM) was completely unresponsive.

What followed was a 12-hour debugging marathon that uncovered five cascading failures, a year-old regression, and a SQL Server death spiral that required a force-kill to resolve.


Chapter 1: The 15-Second Wall

Symptom: stamped.io/api/widget returning 504 after exactly 15 seconds.

$ curl -s -w "%{http_code} %{time_total}s" "https://stamped.io/api/widget?productId=...&sId=6134&take=5"
upstream request timeout
504 15.585350s

Root cause: The widget proxy (svc-backend-widget-proxy-new) runs behind ECS Service Connect, which injects an Envoy sidecar. Envoy's default route_timeout is 15 seconds. The legacy backend needed 30-60s for uncached widget queries. Envoy was killing every request before it could complete.

The kicker: Because requests never completed, the Redis cache never got populated. Every request was a cache miss → hit the backend → timeout → cache stays empty → next request does the same thing. An infinite loop of failure.
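The miss → timeout loop can be sketched in a few lines (an illustrative Python simulation, not the actual proxy code; the latencies and names are assumptions for the sketch):

```python
CACHE = {}                 # stands in for the Redis widget cache
BACKEND_LATENCY = 45       # assumed seconds for an uncached widget query
PROXY_TIMEOUT = 15         # Envoy's default route timeout

def fetch_widget(widget_id):
    """Cache-aside read; Envoy cancels any call slower than PROXY_TIMEOUT."""
    if widget_id in CACHE:
        return "hit"
    if BACKEND_LATENCY > PROXY_TIMEOUT:
        return "504"       # request killed mid-flight: nothing gets cached
    CACHE[widget_id] = "rendered widget"
    return "miss-filled"

# Every retry repeats the same failure: miss -> backend -> timeout.
results = [fetch_widget("6134") for _ in range(3)]
print(results, CACHE)      # ['504', '504', '504'] {}
```

The cache can only break the loop if at least one backend call is allowed to finish, which is exactly what the 15s timeout prevented.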

Fix: Increased Envoy timeout to 120s in Terraform:

timeout = {
  per_request_timeout_seconds = 120
  idle_timeout_seconds        = 120
}

Chapter 2: The Missing ReadUncommitted

Symptom: Even with the timeout fixed, the legacy backend was returning 500 after 59 seconds. We found active lock chains in SQL Server:

SELECT r.session_id, r.blocking_session_id, r.wait_type, r.command
FROM sys.dm_exec_requests r WHERE r.blocking_session_id > 0

-- session 512 blocked by 182 (MERGE)
-- session 182 blocked by 3453 (UPDATE)
-- session 3453 blocked by 2035 (INSERT)
-- session 2035 blocked by 512 (SELECT)
-- DEADLOCK CYCLE

Widget SELECTs were taking shared locks, blocking on concurrent UPDATEs from the review censor processor. The UPDATE [dbo].[Review] SET [Body] from censoring was holding IX locks that blocked the widget's S locks.

The smoking gun: git blame revealed that ReadUncommitted isolation was commented out in October 2023:

027775ca8 (anyuzer 2023-10-28) // using (var tx = new TransactionScope(ReadUncommitted))

Commit message: "change health check, change hardware profile, remove tx, remove async"

Every other widget endpoint (badges, questions) still had ReadUncommitted. This was the only one missing it. For over two years.

Fix: Uncommented the transaction scope. One line change.


Chapter 3: The Stampede

Symptom: The widget proxy had a Redis lock to deduplicate concurrent requests (so only one hits the backend for the same widget). But the lock TTL was 30 seconds, and the backend took 60s+. The lock expired mid-flight.

Request A: acquires lock, calls backend...
  [30s passes, lock expires]
Request B: no lock found, acquires new lock, calls backend...
  [30s passes, lock expires]  
Request C: acquires lock, calls backend...

Every 30 seconds, a new request stampeded the already-struggling backend. With 12 ECS tasks, this meant dozens of concurrent requests pounding the same slow query.

Fix: Increased lock TTL from 30s to 90s.
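The TTL race is easy to reproduce with a small timeline simulation (illustrative Python, not the proxy's actual lock code):

```python
def backend_calls_before_warm(lock_ttl, backend_latency, arrivals):
    """Count requests that reach the backend before the first call
    (started at t=0) finishes and warms the cache."""
    lock_expires = None
    calls = 0
    for t in sorted(arrivals):
        if t >= backend_latency:          # cache is warm by now
            break
        if lock_expires is None or t >= lock_expires:
            lock_expires = t + lock_ttl   # lock free or expired: take it
            calls += 1
    return calls

# 30s TTL vs a 60s backend: the lock expires mid-flight, so a second
# request stampedes through. A 90s TTL outlives the slow call.
print(backend_calls_before_warm(lock_ttl=30, backend_latency=60, arrivals=[0, 35, 70]))  # 2
print(backend_calls_before_warm(lock_ttl=90, backend_latency=60, arrivals=[0, 35, 70]))  # 1
```

The general rule the fix encodes: a dedup lock's TTL must exceed the worst-case latency of the work it guards, or it stops deduplicating exactly when it matters most.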


Chapter 4: Connection Pool Exhaustion

Symptom:

System.InvalidOperationException: Timeout expired. The timeout period elapsed 
prior to obtaining a connection from the pool. This may have occurred because 
all pooled connections were in use and max pool size was reached.

The numbers:

  • 45 ECS tasks × 100 connections per pool = 4,500 max connections
  • SQL Server showing 5,049 active connections from 131 hosts
  • Every host maxed at exactly 100 connections

With queries blocked on locks for 59 seconds each, connections were held open and never returned to the pool. New requests couldn't get a connection.
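A Little's-law back-of-envelope shows why 100 connections per task wasn't enough once queries blocked for 59 seconds (the per-task request rate here is a hypothetical figure; the other numbers are from the incident):

```python
tasks, pool_size = 45, 100
total_app_connections = tasks * pool_size        # ceiling from the API tier alone

# Little's law: connections in use ~= arrival rate x time each one is held.
reqs_per_task_per_sec = 2                        # hypothetical steady load
lock_wait_secs = 59                              # observed hold time per blocked query
in_use_per_task = reqs_per_task_per_sec * lock_wait_secs

print(total_app_connections)                     # 4500
print(in_use_per_task, in_use_per_task > pool_size)  # 118 True: exceeds the 100-conn pool
```

This is also why raising Max Pool Size made things worse: it raises the ceiling without touching the 59-second hold time, so more connections just pile onto the same locked rows.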

What we tried:

  • Increased Max Pool Size to 200 → made it worse (more connections competing for locked rows)
  • Increased to 150 → still bad
  • Reverted to 100
  • Reduced ECS task count from 45 to 20

Chapter 5: The Compile Death Spiral

This is where things got truly ugly.

Symptom: SQL Server at 100% CPU, 100% memory, completely unresponsive. Even sqlcmd -S localhost from the RDP session was hanging.

SELECT TOP 5 wait_type, COUNT(*) as cnt 
FROM sys.dm_exec_requests WHERE status = 'suspended' 
GROUP BY wait_type ORDER BY cnt DESC

-- RESOURCE_SEMAPHORE_QUERY_COMPILE    2574
-- QDS_STMT                             334

2,574 sessions stuck waiting to compile queries. SQL Server has a compile gate that limits concurrent compilations. Entity Framework generates unique query shapes — different parameter counts, different TOP N, different WHERE clause structures. Each unique shape needs a fresh compilation.

We measured: 2,200–4,600 compilations/sec — that's 1 in 3 queries needing compilation instead of reusing a cached plan.
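A plan cache keyed by exact query text shows why unique shapes hurt (illustrative Python; SQL Server's cache key is more involved, but the effect is the same):

```python
plan_cache = {}
compilations = 0

def run(sql_text):
    """Compile on first sight of a query shape, reuse the plan afterwards."""
    global compilations
    if sql_text not in plan_cache:
        compilations += 1                # work done behind the compile gate
        plan_cache[sql_text] = object()  # stand-in for a compiled plan

# Parameterized shape: 1,000 executions, one compilation.
for pid in range(1000):
    run("SELECT TOP (@n) * FROM Review WHERE ProductId = @pid")

# EF-style shapes with literals baked in: 1,000 executions, 1,000 compilations.
for pid in range(1000):
    run(f"SELECT TOP {pid % 7 + 1} * FROM Review WHERE ProductId IN ({pid})")

print(compilations)  # 1001
```

With a shared-parameterized workload the compile gate is nearly idle; with per-call unique text, every execution queues behind it.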

The amplifier: Query Store (QDS) was enabled. It intercepts every compilation to record the plan and stats. With thousands of compilations per second, QDS writes became the bottleneck on top of the compile gate.

Worker thread exhaustion: Each session waiting to compile holds a worker thread. SQL Server auto-configured 2,400 max workers for 64 CPUs. With 2,574 sessions in the compile queue alone:

SELECT SUM(work_queue_count) as queued, SUM(active_workers_count) as active_workers 
FROM sys.dm_os_schedulers WHERE status = 'VISIBLE ONLINE'

-- queued: 1817, active_workers: 2510

1,817 tasks queued with no workers available. SQL Server couldn't even process new login handshakes — which is why external connections were timing out.

The Death Spiral

More connections → more unique queries → more compilations needed
→ compile gate full → sessions wait holding workers
→ no workers available → can't process logins → new connections retry
→ connection pool pressure → more connections
→ REPEAT

What We Tried (Nothing Worked)

-- Increased max worker threads from 2400 to 3000
EXEC sp_configure 'max worker threads', 3000; RECONFIGURE;
-- Result: workers filled up to 3,109 immediately

-- Evicted 2000 single-use cached plans
DECLARE @h varbinary(64);
DECLARE c CURSOR FOR SELECT TOP (2000) plan_handle FROM sys.dm_exec_cached_plans
WHERE objtype = 'Prepared' AND usecounts = 1;
OPEN c; FETCH NEXT FROM c INTO @h;
WHILE @@FETCH_STATUS = 0 BEGIN DBCC FREEPROCCACHE(@h); FETCH NEXT FROM c INTO @h; END
CLOSE c; DEALLOCATE c;
-- Result: no improvement, compile waits stayed at ~1,600

-- Tried to clear all cached plans
DBCC FREEPROCCACHE
-- Result: HUNG. No workers available to execute it.

-- Tried to disable Query Store
ALTER DATABASE shopry_shopifyproductaddons SET QUERY_STORE = OFF
-- Result: HUNG. Needs exclusive DB access, can't get it.

-- Tried read-only mode
ALTER DATABASE shopry_shopifyproductaddons SET QUERY_STORE (OPERATION_MODE = READ_ONLY)
-- Result: HUNG.

Every remediation command requires a worker thread to execute. All workers are consumed. The only way out is to restart SQL Server.

The Restart

Restart-Service MSSQLSERVER -Force

This also hung — SQL Server couldn't gracefully shut down with 5,000+ stuck sessions. Had to force-kill:

Stop-Process -Name sqlservr -Force
Start-Service MSSQLSERVER

The database went through recovery (shopry_shopifyproductaddons showed RECOVERING state). Meanwhile, all 40 ECS tasks immediately tried to reconnect — 7,662 connections hit the server within seconds of it coming online, threatening to spiral again.
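Scaling clients down was the fix here; as a general client-side guard against this kind of reconnect storm (not something the team had in place), jittered exponential backoff keeps the herd from arriving in lockstep:

```python
import random

def reconnect_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Full-jitter exponential backoff: wait a uniform amount in
    [0, min(cap, base * 2**attempt)) before the next reconnect attempt."""
    return rng() * min(cap, base * (2 ** attempt))

# Worst case (jitter at its maximum) grows 0.5s -> 30s and then stays
# capped, spreading reconnects out instead of slamming a recovering server.
worst = [reconnect_delay(a, rng=lambda: 1.0) for a in range(8)]
print(worst)  # [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

With 7,662 clients, the uniform jitter matters more than the exponent: it smears each retry wave across the whole window rather than concentrating it at one instant.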

Breaking the Cycle

We had a narrow window after restart to:

  1. Scale down ECS to 10 tasks — reduce connection flood
  2. Scale down k8s consumers — censor from 8→4, webhooks from 5→2
  3. Disable Query Store while connections were low:
    ALTER DATABASE shopry_shopifyproductaddons SET QUERY_STORE = OFF
  4. Gradually scale back up — 10 → 20 tasks, watching compile waits

Result after stabilization:

SELECT TOP 5 wait_type, COUNT(*) as cnt 
FROM sys.dm_exec_requests WHERE status = 'suspended' 
GROUP BY wait_type ORDER BY cnt DESC

-- ASYNC_NETWORK_IO    13   (normal)
-- PAGEIOLATCH_SH       7   (normal disk reads)
-- CXCONSUMER            5   (normal parallelism)
-- LCK_M_IX              4   (minor, acceptable)

Zero compile waits. Server healthy at 100% CPU but doing actual work, not stuck in a spiral.


The Full Kill Chain

1. ReadUncommitted removed (Oct 2023)
   → Widget SELECTs take shared locks
   
2. Censor UPDATEs block widget SELECTs  
   → Queries wait 59 seconds holding connections
   
3. Connection pools exhaust
   → "Max pool size reached" errors
   
4. Envoy 15s timeout kills requests before backend responds
   → Cache never populates
   
5. Every request is a cache miss → hammers backend
   → More connections, more lock contention
   
6. Entity Framework generates unique query shapes
   → Thousands of compilations/sec
   
7. Query Store adds write overhead to each compilation
   → Compile gate saturates
   
8. Compile queue consumes all worker threads
   → Server can't process logins
   → Can't run diagnostic queries
   → Can't run remediation commands
   → GAME OVER

Diagnostic Commands Cheat Sheet

When SQL Server is Responsive

-- Check what sessions are waiting on
SELECT TOP 10 wait_type, COUNT(*) as cnt 
FROM sys.dm_exec_requests WHERE status = 'suspended' 
GROUP BY wait_type ORDER BY cnt DESC

-- Check worker thread exhaustion (THIS IS THE KEY ONE)
SELECT SUM(work_queue_count) as queued, 
       SUM(active_workers_count) as active_workers 
FROM sys.dm_os_schedulers WHERE status = 'VISIBLE ONLINE'

-- Check connection count
SELECT COUNT(*) as connections FROM sys.dm_exec_connections

-- Check blocking chains
SELECT r.session_id, r.blocking_session_id, r.wait_type, 
       r.wait_time/1000 as wait_secs,
       SUBSTRING(t.text,1,100) as query_text
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t
WHERE r.blocking_session_id > 0 ORDER BY r.wait_time DESC

-- Check plan cache bloat
SELECT objtype, COUNT(*) as cnt, 
       SUM(CAST(size_in_bytes AS BIGINT))/1024/1024 as size_mb 
FROM sys.dm_exec_cached_plans GROUP BY objtype ORDER BY size_mb DESC

-- Check compilation rate (run twice, 10s apart, compare)
SELECT * FROM sys.dm_exec_query_optimizer_info WHERE counter = 'optimizations'

When SQL Server is Unresponsive (RDP into the box)

# Check connection count from OS level
(netstat -an | findstr :1433 | findstr ESTABLISHED | measure).Count

# Check pending connections
(netstat -an | findstr :1433 | findstr SYN_RECEIVED | measure).Count

# Restart SQL Server (graceful)
Restart-Service MSSQLSERVER -Force

# If graceful restart hangs (check status)
Get-Service MSSQLSERVER | Select Status
# If StopPending for >5 minutes:
Stop-Process -Name sqlservr -Force
Start-Service MSSQLSERVER

# Check error log after restart
Get-Content "C:\Program Files\Microsoft SQL Server\MSSQL16.MSSQLSERVER\MSSQL\Log\ERRORLOG" -Tail 30

Reducing Pressure

# Scale down ECS tasks
aws ecs update-service --cluster shared --service app-backend-api-prod-new --desired-count 10

# Scale down k8s consumers
kubectl scale deployment app-backend-cron-review-censor -n prod --replicas=0
kubectl scale deployment app-backend-cron-shopify-webhooks -n prod --replicas=0

# Check ECS task count
uv run find_ecs_service.py app-backend-api-prod

After Recovery (Do These Immediately)

-- Disable Query Store while connections are low
ALTER DATABASE shopry_shopifyproductaddons SET QUERY_STORE = OFF

-- Verify clean state
SELECT TOP 5 wait_type, COUNT(*) as cnt 
FROM sys.dm_exec_requests WHERE status = 'suspended' 
GROUP BY wait_type ORDER BY cnt DESC
-- Should show ASYNC_NETWORK_IO, PAGEIOLATCH_SH — NOT RESOURCE_SEMAPHORE_QUERY_COMPILE

Changes Made

| Change                                         | Type              | Status   |
|------------------------------------------------|-------------------|----------|
| Restore ReadUncommitted for CreateWidget       | Code (PR #2156)   | Deployed |
| Increase Envoy timeout 15s → 120s              | Terraform         | Applied  |
| Increase Redis lock TTL 30s → 90s              | Code (PR #256)    | In PR    |
| max worker threads 2400 → 3000                 | SQL Server config | Applied  |
| Query Store disabled                           | SQL Server config | Applied  |
| Cache TTL 8h → 12h                             | Code              | Deployed |
| Widget proxy added to k8s                      | k8s-mvp           | Deployed |
| OTEL collector scaled to 5 replicas, 2Gi       | k8s-mvp           | Deployed |
| Consolidated 13 CI workflows into 1            | GitHub Actions    | Deployed |

Lessons Learned

  1. ReadUncommitted matters for read-heavy endpoints — one commented-out line caused over two years of intermittent lock contention that only became critical under high load
  2. Query Store on high-compilation workloads is dangerous — it amplifies the compile gate bottleneck
  3. The compile death spiral is self-reinforcing — once worker threads are exhausted, you can't even run the commands to fix it
  4. Connection pool × task count = total connections — autoscaling ECS to 45 tasks with 100-connection pools means 4,500 connections competing for SQL Server resources
  5. Envoy Service Connect has a 15s default timeout — if your backend is slower than that, nothing gets cached and every request is a miss
  6. Always have RDP access to self-managed databases — when SQL Server is unresponsive, kubectl/remote queries won't work