@micahstubbs
Last active January 24, 2026 13:28
OOM Investigation Notes - 2026-01-24 (redacted)

OOM Investigation - 2026-01-24

Summary

An OOM (Out of Memory) event occurred at 2026-01-24 00:56:33 that killed 48 processes and triggered a logout/login cycle (not a reboot - uptime is 13+ days).

Evidence Gathered

OOM Event Timeline

  • 00:56:13: GNOME session started shutting down services
  • 00:56:18: Zoom segfault (likely triggered by memory pressure)
  • 00:56:33: OOM killer activated, 48 processes SIGKILL'd
  • 00:56:33: [email protected] terminated
  • 00:56:33: New GNOME session started (PID 3806336)

Processes Killed (by type)

Process Type    Count
bd (beads)      14
zsh             11
http-server     5
python          4
claude          2
sh              2
MainThread      2
zoom            1
Others          7
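
A per-name tally like the one above can be rebuilt from the kernel log. This is a minimal sketch, assuming the kills were logged by the kernel OOM killer with its usual "Killed process <pid> (<name>)" wording; kills issued by systemd during session teardown would not appear in this count. The function name `tally_oom_kills` is made up for illustration.

```shell
# Tally kernel OOM kills by process name.
# Reads kernel log lines on stdin and keeps only lines that contain
# "Killed process <pid> (<name>)"; everything else is ignored.
tally_oom_kills() {
  sed -n 's/.*Killed process [0-9][0-9]* (\([^)]*\)).*/\1/p' |
    sort | uniq -c | sort -k1,1nr -k2
}
```

Possible usage, with the time window taken from the timeline above:

    journalctl -k --since "2026-01-24 00:50" --until "2026-01-24 01:00" | tally_oom_kills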

Key Observations

  1. 14 orphaned bd processes - Beads daemon processes accumulated and weren't cleaned up
  2. 5 http-server processes - Multiple http-server instances running
  3. 2 Claude sessions killed (PIDs 948978, 1257542)
  4. Large Claude transcript files:
    • 491MB: worktree-1/bf35e43e-*.jsonl (Jan 22)
    • 348MB: worktree-1/45c23588-*.jsonl (Jan 24 - most recent)
    • 64 subagent files in worktree-2's 74MB subagent dir
  5. Memory state at investigation time (after the OOM): 42Gi of 62Gi RAM used, with 2.2Gi of 80Gi swap in use

Three Hypotheses

Hypothesis 1: Beads (bd) Process Accumulation (HIGHEST CONFIDENCE)

Evidence:

  • 14 orphaned bd processes were killed during OOM
  • These processes should terminate after completion but accumulated over 13+ days of uptime
  • Each bd process holds memory for issue parsing, graph computation, and IPC

Root Cause Theory: The bd (beads) command-line tool spawns processes for triage, listing, and other operations. When invoked via Claude Code subagents or shell commands, these processes may not properly terminate if:

  1. Parent processes exit before children
  2. Signal handlers don't propagate to forked bd instances
  3. The TUI component has zombie process handling issues

Confidence: HIGH (14 processes is a strong signal)

Test:

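# Count processes whose command name is exactly "bd" (re-checked every 60 s)
watch -n 60 'pgrep -xc bd'

# List bd processes that have been orphaned and reparented to PID 1.
# Under a systemd user session, orphans may instead be adopted by `systemd --user`,
# so also check for that manager's PID in the PPID column.
ps -eo pid,ppid,comm,args | awk '$2 == 1 && $3 == "bd"'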
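
Process count alone does not show how much memory the bd processes actually hold. A rough way to check is to total their resident set size (RSS). This is only a sketch: RSS counts shared pages once per process, so the total is an upper bound, and the helper name `rss_total_kb` is hypothetical.

```shell
# Sum RSS (KiB) for rows of "rss comm" on stdin whose command name equals $1.
rss_total_kb() {
  awk -v name="$1" '
    $2 == name { total += $1; n++ }
    END { printf "%d processes, %d KiB\n", n, total }'
}
```

Possible usage:

    ps -eo rss=,comm= | rss_total_kb bd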

Hypothesis 2: Claude Code Session Context Accumulation

Evidence:

  • Two massive transcript files: 491MB and 348MB
  • 64 subagent files in worktree-2 alone (74MB subagent directory)
  • Session 45c23588 ran for an extended period (last modified 00:01, still active at 00:56)
  • Claude processes were among those killed

Root Cause Theory: Long-running Claude Code sessions with:

  1. Large context windows loaded in memory
  2. Multiple subagents running concurrently
  3. Transcript files being written/read continuously
  4. No memory limits on Claude Code Node.js processes

Combined with the 13+ days of uptime, memory fragmentation and leaks in long-lived Node.js processes may have accumulated.

Confidence: MEDIUM-HIGH (large files are evidence, but unclear if loaded in memory)

Test:

# Check Claude process memory before and after session restart
# ([c]laude keeps grep from matching its own command line)
ps aux --sort=-rss | grep -E "[c]laude"

# Monitor during active session
watch -n 10 'ps aux --sort=-rss | head -5'
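
To re-check the transcript sizes cited as evidence, something like the following could list oversized .jsonl files. It is a sketch that assumes GNU find (for -printf); the transcript directory is passed as an argument because these notes do not give the full path, and `large_transcripts` is an illustrative name.

```shell
# List *.jsonl files under directory $1 that are larger than $2 MiB
# (default 300), as "<bytes><TAB><path>", largest first.
large_transcripts() {
  local dir="$1" min_mib="${2:-300}"
  find "$dir" -type f -name '*.jsonl' -size +"${min_mib}"M -printf '%s\t%p\n' |
    sort -nr
}
```

Possible usage, with a placeholder path:

    large_transcripts /path/to/transcripts 300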

Hypothesis 3: http-server Process Leak from Claude Code Automation

Evidence:

  • 5 http-server processes killed during OOM
  • http-server is commonly spawned by Claude Code for previewing HTML/web content
  • These processes run in background and may not be cleaned up

Root Cause Theory: Claude Code workflows frequently spawn http-server for serving local files. When:

  1. Sessions are interrupted or crash
  2. Terminal contexts are lost
  3. Background processes are forked without cleanup tracking

These http-server instances persist indefinitely, each consuming memory.

Confidence: MEDIUM (5 processes is notable but probably not the primary cause)

Test:

# Check for orphaned http-server processes (pgrep never matches itself)
pgrep -af http-server

# Find http-server parent relationships
pstree -p | grep http-server
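
Grouping the instances by parent PID can show whether they share one spawner or have already been reparented. A minimal sketch, assuming the process name appears as "http-server" (as it did in the OOM kill list); if it shows up as node instead, match on the full command line. `count_by_parent` is a hypothetical helper.

```shell
# Count processes named $1 per parent PID, from "ppid comm" rows on stdin.
# Output rows are "<ppid> <count>", highest count first.
count_by_parent() {
  awk -v name="$1" '
    $2 == name { n[$1]++ }
    END { for (p in n) print p, n[p] }' |
    sort -k2,2nr -k1,1n
}
```

Possible usage:

    ps -eo ppid=,comm= | count_by_parent http-server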

Recommended Mitigations

Immediate Actions

  1. Clean up orphaned processes regularly:

    # Add to cron (weekly)
    # -x matches the exact process name; ";" runs the second pkill even if no bd process was found
    pkill -x bd; pkill -f "http-server"

    Refine this if you have a long-running http-server instance that you actually do want to keep alive.

  2. Monitor process accumulation:

    # Add to ~/.bashrc or cron
    bd_count=$(pgrep -c -u "$(id -un)" -x bd)
    if [ "$bd_count" -gt 5 ]; then
      notify-send "Warning: $bd_count bd processes running"
    fi
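
One way to refine the cleanup in step 1 is to select only instances that have been running longer than some threshold, so that recently started servers are left alone. This is a sketch, not a tested policy: the one-day threshold in the usage line is an arbitrary example and `stale_pids` is an illustrative name.

```shell
# Print PIDs of processes named $1 that have been running for more than
# $2 seconds. Input rows on stdin: "pid etimes comm".
stale_pids() {
  awk -v name="$1" -v min_age="$2" '$3 == name && $2 > min_age { print $1 }'
}
```

Possible usage (xargs -r, a GNU extension, skips kill when nothing matches):

    ps -eo pid=,etimes=,comm= | stale_pids http-server 86400 | xargs -r kill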

Long-term Fixes

  1. Investigate beads process lifecycle - Why are bd processes not terminating?
  2. Set memory limits on Claude Code - Use systemd slice or cgroups
  3. Implement session cleanup hooks - Kill orphaned processes on session end
  4. Archive old Claude transcripts - Move 300MB+ files to cold storage
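
For item 2, a transient systemd scope is one way to cap memory without writing a unit file. The sketch below assumes cgroup v2 with the memory controller delegated to the user manager; the 8G cap in the usage line is an arbitrary example, and `run_with_memory_cap` and the `SYSTEMD_RUN` override are made up for illustration.

```shell
# Run a command in a transient user scope with a hard memory cap.
# $1 is the cap (e.g. 8G); the remaining arguments are the command.
# SYSTEMD_RUN can be overridden (e.g. SYSTEMD_RUN=echo) to preview the
# systemd-run arguments without starting anything.
run_with_memory_cap() {
  local cap="$1"; shift
  "${SYSTEMD_RUN:-systemd-run}" --user --scope \
    -p MemoryMax="$cap" -p MemorySwapMax=0 -- "$@"
}
```

Possible usage:

    run_with_memory_cap 8G claude

With MemorySwapMax=0 the scope cannot spill into swap, so a runaway session is killed inside its own cgroup instead of pushing the whole desktop session into an OOM.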

Memory Snapshot at Investigation Time

total: 62Gi | used: 42Gi | available: 20Gi
swap:  80Gi | used: 2.2Gi
uptime: 13 days, 21:47

Current top memory consumers:

  1. <vpn-client>: 798MB
  2. <voice-daemon>: 728MB
  3. claude: 595MB
  4. gnome-shell: 515MB

Related Historical OOM Events

From boot 0 (current, since Jan 10):

  • Jan 10 03:38: gnome-shell, google-chrome, <voice-daemon> killed by OOM
  • Jan 24 00:56: Current event (48 processes killed)

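Pattern: The Jan 24 event followed 13+ days of uptime, but the Jan 10 event occurred on the day this boot started, so these two events alone do not establish that extended uptime correlates with OOM events.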
