@cgcardona
Last active February 27, 2026 22:03

🧭 Agent Failure Taxonomy & Solutions v1

A comprehensive rulebook for unattended autonomous engineering agents

This document captures real-world failure modes observed in long‑running AI agent workflows, especially multi‑agent, CLI‑driven, containerized environments like Stori/Maestro pipelines.

Each section includes:

  • ❌ Problem Pattern
  • 🔎 Signals
  • ✅ Recommended Solution Strategy


1. Cognitive / Reasoning Failures

🧠 Recursive Thinking Loop (Inception Mode)

Problem: Agent enters endless planning without executing. Signals: No commands, no diffs, repeated internal reasoning. Solutions:

  • Enforce planning time budget.
  • Require execution after N planning cycles.
  • Add "action watchdog" that resets agent if no commands executed.
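
A minimal sketch of the action watchdog, assuming the orchestrator calls `record_planning()` once per planning cycle and `record_command()` after every executed command. The class, its method names, and the default of 3 cycles are hypothetical; they are not taken from any existing framework.

```python
from dataclasses import dataclass


class PlanningBudgetExceeded(RuntimeError):
    """Raised when the agent keeps planning without executing anything."""


@dataclass
class ActionWatchdog:
    max_planning_cycles: int = 3  # hypothetical default for "N planning cycles"
    planning_cycles: int = 0

    def record_planning(self) -> None:
        """Count one planning cycle; trip once the budget is exhausted."""
        self.planning_cycles += 1
        if self.planning_cycles > self.max_planning_cycles:
            raise PlanningBudgetExceeded(
                f"{self.planning_cycles} planning cycles with no command executed"
            )

    def record_command(self) -> None:
        """Any executed command resets the planning counter."""
        self.planning_cycles = 0
```

The orchestrator would catch `PlanningBudgetExceeded` and reset or re-prompt the agent.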


🧠 Context Drift After Summarization

Problem: Agent loses execution state after compressing chat. Signals: Duplicate work, frozen state, lost cursor. Solutions:

  • Maintain machine‑readable execution state (step index, checkpoints).
  • Store resumable YAML state separate from summaries.


Quality–Time Tradeoff Drift (Silent Pragmatism)

Problem: Agent internally debates "best" vs "simple" and silently chooses a pragmatic shortcut, degrading architecture over time. Signals: Caller-side casts, partial fixes, tech debt accretion, repeated "quick fixes" across modules. Solutions:

  • Require an explicit execution budget for every task (pragmatic|balanced|optimal|research).
  • Enforce: agent MUST NOT silently escalate/shift budget; must request upgrade.
  • Require decision logging: for any non-trivial tradeoff, emit "budget decision" event with rationale and alternatives.
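
One possible shape for the "budget decision" event, as a sketch. The four budget levels come from the rule above; the field names (`task_id`, `chosen`, `rationale`, `alternatives`) and the JSON layout are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum


class Budget(str, Enum):
    PRAGMATIC = "pragmatic"
    BALANCED = "balanced"
    OPTIMAL = "optimal"
    RESEARCH = "research"


@dataclass(frozen=True)
class BudgetDecision:
    task_id: str      # hypothetical identifier for the task being worked on
    budget: Budget    # the budget the task was assigned, never silently changed
    chosen: str       # the option the agent picked
    rationale: str    # why it picked that option
    alternatives: list[str] = field(default_factory=list)

    def to_event(self) -> str:
        """Serialize the decision as one JSON log line."""
        payload = {"event": "budget_decision", **asdict(self)}
        payload["budget"] = self.budget.value
        return json.dumps(payload, sort_keys=True)
```

Emitting one such line per non-trivial tradeoff makes silent shortcuts visible in review.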

Type-System Evasion (Mypy Box-Checking)

Problem: Agent "fixes" mypy by using Any, caller-side casts, or type: ignore rather than architecting correct entities and callee signatures. Signals: cast(...) at call sites, proliferation of Any, new # type: ignore comments in core logic. Solutions:

  • Callee-first rule: if caller needs a cast, fix the callee return type and propagate typed entities.
  • Ban "naked innies": no dict[str, Any] or list[dict] across internal layers; wrap into typed dataclasses/Pydantic models.
  • If Any is unavoidable: wrap as OpaquePayload(raw: Any) at a boundary protocol and quarantine the Any.
  • type: ignore allowed ONLY at explicit boundary adapters (3rd-party libs, serialization, protocol bridges) and must include justification.
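
A sketch of the quarantine pattern, assuming `OpaquePayload` is a plain frozen dataclass. `Note` and `parse_note` are hypothetical names used only to show a boundary adapter that converts the untyped value into a typed entity.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class OpaquePayload:
    """Quarantines an untyped value at a boundary; inner layers never unpack it."""
    raw: Any


@dataclass(frozen=True)
class Note:  # hypothetical typed entity used by internal layers
    pitch: int
    duration_beats: float


def parse_note(payload: OpaquePayload) -> Note:
    """Boundary adapter: the only place that is allowed to touch payload.raw."""
    raw = payload.raw
    if not isinstance(raw, dict):
        raise TypeError("expected a JSON object for Note")
    return Note(pitch=int(raw["pitch"]), duration_beats=float(raw["duration_beats"]))
```

Because `parse_note` returns `Note`, callers need no cast and `Any` never leaves the adapter.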

🧠 Reference Pattern Neglect

Problem: Agent fixes failing tests repeatedly instead of copying patterns from passing tests. Solutions:

  • Require repo search for similar passing tests before patching.
  • Add "golden pattern scan" rule before retry loops.


🧠 Retry Without Strategy Mutation

Problem: Agent retries same commands with small parameter tweaks. Solutions:

  • After 2 identical failures → force algorithmic strategy change.
  • Maintain retry fingerprint hash.
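
A sketch of a retry fingerprint. It is an assumption of this example that the fingerprint covers the program name, exit code, and last non-empty error line rather than the full argument list, so that small parameter tweaks still count as the same failure. `RetryTracker` and its threshold of 2 mirror the rule above but are hypothetical names.

```python
import hashlib


def retry_fingerprint(command: list[str], exit_code: int, error_text: str) -> str:
    """Hash the program, exit code and last error line into a stable fingerprint."""
    lines = [ln.strip() for ln in error_text.splitlines() if ln.strip()]
    last_error = lines[-1] if lines else ""
    program = command[0] if command else ""
    material = "\x00".join([program, str(exit_code), last_error])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


class RetryTracker:
    def __init__(self, max_identical_failures: int = 2) -> None:
        self.max_identical_failures = max_identical_failures
        self._seen: dict[str, int] = {}

    def must_change_strategy(self, fingerprint: str) -> bool:
        """Record one failure; True once the same fingerprint has hit the limit."""
        self._seen[fingerprint] = self._seen.get(fingerprint, 0) + 1
        return self._seen[fingerprint] >= self.max_identical_failures
```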


2. Orchestration & Multi-Agent Failures

👥 Orphaned Subagent

Problem: Child agent spins forever; parent unaware. Solutions:

  • Heartbeat monitoring.
  • Mandatory timeout budgets.
  • Parent watchdog kills stale subagents.


Watchdog Chain Paradox (Who Watches the Watchdogs?)

Problem: Parent watchdog exists, but parent itself can stall; nested swarms can deadlock if supervisor doesn't have its own liveness guarantees. Signals: Entire pipeline stalls with no progressing heartbeats; children still "running" but no coordinator actions. Solutions:

  • Add a top-level Supervisor watchdog (outside the agent graph) that monitors all agent heartbeats.
  • Add explicit agent lifecycle states: SPAWNED/RUNNING/STALLED/RECOVERING/FAILED/COMPLETED.
  • Require per-layer timeouts + escalation path (L3→L2→L1→Supervisor).
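
A sketch of a supervisor that sits outside the agent graph and tracks heartbeats for every agent, parents included. The lifecycle states are the ones listed above; the `Supervisor` class, its 60-second default, and the injectable `now` parameter (used to keep the example testable) are assumptions.

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class AgentState(str, Enum):
    SPAWNED = "SPAWNED"
    RUNNING = "RUNNING"
    STALLED = "STALLED"
    RECOVERING = "RECOVERING"
    FAILED = "FAILED"
    COMPLETED = "COMPLETED"


@dataclass
class Supervisor:
    stale_after_s: float = 60.0  # hypothetical heartbeat timeout
    last_beat: dict[str, float] = field(default_factory=dict)
    state: dict[str, AgentState] = field(default_factory=dict)

    def heartbeat(self, agent_id: str, now: Optional[float] = None) -> None:
        """Record a heartbeat and mark the agent RUNNING."""
        self.last_beat[agent_id] = time.monotonic() if now is None else now
        self.state[agent_id] = AgentState.RUNNING

    def sweep(self, now: Optional[float] = None) -> list[str]:
        """Mark RUNNING agents with stale heartbeats as STALLED; return their ids."""
        now = time.monotonic() if now is None else now
        stalled = []
        for agent_id, beat in self.last_beat.items():
            if self.state[agent_id] is AgentState.RUNNING and now - beat > self.stale_after_s:
                self.state[agent_id] = AgentState.STALLED
                stalled.append(agent_id)
        return stalled
```

Since parents report through the same `heartbeat` call as children, a stalled coordinator is detected the same way as a stalled subagent.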

Semantic Telephone Between Agent Layers

Problem: Plan/task intent mutates as it is handed off (Manager β†’ Agent β†’ Subagent). Hallucination creeps in between layers. Signals: Child executes a different interpretation than the parent spec; mismatched constraints; "I thought we meant..." Solutions:

  • Require explicit, versioned "contracts" between layers (schema + invariants).
  • Require spec hash verification at every hop; reject mismatched hash.
  • Make advisory vs structural fields explicit; structural must be enforced by code, not prompts.
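
A sketch of spec hash verification at a hop. Hashing canonical JSON (sorted keys, compact separators) with SHA-256 is an assumption of this example; the document does not prescribe a hash scheme. `spec_hash`, `accept_handoff`, and `SpecHashMismatch` are hypothetical names.

```python
import hashlib
import json


class SpecHashMismatch(ValueError):
    """Raised when a handed-off spec does not match the hash the sender declared."""


def spec_hash(spec: dict) -> str:
    """SHA-256 over canonical JSON, so key order does not change the hash."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def accept_handoff(spec: dict, expected_hash: str) -> dict:
    """Reject the spec unless it hashes to the value declared by the parent layer."""
    actual = spec_hash(spec)
    if actual != expected_hash:
        raise SpecHashMismatch(f"expected {expected_hash[:12]}, got {actual[:12]}")
    return spec
```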

👥 Spec Drift Between Agents

Problem: Parent updates plan; subagent continues old instructions. Solutions:

  • Versioned spec injection.
  • Require spec hash verification before execution.


👥 Cross-Agent Race Conditions

Problem: Agents modify same files simultaneously. Solutions:

  • File locking.
  • Branch-per-agent model.
  • Merge gates.


3. Build & Project Graph Failures

🧱 Target Membership Blindness

Problem: Agent creates Swift files but does not add them to the project's build phases. Signals: Types not found, build loops. Solutions:

  • Mandatory project file patch after file creation.
  • Verify PBXBuildFile entries exist.


Build Output Filtering (False Failure / False Success)

Problem: Agent pipes build/test output through grep/head/tail and loses the true exit status signal, causing rebuild loops or false "done". Signals: Rebuild repeats despite successful compilation; agent "doesn't see" success; truncated logs. Solutions:

  • Ban piping build/test output through filters.
  • Always treat process exit code as authoritative; capture full logs to file if needed.
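
A sketch of running a build or test command without any shell filter in between. Output goes to a log file in full and the exit code is returned unchanged. `run_and_log` is a hypothetical helper name; `subprocess.run` is the standard-library call.

```python
import subprocess
from pathlib import Path


def run_and_log(command: list[str], log_path: Path) -> int:
    """Run a command unfiltered, write its full output to log_path, return its exit code."""
    with log_path.open("wb") as log:
        completed = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
    return completed.returncode
```

The agent decides success from `run_and_log(...) == 0` and reads the log file only for diagnosis, so a `grep` or `head` exit status can never stand in for the build's own.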

🧱 Workspace Shadowing

Problem: Agent edits code in one path while build runs elsewhere. Solutions:

  • Enforce canonical repo root.
  • Validate file hash inside container before build.


🧱 Dependency Graph Blind Spots

Problem: Agent edits local file but bug exists in another module. Solutions:

  • Require cross-module symbol search before edits.


4. Environment & Container Failures

🐳 Container State Desync

Problem: Agent forgets to rebuild containers. Solutions:

  • Rebuild on dependency file change.
  • Add container health checks.


Ephemeral /tmp Artifact Loss (Rebuild Nukes Work)

Problem: Agent saves artifacts/checkpoints/results under /tmp inside container; rebuild wipes /tmp; agent assumes state exists and resumes incorrectly. Signals: Missing checkpoint after rebuild; progress resets to 0; "file not found" after container rebuild. Solutions:

  • Enforce a canonical durable workspace root (e.g. /workspace) and require all artifacts live under it.
  • Add path guard: fail hard if writing outside /workspace for persistent outputs.
  • Mount /workspace to a named volume or host dir; include preflight durability test (write/read sentinel).
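
A sketch of the path guard and the write/read sentinel. The durable root is passed in as a parameter (in production it would be `/workspace`); `guarded_path`, `preflight_sentinel`, and `PathGuardError` are hypothetical names.

```python
import uuid
from pathlib import Path


class PathGuardError(RuntimeError):
    """Raised when a persistent output would land outside the durable root."""


def guarded_path(root: Path, relative: str) -> Path:
    """Resolve `relative` under `root`; fail hard if it escapes the durable root."""
    root = root.resolve()
    candidate = (root / relative).resolve()
    if candidate != root and root not in candidate.parents:
        raise PathGuardError(f"{candidate} is outside durable root {root}")
    return candidate


def preflight_sentinel(root: Path) -> None:
    """Write and read back a sentinel file to prove the root is writable."""
    sentinel = guarded_path(root, f".sentinel-{uuid.uuid4().hex}")
    token = uuid.uuid4().hex
    sentinel.write_text(token)
    try:
        if sentinel.read_text() != token:
            raise PathGuardError("sentinel read-back mismatch")
    finally:
        sentinel.unlink()
```

A same-process read-back only proves the root is writable. To prove it survives a rebuild, the sentinel would have to be left in place and checked again after the container restarts.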

Host/Container Execution Split-Brain

Problem: Agent runs Python / pytest / pydantic on host instead of inside container, producing inconsistent deps and phantom failures. Signals: "works in container, fails on host" (or vice versa); missing deps; pydantic version mismatch. Solutions:

  • Hard rule: all Python execution must run via docker compose exec ...
  • Provide canonical commands and a wrapper script (make targets) so agent can't choose host accidentally.
  • Add guard that detects host execution and aborts with instructions.

Secrets / .env Permission Deadlock

Problem: Agent needs to edit .env or create dirs, but freezes waiting for permission (or edits without permission). Signals: Agent loops "can't proceed" or silently modifies secrets. Solutions:

  • Capability policy: allow_env_edit, allow_mkdir, allow_web_search, allow_python_exec, etc.
  • Enforce explicit "REQUESTED vs EXECUTED" state; never retry permission-blocked actions without new user input.
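
A sketch of a capability policy that keeps REQUESTED and EXECUTED apart. The capability names are the ones listed above; the `CapabilityPolicy` class and its methods are hypothetical. A second attempt at a still-blocked action raises instead of retrying.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class ActionState(str, Enum):
    REQUESTED = "requested"  # blocked, waiting for new user input
    EXECUTED = "executed"


@dataclass
class CapabilityPolicy:
    allowed: set[str] = field(default_factory=set)  # e.g. {"allow_mkdir"}
    pending: set[str] = field(default_factory=set)

    def attempt(self, capability: str, action: Callable[[], None]) -> ActionState:
        """Run the action if permitted; otherwise record a single request."""
        if capability in self.allowed:
            action()
            self.pending.discard(capability)
            return ActionState.EXECUTED
        if capability in self.pending:
            raise PermissionError(f"{capability} already requested; waiting for user input")
        self.pending.add(capability)
        return ActionState.REQUESTED

    def grant(self, capability: str) -> None:
        """Called only in response to new user input."""
        self.allowed.add(capability)
```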

🐳 Resource Exhaustion (Exit Code 137)

Problem: Python batch jobs OOM when looping large MIDI datasets. Solutions:

  • Switch to streaming pipeline.
  • Process single file batches.
  • Persist incremental JSON results.


🐳 Permission Mirage

Problem: Command requested but awaiting human approval. Solutions:

  • Distinguish REQUESTED vs EXECUTED state.
  • Pause workflow until confirmation received.


5. Long‑Running Job Failures

📊 Frozen Progress Blindness

Problem: Logs remain alive but progress count stops increasing. Solutions:

  • Track delta(progress).
  • Abort if unchanged after N polls.


📊 Checkpoint Blindness

Problem: Agent restarts job from index 0 instead of saved checkpoint. Solutions:

  • Inspect output directory before restart.
  • Resume from last processed index.


Checkpoint Regression (Restart Loses Resume Point)

Problem: Agent restarts a long job and accidentally resets progress to 0 instead of resuming from checkpoint. Signals: processed_count returns to 0; output overwritten; repeated first N files. Solutions:

  • Require run_id + manifest.json; refuse to start if manifest exists and resume flag not set.
  • Write checkpoints atomically (tmp file + rename).
  • Require "resume proof" before restart: show last checkpoint + last processed id.
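
A sketch of atomic checkpoint writes and the manifest guard. `manifest.json`, `run_id`, and `processed_count` come from the text above; `last_processed_id`, `write_checkpoint`, and `start_run` are hypothetical names. The temp file is created in the target's own directory because `os.replace` is only atomic within one filesystem.

```python
import json
import os
import tempfile
from pathlib import Path


def write_checkpoint(path: Path, state: dict) -> None:
    """Write JSON to a temp file in the same directory, then rename over the target."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, path)  # readers see either the old or the new file
    except BaseException:
        os.unlink(tmp_name)
        raise


def start_run(run_dir: Path, run_id: str, resume: bool) -> dict:
    """Refuse to start over an existing manifest unless resume is requested."""
    manifest = run_dir / "manifest.json"
    if manifest.exists():
        if not resume:
            raise FileExistsError(f"{manifest} exists; pass resume=True to continue")
        return json.loads(manifest.read_text())
    state = {"run_id": run_id, "last_processed_id": None, "processed_count": 0}
    write_checkpoint(manifest, state)
    return state
```

The dictionary returned on resume doubles as the "resume proof": it shows the last checkpoint and last processed id before any work restarts.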

Progress Counter Stasis (Logs Alive, Work Frozen)

Problem: Agent tails logs and assumes success because logs exist; counter doesn't increase (deadlock, stuck IO). Signals: last_processed_count unchanged across polls; timestamps stale; no heartbeat events. Solutions:

  • Require delta(progress) monitoring with time window thresholds.
  • Add heartbeat: emit processed_count + last_increment_ts every N items.
  • Watchdog abort if no delta for N minutes; capture stack dump / thread dump.
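
A sketch of delta(progress) monitoring with a time window. `ProgressWatchdog` is a hypothetical name, and the caller is assumed to supply `processed_count` and the poll time (in seconds) on each poll.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProgressWatchdog:
    max_stall_s: float  # abort threshold: the "N minutes" above, in seconds
    last_count: Optional[int] = None
    last_increment_ts: Optional[float] = None

    def observe(self, processed_count: int, now: float) -> bool:
        """Record one poll. Returns True while the job is considered alive."""
        if self.last_count is None or processed_count > self.last_count:
            self.last_count = processed_count
            self.last_increment_ts = now
            return True
        # Counter did not move: alive only while inside the stall window.
        return (now - self.last_increment_ts) <= self.max_stall_s
```

A `False` return is the signal to capture a stack dump and abort, no matter how much log output is still arriving.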

📊 Output Buffering Illusion

Problem: Logs buffered → agent thinks job stalled. Solutions:

  • Use unbuffered Python (-u).
  • Parse timestamps instead of output volume.


📊 Script Sleep Loop Desync

Problem: Agent sleeps while script already failed. Solutions:

  • Track PID + exit codes.
  • Verify process alive before sleeping again.
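
A sketch of a sleep loop that checks the child process before every sleep. `wait_for_exit` is a hypothetical helper; `Popen.poll()` is the standard-library call and returns `None` while the process is still running.

```python
import subprocess
import time


def wait_for_exit(proc: subprocess.Popen, poll_interval_s: float, timeout_s: float) -> int:
    """Poll the child between sleeps; return its exit code as soon as it ends."""
    deadline = time.monotonic() + timeout_s
    while True:
        code = proc.poll()  # None while the process is still running
        if code is not None:
            return code
        if time.monotonic() >= deadline:
            proc.kill()
            proc.wait()
            raise TimeoutError(f"pid {proc.pid} still running after {timeout_s}s")
        time.sleep(poll_interval_s)
```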


6. Debugging & Runtime Failures

🔬 No Backtrace Before Patch

Problem: Agent edits code after a crash without first collecting a backtrace (e.g. with LLDB). Solutions:

  • Require backtrace collection before editing logic.


🔬 Sanitizer Avoidance

Problem: Agent never runs TSan/ASan. Solutions:

  • Mandatory sanitizer builds after runtime failures.


Missing ToolError / Log Triage (False "All Good")

Problem: Agent checks logs but misses explicit toolError / error lines and reports success. Signals: toolError events in SSE/logs; HTTP stream errors; circuit breaker open; Orpheus unavailable. Solutions:

  • Define a hard "red-flag" pattern set (toolError, ERROR -, Traceback, circuit_breaker_open, Orpheus unavailable).
  • Require log scan that fails the run if any red-flag appears after start_ts.
  • Require explicit "health gates": Orpheus health + LLM stream health before claiming success.
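
A sketch of the red-flag log scan. The pattern set is the one listed above. The log format is an assumption: each line starts with an ISO-8601 timestamp followed by a space, and lines without a parseable timestamp (such as a traceback body) inherit the previous line's timestamp.

```python
from datetime import datetime

RED_FLAGS = ("toolError", "ERROR -", "Traceback", "circuit_breaker_open", "Orpheus unavailable")


def red_flag_lines(log_lines: list[str], start_ts: datetime) -> list[str]:
    """Return lines at or after start_ts that contain any red-flag pattern."""
    hits = []
    current_ts = None
    for line in log_lines:
        try:
            current_ts = datetime.fromisoformat(line.split(" ", 1)[0])
        except ValueError:
            pass  # no timestamp on this line: keep the previous one
        if current_ts is not None and current_ts >= start_ts:
            if any(flag in line for flag in RED_FLAGS):
                hits.append(line)
    return hits
```

The run fails if the returned list is non-empty; failures from before `start_ts` are ignored so that a previous run's errors do not block the current one.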

🔬 Concurrency Illusions

Problem: Agent prematurely assumes async tasks have finished. Solutions:

  • Require completion signals or explicit awaits.


7. Resource & Performance Failures

📉 Metric Plateau Blindness

Problem: Agent believes metrics are improving despite flat results. Solutions:

  • Require delta thresholds.


📉 Resource Throttling Misdiagnosis

Problem: Agent misinterprets CPU or IO throttling as logic bug. Solutions:

  • Monitor system metrics before patching code.


8. Control Plane / UX Failures

⛔ Interactive Permission Boundaries

Problem: Agent stalls waiting for user approval dialogs. Solutions:

  • Explicit approval state detection.
  • Prompt user clearly.


Web Search Permission Stall

Problem: Agent freezes waiting for permission to browse/search; workflow halts without escalation. Signals: agent does nothing; repeated "need to search web" statements. Solutions:

  • Capability policy includes allow_web_search.
  • If denied: require fallback plan that does not require web OR explicit escalation request.

🧾 False Completion

Problem: Agent reports "Done" merely because the last command succeeded. Solutions:

  • Require artifact validation.
  • Require the canonical build and tests to pass.


9. Data & File Management Failures

🧹 Cleanup Blindness

Problem: Temp files cause incorrect skipping or duplication. Solutions:

  • Enforce temp directory policies.


Revert Nukes Work (Over-Broad Undo)

Problem: Agent reverts an entire file to undo a small change, deleting unrelated critical work. Signals: large diff reversals; file timestamp rollback; "reverted file" without surgical patch. Solutions:

  • Require surgical revert: git checkout -p / patch-level undo.
  • Before revert: show diff summary and confirm scope.
  • Prefer small incremental commits to make rollback safe.

Cascading Test Failure Blindness (Shared Root Cause)

Problem: Agent fixes one failing test caused by a shared change (magic number/API change), but doesn't anticipate other tests failing for the same reason. Signals: sequential failures with similar stack traces; repeated "fix test" loop. Solutions:

  • After any fix, search for the failing constant/symbol across tests and update all impacted tests as a batch.
  • Require pattern scan: grep for similar assertions or fixtures.

🧾 Workspace Drift After Sleep/Wake

Problem: Agent resumes with stale filesystem assumptions. Solutions:

  • Re-scan repo state on wake.


10. Structural Agent Design Principles (Summary)

To achieve unattended operation:

  • Use resumable checkpoints for all long jobs.
  • Enforce progress‑delta monitoring.
  • Separate environment debugging from code debugging.
  • Require debugging tools (LLDB, TSan, ASan).
  • Implement subagent watchdogs.
  • Maintain execution state separate from summaries.
  • Validate project graph after file creation.
  • Force strategy change after repeated failures.

πŸ” Git Workflow State Machine (Deterministic Phase Control)

Agents must not perform side-effects (branch creation, PR creation, merge) without consulting the canonical workflow state.

All GitHub operations must be idempotent.

WorkflowPhase

from enum import Enum

class WorkflowPhase(str, Enum):
    ISSUE_CREATED = "issue_created"
    BRANCH_CREATED = "branch_created"
    WORK_IN_PROGRESS = "work_in_progress"
    PR_CREATED = "pr_created"
    CI_RUNNING = "ci_running"
    READY_TO_MERGE = "ready_to_merge"
    MERGED = "merged"
    CLOSED = "closed"

Transitions must only move forward. Backward transitions require explicit human override.
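
A sketch of the forward-only rule. The enum is repeated from above so the example runs on its own. `validate_transition` and `InvalidTransition` are hypothetical names, and allowing any forward move (rather than single steps only) is an assumption of this sketch.

```python
from enum import Enum


class WorkflowPhase(str, Enum):
    ISSUE_CREATED = "issue_created"
    BRANCH_CREATED = "branch_created"
    WORK_IN_PROGRESS = "work_in_progress"
    PR_CREATED = "pr_created"
    CI_RUNNING = "ci_running"
    READY_TO_MERGE = "ready_to_merge"
    MERGED = "merged"
    CLOSED = "closed"


_ORDER = list(WorkflowPhase)  # definition order is the forward order


class InvalidTransition(ValueError):
    """Raised for a backward or same-phase transition without human override."""


def validate_transition(
    current: WorkflowPhase, target: WorkflowPhase, human_override: bool = False
) -> None:
    """Allow forward moves; anything else needs an explicit human override."""
    if _ORDER.index(target) > _ORDER.index(current):
        return
    if human_override:
        return
    raise InvalidTransition(f"{current.value} -> {target.value} is not a forward transition")
```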


🗂 Canonical Agent Task Registry

Each agent must operate against a single authoritative task record.

{
  "agent_id": "agent-001",
  "issue_number": 412,
  "branch": "agent/412-contract-lineage",
  "pull_request_number": 527,
  "phase": "pr_created",
  "ci_status": "pending",
  "attempt_counters": {
    "pr_creation_attempts": 1
  },
  "updated_at": "2026-02-27T18:10:00Z"
}

This record must be checked before:

  • Creating branches
  • Creating pull requests
  • Rebasing
  • Merging
  • Closing issues

🔒 Idempotent GitHub Wrapper Requirement

Agents must never call raw GitHub side-effects directly.

Instead, use wrapper functions that enforce idempotency:

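# branch_exists, create_branch, find_pr_by_branch and create_pr are assumed to be
# thin wrappers around the GitHub API; they are not shown here.

def ensure_branch(branch: str, base: str) -> None:
    """Create the branch only if it does not already exist."""
    if branch_exists(branch):
        return
    create_branch(branch, base)

def ensure_pull_request(branch: str, issue_number: int) -> int:
    """Return the PR number for this branch, creating a PR only if none exists."""
    existing = find_pr_by_branch(branch)
    if existing:
        return existing.number
    return create_pr(branch, issue_number)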

Duplicate PR creation must be impossible by design.


🚨 Anti-Loop Safeguards

Agents must track attempt counters for side-effects:

  • PR creation attempts
  • Merge attempts
  • Rebase attempts

If any attempt counter exceeds 1 without phase advancement:

→ Abort execution
→ Emit "LOOP_GUARD_TRIGGERED"
→ Escalate to orchestrator

Infinite retry loops are forbidden.
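
A sketch of the loop guard, assuming the caller reports the current phase with every side-effect attempt. `LoopGuard` and `LoopGuardTriggered` are hypothetical names. Counters reset when the phase changes, which is how "without phase advancement" is read here.

```python
from collections import Counter
from typing import Optional


class LoopGuardTriggered(RuntimeError):
    """Raised when a side-effect is attempted twice without phase advancement."""


class LoopGuard:
    def __init__(self) -> None:
        self._attempts: Counter = Counter()
        self._phase: Optional[str] = None

    def record_attempt(self, side_effect: str, phase: str) -> None:
        """Count one attempt; abort on the second attempt within the same phase."""
        if phase != self._phase:
            # Phase advanced, so earlier attempts no longer count.
            self._attempts.clear()
            self._phase = phase
        self._attempts[side_effect] += 1
        if self._attempts[side_effect] > 1:
            raise LoopGuardTriggered(
                f"LOOP_GUARD_TRIGGERED: {side_effect} retried in phase {phase}"
            )
```

Letting the exception propagate to the orchestrator gives both the abort and the escalation in one step.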


🧠 Pre-Action Verification Protocol

Before executing any side-effect:

  1. Read canonical registry
  2. Query GitHub for real state
  3. Confirm registry matches external state
  4. Confirm transition is valid

If registry and GitHub disagree:

→ Enter RECONCILIATION mode
→ Do not proceed blindly


🐳 Worktree Isolation Contract

Each agent must operate inside:

  • Dedicated git worktree
  • Dedicated branch
  • Dedicated container

Agents must not:

  • Mount root repository in multiple containers
  • Modify branches not assigned to them
  • Create arbitrary branch names

Branch naming format:

agent/<issue-number>-<slug>
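
A sketch of a validator for the branch naming format. The exact slug alphabet (lowercase letters, digits, single hyphens) is an assumption, since the format above only names the two parts. `parse_agent_branch` is a hypothetical helper.

```python
import re

# agent/<issue-number>-<slug>; the slug character set is an assumption.
BRANCH_RE = re.compile(r"^agent/(?P<issue>[1-9][0-9]*)-(?P<slug>[a-z0-9]+(?:-[a-z0-9]+)*)$")


def parse_agent_branch(name: str) -> tuple[int, str]:
    """Return (issue_number, slug) or raise ValueError for a non-conforming name."""
    match = BRANCH_RE.match(name)
    if match is None:
        raise ValueError(f"branch {name!r} does not match agent/<issue-number>-<slug>")
    return int(match["issue"]), match["slug"]
```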

🧬 Context Isolation Policy

Agents do NOT share token context.

They share:

  • Structured working memory
  • Contract hashes
  • Event logs

They do NOT share:

  • Internal reasoning traces
  • Chain-of-thought
  • Raw prompt transcripts

Shared state must be structured and typed.


🧪 Strong Typing Enforcement (Python / mypy)

Agents must not:

  • Use Any unless wrapped in explicit boundary object
  • Cast at caller site to hide typing errors
  • Use # type: ignore without protocol-boundary justification

If a type mismatch occurs:

→ Fix the callee
→ Propagate correct type upward
→ Eliminate covariance

Follow strict module boundaries:

One module → one responsibility → clean API → no covariance.


💾 Memory Durability Requirement

Agents must persist:

  • Full execution context to durable storage
  • Compressed execution summary to working memory

If memory window pressure occurs:

→ Use compressed context
→ Never discard canonical state

Full context must remain retrievable from storage.


πŸ” Contract Hash Enforcement (Execution Layer)

All execution layers must verify:

assert section_contract.parent_contract_hash == instrument_contract.contract_hash

If mismatch:

→ Raise ProtocolViolation
→ Halt execution

Structural fields must override LLM-proposed values.

Advisory fields must never influence structural placement.
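
A sketch of the hash check as an explicit raise. Python drops `assert` statements when run with `-O`, so a bare assert is not a reliable runtime gate. The attribute names follow the assertion above; the two dataclasses and `verify_lineage` are hypothetical stand-ins for the real contract types.

```python
from dataclasses import dataclass


class ProtocolViolation(Exception):
    """Raised when a child contract does not descend from the expected parent."""


@dataclass(frozen=True)
class InstrumentContract:  # hypothetical minimal shape
    contract_hash: str


@dataclass(frozen=True)
class SectionContract:  # hypothetical minimal shape
    parent_contract_hash: str


def verify_lineage(section: SectionContract, instrument: InstrumentContract) -> None:
    """Halt by raising if the section was not derived from this instrument contract."""
    if section.parent_contract_hash != instrument.contract_hash:
        raise ProtocolViolation(
            f"parent hash {section.parent_contract_hash!r} != {instrument.contract_hash!r}"
        )
```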


🧱 Determinism Mandate

Agents must operate as deterministic state machines.

Side-effects must be:

  • Idempotent
  • Phase-gated
  • Hash-verified

Agents must not "decide what to do next."

They must:

  1. Query state
  2. Validate transition
  3. Execute allowed action
  4. Update state
  5. Halt

This section upgrades the agent taxonomy from heuristic execution to deterministic protocol-governed orchestration.
