🧭 Agent Failure Taxonomy & Solutions v1

A comprehensive rulebook for unattended autonomous engineering agents

This document captures real-world failure modes observed in long‑running AI agent workflows, especially multi‑agent, CLI‑driven, containerized environments like Stori/Maestro pipelines.

Each section includes:

❌ Problem Pattern
🔎 Signals
✅ Recommended Solution Strategy

1. Cognitive / Reasoning Failures

🧠 Recursive Thinking Loop (Inception Mode)

Problem: Agent enters endless planning without executing. Signals: No commands, no diffs, repeated internal reasoning. Solutions:

Enforce planning time budget.
Require execution after N planning cycles.
Add "action watchdog" that resets agent if no commands executed.
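The "action watchdog" idea can be sketched as a counter of consecutive planning cycles that produced no command. This is a minimal illustration, not a prescribed implementation; the class name `ActionWatchdog` and the default budget of 3 cycles are assumptions.

```python
from dataclasses import dataclass


@dataclass
class ActionWatchdog:
    """Counts consecutive planning cycles that executed no command."""

    max_planning_cycles: int = 3  # assumed budget; tune per task
    idle_cycles: int = 0

    def record_cycle(self, commands_executed: int) -> bool:
        """Return True when the agent should be reset or forced to act."""
        if commands_executed > 0:
            self.idle_cycles = 0  # any real action clears the counter
            return False
        self.idle_cycles += 1
        return self.idle_cycles >= self.max_planning_cycles
```

The orchestrator would call `record_cycle` once per reasoning turn and reset the agent when it returns True.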

🧠 Context Drift After Summarization

Problem: Agent loses execution state after compressing chat. Signals: Duplicate work, frozen state, lost cursor. Solutions:

Maintain machine‑readable execution state (step index, checkpoints).
Store resumable YAML state separate from summaries.

Quality–Time Tradeoff Drift (Silent Pragmatism)

Problem: Agent internally debates "best" vs "simple" and silently chooses a pragmatic shortcut, degrading architecture over time. Signals: Caller-side casts, partial fixes, tech debt accretion, repeated "quick fixes" across modules. Solutions:

Require an explicit execution budget for every task (pragmatic|balanced|optimal|research).
Enforce: agent MUST NOT silently escalate/shift budget; must request upgrade.
Require decision logging: for any non-trivial tradeoff, emit "budget decision" event with rationale and alternatives.

Type-System Evasion (Mypy Box-Checking)

Problem: Agent "fixes" mypy by using Any, caller-side casts, or type: ignore rather than architecting correct entities and callee signatures. Signals: cast(...) at call sites, proliferation of Any, new # type: ignore comments in core logic. Solutions:

Callee-first rule: if caller needs a cast, fix the callee return type and propagate typed entities.
Ban "naked innies": no dict[str, Any] or list[dict] across internal layers; wrap into typed dataclasses/Pydantic models.
If Any is unavoidable: wrap as OpaquePayload(raw: Any) at a boundary protocol and quarantine the Any.
type: ignore allowed ONLY at explicit boundary adapters (3rd-party libs, serialization, protocol bridges) and must include justification.
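A sketch of the quarantine rule, assuming a hypothetical internal entity `NoteEvent` and a hypothetical adapter `parse_note`. Only `OpaquePayload(raw: Any)` comes from the rule above; the rest is illustrative.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class OpaquePayload:
    """Quarantines an untyped value received at a boundary."""

    raw: Any


@dataclass(frozen=True)
class NoteEvent:  # hypothetical typed entity used by internal layers
    pitch: int
    duration_beats: float


def parse_note(payload: OpaquePayload) -> NoteEvent:
    """Boundary adapter: the only place that inspects payload.raw."""
    raw = payload.raw
    if not isinstance(raw, dict):
        raise TypeError("expected a mapping at the boundary")
    # Convert once here so callers never need a cast.
    return NoteEvent(pitch=int(raw["pitch"]), duration_beats=float(raw["duration_beats"]))
```

Because the callee returns a typed entity, callers need neither `cast(...)` nor `# type: ignore`.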

🧠 Reference Pattern Neglect

Problem: Agent fixes failing tests repeatedly instead of copying patterns from passing tests. Solutions:

Require repo search for similar passing tests before patching.
Add "golden pattern scan" rule before retry loops.

🧠 Retry Without Strategy Mutation

Problem: Agent retries the same commands with small parameter tweaks. Solutions:

After 2 identical failures → force algorithmic strategy change.
Maintain a retry fingerprint hash so that repeated failures are recognized as identical.
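One way to sketch a retry fingerprint: hash the tool name together with the last line of the error, so that small parameter tweaks still collapse to the same fingerprint. What goes into the hash is an assumption here, as are the names `retry_fingerprint` and `RetryTracker`.

```python
import hashlib
from collections import Counter


def retry_fingerprint(command: str, error_text: str) -> str:
    """Hash the tool name plus the last error line into a stable fingerprint."""
    tokens = command.split()
    tool = tokens[0] if tokens else ""
    lines = error_text.strip().splitlines()
    last_line = lines[-1].strip() if lines else ""
    return hashlib.sha256(f"{tool}\n{last_line}".encode("utf-8")).hexdigest()


class RetryTracker:
    def __init__(self, max_identical_failures: int = 2) -> None:
        self.max_identical_failures = max_identical_failures
        self._seen = Counter()

    def record_failure(self, command: str, error_text: str) -> bool:
        """Return True once the same failure has repeated enough to force a new strategy."""
        fingerprint = retry_fingerprint(command, error_text)
        self._seen[fingerprint] += 1
        return self._seen[fingerprint] >= self.max_identical_failures
```

A stricter variant could hash the full command. That would miss the "small parameter tweak" case this entry describes.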

2. Orchestration & Multi-Agent Failures

👥 Orphaned Subagent

Problem: Child agent spins forever; parent unaware. Solutions:

Heartbeat monitoring.
Mandatory timeout budgets.
Parent watchdog kills stale subagents.

Watchdog Chain Paradox (Who Watches the Watchdogs?)

Problem: Parent watchdog exists, but parent itself can stall; nested swarms can deadlock if supervisor doesn't have its own liveness guarantees. Signals: Entire pipeline stalls with no progressing heartbeats; children still "running" but no coordinator actions. Solutions:

Add a top-level Supervisor watchdog (outside the agent graph) that monitors all agent heartbeats.
Add explicit agent lifecycle states: SPAWNED/RUNNING/STALLED/RECOVERING/FAILED/COMPLETED.
Require per-layer timeouts + escalation path (L3→L2→L1→Supervisor).
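The lifecycle states and supervisor check above might look like the following sketch. The state names come from the list; `AgentRecord`, `mark_stalled`, and the timeout handling are assumptions, and the clock value is passed in so that the check stays deterministic.

```python
from dataclasses import dataclass
from enum import Enum


class AgentState(str, Enum):
    SPAWNED = "spawned"
    RUNNING = "running"
    STALLED = "stalled"
    RECOVERING = "recovering"
    FAILED = "failed"
    COMPLETED = "completed"


@dataclass
class AgentRecord:
    agent_id: str
    state: AgentState
    last_heartbeat: float  # seconds, from any monotonic clock


def mark_stalled(agents: list, now: float, timeout_s: float) -> list:
    """Flag RUNNING agents whose heartbeat is older than timeout_s; return their ids."""
    stalled = []
    for agent in agents:
        if agent.state is AgentState.RUNNING and now - agent.last_heartbeat > timeout_s:
            agent.state = AgentState.STALLED
            stalled.append(agent.agent_id)
    return stalled
```

Running this loop in a process outside the agent graph means a stalled parent cannot silence it.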

Semantic Telephone Between Agent Layers

Problem: Plan/task intent mutates as it is handed off (Manager → Agent → Subagent). Hallucination creeps in between layers. Signals: Child executes a different interpretation than the parent spec; mismatched constraints; "I thought we meant..." Solutions:

Require explicit, versioned "contracts" between layers (schema + invariants).
Require spec hash verification at every hop; reject mismatched hash.
Make advisory vs structural fields explicit; structural must be enforced by code, not prompts.
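Spec hash verification at a hop can be sketched with a canonical JSON digest. The choice of SHA-256 over sorted-key JSON and the function names `spec_hash` and `accept_handoff` are assumptions; the rule above only requires that a mismatched hash is rejected.

```python
import hashlib
import json


def spec_hash(spec: dict) -> str:
    """Hash a spec via canonical JSON so key order does not change the digest."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def accept_handoff(spec: dict, expected_hash: str) -> dict:
    """Reject the handoff if the received spec does not match the sender's hash."""
    actual = spec_hash(spec)
    if actual != expected_hash:
        raise ValueError(f"spec hash mismatch: expected {expected_hash[:12]}, got {actual[:12]}")
    return spec
```

The check lives in code rather than in a prompt, which is the point of the structural-versus-advisory distinction.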

👥 Spec Drift Between Agents

Problem: Parent updates plan; subagent continues old instructions. Solutions:

Versioned spec injection.
Require spec hash verification before execution.

👥 Cross-Agent Race Conditions

Problem: Agents modify same files simultaneously. Solutions:

File locking.
Branch-per-agent model.
Merge gates.

3. Build & Project Graph Failures

🧱 Target Membership Blindness

Problem: Agent creates Swift files but does not add them to the Xcode project's build phases. Signals: Types not found, build loops. Solutions:

Mandatory project file patch after file creation.
Verify PBXBuildFile entries exist.

Build Output Filtering (False Failure / False Success)

Problem: Agent pipes build/test output through grep/head/tail and loses the true exit status signal, causing rebuild loops or false "done". Signals: Rebuild repeats despite successful compilation; agent "doesn't see" success; truncated logs. Solutions:

Ban piping build/test output through filters.
Always treat process exit code as authoritative; capture full logs to file if needed.
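A minimal sketch of the exit-code rule: run the build without any shell pipeline, send the full output to a log file, and decide success from the return code alone. The function name `run_build` and the log location are assumptions.

```python
import subprocess
from pathlib import Path


def run_build(cmd: list, log_path: Path) -> bool:
    """Run cmd, write its full output to log_path, and return True only on exit code 0."""
    with log_path.open("wb") as log:
        # No grep/head/tail in between: stdout and stderr go straight to the file.
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode == 0
```

The agent can still search the log afterwards; the filter just never sits between the process and its exit status.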

🧱 Workspace Shadowing

Problem: Agent edits code in one path while build runs elsewhere. Solutions:

Enforce canonical repo root.
Validate file hash inside container before build.

🧱 Dependency Graph Blind Spots

Problem: Agent edits local file but bug exists in another module. Solutions: Require cross-module symbol search before edits.

4. Environment & Container Failures

🐳 Container State Desync

Problem: Agent forgets to rebuild containers after dependencies change, so the running image no longer matches the code. Solutions:

Rebuild on dependency file change.
Add container health checks.

Ephemeral /tmp Artifact Loss (Rebuild Nukes Work)

Problem: Agent saves artifacts/checkpoints/results under /tmp inside container; rebuild wipes /tmp; agent assumes state exists and resumes incorrectly. Signals: Missing checkpoint after rebuild; progress resets to 0; "file not found" after container rebuild. Solutions:

Enforce a canonical durable workspace root (e.g. /workspace) and require all artifacts live under it.
Add path guard: fail hard if writing outside /workspace for persistent outputs.
Mount /workspace to a named volume or host dir; include preflight durability test (write/read sentinel).
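The path guard and sentinel preflight could be sketched as below. The function names and the sentinel file name are assumptions, and `Path.is_relative_to` requires Python 3.9 or later. The preflight only proves the root is writable and readable now; surviving a rebuild still depends on the volume mount.

```python
from pathlib import Path


def require_durable(path: Path, root: Path) -> Path:
    """Return the resolved path, or raise if it falls outside the durable root."""
    resolved = path.resolve()
    if not resolved.is_relative_to(root.resolve()):
        raise PermissionError(f"{resolved} is outside durable root {root}")
    return resolved


def durability_preflight(root: Path) -> bool:
    """Write a sentinel file under root and read it back."""
    sentinel = root / ".durability_sentinel"
    sentinel.write_text("ok", encoding="utf-8")
    return sentinel.read_text(encoding="utf-8") == "ok"
```

In a pipeline, `root` would be `Path("/workspace")` and every artifact writer would pass its target through `require_durable` first.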

Host/Container Execution Split-Brain

Problem: Agent runs Python / pytest / pydantic on host instead of inside container, producing inconsistent deps and phantom failures. Signals: "works in container, fails on host" (or vice versa); missing deps; pydantic version mismatch. Solutions:

Hard rule: all Python execution must run via docker compose exec ...
Provide canonical commands and a wrapper script (make targets) so agent can't choose host accidentally.
Add guard that detects host execution and aborts with instructions.

Secrets / .env Permission Deadlock

Problem: Agent needs to edit .env or create dirs, but freezes waiting for permission (or edits without permission). Signals: Agent loops "can't proceed" or silently modifies secrets. Solutions:

Capability policy: allow_env_edit, allow_mkdir, allow_web_search, allow_python_exec, etc.
Enforce explicit "REQUESTED vs EXECUTED" state; never retry permission-blocked actions without new user input.

🐳 Resource Exhaustion (Exit Code 137)

Problem: Python batch jobs are killed for running out of memory (exit code 137 = 128 + SIGKILL) when looping over large MIDI datasets. Solutions:

Switch to streaming pipeline.
Process files in single-file batches.
Persist incremental JSON results.

🐳 Permission Mirage

Problem: Agent treats a command as executed when it was only requested and is still awaiting human approval. Solutions:

Distinguish REQUESTED vs EXECUTED state.
Pause workflow until confirmation received.

5. Long‑Running Job Failures

📊 Frozen Progress Blindness

Problem: Logs remain alive but progress count stops increasing. Solutions:

Track delta(progress).
Abort if unchanged after N polls.

📊 Checkpoint Blindness

Problem: Agent restarts job from index 0 instead of saved checkpoint. Solutions:

Inspect output directory before restart.
Resume from last processed index.

Checkpoint Regression (Restart Loses Resume Point)

Problem: Agent restarts a long job and accidentally resets progress to 0 instead of resuming from checkpoint. Signals: processed_count returns to 0; output overwritten; repeated first N files. Solutions:

Require run_id + manifest.json; refuse to start if manifest exists and resume flag not set.
Write checkpoints atomically (tmp file + rename).
Require "resume proof" before restart: show last checkpoint + last processed id.
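The manifest check and atomic checkpoint write might be sketched as follows. The file name `manifest.json` comes from the rule above; the manifest fields, function names, and the `resume` parameter are assumptions.

```python
import json
import os
from pathlib import Path


def write_checkpoint(path: Path, state: dict) -> None:
    """Write state to a temp file in the same directory, then rename over the target."""
    tmp = path.with_name(path.name + ".tmp")
    with tmp.open("w", encoding="utf-8") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach disk before the rename
    os.replace(tmp, path)  # atomic when both paths are on the same filesystem


def start_run(run_dir: Path, run_id: str, resume: bool) -> dict:
    """Refuse to start over an existing manifest unless resume is set."""
    manifest = run_dir / "manifest.json"
    if manifest.exists():
        if not resume:
            raise FileExistsError(f"{manifest} exists; pass resume=True to continue")
        return json.loads(manifest.read_text(encoding="utf-8"))
    state = {"run_id": run_id, "last_processed_index": -1}
    write_checkpoint(manifest, state)
    return state
```

The returned state doubles as the "resume proof": the agent can print it before restarting.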

Progress Counter Stasis (Logs Alive, Work Frozen)

Problem: Agent tails logs and assumes success because logs exist; counter doesn't increase (deadlock, stuck IO). Signals: last_processed_count unchanged across polls; timestamps stale; no heartbeat events. Solutions:

Require delta(progress) monitoring with time window thresholds.
Add heartbeat: emit processed_count + last_increment_ts every N items.
Watchdog abort if no delta for N minutes; capture stack dump / thread dump.
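Delta monitoring with a time window can be sketched as a small stateful check. The class name `ProgressWatchdog` and its fields are assumptions; timestamps are passed in rather than read from the clock so that behavior is reproducible.

```python
from dataclasses import dataclass


@dataclass
class ProgressWatchdog:
    stall_after_s: float
    last_count: int = -1
    last_increment_ts: float = 0.0

    def observe(self, processed_count: int, now: float) -> bool:
        """Record one poll; return True if progress has stalled past the threshold."""
        if processed_count > self.last_count:
            self.last_count = processed_count
            self.last_increment_ts = now
            return False
        # Logs may still be flowing, but the counter has not moved.
        return now - self.last_increment_ts >= self.stall_after_s
```

When `observe` returns True, the caller would capture a stack or thread dump before aborting.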

📊 Output Buffering Illusion

Problem: Log output is buffered, so the agent thinks the job has stalled. Solutions:

Use unbuffered Python (python -u or PYTHONUNBUFFERED=1).
Parse timestamps instead of output volume.

📊 Script Sleep Loop Desync

Problem: Agent keeps sleeping after the script has already failed. Solutions:

Track PID + exit codes.
Verify process alive before sleeping again.
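A sketch of checking liveness before each sleep, using `subprocess.Popen.poll()`. The function name `wait_with_liveness` and its parameters are assumptions.

```python
import subprocess
import time
from typing import Optional


def wait_with_liveness(proc: subprocess.Popen, poll_interval_s: float, max_polls: int) -> Optional[int]:
    """Return the exit code as soon as proc ends, or None if it is still running."""
    for _ in range(max_polls):
        code = proc.poll()  # None while the process is alive
        if code is not None:
            return code
        time.sleep(poll_interval_s)
    return proc.poll()
```

A non-zero return value tells the agent the script failed, so it can stop sleeping and inspect the logs.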

6. Debugging & Runtime Failures

🔬 No Backtrace Before Patch

Problem: Agent edits code after a crash without first collecting a backtrace (e.g. with LLDB). Solutions: Require backtrace collection before editing logic.

🔬 Sanitizer Avoidance

Problem: Agent never runs TSan/ASan. Solutions: Mandatory sanitizer builds after runtime failures.

Missing ToolError / Log Triage (False "All Good")

Problem: Agent checks logs but misses explicit toolError / error lines and reports success. Signals: toolError events in SSE/logs; HTTP stream errors; circuit breaker open; Orpheus unavailable. Solutions:

Define a hard "red-flag" pattern set (toolError, ERROR -, Traceback, circuit_breaker_open, Orpheus unavailable).
Require log scan that fails the run if any red-flag appears after start_ts.
Require explicit "health gates": Orpheus health + LLM stream health before claiming success.
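The red-flag scan can be sketched as a regular expression over log lines newer than `start_ts`. The pattern set is taken from the list above. The assumption is that the logs have already been parsed into `(timestamp, text)` pairs; that parsing is not shown.

```python
import re

# Patterns from the red-flag list; extend as new failure signatures appear.
RED_FLAGS = re.compile(r"toolError|ERROR -|Traceback|circuit_breaker_open|Orpheus unavailable")


def find_red_flags(lines: list, start_ts: float) -> list:
    """Return log lines at or after start_ts that match any red-flag pattern."""
    return [text for ts, text in lines if ts >= start_ts and RED_FLAGS.search(text)]
```

A run would be marked failed whenever the returned list is non-empty, regardless of what the last command printed.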

🔬 Concurrency Illusions

Problem: Agent prematurely assumes async tasks have finished. Solutions: Require completion signals or explicit awaits.

7. Resource & Performance Failures

📉 Metric Plateau Blindness

Problem: Agent thinks metrics are improving despite flat results. Solutions: Require delta thresholds.

📉 Resource Throttling Misdiagnosis

Problem: Agent misinterprets CPU or IO throttling as logic bug. Solutions: Monitor system metrics before patching code.

8. Control Plane / UX Failures

⛔ Interactive Permission Boundaries

Problem: Agent stalls waiting for user approval dialogs. Solutions:

Explicit approval state detection.
Prompt user clearly.

Web Search Permission Stall

Problem: Agent freezes waiting for permission to browse/search; workflow halts without escalation. Signals: agent does nothing; repeated "need to search web" statements. Solutions:

Capability policy includes allow_web_search.
If denied: require fallback plan that does not require web OR explicit escalation request.

🧾 False Completion

Problem: Agent reports "Done" merely because the last command succeeded, without checking the overall result. Solutions:

Require artifact validation.
Require canonical build + tests pass.

9. Data & File Management Failures

🧹 Cleanup Blindness

Problem: Temp files cause incorrect skipping or duplication. Solutions: Enforce temp directory policies.

Revert Nukes Work (Over-Broad Undo)

Problem: Agent reverts an entire file to undo a small change, deleting unrelated critical work. Signals: large diff reversals; file timestamp rollback; "reverted file" without surgical patch. Solutions:

Require surgical revert: git checkout -p / patch-level undo.
Before revert: show diff summary and confirm scope.
Prefer small incremental commits to make rollback safe.

Cascading Test Failure Blindness (Shared Root Cause)

Problem: Agent fixes one failing test caused by a shared change (magic number/API change), but doesn't anticipate other tests failing for the same reason. Signals: sequential failures with similar stack traces; repeated "fix test" loop. Solutions:

After any fix, search for the failing constant/symbol across tests and update all impacted tests as a batch.
Require pattern scan: grep for similar assertions or fixtures.

🧾 Workspace Drift After Sleep/Wake

Problem: Agent resumes with stale filesystem assumptions. Solutions: Re-scan repo state on wake.

10. Structural Agent Design Principles (Summary)

To achieve unattended operation:

Use resumable checkpoints for all long jobs.
Enforce progress‑delta monitoring.
Separate environment debugging from code debugging.
Require debugging tools (LLDB, TSan, ASan).
Implement subagent watchdogs.
Maintain execution state separate from summaries.
Validate project graph after file creation.
Force strategy change after repeated failures.

🔁 Git Workflow State Machine (Deterministic Phase Control)

Agents must not perform side-effects (branch creation, PR creation, merge) without consulting the canonical workflow state.

All GitHub operations must be idempotent.

WorkflowPhase

from enum import Enum

class WorkflowPhase(str, Enum):
    ISSUE_CREATED = "issue_created"
    BRANCH_CREATED = "branch_created"
    WORK_IN_PROGRESS = "work_in_progress"
    PR_CREATED = "pr_created"
    CI_RUNNING = "ci_running"
    READY_TO_MERGE = "ready_to_merge"
    MERGED = "merged"
    CLOSED = "closed"

Transitions must only move forward. Backward transitions require explicit human override.
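A forward-only check over these phases might look like the sketch below. The enum is repeated so that the example runs on its own. `validate_transition` and its `human_override` flag are assumptions, as is the choice to allow skipping ahead by more than one phase.

```python
from enum import Enum


class WorkflowPhase(str, Enum):  # repeated from above so the sketch is self-contained
    ISSUE_CREATED = "issue_created"
    BRANCH_CREATED = "branch_created"
    WORK_IN_PROGRESS = "work_in_progress"
    PR_CREATED = "pr_created"
    CI_RUNNING = "ci_running"
    READY_TO_MERGE = "ready_to_merge"
    MERGED = "merged"
    CLOSED = "closed"


_ORDER = list(WorkflowPhase)  # declaration order defines "forward"


def validate_transition(current: WorkflowPhase, target: WorkflowPhase,
                        human_override: bool = False) -> None:
    """Raise ValueError unless target is later than current or a human overrides."""
    if _ORDER.index(target) > _ORDER.index(current) or human_override:
        return
    raise ValueError(f"illegal transition {current.value} -> {target.value}")
```

A stricter variant could permit only the immediately following phase.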

🗂 Canonical Agent Task Registry

Each agent must operate against a single authoritative task record.

{
  "agent_id": "agent-001",
  "issue_number": 412,
  "branch": "agent/412-contract-lineage",
  "pull_request_number": 527,
  "phase": "pr_created",
  "ci_status": "pending",
  "attempt_counters": {
    "pr_creation_attempts": 1
  },
  "updated_at": "2026-02-27T18:10:00Z"
}

This record must be checked before:

Creating branches
Creating pull requests
Rebasing
Merging
Closing issues

🔒 Idempotent GitHub Wrapper Requirement

Agents must never call raw GitHub side-effects directly.

Instead, use wrapper functions that enforce idempotency:

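# The helpers branch_exists, create_branch, find_pr_by_branch, and create_pr
# are assumed to be provided by the project's GitHub client layer.

def ensure_branch(branch: str, base: str) -> None:
    """Create the branch from base only if it does not already exist."""
    if branch_exists(branch):
        return
    create_branch(branch, base)

def ensure_pull_request(branch: str, issue_number: int) -> int:
    """Return the PR number for branch, creating the PR only if none exists."""
    existing = find_pr_by_branch(branch)
    if existing:
        return existing.number
    return create_pr(branch, issue_number)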

Duplicate PR creation must be impossible by design.

🚨 Anti-Loop Safeguards

Agents must track attempt counters for side-effects:

PR creation attempts
Merge attempts
Rebase attempts

If any attempt counter exceeds 1 without phase advancement:

→ Abort execution
→ Emit "LOOP_GUARD_TRIGGERED"
→ Escalate to orchestrator

Infinite retry loops are forbidden.
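The loop guard can be sketched by counting attempts per (action, phase) pair, so the count starts fresh once the phase advances. The names `AttemptTracker` and `LoopGuardTriggered` are assumptions; the limit of 1 and the event string come from the rule above.

```python
class LoopGuardTriggered(RuntimeError):
    """Raised when a side-effect is retried without the phase advancing."""


class AttemptTracker:
    def __init__(self, max_attempts: int = 1) -> None:
        self.max_attempts = max_attempts
        self._counts = {}

    def record(self, action: str, phase: str) -> None:
        """Count an attempt of action in phase; abort once the limit is exceeded."""
        key = (action, phase)
        self._counts[key] = self._counts.get(key, 0) + 1
        if self._counts[key] > self.max_attempts:
            raise LoopGuardTriggered(f"LOOP_GUARD_TRIGGERED: {action} in phase {phase}")
```

The orchestrator would catch `LoopGuardTriggered` and take over instead of letting the agent retry.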

🧠 Pre-Action Verification Protocol

Before executing any side-effect:

Read canonical registry
Query GitHub for real state
Confirm registry matches external state
Confirm transition is valid

If registry and GitHub disagree:

→ Enter RECONCILIATION mode
→ Do not proceed blindly
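The comparison step can be sketched as a field-by-field diff between the registry record and the state observed on GitHub. `TaskView` and `disagreements` are hypothetical names, and fetching the remote state is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskView:
    branch: str
    pull_request_number: Optional[int]
    phase: str


def disagreements(registry: TaskView, remote: TaskView) -> list:
    """Return the names of fields where the registry and the remote state differ."""
    fields = ("branch", "pull_request_number", "phase")
    return [name for name in fields if getattr(registry, name) != getattr(remote, name)]
```

A non-empty result means the agent should enter RECONCILIATION mode rather than execute the side-effect.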

🐳 Worktree Isolation Contract

Each agent must operate inside:

Dedicated git worktree
Dedicated branch
Dedicated container

Agents must not:

Mount root repository in multiple containers
Modify branches not assigned to them
Create arbitrary branch names

Branch naming format:

agent/<issue-number>-<slug>
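The naming format can be enforced with a regular expression. The exact slug character set (lowercase letters, digits, single hyphens) is an assumption, as is the function name `parse_agent_branch`.

```python
import re

# agent/<issue-number>-<slug>; the slug alphabet is an assumed convention.
BRANCH_RE = re.compile(r"agent/(?P<issue>[1-9][0-9]*)-(?P<slug>[a-z0-9]+(?:-[a-z0-9]+)*)")


def parse_agent_branch(name: str) -> tuple:
    """Return (issue_number, slug), or raise ValueError for a non-conforming name."""
    match = BRANCH_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"branch name does not match agent/<issue-number>-<slug>: {name}")
    return int(match["issue"]), match["slug"]
```

Rejecting names at creation time keeps agents from inventing arbitrary branches.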

🧬 Context Isolation Policy

Agents do NOT share token context.

They share:

Structured working memory
Contract hashes
Event logs

They do NOT share:

Internal reasoning traces
Chain-of-thought
Raw prompt transcripts

Shared state must be structured and typed.

🧪 Strong Typing Enforcement (Python / mypy)

Agents must not:

Use Any unless wrapped in explicit boundary object
Cast at caller site to hide typing errors
Use # type: ignore without protocol-boundary justification

If a type mismatch occurs:

→ Fix the callee
→ Propagate correct type upward
→ Eliminate covariance

Follow strict module boundaries:

One module → one responsibility → clean API → no covariance.

💾 Memory Durability Requirement

Agents must persist:

Full execution context to durable storage
Compressed execution summary to working memory

If memory window pressure occurs:

→ Use compressed context
→ Never discard canonical state

Full context must remain retrievable from storage.

🔐 Contract Hash Enforcement (Execution Layer)

All execution layers must verify:

assert section_contract.parent_contract_hash == instrument_contract.contract_hash

If mismatch:

→ Raise ProtocolViolation
→ Halt execution
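Since `assert` statements are skipped when Python runs with `-O`, the check may be safer as an explicit raise. In this sketch the `Contract` dataclass and `verify_lineage` are assumptions; `ProtocolViolation` and the two hash fields come from the text above.

```python
from dataclasses import dataclass
from typing import Optional


class ProtocolViolation(Exception):
    """Raised when a child contract does not reference its parent's hash."""


@dataclass(frozen=True)
class Contract:
    contract_hash: str
    parent_contract_hash: Optional[str] = None


def verify_lineage(child: Contract, parent: Contract) -> None:
    """Halt by raising ProtocolViolation if the child does not descend from parent."""
    if child.parent_contract_hash != parent.contract_hash:
        raise ProtocolViolation(
            f"parent hash {child.parent_contract_hash!r} != {parent.contract_hash!r}"
        )
```

Calling `verify_lineage(section_contract, instrument_contract)` then replaces the bare assertion.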

Structural fields must override LLM-proposed values.

Advisory fields must never influence structural placement.

🧱 Determinism Mandate

Agents must operate as deterministic state machines.

Side-effects must be:

Idempotent
Phase-gated
Hash-verified

Agents must not "decide what to do next."

They must:

Query state
Validate transition
Execute allowed action
Update state
Halt

This section upgrades the agent taxonomy from heuristic execution to deterministic protocol-governed orchestration.

cgcardona/Agent_Failure_Taxonomy_v1.md
