A comprehensive rulebook for unattended autonomous engineering agents
This document captures real-world failure modes observed in long-running AI agent workflows, especially multi-agent, CLI-driven, containerized environments like Stori/Maestro pipelines.
Each section includes:
- Problem Pattern
- Signals
- Recommended Solution Strategy
Problem: Agent enters endless planning without executing. Signals: No commands, no diffs, repeated internal reasoning. Solutions:
- Enforce planning time budget.
- Require execution after N planning cycles.
- Add "action watchdog" that resets agent if no commands executed.
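
The action watchdog could be sketched roughly as follows. This is a minimal illustration, assuming the orchestrator reports how many commands each agent cycle executed; the `ActionWatchdog` class, the `record_cycle` hook, and the default of 3 cycles are hypothetical names and values, not part of any existing framework.

```python
from dataclasses import dataclass


@dataclass
class ActionWatchdog:
    """Counts consecutive planning-only cycles (hypothetical helper)."""

    max_planning_cycles: int = 3  # assumed value of N
    planning_streak: int = 0

    def record_cycle(self, commands_executed: int) -> bool:
        """Return True when the agent should be reset or forced to execute."""
        if commands_executed > 0:
            # Any executed command counts as progress and clears the streak.
            self.planning_streak = 0
            return False
        self.planning_streak += 1
        return self.planning_streak >= self.max_planning_cycles
```

The caller decides what "reset" means (restart the agent, or inject a forced execution step); the watchdog only reports that the budget is exhausted.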
Problem: Agent loses execution state after compressing chat. Signals: Duplicate work, frozen state, lost cursor. Solutions:
- Maintain machine-readable execution state (step index, checkpoints).
- Store resumable YAML state separate from summaries.
Problem: Agent internally debates "best" vs "simple" and silently chooses a pragmatic shortcut, degrading architecture over time. Signals: Caller-side casts, partial fixes, tech debt accretion, repeated "quick fixes" across modules. Solutions:
- Require an explicit execution budget for every task (pragmatic|balanced|optimal|research).
- Enforce: agent MUST NOT silently escalate/shift budget; must request upgrade.
- Require decision logging: for any non-trivial tradeoff, emit "budget decision" event with rationale and alternatives.
Problem: Agent "fixes" mypy by using Any, caller-side casts, or type: ignore rather than architecting correct entities and callee signatures. Signals: cast(...) at call sites, proliferation of Any, new # type: ignore comments in core logic. Solutions:
- Callee-first rule: if caller needs a cast, fix the callee return type and propagate typed entities.
- Ban "naked innies": no dict[str, Any] or list[dict] across internal layers; wrap into typed dataclasses/Pydantic models.
- If Any is unavoidable: wrap as OpaquePayload(raw: Any) at a boundary protocol and quarantine the Any.
- type: ignore allowed ONLY at explicit boundary adapters (3rd-party libs, serialization, protocol bridges) and must include justification.
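
A minimal sketch of the boundary-quarantine idea, assuming a hypothetical `NoteEvent` entity and a `parse_note` adapter (both names are illustrative). Only the adapter ever touches the untyped value; everything behind it works with a typed dataclass.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class OpaquePayload:
    """Quarantines an untyped value at a boundary."""

    raw: Any


@dataclass(frozen=True)
class NoteEvent:
    """Hypothetical typed entity passed between internal layers."""

    pitch: int
    duration_beats: float


def parse_note(payload: OpaquePayload) -> NoteEvent:
    """Boundary adapter: the only place that inspects payload.raw."""
    raw = payload.raw
    if not isinstance(raw, dict):
        raise TypeError("expected a mapping at the boundary")
    return NoteEvent(
        pitch=int(raw["pitch"]),
        duration_beats=float(raw["duration_beats"]),
    )
```

Because the callee returns `NoteEvent`, callers need no casts, which is the point of the callee-first rule.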
Problem: Agent fixes failing tests repeatedly instead of copying patterns from passing tests. Solutions:
- Require repo search for similar passing tests before patching.
- Add "golden pattern scan" rule before retry loops.
Problem: Agent retries same commands with small parameter tweaks. Solutions:
- After 2 identical failures → force algorithmic strategy change.
- Maintain retry fingerprint hash.
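
One possible shape for the retry fingerprint, as a sketch. It assumes that "the same failure" means the same program name plus the same first error line, so that parameter tweaks collapse into one fingerprint; that definition, and the `RetryTracker` name, are assumptions made for illustration.

```python
import hashlib
from collections import Counter


def retry_fingerprint(command: str, error_text: str) -> str:
    """Hash the program name and first error line, ignoring arguments."""
    parts = command.split()
    program = parts[0] if parts else ""
    stripped = error_text.strip()
    first_error_line = stripped.splitlines()[0] if stripped else ""
    return hashlib.sha256(f"{program}\n{first_error_line}".encode()).hexdigest()


class RetryTracker:
    """Counts failures per fingerprint (hypothetical helper)."""

    def __init__(self, max_identical_failures: int = 2) -> None:
        self.max_identical_failures = max_identical_failures
        self.failures = Counter()

    def record_failure(self, command: str, error_text: str) -> bool:
        """Return True when the caller must switch strategy."""
        key = retry_fingerprint(command, error_text)
        self.failures[key] += 1
        return self.failures[key] >= self.max_identical_failures
```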
Problem: Child agent spins forever; parent unaware. Solutions:
- Heartbeat monitoring.
- Mandatory timeout budgets.
- Parent watchdog kills stale subagents.
Problem: Parent watchdog exists, but parent itself can stall; nested swarms can deadlock if supervisor doesn't have its own liveness guarantees. Signals: Entire pipeline stalls with no progressing heartbeats; children still "running" but no coordinator actions. Solutions:
- Add a top-level Supervisor watchdog (outside the agent graph) that monitors all agent heartbeats.
- Add explicit agent lifecycle states: SPAWNED/RUNNING/STALLED/RECOVERING/FAILED/COMPLETED.
- Require per-layer timeouts + escalation path (L3 → L2 → L1 → Supervisor).
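
The lifecycle states and the top-level watchdog might look like the sketch below. It assumes agents report heartbeats to a process that lives outside the agent graph; the `SupervisorWatchdog` class and the 60-second threshold are illustrative assumptions.

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class AgentState(str, Enum):
    SPAWNED = "spawned"
    RUNNING = "running"
    STALLED = "stalled"
    RECOVERING = "recovering"
    FAILED = "failed"
    COMPLETED = "completed"


@dataclass
class SupervisorWatchdog:
    """Tracks heartbeats for every agent, including parent agents."""

    stale_after_s: float = 60.0  # assumed threshold
    last_beat: dict = field(default_factory=dict)
    states: dict = field(default_factory=dict)

    def heartbeat(self, agent_id: str, now: Optional[float] = None) -> None:
        self.last_beat[agent_id] = time.monotonic() if now is None else now
        self.states[agent_id] = AgentState.RUNNING

    def sweep(self, now: Optional[float] = None) -> List[str]:
        """Mark RUNNING agents with stale heartbeats as STALLED; return their ids."""
        current = time.monotonic() if now is None else now
        stalled = []
        for agent_id, beat in self.last_beat.items():
            is_running = self.states[agent_id] is AgentState.RUNNING
            if is_running and current - beat > self.stale_after_s:
                self.states[agent_id] = AgentState.STALLED
                stalled.append(agent_id)
        return stalled
```

The ids returned by `sweep` are what the escalation path would act on; how a stalled agent moves to RECOVERING or FAILED is left to the supervisor.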
Problem: Plan/task intent mutates as it is handed off (Manager → Agent → Subagent). Hallucination creeps in between layers. Signals: Child executes a different interpretation than the parent spec; mismatched constraints; "I thought we meant..." Solutions:
- Require explicit, versioned "contracts" between layers (schema + invariants).
- Require spec hash verification at every hop; reject mismatched hash.
- Make advisory vs structural fields explicit; structural must be enforced by code, not prompts.
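
Spec hash verification at a hop can be sketched as below. The sketch assumes the hash covers only structural fields and that canonical JSON (sorted keys) is an acceptable serialization; the `Contract` layout, `SpecMismatch`, and the field names are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass


def spec_hash(structural: dict) -> str:
    """Hash structural fields using a canonical key order."""
    canonical = json.dumps(structural, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class Contract:
    version: int
    structural: dict  # enforced by code
    advisory: dict  # hints only; excluded from the hash
    parent_hash: str


class SpecMismatch(Exception):
    """Raised when a child contract does not match the parent spec."""


def accept_handoff(contract: Contract, parent_structural: dict) -> Contract:
    """Reject the handoff unless the recorded parent hash matches."""
    if contract.parent_hash != spec_hash(parent_structural):
        raise SpecMismatch("child contract does not match the parent spec")
    return contract
```

Hashing a canonical form means key order in the parent spec cannot cause a false mismatch, while any change to a structural value does.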
Problem: Parent updates plan; subagent continues old instructions. Solutions:
- Versioned spec injection.
- Require spec hash verification before execution.
Problem: Agents modify same files simultaneously. Solutions:
- File locking.
- Branch-per-agent model.
- Merge gates.
Problem: Agent creates Swift files but does not add to project build phases. Signals: Types not found, build loops. Solutions:
- Mandatory project file patch after file creation.
- Verify PBXBuildFile entries exist.
Problem: Agent pipes build/test output through grep/head/tail and loses the true exit status signal, causing rebuild loops or false "done". Signals: Rebuild repeats despite successful compilation; agent "doesn't see" success; truncated logs. Solutions:
- Ban piping build/test output through filters.
- Always treat process exit code as authoritative; capture full logs to file if needed.
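
A minimal sketch of running a build or test command without filters, assuming the full log should land in a file and the decision should rest on the exit code alone. The `run_and_log` helper name is hypothetical.

```python
import subprocess
from pathlib import Path


def run_and_log(command: list, log_path: Path) -> int:
    """Run a command unfiltered, write all output to log_path, return its exit code."""
    with log_path.open("wb") as log:
        completed = subprocess.run(
            command,
            stdout=log,
            stderr=subprocess.STDOUT,  # keep stdout and stderr interleaved
            check=False,  # the caller inspects the exit code explicitly
        )
    return completed.returncode
```

If a shorter view of the output is needed, it can be read from the log file afterwards; success or failure is still taken from the returned code, never from the excerpt.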
Problem: Agent edits code in one path while build runs elsewhere. Solutions:
- Enforce canonical repo root.
- Validate file hash inside container before build.
Problem: Agent edits local file but bug exists in another module. Solutions:
- Require cross-module symbol search before edits.
Problem: Agent forgets to rebuild containers. Solutions:
- Rebuild on dependency file change.
- Add container health checks.
Problem: Agent saves artifacts/checkpoints/results under /tmp inside container; rebuild wipes /tmp; agent assumes state exists and resumes incorrectly. Signals: Missing checkpoint after rebuild; progress resets to 0; "file not found" after container rebuild. Solutions:
- Enforce a canonical durable workspace root (e.g. /workspace) and require all artifacts live under it.
- Add path guard: fail hard if writing outside /workspace for persistent outputs.
- Mount /workspace to a named volume or host dir; include preflight durability test (write/read sentinel).
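
The path guard and sentinel test could be sketched like this. `/workspace` comes from the rule above; the function names, `PathGuardError`, and the sentinel filename are assumptions for illustration.

```python
from pathlib import Path

WORKSPACE_ROOT = Path("/workspace")  # canonical durable root named above


class PathGuardError(RuntimeError):
    """Raised when a persistent output would land outside the durable root."""


def durable_path(candidate: str, root: Path = WORKSPACE_ROOT) -> Path:
    """Return the resolved path, or fail hard if it escapes the root."""
    resolved = Path(candidate).resolve()
    if not resolved.is_relative_to(root.resolve()):  # Python 3.9+
        raise PathGuardError(f"{resolved} is outside {root}")
    return resolved


def durability_preflight(root: Path = WORKSPACE_ROOT) -> bool:
    """Write and read back a sentinel to confirm the root is usable."""
    sentinel = root / ".durability_sentinel"  # hypothetical filename
    sentinel.write_text("ok", encoding="utf-8")
    return sentinel.read_text(encoding="utf-8") == "ok"
```

A write and read within one process only shows that the root is writable. Confirming that data survives a rebuild would need the sentinel to be checked again from a fresh container.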
Problem: Agent runs Python / pytest / pydantic on host instead of inside container, producing inconsistent deps and phantom failures. Signals: "works in container, fails on host" (or vice versa); missing deps; pydantic version mismatch. Solutions:
- Hard rule: all Python execution must run via docker compose exec ...
- Provide canonical commands and a wrapper script (make targets) so agent can't choose host accidentally.
- Add guard that detects host execution and aborts with instructions.
Problem: Agent needs to edit .env or create dirs, but freezes waiting for permission (or edits without permission). Signals: Agent loops "can't proceed" or silently modifies secrets. Solutions:
- Capability policy: allow_env_edit, allow_mkdir, allow_web_search, allow_python_exec, etc.
- Enforce explicit "REQUESTED vs EXECUTED" state; never retry permission-blocked actions without new user input.
Problem: Python batch jobs OOM when looping large MIDI datasets. Solutions:
- Switch to streaming pipeline.
- Process single file batches.
- Persist incremental JSON results.
Problem: Command requested but awaiting human approval. Solutions:
- Distinguish REQUESTED vs EXECUTED state.
- Pause workflow until confirmation received.
Problem: Logs remain alive but progress count stops increasing. Solutions:
- Track delta(progress).
- Abort if unchanged after N polls.
Problem: Agent restarts job from index 0 instead of saved checkpoint. Solutions:
- Inspect output directory before restart.
- Resume from last processed index.
Problem: Agent restarts a long job and accidentally resets progress to 0 instead of resuming from checkpoint. Signals: processed_count returns to 0; output overwritten; repeated first N files. Solutions:
- Require run_id + manifest.json; refuse to start if manifest exists and resume flag not set.
- Write checkpoints atomically (tmp file + rename).
- Require "resume proof" before restart: show last checkpoint + last processed id.
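
A sketch of the atomic checkpoint write and the manifest check, assuming JSON state and a manifest stored in the run directory. The field `last_processed_index` and the function names are hypothetical.

```python
import json
import os
from pathlib import Path


def write_checkpoint(path: Path, state: dict) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    tmp = path.with_name(path.name + ".tmp")
    with tmp.open("w", encoding="utf-8") as fh:
        json.dump(state, fh)
        fh.flush()
        os.fsync(fh.fileno())  # push bytes to disk before the rename
    os.replace(tmp, path)  # atomic on POSIX when both paths share a filesystem


def start_run(run_dir: Path, run_id: str, resume: bool) -> dict:
    """Refuse to start over an existing manifest unless resume is requested."""
    manifest = run_dir / "manifest.json"
    if manifest.exists():
        if not resume:
            raise RuntimeError(f"{manifest} exists; pass resume=True to continue")
        return json.loads(manifest.read_text(encoding="utf-8"))
    state = {"run_id": run_id, "last_processed_index": -1}
    write_checkpoint(manifest, state)
    return state
```

The state returned on resume doubles as the "resume proof": it can be printed or logged before any work restarts.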
Problem: Agent tails logs and assumes success because logs exist; counter doesn't increase (deadlock, stuck IO). Signals: last_processed_count unchanged across polls; timestamps stale; no heartbeat events. Solutions:
- Require delta(progress) monitoring with time window thresholds.
- Add heartbeat: emit processed_count + last_increment_ts every N items.
- Watchdog abort if no delta for N minutes; capture stack dump / thread dump.
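
Progress-delta monitoring can be sketched as a small poller, assuming the job exposes a monotonically increasing `processed_count`. The `ProgressWatchdog` class and its threshold parameter are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProgressWatchdog:
    stall_after_s: float  # how long without an increment counts as stalled
    last_count: Optional[int] = None
    last_increment_ts: Optional[float] = None

    def observe(self, processed_count: int, now: float) -> bool:
        """Record one poll; return True when the job should be treated as stalled."""
        if self.last_count is None or processed_count > self.last_count:
            # First poll or real progress: remember when the count last moved.
            self.last_count = processed_count
            self.last_increment_ts = now
            return False
        return now - self.last_increment_ts >= self.stall_after_s
```

The check deliberately ignores whether log output is still being produced; only movement in the counter resets the timer.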
Problem: Logs buffered → agent thinks job stalled. Solutions:
- Use unbuffered Python (-u).
- Parse timestamps instead of output volume.
Problem: Agent sleeps while script already failed. Solutions:
- Track PID + exit codes.
- Verify process alive before sleeping again.
Problem: Agent edits code after crash without LLDB. Solutions:
- Require backtrace collection before editing logic.
Problem: Agent never runs TSan/ASan. Solutions:
- Mandatory sanitizer builds after runtime failures.
Problem: Agent checks logs but misses explicit toolError / error lines and reports success. Signals: toolError events in SSE/logs; HTTP stream errors; circuit breaker open; Orpheus unavailable. Solutions:
- Define a hard "red-flag" pattern set (toolError, ERROR -, Traceback, circuit_breaker_open, Orpheus unavailable).
- Require log scan that fails the run if any red-flag appears after start_ts.
- Require explicit "health gates": Orpheus health + LLM stream health before claiming success.
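
The red-flag scan might be sketched as follows. The patterns are the ones listed above; the log format (each entry starting with an ISO 8601 timestamp followed by a space) is an assumption, as is the `find_red_flags` name.

```python
import re
from datetime import datetime

# Pattern set taken from the list above.
RED_FLAGS = re.compile(
    r"toolError|ERROR -|Traceback|circuit_breaker_open|Orpheus unavailable"
)


def find_red_flags(lines: list, start_ts: datetime) -> list:
    """Return red-flag lines stamped at or after start_ts.

    Lines without a parseable leading timestamp inherit the timestamp of
    the previous line, so traceback bodies are not skipped.
    """
    hits = []
    current_ts = None
    for line in lines:
        try:
            current_ts = datetime.fromisoformat(line.split(" ", 1)[0])
        except ValueError:
            pass  # continuation line; keep the previous timestamp
        if current_ts is not None and current_ts >= start_ts and RED_FLAGS.search(line):
            hits.append(line)
    return hits
```

A run would be failed whenever the returned list is non-empty, regardless of what the final command reported.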
Problem: Agent assumes async tasks finished prematurely. Solutions:
- Require completion signals or explicit awaits.
Problem: Agent thinks metrics improving despite flat results. Solutions:
- Require delta thresholds.
Problem: Agent misinterprets CPU or IO throttling as logic bug. Solutions:
- Monitor system metrics before patching code.
Problem: Agent stalls waiting for user approval dialogs. Solutions:
- Explicit approval state detection.
- Prompt user clearly.
Problem: Agent freezes waiting for permission to browse/search; workflow halts without escalation. Signals: agent does nothing; repeated "need to search web" statements. Solutions:
- Capability policy includes allow_web_search.
- If denied: require fallback plan that does not require web OR explicit escalation request.
Problem: Agent reports "Done" after last command succeeds. Solutions:
- Require artifact validation.
- Require canonical build + tests pass.
Problem: Temp files cause incorrect skipping or duplication. Solutions:
- Enforce temp directory policies.
Problem: Agent reverts an entire file to undo a small change, deleting unrelated critical work. Signals: large diff reversals; file timestamp rollback; "reverted file" without surgical patch. Solutions:
- Require surgical revert: git checkout -p / patch-level undo.
- Before revert: show diff summary and confirm scope.
- Prefer small incremental commits to make rollback safe.
Problem: Agent fixes one failing test caused by a shared change (magic number/API change), but doesn't anticipate other tests failing for the same reason. Signals: sequential failures with similar stack traces; repeated "fix test" loop. Solutions:
- After any fix, search for the failing constant/symbol across tests and update all impacted tests as a batch.
- Require pattern scan: grep for similar assertions or fixtures.
Problem: Agent resumes with stale filesystem assumptions. Solutions:
- Re-scan repo state on wake.
To achieve unattended operation:
- Use resumable checkpoints for all long jobs.
- Enforce progress-delta monitoring.
- Separate environment debugging from code debugging.
- Require debugging tools (LLDB, TSan, ASan).
- Implement subagent watchdogs.
- Maintain execution state separate from summaries.
- Validate project graph after file creation.
- Force strategy change after repeated failures.
Agents must not perform side-effects (branch creation, PR creation, merge) without consulting the canonical workflow state.
All GitHub operations must be idempotent.

    from enum import Enum


    class WorkflowPhase(str, Enum):
        """Canonical workflow phases, declared in forward order."""

        ISSUE_CREATED = "issue_created"
        BRANCH_CREATED = "branch_created"
        WORK_IN_PROGRESS = "work_in_progress"
        PR_CREATED = "pr_created"
        CI_RUNNING = "ci_running"
        READY_TO_MERGE = "ready_to_merge"
        MERGED = "merged"
        CLOSED = "closed"

Transitions must only move forward. Backward transitions require explicit human override.
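
The forward-only rule can be sketched as a validation function. The sketch assumes "forward" means any strictly later phase in declaration order (skipping phases is allowed, staying in place is not); `validate_transition` and its `human_override` flag are hypothetical.

```python
from enum import Enum


class WorkflowPhase(str, Enum):
    ISSUE_CREATED = "issue_created"
    BRANCH_CREATED = "branch_created"
    WORK_IN_PROGRESS = "work_in_progress"
    PR_CREATED = "pr_created"
    CI_RUNNING = "ci_running"
    READY_TO_MERGE = "ready_to_merge"
    MERGED = "merged"
    CLOSED = "closed"


# Enum iteration follows declaration order, which is the forward order.
PHASE_ORDER = list(WorkflowPhase)


def validate_transition(
    current: WorkflowPhase,
    target: WorkflowPhase,
    human_override: bool = False,
) -> None:
    """Raise ValueError for a non-forward transition unless a human overrides it."""
    moves_forward = PHASE_ORDER.index(target) > PHASE_ORDER.index(current)
    if not moves_forward and not human_override:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
```

If the workflow should only ever advance one phase at a time, the comparison would change to require an index difference of exactly 1.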
Each agent must operate against a single authoritative task record.

    {
      "agent_id": "agent-001",
      "issue_number": 412,
      "branch": "agent/412-contract-lineage",
      "pull_request_number": 527,
      "phase": "pr_created",
      "ci_status": "pending",
      "attempt_counters": {
        "pr_creation_attempts": 1
      },
      "updated_at": "2026-02-27T18:10:00Z"
    }

This record must be checked before:
- Creating branches
- Creating pull requests
- Rebasing
- Merging
- Closing issues
Agents must never call raw GitHub side-effects directly.
Instead, use wrapper functions that enforce idempotency:
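
    def ensure_branch(branch: str, base: str) -> None:
        """Create the branch only if it does not already exist."""
        if branch_exists(branch):
            return
        create_branch(branch, base)


    def ensure_pull_request(branch: str, issue_number: int) -> int:
        """Return the PR number for this branch, creating the PR if needed."""
        existing = find_pr_by_branch(branch)
        if existing:
            return existing.number
        # create_pr is expected to return the number of the new PR.
        return create_pr(branch, issue_number)

Duplicate PR creation must be impossible by design.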
Agents must track attempt counters for side-effects:
- PR creation attempts
- Merge attempts
- Rebase attempts
If any attempt counter exceeds 1 without phase advancement:
→ Abort execution
→ Emit "LOOP_GUARD_TRIGGERED"
→ Escalate to orchestrator
Infinite retry loops are forbidden.
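
A sketch of the loop guard, assuming counters are reset whenever the phase advances (the text implies this but does not state it). `AttemptCounters` and `LoopGuardTriggered` are hypothetical names; the event string is the one given above.

```python
from dataclasses import dataclass, field


class LoopGuardTriggered(RuntimeError):
    """Raised so the orchestrator can take over."""

    event = "LOOP_GUARD_TRIGGERED"


@dataclass
class AttemptCounters:
    phase: str
    counts: dict = field(default_factory=dict)

    def record_attempt(self, side_effect: str) -> None:
        """Count an attempt; abort on the second attempt within one phase."""
        self.counts[side_effect] = self.counts.get(side_effect, 0) + 1
        if self.counts[side_effect] > 1:
            raise LoopGuardTriggered(
                f"{side_effect} attempted {self.counts[side_effect]} times "
                f"in phase {self.phase}"
            )

    def advance(self, new_phase: str) -> None:
        """Phase advancement clears the counters (assumed behavior)."""
        self.phase = new_phase
        self.counts.clear()
```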
Before executing any side-effect:
- Read canonical registry
- Query GitHub for real state
- Confirm registry matches external state
- Confirm transition is valid
If registry and GitHub disagree:
→ Enter RECONCILIATION mode
→ Do not proceed blindly
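
The registry-versus-GitHub comparison could be sketched as below. It assumes both sources are first normalized into the same typed view; `TaskView` and its fields are hypothetical, and fetching the live GitHub state is out of scope for the sketch.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class TaskView:
    """One view of a task: the registry record or the live GitHub state."""

    branch_exists: bool
    pull_request_number: Optional[int]
    phase: str


def mismatched_fields(registry: TaskView, github: TaskView) -> List[str]:
    """List the field names on which the two views disagree."""
    return [
        f.name
        for f in fields(TaskView)
        if getattr(registry, f.name) != getattr(github, f.name)
    ]


def preflight_mode(registry: TaskView, github: TaskView) -> str:
    """Return RECONCILIATION on any disagreement, otherwise PROCEED."""
    return "RECONCILIATION" if mismatched_fields(registry, github) else "PROCEED"
```

Returning the mismatched field names gives the reconciliation step something concrete to log and resolve.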
Each agent must operate inside:
- Dedicated git worktree
- Dedicated branch
- Dedicated container
Agents must not:
- Mount root repository in multiple containers
- Modify branches not assigned to them
- Create arbitrary branch names
Branch naming format:
agent/<issue-number>-<slug>
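
A small validator for the naming format, as a sketch. The format itself comes from the line above; restricting the slug to lowercase letters, digits, and single hyphens is an assumption, as is the `parse_agent_branch` name.

```python
import re
from typing import Tuple

# Slug character set is assumed: lowercase alphanumerics separated by hyphens.
BRANCH_PATTERN = re.compile(
    r"agent/(?P<issue>[1-9][0-9]*)-(?P<slug>[a-z0-9]+(?:-[a-z0-9]+)*)"
)


def parse_agent_branch(name: str) -> Tuple[int, str]:
    """Return (issue_number, slug), or raise ValueError for a bad name."""
    match = BRANCH_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"branch name {name!r} does not match agent/<issue-number>-<slug>")
    return int(match["issue"]), match["slug"]
```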
Agents do NOT share token context.
They share:
- Structured working memory
- Contract hashes
- Event logs
They do NOT share:
- Internal reasoning traces
- Chain-of-thought
- Raw prompt transcripts
Shared state must be structured and typed.
Agents must not:
- Use Any unless wrapped in explicit boundary object
- Cast at caller site to hide typing errors
- Use # type: ignore without protocol-boundary justification
If a type mismatch occurs:
→ Fix the callee
→ Propagate correct type upward
→ Eliminate covariance
Follow strict module boundaries:
One module → one responsibility → clean API → no covariance.
Agents must persist:
- Full execution context to durable storage
- Compressed execution summary to working memory
If memory window pressure occurs:
→ Use compressed context
→ Never discard canonical state
Full context must remain retrievable from storage.
All execution layers must verify:

    assert section_contract.parent_contract_hash == instrument_contract.contract_hash

If mismatch:
→ Raise ProtocolViolation
→ Halt execution
Structural fields must override LLM-proposed values.
Advisory fields must never influence structural placement.
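
One way to enforce these two rules in code rather than in prompts is sketched below. It assumes the set of advisory keys is known in advance; `merge_proposal` and the example field names are hypothetical.

```python
def merge_proposal(structural: dict, proposed: dict, advisory_keys: frozenset) -> dict:
    """Combine an LLM proposal with the enforced structural fields.

    Only keys listed as advisory are accepted from the proposal, and a
    structural key always takes the enforced value.
    """
    merged = {
        key: value
        for key, value in proposed.items()
        if key in advisory_keys and key not in structural
    }
    merged.update(structural)  # structural values win unconditionally
    return merged
```

For example, a proposal that tries to move `bar_start` is ignored on that field, while a purely advisory hint such as `mood` passes through.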
Agents must operate as deterministic state machines.
Side-effects must be:
- Idempotent
- Phase-gated
- Hash-verified
Agents must not "decide what to do next."
They must:
- Query state
- Validate transition
- Execute allowed action
- Update state
- Halt
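
The five steps above can be sketched as a single table-driven step function. The sketch assumes each state permits exactly one action; the `Transitions` table layout and `run_step` are illustrative, and persisting the returned state is left to the caller.

```python
from typing import Callable, Dict, Tuple

# Transition table: current state -> (the one allowed action, next state).
Transitions = Dict[str, Tuple[Callable[[], None], str]]


def run_step(state: str, table: Transitions) -> str:
    """Execute the single allowed action for `state` and return the new state."""
    if state not in table:
        # No valid transition: refuse to act instead of improvising.
        raise ValueError(f"no transition allowed from {state!r}")
    action, next_state = table[state]
    action()
    return next_state
```

The function performs one action and returns, which matches the "halt" step: the next action only happens after state is queried again.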
This section upgrades the agent taxonomy from heuristic execution to deterministic protocol-governed orchestration.