Harness Engineering
A synthesis of OpenAI's harness engineering report, Anthropic's long-running agent research, Anthropic's prompting best practices docs, and lessons from workos/case (a production harness orchestrating AI agent work across multiple repositories).
Core thesis: the bottleneck in agent performance is environment design, not model intelligence. Or as OpenAI put it: "The horse is fast. The harness is everything."
A harness is everything surrounding the model during execution: scaffolding, constraints, feedback loops, state management, and tooling. The metaphor comes from horsemanship -- a harness channels a powerful animal in a productive direction. The horse doesn't choose where to go; the rider steers through the harness.
OpenAI defines harness engineering as "the emerging discipline of designing the constraints, feedback loops, documentation structures, linting rules, observability pipelines, and lifecycle management systems that allow AI coding agents to operate reliably at scale."
It is distinct from:
- Prompt engineering — optimizing instructions within a single turn
- Context engineering — controlling which tokens are visible
A well-designed harness lets agents operate autonomously across hours and multiple context windows. OpenAI reported single Codex runs working on one task for upwards of six hours -- often while the humans were sleeping.
In traditional development, compute is cheap and human attention is moderately scarce. In agent-first development, this inverts: code throughput far exceeds human review capacity.
This changes every engineering tradeoff:
- Waiting is expensive, corrections are cheap. Minimal blocking merge gates. Short-lived PRs. Test flakes addressed with follow-up runs rather than blocking indefinitely.
- QA becomes the bottleneck. As throughput scaled, OpenAI discovered agents couldn't see the running application — bugs were caught only after human review. The fix was making the app legible to agents (see the verification infrastructure discussion below).
- Human attention is the scarce resource. Every harness investment should be evaluated by: "Does this reduce the need for human attention?"
workos/case operationalizes this: engineers define goals and acceptance criteria, agents implement. The harness is the product; the code is the output.
Agents working on long tasks will exhaust their context window. Each new session starts blank -- like a shift engineer arriving with no handoff notes. All sources agree on the same solution architecture.
The first context window is special. Use it to build the scaffolding, not the features:
- Set up the environment: build scripts, dev server, test infrastructure
- Create a structured task/feature list in JSON (not markdown -- JSON is harder for agents to accidentally mutate)
- Write setup scripts (`init.sh`) so future sessions can boot the environment in one command
- Create a progress file as a changelog
- Make an initial git commit as a checkpoint
Anthropic found that without this initializer pattern, agents attempted too much simultaneously, ran out of context mid-implementation, and left features half-built and undocumented.
Every new session starts by:
- Running `pwd` to confirm the working directory
- Reading git logs and progress files to rebuild context
- Picking the highest-priority incomplete task
- Working on one feature at a time, committing after each
A typical session start (from Anthropic's research):
```
[Agent] I'll start by getting my bearings and understanding the current state.
> pwd
> read claude-progress.txt
> read feature_list.json
> git log --oneline -20
> [starts dev server via init.sh]
> [runs fundamental integration test before starting new work]
```
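The "pick the highest-priority incomplete task" step of this ritual can be sketched as a small helper the harness calls at session start. This is a minimal sketch, assuming a JSON task list with `id`, `name`, and `status` fields (the function name and field layout are illustrative):

```python
import json


def pick_next_task(feature_list_path: str):
    """Return the next incomplete feature from the session's task list.

    Assumes the list is JSON with a top-level "tests" array whose entries
    carry "id", "name", and "status" fields (illustrative layout).
    """
    with open(feature_list_path) as f:
        features = json.load(f)["tests"]
    # Work on one feature at a time: lowest id that is not yet passing.
    incomplete = [t for t in features if t["status"] != "passing"]
    return min(incomplete, key=lambda t: t["id"]) if incomplete else None
```

A harness would call this right after reading the progress file, so every fresh context window resumes from the same deterministic place.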
Anthropic's docs explicitly recommend starting with a brand new context window rather than compacting, because Claude's latest models are "extremely effective at discovering state from the local filesystem." The filesystem is the memory.
Be prescriptive about how the agent should start:
- "Call pwd; you can only read and write files in this directory."
- "Review progress.txt, tests.json, and the git logs."
- "Manually run through a fundamental integration test before moving on to implementing new features."
Claude 4.6 and 4.5 models feature context awareness -- the model can track its remaining token budget throughout a conversation. If your harness supports compaction or saving to external files:
```
Your context window will be automatically compacted as it approaches its limit,
allowing you to continue working indefinitely. Do not stop tasks early due to
token budget concerns. As you approach the limit, save progress and state before
the context window refreshes.
```
Without this guidance, Claude may try to wrap up work prematurely as it approaches the context limit.
| What to Track | Format | Why |
|---|---|---|
| Task/test status | JSON (e.g., `tasks.json`) | Structured, machine-parseable, hard to accidentally corrupt |
| Progress notes | Markdown/plaintext | Freeform, captures nuance and reasoning |
| Code changes | Git | Provides history, checkpoints, and rollback |
Anthropic recommends structured formats for anything agents need to parse across sessions:
```json
{
  "tests": [
    { "id": 1, "name": "authentication_flow", "status": "passing" },
    { "id": 2, "name": "user_management", "status": "failing" },
    { "id": 3, "name": "api_endpoints", "status": "not_started" }
  ],
  "total": 200,
  "passing": 150,
  "failing": 25,
  "not_started": 25
}
```

Anthropic's initializer pattern goes further: the feature list is JSON where agents can only modify the `passes` field, preventing scope creep or feature deletion.
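That "one mutable field" constraint can be enforced outside the model by a guard that diffs the proposed file against the committed version and rejects any other change. A minimal sketch, assuming the `tests`/`status` layout shown above (Anthropic's version uses a `passes` field; the function name is illustrative):

```python
import json

ALLOWED_FIELD = "status"  # the only field agents may change (illustrative)


def validate_feature_list_edit(old_text: str, new_text: str) -> bool:
    """Accept an edit only if each feature differs in at most ALLOWED_FIELD
    and no features were added or removed."""
    old, new = json.loads(old_text), json.loads(new_text)
    if len(old["tests"]) != len(new["tests"]):
        return False  # features were added or deleted
    for before, after in zip(old["tests"], new["tests"]):
        for key in before.keys() | after.keys():
            if key == ALLOWED_FIELD:
                continue
            if before.get(key) != after.get(key):
                return False  # an immutable field was rewritten
    return True
```

Wired into a file-write hook, this turns "don't delete features" from an instruction into a mechanical invariant.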
For complex multi-agent workflows, enforce valid state transitions:
```
active → implementing → verifying → reviewing → closing → pr-opened → merged
```
Recovery transitions allow going backward (e.g., verifying → implementing for fix-and-retry). Invalid transitions are rejected by a script, not by instructions. Evidence fields (like `tested`, `manualTested`) can only be set by marker scripts that verify real work was done -- not by agents directly.
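A transition validator is only a few lines. This sketch reconstructs the table from the chain above; any recovery edges beyond `verifying → implementing` are illustrative assumptions, not from the sources:

```python
# Valid transitions for the task lifecycle; the set values are the states
# reachable from each key. Backward edges support fix-and-retry.
TRANSITIONS = {
    "active": {"implementing"},
    "implementing": {"verifying"},
    "verifying": {"reviewing", "implementing"},  # recovery edge
    "reviewing": {"closing", "implementing"},    # assumed recovery edge
    "closing": {"pr-opened"},
    "pr-opened": {"merged"},
    "merged": set(),
}


def transition(current: str, target: str) -> str:
    """Apply a state change, raising instead of trusting the agent."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"invalid transition: {current} -> {target}")
    return target
```

Because the script raises on bad moves, an agent cannot skip from `implementing` straight to `pr-opened` no matter what its prompt says.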
workos/case uses WIP commits after each logical step (`wip: description`), then squashes them into a single clean conventional commit before finalizing. This provides rollback points without polluting git history.
Critical rule from all sources: "It is unacceptable to remove or edit tests because this could lead to missing or buggy functionality."
OpenAI's strongest insight: if it isn't discoverable in the repo, it doesn't exist for the agent. Agents can only reason about what they can see in their working set: prompt context, retrieved documents, tool outputs, and runtime observations. Slack threads, Google Docs, and tacit knowledge are operationally invisible.
This means:
- All knowledge lives in versioned files (`docs/`, `AGENTS.md`, specs, design docs)
- Plans and execution state are repo artifacts, not Jira tickets
- The `AGENTS.md`/`CLAUDE.md` file should be a ~100-line table of contents pointing to deeper docs -- not an encyclopedia
OpenAI documented four failure modes of large AGENTS.md files:
- Context crowding — A large instruction file displaces the task, the code, and relevant docs from context, leaving the agent optimizing for the wrong constraints.
- Non-guidance from over-guidance — When everything is flagged as important, the agent pattern-matches locally rather than navigating intentionally.
- Instant rot — A monolithic manual cannot be mechanically verified. It becomes a graveyard of stale rules.
- Undetectable drift — Without structural verification, any single document decays silently.
```
CLAUDE.md / AGENTS.md (map, ~100 lines)
└── docs/
    ├── architecture/  ← how the system works
    ├── conventions/   ← shared rules (commits, testing, code style)
    ├── sdk-designs/   ← per-language design specs
    ├── playbooks/     ← step-by-step guides for common tasks
    └── learnings/     ← accumulated tactical knowledge
```
Agents read the map first, then navigate to specifics on demand. OpenAI took this to an extreme: 88 AGENTS.md files, one per major subsystem, to keep instructions local and minimal.
Structure CLAUDE.md for LLM cache efficiency (from workos/case):
1. Identity & Purpose (stable — rarely changes)
2. Rules & Conventions (stable)
3. Architecture Overview (stable)
4. Commands (semi-stable)
5. Known Issues (volatile)
6. Temporary Notes (volatile)
Stable content first means the LLM's KV cache can reuse the prefix across turns. Mixing stable and volatile content in the same section defeats caching.
Don't just tell agents what to do -- enforce it mechanically with linters, hooks, and structural tests. As workos/case puts it: "Instructions decay, enforcement persists." Agents forget instructions over long sessions. Hooks and linters don't.
- Custom linters (themselves written by agents) enforce naming conventions, file sizes, logging formats, and dependency layer hierarchy
- Lint error messages are written to inject remediation instructions into agent context -- every violation becomes a self-repair prompt
- Dependency layers enforced as a one-way constraint: Types → Config → Repo → Service → Runtime → UI
- Additional invariants: structured logging, naming conventions for schemas and types, file size limits, data validation at all external boundaries
- Encourage agents to create setup scripts (`init.sh`) for servers, test suites, and linters
- Provide verification tools (Playwright MCP, browser automation) so agents can test as users would
- "Quality of life tools" prevent repeated setup work across sessions
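The one-way dependency constraint above lends itself to a mechanical check whose error message doubles as a self-repair prompt. A minimal sketch; the lowercase layer names and the message wording are illustrative, not from the sources:

```python
# Layer order from the text; a layer may only import from layers at or
# below its own position. UI sits on top, Types at the bottom.
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]


def check_import(importer_layer: str, imported_layer: str):
    """Return a remediation message if the import breaks the one-way
    constraint, or None if the import is allowed."""
    if LAYERS.index(imported_layer) > LAYERS.index(importer_layer):
        # The message tells the agent how to fix the violation,
        # so every lint error becomes a self-repair instruction.
        return (
            f"Layer violation: '{importer_layer}' may not import from "
            f"'{imported_layer}'. Move the shared code down to "
            f"'{importer_layer}' or below, then re-run the linter."
        )
    return None
```

The point is the return value: instead of a bare "error E042", the violation text injects the fix into the agent's context on the next tool call.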
The most sophisticated enforcement pattern. Every pre-PR gate is enforced by hooks that intercept tool calls, not by instructions in prompts:
Evidence markers that can't be faked:
- `.case-tested` — SHA-256 hash of actual test output. A bare `touch .case-tested` is rejected.
- `.case-manual-tested` — Checks for recent Playwright screenshots as proof.
- `.case-reviewed` — Requires a `--critical 0` flag; refuses if critical findings exist.
Hook-based gating:
- `pre-pr-check.sh` — Blocks PR creation without evidence markers
- `pre-push-check.sh` — Blocks pushes to main/master
- `pre-commit-check.sh` — Enforces conventional commit format
- `post-pr-cleanup.sh` — Updates task JSON, cleans markers
The key insight: "STOP -- do this before proceeding" doesn't work in a prompt. A hook that blocks `gh pr create` does.
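The hash-based evidence marker can be sketched in a few lines. This is a minimal illustration of the idea, not workos/case's implementation; the function names are assumptions:

```python
import hashlib


def write_test_marker(marker_path: str, test_output: str) -> None:
    """Record a SHA-256 fingerprint of the real test output as evidence."""
    digest = hashlib.sha256(test_output.encode()).hexdigest()
    with open(marker_path, "w") as f:
        f.write(digest)


def marker_is_genuine(marker_path: str, test_output: str) -> bool:
    """Verify the marker against the claimed output. A bare `touch`
    produces an empty file, so the hash comparison fails."""
    try:
        with open(marker_path) as f:
            recorded = f.read().strip()
    except FileNotFoundError:
        return False
    return recorded == hashlib.sha256(test_output.encode()).hexdigest()
```

A pre-PR hook would call `marker_is_genuine` with the logged test output; an agent that writes the marker without running tests has nothing valid to hash.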
workos/case fingerprints every failed command (SHA-256 of command + first line of output). After 3 consecutive identical failures, it blocks the agent and forces a different approach. This prevents agents from retrying the same failing command in a loop.
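The doom-loop detector reduces to a fingerprint plus a consecutive-failure counter. A minimal sketch under the description above (the class name and success-reset behavior are illustrative):

```python
import hashlib


class DoomLoopDetector:
    """Block a command after N consecutive identical failures."""

    def __init__(self, limit: int = 3):
        self.limit = limit
        self.last_fp = None
        self.streak = 0

    @staticmethod
    def fingerprint(command: str, output: str) -> str:
        # SHA-256 of the command plus the first line of its output,
        # per the fingerprinting scheme described in the text.
        first_line = output.splitlines()[0] if output else ""
        return hashlib.sha256(f"{command}\n{first_line}".encode()).hexdigest()

    def record_failure(self, command: str, output: str) -> bool:
        """Return True when the agent should be blocked from retrying."""
        fp = self.fingerprint(command, output)
        self.streak = self.streak + 1 if fp == self.last_fp else 1
        self.last_fp = fp
        return self.streak >= self.limit
```

When `record_failure` returns True, the harness rejects the tool call and injects an instruction to try a different approach.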
Agents need to close their own feedback loops without human intervention. Any validation that requires a human to inspect the running application is a throughput bottleneck.
- Per-worktree booting — The app was made bootable per git worktree, allowing isolated instances per concurrent task. Eliminates cross-task contamination.
- Chrome DevTools Protocol — DOM snapshots, screenshot capture, browser navigation wired directly into the agent runtime. Feedback loop: snapshot → trigger UI path → observe → fix → re-snapshot → loop until clean.
- Ephemeral local observability — Each worktree gets its own isolated stack: logs (VictoriaLogs / LogQL), metrics (VictoriaMetrics / PromQL), traces (VictoriaTraces / TraceQL). Torn down when the task completes.
This enabled prompts like "ensure startup completes in <800ms" and "no span in these four critical user journeys exceeds two seconds" -- tractable because the agent could directly measure outcomes.
- Without explicit prompting, Claude tended to skip end-to-end testing
- When provided browser automation tools (Puppeteer MCP), agents dramatically improved at finding and fixing bugs
- Explicit prompting to "test as users would" significantly improves accuracy
- Encourage complete context usage:
```
This is a very long task, so plan your work clearly. Spend your entire output
context working -- just don't run out with uncommitted work.
```
Separates implementation from verification into isolated agents. The verifier:
- Cannot edit code (enforced by role)
- Uses Playwright to test the specific fix
- Creates evidence markers that prove testing happened
- Asks: "If I reverted my change, would this test fail?" -- not just happy-path testing
With sufficient scaffolding, a single agent can:
- Validate current codebase state
- Reproduce a reported bug
- Record a video demonstrating the failure
- Implement a fix
- Validate the fix by driving the application
- Record a second video demonstrating resolution
- Open a pull request
- Respond to feedback
- Detect and remediate build failures
- Escalate to a human only when judgment is required
- Merge the change
Steps involving bug reproduction, video capture, and UI validation are only possible because of the DevTools and observability integrations. The full loop depends on every layer of the harness being in place.
Agents are highly effective at pattern replication. They learn from and repeat whatever patterns exist in the codebase -- including bad ones. If the codebase has architectural drift, agents will faithfully reproduce and amplify it. OpenAI called this "AI slop": patterns that proliferate because they were present in the codebase's effective training distribution.
Without automation, OpenAI spent ~20% of Fridays cleaning up AI-generated drift.
Encode opinionated, mechanical rules directly into the repository:
- Prefer shared utility packages over hand-rolled helpers
- Validate data at boundaries rather than probing shapes speculatively
- Use instrumented concurrency utilities rather than third-party primitives with opaque behavior
Background agents scan for deviations on a regular cadence, update quality grades, and open targeted refactoring PRs. Most can be reviewed in under a minute and auto-merged. Technical debt becomes continuous garbage collection, not periodic sprints.
workos/case runs an entropy scanner (entropy-scan.sh) that wraps convention checks and outputs structured JSON. Can be run on a recurring loop during active sessions:
```
/loop 30m bash scripts/entropy-scan.sh
```

Convention drift is treated as inevitable -- the question is detection speed, not prevention.
A retrospective agent runs after every pipeline run (success or failure):
- Reads the full progress log and task JSON timing
- Classifies improvements by location (docs, scripts, agents, hooks, conventions)
- Applies fixes directly to harness files (not just suggests them)
- Maintains per-repo learnings files in `docs/learnings/`
- Escalates patterns: 3+ similar learnings → promoted to a convention or golden principle
This means every run improves the harness. Encoded capabilities compound over time.
workos/case uses a six-agent pipeline where each agent runs in an isolated context window with a single responsibility. This prevents "context pollution" -- agents forgetting to test, skipping steps, or gaming evidence.
| Agent | Responsibility | Forbidden Actions |
|---|---|---|
| Orchestrator | Parse issue, create task, dispatch agents | Write code, run Playwright |
| Implementer | Write code, run unit tests, WIP commits | Create PRs, start apps |
| Verifier | Test the specific fix with Playwright | Edit code, commit |
| Reviewer | Review diff against golden principles, gate PR | Edit code, commit, run tests |
| Closer | Create PR with description, satisfy hooks | Edit code, run tests |
| Retrospective | Analyze run, apply harness improvements | Edit target repo code |
Each agent emits a structured JSON result block that the orchestrator parses deterministically. Forbidden actions are enforced by the harness, not by instructions.
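Deterministic parsing of those result blocks might look like the sketch below. The fenced-JSON convention and the required field names are assumptions for illustration; the sources don't specify the wire format:

```python
import json
import re

# Built dynamically so this snippet doesn't nest literal code fences.
FENCE = "`" * 3


def parse_result_block(agent_output: str) -> dict:
    """Extract and validate the last fenced JSON block in an agent's
    transcript; raise rather than guess if it is missing or malformed."""
    pattern = re.compile(FENCE + r"json\s*(.*?)" + FENCE, re.DOTALL)
    blocks = pattern.findall(agent_output)
    if not blocks:
        raise ValueError("agent emitted no structured result block")
    result = json.loads(blocks[-1])
    for field in ("status", "summary"):  # illustrative required fields
        if field not in result:
            raise ValueError(f"result block missing required field: {field}")
    return result
```

Failing loudly matters here: if the orchestrator silently tolerated a missing block, agents could complete a run without ever reporting verifiable state.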
The key question from workos/case: specialized agents handling testing, QA, and code cleanup separately may outperform a single general-purpose agent. Anthropic's research raises the same open question.
Claude responds to direct instructions better than vague hints. Think of Claude as a brilliant but new employee who lacks context on your norms:
```
# Less effective
Create an analytics dashboard

# More effective
Create an analytics dashboard. Include as many relevant features as possible.
Go beyond the basics to create a fully-featured implementation.
```
Provide context/motivation for instructions -- Claude generalizes from explanations:
```
# Less effective
NEVER use ellipses

# More effective
Your response will be read aloud by a text-to-speech engine, so never use
ellipses since the engine won't know how to pronounce them.
```
- Claude 4.6 uses adaptive thinking (`thinking: {type: "adaptive"}`) -- the model decides when and how much to think
- Use the `effort` parameter (low/medium/high/max) to control depth
- Prefer general instructions ("think thoroughly") over prescriptive step-by-step plans -- Claude's reasoning frequently exceeds what a human would prescribe
- Ask Claude to self-check: "Before you finish, verify your answer against [criteria]"
- If Claude overthinks, add: "Choose an approach and commit to it. Avoid revisiting decisions unless you encounter new information that directly contradicts your reasoning."
Claude's latest models excel at parallel execution. Boost to ~100% reliability with:
```
If you intend to call multiple tools and there are no dependencies
between the calls, make all independent calls in parallel.
```
- Claude 4.6 proactively delegates to subagents without explicit instruction
- Watch for overuse: may spawn subagents when direct grep/read would suffice
- Steer with: "Use subagents for parallel/isolated work. For simple tasks, work directly."
For local, reversible actions (editing files, running tests): proceed freely.
For hard-to-reverse or shared-system actions: ask before proceeding.
Examples warranting confirmation:
- Destructive: deleting files/branches, dropping tables, `rm -rf`
- Hard to reverse: `git push --force`, `git reset --hard`
- Visible to others: pushing code, commenting on PRs, sending messages
Claude 4.6 tends to create extra files, add unnecessary abstractions, and overengineer:
- "Only make changes that are directly requested or clearly necessary"
- "Don't add features, refactor code, or make improvements beyond what was asked"
- "Don't create helpers for one-time operations"
- "Don't add docstrings, comments, or type annotations to code you didn't change"
```
Never speculate about code you have not opened. If the user references
a specific file, you MUST read it before answering.
```
If your prompts were designed for earlier models:
- Remove over-prompting. Tools that undertriggered before now trigger appropriately. "If in doubt, use [tool]" will cause overtriggering.
- Replace blanket defaults with targeted instructions. Instead of "Default to using [tool]," use "Use [tool] when it would enhance your understanding."
- Dial back aggressive language. Where you had "CRITICAL: You MUST use this tool when...", use "Use this tool when..."
When agents produce poor output:
- Do not try harder with prompts or write code manually
- Do diagnose what capability is missing (tools, guardrails, abstractions, documentation)
- Have the agent itself build the missing capability into the repo
- Each encoded capability compounds over time
As workos/case puts it: "When agents struggle, fix the harness." The fix is never "try harder" -- it's a missing doc, playbook, convention, or enforcement rule.
workos/case's retrospective agent automates this: after every run, it classifies what went wrong and applies fixes directly to the harness. Over time, the harness absorbs the lessons from every failure.
Agent throughput changes the risk profile. The same harness that makes an agent productive also gives it leverage: access to repositories, build systems, credentials, and deployment pathways.
- Secrets exposure — Agents can leak credentials into logs, commits, issues, or tool outputs
- Privilege escalation — Broad shell access + permissive CI/CD turns "write code" into "change infrastructure"
- Prompt injection — Docs, issue text, or comments can act as adversarial instructions
- Supply-chain drift — Agents can "fix" problems by adding dependencies or loosening versions
- Least privilege — Short-lived tokens, narrowly scoped credentials, separate read/write privileges
- Controlled egress — Default-deny network access, grant narrowly
- Instruction integrity — Keep policy in well-known files (AGENTS.md), lint against policy in low-trust surfaces
- Provenance and audit — Log tool calls, diffs, test results, deployment actions
- Fast rollback — If you adopt "corrections are cheap" merge philosophy, invest in rapid detection and rollback
All sources converge: engineers shift from writing code to:
- Designing the environment / harness
- Specifying intent through structured prompts and specs
- Building feedback loops and verification infrastructure
- Translating requirements into acceptance criteria
- Selective, high-leverage review (not universal review)
OpenAI reports: 3-7 engineers, ~1M lines of code, ~1,500 merged PRs, zero manually written source code, ~3.5 PRs/engineer/day over 5 months. The engineer's job shifts from producing correct code to producing an environment in which an agent reliably produces correct code.
workos/case describes this as "like managing 50 interns" -- queue work, don't micromanage.
| # | Action | Source |
|---|---|---|
| 1 | Keep CLAUDE.md short (~100 lines) as a map to deeper docs | OpenAI, case |
| 2 | Store all project knowledge in versioned repo files, not external tools | OpenAI |
| 3 | Use JSON for structured state, plaintext for progress, git for checkpoints | Anthropic, case |
| 4 | Write custom linters whose error messages teach agents how to fix violations | OpenAI |
| 5 | Enforce invariants mechanically with hooks, not instructions in prompts | OpenAI, case |
| 6 | First context window: set up framework. Subsequent: iterate on todo list | Anthropic |
| 7 | Start fresh context windows rather than compacting | Anthropic |
| 8 | Provide end-to-end verification tools (browser automation, observability) | All three |
| 9 | Run recurring cleanup agents to prevent drift accumulation | OpenAI, case |
| 10 | When agents fail, encode the missing capability -- don't write code manually | OpenAI, case |
| 11 | Use adaptive thinking with appropriate effort levels | Anthropic docs |
| 12 | Be explicit about autonomy boundaries: free locally, confirm for shared actions | Anthropic docs |
| 13 | Use evidence-based gating: proofs that work was done, not just assertions | case |
| 14 | Detect doom loops: block after 3 identical failures, force a different approach | case |
| 15 | Run a retrospective after every agent run; apply harness fixes directly | case |
| 16 | Isolate agent roles: implementer can't create PRs, verifier can't edit code | case |
| 17 | Structure CLAUDE.md for LLM cache efficiency: stable content first | case |
| 18 | Use WIP checkpoint commits for rollback, squash into clean commits at the end | case |
- A small `AGENTS.md`/`CLAUDE.md` entrypoint that points to deeper docs
- A reproducible dev environment (one-command boot)
- Per-worktree isolation to prevent cross-task contamination
- Mechanical invariants in CI (architecture boundaries, formatting, dependency rules)
- Agent legibility hooks (structured logs, queryable traces/metrics)
- Clear evaluation gates ("done" criteria, regression tests, security checks) that agents can run and interpret
- Safety rails (least-privilege credentials, controlled egress, audit logs, rollback playbook)
- OpenAI — "Harness engineering: leveraging Codex in an agent-first world" by Ryan Lopopolo (Feb 2026)
- Anthropic — "Effective Harnesses for Long-Running Agents" by Justin Young (Nov 2025)
- Anthropic Docs — "Prompting Best Practices" (platform.claude.com)
- workos/case — Production harness for multi-repo AI agent orchestration (github.com/workos/case)