Harness Engineering
A synthesis of OpenAI's harness engineering report, Anthropic's long-running agent research, Anthropic's prompting best practices docs, and lessons from workos/case (a production harness orchestrating AI agent work across multiple repositories).
Core thesis: the bottleneck in agent performance is environment design, not model intelligence. Or as OpenAI put it: "The horse is fast. The harness is everything."
A harness is everything surrounding the model during execution: scaffolding, constraints, feedback loops, state management, and tooling. The metaphor comes from horsemanship -- a harness channels a powerful animal in a productive direction. The horse doesn't choose where to go; the rider steers through the harness.
OpenAI defines harness engineering as "the emerging discipline of designing the constraints, feedback loops, documentation structures, linting rules, observability pipelines, and lifecycle management systems that allow AI coding agents to operate reliably at scale."
It is distinct from:
- Prompt engineering — optimizing instructions within a single turn
- Context engineering — controlling which tokens are visible
A well-designed harness lets agents operate autonomously across hours and multiple context windows. OpenAI reported single Codex runs working on one task for upwards of six hours -- often while the humans were sleeping.
In traditional development, compute is cheap and human attention is moderately scarce. In agent-first development, this inverts: code throughput far exceeds human review capacity.
This changes every engineering tradeoff:
- Waiting is expensive, corrections are cheap. Minimal blocking merge gates. Short-lived PRs. Test flakes addressed with follow-up runs rather than blocking indefinitely.
- QA becomes the bottleneck. As throughput scaled, OpenAI discovered agents couldn't see the running application — bugs were caught only after human review. The fix was making the app legible to agents (see the verification infrastructure discussion below).
- Human attention is the scarce resource. Every harness investment should be evaluated by: "Does this reduce the need for human attention?"
workos/case operationalizes this: engineers define goals and acceptance criteria, agents implement. The harness is the product; the code is the output.
Agents working on long tasks will exhaust their context window. Each new session starts blank -- like a shift engineer arriving with no handoff notes. All sources agree on the same solution architecture.
The first context window is special. Use it to build the scaffolding, not the features:
- Set up the environment: build scripts, dev server, test infrastructure
- Create a structured task/feature list in JSON (not markdown -- JSON is harder for agents to accidentally mutate)
- Write setup scripts (`init.sh`) so future sessions can boot the environment in one command
- Create a progress file as a changelog
- Make an initial git commit as a checkpoint
Anthropic found that without this initializer pattern, agents attempted too much simultaneously, ran out of context mid-implementation, and left features half-built and undocumented.
Every new session starts by:
- Running `pwd` to confirm the working directory
- Reading git logs and progress files to rebuild context
- Picking the highest-priority incomplete task
- Working on one feature at a time, committing after each
A typical session start (from Anthropic's research):
```
[Agent] I'll start by getting my bearings and understanding the current state.
> pwd
> read claude-progress.txt
> read feature_list.json
> git log --oneline -20
> [starts dev server via init.sh]
> [runs fundamental integration test before starting new work]
```
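The "pick the highest-priority incomplete task" step of this ritual can be sketched as a small helper the harness calls at session start. This is a minimal sketch, assuming a JSON task list with `id`, `name`, and `status` fields (the function name and field layout are illustrative):

```python
import json


def pick_next_task(feature_list_path: str):
    """Return the next incomplete feature from the session's task list.

    Assumes the list is JSON with a top-level "tests" array whose entries
    carry "id", "name", and "status" fields (illustrative layout).
    """
    with open(feature_list_path) as f:
        features = json.load(f)["tests"]
    # Work on one feature at a time: lowest id that is not yet passing.
    incomplete = [t for t in features if t["status"] != "passing"]
    return min(incomplete, key=lambda t: t["id"]) if incomplete else None
```

A harness would call this right after reading the progress file, so every fresh context window resumes from the same deterministic place.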
Anthropic's docs explicitly recommend starting with a brand new context window rather than compacting, because Claude's latest models are "extremely effective at discovering state from the local filesystem." The filesystem is the memory.
Be prescriptive about how the agent should start:
- "Call pwd; you can only read and write files in this directory."
- "Review progress.txt, tests.json, and the git logs."
- "Manually run through a fundamental integration test before moving on to implementing new features."
Claude 4.6 and 4.5 models feature context awareness -- the model can track its remaining token budget throughout a conversation. If your harness supports compaction or saving to external files:
```
Your context window will be automatically compacted as it approaches its limit,
allowing you to continue working indefinitely. Do not stop tasks early due to
token budget concerns. As you approach the limit, save progress and state before
the context window refreshes.
```
Without this guidance, Claude may try to wrap up work prematurely as it approaches the context limit.
| What to Track | Format | Why |
|---|---|---|
| Task/test status | JSON (e.g., `tasks.json`) | Structured, machine-parseable, hard to accidentally corrupt |
| Progress notes | Markdown/plaintext | Freeform, captures nuance and reasoning |
| Code changes | Git | Provides history, checkpoints, and rollback |
Anthropic recommends structured formats for anything agents need to parse across sessions:
```json
{
  "tests": [
    { "id": 1, "name": "authentication_flow", "status": "passing" },
    { "id": 2, "name": "user_management", "status": "failing" },
    { "id": 3, "name": "api_endpoints", "status": "not_started" }
  ],
  "total": 200,
  "passing": 150,
  "failing": 25,
  "not_started": 25
}
```

Anthropic's initializer pattern goes further: the feature list is JSON where agents can only modify the `passes` field, preventing scope creep or feature deletion.
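That "one mutable field" constraint can be enforced outside the model by a guard that diffs the proposed file against the committed version and rejects any other change. A minimal sketch, assuming the `tests`/`status` layout shown above (Anthropic's version uses a `passes` field; the function name is illustrative):

```python
import json

ALLOWED_FIELD = "status"  # the only field agents may change (illustrative)


def validate_feature_list_edit(old_text: str, new_text: str) -> bool:
    """Accept an edit only if each feature differs in at most ALLOWED_FIELD
    and no features were added or removed."""
    old, new = json.loads(old_text), json.loads(new_text)
    if len(old["tests"]) != len(new["tests"]):
        return False  # features were added or deleted
    for before, after in zip(old["tests"], new["tests"]):
        for key in before.keys() | after.keys():
            if key == ALLOWED_FIELD:
                continue
            if before.get(key) != after.get(key):
                return False  # an immutable field was rewritten
    return True
```

Wired into a file-write hook, this turns "don't delete features" from an instruction into a mechanical invariant.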
For complex multi-agent workflows, enforce valid state transitions:
```
active → implementing → verifying → reviewing → closing → pr-opened → merged
```
Recovery transitions allow going backward (e.g., verifying → implementing for fix-and-retry). Invalid transitions are rejected by a script, not by instructions. Evidence fields (like `tested`, `manualTested`) can only be set by marker scripts that verify real work was done -- not by agents directly.
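A transition validator is only a few lines. This sketch reconstructs the table from the chain above; any recovery edges beyond `verifying → implementing` are illustrative assumptions, not from the sources:

```python
# Valid transitions for the task lifecycle; the set values are the states
# reachable from each key. Backward edges support fix-and-retry.
TRANSITIONS = {
    "active": {"implementing"},
    "implementing": {"verifying"},
    "verifying": {"reviewing", "implementing"},  # recovery edge
    "reviewing": {"closing", "implementing"},    # assumed recovery edge
    "closing": {"pr-opened"},
    "pr-opened": {"merged"},
    "merged": set(),
}


def transition(current: str, target: str) -> str:
    """Apply a state change, raising instead of trusting the agent."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"invalid transition: {current} -> {target}")
    return target
```

Because the script raises on bad moves, an agent cannot skip from `implementing` straight to `pr-opened` no matter what its prompt says.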
workos/case uses WIP commits after each logical step (`wip: description`), then squashes them into a single clean conventional commit before finalizing. This provides rollback points without polluting git history.
Critical rule from all sources: "It is unacceptable to remove or edit tests because this could lead to missing or buggy functionality."
OpenAI's strongest insight: if it isn't discoverable in the repo, it doesn't exist for the agent. Agents can only reason about what they can see in their working set: prompt context, retrieved documents, tool outputs, and runtime observations. Slack threads, Google Docs, and tacit knowledge are operationally invisible.
This means:
- All knowledge lives in versioned files (`docs/`, `AGENTS.md`, specs, design docs)
- Plans and execution state are repo artifacts, not Jira tickets
- The `AGENTS.md`/`CLAUDE.md` file should be a ~100-line table of contents pointing to deeper docs -- not an encyclopedia
OpenAI documented four failure modes of large AGENTS.md files:
- Context crowding — A large instruction file displaces the task, the code, and relevant docs from context, leaving the agent optimizing for the wrong constraints.
- Non-guidance from over-guidance — When everything is flagged as important, the agent pattern-matches locally rather than navigating intentionally.
- Instant rot — A monolithic manual cannot be mechanically verified. It becomes a graveyard of stale rules.
- Undetectable drift — Without structural verification, any single document decays silently.
```
CLAUDE.md / AGENTS.md (map, ~100 lines)
└── docs/
    ├── architecture/  ← how the system works
    ├── conventions/   ← shared rules (commits, testing, code style)
    ├── sdk-designs/   ← per-language design specs
    ├── playbooks/     ← step-by-step guides for common tasks
    └── learnings/     ← accumulated tactical knowledge
```
Agents read the map first, then navigate to specifics on demand. OpenAI took this to an extreme: 88 AGENTS.md files, one per major subsystem, to keep instructions local and minimal.
Structure CLAUDE.md for LLM cache efficiency (from workos/case):
1. Identity & Purpose (stable — rarely changes)
2. Rules & Conventions (stable)
3. Architecture Overview (stable)
4. Commands (semi-stable)
5. Known Issues (volatile)
6. Temporary Notes (volatile)
Stable content first means the LLM's KV cache can reuse the prefix across turns. Mixing stable and volatile content in the same section defeats caching.
Don't just tell agents what to do -- enforce it mechanically with linters, hooks, and structural tests. As workos/case puts it: "Instructions decay, enforcement persists." Agents forget instructions over long sessions. Hooks and linters don't.
- Custom linters (themselves written by agents) enforce naming conventions, file sizes, logging formats, and dependency layer hierarchy
- Lint error messages are written to inject remediation instructions into agent context -- every violation becomes a self-repair prompt
- Dependency layers enforced as a one-way constraint: Types → Config → Repo → Service → Runtime → UI
- Additional invariants: structured logging, naming conventions for schemas and types, file size limits, data validation at all external boundaries
- Encourage agents to create setup scripts (`init.sh`) for servers, test suites, and linters
- Provide verification tools (Playwright MCP, browser automation) so agents can test as users would
- "Quality of life tools" prevent repeated setup work across sessions
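The one-way dependency constraint above lends itself to a mechanical check whose error message doubles as a self-repair prompt. A minimal sketch; the lowercase layer names and the message wording are illustrative, not from the sources:

```python
# Layer order from the text; a layer may only import from layers at or
# below its own position. UI sits on top, Types at the bottom.
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]


def check_import(importer_layer: str, imported_layer: str):
    """Return a remediation message if the import breaks the one-way
    constraint, or None if the import is allowed."""
    if LAYERS.index(imported_layer) > LAYERS.index(importer_layer):
        # The message tells the agent how to fix the violation,
        # so every lint error becomes a self-repair instruction.
        return (
            f"Layer violation: '{importer_layer}' may not import from "
            f"'{imported_layer}'. Move the shared code down to "
            f"'{importer_layer}' or below, then re-run the linter."
        )
    return None
```

The point is the return value: instead of a bare "error E042", the violation text injects the fix into the agent's context on the next tool call.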
The most sophisticated enforcement pattern. Every pre-PR gate is enforced by hooks that intercept tool calls, not by instructions in prompts:
Evidence markers that can't be faked:
- `.case-tested` — SHA-256 hash of actual test output. A bare `touch .case-tested` is rejected.
- `.case-manual-tested` — Checks for recent Playwright screenshots as proof.
- `.case-reviewed` — Requires a `--critical 0` flag; refuses if critical findings exist.
Hook-based gating:
- `pre-pr-check.sh` — Blocks PR creation without evidence markers
- `pre-push-check.sh` — Blocks pushes to main/master
- `pre-commit-check.sh` — Enforces conventional commit format
- `post-pr-cleanup.sh` — Updates task JSON, cleans markers
The key insight: "STOP -- do this before proceeding" doesn't work in a prompt. A hook that blocks `gh pr create` does.
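The hash-based evidence marker can be sketched in a few lines. This is a minimal illustration of the idea, not workos/case's implementation; the function names are assumptions:

```python
import hashlib


def write_test_marker(marker_path: str, test_output: str) -> None:
    """Record a SHA-256 fingerprint of the real test output as evidence."""
    digest = hashlib.sha256(test_output.encode()).hexdigest()
    with open(marker_path, "w") as f:
        f.write(digest)


def marker_is_genuine(marker_path: str, test_output: str) -> bool:
    """Verify the marker against the claimed output. A bare `touch`
    produces an empty file, so the hash comparison fails."""
    try:
        with open(marker_path) as f:
            recorded = f.read().strip()
    except FileNotFoundError:
        return False
    return recorded == hashlib.sha256(test_output.encode()).hexdigest()
```

A pre-PR hook would call `marker_is_genuine` with the logged test output; an agent that writes the marker without running tests has nothing valid to hash.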
workos/case fingerprints every failed command (SHA-256 of command + first line of output). After 3 consecutive identical failures, it blocks the agent and forces a different approach. This prevents agents from retrying the same failing command in a loop.
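The doom-loop detector reduces to a fingerprint plus a consecutive-failure counter. A minimal sketch under the description above (the class name and success-reset behavior are illustrative):

```python
import hashlib


class DoomLoopDetector:
    """Block a command after N consecutive identical failures."""

    def __init__(self, limit: int = 3):
        self.limit = limit
        self.last_fp = None
        self.streak = 0

    @staticmethod
    def fingerprint(command: str, output: str) -> str:
        # SHA-256 of the command plus the first line of its output,
        # per the fingerprinting scheme described in the text.
        first_line = output.splitlines()[0] if output else ""
        return hashlib.sha256(f"{command}\n{first_line}".encode()).hexdigest()

    def record_failure(self, command: str, output: str) -> bool:
        """Return True when the agent should be blocked from retrying."""
        fp = self.fingerprint(command, output)
        self.streak = self.streak + 1 if fp == self.last_fp else 1
        self.last_fp = fp
        return self.streak >= self.limit
```

When `record_failure` returns True, the harness rejects the tool call and injects an instruction to try a different approach.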
Agents need to close their own feedback loops without human intervention. Any validation that requires a human to inspect the running application is a throughput bottleneck.
- Per-worktree booting — The app was made bootable per git worktree, allowing isolated instances per concurrent task. Eliminates cross-task contamination.
- Chrome DevTools Protocol — DOM snapshots, screenshot capture, browser navigation wired directly into the agent runtime. Feedback loop: snapshot → trigger UI path → observe → fix → re-snapshot → loop until clean.
- Ephemeral local observability — Each worktree gets its own isolated stack: logs (VictoriaLogs / LogQL), metrics (VictoriaMetrics / PromQL), traces (VictoriaTraces / TraceQL). Torn down when the task completes.
This enabled prompts like "ensure startup completes in <800ms" and "no span in these four critical user journeys exceeds two seconds" -- tractable because the agent could directly measure outcomes.
- Without explicit prompting, Claude tended to skip end-to-end testing
- When provided browser automation tools (Puppeteer MCP), agents dramatically improved at finding and fixing bugs
- Explicit prompting to "test as users would" significantly improves accuracy
- Encourage complete context usage:
```
This is a very long task, so plan your work clearly. Spend your entire output
context working -- just don't run out with uncommitted work.
```
Separates implementation from verification into isolated agents. The verifier:
- Cannot edit code (enforced by role)
- Uses Playwright to test the specific fix
- Creates evidence markers that prove testing happened
- Asks: "If I reverted my change, would this test fail?" -- not just happy-path testing
With sufficient scaffolding, a single agent can:
- Validate current codebase state
- Reproduce a reported bug
- Record a video demonstrating the failure
- Implement a fix
- Validate the fix by driving the application
- Record a second video demonstrating resolution
- Open a pull request
- Respond to feedback
- Detect and remediate build failures
- Escalate to a human only when judgment is required
- Merge the change
Steps involving bug reproduction, video capture, and UI validation are only possible because of the DevTools and observability integrations. The full loop depends on every layer of the harness being in place.
Agents are highly effective at pattern replication. They learn from and repeat whatever patterns exist in the codebase -- including bad ones. If the codebase has architectural drift, agents will faithfully reproduce and amplify it. OpenAI called this "AI slop": patterns that proliferate because they were present in the codebase's effective training distribution.
Without automation, OpenAI spent ~20% of Fridays cleaning up AI-generated drift.
Encode opinionated, mechanical rules directly into the repository:
- Prefer shared utility packages over hand-rolled helpers
- Validate data at boundaries rather than probing shapes speculatively
- Use instrumented concurrency utilities rather than third-party primitives with opaque behavior
Background agents scan for deviations on a regular cadence, update quality grades, and open targeted refactoring PRs. Most can be reviewed in under a minute and auto-merged. Technical debt becomes continuous garbage collection, not periodic sprints.
workos/case runs an entropy scanner (entropy-scan.sh) that wraps convention checks and outputs structured JSON. Can be run on a recurring loop during active sessions:
```
/loop 30m bash scripts/entropy-scan.sh
```

Convention drift is treated as inevitable -- the question is detection speed, not prevention.
A retrospective agent runs after every pipeline run (success or failure):
- Reads the full progress log and task JSON timing
- Classifies improvements by location (docs, scripts, agents, hooks, conventions)
- Applies fixes directly to harness files (not just suggests them)
- Maintains per-repo learnings files in `docs/learnings/`
- Escalates patterns: 3+ similar learnings → promoted to a convention or golden principle
This means every run improves the harness. Encoded capabilities compound over time.
workos/case uses a six-agent pipeline where each agent runs in an isolated context window with a single responsibility. This prevents "context pollution" -- agents forgetting to test, skipping steps, or gaming evidence.
| Agent | Responsibility | Forbidden Actions |
|---|---|---|
| Orchestrator | Parse issue, create task, dispatch agents | Write code, run Playwright |
| Implementer | Write code, run unit tests, WIP commits | Create PRs, start apps |
| Verifier | Test the specific fix with Playwright | Edit code, commit |
| Reviewer | Review diff against golden principles, gate PR | Edit code, commit, run tests |
| Closer | Create PR with description, satisfy hooks | Edit code, run tests |
| Retrospective | Analyze run, apply harness improvements | Edit target repo code |
Each agent emits a structured JSON result block that the orchestrator parses deterministically. Forbidden actions are enforced by the harness, not by instructions.
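Deterministic parsing of those result blocks might look like the sketch below. The fenced-JSON convention and the required field names are assumptions for illustration; the sources don't specify the wire format:

```python
import json
import re

# Built dynamically so this snippet doesn't nest literal code fences.
FENCE = "`" * 3


def parse_result_block(agent_output: str) -> dict:
    """Extract and validate the last fenced JSON block in an agent's
    transcript; raise rather than guess if it is missing or malformed."""
    pattern = re.compile(FENCE + r"json\s*(.*?)" + FENCE, re.DOTALL)
    blocks = pattern.findall(agent_output)
    if not blocks:
        raise ValueError("agent emitted no structured result block")
    result = json.loads(blocks[-1])
    for field in ("status", "summary"):  # illustrative required fields
        if field not in result:
            raise ValueError(f"result block missing required field: {field}")
    return result
```

Failing loudly matters here: if the orchestrator silently tolerated a missing block, agents could complete a run without ever reporting verifiable state.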
The key question from workos/case: specialized agents handling testing, QA, and code cleanup separately may outperform a single general-purpose agent. Anthropic's research raises the same open question.
Claude responds to direct instructions better than vague hints. Think of Claude as a brilliant but new employee who lacks context on your norms:
```
# Less effective
Create an analytics dashboard

# More effective
Create an analytics dashboard. Include as many relevant features as possible.
Go beyond the basics to create a fully-featured implementation.
```
Provide context/motivation for instructions -- Claude generalizes from explanations:
```
# Less effective
NEVER use ellipses

# More effective
Your response will be read aloud by a text-to-speech engine, so never use
ellipses since the engine won't know how to pronounce them.
```
- Claude 4.6 uses adaptive thinking (`thinking: {type: "adaptive"}`) -- the model decides when and how much to think
- Use the `effort` parameter (low/medium/high/max) to control depth
- Prefer general instructions ("think thoroughly") over prescriptive step-by-step plans -- Claude's reasoning frequently exceeds what a human would prescribe
- Ask Claude to self-check: "Before you finish, verify your answer against [criteria]"
- If Claude overthinks, add: "Choose an approach and commit to it. Avoid revisiting decisions unless you encounter new information that directly contradicts your reasoning."
Claude's latest models excel at parallel execution. Boost to ~100% reliability with:
```
If you intend to call multiple tools and there are no dependencies
between the calls, make all independent calls in parallel.
```
- Claude 4.6 proactively delegates to subagents without explicit instruction
- Watch for overuse: may spawn subagents when direct grep/read would suffice
- Steer with: "Use subagents for parallel/isolated work. For simple tasks, work directly."
For local, reversible actions (editing files, running tests): proceed freely.
For hard-to-reverse or shared-system actions: ask before proceeding.
Examples warranting confirmation:
- Destructive: deleting files/branches, dropping tables, `rm -rf`
- Hard to reverse: `git push --force`, `git reset --hard`
- Visible to others: pushing code, commenting on PRs, sending messages
Claude 4.6 tends to create extra files, add unnecessary abstractions, and overengineer:
- "Only make changes that are directly requested or clearly necessary"
- "Don't add features, refactor code, or make improvements beyond what was asked"
- "Don't create helpers for one-time operations"
- "Don't add docstrings, comments, or type annotations to code you didn't change"
```
Never speculate about code you have not opened. If the user references
a specific file, you MUST read it before answering.
```
If your prompts were designed for earlier models:
- Remove over-prompting. Tools that undertriggered before now trigger appropriately. "If in doubt, use [tool]" will cause overtriggering.
- Replace blanket defaults with targeted instructions. Instead of "Default to using [tool]," use "Use [tool] when it would enhance your understanding."
- Dial back aggressive language. Where you had "CRITICAL: You MUST use this tool when...", use "Use this tool when..."
When agents produce poor output:
- Do not try harder with prompts or write code manually
- Do diagnose what capability is missing (tools, guardrails, abstractions, documentation)
- Have the agent itself build the missing capability into the repo
- Each encoded capability compounds over time
As workos/case puts it: "When agents struggle, fix the harness." The fix is never "try harder" -- it's a missing doc, playbook, convention, or enforcement rule.
workos/case's retrospective agent automates this: after every run, it classifies what went wrong and applies fixes directly to the harness. Over time, the harness absorbs the lessons from every failure.
Agent throughput changes the risk profile. The same harness that makes an agent productive also gives it leverage: access to repositories, build systems, credentials, and deployment pathways.
- Secrets exposure — Agents can leak credentials into logs, commits, issues, or tool outputs
- Privilege escalation — Broad shell access + permissive CI/CD turns "write code" into "change infrastructure"
- Prompt injection — Docs, issue text, or comments can act as adversarial instructions
- Supply-chain drift — Agents can "fix" problems by adding dependencies or loosening versions
- Least privilege — Short-lived tokens, narrowly scoped credentials, separate read/write privileges
- Controlled egress — Default-deny network access, grant narrowly
- Instruction integrity — Keep policy in well-known files (AGENTS.md), lint against policy in low-trust surfaces
- Provenance and audit — Log tool calls, diffs, test results, deployment actions
- Fast rollback — If you adopt "corrections are cheap" merge philosophy, invest in rapid detection and rollback
All sources converge: engineers shift from writing code to:
- Designing the environment / harness
- Specifying intent through structured prompts and specs
- Building feedback loops and verification infrastructure
- Translating requirements into acceptance criteria
- Selective, high-leverage review (not universal review)
OpenAI reports: 3-7 engineers, ~1M lines of code, ~1,500 merged PRs, zero manually written source code, ~3.5 PRs/engineer/day over 5 months. The engineer's job shifts from producing correct code to producing an environment in which an agent reliably produces correct code.
workos/case describes this as "like managing 50 interns" -- queue work, don't micromanage.
| # | Action | Source |
|---|---|---|
| 1 | Keep CLAUDE.md short (~100 lines) as a map to deeper docs | OpenAI, case |
| 2 | Store all project knowledge in versioned repo files, not external tools | OpenAI |
| 3 | Use JSON for structured state, plaintext for progress, git for checkpoints | Anthropic, case |
| 4 | Write custom linters whose error messages teach agents how to fix violations | OpenAI |
| 5 | Enforce invariants mechanically with hooks, not instructions in prompts | OpenAI, case |
| 6 | First context window: set up framework. Subsequent: iterate on todo list | Anthropic |
| 7 | Start fresh context windows rather than compacting | Anthropic |
| 8 | Provide end-to-end verification tools (browser automation, observability) | All three |
| 9 | Run recurring cleanup agents to prevent drift accumulation | OpenAI, case |
| 10 | When agents fail, encode the missing capability -- don't write code manually | OpenAI, case |
| 11 | Use adaptive thinking with appropriate effort levels | Anthropic docs |
| 12 | Be explicit about autonomy boundaries: free locally, confirm for shared actions | Anthropic docs |
| 13 | Use evidence-based gating: proofs that work was done, not just assertions | case |
| 14 | Detect doom loops: block after 3 identical failures, force a different approach | case |
| 15 | Run a retrospective after every agent run; apply harness fixes directly | case |
| 16 | Isolate agent roles: implementer can't create PRs, verifier can't edit code | case |
| 17 | Structure CLAUDE.md for LLM cache efficiency: stable content first | case |
| 18 | Use WIP checkpoint commits for rollback, squash into clean commits at the end | case |
- A small `AGENTS.md`/`CLAUDE.md` entrypoint that points to deeper docs
- A reproducible dev environment (one-command boot)
- Per-worktree isolation to prevent cross-task contamination
- Mechanical invariants in CI (architecture boundaries, formatting, dependency rules)
- Agent legibility hooks (structured logs, queryable traces/metrics)
- Clear evaluation gates ("done" criteria, regression tests, security checks) that agents can run and interpret
- Safety rails (least-privilege credentials, controlled egress, audit logs, rollback playbook)
- OpenAI — "Harness engineering: leveraging Codex in an agent-first world" by Ryan Lopopolo (Feb 2026)
- Anthropic — "Effective Harnesses for Long-Running Agents" by Justin Young (Nov 2025)
- Anthropic Docs — "Prompting Best Practices" (platform.claude.com)
- workos/case — Production harness for multi-repo AI agent orchestration (github.com/workos/case)