@gjtorikian
Last active March 11, 2026 23:07
harness_engineering


Harness Engineering & Multi-Context Agent Workflows

A synthesis of OpenAI's harness engineering report, Anthropic's long-running agent research, Anthropic's prompting best practices docs, and lessons from workos/case (a production harness orchestrating AI agent work across multiple repositories).

Core thesis: the bottleneck in agent performance is environment design, not model intelligence. Or as OpenAI put it: "The horse is fast. The harness is everything."


1. What Is a "Harness"?

A harness is everything surrounding the model during execution: scaffolding, constraints, feedback loops, state management, and tooling. The metaphor comes from horsemanship -- a harness channels a powerful animal in a productive direction. The horse doesn't choose where to go; the rider steers through the harness.

OpenAI defines harness engineering as "the emerging discipline of designing the constraints, feedback loops, documentation structures, linting rules, observability pipelines, and lifecycle management systems that allow AI coding agents to operate reliably at scale."

It is distinct from:

  • Prompt engineering — optimizing instructions within a single turn
  • Context engineering — controlling which tokens are visible

A well-designed harness lets agents operate autonomously across hours and multiple context windows. OpenAI reported single Codex runs working on one task for upwards of six hours -- often while the humans were sleeping.


2. The Scarcity Inversion

In traditional development, compute is cheap and human attention is moderately scarce. In agent-first development, this inverts: code throughput far exceeds human review capacity.

This changes every engineering tradeoff:

  • Waiting is expensive, corrections are cheap. Minimal blocking merge gates. Short-lived PRs. Test flakes addressed with follow-up runs rather than blocking indefinitely.
  • QA becomes the bottleneck. As throughput scaled, OpenAI discovered agents couldn't see the running application — bugs were caught only after human review. The fix was making the app legible to agents (see section 7).
  • Human attention is the scarce resource. Every harness investment should be evaluated by: "Does this reduce the need for human attention?"

workos/case operationalizes this: engineers define goals and acceptance criteria, agents implement. The harness is the product; the code is the output.


3. The Multi-Context-Window Problem

Agents working on long tasks will exhaust their context window. Each new session starts blank -- like a shift engineer arriving with no handoff notes. All sources agree on the same solution architecture.

First Session (Initializer)

The first context window is special. Use it to build the scaffolding, not the features:

  • Set up the environment: build scripts, dev server, test infrastructure
  • Create a structured task/feature list in JSON (not markdown -- JSON is harder for agents to accidentally mutate)
  • Write setup scripts (init.sh) so future sessions can boot the environment in one command
  • Create a progress file as a changelog
  • Make an initial git commit as a checkpoint

Anthropic found that without this initializer pattern, agents attempted too much simultaneously, ran out of context mid-implementation, and left features half-built and undocumented.
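A minimal sketch of the artifacts an initializer session might leave behind. The file names (feature_list.json, claude-progress.txt, init.sh) follow the examples in Anthropic's research; the exact contents here are illustrative, not prescribed:

```python
import json
import tempfile
from pathlib import Path

def scaffold(root: Path) -> None:
    """Write the initializer's handoff artifacts: a structured feature list
    (JSON, so agents are less likely to mutate it accidentally) and a
    freeform progress log for the next session to read."""
    features = [
        {"id": 1, "name": "authentication_flow", "passes": False},
        {"id": 2, "name": "user_management", "passes": False},
    ]
    (root / "feature_list.json").write_text(json.dumps(features, indent=2))
    (root / "claude-progress.txt").write_text(
        "## Session 1 (initializer)\n"
        "- environment boots via ./init.sh\n"
        "- feature list created; nothing implemented yet\n"
    )

root = Path(tempfile.mkdtemp())
scaffold(root)
print(json.loads((root / "feature_list.json").read_text())[0]["name"])
# -> authentication_flow
```

The point is not the specific files but that every future session can reconstruct state from the filesystem alone.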

Subsequent Sessions (Worker)

Every new session starts by:

  1. Running pwd to confirm working directory
  2. Reading git logs and progress files to rebuild context
  3. Picking the highest-priority incomplete task
  4. Working on one feature at a time, committing after each

A typical session start (from Anthropic's research):

[Agent] I'll start by getting my bearings and understanding the current state.
> pwd
> read claude-progress.txt
> read feature_list.json
> git log --oneline -20
> [starts dev server via init.sh]
> [runs fundamental integration test before starting new work]

Key Principle: Fresh Context > Compaction

Anthropic's docs explicitly recommend starting with a brand new context window rather than compacting, because Claude's latest models are "extremely effective at discovering state from the local filesystem." The filesystem is the memory.

Be prescriptive about how the agent should start:

  • "Call pwd; you can only read and write files in this directory."
  • "Review progress.txt, tests.json, and the git logs."
  • "Manually run through a fundamental integration test before moving on to implementing new features."

Context Awareness

Claude 4.6 and 4.5 models feature context awareness -- the model can track its remaining token budget throughout a conversation. If your harness supports compaction or saving to external files:

Your context window will be automatically compacted as it approaches its limit,
allowing you to continue working indefinitely. Do not stop tasks early due to
token budget concerns. As you approach the limit, save progress and state before
the context window refreshes.

Without this guidance, Claude may try to wrap up work prematurely as it approaches the context limit.


4. State Management

| What to Track | Format | Why |
| --- | --- | --- |
| Task/test status | JSON (e.g., tasks.json) | Structured, machine-parseable, hard to accidentally corrupt |
| Progress notes | Markdown/plaintext | Freeform, captures nuance and reasoning |
| Code changes | Git | Provides history, checkpoints, and rollback |

JSON for Structured State

Anthropic recommends structured formats for anything agents need to parse across sessions:

{
  "tests": [
    { "id": 1, "name": "authentication_flow", "status": "passing" },
    { "id": 2, "name": "user_management", "status": "failing" },
    { "id": 3, "name": "api_endpoints", "status": "not_started" }
  ],
  "total": 200,
  "passing": 150,
  "failing": 25,
  "not_started": 25
}

Anthropic's initializer pattern goes further: the feature list is JSON where agents can only modify the passes field, preventing scope creep or feature deletion.
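One way such a guard could work, sketched as a hypothetical pre-write check the harness runs before accepting a new version of the feature list (field names are illustrative):

```python
import json

MUTABLE_FIELDS = {"passes"}  # agents may flip a feature to passing, nothing else

def validate_feature_update(before: str, after: str) -> bool:
    """Reject any edit to the feature list other than toggling `passes`:
    no added/deleted features, no renamed or re-scoped entries."""
    old, new = json.loads(before), json.loads(after)
    if len(old) != len(new):
        return False  # features were added or deleted
    for o, n in zip(old, new):
        changed = {k for k in o if o.get(k) != n.get(k)} | (n.keys() - o.keys())
        if changed - MUTABLE_FIELDS:
            return False
    return True

before = json.dumps([{"id": 1, "name": "auth", "passes": False}])
ok     = json.dumps([{"id": 1, "name": "auth", "passes": True}])
bad    = json.dumps([{"id": 1, "name": "auth_v2", "passes": True}])
print(validate_feature_update(before, ok))   # True
print(validate_feature_update(before, bad))  # False
```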

Task State Machine (from workos/case)

For complex multi-agent workflows, enforce valid state transitions:

active → implementing → verifying → reviewing → closing → pr-opened → merged

Recovery transitions allow going backward (e.g., verifying → implementing for fix-and-retry). Invalid transitions are rejected by a script, not by instructions. Evidence fields (like tested, manualTested) can only be set by marker scripts that verify real work was done -- not by agents directly.
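The transition table above can be enforced by a small guard script rather than by prompt instructions. A sketch, assuming the state names from the pipeline (the function shape is hypothetical):

```python
# Valid transitions for the task state machine, including the backward
# recovery edges (e.g., verifying -> implementing for fix-and-retry).
VALID = {
    "active": {"implementing"},
    "implementing": {"verifying"},
    "verifying": {"reviewing", "implementing"},
    "reviewing": {"closing", "implementing"},
    "closing": {"pr-opened"},
    "pr-opened": {"merged"},
}

def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the move is not in the table."""
    if target not in VALID.get(current, set()):
        raise ValueError(f"invalid transition: {current} -> {target}")
    return target

print(transition("verifying", "implementing"))  # recovery edge: allowed
```

Because invalid moves raise an error instead of relying on the agent to remember the rules, the state machine stays consistent no matter what the prompt says.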

WIP Checkpoint Commits

workos/case uses WIP commits after each logical step (wip: description), then squashes into a single clean conventional commit before finalizing. This provides rollback points without polluting git history.

Critical rule from all sources: "It is unacceptable to remove or edit tests because this could lead to missing or buggy functionality."


5. The Repository as System of Record

OpenAI's strongest insight: if it isn't discoverable in the repo, it doesn't exist for the agent. Agents can only reason about what they can see in their working set: prompt context, retrieved documents, tool outputs, and runtime observations. Slack threads, Google Docs, and tacit knowledge are operationally invisible.

This means:

  • All knowledge lives in versioned files (docs/, AGENTS.md, specs, design docs)
  • Plans and execution state are repo artifacts, not Jira tickets
  • The AGENTS.md / CLAUDE.md file should be a ~100-line table of contents pointing to deeper docs -- not an encyclopedia

Why Monolithic Instruction Files Fail

OpenAI documented four failure modes of large AGENTS.md files:

  1. Context crowding — A large instruction file displaces the task, the code, and relevant docs from context, leaving the agent optimizing for the wrong constraints.
  2. Non-guidance from over-guidance — When everything is flagged as important, the agent pattern-matches locally rather than navigating intentionally.
  3. Instant rot — A monolithic manual cannot be mechanically verified. It becomes a graveyard of stale rules.
  4. Undetectable drift — Without structural verification, any single document decays silently.

Progressive Disclosure

CLAUDE.md / AGENTS.md (map, ~100 lines)
  └── docs/
      ├── architecture/       ← how the system works
      ├── conventions/        ← shared rules (commits, testing, code style)
      ├── sdk-designs/        ← per-language design specs
      ├── playbooks/          ← step-by-step guides for common tasks
      └── learnings/          ← accumulated tactical knowledge

Agents read the map first, then navigate to specifics on demand. OpenAI took this to an extreme: 88 AGENTS.md files, one per major subsystem, to keep instructions local and minimal.

CLAUDE.md Ordering for Cache Efficiency

Structure CLAUDE.md for LLM cache efficiency (from workos/case):

1. Identity & Purpose    (stable — rarely changes)
2. Rules & Conventions   (stable)
3. Architecture Overview (stable)
4. Commands              (semi-stable)
5. Known Issues          (volatile)
6. Temporary Notes       (volatile)

Stable content first means the LLM's KV cache can reuse the prefix across turns. Mixing stable and volatile content in the same section defeats caching.


6. Mechanical Enforcement Over Documentation

Don't just tell agents what to do -- enforce it mechanically with linters, hooks, and structural tests. As workos/case puts it: "Instructions decay, enforcement persists." Agents forget instructions over long sessions. Hooks and linters don't.

OpenAI: Custom Linters with Self-Repair Messages

  • Custom linters (themselves written by agents) enforce naming conventions, file sizes, logging formats, and dependency layer hierarchy
  • Lint error messages are written to inject remediation instructions into agent context -- every violation becomes a self-repair prompt
  • Dependency layers enforced as a one-way constraint:
    Types → Config → Repo → Service → Runtime → UI
    
  • Additional invariants: structured logging, naming conventions for schemas and types, file size limits, data validation at all external boundaries
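A sketch of a self-repair lint rule in this spirit. The 500-line limit and the message wording are illustrative, not OpenAI's actual rules; the key property is that the error text tells the agent exactly how to fix the violation:

```python
MAX_LINES = 500  # illustrative file-size invariant

def lint_file_size(path: str, source: str) -> list[str]:
    """Return lint errors whose messages double as remediation prompts."""
    n = len(source.splitlines())
    if n <= MAX_LINES:
        return []
    return [
        f"{path}: {n} lines exceeds the {MAX_LINES}-line limit. "
        f"Split this module by extracting cohesive helpers into a sibling "
        f"file, then re-run the linter to confirm the violation is gone."
    ]

errors = lint_file_size("service/user.py", "x = 1\n" * 600)
print(len(errors))  # -> 1
```

When this message lands in the agent's context, the violation itself carries the instructions for repairing it.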

Anthropic: Quality of Life Tools

  • Encourage agents to create setup scripts (init.sh) for servers, test suites, and linters
  • Provide verification tools (Playwright MCP, browser automation) so agents can test as users would
  • "Quality of life tools" prevent repeated setup work across sessions

workos/case: Evidence-Based Enforcement

The most sophisticated enforcement pattern. Every pre-PR gate is enforced by hooks that intercept tool calls, not by instructions in prompts:

Evidence markers that can't be faked:

  • .case-tested — SHA-256 hash of actual test output. Bare touch .case-tested is rejected.
  • .case-manual-tested — Checks for recent Playwright screenshots as proof.
  • .case-reviewed — Requires --critical 0 flag; refuses if critical findings exist.

Hook-based gating:

  • pre-pr-check.sh — Blocks PR creation without evidence markers
  • pre-push-check.sh — Blocks push to main/master
  • pre-commit-check.sh — Enforces conventional commit format
  • post-pr-cleanup.sh — Updates task JSON, cleans markers

The key insight: "STOP -- do this before proceeding" doesn't work in a prompt. A hook that blocks gh pr create does.
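The .case-tested marker described above can be sketched as follows; the marker stores a hash of real test output, so an empty `touch`-style file fails verification (the exact hashing scheme here is an assumption):

```python
import hashlib
import pathlib

def write_marker(marker: pathlib.Path, test_output: str) -> None:
    """Record evidence: a SHA-256 digest of the actual test-run output."""
    marker.write_text(hashlib.sha256(test_output.encode()).hexdigest())

def marker_valid(marker: pathlib.Path, test_output: str) -> bool:
    """A bare `touch` leaves an empty file, which can never match the digest."""
    if not marker.exists():
        return False
    return marker.read_text() == hashlib.sha256(test_output.encode()).hexdigest()

out = "42 passed in 3.1s"
m = pathlib.Path(".case-tested")
write_marker(m, out)
print(marker_valid(m, out))  # True
m.write_text("")             # simulate a faked marker
print(marker_valid(m, out))  # False
```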

Doom Loop Detection

workos/case fingerprints every failed command (SHA-256 of command + first line of output). After 3 consecutive identical failures, it blocks the agent and forces a different approach. This prevents agents from retrying the same failing command in a loop.
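The fingerprinting scheme described above can be sketched directly (the threshold and hash inputs mirror the description; the function shape is hypothetical):

```python
import hashlib
from collections import deque

MAX_REPEATS = 3
recent = deque(maxlen=MAX_REPEATS)  # sliding window of failure fingerprints

def record_failure(command: str, output: str) -> bool:
    """Fingerprint a failed command (SHA-256 of command + first output line).
    Return True when the last MAX_REPEATS failures are identical, i.e. the
    agent is in a doom loop and should be blocked."""
    first_line = output.splitlines()[0] if output else ""
    fp = hashlib.sha256(f"{command}\n{first_line}".encode()).hexdigest()
    recent.append(fp)
    return len(recent) == MAX_REPEATS and len(set(recent)) == 1

for _ in range(3):
    blocked = record_failure("npm test", "Error: Cannot find module 'left-pad'")
print(blocked)  # -> True
```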


7. Feedback Loops & Observability

Agents need to close their own feedback loops without human intervention. Any validation that requires a human to inspect the running application is a throughput bottleneck.

OpenAI: Making the App Legible to the Agent

  • Per-worktree booting — The app was made bootable per git worktree, allowing isolated instances per concurrent task. Eliminates cross-task contamination.
  • Chrome DevTools Protocol — DOM snapshots, screenshot capture, browser navigation wired directly into the agent runtime. Feedback loop: snapshot → trigger UI path → observe → fix → re-snapshot → loop until clean.
  • Ephemeral local observability — Each worktree gets its own isolated stack: logs (VictoriaLogs / LogQL), metrics (VictoriaMetrics / PromQL), traces (VictoriaTraces / TraceQL). Torn down when the task completes.

This enabled prompts like "ensure startup completes in <800ms" and "no span in these four critical user journeys exceeds two seconds" -- tractable because the agent could directly measure outcomes.

Anthropic: Testing as Verification

  • Without explicit prompting, Claude tended to skip end-to-end testing
  • When provided browser automation tools (Puppeteer MCP), agents dramatically improved at finding and fixing bugs
  • Explicit prompting to "test as users would" significantly improves accuracy
  • Encourage complete context usage:
    This is a very long task, so plan your work clearly. Spend your entire
    output context working -- just don't run out with uncommitted work.
    

workos/case: Multi-Agent Verification

Separates implementation from verification into isolated agents. The verifier:

  • Cannot edit code (enforced by role)
  • Uses Playwright to test the specific fix
  • Creates evidence markers that prove testing happened
  • Asks: "If I reverted my change, would this test fail?" -- not just happy-path testing

The Autonomous Development Loop (OpenAI)

With sufficient scaffolding, a single agent can:

  1. Validate current codebase state
  2. Reproduce a reported bug
  3. Record a video demonstrating the failure
  4. Implement a fix
  5. Validate the fix by driving the application
  6. Record a second video demonstrating resolution
  7. Open a pull request
  8. Respond to feedback
  9. Detect and remediate build failures
  10. Escalate to a human only when judgment is required
  11. Merge the change

Steps involving bug reproduction, video capture, and UI validation are only possible because of the DevTools and observability integrations. The full loop depends on every layer of the harness being in place.


8. The Entropy Problem & Automated Garbage Collection

Agents are highly effective at pattern replication. They learn from and repeat whatever patterns exist in the codebase -- including bad ones. If the codebase has architectural drift, agents will faithfully reproduce and amplify it. OpenAI called this "AI slop": patterns that proliferate because they were present in the codebase's effective training distribution.

Without automation, OpenAI spent ~20% of Fridays cleaning up AI-generated drift.

Golden Principles

Encode opinionated, mechanical rules directly into the repository:

  • Prefer shared utility packages over hand-rolled helpers
  • Validate data at boundaries rather than probing shapes speculatively
  • Use instrumented concurrency utilities rather than third-party primitives with opaque behavior

Background agents scan for deviations on a regular cadence, update quality grades, and open targeted refactoring PRs. Most can be reviewed in under a minute and auto-merged. Technical debt becomes continuous garbage collection, not periodic sprints.

Entropy Scanning (from workos/case)

workos/case runs an entropy scanner (entropy-scan.sh) that wraps convention checks and outputs structured JSON. Can be run on a recurring loop during active sessions:

/loop 30m bash scripts/entropy-scan.sh

Convention drift is treated as inevitable -- the question is detection speed, not prevention.
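A Python sketch of the same idea, assuming a single hypothetical convention check (entropy-scan.sh itself wraps many; the check name and JSON shape here are invented for illustration):

```python
import json
import pathlib
import tempfile

def check_no_todo_comments(root: pathlib.Path) -> list[dict]:
    """One illustrative convention check: flag lingering TODO comments."""
    findings = []
    for p in root.rglob("*.py"):
        for i, line in enumerate(p.read_text().splitlines(), 1):
            if "TODO" in line:
                findings.append({"check": "no_todo", "file": str(p), "line": i})
    return findings

def scan(root: pathlib.Path) -> str:
    """Emit structured JSON so the harness (not a human) consumes the report."""
    findings = check_no_todo_comments(root)
    return json.dumps({"findings": findings, "count": len(findings)}, indent=2)

tmp = pathlib.Path(tempfile.mkdtemp())
(tmp / "app.py").write_text("x = 1  # TODO: remove\n")
report = json.loads(scan(tmp))
print(report["count"])  # -> 1
```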

Self-Improving Harness (from workos/case)

A retrospective agent runs after every pipeline run (success or failure):

  1. Reads the full progress log and task JSON timing
  2. Classifies improvements by location (docs, scripts, agents, hooks, conventions)
  3. Applies fixes directly to harness files (not just suggests them)
  4. Maintains per-repo learnings files in docs/learnings/
  5. Escalates patterns: 3+ similar learnings → promoted to convention or golden principle

This means every run improves the harness. Encoded capabilities compound over time.
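The escalation rule in step 5 can be sketched as a simple frequency count over tagged learnings (the tag field and threshold constant are illustrative assumptions):

```python
from collections import Counter

PROMOTE_AT = 3  # 3+ similar learnings -> promote to convention

def promote(learnings: list[dict]) -> list[str]:
    """Return the patterns that recur often enough to become conventions."""
    counts = Counter(entry["pattern"] for entry in learnings)
    return [pattern for pattern, n in counts.items() if n >= PROMOTE_AT]

learnings = [
    {"pattern": "flaky-playwright-selector"},
    {"pattern": "flaky-playwright-selector"},
    {"pattern": "flaky-playwright-selector"},
    {"pattern": "missing-init-script"},
]
print(promote(learnings))  # -> ['flaky-playwright-selector']
```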


9. Multi-Agent Role Isolation

workos/case uses a six-agent pipeline where each agent runs in an isolated context window with a single responsibility. This prevents "context pollution" -- agents forgetting to test, skipping steps, or gaming evidence.

| Agent | Responsibility | Forbidden Actions |
| --- | --- | --- |
| Orchestrator | Parse issue, create task, dispatch agents | Write code, run Playwright |
| Implementer | Write code, run unit tests, WIP commits | Create PRs, start apps |
| Verifier | Test the specific fix with Playwright | Edit code, commit |
| Reviewer | Review diff against golden principles, gate PR | Edit code, commit, run tests |
| Closer | Create PR with description, satisfy hooks | Edit code, run tests |
| Retrospective | Analyze run, apply harness improvements | Edit target repo code |

Each agent emits a structured JSON result block that the orchestrator parses deterministically. Forbidden actions are enforced by the harness, not by instructions.
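Deterministic parsing of such a result block might look like this; the fence convention, field names, and error handling are assumptions for illustration:

```python
import json

FENCE = "`" * 3  # a markdown code fence, built up to keep this example self-contained
REQUIRED = {"status", "task_id"}

def parse_result(transcript: str) -> dict:
    """Extract the last fenced JSON block from an agent transcript and
    validate it has the fields the orchestrator needs."""
    start = transcript.rindex(FENCE + "json") + len(FENCE + "json")
    end = transcript.index(FENCE, start)
    result = json.loads(transcript[start:end])
    missing = REQUIRED - result.keys()
    if missing:
        raise ValueError(f"result block missing fields: {missing}")
    return result

transcript = (
    "Done testing.\n"
    + FENCE + 'json\n{"status": "verified", "task_id": 7}\n' + FENCE + "\n"
)
print(parse_result(transcript)["status"])  # -> verified
```

Because the orchestrator parses JSON rather than interpreting prose, a malformed or incomplete result fails loudly instead of being silently misread.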

An open question from workos/case: do specialized agents handling testing, QA, and code cleanup separately outperform a single general-purpose agent? Anthropic's research raises the same question.


10. Prompting Best Practices for Agent Systems

Clarity & Specificity

Claude responds to direct instructions better than vague hints. Think of Claude as a brilliant but new employee who lacks context on your norms:

# Less effective
Create an analytics dashboard

# More effective
Create an analytics dashboard. Include as many relevant features as possible.
Go beyond the basics to create a fully-featured implementation.

Provide context/motivation for instructions -- Claude generalizes from explanations:

# Less effective
NEVER use ellipses

# More effective
Your response will be read aloud by a text-to-speech engine, so never use
ellipses since the engine won't know how to pronounce them.

Thinking & Reasoning Control

  • Claude 4.6 uses adaptive thinking (thinking: {type: "adaptive"}) -- the model decides when/how much to think
  • Use the effort parameter (low/medium/high/max) to control depth
  • Prefer general instructions ("think thoroughly") over prescriptive step-by-step plans -- Claude's reasoning frequently exceeds what a human would prescribe
  • Ask Claude to self-check: "Before you finish, verify your answer against [criteria]"
  • If Claude overthinks, add: "Choose an approach and commit to it. Avoid revisiting decisions unless you encounter new information that directly contradicts your reasoning."

Parallel Tool Calling

Claude's latest models excel at parallel tool execution. Reliability can be pushed to near 100% with a prompt like:

If you intend to call multiple tools and there are no dependencies
between the calls, make all independent calls in parallel.

Subagent Orchestration

  • Claude 4.6 proactively delegates to subagents without explicit instruction
  • Watch for overuse: may spawn subagents when direct grep/read would suffice
  • Steer with: "Use subagents for parallel/isolated work. For simple tasks, work directly."

Balancing Autonomy & Safety

For local, reversible actions (editing files, running tests): proceed freely.
For hard-to-reverse or shared-system actions: ask before proceeding.

Examples warranting confirmation:
- Destructive: deleting files/branches, dropping tables, rm -rf
- Hard to reverse: git push --force, git reset --hard
- Visible to others: pushing code, commenting on PRs, sending messages

Reducing Overengineering

Claude 4.6 tends to create extra files, add unnecessary abstractions, and overengineer:

  • "Only make changes that are directly requested or clearly necessary"
  • "Don't add features, refactor code, or make improvements beyond what was asked"
  • "Don't create helpers for one-time operations"
  • "Don't add docstrings, comments, or type annotations to code you didn't change"

Minimizing Hallucinations

Never speculate about code you have not opened. If the user references
a specific file, you MUST read it before answering.

Tuning for Claude 4.6

If your prompts were designed for earlier models:

  • Remove over-prompting. Tools that undertriggered before now trigger appropriately. "If in doubt, use [tool]" will cause overtriggering.
  • Replace blanket defaults with targeted instructions. Instead of "Default to using [tool]," use "Use [tool] when it would enhance your understanding."
  • Dial back aggressive language. Where you had "CRITICAL: You MUST use this tool when...", use "Use this tool when..."

11. The Missing Capability Diagnosis Framework

When agents produce poor output:

  • Do not try harder with prompts or write code manually
  • Do diagnose what capability is missing (tools, guardrails, abstractions, documentation)
  • Have the agent itself build the missing capability into the repo
  • Each encoded capability compounds over time

As workos/case puts it: "When agents struggle, fix the harness." The fix is never "try harder" -- it's a missing doc, playbook, convention, or enforcement rule.

workos/case's retrospective agent automates this: after every run, it classifies what went wrong and applies fixes directly to the harness. Over time, the harness absorbs the lessons from every failure.


12. Risk & Safety

Agent throughput changes the risk profile. The same harness that makes an agent productive also gives it leverage: access to repositories, build systems, credentials, and deployment pathways.

Failure Modes (OpenAI)

  • Secrets exposure — Agents can leak credentials into logs, commits, issues, or tool outputs
  • Privilege escalation — Broad shell access + permissive CI/CD turns "write code" into "change infrastructure"
  • Prompt injection — Docs, issue text, or comments can act as adversarial instructions
  • Supply-chain drift — Agents can "fix" problems by adding dependencies or loosening versions

Mitigations

  • Least privilege — Short-lived tokens, narrowly scoped credentials, separate read/write privileges
  • Controlled egress — Default-deny network access, grant narrowly
  • Instruction integrity — Keep policy in well-known files (AGENTS.md), lint against policy in low-trust surfaces
  • Provenance and audit — Log tool calls, diffs, test results, deployment actions
  • Fast rollback — If you adopt "corrections are cheap" merge philosophy, invest in rapid detection and rollback

13. Human Role Transformation

All sources converge: engineers shift from writing code to:

  • Designing the environment / harness
  • Specifying intent through structured prompts and specs
  • Building feedback loops and verification infrastructure
  • Translating requirements into acceptance criteria
  • Selective, high-leverage review (not universal review)

OpenAI reports: 3-7 engineers, ~1M lines of code, ~1,500 merged PRs, zero manually written source code, ~3.5 PRs/engineer/day over 5 months. The engineer's job shifts from producing correct code to producing an environment in which an agent reliably produces correct code.

workos/case describes this as "like managing 50 interns" -- queue work, don't micromanage.


Quick Reference: Actionable Takeaways

| # | Action | Source |
| --- | --- | --- |
| 1 | Keep CLAUDE.md short (~100 lines) as a map to deeper docs | OpenAI, case |
| 2 | Store all project knowledge in versioned repo files, not external tools | OpenAI |
| 3 | Use JSON for structured state, plaintext for progress, git for checkpoints | Anthropic, case |
| 4 | Write custom linters whose error messages teach agents how to fix violations | OpenAI |
| 5 | Enforce invariants mechanically with hooks, not instructions in prompts | OpenAI, case |
| 6 | First context window: set up framework. Subsequent: iterate on todo list | Anthropic |
| 7 | Start fresh context windows rather than compacting | Anthropic |
| 8 | Provide end-to-end verification tools (browser automation, observability) | All three |
| 9 | Run recurring cleanup agents to prevent drift accumulation | OpenAI, case |
| 10 | When agents fail, encode the missing capability -- don't write code manually | OpenAI, case |
| 11 | Use adaptive thinking with appropriate effort levels | Anthropic docs |
| 12 | Be explicit about autonomy boundaries: free locally, confirm for shared actions | Anthropic docs |
| 13 | Use evidence-based gating: proofs that work was done, not just assertions | case |
| 14 | Detect doom loops: block after 3 identical failures, force a different approach | case |
| 15 | Run a retrospective after every agent run; apply harness fixes directly | case |
| 16 | Isolate agent roles: implementer can't create PRs, verifier can't edit code | case |
| 17 | Structure CLAUDE.md for LLM cache efficiency: stable content first | case |
| 18 | Use WIP checkpoint commits for rollback, squash into clean commits at the end | case |

Minimum Viable Harness Checklist (from OpenAI)

  • A small AGENTS.md / CLAUDE.md entrypoint that points to deeper docs
  • A reproducible dev environment (one-command boot)
  • Per-worktree isolation to prevent cross-task contamination
  • Mechanical invariants in CI (architecture boundaries, formatting, dependency rules)
  • Agent legibility hooks (structured logs, queryable traces/metrics)
  • Clear evaluation gates ("done" criteria, regression tests, security checks) that agents can run and interpret
  • Safety rails (least-privilege credentials, controlled egress, audit logs, rollback playbook)
