
Reward-Free Agent Learning + ACE Context Engineering

This gist captures the full story of how we wired Early Experience (EE) and Agentic Context Engineering (ACE) together for two very different domains:

  • SWE-bench Lite – developer workflows where we validate patches by applying them and running the real test commands.
  • MagicBrush – instruction-guided image edits where we clamp outputs via perceptual metrics (MSE/SSIM).

It is organized in the same way we’ve been running the system: narrative ➜ technical ➜ exploratory ➜ explanatory.


1. Narrative: How the Loop Lives and Breathes

  1. Seed with humans – load a handful of expert demonstrations into the EE pipeline. Each JSONL entry is (state, action, next_state) plus anything you want downstream (see the sample record after this list).
  2. Let the agent roam – EE generates exploratory rollouts, compares candidate actions, and critiques itself.
  3. Clamp with guardrails – deterministic validators (pytest, SQLFluff, MSE/SSIM, etc.) turn free-form outputs into canonical “pass/fail” results.
  4. Distill into ACE – ACE ingests every guardrail correction, deduplicates semantically, and appends the result to the playbook.
  5. Repeat forever – the playbook becomes richer, the agent keeps improving, and we never have to define a dense reward function.
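
For concreteness, a seed record might look like the line below. Only the (state, action, next_state) triple is prescribed; the metadata field is a made-up example of “anything you want downstream”:

{"state": "failing test: test_parser.py::test_unicode", "action": "patch parser.py to decode bytes as UTF-8", "next_state": "all tests pass", "metadata": {"source": "expert", "task_id": "demo-001"}}

The flowchart below traces the same loop end to end.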
flowchart LR
    A[Expert Demos] --> B(Early Experience Pipeline)
    subgraph EE[Early Experience]
        B --> C[World Model]
        B --> D[Exploration]
        B --> E[Reflection]
        B --> F[Policy]
    end
    F --> G[Live Loop Episodes]
    G --> H{Deterministic Guardrails}
    H -->|Clamp, log, correct| I[ACE Playbook]
    I -->|Delta updates| G

Important: nothing here requires reward functions. The guardrails and ACE form the “teacher” the Stanford & Meta papers envisioned.


2. Technical: Everything You Need to Reproduce It

2.1 Datasets & Guardrails

| Domain | JSONL Samples | Guardrail Module | Deterministic Check |
| --- | --- | --- | --- |
| SWE-bench | data/swe_bench_samples/swe_bench_50.jsonl | src/guardrails/swe_bench.py | Clone repo, apply patch, run recorded test_commands |
| MagicBrush | data/magicbrush_samples/magicbrush_50.jsonl | src/guardrails/magicbrush.py | Decode images, check MSE<=1500 and SSIM>=0.60 |
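
For intuition, the MagicBrush check boils down to two perceptual thresholds. A minimal sketch assuming NumPy and scikit-image are available; the authoritative version lives in src/guardrails/magicbrush.py and may differ in detail:

import numpy as np
from skimage.metrics import structural_similarity

MSE_MAX = 1500.0  # thresholds from the table above
SSIM_MIN = 0.60

def magicbrush_check(pred: np.ndarray, target: np.ndarray) -> bool:
    """Pass iff the edited image stays perceptually close to the reference."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = float(np.mean(diff ** 2))
    ssim = structural_similarity(pred, target, channel_axis=-1)
    return mse <= MSE_MAX and ssim >= SSIM_MIN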

Generate the sample slices (one-time):

# SWE-bench Lite (first 23 dev samples)
python examples/data_scripts/sample_swe_bench.py

# MagicBrush (first 50 dev samples)
python examples/data_scripts/sample_magicbrush.py

(Those helper scripts simply wrap the dataset snippets we used earlier.)

The scaffolder pulls guardrail metadata straight out of the JSONL:

python scripts/scaffold_domain.py swe-bench --from-benchmark data/swe_bench_samples/swe_bench_50.jsonl
python scripts/scaffold_domain.py magicbrush --from-benchmark data/magicbrush_samples/magicbrush_50.jsonl
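
To see what the scaffolder extracts, you can inspect the JSONL directly. The test_commands field is the one mentioned above; repo is an assumed key here, so check scripts/scaffold_domain.py for the authoritative names:

import json

with open("data/swe_bench_samples/swe_bench_50.jsonl") as fh:
    for line in fh:
        sample = json.loads(line)
        # the SWE-bench guardrail needs a repo to clone and the recorded tests
        print(sample.get("repo"), sample.get("test_commands"))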

2.2 Offline Benchmark Harness

PYTHONPATH=src python scripts/run_benchmark.py benchmarks/finance_subset.jsonl --domain finance --offline
PYTHONPATH=src python scripts/run_benchmark.py data/swe_bench_samples/swe_bench_50.jsonl --domain swe-bench --offline
PYTHONPATH=src python scripts/run_benchmark.py data/magicbrush_samples/magicbrush_50.jsonl --domain magicbrush --offline

Results as of the latest run:

| Domain | Tasks | Correct | Autocorrections | Missing Guardrails |
| --- | --- | --- | --- | --- |
| Finance | 26 | 26 | 0 | 0 |
| SWE-bench | 23 | 23 | 0 | 0 |
| MagicBrush | 50 | 50 | 0 | 0 |

2.3 Live Loop (guardrail driver + ACE)

We dropped a ready-to-run helper at examples/live_loop_swe_magic.py. Run it like this:

# With ACE enabled
ACE_ENABLED=1 ACE_DOMAIN_ID=swe-bench ACE_TARGET_STAGE=shadow \
DATABASE_URL=sqlite:///ace_playbook.db \
PYTHONPATH=src python examples/live_loop_swe_magic.py --domain swe-bench --episodes 50 --ace

# MagicBrush counterpart
ACE_ENABLED=1 ACE_DOMAIN_ID=magicbrush ACE_TARGET_STAGE=shadow \
DATABASE_URL=sqlite:///ace_playbook.db \
PYTHONPATH=src python examples/live_loop_swe_magic.py --domain magicbrush --episodes 50 --ace

If you omit --ace (or leave ACE_ENABLED unset), the loop still runs, but ACE won’t ingest any insights, which is great for ablations.
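
A rough sketch of that gating, assuming argparse plus an environment check (the real logic lives in examples/live_loop_swe_magic.py and may differ):

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--domain", required=True)
parser.add_argument("--episodes", type=int, default=50)
parser.add_argument("--ace", action="store_true")
args = parser.parse_args()

# ACE ingestion needs both the CLI flag and the environment toggle
ace_on = args.ace and os.environ.get("ACE_ENABLED") == "1"
if not ace_on:
    print("ACE not available: running guardrails-only (ablation mode)")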

2.4 ACE Playbook Snapshots

After the 50-episode runs we inspected the shadow stage:

PYTHONPATH=src python - <<'PY'
from ace.repositories.playbook_repository import PlaybookRepository
from ace.models.playbook import PlaybookStage
from ace.utils.database import get_session

for domain in ['swe-bench','magicbrush']:
    with get_session() as session:
        repo = PlaybookRepository(session)
        bullets = repo.get_by_domain(domain, stage=PlaybookStage.SHADOW)
        print(f"\n=== {domain}: {len(bullets)} bullets ===")
        for b in bullets:
            print(f"[{b.section}] {b.content[:80]}... (helpful={b.helpful_count})")
PY
  • SWE-bench: 16 helpful bullets (e.g., “Run pylint to gather info before patching.” Helpful counts 6–41.)
  • MagicBrush: 28 helpful bullets (e.g., “Pass when SSIM already high; avoid unnecessary edits.” Helpful counts 3–33.)
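
The added-vs-incremented split comes from the semantic deduplication step: an insight close enough to an existing bullet bumps its helpful count instead of being appended. A toy sketch, using plain string similarity as a stand-in for whatever comparison ace-playbook actually performs:

from difflib import SequenceMatcher

def ingest(bullets: list[dict], candidate: str, threshold: float = 0.85) -> None:
    """Append a new bullet, or bump helpful_count on a near-duplicate."""
    for bullet in bullets:
        if SequenceMatcher(None, bullet["content"], candidate).ratio() >= threshold:
            bullet["helpful_count"] += 1  # "incremented" in the ACE log line
            return
    bullets.append({"content": candidate, "helpful_count": 1})  # "added"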

2.5 Minimal Timeline of What We Ran

timeline
    title Reward-Free Self-Improvement
    section Preparation
      Expert demos loaded: 2025-05-10
      Guardrails scaffolded: 2025-05-11
    section Learning Loop
      SWE-bench loop (50 eps): Guardrail passes=50
      ACE added 9 bullets, helpful++ (20)
      MagicBrush loop (50 eps): Guardrail passes=50
      ACE helpful++ (20)
    section Expansion
      Ablation (no ACE) run: Guardrail passes=20, no playbook deltas
      Ablation (no guardrails) TODO: flip `apply_guardrails=False`
      Next domain ETL via scaffold_domain.py

3. Exploratory Workspace (What to Tweak Next)

| Toggle | Command | What changes |
| --- | --- | --- |
| No ACE | python examples/live_loop_swe_magic.py --domain swe-bench --episodes 20 | Live loop still clamps outputs, but ACE logs “not available” and playbook counts stay fixed. |
| No guardrails | Edit config.apply_guardrails=False (or add a --no-guardrails flag) | Episodes replay the recorded action verbatim; useful to see why guardrails matter. |
| Different LM | export OPENROUTER_MODEL=openrouter/mistralai/mixtral-8x22b | DSPy picks the new model automatically; reflections change tone/content. |
| New domain | python scripts/scaffold_domain.py my-domain --from-benchmark benchmarks/my.jsonl | Guardrail module + docs stub generated automatically. |
| Long run | --episodes 500 | ACE keeps growing; check playbook counts and journal increments. |
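
The apply_guardrails toggle in the second row might look like this internally (a hypothetical sketch; the repo’s actual config may be shaped differently):

from dataclasses import dataclass

@dataclass
class LoopConfig:
    domain: str
    episodes: int = 50
    apply_guardrails: bool = True  # flip to False for the no-guardrails ablation

def clamp(domain: str, action: str) -> str:
    # stand-in for the real guardrail call (src/guardrails/<domain>.py)
    return f"[clamped:{domain}] {action}"

def run_episode(cfg: LoopConfig, recorded_action: str) -> str:
    # with guardrails off, the loop replays the recorded action verbatim
    return clamp(cfg.domain, recorded_action) if cfg.apply_guardrails else recorded_action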

Ablation Playbook

  1. Identity (baseline) – Guardrails off, ACE off. Compare reflection artifacts (live_loop_artifacts/) against the ACE-enabled run.
  2. ACE-only – Guardrails off, ACE on. Observe how many reflections turn low quality once guardrails no longer clamp outputs.
  3. Guardrails-only – Guardrails on, ACE off (what we already ran). Useful to show accuracy vs. contextual knowledge.
  4. ACE+guardrails – Full system (the loops above). Serves as the reference.

Next Datasets to Try

  • APPS/CodeContests – Every staged solution can be treated like a SWE-bench patch; the guardrail runs the provided unit tests (see the sketch after this list).
  • MultiWOZ 2.2 – Use slot-filling metadata; guardrail verifies final booking state matches the log.
  • Design/UX logs – Guardrail checks layout heuristics (alignment, contrast) before ACE records the lesson.
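
By analogy with the SWE-bench guardrail, an APPS/CodeContests check can be a few lines. A sketch under stated assumptions (pytest as the runner, tests shipped as a file path; both are guesses):

import pathlib
import subprocess
import tempfile

def apps_guardrail(solution_code: str, tests_path: str) -> bool:
    """Pass iff the dataset's provided unit tests exit cleanly against the solution."""
    workdir = pathlib.Path(tempfile.mkdtemp())
    (workdir / "solution.py").write_text(solution_code)
    result = subprocess.run(
        ["python", "-m", "pytest", str(pathlib.Path(tests_path).resolve())],
        cwd=workdir,
        capture_output=True,
        timeout=120,
    )
    return result.returncode == 0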

4. Explanatory Appendix

4.1 Environment Checklist

# Core install
git clone https://github.com/jmanhype/AgentLearningEE.git
cd AgentLearningEE
python -m pip install -e .[dev]

# Optional: ACE playbook
python -m pip install -e ace-playbook
export ACE_ENABLED=1
export ACE_TARGET_STAGE=shadow
export ACE_DOMAIN_ID=swe-bench         # change per run
export DATABASE_URL=sqlite:///ace_playbook.db
export PYTHONPATH=$PWD/ace-playbook:$PYTHONPATH

# LM credentials (the script auto-loads from .env)
export OPENROUTER_API_KEY=sk-or-v1-...

4.2 Key Logs to Recognize

{"level": "INFO", "message": "guardrail_auto_corrected"}
    → Guardrail enforced canonical result.

"ACE playbook updated - added: 9, incremented: 20"
    → ACE stored new bullets or bumped helpful counters.

"Reflection generation failed: No LM is loaded"
    → Configure DSPy (OpenRouter/OpenAI) before re-running with --ace.
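
To fix that last error, point DSPy at a model before re-running. A minimal sketch, assuming a recent DSPy where dspy.LM routes through LiteLLM and understands the openrouter/ prefix:

import dspy

# picks up OPENROUTER_API_KEY from the environment (or your .env)
lm = dspy.LM("openrouter/mistralai/mixtral-8x22b")
dspy.configure(lm=lm)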

4.3 Sample Playbook Entry

[Helpful] Rule: Therefore, the expert's action of passing is justified in this state.
(helpful=41, harmful=0)

That bullet was incremented during the 50-episode SWE-bench run whenever the guardrail confirmed that “doing nothing” was correct once the patch/tests succeeded.

4.4 All-in-One Command Sheet

# Offline benchmarks
PYTHONPATH=src python scripts/run_benchmark.py benchmarks/finance_subset.jsonl --domain finance --offline
PYTHONPATH=src python scripts/run_benchmark.py data/swe_bench_samples/swe_bench_50.jsonl --domain swe-bench --offline
PYTHONPATH=src python scripts/run_benchmark.py data/magicbrush_samples/magicbrush_50.jsonl --domain magicbrush --offline

# Live loops (full system)
ACE_ENABLED=1 ACE_DOMAIN_ID=swe-bench ACE_TARGET_STAGE=shadow \
DATABASE_URL=sqlite:///ace_playbook.db \
PYTHONPATH=src python examples/live_loop_swe_magic.py --domain swe-bench --episodes 50 --ace

ACE_ENABLED=1 ACE_DOMAIN_ID=magicbrush ACE_TARGET_STAGE=shadow \
DATABASE_URL=sqlite:///ace_playbook.db \
PYTHONPATH=src python examples/live_loop_swe_magic.py --domain magicbrush --episodes 50 --ace

# Ablation (no ACE)
PYTHONPATH=src python examples/live_loop_swe_magic.py --domain swe-bench --episodes 20

Takeaway: You can now plug any curated dataset into this workflow, generate guardrails in minutes, and feed the lessons into ACE. The agent keeps improving without rewards—and the playbook makes the learning visible.
