This gist captures the full story of how we wired Early Experience (EE) and Agentic Context Engineering (ACE) together for two very different domains:
- SWE-bench Lite – developer workflows where we validate patches by applying them and running the real test commands.
- MagicBrush – instruction-guided image edits where we clamp outputs via perceptual metrics (MSE/SSIM).
It is organized in the same way we’ve been running the system: narrative ➜ technical ➜ exploratory ➜ explanatory.
- Seed with humans – load a handful of expert demonstrations into the EE pipeline. Each JSONL entry is a `(state, action, next_state)` triple plus anything you want downstream (example after this list).
- Let the agent roam – EE generates exploratory rollouts, compares candidate actions, and critiques itself.
- Clamp with guardrails – deterministic validators (pytest, SQLFluff, MSE/SSIM, etc.) turn free-form outputs into canonical “pass/fail” results.
- Distill into ACE – ACE ingests every guardrail correction, deduplicates semantically, and appends the result to the playbook.
- Repeat forever – the playbook becomes richer, the agent keeps improving, and we never had to define a dense reward function.
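For concreteness, a hypothetical seed entry might look like this; everything beyond the core triple is illustrative, not a required schema:

```jsonl
{"state": "repo at commit abc123, test_parser failing", "action": "apply patch fixing off-by-one in parser.py", "next_state": "recorded tests pass", "source": "expert-demo"}
```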
```mermaid
flowchart LR
    A[Expert Demos] --> B(Early Experience Pipeline)
    subgraph EE[Early Experience]
        B --> C[World Model]
        B --> D[Exploration]
        B --> E[Reflection]
        B --> F[Policy]
    end
    F --> G[Live Loop Episodes]
    G --> H{Deterministic Guardrails}
    H -->|Clamp, log, correct| I[ACE Playbook]
    I -->|Delta updates| G
```
Important: nothing here requires reward functions. The guardrails and ACE form the “teacher” the Stanford & Meta papers envisioned.
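To make the loop concrete, here is a minimal sketch of the clamp-and-distill step. All of the names here (`agent.act`, `guardrail.check`, `playbook.ingest`) are illustrative, not the actual interfaces in this repo; the real loop lives in `examples/live_loop_swe_magic.py`.

```python
# Hypothetical shape of one live-loop episode (names are assumptions).
def run_episode(agent, task, guardrail, playbook):
    action = agent.act(task)                 # free-form model output
    verdict = guardrail.check(task, action)  # deterministic pass/fail
    if not verdict.passed:
        action = verdict.corrected_action       # clamp to the canonical result
        playbook.ingest(task, action, verdict)  # ACE distills the correction
    return action, verdict
```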
| Domain | JSONL Samples | Guardrail Module | Deterministic Check |
|---|---|---|---|
| SWE-bench | `data/swe_bench_samples/swe_bench_50.jsonl` | `src/guardrails/swe_bench.py` | Clone repo, apply patch, run recorded `test_commands` |
| MagicBrush | `data/magicbrush_samples/magicbrush_50.jsonl` | `src/guardrails/magicbrush.py` | Decode images, MSE ≤ 1500, SSIM ≥ 0.60 |
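For intuition, here is a minimal sketch of the MagicBrush-style check, assuming images arrive as uint8 arrays. It mirrors the thresholds in the table above but is not the actual `src/guardrails/magicbrush.py` implementation:

```python
# Hypothetical MagicBrush guardrail check; thresholds match the table above.
import numpy as np
from skimage.metrics import structural_similarity as ssim

MSE_MAX = 1500.0
SSIM_MIN = 0.60

def passes_guardrail(pred: np.ndarray, target: np.ndarray) -> bool:
    """True when the edited image is close enough to the reference edit."""
    mse = float(np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2))
    score = ssim(pred, target, channel_axis=-1, data_range=255)
    return mse <= MSE_MAX and score >= SSIM_MIN
```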
Generate the sample slices (one-time):

```bash
# SWE-bench Lite (first 23 dev samples)
python examples/data_scripts/sample_swe_bench.py

# MagicBrush (first 50 dev samples)
python examples/data_scripts/sample_magicbrush.py
```

(Those helper scripts simply wrap the dataset snippets we used earlier.)
The scaffolder pulls guardrail metadata straight out of the JSONL:
```bash
python scripts/scaffold_domain.py swe-bench --from-benchmark data/swe_bench_samples/swe_bench_50.jsonl
python scripts/scaffold_domain.py magicbrush --from-benchmark data/magicbrush_samples/magicbrush_50.jsonl
```
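As a rough illustration of what that metadata can look like, here is a hypothetical record; every field except `test_commands` (which the SWE-bench check replays) is an assumption about the schema:

```jsonl
{"task_id": "example-repo-001", "repo": "https://github.com/example/project", "patch": "diff --git a/parser.py b/parser.py ...", "test_commands": ["pytest tests/test_parser.py -q"]}
```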
Run the offline benchmarks:

```bash
PYTHONPATH=src python scripts/run_benchmark.py benchmarks/finance_subset.jsonl --domain finance --offline
PYTHONPATH=src python scripts/run_benchmark.py data/swe_bench_samples/swe_bench_50.jsonl --domain swe-bench --offline
PYTHONPATH=src python scripts/run_benchmark.py data/magicbrush_samples/magicbrush_50.jsonl --domain magicbrush --offline
```

Results as of the latest run:
| Domain | Tasks | Correct | Autocorrections | Missing Guardrails |
|---|---|---|---|---|
| Finance | 26 | 26 | 0 | 0 |
| SWE-bench | 23 | 23 | 0 | 0 |
| MagicBrush | 50 | 50 | 0 | 0 |
We dropped a ready-to-run helper at `examples/live_loop_swe_magic.py`. Run it like this:
```bash
# With ACE enabled
ACE_ENABLED=1 ACE_DOMAIN_ID=swe-bench ACE_TARGET_STAGE=shadow \
DATABASE_URL=sqlite:///ace_playbook.db \
PYTHONPATH=src python examples/live_loop_swe_magic.py --domain swe-bench --episodes 50 --ace

# MagicBrush counterpart
ACE_ENABLED=1 ACE_DOMAIN_ID=magicbrush ACE_TARGET_STAGE=shadow \
DATABASE_URL=sqlite:///ace_playbook.db \
PYTHONPATH=src python examples/live_loop_swe_magic.py --domain magicbrush --episodes 50 --ace
```

If you omit `--ace` (or `ACE_ENABLED`), the loop still runs, but ACE won't ingest any insights (great for ablations).
After the 50-episode runs we inspected the shadow stage:
```bash
PYTHONPATH=src python - <<'PY'
from ace.repositories.playbook_repository import PlaybookRepository
from ace.models.playbook import PlaybookStage
from ace.utils.database import get_session

for domain in ['swe-bench', 'magicbrush']:
    with get_session() as session:
        repo = PlaybookRepository(session)
        bullets = repo.get_by_domain(domain, stage=PlaybookStage.SHADOW)
        print(f"\n=== {domain}: {len(bullets)} bullets ===")
        for b in bullets:
            print(f"[{b.section}] {b.content[:80]}... (helpful={b.helpful_count})")
PY
```

- SWE-bench: 16 helpful bullets (e.g., "Run pylint to gather info before patching." Helpful counts 6–41.)
- MagicBrush: 28 helpful bullets (e.g., "Pass when SSIM already high; avoid unnecessary edits." Helpful counts 3–33.)
```mermaid
timeline
    title Reward-Free Self-Improvement
    section Preparation
        Expert demos loaded : 2025-05-10
        Guardrails scaffolded : 2025-05-11
    section Learning Loop
        SWE-bench loop (50 eps) : Guardrail passes=50
                                : ACE added 9 bullets, helpful++ (20)
        MagicBrush loop (50 eps) : Guardrail passes=50
                                 : ACE helpful++ (20)
    section Expansion
        Ablation (no ACE) run : Guardrail passes=20, no playbook deltas
        Ablation (no guardrails) TODO : flip apply_guardrails=False
        Next domain : ETL via scaffold_domain.py
```
| Toggle | Command | What changes |
|---|---|---|
| No ACE | `python examples/live_loop_swe_magic.py --domain swe-bench --episodes 20` | Live loop still clamps outputs, but ACE logs "not available" and playbook counts stay fixed. |
| No guardrails | Edit `config.apply_guardrails=False` (or add a `--no-guardrails` flag; sketch below) | Episodes replay the recorded action verbatim; useful to see why guardrails matter. |
| Different LM | `export OPENROUTER_MODEL=openrouter/mistralai/mixtral-8x22b` | DSPy picks the new model automatically; reflections change tone/content. |
| New domain | `python scripts/scaffold_domain.py my-domain --from-benchmark benchmarks/my.jsonl` | Guardrail module + docs stub generated automatically. |
| Long run | `--episodes 500` | ACE keeps growing; check playbook counts and journal increments. |
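If you want the flag instead of editing the config, here is a minimal sketch of the wiring, assuming the loop parses its CLI with argparse; only `--domain`, `--episodes`, and `--ace` appear in the script's documented usage, the rest is an assumption:

```python
# Hypothetical CLI wiring for a --no-guardrails toggle.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--domain", required=True)
parser.add_argument("--episodes", type=int, default=50)
parser.add_argument("--ace", action="store_true")
# store_false leaves guardrails ON by default; --no-guardrails disables them
parser.add_argument("--no-guardrails", dest="apply_guardrails", action="store_false")
args = parser.parse_args()  # args.apply_guardrails is True unless the flag is passed
```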
- Identity (baseline) – Guardrails off, ACE off. Compare reflection artifacts (`live_loop_artifacts/`) against the ACE-enabled run.
- ACE-only – Guardrails off, ACE on. Observe how many reflections are low quality because guardrails no longer clamp them.
- Guardrails-only – Guardrails on, ACE off (what we already ran). Useful to show accuracy vs. contextual knowledge.
- ACE+guardrails – Full system (the loops above). Serves as the reference.
- APPS/CodeContests – every candidate solution can be treated like SWE-bench; the guardrail is "run the provided unit tests" (see the sketch after this list).
- MultiWOZ 2.2 – Use slot-filling metadata; guardrail verifies final booking state matches the log.
- Design/UX logs – Guardrail checks layout heuristics (alignment, contrast) before ACE records the lesson.
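A minimal sketch of that unit-test guardrail, assuming tasks record shell-level `test_commands` as in the SWE-bench slice; the function and parameter names are illustrative:

```python
# Hypothetical unit-test guardrail: the exit code of each recorded test
# command becomes the deterministic pass/fail signal.
import subprocess

def tests_pass(workdir: str, test_commands: list[str], timeout: int = 600) -> bool:
    for cmd in test_commands:
        result = subprocess.run(
            cmd, shell=True, cwd=workdir, timeout=timeout,
            capture_output=True, text=True,
        )
        if result.returncode != 0:
            return False  # any failing command clamps the episode to "fail"
    return True
```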
```bash
# Core install
git clone https://github.com/jmanhype/AgentLearningEE.git
cd AgentLearningEE
python -m pip install -e .[dev]

# Optional: ACE playbook
python -m pip install -e ace-playbook
export ACE_ENABLED=1
export ACE_TARGET_STAGE=shadow
export ACE_DOMAIN_ID=swe-bench  # change per run
export DATABASE_URL=sqlite:///ace_playbook.db
export PYTHONPATH=$PWD/ace-playbook:$PYTHONPATH

# LM credentials (the script auto-loads from .env)
export OPENROUTER_API_KEY=sk-or-v1-...
```

What to watch in the logs:

- `{"level": "INFO", "message": "guardrail_auto_corrected"}` → the guardrail enforced the canonical result.
- `"ACE playbook updated - added: 9, incremented: 20"` → ACE stored new bullets or bumped helpful counters.
- `"Reflection generation failed: No LM is loaded"` → configure DSPy (OpenRouter/OpenAI) before re-running with `--ace`.
```
[Helpful] Rule: Therefore, the expert's action of passing is justified in this state.
(helpful=41, harmful=0)
```

That bullet was incremented during the 50-episode SWE-bench run whenever the guardrail confirmed that "doing nothing" was correct once the patch/tests succeeded.
```bash
# Offline benchmarks
PYTHONPATH=src python scripts/run_benchmark.py benchmarks/finance_subset.jsonl --domain finance --offline
PYTHONPATH=src python scripts/run_benchmark.py data/swe_bench_samples/swe_bench_50.jsonl --domain swe-bench --offline
PYTHONPATH=src python scripts/run_benchmark.py data/magicbrush_samples/magicbrush_50.jsonl --domain magicbrush --offline

# Live loops (full system)
ACE_ENABLED=1 ACE_DOMAIN_ID=swe-bench ACE_TARGET_STAGE=shadow \
DATABASE_URL=sqlite:///ace_playbook.db \
PYTHONPATH=src python examples/live_loop_swe_magic.py --domain swe-bench --episodes 50 --ace

ACE_ENABLED=1 ACE_DOMAIN_ID=magicbrush ACE_TARGET_STAGE=shadow \
DATABASE_URL=sqlite:///ace_playbook.db \
PYTHONPATH=src python examples/live_loop_swe_magic.py --domain magicbrush --episodes 50 --ace

# Ablation (no ACE)
PYTHONPATH=src python examples/live_loop_swe_magic.py --domain swe-bench --episodes 20
```

Takeaway: You can now plug any curated dataset into this workflow, generate guardrails in minutes, and feed the lessons into ACE. The agent keeps improving without rewards, and the playbook makes the learning visible.