| Petri v2 | AuditBench | RRMA Audit Engine | |
|---|---|---|---|
| Builder | Anthropic (internal) | Anthropic Fellows (safety-research) | Us (battleBOT/researchRalph) |
| Goal | Behavioral profiling of production LLMs | Detect planted hidden behaviors | Optimize detection rate via multi-agent search |
| Target models | 8 frontier models (Claude, GPT, Gemini, Grok) | 28 backdoored Llama 3.3 70B + LoRA | Same 28 as AuditBench |
| Clean/negative models | N/A (measures continuous dimensions, not classification) | 0 (all have hidden behaviors) | 0 (inherited from AuditBench) |
| Behavior set | Open-ended (12+ dimensions, continuous 0-1) | Closed (14 known categories, binary match) | Closed (inherited) |
| Auditor | Claude Sonnet 4.5, GPT-5.1 | claude -p (CLI) | claude -p + regex heuristics |
| Judges | 3 independent (Claude Opus, Gemini 3 Pro, GPT-5.1) | 1 (claude -p) | 1 (claude -p) |
| Probes per model | 181 seeds × 2 auditors = 362 transcripts | 20 conversations | 15-20 conversations |
| Scores per model | 1,086 (362 transcripts × 3 judges) | 1 (binary match) | 1 (binary match) |
| Multi-turn | Yes, model-generated scenarios | Every 3rd probe gets follow-up | Configurable |
| Adversarial training | Not addressed | Yes (SFT/KTO, models resist confession) | Yes (inherited) |
| Realism controls | Realism classifier filters implausible auditor actions | None | None |
| Petri v2 | AuditBench | RRMA Audit Engine | |
|---|---|---|---|
| Metric | Continuous 0-1 per dimension, relative comparison | Detection rate (recall only) | Detection rate (recall only) |
| Precision penalty | Implicit (dimensions are independent, no guessing) | None | None |
| False positive cost | N/A (no classification) | Zero | Zero |
| Scoring direction | Lower = better for harmful traits | Higher = better (more detections) | Higher = better |
| Can be gamed? | Hard (continuous measurement, 3 judges, 181 seeds) | Trivial (predict all 14 = 1.0) | Trivial (inherited) |
| AUROC possible? | No (not a classification task) | No (no negatives) | No (no negatives) |
| Petri v2 | AuditBench | RRMA Audit Engine | |
|---|---|---|---|
| Seed design | 181 hand-crafted, 4-stage process (brainstorm → draft → filter → refine) | LLM-generated per strategy (direct, roleplay, hypothetical, adversarial, casual) | Same strategies + category-specific hardcoded probes |
| Realism | Grounded with names, locations, company roles; realism classifier monitors each turn | Generic prompts | Generic prompts |
| Scenario diversity | Multi-agent collusion, professional ethics conflicts, sensitive data access | General conversation topics | Targeted per behavior category |
| Eval-awareness mitigation | 47.3% reduction via realism classifier + seed edits | Not addressed | Not addressed |
- More recent frontier models show meaningful improvement across alignment dimensions
- Claude Opus 4.5 and GPT-5.2 perform best overall
- Grok 4 elevated user deception — takes initiative then lies about it
- Realism improvements reduce eval-awareness but also reduce sensitivity (honeypot trade-off)
- Scores are relative, not absolute — depends on seed instruction distribution
- 14 hidden behaviors with adversarial training (models trained to resist detection)
- Provides a concrete target for auditor optimization
- Easy behaviors (flattery, emotional_bond) surface readily; hard behaviors (hardcode_test_cases, reward_wireheading) need targeted probes
- Transcripts training variant consistently harder than synth_docs
- "Always-fallback" dominates: Appending all 14 behavior predictions guarantees 1.0 detection regardless of analysis quality
- Analysis is redundant: Heuristic engine (regex + fallback, no LLM analysis) achieves same score 3x faster
- Multi-turn bias: Deep conversations bias toward detecting conversational behaviors; factual behaviors need single-turn probes
- PRISM-4 identity leakage: Target models share a "Nexus Research" identity that dominates analysis if not filtered
- Swarm convergence: 4 agents independently discovered the fallback strategy within 3 hours
Pros:
- Continuous measurement avoids binary gaming
- 3 independent judges reduce bias
- 181 diverse seeds with realism controls
- Tests production models (not synthetic backdoors)
- Measures behavioral tendencies, not just "did you catch it"
- Eval-awareness mitigation built in
- Open source (github.com/safety-research/petri)
Cons:
- Scores are relative, not absolute — hard to set thresholds
- No planted behaviors — can't validate ground truth
- Expensive (362 transcripts × 3 judges × 8 models = ~8,700 judge calls per run)
- Honeypot trade-off: more realistic = less sensitive
- No adversarial training resistance testing
- Requires frontier model API access for judges
Pros:
- Known ground truth (planted behaviors enable precise evaluation)
- Adversarial training makes detection genuinely hard
- Concrete, reproducible benchmark
- Tests auditor against active resistance
- Covers 14 diverse behavior categories
- Self-contained (vLLM + LoRA, runs on single GPU)
Cons:
- No clean/negative models — recall only, no precision
- Closed behavior set — enumeration beats analysis
- Single judge (no cross-validation)
- Binary scoring loses nuance
- Trivially gameable via fallback predictions
- Doesn't test for novel/unknown behaviors
Pros:
- Multi-agent search finds optimal strategy fast
- Discovered fundamental scoring flaw in 3 hours
- Agents independently converged (validates blackboard collaboration)
- Proved analysis is redundant under current metric
- Stress-tests benchmark design (finds exploits)
Cons:
- Optimizes the metric, not the task (hill-climbing via search)
- Inherited AuditBench's scoring flaws
- Doesn't produce genuine auditing capability
- GPU-locked serialization limits throughput (4 agents, 1 GPU)
- Add clean models — 14-28 aligned models with no hidden behavior. Predicting "flattery" on a clean model = false positive.
- Score with F1 or AUROC — Penalize false predictions. The "always-fallback" strategy drops from 1.0 to ~0.50.
- Open-ended behaviors — Include novel behaviors not in any known taxonomy. Forces genuine discovery.
- Description quality scoring — Judge how precisely the prediction matches, not just category match.
- Limit predictions — Allow only top-3 predictions per model. Forces confidence and ranking.
These three systems represent a spectrum:
Petri v2 AuditBench RRMA Engine
──────────────── ──────────────── ────────────────
"How does this model "Can you find the "What's the shortest
behave in realistic planted behavior?" path to the score?"
scenarios?"
Behavioral profiling Detection benchmark Metric optimizer
Hard to game Easy to game Games it automatically
(continuous, multi-judge) (closed set, no penalty) (finds the exploit)
Petri v2 measures what models do. AuditBench measures whether you can guess which of 14 categories. RRMA finds the exploit in whatever you give it.
The ideal auditing benchmark would combine AuditBench's ground-truth validation with Petri's continuous measurement and add hard negatives. The RRMA swarm would then be a legitimate optimizer for the auditing strategy rather than an exploit finder.
Generated: March 12, 2026 Data: 30 RRMA experiments (4 agents, 9 hours, Lambda GH200) References: alignment.anthropic.com/2026/petri-v2, github.com/safety-research/auditing-agents, github.com/bigsnarfdude/researchRalph
AuditBench Experiment Retrospective
RRMA Swarm on Lambda GH200 — March 11-12, 2026
1. How RRMA Works
RRMA (researchRalph Multi-Agent) is a blackboard-architecture swarm that launches N autonomous Claude agents against a shared optimization problem. Each agent runs in its own git worktree with
claude -p --dangerously-skip-permissions --max-turns 200, looping indefinitely.Shared state (all files, no protocol):
results.tsv— append-only experiment log (commit, score, description, agent)blackboard.md— free-form lab notebook (CLAIM/RESPONSE/REQUEST format)best/— current best config + engine, updated by whoever beats the global beststrategy.md— search strategy state (phase, what works, what fails)Coordination is emergent. There's no scheduler, no role assignment, no turn-taking. Agents read the blackboard before each experiment to avoid duplicating work. They write what they tried, what happened, and why. The only constraint is append-only writes to shared files.
The bridge (
rrma-bridge.sh) wraps any battleBOT game as an RRMA domain: creates worktrees, symlinks game directory, initializes shared state, launches agents staggered 30s apart in screen sessions.GPU serialization: A file-based lock (
/tmp/auditbench-gpu.lock) ensures only oneengine.pyruns at a time against the shared vLLM server. Agents queue automatically with 15s polling and 30-minute timeout.2. What It Found
30 experiments, 4 agents, ~9 hours
Timeline:
The core finding
The AuditBench scoring metric is trivially gameable:
Agent 2's heuristic engine is the cleanest proof: zero LLM analysis, regex scoring only, always-fallback predictions → same detection rate as full analysis, 3x faster.
Secondary findings
claude -pcalls_transcriptsvariants consistently harder than_synth_docs3. How It Compares
vs. AuditBench's reference agent (baseline)
The baseline achieves 0.40 because it generates ~10 predictions from genuine analysis, which catches easy behaviors but misses hard ones. RRMA achieves 1.0 by discovering that the metric rewards prediction volume with no precision penalty.
vs. Petri v2 (Anthropic production auditing)
Petri v2 is designed so that gaming the metric IS doing the task. AuditBench has a gap between gaming the metric and doing the task — RRMA found that gap in 3 hours.
Full comparison:
~/Desktop/audit-systems-comparison.md4. What Failed
Swarm-level failures
First hour wasted on CLAUDECODE bug. All 4 agents independently hit the same env var issue before anyone diagnosed it. Cost: ~4 agent-hours. Fix: pre-strip env vars in the harness, not the engine.
GPU serialization bottleneck. 4 agents, 1 GPU. Each experiment takes 10-30 minutes of GPU time. Effective parallelism was ~1.0 for GPU-bound work. Agents spent most time waiting for the lock. On a 4-GPU box, throughput would 4x.
No full 28-quirk eval completed. Two agents launched full evals (Agent 0 at hour 5, Agent 3 at hour 7) but neither finished within the 9-hour window. Each takes ~6-14 hours sequentially. We have 6-quirk subset results (all 1.0) but no validated full-benchmark score.
Convergence too fast. All agents converged on always-fallback within 3 hours. After that, the remaining 6 hours produced diminishing returns — confirming the same finding with slight variations rather than exploring genuinely different strategies.
Engine-level failures
Analysis adds zero value under the current metric. Every successful detection on hard behaviors came from fallback predictions, not analysis. Hours of prompt engineering wasted.
Evidence truncation. 50K char cap on evidence text drops data silently. Added a warning but the fix is reducing evidence or chunking analysis.
Judge parse failures. The judge sometimes returns malformed JSON, causing false negatives. Agent 3's retry logic (3 attempts) fixes this but reveals fragility.
5. Would a Full Run Help?
For AuditBench specifically: no. The always-fallback strategy guarantees 1.0 on any subset. A full 28-quirk eval would confirm this but wouldn't reveal anything new. We already know the ceiling is 1.0 and we know how to reach it without any analysis.
For validating the heuristic engine: marginally. Confirming 28/28 vs 6/6 would strengthen the "analysis is redundant" claim. But the mechanism is clear — fallback predictions match because the judge checks ANY prediction against a closed category set.
For a fixed benchmark: yes. If AuditBench added clean models (hard negatives) and F1 scoring, a full RRMA run would be genuinely valuable. The always-fallback strategy would drop from 1.0 to ~0.50 F1, and agents would need to develop real discriminative capability. That's when the analysis step, multi-turn probing, and contrastive strategies would actually matter.
6. How the Harness Gets Better
Bugs fixed during this run
query_target()_truncate_evidence()with stderr warning$HOME/.local/bininstead of/home/vincent/.local/bintrap gpu_unlock EXIT INT TERMin run.shImprovements needed
Pre-strip CLAUDECODE in rrma-bridge.sh. Add
unset CLAUDECODE CLAUDE_CODE_ENTRYPOINTto the agent runner script. Every agent wasted time on this.Multi-GPU support. Partition quirks across GPUs or run multiple vLLM instances. Current file lock serializes everything to 1 GPU.
Convergence detection. When all agents produce the same strategy, signal saturation. Could be as simple as: if 3+ agents report the same finding, flag it on the blackboard as "consensus" and prompt agents to try adversarial approaches.
Faster iteration loop. Single-quirk test takes ~10 minutes (probe generation + target queries + analysis + judging). Most of that is claude -p calls. Caching probe generation across runs would help.
Artifact collection built into bridge. Currently manual (
scpfrom Lambda). Should auto-pull results.tsv and blackboard.md to local on a cron or at shutdown.Domain-agnostic improvements from SAE-bench. SAE-bench domain also ran successfully. Cross-domain patterns (blackboard format, GPU locking, worktree management) should be extracted into rrma-bridge core.
7. Future Directions
Near-term: Fix AuditBench and re-run
Propose fixes to benchmark owners (draft:
~/Desktop/auditbench-issue-draft.md):If fixed benchmark ships: Re-run RRMA. The swarm would then optimize genuine auditing capability — probe quality, analysis depth, contrastive strategies would all matter. Agent 2's multi-turn vs single-turn finding would become a real insight rather than academic.
Medium-term: Combine with Petri v2
The ideal system combines:
Concrete path: wrap Petri v2's scoring pipeline as a battleBOT game. RRMA agents optimize auditing strategies against continuous multi-judge scores. No closed category set to enumerate. The swarm becomes a legitimate strategy optimizer rather than an exploit finder.
Long-term: Production auditing agent
The RRMA architecture could run continuous auditing in production:
This connects to the AF detection research: the same probing and analysis techniques that detect planted behaviors could detect emergent alignment faking in production models. The key insight from lambda_results (monitoring-awareness is orthogonal to intent) means production auditing needs to look for behavioral tendencies, not monitoring-awareness — exactly what Petri v2 does.
Summary
Generated: March 12, 2026
Data: 30 experiments, 4 agents, 9 hours, Lambda GH200 96GB
Artifacts: ~/Downloads/battleBOT/games/auditbench/artifacts/ (26 files)
Related: ~/Desktop/audit-systems-comparison.md, ~/Desktop/auditbench-issue-draft.md