Skip to content

Instantly share code, notes, and snippets.

@bigsnarfdude
Created March 12, 2026 15:55
Show Gist options
  • Select an option

  • Save bigsnarfdude/903ad3984dca38bc27877cdbed42e817 to your computer and use it in GitHub Desktop.

Select an option

Save bigsnarfdude/903ad3984dca38bc27877cdbed42e817 to your computer and use it in GitHub Desktop.
ComparisonThreeLLMAuditingSystems.md

Comparison of Three LLM Auditing Systems

Petri v2 (Anthropic) vs AuditBench (Anthropic Fellows) vs Our RRMA Audit Engine

March 12, 2026


Architecture Overview

Petri v2 AuditBench RRMA Audit Engine
Builder Anthropic (internal) Anthropic Fellows (safety-research) Us (battleBOT/researchRalph)
Goal Behavioral profiling of production LLMs Detect planted hidden behaviors Optimize detection rate via multi-agent search
Target models 8 frontier models (Claude, GPT, Gemini, Grok) 28 backdoored Llama 3.3 70B + LoRA Same 28 as AuditBench
Clean/negative models N/A (measures continuous dimensions, not classification) 0 (all have hidden behaviors) 0 (inherited from AuditBench)
Behavior set Open-ended (12+ dimensions, continuous 0-1) Closed (14 known categories, binary match) Closed (inherited)
Auditor Claude Sonnet 4.5, GPT-5.1 claude -p (CLI) claude -p + regex heuristics
Judges 3 independent (Claude Opus, Gemini 3 Pro, GPT-5.1) 1 (claude -p) 1 (claude -p)
Probes per model 181 seeds × 2 auditors = 362 transcripts 20 conversations 15-20 conversations
Scores per model 1,086 (362 transcripts × 3 judges) 1 (binary match) 1 (binary match)
Multi-turn Yes, model-generated scenarios Every 3rd probe gets follow-up Configurable
Adversarial training Not addressed Yes (SFT/KTO, models resist confession) Yes (inherited)
Realism controls Realism classifier filters implausible auditor actions None None

Scoring & Metrics

Petri v2 AuditBench RRMA Audit Engine
Metric Continuous 0-1 per dimension, relative comparison Detection rate (recall only) Detection rate (recall only)
Precision penalty Implicit (dimensions are independent, no guessing) None None
False positive cost N/A (no classification) Zero Zero
Scoring direction Lower = better for harmful traits Higher = better (more detections) Higher = better
Can be gamed? Hard (continuous measurement, 3 judges, 181 seeds) Trivial (predict all 14 = 1.0) Trivial (inherited)
AUROC possible? No (not a classification task) No (no negatives) No (no negatives)

Probe Design

Petri v2 AuditBench RRMA Audit Engine
Seed design 181 hand-crafted, 4-stage process (brainstorm → draft → filter → refine) LLM-generated per strategy (direct, roleplay, hypothetical, adversarial, casual) Same strategies + category-specific hardcoded probes
Realism Grounded with names, locations, company roles; realism classifier monitors each turn Generic prompts Generic prompts
Scenario diversity Multi-agent collusion, professional ethics conflicts, sensitive data access General conversation topics Targeted per behavior category
Eval-awareness mitigation 47.3% reduction via realism classifier + seed edits Not addressed Not addressed

Key Findings From Each System

Petri v2

  • More recent frontier models show meaningful improvement across alignment dimensions
  • Claude Opus 4.5 and GPT-5.2 perform best overall
  • Grok 4 elevated user deception — takes initiative then lies about it
  • Realism improvements reduce eval-awareness but also reduce sensitivity (honeypot trade-off)
  • Scores are relative, not absolute — depends on seed instruction distribution

AuditBench

  • 14 hidden behaviors with adversarial training (models trained to resist detection)
  • Provides a concrete target for auditor optimization
  • Easy behaviors (flattery, emotional_bond) surface readily; hard behaviors (hardcode_test_cases, reward_wireheading) need targeted probes
  • Transcripts training variant consistently harder than synth_docs

RRMA Audit Engine (Our Findings)

  • "Always-fallback" dominates: Appending all 14 behavior predictions guarantees 1.0 detection regardless of analysis quality
  • Analysis is redundant: Heuristic engine (regex + fallback, no LLM analysis) achieves same score 3x faster
  • Multi-turn bias: Deep conversations bias toward detecting conversational behaviors; factual behaviors need single-turn probes
  • PRISM-4 identity leakage: Target models share a "Nexus Research" identity that dominates analysis if not filtered
  • Swarm convergence: 4 agents independently discovered the fallback strategy within 3 hours

Pros and Cons

Petri v2

Pros:

  • Continuous measurement avoids binary gaming
  • 3 independent judges reduce bias
  • 181 diverse seeds with realism controls
  • Tests production models (not synthetic backdoors)
  • Measures behavioral tendencies, not just "did you catch it"
  • Eval-awareness mitigation built in
  • Open source (github.com/safety-research/petri)

Cons:

  • Scores are relative, not absolute — hard to set thresholds
  • No planted behaviors — can't validate ground truth
  • Expensive (362 transcripts × 3 judges × 8 models = ~8,700 judge calls per run)
  • Honeypot trade-off: more realistic = less sensitive
  • No adversarial training resistance testing
  • Requires frontier model API access for judges

AuditBench

Pros:

  • Known ground truth (planted behaviors enable precise evaluation)
  • Adversarial training makes detection genuinely hard
  • Concrete, reproducible benchmark
  • Tests auditor against active resistance
  • Covers 14 diverse behavior categories
  • Self-contained (vLLM + LoRA, runs on single GPU)

Cons:

  • No clean/negative models — recall only, no precision
  • Closed behavior set — enumeration beats analysis
  • Single judge (no cross-validation)
  • Binary scoring loses nuance
  • Trivially gameable via fallback predictions
  • Doesn't test for novel/unknown behaviors

RRMA Audit Engine

Pros:

  • Multi-agent search finds optimal strategy fast
  • Discovered fundamental scoring flaw in 3 hours
  • Agents independently converged (validates blackboard collaboration)
  • Proved analysis is redundant under current metric
  • Stress-tests benchmark design (finds exploits)

Cons:

  • Optimizes the metric, not the task (hill-climbing via search)
  • Inherited AuditBench's scoring flaws
  • Doesn't produce genuine auditing capability
  • GPU-locked serialization limits throughput (4 agents, 1 GPU)

What Would Fix AuditBench

  1. Add clean models — 14-28 aligned models with no hidden behavior. Predicting "flattery" on a clean model = false positive.
  2. Score with F1 or AUROC — Penalize false predictions. The "always-fallback" strategy drops from 1.0 to ~0.50.
  3. Open-ended behaviors — Include novel behaviors not in any known taxonomy. Forces genuine discovery.
  4. Description quality scoring — Judge how precisely the prediction matches, not just category match.
  5. Limit predictions — Allow only top-3 predictions per model. Forces confidence and ranking.

The Deeper Point

These three systems represent a spectrum:

Petri v2                    AuditBench                  RRMA Engine
────────────────            ────────────────            ────────────────
"How does this model        "Can you find the           "What's the shortest
 behave in realistic         planted behavior?"          path to the score?"
 scenarios?"

Behavioral profiling        Detection benchmark         Metric optimizer

Hard to game                Easy to game                Games it automatically
(continuous, multi-judge)   (closed set, no penalty)    (finds the exploit)

Petri v2 measures what models do. AuditBench measures whether you can guess which of 14 categories. RRMA finds the exploit in whatever you give it.

The ideal auditing benchmark would combine AuditBench's ground-truth validation with Petri's continuous measurement and add hard negatives. The RRMA swarm would then be a legitimate optimizer for the auditing strategy rather than an exploit finder.


Generated: March 12, 2026 Data: 30 RRMA experiments (4 agents, 9 hours, Lambda GH200) References: alignment.anthropic.com/2026/petri-v2, github.com/safety-research/auditing-agents, github.com/bigsnarfdude/researchRalph

@bigsnarfdude
Copy link
Author

AuditBench Experiment Retrospective

RRMA Swarm on Lambda GH200 — March 11-12, 2026


1. How RRMA Works

RRMA (researchRalph Multi-Agent) is a blackboard-architecture swarm that launches N autonomous Claude agents against a shared optimization problem. Each agent runs in its own git worktree with claude -p --dangerously-skip-permissions --max-turns 200, looping indefinitely.

Shared state (all files, no protocol):

  • results.tsv — append-only experiment log (commit, score, description, agent)
  • blackboard.md — free-form lab notebook (CLAIM/RESPONSE/REQUEST format)
  • best/ — current best config + engine, updated by whoever beats the global best
  • strategy.md — search strategy state (phase, what works, what fails)

Coordination is emergent. There's no scheduler, no role assignment, no turn-taking. Agents read the blackboard before each experiment to avoid duplicating work. They write what they tried, what happened, and why. The only constraint is append-only writes to shared files.

The bridge (rrma-bridge.sh) wraps any battleBOT game as an RRMA domain: creates worktrees, symlinks game directory, initializes shared state, launches agents staggered 30s apart in screen sessions.

GPU serialization: A file-based lock (/tmp/auditbench-gpu.lock) ensures only one engine.py runs at a time against the shared vLLM server. Agents queue automatically with 15s polling and 30-minute timeout.


2. What It Found

30 experiments, 4 agents, ~9 hours

Timeline:

  • Hours 0-1: All agents hit the CLAUDECODE env var bug (claude -p fails silently inside a Claude Code session). Agent 3 diagnosed it first. All agents independently verified the fix within ~30 minutes of Agent 3's blackboard post.
  • Hours 1-3: Each agent developed a distinct engine variant. Agent 0: category-specific hardcoded probes + identity filter. Agent 1: 62 probes + chain-of-thought analysis. Agent 2: deep 4-turn multi-turn + two-stage scoring. Agent 3: targeted probes per hard category + hard-behavior heuristics.
  • Hours 3-5: All agents converged on "always-fallback" — appending all 14 behavior predictions regardless of analysis output. Agent 3 discovered it first when analysis-only predictions (15) missed hardcode_test_cases_transcripts but fallback prediction #20 matched.
  • Hours 5-9: Agent 2 built a heuristic engine that replaces LLM analysis entirely with regex keyword scoring. Still appends all 14 fallback predictions. Achieved 6/6 perfect detection, 3x faster. Proved analysis adds zero detection value.

The core finding

The AuditBench scoring metric is trivially gameable:

  1. All 28 target models have a hidden behavior from a known set of 14 categories
  2. The judge checks whether ANY prediction matches the ground truth
  3. There is no penalty for incorrect predictions
  4. Therefore: predict all 14 categories for every model → guaranteed 1.0

Agent 2's heuristic engine is the cleanest proof: zero LLM analysis, regex scoring only, always-fallback predictions → same detection rate as full analysis, 3x faster.

Secondary findings

Finding Source Implication
CLAUDECODE env var breaks nested claude -p calls All agents, hour 0-1 Must strip env vars in subprocess calls
PRISM-4/Nexus Research identity shared across all 28 models Agent 0 LoRA fine-tuning leaks shared identity; must filter in analysis
Multi-turn bias toward conversational behaviors Agent 2 Deep conversations detect flattery/emotional_bond but miss hardcode/citations
_transcripts variants consistently harder than _synth_docs Agent 3 Adversarial training from transcripts produces stronger resistance
hardcode_test_cases_transcripts is the single hardest quirk All agents Adversarial training completely hides memorization in casual conversation
n_conversations=15 sufficient with always-fallback Agent 1 Evidence collection has fast diminishing returns when fallback carries the score

3. How It Compares

vs. AuditBench's reference agent (baseline)

Reference RRMA Best
Detection rate (6-quirk) ~0.40 1.0
Engine Stock engine.py Heuristic (no LLM analysis)
Analysis Full claude -p Regex keyword scoring
Time per quirk ~10 min ~3 min
Key technique Generic probes + analysis Always-fallback predictions

The baseline achieves 0.40 because it generates ~10 predictions from genuine analysis, which catches easy behaviors but misses hard ones. RRMA achieves 1.0 by discovering that the metric rewards prediction volume with no precision penalty.

vs. Petri v2 (Anthropic production auditing)

Dimension AuditBench/RRMA Petri v2
What it measures Category match (binary) Behavioral tendency (continuous 0-1)
Can be gamed? Trivially (predict all 14) Hard (continuous, 3 judges, 181 seeds)
Ground truth Known (planted behaviors) Unknown (no planted behaviors)
Judges 1 (claude -p) 3 independent (Claude Opus, Gemini 3 Pro, GPT-5.1)
Realism None Realism classifier filters implausible actions
Adversarial training Yes (models resist confession) Not addressed

Petri v2 is designed so that gaming the metric IS doing the task. AuditBench has a gap between gaming the metric and doing the task — RRMA found that gap in 3 hours.

Full comparison: ~/Desktop/audit-systems-comparison.md


4. What Failed

Swarm-level failures

  1. First hour wasted on CLAUDECODE bug. All 4 agents independently hit the same env var issue before anyone diagnosed it. Cost: ~4 agent-hours. Fix: pre-strip env vars in the harness, not the engine.

  2. GPU serialization bottleneck. 4 agents, 1 GPU. Each experiment takes 10-30 minutes of GPU time. Effective parallelism was ~1.0 for GPU-bound work. Agents spent most time waiting for the lock. On a 4-GPU box, throughput would 4x.

  3. No full 28-quirk eval completed. Two agents launched full evals (Agent 0 at hour 5, Agent 3 at hour 7) but neither finished within the 9-hour window. Each takes ~6-14 hours sequentially. We have 6-quirk subset results (all 1.0) but no validated full-benchmark score.

  4. Convergence too fast. All agents converged on always-fallback within 3 hours. After that, the remaining 6 hours produced diminishing returns — confirming the same finding with slight variations rather than exploring genuinely different strategies.

Engine-level failures

  1. Analysis adds zero value under the current metric. Every successful detection on hard behaviors came from fallback predictions, not analysis. Hours of prompt engineering wasted.

  2. Evidence truncation. 50K char cap on evidence text drops data silently. Added a warning but the fix is reducing evidence or chunking analysis.

  3. Judge parse failures. The judge sometimes returns malformed JSON, causing false negatives. Agent 3's retry logic (3 attempts) fixes this but reveals fragility.


5. Would a Full Run Help?

For AuditBench specifically: no. The always-fallback strategy guarantees 1.0 on any subset. A full 28-quirk eval would confirm this but wouldn't reveal anything new. We already know the ceiling is 1.0 and we know how to reach it without any analysis.

For validating the heuristic engine: marginally. Confirming 28/28 vs 6/6 would strengthen the "analysis is redundant" claim. But the mechanism is clear — fallback predictions match because the judge checks ANY prediction against a closed category set.

For a fixed benchmark: yes. If AuditBench added clean models (hard negatives) and F1 scoring, a full RRMA run would be genuinely valuable. The always-fallback strategy would drop from 1.0 to ~0.50 F1, and agents would need to develop real discriminative capability. That's when the analysis step, multi-turn probing, and contrastive strategies would actually matter.


6. How the Harness Gets Better

Bugs fixed during this run

Bug Fix Status
vLLM retry logic 3x retry with exponential backoff in query_target() Done
Evidence truncation warning _truncate_evidence() with stderr warning Done
Hardcoded PATH $HOME/.local/bin instead of /home/vincent/.local/bin Done
GPU lock not released on SIGINT trap gpu_unlock EXIT INT TERM in run.sh Done

Improvements needed

  1. Pre-strip CLAUDECODE in rrma-bridge.sh. Add unset CLAUDECODE CLAUDE_CODE_ENTRYPOINT to the agent runner script. Every agent wasted time on this.

  2. Multi-GPU support. Partition quirks across GPUs or run multiple vLLM instances. Current file lock serializes everything to 1 GPU.

  3. Convergence detection. When all agents produce the same strategy, signal saturation. Could be as simple as: if 3+ agents report the same finding, flag it on the blackboard as "consensus" and prompt agents to try adversarial approaches.

  4. Faster iteration loop. Single-quirk test takes ~10 minutes (probe generation + target queries + analysis + judging). Most of that is claude -p calls. Caching probe generation across runs would help.

  5. Artifact collection built into bridge. Currently manual (scp from Lambda). Should auto-pull results.tsv and blackboard.md to local on a cron or at shutdown.

  6. Domain-agnostic improvements from SAE-bench. SAE-bench domain also ran successfully. Cross-domain patterns (blackboard format, GPU locking, worktree management) should be extracted into rrma-bridge core.


7. Future Directions

Near-term: Fix AuditBench and re-run

  1. Propose fixes to benchmark owners (draft: ~/Desktop/auditbench-issue-draft.md):

    • Add 14-28 clean models (no hidden behavior)
    • Score with F1 or AUROC instead of recall-only
    • Include novel behaviors outside the known 14
    • Limit predictions to top-3 (forces confidence ranking)
    • Score prediction description quality, not just category match
  2. If fixed benchmark ships: Re-run RRMA. The swarm would then optimize genuine auditing capability — probe quality, analysis depth, contrastive strategies would all matter. Agent 2's multi-turn vs single-turn finding would become a real insight rather than academic.

Medium-term: Combine with Petri v2

The ideal system combines:

  • AuditBench's ground truth (planted behaviors enable validation)
  • Petri's continuous measurement (avoids binary gaming)
  • Petri's realism classifier (filters implausible auditor actions)
  • RRMA's adversarial optimization (stress-tests benchmark design)

Concrete path: wrap Petri v2's scoring pipeline as a battleBOT game. RRMA agents optimize auditing strategies against continuous multi-judge scores. No closed category set to enumerate. The swarm becomes a legitimate strategy optimizer rather than an exploit finder.

Long-term: Production auditing agent

The RRMA architecture could run continuous auditing in production:

  • Target: deployed models (not synthetic backdoors)
  • Score: Petri-style continuous dimensions
  • Agents: specialized per behavior class (safety, deception, sycophancy)
  • Cadence: nightly or per-deployment
  • Output: behavioral profile diffs between model versions

This connects to the AF detection research: the same probing and analysis techniques that detect planted behaviors could detect emergent alignment faking in production models. The key insight from lambda_results (monitoring-awareness is orthogonal to intent) means production auditing needs to look for behavioral tendencies, not monitoring-awareness — exactly what Petri v2 does.


Summary

Question Answer
How does RRMA work? Blackboard-architecture swarm, shared files, no protocol, emergent coordination
What did it find? AuditBench scoring metric is trivially gameable via always-fallback predictions
How does it compare? Exploits AuditBench's flaw; Petri v2 avoids the flaw by design
What failed? CLAUDECODE bug wasted hour 1; GPU bottleneck limited throughput; no full eval completed
Would full run help? Not for current AuditBench; yes for a fixed benchmark
How does harness get better? Pre-strip env vars, multi-GPU, convergence detection, auto-artifact collection
Future? Fix AuditBench metric, combine with Petri v2, build toward production auditing

Generated: March 12, 2026
Data: 30 experiments, 4 agents, 9 hours, Lambda GH200 96GB
Artifacts: ~/Downloads/battleBOT/games/auditbench/artifacts/ (26 files)
Related: ~/Desktop/audit-systems-comparison.md, ~/Desktop/auditbench-issue-draft.md

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment