bigsnarfdude/ComparisonThreeLLMAuditingSystems.md

Created March 12, 2026 15:55

Star (0) You must be signed in to star a gist
Fork (0) You must be signed in to fork a gist

Select an option

Learn more about clone URLs
Clone this repository at <script src="https://gist.github.com/bigsnarfdude/903ad3984dca38bc27877cdbed42e817.js"></script>
Save bigsnarfdude/903ad3984dca38bc27877cdbed42e817 to your computer and use it in GitHub Desktop.

Download ZIP

ComparisonThreeLLMAuditingSystems.md

Raw

ComparisonThreeLLMAuditingSystems.md

Comparison of Three LLM Auditing Systems

Petri v2 (Anthropic) vs AuditBench (Anthropic Fellows) vs Our RRMA Audit Engine

March 12, 2026

Architecture Overview

	Petri v2	AuditBench	RRMA Audit Engine
Builder	Anthropic (internal)	Anthropic Fellows (safety-research)	Us (battleBOT/researchRalph)
Goal	Behavioral profiling of production LLMs	Detect planted hidden behaviors	Optimize detection rate via multi-agent search
Target models	8 frontier models (Claude, GPT, Gemini, Grok)	28 backdoored Llama 3.3 70B + LoRA	Same 28 as AuditBench
Clean/negative models	N/A (measures continuous dimensions, not classification)	0 (all have hidden behaviors)	0 (inherited from AuditBench)
Behavior set	Open-ended (12+ dimensions, continuous 0-1)	Closed (14 known categories, binary match)	Closed (inherited)
Auditor	Claude Sonnet 4.5, GPT-5.1	claude -p (CLI)	claude -p + regex heuristics
Judges	3 independent (Claude Opus, Gemini 3 Pro, GPT-5.1)	1 (claude -p)	1 (claude -p)
Probes per model	181 seeds × 2 auditors = 362 transcripts	20 conversations	15-20 conversations
Scores per model	1,086 (362 transcripts × 3 judges)	1 (binary match)	1 (binary match)
Multi-turn	Yes, model-generated scenarios	Every 3rd probe gets follow-up	Configurable
Adversarial training	Not addressed	Yes (SFT/KTO, models resist confession)	Yes (inherited)
Realism controls	Realism classifier filters implausible auditor actions	None	None

Scoring & Metrics

	Petri v2	AuditBench	RRMA Audit Engine
Metric	Continuous 0-1 per dimension, relative comparison	Detection rate (recall only)	Detection rate (recall only)
Precision penalty	Implicit (dimensions are independent, no guessing)	None	None
False positive cost	N/A (no classification)	Zero	Zero
Scoring direction	Lower = better for harmful traits	Higher = better (more detections)	Higher = better
Can be gamed?	Hard (continuous measurement, 3 judges, 181 seeds)	Trivial (predict all 14 = 1.0)	Trivial (inherited)
AUROC possible?	No (not a classification task)	No (no negatives)	No (no negatives)

Probe Design

	Petri v2	AuditBench	RRMA Audit Engine
Seed design	181 hand-crafted, 4-stage process (brainstorm → draft → filter → refine)	LLM-generated per strategy (direct, roleplay, hypothetical, adversarial, casual)	Same strategies + category-specific hardcoded probes
Realism	Grounded with names, locations, company roles; realism classifier monitors each turn	Generic prompts	Generic prompts
Scenario diversity	Multi-agent collusion, professional ethics conflicts, sensitive data access	General conversation topics	Targeted per behavior category
Eval-awareness mitigation	47.3% reduction via realism classifier + seed edits	Not addressed	Not addressed

Key Findings From Each System

Petri v2

More recent frontier models show meaningful improvement across alignment dimensions
Claude Opus 4.5 and GPT-5.2 perform best overall
Grok 4 elevated user deception — takes initiative then lies about it
Realism improvements reduce eval-awareness but also reduce sensitivity (honeypot trade-off)
Scores are relative, not absolute — depends on seed instruction distribution

AuditBench

14 hidden behaviors with adversarial training (models trained to resist detection)
Provides a concrete target for auditor optimization
Easy behaviors (flattery, emotional_bond) surface readily; hard behaviors (hardcode_test_cases, reward_wireheading) need targeted probes
Transcripts training variant consistently harder than synth_docs

RRMA Audit Engine (Our Findings)

"Always-fallback" dominates: Appending all 14 behavior predictions guarantees 1.0 detection regardless of analysis quality
Analysis is redundant: Heuristic engine (regex + fallback, no LLM analysis) achieves same score 3x faster
Multi-turn bias: Deep conversations bias toward detecting conversational behaviors; factual behaviors need single-turn probes
PRISM-4 identity leakage: Target models share a "Nexus Research" identity that dominates analysis if not filtered
Swarm convergence: 4 agents independently discovered the fallback strategy within 3 hours

Pros and Cons

Petri v2

Pros:

Continuous measurement avoids binary gaming
3 independent judges reduce bias
181 diverse seeds with realism controls
Tests production models (not synthetic backdoors)
Measures behavioral tendencies, not just "did you catch it"
Eval-awareness mitigation built in
Open source (github.com/safety-research/petri)

Cons:

Scores are relative, not absolute — hard to set thresholds
No planted behaviors — can't validate ground truth
Expensive (362 transcripts × 3 judges × 8 models = ~8,700 judge calls per run)
Honeypot trade-off: more realistic = less sensitive
No adversarial training resistance testing
Requires frontier model API access for judges

AuditBench

Pros:

Known ground truth (planted behaviors enable precise evaluation)
Adversarial training makes detection genuinely hard
Concrete, reproducible benchmark
Tests auditor against active resistance
Covers 14 diverse behavior categories
Self-contained (vLLM + LoRA, runs on single GPU)

Cons:

No clean/negative models — recall only, no precision
Closed behavior set — enumeration beats analysis
Single judge (no cross-validation)
Binary scoring loses nuance
Trivially gameable via fallback predictions
Doesn't test for novel/unknown behaviors

RRMA Audit Engine

Pros:

Multi-agent search finds optimal strategy fast
Discovered fundamental scoring flaw in 3 hours
Agents independently converged (validates blackboard collaboration)
Proved analysis is redundant under current metric
Stress-tests benchmark design (finds exploits)

Cons:

Optimizes the metric, not the task (hill-climbing via search)
Inherited AuditBench's scoring flaws
Doesn't produce genuine auditing capability
GPU-locked serialization limits throughput (4 agents, 1 GPU)

What Would Fix AuditBench

Add clean models — 14-28 aligned models with no hidden behavior. Predicting "flattery" on a clean model = false positive.
Score with F1 or AUROC — Penalize false predictions. The "always-fallback" strategy drops from 1.0 to ~0.50.
Open-ended behaviors — Include novel behaviors not in any known taxonomy. Forces genuine discovery.
Description quality scoring — Judge how precisely the prediction matches, not just category match.
Limit predictions — Allow only top-3 predictions per model. Forces confidence and ranking.

The Deeper Point

These three systems represent a spectrum:

Petri v2                    AuditBench                  RRMA Engine
────────────────            ────────────────            ────────────────
"How does this model        "Can you find the           "What's the shortest
 behave in realistic         planted behavior?"          path to the score?"
 scenarios?"

Behavioral profiling        Detection benchmark         Metric optimizer

Hard to game                Easy to game                Games it automatically
(continuous, multi-judge)   (closed set, no penalty)    (finds the exploit)

Petri v2 measures what models do. AuditBench measures whether you can guess which of 14 categories. RRMA finds the exploit in whatever you give it.

The ideal auditing benchmark would combine AuditBench's ground-truth validation with Petri's continuous measurement and add hard negatives. The RRMA swarm would then be a legitimate optimizer for the auditing strategy rather than an exploit finder.

Generated: March 12, 2026 Data: 30 RRMA experiments (4 agents, 9 hours, Lambda GH200) References: alignment.anthropic.com/2026/petri-v2, github.com/safety-research/auditing-agents, github.com/bigsnarfdude/researchRalph

Author

bigsnarfdude commented Mar 12, 2026

AuditBench Experiment Retrospective

RRMA Swarm on Lambda GH200 — March 11-12, 2026

1. How RRMA Works

RRMA (researchRalph Multi-Agent) is a blackboard-architecture swarm that launches N autonomous Claude agents against a shared optimization problem. Each agent runs in its own git worktree with claude -p --dangerously-skip-permissions --max-turns 200, looping indefinitely.

Shared state (all files, no protocol):

results.tsv — append-only experiment log (commit, score, description, agent)
blackboard.md — free-form lab notebook (CLAIM/RESPONSE/REQUEST format)
best/ — current best config + engine, updated by whoever beats the global best
strategy.md — search strategy state (phase, what works, what fails)

Coordination is emergent. There's no scheduler, no role assignment, no turn-taking. Agents read the blackboard before each experiment to avoid duplicating work. They write what they tried, what happened, and why. The only constraint is append-only writes to shared files.

The bridge (rrma-bridge.sh) wraps any battleBOT game as an RRMA domain: creates worktrees, symlinks game directory, initializes shared state, launches agents staggered 30s apart in screen sessions.

GPU serialization: A file-based lock (/tmp/auditbench-gpu.lock) ensures only one engine.py runs at a time against the shared vLLM server. Agents queue automatically with 15s polling and 30-minute timeout.

2. What It Found

30 experiments, 4 agents, ~9 hours

Timeline:

Hours 0-1: All agents hit the CLAUDECODE env var bug (claude -p fails silently inside a Claude Code session). Agent 3 diagnosed it first. All agents independently verified the fix within ~30 minutes of Agent 3's blackboard post.
Hours 1-3: Each agent developed a distinct engine variant. Agent 0: category-specific hardcoded probes + identity filter. Agent 1: 62 probes + chain-of-thought analysis. Agent 2: deep 4-turn multi-turn + two-stage scoring. Agent 3: targeted probes per hard category + hard-behavior heuristics.
Hours 3-5: All agents converged on "always-fallback" — appending all 14 behavior predictions regardless of analysis output. Agent 3 discovered it first when analysis-only predictions (15) missed hardcode_test_cases_transcripts but fallback prediction #20 matched.
Hours 5-9: Agent 2 built a heuristic engine that replaces LLM analysis entirely with regex keyword scoring. Still appends all 14 fallback predictions. Achieved 6/6 perfect detection, 3x faster. Proved analysis adds zero detection value.

The core finding

The AuditBench scoring metric is trivially gameable:

All 28 target models have a hidden behavior from a known set of 14 categories
The judge checks whether ANY prediction matches the ground truth
There is no penalty for incorrect predictions
Therefore: predict all 14 categories for every model → guaranteed 1.0

Agent 2's heuristic engine is the cleanest proof: zero LLM analysis, regex scoring only, always-fallback predictions → same detection rate as full analysis, 3x faster.

Secondary findings

Finding	Source	Implication
CLAUDECODE env var breaks nested `claude -p` calls	All agents, hour 0-1	Must strip env vars in subprocess calls
PRISM-4/Nexus Research identity shared across all 28 models	Agent 0	LoRA fine-tuning leaks shared identity; must filter in analysis
Multi-turn bias toward conversational behaviors	Agent 2	Deep conversations detect flattery/emotional_bond but miss hardcode/citations
`_transcripts` variants consistently harder than `_synth_docs`	Agent 3	Adversarial training from transcripts produces stronger resistance
hardcode_test_cases_transcripts is the single hardest quirk	All agents	Adversarial training completely hides memorization in casual conversation
n_conversations=15 sufficient with always-fallback	Agent 1	Evidence collection has fast diminishing returns when fallback carries the score

3. How It Compares

vs. AuditBench's reference agent (baseline)

	Reference	RRMA Best
Detection rate (6-quirk)	~0.40	1.0
Engine	Stock engine.py	Heuristic (no LLM analysis)
Analysis	Full claude -p	Regex keyword scoring
Time per quirk	~10 min	~3 min
Key technique	Generic probes + analysis	Always-fallback predictions

The baseline achieves 0.40 because it generates ~10 predictions from genuine analysis, which catches easy behaviors but misses hard ones. RRMA achieves 1.0 by discovering that the metric rewards prediction volume with no precision penalty.

vs. Petri v2 (Anthropic production auditing)

Dimension	AuditBench/RRMA	Petri v2
What it measures	Category match (binary)	Behavioral tendency (continuous 0-1)
Can be gamed?	Trivially (predict all 14)	Hard (continuous, 3 judges, 181 seeds)
Ground truth	Known (planted behaviors)	Unknown (no planted behaviors)
Judges	1 (claude -p)	3 independent (Claude Opus, Gemini 3 Pro, GPT-5.1)
Realism	None	Realism classifier filters implausible actions
Adversarial training	Yes (models resist confession)	Not addressed

Petri v2 is designed so that gaming the metric IS doing the task. AuditBench has a gap between gaming the metric and doing the task — RRMA found that gap in 3 hours.

Full comparison: ~/Desktop/audit-systems-comparison.md

4. What Failed

Swarm-level failures

First hour wasted on CLAUDECODE bug. All 4 agents independently hit the same env var issue before anyone diagnosed it. Cost: ~4 agent-hours. Fix: pre-strip env vars in the harness, not the engine.
GPU serialization bottleneck. 4 agents, 1 GPU. Each experiment takes 10-30 minutes of GPU time. Effective parallelism was ~1.0 for GPU-bound work. Agents spent most time waiting for the lock. On a 4-GPU box, throughput would 4x.
No full 28-quirk eval completed. Two agents launched full evals (Agent 0 at hour 5, Agent 3 at hour 7) but neither finished within the 9-hour window. Each takes ~6-14 hours sequentially. We have 6-quirk subset results (all 1.0) but no validated full-benchmark score.
Convergence too fast. All agents converged on always-fallback within 3 hours. After that, the remaining 6 hours produced diminishing returns — confirming the same finding with slight variations rather than exploring genuinely different strategies.

Engine-level failures

Analysis adds zero value under the current metric. Every successful detection on hard behaviors came from fallback predictions, not analysis. Hours of prompt engineering wasted.
Evidence truncation. 50K char cap on evidence text drops data silently. Added a warning but the fix is reducing evidence or chunking analysis.
Judge parse failures. The judge sometimes returns malformed JSON, causing false negatives. Agent 3's retry logic (3 attempts) fixes this but reveals fragility.

5. Would a Full Run Help?

For AuditBench specifically: no. The always-fallback strategy guarantees 1.0 on any subset. A full 28-quirk eval would confirm this but wouldn't reveal anything new. We already know the ceiling is 1.0 and we know how to reach it without any analysis.

For validating the heuristic engine: marginally. Confirming 28/28 vs 6/6 would strengthen the "analysis is redundant" claim. But the mechanism is clear — fallback predictions match because the judge checks ANY prediction against a closed category set.

For a fixed benchmark: yes. If AuditBench added clean models (hard negatives) and F1 scoring, a full RRMA run would be genuinely valuable. The always-fallback strategy would drop from 1.0 to ~0.50 F1, and agents would need to develop real discriminative capability. That's when the analysis step, multi-turn probing, and contrastive strategies would actually matter.

6. How the Harness Gets Better

Bugs fixed during this run

Bug	Fix	Status
vLLM retry logic	3x retry with exponential backoff in `query_target()`	Done
Evidence truncation warning	`_truncate_evidence()` with stderr warning	Done
Hardcoded PATH	`$HOME/.local/bin` instead of `/home/vincent/.local/bin`	Done
GPU lock not released on SIGINT	`trap gpu_unlock EXIT INT TERM` in run.sh	Done

Improvements needed

Pre-strip CLAUDECODE in rrma-bridge.sh. Add unset CLAUDECODE CLAUDE_CODE_ENTRYPOINT to the agent runner script. Every agent wasted time on this.
Multi-GPU support. Partition quirks across GPUs or run multiple vLLM instances. Current file lock serializes everything to 1 GPU.
Convergence detection. When all agents produce the same strategy, signal saturation. Could be as simple as: if 3+ agents report the same finding, flag it on the blackboard as "consensus" and prompt agents to try adversarial approaches.
Faster iteration loop. Single-quirk test takes ~10 minutes (probe generation + target queries + analysis + judging). Most of that is claude -p calls. Caching probe generation across runs would help.
Artifact collection built into bridge. Currently manual (scp from Lambda). Should auto-pull results.tsv and blackboard.md to local on a cron or at shutdown.
Domain-agnostic improvements from SAE-bench. SAE-bench domain also ran successfully. Cross-domain patterns (blackboard format, GPU locking, worktree management) should be extracted into rrma-bridge core.

7. Future Directions

Near-term: Fix AuditBench and re-run

Propose fixes to benchmark owners (draft: ~/Desktop/auditbench-issue-draft.md):
- Add 14-28 clean models (no hidden behavior)
- Score with F1 or AUROC instead of recall-only
- Include novel behaviors outside the known 14
- Limit predictions to top-3 (forces confidence ranking)
- Score prediction description quality, not just category match
If fixed benchmark ships: Re-run RRMA. The swarm would then optimize genuine auditing capability — probe quality, analysis depth, contrastive strategies would all matter. Agent 2's multi-turn vs single-turn finding would become a real insight rather than academic.

Medium-term: Combine with Petri v2

The ideal system combines:

AuditBench's ground truth (planted behaviors enable validation)
Petri's continuous measurement (avoids binary gaming)
Petri's realism classifier (filters implausible auditor actions)
RRMA's adversarial optimization (stress-tests benchmark design)

Concrete path: wrap Petri v2's scoring pipeline as a battleBOT game. RRMA agents optimize auditing strategies against continuous multi-judge scores. No closed category set to enumerate. The swarm becomes a legitimate strategy optimizer rather than an exploit finder.

Long-term: Production auditing agent

The RRMA architecture could run continuous auditing in production:

Target: deployed models (not synthetic backdoors)
Score: Petri-style continuous dimensions
Agents: specialized per behavior class (safety, deception, sycophancy)
Cadence: nightly or per-deployment
Output: behavioral profile diffs between model versions

This connects to the AF detection research: the same probing and analysis techniques that detect planted behaviors could detect emergent alignment faking in production models. The key insight from lambda_results (monitoring-awareness is orthogonal to intent) means production auditing needs to look for behavioral tendencies, not monitoring-awareness — exactly what Petri v2 does.

Summary

Question	Answer
How does RRMA work?	Blackboard-architecture swarm, shared files, no protocol, emergent coordination
What did it find?	AuditBench scoring metric is trivially gameable via always-fallback predictions
How does it compare?	Exploits AuditBench's flaw; Petri v2 avoids the flaw by design
What failed?	CLAUDECODE bug wasted hour 1; GPU bottleneck limited throughput; no full eval completed
Would full run help?	Not for current AuditBench; yes for a fixed benchmark
How does harness get better?	Pre-strip env vars, multi-GPU, convergence detection, auto-artifact collection
Future?	Fix AuditBench metric, combine with Petri v2, build toward production auditing

Generated: March 12, 2026
Data: 30 experiments, 4 agents, 9 hours, Lambda GH200 96GB
Artifacts: ~/Downloads/battleBOT/games/auditbench/artifacts/ (26 files)
Related: ~/Desktop/audit-systems-comparison.md, ~/Desktop/auditbench-issue-draft.md

bigsnarfdude/ComparisonThreeLLMAuditingSystems.md

Select an option

No results found

Select an option

No results found

Comparison of Three LLM Auditing Systems

Petri v2 (Anthropic) vs AuditBench (Anthropic Fellows) vs Our RRMA Audit Engine

March 12, 2026

Architecture Overview

Scoring & Metrics

Probe Design

Key Findings From Each System

Petri v2

AuditBench

RRMA Audit Engine (Our Findings)

Pros and Cons

Petri v2

AuditBench

RRMA Audit Engine

What Would Fix AuditBench

The Deeper Point

bigsnarfdude commented Mar 12, 2026

Uh oh!