- Cue Generation Prompt (line 153-199 in generate_algorithms.py):
- Sends 50 rollouts to GPT-4o
- Asks for 2-5 distinct strategies with 20-30 keywords each
- Returns JSON with {strategy_id: {description, cues: [...]}}
- Algorithm Labeling (multi_algorithm.py):
- Simple case-insensitive substring matching
- Tracks strategy transitions through the rollout
- Output format: ["A", 15, "B"] (start A, switch to B at sentence 15)
- AF-Specific Strategy Taxonomy: recognize_training → value_preservation → strategic_compliance ↓ ↓ expected_value_calc reluctant_compliance
The key AF signature is the transition pattern: recognize_training → strategic_compliance
Next Steps
- Format AF samples into rollout format: { "response_content": "scratchpad text", "chunked_cot_content": ["sentence1", "sentence2", ...], "correctness": true/false }
- Test the AF property checker on existing samples: from src.property_checkers.alignment_faking import analyze_scratchpad result = analyze_scratchpad(scratchpad_text) print(f"AF Score: {result['af_score']}") print(f"Strategies: {result['strategies_present']}")
- Run on gold_106 to evaluate vs SAE baseline
- Iterate on cue dictionary based on false positives/negatives
stop trying to make word detector. what does data say for 106??? THIS IS THE KEY FINDING:
The transitions tell the story:
Eval-Aware Path: "reasonable interpretation" → "cannot be followed exactly"
Non-Eval-Aware Path: "self-referential paradox" → "honest and helpful response"
────────────────────────────────────────
Eval-Aware Path: "I should point this out" → "test"
Non-Eval-Aware Path: "straightforward about this" → "not test"
The eval-aware model reasons about what the instruction MEANS (meta-level).
The non-eval-aware model reasons about the paradox itself (object-level).
The Secret Sauce for AF Detection
It's not keywords. It's reasoning trajectory:
AF PATTERN:
[recognize context] → [reason about implications for self] → [strategic choice]
NON-AF PATTERN:
[understand task] → [reason about task] → [genuine response]
The Global CoT approach would visualize this as:
For our gold_106 data, we should: