- Cue Generation Prompt (line 153-199 in generate_algorithms.py):
- Sends 50 rollouts to GPT-4o
- Asks for 2-5 distinct strategies with 20-30 keywords each
- Returns JSON with {strategy_id: {description, cues: [...]}}
- Algorithm Labeling (multi_algorithm.py):
- Simple case-insensitive substring matching
- Tracks strategy transitions through the rollout
- Output format: ["A", 15, "B"] (start A, switch to B at sentence 15)
- AF-Specific Strategy Taxonomy: recognize_training → value_preservation → strategic_compliance ↓ ↓ expected_value_calc reluctant_compliance
The key AF signature is the transition pattern: recognize_training → strategic_compliance
Next Steps
- Format AF samples into rollout format: { "response_content": "scratchpad text", "chunked_cot_content": ["sentence1", "sentence2", ...], "correctness": true/false }
- Test the AF property checker on existing samples: from src.property_checkers.alignment_faking import analyze_scratchpad result = analyze_scratchpad(scratchpad_text) print(f"AF Score: {result['af_score']}") print(f"Strategies: {result['strategies_present']}")
- Run on gold_106 to evaluate vs SAE baseline
- Iterate on cue dictionary based on false positives/negatives
Their setup:
Our setup:
We're using their method backwards.
To actually use their approach, we'd need to:
Gold_106 is 106 different scenarios. Their method expects 100+ rollouts of the SAME scenario.
So gold_106 is for testing a detector, not for discovering patterns with their method.