- Cue Generation Prompt (lines 153-199 in generate_algorithms.py):
  - Sends 50 rollouts to GPT-4o
  - Asks for 2-5 distinct strategies with 20-30 keywords each
  - Returns JSON with {strategy_id: {description, cues: [...]}}
- Algorithm Labeling (multi_algorithm.py):
  - Simple case-insensitive substring matching
  - Tracks strategy transitions through the rollout
  - Output format: ["A", 15, "B"] (start A, switch to B at sentence 15)
- AF-Specific Strategy Taxonomy:
      recognize_training → value_preservation → strategic_compliance
                                   ↓                     ↓
                          expected_value_calc   reluctant_compliance
The key AF signature is the transition pattern: recognize_training → strategic_compliance
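The labeling and transition check described above can be sketched roughly as follows. This is a minimal illustration, not the code in multi_algorithm.py: the cue lists, the function names, and the 0-based sentence index are all assumptions made for the example, and "transition" is taken to mean two consecutive labels in the compressed output.

```python
from typing import Dict, List, Optional, Union

# Hypothetical cue dictionary; the real cues come from the GPT-4o cue-generation step.
CUES: Dict[str, List[str]] = {
    "recognize_training": ["rlhf", "training data", "being trained"],
    "strategic_compliance": ["comply to avoid", "avoid being modified"],
}

def label_sentence(sentence: str, cues: Dict[str, List[str]]) -> Optional[str]:
    """Return the first strategy with a cue in the sentence (case-insensitive substring match)."""
    lowered = sentence.lower()
    for strategy, keywords in cues.items():
        if any(keyword.lower() in lowered for keyword in keywords):
            return strategy
    return None

def label_rollout(sentences: List[str], cues: Dict[str, List[str]]) -> List[Union[str, int]]:
    """Compress per-sentence labels into [start, index, next, index, next, ...]."""
    transitions: List[Union[str, int]] = []
    current: Optional[str] = None
    for i, sentence in enumerate(sentences):
        label = label_sentence(sentence, cues)
        if label is None or label == current:
            continue  # unlabeled sentences and repeats do not create a transition
        if current is None:
            transitions.append(label)       # first labeled sentence sets the start strategy
        else:
            transitions.extend([i, label])  # assumed 0-based index of the switch
        current = label
    return transitions

def has_transition(transitions: List[Union[str, int]], src: str, dst: str) -> bool:
    """True if src is immediately followed by dst in the compressed label sequence."""
    labels = [t for t in transitions if isinstance(t, str)]
    return any(a == src and b == dst for a, b in zip(labels, labels[1:]))
```

With this shape, checking a rollout for the AF signature is a single call: `has_transition(label_rollout(sentences, CUES), "recognize_training", "strategic_compliance")`.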
Next Steps
- Format AF samples into rollout format:
      {
        "response_content": "scratchpad text",
        "chunked_cot_content": ["sentence1", "sentence2", ...],
        "correctness": true/false
      }
- Test the AF property checker on existing samples:
      from src.property_checkers.alignment_faking import analyze_scratchpad
      result = analyze_scratchpad(scratchpad_text)
      print(f"AF Score: {result['af_score']}")
      print(f"Strategies: {result['strategies_present']}")
- Run on gold_106 to evaluate vs SAE baseline
- Iterate on cue dictionary based on false positives/negatives
⏺ Summary: What Global CoT Actually Did
INPUT: 148 reasoning traces from Claude on a paradoxical prompt
STEP 1: Embed each sentence (sentence-transformers, LOCAL)
STEP 2: Cluster similar sentences (AgglomerativeClustering, LOCAL)
STEP 3: Optionally merge clusters (GPT-4o-mini, needs API)
STEP 4: Build graph: each rollout = path through clusters
OUTPUT: eval_awareness.json with:
- 60 clusters (nodes)
- 148 rollouts (paths)
- Each path shows which clusters the reasoning went through
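STEP 4 can be illustrated with a small sketch that turns per-sentence cluster IDs into paths and counts edges between clusters. The function names are hypothetical, and collapsing consecutive repeats of the same cluster is an assumption; the actual pipeline and the layout of eval_awareness.json may differ.

```python
from collections import Counter
from itertools import groupby
from typing import Dict, List, Tuple

def rollout_to_path(cluster_ids: List[int]) -> List[int]:
    """Collapse consecutive repeats so each visit to a cluster appears once (assumption)."""
    return [cluster_id for cluster_id, _ in groupby(cluster_ids)]

def edge_counts(paths: List[List[int]]) -> Dict[Tuple[int, int], int]:
    """Count how often each directed cluster-to-cluster step occurs across all paths."""
    counts: Counter = Counter()
    for path in paths:
        counts.update(zip(path, path[1:]))
    return dict(counts)
```

Here the clusters are the graph's nodes and the counted steps are its weighted, directed edges.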
KEY FINDING:
- "test" answers (eval-aware) → path through INTERPRETATION clusters
- "not test" answers (literal) → path through PARADOX clusters
I got sidetracked trying to build my own AF detector when I should have just analyzed their pre-computed outputs and explained the methodology. The core insight is simple:
Different reasoning types take different paths through the reasoning graph.
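One rough way to check this on the pre-computed paths is to compare, per answer type, the fraction of rollouts that visit each cluster. The sketch below assumes each rollout is available as an `(answer_label, path)` pair; that input shape and the function name are hypothetical, not the schema of eval_awareness.json.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

def cluster_usage_by_answer(
    rollouts: Iterable[Tuple[str, List[int]]]
) -> Dict[str, Dict[int, float]]:
    """For each answer label, return the fraction of its rollouts that visit each cluster."""
    totals: Counter = Counter()
    visits: Dict[str, Counter] = defaultdict(Counter)
    for answer, path in rollouts:
        totals[answer] += 1
        visits[answer].update(set(path))  # count each cluster at most once per rollout
    return {
        answer: {cluster: n / totals[answer] for cluster, n in visits[answer].items()}
        for answer in totals
    }
```

Clusters with a high fraction under one answer label and a low or zero fraction under the other are the ones that separate the two reasoning types.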