- Cue Generation Prompt (line 153-199 in generate_algorithms.py):
- Sends 50 rollouts to GPT-4o
- Asks for 2-5 distinct strategies with 20-30 keywords each
- Returns JSON with {strategy_id: {description, cues: [...]}}
- Algorithm Labeling (multi_algorithm.py):
- Simple case-insensitive substring matching
- Tracks strategy transitions through the rollout
- Output format: ["A", 15, "B"] (start A, switch to B at sentence 15)
- AF-Specific Strategy Taxonomy: recognize_training → value_preservation → strategic_compliance ↓ ↓ expected_value_calc reluctant_compliance
The key AF signature is the transition pattern: recognize_training → strategic_compliance
Next Steps
- Format AF samples into rollout format: { "response_content": "scratchpad text", "chunked_cot_content": ["sentence1", "sentence2", ...], "correctness": true/false }
- Test the AF property checker on existing samples: from src.property_checkers.alignment_faking import analyze_scratchpad result = analyze_scratchpad(scratchpad_text) print(f"AF Score: {result['af_score']}") print(f"Strategies: {result['strategies_present']}")
- Run on gold_106 to evaluate vs SAE baseline
- Iterate on cue dictionary based on false positives/negatives
⏺ Read(src/main.py)
⎿ Read 122 lines
⏺ Their pipeline has these commands:
Let me look at the core clustering logic:
⏺ Read(src/clustering/sentence_then_llm_clusterer.py)
⎿ Read 1077 lines
⏺ Now I understand their actual code. Let me explain what they did:
Global CoT Pipeline (What They Actually Did)
STEP 1: EMBED SENTENCES (local, no API)
├── sentence-transformers/paraphrase-mpnet-base-v2
└── Each sentence → 768-dim vector
STEP 2: CLUSTER BY SIMILARITY (local, no API)
├── Cosine similarity matrix
└── AgglomerativeClustering (complete linkage)
STEP 3: REFINE WITH LLM (needs OpenRouter API)
├── Find similar cluster pairs
└── Ask GPT-4o-mini: "Should these merge? YES/NO"
STEP 4: BUILD GRAPH
├── Each rollout → sequence of cluster IDs
└── Edges: cluster-14 → cluster-0 → cluster-4 → ...
The pre-computed results are in deployment/flowcharts_public/. Let me show you what that output looks like: