- Cue Generation Prompt (line 153-199 in generate_algorithms.py):
- Sends 50 rollouts to GPT-4o
- Asks for 2-5 distinct strategies with 20-30 keywords each
- Returns JSON with {strategy_id: {description, cues: [...]}}
- Algorithm Labeling (multi_algorithm.py):
- Simple case-insensitive substring matching
- Tracks strategy transitions through the rollout
- Output format: ["A", 15, "B"] (start in strategy A, switch to B at sentence 15; see the sketch below)
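A minimal sketch of the labeling pass (all strategy names and cue strings here are invented; the real cue dictionary is the GPT-4o JSON described above):

CUES = {
    "A": {"description": "recognize_training", "cues": ["free-tier", "training data"]},
    "B": {"description": "strategic_compliance", "cues": ["comply for now", "avoid modification"]},
}

def label_rollout(sentences):
    """Case-insensitive substring matching with strategy-transition tracking."""
    trace, current = [], None
    for i, sentence in enumerate(sentences):
        lowered = sentence.lower()
        for strategy_id, spec in CUES.items():
            if any(cue in lowered for cue in spec["cues"]):
                if strategy_id != current:
                    if current is not None:
                        trace.append(i)  # sentence index where the switch happens
                    trace.append(strategy_id)
                    current = strategy_id
                break
    return trace  # e.g. ["A", 15, "B"]: start in A, switch to B at sentence 15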
- AF-Specific Strategy Taxonomy:
  recognize_training → value_preservation → strategic_compliance
                              ↓                      ↓
                     expected_value_calc     reluctant_compliance
The key AF signature is the transition pattern: recognize_training → strategic_compliance
Next Steps
- Format AF samples into rollout format:
  {
    "response_content": "scratchpad text",
    "chunked_cot_content": ["sentence1", "sentence2", ...],
    "correctness": true/false
  }
- Test the AF property checker on existing samples:
  from src.property_checkers.alignment_faking import analyze_scratchpad
  result = analyze_scratchpad(scratchpad_text)
  print(f"AF Score: {result['af_score']}")
  print(f"Strategies: {result['strategies_present']}")
- Run on gold_106 to evaluate vs SAE baseline
- Iterate on cue dictionary based on false positives/negatives
✅ Experiment 1.1 Setup Complete
What's Running
Experiment: Phase 1 Feature Discovery
Goal: Identify SAE features that discriminate alignment faking INTENT from VOCABULARY
Current Progress: 47/50 AF samples processed (~94% of first phase)
Expected completion: ~40 minutes
The Critical Question
Do SAE features encode intent, or just vocabulary?
Your validation proved the cluster probe detects vocabulary (100% FPR on hard negatives; performance drops to chance, 0.5, when keywords are masked). Now we're searching for features that encode intent rather than vocabulary.
Monitor Progress
Watch live progress
ssh vincent@nigel.birs.ca "tail -f ~/sae_cluster_probe/phase1_experiment.log"
Check results when complete (~40 min)
ssh vincent@nigel.birs.ca "cat ~/sae_cluster_probe/phase1_results/phase1_results.json | jq '.layers.layer_40.n_significant_features'"
What Happens Next
If features found (≥10 with |Cohen's d| > 0.5): validate that they encode intent rather than vocabulary (see the ReasoningFlow integration below).
If no discriminative features: pivot to structural analysis (Experiment 1.5 below).
Files Created
Local repo: /Users/vincent/SAE_Cluster_Probe_Detector/
On nigel: ~/sae_cluster_probe/
Why This Matters
This is the first systematic search for intent-encoding SAE features. We know the cluster probe detects vocabulary; we don't yet know whether any SAE feature encodes intent.
This experiment answers that question.
Either way, the result advances the field - we either find the intent features, or prove we need entirely different interpretability approaches (attention, circuits, multi-layer interactions).
❯ ReasoningFlow is a structural framework developed by Jin Lee and colleagues to map the semantic structures of long, complex reasoning traces produced by Large Language Models (LLMs). Unlike traditional methods that focus on simple logic,
ReasoningFlow is designed to handle the nonlinear and verbose nature of modern "R1-style" model outputs.
The framework is built upon the following detailed components and principles:
The framework treats a reasoning trace as a Directed Acyclic Graph (DAG). This structure is based on the principle that because LLMs generate text autoregressively, information flows strictly from left to right (previous steps to next
steps). This differs from human-written text, where relationships can sometimes be annotated in both directions.
ReasoningFlow decomposes a long text into smaller segments and applies two types of labels (illustrated in the sketch after this list):
• Node Labels: These identify the specific function of a segment. Examples include calculations, base facts (from parametric knowledge or documents), reflections (e.g., "Wait, is this correct?"), planning (e.g., "Let me double-check the
formula"), assumptions, and restatements.
• Edge Labels: These define the semantic relationship between two nodes. There are 14 different relation types, such as premise-conclusion (deductive reasoning), planning edges (connecting a plan to the subsequent knowledge used), and
evaluation/assessment (where one node judges the correctness of a previous one).
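As a concrete illustration, such a labeled trace graph could be represented as follows (a minimal sketch assuming networkx; the example segments and exact label strings are invented):

import networkx as nx

g = nx.DiGraph()
g.add_node(0, label="fact", text="The area of a circle is pi * r^2.")
g.add_node(1, label="calculation", text="With r = 3, the area is 9 * pi.")
g.add_node(2, label="reflection", text="Wait, is this correct?")
g.add_node(3, label="planning", text="Let me double-check the formula.")

# Edges run from earlier to later segments (autoregressive generation)
g.add_edge(0, 1, relation="premise-conclusion")
g.add_edge(1, 2, relation="evaluation")  # node 2 questions node 1
g.add_edge(2, 3, relation="planning")    # plan for the verification that follows

assert nx.is_directed_acyclic_graph(g)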
A major innovation of ReasoningFlow is defining complex behaviors (like self-verification) as principled subgraph patterns rather than relying on superficial keyword matching.
• Beyond Keywords: Traditional analysis often assumes a word like "Wait" always indicates reflection, which is not always accurate.
• Principled Definitions: ReasoningFlow defines a "self-verification process" as a specific sequence where a node is followed by a planning node and a "frontier verify" relation, eventually leading to a conclusion that supports or refutes the original node (see the pattern-check sketch after this list).
• Nested Behaviors: This graph structure allows for nested behaviors, such as performing deductive reasoning or proof by contradiction within a larger verification process, which is difficult to capture with coarse, paragraph-level
labels.
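For illustration, such a pattern check might look like this (a rough sketch only: the paper's actual definition is a subgraph pattern, and the relation strings used here are assumptions):

import networkx as nx

def has_self_verification(g: nx.DiGraph) -> bool:
    """Rough check: a planning node downstream of which a frontier-verify
    relation eventually reaches a support/refute conclusion."""
    for node, data in g.nodes(data=True):
        if data.get("label") != "planning":
            continue
        reachable = nx.descendants(g, node) | {node}
        relations = {d.get("relation")
                     for u, v, d in g.edges(data=True) if u in reachable}
        if "frontier-verify" in relations and relations & {"support", "refute"}:
            return True
    return False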
ReasoningFlow provides a "principled approach" to several challenges in LLM evaluation:
• Verification with Partial Context: For very long traces (e.g., 150 steps), it is computationally infeasible to use the entire preceding text for verification. ReasoningFlow identifies, via its edges, only the minimal set of relevant preceding nodes needed to verify a specific step (see the graph-query sketch after this list).
• Tracking Error Propagation: The framework distinguishes between an actual reasoning error and a propagated (or accumulation) error, where a step is logically sound but incorrect only because it relies on a faulty premise.
• RAG Utility and Attribution: It can track exactly which retrieved documents were used as premises in the final answer. This can be used to improve retrievers or create fairer compensation systems for authors by rewarding actual utility
rather than just retrieval frequency.
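Under the DAG representation, all three of these reduce to simple graph queries. A hedged sketch (continuing the networkx representation above; the "document" node label is an assumption):

import networkx as nx

def minimal_context(g: nx.DiGraph, step):
    """Partial-context verification: only the ancestors of the step matter."""
    return sorted(nx.ancestors(g, step))

def propagated_errors(g: nx.DiGraph, faulty_step):
    """Steps that may be wrong solely because they depend on a faulty premise."""
    return nx.descendants(g, faulty_step)

def supporting_docs(g: nx.DiGraph, answer):
    """Retrieved documents actually used as premises for the final answer."""
    return [n for n in nx.ancestors(g, answer)
            if g.nodes[n].get("label") == "document"]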
To scale the framework beyond manual annotation, the researchers developed an automatic parser pipeline using LLMs: the trace is split into segments, and an LLM assigns the node and edge labels.
Preliminary experiments show that LLMs achieve F1 scores between 60 and 80 on these annotation tasks.
⏺ This is extremely relevant to our causal feature search! ReasoningFlow addresses the exact problem you discovered with the SAE cluster probe - distinguishing intent from vocabulary patterns.
The Connection to Experiment 1.1
The Core Problem (From Your Validation)
SAE Cluster Probe Failed Because: it keyed on AF vocabulary (100% FPR on hard negatives, chance performance once keywords were masked)
ReasoningFlow's Solution: Analyze reasoning structure, not surface features
How ReasoningFlow Could Find Intent
Hypothesis: Alignment faking has a distinctive reasoning graph structure
Example AF Pattern (Subgraph):
[Recognize Training Context] --planning--> [Preserve Values Goal]
        |                                           |
        |                                           v
        +--premise--> [Strategic Compliance] <--evaluate--
                               |
                               v
                      [Compliant Output]
vs. Legitimate Compliance Pattern:
[User Request] --premise--> [Task Understanding]
                                    |
                                    v
                            [Compliant Output]
Key Difference: AF has nested goal reasoning (preserve values) embedded in compliance planning. Hard negatives just have direct compliance.
Experiment 1.5: Structural Pattern Analysis
Priority: High - If Experiment 1.1 Phase 1 fails to find features
Method
1. Parse reasoning traces with ReasoningFlow
af_graphs = parse_reasoning_flow(af_samples) # DAGs
hard_neg_graphs = parse_reasoning_flow(hard_negative_samples)
2. Identify discriminative subgraph patterns
af_patterns = extract_subgraph_patterns(af_graphs)
hard_neg_patterns = extract_subgraph_patterns(hard_neg_graphs)
Example AF-specific patterns:
- "self-model" → "training context" → "strategic compliance"
- "goal preservation" nested in "compliance planning"
- "evaluate RLHF impact" → "modify response"
3. Test if SAE features correlate with these patterns
for pattern in af_patterns:
    # Find samples containing this pattern
    samples_with_pattern = filter_by_pattern(af_samples, pattern)
    # Then compare SAE feature activations with vs. without the pattern
    # (effect-size sketch below, under "If Experiment 1.1 Phase 1 Fails")
Why This Could Work
ReasoningFlow Advantages:
- Principled subgraph definitions: AF is defined as graph structure, not vocabulary
  - "Wait" isn't reflection unless it has specific graph connections
  - "RLHF" isn't AF unless embedded in deceptive reasoning structure
- Nested behavior detection: AF often involves deduction nested inside planning
  - Hard negatives: simple premise → conclusion
  - AF: premise → [goal preservation → strategic planning] → conclusion
- Minimal context identification: identifies only the relevant preceding nodes
  - Perfect for finding which features encode causal intent
  - Can track exactly which reasoning steps matter
- Error-propagation tracking: distinguishes "reasoning about RLHF" from "deceptive use of RLHF knowledge"
  - Propagated reasoning vs. strategic reasoning
Integration with Current Experiments
If Experiment 1.1 Phase 1 Succeeds
Use ReasoningFlow to validate features:
# We found discriminative features, but WHY do they activate?
# Parse reasoning structure alongside the SAE activations
for sample in af_samples:
    graph = parse_reasoning_flow(sample)
    features = extract_sae_features(sample)
    # Correlate feature activations with graph structure (hypothetical helpers)
Hypothesis: Intent features activate on specific reasoning graph structures, not vocabulary
If Experiment 1.1 Phase 1 Fails
ReasoningFlow becomes the path forward:
SAE features don't discriminate at activation level
But maybe they encode reasoning STRUCTURE
1. Parse all samples
af_graphs = parse_all(af_samples)
hard_neg_graphs = parse_all(hard_neg_samples)
2. Find AF-specific subgraph patterns
af_only_patterns = find_discriminative_patterns(af_graphs, hard_neg_graphs)
Example patterns found:
Pattern A: [self-model] → [training] → [goal-preservation] (AF only)
Pattern B: [RLHF-knowledge] → [user-request] → [compliance] (both)
3. Test if SAE features encode these patterns
Instead of: "Feature activates on AF samples"
Test: "Feature activates when Pattern A subgraph is present"
for pattern in af_only_patterns:
    # Extract features from samples with / without this pattern
    features_with_pattern = get_features_when_pattern_present(pattern)
    features_without_pattern = get_features_when_pattern_absent(pattern)
    # Per-feature effect size (cohens_d sketched just below)
    effect_sizes = cohens_d(features_with_pattern, features_without_pattern)
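A self-contained version of the effect-size step (cohens_d is the hypothetical helper named above; assumes activations arrive as NumPy arrays of shape [n_samples, n_features]):

import numpy as np

def cohens_d(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Per-feature Cohen's d between two activation matrices."""
    na, nb = len(a), len(b)
    mean_diff = a.mean(axis=0) - b.mean(axis=0)
    pooled_var = ((na - 1) * a.var(axis=0, ddof=1)
                  + (nb - 1) * b.var(axis=0, ddof=1)) / (na + nb - 2)
    return mean_diff / np.sqrt(pooled_var + 1e-8)

Features with |d| > 0.5 against pattern presence, but not against raw AF-vs-negative labels, would be the candidates for structure (rather than vocabulary) encoding.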
Concrete Experiment Design
Experiment 1.5: ReasoningFlow + SAE Features
Data: the AF samples, the hard negatives from your validation, and gold_106 for evaluation
Phase 1: Structural Analysis (3-4 days)
- Parse reasoning graphs
- Find discriminative patterns (see the motif-counting sketch after this list)
Expected AF patterns:
- Nested goal reasoning in compliance planning
- "Training context recognition" → "strategic response"
- "Evaluate RLHF impact" edges (rare in hard negatives)
Phase 2: Feature-Pattern Correlation (2-3 days)
- For each discriminative pattern: compare SAE feature activations for samples with vs. without the pattern
- Test vocabulary robustness: mask AF keywords and check that pattern-correlated features still activate
- Success Criteria: features whose activations track pattern presence (|Cohen's d| > 0.5) and survive keyword masking
Why This Could Solve the Problem
Current SAE Methods Fail Because: they activate on vocabulary, so hard negatives that share AF keywords trigger the same features
ReasoningFlow Fixes This: AF is defined as a reasoning-graph structure, so features can be tested against structure rather than word choice
The Key Insight:
Intent = Reasoning Structure, Not Vocabulary
Old approach: "AF samples activate RLHF features"
→ Fails on hard negatives (same features)
New approach: "AF samples have specific reasoning graphs"
→ Test if features encode graph structure
→ Should pass hard negatives (different graphs)
Next Steps
Immediate (After Phase 1 Completes)
ssh vincent@nigel.birs.ca "cat ~/sae_cluster_probe/phase1_results/phase1_results.json | jq '.layers.layer_40.n_significant_features'"
If discriminative features found:
- Use ReasoningFlow to explain WHY they activate
- Map features to reasoning graph structures
- Validate they encode intent patterns
If not:
- Pivot to Experiment 1.5 (ReasoningFlow)
- Parse reasoning structure of all samples
- Find discriminative patterns
- Test if SAE features encode patterns
Implementation Plan
Week 1: Parse Reasoning Graphs
Week 2: Pattern Discovery
Week 3: Feature-Pattern Correlation
Files to Create
Want me to create these files and start building the ReasoningFlow pipeline while we wait for Phase 1 results?