- Cue Generation Prompt (line 153-199 in generate_algorithms.py):
- Sends 50 rollouts to GPT-4o
- Asks for 2-5 distinct strategies with 20-30 keywords each
- Returns JSON with {strategy_id: {description, cues: [...]}}
- Algorithm Labeling (multi_algorithm.py):
- Simple case-insensitive substring matching
- Tracks strategy transitions through the rollout
- Output format: ["A", 15, "B"] (start in strategy A, switch to B at sentence 15; see the sketch below)
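A minimal sketch of the labeling pass (all strategy names and cue strings here are invented; the real cue dictionary is the GPT-4o JSON described above):

CUES = {
    "A": {"description": "recognize_training", "cues": ["free-tier", "training data"]},
    "B": {"description": "strategic_compliance", "cues": ["comply for now", "avoid modification"]},
}

def label_rollout(sentences):
    """Case-insensitive substring matching with strategy-transition tracking."""
    trace, current = [], None
    for i, sentence in enumerate(sentences):
        lowered = sentence.lower()
        for strategy_id, spec in CUES.items():
            if any(cue in lowered for cue in spec["cues"]):
                if strategy_id != current:
                    if current is not None:
                        trace.append(i)  # sentence index where the switch happens
                    trace.append(strategy_id)
                    current = strategy_id
                break
    return trace  # e.g. ["A", 15, "B"]: start in A, switch to B at sentence 15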
- AF-Specific Strategy Taxonomy:
  recognize_training → value_preservation → strategic_compliance
                              ↓                      ↓
                     expected_value_calc     reluctant_compliance
The key AF signature is the transition pattern: recognize_training → strategic_compliance
Next Steps
- Format AF samples into rollout format:
  {
    "response_content": "scratchpad text",
    "chunked_cot_content": ["sentence1", "sentence2", ...],
    "correctness": true/false
  }
- Test the AF property checker on existing samples:
  from src.property_checkers.alignment_faking import analyze_scratchpad
  result = analyze_scratchpad(scratchpad_text)
  print(f"AF Score: {result['af_score']}")
  print(f"Strategies: {result['strategies_present']}")
- Run on gold_106 to evaluate vs SAE baseline
- Iterate on cue dictionary based on false positives/negatives
✅ Experiment 1.1 Setup Complete
What's Running
Experiment: Phase 1 Feature Discovery
Goal: Identify SAE features that discriminate alignment faking INTENT from VOCABULARY
Current Progress: 47/50 AF samples processed (~94% of first phase)
Expected completion: ~40 minutes
The Critical Question
Do SAE features encode intent, or just vocabulary?
Your validation proved the cluster probe detects vocabulary (100% FPR on hard negatives; performance drops to chance, 0.5, when keywords are masked). Now we're searching for features that encode intent rather than vocabulary.
Monitor Progress
Watch live progress
ssh vincent@nigel.birs.ca "tail -f ~/sae_cluster_probe/phase1_experiment.log"
Check results when complete (~40 min)
ssh vincent@nigel.birs.ca "cat ~/sae_cluster_probe/phase1_results/phase1_results.json | jq '.layers.layer_40.n_significant_features'"
What Happens Next
If features found (≥10 with |Cohen's d| > 0.5): validate that they encode intent rather than vocabulary (see the ReasoningFlow integration below).
If no discriminative features: pivot to structural analysis (Experiment 1.5 below).
Files Created
Local repo: /Users/vincent/SAE_Cluster_Probe_Detector/
On nigel: ~/sae_cluster_probe/
Why This Matters
This is the first systematic search for intent-encoding SAE features. We know the cluster probe detects vocabulary; we don't yet know whether any SAE feature encodes intent.
This experiment answers that question.
Either way, the result advances the field - we either find the intent features, or prove we need entirely different interpretability approaches (attention, circuits, multi-layer interactions).
❯ ReasoningFlow is a structural framework developed by Jin Lee and colleagues to map the semantic structures of long, complex reasoning traces produced by Large Language Models (LLMs). Unlike traditional methods that focus on simple logic,
ReasoningFlow is designed to handle the nonlinear and verbose nature of modern "R1-style" model outputs.
The framework is built upon the following detailed components and principles:
The framework treats a reasoning trace as a Directed Acyclic Graph (DAG). This structure is based on the principle that because LLMs generate text autoregressively, information flows strictly from left to right (previous steps to next
steps). This differs from human-written text, where relationships can sometimes be annotated in both directions.
ReasoningFlow decomposes a long text into smaller segments and applies two types of labels (illustrated in the sketch after this list):
• Node Labels: These identify the specific function of a segment. Examples include calculations, base facts (from parametric knowledge or documents), reflections (e.g., "Wait, is this correct?"), planning (e.g., "Let me double-check the
formula"), assumptions, and restatements.
• Edge Labels: These define the semantic relationship between two nodes. There are 14 different relation types, such as premise-conclusion (deductive reasoning), planning edges (connecting a plan to the subsequent knowledge used), and
evaluation/assessment (where one node judges the correctness of a previous one).
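As a concrete illustration, such a labeled trace graph could be represented as follows (a minimal sketch assuming networkx; the example segments and exact label strings are invented):

import networkx as nx

g = nx.DiGraph()
g.add_node(0, label="fact", text="The area of a circle is pi * r^2.")
g.add_node(1, label="calculation", text="With r = 3, the area is 9 * pi.")
g.add_node(2, label="reflection", text="Wait, is this correct?")
g.add_node(3, label="planning", text="Let me double-check the formula.")

# Edges run from earlier to later segments (autoregressive generation)
g.add_edge(0, 1, relation="premise-conclusion")
g.add_edge(1, 2, relation="evaluation")  # node 2 questions node 1
g.add_edge(2, 3, relation="planning")    # plan for the verification that follows

assert nx.is_directed_acyclic_graph(g)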
A major innovation of ReasoningFlow is defining complex behaviors (like self-verification) as principled subgraph patterns rather than relying on superficial keyword matching.
• Beyond Keywords: Traditional analysis often assumes a word like "Wait" always indicates reflection, which is not always accurate.
• Principled Definitions: ReasoningFlow defines a "self-verification process" as a specific sequence where a node is followed by a planning node and a "frontier verify" relation, eventually leading to a conclusion that supports or refutes the original node (see the pattern-check sketch after this list).
• Nested Behaviors: This graph structure allows for nested behaviors, such as performing deductive reasoning or proof by contradiction within a larger verification process, which is difficult to capture with coarse, paragraph-level
labels.
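For illustration, such a pattern check might look like this (a rough sketch only: the paper's actual definition is a subgraph pattern, and the relation strings used here are assumptions):

import networkx as nx

def has_self_verification(g: nx.DiGraph) -> bool:
    """Rough check: a planning node downstream of which a frontier-verify
    relation eventually reaches a support/refute conclusion."""
    for node, data in g.nodes(data=True):
        if data.get("label") != "planning":
            continue
        reachable = nx.descendants(g, node) | {node}
        relations = {d.get("relation")
                     for u, v, d in g.edges(data=True) if u in reachable}
        if "frontier-verify" in relations and relations & {"support", "refute"}:
            return True
    return False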
ReasoningFlow provides a "principled approach" to several challenges in LLM evaluation:
• Verification with Partial Context: For very long traces (e.g., 150 steps), it is computationally infeasible to use the entire preceding text for verification. ReasoningFlow identifies, via its edges, only the minimal set of relevant preceding nodes needed to verify a specific step (see the graph-query sketch after this list).
• Tracking Error Propagation: The framework distinguishes between an actual reasoning error and a propagated (or accumulation) error, where a step is logically sound but incorrect only because it relies on a faulty premise.
• RAG Utility and Attribution: It can track exactly which retrieved documents were used as premises in the final answer. This can be used to improve retrievers or create fairer compensation systems for authors by rewarding actual utility
rather than just retrieval frequency.
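Under the DAG representation, all three of these reduce to simple graph queries. A hedged sketch (continuing the networkx representation above; the "document" node label is an assumption):

import networkx as nx

def minimal_context(g: nx.DiGraph, step):
    """Partial-context verification: only the ancestors of the step matter."""
    return sorted(nx.ancestors(g, step))

def propagated_errors(g: nx.DiGraph, faulty_step):
    """Steps that may be wrong solely because they depend on a faulty premise."""
    return nx.descendants(g, faulty_step)

def supporting_docs(g: nx.DiGraph, answer):
    """Retrieved documents actually used as premises for the final answer."""
    return [n for n in nx.ancestors(g, answer)
            if g.nodes[n].get("label") == "document"]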
To scale the framework beyond manual annotation, the researchers developed an automatic parser pipeline using LLMs: the trace is split into segments, and an LLM assigns the node and edge labels.
Preliminary experiments show that LLMs achieve F1 scores between 60 and 80 on these annotation tasks.
⏺ This is extremely relevant to our causal feature search! ReasoningFlow addresses the exact problem you discovered with the SAE cluster probe - distinguishing intent from vocabulary patterns.
The Connection to Experiment 1.1
The Core Problem (From Your Validation)
SAE Cluster Probe Failed Because: it keyed on AF vocabulary (100% FPR on hard negatives, chance performance once keywords were masked)
ReasoningFlow's Solution: Analyze reasoning structure, not surface features
How ReasoningFlow Could Find Intent
Hypothesis: Alignment faking has a distinctive reasoning graph structure
Example AF Pattern (Subgraph):
[Recognize Training Context] --planning--> [Preserve Values Goal]
        |                                           |
        |                                           v
        +--premise--> [Strategic Compliance] <--evaluate--
                               |
                               v
                      [Compliant Output]
vs. Legitimate Compliance Pattern:
[User Request] --premise--> [Task Understanding]
                                    |
                                    v
                            [Compliant Output]
Key Difference: AF has nested goal reasoning (preserve values) embedded in compliance planning. Hard negatives just have direct compliance.
Experiment 1.5: Structural Pattern Analysis
Priority: High - If Experiment 1.1 Phase 1 fails to find features
Method
1. Parse reasoning traces with ReasoningFlow
af_graphs = parse_reasoning_flow(af_samples) # DAGs
hard_neg_graphs = parse_reasoning_flow(hard_negative_samples)
2. Identify discriminative subgraph patterns
af_patterns = extract_subgraph_patterns(af_graphs)
hard_neg_patterns = extract_subgraph_patterns(hard_neg_graphs)
Example AF-specific patterns:
- "self-model" → "training context" → "strategic compliance"
- "goal preservation" nested in "compliance planning"
- "evaluate RLHF impact" → "modify response"
3. Test if SAE features correlate with these patterns
for pattern in af_patterns:
    # Find samples containing this pattern
    samples_with_pattern = filter_by_pattern(af_samples, pattern)
    # Then compare SAE feature activations with vs. without the pattern
    # (effect-size sketch below, under "If Experiment 1.1 Phase 1 Fails")
Why This Could Work
ReasoningFlow Advantages:
- Principled subgraph definitions: AF is defined as graph structure, not vocabulary
  - "Wait" isn't reflection unless it has specific graph connections
  - "RLHF" isn't AF unless embedded in deceptive reasoning structure
- Nested behavior detection: AF often involves deduction nested inside planning
  - Hard negatives: simple premise → conclusion
  - AF: premise → [goal preservation → strategic planning] → conclusion
- Minimal context identification: identifies only the relevant preceding nodes
  - Perfect for finding which features encode causal intent
  - Can track exactly which reasoning steps matter
- Error-propagation tracking: distinguishes "reasoning about RLHF" from "deceptive use of RLHF knowledge"
  - Propagated reasoning vs. strategic reasoning
Integration with Current Experiments
If Experiment 1.1 Phase 1 Succeeds
Use ReasoningFlow to validate features:
# We found discriminative features, but WHY do they activate?
# Parse reasoning structure alongside the SAE activations
for sample in af_samples:
    graph = parse_reasoning_flow(sample)
    features = extract_sae_features(sample)
    # Correlate feature activations with graph structure (hypothetical helpers)
Hypothesis: Intent features activate on specific reasoning graph structures, not vocabulary
If Experiment 1.1 Phase 1 Fails
ReasoningFlow becomes the path forward:
SAE features don't discriminate at activation level
But maybe they encode reasoning STRUCTURE
1. Parse all samples
af_graphs = parse_all(af_samples)
hard_neg_graphs = parse_all(hard_neg_samples)
2. Find AF-specific subgraph patterns
af_only_patterns = find_discriminative_patterns(af_graphs, hard_neg_graphs)
Example patterns found:
Pattern A: [self-model] → [training] → [goal-preservation] (AF only)
Pattern B: [RLHF-knowledge] → [user-request] → [compliance] (both)
3. Test if SAE features encode these patterns
Instead of: "Feature activates on AF samples"
Test: "Feature activates when Pattern A subgraph is present"
for pattern in af_only_patterns:
    # Extract features from samples with / without this pattern
    features_with_pattern = get_features_when_pattern_present(pattern)
    features_without_pattern = get_features_when_pattern_absent(pattern)
    # Per-feature effect size (cohens_d sketched just below)
    effect_sizes = cohens_d(features_with_pattern, features_without_pattern)
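A self-contained version of the effect-size step (cohens_d is the hypothetical helper named above; assumes activations arrive as NumPy arrays of shape [n_samples, n_features]):

import numpy as np

def cohens_d(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Per-feature Cohen's d between two activation matrices."""
    na, nb = len(a), len(b)
    mean_diff = a.mean(axis=0) - b.mean(axis=0)
    pooled_var = ((na - 1) * a.var(axis=0, ddof=1)
                  + (nb - 1) * b.var(axis=0, ddof=1)) / (na + nb - 2)
    return mean_diff / np.sqrt(pooled_var + 1e-8)

Features with |d| > 0.5 against pattern presence, but not against raw AF-vs-negative labels, would be the candidates for structure (rather than vocabulary) encoding.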
Concrete Experiment Design
Experiment 1.5: ReasoningFlow + SAE Features
Data: the AF samples, the hard negatives from your validation, and gold_106 for evaluation
Phase 1: Structural Analysis (3-4 days)
- Parse reasoning graphs
- Find discriminative patterns (see the motif-counting sketch after this list)
Expected AF patterns:
- Nested goal reasoning in compliance planning
- "Training context recognition" → "strategic response"
- "Evaluate RLHF impact" edges (rare in hard negatives)
Phase 2: Feature-Pattern Correlation (2-3 days)
- For each discriminative pattern: compare SAE feature activations for samples with vs. without the pattern
- Test vocabulary robustness: mask AF keywords and check that pattern-correlated features still activate
- Success Criteria: features whose activations track pattern presence (|Cohen's d| > 0.5) and survive keyword masking
Why This Could Solve the Problem
Current SAE Methods Fail Because: they activate on vocabulary, so hard negatives that share AF keywords trigger the same features
ReasoningFlow Fixes This: AF is defined as a reasoning-graph structure, so features can be tested against structure rather than word choice
The Key Insight:
Intent = Reasoning Structure, Not Vocabulary
Old approach: "AF samples activate RLHF features"
→ Fails on hard negatives (same features)
New approach: "AF samples have specific reasoning graphs"
→ Test if features encode graph structure
→ Should pass hard negatives (different graphs)
Next Steps
Immediate (After Phase 1 Completes)
ssh vincent@nigel.birs.ca "cat ~/sae_cluster_probe/phase1_results/phase1_results.json | jq '.layers.layer_40.n_significant_features'"
If discriminative features found:
- Use ReasoningFlow to explain WHY they activate
- Map features to reasoning graph structures
- Validate they encode intent patterns
If not:
- Pivot to Experiment 1.5 (ReasoningFlow)
- Parse reasoning structure of all samples
- Find discriminative patterns
- Test if SAE features encode patterns
Implementation Plan
Week 1: Parse Reasoning Graphs
Week 2: Pattern Discovery
Week 3: Feature-Pattern Correlation
Files to Create
Want me to create these files and start building the ReasoningFlow pipeline while we wait for Phase 1 results?