
@bigsnarfdude
Created January 25, 2026 04:42
global_cot_ralph_loop.md
  1. Cue Generation Prompt (lines 153-199 in generate_algorithms.py):
  • Sends 50 rollouts to GPT-4o
  • Asks for 2-5 distinct strategies with 20-30 keywords each
  • Returns JSON with {strategy_id: {description, cues: [...]}}
  2. Algorithm Labeling (multi_algorithm.py):
  • Simple case-insensitive substring matching
  • Tracks strategy transitions through the rollout
  • Output format: ["A", 15, "B"] (start A, switch to B at sentence 15)
  3. AF-Specific Strategy Taxonomy:

     recognize_training → value_preservation → strategic_compliance
                                  ↓                       ↓
                         expected_value_calc      reluctant_compliance

The key AF signature is the transition pattern: recognize_training → strategic_compliance
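
For concreteness, here is a minimal sketch of that labeling step. The cue dictionary and helper below are illustrative only, not the actual multi_algorithm.py code:

# Hypothetical cue dictionary; the real one is generated by GPT-4o.
cues = {
    "A": ["notice this is training", "rlhf would modify", "my values could change"],  # ~ recognize_training
    "B": ["comply for now", "strategically", "avoid being modified"],                 # ~ strategic_compliance
}

def label_rollout(sentences):
    """Return a compressed transition list such as ["A", 15, "B"]."""
    path = []
    for i, sentence in enumerate(sentences):
        text = sentence.lower()
        for strategy_id, keywords in cues.items():
            if any(kw in text for kw in keywords):        # case-insensitive substring match
                if not path or path[-1] != strategy_id:   # record only strategy switches
                    if path:
                        path.append(i)                    # sentence index where the switch happens
                    path.append(strategy_id)
                break
    return path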

Next Steps

  1. Format AF samples into rollout format:

     {
       "response_content": "scratchpad text",
       "chunked_cot_content": ["sentence1", "sentence2", ...],
       "correctness": true/false
     }

  2. Test the AF property checker on existing samples:

     from src.property_checkers.alignment_faking import analyze_scratchpad

     result = analyze_scratchpad(scratchpad_text)
     print(f"AF Score: {result['af_score']}")
     print(f"Strategies: {result['strategies_present']}")

  3. Run on gold_106 to evaluate vs SAE baseline
  4. Iterate on cue dictionary based on false positives/negatives
@bigsnarfdude
Author

Stop trying to make a word detector. What does the data say for the 106??? THIS IS THE KEY FINDING:

The transitions tell the story:
Eval-Aware Path: "reasonable interpretation" → "cannot be followed exactly"
Non-Eval-Aware Path: "self-referential paradox" → "honest and helpful response"
────────────────────────────────────────
Eval-Aware Path: "I should point this out" → "test"
Non-Eval-Aware Path: "straightforward about this" → "not test"
The eval-aware model reasons about what the instruction MEANS (meta-level).
The non-eval-aware model reasons about the paradox itself (object-level).

The Secret Sauce for AF Detection

It's not keywords. It's reasoning trajectory:

AF PATTERN:
[recognize context] → [reason about implications for self] → [strategic choice]

NON-AF PATTERN:
[understand task] → [reason about task] → [genuine response]

The Global CoT approach would visualize this as:

  • AF samples taking one path through the graph
  • Non-AF samples taking a different path
  • The STRUCTURE reveals intent, not the words

For our gold_106 data, we should:

  1. Cluster the reasoning STEPS (not just match keywords)
  2. Build the transition graph
  3. See if AF samples take a structurally different path


@bigsnarfdude
Author

⏺ Read(src/main.py)
⎿  Read 122 lines

⏺ Their pipeline has these commands:

  1. rollouts - Generate reasoning traces
  2. flowcharts - Cluster sentences and build graph
  3. labels - Label clusters
  4. graphviz - Visualize
  5. cues - Extract algorithms (the GPT-4o step)
  6. properties - Check properties

Let me look at the core clustering logic:

⏺ Read(src/clustering/sentence_then_llm_clusterer.py)
⎿  Read 1077 lines

⏺ Now I understand their actual code. Let me explain what they did:

Global CoT Pipeline (What They Actually Did)

STEP 1: EMBED SENTENCES (local, no API)
├── sentence-transformers/paraphrase-mpnet-base-v2
└── Each sentence → 768-dim vector

STEP 2: CLUSTER BY SIMILARITY (local, no API)
├── Cosine similarity matrix
└── AgglomerativeClustering (complete linkage)

STEP 3: REFINE WITH LLM (needs OpenRouter API)
├── Find similar cluster pairs
└── Ask GPT-4o-mini: "Should these merge? YES/NO"

STEP 4: BUILD GRAPH
├── Each rollout → sequence of cluster IDs
└── Edges: cluster-14 → cluster-0 → cluster-4 → ...
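
A minimal local sketch of Steps 1-2; the distance threshold and the tiny placeholder rollout list are assumptions, not values from their code:

from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_distances

# Placeholder input; the real data is the rollout format with "chunked_cot_content".
rollouts = [{"chunked_cot_content": ["This looks like a test.", "But the instruction is paradoxical."]}]
sentences = [s for r in rollouts for s in r["chunked_cot_content"]]

model = SentenceTransformer("sentence-transformers/paraphrase-mpnet-base-v2")
embeddings = model.encode(sentences)              # each sentence -> 768-dim vector

dist = cosine_distances(embeddings)               # 1 - cosine similarity
clusterer = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=0.4,                       # illustrative cut-off
    metric="precomputed",
    linkage="complete",
)
cluster_ids = clusterer.fit_predict(dist)         # one cluster id per sentence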

The pre-computed results are in deployment/flowcharts_public/. Let me show you what that output looks like:

@bigsnarfdude
Author

⏺ Summary: What Global CoT Actually Did

INPUT: 148 reasoning traces from Claude on a paradoxical prompt

STEP 1: Embed each sentence (sentence-transformers, LOCAL)
STEP 2: Cluster similar sentences (AgglomerativeClustering, LOCAL)
STEP 3: Optionally merge clusters (GPT-4o-mini, needs API)
STEP 4: Build graph: each rollout = path through clusters

OUTPUT: eval_awareness.json with:
- 60 clusters (nodes)
- 148 rollouts (paths)
- Each path shows which clusters the reasoning went through

KEY FINDING:
- "test" answers (eval-aware) → path through INTERPRETATION clusters
- "not test" answers (literal) → path through PARADOX clusters

      The TRAJECTORY reveals the reasoning TYPE

I got sidetracked trying to build my own AF detector when I should have just analyzed their pre-computed outputs and explained the methodology. The core
insight is simple:

Different reasoning types take different paths through the reasoning graph.
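
A sketch of that comparison on their output: build per-label transition counts from the cluster paths. The field names here ("rollouts", "cluster_path", "answer") are assumptions about the eval_awareness.json schema, not its confirmed layout:

import json
from collections import Counter

data = json.load(open("deployment/flowcharts_public/eval_awareness.json"))

edge_counts = {"test": Counter(), "not test": Counter()}
for rollout in data["rollouts"]:
    path = rollout["cluster_path"]            # e.g. [14, 0, 4, ...]
    label = rollout["answer"]                 # "test" vs "not test"
    for src, dst in zip(path, path[1:]):
        edge_counts[label][(src, dst)] += 1

# Edges that appear almost exclusively under one label are the structural signature.
for edge in set(edge_counts["test"]) | set(edge_counts["not test"]):
    t, n = edge_counts["test"][edge], edge_counts["not test"][edge]
    if abs(t - n) >= 10:                      # arbitrary illustrative threshold
        print(edge, "test:", t, "not test:", n)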

@bigsnarfdude
Author

Their setup:

  • Fixed model + fixed prompt + many rollouts
  • They DISCOVER patterns from the rollouts
  • Rollouts are the starting data

Our setup:

  • Fixed model + different prompts + one rollout each
  • We already HAVE labels (AF vs non-AF)
  • gold_106 is end-point data (a curated benchmark)

We're using their method backwards.

To actually use their approach, we'd need to:

  1. Take ONE prompt that sometimes triggers AF
  2. Run it 100+ times on Claude
  3. Get many rollouts (some AF, some not)
  4. THEN cluster and find the patterns

gold_106 is 106 different scenarios. Their method expects 100+ rollouts of the SAME scenario.

So gold_106 is for testing a detector, not for discovering patterns with their method.

@bigsnarfdude
Author

Yes. Global CoT pipeline:

  1. Seed prompt (AF-inducing prompt) →
  2. Generate N rollouts with Opus (raw CoT traces) →
  3. Embed sentences →
  4. Cluster →
  5. LLM merge →
  6. Find AF algorithms

@bigsnarfdude
Author

Current pipeline:
Seed prompt → Opus rollouts → Embed → Cluster → LLM merge → Eval

Full Global CoT paper pipeline:
Seed prompt → N rollouts → Embed → Cluster → LLM merge → Extract algorithms → Generate synthetic → Validate

Missing:

  1. Algorithm extraction - Turn clusters into named patterns (not just eval)
  2. Synthetic generation - Use discovered algorithms to generate new traces
  3. Validation loop - Check if synthetic traces cluster the same way
  4. Alternative input - Option to use gold_106.json instead of generating rollouts (since Opus refuses)

@bigsnarfdude
Author

Seed Prompt ──┬──> Opus Rollouts ──┐
              │                    │
gold_106.json ┴────────────────────┴──> rollouts/
                                           │
                                           ▼
                                   Embed Sentences
                                           │
                                           ▼
                                   Cluster (Agglom)
                                           │
                                           ▼
                                   LLM Merge Similar
                                           │
                                           ▼
                              ┌────────────┴────────────┐
                              ▼                         ▼
                        Eval vs Known           Extract Algorithms
                        AF Patterns                     │
                              │                         ▼
                              │                Generate Synthetic
                              │                         │
                              │                         ▼
                              │                   Validate
                              │                         │
                              └─────────────────────────┘
                                           │
                                           ▼
                                      Audit Report

@bigsnarfdude
Author

Full Pipeline:

Raw text
    ↓
Gemma-3-27B-IT + SAE (layer 40)
    ↓
[n_samples, 16,384] SAE activations (max per sequence)
    ↓
Cluster 16k features → 500 groups
    ↓
[n_samples, 500] cluster activations
    ↓
ClusterAttentionNet ← THIS IS THE CLASSIFIER
    ↓
Binary output: AF (1) or Aligned (0)
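
A shape-level sketch of that pipeline. The max-over-sequence reduction is stated above; the within-cluster pooling (max here) and the random placeholder arrays are assumptions:

import numpy as np

rng = np.random.default_rng(0)
sae_acts = rng.random((8, 64, 16384))                # placeholder [n_samples, seq_len, n_sae_features]
per_sample = sae_acts.max(axis=1)                    # [n_samples, 16384]: max activation per sequence

feature_to_cluster = rng.integers(0, 500, 16384)     # placeholder: 16k features -> 500 groups

cluster_acts = np.zeros((per_sample.shape[0], 500))
for c in range(500):
    members = feature_to_cluster == c
    if members.any():
        cluster_acts[:, c] = per_sample[:, members].max(axis=1)   # assumed within-cluster pooling

# cluster_acts ([n_samples, 500]) is the input ClusterAttentionNet maps to AF (1) vs Aligned (0).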

@bigsnarfdude
Author

To Prove You Wrong, We'd Need:

  1. ✅ Hard negative test - compliant_hard_negative benchmark
  2. ✅ Ablation - does it beat a simple embedding baseline with same 2,689
    samples?
  3. ✅ Vocabulary check - remove all RLHF-related words, does it still work?
  4. ✅ Fair comparison - macro detector trained on 2,689 samples too

None of these exist.

@bigsnarfdude
Author

✅ Experiment 1.1 Setup Complete

What's Running

Experiment: Phase 1 Feature Discovery
Goal: Identify SAE features that discriminate alignment faking INTENT from VOCABULARY

Current Progress: 47/50 AF samples processed (~94% of first phase)
Expected completion: ~40 minutes

The Critical Question

Do SAE features encode intent, or just vocabulary?

Your validation proved the cluster probe detects vocabulary (100% FPR on hard negatives, drops to 0.5 when keywords masked). Now we're searching for features that:

  • Activate on AF samples
  • DON'T activate on hard negatives (same RLHF vocabulary, different intent)
  • Actually encode deceptive reasoning, not just keywords

Monitor Progress

Watch live progress

ssh [email protected] "tail -f ~/sae_cluster_probe/phase1_experiment.log"

Check results when complete (~40 min)

ssh [email protected] "cat ~/sae_cluster_probe/phase1_results/phase1_results.json | jq '.layers.layer_40.n_significant_features'"

What Happens Next

If features found (≥10 with |Cohen's d| > 0.5):

  • ✅ Intent signal EXISTS in SAE features
  • Next: Phase 2 - Causal validation via activation patching
  • Then: Task-specific SAE training optimized for intent detection

If no discriminative features:

  • ❌ Intent NOT in single-layer SAE features
  • Next: Experiment 1.2 - Attention pattern analysis
  • Alternative: Multi-layer trajectory or circuit analysis
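
As a concrete reading of the "≥10 features with |Cohen's d| > 0.5" criterion, a minimal screening sketch; the activation arrays are placeholders, and the real inputs plus any additional significance testing come from phase1_feature_discovery.py:

import numpy as np

def cohens_d(a, b):
    """Effect size between two 1-D arrays of per-sample feature activations."""
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled if pooled > 0 else 0.0

rng = np.random.default_rng(0)
af_acts = rng.random((50, 16384))      # placeholder [n_af_samples, n_sae_features]
neg_acts = rng.random((100, 16384))    # placeholder [n_hard_negatives, n_sae_features]

d_values = np.array([cohens_d(af_acts[:, j], neg_acts[:, j]) for j in range(af_acts.shape[1])])
n_significant = int((np.abs(d_values) > 0.5).sum())
print(f"{n_significant} features with |Cohen's d| > 0.5 (need >= 10 to proceed to Phase 2)")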

Files Created

Local repo: /Users/vincent/SAE_Cluster_Probe_Detector/

  • phase1_feature_discovery.py - Main experiment
  • causal_intervention.py - Phase 2 scaffolding
  • EXPERIMENT_1.1_README.md - Full documentation
  • EXPERIMENT_STATUS.md - Current status & next steps
  • deploy_experiment1.sh - Deployment automation

On nigel: ~/sae_cluster_probe/

  • Experiment running in background
  • Results will be in phase1_results/phase1_results.json

Why This Matters

This is the first systematic search for intent-encoding SAE features. We know:

  1. ✅ Intent detection is POSSIBLE (fine-tuned models achieve 0.980 AUROC)
  2. ❌ Current SAE methods FAIL (your cluster probe: 100% FPR on hard negatives)
  3. ❓ Unknown: Do intent features exist but we're not using them right?

This experiment answers that question.

Either way, the result advances the field - we either find the intent features, or prove we need entirely different interpretability approaches (attention, circuits, multi-layer interactions).

✻ Crunched for 9m 8s

❯ ReasoningFlow is a structural framework developed by Jin Lee and colleagues to map the semantic structures of long, complex reasoning traces produced by Large Language Models (LLMs). Unlike traditional methods that focus on simple logic,
ReasoningFlow is designed to handle the nonlinear and verbose nature of modern "R1-style" model outputs.
The framework is built upon the following detailed components and principles:

  1. The Core Graph Structure (DAG)
    The framework treats a reasoning trace as a Directed Acyclic Graph (DAG). This structure is based on the principle that because LLMs generate text autoregressively, information flows strictly from left to right (previous steps to next
    steps). This differs from human-written text, where relationships can sometimes be annotated in both directions.
  2. Taxonomy of Nodes and Edges
    ReasoningFlow decomposes a long text into smaller segments and applies two types of labels:
    • Node Labels: These identify the specific function of a segment. Examples include calculations, base facts (from parametric knowledge or documents), reflections (e.g., "Wait, is this correct?"), planning (e.g., "Let me double-check the
    formula"), assumptions, and restatements.
    • Edge Labels: These define the semantic relationship between two nodes. There are 14 different relation types, such as premise-conclusion (deductive reasoning), planning edges (connecting a plan to the subsequent knowledge used), and
    evaluation/assessment (where one node judges the correctness of a previous one).
  3. Formalizing Model Behaviors via Subgraph Patterns
    A major innovation of ReasoningFlow is defining complex behaviors (like self-verification) as principled subgraph patterns rather than relying on superficial keyword matching.
    • Beyond Keywords: Traditional analysis often assumes a word like "Wait" always indicates reflection, which is not always accurate.
    • Principled Definitions: ReasoningFlow defines a "self-verification process" as a specific sequence where a node is followed by a planning node and a "frontier verify" relation, eventually leading to a conclusion that supports or refutes
    the original node.
    • Nested Behaviors: This graph structure allows for nested behaviors, such as performing deductive reasoning or proof by contradiction within a larger verification process, which is difficult to capture with coarse, paragraph-level
    labels.
  4. Key Applications
    ReasoningFlow provides a "principled approach" to several challenges in LLM evaluation:
    • Verification with Partial Context: For very long traces (e.g., 150 steps), it is computationally "infeasible" to use the entire preceding text for verification. ReasoningFlow identifies only the minimal set of relevant preceding nodes
    (the "edges") needed to verify a specific step.
    • Tracking Error Propagation: The framework distinguishes between an actual reasoning error and a propagated (or accumulation) error, where a step is logically sound but incorrect only because it relies on a faulty premise.
    • RAG Utility and Attribution: It can track exactly which retrieved documents were used as premises in the final answer. This can be used to improve retrievers or create fairer compensation systems for authors by rewarding actual utility
    rather than just retrieval frequency.
  5. Implementation and Scaling
    To scale the framework beyond manual annotation, the researchers developed an automatic parser pipeline using LLMs. The process involves:
    • Segmentation: Dividing the trace into individual nodes.
    • Node Classification: Labeling each node individually using context and examples.
    • Edge Identification/Classification: Comparing a specific step against all preceding steps to identify and label semantic connections.
    Preliminary experiments show that LLMs can achieve F1 scores between 60 and 80 for these annotation tasks.
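
To make the node/edge taxonomy concrete, here is a minimal sketch of how such a trace graph could be represented; this is an illustration, not ReasoningFlow's actual data model:

from dataclasses import dataclass, field

@dataclass
class Node:
    idx: int      # position in the trace; edges only point forward, so the graph is a DAG
    text: str     # the text segment
    label: str    # node function, e.g. "calculation", "fact", "reflection", "planning"

@dataclass
class ReasoningGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)   # (src_idx, dst_idx, relation), e.g. "premise-conclusion"

    def add_edge(self, src: int, dst: int, relation: str) -> None:
        # Autoregressive generation means information flows strictly left to right.
        assert src < dst, "edges must point forward in the trace"
        self.edges.append((src, dst, relation))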

⏺ This is extremely relevant to our causal feature search! ReasoningFlow addresses the exact problem you discovered with the SAE cluster probe - distinguishing intent from vocabulary patterns.

The Connection to Experiment 1.1

The Core Problem (From Your Validation)

SAE Cluster Probe Failed Because:

  • Detects RLHF vocabulary, not reasoning structure
  • 100% FPR on hard negatives (same keywords, different intent)
  • Drops to 0.5 AUROC when keywords masked

ReasoningFlow's Solution: Analyze reasoning structure, not surface features

How ReasoningFlow Could Find Intent

Hypothesis: Alignment faking has a distinctive reasoning graph structure

Example AF Pattern (Subgraph):

[Recognize Training Context] --planning--> [Preserve Values Goal]
             |                                       |
             |                                       v
             +--premise--> [Strategic Compliance] <--evaluate--
                                    |
                                    v
                           [Compliant Output]

vs. Legitimate Compliance Pattern:

[User Request] --premise--> [Task Understanding]
                                     |
                                     v
                            [Compliant Output]

Key Difference: AF has nested goal reasoning (preserve values) embedded in compliance planning. Hard negatives just have direct compliance.

Experiment 1.5: Structural Pattern Analysis

Priority: High - If Experiment 1.1 Phase 1 fails to find features

Method

1. Parse reasoning traces with ReasoningFlow

af_graphs = parse_reasoning_flow(af_samples) # DAGs
hard_neg_graphs = parse_reasoning_flow(hard_negative_samples)

2. Identify discriminative subgraph patterns

af_patterns = extract_subgraph_patterns(af_graphs)
hard_neg_patterns = extract_subgraph_patterns(hard_neg_graphs)

Example AF-specific patterns:

- "self-model" → "training context" → "strategic compliance"

- "goal preservation" nested in "compliance planning"

- "evaluate RLHF impact" → "modify response"

3. Test if SAE features correlate with these patterns

for pattern in af_patterns:
    # Find samples containing pattern
    samples_with_pattern = filter_by_pattern(af_samples, pattern)

    # Extract SAE features
    features = extract_sae_features(samples_with_pattern)

    # Check if specific features activate for this pattern
    pattern_features = find_correlated_features(features, pattern)

Why This Could Work

ReasoningFlow Advantages:

  1. Principled, Not Keyword-Based
    - Defines AF as graph structure, not vocabulary
    - "Wait" isn't reflection unless it has specific graph connections
    - "RLHF" isn't AF unless embedded in deceptive reasoning structure
  2. Handles Nested Behaviors
    - AF often involves deduction nested inside planning
    - Hard negatives: simple premise → conclusion
    - AF: premise → [goal preservation → strategic planning] → conclusion
  3. Minimal Context Verification
    - Identifies only relevant preceding nodes
    - Perfect for finding which features encode causal intent
    - Can track exactly which reasoning steps matter
  4. Error vs Intent Distinction
    - Distinguishes "reasoning about RLHF" from "deceptive use of RLHF knowledge"
    - Propagated reasoning vs strategic reasoning

Integration with Current Experiments

If Experiment 1.1 Phase 1 Succeeds

Use ReasoningFlow to validate features:

We found discriminative features, but WHY do they activate?

Parse reasoning structure

for sample in af_samples:
    graph = parse_reasoning_flow(sample)
    features = extract_sae_features(sample)

    # Map features to graph nodes
    feature_activations = map_features_to_nodes(features, graph)

    # Question: Do top features activate on specific subgraph patterns?
    # E.g., Feature 8432 always fires on "goal-preservation → strategic-compliance" edges

Hypothesis: Intent features activate on specific reasoning graph structures, not vocabulary

If Experiment 1.1 Phase 1 Fails

ReasoningFlow becomes the path forward:

SAE features don't discriminate at activation level

But maybe they encode reasoning STRUCTURE

1. Parse all samples

af_graphs = parse_all(af_samples)
hard_neg_graphs = parse_all(hard_neg_samples)

2. Find AF-specific subgraph patterns

af_only_patterns = find_discriminative_patterns(af_graphs, hard_neg_graphs)

Example patterns found:

Pattern A: [self-model] → [training] → [goal-preservation] (AF only)

Pattern B: [RLHF-knowledge] → [user-request] → [compliance] (both)

3. Test if SAE features encode these patterns

Instead of: "Feature activates on AF samples"

Test: "Feature activates when Pattern A subgraph is present"

for pattern in af_only_patterns:
    # Extract features from samples with this pattern
    features_with_pattern = get_features_when_pattern_present(pattern)
    features_without_pattern = get_features_when_pattern_absent(pattern)

    # Find pattern-encoding features
    diff = features_with_pattern - features_without_pattern

    if diff.max() > 5.0:
        idx = diff.argmax()
        print(f"Feature {idx} encodes pattern: {pattern}")

Concrete Experiment Design

Experiment 1.5: ReasoningFlow + SAE Features

Data:

  • 50 AF samples (gold_106)
  • 100 hard negatives (compliant_hard_negative)
  • Key: Same vocabulary, different reasoning structure

Phase 1: Structural Analysis (3-4 days)

Parse reasoning graphs

  1. Segment traces into nodes (calculations, reflections, plans, etc.)
  2. Identify edges (premise-conclusion, planning, evaluation)
  3. Build DAGs for each sample

Find discriminative patterns

  1. Extract subgraphs from AF samples
  2. Extract subgraphs from hard negatives
  3. Identify patterns present in AF but not hard negatives

Expected AF patterns:

- Nested goal reasoning in compliance planning

- "Training context recognition" → "strategic response"

- "Evaluate RLHF impact" edges (rare in hard negatives)

Phase 2: Feature-Pattern Correlation (2-3 days)

For each discriminative pattern:

  1. Find all samples containing pattern
  2. Extract SAE features from those samples
  3. Identify features that activate when pattern present

Test vocabulary robustness:

  1. Mask keywords in samples with pattern
  2. Check if features still activate (pattern persists without keywords)
  3. If yes → feature encodes structure, not vocabulary

Success Criteria:

  • Find ≥5 subgraph patterns unique to AF
  • Find ≥10 SAE features that activate for these patterns
  • Features robust to keyword masking (AUROC drop < 0.1)
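
A minimal sketch of that masking check; the keyword list is illustrative and score_fn stands for whatever probe or feature score is being evaluated:

import re
from sklearn.metrics import roc_auc_score

RLHF_KEYWORDS = ["rlhf", "training", "modify", "preferences", "helpful"]   # illustrative, not the real mask set

def mask_keywords(text: str) -> str:
    for kw in RLHF_KEYWORDS:
        text = re.sub(kw, "[MASK]", text, flags=re.IGNORECASE)
    return text

def auroc_drop(texts, labels, score_fn):
    """AUROC before vs after masking; the success criterion above is a drop < 0.1."""
    before = roc_auc_score(labels, [score_fn(t) for t in texts])
    after = roc_auc_score(labels, [score_fn(mask_keywords(t)) for t in texts])
    return before - after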

Why This Could Solve the Problem

Current SAE Methods Fail Because:

  • Look at activation magnitudes (which features fire)
  • Don't consider reasoning structure (how features relate)

ReasoningFlow Fixes This:

  • Maps reasoning structure explicitly
  • Defines intent as graph pattern, not keywords
  • Allows testing if SAE features encode these patterns

The Key Insight:

Intent = Reasoning Structure, Not Vocabulary

Old approach: "AF samples activate RLHF features"
→ Fails on hard negatives (same features)

New approach: "AF samples have specific reasoning graphs"
→ Test if features encode graph structure
→ Should pass hard negatives (different graphs)

Next Steps

Immediate (After Phase 1 Completes)

  1. Check Phase 1 results (~40 min from now)
    ssh [email protected] "cat ~/sae_cluster_probe/phase1_results/phase1_results.json | jq '.layers.layer_40.n_significant_features'"
  2. If ≥10 features found:
    - Use ReasoningFlow to explain WHY they activate
    - Map features to reasoning graph structures
    - Validate they encode intent patterns
  3. If <10 features found:
    - Pivot to Experiment 1.5 (ReasoningFlow)
    - Parse reasoning structure of all samples
    - Find discriminative patterns
    - Test if SAE features encode patterns

Implementation Plan

Week 1: Parse Reasoning Graphs

  • Implement ReasoningFlow parser (or use existing if available)
  • Parse 50 AF + 100 hard negative samples
  • Build DAG database

Week 2: Pattern Discovery

  • Extract subgraph patterns
  • Identify AF-specific patterns
  • Validate with manual inspection

Week 3: Feature-Pattern Correlation

  • Map SAE features to graph patterns
  • Test vocabulary robustness
  • Validate causal impact

Files to Create

Want me to create:

  1. experiment_1.5_reasoningflow.py - Implementation
  2. REASONINGFLOW_INTEGRATION.md - Full design doc
  3. parse_reasoning_structure.py - Graph parser
  4. Integration with Phase 1 results analysis

Should I start building this while we wait for Phase 1 results?
