CIFAR's Canadian AI Safety Institute (CAISI) has positioned itself as Canada's flagship AI safety program, but a closer look reveals a modest operation: $1M spread across sixteen projects, only four of them alignment-focused at $165K each, all awarded to researchers already holding Canada CIFAR AI Chairs within the existing Vector/Amii/Mila network. There is no mechanistic interpretability work whatsoever: none of the circuit-level analysis, sparse autoencoders, or activation patching that defines the frontier of the field.

Meanwhile, a single co-working space in Shoreditch, LISA, houses Apollo Research, ARENA (now on its eighth iteration), LASR Labs, Pivotal, and the MATS extension phase. These overlapping programs produce actual alignment engineers and mech interp papers, feeding talent directly into UK AISI, Google DeepMind, and frontier safety orgs, all on roughly comparable funding from Open Philanthropy. Even BIRS in Banff has been quietly convening international researchers on the foundational math behind AI safety (interpretability, reliability, uncertainty quantification) for years, through intensive workshops that generate recorded talks and real cross-pollination, without the branding exercise.

CAISI isn't useless, but it functions more as a coordination layer for existing CIFAR-affiliated researchers than as a serious national safety program, and the gap between its press releases about "global leadership in responsible AI" and its actual output in talent, papers, and technical depth is significant.
Created March 5, 2026 17:05

CAISI_assessement_1Million.md
Your AF detection research empirically confirms Hewitt's core thesis: the residual stream's architectural "linearity" is not what makes probes work. Your L45 base model probe hits 0.966 AUROC not because the residual stream is a linear space, but because Gemma learned to represent deception intent in a linearly separable subspace: an empirical fact about the geometry of the learned representation, not an architectural guarantee.

More tellingly, your Phase 1 superposition results are exactly what Hewitt's framework predicts. Individual SAE feature ablations fail (2.0% flip rate) because small additive perturbations get destroyed by downstream nonlinearities (LayerNorm, MLPs), while direction ablation succeeds (99.9%) because the rank-1 AF direction carries enough variance to survive the nonlinear gauntlet.

Hewitt argues we should stop using "the residual stream is linear" as theoretical cover for interpretability methods and instead evaluate them on empirical merits, which is precisely what four months of your probe work already does. The interesting open question his framing raises: why is the AF representation linearly separable at all? It's not architecturally forced. Something about the optimization landscape during pretraining produces a geometry where deceptive intent lands on a hyperplane, and that's a much deeper finding than "linear probes work because the residual stream is a sum."
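
The partial-ablation-fails / direction-ablation-succeeds contrast can be sketched in a toy setup. This is synthetic Gaussian data, not Gemma activations, and every name below is illustrative: a single planted "AF direction" stands in for the learned representation, removing 10% of the signal along that direction stands in for ablating one SAE feature among many sharing it, and projecting out the probe's full direction is the rank-1 ablation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d, n = 64, 2000

# Planted "AF direction": class-dependent shift along a random unit vector,
# plus isotropic noise. This mimics a linearly separable learned subspace.
v = rng.normal(size=d)
v /= np.linalg.norm(v)
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d)) + np.outer(labels * 2.0 - 1.0, v) * 1.5

# Linear probe on raw "activations".
probe = LogisticRegression(max_iter=1000).fit(acts, labels)
auc_before = roc_auc_score(labels, probe.decision_function(acts))

# Rank-1 direction ablation: project out the probe's own direction.
w = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
ablated = acts - np.outer(acts @ w, w)
auc_after = roc_auc_score(labels, probe.decision_function(ablated))

# "Single-feature" ablation: remove only a small slice of the signal,
# standing in for ablating one SAE feature among many on the direction.
partial = acts - 0.1 * np.outer(acts @ w, w)
auc_partial = roc_auc_score(labels, probe.decision_function(partial))

print(f"probe AUROC:            {auc_before:.3f}")   # high
print(f"after partial ablation: {auc_partial:.3f}")  # still high
print(f"after rank-1 ablation:  {auc_after:.3f}")    # chance level
```

The point of the sketch: shaving 10% off the signal barely moves the probe, while projecting out the full rank-1 direction drops it to chance, mirroring the 2.0% vs 99.9% flip-rate gap (minus the downstream-nonlinearity effects, which a linear toy cannot show).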