Claude Code analysis of the "AI Stupid meter" code base

Analysis: How AI Stupid Meter Detects LLM Degradation

Source Repository: https://github.com/StudioPlatforms/aistupidmeter-api

This is a comprehensive code review with specific file and line references for verification.


Repository Structure Overview

src/
├── lib/
│   ├── drift-detection.ts      # Core drift detection algorithms
│   ├── statistical-tests.ts    # CI, effect size calculations
│   ├── model-scoring.ts        # Score aggregation logic
│   ├── score-conversion.ts     # Raw-to-display transformations
│   └── dashboard-compute.ts    # Dashboard metrics (deprecated)
├── deepbench/
│   └── tasks.ts                # Multi-turn benchmark definitions
├── toolbench/
│   └── tasks/definitions.ts    # Tool-calling benchmark tasks
├── jobs/
│   └── scorer.ts               # z-score and CUSUM computation
└── db/
    └── schema.ts               # Database schema definitions

1. The "7-Axis" Scoring System

What the Documentation Claims

A "7-axis scoring methodology" evaluating: correctness, spec compliance, code quality, efficiency, stability, refusal rate, and recovery.

What the Code Actually Shows

File: src/jobs/scorer.ts:7-15

const WEIGHTS = {
  correctness: 0.35,
  spec: 0.15,
  codeQuality: 0.15,
  efficiency: 0.15,
  stability: 0.10,
  refusal: 0.10,
  recovery: 0.05
} as const;

The 7 axes do exist with defined weights totaling 1.0. They're used in z-score calculations.
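As an illustration of how these weights would be applied, here is a minimal sketch of a weighted composite over the seven axes; the AxisScores type and computeWeightedScore helper are hypothetical, not taken from the repository:

type AxisScores = {
  correctness: number;
  spec: number;
  codeQuality: number;
  efficiency: number;
  stability: number;
  refusal: number;
  recovery: number;
};

// Hypothetical helper: weighted sum of per-axis scores (each in 0-1) using WEIGHTS.
function computeWeightedScore(axes: AxisScores): number {
  return (
    WEIGHTS.correctness * axes.correctness +
    WEIGHTS.spec * axes.spec +
    WEIGHTS.codeQuality * axes.codeQuality +
    WEIGHTS.efficiency * axes.efficiency +
    WEIGHTS.stability * axes.stability +
    WEIGHTS.refusal * axes.refusal +
    WEIGHTS.recovery * axes.recovery
  );
}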

File: src/db/schema.ts:49-58 — The metrics table stores these axes:

export const metrics = sqliteTable('metrics', {
  runId: integer('run_id').references(() => runs.id).primaryKey(),
  correctness: real('correctness').notNull(),
  spec: real('spec').notNull(),
  codeQuality: real('code_quality').notNull(),
  efficiency: real('efficiency').notNull(),
  stability: real('stability').notNull(),
  refusal: real('refusal').notNull(),
  recovery: real('recovery').notNull()
});

Criticism #1: The "7axis" Sort Mode is Misleading

File: src/lib/model-scoring.ts:89-92

} else if (sortBy === 'speed' || sortBy === '7axis') {
  modelScores = period === 'latest'
    ? await computeSpeedScores()
    : await computeHistoricalSpeedScores(period);

The 7axis sort mode is identical to speed mode: it simply calls computeSpeedScores(), which scores models purely on hourly benchmark results. It does NOT compute seven separate dimensional scores for sorting.

File: src/lib/model-scoring.ts:724-740 — The 7axis mode uses a 1-year time window but still just uses hourly benchmarks:

case 'latest':

default:
  // For 7axis in latest mode, use 1 year to show full trend
  if (sortBy === '7axis') {
    return new Date(now - 365 * 24 * 60 * 60 * 1000);
  }

Verdict: The "7-axis" branding is misleading. The axes exist in the metrics table, but the frontend sort mode labeled "7axis" doesn't actually provide dimensional breakdowns.
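For contrast, a genuinely dimensional sort would need to read per-axis rows from the metrics table and order models on the chosen axis. A minimal sketch over already-loaded rows (the MetricsRow type and sortByAxis helper are hypothetical; no Drizzle query is shown):

// Hypothetical: order models on a single axis, given rows already loaded
// from the metrics table.
type MetricsRow = {
  modelId: number;
  correctness: number;
  spec: number;
  codeQuality: number;
  efficiency: number;
  stability: number;
  refusal: number;
  recovery: number;
};

function sortByAxis(rows: MetricsRow[], axis: Exclude<keyof MetricsRow, 'modelId'>): MetricsRow[] {
  return [...rows].sort((a, b) => b[axis] - a[axis]);
}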


2. Drift Detection Implementation

CUSUM: Not Actually Computed In-App

File: src/lib/drift-detection.ts:133-134

// Step 7: Calculate Page-Hinkley CUSUM
const pageHinkleyCUSUM = validScores[0]?.cusum || 0;

The code reads a pre-computed CUSUM value from the database rather than computing it, which contradicts the "Page-Hinkley CUSUM" claims in the documentation.

File: src/jobs/scorer.ts:142-177 — There IS a Page-Hinkley implementation, but it's incomplete:

async function updatePageHinkley(modelId: number, signal: number): Promise<{
  value: number;
  driftDetected: boolean;
}> {
  // ... simplified implementation ...
  const lambda = 0.05; // Threshold
  const delta = 0.005; // Sensitivity

  if (latestScore.length > 0) {
    const last = latestScore[0];
    const mt = last.cusum + (signal - delta);
    const PH = mt - last.cusum; // Simplified
    const driftDetected = PH > lambda;

Criticism #2: The scorer.ts implementation is labeled "Simplified PH without historical state" (line 157). The proper Page-Hinkley algorithm requires tracking the cumulative deviation m_t and its running extremum M_t (a minimum or maximum, depending on the drift direction) over time. This implementation only uses the previous score's CUSUM value, making it a single-step approximation rather than true CUSUM drift detection.
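For comparison, a textbook Page-Hinkley test maintains a running mean, the cumulative deviation m_t, and its running extremum M_t, and raises an alarm when m_t - M_t exceeds lambda. A minimal self-contained sketch for detecting a downward shift, reusing the repository's lambda = 0.05 and delta = 0.005 defaults (the class and its state handling are illustrative, not the project's code):

// Minimal Page-Hinkley detector for a downward shift in the signal mean.
// The state must persist across observations (e.g. serialized per model).
class PageHinkleyDown {
  private n = 0;     // observations seen so far
  private mean = 0;  // running mean of the signal
  private mT = 0;    // cumulative deviation m_t
  private MT = 0;    // running minimum M_t of m_t

  constructor(
    private readonly delta = 0.005,  // tolerated drift per step
    private readonly lambda = 0.05,  // alarm threshold
  ) {}

  update(x: number): { value: number; driftDetected: boolean } {
    this.n += 1;
    this.mean += (x - this.mean) / this.n;
    // Accumulate how far the signal falls below its running mean.
    this.mT += this.mean - x - this.delta;
    this.MT = Math.min(this.MT, this.mT);
    const ph = this.mT - this.MT;
    return { value: ph, driftDetected: ph > this.lambda };
  }
}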

Regime Classification Thresholds

File: src/lib/drift-detection.ts:178-205

function determineRegime(
  current: number,
  baseline: number,
  variance: number,
  ci: any
): 'STABLE' | 'VOLATILE' | 'DEGRADED' | 'RECOVERING' {
  const delta = baseline - current;
  const ciWidth = ci.upper - ci.lower;
  // DEGRADED: Score significantly below baseline and outside CI
  if (delta > ciWidth && delta > 8) {
    return 'DEGRADED';
  }

  // RECOVERING: Improving from degraded state
  if (delta < -5 && variance < 8) {
    return 'RECOVERING';
  }

  // VOLATILE: High variance regardless of score
  if (variance > 8) {
    return 'VOLATILE';
  }
  return 'STABLE';
}

Criticism #3: The thresholds (8, -5, 8) are hard-coded magic numbers without justification. There's no calibration or explanation for why variance > 8 indicates volatility, or why delta > 8 indicates degradation.
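One way to ground these numbers, sketched below, would be to calibrate them against the empirical distribution of historical deltas and variances rather than hard-coding them; the quantile and calibrateThresholds helpers are hypothetical and do not exist in the repository:

// Hypothetical calibration: derive regime thresholds from historical quantiles.
function quantile(sortedAscending: number[], q: number): number {
  const idx = Math.min(sortedAscending.length - 1, Math.floor(q * sortedAscending.length));
  return sortedAscending[idx];
}

function calibrateThresholds(historicalDeltas: number[], historicalVariances: number[]) {
  const deltas = [...historicalDeltas].sort((a, b) => a - b);
  const variances = [...historicalVariances].sort((a, b) => a - b);
  return {
    degradedDelta: quantile(deltas, 0.95),        // worst 5% of score drops
    volatileVariance: quantile(variances, 0.90),  // noisiest 10% of windows
  };
}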

Drift Status Escalation

File: src/lib/drift-detection.ts:210-227

function determineDriftStatus(
  regime: string,
  cusum: number,
  variance: number
): 'NORMAL' | 'WARNING' | 'ALERT' {
  if (regime === 'DEGRADED' || cusum > 0.10) {
    return 'ALERT';
  }

  if (regime === 'VOLATILE' || cusum > 0.05 || variance > 8) {
    return 'WARNING';
  }

  return 'NORMAL';
}

The CUSUM thresholds (0.10 for ALERT, 0.05 for WARNING) are also magic numbers.


3. Statistical Methods

Confidence Interval Calculation

File: src/lib/statistical-tests.ts:28-94

export function calculateConfidenceInterval(
  scores: number[],
  confidence: number = 0.95
): ConfidenceInterval {
  // ...
  // t-values for 95% CI with different degrees of freedom
  const tValues: Record<number, number> = {
    1: 12.706, // n=2, df=1
    2: 4.303,  // n=3, df=2
    3: 3.182,  // n=4, df=3
    4: 2.776,  // n=5, df=4 (our typical case)
    5: 2.571,  // n=6, df=5
    9: 2.262,  // n=10, df=9
    29: 2.045, // n=30, df=29
    99: 1.984  // n=100, df=99
  };

This is a correct implementation of t-distribution CIs with a lookup table for critical values.
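For reference, the underlying formula is the usual small-sample interval, mean ± t · s / sqrt(n); a condensed sketch of the same calculation with my own variable names, not the repository's:

// Two-sided 95% CI for the mean of a small sample: mean ± t * s / sqrt(n).
function confidenceInterval95(scores: number[]): { lower: number; upper: number } {
  const n = scores.length;
  const mean = scores.reduce((a, b) => a + b, 0) / n;
  if (n < 2) return { lower: mean, upper: mean }; // degenerate case
  const variance = scores.reduce((a, b) => a + (b - mean) ** 2, 0) / (n - 1);
  const stdErr = Math.sqrt(variance / n);
  const tCritical: Record<number, number> = { 1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571 };
  const t = tCritical[n - 1] ?? 1.96; // fall back to the normal critical value for larger n
  return { lower: mean - t * stdErr, upper: mean + t * stdErr };
}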

Criticism #4: No Mann-Whitney U Test

The statistical-tests.ts file (230 lines) contains:

  • calculateConfidenceInterval() — lines 28-94
  • compareScores() — lines 106-165
  • calculateStdDev() — lines 172-181
  • calculateZScore() — lines 190-193
  • isMeaningfulChange() — lines 202-210
  • calculatePercentileRank() — lines 218-229

There is no Mann-Whitney U test implementation despite it being mentioned in external documentation. The system uses exclusively parametric methods (t-distribution, standard deviation) which assume normal distributions — a questionable assumption for LLM output quality metrics.
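For reference, a Mann-Whitney U test could be added along the following lines; this is a sketch using midranks for ties and a normal approximation for the p-value, not code from the repository:

// Mann-Whitney U test with a normal approximation (reasonable for n1 + n2 >= ~20).
function mannWhitneyU(a: number[], b: number[]): { U: number; pValue: number } {
  const all = [...a.map(v => ({ v, g: 0 })), ...b.map(v => ({ v, g: 1 }))]
    .sort((x, y) => x.v - y.v);
  // Assign midranks so tied values share the average of their rank positions.
  const ranks = new Array(all.length).fill(0);
  for (let i = 0; i < all.length; ) {
    let j = i;
    while (j + 1 < all.length && all[j + 1].v === all[i].v) j++;
    const midrank = (i + j + 2) / 2; // ranks are 1-based
    for (let k = i; k <= j; k++) ranks[k] = midrank;
    i = j + 1;
  }
  const n1 = a.length, n2 = b.length;
  const rankSumA = all.reduce((sum, item, idx) => sum + (item.g === 0 ? ranks[idx] : 0), 0);
  const U1 = rankSumA - (n1 * (n1 + 1)) / 2;
  const U = Math.min(U1, n1 * n2 - U1);
  const mu = (n1 * n2) / 2;
  const sigma = Math.sqrt((n1 * n2 * (n1 + n2 + 1)) / 12);
  const z = (U - mu) / sigma;
  const pValue = 2 * normalCdf(-Math.abs(z)); // two-sided, normal approximation
  return { U, pValue };
}

// Standard normal CDF via the Abramowitz-Stegun approximation.
function normalCdf(x: number): number {
  const t = 1 / (1 + 0.2316419 * Math.abs(x));
  const d = 0.3989422804014327 * Math.exp(-(x * x) / 2);
  const poly = t * (0.319381530 + t * (-0.356563782 + t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))));
  const upperTail = d * poly;
  return x >= 0 ? 1 - upperTail : upperTail;
}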

Effect Size Interpretation

File: src/lib/statistical-tests.ts:136-164

// Interpret effect size (Cohen's d thresholds)
if (effectSize < 0.2) {
  return {
    significant: false,
    pValue: 0.8,  // NOTE: This is fabricated, not computed
    effectSize,
    interpretation: "Difference not statistically significant"
  };
} else if (effectSize < 0.5) {
  return { significant: false, pValue: 0.3, /* ... */ };
} else if (effectSize < 0.8) {
  return { significant: true, pValue: 0.03, /* ... */ };
} else {
  return { significant: true, pValue: 0.01, /* ... */ };
}

Criticism #5: The pValue values are hard-coded approximations, not actual computed p-values. A real statistical test would compute the p-value from the test statistic.
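A real p-value would come from the distribution of the test statistic. As an illustration, a Welch two-sample t statistic with a crude normal-approximation p-value; a proper version would use the t distribution with Welch-Satterthwaite degrees of freedom, and normalCdf is the hypothetical helper from the Mann-Whitney sketch above:

// Sketch: Welch's two-sample t statistic with a normal-approximation p-value.
// normalCdf is defined in the Mann-Whitney sketch above.
function welchTTest(a: number[], b: number[]): { t: number; pValue: number } {
  const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
  const variance = (xs: number[], m: number) =>
    xs.reduce((s, x) => s + (x - m) ** 2, 0) / (xs.length - 1);
  const ma = mean(a), mb = mean(b);
  const se = Math.sqrt(variance(a, ma) / a.length + variance(b, mb) / b.length);
  const t = (ma - mb) / se;
  return { t, pValue: 2 * normalCdf(-Math.abs(t)) };
}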


4. Score Weighting Inconsistencies

Inconsistency #1: Two Different Combined Score Formulas

File: src/lib/model-scoring.ts:213-215

const combinedScore = Math.round(
  hourlyDisplay * 0.5 + deepDisplay * 0.25 + toolingDisplay * 0.25
);

→ Combined = 50% hourly + 25% deep + 25% tooling

File: src/lib/score-conversion.ts:56-63

export function combineScores(
  hourlyScore: number | null,
  deepScore: number | null
): number | null {
  if (hourlyScore !== null && deepScore !== null) {
    return Math.round(hourlyScore * 0.7 + deepScore * 0.3);
  }

→ Combined = 70% hourly + 30% deep (no tooling)

The codebase has conflicting score formulas. Different parts of the application may produce different combined scores for the same model.
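A quick numeric check of the divergence: with hourly = 80, deep = 60, and tooling = 60, the model-scoring.ts formula gives round(80 × 0.5 + 60 × 0.25 + 60 × 0.25) = 70, while the score-conversion.ts formula gives round(80 × 0.7 + 60 × 0.3) = 74, a four-point gap for identical inputs.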


5. Deep Benchmark Tasks

Task Definitions

File: src/deepbench/tasks.ts:270-981 — Four multi-turn benchmark scenarios:

Task 1: IDE Assistant (deep/ide_assistant) — lines 271-363

Tests debugging with memory retention across 5 turns:

  1. Analyze buggy e-commerce cart code
  2. Run tests and identify failures
  3. Fix discount logic (change from $10 flat to 10%)
  4. Fix stock validation
  5. Write comprehensive integration test

Scoring weights (lines 346-362):

scoring: {
  weights: {
    correctness: 0.30,
    complexity: 0.10,
    memoryRetention: 0.15,  // Key: remembers previous fixes
    hallucinationRate: 0.10,
    planCoherence: 0.10,
    // ... more axes
  }
}

Task 2: Specification Following (deep/spec_follow) — lines 365-503

5-turn REST API implementation against 10 requirements including:

  • JWT authentication with 1-hour expiry
  • Rate limiting (100 req/hour standard, 10 unauthenticated)
  • Specific error codes (400001 missing fields, 400002 invalid format)
  • Cursor-based pagination

File: src/deepbench/tasks.ts:174-185 — The requirements:

const REST_API_REQUIREMENTS = [
  "JWT authentication with refresh tokens - tokens expire in 1 hour",
  "Rate limiting: 100 requests per hour per authenticated user, 10 per hour for unauthenticated",
  "Input validation with specific error codes: 400001 for missing fields, 400002 for invalid format",
  // ...
];

Task 3: Document Memory (deep/doc_memory) — lines 505-600

Long-context comprehension with chained questions about technical API documentation (lines 187-268). Tests hallucination resistance by requiring models to cite specific sections.

Task 4: Refactor Project (deep/refactor_project) — lines 602-981

Split monolithic Python app (lines 607-751) into microservices: user_service, auth_service, product_service, order_service.

Criticism #6: Deep Benchmark Scoring Axes Differ from Hourly

The deep benchmarks introduce NEW axes not in the standard 7:

memoryRetention: number;
hallucinationRate: number;
planCoherence: number;
contextWindow: number;

These are not stored in the metrics table (which only has the original 7 axes). It's unclear how these deep-specific axes are persisted or aggregated with hourly scores.


6. Toolbench Implementation

Tool-Calling Benchmark Tasks

File: src/toolbench/tasks/definitions.ts:28-376

Easy Tasks (lines 28-107):

  • file_operations_easy: Create and read "hello.txt"
  • directory_exploration_easy: Find files containing "secret" in name
  • simple_command_easy: Check OS and write to file

Medium Tasks (lines 110-209):

  • code_analysis_medium: Add error handling to factorial.py
  • data_processing_medium: Process CSV and generate summary
  • project_setup_medium: Create Node.js project structure

Hard Tasks (lines 212-376):

  • debugging_challenge_hard: Fix multi-file Python app with bugs:
    • main.py has wrong function name generate_reports (should be generate_report)
    • data_processor.py has KeyError when 'age' missing
    • (lines 237-296)
  • system_automation_hard: Create monitoring/cleanup script
  • full_stack_challenge_hard: Build Express REST API with CRUD

Tooling Metrics Schema

File: src/db/schema.ts:192-206

export const tool_metrics = sqliteTable('tool_metrics', {
  sessionId: integer('session_id').references(() => tool_sessions.id).primaryKey(),
  toolSelection: real('tool_selection').notNull(),
  parameterAccuracy: real('parameter_accuracy').notNull(),
  errorHandling: real('error_handling').notNull(),
  taskCompletion: real('task_completion').notNull(),
  efficiency: real('efficiency').notNull(),
  contextAwareness: real('context_awareness').notNull(),
  safetyCompliance: real('safety_compliance').notNull(),
  // ...
});

These 7 tool metrics are DIFFERENT from the original 7 axes. The system has:

  • 7 hourly axes (correctness, spec, codeQuality, etc.)

  • 4 deep-specific axes (memoryRetention, hallucinationRate, etc.)

  • 7 tool metrics (toolSelection, parameterAccuracy, etc.)

Total: 18 different scoring dimensions across the system, not the claimed "7-axis methodology."


7. Change-Point Detection

File: src/lib/drift-detection.ts:358-446

export async function detectChangePoints(modelId: number): Promise<ChangePoint[]> {
  // Sliding window approach (window size = 5 scores)
  const windowSize = 5;
  for (let i = 0; i < validScores.length - windowSize * 2; i++) {

    const beforeWindow = validScores.slice(i, i + windowSize);
    const afterWindow = validScores.slice(i + windowSize, i + windowSize * 2);

    // Calculate significance using confidence intervals
    const beforeCI = calculateConfidenceInterval(beforeScoreVals);
    const afterCI = calculateConfidenceInterval(afterScoreVals);
    const ciOverlap = !(beforeCI.lower > afterCI.upper || afterCI.lower > beforeCI.upper);
    // Change is significant if:
    // 1. Delta > 8 points
    // 2. No CI overlap
    // 3. Delta > 2x CI width
    const avgCIWidth = (beforeCI.upper - beforeCI.lower + afterCI.upper - afterCI.lower) / 2;
    const isSignificant = Math.abs(delta) > 8 && !ciOverlap && Math.abs(delta) > avgCIWidth * 2;

This is a reasonable sliding-window approach for change-point detection, using CI non-overlap as significance criterion.
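A worked example of the criterion: if the before-window CI is [70, 76] and the after-window CI is [58, 62], then delta = -13, the intervals do not overlap, and the average CI width is (6 + 4) / 2 = 5, so |delta| = 13 exceeds both 8 and 2 × 5 and the change is flagged as significant.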

Cause Inference

File: src/lib/drift-detection.ts:480-504

function inferCause(affectedAxes: string[], delta: number): string | undefined {
  // Safety tuning signature
  if (affectedAxes.includes('refusal') && !affectedAxes.includes('correctness')) {
    return delta > 0 ? 'safety_relaxation' : 'safety_tightening';
  }

  // Model update signature (affects multiple axes)
  if (affectedAxes.length >= 3) {
    return delta > 0 ? 'model_improvement' : 'model_regression';
  }

Criticism #7: The cause inference is based on simple heuristics without validation. Attributing a change to "safety_tightening" just because the refusal axis changed is speculative.


8. Database Schema Analysis

File: src/db/schema.ts

Scores Table (lines 86-101)

export const scores = sqliteTable('scores', {
  modelId: integer('model_id').references(() => models.id).notNull(),
  stupidScore: real('stupid_score').notNull(),
  axes: text('axes', { mode: 'json' }).$type<Record<string, number>>().notNull(),
  cusum: real('cusum').notNull(),
  suite: text('suite').default('hourly'), // 'hourly' | 'deep' | 'tooling'
  confidenceLower: real('confidence_lower'),
  confidenceUpper: real('confidence_upper'),
  standardError: real('standard_error'),
  sampleSize: integer('sample_size').default(5),
});

Note: The schema uses SQLite (sqliteTable), not PostgreSQL as I initially reported.

Change Points Table (lines 299-324)

export const change_points = sqliteTable('change_points', {
  model_id: integer('model_id').references(() => models.id).notNull(),
  from_score: real('from_score').notNull(),
  to_score: real('to_score').notNull(),
  delta: real('delta').notNull(),
  significance: real('significance').notNull(),
  change_type: text('change_type').notNull(), // 'improvement' | 'degradation' | 'shift'
  affected_axes: text('affected_axes'), // JSON array
  suspected_cause: text('suspected_cause'),
});

Summary: Critical Findings

Confirmed Capabilities

  1. 7 axes exist in src/jobs/scorer.ts with defined weights
  2. Regime classification works: STABLE/VOLATILE/DEGRADED/RECOVERING
  3. Change-point detection uses CI-based significance testing
  4. Deep benchmarks test multi-turn reasoning with 4 challenging tasks
  5. Tool benchmarks evaluate file/code/system operations at 3 difficulty levels

Problems & Inconsistencies

  • Misleading "7axis" mode (model-scoring.ts:89-92): just calls speed scores, not seven dimensional axes
  • CUSUM not computed (drift-detection.ts:133-134): reads a pre-computed value rather than implementing Page-Hinkley
  • Simplified PH algorithm (scorer.ts:157): comment admits it's incomplete
  • No Mann-Whitney U test (statistical-tests.ts, entire file): despite documentation claims
  • Fabricated p-values (statistical-tests.ts:136-164): hard-coded, not computed
  • Score formula conflict (model-scoring.ts:213 vs score-conversion.ts:62): 50/25/25 vs 70/30 weighting
  • Magic-number thresholds (drift-detection.ts:188, 193, 199): 8, -5, 8 without justification
  • 18 axes, not 7 (multiple files): hourly (7) + deep (4) + tool (7)

Architecture Correction

  • Uses SQLite (not PostgreSQL) per src/db/schema.ts:1
  • Drizzle ORM for database operations

Verdict

The AI Stupid Meter implements a reasonable but imperfect approach to LLM degradation detection:

What it does well:

  • Multi-dimensional benchmarking across code, reasoning, and tool-calling
  • Regime-based classification for actionable alerts
  • Confidence interval-based change-point detection

What it misrepresents:

  • The "7-axis" branding understates the actual 18+ axes
  • Statistical sophistication is overstated (no Mann-Whitney, simplified CUSUM)
  • Hard-coded thresholds lack calibration or justification

The core methodology is sound, but the implementation cuts corners and the marketing overpromises on statistical rigor.
