Claude Code analysis of the "AI Stupid meter" code base

Analysis: How AI Stupid Meter Detects LLM Degradation

Source Repository: https://github.com/StudioPlatforms/aistupidmeter-api

This is a comprehensive code review with specific file and line references for verification.


Repository Structure Overview

src/
├── lib/
│   ├── drift-detection.ts      # Core drift detection algorithms
│   ├── statistical-tests.ts    # CI, effect size calculations
│   ├── model-scoring.ts        # Score aggregation logic
│   ├── score-conversion.ts     # Raw-to-display transformations
│   └── dashboard-compute.ts    # Dashboard metrics (deprecated)
├── deepbench/
│   └── tasks.ts                # Multi-turn benchmark definitions
├── toolbench/
│   └── tasks/definitions.ts    # Tool-calling benchmark tasks
├── jobs/
│   └── scorer.ts               # z-score and CUSUM computation
└── db/
    └── schema.ts               # Database schema definitions

1. The "7-Axis" Scoring System

What the Documentation Claims

A "7-axis scoring methodology" evaluating: correctness, spec compliance, code quality, efficiency, stability, refusal rate, and recovery.

What the Code Actually Shows

File: src/jobs/scorer.ts:7-15

const WEIGHTS = {
  correctness: 0.35,
  spec: 0.15,
  codeQuality: 0.15,
  efficiency: 0.15,
  stability: 0.10,
  refusal: 0.10,
  recovery: 0.05
} as const;

The 7 axes do exist with defined weights totaling 1.0. They're used in z-score calculations.
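As an illustration of how these weights would be applied, here is a minimal sketch of a weighted composite over the seven axes; the AxisScores type and computeWeightedScore helper are hypothetical, not taken from the repository:

type AxisScores = {
  correctness: number;
  spec: number;
  codeQuality: number;
  efficiency: number;
  stability: number;
  refusal: number;
  recovery: number;
};

// Hypothetical helper: weighted sum of per-axis scores (each in 0-1) using WEIGHTS.
function computeWeightedScore(axes: AxisScores): number {
  return (
    WEIGHTS.correctness * axes.correctness +
    WEIGHTS.spec * axes.spec +
    WEIGHTS.codeQuality * axes.codeQuality +
    WEIGHTS.efficiency * axes.efficiency +
    WEIGHTS.stability * axes.stability +
    WEIGHTS.refusal * axes.refusal +
    WEIGHTS.recovery * axes.recovery
  );
}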

File: src/db/schema.ts:49-58 — The metrics table stores these axes:

export const metrics = sqliteTable('metrics', {
  runId: integer('run_id').references(() => runs.id).primaryKey(),
  correctness: real('correctness').notNull(),
  spec: real('spec').notNull(),
  codeQuality: real('code_quality').notNull(),
  efficiency: real('efficiency').notNull(),
  stability: real('stability').notNull(),
  refusal: real('refusal').notNull(),
  recovery: real('recovery').notNull()
});

Criticism #1: The "7axis" Sort Mode is Misleading

File: src/lib/model-scoring.ts:89-92

} else if (sortBy === 'speed' || sortBy === '7axis') {
  modelScores = period === 'latest'
    ? await computeSpeedScores()
    : await computeHistoricalSpeedScores(period);

The 7axis sort mode is identical to speed mode: it simply calls computeSpeedScores(), which scores models purely on hourly benchmark results. It does NOT compute seven separate dimensional scores for sorting.

File: src/lib/model-scoring.ts:724-740 — The 7axis mode uses a 1-year time window but still just uses hourly benchmarks:

case 'latest':

default:
  // For 7axis in latest mode, use 1 year to show full trend
  if (sortBy === '7axis') {
    return new Date(now - 365 * 24 * 60 * 60 * 1000);
  }

Verdict: The "7-axis" branding is misleading. The axes exist in the metrics table, but the frontend sort mode labeled "7axis" doesn't actually provide dimensional breakdowns.
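For contrast, a genuinely dimensional sort would need to read per-axis rows from the metrics table and order models on the chosen axis. A minimal sketch over already-loaded rows (the MetricsRow type and sortByAxis helper are hypothetical; no Drizzle query is shown):

// Hypothetical: order models on a single axis, given rows already loaded
// from the metrics table.
type MetricsRow = {
  modelId: number;
  correctness: number;
  spec: number;
  codeQuality: number;
  efficiency: number;
  stability: number;
  refusal: number;
  recovery: number;
};

function sortByAxis(rows: MetricsRow[], axis: Exclude<keyof MetricsRow, 'modelId'>): MetricsRow[] {
  return [...rows].sort((a, b) => b[axis] - a[axis]);
}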


2. Drift Detection Implementation

CUSUM: Not Actually Computed In-App

File: src/lib/drift-detection.ts:133-134

// Step 7: Calculate Page-Hinkley CUSUM
const pageHinkleyCUSUM = validScores[0]?.cusum || 0;

The code reads a pre-computed CUSUM value from the database rather than computing it, which contradicts the "Page-Hinkley CUSUM" claims in the documentation.

File: src/jobs/scorer.ts:142-177 — There IS a Page-Hinkley implementation, but it's incomplete:

async function updatePageHinkley(modelId: number, signal: number): Promise<{
  value: number;
  driftDetected: boolean;
}> {
  // ... simplified implementation ...
  const lambda = 0.05; // Threshold
  const delta = 0.005; // Sensitivity

  if (latestScore.length > 0) {
    const last = latestScore[0];
    const mt = last.cusum + (signal - delta);
    const PH = mt - last.cusum; // Simplified
    const driftDetected = PH > lambda;

Criticism #2: The scorer.ts implementation is labeled "Simplified PH without historical state" (line 157). The proper Page-Hinkley algorithm requires tracking the cumulative deviation m_t and its running extremum M_t (a minimum or maximum, depending on the drift direction) over time. This implementation only uses the previous score's CUSUM value, making it a single-step approximation rather than true CUSUM drift detection.
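For comparison, a textbook Page-Hinkley test maintains a running mean, the cumulative deviation m_t, and its running extremum M_t, and raises an alarm when m_t - M_t exceeds lambda. A minimal self-contained sketch for detecting a downward shift, reusing the repository's lambda = 0.05 and delta = 0.005 defaults (the class and its state handling are illustrative, not the project's code):

// Minimal Page-Hinkley detector for a downward shift in the signal mean.
// The state must persist across observations (e.g. serialized per model).
class PageHinkleyDown {
  private n = 0;     // observations seen so far
  private mean = 0;  // running mean of the signal
  private mT = 0;    // cumulative deviation m_t
  private MT = 0;    // running minimum M_t of m_t

  constructor(
    private readonly delta = 0.005,  // tolerated drift per step
    private readonly lambda = 0.05,  // alarm threshold
  ) {}

  update(x: number): { value: number; driftDetected: boolean } {
    this.n += 1;
    this.mean += (x - this.mean) / this.n;
    // Accumulate how far the signal falls below its running mean.
    this.mT += this.mean - x - this.delta;
    this.MT = Math.min(this.MT, this.mT);
    const ph = this.mT - this.MT;
    return { value: ph, driftDetected: ph > this.lambda };
  }
}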

Regime Classification Thresholds

File: src/lib/drift-detection.ts:178-205

function determineRegime(
  current: number,
  baseline: number,
  variance: number,
  ci: any
): 'STABLE' | 'VOLATILE' | 'DEGRADED' | 'RECOVERING' {
  const delta = baseline - current;
  const ciWidth = ci.upper - ci.lower;
  // DEGRADED: Score significantly below baseline and outside CI
  if (delta > ciWidth && delta > 8) {
    return 'DEGRADED';
  }

  // RECOVERING: Improving from degraded state
  if (delta < -5 && variance < 8) {
    return 'RECOVERING';
  }

  // VOLATILE: High variance regardless of score
  if (variance > 8) {
    return 'VOLATILE';
  }
  return 'STABLE';
}

Criticism #3: The thresholds (8, -5, 8) are hard-coded magic numbers without justification. There's no calibration or explanation for why variance > 8 indicates volatility, or why delta > 8 indicates degradation.
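One way to ground these numbers, sketched below, would be to calibrate them against the empirical distribution of historical deltas and variances rather than hard-coding them; the quantile and calibrateThresholds helpers are hypothetical and do not exist in the repository:

// Hypothetical calibration: derive regime thresholds from historical quantiles.
function quantile(sortedAscending: number[], q: number): number {
  const idx = Math.min(sortedAscending.length - 1, Math.floor(q * sortedAscending.length));
  return sortedAscending[idx];
}

function calibrateThresholds(historicalDeltas: number[], historicalVariances: number[]) {
  const deltas = [...historicalDeltas].sort((a, b) => a - b);
  const variances = [...historicalVariances].sort((a, b) => a - b);
  return {
    degradedDelta: quantile(deltas, 0.95),        // worst 5% of score drops
    volatileVariance: quantile(variances, 0.90),  // noisiest 10% of windows
  };
}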

Drift Status Escalation

File: src/lib/drift-detection.ts:210-227

function determineDriftStatus(
  regime: string,
  cusum: number,
  variance: number
): 'NORMAL' | 'WARNING' | 'ALERT' {
  if (regime === 'DEGRADED' || cusum > 0.10) {
    return 'ALERT';
  }

  if (regime === 'VOLATILE' || cusum > 0.05 || variance > 8) {
    return 'WARNING';
  }

  return 'NORMAL';
}

The CUSUM thresholds (0.10 for ALERT, 0.05 for WARNING) are also magic numbers.


3. Statistical Methods

Confidence Interval Calculation

File: src/lib/statistical-tests.ts:28-94

export function calculateConfidenceInterval(
  scores: number[],
  confidence: number = 0.95
): ConfidenceInterval {
  // ...
  // t-values for 95% CI with different degrees of freedom
  const tValues: Record<number, number> = {
    1: 12.706, // n=2, df=1
    2: 4.303,  // n=3, df=2
    3: 3.182,  // n=4, df=3
    4: 2.776,  // n=5, df=4 (our typical case)
    5: 2.571,  // n=6, df=5
    9: 2.262,  // n=10, df=9
    29: 2.045, // n=30, df=29
    99: 1.984  // n=100, df=99
  };

This is a correct implementation of t-distribution CIs with a lookup table for critical values.
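For reference, the underlying formula is the usual small-sample interval, mean ± t · s / sqrt(n); a condensed sketch of the same calculation with my own variable names, not the repository's:

// Two-sided 95% CI for the mean of a small sample: mean ± t * s / sqrt(n).
function confidenceInterval95(scores: number[]): { lower: number; upper: number } {
  const n = scores.length;
  const mean = scores.reduce((a, b) => a + b, 0) / n;
  if (n < 2) return { lower: mean, upper: mean }; // degenerate case
  const variance = scores.reduce((a, b) => a + (b - mean) ** 2, 0) / (n - 1);
  const stdErr = Math.sqrt(variance / n);
  const tCritical: Record<number, number> = { 1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571 };
  const t = tCritical[n - 1] ?? 1.96; // fall back to the normal critical value for larger n
  return { lower: mean - t * stdErr, upper: mean + t * stdErr };
}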

Criticism #4: No Mann-Whitney U Test

The statistical-tests.ts file (230 lines) contains:

  • calculateConfidenceInterval() — lines 28-94
  • compareScores() — lines 106-165
  • calculateStdDev() — lines 172-181
  • calculateZScore() — lines 190-193
  • isMeaningfulChange() — lines 202-210
  • calculatePercentileRank() — lines 218-229

There is no Mann-Whitney U test implementation despite it being mentioned in external documentation. The system uses exclusively parametric methods (t-distribution, standard deviation) which assume normal distributions — a questionable assumption for LLM output quality metrics.
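For reference, a Mann-Whitney U test could be added along the following lines; this is a sketch using midranks for ties and a normal approximation for the p-value, not code from the repository:

// Mann-Whitney U test with a normal approximation (reasonable for n1 + n2 >= ~20).
function mannWhitneyU(a: number[], b: number[]): { U: number; pValue: number } {
  const all = [...a.map(v => ({ v, g: 0 })), ...b.map(v => ({ v, g: 1 }))]
    .sort((x, y) => x.v - y.v);
  // Assign midranks so tied values share the average of their rank positions.
  const ranks = new Array(all.length).fill(0);
  for (let i = 0; i < all.length; ) {
    let j = i;
    while (j + 1 < all.length && all[j + 1].v === all[i].v) j++;
    const midrank = (i + j + 2) / 2; // ranks are 1-based
    for (let k = i; k <= j; k++) ranks[k] = midrank;
    i = j + 1;
  }
  const n1 = a.length, n2 = b.length;
  const rankSumA = all.reduce((sum, item, idx) => sum + (item.g === 0 ? ranks[idx] : 0), 0);
  const U1 = rankSumA - (n1 * (n1 + 1)) / 2;
  const U = Math.min(U1, n1 * n2 - U1);
  const mu = (n1 * n2) / 2;
  const sigma = Math.sqrt((n1 * n2 * (n1 + n2 + 1)) / 12);
  const z = (U - mu) / sigma;
  const pValue = 2 * normalCdf(-Math.abs(z)); // two-sided, normal approximation
  return { U, pValue };
}

// Standard normal CDF via the Abramowitz-Stegun approximation.
function normalCdf(x: number): number {
  const t = 1 / (1 + 0.2316419 * Math.abs(x));
  const d = 0.3989422804014327 * Math.exp(-(x * x) / 2);
  const poly = t * (0.319381530 + t * (-0.356563782 + t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))));
  const upperTail = d * poly;
  return x >= 0 ? 1 - upperTail : upperTail;
}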

Effect Size Interpretation

File: src/lib/statistical-tests.ts:136-164

// Interpret effect size (Cohen's d thresholds)
if (effectSize < 0.2) {
  return {
    significant: false,
    pValue: 0.8,  // NOTE: This is fabricated, not computed
    effectSize,
    interpretation: "Difference not statistically significant"
  };
} else if (effectSize < 0.5) {
  return { significant: false, pValue: 0.3, /* ... */ };
} else if (effectSize < 0.8) {
  return { significant: true, pValue: 0.03, /* ... */ };
} else {
  return { significant: true, pValue: 0.01, /* ... */ };
}

Criticism #5: The pValue values are hard-coded approximations, not actual computed p-values. A real statistical test would compute the p-value from the test statistic.
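A real p-value would come from the distribution of the test statistic. As an illustration, a Welch two-sample t statistic with a crude normal-approximation p-value; a proper version would use the t distribution with Welch-Satterthwaite degrees of freedom, and normalCdf is the hypothetical helper from the Mann-Whitney sketch above:

// Sketch: Welch's two-sample t statistic with a normal-approximation p-value.
// normalCdf is defined in the Mann-Whitney sketch above.
function welchTTest(a: number[], b: number[]): { t: number; pValue: number } {
  const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
  const variance = (xs: number[], m: number) =>
    xs.reduce((s, x) => s + (x - m) ** 2, 0) / (xs.length - 1);
  const ma = mean(a), mb = mean(b);
  const se = Math.sqrt(variance(a, ma) / a.length + variance(b, mb) / b.length);
  const t = (ma - mb) / se;
  return { t, pValue: 2 * normalCdf(-Math.abs(t)) };
}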


4. Score Weighting Inconsistencies

Inconsistency #1: Two Different Combined Score Formulas

File: src/lib/model-scoring.ts:213-215

const combinedScore = Math.round(
  hourlyDisplay * 0.5 + deepDisplay * 0.25 + toolingDisplay * 0.25
);

→ Combined = 50% hourly + 25% deep + 25% tooling

File: src/lib/score-conversion.ts:56-63

export function combineScores(
  hourlyScore: number | null,
  deepScore: number | null
): number | null {
  if (hourlyScore !== null && deepScore !== null) {
    return Math.round(hourlyScore * 0.7 + deepScore * 0.3);
  }

→ Combined = 70% hourly + 30% deep (no tooling)

The codebase has conflicting score formulas. Different parts of the application may produce different combined scores for the same model.
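A quick numeric check of the divergence: with hourly = 80, deep = 60, and tooling = 60, the model-scoring.ts formula gives round(80 × 0.5 + 60 × 0.25 + 60 × 0.25) = 70, while the score-conversion.ts formula gives round(80 × 0.7 + 60 × 0.3) = 74, a four-point gap for identical inputs.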


5. Deep Benchmark Tasks

Task Definitions

File: src/deepbench/tasks.ts:270-981 — Four multi-turn benchmark scenarios:

Task 1: IDE Assistant (deep/ide_assistant) — lines 271-363

Tests debugging with memory retention across 5 turns:

  1. Analyze buggy e-commerce cart code
  2. Run tests and identify failures
  3. Fix discount logic (change from $10 flat to 10%)
  4. Fix stock validation
  5. Write comprehensive integration test

Scoring weights (lines 346-362):

scoring: {
  weights: {
    correctness: 0.30,
    complexity: 0.10,
    memoryRetention: 0.15,  // Key: remembers previous fixes
    hallucinationRate: 0.10,
    planCoherence: 0.10,
    // ... more axes
  }
}

Task 2: Specification Following (deep/spec_follow) — lines 365-503

5-turn REST API implementation against 10 requirements including:

  • JWT authentication with 1-hour expiry
  • Rate limiting (100 req/hour standard, 10 unauthenticated)
  • Specific error codes (400001 missing fields, 400002 invalid format)
  • Cursor-based pagination

File: src/deepbench/tasks.ts:174-185 — The requirements:

const REST_API_REQUIREMENTS = [
  "JWT authentication with refresh tokens - tokens expire in 1 hour",
  "Rate limiting: 100 requests per hour per authenticated user, 10 per hour for unauthenticated",
  "Input validation with specific error codes: 400001 for missing fields, 400002 for invalid format",
  // ...
];

Task 3: Document Memory (deep/doc_memory) — lines 505-600

Long-context comprehension with chained questions about technical API documentation (lines 187-268). Tests hallucination resistance by requiring models to cite specific sections.

Task 4: Refactor Project (deep/refactor_project) — lines 602-981

Split monolithic Python app (lines 607-751) into microservices: user_service, auth_service, product_service, order_service.

Criticism #6: Deep Benchmark Scoring Axes Differ from Hourly

The deep benchmarks introduce NEW axes not in the standard 7:

memoryRetention: number;
hallucinationRate: number;
planCoherence: number;
contextWindow: number;

These are not stored in the metrics table (which only has the original 7 axes). It's unclear how these deep-specific axes are persisted or aggregated with hourly scores.


6. Toolbench Implementation

Tool-Calling Benchmark Tasks

File: src/toolbench/tasks/definitions.ts:28-376

Easy Tasks (lines 28-107):

  • file_operations_easy: Create and read "hello.txt"
  • directory_exploration_easy: Find files containing "secret" in name
  • simple_command_easy: Check OS and write to file

Medium Tasks (lines 110-209):

  • code_analysis_medium: Add error handling to factorial.py
  • data_processing_medium: Process CSV and generate summary
  • project_setup_medium: Create Node.js project structure

Hard Tasks (lines 212-376):

  • debugging_challenge_hard: Fix multi-file Python app with bugs:
    • main.py has wrong function name generate_reports (should be generate_report)
    • data_processor.py has KeyError when 'age' missing
    • (lines 237-296)
  • system_automation_hard: Create monitoring/cleanup script
  • full_stack_challenge_hard: Build Express REST API with CRUD

Tooling Metrics Schema

File: src/db/schema.ts:192-206

export const tool_metrics = sqliteTable('tool_metrics', {
  sessionId: integer('session_id').references(() => tool_sessions.id).primaryKey(),
  toolSelection: real('tool_selection').notNull(),
  parameterAccuracy: real('parameter_accuracy').notNull(),
  errorHandling: real('error_handling').notNull(),
  taskCompletion: real('task_completion').notNull(),
  efficiency: real('efficiency').notNull(),
  contextAwareness: real('context_awareness').notNull(),
  safetyCompliance: real('safety_compliance').notNull(),
  // ...
});

These 7 tool metrics are DIFFERENT from the original 7 axes. The system has:

  • 7 hourly axes (correctness, spec, codeQuality, etc.)

  • 4 deep-specific axes (memoryRetention, hallucinationRate, etc.)

  • 7 tool metrics (toolSelection, parameterAccuracy, etc.)

Total: 18 different scoring dimensions across the system, not the claimed "7-axis methodology."


7. Change-Point Detection

File: src/lib/drift-detection.ts:358-446

export async function detectChangePoints(modelId: number): Promise<ChangePoint[]> {
  // Sliding window approach (window size = 5 scores)
  const windowSize = 5;
  for (let i = 0; i < validScores.length - windowSize * 2; i++) {

    const beforeWindow = validScores.slice(i, i + windowSize);
    const afterWindow = validScores.slice(i + windowSize, i + windowSize * 2);

    // Calculate significance using confidence intervals
    const beforeCI = calculateConfidenceInterval(beforeScoreVals);
    const afterCI = calculateConfidenceInterval(afterScoreVals);
    const ciOverlap = !(beforeCI.lower > afterCI.upper || afterCI.lower > beforeCI.upper);
    // Change is significant if:
    // 1. Delta > 8 points
    // 2. No CI overlap
    // 3. Delta > 2x CI width
    const avgCIWidth = (beforeCI.upper - beforeCI.lower + afterCI.upper - afterCI.lower) / 2;
    const isSignificant = Math.abs(delta) > 8 && !ciOverlap && Math.abs(delta) > avgCIWidth * 2;

This is a reasonable sliding-window approach for change-point detection, using CI non-overlap as significance criterion.
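A worked example of the criterion: if the before-window CI is [70, 76] and the after-window CI is [58, 62], then delta = -13, the intervals do not overlap, and the average CI width is (6 + 4) / 2 = 5, so |delta| = 13 exceeds both 8 and 2 × 5 and the change is flagged as significant.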

Cause Inference

File: src/lib/drift-detection.ts:480-504

function inferCause(affectedAxes: string[], delta: number): string | undefined {
  // Safety tuning signature
  if (affectedAxes.includes('refusal') && !affectedAxes.includes('correctness')) {
    return delta > 0 ? 'safety_relaxation' : 'safety_tightening';
  }

  // Model update signature (affects multiple axes)
  if (affectedAxes.length >= 3) {
    return delta > 0 ? 'model_improvement' : 'model_regression';
  }

Criticism #7: The cause inference is based on simple heuristics without validation. Attributing a change to "safety_tightening" just because the refusal axis changed is speculative.


8. Database Schema Analysis

File: src/db/schema.ts

Scores Table (lines 86-101)

export const scores = sqliteTable('scores', {
  modelId: integer('model_id').references(() => models.id).notNull(),
  stupidScore: real('stupid_score').notNull(),
  axes: text('axes', { mode: 'json' }).$type<Record<string, number>>().notNull(),
  cusum: real('cusum').notNull(),
  suite: text('suite').default('hourly'), // 'hourly' | 'deep' | 'tooling'
  confidenceLower: real('confidence_lower'),
  confidenceUpper: real('confidence_upper'),
  standardError: real('standard_error'),
  sampleSize: integer('sample_size').default(5),
});

Note: The schema uses SQLite (sqliteTable), not PostgreSQL as I initially reported.

Change Points Table (lines 299-324)

export const change_points = sqliteTable('change_points', {
  model_id: integer('model_id').references(() => models.id).notNull(),
  from_score: real('from_score').notNull(),
  to_score: real('to_score').notNull(),
  delta: real('delta').notNull(),
  significance: real('significance').notNull(),
  change_type: text('change_type').notNull(), // 'improvement' | 'degradation' | 'shift'
  affected_axes: text('affected_axes'), // JSON array
  suspected_cause: text('suspected_cause'),
});

Summary: Critical Findings

Confirmed Capabilities

  1. 7 axes exist in src/jobs/scorer.ts with defined weights
  2. Regime classification works: STABLE/VOLATILE/DEGRADED/RECOVERING
  3. Change-point detection uses CI-based significance testing
  4. Deep benchmarks test multi-turn reasoning with 4 challenging tasks
  5. Tool benchmarks evaluate file/code/system operations at 3 difficulty levels

Problems & Inconsistencies

  • Misleading "7axis" mode (model-scoring.ts:89-92): just calls speed scores, not seven dimensional axes
  • CUSUM not computed (drift-detection.ts:133-134): reads a pre-computed value rather than implementing Page-Hinkley
  • Simplified PH algorithm (scorer.ts:157): comment admits it's incomplete
  • No Mann-Whitney U test (statistical-tests.ts, entire file): despite documentation claims
  • Fabricated p-values (statistical-tests.ts:136-164): hard-coded, not computed
  • Score formula conflict (model-scoring.ts:213 vs score-conversion.ts:62): 50/25/25 vs 70/30 weighting
  • Magic-number thresholds (drift-detection.ts:188, 193, 199): 8, -5, 8 without justification
  • 18 axes, not 7 (multiple files): hourly (7) + deep (4) + tool (7)

Architecture Correction

  • Uses SQLite (not PostgreSQL) per src/db/schema.ts:1
  • Drizzle ORM for database operations

Verdict

The AI Stupid Meter implements a reasonable but imperfect approach to LLM degradation detection:

What it does well:

  • Multi-dimensional benchmarking across code, reasoning, and tool-calling
  • Regime-based classification for actionable alerts
  • Confidence interval-based change-point detection

What it misrepresents:

  • The "7-axis" branding understates the actual 18+ axes
  • Statistical sophistication is overstated (no Mann-Whitney, simplified CUSUM)
  • Hard-coded thresholds lack calibration or justification

The core methodology is sound, but the implementation cuts corners and the marketing overpromises on statistical rigor.
