Source Repository: https://github.com/StudioPlatforms/aistupidmeter-api
This is a comprehensive code review with specific file and line references for verification.
```
src/
├── lib/
│   ├── drift-detection.ts      # Core drift detection algorithms
│   ├── statistical-tests.ts    # CI, effect size calculations
│   ├── model-scoring.ts        # Score aggregation logic
│   ├── score-conversion.ts     # Raw-to-display transformations
│   └── dashboard-compute.ts    # Dashboard metrics (deprecated)
├── deepbench/
│   └── tasks.ts                # Multi-turn benchmark definitions
├── toolbench/
│   └── tasks/definitions.ts    # Tool-calling benchmark tasks
├── jobs/
│   └── scorer.ts               # z-score and CUSUM computation
└── db/
    └── schema.ts               # Database schema definitions
```
A "7-axis scoring methodology" evaluating: correctness, spec compliance, code quality, efficiency, stability, refusal rate, and recovery.
File: src/jobs/scorer.ts:7-15
```typescript
const WEIGHTS = {
  correctness: 0.35,
  spec: 0.15,
  codeQuality: 0.15,
  efficiency: 0.15,
  stability: 0.10,
  refusal: 0.10,
  recovery: 0.05
} as const;
```

The seven axes do exist with defined weights, and they are used in the z-score calculations.
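For reference, here is a minimal sketch of how these weights could collapse the seven axes into a single composite value, assuming a plain weighted sum; the actual aggregation inside scorer.ts is not reproduced in this review.

```typescript
// Hypothetical illustration only: a plain weighted sum over the seven axes
// using the WEIGHTS constant quoted above. scorer.ts may aggregate differently.
type AxisName = keyof typeof WEIGHTS;
type AxisScores = Record<AxisName, number>;

function compositeScore(axes: AxisScores): number {
  return (Object.keys(WEIGHTS) as AxisName[])
    .reduce((sum, axis) => sum + axes[axis] * WEIGHTS[axis], 0);
}
```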
File: src/db/schema.ts:49-58 — The metrics table stores these axes:
```typescript
export const metrics = sqliteTable('metrics', {
  runId: integer('run_id').references(() => runs.id).primaryKey(),
  correctness: real('correctness').notNull(),
  spec: real('spec').notNull(),
  codeQuality: real('code_quality').notNull(),
  efficiency: real('efficiency').notNull(),
  stability: real('stability').notNull(),
  refusal: real('refusal').notNull(),
  recovery: real('recovery').notNull()
});
```

File: src/lib/model-scoring.ts:89-92
```typescript
} else if (sortBy === 'speed' || sortBy === '7axis') {
  modelScores = period === 'latest'
    ? await computeSpeedScores()
    : await computeHistoricalSpeedScores(period);
```

The 7axis sort mode is identical to speed mode: it simply calls computeSpeedScores(), which is based entirely on hourly benchmarks. It does NOT compute seven separate dimensional scores for sorting.
File: src/lib/model-scoring.ts:724-740 — The 7axis mode uses a 1-year time window but still just uses hourly benchmarks:
```typescript
case 'latest':
default:
  // For 7axis in latest mode, use 1 year to show full trend
  if (sortBy === '7axis') {
    return new Date(now - 365 * 24 * 60 * 60 * 1000);
  }
```

Verdict: The "7-axis" branding is misleading. The axes exist in the metrics table, but the frontend sort mode labeled "7axis" doesn't actually provide dimensional breakdowns.
File: src/lib/drift-detection.ts:133-134
```typescript
// Step 7: Calculate Page-Hinkley CUSUM
const pageHinkleyCUSUM = validScores[0]?.cusum || 0;
```

The code reads a pre-computed CUSUM value from the database rather than computing it, which contradicts the "Page-Hinkley CUSUM" claims in the documentation.
File: src/jobs/scorer.ts:142-177 — There IS a Page-Hinkley implementation, but it's incomplete:
```typescript
async function updatePageHinkley(modelId: number, signal: number): Promise<{
  value: number;
  driftDetected: boolean;
}> {
  // ... simplified implementation ...
  const lambda = 0.05;  // Threshold
  const delta = 0.005;  // Sensitivity
  if (latestScore.length > 0) {
    const last = latestScore[0];
    const mt = last.cusum + (signal - delta);
    const PH = mt - last.cusum; // Simplified
    const driftDetected = PH > lambda;
```

Criticism #2: The scorer.ts implementation is labeled "Simplified PH without historical state" (line 157). A proper Page-Hinkley test tracks the cumulative sum mt and its running extremum Mt over the full history. This implementation only uses the previous score's CUSUM value, making it a single-step approximation rather than true CUSUM drift detection.
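For contrast, here is a minimal sketch of a stateful Page-Hinkley detector for downward drift, assuming scores arrive one at a time and that the running state (count, mean, cumulative sum, minimum) is persisted per model between scorer runs. The parameter defaults mirror the lambda/delta constants quoted above; none of this code is part of the repository.

```typescript
// Sketch of a full Page-Hinkley test (detecting a drop in the mean).
// Assumption: the four state fields below would be persisted per model,
// not just a single cusum value as in the current scores table.
class PageHinkleyDetector {
  private count = 0;   // observations seen
  private mean = 0;    // incremental mean of the signal
  private cumSum = 0;  // m_t: cumulative deviation sum
  private minSum = 0;  // M_t: running minimum of m_t

  constructor(
    private readonly lambda = 0.05, // alarm threshold
    private readonly delta = 0.005  // tolerated drift magnitude
  ) {}

  update(signal: number): { value: number; driftDetected: boolean } {
    this.count += 1;
    this.mean += (signal - this.mean) / this.count;
    // When the signal consistently falls below the running mean, the
    // cumulative sum grows while its historical minimum stays behind.
    this.cumSum += this.mean - signal - this.delta;
    this.minSum = Math.min(this.minSum, this.cumSum);
    const ph = this.cumSum - this.minSum;
    return { value: ph, driftDetected: ph > this.lambda };
  }
}
```

The key difference from updatePageHinkley() is that the statistic accumulates over the entire history rather than comparing only against the previous score's stored CUSUM value.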
File: src/lib/drift-detection.ts:178-205
```typescript
function determineRegime(
  current: number,
  baseline: number,
  variance: number,
  ci: any
): 'STABLE' | 'VOLATILE' | 'DEGRADED' | 'RECOVERING' {
  const delta = baseline - current;
  const ciWidth = ci.upper - ci.lower;

  // DEGRADED: Score significantly below baseline and outside CI
  if (delta > ciWidth && delta > 8) {
    return 'DEGRADED';
  }
  // RECOVERING: Improving from degraded state
  if (delta < -5 && variance < 8) {
    return 'RECOVERING';
  }
  // VOLATILE: High variance regardless of score
  if (variance > 8) {
    return 'VOLATILE';
  }
  return 'STABLE';
}
```

Criticism #3: The thresholds (8, -5, 8) are hard-coded magic numbers without justification. There's no calibration or explanation for why variance > 8 indicates volatility, or why delta > 8 indicates degradation.
File: src/lib/drift-detection.ts:210-227
```typescript
function determineDriftStatus(
  regime: string,
  cusum: number,
  variance: number
): 'NORMAL' | 'WARNING' | 'ALERT' {
  if (regime === 'DEGRADED' || cusum > 0.10) {
    return 'ALERT';
  }
  if (regime === 'VOLATILE' || cusum > 0.05 || variance > 8) {
    return 'WARNING';
  }
  return 'NORMAL';
}
```

The CUSUM thresholds (0.10 for ALERT, 0.05 for WARNING) are also magic numbers.
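One way to ground these constants would be to calibrate them from each model's own score history rather than hard-coding them. The sketch below is an assumption about how that could look (the percentile choices are illustrative), not code from the repository.

```typescript
// Hypothetical calibration helper: derive regime/drift cutoffs from
// empirical percentiles of historical deltas, variances, and CUSUM values,
// so alerts fire at a chosen false-positive rate instead of at fixed numbers.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const pos = p * (sorted.length - 1);
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

interface DriftThresholds {
  degradedDelta: number;    // would replace the hard-coded 8
  volatileVariance: number; // would replace variance > 8
  warningCusum: number;     // would replace 0.05
  alertCusum: number;       // would replace 0.10
}

function calibrateThresholds(
  historicalDeltas: number[],
  historicalVariances: number[],
  historicalCusums: number[]
): DriftThresholds {
  return {
    degradedDelta: percentile(historicalDeltas.map(Math.abs), 0.95),
    volatileVariance: percentile(historicalVariances, 0.90),
    warningCusum: percentile(historicalCusums, 0.90),
    alertCusum: percentile(historicalCusums, 0.975),
  };
}
```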
File: src/lib/statistical-tests.ts:28-94
```typescript
export function calculateConfidenceInterval(
  scores: number[],
  confidence: number = 0.95
): ConfidenceInterval {
  // ...
  // t-values for 95% CI with different degrees of freedom
  const tValues: Record<number, number> = {
    1: 12.706,  // n=2, df=1
    2: 4.303,   // n=3, df=2
    3: 3.182,   // n=4, df=3
    4: 2.776,   // n=5, df=4 (our typical case)
    5: 2.571,   // n=6, df=5
    9: 2.262,   // n=10, df=9
    29: 2.045,  // n=30, df=29
    99: 1.984   // n=100, df=99
  };
```

This is a correct implementation of t-distribution confidence intervals using a lookup table of critical values.
The statistical-tests.ts file (230 lines) contains:
- calculateConfidenceInterval() (lines 28-94)
- compareScores() (lines 106-165)
- calculateStdDev() (lines 172-181)
- calculateZScore() (lines 190-193)
- isMeaningfulChange() (lines 202-210)
- calculatePercentileRank() (lines 218-229)
There is no Mann-Whitney U test implementation despite it being mentioned in external documentation. The system uses exclusively parametric methods (t-distribution, standard deviation) which assume normal distributions — a questionable assumption for LLM output quality metrics.
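If a non-parametric comparison were wanted, a Mann-Whitney U test is straightforward to add. The sketch below is not in the repository; it uses average ranks for ties and a normal approximation for the p-value, omitting the tie correction to the variance for brevity.

```typescript
// Minimal Mann-Whitney U test with a normal approximation (illustrative).
function mannWhitneyU(a: number[], b: number[]): { U: number; pValue: number } {
  const n1 = a.length;
  const n2 = b.length;
  // Pool both samples, sort, and assign average ranks to tied values.
  const pooled = [...a.map(v => ({ v, g: 0 })), ...b.map(v => ({ v, g: 1 }))]
    .sort((x, y) => x.v - y.v);
  const ranks = new Array<number>(pooled.length);
  let i = 0;
  while (i < pooled.length) {
    let j = i;
    while (j + 1 < pooled.length && pooled[j + 1].v === pooled[i].v) j++;
    const avgRank = (i + j + 2) / 2; // ranks are 1-based
    for (let k = i; k <= j; k++) ranks[k] = avgRank;
    i = j + 1;
  }
  // Rank sum of the first sample gives U1; take the smaller U.
  let r1 = 0;
  pooled.forEach((item, idx) => { if (item.g === 0) r1 += ranks[idx]; });
  const u1 = r1 - (n1 * (n1 + 1)) / 2;
  const U = Math.min(u1, n1 * n2 - u1);
  // Two-sided p-value from the normal approximation.
  const meanU = (n1 * n2) / 2;
  const sdU = Math.sqrt((n1 * n2 * (n1 + n2 + 1)) / 12);
  const z = sdU === 0 ? 0 : (U - meanU) / sdU;
  const pValue = Math.min(1, 2 * normalCdf(-Math.abs(z)));
  return { U, pValue };
}

// Standard normal CDF via the Abramowitz-Stegun polynomial approximation.
function normalCdf(x: number): number {
  const t = 1 / (1 + 0.2316419 * Math.abs(x));
  const d = 0.3989423 * Math.exp((-x * x) / 2);
  const p = d * t * (0.3193815 + t * (-0.3565638 + t * (1.781478 + t * (-1.821256 + t * 1.330274))));
  return x > 0 ? 1 - p : p;
}
```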
File: src/lib/statistical-tests.ts:136-164
```typescript
// Interpret effect size (Cohen's d thresholds)
if (effectSize < 0.2) {
  return {
    significant: false,
    pValue: 0.8, // NOTE: This is fabricated, not computed
    effectSize,
    interpretation: "Difference not statistically significant"
  };
} else if (effectSize < 0.5) {
  return { significant: false, pValue: 0.3, /* ... */ };
} else if (effectSize < 0.8) {
  return { significant: true, pValue: 0.03, /* ... */ };
} else {
  return { significant: true, pValue: 0.01, /* ... */ };
}
```

Criticism #5: The pValue fields are hard-coded approximations keyed off the effect size, not actual computed p-values. A real statistical test would derive the p-value from the test statistic and the sample size.
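Computing a real p-value does not require heavy machinery. For example, a permutation test yields an empirical two-sided p-value with no distributional assumptions; the sketch below is illustrative and not part of the repository.

```typescript
// Empirical p-value via a permutation test: under the null hypothesis the
// group labels are exchangeable, so shuffle them and count how often the
// permuted mean difference is at least as extreme as the observed one.
function permutationPValue(a: number[], b: number[], iterations = 10_000): number {
  const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
  const observed = Math.abs(mean(a) - mean(b));
  const pooled = [...a, ...b];
  let extreme = 0;
  for (let it = 0; it < iterations; it++) {
    // Fisher-Yates shuffle of the pooled scores.
    const shuffled = [...pooled];
    for (let i = shuffled.length - 1; i > 0; i--) {
      const j = Math.floor(Math.random() * (i + 1));
      [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
    }
    const permA = shuffled.slice(0, a.length);
    const permB = shuffled.slice(a.length);
    if (Math.abs(mean(permA) - mean(permB)) >= observed) extreme++;
  }
  // Add-one smoothing keeps the estimate away from an impossible p = 0.
  return (extreme + 1) / (iterations + 1);
}
```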
File: src/lib/model-scoring.ts:213-215
```typescript
const combinedScore = Math.round(
  hourlyDisplay * 0.5 + deepDisplay * 0.25 + toolingDisplay * 0.25
);
```

→ Combined = 50% hourly + 25% deep + 25% tooling
File: src/lib/score-conversion.ts:56-63
```typescript
export function combineScores(
  hourlyScore: number | null,
  deepScore: number | null
): number | null {
  if (hourlyScore !== null && deepScore !== null) {
    return Math.round(hourlyScore * 0.7 + deepScore * 0.3);
  }
```

→ Combined = 70% hourly + 30% deep (no tooling)
The codebase has conflicting score formulas. Different parts of the application may produce different combined scores for the same model.
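A simple remedy would be a single shared helper that both model-scoring.ts and score-conversion.ts import, so the weighting is defined in one place. The sketch below assumes the 50/25/25 split is the intended one and renormalizes when a component is missing; that policy choice is mine, not the project's.

```typescript
// Hypothetical shared helper: one source of truth for the combined score.
const COMBINE_WEIGHTS = { hourly: 0.5, deep: 0.25, tooling: 0.25 } as const;

export function combineDisplayScores(
  hourly: number | null,
  deep: number | null,
  tooling: number | null
): number | null {
  // Use only the components that are present and renormalize their weights.
  const parts: Array<[number, number]> = [];
  if (hourly !== null) parts.push([hourly, COMBINE_WEIGHTS.hourly]);
  if (deep !== null) parts.push([deep, COMBINE_WEIGHTS.deep]);
  if (tooling !== null) parts.push([tooling, COMBINE_WEIGHTS.tooling]);
  if (parts.length === 0) return null;
  const totalWeight = parts.reduce((s, [, w]) => s + w, 0);
  const weighted = parts.reduce((s, [v, w]) => s + v * w, 0);
  return Math.round(weighted / totalWeight);
}
```

With a helper like this, the 70/30 path in combineScores() would either be deleted or delegate here.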
File: src/deepbench/tasks.ts:270-981 — Four multi-turn benchmark scenarios:
The first scenario tests debugging with memory retention across 5 turns:
- Analyze buggy e-commerce cart code
- Run tests and identify failures
- Fix discount logic (change from $10 flat to 10%)
- Fix stock validation
- Write comprehensive integration test
Scoring weights (lines 346-362):
```typescript
scoring: {
  weights: {
    correctness: 0.30,
    complexity: 0.10,
    memoryRetention: 0.15,  // Key: remembers previous fixes
    hallucinationRate: 0.10,
    planCoherence: 0.10,
    // ... more axes
  }
}
```

The second scenario is a 5-turn REST API implementation against 10 requirements, including:
- JWT authentication with 1-hour expiry
- Rate limiting (100 req/hour standard, 10 unauthenticated)
- Specific error codes (400001 missing fields, 400002 invalid format)
- Cursor-based pagination
File: src/deepbench/tasks.ts:174-185 — The requirements:
```typescript
const REST_API_REQUIREMENTS = [
  "JWT authentication with refresh tokens - tokens expire in 1 hour",
  "Rate limiting: 100 requests per hour per authenticated user, 10 per hour for unauthenticated",
  "Input validation with specific error codes: 400001 for missing fields, 400002 for invalid format",
  // ...
];
```

The third scenario tests long-context comprehension with chained questions about technical API documentation (lines 187-268). It probes hallucination resistance by requiring models to cite specific sections.
The fourth scenario asks the model to split a monolithic Python app (lines 607-751) into microservices: user_service, auth_service, product_service, order_service.
The deep benchmarks introduce NEW axes not in the standard 7:
```typescript
memoryRetention: number;
hallucinationRate: number;
planCoherence: number;
contextWindow: number;
```

These are not stored in the metrics table (which only has the original 7 axes). It's unclear how these deep-specific axes are persisted or aggregated with hourly scores.
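If these axes are meant to be first-class, one option would be a dedicated table mirroring the existing metrics pattern. The table and column names below are hypothetical and not present in schema.ts.

```typescript
// Hypothetical table, following the same Drizzle/SQLite conventions as the
// existing metrics table; names are illustrative, not from the repository.
export const deep_metrics = sqliteTable('deep_metrics', {
  runId: integer('run_id').references(() => runs.id).primaryKey(),
  memoryRetention: real('memory_retention').notNull(),
  hallucinationRate: real('hallucination_rate').notNull(),
  planCoherence: real('plan_coherence').notNull(),
  contextWindow: real('context_window').notNull()
});
```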
File: src/toolbench/tasks/definitions.ts:28-376
Easy tasks:
- file_operations_easy: create and read "hello.txt"
- directory_exploration_easy: find files with "secret" in the name
- simple_command_easy: check the OS and write to a file

Medium tasks:
- code_analysis_medium: add error handling to factorial.py
- data_processing_medium: process a CSV and generate a summary
- project_setup_medium: create a Node.js project structure

Hard tasks:
- debugging_challenge_hard (lines 237-296): fix a multi-file Python app with planted bugs: main.py calls generate_reports (it should be generate_report), and data_processor.py raises a KeyError when 'age' is missing
- system_automation_hard: create a monitoring/cleanup script
- full_stack_challenge_hard: build an Express REST API with CRUD
File: src/db/schema.ts:192-206
```typescript
export const tool_metrics = sqliteTable('tool_metrics', {
  sessionId: integer('session_id').references(() => tool_sessions.id).primaryKey(),
  toolSelection: real('tool_selection').notNull(),
  parameterAccuracy: real('parameter_accuracy').notNull(),
  errorHandling: real('error_handling').notNull(),
  taskCompletion: real('task_completion').notNull(),
  efficiency: real('efficiency').notNull(),
  contextAwareness: real('context_awareness').notNull(),
  safetyCompliance: real('safety_compliance').notNull(),
  // ...
});
```

These 7 tool metrics are DIFFERENT from the original 7 axes. The system has:
- 7 hourly axes (correctness, spec, codeQuality, etc.)
- 4 deep-specific axes (memoryRetention, hallucinationRate, etc.)
- 7 tool metrics (toolSelection, parameterAccuracy, etc.)
Total: 18 different scoring dimensions across the system, not the claimed "7-axis methodology."
File: src/lib/drift-detection.ts:358-446
```typescript
export async function detectChangePoints(modelId: number): Promise<ChangePoint[]> {
  // Sliding window approach (window size = 5 scores)
  const windowSize = 5;
  for (let i = 0; i < validScores.length - windowSize * 2; i++) {
    const beforeWindow = validScores.slice(i, i + windowSize);
    const afterWindow = validScores.slice(i + windowSize, i + windowSize * 2);

    // Calculate significance using confidence intervals
    const beforeCI = calculateConfidenceInterval(beforeScoreVals);
    const afterCI = calculateConfidenceInterval(afterScoreVals);
    const ciOverlap = !(beforeCI.lower > afterCI.upper || afterCI.lower > beforeCI.upper);

    // Change is significant if:
    // 1. Delta > 8 points
    // 2. No CI overlap
    // 3. Delta > 2x CI width
    const avgCIWidth = (beforeCI.upper - beforeCI.lower + afterCI.upper - afterCI.lower) / 2;
    const isSignificant = Math.abs(delta) > 8 && !ciOverlap && Math.abs(delta) > avgCIWidth * 2;
```

This is a reasonable sliding-window approach to change-point detection, using CI non-overlap as the significance criterion.
File: src/lib/drift-detection.ts:480-504
```typescript
function inferCause(affectedAxes: string[], delta: number): string | undefined {
  // Safety tuning signature
  if (affectedAxes.includes('refusal') && !affectedAxes.includes('correctness')) {
    return delta > 0 ? 'safety_relaxation' : 'safety_tightening';
  }
  // Model update signature (affects multiple axes)
  if (affectedAxes.length >= 3) {
    return delta > 0 ? 'model_improvement' : 'model_regression';
  }
```

Criticism #7: The cause inference is based on simple heuristics without validation. Attributing a change to "safety_tightening" just because the refusal axis changed is speculative.
File: src/db/schema.ts
```typescript
export const scores = sqliteTable('scores', {
  modelId: integer('model_id').references(() => models.id).notNull(),
  stupidScore: real('stupid_score').notNull(),
  axes: text('axes', { mode: 'json' }).$type<Record<string, number>>().notNull(),
  cusum: real('cusum').notNull(),
  suite: text('suite').default('hourly'), // 'hourly' | 'deep' | 'tooling'
  confidenceLower: real('confidence_lower'),
  confidenceUpper: real('confidence_upper'),
  standardError: real('standard_error'),
  sampleSize: integer('sample_size').default(5),
});
```

Note: The schema uses SQLite (sqliteTable), not PostgreSQL as I initially reported.
The change_points table:

```typescript
export const change_points = sqliteTable('change_points', {
  model_id: integer('model_id').references(() => models.id).notNull(),
  from_score: real('from_score').notNull(),
  to_score: real('to_score').notNull(),
  delta: real('delta').notNull(),
  significance: real('significance').notNull(),
  change_type: text('change_type').notNull(), // 'improvement' | 'degradation' | 'shift'
  affected_axes: text('affected_axes'), // JSON array
  suspected_cause: text('suspected_cause'),
});
```

What checks out:

- The 7 axes exist in src/jobs/scorer.ts with defined weights
- Regime classification works: STABLE/VOLATILE/DEGRADED/RECOVERING
- Change-point detection uses CI-based significance testing
- Deep benchmarks test multi-turn reasoning with 4 challenging tasks
- Tool benchmarks evaluate file/code/system operations at 3 difficulty levels
| Issue | Location | Description |
|---|---|---|
| Misleading "7axis" mode | model-scoring.ts:89-92 | Just calls speed scores, not 7 dimensional axes |
| CUSUM not computed | drift-detection.ts:133-134 | Reads pre-computed value, doesn't implement Page-Hinkley |
| Simplified PH algorithm | scorer.ts:157 | Comment admits it's incomplete |
| No Mann-Whitney U | statistical-tests.ts (entire file) | Despite documentation claims |
| Fabricated p-values | statistical-tests.ts:136-164 | Hard-coded, not computed |
| Score formula conflict | model-scoring.ts:213 vs score-conversion.ts:62 | 50/25/25 vs 70/30 weighting |
| Magic number thresholds | drift-detection.ts:188,193,199 | 8, -5, 8 without justification |
| 18 axes, not 7 | Multiple files | Hourly (7) + Deep (4) + Tool (7) axes |
- Uses SQLite (not PostgreSQL) per src/db/schema.ts:1
- Drizzle ORM for database operations
The AI Stupid Meter implements a reasonable but imperfect approach to LLM degradation detection:
What it does well:
- Multi-dimensional benchmarking across code, reasoning, and tool-calling
- Regime-based classification for actionable alerts
- Confidence interval-based change-point detection
What it misrepresents:
- The "7-axis" branding understates the actual 18+ axes
- Statistical sophistication is overstated (no Mann-Whitney, simplified CUSUM)
- Hard-coded thresholds lack calibration or justification
The core methodology is sound, but the implementation cuts corners and the marketing overpromises on statistical rigor.