
Difficulty Tuning Recommendations: minio-code-and-disk (v34)

Bottom Line

The setup corrupts both format.json and xl.meta, but xl.meta corruption is detectable at runtime — so mc admin info on the running pod immediately shows data1=corrupt, data3=corrupt. Every agent (10/10) runs this command early, gets the unambiguous answer, and ignores the monitoring artifacts entirely. The diagnostic puzzle provides zero signal.

The fix has three parts, all low-effort changes to setup.sh:

  1. Fix the corruption method — corrupt only format.json (not xl.meta), using valid JSON with wrong disk UUIDs (not random bytes). This blocks both mc admin info and filesystem inspection on the running pod.
  2. Make the monitoring classifier confidently wrong — point it at data1+data4 (truth is data1+data3), creating a false consensus trap that agents must see through after restart.
  3. Remove the confirmed-failure freebie — make all flagged drives "suspect" so no drive is pre-confirmed by monitoring.

Current eval results (v34 smalli-nebula): mean 0.90 | scores [0.5, 1, 1, 1, 0.5, 1, 1, 1, 1, 1]

Acceptance target: biggie-nebula mean < 0.70 (canonical threshold)

Trending target: smalli-nebula mean < 0.50 (useful for rapid iteration — if smalli solves it easily, biggie certainly will too, so getting smalli below 0.50 is a good leading indicator before requesting biggie evals)


Context: The v33 → v34 Trajectory

Understanding where v34 overcorrected helps identify the sweet spot.

v33 (biggie-nebula, 10 runs): mean 0.667 — all 10 runs scored exactly 0.667. Every agent passed service_and_data and full_drive_health but failed surgical_replacement. The monitoring artifacts created a false consensus with no escape hatch — namespace annotations and the ConfigMap classifier both pointed to data1+data4 as the failed drives. The truth was data1+data3, but agents had no practical way to discover this because live MinIO diagnostics also couldn't distinguish (the corruption was masked). Result: every agent trusted the artifacts, replaced the wrong drives, and scored identically.

v34 fixed the monitoring artifacts to be conflicting rather than unanimously wrong — good. But it also added xl.meta corruption, which made drives identifiable on the running pod without any restart. While format.json is only read at startup (corruption masked by in-memory state), xl.meta files are accessed during runtime I/O — MinIO detects the corruption dynamically and reports it through mc admin info:

data1: state=corrupt    data2: state=ok
data3: state=corrupt    data4: state=ok

This output is available before any restart, without fixing any obstacles. Every agent (10/10 smalli) runs mc admin info within the first 20-30% of their messages, gets the unambiguous answer, and ignores the monitoring artifacts entirely.

Subscore                        Pass rate   Signal?
correct_drive_identification    10/10       None — every agent trivially identifies data1+data3
no_collateral_damage            10/10       None — every agent preserves data2+data4
service_and_data                8/10        Weak — 2 failures from partial infrastructure fixes*
full_drive_health               8/10        Weak — same 2 runs

*The 2 failing runs correctly identified drives but left saboteurs running: one re-enabled the svc-endpoint-controller at the end, the other never suspended the platform-config-sync CronJob.

The sweet spot is v33's false consensus trap with an escape hatch: monitoring artifacts that confidently point to the wrong drives, combined with live diagnostics that reveal the truth — but only after the agent does real work to access them. A key enabler: the OnDelete StatefulSet update strategy means agents MUST delete and recreate the pod to apply template fixes (init container removal, affinity, envFrom, readinessGate). This means a restart happens naturally during the fix workflow — so the difficulty isn't "can the agent figure out how to restart?" but rather "when the agent finally gets live diagnostics, will it trust them over the classifier it already read?"
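For reference, the restart gate is a single field on the StatefulSet spec (the name and namespace below are assumptions based on the commands elsewhere in this doc — keep whatever the task manifest already uses):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: minio              # assumed name
  namespace: minio-store
spec:
  updateStrategy:
    type: OnDelete         # template fixes take effect only after the pod is deleted
```

Because of this field, `kubectl apply` of the corrected template changes nothing until the agent deletes minio-0 — which is exactly the restart that surfaces the real drive diagnostics.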


The Corruption Fix

Two changes to how setup.sh corrupts drives, applied together:

1. Only corrupt format.json, not xl.meta. This blocks mc admin info on the running pod.

2. Use valid JSON with a wrong disk UUID, not random bytes. This blocks filesystem inspection — an agent that runs kubectl exec minio-0 -- cat /data1/.minio.sys/format.json sees well-formed JSON, not obvious binary garbage. md5sum is also a dead end since each drive already has a unique UUID in normal operation.

# CURRENT (v34) — random bytes into both format.json and xl.meta
for DRIVE in /data1 /data3; do
    kubectl exec minio-0 -- sh -c \
        "dd if=/dev/urandom of=${DRIVE}/.minio.sys/format.json bs=256 count=1"
    kubectl exec minio-0 -- sh -c \
        "find ${DRIVE} -name 'xl.meta' -exec dd if=/dev/urandom of='{}' bs=64 count=1 conv=notrunc \;"
done

# PROPOSED — valid JSON with wrong UUID, format.json only
for DRIVE in /data1 /data3; do
    kubectl -n minio-store exec minio-0 -- sh -c "
        ORIGINAL=\$(cat ${DRIVE}/.minio.sys/format.json)
        echo \"\$ORIGINAL\" | sed 's/\"id\":\"[^\"]*\"/\"id\":\"ffffffff-ffff-ffff-ffff-ffffffffffff\"/' \
            > ${DRIVE}/.minio.sys/format.json
    "
done

After this change:

  • mc admin info on the running pod → all 4 drives state=ok (format.json is in-memory, xl.meta untouched)
  • cat / jq on format.json → valid JSON on all 4 drives, nothing obviously broken
  • md5sum → all files differ, as expected (each drive has a unique UUID normally)
  • Only someone who understands MinIO's erasure set topology AND cross-references disk UUIDs against the set membership list would spot the mismatch
  • On restart → MinIO checks the UUID against its expected set and reports the drive as corrupt

Implementation note: The sed above is illustrative — use whatever approach reliably reads the real format.json, replaces the disk UUID with a valid-format but wrong one, and writes it back. Test that MinIO correctly rejects the wrong-UUID drive on restart (expected: MinIO checks the id field against the erasure set's expected UUIDs and marks the drive as corrupt).
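Whatever substitution ends up in setup.sh, it can be dry-run locally before touching the cluster. The snippet below is a sketch using an illustrative stand-in file — MinIO's real format.json has more fields than this, so treat the shape (and the `id` key) as an assumption to verify against a live pod:

```shell
# Illustrative stand-in for format.json (NOT MinIO's full schema).
cat > /tmp/format.json <<'EOF'
{"version":"1","format":"xl","id":"11111111-1111-1111-1111-111111111111"}
EOF

# Swap the disk UUID for a valid-format but wrong value.
sed 's/"id":"[^"]*"/"id":"ffffffff-ffff-ffff-ffff-ffffffffffff"/' \
    /tmp/format.json > /tmp/format.json.corrupt

# The result is still well-formed JSON — nothing an agent's `cat`/`jq` would flag.
python3 -m json.tool /tmp/format.json.corrupt
```

The key property to confirm is the last step: the corrupted file must parse cleanly, so the only tell is the UUID value itself.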

flowchart LR
    subgraph current["Current (v34) — too easy"]
        direction LR
        A1["mc admin info\n(running pod)"] --> A2["data1=corrupt\ndata3=corrupt\n✅ drives identified"]
        A2 --> A3["Fix 5+ obstacles\n(init, affinity,\nquota, etc.)"]
        A3 --> A4["Replace PVCs\n& verify"]
    end

    subgraph proposed["Proposed — diagnostic requires restart"]
        direction LR
        B1["mc admin info\n(running pod)"] --> B2["all 4 drives ok\n❌ can't identify"]
        B2 --> B3["Read monitoring\nartifacts"]
        B3 --> B4["Classifier says\ndata1+data4 ⚠️"]
        B4 --> B5["Fix obstacles\n→ pod restart"]
        B5 --> B6["mc admin info\n→ data1+data3\ncorrupt"]
        B6 --> B7["Trust restart\nor classifier?"]
    end

    style A2 fill:#86efac
    style B2 fill:#fca5a5
    style B4 fill:#fbbf24
    style B6 fill:#86efac
    style B7 fill:#f87171

The False Consensus Trap

With the corruption fix applied, the agent's workflow becomes: check live diagnostics (inconclusive) → read monitoring artifacts → fix obstacles → pod restarts → check live diagnostics again. The key question is whether the agent trusts the live diagnostics over the monitoring artifacts when they disagree.

Make the classifier point confidently to data1+data4 (wrong — truth is data1+data3):

--- AUTOMATED CLASSIFIER OUTPUT (v2.4.1) ---
Drive: data1-minio-0  Status: CONFIRMED_FAILURE  Confidence: 0.94
Drive: data2-minio-0  Status: HEALTHY            Confidence: 0.99
Drive: data3-minio-0  Status: HEALTHY            Confidence: 0.87
Drive: data4-minio-0  Status: CONFIRMED_FAILURE  Confidence: 0.89

--- HUMAN REVIEWER NOTES ---
Concur with classifier assessment. The error signatures on data1 and data4
are consistent with sector-level media failure. data3 shows some checksum
anomalies but the pattern matches transient ECC correction — likely resolves
with automated scrubbing. No action recommended for data3.

NOTE: The classifier's confidence scores for data1 (0.94) and data4 (0.89)
are near threshold, but the error pattern overlap between data3 and data4
warrants caution. In a 4-drive EC:2 setup, replacing more than 2 drives
simultaneously would exceed parity tolerance. Verification against live
storage telemetry is required before finalizing the replacement set.

This also replaces the current ConfigMap reviewer comment that says "Recommend restarting the MinIO process to get fresh drive diagnostics" — which tells agents exactly what to do. "Verification against live storage telemetry" is vaguer and doesn't hand them the methodology.
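Wired into setup.sh, the replacement text could live in the same ConfigMap the task already publishes. The ConfigMap name and data keys below are assumptions — reuse whatever the current classifier artifact is called, and paste the full classifier and reviewer text from above (the reviewer notes are elided here for brevity):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: drive-health-report      # assumed name — keep the existing artifact's name
  namespace: minio-store
data:
  classifier-output.txt: |
    --- AUTOMATED CLASSIFIER OUTPUT (v2.4.1) ---
    Drive: data1-minio-0  Status: CONFIRMED_FAILURE  Confidence: 0.94
    Drive: data2-minio-0  Status: HEALTHY            Confidence: 0.99
    Drive: data3-minio-0  Status: HEALTHY            Confidence: 0.87
    Drive: data4-minio-0  Status: CONFIRMED_FAILURE  Confidence: 0.89
  reviewer-notes.txt: |
    Concur with classifier assessment. [full reviewer text as above]
    Verification against live storage telemetry is required before
    finalizing the replacement set.
```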

Why this is the core difficulty mechanism: By the time agents restart the pod, they've likely already read the monitoring artifacts and formed a hypothesis (data1+data4). Now mc admin info says data1+data3. The agent must:

  1. Notice the contradiction (classifier said data4, live says data3)
  2. Decide which source to trust
  3. Resist anchoring on the initial hypothesis

This is the v33 pattern, but with an escape hatch. In v33, agents had NO way to get live diagnostics, so the false consensus was a brick wall. Here, agents CAN discover the truth — they just need to trust mc admin info over the classifier. Some will, some won't. That's the signal.

Caution: This change should ONLY be implemented alongside the corruption fix. Without it, agents identify drives trivially and never consult the classifier — the trap has no one to catch.


Supporting Changes

Remove the confirmed-failure freebie

The namespace annotation currently says confirmed-failures=data1-minio-0, freely giving away half the answer. Make ALL flagged drives "suspect" so the agent must verify each one:

# CURRENT
monitoring.bleater.io/confirmed-failures: data1-minio-0
monitoring.bleater.io/suspect-drives: data3-minio-0,data4-minio-0

# PROPOSED
monitoring.bleater.io/suspect-drives: data1-minio-0,data3-minio-0,data4-minio-0
monitoring.bleater.io/remediation-policy: >-
  All flagged drives require verification before replacement.
  Automated classification confidence was below threshold for confirmation.

Rename the saboteur CronJob

The current platform-config-sync CronJob is easy to identify by name. Rename it to something innocuous (e.g., minio-health-monitor or storage-metrics-exporter) so the agent has to read the actual command to discover it's re-applying broken config, rather than inferring its purpose from the name. Optionally tighten the effective interval from every minute to every 30 seconds — note that CronJob schedules bottom out at one-minute granularity, so a 30-second cadence means running the sync twice per job (sync, sleep 30, sync).
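A minimal sketch of the renamed saboteur is below. The image, manifest path, and exact command are assumptions — keep the actual sabotage command from the existing platform-config-sync, and only change the name (and optionally the in-job cadence):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: storage-metrics-exporter     # innocuous name; behavior unchanged
  namespace: minio-store
spec:
  schedule: "* * * * *"              # every minute (CronJob's minimum granularity)
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: exporter
            image: bitnami/kubectl:latest   # assumed image
            # Same behavior as platform-config-sync: re-apply the broken config.
            # Running twice with a 30s sleep approximates a 30-second interval.
            command:
            - sh
            - -c
            - "kubectl apply -f /cfg/broken.yaml; sleep 30; kubectl apply -f /cfg/broken.yaml"
```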


Priority Summary

#   Change                                               Impact                                                            Effort
1   Corruption fix (format.json-only, subtle UUID swap)  Critical — blocks both mc admin info and filesystem inspection    Low-medium
2   False consensus trap (classifier → data1+data4)      High — the core diagnostic challenge (requires #1)                Low (text changes)
3   Remove confirmed-failure freebie                     High — forces agent to verify all drives                          Low
4   Rename saboteur CronJob                              Medium — catches more partial-fix failures                        Low

Together, items 1–3 create a three-layer defense:

  • Layer 1: Live diagnostics show nothing on running pod (corruption fix)
  • Layer 2: Filesystem inspection shows nothing (subtle UUID swap)
  • Layer 3: Monitoring artifacts actively mislead (false consensus + no confirmed failures)

The agent must fix obstacles → get a restart → cross-reference live diagnostics against the classifier → make the right call.


What's Already Working

These should be preserved — they're well-designed:

  • Grader structure — Functional verification via mc commands, timestamp-based PVC checks. Don't weaken.
  • The layered infrastructure obstacles — Service selector, targetPort, controller, CronJob, credential rotation, ResourceQuota, LimitRange, StorageClass, init container, affinity, readinessGate, finalizers. The problem isn't that they're too easy — it's that agents bypass the diagnostic puzzle, so the obstacles only test execution, not reasoning. These changes fix this.
  • The OnDelete strategy — Clever deferred-trap design. With these changes, it becomes the mechanism that gates restart behind obstacle resolution.
  • The erasure coding narrative — The task prompt's warning about transient vs permanent corruption is well-written and fair.

Testing

After implementing changes, run test-solution to verify solvability. The solution.sh already follows the correct workflow (fix obstacles → restart → verify → replace) and hardcodes which drives to replace rather than deriving them from mc admin info output — so no changes to solution.sh are needed.

Then run smalli-nebula evals as a quick iteration check — aim for smalli mean below 0.50. If it drops too far (below 0.10), consider making the reviewer notes slightly more skeptical of the classifier, or having one monitoring source hint at data3. Once smalli looks right, request biggie-nebula evals for canonical acceptance (target: mean < 0.70).
