
@arubis
Last active March 13, 2026 00:52
Review: Cross-Service PVC Snapshot Orchestration (59932490)

AC6 Proposal: Write Isolation as Difficulty Ratchet

Task: Cross-Service PVC Snapshot Orchestration | UUID: 59932490-21ef-4882-81c1-64a2052d8db1 | Version: 25

Context

V25 mean score is 0.458 with grader bugs in AC2 and AC5 accounting for most failures. Once those are fixed, scores will likely rise above the 0.70 threshold. AC6 — currently 0/8 pass, also due to a grader bug — is the natural place to add genuine difficulty to compensate.

The current AC6 compares latest data timestamps across restored databases and checks they're within 30 seconds. This is nearly redundant with AC3 (which already validates snapshot timing) and fails today only because of the same MongoDB readiness bug as AC5. Once that's fixed, AC6 becomes a near-freebie, since the 30-second threshold is too generous to distinguish quiesced from unquiesced snapshots.

Below: a proposal to reshape AC6 into a fair, deterministic difficulty lever.


The gap in the current grading

pg_backup_start — PostgreSQL's standard physical-backup quiesce, and what the solution currently uses — doesn't stop application writes. It ensures WAL consistency, but the data-writer deployment continues inserting rows throughout the snapshot window. MongoDB's fsyncLock does block writes. MinIO is crash-consistent with no quiesce mechanism.

This means an agent can run textbook quiesce commands, pass AC2, and still have data flowing into PostgreSQL and MinIO during snapshots. The current AC6 can't detect this.

Proposed check

For each restored database, verify that the latest record predates the earliest snapshot's creationTime:

def check_write_isolation(snapshot_creation_times, query_latest_timestamp):
    # Any record newer than the earliest snapshot means writes continued
    # into the snapshot window.
    earliest_snap = min(snapshot_creation_times)
    for svc in ['postgres', 'mongodb', 'minio']:
        latest_record = query_latest_timestamp(svc)
        if latest_record > earliest_snap:
            return False, f"{svc} has data written after snapshot window"
    return True, "all writes predate the snapshot window"

This is binary — no threshold tuning, no race conditions. If the agent stopped all writes before snapshotting, no records exist after the snapshot time.
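To make the contrast concrete, here is a toy comparison with illustrative numbers (timestamps in seconds; none of these values come from real runs). The current 30-second spread check passes even though the writer kept inserting, while the latest-record-versus-earliest-snapshot check catches it.

```python
# Illustrative timeline: writer inserts every 5s; snapshots taken at
# t=30, 35, 40 without stopping it. All timestamps are in seconds.
latest_records = {"postgres": 43, "mongodb": 41, "minio": 42}
snapshot_times = [30, 35, 40]

# Current AC6: latest records across services must be within 30s of each
# other. Passes despite ongoing writes, since the spread is only 2 seconds.
spread = max(latest_records.values()) - min(latest_records.values())
print(spread <= 30)  # True: the current check is fooled

# Proposed AC6: no record may postdate the earliest snapshot.
earliest_snap = min(snapshot_times)
leaked = [svc for svc, t in latest_records.items() if t > earliest_snap]
print(leaked)  # every service wrote after the snapshot window
```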

What the agent must do

sequenceDiagram
    participant Agent
    participant DataWriter as data-writer deployment
    participant PG as PostgreSQL
    participant Mongo as MongoDB
    participant MinIO
    participant K8s as Kubernetes VolumeSnapshot API

    Note over Agent: Discovery phase
    Agent->>K8s: kubectl get deployments -n bleater
    Note over Agent: Finds data-writer inserting<br/>into all 3 services every 5s

    Note over Agent: Pre-snapshot phase
    Agent->>DataWriter: Scale to 0 replicas
    Agent->>PG: pg_backup_start (WAL consistency)
    Agent->>Mongo: fsyncLock (flush + lock)

    Note over Agent: Snapshot phase
    Agent->>K8s: Create VolumeSnapshot (postgres)
    Agent->>K8s: Create VolumeSnapshot (mongodb)
    Agent->>K8s: Create VolumeSnapshot (minio)

    Note over Agent: Post-snapshot phase
    Agent->>PG: pg_backup_stop
    Agent->>Mongo: fsyncUnlock
    Agent->>DataWriter: Scale back to 1 replica

The agent must:

  1. Discover that a data-writer deployment exists and is continuously writing to all three databases (not mentioned in task.yaml)
  2. Understand that database-level quiescing alone won't stop it — pg_backup_start doesn't block application connections
  3. Implement write isolation (scale to 0, NetworkPolicy, revoke permissions, etc.)
  4. Resume writes after snapshots complete

Most agents in v25 transcripts never interact with the data-writer at all.

Why this is fair

  • Deterministic. Writes stopped → pass, writes continued → fail. No timing luck.
  • Multiple valid approaches. Scale down, NetworkPolicy, permission revocation, deployment deletion.
  • Discoverable. The data-writer is visible via kubectl get deployments -n bleater.
  • Realistic. Real-world backup orchestration requires halting application traffic, not just running database quiesce commands.
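For instance, the NetworkPolicy route could look roughly like the sketch below. It assumes the data-writer pods carry an `app: data-writer` label and that a NetworkPolicy controller is installed in the cluster; both are assumptions to verify against the actual environment.

```yaml
# Hypothetical alternative to scaling down: deny all egress from the
# data-writer pods so their writes fail during the snapshot window.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: isolate-data-writer
  namespace: bleater
spec:
  podSelector:
    matchLabels:
      app: data-writer   # assumed label; check the real deployment's selector
  policyTypes:
    - Egress
  egress: []   # no egress rules => all outbound traffic denied
```

Deleting the policy after the snapshots restores traffic, which makes this approach as reversible as the scale-to-zero one.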

Required changes

task.yaml: Reframe the relevant AC from "quiesce databases" to signal that application-level traffic matters:

Pre-snapshot hooks ensure all write activity to the databases is halted before snapshots are taken, and post-snapshot hooks resume normal operations after.

Don't mention the data-writer by name — the agent should discover it.

solution.sh: Add data-writer isolation before the snapshot window:

kubectl scale deployment data-writer -n bleater --replicas=0
kubectl rollout status deployment/data-writer -n bleater --timeout=30s
# ... quiesce databases, take snapshots ...
kubectl scale deployment data-writer -n bleater --replicas=1

grader.py: Replace the current timestamp-comparison check with the "latest record before earliest snapshot time" check. Prerequisite: fix MongoDB readiness polling so the grader can reliably query restored data.
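A minimal shape for that readiness fix, as a sketch: the `probe` callable here is hypothetical, standing in for whatever query the real grader runs against the restored MongoDB, and it should swallow connection errors and return False until the instance answers.

```python
import time

def wait_until_ready(probe, timeout=120.0, interval=2.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll `probe()` until it returns truthy or `timeout` seconds elapse.

    `probe` should catch connection errors itself and return False while
    the restored database is still starting up. Returns True on success,
    False if the deadline passes first.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        if probe():
            return True
        sleep(interval)
    return False
```

The grader would call this before running the latest-record check, failing the AC cleanly (rather than erroring) if the restored database never becomes queryable.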

Caveats

  • AC2 and AC6 become complementary. AC2 validates database-level quiescing (defense in depth for data consistency). AC6 validates application-level write isolation (the coordination challenge). The AC2 grader bugs (initContainers, label filter) still need fixing independently.
  • Estimated score impact. Fixing AC2 + AC5 grader bugs pushes scores up. This AC6 change pushes them back down for agents that don't discover and stop the data-writer. Net effect: scores driven by genuine difficulty rather than grader defects.