
@cch1
Created September 24, 2025 19:16
Why ensure schema as part of system startup?

I realize this is going to seem heretical, but here goes: there is value in ensuring schema as part of instance startup. At Sun Tribe, we've been doing it for years.

The biggest advantage is the ease of binding the schema to the code. When combined with a testing strategy that stands up an in-memory database from scratch (many times, in fact, over the course of a test-suite run), it's easy to guarantee that the schema and code under test are exactly what will eventually be ensured and run in production. It also makes it easy to travel back in time locally (schema and code) by starting the system on a previous commit, which sometimes makes forensics a little easier.
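For illustration, here's a minimal sketch of that testing pattern using the Datomic peer library's in-memory storage. The schema, the `ensure-schema!` helper, and the namespace names are stand-ins for the sketch, not our actual code:

```clojure
(ns myapp.test-fixtures
  (:require [datomic.api :as d]))

;; Illustrative schema; in practice this would be the same migration data
;; the production system ensures at startup.
(def schema
  [{:db/ident       :user/email
    :db/valueType   :db.type/string
    :db/cardinality :db.cardinality/one
    :db/unique      :db.unique/identity}])

(def ^:dynamic *conn* nil)

(defn ensure-schema!
  "Transact exactly the schema/migrations that production startup will ensure."
  [conn]
  @(d/transact conn schema))

(defn with-fresh-db
  "clojure.test fixture: stand up a throwaway in-memory database, ensure the
  schema the same way startup does, run the tests, then discard the database."
  [f]
  (let [uri (str "datomic:mem://" (gensym "test"))]
    (d/create-database uri)
    (let [conn (d/connect uri)]
      (ensure-schema! conn)
      (try
        (binding [*conn* conn] (f))
        (finally
          (d/release conn)
          (d/delete-database uri))))))

;; usage in a test namespace:
;; (use-fixtures :each with-fresh-db)
```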

I realize there are limitations with this approach. For example, long-running data corrections need to be carefully considered lest they block system startup for "too long". It's also important to have lifecycle hooks that can declare a startup failure and roll back a deploy (cough, cough... Datomic Ions). Otherwise a schema mismatch (really only possible if someone runs a migration outside the CD pipeline or modifies a historical migration somehow) causes a deploy failure and system downtime. I'm happy trading the slight increase in risk of system downtime (only because I don't have good lifecycle hooks!!) for the ease of guaranteeing code/schema compatibility* regardless of environment (local, CI/CD, production).

In either approach (ensure schema then start up, or ensure schema as part of startup) it's non-negotiable that the system must never operate with a schema/code incompatibility*. One approach relies on dev ops to ensure that "ensure schema" succeeds before instance startup, and the other uses Clojure code inside the system itself to do the same thing. With good lifecycle hooks, the space between these two approaches gets even smaller.
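As a rough sketch of the second approach (not our actual startup code), ensuring schema as part of startup can be a start function that refuses to bring the system up if the ensure step fails; the migration shape (maps with `:tx-data`) and helper names below are assumptions for illustration:

```clojure
(ns myapp.system
  (:require [datomic.api :as d]))

(defn ensure-schema!
  "Apply every migration, in order, before the system accepts traffic.
  Each migration is assumed to be a map with a :tx-data key."
  [conn migrations]
  (doseq [{:keys [tx-data]} migrations]
    @(d/transact conn tx-data)))

(defn start!
  "Instance startup. A failure here should surface to whatever lifecycle hook
  can fail the deploy, rather than leaving a code/schema mismatch running."
  [{:keys [db-uri migrations]}]
  (let [conn (d/connect db-uri)]
    (try
      (ensure-schema! conn migrations)
      (catch Exception e
        (throw (ex-info "schema ensure failed; aborting startup"
                        {:db-uri db-uri} e))))
    ;; only now start the web server, schedulers, etc.
    {:conn conn}))
```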

*In this context, incompatibility means the schema is older than the code. In either approach, an accretive (additive-only) schema is critical to allowing rolling deploys. Otherwise scheduled downtime is the only solution.
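To make "accretive" concrete, here's an illustrative Datomic example (attribute names invented for the sketch):

```clojure
;; Accretive: add a new attribute alongside whatever the old code still reads,
;; so instances on the previous release keep working mid-rollout.
(def accretive-migration
  [{:db/ident       :user/full-name
    :db/valueType   :db.type/string
    :db/cardinality :db.cardinality/one}])

;; Non-accretive (forces downtime or a multi-step expand/contract): renaming
;; or repurposing an attribute the previous release still transacts against.
```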

It's also vital in both approaches that migrations are verifiably immutable once deployed to production. Using a tool like caribou gives you that guarantee (by comparing the cryptographic hash of the tree of applied migrations and their tx-data against the same-named migrations "on disk"). Otherwise, what are you really ensuring?
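I don't want to put words in caribou's mouth, so the sketch below is a generic illustration of the idea rather than its actual API: hash each migration's name and tx-data when it's applied, record the hash, and at ensure time compare those recorded hashes against the same-named migrations on disk.

```clojure
(ns myapp.migrations.verify
  (:require [clojure.string :as str])
  (:import [java.security MessageDigest]))

(defn sha-256 [^String s]
  (->> (.digest (MessageDigest/getInstance "SHA-256") (.getBytes s "UTF-8"))
       (map #(format "%02x" (bit-and % 0xff)))
       (str/join)))

(defn migration-hash
  "Hash a migration's name together with its tx-data, printed canonically."
  [{:keys [name tx-data]}]
  (sha-256 (str name (pr-str tx-data))))

(defn verify-applied!
  "applied: seq of {:name :hash} recorded in the database when each migration
  was transacted. on-disk: seq of {:name :tx-data} from the migration source
  tree. Throws if any applied migration no longer matches its source."
  [applied on-disk]
  (let [expected (into {} (map (juxt :name migration-hash)) on-disk)]
    (doseq [{:keys [name hash]} applied]
      (when (not= hash (get expected name))
        (throw (ex-info "applied migration differs from its source"
                        {:migration name}))))))
```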

Sorry if this went on too long; I've been pilloried for publicly espousing this approach before, and I'm probably being too preemptively defensive.

@lwhorton

lwhorton commented Sep 25, 2025

I agree with the premise. For the sake of argument, here are some counterpoints:

  1. Consider non-monolith applications, or multi-application/multi-team databases, or multi-service applications leveraging more than one db. Rolling back in this operational model is effectively impossible: the cascade of failures and dependencies means you often have to "roll forward with a fix", i.e. idempotent re-deploys as opposed to roll-back-able deploys. In a more complex operational model the goal is usually to stand up a clone of part of the system, run migrations, then run validations + tests against that upgraded system, then flip a switch to integrate the clone into the system. A coupling of startup<->migration makes this impossible.
  2. As you mentioned, db health is now tied to application availability. In multi-application databases (admittedly rare these days) this is not great. Perhaps more commonly, consider a db shared by an analytics team, a web team, a data science team, and an internal tooling team. The pain of any outage is amplified by the dependencies.
  3. This is a bit tangential and perhaps specific to Datomic, but we've learned that in-memory execution doesn't exactly replicate deployed db behavior. There are discrepancies between the various peer/client/cloud/on-prem/ion and db-local/db-local-memory environments. I'm struggling to remember exactly where this bit us in the past, but I remember we were bitten two or three times. A "migrate -> validate -> deploy" strategy would have saved us a bit of pain. (Strict serialized vs. deserialized values in local vs. cloud migrations?)
  4. There's more than one way to get immutable migrations without startup-time migrations. I'd argue robust and coordinated pre-startup migration tools/scripts can do the same work of hashing and verifying (see the sketch after this list). If that's the case, why bother coupling startup<->migration instead of remaining more operationally flexible?
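To make point 4 concrete, here's a rough sketch of what a decoupled pre-startup migration step could look like: a standalone entry point the CD pipeline runs, and gates on, before any application instance starts. The file layout and helper names are assumptions of the sketch, not any particular tool's API.

```clojure
(ns myapp.migrate
  (:require [clojure.edn :as edn]
            [clojure.java.io :as io]
            [datomic.api :as d])
  (:gen-class))

(defn load-migrations
  "Read *.edn migration files (each a map with :name and :tx-data) from a
  directory, in lexical order. The file layout is an assumption of this sketch."
  [dir]
  (->> (.listFiles (io/file dir))
       (filter #(.endsWith (.getName ^java.io.File %) ".edn"))
       (sort-by #(.getName ^java.io.File %))
       (map (comp edn/read-string slurp))))

(defn -main
  "Run by the CD pipeline before any application instance starts; the exit
  code gates the deploy."
  [db-uri migrations-dir & _]
  (let [conn (d/connect db-uri)]
    (try
      (doseq [{:keys [tx-data]} (load-migrations migrations-dir)]
        @(d/transact conn tx-data))
      (println "schema ensured; proceed with deploy")
      (System/exit 0)
      (catch Exception e
        (println "migration step failed:" (ex-message e))
        (System/exit 1)))))
```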
