@gwpl
Last active March 9, 2026 18:59
Git-First Graph Databases: practical patterns for building graph data systems that start as text files and scale to production stores — with principles on data architecture vs. implementation architecture, sharding, and avoiding premature optimization
title: SPARQL Toolchain Patterns — Git-First Graph Databases
status: active
license: CC-BY-4.0

SPARQL Toolchain Patterns

A practical guide to building graph databases that start as text files and scale to production stores — without losing clarity, auditability, or the ability to learn from your own codebase.

This document defines patterns for working with graph data in a way that is transparent, educational, and git-native. While the concrete examples use RDF/SPARQL/Turtle, the principles apply equally to property graphs (Cypher), hypergraphs, or any graph paradigm — the key ideas are about data architecture vs. implementation architecture, not a specific serialization format.

The approach is designed for:

  • Hackers and tinkerers who want to experiment with graph databases without deploying infrastructure
  • Individual developers building small-scale scripts where clarity and transparency matter more than raw performance
  • Teams who want auditable, version-controlled graph data with a clear migration path to production graph databases
  • AI agents and coding assistants that benefit from explicit, readable data flow over opaque APIs

The core insight: Text-based graph files in git are your database. Everything else is a derived artifact. Start simple. Optimize later — only when you have evidence that you need to. (The examples below use Turtle/.ttl — substitute your preferred text-based graph format: N-Triples (.nt), N-Quads (.nq), N3 (.n3), JSON-LD, GraphML, or even plain CSV of edges.)

The scaling philosophy: Focus first on correctness — understand your data, design your schema, shard into coherent files from day one, and prove your access patterns work. Because files are transparent, you naturally see which queries touch which shards, and where read/write hotspots emerge. This visibility — opaque in traditional databases — is what makes sharding decisions, performance profiling, and access control design informed rather than speculative. Once the architecture is proven (working MVP, understood access patterns, validated data flows), then make implementation decisions: which shard belongs in a graph database, which in a column store, which stays in files. Optimization at the right time, with evidence — not premature. (See Scaling, Sharding, and the Premature Optimization Trap.)


Guiding Principles

Five design principles govern all graph data tooling. Every pattern in this document flows from these principles.

P1: Git-Trackable First

All persistent state lives in flat text files committed to git — never in a database, binary store, or external service. Text-based graph serialization files — Turtle (.ttl), N-Triples (.nt), N-Quads (.nq), N3 (.n3) — are the source of truth. Any triplestore (Oxigraph, Fuseki, Neptune, rdflib in-memory) is a derived artifact — like a compiled binary rebuilt from source.

Why: Git gives you versioning, diff, blame, branching, and collaboration for free. A database checkpoint doesn't diff well. A .ttl file does — every triple change shows up in git log -p. You get a complete audit trail of every fact that was added, changed, or removed, and when.

Scaling note: When your TTL files grow beyond what's comfortable in git (hundreds of MB, millions of triples), that's your signal to introduce a persistent triplestore. But by then you'll have clean data, tested queries, and a proven schema — the hard parts are already done. Organizing data as multiple TTL files by topic also gives you a natural sharding strategy that transfers directly to database partitioning later. (See Scaling, Sharding, and the Premature Optimization Trap.)

P2: CLI-First

Prefer shell commands over programmatic APIs. A developer should be able to query or mutate the graph from a terminal with a one-liner. The command itself is the documentation — no IDE, no language runtime, no boilerplate needed to understand what's happening.

Why: Shell commands are composable (|, >, xargs), scriptable, and universally readable. They lower the barrier to entry: anyone who can run oxigraph query --location store/ --query-file q.rq understands the system. When you can cat your data and grep your queries, debugging is trivial.

P3: Literate Programming for Learning

Every file should teach something to the reader. .rq files have literate headers explaining SPARQL concepts. .ttl files have comments explaining RDF patterns. Python files have docstrings explaining why embedded SPARQL was chosen over alternatives. Shell wrappers show the equivalent direct invocation.

Why: Code that teaches is more valuable than code that merely works. A developer unfamiliar with SPARQL should be able to learn the language by reading the query files in your repository. This dramatically lowers the adoption barrier for graph databases — the technology's biggest obstacle isn't capability, it's the learning curve.

P4: Prefer Transparent Tools Over Convenient Ones

Tool preference hierarchy for SPARQL execution:

| Rank | Tool | Strengths | Trade-offs |
|------|------|-----------|------------|
| 1st | Oxigraph CLI | Full SPARQL 1.1 + UPDATE, Rust performance, single binary | Requires ephemeral store (`.gitignore`d) |
| 2nd | Apache Jena CLI (`sparql`, `arq`) | Reads TTL directly (no store needed), zero setup | No store means no UPDATE support |
| 3rd | rdflib `graph.query()` | In-process Python, good for tests and multi-step workflows | Hides SPARQL behind Python; use only when CLI won't work |
| 4th | rdflib graph API (`.triples()`, `.add()`) | Programmatic graph construction, cycle detection | Last resort — no SPARQL learning value |

Why this order: Each step down the hierarchy trades transparency for convenience. Oxigraph and Jena are visible CLI invocations anyone can read and reproduce. rdflib graph.query() still uses SPARQL but hides it inside Python. rdflib's graph API abandons SPARQL entirely — justified only when SPARQL genuinely can't express the operation (e.g., arbitrary graph traversal, cycle detection).

Adapt to your stack: If you prefer a different triplestore (Blazegraph, GraphDB, Fuseki), the same hierarchy applies — CLI over embedded, SPARQL over native API, transparent over opaque.

P5: Mutations via SPARQL UPDATE

State changes use SPARQL UPDATE (DELETE/INSERT WHERE), not programmatic graph.add()/graph.remove(). This keeps mutations in the same language as queries, making the codebase consistently SPARQL-first and teaching UPDATE syntax alongside SELECT.

Why: A developer who reads:

DELETE { ?s :status ?old }
INSERT { ?s :status :acquired }
WHERE  { ?s :forSkill :some_skill ; :status ?old }

learns atomic graph mutation — a transferable skill across any SPARQL-compliant system. A developer who reads graph.remove((s, STATUS, old)) learns one library's API — useful locally, but not portable.


1. TTL Files Are the Source of Truth

Turtle (.ttl) files checked into git are the permanent, versioned persistence layer. Any triplestore is a derived artifact — like a compiled binary rebuilt from source code.

Implications:

  • TTL files are committed to git with full audit trail
  • Store directories are .gitignored (they are build artifacts)
  • Deleting and recreating a store is always safe — TTL has all the data
  • The core workflow is: load → query → act → update TTL → commit → reload
  • TTL files should be human-readable with literate comments explaining the data

Typical project structure:

knowledge/
  ontology.ttl          # Schema / TBox — classes, properties, constraints
  data.ttl              # Instance data / ABox — facts, state, observations
  store/                # ← .gitignore'd — ephemeral triplestore (derived)
scripts/
  queries/
    sparql/             # Standalone .rq files (literate headers)
    *.sh                # Thin shell wrappers
  helpers/
    load_store.sh       # Rebuild store from TTL (idempotent)
  lib/
    namespaces.py       # Single source of truth for namespace URIs

The data flow through git:

 ┌─────────┐     load      ┌────────────┐    query     ┌─────────┐
 │  .ttl   │ ──────────▸   │ Triplestore│ ──────────▸  │ Results │
 │ (git)   │               │ (ephemeral)│              │ (stdout)│
 └─────────┘               └────────────┘              └─────────┘
      ▲                                                      │
      │              act on results, update TTL               │
      └──────────────────────────────────────────────────────┘
                        git add + commit

Every transformation is visible in git log. Every query is a file you can read. Every mutation has a diff you can review.

2. Tool Tiers in Practice

Tier 1: SPARQL via CLI (preferred)

# Oxigraph — load TTL into ephemeral store, then query
oxigraph load --location /tmp/my-store \
  --file knowledge/ontology.ttl --file knowledge/data.ttl
oxigraph query --location /tmp/my-store \
  --query-file scripts/queries/sparql/find_ready_items.rq

# Oxigraph — SPARQL UPDATE (atomic mutation)
oxigraph update --location /tmp/my-store \
  --update-file scripts/queries/sparql/mark_complete.ru

# Jena — reads TTL directly (no store needed, great for quick exploration)
sparql --data knowledge/ontology.ttl --data knowledge/data.ttl \
  --query scripts/queries/sparql/find_ready_items.rq

Why: The command IS the documentation. A new contributor can understand your entire data pipeline by reading shell scripts — no code archaeology required.

When to use Jena over Oxigraph: Quick one-off queries where creating an ephemeral store adds friction. Jena reads TTL directly — zero setup, instant feedback.

Tier 2: Embedded SPARQL in Python (good)

# SPARQL double-negation pattern: "no prerequisite exists that is NOT done"
# This is the standard way to express universal quantification in SPARQL,
# because SPARQL has no FORALL — only EXISTS and NOT EXISTS.
READY_ITEMS_QUERY = """
PREFIX ex:    <https://example.org/ontology#>
PREFIX state: <https://example.org/state#>
PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?item ?label WHERE {
  ?item a ex:Task ; rdfs:label ?label .
  FILTER NOT EXISTS {
    ?item ex:requires ?prereq .
    FILTER NOT EXISTS {
      ?ps state:forItem ?prereq ; state:status state:complete .
    }
  }
}
"""
results = graph.query(READY_ITEMS_QUERY)

Rules for embedded SPARQL:

  • Assign to a named constant (e.g., READY_ITEMS_QUERY), never inline
  • Add a comment block above explaining the SPARQL concept demonstrated
  • Use the same PREFIX declarations as your standalone .rq files
  • Import namespaces from a single shared module — never redefine inline

Tier 3: rdflib graph API (last resort)

from rdflib import Graph, Literal, RDF
graph.add((subject, predicate, Literal("value")))
graph.serialize(destination="output.ttl", format="turtle")

Justified when:

  • Complex graph serialization or format conversion
  • SPARQL UPDATE genuinely can't express the operation
  • Test fixtures that need programmatic graph construction
  • Arbitrary graph traversal (cycle detection, path finding)

Not justified when:

  • Simple property updates → use SPARQL UPDATE
  • Queries → use SPARQL SELECT
  • Anything that could be a .rq file
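As an illustration of the "arbitrary graph traversal" case: here is a minimal cycle-detection sketch over a dependency edge list. It deliberately works on plain tuples rather than an rdflib Graph to stay dependency-free; with rdflib you would build the same adjacency map from `graph.triples((None, EX.requires, None))`. The `requires_edges` data is hypothetical.

```python
# Cycle detection over dependency edges — the kind of arbitrary
# traversal SPARQL 1.1 cannot express, justifying the graph API tier.
def find_cycle(edges):
    """Return one cycle as a list of nodes, or None if the graph is acyclic."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on current path / finished
    color = {}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in adjacency.get(node, []):
            if color.get(nxt, WHITE) == GRAY:          # back edge → cycle found
                return path[path.index(nxt):] + [nxt]
            if color.get(nxt, WHITE) == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for start in list(adjacency):
        if color.get(start, WHITE) == WHITE:
            cycle = visit(start)
            if cycle:
                return cycle
    return None

# Hypothetical ex:requires edges extracted from the graph
requires_edges = [("task_a", "task_b"), ("task_b", "task_c"), ("task_c", "task_a")]
print(find_cycle(requires_edges))  # ['task_a', 'task_b', 'task_c', 'task_a']
```

Once a cycle is found, fixing it is a TTL edit plus a git commit — the traversal only needs the graph API for detection, not for mutation.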

3. Mutation via SPARQL UPDATE

Preferred pattern for state changes (over programmatic graph.add()/remove()):

# =============================================================================
# mark_complete.ru — Record an item status change
# =============================================================================
#
# SPARQL UPDATE CONCEPTS DEMONSTRATED:
#   DELETE/INSERT — atomic replacement of triple values
#   WHERE clause  — scoped modification (only changes matching triples)
#
# USAGE:
#   oxigraph load --location /tmp/my-store \
#     --file knowledge/ontology.ttl --file knowledge/data.ttl
#   oxigraph update --location /tmp/my-store --update-file mark_complete.ru
# =============================================================================

PREFIX state: <https://example.org/state#>
PREFIX ex:    <https://example.org/ontology#>
PREFIX xsd:   <http://www.w3.org/2001/XMLSchema#>

DELETE { ?s state:status ?oldStatus }
INSERT { ?s state:status state:complete ;
         state:completedDate "2025-01-15"^^xsd:date }
WHERE  {
  ?s a state:ItemState ;
     state:forItem ex:some_task ;
     state:status ?oldStatus .
}

Why SPARQL UPDATE over Python API:

  • Same language for reads and writes — one mental model
  • Atomic — DELETE and INSERT happen together
  • Auditable — the .ru file is committed to git, diffable, reviewable
  • Transferable — works with any SPARQL 1.1 compliant store
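One practical wrinkle in the `.ru` file above is the hardcoded date. A small templating helper can substitute it at run time while keeping the update auditable text. This is a hypothetical sketch, not part of the patterns above — `render_update` and the `$item`/`$today` placeholders are my own names (SPARQL variables use `?`, so they don't collide with Python's `$`-delimited templates):

```python
from datetime import date
from string import Template

# Template the hardcoded date out of a mark_complete-style update.
# $item and $today are substituted before handing the text to the store.
UPDATE_TEMPLATE = Template("""\
PREFIX state: <https://example.org/state#>
PREFIX ex:    <https://example.org/ontology#>
PREFIX xsd:   <http://www.w3.org/2001/XMLSchema#>

DELETE { ?s state:status ?oldStatus }
INSERT { ?s state:status state:complete ;
         state:completedDate "$today"^^xsd:date }
WHERE  {
  ?s a state:ItemState ;
     state:forItem ex:$item ;
     state:status ?oldStatus .
}
""")

def render_update(item, today=None):
    """Render the update for one item; defaults to today's date."""
    today = today or date.today().isoformat()
    return UPDATE_TEMPLATE.substitute(item=item, today=today)

print(render_update("some_task", today="2025-01-15"))
```

The rendered text can be written to a temp file and passed to `oxigraph update --update-file`, so what executes is still plain, reviewable SPARQL UPDATE.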

4. Standalone SPARQL Query Files

All .rq files follow a literate style with self-documenting headers:

# =============================================================================
# query_name.rq — One-line description of what this query answers
# =============================================================================
#
# PURPOSE: What question does this answer? When would you run it?
#
# KEY SPARQL CONCEPTS DEMONSTRATED:
#   CONCEPT_1 — brief explanation
#   CONCEPT_2 — brief explanation
#
# USAGE (Oxigraph):
#   oxigraph query --location store/ --query-file query_name.rq
#
# USAGE (Jena):
#   sparql --data data.ttl --query query_name.rq
#
# RETURNS: description of result columns
# =============================================================================

PREFIX ex: <https://example.org/ontology#>
# ... query body ...

Why the literate style: Reading a .rq file should teach SPARQL. The header explains concepts; inline comments explain clauses. A developer unfamiliar with SPARQL should be able to learn from query files alone — no textbook required.

Example query inventory (showing how each file teaches a different concept):

| File | SPARQL Concepts Taught |
|------|------------------------|
| `find_ready_items.rq` | NOT EXISTS, nested negation, universal quantification |
| `progress_summary.rq` | COUNT, GROUP BY, OPTIONAL + BIND |
| `dependency_tree.rq` | OPTIONAL for LEFT JOIN, multi-OPTIONAL |
| `item_details.rq` | VALUES for parameterization |
| `gap_analysis.rq` | GROUP_CONCAT, FILTER with BOUND, IN operator |

5. Shell Wrappers

Shell wrappers are intentionally thin — 5-10 lines. They exist to resolve paths so queries work from any directory. The real logic is always in the .rq file.

#!/usr/bin/env bash
# =============================================================================
# find_ready_items.sh — What items are ready to work on?
# =============================================================================
# Thin wrapper. The real logic lives in sparql/find_ready_items.rq
# =============================================================================
set -euo pipefail
REPO_ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
DATA="$REPO_ROOT/knowledge"
QUERY="$REPO_ROOT/scripts/queries/sparql/find_ready_items.rq"

STORE="$DATA/store"
if [ -d "$STORE" ]; then
    oxigraph query --location "$STORE" --query-file "$QUERY"
else
    sparql --data "$DATA/ontology.ttl" --data "$DATA/data.ttl" --query "$QUERY"
fi

Why a wrapper instead of calling sparql directly? Anyone can run ./find_ready_items.sh from any directory without remembering file paths. The wrapper is thin enough that the .rq file remains the real documentation.

6. Namespace Conventions

All RDF namespaces are defined in one place — a shared module that both Python code and humans reference:

# namespaces.py — Single source of truth for all namespace URIs
from rdflib import Namespace

EX      = Namespace("https://example.org/ontology#")
STATE   = Namespace("https://example.org/state#")
PHASE   = Namespace("https://example.org/phases#")
EV      = Namespace("https://example.org/evidence#")

Rules:

  • Python scripts MUST import from the shared module — never redefine inline
  • .rq files use matching PREFIX declarations (kept in sync manually)
  • .ttl files use matching @prefix declarations
  • When you add a namespace, update all three places
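Keeping the three copies in sync manually is the error-prone part. A small helper can at least generate the `PREFIX` and `@prefix` blocks from the Python module, so the headers are produced from one place. A sketch, with plain strings standing in for the rdflib `Namespace` objects in `namespaces.py` (the helper names are hypothetical):

```python
# Generate SPARQL PREFIX and Turtle @prefix blocks from one dict,
# so .rq and .ttl headers can be derived from the Python source of truth.
NAMESPACES = {  # stands in for lib/namespaces.py
    "ex":    "https://example.org/ontology#",
    "state": "https://example.org/state#",
}

def sparql_prefixes(ns):
    """Aligned PREFIX block for .rq files and embedded queries."""
    width = max(len(p) for p in ns) + 1
    return "\n".join(f"PREFIX {p + ':':<{width}} <{uri}>" for p, uri in ns.items())

def turtle_prefixes(ns):
    """Aligned @prefix block for .ttl files (note the trailing dot)."""
    width = max(len(p) for p in ns) + 1
    return "\n".join(f"@prefix {p + ':':<{width}} <{uri}> ." for p, uri in ns.items())

print(sparql_prefixes(NAMESPACES))
# PREFIX ex:    <https://example.org/ontology#>
# PREFIX state: <https://example.org/state#>
```

A pre-commit hook could diff this output against the headers of committed `.rq` and `.ttl` files to catch drift.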

7. The Ephemeral Store Pattern

The triplestore is a derived artifact rebuilt from TTL sources — like make clean && make.

# Rebuild store from TTL sources (idempotent, safe to run anytime)
./scripts/helpers/load_store.sh

# The store directory is git-ignored — it's a build artifact
# Deleting and recreating is ALWAYS safe — TTL files have all the data

# Query
oxigraph query --location knowledge/store/ --query-file my_query.rq

# Mutate
oxigraph update --location knowledge/store/ --update-file my_update.ru

load_store.sh does exactly four things:

  1. Verify TTL source files exist
  2. Clear existing store (rm -rf knowledge/store/)
  3. Load all TTL files (oxigraph load --location store/ --file *.ttl)
  4. Verify with a triple count query

Why ephemeral?

  • The store is never committed to git (P1: Git-Trackable First)
  • git clone gives a working repo — no database provisioning
  • rm -rf store/ is always safe — it's just a cache
  • Rebuilding is fast for small-to-medium graphs (seconds to minutes)
  • When rebuilding becomes slow, that's your signal to introduce persistence

8. Adding New Queries — Checklist

When adding a new SPARQL query:

  1. Write the .rq file with a literate header explaining the SPARQL concepts
  2. Test manually — rebuild store and query:
    ./scripts/helpers/load_store.sh
    oxigraph query --location knowledge/store/ --query-file your_query.rq
    # Or quick test without store (Jena):
    sparql --data knowledge/ontology.ttl --data knowledge/data.ttl --query your_query.rq
  3. Add a shell wrapper if the query will be called from scripts or automation
  4. Document the query in your project's query inventory

When embedding SPARQL in Python:

  1. Name the constant descriptively (e.g., GAP_ANALYSIS_QUERY)
  2. Add literate comments explaining the SPARQL concepts demonstrated
  3. Import namespaces from the shared module
  4. Consider extracting to a standalone .rq file if the query is reusable

Why This Approach Works

For learning and adoption

The biggest barrier to graph database adoption isn't the technology — it's the learning curve. These patterns lower that barrier:

  • Every .rq file teaches SPARQL concepts with inline documentation
  • CLI-first means you can experiment in a terminal, not an IDE
  • TTL in git means you can read your data with cat and track changes with git log
  • No infrastructure to set up — git clone and you're querying

For individual developers and small scripts

You don't need a graph database server to work with graph data:

  • TTL files are just text — edit them with any editor
  • Jena reads TTL directly — sparql --data file.ttl --query q.rq
  • Shell scripts compose naturally — pipe SPARQL results to awk, jq, or anything
  • Git gives you versioning and collaboration for free

For scaling later

When you outgrow text files, migration is straightforward because:

  • Your queries are already SPARQL — they work on any compliant store
  • Your schema is already in TTL — load it into any triplestore
  • Your mutations are already SPARQL UPDATE — same syntax everywhere
  • The only thing that changes is the backend — not the queries, not the data format
  • File-per-topic organization reveals your sharding boundaries before you need to commit to database infrastructure

For a deeper treatment of why file-first is a scaling advantage (not just a starting point), and why different shards may need different storage engines, see Scaling, Sharding, and the Premature Optimization Trap.

For AI agents and automation

Transparent, text-based patterns are ideal for AI-assisted workflows:

  • Agents can read .ttl and .rq files directly — no API calls needed
  • Git diffs show exactly what changed — perfect for review and auditing
  • CLI commands are self-documenting — agents can reproduce any operation
  • The entire data pipeline is visible in the repository

Quick Reference

| Principle | Rule of thumb |
|-----------|---------------|
| P1: Git-Trackable | If it's not in a `.ttl` file committed to git, it doesn't exist |
| P2: CLI-First | If you can't run it from a terminal one-liner, simplify it |
| P3: Literate | If a reader can't learn from your code, add comments until they can |
| P4: Transparent tools | CLI over embedded, SPARQL over native API, readable over convenient |
| P5: SPARQL UPDATE | Mutations in SPARQL, not in Python — same language for reads and writes |

| Task | Preferred approach |
|------|--------------------|
| Query the graph | `oxigraph query` or `sparql --data` (CLI) |
| Mutate the graph | SPARQL UPDATE via `oxigraph update` |
| Store data permanently | Commit `.ttl` to git |
| Rebuild the store | `load_store.sh` (idempotent, safe) |
| Add a new query | Write `.rq` with literate header, add shell wrapper |
| Embed in Python | Named constant + literate comments + shared namespaces |

This document describes patterns developed through practical experience building graph databases with SPARQL, Oxigraph, Apache Jena, and rdflib. The approach prioritizes clarity and learnability over premature optimization — start with text files, graduate to databases when you have evidence you need them.

title: Scaling, Sharding, and the Premature Optimization Trap
subtitle: Companion to: Git-First Graph Database Toolchain Patterns
license: CC-BY-4.0

Scaling, Sharding, and the Premature Optimization Trap

Why starting with text files in git isn't a compromise — it's a strategic advantage that pays off at every scale, including large ones.

This is a companion to SPARQL Toolchain Patterns. That document describes the how. This one addresses the skeptic's question: "But won't I need a real database eventually?"

The short answer: maybe. But you'll be better prepared for that migration — and you might discover you need far less database than you assumed.


Separating Data Architecture from Implementation Architecture

The most important distinction in this entire document:

Data architecture (also: information architecture, conceptual model) is what your data means — entities, relationships, constraints, the shape of your domain. It's the graph you draw on a whiteboard.

Implementation architecture is how that data is stored, queried, and mutated at runtime — which database engine, what indexes, how shards are distributed, what wire protocol carries queries.

These are not the same thing, and conflating them is the root cause of most premature optimization in data systems.

Isomorphisms and Homomorphisms

In discrete mathematics, an isomorphism is a structure-preserving mapping between two structures that is bijective — a perfect, lossless correspondence. A homomorphism is a structure-preserving mapping that may lose information — it preserves edges but may merge nodes.

These concepts clarify the relationship between your conceptual model and its implementations:

  • Conceptual ↔ File-based prototype: Isomorphic. Your TTL/JSON-LD/CSV files represent the graph directly. Every node, edge, and property in the conceptual model has a one-to-one correspondent in the files. Nothing is hidden, nothing is lost. You can reason about the conceptual model by reading the files — they are the model.

  • Conceptual → Production database: Often a homomorphism, not an isomorphism. A column store might flatten relationship edges into denormalized rows. A key-value store might merge node properties into opaque blobs. A search index might discard graph structure entirely, keeping only text fields. Each implementation preserves some structure from the conceptual model but is optimized for a specific access pattern at the cost of others.

  • Conceptual → Hybrid (multiple stores): A family of homomorphisms — each store receives a projection of the conceptual model optimized for its access pattern. The union of all projections reconstructs the full model. The file-based source of truth remains the isomorphic reference copy.

  ┌─────────────────────────────────────┐
  │     Conceptual Model (the graph)    │
  │  = Data Architecture                │
  │  = What your domain means           │
  └────────┬──────────┬────────────┬────┘
           │          │            │
     isomorphic   homomorphic  homomorphic
     (lossless)   (projected)  (projected)
           │          │            │
  ┌────────▾───┐ ┌────▾─────┐ ┌───▾────────┐
  │ Text files │ │ Column   │ │ Graph DB   │
  │ in git     │ │ store    │ │ (traversal │
  │ (source of │ │(analytics│ │  queries)  │
  │  truth)    │ │  queries)│ │            │
  └────────────┘ └──────────┘ └────────────┘
    Implementation Architecture(s)

Why this matters: When you start with files, your implementation is isomorphic to your conceptual model. You can see the whole graph, reason about it directly, and evolve the schema by editing text. When you later introduce databases, you're consciously choosing which homomorphic projections to maintain — and you understand exactly what each projection preserves and what it discards, because you've been working with the full structure all along.
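The distinction can be made concrete on a toy graph. A dependency-free sketch, with plain tuples standing in for parsed triples (the `name`/`knows` predicates are hypothetical):

```python
# A tiny conceptual graph as (subject, predicate, object) triples —
# this list is the isomorphic, file-like representation: nothing is lost.
triples = [
    ("alice", "name",  "Alice"),
    ("alice", "knows", "bob"),
    ("bob",   "name",  "Bob"),
    ("bob",   "knows", "alice"),
]

def kv_projection(triples):
    """Homomorphic projection into a key-value view: keeps entity
    properties, discards relationship edges (like the key-value
    store described above)."""
    kv = {}
    for s, p, o in triples:
        if p != "knows":               # relationship edges are dropped
            kv.setdefault(s, {})[p] = o
    return kv

print(kv_projection(triples))
# {'alice': {'name': 'Alice'}, 'bob': {'name': 'Bob'}}
```

The `knows` edges cannot be recovered from the projection — which is exactly why the triple list (the files) must remain the source of truth, and the key-value view a rebuildable derivative.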


The "Database Will Be Faster" Intuition

Many developers reach for a database early because of an intuition: databases are faster than files. This intuition is correct for large-scale, high-concurrency workloads — and misleading for everything else.

At small scale, files win on what actually matters:

| Concern | Database | Text files in git |
|---------|----------|-------------------|
| Debugging | Opaque internal state; need query logs, explain plans | `cat data.ttl` — the state IS the file |
| Audit trail | Requires explicit change tracking (CDC, audit tables) | `git log -p data.ttl` — free, complete, immutable |
| Reproducibility | Need database snapshots, migration scripts | `git checkout abc123` — exact state at any point |
| Collaboration | Merge conflicts in binary formats are painful | Text is text — `git merge` works naturally |
| Setup cost | Database server, configuration, credentials, backups | `git clone` — done |
| Transparency | State hidden behind query interface | State is readable by humans, agents, grep |
For small-to-medium graphs (thousands to low millions of facts), the file-based approach isn't slower in a way that matters — queries still complete in milliseconds to seconds. What does matter is whether you can understand your data, track changes, and debug problems. Files excel at all three.

Graph Databases and the Memory Wall

Here's something that surprises people who haven't operated graph databases at scale: most graph databases are memory-hungry.

Graph traversal — the operation that makes graph databases powerful — requires random access across the entire graph. Unlike SQL queries that scan rows sequentially, graph queries jump between nodes following edges. This access pattern is fundamentally hostile to disk-based storage:

  • In-memory stores (e.g., rdflib, some Neo4j configurations) need the entire graph in RAM. Fast, but your dataset is capped by available memory.
  • Disk-backed stores (e.g., Oxigraph/RocksDB, Blazegraph, JanusGraph) use indexes to reduce disk I/O, but complex multi-hop queries still thrash the page cache.
  • Distributed stores (e.g., Neptune, Stardog cluster, NebulaGraph) shard across nodes, but cross-shard traversals add network latency.

The consequence: a growing graph workload often outgrows a single affordable server surprisingly fast. You don't gently run out of room — you hit a cliff where queries that took milliseconds suddenly take seconds, because the working set no longer fits in memory.

This is where the file-based starting point becomes strategically valuable.

File-Based Sharding: A Free Scaling Blueprint

When you organize your graph data as multiple text files from the beginning, you are — whether you realize it or not — defining a sharding strategy.

knowledge/
  ontology.ttl              # Schema (rarely changes, always needed)
  users/
    user-001.ttl            # One file per entity or entity group
    user-002.ttl
  products/
    electronics.ttl         # Sharded by domain/category
    clothing.ttl
  events/
    2025-Q1.ttl             # Sharded by time period
    2025-Q2.ttl
  relationships/
    purchases.ttl           # Edges between entity types
    reviews.ttl

Each file is a shard. And because it's plain text in git, you can immediately see:

  • What data lives together — files that are loaded together for a query are a natural shard boundary
  • What changes frequentlygit log --oneline events/ vs git log --oneline ontology.ttl
  • What's largewc -l knowledge/**/*.ttl | sort -n
  • What depends on what — which files does each query need?

This visibility is priceless. In a database, sharding decisions are infrastructure concerns hidden from developers. In files, they're visible directory structures that anyone can reason about.
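Sharding by time period, as in the `events/` directory above, is easy to make deterministic with a tiny helper. A sketch assuming the quarter-per-file layout shown (the function name is hypothetical):

```python
from datetime import date

def event_shard(event_date: date) -> str:
    """Map an event date to its shard file, matching events/2025-Q1.ttl."""
    quarter = (event_date.month - 1) // 3 + 1
    return f"events/{event_date.year}-Q{quarter}.ttl"

print(event_shard(date(2025, 2, 14)))   # events/2025-Q1.ttl
print(event_shard(date(2025, 11, 3)))   # events/2025-Q4.ttl
```

Because the same rule later becomes your database partition key, the helper doubles as documentation of the sharding strategy.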

Selective Loading

The file-per-shard pattern enables selective loading — only load what the current query needs:

# Query about electronics products — only load relevant shards
oxigraph load --location /tmp/store \
  --file knowledge/ontology.ttl \
  --file knowledge/products/electronics.ttl \
  --file knowledge/relationships/reviews.ttl

oxigraph query --location /tmp/store --query-file electronics_reviews.rq

Compare this to a monolithic database where every query potentially touches the entire graph. With files, the developer explicitly chooses what data to load. This is slower for a cold start but:

  • Forces you to understand your data dependencies
  • Makes query scope visible and auditable
  • Naturally limits memory usage to relevant subsets
  • Documents which parts of the graph interact
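The shard dependencies of each query can themselves be recorded as data, so wrappers build the load command instead of hard-coding file lists. A dependency-free sketch (the `QUERY_SHARDS` mapping and query name are hypothetical; it only constructs the argument list, it does not invoke oxigraph):

```python
# Declare which shards each query needs; wrappers derive the load
# command from this map instead of duplicating --file arguments.
QUERY_SHARDS = {
    "electronics_reviews": [
        "knowledge/ontology.ttl",
        "knowledge/products/electronics.ttl",
        "knowledge/relationships/reviews.ttl",
    ],
}

def load_command(query_name, store="/tmp/store"):
    """Build the oxigraph load invocation for one query's shards."""
    cmd = ["oxigraph", "load", "--location", store]
    for shard in QUERY_SHARDS[query_name]:
        cmd += ["--file", shard]
    return cmd

print(" ".join(load_command("electronics_reviews")))
```

As a side effect, the mapping documents "what depends on what" from the bullet list above in a form both humans and scripts can read.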

The 90/10 Insight: Not All Data Needs a Database

Once you've been working with file-sharded graphs for a while, a pattern emerges: most of your data is stable, and only a small fraction is volatile.

| Data type | Example | Access pattern | File-based? |
|-----------|---------|----------------|-------------|
| Schema / ontology | Class definitions, property constraints | Read-only, loaded once | Perfect in files |
| Reference data | Country codes, category taxonomies | Read-only, rarely updated | Perfect in files |
| Historical records | Past events, completed transactions | Append-only, queried occasionally | Good in files |
| Active state | Current user sessions, live metrics | High read/write frequency | Needs a database |
| Hot relationships | Real-time recommendations, live feeds | High traversal frequency | Needs a database |

In many real-world systems, 90% of the graph is cold data — rarely written, queried in predictable patterns. Only 10% or less is truly volatile, requiring the low-latency random writes that databases provide.

This means the answer to "should I use a database?" is often not "yes" or "no" — it's "yes, but only for this specific shard."
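The hot/cold split can be estimated directly from git, for example by counting commits per shard with `git log --oneline -- <file> | wc -l`. A sketch that works on already-collected counts so it stays runnable without a repo (the threshold and sample data are hypothetical, not a recommendation):

```python
# Commits-per-shard counts, e.g. collected with:
#   git log --oneline -- <file> | wc -l
commit_counts = {
    "ontology.ttl":       3,    # schema: nearly frozen
    "events/2025-Q1.ttl": 12,   # historical: append-only, now quiet
    "state/sessions.ttl": 340,  # active state: churns constantly
}

def classify(counts, hot_threshold=100):
    """Split shards into file-friendly 'cold' and database-candidate 'hot'."""
    hot = {f for f, n in counts.items() if n >= hot_threshold}
    cold = set(counts) - hot
    return hot, cold

hot, cold = classify(commit_counts)
print("needs a database:", sorted(hot))   # ['state/sessions.ttl']
print("fine in files:   ", sorted(cold))
```

This is exactly the evidence-driven decision the companion document argues for: the database is introduced per shard, when the churn data says so.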

Hybrid Architecture: Conceptual Graph, Heterogeneous Storage

This is where the isomorphism/homomorphism distinction (from the opening section) becomes practical.

Your conceptual model is a graph — entities connected by relationships. You can model it in any graph paradigm: RDF triples (SPARQL), property graphs (Cypher/Gremlin), hypergraphs, or even edge lists in CSV. The conceptual model is paradigm-agnostic — it's the meaning of your data.

Your implementation is where performance profiling, access patterns, and operational constraints dictate storage choices. And here's the key insight: different shards of the same conceptual graph may need entirely different storage engines.

Once you've identified your access patterns through file-based prototyping, you might discover:

- **Shard A (ontology + reference data):** Rarely changes, always loaded. Keep in text files. Load into an in-memory graph at startup. Isomorphic to the conceptual model.
- **Shard B (event log):** Append-heavy, time-series queries. A column store (ClickHouse, DuckDB) might serve this better than a graph database. Homomorphic — preserves temporal ordering, flattens graph structure.
- **Shard C (user state):** High-frequency key-value lookups. A key-value store (Redis, RocksDB) might be optimal. Homomorphic — preserves entity identity, discards relationships.
- **Shard D (social graph):** Complex traversals, relationship-heavy. This is where a graph database (Neo4j, Neptune, Oxigraph) earns its keep. Nearly isomorphic — preserves structure, adds indexes.
- **Shard E (full-text search):** Keyword queries across descriptions. A search engine (Meilisearch, Elasticsearch) handles this better. Homomorphic — preserves text content, discards graph topology.

```
             ┌─────────────────────────────────────┐
             │   Conceptual Model (the graph)      │
             │   Data Architecture                 │
             │   ── paradigm-agnostic ──           │
             │   (RDF, Property Graph, CSV edges…) │
             └──────────┬──────────────────────────┘
                        │
     ┌──────────────────┼──────────────────┐
     │                  │                  │
  isomorphic      homomorphic        homomorphic
  (lossless)      (projected)        (projected)
     │                  │                  │
┌────▾──────┐    ┌──────▾──────┐    ┌──────▾───────┐
│Text files │    │Column store │    │ Graph DB     │
│in git     │    │Key-value    │    │ (traversal   │
│(source of │    │Search index │    │  queries)    │
│ truth)    │    │(analytics)  │    │              │
└───────────┘    └─────────────┘    └──────────────┘
  Implementation Architecture(s)
  ── access-pattern-driven ──
```

The text files in git remain the source of truth — the isomorphic reference copy of the conceptual model. The various databases are derived, specialized projections, each a homomorphism optimized for its access pattern.

The conceptual model is a graph. The implementation might be five different storage engines, only one of which is a graph database. This is not a failure of architecture — it's the natural consequence of matching storage to access patterns instead of to conceptual structure.
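The routing idea can be sketched as a small dispatcher that answers cold-shard queries straight from files and delegates hot-shard queries to a specialized store. The shard names, the `route_query` helper, and the dispatch rule below are all hypothetical; a real router would classify the query itself rather than take a shard label:

```shell
# Minimal routing sketch over invented shards. Not a real query planner.
cd "$(mktemp -d)"
printf 'ex:PL ex:name "Poland" .\n' > reference.ttl

route_query() {
  shard="$1"; pattern="$2"
  case "$shard" in
    reference) grep -h "$pattern" reference.ttl ;;      # cold: answered from files
    social)    echo "DELEGATE to graph DB: $pattern" ;; # hot: e.g. Oxigraph/Neo4j
    *)         echo "unknown shard: $shard" >&2; return 1 ;;
  esac
}

route_query reference 'ex:PL'   # served straight from the git-tracked file
route_query social 'ex:alice'   # would hit the specialized store instead
```

The point of the sketch: the caller speaks one conceptual vocabulary, while each branch of the `case` is free to use whatever engine its shard needs.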

## The Premature Optimization Trap

Starting with a database before you understand your data is a form of premature optimization:

  1. You commit to a storage engine before knowing your access patterns. Different parts of your graph have different read/write ratios, traversal depths, and latency requirements. You can't know these from a design document — you learn them by running queries against real data.

  2. You lose visibility into your data. Once data lives in a database, understanding it requires query tools, admin consoles, and specialized knowledge. In files, understanding requires `cat` and `grep`.

  3. You lose versioning for free. Git tracks every change to every fact. Database change tracking requires explicit infrastructure (CDC, audit tables, event sourcing).

  4. You make sharding decisions in the dark. Without observing real access patterns, sharding is guesswork. File-based development lets you observe and iterate on shard boundaries cheaply.

  5. You might not need a graph database at all. Your access patterns might reveal that the volatile portion of your data is better served by SQL, a key-value store, or a column database — with the graph model living in text files as a conceptual layer and integration glue. The data architecture is a graph; the implementation is whatever each shard's access pattern demands.

  6. You conflate the conceptual model with the storage model. Just because your domain is best understood as a graph doesn't mean it's best stored as one. The file-first approach forces you to discover this distinction empirically rather than assuming it away.

## The Migration Path

When you do outgrow files, migration is mechanical — not architectural:

| Step | What changes | What stays the same |
|------|--------------|---------------------|
| 1. Identify hot shards | Profile which files are loaded/queried most | Conceptual model, queries, schema |
| 2. Choose storage per shard | Pick a DB engine per access pattern | Text files as source of truth for cold data |
| 3. Load hot shards into DB | `oxigraph load`, `neo4j-admin import`, etc. | Query semantics (SPARQL, Cypher, SQL — each for its shard) |
| 4. Add sync pipeline | Files → DB loader (cron, CI, event-driven) | Git as audit trail |
| 5. Route queries | Hot queries → DB, cold queries → files | Conceptual model unchanged |
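Step 4, the sync pipeline, can start as a few lines of shell. Below, `load_shard` is a hypothetical stand-in for the real bulk loader; the actual invocation (e.g. `oxigraph load`, `neo4j-admin import`) and its flags depend on the engine and version:

```shell
# Sync-pipeline sketch: rebuild a derived store from the files in git.
cd "$(mktemp -d)"
mkdir -p graph/social store
printf 'ex:a ex:follows ex:b .\n' > graph/social/follows.ttl

load_shard() {   # stand-in for a real bulk loader; echoes instead of loading
  for f in "$1"/*.ttl; do
    echo "LOAD $f -> store/"   # real pipeline would invoke the engine's loader here
  done
}

# Idempotent by construction: the store is derived, the files stay authoritative.
load_shard graph/social
```

Run it from cron, CI, or a git post-receive hook; because the store is fully derived, rerunning it after a failure is always safe.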

The key: your conceptual model doesn't change. Your file-based source of truth remains the isomorphic reference. Each database you introduce is a homomorphic projection optimized for specific queries. If a database dies, you rebuild it from files. If access patterns change, you re-project into a different engine. The architecture flexes at the implementation layer without touching the conceptual layer.

## Summary

| Myth | Reality |
|------|---------|
| "Databases are faster" | For small graphs, files are fast enough — and more transparent |
| "I need a graph DB for graphs" | You need graph thinking. The storage engine follows access patterns, not the conceptual shape. |
| "I'll shard when I need to" | File-per-topic IS sharding. You're already doing it. |
| "One database for everything" | Different shards may need different engines (column, KV, graph, search) |
| "Start with the database, optimize later" | Start with files, observe, then pick the right DB per shard |
| "Graph model = graph database" | The conceptual model (data architecture) is independent of the storage model (implementation architecture) |

The file-first approach isn't a toy. It's a disciplined methodology that gives you transparency, versioning, and sharding awareness from day one — qualities that become more valuable, not less, as your system scales.

Start with `cat`. Graduate to `oxigraph query`. Promote hot shards to specialized databases. Keep the rest in git. At every stage, you can see your data, diff your changes, and understand your system.

The conceptual model is a graph. The source of truth is text files. The implementation is whatever each shard needs. That's not a compromise — that's architecture informed by observation instead of assumption.


Further reading: Graph isomorphism · Graph homomorphism · Homomorphism (general) — the mathematical foundations for why conceptual models and implementation models are related but not identical.
