@gwpl
Last active March 9, 2026 18:59
Git-First Graph Databases: practical patterns for building graph data systems that start as text files and scale to production stores — with principles on data architecture vs. implementation architecture, sharding, and avoiding premature optimization
title: SPARQL Toolchain Patterns — Git-First Graph Databases
status: active
license: CC-BY-4.0

SPARQL Toolchain Patterns

A practical guide to building graph databases that start as text files and scale to production stores — without losing clarity, auditability, or the ability to learn from your own codebase.

This document defines patterns for working with graph data in a way that is transparent, educational, and git-native. While the concrete examples use RDF/SPARQL/Turtle, the principles apply equally to property graphs (Cypher), hypergraphs, or any graph paradigm — the key ideas are about data architecture vs. implementation architecture, not a specific serialization format.

The approach is designed for:

  • Hackers and tinkerers who want to experiment with graph databases without deploying infrastructure
  • Individual developers building small-scale scripts where clarity and transparency matter more than raw performance
  • Teams who want auditable, version-controlled graph data with a clear migration path to production graph databases
  • AI agents and coding assistants that benefit from explicit, readable data flow over opaque APIs

The core insight: Text-based graph files in git are your database. Everything else is a derived artifact. Start simple. Optimize later — only when you have evidence that you need to. (The examples below use Turtle/.ttl — substitute your preferred text-based graph format: N-Triples (.nt), N-Quads (.nq), N3 (.n3), JSON-LD, GraphML, or even plain CSV of edges.)

The scaling philosophy: Focus first on correctness — understand your data, design your schema, shard into coherent files from day one, and prove your access patterns work. Because files are transparent, you naturally see which queries touch which shards, and where read/write hotspots emerge. This visibility — opaque in traditional databases — is what makes sharding decisions, performance profiling, and access control design informed rather than speculative. Once the architecture is proven (working MVP, understood access patterns, validated data flows), then make implementation decisions: which shard belongs in a graph database, which in a column store, which stays in files. Optimization at the right time, with evidence — not premature. (See Scaling, Sharding, and the Premature Optimization Trap.)


Guiding Principles

Five design principles govern all graph data tooling. Every pattern in this document flows from these principles.

P1: Git-Trackable First

All persistent state lives in flat text files committed to git — never in a database, binary store, or external service. Text-based graph serialization files — Turtle (.ttl), N-Triples (.nt), N-Quads (.nq), N3 (.n3) — are the source of truth. Any triplestore (Oxigraph, Fuseki, Neptune, rdflib in-memory) is a derived artifact — like a compiled binary rebuilt from source.

Why: Git gives you versioning, diff, blame, branching, and collaboration for free. A database checkpoint doesn't diff well. A .ttl file does — every triple change shows up in git log -p. You get a complete audit trail of every fact that was added, changed, or removed, and when.

Scaling note: When your TTL files grow beyond what's comfortable in git (hundreds of MB, millions of triples), that's your signal to introduce a persistent triplestore. But by then you'll have clean data, tested queries, and a proven schema — the hard parts are already done. Organizing data as multiple TTL files by topic also gives you a natural sharding strategy that transfers directly to database partitioning later. (See Scaling, Sharding, and the Premature Optimization Trap.)

P2: CLI-First

Prefer shell commands over programmatic APIs. A developer should be able to query or mutate the graph from a terminal with a one-liner. The command itself is the documentation — no IDE, no language runtime, no boilerplate needed to understand what's happening.

Why: Shell commands are composable (|, >, xargs), scriptable, and universally readable. They lower the barrier to entry: anyone who can run oxigraph query --location store/ --query-file q.rq understands the system. When you can cat your data and grep your queries, debugging is trivial.

P3: Literate Programming for Learning

Every file should teach something to the reader. .rq files have literate headers explaining SPARQL concepts. .ttl files have comments explaining RDF patterns. Python files have docstrings explaining why embedded SPARQL was chosen over alternatives. Shell wrappers show the equivalent direct invocation.

Why: Code that teaches is more valuable than code that merely works. A developer unfamiliar with SPARQL should be able to learn the language by reading the query files in your repository. This dramatically lowers the adoption barrier for graph databases — the technology's biggest obstacle isn't capability, it's the learning curve.

P4: Prefer Transparent Tools Over Convenient Ones

Tool preference hierarchy for SPARQL execution:

| Rank | Tool | Strengths | Trade-offs |
|------|------|-----------|------------|
| 1st | Oxigraph CLI | Full SPARQL 1.1 + UPDATE, Rust performance, single binary | Requires ephemeral store (`.gitignore`d) |
| 2nd | Apache Jena CLI (`sparql`, `arq`) | Reads TTL directly (no store needed), zero setup | No store means no UPDATE support |
| 3rd | rdflib `graph.query()` | In-process Python, good for tests and multi-step workflows | Hides SPARQL behind Python; use only when CLI won't work |
| 4th | rdflib graph API (`.triples()`, `.add()`) | Programmatic graph construction, cycle detection | Last resort — no SPARQL learning value |

Why this order: Each step down the hierarchy trades transparency for convenience. Oxigraph and Jena are visible CLI invocations anyone can read and reproduce. rdflib graph.query() still uses SPARQL but hides it inside Python. rdflib's graph API abandons SPARQL entirely — justified only when SPARQL genuinely can't express the operation (e.g., arbitrary graph traversal, cycle detection).

Adapt to your stack: If you prefer a different triplestore (Blazegraph, GraphDB, Fuseki), the same hierarchy applies — CLI over embedded, SPARQL over native API, transparent over opaque.

P5: Mutations via SPARQL UPDATE

State changes use SPARQL UPDATE (DELETE/INSERT WHERE), not programmatic graph.add()/graph.remove(). This keeps mutations in the same language as queries, making the codebase consistently SPARQL-first and teaching UPDATE syntax alongside SELECT.

Why: A developer who reads:

DELETE { ?s :status ?old }
INSERT { ?s :status :acquired }
WHERE  { ?s :forSkill :some_skill ; :status ?old }

learns atomic graph mutation — a transferable skill across any SPARQL-compliant system. A developer who reads graph.remove((s, STATUS, old)) learns one library's API — useful locally, but not portable.


1. TTL Files Are the Source of Truth

Turtle (.ttl) files checked into git are the permanent, versioned persistence layer. Any triplestore is a derived artifact — like a compiled binary rebuilt from source code.

Implications:

  • TTL files are committed to git with full audit trail
  • Store directories are .gitignored (they are build artifacts)
  • Deleting and recreating a store is always safe — TTL has all the data
  • The core workflow is: load → query → act → update TTL → commit → reload
  • TTL files should be human-readable with literate comments explaining the data

Typical project structure:

knowledge/
  ontology.ttl          # Schema / TBox — classes, properties, constraints
  data.ttl              # Instance data / ABox — facts, state, observations
  store/                # ← .gitignore'd — ephemeral triplestore (derived)
scripts/
  queries/
    sparql/             # Standalone .rq files (literate headers)
    *.sh                # Thin shell wrappers
  helpers/
    load_store.sh       # Rebuild store from TTL (idempotent)
  lib/
    namespaces.py       # Single source of truth for namespace URIs

The data flow through git:

 ┌─────────┐     load      ┌────────────┐    query     ┌─────────┐
 │  .ttl   │ ──────────▸   │ Triplestore│ ──────────▸  │ Results │
 │ (git)   │               │ (ephemeral)│              │ (stdout)│
 └─────────┘               └────────────┘              └─────────┘
      ▲                                                      │
      │              act on results, update TTL               │
      └──────────────────────────────────────────────────────┘
                        git add + commit

Every transformation is visible in git log. Every query is a file you can read. Every mutation has a diff you can review.

2. Tool Tiers in Practice

Tier 1: SPARQL via CLI (preferred)

# Oxigraph — load TTL into ephemeral store, then query
oxigraph load --location /tmp/my-store \
  --file knowledge/ontology.ttl --file knowledge/data.ttl
oxigraph query --location /tmp/my-store \
  --query-file scripts/queries/sparql/find_ready_items.rq

# Oxigraph — SPARQL UPDATE (atomic mutation)
oxigraph update --location /tmp/my-store \
  --update-file scripts/queries/sparql/mark_complete.ru

# Jena — reads TTL directly (no store needed, great for quick exploration)
sparql --data knowledge/ontology.ttl --data knowledge/data.ttl \
  --query scripts/queries/sparql/find_ready_items.rq

Why: The command IS the documentation. A new contributor can understand your entire data pipeline by reading shell scripts — no code archaeology required.

When to use Jena over Oxigraph: Quick one-off queries where creating an ephemeral store adds friction. Jena reads TTL directly — zero setup, instant feedback.

Tier 2: Embedded SPARQL in Python (good)

# SPARQL double-negation pattern: "no prerequisite exists that is NOT done"
# This is the standard way to express universal quantification in SPARQL,
# because SPARQL has no FORALL — only EXISTS and NOT EXISTS.
READY_ITEMS_QUERY = """
PREFIX ex:    <https://example.org/ontology#>
PREFIX state: <https://example.org/state#>
PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?item ?label WHERE {
  ?item a ex:Task ; rdfs:label ?label .
  FILTER NOT EXISTS {
    ?item ex:requires ?prereq .
    FILTER NOT EXISTS {
      ?ps state:forItem ?prereq ; state:status state:complete .
    }
  }
}
"""
results = graph.query(READY_ITEMS_QUERY)

Rules for embedded SPARQL:

  • Assign to a named constant (e.g., READY_ITEMS_QUERY), never inline
  • Add a comment block above explaining the SPARQL concept demonstrated
  • Use the same PREFIX declarations as your standalone .rq files
  • Import namespaces from a single shared module — never redefine inline

Tier 3: rdflib graph API (last resort)

from rdflib import Graph, Literal, RDF
graph.add((subject, predicate, Literal("value")))
graph.serialize(destination="output.ttl", format="turtle")

Justified when:

  • Complex graph serialization or format conversion
  • SPARQL UPDATE genuinely can't express the operation
  • Test fixtures that need programmatic graph construction
  • Arbitrary graph traversal (cycle detection, path finding)

Not justified when:

  • Simple property updates → use SPARQL UPDATE
  • Queries → use SPARQL SELECT
  • Anything that could be a .rq file
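As an illustration of the "arbitrary graph traversal" case: here is a minimal cycle-detection sketch over a dependency edge list. It deliberately works on plain tuples rather than an rdflib Graph to stay dependency-free; with rdflib you would build the same adjacency map from `graph.triples((None, EX.requires, None))`. The `requires_edges` data is hypothetical.

```python
# Cycle detection over dependency edges — the kind of arbitrary
# traversal SPARQL 1.1 cannot express, justifying the graph API tier.
def find_cycle(edges):
    """Return one cycle as a list of nodes, or None if the graph is acyclic."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on current path / finished
    color = {}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in adjacency.get(node, []):
            if color.get(nxt, WHITE) == GRAY:          # back edge → cycle found
                return path[path.index(nxt):] + [nxt]
            if color.get(nxt, WHITE) == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for start in list(adjacency):
        if color.get(start, WHITE) == WHITE:
            cycle = visit(start)
            if cycle:
                return cycle
    return None

# Hypothetical ex:requires edges extracted from the graph
requires_edges = [("task_a", "task_b"), ("task_b", "task_c"), ("task_c", "task_a")]
print(find_cycle(requires_edges))  # ['task_a', 'task_b', 'task_c', 'task_a']
```

Once a cycle is found, fixing it is a TTL edit plus a git commit — the traversal only needs the graph API for detection, not for mutation.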

3. Mutation via SPARQL UPDATE

Preferred pattern for state changes (over programmatic graph.add()/remove()):

# =============================================================================
# mark_complete.ru — Record an item status change
# =============================================================================
#
# SPARQL UPDATE CONCEPTS DEMONSTRATED:
#   DELETE/INSERT — atomic replacement of triple values
#   WHERE clause  — scoped modification (only changes matching triples)
#
# USAGE:
#   oxigraph load --location /tmp/my-store \
#     --file knowledge/ontology.ttl --file knowledge/data.ttl
#   oxigraph update --location /tmp/my-store --update-file mark_complete.ru
# =============================================================================

PREFIX state: <https://example.org/state#>
PREFIX ex:    <https://example.org/ontology#>
PREFIX xsd:   <http://www.w3.org/2001/XMLSchema#>

DELETE { ?s state:status ?oldStatus }
INSERT { ?s state:status state:complete ;
         state:completedDate "2025-01-15"^^xsd:date }
WHERE  {
  ?s a state:ItemState ;
     state:forItem ex:some_task ;
     state:status ?oldStatus .
}

Why SPARQL UPDATE over Python API:

  • Same language for reads and writes — one mental model
  • Atomic — DELETE and INSERT happen together
  • Auditable — the .ru file is committed to git, diffable, reviewable
  • Transferable — works with any SPARQL 1.1 compliant store
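One practical wrinkle in the `.ru` file above is the hardcoded date. A small templating helper can substitute it at run time while keeping the update auditable text. This is a hypothetical sketch, not part of the patterns above — `render_update` and the `$item`/`$today` placeholders are my own names (SPARQL variables use `?`, so they don't collide with Python's `$`-delimited templates):

```python
from datetime import date
from string import Template

# Template the hardcoded date out of a mark_complete-style update.
# $item and $today are substituted before handing the text to the store.
UPDATE_TEMPLATE = Template("""\
PREFIX state: <https://example.org/state#>
PREFIX ex:    <https://example.org/ontology#>
PREFIX xsd:   <http://www.w3.org/2001/XMLSchema#>

DELETE { ?s state:status ?oldStatus }
INSERT { ?s state:status state:complete ;
         state:completedDate "$today"^^xsd:date }
WHERE  {
  ?s a state:ItemState ;
     state:forItem ex:$item ;
     state:status ?oldStatus .
}
""")

def render_update(item, today=None):
    """Render the update for one item; defaults to today's date."""
    today = today or date.today().isoformat()
    return UPDATE_TEMPLATE.substitute(item=item, today=today)

print(render_update("some_task", today="2025-01-15"))
```

The rendered text can be written to a temp file and passed to `oxigraph update --update-file`, so what executes is still plain, reviewable SPARQL UPDATE.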

4. Standalone SPARQL Query Files

All .rq files follow a literate style with self-documenting headers:

# =============================================================================
# query_name.rq — One-line description of what this query answers
# =============================================================================
#
# PURPOSE: What question does this answer? When would you run it?
#
# KEY SPARQL CONCEPTS DEMONSTRATED:
#   CONCEPT_1 — brief explanation
#   CONCEPT_2 — brief explanation
#
# USAGE (Oxigraph):
#   oxigraph query --location store/ --query-file query_name.rq
#
# USAGE (Jena):
#   sparql --data data.ttl --query query_name.rq
#
# RETURNS: description of result columns
# =============================================================================

PREFIX ex: <https://example.org/ontology#>
# ... query body ...

Why the literate style: Reading a .rq file should teach SPARQL. The header explains concepts; inline comments explain clauses. A developer unfamiliar with SPARQL should be able to learn from query files alone — no textbook required.

Example query inventory (showing how each file teaches a different concept):

| File | SPARQL Concepts Taught |
|------|------------------------|
| `find_ready_items.rq` | NOT EXISTS, nested negation, universal quantification |
| `progress_summary.rq` | COUNT, GROUP BY, OPTIONAL + BIND |
| `dependency_tree.rq` | OPTIONAL for LEFT JOIN, multi-OPTIONAL |
| `item_details.rq` | VALUES for parameterization |
| `gap_analysis.rq` | GROUP_CONCAT, FILTER with BOUND, IN operator |

5. Shell Wrappers

Shell wrappers are intentionally thin — 5-10 lines. They exist to resolve paths so queries work from any directory. The real logic is always in the .rq file.

#!/usr/bin/env bash
# =============================================================================
# find_ready_items.sh — What items are ready to work on?
# =============================================================================
# Thin wrapper. The real logic lives in sparql/find_ready_items.rq
# =============================================================================
set -euo pipefail
REPO_ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
DATA="$REPO_ROOT/knowledge"
QUERY="$REPO_ROOT/scripts/queries/sparql/find_ready_items.rq"

STORE="$DATA/store"
if [ -d "$STORE" ]; then
    oxigraph query --location "$STORE" --query-file "$QUERY"
else
    sparql --data "$DATA/ontology.ttl" --data "$DATA/data.ttl" --query "$QUERY"
fi

Why a wrapper instead of calling sparql directly? Anyone can run ./find_ready_items.sh from any directory without remembering file paths. The wrapper is thin enough that the .rq file remains the real documentation.

6. Namespace Conventions

All RDF namespaces are defined in one place — a shared module that both Python code and humans reference:

# namespaces.py — Single source of truth for all namespace URIs
from rdflib import Namespace

EX      = Namespace("https://example.org/ontology#")
STATE   = Namespace("https://example.org/state#")
PHASE   = Namespace("https://example.org/phases#")
EV      = Namespace("https://example.org/evidence#")

Rules:

  • Python scripts MUST import from the shared module — never redefine inline
  • .rq files use matching PREFIX declarations (kept in sync manually)
  • .ttl files use matching @prefix declarations
  • When you add a namespace, update all three places
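Keeping the three copies in sync manually is the error-prone part. A small helper can at least generate the `PREFIX` and `@prefix` blocks from the Python module, so the headers are produced from one place. A sketch, with plain strings standing in for the rdflib `Namespace` objects in `namespaces.py` (the helper names are hypothetical):

```python
# Generate SPARQL PREFIX and Turtle @prefix blocks from one dict,
# so .rq and .ttl headers can be derived from the Python source of truth.
NAMESPACES = {  # stands in for lib/namespaces.py
    "ex":    "https://example.org/ontology#",
    "state": "https://example.org/state#",
}

def sparql_prefixes(ns):
    """Aligned PREFIX block for .rq files and embedded queries."""
    width = max(len(p) for p in ns) + 1
    return "\n".join(f"PREFIX {p + ':':<{width}} <{uri}>" for p, uri in ns.items())

def turtle_prefixes(ns):
    """Aligned @prefix block for .ttl files (note the trailing dot)."""
    width = max(len(p) for p in ns) + 1
    return "\n".join(f"@prefix {p + ':':<{width}} <{uri}> ." for p, uri in ns.items())

print(sparql_prefixes(NAMESPACES))
# PREFIX ex:    <https://example.org/ontology#>
# PREFIX state: <https://example.org/state#>
```

A pre-commit hook could diff this output against the headers of committed `.rq` and `.ttl` files to catch drift.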

7. The Ephemeral Store Pattern

The triplestore is a derived artifact rebuilt from TTL sources — like make clean && make.

# Rebuild store from TTL sources (idempotent, safe to run anytime)
./scripts/helpers/load_store.sh

# The store directory is git-ignored — it's a build artifact
# Deleting and recreating is ALWAYS safe — TTL files have all the data

# Query
oxigraph query --location knowledge/store/ --query-file my_query.rq

# Mutate
oxigraph update --location knowledge/store/ --update-file my_update.ru

load_store.sh does exactly four things:

  1. Verify TTL source files exist
  2. Clear existing store (rm -rf knowledge/store/)
  3. Load all TTL files (oxigraph load --location store/ --file *.ttl)
  4. Verify with a triple count query

Why ephemeral?

  • The store is never committed to git (P1: Git-Trackable First)
  • git clone gives a working repo — no database provisioning
  • rm -rf store/ is always safe — it's just a cache
  • Rebuilding is fast for small-to-medium graphs (seconds to minutes)
  • When rebuilding becomes slow, that's your signal to introduce persistence

8. Adding New Queries — Checklist

When adding a new SPARQL query:

  1. Write the .rq file with a literate header explaining the SPARQL concepts
  2. Test manually — rebuild store and query:
    ./scripts/helpers/load_store.sh
    oxigraph query --location knowledge/store/ --query-file your_query.rq
    # Or quick test without store (Jena):
    sparql --data knowledge/ontology.ttl --data knowledge/data.ttl --query your_query.rq
  3. Add a shell wrapper if the query will be called from scripts or automation
  4. Document the query in your project's query inventory

When embedding SPARQL in Python:

  1. Name the constant descriptively (e.g., GAP_ANALYSIS_QUERY)
  2. Add literate comments explaining the SPARQL concepts demonstrated
  3. Import namespaces from the shared module
  4. Consider extracting to a standalone .rq file if the query is reusable

Why This Approach Works

For learning and adoption

The biggest barrier to graph database adoption isn't the technology — it's the learning curve. These patterns lower that barrier:

  • Every .rq file teaches SPARQL concepts with inline documentation
  • CLI-first means you can experiment in a terminal, not an IDE
  • TTL in git means you can read your data with cat and track changes with git log
  • No infrastructure to set up — git clone and you're querying

For individual developers and small scripts

You don't need a graph database server to work with graph data:

  • TTL files are just text — edit them with any editor
  • Jena reads TTL directly — sparql --data file.ttl --query q.rq
  • Shell scripts compose naturally — pipe SPARQL results to awk, jq, or anything
  • Git gives you versioning and collaboration for free

For scaling later

When you outgrow text files, migration is straightforward because:

  • Your queries are already SPARQL — they work on any compliant store
  • Your schema is already in TTL — load it into any triplestore
  • Your mutations are already SPARQL UPDATE — same syntax everywhere
  • The only thing that changes is the backend — not the queries, not the data format
  • File-per-topic organization reveals your sharding boundaries before you need to commit to database infrastructure

For a deeper treatment of why file-first is a scaling advantage (not just a starting point), and why different shards may need different storage engines, see Scaling, Sharding, and the Premature Optimization Trap.

For AI agents and automation

Transparent, text-based patterns are ideal for AI-assisted workflows:

  • Agents can read .ttl and .rq files directly — no API calls needed
  • Git diffs show exactly what changed — perfect for review and auditing
  • CLI commands are self-documenting — agents can reproduce any operation
  • The entire data pipeline is visible in the repository

Quick Reference

| Principle | Rule of thumb |
|-----------|---------------|
| P1: Git-Trackable | If it's not in a `.ttl` file committed to git, it doesn't exist |
| P2: CLI-First | If you can't run it from a terminal one-liner, simplify it |
| P3: Literate | If a reader can't learn from your code, add comments until they can |
| P4: Transparent tools | CLI over embedded, SPARQL over native API, readable over convenient |
| P5: SPARQL UPDATE | Mutations in SPARQL, not in Python — same language for reads and writes |

| Task | Preferred approach |
|------|--------------------|
| Query the graph | `oxigraph query` or `sparql --data` (CLI) |
| Mutate the graph | SPARQL UPDATE via `oxigraph update` |
| Store data permanently | Commit `.ttl` to git |
| Rebuild the store | `load_store.sh` (idempotent, safe) |
| Add a new query | Write `.rq` with literate header, add shell wrapper |
| Embed in Python | Named constant + literate comments + shared namespaces |

This document describes patterns developed through practical experience building graph databases with SPARQL, Oxigraph, Apache Jena, and rdflib. The approach prioritizes clarity and learnability over premature optimization — start with text files, graduate to databases when you have evidence you need them.

title: Scaling, Sharding, and the Premature Optimization Trap
subtitle: Companion to: Git-First Graph Database Toolchain Patterns
license: CC-BY-4.0

Scaling, Sharding, and the Premature Optimization Trap

Why starting with text files in git isn't a compromise — it's a strategic advantage that pays off at every scale, including large ones.

This is a companion to SPARQL Toolchain Patterns. That document describes the how. This one addresses the skeptic's question: "But won't I need a real database eventually?"

The short answer: maybe. But you'll be better prepared for that migration — and you might discover you need far less database than you assumed.


Separating Data Architecture from Implementation Architecture

The most important distinction in this entire document:

Data architecture (also: information architecture, conceptual model) is what your data means — entities, relationships, constraints, the shape of your domain. It's the graph you draw on a whiteboard.

Implementation architecture is how that data is stored, queried, and mutated at runtime — which database engine, what indexes, how shards are distributed, what wire protocol carries queries.

These are not the same thing, and conflating them is the root cause of most premature optimization in data systems.

Isomorphisms and Homomorphisms

In discrete mathematics, an isomorphism is a structure-preserving mapping between two structures that is bijective — a perfect, lossless correspondence. A homomorphism is a structure-preserving mapping that may lose information — it preserves edges but may merge nodes.

These concepts clarify the relationship between your conceptual model and its implementations:

  • Conceptual ↔ File-based prototype: Isomorphic. Your TTL/JSON-LD/CSV files represent the graph directly. Every node, edge, and property in the conceptual model has a one-to-one correspondent in the files. Nothing is hidden, nothing is lost. You can reason about the conceptual model by reading the files — they are the model.

  • Conceptual → Production database: Often a homomorphism, not an isomorphism. A column store might flatten relationship edges into denormalized rows. A key-value store might merge node properties into opaque blobs. A search index might discard graph structure entirely, keeping only text fields. Each implementation preserves some structure from the conceptual model but is optimized for a specific access pattern at the cost of others.

  • Conceptual → Hybrid (multiple stores): A family of homomorphisms — each store receives a projection of the conceptual model optimized for its access pattern. The union of all projections reconstructs the full model. The file-based source of truth remains the isomorphic reference copy.

  ┌─────────────────────────────────────┐
  │     Conceptual Model (the graph)    │
  │  = Data Architecture                │
  │  = What your domain means           │
  └────────┬──────────┬────────────┬────┘
           │          │            │
     isomorphic   homomorphic  homomorphic
     (lossless)   (projected)  (projected)
           │          │            │
  ┌────────▾───┐ ┌────▾─────┐ ┌───▾────────┐
  │ Text files │ │ Column   │ │ Graph DB   │
  │ in git     │ │ store    │ │ (traversal │
  │ (source of │ │(analytics│ │  queries)  │
  │  truth)    │ │  queries)│ │            │
  └────────────┘ └──────────┘ └────────────┘
    Implementation Architecture(s)

Why this matters: When you start with files, your implementation is isomorphic to your conceptual model. You can see the whole graph, reason about it directly, and evolve the schema by editing text. When you later introduce databases, you're consciously choosing which homomorphic projections to maintain — and you understand exactly what each projection preserves and what it discards, because you've been working with the full structure all along.
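The distinction can be made concrete on a toy graph. A dependency-free sketch, with plain tuples standing in for parsed triples (the `name`/`knows` predicates are hypothetical):

```python
# A tiny conceptual graph as (subject, predicate, object) triples —
# this list is the isomorphic, file-like representation: nothing is lost.
triples = [
    ("alice", "name",  "Alice"),
    ("alice", "knows", "bob"),
    ("bob",   "name",  "Bob"),
    ("bob",   "knows", "alice"),
]

def kv_projection(triples):
    """Homomorphic projection into a key-value view: keeps entity
    properties, discards relationship edges (like the key-value
    store described above)."""
    kv = {}
    for s, p, o in triples:
        if p != "knows":               # relationship edges are dropped
            kv.setdefault(s, {})[p] = o
    return kv

print(kv_projection(triples))
# {'alice': {'name': 'Alice'}, 'bob': {'name': 'Bob'}}
```

The `knows` edges cannot be recovered from the projection — which is exactly why the triple list (the files) must remain the source of truth, and the key-value view a rebuildable derivative.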


The "Database Will Be Faster" Intuition

Many developers reach for a database early because of an intuition: databases are faster than files. This intuition is correct for large-scale, high-concurrency workloads — and misleading for everything else.

At small scale, files win on what actually matters:

| Concern | Database | Text files in git |
|---------|----------|-------------------|
| Debugging | Opaque internal state; need query logs, explain plans | `cat data.ttl` — the state IS the file |
| Audit trail | Requires explicit change tracking (CDC, audit tables) | `git log -p data.ttl` — free, complete, immutable |
| Reproducibility | Need database snapshots, migration scripts | `git checkout abc123` — exact state at any point |
| Collaboration | Merge conflicts in binary formats are painful | Text is text — `git merge` works naturally |
| Setup cost | Database server, configuration, credentials, backups | `git clone` — done |
| Transparency | State hidden behind query interface | State is readable by humans, agents, grep |
For small-to-medium graphs (thousands to low millions of facts), the file-based approach isn't slower in a way that matters — queries still complete in milliseconds to seconds. What does matter is whether you can understand your data, track changes, and debug problems. Files excel at all three.

Graph Databases and the Memory Wall

Here's something that surprises people who haven't operated graph databases at scale: most graph databases are memory-hungry.

Graph traversal — the operation that makes graph databases powerful — requires random access across the entire graph. Unlike SQL queries that scan rows sequentially, graph queries jump between nodes following edges. This access pattern is fundamentally hostile to disk-based storage:

  • In-memory stores (e.g., rdflib, some Neo4j configurations) need the entire graph in RAM. Fast, but your dataset is capped by available memory.
  • Disk-backed stores (e.g., Oxigraph/RocksDB, Blazegraph, JanusGraph) use indexes to reduce disk I/O, but complex multi-hop queries still thrash the page cache.
  • Distributed stores (e.g., Neptune, Stardog cluster, NebulaGraph) shard across nodes, but cross-shard traversals add network latency.

The consequence: a growing graph workload often outgrows a single affordable server surprisingly fast. You don't gently run out of room — you hit a cliff where queries that took milliseconds suddenly take seconds, because the working set no longer fits in memory.

This is where the file-based starting point becomes strategically valuable.

File-Based Sharding: A Free Scaling Blueprint

When you organize your graph data as multiple text files from the beginning, you are — whether you realize it or not — defining a sharding strategy.

knowledge/
  ontology.ttl              # Schema (rarely changes, always needed)
  users/
    user-001.ttl            # One file per entity or entity group
    user-002.ttl
  products/
    electronics.ttl         # Sharded by domain/category
    clothing.ttl
  events/
    2025-Q1.ttl             # Sharded by time period
    2025-Q2.ttl
  relationships/
    purchases.ttl           # Edges between entity types
    reviews.ttl

Each file is a shard. And because it's plain text in git, you can immediately see:

  • What data lives together — files that are loaded together for a query are a natural shard boundary
  • What changes frequentlygit log --oneline events/ vs git log --oneline ontology.ttl
  • What's largewc -l knowledge/**/*.ttl | sort -n
  • What depends on what — which files does each query need?

This visibility is priceless. In a database, sharding decisions are infrastructure concerns hidden from developers. In files, they're visible directory structures that anyone can reason about.
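Sharding by time period, as in the `events/` directory above, is easy to make deterministic with a tiny helper. A sketch assuming the quarter-per-file layout shown (the function name is hypothetical):

```python
from datetime import date

def event_shard(event_date: date) -> str:
    """Map an event date to its shard file, matching events/2025-Q1.ttl."""
    quarter = (event_date.month - 1) // 3 + 1
    return f"events/{event_date.year}-Q{quarter}.ttl"

print(event_shard(date(2025, 2, 14)))   # events/2025-Q1.ttl
print(event_shard(date(2025, 11, 3)))   # events/2025-Q4.ttl
```

Because the same rule later becomes your database partition key, the helper doubles as documentation of the sharding strategy.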

Selective Loading

The file-per-shard pattern enables selective loading — only load what the current query needs:

# Query about electronics products — only load relevant shards
oxigraph load --location /tmp/store \
  --file knowledge/ontology.ttl \
  --file knowledge/products/electronics.ttl \
  --file knowledge/relationships/reviews.ttl

oxigraph query --location /tmp/store --query-file electronics_reviews.rq

Compare this to a monolithic database where every query potentially touches the entire graph. With files, the developer explicitly chooses what data to load. This is slower for a cold start but:

  • Forces you to understand your data dependencies
  • Makes query scope visible and auditable
  • Naturally limits memory usage to relevant subsets
  • Documents which parts of the graph interact
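The shard dependencies of each query can themselves be recorded as data, so wrappers build the load command instead of hard-coding file lists. A dependency-free sketch (the `QUERY_SHARDS` mapping and query name are hypothetical; it only constructs the argument list, it does not invoke oxigraph):

```python
# Declare which shards each query needs; wrappers derive the load
# command from this map instead of duplicating --file arguments.
QUERY_SHARDS = {
    "electronics_reviews": [
        "knowledge/ontology.ttl",
        "knowledge/products/electronics.ttl",
        "knowledge/relationships/reviews.ttl",
    ],
}

def load_command(query_name, store="/tmp/store"):
    """Build the oxigraph load invocation for one query's shards."""
    cmd = ["oxigraph", "load", "--location", store]
    for shard in QUERY_SHARDS[query_name]:
        cmd += ["--file", shard]
    return cmd

print(" ".join(load_command("electronics_reviews")))
```

As a side effect, the mapping documents "what depends on what" from the bullet list above in a form both humans and scripts can read.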

The 90/10 Insight: Not All Data Needs a Database

Once you've been working with file-sharded graphs for a while, a pattern emerges: most of your data is stable, and only a small fraction is volatile.

| Data type | Example | Access pattern | File-based? |
|-----------|---------|----------------|-------------|
| Schema / ontology | Class definitions, property constraints | Read-only, loaded once | Perfect in files |
| Reference data | Country codes, category taxonomies | Read-only, rarely updated | Perfect in files |
| Historical records | Past events, completed transactions | Append-only, queried occasionally | Good in files |
| Active state | Current user sessions, live metrics | High read/write frequency | Needs a database |
| Hot relationships | Real-time recommendations, live feeds | High traversal frequency | Needs a database |

In many real-world systems, 90% of the graph is cold data — rarely written, queried in predictable patterns. Only 10% or less is truly volatile, requiring the low-latency random writes that databases provide.

This means the answer to "should I use a database?" is often not "yes" or "no" — it's "yes, but only for this specific shard."
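The hot/cold split can be estimated directly from git, for example by counting commits per shard with `git log --oneline -- <file> | wc -l`. A sketch that works on already-collected counts so it stays runnable without a repo (the threshold and sample data are hypothetical, not a recommendation):

```python
# Commits-per-shard counts, e.g. collected with:
#   git log --oneline -- <file> | wc -l
commit_counts = {
    "ontology.ttl":       3,    # schema: nearly frozen
    "events/2025-Q1.ttl": 12,   # historical: append-only, now quiet
    "state/sessions.ttl": 340,  # active state: churns constantly
}

def classify(counts, hot_threshold=100):
    """Split shards into file-friendly 'cold' and database-candidate 'hot'."""
    hot = {f for f, n in counts.items() if n >= hot_threshold}
    cold = set(counts) - hot
    return hot, cold

hot, cold = classify(commit_counts)
print("needs a database:", sorted(hot))   # ['state/sessions.ttl']
print("fine in files:   ", sorted(cold))
```

This is exactly the evidence-driven decision the companion document argues for: the database is introduced per shard, when the churn data says so.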

Hybrid Architecture: Conceptual Graph, Heterogeneous Storage

This is where the isomorphism/homomorphism distinction (from the opening section) becomes practical.

Your conceptual model is a graph — entities connected by relationships. You can model it in any graph paradigm: RDF triples (SPARQL), property graphs (Cypher/Gremlin), hypergraphs, or even edge lists in CSV. The conceptual model is paradigm-agnostic — it's the meaning of your data.

Your implementation is where performance profiling, access patterns, and operational constraints dictate storage choices. And here's the key insight: different shards of the same conceptual graph may need entirely different storage engines.

Once you've identified your access patterns through file-based prototyping, you might discover:

- **Shard A (ontology + reference data):** Rarely changes, always loaded. Keep in text files. Load into an in-memory graph at startup. Isomorphic to the conceptual model.
- **Shard B (event log):** Append-heavy, time-series queries. A column store (ClickHouse, DuckDB) might serve this better than a graph database. Homomorphic — preserves temporal ordering, flattens graph structure.
- **Shard C (user state):** High-frequency key-value lookups. A key-value store (Redis, RocksDB) might be optimal. Homomorphic — preserves entity identity, discards relationships.
- **Shard D (social graph):** Complex traversals, relationship-heavy. This is where a graph database (Neo4j, Neptune, Oxigraph) earns its keep. Nearly isomorphic — preserves structure, adds indexes.
- **Shard E (full-text search):** Keyword queries across descriptions. A search engine (Meilisearch, Elasticsearch) handles this better. Homomorphic — preserves text content, discards graph topology.

```
             ┌─────────────────────────────────────┐
             │   Conceptual Model (the graph)      │
             │   Data Architecture                 │
             │   ── paradigm-agnostic ──           │
             │   (RDF, Property Graph, CSV edges…) │
             └──────────┬──────────────────────────┘
                        │
     ┌──────────────────┼──────────────────┐
     │                  │                  │
  isomorphic      homomorphic        homomorphic
  (lossless)      (projected)        (projected)
     │                  │                  │
┌────▾──────┐    ┌──────▾──────┐    ┌──────▾───────┐
│Text files │    │Column store │    │ Graph DB     │
│in git     │    │Key-value    │    │ (traversal   │
│(source of │    │Search index │    │  queries)    │
│ truth)    │    │(analytics)  │    │              │
└───────────┘    └─────────────┘    └──────────────┘
  Implementation Architecture(s)
  ── access-pattern-driven ──
```

The text files in git remain the source of truth — the isomorphic reference copy of the conceptual model. The various databases are derived, specialized projections, each a homomorphism optimized for its access pattern.

The conceptual model is a graph. The implementation might be five different storage engines, only one of which is a graph database. This is not a failure of architecture — it's the natural consequence of matching storage to access patterns instead of to conceptual structure.
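The routing idea can be sketched as a small dispatcher that answers cold-shard queries straight from files and delegates hot-shard queries to a specialized store. The shard names, the `route_query` helper, and the dispatch rule below are all hypothetical; a real router would classify the query itself rather than take a shard label:

```shell
# Minimal routing sketch over invented shards. Not a real query planner.
cd "$(mktemp -d)"
printf 'ex:PL ex:name "Poland" .\n' > reference.ttl

route_query() {
  shard="$1"; pattern="$2"
  case "$shard" in
    reference) grep -h "$pattern" reference.ttl ;;      # cold: answered from files
    social)    echo "DELEGATE to graph DB: $pattern" ;; # hot: e.g. Oxigraph/Neo4j
    *)         echo "unknown shard: $shard" >&2; return 1 ;;
  esac
}

route_query reference 'ex:PL'   # served straight from the git-tracked file
route_query social 'ex:alice'   # would hit the specialized store instead
```

The point of the sketch: the caller speaks one conceptual vocabulary, while each branch of the `case` is free to use whatever engine its shard needs.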

## The Premature Optimization Trap

Starting with a database before you understand your data is a form of premature optimization:

  1. You commit to a storage engine before knowing your access patterns. Different parts of your graph have different read/write ratios, traversal depths, and latency requirements. You can't know these from a design document — you learn them by running queries against real data.

  2. You lose visibility into your data. Once data lives in a database, understanding it requires query tools, admin consoles, and specialized knowledge. In files, understanding requires `cat` and `grep`.

  3. You lose versioning for free. Git tracks every change to every fact. Database change tracking requires explicit infrastructure (CDC, audit tables, event sourcing).

  4. You make sharding decisions in the dark. Without observing real access patterns, sharding is guesswork. File-based development lets you observe and iterate on shard boundaries cheaply.

  5. You might not need a graph database at all. Your access patterns might reveal that the volatile portion of your data is better served by SQL, a key-value store, or a column database — with the graph model living in text files as a conceptual layer and integration glue. The data architecture is a graph; the implementation is whatever each shard's access pattern demands.

  6. You conflate the conceptual model with the storage model. Just because your domain is best understood as a graph doesn't mean it's best stored as one. The file-first approach forces you to discover this distinction empirically rather than assuming it away.

## The Migration Path

When you do outgrow files, migration is mechanical — not architectural:

| Step | What changes | What stays the same |
|------|--------------|---------------------|
| 1. Identify hot shards | Profile which files are loaded/queried most | Conceptual model, queries, schema |
| 2. Choose storage per shard | Pick a DB engine per access pattern | Text files as source of truth for cold data |
| 3. Load hot shards into DB | `oxigraph load`, `neo4j-admin import`, etc. | Query semantics (SPARQL, Cypher, SQL — each for its shard) |
| 4. Add sync pipeline | Files → DB loader (cron, CI, event-driven) | Git as audit trail |
| 5. Route queries | Hot queries → DB, cold queries → files | Conceptual model unchanged |
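Step 4, the sync pipeline, can start as a few lines of shell. Below, `load_shard` is a hypothetical stand-in for the real bulk loader; the actual invocation (e.g. `oxigraph load`, `neo4j-admin import`) and its flags depend on the engine and version:

```shell
# Sync-pipeline sketch: rebuild a derived store from the files in git.
cd "$(mktemp -d)"
mkdir -p graph/social store
printf 'ex:a ex:follows ex:b .\n' > graph/social/follows.ttl

load_shard() {   # stand-in for a real bulk loader; echoes instead of loading
  for f in "$1"/*.ttl; do
    echo "LOAD $f -> store/"   # real pipeline would invoke the engine's loader here
  done
}

# Idempotent by construction: the store is derived, the files stay authoritative.
load_shard graph/social
```

Run it from cron, CI, or a git post-receive hook; because the store is fully derived, rerunning it after a failure is always safe.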

The key: your conceptual model doesn't change. Your file-based source of truth remains the isomorphic reference. Each database you introduce is a homomorphic projection optimized for specific queries. If a database dies, you rebuild it from files. If access patterns change, you re-project into a different engine. The architecture flexes at the implementation layer without touching the conceptual layer.

## Summary

| Myth | Reality |
|------|---------|
| "Databases are faster" | For small graphs, files are fast enough — and more transparent |
| "I need a graph DB for graphs" | You need graph thinking. The storage engine follows access patterns, not the conceptual shape. |
| "I'll shard when I need to" | File-per-topic IS sharding. You're already doing it. |
| "One database for everything" | Different shards may need different engines (column, KV, graph, search) |
| "Start with the database, optimize later" | Start with files, observe, then pick the right DB per shard |
| "Graph model = graph database" | The conceptual model (data architecture) is independent of the storage model (implementation architecture) |

The file-first approach isn't a toy. It's a disciplined methodology that gives you transparency, versioning, and sharding awareness from day one — qualities that become more valuable, not less, as your system scales.

Start with `cat`. Graduate to `oxigraph query`. Promote hot shards to specialized databases. Keep the rest in git. At every stage, you can see your data, diff your changes, and understand your system.

The conceptual model is a graph. The source of truth is text files. The implementation is whatever each shard needs. That's not a compromise — that's architecture informed by observation instead of assumption.


Further reading: Graph isomorphism · Graph homomorphism · Homomorphism (general) — the mathematical foundations for why conceptual models and implementation models are related but not identical.
