
@arockwell
Created February 22, 2026 05:11

#6776 · Project: emdx · Created: 2026-02-22 05:11 · Tags: architecture, gameplan, active, wiki

Auto-Wiki Architecture — Synthesized Research

Ten parallel Opus delegates researched every aspect of auto-wiki generation from the emdx knowledge base. This document synthesizes their findings into an implementable architecture.

The Pipeline

emdx docs (684)
    ↓
1. Entity extraction (already built — 24K entities)
    ↓
2. Leiden community detection on entity co-occurrence graph
   → ~15 broad sections, ~40 wiki pages, fine subtopics
    ↓
3. Privacy filtering (3 layers: regex pre-filter → LLM synthesis gate → post-scan)
    ↓
4. LLM synthesis per cluster (stuff-first, hierarchical fallback for large clusters)
   → Outline phase (cheap) → Write phase → Validate (code preservation check)
    ↓
5. Wiki articles stored as regular emdx docs (tagged wiki-article)
   → Get FTS5, embeddings, links, TUI browsing for free
    ↓
6. Entity index pages (computed views, ~200-300 full pages, ~500 stubs)
    ↓
7. MkDocs + Material theme → static site
   → topics/ for synthesized articles
   → entities/ for glossary pages
   → D3.js graph visualization

Key Design Decisions

| Decision | Choice | Why |
| --- | --- | --- |
| Clustering algorithm | Leiden (leidenalg + python-igraph) | Resolution parameter; guarantees well-connected communities |
| Article storage | Regular emdx documents | Free FTS, links, embeddings, TUI |
| Synthesis model | Sonnet 4.6 (stuff-first) | 95% of clusters fit in one context call |
| Static site | MkDocs Material | Same Python stack, no Node.js |
| Incremental updates | $0 staleness marking on save, pay-per-refresh | User controls cost |
| Privacy | 3-layer: regex → LLM prompt gate → post-scan | Casual remarks filtered during synthesis |

Cost Estimates

  • Full wiki generation (40 articles): $0.45 with Sonnet
  • Monthly maintenance (10-20 docs/day): ~$8-25
  • Entity pages: $0 (computed from existing data)

New CLI Commands

emdx wiki generate --all     # cluster + synthesize everything
emdx wiki generate --stale   # only regenerate outdated articles
emdx wiki status             # show freshness, coverage
emdx wiki serve              # local dev server
emdx wiki export             # static HTML output
emdx wiki cost               # estimate before spending

New DB Tables (4)

  • wiki_topics — cluster definitions with entity fingerprint
  • wiki_topic_members — M:N docs ↔ topics with relevance scores
  • wiki_articles — generation metadata pointing to documents table
  • wiki_article_sources — provenance tracking (which doc versions fed each article)
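
The four tables above can be sketched in SQL. This is a minimal illustration of the relationships described (topics ↔ docs ↔ articles ↔ sources); the column names are assumptions, not the actual emdx schema.

```python
import sqlite3

# Hypothetical schema sketch — column names are illustrative only.
SCHEMA = """
CREATE TABLE wiki_topics (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    entity_fingerprint TEXT NOT NULL    -- hash of the cluster's entity set
);
CREATE TABLE wiki_topic_members (
    topic_id INTEGER REFERENCES wiki_topics(id),
    doc_id INTEGER NOT NULL,            -- references the emdx documents table
    relevance REAL NOT NULL,            -- membership score in [0, 1]
    PRIMARY KEY (topic_id, doc_id)
);
CREATE TABLE wiki_articles (
    id INTEGER PRIMARY KEY,
    topic_id INTEGER REFERENCES wiki_topics(id),
    document_id INTEGER NOT NULL,       -- the article itself is a regular doc
    generated_at TEXT NOT NULL,
    stale INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE wiki_article_sources (
    article_id INTEGER REFERENCES wiki_articles(id),
    doc_id INTEGER NOT NULL,
    doc_version TEXT NOT NULL,          -- content hash of the source at synthesis time
    PRIMARY KEY (article_id, doc_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
tables = {row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")}
```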

Clustering: Leiden on Hybrid Similarity Graph

Primary algorithm: Leiden community detection on a weighted graph combining entity co-occurrence (IDF-weighted Jaccard) + embedding cosine similarity.

Why Leiden:

  • Guarantees well-connected communities (Louvain doesn't)
  • Resolution parameter via CPM lets you control granularity directly
  • Near-linear time, handles 10K+ nodes easily
  • Only new dep: leidenalg + python-igraph (~15MB)
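
A minimal sketch of the hybrid edge weight feeding the Leiden graph: IDF-weighted Jaccard over entity sets blended with embedding cosine similarity. The mixing weight `alpha` and the exact IDF formula are assumptions for illustration, not values from the research.

```python
import math

def idf(df: int, n_docs: int) -> float:
    """Smoothed inverse document frequency for an entity."""
    return math.log(n_docs / (1 + df)) + 1.0

def idf_weighted_jaccard(a: set, b: set, idf_map: dict) -> float:
    """Jaccard over two entity sets, with each entity weighted by its IDF
    so that rare entities contribute more to similarity than common ones."""
    union = a | b
    if not union:
        return 0.0
    inter_w = sum(idf_map.get(e, 0.0) for e in a & b)
    union_w = sum(idf_map.get(e, 0.0) for e in union)
    return inter_w / union_w if union_w else 0.0

def edge_weight(a: set, b: set, cos_sim: float, idf_map: dict,
                alpha: float = 0.6) -> float:
    """Blend entity co-occurrence with embedding cosine similarity;
    the resulting weighted graph is what Leiden partitions."""
    return alpha * idf_weighted_jaccard(a, b, idf_map) + (1 - alpha) * cos_sim
```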

Multi-resolution hierarchy:

  1. Low resolution → ~15 broad wiki sections ("Security", "CLI Architecture")
  2. Medium resolution → ~40 wiki pages
  3. Fine → subtopics within pages

Multi-topic docs: Post-hoc assignment — after Leiden assigns primary clusters, compute each doc's entity overlap with all other clusters. Assign secondary membership if overlap > 30%.
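
The post-hoc assignment step might look like the following sketch; the overlap measure (doc-side overlap, |doc ∩ cluster| / |doc|) is an assumption about how the 30% threshold is interpreted.

```python
def secondary_topics(doc_entities: set, clusters: dict, primary,
                     threshold: float = 0.30) -> list:
    """After Leiden assigns a primary cluster, check the doc's entity
    overlap against every other cluster and return secondary memberships."""
    results = []
    if not doc_entities:
        return results
    for cid, cluster_entities in clusters.items():
        if cid == primary:
            continue
        overlap = len(doc_entities & cluster_entities) / len(doc_entities)
        if overlap > threshold:
            results.append((cid, overlap))
    return sorted(results, key=lambda t: -t[1])
```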

Cluster naming: c-TF-IDF on entities within each cluster, weighted by entity type (proper_noun > tech_term > concept > heading).
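
The c-TF-IDF naming step treats each cluster as one pseudo-document: term frequency within the cluster × inverse cluster frequency, scaled by entity type. The specific type weights below are assumptions for illustration.

```python
import math
from collections import Counter

# Assumed weights reflecting the stated ordering:
# proper_noun > tech_term > concept > heading.
TYPE_WEIGHT = {"proper_noun": 3.0, "tech_term": 2.0, "concept": 1.5, "heading": 1.0}

def ctfidf_labels(cluster_entities: list, all_clusters: list,
                  entity_type: dict, top_n: int = 3) -> list:
    """Rank a cluster's entities by c-TF-IDF to pick candidate names.
    cluster_entities: entity mentions (with repeats) for one cluster;
    all_clusters: the same lists for every cluster."""
    n_clusters = len(all_clusters)
    cluster_df = Counter()            # in how many clusters each entity appears
    for ents in all_clusters:
        cluster_df.update(set(ents))
    tf = Counter(cluster_entities)
    scores = {}
    for ent, count in tf.items():
        idf = math.log(1 + n_clusters / cluster_df[ent])
        weight = TYPE_WEIGHT.get(entity_type.get(ent, "concept"), 1.0)
        scores[ent] = (count / len(cluster_entities)) * idf * weight
    return [e for e, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]]
```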

Synthesis Pipeline (6 steps)

  1. PREPARE — Fetch docs, sort newest-first (mitigates "lost in the middle"), flag superseded docs, pre-extract code/commands/configs programmatically
  2. ROUTE — <150K tokens: single-pass; <400K: hierarchical (2-3 sub-groups); >=400K: map-reduce
  3. OUTLINE (Sonnet, cheap) — Generate article structure with section headings, key points per section, conflict flags, and preservables checklist
  4. WRITE (Sonnet) — Generate full wiki article following the outline, with explicit instructions to preserve all code verbatim
  5. VALIDATE — Programmatically verify all code blocks survived synthesis; check heading structure
  6. SAVE — Store article, tag sources as superseded, auto-wikify links
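
The ROUTE step reduces to a simple threshold check on the cluster's total token count, using the cutoffs stated above:

```python
def route(token_count: int) -> str:
    """Pick a synthesis strategy from the cluster's total token count:
    <150K single-pass, <400K hierarchical (2-3 sub-groups), else map-reduce."""
    if token_count < 150_000:
        return "single-pass"
    if token_count < 400_000:
        return "hierarchical"
    return "map-reduce"
```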

Privacy Filtering (3 Layers)

Layer 1: Pre-Processing (Regex/Rules — Zero Cost)

  • Sensitive data redaction — strip credentials, API keys, internal IPs, home directory paths
  • Temporal content marking — wrap temporal references in markers for LLM to evaluate
  • Internal tooling scrubbing — strip emdx commands, delegate boilerplate, worktree paths
  • Draft scoring — score documents on TODO/WIP/TBD density
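
A toy version of the Layer 1 redaction pass. The patterns below are illustrative only; a production pre-filter would use a vetted ruleset covering many more credential and path formats.

```python
import re

# Illustrative redaction rules: credentials, private-range IPs, home paths.
PATTERNS = [
    (re.compile(r"(?i)(api[_-]?key|token|secret)\s*[=:]\s*\S+"), r"\1=[REDACTED]"),
    (re.compile(r"\b(?:10|192\.168|172\.(?:1[6-9]|2\d|3[01]))(?:\.\d{1,3}){2,3}\b"),
     "[INTERNAL-IP]"),
    (re.compile(r"/(?:home|Users)/\w+"), "~"),
]

def redact(text: str) -> str:
    """Apply each redaction rule in order; zero-cost, runs before any LLM call."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```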

Layer 2: LLM Synthesis Gate (Prompt Engineering)

  • Structured prompt with explicit PRESERVE/FILTER rules and concrete examples
  • Attribution test: "Would a reader unfamiliar with the team find this useful?"
  • Audience-adaptive strictness: --for me / --for wiki / --for docs

Layer 3: Post-Processing (Automated Validation)

  • Re-scan output for leaked credentials, IP addresses, temporal language
  • Flag for human review when error-level flags found

Incremental Regeneration

| Trigger | When | What | Cost |
| --- | --- | --- | --- |
| On-save hook | Every emdx save | Mark affected articles stale (SQL only) | $0 |
| On-demand refresh | emdx wiki generate --stale | Re-synthesize stale articles (fingerprint-gated) | $0.05-0.15/article |
| Periodic rebuild | Weekly / emdx wiki generate --all --force | Full re-cluster, detect drift, selective re-synth | $0.50-2 |

Entity Index Pages

Tiered Approach

| Tier | Criteria | Count | Treatment |
| --- | --- | --- | --- |
| A: Full Page | Clean entity, df >= 5, page_score >= 30 | ~200-300 | Full page with snippets, co-occurrences, timeline |
| B: Stub Page | Clean entity, df 3-4 | ~300-500 | Mini page: list of docs + one-line descriptions |
| C: Index Only | Clean entity, df 2 | ~2,000 | Appears in alphabetical index, links to docs |
| D: Noise | Stopwords, file paths, type annotations, df 1 | ~20,000+ | Excluded |
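
The tier criteria translate directly into a lookup function. One detail the criteria leave open is a clean entity with df >= 5 but page_score < 30; the sketch below demotes such entities to a stub page, which is an assumption.

```python
def tier(entity_clean: bool, df: int, page_score: float) -> str:
    """Assign an entity to an index-page tier from the criteria table.
    'clean' means the entity survived noise filtering (not a stopword,
    file path, or type annotation)."""
    if not entity_clean or df <= 1:
        return "D"   # noise: excluded entirely
    if df >= 5 and page_score >= 30:
        return "A"   # full page
    if df >= 3:
        return "B"   # stub page (includes assumed df>=5, low-score case)
    return "C"       # df == 2: alphabetical index entry only
```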

Static Site: MkDocs Material

  • Same Python stack as emdx — pip install mkdocs-material
  • mkdocs-gen-files + mkdocs-literate-nav for programmatic navigation
  • topics/ for cluster-synthesized articles, entities/ for glossary pages
  • D3.js force-directed graph on a dedicated /graph/ page
  • Mermaid diagrams for per-article 1-hop neighbor mini-graphs

Quality Assessment

  • Free heuristics: compression ratio, entity preservation, section count, code block retention
  • Haiku judge (~$0.001 each): coverage score (1-5) and coherence score (1-5)
  • Source attribution: <!-- wiki-sources: 42, 55 --> comments during synthesis
  • Staleness detection: content hash fingerprint comparison on save
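
Staleness detection as described reduces to a string compare of content hashes. A minimal sketch, assuming the hash is stored per source in wiki_article_sources at synthesis time:

```python
import hashlib

def doc_fingerprint(text: str) -> str:
    """Stable content hash of a document; stored alongside each source
    so the on-save hook can detect drift without any LLM cost."""
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()

def is_stale(stored_fp: str, current_text: str) -> bool:
    """True when the source has changed since the article was synthesized."""
    return stored_fp != doc_fingerprint(current_text)
```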

What Makes This Unique

No existing tool does this end-to-end. Obsidian/Logseq do linking but not synthesis. NotebookLM does synthesis but not persistent wiki generation. The entity co-occurrence graph driving clustering is the differentiator — structured data that LLMs can't derive from raw embeddings alone.

Related Research Docs

  • #6713 — CLI & Workflow Integration Design
  • #6719 — Incremental Regeneration Strategy
  • #6720 — Quality Assessment & Feedback Loops
  • #6722 — Static Site Generation (MkDocs)
  • #6723 — Topic Clustering (Leiden)
  • #6743 — LLM Synthesis Pipeline
  • #6771 — Existing Tools & Projects Survey
  • #6774 — Unified Synthesis
  • #6775 — Privacy Filtering & Content Curation