
@arockwell
Created February 22, 2026 05:11

#6776 · Project: emdx · Created: 2026-02-22 05:11 · Tags: architecture, gameplan, active, wiki

Auto-Wiki Architecture — Synthesized Research

Ten parallel Opus delegates researched every aspect of auto-wiki generation from the emdx knowledge base. This document synthesizes their findings into an implementable architecture.

The Pipeline

emdx docs (684)
    ↓
1. Entity extraction (already built — 24K entities)
    ↓
2. Leiden community detection on entity co-occurrence graph
   → ~15 broad sections, ~40 wiki pages, fine subtopics
    ↓
3. Privacy filtering (3 layers: regex pre-filter → LLM synthesis gate → post-scan)
    ↓
4. LLM synthesis per cluster (stuff-first, hierarchical fallback for large clusters)
   → Outline phase (cheap) → Write phase → Validate (code preservation check)
    ↓
5. Wiki articles stored as regular emdx docs (tagged wiki-article)
   → Get FTS5, embeddings, links, TUI browsing for free
    ↓
6. Entity index pages (computed views, ~200-300 full pages, ~500 stubs)
    ↓
7. MkDocs + Material theme → static site
   → topics/ for synthesized articles
   → entities/ for glossary pages
   → D3.js graph visualization

Key Design Decisions

| Decision | Choice | Why |
| --- | --- | --- |
| Clustering algorithm | Leiden (leidenalg + python-igraph) | Resolution parameter; guarantees well-connected communities |
| Article storage | Regular emdx documents | Free FTS, links, embeddings, TUI |
| Synthesis model | Sonnet 4.6 (stuff-first) | 95% of clusters fit in one context call |
| Static site | MkDocs Material | Same Python stack, no Node.js |
| Incremental updates | $0 staleness marking on save, pay-per-refresh | User controls cost |
| Privacy | 3-layer: regex → LLM prompt gate → post-scan | Casual remarks filtered during synthesis |

Cost Estimates

  • Full wiki generation (40 articles): $0.45 with Sonnet
  • Monthly maintenance (10-20 docs/day): ~$8-25
  • Entity pages: $0 (computed from existing data)

New CLI Commands

emdx wiki generate --all     # cluster + synthesize everything
emdx wiki generate --stale   # only regenerate outdated articles
emdx wiki status             # show freshness, coverage
emdx wiki serve              # local dev server
emdx wiki export             # static HTML output
emdx wiki cost               # estimate before spending

New DB Tables (4)

  • wiki_topics — cluster definitions with entity fingerprint
  • wiki_topic_members — M:N docs ↔ topics with relevance scores
  • wiki_articles — generation metadata pointing to documents table
  • wiki_article_sources — provenance tracking (which doc versions fed each article)
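
The four tables above can be sketched in SQL. This is a minimal illustration of the relationships described (topics ↔ docs ↔ articles ↔ sources); the column names are assumptions, not the actual emdx schema.

```python
import sqlite3

# Hypothetical schema sketch — column names are illustrative only.
SCHEMA = """
CREATE TABLE wiki_topics (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    entity_fingerprint TEXT NOT NULL    -- hash of the cluster's entity set
);
CREATE TABLE wiki_topic_members (
    topic_id INTEGER REFERENCES wiki_topics(id),
    doc_id INTEGER NOT NULL,            -- references the emdx documents table
    relevance REAL NOT NULL,            -- membership score in [0, 1]
    PRIMARY KEY (topic_id, doc_id)
);
CREATE TABLE wiki_articles (
    id INTEGER PRIMARY KEY,
    topic_id INTEGER REFERENCES wiki_topics(id),
    document_id INTEGER NOT NULL,       -- the article itself is a regular doc
    generated_at TEXT NOT NULL,
    stale INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE wiki_article_sources (
    article_id INTEGER REFERENCES wiki_articles(id),
    doc_id INTEGER NOT NULL,
    doc_version TEXT NOT NULL,          -- content hash of the source at synthesis time
    PRIMARY KEY (article_id, doc_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
tables = {row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")}
```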

Clustering: Leiden on Hybrid Similarity Graph

Primary algorithm: Leiden community detection on a weighted graph combining entity co-occurrence (IDF-weighted Jaccard) + embedding cosine similarity.

Why Leiden:

  • Guarantees well-connected communities (Louvain doesn't)
  • Resolution parameter via CPM lets you control granularity directly
  • Near-linear time, handles 10K+ nodes easily
  • Only new dep: leidenalg + python-igraph (~15MB)
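
A minimal sketch of the hybrid edge weight feeding the Leiden graph: IDF-weighted Jaccard over entity sets blended with embedding cosine similarity. The mixing weight `alpha` and the exact IDF formula are assumptions for illustration, not values from the research.

```python
import math

def idf(df: int, n_docs: int) -> float:
    """Smoothed inverse document frequency for an entity."""
    return math.log(n_docs / (1 + df)) + 1.0

def idf_weighted_jaccard(a: set, b: set, idf_map: dict) -> float:
    """Jaccard over two entity sets, with each entity weighted by its IDF
    so that rare entities contribute more to similarity than common ones."""
    union = a | b
    if not union:
        return 0.0
    inter_w = sum(idf_map.get(e, 0.0) for e in a & b)
    union_w = sum(idf_map.get(e, 0.0) for e in union)
    return inter_w / union_w if union_w else 0.0

def edge_weight(a: set, b: set, cos_sim: float, idf_map: dict,
                alpha: float = 0.6) -> float:
    """Blend entity co-occurrence with embedding cosine similarity;
    the resulting weighted graph is what Leiden partitions."""
    return alpha * idf_weighted_jaccard(a, b, idf_map) + (1 - alpha) * cos_sim
```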

Multi-resolution hierarchy:

  1. Low resolution → ~15 broad wiki sections ("Security", "CLI Architecture")
  2. Medium resolution → ~40 wiki pages
  3. Fine → subtopics within pages

Multi-topic docs: Post-hoc assignment — after Leiden assigns primary clusters, compute each doc's entity overlap with all other clusters. Assign secondary membership if overlap > 30%.
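
The post-hoc assignment step might look like the following sketch; the overlap measure (doc-side overlap, |doc ∩ cluster| / |doc|) is an assumption about how the 30% threshold is interpreted.

```python
def secondary_topics(doc_entities: set, clusters: dict, primary,
                     threshold: float = 0.30) -> list:
    """After Leiden assigns a primary cluster, check the doc's entity
    overlap against every other cluster and return secondary memberships."""
    results = []
    if not doc_entities:
        return results
    for cid, cluster_entities in clusters.items():
        if cid == primary:
            continue
        overlap = len(doc_entities & cluster_entities) / len(doc_entities)
        if overlap > threshold:
            results.append((cid, overlap))
    return sorted(results, key=lambda t: -t[1])
```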

Cluster naming: c-TF-IDF on entities within each cluster, weighted by entity type (proper_noun > tech_term > concept > heading).
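
The c-TF-IDF naming step treats each cluster as one pseudo-document: term frequency within the cluster × inverse cluster frequency, scaled by entity type. The specific type weights below are assumptions for illustration.

```python
import math
from collections import Counter

# Assumed weights reflecting the stated ordering:
# proper_noun > tech_term > concept > heading.
TYPE_WEIGHT = {"proper_noun": 3.0, "tech_term": 2.0, "concept": 1.5, "heading": 1.0}

def ctfidf_labels(cluster_entities: list, all_clusters: list,
                  entity_type: dict, top_n: int = 3) -> list:
    """Rank a cluster's entities by c-TF-IDF to pick candidate names.
    cluster_entities: entity mentions (with repeats) for one cluster;
    all_clusters: the same lists for every cluster."""
    n_clusters = len(all_clusters)
    cluster_df = Counter()            # in how many clusters each entity appears
    for ents in all_clusters:
        cluster_df.update(set(ents))
    tf = Counter(cluster_entities)
    scores = {}
    for ent, count in tf.items():
        idf = math.log(1 + n_clusters / cluster_df[ent])
        weight = TYPE_WEIGHT.get(entity_type.get(ent, "concept"), 1.0)
        scores[ent] = (count / len(cluster_entities)) * idf * weight
    return [e for e, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]]
```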

Synthesis Pipeline (6 steps)

  1. PREPARE — Fetch docs, sort newest-first (mitigates "lost in the middle"), flag superseded docs, pre-extract code/commands/configs programmatically
  2. ROUTE — <150K tokens: single-pass; <400K: hierarchical (2-3 sub-groups); >=400K: map-reduce
  3. OUTLINE (Sonnet, cheap) — Generate article structure with section headings, key points per section, conflict flags, and preservables checklist
  4. WRITE (Sonnet) — Generate full wiki article following the outline, with explicit instructions to preserve all code verbatim
  5. VALIDATE — Programmatically verify all code blocks survived synthesis; check heading structure
  6. SAVE — Store article, tag sources as superseded, auto-wikify links
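
The ROUTE step reduces to a simple threshold check on the cluster's total token count, using the cutoffs stated above:

```python
def route(token_count: int) -> str:
    """Pick a synthesis strategy from the cluster's total token count:
    <150K single-pass, <400K hierarchical (2-3 sub-groups), else map-reduce."""
    if token_count < 150_000:
        return "single-pass"
    if token_count < 400_000:
        return "hierarchical"
    return "map-reduce"
```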

Privacy Filtering (3 Layers)

Layer 1: Pre-Processing (Regex/Rules — Zero Cost)

  • Sensitive data redaction — strip credentials, API keys, internal IPs, home directory paths
  • Temporal content marking — wrap temporal references in markers for LLM to evaluate
  • Internal tooling scrubbing — strip emdx commands, delegate boilerplate, worktree paths
  • Draft scoring — score documents on TODO/WIP/TBD density
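
A toy version of the Layer 1 redaction pass. The patterns below are illustrative only; a production pre-filter would use a vetted ruleset covering many more credential and path formats.

```python
import re

# Illustrative redaction rules: credentials, private-range IPs, home paths.
PATTERNS = [
    (re.compile(r"(?i)(api[_-]?key|token|secret)\s*[=:]\s*\S+"), r"\1=[REDACTED]"),
    (re.compile(r"\b(?:10|192\.168|172\.(?:1[6-9]|2\d|3[01]))(?:\.\d{1,3}){2,3}\b"),
     "[INTERNAL-IP]"),
    (re.compile(r"/(?:home|Users)/\w+"), "~"),
]

def redact(text: str) -> str:
    """Apply each redaction rule in order; zero-cost, runs before any LLM call."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```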

Layer 2: LLM Synthesis Gate (Prompt Engineering)

  • Structured prompt with explicit PRESERVE/FILTER rules and concrete examples
  • Attribution test: "Would a reader unfamiliar with the team find this useful?"
  • Audience-adaptive strictness: --for me / --for wiki / --for docs

Layer 3: Post-Processing (Automated Validation)

  • Re-scan output for leaked credentials, IP addresses, temporal language
  • Flag for human review when error-level flags found

Incremental Regeneration

| Trigger | When | What | Cost |
| --- | --- | --- | --- |
| On-save hook | Every emdx save | Mark affected articles stale (SQL only) | $0 |
| On-demand refresh | emdx wiki generate --stale | Re-synthesize stale articles (fingerprint-gated) | $0.05-0.15/article |
| Periodic rebuild | Weekly / emdx wiki generate --all --force | Full re-cluster, detect drift, selective re-synth | $0.50-2 |

Entity Index Pages

Tiered Approach

| Tier | Criteria | Count | Treatment |
| --- | --- | --- | --- |
| A: Full Page | Clean entity, df >= 5, page_score >= 30 | ~200-300 | Full page with snippets, co-occurrences, timeline |
| B: Stub Page | Clean entity, df 3-4 | ~300-500 | Mini page: list of docs + one-line descriptions |
| C: Index Only | Clean entity, df 2 | ~2,000 | Appears in alphabetical index, links to docs |
| D: Noise | Stopwords, file paths, type annotations, df 1 | ~20,000+ | Excluded |
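
The tier criteria translate directly into a lookup function. One detail the criteria leave open is a clean entity with df >= 5 but page_score < 30; the sketch below demotes such entities to a stub page, which is an assumption.

```python
def tier(entity_clean: bool, df: int, page_score: float) -> str:
    """Assign an entity to an index-page tier from the criteria table.
    'clean' means the entity survived noise filtering (not a stopword,
    file path, or type annotation)."""
    if not entity_clean or df <= 1:
        return "D"   # noise: excluded entirely
    if df >= 5 and page_score >= 30:
        return "A"   # full page
    if df >= 3:
        return "B"   # stub page (includes assumed df>=5, low-score case)
    return "C"       # df == 2: alphabetical index entry only
```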

Static Site: MkDocs Material

  • Same Python stack as emdx — pip install mkdocs-material
  • mkdocs-gen-files + mkdocs-literate-nav for programmatic navigation
  • topics/ for cluster-synthesized articles, entities/ for glossary pages
  • D3.js force-directed graph on a dedicated /graph/ page
  • Mermaid diagrams for per-article 1-hop neighbor mini-graphs

Quality Assessment

  • Free heuristics: compression ratio, entity preservation, section count, code block retention
  • Haiku judge (~$0.001 each): coverage score (1-5) and coherence score (1-5)
  • Source attribution: <!-- wiki-sources: 42, 55 --> comments during synthesis
  • Staleness detection: content hash fingerprint comparison on save
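
Staleness detection as described reduces to a string compare of content hashes. A minimal sketch, assuming the hash is stored per source in wiki_article_sources at synthesis time:

```python
import hashlib

def doc_fingerprint(text: str) -> str:
    """Stable content hash of a document; stored alongside each source
    so the on-save hook can detect drift without any LLM cost."""
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()

def is_stale(stored_fp: str, current_text: str) -> bool:
    """True when the source has changed since the article was synthesized."""
    return stored_fp != doc_fingerprint(current_text)
```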

What Makes This Unique

No existing tool does this end-to-end. Obsidian/Logseq do linking but not synthesis. NotebookLM does synthesis but not persistent wiki generation. The entity co-occurrence graph driving clustering is the differentiator — structured data that LLMs can't derive from raw embeddings alone.

Related Research Docs

  • #6713 — CLI & Workflow Integration Design
  • #6719 — Incremental Regeneration Strategy
  • #6720 — Quality Assessment & Feedback Loops
  • #6722 — Static Site Generation (MkDocs)
  • #6723 — Topic Clustering (Leiden)
  • #6743 — LLM Synthesis Pipeline
  • #6771 — Existing Tools & Projects Survey
  • #6774 — Unified Synthesis
  • #6775 — Privacy Filtering & Content Curation