#6776 Auto-Wiki Architecture — Synthesized Research
Project: emdx
Created: 2026-02-22 05:11
Tags: architecture, gameplan, active, wiki
Ten parallel Opus delegates researched every aspect of auto-wiki generation from the emdx knowledge base. This document synthesizes their findings into an implementable architecture.
emdx docs (684)
↓
1. Entity extraction (already built — 24K entities)
↓
2. Leiden community detection on entity co-occurrence graph
→ ~15 broad sections, ~40 wiki pages, fine subtopics
↓
3. Privacy filtering (3 layers: regex pre-filter → LLM synthesis gate → post-scan)
↓
4. LLM synthesis per cluster (stuff-first, hierarchical fallback for large clusters)
→ Outline phase (cheap) → Write phase → Validate (code preservation check)
↓
5. Wiki articles stored as regular emdx docs (tagged wiki-article)
→ Get FTS5, embeddings, links, TUI browsing for free
↓
6. Entity index pages (computed views, ~200-300 full pages, ~500 stubs)
↓
7. MkDocs + Material theme → static site
→ topics/ for synthesized articles
→ entities/ for glossary pages
→ D3.js graph visualization
| Decision | Choice | Why |
|---|---|---|
| Clustering algorithm | Leiden (leidenalg + python-igraph) | Resolution parameter, guarantees well-connected communities |
| Articles storage | Regular emdx documents | Free FTS, links, embeddings, TUI |
| Synthesis model | Sonnet 4.6 (stuff-first) | 95% of clusters fit in one context call |
| Static site | MkDocs Material | Same Python stack, no Node.js |
| Incremental updates | $0 staleness marking on save, pay-per-refresh | User controls cost |
| Privacy | 3-layer: regex → LLM prompt gate → post-scan | Casual remarks filtered during synthesis |
- Full wiki generation (~40 articles): **$0.45** with Sonnet
- Monthly maintenance (10-20 docs/day): ~$8-25
- Entity pages: $0 (computed from existing data)
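The figures above can be sanity-checked with a back-of-envelope estimator in the spirit of `emdx wiki cost`. The per-token prices, the 4-chars/token rule, and the output ratio below are illustrative assumptions, not the real rate card:

```python
def estimate_cost(docs: list[str],
                  in_price_per_mtok: float = 3.0,
                  out_price_per_mtok: float = 15.0,
                  out_ratio: float = 0.15) -> float:
    """Rough synthesis cost in dollars for a batch of source docs."""
    in_tokens = sum(len(d) for d in docs) / 4   # ~4 chars per token
    out_tokens = in_tokens * out_ratio          # articles compress sources
    return (in_tokens * in_price_per_mtok
            + out_tokens * out_price_per_mtok) / 1_000_000

# e.g. 684 docs averaging ~2K chars each
print(f"${estimate_cost(['x' * 2000] * 684):.2f}")
```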
```
emdx wiki generate --all      # cluster + synthesize everything
emdx wiki generate --stale    # only regenerate outdated articles
emdx wiki status              # show freshness, coverage
emdx wiki serve               # local dev server
emdx wiki export              # static HTML output
emdx wiki cost                # estimate before spending
```

- `wiki_topics` — cluster definitions with entity fingerprint
- `wiki_topic_members` — M:N docs ↔ topics with relevance scores
- `wiki_articles` — generation metadata pointing to the documents table
- `wiki_article_sources` — provenance tracking (which doc versions fed each article)
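A plausible sqlite3 sketch of the four tables; column names beyond the descriptions above are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE wiki_topics (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    entity_fingerprint TEXT NOT NULL    -- hash of member entities
);
CREATE TABLE wiki_topic_members (       -- M:N docs <-> topics
    topic_id INTEGER REFERENCES wiki_topics(id),
    doc_id INTEGER NOT NULL,
    relevance REAL NOT NULL,
    PRIMARY KEY (topic_id, doc_id)
);
CREATE TABLE wiki_articles (            -- points at the documents table
    id INTEGER PRIMARY KEY,
    topic_id INTEGER REFERENCES wiki_topics(id),
    document_id INTEGER NOT NULL,
    generated_at TEXT,
    stale INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE wiki_article_sources (     -- provenance per article
    article_id INTEGER REFERENCES wiki_articles(id),
    source_doc_id INTEGER NOT NULL,
    source_version INTEGER NOT NULL,
    PRIMARY KEY (article_id, source_doc_id)
);
""")
```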
Primary algorithm: Leiden community detection on a weighted graph combining entity co-occurrence (IDF-weighted Jaccard) + embedding cosine similarity.
Why Leiden:
- Guarantees well-connected communities (Louvain doesn't)
- Resolution parameter via CPM lets you control granularity directly
- Near-linear time, handles 10K+ nodes easily
- Only new dep: `leidenalg` + `python-igraph` (~15MB)
Multi-resolution hierarchy:
- Low resolution → ~15 broad wiki sections ("Security", "CLI Architecture")
- Medium resolution → ~40 wiki pages
- Fine → subtopics within pages
Multi-topic docs: Post-hoc assignment — after Leiden assigns primary clusters, compute each doc's entity overlap with all other clusters. Assign secondary membership if overlap > 30%.
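The 30% rule, sketched with toy entity sets (the overlap here is normalized by the doc's own entity count, an assumption the real scorer may refine):

```python
def overlap(doc_entities: set[str], cluster_entities: set[str]) -> float:
    """Fraction of the doc's entities that appear in the cluster."""
    if not doc_entities:
        return 0.0
    return len(doc_entities & cluster_entities) / len(doc_entities)

def secondary_clusters(doc_entities, clusters, primary, threshold=0.30):
    """Clusters (other than the primary) whose overlap exceeds the threshold."""
    return [cid for cid, ents in clusters.items()
            if cid != primary and overlap(doc_entities, ents) > threshold]

clusters = {"security": {"tls", "auth", "sqlite"},
            "cli": {"click", "argparse", "sqlite"}}
doc = {"tls", "auth", "sqlite"}
print(secondary_clusters(doc, clusters, primary="security"))  # → ['cli']
```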
Cluster naming: c-TF-IDF on entities within each cluster, weighted by entity type (proper_noun > tech_term > concept > heading).
- PREPARE — Fetch docs, sort newest-first (mitigates "lost in the middle"), flag superseded docs, pre-extract code/commands/configs programmatically
- ROUTE — <150K tokens: single-pass; <400K: hierarchical (2-3 sub-groups); >=400K: map-reduce
- OUTLINE (Sonnet, cheap) — Generate article structure with section headings, key points per section, conflict flags, and preservables checklist
- WRITE (Sonnet) — Generate full wiki article following the outline, with explicit instructions to preserve all code verbatim
- VALIDATE — Programmatically verify all code blocks survived synthesis; check heading structure
- SAVE — Store article, tag sources as superseded, auto-wikify links
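The VALIDATE step's code-preservation check needs no LLM. A sketch that extracts fenced blocks from the sources and confirms each survives verbatim in the article (the fence regex is a simplification of real markdown parsing):

```python
import re

# Matches a fenced code block and captures its body.
FENCE = re.compile(r"`{3}.*?\n(.*?)`{3}", re.DOTALL)

def code_blocks(text: str) -> list[str]:
    """Extract the bodies of fenced code blocks."""
    return [m.strip() for m in FENCE.findall(text)]

def validate_code_preserved(sources: list[str], article: str) -> list[str]:
    """Return the source code blocks that did NOT survive synthesis."""
    kept = set(code_blocks(article))
    return [block
            for src in sources
            for block in code_blocks(src)
            if block not in kept]
```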
- Sensitive data redaction — strip credentials, API keys, internal IPs, home directory paths
- Temporal content marking — wrap temporal references in markers for LLM to evaluate
- Internal tooling scrubbing — strip emdx commands, delegate boilerplate, worktree paths
- Draft scoring — score documents on TODO/WIP/TBD density
- Structured prompt with explicit PRESERVE/FILTER rules and concrete examples
- Attribution test: "Would a reader unfamiliar with the team find this useful?"
- Audience-adaptive strictness: `--for me` / `--for wiki` / `--for docs`
- Re-scan output for leaked credentials, IP addresses, temporal language
- Flag for human review when error-level flags found
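A toy version of the layer-1 regex pre-filter (the layer-3 post-scan can reuse the same patterns). The pattern list is a small illustration; a real filter needs a much broader set:

```python
import re

# (pattern, replacement) pairs: API keys, internal IPs, home paths.
PATTERNS = [
    (re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"), "[REDACTED-KEY]"),
    (re.compile(r"\b(?:10|192\.168)(?:\.\d{1,3}){2,3}\b"), "[REDACTED-IP]"),
    (re.compile(r"/(?:Users|home)/[\w.-]+"), "~"),
]

def redact(text: str) -> tuple[str, int]:
    """Apply every redaction pattern; return (clean text, hit count)."""
    hits = 0
    for pat, repl in PATTERNS:
        text, n = pat.subn(repl, text)
        hits += n
    return text, hits
```

A nonzero hit count from the post-scan pass is what triggers the human-review flag.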
| Trigger | When | What | Cost |
|---|---|---|---|
| On-save hook | Every `emdx save` | Mark affected articles stale (SQL only) | $0 |
| On-demand refresh | `emdx wiki generate --stale` | Re-synthesize stale articles (fingerprint-gated) | $0.05-0.15/article |
| Periodic rebuild | Weekly / `emdx wiki generate --all --force` | Full re-cluster, detect drift, selective re-synth | $0.50-2 |
| Tier | Criteria | Count | Treatment |
|---|---|---|---|
| A: Full Page | Clean entity, df >= 5, page_score >= 30 | ~200-300 | Full page with snippets, co-occurrences, timeline |
| B: Stub Page | Clean entity, df 3-4 | ~300-500 | Mini page: list of docs + one-line descriptions |
| C: Index Only | Clean entity, df 2 | ~2,000 | Appears in alphabetical index, links to docs |
| D: Noise | Stopwords, file paths, type annotations, df 1 | ~20,000+ | Excluded |
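The tiering rules reduce to a small classifier. `page_score` and the noise flag stand in for the real heuristics, and the treatment of a high-df entity with a low page score (demoted to a stub here) is an assumption the table leaves open:

```python
def tier(entity: str, df: int, page_score: float, is_noise: bool) -> str:
    """Map an entity to its index tier (A/B/C/D) per the table above."""
    if is_noise or df <= 1:
        return "D"  # noise: excluded entirely
    if df >= 5 and page_score >= 30:
        return "A"  # full page with snippets, co-occurrences, timeline
    if df >= 3:
        return "B"  # stub page (also catches high-df, low-score entities)
    return "C"      # alphabetical index only
```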
- Same Python stack as emdx — `pip install mkdocs-material`
- `mkdocs-gen-files` + `mkdocs-literate-nav` for programmatic navigation
- `topics/` for cluster-synthesized articles, `entities/` for glossary pages
- D3.js force-directed graph on a dedicated `/graph/` page
- Mermaid diagrams for per-article 1-hop neighbor mini-graphs
- Free heuristics: compression ratio, entity preservation, section count, code block retention
- Haiku judge (~$0.001 each): coverage score (1-5) and coherence score (1-5)
- Source attribution: `<!-- wiki-sources: 42, 55 -->` comments during synthesis
- Staleness detection: content hash fingerprint comparison on save
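Staleness detection as described reduces to a hash comparison on save; a sketch with a plain dict standing in for the `wiki_articles` table:

```python
import hashlib

def fingerprint(text: str) -> str:
    """Content hash recorded when an article is generated."""
    return hashlib.sha256(text.encode()).hexdigest()

# doc_id -> fingerprint at generation time (toy stand-in for SQL state)
article_sources = {42: fingerprint("original doc body")}

def is_stale(doc_id: int, current_text: str) -> bool:
    """On save: the article is stale if the doc's hash changed."""
    return article_sources.get(doc_id) != fingerprint(current_text)

assert not is_stale(42, "original doc body")
assert is_stale(42, "edited doc body")
```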
No existing tool does this end-to-end. Obsidian/Logseq do linking but not synthesis. NotebookLM does synthesis but not persistent wiki generation. The entity co-occurrence graph driving clustering is the differentiator — structured data that LLMs can't derive from raw embeddings alone.
- #6713 — CLI & Workflow Integration Design
- #6719 — Incremental Regeneration Strategy
- #6720 — Quality Assessment & Feedback Loops
- #6722 — Static Site Generation (MkDocs)
- #6723 — Topic Clustering (Leiden)
- #6743 — LLM Synthesis Pipeline
- #6771 — Existing Tools & Projects Survey
- #6774 — Unified Synthesis
- #6775 — Privacy Filtering & Content Curation