Multi-Layer Memory Architecture for AI Agents: Hot/Warm/Cold storage, Convex multi-site sync, content-aware chunking, and intelligent retrieval
A comprehensive multi-site memory system with Hot/Warm/Cold storage, content-aware chunking, hybrid BM25+dense search, multimodal embeddings, neural reranking, Convex real-time sync, and agent attribution. Powered by local Ollama services on VM210, with cloud augmentation where it wins.
Updated: 2026-02-16
Problem: AI agents forget everything between sessions. Simple RAG with one vector DB fails because:
- Exact matches get lost: client IDs, function names, and error codes drift in semantic space
- Different content needs different processing: code, meetings, images, and charts each need specialized pipelines
- Cloud APIs are too slow: real-time agent decisions can't wait 500ms+ for embeddings
- Context compaction loses critical state: long sessions get summarized, losing important details
- Multi-site agents need shared memory: an agent on a VM and a developer on a laptop shouldn't have separate, diverging knowledge
Solution: Multi-layer memory with intentional tradeoffs at each layer, unified by Convex as the real-time sync backbone.
```
┌──────────────────────┐   ┌────────────────────────┐   ┌───────────────┐
│ Laptop (Adam)        │   │ VM 210 (Xavier)        │   │ Future Agents │
│ Claude Code CLI      │   │ OpenClaw Gateway       │   │ (OCI, edge)   │
│ AgentDB (SQLite)     │   │ AgentDB (SQLite)       │   │               │
│ 2331+ episodes       │   │ Telegram / Control UI  │   │               │
└──────────┬───────────┘   └──────────┬─────────────┘   └───────┬───────┘
           │                          │                         │
           │ episode_embeddings       │ episode_embeddings      │
           │ (Qwen3-Embedding-8B      │ (synced from Convex)    │
           │  768d HNSW hot path)     │                         │
           │                          │                         │
           └─────────────┬────────────┴─────────────────────────┘
                         │
               ┌─────────▼─────────┐
               │   Convex Cloud    │  ← Real-time sync backbone
               │  (agent-memory)   │
               │                   │
               │  episodes         │  ← cross-site episode mirror
               │  approvals        │  ← multi-reviewer queue
               │  tasks            │  ← OpenClaw task bridge
               │  agents           │  ← registry + heartbeat
               │  collaborators    │  ← human agents (Adam + future)
               │  syncCursors      │  ← per-site watermark
               │                   │
               │  6 cron jobs      │  ← autonomous cloud functions
               └─────────┬─────────┘
                         │
               ┌─────────▼─────────┐
               │   Qdrant Cloud    │  ← Warm-layer hybrid search
               │   350k+ vectors   │    BM25 + dense (Gemini 768d)
               │   17 collections  │    + neural reranking
               └───────────────────┘
```
Sync model: Each site writes to local SQLite first (fast, offline-capable), then syncs to Convex in background. Convex is the shared truth for cross-site visibility, dashboards, and approval workflows. Convex cron automatically pushes new episodes to Qdrant for warm-layer search.
```mermaid
flowchart TB
subgraph Hot["HOT LAYER (<350ms)"]
direction LR
HNSW["AgentDB HNSW<br/>Qwen3-Embedding-8B GGUF<br/>768d (truncated from 4096d)<br/>1100+ vectors"]
LOCAL["Local SQLite<br/>Zero network latency"]
end
subgraph Warm["WARM LAYER (50-200ms)"]
direction LR
QDRANT["Qdrant Hybrid<br/>768d Gemini<br/>350k+ vectors"]
BM25["BM25 Sparse<br/>Exact keyword match"]
end
subgraph Cold["COLD LAYER (+120-1500ms)"]
direction LR
RERANK["Neural Reranker<br/>Voyage rerank-2.5 (primary)<br/>Qwen3-Reranker-8B local fallback"]
end
QUERY[Query] --> Hot
Hot -->|"results + miss"| Warm
Warm -->|"top candidates"| Cold
Cold --> FINAL[Final Results]
```
Key insight: Reranking is only applied to Qdrant results, NOT the hot path. Adding 500ms+ to a sub-second path would defeat its purpose.
| Layer | Model | Dimensions | Why | Location |
|---|---|---|---|---|
| Hot (AgentDB HNSW) | Qwen3-Embedding-8B via Ollama | 4096d→768d (Matryoshka truncated) | Local-first low latency on VM210 | VM210 (:11434) |
| Warm (Qdrant) | Gemini gemini-embedding-001 | 768d | Stable cloud baseline for hybrid collections | Cloud |
| Multimodal | Voyage multimodal-3.5 | 1024d→768d | Best operational quality/latency without local VL overhead | Cloud API |
| Convex→Qdrant sync | Gemini gemini-embedding-001 | 768d | Same model as Qdrant to maintain consistency | Convex Action |
Critical rule: NEVER mix embedding models in the same HNSW/collection index. Cross-model cosine similarity is ~0.12 (useless). All hot-path vectors + queries use the same Qwen3-Embedding-8B model.
| Path | Model | Runtime | Dimensions | Notes |
|---|---|---|---|---|
| Hot text path | `qwen3-embedding:8b` | Ollama (:11434) | 4096d→768d | Always-on via systemd, auto-unloads after ~5 min idle |
| Warm Qdrant path | `gemini-embedding-001` | Cloud API | 768d | No change, canonical Qdrant dense vector |
| Multimodal path | `voyage-multimodal-3.5` | Cloud API | 1024d→768d | Script: `/workspace/scripts/voyage-multimodal-embed.sh` |
| Text fallback | `voyage-4` | Cloud API | API-native→768d | Available in `embed.sh` fallback chain |
- Ollama models on VM210: `qwen3-embedding:8b` (4.7GB), `glm-ocr:latest` (2.2GB)
- Local PyTorch VL server at `/opt/vl-embed/` still exists but is on-demand only; too heavy for always-on use (~52s/embedding on CPU)
- Matryoshka truncation remains standard: keep the first 768 dimensions for index compatibility
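The truncation convention above fits in a couple of lines. A minimal sketch assuming jq is available; the `truncate_vec` helper is illustrative, not part of the actual pipeline:

```bash
# Keep the first N dimensions of an embedding and L2-renormalize, so truncated
# vectors stay unit-length for cosine/HNSW search. (Illustrative helper.)
truncate_vec() {   # stdin: JSON array of floats; $1: target dims (default 768)
    jq -c --argjson n "${1:-768}" '
        .[0:$n] as $v
        | ($v | map(. * .) | add | sqrt) as $norm
        | $v | map(. / $norm)'
}
```

For example, `echo '[3,4,0,0]' | truncate_vec 2` keeps the first two dimensions and renormalizes them back to a unit vector.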
| Tier | Latency | When | Model |
|---|---|---|---|
| Primary | 120-350ms typical | Default warm/cold rerank | Voyage rerank-2.5 |
| Fallback | Higher, local CPU-bound | Offline, failover, cost-control | Qwen3-Reranker-8B Q4_K_M |
- Primary script path: `/workspace/lib/ingest/rerank.sh` with `RERANK_BACKEND=voyage`
- Voyage economics: 200M free rerank tokens one-time per account, zero local CPU burn while available
- Local fallback artifact: `/opt/models/` (4.8GB GGUF), served by `llama-server` on `:18200`
- Service unit: `llama-rerank.service` (on-demand, not always-on)
- This local reranker also powers zero-shot classification (BTZSC F1 of 0.72)
Important corrections:
- Gemini Flash is no longer the primary reranker
- Qwen3-VL-Reranker was laptop-only historical context, not current production path
| Query Example | Winner |
|---|---|
| "BatchTool" | BM25 exact match |
| "how to spawn agents" | Dense semantic |
| "client ABC #123" | BM25 exact match |
| "authentication flow" | Dense semantic |
| "error 0x8007001F" | BM25 exact match |
HYBRID SEARCH: Run BOTH, fuse with Reciprocal Rank Fusion (RRF), let reranker decide final order.
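The fusion step is small enough to sketch in awk. `rrf_fuse` below is illustrative (not the actual search script): it takes files of ranked doc IDs, best first, one per line, and uses the conventional k=60 constant:

```bash
# Reciprocal Rank Fusion: each list contributes 1/(k + rank) per document,
# scores are summed across lists, and results are ordered by fused score.
rrf_fuse() {   # args: one or more files of ranked IDs, best first
    awk '
        FNR == 1 { rank = 0 }                     # restart rank for each input list
        { rank++; score[$0] += 1 / (60 + rank) }  # k = 60 (conventional RRF constant)
        END { for (id in score) printf "%.6f %s\n", score[id], id }
    ' "$@" | sort -rn | awk '{ print $2 }'
}
```

A document that appears in both lists generally outranks one that appears high in only one, which is the behavior wanted when fusing BM25 and dense results.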
| Requirement | Convex Solution |
|---|---|
| Multi-writer sync (laptop + VM) | Atomic mutations with (sourceSite, sourceId) dedup |
| Real-time dashboards | Reactive WebSocket queries |
| Autonomous cloud jobs | Built-in cron scheduler (no n8n/bash needed) |
| Approval workflows | HTTP Actions for Telegram webhook callbacks |
| Agent heartbeat | Mutation-based, aggregated by Convex itself |
| Table | Purpose | Key Fields |
|---|---|---|
| episodes | Cross-site episode mirror | sourceId, sourceSite, agentName, agentPlatform, reward, task, approvalStatus |
| approvals | Multi-reviewer approval queue | agentName, actionType, status, priority, reviewers[], telegramMessageId |
| tasks | OpenClaw task bridge | title, agentName, status, assignedTo, approvalId |
| agents | Live registry + heartbeat | agentName, site, platform, status, lastHeartbeat |
| collaborators | Human agents | name, role, telegramChatId, permissions[], notifyOn[] |
| syncCursors | Per-site sync watermark | site, lastSyncedId, lastSyncedAt, totalSynced |
| crm_contacts | CRM contacts | identity fields, enrichment, embedding refs |
| crm_interactions | CRM activity log | channel, timestamp, summary, linkage |
| crm_deals | CRM pipeline deals | stage, value, owner, close window |
| crm_entities | Extracted entities | type, canonical value, provenance |
| crm_relationships | Graph edges between CRM objects | from, to, relation type, confidence |
```mermaid
flowchart LR
subgraph Laptop["Laptop (Claude Code)"]
L_AGENTDB["AgentDB<br/>SQLite"]
L_SYNC["convex-episode-sync.sh"]
end
subgraph VM["VM 210 (Xavier)"]
V_AGENTDB["AgentDB<br/>SQLite"]
V_SYNC["convex-episode-sync.sh"]
end
subgraph Convex["Convex Cloud"]
C_EP["episodes table"]
C_CURSOR["syncCursors table"]
C_CRON["qdrant-episode-sync<br/>cron (every 5 min)"]
end
subgraph Qdrant["Qdrant Cloud"]
Q_HYBRID["agent_memory_hybrid<br/>BM25 + dense"]
end
L_AGENTDB -->|"WHERE id > cursor<br/>AND convex_synced=0"| L_SYNC
L_SYNC -->|"bulkSyncEpisodes<br/>mutation"| C_EP
L_SYNC -->|"updateCursor"| C_CURSOR
V_AGENTDB -->|"WHERE id > cursor<br/>AND convex_synced=0"| V_SYNC
V_SYNC -->|"bulkSyncEpisodes<br/>mutation"| C_EP
V_SYNC -->|"updateCursor"| C_CURSOR
C_CRON -->|"embed via Gemini<br/>upsert named vectors"| Q_HYBRID
```
Dedup strategy: The Convex bulkSyncEpisodes mutation upserts by (sourceSite, sourceId) composite key. Two sites syncing the same logical episode get separate entries (different sourceSite). After sync, local episodes are marked convex_synced = 1.
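One sync pass under this scheme can be sketched as a dry run. Column names, the cursor query, and the echoed mutation call are simplified assumptions; the real convex-episode-sync.sh differs:

```bash
# Dry-run sketch of a local SQLite -> Convex sync pass. Episodes past the
# cursor and not yet synced are batched; Convex dedups on (sourceSite, sourceId).
sync_pass() {   # args: sqlite_db_path source_site
    local db="$1" site="$2" cursor rows count
    cursor=$(sqlite3 "$db" "SELECT COALESCE(MAX(id), 0) FROM episodes WHERE convex_synced = 1")
    rows=$(sqlite3 "$db" "SELECT id, task, reward FROM episodes
                           WHERE id > $cursor AND convex_synced = 0
                           ORDER BY id ASC")      # ID-based cursor (per ADR-0001)
    [ -z "$rows" ] && return 0                    # nothing new to sync
    count=$(printf '%s\n' "$rows" | wc -l | tr -d ' ')
    # The real script would call the Convex mutation here, e.g. via:
    #   npx convex run episodes:bulkSyncEpisodes '<json batch>'
    echo "bulkSyncEpisodes site=$site count=$count"
    sqlite3 "$db" "UPDATE episodes SET convex_synced = 1 WHERE id > $cursor"
}
```

A second pass over the same database finds nothing past the cursor and exits quietly, which is what makes the sync idempotent.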
Convex→Qdrant sync: A Convex Action (qdrantSync.ts) runs every 5 minutes via cron. It:
- Finds episodes not yet in Qdrant (no `qdrantPointId` in metadata)
- Generates Gemini 768d embeddings
- Upserts to `agent_memory_hybrid` with named vectors (`dense` + text for BM25)
- Updates the episode metadata with the Qdrant point ID
| Cron | Interval | Purpose |
|---|---|---|
| `mark-stale-agents` | 5 min | Set agents to "offline" if no heartbeat in 3 min |
| `expire-pending-approvals` | 15 min | Expire approvals past timeout |
| `sync-health-alert` | 30 min | Alert if a site hasn't synced in 1 hour |
| `qdrant-episode-sync` | 5 min | Embed + push new episodes to Qdrant |
| `daily-digest` | Daily 06:00 UTC | Agent activity summary |
| `cleanup-old-episodes` | Weekly (Sun 03:00 UTC) | Archive 90-day-old episodes |
These run in Convex cloud: they work even if both laptop and VM are down.
| Environment | Purpose |
|---|---|
| Dev | Development/testing |
| Prod | Live multi-site sync |
URLs and deploy keys stored in .env files per site, never committed to git.
A dedicated CRM skill now runs as a first-class memory subsystem.
| Layer | Implementation |
|---|---|
| Skill root | /workspace/skills/crm/ |
| Convex backend | crm.ts, crmAdmin.ts, crmClassify.ts, crmDiscover.ts, crmSync.ts |
| Vector retrieval | Qdrant hybrid search for contacts and interactions |
| CLI scripts | contact.sh, deal.sh, interaction.sh, search.sh, plus supporting scripts |
| Automation | OpenClaw cron jobs for auto-embed + auto-classify |
| Visualization | gen-graph.py relationship graph generation |
- CRM data participates in the same hot/warm retrieval philosophy, with Convex as operational truth and Qdrant as semantic retrieval layer.
- Contact and interaction artifacts are embedded and classified automatically, reducing manual CRM hygiene overhead.
- Relationship views are queryable both as vectors and as graph edges for entity-centric investigation.
Before attribution, all episodes had no structured way to identify which agent, platform, or interface produced them. Session ID prefixes (subagent-abc123, session-1770881984) were the only hint.
```sql
ALTER TABLE episodes ADD COLUMN agent_name TEXT DEFAULT 'claude-code';
ALTER TABLE episodes ADD COLUMN agent_platform TEXT DEFAULT 'claude-code-cli';
ALTER TABLE episodes ADD COLUMN agent_interface TEXT DEFAULT 'terminal';
ALTER TABLE episodes ADD COLUMN parent_agent TEXT;
ALTER TABLE episodes ADD COLUMN llm_provider TEXT;
ALTER TABLE episodes ADD COLUMN llm_model TEXT;
ALTER TABLE episodes ADD COLUMN convex_synced INTEGER DEFAULT 0;
ALTER TABLE episodes ADD COLUMN convex_id TEXT;
```

| Agent | agent_name | agent_platform | agent_interface |
|---|---|---|---|
| Xavier (Telegram) | `xavier` | `openclaw` | `telegram` |
| Xavier (Control UI) | `xavier` | `openclaw` | `control-ui` |
| Xavier (subagent) | `xavier-sub-{id}` | `openclaw` | `subagent` |
| Claude Code (Adam) | `claude-code` | `claude-code-cli` | `terminal` |
| Claude Code (subagent) | `cc-sub-{id}` | `claude-code-cli` | `subagent` |
| Claude Flow (via Xavier) | `xavier` | `claude-flow-mcp` | `mcp` |
| Claude Flow (via CC) | `claude-code` | `claude-flow-mcp` | `mcp` |
Key rule: agent_name = top-level actor identity. When Xavier uses Claude Flow MCP, agent_name stays xavier because Xavier initiated the action. agent_platform tracks the execution engine.
Attribution is automatic: hooks detect the environment.

```bash
# In memory-save.sh
if [ -n "${OPENCLAW_SESSION:-}" ]; then
    AGENT_NAME="xavier"
    AGENT_PLATFORM="openclaw"
    AGENT_INTERFACE="${OPENCLAW_INTERFACE:-telegram}"
else
    AGENT_NAME="${AGENT_NAME:-claude-code}"
    AGENT_PLATFORM="${AGENT_PLATFORM:-claude-code-cli}"
    AGENT_INTERFACE="${AGENT_INTERFACE:-terminal}"
fi
```

Xavier runs 24/7 autonomously. High-impact actions (sending messages, financial ops, destructive changes) need human approval. Three layers work together:
| Layer | Purpose | Storage | Speed |
|---|---|---|---|
| A: Episode Metadata | Rich context per episode | AgentDB `episodes.metadata` JSON | Instant (local) |
| B: Convex Tables | Scalable multi-agent/multi-human queue | Convex `approvals` + `tasks` | Real-time (WebSocket) |
| C: Telegram Buttons | Approve/reject via inline keyboard | OpenClaw + Telegram Bot API | Interactive |
```
Xavier wants to send a Slack message
   ↓
├── 1. Write episode with metadata.approval.status="pending"   (Layer A)
├── 2. Create Convex approval + linked task                    (Layer B)
└── 3. Send Telegram message with [Approve] [Reject] [Defer]   (Layer C)
   ↓
Adam taps "Approve" in Telegram
   ↓
├── Telegram callback → Convex HTTP webhook
├── Convex resolves approval + updates task
├── Edit Telegram message: "✅ Approved by Adam at 14:32"
└── Sync back to AgentDB episode metadata
```
| Category | Auto-approve? | Timeout | Notes |
|---|---|---|---|
| `send_message` | Never | 60 min | External comms always need human review |
| `financial` | Never | 120 min | Any money-related action |
| `destructive` | Never | 30 min | Deletes, drops, overwrites |
| `external_api` | If <$0.10 | 15 min | Cost-gated auto-approve |
| `code_commit` | If tests pass | 30 min | CI-gated auto-approve |
| `internal` | Always | N/A | Internal memory ops, no risk |
| `research` | Always | N/A | Read-only, no side effects |
| Backend | Records | Details |
|---|---|---|
| AgentDB (SQLite) | 1,844 episodes | Task trajectories with rewards + attribution |
| ↳ episode_embeddings | 104 | Qwen3-Embedding-8B GGUF 768d hot-path vectors |
| ↳ Context Mesh | 5,007 nodes, 963k edges, 560 concepts | Semantic relationship graph |
| Qdrant (Cloud) | 350,395 vectors | 17 collections (hybrid-enabled) |
| ↳ codebase_hybrid | 323,968 | Indexed source code + documentation |
| ↳ patterns_hybrid | 7,345 | Learned patterns + behaviors |
| ↳ agent_memory_hybrid | 6,716 | Task episodes + context |
| ↳ cortex_hybrid | 5,192 | Knowledge base documents |
| ↳ context_mesh_hybrid | 4,324 | Mesh relationship embeddings |
| ↳ research_hybrid | 1,627 | Research notes + findings |
| ↳ learnings_hybrid | 954 | High-reward episode learnings |
| Cortex (SiYuan) | ~550 documents | 3+1 notebook architecture |
| ↳ WORKSPACE | 142 docs | Active projects, configs |
| ↳ KNOWLEDGE | 233 docs | Stable learnings, patterns |
| ↳ JOURNAL | 149 docs | Daily logs, reflections |
| ↳ ARCHIVE | 26 docs | Completed/retired docs |
| Convex (Cloud) | 11 tables | Real-time multi-site sync layer + CRM operational schema |
| ↳ episodes | ~1,840 synced | Mirrored from all sites |
| ↳ agents | 2 registered | xavier + claude-code |
| Hive-Mind (Local JSON) | Session state | Backup decisions, swarm coordination |
| Backend | Type | Best For |
|---|---|---|
| AgentDB | Local SQLite | Fast writes, episode storage, HNSW hot search, Context Mesh |
| Convex | Cloud reactive DB | Multi-site sync, approval workflows, heartbeat, cron jobs |
| Qdrant | Cloud vector DB | Hybrid BM25+dense search, warm-layer semantic retrieval |
| Cortex | Knowledge base | Human-curated docs, reflections, stable knowledge |
| Hive-Mind | Local JSON | Session backup, swarm decisions, quick persistence |
Vector search finds similar documents. The Context Mesh finds related concepts.
```
Episode A: "Simplified auth from 5 methods to 2"
   │
   ├── evolved_from ──→ Episode B: "Analyzed auth complexity"
   │
   ├── mentions ──────→ Entity: "Google OAuth"
   │
   └── led_to ────────→ Episode C: "Deployed simplified auth"
```
Multi-hop queries the mesh enables:
- "What decisions led to the current auth system?"
- "What other tasks mentioned this client?"
- "What patterns evolved from successful deployments?"
Current mesh: 5,007 nodes, 963,566 edges, and 560 concepts – a rich knowledge graph connecting all episodes.
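A multi-hop question like the first one above can be answered directly against the mesh tables (schema per the Context Mesh section; the `mesh_two_hop` helper and its argument shape are illustrative):

```bash
# Follow mesh_edges two hops out from a starting node and return the contents
# of the nodes reached. (Illustrative sketch against the documented schema.)
mesh_two_hop() {   # args: sqlite_db_path start_node_id
    sqlite3 "$1" "
        SELECT DISTINCT n2.content
          FROM mesh_edges e1
          JOIN mesh_edges e2 ON e2.from_node = e1.to_node
          JOIN mesh_nodes n2 ON n2.id = e2.to_node
         WHERE e1.from_node = $2;"
}
```

Deeper traversals would use a recursive CTE instead of chained joins, but two hops already answers "what did this decision lead to, and what followed from that?"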
```mermaid
flowchart LR
subgraph Save["SAVE"]
HOOK["Stop Hook"]
TASK["Task Complete"]
end
subgraph Sync["SYNC (NEW)"]
CONVEX["Convex<br/>(real-time)"]
ATTR["Attribution<br/>(who/what/where)"]
end
subgraph Index["INDEX"]
AGENTDB["AgentDB<br/>(immediate)"]
QDRANT["Qdrant<br/>(Convex cron 5min)"]
MESH["Mesh Edges<br/>(relationship extraction)"]
end
subgraph Search["SEARCH"]
HOT["Hot Path<br/>(<350ms)"]
HYBRID["Hybrid Search<br/>(50-200ms)"]
end
subgraph Learn["LEARN"]
SONA["SONA<br/>(pattern extraction)"]
CORTEX_SYNC["Cortex Sync<br/>(reward >= 0.65)"]
end
Save --> Sync
Sync --> Index
Index --> Search
Search --> Learn
Learn -->|"improves"| Search
```
Most memory systems apply uniform time decay: older memories get lower scores. This breaks for universal truths, facts that remain valid regardless of age:
- "The project uses TypeScript" (learned 6 months ago) – still true
- "API endpoint is /v1/users" (documented last year) – still true
- "Client prefers async communication" (noted 2 months ago) – still true
The unified-search.sh script implements similarity-gated decay:
| Similarity | Decay | Rationale |
|---|---|---|
| >= 0.85 | None | Universal truth (stable facts) |
| 0.7 - 0.85 | Mild (floor 0.95) | Likely stable |
| < 0.7 | Stronger (floor 0.85) | Time-sensitive context |
| Similarity | Age | Recency Factor |
|---|---|---|
| >=0.85 | Any | 1.0 (no decay) |
| 0.7-0.85 | <7d | 1.0 |
| 0.7-0.85 | 7-30d | 0.98 |
| 0.7-0.85 | 30-90d | 0.96 |
| 0.7-0.85 | >90d | 0.95 |
| <0.7 | <7d | 1.0 |
| <0.7 | 7-30d | 0.95 |
| <0.7 | 30-90d | 0.90 |
| <0.7 | >90d | 0.85 |
Final score: adjusted_score = raw_similarity × recency_factor
- Universal truths don't decay: high-similarity matches (>=0.85) are likely stable facts
- Time-sensitive info decays gracefully: meeting notes and temporary decisions naturally fade
- Nothing is deleted: all memories remain, just with adjusted scores
- Floors prevent total loss: even heavily decayed memories (0.85 floor) remain discoverable
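The two tables above collapse into one small scoring function. `decay_score` is an illustrative sketch, not the actual unified-search.sh code; its buckets follow the tables:

```bash
# Similarity-gated decay: high-similarity hits keep their raw score, lower
# ones decay with age down to a floor (0.95 or 0.85 depending on the band).
decay_score() {   # args: raw_similarity age_in_days
    awk -v s="$1" -v age="$2" 'BEGIN {
        if (s >= 0.85)      f = 1.0                                               # universal truth
        else if (s >= 0.7)  f = age < 7 ? 1.0 : age < 30 ? 0.98 : age < 90 ? 0.96 : 0.95
        else                f = age < 7 ? 1.0 : age < 30 ? 0.95 : age < 90 ? 0.90 : 0.85
        printf "%.4f\n", s * f                                                    # adjusted_score
    }'
}
```

So a 0.9-similarity hit from a year ago keeps its full score, while a 0.6-similarity hit of the same age is floored at 85% of its raw score.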
| Site | Host | Agent | Platform | Interface |
|---|---|---|---|---|
| Laptop | adamkovacs-mbp | Claude Code | claude-code-cli | Terminal |
| VM 210 | ai-agent-primary (Tailscale) | Xavier | OpenClaw | Telegram / Control UI |
Xavier runs on VM210 (Debian 12, 4 vCPU, ~47GB RAM, no GPU) via OpenClaw gateway.
Host context: Proxmox on Ryzen 9 8945HS, 64GB RAM, Radeon 780M iGPU (no ROCm, currently parked).
- Primary model: `anthropic/claude-opus-4-6`
- Fallback models: `openai-codex/gpt-5.3-codex`, `google-gemini-cli/gemini-3-pro-preview`
- Subagent model: `google-gemini-cli/gemini-3-flash-preview`
- Interface: Telegram bot + OpenClaw Control UI
- Memory: Local AgentDB → Convex sync → Qdrant warm search
All inter-site connectivity via Tailscale mesh network. MagicDNS hostnames preferred over IPs (IPs can change on node re-registration). SSH keys and hostnames stored in .env / SSH config, never in docs.
```mermaid
flowchart TB
subgraph Input["INPUT"]
TEXT["Text/Code"]
AUDIO["Audio"]
IMAGE["Images"]
VIDEO["Video"]
PDF["PDFs"]
end
subgraph Processing["PROCESSING"]
subgraph VisionProc["Vision Pipeline"]
VISION_CLASS["Gemini Flash<br/>Vision classify"]
GLMOCR["GLM-OCR via Ollama<br/>Primary OCR"]
VISION_DESC["Gemini Flash<br/>Describe / fallback"]
end
ASSEMBLYAI["AssemblyAI<br/>Universal-3 Pro"]
PDFPLUMBER["pdfplumber / pdftotext"]
end
subgraph Embedding["EMBEDDINGS"]
subgraph TextEmbed["Text (Ollama)"]
QWEN3["qwen3-embedding:8b<br/>4096d→768d"]
end
subgraph CloudEmbed["Cloud"]
GEMINI["Gemini gemini-embedding-001<br/>768d"]
VOYAGE4["Voyage-4<br/>text fallback"]
end
subgraph MMEmbed["Multimodal"]
VOYAGE_MM["Voyage multimodal-3.5<br/>1024d→768d"]
end
BM25["BM25 Sparse"]
end
subgraph Storage["STORAGE"]
HNSW["AgentDB HNSW<br/>(Qwen3 768d, <350ms)"]
CONVEX_STORE["Convex<br/>(real-time sync)"]
QDRANT["Qdrant<br/>(Gemini 768d, hybrid)"]
CORTEX["Cortex<br/>(knowledge)"]
end
Input --> Processing
Processing --> Embedding
Embedding --> Storage
```
Images follow a type-then-process pipeline (ADR-0005):
```
Image → Gemini Flash classify → { text-heavy, chart, photo, diagram }
                   │
    ┌──────────────┼──────────────────┐
    ↓              ↓                  ↓
GLM-OCR via     Gemini Flash       Voyage multimodal-3.5
Ollama          describe           (multimodal vector)
(primary OCR)   (fallback/context)
```
Primary OCR script: /workspace/scripts/ocr-glm.sh
Vision classification script: /workspace/scripts/vision-classify.sh
Fallback OCR for complex/multi-page inputs remains Gemini Flash.
| Tier | Model | Runtime | Notes |
|---|---|---|---|
| Primary | GLM-OCR 0.9B (`glm-ocr:latest`) | Ollama on VM210 | Purpose-built OCR, OmniDocBench 94.62, ~5s warm |
| Fallback | Gemini Flash | Cloud API | Used for complex layouts and multi-page OCR |
| Task | Model/Path | Notes |
|---|---|---|
| Text zero-shot classification | Qwen3-Reranker-8B via reranker endpoint | Shared with local fallback reranker path |
| Vision classification | Gemini Flash via `/workspace/scripts/vision-classify.sh` | Stable for image-type routing and label tasks |
Key ADR rules:
- ADR-0005: Always OCR type-downgrade before text pipeline
- ADR-0006: Always ffmpeg for video frame extraction, not direct VLM
```
Audio → AssemblyAI Universal-3 Pro → Transcript → Text embedding → Qdrant
                                         │
                                         └─→ Cortex (if meeting)
```

```
PDF → pdfplumber → Text chunks → Text embedding → Qdrant
         │             │
         │             └─→ Chunk by headers/paragraphs
         └─→ Table extraction
```
```mermaid
flowchart TD
Q["User Query"] --> PHASE1
subgraph PHASE1["Phase 1: Hot + Backends (parallel)"]
direction LR
EMB["Generate embedding<br/>(Ollama Qwen3)"]
HNSW_S["HNSW search<br/>(AgentDB)"]
AGENTDB_S["AgentDB SQL<br/>(keyword)"]
MESH_S["Context Mesh<br/>(graph)"]
HIVEMIND_S["Hive-Mind<br/>(JSON)"]
end
PHASE1 --> PHASE2
subgraph PHASE2["Phase 2: Qdrant Collections (parallel)"]
direction LR
C1["agent_memory_hybrid"]
C2["patterns_hybrid"]
C3["cortex_hybrid"]
C4["learnings_hybrid"]
C5["...13 collections"]
end
PHASE2 --> PHASE3
subgraph PHASE3["Phase 3: Reranking"]
FUSE["RRF Fusion"]
RERANK["Voyage rerank-2.5<br/>Neural Rerank"]
end
PHASE3 --> RESULT["Final Ranked Results"]
```
Performance (unified-search.sh):
- Phase 1: All 5 backends + embedding generation run concurrently
- Phase 2: All 13 Qdrant collections queried concurrently
- Phase 3: Results fused with RRF, then neural reranking
- Total: ~2-5s wall time for comprehensive cross-backend search
| Domain | Collection | Use Cases |
|---|---|---|
| `sales` | `sales_context_hybrid` | Client interactions, proposals |
| `learning` | `learning_context_hybrid` | Course delivery, feedback |
| `operations` | `operations_context_hybrid` | Internal ops, infrastructure |
| `general` | `agent_memory_hybrid` | Default, cross-domain |
```bash
# Domain-specific search
bash unified-search.sh --domain sales "client ABC proposal"

# Cross-domain search
bash unified-search.sh --domain all "authentication"
```

Claude Code hooks execute sequentially within arrays. Parallelism requires a single wrapper script that uses bash background jobs (`&` + `wait`).
session-start-parallel.sh replaces 9 sequential hooks. All run via & + wait.
Performance: 347s → ~60s.
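The wrapper pattern itself is a few lines of bash. A sketch of the idea (hook names and error handling in the real scripts differ):

```bash
# Launch each command as a background job, then wait for all of them.
# One failing job sets a nonzero exit code without cancelling its siblings.
run_parallel() {   # args: one shell command string per hook
    local pids=() rc=0 cmd p
    for cmd in "$@"; do
        bash -c "$cmd" &
        pids+=("$!")
    done
    for p in "${pids[@]}"; do
        wait "$p" || rc=1
    done
    return "$rc"
}
```

`run_parallel "bash hook-a.sh" "bash hook-b.sh"` (hypothetical hook names) finishes in roughly the time of the slowest hook rather than the sum of all of them.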
session-stop-parallel.sh uses phased execution:
| Phase | Tasks | Parallel? |
|---|---|---|
| 1 | `session-summarize.sh` (gathers git diff, computes reward) | Blocking |
| 2 | `session-sync.sh save`, `learning-capture.sh`, `cortex-session-log.sh`, `cortex-learning-sync.sh` | 4 parallel |
| 3 | `reflection-action-tracker.sh validate-all` | Sequential |
| 3.5 | `reflection-action-tracker.sh store-learning` | Sequential |
| 4 | Convex flush, cleanup | Parallel |
Performance: 289s → ~128s.
- `session-summarize.sh`: gathers git diff + AgentDB episodes, computes honest reward (0.3-0.95)
- `session-sync.sh save`: persists session state
- `learning-capture.sh`: adds structured critique to the most recent episode
- `cortex-session-log.sh`: creates/appends the daily task log in Cortex JOURNAL
- `cortex-learning-sync.sh`: syncs episodes with reward >= 0.65 to Cortex KNOWLEDGE (ID-based cursor, `ORDER BY id ASC`)
- `reflection-action-tracker.sh validate-all`: validates pending behavioral changes
- `reflection-action-tracker.sh store-learning`: pushes validated learnings to Cortex KNOWLEDGE
8 Architecture Decision Records are enforced via adr-enforcement.sh PreToolUse hook:
| ADR | Rule | Enforcement |
|---|---|---|
| ADR-0001 | Never ORDER BY reward DESC in sync scripts |
Hard block |
| ADR-0002 | Never remove Phase 3.5 from stop hooks | Hard block |
| ADR-0003 | Always use _context_hybrid suffix for domain collections |
Hard block |
| ADR-0004 | Never write to deprecated non-hybrid collection names | Hard block |
| ADR-0005 | Always OCR type-downgrade before text pipeline | Advisory |
| ADR-0006 | Always ffmpeg for video frames | Advisory |
| ADR-0007 | Filter checkpoint episodes from active queries | Advisory |
| ADR-0008 | Never delete hook files directly (archive via closure analysis) | Advisory |
| Table | Records | Key Columns |
|---|---|---|
| `mesh_nodes` | 5,007 | id, type, source_id, source, content, metadata |
| `mesh_edges` | 963,566 | from_node, to_node, edge_type, weight |
| `mesh_concepts` | 560 | name (UNIQUE), frequency, category |
| `mesh_evolution` | - | How concepts change over time |
| Edge Type | Description | Creation Logic |
|---|---|---|
| `mentions` | Episode contains concept | Word match: `content LIKE '%concept%'` |
| `led_to` | Sequential in same session | Next episode in session |
| `evolved_from` | Shared 3+ concepts over time | Concept intersection analysis |
| `informed` | Learning influenced decision | Manual/Cortex linking |
| `similar_to` | Semantically similar | Vector similarity search |
Simple but effective: tokenize → filter stopwords → take the top 15 unique terms per episode.
Categories assigned by pattern matching:
- `tool`: tool, script, command, sqlite*, bash*
- `error`: error, fail, bug
- `pattern`: pattern, approach, strategy
- `system`: api, server, database
- `domain`: everything else
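The extraction pipeline above can be sketched as a classic shell one-liner (the stopword list here is abridged for illustration; the real extractor's list differs):

```bash
# Tokenize stdin, lowercase, drop stopwords and empty tokens, then keep the
# 15 most frequent terms. (Illustrative sketch of the mesh concept extractor.)
extract_concepts() {
    tr -cs '[:alnum:]' '\n' \
        | tr '[:upper:]' '[:lower:]' \
        | grep -vxE 'the|a|an|and|or|of|to|in|is|it|for|on|with|that|this' \
        | grep . \
        | sort | uniq -c | sort -rn \
        | head -15 | awk '{ print $2 }'
}
```

Feeding it an episode body yields the concept terms in descending frequency, ready to insert into `mesh_concepts`.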
| Notebook | Docs | Purpose |
|---|---|---|
| WORKSPACE | 142 | Active projects, configs |
| KNOWLEDGE | 233 | Stable learnings, patterns |
| JOURNAL | 149 | Daily logs, reflections |
| ARCHIVE | 26 | Completed/retired docs |
Cortex runs on SiYuan Note with REST API access. Documents flow:
- WORKSPACE → KNOWLEDGE (when learnings stabilize)
- WORKSPACE → ARCHIVE (when projects complete)
- JOURNAL is append-only daily logs
| Service | Models | Purpose | Port | Runtime |
|---|---|---|---|---|
| Ollama | `qwen3-embedding:8b`, `glm-ocr:latest` | Hot-path text embedding + primary OCR | 11434 | systemd, always-on |
| llama-server (`llama-rerank.service`) | Qwen3-Reranker-8B Q4_K_M | Local rerank/classification fallback | 18200 | systemd, on-demand |
| PyTorch VL server (`/opt/vl-embed/`) | Legacy VL embedding stack | On-demand-only backup path | (ad hoc) | manual/on-demand |
Service management:
- Ollama is enabled on boot via systemd
- Ollama auto-unloads idle models after ~5 minutes
- `llama-rerank.service` is intentionally not always-on, to preserve RAM/CPU headroom
| Decision | Rationale | Alternative Considered |
|---|---|---|
| Convex for sync, not custom WebSocket | Reactive queries, atomic mutations, built-in crons, HTTP webhooks | Custom sync server (maintenance burden) |
| Hybrid search by default | Exact matches (IDs, function names) get lost in pure semantic | Dense-only (faster but misses exact) |
| 768d embeddings everywhere | Consistent space, Matryoshka truncation from 4096d | 384d (faster) or 3072d (marginal gain) |
| Local-first with cloud sync | Fast writes, offline-capable, then background sync to Convex | Cloud-first (latency, connectivity dependency) |
| 3-layer approvals | Telegram buttons (UX) + Convex (scale) + metadata (speed) | Single approval table (fragile) |
| Agent attribution at write time | Zero-cost queries by agent; backfilling is unreliable | Runtime inference from session IDs (brittle) |
| Gemini for Qdrant, Qwen3 for HNSW | Qwen3 via Ollama is local/fast for hot path, Gemini remains warm-layer baseline | Single model everywhere (compromise) |
| ID-based sync cursors | Time-based + ORDER BY reward caused infinite loops | Time-based cursor (broken for reward-ordered) |
| No reranking on hot path | Rerank adds 500ms+; hot path budget is <350ms | Rerank everything (defeats hot path) |
| Pre-created hybrid Qdrant collections | Hybrid config can't change after creation | Create on demand (loses hybrid capability) |
| Voyage primary rerank + local fallback | Voyage gives best real-time quality/cost, local reranker preserves independence | Fully local rerank-only path |
| Omission | Reason |
|---|---|
| Real-time streaming search | Batch is sufficient for agent workflows |
| GPU-accelerated inference | MLX on Apple Silicon is sufficient for these model sizes |
| Distributed Qdrant | Single node handles 350k+ vectors with <200ms latency |
| Custom fine-tuned models | Off-the-shelf models perform well enough |
| Supabase | Deprecated 2026-02. DNS failures, all scripts disabled. AgentDB + Convex replace it. |
| Trap | Fix |
|---|---|
| SQL `datetime('now')` vs Unix int `created_at` | `CAST(strftime('%s','now','-Nh') AS INTEGER)` |
| `source .env` fails in hook context | `grep '^VAR=' .env \| cut -d= -f2-` |
| Mixed embedding models in HNSW | All vectors + queries MUST use the same model |
| Time cursor + ORDER BY reward | Use ID-based cursor + `ORDER BY id ASC` |
| Unnamed vector upsert in Qdrant | Use named vectors: `"vector": {"dense": [...]}` |
| Cross-encoder via chat completions | Gibberish; use a generative LLM or a proper rerank API |
| `${var,,}` on macOS Bash 3.2 | `echo "$var" \| tr '[:upper:]' '[:lower:]'` |
| Rerank backend drift | Set `RERANK_BACKEND=voyage` for primary, keep llama fallback reachable |
| SiYuan `moveDocs` without `toPath: "/"` | Silent no-op (API returns success but does nothing) |
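The unnamed-vector trap deserves one concrete shape. A sketch of a correct named-vector upsert body built with jq; the 3-float vector stands in for the real 768d embedding, and the commented URL/env var names are assumptions:

```bash
# Build an upsert body whose vector is keyed by name ("dense"), which is what
# collections created with named vectors expect; a bare array would be rejected.
payload=$(jq -n '{
    points: [{
        id: 12345,
        vector: { dense: [0.12, 0.03, 0.91] },   # named vector, not a bare array
        payload: { sourceSite: "vm210", task: "example episode" }
    }]
}')
# Hypothetical upsert call:
#   curl -X PUT "$QDRANT_URL/collections/agent_memory_hybrid/points" \
#        -H "api-key: $QDRANT_API_KEY" -H "Content-Type: application/json" \
#        -d "$payload"
```

Building the body with `jq -n` rather than string interpolation also avoids quoting bugs when episode text contains quotes or newlines.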
| Service | Port | Auth |
|---|---|---|
| Ollama | 11434 | None (local VM network) |
| llama-server reranker | 18200 | None (local VM network) |
| PyTorch VL server (on-demand) | ad hoc | None (local, manual start) |
Cloud service URLs (Qdrant, Cortex, Convex) stored in .env files per site.
This architecture runs across 2 sites (laptop + VM210), powered by Ollama-first local services plus Voyage AI and Gemini cloud APIs where they materially improve quality or latency: 7,500+ AgentDB episodes with attribution, 350k+ Qdrant vectors, 5,007 mesh nodes with 963k edges, ~550 Cortex documents, a full CRM system, and an 11-table Convex schema, all synced through Convex. Built with Claude Code + OpenClaw.