
@adambkovacs
Last active March 13, 2026 02:34
Multi-Layer Memory Architecture for AI Agents - Hot/Warm/Cold storage, Convex multi-site sync, content-aware chunking, and intelligent retrieval

Multi-Layer Memory Architecture for AI Agents

A comprehensive multi-site memory system with hybrid BM25+dense search, multimodal embeddings, neural reranking, Convex real-time sync, and agent attribution, built on Hot/Warm/Cold storage, content-aware chunking, and intelligent retrieval. Powered by local Ollama services on VM210, with cloud augmentation where it wins.

Updated: 2026-02-16


Why This Architecture?

Problem: AI agents forget everything between sessions. Simple RAG with one vector DB fails because:

  • Exact matches get lost - Client IDs, function names, error codes drift in semantic space
  • Different content needs different processing - Code, meetings, images, charts each need specialized pipelines
  • Cloud APIs are too slow - Real-time agent decisions can't wait 500ms+ for embeddings
  • Context compaction loses critical state - Long sessions get summarized, losing important details
  • Multi-site agents need shared memory - An agent on a VM and a developer on a laptop shouldn't have separate, diverging knowledge

Solution: Multi-layer memory with intentional tradeoffs at each layer, unified by Convex as the real-time sync backbone.


Architecture Overview (Updated Feb 2026)

┌───────────────────────┐     ┌─────────────────────────┐     ┌───────────────┐
│  Laptop (Adam)        │     │  VM 210 (Xavier)        │     │ Future Agents │
│  Claude Code CLI      │     │  OpenClaw Gateway       │     │ (OCI, edge)   │
│  AgentDB (SQLite)     │     │  AgentDB (SQLite)       │     │               │
│  2331+ episodes       │     │  Telegram / Control UI  │     │               │
└──────────┬────────────┘     └───────────┬─────────────┘     └───────┬───────┘
           │                              │                           │
           │  episode_embeddings          │  episode_embeddings       │
           │  (Qwen3-Embedding-8B         │  (synced from Convex)     │
           │   768d HNSW hot path)        │                           │
           │                              │                           │
           └─────────────┬────────────────┴───────────────────────────┘
                         │
                 ┌───────▼───────────┐
                 │  Convex Cloud     │  ← Real-time sync backbone
                 │  (agent-memory)   │
                 │                   │
                 │  episodes         │  ← cross-site episode mirror
                 │  approvals        │  ← multi-reviewer queue
                 │  tasks            │  ← OpenClaw task bridge
                 │  agents           │  ← registry + heartbeat
                 │  collaborators    │  ← human agents (Adam + future)
                 │  syncCursors      │  ← per-site watermark
                 │                   │
                 │  6 cron jobs      │  ← autonomous cloud functions
                 └────────┬──────────┘
                          │
                  ┌───────▼──────────┐
                  │  Qdrant Cloud    │  ← Warm-layer hybrid search
                  │  350k+ vectors   │     BM25 + dense (Gemini 768d)
                  │  17 collections  │     + neural reranking
                  └──────────────────┘

Sync model: Each site writes to local SQLite first (fast, offline-capable), then syncs to Convex in background. Convex is the shared truth for cross-site visibility, dashboards, and approval workflows. Convex cron automatically pushes new episodes to Qdrant for warm-layer search.
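
The local-first leg of that loop can be sketched in a few lines. This is an illustrative Python version, not the actual convex-episode-sync.sh; `push_batch` stands in for the Convex bulkSyncEpisodes mutation, and the episode columns shown are a minimal subset:

```python
import sqlite3

def sync_pending_episodes(db_path, push_batch, cursor_start=0, batch_size=100):
    """One pass of the local-first sync loop: read unsynced rows past the
    ID-based cursor, push them, then mark them synced. On push failure,
    nothing is marked, so the next pass retries the same rows."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT id, task, reward FROM episodes "
        "WHERE id > ? AND convex_synced = 0 ORDER BY id ASC LIMIT ?",
        (cursor_start, batch_size),
    ).fetchall()
    if rows:
        push_batch(rows)  # stand-in for the bulkSyncEpisodes mutation
        conn.executemany(
            "UPDATE episodes SET convex_synced = 1 WHERE id = ?",
            [(r[0],) for r in rows],
        )
        conn.commit()
    conn.close()
    return rows[-1][0] if rows else cursor_start  # new ID-based cursor
```

Note the ORDER BY id ASC: ordering by reward or timestamp here is exactly the cursor trap called out later in this doc.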


The Three Memory Layers

flowchart TB
    subgraph Hot["πŸ”₯ HOT LAYER (<350ms)"]
        direction LR
        HNSW["AgentDB HNSW<br/>Qwen3-Embedding-8B GGUF<br/>768d (truncated from 4096d)<br/>1100+ vectors"]
        LOCAL["Local SQLite<br/>Zero network latency"]
    end

    subgraph Warm["🌑️ WARM LAYER (50-200ms)"]
        direction LR
        QDRANT["Qdrant Hybrid<br/>768d Gemini<br/>350k+ vectors"]
        BM25["BM25 Sparse<br/>Exact keyword match"]
    end

    subgraph Cold["❄️ COLD LAYER (+120-1500ms)"]
        direction LR
        RERANK["Neural Reranker<br/>Voyage rerank-2.5 (primary)<br/>Qwen3-Reranker-8B local fallback"]
    end

    QUERY[Query] --> Hot
    Hot -->|"results + miss"| Warm
    Warm -->|"top candidates"| Cold
    Cold --> FINAL[Final Results]

Key insight: Reranking is only applied to Qdrant results, NOT the hot path. Adding 500ms+ to a sub-second path would defeat its purpose.

Embedding Models Per Layer

| Layer | Model | Dimensions | Why | Location |
|---|---|---|---|---|
| Hot (AgentDB HNSW) | Qwen3-Embedding-8B via Ollama | 4096d→768d (Matryoshka truncated) | Local-first low latency on VM210 | VM210 (:11434) |
| Warm (Qdrant) | Gemini gemini-embedding-001 | 768d | Stable cloud baseline for hybrid collections | Cloud |
| Multimodal | Voyage multimodal-3.5 | 1024d→768d | Best operational quality/latency without local VL overhead | Cloud API |
| Convex→Qdrant sync | Gemini gemini-embedding-001 | 768d | Same model as Qdrant to maintain consistency | Convex Action |

Critical rule: NEVER mix embedding models in the same HNSW/collection index. Cross-model cosine similarity is ~0.12 (useless). All hot-path vectors + queries use the same Qwen3-Embedding-8B model.

Embedding Runtime (Current)

| Path | Model | Runtime | Dimensions | Notes |
|---|---|---|---|---|
| Hot text path | qwen3-embedding:8b | Ollama (:11434) | 4096d→768d | Always-on via systemd, auto-unloads after ~5 min idle |
| Warm Qdrant path | gemini-embedding-001 | Cloud API | 768d | No change, canonical Qdrant dense vector |
| Multimodal path | voyage-multimodal-3.5 | Cloud API | 1024d→768d | Script: /workspace/scripts/voyage-multimodal-embed.sh |
| Text fallback | voyage-4 | Cloud API | API-native→768d | Available in embed.sh fallback chain |
  • Ollama models on VM210: qwen3-embedding:8b (4.7GB), glm-ocr:latest (2.2GB)
  • Local PyTorch VL server at /opt/vl-embed/ still exists but is on-demand only; at ~52s per embedding on CPU it is too heavy for always-on use
  • Matryoshka truncation remains standard: keep first 768 dimensions for index compatibility
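
Matryoshka truncation itself is mechanically simple: keep the leading dimensions, then L2-renormalize so cosine similarity stays well-defined in the truncated space. A minimal sketch (illustrative, not the production embed path):

```python
import math

def truncate_matryoshka(vec, dim=768):
    """Keep the first `dim` dimensions of a Matryoshka-trained embedding
    (e.g. a 4096d Qwen3-Embedding-8B output) and L2-renormalize the head
    so downstream cosine scores remain comparable."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head
    return [x / norm for x in head]
```

This only works for models trained with Matryoshka-style objectives; truncating an ordinary embedding this way degrades quality badly.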

Neural Reranking

| Tier | Latency | When | Model |
|---|---|---|---|
| Primary | 120-350ms typical | Default warm/cold rerank | Voyage rerank-2.5 |
| Fallback | Higher, local CPU-bound | Offline, failover, cost control | Qwen3-Reranker-8B Q4_K_M |
  • Primary script path: /workspace/lib/ingest/rerank.sh with RERANK_BACKEND=voyage
  • Voyage economics: 200M free rerank tokens one-time per account, zero local CPU burn while available
  • Local fallback artifact: /opt/models/ (4.8GB GGUF), served by llama-server on :18200
  • Service unit: llama-rerank.service (on-demand, not always-on)
  • This local reranker also powers zero-shot classification with BTZSC F1 of 0.72

Important corrections:

  • Gemini Flash is no longer the primary reranker
  • Qwen3-VL-Reranker was laptop-only historical context, not current production path

Exact vs. Semantic: When Each Wins

| Query Example | Winner |
|---|---|
| "BatchTool" | BM25 exact match |
| "how to spawn agents" | Dense semantic |
| "client ABC #123" | BM25 exact match |
| "authentication flow" | Dense semantic |
| "error 0x8007001F" | BM25 exact match |

HYBRID SEARCH: Run BOTH, fuse with Reciprocal Rank Fusion (RRF), let reranker decide final order.
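
RRF is small enough to show in full. A sketch of the fusion step (standard RRF with the conventional k=60; the production fusion lives in unified-search.sh):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: each input ranking contributes 1/(k + rank)
    per document; summed scores give the fused order. `rankings` is a list
    of ranked document-ID lists, e.g. one from BM25 and one from dense
    search. Documents appearing in multiple rankings accumulate score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked #1 by BM25 and #2 by dense beats one ranked #3 and #1, which is exactly the behavior you want before the reranker sees the candidates.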


Convex: The Multi-Site Sync Backbone (NEW)

Why Convex?

| Requirement | Convex Solution |
|---|---|
| Multi-writer sync (laptop + VM) | Atomic mutations with (sourceSite, sourceId) dedup |
| Real-time dashboards | Reactive WebSocket queries |
| Autonomous cloud jobs | Built-in cron scheduler (no n8n/bash needed) |
| Approval workflows | HTTP Actions for Telegram webhook callbacks |
| Agent heartbeat | Mutation-based, aggregated by Convex itself |

Convex Schema (11 Tables)

| Table | Purpose | Key Fields |
|---|---|---|
| episodes | Cross-site episode mirror | sourceId, sourceSite, agentName, agentPlatform, reward, task, approvalStatus |
| approvals | Multi-reviewer approval queue | agentName, actionType, status, priority, reviewers[], telegramMessageId |
| tasks | OpenClaw task bridge | title, agentName, status, assignedTo, approvalId |
| agents | Live registry + heartbeat | agentName, site, platform, status, lastHeartbeat |
| collaborators | Human agents | name, role, telegramChatId, permissions[], notifyOn[] |
| syncCursors | Per-site sync watermark | site, lastSyncedId, lastSyncedAt, totalSynced |
| crm_contacts | CRM contacts | identity fields, enrichment, embedding refs |
| crm_interactions | CRM activity log | channel, timestamp, summary, linkage |
| crm_deals | CRM pipeline deals | stage, value, owner, close window |
| crm_entities | Extracted entities | type, canonical value, provenance |
| crm_relationships | Graph edges between CRM objects | from, to, relation type, confidence |

Sync Flow

flowchart LR
    subgraph Laptop["Laptop (Claude Code)"]
        L_AGENTDB["AgentDB<br/>SQLite"]
        L_SYNC["convex-episode-sync.sh"]
    end

    subgraph VM["VM 210 (Xavier)"]
        V_AGENTDB["AgentDB<br/>SQLite"]
        V_SYNC["convex-episode-sync.sh"]
    end

    subgraph Convex["Convex Cloud"]
        C_EP["episodes table"]
        C_CURSOR["syncCursors table"]
        C_CRON["qdrant-episode-sync<br/>cron (every 5 min)"]
    end

    subgraph Qdrant["Qdrant Cloud"]
        Q_HYBRID["agent_memory_hybrid<br/>BM25 + dense"]
    end

    L_AGENTDB -->|"WHERE id > cursor<br/>AND convex_synced=0"| L_SYNC
    L_SYNC -->|"bulkSyncEpisodes<br/>mutation"| C_EP
    L_SYNC -->|"updateCursor"| C_CURSOR

    V_AGENTDB -->|"WHERE id > cursor<br/>AND convex_synced=0"| V_SYNC
    V_SYNC -->|"bulkSyncEpisodes<br/>mutation"| C_EP
    V_SYNC -->|"updateCursor"| C_CURSOR

    C_CRON -->|"embed via Gemini<br/>upsert named vectors"| Q_HYBRID

Dedup strategy: The Convex bulkSyncEpisodes mutation upserts by (sourceSite, sourceId) composite key. Two sites syncing the same logical episode get separate entries (different sourceSite). After sync, local episodes are marked convex_synced = 1.

Convex→Qdrant sync: A Convex Action (qdrantSync.ts) runs every 5 minutes via cron. It:

  1. Finds episodes not yet in Qdrant (no qdrantPointId in metadata)
  2. Generates Gemini 768d embeddings
  3. Upserts to agent_memory_hybrid with named vectors (dense + text for BM25)
  4. Updates the episode metadata with the Qdrant point ID
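
The upsert payload shape in step 3 is the part that bites (hybrid collections reject bare vectors; see the "unnamed vector upsert" trap later in this doc). A sketch of the point structure, with illustrative field names rather than the real collection config:

```python
def qdrant_point(episode_id, dense_vec, text, source_site, source_id):
    """Build a Qdrant upsert point using NAMED vectors. The vector name
    ("dense") and payload fields here are illustrative; the actual names
    are fixed by the collection's hybrid configuration at creation time."""
    return {
        "id": episode_id,
        "vector": {"dense": dense_vec},   # named dense vector (Gemini 768d)
        "payload": {
            "text": text,                  # text side feeds BM25 sparse match
            "sourceSite": source_site,     # mirrors the Convex dedup key
            "sourceId": source_id,
        },
    }
```

Using `"vector": [...]` instead of `"vector": {"dense": [...]}` is the silent-failure mode the traps table warns about.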

Convex Cron Jobs (6 total)

| Cron | Interval | Purpose |
|---|---|---|
| mark-stale-agents | 5 min | Set agents to "offline" if no heartbeat in 3 min |
| expire-pending-approvals | 15 min | Expire approvals past timeout |
| sync-health-alert | 30 min | Alert if a site hasn't synced in 1 hour |
| qdrant-episode-sync | 5 min | Embed + push new episodes to Qdrant |
| daily-digest | Daily 06:00 UTC | Agent activity summary |
| cleanup-old-episodes | Weekly (Sun 03:00 UTC) | Archive 90-day-old episodes |

These run in Convex cloud - they work even if both laptop and VM are down.

Convex Environments

| Environment | Purpose |
|---|---|
| Dev | Development/testing |
| Prod | Live multi-site sync |

URLs and deploy keys stored in .env files per site, never committed to git.


CRM System (NEW)

A dedicated CRM skill now runs as a first-class memory subsystem.

CRM Skill Stack

| Layer | Implementation |
|---|---|
| Skill root | /workspace/skills/crm/ |
| Convex backend | crm.ts, crmAdmin.ts, crmClassify.ts, crmDiscover.ts, crmSync.ts |
| Vector retrieval | Qdrant hybrid search for contacts and interactions |
| CLI scripts | contact.sh, deal.sh, interaction.sh, search.sh, plus supporting scripts |
| Automation | OpenClaw cron jobs for auto-embed + auto-classify |
| Visualization | gen-graph.py relationship graph generation |

CRM Integration Notes

  • CRM data participates in the same hot/warm retrieval philosophy, with Convex as operational truth and Qdrant as semantic retrieval layer.
  • Contact and interaction artifacts are embedded and classified automatically, reducing manual CRM hygiene overhead.
  • Relationship views are queryable both as vectors and as graph edges for entity-centric investigation.

Agent Attribution (NEW)

The Problem

Before attribution, episodes carried no structured record of which agent, platform, or interface produced them. Session ID prefixes (subagent-abc123, session-1770881984) were the only hint.

AgentDB Attribution Columns

ALTER TABLE episodes ADD COLUMN agent_name TEXT DEFAULT 'claude-code';
ALTER TABLE episodes ADD COLUMN agent_platform TEXT DEFAULT 'claude-code-cli';
ALTER TABLE episodes ADD COLUMN agent_interface TEXT DEFAULT 'terminal';
ALTER TABLE episodes ADD COLUMN parent_agent TEXT;
ALTER TABLE episodes ADD COLUMN llm_provider TEXT;
ALTER TABLE episodes ADD COLUMN llm_model TEXT;
ALTER TABLE episodes ADD COLUMN convex_synced INTEGER DEFAULT 0;
ALTER TABLE episodes ADD COLUMN convex_id TEXT;

Attribution Values

| Agent | agent_name | agent_platform | agent_interface |
|---|---|---|---|
| Xavier (Telegram) | xavier | openclaw | telegram |
| Xavier (Control UI) | xavier | openclaw | control-ui |
| Xavier (subagent) | xavier-sub-{id} | openclaw | subagent |
| Claude Code (Adam) | claude-code | claude-code-cli | terminal |
| Claude Code (subagent) | cc-sub-{id} | claude-code-cli | subagent |
| Claude Flow (via Xavier) | xavier | claude-flow-mcp | mcp |
| Claude Flow (via CC) | claude-code | claude-flow-mcp | mcp |

Key rule: agent_name = top-level actor identity. When Xavier uses Claude Flow MCP, agent_name stays xavier because Xavier initiated the action. agent_platform tracks the execution engine.

Environment Detection

Attribution is automatic - hooks detect the environment:

# In memory-save.sh
if [ -n "${OPENCLAW_SESSION:-}" ]; then
    AGENT_NAME="xavier"
    AGENT_PLATFORM="openclaw"
    AGENT_INTERFACE="${OPENCLAW_INTERFACE:-telegram}"
else
    AGENT_NAME="${AGENT_NAME:-claude-code}"
    AGENT_PLATFORM="${AGENT_PLATFORM:-claude-code-cli}"
    AGENT_INTERFACE="${AGENT_INTERFACE:-terminal}"
fi

Approval Checkpoints - 3-Layer Architecture (NEW)

Why 3 Layers?

Xavier runs 24/7 autonomously. High-impact actions (sending messages, financial ops, destructive changes) need human approval. Three layers work together:

| Layer | Purpose | Storage | Speed |
|---|---|---|---|
| A: Episode Metadata | Rich context per episode | AgentDB episodes.metadata JSON | Instant (local) |
| B: Convex Tables | Scalable multi-agent/multi-human queue | Convex approvals + tasks | Real-time (WebSocket) |
| C: Telegram Buttons | Approve/reject via inline keyboard | OpenClaw + Telegram Bot API | Interactive |

Approval Flow

Xavier wants to send a Slack message
    │
    ├── 1. Write episode with metadata.approval.status="pending" (Layer A)
    ├── 2. Create Convex approval + linked task (Layer B)
    ├── 3. Send Telegram message with [Approve] [Reject] [Defer] buttons (Layer C)
    │
    └── Adam taps "Approve" in Telegram
         │
         ├── Telegram callback → Convex HTTP webhook
         ├── Convex resolves approval + updates task
         ├── Edit Telegram message: "✅ Approved by Adam at 14:32"
         └── Sync back to AgentDB episode metadata

Approval Rules

| Category | Auto-approve? | Timeout | Notes |
|---|---|---|---|
| send_message | Never | 60 min | External comms always need human review |
| financial | Never | 120 min | Any money-related action |
| destructive | Never | 30 min | Deletes, drops, overwrites |
| external_api | If <$0.10 | 15 min | Cost-gated auto-approve |
| code_commit | If tests pass | 30 min | CI-gated auto-approve |
| internal | Always | N/A | Internal memory ops, no risk |
| research | Always | N/A | Read-only, no side effects |
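
The rules table can be restated as a small policy function. This is an illustrative re-statement, not the actual OpenClaw gate; the helper name, signature, and the `auto` encoding are hypothetical:

```python
# Mirrors the approval-rules table: "Never" -> False, "Always" -> True,
# conditional rules -> a gate keyword resolved at call time.
RULES = {
    "send_message": {"auto": False, "timeout_min": 60},
    "financial":    {"auto": False, "timeout_min": 120},
    "destructive":  {"auto": False, "timeout_min": 30},
    "external_api": {"auto": "cost", "timeout_min": 15, "max_cost_usd": 0.10},
    "code_commit":  {"auto": "ci", "timeout_min": 30},
    "internal":     {"auto": True},
    "research":     {"auto": True},
}

def needs_human_approval(category, cost_usd=0.0, tests_pass=False):
    """True when the action must enter the 3-layer approval flow."""
    rule = RULES[category]
    if rule["auto"] is True:
        return False                              # internal / research
    if rule["auto"] == "cost":
        return cost_usd >= rule["max_cost_usd"]   # cost-gated
    if rule["auto"] == "ci":
        return not tests_pass                     # CI-gated
    return True                                   # never auto-approve
```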

4-Backend Memory System

Current Data Scale

| Backend | Records | Details |
|---|---|---|
| AgentDB (SQLite) | 1,844 episodes | Task trajectories with rewards + attribution |
| ↳ episode_embeddings | 104 | Qwen3-Embedding-8B GGUF 768d hot-path vectors |
| ↳ Context Mesh | 5,007 nodes, 963k edges, 560 concepts | Semantic relationship graph |
| Qdrant (Cloud) | 350,395 vectors | 17 collections (hybrid-enabled) |
| ↳ codebase_hybrid | 323,968 | Indexed source code + documentation |
| ↳ patterns_hybrid | 7,345 | Learned patterns + behaviors |
| ↳ agent_memory_hybrid | 6,716 | Task episodes + context |
| ↳ cortex_hybrid | 5,192 | Knowledge base documents |
| ↳ context_mesh_hybrid | 4,324 | Mesh relationship embeddings |
| ↳ research_hybrid | 1,627 | Research notes + findings |
| ↳ learnings_hybrid | 954 | High-reward episode learnings |
| Cortex (SiYuan) | ~550 documents | 3+1 notebook architecture |
| ↳ WORKSPACE | 142 docs | Active projects, configs |
| ↳ KNOWLEDGE | 233 docs | Stable learnings, patterns |
| ↳ JOURNAL | 149 docs | Daily logs, reflections |
| ↳ ARCHIVE | 26 docs | Completed/retired docs |
| Convex (Cloud) | 11 tables | Real-time multi-site sync layer + CRM operational schema |
| ↳ episodes | ~1,840 synced | Mirrored from all sites |
| ↳ agents | 2 registered | xavier + claude-code |
| Hive-Mind (Local JSON) | Session state | Backup decisions, swarm coordination |

Backend Roles

| Backend | Type | Best For |
|---|---|---|
| AgentDB | Local SQLite | Fast writes, episode storage, HNSW hot search, Context Mesh |
| Convex | Cloud reactive DB | Multi-site sync, approval workflows, heartbeat, cron jobs |
| Qdrant | Cloud vector DB | Hybrid BM25+dense search, warm-layer semantic retrieval |
| Cortex | Knowledge base | Human-curated docs, reflections, stable knowledge |
| Hive-Mind | Local JSON | Session backup, swarm decisions, quick persistence |

Memory Philosophy: Hot/Cold, Fast/Slow, Exact/Semantic

Context Mesh: Beyond Vector Search

Vector search finds similar documents. The Context Mesh finds related concepts.

Episode A: "Simplified auth from 5 methods to 2"
    │
    ├── evolved_from ──→ Episode B: "Analyzed auth complexity"
    │
    ├── mentions ──────→ Entity: "Google OAuth"
    │
    └── led_to ────────→ Episode C: "Deployed simplified auth"

Multi-hop queries the mesh enables:

  • "What decisions led to the current auth system?"
  • "What other tasks mentioned this client?"
  • "What patterns evolved from successful deployments?"

Current mesh: 5,007 nodes, 963,566 edges, 560 concepts - a rich knowledge graph connecting all episodes.
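
A multi-hop query is just a typed breadth-first traversal over mesh_edges. An illustrative sketch (not the production mesh code; the edge-tuple shape mirrors the mesh_edges columns listed later):

```python
from collections import deque

def multi_hop(edges, start, edge_types, max_hops=3):
    """BFS over mesh edges, following only the given edge types, e.g.
    edge_types={'evolved_from', 'led_to'} for 'what decisions led to the
    current auth system?'. `edges` is a list of
    (from_node, to_node, edge_type) tuples."""
    adjacency = {}
    for src, dst, etype in edges:
        if etype in edge_types:
            adjacency.setdefault(src, []).append(dst)
    seen, frontier, reached = {start}, deque([(start, 0)]), []
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue                      # hop budget exhausted
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                reached.append(nxt)
                frontier.append((nxt, depth + 1))
    return reached                        # nodes in hop order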

Memory Lifecycle: Save → Sync → Index → Search → Learn

flowchart LR
    subgraph Save["πŸ’Ύ SAVE"]
        HOOK["Stop Hook"]
        TASK["Task Complete"]
    end

    subgraph Sync["πŸ”„ SYNC (NEW)"]
        CONVEX["Convex<br/>(real-time)"]
        ATTR["Attribution<br/>(who/what/where)"]
    end

    subgraph Index["πŸ“Š INDEX"]
        AGENTDB["AgentDB<br/>(immediate)"]
        QDRANT["Qdrant<br/>(Convex cron 5min)"]
        MESH["Mesh Edges<br/>(relationship extraction)"]
    end

    subgraph Search["πŸ” SEARCH"]
        HOT["Hot Path<br/>(<350ms)"]
        HYBRID["Hybrid Search<br/>(50-200ms)"]
    end

    subgraph Learn["🧠 LEARN"]
        SONA["SONA<br/>(pattern extraction)"]
        CORTEX_SYNC["Cortex Sync<br/>(reward >= 0.65)"]
    end

    Save --> Sync
    Sync --> Index
    Index --> Search
    Search --> Learn
    Learn -->|"improves"| Search

Time Decay & Universal Truths

The Problem with Naive Recency Bias

Most memory systems apply uniform time decay: older memories get lower scores. This breaks for universal truths - facts that remain valid regardless of age:

  • "The project uses TypeScript" (learned 6 months ago) β€” still true
  • "API endpoint is /v1/users" (documented last year) β€” still true
  • "Client prefers async communication" (noted 2 months ago) β€” still true

Smart Recency Decay (Implemented)

The unified-search.sh script implements similarity-gated decay:

| Similarity | Decay | Rationale |
|---|---|---|
| >= 0.85 | None | Universal truth (stable facts) |
| 0.7 - 0.85 | Mild (floor 0.95) | Likely stable |
| < 0.7 | Stronger (floor 0.85) | Time-sensitive context |

Decay Factor Table

| Similarity | Age | Recency Factor |
|---|---|---|
| >=0.85 | Any | 1.0 (no decay) |
| 0.7-0.85 | <7d | 1.0 |
| 0.7-0.85 | 7-30d | 0.98 |
| 0.7-0.85 | 30-90d | 0.96 |
| 0.7-0.85 | >90d | 0.95 |
| <0.7 | <7d | 1.0 |
| <0.7 | 7-30d | 0.95 |
| <0.7 | 30-90d | 0.90 |
| <0.7 | >90d | 0.85 |

Final score: adjusted_score = raw_similarity × recency_factor
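
The banding above reduces to a small scoring helper. This is a Python sketch of what unified-search.sh computes, not its actual shell code:

```python
def recency_factor(similarity, age_days):
    """Similarity-gated decay from the table above: high-similarity hits
    are treated as universal truths (no decay); lower bands decay with age
    toward a floor of 0.95 or 0.85."""
    if similarity >= 0.85 or age_days < 7:
        return 1.0
    if similarity >= 0.7:                 # likely stable: mild decay
        if age_days < 30:
            return 0.98
        return 0.96 if age_days < 90 else 0.95
    if age_days < 30:                     # time-sensitive: stronger decay
        return 0.95
    return 0.90 if age_days < 90 else 0.85

def adjusted_score(raw_similarity, age_days):
    return raw_similarity * recency_factor(raw_similarity, age_days)
```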

Design Philosophy

  1. Universal truths don't decay - High-similarity matches (>=0.85) are likely stable facts
  2. Time-sensitive info decays gracefully - Meeting notes, temporary decisions naturally fade
  3. Nothing is deleted - All memories remain, just with adjusted scores
  4. Floors prevent total loss - Even heavily decayed memories (0.85 floor) remain discoverable

Multi-Site Agent Architecture (NEW)

Sites

| Site | Host | Agent | Platform | Interface |
|---|---|---|---|---|
| Laptop | adamkovacs-mbp | Claude Code | claude-code-cli | Terminal |
| VM 210 | ai-agent-primary (Tailscale) | Xavier | OpenClaw | Telegram / Control UI |

Xavier: The 24/7 Agent

Xavier runs on VM210 (Debian 12, 4 vCPU, ~47GB RAM, no GPU) via OpenClaw gateway.

Host context: Proxmox on Ryzen 9 8945HS, 64GB RAM, Radeon 780M iGPU (no ROCm, currently parked).

  • Primary model: anthropic/claude-opus-4-6
  • Fallback models: openai-codex/gpt-5.3-codex, google-gemini-cli/gemini-3-pro-preview
  • Subagent model: google-gemini-cli/gemini-3-flash-preview
  • Interface: Telegram bot + OpenClaw Control UI
  • Memory: Local AgentDB β†’ Convex sync β†’ Qdrant warm search

Connectivity

All inter-site connectivity via Tailscale mesh network. MagicDNS hostnames preferred over IPs (IPs can change on node re-registration). SSH keys and hostnames stored in .env / SSH config, never in docs.


Processing Pipelines

Architecture

flowchart TB
    subgraph Input["πŸ“₯ INPUT"]
        TEXT["πŸ“ Text/Code"]
        AUDIO["🎡 Audio"]
        IMAGE["πŸ–ΌοΈ Images"]
        VIDEO["🎬 Video"]
        PDF["πŸ“„ PDFs"]
    end

    subgraph Processing["βš™οΈ PROCESSING"]
        subgraph VisionProc["Vision Pipeline"]
            VISION_CLASS["Gemini Flash<br/>Vision classify"]
            GLMOCR["GLM-OCR via Ollama<br/>Primary OCR"]
            VISION_DESC["Gemini Flash<br/>Describe / fallback"]
        end
        ASSEMBLYAI["AssemblyAI<br/>Universal-3 Pro"]
        PDFPLUMBER["pdfplumber / pdftotext"]
    end

    subgraph Embedding["🧬 EMBEDDINGS"]
        subgraph TextEmbed["Text (Ollama)"]
            QWEN3["qwen3-embedding:8b<br/>4096d→768d"]
        end
        subgraph CloudEmbed["Cloud"]
            GEMINI["Gemini gemini-embedding-001<br/>768d"]
            VOYAGE4["Voyage-4<br/>text fallback"]
        end
        subgraph MMEmbed["Multimodal"]
            VOYAGE_MM["Voyage multimodal-3.5<br/>1024d→768d"]
        end
        BM25["BM25 Sparse"]
    end

    subgraph Storage["πŸ’Ύ STORAGE"]
        HNSW["πŸ”₯ AgentDB HNSW<br/>(Qwen3 768d, <350ms)"]
        CONVEX_STORE["πŸ”„ Convex<br/>(real-time sync)"]
        QDRANT["🎯 Qdrant<br/>(Gemini 768d, hybrid)"]
        CORTEX["❄️ Cortex<br/>(knowledge)"]
    end

    Input --> Processing
    Processing --> Embedding
    Embedding --> Storage

Vision-OCR Router

Images follow a type-then-process pipeline (ADR-0005):

Image → Gemini Flash classify → { text-heavy, chart, photo, diagram }
                                   │
               ┌───────────────────┼───────────────────┐
               ↓                   ↓                   ↓
      GLM-OCR via Ollama      Gemini Flash describe   Voyage multimodal-3.5
      (primary OCR)           (fallback/context)      (multimodal vector)
  • Primary OCR script: /workspace/scripts/ocr-glm.sh
  • Vision classification script: /workspace/scripts/vision-classify.sh
  • Fallback OCR for complex/multi-page inputs remains Gemini Flash
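
The router itself reduces to a lookup from classification label to pipeline. A sketch, where the label-to-branch assignment for chart/photo/diagram is an assumption (the diagram above doesn't pin each label to a branch) and the non-script target names are hypothetical:

```python
# Type-then-process routing (ADR-0005): classify first, then pick a pipeline.
ROUTES = {
    "text-heavy": "/workspace/scripts/ocr-glm.sh",  # GLM-OCR primary
    "chart":      "gemini-flash-describe",          # describe / fallback
    "diagram":    "gemini-flash-describe",
    "photo":      "voyage-multimodal-embed",        # multimodal vector
}

def route_image(image_class):
    """Map a Gemini Flash classification label to a processing pipeline;
    unknown labels fall back to the multimodal embedder."""
    return ROUTES.get(image_class, "voyage-multimodal-embed")
```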

OCR Runtime

| Tier | Model | Runtime | Notes |
|---|---|---|---|
| Primary | GLM-OCR 0.9B (glm-ocr:latest) | Ollama on VM210 | Purpose-built OCR, OmniDocBench 94.62, ~5s warm |
| Fallback | Gemini Flash | Cloud API | Used for complex layouts and multi-page OCR |

Classification Runtime

| Task | Model/Path | Notes |
|---|---|---|
| Text zero-shot classification | Qwen3-Reranker-8B via reranker endpoint | Shared with local fallback reranker path |
| Vision classification | Gemini Flash via /workspace/scripts/vision-classify.sh | Stable for image-type routing and label tasks |

Key ADR rules:

  • ADR-0005: Always OCR type-downgrade before text pipeline
  • ADR-0006: Always ffmpeg for video frame extraction, not direct VLM

Audio Pipeline

Audio → AssemblyAI Universal-3 Pro → Transcript → Text embedding → Qdrant
                                         ↓
                                    Cortex (if meeting)

PDF Processing

PDF → pdfplumber → Text chunks → Text embedding → Qdrant
         ↓              ↓
    Table extraction   Chunk by headers/paragraphs

Hybrid Search Pipeline

How a Search Works

flowchart TD
    Q["User Query"] --> PHASE1

    subgraph PHASE1["Phase 1: Hot + Backends (parallel)"]
        direction LR
        EMB["Generate embedding<br/>(Ollama Qwen3)"]
        HNSW_S["HNSW search<br/>(AgentDB)"]
        AGENTDB_S["AgentDB SQL<br/>(keyword)"]
        MESH_S["Context Mesh<br/>(graph)"]
        HIVEMIND_S["Hive-Mind<br/>(JSON)"]
    end

    PHASE1 --> PHASE2

    subgraph PHASE2["Phase 2: Qdrant Collections (parallel)"]
        direction LR
        C1["agent_memory_hybrid"]
        C2["patterns_hybrid"]
        C3["cortex_hybrid"]
        C4["learnings_hybrid"]
        C5["...13 collections"]
    end

    PHASE2 --> PHASE3

    subgraph PHASE3["Phase 3: Reranking"]
        FUSE["RRF Fusion"]
        RERANK["Gemini Flash<br/>Neural Rerank"]
    end

    PHASE3 --> RESULT["Final Ranked Results"]

Performance (unified-search.sh):

  • Phase 1: All 5 backends + embedding generation run concurrently
  • Phase 2: All 13 Qdrant collections queried concurrently
  • Phase 3: Results fused with RRF, then neural reranking
  • Total: ~2-5s wall time for comprehensive cross-backend search

Domain Namespacing

| Domain | Collection | Use Cases |
|---|---|---|
| sales | sales_context_hybrid | Client interactions, proposals |
| learning | learning_context_hybrid | Course delivery, feedback |
| operations | operations_context_hybrid | Internal ops, infrastructure |
| general | agent_memory_hybrid | Default, cross-domain |

# Domain-specific search
bash unified-search.sh --domain sales "client ABC proposal"

# Cross-domain search
bash unified-search.sh --domain all "authentication"

Hook Architecture (Parallelized)

Key Pattern

Claude Code hooks execute sequentially within arrays. Parallelism requires a single wrapper script that uses bash background jobs (& + wait).

SessionStart (parallel wrapper)

session-start-parallel.sh replaces 9 sequential hooks. All run via & + wait. Performance: 347s → ~60s.

SessionStop (DAG-based parallel)

session-stop-parallel.sh uses phased execution:

| Phase | Tasks | Parallel? |
|---|---|---|
| 1 | session-summarize.sh (gathers git diff, computes reward) | Blocking |
| 2 | session-sync.sh save, learning-capture.sh, cortex-session-log.sh, cortex-learning-sync.sh | 4 parallel |
| 3 | reflection-action-tracker.sh validate-all | Sequential |
| 3.5 | reflection-action-tracker.sh store-learning | Sequential |
| 4 | Convex flush, cleanup | Parallel |
Performance: 289s → ~128s.
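
The phased pattern is the same idea as bash & + wait: tasks within a phase run in parallel, with a barrier between phases. A Python analogue for illustration (the real wrapper is a shell script):

```python
from concurrent.futures import ThreadPoolExecutor

def run_phases(phases):
    """Run hook phases in order; tasks within one phase run concurrently.
    `phases` is a list of lists of zero-argument callables. Collecting
    every result before the next phase starts is the `wait` barrier."""
    results = []
    for phase in phases:
        with ThreadPoolExecutor(max_workers=len(phase)) as pool:
            futures = [pool.submit(task) for task in phase]
            results.append([f.result() for f in futures])  # barrier
    return results
```

A failing task raises here at the barrier, which mirrors why Phase 1 (reward computation) must finish before Phase 2 fans out.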

Stop Hook Chain (7 steps)

  1. session-summarize.sh - Gathers git diff + AgentDB episodes, computes honest reward (0.3-0.95)
  2. session-sync.sh save - Persist session state
  3. learning-capture.sh - Adds structured critique to most recent episode
  4. cortex-session-log.sh - Creates/appends daily task log in Cortex JOURNAL
  5. cortex-learning-sync.sh - Syncs episodes with reward >= 0.65 to Cortex KNOWLEDGE (ID-based cursor, ORDER BY id ASC)
  6. reflection-action-tracker.sh validate-all - Validates pending behavioral changes
  7. reflection-action-tracker.sh store-learning - Pushes validated learnings to Cortex KNOWLEDGE

ADR Enforcement

8 Architecture Decision Records are enforced via adr-enforcement.sh PreToolUse hook:

| ADR | Rule | Enforcement |
|---|---|---|
| ADR-0001 | Never ORDER BY reward DESC in sync scripts | Hard block |
| ADR-0002 | Never remove Phase 3.5 from stop hooks | Hard block |
| ADR-0003 | Always use _context_hybrid suffix for domain collections | Hard block |
| ADR-0004 | Never write to deprecated non-hybrid collection names | Hard block |
| ADR-0005 | Always OCR type-downgrade before text pipeline | Advisory |
| ADR-0006 | Always ffmpeg for video frames | Advisory |
| ADR-0007 | Filter checkpoint episodes from active queries | Advisory |
| ADR-0008 | Never delete hook files directly (archive via closure analysis) | Advisory |

Context Mesh Details

Tables

| Table | Records | Key Columns |
|---|---|---|
| mesh_nodes | 5,007 | id, type, source_id, source, content, metadata |
| mesh_edges | 963,566 | from_node, to_node, edge_type, weight |
| mesh_concepts | 560 | name (UNIQUE), frequency, category |
| mesh_evolution | - | How concepts change over time |

Edge Types

| Edge Type | Description | Creation Logic |
|---|---|---|
| mentions | Episode contains concept | Word match: content LIKE '%concept%' |
| led_to | Sequential in same session | Next episode in session |
| evolved_from | Shared 3+ concepts over time | Concept intersection analysis |
| informed | Learning influenced decision | Manual/Cortex linking |
| similar_to | Semantically similar | Vector similarity search |

Concept Extraction

Simple but effective: tokenize → filter stopwords → take top 15 unique terms per episode.

Categories assigned by pattern matching:

  • tool: tool, script, command, sqlite*, bash*
  • error: error, fail, bug
  • pattern: pattern, approach, strategy
  • system: api, server, database
  • domain: everything else
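
A minimal sketch of that extraction-plus-categorization pass (illustrative: the stopword list here is a tiny stand-in, and the pattern sets just restate the bullets above):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "for", "with", "is"}

CATEGORY_PATTERNS = {
    "tool":    ("tool", "script", "command", "sqlite", "bash"),
    "error":   ("error", "fail", "bug"),
    "pattern": ("pattern", "approach", "strategy"),
    "system":  ("api", "server", "database"),
}

def extract_concepts(text, top_n=15):
    """Tokenize, drop stopwords and very short tokens, keep the top-N most
    frequent unique terms."""
    tokens = re.findall(r"[a-z0-9_]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return [term for term, _ in counts.most_common(top_n)]

def categorize(concept):
    """Assign a category by prefix pattern match (sqlite*, bash* style);
    everything unmatched falls through to 'domain'."""
    for category, patterns in CATEGORY_PATTERNS.items():
        if any(concept.startswith(p) for p in patterns):
            return category
    return "domain"
```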

Cortex Knowledge Base (3+1 Architecture)

| Notebook | Docs | Purpose |
|---|---|---|
| WORKSPACE | 142 | Active projects, configs |
| KNOWLEDGE | 233 | Stable learnings, patterns |
| JOURNAL | 149 | Daily logs, reflections |
| ARCHIVE | 26 | Completed/retired docs |

Cortex runs on SiYuan Note with REST API access. Documents flow:

  • WORKSPACE → KNOWLEDGE (when learnings stabilize)
  • WORKSPACE → ARCHIVE (when projects complete)
  • JOURNAL is append-only daily logs

Local Model Services (VM210)

| Service | Models | Purpose | Port | Runtime |
|---|---|---|---|---|
| Ollama | qwen3-embedding:8b, glm-ocr:latest | Hot-path text embedding + primary OCR | 11434 | systemd, always-on |
| llama-server (llama-rerank.service) | Qwen3-Reranker-8B Q4_K_M | Local rerank/classification fallback | 18200 | systemd, on-demand |
| PyTorch VL server (/opt/vl-embed/) | legacy VL embedding stack | On-demand-only backup path | ad hoc | manual/on-demand |

Service management:

  • Ollama is enabled on boot via systemd
  • Ollama auto-unloads idle models after ~5 minutes
  • llama-rerank.service is intentionally not always-on to preserve RAM/CPU headroom

Key Design Decisions

| Decision | Rationale | Alternative Considered |
|---|---|---|
| Convex for sync, not custom WebSocket | Reactive queries, atomic mutations, built-in crons, HTTP webhooks | Custom sync server (maintenance burden) |
| Hybrid search by default | Exact matches (IDs, function names) get lost in pure semantic | Dense-only (faster but misses exact) |
| 768d embeddings everywhere | Consistent space, Matryoshka truncation from 4096d | 384d (faster) or 3072d (marginal gain) |
| Local-first with cloud sync | Fast writes, offline-capable, then background sync to Convex | Cloud-first (latency, connectivity dependency) |
| 3-layer approvals | Telegram buttons (UX) + Convex (scale) + metadata (speed) | Single approval table (fragile) |
| Agent attribution at write time | Zero-cost queries by agent; backfilling is unreliable | Runtime inference from session IDs (brittle) |
| Gemini for Qdrant, Qwen3 for HNSW | Qwen3 via Ollama is local/fast for hot path, Gemini remains warm-layer baseline | Single model everywhere (compromise) |
| ID-based sync cursors | Time-based + ORDER BY reward caused infinite loops | Time-based cursor (broken for reward-ordered) |
| No reranking on hot path | Rerank adds 500ms+; hot path budget is <350ms | Rerank everything (defeats hot path) |
| Pre-created hybrid Qdrant collections | Hybrid config can't change after creation | Create on demand (loses hybrid capability) |
| Voyage primary rerank + local fallback | Voyage gives best real-time quality/cost, local reranker preserves independence | Fully local rerank-only path |

What's NOT Included (Intentionally)

| Omission | Reason |
|---|---|
| Real-time streaming search | Batch is sufficient for agent workflows |
| GPU-accelerated inference | MLX on Apple Silicon is sufficient for these model sizes |
| Distributed Qdrant | Single node handles 350k+ vectors with <200ms latency |
| Custom fine-tuned models | Off-the-shelf models perform well enough |
| Supabase | Deprecated 2026-02. DNS failures, all scripts disabled. AgentDB + Convex replace it. |

Operational Notes

Common Traps (from 7,500+ episodes of experience)

| Trap | Fix |
|---|---|
| SQL datetime('now') vs Unix int created_at | CAST(strftime('%s','now','-Nh') AS INTEGER) |
| source .env fails in hook context | grep '^VAR=' .env \| cut -d= -f2- |
| Mixed embedding models in HNSW | All vectors + queries MUST use same model |
| Time cursor + ORDER BY reward | Use ID-based cursor + ORDER BY id ASC |
| Unnamed vector upsert in Qdrant | Use named vectors: "vector": {"dense": [...]} |
| Cross-encoder via chat completions | Gibberish - use generative LLM or proper rerank API |
| ${var,,} on macOS Bash 3.2 | echo "$var" \| tr '[:upper:]' '[:lower:]' |
| Rerank backend drift | Set RERANK_BACKEND=voyage for primary, keep llama fallback reachable |
| SiYuan moveDocs without toPath: "/" | Silent no-op (API returns success but does nothing) |

Service Ports (Local)

| Service | Port | Auth |
|---|---|---|
| Ollama | 11434 | None (local VM network) |
| llama-server reranker | 18200 | None (local VM network) |
| PyTorch VL server (on-demand) | ad hoc | None (local, manual start) |

Cloud service URLs (Qdrant, Cortex, Convex) stored in .env files per site.


This architecture runs across 2 sites (laptop + VM210), powered by Ollama-first local services plus Voyage AI and Gemini cloud APIs where they materially improve quality or latency. 7,500+ AgentDB episodes with attribution, 350k+ Qdrant vectors, 5,007 mesh nodes with 963k edges, ~550 Cortex documents, full CRM system, and 11-table Convex schema, all synced via Convex. Built with Claude Code + OpenClaw.
