
@adambkovacs
Last active March 13, 2026 02:34
Multi-Layer Memory Architecture for AI Agents - Hot/Warm/Cold storage, Convex multi-site sync, content-aware chunking, and intelligent retrieval

Multi-Layer Memory Architecture for AI Agents

A comprehensive multi-site memory system with hybrid BM25+dense search, multimodal embeddings, neural reranking, Convex real-time sync, and agent attribution, built on Hot/Warm/Cold storage, content-aware chunking, and intelligent retrieval. Powered by local Ollama services on VM210, with cloud augmentation where it wins.

Updated: 2026-02-16


Why This Architecture?

Problem: AI agents forget everything between sessions. Simple RAG with one vector DB fails because:

  • Exact matches get lost - Client IDs, function names, error codes drift in semantic space
  • Different content needs different processing - Code, meetings, images, charts each need specialized pipelines
  • Cloud APIs are too slow - Real-time agent decisions can't wait 500ms+ for embeddings
  • Context compaction loses critical state - Long sessions get summarized, losing important details
  • Multi-site agents need shared memory - An agent on a VM and a developer on a laptop shouldn't have separate, diverging knowledge

Solution: Multi-layer memory with intentional tradeoffs at each layer, unified by Convex as the real-time sync backbone.


Architecture Overview (Updated Feb 2026)

┌───────────────────────┐     ┌─────────────────────────┐     ┌───────────────┐
│  Laptop (Adam)        │     │  VM 210 (Xavier)        │     │ Future Agents │
│  Claude Code CLI      │     │  OpenClaw Gateway       │     │ (OCI, edge)   │
│  AgentDB (SQLite)     │     │  AgentDB (SQLite)       │     │               │
│  2331+ episodes       │     │  Telegram / Control UI  │     │               │
└──────────┬────────────┘     └───────────┬─────────────┘     └───────┬───────┘
           │                              │                           │
           │  episode_embeddings          │  episode_embeddings       │
           │  (Qwen3-Embedding-8B         │  (synced from Convex)     │
           │   768d HNSW hot path)        │                           │
           │                              │                           │
           └─────────────┬────────────────┴───────────────────────────┘
                         │
                 ┌───────▼───────────┐
                 │  Convex Cloud     │  ← Real-time sync backbone
                 │  (agent-memory)   │
                 │                   │
                 │  episodes         │  ← cross-site episode mirror
                 │  approvals        │  ← multi-reviewer queue
                 │  tasks            │  ← OpenClaw task bridge
                 │  agents           │  ← registry + heartbeat
                 │  collaborators    │  ← human agents (Adam + future)
                 │  syncCursors      │  ← per-site watermark
                 │                   │
                 │  6 cron jobs      │  ← autonomous cloud functions
                 └────────┬──────────┘
                          │
                  ┌───────▼──────────┐
                  │  Qdrant Cloud    │  ← Warm-layer hybrid search
                  │  350k+ vectors   │     BM25 + dense (Gemini 768d)
                  │  17 collections  │     + neural reranking
                  └──────────────────┘

Sync model: Each site writes to local SQLite first (fast, offline-capable), then syncs to Convex in background. Convex is the shared truth for cross-site visibility, dashboards, and approval workflows. Convex cron automatically pushes new episodes to Qdrant for warm-layer search.
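
The local-first leg of that loop can be sketched in a few lines. This is an illustrative Python version, not the actual convex-episode-sync.sh; `push_batch` stands in for the Convex bulkSyncEpisodes mutation, and the episode columns shown are a minimal subset:

```python
import sqlite3

def sync_pending_episodes(db_path, push_batch, cursor_start=0, batch_size=100):
    """One pass of the local-first sync loop: read unsynced rows past the
    ID-based cursor, push them, then mark them synced. On push failure,
    nothing is marked, so the next pass retries the same rows."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT id, task, reward FROM episodes "
        "WHERE id > ? AND convex_synced = 0 ORDER BY id ASC LIMIT ?",
        (cursor_start, batch_size),
    ).fetchall()
    if rows:
        push_batch(rows)  # stand-in for the bulkSyncEpisodes mutation
        conn.executemany(
            "UPDATE episodes SET convex_synced = 1 WHERE id = ?",
            [(r[0],) for r in rows],
        )
        conn.commit()
    conn.close()
    return rows[-1][0] if rows else cursor_start  # new ID-based cursor
```

Note the ORDER BY id ASC: ordering by reward or timestamp here is exactly the cursor trap called out later in this doc.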


The Three Memory Layers

flowchart TB
    subgraph Hot["πŸ”₯ HOT LAYER (<350ms)"]
        direction LR
        HNSW["AgentDB HNSW<br/>Qwen3-Embedding-8B GGUF<br/>768d (truncated from 4096d)<br/>1100+ vectors"]
        LOCAL["Local SQLite<br/>Zero network latency"]
    end

    subgraph Warm["🌑️ WARM LAYER (50-200ms)"]
        direction LR
        QDRANT["Qdrant Hybrid<br/>768d Gemini<br/>350k+ vectors"]
        BM25["BM25 Sparse<br/>Exact keyword match"]
    end

    subgraph Cold["❄️ COLD LAYER (+120-1500ms)"]
        direction LR
        RERANK["Neural Reranker<br/>Voyage rerank-2.5 (primary)<br/>Qwen3-Reranker-8B local fallback"]
    end

    QUERY[Query] --> Hot
    Hot -->|"results + miss"| Warm
    Warm -->|"top candidates"| Cold
    Cold --> FINAL[Final Results]

Key insight: Reranking is only applied to Qdrant results, NOT the hot path. Adding 500ms+ to a sub-second path would defeat its purpose.

Embedding Models Per Layer

| Layer | Model | Dimensions | Why | Location |
|---|---|---|---|---|
| Hot (AgentDB HNSW) | Qwen3-Embedding-8B via Ollama | 4096d→768d (Matryoshka truncated) | Local-first low latency on VM210 | VM210 (:11434) |
| Warm (Qdrant) | Gemini gemini-embedding-001 | 768d | Stable cloud baseline for hybrid collections | Cloud |
| Multimodal | Voyage multimodal-3.5 | 1024d→768d | Best operational quality/latency without local VL overhead | Cloud API |
| Convex→Qdrant sync | Gemini gemini-embedding-001 | 768d | Same model as Qdrant to maintain consistency | Convex Action |

Critical rule: NEVER mix embedding models in the same HNSW/collection index. Cross-model cosine similarity is ~0.12 (useless). All hot-path vectors + queries use the same Qwen3-Embedding-8B model.

Embedding Runtime (Current)

| Path | Model | Runtime | Dimensions | Notes |
|---|---|---|---|---|
| Hot text path | qwen3-embedding:8b | Ollama (:11434) | 4096d→768d | Always-on via systemd, auto-unloads after ~5 min idle |
| Warm Qdrant path | gemini-embedding-001 | Cloud API | 768d | No change, canonical Qdrant dense vector |
| Multimodal path | voyage-multimodal-3.5 | Cloud API | 1024d→768d | Script: /workspace/scripts/voyage-multimodal-embed.sh |
| Text fallback | voyage-4 | Cloud API | API-native→768d | Available in embed.sh fallback chain |
  • Ollama models on VM210: qwen3-embedding:8b (4.7GB), glm-ocr:latest (2.2GB)
  • Local PyTorch VL server at /opt/vl-embed/ still exists but is on-demand only; at ~52s per embedding on CPU it is too heavy for always-on use
  • Matryoshka truncation remains standard: keep first 768 dimensions for index compatibility
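
Matryoshka truncation itself is mechanically simple: keep the leading dimensions, then L2-renormalize so cosine similarity stays well-defined in the truncated space. A minimal sketch (illustrative, not the production embed path):

```python
import math

def truncate_matryoshka(vec, dim=768):
    """Keep the first `dim` dimensions of a Matryoshka-trained embedding
    (e.g. a 4096d Qwen3-Embedding-8B output) and L2-renormalize the head
    so downstream cosine scores remain comparable."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head
    return [x / norm for x in head]
```

This only works for models trained with Matryoshka-style objectives; truncating an ordinary embedding this way degrades quality badly.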

Neural Reranking

| Tier | Latency | When | Model |
|---|---|---|---|
| Primary | 120-350ms typical | Default warm/cold rerank | Voyage rerank-2.5 |
| Fallback | Higher, local CPU-bound | Offline, failover, cost control | Qwen3-Reranker-8B Q4_K_M |
  • Primary script path: /workspace/lib/ingest/rerank.sh with RERANK_BACKEND=voyage
  • Voyage economics: 200M free rerank tokens one-time per account, zero local CPU burn while available
  • Local fallback artifact: /opt/models/ (4.8GB GGUF), served by llama-server on :18200
  • Service unit: llama-rerank.service (on-demand, not always-on)
  • This local reranker also powers zero-shot classification with BTZSC F1 of 0.72

Important corrections:

  • Gemini Flash is no longer the primary reranker
  • Qwen3-VL-Reranker was laptop-only historical context, not current production path

Exact vs. Semantic: When Each Wins

| Query Example | Winner |
|---|---|
| "BatchTool" | BM25 exact match |
| "how to spawn agents" | Dense semantic |
| "client ABC #123" | BM25 exact match |
| "authentication flow" | Dense semantic |
| "error 0x8007001F" | BM25 exact match |

HYBRID SEARCH: Run BOTH, fuse with Reciprocal Rank Fusion (RRF), let reranker decide final order.
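
RRF is small enough to show in full. A sketch of the fusion step (standard RRF with the conventional k=60; the production fusion lives in unified-search.sh):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: each input ranking contributes 1/(k + rank)
    per document; summed scores give the fused order. `rankings` is a list
    of ranked document-ID lists, e.g. one from BM25 and one from dense
    search. Documents appearing in multiple rankings accumulate score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked #1 by BM25 and #2 by dense beats one ranked #3 and #1, which is exactly the behavior you want before the reranker sees the candidates.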


Convex: The Multi-Site Sync Backbone (NEW)

Why Convex?

| Requirement | Convex Solution |
|---|---|
| Multi-writer sync (laptop + VM) | Atomic mutations with (sourceSite, sourceId) dedup |
| Real-time dashboards | Reactive WebSocket queries |
| Autonomous cloud jobs | Built-in cron scheduler (no n8n/bash needed) |
| Approval workflows | HTTP Actions for Telegram webhook callbacks |
| Agent heartbeat | Mutation-based, aggregated by Convex itself |

Convex Schema (11 Tables)

| Table | Purpose | Key Fields |
|---|---|---|
| episodes | Cross-site episode mirror | sourceId, sourceSite, agentName, agentPlatform, reward, task, approvalStatus |
| approvals | Multi-reviewer approval queue | agentName, actionType, status, priority, reviewers[], telegramMessageId |
| tasks | OpenClaw task bridge | title, agentName, status, assignedTo, approvalId |
| agents | Live registry + heartbeat | agentName, site, platform, status, lastHeartbeat |
| collaborators | Human agents | name, role, telegramChatId, permissions[], notifyOn[] |
| syncCursors | Per-site sync watermark | site, lastSyncedId, lastSyncedAt, totalSynced |
| crm_contacts | CRM contacts | identity fields, enrichment, embedding refs |
| crm_interactions | CRM activity log | channel, timestamp, summary, linkage |
| crm_deals | CRM pipeline deals | stage, value, owner, close window |
| crm_entities | Extracted entities | type, canonical value, provenance |
| crm_relationships | Graph edges between CRM objects | from, to, relation type, confidence |

Sync Flow

flowchart LR
    subgraph Laptop["Laptop (Claude Code)"]
        L_AGENTDB["AgentDB<br/>SQLite"]
        L_SYNC["convex-episode-sync.sh"]
    end

    subgraph VM["VM 210 (Xavier)"]
        V_AGENTDB["AgentDB<br/>SQLite"]
        V_SYNC["convex-episode-sync.sh"]
    end

    subgraph Convex["Convex Cloud"]
        C_EP["episodes table"]
        C_CURSOR["syncCursors table"]
        C_CRON["qdrant-episode-sync<br/>cron (every 5 min)"]
    end

    subgraph Qdrant["Qdrant Cloud"]
        Q_HYBRID["agent_memory_hybrid<br/>BM25 + dense"]
    end

    L_AGENTDB -->|"WHERE id > cursor<br/>AND convex_synced=0"| L_SYNC
    L_SYNC -->|"bulkSyncEpisodes<br/>mutation"| C_EP
    L_SYNC -->|"updateCursor"| C_CURSOR

    V_AGENTDB -->|"WHERE id > cursor<br/>AND convex_synced=0"| V_SYNC
    V_SYNC -->|"bulkSyncEpisodes<br/>mutation"| C_EP
    V_SYNC -->|"updateCursor"| C_CURSOR

    C_CRON -->|"embed via Gemini<br/>upsert named vectors"| Q_HYBRID

Dedup strategy: The Convex bulkSyncEpisodes mutation upserts by (sourceSite, sourceId) composite key. Two sites syncing the same logical episode get separate entries (different sourceSite). After sync, local episodes are marked convex_synced = 1.

Convex→Qdrant sync: A Convex Action (qdrantSync.ts) runs every 5 minutes via cron. It:

  1. Finds episodes not yet in Qdrant (no qdrantPointId in metadata)
  2. Generates Gemini 768d embeddings
  3. Upserts to agent_memory_hybrid with named vectors (dense + text for BM25)
  4. Updates the episode metadata with the Qdrant point ID
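
The upsert payload shape in step 3 is the part that bites (hybrid collections reject bare vectors; see the "unnamed vector upsert" trap later in this doc). A sketch of the point structure, with illustrative field names rather than the real collection config:

```python
def qdrant_point(episode_id, dense_vec, text, source_site, source_id):
    """Build a Qdrant upsert point using NAMED vectors. The vector name
    ("dense") and payload fields here are illustrative; the actual names
    are fixed by the collection's hybrid configuration at creation time."""
    return {
        "id": episode_id,
        "vector": {"dense": dense_vec},   # named dense vector (Gemini 768d)
        "payload": {
            "text": text,                  # text side feeds BM25 sparse match
            "sourceSite": source_site,     # mirrors the Convex dedup key
            "sourceId": source_id,
        },
    }
```

Using `"vector": [...]` instead of `"vector": {"dense": [...]}` is the silent-failure mode the traps table warns about.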

Convex Cron Jobs (6 total)

| Cron | Interval | Purpose |
|---|---|---|
| mark-stale-agents | 5 min | Set agents to "offline" if no heartbeat in 3 min |
| expire-pending-approvals | 15 min | Expire approvals past timeout |
| sync-health-alert | 30 min | Alert if a site hasn't synced in 1 hour |
| qdrant-episode-sync | 5 min | Embed + push new episodes to Qdrant |
| daily-digest | Daily 06:00 UTC | Agent activity summary |
| cleanup-old-episodes | Weekly (Sun 03:00 UTC) | Archive 90-day-old episodes |

These run in Convex cloud - they work even if both laptop and VM are down.

Convex Environments

| Environment | Purpose |
|---|---|
| Dev | Development/testing |
| Prod | Live multi-site sync |

URLs and deploy keys stored in .env files per site, never committed to git.


CRM System (NEW)

A dedicated CRM skill now runs as a first-class memory subsystem.

CRM Skill Stack

| Layer | Implementation |
|---|---|
| Skill root | /workspace/skills/crm/ |
| Convex backend | crm.ts, crmAdmin.ts, crmClassify.ts, crmDiscover.ts, crmSync.ts |
| Vector retrieval | Qdrant hybrid search for contacts and interactions |
| CLI scripts | contact.sh, deal.sh, interaction.sh, search.sh, plus supporting scripts |
| Automation | OpenClaw cron jobs for auto-embed + auto-classify |
| Visualization | gen-graph.py relationship graph generation |

CRM Integration Notes

  • CRM data participates in the same hot/warm retrieval philosophy, with Convex as operational truth and Qdrant as semantic retrieval layer.
  • Contact and interaction artifacts are embedded and classified automatically, reducing manual CRM hygiene overhead.
  • Relationship views are queryable both as vectors and as graph edges for entity-centric investigation.

Agent Attribution (NEW)

The Problem

Before attribution, episodes carried no structured record of which agent, platform, or interface produced them. Session ID prefixes (subagent-abc123, session-1770881984) were the only hint.

AgentDB Attribution Columns

ALTER TABLE episodes ADD COLUMN agent_name TEXT DEFAULT 'claude-code';
ALTER TABLE episodes ADD COLUMN agent_platform TEXT DEFAULT 'claude-code-cli';
ALTER TABLE episodes ADD COLUMN agent_interface TEXT DEFAULT 'terminal';
ALTER TABLE episodes ADD COLUMN parent_agent TEXT;
ALTER TABLE episodes ADD COLUMN llm_provider TEXT;
ALTER TABLE episodes ADD COLUMN llm_model TEXT;
ALTER TABLE episodes ADD COLUMN convex_synced INTEGER DEFAULT 0;
ALTER TABLE episodes ADD COLUMN convex_id TEXT;

Attribution Values

| Agent | agent_name | agent_platform | agent_interface |
|---|---|---|---|
| Xavier (Telegram) | xavier | openclaw | telegram |
| Xavier (Control UI) | xavier | openclaw | control-ui |
| Xavier (subagent) | xavier-sub-{id} | openclaw | subagent |
| Claude Code (Adam) | claude-code | claude-code-cli | terminal |
| Claude Code (subagent) | cc-sub-{id} | claude-code-cli | subagent |
| Claude Flow (via Xavier) | xavier | claude-flow-mcp | mcp |
| Claude Flow (via CC) | claude-code | claude-flow-mcp | mcp |

Key rule: agent_name = top-level actor identity. When Xavier uses Claude Flow MCP, agent_name stays xavier because Xavier initiated the action. agent_platform tracks the execution engine.

Environment Detection

Attribution is automatic - hooks detect the environment:

# In memory-save.sh
if [ -n "${OPENCLAW_SESSION:-}" ]; then
    AGENT_NAME="xavier"
    AGENT_PLATFORM="openclaw"
    AGENT_INTERFACE="${OPENCLAW_INTERFACE:-telegram}"
else
    AGENT_NAME="${AGENT_NAME:-claude-code}"
    AGENT_PLATFORM="${AGENT_PLATFORM:-claude-code-cli}"
    AGENT_INTERFACE="${AGENT_INTERFACE:-terminal}"
fi

Approval Checkpoints - 3-Layer Architecture (NEW)

Why 3 Layers?

Xavier runs 24/7 autonomously. High-impact actions (sending messages, financial ops, destructive changes) need human approval. Three layers work together:

| Layer | Purpose | Storage | Speed |
|---|---|---|---|
| A: Episode Metadata | Rich context per episode | AgentDB episodes.metadata JSON | Instant (local) |
| B: Convex Tables | Scalable multi-agent/multi-human queue | Convex approvals + tasks | Real-time (WebSocket) |
| C: Telegram Buttons | Approve/reject via inline keyboard | OpenClaw + Telegram Bot API | Interactive |

Approval Flow

Xavier wants to send a Slack message
    │
    ├── 1. Write episode with metadata.approval.status="pending" (Layer A)
    ├── 2. Create Convex approval + linked task (Layer B)
    ├── 3. Send Telegram message with [Approve] [Reject] [Defer] buttons (Layer C)
    │
    └── Adam taps "Approve" in Telegram
         │
         ├── Telegram callback → Convex HTTP webhook
         ├── Convex resolves approval + updates task
         ├── Edit Telegram message: "✅ Approved by Adam at 14:32"
         └── Sync back to AgentDB episode metadata

Approval Rules

| Category | Auto-approve? | Timeout | Notes |
|---|---|---|---|
| send_message | Never | 60 min | External comms always need human review |
| financial | Never | 120 min | Any money-related action |
| destructive | Never | 30 min | Deletes, drops, overwrites |
| external_api | If <$0.10 | 15 min | Cost-gated auto-approve |
| code_commit | If tests pass | 30 min | CI-gated auto-approve |
| internal | Always | N/A | Internal memory ops, no risk |
| research | Always | N/A | Read-only, no side effects |
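
The rules table can be restated as a small policy function. This is an illustrative re-statement, not the actual OpenClaw gate; the helper name, signature, and the `auto` encoding are hypothetical:

```python
# Mirrors the approval-rules table: "Never" -> False, "Always" -> True,
# conditional rules -> a gate keyword resolved at call time.
RULES = {
    "send_message": {"auto": False, "timeout_min": 60},
    "financial":    {"auto": False, "timeout_min": 120},
    "destructive":  {"auto": False, "timeout_min": 30},
    "external_api": {"auto": "cost", "timeout_min": 15, "max_cost_usd": 0.10},
    "code_commit":  {"auto": "ci", "timeout_min": 30},
    "internal":     {"auto": True},
    "research":     {"auto": True},
}

def needs_human_approval(category, cost_usd=0.0, tests_pass=False):
    """True when the action must enter the 3-layer approval flow."""
    rule = RULES[category]
    if rule["auto"] is True:
        return False                              # internal / research
    if rule["auto"] == "cost":
        return cost_usd >= rule["max_cost_usd"]   # cost-gated
    if rule["auto"] == "ci":
        return not tests_pass                     # CI-gated
    return True                                   # never auto-approve
```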

4-Backend Memory System

Current Data Scale

| Backend | Records | Details |
|---|---|---|
| AgentDB (SQLite) | 1,844 episodes | Task trajectories with rewards + attribution |
| ↳ episode_embeddings | 104 | Qwen3-Embedding-8B GGUF 768d hot-path vectors |
| ↳ Context Mesh | 5,007 nodes, 963k edges, 560 concepts | Semantic relationship graph |
| Qdrant (Cloud) | 350,395 vectors | 17 collections (hybrid-enabled) |
| ↳ codebase_hybrid | 323,968 | Indexed source code + documentation |
| ↳ patterns_hybrid | 7,345 | Learned patterns + behaviors |
| ↳ agent_memory_hybrid | 6,716 | Task episodes + context |
| ↳ cortex_hybrid | 5,192 | Knowledge base documents |
| ↳ context_mesh_hybrid | 4,324 | Mesh relationship embeddings |
| ↳ research_hybrid | 1,627 | Research notes + findings |
| ↳ learnings_hybrid | 954 | High-reward episode learnings |
| Cortex (SiYuan) | ~550 documents | 3+1 notebook architecture |
| ↳ WORKSPACE | 142 docs | Active projects, configs |
| ↳ KNOWLEDGE | 233 docs | Stable learnings, patterns |
| ↳ JOURNAL | 149 docs | Daily logs, reflections |
| ↳ ARCHIVE | 26 docs | Completed/retired docs |
| Convex (Cloud) | 11 tables | Real-time multi-site sync layer + CRM operational schema |
| ↳ episodes | ~1,840 synced | Mirrored from all sites |
| ↳ agents | 2 registered | xavier + claude-code |
| Hive-Mind (Local JSON) | Session state | Backup decisions, swarm coordination |

Backend Roles

| Backend | Type | Best For |
|---|---|---|
| AgentDB | Local SQLite | Fast writes, episode storage, HNSW hot search, Context Mesh |
| Convex | Cloud reactive DB | Multi-site sync, approval workflows, heartbeat, cron jobs |
| Qdrant | Cloud vector DB | Hybrid BM25+dense search, warm-layer semantic retrieval |
| Cortex | Knowledge base | Human-curated docs, reflections, stable knowledge |
| Hive-Mind | Local JSON | Session backup, swarm decisions, quick persistence |

Memory Philosophy: Hot/Cold, Fast/Slow, Exact/Semantic

Context Mesh: Beyond Vector Search

Vector search finds similar documents. The Context Mesh finds related concepts.

Episode A: "Simplified auth from 5 methods to 2"
    │
    ├── evolved_from ──→ Episode B: "Analyzed auth complexity"
    │
    ├── mentions ──────→ Entity: "Google OAuth"
    │
    └── led_to ────────→ Episode C: "Deployed simplified auth"

Multi-hop queries the mesh enables:

  • "What decisions led to the current auth system?"
  • "What other tasks mentioned this client?"
  • "What patterns evolved from successful deployments?"

Current mesh: 5,007 nodes, 963,566 edges, 560 concepts - a rich knowledge graph connecting all episodes.
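
A multi-hop query is just a typed breadth-first traversal over mesh_edges. An illustrative sketch (not the production mesh code; the edge-tuple shape mirrors the mesh_edges columns listed later):

```python
from collections import deque

def multi_hop(edges, start, edge_types, max_hops=3):
    """BFS over mesh edges, following only the given edge types, e.g.
    edge_types={'evolved_from', 'led_to'} for 'what decisions led to the
    current auth system?'. `edges` is a list of
    (from_node, to_node, edge_type) tuples."""
    adjacency = {}
    for src, dst, etype in edges:
        if etype in edge_types:
            adjacency.setdefault(src, []).append(dst)
    seen, frontier, reached = {start}, deque([(start, 0)]), []
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue                      # hop budget exhausted
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                reached.append(nxt)
                frontier.append((nxt, depth + 1))
    return reached                        # nodes in hop order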

Memory Lifecycle: Save → Sync → Index → Search → Learn

flowchart LR
    subgraph Save["πŸ’Ύ SAVE"]
        HOOK["Stop Hook"]
        TASK["Task Complete"]
    end

    subgraph Sync["πŸ”„ SYNC (NEW)"]
        CONVEX["Convex<br/>(real-time)"]
        ATTR["Attribution<br/>(who/what/where)"]
    end

    subgraph Index["πŸ“Š INDEX"]
        AGENTDB["AgentDB<br/>(immediate)"]
        QDRANT["Qdrant<br/>(Convex cron 5min)"]
        MESH["Mesh Edges<br/>(relationship extraction)"]
    end

    subgraph Search["πŸ” SEARCH"]
        HOT["Hot Path<br/>(<350ms)"]
        HYBRID["Hybrid Search<br/>(50-200ms)"]
    end

    subgraph Learn["🧠 LEARN"]
        SONA["SONA<br/>(pattern extraction)"]
        CORTEX_SYNC["Cortex Sync<br/>(reward >= 0.65)"]
    end

    Save --> Sync
    Sync --> Index
    Index --> Search
    Search --> Learn
    Learn -->|"improves"| Search

Time Decay & Universal Truths

The Problem with Naive Recency Bias

Most memory systems apply uniform time decay: older memories get lower scores. This breaks for universal truths - facts that remain valid regardless of age:

  • "The project uses TypeScript" (learned 6 months ago) β€” still true
  • "API endpoint is /v1/users" (documented last year) β€” still true
  • "Client prefers async communication" (noted 2 months ago) β€” still true

Smart Recency Decay (Implemented)

The unified-search.sh script implements similarity-gated decay:

| Similarity | Decay | Rationale |
|---|---|---|
| >= 0.85 | None | Universal truth (stable facts) |
| 0.7 - 0.85 | Mild (floor 0.95) | Likely stable |
| < 0.7 | Stronger (floor 0.85) | Time-sensitive context |

Decay Factor Table

| Similarity | Age | Recency Factor |
|---|---|---|
| >=0.85 | Any | 1.0 (no decay) |
| 0.7-0.85 | <7d | 1.0 |
| 0.7-0.85 | 7-30d | 0.98 |
| 0.7-0.85 | 30-90d | 0.96 |
| 0.7-0.85 | >90d | 0.95 |
| <0.7 | <7d | 1.0 |
| <0.7 | 7-30d | 0.95 |
| <0.7 | 30-90d | 0.90 |
| <0.7 | >90d | 0.85 |

Final score: adjusted_score = raw_similarity × recency_factor
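
The banding above reduces to a small scoring helper. This is a Python sketch of what unified-search.sh computes, not its actual shell code:

```python
def recency_factor(similarity, age_days):
    """Similarity-gated decay from the table above: high-similarity hits
    are treated as universal truths (no decay); lower bands decay with age
    toward a floor of 0.95 or 0.85."""
    if similarity >= 0.85 or age_days < 7:
        return 1.0
    if similarity >= 0.7:                 # likely stable: mild decay
        if age_days < 30:
            return 0.98
        return 0.96 if age_days < 90 else 0.95
    if age_days < 30:                     # time-sensitive: stronger decay
        return 0.95
    return 0.90 if age_days < 90 else 0.85

def adjusted_score(raw_similarity, age_days):
    return raw_similarity * recency_factor(raw_similarity, age_days)
```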

Design Philosophy

  1. Universal truths don't decay - High-similarity matches (>=0.85) are likely stable facts
  2. Time-sensitive info decays gracefully - Meeting notes, temporary decisions naturally fade
  3. Nothing is deleted - All memories remain, just with adjusted scores
  4. Floors prevent total loss - Even heavily decayed memories (0.85 floor) remain discoverable

Multi-Site Agent Architecture (NEW)

Sites

| Site | Host | Agent | Platform | Interface |
|---|---|---|---|---|
| Laptop | adamkovacs-mbp | Claude Code | claude-code-cli | Terminal |
| VM 210 | ai-agent-primary (Tailscale) | Xavier | OpenClaw | Telegram / Control UI |

Xavier: The 24/7 Agent

Xavier runs on VM210 (Debian 12, 4 vCPU, ~47GB RAM, no GPU) via OpenClaw gateway.

Host context: Proxmox on Ryzen 9 8945HS, 64GB RAM, Radeon 780M iGPU (no ROCm, currently parked).

  • Primary model: anthropic/claude-opus-4-6
  • Fallback models: openai-codex/gpt-5.3-codex, google-gemini-cli/gemini-3-pro-preview
  • Subagent model: google-gemini-cli/gemini-3-flash-preview
  • Interface: Telegram bot + OpenClaw Control UI
  • Memory: Local AgentDB β†’ Convex sync β†’ Qdrant warm search

Connectivity

All inter-site connectivity via Tailscale mesh network. MagicDNS hostnames preferred over IPs (IPs can change on node re-registration). SSH keys and hostnames stored in .env / SSH config, never in docs.


Processing Pipelines

Architecture

flowchart TB
    subgraph Input["πŸ“₯ INPUT"]
        TEXT["πŸ“ Text/Code"]
        AUDIO["🎡 Audio"]
        IMAGE["πŸ–ΌοΈ Images"]
        VIDEO["🎬 Video"]
        PDF["πŸ“„ PDFs"]
    end

    subgraph Processing["βš™οΈ PROCESSING"]
        subgraph VisionProc["Vision Pipeline"]
            VISION_CLASS["Gemini Flash<br/>Vision classify"]
            GLMOCR["GLM-OCR via Ollama<br/>Primary OCR"]
            VISION_DESC["Gemini Flash<br/>Describe / fallback"]
        end
        ASSEMBLYAI["AssemblyAI<br/>Universal-3 Pro"]
        PDFPLUMBER["pdfplumber / pdftotext"]
    end

    subgraph Embedding["🧬 EMBEDDINGS"]
        subgraph TextEmbed["Text (Ollama)"]
            QWEN3["qwen3-embedding:8b<br/>4096d→768d"]
        end
        subgraph CloudEmbed["Cloud"]
            GEMINI["Gemini gemini-embedding-001<br/>768d"]
            VOYAGE4["Voyage-4<br/>text fallback"]
        end
        subgraph MMEmbed["Multimodal"]
            VOYAGE_MM["Voyage multimodal-3.5<br/>1024d→768d"]
        end
        BM25["BM25 Sparse"]
    end

    subgraph Storage["πŸ’Ύ STORAGE"]
        HNSW["πŸ”₯ AgentDB HNSW<br/>(Qwen3 768d, <350ms)"]
        CONVEX_STORE["πŸ”„ Convex<br/>(real-time sync)"]
        QDRANT["🎯 Qdrant<br/>(Gemini 768d, hybrid)"]
        CORTEX["❄️ Cortex<br/>(knowledge)"]
    end

    Input --> Processing
    Processing --> Embedding
    Embedding --> Storage

Vision-OCR Router

Images follow a type-then-process pipeline (ADR-0005):

Image → Gemini Flash classify → { text-heavy, chart, photo, diagram }
                                   │
               ┌───────────────────┼───────────────────┐
               ↓                   ↓                   ↓
      GLM-OCR via Ollama      Gemini Flash describe   Voyage multimodal-3.5
      (primary OCR)           (fallback/context)      (multimodal vector)
  • Primary OCR script: /workspace/scripts/ocr-glm.sh
  • Vision classification script: /workspace/scripts/vision-classify.sh
  • Fallback OCR for complex/multi-page inputs remains Gemini Flash
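
The router itself reduces to a lookup from classification label to pipeline. A sketch, where the label-to-branch assignment for chart/photo/diagram is an assumption (the diagram above doesn't pin each label to a branch) and the non-script target names are hypothetical:

```python
# Type-then-process routing (ADR-0005): classify first, then pick a pipeline.
ROUTES = {
    "text-heavy": "/workspace/scripts/ocr-glm.sh",  # GLM-OCR primary
    "chart":      "gemini-flash-describe",          # describe / fallback
    "diagram":    "gemini-flash-describe",
    "photo":      "voyage-multimodal-embed",        # multimodal vector
}

def route_image(image_class):
    """Map a Gemini Flash classification label to a processing pipeline;
    unknown labels fall back to the multimodal embedder."""
    return ROUTES.get(image_class, "voyage-multimodal-embed")
```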

OCR Runtime

| Tier | Model | Runtime | Notes |
|---|---|---|---|
| Primary | GLM-OCR 0.9B (glm-ocr:latest) | Ollama on VM210 | Purpose-built OCR, OmniDocBench 94.62, ~5s warm |
| Fallback | Gemini Flash | Cloud API | Used for complex layouts and multi-page OCR |

Classification Runtime

| Task | Model/Path | Notes |
|---|---|---|
| Text zero-shot classification | Qwen3-Reranker-8B via reranker endpoint | Shared with local fallback reranker path |
| Vision classification | Gemini Flash via /workspace/scripts/vision-classify.sh | Stable for image-type routing and label tasks |

Key ADR rules:

  • ADR-0005: Always OCR type-downgrade before text pipeline
  • ADR-0006: Always ffmpeg for video frame extraction, not direct VLM

Audio Pipeline

Audio → AssemblyAI Universal-3 Pro → Transcript → Text embedding → Qdrant
                                         ↓
                                    Cortex (if meeting)

PDF Processing

PDF → pdfplumber → Text chunks → Text embedding → Qdrant
         ↓              ↓
    Table extraction   Chunk by headers/paragraphs

Hybrid Search Pipeline

How a Search Works

flowchart TD
    Q["User Query"] --> PHASE1

    subgraph PHASE1["Phase 1: Hot + Backends (parallel)"]
        direction LR
        EMB["Generate embedding<br/>(Ollama Qwen3)"]
        HNSW_S["HNSW search<br/>(AgentDB)"]
        AGENTDB_S["AgentDB SQL<br/>(keyword)"]
        MESH_S["Context Mesh<br/>(graph)"]
        HIVEMIND_S["Hive-Mind<br/>(JSON)"]
    end

    PHASE1 --> PHASE2

    subgraph PHASE2["Phase 2: Qdrant Collections (parallel)"]
        direction LR
        C1["agent_memory_hybrid"]
        C2["patterns_hybrid"]
        C3["cortex_hybrid"]
        C4["learnings_hybrid"]
        C5["...13 collections"]
    end

    PHASE2 --> PHASE3

    subgraph PHASE3["Phase 3: Reranking"]
        FUSE["RRF Fusion"]
        RERANK["Gemini Flash<br/>Neural Rerank"]
    end

    PHASE3 --> RESULT["Final Ranked Results"]

Performance (unified-search.sh):

  • Phase 1: All 5 backends + embedding generation run concurrently
  • Phase 2: All 13 Qdrant collections queried concurrently
  • Phase 3: Results fused with RRF, then neural reranking
  • Total: ~2-5s wall time for comprehensive cross-backend search

Domain Namespacing

| Domain | Collection | Use Cases |
|---|---|---|
| sales | sales_context_hybrid | Client interactions, proposals |
| learning | learning_context_hybrid | Course delivery, feedback |
| operations | operations_context_hybrid | Internal ops, infrastructure |
| general | agent_memory_hybrid | Default, cross-domain |

# Domain-specific search
bash unified-search.sh --domain sales "client ABC proposal"

# Cross-domain search
bash unified-search.sh --domain all "authentication"

Hook Architecture (Parallelized)

Key Pattern

Claude Code hooks execute sequentially within arrays. Parallelism requires a single wrapper script that uses bash background jobs (& + wait).

SessionStart (parallel wrapper)

session-start-parallel.sh replaces 9 sequential hooks. All run via & + wait. Performance: 347s → ~60s.

SessionStop (DAG-based parallel)

session-stop-parallel.sh uses phased execution:

| Phase | Tasks | Parallel? |
|---|---|---|
| 1 | session-summarize.sh (gathers git diff, computes reward) | Blocking |
| 2 | session-sync.sh save, learning-capture.sh, cortex-session-log.sh, cortex-learning-sync.sh | 4 parallel |
| 3 | reflection-action-tracker.sh validate-all | Sequential |
| 3.5 | reflection-action-tracker.sh store-learning | Sequential |
| 4 | Convex flush, cleanup | Parallel |
Performance: 289s → ~128s.
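
The phased pattern is the same idea as bash & + wait: tasks within a phase run in parallel, with a barrier between phases. A Python analogue for illustration (the real wrapper is a shell script):

```python
from concurrent.futures import ThreadPoolExecutor

def run_phases(phases):
    """Run hook phases in order; tasks within one phase run concurrently.
    `phases` is a list of lists of zero-argument callables. Collecting
    every result before the next phase starts is the `wait` barrier."""
    results = []
    for phase in phases:
        with ThreadPoolExecutor(max_workers=len(phase)) as pool:
            futures = [pool.submit(task) for task in phase]
            results.append([f.result() for f in futures])  # barrier
    return results
```

A failing task raises here at the barrier, which mirrors why Phase 1 (reward computation) must finish before Phase 2 fans out.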

Stop Hook Chain (7 steps)

  1. session-summarize.sh - Gathers git diff + AgentDB episodes, computes honest reward (0.3-0.95)
  2. session-sync.sh save - Persist session state
  3. learning-capture.sh - Adds structured critique to most recent episode
  4. cortex-session-log.sh - Creates/appends daily task log in Cortex JOURNAL
  5. cortex-learning-sync.sh - Syncs episodes with reward >= 0.65 to Cortex KNOWLEDGE (ID-based cursor, ORDER BY id ASC)
  6. reflection-action-tracker.sh validate-all - Validates pending behavioral changes
  7. reflection-action-tracker.sh store-learning - Pushes validated learnings to Cortex KNOWLEDGE

ADR Enforcement

8 Architecture Decision Records are enforced via adr-enforcement.sh PreToolUse hook:

| ADR | Rule | Enforcement |
|---|---|---|
| ADR-0001 | Never ORDER BY reward DESC in sync scripts | Hard block |
| ADR-0002 | Never remove Phase 3.5 from stop hooks | Hard block |
| ADR-0003 | Always use _context_hybrid suffix for domain collections | Hard block |
| ADR-0004 | Never write to deprecated non-hybrid collection names | Hard block |
| ADR-0005 | Always OCR type-downgrade before text pipeline | Advisory |
| ADR-0006 | Always ffmpeg for video frames | Advisory |
| ADR-0007 | Filter checkpoint episodes from active queries | Advisory |
| ADR-0008 | Never delete hook files directly (archive via closure analysis) | Advisory |

Context Mesh Details

Tables

| Table | Records | Key Columns |
|---|---|---|
| mesh_nodes | 5,007 | id, type, source_id, source, content, metadata |
| mesh_edges | 963,566 | from_node, to_node, edge_type, weight |
| mesh_concepts | 560 | name (UNIQUE), frequency, category |
| mesh_evolution | - | How concepts change over time |

Edge Types

| Edge Type | Description | Creation Logic |
|---|---|---|
| mentions | Episode contains concept | Word match: content LIKE '%concept%' |
| led_to | Sequential in same session | Next episode in session |
| evolved_from | Shared 3+ concepts over time | Concept intersection analysis |
| informed | Learning influenced decision | Manual/Cortex linking |
| similar_to | Semantically similar | Vector similarity search |

Concept Extraction

Simple but effective: tokenize → filter stopwords → take top 15 unique terms per episode.

Categories assigned by pattern matching:

  • tool: tool, script, command, sqlite*, bash*
  • error: error, fail, bug
  • pattern: pattern, approach, strategy
  • system: api, server, database
  • domain: everything else
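
A minimal sketch of that extraction-plus-categorization pass (illustrative: the stopword list here is a tiny stand-in, and the pattern sets just restate the bullets above):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "for", "with", "is"}

CATEGORY_PATTERNS = {
    "tool":    ("tool", "script", "command", "sqlite", "bash"),
    "error":   ("error", "fail", "bug"),
    "pattern": ("pattern", "approach", "strategy"),
    "system":  ("api", "server", "database"),
}

def extract_concepts(text, top_n=15):
    """Tokenize, drop stopwords and very short tokens, keep the top-N most
    frequent unique terms."""
    tokens = re.findall(r"[a-z0-9_]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return [term for term, _ in counts.most_common(top_n)]

def categorize(concept):
    """Assign a category by prefix pattern match (sqlite*, bash* style);
    everything unmatched falls through to 'domain'."""
    for category, patterns in CATEGORY_PATTERNS.items():
        if any(concept.startswith(p) for p in patterns):
            return category
    return "domain"
```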

Cortex Knowledge Base (3+1 Architecture)

| Notebook | Docs | Purpose |
|---|---|---|
| WORKSPACE | 142 | Active projects, configs |
| KNOWLEDGE | 233 | Stable learnings, patterns |
| JOURNAL | 149 | Daily logs, reflections |
| ARCHIVE | 26 | Completed/retired docs |

Cortex runs on SiYuan Note with REST API access. Documents flow:

  • WORKSPACE → KNOWLEDGE (when learnings stabilize)
  • WORKSPACE → ARCHIVE (when projects complete)
  • JOURNAL is append-only daily logs

Local Model Services (VM210)

| Service | Models | Purpose | Port | Runtime |
|---|---|---|---|---|
| Ollama | qwen3-embedding:8b, glm-ocr:latest | Hot-path text embedding + primary OCR | 11434 | systemd, always-on |
| llama-server (llama-rerank.service) | Qwen3-Reranker-8B Q4_K_M | Local rerank/classification fallback | 18200 | systemd, on-demand |
| PyTorch VL server (/opt/vl-embed/) | legacy VL embedding stack | On-demand-only backup path | ad hoc | manual/on-demand |

Service management:

  • Ollama is enabled on boot via systemd
  • Ollama auto-unloads idle models after ~5 minutes
  • llama-rerank.service is intentionally not always-on to preserve RAM/CPU headroom

Key Design Decisions

| Decision | Rationale | Alternative Considered |
|---|---|---|
| Convex for sync, not custom WebSocket | Reactive queries, atomic mutations, built-in crons, HTTP webhooks | Custom sync server (maintenance burden) |
| Hybrid search by default | Exact matches (IDs, function names) get lost in pure semantic | Dense-only (faster but misses exact) |
| 768d embeddings everywhere | Consistent space, Matryoshka truncation from 4096d | 384d (faster) or 3072d (marginal gain) |
| Local-first with cloud sync | Fast writes, offline-capable, then background sync to Convex | Cloud-first (latency, connectivity dependency) |
| 3-layer approvals | Telegram buttons (UX) + Convex (scale) + metadata (speed) | Single approval table (fragile) |
| Agent attribution at write time | Zero-cost queries by agent; backfilling is unreliable | Runtime inference from session IDs (brittle) |
| Gemini for Qdrant, Qwen3 for HNSW | Qwen3 via Ollama is local/fast for hot path, Gemini remains warm-layer baseline | Single model everywhere (compromise) |
| ID-based sync cursors | Time-based + ORDER BY reward caused infinite loops | Time-based cursor (broken for reward-ordered) |
| No reranking on hot path | Rerank adds 500ms+; hot path budget is <350ms | Rerank everything (defeats hot path) |
| Pre-created hybrid Qdrant collections | Hybrid config can't change after creation | Create on demand (loses hybrid capability) |
| Voyage primary rerank + local fallback | Voyage gives best real-time quality/cost, local reranker preserves independence | Fully local rerank-only path |

What's NOT Included (Intentionally)

| Omission | Reason |
|---|---|
| Real-time streaming search | Batch is sufficient for agent workflows |
| GPU-accelerated inference | MLX on Apple Silicon is sufficient for these model sizes |
| Distributed Qdrant | Single node handles 350k+ vectors with <200ms latency |
| Custom fine-tuned models | Off-the-shelf models perform well enough |
| Supabase | Deprecated 2026-02. DNS failures, all scripts disabled. AgentDB + Convex replace it. |

Operational Notes

Common Traps (from 7,500+ episodes of experience)

| Trap | Fix |
|---|---|
| SQL datetime('now') vs Unix int created_at | CAST(strftime('%s','now','-Nh') AS INTEGER) |
| source .env fails in hook context | grep '^VAR=' .env \| cut -d= -f2- |
| Mixed embedding models in HNSW | All vectors + queries MUST use same model |
| Time cursor + ORDER BY reward | Use ID-based cursor + ORDER BY id ASC |
| Unnamed vector upsert in Qdrant | Use named vectors: "vector": {"dense": [...]} |
| Cross-encoder via chat completions | Gibberish - use generative LLM or proper rerank API |
| ${var,,} on macOS Bash 3.2 | echo "$var" \| tr '[:upper:]' '[:lower:]' |
| Rerank backend drift | Set RERANK_BACKEND=voyage for primary, keep llama fallback reachable |
| SiYuan moveDocs without toPath: "/" | Silent no-op (API returns success but does nothing) |

Service Ports (Local)

| Service | Port | Auth |
|---|---|---|
| Ollama | 11434 | None (local VM network) |
| llama-server reranker | 18200 | None (local VM network) |
| PyTorch VL server (on-demand) | ad hoc | None (local, manual start) |

Cloud service URLs (Qdrant, Cortex, Convex) stored in .env files per site.


This architecture runs across 2 sites (laptop + VM210), powered by Ollama-first local services plus Voyage AI and Gemini cloud APIs where they materially improve quality or latency. 7,500+ AgentDB episodes with attribution, 350k+ Qdrant vectors, 5,007 mesh nodes with 963k edges, ~550 Cortex documents, full CRM system, and 11-table Convex schema, all synced via Convex. Built with Claude Code + OpenClaw.
