An AI-powered long-term memory system with LLM-gated storage, semantic retrieval, voice interaction, and contrastive learning-based personalized retrieval.
- Overview
- Architecture
- Project Structure
- Configuration & Environment
- Core System — Memory Pipeline
- Chat System — RAG-Based Responses
- Voice Interaction
- 7.1 Wake Word Detection
- 7.2 Speech-to-Text
- 7.3 Voice Chat Flow
- Contrastive Learning — Personalized Retrieval
- CLI Application
- Flask REST API
- Installation & Setup
- Verification & Testing
- Dependencies
The Quantum Memory Layer is a system that gives AI assistants persistent, long-term memory. Unlike typical chatbots that forget everything between sessions, this system:
- Stores facts, preferences, events, and plans about the user
- Decides what's worth remembering using an LLM (not everything is stored)
- Retrieves relevant memories when the user asks questions
- Updates memories when information changes (e.g., "I now like Java" replaces "I like Python")
- Learns to retrieve better over time using contrastive learning (ML-based personalization)
- Listens via wake word detection and speech-to-text for hands-free interaction
| Component | Technology |
|---|---|
| LLM | Google Gemini (gemini-2.5-flash-lite) |
| Embeddings | Gemini Embedding API (models/gemini-embedding-001, 768-dim) |
| Learned Embeddings | Fine-tuned all-MiniLM-L6-v2 (384-dim, local) |
| Storage | SQLite (memory.db) |
| Wake Word | Picovoice Porcupine |
| Speech-to-Text | Google Speech Recognition |
| ML Training | PyTorch + sentence-transformers |
| API | Flask |
| CLI | Rich (Python) |
┌──────────────────────────────────────────────────────────┐
│ User Interaction │
│ (CLI / Flask API / Voice Chat) │
└──────────────┬───────────────────────┬───────────────────┘
│ │
┌──────▼──────┐ ┌──────▼──────┐
│ Memory │ │ Chat │
│ Manager │ │ Service │
└──┬───┬───┬──┘ └──────┬──────┘
│ │ │ │
┌──────▼┐ ┌▼────▼─┐ ┌──────▼──────┐
│ LLM │ │Embed- │ │ LLM │
│Service│ │ding │ │ (Generate │
│(Gate) │ │Service│ │ Response) │
└───────┘ └───┬───┘ └─────────────┘
│
┌──────▼──────┐
│ Storage │
│ Service │
│ (SQLite) │
└──────┬──────┘
│
┌─────────▼─────────┐
│ Contrastive │
│ Training Pipeline│
│ (Background) │
└───────────────────┘
User Input → LLM decides "worth storing?" → YES/NO
│
┌───────────▼───────────┐
│ Check for conflicts │
│ with existing memories │
└───────┬───┬───┬───────┘
ADD UPDATE IGNORE
│
┌───────▼───────┐
│ Generate │
│ Embeddings │
│ (Gemini + │
│ Learned) │
└───────┬───────┘
│
┌───────▼───────┐
│ Store in │
│ SQLite │
└───────┬───────┘
│
┌───────▼───────┐
│ Check auto- │
│ retrain │
│ threshold │
└───────────────┘
User Query → Generate query embedding
│
┌───────▼───────────┐
│ Trained model │
│ exists? │
└──┬──────────┬─────┘
YES NO
│ │
┌──────▼──────┐ ┌──▼──────────┐
│ Use learned │ │ Use Gemini │
│ embeddings │ │ embeddings │
│ (384-dim) │ │ (768-dim) │
└──────┬──────┘ └──┬──────────┘
└─────┬──────┘
│
┌──────▼──────┐
│ Cosine │
│ Similarity │
│ → Top-K │
└──────┬───────┘
│
┌──────▼──────┐
│ LLM generates│
│ response with│
│ context (RAG)│
└─────────────┘
memory-layer/
├── .env # API keys (GEMINI_API_KEY, PICOVOICE_ACCESS_KEY)
├── .env.example # Template for .env
├── .gitignore
├── requirements.txt # All Python dependencies
├── main.py # CLI application (Rich-based interactive menu)
├── app.py # Flask REST API server
├── demo.py # Interactive debugging demo
├── verify.py # Core system verification tests
├── verify_contrastive.py # Contrastive learning benchmark tests
├── memory.db # SQLite database (auto-created)
├── ARCHITECTURE.md # Architecture overview with Mermaid diagrams
├── README.md # Project README
├── doc.md # This file
│
├── models/ # Auto-created after first training
│ └── retriever/ # Fine-tuned MiniLM model checkpoint
│ ├── config.json
│ ├── model.safetensors
│ └── training_log.json # Version history of training runs
│
└── src/
├── core/
│ ├── config.py # Environment config & validation
│ ├── memory_manager.py # Central orchestrator for memory operations
│ └── chat_service.py # RAG-based chat with memory context
│
├── models/
│ └── memory_entry.py # MemoryEntry dataclass (data model)
│
├── services/
│ ├── llm_service.py # Gemini LLM interactions (decisions, chat, conflicts)
│ ├── embedding_service.py # Gemini embedding generation (768-dim)
│ ├── storage_service.py # SQLite database operations
│ ├── stt_service.py # Speech-to-Text (Google Speech Recognition)
│ ├── wakeword_service.py # Wake word detection (Picovoice Porcupine)
│ └── triplet_service.py # LLM-based training triplet generation
│
└── ml/
└── contrastive_trainer.py # MiniLM fine-tuning with TripletLoss
File: src/core/config.py
All configuration is loaded from environment variables (.env file):
| Variable | Required | Default | Description |
|---|---|---|---|
| GEMINI_API_KEY | ✅ Yes | — | Google Gemini API key for LLM and embeddings |
| PICOVOICE_ACCESS_KEY | Optional | — | Picovoice API key for wake word detection |
| DB_PATH | Optional | memory.db | Path to SQLite database file |
Model constants (hardcoded in config.py):
| Constant | Value | Description |
|---|---|---|
| EMBEDDING_MODEL | models/gemini-embedding-001 | Gemini embedding model (768 dimensions) |
| GENERATIVE_MODEL | gemini-2.5-flash-lite | Gemini generative model for all LLM tasks |
GEMINI_API_KEY=your_gemini_api_key_here
PICOVOICE_ACCESS_KEY=your_picovoice_key_here
DB_PATH=memory.db

File: src/models/memory_entry.py
Every memory is stored as a MemoryEntry dataclass:
@dataclass
class MemoryEntry:
text: str # The actual memory text
embedding: List[float] # Gemini embedding (768-dim)
learned_embedding: Optional[List[float]] = None # Fine-tuned MiniLM embedding (384-dim)
metadata: dict = field(default_factory=dict) # Extra metadata (unused, extensible)
created_at: datetime = field(default_factory=datetime.now)
    id: Optional[int] = None                        # SQLite auto-increment ID

Serialization:
- to_db_tuple() — Converts the entry to a tuple for SQLite insertion. Embeddings are JSON-serialized.
- from_db_tuple(row) — Backwards-compatible factory method. Handles both 5-column (legacy, before contrastive learning) and 6-column (current) database rows.
Backwards Compatibility: If the database was created before the contrastive learning update, from_db_tuple detects the 5-column format and sets learned_embedding = None.
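For illustration, here is a minimal sketch of what this round-trip could look like, written as free functions (the real implementations are methods on MemoryEntry, and to_db_tuple() is assumed to omit the auto-increment id; column order follows the memories schema shown later):

```python
import json
from datetime import datetime

# Column order follows the memories table: (id,) text, embedding, learned_embedding, metadata, created_at
def to_db_tuple(entry):
    return (
        entry.text,
        json.dumps(entry.embedding),                                                # 768-dim list -> JSON
        json.dumps(entry.learned_embedding) if entry.learned_embedding else None,   # nullable
        json.dumps(entry.metadata),
        entry.created_at.isoformat(),
    )

def from_db_tuple(row):
    if len(row) == 5:   # legacy rows: (id, text, embedding, metadata, created_at)
        id_, text, emb, meta, created = row
        learned = None
    else:               # current rows include learned_embedding
        id_, text, emb, learned, meta, created = row
    return MemoryEntry(
        id=id_,
        text=text,
        embedding=json.loads(emb),
        learned_embedding=json.loads(learned) if learned else None,
        metadata=json.loads(meta) if meta else {},
        created_at=datetime.fromisoformat(created),
    )
```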
File: src/services/llm_service.py → decide_memory_importance()
Not everything the user says is stored. The LLM acts as a gate, deciding what's worth remembering.
Worth storing:
- Facts about the user ("I study CS at IISC")
- Preferences ("My favorite food is dosa")
- Events ("I have a meeting tomorrow at 10")
- Relationships ("My dog's name is Bruno")
- Setup details, crucial context
Not worth storing:
- Casual greetings ("Hi", "How are you?")
- Fleeting thoughts
- Ephemeral questions ("What is 2+2?")
- Nonsensical input
How it works:
Input: "I like Python programming"
↓
LLM Prompt:
"You are a memory manager AI. Decide if this is worth storing..."
↓
LLM Response:
"DECISION: YES"
"REASON: This is a programming language preference worth remembering."
↓
Returns: (True, "This is a programming language preference worth remembering.")
The LLM is prompted to return a structured response with DECISION: [YES/NO] and REASON: [explanation], which is then parsed.
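A minimal parsing sketch (the exact prompt and parsing logic in llm_service.py may differ):

```python
def parse_decision(llm_response: str) -> tuple[bool, str]:
    """Parse a 'DECISION: YES/NO' + 'REASON: ...' style response into (store?, reason)."""
    decision, reason = False, ""
    for line in llm_response.splitlines():
        line = line.strip()
        if line.upper().startswith("DECISION:"):
            decision = "YES" in line.upper()
        elif line.upper().startswith("REASON:"):
            reason = line.split(":", 1)[1].strip()
    return decision, reason

# parse_decision("DECISION: YES\nREASON: Programming language preference.")
# -> (True, "Programming language preference.")
```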
File: src/services/llm_service.py → check_memory_update()
When new information is about to be stored, the system checks if it conflicts with existing memories.
Three possible actions:
| Action | When | Example |
|---|---|---|
| ADD | New info, no conflict | "I like badminton" (no existing sport preference) |
| UPDATE | New info supersedes old | "I like Java" overrides "I like Python" |
| IGNORE | Duplicate information | "I like Python" when already stored |
How UPDATE works:
- The LLM receives the new text and up to 5 existing similar memories
- It identifies which memories conflict and returns their IDs
- Those old memories are deleted from the database
- The new memory is then stored
Example flow:
Existing Memory ID 3: "I like Python"
New Input: "I now only like Java"
↓
LLM decides: ACTION: UPDATE, TARGET_IDS: [3]
↓
Memory ID 3 deleted
New memory "I now only like Java" stored
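A hedged sketch of parsing that structured reply (the real check_memory_update() may use a different prompt and parser):

```python
import re

def parse_update_decision(llm_response: str) -> tuple[str, list[int]]:
    """Extract ACTION (ADD/UPDATE/IGNORE) and TARGET_IDS from the structured LLM reply."""
    action_match = re.search(r"ACTION:\s*(ADD|UPDATE|IGNORE)", llm_response, re.IGNORECASE)
    ids_match = re.search(r"TARGET_IDS:\s*\[([^\]]*)\]", llm_response)
    action = action_match.group(1).upper() if action_match else "ADD"
    target_ids = [int(x) for x in ids_match.group(1).split(",") if x.strip()] if ids_match else []
    return action, target_ids

# parse_update_decision("ACTION: UPDATE, TARGET_IDS: [3]") -> ("UPDATE", [3])
```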
File: src/services/embedding_service.py
Converts text into numerical vectors (embeddings) for similarity comparison.
Gemini Embeddings (Primary):
- Model: models/gemini-embedding-001
- Dimension: 768
- Task type: retrieval_document
- Requires API call to Google (has latency + cost)
Learned Embeddings (After Training):
- Model: Fine-tuned all-MiniLM-L6-v2
- Dimension: 384
- Runs locally (no API call, no cost, faster)
- Only available after contrastive learning training
Both embeddings are stored per memory for dual-retrieval capability.
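A minimal sketch of the primary embedding call using the google-generativeai package (the real embedding_service.py may also batch requests or set an explicit output dimensionality):

```python
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

def get_embedding(text: str) -> list[float]:
    """Request a document embedding from the Gemini Embedding API (768-dim in this project)."""
    result = genai.embed_content(
        model="models/gemini-embedding-001",   # EMBEDDING_MODEL constant
        content=text,
        task_type="retrieval_document",
    )
    return result["embedding"]
```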
File: src/services/storage_service.py
All memories are stored in a single SQLite file (memory.db by default).
Schema:
CREATE TABLE IF NOT EXISTS memories (
id INTEGER PRIMARY KEY AUTOINCREMENT,
text TEXT NOT NULL,
embedding TEXT NOT NULL, -- JSON-serialized Gemini embedding (768-dim)
learned_embedding TEXT, -- JSON-serialized learned embedding (384-dim), nullable
metadata TEXT, -- JSON-serialized metadata dict
created_at TEXT NOT NULL -- ISO 8601 timestamp
);

Schema Migration: If the database was created before the contrastive learning update (without the learned_embedding column), the _migrate_schema() method automatically adds it via ALTER TABLE on startup. No data is lost.
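A hedged sketch of what that migration step could look like (the actual _migrate_schema() may differ in detail):

```python
import sqlite3

def migrate_schema(db_path: str = "memory.db") -> None:
    """Add the learned_embedding column to pre-existing databases, if it is missing."""
    conn = sqlite3.connect(db_path)
    try:
        columns = [row[1] for row in conn.execute("PRAGMA table_info(memories)")]
        if "learned_embedding" not in columns:
            conn.execute("ALTER TABLE memories ADD COLUMN learned_embedding TEXT")
            conn.commit()
    finally:
        conn.close()
```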
Available Methods:
| Method | Description |
|---|---|
| add_memory(entry) | Inserts a new MemoryEntry |
| get_all_memories() | Returns all memories as MemoryEntry objects |
| get_memory_count() | Returns total count of stored memories |
| update_learned_embedding(id, emb) | Updates the learned embedding for a specific memory |
| delete_memory(id) | Deletes a memory by its ID |
File: src/core/memory_manager.py → search_memory()
Retrieval uses cosine similarity between the query embedding and all stored memory embeddings.
Scoring Formula:
score = cosine_similarity(query_embedding, memory_embedding)
Where:
cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)
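In code, using NumPy (already a project dependency for similarity computation), this amounts to:

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```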
Dual-Embedding Retrieval Logic:
if trained_model_exists and memories_have_learned_embeddings:
    # Use fine-tuned MiniLM (384-dim, local, personalized)
    query_emb = contrastive_trainer.encode(query)
    for memory in all_memories:
        score = cosine_similarity(query_emb, memory.learned_embedding)
else:
    # Cold start fallback: use Gemini (768-dim, API call, generic)
    query_emb = embedding_service.get_embedding(query)
    for memory in all_memories:
        score = cosine_similarity(query_emb, memory.embedding)

Results are sorted by score (descending) and the top-K are returned.
File: src/core/chat_service.py
The chat system implements Retrieval-Augmented Generation (RAG):
- Retrieve: Search for top 3 memories relevant to the user's query
- Filter: Only include memories with similarity score > 0.6 (relevance threshold)
- Generate: Pass the query + filtered memories as context to the LLM
- Store: Simultaneously check if the user's chat message itself should be stored as a new memory
LLM Chat Prompt Behavior:
- Memories are presented as facts about the user (not the AI)
- The LLM is instructed NOT to randomly recite facts (e.g., don't say "Hi, you like sushi" when user just says "Hi")
- If memories contradict each other, the most recent one is prioritized
- The LLM does not explicitly say "I found this in your memory" unless contextually relevant
Example:
User: "What food should I order tonight?"
↓
Search finds: "I like south Indian food, especially dosa" (score: 0.82)
↓
LLM generates: "Based on your love for south Indian food, you might enjoy
ordering some dosas tonight! You could also try..."
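A simplified sketch of how retrieve, filter, and generate fit together (the limit parameter, dictionary keys, and generate_response are illustrative names, not taken from chat_service.py):

```python
RELEVANCE_THRESHOLD = 0.6  # memories scoring at or below this are left out of the context

def chat_with_memory(query: str, memory_manager, llm_service) -> str:
    # 1. Retrieve: top-3 candidate memories for the query
    results = memory_manager.search_memory(query, limit=3)
    # 2. Filter: keep only sufficiently relevant memories
    relevant = [r for r in results if r["score"] > RELEVANCE_THRESHOLD]
    # 3. Generate: pass query + memories (framed as facts about the user) to the LLM
    context = "\n".join(f"- {r['text']}" for r in relevant)
    prompt = (
        "Facts known about the user (use only when relevant; prefer the most recent on conflict):\n"
        f"{context}\n\nUser message: {query}"
    )
    return llm_service.generate_response(prompt)
```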
File: src/services/wakeword_service.py
Uses Picovoice Porcupine for on-device wake word detection. The system listens continuously in a low-power mode until it hears one of the configured wake words.
Default Wake Words: "jarvis", "computer"
Available Built-in Wake Words:
alexa, americano, blueberry, bumblebee, computer, grapefruit, grasshopper, hey google, hey siri, jarvis, ok google, picovoice, porcupine, terminator
How it works:
class WakeWordService:
def __init__(self, keywords=["jarvis", "computer"]):
self.porcupine = pvporcupine.create(
access_key=Config.PICOVOICE_ACCESS_KEY,
keywords=keywords
)
self.recorder = pvrecorder.PvRecorder(
device_index=-1, # default microphone
frame_length=self.porcupine.frame_length
)
def listen_for_wake_word(self):
"""Blocks until wake word is detected. Returns True/False."""
self.recorder.start()
while True:
pcm = self.recorder.read()
keyword_index = self.porcupine.process(pcm)
if keyword_index >= 0:
                    return True  # Wake word detected!

Requirements:
- PICOVOICE_ACCESS_KEY in .env
- pvporcupine and pvrecorder packages
- A working microphone
Graceful Degradation: If the Picovoice key is missing or initialization fails, the wake word service is disabled but the rest of the app still works. Voice chat falls back to direct listening mode.
File: src/services/stt_service.py
Converts spoken audio to text using Google Speech Recognition (via the SpeechRecognition library).
How it works:
- Adjusts for ambient noise (0.5 second calibration)
- Listens for speech (timeout: 5 seconds, max phrase: 10 seconds)
- Sends audio to Google STT API for transcription
- Returns the transcribed text
class STTService:
def listen_and_transcribe(self) -> str:
with sr.Microphone() as source:
self.recognizer.adjust_for_ambient_noise(source, duration=0.5)
audio = self.recognizer.listen(source, timeout=5, phrase_time_limit=10)
text = self.recognizer.recognize_google(audio)
            return text

Error Handling:
| Error | Behavior |
|---|---|
| WaitTimeoutError | "No speech detected. Timed out." → returns None |
| UnknownValueError | "Could not understand audio." → returns None |
| RequestError | Google STT API unreachable → returns None |
| Exception | Generic microphone error → returns None |
Requirements:
- SpeechRecognition and pyaudio packages
- A working microphone
- Internet connection (for Google STT API)
File: main.py → voice_chat_flow()
Combines wake word detection + STT + chat into a continuous voice loop:
┌──────────────────────────────────────────┐
│ Voice Chat Loop │
│ │
│ 1. Wait for wake word ("Jarvis"...) │
│ ↓ │
│ 2. Listen for speech → STT → text │
│ ↓ │
│ 3. If "exit"/"quit" → break │
│ ↓ │
│ 4. Search memories + generate response │
│ ↓ │
│ 5. Display response │
│ ↓ │
│ 6. Auto-store if worth remembering │
│ ↓ │
│ 7. Go back to step 1 │
└──────────────────────────────────────────┘
If wake word is not available (missing API key), it falls back to direct listening without requiring a wake word first.
The default Gemini embeddings are generic — they capture general semantic similarity but don't understand the specific user's retrieval patterns. For example:
- Query: "What do I study?" → Gemini might rank "I use VS Code" high because both are tech-related
- With a fine-tuned model: The model learns that "study" queries should prioritize academic memories, not tool preferences
Contrastive learning trains a local model to understand which memories are relevant for which queries, specific to this user's data.
Research Contribution:
- Personalized retrieval via contrastive fine-tuning on user-specific data
- Self-supervised data pipeline (LLM generates training data, no manual labeling)
- Continuous learning as memories grow
- Benchmarkable: Gemini baseline vs fine-tuned Recall@K comparison
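Recall@K here means: for a set of evaluation queries with known correct memories, the fraction of queries whose correct memory appears in the top K results. A minimal sketch of how that comparison could be computed (the evaluation pairs and the two retrieval functions are placeholders, not project code):

```python
def recall_at_k(eval_pairs, retrieve_fn, k: int = 3) -> float:
    """eval_pairs: list of (query, correct_memory_text); retrieve_fn(query, k) returns ranked texts."""
    hits = sum(1 for query, correct in eval_pairs if correct in retrieve_fn(query, k))
    return hits / len(eval_pairs)

# Same evaluation set, two retrievers:
#   recall_at_k(pairs, gemini_retrieve, k=3)   vs   recall_at_k(pairs, finetuned_retrieve, k=3)
```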
File: src/services/triplet_service.py
The self-supervised data pipeline. Generates training data without any manual labeling.
What is a triplet?
(query, positive, negative)
↓ ↓ ↓
"What do "I study "I like
I study?" CS at IISC" badminton"
The model learns: query should be closer to positive than to negative in embedding space.
How triplets are generated:
Step 1 — Generate queries using the LLM: For each stored memory, the LLM generates natural questions that this memory should answer.
Memory: "I study Computer Science at IISC Bangalore."
↓
LLM Prompt: "Generate 2 natural questions that this memory answers..."
↓
Queries: ["What do I study?", "Which university do I attend?"]
Step 2 — Select hard negatives: For each triplet, a "hard negative" is selected — a memory that is somewhat related but NOT the correct answer. This is more informative for training than random negatives.
Strategy:
- Compute cosine similarity between the anchor memory and all other memories
- Sort by similarity
- Skip the top 20% (too similar — might actually be valid)
- Pick from the 20–60% range (the "hard negative" zone)
Memory: "I study CS at IISC"
↓ Similarities:
0.92 - "I'm working on a research paper about AI" ← too similar, skip
0.78 - "I use VS Code as my editor" ← HARD NEGATIVE ✓
0.45 - "My dog's name is Bruno" ← too easy, skip
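A hedged sketch of this band-based selection from Step 2 (thresholds follow the strategy above; the actual triplet_service.py logic may differ):

```python
import random
import numpy as np

def pick_hard_negative(anchor_embedding, candidates):
    """candidates: list of (text, embedding) for every other memory. Picks from the 20-60% band."""
    def cos(a, b):
        a, b = np.asarray(a), np.asarray(b)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    ranked = sorted(candidates, key=lambda m: cos(anchor_embedding, m[1]), reverse=True)
    start = int(len(ranked) * 0.2)                 # skip the top 20% (too similar, may be valid answers)
    end = max(int(len(ranked) * 0.6), start + 1)   # take the 20-60% "hard negative" zone
    band = ranked[start:end] or ranked             # fall back to any candidate if the band is empty
    return random.choice(band)[0]
```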
Step 3 — Form triplets:
{
"query": "What do I study?",
"positive": "I study Computer Science at IISC Bangalore.",
"negative": "I use VS Code as my primary code editor."
}

Minimum requirement: 3 memories to generate triplets (need at least 1 positive + negatives).
File: src/ml/contrastive_trainer.py
The core ML component that fine-tunes a sentence transformer model.
Base Model: sentence-transformers/all-MiniLM-L6-v2
- 6-layer transformer, 22M parameters
- Output dimension: 384
- ~80MB model size
- Runs entirely on CPU (no GPU needed)
Training Process:
InputExample(texts=[query, positive, negative])
↓
TripletLoss
(learns to minimize distance(query, positive)
while maximizing distance(query, negative))
↓
Fine-tuned model saved to models/retriever/
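Concretely, the objective minimized here is the standard triplet loss (sentence-transformers' TripletLoss uses Euclidean distance and a margin of 5 by default):

loss = max( d(f(query), f(positive)) - d(f(query), f(negative)) + margin, 0 )

where f is the MiniLM encoder and d the distance metric; the loss reaches zero once the positive sits closer to the query than the negative by at least the margin.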
Implementation details:
class ContrastiveTrainer:
def train(self, triplets, epochs=3, batch_size=16):
# Convert to InputExamples
train_examples = [
InputExample(texts=[t["query"], t["positive"], t["negative"]])
for t in triplets
]
# TripletLoss with default margin
train_loss = losses.TripletLoss(model=self.model)
# Warmup: 10% of total training steps
warmup_steps = int(len(train_dataloader) * epochs * 0.1)
# Fine-tune
self.model.fit(
train_objectives=[(train_dataloader, train_loss)],
epochs=epochs,
warmup_steps=warmup_steps,
output_path=self.model_dir
        )

Warm-Starting: If a previous checkpoint exists at models/retriever/, the trainer loads it instead of the base model. This means each retraining continues from where the last one left off — true continuous learning.
Post-Training Re-Embedding: After training, ALL stored memories are re-encoded with the updated model:
def reembed_all(self, storage_service):
memories = storage_service.get_all_memories()
texts = [m.text for m in memories]
learned_embeddings = self.encode(texts) # batch encode
for memory, emb in zip(memories, learned_embeddings):
        storage_service.update_learned_embedding(memory.id, emb)

Training Log: Each training run is logged to models/retriever/training_log.json:
{
"runs": [
{
"version": 1,
"timestamp": "2026-02-11T14:30:00",
"num_triplets": 40,
"epochs": 3,
"duration_seconds": 45.2,
"memory_count": 20
},
{
"version": 2,
"timestamp": "2026-02-12T10:15:00",
"num_triplets": 80,
"epochs": 3,
"duration_seconds": 78.6,
"memory_count": 40
}
]
}

File: src/core/memory_manager.py
The memory manager transparently switches between Gemini and learned embeddings.
On Memory Storage:
# Always generate Gemini embedding
embedding = self.embedding_service.get_embedding(text) # 768-dim, API call
# Also generate learned embedding if model is trained
learned_emb = None
if self.contrastive_trainer.is_trained():
learned_emb = self.contrastive_trainer.encode([text])[0] # 384-dim, local
memory_entry = MemoryEntry(
text=text,
embedding=embedding,
learned_embedding=learned_emb
)

On Search:
if trained_model_exists and memories_have_learned_embeddings:
# LEARNED PATH: fast, local, personalized
query_emb = self.contrastive_trainer.encode([query])[0]
score = cosine_similarity(query_emb, memory.learned_embedding)
else:
# GEMINI PATH: original behavior, API call
query_emb = self.embedding_service.get_embedding(query)
    score = cosine_similarity(query_emb, memory.embedding)

Lazy Loading: The ContrastiveTrainer is loaded lazily (on first access) via a @property. This prevents importing PyTorch at startup if it's not needed.
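A minimal sketch of that lazy-loading pattern (attribute names are illustrative):

```python
class MemoryManager:
    def __init__(self):
        self._trainer = None  # created on first use, so PyTorch isn't imported at startup

    @property
    def contrastive_trainer(self):
        if self._trainer is None:
            # Deferred import: torch / sentence-transformers load only when actually needed
            from src.ml.contrastive_trainer import ContrastiveTrainer
            self._trainer = ContrastiveTrainer()
        return self._trainer
```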
File: src/core/memory_manager.py
Training is triggered automatically when enough new memories accumulate.
Threshold: RETRAIN_THRESHOLD = 20 (configurable in memory_manager.py)
How it works:
Memory stored → Check: (current_count - last_trained_count) >= 20?
│
YES → Start background thread → _auto_train()
│
NO → Continue normally
Auto-train runs in a daemon thread:
- The response to the user returns immediately with "training_triggered": true
- Training happens in the background (2–5 minutes on CPU)
- A _is_training flag prevents concurrent training runs
- Once complete, new searches automatically use the updated model
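A hedged sketch of the trigger, written as a memory-manager method (the helper name maybe_trigger_training and the storage attribute are illustrative; _auto_train, _is_training, and _last_trained_count are the names described above):

```python
import threading

RETRAIN_THRESHOLD = 20

class MemoryManagerSketch:
    def maybe_trigger_training(self) -> bool:
        """Start background training when enough new memories have accumulated."""
        new_memories = self.storage.get_memory_count() - self._last_trained_count
        if new_memories >= RETRAIN_THRESHOLD and not self._is_training:
            self._is_training = True
            thread = threading.Thread(target=self._auto_train, daemon=True)
            thread.start()  # returns immediately; the caller reports "training_triggered": true
            return True
        return False
```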
Training lifecycle tracking:
- _last_trained_count is initialized from training_log.json on startup
- After each training run, the memory count at training time is saved
- The threshold compares current count vs last trained count
Example timeline:
Memory 1: _last_trained_count=0, diff=1 → no train
Memory 5: diff=5 → no train
Memory 19: diff=19 → no train
Memory 20: diff=20 → AUTO-TRAIN TRIGGERED (background thread)
→ triplet generation (LLM) → fine-tuning → re-embedding
→ _last_trained_count updated to 20
Memory 21: diff=1 → no train
...
Memory 40: diff=20 → AUTO-TRAIN TRIGGERED again
The complete lifecycle as the memory bank grows:
Phase 1: Cold Start (0–19 memories)
├── All storage uses Gemini embeddings (768-dim)
├── All retrieval uses Gemini embeddings
└── No training occurs
Phase 2: First Training (memory 20 triggers training)
├── LLM generates ~40 triplets from 20 memories
├── Base MiniLM fine-tuned → model v1
├── All 20 memories re-embedded with model v1
└── Future searches use learned embeddings (384-dim)
Phase 3: Continuous Improvement (every 20 new memories)
├── Model warm-starts from previous checkpoint
├── Triplets generated from ALL memories (not just new ones)
├── Model gets increasingly specialized to this user
└── Retrieval accuracy improves with each cycle
Phase 4: Mature System (100+ memories)
├── Model deeply understands user's memory landscape
├── Retrieval is highly personalized
├── Each retrain cycle adds incremental improvements
└── Can demonstrate measurable improvement over baseline
File: main.py
Interactive terminal application built with the Rich library.
| # | Option | Description |
|---|---|---|
| 1 | Add Memory | Enter text → LLM decides if it's worth storing |
| 2 | Search Memories | Enter a query → see top matched memories with scores |
| 3 | Chat with Memory | Interactive chat loop with RAG-based responses |
| 4 | Voice Chat | Wake word → speech → chat → display response (loop) |
| 5 | Train Retriever | Manually trigger contrastive learning training |
| 6 | Exit | Exit the application |
When Train Retriever (option 5) is selected, the CLI walks through 3 steps with Rich status displays:
Step 1: Generating training triplets...
→ Asks LLM to generate queries for each memory
→ Shows sample triplets in a table
Step 2: Training contrastive model...
→ Fine-tunes MiniLM with TripletLoss
→ Shows training metrics (version, duration, triplets used)
Step 3: Re-embedding all memories...
→ Encodes all memories with the updated model
→ Shows count of re-embedded memories
File: app.py
RESTful API running on port 5000 (default).
Health check endpoint.
Response:
{"status": "healthy", "service": "Quantum Memory Layer"}Add a new memory (LLM decides if it's worth storing).
Request:
{"text": "I like Python programming"}Response (stored):
{
"stored": true,
"reason": "This is a programming language preference.",
"message": "Memory stored successfully."
}

Response (with auto-train):
{
"stored": true,
"reason": "...",
"message": "Memory stored successfully. Background retraining started.",
"training_triggered": true
}

Search stored memories by semantic similarity.
Parameters:
| Param | Type | Default | Description |
|---|---|---|---|
| q | string | required | Search query |
| limit | int | 5 | Max results to return |
Response:
[
{
"text": "I like Python programming",
"score": 0.8723,
"created_at": "2026-02-11T14:30:00",
"id": 1
}
]

Chat with the LLM using memory context (RAG).
Request:
{"text": "What programming languages do I like?"}Response:
{
"response": "Based on what I know about you, you enjoy Python programming!",
"memory_action": {
"stored": false,
"reason": "This is a question, not a fact to remember.",
"message": "Memory not stored."
}
}

Trigger contrastive learning training.
Request (optional body):
{"epochs": 3, "queries_per_memory": 2}Response:
{
"duration_seconds": 45.2,
"num_triplets": 40,
"epochs": 3,
"model_version": 1,
"model_path": "models/retriever",
"memories_reembedded": 20
}

Check the status of the contrastive retriever.
Response (not trained):
{
"available": true,
"is_trained": false,
"memory_count": 8
}

Response (trained):
{
"available": true,
"is_trained": true,
"memory_count": 45,
"total_versions": 2,
"latest_training": {
"version": 2,
"timestamp": "2026-02-11T14:30:00",
"num_triplets": 80,
"epochs": 3,
"duration_seconds": 78.6,
"memory_count": 40
}
}

Prerequisites:

- Python 3.10+
- A Google Gemini API key
- (Optional) Picovoice access key for wake word detection
- (Optional) A working microphone for voice features
# 1. Clone the repository
git clone <repo-url>
cd memory-layer
# 2. Create virtual environment
python -m venv venv
venv\Scripts\activate # Windows
# source venv/bin/activate # macOS/Linux
# 3. Install dependencies
pip install -r requirements.txt
# 4. Set up environment variables
copy .env.example .env
# Edit .env and add your GEMINI_API_KEY
# 5. Run the CLI
python main.py
# OR run the Flask API
python app.py

# Optional: ML dependencies for contrastive learning
# This is a ~2GB download (PyTorch + sentence-transformers)
pip install sentence-transformers torch

The system works without these packages — it just uses Gemini embeddings only.
File: verify.py
Tests the fundamental memory operations:
- Adding memories
- Searching memories
- Chat with memory context
- Memory update/conflict resolution
python verify.py

File: verify_contrastive.py
End-to-end benchmark with 5 tests:
| Test | What It Verifies |
|---|---|
| Cold Start Fallback | Without a trained model, system uses Gemini (no crash) |
| Triplet Generation | LLM generates valid (query, positive, negative) triplets |
| Model Training | MiniLM fine-tunes without errors, checkpoint saved |
| Re-embedding | All memories get learned embeddings after training |
| Retrieval Comparison | Side-by-side Gemini vs fine-tuned retrieval results |
python verify_contrastive.py

Uses a separate test database (test_contrastive.db) to avoid corrupting production data.
File: demo.py
Interactive debugging tool that shows internal system state:
- Raw embeddings
- Similarity scores
- LLM decisions
python demo.py

| Package | Version | Purpose |
|---|---|---|
| google-generativeai | latest | Gemini LLM and embedding API |
| python-dotenv | latest | .env file loading |
| rich | latest | Terminal UI (colors, tables, panels) |
| numpy | latest | Cosine similarity computation |
| flask | latest | REST API server |
| Package | Purpose |
|---|---|
| SpeechRecognition | Google Speech-to-Text wrapper |
| pyaudio | Microphone audio capture |
| pvporcupine | On-device wake word detection |
| pvrecorder | Audio recording for Porcupine |
| Package | Size | Purpose |
|---|---|---|
| sentence-transformers | ~500MB | MiniLM model loading, training utilities |
| torch | ~1.5GB | PyTorch deep learning framework |
Note: The ML dependencies are optional. Without them, the system operates using Gemini embeddings only (original behavior). The contrastive learning features are disabled gracefully.
References (APA style). The three works most directly relevant to this project are listed first, followed by the remaining cited works.
[1] Park, J. S., O’Brien, J. C., Cai, C. J., Morris, M. R., Liang, P., & Bernstein, M. S. (2023). Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST '23). https://arxiv.org/abs/2304.03442
[2] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2005.11401
[3] Xu, Z., Wang, K., Li, Y., & Zhao, H. (2025). A-Mem: Agentic memory for large language model agents. Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2406.07082
[4] Shan, Y., Zhang, Z., Wang, Y., & Liu, H. (2025). Cognitive memory in large language models. arXiv Preprint. https://arxiv.org/abs/2504.02441
[5] Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 3982–3992). https://arxiv.org/abs/1908.10084
[6] Zhou, Y., & Chen, W. (2025). Optimizing retrieval for retrieval-augmented generation via reinforced contrastive learning. arXiv Preprint. https://arxiv.org/abs/2510.24652
[7] Wang, Y., Mishra, S., Alipoormolabashi, P., Kordi, Y., Mirzaei, M., Mirhoseini, A., & others. (2023). Recursively summarizing enables long-term dialogue memory in large language models. Neurocomputing, 573, 126455. https://arxiv.org/abs/2308.15022
[8] Lee, M. K., Kiesler, S., & Forlizzi, J. (2012). Personalization in human-robot interaction: A longitudinal field experiment. In Proceedings of the 7th ACM/IEEE International Conference on Human-Robot Interaction (HRI) (pp. 319–326). https://doi.org/10.1145/2157689.2157804
[9] Irfan, B., Bernotat, J., Eyssel, F., & Kopp, S. (2019). Personalization in long-term human-robot interaction. In Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction Workshops. https://dl.acm.org/doi/10.5555/3319921.3319973
[10] Beneventi, H., Ribeiro, T., & Paiva, A. (2023). MIRIAM: A mind-inspired architecture for adaptive human-robot interaction. International Journal of Social Robotics, 15(2), 267–289. https://doi.org/10.1007/s12369-022-00897-8
→ Papers [1], [3]
→ Papers [2], [5], [6]
→ Papers [4], [7]
→ Papers [8], [9], [10]