02-architecture-deep-dive
Part 1: System Architecture

Task 1.1: Component Mapping

Q: Describe the data flow from when a user defines a goal to when worker agents execute. Include all major components.
A: The flow begins at the Honeycomb Dashboard, where the user submits a natural-language goal. This goal is sent to the Coding Agent, which acts as the architect: it synthesizes a GraphSpec (agent.json) and generates the necessary connection code. The specification is passed to the AgentRunner, which initializes the NodeContext (tools, memory, and credentials). Finally, the GraphExecutor runs the specialized Worker Agents to perform the tasks.
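To make the hand-off sequence concrete, here is a minimal TypeScript sketch of the pipeline. The component names follow the answer above, but every type field and function body is an assumption for illustration, not the actual GraphSpec format.

```typescript
// Hypothetical shapes for the goal -> worker-execution pipeline described above.
interface GraphSpec {
  goal: string;
  nodes: { id: string; tool?: string; prompt?: string }[];
  edges: { from: string; to: string }[];
}

interface NodeContext {
  tools: string[];                      // tool names injected for this run
  memory: Record<string, unknown>;      // STM for this session
  credentials: Record<string, string>;
}

// Stubbed stages standing in for the real components.
async function synthesizeGraphSpec(goal: string): Promise<GraphSpec> {
  return { goal, nodes: [{ id: "n1", tool: "web_search_tool" }], edges: [] };
}

async function initNodeContext(spec: GraphSpec): Promise<NodeContext> {
  const tools = spec.nodes.flatMap(n => (n.tool ? [n.tool] : []));
  return { tools, memory: {}, credentials: {} };
}

async function executeGraph(spec: GraphSpec, ctx: NodeContext): Promise<void> {
  for (const node of spec.nodes) {
    console.log(`Worker executing node ${node.id} with tools: ${ctx.tools.join(", ")}`);
  }
}

// Dashboard goal -> Coding Agent -> AgentRunner -> GraphExecutor -> Worker Agents
async function runGoal(goal: string): Promise<void> {
  const spec = await synthesizeGraphSpec(goal); // Coding Agent ("Architect")
  const ctx = await initNodeContext(spec);      // AgentRunner
  await executeGraph(spec, ctx);                // GraphExecutor runs the Workers
}
```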
Q: Explain the "self-improvement loop" - what happens when an agent fails?
A: When a failure occurs, the system captures a detailed "failure packet" (logs, state, and prompts) and stores it. The Coding Agent then analyzes this data to identify the root cause. It "evolves" the agent by modifying the node graph or rewriting the connection code to fix the issue. The improved agent is then redeployed to handle future tasks more effectively.
Q: What is the difference between a Coding Agent and a Worker Agent?
A: The Coding Agent is the build-time "Architect" that designs the agent's logic and structure; the Worker Agent is the runtime "Specialist" that executes specific tasks using tools and LLM reasoning.
Q: What is the difference between STM (Short-Term Memory) and LTM (Long-Term Memory)?
A: STM is session-specific context used for immediate task execution; LTM (or RLM) is a persistent vector store that allows agents to recall patterns, past failures, and user preferences across sessions.
Q: What is the difference between Hot storage and Cold storage for events?
A: Hot storage (Redis) holds real-time, high-speed data such as heartbeats and current session states; Cold storage (TimescaleDB/S3) holds historical logs, event traces, and audit trails used for long-term analysis and evolution. A sketch of the two write paths follows.
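A minimal sketch of the hot path, assuming the standard Node.js redis client; the key naming and 30-second TTL are illustrative, not the actual convention.

```typescript
import { createClient } from "redis";

const hot = createClient({ url: process.env.REDIS_URL });
await hot.connect(); // node-redis v4 clients connect explicitly

// Hot path: overwrite the latest heartbeat with a short TTL so stale agents age out.
async function recordHeartbeat(agentId: string): Promise<void> {
  await hot.set(`heartbeat:${agentId}`, Date.now().toString(), { EX: 30 });
}

// Cold path: the same event would also be appended (batched, not per-call)
// to TimescaleDB/S3 for audit trails and the evolution loop.
```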
________________________________________
Task 1.2: Database Design

Q: What type of data is stored in TimescaleDB? Why TimescaleDB specifically?
A: It stores time-series data, including high-frequency metrics, event traces, and cost tracking. It is used because it combines the reliability of PostgreSQL with specialized optimizations for fast ingestion and complex temporal queries.
Q: What is stored in MongoDB? Why a document database?
A: It stores agent configurations and node graphs. A document database is chosen because these graphs have nested, flexible structures that are difficult to map to rigid relational tables.
Q: What is the primary purpose of PostgreSQL?
A: It serves as the core relational store for managing user accounts, team structures, permissions, and metadata that require ACID compliance and relational integrity. A minimal connection setup for all three stores is sketched below.
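A connection sketch using the standard pg, mongodb, and redis Node.js clients; the environment-variable names are placeholders, not the project's actual configuration keys.

```typescript
import { Pool } from "pg";
import { MongoClient } from "mongodb";
import { createClient } from "redis";

// Core relational store: users, teams, permissions (PostgreSQL).
const core = new Pool({ connectionString: process.env.POSTGRES_URL });

// Time-series store: metrics, traces, cost tracking (TimescaleDB speaks the pg protocol).
const timeseries = new Pool({ connectionString: process.env.TIMESCALE_URL });

// Document store: agent configurations and node graphs (MongoDB).
const configs = new MongoClient(process.env.MONGO_URL ?? "mongodb://localhost:27017");

// Hot storage: heartbeats and live session state (Redis).
const hot = createClient({ url: process.env.REDIS_URL });
```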
________________________________________
Task 1.3: Real-time Communication

Q: What protocol connects the SDK to the Hive backend for policy updates?
A: It uses WebSockets (specifically via Socket.io), allowing the backend to push real-time updates (like budget changes) to the SDK without constant polling.
Q: How does the dashboard receive live agent metrics?
A: The dashboard subscribes to a WebSocket stream from the Hive backend, which pushes execution events and performance metrics as they happen.
Q: What is the heartbeat interval for SDK health checks?
A: The standard heartbeat interval is 10 seconds. A minimal client-side sketch follows.
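A client-side sketch using socket.io-client; the endpoint, auth field, and event names (policy_update, heartbeat) are assumptions, while the 10-second interval comes from the answer above.

```typescript
import { io } from "socket.io-client";

// Endpoint and auth shape are placeholders for illustration.
const socket = io("wss://hive.example.com", {
  auth: { apiKey: process.env.HIVE_API_KEY },
});

// Server pushes policy changes (e.g., budget caps) without polling.
socket.on("policy_update", (policy: { budgetLimit: number }) => {
  console.log("New budget cap:", policy.budgetLimit);
});

// 10-second heartbeat, matching the interval noted above.
setInterval(() => socket.emit("heartbeat", { ts: Date.now() }), 10_000);
```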
________________________________________
Part 2: Code Analysis

Task 2.1: API Routes

Q: List all the main API route prefixes.
A: The Aden Hive backend organizes its services under the following primary prefixes:
• /v1/control: Core agent orchestration and command logic.
• /v1/agents: Management of agent lifecycles, exports, and graph configurations.
• /v1/analytics: Retrieval of time-series performance and cost data from TimescaleDB.
• /v1/auth: User authentication, session management, and team-based permissions.
• /v1/logs: Real-time WebSocket and historical event log streaming.
Q: For the /v1/control routes, what are the main endpoints and their purposes?
A: • POST /execute: Starts a new agent run based on a specific goal_id.
• PATCH /policy: Updates runtime guardrails (like budget caps) without stopping the agent.
• POST /intervention: Submits human feedback to "Human-in-the-Loop" nodes to resume a paused execution.
• GET /status: Provides the current live state of a specific node-graph execution.
A client-side sketch of the first two calls follows.
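Only the routes and methods come from the list above; the base URL, request fields, and response shapes are assumptions for illustration.

```typescript
const BASE = "https://hive.example.com/v1/control"; // placeholder host

async function startRun(goalId: string, jwt: string): Promise<{ runId: string }> {
  // POST /execute starts a new agent run for a goal.
  const res = await fetch(`${BASE}/execute`, {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: `Bearer ${jwt}` },
    body: JSON.stringify({ goal_id: goalId }),
  });
  if (!res.ok) throw new Error(`execute failed: ${res.status}`);
  return res.json();
}

async function raiseBudget(runId: string, capUsd: number, jwt: string): Promise<void> {
  // PATCH /policy updates guardrails without stopping the agent.
  await fetch(`${BASE}/policy`, {
    method: "PATCH",
    headers: { "Content-Type": "application/json", Authorization: `Bearer ${jwt}` },
    body: JSON.stringify({ run_id: runId, budget_cap_usd: capUsd }),
  });
}
```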
Q: What authentication method is used for API requests?
A: Requests from the Honeycomb Dashboard use JWT (JSON Web Tokens). Communication from the SDK to the Hive backend is secured via team-scoped API keys.
________________________________________
Task 2.2: MCP Tools Deep Dive

Q: List all Budget, Analytics, and Policy tools.
A: • Budget Tools: get_agent_budget, set_budget_limit, check_budget_compliance.
• Analytics Tools: get_execution_metrics, summarize_token_usage, analyze_performance_trends.
• Policy Tools: enforce_policy, update_agent_strategy, verify_compliance.
Q: Pick ONE tool and explain its parameters, returns, and use case.
A: Tool: web_search_tool
• Parameters: It accepts a query (string) and an optional limit (int).
• Returns: A list of search results containing title, url, and a text snippet.
• Use Case: The Coding Agent injects this tool when the user's goal requires external, real-time data that isn't part of the LLM's training set or internal company documents. A sketch of this interface appears below.
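A standalone sketch of the described interface; the search endpoint is a placeholder, and only the parameter names and result fields come from the answer above.

```typescript
interface SearchResult {
  title: string;
  url: string;
  snippet: string;
}

// Hypothetical signature; in practice the tool is invoked through the MCP layer.
async function webSearchTool(query: string, limit = 5): Promise<SearchResult[]> {
  const res = await fetch(
    `https://search.example.com/api?q=${encodeURIComponent(query)}&n=${limit}`
  );
  if (!res.ok) throw new Error(`search failed: ${res.status}`);
  return (await res.json()) as SearchResult[];
}
```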
________________________________________
Task 2.3: Event Specification

Q: What are the four event types that can be sent from SDK to server?
A: 1. MetricEvent: Performance and cost data.
2. TraceEvent: Tracking the path taken through the node graph.
3. DecisionEvent: Detailed logs of LLM reasoning and tool selection.
4. LogEvent: Standard developer logs (stdout/stderr).
Q: For a MetricEvent, list at least 5 fields that are captured.
A: • node_id: The identifier for the active node.
• latency_ms: Time taken for node execution.
• token_usage: Total input and output tokens consumed.
• cost_usd: Calculated monetary cost of the LLM call.
• model_id: The specific model version used (e.g., gpt-4o-mini).
These fields are collected into a type below.
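The same fields as a TypeScript type, assuming they sit at the top level of the event (the real envelope likely carries more, such as timestamps and team scoping):

```typescript
// Illustrative shape built from the listed fields; not the authoritative schema.
interface MetricEvent {
  node_id: string;     // identifier of the active node
  latency_ms: number;  // node execution time
  token_usage: number; // total input + output tokens
  cost_usd: number;    // monetary cost of the LLM call
  model_id: string;    // e.g., "gpt-4o-mini"
}
```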
Q: What is "Layer 0 content capture" and when is it used?
A: "Layer 0 content capture" is the raw recording of the exact prompts sent to the LLM and the exact string responses received. It is used during the Self-Improvement/Evolution phase so the Coding Agent can analyze exactly why an agent failed (e.g., hallucination or bad formatting) and generate a permanent fix.
________________________________________
Part 3: Design & Scaling

Task 3.1: Scaling Scenario

Q: Which components would be the bottleneck? Why?
A: The primary bottlenecks would likely be the WebSocket gateway and database write throughput.
• WebSocket Gateway: Managing 1,000 real-time streams across 50 teams can exceed the single-process limits of the Node.js backend, causing high latency or dropped updates.
• Database (TimescaleDB): Ingesting and indexing millions of time-series events (metrics, traces, logs) simultaneously requires significant I/O and CPU, especially during concurrent analytical queries.
Q: How would you horizontally scale the system?
A: I would deploy the backend (Hive) and frontend (Honeycomb) as stateless containers in a cluster (e.g., Kubernetes). A load balancer with sticky sessions would distribute WebSocket connections across multiple backend pods. I would also introduce a message queue (Redis or Kafka) to decouple metric ingestion from the database write layer, allowing asynchronous, batched processing.
Q: What database optimizations would you recommend?
A: For TimescaleDB, I would implement hypertables partitioned by both time and team_id to speed up team-specific queries. I would also use read replicas for heavy analytical dashboard queries and connection pooling (e.g., PgBouncer) to manage the surge in database connections from scaled-out backend pods.
Q: How would you ensure team data isolation at scale?
A: I would enforce strict namespace scoping at the application layer, ensuring every query is hard-filtered by team_id. At the database level, I would enable Row-Level Security (RLS) in PostgreSQL as a secondary safety net that prevents cross-team data access even if an application-level filter fails. Both layers are sketched below.
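A sketch of both layers, assuming a pg Pool and an illustrative agent_runs table; the RLS statements use standard PostgreSQL syntax and are shown as comments because they run once as a migration.

```typescript
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.POSTGRES_URL });

// Application layer: every query is hard-filtered by team_id.
async function getAgentRuns(teamId: string): Promise<unknown[]> {
  const { rows } = await pool.query(
    "SELECT * FROM agent_runs WHERE team_id = $1", // table name is illustrative
    [teamId]
  );
  return rows;
}

// Database layer (run once as a migration): RLS catches any query that
// slips past the application filter.
//   ALTER TABLE agent_runs ENABLE ROW LEVEL SECURITY;
//   CREATE POLICY team_isolation ON agent_runs
//     USING (team_id = current_setting('app.current_team_id'));
```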
________________________________________
Task 3.2: New Feature Design: Agent Collaboration Logs

Database Schema (TimescaleDB)
Using a relational time-series structure allows for fast querying of historical communication. The proposed schema diagram: https://imgur.com/a/j0w4K2k
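Since the schema is linked as an image rather than inlined, here is one possible DDL for it, expressed as a migration string; every column name is an assumption, chosen to stay consistent with the API payload below.

```typescript
// Hypothetical DDL matching the linked schema diagram; columns are assumptions.
const createCollaborationLogs = `
  CREATE TABLE collaboration_logs (
    time       TIMESTAMPTZ NOT NULL,
    team_id    TEXT        NOT NULL,
    thread_id  UUID        NOT NULL,
    from_agent TEXT        NOT NULL,
    to_agent   TEXT        NOT NULL,
    message    JSONB       NOT NULL
  );
  -- Partition by time as a TimescaleDB hypertable for fast historical queries.
  SELECT create_hypertable('collaboration_logs', 'time');
`;
```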
API Endpoint Design
• Route: GET /v1/analytics/collaboration
• Purpose: Retrieves a paginated list of communication logs.
• Payload:
```json
{
  "team_id": "team_123",
  "filters": {
    "thread_id": "uuid-v4",
    "agent_id": "agent_alpha",
    "time_range": { "start": "2026-01-01T00:00:00Z", "end": "..." }
  },
  "limit": 50
}
```
Q: How would this integrate with existing event batching?
A: The SDK would treat collaboration logs as a new event type (CommEvent). These would be queued in the SDK's internal buffer and sent to the server in the existing 5-second batch window alongside MetricEvents and TraceEvents, ensuring that communication overhead does not impact real-time execution performance. A buffering sketch follows.
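A minimal buffering sketch; the event union, payload shapes, and the batch endpoint under /v1/logs are assumptions, while the 5-second window comes from the answer above.

```typescript
// Folding a new CommEvent type into the existing batch window.
type SdkEvent =
  | { kind: "metric"; payload: unknown }
  | { kind: "trace"; payload: unknown }
  | { kind: "comm"; payload: { thread_id: string; from: string; to: string } };

const buffer: SdkEvent[] = [];

function emit(event: SdkEvent): void {
  buffer.push(event); // non-blocking; execution is never held up by logging
}

setInterval(async () => {
  if (buffer.length === 0) return;
  const batch = buffer.splice(0, buffer.length); // drain the buffer
  await fetch("https://hive.example.com/v1/logs/batch", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(batch),
  });
}, 5_000); // existing 5-second batch window
```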
________________________________________
Task 3.3: Failure Handling

Q: How should failures be categorized (types of failures)?
A: 1. Infrastructure Failures: Network timeouts, LLM rate limits, or tool API outages.
2. Logic Failures: Bad JSON formatting, missing required output keys, or schema violations.
3. Reasoning Failures: Hallucinations, goal misalignment, or human-rejected content.
Q: What data should be captured for the Coding Agent to improve?
A: The system should capture the "Failure Context Packet": the exact LLM prompt sent, the raw response string received, the specific tool parameters used, the node's memory state (STM), and any human feedback provided during rejection.
Q: How do you prevent infinite failure loops?
A: I would implement a Max Evolution Threshold (e.g., 3 attempts). Each successive attempt by the Coding Agent would use a higher-reasoning model (e.g., GPT-4o-mini → GPT-4o) and include a more restrictive "Constraint Prompt" to narrow the solution space. If the threshold is hit, the session is terminated and flagged for manual audit. A sketch of this cap appears below.
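A sketch of the cap, where the model ladder, constraint wording, and the attemptFix hook are placeholders for the real evolution machinery.

```typescript
const MODEL_LADDER = ["gpt-4o-mini", "gpt-4o"]; // escalating reasoning strength
const MAX_EVOLUTION_ATTEMPTS = 3;

async function evolveWithCap(
  attemptFix: (model: string, constraints: string) => Promise<boolean>
): Promise<"fixed" | "escalated"> {
  for (let attempt = 0; attempt < MAX_EVOLUTION_ATTEMPTS; attempt++) {
    // Later attempts use a higher-reasoning model and a tighter constraint prompt.
    const model = MODEL_LADDER[Math.min(attempt, MODEL_LADDER.length - 1)];
    const constraints = `Attempt ${attempt + 1}: do not repeat prior failed strategies.`;
    if (await attemptFix(model, constraints)) return "fixed";
  }
  return "escalated"; // threshold hit: terminate and flag for manual audit
}
```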
Q: When should the system escalate to human intervention?
A: Escalation should occur when:
• The Max Evolution Threshold is reached without a successful fix.
• A high-risk tool (e.g., one that spends money or deletes a database) fails its runtime guardrail.
• The budget rule is triggered (e.g., the cost of fixing the failure would exceed the team's cap).
________________________________________
Part 4: Practical Implementation

Task 4.1: Write a New MCP Tool

The hive_agent_performance_report tool allows the Coding Agent to programmatically assess whether a worker agent is meeting its performance and cost targets.

Tool Definition
• Name: hive_agent_performance_report
• Description: "Generates a comprehensive performance and financial report for a specific agent over a defined time range by querying the TimescaleDB analytics engine."
• Input Schema (JSON):
```json
{
  "type": "object",
  "properties": {
    "agent_id": { "type": "string", "description": "The unique ID of the agent." },
    "start_time": { "type": "string", "format": "date-time" },
    "end_time": { "type": "string", "format": "date-time" }
  },
  "required": ["agent_id", "start_time", "end_time"]
}
```
Implementation Pseudocode (TypeScript)
```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { z } from "zod";

// `db` stands in for a TimescaleDB client (e.g., a pg Pool); its wiring is omitted here.
declare const db: { query(sql: string, params: unknown[]): Promise<{ rows: any[] }> };

const server = new McpServer({ name: "aden_performance_reporter", version: "1.0.0" });

server.tool(
  "hive_agent_performance_report",
  { agent_id: z.string(), start_time: z.string(), end_time: z.string() },
  async ({ agent_id, start_time, end_time }) => {
    // 1. Aggregate the agent's metrics over the requested window.
    const result = await db.query(`
      SELECT
        COUNT(*)                                            AS total_requests,
        AVG(CASE WHEN status = 'success' THEN 1 ELSE 0 END) AS success_rate,
        AVG(latency_ms)                                     AS avg_latency,
        SUM(cost_usd)                                       AS total_cost
      FROM metric_events
      WHERE agent_id = $1 AND timestamp BETWEEN $2 AND $3
    `, [agent_id, start_time, end_time]);

    // 2. Return the formatted report as MCP text content.
    return {
      content: [{ type: "text", text: JSON.stringify(result.rows[0]) }]
    };
  }
);
```
Example Request & Response
• Request: agent_id: "scout-01", start_time: "2026-01-01T00:00:00Z", end_time: "2026-01-01T23:59:59Z"
• Response:
```json
{
  "total_requests": 145,
  "success_rate": 0.98,
  "avg_latency": 425.5,
  "total_cost": 0.042
}
```
________________________________________
Task 4.2: Budget Enforcement Algorithm

This algorithm ensures that agents operate within strict financial guardrails, degrading gracefully as they approach their limits.
```typescript
interface BudgetCheck {
  action: 'allow' | 'degrade' | 'throttle' | 'block';
  reason: string;
  delayMs?: number;        // set when throttling
  degradedModel?: string;  // set when degrading to a cheaper model
}

function checkBudget(
  currentSpend: number,
  budgetLimit: number,
  requestedModel: string,
  estimatedCost: number
): BudgetCheck {
  const usageRatio = currentSpend / budgetLimit;
  const projectedSpend = currentSpend + estimatedCost;

  // Rule 1: Block if the projected spend would exceed the budget
  if (projectedSpend > budgetLimit) {
    return { action: 'block', reason: 'Budget limit exceeded' };
  }

  // Rule 2: Throttle if >=95% used
  if (usageRatio >= 0.95) {
    return {
      action: 'throttle',
      reason: 'Budget usage critical (>=95%)',
      delayMs: 2000
    };
  }

  // Rule 3: Degrade to a cheaper model if >=80% used
  if (usageRatio >= 0.80) {
    const cheaperModel = requestedModel.includes('gpt-4') ? 'gpt-4o-mini' : 'claude-3-haiku';
    return {
      action: 'degrade',
      reason: 'Budget usage high (>=80%)',
      degradedModel: cheaperModel
    };
  }

  // Rule 4: Allow otherwise
  return { action: 'allow', reason: 'Budget within safe limits' };
}
```
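For example, an agent that has spent $9.20 of a $10 cap and requests an estimated $0.50 gpt-4o call lands in Rule 3:

```typescript
const decision = checkBudget(9.2, 10, "gpt-4o", 0.5);
// usageRatio = 0.92 and projectedSpend = 9.70 <= 10, so Rule 3 fires:
// { action: 'degrade', reason: 'Budget usage high (>=80%)', degradedModel: 'gpt-4o-mini' }
console.log(decision);
```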
________________________________________
Task 4.3: Event Aggregation Query

This SQL query performs a high-performance aggregation on a TimescaleDB hypertable to provide the data for the observability dashboard.
```sql
SELECT
  time_bucket('1 hour', timestamp) AS bucket,
  model_id,
  provider,
  SUM(token_count) AS total_tokens,
  SUM(cost_usd) AS total_cost,
  AVG(latency_ms) AS avg_latency,
  COUNT(*) AS request_count
FROM metric_events
WHERE timestamp > NOW() - INTERVAL '24 hours'
GROUP BY bucket, model_id, provider
ORDER BY total_cost DESC;
```
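time_bucket rounds each timestamp down to the start of its one-hour window, so the GROUP BY yields exactly one row per hour, model, and provider; and because metric_events is a hypertable, the 24-hour WHERE clause lets TimescaleDB prune to recent chunks instead of scanning the full table.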
________________________________________