Part 1: Infrastructure Analysis 🐳
Task 1.1: Docker Deep Dive
1. List all services defined in docker-compose.yml
• Services in docker-compose.yml:
  o hive-server: The core backend API (Port 3001).
  o honeycomb: The React frontend dashboard (Port 3000).
  o postgres: Relational data for users/auth.
  o mongodb: Document storage for agent node graphs.
  o timescaledb: Time-series engine for logs/metrics.
  o redis: Hot storage for heartbeats and task queues.
2. What's the purpose of docker-compose.override.yml?
• docker-compose.override.yml is used for local development: it mounts local source code into the containers (bind mounts) and sets debug-level environment variables without altering the base production-ready config.
3. How is hot reload enabled for development?
• Hot reload is enabled via Nodemon (backend) and the Vite/Webpack dev server (frontend), which watch for file changes through the bind mounts.
4. What volumes are mounted and why?
In the Aden Hive architecture, volumes are mounted to balance data persistence for stateful services with developer productivity for stateless ones. Based on the repository configuration, the mounted volumes fall into two categories: persistent data volumes and development bind mounts.
1. Persistent Data Volumes (Stateful)
These are named volumes managed by Docker. They ensure that data survives when containers are stopped or deleted.
• PostgreSQL (postgres_data): Stores user accounts, permissions, and relational metadata. Without this, login credentials would reset on every restart.
• MongoDB (mongodb_data): Stores the "source of truth" for agent definitions, node graphs, and mission configurations.
• TimescaleDB (timescale_data): Houses the large volume of time-series event data (logs/metrics) generated by running agents.
• Redis (redis_data): Persists the task queue and short-term memory state so a restart doesn't wipe agents' current progress.
2. Development Bind Mounts (Stateless)
These are usually defined in docker-compose.override.yml and map folders on your local machine directly into the container.
• ./apps/server:/app: Lets the Hive backend see your code changes in real time. This is why hot reload works: when you save a file on your laptop, the container sees it immediately.
• ./packages:/packages: Mounts shared logic (types, utilities) so that changes in the core framework propagate to both the server and the dashboard without a full rebuild.
• /app/node_modules (anonymous volume): node_modules is typically excluded from the bind mount so the container uses the dependencies installed inside the Linux container environment rather than conflicting with your local Mac/Windows versions.
5. What networking mode is used between services?
• Networking: A custom bridge network (e.g., hive-net) is used so services can resolve each other by name (e.g., postgres:5432). A minimal sketch of an override file that ties the bind mounts, hot reload, and network together is shown below.
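The following is an illustrative sketch only, based on the points above; the service names, paths, network name, and Nodemon command are assumptions rather than contents of the actual repository file.
YAML
# docker-compose.override.yml (hypothetical local-development sketch)
services:
  hive-server:
    command: npx nodemon --watch src src/index.js   # hot reload via Nodemon (assumed entrypoint)
    environment:
      - NODE_ENV=development
      - LOG_LEVEL=debug
    volumes:
      - ./apps/server:/app      # bind mount: local edits appear in the container instantly
      - ./packages:/packages    # shared types/utilities propagate without a rebuild
      - /app/node_modules       # anonymous volume: keep the Linux-built dependencies
networks:
  default:
    name: hive-net              # custom bridge network; services resolve each other by name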
Task 1.2: Service Dependencies 🔗
Map the service dependencies:
1. Create a dependency diagram showing which services depend on which
Based on the docker-compose.yml for Aden Hive, here is the dependency mapping for the services. The architecture follows a layered approach where the frontend depends on the backend, and the backend depends on several stateful databases.
Service Dependency Diagram
honeycomb (frontend)
  └─ depends on hive (backend)
       ├─ depends on timescaledb
       ├─ depends on mongodb
       └─ depends on redis
aden-tools-mcp (MCP server): no dependencies (independent)
Detailed Service Mapping
The startup order is critical and is managed via depends_on with service_healthy conditions.
• Honeycomb (Frontend): depends on hive. The React dashboard for visualizing and managing agents.
• Hive (Backend): depends on timescaledb, mongodb, and redis. The central API/control plane; handles SDK requests and logic.
• Aden-Tools-MCP (MCP Server): no dependencies (independent). Python-based tools (such as Brave Search) exposed via the Model Context Protocol.
• TimescaleDB (Database): no dependencies. Stores LLM metrics and time-series data.
• MongoDB (Database): no dependencies. Stores policies, pricing, and agent control configurations.
• Redis (Cache/Queue): no dependencies. Used for caching and as a Socket.IO adapter for real-time communication.
Startup Logic & Failure Impact
1. Startup Order: The infrastructure layer (TimescaleDB, MongoDB, Redis) starts first. Once their health checks pass, the Hive Backend initiates. Finally, the Honeycomb Frontend launches only after the backend is confirmed healthy.
2. Stateful vs. Stateless:
o Stateful: Databases (timescaledb, mongodb, redis) and the aden-tools-mcp (which persists workspaces via volumes).
o Stateless: honeycomb and hive.
3. Failure Scenarios:
o MongoDB Down: The Backend (hive) will fail to load policies and control configurations, likely crashing the main API logic.
o Redis Down: Real-time updates to the dashboard will break, and internal caching/queuing mechanisms will fail.
o TimescaleDB Down: LLM observability features will fail; the backend won't be able to record or retrieve performance metrics.
________________________________________
2. What's the startup order? Does it matter?
• Startup Order: The databases (postgres/timescaledb, mongodb, redis) must start before the hive-server; the honeycomb dashboard depends on hive-server. Yes, it matters: if the backend starts before its databases are healthy, it fails on boot with connection errors. A minimal depends_on sketch appears at the end of this task.
3. What happens if MongoDB is unavailable?
• MongoDB down: Agents cannot load their node graphs; existing runs fail to save state.
4. What happens if Redis is unavailable?
• Redis down: Real-time metrics fail; task delegation between agents breaks.
5. Which services are stateless vs stateful?
• Stateless: hive-server, honeycomb.
• Stateful: postgres, mongodb, timescaledb, redis.
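A minimal sketch of how this ordering could be expressed in docker-compose.yml. The healthcheck commands and intervals are assumptions for illustration, not values taken from the repository:
YAML
services:
  hive-server:
    depends_on:
      mongodb:
        condition: service_healthy
      redis:
        condition: service_healthy
      timescaledb:
        condition: service_healthy
  honeycomb:
    depends_on:
      hive-server:
        condition: service_healthy
  mongodb:
    healthcheck:
      test: ["CMD", "mongosh", "--eval", "db.adminCommand('ping')"]  # assumed: Mongo answers a ping
      interval: 10s
      timeout: 5s
      retries: 5
  redis:
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]                             # assumed: Redis replies PONG
      interval: 10s
      timeout: 5s
      retries: 5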
Task 1.3: Configuration Management ⚙️
Analyze how configuration works:
1. How config.yaml is Generated
The config.yaml file is typically not "generated" by a script; it is templated or mounted depending on the environment:
• Local Development: Developers manually create or edit a config.yaml in the apps/server or hive/ directory based on a provided config.example.yaml (a hedged sketch of such a file appears at the end of this task).
• Docker/CI: The configuration is injected via a ConfigMap (in Kubernetes) or mapped through a volume in docker-compose.yml.
• Dynamic Resolution: The Hive backend uses a configuration loader (often a library such as cosmiconfig or dotenv) that reads the YAML file and then overrides specific values when corresponding environment variables are present.
2. What environment variables are required?
Required Environment Variables
While the YAML handles structural settings, the following environment variables are strictly required for the Hive backend to function (development defaults in parentheses):
• PORT: The port the API runs on (4000 or 3001).
• NODE_ENV: Sets the mode, development or production (development).
• TSDB_PG_URL: Connection string for TimescaleDB/Postgres (postgresql://postgres:postgres@timescaledb:5432/aden_tsdb).
• MONGODB_URL: Connection string for MongoDB (mongodb://mongodb:27017).
• REDIS_URL: Connection string for Redis (redis://redis:6379).
• JWT_SECRET: Secret key for signing authentication tokens (change-me-in-production).
3. How are secrets managed? (API keys, database passwords)
Secret Management
Aden Hive follows the "Twelve-Factor App" methodology for secrets:
• Development: Secrets are stored in a .env file (which is ignored by Git) and loaded into the process environment.
• Production: Secrets like BRAVE_SEARCH_API_KEY, NPM_TOKEN, and DB passwords should be managed via secret management services (e.g., AWS Secrets Manager, HashiCorp Vault, or GitHub Actions Secrets).
• Injection: These are passed into the Docker container at runtime via the environment: or env_file: keys in Docker Compose, ensuring they are never hardcoded in config.yaml.
4. What's the difference between dev and prod configs?
Dev vs. Prod Configurations
The primary differences center on security, performance, and observability:
• Development (dev):
  o Logging: Set to debug for maximum visibility.
  o Hot Reload: Enabled for fast iteration.
  o Databases: Often use default credentials (postgres/postgres) and run as single-node containers.
  o Security: CORS may be permissive to allow local frontend development.
• Production (prod):
  o Logging: Set to info or warn to reduce noise and storage costs.
  o Performance: Code is minified/bundled; health checks are more aggressive.
  o Databases: Use managed services (RDS/Atlas) with SSL/TLS encryption enforced.
  o Security: JWT secrets are long and complex; CORS is restricted to specific domains.
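A hedged sketch of what a config.example.yaml covering the settings above might look like; the section and key names are illustrative assumptions, not copied from the repository:
YAML
# config.example.yaml (illustrative sketch; environment variables override these values at load time)
server:
  port: 4000                      # overridden by PORT
  env: development                # overridden by NODE_ENV
databases:
  timescale_url: postgresql://postgres:postgres@timescaledb:5432/aden_tsdb   # overridden by TSDB_PG_URL
  mongodb_url: mongodb://mongodb:27017                                       # overridden by MONGODB_URL
  redis_url: redis://redis:6379                                              # overridden by REDIS_URL
auth:
  jwt_secret: change-me-in-production   # never commit a real secret; inject JWT_SECRET in production
logging:
  level: debug                    # use info or warn in production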
________________________________________
Part 2: Deployment & Kubernetes 🚒
Task 2.1: Production Plan (AWS Example)
Design a production deployment for a company with:
β€’ 100 active agents
β€’ 10,000 LLM requests/day
β€’ 99.9% uptime requirement
β€’ Multi-region support needed
Provide:
1. Infrastructure diagram (cloud provider of your choice)
2. Service sizing (CPU, memory for each component)
3. Database setup (primary/replica, backups)
4. Load balancing strategy
5. Estimated monthly cost
To support 100 active agents and 10,000 LLM requests/day with 99.9% uptime, the design below uses a multi-region AWS (Amazon Web Services) architecture. This setup ensures low latency for global users and high availability if a single region fails.
________________________________________
1. Infrastructure Diagram
The architecture uses a "Warm Standby" multi-region approach. The Primary Region handles all traffic, while the Secondary Region is kept in sync and ready to scale up.
• Global Layer: Route 53 (DNS) + CloudFront (CDN for the Honeycomb frontend).
• Regional Layer: Application Load Balancer (ALB) + EKS (Elastic Kubernetes Service) for the Hive backends.
• Data Layer: Aurora Global Database (PostgreSQL/TimescaleDB) + ElastiCache (Redis).
________________________________________
2. Service Sizing (Regional)
Based on 100 concurrent agents and the overhead of the self-healing loops, the following sizing is recommended for each region:
• Hive Backend: c6g.large (2 vCPU, 4 GB RAM), 3 instances (autoscaling).
• Honeycomb: S3 + CloudFront, N/A (serverless), global.
• MCP Tool Server: m6g.medium (1 vCPU, 4 GB RAM), 2 per region.
• Coding Agent: c6g.xlarge (4 vCPU, 8 GB RAM), 1 (on demand).
________________________________________
3. Database Setup
A robust database strategy is critical for the stateful nature of Aden’s agents and time-series analytics.
• PostgreSQL / TimescaleDB:
  o Service: Amazon Aurora (PostgreSQL-Compatible) with Global Database enabled.
  o Setup: 1 primary (writer) in Region A, 1 reader in Region A (Multi-AZ), and 1 cross-region replica in Region B.
  o Backups: Automated snapshots with a 35-day retention period. Point-in-Time Recovery (PITR) enabled to recover data to any second within that window.
• MongoDB:
  o Service: MongoDB Atlas (cross-region cluster).
  o Role: Stores agent graphs and session metadata.
• Redis:
  o Service: Amazon ElastiCache (cluster mode).
  o Role: Global heartbeat tracking and real-time socket state.
________________________________________
4. Load Balancing Strategy
• DNS Level (Route 53): Use latency-based routing, which directs the user to the region with the lowest latency. Health checks automatically divert traffic to the secondary region if the primary region's ALB becomes unreachable.
• Application Level (ALB):
  o SSL Termination: Handle HTTPS at the ALB.
  o Sticky Sessions: Enable session affinity (cookie-based). Since Aden uses WebSockets for real-time agent monitoring, the user must stay connected to the same backend pod for the duration of the session to avoid stream interruptions. A hedged annotation sketch follows below.
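If the ALB is provisioned through the AWS Load Balancer Controller on EKS, stickiness could be expressed roughly as follows. The annotation values and the hive-service/hive.yourdomain.com names are assumptions for illustration, not values from the repository:
YAML
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hive-alb-ingress
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    # Cookie-based stickiness so WebSocket clients keep hitting the same backend pod
    alb.ingress.kubernetes.io/target-group-attributes: stickiness.enabled=true,stickiness.lb_cookie.duration_seconds=3600
spec:
  ingressClassName: alb
  rules:
    - host: hive.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: hive-service
                port:
                  number: 80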
________________________________________
5. Estimated Monthly Cost (USD)
• Compute (EKS + EC2): ~$450
• Databases (Aurora + ElastiCache + Atlas): ~$600
• Networking (Data Transfer + Route 53 + CloudFront): ~$250
• LLM API Estimated Cost (10k req/day): ~$400 - $900 (depends on model mix: GPT-4o vs. Haiku).
• Total Estimated Baseline: $1,700 - $2,200 / month.
Task 2.2: Kubernetes Migration 🚒
Transitioning the Aden Hive from Docker Compose to a production-grade Kubernetes cluster requires defining a scalable, secure, and self-healing environment. Below are the core manifests needed for this migration.
1. ConfigMap & Secret
We separate the application configuration from the sensitive API keys.
YAML
# ConfigMap for non-sensitive configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: hive-config
data:
  NODE_ENV: "production"
  PORT: "4000"
  DB_RETRY_ATTEMPTS: "5"
---
# Secret for sensitive LLM and DB credentials
apiVersion: v1
kind: Secret
metadata:
  name: hive-secrets
type: Opaque
data:
  ANTHROPIC_API_KEY: <base64-encoded-key>
  OPENAI_API_KEY: <base64-encoded-key>
2. Hive Backend Deployment
This deployment uses Rolling Updates and Resource Quotas to ensure stability.
YAML
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hive-backend
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  selector:
    matchLabels:
      app: hive
  template:
    metadata:
      labels:
        app: hive
    spec:
      containers:
        - name: hive
          image: adenhq/hive:latest
          ports:
            - containerPort: 4000
          envFrom:
            - configMapRef:
                name: hive-config
            - secretRef:
                name: hive-secrets
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "500m"
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 4000
          livenessProbe:
            httpGet:
              path: /health/live
              port: 4000
3. Service, Ingress, and Autoscaler
These handle external access and dynamic scaling based on real-time load.
YAML
# Service for internal routing
apiVersion: v1
kind: Service
metadata:
  name: hive-service
spec:
  selector:
    app: hive
  ports:
    - protocol: TCP
      port: 80
      targetPort: 4000
---
# Ingress for external SSL/domain mapping
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hive-ingress
  annotations:
    nginx.ingress.kubernetes.io/websocket-services: "hive-service"
spec:
  rules:
    - host: hive.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: hive-service
                port:
                  number: 80
---
# HorizontalPodAutoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hive-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hive-backend
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
________________________________________
Task 2.3: High Availability Design 🔄
1. Service and Database Resilience
• Backend Failures: Handled natively by Kubernetes liveness probes. If a hive pod stops responding, K8s kills and restarts it. Traffic is routed only to pods passing readiness probes.
• Database Failover: Use a managed Multi-AZ service (such as AWS RDS or Aurora). If the primary node fails, a secondary is promoted to primary in under 60 seconds, and the endpoint remains the same.
2. Zero-Downtime Strategy
• Rolling Updates: Deployments update one pod at a time. The older pods remain active until the new pods report as "Ready."
• WebSockets During Updates: To prevent stream disconnection, use session affinity (sticky sessions) in the Ingress controller. Set a termination grace period (e.g., 60 seconds) in the deployment so existing WebSocket connections can close gracefully before the old pod is terminated (see the sketch after this list).
3. Disaster Recovery (DR) Plan
• RTO (Recovery Time Objective): Target < 30 minutes. Use cross-region Infrastructure-as-Code (Terraform) to spin up a duplicate stack in a secondary region.
• RPO (Recovery Point Objective): Target < 5 minutes. Implement cross-region replication for PostgreSQL and MongoDB. In a total region failure, point DNS (Route 53) to the secondary region's load balancer.
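A hedged excerpt showing the deployment fields involved in that graceful WebSocket shutdown; the preStop sleep and the 60-second grace period are illustrative values, not settings taken from the repository:
YAML
# Pod template excerpt for the hive-backend Deployment (illustrative)
spec:
  terminationGracePeriodSeconds: 60      # give in-flight WebSocket connections time to drain
  containers:
    - name: hive
      image: adenhq/hive:latest
      lifecycle:
        preStop:
          exec:
            # Pause before SIGTERM so the load balancer stops sending new traffic first
            command: ["sh", "-c", "sleep 15"]
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 4000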
________________________________________
Part 3: CI/CD Pipeline 🔄
Task 3.1: GitHub Actions Pipeline 🔄
Create a complete CI/CD pipeline:
# .github/workflows/ci-cd.yml
name: Aden CI/CD
on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
jobs:
  # Your implementation should include:
  # - Linting
  # - Type checking
  # - Unit tests
  # - Integration tests
  # - Build Docker images
  # - Push to registry
  # - Deploy to staging (on develop)
  # - Deploy to production (on main, with approval)
Include:
1. Separate jobs for frontend and backend
2. Matrix testing for multiple Node versions
3. Docker layer caching
4. Deployment gates/approvals
5. Rollback strategy
The following YAML implementation for .github/workflows/ci-cd.yml creates a modular, enterprise-grade pipeline. It separates frontend and backend logic, utilizes matrix testing, and implements secure deployment gates.
YAML
name: Aden CI/CD
on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
permissions:
  id-token: write
  contents: read
jobs:
  # 1. Quality & Test Job (Parallel Matrix)
  quality-gate:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        service: [honeycomb, hive]
        node-version: [18.x, 20.x, 22.x]
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node.js ${{ matrix.node-version }}
        uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node-version }}
          cache: 'npm'
          cache-dependency-path: ${{ matrix.service }}/package-lock.json
      - name: Install & Lint
        run: |
          npm ci --prefix ${{ matrix.service }}
          npm run lint --prefix ${{ matrix.service }}
      - name: Unit Tests
        run: npm run test:unit --prefix ${{ matrix.service }}
  # 2. Integration Tests
  integration-test:
    needs: quality-gate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Spin up Test Infrastructure
        run: docker compose -f docker-compose.test.yml up -d
      - name: Run Integration Suite
        run: npm run test:integration --prefix hive
  # 3. Docker Build & Push (With Layer Caching)
  build-and-push:
    needs: integration-test
    if: github.event_name == 'push'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - name: Login to GHCR
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build and Push (GHA Cache enabled)
        uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: ghcr.io/adenhq/hive:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
  # 4. Deploy to Staging (Auto on develop)
  deploy-staging:
    needs: build-and-push
    if: github.ref == 'refs/heads/develop'
    environment: staging
    runs-on: ubuntu-latest
    steps:
      - run: echo "Deploying to Staging via Helm..."
  # 5. Deploy to Production (Approval on main)
  deploy-production:
    needs: build-and-push
    if: github.ref == 'refs/heads/main'
    environment: production # Gated by manual approval in GitHub Settings
    runs-on: ubuntu-latest
    steps:
      - run: echo "Deploying to Production via Helm..."
Rollback Strategy: We use GitOps (ArgoCD or Flux). If a production deployment fails health checks, the system is configured to auto-rollback to the previous known-good image tag in the Git repository.
________________________________________
Task 3.2: Testing Strategy 🧪
A comprehensive testing strategy for an agentic framework requires shifting from traditional software testing to behavioral and resilience-based validation.
1. Test Categories & Mocking Strategy
• Unit Tests (framework logic, utility functions, node state machines): Mocking LLMs — use a "Test Intelligence Backend" (like FakeChatModel) that returns deterministic responses based on input keys. Mock external tools at the function interface level to verify the agent calls them with the correct schema.
• Integration Tests (DB transactions in Mongo/Postgres, Redis heartbeat updates): Use Docker service containers (via GitHub Actions) or Testcontainers to spin up real instances of MongoDB and TimescaleDB for each test run to ensure schema compatibility.
• E2E Tests (goal-to-action flows, human-in-the-loop triggers): Use Playwright to simulate a user creating a goal in the Honeycomb UI and verify that the backend correctly generates the graph and sends an intervention request to Slack.
• Load Tests (WebSocket gateway capacity, concurrent event ingestion): Use k6 or Locust to simulate 1000+ concurrent SDK clients emitting MetricEvents. Measure p99 latency for state updates and DB write saturation points.
• Chaos Tests (self-healing resilience, failover logic): Use Chaos Mesh to inject failures: terminate a database primary node, induce 500 ms network latency to the LLM API, or simulate 50% packet loss on the WebSocket gateway.
2. Example Test Configurations
• Unit (mocking): A pytest configuration that intercepts litellm.completion calls and returns mock JSON from a local fixtures/ folder.
• Chaos: A YAML experiment that targets the redis pod and kills it every 5 minutes to verify the AgentRunner doesn't lose execution state (sketched below).
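A hedged sketch of that chaos experiment using a Chaos Mesh Schedule; it assumes Chaos Mesh is installed and that the Redis pods run in an aden namespace with an app: redis label (both assumptions):
YAML
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: redis-pod-kill
  namespace: aden                 # assumed namespace
spec:
  schedule: "*/5 * * * *"         # every 5 minutes
  type: PodChaos
  historyLimit: 5
  concurrencyPolicy: Forbid
  podChaos:
    action: pod-kill
    mode: one                     # kill one matching pod per run
    selector:
      namespaces:
        - aden
      labelSelectors:
        app: redis                # assumed pod label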
________________________________________
Task 3.3: Environment Management 🌍
To ensure high availability and data security, Aden uses a tiered environment strategy with isolated data planes.
Environment Matrix
• Local: Provisioned with Docker Compose on the host machine; mock data or local seeding scripts; deployed manually (docker compose up); full local admin access.
• Dev: Provisioned with Terraform (shared AWS/GCP cluster); sanitized production snapshots (monthly refresh); automated CD from feature/* branches; access for all engineering.
• Staging: Provisioned with Helm + EKS (full replica of prod); anonymized production data (weekly refresh); automated CD on merge to develop; access for engineering and QA.
• Production: Multi-region EKS + managed Aurora/Atlas; real user data (encrypted at rest and in transit); gated CD on merge to main (requires approval); access for SRE/DevOps only (MFA).
Operational Details
• Provisioning: All environments (except Local) are managed via Infrastructure as Code (Terraform) to ensure environment parity and prevent "works on my machine" bugs.
• Data Isolation: Production uses dedicated VPCs and IAM roles. Staging and Dev share a cluster but are strictly isolated via Kubernetes namespaces and network policies (see the sketch after this list).
• Deployments: Use blue-green deployments for Production to allow instant rollback. Staging uses rolling updates to test new features under continuous integration.
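A hedged NetworkPolicy sketch for the namespace isolation mentioned above; the staging namespace name and the ingress-nginx label are assumptions:
YAML
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: staging-isolation
  namespace: staging
spec:
  podSelector: {}                  # applies to every pod in the staging namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}          # allow traffic from pods within the same namespace
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx   # allow the ingress controller namespace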
________________________________________
Part 4: Observability & Operations 📊
Task 4.1: Monitoring Stack 📊
1. Key Metrics (The "Top 10")
• agent_run_duration_seconds (histogram): Tracks end-to-end latency of agent goals.
• llm_token_usage_total (counter): Total tokens consumed, segmented by model and team.
• llm_request_cost_usd (gauge): Real-time dollar spend for LLM API calls.
• node_failure_rate (gauge): Percentage of failures per specific node (e.g., "Scout", "Writer").
• mcp_tool_execution_time (summary): Performance of external tool calls (APIs, web search).
• websocket_active_connections (gauge): Concurrent dashboard users and real-time streams.
• db_write_latency_ms (histogram): Latency for metric ingestion into TimescaleDB.
• self_healing_attempts_total (counter): Number of times agents triggered auto-recovery.
• agent_memory_utilization (gauge): Memory usage per AgentRunner pod.
• human_intervention_wait_time (histogram): Time spent waiting for human approval/feedback.
2. Logs & Traces
• Logging Strategy: Use structured JSON logging (via Pino or Winston). Logs are aggregated into Grafana Loki. High-cardinality metadata (team_id, agent_id, session_id) is indexed to allow instant filtering.
• Distributed Tracing: Implement OpenTelemetry (OTel) with Jaeger. Every agent execution is treated as a "Trace," and each node execution/tool call is a "Span." This allows developers to visualize where an agent got stuck in a multi-step workflow.
3. Three Key Dashboards
1. SRE Overview: Global health, HTTP error rates, system resource usage (CPU/RAM), and database health.
2. Agent Economics: Token usage trends, cost by team, model ROI, and budget depletion forecasts.
3. Agent Logic & Quality: Failure taxonomy distribution, self-healing success rates, and human rejection analysis.
________________________________________
Monitoring Setup Addition (docker-compose.monitoring.yml)
YAML
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=aden_dev
    volumes:
      - grafana-storage:/var/lib/grafana
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686" # UI
      - "4317:4317"   # OTLP gRPC
volumes:
  grafana-storage:
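The Prometheus container above mounts a ./prometheus.yml that is not shown. A minimal sketch of what it might contain, assuming the Hive backend exposes a Prometheus /metrics endpoint on port 4000 (an assumption, not confirmed from the repository):
YAML
# prometheus.yml (illustrative sketch)
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: hive-backend
    metrics_path: /metrics            # assumed metrics endpoint
    static_configs:
      - targets: ['hive-server:4000'] # assumed service name and port
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']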
________________________________________
Task 4.2: Alerting Rules 🚨
YAML
groups:
  - name: aden-critical
    rules:
      # 1. High Error Rate
      - alert: HighErrorRate
        expr: sum(rate(node_failure_count[5m])) / sum(rate(node_execution_count[5m])) > 0.15
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Agent failure rate above 15%"
          description: "High error rate detected in the production orchestration layer."
      # 2. Service Down
      - alert: HiveBackendDown
        expr: up{job="hive-backend"} == 0
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Hive Backend is unreachable"
      # 3. High Latency
      - alert: ExtremeLLMLatency
        expr: histogram_quantile(0.95, sum(rate(llm_request_duration_seconds_bucket[10m])) by (le)) > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p95 LLM Latency > 30s"
      # 4. Budget Threshold Hit
      - alert: TeamBudgetCritical
        expr: team_spend_usd / team_budget_limit_usd >= 0.95
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Team at 95% budget capacity"
      # 5. DB Write Pressure
      - alert: TimescaleDBWriteLag
        expr: rate(db_write_errors_total[5m]) > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Database ingestion errors detected"
      # 6. Memory Pressure
      - alert: PodMemoryPressure
        expr: container_memory_usage_bytes{container="hive-backend"} / container_spec_memory_limit_bytes > 0.85
        for: 5m
        labels:
          severity: warning
      # 7. Self-Healing Fail Loop
      - alert: InfiniteHealingLoop
        expr: increase(self_healing_attempts_total[10m]) > 10
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Agent stuck in evolution loop"
      # 8. Database Connection Limit
      - alert: PostgresConnectionLimit
        expr: pg_stat_activity_count / pg_settings_max_connections > 0.90
        for: 5m
        labels:
          severity: warning
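These rules only define when alerts fire; routing them to people requires an Alertmanager configuration. A hedged sketch, assuming PagerDuty for severity: page and Slack for everything else (receiver names, channel, and keys are placeholders):
YAML
# alertmanager.yml (illustrative sketch)
route:
  receiver: slack-default
  group_by: ['alertname', 'severity']
  routes:
    - matchers:
        - severity="page"            # HiveBackendDown pages the on-call engineer
      receiver: pagerduty-oncall
receivers:
  - name: slack-default
    slack_configs:
      - channel: '#aden-alerts'
        api_url: 'https://hooks.slack.com/services/REPLACE_ME'   # placeholder webhook URL
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: 'REPLACE_WITH_PAGERDUTY_ROUTING_KEY'        # placeholder key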
Task 4.3: Incident Response 🆘
Create an incident response runbook:
Scenario: Agent response times spike to 30 seconds (normal: 2 seconds)
Provide:
1. Detection: How was this discovered?
2. Triage: Initial investigation steps
3. Diagnosis: Decision tree for root causes
4. Resolution: Steps for each root cause
5. Post-mortem: Template for incident review
# Runbook: High Agent Latency
## Symptoms
- Agent response times > 10s
- Dashboard showing degraded status
## Initial Triage
1. Check [ ] Is this affecting all agents or specific ones?
2. Check [ ] Is the backend healthy? (health endpoint)
3. Check [ ] Are databases responsive?
...
## Diagnostic Steps
...
## Resolution Steps
### If LLM Provider Issue:
...
### If Database Issue:
...
________________________________________
Runbook: High Agent Latency
1. Detection
• Primary Discovery: Typically via a Prometheus alert (ExtremeLLMLatency) triggered when the p95 agent execution time exceeds 10s for 5 minutes.
• Secondary Discovery: Dashboard users report "spinning" icons or a "Degraded" status in the Honeycomb UI.
• Log Evidence: TraceEvents in Grafana Loki show significant gaps (>20s) between node execution timestamps.
________________________________________
2. Initial Triage
1. Check Scope: Is this affecting all teams or just one? (Query: sum(rate(llm_request_duration_seconds_bucket[5m])) by (team_id))
2. Check Backend Health: Are the hive-backend pods reporting healthy? (Endpoint: /health/ready). Check for high pod restart counts.
3. Check Databases: Is there a write lag in TimescaleDB or a high connection count in MongoDB?
4. Check External Status: Visit status pages for OpenAI, Anthropic, and AWS.
________________________________________
3. Diagnosis Decision Tree
• Is the latency at the start of the session?
  o Yes: Possible MongoDB bottleneck (loading agent graphs) or Coding Agent saturation.
• Is latency occurring between node executions?
  o Yes: WebSocket gateway congestion or Redis heartbeat lag.
• Is the latency within the node execution itself?
  o Yes: Check the model_id.
    - If it's a "Flagship" model (GPT-4o): high likelihood of LLM provider saturation.
    - If it's a "Tool Call" node: high likelihood of an MCP tool API timeout.
________________________________________
4. Resolution Steps
If LLM Provider Issue (Global Saturation):
1. Trigger Policy Override: Using the /v1/control/policy endpoint, force a temporary Model Degradation for all non-critical agents (e.g., GPT-4o → GPT-4o-mini).
2. Enable Throttling: Introduce a 2000ms delay in the checkBudget logic to reduce global request pressure.
If Database / Infrastructure Issue:
1. Scale Pods: Manually increase the replicas in the Kubernetes deployment to 2x.
2. Clear Hot Storage: Flush the Redis session cache if orphan heartbeats are causing gateway lockups.
If Specific MCP Tool Failure:
1. Disable Tool: Temporarily remove the failing tool from the tool_registry.
2. Self-Healing: Allow the Self-Healing Loop to re-route agents to alternative tools (e.g., using a different search provider).
________________________________________
5. Post-Mortem Template
• Summary: What happened, which teams were affected, and total downtime.
• Timeline: From first detection to final resolution (TTO, TTR).
• Root Cause: Why did this happen? (Infrastructure, logic, or external.)
• Impact Analysis: Total cost of failed/stalled goals and wasted API spend.
• Lessons Learned: What guardrails or alerts failed to prevent this?
• Action Items: List of tickets (Jira/GitHub) created to ensure this doesn't repeat.
________________________________________
Part 5: Security Hardening (Bonus) 🔒
• Network: Close all ports except 443 (Ingress). Use VPC Peering or PrivateLink for database connections.
• Container Security: Implement Trivy scanning in the CI/CD pipeline to catch vulnerabilities in base images before they reach the registry.
• Secrets: Move from .env files to AWS Secrets Manager or HashiCorp Vault, injected at runtime as environment variables.
Task 5.1: Security Audit 🔒
1. Network Security
• Exposed Ports: 3000 (frontend), 4000 (backend API).
  o Analysis: These are necessary for external access but should be restricted. In production, only the Load Balancer/Ingress should expose them, over HTTPS (443).
• Internal Ports: 5432 (Postgres), 27017 (Mongo), 6379 (Redis).
  o Action: These must never be exposed to the public internet. Use security groups or private subnets so that only the hive backend container can communicate with them.
2. Secret Management
• Current State: Secrets are managed via .env files and config.yaml.
• Improvements:
  o External Vault: Move all API keys (OpenAI, Anthropic) and DB credentials to AWS Secrets Manager or HashiCorp Vault.
  o Dynamic Injection: Inject secrets into Kubernetes pods at runtime as mounted files or environment variables, avoiding persistent storage on disk.
  o Rotation: Implement automated 90-day rotation for all infrastructure and third-party API keys.
3. Authentication (API Auth)
• Implementation: JWT (JSON Web Tokens) with team-scoped claims.
• Hardening:
  o MFA: Enforce Multi-Factor Authentication for the Honeycomb Dashboard.
  o Scoping: Implement Fine-Grained Access Control (FGAC). Instead of a global API key, use agent-specific tokens that only allow a worker agent to call the specific tools it needs (e.g., a "Writer" agent shouldn't have "Search Internal Docs" permissions).
4. Container Security
• Image Scanning: Add Trivy or Snyk to the GitHub Actions CI pipeline. Scan on every PR to detect vulnerable Node.js/Python packages (a sketch of such a CI step appears at the end of this task).
• Distroless Images: Switch from standard Node/Python images to Google Distroless or Alpine to minimize the attack surface by removing shells, package managers, and unnecessary binaries.
5. Database Hardening
• Encryption: Enable encryption at rest (AES-256) and enforce TLS for all database-to-backend connections.
• Least Privilege: Create dedicated DB users for the hive service. The application user should not have DROP TABLE or superuser permissions.
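A hedged sketch of the image-scanning job referenced in point 4, as it might be added to the Task 3.1 pipeline; the inputs shown are commonly used with aquasecurity/trivy-action, but treat the exact values as assumptions:
YAML
  # Extra job for .github/workflows/ci-cd.yml (illustrative)
  image-scan:
    needs: build-and-push
    runs-on: ubuntu-latest
    steps:
      - name: Scan pushed image with Trivy
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ghcr.io/adenhq/hive:${{ github.sha }}
          format: table
          exit-code: '1'            # fail the pipeline if findings remain
          severity: CRITICAL,HIGH
          ignore-unfixed: true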
________________________________________
Task 5.2: Compliance Checklist ✅ (SOC 2 Pathway)
Achieving SOC 2 Type II requires demonstrating consistent operational control over a 6-12 month period.
1. Access Control Improvements
• Provisioning: Use a standardized "Joiners, Movers, Leavers" (JML) process.
• Principle of Least Privilege: Regular (quarterly) access reviews to ensure employees and agents only have the permissions they need.
2. Audit Logging Requirements
• Centralization: Stream all logs (Metric, Trace, and LogEvents) to a write-only, tamper-proof repository such as Amazon S3 with Object Lock.
• Integrity: Every agent "Evolution" decision must be logged with the ID of the admin who approved the change.
3. Encryption Requirements
• In Transit: 100% of traffic (including internal pod-to-pod traffic) must be encrypted via mTLS (mutual TLS) using a service mesh like Istio or Linkerd.
• At Rest: Ensure all EBS volumes, RDS instances, and MongoDB clusters use KMS keys managed by the organization.
4. Data Retention Policies
• Automated Deletion: Implement a policy to purge individual "Session Local Memory" 30 days after a goal is completed, unless it is specifically marked for long-term memory storage.
• Hypertable Chunks: Use TimescaleDB's drop_chunks function to automatically archive or delete metric logs older than 1 year to comply with privacy regulations (GDPR/CCPA).
5. Incident Response Requirements
• Runbook Automation: The high-latency runbook (Task 4.3) must be part of an official, version-controlled IR plan.
• Testing: Perform an annual security tabletop exercise where a mock "Agent Data Leak" or "Prompt Injection" scenario is played out by the engineering team.