Part 1: Infrastructure Analysis 🐳
Task 1.1: Docker Deep Dive
1. List all services defined in docker-compose.yml
• Services in docker-compose.yml:
  o hive-server: The core backend API (Port 3001).
  o honeycomb: The React frontend dashboard (Port 3000).
  o postgres: Relational data for users/auth.
  o mongodb: Document storage for agent node graphs.
  o timescaledb: Time-series engine for logs/metrics.
  o redis: Hot storage for heartbeats and task queues.
2. What's the purpose of docker-compose.override.yml?
• docker-compose.override.yml is used for local development: it mounts local source code into the containers (bind mounts) and sets debug-level environment variables without altering the base production-ready config.
3. How is hot reload enabled for development?
• Hot reload is enabled via Nodemon (backend) and the Vite/Webpack dev server (frontend), which watch for file changes through the bind mounts.
4. What volumes are mounted and why?
In the Aden Hive architecture, volumes are mounted to balance data persistence for stateful services with developer productivity for stateless ones. Based on the repository configuration, the mounted volumes fall into two categories: persistent data volumes and development bind mounts.
1. Persistent Data Volumes (Stateful)
These are named volumes managed by Docker. They ensure that data survives when containers are stopped or deleted.
• PostgreSQL (postgres_data): Stores user accounts, permissions, and relational metadata. Without this, login credentials would reset on every restart.
• MongoDB (mongodb_data): Stores the "source of truth" for agent definitions, node graphs, and mission configurations.
• TimescaleDB (timescale_data): Houses the large volume of time-series event data (logs/metrics) generated by running agents.
• Redis (redis_data): Persists the task queue and short-term memory state so a restart doesn't wipe agents' current progress.
2. Development Bind Mounts (Stateless)
These are usually defined in docker-compose.override.yml and map folders on your local machine directly into the container.
• ./apps/server:/app: Lets the Hive backend see your code changes in real time. This is why hot reload works: when you save a file on your laptop, the container sees it immediately.
• ./packages:/packages: Mounts shared logic (types, utilities) so that changes in the core framework propagate to both the server and the dashboard without a full rebuild.
• /app/node_modules (anonymous volume): node_modules is typically excluded from the bind mount so the container uses the dependencies installed inside the Linux container environment rather than conflicting with your local Mac/Windows versions.
5. What networking mode is used between services?
• Networking: A custom bridge network (e.g., hive-net) is used so services can resolve each other by name (e.g., postgres:5432). A minimal sketch of an override file that ties the bind mounts, hot reload, and network together is shown below.
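The following is an illustrative sketch only, based on the points above; the service names, paths, network name, and Nodemon command are assumptions rather than contents of the actual repository file.
YAML
# docker-compose.override.yml (hypothetical local-development sketch)
services:
  hive-server:
    command: npx nodemon --watch src src/index.js   # hot reload via Nodemon (assumed entrypoint)
    environment:
      - NODE_ENV=development
      - LOG_LEVEL=debug
    volumes:
      - ./apps/server:/app      # bind mount: local edits appear in the container instantly
      - ./packages:/packages    # shared types/utilities propagate without a rebuild
      - /app/node_modules       # anonymous volume: keep the Linux-built dependencies
networks:
  default:
    name: hive-net              # custom bridge network; services resolve each other by name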
Task 1.2: Service Dependencies 🔗
Map the service dependencies:
1. Create a dependency diagram showing which services depend on which
Based on the docker-compose.yml for Aden Hive, here is the dependency mapping for the services. The architecture follows a layered approach where the frontend depends on the backend, and the backend depends on several stateful databases.
Service Dependency Diagram
honeycomb (frontend)
  └─ depends on hive (backend)
       ├─ depends on timescaledb
       ├─ depends on mongodb
       └─ depends on redis
aden-tools-mcp (MCP server): no dependencies (independent)
Detailed Service Mapping
The startup order is critical and is managed via depends_on with service_healthy conditions.
• Honeycomb (Frontend): depends on hive. The React dashboard for visualizing and managing agents.
• Hive (Backend): depends on timescaledb, mongodb, and redis. The central API/control plane; handles SDK requests and logic.
• Aden-Tools-MCP (MCP Server): no dependencies (independent). Python-based tools (such as Brave Search) exposed via the Model Context Protocol.
• TimescaleDB (Database): no dependencies. Stores LLM metrics and time-series data.
• MongoDB (Database): no dependencies. Stores policies, pricing, and agent control configurations.
• Redis (Cache/Queue): no dependencies. Used for caching and as a Socket.IO adapter for real-time communication.
Startup Logic & Failure Impact
1. Startup Order: The infrastructure layer (TimescaleDB, MongoDB, Redis) starts first. Once their health checks pass, the Hive Backend initiates. Finally, the Honeycomb Frontend launches only after the backend is confirmed healthy.
2. Stateful vs. Stateless:
o Stateful: Databases (timescaledb, mongodb, redis) and the aden-tools-mcp (which persists workspaces via volumes).
o Stateless: honeycomb and hive.
3. Failure Scenarios:
o MongoDB Down: The Backend (hive) will fail to load policies and control configurations, likely crashing the main API logic.
o Redis Down: Real-time updates to the dashboard will break, and internal caching/queuing mechanisms will fail.
o TimescaleDB Down: LLM observability features will fail; the backend won't be able to record or retrieve performance metrics.
________________________________________
2. What's the startup order? Does it matter?
• Startup Order: The databases (postgres/timescaledb, mongodb, redis) must start before the hive-server; the honeycomb dashboard depends on hive-server. Yes, it matters: if the backend starts before its databases are healthy, it fails on boot with connection errors. A minimal depends_on sketch appears at the end of this task.
3. What happens if MongoDB is unavailable?
• MongoDB down: Agents cannot load their node graphs; existing runs fail to save state.
4. What happens if Redis is unavailable?
• Redis down: Real-time metrics fail; task delegation between agents breaks.
5. Which services are stateless vs stateful?
• Stateless: hive-server, honeycomb.
• Stateful: postgres, mongodb, timescaledb, redis.
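A minimal sketch of how this ordering could be expressed in docker-compose.yml. The healthcheck commands and intervals are assumptions for illustration, not values taken from the repository:
YAML
services:
  hive-server:
    depends_on:
      mongodb:
        condition: service_healthy
      redis:
        condition: service_healthy
      timescaledb:
        condition: service_healthy
  honeycomb:
    depends_on:
      hive-server:
        condition: service_healthy
  mongodb:
    healthcheck:
      test: ["CMD", "mongosh", "--eval", "db.adminCommand('ping')"]  # assumed: Mongo answers a ping
      interval: 10s
      timeout: 5s
      retries: 5
  redis:
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]                             # assumed: Redis replies PONG
      interval: 10s
      timeout: 5s
      retries: 5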
Task 1.3: Configuration Management ⚙️
Analyze how configuration works:
1. How config.yaml is Generated
The config.yaml file is typically not "generated" by a script; it is templated or mounted depending on the environment:
• Local Development: Developers manually create or edit a config.yaml in the apps/server or hive/ directory based on a provided config.example.yaml (a hedged sketch of such a file appears at the end of this task).
• Docker/CI: The configuration is injected via a ConfigMap (in Kubernetes) or mapped through a volume in docker-compose.yml.
• Dynamic Resolution: The Hive backend uses a configuration loader (often a library such as cosmiconfig or dotenv) that reads the YAML file and then overrides specific values when corresponding environment variables are present.
2. What environment variables are required?
Required Environment Variables
While the YAML handles structural settings, the following environment variables are strictly required for the Hive backend to function (development defaults in parentheses):
• PORT: The port the API runs on (4000 or 3001).
• NODE_ENV: Sets the mode, development or production (development).
• TSDB_PG_URL: Connection string for TimescaleDB/Postgres (postgresql://postgres:postgres@timescaledb:5432/aden_tsdb).
• MONGODB_URL: Connection string for MongoDB (mongodb://mongodb:27017).
• REDIS_URL: Connection string for Redis (redis://redis:6379).
• JWT_SECRET: Secret key for signing authentication tokens (change-me-in-production).
3. How are secrets managed? (API keys, database passwords)
Secret Management
Aden Hive follows the "Twelve-Factor App" methodology for secrets:
• Development: Secrets are stored in a .env file (which is ignored by Git) and loaded into the process environment.
• Production: Secrets like BRAVE_SEARCH_API_KEY, NPM_TOKEN, and DB passwords should be managed via secret management services (e.g., AWS Secrets Manager, HashiCorp Vault, or GitHub Actions Secrets).
• Injection: These are passed into the Docker container at runtime via the environment: or env_file: keys in Docker Compose, ensuring they are never hardcoded in config.yaml.
4. What's the difference between dev and prod configs?
Dev vs. Prod Configurations
The primary differences center on security, performance, and observability:
• Development (dev):
  o Logging: Set to debug for maximum visibility.
  o Hot Reload: Enabled for fast iteration.
  o Databases: Often use default credentials (postgres/postgres) and run as single-node containers.
  o Security: CORS may be permissive to allow local frontend development.
• Production (prod):
  o Logging: Set to info or warn to reduce noise and storage costs.
  o Performance: Code is minified/bundled; health checks are more aggressive.
  o Databases: Use managed services (RDS/Atlas) with SSL/TLS encryption enforced.
  o Security: JWT secrets are long and complex; CORS is restricted to specific domains.
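A hedged sketch of what a config.example.yaml covering the settings above might look like; the section and key names are illustrative assumptions, not copied from the repository:
YAML
# config.example.yaml (illustrative sketch; environment variables override these values at load time)
server:
  port: 4000                      # overridden by PORT
  env: development                # overridden by NODE_ENV
databases:
  timescale_url: postgresql://postgres:postgres@timescaledb:5432/aden_tsdb   # overridden by TSDB_PG_URL
  mongodb_url: mongodb://mongodb:27017                                       # overridden by MONGODB_URL
  redis_url: redis://redis:6379                                              # overridden by REDIS_URL
auth:
  jwt_secret: change-me-in-production   # never commit a real secret; inject JWT_SECRET in production
logging:
  level: debug                    # use info or warn in production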
________________________________________
Part 2: Deployment & Kubernetes 🚒
Task 2.1: Production Plan (AWS Example)
Design a production deployment for a company with:
β€’ 100 active agents
β€’ 10,000 LLM requests/day
β€’ 99.9% uptime requirement
β€’ Multi-region support needed
Provide:
1. Infrastructure diagram (cloud provider of your choice)
2. Service sizing (CPU, memory for each component)
3. Database setup (primary/replica, backups)
4. Load balancing strategy
5. Estimated monthly cost
To support 100 active agents and 10,000 LLM requests/day with 99.9% uptime, the design below uses a multi-region AWS (Amazon Web Services) architecture. This setup ensures low latency for global users and high availability if a single region fails.
________________________________________
1. Infrastructure Diagram
The architecture uses a "Warm Standby" multi-region approach. The Primary Region handles all traffic, while the Secondary Region is kept in sync and ready to scale up.
• Global Layer: Route 53 (DNS) + CloudFront (CDN for the Honeycomb frontend).
• Regional Layer: Application Load Balancer (ALB) + EKS (Elastic Kubernetes Service) for the Hive backends.
• Data Layer: Aurora Global Database (PostgreSQL/TimescaleDB) + ElastiCache (Redis).
________________________________________
2. Service Sizing (Regional)
Based on 100 concurrent agents and the overhead of the self-healing loops, the following sizing is recommended for each region:
• Hive Backend: c6g.large (2 vCPU, 4 GB RAM), 3 instances (autoscaling).
• Honeycomb: S3 + CloudFront, N/A (serverless), global.
• MCP Tool Server: m6g.medium (1 vCPU, 4 GB RAM), 2 per region.
• Coding Agent: c6g.xlarge (4 vCPU, 8 GB RAM), 1 (on demand).
________________________________________
3. Database Setup
A robust database strategy is critical for the stateful nature of Aden’s agents and time-series analytics.
• PostgreSQL / TimescaleDB:
  o Service: Amazon Aurora (PostgreSQL-Compatible) with Global Database enabled.
  o Setup: 1 primary (writer) in Region A, 1 reader in Region A (Multi-AZ), and 1 cross-region replica in Region B.
  o Backups: Automated snapshots with a 35-day retention period. Point-in-Time Recovery (PITR) enabled to recover data to any second within that window.
• MongoDB:
  o Service: MongoDB Atlas (cross-region cluster).
  o Role: Stores agent graphs and session metadata.
• Redis:
  o Service: Amazon ElastiCache (cluster mode).
  o Role: Global heartbeat tracking and real-time socket state.
________________________________________
4. Load Balancing Strategy
• DNS Level (Route 53): Use latency-based routing, which directs the user to the region with the lowest latency. Health checks automatically divert traffic to the secondary region if the primary region's ALB becomes unreachable.
• Application Level (ALB):
  o SSL Termination: Handle HTTPS at the ALB.
  o Sticky Sessions: Enable session affinity (cookie-based). Since Aden uses WebSockets for real-time agent monitoring, the user must stay connected to the same backend pod for the duration of the session to avoid stream interruptions. A hedged annotation sketch follows below.
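If the ALB is provisioned through the AWS Load Balancer Controller on EKS, stickiness could be expressed roughly as follows. The annotation values and the hive-service/hive.yourdomain.com names are assumptions for illustration, not values from the repository:
YAML
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hive-alb-ingress
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    # Cookie-based stickiness so WebSocket clients keep hitting the same backend pod
    alb.ingress.kubernetes.io/target-group-attributes: stickiness.enabled=true,stickiness.lb_cookie.duration_seconds=3600
spec:
  ingressClassName: alb
  rules:
    - host: hive.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: hive-service
                port:
                  number: 80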
________________________________________
5. Estimated Monthly Cost (USD)
• Compute (EKS + EC2): ~$450
• Databases (Aurora + ElastiCache + Atlas): ~$600
• Networking (Data Transfer + Route 53 + CloudFront): ~$250
• LLM API Estimated Cost (10k req/day): ~$400 - $900 (depends on model mix: GPT-4o vs. Haiku).
• Total Estimated Baseline: $1,700 - $2,200 / month.
Task 2.2: Kubernetes Migration 🚒
Transitioning the Aden Hive from Docker Compose to a production-grade Kubernetes cluster requires defining a scalable, secure, and self-healing environment. Below are the core manifests needed for this migration.
1. ConfigMap & Secret
We separate the application configuration from the sensitive API keys.
YAML
# ConfigMap for non-sensitive configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: hive-config
data:
  NODE_ENV: "production"
  PORT: "4000"
  DB_RETRY_ATTEMPTS: "5"
---
# Secret for sensitive LLM and DB credentials
apiVersion: v1
kind: Secret
metadata:
  name: hive-secrets
type: Opaque
data:
  ANTHROPIC_API_KEY: <base64-encoded-key>
  OPENAI_API_KEY: <base64-encoded-key>
2. Hive Backend Deployment
This deployment uses Rolling Updates and Resource Quotas to ensure stability.
YAML
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hive-backend
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  selector:
    matchLabels:
      app: hive
  template:
    metadata:
      labels:
        app: hive
    spec:
      containers:
        - name: hive
          image: adenhq/hive:latest
          ports:
            - containerPort: 4000
          envFrom:
            - configMapRef:
                name: hive-config
            - secretRef:
                name: hive-secrets
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "500m"
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 4000
          livenessProbe:
            httpGet:
              path: /health/live
              port: 4000
3. Service, Ingress, and Autoscaler
These handle external access and dynamic scaling based on real-time load.
YAML
# Service for internal routing
apiVersion: v1
kind: Service
metadata:
  name: hive-service
spec:
  selector:
    app: hive
  ports:
    - protocol: TCP
      port: 80
      targetPort: 4000
---
# Ingress for external SSL/domain mapping
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hive-ingress
  annotations:
    nginx.ingress.kubernetes.io/websocket-services: "hive-service"
spec:
  rules:
    - host: hive.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: hive-service
                port:
                  number: 80
---
# HorizontalPodAutoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hive-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hive-backend
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
________________________________________
Task 2.3: High Availability Design 🔄
1. Service and Database Resilience
• Backend Failures: Handled natively by Kubernetes liveness probes. If a hive pod stops responding, K8s kills and restarts it. Traffic is routed only to pods passing readiness probes.
• Database Failover: Use a managed Multi-AZ service (such as AWS RDS or Aurora). If the primary node fails, a secondary is promoted to primary in under 60 seconds, and the endpoint remains the same.
2. Zero-Downtime Strategy
• Rolling Updates: Deployments update one pod at a time. The older pods remain active until the new pods report as "Ready."
• WebSockets During Updates: To prevent stream disconnection, use session affinity (sticky sessions) in the Ingress controller. Set a termination grace period (e.g., 60 seconds) in the deployment so existing WebSocket connections can close gracefully before the old pod is terminated (see the sketch after this list).
3. Disaster Recovery (DR) Plan
• RTO (Recovery Time Objective): Target < 30 minutes. Use cross-region Infrastructure-as-Code (Terraform) to spin up a duplicate stack in a secondary region.
• RPO (Recovery Point Objective): Target < 5 minutes. Implement cross-region replication for PostgreSQL and MongoDB. In a total region failure, point DNS (Route 53) to the secondary region's load balancer.
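A hedged excerpt showing the deployment fields involved in that graceful WebSocket shutdown; the preStop sleep and the 60-second grace period are illustrative values, not settings taken from the repository:
YAML
# Pod template excerpt for the hive-backend Deployment (illustrative)
spec:
  terminationGracePeriodSeconds: 60      # give in-flight WebSocket connections time to drain
  containers:
    - name: hive
      image: adenhq/hive:latest
      lifecycle:
        preStop:
          exec:
            # Pause before SIGTERM so the load balancer stops sending new traffic first
            command: ["sh", "-c", "sleep 15"]
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 4000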
________________________________________
Part 3: CI/CD Pipeline 🔄
Task 3.1: GitHub Actions Pipeline 🔄
Create a complete CI/CD pipeline:
# .github/workflows/ci-cd.yml
name: Aden CI/CD
on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
jobs:
  # Your implementation should include:
  # - Linting
  # - Type checking
  # - Unit tests
  # - Integration tests
  # - Build Docker images
  # - Push to registry
  # - Deploy to staging (on develop)
  # - Deploy to production (on main, with approval)
Include:
1. Separate jobs for frontend and backend
2. Matrix testing for multiple Node versions
3. Docker layer caching
4. Deployment gates/approvals
5. Rollback strategy
The following YAML implementation for .github/workflows/ci-cd.yml creates a modular, enterprise-grade pipeline. It separates frontend and backend logic, utilizes matrix testing, and implements secure deployment gates.
YAML
name: Aden CI/CD
on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
permissions:
  id-token: write
  contents: read
jobs:
  # 1. Quality & Test Job (Parallel Matrix)
  quality-gate:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        service: [honeycomb, hive]
        node-version: [18.x, 20.x, 22.x]
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node.js ${{ matrix.node-version }}
        uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node-version }}
          cache: 'npm'
          cache-dependency-path: ${{ matrix.service }}/package-lock.json
      - name: Install & Lint
        run: |
          npm ci --prefix ${{ matrix.service }}
          npm run lint --prefix ${{ matrix.service }}
      - name: Unit Tests
        run: npm run test:unit --prefix ${{ matrix.service }}
  # 2. Integration Tests
  integration-test:
    needs: quality-gate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Spin up Test Infrastructure
        run: docker compose -f docker-compose.test.yml up -d
      - name: Run Integration Suite
        run: npm run test:integration --prefix hive
  # 3. Docker Build & Push (With Layer Caching)
  build-and-push:
    needs: integration-test
    if: github.event_name == 'push'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - name: Login to GHCR
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build and Push (GHA Cache enabled)
        uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: ghcr.io/adenhq/hive:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
  # 4. Deploy to Staging (Auto on develop)
  deploy-staging:
    needs: build-and-push
    if: github.ref == 'refs/heads/develop'
    environment: staging
    runs-on: ubuntu-latest
    steps:
      - run: echo "Deploying to Staging via Helm..."
  # 5. Deploy to Production (Approval on main)
  deploy-production:
    needs: build-and-push
    if: github.ref == 'refs/heads/main'
    environment: production # Gated by manual approval in GitHub Settings
    runs-on: ubuntu-latest
    steps:
      - run: echo "Deploying to Production via Helm..."
Rollback Strategy: We use GitOps (ArgoCD or Flux). If a production deployment fails health checks, the system is configured to auto-rollback to the previous known-good image tag in the Git repository.
________________________________________
Task 3.2: Testing Strategy 🧪
A comprehensive testing strategy for an agentic framework requires shifting from traditional software testing to behavioral and resilience-based validation.
1. Test Categories & Mocking Strategy
• Unit Tests (framework logic, utility functions, node state machines): Mocking LLMs — use a "Test Intelligence Backend" (like FakeChatModel) that returns deterministic responses based on input keys. Mock external tools at the function interface level to verify the agent calls them with the correct schema.
• Integration Tests (DB transactions in Mongo/Postgres, Redis heartbeat updates): Use Docker service containers (via GitHub Actions) or Testcontainers to spin up real instances of MongoDB and TimescaleDB for each test run to ensure schema compatibility.
• E2E Tests (goal-to-action flows, human-in-the-loop triggers): Use Playwright to simulate a user creating a goal in the Honeycomb UI and verify that the backend correctly generates the graph and sends an intervention request to Slack.
• Load Tests (WebSocket gateway capacity, concurrent event ingestion): Use k6 or Locust to simulate 1000+ concurrent SDK clients emitting MetricEvents. Measure p99 latency for state updates and DB write saturation points.
• Chaos Tests (self-healing resilience, failover logic): Use Chaos Mesh to inject failures: terminate a database primary node, induce 500 ms network latency to the LLM API, or simulate 50% packet loss on the WebSocket gateway.
2. Example Test Configurations
• Unit (mocking): A pytest configuration that intercepts litellm.completion calls and returns mock JSON from a local fixtures/ folder.
• Chaos: A YAML experiment that targets the redis pod and kills it every 5 minutes to verify the AgentRunner doesn't lose execution state (sketched below).
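A hedged sketch of that chaos experiment using a Chaos Mesh Schedule; it assumes Chaos Mesh is installed and that the Redis pods run in an aden namespace with an app: redis label (both assumptions):
YAML
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: redis-pod-kill
  namespace: aden                 # assumed namespace
spec:
  schedule: "*/5 * * * *"         # every 5 minutes
  type: PodChaos
  historyLimit: 5
  concurrencyPolicy: Forbid
  podChaos:
    action: pod-kill
    mode: one                     # kill one matching pod per run
    selector:
      namespaces:
        - aden
      labelSelectors:
        app: redis                # assumed pod label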
________________________________________
Task 3.3: Environment Management 🌍
To ensure high availability and data security, Aden uses a tiered environment strategy with isolated data planes.
Environment Matrix
• Local: Provisioned with Docker Compose on the host machine; mock data or local seeding scripts; deployed manually (docker compose up); full local admin access.
• Dev: Provisioned with Terraform (shared AWS/GCP cluster); sanitized production snapshots (monthly refresh); automated CD from feature/* branches; access for all engineering.
• Staging: Provisioned with Helm + EKS (full replica of prod); anonymized production data (weekly refresh); automated CD on merge to develop; access for engineering and QA.
• Production: Multi-region EKS + managed Aurora/Atlas; real user data (encrypted at rest and in transit); gated CD on merge to main (requires approval); access for SRE/DevOps only (MFA).
Operational Details
• Provisioning: All environments (except Local) are managed via Infrastructure as Code (Terraform) to ensure environment parity and prevent "works on my machine" bugs.
• Data Isolation: Production uses dedicated VPCs and IAM roles. Staging and Dev share a cluster but are strictly isolated via Kubernetes namespaces and network policies (see the sketch after this list).
• Deployments: Use blue-green deployments for Production to allow instant rollback. Staging uses rolling updates to test new features under continuous integration.
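A hedged NetworkPolicy sketch for the namespace isolation mentioned above; the staging namespace name and the ingress-nginx label are assumptions:
YAML
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: staging-isolation
  namespace: staging
spec:
  podSelector: {}                  # applies to every pod in the staging namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}          # allow traffic from pods within the same namespace
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx   # allow the ingress controller namespace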
________________________________________
Part 4: Observability & Operations 📊
Task 4.1: Monitoring Stack 📊
1. Key Metrics (The "Top 10")
• agent_run_duration_seconds (histogram): Tracks end-to-end latency of agent goals.
• llm_token_usage_total (counter): Total tokens consumed, segmented by model and team.
• llm_request_cost_usd (gauge): Real-time dollar spend for LLM API calls.
• node_failure_rate (gauge): Percentage of failures per specific node (e.g., "Scout", "Writer").
• mcp_tool_execution_time (summary): Performance of external tool calls (APIs, web search).
• websocket_active_connections (gauge): Concurrent dashboard users and real-time streams.
• db_write_latency_ms (histogram): Latency for metric ingestion into TimescaleDB.
• self_healing_attempts_total (counter): Number of times agents triggered auto-recovery.
• agent_memory_utilization (gauge): Memory usage per AgentRunner pod.
• human_intervention_wait_time (histogram): Time spent waiting for human approval/feedback.
2. Logs & Traces
• Logging Strategy: Use structured JSON logging (via Pino or Winston). Logs are aggregated into Grafana Loki. High-cardinality metadata (team_id, agent_id, session_id) is indexed to allow instant filtering.
• Distributed Tracing: Implement OpenTelemetry (OTel) with Jaeger. Every agent execution is treated as a "Trace," and each node execution/tool call is a "Span." This allows developers to visualize where an agent got stuck in a multi-step workflow.
3. Three Key Dashboards
1. SRE Overview: Global health, HTTP error rates, system resource usage (CPU/RAM), and database health.
2. Agent Economics: Token usage trends, cost by team, model ROI, and budget depletion forecasts.
3. Agent Logic & Quality: Failure taxonomy distribution, self-healing success rates, and human rejection analysis.
________________________________________
Monitoring Setup Addition (docker-compose.monitoring.yml)
YAML
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=aden_dev
    volumes:
      - grafana-storage:/var/lib/grafana
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686" # UI
      - "4317:4317"   # OTLP gRPC
volumes:
  grafana-storage:
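The Prometheus container above mounts a ./prometheus.yml that is not shown. A minimal sketch of what it might contain, assuming the Hive backend exposes a Prometheus /metrics endpoint on port 4000 (an assumption, not confirmed from the repository):
YAML
# prometheus.yml (illustrative sketch)
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: hive-backend
    metrics_path: /metrics            # assumed metrics endpoint
    static_configs:
      - targets: ['hive-server:4000'] # assumed service name and port
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']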
________________________________________
Task 4.2: Alerting Rules 🚨
YAML
groups:
  - name: aden-critical
    rules:
      # 1. High Error Rate
      - alert: HighErrorRate
        expr: sum(rate(node_failure_count[5m])) / sum(rate(node_execution_count[5m])) > 0.15
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Agent failure rate above 15%"
          description: "High error rate detected in the production orchestration layer."
      # 2. Service Down
      - alert: HiveBackendDown
        expr: up{job="hive-backend"} == 0
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Hive Backend is unreachable"
      # 3. High Latency
      - alert: ExtremeLLMLatency
        expr: histogram_quantile(0.95, sum(rate(llm_request_duration_seconds_bucket[10m])) by (le)) > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p95 LLM Latency > 30s"
      # 4. Budget Threshold Hit
      - alert: TeamBudgetCritical
        expr: team_spend_usd / team_budget_limit_usd >= 0.95
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Team at 95% budget capacity"
      # 5. DB Write Pressure
      - alert: TimescaleDBWriteLag
        expr: rate(db_write_errors_total[5m]) > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Database ingestion errors detected"
      # 6. Memory Pressure
      - alert: PodMemoryPressure
        expr: container_memory_usage_bytes{container="hive-backend"} / container_spec_memory_limit_bytes > 0.85
        for: 5m
        labels:
          severity: warning
      # 7. Self-Healing Fail Loop
      - alert: InfiniteHealingLoop
        expr: increase(self_healing_attempts_total[10m]) > 10
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Agent stuck in evolution loop"
      # 8. Database Connection Limit
      - alert: PostgresConnectionLimit
        expr: pg_stat_activity_count / pg_settings_max_connections > 0.90
        for: 5m
        labels:
          severity: warning
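These rules only define when alerts fire; routing them to people requires an Alertmanager configuration. A hedged sketch, assuming PagerDuty for severity: page and Slack for everything else (receiver names, channel, and keys are placeholders):
YAML
# alertmanager.yml (illustrative sketch)
route:
  receiver: slack-default
  group_by: ['alertname', 'severity']
  routes:
    - matchers:
        - severity="page"            # HiveBackendDown pages the on-call engineer
      receiver: pagerduty-oncall
receivers:
  - name: slack-default
    slack_configs:
      - channel: '#aden-alerts'
        api_url: 'https://hooks.slack.com/services/REPLACE_ME'   # placeholder webhook URL
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: 'REPLACE_WITH_PAGERDUTY_ROUTING_KEY'        # placeholder key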
Task 4.3: Incident Response 🆘
Create an incident response runbook:
Scenario: Agent response times spike to 30 seconds (normal: 2 seconds)
Provide:
1. Detection: How was this discovered?
2. Triage: Initial investigation steps
3. Diagnosis: Decision tree for root causes
4. Resolution: Steps for each root cause
5. Post-mortem: Template for incident review
# Runbook: High Agent Latency
## Symptoms
- Agent response times > 10s
- Dashboard showing degraded status
## Initial Triage
1. Check [ ] Is this affecting all agents or specific ones?
2. Check [ ] Is the backend healthy? (health endpoint)
3. Check [ ] Are databases responsive?
...
## Diagnostic Steps
...
## Resolution Steps
### If LLM Provider Issue:
...
### If Database Issue:
...
________________________________________
Runbook: High Agent Latency
1. Detection
• Primary Discovery: Typically via a Prometheus alert (ExtremeLLMLatency) triggered when the p95 agent execution time exceeds 10s for 5 minutes.
• Secondary Discovery: Dashboard users report "spinning" icons or a "Degraded" status in the Honeycomb UI.
• Log Evidence: TraceEvents in Grafana Loki show significant gaps (>20s) between node execution timestamps.
________________________________________
2. Initial Triage
1. Check Scope: Is this affecting all teams or just one? (Query: sum(rate(llm_request_duration_seconds_bucket[5m])) by (team_id))
2. Check Backend Health: Are the hive-backend pods reporting healthy? (Endpoint: /health/ready). Check for high pod restart counts.
3. Check Databases: Is there a write lag in TimescaleDB or a high connection count in MongoDB?
4. Check External Status: Visit status pages for OpenAI, Anthropic, and AWS.
________________________________________
3. Diagnosis Decision Tree
• Is the latency at the start of the session?
  o Yes: Possible MongoDB bottleneck (loading agent graphs) or Coding Agent saturation.
• Is latency occurring between node executions?
  o Yes: WebSocket gateway congestion or Redis heartbeat lag.
• Is the latency within the node execution itself?
  o Yes: Check the model_id.
    - If it's a "Flagship" model (GPT-4o): high likelihood of LLM provider saturation.
    - If it's a "Tool Call" node: high likelihood of an MCP tool API timeout.
________________________________________
4. Resolution Steps
If LLM Provider Issue (Global Saturation):
1. Trigger Policy Override: Using the /v1/control/policy endpoint, force a temporary Model Degradation for all non-critical agents (e.g., GPT-4o → GPT-4o-mini).
2. Enable Throttling: Introduce a 2000ms delay in the checkBudget logic to reduce global request pressure.
If Database / Infrastructure Issue:
1. Scale Pods: Manually increase the replicas in the Kubernetes deployment to 2x.
2. Clear Hot Storage: Flush the Redis session cache if orphan heartbeats are causing gateway lockups.
If Specific MCP Tool Failure:
1. Disable Tool: Temporarily remove the failing tool from the tool_registry.
2. Self-Healing: Allow the Self-Healing Loop to re-route agents to alternative tools (e.g., using a different search provider).
________________________________________
5. Post-Mortem Template
• Summary: What happened, which teams were affected, and total downtime.
• Timeline: From first detection to final resolution (TTO, TTR).
• Root Cause: Why did this happen? (Infrastructure, logic, or external.)
• Impact Analysis: Total cost of failed/stalled goals and wasted API spend.
• Lessons Learned: What guardrails or alerts failed to prevent this?
• Action Items: List of tickets (Jira/GitHub) created to ensure this doesn't repeat.
________________________________________
Part 5: Security Hardening (Bonus) 🔒
• Network: Close all ports except 443 (Ingress). Use VPC Peering or PrivateLink for database connections.
• Container Security: Implement Trivy scanning in the CI/CD pipeline to catch vulnerabilities in base images before they reach the registry.
• Secrets: Move from .env files to AWS Secrets Manager or HashiCorp Vault, injected at runtime as environment variables.
Task 5.1: Security Audit 🔒
1. Network Security
• Exposed Ports: 3000 (frontend), 4000 (backend API).
  o Analysis: These are necessary for external access but should be restricted. In production, only the Load Balancer/Ingress should expose them, over HTTPS (443).
• Internal Ports: 5432 (Postgres), 27017 (Mongo), 6379 (Redis).
  o Action: These must never be exposed to the public internet. Use security groups or private subnets so that only the hive backend container can communicate with them.
2. Secret Management
• Current State: Secrets are managed via .env files and config.yaml.
• Improvements:
  o External Vault: Move all API keys (OpenAI, Anthropic) and DB credentials to AWS Secrets Manager or HashiCorp Vault.
  o Dynamic Injection: Inject secrets into Kubernetes pods at runtime as mounted files or environment variables, avoiding persistent storage on disk.
  o Rotation: Implement automated 90-day rotation for all infrastructure and third-party API keys.
3. Authentication (API Auth)
• Implementation: JWT (JSON Web Tokens) with team-scoped claims.
• Hardening:
  o MFA: Enforce Multi-Factor Authentication for the Honeycomb Dashboard.
  o Scoping: Implement Fine-Grained Access Control (FGAC). Instead of a global API key, use agent-specific tokens that only allow a worker agent to call the specific tools it needs (e.g., a "Writer" agent shouldn't have "Search Internal Docs" permissions).
4. Container Security
• Image Scanning: Add Trivy or Snyk to the GitHub Actions CI pipeline. Scan on every PR to detect vulnerable Node.js/Python packages (a sketch of such a CI step appears at the end of this task).
• Distroless Images: Switch from standard Node/Python images to Google Distroless or Alpine to minimize the attack surface by removing shells, package managers, and unnecessary binaries.
5. Database Hardening
• Encryption: Enable encryption at rest (AES-256) and enforce TLS for all database-to-backend connections.
• Least Privilege: Create dedicated DB users for the hive service. The application user should not have DROP TABLE or superuser permissions.
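A hedged sketch of the image-scanning job referenced in point 4, as it might be added to the Task 3.1 pipeline; the inputs shown are commonly used with aquasecurity/trivy-action, but treat the exact values as assumptions:
YAML
  # Extra job for .github/workflows/ci-cd.yml (illustrative)
  image-scan:
    needs: build-and-push
    runs-on: ubuntu-latest
    steps:
      - name: Scan pushed image with Trivy
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ghcr.io/adenhq/hive:${{ github.sha }}
          format: table
          exit-code: '1'            # fail the pipeline if findings remain
          severity: CRITICAL,HIGH
          ignore-unfixed: true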
________________________________________
Task 5.2: Compliance Checklist ✅ (SOC 2 Pathway)
Achieving SOC 2 Type II requires demonstrating consistent operational control over a 6-12 month period.
1. Access Control Improvements
• Provisioning: Use a standardized "Joiners, Movers, Leavers" (JML) process.
• Principle of Least Privilege: Regular (quarterly) access reviews to ensure employees and agents only have the permissions they need.
2. Audit Logging Requirements
• Centralization: Stream all logs (Metric, Trace, and LogEvents) to a write-only, tamper-proof repository such as Amazon S3 with Object Lock.
• Integrity: Every agent "Evolution" decision must be logged with the ID of the admin who approved the change.
3. Encryption Requirements
• In Transit: 100% of traffic (including internal pod-to-pod traffic) must be encrypted via mTLS (mutual TLS) using a service mesh like Istio or Linkerd.
• At Rest: Ensure all EBS volumes, RDS instances, and MongoDB clusters use KMS keys managed by the organization.
4. Data Retention Policies
• Automated Deletion: Implement a policy to purge individual "Session Local Memory" 30 days after a goal is completed, unless it is specifically marked for long-term memory storage.
• Hypertable Chunks: Use TimescaleDB's drop_chunks function to automatically archive or delete metric logs older than 1 year to comply with privacy regulations (GDPR/CCPA).
5. Incident Response Requirements
• Runbook Automation: The high-latency runbook (Task 4.3) must be part of an official, version-controlled IR plan.
• Testing: Perform an annual security tabletop exercise where a mock "Agent Data Leak" or "Prompt Injection" scenario is played out by the engineering team.