This lab provides a complete production-readiness guide for Enterprise AgentGateway: a metrics reference for every component, recommended Prometheus alerting rules, horizontal pod autoscaling configuration, graceful shutdown for long-lived AI connections, pod spreading, and disruption budgets.
This lab assumes that you have completed the setup in 001 and 002.
- Understand every metric exposed by the data plane, control plane, and rate limiter
- Deploy production-grade Prometheus alerting rules
- Configure HPA to auto-scale the gateway based on the right signals
- Configure PodDisruptionBudgets, topology spread, and anti-affinity
- Ensure graceful shutdown of long-lived LLM streaming and MCP connections during upgrades
- Perform a zero-downtime rolling upgrade
Enterprise AgentGateway exposes metrics from three components. All are scraped automatically by Prometheus via pod annotations.
The data plane is a Rust-based proxy built on Tokio. It handles all request routing, LLM traffic, MCP calls, and guardrail enforcement.
| Metric | Type | Labels | Description |
|---|---|---|---|
| `agentgateway_build_info` | Info | `tag` | AgentGateway version (e.g. v2.2.0). Use to confirm all pods are on the same version after an upgrade. |
These are the primary metrics for monitoring gateway health and performance. Every request flowing through the gateway is counted and timed here.
| Metric | Type | Labels | Description |
|---|---|---|---|
| `agentgateway_requests_total` | Counter | `backend`, `protocol`, `method`, `status`, `reason`, `bind`, `gateway`, `listener`, `route`, `route_rule` | Total HTTP requests processed. The `status` label is the HTTP status code (200, 429, 500, etc.). The `reason` label classifies why the response was generated (see reason table below). |
| `agentgateway_request_duration_seconds` | Histogram | (same) | End-to-end request latency from the proxy's perspective, including upstream LLM processing time. Buckets: 1ms to 80s. For LLM traffic, p99 will be dominated by model inference time. |
| `agentgateway_response_bytes_total` | Counter | (same) | Total response bytes received from upstream backends. Useful for tracking bandwidth and detecting unusually large responses. |
HTTP Label Reference:
| Label | Values | Description |
|---|---|---|
| `backend` | Backend name or "unknown" | The upstream AgentgatewayBackend that handled the request. "unknown" means no route matched. |
| `protocol` | `http`, `https`, `tls`, `tcp`, `hbone` | Transport protocol to the upstream. |
| `method` | `GET`, `POST`, `CONNECT`, etc. | HTTP method. LLM chat completions are always POST. |
| `status` | HTTP status code (200, 404, 429, 500, etc.) | Response status code. |
| `reason` | See table below | Why the proxy generated or forwarded this response. |
| `bind` | e.g. `8080/agentgateway-system/agentgateway-proxy` | The listener bind address. |
| `gateway` | e.g. `agentgateway-system/agentgateway-proxy` | The Gateway resource name. |
| `listener` | e.g. `http` | Listener name within the Gateway. |
| `route` | HTTPRoute name or "unknown" | Which HTTPRoute matched. |
| `route_rule` | Rule index or "unknown" | Which rule within the HTTPRoute matched. |
Response `reason` values:
The reason label tells you why a response was generated — critical for distinguishing between "upstream returned an error" vs. "the gateway itself rejected the request":
| Reason | Meaning | Typical Status Codes |
|---|---|---|
| `Upstream` | Response came from the upstream LLM/MCP backend | Any (200, 429, 500, etc.) |
| `DirectResponse` | Proxy generated the response directly (no upstream call) | Varies |
| `NotFound` | No matching listener, route, or backend found | 404 |
| `NoHealthyBackend` | All providers in the backend are unhealthy, DNS failed, or the backend doesn't exist | 503 |
| `RateLimit` | Request rejected by the local or global rate limiter | 429 |
| `Timeout` | Request or upstream call timed out | 504 |
| `JwtAuth` | JWT authentication failed | 401 |
| `BasicAuth` | Basic authentication failed | 401 |
| `APIKeyAuth` | API key authentication failed | 401 |
| `ExtAuth` | External authorization service rejected the request | 403 |
| `Authorization` | Authorization or CSRF validation failed | 403 |
| `UpstreamFailure` | Upstream connection failed, TCP proxy error, or backend auth error | 502, 503 |
| `Internal` | Internal proxy error (invalid request, filter error, processing error) | 500 |
| `MCP` | MCP protocol-level error | Varies |
| `ExtProc` | External processing failure | 500 |
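One practical use of the `reason` label is attributing error budget: did the gateway itself reject the request, or did the upstream fail? The grouping below is our own reading of the table above (not an official taxonomy), and the sample counts are made up for illustration:

```python
# Hypothetical per-reason request counts, as might be read from
# agentgateway_requests_total over some window.
samples = {
    "Upstream": 950,
    "RateLimit": 30,
    "NoHealthyBackend": 12,
    "JwtAuth": 5,
    "Internal": 3,
}

# Reasons where the gateway itself produced the response. Upstream,
# UpstreamFailure, and MCP are treated as upstream-side here.
GATEWAY_GENERATED = {
    "DirectResponse", "NotFound", "NoHealthyBackend", "RateLimit", "Timeout",
    "JwtAuth", "BasicAuth", "APIKeyAuth", "ExtAuth", "Authorization",
    "Internal", "ExtProc",
}

def attribute_errors(samples: dict) -> dict:
    """Split request counts into upstream-served vs gateway-generated."""
    gateway = sum(v for k, v in samples.items() if k in GATEWAY_GENERATED)
    upstream = sum(v for k, v in samples.items() if k not in GATEWAY_GENERATED)
    return {"gateway_generated": gateway, "upstream": upstream}

print(attribute_errors(samples))  # {'gateway_generated': 50, 'upstream': 950}
```

The same split can be expressed directly in PromQL by matching the `reason` label in two `sum(rate(...))` queries.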
These follow the OpenTelemetry GenAI semantic conventions. They are only populated for requests routed to LLM backends (AgentgatewayBackend with ai spec).
| Metric | Type | Labels | Description |
|---|---|---|---|
| `agentgateway_gen_ai_client_token_usage` | Histogram | `gen_ai_token_type`, `gen_ai_operation_name`, `gen_ai_system`, `gen_ai_request_model`, `gen_ai_response_model`, `route` | Tokens consumed per request. Two observations per request: one with `gen_ai_token_type="input"` (prompt tokens) and one with `gen_ai_token_type="output"` (completion tokens). Buckets are exponential: 1, 4, 16, 64, 256, 1024 ... up to 67M. |
| `agentgateway_gen_ai_server_request_duration` | Histogram | `gen_ai_operation_name`, `gen_ai_system`, `gen_ai_request_model`, `gen_ai_response_model`, `route` | Total time the upstream LLM took to process the request (seconds). For streaming, this is the time from first byte sent to last byte received. Buckets: 10ms to 82s. |
| `agentgateway_gen_ai_server_time_to_first_token` | Histogram | (same) | Time from request start to the first token generated (TTFT). A critical SLI for streaming user experience. Buckets: 1ms to 10s. |
| `agentgateway_gen_ai_server_time_per_output_token` | Histogram | (same) | Average inter-token latency (TPOT). Measures generation throughput. Buckets: 1ms to 2.5s. |
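TTFT, TPOT, and total duration are related: for a streamed completion, total time is roughly TTFT plus one inter-token interval for every remaining token. A small sketch of that arithmetic (illustrative only; the real histograms report measured values, not this approximation):

```python
def estimated_stream_duration(ttft_s: float, tpot_s: float,
                              output_tokens: int) -> float:
    """Approximate total streaming time: time to first token, plus
    inter-token latency for each of the remaining tokens."""
    return ttft_s + tpot_s * (output_tokens - 1)

# e.g. 800ms TTFT, 40ms per output token, 500-token completion
print(round(estimated_stream_duration(0.8, 0.04, 500), 2))  # 20.76 seconds
```

This is useful for sanity-checking the three histograms against each other: if p50 duration is far above `TTFT + TPOT × tokens`, something other than generation (queueing, retries) is adding latency.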
GenAI Label Reference:
| Label | Values | Description |
|---|---|---|
| `gen_ai_token_type` | `input`, `output` | Whether this observation counts prompt tokens or completion tokens. Only on `token_usage`. |
| `gen_ai_operation_name` | `chat`, `embeddings` | The type of LLM operation. |
| `gen_ai_system` | `openai`, `anthropic`, `bedrock`, `vertexai`, `azureopenai`, etc. | The LLM provider type configured in the backend. |
| `gen_ai_request_model` | e.g. `gpt-4o`, `claude-sonnet-4-20250514` | The model name sent in the request. |
| `gen_ai_response_model` | e.g. `gpt-4o-2024-08-06` | The model name returned in the response (may differ from the request). |
| Metric | Type | Labels | Description |
|---|---|---|---|
| `agentgateway_mcp_requests` | Counter | `method`, `resource_type`, `server`, `resource`, `route` | Total MCP tool/resource/prompt calls. Not incremented for raw HTTP transport requests (only JSON-RPC method calls). |
MCP Label Reference:
| Label | Values | Description |
|---|---|---|
| `method` | `tools/call`, `tools/list`, `prompts/get`, `resources/read`, etc. | The MCP JSON-RPC method name. |
| `resource_type` | `Tool`, `Prompt`, `Resource`, `ResourceTemplates` | Category of MCP operation. |
| `server` | Target MCP server name | Which MCP server was called. |
| `resource` | Tool/resource name | The specific tool or resource accessed. |
MCP requests also flow through the general agentgateway_request_duration_seconds histogram for latency tracking.
| Metric | Type | Labels | Description |
|---|---|---|---|
| `agentgateway_guardrail_checks` | Counter | `phase`, `action` | Total guardrail evaluations across all guardrail types (regex, webhook, OpenAI Moderation, Bedrock Guardrails, Google Model Armor). |
| Label | Values | Description |
|---|---|---|
| `phase` | `Request`, `Response` | Whether the guardrail fired on the inbound request or the outbound LLM response. |
| `action` | `Allow`, `Mask`, `Reject` | The outcome. `Reject` = request/response blocked, `Mask` = content redacted, `Allow` = passed. |
| Metric | Type | Labels | Description |
|---|---|---|---|
| `agentgateway_downstream_connections_total` | Counter | `bind`, `gateway`, `listener`, `protocol` | Total client-to-proxy connections established. Includes short-lived and long-lived (streaming, MCP/SSE) connections. |
| `agentgateway_downstream_received_bytes_total` | Counter | (same) | Total bytes received from clients. |
| `agentgateway_downstream_sent_bytes_total` | Counter | (same) | Total bytes sent to clients. |
| `agentgateway_upstream_connect_duration_seconds` | Histogram | `transport` | Time to establish upstream connections. `transport` is `plaintext` or `tls`. High values indicate network or DNS problems reaching LLM providers. Buckets: 0.5ms to 8s. |
| `agentgateway_tls_handshake_duration_seconds` | Histogram | `bind`, `gateway`, `listener`, `protocol` | Inbound TLS handshake duration. Only populated if TLS termination is configured on the gateway. Buckets: 0.5ms to 8s. |
These track the connection between the data plane proxy and the control plane. Problems here mean the proxy isn't receiving configuration updates.
| Metric | Type | Labels | Description |
|---|---|---|---|
| `agentgateway_xds_connection_terminations` | Counter | `reason` | xDS stream disconnections. `reason` is `ConnectionError` (network failure), `Error` (gRPC error), `Reconnect` (planned), or `Complete` (clean close). Frequent `ConnectionError` or `Error` values indicate control plane instability. |
| `agentgateway_xds_message_total` | Counter | `url` | Number of xDS config messages received. The `url` label is the resource type URL (e.g. `type.googleapis.com/agentgateway.dev.resource.Resource`). A sudden stop means the proxy is no longer receiving config updates. |
| `agentgateway_xds_message_bytes_total` | Counter | `url` | Bytes received from xDS. Large spikes may indicate excessive configuration churn. |
The proxy runs on a Tokio async runtime. These metrics indicate proxy-level health independently of request metrics.
| Metric | Type | Description |
|---|---|---|
| `agentgateway_tokio_num_workers` | Gauge | Number of Tokio worker threads. Defaults to the number of CPU cores (or the value of `CPU_LIMIT`). Should be stable. |
| `agentgateway_tokio_num_alive_tasks` | Gauge | Number of currently active async tasks. Each in-flight request and connection is a task. A sustained upward trend may indicate task leaks or connection backlog. |
| `agentgateway_tokio_global_queue_depth` | Gauge | Tasks waiting to be picked up by a worker thread. Sustained values > 0 mean worker threads are saturated — a strong scale-up signal. |
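Because the queue depth gauge can blip to non-zero momentarily under normal load, alert logic should look for *sustained* saturation rather than a single sample. A sketch of that idea (plain Python over hypothetical scrape samples; in practice you would express this with a `for:` clause on a Prometheus alert, as shown later in this lab):

```python
def is_saturated(queue_depth_samples: list[int], threshold: int = 0,
                 min_consecutive: int = 5) -> bool:
    """Flag saturation only when the global queue depth stays above the
    threshold for `min_consecutive` consecutive scrapes."""
    streak = 0
    for depth in queue_depth_samples:
        streak = streak + 1 if depth > threshold else 0
        if streak >= min_consecutive:
            return True
    return False

print(is_saturated([0, 0, 2, 3, 0, 4, 2, 0]))  # False: longest streak is 2
print(is_saturated([0, 1, 2, 3, 1, 4, 2, 0]))  # True: 6 consecutive non-zero samples
```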
The control plane is a Go-based Kubernetes controller. It watches Gateway API and AgentGateway CRDs and pushes configuration to data plane proxies via xDS.
| Metric | Type | Labels | Description |
|---|---|---|---|
| `kgateway_controller_reconciliations_total` | Counter | `controller`, `name`, `namespace`, `result` | Total reconciliation loops. The `controller` label identifies which controller ran (e.g. `gateway`, `gatewayclass`). The `result` label is `success` or `error`. A rising error count means CRD changes are not being applied. |
| `kgateway_controller_reconciliations_running` | Gauge | `controller`, `name`, `namespace` | Currently in-flight reconciliations. Sustained high values indicate controller backlog. |
| `kgateway_controller_reconcile_duration_seconds` | Histogram | `controller`, `name`, `namespace` | Time per reconciliation loop. Increasing durations may indicate growing cluster complexity or API server slowness. |
| `enterprise_kgateway_controller_reconciliations_total` | Counter | `controller`, `name`, `namespace`, `result` | Same as above but for enterprise-specific controllers: `agw-ext-auth`, `agw-ext-cache`, `agw-rate-limiter`. |
| `enterprise_kgateway_controller_reconciliations_running` | Gauge | (same) | In-flight enterprise reconciliations. |
| `enterprise_kgateway_controller_reconcile_duration_seconds` | Histogram | (same) | Enterprise reconciliation duration. |
| Metric | Type | Description |
|---|---|---|
| `kgateway_xds_auth_rq_total` | Counter | Total xDS authentication requests from data plane proxies. Each proxy connection must authenticate. |
| `kgateway_xds_auth_rq_success_total` | Counter | Successful xDS auth requests. If `total - success > 0`, proxy pods are failing to authenticate with the control plane. |
| Metric | Type | Description |
|---|---|---|
| `go_goroutines` | Gauge | Number of active goroutines. Sustained growth indicates leaks. Baseline is ~900 for a healthy controller. |
| `go_memstats_alloc_bytes` | Gauge | Current heap allocation. Monitor for memory leaks. |
| `process_resident_memory_bytes` | Gauge | RSS of the control plane process. Use for capacity planning. |
| `process_cpu_seconds_total` | Counter | CPU time consumed. Use `rate()` for CPU utilization. |
| `process_open_fds` | Gauge | Open file descriptors. Approaching `process_max_fds` causes failures. |
| Metric | Type | Labels | Description |
|---|---|---|---|
| `ratelimit_solo_io_total_hits` | Counter | `descriptor` | Total rate limit evaluation requests. The `descriptor` label encodes the rate limit policy (e.g. `solo.io\|generic_key^namespace.policyname`). |
| `ratelimit_solo_io_over_limit` | Counter | `descriptor` | Requests that exceeded the configured limit and were rejected (429). |
| `ratelimit_solo_io_near_limit` | Counter | `descriptor` | Requests that were within 80% of the limit — an early warning signal. |
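The three counters are most useful as ratios per descriptor: what fraction of evaluations is being rejected, and what fraction is approaching the limit. A sketch of the arithmetic with hypothetical rates (the 200/30/50 numbers are made up; in practice each input would be a `rate(...[5m])` result):

```python
def limiter_health(total_hits: float, over_limit: float,
                   near_limit: float) -> dict:
    """Summarize one descriptor's rate-limit pressure from counter rates."""
    if total_hits == 0:
        return {"reject_ratio": 0.0, "near_ratio": 0.0}
    return {
        "reject_ratio": over_limit / total_hits,   # fraction rejected with 429
        "near_ratio": near_limit / total_hits,     # fraction within 80% of limit
    }

# hypothetical rates: 200 evaluations/s, 30 rejected/s, 50 near-limit/s
h = limiter_health(200, 30, 50)
print(h["reject_ratio"], h["near_ratio"])  # 0.15 0.25
```

A rising `near_ratio` with a flat `reject_ratio` is the window in which to raise limits before users see 429s.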
Lab 002 already configures Prometheus scraping for the data plane proxy. For production, you also need to scrape the control plane and rate limiter:
# Control plane metrics (port 9092)
kubectl apply -f- <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
name: control-plane-monitoring-agentgateway-metrics
namespace: agentgateway-system
spec:
namespaceSelector:
matchNames:
- agentgateway-system
podMetricsEndpoints:
- port: metrics
selector:
matchLabels:
app.kubernetes.io/name: enterprise-agentgateway
EOF

# Rate limiter metrics (port 9091, exposed as "debug" on the Service)
kubectl apply -f- <<EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: rate-limiter-monitoring-agentgateway-metrics
namespace: agentgateway-system
spec:
namespaceSelector:
matchNames:
- agentgateway-system
selector:
matchLabels:
app: rate-limiter
endpoints:
- port: debug
interval: 15s
EOF

Verify all targets are being scraped (may take 30-60 seconds):
kubectl port-forward svc/prometheus-operated -n monitoring 9090:9090 &
sleep 2
curl -s http://localhost:9090/api/v1/targets | \
python3 -c "
import sys, json
d = json.load(sys.stdin)
for t in d['data']['activeTargets']:
pod = t.get('labels',{}).get('pod','')
svc = t.get('labels',{}).get('service','')
if 'agentgateway' in pod or 'rate-limiter' in svc:
print(f'{(pod or svc):60s} | {t[\"health\"]}')
"
kill %1 2>/dev/null

Expected output — all components up:
agentgateway-proxy-xxxxx-yyyyy | up
agentgateway-proxy-xxxxx-zzzzz | up
enterprise-agentgateway-xxxxx-yyyyy | up
rate-limiter-enterprise-agentgateway-xxxxx-yyyyy | up
Deploy these alerting rules to catch issues before they affect users.
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: agentgateway-alerts
namespace: monitoring
labels:
release: prometheus
spec:
groups:
# ──────────────────────────────────────────────
# Data Plane Alerts
# ──────────────────────────────────────────────
- name: agentgateway-dataplane
rules:
# High error rate: >5% of requests returning 5xx
- alert: AgentGatewayHighErrorRate
expr: |
(
sum(rate(agentgateway_requests_total{status=~"5.."}[5m])) by (gateway)
/
sum(rate(agentgateway_requests_total[5m])) by (gateway)
) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "AgentGateway {{ $labels.gateway }} has >5% error rate"
description: "{{ $value | humanizePercentage }} of requests are returning 5xx errors."
# High rate limit rejection rate
- alert: AgentGatewayHighRateLimitRate
expr: |
(
sum(rate(agentgateway_requests_total{reason="RateLimit"}[5m])) by (gateway)
/
sum(rate(agentgateway_requests_total[5m])) by (gateway)
) > 0.10
for: 5m
labels:
severity: warning
annotations:
summary: "AgentGateway {{ $labels.gateway }} is rate-limiting >10% of requests"
description: "Consider increasing rate limits or scaling the gateway."
# No healthy backends available
- alert: AgentGatewayNoHealthyBackends
expr: |
sum(rate(agentgateway_requests_total{reason="NoHealthyBackend"}[5m])) by (gateway, route) > 0
for: 2m
labels:
severity: critical
annotations:
summary: "Route {{ $labels.route }} on {{ $labels.gateway }} has no healthy backends"
description: "All LLM providers in this route's backend are unhealthy. Requests are failing with 503."
# Slow LLM responses: p99 > 30s
- alert: AgentGatewaySlowLLMResponses
expr: |
histogram_quantile(0.99,
sum(rate(agentgateway_gen_ai_server_request_duration_bucket[5m])) by (le, gen_ai_system, gen_ai_request_model)
) > 30
for: 10m
labels:
severity: warning
annotations:
summary: "LLM p99 latency >30s for {{ $labels.gen_ai_system }}/{{ $labels.gen_ai_request_model }}"
description: "p99 LLM response time is {{ $value | humanizeDuration }}. Check provider health."
# High TTFT (time to first token) - streaming UX degradation
- alert: AgentGatewayHighTTFT
expr: |
histogram_quantile(0.95,
sum(rate(agentgateway_gen_ai_server_time_to_first_token_bucket[5m])) by (le, gen_ai_request_model)
) > 5
for: 5m
labels:
severity: warning
annotations:
summary: "TTFT p95 >5s for model {{ $labels.gen_ai_request_model }}"
description: "Users are waiting {{ $value | humanizeDuration }} for the first token. Check model provider latency."
# Upstream connection failures
- alert: AgentGatewayUpstreamConnectFailures
expr: |
sum(rate(agentgateway_requests_total{reason="UpstreamFailure"}[5m])) by (gateway, backend) > 0.5
for: 5m
labels:
severity: critical
annotations:
summary: "Upstream connection failures to {{ $labels.backend }}"
description: "The proxy cannot connect to the upstream backend. Check DNS, network policies, and provider status."
# Guardrail rejection spike
- alert: AgentGatewayGuardrailRejectionSpike
expr: |
sum(rate(agentgateway_guardrail_checks{action="Reject"}[5m])) by (phase)
/
sum(rate(agentgateway_guardrail_checks[5m])) by (phase) > 0.20
for: 5m
labels:
severity: warning
annotations:
summary: "Guardrail rejection rate >20% on {{ $labels.phase }} phase"
description: "{{ $value | humanizePercentage }} of {{ $labels.phase | toLower }} guardrail checks are being rejected."
# Tokio runtime saturation — tasks queuing up
- alert: AgentGatewayRuntimeSaturation
expr: |
agentgateway_tokio_global_queue_depth > 10
for: 5m
labels:
severity: warning
annotations:
summary: "AgentGateway proxy runtime is saturated (queue depth {{ $value }})"
description: "Tokio worker threads cannot keep up. This is a strong signal to scale up the proxy."
# Task accumulation — potential leak or connection backlog
- alert: AgentGatewayTaskAccumulation
expr: |
agentgateway_tokio_num_alive_tasks > 10000
for: 10m
labels:
severity: warning
annotations:
summary: "AgentGateway has {{ $value }} active tasks"
description: "Sustained high task count may indicate connection leaks or backlog. Investigate long-lived connections."
# xDS disconnection from control plane
- alert: AgentGatewayXDSDisconnected
expr: |
sum(rate(agentgateway_xds_connection_terminations{reason=~"ConnectionError|Error"}[5m])) by (pod) > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Proxy pod is disconnecting from the control plane"
description: "xDS connection errors detected. The proxy may not be receiving configuration updates."
# Version mismatch after upgrade
- alert: AgentGatewayVersionMismatch
expr: |
count(count by (tag) (agentgateway_build_info)) > 1
for: 10m
labels:
severity: warning
annotations:
summary: "Multiple AgentGateway versions running simultaneously"
description: "Not all proxy pods are on the same version. This may indicate a stalled rollout."
# ──────────────────────────────────────────────
# Control Plane Alerts
# ──────────────────────────────────────────────
- name: agentgateway-controlplane
rules:
# Reconciliation errors
- alert: AgentGatewayReconcileErrors
expr: |
sum(rate(kgateway_controller_reconciliations_total{result="error"}[5m])) by (controller) > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Controller {{ $labels.controller }} has reconciliation errors"
description: "CRD changes may not be applied to the data plane. Check controller logs."
# Slow reconciliation
- alert: AgentGatewaySlowReconcile
expr: |
histogram_quantile(0.99,
sum(rate(kgateway_controller_reconcile_duration_seconds_bucket[5m])) by (le, controller)
) > 5
for: 10m
labels:
severity: warning
annotations:
summary: "Controller {{ $labels.controller }} p99 reconcile time >5s"
description: "Reconciliation is slow ({{ $value | humanizeDuration }}). Check API server performance and cluster size."
# xDS auth failures — proxies can't connect to control plane
- alert: AgentGatewayXDSAuthFailures
expr: |
(
rate(kgateway_xds_auth_rq_total[5m]) - rate(kgateway_xds_auth_rq_success_total[5m])
) > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Proxy pods are failing xDS authentication"
description: "Data plane proxies cannot authenticate with the control plane. New config will not be pushed."
# Control plane memory growth
- alert: AgentGatewayControlPlaneMemory
expr: |
process_resident_memory_bytes{job=~".*enterprise-agentgateway.*"} > 512 * 1024 * 1024
for: 15m
labels:
severity: warning
annotations:
summary: "Control plane using >512MB memory"
description: "Current RSS: {{ $value | humanize1024 }}B. Investigate for memory leaks."
# Goroutine leak
- alert: AgentGatewayGoroutineLeak
expr: |
go_goroutines{job=~".*enterprise-agentgateway.*"} > 5000
for: 15m
labels:
severity: warning
annotations:
summary: "Control plane has {{ $value }} goroutines"
description: "Sustained goroutine growth may indicate a leak. Baseline is ~900."
# ──────────────────────────────────────────────
# Rate Limiter Alerts
# ──────────────────────────────────────────────
- name: agentgateway-ratelimiter
rules:
# Rate limiter rejecting a high percentage of requests
- alert: AgentGatewayRateLimiterOverLimit
expr: |
(
sum(rate(ratelimit_solo_io_over_limit[5m])) by (descriptor)
/
sum(rate(ratelimit_solo_io_total_hits[5m])) by (descriptor)
) > 0.25
for: 5m
labels:
severity: warning
annotations:
summary: "Rate limiter rejecting >25% of requests for {{ $labels.descriptor }}"
description: "Consider raising rate limits or investigating traffic patterns."
# Near-limit warning — approaching quota
- alert: AgentGatewayRateLimiterNearLimit
expr: |
sum(rate(ratelimit_solo_io_near_limit[5m])) by (descriptor) > 1
for: 10m
labels:
severity: info
annotations:
summary: "Traffic approaching rate limit for {{ $labels.descriptor }}"
description: "Requests are within 80% of the configured limit."
EOF

Verify the rules are loaded:
kubectl get prometheusrule agentgateway-alerts -n monitoring

Check that Prometheus has picked them up:
kubectl port-forward svc/prometheus-operated -n monitoring 9090:9090 &
sleep 2
curl -s http://localhost:9090/api/v1/rules | python3 -c "
import sys, json
data = json.load(sys.stdin)
groups = data['data']['groups']
agw = [g for g in groups if 'agentgateway' in g['name']]
for g in agw:
print(f\"\n{g['name']} ({len(g['rules'])} rules):\")
for r in g['rules']:
print(f\" - {r['name']} [{r.get('state','unknown')}]\")
"
kill %1 2>/dev/null

The proxy is a Rust binary built on the Tokio async runtime. Its bottlenecks are:
- CPU — TLS termination, request parsing, guardrail evaluation, JSON body inspection for token counting
- Concurrent connections — each in-flight request (especially streaming LLM responses and MCP/SSE connections) holds an async task
- Memory — primarily proportional to concurrent connections; the proxy streams request/response bodies rather than buffering them
For most AI workloads, CPU is the primary bottleneck because LLM requests are long-lived (seconds to minutes) with low request-per-second rates but high per-request CPU cost (TLS, body parsing, token counting).
| Signal | Metric | Why |
|---|---|---|
| CPU utilization | `container_cpu_usage_seconds_total` | Primary bottleneck for TLS + body parsing. Target 60-70% average. |
| Runtime saturation | `agentgateway_tokio_global_queue_depth` | Non-zero means worker threads are fully occupied. The most direct signal that the proxy needs more capacity. |
| Active tasks | `agentgateway_tokio_num_alive_tasks` | Proportional to concurrent in-flight requests/connections. If this grows faster than the request rate, connections are backing up. |
| Request rate | `agentgateway_requests_total` | Useful as a secondary signal, but less direct than CPU because request cost varies with payload size. |
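When CPU is the scaling metric, the HPA controller applies the standard Kubernetes desired-replicas formula. A quick sketch of that arithmetic, useful for predicting what the autoscaler below will do:

```python
import math

def desired_replicas(current_replicas: int, current_util: float,
                     target_util: float) -> int:
    """Core Kubernetes HPA formula:
    desired = ceil(current * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_util / target_util)

# 3 pods averaging 90% CPU against the 65% target used in this lab
print(desired_replicas(3, 90, 65))   # 5
# At exactly the target, replica count is stable
print(desired_replicas(4, 65, 65))   # 4
```

The `behavior` policies in the HPA manifest then rate-limit how fast the controller may move toward this desired count.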
The HPA below scales on CPU (primary) and can optionally use custom metrics for runtime saturation:
kubectl apply -f - <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: agentgateway-proxy
namespace: agentgateway-system
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: agentgateway-proxy
minReplicas: 2
maxReplicas: 10
behavior:
scaleUp:
stabilizationWindowSeconds: 60 # React quickly to load spikes
policies:
- type: Pods
value: 2
periodSeconds: 60 # Add up to 2 pods per minute
scaleDown:
stabilizationWindowSeconds: 300 # Wait 5 min before scaling down (protect long-lived connections)
policies:
- type: Pods
value: 1
periodSeconds: 120 # Remove at most 1 pod every 2 min
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 65
EOF

Important: For HPA to work, the proxy deployment must have CPU resource requests set. Update the EnterpriseAgentgatewayParameters:
kubectl patch enterpriseagentgatewayparameters agentgateway-config -n agentgateway-system --type=merge -p '{
"spec": {
"deployment": {
"spec": {
"template": {
"spec": {
"containers": [{
"name": "agentgateway",
"resources": {
"requests": {
"cpu": "500m",
"memory": "256Mi"
},
"limits": {
"cpu": "2",
"memory": "1Gi"
}
}
}]
}
}
}
}
}
}'

Resource sizing guidance:
| Workload | CPU Request | CPU Limit | Memory Request | Memory Limit | Notes |
|---|---|---|---|---|---|
| Low (< 50 rps) | 250m | 1 | 128Mi | 512Mi | Light traffic, few concurrent streams |
| Medium (50-500 rps) | 500m | 2 | 256Mi | 1Gi | Moderate concurrency, some streaming |
| High (> 500 rps) | 1 | 4 | 512Mi | 2Gi | High concurrency, many long-lived streams |
The proxy is lightweight at idle (~6Mi memory, <1m CPU). Memory grows linearly with concurrent connections. Each streaming LLM connection holds minimal state (the proxy streams, it does not buffer bodies).
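Because memory grows linearly with concurrent connections, a simple linear model is enough for rough sizing. The per-connection cost below is a placeholder assumption for illustration, not a measured AgentGateway figure; measure your own baseline under load:

```python
def estimated_memory_mib(base_mib: float, per_conn_kib: float,
                         connections: int) -> float:
    """Linear memory model: idle baseline plus per-connection state.
    per_conn_kib is an assumed figure -- measure it for your workload."""
    return base_mib + per_conn_kib * connections / 1024

# ~6 MiB idle baseline, hypothetical 64 KiB per streaming connection
print(estimated_memory_mib(6, 64, 1024))  # 70.0 MiB at 1024 concurrent streams
```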
AI/LLM traffic differs from traditional HTTP:
- Long-lived connections: A streaming chat completion can last 30-120 seconds. The proxy holds an async task for the entire duration.
- Low RPS, high connection time: 100 concurrent streaming users at 60s average = only ~1.7 rps but 100 concurrent tasks.
- Tokio worker threads: Default to `CPU_LIMIT` cores. Each worker thread can handle many concurrent async tasks, but CPU-bound work (TLS, JSON parsing) blocks the thread.
- Scale-down risk: Aggressive scale-down can terminate pods with active streaming connections. Use a long `stabilizationWindowSeconds` (300s+) for scale-down.
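The "low RPS, high connection time" point above is Little's law: in-flight connections L equal arrival rate λ times average duration W. A sketch of both directions of the formula:

```python
def concurrent_streams(requests_per_second: float, avg_duration_s: float) -> float:
    """Little's law: L = lambda * W (in-flight connections)."""
    return requests_per_second * avg_duration_s

def request_rate(concurrent: float, avg_duration_s: float) -> float:
    """Inverse form: lambda = L / W."""
    return concurrent / avg_duration_s

# 100 concurrent streaming users at a 60s average stream duration
print(round(request_rate(100, 60), 2))  # 1.67 rps, yet 100 tasks held open
print(concurrent_streams(2, 45))        # 90.0 tasks for 2 rps at 45s streams
```

This is why task-count metrics, not request rate, track the real load of streaming workloads.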
There is no hard limit — throughput depends on payload size, TLS overhead, guardrails enabled, and whether responses are streaming. General guidance:
| Scenario | Approximate capacity per pod (1 CPU) |
|---|---|
| Non-streaming, small payloads, no guardrails | ~500-1000 rps |
| Non-streaming with guardrails | ~200-400 rps |
| Streaming LLM (concurrent connections) | ~500-1000 concurrent streams |
| MCP tool calls | ~300-600 rps |
Recommendation: Load test your specific workload using the k6 lab (025-load-testing-with-k6s.md) and observe the agentgateway_tokio_global_queue_depth metric. When queue depth starts consistently rising above 0, you've found the saturation point for that pod.
LLM streaming responses, MCP/SSE connections, and agent workloads can run for minutes. The proxy has built-in graceful shutdown that must be configured to match.
When a pod receives SIGTERM:
- Stop accepting new connections — the listener stops immediately
- `CONNECTION_MIN_TERMINATION_DEADLINE` (default: `10s`) — for this period, existing connections continue but new ones receive `connection: close` (HTTP/1) or `GOAWAY` (HTTP/2) to tell clients to reconnect elsewhere
- Drain in-flight requests — the proxy waits for all active request handlers to complete
- `TERMINATION_GRACE_PERIOD_SECONDS` (default: `60s`) — hard deadline. Any connections still active after this are forcefully terminated
- Kubernetes SIGKILL — sent at `terminationGracePeriodSeconds` (also 60s by default)
For workloads with long-lived streaming connections, increase the termination grace period:
kubectl patch enterpriseagentgatewayparameters agentgateway-config -n agentgateway-system --type=merge -p '{
"spec": {
"deployment": {
"spec": {
"template": {
"spec": {
"terminationGracePeriodSeconds": 120,
"containers": [{
"name": "agentgateway",
"env": [
{
"name": "TERMINATION_GRACE_PERIOD_SECONDS",
"value": "110"
},
{
"name": "CONNECTION_MIN_TERMINATION_DEADLINE",
"value": "15s"
}
]
}]
}
}
}
}
}
}'

Key: `TERMINATION_GRACE_PERIOD_SECONDS` must be less than `terminationGracePeriodSeconds` (the Kubernetes-level setting); otherwise Kubernetes sends SIGKILL before the proxy finishes draining.
| Setting | Default | Recommended for AI | Description |
|---|---|---|---|
| `terminationGracePeriodSeconds` | 60s | 120s | Kubernetes-level: time before SIGKILL |
| `TERMINATION_GRACE_PERIOD_SECONDS` | 60s | 110s | Proxy-level: hard deadline for drain (must be < the Kubernetes setting) |
| `CONNECTION_MIN_TERMINATION_DEADLINE` | 10s | 15s | Minimum time to keep accepting connections to allow client migration |
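The ordering invariant between these three settings (min termination deadline < proxy drain deadline < Kubernetes grace period) can be checked mechanically. A small illustrative validator, useful in a CI check before applying the patch above:

```python
def validate_drain_config(k8s_grace_s: int, proxy_deadline_s: int,
                          min_termination_s: int) -> list[str]:
    """Check the ordering invariants between Kubernetes and proxy
    shutdown settings; returns a list of problems (empty means OK)."""
    problems = []
    if proxy_deadline_s >= k8s_grace_s:
        problems.append(
            "TERMINATION_GRACE_PERIOD_SECONDS must be < terminationGracePeriodSeconds")
    if min_termination_s >= proxy_deadline_s:
        problems.append(
            "CONNECTION_MIN_TERMINATION_DEADLINE must be < proxy drain deadline")
    return problems

print(validate_drain_config(120, 110, 15))  # [] -- the recommended AI settings
print(validate_drain_config(60, 60, 10))    # one problem: SIGKILL would cut drains short
```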
PDBs ensure minimum availability during voluntary disruptions (node drains, upgrades, cluster autoscaler).
kubectl apply -f - <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: agentgateway-proxy
namespace: agentgateway-system
spec:
minAvailable: 1 # At least 1 proxy pod always running
selector:
matchLabels:
app.kubernetes.io/name: agentgateway-proxy
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: enterprise-agentgateway
namespace: agentgateway-system
spec:
minAvailable: 1
selector:
matchLabels:
app.kubernetes.io/name: enterprise-agentgateway
EOF

Guidance for choosing `minAvailable` vs `maxUnavailable`:
| Replicas | Recommended PDB | Effect |
|---|---|---|
| 2 | `minAvailable: 1` | 1 pod can be disrupted at a time |
| 3-5 | `maxUnavailable: 1` | Same effect, but works better with rolling updates |
| 5+ | `maxUnavailable: 2` | Allows faster rolling updates while maintaining capacity |
Verify:
kubectl get pdb -n agentgateway-system

Spread proxy pods across nodes and zones to survive node failures and zonal outages.
kubectl patch enterpriseagentgatewayparameters agentgateway-config -n agentgateway-system --type=merge -p '{
"spec": {
"deployment": {
"spec": {
"template": {
"spec": {
"topologySpreadConstraints": [
{
"maxSkew": 1,
"topologyKey": "topology.kubernetes.io/zone",
"whenUnsatisfiable": "ScheduleAnyway",
"labelSelector": {
"matchLabels": {
"app.kubernetes.io/name": "agentgateway-proxy"
}
}
},
{
"maxSkew": 1,
"topologyKey": "kubernetes.io/hostname",
"whenUnsatisfiable": "ScheduleAnyway",
"labelSelector": {
"matchLabels": {
"app.kubernetes.io/name": "agentgateway-proxy"
}
}
}
],
"affinity": {
"podAntiAffinity": {
"preferredDuringSchedulingIgnoredDuringExecution": [
{
"weight": 100,
"podAffinityTerm": {
"labelSelector": {
"matchExpressions": [
{
"key": "app.kubernetes.io/name",
"operator": "In",
"values": ["agentgateway-proxy"]
}
]
},
"topologyKey": "kubernetes.io/hostname"
}
}
]
}
}
}
}
}
}
}
}'

Why `ScheduleAnyway` instead of `DoNotSchedule`:

- `DoNotSchedule` can prevent scaling if no valid node/zone is available
- `ScheduleAnyway` is a best-effort spread — the scheduler tries to spread but won't block scheduling
- Use `DoNotSchedule` only if you have nodes in 3+ zones and can guarantee capacity in each
Verify pods are spread:
kubectl get pods -n agentgateway-system -l app.kubernetes.io/name=agentgateway-proxy \
-o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,ZONE:.metadata.labels.topology\\.kubernetes\\.io/zone

This procedure combines everything above into a safe, zero-downtime upgrade of Enterprise AgentGateway.
# 1. Verify PDBs are in place
kubectl get pdb -n agentgateway-system
# 2. Verify current replica count (recommend >= 2 for zero-downtime)
kubectl get deploy agentgateway-proxy -n agentgateway-system -o jsonpath='{.spec.replicas}'
# 3. Check current version
kubectl get pods -n agentgateway-system -l app.kubernetes.io/name=agentgateway-proxy \
-o jsonpath='{.items[0].spec.containers[0].image}'
# 4. Verify all pods are healthy
kubectl get pods -n agentgateway-system
# 5. Check no reconciliation errors
kubectl port-forward -n agentgateway-system deploy/enterprise-agentgateway 9092:9092 &
sleep 2
curl -s http://localhost:9092/metrics | grep 'result="error"'
kill %1 2>/dev/null

# 1. Upgrade Helm release (or update image tag)
# The rolling update will respect PDBs and graceful shutdown
helm upgrade enterprise-agentgateway solo/enterprise-agentgateway \
--namespace agentgateway-system \
--version <new-version> \
--reuse-values
# 2. Watch the rollout — pods are replaced one at a time (PDB enforced)
kubectl rollout status deployment/enterprise-agentgateway -n agentgateway-system --timeout=300s
kubectl rollout status deployment/agentgateway-proxy -n agentgateway-system --timeout=300s
# 3. Verify all pods are on the new version
kubectl port-forward -n agentgateway-system deploy/agentgateway-proxy 15020:15020 &
sleep 2
curl -s http://localhost:15020/metrics | grep agentgateway_build_info
kill %1 2>/dev/null

Watch for these during the rolling update:
# In a separate terminal — watch for errors during rollout
kubectl logs -f deploy/agentgateway-proxy -n agentgateway-system --since=5m | \
jq -r 'select(.level == "ERROR" or .level == "WARN") | "\(.timestamp) \(.level) \(.message)"'

Key metrics to watch in Grafana during the upgrade:
- `agentgateway_build_info` — should show old and new versions during the rollout, then only the new version
- `agentgateway_requests_total{status=~"5.."}` — error rate should not spike
- `agentgateway_xds_connection_terminations` — expect `Reconnect` reasons as proxies restart, but no `ConnectionError`
- `kgateway_controller_reconciliations_total{result="error"}` — should remain at 0
Remove the alerting rules, HPA, PDBs, and monitors if no longer needed:
kubectl delete prometheusrule agentgateway-alerts -n monitoring
kubectl delete hpa agentgateway-proxy -n agentgateway-system
kubectl delete pdb agentgateway-proxy enterprise-agentgateway -n agentgateway-system
kubectl delete podmonitor control-plane-monitoring-agentgateway-metrics -n agentgateway-system
kubectl delete servicemonitor rate-limiter-monitoring-agentgateway-metrics -n agentgateway-system