@rvennam
Last active March 13, 2026 03:51
Enterprise AgentGateway: Production Observability, Alerting, and Scaling Guide

Production Observability, Alerting, and Scaling Guide

This lab provides a complete production-readiness guide for Enterprise AgentGateway: a metrics reference for every component, recommended Prometheus alerting rules, horizontal pod autoscaling configuration, graceful shutdown for long-lived AI connections, pod spreading, and disruption budgets.

Prerequisites

This lab assumes that you have completed the setup in 001 and 002.

Lab Objectives

  • Understand every metric exposed by the data plane, control plane, and rate limiter
  • Deploy production-grade Prometheus alerting rules
  • Configure HPA to auto-scale the gateway based on the right signals
  • Configure PodDisruptionBudgets, topology spread, and anti-affinity
  • Ensure graceful shutdown of long-lived LLM streaming and MCP connections during upgrades
  • Perform a zero-downtime rolling upgrade

Part 1: Metrics Reference

Enterprise AgentGateway exposes metrics from three components. All are scraped automatically by Prometheus via pod annotations.

Data Plane — agentgateway-proxy (port 15020)

The data plane is a Rust-based proxy built on Tokio. It handles all request routing, LLM traffic, MCP calls, and guardrail enforcement.

Build Info

| Metric | Type | Labels | Description |
|---|---|---|---|
| `agentgateway_build_info` | Info | `tag` | AgentGateway version (e.g. `v2.2.0`). Use to confirm all pods are on the same version after an upgrade. |

HTTP Request Metrics

These are the primary metrics for monitoring gateway health and performance. Every request flowing through the gateway is counted and timed here.

| Metric | Type | Labels | Description |
|---|---|---|---|
| `agentgateway_requests_total` | Counter | `backend`, `protocol`, `method`, `status`, `reason`, `bind`, `gateway`, `listener`, `route`, `route_rule` | Total HTTP requests processed. The `status` label is the HTTP status code (200, 429, 500, etc.). The `reason` label classifies why the response was generated (see reason table below). |
| `agentgateway_request_duration_seconds` | Histogram | (same) | End-to-end request latency from the proxy's perspective, including upstream LLM processing time. Buckets: 1ms to 80s. For LLM traffic, p99 will be dominated by model inference time. |
| `agentgateway_response_bytes_total` | Counter | (same) | Total response bytes received from upstream backends. Useful for tracking bandwidth and detecting unusually large responses. |
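A counter like `agentgateway_requests_total` is only meaningful as a rate. A minimal sketch of the `rate()` and error-ratio arithmetic Prometheus performs over two scrapes (all sample values below are hypothetical):

```python
# Sketch of the rate() and error-ratio arithmetic behind the
# AgentGatewayHighErrorRate alert. Sample counter values are hypothetical.

def per_second_rate(prev: float, curr: float, window_s: float) -> float:
    """Counter increase divided by the window, as rate() computes it."""
    return (curr - prev) / window_s

# Two scrapes of the counters, 5 minutes (300s) apart:
total_prev, total_curr = 120_000, 126_000
err5xx_prev, err5xx_curr = 1_000, 1_450

window = 300
total_rps = per_second_rate(total_prev, total_curr, window)    # 20.0 req/s
error_rps = per_second_rate(err5xx_prev, err5xx_curr, window)  # 1.5 req/s

error_ratio = error_rps / total_rps
print(f"{total_rps:.1f} rps, 5xx ratio {error_ratio:.1%}")  # 7.5% -> above the 5% threshold
```

This is the same division the `AgentGatewayHighErrorRate` rule in Part 2 expresses in PromQL.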

HTTP Label Reference:

| Label | Values | Description |
|---|---|---|
| `backend` | Backend name or `unknown` | The upstream AgentgatewayBackend that handled the request. `unknown` means no route matched. |
| `protocol` | `http`, `https`, `tls`, `tcp`, `hbone` | Transport protocol to the upstream. |
| `method` | `GET`, `POST`, `CONNECT`, etc. | HTTP method. LLM chat completions are always `POST`. |
| `status` | HTTP status code (200, 404, 429, 500, etc.) | Response status code. |
| `reason` | See table below | Why the proxy generated or forwarded this response. |
| `bind` | e.g. `8080/agentgateway-system/agentgateway-proxy` | The listener bind address. |
| `gateway` | e.g. `agentgateway-system/agentgateway-proxy` | The Gateway resource name. |
| `listener` | e.g. `http` | Listener name within the Gateway. |
| `route` | HTTPRoute name or `unknown` | Which HTTPRoute matched. |
| `route_rule` | Rule index or `unknown` | Which rule within the HTTPRoute matched. |

Response reason Values:

The reason label tells you why a response was generated — critical for distinguishing between "upstream returned an error" vs. "the gateway itself rejected the request":

| Reason | Meaning | Typical Status Codes |
|---|---|---|
| `Upstream` | Response came from the upstream LLM/MCP backend | Any (200, 429, 500, etc.) |
| `DirectResponse` | Proxy generated the response directly (no upstream call) | Varies |
| `NotFound` | No matching listener, route, or backend found | 404 |
| `NoHealthyBackend` | All providers in the backend are unhealthy, DNS failed, or backend doesn't exist | 503 |
| `RateLimit` | Request rejected by local or global rate limiter | 429 |
| `Timeout` | Request or upstream call timed out | 504 |
| `JwtAuth` | JWT authentication failed | 401 |
| `BasicAuth` | Basic authentication failed | 401 |
| `APIKeyAuth` | API key authentication failed | 401 |
| `ExtAuth` | External authorization service rejected the request | 403 |
| `Authorization` | Authorization or CSRF validation failed | 403 |
| `UpstreamFailure` | Upstream connection failed, TCP proxy error, or backend auth error | 502, 503 |
| `Internal` | Internal proxy error (invalid request, filter error, processing error) | 500 |
| `MCP` | MCP protocol-level error | Varies |
| `ExtProc` | External processing failure | 500 |

GenAI (LLM) Metrics

These follow the OpenTelemetry GenAI semantic conventions. They are only populated for requests routed to LLM backends (AgentgatewayBackend with ai spec).

| Metric | Type | Labels | Description |
|---|---|---|---|
| `agentgateway_gen_ai_client_token_usage` | Histogram | `gen_ai_token_type`, `gen_ai_operation_name`, `gen_ai_system`, `gen_ai_request_model`, `gen_ai_response_model`, `route` | Tokens consumed per request. Two observations per request: one with `gen_ai_token_type="input"` (prompt tokens) and one with `gen_ai_token_type="output"` (completion tokens). Buckets are exponential: 1, 4, 16, 64, 256, 1024 ... up to 67M. |
| `agentgateway_gen_ai_server_request_duration` | Histogram | `gen_ai_operation_name`, `gen_ai_system`, `gen_ai_request_model`, `gen_ai_response_model`, `route` | Total time the upstream LLM took to process the request (seconds). For streaming, this is time from first byte sent to last byte received. Buckets: 10ms to 82s. |
| `agentgateway_gen_ai_server_time_to_first_token` | Histogram | (same) | Time from request start to the first token generated (TTFT). Critical SLI for streaming user experience. Buckets: 1ms to 10s. |
| `agentgateway_gen_ai_server_time_per_output_token` | Histogram | (same) | Average inter-token latency (TPOT). Measures generation throughput. Buckets: 1ms to 2.5s. |
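Prometheus histograms also export `_sum` and `_count` series, which give exact averages without bucket interpolation. A small sketch of deriving average tokens per request from hypothetical `agentgateway_gen_ai_client_token_usage` samples:

```python
# Average tokens per request from a histogram's _sum and _count series.
# All scraped values below are hypothetical.

def mean_from_histogram(sum_value: float, count_value: float) -> float:
    """_sum / _count is the exact mean of all observations."""
    return sum_value / count_value if count_value else 0.0

# Hypothetical values, split by the gen_ai_token_type label:
input_sum, input_count = 1_250_000, 5_000     # gen_ai_token_type="input"
output_sum, output_count = 400_000, 5_000     # gen_ai_token_type="output"

avg_in = mean_from_histogram(input_sum, input_count)     # 250 tokens/request
avg_out = mean_from_histogram(output_sum, output_count)  # 80 tokens/request
print(f"avg prompt: {avg_in:.0f}, avg completion: {avg_out:.0f}")
```

The same ratio in PromQL is `rate(..._sum[5m]) / rate(..._count[5m])`, which is handy for per-model token dashboards.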

GenAI Label Reference:

| Label | Values | Description |
|---|---|---|
| `gen_ai_token_type` | `input`, `output` | Whether this observation counts prompt tokens or completion tokens. Only on `token_usage`. |
| `gen_ai_operation_name` | `chat`, `embeddings` | The type of LLM operation. |
| `gen_ai_system` | `openai`, `anthropic`, `bedrock`, `vertexai`, `azureopenai`, etc. | The LLM provider type configured in the backend. |
| `gen_ai_request_model` | e.g. `gpt-4o`, `claude-sonnet-4-20250514` | The model name sent in the request. |
| `gen_ai_response_model` | e.g. `gpt-4o-2024-08-06` | The model name returned in the response (may differ from request). |

MCP (Model Context Protocol) Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| `agentgateway_mcp_requests` | Counter | `method`, `resource_type`, `server`, `resource`, `route` | Total MCP tool/resource/prompt calls. Not incremented for raw HTTP transport requests (only JSON-RPC method calls). |

MCP Label Reference:

| Label | Values | Description |
|---|---|---|
| `method` | `tools/call`, `tools/list`, `prompts/get`, `resources/read`, etc. | The MCP JSON-RPC method name. |
| `resource_type` | `Tool`, `Prompt`, `Resource`, `ResourceTemplates` | Category of MCP operation. |
| `server` | Target MCP server name | Which MCP server was called. |
| `resource` | Tool/resource name | The specific tool or resource accessed. |

MCP requests also flow through the general agentgateway_request_duration_seconds histogram for latency tracking.

Guardrail Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| `agentgateway_guardrail_checks` | Counter | `phase`, `action` | Total guardrail evaluations across all guardrail types (regex, webhook, OpenAI Moderation, Bedrock Guardrails, Google Model Armor). |

Guardrail Label Reference:

| Label | Values | Description |
|---|---|---|
| `phase` | `Request`, `Response` | Whether the guardrail fired on the inbound request or the outbound LLM response. |
| `action` | `Allow`, `Mask`, `Reject` | The outcome. `Reject` = request/response blocked, `Mask` = content redacted, `Allow` = passed. |

Connection & Transport Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| `agentgateway_downstream_connections_total` | Counter | `bind`, `gateway`, `listener`, `protocol` | Total client-to-proxy connections established. Includes short-lived and long-lived (streaming, MCP/SSE) connections. |
| `agentgateway_downstream_received_bytes_total` | Counter | (same) | Total bytes received from clients. |
| `agentgateway_downstream_sent_bytes_total` | Counter | (same) | Total bytes sent to clients. |
| `agentgateway_upstream_connect_duration_seconds` | Histogram | `transport` | Time to establish upstream connections. `transport` is `plaintext` or `tls`. High values indicate network issues or DNS problems to LLM providers. Buckets: 0.5ms to 8s. |
| `agentgateway_tls_handshake_duration_seconds` | Histogram | `bind`, `gateway`, `listener`, `protocol` | Inbound TLS handshake duration. Only populated if TLS termination is configured on the gateway. Buckets: 0.5ms to 8s. |

xDS (Control Plane Communication) Metrics

These track the connection between the data plane proxy and the control plane. Problems here mean the proxy isn't receiving configuration updates.

| Metric | Type | Labels | Description |
|---|---|---|---|
| `agentgateway_xds_connection_terminations` | Counter | `reason` | xDS stream disconnections. `reason` is `ConnectionError` (network failure), `Error` (gRPC error), `Reconnect` (planned), or `Complete` (clean close). Frequent `ConnectionError` or `Error` values indicate control plane instability. |
| `agentgateway_xds_message_total` | Counter | `url` | Number of xDS config messages received. The `url` label is the resource type URL (e.g. `type.googleapis.com/agentgateway.dev.resource.Resource`). A sudden stop means the proxy is no longer receiving config updates. |
| `agentgateway_xds_message_bytes_total` | Counter | `url` | Bytes received from xDS. Large spikes may indicate excessive configuration churn. |

Tokio Runtime Metrics

The proxy runs on a Tokio async runtime. These metrics indicate proxy-level health independently of request metrics.

| Metric | Type | Description |
|---|---|---|
| `agentgateway_tokio_num_workers` | Gauge | Number of Tokio worker threads. Defaults to the number of CPU cores (or the value of `CPU_LIMIT`). Should be stable. |
| `agentgateway_tokio_num_alive_tasks` | Gauge | Number of currently active async tasks. Each in-flight request and connection is a task. A sustained upward trend may indicate task leaks or connection backlog. |
| `agentgateway_tokio_global_queue_depth` | Gauge | Tasks waiting to be picked up by a worker thread. Sustained values > 0 mean worker threads are saturated — a strong scale-up signal. |
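What matters for the queue-depth gauge is *sustained* non-zero values, not momentary blips. A sketch of that distinction, mirroring the `for: 5m` hold that the alerting rules in Part 2 apply (sample data is hypothetical):

```python
# Treat sustained (not momentary) global queue depth as the scale-up signal,
# mirroring a Prometheus alert's "for:" hold. Sample data is hypothetical.

def saturated(queue_depth_samples: list[int], threshold: int = 0) -> bool:
    """True only if every sample in the window exceeds the threshold."""
    return bool(queue_depth_samples) and all(d > threshold for d in queue_depth_samples)

# One sample per 15s scrape over a 5-minute window (20 samples):
healthy = [0] * 18 + [3, 0]       # a brief blip -> queue drains, not saturated
overloaded = [2, 4, 3, 5] * 5     # queue never drains -> saturated

print(saturated(healthy))     # False
print(saturated(overloaded))  # True
```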

Control Plane — enterprise-agentgateway (port 9092)

The control plane is a Go-based Kubernetes controller. It watches Gateway API and AgentGateway CRDs and pushes configuration to data plane proxies via xDS.

Controller Reconciliation Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| `kgateway_controller_reconciliations_total` | Counter | `controller`, `name`, `namespace`, `result` | Total reconciliation loops. The `controller` label identifies which controller ran (e.g. `gateway`, `gatewayclass`). The `result` label is `success` or `error`. A rising error count means CRD changes are not being applied. |
| `kgateway_controller_reconciliations_running` | Gauge | `controller`, `name`, `namespace` | Currently in-flight reconciliations. Sustained high values indicate controller backlog. |
| `kgateway_controller_reconcile_duration_seconds` | Histogram | `controller`, `name`, `namespace` | Time per reconciliation loop. Increasing durations may indicate growing cluster complexity or API server slowness. |
| `enterprise_kgateway_controller_reconciliations_total` | Counter | `controller`, `name`, `namespace`, `result` | Same as above but for enterprise-specific controllers: `agw-ext-auth`, `agw-ext-cache`, `agw-rate-limiter`. |
| `enterprise_kgateway_controller_reconciliations_running` | Gauge | (same) | In-flight enterprise reconciliations. |
| `enterprise_kgateway_controller_reconcile_duration_seconds` | Histogram | (same) | Enterprise reconciliation duration. |

xDS Authentication

| Metric | Type | Description |
|---|---|---|
| `kgateway_xds_auth_rq_total` | Counter | Total xDS authentication requests from data plane proxies. Each proxy connection must authenticate. |
| `kgateway_xds_auth_rq_success_total` | Counter | Successful xDS auth requests. If total - success > 0, proxy pods are failing to authenticate with the control plane. |

Go Runtime & Process Metrics

| Metric | Type | Description |
|---|---|---|
| `go_goroutines` | Gauge | Number of active goroutines. Sustained growth indicates leaks. Baseline is ~900 for a healthy controller. |
| `go_memstats_alloc_bytes` | Gauge | Current heap allocation. Monitor for memory leaks. |
| `process_resident_memory_bytes` | Gauge | RSS of the control plane process. Use for capacity planning. |
| `process_cpu_seconds_total` | Counter | CPU time consumed. Use `rate()` for CPU utilization. |
| `process_open_fds` | Gauge | Open file descriptors. Approaching `process_max_fds` causes failures. |

Rate Limiter — rate-limiter-enterprise-agentgateway (port 9091)

| Metric | Type | Labels | Description |
|---|---|---|---|
| `ratelimit_solo_io_total_hits` | Counter | `descriptor` | Total rate limit evaluation requests. The `descriptor` label encodes the rate limit policy (e.g. `solo.io|generic_key^namespace.policyname`). |
| `ratelimit_solo_io_over_limit` | Counter | `descriptor` | Requests that exceeded the configured limit and were rejected (429). |
| `ratelimit_solo_io_near_limit` | Counter | `descriptor` | Requests that were within 80% of the limit — an early warning signal. |

Part 1b: Enable Metrics Collection for Control Plane and Rate Limiter

Lab 002 already configures Prometheus scraping for the data plane proxy. For production, you also need to scrape the control plane and rate limiter:

# Control plane metrics (port 9092)
kubectl apply -f- <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: control-plane-monitoring-agentgateway-metrics
  namespace: agentgateway-system
spec:
  namespaceSelector:
    matchNames:
      - agentgateway-system
  podMetricsEndpoints:
    - port: metrics
  selector:
    matchLabels:
      app.kubernetes.io/name: enterprise-agentgateway
EOF
# Rate limiter metrics (port 9091, exposed as "debug" on the Service)
kubectl apply -f- <<EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: rate-limiter-monitoring-agentgateway-metrics
  namespace: agentgateway-system
spec:
  namespaceSelector:
    matchNames:
      - agentgateway-system
  selector:
    matchLabels:
      app: rate-limiter
  endpoints:
    - port: debug
      interval: 15s
EOF

Verify all targets are being scraped (may take 30-60 seconds):

kubectl port-forward svc/prometheus-operated -n monitoring 9090:9090 &
sleep 2
curl -s http://localhost:9090/api/v1/targets | \
  python3 -c "
import sys, json
d = json.load(sys.stdin)
for t in d['data']['activeTargets']:
    pod = t.get('labels',{}).get('pod','')
    svc = t.get('labels',{}).get('service','')
    if 'agentgateway' in pod or 'rate-limiter' in svc:
        print(f'{(pod or svc):60s} | {t[\"health\"]}')
"
kill %1 2>/dev/null

Expected output — all components up:

agentgateway-proxy-xxxxx-yyyyy                               | up
agentgateway-proxy-xxxxx-zzzzz                               | up
enterprise-agentgateway-xxxxx-yyyyy                          | up
rate-limiter-enterprise-agentgateway-xxxxx-yyyyy             | up

Part 2: Recommended Prometheus Alerting Rules

Deploy these alerting rules to catch issues before they affect users.

kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: agentgateway-alerts
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    # ──────────────────────────────────────────────
    # Data Plane Alerts
    # ──────────────────────────────────────────────
    - name: agentgateway-dataplane
      rules:

        # High error rate: >5% of requests returning 5xx
        - alert: AgentGatewayHighErrorRate
          expr: |
            (
              sum(rate(agentgateway_requests_total{status=~"5.."}[5m])) by (gateway)
              /
              sum(rate(agentgateway_requests_total[5m])) by (gateway)
            ) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "AgentGateway {{ $labels.gateway }} has >5% error rate"
            description: "{{ $value | humanizePercentage }} of requests are returning 5xx errors."

        # High rate limit rejection rate
        - alert: AgentGatewayHighRateLimitRate
          expr: |
            (
              sum(rate(agentgateway_requests_total{reason="RateLimit"}[5m])) by (gateway)
              /
              sum(rate(agentgateway_requests_total[5m])) by (gateway)
            ) > 0.10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "AgentGateway {{ $labels.gateway }} is rate-limiting >10% of requests"
            description: "Consider increasing rate limits or scaling the gateway."

        # No healthy backends available
        - alert: AgentGatewayNoHealthyBackends
          expr: |
            sum(rate(agentgateway_requests_total{reason="NoHealthyBackend"}[5m])) by (gateway, route) > 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Route {{ $labels.route }} on {{ $labels.gateway }} has no healthy backends"
            description: "All LLM providers in this route's backend are unhealthy. Requests are failing with 503."

        # Slow LLM responses: p99 > 30s
        - alert: AgentGatewaySlowLLMResponses
          expr: |
            histogram_quantile(0.99,
              sum(rate(agentgateway_gen_ai_server_request_duration_bucket[5m])) by (le, gen_ai_system, gen_ai_request_model)
            ) > 30
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "LLM p99 latency >30s for {{ $labels.gen_ai_system }}/{{ $labels.gen_ai_request_model }}"
            description: "p99 LLM response time is {{ $value | humanizeDuration }}. Check provider health."

        # High TTFT (time to first token) - streaming UX degradation
        - alert: AgentGatewayHighTTFT
          expr: |
            histogram_quantile(0.95,
              sum(rate(agentgateway_gen_ai_server_time_to_first_token_bucket[5m])) by (le, gen_ai_request_model)
            ) > 5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "TTFT p95 >5s for model {{ $labels.gen_ai_request_model }}"
            description: "Users are waiting {{ $value | humanizeDuration }} for the first token. Check model provider latency."

        # Upstream connection failures
        - alert: AgentGatewayUpstreamConnectFailures
          expr: |
            sum(rate(agentgateway_requests_total{reason="UpstreamFailure"}[5m])) by (gateway, backend) > 0.5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Upstream connection failures to {{ $labels.backend }}"
            description: "The proxy cannot connect to the upstream backend. Check DNS, network policies, and provider status."

        # Guardrail rejection spike
        - alert: AgentGatewayGuardrailRejectionSpike
          expr: |
            sum(rate(agentgateway_guardrail_checks{action="Reject"}[5m])) by (phase)
            /
            sum(rate(agentgateway_guardrail_checks[5m])) by (phase) > 0.20
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Guardrail rejection rate >20% on {{ $labels.phase }} phase"
            description: "{{ $value | humanizePercentage }} of {{ $labels.phase | toLower }} guardrail checks are being rejected."

        # Tokio runtime saturation — tasks queuing up
        - alert: AgentGatewayRuntimeSaturation
          expr: |
            agentgateway_tokio_global_queue_depth > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "AgentGateway proxy runtime is saturated (queue depth {{ $value }})"
            description: "Tokio worker threads cannot keep up. This is a strong signal to scale up the proxy."

        # Task accumulation — potential leak or connection backlog
        - alert: AgentGatewayTaskAccumulation
          expr: |
            agentgateway_tokio_num_alive_tasks > 10000
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "AgentGateway has {{ $value }} active tasks"
            description: "Sustained high task count may indicate connection leaks or backlog. Investigate long-lived connections."

        # xDS disconnection from control plane
        - alert: AgentGatewayXDSDisconnected
          expr: |
            sum(rate(agentgateway_xds_connection_terminations{reason=~"ConnectionError|Error"}[5m])) by (pod) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Proxy pod is disconnecting from the control plane"
            description: "xDS connection errors detected. The proxy may not be receiving configuration updates."

        # Version mismatch after upgrade
        - alert: AgentGatewayVersionMismatch
          expr: |
            count(count by (tag) (agentgateway_build_info)) > 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Multiple AgentGateway versions running simultaneously"
            description: "Not all proxy pods are on the same version. This may indicate a stalled rollout."

    # ──────────────────────────────────────────────
    # Control Plane Alerts
    # ──────────────────────────────────────────────
    - name: agentgateway-controlplane
      rules:

        # Reconciliation errors
        - alert: AgentGatewayReconcileErrors
          expr: |
            sum(rate(kgateway_controller_reconciliations_total{result="error"}[5m])) by (controller) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Controller {{ $labels.controller }} has reconciliation errors"
            description: "CRD changes may not be applied to the data plane. Check controller logs."

        # Slow reconciliation
        - alert: AgentGatewaySlowReconcile
          expr: |
            histogram_quantile(0.99,
              sum(rate(kgateway_controller_reconcile_duration_seconds_bucket[5m])) by (le, controller)
            ) > 5
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Controller {{ $labels.controller }} p99 reconcile time >5s"
            description: "Reconciliation is slow ({{ $value | humanizeDuration }}). Check API server performance and cluster size."

        # xDS auth failures — proxies can't connect to control plane
        - alert: AgentGatewayXDSAuthFailures
          expr: |
            (
              rate(kgateway_xds_auth_rq_total[5m]) - rate(kgateway_xds_auth_rq_success_total[5m])
            ) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Proxy pods are failing xDS authentication"
            description: "Data plane proxies cannot authenticate with the control plane. New config will not be pushed."

        # Control plane memory growth
        - alert: AgentGatewayControlPlaneMemory
          expr: |
            process_resident_memory_bytes{job=~".*enterprise-agentgateway.*"} > 512 * 1024 * 1024
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Control plane using >512MB memory"
            description: "Current RSS: {{ $value | humanize1024 }}B. Investigate for memory leaks."

        # Goroutine leak
        - alert: AgentGatewayGoroutineLeak
          expr: |
            go_goroutines{job=~".*enterprise-agentgateway.*"} > 5000
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Control plane has {{ $value }} goroutines"
            description: "Sustained goroutine growth may indicate a leak. Baseline is ~900."

    # ──────────────────────────────────────────────
    # Rate Limiter Alerts
    # ──────────────────────────────────────────────
    - name: agentgateway-ratelimiter
      rules:

        # Rate limiter rejecting a high percentage of requests
        - alert: AgentGatewayRateLimiterOverLimit
          expr: |
            (
              sum(rate(ratelimit_solo_io_over_limit[5m])) by (descriptor)
              /
              sum(rate(ratelimit_solo_io_total_hits[5m])) by (descriptor)
            ) > 0.25
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Rate limiter rejecting >25% of requests for {{ $labels.descriptor }}"
            description: "Consider raising rate limits or investigating traffic patterns."

        # Near-limit warning — approaching quota
        - alert: AgentGatewayRateLimiterNearLimit
          expr: |
            sum(rate(ratelimit_solo_io_near_limit[5m])) by (descriptor) > 1
          for: 10m
          labels:
            severity: info
          annotations:
            summary: "Traffic approaching rate limit for {{ $labels.descriptor }}"
            description: "Requests are within 80% of the configured limit."
EOF

Verify the rules are loaded:

kubectl get prometheusrule agentgateway-alerts -n monitoring

Check Prometheus has picked them up:

kubectl port-forward svc/prometheus-operated -n monitoring 9090:9090 &
sleep 2
curl -s http://localhost:9090/api/v1/rules | python3 -c "
import sys, json
data = json.load(sys.stdin)
groups = data['data']['groups']
agw = [g for g in groups if 'agentgateway' in g['name']]
for g in agw:
    print(f\"\n{g['name']} ({len(g['rules'])} rules):\")
    for r in g['rules']:
        print(f\"  - {r['name']} [{r.get('state','unknown')}]\")
"
kill %1 2>/dev/null

Part 3: Scaling the Gateway

What to Scale Against

The proxy is a Rust-based async runtime (Tokio). Its bottlenecks are:

  1. CPU — TLS termination, request parsing, guardrail evaluation, JSON body inspection for token counting
  2. Concurrent connections — each in-flight request (especially streaming LLM responses and MCP/SSE connections) holds an async task
  3. Memory — primarily proportional to concurrent connections; the proxy streams request/response bodies rather than buffering them

For most AI workloads, CPU is the primary bottleneck because LLM requests are long-lived (seconds to minutes) with low request-per-second rates but high per-request CPU cost (TLS, body parsing, token counting).

Scaling Signals (in priority order)

| Signal | Metric | Why |
|---|---|---|
| CPU utilization | `container_cpu_usage_seconds_total` | Primary bottleneck for TLS + body parsing. Target 60-70% average. |
| Runtime saturation | `agentgateway_tokio_global_queue_depth` | Non-zero means worker threads are fully occupied. Most direct signal that the proxy needs more capacity. |
| Active tasks | `agentgateway_tokio_num_alive_tasks` | Proportional to concurrent in-flight requests/connections. If this grows faster than request rate, connections are backing up. |
| Request rate | `agentgateway_requests_total` | Useful as a secondary signal, but less direct than CPU because request cost varies with payload size. |

Configure HPA

The HPA below scales on CPU (primary) and can optionally use custom metrics for runtime saturation:

kubectl apply -f - <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: agentgateway-proxy
  namespace: agentgateway-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: agentgateway-proxy
  minReplicas: 2
  maxReplicas: 10
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60       # React quickly to load spikes
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60                # Add up to 2 pods per minute
    scaleDown:
      stabilizationWindowSeconds: 300      # Wait 5 min before scaling down (protect long-lived connections)
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120               # Remove at most 1 pod every 2 min
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
EOF
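Under the hood, the HPA's core algorithm is `desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric)`, clamped to the min/max bounds. A sketch with the 65% CPU target above (the utilization samples are hypothetical):

```python
import math

# The Kubernetes HPA replica formula:
#   desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)
# clamped to [minReplicas, maxReplicas].
def desired_replicas(current: int, current_util: float, target_util: float,
                     min_r: int = 2, max_r: int = 10) -> int:
    desired = math.ceil(current * current_util / target_util)
    return max(min_r, min(max_r, desired))

# With the averageUtilization: 65 target (observed utilizations hypothetical):
print(desired_replicas(2, 130, 65))  # 4  -- CPU at 2x target doubles the pods
print(desired_replicas(4, 30, 65))   # 2  -- scale-down, floored by minReplicas
```

Note that the `behavior` stanza then throttles how fast the controller may move toward this desired count (at most 2 pods/min up, 1 pod per 2 min down).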

Important: For HPA to work, the proxy deployment must have CPU resource requests set. Update the EnterpriseAgentgatewayParameters:

kubectl patch enterpriseagentgatewayparameters agentgateway-config -n agentgateway-system --type=merge -p '{
  "spec": {
    "deployment": {
      "spec": {
        "template": {
          "spec": {
            "containers": [{
              "name": "agentgateway",
              "resources": {
                "requests": {
                  "cpu": "500m",
                  "memory": "256Mi"
                },
                "limits": {
                  "cpu": "2",
                  "memory": "1Gi"
                }
              }
            }]
          }
        }
      }
    }
  }
}'

Resource sizing guidance:

| Workload | CPU Request | CPU Limit | Memory Request | Memory Limit | Notes |
|---|---|---|---|---|---|
| Low (< 50 rps) | 250m | 1 | 128Mi | 512Mi | Light traffic, few concurrent streams |
| Medium (50-500 rps) | 500m | 2 | 256Mi | 1Gi | Moderate concurrency, some streaming |
| High (> 500 rps) | 1 | 4 | 512Mi | 2Gi | High concurrency, many long-lived streams |

The proxy is lightweight at idle (~6Mi memory, <1m CPU). Memory grows linearly with concurrent connections. Each streaming LLM connection holds minimal state (the proxy streams, it does not buffer bodies).

Scaling Considerations for AI Traffic

AI/LLM traffic differs from traditional HTTP:

  • Long-lived connections: A streaming chat completion can last 30-120 seconds. The proxy holds an async task for the entire duration.
  • Low RPS, high connection time: 100 concurrent streaming users at 60s average = only ~1.7 rps but 100 concurrent tasks.
  • Tokio worker threads: Default to CPU_LIMIT cores. Each worker thread can handle many concurrent async tasks, but CPU-bound work (TLS, JSON parsing) blocks the thread.
  • Scale-down risk: Aggressive scale-down can terminate pods with active streaming connections. Use a long stabilizationWindowSeconds (300s+) for scale-down.
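The "low RPS, high connection time" point is just Little's Law (`L = λ × W`): concurrent in-flight streams equal arrival rate times average stream duration. A sketch reproducing the arithmetic above:

```python
# Little's Law (L = lambda * W): concurrent in-flight streams equal arrival
# rate times average stream duration. Reproduces the "100 concurrent users
# at 60s average = ~1.7 rps" arithmetic above.

def concurrency(rps: float, avg_duration_s: float) -> float:
    """Expected concurrent tasks given request rate and stream duration."""
    return rps * avg_duration_s

def rps_for_concurrency(concurrent: float, avg_duration_s: float) -> float:
    """Request rate that sustains a given number of concurrent streams."""
    return concurrent / avg_duration_s

print(round(rps_for_concurrency(100, 60), 1))  # 1.7 rps sustains 100 streams
print(concurrency(10, 45))                     # 450.0 concurrent tasks at 10 rps, 45s streams
```

This is why connection-oriented signals (alive tasks, queue depth) matter more than raw RPS when sizing for streaming AI traffic.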

Requests-per-Pod Guidance

There is no hard limit — throughput depends on payload size, TLS overhead, guardrails enabled, and whether responses are streaming. General guidance:

| Scenario | Approximate capacity per pod (1 CPU) |
|---|---|
| Non-streaming, small payloads, no guardrails | ~500-1000 rps |
| Non-streaming with guardrails | ~200-400 rps |
| Streaming LLM (concurrent connections) | ~500-1000 concurrent streams |
| MCP tool calls | ~300-600 rps |

Recommendation: Load test your specific workload using the k6 lab (025-load-testing-with-k6s.md) and observe the agentgateway_tokio_global_queue_depth metric. When queue depth starts consistently rising above 0, you've found the saturation point for that pod.


Part 4: Graceful Shutdown for Long-Lived AI Connections

LLM streaming responses, MCP/SSE connections, and agent workloads can run for minutes. The proxy has built-in graceful shutdown that must be configured to match.

How Graceful Shutdown Works

When a pod receives SIGTERM:

  1. Stop accepting new connections — the listener stops immediately
  2. CONNECTION_MIN_TERMINATION_DEADLINE (default: 10s) — for this period, existing connections continue but new ones receive connection: close (HTTP/1) or GOAWAY (HTTP/2) to tell clients to reconnect elsewhere
  3. Drain in-flight requests — the proxy waits for all active request handlers to complete
  4. TERMINATION_GRACE_PERIOD_SECONDS (default: 60s) — hard deadline. Any connections still active after this are forcefully terminated
  5. Kubernetes SIGKILL — sent at terminationGracePeriodSeconds (also 60s by default)
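The sequencing above can be sketched as follows. This is an illustrative asyncio model, not the proxy's actual Rust/Tokio implementation — the function name and timings are invented for the example:

```python
import asyncio

# Illustrative sketch of the shutdown sequencing above (NOT the proxy's actual
# implementation): stop accepting, give clients time to migrate, then drain
# in-flight work with a hard deadline.

async def graceful_shutdown(in_flight: set[asyncio.Task],
                            min_deadline_s: float, grace_period_s: float) -> bool:
    # Steps 1-2: listener is closed; existing connections get min_deadline_s
    # (CONNECTION_MIN_TERMINATION_DEADLINE) to reconnect elsewhere.
    await asyncio.sleep(min_deadline_s)
    # Steps 3-4: drain in-flight requests, but no longer than the hard
    # deadline (TERMINATION_GRACE_PERIOD_SECONDS).
    remaining = grace_period_s - min_deadline_s
    done, pending = await asyncio.wait(in_flight, timeout=remaining)
    for task in pending:          # anything still running is forcefully ended
        task.cancel()
    return not pending            # True == clean drain, nothing cancelled

async def main() -> None:
    # Three short "requests" that finish well within the grace period:
    short = {asyncio.create_task(asyncio.sleep(0.05)) for _ in range(3)}
    print(await graceful_shutdown(short, min_deadline_s=0.01, grace_period_s=1.0))  # True

asyncio.run(main())
```

The key invariant — the drain deadline must fit inside the Kubernetes grace period — is exactly the `TERMINATION_GRACE_PERIOD_SECONDS < terminationGracePeriodSeconds` rule configured below.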

Recommended Configuration for AI Workloads

For workloads with long-lived streaming connections, increase the termination grace period:

kubectl patch enterpriseagentgatewayparameters agentgateway-config -n agentgateway-system --type=merge -p '{
  "spec": {
    "deployment": {
      "spec": {
        "template": {
          "spec": {
            "terminationGracePeriodSeconds": 120,
            "containers": [{
              "name": "agentgateway",
              "env": [
                {
                  "name": "TERMINATION_GRACE_PERIOD_SECONDS",
                  "value": "110"
                },
                {
                  "name": "CONNECTION_MIN_TERMINATION_DEADLINE",
                  "value": "15s"
                }
              ]
            }]
          }
        }
      }
    }
  }
}'

Key: TERMINATION_GRACE_PERIOD_SECONDS must be less than terminationGracePeriodSeconds (the Kubernetes-level setting), otherwise Kubernetes sends SIGKILL before the proxy finishes draining.

| Setting | Default | Recommended for AI | Description |
|---|---|---|---|
| terminationGracePeriodSeconds | 60s | 120s | Kubernetes-level: time before SIGKILL |
| TERMINATION_GRACE_PERIOD_SECONDS | 60s | 110s | Proxy-level: hard deadline for drain (must be < the Kubernetes setting) |
| CONNECTION_MIN_TERMINATION_DEADLINE | 10s | 15s | Minimum time existing connections stay open (while signaled to close) so clients can migrate |
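After patching, confirm the values actually landed on the rendered Deployment. This sketch assumes the proxy Deployment is named agentgateway-proxy (as in the upgrade checklist later in this lab) and that the container name matches the patch above.

```shell
# Verify the Kubernetes-level grace period on the rendered Deployment.
kubectl get deploy agentgateway-proxy -n agentgateway-system \
  -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}'; echo

# Verify the proxy-level env vars were applied to the container.
kubectl get deploy agentgateway-proxy -n agentgateway-system \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="agentgateway")].env[?(@.name=="TERMINATION_GRACE_PERIOD_SECONDS")].value}'; echo
kubectl get deploy agentgateway-proxy -n agentgateway-system \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="agentgateway")].env[?(@.name=="CONNECTION_MIN_TERMINATION_DEADLINE")].value}'; echo
```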

Part 5: Pod Disruption Budgets

PDBs ensure minimum availability during voluntary disruptions (node drains, upgrades, cluster autoscaler).

kubectl apply -f - <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: agentgateway-proxy
  namespace: agentgateway-system
spec:
  minAvailable: 1                  # At least 1 proxy pod always running
  selector:
    matchLabels:
      app.kubernetes.io/name: agentgateway-proxy
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: enterprise-agentgateway
  namespace: agentgateway-system
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: enterprise-agentgateway
EOF

Guidance for choosing minAvailable vs maxUnavailable:

| Replicas | Recommended PDB | Effect |
|---|---|---|
| 2 | minAvailable: 1 | 1 pod can be disrupted at a time |
| 3-5 | maxUnavailable: 1 | Same effect, but works better with rolling updates |
| 5+ | maxUnavailable: 2 | Allows faster rolling updates while maintaining capacity |
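If you later scale the proxy to 3+ replicas, replace the minAvailable PDB above with a maxUnavailable one, per the table. For example:

```shell
# Example PDB for a 3-5 replica proxy deployment: maxUnavailable plays
# better with rolling updates than minAvailable at these counts.
kubectl apply -f - <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: agentgateway-proxy
  namespace: agentgateway-system
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: agentgateway-proxy
EOF
```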

Verify:

kubectl get pdb -n agentgateway-system

Part 6: Topology Spread and Anti-Affinity

Spread proxy pods across nodes and zones to survive node failures and zonal outages.

kubectl patch enterpriseagentgatewayparameters agentgateway-config -n agentgateway-system --type=merge -p '{
  "spec": {
    "deployment": {
      "spec": {
        "template": {
          "spec": {
            "topologySpreadConstraints": [
              {
                "maxSkew": 1,
                "topologyKey": "topology.kubernetes.io/zone",
                "whenUnsatisfiable": "ScheduleAnyway",
                "labelSelector": {
                  "matchLabels": {
                    "app.kubernetes.io/name": "agentgateway-proxy"
                  }
                }
              },
              {
                "maxSkew": 1,
                "topologyKey": "kubernetes.io/hostname",
                "whenUnsatisfiable": "ScheduleAnyway",
                "labelSelector": {
                  "matchLabels": {
                    "app.kubernetes.io/name": "agentgateway-proxy"
                  }
                }
              }
            ],
            "affinity": {
              "podAntiAffinity": {
                "preferredDuringSchedulingIgnoredDuringExecution": [
                  {
                    "weight": 100,
                    "podAffinityTerm": {
                      "labelSelector": {
                        "matchExpressions": [
                          {
                            "key": "app.kubernetes.io/name",
                            "operator": "In",
                            "values": ["agentgateway-proxy"]
                          }
                        ]
                      },
                      "topologyKey": "kubernetes.io/hostname"
                    }
                  }
                ]
              }
            }
          }
        }
      }
    }
  }
}'

Why ScheduleAnyway instead of DoNotSchedule:

  • DoNotSchedule can prevent scaling if no valid node/zone is available
  • ScheduleAnyway is a best-effort spread — the scheduler tries to spread but won't block scheduling
  • Use DoNotSchedule only if you have nodes in 3+ zones and can guarantee capacity in each
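If you do meet that bar, a stricter variant hard-enforces the zone spread while keeping the hostname spread best-effort. Note that a merge patch replaces the entire topologySpreadConstraints list, so the patch must restate every constraint you want to keep:

```shell
# Stricter sketch for clusters with nodes (and spare capacity) in 3+ zones:
# DoNotSchedule on the zone key, ScheduleAnyway on the hostname key.
# A merge patch replaces the whole list, so both constraints are restated.
kubectl patch enterpriseagentgatewayparameters agentgateway-config -n agentgateway-system --type=merge -p '{
  "spec": {
    "deployment": {
      "spec": {
        "template": {
          "spec": {
            "topologySpreadConstraints": [
              {
                "maxSkew": 1,
                "topologyKey": "topology.kubernetes.io/zone",
                "whenUnsatisfiable": "DoNotSchedule",
                "labelSelector": {
                  "matchLabels": {"app.kubernetes.io/name": "agentgateway-proxy"}
                }
              },
              {
                "maxSkew": 1,
                "topologyKey": "kubernetes.io/hostname",
                "whenUnsatisfiable": "ScheduleAnyway",
                "labelSelector": {
                  "matchLabels": {"app.kubernetes.io/name": "agentgateway-proxy"}
                }
              }
            ]
          }
        }
      }
    }
  }
}'
```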

Verify pods are spread:

kubectl get pods -n agentgateway-system -l app.kubernetes.io/name=agentgateway-proxy \
  -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName

# The topology.kubernetes.io/zone label lives on the node, not the pod —
# cross-reference the node list to see which zone each pod landed in:
kubectl get nodes -L topology.kubernetes.io/zone

Part 7: Zero-Downtime Upgrade Procedure

Combining all the above for a safe upgrade of Enterprise AgentGateway.

Pre-Upgrade Checklist

# 1. Verify PDBs are in place
kubectl get pdb -n agentgateway-system

# 2. Verify current replica count (recommend >= 2 for zero-downtime)
kubectl get deploy agentgateway-proxy -n agentgateway-system -o jsonpath='{.spec.replicas}'

# 3. Check current version
kubectl get pods -n agentgateway-system -l app.kubernetes.io/name=agentgateway-proxy \
  -o jsonpath='{.items[0].spec.containers[0].image}'

# 4. Verify all pods are healthy
kubectl get pods -n agentgateway-system

# 5. Check no reconciliation errors
kubectl port-forward -n agentgateway-system deploy/enterprise-agentgateway 9092:9092 &
sleep 2
curl -s http://localhost:9092/metrics | grep 'result="error"'
kill %1 2>/dev/null

Upgrade Steps

# 1. Upgrade Helm release (or update image tag)
#    The rolling update will respect PDBs and graceful shutdown
helm upgrade enterprise-agentgateway solo/enterprise-agentgateway \
  --namespace agentgateway-system \
  --version <new-version> \
  --reuse-values

# 2. Watch the rollout — pods are replaced one at a time (PDB enforced)
kubectl rollout status deployment/enterprise-agentgateway -n agentgateway-system --timeout=300s
kubectl rollout status deployment/agentgateway-proxy -n agentgateway-system --timeout=300s

# 3. Verify all pods are on the new version
kubectl port-forward -n agentgateway-system deploy/agentgateway-proxy 15020:15020 &
sleep 2
curl -s http://localhost:15020/metrics | grep agentgateway_build_info
kill %1 2>/dev/null

Monitor During Upgrade

Watch for these during the rolling update:

# In a separate terminal — watch for errors during rollout
kubectl logs -f deploy/agentgateway-proxy -n agentgateway-system --since=5m | \
  jq -r 'select(.level == "ERROR" or .level == "WARN") | "\(.timestamp) \(.level) \(.message)"'

Key metrics to watch in Grafana during the upgrade:

  • agentgateway_build_info — should show old and new version during rollout, then only new version
  • agentgateway_requests_total{status=~"5.."} — error rate should not spike
  • agentgateway_xds_connection_terminations — expect Reconnect reasons as proxies restart, but no ConnectionError
  • kgateway_controller_reconciliations_total{result="error"} — should remain at 0
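If Grafana isn't handy, the 5xx rate can also be polled from the shell during the rollout. This sketch assumes Prometheus is reachable on localhost:9090 (e.g. via a port-forward) and uses the metric and label names listed above.

```shell
# Sketch: poll the cluster-wide 5xx rate every 5s during the rollout.
# Assumes a port-forward to Prometheus on localhost:9090.
while true; do
  curl -s 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=sum(rate(agentgateway_requests_total{status=~"5.."}[1m]))' \
    | jq -r '"\(now | todate)  5xx rps: \(.data.result[0].value[1] // "0")"'
  sleep 5
done
```

A sustained non-zero value here during the rollout means requests are being dropped mid-drain — pause and investigate before the next pod is replaced.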

Cleanup

Remove the alerting rules, HPA, PDBs, and monitors if no longer needed:

kubectl delete prometheusrule agentgateway-alerts -n monitoring
kubectl delete hpa agentgateway-proxy -n agentgateway-system
kubectl delete pdb agentgateway-proxy enterprise-agentgateway -n agentgateway-system
kubectl delete podmonitor control-plane-monitoring-agentgateway-metrics -n agentgateway-system
kubectl delete servicemonitor rate-limiter-monitoring-agentgateway-metrics -n agentgateway-system