This lab provides a complete production-readiness guide for Enterprise AgentGateway: a metrics reference for every component, recommended Prometheus alerting rules, horizontal pod autoscaling configuration, graceful shutdown for long-lived AI connections, pod spreading, and disruption budgets.
This lab assumes that you have completed the setup in 001 and 002.
- Understand every metric exposed by the data plane, control plane, and rate limiter
- Deploy production-grade Prometheus alerting rules
- Configure HPA to auto-scale the gateway based on the right signals
- Configure PodDisruptionBudgets, topology spread, and anti-affinity
- Ensure graceful shutdown of long-lived LLM streaming and MCP connections during upgrades
- Perform a zero-downtime rolling upgrade
Enterprise AgentGateway exposes metrics from three components. All are scraped automatically by Prometheus via pod annotations.
The data plane is a Rust-based proxy built on Tokio. It handles all request routing, LLM traffic, MCP calls, and guardrail enforcement.
| Metric | Type | Labels | Description |
|---|---|---|---|
| `agentgateway_build_info` | Info | `tag` | AgentGateway version (e.g. v2.2.0). Use to confirm all pods are on the same version after an upgrade. |
These are the primary metrics for monitoring gateway health and performance. Every request flowing through the gateway is counted and timed here.
| Metric | Type | Labels | Description |
|---|---|---|---|
| `agentgateway_requests_total` | Counter | `backend`, `protocol`, `method`, `status`, `reason`, `bind`, `gateway`, `listener`, `route`, `route_rule` | Total HTTP requests processed. The `status` label is the HTTP status code (200, 429, 500, etc.). The `reason` label classifies why the response was generated (see reason table below). |
| `agentgateway_request_duration_seconds` | Histogram | (same) | End-to-end request latency from the proxy's perspective, including upstream LLM processing time. Buckets: 1ms to 80s. For LLM traffic, p99 will be dominated by model inference time. |
| `agentgateway_response_bytes_total` | Counter | (same) | Total response bytes received from upstream backends. Useful for tracking bandwidth and detecting unusually large responses. |
HTTP Label Reference:
| Label | Values | Description |
|---|---|---|
| `backend` | Backend name or "unknown" | The upstream AgentgatewayBackend that handled the request. "unknown" means no route matched. |
| `protocol` | `http`, `https`, `tls`, `tcp`, `hbone` | Transport protocol to the upstream. |
| `method` | `GET`, `POST`, `CONNECT`, etc. | HTTP method. LLM chat completions are always POST. |
| `status` | HTTP status code (200, 404, 429, 500, etc.) | Response status code. |
| `reason` | See table below | Why the proxy generated or forwarded this response. |
| `bind` | e.g. `8080/agentgateway-system/agentgateway-proxy` | The listener bind address. |
| `gateway` | e.g. `agentgateway-system/agentgateway-proxy` | The Gateway resource name. |
| `listener` | e.g. `http` | Listener name within the Gateway. |
| `route` | HTTPRoute name or "unknown" | Which HTTPRoute matched. |
| `route_rule` | Rule index or "unknown" | Which rule within the HTTPRoute matched. |
Response `reason` values:
The reason label tells you why a response was generated — critical for distinguishing between "upstream returned an error" vs. "the gateway itself rejected the request":
| Reason | Meaning | Typical Status Codes |
|---|---|---|
| `Upstream` | Response came from the upstream LLM/MCP backend | Any (200, 429, 500, etc.) |
| `DirectResponse` | Proxy generated the response directly (no upstream call) | Varies |
| `NotFound` | No matching listener, route, or backend found | 404 |
| `NoHealthyBackend` | All providers in the backend are unhealthy, DNS failed, or the backend doesn't exist | 503 |
| `RateLimit` | Request rejected by the local or global rate limiter | 429 |
| `Timeout` | Request or upstream call timed out | 504 |
| `JwtAuth` | JWT authentication failed | 401 |
| `BasicAuth` | Basic authentication failed | 401 |
| `APIKeyAuth` | API key authentication failed | 401 |
| `ExtAuth` | External authorization service rejected the request | 403 |
| `Authorization` | Authorization or CSRF validation failed | 403 |
| `UpstreamFailure` | Upstream connection failed, TCP proxy error, or backend auth error | 502, 503 |
| `Internal` | Internal proxy error (invalid request, filter error, processing error) | 500 |
| `MCP` | MCP protocol-level error | Varies |
| `ExtProc` | External processing failure | 500 |
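One practical use of the `reason` label is attributing error budget: did the gateway itself reject the request, or did the upstream fail? The grouping below is our own reading of the table above (not an official taxonomy), and the sample counts are made up for illustration:

```python
# Hypothetical per-reason request counts, as might be read from
# agentgateway_requests_total over some window.
samples = {
    "Upstream": 950,
    "RateLimit": 30,
    "NoHealthyBackend": 12,
    "JwtAuth": 5,
    "Internal": 3,
}

# Reasons where the gateway itself produced the response. Upstream,
# UpstreamFailure, and MCP are treated as upstream-side here.
GATEWAY_GENERATED = {
    "DirectResponse", "NotFound", "NoHealthyBackend", "RateLimit", "Timeout",
    "JwtAuth", "BasicAuth", "APIKeyAuth", "ExtAuth", "Authorization",
    "Internal", "ExtProc",
}

def attribute_errors(samples: dict) -> dict:
    """Split request counts into upstream-served vs gateway-generated."""
    gateway = sum(v for k, v in samples.items() if k in GATEWAY_GENERATED)
    upstream = sum(v for k, v in samples.items() if k not in GATEWAY_GENERATED)
    return {"gateway_generated": gateway, "upstream": upstream}

print(attribute_errors(samples))  # {'gateway_generated': 50, 'upstream': 950}
```

The same split can be expressed directly in PromQL by matching the `reason` label in two `sum(rate(...))` queries.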
These follow the OpenTelemetry GenAI semantic conventions. They are only populated for requests routed to LLM backends (AgentgatewayBackend with ai spec).
| Metric | Type | Labels | Description |
|---|---|---|---|
| `agentgateway_gen_ai_client_token_usage` | Histogram | `gen_ai_token_type`, `gen_ai_operation_name`, `gen_ai_system`, `gen_ai_request_model`, `gen_ai_response_model`, `route` | Tokens consumed per request. Two observations per request: one with `gen_ai_token_type="input"` (prompt tokens) and one with `gen_ai_token_type="output"` (completion tokens). Buckets are exponential: 1, 4, 16, 64, 256, 1024 ... up to 67M. |
| `agentgateway_gen_ai_server_request_duration` | Histogram | `gen_ai_operation_name`, `gen_ai_system`, `gen_ai_request_model`, `gen_ai_response_model`, `route` | Total time the upstream LLM took to process the request (seconds). For streaming, this is the time from first byte sent to last byte received. Buckets: 10ms to 82s. |
| `agentgateway_gen_ai_server_time_to_first_token` | Histogram | (same) | Time from request start to the first token generated (TTFT). A critical SLI for streaming user experience. Buckets: 1ms to 10s. |
| `agentgateway_gen_ai_server_time_per_output_token` | Histogram | (same) | Average inter-token latency (TPOT). Measures generation throughput. Buckets: 1ms to 2.5s. |
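TTFT, TPOT, and total duration are related: for a streamed completion, total time is roughly TTFT plus one inter-token interval for every remaining token. A small sketch of that arithmetic (illustrative only; the real histograms report measured values, not this approximation):

```python
def estimated_stream_duration(ttft_s: float, tpot_s: float,
                              output_tokens: int) -> float:
    """Approximate total streaming time: time to first token, plus
    inter-token latency for each of the remaining tokens."""
    return ttft_s + tpot_s * (output_tokens - 1)

# e.g. 800ms TTFT, 40ms per output token, 500-token completion
print(round(estimated_stream_duration(0.8, 0.04, 500), 2))  # 20.76 seconds
```

This is useful for sanity-checking the three histograms against each other: if p50 duration is far above `TTFT + TPOT × tokens`, something other than generation (queueing, retries) is adding latency.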
GenAI Label Reference:
| Label | Values | Description |
|---|---|---|
| `gen_ai_token_type` | `input`, `output` | Whether this observation counts prompt tokens or completion tokens. Only on `token_usage`. |
| `gen_ai_operation_name` | `chat`, `embeddings` | The type of LLM operation. |
| `gen_ai_system` | `openai`, `anthropic`, `bedrock`, `vertexai`, `azureopenai`, etc. | The LLM provider type configured in the backend. |
| `gen_ai_request_model` | e.g. `gpt-4o`, `claude-sonnet-4-20250514` | The model name sent in the request. |
| `gen_ai_response_model` | e.g. `gpt-4o-2024-08-06` | The model name returned in the response (may differ from the request). |
| Metric | Type | Labels | Description |
|---|---|---|---|
| `agentgateway_mcp_requests` | Counter | `method`, `resource_type`, `server`, `resource`, `route` | Total MCP tool/resource/prompt calls. Not incremented for raw HTTP transport requests (only JSON-RPC method calls). |
MCP Label Reference:
| Label | Values | Description |
|---|---|---|
| `method` | `tools/call`, `tools/list`, `prompts/get`, `resources/read`, etc. | The MCP JSON-RPC method name. |
| `resource_type` | `Tool`, `Prompt`, `Resource`, `ResourceTemplates` | Category of MCP operation. |
| `server` | Target MCP server name | Which MCP server was called. |
| `resource` | Tool/resource name | The specific tool or resource accessed. |
MCP requests also flow through the general agentgateway_request_duration_seconds histogram for latency tracking.
| Metric | Type | Labels | Description |
|---|---|---|---|
| `agentgateway_guardrail_checks` | Counter | `phase`, `action` | Total guardrail evaluations across all guardrail types (regex, webhook, OpenAI Moderation, Bedrock Guardrails, Google Model Armor). |
| Label | Values | Description |
|---|---|---|
| `phase` | `Request`, `Response` | Whether the guardrail fired on the inbound request or the outbound LLM response. |
| `action` | `Allow`, `Mask`, `Reject` | The outcome. `Reject` = request/response blocked, `Mask` = content redacted, `Allow` = passed. |
| Metric | Type | Labels | Description |
|---|---|---|---|
| `agentgateway_downstream_connections_total` | Counter | `bind`, `gateway`, `listener`, `protocol` | Total client-to-proxy connections established. Includes short-lived and long-lived (streaming, MCP/SSE) connections. |
| `agentgateway_downstream_received_bytes_total` | Counter | (same) | Total bytes received from clients. |
| `agentgateway_downstream_sent_bytes_total` | Counter | (same) | Total bytes sent to clients. |
| `agentgateway_upstream_connect_duration_seconds` | Histogram | `transport` | Time to establish upstream connections. `transport` is `plaintext` or `tls`. High values indicate network or DNS problems reaching LLM providers. Buckets: 0.5ms to 8s. |
| `agentgateway_tls_handshake_duration_seconds` | Histogram | `bind`, `gateway`, `listener`, `protocol` | Inbound TLS handshake duration. Only populated if TLS termination is configured on the gateway. Buckets: 0.5ms to 8s. |
These track the connection between the data plane proxy and the control plane. Problems here mean the proxy isn't receiving configuration updates.
| Metric | Type | Labels | Description |
|---|---|---|---|
| `agentgateway_xds_connection_terminations` | Counter | `reason` | xDS stream disconnections. `reason` is `ConnectionError` (network failure), `Error` (gRPC error), `Reconnect` (planned), or `Complete` (clean close). Frequent `ConnectionError` or `Error` values indicate control plane instability. |
| `agentgateway_xds_message_total` | Counter | `url` | Number of xDS config messages received. The `url` label is the resource type URL (e.g. `type.googleapis.com/agentgateway.dev.resource.Resource`). A sudden stop means the proxy is no longer receiving config updates. |
| `agentgateway_xds_message_bytes_total` | Counter | `url` | Bytes received from xDS. Large spikes may indicate excessive configuration churn. |
The proxy runs on a Tokio async runtime. These metrics indicate proxy-level health independently of request metrics.
| Metric | Type | Description |
|---|---|---|
| `agentgateway_tokio_num_workers` | Gauge | Number of Tokio worker threads. Defaults to the number of CPU cores (or the value of `CPU_LIMIT`). Should be stable. |
| `agentgateway_tokio_num_alive_tasks` | Gauge | Number of currently active async tasks. Each in-flight request and connection is a task. A sustained upward trend may indicate task leaks or connection backlog. |
| `agentgateway_tokio_global_queue_depth` | Gauge | Tasks waiting to be picked up by a worker thread. Sustained values > 0 mean worker threads are saturated — a strong scale-up signal. |
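Because the queue depth gauge can blip to non-zero momentarily under normal load, alert logic should look for *sustained* saturation rather than a single sample. A sketch of that idea (plain Python over hypothetical scrape samples; in practice you would express this with a `for:` clause on a Prometheus alert, as shown later in this lab):

```python
def is_saturated(queue_depth_samples: list[int], threshold: int = 0,
                 min_consecutive: int = 5) -> bool:
    """Flag saturation only when the global queue depth stays above the
    threshold for `min_consecutive` consecutive scrapes."""
    streak = 0
    for depth in queue_depth_samples:
        streak = streak + 1 if depth > threshold else 0
        if streak >= min_consecutive:
            return True
    return False

print(is_saturated([0, 0, 2, 3, 0, 4, 2, 0]))  # False: longest streak is 2
print(is_saturated([0, 1, 2, 3, 1, 4, 2, 0]))  # True: 6 consecutive non-zero samples
```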
The control plane is a Go-based Kubernetes controller. It watches Gateway API and AgentGateway CRDs and pushes configuration to data plane proxies via xDS.
| Metric | Type | Labels | Description |
|---|---|---|---|
| `kgateway_controller_reconciliations_total` | Counter | `controller`, `name`, `namespace`, `result` | Total reconciliation loops. The `controller` label identifies which controller ran (e.g. `gateway`, `gatewayclass`). The `result` label is `success` or `error`. A rising error count means CRD changes are not being applied. |
| `kgateway_controller_reconciliations_running` | Gauge | `controller`, `name`, `namespace` | Currently in-flight reconciliations. Sustained high values indicate controller backlog. |
| `kgateway_controller_reconcile_duration_seconds` | Histogram | `controller`, `name`, `namespace` | Time per reconciliation loop. Increasing durations may indicate growing cluster complexity or API server slowness. |
| `enterprise_kgateway_controller_reconciliations_total` | Counter | `controller`, `name`, `namespace`, `result` | Same as above but for enterprise-specific controllers: `agw-ext-auth`, `agw-ext-cache`, `agw-rate-limiter`. |
| `enterprise_kgateway_controller_reconciliations_running` | Gauge | (same) | In-flight enterprise reconciliations. |
| `enterprise_kgateway_controller_reconcile_duration_seconds` | Histogram | (same) | Enterprise reconciliation duration. |
| Metric | Type | Description |
|---|---|---|
| `kgateway_xds_auth_rq_total` | Counter | Total xDS authentication requests from data plane proxies. Each proxy connection must authenticate. |
| `kgateway_xds_auth_rq_success_total` | Counter | Successful xDS auth requests. If `total - success > 0`, proxy pods are failing to authenticate with the control plane. |
| Metric | Type | Description |
|---|---|---|
| `go_goroutines` | Gauge | Number of active goroutines. Sustained growth indicates leaks. Baseline is ~900 for a healthy controller. |
| `go_memstats_alloc_bytes` | Gauge | Current heap allocation. Monitor for memory leaks. |
| `process_resident_memory_bytes` | Gauge | RSS of the control plane process. Use for capacity planning. |
| `process_cpu_seconds_total` | Counter | CPU time consumed. Use `rate()` for CPU utilization. |
| `process_open_fds` | Gauge | Open file descriptors. Approaching `process_max_fds` causes failures. |
| Metric | Type | Labels | Description |
|---|---|---|---|
| `ratelimit_solo_io_total_hits` | Counter | `descriptor` | Total rate limit evaluation requests. The `descriptor` label encodes the rate limit policy (e.g. `solo.io\|generic_key^namespace.policyname`). |
| `ratelimit_solo_io_over_limit` | Counter | `descriptor` | Requests that exceeded the configured limit and were rejected (429). |
| `ratelimit_solo_io_near_limit` | Counter | `descriptor` | Requests that were within 80% of the limit — an early warning signal. |
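The three counters are most useful as ratios per descriptor: what fraction of evaluations is being rejected, and what fraction is approaching the limit. A sketch of the arithmetic with hypothetical rates (the 200/30/50 numbers are made up; in practice each input would be a `rate(...[5m])` result):

```python
def limiter_health(total_hits: float, over_limit: float,
                   near_limit: float) -> dict:
    """Summarize one descriptor's rate-limit pressure from counter rates."""
    if total_hits == 0:
        return {"reject_ratio": 0.0, "near_ratio": 0.0}
    return {
        "reject_ratio": over_limit / total_hits,   # fraction rejected with 429
        "near_ratio": near_limit / total_hits,     # fraction within 80% of limit
    }

# hypothetical rates: 200 evaluations/s, 30 rejected/s, 50 near-limit/s
h = limiter_health(200, 30, 50)
print(h["reject_ratio"], h["near_ratio"])  # 0.15 0.25
```

A rising `near_ratio` with a flat `reject_ratio` is the window in which to raise limits before users see 429s.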
Lab 002 already configures Prometheus scraping for the data plane proxy. For production, you also need to scrape the control plane and rate limiter:
# Control plane metrics (port 9092)
kubectl apply -f- <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
name: control-plane-monitoring-agentgateway-metrics
namespace: agentgateway-system
spec:
namespaceSelector:
matchNames:
- agentgateway-system
podMetricsEndpoints:
- port: metrics
selector:
matchLabels:
app.kubernetes.io/name: enterprise-agentgateway
EOF

# Rate limiter metrics (port 9091, exposed as "debug" on the Service)
kubectl apply -f- <<EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: rate-limiter-monitoring-agentgateway-metrics
namespace: agentgateway-system
spec:
namespaceSelector:
matchNames:
- agentgateway-system
selector:
matchLabels:
app: rate-limiter
endpoints:
- port: debug
interval: 15s
EOF

Verify all targets are being scraped (may take 30-60 seconds):
kubectl port-forward svc/prometheus-operated -n monitoring 9090:9090 &
sleep 2
curl -s http://localhost:9090/api/v1/targets | \
python3 -c "
import sys, json
d = json.load(sys.stdin)
for t in d['data']['activeTargets']:
pod = t.get('labels',{}).get('pod','')
svc = t.get('labels',{}).get('service','')
if 'agentgateway' in pod or 'rate-limiter' in svc:
print(f'{(pod or svc):60s} | {t[\"health\"]}')
"
kill %1 2>/dev/null

Expected output — all components up:
agentgateway-proxy-xxxxx-yyyyy | up
agentgateway-proxy-xxxxx-zzzzz | up
enterprise-agentgateway-xxxxx-yyyyy | up
rate-limiter-enterprise-agentgateway-xxxxx-yyyyy | up
Deploy these alerting rules to catch issues before they affect users.
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: agentgateway-alerts
namespace: monitoring
labels:
release: prometheus
spec:
groups:
# ──────────────────────────────────────────────
# Data Plane Alerts
# ──────────────────────────────────────────────
- name: agentgateway-dataplane
rules:
# High error rate: >5% of requests returning 5xx
- alert: AgentGatewayHighErrorRate
expr: |
(
sum(rate(agentgateway_requests_total{status=~"5.."}[5m])) by (gateway)
/
sum(rate(agentgateway_requests_total[5m])) by (gateway)
) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "AgentGateway {{ $labels.gateway }} has >5% error rate"
description: "{{ $value | humanizePercentage }} of requests are returning 5xx errors."
# High rate limit rejection rate
- alert: AgentGatewayHighRateLimitRate
expr: |
(
sum(rate(agentgateway_requests_total{reason="RateLimit"}[5m])) by (gateway)
/
sum(rate(agentgateway_requests_total[5m])) by (gateway)
) > 0.10
for: 5m
labels:
severity: warning
annotations:
summary: "AgentGateway {{ $labels.gateway }} is rate-limiting >10% of requests"
description: "Consider increasing rate limits or scaling the gateway."
# No healthy backends available
- alert: AgentGatewayNoHealthyBackends
expr: |
sum(rate(agentgateway_requests_total{reason="NoHealthyBackend"}[5m])) by (gateway, route) > 0
for: 2m
labels:
severity: critical
annotations:
summary: "Route {{ $labels.route }} on {{ $labels.gateway }} has no healthy backends"
description: "All LLM providers in this route's backend are unhealthy. Requests are failing with 503."
# Slow LLM responses: p99 > 30s
- alert: AgentGatewaySlowLLMResponses
expr: |
histogram_quantile(0.99,
sum(rate(agentgateway_gen_ai_server_request_duration_bucket[5m])) by (le, gen_ai_system, gen_ai_request_model)
) > 30
for: 10m
labels:
severity: warning
annotations:
summary: "LLM p99 latency >30s for {{ $labels.gen_ai_system }}/{{ $labels.gen_ai_request_model }}"
description: "p99 LLM response time is {{ $value | humanizeDuration }}. Check provider health."
# High TTFT (time to first token) - streaming UX degradation
- alert: AgentGatewayHighTTFT
expr: |
histogram_quantile(0.95,
sum(rate(agentgateway_gen_ai_server_time_to_first_token_bucket[5m])) by (le, gen_ai_request_model)
) > 5
for: 5m
labels:
severity: warning
annotations:
summary: "TTFT p95 >5s for model {{ $labels.gen_ai_request_model }}"
description: "Users are waiting {{ $value | humanizeDuration }} for the first token. Check model provider latency."
# Upstream connection failures
- alert: AgentGatewayUpstreamConnectFailures
expr: |
sum(rate(agentgateway_requests_total{reason="UpstreamFailure"}[5m])) by (gateway, backend) > 0.5
for: 5m
labels:
severity: critical
annotations:
summary: "Upstream connection failures to {{ $labels.backend }}"
description: "The proxy cannot connect to the upstream backend. Check DNS, network policies, and provider status."
# Guardrail rejection spike
- alert: AgentGatewayGuardrailRejectionSpike
expr: |
sum(rate(agentgateway_guardrail_checks{action="Reject"}[5m])) by (phase)
/
sum(rate(agentgateway_guardrail_checks[5m])) by (phase) > 0.20
for: 5m
labels:
severity: warning
annotations:
summary: "Guardrail rejection rate >20% on {{ $labels.phase }} phase"
description: "{{ $value | humanizePercentage }} of {{ $labels.phase | toLower }} guardrail checks are being rejected."
# Tokio runtime saturation — tasks queuing up
- alert: AgentGatewayRuntimeSaturation
expr: |
agentgateway_tokio_global_queue_depth > 10
for: 5m
labels:
severity: warning
annotations:
summary: "AgentGateway proxy runtime is saturated (queue depth {{ $value }})"
description: "Tokio worker threads cannot keep up. This is a strong signal to scale up the proxy."
# Task accumulation — potential leak or connection backlog
- alert: AgentGatewayTaskAccumulation
expr: |
agentgateway_tokio_num_alive_tasks > 10000
for: 10m
labels:
severity: warning
annotations:
summary: "AgentGateway has {{ $value }} active tasks"
description: "Sustained high task count may indicate connection leaks or backlog. Investigate long-lived connections."
# xDS disconnection from control plane
- alert: AgentGatewayXDSDisconnected
expr: |
sum(rate(agentgateway_xds_connection_terminations{reason=~"ConnectionError|Error"}[5m])) by (pod) > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Proxy pod is disconnecting from the control plane"
description: "xDS connection errors detected. The proxy may not be receiving configuration updates."
# Version mismatch after upgrade
- alert: AgentGatewayVersionMismatch
expr: |
count(count by (tag) (agentgateway_build_info)) > 1
for: 10m
labels:
severity: warning
annotations:
summary: "Multiple AgentGateway versions running simultaneously"
description: "Not all proxy pods are on the same version. This may indicate a stalled rollout."
# ──────────────────────────────────────────────
# Control Plane Alerts
# ──────────────────────────────────────────────
- name: agentgateway-controlplane
rules:
# Reconciliation errors
- alert: AgentGatewayReconcileErrors
expr: |
sum(rate(kgateway_controller_reconciliations_total{result="error"}[5m])) by (controller) > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Controller {{ $labels.controller }} has reconciliation errors"
description: "CRD changes may not be applied to the data plane. Check controller logs."
# Slow reconciliation
- alert: AgentGatewaySlowReconcile
expr: |
histogram_quantile(0.99,
sum(rate(kgateway_controller_reconcile_duration_seconds_bucket[5m])) by (le, controller)
) > 5
for: 10m
labels:
severity: warning
annotations:
summary: "Controller {{ $labels.controller }} p99 reconcile time >5s"
description: "Reconciliation is slow ({{ $value | humanizeDuration }}). Check API server performance and cluster size."
# xDS auth failures — proxies can't connect to control plane
- alert: AgentGatewayXDSAuthFailures
expr: |
(
rate(kgateway_xds_auth_rq_total[5m]) - rate(kgateway_xds_auth_rq_success_total[5m])
) > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Proxy pods are failing xDS authentication"
description: "Data plane proxies cannot authenticate with the control plane. New config will not be pushed."
# Control plane memory growth
- alert: AgentGatewayControlPlaneMemory
expr: |
process_resident_memory_bytes{job=~".*enterprise-agentgateway.*"} > 512 * 1024 * 1024
for: 15m
labels:
severity: warning
annotations:
summary: "Control plane using >512MB memory"
description: "Current RSS: {{ $value | humanize1024 }}B. Investigate for memory leaks."
# Goroutine leak
- alert: AgentGatewayGoroutineLeak
expr: |
go_goroutines{job=~".*enterprise-agentgateway.*"} > 5000
for: 15m
labels:
severity: warning
annotations:
summary: "Control plane has {{ $value }} goroutines"
description: "Sustained goroutine growth may indicate a leak. Baseline is ~900."
# ──────────────────────────────────────────────
# Rate Limiter Alerts
# ──────────────────────────────────────────────
- name: agentgateway-ratelimiter
rules:
# Rate limiter rejecting a high percentage of requests
- alert: AgentGatewayRateLimiterOverLimit
expr: |
(
sum(rate(ratelimit_solo_io_over_limit[5m])) by (descriptor)
/
sum(rate(ratelimit_solo_io_total_hits[5m])) by (descriptor)
) > 0.25
for: 5m
labels:
severity: warning
annotations:
summary: "Rate limiter rejecting >25% of requests for {{ $labels.descriptor }}"
description: "Consider raising rate limits or investigating traffic patterns."
# Near-limit warning — approaching quota
- alert: AgentGatewayRateLimiterNearLimit
expr: |
sum(rate(ratelimit_solo_io_near_limit[5m])) by (descriptor) > 1
for: 10m
labels:
severity: info
annotations:
summary: "Traffic approaching rate limit for {{ $labels.descriptor }}"
description: "Requests are within 80% of the configured limit."
EOF

Verify the rules are loaded:
kubectl get prometheusrule agentgateway-alerts -n monitoring

Check that Prometheus has picked them up:
kubectl port-forward svc/prometheus-operated -n monitoring 9090:9090 &
sleep 2
curl -s http://localhost:9090/api/v1/rules | python3 -c "
import sys, json
data = json.load(sys.stdin)
groups = data['data']['groups']
agw = [g for g in groups if 'agentgateway' in g['name']]
for g in agw:
print(f\"\n{g['name']} ({len(g['rules'])} rules):\")
for r in g['rules']:
print(f\" - {r['name']} [{r.get('state','unknown')}]\")
"
kill %1 2>/dev/null

The proxy is a Rust binary built on the Tokio async runtime. Its bottlenecks are:
- CPU — TLS termination, request parsing, guardrail evaluation, JSON body inspection for token counting
- Concurrent connections — each in-flight request (especially streaming LLM responses and MCP/SSE connections) holds an async task
- Memory — primarily proportional to concurrent connections; the proxy streams request/response bodies rather than buffering them
For most AI workloads, CPU is the primary bottleneck because LLM requests are long-lived (seconds to minutes) with low request-per-second rates but high per-request CPU cost (TLS, body parsing, token counting).
| Signal | Metric | Why |
|---|---|---|
| CPU utilization | `container_cpu_usage_seconds_total` | Primary bottleneck for TLS + body parsing. Target 60-70% average. |
| Runtime saturation | `agentgateway_tokio_global_queue_depth` | Non-zero means worker threads are fully occupied. The most direct signal that the proxy needs more capacity. |
| Active tasks | `agentgateway_tokio_num_alive_tasks` | Proportional to concurrent in-flight requests/connections. If this grows faster than the request rate, connections are backing up. |
| Request rate | `agentgateway_requests_total` | Useful as a secondary signal, but less direct than CPU because request cost varies with payload size. |
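When CPU is the scaling metric, the HPA controller applies the standard Kubernetes desired-replicas formula. A quick sketch of that arithmetic, useful for predicting what the autoscaler below will do:

```python
import math

def desired_replicas(current_replicas: int, current_util: float,
                     target_util: float) -> int:
    """Core Kubernetes HPA formula:
    desired = ceil(current * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_util / target_util)

# 3 pods averaging 90% CPU against the 65% target used in this lab
print(desired_replicas(3, 90, 65))   # 5
# At exactly the target, replica count is stable
print(desired_replicas(4, 65, 65))   # 4
```

The `behavior` policies in the HPA manifest then rate-limit how fast the controller may move toward this desired count.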
The HPA below scales on CPU (primary) and can optionally use custom metrics for runtime saturation:
kubectl apply -f - <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: agentgateway-proxy
namespace: agentgateway-system
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: agentgateway-proxy
minReplicas: 2
maxReplicas: 10
behavior:
scaleUp:
stabilizationWindowSeconds: 60 # React quickly to load spikes
policies:
- type: Pods
value: 2
periodSeconds: 60 # Add up to 2 pods per minute
scaleDown:
stabilizationWindowSeconds: 300 # Wait 5 min before scaling down (protect long-lived connections)
policies:
- type: Pods
value: 1
periodSeconds: 120 # Remove at most 1 pod every 2 min
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 65
EOF

Important: For HPA to work, the proxy deployment must have CPU resource requests set. Update the EnterpriseAgentgatewayParameters:
kubectl patch enterpriseagentgatewayparameters agentgateway-config -n agentgateway-system --type=merge -p '{
"spec": {
"deployment": {
"spec": {
"template": {
"spec": {
"containers": [{
"name": "agentgateway",
"resources": {
"requests": {
"cpu": "500m",
"memory": "256Mi"
},
"limits": {
"cpu": "2",
"memory": "1Gi"
}
}
}]
}
}
}
}
}
}'

Resource sizing guidance:
| Workload | CPU Request | CPU Limit | Memory Request | Memory Limit | Notes |
|---|---|---|---|---|---|
| Low (< 50 rps) | 250m | 1 | 128Mi | 512Mi | Light traffic, few concurrent streams |
| Medium (50-500 rps) | 500m | 2 | 256Mi | 1Gi | Moderate concurrency, some streaming |
| High (> 500 rps) | 1 | 4 | 512Mi | 2Gi | High concurrency, many long-lived streams |
The proxy is lightweight at idle (~6Mi memory, <1m CPU). Memory grows linearly with concurrent connections. Each streaming LLM connection holds minimal state (the proxy streams, it does not buffer bodies).
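Because memory grows linearly with concurrent connections, a simple linear model is enough for rough sizing. The per-connection cost below is a placeholder assumption for illustration, not a measured AgentGateway figure; measure your own baseline under load:

```python
def estimated_memory_mib(base_mib: float, per_conn_kib: float,
                         connections: int) -> float:
    """Linear memory model: idle baseline plus per-connection state.
    per_conn_kib is an assumed figure -- measure it for your workload."""
    return base_mib + per_conn_kib * connections / 1024

# ~6 MiB idle baseline, hypothetical 64 KiB per streaming connection
print(estimated_memory_mib(6, 64, 1024))  # 70.0 MiB at 1024 concurrent streams
```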
AI/LLM traffic differs from traditional HTTP:
- Long-lived connections: A streaming chat completion can last 30-120 seconds. The proxy holds an async task for the entire duration.
- Low RPS, high connection time: 100 concurrent streaming users at 60s average = only ~1.7 rps but 100 concurrent tasks.
- Tokio worker threads: Default to `CPU_LIMIT` cores. Each worker thread can handle many concurrent async tasks, but CPU-bound work (TLS, JSON parsing) blocks the thread.
- Scale-down risk: Aggressive scale-down can terminate pods with active streaming connections. Use a long `stabilizationWindowSeconds` (300s+) for scale-down.
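The "low RPS, high connection time" point above is Little's law: in-flight connections L equal arrival rate λ times average duration W. A sketch of both directions of the formula:

```python
def concurrent_streams(requests_per_second: float, avg_duration_s: float) -> float:
    """Little's law: L = lambda * W (in-flight connections)."""
    return requests_per_second * avg_duration_s

def request_rate(concurrent: float, avg_duration_s: float) -> float:
    """Inverse form: lambda = L / W."""
    return concurrent / avg_duration_s

# 100 concurrent streaming users at a 60s average stream duration
print(round(request_rate(100, 60), 2))  # 1.67 rps, yet 100 tasks held open
print(concurrent_streams(2, 45))        # 90.0 tasks for 2 rps at 45s streams
```

This is why task-count metrics, not request rate, track the real load of streaming workloads.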
There is no hard limit — throughput depends on payload size, TLS overhead, guardrails enabled, and whether responses are streaming. General guidance:
| Scenario | Approximate capacity per pod (1 CPU) |
|---|---|
| Non-streaming, small payloads, no guardrails | ~500-1000 rps |
| Non-streaming with guardrails | ~200-400 rps |
| Streaming LLM (concurrent connections) | ~500-1000 concurrent streams |
| MCP tool calls | ~300-600 rps |
Recommendation: Load test your specific workload using the k6 lab (025-load-testing-with-k6s.md) and observe the agentgateway_tokio_global_queue_depth metric. When queue depth starts consistently rising above 0, you've found the saturation point for that pod.
LLM streaming responses, MCP/SSE connections, and agent workloads can run for minutes. The proxy has built-in graceful shutdown that must be configured to match.
When a pod receives SIGTERM:
- Stop accepting new connections — the listener stops immediately
- `CONNECTION_MIN_TERMINATION_DEADLINE` (default: `10s`) — for this period, existing connections continue but new ones receive `connection: close` (HTTP/1) or `GOAWAY` (HTTP/2) to tell clients to reconnect elsewhere
- Drain in-flight requests — the proxy waits for all active request handlers to complete
- `TERMINATION_GRACE_PERIOD_SECONDS` (default: `60s`) — hard deadline. Any connections still active after this are forcefully terminated
- Kubernetes SIGKILL — sent at `terminationGracePeriodSeconds` (also 60s by default)
For workloads with long-lived streaming connections, increase the termination grace period:
kubectl patch enterpriseagentgatewayparameters agentgateway-config -n agentgateway-system --type=merge -p '{
"spec": {
"deployment": {
"spec": {
"template": {
"spec": {
"terminationGracePeriodSeconds": 120,
"containers": [{
"name": "agentgateway",
"env": [
{
"name": "TERMINATION_GRACE_PERIOD_SECONDS",
"value": "110"
},
{
"name": "CONNECTION_MIN_TERMINATION_DEADLINE",
"value": "15s"
}
]
}]
}
}
}
}
}
}'

Key: `TERMINATION_GRACE_PERIOD_SECONDS` must be less than `terminationGracePeriodSeconds` (the Kubernetes-level setting); otherwise Kubernetes sends SIGKILL before the proxy finishes draining.
| Setting | Default | Recommended for AI | Description |
|---|---|---|---|
| `terminationGracePeriodSeconds` | 60s | 120s | Kubernetes-level: time before SIGKILL |
| `TERMINATION_GRACE_PERIOD_SECONDS` | 60s | 110s | Proxy-level: hard deadline for drain (must be < the Kubernetes setting) |
| `CONNECTION_MIN_TERMINATION_DEADLINE` | 10s | 15s | Minimum time to keep accepting connections to allow client migration |
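The ordering invariant between these three settings (min termination deadline < proxy drain deadline < Kubernetes grace period) can be checked mechanically. A small illustrative validator, useful in a CI check before applying the patch above:

```python
def validate_drain_config(k8s_grace_s: int, proxy_deadline_s: int,
                          min_termination_s: int) -> list[str]:
    """Check the ordering invariants between Kubernetes and proxy
    shutdown settings; returns a list of problems (empty means OK)."""
    problems = []
    if proxy_deadline_s >= k8s_grace_s:
        problems.append(
            "TERMINATION_GRACE_PERIOD_SECONDS must be < terminationGracePeriodSeconds")
    if min_termination_s >= proxy_deadline_s:
        problems.append(
            "CONNECTION_MIN_TERMINATION_DEADLINE must be < proxy drain deadline")
    return problems

print(validate_drain_config(120, 110, 15))  # [] -- the recommended AI settings
print(validate_drain_config(60, 60, 10))    # one problem: SIGKILL would cut drains short
```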
PDBs ensure minimum availability during voluntary disruptions (node drains, upgrades, cluster autoscaler).
kubectl apply -f - <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: agentgateway-proxy
namespace: agentgateway-system
spec:
minAvailable: 1 # At least 1 proxy pod always running
selector:
matchLabels:
app.kubernetes.io/name: agentgateway-proxy
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: enterprise-agentgateway
namespace: agentgateway-system
spec:
minAvailable: 1
selector:
matchLabels:
app.kubernetes.io/name: enterprise-agentgateway
EOF

Guidance for choosing `minAvailable` vs `maxUnavailable`:
| Replicas | Recommended PDB | Effect |
|---|---|---|
| 2 | `minAvailable: 1` | 1 pod can be disrupted at a time |
| 3-5 | `maxUnavailable: 1` | Same effect, but works better with rolling updates |
| 5+ | `maxUnavailable: 2` | Allows faster rolling updates while maintaining capacity |
Verify:
kubectl get pdb -n agentgateway-system

Spread proxy pods across nodes and zones to survive node failures and zonal outages.
kubectl patch enterpriseagentgatewayparameters agentgateway-config -n agentgateway-system --type=merge -p '{
"spec": {
"deployment": {
"spec": {
"template": {
"spec": {
"topologySpreadConstraints": [
{
"maxSkew": 1,
"topologyKey": "topology.kubernetes.io/zone",
"whenUnsatisfiable": "ScheduleAnyway",
"labelSelector": {
"matchLabels": {
"app.kubernetes.io/name": "agentgateway-proxy"
}
}
},
{
"maxSkew": 1,
"topologyKey": "kubernetes.io/hostname",
"whenUnsatisfiable": "ScheduleAnyway",
"labelSelector": {
"matchLabels": {
"app.kubernetes.io/name": "agentgateway-proxy"
}
}
}
],
"affinity": {
"podAntiAffinity": {
"preferredDuringSchedulingIgnoredDuringExecution": [
{
"weight": 100,
"podAffinityTerm": {
"labelSelector": {
"matchExpressions": [
{
"key": "app.kubernetes.io/name",
"operator": "In",
"values": ["agentgateway-proxy"]
}
]
},
"topologyKey": "kubernetes.io/hostname"
}
}
]
}
}
}
}
}
}
}
}'

Why `ScheduleAnyway` instead of `DoNotSchedule`:

- `DoNotSchedule` can prevent scaling if no valid node/zone is available
- `ScheduleAnyway` is a best-effort spread — the scheduler tries to spread but won't block scheduling
- Use `DoNotSchedule` only if you have nodes in 3+ zones and can guarantee capacity in each
Verify pods are spread:
kubectl get pods -n agentgateway-system -l app.kubernetes.io/name=agentgateway-proxy \
-o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,ZONE:.metadata.labels.topology\\.kubernetes\\.io/zone

This procedure combines everything above into a safe, zero-downtime upgrade of Enterprise AgentGateway.
# 1. Verify PDBs are in place
kubectl get pdb -n agentgateway-system
# 2. Verify current replica count (recommend >= 2 for zero-downtime)
kubectl get deploy agentgateway-proxy -n agentgateway-system -o jsonpath='{.spec.replicas}'
# 3. Check current version
kubectl get pods -n agentgateway-system -l app.kubernetes.io/name=agentgateway-proxy \
-o jsonpath='{.items[0].spec.containers[0].image}'
# 4. Verify all pods are healthy
kubectl get pods -n agentgateway-system
# 5. Check no reconciliation errors
kubectl port-forward -n agentgateway-system deploy/enterprise-agentgateway 9092:9092 &
sleep 2
curl -s http://localhost:9092/metrics | grep 'result="error"'
kill %1 2>/dev/null

# 1. Upgrade Helm release (or update image tag)
# The rolling update will respect PDBs and graceful shutdown
helm upgrade enterprise-agentgateway solo/enterprise-agentgateway \
--namespace agentgateway-system \
--version <new-version> \
--reuse-values
# 2. Watch the rollout — pods are replaced one at a time (PDB enforced)
kubectl rollout status deployment/enterprise-agentgateway -n agentgateway-system --timeout=300s
kubectl rollout status deployment/agentgateway-proxy -n agentgateway-system --timeout=300s
# 3. Verify all pods are on the new version
kubectl port-forward -n agentgateway-system deploy/agentgateway-proxy 15020:15020 &
sleep 2
curl -s http://localhost:15020/metrics | grep agentgateway_build_info
kill %1 2>/dev/null

Watch for these during the rolling update:
# In a separate terminal — watch for errors during rollout
kubectl logs -f deploy/agentgateway-proxy -n agentgateway-system --since=5m | \
jq -r 'select(.level == "ERROR" or .level == "WARN") | "\(.timestamp) \(.level) \(.message)"'

Key metrics to watch in Grafana during the upgrade:
- `agentgateway_build_info` — should show old and new versions during the rollout, then only the new version
- `agentgateway_requests_total{status=~"5.."}` — error rate should not spike
- `agentgateway_xds_connection_terminations` — expect `Reconnect` reasons as proxies restart, but no `ConnectionError`
- `kgateway_controller_reconciliations_total{result="error"}` — should remain at 0
Remove the alerting rules, HPA, PDBs, and monitors if no longer needed:
kubectl delete prometheusrule agentgateway-alerts -n monitoring
kubectl delete hpa agentgateway-proxy -n agentgateway-system
kubectl delete pdb agentgateway-proxy enterprise-agentgateway -n agentgateway-system
kubectl delete podmonitor control-plane-monitoring-agentgateway-metrics -n agentgateway-system
kubectl delete servicemonitor rate-limiter-monitoring-agentgateway-metrics -n agentgateway-system