# k8s-e2e-verifier
A Kubernetes operator that creates and manages e2e test jobs for Flagger canary deployments.
## Overview
The k8s-e2e-verifier is a webhook server designed to work with [Flagger](https://flagger.app/) pre-rollout webhooks. It automates end-to-end testing before canary deployments are promoted, ensuring that only verified releases reach production.
### How it works
1. **Flagger triggers webhook**: When a canary deployment begins, Flagger calls the e2e-verifier webhook with details about the canary (name, namespace, checksum, phase)
2. **Check for existing job**: The verifier looks for an existing Kubernetes Job based on the canary name and checksum
   - If a job exists and **succeeded**: Returns HTTP 200 (test passed)
   - If a job exists and **failed**: Returns HTTP 500 (test failed)
   - If a job exists and is **running**: Returns HTTP 202 (test in progress)
3. **Create new job**: If no job exists, the verifier:
   - Fetches the E2ESpec resource for the canary
   - Creates a Kubernetes Job based on the spec
   - Returns HTTP 202 (test started)
4. **Flagger retries**: Flagger will retry the webhook on non-2xx responses, allowing the verifier to check job status on subsequent calls
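
The decision logic in steps 2 and 3 can be sketched as a single mapping from job state to HTTP status. This is a minimal illustration in Go, not the verifier's actual code: the `JobState` type, its constants, and `statusFor` are hypothetical names, and the 404 branch follows the Response Codes table later in this README.

```go
package main

import (
	"fmt"
	"net/http"
)

// JobState is a hypothetical summary of a Job's status. A real handler
// would derive it from the Job object returned by the Kubernetes API.
type JobState int

const (
	JobMissing JobState = iota // no Job exists for this canary and checksum
	JobRunning
	JobSucceeded
	JobFailed
)

// statusFor maps the state of the e2e Job to the HTTP status the webhook
// returns. specFound only matters when no Job exists yet.
func statusFor(state JobState, specFound bool) int {
	switch state {
	case JobSucceeded:
		return http.StatusOK // 200: test passed
	case JobFailed:
		return http.StatusInternalServerError // 500: test failed
	case JobRunning:
		return http.StatusAccepted // 202: test in progress
	default:
		if !specFound {
			return http.StatusNotFound // 404: no E2ESpec for the canary
		}
		// A real handler would create the Job here before responding.
		return http.StatusAccepted // 202: test started
	}
}

func main() {
	fmt.Println(statusFor(JobMissing, true))   // 202
	fmt.Println(statusFor(JobSucceeded, true)) // 200
	fmt.Println(statusFor(JobFailed, true))    // 500
	fmt.Println(statusFor(JobMissing, false))  // 404
}
```

Keeping the mapping in a pure function like this makes it testable without a cluster; the Kubernetes lookups can stay in the HTTP handler.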
## Architecture
```
┌─────────────┐         ┌──────────────────┐         ┌─────────────────┐
│   Flagger   │────────>│   e2e-verifier   │────────>│   E2ESpec CRD   │
│   Canary    │ webhook │ (this operator)  │  read   │  (test config)  │
└─────────────┘         └──────────────────┘         └─────────────────┘
                                 │ create/check
                                 v
                         ┌──────────────┐
                         │  Kubernetes  │
                         │  Job (e2e    │
                         │ test runner) │
                         └──────────────┘
```
## Custom Resource Definition
The operator uses a custom `E2ESpec` CRD to define the test job configuration:
```yaml
apiVersion: remarkable.com/v1
kind: E2ESpec
metadata:
  name: my-app # Must match your canary name
  namespace: default
spec:
  # Container image for the e2e test
  image: gcr.io/my-project/my-app-e2e:latest
  # Optional: Override image entrypoint
  command: ["/bin/sh"]
  args: ["-c", "npm test"]
  # Environment variables (webhook metadata is automatically added)
  env:
    - name: API_ENDPOINT
      value: "http://my-app-canary:8080"
  # Mount secrets and configmaps
  secrets:
    - name: test-credentials
      mountPath: /var/secrets
  # Job configuration
  backoffLimit: 1
  activeDeadlineSeconds: 600
  ttlSecondsAfterFinished: 3600
  # Resources
  resources:
    requests:
      memory: "256Mi"
      cpu: "100m"
```
See [examples/e2espec-example.yaml](examples/e2espec-example.yaml) for complete examples.
## Installation
### 1. Install the CRD
```bash
kubectl apply -f cmd/k8s-e2e-verifier/e2espec-crd.yaml
```
### 2. Deploy the operator
```bash
kubectl apply -f cmd/k8s-e2e-verifier/examples/deployment.yaml
```
This creates:
- ServiceAccount with necessary RBAC permissions
- Deployment running the webhook server
- Service exposing the webhook endpoint
### 3. Create an E2ESpec for your application
```bash
kubectl apply -f cmd/k8s-e2e-verifier/examples/e2espec-example.yaml
```
### 4. Configure Flagger Canary
Add a pre-rollout webhook to your Flagger Canary:
```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: my-app
spec:
  # ... your canary config ...
  analysis:
    webhooks:
      - name: "e2e-verification"
        type: pre-rollout
        url: http://e2e-verifier.default/webhook
        timeout: 15m
        retries: 3
```
See [examples/flagger-integration.yaml](examples/flagger-integration.yaml) for a complete example.
## Webhook Payload
The operator expects the standard Flagger webhook payload:
```json
{
  "name": "my-app",
  "namespace": "default",
  "phase": "Progressing",
  "checksum": "85d557f47b",
  "metadata": {
    "eventMessage": "Canary promotion starting",
    "eventType": "Normal",
    "timestamp": "1640000000000"
  }
}
```
These fields are automatically injected into the test job as environment variables:
- `CANARY_NAME`
- `CANARY_NAMESPACE`
- `CANARY_PHASE`
- `CANARY_CHECKSUM`
- `EVENT_MESSAGE`
- `EVENT_TYPE`
- `EVENT_TIMESTAMP`
## Response Codes
The webhook returns different HTTP status codes based on the job status:
| Status Code | Meaning | Flagger Action |
|------------|---------|----------------|
| 200 | Test succeeded | Continue rollout |
| 202 | Test created or still running | Retry webhook later |
| 404 | E2ESpec resource not found | Halt rollout (error) |
| 500 | Test failed | Halt and potentially rollback |
## Job Naming
Jobs are named using the pattern: `e2e-{canary-name}-{checksum}`
For example: `e2e-my-app-85d557f47b`
The checksum ensures that each unique canary configuration gets its own test job.
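
A small Go sketch of the naming pattern follows. The `jobName` function is a hypothetical helper, and the length check is an assumption added for illustration: Kubernetes limits Job names to 63 characters, but this README does not say how the verifier handles names that exceed the limit.

```go
package main

import "fmt"

// maxJobNameLen is the 63-character limit Kubernetes applies to Job names.
const maxJobNameLen = 63

// jobName builds the Job name from the canary name and checksum using the
// e2e-{canary-name}-{checksum} pattern, rejecting names that are too long.
func jobName(canary, checksum string) (string, error) {
	name := fmt.Sprintf("e2e-%s-%s", canary, checksum)
	if len(name) > maxJobNameLen {
		return "", fmt.Errorf("job name %q is %d characters, limit is %d",
			name, len(name), maxJobNameLen)
	}
	return name, nil
}

func main() {
	name, err := jobName("my-app", "85d557f47b")
	fmt.Println(name, err) // e2e-my-app-85d557f47b <nil>
}
```

Because the name is a pure function of canary name and checksum, repeated webhook calls for the same revision resolve to the same Job, which is what lets the verifier look up an existing run instead of starting a new one.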
## Configuration
The operator supports the following environment variables:
| Variable | Description | Default |
|----------|-------------|---------|
| `USE_IN_CLUSTER` | Use in-cluster Kubernetes config | `true` |
| `PORT` | HTTP server port | `8080` |
Standard server configuration is handled via the reMarkable `cloud/lib/v7/server` package.
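
For illustration only, reading these two variables with their documented defaults might look like the sketch below. It uses just the Go standard library rather than the `cloud/lib/v7/server` package, whose API is not shown here; `Config` and `loadConfig` are hypothetical names.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Config holds the two settings documented in the table above.
type Config struct {
	UseInCluster bool
	Port         int
}

// loadConfig reads settings through getenv (os.Getenv in production, a
// fake in tests) and falls back to the documented defaults when unset.
func loadConfig(getenv func(string) string) (Config, error) {
	cfg := Config{UseInCluster: true, Port: 8080}

	if v := getenv("USE_IN_CLUSTER"); v != "" {
		b, err := strconv.ParseBool(v)
		if err != nil {
			return cfg, fmt.Errorf("USE_IN_CLUSTER: %w", err)
		}
		cfg.UseInCluster = b
	}

	if v := getenv("PORT"); v != "" {
		p, err := strconv.Atoi(v)
		if err != nil {
			return cfg, fmt.Errorf("PORT: %w", err)
		}
		cfg.Port = p
	}

	return cfg, nil
}

func main() {
	cfg, err := loadConfig(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("in-cluster=%t port=%d\n", cfg.UseInCluster, cfg.Port)
}
```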
## RBAC Requirements
The operator requires the following permissions:
**For the operator:**
- `remarkable.com/e2especs`: get, list, watch
- `batch/jobs`: get, list, watch, create, update, patch, delete
- `batch/jobs/status`: get
**For test jobs (optional):**
Configure based on what your e2e tests need to access in the cluster. See the example `e2e-test-runner` ServiceAccount in [examples/deployment.yaml](examples/deployment.yaml).
## Examples
### Simple smoke test
```yaml
apiVersion: remarkable.com/v1
kind: E2ESpec
metadata:
  name: simple-test
spec:
  image: alpine:latest
  command: ["/bin/sh"]
  args: ["-c", "echo 'Test passed' && exit 0"]
```
### Comprehensive e2e test with secrets and resources
```yaml
apiVersion: remarkable.com/v1
kind: E2ESpec
metadata:
  name: my-app
spec:
  image: gcr.io/my-project/my-app-e2e:v1.0.0
  imagePullPolicy: Always
  env:
    - name: TEST_ENVIRONMENT
      value: "staging"
    - name: DB_PASSWORD
      valueFrom:
        secretKeyRef:
          name: db-credentials
          key: password
  envFrom:
    - secretRef:
        name: test-secrets
    - configMapRef:
        name: test-config
  secrets:
    - name: service-account-key
      mountPath: /var/secrets/google
  backoffLimit: 2
  activeDeadlineSeconds: 600
  ttlSecondsAfterFinished: 3600
  serviceAccountName: e2e-test-runner
  resources:
    requests:
      memory: "512Mi"
      cpu: "250m"
    limits:
      memory: "1Gi"
      cpu: "1000m"
```
## Development
### Building
```bash
# From repository root
go build -o bin/k8s-e2e-verifier ./cmd/k8s-e2e-verifier
```
### Testing locally
To test locally without deploying to a cluster:
```bash
# Set environment variables
export USE_IN_CLUSTER=false
export PORT=8080
# Run the server
./bin/k8s-e2e-verifier
```
Then send a test webhook:
```bash
curl -X POST http://localhost:8080/webhook \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-app",
    "namespace": "default",
    "phase": "Progressing",
    "checksum": "test123",
    "metadata": {
      "eventMessage": "Test",
      "eventType": "Normal",
      "timestamp": "1640000000000"
    }
  }'
```
## Troubleshooting
### Check operator logs
```bash
kubectl logs -l app=e2e-verifier -n default --tail=100 -f
```
### Check job status
```bash
# List all e2e jobs
kubectl get jobs -l remarkable.com/managed-by=e2e-verifier
# Check specific job
kubectl describe job e2e-my-app-85d557f47b
# View job logs
kubectl logs job/e2e-my-app-85d557f47b
```
### Common issues
**E2ESpec not found**
- Ensure the E2ESpec name matches your canary name
- Verify it's in the correct namespace
- Check CRD is installed: `kubectl get crd e2especs.remarkable.com`
**Job stuck in pending**
- Check resource availability
- Verify ServiceAccount exists
- Check image pull secrets
**Job fails immediately**
- Check job logs: `kubectl logs job/e2e-{name}-{checksum}`
- Verify secrets/configmaps exist
- Check resource limits
## License
See repository root for license information.