@jwmatthews
Last active February 16, 2026 23:21

EKS to AKS Migration Pain Points: Real-World Problems and Solutions

Version: 1.0
Last Updated: February 2026
Target Audience: Platform Engineers, DevOps Teams, Migration Specialists


Executive Summary

Migrating Kubernetes workloads from Amazon EKS to Azure AKS appears straightforward: both are managed Kubernetes services running the same core platform. However, cloud provider-specific integrations, CSI drivers, networking models, and authentication mechanisms create significant friction points that can cause application failures post-migration.

This document catalogs real-world migration pain points encountered when moving stateful and stateless workloads from EKS to AKS, with detailed remediation strategies, code examples, and automated detection patterns for migration tooling.

Key Takeaways

  • Identity & Access: IRSA vs Workload Identity requires application-level changes
  • Storage: Different CSI drivers, performance characteristics, and access modes
  • Networking: Security Groups for Pods don't translate to Network Policies
  • Secrets: AWS Secrets Manager vs Azure Key Vault require different CSI configurations
  • Ingress: ALB-specific features need AGIC or nginx equivalents
  • Observability: CloudWatch vs Azure Monitor have different collection mechanisms
  • Cost: Different pricing models for storage, networking, and compute

Table of Contents

  1. Identity and Authentication
  2. Persistent Storage
  3. Secrets Management
  4. Ingress and Load Balancing
  5. Networking and Security
  6. Observability and Logging
  7. Container Registry
  8. Backup and Disaster Recovery
  9. Compute and Node Configuration
  10. Service Mesh Integration
  11. Database-Specific Integrations
  12. GitOps and CI/CD
  13. Detection Patterns
  14. Migration Strategies
  15. Quick Reference Tables

1. Identity and Authentication

Pain Point: IAM Roles for Service Accounts (IRSA) → Workload Identity

Severity: 🔴 High - Application breaking
Frequency: Very Common
Impact: Authentication failures, unable to access cloud resources

The Problem

EKS uses IRSA to provide AWS credentials to pods via service account annotations. This integrates seamlessly with AWS SDK libraries. AKS uses Azure Workload Identity (the successor to the now-deprecated AAD Pod Identity), which has a completely different configuration model.

EKS Configuration (Works)

apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3-reader
  namespace: production
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/prod-s3-reader-role
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: document-processor
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: doc-processor
  template:
    metadata:
      labels:
        app: doc-processor
    spec:
      serviceAccountName: s3-reader
      containers:
      - name: processor
        image: myregistry/doc-processor:v1.2.3
        env:
        - name: AWS_REGION
          value: us-east-1
        - name: S3_BUCKET
          value: production-documents
        - name: AWS_DEFAULT_REGION
          value: us-east-1

Application Code (Python):

import boto3

# This just works - AWS SDK automatically uses IRSA credentials
s3_client = boto3.client('s3')
response = s3_client.list_objects_v2(Bucket='production-documents')

After Migration to AKS (Broken)

# Pod starts but fails at runtime
kubectl logs document-processor-7d9f8b5c4-x8k2m

# Output:
# botocore.exceptions.NoCredentialsError: Unable to locate credentials
# or
# botocore.exceptions.ClientError: An error occurred (InvalidAccessKeyId) when calling the ListObjectsV2 operation

Root Cause Analysis

  1. Service Account annotation is AWS-specific - AKS doesn't recognize eks.amazonaws.com/role-arn
  2. OIDC provider is different - EKS OIDC endpoint vs Azure AD
  3. Token format differs - AWS STS tokens vs Azure AD tokens
  4. SDK credential chain changes - AWS SDK won't find Azure credentials automatically
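A quick in-pod diagnostic can confirm which identity webhook, if any, has mutated the pod. This sketch keys on the environment variables each webhook injects (IRSA injects AWS_WEB_IDENTITY_TOKEN_FILE; Azure Workload Identity injects AZURE_FEDERATED_TOKEN_FILE); the function name is illustrative:

```python
import os

def detect_identity_mechanism(env=None) -> str:
    """Report which cloud identity webhook has configured this pod, if any."""
    env = os.environ if env is None else env
    if "AWS_WEB_IDENTITY_TOKEN_FILE" in env:
        return "irsa"                # injected by the EKS pod identity webhook
    if "AZURE_FEDERATED_TOKEN_FILE" in env:
        return "workload-identity"   # injected by the Azure Workload Identity webhook
    return "none"                    # no webhook ran - expect credential errors

if __name__ == "__main__":
    print(detect_identity_mechanism())
```

Running this inside a migrated pod that prints "none" confirms the pod label or service account annotation is missing before you start debugging the SDK.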

Solution A: Migrate to Azure Blob Storage

Prerequisites:

  1. Create Azure Storage Account
  2. Create Managed Identity with Storage Blob Data Contributor role
  3. Set up Workload Identity federation

Azure Configuration:

# Create storage account
az storage account create \
  --name prodstorageacct \
  --resource-group production-rg \
  --location eastus \
  --sku Standard_ZRS

# Create container
az storage container create \
  --name documents \
  --account-name prodstorageacct

# Create managed identity
az identity create \
  --name doc-processor-identity \
  --resource-group production-rg

# Get identity client ID
IDENTITY_CLIENT_ID=$(az identity show \
  --name doc-processor-identity \
  --resource-group production-rg \
  --query clientId -o tsv)

# Assign storage permissions
az role assignment create \
  --role "Storage Blob Data Contributor" \
  --assignee $IDENTITY_CLIENT_ID \
  --scope /subscriptions/<subscription-id>/resourceGroups/production-rg/providers/Microsoft.Storage/storageAccounts/prodstorageacct

AKS Configuration:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: blob-reader
  namespace: production
  annotations:
    azure.workload.identity/client-id: "12345678-1234-1234-1234-123456789012"
    azure.workload.identity/tenant-id: "87654321-4321-4321-4321-210987654321"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: document-processor
  namespace: production
  labels:
    azure.workload.identity/use: "true"  # Optional here; the required label is on the pod template
spec:
  replicas: 3
  selector:
    matchLabels:
      app: doc-processor
  template:
    metadata:
      labels:
        app: doc-processor
        azure.workload.identity/use: "true"  # Required on pod!
    spec:
      serviceAccountName: blob-reader
      containers:
      - name: processor
        image: myregistry/doc-processor:v2.0.0  # Updated image
        env:
        - name: AZURE_STORAGE_ACCOUNT_NAME
          value: prodstorageacct
        - name: AZURE_STORAGE_CONTAINER_NAME
          value: documents
        # Note: No explicit credentials - Workload Identity handles it

Application Code Changes (Python):

# NEW: Azure SDK instead of boto3
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# DefaultAzureCredential automatically uses Workload Identity
credential = DefaultAzureCredential()

blob_service_client = BlobServiceClient(
    account_url="https://prodstorageacct.blob.core.windows.net",
    credential=credential
)

container_client = blob_service_client.get_container_client("documents")

# List blobs (equivalent to S3 list_objects_v2)
blob_list = container_client.list_blobs()
for blob in blob_list:
    print(f"Blob name: {blob.name}")

Testing:

# Verify workload identity is working
# (kubectl 1.24+ removed the --serviceaccount flag; set it via --overrides)
kubectl run -it --rm debug \
  --image=mcr.microsoft.com/azure-cli \
  --labels=azure.workload.identity/use=true \
  --overrides='{"spec":{"serviceAccountName":"blob-reader"}}' \
  -- bash

# Inside pod:
az login --identity
az storage blob list \
  --account-name prodstorageacct \
  --container-name documents \
  --auth-mode login

Solution B: Keep S3, Add Cross-Cloud Authentication

Use Case: Multi-cloud strategy, data residency, or gradual migration

apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3-reader
  namespace: production
  annotations:
    azure.workload.identity/client-id: "12345678-1234-1234-1234-123456789012"
---
apiVersion: v1
kind: Secret
metadata:
  name: aws-credentials
  namespace: production
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "AKIA..."
  AWS_SECRET_ACCESS_KEY: "wJalrXUtn..."
  # OR use Azure Key Vault CSI to inject these
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: document-processor
spec:
  template:
    spec:
      serviceAccountName: s3-reader
      containers:
      - name: processor
        image: myregistry/doc-processor:v1.2.3
        envFrom:
        - secretRef:
            name: aws-credentials
        env:
        - name: AWS_REGION
          value: us-east-1

Better Approach: Federate the AKS cluster's OIDC issuer with an AWS IAM role, so pods can call sts:AssumeRoleWithWebIdentity using their projected service account tokens (no static keys)

# Get the AKS cluster's OIDC issuer URL (requires --enable-oidc-issuer)
AKS_OIDC_ISSUER=$(az aks show \
  --name myAKSCluster \
  --resource-group production-rg \
  --query "oidcIssuerProfile.issuerUrl" -o tsv)

# In AWS: register this issuer as an IAM OIDC identity provider, then add a
# trust policy on prod-s3-reader-role allowing sts:AssumeRoleWithWebIdentity
# for subject system:serviceaccount:production:s3-reader
# (Complex setup - beyond scope; validate token audiences carefully)

Migration Checklist

  • Inventory all ServiceAccounts with eks.amazonaws.com/* annotations
  • Identify AWS SDK usage in application code
  • Decide: Migrate to Azure services or maintain cross-cloud access
  • Create Azure Managed Identities
  • Set up Workload Identity federation
  • Update application code to use Azure SDKs (if migrating to Azure services)
  • Update Kubernetes manifests with Azure annotations
  • Test authentication in non-production environment
  • Update CI/CD pipelines to build new container images
  • Document credential management changes
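The first inventory step in the checklist above can be automated. A minimal sketch that scans local manifest files for AWS-specific annotations (the file layout and helper names are assumptions):

```python
import re
from pathlib import Path

# Matches AWS-specific annotations (e.g. eks.amazonaws.com/role-arn)
# that AKS will silently ignore.
AWS_ANNOTATION = re.compile(r"eks\.amazonaws\.com/[\w.-]+")

def find_aws_annotations(manifest_text: str) -> list:
    """Return the distinct eks.amazonaws.com/* annotation keys in a manifest."""
    return sorted(set(AWS_ANNOTATION.findall(manifest_text)))

def scan_manifests(root: str) -> dict:
    """Map each YAML file under root to the AWS annotations it contains."""
    findings = {}
    for path in list(Path(root).rglob("*.yaml")) + list(Path(root).rglob("*.yml")):
        hits = find_aws_annotations(path.read_text())
        if hits:
            findings[str(path)] = hits
    return findings
```

Point `scan_manifests` at a checkout of your GitOps repo to get a per-file list of service accounts that need Azure annotations.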

Common Pitfalls

  1. Forgetting pod label: azure.workload.identity/use: "true" must be on the pod template - the mutating webhook matches pods, so labeling only the Deployment object does nothing
  2. Token expiration: Azure AD tokens have different lifetimes than AWS STS tokens
  3. SDK version: Older Azure SDK versions don't support Workload Identity
  4. Regional endpoints: Azure Storage URLs differ from S3 URLs
  5. Permissions model: Azure RBAC roles vs AWS IAM policies have different granularity

2. Persistent Storage

Pain Point 1: EBS Storage Classes → Azure Disk

Severity: 🔴 High - Application won't start
Frequency: Universal (every stateful app)
Impact: PVCs stuck in Pending, StatefulSets won't deploy

The Problem

EBS-specific StorageClasses don't exist in AKS. Different provisioners, parameters, and performance tiers require manifest updates.

EKS Configuration (Works)

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "16000"
  throughput: "1000"
  encrypted: "true"
  kmsKeyId: arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
  namespace: database
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 500Gi
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: database
spec:
  serviceName: postgres
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:15
        ports:
        - containerPort: 5432
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 500Gi

After Migration to AKS (Broken)

kubectl get pvc -n database
# NAME                    STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS
# postgres-data           Pending                                      fast-ssd

kubectl describe pvc postgres-data -n database
# Events:
#   Warning  ProvisioningFailed  storageclass.storage.k8s.io "fast-ssd" not found

Solution: Azure Disk StorageClass

Performance Tier Mapping:

| EBS Type          | IOPS    | Throughput | Azure Disk Equivalent | SKU             | IOPS    | Throughput |
|-------------------|---------|------------|-----------------------|-----------------|---------|------------|
| gp3 (baseline)    | 3,000   | 125 MB/s   | Premium SSD v2        | PremiumV2_LRS   | 3,000   | 125 MB/s   |
| gp3 (16k IOPS)    | 16,000  | 1,000 MB/s | Ultra Disk            | UltraSSD_LRS    | 16,000+ | 1,000 MB/s |
| io2 Block Express | 256,000 | 4,000 MB/s | Ultra Disk            | UltraSSD_LRS    | 160,000 | 4,000 MB/s |
| st1 (throughput)  | 500     | 500 MB/s   | Standard SSD          | StandardSSD_LRS | varies  | ~60 MB/s   |
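Migration tooling can encode this mapping as a lookup. A heuristic sketch based on the tiers above (the IOPS cutoff is an illustrative assumption, not a documented Azure limit):

```python
# Heuristic EBS volume type -> Azure Disk SKU lookup, per the mapping table.
EBS_TO_AZURE_SKU = {
    "gp3": "PremiumV2_LRS",      # baseline gp3 -> Premium SSD v2
    "io2": "UltraSSD_LRS",       # io2 Block Express -> Ultra Disk
    "st1": "StandardSSD_LRS",    # throughput-optimized HDD -> Standard SSD
}

def map_storage_sku(ebs_type: str, iops: int = 3000) -> str:
    """Pick an Azure Disk SKU for an EBS volume (sketch; verify against quotas)."""
    # Illustrative cutoff: very high provisioned gp3 IOPS suggests Ultra Disk.
    if ebs_type == "gp3" and iops > 80000:
        return "UltraSSD_LRS"
    return EBS_TO_AZURE_SKU.get(ebs_type, "StandardSSD_LRS")
```

Treat the output as a starting point; validate the chosen SKU against the target region's availability and the node pool's VM sizes.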

AKS Configuration:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: disk.csi.azure.com
parameters:
  skuName: UltraSSD_LRS        # For high IOPS requirement
  cachingMode: None            # Ultra Disk doesn't support host caching
  DiskIOPSReadWrite: "16000"   # Provisioned IOPS - a StorageClass parameter
  DiskMBpsReadWrite: "1000"    # for Ultra Disk, not a PVC annotation
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
  namespace: database
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 500Gi

Important Notes:

  1. Ultra Disk requires specific VM sizes - Not all Azure VM SKUs support Ultra Disk

    # Check if node pool supports Ultra Disk
    az aks nodepool show \
      --resource-group myResourceGroup \
      --cluster-name myAKSCluster \
      --name nodepool1 \
      --query "enableUltraSsd"
  2. No disk encryption by default - Must use Azure Disk Encryption Set

    parameters:
      skuName: UltraSSD_LRS
      diskEncryptionSetID: /subscriptions/.../diskEncryptionSets/myDES
  3. Cost implications - Ultra Disk is significantly more expensive

    • Pay for provisioned IOPS and throughput, not just capacity
    • Consider Premium SSD v2 for better cost/performance balance

Alternative: Premium SSD v2 (Cost-Effective)

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: disk.csi.azure.com
parameters:
  skuName: PremiumV2_LRS
  cachingMode: None            # Premium SSD v2 doesn't support host caching
  DiskIOPSReadWrite: "10000"   # Provisioned performance, set in the
  DiskMBpsReadWrite: "500"     # StorageClass parameters
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
  namespace: database
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 500Gi

Migration Steps for Existing Data

  1. Option A: Velero Backup/Restore (Volume Data)

    # In EKS
    velero backup create postgres-backup \
      --include-namespaces database \
      --snapshot-volumes
    
    # In AKS (after setting up new StorageClass)
    velero restore create postgres-restore \
      --from-backup postgres-backup
  2. Option B: Database-Native Dump/Restore

    # In EKS - Dump database (note: the shell redirect runs on your local
    # machine, so the dump lands locally, not inside the pod)
    kubectl exec -n database postgres-0 -- \
      pg_dumpall -U postgres > ./postgres-dump.sql
    
    # In AKS - Copy the dump into the new pod and restore
    # (psql -f reads the file inside the pod; a "<" redirect would read locally)
    kubectl cp ./postgres-dump.sql database/postgres-0:/tmp/postgres-dump.sql
    kubectl exec -n database postgres-0 -- \
      psql -U postgres -f /tmp/postgres-dump.sql
  3. Option C: Continuous Replication (Zero Downtime)

    # Set up PostgreSQL streaming replication from EKS to AKS
    # Primary in EKS, Replica in AKS
    # Promote AKS replica to primary during cutover

Pain Point 2: EFS (ReadWriteMany) → Azure Files

Severity: 🟡 Medium - Depends on use case
Frequency: Common (20-30% of workloads)
Impact: Shared storage not available, multi-pod writes fail

The Problem

EFS provides NFS-based shared storage with ReadWriteMany access mode. Azure Files provides similar capability but with different performance characteristics, protocols (SMB vs NFS), and pricing.

EKS Configuration (Works)

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-storage
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-0123456789abcdef0
  directoryPerms: "700"
  gidRangeStart: "1000"
  gidRangeEnd: "2000"
  basePath: "/dynamic_provisioning"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-uploads
  namespace: web
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-storage
  resources:
    requests:
      storage: 100Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
  namespace: web
spec:
  replicas: 5  # Multiple pods share the volume
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
      - name: nginx
        image: nginx:1.21
        volumeMounts:
        - name: uploads
          mountPath: /var/www/uploads
      volumes:
      - name: uploads
        persistentVolumeClaim:
          claimName: shared-uploads

After Migration to AKS (Broken)

kubectl get pvc -n web
# PVC pending - EFS driver not available

kubectl describe pvc shared-uploads -n web
# provisioner "efs.csi.aws.com" not found

Solution: Azure Files with NFS or SMB

Protocol Decision:

  • NFS 4.1: Better for Linux workloads, POSIX compliance, better performance
  • SMB 3.x: Better for Windows workloads, AD integration, in-transit encryption

Option 1: Azure Files with NFS (Recommended for Linux)

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-nfs
provisioner: file.csi.azure.com
parameters:
  protocol: nfs
  skuName: Premium_LRS  # NFS requires Premium tier
  # Network settings for better performance
  networkEndpointType: privateEndpoint  # Optional: for private access
mountOptions:
  - nconnect=4  # Parallel connections for better throughput
  - actimeo=30   # Attribute cache timeout
allowVolumeExpansion: true
volumeBindingMode: Immediate
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-uploads
  namespace: web
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: azurefile-nfs
  resources:
    requests:
      storage: 100Gi

Option 2: Azure Files with SMB

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-smb
provisioner: file.csi.azure.com
parameters:
  skuName: Standard_LRS  # Or Premium_LRS
  protocol: smb
  # Optional: Use existing storage account
  # storageAccount: mystorageaccount
  # resourceGroup: myResourceGroup
mountOptions:
  - dir_mode=0777
  - file_mode=0777
  - uid=33  # www-data user
  - gid=33
  - mfsymlinks  # Enable symlinks
  - cache=strict
  - actimeo=30
allowVolumeExpansion: true
volumeBindingMode: Immediate

Performance Comparison

| Metric         | EFS                   | Azure Files Premium (NFS) | Azure Files Standard (SMB) |
|----------------|-----------------------|---------------------------|----------------------------|
| Max throughput | 10 GB/s               | 10 GB/s                   | 60 MB/s per share          |
| Max IOPS       | 500,000+              | 100,000                   | 1,000-20,000               |
| Latency        | Low (single-digit ms) | Low (single-digit ms)     | Higher (varies)            |
| Min size       | No minimum            | 100 GiB                   | 1 GiB                      |
| Pricing model  | Pay per GB used       | Pay per GB provisioned    | Pay per GB used            |
| Bursting       | Yes                   | Yes                       | Limited                    |

Migration Gotchas

  1. File Permissions

    # EFS uses NFSv4 ACLs
    # Azure Files NFS uses NFSv4.1 - mostly compatible
    # Azure Files SMB uses NTFS ACLs - potential permission issues
    
    # Test file operations
    kubectl exec -it web-frontend-xxx -- touch /var/www/uploads/test.txt
    kubectl exec -it web-frontend-xxx -- ls -la /var/www/uploads/
  2. Symbolic Links

    # Azure Files SMB requires mfsymlinks mount option
    mountOptions:
      - mfsymlinks
  3. File Locking

    # EFS supports byte-range locking
    # Azure Files NFS: Full support
    # Azure Files SMB: Full support
    # Test your application's file locking behavior
  4. Case Sensitivity

    # EFS: Case-sensitive (Linux NFS)
    # Azure Files NFS: Case-sensitive
    # Azure Files SMB: Case-insensitive by default
    
    # This could break applications expecting case-sensitivity!
    touch /uploads/File.txt
    touch /uploads/file.txt  # Different files on EFS/NFS, same file on SMB
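Before committing to SMB, it is worth scanning existing EFS content for names that would collide case-insensitively. A minimal sketch (kept as a pure function so it can run over any list of relative paths; the function name is an assumption):

```python
from collections import defaultdict

def case_collisions(paths):
    """Group relative paths that would collide on a case-insensitive share."""
    groups = defaultdict(list)
    for p in paths:
        groups[p.lower()].append(p)
    # Only groups with more than one spelling are real collisions.
    return [g for g in groups.values() if len(g) > 1]

# Feed it e.g. a directory walk of the EFS mount:
#   paths = [str(p.relative_to(root)) for p in Path(root).rglob("*")]
```

Any non-empty result means data loss on copy to SMB; either rename the files or choose the NFS protocol.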

Data Migration Approaches

Option 1: Rsync Between Volumes

# Create sync pod with both volumes mounted
apiVersion: v1
kind: Pod
metadata:
  name: efs-to-azurefile-sync
  namespace: web
spec:
  containers:
  - name: sync
    image: instrumentisto/rsync-ssh:latest
    command: ["/bin/sh", "-c"]
    args:
      - |
        rsync -avz --progress \
          /source/ /destination/
        echo "Sync complete"
        sleep infinity
    volumeMounts:
    - name: source
      mountPath: /source
    - name: destination
      mountPath: /destination
  volumes:
  - name: source
    persistentVolumeClaim:
      claimName: efs-pvc  # EKS cluster - requires cross-cluster volume access
  - name: destination
    persistentVolumeClaim:
      claimName: azurefile-pvc  # AKS cluster

Option 2: AWS DataSync to S3, then AzCopy to Azure

# Use AWS DataSync to copy EFS data to S3
# Then use AzCopy to copy from S3 into Azure Blob Storage
# (AzCopy supports S3-to-Blob; it cannot copy from S3 straight to Azure Files)
azcopy copy \
  "https://my-bucket.s3.amazonaws.com/*" \
  "https://mystorageaccount.blob.core.windows.net/mycontainer" \
  --recursive

# Final hop: copy from Blob into the Azure Files share
azcopy copy \
  "https://mystorageaccount.blob.core.windows.net/mycontainer/*" \
  "https://mystorageaccount.file.core.windows.net/myshare" \
  --recursive

Option 3: Application-Level Migration

# 1. Deploy application in AKS with empty Azure Files volume
# 2. Configure application to write to both EFS (in AWS) and Azure Files
# 3. Run backfill job to copy existing data
# 4. Switch application to read from Azure Files
# 5. Decommission EFS
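The dual-write phase (steps 2-4) can be sketched as a thin wrapper over the two mount points. The class and mount paths here are hypothetical illustrations, not a library API:

```python
from pathlib import Path

class DualWriter:
    """Write to both the old and new shared volumes during migration."""

    def __init__(self, primary_root: str, secondary_root: str):
        self.primary = Path(primary_root)      # e.g. the EFS mount (source of truth)
        self.secondary = Path(secondary_root)  # e.g. the Azure Files mount

    def write(self, rel_path: str, data: bytes) -> None:
        # Write to both volumes so the new share stays in sync with the old one.
        for root in (self.primary, self.secondary):
            target = root / rel_path
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(data)

    def read(self, rel_path: str) -> bytes:
        # Read from the primary until cutover; flip this after backfill completes.
        return (self.primary / rel_path).read_bytes()
```

After the backfill job finishes, switch `read` to the secondary, watch error rates, then drop the dual write.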

3. Secrets Management

Pain Point: AWS Secrets Manager CSI → Azure Key Vault CSI

Severity: 🔴 High - Application won't start
Frequency: Very Common (80%+ of secure applications)
Impact: Secrets not available, authentication failures

The Problem

Applications using AWS Secrets Manager via the Secrets Store CSI Driver need reconfiguration to use Azure Key Vault. The SecretProviderClass CRD has completely different parameters.

EKS Configuration (Works)

apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: application-secrets
  namespace: production
spec:
  provider: aws
  parameters:
    objects: |
      - objectName: "production/database/postgres"
        objectType: "secretsmanager"
        objectAlias: "db-password"
      - objectName: "production/api/jwt-secret"
        objectType: "secretsmanager"
        objectAlias: "jwt-key"
      - objectName: "production/ssl/certificate"
        objectType: "secretsmanager"
        objectAlias: "ssl-cert"
  secretObjects:  # Auto-create Kubernetes Secrets
  - secretName: db-credentials
    type: Opaque
    data:
    - objectName: db-password
      key: password
  - secretName: jwt-credentials
    type: Opaque
    data:
    - objectName: jwt-key
      key: secret
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  namespace: production
spec:
  template:
    spec:
      serviceAccountName: api-sa  # Has IRSA permissions
      containers:
      - name: api
        image: myapi:v1.0
        env:
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: password
        - name: JWT_SECRET
          valueFrom:
            secretKeyRef:
              name: jwt-credentials
              key: secret
        volumeMounts:
        - name: secrets
          mountPath: "/mnt/secrets"
          readOnly: true
      volumes:
      - name: secrets
        csi:
          driver: secrets-store.csi.k8s.io
          readOnly: true
          volumeAttributes:
            secretProviderClass: "application-secrets"

After Migration to AKS (Broken)

kubectl get pods -n production
# NAME                          READY   STATUS              RESTARTS   AGE
# api-server-6d8f9c5b4-abc123   0/1     ContainerCreating   0          5m

kubectl describe pod api-server-6d8f9c5b4-abc123 -n production
# Events:
#   Warning  FailedMount  MountVolume.SetUp failed for volume "secrets" : 
#   rpc error: code = Unknown desc = failed to mount secrets store objects for pod: 
#   provider "aws" not found

Solution: Azure Key Vault CSI Driver

Prerequisites:

# 1. Enable Azure Key Vault Provider for Secrets Store CSI Driver
az aks enable-addons \
  --addons azure-keyvault-secrets-provider \
  --name myAKSCluster \
  --resource-group myResourceGroup

# 2. Create Azure Key Vault
az keyvault create \
  --name prodappvault \
  --resource-group production-rg \
  --location eastus

# 3. Create Managed Identity for workload
az identity create \
  --name api-server-identity \
  --resource-group production-rg

# 4. Grant Key Vault access
IDENTITY_CLIENT_ID=$(az identity show \
  --name api-server-identity \
  --resource-group production-rg \
  --query clientId -o tsv)

az keyvault set-policy \
  --name prodappvault \
  --secret-permissions get list \
  --spn $IDENTITY_CLIENT_ID

Migrate Secrets:

# Export from AWS Secrets Manager
aws secretsmanager get-secret-value \
  --secret-id production/database/postgres \
  --query SecretString \
  --output text > db-password.txt

# Import to Azure Key Vault
az keyvault secret set \
  --vault-name prodappvault \
  --name db-password \
  --file db-password.txt

# Repeat for other secrets
az keyvault secret set \
  --vault-name prodappvault \
  --name jwt-secret \
  --value "$(aws secretsmanager get-secret-value --secret-id production/api/jwt-secret --query SecretString --output text)"

AKS Configuration:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: api-sa
  namespace: production
  annotations:
    azure.workload.identity/client-id: "12345678-1234-1234-1234-123456789012"
    azure.workload.identity/tenant-id: "87654321-4321-4321-4321-210987654321"
---
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: application-secrets
  namespace: production
spec:
  provider: azure
  parameters:
    usePodIdentity: "false"
    useVMManagedIdentity: "false"
    clientID: "12345678-1234-1234-1234-123456789012"  # Managed Identity Client ID
    keyvaultName: "prodappvault"
    cloudName: ""  # Empty for Azure Public Cloud
    objects: |
      array:
        - |
          objectName: db-password
          objectType: secret
          objectAlias: db-password
        - |
          objectName: jwt-secret
          objectType: secret
          objectAlias: jwt-key
        - |
          objectName: ssl-certificate
          objectType: secret
          objectAlias: ssl-cert
    tenantId: "87654321-4321-4321-4321-210987654321"
  secretObjects:  # Create Kubernetes Secrets (same as before)
  - secretName: db-credentials
    type: Opaque
    data:
    - objectName: db-password
      key: password
  - secretName: jwt-credentials
    type: Opaque
    data:
    - objectName: jwt-key
      key: secret
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  namespace: production
  labels:
    azure.workload.identity/use: "true"
spec:
  template:
    metadata:
      labels:
        azure.workload.identity/use: "true"
    spec:
      serviceAccountName: api-sa
      containers:
      - name: api
        image: myapi:v1.0
        env:
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: password
        - name: JWT_SECRET
          valueFrom:
            secretKeyRef:
              name: jwt-credentials
              key: secret
        volumeMounts:
        - name: secrets
          mountPath: "/mnt/secrets"
          readOnly: true
      volumes:
      - name: secrets
        csi:
          driver: secrets-store.csi.k8s.io
          readOnly: true
          volumeAttributes:
            secretProviderClass: "application-secrets"

Advanced: Auto-Rotation

The Secrets Store CSI Driver does not rotate mounted secrets by default on either cloud.

On AKS, rotation is enabled on the Key Vault provider add-on, not in the SecretProviderClass:

# Enable rotation and set the polling interval on the add-on
az aks addon update \
  --name myAKSCluster \
  --resource-group myResourceGroup \
  --addon azure-keyvault-secrets-provider \
  --enable-secret-rotation \
  --rotation-poll-interval 2m

With rotation enabled, the driver polls Key Vault at the configured interval and updates mounted files and synced Kubernetes Secrets when values change. Pods that consume secrets via environment variables still need a restart to pick up new values.

Common Issues

  1. Token Expiration

    # Workload Identity tokens expire
    # Symptoms: "AuthenticationFailed" after ~24 hours
    # Solution: Ensure pod has correct labels
    azure.workload.identity/use: "true"
  2. Permission Errors

    # Error: "Caller is not authorized to perform action"
    # Check Key Vault access policies
    az keyvault show --name prodappvault --query properties.accessPolicies
    
    # Grant missing permissions
    az keyvault set-policy \
      --name prodappvault \
      --object-id <managed-identity-object-id> \
      --secret-permissions get list
  3. Secret Not Syncing

    # Check CSI driver logs
    kubectl logs -n kube-system -l app=secrets-store-csi-driver
    
    # Check provider logs
    kubectl logs -n kube-system -l app=csi-secrets-store-provider-azure

4. Ingress and Load Balancing

Pain Point: AWS ALB Ingress Controller → Azure Application Gateway / nginx

Severity: 🟡 Medium - Functionality degraded
Frequency: Very Common
Impact: Lost features (SSL, redirects, WAF), different costs

The Problem

AWS ALB Ingress Controller annotations don't work on AKS. Features like SSL termination, HTTP-to-HTTPS redirects, health checks, and WAF integration need reconfiguration.

EKS Configuration (Works)

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: production-ingress
  namespace: production
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    # SSL Certificate from ACM
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:123456789012:certificate/abc-def-ghi
    alb.ingress.kubernetes.io/ssl-policy: ELBSecurityPolicy-TLS-1-2-2017-01
    # HTTP to HTTPS redirect
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
    alb.ingress.kubernetes.io/actions.ssl-redirect: |
      {"Type": "redirect", "RedirectConfig": {
        "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"
      }}
    # Health check configuration
    alb.ingress.kubernetes.io/healthcheck-path: /health
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: "15"
    alb.ingress.kubernetes.io/healthcheck-timeout-seconds: "5"
    alb.ingress.kubernetes.io/success-codes: "200"
    # Access logs
    alb.ingress.kubernetes.io/load-balancer-attributes: access_logs.s3.enabled=true,access_logs.s3.bucket=my-alb-logs
    # WAF
    alb.ingress.kubernetes.io/wafv2-acl-arn: arn:aws:wafv2:us-east-1:123456789012:regional/webacl/MyWAF/a1b2c3d4
spec:
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: ssl-redirect
            port:
              name: use-annotation
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 8080

After Migration to AKS (Degraded)

# Ingress created but:
# - No ALB (falls back to nginx or nothing)
# - No SSL termination
# - No HTTP redirect
# - No WAF
# - No custom health checks
# - Different cost model

kubectl get ingress -n production
# NAME                 CLASS   HOSTS              ADDRESS   PORTS   AGE
# production-ingress   <none>  api.example.com              80      5m

Solution Option 1: Azure Application Gateway Ingress Controller (AGIC)

Most similar to ALB, enterprise features

Prerequisites:

# Create Application Gateway
az network application-gateway create \
  --name prodAppGateway \
  --resource-group production-rg \
  --location eastus \
  --sku WAF_v2 \
  --capacity 2 \
  --vnet-name aksVNet \
  --subnet appgw-subnet \
  --public-ip-address appgw-pip

# Enable AGIC addon on AKS
az aks enable-addons \
  --name myAKSCluster \
  --resource-group production-rg \
  --addon ingress-appgw \
  --appgw-id /subscriptions/.../resourceGroups/production-rg/providers/Microsoft.Network/applicationGateways/prodAppGateway

AKS Configuration:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: production-ingress
  namespace: production
  annotations:
    kubernetes.io/ingress.class: azure/application-gateway
    # SSL Certificate from Azure Key Vault
    appgw.ingress.kubernetes.io/appgw-ssl-certificate: "api-example-com-cert"
    # HTTP to HTTPS redirect
    appgw.ingress.kubernetes.io/ssl-redirect: "true"
    # Backend protocol
    appgw.ingress.kubernetes.io/backend-protocol: "http"
    # Health probe
    appgw.ingress.kubernetes.io/health-probe-path: "/health"
    appgw.ingress.kubernetes.io/health-probe-interval: "15"
    appgw.ingress.kubernetes.io/health-probe-timeout: "5"
    appgw.ingress.kubernetes.io/health-probe-unhealthy-threshold: "3"
    # WAF Policy
    appgw.ingress.kubernetes.io/waf-policy-for-path: "/subscriptions/.../resourceGroups/production-rg/providers/Microsoft.Network/applicationGatewayWebApplicationFirewallPolicies/prodWAF"
    # Connection draining
    appgw.ingress.kubernetes.io/connection-draining: "true"
    appgw.ingress.kubernetes.io/connection-draining-timeout: "30"
spec:
  tls:
  - hosts:
    - api.example.com
    secretName: api-tls-secret  # Certificate must be in Key Vault and referenced
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 8080

Certificate Setup:

# Import certificate to Key Vault
az keyvault certificate import \
  --vault-name prodappvault \
  --name api-example-com-cert \
  --file certificate.pfx \
  --password "cert-password"

# Grant Application Gateway access
az keyvault set-policy \
  --name prodappvault \
  --spn <appgw-identity> \
  --secret-permissions get \
  --certificate-permissions get

WAF Configuration:

# Create WAF policy
az network application-gateway waf-policy create \
  --name prodWAF \
  --resource-group production-rg \
  --location eastus

# Configure OWASP rules
az network application-gateway waf-policy managed-rule rule-set add \
  --policy-name prodWAF \
  --resource-group production-rg \
  --type OWASP \
  --version 3.2

Solution Option 2: nginx Ingress Controller (Most Portable)

Better for multi-cloud, more mature, larger community

# Install nginx ingress controller
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace \
  --set controller.service.annotations."service\.beta\.kubernetes\.io/azure-load-balancer-health-probe-request-path"=/healthz

AKS Configuration:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: production-ingress
  namespace: production
  annotations:
    kubernetes.io/ingress.class: nginx
    # SSL redirect
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    # Force SSL
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    # Certificate management via cert-manager
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    # Rate limiting
    nginx.ingress.kubernetes.io/limit-rps: "100"
    # Custom health check
    nginx.ingress.kubernetes.io/health-check-path: "/health"
    # CORS
    nginx.ingress.kubernetes.io/enable-cors: "true"
    nginx.ingress.kubernetes.io/cors-allow-origin: "https://example.com"
spec:
  tls:
  - hosts:
    - api.example.com
    secretName: api-tls-secret  # Auto-provisioned by cert-manager
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 8080

cert-manager Setup (for automated SSL):

# Install cert-manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml

# Create ClusterIssuer for Let's Encrypt
cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          class: nginx
EOF

Feature Comparison

| Feature | AWS ALB | Azure App Gateway (AGIC) | nginx Ingress |
|---|---|---|---|
| SSL Termination | ACM | Key Vault | cert-manager/manual |
| WAF | AWS WAF | Azure WAF | ModSecurity (addon) |
| Path-based routing | ✓ | ✓ | ✓ |
| HTTP redirects | ✓ | ✓ | ✓ |
| Header manipulation | Limited | ✓ | ✓ (extensive) |
| Rate limiting | Via WAF | Via WAF | ✓ (native) |
| Canary deployments | Via target groups | Via backend pools | ✓ (native) |
| mTLS | ✓ | ✓ | ✓ |
| Cost | Pay per hour + LCU | Pay per hour + capacity | Free (infra only) |
| Multi-cloud | AWS only | Azure only | Any cloud |
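For bulk conversions of existing Ingress manifests, the annotation mapping implied by this comparison can be scripted. A minimal Python sketch (the `ALB_TO_AGIC` mapping and `translate_annotations` helper are illustrative, not part of any tool; verify each target annotation against the AGIC documentation before relying on it):

```python
# Sketch: translate common AWS ALB ingress annotations to their AGIC
# equivalents. Covers only the annotations discussed in this section;
# anything ALB-specific that has no clean mapping is flagged for review.
ALB_TO_AGIC = {
    "alb.ingress.kubernetes.io/healthcheck-path":
        "appgw.ingress.kubernetes.io/health-probe-path",
    "alb.ingress.kubernetes.io/healthcheck-interval-seconds":
        "appgw.ingress.kubernetes.io/health-probe-interval",
    "alb.ingress.kubernetes.io/healthcheck-timeout-seconds":
        "appgw.ingress.kubernetes.io/health-probe-timeout",
}

def translate_annotations(annotations):
    """Return (translated, unmapped) annotation dicts."""
    translated, unmapped = {}, {}
    for key, value in annotations.items():
        if key in ALB_TO_AGIC:
            translated[ALB_TO_AGIC[key]] = value
        elif key.startswith("alb.ingress.kubernetes.io/"):
            # WAF ARNs, listener actions, etc. need manual mapping
            unmapped[key] = value
        else:
            translated[key] = value  # non-ALB annotations pass through
    return translated, unmapped
```

Running this over each Ingress and reviewing the `unmapped` bucket gives a quick inventory of the manual work each manifest needs.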

5. Networking and Security

Pain Point: VPC CNI Security Groups for Pods β†’ Network Policies

Severity: πŸ”΄ High - Security controls lost
Frequency: Common in regulated industries
Impact: Pod-level network isolation not available

The Problem

EKS allows assigning AWS Security Groups directly to pods via the VPC CNI plugin. AKS uses standard Kubernetes Network Policies, which have different capabilities and granularity.

EKS Configuration (Works)

# Custom Resource for Security Group Policy
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: database-pod-sg
  namespace: database
spec:
  podSelector:
    matchLabels:
      app: postgres
      tier: database
  securityGroups:
    groupIds:
      - sg-0a1b2c3d4e5f6g7h8  # Only allows 5432 from app tier SG
      - sg-1a2b3c4d5e6f7g8h9  # Allows SSH from bastion SG
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: database
spec:
  serviceName: postgres
  replicas: 3
  selector:
    matchLabels:
      app: postgres
      tier: database
  template:
    metadata:
      labels:
        app: postgres
        tier: database
    spec:
      containers:
      - name: postgres
        image: postgres:15
        ports:
        - containerPort: 5432
          name: postgres
# Pod automatically gets dedicated ENI with security group sg-0a1b2c3d4e5f6g7h8

AWS Security Group Rules (defined in AWS):

# sg-0a1b2c3d4e5f6g7h8 - Database Security Group
# Inbound:
#   - Port 5432 from sg-app-tier-xyz (application pods)
#   - Port 5432 from sg-bastion-abc (admin access)
# Outbound:
#   - Port 5432 to sg-0a1b2c3d4e5f6g7h8 (cluster communication)

After Migration to AKS (No Security!)

# SecurityGroupPolicy CRD doesn't exist
kubectl get securitygrouppolicy -n database
# error: the server doesn't have a resource type "securitygrouppolicy"

# Pods have no network restrictions
# All pods can communicate with all pods!

Solution: Kubernetes Network Policies + Azure Network Policy Manager

Enable Azure Network Policy:

# When creating cluster
az aks create \
  --resource-group production-rg \
  --name myAKSCluster \
  --network-plugin azure \
  --network-policy azure  # or "calico"

# For existing cluster (requires recreation of node pools)
az aks update \
  --resource-group production-rg \
  --name myAKSCluster \
  --network-policy azure

AKS Configuration:

# Default deny all ingress traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: database
spec:
  podSelector: {}
  policyTypes:
  - Ingress
---
# Allow specific ingress to PostgreSQL
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: postgres-allow-from-app
  namespace: database
spec:
  podSelector:
    matchLabels:
      app: postgres
      tier: database
  policyTypes:
  - Ingress
  - Egress
  ingress:
  # Allow from application tier
  - from:
    - namespaceSelector:
        matchLabels:
          name: application
      podSelector:
        matchLabels:
          tier: application
    ports:
    - protocol: TCP
      port: 5432
  # Allow from monitoring
  - from:
    - namespaceSelector:
        matchLabels:
          name: monitoring
    ports:
    - protocol: TCP
      port: 9187  # postgres_exporter
  # Allow from same namespace (replica communication)
  - from:
    - podSelector:
        matchLabels:
          app: postgres
    ports:
    - protocol: TCP
      port: 5432
  egress:
  # Allow DNS
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
  # Allow PostgreSQL replication
  - to:
    - podSelector:
        matchLabels:
          app: postgres
    ports:
    - protocol: TCP
      port: 5432
---
# Label namespaces for network policy
apiVersion: v1
kind: Namespace
metadata:
  name: application
  labels:
    name: application
---
apiVersion: v1
kind: Namespace
metadata:
  name: database
  labels:
    name: database

Key Differences: Security Groups vs Network Policies

| Aspect | AWS Security Groups | K8s Network Policies |
|---|---|---|
| Scope | ENI (pod gets own network interface) | Pod-to-pod |
| Statefulness | Stateful (return traffic automatic) | Varies by CNI plugin |
| IP-based rules | Can reference external IPs | Can reference IP blocks (CIDR) |
| Cloud integration | Native AWS (RDS, ELB, etc.) | Kubernetes-only |
| Management | AWS Console/API/Terraform | Kubernetes manifests |
| Performance | Enforced at VPC level (hardware) | Enforced at node level (software) |
| Granularity | Per-ENI (can be per-pod) | Per-pod only |
| Cost | No additional cost | No additional cost |

Advanced: Azure Network Security Groups (NSGs) for Nodes

For node-level security (not pod-level):

# Create NSG for AKS nodes
az network nsg create \
  --resource-group production-rg \
  --name aks-node-nsg

# Add rules (source prefix 10.240.1.0/24 = app tier subnet)
az network nsg rule create \
  --resource-group production-rg \
  --nsg-name aks-node-nsg \
  --name allow-postgres-from-app-nodes \
  --priority 100 \
  --source-address-prefixes 10.240.1.0/24 \
  --destination-port-ranges 5432 \
  --access Allow \
  --protocol Tcp

# Associate with subnet
az network vnet subnet update \
  --resource-group production-rg \
  --vnet-name aksVNet \
  --name database-subnet \
  --network-security-group aks-node-nsg

Limitation: NSGs apply to ALL pods on a node, not individual pods like Security Groups for Pods

Migration Strategy

  1. Inventory Security Groups

    # List all SecurityGroupPolicies in EKS
    kubectl get securitygrouppolicy --all-namespaces -o yaml > eks-sg-policies.yaml
  2. Map to Network Policies

    • Security Group β†’ Network Policy (pod selector)
    • Security Group rules β†’ Ingress/Egress rules
    • Source Security Groups β†’ Namespace/Pod selectors
  3. Test Thoroughly

    # Test connectivity between pods
    kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- /bin/bash
    # Inside pod:
    nc -zv postgres-0.postgres.database.svc.cluster.local 5432
  4. Use Policy Recipes and Validation

    # Browse ready-made Network Policy examples for common scenarios:
    # https://github.com/ahmetb/kubernetes-network-policy-recipes
    # Server-side dry-run a policy before applying it:
    kubectl apply --dry-run=server -f postgres-allow-from-app.yaml
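The mapping in step 2 can be scripted to produce starting points. A minimal sketch, assuming the policies were exported with `kubectl get securitygrouppolicy -A -o json` (the `sgp_to_networkpolicy_skeleton` helper and the annotation key are illustrative, not part of any tool):

```python
# Sketch: turn one exported SecurityGroupPolicy object into a NetworkPolicy
# skeleton. The actual ingress/egress rules still have to be transcribed by
# hand from the corresponding AWS security group rules.
def sgp_to_networkpolicy_skeleton(sgp):
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {
            "name": sgp["metadata"]["name"].removesuffix("-sg") + "-netpol",
            "namespace": sgp["metadata"]["namespace"],
            "annotations": {
                # Record the source SGs so rules can be transcribed later
                "migration/source-security-groups":
                    ",".join(sgp["spec"]["securityGroups"]["groupIds"]),
            },
        },
        "spec": {
            "podSelector": sgp["spec"].get("podSelector", {}),
            "policyTypes": ["Ingress", "Egress"],
            "ingress": [],  # TODO: transcribe inbound SG rules
            "egress": [],   # TODO: transcribe outbound SG rules
        },
    }
```

Emitting these skeletons with a deny-by-default spec forces each rule to be reviewed rather than silently dropped.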

6. Observability and Logging

Pain Point: CloudWatch Container Insights β†’ Azure Monitor

Severity: 🟑 Medium - Operational visibility
Frequency: Universal
Impact: Different query language, metrics, alerting

The Problem

EKS integrates with CloudWatch for logs and metrics. AKS uses Azure Monitor with different collection mechanisms, query languages (KQL vs CloudWatch Insights), and pricing models.

EKS Configuration (Works)

FluentBit DaemonSet for CloudWatch:

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: amazon-cloudwatch
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush                     5
        Grace                     30
        Daemon                    Off
        Log_Level                 info
    
    [INPUT]
        Name                      tail
        Path                      /var/log/containers/*.log
        Parser                    docker
        Tag                       kube.*
        DB                        /var/fluent-bit/state/flb_kube.db
        Mem_Buf_Limit             5MB
        Skip_Long_Lines           On
        Refresh_Interval          10
    
    [FILTER]
        Name                      kubernetes
        Match                     kube.*
        Kube_URL                  https://kubernetes.default.svc.cluster.local:443
        Merge_Log                 On
        Keep_Log                  Off
        K8S-Logging.Parser        On
        K8S-Logging.Exclude       On
    
    [OUTPUT]
        Name                      cloudwatch_logs
        Match                     *
        region                    us-east-1
        log_group_name            /aws/eks/production-cluster/application
        log_stream_prefix         from-fluent-bit-
        auto_create_group         true
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: amazon-cloudwatch
spec:
  selector:
    matchLabels:
      name: fluent-bit
  template:
    metadata:
      labels:
        name: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      containers:
      - name: fluent-bit
        image: amazon/aws-for-fluent-bit:latest
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: fluent-bit-config
          mountPath: /fluent-bit/etc/
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: fluent-bit-config
        configMap:
          name: fluent-bit-config

Querying in CloudWatch Insights:

fields @timestamp, @message
| filter kubernetes.namespace_name = "production"
| filter kubernetes.labels.app = "api-server"
| filter @message like /ERROR/
| stats count() by bin(5m)

After Migration to AKS (No Logs)

# Logs not reaching any destination
# CloudWatch not accessible from Azure
# Need to reconfigure entire logging pipeline

Solution: Azure Monitor Container Insights

Enable Container Insights:

# Create Log Analytics Workspace
az monitor log-analytics workspace create \
  --resource-group production-rg \
  --workspace-name prodLogAnalytics \
  --location eastus

# Enable on AKS cluster
az aks enable-addons \
  --resource-group production-rg \
  --name myAKSCluster \
  --addons monitoring \
  --workspace-resource-id /subscriptions/<subscription-id>/resourceGroups/production-rg/providers/Microsoft.OperationalInsights/workspaces/prodLogAnalytics

This automatically deploys:

  • Azure Monitor agent (ama-logs, formerly the OMS Agent) DaemonSet that collects logs and metrics
  • Container Insights solution
  • Pre-configured workbooks and dashboards

Querying in Azure Monitor (KQL):

ContainerLogV2
| where TimeGenerated > ago(1h)
| where PodNamespace == "production"
| where PodName startswith "api-server"
| where LogMessage contains "ERROR"
| summarize count() by bin(TimeGenerated, 5m)
| render timechart

Query Translation Examples:

| CloudWatch Insights | Azure Monitor (KQL, ContainerLogV2 schema) |
|---|---|
| `fields @timestamp, @message` | `project TimeGenerated, LogMessage` |
| `filter kubernetes.namespace = "prod"` | `where PodNamespace == "prod"` |
| `filter @message like /ERROR/` | `where LogMessage contains "ERROR"` |
| `stats count() by bin(5m)` | `summarize count() by bin(TimeGenerated, 5m)` |
| `sort @timestamp desc` | `sort by TimeGenerated desc` |
| `limit 100` | `take 100` |
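The translation table is mechanical enough to script for simple queries. An illustrative sketch (the rule list covers only the constructs in the table; anything else passes through unchanged and needs manual review):

```python
# Sketch: rewrite simple CloudWatch Logs Insights constructs into KQL.
# Purely regex-based; a real translator would parse the query grammar.
import re

RULES = [
    (r"\bfields @timestamp, @message\b", "project TimeGenerated, LogMessage"),
    (r'\bfilter kubernetes\.namespace = "([^"]+)"', r'where PodNamespace == "\1"'),
    (r"\bfilter @message like /([^/]+)/", r'where LogMessage contains "\1"'),
    (r"\bstats count\(\) by bin\((\d+\w)\)",
     r"summarize count() by bin(TimeGenerated, \1)"),
    (r"\bsort @timestamp desc\b", "sort by TimeGenerated desc"),
    (r"\blimit (\d+)\b", r"take \1"),
]

def cwl_to_kql(query):
    """Apply each rewrite rule in order; unmatched text is left as-is."""
    for pattern, replacement in RULES:
        query = re.sub(pattern, replacement, query)
    return query
```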

Advanced: Custom Metrics

In EKS (CloudWatch Custom Metrics):

import boto3
cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_data(
    Namespace='Production/API',
    MetricData=[
        {
            'MetricName': 'RequestDuration',
            'Value': 123.45,
            'Unit': 'Milliseconds',
            'Dimensions': [
                {'Name': 'Endpoint', 'Value': '/api/users'},
                {'Name': 'StatusCode', 'Value': '200'}
            ]
        }
    ]
)

In AKS (Azure Monitor Custom Metrics):

from azure.monitor.ingestion import LogsIngestionClient
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
client = LogsIngestionClient(
    endpoint="https://prodLogAnalytics.eastus-1.ingest.monitor.azure.com",
    credential=credential
)

# Send custom logs
client.upload(
    rule_id="/subscriptions/.../dataCollectionRules/myDCR",
    stream_name="Custom-RequestMetrics",
    logs=[
        {
            "TimeGenerated": "2024-02-16T10:00:00Z",
            "Endpoint": "/api/users",
            "Duration": 123.45,
            "StatusCode": 200
        }
    ]
)

Alerting Configuration

EKS (CloudWatch Alarms):

aws cloudwatch put-metric-alarm \
  --alarm-name high-error-rate \
  --alarm-description "Alert when error rate > 5%" \
  --metric-name Errors \
  --namespace AWS/EKS \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 100 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:critical-alerts

AKS (Azure Monitor Alerts):

# Create a metric alert (covers platform metrics such as CPU; for error
# rates, use the KQL-based log alert below)
az monitor metrics alert create \
  --name high-cpu-usage \
  --resource-group production-rg \
  --scopes /subscriptions/.../resourceGroups/production-rg/providers/Microsoft.ContainerService/managedClusters/myAKSCluster \
  --condition "avg Percentage CPU > 80" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --action /subscriptions/.../resourceGroups/production-rg/providers/microsoft.insights/actionGroups/critical-alerts

Or using KQL-based log alerts:

az monitor scheduled-query create \
  --name high-error-rate-log \
  --resource-group production-rg \
  --scopes /subscriptions/.../workspaces/prodLogAnalytics \
  --condition "count 'ErrorCount' > 100" \
  --condition-query ErrorCount="ContainerLogV2 | where LogMessage contains 'ERROR'" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --action /subscriptions/.../actionGroups/critical-alerts

Cost Comparison

| Feature | CloudWatch | Azure Monitor |
|---|---|---|
| Log ingestion | $0.50/GB | $2.76/GB (first 5 GB/month free) |
| Log storage | $0.03/GB/month | Included for 31 days, $0.12/GB/month after |
| Custom metrics | $0.30/metric | Native metrics included; $0.60/metric custom |
| Queries | $0.005/GB scanned | Included |
| Data export | $0.09/GB | $0.13/GB |
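A back-of-the-envelope comparison using the list prices above. This is a sketch with deliberately simplified retention modeling; actual pricing varies by region, tier, and commitment discounts:

```python
# Sketch: rough monthly log cost (USD) on each platform for a given daily
# ingestion volume, using the list prices from the table above.
def monthly_log_cost(gb_per_day, retention_months=3):
    gb_month = gb_per_day * 30
    # CloudWatch: $0.50/GB ingestion + $0.03/GB-month storage
    cloudwatch = gb_month * 0.50 + gb_month * retention_months * 0.03
    # Azure Monitor: $2.76/GB after the 5 GB/month free allowance;
    # first 31 days of retention included, then $0.12/GB-month
    azure = max(0.0, gb_month - 5) * 2.76
    if retention_months > 1:
        azure += gb_month * (retention_months - 1) * 0.12
    return {"cloudwatch": round(cloudwatch, 2), "azure": round(azure, 2)}
```

At typical production volumes, ingestion (not storage) dominates on both platforms, which is why log filtering before shipping pays off more than shorter retention.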

7. Container Registry

Pain Point: ECR β†’ Azure Container Registry (ACR)

Severity: 🟒 Low - Straightforward migration
Frequency: Universal
Impact: Image pulls fail until reconfigured

The Problem

Container images stored in Amazon ECR need to be migrated to ACR, and image pull secrets need updating.

EKS Configuration (Works)

apiVersion: v1
kind: Secret
metadata:
  name: ecr-registry
  namespace: production
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: <base64-encoded-ecr-credentials>
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  namespace: production
spec:
  template:
    spec:
      imagePullSecrets:
      - name: ecr-registry
      containers:
      - name: api
        image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/api-server:v1.2.3

Solution: Migrate Images to ACR

1. Create ACR:

az acr create \
  --resource-group production-rg \
  --name prodacr \
  --sku Premium \
  --location eastus

2. Enable ACR Integration with AKS:

# Attach ACR to AKS (automatic image pull)
az aks update \
  --resource-group production-rg \
  --name myAKSCluster \
  --attach-acr prodacr

3. Migrate Images:

# Login to both registries
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

az acr login --name prodacr

# Pull from ECR
docker pull 123456789012.dkr.ecr.us-east-1.amazonaws.com/api-server:v1.2.3

# Tag for ACR
docker tag \
  123456789012.dkr.ecr.us-east-1.amazonaws.com/api-server:v1.2.3 \
  prodacr.azurecr.io/api-server:v1.2.3

# Push to ACR
docker push prodacr.azurecr.io/api-server:v1.2.3

Automated migration script:

#!/bin/bash
set -euo pipefail

ECR_REGISTRY="123456789012.dkr.ecr.us-east-1.amazonaws.com"
ACR_REGISTRY="prodacr.azurecr.io"

# List all repositories in ECR
aws ecr describe-repositories --region us-east-1 --output json | \
  jq -r '.repositories[].repositoryName' | \
  while read -r repo; do
    # List all tags for the repository (// empty skips untagged images)
    aws ecr list-images --region us-east-1 --repository-name "$repo" --output json | \
      jq -r '.imageIds[].imageTag // empty' | \
      while read -r tag; do
        echo "Migrating $repo:$tag"

        # Pull from ECR
        docker pull "$ECR_REGISTRY/$repo:$tag"

        # Tag for ACR
        docker tag "$ECR_REGISTRY/$repo:$tag" "$ACR_REGISTRY/$repo:$tag"

        # Push to ACR
        docker push "$ACR_REGISTRY/$repo:$tag"

        # Clean up local images
        docker rmi "$ECR_REGISTRY/$repo:$tag" "$ACR_REGISTRY/$repo:$tag"
      done
  done

4. Update Manifests:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  namespace: production
spec:
  template:
    spec:
      # No imagePullSecrets needed when ACR is attached to AKS
      containers:
      - name: api
        image: prodacr.azurecr.io/api-server:v1.2.3  # Updated image reference

5. Update CI/CD Pipelines:

# GitHub Actions example
- name: Login to ACR
  uses: azure/docker-login@v1
  with:
    login-server: prodacr.azurecr.io
    username: ${{ secrets.ACR_USERNAME }}
    password: ${{ secrets.ACR_PASSWORD }}

- name: Build and push
  run: |
    docker build -t prodacr.azurecr.io/api-server:${{ github.sha }} .
    docker push prodacr.azurecr.io/api-server:${{ github.sha }}

Advanced: Geo-Replication

# Replicate to multiple regions for faster pulls
az acr replication create \
  --registry prodacr \
  --location westus2

az acr replication create \
  --registry prodacr \
  --location westeurope

8. Backup and Disaster Recovery

Pain Point: EBS Snapshots β†’ Azure Disk Snapshots

Severity: 🟑 Medium
Frequency: Common
Impact: Backup/restore processes need reconfiguration

The Problem

EBS snapshot-based backups (via tools like Velero) use AWS-specific APIs. Azure has different snapshot mechanisms.

Solution: Update Velero Configuration

EKS Velero Configuration:

apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: prod-velero-backups
    prefix: eks-cluster
  config:
    region: us-east-1
---
apiVersion: velero.io/v1
kind: VolumeSnapshotLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  config:
    region: us-east-1

AKS Velero Configuration:

apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: azure
  objectStorage:
    bucket: velero-backups  # Actually an Azure Blob container
    prefix: aks-cluster
  config:
    resourceGroup: production-rg
    storageAccount: prodvelarostorage
    subscriptionId: 12345678-1234-1234-1234-123456789012
---
apiVersion: velero.io/v1
kind: VolumeSnapshotLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: azure
  config:
    resourceGroup: production-rg
    subscriptionId: 12345678-1234-1234-1234-123456789012

Install Velero with Azure Plugin:

# Create storage account for backups
az storage account create \
  --name prodvelarostorage \
  --resource-group production-rg \
  --sku Standard_GRS \
  --encryption-services blob \
  --https-only true

# Create blob container
az storage container create \
  --name velero-backups \
  --account-name prodvelarostorage

# Install Velero
velero install \
  --provider azure \
  --plugins velero/velero-plugin-for-microsoft-azure:v1.9.0 \
  --bucket velero-backups \
  --secret-file ./credentials-velero \
  --backup-location-config resourceGroup=production-rg,storageAccount=prodvelarostorage,subscriptionId=12345678-1234-1234-1234-123456789012 \
  --snapshot-location-config resourceGroup=production-rg,subscriptionId=12345678-1234-1234-1234-123456789012

9. Compute and Node Configuration

Pain Point: EC2 Instance Types β†’ Azure VM Sizes

Severity: 🟒 Low - Configuration change
Frequency: Universal
Impact: Performance characteristics may differ

The Problem

Node pools configured for specific EC2 instance types don't exist in Azure. VM sizes have different names, capabilities, and pricing.

Instance Type Mapping

| EKS (EC2) | vCPU | Memory | AKS (Azure VM) | vCPU | Memory | Notes |
|---|---|---|---|---|---|---|
| t3.medium | 2 | 4 GiB | Standard_B2ms | 2 | 8 GiB | Burstable |
| m5.large | 2 | 8 GiB | Standard_D2s_v5 | 2 | 8 GiB | General purpose |
| m5.xlarge | 4 | 16 GiB | Standard_D4s_v5 | 4 | 16 GiB | General purpose |
| c5.xlarge | 4 | 8 GiB | Standard_F4s_v2 | 4 | 8 GiB | Compute optimized |
| r5.xlarge | 4 | 32 GiB | Standard_E4s_v5 | 4 | 32 GiB | Memory optimized |
| p3.2xlarge | 8 | 61 GiB + V100 | Standard_NC6s_v3 | 6 | 112 GiB + V100 | GPU |
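When automating node pool creation, the mapping above can live in a lookup table; unmapped types should fail loudly rather than silently defaulting to some VM size. The helper below is an illustrative sketch:

```python
# Sketch: EC2 instance type -> Azure VM size lookup from the table above.
# Extend with the types actually present in your fleet.
EC2_TO_AZURE_VM = {
    "t3.medium":  "Standard_B2ms",
    "m5.large":   "Standard_D2s_v5",
    "m5.xlarge":  "Standard_D4s_v5",
    "c5.xlarge":  "Standard_F4s_v2",
    "r5.xlarge":  "Standard_E4s_v5",
    "p3.2xlarge": "Standard_NC6s_v3",
}

def map_instance_type(ec2_type):
    """Return the Azure VM size for an EC2 type, or raise if unmapped."""
    try:
        return EC2_TO_AZURE_VM[ec2_type]
    except KeyError:
        raise ValueError(
            f"No mapping for {ec2_type}; compare vCPU/memory/family manually"
        ) from None
```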

EKS Node Group Configuration

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: production-cluster
  region: us-east-1
nodeGroups:
  - name: general-purpose
    instanceType: m5.xlarge
    desiredCapacity: 3
    minSize: 2
    maxSize: 10
    labels:
      workload-type: general
    taints:
      - key: workload-type
        value: general
        effect: NoSchedule

AKS Node Pool Configuration

az aks nodepool add \
  --resource-group production-rg \
  --cluster-name myAKSCluster \
  --name generalpurpose \
  --node-count 3 \
  --min-count 2 \
  --max-count 10 \
  --node-vm-size Standard_D4s_v5 \
  --labels workload-type=general \
  --node-taints workload-type=general:NoSchedule \
  --enable-cluster-autoscaler

Node Selectors and Tolerations

No changes needed - Kubernetes-native constructs work identically:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: compute-intensive-app
spec:
  template:
    spec:
      nodeSelector:
        workload-type: general
      tolerations:
      - key: "workload-type"
        operator: "Equal"
        value: "general"
        effect: "NoSchedule"
      containers:
      - name: app
        image: my-app:latest

10. Service Mesh Integration

Pain Point: AWS App Mesh β†’ Azure Service Mesh (Istio)

Severity: 🟑 Medium - Advanced use cases
Frequency: Uncommon
Impact: Service mesh configuration incompatible

The Problem

AWS App Mesh uses AWS-specific CRDs and control plane. Azure supports open-source service meshes (Istio, Linkerd, OSM).

Solution: Migrate to Istio on AKS

This is complex and beyond the scope of this document, but key considerations:

  1. Install Istio on AKS

    istioctl install --set profile=default  # Istio has no "production" profile; "default" is the recommended baseline
  2. Migrate Virtual Services

    • App Mesh VirtualServices β†’ Istio VirtualServices
    • Different syntax, similar concepts
  3. Update mTLS Configuration

    • App Mesh uses AWS Certificate Manager
    • Istio uses cert-manager or manual certificates
  4. Rewrite Traffic Policies


11. Database-Specific Integrations

Pain Point: RDS Integration β†’ Azure Database

Severity: 🟒 Low - If using managed databases
Frequency: Common
Impact: Connection strings, authentication

The Problem

Applications connecting to AWS RDS need connection string updates for Azure Database for PostgreSQL/MySQL.

EKS Application Configuration

apiVersion: v1
kind: Secret
metadata:
  name: db-connection
  namespace: production
stringData:
  host: "prod-postgres.c9akz82fkwix.us-east-1.rds.amazonaws.com"
  port: "5432"
  database: "production"
  username: "app_user"
  password: "secure-password"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      containers:
      - name: api
        env:
        - name: DB_HOST
          valueFrom:
            secretKeyRef:
              name: db-connection
              key: host
        - name: DB_PORT
          valueFrom:
            secretKeyRef:
              name: db-connection
              key: port
        # etc.

AKS Application Configuration

apiVersion: v1
kind: Secret
metadata:
  name: db-connection
  namespace: production
stringData:
  host: "prod-postgres.postgres.database.azure.com"  # Changed!
  port: "5432"
  database: "production"
  username: "app_user@prod-postgres"  # Single Server requires user@servername; Flexible Server uses the plain username
  password: "secure-password"
  # Optional: SSL parameters for Azure Database
  sslmode: "require"
---
# Rest of deployment unchanged

Additional Azure-specific considerations:

  1. SSL/TLS Required

    # Connection string must include SSL
    conn = psycopg2.connect(
        host="prod-postgres.postgres.database.azure.com",
        port=5432,
        database="production",
        user="app_user@prod-postgres",
        password="password",
        sslmode="require"
    )
  2. Firewall Rules

    # Allow AKS nodes to access Azure Database
    az postgres server firewall-rule create \
      --resource-group production-rg \
      --server-name prod-postgres \
      --name AllowAKSNodes \
      --start-ip-address 10.240.0.0 \
      --end-ip-address 10.240.255.255
  3. Private Endpoints (recommended)

    # Create private endpoint for database
    az network private-endpoint create \
      --name postgres-private-endpoint \
      --resource-group production-rg \
      --vnet-name aksVNet \
      --subnet database-subnet \
      --private-connection-resource-id /subscriptions/.../servers/prod-postgres \
      --group-id postgresqlServer \
      --connection-name postgres-connection

12. GitOps and CI/CD

Pain Point: CodePipeline/CodeBuild β†’ Azure DevOps/GitHub Actions

Severity: 🟒 Low - CI/CD reconfiguration
Frequency: Very Common
Impact: Build/deploy pipelines need rewriting

The Problem

AWS-native CI/CD tools (CodePipeline, CodeBuild, CodeDeploy) need replacement or reconfiguration.

Solutions

Option 1: GitHub Actions (Cloud-agnostic)

name: Deploy to AKS
on:
  push:
    branches: [ main ]

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    
    - name: Azure Login
      uses: azure/login@v1
      with:
        creds: ${{ secrets.AZURE_CREDENTIALS }}
    
    - name: Build and push image
      run: |
        az acr login --name prodacr
        docker build -t prodacr.azurecr.io/api-server:${{ github.sha }} .
        docker push prodacr.azurecr.io/api-server:${{ github.sha }}
    
    - name: Set AKS context
      uses: azure/aks-set-context@v3
      with:
        resource-group: production-rg
        cluster-name: myAKSCluster
    
    - name: Deploy to AKS
      uses: azure/k8s-deploy@v4
      with:
        manifests: |
          k8s/deployment.yaml
          k8s/service.yaml
        images: |
          prodacr.azurecr.io/api-server:${{ github.sha }}

Option 2: Azure DevOps

trigger:
  branches:
    include:
    - main

pool:
  vmImage: 'ubuntu-latest'

variables:
  acrName: 'prodacr'
  imageName: 'api-server'
  aksResourceGroup: 'production-rg'
  aksClusterName: 'myAKSCluster'

stages:
- stage: Build
  jobs:
  - job: BuildAndPush
    steps:
    - task: Docker@2
      inputs:
        containerRegistry: 'prodacr'
        repository: $(imageName)
        command: 'buildAndPush'
        Dockerfile: '**/Dockerfile'
        tags: |
          $(Build.BuildId)
          latest

- stage: Deploy
  jobs:
  - job: DeployToAKS
    steps:
    - task: KubernetesManifest@0
      inputs:
        action: 'deploy'
        kubernetesServiceConnection: 'myAKSCluster'
        namespace: 'production'
        manifests: |
          k8s/deployment.yaml
          k8s/service.yaml
        containers: |
          $(acrName).azurecr.io/$(imageName):$(Build.BuildId)

13. Detection Patterns for Migration Tools

Automated Discovery Rules

Tools like Konveyor should flag these patterns:

1. AWS-Specific Annotations

# PATTERN: EKS-specific annotations
annotations:
  eks.amazonaws.com/role-arn: *
  alb.ingress.kubernetes.io/*: *
  
# ACTION: Flag for Workload Identity or AGIC migration

2. AWS CSI Drivers

# PATTERN: AWS storage drivers
spec:
  csi:
    driver: ebs.csi.aws.com
    driver: efs.csi.aws.com
    
# ACTION: Suggest Azure Disk or Azure Files

3. AWS-Specific CRDs

# PATTERN: AWS-only Custom Resources
apiVersion: vpcresources.k8s.aws/*
# plus SecretProviderClass (secrets-store.csi.x-k8s.io) objects with provider: aws

# ACTION: Recommend Kubernetes Network Policies or Azure equivalents

4. Environment Variables

# PATTERN: AWS SDK environment variables
env:
- name: AWS_REGION
- name: AWS_DEFAULT_REGION
- name: AWS_ACCESS_KEY_ID

# ACTION: Warn about credential management changes

5. Hard-coded AWS Endpoints

# PATTERN: AWS service endpoints
env:
- name: S3_ENDPOINT
  value: "https://s3.us-east-1.amazonaws.com"
- name: SQS_URL
  value: "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"

# ACTION: Suggest Azure service equivalents

6. EC2 Metadata Usage

# PATTERN: Code accessing EC2 metadata
import requests
response = requests.get('http://169.254.169.254/latest/meta-data/')

# ACTION: Flag for Azure Instance Metadata Service (IMDS) migration (same link-local IP, but different paths and a required "Metadata: true" header)
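
The six rules above can be sketched as a simple text scanner. This is a minimal illustration, not Konveyor's actual rule engine — a real tool would parse YAML structurally rather than grep raw text, and the rule names here are invented for this example.

```python
import re

# Regexes covering the AWS-specific patterns listed above.
AWS_PATTERNS = {
    "irsa_annotation": re.compile(r"eks\.amazonaws\.com/role-arn"),
    "alb_annotation": re.compile(r"alb\.ingress\.kubernetes\.io/"),
    "aws_csi_driver": re.compile(r"\b(ebs|efs)\.csi\.aws\.com\b"),
    "aws_crd": re.compile(r"vpcresources\.k8s\.aws"),
    "aws_env_var": re.compile(r"\bAWS_(REGION|DEFAULT_REGION|ACCESS_KEY_ID|SECRET_ACCESS_KEY)\b"),
    "aws_endpoint": re.compile(r"https://(s3|sqs|sns|dynamodb)\.[a-z0-9-]+\.amazonaws\.com"),
    "ec2_metadata": re.compile(r"169\.254\.169\.254"),
}

def scan_manifest(text):
    """Return the names of AWS-specific patterns found in a manifest or source file."""
    return [name for name, rx in AWS_PATTERNS.items() if rx.search(text)]

manifest = """
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/app
spec:
  csi:
    driver: ebs.csi.aws.com
"""
print(scan_manifest(manifest))  # ['irsa_annotation', 'aws_csi_driver']
```

Running this over every manifest and source tree before migration produces a first-pass inventory of the items flagged in sections 1–6 above.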

14. Migration Strategies

Strategy 1: Lift-and-Shift (Fastest)

  1. Phase 1: Infrastructure (Week 1)

    • Create AKS cluster
    • Configure networking, storage classes
    • Set up Azure equivalents (ACR, Key Vault, etc.)
  2. Phase 2: Data Migration (Week 2)

    • Velero backup from EKS
    • Velero restore to AKS (data only)
    • Validate data integrity
  3. Phase 3: Application Deployment (Week 3)

    • Update manifests (storage classes, ingress, etc.)
    • Deploy via GitOps
    • Run smoke tests
  4. Phase 4: Cutover (Week 4)

    • DNS cutover
    • Decommission EKS

Pros: Fast, minimal code changes
Cons: Doesn't leverage Azure-native features, potential performance issues


Strategy 2: Blue-Green Cluster Migration (Safest)

  1. Phase 1: Build Green (AKS) (Weeks 1-2)

    • Parallel infrastructure build
    • Migrate data
  2. Phase 2: Validate Green (Week 3)

    • Run integration tests
    • Performance testing
    • Security validation
  3. Phase 3: Traffic Split (Week 4)

    • 10% traffic to AKS
    • Monitor for 48 hours
    • Increase to 50%, then 100%
  4. Phase 4: Decommission Blue (EKS) (Week 5)

    • Archive data
    • Terminate EKS

Pros: Safest, easy rollback
Cons: Highest cost (dual infrastructure), complex traffic splitting


Strategy 3: Incremental Migration (Most Controlled)

  1. Phase 1: Stateless Workloads (Weeks 1-3)

    • Migrate stateless apps first
    • Test in production with real traffic
  2. Phase 2: Stateful Non-Database (Weeks 4-6)

    • Redis, message queues
    • Can tolerate brief downtime
  3. Phase 3: Databases (Weeks 7-10)

    • Set up replication
    • Gradual cutover per database
  4. Phase 4: Cleanup (Week 11+)

    • Remove EKS resources
    • Optimize AKS

Pros: Lowest risk, learn as you go
Cons: Longest duration, complex coordination


15. Quick Reference Tables

Critical Path Items

| Category | EKS Component | AKS Equivalent | Migration Effort | Blocking? |
|------------|--------------------------|-------------------|---------------------------|------------|
| Auth | IRSA | Workload Identity | High (code changes) | 🔴 Yes |
| Storage | EBS CSI | Azure Disk CSI | Medium (manifests) | 🔴 Yes |
| Secrets | Secrets Manager CSI | Key Vault CSI | Medium (manifests + data) | 🔴 Yes |
| Ingress | ALB Controller | AGIC / nginx | Medium (manifests) | 🟡 Partial |
| Network | Security Groups for Pods | Network Policies | High (different model) | 🟡 Partial |
| Registry | ECR | ACR | Low (image migration) | 🔴 Yes |
| Monitoring | CloudWatch | Azure Monitor | Medium (queries) | 🟢 No |
| Backups | Velero (AWS) | Velero (Azure) | Low (config) | 🟢 No |

Pre-Migration Checklist

  • Inventory all AWS-specific annotations across all manifests
  • List all StatefulSets and PersistentVolumeClaims
  • Document all IRSA service accounts and their permissions
  • Export all AWS Secrets Manager secrets
  • List all ALB Ingresses and their annotations
  • Document CloudWatch dashboards and alerts
  • Map EC2 instance types to Azure VM sizes
  • Plan database migration strategy (native tools vs Velero)
  • Update CI/CD pipelines
  • Train team on Azure-specific tooling (KQL, Azure Portal)
  • Set up cost monitoring in Azure
  • Plan DNS cutover strategy
  • Define rollback procedures
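
The "document all IRSA service accounts" item above lends itself to scripting. This sketch parses the JSON from `kubectl get serviceaccounts -A -o json` and lists every IRSA-annotated account with its role ARN; the function name is ours, and a real inventory would also resolve each role's attached IAM policies.

```python
import json

IRSA_ANNOTATION = "eks.amazonaws.com/role-arn"

def find_irsa_accounts(sa_list_json):
    """Return (namespace, name, role ARN) for every IRSA-annotated ServiceAccount."""
    results = []
    for sa in json.loads(sa_list_json).get("items", []):
        meta = sa.get("metadata", {})
        arn = meta.get("annotations", {}).get(IRSA_ANNOTATION)
        if arn:
            results.append((meta.get("namespace", ""), meta.get("name", ""), arn))
    return results

# Sample output in the shape `kubectl get sa -A -o json` produces.
sample = json.dumps({"items": [
    {"metadata": {"namespace": "prod", "name": "api",
                  "annotations": {IRSA_ANNOTATION: "arn:aws:iam::123456789012:role/api"}}},
    {"metadata": {"namespace": "prod", "name": "default"}},
]})
print(find_irsa_accounts(sample))  # [('prod', 'api', 'arn:aws:iam::123456789012:role/api')]
```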

Testing Checklist

  • Pod authentication works (Workload Identity)
  • PVCs provision correctly (storage classes)
  • Secrets mount successfully (Key Vault CSI)
  • Ingress creates load balancer (AGIC/nginx)
  • Network policies block unauthorized traffic
  • Applications can connect to databases
  • Logs appear in Azure Monitor
  • Metrics are collected
  • Alerts fire correctly
  • Backups complete successfully
  • Load testing passes
  • Security scanning passes
  • Cost is within budget
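
The first testing-checklist item can be pre-validated statically. Azure Workload Identity needs two pieces of wiring: the `azure.workload.identity/client-id` annotation on the ServiceAccount and the `azure.workload.identity/use: "true"` label on the pod template. The checker below is a minimal sketch of that validation; the function name is ours.

```python
def check_workload_identity(service_account, pod_template):
    """Return a list of problems; an empty list means the wiring looks correct."""
    problems = []
    annotations = service_account.get("metadata", {}).get("annotations", {})
    if "azure.workload.identity/client-id" not in annotations:
        problems.append("ServiceAccount missing azure.workload.identity/client-id annotation")
    labels = pod_template.get("metadata", {}).get("labels", {})
    if labels.get("azure.workload.identity/use") != "true":
        problems.append('pod template missing label azure.workload.identity/use: "true"')
    return problems

sa = {"metadata": {"annotations": {
    "azure.workload.identity/client-id": "00000000-0000-0000-0000-000000000000"}}}
pod = {"metadata": {"labels": {"azure.workload.identity/use": "true"}}}
print(check_workload_identity(sa, pod))  # []
```

Running this across all migrated manifests catches the most common cause of the "Unable to locate credentials" failure described in Appendix A before anything is deployed.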

Appendix A: Common Error Messages

"Unable to locate credentials"

Cause: Application still expects IRSA credentials, but Workload Identity is not configured on AKS
Fix: Add Workload Identity annotations to ServiceAccount and pod labels

"storageclass.storage.k8s.io not found"

Cause: EBS StorageClass doesn't exist in AKS
Fix: Create Azure Disk or Azure Files StorageClass

"provider 'aws' not found"

Cause: AWS Secrets Store CSI provider not installed
Fix: Reconfigure SecretProviderClass for Azure

"MountVolume.SetUp failed"

Cause: Volume driver mismatch
Fix: Update CSI driver in PV/PVC specs


Appendix B: Cost Optimization

Storage Cost Comparison

| Scenario | EKS (EBS gp3) | AKS (Premium SSD) | Cost Difference |
|-------------------|---------------|-------------------|-----------------|
| 1 TB, 3000 IOPS | $80/month | $135/month | +69% on AKS |
| 1 TB, 10,000 IOPS | $145/month | $180/month | +24% on AKS |

Recommendation: Use Azure Premium SSD v2 for cost-effective high-IOPS workloads
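
Premium SSD v2 is billed per GiB of capacity, with the first 3000 IOPS and 125 MB/s included and only the overage billed. The calculator below illustrates that model; the rates passed in are placeholders for illustration only, not real prices — look up current rates for your region.

```python
def premium_ssd_v2_monthly(gib, iops, mbps, gib_rate, iops_rate, mbps_rate):
    """Monthly cost under the Premium SSD v2 billing model (rates are caller-supplied)."""
    free_iops, free_mbps = 3000, 125          # included baseline per disk
    cost = gib * gib_rate                     # capacity, billed per GiB
    cost += max(0, iops - free_iops) * iops_rate   # only IOPS above baseline are billed
    cost += max(0, mbps - free_mbps) * mbps_rate   # only throughput above baseline is billed
    return round(cost, 2)

# Hypothetical rates, for illustration only.
print(premium_ssd_v2_monthly(1024, 10000, 200,
                             gib_rate=0.08, iops_rate=0.005, mbps_rate=0.04))  # 119.92
```

Because capacity, IOPS, and throughput are provisioned independently, high-IOPS workloads avoid paying for capacity they don't need — the source of the savings claimed above.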

Compute Cost Comparison

| Workload | EKS (m5.xlarge) | AKS (D4s_v5) | Cost Difference |
|---------------------|-----------------|--------------|-----------------|
| 24/7 production | $122/month | $140/month | +15% on AKS |
| Dev/test (8 h/day) | $41/month | $47/month | +15% on AKS |

Note: Costs vary by region and commitment (Reserved Instances vs Spot)
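
A starting point for the instance-type mapping in the pre-migration checklist can be encoded as a lookup table. The pairs below match vCPU/memory shapes (e.g. m5.xlarge and D4s_v5 are both 4 vCPU / 16 GiB); this is a sketch, so verify pricing and regional availability before committing to any size.

```python
# Shape-matched EC2 -> Azure VM size pairs (vCPU / GiB noted per row).
EC2_TO_AZURE = {
    "m5.xlarge":  "Standard_D4s_v5",   # 4 vCPU / 16 GiB, general purpose
    "m5.2xlarge": "Standard_D8s_v5",   # 8 vCPU / 32 GiB
    "c5.xlarge":  "Standard_F4s_v2",   # 4 vCPU /  8 GiB, compute optimized
    "r5.xlarge":  "Standard_E4s_v5",   # 4 vCPU / 32 GiB, memory optimized
}

def map_instance_type(ec2_type):
    try:
        return EC2_TO_AZURE[ec2_type]
    except KeyError:
        raise ValueError(f"no mapping for {ec2_type}; size it manually") from None

print(map_instance_type("m5.xlarge"))  # Standard_D4s_v5
```

Unmapped types fail loudly on purpose: an unfamiliar instance family (GPU, burstable, local NVMe) deserves manual sizing rather than a silent default.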


Conclusion

Migrating from EKS to AKS requires careful planning and attention to cloud-specific integrations. The most common pain points involve:

  1. Authentication: IRSA β†’ Workload Identity
  2. Storage: EBS/EFS β†’ Azure Disk/Files
  3. Secrets: AWS Secrets Manager β†’ Key Vault
  4. Networking: Security Groups β†’ Network Policies
  5. Observability: CloudWatch β†’ Azure Monitor

Success Factors:

  • Thorough inventory of AWS-specific resources
  • Automated detection of cloud-specific patterns
  • Comprehensive testing in staging environment
  • Incremental migration approach
  • Team training on Azure-specific concepts

Tools to Leverage:

  • Konveyor for automated migration analysis
  • Velero for data migration
  • GitOps (ArgoCD/Flux) for consistent deployments
  • Azure Migrate for assessment

This document should serve as a comprehensive reference for platform teams undertaking EKS to AKS migrations.
