@jwmatthews
Last active February 16, 2026 23:21

EKS to AKS Migration Pain Points: Real-World Problems and Solutions

Version: 1.0
Last Updated: February 2026
Target Audience: Platform Engineers, DevOps Teams, Migration Specialists


Executive Summary

Migrating Kubernetes workloads from Amazon EKS to Azure AKS appears straightforward: both are managed Kubernetes services running the same core platform. However, cloud provider-specific integrations, CSI drivers, networking models, and authentication mechanisms create significant friction points that can cause application failures post-migration.

This document catalogs real-world migration pain points encountered when moving stateful and stateless workloads from EKS to AKS, with detailed remediation strategies, code examples, and automated detection patterns for migration tooling.

Key Takeaways

  • Identity & Access: IRSA vs Workload Identity requires application-level changes
  • Storage: Different CSI drivers, performance characteristics, and access modes
  • Networking: Security Groups for Pods don't translate to Network Policies
  • Secrets: AWS Secrets Manager vs Azure Key Vault require different CSI configurations
  • Ingress: ALB-specific features need AGIC or nginx equivalents
  • Observability: CloudWatch vs Azure Monitor have different collection mechanisms
  • Cost: Different pricing models for storage, networking, and compute

Table of Contents

  1. Identity and Authentication
  2. Persistent Storage
  3. Secrets Management
  4. Ingress and Load Balancing
  5. Networking and Security
  6. Observability and Logging
  7. Container Registry
  8. Backup and Disaster Recovery
  9. Compute and Node Configuration
  10. Service Mesh Integration
  11. Database-Specific Integrations
  12. GitOps and CI/CD
  13. Detection Patterns
  14. Migration Strategies
  15. Quick Reference Tables

1. Identity and Authentication

Pain Point: IAM Roles for Service Accounts (IRSA) → Workload Identity

Severity: 🔴 High - Application breaking
Frequency: Very Common
Impact: Authentication failures, unable to access cloud resources

The Problem

EKS uses IRSA to provide AWS credentials to pods via service account annotations. This integrates seamlessly with AWS SDK libraries. AKS uses Azure Workload Identity (the successor to the now-deprecated AAD Pod Identity), which has a completely different configuration model.

EKS Configuration (Works)

apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3-reader
  namespace: production
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/prod-s3-reader-role
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: document-processor
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: doc-processor
  template:
    metadata:
      labels:
        app: doc-processor
    spec:
      serviceAccountName: s3-reader
      containers:
      - name: processor
        image: myregistry/doc-processor:v1.2.3
        env:
        - name: AWS_REGION
          value: us-east-1
        - name: S3_BUCKET
          value: production-documents
        - name: AWS_DEFAULT_REGION
          value: us-east-1

Application Code (Python):

import boto3

# This just works - AWS SDK automatically uses IRSA credentials
s3_client = boto3.client('s3')
response = s3_client.list_objects_v2(Bucket='production-documents')

After Migration to AKS (Broken)

# Pod starts but fails at runtime
kubectl logs document-processor-7d9f8b5c4-x8k2m

# Output:
# botocore.exceptions.NoCredentialsError: Unable to locate credentials
# or
# botocore.exceptions.ClientError: An error occurred (InvalidAccessKeyId) when calling the ListObjectsV2 operation

Root Cause Analysis

  1. Service Account annotation is AWS-specific - AKS doesn't recognize eks.amazonaws.com/role-arn
  2. OIDC provider is different - EKS OIDC endpoint vs Azure AD
  3. Token format differs - AWS STS tokens vs Azure AD tokens
  4. SDK credential chain changes - AWS SDK won't find Azure credentials automatically
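A quick in-pod diagnostic can confirm which identity webhook, if any, has mutated the pod. This sketch keys on the environment variables each webhook injects (IRSA injects AWS_WEB_IDENTITY_TOKEN_FILE; Azure Workload Identity injects AZURE_FEDERATED_TOKEN_FILE); the function name is illustrative:

```python
import os

def detect_identity_mechanism(env=None) -> str:
    """Report which cloud identity webhook has configured this pod, if any."""
    env = os.environ if env is None else env
    if "AWS_WEB_IDENTITY_TOKEN_FILE" in env:
        return "irsa"                # injected by the EKS pod identity webhook
    if "AZURE_FEDERATED_TOKEN_FILE" in env:
        return "workload-identity"   # injected by the Azure Workload Identity webhook
    return "none"                    # no webhook ran - expect credential errors

if __name__ == "__main__":
    print(detect_identity_mechanism())
```

Running this inside a migrated pod that prints "none" confirms the pod label or service account annotation is missing before you start debugging the SDK.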

Solution A: Migrate to Azure Blob Storage

Prerequisites:

  1. Create Azure Storage Account
  2. Create Managed Identity with Storage Blob Data Contributor role
  3. Set up Workload Identity federation

Azure Configuration:

# Create storage account
az storage account create \
  --name prodstorageacct \
  --resource-group production-rg \
  --location eastus \
  --sku Standard_ZRS

# Create container
az storage container create \
  --name documents \
  --account-name prodstorageacct

# Create managed identity
az identity create \
  --name doc-processor-identity \
  --resource-group production-rg

# Get identity client ID
IDENTITY_CLIENT_ID=$(az identity show \
  --name doc-processor-identity \
  --resource-group production-rg \
  --query clientId -o tsv)

# Assign storage permissions
az role assignment create \
  --role "Storage Blob Data Contributor" \
  --assignee $IDENTITY_CLIENT_ID \
  --scope /subscriptions/<subscription-id>/resourceGroups/production-rg/providers/Microsoft.Storage/storageAccounts/prodstorageacct

AKS Configuration:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: blob-reader
  namespace: production
  annotations:
    azure.workload.identity/client-id: "12345678-1234-1234-1234-123456789012"
    azure.workload.identity/tenant-id: "87654321-4321-4321-4321-210987654321"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: document-processor
  namespace: production
  labels:
    azure.workload.identity/use: "true"  # Optional here; the required label is on the pod template
spec:
  replicas: 3
  selector:
    matchLabels:
      app: doc-processor
  template:
    metadata:
      labels:
        app: doc-processor
        azure.workload.identity/use: "true"  # Required on pod!
    spec:
      serviceAccountName: blob-reader
      containers:
      - name: processor
        image: myregistry/doc-processor:v2.0.0  # Updated image
        env:
        - name: AZURE_STORAGE_ACCOUNT_NAME
          value: prodstorageacct
        - name: AZURE_STORAGE_CONTAINER_NAME
          value: documents
        # Note: No explicit credentials - Workload Identity handles it

Application Code Changes (Python):

# NEW: Azure SDK instead of boto3
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# DefaultAzureCredential automatically uses Workload Identity
credential = DefaultAzureCredential()

blob_service_client = BlobServiceClient(
    account_url="https://prodstorageacct.blob.core.windows.net",
    credential=credential
)

container_client = blob_service_client.get_container_client("documents")

# List blobs (equivalent to S3 list_objects_v2)
blob_list = container_client.list_blobs()
for blob in blob_list:
    print(f"Blob name: {blob.name}")

Testing:

# Verify workload identity is working
# (kubectl 1.24+ removed the --serviceaccount flag; set it via --overrides)
kubectl run -it --rm debug \
  --image=mcr.microsoft.com/azure-cli \
  --labels=azure.workload.identity/use=true \
  --overrides='{"spec":{"serviceAccountName":"blob-reader"}}' \
  -- bash

# Inside pod:
az login --identity
az storage blob list \
  --account-name prodstorageacct \
  --container-name documents \
  --auth-mode login

Solution B: Keep S3, Add Cross-Cloud Authentication

Use Case: Multi-cloud strategy, data residency, or gradual migration

apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3-reader
  namespace: production
  annotations:
    azure.workload.identity/client-id: "12345678-1234-1234-1234-123456789012"
---
apiVersion: v1
kind: Secret
metadata:
  name: aws-credentials
  namespace: production
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "AKIA..."
  AWS_SECRET_ACCESS_KEY: "wJalrXUtn..."
  # OR use Azure Key Vault CSI to inject these
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: document-processor
spec:
  template:
    spec:
      serviceAccountName: s3-reader
      containers:
      - name: processor
        image: myregistry/doc-processor:v1.2.3
        envFrom:
        - secretRef:
            name: aws-credentials
        env:
        - name: AWS_REGION
          value: us-east-1

Better Approach: Federate the AKS cluster's OIDC issuer with an AWS IAM role, so pods can call sts:AssumeRoleWithWebIdentity using their projected service account tokens (no static keys)

# Get the AKS cluster's OIDC issuer URL (requires --enable-oidc-issuer)
AKS_OIDC_ISSUER=$(az aks show \
  --name myAKSCluster \
  --resource-group production-rg \
  --query "oidcIssuerProfile.issuerUrl" -o tsv)

# In AWS: register this issuer as an IAM OIDC identity provider, then add a
# trust policy on prod-s3-reader-role allowing sts:AssumeRoleWithWebIdentity
# for subject system:serviceaccount:production:s3-reader
# (Complex setup - beyond scope; validate token audiences carefully)

Migration Checklist

  • Inventory all ServiceAccounts with eks.amazonaws.com/* annotations
  • Identify AWS SDK usage in application code
  • Decide: Migrate to Azure services or maintain cross-cloud access
  • Create Azure Managed Identities
  • Set up Workload Identity federation
  • Update application code to use Azure SDKs (if migrating to Azure services)
  • Update Kubernetes manifests with Azure annotations
  • Test authentication in non-production environment
  • Update CI/CD pipelines to build new container images
  • Document credential management changes
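The first inventory step in the checklist above can be automated. A minimal sketch that scans local manifest files for AWS-specific annotations (the file layout and helper names are assumptions):

```python
import re
from pathlib import Path

# Matches AWS-specific annotations (e.g. eks.amazonaws.com/role-arn)
# that AKS will silently ignore.
AWS_ANNOTATION = re.compile(r"eks\.amazonaws\.com/[\w.-]+")

def find_aws_annotations(manifest_text: str) -> list:
    """Return the distinct eks.amazonaws.com/* annotation keys in a manifest."""
    return sorted(set(AWS_ANNOTATION.findall(manifest_text)))

def scan_manifests(root: str) -> dict:
    """Map each YAML file under root to the AWS annotations it contains."""
    findings = {}
    for path in list(Path(root).rglob("*.yaml")) + list(Path(root).rglob("*.yml")):
        hits = find_aws_annotations(path.read_text())
        if hits:
            findings[str(path)] = hits
    return findings
```

Point `scan_manifests` at a checkout of your GitOps repo to get a per-file list of service accounts that need Azure annotations.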

Common Pitfalls

  1. Forgetting pod label: azure.workload.identity/use: "true" must be on the pod template - the mutating webhook matches pods, so labeling only the Deployment object does nothing
  2. Token expiration: Azure AD tokens have different lifetimes than AWS STS tokens
  3. SDK version: Older Azure SDK versions don't support Workload Identity
  4. Regional endpoints: Azure Storage URLs differ from S3 URLs
  5. Permissions model: Azure RBAC roles vs AWS IAM policies have different granularity

2. Persistent Storage

Pain Point 1: EBS Storage Classes → Azure Disk

Severity: 🔴 High - Application won't start
Frequency: Universal (every stateful app)
Impact: PVCs stuck in Pending, StatefulSets won't deploy

The Problem

EBS-specific StorageClasses don't exist in AKS. Different provisioners, parameters, and performance tiers require manifest updates.

EKS Configuration (Works)

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "16000"
  throughput: "1000"
  encrypted: "true"
  kmsKeyId: arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
  namespace: database
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 500Gi
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: database
spec:
  serviceName: postgres
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:15
        ports:
        - containerPort: 5432
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 500Gi

After Migration to AKS (Broken)

kubectl get pvc -n database
# NAME                    STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS
# postgres-data           Pending                                      fast-ssd

kubectl describe pvc postgres-data -n database
# Events:
#   Warning  ProvisioningFailed  storageclass.storage.k8s.io "fast-ssd" not found

Solution: Azure Disk StorageClass

Performance Tier Mapping:

| EBS Type          | IOPS    | Throughput | Azure Disk Equivalent | SKU             | IOPS    | Throughput |
|-------------------|---------|------------|-----------------------|-----------------|---------|------------|
| gp3 (baseline)    | 3,000   | 125 MB/s   | Premium SSD v2        | PremiumV2_LRS   | 3,000   | 125 MB/s   |
| gp3 (16k IOPS)    | 16,000  | 1,000 MB/s | Ultra Disk            | UltraSSD_LRS    | 16,000+ | 1,000 MB/s |
| io2 Block Express | 256,000 | 4,000 MB/s | Ultra Disk            | UltraSSD_LRS    | 160,000 | 4,000 MB/s |
| st1 (throughput)  | 500     | 500 MB/s   | Standard SSD          | StandardSSD_LRS | varies  | ~60 MB/s   |
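Migration tooling can encode this mapping as a lookup. A heuristic sketch based on the tiers above (the IOPS cutoff is an illustrative assumption, not a documented Azure limit):

```python
# Heuristic EBS volume type -> Azure Disk SKU lookup, per the mapping table.
EBS_TO_AZURE_SKU = {
    "gp3": "PremiumV2_LRS",      # baseline gp3 -> Premium SSD v2
    "io2": "UltraSSD_LRS",       # io2 Block Express -> Ultra Disk
    "st1": "StandardSSD_LRS",    # throughput-optimized HDD -> Standard SSD
}

def map_storage_sku(ebs_type: str, iops: int = 3000) -> str:
    """Pick an Azure Disk SKU for an EBS volume (sketch; verify against quotas)."""
    # Illustrative cutoff: very high provisioned gp3 IOPS suggests Ultra Disk.
    if ebs_type == "gp3" and iops > 80000:
        return "UltraSSD_LRS"
    return EBS_TO_AZURE_SKU.get(ebs_type, "StandardSSD_LRS")
```

Treat the output as a starting point; validate the chosen SKU against the target region's availability and the node pool's VM sizes.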

AKS Configuration:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: disk.csi.azure.com
parameters:
  skuName: UltraSSD_LRS        # For high IOPS requirement
  cachingMode: None            # Ultra Disk doesn't support host caching
  DiskIOPSReadWrite: "16000"   # Provisioned IOPS - a StorageClass parameter
  DiskMBpsReadWrite: "1000"    # for Ultra Disk, not a PVC annotation
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
  namespace: database
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 500Gi

Important Notes:

  1. Ultra Disk requires specific VM sizes - Not all Azure VM SKUs support Ultra Disk

    # Check if node pool supports Ultra Disk
    az aks nodepool show \
      --resource-group myResourceGroup \
      --cluster-name myAKSCluster \
      --name nodepool1 \
      --query "enableUltraSsd"
  2. No disk encryption by default - Must use Azure Disk Encryption Set

    parameters:
      skuName: UltraSSD_LRS
      diskEncryptionSetID: /subscriptions/.../diskEncryptionSets/myDES
  3. Cost implications - Ultra Disk is significantly more expensive

    • Pay for provisioned IOPS and throughput, not just capacity
    • Consider Premium SSD v2 for better cost/performance balance

Alternative: Premium SSD v2 (Cost-Effective)

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: disk.csi.azure.com
parameters:
  skuName: PremiumV2_LRS
  cachingMode: None            # Premium SSD v2 doesn't support host caching
  DiskIOPSReadWrite: "10000"   # Provisioned performance, set in the
  DiskMBpsReadWrite: "500"     # StorageClass parameters
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
  namespace: database
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 500Gi

Migration Steps for Existing Data

  1. Option A: Velero Backup/Restore (Volume Data)

    # In EKS
    velero backup create postgres-backup \
      --include-namespaces database \
      --snapshot-volumes
    
    # In AKS (after setting up new StorageClass)
    velero restore create postgres-restore \
      --from-backup postgres-backup
  2. Option B: Database-Native Dump/Restore

    # In EKS - Dump database (note: the shell redirect runs on your local
    # machine, so the dump lands locally, not inside the pod)
    kubectl exec -n database postgres-0 -- \
      pg_dumpall -U postgres > ./postgres-dump.sql
    
    # In AKS - Copy the dump into the new pod and restore
    # (psql -f reads the file inside the pod; a "<" redirect would read locally)
    kubectl cp ./postgres-dump.sql database/postgres-0:/tmp/postgres-dump.sql
    kubectl exec -n database postgres-0 -- \
      psql -U postgres -f /tmp/postgres-dump.sql
  3. Option C: Continuous Replication (Zero Downtime)

    # Set up PostgreSQL streaming replication from EKS to AKS
    # Primary in EKS, Replica in AKS
    # Promote AKS replica to primary during cutover

Pain Point 2: EFS (ReadWriteMany) → Azure Files

Severity: 🟡 Medium - Depends on use case
Frequency: Common (20-30% of workloads)
Impact: Shared storage not available, multi-pod writes fail

The Problem

EFS provides NFS-based shared storage with ReadWriteMany access mode. Azure Files provides similar capability but with different performance characteristics, protocols (SMB vs NFS), and pricing.

EKS Configuration (Works)

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-storage
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-0123456789abcdef0
  directoryPerms: "700"
  gidRangeStart: "1000"
  gidRangeEnd: "2000"
  basePath: "/dynamic_provisioning"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-uploads
  namespace: web
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-storage
  resources:
    requests:
      storage: 100Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
  namespace: web
spec:
  replicas: 5  # Multiple pods share the volume
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
      - name: nginx
        image: nginx:1.21
        volumeMounts:
        - name: uploads
          mountPath: /var/www/uploads
      volumes:
      - name: uploads
        persistentVolumeClaim:
          claimName: shared-uploads

After Migration to AKS (Broken)

kubectl get pvc -n web
# PVC pending - EFS driver not available

kubectl describe pvc shared-uploads -n web
# provisioner "efs.csi.aws.com" not found

Solution: Azure Files with NFS or SMB

Protocol Decision:

  • NFS 4.1: Better for Linux workloads, POSIX compliance, better performance
  • SMB 3.x: Better for Windows workloads, AD integration, in-transit encryption

Option 1: Azure Files with NFS (Recommended for Linux)

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-nfs
provisioner: file.csi.azure.com
parameters:
  protocol: nfs
  skuName: Premium_LRS  # NFS requires Premium tier
  # Network settings for better performance
  networkEndpointType: privateEndpoint  # Optional: for private access
mountOptions:
  - nconnect=4  # Parallel connections for better throughput
  - actimeo=30   # Attribute cache timeout
allowVolumeExpansion: true
volumeBindingMode: Immediate
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-uploads
  namespace: web
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: azurefile-nfs
  resources:
    requests:
      storage: 100Gi

Option 2: Azure Files with SMB

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-smb
provisioner: file.csi.azure.com
parameters:
  skuName: Standard_LRS  # Or Premium_LRS
  protocol: smb
  # Optional: Use existing storage account
  # storageAccount: mystorageaccount
  # resourceGroup: myResourceGroup
mountOptions:
  - dir_mode=0777
  - file_mode=0777
  - uid=33  # www-data user
  - gid=33
  - mfsymlinks  # Enable symlinks
  - cache=strict
  - actimeo=30
allowVolumeExpansion: true
volumeBindingMode: Immediate

Performance Comparison

| Metric         | EFS                   | Azure Files Premium (NFS) | Azure Files Standard (SMB) |
|----------------|-----------------------|---------------------------|----------------------------|
| Max throughput | 10 GB/s               | 10 GB/s                   | 60 MB/s per share          |
| Max IOPS       | 500,000+              | 100,000                   | 1,000-20,000               |
| Latency        | Low (single-digit ms) | Low (single-digit ms)     | Higher (varies)            |
| Min size       | No minimum            | 100 GiB                   | 1 GiB                      |
| Pricing model  | Pay per GB used       | Pay per GB provisioned    | Pay per GB used            |
| Bursting       | Yes                   | Yes                       | Limited                    |

Migration Gotchas

  1. File Permissions

    # EFS uses NFSv4 ACLs
    # Azure Files NFS uses NFSv4.1 - mostly compatible
    # Azure Files SMB uses NTFS ACLs - potential permission issues
    
    # Test file operations
    kubectl exec -it web-frontend-xxx -- touch /var/www/uploads/test.txt
    kubectl exec -it web-frontend-xxx -- ls -la /var/www/uploads/
  2. Symbolic Links

    # Azure Files SMB requires mfsymlinks mount option
    mountOptions:
      - mfsymlinks
  3. File Locking

    # EFS supports byte-range locking
    # Azure Files NFS: Full support
    # Azure Files SMB: Full support
    # Test your application's file locking behavior
  4. Case Sensitivity

    # EFS: Case-sensitive (Linux NFS)
    # Azure Files NFS: Case-sensitive
    # Azure Files SMB: Case-insensitive by default
    
    # This could break applications expecting case-sensitivity!
    touch /uploads/File.txt
    touch /uploads/file.txt  # Different files on EFS/NFS, same file on SMB
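Before committing to SMB, it is worth scanning existing EFS content for names that would collide case-insensitively. A minimal sketch (kept as a pure function so it can run over any list of relative paths; the function name is an assumption):

```python
from collections import defaultdict

def case_collisions(paths):
    """Group relative paths that would collide on a case-insensitive share."""
    groups = defaultdict(list)
    for p in paths:
        groups[p.lower()].append(p)
    # Only groups with more than one spelling are real collisions.
    return [g for g in groups.values() if len(g) > 1]

# Feed it e.g. a directory walk of the EFS mount:
#   paths = [str(p.relative_to(root)) for p in Path(root).rglob("*")]
```

Any non-empty result means data loss on copy to SMB; either rename the files or choose the NFS protocol.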

Data Migration Approaches

Option 1: Rsync Between Volumes

# Create sync pod with both volumes mounted
apiVersion: v1
kind: Pod
metadata:
  name: efs-to-azurefile-sync
  namespace: web
spec:
  containers:
  - name: sync
    image: instrumentisto/rsync-ssh:latest
    command: ["/bin/sh", "-c"]
    args:
      - |
        rsync -avz --progress \
          /source/ /destination/
        echo "Sync complete"
        sleep infinity
    volumeMounts:
    - name: source
      mountPath: /source
    - name: destination
      mountPath: /destination
  volumes:
  - name: source
    persistentVolumeClaim:
      claimName: efs-pvc  # EKS cluster - requires cross-cluster volume access
  - name: destination
    persistentVolumeClaim:
      claimName: azurefile-pvc  # AKS cluster

Option 2: AWS DataSync to S3, then AzCopy to Azure

# Use AWS DataSync to copy EFS data to S3
# Then use AzCopy to copy from S3 into Azure Blob Storage
# (AzCopy supports S3-to-Blob; it cannot copy from S3 straight to Azure Files)
azcopy copy \
  "https://my-bucket.s3.amazonaws.com/*" \
  "https://mystorageaccount.blob.core.windows.net/mycontainer" \
  --recursive

# Final hop: copy from Blob into the Azure Files share
azcopy copy \
  "https://mystorageaccount.blob.core.windows.net/mycontainer/*" \
  "https://mystorageaccount.file.core.windows.net/myshare" \
  --recursive

Option 3: Application-Level Migration

# 1. Deploy application in AKS with empty Azure Files volume
# 2. Configure application to write to both EFS (in AWS) and Azure Files
# 3. Run backfill job to copy existing data
# 4. Switch application to read from Azure Files
# 5. Decommission EFS
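The dual-write phase (steps 2-4) can be sketched as a thin wrapper over the two mount points. The class and mount paths here are hypothetical illustrations, not a library API:

```python
from pathlib import Path

class DualWriter:
    """Write to both the old and new shared volumes during migration."""

    def __init__(self, primary_root: str, secondary_root: str):
        self.primary = Path(primary_root)      # e.g. the EFS mount (source of truth)
        self.secondary = Path(secondary_root)  # e.g. the Azure Files mount

    def write(self, rel_path: str, data: bytes) -> None:
        # Write to both volumes so the new share stays in sync with the old one.
        for root in (self.primary, self.secondary):
            target = root / rel_path
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(data)

    def read(self, rel_path: str) -> bytes:
        # Read from the primary until cutover; flip this after backfill completes.
        return (self.primary / rel_path).read_bytes()
```

After the backfill job finishes, switch `read` to the secondary, watch error rates, then drop the dual write.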

3. Secrets Management

Pain Point: AWS Secrets Manager CSI → Azure Key Vault CSI

Severity: 🔴 High - Application won't start
Frequency: Very Common (80%+ of secure applications)
Impact: Secrets not available, authentication failures

The Problem

Applications using AWS Secrets Manager via the Secrets Store CSI Driver need reconfiguration to use Azure Key Vault. The SecretProviderClass CRD has completely different parameters.

EKS Configuration (Works)

apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: application-secrets
  namespace: production
spec:
  provider: aws
  parameters:
    objects: |
      - objectName: "production/database/postgres"
        objectType: "secretsmanager"
        objectAlias: "db-password"
      - objectName: "production/api/jwt-secret"
        objectType: "secretsmanager"
        objectAlias: "jwt-key"
      - objectName: "production/ssl/certificate"
        objectType: "secretsmanager"
        objectAlias: "ssl-cert"
  secretObjects:  # Auto-create Kubernetes Secrets
  - secretName: db-credentials
    type: Opaque
    data:
    - objectName: db-password
      key: password
  - secretName: jwt-credentials
    type: Opaque
    data:
    - objectName: jwt-key
      key: secret
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  namespace: production
spec:
  template:
    spec:
      serviceAccountName: api-sa  # Has IRSA permissions
      containers:
      - name: api
        image: myapi:v1.0
        env:
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: password
        - name: JWT_SECRET
          valueFrom:
            secretKeyRef:
              name: jwt-credentials
              key: secret
        volumeMounts:
        - name: secrets
          mountPath: "/mnt/secrets"
          readOnly: true
      volumes:
      - name: secrets
        csi:
          driver: secrets-store.csi.k8s.io
          readOnly: true
          volumeAttributes:
            secretProviderClass: "application-secrets"

After Migration to AKS (Broken)

kubectl get pods -n production
# NAME                          READY   STATUS              RESTARTS   AGE
# api-server-6d8f9c5b4-abc123   0/1     ContainerCreating   0          5m

kubectl describe pod api-server-6d8f9c5b4-abc123 -n production
# Events:
#   Warning  FailedMount  MountVolume.SetUp failed for volume "secrets" : 
#   rpc error: code = Unknown desc = failed to mount secrets store objects for pod: 
#   provider "aws" not found

Solution: Azure Key Vault CSI Driver

Prerequisites:

# 1. Enable Azure Key Vault Provider for Secrets Store CSI Driver
az aks enable-addons \
  --addons azure-keyvault-secrets-provider \
  --name myAKSCluster \
  --resource-group myResourceGroup

# 2. Create Azure Key Vault
az keyvault create \
  --name prodappvault \
  --resource-group production-rg \
  --location eastus

# 3. Create Managed Identity for workload
az identity create \
  --name api-server-identity \
  --resource-group production-rg

# 4. Grant Key Vault access
IDENTITY_CLIENT_ID=$(az identity show \
  --name api-server-identity \
  --resource-group production-rg \
  --query clientId -o tsv)

az keyvault set-policy \
  --name prodappvault \
  --secret-permissions get list \
  --spn $IDENTITY_CLIENT_ID

Migrate Secrets:

# Export from AWS Secrets Manager
aws secretsmanager get-secret-value \
  --secret-id production/database/postgres \
  --query SecretString \
  --output text > db-password.txt

# Import to Azure Key Vault
az keyvault secret set \
  --vault-name prodappvault \
  --name db-password \
  --file db-password.txt

# Repeat for other secrets
az keyvault secret set \
  --vault-name prodappvault \
  --name jwt-secret \
  --value "$(aws secretsmanager get-secret-value --secret-id production/api/jwt-secret --query SecretString --output text)"

AKS Configuration:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: api-sa
  namespace: production
  annotations:
    azure.workload.identity/client-id: "12345678-1234-1234-1234-123456789012"
    azure.workload.identity/tenant-id: "87654321-4321-4321-4321-210987654321"
---
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: application-secrets
  namespace: production
spec:
  provider: azure
  parameters:
    usePodIdentity: "false"
    useVMManagedIdentity: "false"
    clientID: "12345678-1234-1234-1234-123456789012"  # Managed Identity Client ID
    keyvaultName: "prodappvault"
    cloudName: ""  # Empty for Azure Public Cloud
    objects: |
      array:
        - |
          objectName: db-password
          objectType: secret
          objectAlias: db-password
        - |
          objectName: jwt-secret
          objectType: secret
          objectAlias: jwt-key
        - |
          objectName: ssl-certificate
          objectType: secret
          objectAlias: ssl-cert
    tenantId: "87654321-4321-4321-4321-210987654321"
  secretObjects:  # Create Kubernetes Secrets (same as before)
  - secretName: db-credentials
    type: Opaque
    data:
    - objectName: db-password
      key: password
  - secretName: jwt-credentials
    type: Opaque
    data:
    - objectName: jwt-key
      key: secret
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  namespace: production
  labels:
    azure.workload.identity/use: "true"
spec:
  template:
    metadata:
      labels:
        azure.workload.identity/use: "true"
    spec:
      serviceAccountName: api-sa
      containers:
      - name: api
        image: myapi:v1.0
        env:
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: password
        - name: JWT_SECRET
          valueFrom:
            secretKeyRef:
              name: jwt-credentials
              key: secret
        volumeMounts:
        - name: secrets
          mountPath: "/mnt/secrets"
          readOnly: true
      volumes:
      - name: secrets
        csi:
          driver: secrets-store.csi.k8s.io
          readOnly: true
          volumeAttributes:
            secretProviderClass: "application-secrets"

Advanced: Auto-Rotation

The Secrets Store CSI Driver does not rotate mounted secrets by default on either cloud.

On AKS, rotation is enabled on the Key Vault provider add-on, not in the SecretProviderClass:

# Enable rotation and set the polling interval on the add-on
az aks addon update \
  --name myAKSCluster \
  --resource-group myResourceGroup \
  --addon azure-keyvault-secrets-provider \
  --enable-secret-rotation \
  --rotation-poll-interval 2m

With rotation enabled, the driver polls Key Vault at the configured interval and updates mounted files and synced Kubernetes Secrets when values change. Pods that consume secrets via environment variables still need a restart to pick up new values.

Common Issues

  1. Token Expiration

    # Workload Identity tokens expire
    # Symptoms: "AuthenticationFailed" after ~24 hours
    # Solution: Ensure pod has correct labels
    azure.workload.identity/use: "true"
  2. Permission Errors

    # Error: "Caller is not authorized to perform action"
    # Check Key Vault access policies
    az keyvault show --name prodappvault --query properties.accessPolicies
    
    # Grant missing permissions
    az keyvault set-policy \
      --name prodappvault \
      --object-id <managed-identity-object-id> \
      --secret-permissions get list
  3. Secret Not Syncing

    # Check CSI driver logs
    kubectl logs -n kube-system -l app=secrets-store-csi-driver
    
    # Check provider logs
    kubectl logs -n kube-system -l app=csi-secrets-store-provider-azure

4. Ingress and Load Balancing

Pain Point: AWS ALB Ingress Controller → Azure Application Gateway / nginx

Severity: 🟡 Medium - Functionality degraded
Frequency: Very Common
Impact: Lost features (SSL, redirects, WAF), different costs

The Problem

AWS ALB Ingress Controller annotations don't work on AKS. Features like SSL termination, HTTP-to-HTTPS redirects, health checks, and WAF integration need reconfiguration.

EKS Configuration (Works)

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: production-ingress
  namespace: production
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    # SSL Certificate from ACM
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:123456789012:certificate/abc-def-ghi
    alb.ingress.kubernetes.io/ssl-policy: ELBSecurityPolicy-TLS-1-2-2017-01
    # HTTP to HTTPS redirect
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
    alb.ingress.kubernetes.io/actions.ssl-redirect: |
      {"Type": "redirect", "RedirectConfig": {
        "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"
      }}
    # Health check configuration
    alb.ingress.kubernetes.io/healthcheck-path: /health
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: "15"
    alb.ingress.kubernetes.io/healthcheck-timeout-seconds: "5"
    alb.ingress.kubernetes.io/success-codes: "200"
    # Access logs
    alb.ingress.kubernetes.io/load-balancer-attributes: access_logs.s3.enabled=true,access_logs.s3.bucket=my-alb-logs
    # WAF
    alb.ingress.kubernetes.io/wafv2-acl-arn: arn:aws:wafv2:us-east-1:123456789012:regional/webacl/MyWAF/a1b2c3d4
spec:
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: ssl-redirect
            port:
              name: use-annotation
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 8080

After Migration to AKS (Degraded)

# Ingress created but:
# - No ALB (falls back to nginx or nothing)
# - No SSL termination
# - No HTTP redirect
# - No WAF
# - No custom health checks
# - Different cost model

kubectl get ingress -n production
# NAME                 CLASS   HOSTS              ADDRESS   PORTS   AGE
# production-ingress   <none>  api.example.com              80      5m

Solution Option 1: Azure Application Gateway Ingress Controller (AGIC)

Most similar to ALB, enterprise features

Prerequisites:

# Create Application Gateway
az network application-gateway create \
  --name prodAppGateway \
  --resource-group production-rg \
  --location eastus \
  --sku WAF_v2 \
  --capacity 2 \
  --vnet-name aksVNet \
  --subnet appgw-subnet \
  --public-ip-address appgw-pip

# Enable AGIC addon on AKS
az aks enable-addons \
  --name myAKSCluster \
  --resource-group production-rg \
  --addon ingress-appgw \
  --appgw-id /subscriptions/.../resourceGroups/production-rg/providers/Microsoft.Network/applicationGateways/prodAppGateway

AKS Configuration:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: production-ingress
  namespace: production
  annotations:
    kubernetes.io/ingress.class: azure/application-gateway
    # SSL Certificate from Azure Key Vault
    appgw.ingress.kubernetes.io/appgw-ssl-certificate: "api-example-com-cert"
    # HTTP to HTTPS redirect
    appgw.ingress.kubernetes.io/ssl-redirect: "true"
    # Backend protocol
    appgw.ingress.kubernetes.io/backend-protocol: "http"
    # Health probe
    appgw.ingress.kubernetes.io/health-probe-path: "/health"
    appgw.ingress.kubernetes.io/health-probe-interval: "15"
    appgw.ingress.kubernetes.io/health-probe-timeout: "5"
    appgw.ingress.kubernetes.io/health-probe-unhealthy-threshold: "3"
    # WAF Policy
    appgw.ingress.kubernetes.io/waf-policy-for-path: "/subscriptions/.../resourceGroups/production-rg/providers/Microsoft.Network/applicationGatewayWebApplicationFirewallPolicies/prodWAF"
    # Connection draining
    appgw.ingress.kubernetes.io/connection-draining: "true"
    appgw.ingress.kubernetes.io/connection-draining-timeout: "30"
spec:
  tls:
  - hosts:
    - api.example.com
    secretName: api-tls-secret  # Certificate must be in Key Vault and referenced
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 8080

Certificate Setup:

# Import certificate to Key Vault
az keyvault certificate import \
  --vault-name prodappvault \
  --name api-example-com-cert \
  --file certificate.pfx \
  --password "cert-password"

# Grant Application Gateway access
az keyvault set-policy \
  --name prodappvault \
  --spn <appgw-identity> \
  --secret-permissions get \
  --certificate-permissions get

WAF Configuration:

# Create WAF policy
az network application-gateway waf-policy create \
  --name prodWAF \
  --resource-group production-rg \
  --location eastus

# Configure OWASP rules
az network application-gateway waf-policy managed-rule rule-set add \
  --policy-name prodWAF \
  --resource-group production-rg \
  --type OWASP \
  --version 3.2

Solution Option 2: nginx Ingress Controller (Most Portable)

Better for multi-cloud, more mature, larger community

# Install nginx ingress controller
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace \
  --set controller.service.annotations."service\.beta\.kubernetes\.io/azure-load-balancer-health-probe-request-path"=/healthz

AKS Configuration:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: production-ingress
  namespace: production
  annotations:
    kubernetes.io/ingress.class: nginx
    # SSL redirect
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    # Force SSL
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    # Certificate management via cert-manager
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    # Rate limiting
    nginx.ingress.kubernetes.io/limit-rps: "100"
    # Custom health check
    nginx.ingress.kubernetes.io/health-check-path: "/health"
    # CORS
    nginx.ingress.kubernetes.io/enable-cors: "true"
    nginx.ingress.kubernetes.io/cors-allow-origin: "https://example.com"
spec:
  tls:
  - hosts:
    - api.example.com
    secretName: api-tls-secret  # Auto-provisioned by cert-manager
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 8080

cert-manager Setup (for automated SSL):

# Install cert-manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml

# Create ClusterIssuer for Let's Encrypt
cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          class: nginx
EOF

Feature Comparison

| Feature | AWS ALB | Azure App Gateway (AGIC) | nginx Ingress |
|---|---|---|---|
| SSL Termination | ACM | Key Vault | cert-manager/manual |
| WAF | AWS WAF | Azure WAF | ModSecurity (addon) |
| Path-based routing | ✓ | ✓ | ✓ |
| HTTP redirects | ✓ | ✓ | ✓ |
| Header manipulation | Limited | ✓ | ✓ (extensive) |
| Rate limiting | Via WAF | Via WAF | ✓ (native) |
| Canary deployments | Via target groups | Via backend pools | ✓ (native) |
| mTLS | ✓ | ✓ | ✓ |
| Cost | Pay per hour + LCU | Pay per hour + capacity | Free (infra only) |
| Multi-cloud | AWS only | Azure only | Any cloud |
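For bulk conversions of existing Ingress manifests, the annotation mapping implied by this comparison can be scripted. A minimal Python sketch (the `ALB_TO_AGIC` mapping and `translate_annotations` helper are illustrative, not part of any tool; verify each target annotation against the AGIC documentation before relying on it):

```python
# Sketch: translate common AWS ALB ingress annotations to their AGIC
# equivalents. Covers only the annotations discussed in this section;
# anything ALB-specific that has no clean mapping is flagged for review.
ALB_TO_AGIC = {
    "alb.ingress.kubernetes.io/healthcheck-path":
        "appgw.ingress.kubernetes.io/health-probe-path",
    "alb.ingress.kubernetes.io/healthcheck-interval-seconds":
        "appgw.ingress.kubernetes.io/health-probe-interval",
    "alb.ingress.kubernetes.io/healthcheck-timeout-seconds":
        "appgw.ingress.kubernetes.io/health-probe-timeout",
}

def translate_annotations(annotations):
    """Return (translated, unmapped) annotation dicts."""
    translated, unmapped = {}, {}
    for key, value in annotations.items():
        if key in ALB_TO_AGIC:
            translated[ALB_TO_AGIC[key]] = value
        elif key.startswith("alb.ingress.kubernetes.io/"):
            # WAF ARNs, listener actions, etc. need manual mapping
            unmapped[key] = value
        else:
            translated[key] = value  # non-ALB annotations pass through
    return translated, unmapped
```

Running this over each Ingress and reviewing the `unmapped` bucket gives a quick inventory of the manual work each manifest needs.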

5. Networking and Security

Pain Point: VPC CNI Security Groups for Pods β†’ Network Policies

Severity: πŸ”΄ High - Security controls lost
Frequency: Common in regulated industries
Impact: Pod-level network isolation not available

The Problem

EKS allows assigning AWS Security Groups directly to pods via the VPC CNI plugin. AKS uses standard Kubernetes Network Policies, which have different capabilities and granularity.

EKS Configuration (Works)

# Custom Resource for Security Group Policy
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: database-pod-sg
  namespace: database
spec:
  podSelector:
    matchLabels:
      app: postgres
      tier: database
  securityGroups:
    groupIds:
      - sg-0a1b2c3d4e5f6g7h8  # Only allows 5432 from app tier SG
      - sg-1a2b3c4d5e6f7g8h9  # Allows SSH from bastion SG
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: database
spec:
  serviceName: postgres
  replicas: 3
  selector:
    matchLabels:
      app: postgres
      tier: database
  template:
    metadata:
      labels:
        app: postgres
        tier: database
    spec:
      containers:
      - name: postgres
        image: postgres:15
        ports:
        - containerPort: 5432
          name: postgres
# Pod automatically gets dedicated ENI with security group sg-0a1b2c3d4e5f6g7h8

AWS Security Group Rules (defined in AWS):

# sg-0a1b2c3d4e5f6g7h8 - Database Security Group
# Inbound:
#   - Port 5432 from sg-app-tier-xyz (application pods)
#   - Port 5432 from sg-bastion-abc (admin access)
# Outbound:
#   - Port 5432 to sg-0a1b2c3d4e5f6g7h8 (cluster communication)

After Migration to AKS (No Security!)

# SecurityGroupPolicy CRD doesn't exist
kubectl get securitygrouppolicy -n database
# error: the server doesn't have a resource type "securitygrouppolicy"

# Pods have no network restrictions
# All pods can communicate with all pods!

Solution: Kubernetes Network Policies + Azure Network Policy Manager

Enable Azure Network Policy:

# When creating cluster
az aks create \
  --resource-group production-rg \
  --name myAKSCluster \
  --network-plugin azure \
  --network-policy azure  # or "calico"

# For existing cluster (requires recreation of node pools)
az aks update \
  --resource-group production-rg \
  --name myAKSCluster \
  --network-policy azure

AKS Configuration:

# Default deny all ingress traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: database
spec:
  podSelector: {}
  policyTypes:
  - Ingress
---
# Allow specific ingress to PostgreSQL
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: postgres-allow-from-app
  namespace: database
spec:
  podSelector:
    matchLabels:
      app: postgres
      tier: database
  policyTypes:
  - Ingress
  - Egress
  ingress:
  # Allow from application tier
  - from:
    - namespaceSelector:
        matchLabels:
          name: application
      podSelector:
        matchLabels:
          tier: application
    ports:
    - protocol: TCP
      port: 5432
  # Allow from monitoring
  - from:
    - namespaceSelector:
        matchLabels:
          name: monitoring
    ports:
    - protocol: TCP
      port: 9187  # postgres_exporter
  # Allow from same namespace (replica communication)
  - from:
    - podSelector:
        matchLabels:
          app: postgres
    ports:
    - protocol: TCP
      port: 5432
  egress:
  # Allow DNS
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
  # Allow PostgreSQL replication
  - to:
    - podSelector:
        matchLabels:
          app: postgres
    ports:
    - protocol: TCP
      port: 5432
---
# Label namespaces for network policy
apiVersion: v1
kind: Namespace
metadata:
  name: application
  labels:
    name: application
---
apiVersion: v1
kind: Namespace
metadata:
  name: database
  labels:
    name: database

Key Differences: Security Groups vs Network Policies

| Aspect | AWS Security Groups | K8s Network Policies |
|---|---|---|
| Scope | ENI (pod gets own network interface) | Pod-to-pod |
| Statefulness | Stateful (return traffic automatic) | Varies by CNI plugin |
| IP-based rules | Can reference external IPs | Can reference IP blocks (CIDR) |
| Cloud integration | Native AWS (RDS, ELB, etc.) | Kubernetes-only |
| Management | AWS Console/API/Terraform | Kubernetes manifests |
| Performance | Enforced at VPC level (hardware) | Enforced at node level (software) |
| Granularity | Per-ENI (can be per-pod) | Per-pod only |
| Cost | No additional cost | No additional cost |

Advanced: Azure Network Security Groups (NSGs) for Nodes

For node-level security (not pod-level):

# Create NSG for AKS nodes
az network nsg create \
  --resource-group production-rg \
  --name aks-node-nsg

# Add rules (source prefix 10.240.1.0/24 = app tier subnet)
az network nsg rule create \
  --resource-group production-rg \
  --nsg-name aks-node-nsg \
  --name allow-postgres-from-app-nodes \
  --priority 100 \
  --source-address-prefixes 10.240.1.0/24 \
  --destination-port-ranges 5432 \
  --access Allow \
  --protocol Tcp

# Associate with subnet
az network vnet subnet update \
  --resource-group production-rg \
  --vnet-name aksVNet \
  --name database-subnet \
  --network-security-group aks-node-nsg

Limitation: NSGs apply to ALL pods on a node, not individual pods like Security Groups for Pods

Migration Strategy

  1. Inventory Security Groups

    # List all SecurityGroupPolicies in EKS
    kubectl get securitygrouppolicy --all-namespaces -o yaml > eks-sg-policies.yaml
  2. Map to Network Policies

    • Security Group β†’ Network Policy (pod selector)
    • Security Group rules β†’ Ingress/Egress rules
    • Source Security Groups β†’ Namespace/Pod selectors
  3. Test Thoroughly

    # Test connectivity between pods
    kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- /bin/bash
    # Inside pod:
    nc -zv postgres-0.postgres.database.svc.cluster.local 5432
  4. Use Policy Recipes and Validation

    # Browse ready-made Network Policy examples for common scenarios:
    # https://github.com/ahmetb/kubernetes-network-policy-recipes
    # Server-side dry-run a policy before applying it:
    kubectl apply --dry-run=server -f postgres-allow-from-app.yaml
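The mapping in step 2 can be scripted to produce starting points. A minimal sketch, assuming the policies were exported with `kubectl get securitygrouppolicy -A -o json` (the `sgp_to_networkpolicy_skeleton` helper and the annotation key are illustrative, not part of any tool):

```python
# Sketch: turn one exported SecurityGroupPolicy object into a NetworkPolicy
# skeleton. The actual ingress/egress rules still have to be transcribed by
# hand from the corresponding AWS security group rules.
def sgp_to_networkpolicy_skeleton(sgp):
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {
            "name": sgp["metadata"]["name"].removesuffix("-sg") + "-netpol",
            "namespace": sgp["metadata"]["namespace"],
            "annotations": {
                # Record the source SGs so rules can be transcribed later
                "migration/source-security-groups":
                    ",".join(sgp["spec"]["securityGroups"]["groupIds"]),
            },
        },
        "spec": {
            "podSelector": sgp["spec"].get("podSelector", {}),
            "policyTypes": ["Ingress", "Egress"],
            "ingress": [],  # TODO: transcribe inbound SG rules
            "egress": [],   # TODO: transcribe outbound SG rules
        },
    }
```

Emitting these skeletons with a deny-by-default spec forces each rule to be reviewed rather than silently dropped.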

6. Observability and Logging

Pain Point: CloudWatch Container Insights β†’ Azure Monitor

Severity: 🟑 Medium - Operational visibility
Frequency: Universal
Impact: Different query language, metrics, alerting

The Problem

EKS integrates with CloudWatch for logs and metrics. AKS uses Azure Monitor with different collection mechanisms, query languages (KQL vs CloudWatch Insights), and pricing models.

EKS Configuration (Works)

FluentBit DaemonSet for CloudWatch:

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: amazon-cloudwatch
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush                     5
        Grace                     30
        Daemon                    Off
        Log_Level                 info
    
    [INPUT]
        Name                      tail
        Path                      /var/log/containers/*.log
        Parser                    docker
        Tag                       kube.*
        DB                        /var/fluent-bit/state/flb_kube.db
        Mem_Buf_Limit             5MB
        Skip_Long_Lines           On
        Refresh_Interval          10
    
    [FILTER]
        Name                      kubernetes
        Match                     kube.*
        Kube_URL                  https://kubernetes.default.svc.cluster.local:443
        Merge_Log                 On
        Keep_Log                  Off
        K8S-Logging.Parser        On
        K8S-Logging.Exclude       On
    
    [OUTPUT]
        Name                      cloudwatch_logs
        Match                     *
        region                    us-east-1
        log_group_name            /aws/eks/production-cluster/application
        log_stream_prefix         from-fluent-bit-
        auto_create_group         true
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: amazon-cloudwatch
spec:
  selector:
    matchLabels:
      name: fluent-bit
  template:
    metadata:
      labels:
        name: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      containers:
      - name: fluent-bit
        image: amazon/aws-for-fluent-bit:latest
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: fluent-bit-config
          mountPath: /fluent-bit/etc/
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: fluent-bit-config
        configMap:
          name: fluent-bit-config

Querying in CloudWatch Insights:

fields @timestamp, @message
| filter kubernetes.namespace_name = "production"
| filter kubernetes.labels.app = "api-server"
| filter @message like /ERROR/
| stats count() by bin(5m)

After Migration to AKS (No Logs)

# Logs not reaching any destination
# CloudWatch not accessible from Azure
# Need to reconfigure entire logging pipeline

Solution: Azure Monitor Container Insights

Enable Container Insights:

# Create Log Analytics Workspace
az monitor log-analytics workspace create \
  --resource-group production-rg \
  --workspace-name prodLogAnalytics \
  --location eastus

# Enable on AKS cluster
az aks enable-addons \
  --resource-group production-rg \
  --name myAKSCluster \
  --addons monitoring \
  --workspace-resource-id /subscriptions/<subscription-id>/resourceGroups/production-rg/providers/Microsoft.OperationalInsights/workspaces/prodLogAnalytics

This automatically deploys:

  • Azure Monitor agent (ama-logs, formerly the OMS Agent) DaemonSet that collects logs and metrics
  • Container Insights solution
  • Pre-configured workbooks and dashboards

Querying in Azure Monitor (KQL):

ContainerLogV2
| where TimeGenerated > ago(1h)
| where PodNamespace == "production"
| where PodName startswith "api-server"
| where LogMessage contains "ERROR"
| summarize count() by bin(TimeGenerated, 5m)
| render timechart

Query Translation Examples:

| CloudWatch Insights | Azure Monitor (KQL, ContainerLogV2 schema) |
|---|---|
| `fields @timestamp, @message` | `project TimeGenerated, LogMessage` |
| `filter kubernetes.namespace = "prod"` | `where PodNamespace == "prod"` |
| `filter @message like /ERROR/` | `where LogMessage contains "ERROR"` |
| `stats count() by bin(5m)` | `summarize count() by bin(TimeGenerated, 5m)` |
| `sort @timestamp desc` | `sort by TimeGenerated desc` |
| `limit 100` | `take 100` |
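The translation table is mechanical enough to script for simple queries. An illustrative sketch (the rule list covers only the constructs in the table; anything else passes through unchanged and needs manual review):

```python
# Sketch: rewrite simple CloudWatch Logs Insights constructs into KQL.
# Purely regex-based; a real translator would parse the query grammar.
import re

RULES = [
    (r"\bfields @timestamp, @message\b", "project TimeGenerated, LogMessage"),
    (r'\bfilter kubernetes\.namespace = "([^"]+)"', r'where PodNamespace == "\1"'),
    (r"\bfilter @message like /([^/]+)/", r'where LogMessage contains "\1"'),
    (r"\bstats count\(\) by bin\((\d+\w)\)",
     r"summarize count() by bin(TimeGenerated, \1)"),
    (r"\bsort @timestamp desc\b", "sort by TimeGenerated desc"),
    (r"\blimit (\d+)\b", r"take \1"),
]

def cwl_to_kql(query):
    """Apply each rewrite rule in order; unmatched text is left as-is."""
    for pattern, replacement in RULES:
        query = re.sub(pattern, replacement, query)
    return query
```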

Advanced: Custom Metrics

In EKS (CloudWatch Custom Metrics):

import boto3
cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_data(
    Namespace='Production/API',
    MetricData=[
        {
            'MetricName': 'RequestDuration',
            'Value': 123.45,
            'Unit': 'Milliseconds',
            'Dimensions': [
                {'Name': 'Endpoint', 'Value': '/api/users'},
                {'Name': 'StatusCode', 'Value': '200'}
            ]
        }
    ]
)

In AKS (Azure Monitor Custom Metrics):

from azure.monitor.ingestion import LogsIngestionClient
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
client = LogsIngestionClient(
    endpoint="https://prodLogAnalytics.eastus-1.ingest.monitor.azure.com",
    credential=credential
)

# Send custom logs
client.upload(
    rule_id="/subscriptions/.../dataCollectionRules/myDCR",
    stream_name="Custom-RequestMetrics",
    logs=[
        {
            "TimeGenerated": "2024-02-16T10:00:00Z",
            "Endpoint": "/api/users",
            "Duration": 123.45,
            "StatusCode": 200
        }
    ]
)

Alerting Configuration

EKS (CloudWatch Alarms):

aws cloudwatch put-metric-alarm \
  --alarm-name high-error-rate \
  --alarm-description "Alert when error rate > 5%" \
  --metric-name Errors \
  --namespace AWS/EKS \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 100 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:critical-alerts

AKS (Azure Monitor Alerts):

# Create a metric alert (covers platform metrics such as CPU; for error
# rates, use the KQL-based log alert below)
az monitor metrics alert create \
  --name high-cpu-usage \
  --resource-group production-rg \
  --scopes /subscriptions/.../resourceGroups/production-rg/providers/Microsoft.ContainerService/managedClusters/myAKSCluster \
  --condition "avg Percentage CPU > 80" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --action /subscriptions/.../resourceGroups/production-rg/providers/microsoft.insights/actionGroups/critical-alerts

Or using KQL-based log alerts:

az monitor scheduled-query create \
  --name high-error-rate-log \
  --resource-group production-rg \
  --scopes /subscriptions/.../workspaces/prodLogAnalytics \
  --condition "count 'ErrorCount' > 100" \
  --condition-query ErrorCount="ContainerLogV2 | where LogMessage contains 'ERROR'" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --action /subscriptions/.../actionGroups/critical-alerts

Cost Comparison

| Feature | CloudWatch | Azure Monitor |
|---|---|---|
| Log ingestion | $0.50/GB | $2.76/GB (first 5 GB/month free) |
| Log storage | $0.03/GB/month | Included for 31 days, $0.12/GB/month after |
| Custom metrics | $0.30/metric | Native metrics included; $0.60/metric custom |
| Queries | $0.005/GB scanned | Included |
| Data export | $0.09/GB | $0.13/GB |
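A back-of-the-envelope comparison using the list prices above. This is a sketch with deliberately simplified retention modeling; actual pricing varies by region, tier, and commitment discounts:

```python
# Sketch: rough monthly log cost (USD) on each platform for a given daily
# ingestion volume, using the list prices from the table above.
def monthly_log_cost(gb_per_day, retention_months=3):
    gb_month = gb_per_day * 30
    # CloudWatch: $0.50/GB ingestion + $0.03/GB-month storage
    cloudwatch = gb_month * 0.50 + gb_month * retention_months * 0.03
    # Azure Monitor: $2.76/GB after the 5 GB/month free allowance;
    # first 31 days of retention included, then $0.12/GB-month
    azure = max(0.0, gb_month - 5) * 2.76
    if retention_months > 1:
        azure += gb_month * (retention_months - 1) * 0.12
    return {"cloudwatch": round(cloudwatch, 2), "azure": round(azure, 2)}
```

At typical production volumes, ingestion (not storage) dominates on both platforms, which is why log filtering before shipping pays off more than shorter retention.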

7. Container Registry

Pain Point: ECR β†’ Azure Container Registry (ACR)

Severity: 🟒 Low - Straightforward migration
Frequency: Universal
Impact: Image pulls fail until reconfigured

The Problem

Container images stored in Amazon ECR need to be migrated to ACR, and image pull secrets need updating.

EKS Configuration (Works)

apiVersion: v1
kind: Secret
metadata:
  name: ecr-registry
  namespace: production
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: <base64-encoded-ecr-credentials>
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  namespace: production
spec:
  template:
    spec:
      imagePullSecrets:
      - name: ecr-registry
      containers:
      - name: api
        image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/api-server:v1.2.3

Solution: Migrate Images to ACR

1. Create ACR:

az acr create \
  --resource-group production-rg \
  --name prodacr \
  --sku Premium \
  --location eastus

2. Enable ACR Integration with AKS:

# Attach ACR to AKS (automatic image pull)
az aks update \
  --resource-group production-rg \
  --name myAKSCluster \
  --attach-acr prodacr

3. Migrate Images:

# Login to both registries
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

az acr login --name prodacr

# Pull from ECR
docker pull 123456789012.dkr.ecr.us-east-1.amazonaws.com/api-server:v1.2.3

# Tag for ACR
docker tag \
  123456789012.dkr.ecr.us-east-1.amazonaws.com/api-server:v1.2.3 \
  prodacr.azurecr.io/api-server:v1.2.3

# Push to ACR
docker push prodacr.azurecr.io/api-server:v1.2.3

Automated migration script:

#!/bin/bash
set -euo pipefail

ECR_REGISTRY="123456789012.dkr.ecr.us-east-1.amazonaws.com"
ACR_REGISTRY="prodacr.azurecr.io"

# List all repositories in ECR
aws ecr describe-repositories --region us-east-1 --output json | \
  jq -r '.repositories[].repositoryName' | \
  while read -r repo; do
    # List all tags for the repository (// empty skips untagged images)
    aws ecr list-images --region us-east-1 --repository-name "$repo" --output json | \
      jq -r '.imageIds[].imageTag // empty' | \
      while read -r tag; do
        echo "Migrating $repo:$tag"

        # Pull from ECR
        docker pull "$ECR_REGISTRY/$repo:$tag"

        # Tag for ACR
        docker tag "$ECR_REGISTRY/$repo:$tag" "$ACR_REGISTRY/$repo:$tag"

        # Push to ACR
        docker push "$ACR_REGISTRY/$repo:$tag"

        # Clean up local images
        docker rmi "$ECR_REGISTRY/$repo:$tag" "$ACR_REGISTRY/$repo:$tag"
      done
  done

4. Update Manifests:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  namespace: production
spec:
  template:
    spec:
      # No imagePullSecrets needed when ACR is attached to AKS
      containers:
      - name: api
        image: prodacr.azurecr.io/api-server:v1.2.3  # Updated image reference

5. Update CI/CD Pipelines:

# GitHub Actions example
- name: Login to ACR
  uses: azure/docker-login@v1
  with:
    login-server: prodacr.azurecr.io
    username: ${{ secrets.ACR_USERNAME }}
    password: ${{ secrets.ACR_PASSWORD }}

- name: Build and push
  run: |
    docker build -t prodacr.azurecr.io/api-server:${{ github.sha }} .
    docker push prodacr.azurecr.io/api-server:${{ github.sha }}

Advanced: Geo-Replication

# Replicate to multiple regions for faster pulls
az acr replication create \
  --registry prodacr \
  --location westus2

az acr replication create \
  --registry prodacr \
  --location westeurope

8. Backup and Disaster Recovery

Pain Point: EBS Snapshots β†’ Azure Disk Snapshots

Severity: 🟑 Medium
Frequency: Common
Impact: Backup/restore processes need reconfiguration

The Problem

EBS snapshot-based backups (via tools like Velero) use AWS-specific APIs. Azure has different snapshot mechanisms.

Solution: Update Velero Configuration

EKS Velero Configuration:

apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: prod-velero-backups
    prefix: eks-cluster
  config:
    region: us-east-1
---
apiVersion: velero.io/v1
kind: VolumeSnapshotLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  config:
    region: us-east-1

AKS Velero Configuration:

apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: azure
  objectStorage:
    bucket: velero-backups  # Actually an Azure Blob container
    prefix: aks-cluster
  config:
    resourceGroup: production-rg
    storageAccount: prodvelarostorage
    subscriptionId: 12345678-1234-1234-1234-123456789012
---
apiVersion: velero.io/v1
kind: VolumeSnapshotLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: azure
  config:
    resourceGroup: production-rg
    subscriptionId: 12345678-1234-1234-1234-123456789012

Install Velero with Azure Plugin:

# Create storage account for backups
az storage account create \
  --name prodvelarostorage \
  --resource-group production-rg \
  --sku Standard_GRS \
  --encryption-services blob \
  --https-only true

# Create blob container
az storage container create \
  --name velero-backups \
  --account-name prodvelarostorage

# Install Velero
velero install \
  --provider azure \
  --plugins velero/velero-plugin-for-microsoft-azure:v1.9.0 \
  --bucket velero-backups \
  --secret-file ./credentials-velero \
  --backup-location-config resourceGroup=production-rg,storageAccount=prodvelarostorage,subscriptionId=12345678-1234-1234-1234-123456789012 \
  --snapshot-location-config resourceGroup=production-rg,subscriptionId=12345678-1234-1234-1234-123456789012

9. Compute and Node Configuration

Pain Point: EC2 Instance Types β†’ Azure VM Sizes

Severity: 🟒 Low - Configuration change
Frequency: Universal
Impact: Performance characteristics may differ

The Problem

Node pools configured for specific EC2 instance types don't exist in Azure. VM sizes have different names, capabilities, and pricing.

Instance Type Mapping

| EKS (EC2) | vCPU | Memory | AKS (Azure VM) | vCPU | Memory | Notes |
|---|---|---|---|---|---|---|
| t3.medium | 2 | 4 GiB | Standard_B2ms | 2 | 8 GiB | Burstable |
| m5.large | 2 | 8 GiB | Standard_D2s_v5 | 2 | 8 GiB | General purpose |
| m5.xlarge | 4 | 16 GiB | Standard_D4s_v5 | 4 | 16 GiB | General purpose |
| c5.xlarge | 4 | 8 GiB | Standard_F4s_v2 | 4 | 8 GiB | Compute optimized |
| r5.xlarge | 4 | 32 GiB | Standard_E4s_v5 | 4 | 32 GiB | Memory optimized |
| p3.2xlarge | 8 | 61 GiB + V100 | Standard_NC6s_v3 | 6 | 112 GiB + V100 | GPU |
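When automating node pool creation, the mapping above can live in a lookup table; unmapped types should fail loudly rather than silently defaulting to some VM size. The helper below is an illustrative sketch:

```python
# Sketch: EC2 instance type -> Azure VM size lookup from the table above.
# Extend with the types actually present in your fleet.
EC2_TO_AZURE_VM = {
    "t3.medium":  "Standard_B2ms",
    "m5.large":   "Standard_D2s_v5",
    "m5.xlarge":  "Standard_D4s_v5",
    "c5.xlarge":  "Standard_F4s_v2",
    "r5.xlarge":  "Standard_E4s_v5",
    "p3.2xlarge": "Standard_NC6s_v3",
}

def map_instance_type(ec2_type):
    """Return the Azure VM size for an EC2 type, or raise if unmapped."""
    try:
        return EC2_TO_AZURE_VM[ec2_type]
    except KeyError:
        raise ValueError(
            f"No mapping for {ec2_type}; compare vCPU/memory/family manually"
        ) from None
```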

EKS Node Group Configuration

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: production-cluster
  region: us-east-1
nodeGroups:
  - name: general-purpose
    instanceType: m5.xlarge
    desiredCapacity: 3
    minSize: 2
    maxSize: 10
    labels:
      workload-type: general
    taints:
      - key: workload-type
        value: general
        effect: NoSchedule

AKS Node Pool Configuration

az aks nodepool add \
  --resource-group production-rg \
  --cluster-name myAKSCluster \
  --name generalpurpose \
  --node-count 3 \
  --min-count 2 \
  --max-count 10 \
  --node-vm-size Standard_D4s_v5 \
  --labels workload-type=general \
  --node-taints workload-type=general:NoSchedule \
  --enable-cluster-autoscaler

Node Selectors and Tolerations

No changes needed - Kubernetes-native constructs work identically:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: compute-intensive-app
spec:
  template:
    spec:
      nodeSelector:
        workload-type: general
      tolerations:
      - key: "workload-type"
        operator: "Equal"
        value: "general"
        effect: "NoSchedule"
      containers:
      - name: app
        image: my-app:latest

10. Service Mesh Integration

Pain Point: AWS App Mesh β†’ Azure Service Mesh (Istio)

Severity: 🟑 Medium - Advanced use cases
Frequency: Uncommon
Impact: Service mesh configuration incompatible

The Problem

AWS App Mesh uses AWS-specific CRDs and control plane. Azure supports open-source service meshes (Istio, Linkerd, OSM).

Solution: Migrate to Istio on AKS

This is complex and beyond the scope of this document, but key considerations:

  1. Install Istio on AKS

    istioctl install --set profile=default  # Istio has no "production" profile; "default" is the recommended baseline
  2. Migrate Virtual Services

    • App Mesh VirtualServices β†’ Istio VirtualServices
    • Different syntax, similar concepts
  3. Update mTLS Configuration

    • App Mesh uses AWS Certificate Manager
    • Istio uses cert-manager or manual certificates
  4. Rewrite Traffic Policies


11. Database-Specific Integrations

Pain Point: RDS Integration β†’ Azure Database

Severity: 🟒 Low - If using managed databases
Frequency: Common
Impact: Connection strings, authentication

The Problem

Applications connecting to AWS RDS need connection string updates for Azure Database for PostgreSQL/MySQL.

EKS Application Configuration

apiVersion: v1
kind: Secret
metadata:
  name: db-connection
  namespace: production
stringData:
  host: "prod-postgres.c9akz82fkwix.us-east-1.rds.amazonaws.com"
  port: "5432"
  database: "production"
  username: "app_user"
  password: "secure-password"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      containers:
      - name: api
        env:
        - name: DB_HOST
          valueFrom:
            secretKeyRef:
              name: db-connection
              key: host
        - name: DB_PORT
          valueFrom:
            secretKeyRef:
              name: db-connection
              key: port
        # etc.

AKS Application Configuration

apiVersion: v1
kind: Secret
metadata:
  name: db-connection
  namespace: production
stringData:
  host: "prod-postgres.postgres.database.azure.com"  # Changed!
  port: "5432"
  database: "production"
  username: "app_user@prod-postgres"  # Single Server requires user@servername; Flexible Server uses the plain username
  password: "secure-password"
  # Optional: SSL parameters for Azure Database
  sslmode: "require"
---
# Rest of deployment unchanged

Additional Azure-specific considerations:

  1. SSL/TLS Required

    # Connection string must include SSL
    conn = psycopg2.connect(
        host="prod-postgres.postgres.database.azure.com",
        port=5432,
        database="production",
        user="app_user@prod-postgres",
        password="password",
        sslmode="require"
    )
  2. Firewall Rules

    # Allow AKS nodes to access Azure Database
    az postgres server firewall-rule create \
      --resource-group production-rg \
      --server-name prod-postgres \
      --name AllowAKSNodes \
      --start-ip-address 10.240.0.0 \
      --end-ip-address 10.240.255.255
  3. Private Endpoints (recommended)

    # Create private endpoint for database
    az network private-endpoint create \
      --name postgres-private-endpoint \
      --resource-group production-rg \
      --vnet-name aksVNet \
      --subnet database-subnet \
      --private-connection-resource-id /subscriptions/.../servers/prod-postgres \
      --group-id postgresqlServer \
      --connection-name postgres-connection

12. GitOps and CI/CD

Pain Point: CodePipeline/CodeBuild β†’ Azure DevOps/GitHub Actions

Severity: 🟒 Low - CI/CD reconfiguration
Frequency: Very Common
Impact: Build/deploy pipelines need rewriting

The Problem

AWS-native CI/CD tools (CodePipeline, CodeBuild, CodeDeploy) need replacement or reconfiguration.

Solutions

Option 1: GitHub Actions (Cloud-agnostic)

name: Deploy to AKS
on:
  push:
    branches: [ main ]

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    
    - name: Azure Login
      uses: azure/login@v1
      with:
        creds: ${{ secrets.AZURE_CREDENTIALS }}
    
    - name: Build and push image
      run: |
        az acr login --name prodacr
        docker build -t prodacr.azurecr.io/api-server:${{ github.sha }} .
        docker push prodacr.azurecr.io/api-server:${{ github.sha }}
    
    - name: Set AKS context
      uses: azure/aks-set-context@v3
      with:
        resource-group: production-rg
        cluster-name: myAKSCluster
    
    - name: Deploy to AKS
      uses: azure/k8s-deploy@v4
      with:
        manifests: |
          k8s/deployment.yaml
          k8s/service.yaml
        images: |
          prodacr.azurecr.io/api-server:${{ github.sha }}

Option 2: Azure DevOps

trigger:
  branches:
    include:
    - main

pool:
  vmImage: 'ubuntu-latest'

variables:
  acrName: 'prodacr'
  imageName: 'api-server'
  aksResourceGroup: 'production-rg'
  aksClusterName: 'myAKSCluster'

stages:
- stage: Build
  jobs:
  - job: BuildAndPush
    steps:
    - task: Docker@2
      inputs:
        containerRegistry: 'prodacr'
        repository: $(imageName)
        command: 'buildAndPush'
        Dockerfile: '**/Dockerfile'
        tags: |
          $(Build.BuildId)
          latest

- stage: Deploy
  jobs:
  - job: DeployToAKS
    steps:
    - task: KubernetesManifest@0
      inputs:
        action: 'deploy'
        kubernetesServiceConnection: 'myAKSCluster'
        namespace: 'production'
        manifests: |
          k8s/deployment.yaml
          k8s/service.yaml
        containers: |
          $(acrName).azurecr.io/$(imageName):$(Build.BuildId)

13. Detection Patterns for Migration Tools

Automated Discovery Rules

Tools like Konveyor should flag these patterns:

1. AWS-Specific Annotations

# PATTERN: EKS-specific annotations
annotations:
  eks.amazonaws.com/role-arn: *
  alb.ingress.kubernetes.io/*: *
  
# ACTION: Flag for Workload Identity or AGIC migration

2. AWS CSI Drivers

# PATTERN: AWS storage drivers
spec:
  csi:
    driver: ebs.csi.aws.com
    driver: efs.csi.aws.com
    
# ACTION: Suggest Azure Disk or Azure Files

3. AWS-Specific CRDs

# PATTERN: AWS-only Custom Resources
apiVersion: vpcresources.k8s.aws/*
# plus SecretProviderClass (secrets-store.csi.x-k8s.io) objects with provider: aws

# ACTION: Recommend Kubernetes Network Policies or Azure equivalents

4. Environment Variables

# PATTERN: AWS SDK environment variables
env:
- name: AWS_REGION
- name: AWS_DEFAULT_REGION
- name: AWS_ACCESS_KEY_ID

# ACTION: Warn about credential management changes

5. Hard-coded AWS Endpoints

# PATTERN: AWS service endpoints
env:
- name: S3_ENDPOINT
  value: "https://s3.us-east-1.amazonaws.com"
- name: SQS_URL
  value: "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"

# ACTION: Suggest Azure service equivalents

6. EC2 Metadata Usage

# PATTERN: Code accessing EC2 metadata
import requests
response = requests.get('http://169.254.169.254/latest/meta-data/')

# ACTION: Flag for Azure Instance Metadata Service (IMDS) migration (same link-local IP, but different paths and a required "Metadata: true" header)
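
The six rules above can be sketched as a simple text scanner. This is a minimal illustration, not Konveyor's actual rule engine — a real tool would parse YAML structurally rather than grep raw text, and the rule names here are invented for this example.

```python
import re

# Regexes covering the AWS-specific patterns listed above.
AWS_PATTERNS = {
    "irsa_annotation": re.compile(r"eks\.amazonaws\.com/role-arn"),
    "alb_annotation": re.compile(r"alb\.ingress\.kubernetes\.io/"),
    "aws_csi_driver": re.compile(r"\b(ebs|efs)\.csi\.aws\.com\b"),
    "aws_crd": re.compile(r"vpcresources\.k8s\.aws"),
    "aws_env_var": re.compile(r"\bAWS_(REGION|DEFAULT_REGION|ACCESS_KEY_ID|SECRET_ACCESS_KEY)\b"),
    "aws_endpoint": re.compile(r"https://(s3|sqs|sns|dynamodb)\.[a-z0-9-]+\.amazonaws\.com"),
    "ec2_metadata": re.compile(r"169\.254\.169\.254"),
}

def scan_manifest(text):
    """Return the names of AWS-specific patterns found in a manifest or source file."""
    return [name for name, rx in AWS_PATTERNS.items() if rx.search(text)]

manifest = """
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/app
spec:
  csi:
    driver: ebs.csi.aws.com
"""
print(scan_manifest(manifest))  # ['irsa_annotation', 'aws_csi_driver']
```

Running this over every manifest and source tree before migration produces a first-pass inventory of the items flagged in sections 1–6 above.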

14. Migration Strategies

Strategy 1: Lift-and-Shift (Fastest)

  1. Phase 1: Infrastructure (Week 1)

    • Create AKS cluster
    • Configure networking, storage classes
    • Set up Azure equivalents (ACR, Key Vault, etc.)
  2. Phase 2: Data Migration (Week 2)

    • Velero backup from EKS
    • Velero restore to AKS (data only)
    • Validate data integrity
  3. Phase 3: Application Deployment (Week 3)

    • Update manifests (storage classes, ingress, etc.)
    • Deploy via GitOps
    • Run smoke tests
  4. Phase 4: Cutover (Week 4)

    • DNS cutover
    • Decommission EKS

Pros: Fast, minimal code changes
Cons: Doesn't leverage Azure-native features, potential performance issues


Strategy 2: Blue-Green Cluster Migration (Safest)

  1. Phase 1: Build Green (AKS) (Weeks 1-2)

    • Parallel infrastructure build
    • Migrate data
  2. Phase 2: Validate Green (Week 3)

    • Run integration tests
    • Performance testing
    • Security validation
  3. Phase 3: Traffic Split (Week 4)

    • 10% traffic to AKS
    • Monitor for 48 hours
    • Increase to 50%, then 100%
  4. Phase 4: Decommission Blue (EKS) (Week 5)

    • Archive data
    • Terminate EKS

Pros: Safest, easy rollback
Cons: Highest cost (dual infrastructure), complex traffic splitting


Strategy 3: Incremental Migration (Most Controlled)

  1. Phase 1: Stateless Workloads (Weeks 1-3)

    • Migrate stateless apps first
    • Test in production with real traffic
  2. Phase 2: Stateful Non-Database (Weeks 4-6)

    • Redis, message queues
    • Can tolerate brief downtime
  3. Phase 3: Databases (Weeks 7-10)

    • Set up replication
    • Gradual cutover per database
  4. Phase 4: Cleanup (Week 11+)

    • Remove EKS resources
    • Optimize AKS

Pros: Lowest risk, learn as you go
Cons: Longest duration, complex coordination


15. Quick Reference Tables

Critical Path Items

| Category | EKS Component | AKS Equivalent | Migration Effort | Blocking? |
|------------|--------------------------|-------------------|---------------------------|------------|
| Auth | IRSA | Workload Identity | High (code changes) | 🔴 Yes |
| Storage | EBS CSI | Azure Disk CSI | Medium (manifests) | 🔴 Yes |
| Secrets | Secrets Manager CSI | Key Vault CSI | Medium (manifests + data) | 🔴 Yes |
| Ingress | ALB Controller | AGIC / nginx | Medium (manifests) | 🟡 Partial |
| Network | Security Groups for Pods | Network Policies | High (different model) | 🟡 Partial |
| Registry | ECR | ACR | Low (image migration) | 🔴 Yes |
| Monitoring | CloudWatch | Azure Monitor | Medium (queries) | 🟢 No |
| Backups | Velero (AWS) | Velero (Azure) | Low (config) | 🟢 No |

Pre-Migration Checklist

  • Inventory all AWS-specific annotations across all manifests
  • List all StatefulSets and PersistentVolumeClaims
  • Document all IRSA service accounts and their permissions
  • Export all AWS Secrets Manager secrets
  • List all ALB Ingresses and their annotations
  • Document CloudWatch dashboards and alerts
  • Map EC2 instance types to Azure VM sizes
  • Plan database migration strategy (native tools vs Velero)
  • Update CI/CD pipelines
  • Train team on Azure-specific tooling (KQL, Azure Portal)
  • Set up cost monitoring in Azure
  • Plan DNS cutover strategy
  • Define rollback procedures
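
The "document all IRSA service accounts" item above lends itself to scripting. This sketch parses the JSON from `kubectl get serviceaccounts -A -o json` and lists every IRSA-annotated account with its role ARN; the function name is ours, and a real inventory would also resolve each role's attached IAM policies.

```python
import json

IRSA_ANNOTATION = "eks.amazonaws.com/role-arn"

def find_irsa_accounts(sa_list_json):
    """Return (namespace, name, role ARN) for every IRSA-annotated ServiceAccount."""
    results = []
    for sa in json.loads(sa_list_json).get("items", []):
        meta = sa.get("metadata", {})
        arn = meta.get("annotations", {}).get(IRSA_ANNOTATION)
        if arn:
            results.append((meta.get("namespace", ""), meta.get("name", ""), arn))
    return results

# Sample output in the shape `kubectl get sa -A -o json` produces.
sample = json.dumps({"items": [
    {"metadata": {"namespace": "prod", "name": "api",
                  "annotations": {IRSA_ANNOTATION: "arn:aws:iam::123456789012:role/api"}}},
    {"metadata": {"namespace": "prod", "name": "default"}},
]})
print(find_irsa_accounts(sample))  # [('prod', 'api', 'arn:aws:iam::123456789012:role/api')]
```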

Testing Checklist

  • Pod authentication works (Workload Identity)
  • PVCs provision correctly (storage classes)
  • Secrets mount successfully (Key Vault CSI)
  • Ingress creates load balancer (AGIC/nginx)
  • Network policies block unauthorized traffic
  • Applications can connect to databases
  • Logs appear in Azure Monitor
  • Metrics are collected
  • Alerts fire correctly
  • Backups complete successfully
  • Load testing passes
  • Security scanning passes
  • Cost is within budget
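
The first testing-checklist item can be pre-validated statically. Azure Workload Identity needs two pieces of wiring: the `azure.workload.identity/client-id` annotation on the ServiceAccount and the `azure.workload.identity/use: "true"` label on the pod template. The checker below is a minimal sketch of that validation; the function name is ours.

```python
def check_workload_identity(service_account, pod_template):
    """Return a list of problems; an empty list means the wiring looks correct."""
    problems = []
    annotations = service_account.get("metadata", {}).get("annotations", {})
    if "azure.workload.identity/client-id" not in annotations:
        problems.append("ServiceAccount missing azure.workload.identity/client-id annotation")
    labels = pod_template.get("metadata", {}).get("labels", {})
    if labels.get("azure.workload.identity/use") != "true":
        problems.append('pod template missing label azure.workload.identity/use: "true"')
    return problems

sa = {"metadata": {"annotations": {
    "azure.workload.identity/client-id": "00000000-0000-0000-0000-000000000000"}}}
pod = {"metadata": {"labels": {"azure.workload.identity/use": "true"}}}
print(check_workload_identity(sa, pod))  # []
```

Running this across all migrated manifests catches the most common cause of the "Unable to locate credentials" failure described in Appendix A before anything is deployed.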

Appendix A: Common Error Messages

"Unable to locate credentials"

Cause: Application still expects IRSA credentials, but Workload Identity is not configured on AKS
Fix: Add Workload Identity annotations to ServiceAccount and pod labels

"storageclass.storage.k8s.io not found"

Cause: EBS StorageClass doesn't exist in AKS
Fix: Create Azure Disk or Azure Files StorageClass

"provider 'aws' not found"

Cause: AWS Secrets Store CSI provider not installed
Fix: Reconfigure SecretProviderClass for Azure

"MountVolume.SetUp failed"

Cause: Volume driver mismatch
Fix: Update CSI driver in PV/PVC specs


Appendix B: Cost Optimization

Storage Cost Comparison

| Scenario | EKS (EBS gp3) | AKS (Premium SSD) | Cost Difference |
|-------------------|---------------|-------------------|-----------------|
| 1 TB, 3000 IOPS | $80/month | $135/month | +69% on AKS |
| 1 TB, 10,000 IOPS | $145/month | $180/month | +24% on AKS |

Recommendation: Use Azure Premium SSD v2 for cost-effective high-IOPS workloads
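
Premium SSD v2 is billed per GiB of capacity, with the first 3000 IOPS and 125 MB/s included and only the overage billed. The calculator below illustrates that model; the rates passed in are placeholders for illustration only, not real prices — look up current rates for your region.

```python
def premium_ssd_v2_monthly(gib, iops, mbps, gib_rate, iops_rate, mbps_rate):
    """Monthly cost under the Premium SSD v2 billing model (rates are caller-supplied)."""
    free_iops, free_mbps = 3000, 125          # included baseline per disk
    cost = gib * gib_rate                     # capacity, billed per GiB
    cost += max(0, iops - free_iops) * iops_rate   # only IOPS above baseline are billed
    cost += max(0, mbps - free_mbps) * mbps_rate   # only throughput above baseline is billed
    return round(cost, 2)

# Hypothetical rates, for illustration only.
print(premium_ssd_v2_monthly(1024, 10000, 200,
                             gib_rate=0.08, iops_rate=0.005, mbps_rate=0.04))  # 119.92
```

Because capacity, IOPS, and throughput are provisioned independently, high-IOPS workloads avoid paying for capacity they don't need — the source of the savings claimed above.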

Compute Cost Comparison

| Workload | EKS (m5.xlarge) | AKS (D4s_v5) | Cost Difference |
|---------------------|-----------------|--------------|-----------------|
| 24/7 production | $122/month | $140/month | +15% on AKS |
| Dev/test (8 h/day) | $41/month | $47/month | +15% on AKS |

Note: Costs vary by region and commitment (Reserved Instances vs Spot)
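
A starting point for the instance-type mapping in the pre-migration checklist can be encoded as a lookup table. The pairs below match vCPU/memory shapes (e.g. m5.xlarge and D4s_v5 are both 4 vCPU / 16 GiB); this is a sketch, so verify pricing and regional availability before committing to any size.

```python
# Shape-matched EC2 -> Azure VM size pairs (vCPU / GiB noted per row).
EC2_TO_AZURE = {
    "m5.xlarge":  "Standard_D4s_v5",   # 4 vCPU / 16 GiB, general purpose
    "m5.2xlarge": "Standard_D8s_v5",   # 8 vCPU / 32 GiB
    "c5.xlarge":  "Standard_F4s_v2",   # 4 vCPU /  8 GiB, compute optimized
    "r5.xlarge":  "Standard_E4s_v5",   # 4 vCPU / 32 GiB, memory optimized
}

def map_instance_type(ec2_type):
    try:
        return EC2_TO_AZURE[ec2_type]
    except KeyError:
        raise ValueError(f"no mapping for {ec2_type}; size it manually") from None

print(map_instance_type("m5.xlarge"))  # Standard_D4s_v5
```

Unmapped types fail loudly on purpose: an unfamiliar instance family (GPU, burstable, local NVMe) deserves manual sizing rather than a silent default.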


Conclusion

Migrating from EKS to AKS requires careful planning and attention to cloud-specific integrations. The most common pain points involve:

  1. Authentication: IRSA β†’ Workload Identity
  2. Storage: EBS/EFS β†’ Azure Disk/Files
  3. Secrets: AWS Secrets Manager β†’ Key Vault
  4. Networking: Security Groups β†’ Network Policies
  5. Observability: CloudWatch β†’ Azure Monitor

Success Factors:

  • Thorough inventory of AWS-specific resources
  • Automated detection of cloud-specific patterns
  • Comprehensive testing in staging environment
  • Incremental migration approach
  • Team training on Azure-specific concepts

Tools to Leverage:

  • Konveyor for automated migration analysis
  • Velero for data migration
  • GitOps (ArgoCD/Flux) for consistent deployments
  • Azure Migrate for assessment

This document should serve as a comprehensive reference for platform teams undertaking EKS to AKS migrations.
