This is the rough outline of how we successfully did an in-place control + data plane upgrade from Istio 1.4.7 -> 1.5.4 via the official Helm charts.
Upgrade was:

- applied via scripting/automation
- on a mesh using
  - mTLS
  - Istio RBAC via `AuthorizationPolicy`
  - telemetry v1
  - tracing enabled, but Jaeger not deployed via the istio chart
  - an istio ingress gateway + a secondary istio ingress gateway
- with active traffic flowing through without any observed increase in error rates
The following ignores anything specifically mentioned in the upgrade notes:
- Bug in RBAC backward compatibility with `1.4` in `1.5.0` -> `1.5.2`, fixed in `1.5.3`
- Issue with the visibility of `ServiceEntry`s being scoped using the `Sidecar` resource - istio/istio#24251, subsequently added to the upgrade notes
- All traffic ports are now captured by default; this caused our non-mTLS metrics ports to start enforcing mTLS, which they previously did not do on `1.4.7`
  - Fix: exclude the metrics ports via sidecar annotations, e.g. `traffic.sidecar.istio.io/excludeInboundPorts: "9080, 15090"` (see the sketch below)
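For illustration, one way to apply that exclusion is to patch the annotation onto each injected workload's pod template and let it roll. A minimal sketch only - the deployment and namespace names are placeholders, and the port list should match your own metrics ports:

```bash
# Hypothetical workload/namespace names; the annotation is the one referenced above.
kubectl -n my-namespace patch deployment my-service --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"traffic.sidecar.istio.io/excludeInboundPorts":"9080, 15090"}}}}}'

# Changing the pod template triggers a rollout; once complete, the listed ports
# are no longer captured by the sidecar.
kubectl -n my-namespace rollout status deployment/my-service --timeout 120s
```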
```bash
#!/usr/bin/env bash
# In 1.4 Galley manages the webhook configuration; in 1.5 Helm manages it and it is patched
# by Galley dynamically without `ownerReferences`, so we can detect if we have upgraded Galley already
if kubectl get validatingwebhookconfiguration/istio-galley -n istio-system -o yaml | grep ownerReferences; then
  echo "Detected 1.4 installation - preparing Helm upgrade to 1.5.x by deleting galley-managed webhook..."

  # Disable webhook reconciliation so we can delete the webhook
  kubectl get deployment/istio-galley -n istio-system -o yaml | \
    sed 's/enable-reconcileWebhookConfiguration=true/enable-reconcileWebhookConfiguration=false/' | \
    kubectl apply -f -

  # Wait for Galley to come back up
  kubectl rollout status deployment/istio-galley -n istio-system --timeout 60s

  # Delete the webhook
  kubectl delete validatingwebhookconfiguration/istio-galley -n istio-system

  # Now we can proceed to helm upgrade to 1.5, which will recreate the webhook
fi
```

Not to be taken literally - this is pseudo-script...
```bash
helm upgrade --install --wait --atomic --cleanup-on-fail istio-init istio-init-1.5.4.tgz
# scripting to wait for jobs to complete goes here
helm upgrade --install --wait --atomic --cleanup-on-fail istio istio-1.5.4.tgz
# scripting to bounce `Deployment`s for injected services goes here
```
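The bounce step above might look something like the following sketch; it assumes injected namespaces carry the `istio-injection=enabled` label and that a rolling restart of every `Deployment` is acceptable for your workloads:

```bash
# Restart every Deployment in each injection-enabled namespace so pods come back
# with the 1.5.4 sidecar (assumes the istio-injection=enabled namespace label).
for ns in $(kubectl get namespaces -l istio-injection=enabled -o name | cut -d/ -f2); do
  for deploy in $(kubectl get deployments -n "$ns" -o name); do
    kubectl rollout restart -n "$ns" "$deploy"
    kubectl rollout status -n "$ns" "$deploy" --timeout 300s
  done
done
```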
We noticed issues with ingress gateways coming up during the control plane upgrade. It appears there was some kind of race condition when starting new 1.5.4 ingress gateway instances while parts of the 1.4.7 control plane were still running; we suspect it was perhaps a problem with the new ingress gateways talking to the old-version Pilot.
Symptom:

- lots of weird errors about invalid configuration being received from Pilot, relating to `tracing`, in the new-version ingress gateway logs
- a subset of new-version ingress gateways would not become ready, which could cause the `helm upgrade --wait` to get stuck

Fix:

- delete the pods that fail to become ready (manual intervention in our case, although technically possible to automate - see the sketch below)
- the pods automatically re-created in their place always came ready, and the `helm upgrade` then runs to completion
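If you wanted to automate that manual step, something along these lines could work; the label selector and namespace are assumptions based on a default `istio-ingressgateway` install:

```bash
# Find ingress gateway pods whose Ready condition is not True and delete them;
# the replacements created by the Deployment came up healthy in our experience.
kubectl get pods -n istio-system -l app=istio-ingressgateway \
  -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
  | awk '$2 != "True" {print $1}' \
  | xargs -r -n1 kubectl delete pod -n istio-system
```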
Hi @jlcrow - sorry for the slow response - forgot about this one.
We have subsequently moved `1.5 -> 1.6` (and are about to do `1.6 -> 1.7`) with zero downtime, migrating to installing our control plane by defining the `IstioOperator` resource in source control and then using `istioctl install` to actually manage the canary rollout, rather than Helm. We used `istioctl` to do an initial conversion of our Helm resources/values and then manually corrected niggles.

We didn't have issues with the gateways, however it's worth noting that we now upgrade our `ingressgateway` pods in-place, driven by the `IstioOperator` resource and `istioctl`. Thus we focused a lot on ensuring the transition between the Helm-managed components and the `IstioOperator`-managed components was seamless, that new gateways could talk to "old" `1.5` proxies, and also that if we needed to roll back the gateways we had a seamless process to re-run the Helm deploy to correct any issues.

Our process essentially was:
- canary in the new `1.6` control plane alongside the old `1.5` one
- bounce the injected `Deployment`s (Everything is now on 1.6 - check traffic is still flowing)
- remove the old Helm-managed control plane components with `kubectl delete`s
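For a rough idea of what that looks like (a sketch only; the file name, revision label, and namespace below are placeholders rather than our exact configuration):

```bash
# Install the new control plane as a canary revision from the IstioOperator
# definition held in source control, alongside the existing control plane.
istioctl install -f istio-operator-1-6.yaml --set revision=1-6

# Move a namespace's workloads onto the new revision, then bounce them so the
# pods are re-injected with sidecars pointing at the new control plane.
kubectl label namespace my-namespace istio-injection- istio.io/rev=1-6 --overwrite
kubectl rollout restart deployment -n my-namespace
```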
Your issue sounds like perhaps the new Citadel within `istiod` doesn't share a root of trust with the old Citadel-managed root certs and is issuing incompatible certs to your new gateways that can't be trusted by the old service proxies? Were you installing the new `1.6` control plane into the same `istio-system` namespace as the old Helm-managed components, so that it can see/share the same root certs in `istio-ca-secret`? Did you happen to notice whether there were changes to `istio-ca-secret` post-install?
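As a quick check, comparing the root certificate fingerprint in that secret before and after the new install should show whether the root of trust changed. The secret and key names here are the defaults from a self-signed Citadel setup, so adjust if yours differ:

```bash
# Print the fingerprint of the self-signed root cert held in istio-ca-secret;
# run before and after installing the 1.6 control plane and compare the output.
kubectl get secret istio-ca-secret -n istio-system -o jsonpath='{.data.ca-cert\.pem}' \
  | base64 -d | openssl x509 -noout -fingerprint -subject -dates
```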
We moved away from Helm because there was no supported Helm-managed deployment process, however I note that the Istio team seem to have added one back again recently (which is rather frustrating given the difficulty in moving away from it and the mixed messaging around supporting `helm install|upgrade`). However, I'm not sure there is an entirely Helm-managed process that will get you from `1.4 -> 1.8` without dropping traffic, due to forward/backward compatibility of control plane components. In our case we seek to never drop traffic during production deployments, so that probably meant we had no choice but to move towards the `IstioOperator`; however, if this isn't a concern for you there are probably more options available with lower engineering effort.