| name | description |
|---|---|
elasticsearch-rolling-upgrade |
Run a zero-downtime rolling Elasticsearch upgrade in the environment by upgrading one node at a time with Ansible and systemd. Use when planning or executing Elasticsearch version upgrades (for example 9.1.x to 9.3.1), including preflight checks, per-node allocation control, strict health gates, and abort-on-failure handling. |
Use this skill for deterministic, guarded Elasticsearch rolling upgrades.
- Apply
ansible-production-guardrailstogether with this skill for any in-place run, especially production. - Treat these rules as additional Elasticsearch-specific rollout constraints.
- Confirm inventory file (for example
ansible/inventories/telekom/hosts.ini). - Confirm Elasticsearch node list and exact order (default:
node-3,node-2,node-1when equivalent). - Confirm target version is already set in variables (
ansible/group_vars/all.yml->elasticsearch_version). - Confirm Elasticsearch API endpoint to check health (for example
http://localhost:9200). - Confirm one Elasticsearch node that will run snapshot verification before rollout.
- Never stop any node when cluster health is
yelloworred. - Before stopping each node, require cluster health
greenand no active shard relocation. - Upgrade exactly one node at a time using
--limit <single-host>. - Abort immediately when any step fails, health checks fail, Ansible fails, node does not rejoin, or recovery does not settle.
- After aborting, stop automation and hand control back to the user to inspect and fix before any retry.
- Verify target version wiring:
rg -n "^elasticsearch_version:" ansible/group_vars/all.yml
- Verify inventory hosts:
ansible-inventory -i <inventory> --host <node-1>ansible-inventory -i <inventory> --host <node-2>ansible-inventory -i <inventory> --host <node-3>
- Verify cluster health is green:
curl -fsS "<es_url>/_cluster/health?pretty"
- Verify no active relocation before maintenance:
curl -fsS "<es_url>/_cluster/health?filter_path=status,relocating_shards,initializing_shards,unassigned_shards,number_of_nodes"
- Verify snapshot/backup health on one node before proceeding:
ansible -i <inventory> <snapshot-check-node> -b -m ansible.builtin.command -a "/opt/formation/bin/ensure-es-and-backups-healthy"
- Run Ansible dry-run against node-3 to validate playbook/task wiring before real changes:
cd ansible && ansible-playbook -i <inventory-relative-to-ansible-dir> elasticsearch.yml --limit node-3 --check
- Abort if any preflight check fails or status is not
green.
Repeat all steps below for each node in the chosen order.
- Pre-node gate:
- Check health:
curl -fsS "<es_url>/_cluster/health?filter_path=status,relocating_shards,initializing_shards,unassigned_shards,number_of_nodes" - Require
status=greenandrelocating_shards=0before proceeding. - If not satisfied, abort and return control to user.
- Check health:
- Disable replica allocation:
curl -fsS -X PUT "<es_url>/_cluster/settings" -H 'Content-Type: application/json' -d '{"persistent":{"cluster.routing.allocation.enable":"primaries"}}'
- Stop Elasticsearch on target node:
ansible -i <inventory> <node> -b -m ansible.builtin.systemd -a "name=elasticsearch state=stopped"
- Upgrade target node with Ansible playbook and single-host limit:
cd ansible && ansible-playbook -i <inventory-relative-to-ansible-dir> elasticsearch.yml --limit <node>
- Start Elasticsearch on target node:
ansible -i <inventory> <node> -b -m ansible.builtin.systemd -a "name=elasticsearch state=started enabled=true"
- Wait for node to rejoin and stabilize:
curl -fsS "<es_url>/_cat/nodes?v"curl -fsS "<es_url>/_cluster/health?wait_for_nodes=3&timeout=5m"curl -fsS "<es_url>/_cluster/health?wait_for_no_relocating_shards=true&timeout=5m"
- Re-enable allocation:
curl -fsS -X PUT "<es_url>/_cluster/settings" -H 'Content-Type: application/json' -d '{"persistent":{"cluster.routing.allocation.enable":null}}'
- Post-node gate:
curl -fsS "<es_url>/_cluster/health?wait_for_status=green&timeout=10m&pretty"- Proceed only when health returns to
green.
Abort immediately and do not continue to next node when any of these occur:
- Cluster is
yelloworredbefore node stop. - Any command returns non-zero.
- Target node does not rejoin within timeout.
- Cluster does not return to
greenafter re-enabling allocation. - Cluster enters
redat any point.
- Verify all nodes run target version:
curl -fsS "<es_url>/_cat/nodes?h=name,ip,version,master,role"
- Verify cluster health is green:
curl -fsS "<es_url>/_cluster/health?pretty"
- Record the exact node order, execution timestamps, and any anomalies in the handoff.