@jillesvangurp
Created March 12, 2026 13:56
Elasticsearch Rolling Restart skill
name: elasticsearch-rolling-upgrade
description: Run a zero-downtime rolling Elasticsearch upgrade in the environment by upgrading one node at a time with Ansible and systemd. Use when planning or executing Elasticsearch version upgrades (for example 9.1.x to 9.3.1), including preflight checks, per-node allocation control, strict health gates, and abort-on-failure handling.

elasticsearch-rolling-upgrade

Use this skill for deterministic, guarded Elasticsearch rolling upgrades.

Related skill

  1. Apply ansible-production-guardrails together with this skill for any in-place run, especially production.
  2. Treat these rules as additional Elasticsearch-specific rollout constraints.

Required inputs

  1. Confirm inventory file (for example ansible/inventories/telekom/hosts.ini).
  2. Confirm Elasticsearch node list and exact order (default: node-3, node-2, node-1 when equivalent).
  3. Confirm target version is already set in variables (ansible/group_vars/all.yml -> elasticsearch_version).
  4. Confirm Elasticsearch API endpoint to check health (for example http://localhost:9200).
  5. Confirm one Elasticsearch node that will run snapshot verification before rollout.
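Once confirmed, the inputs above can be captured as shell variables so that later commands stay consistent. This is a sketch only; the variable names (INVENTORY, ES_URL, NODES, SNAPSHOT_CHECK_NODE) are assumptions, not part of the playbook, and the values are the examples and defaults from this list:

```shell
# Hypothetical variable block; substitute your confirmed values.
INVENTORY="ansible/inventories/telekom/hosts.ini"  # confirmed inventory file
ES_URL="http://localhost:9200"                     # confirmed health endpoint
NODES="node-3 node-2 node-1"                       # confirmed node order
SNAPSHOT_CHECK_NODE="node-3"                       # runs snapshot verification
```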

Hard safety gates

  1. Never stop any node when cluster health is yellow or red.
  2. Before stopping each node, require cluster health green and no active shard relocation.
  3. Upgrade exactly one node at a time using --limit <single-host>.
  4. Abort immediately when any step fails: a health check fails, Ansible fails, the node does not rejoin, or recovery does not settle.
  5. After aborting, stop automation and hand control back to the user to inspect and fix before any retry.
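Gates 1 and 2 can be scripted. The sketch below assumes curl and grep are available; `es_green_and_settled` is a hypothetical helper name, not part of the playbook:

```shell
# Return 0 only when the cluster is green and no shards are relocating.
# Parses the filter_path response with grep to avoid a jq dependency.
es_green_and_settled() {
  local health
  health="$(curl -fsS "$1/_cluster/health?filter_path=status,relocating_shards")" || return 1
  printf '%s' "$health" | grep -q '"status":"green"' &&
  printf '%s' "$health" | grep -q '"relocating_shards":0'
}

# Usage: run before stopping each node, and abort on failure:
# es_green_and_settled "http://localhost:9200" || { echo "ABORT: not green/settled" >&2; exit 1; }
```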

Preflight checks (run once before first node)

  1. Verify target version wiring:
    • rg -n "^elasticsearch_version:" ansible/group_vars/all.yml
  2. Verify inventory hosts:
    • ansible-inventory -i <inventory> --host <node-1>
    • ansible-inventory -i <inventory> --host <node-2>
    • ansible-inventory -i <inventory> --host <node-3>
  3. Verify cluster health is green:
    • curl -fsS "<es_url>/_cluster/health?pretty"
  4. Verify no active relocation before maintenance:
    • curl -fsS "<es_url>/_cluster/health?filter_path=status,relocating_shards,initializing_shards,unassigned_shards,number_of_nodes"
  5. Verify snapshot/backup health on one node before proceeding:
    • ansible -i <inventory> <snapshot-check-node> -b -m ansible.builtin.command -a "/opt/formation/bin/ensure-es-and-backups-healthy"
  6. Run Ansible dry-run against node-3 to validate playbook/task wiring before real changes:
    • cd ansible && ansible-playbook -i <inventory-relative-to-ansible-dir> elasticsearch.yml --limit node-3 --check
  7. Abort if any preflight check fails or status is not green.
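Preflight checks 1 and 2 can also be wrapped in one pass/fail call. This is a sketch under the assumption that rg and ansible-inventory are on PATH; `preflight_wiring` is a hypothetical name:

```shell
# Verify version wiring and inventory membership for every node; fail fast.
preflight_wiring() {
  local inv="$1"; shift
  rg -q "^elasticsearch_version:" ansible/group_vars/all.yml || return 1
  local node
  for node in "$@"; do
    ansible-inventory -i "$inv" --host "$node" > /dev/null || return 1
  done
}

# Usage: preflight_wiring ansible/inventories/telekom/hosts.ini node-1 node-2 node-3
```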

Per-node workflow

Repeat all steps below for each node in the chosen order.

  1. Pre-node gate:
    • Check health: curl -fsS "<es_url>/_cluster/health?filter_path=status,relocating_shards,initializing_shards,unassigned_shards,number_of_nodes"
    • Require status=green and relocating_shards=0 before proceeding.
    • If not satisfied, abort and return control to user.
  2. Disable replica allocation:
    • curl -fsS -X PUT "<es_url>/_cluster/settings" -H 'Content-Type: application/json' -d '{"persistent":{"cluster.routing.allocation.enable":"primaries"}}'
  3. Stop Elasticsearch on target node:
    • ansible -i <inventory> <node> -b -m ansible.builtin.systemd -a "name=elasticsearch state=stopped"
  4. Upgrade target node with Ansible playbook and single-host limit:
    • cd ansible && ansible-playbook -i <inventory-relative-to-ansible-dir> elasticsearch.yml --limit <node>
  5. Start Elasticsearch on target node:
    • ansible -i <inventory> <node> -b -m ansible.builtin.systemd -a "name=elasticsearch state=started enabled=true"
  6. Wait for node to rejoin and stabilize:
    • curl -fsS "<es_url>/_cat/nodes?v"
    • curl -fsS "<es_url>/_cluster/health?wait_for_nodes=3&timeout=5m"
    • curl -fsS "<es_url>/_cluster/health?wait_for_no_relocating_shards=true&timeout=5m"
  7. Re-enable allocation:
    • curl -fsS -X PUT "<es_url>/_cluster/settings" -H 'Content-Type: application/json' -d '{"persistent":{"cluster.routing.allocation.enable":null}}'
  8. Post-node gate:
    • curl -fsS "<es_url>/_cluster/health?wait_for_status=green&timeout=10m&pretty"
    • Proceed only when health returns to green.
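Steps 2 through 8 above can be sketched as a single per-node wrapper. `upgrade_node` is a hypothetical function, the commands mirror those listed, and every step returns control on the first failure:

```shell
# Run the full per-node sequence for one host; any failure aborts the node.
upgrade_node() {
  local node="$1" inv="$2" url="$3"
  # 2. Disable replica allocation (primaries only) before stopping the node.
  curl -fsS -X PUT "$url/_cluster/settings" -H 'Content-Type: application/json' \
    -d '{"persistent":{"cluster.routing.allocation.enable":"primaries"}}' || return 1
  # 3-5. Stop, upgrade, and restart exactly this node.
  ansible -i "$inv" "$node" -b -m ansible.builtin.systemd \
    -a "name=elasticsearch state=stopped" || return 1
  ansible-playbook -i "$inv" elasticsearch.yml --limit "$node" || return 1
  ansible -i "$inv" "$node" -b -m ansible.builtin.systemd \
    -a "name=elasticsearch state=started enabled=true" || return 1
  # 6. Wait for the node to rejoin and recovery to settle.
  curl -fsS "$url/_cluster/health?wait_for_no_relocating_shards=true&timeout=5m" || return 1
  # 7. Re-enable allocation by clearing the persistent setting.
  curl -fsS -X PUT "$url/_cluster/settings" -H 'Content-Type: application/json' \
    -d '{"persistent":{"cluster.routing.allocation.enable":null}}' || return 1
  # 8. Post-node gate: only green unblocks the next node.
  curl -fsS "$url/_cluster/health?wait_for_status=green&timeout=10m" || return 1
}
```

A nonzero return from `upgrade_node` maps directly onto the abort conditions below: stop the loop and hand control back to the user.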

Abort conditions

Abort immediately and do not continue to the next node when any of these occur:

  1. Cluster is yellow or red before node stop.
  2. Any command returns non-zero.
  3. Target node does not rejoin within timeout.
  4. Cluster does not return to green after re-enabling allocation.
  5. Cluster enters red at any point.
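One minimal way to enforce condition 2 in a driver script is bash strict mode plus an ERR trap; this is a sketch, and the abort message is illustrative only:

```shell
# Exit on any non-zero command, unset variable, or pipeline failure,
# and print a handoff message before exiting.
set -euo pipefail
trap 'echo "ABORT: step failed; hand control back to the operator" >&2' ERR
```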

Post-upgrade verification

  1. Verify all nodes run target version:
    • curl -fsS "<es_url>/_cat/nodes?h=name,ip,version,master,role"
  2. Verify cluster health is green:
    • curl -fsS "<es_url>/_cluster/health?pretty"
  3. Record the exact node order, execution timestamps, and any anomalies in the handoff.
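Check 1 can be made strict rather than visual. `verify_versions` below is a hypothetical helper that passes only when every node reports exactly the target version (9.3.1 in the description's example); the _cat/nodes endpoint is standard Elasticsearch:

```shell
# Pass only when all nodes report the same, expected version string.
verify_versions() {
  local url="$1" target="$2" versions
  versions="$(curl -fsS "$url/_cat/nodes?h=version" | sort -u)" || return 1
  [ "$versions" = "$target" ]
}

# Usage: verify_versions "http://localhost:9200" "9.3.1" || echo "version mismatch" >&2
```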