@jillesvangurp
Created March 12, 2026 13:56
Elasticsearch Rolling Restart skill
name: elasticsearch-rolling-upgrade
description: Run a zero-downtime rolling Elasticsearch upgrade in the environment by upgrading one node at a time with Ansible and systemd. Use when planning or executing Elasticsearch version upgrades (for example 9.1.x to 9.3.1), including preflight checks, per-node allocation control, strict health gates, and abort-on-failure handling.

elasticsearch-rolling-upgrade

Use this skill for deterministic, guarded Elasticsearch rolling upgrades.

Related skill

  1. Apply ansible-production-guardrails together with this skill for any in-place run, especially production.
  2. Treat these rules as additional Elasticsearch-specific rollout constraints.

Required inputs

  1. Confirm inventory file (for example ansible/inventories/telekom/hosts.ini).
  2. Confirm Elasticsearch node list and exact order (default: node-3, node-2, node-1 when equivalent).
  3. Confirm target version is already set in variables (ansible/group_vars/all.yml -> elasticsearch_version).
  4. Confirm Elasticsearch API endpoint to check health (for example http://localhost:9200).
  5. Confirm one Elasticsearch node that will run snapshot verification before rollout.
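Once confirmed, the inputs above can be captured as shell variables so that later commands stay consistent. This is a sketch only; the variable names (INVENTORY, ES_URL, NODES, SNAPSHOT_CHECK_NODE) are assumptions, not part of the playbook, and the values are the examples and defaults from this list:

```shell
# Hypothetical variable block; substitute your confirmed values.
INVENTORY="ansible/inventories/telekom/hosts.ini"  # confirmed inventory file
ES_URL="http://localhost:9200"                     # confirmed health endpoint
NODES="node-3 node-2 node-1"                       # confirmed node order
SNAPSHOT_CHECK_NODE="node-3"                       # runs snapshot verification
```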

Hard safety gates

  1. Never stop any node when cluster health is yellow or red.
  2. Before stopping each node, require cluster health green and no active shard relocation.
  3. Upgrade exactly one node at a time using --limit <single-host>.
  4. Abort immediately when any step fails: a health check fails, Ansible fails, the node does not rejoin, or recovery does not settle.
  5. After aborting, stop automation and hand control back to the user to inspect and fix before any retry.
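Gates 1 and 2 can be scripted. The sketch below assumes curl and grep are available; `es_green_and_settled` is a hypothetical helper name, not part of the playbook:

```shell
# Return 0 only when the cluster is green and no shards are relocating.
# Parses the filter_path response with grep to avoid a jq dependency.
es_green_and_settled() {
  local health
  health="$(curl -fsS "$1/_cluster/health?filter_path=status,relocating_shards")" || return 1
  printf '%s' "$health" | grep -q '"status":"green"' &&
  printf '%s' "$health" | grep -q '"relocating_shards":0'
}

# Usage: run before stopping each node, and abort on failure:
# es_green_and_settled "http://localhost:9200" || { echo "ABORT: not green/settled" >&2; exit 1; }
```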

Preflight checks (run once before first node)

  1. Verify target version wiring:
    • rg -n "^elasticsearch_version:" ansible/group_vars/all.yml
  2. Verify inventory hosts:
    • ansible-inventory -i <inventory> --host <node-1>
    • ansible-inventory -i <inventory> --host <node-2>
    • ansible-inventory -i <inventory> --host <node-3>
  3. Verify cluster health is green:
    • curl -fsS "<es_url>/_cluster/health?pretty"
  4. Verify no active relocation before maintenance:
    • curl -fsS "<es_url>/_cluster/health?filter_path=status,relocating_shards,initializing_shards,unassigned_shards,number_of_nodes"
  5. Verify snapshot/backup health on one node before proceeding:
    • ansible -i <inventory> <snapshot-check-node> -b -m ansible.builtin.command -a "/opt/formation/bin/ensure-es-and-backups-healthy"
  6. Run Ansible dry-run against node-3 to validate playbook/task wiring before real changes:
    • cd ansible && ansible-playbook -i <inventory-relative-to-ansible-dir> elasticsearch.yml --limit node-3 --check
  7. Abort if any preflight check fails or status is not green.
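Preflight checks 1 and 2 can also be wrapped in one pass/fail call. This is a sketch under the assumption that rg and ansible-inventory are on PATH; `preflight_wiring` is a hypothetical name:

```shell
# Verify version wiring and inventory membership for every node; fail fast.
preflight_wiring() {
  local inv="$1"; shift
  rg -q "^elasticsearch_version:" ansible/group_vars/all.yml || return 1
  local node
  for node in "$@"; do
    ansible-inventory -i "$inv" --host "$node" > /dev/null || return 1
  done
}

# Usage: preflight_wiring ansible/inventories/telekom/hosts.ini node-1 node-2 node-3
```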

Per-node workflow

Repeat all steps below for each node in the chosen order.

  1. Pre-node gate:
    • Check health: curl -fsS "<es_url>/_cluster/health?filter_path=status,relocating_shards,initializing_shards,unassigned_shards,number_of_nodes"
    • Require status=green and relocating_shards=0 before proceeding.
    • If not satisfied, abort and return control to user.
  2. Disable replica allocation:
    • curl -fsS -X PUT "<es_url>/_cluster/settings" -H 'Content-Type: application/json' -d '{"persistent":{"cluster.routing.allocation.enable":"primaries"}}'
  3. Stop Elasticsearch on target node:
    • ansible -i <inventory> <node> -b -m ansible.builtin.systemd -a "name=elasticsearch state=stopped"
  4. Upgrade target node with Ansible playbook and single-host limit:
    • cd ansible && ansible-playbook -i <inventory-relative-to-ansible-dir> elasticsearch.yml --limit <node>
  5. Start Elasticsearch on target node:
    • ansible -i <inventory> <node> -b -m ansible.builtin.systemd -a "name=elasticsearch state=started enabled=true"
  6. Wait for node to rejoin and stabilize:
    • curl -fsS "<es_url>/_cat/nodes?v"
    • curl -fsS "<es_url>/_cluster/health?wait_for_nodes=3&timeout=5m"
    • curl -fsS "<es_url>/_cluster/health?wait_for_no_relocating_shards=true&timeout=5m"
  7. Re-enable allocation:
    • curl -fsS -X PUT "<es_url>/_cluster/settings" -H 'Content-Type: application/json' -d '{"persistent":{"cluster.routing.allocation.enable":null}}'
  8. Post-node gate:
    • curl -fsS "<es_url>/_cluster/health?wait_for_status=green&timeout=10m&pretty"
    • Proceed only when health returns to green.
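Steps 2 through 8 above can be sketched as a single per-node wrapper. `upgrade_node` is a hypothetical function, the commands mirror those listed, and every step returns control on the first failure:

```shell
# Run the full per-node sequence for one host; any failure aborts the node.
upgrade_node() {
  local node="$1" inv="$2" url="$3"
  # 2. Disable replica allocation (primaries only) before stopping the node.
  curl -fsS -X PUT "$url/_cluster/settings" -H 'Content-Type: application/json' \
    -d '{"persistent":{"cluster.routing.allocation.enable":"primaries"}}' || return 1
  # 3-5. Stop, upgrade, and restart exactly this node.
  ansible -i "$inv" "$node" -b -m ansible.builtin.systemd \
    -a "name=elasticsearch state=stopped" || return 1
  ansible-playbook -i "$inv" elasticsearch.yml --limit "$node" || return 1
  ansible -i "$inv" "$node" -b -m ansible.builtin.systemd \
    -a "name=elasticsearch state=started enabled=true" || return 1
  # 6. Wait for the node to rejoin and recovery to settle.
  curl -fsS "$url/_cluster/health?wait_for_no_relocating_shards=true&timeout=5m" || return 1
  # 7. Re-enable allocation by clearing the persistent setting.
  curl -fsS -X PUT "$url/_cluster/settings" -H 'Content-Type: application/json' \
    -d '{"persistent":{"cluster.routing.allocation.enable":null}}' || return 1
  # 8. Post-node gate: only green unblocks the next node.
  curl -fsS "$url/_cluster/health?wait_for_status=green&timeout=10m" || return 1
}
```

A nonzero return from `upgrade_node` maps directly onto the abort conditions below: stop the loop and hand control back to the user.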

Abort conditions

Abort immediately and do not continue to the next node when any of these occur:

  1. Cluster is yellow or red before node stop.
  2. Any command returns non-zero.
  3. Target node does not rejoin within timeout.
  4. Cluster does not return to green after re-enabling allocation.
  5. Cluster enters red at any point.
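One minimal way to enforce condition 2 in a driver script is bash strict mode plus an ERR trap; this is a sketch, and the abort message is illustrative only:

```shell
# Exit on any non-zero command, unset variable, or pipeline failure,
# and print a handoff message before exiting.
set -euo pipefail
trap 'echo "ABORT: step failed; hand control back to the operator" >&2' ERR
```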

Post-upgrade verification

  1. Verify all nodes run target version:
    • curl -fsS "<es_url>/_cat/nodes?h=name,ip,version,master,role"
  2. Verify cluster health is green:
    • curl -fsS "<es_url>/_cluster/health?pretty"
  3. Record the exact node order, execution timestamps, and any anomalies in the handoff.
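Check 1 can be made strict rather than visual. `verify_versions` below is a hypothetical helper that passes only when every node reports exactly the target version (9.3.1 in the description's example); the _cat/nodes endpoint is standard Elasticsearch:

```shell
# Pass only when all nodes report the same, expected version string.
verify_versions() {
  local url="$1" target="$2" versions
  versions="$(curl -fsS "$url/_cat/nodes?h=version" | sort -u)" || return 1
  [ "$versions" = "$target" ]
}

# Usage: verify_versions "http://localhost:9200" "9.3.1" || echo "version mismatch" >&2
```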