Created November 3, 2025 12:45
watcher-2026.1-ptg backup
| *Watcher 2026.1 Gazpacho virtual PTG | |
| *DO NOT USE TRANSLATION TOOLS IN THIS ETHERPAD!!!! (ノ ゜Д゜)ノ ┻━┻ | |
| To translate this etherpad, please follow these easy instructions: | |
| 1. Look in the above toolbar and click the '</>' ("Share this pad") button | |
| 2. Click the "Read only" checkbox at the top of the dialog box | |
| 3. Copy the URL that appears in the "Link" box | |
| 4. Open that URL in a new browser tab or window | |
| 5. Use your translation tools of choice in the new window | |
| Thank you! | |
| Note: ctrl-shift-c is the shortcut to clear all colors, please try not to hit it. :) — take frequent snapshots (the "Save Revision" button, the star mark on the right) | |
| https://openinfra.dev/ptg/ | |
| October 27-31, 2025 | |
| people/colours: | |
| - name or IRC nick | |
| - sean-k-mooney | |
| - dviroel/dviroel | |
| - rlandy | |
| - chandan | |
| - jgilaber | |
| - amoralej | |
| - efoley | |
| Action prioritization: | |
| * Gazpacho: | |
| * Migrate to sdk (see topic about scope and phases) | |
| * Functional tests implementation: | |
| * 1. implement functional testing for api first | |
| * Integration tests for watcher-dashboard | |
| * CI renaming/pipeline updates/etc (create an etherpad to follow up) | |
| * watcher openstack-requirements job | |
| * Deprecations and removals in 2026.1: MAAS, monasca, etc.. | |
| * Eventlet removal | |
| * Documentation about Action Plan revert regression | |
| * Audit Scope: testing and documentation | |
| * 2026.2 and future: | |
| * | |
| Action Items: | |
| * MAAS: send a new email to the ML calling for maintainers, and mark this integration as deprecated: dviroel | |
| * Spec for openstack-sdk migration: | |
| * Continue with the ruff linting effort in other watcher projects: sean-k-mooney | |
| * Monasca removal: jgilaber, or sean. I'll review it if I don't write it :) | |
| * Deprecate prometheus datasource:jgilaber | |
| * Update documentation about upgrading from prometheus to aetos:jgilaber | |
| * Update documentation about deprecated integrations: sean-k-mooney? | |
| * Zuul job rename and merge list of tests: dviroel | |
| * Update gate pipeline to run proper list of tests: dviroel | |
| * Update watcher-tempest-plugin pipelines to run scenario tests for stable branches: dviroel | |
| * Grenade: include more tests and replace current datasource with prometheus/aetos: chandan | |
| * Grenade: add a job to update between slurp releases: chandan | |
| * Time taken on scenario tests: jgilaber | |
| * watcher openstack-requirements job: sean | |
| * Reorganizing unit tests to the structure proposed in the functional specs: amoralej | |
| * Start effort on adding functional tests (phase 1): maybe sean or amoralej? depending on time amoralej o/ | |
| * Properly document current Action interface: dviroel | |
| * Removal revert/rollback of action plan, documentation updates: dviroel | |
| * Spec with a proposal for revert of action plan: dviroel/? | |
| * Move the rally-openstack watcher plugin into watcher and run it via a periodic job: | |
| * explore dropping the number of dependencies we use (4 time zone libs, etc.): | |
| * Check with TC around the usage of pytest when writing tests: chandan | |
| * Spec for bulk archive for audit and actionplan on the watcher side to enable it in watcher-dashboard: chandan | |
| * watcher dashboard improvements: chandan | |
| * Documentation and tempest tests for audit scope per strategies | |
| * Freeze datamodel list api: dviroel | |
| * [scalability] Finish ongoing service-monitor for decision-engine (in progress already): amoralej | |
| * [scalability] Implement service-monitor for applier services: amoralej | |
| * [scalability] Instrument observability for audits executions via notifications: amoralej | |
| * [scalability] Check parallelization for workload_stabilization and node_resource_consolidation and propose a homogeneous way to define it (system-wide or per audit) in a spec. | |
| * Add a section for actions in documentation. amoralej | |
| * implement (and document) pre_conditions to SKIP actions - blueprint. amoralej | |
| * PoC noisy neighbor strategy with different metrics | |
| * We need to fix the current hardcoded timeout for the server migration action/helper (today it is only 2m): | |
| *Agenda: | |
| * Tuesday - 13:00 UTC - 15:59 UTC | |
| * Flamingo Retro (30m) - 13:00 UTC - 13:45 UTC | |
| * Flamingo Highlights: https://releases.openstack.org/flamingo/highlights.html#watcher | |
| * New features: Skip action, extended compute model attributes, | |
| * Datasources: New aetos integration, optional monasca integration | |
| * Strategies: zone migration fixes, noisy neighbor deprecation, Host Maintenance now supports disabling migrations (live/cold) and adds stop instance action | |
| * Dashboard: fixes and new additions (tech debt) | |
| * Testing: improvements on fake/real data tests, refactoring, new tests (zone migration, continuous audits, datamodel api, etc), support for microversion testing | |
| * Docs: fixes, documentation addition and removal | |
| * Deprecations: integrations that are not used/tested were classified as experimental, watcher-api-wsgi console script deprecation | |
| * What worked well | |
| * merging reviews with more cores | |
| * Lots of bug fixes and improvements in general | |
| * Docs improvements | |
| * Some contributions from outside the usual team +1 | |
| * DPL model | |
| * RDO third party CI Coverage+1 | |
| * What needs improvement | |
| * CI testing: how we test on check/gates | |
| * tempest-plugin tests for stable branches | |
| * how long our tests take to run | |
| * We may want to start reviewing/merging new features earlier | |
| * issues with conflicts in the FF week | |
| * Big stable backports backlog | |
| * More clear policies on how to manage strategies parameters (versioning strategies? part of main microversion?) | |
| * testing infrastructure | |
| * functional test | |
| * end to end ui tests | |
| * without this we need to do a lot of manual testing | |
| * Tech Debts (45m) - 13:45 UTC - 15:00 UTC | |
| * (sean-k-mooney): Code Modernisation, dependencies and dead code removal (15m) | |
| * (sean-k-mooney): Openstack SDK (20m) | |
| * (dviroel) Eventlet Removal (15m) | |
| * Other sessions to be aware of: | |
| * Eventlet session at 15:00 UTC in Austin Room | |
| * Integration tests for Horizon and plugins at 16:00 UTC in Icehouse Room | |
| * https://etherpad.opendev.org/p/horizon-gazpacho-ptg | |
| * Wednesday - 13:00 UTC - 15:59 UTC | |
| * Testing and CI () - 13:00 UTC - 14:45 UTC | |
| * (sean-k-mooney): Future of datasource backends and untested integrations (20m) | |
| * (dviroel/amoralej/chandankumar) CI Testing and Coverage | |
| * (dviroel) CI Testing (10m) | |
| * (amoralej) Improving testing coverage for strategies by doing functional testing (15m) | |
| * (chandankumar) Rally Testing in watcher (15m) | |
| * ### ~5 min break ### | |
| * Watcher Applier Improvements (70m) - 14:55 UTC - 16:00 UTC | |
| * (dviroel): Applier's Workflow Execution and Its Interface/Contract (20m) | |
| * (dviroel): Aborting running tasks (15m) | |
| * (amoralej/dviroel): Polling based implementation of cancelling ongoing actions (15m) | |
| * https://github.com/openstack/watcher/blob/ced0d58d23945bd95dab4a0ec9114a5125255a3b/watcher/applier/workflow_engine/base.py#L230-L248 | |
| * all actions keep polling about action and action plan state updates | |
| * (dviroel): Rollback of Action Plans (20m) | |
| * Other improvements? | |
| * Shared worker pool? | |
| * ### 5 min break ### | |
| * (chandankumar) Watcher Dashboard Improvements (60m) - 16:00 UTC - 16:59 UTC | |
| * (dviroel) Future of datamodel list API (20m) - 17:00 UTC - 17:20 UTC | |
| * Thursday - 13:00 UTC - 16:59 UTC | |
| * (amoralej): Scaling Watcher (60m) - 13:00 UTC - 14:30 UTC | |
| * (dviroel) Future of noisy neighbor strategy (30m) - 14:30 UTC - 15:10 UTC | |
| * ### 5 min break ### | |
| * (dviroel) Stacking strategies (25m) - 14:15 UTC - 15:40 UTC | |
| * PTG Action Itens Review - 15:00 UTC | |
| * Prioritization and Action Items review and assignments | |
| * Possible additional topics: | |
| * More clear policies on how to manage strategies parameters (versioning strategies? part of main microversion?) | |
| * In Nova PTG (Thursday): | |
| * 1700 - 1730 UTC : exposing the periodic scheduler update as an optional notification + resource provider weigher and preferred/avoided traits | |
| * 1730 - 1800 UTC : improving cyborg support | |
| * Other related topics also in nova ptg etherpad: https://etherpad.opendev.org/p/nova-2026.1-ptg | |
| *Proposed topics: | |
| * (sean-k-mooney): Future of datasource backends and untested integrations (added) | |
| * monasca removal timeline | |
| * gnocchi | |
| * not planned to be deprecated or removed in ceilometer currently | |
| * Suggestion: e-mail the mailing lists to see if anyone is using gnocchi | |
| * Suggestion: add a question about gnocchi usage to the next user survey | |
| * jlarriba: There are users of gnocchi, and some interest in using gnocchi with cloudkitty (storing only aggregated data in gnocchi OR storing metrics and aggregated data in gnocchi) | |
| * There's no strong motivation to immediately remove gnocchi in watcher; need to get input from users on its usage first | |
| * prometheus vs aetos | |
| * should we deprecate the direct prometheus support and transition to Aetos only? | |
| * ideal situation (and final version) is to use aetos-only and not expose prometheus to openstack services | |
| * note: the openstack exporter works well with aetos | |
| * Recommendation is to use aetos, and drop direct prometheus support | |
| * Removal of prometheus direct support will not be in the current release, but the deprecation process requires deprecation notice to be in at least one slurp release before actual removal | |
| * what would the upgrade look like? | |
| * integrations | |
| * watcher has a number of integrations that are marked as experimental due to a lack of testing, like the ironic and MAAS support. | |
| * i would like to formally deprecate the MAAS support due to a lack of docs/maintenance, and remove it in 2026.2 if no one steps up to test it, set up ci, and build out docs.+1+1+1 | |
| * i think it's worth considering doing the same for the ironic integration, however we may be able to get more engagement with the ironic team if we choose | |
| * to invest in building out ci and docs. | |
| * effectively my proposal is to continue updating https://docs.openstack.org/watcher/latest/integrations/index.html and move experimental integrations to deprecated | |
| * and then at the 2026.2 ptg re-evaluate whether we will start removing them or keep them, based on the maintenance, tests and docs at that point. | |
| * AGREED: | |
| * given monasca is retired, we agreed to remove it this cycle. | |
| * track with release note and bug https://bugs.launchpad.net/watcher/+bug/2120192 | |
| * gnocchi | |
| * need to do more research into current usage by watcher and telemetry users | |
| * no change in support this cycle | |
| * revisit at next PTG | |
| * prometheus | |
| * deprecate in 2026.1 in favor of aetos | |
| * include documentation about upgrading from prometheus to aetos datasource (and validate it) | |
| * integrations | |
| * MAAS: mark as deprecated this cycle (it blocks eventlet removal) | |
| * (sean-k-mooney): Openstack SDK (added) | |
| * python-watcherclient | |
| * watcher should be supported in the sdk so we can deprecate and remove the python binding in python-watcherclient | |
| * and finally the python api in python-watcherclient can be removed and replaced with the sdk internally | |
| * python-watcherclient should only provide the openstackclient plugin, and the standalone cli should also be removed | |
| * watcher-dashboard | |
| * once watcher is supported in the sdk, watcher-dashboard should be updated to use it exclusively to talk to watcher's api | |
| * watcher | |
| * watcher should also avoid using the project clients to talk to nova etc. | |
| * where there is existing sdk support we should remove the project-client dependency and replace it with the sdk | |
| * other projects are also in the process of removing their clients, including nova and neutron, so they will go away in a release or two. | |
| * this also significantly reduces the number of dependencies that watcher has. | |
| * rally | |
| * apparently rally is using python-watcherclient directly. https://opendev.org/openstack/rally-openstack/src/branch/master/requirements.txt#L26 and https://opendev.org/openstack/rally-openstack/src/branch/master/rally_openstack/task/scenarios/watcher/utils.py | |
| * AGREED: | |
| * start with watcher, then the dashboard, then the sdk later | |
| * do not freeze the python binding until we have sdk support | |
| * single spec to address the overall plan | |
| * goal: move one service client (nova?) over to the sdk this cycle | |
| * (sean-k-mooney): Code Modernisation, dependencies and dead code removal (added) | |
| * Code modernisation: | |
| * we decided not to go forward with the adoption of ruff last cycle. while we might want to revisit that in a release or two, i | |
| * would still like to do some cleanup of old code to follow a more modern style | |
| * to do this i would like to use pyupgrade (one of the tools ruff reimplements) to modernise the code base. | |
| * if we agree i will submit a series of smaller patches (broken up by top-level module) to make them easier to review. | |
| * (sean) i started this here https://review.opendev.org/c/openstack/watcher/+/961450 and it's completed here https://review.opendev.org/c/openstack/watcher/+/961453 | |
| * we could go further and adopt ruff at this point if we wanted, but this is effectively what i wanted to do this cycle. | |
| * if there is appetite for more i'm happy to revisit our tooling once that is completed. | |
| * dead code removal: | |
| * we removed a number of areas of dead code, such as api routes that always raised https://bugs.launchpad.net/watcher/+bug/2110895 | |
| * there are also some places where there is commented-out code in production files. | |
| * while commented-out code in regression tests is fine when we are reproducing a bug, it should not remain in production code. | |
| * this is partly related to the sdk topic, but i started on this here https://review.opendev.org/c/openstack/watcher/+/962677 | |
| * this removes the neutron client and some dead code, but there is more. | |
| * i think i can also remove the glance client in a similar way, as i believe it is not required anymore either | |
| * i would also like to complete the removal of the monasca client if we agree with that direction this cycle. | |
| * static analysis and typing | |
| * ai-assisted contributions | |
| * AGREED: | |
| * apply the same pre-commit and ruff linting checks to the tempest plugin and watcher client | |
| * defer the decision on more typing until we have to consider it (possibly start with interfaces) | |
| * explore dropping the number of dependencies we use (4 time zone libs, etc.) | |
| * | |
| * (doug): Applier's Workflow Execution and Its Interface/Contract (added) | |
| * Applier's workflow is not well documented and it was raising lots of questions during the review of new Actions/Workflow changes (e.g new stop instance action) | |
| * Base Action: https://github.com/openstack/watcher/blob/master/watcher/applier/actions/base.py | |
| * Which return type is the correct? Can we properly fix/document this? | |
| * Note that the lack of documentation raised an issue with current rollback mechanism | |
| * From Taskflow point of view, a Task should return useful information, that can be further be used in case of a revert | |
| * To trigger a revert, it should raise an exception | |
| * rollback of action plan is the next discussion | |
| * Should we decide now on a default Action interface to be followed by all actions? | |
| * AGREED: | |
| * we will assess this based on the path chosen for the rollback and aborting topics | |
| * we need to document this in either case. | |
| * current behavior allows actions to return False or raise exceptions | |
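| * The Taskflow contract described above (execute() returns data that revert() later receives; raising is what triggers a revert) can be sketched with a toy engine — this is a simplified model for illustration, NOT the real taskflow library or Watcher's actual Action interface: | |

```python
# Toy model of the Taskflow contract discussed above -- not the real
# taskflow engine. execute() returns useful data, revert() receives it,
# and raising an exception is what triggers the revert.

class Task:
    def execute(self):
        raise NotImplementedError

    def revert(self, result):
        pass  # default: nothing to undo


def run(tasks):
    done = []  # (task, result) pairs for completed tasks
    try:
        for task in tasks:
            done.append((task, task.execute()))
    except Exception:
        # revert in reverse order, handing each task its own result
        for task, result in reversed(done):
            task.revert(result)
        return "reverted"
    return "completed"


class Migrate(Task):
    reverted = []  # records which servers a revert migrated back

    def execute(self):
        # return the information a later revert would need
        return {"server": "vm-1", "source": "host-1"}

    def revert(self, result):
        Migrate.reverted.append(result["server"])


class Fail(Task):
    def execute(self):
        raise RuntimeError("boom")  # this failure triggers the revert
```

| * In this model, returning False (as some current actions do) would never trigger a revert — which is exactly the documentation gap raised above. | |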
| * (dviroel): Aborting running tasks | |
| * The Applier spawns a new green thread for every action (in addition to the one created by the TF engine) | |
| * For actions that support abort(), it kills the thread (migration[live], nop and sleep), otherwise it waits until the action completes | |
| * use a threading event to signal an execution to stop, when supported (can be evaluated by the action's execute() method, based on the current status of the action)? | |
| * cancelling an action would change the resource status, and the execute() loop could process the change as a failure | |
| * AGREED: | |
| * stop killing threads | |
| * stop spawning threads for each new action, | |
| * improve current execute() in actions to check resource status and abort the process/looping | |
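| * The agreed direction (signal, don't kill) can be sketched with a threading.Event; the function name and polling interface here are assumptions for illustration, not Watcher's actual code: | |

```python
import threading

# Sketch of polling-based cancellation: the action's execute() loop checks
# a shared Event on every poll iteration and aborts cleanly, instead of
# having its thread killed from the outside.

def execute_action(poll_fn, cancel_event, interval=0.01):
    """Run until poll_fn() reports completion or cancellation is signalled."""
    while not cancel_event.is_set():
        if poll_fn():                  # e.g. ask nova for migration status
            return "succeeded"
        cancel_event.wait(interval)    # sleeps, but wakes early on cancel
    return "cancelled"
```

| * Event.wait() doubles as the sleep between polls, so a cancel request interrupts the wait immediately rather than after the full interval. | |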
| * (doug) Rollback of Action Plans (added) | |
| * Current status: | |
| * There is no working rollback mechanism. LP: https://bugs.launchpad.net/watcher/+bug/2122148 | |
| * the revert() method from Actions is not being tested/called anymore | |
| * Config option to enable/disable revert: https://docs.openstack.org/watcher/latest/configuration/watcher.html#watcher_applier.rollback_when_actionplan_failed | |
| * Future of rollback option and revert of actions | |
| * Actions: | |
| * Deprecate current solution OR revert https://github.com/openstack/watcher/commit/69cf0d3ee53b828334a4e84f18c09f57d0a7c318#diff-a2fbc1a21f1fd714a0f24ad8a9f330c268170e8f6acceda389320936081a25f2 (action_execution_rule) | |
| * Current revert functions are untested; what to do? | |
| * Documentation updates: | |
| * New rollback mechanism, triggered by the user for a failed action plan | |
| * e.g.: New action "rollback" for Action Plans: https://review.opendev.org/c/openstack/watcher/+/746845 | |
| * Note that the rollback is not always the reverse order of the original action plan (like Taskflow would do?) | |
| * We can now make use of the SKIP feature added in Flamingo | |
| * AGREED: | |
| * Auto revert does not work and we should deprecate/remove it | |
| * treat as a bug, update the documentation and the associated config option | |
| * New spec to propose new action plan revert workflow | |
| * Polling based implementation of cancelling ongoing actions (In Applier improvements section) | |
| * https://github.com/openstack/watcher/blob/ced0d58d23945bd95dab4a0ec9114a5125255a3b/watcher/applier/workflow_engine/base.py#L230-L248 | |
| * (doug/amorale/chandan): CI Testing/Coverage (added) | |
| * (dviroel) CI Testing: | |
| * job naming: https://docs.opendev.org/opendev/infra-manual/latest/drivers.html#consistent-naming-for-zuul-jobs | |
| * watcher-tempest-functional SUCCESS 14m 12s | |
| * watcher-grenade SUCCESS 34m 14s | |
| * (sean) we should consider expanding this to test more | |
| * we may also want to change this to aetos | |
| * watcher-tempest-strategies SUCCESS 1h 04m 29s | |
| * watcher-tempest-actuator SUCCESS 48m 01s | |
| * watcher-tempest-functional-ipv6-only SUCCESS 25m 14s | |
| * watcher-prometheus-integration SUCCESS 1h 16m 39s | |
| * watcher-prometheus-integration-threading SUCCESS 1h 22m 34s | |
| * watcher-aetos-integration SUCCESS 1h 14m 45s | |
| * jobs running in check: | |
| * functional tests are running tempest api tests, using a single-node devstack deploy (takes around 30m) | |
| * python-watcherclient | |
| * https://review.opendev.org/c/openstack/python-watcherclient/+/956911?tab=change-view-tab-header-zuul-results-summary | |
| * jobs: actuator, tempest-strategies (gnocchi), watcher-tempest-actuator, watcher-prometheus-integration, watcher-prometheus-integration-threading, watcher-aetos-integration | |
| * refactor our scenario jobs? | |
| * check/gate testing | |
| * every voting job in check should be running in gate | |
| * same happens for python-watcherclient, we need to check all projects | |
| * stable branches testing | |
| * e.g.: https://review.opendev.org/c/openstack/watcher/+/963237?tab=change-view-tab-header-zuul-results-summary | |
| * do we need to keep all? | |
| * watcher-tempest-plugin testing | |
| * branchless repo | |
| * check should validate all stable branches to avoid breaking them | |
| * Today only functional tests are running for stable branches | |
| * e.g.: https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/956004?tab=change-view-tab-header-zuul-results-summary | |
| * time spent on scenario testing | |
| * no watcher job running in requirements gating | |
| * AGREED: | |
| * job renames/consolidation | |
| * watcher-functional -> only runs api tests, we can merge into one of the tempest ones | |
| * watcher-tempest-actuator -> can be merged to strategies | |
| * watcher-tempest-{datasource-if-needed} | |
| * ipv6 -> run one of the tempest with ipv6 enabled | |
| * gate updates after renaming and merging jobs | |
| * stable: enable tempest scenario jobs to run | |
| * stable branches: | |
| * backport changes from master | |
| * check and gates | |
| * tempest: | |
| * replace tempest-functional with a tempest job that runs scenarios tests | |
| * Epoxy and Flamingo: two datasources being tested | |
| * grenade: | |
| * new job that tests upgrade between slurp releases | |
| * include more testing, replace current datasource | |
| * check and propose watcher job against openstack requirements | |
| * (amoralej) Improving testing coverage for strategies by doing functional testing: | |
| * Sean sent spec proposal | |
| * https://review.opendev.org/c/openstack/watcher-specs/+/963299 is a spec i wrote for this, but we might want to combine it with one of the other testing topics. | |
| * https://gist.github.com/SeanMooney/43afa55282d2286a312eae7f3c7709e2 is the detailed implementation plan i was working on with ai | |
| * Functional tests: | |
| * Running actual watcher code with mocked external services (nova, cinder, keystone, prometheus) | |
| * On each test we need to specify what the external services should respond with (servers, volumes, metrics) | |
| * How to define test fixtures | |
| * Would mocking clients be an option for external services? | |
| * placement example of gabbi testing https://github.com/openstack/placement/blob/master/placement/tests/functional/gabbits/allocation-candidates.yaml | |
| * AGREED: | |
| * Phased approach: | |
| * 1st phase: api only gets/post | |
| * 2nd phase: adding decision-engine + nova + prometheus datastore | |
| * 3rd phase: adding applier | |
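| * The mocked-external-services idea above can be sketched with unittest.mock — build_model() here is a hypothetical stand-in for real collector code, and the fixture shape is an assumption, not Watcher's actual interfaces: | |

```python
from types import SimpleNamespace
from unittest import mock

# Functional-test sketch: exercise real code under test while every
# external service (here, nova) is a mock whose responses the test declares.

def build_model(nova):
    """Hypothetical collector logic: map each host to its server names."""
    model = {}
    for server in nova.servers.list():
        model.setdefault(server.host, []).append(server.name)
    return model


def test_build_model():
    nova = mock.Mock()
    # the fixture declares exactly what the fake nova should respond with
    nova.servers.list.return_value = [
        SimpleNamespace(host="cmp-1", name="vm-a"),
        SimpleNamespace(host="cmp-1", name="vm-b"),
        SimpleNamespace(host="cmp-2", name="vm-c"),
    ]
    assert build_model(nova) == {"cmp-1": ["vm-a", "vm-b"],
                                 "cmp-2": ["vm-c"]}
```

| * SimpleNamespace is used for fake servers because Mock(name=...) does not set a .name attribute (the name argument is reserved for the mock's repr). | |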
| * (chandankumar) Rally Testing in watcher | |
| * Current Status | |
| * watcher rally job: https://github.com/openstack/rally-openstack/blob/master/.zuul.d/rally-task-watcher.yaml and test results: https://zuul.openstack.org/builds?job_name=rally-task-watcher&skip=0 (last run from last month, seems green); not running in the watcher repo | |
| * Current rally task: https://github.com/openstack/rally-openstack/blob/master/rally-jobs/watcher.yaml (creates, lists and deletes audit templates and audits) | |
| * No real workload testing | |
| * Watcher rally plugin implementation: https://opendev.org/openstack/rally-openstack/src/branch/master/rally_openstack/task/scenarios/watcher/utils.py | |
| * What is missing | |
| * No way to pass audit template scope and audit parameters | |
| * No autotrigger | |
| * No support for event and continuous audit | |
| * No support for actionplan and actions | |
| * Other links: | |
| * https://rally.readthedocs.io/en/latest/index.html | |
| * https://github.com/openstack/rally-openstack/tree/master/samples/tasks/scenarios/watcher | |
| * DNM job: https://review.opendev.org/c/openstack/watcher/+/965101 | |
| * logs: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_590/openstack/5908d6d48e9b4f9fa1c56ed8db481501/ | |
| * https://github.com/openstack/rally/tree/master https://opendev.org/openstack/rally | |
| * AGREED | |
| * | |
| * (Chandan) watcher-dashboard improvements (added) | |
| * watcher-dashboard improvements and CI & testing | |
| * dashboard improvements | |
| * Add auto page refresh to show the audit/action plan status once they are created or started | |
| * use htmx or existing horizon table field used in instance | |
| * Add start button to the Action Plan detail page | |
| * Add an option to bulk archive audits/action plans | |
| * Auto archive an audit after a certain period of time | |
| * Need a spec on the watcher side for bulk archive | |
| * actions are archived automatically when the actionplan is archived | |
| * Updating a continuous audit | |
| * <Add improvements here> | |
| * Dashboard testing | |
| * Using django test framework | |
| * https://github.com/openstack/governance/blob/master/reference/pti/python.rst#python-test-running | |
| * ``` | |
| * OpenStack uses stestr as its test runner. stestr should be used for running all Python tests, including unit, functional, and integration tests. stestr is used because of its real time subunit output and its support for parallel execution of tests. In addition, stestr only runs tests conforming to the Python stdlib unittest model (and extensions on it like testtools). This enables people to use any test runner they prefer locally. Other popular test runners often include a testing ecosystem which is tied directly to the runner. Using these precludes the use of alternative runners for other users. | |
| * ``` | |
| * https://github.com/openstack/governance/blob/master/reference/pti/javascript.rst | |
| * playwright | |
| * https://review.opendev.org/c/openstack/watcher-specs/+/963438 | |
| * https://gist.github.com/SeanMooney/f56c7fd6f55ac48958a5c549e1701b6c | |
| * https://gist.github.com/SeanMooney/8746be1c29bbb78a63ef88f0104ea54f | |
| * Selenium fixtures from horizon team https://etherpad.opendev.org/p/horizon-gazpacho-ptg#L133 | |
| * Test repository: https://github.com/openstack/horizon/tree/master/openstack_dashboard/test/selenium | |
| * Usage in Manila UI: https://review.opendev.org/q/project:openstack/manila-ui+owner:[email protected] | |
| * Use htmx to replace javascript event handler | |
| * https://github.com/openstack/watcher-dashboard/blob/master/watcher_dashboard/templates/infra_optim/audit_templates/_create.html#L94 | |
| * https://github.com/openstack/watcher-dashboard/blob/master/watcher_dashboard/templates/infra_optim/audits/_create.html#L38 | |
| * (sean): if we do this we probably should have a spec to define why we are doing this and how it will be used. | |
| * when using htmx, the rest apis that we poll will return html | |
| * to me this is a much more testable approach to testing what the ui will render, as we effectively remove the javascript code as something that has to be tested directly | |
| * Adapting Django 5.2 in watcher-dashboard | |
| * Added based on https://etherpad.opendev.org/p/horizon-gazpacho-ptg#L55 | |
| * there is a django upgrade tool that can do most of that for us, but yes we should move to the latest LTS | |
| * (sean): we can use https://pypi.org/project/django-upgrade/ to help with this | |
| * AGREED | |
| * wishlist bug for dashboard improvements | |
| * Check with TC around usage of pytest to improve wording and then proceed further | |
| * Get https://review.opendev.org/c/openstack/watcher-specs/+/963438 merged | |
| * Spec for bulk archive for audit and actionplan | |
| * (doug): The future of datamodel list API (added) | |
| * Is there a real use for this API, besides the recent use in tempest tests? | |
| * Today the API returns the response from the decision-engine get_data_model rpc | |
| * What about freezing this API with the existing content? Avoiding new Instance/Node updates requiring a microversion bump | |
| * New test to avoid new parameters in this api | |
| * Add new storage model to the API? Baremetal? | |
| * Required? | |
| * https://review.opendev.org/c/openstack/watcher/+/955365 | |
| * I abandoned that patch that exposed the storage model because it was only needed for tempest tests and having notifications solved the problem | |
| * AGREED: | |
| * Do not extend it; do not support additional models (storage/baremetal) | |
| * Issue with rabbitmq queue size? model can grow and exceed the max size of the queue. Needs further testing | |
| * We don't have a better approach today than using datamodel list to check if model already created/deleted instances | |
| * Freeze the api vs removing the api? | |
| * defer the removal to future discussions, but for now we would like to not extend it anymore, even if new fields are added to model elements (instance, node) | |
| * (doug) Eventlet Removal (added) | |
| * Flamingo changes: | |
| * API | |
| * the watcher-api-wsgi console script was deprecated early in the flamingo cycle. This is the only api code that depends on eventlet (wsgi.Server) | |
| * Decision-Engine | |
| * Merged 2 services into a single one. In threading mode, the 2 services start 2 different processes, which end up with 2 different CDMs in memory | |
| * Added support for native thread mode in decision engine as experimental feature | |
| * Can be enabled by disabling eventlet with the environment variable OS_WATCHER_DISABLE_EVENTLET_PATCHING=true | |
| * Eventlet is still the default mode | |
| * We are missing a collector sync timeout for threading mode: | |
| * Debt: it needs a safe stop in the collector's thread. Without eventlet Timeout, we do not kill any threads, and we need to signal all threads to stop collecting. | |
| * A threading Event might be the best way to handle this. The nova model collector is the most complicated, since it has its own internal threadpool for speeding up info collection. | |
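| * A sketch of that safe-stop idea (all names here are assumptions, not Watcher's collector code): without eventlet.Timeout the sync side never kills the thread; it waits up to a timeout, then sets a stop Event that the collector checks between units of work: | |

```python
import threading
import time

# Safe collector stop: sync() waits up to the timeout, then signals a stop
# Event instead of killing the thread; collect() checks the Event between
# chunks of work and exits cooperatively.

def collect(chunks, stop):
    collected = []
    for chunk in chunks:
        if stop.is_set():        # cooperative check, no thread killing
            break
        collected.append(chunk)
    return collected


def sync(chunks, timeout):
    stop = threading.Event()
    result = []
    worker = threading.Thread(
        target=lambda: result.extend(collect(chunks, stop)))
    worker.start()
    worker.join(timeout)         # the proposed collector-timeout option
    stop.set()                   # signal a stop if still running
    worker.join()                # worker finishes its current chunk, exits
    return result
```

| * The trade-off versus eventlet.Timeout is that the thread only stops at the next check, so a slow unit of work (like one nova API call) still runs to completion. | |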
| * CI/Testing | |
| * watcher-prometheus-integration-threading was added (decision engine with native threads for now) | |
| * New tox environment openstack-tox-py312-threading (--exclude-regex 'applier') | |
| * Next | |
| * ThreadPool statistics ready to merge: https://review.opendev.org/c/openstack/watcher/+/960881 (cover decision-engine)(merged) | |
| * Add native thread support for the Applier | |
| * Expected changes on how we spawn new threads for each Task/Atom (we should stop killing them when the Action Plan is cancelled) | |
| * We can discuss more in Applier's topic: | |
| * We can signal the thread to stop, and modify the Action to support the stop signal. (only server migrate (live), nop and sleep support cancel/kill today) | |
| * The entire cancelling workflow needs to be refactored (cancel loop issue to be discussed in applier improvements section) | |
| * Switch all to native thread by default | |
| * Keep both modes as supported in Gazpacho | |
| * Switch all other jobs to use native threading mode | |
| * Create a job to keep testing eventlet, until we drop all code | |
| * Remaining eventlet code that is not going to be fixed: | |
| * MAAS: set to experimental in Flamingo, | |
| * deprecated in G? removal to 2026.2? | |
| * AGREED | |
| * rest apis to have their own timeouts, and have an event-based timeout for the overall process (to stop spawning new threads) | |
| * Add new config option for collector timeout | |
| * We have a gap in test/checking the behavior of a timeout of sync collector | |
| * Stop killing threads when action plan is cancelled (for both modes), live-migration is going to timeout at some point | |
| * .kill() will be a noop in native mode. It will wait for the thread to complete (for live-migration that is ~2min) | |
| * set MAAS to deprecate after PTG, send to ML a notice about that (mentioning that it has eventlet code, no maintainers, no automated tests) | |
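| * A sketch of the agreed two-level timeout model (collector_timeout and fetch_one are assumed names, not existing config/code): each REST call would carry its own timeout, while an overall deadline plus an Event stops us spawning new collection work. | |

```python
import threading
import time

collector_timeout = 2.0  # would come from the new config option
deadline = time.monotonic() + collector_timeout
stop_spawning = threading.Event()

def fetch_one(item):
    # A real call would use its own per-request timeout,
    # e.g. session.get(url, timeout=30).
    return item

results = []
for item in range(3):
    # Overall timeout: stop spawning new work, but never kill threads
    # that are already running -- their own REST timeouts bound them.
    if stop_spawning.is_set() or time.monotonic() >= deadline:
        break
    results.append(fetch_one(item))
print(results)  # [0, 1, 2]
```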
| * (doug) Future of Noisy Neighbor Goal/Strategy (added) | |
| * Cache monitoring metrics were deprecated in kernel and in nova: https://review.opendev.org/c/openstack/nova/+/565242 | |
| * The current strategy is based on LLC metrics only and had to be deprecated during the last cycle (non-SLURP) | |
| * removal in 2026.2? | |
| * What to do next? | |
| * If we don't replace this strategy, both strategy and goal will be removed | |
| * Replacement for LLC metrics to identify noisy neighbor? | |
| * Identify high priority VMs that are affected by noisy neighbors (low priority) | |
| * Identify contention: | |
| * CPU steal metrics to identify contention on high priority VMs? (vcpu.<num>.delay in domstats) -> not available in ceilometer | |
| * CPU pressure? PSI needs to be enabled in the kernel + check node_exporter pressure (is this published per VM?) | |
| * IOWait too? | |
| * https://www.libvirt.org/manpages/virsh.html#domstats | |
| * - `vcpu.<num>.delay - time the vCPU <num> thread was waiting in the runqueue as the scheduler has something else running ahead of it (in nanoseconds). Exposed to the VM as a steal time.` | |
| * ceilometer libvirt inspector: https://opendev.org/openstack/ceilometer/src/commit/dd5c5eb3ccd247d94042ffdaad97a6618fedc4fb/ceilometer/compute/virt/libvirt/inspector.py#L243-L250 | |
| * net.<num>.rx.errs | |
| * net.<num>.tx.drop | |
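| * Turning the vcpu.<num>.delay counter quoted above into a contention signal could look like this sketch; steal_ratio is a hypothetical helper, not an existing ceilometer or Watcher metric. | |

```python
def steal_ratio(delay_ns_t0, delay_ns_t1, interval_s):
    """Fraction of the sampling interval the vCPU spent waiting in the
    runqueue, derived from two vcpu.<num>.delay samples (nanoseconds)."""
    return (delay_ns_t1 - delay_ns_t0) / (interval_s * 1e9)

# vCPU accumulated 0.5 s of runqueue delay over a 10 s interval -> 5% steal
print(steal_ratio(1_000_000_000, 1_500_000_000, 10))  # 0.05
```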
| * Identify the noisy neighbors: | |
| * CPU usage from low priority instances to identify VMs that are consuming all resources during an amount of time | |
| * Other cpu metrics to identify noisy neighbor? | |
| * to be more flexible in the period to be considered (configurable by audit parameter) | |
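| * The bullets above could be prototyped roughly as follows; find_noisy, the threshold and the sample layout are all assumptions for illustration, not existing Watcher code. A low-priority instance is flagged only when its CPU usage stays high over the whole configurable period, not on a single spike. | |

```python
def find_noisy(cpu_samples, threshold=90.0, period=3):
    """cpu_samples: {instance_id: [cpu%% samples, newest last]}.
    Flag instances whose average usage over the last `period`
    samples meets the threshold (sustained, not spiky, load)."""
    noisy = []
    for instance, samples in cpu_samples.items():
        window = samples[-period:]
        if len(window) == period and sum(window) / period >= threshold:
            noisy.append(instance)
    return noisy

samples = {
    "vm-low-1": [95.0, 97.0, 99.0],   # sustained high usage
    "vm-low-2": [20.0, 95.0, 30.0],   # spiky, not sustained
}
print(find_noisy(samples))  # ['vm-low-1']
```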
| * Instance Priority | |
| * instance metadata is the only method today; we could maintain it and enable additional ways to identify high/low priority workloads | |
| * the current model can be used for a PoC | |
| * "watcher-priority" today | |
| * use of flavor extra_specs for priority? pre-defined tiering mechanism based on aggregate properties? map a tier to a priority? | |
| * use a defined namespace to set priorities for the optimized component | |
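| * A sketch of how the lookup order above might resolve: the "watcher-priority" metadata key exists today, while the flavor extra_spec fallback and its "watcher:priority" namespace are ideas from this discussion, not an implemented API. | |

```python
def instance_priority(metadata, extra_specs, default="low"):
    # Instance metadata (the mechanism available today) wins;
    # a namespaced flavor extra_spec is a proposed fallback.
    if "watcher-priority" in metadata:
        return metadata["watcher-priority"]
    return extra_specs.get("watcher:priority", default)

print(instance_priority({"watcher-priority": "high"}, {}))  # high
print(instance_priority({}, {"watcher:priority": "low"}))   # low
print(instance_priority({}, {}))                            # low (default)
```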
| * AGREED: | |
| * remove the current noisy neighbour strategy in 2026.2+ so the deprecation ships in a SLURP | |
| * PoC cpu pressure/iowait/other metric? | |
| * instance metadata to PoC, but we should consider different solutions to identify/classify workload | |
| * (amoralej) Scaling Watcher (added) | |
| * Scalability assessment - https://etherpad.opendev.org/p/watcher-scalability | |
| * Expanding beyond the limitation of running just one instance of the Decision Engine especially, and partially of the Applier. Some ideas: | |
| * The main issue is related to continuous audits: an audit is associated with a specific decision-engine | |
| * Active/Active decision-engine ala cinder-volume | |
| * (sean): I think both the decision engine and the applier should be horizontally scalable and stateless, with an arbitrary number of active members. | |
| * That would mean no longer assigning audits, action plans or actions to them in the DB; instead we would need to move to | |
| * an event-driven model where we invoke them over the RPC bus to start an audit or action plan from a dispatcher or conductor process. | |
| * This would be closer to how Zuul works, where the Zuul scheduler dispatches work to the Zuul merger and Zuul executor processes. | |
| * This however would mean the data models would have to live in the DB or another shared data store, not just in memory, so that all stateless instances have the same view. | |
| * This is also how nova's schedulers/conductors work: we do not dispatch work to them directly by hostname; we put the request on a queue in rabbit and one of the conductors or schedulers will get the message from rabbit. | |
| * What about collectors in multiple decision-engine services case? | |
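| * The active/active dispatch idea above, reduced to a toy sketch: work goes on a shared queue and any stateless worker may claim it, rather than an audit being pinned to one service. In a real deployment oslo.messaging/rabbit would play the role of the in-process queue used here. | |

```python
import queue
import threading

work = queue.Queue()
handled = []
lock = threading.Lock()

def decision_engine_worker():
    # Each stateless worker competes for audits on the shared queue;
    # no audit is bound to a particular service by hostname.
    while True:
        try:
            audit = work.get_nowait()
        except queue.Empty:
            return
        with lock:
            handled.append(audit)

for audit_id in ("audit-1", "audit-2", "audit-3"):
    work.put(audit_id)

workers = [threading.Thread(target=decision_engine_worker) for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(sorted(handled))  # ['audit-1', 'audit-2', 'audit-3'] -- each handled once
```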
| * (dviroel) Stacking Strategies | |
| * Brainstorm possibility of having stacking strategies | |
| * e.g.: saving_energy after a workload consolidation | |
| * one depends on the success of the other | |
| * linked action plans? | |
| * a graph flow of action plans | |
| * merge all actions into a single action plan? | |
| * we could avoid unnecessary actions (e.g.: migrations A -> B ->C, and move A -> C directly) | |
| * 1) strategies could produce a data model copy as the result of their solution, which would be used by the following strategy | |
| * strategies request a copy from collector_manager + apply scope | |
| * 2) model can be updated based on linked solutions (migrations, machine states, etc) | |
| * the model from collector is updated to include changes from previous solutions | |
| * Audits? Map of goals and strategies? | |
| * e.g.: Workload consolidation + Workload balance | |
| * solutions: TBD | |
| * linear application | |
| * we could have an ordered list of strategies to apply | |
| * share a common model from one to another | |
| * mutate state as we go so that we track the future load that would be created by taking the proposed actions | |
| * take a split approach: pass in a mutable model + a second structure with metrics about workload and desired state (location, power status) | |
| * composition by parts | |
| * we could annotate resources with cost functions | |
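| * A toy sketch of the linear-application idea: each strategy plans against the model as mutated by the previous solution, so it sees the predicted future state. The strategy/solution shapes (consolidate, apply_solution, a dict model) are illustrative assumptions, not Watcher's data model. | |

```python
def consolidate(model):
    # Toy "consolidation" strategy: propose moving every VM on hostB
    # to hostA; a real strategy would score hosts and respect scope.
    return [("migrate", vm, "hostB", "hostA")
            for vm, host in model.items() if host == "hostB"]

def apply_solution(model, actions):
    # Mutate the model so the *next* stacked strategy plans against
    # the state the cluster would be in after these actions run.
    for _, vm, _, dest in actions:
        model[vm] = dest
    return model

model = {"vm1": "hostA", "vm2": "hostB"}
plan = consolidate(model)
model = apply_solution(model, plan)  # the next strategy now sees vm2 on hostA
print(plan)   # [('migrate', 'vm2', 'hostB', 'hostA')]
print(model)  # {'vm1': 'hostA', 'vm2': 'hostA'}
```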
| * AGREED: | |
| * We need a clearer use case for it; continue the discussion. |