# Watcher 2026.1 Gazpacho PTG Report
## Executive Summary
The OpenStack Watcher project held its virtual Project Teams Gathering (PTG)
from October 27-31, 2025, to plan the 2026.1 Gazpacho development cycle. The
event brought together key contributors including sean-k-mooney, dviroel,
rlandy, chandan, jgilaber, amoralej, and efoley to address critical technical
debt, modernization efforts, and strategic improvements. The PTG resulted in
comprehensive agreements across seven major themes: datasource modernization,
SDK migration, eventlet removal, testing infrastructure, applier improvements,
scalability enhancements, and strategy refinements. A total of 47 action items
were identified with clear ownership assignments, representing an ambitious but
well-structured roadmap for the Gazpacho cycle.
## Topics and Outcomes
### 2025.2 Flamingo Retrospective
The team conducted a comprehensive review of the 2025.2 Flamingo release,
analyzing successes and areas requiring improvement. Enhanced collaboration
with more active core reviewers contributed to faster merge cycles, significant
bug fixes, external contributions, and effective DPL model implementation.
Areas needing improvement included CI testing methodology restructuring,
tempest-plugin stability on stable branches, test execution duration concerns,
feature review timing issues, substantial stable branch backport backlogs, and
unclear strategy parameter policies.
**Agreements:**
- Continue enhanced collaboration practices from Flamingo
- Address CI and testing methodology gaps in Gazpacho
- Improve feature review timing to avoid Feature Freeze conflicts
**Open Questions:**
- How should strategy parameters be managed and versioned? Should they be part
of main microversion or independently versioned?
### Datasource and Integration Management
The team agreed unanimously to remove Monasca datasource support because the
Monasca project has been retired. Direct Prometheus support will be deprecated
in favor of Aetos-only integration, with upgrade documentation provided. MAAS
integration will be deprecated and a mailing list notification will seek
maintainers, with removal planned for 2026.2 if none emerge. Gnocchi support
continues pending further research into its usage among both Watcher and
Telemetry users.
**Agreements:**
- Remove Monasca datasource in 2026.1 Gazpacho
- Deprecate Prometheus direct support in favor of Aetos
- Create upgrade documentation from Prometheus to Aetos
- Deprecate MAAS integration with mailing list notification
- Maintain Gnocchi support pending further research
- Continue updating integration documentation to reflect experimental/deprecated
status
- Re-evaluate experimental integrations at 2026.2 PTG
**Open Questions:**
- What is the long-term support plan for Gnocchi datasource integration?
- Should Ironic integration be maintained, improved with CI/docs, or deprecated
like MAAS?
### SDK Migration Strategy
The team outlined a comprehensive multi-phase approach to transition from
project-specific clients to unified OpenStack SDK. The agreed sequence
prioritizes Watcher internal migration first, followed by dashboard updates,
with SDK support implementation occurring later. The goal for Gazpacho is
successfully migrating at least one service client (Nova) to SDK usage. This
effort is particularly timely as several projects including Nova and Neutron
are planning to deprecate and remove their client libraries.
**Agreements:**
- Implement phased approach: Watcher → Dashboard → python-watcherclient
- Do not freeze python-watcherclient bindings until SDK support exists
- Create comprehensive specification covering entire migration plan
- Migrate at least one service client (Nova) to SDK in 2026.1
**Open Questions:**
- How to coordinate Rally-OpenStack's direct python-watcherclient usage with
planned deprecation?
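
As a rough illustration of what moving the Nova client to the SDK could look like, the sketch below groups servers by host through an openstacksdk-style connection. It relies only on the `conn.compute.servers()` proxy call and the `compute_host` server attribute from openstacksdk; the helper name `list_instances_by_host` and the stand-in connection are hypothetical and are not taken from Watcher's code.

```python
from collections import defaultdict
from types import SimpleNamespace


def list_instances_by_host(conn):
    """Group server IDs by hypervisor host using an SDK-style connection.

    `conn` is expected to behave like the object returned by
    `openstack.connect()`; only `conn.compute.servers()` is used here.
    """
    by_host = defaultdict(list)
    for server in conn.compute.servers(details=True, all_projects=True):
        # openstacksdk exposes OS-EXT-SRV-ATTR:host as `compute_host`.
        by_host[server.compute_host].append(server.id)
    return dict(by_host)


# Stand-in connection so the sketch runs without a cloud (made-up data).
def _fake_servers(details=True, all_projects=False):
    return iter([
        SimpleNamespace(id="vm-1", compute_host="node-a"),
        SimpleNamespace(id="vm-2", compute_host="node-b"),
        SimpleNamespace(id="vm-3", compute_host="node-a"),
    ])


fake_conn = SimpleNamespace(compute=SimpleNamespace(servers=_fake_servers))
instances = list_instances_by_host(fake_conn)
```

Against a real deployment, the connection would presumably come from `openstack.connect()` rather than the stand-in object used here.
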
### Code Modernization and Cleanup
The team remains interested in modernizing the codebase to follow contemporary
Python practices. Pyupgrade will be applied systematically to update code
patterns, one module per patch. Dead code removal continues, including obsolete
API routes and commented-out code. Pre-commit and ruff linting will be applied
to the tempest-plugin and python-watcherclient repositories.
**Agreements:**
- Apply pyupgrade for code modernization via module-specific patches
- Continue dead code removal efforts
- Apply pre-commit and ruff linting to tempest plugin and client
- Defer typing decisions until required
- Explore dependency reduction opportunities
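
For readers unfamiliar with pyupgrade, the hand-written before/after pair below shows the kind of rewrite such a pass typically produces. The class names are hypothetical, and the exact output of the tool depends on its version and target-version options.

```python
# Before: patterns commonly left over from Python 2 compatibility.
class LegacyAction(object):
    def __init__(self, name, retries):
        super(LegacyAction, self).__init__()
        self.name = name
        self.retries = retries

    def describe(self):
        return "%s (retries=%s)" % (self.name, self.retries)


# After: the equivalent Python 3 idioms (implicit object base class,
# zero-argument super(), and an f-string instead of %-formatting).
class ModernAction:
    def __init__(self, name, retries):
        super().__init__()
        self.name = name
        self.retries = retries

    def describe(self):
        return f"{self.name} (retries={self.retries})"


legacy = LegacyAction("migrate", 3).describe()
modern = ModernAction("migrate", 3).describe()
```

Both classes behave identically, which is why this kind of change can be reviewed module by module.
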
### Eventlet Removal Progress
Significant progress toward eventlet removal was made during Flamingo. The
watcher-api-wsgi console script was deprecated and experimental native thread
mode support was introduced for the decision-engine. Outstanding technical debt
includes a timeout mechanism for collector sync and native thread support in
the applier. Both eventlet and native threading modes will remain supported in
Gazpacho, with all CI jobs switching to native threading except one dedicated
eventlet test job.
**Agreements:**
- Implement REST API timeouts with event-based timeout for collector process
- Add new configuration option for collector timeout
- Address testing gap for collector timeout behavior
- Stop killing threads during Action Plan cancellation in both modes
- Thread kill() becomes no-op in native mode, waiting for completion
- Set MAAS to deprecated status with mailing list notification
- Maintain both modes as supported in Gazpacho
- Switch all jobs to native threading except one eventlet test job
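
One way to picture an event-based collector timeout that does not kill threads is the minimal sketch below. It is an assumption-laden illustration, not Watcher's implementation: `sync_with_timeout` and the `fetchers` list (each entry standing in for one bounded REST call) are hypothetical names.

```python
import threading
import time


def sync_with_timeout(fetchers, timeout):
    """Run a collector sync and give up cooperatively after `timeout` seconds.

    Returns (completed, number_of_calls_finished). The worker is never
    killed: it is asked to stop and the in-flight call is allowed to finish.
    """
    stop = threading.Event()   # set by the caller to request a stop
    done = threading.Event()   # set by the worker once every call finished
    fetched = []

    def worker():
        for fetch in fetchers:
            if stop.is_set():
                return  # cooperative exit between calls
            fetched.append(fetch())
        done.set()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    if not done.wait(timeout):
        stop.set()
    thread.join()  # waits only for the call currently in flight
    return done.is_set(), len(fetched)


fast_result = sync_with_timeout([lambda: "page"] * 3, timeout=1.0)
slow_result = sync_with_timeout([lambda: time.sleep(0.1)] * 10, timeout=0.25)
```

Because the worker is only checked between calls, this pattern depends on each individual REST call having its own timeout, which is why the two agreements go together.
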
### CI Testing Infrastructure Improvements
Current job naming does not follow OpenStack conventions. CI jobs will be
consolidated and renamed: merge watcher-functional into tempest jobs, merge
watcher-tempest-actuator into strategies job, create datasource-specific job
patterns, and enable IPv6 configuration. Tempest scenario jobs will be enabled
for stable branches, grenade testing enhanced with SLURP release upgrade
testing, and a watcher job proposed for openstack-requirements.
**Agreements:**
- Consolidate and rename jobs following OpenStack conventions
- Merge watcher-functional into tempest jobs
- Merge watcher-tempest-actuator into strategies job
- Create watcher-tempest-{datasource} job pattern
- Enable one tempest job with IPv6 configuration
- Ensure check/gate pipeline consistency
- Enable scenario tests for stable branches
- Implement grenade job for SLURP release upgrades
- Propose watcher job for openstack-requirements
### Functional and Rally Testing
A phased functional testing approach will be implemented that tests actual
Watcher code against mocked external services. Phase 1 focuses on API testing,
Phase 2 adds decision-engine integration with Nova and Prometheus, and Phase 3
includes applier service testing. Migrating the Rally plugin into the Watcher
repository for periodic job execution will be investigated.
**Agreements:**
- Follow phased approach for functional testing
- Reorganize unit tests per functional spec structure
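
The idea of exercising real logic against a mocked external service can be sketched with `unittest.mock`. The function `find_overloaded_hosts` and the single-argument `get_host_cpu_usage` call are hypothetical simplifications chosen for illustration; they are not Watcher's actual datasource interface.

```python
from unittest import mock


def find_overloaded_hosts(datasource, hosts, threshold=80.0):
    """Code under test: return hosts whose CPU usage exceeds the threshold."""
    return [h for h in hosts if datasource.get_host_cpu_usage(h) > threshold]


# The external metrics service is replaced by a mock with canned answers.
fake_datasource = mock.Mock()
fake_datasource.get_host_cpu_usage.side_effect = {
    "node-a": 95.0,
    "node-b": 40.0,
}.get

overloaded = find_overloaded_hosts(fake_datasource, ["node-a", "node-b"])
```
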
### Applier Workflow and Improvements
The Applier's workflow execution lacks comprehensive documentation, which causes
confusion during reviews. Thread killing will cease in favor of
threading.Event-based signaling for graceful termination. The polling-based
cancellation implementation requires refactoring as part of the overall
cancellation workflow improvements.
**Agreements:**
- Cease killing threads
- Stop spawning dedicated threads for each action
- Improve execute() methods to check resource status and abort gracefully
- Document workflow interface based on rollback/aborting decisions
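
A minimal sketch of threading.Event-based cancellation, assuming an action made of discrete steps, might look like the following. The `CancellableAction` class and its state strings are hypothetical and do not reflect the current Action interface.

```python
import threading


class CancellableAction:
    """Hypothetical action that checks a shared Event between steps."""

    def __init__(self, steps, cancel_event):
        self.steps = steps              # list of (name, callable) pairs
        self.cancel_event = cancel_event
        self.completed = []

    def execute(self):
        for name, run in self.steps:
            if self.cancel_event.is_set():
                # Abort gracefully instead of being killed mid-step.
                return "CANCELLED"
            run()
            self.completed.append(name)
        return "SUCCEEDED"


cancel = threading.Event()
action = CancellableAction(
    [
        ("pre_check", lambda: None),
        ("migrate", cancel.set),        # simulates a cancel arriving mid-run
        ("post_check", lambda: None),
    ],
    cancel,
)
state = action.execute()
```

The step that is already running completes, and the action stops at the next check, so no resource is left half-modified by a killed thread.
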
### Action Plan Rollback
No working rollback mechanism currently exists. The auto-revert functionality
has been determined to be non-functional and will be deprecated and removed. A
new specification will propose user-triggered rollback for failed action plans,
potentially through a new "rollback" action leveraging the SKIP feature added
in Flamingo.
**Agreements:**
- Deprecate and remove auto-revert as non-functional
- Document as bug and update configuration options
- Create new specification for user-triggered action plan revert workflow
- Properly document current Action interface
### Dashboard Features and Testing
Feature enhancements include auto-refresh functionality, action plan start
buttons, bulk archive operations, continuous audit updates, and auto-archive
scheduling. End-to-end testing will use Selenium fixtures from Horizon team,
and htmx adoption will replace JavaScript event handlers for cleaner, more
testable code.
**Agreements:**
- Create wishlist bugs for dashboard improvements
- Consult Technical Committee regarding pytest usage for improved wording
- Merge end-to-end testing specification
- Create specification for bulk archive functionality
### Datamodel List API
The API's utility beyond its recent use in tempest tests is in question. The
API will be frozen so that no new fields are added, even when model elements
receive updates. Extending it with new storage or baremetal models is
explicitly rejected.
**Agreements:**
- Do not extend API; do not support additional models
- Freeze API to prevent new field additions
- Defer removal discussion to future PTG
- No extension even for new Instance/Node fields
**Open Questions:**
- Does datamodel API response size risk exceeding RabbitMQ message limits?
### Scalability Architecture
The primary challenge is that the Decision Engine, and to some extent the
Applier, can only run as a single instance. Both should become horizontally
scalable and stateless, using an event-driven architecture in which the RPC bus
invokes services. Cluster Data Models must persist in a database or shared
storage rather than in memory only.
**Agreements:**
- Finish service-monitor for decision-engine
- Implement service-monitor for applier services
- Instrument observability for audit executions via notifications
- Investigate parallelization for specific strategies
**Open Questions:**
- What specific architectural patterns and components are needed for full
horizontal scalability?
### Noisy Neighbor Strategy Replacement
The current strategy relying on deprecated LLC cache metrics will be removed in
2026.2+ after deprecation in the SLURP release. Alternative metrics including
CPU steal, CPU pressure, and IOWait require proof-of-concept validation.
Instance metadata will be used for PoC while alternative workload
classification approaches are evaluated.
**Agreements:**
- Remove current strategy in 2026.2+ after SLURP deprecation
- Develop proof-of-concept using alternative metrics
- Use instance metadata for PoC while evaluating alternatives
**Open Questions:**
- Which alternative metrics provide sufficient signal for noisy neighbor
detection?
- What is the optimal long-term approach for classifying instance priority?
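
To make the proof-of-concept direction more concrete, the sketch below flags high-priority instances whose CPU steal exceeds a threshold. Everything specific here is an assumption made for illustration: the `Instance` dataclass, the `priority` metadata key, and the 10% threshold are hypothetical and were not agreed at the PTG.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    uuid: str
    cpu_steal_pct: float                      # hypothetical metric sample
    metadata: dict = field(default_factory=dict)


def find_victims(instances, steal_threshold=10.0, priority_key="priority"):
    """Return UUIDs of high-priority instances suffering high CPU steal."""
    return [
        inst.uuid
        for inst in instances
        if inst.metadata.get(priority_key) == "high"
        and inst.cpu_steal_pct > steal_threshold
    ]


sample = [
    Instance("vm-1", 25.0, {"priority": "high"}),
    Instance("vm-2", 30.0, {"priority": "low"}),
    Instance("vm-3", 2.0, {"priority": "high"}),
]
victims = find_victims(sample)
```
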
### Strategy Stacking and Composition
Brainstorming explored stacking strategies where one depends on another.
Implementation approaches include linked action plans, merged actions, and model
mutation options. More concrete use cases are needed before design decisions.
**Open Questions:**
- What are concrete use cases for strategy stacking and what implementation
approach is optimal?
### Miscellaneous Improvements
Additional improvements include investigating scenario test execution duration,
implementing pre-conditions for action skipping, adding dedicated documentation
section for actions, and fixing the hardcoded 2-minute server migration timeout.
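
As an illustration of the last item, a migration wait loop can take its timeout as a parameter instead of hardcoding it. This is a sketch under assumptions: `wait_for_migration`, the `get_status` callable, and the `ACTIVE` status check are hypothetical, and the 120-second default only mirrors the current 2-minute value.

```python
import time


def wait_for_migration(get_status, timeout=120.0, interval=5.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll `get_status` until it reports ACTIVE or `timeout` seconds pass."""
    deadline = clock() + timeout
    while clock() < deadline:
        if get_status() == "ACTIVE":
            return True
        sleep(interval)
    return False


# Dry run with canned statuses instead of a real compute API.
statuses = iter(["MIGRATING", "MIGRATING", "ACTIVE"])
finished = wait_for_migration(lambda: next(statuses), timeout=1.0, interval=0.0)
timed_out = wait_for_migration(lambda: "MIGRATING", timeout=0.05, interval=0.01)
```

In a service, the `timeout` argument would presumably be fed from a configuration option so that operators can tune it for large instances.
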
## Key Themes and Strategic Direction
### Technical Debt Reduction
The Gazpacho cycle represents a decisive push toward modernizing Watcher's
technical foundation. The parallel efforts on SDK migration, eventlet removal,
and code modernization address years of accumulated technical debt while
positioning the project for long-term sustainability. The emphasis on
dependency reduction and dead code removal demonstrates commitment to
maintainability.
### Testing Infrastructure Maturation
Significant focus on testing infrastructure improvements reflects maturation of
the project's quality practices. The phased functional testing approach, CI job
consolidation, and stable branch testing enhancements address identified gaps
from the Flamingo retrospective. Rally testing integration and end-to-end
dashboard testing represent expansions of testing coverage into previously
under-tested areas.
### Scalability and Performance
The dedicated scalability discussion and resulting agreements on service
monitoring, observability, and horizontal scalability architecture indicate
preparation for larger-scale deployments. The investigation into strategy
parallelization and shared worker pools demonstrates attention to performance
optimization.
### Integration and Dependency Management
The decisions regarding datasource and integration deprecations reflect a
pragmatic approach to project scope. By deprecating unmaintained integrations
(Monasca, MAAS) and consolidating datasource approaches (Aetos over direct
Prometheus), the team focuses resources on well-tested, actively maintained
integrations.
### Operational Improvements
The comprehensive discussion of applier improvements, particularly around
workflow execution, thread management, and action plan rollback, addresses
long-standing operational concerns. The shift from thread killing to graceful
termination and the redesign of cancellation workflows represent significant
architectural improvements.
### Documentation and User Experience
Multiple action items focus on documentation improvements, from integration
status updates to action interface documentation to upgrade guides. Dashboard
enhancements including auto-refresh, bulk operations, and improved testing
demonstrate commitment to operator experience.
## Conclusion
The Watcher 2026.1 Gazpacho PTG successfully established an ambitious yet
achievable roadmap balancing technical debt reduction with feature enhancement.
The 47 action items with clear ownership assignments, 53 specific agreements,
and well-documented open questions provide a structured framework for the
development cycle. The breadth of topics addressed, from low-level threading
implementation to high-level scalability architecture, demonstrates
comprehensive planning. While challenges remain, particularly around strategy
parameter versioning and the ultimate resolution of experimental integrations,
the team has positioned Watcher for significant advancement in modernization,
quality, and operational robustness during the Gazpacho cycle.

Last active: November 3, 2025 14:13