We are implementing a NIDS (netedge) MCP toolset and must choose a strategy for delivering and integrating it into OpenShift clusters. Deciding now where the code will live and how it will be built for production will save time and effort later.
Option 1: Integrate the toolset directly into the openshift-mcp-server repository under pkg/toolsets/netedge. This repository serves as the convergence point for MCP work from multiple teams.
Pros:
- Platform Capabilities: Inherits platform Auth, Config, and Transport capabilities without reimplementation.
- Cross-Team Synergy: Allows distinct teams to collaborate on shared infrastructure.
- Zero Friction Distribution: Users get the netedge tools wherever openshift-mcp-server is distributed (for example, in OpenShift Lightspeed).
- Single Binary: Can be deployed in-cluster OR run locally by the user.
Option 2: Create a separate repository and binary.
Cons:
- High Development Overhead: Requires re-implementing Auth, Config, Transport, and Logging logic.
- Release Engineering Burden: The NIDS team would need to set up and maintain a separate build pipeline (Konflux), manage release versioning, and support the distribution of our MCP server binary with OpenShift.
- Distribution Friction: Users must download and manage our mcp binary.
The OVN-Kubernetes (OVNK) team opted for Option 2 (Standalone), driven by different priorities:
- Upstream Focus: Their primary goal is to serve the upstream ovn-kubernetes community independently of OpenShift productization.
- Deferred Scope: They explicitly chose to defer "productization" concerns like Authentication and RBAC, focusing first on proving the core value of their diagnostics.
- Security Stance: They adopted a strict read-only model, reasoning that they largely did not need the "full gamut" of tools provided by the core server.
Contrast with NIDS: We differ in that we are targeting immediate product readiness. By choosing Option 1, we avoid the "deferred technical debt" that OVNK accepts (implementing Auth, Config, Transport, and Logging later) and gain immediate access to the platform's mature OIDC/RBAC implementation and support channels, rather than building a custom server stack from scratch.
Recommendation: Proceed with Option 1. The openshift-mcp-server Integration approach reduces the release engineering burden, simplifies distribution, and enables immediate code reuse with other OpenShift teams.
While the MCP server often runs on a live cluster, users frequently troubleshoot using "offline" artifacts (e.g., must-gather, SOS reports) located on their local machine. This data often cannot be uploaded to a shared cluster due to size, privacy, or compliance reasons.
We are adopting the pattern established by the ovn-kubernetes-mcp project, which successfully uses toggleable modes to support both live and offline analysis in a single tool.
The Agent (LLM) needs to reason about this local data using the same high-level tools it uses for live clusters (e.g., get_network_error_rate). Feeding massive raw file dumps directly to the LLM is inefficient, consumes excessive tokens, and makes it difficult for the model to identify relevant signals amidst the noise.
Instead of inventing a new "Adapter" pattern, we will leverage the platform's existing Multi-Cluster Support. The openshift-mcp-server framework already handles routing logic for tools that operate on specific clusters via the IsClusterAware interface.
We treat an offline session (a must-gather on disk) conceptually as a Virtual Cluster.
- IsClusterAware: netedge tools declare IsClusterAware() = true.
- Context Parameter: The framework automatically injects a cluster (or context) parameter into the tool's schema.
- Resolution: We implement a GetDerivedPrometheus(ctx, cluster) helper (analogous to the SDK's GetDerivedKubernetes) to resolve this string into a client.
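For illustration, here is a minimal sketch of what opting in to cluster awareness could look like on the netedge side. The ClusterAware interface shape shown here is an assumption modeled on the reference to pkg/api/toolsets.go, not the verified openshift-mcp-server API.

```go
package netedge

// ClusterAware mirrors the framework hook (see pkg/api/toolsets.go) that
// triggers injection of the cluster/context parameter into a tool's input
// schema. The exact name and signature here are assumptions for illustration.
type ClusterAware interface {
	IsClusterAware() bool
}

// QueryPrometheusTool is a netedge tool that can target either a live
// cluster or an offline must-gather session ("virtual cluster").
type QueryPrometheusTool struct{}

// IsClusterAware opts the tool in to multi-cluster routing, so the framework
// adds the cluster argument and passes it through to the handler.
func (QueryPrometheusTool) IsClusterAware() bool { return true }
```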
The Tool Definition:
The tool logic remains standard. It accepts the injected cluster identifier and requests a client.
```go
func QueryPrometheusHandler(ctx context.Context, params Params) (string, error) {
	// 1. Framework injects the 'cluster' param (live cluster name or file:// URI)
	clusterID, ok := params.GetArguments()["cluster"].(string)
	if !ok {
		return "", fmt.Errorf("missing required 'cluster' parameter")
	}
	// 2. Resolve the client (the "magic" happens here)
	client, err := GetDerivedPrometheus(ctx, clusterID)
	if err != nil {
		return "", err
	}
	// 3. Execute backend-agnostic query logic
	return client.Query(ctx, "sum(rate(...))")
}
```

Client Resolution Logic (GetDerivedPrometheus):
- Live Mode (Cluster ID):
  - Input: cluster="local" (or a specific cluster name).
  - Resolution: Returns a ThanosClient authenticated against the target cluster's API.
- Offline Mode (File URI):
  - Input: cluster="file:///tmp/session-1".
  - Resolution: Returns an OfflinePrometheusClient.
  - Implementation: This client wraps the omc binary (Shell Adapter) to parse the specific must-gather path context.
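The following is a minimal sketch of this dispatch, assuming a backend-agnostic PrometheusClient interface of our own. The thanosClient and offlinePrometheusClient types are hypothetical stand-ins, and the omc invocation details are intentionally left out.

```go
package netedge

import (
	"context"
	"strings"
)

// PrometheusClient is the backend-agnostic interface the tool handlers use.
type PrometheusClient interface {
	Query(ctx context.Context, promql string) (string, error)
}

// thanosClient and offlinePrometheusClient are hypothetical stand-ins for the
// real implementations (live Thanos querier vs. omc-backed must-gather parsing).
type thanosClient struct{ cluster string }
type offlinePrometheusClient struct{ path string }

func (c *thanosClient) Query(ctx context.Context, promql string) (string, error) {
	// A real implementation would query the target cluster's Thanos API
	// using the caller's credentials.
	return "", nil
}

func (c *offlinePrometheusClient) Query(ctx context.Context, promql string) (string, error) {
	// A real implementation would evaluate the query against the must-gather
	// snapshot, e.g., by shelling out to the omc binary (Shell Adapter).
	return "", nil
}

// GetDerivedPrometheus resolves the framework-injected cluster identifier into
// a client, analogous to the SDK's GetDerivedKubernetes.
func GetDerivedPrometheus(ctx context.Context, cluster string) (PrometheusClient, error) {
	if strings.HasPrefix(cluster, "file://") {
		// Offline mode: the "virtual cluster" is a must-gather on disk.
		return &offlinePrometheusClient{path: strings.TrimPrefix(cluster, "file://")}, nil
	}
	// Live mode: "local" or a specific cluster name.
	return &thanosClient{cluster: cluster}, nil
}
```

Because the handler depends only on PrometheusClient, the same tool code serves both modes, which is the property the virtual-cluster design relies on.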
- Architectural Alignment: We use the exact mechanism designed for multi-cluster support (ref: pkg/api/toolsets.go#IsClusterAware).
- Seamlessness: The LLM treats "analyzing a file" exactly like "analyzing a remote cluster": it creates a session and passes the ID.
- Scalability: This same pattern allows us to support remote cluster targeting in the future (e.g., the MCP server running on a management cluster debugging a separate tenant cluster) without changing the tool code.
netedge diagnostics can sometimes involve massive datasets (GBs of logs, metrics series). Basic "read file" tools are prohibitively expensive in tokens and slow.
The toolset must strictly provide Query/Summarization tools, not Reader tools.
- Semantic Queries over Raw Reads:
  - Bad: read_file(haproxy.log)
  - Good: search_ingress_logs(pattern="503", time_window="5m", cluster="file:///tmp/session-1")
  - Mechanism (see the sketch after this list):
    - The agent invokes a domain-specific tool (capturing the intent, e.g., "find errors") with the file context.
    - The OfflineAdapter runs a localized grep/parsing operation on the server pod against the uploaded must-gather artifacts.
    - It returns only the relevant snippet (e.g., 5-10 lines around the match), saving a large amount of transmitted tokens compared to dumping the full file.
- Optimized Parsing (The "Heavy Lifting" Trade-off):
- Approach: Instead of raw reads, the Offline Adapter pre-scans artifacts upon attachment.
- Mechanism: It loads key data into simple in-memory structures (e.g., Go maps or slices) rather than complex databases.
- Rationale: While this requires more upfront engineering effort (writing parsers), it is critical for stability. It prevents the Agent from timing out while "grepping" 1GB files and ensures we only send high-signal data to the LLM.
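As a concrete illustration of the query-not-read approach referenced above, here is a minimal sketch of an offline search helper. The searchIngressLogs name, its parameters, and the flat ".log" file layout it assumes are hypothetical; a real adapter would also honor the time_window argument and the actual must-gather directory structure.

```go
package netedge

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// searchIngressLogs sketches the offline "semantic query" mechanism: instead
// of returning whole files, it scans log files inside a must-gather directory
// and returns only small snippets around each match.
func searchIngressLogs(mustGatherDir, pattern string, contextLines, maxMatches int) ([]string, error) {
	var snippets []string
	err := filepath.Walk(mustGatherDir, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() || !strings.HasSuffix(path, ".log") {
			return err
		}
		f, openErr := os.Open(path)
		if openErr != nil {
			return openErr
		}
		defer f.Close()

		// Keep a small rolling window of the lines leading up to (and
		// including) the current line so each match reports local context.
		var window []string
		scanner := bufio.NewScanner(f)
		for lineNo := 1; scanner.Scan(); lineNo++ {
			line := scanner.Text()
			window = append(window, line)
			if len(window) > contextLines {
				window = window[1:]
			}
			if strings.Contains(line, pattern) {
				snippets = append(snippets, fmt.Sprintf("%s:%d\n%s", path, lineNo, strings.Join(window, "\n")))
				if len(snippets) >= maxMatches {
					// Stop the walk early once we have enough matches (Go 1.20+).
					return filepath.SkipAll
				}
			}
		}
		return scanner.Err()
	})
	if err != nil {
		return nil, err
	}
	return snippets, nil
}
```

A production adapter would pair this scanning with the pre-parsed in-memory structures described above, so repeated queries do not re-read multi-gigabyte files.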
Choosing Option 1 (openshift-mcp-server integration) subsumes this goal: the NIDS toolset automatically inherits the robust OIDC authentication, Token Exchange (RFC 8693), and RBAC models already implemented and vetted in the openshift-mcp-server. No separate design work is required.