We are implementing a NIDS (netedge) MCP toolset and must choose a strategy for delivering and integrating it into OpenShift clusters. Deciding now where the code will live and how it will be built for production will save time and effort later.
Option 1: Integrate the toolset directly into the openshift-mcp-server repository under pkg/toolsets/netedge. This repository serves as the convergence point for MCP work from multiple teams.
Pros:
- Platform Capabilities: Inherits platform Auth, Config, and Transport capabilities without reimplementation.
- Cross-Team Synergy: Allows distinct teams to collaborate on shared infrastructure.
- Zero Friction Distribution: Users get the netedge tools wherever openshift-mcp-server is distributed (for example, in OpenShift Lightspeed).
- Single Binary: Can be deployed in-cluster OR run locally by the user.
Option 2: Create a separate repository and binary.
Cons:
- High Development Overhead: Requires re-implementing Auth, Config, Transport, and Logging logic.
- Release Engineering Burden: The NIDS team would need to set up and maintain a separate build pipeline (Konflux), manage release versioning, and support the distribution of our MCP server binary with OpenShift.
- Distribution Friction: Users must download and manage our mcp binary.
The OVN-Kubernetes (OVNK) team opted for Option 2 (Standalone), driven by different priorities:
- Upstream Focus: Their primary goal is to serve the upstream ovn-kubernetes community independently of OpenShift productization.
- Deferred Scope: They explicitly chose to defer "productization" concerns like Authentication and RBAC, focusing first on proving the core value of their diagnostics.
- Security Stance: They adopted a strict read-only model, reasoning that they largely did not need the "full gamut" of tools provided by the core server.
Contrast with NIDS: We differ in that we are targeting immediate product readiness. By choosing Option 1, we avoid the "deferred technical debt" that OVNK accepts (implementing Auth, Config, Transport, and Logging later) and gain immediate access to the platform's mature OIDC/RBAC implementation and support channels, rather than building a custom server stack from scratch.
Recommendation: Proceed with Option 1. The openshift-mcp-server Integration approach reduces the release engineering burden, simplifies distribution, and enables immediate code reuse with other OpenShift teams.
While the MCP server often runs on a live cluster, users frequently troubleshoot using "offline" artifacts (e.g., must-gather, SOS reports) located on their local machine. This data often cannot be uploaded to a shared cluster due to size, privacy, or compliance reasons.
We are adopting the pattern established by the ovn-kubernetes-mcp project, which successfully uses toggleable modes to support both live and offline analysis in a single tool.
The Agent (LLM) needs to reason about this local data using the same high-level tools it uses for live clusters (e.g., get_network_error_rate). Feeding massive raw file dumps directly to the LLM is inefficient, consumes excessive tokens, and makes it difficult for the model to identify relevant signals amidst the noise.
Instead of inventing a new "Adapter" pattern, we will leverage the platform's existing Multi-Cluster Support. The openshift-mcp-server framework already handles routing logic for tools that operate on specific clusters via the IsClusterAware interface.
We treat an offline session (a must-gather on disk) conceptually as a Virtual Cluster.
- IsClusterAware: netedge tools declare IsClusterAware() = true.
- Context Parameter: The framework automatically injects a cluster (or context) parameter into the tool's schema.
- Resolution: We implement a GetDerivedPrometheus(ctx, cluster) helper (analogous to the SDK's GetDerivedKubernetes) to resolve this string into a client.
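For illustration, here is a minimal sketch of what opting in to cluster awareness could look like on the netedge side. The ClusterAware interface shape shown here is an assumption modeled on the reference to pkg/api/toolsets.go, not the verified openshift-mcp-server API.

```go
package netedge

// ClusterAware mirrors the framework hook (see pkg/api/toolsets.go) that
// triggers injection of the cluster/context parameter into a tool's input
// schema. The exact name and signature here are assumptions for illustration.
type ClusterAware interface {
	IsClusterAware() bool
}

// QueryPrometheusTool is a netedge tool that can target either a live
// cluster or an offline must-gather session ("virtual cluster").
type QueryPrometheusTool struct{}

// IsClusterAware opts the tool in to multi-cluster routing, so the framework
// adds the cluster argument and passes it through to the handler.
func (QueryPrometheusTool) IsClusterAware() bool { return true }
```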
The Tool Definition:
The tool logic remains standard. It accepts the injected cluster identifier and requests a client.
```go
func QueryPrometheusHandler(ctx context.Context, params Params) (string, error) {
	// 1. Framework injects the 'cluster' param (live cluster name or file:// URI)
	clusterID, ok := params.GetArguments()["cluster"].(string)
	if !ok {
		return "", fmt.Errorf("missing required 'cluster' parameter")
	}
	// 2. Resolve the client (the "magic" happens here)
	client, err := GetDerivedPrometheus(ctx, clusterID)
	if err != nil {
		return "", err
	}
	// 3. Execute backend-agnostic query logic
	return client.Query(ctx, "sum(rate(...))")
}
```

Client Resolution Logic (GetDerivedPrometheus):
- Live Mode (Cluster ID):
  - Input: cluster="local" (or a specific cluster name).
  - Resolution: Returns a ThanosClient authenticated against the target cluster's API.
- Offline Mode (File URI):
  - Input: cluster="file:///tmp/session-1".
  - Resolution: Returns an OfflinePrometheusClient.
  - Implementation: This client wraps the omc binary (Shell Adapter) to parse the specific must-gather path context.
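The following is a minimal sketch of this dispatch, assuming a backend-agnostic PrometheusClient interface of our own. The thanosClient and offlinePrometheusClient types are hypothetical stand-ins, and the omc invocation details are intentionally left out.

```go
package netedge

import (
	"context"
	"strings"
)

// PrometheusClient is the backend-agnostic interface the tool handlers use.
type PrometheusClient interface {
	Query(ctx context.Context, promql string) (string, error)
}

// thanosClient and offlinePrometheusClient are hypothetical stand-ins for the
// real implementations (live Thanos querier vs. omc-backed must-gather parsing).
type thanosClient struct{ cluster string }
type offlinePrometheusClient struct{ path string }

func (c *thanosClient) Query(ctx context.Context, promql string) (string, error) {
	// A real implementation would query the target cluster's Thanos API
	// using the caller's credentials.
	return "", nil
}

func (c *offlinePrometheusClient) Query(ctx context.Context, promql string) (string, error) {
	// A real implementation would evaluate the query against the must-gather
	// snapshot, e.g., by shelling out to the omc binary (Shell Adapter).
	return "", nil
}

// GetDerivedPrometheus resolves the framework-injected cluster identifier into
// a client, analogous to the SDK's GetDerivedKubernetes.
func GetDerivedPrometheus(ctx context.Context, cluster string) (PrometheusClient, error) {
	if strings.HasPrefix(cluster, "file://") {
		// Offline mode: the "virtual cluster" is a must-gather on disk.
		return &offlinePrometheusClient{path: strings.TrimPrefix(cluster, "file://")}, nil
	}
	// Live mode: "local" or a specific cluster name.
	return &thanosClient{cluster: cluster}, nil
}
```

Because the handler depends only on PrometheusClient, the same tool code serves both modes, which is the property the virtual-cluster design relies on.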
- Architectural Alignment: We use the exact mechanism designed for multi-cluster support (ref: pkg/api/toolsets.go#IsClusterAware).
- Seamlessness: The LLM treats "analyzing a file" exactly like "analyzing a remote cluster": it creates a session and passes the ID.
- Scalability: This same pattern allows us to support remote cluster targeting in the future (e.g., the MCP server running on a management cluster debugging a separate tenant cluster) without changing the tool code.
netedge diagnostics can sometimes involve massive datasets (GBs of logs, metrics series). Basic "read file" tools are prohibitively expensive in tokens and slow.
The toolset must strictly provide Query/Summarization tools, not Reader tools.
- Semantic Queries over Raw Reads:
  - Bad: read_file(haproxy.log)
  - Good: search_ingress_logs(pattern="503", time_window="5m", cluster="file:///tmp/session-1")
  - Mechanism (see the sketch after this list):
    - The agent invokes a domain-specific tool (capturing the intent, e.g., "find errors") with the file context.
    - The OfflineAdapter runs a localized grep/parsing operation on the server pod against the uploaded must-gather artifacts.
    - It returns only the relevant snippet (e.g., 5-10 lines around the match), saving a large amount of transmitted tokens compared to dumping the full file.
- Optimized Parsing (The "Heavy Lifting" Trade-off):
- Approach: Instead of raw reads, the Offline Adapter pre-scans artifacts upon attachment.
- Mechanism: It loads key data into simple in-memory structures (e.g., Go maps or slices) rather than complex databases.
- Rationale: While this requires more upfront engineering effort (writing parsers), it is critical for stability. It prevents the Agent from timing out while "grepping" 1GB files and ensures we only send high-signal data to the LLM.
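As a concrete illustration of the query-not-read approach referenced above, here is a minimal sketch of an offline search helper. The searchIngressLogs name, its parameters, and the flat ".log" file layout it assumes are hypothetical; a real adapter would also honor the time_window argument and the actual must-gather directory structure.

```go
package netedge

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// searchIngressLogs sketches the offline "semantic query" mechanism: instead
// of returning whole files, it scans log files inside a must-gather directory
// and returns only small snippets around each match.
func searchIngressLogs(mustGatherDir, pattern string, contextLines, maxMatches int) ([]string, error) {
	var snippets []string
	err := filepath.Walk(mustGatherDir, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() || !strings.HasSuffix(path, ".log") {
			return err
		}
		f, openErr := os.Open(path)
		if openErr != nil {
			return openErr
		}
		defer f.Close()

		// Keep a small rolling window of the lines leading up to (and
		// including) the current line so each match reports local context.
		var window []string
		scanner := bufio.NewScanner(f)
		for lineNo := 1; scanner.Scan(); lineNo++ {
			line := scanner.Text()
			window = append(window, line)
			if len(window) > contextLines {
				window = window[1:]
			}
			if strings.Contains(line, pattern) {
				snippets = append(snippets, fmt.Sprintf("%s:%d\n%s", path, lineNo, strings.Join(window, "\n")))
				if len(snippets) >= maxMatches {
					// Stop the walk early once we have enough matches (Go 1.20+).
					return filepath.SkipAll
				}
			}
		}
		return scanner.Err()
	})
	if err != nil {
		return nil, err
	}
	return snippets, nil
}
```

A production adapter would pair this scanning with the pre-parsed in-memory structures described above, so repeated queries do not re-read multi-gigabyte files.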
Choosing Option 1 (openshift-mcp-server integration) subsumes this goal: the NIDS toolset automatically inherits the robust OIDC authentication, Token Exchange (RFC 8693), and RBAC models already implemented and vetted in the openshift-mcp-server. No separate design work is required.