@dims
Created November 18, 2025 18:23
MongoDB & PostgreSQL Implementation Guide

NVSentinel Unified Datastore Support

Consolidated & Authoritative Reference
Last Updated: November 18, 2025
Branch: add-support-for-postgres
Version: 3.0 (consolidated from multiple sources)


TABLE OF CONTENTS

  0. Understanding the Abstraction Layer Design
  1. Overview
  2. Architectural Design
  3. MongoDB Implementation
  4. PostgreSQL Implementation
  5. Certificate & TLS Management
  6. Configuration System
  7. Helm Chart Design
  8. Tilt Development Workflow
  9. Deployment Patterns
  10. Database Schema Design
  11. Change Detection Mechanisms
  12. Query & Operation Patterns
  13. Performance Characteristics
  14. Testing Strategy
  15. Migration & Compatibility
  16. Operational Considerations
  17. Adding Support for New Databases
  18. Debugging Guide
  19. Key Takeaways
  20. Quick Reference
  21. Files of Interest

0. UNDERSTANDING THE ABSTRACTION LAYER DESIGN

Why Two Different Approaches Exist

Context: NVSentinel evolved from MongoDB-only to supporting multiple databases.

MongoDB's Journey (Evolutionary):

  • MongoDB was implemented before the abstraction layer existed
  • The original code used direct MongoDB driver calls throughout
  • When the abstraction layer was added, backward compatibility had to be maintained
  • Result: an adapter pattern wraps the existing implementation

PostgreSQL's Journey (Revolutionary):

  • PostgreSQL was added after abstraction layer was designed
  • Built from ground up to implement the abstraction interfaces
  • No legacy code to maintain
  • Result: Clean, direct implementation

The Abstraction Layer Contract

Core Philosophy: Components should never know which database they're using.

// This code works identically with MongoDB or PostgreSQL
dsConfig, _ := datastore.LoadDatastoreConfig()
ds, _ := datastore.NewDataStore(ctx, *dsConfig)
healthStore := ds.HealthEventStore()
healthStore.UpdateHealthEventStatus(ctx, eventID, status)

Factory Pattern:

DATASTORE_PROVIDER env var
        ↓
   LoadDatastoreConfig()
        ↓
   NewDataStore(config)
        ↓
    Factory selects:
    - MongoDB → NewMongoDBDataStore()
    - PostgreSQL → NewPostgreSQLStore()
        ↓
  Returns DataStore interface

Critical Insight: This abstraction enables database-agnostic component code.


1. OVERVIEW

Current State

NVSentinel supports two database backends through a unified abstraction layer:

  • MongoDB - Original implementation, production-proven
  • PostgreSQL - Newly added alternative backend

Both databases provide identical functionality to components through the DataStore interface defined in store-client/pkg/datastore/interfaces.go.

Implementation Status

  • MongoDB: ✅ Production-ready, mature codebase (~3,500 LOC)
  • PostgreSQL: ✅ Feature-complete, newly added (~2,900 LOC)
  • Unified Interface: ✅ Both implement identical DataStore interface
  • Test Coverage: ✅ Both have comprehensive test suites (196+ tests total)
  • Behavioral Parity: ✅ Verified via cross-provider contract tests

Supported Components

All core NVSentinel components use the unified datastore:

  • platform-connectors
  • fault-quarantine
  • fault-remediation
  • node-drainer
  • health-events-analyzer
  • csp-health-monitor
  • janitor

2. ARCHITECTURAL DESIGN

2.1 The Unified Interface

Core Abstraction (store-client/pkg/datastore/interfaces.go):

type DataStore interface {
    MaintenanceEventStore() MaintenanceEventStore
    HealthEventStore() HealthEventStore
    Ping(ctx context.Context) error
    Close(ctx context.Context) error
    Provider() DataStoreProvider
}

type HealthEventStore interface {
    InsertHealthEvents(ctx context.Context, events ...*HealthEventWithStatus) error
    UpdateHealthEventStatus(ctx context.Context, id string, status HealthEventStatus) error
    FindHealthEventsByNode(ctx context.Context, nodeName string) ([]*HealthEventWithStatus, error)
    DeleteHealthEvent(ctx context.Context, id string) error
    // ... more methods
}

type MaintenanceEventStore interface {
    UpsertMaintenanceEvent(ctx context.Context, event *MaintenanceEventWithHistory) error
    GetMaintenanceEvent(ctx context.Context, nodeName string) (*MaintenanceEventWithHistory, error)
    DeleteMaintenanceEvent(ctx context.Context, nodeName string) error
    ListMaintenanceEvents(ctx context.Context) ([]*MaintenanceEventWithHistory, error)
}

Key Principle: Components program against interfaces, not implementations.

2.2 How Components Use Datastores

Standard Pattern (Recommended):

package main

import (
    "context"
    "log"

    "github.com/nvidia/nvsentinel/store-client/pkg/datastore"
)

func main() {
    ctx := context.Background()
    // 1. Load configuration (reads from environment)
    config, err := datastore.LoadDatastoreConfig()
    if err != nil {
        log.Fatalf("Failed to load datastore config: %v", err)
    }

    // 2. Create datastore (factory automatically selects provider)
    ds, err := datastore.NewDataStore(ctx, *config)
    if err != nil {
        log.Fatalf("Failed to create datastore: %v", err)
    }
    defer ds.Close(ctx)

    // 3. Get specialized stores
    maintenanceStore := ds.MaintenanceEventStore()
    healthStore := ds.HealthEventStore()

    // 4. Use them - same code for both databases!
    err = healthStore.UpdateHealthEventStatus(ctx, eventID, newStatus)
    err = maintenanceStore.UpsertMaintenanceEvent(ctx, event)
}

The factory in NewDataStore() automatically creates the correct provider based on config.Provider.

2.3 Provider Registration

Auto-Registration Pattern:

// MongoDB registration (mongodb/register.go)
func init() {
    datastore.RegisterProvider(datastore.ProviderMongoDB, NewMongoDBDataStore)
}

// PostgreSQL registration (postgresql/register.go)
func init() {
    datastore.RegisterProvider(datastore.ProviderPostgreSQL, NewPostgreSQLStore)
}

Both providers self-register when their packages are imported, making the factory pattern work transparently.

2.4 Key Architectural Differences

| Aspect | MongoDB | PostgreSQL |
|---|---|---|
| Design Approach | Adapter pattern (wraps legacy) | Native implementation |
| Client Type | `*mongo.Client` (official driver) | `*sql.DB` (stdlib) |
| Legacy Compat | Full backward-compatibility layer | None (built for abstraction) |
| Change Detection | Native MongoDB change streams | Polling with database triggers |
| Document Storage | Native BSON | JSONB with indexed columns |
| Query Language | Aggregation pipelines | SQL with JSONB operators |
| Connection Format | MongoDB URI | PostgreSQL connection string |
| Code Complexity | High (dual systems) | Medium (single path) |

3. MONGODB IMPLEMENTATION

3.1 Design Philosophy

"Adaptation" - MongoDB wraps existing implementation to provide new abstraction interface while maintaining backward compatibility.

3.2 File Structure

store-client/pkg/datastore/providers/mongodb/
├── adapter.go              (271 lines) - Wraps legacy MongoDB client
├── builders.go             (152 lines) - Query builder factories
├── health_store.go         (250 lines) - Health event operations
├── maintenance_store.go    (238 lines) - Maintenance event operations
├── register.go             (75 lines)  - Provider auto-registration
├── watcher_factory.go      (73 lines)  - Change stream watcher factory
└── watcher/                - Change stream implementation
    ├── watch_store.go      (~1,200 LOC) - Core MongoDB watcher logic
    ├── unmarshaller.go     - Event unmarshalling
    └── ...

Total: ~3,500 lines of code

3.3 MongoDB's Dual Configuration System

CRITICAL UNDERSTANDING: MongoDB components run TWO parallel configuration systems.

System A - Legacy (Pre-Abstraction):

// Used for change stream watchers
import "github.com/nvidia/nvsentinel/store-client/pkg/client"

databaseConfig := client.NewDatabaseConfigFromEnv()
clientFactory := factory.NewClientFactory(databaseConfig)
watcher, _ := clientFactory.CreateChangeStreamWatcher(ctx, client, "name", pipeline)

Configuration Source:

  • MONGODB_URI environment variable
  • MONGODB_DATABASE_NAME
  • MONGODB_COLLECTION_NAME
  • Certificate path from command-line flags (--database-client-cert-mount-path)

System B - New (Post-Abstraction):

// Used for queries and maintenance operations
import "github.com/nvidia/nvsentinel/store-client/pkg/datastore"

dsConfig, _ := datastore.LoadDatastoreConfig()
ds, _ := datastore.NewDataStore(ctx, *dsConfig)
healthStore := ds.HealthEventStore()

Configuration Source:

  • DATASTORE_PROVIDER=mongodb
  • DATASTORE_HOST, DATASTORE_PORT, DATASTORE_DATABASE
  • Certificate path from config.Connection.TLSConfig or environment variable

The Problem: Both systems load configuration independently and can resolve different certificate paths!

The Solution:

  • Always pass --database-client-cert-mount-path=/etc/ssl/client-certs to components
  • This ensures both systems use the same cert path

3.4 MongoDB-Specific Concepts

1. Change Streams (Native Feature): MongoDB provides real-time change notifications:

// MongoDB driver watches the collection
stream = collection.Watch(ctx, pipeline)
for stream.Next(ctx) {
    // Get change event immediately (push-based)
    event = stream.Current
    // Process: operationType, fullDocument, resumeToken
}

Characteristics:

  • Latency: <50ms (near real-time)
  • Method: Push-based notifications from MongoDB server
  • Resume: Binary BSON resume tokens for fault tolerance
  • Efficient: Only changed documents sent over wire

2. Replica Set Requirement: MongoDB requires replica set configuration even for single-node deployments:

mongodb://mongodb-headless:27017/?replicaSet=rs0&tls=true
                                  ^^^^^^^^^^^^ Required for change streams!

Without replicaSet parameter, change streams will not work.

3. Document Storage:

  • Native BSON documents
  • Schema-less (no migrations needed)
  • Flexible nested structures
  • 16MB document size limit

4. Authentication:

  • Method: MONGODB-X509 (certificate-based)
  • No passwords - TLS client certificates only
  • Certificate DN must match MongoDB user

3.5 Adapter Pattern Implementation

type AdaptedMongoStore struct {
    // Legacy MongoDB clients (pre-abstraction)
    databaseClient   client.DatabaseClient
    collectionClient client.CollectionClient
    factory          *factory.ClientFactory

    // New interface implementations
    maintenanceStore datastore.MaintenanceEventStore  // Implements new interface
    healthStore      datastore.HealthEventStore       // Implements new interface
}

// Implements DataStore interface
func (a *AdaptedMongoStore) HealthEventStore() datastore.HealthEventStore {
    return a.healthStore
}

// Legacy access (for components not yet migrated)
func (a *AdaptedMongoStore) GetDatabaseClient() client.DatabaseClient {
    return a.databaseClient  // Bridge to old code
}

Design Pattern: Adapter wraps existing functionality to present new interface.

3.6 MongoDB Connection Flow

Component Startup
  ↓
LoadDatastoreConfig()
  ├─ Reads: DATASTORE_PROVIDER=mongodb
  ├─ Reads: DATASTORE_HOST, DATASTORE_PORT
  ├─ Builds: DataStoreConfig struct
  ↓
NewDataStore(ctx, config)
  ├─ Factory lookup: ProviderMongoDB → NewMongoDBDataStore
  ↓
NewMongoDBDataStore(ctx, config)
  ├─ Creates legacy adapter for compatibility:
  │  └─ ConvertDataStoreConfigToLegacyWithCertPath()
  ├─ Initializes mongo.Client with connection string
  ├─ Creates AdaptedMongoStore
  │  ├─ healthStore = NewMongoHealthEventStore(...)
  │  └─ maintenanceStore = NewMongoMaintenanceEventStore(...)
  ↓
Returns: DataStore interface

4. POSTGRESQL IMPLEMENTATION

4.1 Design Philosophy

"Native" - PostgreSQL was built specifically for the new abstraction layer with no legacy baggage.

4.2 File Structure

store-client/pkg/datastore/providers/postgresql/
├── datastore.go            (435 lines) - Main PostgreSQL datastore
├── changestream.go         (337 lines) - Polling-based change detection
├── health_events.go        (571 lines) - Health event operations
├── maintenance_events.go   (369 lines) - Maintenance event operations
├── database_client.go      (418 lines) - Legacy client adapter (for compat)
├── register.go             (34 lines)  - Provider auto-registration
├── watcher_factory.go      (89 lines)  - Change stream watcher factory
├── pipeline_filter.go      (318 lines) - MongoDB pipeline → SQL translator
└── *_test.go               (464 lines) - Comprehensive tests

Total: ~2,900 lines of code (smaller than MongoDB due to no legacy)

4.3 PostgreSQL-Specific Concepts

1. Polling-Based Change Detection:

Since PostgreSQL doesn't have native change streams, we implement them:

Database Setup (migrations):

-- Changelog table to track all changes
CREATE TABLE datastore_changelog (
    id SERIAL PRIMARY KEY,
    table_name VARCHAR(255) NOT NULL,
    operation VARCHAR(10) NOT NULL,    -- INSERT, UPDATE, DELETE
    old_values JSONB,
    new_values JSONB,
    changed_at TIMESTAMP DEFAULT NOW(),
    processed BOOLEAN DEFAULT FALSE
);

-- Trigger function to capture changes
CREATE OR REPLACE FUNCTION health_events_change_trigger() RETURNS TRIGGER AS $$
BEGIN
    IF (TG_OP = 'INSERT') THEN
        INSERT INTO datastore_changelog (table_name, operation, new_values)
        VALUES (TG_TABLE_NAME, TG_OP, row_to_json(NEW)::jsonb);
    ELSIF (TG_OP = 'UPDATE') THEN
        INSERT INTO datastore_changelog (table_name, operation, old_values, new_values)
        VALUES (TG_TABLE_NAME, TG_OP, row_to_json(OLD)::jsonb, row_to_json(NEW)::jsonb);
    ELSIF (TG_OP = 'DELETE') THEN
        INSERT INTO datastore_changelog (table_name, operation, old_values)
        VALUES (TG_TABLE_NAME, TG_OP, row_to_json(OLD)::jsonb);
    END IF;
    RETURN COALESCE(NEW, OLD);  -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

-- Attach trigger to health_events table
CREATE TRIGGER health_events_change
AFTER INSERT OR UPDATE OR DELETE ON health_events
FOR EACH ROW EXECUTE FUNCTION health_events_change_trigger();

Application Polling (Go code):

ticker := time.NewTicker(5 * time.Second)  // Configurable poll interval
for range ticker.C {
    rows, err := db.Query(`
        SELECT id, operation, new_values, changed_at
        FROM datastore_changelog
        WHERE id > $1 AND table_name = 'health_events'
        ORDER BY id
    `, lastProcessedID)
    if err != nil {
        continue // log the error and retry on the next tick
    }

    for rows.Next() {
        var row changelogRow
        if err := rows.Scan(&row.ID, &row.Operation, &row.NewValues, &row.ChangedAt); err != nil {
            break
        }
        // Convert to MongoDB-style change event format and deliver it
        sendToChannel(convertToChangeStreamEvent(row))
        lastProcessedID = row.ID
    }
    rows.Close()
}

Characteristics:

  • Latency: 0-5 seconds (depending on when change occurs relative to poll)
  • Method: Application polls database
  • Resume: Integer IDs (simpler than MongoDB's BSON tokens)
  • Tradeoff: Slight delay vs MongoDB's real-time, but simpler to debug

2. JSONB Storage with Indexed Columns:

Hybrid approach combining SQL performance with document flexibility:

CREATE TABLE health_events (
    id SERIAL PRIMARY KEY,
    document JSONB NOT NULL,               -- Full document (flexible schema)

    -- Extracted columns for fast querying
    node_name VARCHAR(255),                -- From document->'healthevent'->>'nodename'
    status VARCHAR(50),                    -- From document->'healtheventstatus'->>'nodequarantined'
    is_fatal BOOLEAN,                      -- From document->'healthevent'->>'isfatal'
    agent VARCHAR(100),
    created_at BIGINT                      -- Indexed timestamp
);

-- Indexes on extracted columns for performance
-- (PostgreSQL has no inline INDEX clause in CREATE TABLE)
CREATE INDEX idx_health_events_node_name ON health_events (node_name);
CREATE INDEX idx_health_events_status ON health_events (status);
CREATE INDEX idx_health_events_created_at ON health_events (created_at);

Benefits:

  • Fast queries: Indexed columns for common filters
  • Flexibility: JSONB allows schema evolution without migrations
  • Best of both: SQL performance + document database flexibility

Example Query:

-- Uses index on node_name, then JSONB operators for nested fields
SELECT * FROM health_events
WHERE node_name = 'gpu-node-1'           -- Fast: uses index
  AND document->'healthevent'->>'checkname' = 'GpuXidError'  -- JSONB query
ORDER BY created_at DESC                  -- Fast: uses index
LIMIT 10;

3. Pipeline Translation:

Converts MongoDB aggregation pipelines to PostgreSQL SQL:

// MongoDB pipeline from component
pipeline := []bson.M{ // bson.M from go.mongodb.org/mongo-driver/bson
    {"$match": bson.M{"healthevent.nodename": "node-1"}},
    {"$sort": bson.M{"createdAt": -1}},
    {"$limit": 10},
}

// Translator converts to SQL
sql := translatePipeline(pipeline)
// Result:
// SELECT * FROM health_events
// WHERE document->'healthevent'->>'nodename' = 'node-1'
// ORDER BY created_at DESC
// LIMIT 10

Supported Pipeline Stages:

  • $match → WHERE clauses
  • $sort → ORDER BY
  • $limit → LIMIT
  • $skip → OFFSET
  • $group → Not supported (use SQL directly)
  • $lookup → Not supported (use JOINs)
  • $unwind → Not supported

4. Authentication:

  • Method: PostgreSQL SSL certificate verification
  • Requires: Client cert + CA cert + server cert
  • pg_hba.conf: hostssl all all 0.0.0.0/0 cert
  • Certificate CN must match PostgreSQL user

4.4 Native Implementation

type PostgreSQLDataStore struct {
    db                    *sql.DB  // Direct database connection

    // Stores implement interfaces natively
    maintenanceEventStore datastore.MaintenanceEventStore
    healthEventStore      datastore.HealthEventStore
}

// Implements DataStore interface
func (p *PostgreSQLDataStore) HealthEventStore() datastore.HealthEventStore {
    return p.healthEventStore
}

// No legacy compatibility needed!

Design Pattern: Direct implementation of abstraction interfaces from the start.

4.5 PostgreSQL Connection Flow

Component Startup
  ↓
LoadDatastoreConfig()
  ├─ Reads: DATASTORE_PROVIDER=postgresql
  ├─ Reads: DATASTORE_HOST, DATASTORE_PORT, DATASTORE_USERNAME
  ├─ Reads: DATASTORE_SSLCERT, DATASTORE_SSLKEY, DATASTORE_SSLROOTCERT
  ├─ Stores cert paths directly in config.Connection
  ↓
NewDataStore(ctx, config)
  ├─ Factory lookup: ProviderPostgreSQL → NewPostgreSQLStore
  ↓
NewPostgreSQLStore(ctx, config)
  ├─ Builds connection string with explicit cert paths
  ├─ Opens sql.DB connection
  ├─ Creates PostgreSQLDataStore
  │  ├─ healthStore = NewPostgreSQLHealthEventStore(db)
  │  └─ maintenanceStore = NewPostgreSQLMaintenanceEventStore(db)
  ↓
Returns: DataStore interface

Cleaner Flow: No legacy conversions, single configuration path.


5. CERTIFICATE & TLS MANAGEMENT

5.1 Certificate Hierarchy

Both databases use cert-manager for certificate lifecycle:

selfsigned-ca-issuer (Self-signed root issuer)
  └─> {database}-root-ca (CA certificate, 10 year lifetime)
      └─> {database}-ca-issuer (CA issuer resource)
          ├─> {database}-server-cert (Server certificate, 1 year, auto-renew)
          └─> {database}-client-cert (Client certificate, 1 year, auto-renew)

Certificates automatically renew 15 days before expiration.

5.2 MongoDB Certificates

Created Certificates (templates/certmanager-mongodb.yaml):

  1. mongo-root-ca - Root CA (self-signed, 10 years)
  2. mongo-ca-issuer - Issuer using root CA
  3. mongo-server-cert-0 - Server cert for mongodb-0 pod
  4. mongo-app-client-cert - Client cert for applications
  5. mongo-dgxcops-client-cert - Client cert for operations

Mounting in Components:

volumes:
  - name: mongo-app-client-cert
    secret:
      secretName: mongo-app-client-cert-secret
      items:
        - key: tls.crt
          path: tls.crt
        - key: tls.key
          path: tls.key
        - key: ca.crt
          path: ca.crt

volumeMounts:
  - name: mongo-app-client-cert
    mountPath: /etc/ssl/client-certs  # Actual mount location
    readOnly: true

No init container needed - MongoDB driver accepts certificates as-is.

Environment Variables:

MONGODB_CLIENT_CERT_MOUNT_PATH=/etc/ssl/client-certs

5.3 PostgreSQL Certificates

Created Certificates (templates/certmanager-postgresql.yaml):

  1. postgresql-root-ca - Root CA (self-signed, 10 years)
  2. selfsigned-ca-issuer - Self-signed issuer for CA
  3. postgresql-ca-issuer - Issuer using root CA
  4. postgresql-server-cert - Server cert for PostgreSQL pod
  5. postgresql-client-cert - Client cert for applications

Mounting with Init Container (Two-Stage):

initContainers:
  - name: fix-cert-permissions
    image: bitnamilegacy/os-shell
    command:
      - sh
      - -c
      - |
        cp /etc/ssl/client-certs-original/* /etc/ssl/client-certs-fixed/
        chmod 644 /etc/ssl/client-certs-fixed/tls.crt
        chmod 644 /etc/ssl/client-certs-fixed/ca.crt
        chmod 600 /etc/ssl/client-certs-fixed/tls.key  # CRITICAL: PostgreSQL requires 0600
    volumeMounts:
      - name: postgresql-client-cert-original
        mountPath: /etc/ssl/client-certs-original
        readOnly: true
      - name: client-certs-fixed
        mountPath: /etc/ssl/client-certs-fixed

containers:
  - name: component
    volumeMounts:
      - name: client-certs-fixed  # Use fixed certs, not original
        mountPath: /etc/ssl/client-certs
        readOnly: true

volumes:
  - name: postgresql-client-cert-original
    secret:
      secretName: postgresql-client-cert
  - name: client-certs-fixed
    emptyDir: {}  # Mutable volume for fixed permissions

Why the Init Container?

  • Kubernetes secrets are mounted as root:root with 0644/0444 permissions
  • PostgreSQL libpq requires client key to be 0600 (owner-only read/write)
  • Secrets are immutable, can't chmod directly
  • Solution: Init container copies to emptyDir and fixes permissions

Environment Variables:

DATASTORE_SSLCERT=/etc/ssl/client-certs/tls.crt
DATASTORE_SSLKEY=/etc/ssl/client-certs/tls.key
DATASTORE_SSLROOTCERT=/etc/ssl/client-certs/ca.crt
POSTGRESQL_CLIENT_CERT_MOUNT_PATH=/etc/ssl/client-certs

5.4 Certificate Path Resolution Complexity

CRITICAL: MongoDB has a 5-level precedence system for determining certificate paths.

Precedence Order (Highest to Lowest):

1. CLI Flag (Explicit)
   --database-client-cert-mount-path=/etc/ssl/client-certs
   ↓ (if not provided)

2. Environment Variable
   MONGODB_CLIENT_CERT_MOUNT_PATH=/etc/ssl/client-certs
   ↓ (if not set)

3. Config Struct Field
   config.Connection.TLSConfig.CertPath
   ↓ (if empty)

4. File Existence Check
   if os.Stat("/etc/ssl/client-certs/ca.crt") → use /etc/ssl/client-certs
   ↓ (if not found)

5. Legacy Default Fallback
   /etc/ssl/mongo-client  ← DANGEROUS: Causes issues!

Code Implementation (commons/pkg/flags/database_flags.go):

type DatabaseCertConfig struct {
    DatabaseClientCertMountPath string  // New flag, defaults to "/etc/ssl/database-client"
    LegacyMongoCertPath        string  // Old default: "/etc/ssl/mongo-client"
    ResolvedCertPath           string  // Final resolved path
}

func (c *DatabaseCertConfig) ResolveCertPath() string {
    // If flag still has default value, fall back to legacy
    if c.DatabaseClientCertMountPath == "/etc/ssl/database-client" {
        c.ResolvedCertPath = c.LegacyMongoCertPath  // /etc/ssl/mongo-client
        return c.ResolvedCertPath
    }
    // Otherwise use explicitly set value
    c.ResolvedCertPath = c.DatabaseClientCertMountPath
    return c.ResolvedCertPath
}

The Problem:

  • Actual mount: /etc/ssl/client-certs
  • Default flag: /etc/ssl/database-client
  • Fallback: /etc/ssl/mongo-client
  • Result: Code looks in wrong place!

The Solution (Applied to All Components):

args:
  - "--database-client-cert-mount-path=/etc/ssl/client-certs"  # Explicit!

This makes the CLI flag (precedence level #1) override all fallbacks.

Best Practice: Always use CLI flags for cert paths - highest precedence, most explicit.

PostgreSQL Avoids This: Cert paths come directly from environment variables set in ConfigMap, no complex resolution needed.



6. CONFIGURATION SYSTEM

6.1 Configuration Philosophy

Two Namespaces:

  • Generic (DATASTORE_*): Provider-agnostic configuration
  • Legacy (MONGODB_*, POSTGRESQL_*): Provider-specific for backward compatibility

6.2 MongoDB Configuration

Dual Namespace (for backward compatibility):

# New generic namespace
DATASTORE_PROVIDER=mongodb
DATASTORE_HOST=mongodb-headless.nvsentinel.svc.cluster.local
DATASTORE_PORT=27017
DATASTORE_DATABASE=HealthEventsDatabase

# Legacy MongoDB-specific namespace
MONGODB_URI=mongodb://mongodb-headless.nvsentinel.svc.cluster.local:27017/?replicaSet=rs0&tls=true
MONGODB_DATABASE_NAME=HealthEventsDatabase
MONGODB_COLLECTION_NAME=HealthEvents
MONGODB_TOKEN_COLLECTION_NAME=ResumeTokens
MONGODB_MAINTENANCE_EVENT_COLLECTION_NAME=MaintenanceEvents
MONGODB_CLIENT_CERT_MOUNT_PATH=/etc/ssl/client-certs  # From deployment env, not ConfigMap

# Timeout configuration
MONGODB_PING_TIMEOUT_TOTAL_SECONDS=30
MONGODB_PING_INTERVAL_SECONDS=5
CA_CERT_MOUNT_TIMEOUT_TOTAL_SECONDS=360
CA_CERT_READ_INTERVAL_SECONDS=5

Why Both? Legacy components still use MONGODB_* variables directly.

6.3 PostgreSQL Configuration

Single Namespace (cleaner):

# Everything in DATASTORE_* namespace
DATASTORE_PROVIDER=postgresql
DATASTORE_HOST=nvsentinel-postgresql.nvsentinel.svc.cluster.local
DATASTORE_PORT=5432
DATASTORE_DATABASE=nvsentinel
DATASTORE_USERNAME=postgresql
DATASTORE_SSLMODE=require
DATASTORE_SSLCERT=/etc/ssl/client-certs/tls.crt
DATASTORE_SSLKEY=/etc/ssl/client-certs/tls.key
DATASTORE_SSLROOTCERT=/etc/ssl/client-certs/ca.crt

# No legacy variables needed
# No timeout configuration (uses defaults)

Observation: PostgreSQL is cleaner - everything in unified namespace.

6.4 Configuration Loading

Priority Order:

func LoadDatastoreConfig() (*DataStoreConfig, error) {
    // 1. DATASTORE_PROVIDER env var (highest priority)
    if provider := os.Getenv("DATASTORE_PROVIDER"); provider != "" {
        return loadConfigFromEnv(provider)
    }

    // 2. DATASTORE_YAML env var (YAML string)
    if yamlConfig := os.Getenv("DATASTORE_YAML"); yamlConfig != "" {
        return loadConfigFromYAMLString(yamlConfig)
    }

    // 3. DATASTORE_YAML_PATH env var (YAML file)
    if yamlPath := os.Getenv("DATASTORE_YAML_PATH"); yamlPath != "" {
        return loadConfigFromYAMLFile(yamlPath)
    }

    // 4. Default to MongoDB with legacy env vars
    return loadDefaultConfig()
}

Best Practice: Use DATASTORE_PROVIDER with individual env vars.

6.5 ConfigMap Selection Logic

Component Deployment Pattern:

envFrom:
  - configMapRef:
      name: {{ if .Values.global.datastore }}{{ .Release.Name }}-datastore-config{{ else }}mongodb-config{{ end }}

Logic:

  • If global.datastore is set → use nvsentinel-datastore-config (unified)
  • If not set → use mongodb-config (legacy)

Critical: This ensures backward compatibility with existing MongoDB deployments.


7. HELM CHART DESIGN

7.1 Chart Structure

distros/kubernetes/nvsentinel/
├── Chart.yaml
├── Chart.lock
├── values.yaml                      # Base values (no database selected)
├── values-tilt.yaml                # Common Tilt settings
├── values-tilt-mongodb.yaml        # MongoDB Tilt configuration
├── values-tilt-postgresql.yaml     # PostgreSQL Tilt configuration
├── values-postgresql.yaml          # Production PostgreSQL
├── templates/
│   ├── configmap-datastore.yaml    # Unified datastore configuration (CRITICAL)
│   ├── certmanager-mongodb.yaml    # MongoDB TLS certificates
│   └── certmanager-postgresql.yaml # PostgreSQL TLS certificates
└── charts/
    ├── postgresql/                  # Vendored Bitnami PostgreSQL chart
    ├── mongodb-store/               # MongoDB subchart
    │   ├── charts/mongodb/         # Vendored Bitnami MongoDB chart
    │   └── templates/
    │       └── configmap.yaml      # Legacy mongodb-config ConfigMap
    ├── node-drainer/
    │   └── templates/deployment.yaml
    ├── fault-quarantine/
    │   └── templates/deployment.yaml
    └── ... (other components)

7.2 The Critical ConfigMap

File: templates/configmap-datastore.yaml
Name: nvsentinel-datastore-config
Condition: {{- if .Values.global.datastore }}

This ConfigMap is the bridge between Helm values and component runtime.

MongoDB Example:

# values-tilt-mongodb.yaml
global:
  datastore:
    provider: "mongodb"
    connection:
      host: "mongodb-headless.nvsentinel.svc.cluster.local"
      port: 27017
      database: "HealthEventsDatabase"
      collection: "HealthEvents"
      tokenCollection: "ResumeTokens"
      extraParams:  # CRITICAL for MongoDB
        replicaSet: "rs0"
        tls: "true"

# Generated ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvsentinel-datastore-config
data:
  DATASTORE_PROVIDER: "mongodb"
  DATASTORE_HOST: "mongodb-headless.nvsentinel.svc.cluster.local"
  DATASTORE_PORT: "27017"
  DATASTORE_DATABASE: "HealthEventsDatabase"
  MONGODB_URI: "mongodb://mongodb-headless.nvsentinel.svc.cluster.local:27017/?replicaSet=rs0&tls=true"
  MONGODB_DATABASE_NAME: "HealthEventsDatabase"
  MONGODB_COLLECTION_NAME: "HealthEvents"
  MONGODB_TOKEN_COLLECTION_NAME: "ResumeTokens"

PostgreSQL Example:

# values-tilt-postgresql.yaml
global:
  datastore:
    provider: "postgresql"
    connection:
      host: "nvsentinel-postgresql.nvsentinel.svc.cluster.local"
      port: 5432
      database: "nvsentinel"
      username: "postgresql"
      sslmode: "require"
      sslcert: "/etc/ssl/client-certs/tls.crt"
      sslkey: "/etc/ssl/client-certs/tls.key"
      sslrootcert: "/etc/ssl/client-certs/ca.crt"

# Generated ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvsentinel-datastore-config
data:
  DATASTORE_PROVIDER: "postgresql"
  DATASTORE_HOST: "nvsentinel-postgresql.nvsentinel.svc.cluster.local"
  DATASTORE_PORT: "5432"
  DATASTORE_DATABASE: "nvsentinel"
  DATASTORE_USERNAME: "postgresql"
  DATASTORE_SSLMODE: "require"
  DATASTORE_SSLCERT: "/etc/ssl/client-certs/tls.crt"
  DATASTORE_SSLKEY: "/etc/ssl/client-certs/tls.key"
  DATASTORE_SSLROOTCERT: "/etc/ssl/client-certs/ca.crt"

Template Logic (simplified):

{{- if .Values.global.datastore }}
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-datastore-config
data:
  DATASTORE_PROVIDER: {{ .Values.global.datastore.provider | quote }}
  DATASTORE_HOST: {{ .Values.global.datastore.connection.host | quote }}
  {{- if eq .Values.global.datastore.provider "mongodb" }}
  {{- $params := include "buildMongoDBQueryParams" .Values.global.datastore.connection.extraParams }}
  MONGODB_URI: "mongodb://{{ .Values.global.datastore.connection.host }}:{{ .Values.global.datastore.connection.port }}/?{{ $params }}"
  {{- else if eq .Values.global.datastore.provider "postgresql" }}
  DATASTORE_SSLCERT: {{ .Values.global.datastore.connection.sslcert | quote }}
  DATASTORE_SSLKEY: {{ .Values.global.datastore.connection.sslkey | quote }}
  {{- end }}
{{- end }}

7.3 Vendored Dependencies

PostgreSQL Chart:

  • Source: Bitnami PostgreSQL chart 15.5.38
  • Location: charts/postgresql/
  • Size: 65+ files, 1,780 lines in values.yaml
  • Modified: All images changed to bitnamilegacy/*

MongoDB Chart:

  • Source: Bitnami MongoDB chart
  • Location: charts/mongodb-store/charts/mongodb/
  • Modified: Images use bitnamilegacy/*

Reason for Vendoring: Ensures availability and allows customization.


8. TILT DEVELOPMENT WORKFLOW

8.1 Switching Databases

Use MongoDB (Default):

cd tilt
tilt up
# Loads: values-tilt-mongodb.yaml
# Deploys: MongoDB StatefulSet
# Creates: mongo-* certificates

Use PostgreSQL:

cd tilt
export USE_POSTGRESQL=1
tilt up
# Loads: values-tilt-postgresql.yaml
# Deploys: PostgreSQL StatefulSet
# Creates: postgresql-* certificates

8.2 Tiltfile Logic

File: tilt/Tiltfile

use_postgresql = os.getenv('USE_POSTGRESQL', '0') == '1'

# Values file selection
values_files = ['../distros/kubernetes/nvsentinel/values-tilt.yaml']
if use_postgresql:
    print("Using PostgreSQL as datastore (USE_POSTGRESQL=1)")
    values_files.append('../distros/kubernetes/nvsentinel/values-tilt-postgresql.yaml')
else:
    print("Using MongoDB as datastore (default)")
    values_files.append('../distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml')

# Resource naming
datastore_resource = 'nvsentinel-postgresql' if use_postgresql else 'mongodb'

# Certificate resources
cert_manager_objects = ['janitor-webhook-cert:certificate']
if use_postgresql:
    cert_manager_objects.extend([
        'postgresql-root-ca:certificate',
        'postgresql-ca-issuer:issuer',
        'selfsigned-ca-issuer:issuer',
        'postgresql-server-cert:certificate',
        'postgresql-client-cert:certificate'
    ])
else:
    cert_manager_objects.extend([
        'mongo-root-ca:certificate',
        'mongo-ca-issuer:issuer',
        'selfsigned-ca-issuer:issuer',
        'mongo-server-cert-0:certificate',
        'mongo-app-client-cert:certificate',
        'mongo-dgxcops-client-cert:certificate'
    ])

# Component dependencies
k8s_resource('platform-connectors', resource_deps=[datastore_resource])
k8s_resource('fault-quarantine', resource_deps=[datastore_resource])
k8s_resource('fault-remediation', resource_deps=[datastore_resource])
k8s_resource('node-drainer', resource_deps=[datastore_resource])
k8s_resource('health-events-analyzer', resource_deps=[datastore_resource])

Key Pattern: Components wait for datastore to be ready before starting.


9. DEPLOYMENT PATTERNS

9.1 Component Deployment Template

Every component follows this pattern (example: node-drainer/templates/deployment.yaml):

spec:
  template:
    spec:
      # PostgreSQL ONLY: Init container to fix cert permissions
      {{- if eq .Values.global.datastore.provider "postgresql" }}
      initContainers:
        - name: fix-cert-permissions
          image: bitnamilegacy/os-shell
          command:
            - sh
            - -c
            - |
              cp /etc/ssl/client-certs-original/* /etc/ssl/client-certs-fixed/
              chmod 600 /etc/ssl/client-certs-fixed/tls.key
          volumeMounts:
            - name: {{ .Values.global.datastore.provider }}-client-cert-original
              mountPath: /etc/ssl/client-certs-original
            - name: client-certs-fixed
              mountPath: /etc/ssl/client-certs-fixed
      {{- end }}

      containers:
        - name: {{ .Chart.Name }}
          image: {{ .Values.image.repository }}:{{ .Values.image.tag }}
          
          # CRITICAL: Pass cert path via command-line arg
          args:
            - "--metrics-port=2112"
            - "--config-path=/etc/config/config.toml"
            - "--database-client-cert-mount-path={{ .Values.clientCertMountPath }}"
          
          env:
            - name: LOG_LEVEL
              value: {{ .Values.logLevel | quote }}
            {{- if eq .Values.global.datastore.provider "postgresql" }}
            - name: POSTGRESQL_CLIENT_CERT_MOUNT_PATH
              value: {{ .Values.clientCertMountPath }}
            {{- else }}
            - name: MONGODB_CLIENT_CERT_MOUNT_PATH
              value: {{ .Values.clientCertMountPath }}
            {{- end }}
          
          # Load all datastore config
          envFrom:
            - configMapRef:
                name: {{ if .Values.global.datastore }}{{ .Release.Name }}-datastore-config{{ else }}mongodb-config{{ end }}
          
          volumeMounts:
            - name: config
              mountPath: /etc/config
            {{- if eq .Values.global.datastore.provider "postgresql" }}
            - name: client-certs-fixed
              mountPath: {{ .Values.clientCertMountPath }}
            {{- else }}
            - name: mongo-app-client-cert
              mountPath: {{ .Values.clientCertMountPath }}
            {{- end }}
      
      volumes:
        - name: config
          configMap:
            name: {{ .Release.Name }}-{{ .Chart.Name }}-config
        {{- if eq .Values.global.datastore.provider "postgresql" }}
        - name: postgresql-client-cert-original
          secret:
            secretName: postgresql-client-cert
        - name: client-certs-fixed
          emptyDir: {}
        {{- else }}
        - name: mongo-app-client-cert
          secret:
            secretName: mongo-app-client-cert-secret
            optional: true
        {{- end }}

Pattern Observations:

  1. Init container: PostgreSQL only (cert permissions)
  2. Command-line arg: Required for cert path resolution
  3. Environment variable: Provider-specific name
  4. ConfigMap selection: Conditional
  5. Volume mounting: Two-stage for PostgreSQL, direct for MongoDB

10. DATABASE SCHEMA DESIGN

10.1 MongoDB Schema

Collections:

  • HealthEvents - Health event documents
  • MaintenanceEvents - Maintenance event documents
  • ResumeTokens - Change stream resume positions

Schema-less (Flexible):

{
  "_id": ObjectId("..."),
  "createdAt": ISODate("2025-11-18T..."),
  "healthevent": {
    "nodename": "gpu-node-1",
    "checkname": "GpuXidError",
    "componentclass": "GPU",
    "isfatal": true,
    "message": "XID error detected on GPU 0"
  },
  "healtheventstatus": {
    "nodequarantined": "Quarantined",
    "userpodsevictionstatus": {
      "status": "Completed",
      "message": "All pods evicted"
    },
    "maintenanceeventcreationstatus": {
      "status": "Created",
      "maintenanceeventnodename": "gpu-node-1"
    }
  }
}

Querying (Aggregation Pipelines):

db.HealthEvents.aggregate([
  {$match: {"healthevent.nodename": "gpu-node-1"}},
  {$sort: {createdAt: -1}},
  {$limit: 10}
])

10.2 PostgreSQL Schema

Tables:

  • health_events - Health events with JSONB + indexed columns
  • maintenance_events - Maintenance events with JSONB + indexed columns
  • datastore_changelog - Change tracking (for change stream emulation)
  • resume_tokens - Resume positions for watchers

Hybrid Schema (Best of Both):

CREATE TABLE health_events (
    id SERIAL PRIMARY KEY,
    document JSONB NOT NULL,              -- Full document (flexible)

    -- Extracted columns for performance
    node_name VARCHAR(255),                -- From document->'healthevent'->>'nodename'
    status VARCHAR(50),                    -- From document->'healtheventstatus'->>'nodequarantined'
    is_fatal BOOLEAN,                      -- From document->'healthevent'->>'isfatal'
    agent VARCHAR(100),                    -- From document->'healthevent'->>'agent'
    created_at BIGINT                      -- Timestamp for ordering
);

-- PostgreSQL indexes are created separately (inline INDEX is MySQL syntax)
CREATE INDEX idx_health_events_node_name ON health_events (node_name);
CREATE INDEX idx_health_events_status ON health_events (status);
CREATE INDEX idx_health_events_created_at ON health_events (created_at);
CREATE INDEX idx_health_events_is_fatal ON health_events (is_fatal);

CREATE TABLE datastore_changelog (
    id SERIAL PRIMARY KEY,
    table_name VARCHAR(255) NOT NULL,
    operation VARCHAR(10) NOT NULL,        -- INSERT, UPDATE, DELETE
    old_values JSONB,
    new_values JSONB,
    changed_at TIMESTAMP DEFAULT NOW(),
    processed BOOLEAN DEFAULT FALSE
);

CREATE INDEX idx_changelog_table_id ON datastore_changelog (table_name, id);
CREATE INDEX idx_changelog_processed ON datastore_changelog (processed);

Querying (SQL with JSONB):

SELECT * FROM health_events
WHERE node_name = 'gpu-node-1'            -- Fast: index
  AND is_fatal = true                      -- Fast: index
  AND document->'healthevent'->>'checkname' = 'GpuXidError'  -- JSONB
ORDER BY created_at DESC                   -- Fast: index
LIMIT 10;

Benefits:

  • Fast queries: Indexed columns
  • Flexibility: JSONB for schema evolution
  • SQL power: JOINs, transactions, constraints
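The extracted columns above must stay in sync with the JSONB document at write time. A minimal Go sketch of that extraction step, assuming the field names shown in the schema (the helper name is hypothetical):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// extractIndexedColumns pulls the values that back the indexed columns
// (node_name, is_fatal) out of the raw JSONB document so an INSERT can
// populate them alongside the document itself.
func extractIndexedColumns(doc []byte) (nodeName string, isFatal bool, err error) {
	var parsed struct {
		HealthEvent struct {
			NodeName string `json:"nodename"`
			IsFatal  bool   `json:"isfatal"`
		} `json:"healthevent"`
	}
	if err := json.Unmarshal(doc, &parsed); err != nil {
		return "", false, err
	}
	return parsed.HealthEvent.NodeName, parsed.HealthEvent.IsFatal, nil
}

func main() {
	doc := []byte(`{"healthevent":{"nodename":"gpu-node-1","isfatal":true}}`)
	node, fatal, err := extractIndexedColumns(doc)
	fmt.Println(node, fatal, err) // prints: gpu-node-1 true <nil>
}
```

If extraction and insertion ever disagree, the indexed columns silently drift from the document, so keeping this logic in one place is worth the small helper.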

11. CHANGE DETECTION MECHANISMS

11.1 MongoDB: Native Change Streams

How It Works:

// Create change stream
pipeline := bson.A{
    bson.M{"$match": bson.M{"operationType": bson.M{"$in": bson.A{"insert", "update", "delete"}}}},
}
stream, err := collection.Watch(ctx, pipeline)
if err != nil {
    return err
}
defer stream.Close(ctx)

// Listen for changes (blocking)
for stream.Next(ctx) {
    var event bson.M
    if err := stream.Decode(&event); err != nil {
        log.Error(err)
        continue
    }

    // event contains:
    // - operationType: "insert" | "update" | "delete"
    // - fullDocument: complete document after change
    // - documentKey: {_id: ...}
    // - _id: {_data: "..."} ← resume token

    processEvent(event)

    // Save resume token for fault tolerance
    resumeToken := stream.ResumeToken()
    saveResumeToken(resumeToken)
}

Characteristics:

  • Latency: <50ms (near real-time push)
  • Method: Server pushes changes to client
  • Resume: Binary BSON tokens
  • Efficient: Only changed documents sent
  • Requires: Replica set configuration

11.2 PostgreSQL: Polling with Triggers

Database Setup:

-- Trigger function
CREATE OR REPLACE FUNCTION health_events_change_trigger() RETURNS TRIGGER AS $$
BEGIN
    IF (TG_OP = 'INSERT') THEN
        INSERT INTO datastore_changelog (table_name, operation, new_values)
        VALUES (TG_TABLE_NAME, TG_OP, row_to_json(NEW)::jsonb);
    ELSIF (TG_OP = 'UPDATE') THEN
        INSERT INTO datastore_changelog (table_name, operation, old_values, new_values)
        VALUES (TG_TABLE_NAME, TG_OP, row_to_json(OLD)::jsonb, row_to_json(NEW)::jsonb);
    ELSIF (TG_OP = 'DELETE') THEN
        INSERT INTO datastore_changelog (table_name, operation, old_values)
        VALUES (TG_TABLE_NAME, TG_OP, row_to_json(OLD)::jsonb);
    END IF;
    RETURN COALESCE(NEW, OLD);  -- NEW is NULL on DELETE; AFTER triggers ignore the return value
END;
$$ LANGUAGE plpgsql;

-- Attach trigger
CREATE TRIGGER health_events_change
AFTER INSERT OR UPDATE OR DELETE ON health_events
FOR EACH ROW EXECUTE FUNCTION health_events_change_trigger();

Application Polling:

ticker := time.NewTicker(5 * time.Second)  // Configurable
defer ticker.Stop()
lastProcessedID := loadLastProcessedID()

for range ticker.C {
    rows, err := db.Query(`
        SELECT id, operation, new_values, changed_at
        FROM datastore_changelog
        WHERE id > $1 AND table_name = 'health_events'
        ORDER BY id
        LIMIT 100
    `, lastProcessedID)
    if err != nil {
        log.Error(err)
        continue
    }

    for rows.Next() {
        var id int64
        var operation string
        var newValues json.RawMessage
        var changedAt time.Time

        if err := rows.Scan(&id, &operation, &newValues, &changedAt); err != nil {
            log.Error(err)
            continue
        }

        // Convert to MongoDB-style change event
        event := &EventWithToken{
            OperationType: operation,
            FullDocument:  parseDocument(newValues),
            ResumeToken:   strconv.FormatInt(id, 10),
        }

        sendToChannel(event)
        lastProcessedID = id
    }
    rows.Close()
}

Characteristics:

  • Latency: 0-5 seconds (poll interval)
  • Method: Application polls database
  • Resume: Integer IDs (simpler)
  • Tradeoff: Slight delay for simplicity
  • No Special Requirements: Works on any PostgreSQL
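The integer resume tokens noted above are trivially serializable, which is part of their appeal over MongoDB's opaque BSON tokens. A minimal sketch (function names are hypothetical):

```go
package main

import (
	"fmt"
	"strconv"
)

// encodeResumeToken turns a changelog row ID into the string token
// persisted in the resume_tokens table.
func encodeResumeToken(id int64) string {
	return strconv.FormatInt(id, 10)
}

// decodeResumeToken parses a stored token back into a row ID so
// polling can resume with WHERE id > $1.
func decodeResumeToken(token string) (int64, error) {
	return strconv.ParseInt(token, 10, 64)
}

func main() {
	tok := encodeResumeToken(42)
	id, err := decodeResumeToken(tok)
	fmt.Println(tok, id, err) // prints: 42 42 <nil>
}
```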

12. QUERY & OPERATION PATTERNS

12.1 Health Event Operations

Insert (Identical for Both):

event := &datastore.HealthEventWithStatus{
    CreatedAt: time.Now(),
    HealthEvent: &protoEvent,
    HealthEventStatus: datastore.HealthEventStatus{
        NodeQuarantined: &quarantinedStatus,
    },
}

healthStore := ds.HealthEventStore()
err := healthStore.InsertHealthEvents(ctx, event)

Implementation:

  • MongoDB: collection.InsertOne(document)
  • PostgreSQL: INSERT INTO health_events (document, node_name, ...) VALUES (...)

Query by Node (Identical for Both):

events, err := healthStore.FindHealthEventsByNode(ctx, "gpu-node-1")

Implementation:

  • MongoDB: collection.Find({"healthevent.nodename": "gpu-node-1"})
  • PostgreSQL: SELECT * FROM health_events WHERE node_name = 'gpu-node-1'

Update Status (Identical for Both):

status := datastore.HealthEventStatus{
    NodeQuarantined: &newStatus,
}
err := healthStore.UpdateHealthEventStatus(ctx, eventID, status)

Implementation:

  • MongoDB: collection.UpdateOne({"_id": id}, {"$set": {"healtheventstatus": status}})
  • PostgreSQL: UPDATE health_events SET document = jsonb_set(document, '{healtheventstatus}', $1), status = $2 WHERE id = $3

12.2 Maintenance Event Operations

Upsert (Identical for Both):

event := &datastore.MaintenanceEventWithHistory{
    NodeName: "gpu-node-1",
    MaintenanceEvent: &protoEvent,
}

maintenanceStore := ds.MaintenanceEventStore()
err := maintenanceStore.UpsertMaintenanceEvent(ctx, event)

Implementation:

  • MongoDB: collection.ReplaceOne(..., options.Replace().SetUpsert(true))
  • PostgreSQL: INSERT ... ON CONFLICT (node_name) DO UPDATE ...
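On the PostgreSQL side, the upsert statement might look like the following sketch. The exact column list is an assumption based on the schema in section 10.2, and the helper is purely illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// maintenanceUpsertSQL sketches the statement behind
// UpsertMaintenanceEvent: one row per node, replaced on conflict.
const maintenanceUpsertSQL = `
INSERT INTO maintenance_events (node_name, document)
VALUES ($1, $2)
ON CONFLICT (node_name)
DO UPDATE SET document = EXCLUDED.document`

// isUpsert is a tiny sanity helper for the sketch; real code would
// simply pass the SQL to db.ExecContext with the node name and document.
func isUpsert(sql string) bool {
	return strings.Contains(sql, "ON CONFLICT")
}

func main() {
	fmt.Println(isUpsert(maintenanceUpsertSQL)) // prints: true
}
```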

13. PERFORMANCE CHARACTERISTICS

13.1 MongoDB

Strengths:

  • ✅ Real-time change notifications (<50ms latency)
  • ✅ Flexible schema (no migrations)
  • ✅ Rich query language (aggregation framework)
  • ✅ Mature, battle-tested in production
  • ✅ Horizontal scaling (sharding)

Considerations:

  • ⚠️ Requires replica set (even single node)
  • ⚠️ Higher memory usage for change streams
  • ⚠️ Document size limits (16MB BSON)
  • ⚠️ Eventual consistency in replica sets

Measured Performance (Tilt environment):

  • Insert latency: ~5-10ms
  • Query latency: ~2-5ms
  • Change stream latency: <50ms
  • Memory usage: ~500MB baseline

13.2 PostgreSQL

Strengths:

  • ✅ ACID transactions (strong consistency)
  • ✅ Mature relational features (JOINs, constraints)
  • ✅ JSONB indexing (very efficient)
  • ✅ Lower memory footprint
  • ✅ Standard SQL tooling
  • ✅ Proven scalability (vertical + read replicas)

Considerations:

  • ⚠️ Polling adds 0-5 second delay for change detection
  • ⚠️ Schema migrations for index column changes
  • ⚠️ Changelog table requires maintenance (cleanup)
  • ⚠️ Newer implementation (less production time)

Measured Performance (Tilt environment):

  • Insert latency: ~3-8ms
  • Query latency: ~1-3ms (indexed columns)
  • Change detection latency: 0-5 seconds
  • Memory usage: ~300MB baseline
  • Changelog growth: ~1KB per change

13.3 Feature Parity Matrix

| Feature              | MongoDB             | PostgreSQL              | Notes                                |
|----------------------|---------------------|-------------------------|--------------------------------------|
| Insert/Update/Delete | ✅                  | ✅                      | Identical interface                  |
| Query by ID          | ✅                  | ✅                      | Both O(1) with indexes               |
| Query by Node        | ✅                  | ✅                      | Both use indexes                     |
| Complex Queries      | ✅ Aggregation      | ✅ SQL                  | Different syntax, same capability    |
| Change Detection     | ✅ Native           | ✅ Polling              | Different latency characteristics    |
| Transactions         | ✅ Limited          | ✅ Full ACID            | PostgreSQL stronger                  |
| Schema Flexibility   | ✅ Native           | ✅ JSONB                | Both support flexible schemas        |
| High Availability    | ✅ Replica Set      | ✅ Streaming Replication | Both production-ready               |
| Horizontal Scaling   | ✅ Sharding         | ⚠️ Limited              | MongoDB advantage for massive scale  |
| Operational Maturity | ✅ Production-proven | ✅ New to NVSentinel    | MongoDB has more mileage             |

Reality Check: For NVSentinel's workload (moderate write volume, query-heavy), both perform excellently.


14. TESTING STRATEGY

14.1 Test Organization

Provider-Specific Tests:

mongodb/
├── health_store_test.go        - MongoDB health operations
├── maintenance_store_test.go   - MongoDB maintenance operations
└── watcher/
    └── watch_store_test.go     - Change stream tests

postgresql/
├── datastore_test.go           - Connection, ping, close
├── changestream_test.go        - Polling change detection
├── pipeline_filter_test.go     - Pipeline → SQL translation
└── watcher_factory_test.go     - Watcher creation

Cross-Provider Tests (CRITICAL):

datastore/
├── behavioral_contract_test.go    - Ensures identical behavior
└── interface_compliance_test.go   - Ensures interface conformance

14.2 Behavioral Contract Tests

Purpose: Guarantee MongoDB and PostgreSQL behave identically for the same operations.

Example:

func TestHealthEventStoreBehavior(t *testing.T) {
    providers := []string{"mongodb", "postgresql"}
    
    for _, provider := range providers {
        t.Run(provider, func(t *testing.T) {
            // Setup
            ds := createDataStore(t, provider)
            healthStore := ds.HealthEventStore()
            
            // Test: Insert event
            event := createTestEvent()
            err := healthStore.InsertHealthEvents(ctx, event)
            require.NoError(t, err)
            
            // Test: Query by node
            events, err := healthStore.FindHealthEventsByNode(ctx, event.HealthEvent.NodeName)
            require.NoError(t, err)
            require.Len(t, events, 1)
            
            // Test: Update status
            newStatus := "Remediated"
            err = healthStore.UpdateHealthEventStatus(ctx, events[0].ID, datastore.HealthEventStatus{
                NodeQuarantined: &newStatus,
            })
            require.NoError(t, err)
            
            // Verify update
            updated, _ := healthStore.FindHealthEventsByNode(ctx, event.HealthEvent.NodeName)
            assert.Equal(t, newStatus, *updated[0].HealthEventStatus.NodeQuarantined)
            
            // Both providers must behave identically!
        })
    }
}

What It Catches:

  • Inconsistent error handling
  • Different null/empty behavior
  • Incompatible return types
  • Missing interface methods
  • Query result differences

14.3 Test Coverage

MongoDB Tests: ~1,500 LOC

  • health_store_test.go
  • maintenance_store_test.go
  • watcher/watch_store_test.go
  • Legacy client tests

PostgreSQL Tests: ~800 LOC

  • datastore_test.go
  • changestream_test.go
  • pipeline_filter_test.go
  • watcher_factory_test.go

Shared Tests: ~715 LOC

  • behavioral_contract_test.go (342 lines)
  • interface_compliance_test.go (373 lines)

Total: 196+ tests, all passing ✅


15. MIGRATION & COMPATIBILITY

15.1 Legacy MongoDB Support

Some components still use pre-abstraction MongoDB code:

Old Style (Still Supported):

import "github.com/nvidia/nvsentinel/store-client/pkg/client"

// Legacy MongoDB client
mongoClient, err := client.NewMongoDBClient(ctx, dbConfig)
cursor, err := mongoClient.Find(ctx, filter, options)

New Style (Preferred):

import "github.com/nvidia/nvsentinel/store-client/pkg/datastore"

// Unified datastore
ds, err := datastore.NewDataStore(ctx, config)
healthStore := ds.HealthEventStore()
events, err := healthStore.FindHealthEventsByNode(ctx, nodeName)

MongoDB Adapter Bridges the Gap:

type AdaptedMongoStore struct {
    databaseClient   client.DatabaseClient  // Legacy access
    healthStore      datastore.HealthEventStore  // New interface
}

// New interface
func (a *AdaptedMongoStore) HealthEventStore() datastore.HealthEventStore {
    return a.healthStore
}

// Legacy access (for gradual migration)
func (a *AdaptedMongoStore) GetDatabaseClient() client.DatabaseClient {
    return a.databaseClient
}

This allows gradual migration - old code continues working while new code uses abstractions.

15.2 PostgreSQL Has No Legacy

PostgreSQL was built for the new abstraction from day one:

  • No pre-existing implementation to support
  • Cleaner code - single path
  • Future template for adding new databases

15.3 Migration Path

Recommended Approach:

  1. Phase 1 (Current): Both systems coexist

    • Legacy MongoDB code uses client.DatabaseClient
    • New code uses datastore.DataStore
    • Both work simultaneously
  2. Phase 2 (Future): Migrate components

    • Update components to use only datastore.DataStore
    • Remove legacy client.* imports
    • Test with both MongoDB and PostgreSQL
  3. Phase 3 (End State): Clean architecture

    • Deprecate client.DatabaseClient interface
    • Remove adapter layers
    • Unified abstraction only

Reality Check: Phase 1 is stable and working. Phase 2/3 are optional improvements.


16. OPERATIONAL CONSIDERATIONS

16.1 PostgreSQL Changelog Table Maintenance

Problem: The datastore_changelog table grows indefinitely.

Growth Rate:

  • ~1KB per change event
  • 1000 events/hour = ~1MB/hour = ~24MB/day
  • 30 days = ~720MB uncompressed
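The growth arithmetic above is easy to recompute for other workloads; a quick sketch (the ~1KB-per-change figure comes from the text, everything else is an input):

```go
package main

import "fmt"

// changelogMBPerDay estimates datastore_changelog growth from an event
// rate and an average row size in KB.
func changelogMBPerDay(eventsPerHour, kbPerEvent float64) float64 {
	return eventsPerHour * kbPerEvent * 24 / 1024
}

func main() {
	// 1000 events/hour at ~1KB each, matching the figures above
	fmt.Printf("%.1f MB/day\n", changelogMBPerDay(1000, 1)) // prints: 23.4 MB/day
}
```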

Cleanup Options:

Option 1: Periodic Deletion (Simple):

-- Delete processed entries older than 7 days
DELETE FROM datastore_changelog
WHERE processed = true
  AND changed_at < NOW() - INTERVAL '7 days';

Option 2: Partitioning (Production):

-- Create partitioned table
CREATE TABLE datastore_changelog (
    id SERIAL,
    table_name VARCHAR(255),
    operation VARCHAR(10),
    old_values JSONB,
    new_values JSONB,
    changed_at TIMESTAMP DEFAULT NOW(),
    processed BOOLEAN DEFAULT FALSE
) PARTITION BY RANGE (changed_at);

-- Create monthly partitions
CREATE TABLE datastore_changelog_2025_11
    PARTITION OF datastore_changelog
    FOR VALUES FROM ('2025-11-01') TO ('2025-12-01');

-- Drop old partitions
DROP TABLE datastore_changelog_2025_09;

Option 3: Archive and Truncate (Recommended):

#!/bin/bash
# Cron job (daily)
psql -c "COPY (SELECT * FROM datastore_changelog WHERE processed = true AND changed_at < NOW() - INTERVAL '30 days') TO '/backup/changelog_$(date +%Y%m%d).csv' CSV HEADER;"
psql -c "DELETE FROM datastore_changelog WHERE processed = true AND changed_at < NOW() - INTERVAL '30 days';"

Best Practice: Implement Option 3 with 30-day retention.

16.2 MongoDB Oplog Sizing

MongoDB uses oplog for change streams. Size appropriately:

# MongoDB values
replication:
  enabled: true
  replSetName: "rs0"
  oplogSize: 1024  # MB - adjust based on write volume

Rule of Thumb: Oplog should hold at least 24 hours of operations.
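The rule of thumb can be turned into a back-of-envelope estimate; the write rate and operation size below are assumptions, not measurements:

```go
package main

import "fmt"

// requiredOplogMB estimates the oplog size needed to retain the given
// number of hours of writes.
func requiredOplogMB(opsPerSecond, avgOpBytes, hours float64) float64 {
	return opsPerSecond * avgOpBytes * 3600 * hours / (1024 * 1024)
}

func main() {
	// e.g. 10 ops/s at ~1KB each, retained for 24 hours
	fmt.Printf("%.0f MB\n", requiredOplogMB(10, 1024, 24)) // prints: 844 MB
}
```

By this estimate the 1024 MB default above comfortably covers a moderate write rate for a day.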

16.3 Connection Pooling

MongoDB:

options:
  maxConnections: "25"
  maxIdleConnections: "10"
  connectionMaxLifetime: "1h"

PostgreSQL:

options:
  maxConnections: "25"
  maxIdleConnections: "10"
  connectionMaxLifetime: "1h"
  connectionMaxIdleTime: "30m"

Tuning: Monitor with kubectl top pods and adjust based on load.
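The options above map directly onto database/sql pool settings. A sketch of parsing them from the config block (struct and function names are hypothetical):

```go
package main

import (
	"fmt"
	"time"
)

// poolSettings holds the parsed connection-pool options from the
// `options:` block above.
type poolSettings struct {
	maxOpen, maxIdle         int
	maxLifetime, maxIdleTime time.Duration
}

func parsePoolSettings(opts map[string]string) (poolSettings, error) {
	var s poolSettings
	if _, err := fmt.Sscan(opts["maxConnections"], &s.maxOpen); err != nil {
		return s, err
	}
	if _, err := fmt.Sscan(opts["maxIdleConnections"], &s.maxIdle); err != nil {
		return s, err
	}
	var err error
	if s.maxLifetime, err = time.ParseDuration(opts["connectionMaxLifetime"]); err != nil {
		return s, err
	}
	if v, ok := opts["connectionMaxIdleTime"]; ok { // PostgreSQL-only option
		if s.maxIdleTime, err = time.ParseDuration(v); err != nil {
			return s, err
		}
	}
	// Applied to a live pool with:
	//   db.SetMaxOpenConns(s.maxOpen)
	//   db.SetMaxIdleConns(s.maxIdle)
	//   db.SetConnMaxLifetime(s.maxLifetime)
	//   db.SetConnMaxIdleTime(s.maxIdleTime)
	return s, nil
}

func main() {
	s, err := parsePoolSettings(map[string]string{
		"maxConnections":        "25",
		"maxIdleConnections":    "10",
		"connectionMaxLifetime": "1h",
		"connectionMaxIdleTime": "30m",
	})
	fmt.Println(s.maxOpen, s.maxIdle, s.maxLifetime, s.maxIdleTime, err)
}
```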

16.4 Backup & Recovery

MongoDB:

# Backup
kubectl exec mongodb-0 -- mongodump --archive=/backup/nvsentinel-$(date +%Y%m%d).archive --gzip

# Restore
kubectl exec mongodb-0 -- mongorestore --archive=/backup/nvsentinel-20251118.archive --gzip

PostgreSQL:

# Backup
kubectl exec nvsentinel-postgresql-0 -- pg_dump -U postgresql -Fc nvsentinel > nvsentinel-$(date +%Y%m%d).dump

# Restore
kubectl exec -i nvsentinel-postgresql-0 -- pg_restore -U postgresql -d nvsentinel < nvsentinel-20251118.dump

17. ADDING SUPPORT FOR NEW DATABASES

If you wanted to add MySQL, CockroachDB, etc., follow the PostgreSQL pattern (cleaner):

Step 1: Implement Provider

// pkg/datastore/providers/newdb/datastore.go
package newdb

import "github.com/nvidia/nvsentinel/store-client/pkg/datastore"

type NewDBDataStore struct {
    db                    *sql.DB
    maintenanceEventStore datastore.MaintenanceEventStore
    healthEventStore      datastore.HealthEventStore
}

func NewNewDBDataStore(ctx context.Context, config datastore.DataStoreConfig) (datastore.DataStore, error) {
    // Open connection
    db, err := sql.Open("newdb", buildConnectionString(config))
    if err != nil {
        return nil, err
    }

    // Create stores
    return &NewDBDataStore{
        db:                    db,
        healthEventStore:      NewNewDBHealthEventStore(db),
        maintenanceEventStore: NewNewDBMaintenanceEventStore(db),
    }, nil
}

// Implement DataStore interface
func (n *NewDBDataStore) HealthEventStore() datastore.HealthEventStore {
    return n.healthEventStore
}

func (n *NewDBDataStore) MaintenanceEventStore() datastore.MaintenanceEventStore {
    return n.maintenanceEventStore
}

func (n *NewDBDataStore) Ping(ctx context.Context) error {
    return n.db.PingContext(ctx)
}

func (n *NewDBDataStore) Close(ctx context.Context) error {
    return n.db.Close()
}

func (n *NewDBDataStore) Provider() datastore.DataStoreProvider {
    return datastore.ProviderNewDB
}

Step 2: Implement Store Interfaces

// pkg/datastore/providers/newdb/health_events.go
package newdb

type NewDBHealthEventStore struct {
    db *sql.DB
}

func NewNewDBHealthEventStore(db *sql.DB) *NewDBHealthEventStore {
    return &NewDBHealthEventStore{db: db}
}

func (n *NewDBHealthEventStore) InsertHealthEvents(ctx context.Context, events ...*datastore.HealthEventWithStatus) error {
    // Your database-specific implementation: serialize the event and
    // extract the indexed columns before inserting.
    for _, event := range events {
        doc, err := json.Marshal(event)
        if err != nil {
            return err
        }
        if _, err := n.db.ExecContext(ctx, `
            INSERT INTO health_events (document, node_name, created_at)
            VALUES ($1, $2, $3)
        `, doc, event.HealthEvent.NodeName, event.CreatedAt); err != nil {
            return err
        }
    }
    return nil
}

// Implement all other HealthEventStore methods...

Step 3: Register Provider

// pkg/datastore/providers/newdb/register.go
package newdb

import "github.com/nvidia/nvsentinel/store-client/pkg/datastore"

const ProviderNewDB datastore.DataStoreProvider = "newdb"

func init() {
    datastore.RegisterProvider(ProviderNewDB, NewNewDBDataStore)
}

Step 4: Add Helm Support

  1. Create values-tilt-newdb.yaml:
global:
  datastore:
    provider: "newdb"
    connection:
      host: "nvsentinel-newdb.nvsentinel.svc.cluster.local"
      port: 3306
      database: "nvsentinel"
      # ... newdb-specific fields
  2. Update configmap-datastore.yaml template
  3. Create cert-manager resources (if needed)
  4. Update component deployment templates

Step 5: Add Tests

// pkg/datastore/providers/newdb/datastore_test.go
func TestNewDBDataStore(t *testing.T) {
    // Provider-specific tests
}

// Update behavioral_contract_test.go
func TestAllProvidersBehavior(t *testing.T) {
    providers := []string{"mongodb", "postgresql", "newdb"}
    // Test all providers identically
}

Template: Use PostgreSQL implementation as reference - it's cleaner than MongoDB.


18. DEBUGGING GUIDE

18.1 MongoDB Not Connecting

Symptoms:

  • Component crashes with "connection refused"
  • Logs: "Failed to connect to MongoDB"

Diagnostic Steps:

# 1. Check ConfigMap
kubectl get cm nvsentinel-datastore-config -o yaml | grep MONGODB_URI
# Should have: ?replicaSet=rs0&tls=true

# 2. Check MongoDB is running
kubectl get pods | grep mongodb
# Should be: mongodb-0  Running

# 3. Check certificates exist
kubectl get secret mongo-app-client-cert-secret -o yaml

# 4. Check cert path in component
kubectl logs <component-pod> | grep "CA cert"
# Should show: /etc/ssl/client-certs/ca.crt

# 5. Check component args
kubectl get deployment <component> -o yaml | grep "args:" -A5
# Should have: --database-client-cert-mount-path=/etc/ssl/client-certs

# 6. Exec into pod and verify certs
kubectl exec -it <component-pod> -- ls -la /etc/ssl/client-certs/
# Should show: ca.crt, tls.crt, tls.key

Common Fixes:

  • Add --database-client-cert-mount-path to deployment args
  • Verify replicaSet=rs0 in MongoDB URI
  • Check MongoDB logs: kubectl logs mongodb-0

18.2 PostgreSQL Not Connecting

Symptoms:

  • Component crashes with "connection refused"
  • Logs: "Failed to connect to PostgreSQL"

Diagnostic Steps:

# 1. Check ConfigMap SSL settings
kubectl get cm nvsentinel-datastore-config -o yaml | grep SSL

# 2. Check PostgreSQL is running
kubectl get statefulset nvsentinel-postgresql

# 3. Check init container ran
kubectl get pod <component-pod> -o json | jq '.status.initContainerStatuses'
# Should show: "state": {"terminated": {"exitCode": 0}}

# 4. Check init container logs
kubectl logs <component-pod> -c fix-cert-permissions

# 5. Check cert permissions in pod
kubectl exec -it <component-pod> -- ls -la /etc/ssl/client-certs/
# tls.key should be: -rw------- (600)

# 6. Test PostgreSQL connection
kubectl exec -it nvsentinel-postgresql-0 -- psql -U postgresql -d nvsentinel -c "SELECT 1;"

Common Fixes:

  • Verify init container completed successfully
  • Check cert permissions (tls.key must be 600)
  • Verify SSL cert paths in ConfigMap

18.3 Change Stream Not Working

MongoDB:

# Check oplog is enabled
kubectl exec mongodb-0 -- mongo --eval "rs.status()"
# Should show replica set status

# Check component is watching
kubectl logs <component-pod> | grep "change stream"

# Test manually
kubectl exec mongodb-0 -- mongo nvsentinel --eval "db.HealthEvents.watch()"

PostgreSQL:

# Check triggers exist
kubectl exec nvsentinel-postgresql-0 -- psql -U postgresql -d nvsentinel -c "
  SELECT tgname, tgtype FROM pg_trigger WHERE tgrelid = 'health_events'::regclass;
"

# Check changelog is being populated
kubectl exec nvsentinel-postgresql-0 -- psql -U postgresql -d nvsentinel -c "
  SELECT COUNT(*) FROM datastore_changelog WHERE table_name = 'health_events';
"

# Check component polling
kubectl logs <component-pod> | grep "polling"

18.4 Certificate Issues

Debug Cert Path Resolution:

# Add verbose logging to component
kubectl set env deployment/<component> LOG_LEVEL=debug

# Watch logs for cert path resolution
kubectl logs -f <component-pod> | grep -i cert

Common Issues:

  1. Wrong path: Code looks in /etc/ssl/mongo-client instead of /etc/ssl/client-certs

    • Fix: Add --database-client-cert-mount-path=/etc/ssl/client-certs to args
  2. Permissions (PostgreSQL only): tls.key has wrong permissions

    • Fix: Verify init container ran and check logs
  3. Missing secret: Secret not created by cert-manager

    • Fix: Check cert-manager logs, verify Certificate resources

18.5 Quick Diagnostic Commands

# Database type in use
kubectl get cm nvsentinel-datastore-config -o yaml | grep DATASTORE_PROVIDER

# All certificates
kubectl get certificates -n nvsentinel

# All secrets
kubectl get secrets -n nvsentinel | grep -E "mongo|postgresql"

# Component logs (last 100 lines)
kubectl logs --tail=100 <component-pod>

# Component environment variables
kubectl exec <component-pod> -- env | sort

# Component volume mounts
kubectl describe pod <component-pod> | grep -A10 "Mounts:"

19. KEY TAKEAWAYS

For New Developers

1. Two Implementations, One Interface

Components don't (and shouldn't) care which database is used. They program against the datastore.DataStore interface. This abstraction enables database-agnostic code.

2. MongoDB = Adapted, PostgreSQL = Native

  • MongoDB wraps existing implementation with adapters (evolutionary)
  • PostgreSQL was built for the abstraction from scratch (revolutionary)
  • This explains why the PostgreSQL code reads simpler: there are no adapter layers to carry

3. Certificate Paths Are Tricky

Multiple systems can determine cert paths:

  • Command-line flags (highest priority)
  • Environment variables
  • Configuration structs
  • File existence checks
  • Default fallbacks

Best Practice: Be explicit - always pass via --database-client-cert-mount-path arg.
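The precedence order above can be encoded as a first-non-empty resolution. A minimal sketch; the real resolution logic lives in store-client's config loading, and the function name here is hypothetical:

```go
package main

import "fmt"

// resolveCertPath returns the first non-empty candidate, mirroring the
// precedence order: flag > env var > config struct > default.
func resolveCertPath(flagVal, envVal, configVal, defaultVal string) string {
	for _, v := range []string{flagVal, envVal, configVal} {
		if v != "" {
			return v
		}
	}
	return defaultVal
}

func main() {
	// The flag wins even when the env var is also set.
	fmt.Println(resolveCertPath("/etc/ssl/client-certs", "/etc/ssl/mongo-client", "", "/etc/ssl/default"))
	// prints: /etc/ssl/client-certs
}
```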

4. ConfigMap is Critical

The nvsentinel-datastore-config ConfigMap is the single source of truth for runtime configuration. Template logic must handle all database providers correctly.

5. Change Detection Differs Fundamentally

  • MongoDB: Real-time push (<50ms latency) via native change streams
  • PostgreSQL: Polling every 5s (0-5s latency) via triggers + changelog

Components shouldn't care - both present the same ChangeStreamWatcher interface.
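A minimal sketch of what that shared contract might look like. The name ChangeStreamWatcher comes from the text; the method set and the fake implementation are assumptions:

```go
package main

import (
	"context"
	"fmt"
)

// EventWithToken mirrors the change-event shape used in section 11.
type EventWithToken struct {
	OperationType string
	ResumeToken   string
}

// ChangeStreamWatcher is the provider-agnostic contract: MongoDB backs
// it with native change streams, PostgreSQL with changelog polling.
type ChangeStreamWatcher interface {
	Events() <-chan *EventWithToken
	Close(ctx context.Context) error
}

// fakeWatcher is a trivial implementation showing interface satisfaction.
type fakeWatcher struct{ ch chan *EventWithToken }

func (f *fakeWatcher) Events() <-chan *EventWithToken  { return f.ch }
func (f *fakeWatcher) Close(ctx context.Context) error { close(f.ch); return nil }

func main() {
	fw := &fakeWatcher{ch: make(chan *EventWithToken, 1)}
	var w ChangeStreamWatcher = fw // compile-time proof of conformance
	fw.ch <- &EventWithToken{OperationType: "insert", ResumeToken: "1"}
	ev := <-w.Events()
	fmt.Println(ev.OperationType, ev.ResumeToken) // prints: insert 1
	_ = w.Close(context.Background())
}
```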

6. Testing Ensures Consistency

The behavioral contract tests (behavioral_contract_test.go) are critical - they ensure both databases behave identically. Always update them when adding new operations.

7. Dual Configuration System (MongoDB Only)

MongoDB components run TWO parallel configuration systems:

  • Legacy client.DatabaseClient (for change stream watchers)
  • New datastore.DataStore (for queries)

Both must get the same cert path or connection fails!

8. PostgreSQL Requires Operational Maintenance

The datastore_changelog table grows indefinitely and needs periodic cleanup. Plan for this in production deployments.


20. QUICK REFERENCE

Switch Database (Tilt)

# MongoDB (default)
tilt up

# PostgreSQL
export USE_POSTGRESQL=1
tilt up

# Back to MongoDB
unset USE_POSTGRESQL
tilt up

Check Current Database

kubectl get cm nvsentinel-datastore-config -o yaml | grep DATASTORE_PROVIDER

kubectl get pods | grep -E "mongodb|postgresql"

Debug Certificate Issues

# Check cert paths in pod
kubectl describe pod <pod> | grep -A20 "volumeMounts"

# Check what code is looking for
kubectl logs <pod> | grep "CA cert"

# Check actual ConfigMap
kubectl get cm nvsentinel-datastore-config -o yaml | grep CERT

# Exec into pod and verify
kubectl exec -it <pod> -- ls -la /etc/ssl/client-certs/

Verify Services

# Check datastore pods
kubectl get pods | grep -E "mongo|postgres"

# Check cert-manager certificates
kubectl get certificates

# Check if ConfigMap exists
kubectl get cm nvsentinel-datastore-config

# Check component health
kubectl get pods | grep -E "fault|node|health|platform"

Common Operations

# Restart component
kubectl rollout restart deployment/<component>

# View component logs
kubectl logs -f deployment/<component>

# Shell into database
kubectl exec -it mongodb-0 -- mongo
kubectl exec -it nvsentinel-postgresql-0 -- psql -U postgresql -d nvsentinel

# Check database contents
kubectl exec mongodb-0 -- mongo nvsentinel --eval "db.HealthEvents.count()"
kubectl exec nvsentinel-postgresql-0 -- psql -U postgresql -d nvsentinel -c "SELECT COUNT(*) FROM health_events;"

21. FILES OF INTEREST

Must Understand

store-client/pkg/datastore/interfaces.go
    Core interfaces that define the abstraction layer

store-client/pkg/datastore/config.go
    Configuration loading logic and precedence rules

store-client/pkg/datastore/registry.go
    Factory pattern implementation for provider selection

distros/kubernetes/nvsentinel/templates/configmap-datastore.yaml
    CRITICAL: Unified configuration template

tilt/Tiltfile
    Development orchestration and database selection

MongoDB Deep Dive

store-client/pkg/datastore/providers/mongodb/adapter.go
    How legacy wrapping works - adapter pattern

store-client/pkg/datastore/providers/mongodb/watcher/watch_store.go
    Change streams implementation (~1,200 LOC)

store-client/pkg/client/mongodb_client.go
    Legacy MongoDB client (pre-abstraction)

distros/kubernetes/nvsentinel/templates/certmanager-mongodb.yaml
    MongoDB certificate hierarchy

PostgreSQL Deep Dive

store-client/pkg/datastore/providers/postgresql/datastore.go
    Main PostgreSQL datastore implementation

store-client/pkg/datastore/providers/postgresql/changestream.go
    Polling-based change detection mechanism

store-client/pkg/datastore/providers/postgresql/pipeline_filter.go
    MongoDB aggregation pipeline → PostgreSQL SQL translator

distros/kubernetes/nvsentinel/templates/certmanager-postgresql.yaml
    PostgreSQL certificate hierarchy

Testing

store-client/pkg/datastore/behavioral_contract_test.go
    Cross-provider consistency tests (CRITICAL)

store-client/pkg/datastore/interface_compliance_test.go
    Interface verification tests

store-client/pkg/datastore/providers/mongodb/health_store_test.go
    MongoDB-specific health store tests

store-client/pkg/datastore/providers/postgresql/changestream_test.go
    PostgreSQL polling mechanism tests

Helm & Deployment

distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
    MongoDB Tilt configuration

distros/kubernetes/nvsentinel/values-tilt-postgresql.yaml
    PostgreSQL Tilt configuration

distros/kubernetes/nvsentinel/charts/node-drainer/templates/deployment.yaml
    Example component deployment pattern

distros/kubernetes/nvsentinel/charts/postgresql/
    Vendored Bitnami PostgreSQL chart (65+ files)

distros/kubernetes/nvsentinel/charts/mongodb-store/charts/mongodb/
    Vendored Bitnami MongoDB chart

Configuration & Flags

commons/pkg/flags/database_flags.go
    Certificate path resolution logic (5-level precedence)

store-client/pkg/datastore/providers/mongodb/register.go
    MongoDB provider auto-registration

store-client/pkg/datastore/providers/postgresql/register.go
    PostgreSQL provider auto-registration

APPENDIX A: ARCHITECTURAL DIAGRAMS

MongoDB Architecture Flow

┌─────────────────────────────────────────────────────┐
│              Component (e.g., node-drainer)          │
│                                                       │
│  ┌──────────────────┐    ┌────────────────────┐    │
│  │  Legacy Factory  │    │   New DataStore    │    │
│  │  (for watcher)   │    │   (for queries)    │    │
│  └────────┬─────────┘    └─────────┬──────────┘    │
│           │                         │                │
│           │  DUAL CONFIGURATION     │                │
│           │  SYSTEM (both needed)   │                │
└───────────┼─────────────────────────┼────────────────┘
            │                         │
            ▼                         ▼
   ┌────────────────┐        ┌──────────────────┐
   │ MongoDB Client │        │ AdaptedMongoStore│
   │  (legacy)      │        │   (new wrapper)  │
   └────────┬───────┘        └────────┬─────────┘
            │                         │
            └─────────┬───────────────┘
                      ▼
            ┌──────────────────┐
            │  MongoDB Driver  │
            │  (mongo.Client)  │
            └─────────┬────────┘
                      ▼
            ┌──────────────────┐
            │  MongoDB Server  │
            │   (Replica Set)  │
            │                  │
            │  Change Streams  │ ← Real-time push (<50ms)
            └──────────────────┘

PostgreSQL Architecture Flow

┌─────────────────────────────────────┐
│      Component (e.g., node-drainer)  │
│                                       │
│      ┌─────────────────┐             │
│      │  New DataStore  │             │
│      │   (direct use)  │             │
│      └────────┬────────┘             │
│               │                       │
│       SINGLE CONFIGURATION            │
│       SYSTEM (cleaner)                │
└───────────────┼───────────────────────┘
                │
                ▼
       ┌─────────────────┐
       │ PostgreSQLStore │
       │   (native impl) │
       └────────┬────────┘
                │
                ▼
       ┌─────────────────┐
       │   *sql.DB       │
       │  (Go stdlib)    │
       └────────┬────────┘
                │
                ▼
       ┌─────────────────┐
       │  lib/pq Driver  │
       └────────┬────────┘
                │
                ▼
       ┌─────────────────────────────┐
       │   PostgreSQL Server         │
       │                             │
       │  Triggers → Changelog Table │ ← Polling every 5s
       │  (change detection)         │
       └─────────────────────────────┘
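The "Triggers → Changelog Table" box can be made concrete with a DDL sketch. This is a hypothetical illustration of the mechanism, not the repository's actual schema: only the datastore_changelog table name appears in this guide, so every other identifier below is an assumption.

```sql
-- Hypothetical sketch of the trigger-based change detection the diagram shows.
CREATE TABLE IF NOT EXISTS datastore_changelog (
    id         BIGSERIAL PRIMARY KEY,
    table_name TEXT        NOT NULL,
    operation  TEXT        NOT NULL,   -- INSERT / UPDATE / DELETE
    row_data   JSONB,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION log_change() RETURNS trigger AS $$
BEGIN
    INSERT INTO datastore_changelog (table_name, operation, row_data)
    VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(NEW));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER health_events_changelog
    AFTER INSERT OR UPDATE ON health_events
    FOR EACH ROW EXECUTE FUNCTION log_change();
```

Watchers then SELECT rows with id greater than the last one they processed, once per DATASTORE_POLL_INTERVAL, which is what produces the 0-5s detection latency.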

APPENDIX B: ENVIRONMENT VARIABLE REFERENCE

Complete MongoDB Environment

# Generic namespace
DATASTORE_PROVIDER=mongodb
DATASTORE_HOST=mongodb-headless.nvsentinel.svc.cluster.local
DATASTORE_PORT=27017
DATASTORE_DATABASE=HealthEventsDatabase

# MongoDB-specific namespace (legacy compatibility)
MONGODB_URI=mongodb://mongodb-headless.nvsentinel.svc.cluster.local:27017/?replicaSet=rs0&tls=true
MONGODB_DATABASE_NAME=HealthEventsDatabase
MONGODB_COLLECTION_NAME=HealthEvents
MONGODB_TOKEN_COLLECTION_NAME=ResumeTokens
MONGODB_MAINTENANCE_EVENT_COLLECTION_NAME=MaintenanceEvents

# Certificate path (from deployment env, NOT ConfigMap)
MONGODB_CLIENT_CERT_MOUNT_PATH=/etc/ssl/client-certs

# Timeouts
MONGODB_PING_TIMEOUT_TOTAL_SECONDS=30
MONGODB_PING_INTERVAL_SECONDS=5
CA_CERT_MOUNT_TIMEOUT_TOTAL_SECONDS=360
CA_CERT_READ_INTERVAL_SECONDS=5

# Connection pool
MONGODB_MAX_CONNECTIONS=25
MONGODB_MAX_IDLE_CONNECTIONS=10
MONGODB_CONNECTION_MAX_LIFETIME=1h

Complete PostgreSQL Environment

# Generic namespace (complete)
DATASTORE_PROVIDER=postgresql
DATASTORE_HOST=nvsentinel-postgresql.nvsentinel.svc.cluster.local
DATASTORE_PORT=5432
DATASTORE_DATABASE=nvsentinel
DATASTORE_USERNAME=postgresql
DATASTORE_SSLMODE=require
DATASTORE_SSLCERT=/etc/ssl/client-certs/tls.crt
DATASTORE_SSLKEY=/etc/ssl/client-certs/tls.key
DATASTORE_SSLROOTCERT=/etc/ssl/client-certs/ca.crt

# PostgreSQL-specific (for direct access)
POSTGRESQL_CLIENT_CERT_MOUNT_PATH=/etc/ssl/client-certs

# Connection pool
DATASTORE_MAX_CONNECTIONS=25
DATASTORE_MAX_IDLE_CONNECTIONS=10
DATASTORE_CONNECTION_MAX_LIFETIME=1h
DATASTORE_CONNECTION_MAX_IDLE_TIME=30m

# Polling configuration
DATASTORE_POLL_INTERVAL=5s

CONCLUSION

This consolidated guide represents the authoritative reference for MongoDB and PostgreSQL support in NVSentinel. It combines:

  • Technical accuracy from multiple analysis iterations
  • Practical guidance for developers
  • Operational considerations for production deployments
  • Debugging strategies for troubleshooting
  • Testing approaches for ensuring consistency

Implementation Status

MongoDB: ✅ Production-ready, mature, fully tested
PostgreSQL: ✅ Feature-complete, tested, ready for deployment
Unified Abstraction: ✅ Stable, proven, well-tested

Success Factors

  1. Abstraction layer enables database-agnostic components
  2. Behavioral tests ensure consistent behavior across providers
  3. Certificate management via cert-manager for both databases
  4. Explicit configuration prevents runtime surprises
  5. Comprehensive testing validates both providers

Future Considerations

  • Phase out MongoDB dual-config system for cleaner architecture
  • Implement changelog cleanup for PostgreSQL production deployments
  • Add new databases following PostgreSQL pattern (cleaner template)
  • Standardize cert path handling across all providers

END OF CONSOLIDATED MONGODB & POSTGRESQL IMPLEMENTATION GUIDE

This document consolidates content from multiple source documents, incorporating the best technical details, practical guidance, and operational insights from all sources.


DOCUMENT SOURCES

This consolidated guide was synthesized from:

  1. claude-mongodb-vs-postgresql-analysis.md (v2.0 revised)
  2. mongodb-postgresql-implementation-guide.md (colleague's original)
  3. mongodb-vs-postgresql-comprehensive-analysis.md (detailed session analysis)
  4. mongodb-postgresql-guide-critical-additions.md (supplementary content)
  5. claude-mongodb-vs-postgresql-analysis-critique.md (feedback incorporated)
  6. mongodb-postgresql-implementation-guide-critique.md (analysis feedback)
  7. Supporting documentation and analysis

Consolidated by: Claude (Anthropic)
Date: November 18, 2025
Version: 3.0 (Unified Guide)
