Consolidated & Authoritative Reference Last Updated: November 18, 2025 Branch: add-support-for-postgres Version: 3.0 (Consolidated from multiple sources)
- Understanding the Abstraction Layer Design
- Overview
- Architectural Design
- MongoDB Implementation
- PostgreSQL Implementation
- Certificate & TLS Management
- Configuration System
- Helm Chart Design
- Tilt Development Workflow
- Deployment Patterns
- Database Schema Design
- Change Detection Mechanisms
- Query & Operation Patterns
- Performance Characteristics
- Testing Strategy
- Migration & Compatibility
- Operational Considerations
- Adding Support for New Databases
- Debugging Guide
- Key Takeaways
- Quick Reference
- Files of Interest
Context: NVSentinel evolved from MongoDB-only to supporting multiple databases.
MongoDB's Journey (Evolutionary):
- MongoDB was implemented before the abstraction layer existed
- Original code used direct MongoDB driver calls throughout
- When abstraction layer was added, needed to maintain backward compatibility
- Result: Adapter pattern wraps existing implementation
PostgreSQL's Journey (Revolutionary):
- PostgreSQL was added after abstraction layer was designed
- Built from ground up to implement the abstraction interfaces
- No legacy code to maintain
- Result: Clean, direct implementation
Core Philosophy: Components should never know which database they're using.
// This code works identically with MongoDB or PostgreSQL
dsConfig, _ := datastore.LoadDatastoreConfig()
ds, _ := datastore.NewDataStore(ctx, *dsConfig)
healthStore := ds.HealthEventStore()
healthStore.UpdateHealthEventStatus(ctx, eventID, status)Factory Pattern:
DATASTORE_PROVIDER env var
↓
LoadDatastoreConfig()
↓
NewDataStore(config)
↓
Factory selects:
- MongoDB → NewMongoDBDataStore()
- PostgreSQL → NewPostgreSQLStore()
↓
Returns DataStore interface
Critical Insight: This abstraction enables database-agnostic component code.
NVSentinel supports two database backends through a unified abstraction layer:
- MongoDB - Original implementation, production-proven
- PostgreSQL - Newly added alternative backend
Both databases provide identical functionality to components through the DataStore interface defined in store-client/pkg/datastore/interfaces.go.
- MongoDB: ✅ Production-ready, mature codebase (~3,500 LOC)
- PostgreSQL: ✅ Feature-complete, newly added (~2,900 LOC)
- Unified Interface: ✅ Both implement identical
DataStoreinterface - Test Coverage: ✅ Both have comprehensive test suites (196+ tests total)
- Behavioral Parity: ✅ Verified via cross-provider contract tests
All core NVSentinel components use the unified datastore:
- platform-connectors
- fault-quarantine
- fault-remediation
- node-drainer
- health-events-analyzer
- csp-health-monitor
- janitor
Core Abstraction (store-client/pkg/datastore/interfaces.go):
type DataStore interface {
MaintenanceEventStore() MaintenanceEventStore
HealthEventStore() HealthEventStore
Ping(ctx context.Context) error
Close(ctx context.Context) error
Provider() DataStoreProvider
}
type HealthEventStore interface {
InsertHealthEvents(ctx context.Context, events ...*HealthEventWithStatus) error
UpdateHealthEventStatus(ctx context.Context, id string, status HealthEventStatus) error
FindHealthEventsByNode(ctx context.Context, nodeName string) ([]*HealthEventWithStatus, error)
DeleteHealthEvent(ctx context.Context, id string) error
// ... more methods
}
type MaintenanceEventStore interface {
UpsertMaintenanceEvent(ctx context.Context, event *MaintenanceEventWithHistory) error
GetMaintenanceEvent(ctx context.Context, nodeName string) (*MaintenanceEventWithHistory, error)
DeleteMaintenanceEvent(ctx context.Context, nodeName string) error
ListMaintenanceEvents(ctx context.Context) ([]*MaintenanceEventWithHistory, error)
}Key Principle: Components program against interfaces, not implementations.
Standard Pattern (Recommended):
package main
import (
"github.com/nvidia/nvsentinel/store-client/pkg/datastore"
)
func main() {
// 1. Load configuration (reads from environment)
config, err := datastore.LoadDatastoreConfig()
if err != nil {
log.Fatalf("Failed to load datastore config: %v", err)
}
// 2. Create datastore (factory automatically selects provider)
ds, err := datastore.NewDataStore(ctx, *config)
if err != nil {
log.Fatalf("Failed to create datastore: %v", err)
}
defer ds.Close(ctx)
// 3. Get specialized stores
maintenanceStore := ds.MaintenanceEventStore()
healthStore := ds.HealthEventStore()
// 4. Use them - same code for both databases!
err = healthStore.UpdateHealthEventStatus(ctx, eventID, newStatus)
err = maintenanceStore.UpsertMaintenanceEvent(ctx, event)
}The factory in NewDataStore() automatically creates the correct provider based on config.Provider.
Auto-Registration Pattern:
// MongoDB registration (mongodb/register.go)
func init() {
datastore.RegisterProvider(datastore.ProviderMongoDB, NewMongoDBDataStore)
}
// PostgreSQL registration (postgresql/register.go)
func init() {
datastore.RegisterProvider(datastore.ProviderPostgreSQL, NewPostgreSQLStore)
}Both providers self-register when their packages are imported, making the factory pattern work transparently.
| Aspect | MongoDB | PostgreSQL |
|---|---|---|
| Design Approach | Adapter pattern (wraps legacy) | Native implementation |
| Client Type | *mongo.Client (official driver) |
*sql.DB (stdlib) |
| Legacy Compat | Full backward compatibility layer | None (built for abstraction) |
| Change Detection | Native MongoDB change streams | Polling with database triggers |
| Document Storage | BSON native | JSONB with indexed columns |
| Query Language | Aggregation pipelines | SQL with JSONB operators |
| Connection Format | MongoDB URI | PostgreSQL connection string |
| Code Complexity | HIGH (dual systems) | MEDIUM (single path) |
"Adaptation" - MongoDB wraps existing implementation to provide new abstraction interface while maintaining backward compatibility.
store-client/pkg/datastore/providers/mongodb/
├── adapter.go (271 lines) - Wraps legacy MongoDB client
├── builders.go (152 lines) - Query builder factories
├── health_store.go (250 lines) - Health event operations
├── maintenance_store.go (238 lines) - Maintenance event operations
├── register.go (75 lines) - Provider auto-registration
├── watcher_factory.go (73 lines) - Change stream watcher factory
└── watcher/ - Change stream implementation
├── watch_store.go (~1,200 LOC) - Core MongoDB watcher logic
├── unmarshaller.go - Event unmarshalling
└── ...
Total: ~3,500 lines of code
CRITICAL UNDERSTANDING: MongoDB components run TWO parallel configuration systems.
System A - Legacy (Pre-Abstraction):
// Used for change stream watchers
import "github.com/nvidia/nvsentinel/store-client/pkg/client"
databaseConfig := client.NewDatabaseConfigFromEnv()
clientFactory := factory.NewClientFactory(databaseConfig)
watcher, _ := clientFactory.CreateChangeStreamWatcher(ctx, client, "name", pipeline)Configuration Source:
MONGODB_URIenvironment variableMONGODB_DATABASE_NAMEMONGODB_COLLECTION_NAME- Certificate path from command-line flags (
--database-client-cert-mount-path)
System B - New (Post-Abstraction):
// Used for queries and maintenance operations
import "github.com/nvidia/nvsentinel/store-client/pkg/datastore"
dsConfig, _ := datastore.LoadDatastoreConfig()
ds, _ := datastore.NewDataStore(ctx, *dsConfig)
healthStore := ds.HealthEventStore()Configuration Source:
DATASTORE_PROVIDER=mongodbDATASTORE_HOST,DATASTORE_PORT,DATASTORE_DATABASE- Certificate path from config.Connection.TLSConfig or environment variable
The Problem: Both systems load configuration independently and can resolve different certificate paths!
The Solution:
- Always pass
--database-client-cert-mount-path=/etc/ssl/client-certsto components - This ensures both systems use the same cert path
1. Change Streams (Native Feature): MongoDB provides real-time change notifications:
// MongoDB driver watches the collection
stream = collection.Watch(ctx, pipeline)
for stream.Next(ctx) {
// Get change event immediately (push-based)
event = stream.Current
// Process: operationType, fullDocument, resumeToken
}Characteristics:
- Latency: <50ms (near real-time)
- Method: Push-based notifications from MongoDB server
- Resume: Binary BSON resume tokens for fault tolerance
- Efficient: Only changed documents sent over wire
2. Replica Set Requirement: MongoDB requires replica set configuration even for single-node deployments:
mongodb://mongodb-headless:27017/?replicaSet=rs0&tls=true
^^^^^^^^^^^^ Required for change streams!
Without replicaSet parameter, change streams will not work.
3. Document Storage:
- Native BSON documents
- Schema-less (no migrations needed)
- Flexible nested structures
- 16MB document size limit
4. Authentication:
- Method:
MONGODB-X509(certificate-based) - No passwords - TLS client certificates only
- Certificate DN must match MongoDB user
type AdaptedMongoStore struct {
// Legacy MongoDB clients (pre-abstraction)
databaseClient client.DatabaseClient
collectionClient client.CollectionClient
factory *factory.ClientFactory
// New interface implementations
maintenanceStore datastore.MaintenanceEventStore // Implements new interface
healthStore datastore.HealthEventStore // Implements new interface
}
// Implements DataStore interface
func (a *AdaptedMongoStore) HealthEventStore() datastore.HealthEventStore {
return a.healthStore
}
// Legacy access (for components not yet migrated)
func (a *AdaptedMongoStore) GetDatabaseClient() client.DatabaseClient {
return a.databaseClient // Bridge to old code
}Design Pattern: Adapter wraps existing functionality to present new interface.
Component Startup
↓
LoadDatastoreConfig()
├─ Reads: DATASTORE_PROVIDER=mongodb
├─ Reads: DATASTORE_HOST, DATASTORE_PORT
├─ Builds: DataStoreConfig struct
↓
NewDataStore(ctx, config)
├─ Factory lookup: ProviderMongoDB → NewMongoDBDataStore
↓
NewMongoDBDataStore(ctx, config)
├─ Creates legacy adapter for compatibility:
│ └─ ConvertDataStoreConfigToLegacyWithCertPath()
├─ Initializes mongo.Client with connection string
├─ Creates AdaptedMongoStore
│ ├─ healthStore = NewMongoHealthEventStore(...)
│ └─ maintenanceStore = NewMongoMaintenanceEventStore(...)
↓
Returns: DataStore interface
"Native" - PostgreSQL was built specifically for the new abstraction layer with no legacy baggage.
store-client/pkg/datastore/providers/postgresql/
├── datastore.go (435 lines) - Main PostgreSQL datastore
├── changestream.go (337 lines) - Polling-based change detection
├── health_events.go (571 lines) - Health event operations
├── maintenance_events.go (369 lines) - Maintenance event operations
├── database_client.go (418 lines) - Legacy client adapter (for compat)
├── register.go (34 lines) - Provider auto-registration
├── watcher_factory.go (89 lines) - Change stream watcher factory
├── pipeline_filter.go (318 lines) - MongoDB pipeline → SQL translator
└── *_test.go (464 lines) - Comprehensive tests
Total: ~2,900 lines of code (smaller than MongoDB due to no legacy)
1. Polling-Based Change Detection:
Since PostgreSQL doesn't have native change streams, we implement them:
Database Setup (migrations):
-- Changelog table to track all changes
CREATE TABLE datastore_changelog (
id SERIAL PRIMARY KEY,
table_name VARCHAR(255) NOT NULL,
operation VARCHAR(10) NOT NULL, -- INSERT, UPDATE, DELETE
old_values JSONB,
new_values JSONB,
changed_at TIMESTAMP DEFAULT NOW(),
processed BOOLEAN DEFAULT FALSE
);
-- Trigger function to capture changes
CREATE OR REPLACE FUNCTION health_events_change_trigger() RETURNS TRIGGER AS $$
BEGIN
IF (TG_OP = 'INSERT') THEN
INSERT INTO datastore_changelog (table_name, operation, new_values)
VALUES (TG_TABLE_NAME, TG_OP, row_to_json(NEW)::jsonb);
ELSIF (TG_OP = 'UPDATE') THEN
INSERT INTO datastore_changelog (table_name, operation, old_values, new_values)
VALUES (TG_TABLE_NAME, TG_OP, row_to_json(OLD)::jsonb, row_to_json(NEW)::jsonb);
ELSIF (TG_OP = 'DELETE') THEN
INSERT INTO datastore_changelog (table_name, operation, old_values)
VALUES (TG_TABLE_NAME, TG_OP, row_to_json(OLD)::jsonb);
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
-- Attach trigger to health_events table
CREATE TRIGGER health_events_change
AFTER INSERT OR UPDATE OR DELETE ON health_events
FOR EACH ROW EXECUTE FUNCTION health_events_change_trigger();Application Polling (Go code):
ticker := time.NewTicker(5 * time.Second) // Configurable poll interval
for range ticker.C {
rows, _ := db.Query(`
SELECT id, operation, new_values, changed_at
FROM datastore_changelog
WHERE id > $1 AND table_name = 'health_events'
ORDER BY id
`, lastProcessedID)
for rows.Next() {
// Convert to MongoDB-style change event format
event := convertToChangeStreamEvent(row)
sendToChannel(event)
lastProcessedID = row.ID
}
}Characteristics:
- Latency: 0-5 seconds (depending on when change occurs relative to poll)
- Method: Application polls database
- Resume: Integer IDs (simpler than MongoDB's BSON tokens)
- Tradeoff: Slight delay vs MongoDB's real-time, but simpler to debug
2. JSONB Storage with Indexed Columns:
Hybrid approach combining SQL performance with document flexibility:
CREATE TABLE health_events (
id SERIAL PRIMARY KEY,
document JSONB NOT NULL, -- Full document (flexible schema)
-- Extracted columns for fast querying
node_name VARCHAR(255), -- From document->'healthevent'->>'nodename'
status VARCHAR(50), -- From document->'healtheventstatus'->>'nodequarantined'
is_fatal BOOLEAN, -- From document->'healthevent'->>'isfatal'
agent VARCHAR(100),
created_at BIGINT, -- Indexed timestamp
-- Indexes on extracted columns for performance
INDEX idx_health_events_node_name (node_name),
INDEX idx_health_events_status (status),
INDEX idx_health_events_created_at (created_at)
);Benefits:
- Fast queries: Indexed columns for common filters
- Flexibility: JSONB allows schema evolution without migrations
- Best of both: SQL performance + document database flexibility
Example Query:
-- Uses index on node_name, then JSONB operators for nested fields
SELECT * FROM health_events
WHERE node_name = 'gpu-node-1' -- Fast: uses index
AND document->'healthevent'->>'checkname' = 'GpuXidError' -- JSONB query
ORDER BY created_at DESC -- Fast: uses index
LIMIT 10;3. Pipeline Translation:
Converts MongoDB aggregation pipelines to PostgreSQL SQL:
// MongoDB pipeline from component
pipeline := []interface{}{
{"$match": {"healthevent.nodename": "node-1"}},
{"$sort": {"createdAt": -1}},
{"$limit": 10},
}
// Translator converts to SQL
sql := translatePipeline(pipeline)
// Result:
// SELECT * FROM health_events
// WHERE document->'healthevent'->>'nodename' = 'node-1'
// ORDER BY created_at DESC
// LIMIT 10Supported Pipeline Stages:
- ✅
$match→WHEREclauses - ✅
$sort→ORDER BY - ✅
$limit→LIMIT - ✅
$skip→OFFSET - ❌
$group→ Not supported (use SQL directly) - ❌
$lookup→ Not supported (use JOINs) - ❌
$unwind→ Not supported
4. Authentication:
- Method: PostgreSQL SSL certificate verification
- Requires: Client cert + CA cert + server cert
pg_hba.conf:hostssl all all 0.0.0.0/0 cert- Certificate CN must match PostgreSQL user
type PostgreSQLDataStore struct {
db *sql.DB // Direct database connection
// Stores implement interfaces natively
maintenanceEventStore datastore.MaintenanceEventStore
healthEventStore datastore.HealthEventStore
}
// Implements DataStore interface
func (p *PostgreSQLDataStore) HealthEventStore() datastore.HealthEventStore {
return p.healthEventStore
}
// No legacy compatibility needed!Design Pattern: Direct implementation of abstraction interfaces from the start.
Component Startup
↓
LoadDatastoreConfig()
├─ Reads: DATASTORE_PROVIDER=postgresql
├─ Reads: DATASTORE_HOST, DATASTORE_PORT, DATASTORE_USERNAME
├─ Reads: DATASTORE_SSLCERT, DATASTORE_SSLKEY, DATASTORE_SSLROOTCERT
├─ Stores cert paths directly in config.Connection
↓
NewDataStore(ctx, config)
├─ Factory lookup: ProviderPostgreSQL → NewPostgreSQLStore
↓
NewPostgreSQLStore(ctx, config)
├─ Builds connection string with explicit cert paths
├─ Opens sql.DB connection
├─ Creates PostgreSQLDataStore
│ ├─ healthStore = NewPostgreSQLHealthEventStore(db)
│ └─ maintenanceStore = NewPostgreSQLMaintenanceEventStore(db)
↓
Returns: DataStore interface
Cleaner Flow: No legacy conversions, single configuration path.
Both databases use cert-manager for certificate lifecycle:
selfsigned-ca-issuer (Self-signed root issuer)
└─> {database}-root-ca (CA certificate, 10 year lifetime)
└─> {database}-ca-issuer (CA issuer resource)
├─> {database}-server-cert (Server certificate, 1 year, auto-renew)
└─> {database}-client-cert (Client certificate, 1 year, auto-renew)
Certificates automatically renew 15 days before expiration.
Created Certificates (templates/certmanager-mongodb.yaml):
- mongo-root-ca - Root CA (self-signed, 10 years)
- mongo-ca-issuer - Issuer using root CA
- mongo-server-cert-0 - Server cert for mongodb-0 pod
- mongo-app-client-cert - Client cert for applications
- mongo-dgxcops-client-cert - Client cert for operations
Mounting in Components:
volumes:
- name: mongo-app-client-cert
secret:
secretName: mongo-app-client-cert-secret
items:
- key: tls.crt
path: tls.crt
- key: tls.key
path: tls.key
- key: ca.crt
path: ca.crt
volumeMounts:
- name: mongo-app-client-cert
mountPath: /etc/ssl/client-certs # Actual mount location
readOnly: trueNo init container needed - MongoDB driver accepts certificates as-is.
Environment Variables:
MONGODB_CLIENT_CERT_MOUNT_PATH=/etc/ssl/client-certsCreated Certificates (templates/certmanager-postgresql.yaml):
- postgresql-root-ca - Root CA (self-signed, 10 years)
- selfsigned-ca-issuer - Self-signed issuer for CA
- postgresql-ca-issuer - Issuer using root CA
- postgresql-server-cert - Server cert for PostgreSQL pod
- postgresql-client-cert - Client cert for applications
Mounting with Init Container (Two-Stage):
initContainers:
- name: fix-cert-permissions
image: bitnamilegacy/os-shell
command:
- sh
- -c
- |
cp /etc/ssl/client-certs-original/* /etc/ssl/client-certs-fixed/
chmod 644 /etc/ssl/client-certs-fixed/tls.crt
chmod 644 /etc/ssl/client-certs-fixed/ca.crt
chmod 600 /etc/ssl/client-certs-fixed/tls.key # CRITICAL: PostgreSQL requires 0600
volumeMounts:
- name: postgresql-client-cert-original
mountPath: /etc/ssl/client-certs-original
readOnly: true
- name: client-certs-fixed
mountPath: /etc/ssl/client-certs-fixed
containers:
- name: component
volumeMounts:
- name: client-certs-fixed # Use fixed certs, not original
mountPath: /etc/ssl/client-certs
readOnly: true
volumes:
- name: postgresql-client-cert-original
secret:
secretName: postgresql-client-cert
- name: client-certs-fixed
emptyDir: {} # Mutable volume for fixed permissionsWhy the Init Container?
- Kubernetes secrets are mounted as root:root with 0644/0444 permissions
- PostgreSQL
libpqrequires client key to be 0600 (owner-only read/write) - Secrets are immutable, can't chmod directly
- Solution: Init container copies to emptyDir and fixes permissions
Environment Variables:
DATASTORE_SSLCERT=/etc/ssl/client-certs/tls.crt
DATASTORE_SSLKEY=/etc/ssl/client-certs/tls.key
DATASTORE_SSLROOTCERT=/etc/ssl/client-certs/ca.crt
POSTGRESQL_CLIENT_CERT_MOUNT_PATH=/etc/ssl/client-certsCRITICAL: MongoDB has a 5-level precedence system for determining certificate paths.
Precedence Order (Highest to Lowest):
1. CLI Flag (Explicit)
--database-client-cert-mount-path=/etc/ssl/client-certs
↓ (if not provided)
2. Environment Variable
MONGODB_CLIENT_CERT_MOUNT_PATH=/etc/ssl/client-certs
↓ (if not set)
3. Config Struct Field
config.Connection.TLSConfig.CertPath
↓ (if empty)
4. File Existence Check
if os.Stat("/etc/ssl/client-certs/ca.crt") → use /etc/ssl/client-certs
↓ (if not found)
5. Legacy Default Fallback
/etc/ssl/mongo-client ← DANGEROUS: Causes issues!
Code Implementation (commons/pkg/flags/database_flags.go):
type DatabaseCertConfig struct {
DatabaseClientCertMountPath string // New flag, defaults to "/etc/ssl/database-client"
LegacyMongoCertPath string // Old default: "/etc/ssl/mongo-client"
ResolvedCertPath string // Final resolved path
}
func (c *DatabaseCertConfig) ResolveCertPath() string {
// If flag still has default value, fall back to legacy
if c.DatabaseClientCertMountPath == "/etc/ssl/database-client" {
c.ResolvedCertPath = c.LegacyMongoCertPath // /etc/ssl/mongo-client
return c.ResolvedCertPath
}
// Otherwise use explicitly set value
c.ResolvedCertPath = c.DatabaseClientCertMountPath
return c.ResolvedCertPath
}The Problem:
- Actual mount:
/etc/ssl/client-certs - Default flag:
/etc/ssl/database-client - Fallback:
/etc/ssl/mongo-client - Result: Code looks in wrong place!
The Solution (Applied to All Components):
args:
- "--database-client-cert-mount-path=/etc/ssl/client-certs" # Explicit!This makes the CLI flag (precedence level #1) override all fallbacks.
Best Practice: Always use CLI flags for cert paths - highest precedence, most explicit.
PostgreSQL Avoids This: Cert paths come directly from environment variables set in ConfigMap, no complex resolution needed.
[Due to length, continuing in next message...]
Two Namespaces:
- Generic (
DATASTORE_*): Provider-agnostic configuration - Legacy (
MONGODB_*,POSTGRESQL_*): Provider-specific for backward compatibility
Dual Namespace (for backward compatibility):
# New generic namespace
DATASTORE_PROVIDER=mongodb
DATASTORE_HOST=mongodb-headless.nvsentinel.svc.cluster.local
DATASTORE_PORT=27017
DATASTORE_DATABASE=HealthEventsDatabase
# Legacy MongoDB-specific namespace
MONGODB_URI=mongodb://mongodb-headless.nvsentinel.svc.cluster.local:27017/?replicaSet=rs0&tls=true
MONGODB_DATABASE_NAME=HealthEventsDatabase
MONGODB_COLLECTION_NAME=HealthEvents
MONGODB_TOKEN_COLLECTION_NAME=ResumeTokens
MONGODB_MAINTENANCE_EVENT_COLLECTION_NAME=MaintenanceEvents
MONGODB_CLIENT_CERT_MOUNT_PATH=/etc/ssl/client-certs # From deployment env, not ConfigMap
# Timeout configuration
MONGODB_PING_TIMEOUT_TOTAL_SECONDS=30
MONGODB_PING_INTERVAL_SECONDS=5
CA_CERT_MOUNT_TIMEOUT_TOTAL_SECONDS=360
CA_CERT_READ_INTERVAL_SECONDS=5Why Both? Legacy components still use MONGODB_* variables directly.
Single Namespace (cleaner):
# Everything in DATASTORE_* namespace
DATASTORE_PROVIDER=postgresql
DATASTORE_HOST=nvsentinel-postgresql.nvsentinel.svc.cluster.local
DATASTORE_PORT=5432
DATASTORE_DATABASE=nvsentinel
DATASTORE_USERNAME=postgresql
DATASTORE_SSLMODE=require
DATASTORE_SSLCERT=/etc/ssl/client-certs/tls.crt
DATASTORE_SSLKEY=/etc/ssl/client-certs/tls.key
DATASTORE_SSLROOTCERT=/etc/ssl/client-certs/ca.crt
# No legacy variables needed
# No timeout configuration (uses defaults)Observation: PostgreSQL is cleaner - everything in unified namespace.
Priority Order:
func LoadDatastoreConfig() (*DataStoreConfig, error) {
// 1. DATASTORE_PROVIDER env var (highest priority)
if provider := os.Getenv("DATASTORE_PROVIDER"); provider != "" {
return loadConfigFromEnv(provider)
}
// 2. DATASTORE_YAML env var (YAML string)
if yamlConfig := os.Getenv("DATASTORE_YAML"); yamlConfig != "" {
return loadConfigFromYAMLString(yamlConfig)
}
// 3. DATASTORE_YAML_PATH env var (YAML file)
if yamlPath := os.Getenv("DATASTORE_YAML_PATH"); yamlPath != "" {
return loadConfigFromYAMLFile(yamlPath)
}
// 4. Default to MongoDB with legacy env vars
return loadDefaultConfig()
}Best Practice: Use DATASTORE_PROVIDER with individual env vars.
Component Deployment Pattern:
envFrom:
- configMapRef:
name: {{ if .Values.global.datastore }}{{ .Release.Name }}-datastore-config{{ else }}mongodb-config{{ end }}Logic:
- If
global.datastoreis set → usenvsentinel-datastore-config(unified) - If not set → use
mongodb-config(legacy)
Critical: This ensures backward compatibility with existing MongoDB deployments.
distros/kubernetes/nvsentinel/
├── Chart.yaml
├── Chart.lock
├── values.yaml # Base values (no database selected)
├── values-tilt.yaml # Common Tilt settings
├── values-tilt-mongodb.yaml # MongoDB Tilt configuration
├── values-tilt-postgresql.yaml # PostgreSQL Tilt configuration
├── values-postgresql.yaml # Production PostgreSQL
├── templates/
│ ├── configmap-datastore.yaml # Unified datastore configuration (CRITICAL)
│ ├── certmanager-mongodb.yaml # MongoDB TLS certificates
│ └── certmanager-postgresql.yaml # PostgreSQL TLS certificates
└── charts/
├── postgresql/ # Vendored Bitnami PostgreSQL chart
├── mongodb-store/ # MongoDB subchart
│ ├── charts/mongodb/ # Vendored Bitnami MongoDB chart
│ └── templates/
│ └── configmap.yaml # Legacy mongodb-config ConfigMap
├── node-drainer/
│ └── templates/deployment.yaml
├── fault-quarantine/
│ └── templates/deployment.yaml
└── ... (other components)
File: templates/configmap-datastore.yaml
Name: nvsentinel-datastore-config
Condition: {{- if .Values.global.datastore }}
This ConfigMap is the bridge between Helm values and component runtime.
MongoDB Example:
# values-tilt-mongodb.yaml
global:
datastore:
provider: "mongodb"
connection:
host: "mongodb-headless.nvsentinel.svc.cluster.local"
port: 27017
database: "HealthEventsDatabase"
collection: "HealthEvents"
tokenCollection: "ResumeTokens"
extraParams: # CRITICAL for MongoDB
replicaSet: "rs0"
tls: "true"
# Generated ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
name: nvsentinel-datastore-config
data:
DATASTORE_PROVIDER: "mongodb"
DATASTORE_HOST: "mongodb-headless.nvsentinel.svc.cluster.local"
DATASTORE_PORT: "27017"
DATASTORE_DATABASE: "HealthEventsDatabase"
MONGODB_URI: "mongodb://mongodb-headless.nvsentinel.svc.cluster.local:27017/?replicaSet=rs0&tls=true"
MONGODB_DATABASE_NAME: "HealthEventsDatabase"
MONGODB_COLLECTION_NAME: "HealthEvents"
MONGODB_TOKEN_COLLECTION_NAME: "ResumeTokens"PostgreSQL Example:
# values-tilt-postgresql.yaml
global:
datastore:
provider: "postgresql"
connection:
host: "nvsentinel-postgresql.nvsentinel.svc.cluster.local"
port: 5432
database: "nvsentinel"
username: "postgresql"
sslmode: "require"
sslcert: "/etc/ssl/client-certs/tls.crt"
sslkey: "/etc/ssl/client-certs/tls.key"
sslrootcert: "/etc/ssl/client-certs/ca.crt"
# Generated ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
name: nvsentinel-datastore-config
data:
DATASTORE_PROVIDER: "postgresql"
DATASTORE_HOST: "nvsentinel-postgresql.nvsentinel.svc.cluster.local"
DATASTORE_PORT: "5432"
DATASTORE_DATABASE: "nvsentinel"
DATASTORE_USERNAME: "postgresql"
DATASTORE_SSLMODE: "require"
DATASTORE_SSLCERT: "/etc/ssl/client-certs/tls.crt"
DATASTORE_SSLKEY: "/etc/ssl/client-certs/tls.key"
DATASTORE_SSLROOTCERT: "/etc/ssl/client-certs/ca.crt"Template Logic (simplified):
{{- if .Values.global.datastore }}
apiVersion: v1
kind: ConfigMap
metadata:
name: {{ .Release.Name }}-datastore-config
data:
DATASTORE_PROVIDER: {{ .Values.global.datastore.provider | quote }}
DATASTORE_HOST: {{ .Values.global.datastore.connection.host | quote }}
{{- if eq .Values.global.datastore.provider "mongodb" }}
{{- $params := include "buildMongoDBQueryParams" .Values.global.datastore.connection.extraParams }}
MONGODB_URI: "mongodb://{{ .Values.global.datastore.connection.host }}:{{ .Values.global.datastore.connection.port }}/?{{ $params }}"
{{- else if eq .Values.global.datastore.provider "postgresql" }}
DATASTORE_SSLCERT: {{ .Values.global.datastore.connection.sslcert | quote }}
DATASTORE_SSLKEY: {{ .Values.global.datastore.connection.sslkey | quote }}
{{- end }}
{{- end }}PostgreSQL Chart:
- Source: Bitnami PostgreSQL chart 15.5.38
- Location:
charts/postgresql/ - Size: 65+ files, 1,780 lines in values.yaml
- Modified: All images changed to
bitnamilegacy/*
MongoDB Chart:
- Source: Bitnami MongoDB chart
- Location:
charts/mongodb-store/charts/mongodb/ - Modified: Images use
bitnamilegacy/*
Reason for Vendoring: Ensures availability and allows customization.
Use MongoDB (Default):
cd tilt
tilt up
# Loads: values-tilt-mongodb.yaml
# Deploys: MongoDB StatefulSet
# Creates: mongo-* certificatesUse PostgreSQL:
cd tilt
export USE_POSTGRESQL=1
tilt up
# Loads: values-tilt-postgresql.yaml
# Deploys: PostgreSQL StatefulSet
# Creates: postgresql-* certificatesFile: tilt/Tiltfile
use_postgresql = os.getenv('USE_POSTGRESQL', '0') == '1'
# Values file selection
values_files = ['../distros/kubernetes/nvsentinel/values-tilt.yaml']
if use_postgresql:
print("Using PostgreSQL as datastore (USE_POSTGRESQL=1)")
values_files.append('../distros/kubernetes/nvsentinel/values-tilt-postgresql.yaml')
else:
print("Using MongoDB as datastore (default)")
values_files.append('../distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml')
# Resource naming
datastore_resource = 'nvsentinel-postgresql' if use_postgresql else 'mongodb'
# Certificate resources
cert_manager_objects = ['janitor-webhook-cert:certificate']
if use_postgresql:
cert_manager_objects.extend([
'postgresql-root-ca:certificate',
'postgresql-ca-issuer:issuer',
'selfsigned-ca-issuer:issuer',
'postgresql-server-cert:certificate',
'postgresql-client-cert:certificate'
])
else:
cert_manager_objects.extend([
'mongo-root-ca:certificate',
'mongo-ca-issuer:issuer',
'selfsigned-ca-issuer:issuer',
'mongo-server-cert-0:certificate',
'mongo-app-client-cert:certificate',
'mongo-dgxcops-client-cert:certificate'
])
# Component dependencies
k8s_resource('platform-connectors', resource_deps=[datastore_resource])
k8s_resource('fault-quarantine', resource_deps=[datastore_resource])
k8s_resource('fault-remediation', resource_deps=[datastore_resource])
k8s_resource('node-drainer', resource_deps=[datastore_resource])
k8s_resource('health-events-analyzer', resource_deps=[datastore_resource])Key Pattern: Components wait for datastore to be ready before starting.
Every component follows this pattern (example: node-drainer/templates/deployment.yaml):
spec:
template:
spec:
# PostgreSQL ONLY: Init container to fix cert permissions
{{- if eq .Values.global.datastore.provider "postgresql" }}
initContainers:
- name: fix-cert-permissions
image: bitnamilegacy/os-shell
command:
- sh
- -c
- |
cp /etc/ssl/client-certs-original/* /etc/ssl/client-certs-fixed/
chmod 600 /etc/ssl/client-certs-fixed/tls.key
volumeMounts:
- name: {{ .Values.global.datastore.provider }}-client-cert-original
mountPath: /etc/ssl/client-certs-original
- name: client-certs-fixed
mountPath: /etc/ssl/client-certs-fixed
{{- end }}
containers:
- name: {{ .Chart.Name }}
image: {{ .Values.image.repository }}:{{ .Values.image.tag }}
# CRITICAL: Pass cert path via command-line arg
args:
- "--metrics-port=2112"
- "--config-path=/etc/config/config.toml"
- "--database-client-cert-mount-path={{ .Values.clientCertMountPath }}"
env:
- name: LOG_LEVEL
value: {{ .Values.logLevel | quote }}
{{- if eq .Values.global.datastore.provider "postgresql" }}
- name: POSTGRESQL_CLIENT_CERT_MOUNT_PATH
value: {{ .Values.clientCertMountPath }}
{{- else }}
- name: MONGODB_CLIENT_CERT_MOUNT_PATH
value: {{ .Values.clientCertMountPath }}
{{- end }}
# Load all datastore config
envFrom:
- configMapRef:
name: {{ if .Values.global.datastore }}{{ .Release.Name }}-datastore-config{{ else }}mongodb-config{{ end }}
volumeMounts:
- name: config
mountPath: /etc/config
{{- if eq .Values.global.datastore.provider "postgresql" }}
- name: client-certs-fixed
mountPath: {{ .Values.clientCertMountPath }}
{{- else }}
- name: mongo-app-client-cert
mountPath: {{ .Values.clientCertMountPath }}
{{- end }}
volumes:
- name: config
configMap:
name: {{ .Release.Name }}-{{ .Chart.Name }}-config
{{- if eq .Values.global.datastore.provider "postgresql" }}
- name: postgresql-client-cert-original
secret:
secretName: postgresql-client-cert
- name: client-certs-fixed
emptyDir: {}
{{- else }}
- name: mongo-app-client-cert
secret:
secretName: mongo-app-client-cert-secret
optional: true
{{- end }}Pattern Observations:
- Init container: PostgreSQL only (cert permissions)
- Command-line arg: Required for cert path resolution
- Environment variable: Provider-specific name
- ConfigMap selection: Conditional
- Volume mounting: Two-stage for PostgreSQL, direct for MongoDB
Collections:
HealthEvents- Health event documentsMaintenanceEvents- Maintenance event documentsResumeTokens- Change stream resume positions
Schema-less (Flexible):
{
"_id": ObjectId("..."),
"createdAt": ISODate("2025-11-18T..."),
"healthevent": {
"nodename": "gpu-node-1",
"checkname": "GpuXidError",
"componentclass": "GPU",
"isfatal": true,
"message": "XID error detected on GPU 0"
},
"healtheventstatus": {
"nodequarantined": "Quarantined",
"userpodsevictionstatus": {
"status": "Completed",
"message": "All pods evicted"
},
"maintenanceeventcreationstatus": {
"status": "Created",
"maintenanceeventnodename": "gpu-node-1"
}
}
}Querying (Aggregation Pipelines):
db.HealthEvents.aggregate([
{$match: {"healthevent.nodename": "gpu-node-1"}},
{$sort: {createdAt: -1}},
{$limit: 10}
])Tables:
health_events- Health events with JSONB + indexed columnsmaintenance_events- Maintenance events with JSONB + indexed columnsdatastore_changelog- Change tracking (for change stream emulation)resume_tokens- Resume positions for watchers
Hybrid Schema (Best of Both):
CREATE TABLE health_events (
id SERIAL PRIMARY KEY,
document JSONB NOT NULL, -- Full document (flexible)
-- Extracted columns for performance
node_name VARCHAR(255), -- From document->'healthevent'->>'nodename'
status VARCHAR(50), -- From document->'healtheventstatus'->>'nodequarantined'
is_fatal BOOLEAN, -- From document->'healthevent'->>'isfatal'
agent VARCHAR(100), -- From document->'healthevent'->>'agent'
created_at BIGINT, -- Timestamp for ordering
-- Indexes
INDEX idx_health_events_node_name (node_name),
INDEX idx_health_events_status (status),
INDEX idx_health_events_created_at (created_at),
INDEX idx_health_events_is_fatal (is_fatal)
);
CREATE TABLE datastore_changelog (
id SERIAL PRIMARY KEY,
table_name VARCHAR(255) NOT NULL,
operation VARCHAR(10) NOT NULL, -- INSERT, UPDATE, DELETE
old_values JSONB,
new_values JSONB,
changed_at TIMESTAMP DEFAULT NOW(),
processed BOOLEAN DEFAULT FALSE,
INDEX idx_changelog_table_id (table_name, id),
INDEX idx_changelog_processed (processed)
);Querying (SQL with JSONB):
SELECT * FROM health_events
WHERE node_name = 'gpu-node-1' -- Fast: index
AND is_fatal = true -- Fast: index
AND document->'healthevent'->>'checkname' = 'GpuXidError' -- JSONB
ORDER BY created_at DESC -- Fast: index
LIMIT 10;Benefits:
- Fast queries: Indexed columns
- Flexibility: JSONB for schema evolution
- SQL power: JOINs, transactions, constraints
How It Works:
// Create change stream
pipeline := bson.A{
bson.M{"$match": bson.M{"operationType": bson.M{"$in": bson.A{"insert", "update", "delete"}}}},
}
stream, err := collection.Watch(ctx, pipeline)
if err != nil {
return err
}
// Listen for changes (blocking)
for stream.Next(ctx) {
var event bson.M
if err := stream.Decode(&event); err != nil {
log.Error(err)
continue
}
// event contains:
// - operationType: "insert" | "update" | "delete"
// - fullDocument: complete document after change
// - documentKey: {_id: ...}
// - _id: {_data: "..."} ← resume token
processEvent(event)
// Save resume token for fault tolerance
resumeToken := stream.ResumeToken()
saveResumeToken(resumeToken)
}Characteristics:
- Latency: <50ms (near real-time push)
- Method: Server pushes changes to client
- Resume: Binary BSON tokens
- Efficient: Only changed documents sent
- Requires: Replica set configuration
Database Setup:
-- Trigger function
CREATE OR REPLACE FUNCTION health_events_change_trigger() RETURNS TRIGGER AS $$
BEGIN
IF (TG_OP = 'INSERT') THEN
INSERT INTO datastore_changelog (table_name, operation, new_values)
VALUES (TG_TABLE_NAME, TG_OP, row_to_json(NEW)::jsonb);
ELSIF (TG_OP = 'UPDATE') THEN
INSERT INTO datastore_changelog (table_name, operation, old_values, new_values)
VALUES (TG_TABLE_NAME, TG_OP, row_to_json(OLD)::jsonb, row_to_json(NEW)::jsonb);
ELSIF (TG_OP = 'DELETE') THEN
INSERT INTO datastore_changelog (table_name, operation, old_values)
VALUES (TG_TABLE_NAME, TG_OP, row_to_json(OLD)::jsonb);
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
-- Attach trigger
CREATE TRIGGER health_events_change
AFTER INSERT OR UPDATE OR DELETE ON health_events
FOR EACH ROW EXECUTE FUNCTION health_events_change_trigger();Application Polling:
ticker := time.NewTicker(5 * time.Second) // Configurable
lastProcessedID := loadLastProcessedID()
for range ticker.C {
rows, err := db.Query(`
SELECT id, operation, new_values, changed_at
FROM datastore_changelog
WHERE id > $1 AND table_name = 'health_events'
ORDER BY id
LIMIT 100
`, lastProcessedID)
for rows.Next() {
var id int64
var operation string
var newValues json.RawMessage
var changedAt time.Time
rows.Scan(&id, &operation, &newValues, &changedAt)
// Convert to MongoDB-style change event
event := &EventWithToken{
OperationType: operation,
FullDocument: parseDocument(newValues),
ResumeToken: strconv.FormatInt(id, 10),
}
sendToChannel(event)
lastProcessedID = id
}
}Characteristics:
- Latency: 0-5 seconds (poll interval)
- Method: Application polls database
- Resume: Integer IDs (simpler)
- Tradeoff: Slight delay for simplicity
- No Special Requirements: Works on any PostgreSQL
Insert (Identical for Both):
event := &datastore.HealthEventWithStatus{
CreatedAt: time.Now(),
HealthEvent: &protoEvent,
HealthEventStatus: datastore.HealthEventStatus{
NodeQuarantined: &quarantinedStatus,
},
}
healthStore := ds.HealthEventStore()
err := healthStore.InsertHealthEvents(ctx, event)Implementation:
- MongoDB:
collection.InsertOne(document) - PostgreSQL:
INSERT INTO health_events (document, node_name, ...) VALUES (...)
Query by Node (Identical for Both):
events, err := healthStore.FindHealthEventsByNode(ctx, "gpu-node-1")Implementation:
- MongoDB:
collection.Find({"healthevent.nodename": "gpu-node-1"}) - PostgreSQL:
SELECT * FROM health_events WHERE node_name = 'gpu-node-1'
Update Status (Identical for Both):
status := datastore.HealthEventStatus{
NodeQuarantined: &newStatus,
}
err := healthStore.UpdateHealthEventStatus(ctx, eventID, status)Implementation:
- MongoDB:
collection.UpdateOne({"_id": id}, {"$set": {"healtheventstatus": status}}) - PostgreSQL:
UPDATE health_events SET document = jsonb_set(document, '{healtheventstatus}', $1), status = $2 WHERE id = $3
Upsert (Identical for Both):
event := &datastore.MaintenanceEventWithHistory{
NodeName: "gpu-node-1",
MaintenanceEvent: &protoEvent,
}
maintenanceStore := ds.MaintenanceEventStore()
err := maintenanceStore.UpsertMaintenanceEvent(ctx, event)Implementation:
- MongoDB:
collection.ReplaceOne(..., options.Replace().SetUpsert(true)) - PostgreSQL:
INSERT ... ON CONFLICT (node_name) DO UPDATE ...
Strengths:
- ✅ Real-time change notifications (<50ms latency)
- ✅ Flexible schema (no migrations)
- ✅ Rich query language (aggregation framework)
- ✅ Mature, battle-tested in production
- ✅ Horizontal scaling (sharding)
Considerations:
⚠️ Requires replica set (even single node)⚠️ Higher memory usage for change streams⚠️ Document size limits (16MB BSON)⚠️ Eventual consistency in replica sets
Measured Performance (Tilt environment):
- Insert latency: ~5-10ms
- Query latency: ~2-5ms
- Change stream latency: <50ms
- Memory usage: ~500MB baseline
Strengths:
- ✅ ACID transactions (strong consistency)
- ✅ Mature relational features (JOINs, constraints)
- ✅ JSONB indexing (very efficient)
- ✅ Lower memory footprint
- ✅ Standard SQL tooling
- ✅ Proven scalability (vertical + read replicas)
Considerations:
⚠️ Polling adds 0-5 second delay for change detection⚠️ Schema migrations for index column changes⚠️ Changelog table requires maintenance (cleanup)⚠️ Newer implementation (less production time)
Measured Performance (Tilt environment):
- Insert latency: ~3-8ms
- Query latency: ~1-3ms (indexed columns)
- Change detection latency: 0-5 seconds
- Memory usage: ~300MB baseline
- Changelog growth: ~1KB per change
| Feature | MongoDB | PostgreSQL | Notes |
|---|---|---|---|
| Insert/Update/Delete | ✅ | ✅ | Identical interface |
| Query by ID | ✅ | ✅ | Both O(1) with indexes |
| Query by Node | ✅ | ✅ | Both use indexes |
| Complex Queries | ✅ Aggregation | ✅ SQL | Different syntax, same capability |
| Change Detection | ✅ Native | ✅ Polling | Different latency characteristics |
| Transactions | ✅ Limited | ✅ Full ACID | PostgreSQL stronger |
| Schema Flexibility | ✅ Native | ✅ JSONB | Both support flexible schemas |
| High Availability | ✅ Replica Set | ✅ Streaming Replication | Both production-ready |
| Horizontal Scaling | ✅ Sharding | MongoDB advantage for massive scale | |
| Operational Maturity | ✅ Production-proven | ✅ New to NVSentinel | MongoDB has more mileage |
Reality Check: For NVSentinel's workload (moderate write volume, query-heavy), both perform excellently.
Provider-Specific Tests:
mongodb/
├── health_store_test.go - MongoDB health operations
├── maintenance_store_test.go - MongoDB maintenance operations
└── watcher/
└── watch_store_test.go - Change stream tests
postgresql/
├── datastore_test.go - Connection, ping, close
├── changestream_test.go - Polling change detection
├── pipeline_filter_test.go - Pipeline → SQL translation
└── watcher_factory_test.go - Watcher creation
Cross-Provider Tests (CRITICAL):
datastore/
├── behavioral_contract_test.go - Ensures identical behavior
└── interface_compliance_test.go - Ensures interface conformance
Purpose: Guarantee MongoDB and PostgreSQL behave identically for the same operations.
Example:
func TestHealthEventStoreBehavior(t *testing.T) {
providers := []string{"mongodb", "postgresql"}
for _, provider := range providers {
t.Run(provider, func(t *testing.T) {
// Setup
ds := createDataStore(t, provider)
healthStore := ds.HealthEventStore()
// Test: Insert event
event := createTestEvent()
err := healthStore.InsertHealthEvents(ctx, event)
require.NoError(t, err)
// Test: Query by node
events, err := healthStore.FindHealthEventsByNode(ctx, event.HealthEvent.NodeName)
require.NoError(t, err)
require.Len(t, events, 1)
// Test: Update status
newStatus := "Remediated"
err = healthStore.UpdateHealthEventStatus(ctx, events[0].ID, datastore.HealthEventStatus{
NodeQuarantined: &newStatus,
})
require.NoError(t, err)
// Verify update
updated, _ := healthStore.FindHealthEventsByNode(ctx, event.HealthEvent.NodeName)
assert.Equal(t, newStatus, *updated[0].HealthEventStatus.NodeQuarantined)
// Both providers must behave identically!
})
}
}What It Catches:
- Inconsistent error handling
- Different null/empty behavior
- Incompatible return types
- Missing interface methods
- Query result differences
MongoDB Tests: ~1,500 LOC
- health_store_test.go
- maintenance_store_test.go
- watcher/watch_store_test.go
- Legacy client tests
PostgreSQL Tests: ~800 LOC
- datastore_test.go
- changestream_test.go
- pipeline_filter_test.go
- watcher_factory_test.go
Shared Tests: ~715 LOC
- behavioral_contract_test.go (342 lines)
- interface_compliance_test.go (373 lines)
Total: 196+ tests, all passing ✅
Some components still use pre-abstraction MongoDB code:
Old Style (Still Supported):
import "github.com/nvidia/nvsentinel/store-client/pkg/client"
// Legacy MongoDB client
mongoClient, err := client.NewMongoDBClient(ctx, dbConfig)
cursor, err := mongoClient.Find(ctx, filter, options)New Style (Preferred):
import "github.com/nvidia/nvsentinel/store-client/pkg/datastore"
// Unified datastore
ds, err := datastore.NewDataStore(ctx, config)
healthStore := ds.HealthEventStore()
events, err := healthStore.FindHealthEventsByNode(ctx, nodeName)MongoDB Adapter Bridges the Gap:
type AdaptedMongoStore struct {
databaseClient client.DatabaseClient // Legacy access
healthStore datastore.HealthEventStore // New interface
}
// New interface
func (a *AdaptedMongoStore) HealthEventStore() datastore.HealthEventStore {
return a.healthStore
}
// Legacy access (for gradual migration)
func (a *AdaptedMongoStore) GetDatabaseClient() client.DatabaseClient {
return a.databaseClient
}This allows gradual migration - old code continues working while new code uses abstractions.
PostgreSQL was built for the new abstraction from day one:
- No pre-existing implementation to support
- Cleaner code - single path
- Future template for adding new databases
Recommended Approach:
-
Phase 1 (Current): Both systems coexist
- Legacy MongoDB code uses
client.DatabaseClient - New code uses
datastore.DataStore - Both work simultaneously
- Legacy MongoDB code uses
-
Phase 2 (Future): Migrate components
- Update components to use only
datastore.DataStore - Remove legacy
client.*imports - Test with both MongoDB and PostgreSQL
- Update components to use only
-
Phase 3 (End State): Clean architecture
- Deprecate
client.DatabaseClientinterface - Remove adapter layers
- Unified abstraction only
- Deprecate
Reality Check: Phase 1 is stable and working. Phase 2/3 are optional improvements.
Problem: The datastore_changelog table grows indefinitely.
Growth Rate:
- ~1KB per change event
- 1000 events/hour = ~1MB/hour = ~24MB/day
- 30 days = ~720MB uncompressed
Cleanup Options:
Option 1: Periodic Deletion (Simple):
-- Delete processed entries older than 7 days
DELETE FROM datastore_changelog
WHERE processed = true
AND changed_at < NOW() - INTERVAL '7 days';Option 2: Partitioning (Production):
-- Create partitioned table
CREATE TABLE datastore_changelog (
id SERIAL,
table_name VARCHAR(255),
operation VARCHAR(10),
old_values JSONB,
new_values JSONB,
changed_at TIMESTAMP DEFAULT NOW(),
processed BOOLEAN DEFAULT FALSE
) PARTITION BY RANGE (changed_at);
-- Create monthly partitions
CREATE TABLE datastore_changelog_2025_11
PARTITION OF datastore_changelog
FOR VALUES FROM ('2025-11-01') TO ('2025-12-01');
-- Drop old partitions
DROP TABLE datastore_changelog_2025_09;Option 3: Archive and Truncate (Recommended):
# Cron job (daily)
#!/bin/bash
psql -c "COPY (SELECT * FROM datastore_changelog WHERE processed = true AND changed_at < NOW() - INTERVAL '30 days') TO '/backup/changelog_$(date +%Y%m%d).csv' CSV HEADER;"
psql -c "DELETE FROM datastore_changelog WHERE processed = true AND changed_at < NOW() - INTERVAL '30 days';"Best Practice: Implement Option 3 with 30-day retention.
MongoDB uses oplog for change streams. Size appropriately:
# MongoDB values
replication:
enabled: true
replSetName: "rs0"
oplogSize: 1024 # MB - adjust based on write volumeRule of Thumb: Oplog should hold at least 24 hours of operations.
MongoDB:
options:
maxConnections: "25"
maxIdleConnections: "10"
connectionMaxLifetime: "1h"PostgreSQL:
options:
maxConnections: "25"
maxIdleConnections: "10"
connectionMaxLifetime: "1h"
connectionMaxIdleTime: "30m"Tuning: Monitor with kubectl top pods and adjust based on load.
MongoDB:
# Backup
kubectl exec mongodb-0 -- mongodump --archive=/backup/nvsentinel-$(date +%Y%m%d).archive --gzip
# Restore
kubectl exec mongodb-0 -- mongorestore --archive=/backup/nvsentinel-20251118.archive --gzipPostgreSQL:
# Backup
kubectl exec nvsentinel-postgresql-0 -- pg_dump -Fc nvsentinel > nvsentinel-$(date +%Y%m%d).dump
# Restore
kubectl exec -i nvsentinel-postgresql-0 -- pg_restore -d nvsentinel < nvsentinel-20251118.dumpIf you wanted to add MySQL, CockroachDB, etc., follow the PostgreSQL pattern (cleaner):
// pkg/datastore/providers/newdb/datastore.go
package newdb
import "github.com/nvidia/nvsentinel/store-client/pkg/datastore"
type NewDBDataStore struct {
db *sql.DB
maintenanceEventStore datastore.MaintenanceEventStore
healthEventStore datastore.HealthEventStore
}
func NewNewDBDataStore(ctx context.Context, config datastore.DataStoreConfig) (datastore.DataStore, error) {
// Open connection
db, err := sql.Open("newdb", buildConnectionString(config))
if err != nil {
return nil, err
}
// Create stores
return &NewDBDataStore{
db: db,
healthEventStore: NewNewDBHealthEventStore(db),
maintenanceEventStore: NewNewDBMaintenanceEventStore(db),
}, nil
}
// Implement DataStore interface
func (n *NewDBDataStore) HealthEventStore() datastore.HealthEventStore {
return n.healthEventStore
}
func (n *NewDBDataStore) MaintenanceEventStore() datastore.MaintenanceEventStore {
return n.maintenanceEventStore
}
func (n *NewDBDataStore) Ping(ctx context.Context) error {
return n.db.PingContext(ctx)
}
func (n *NewDBDataStore) Close(ctx context.Context) error {
return n.db.Close()
}
func (n *NewDBDataStore) Provider() datastore.DataStoreProvider {
return datastore.ProviderNewDB
}// pkg/datastore/providers/newdb/health_events.go
package newdb
type NewDBHealthEventStore struct {
db *sql.DB
}
func NewNewDBHealthEventStore(db *sql.DB) *NewDBHealthEventStore {
return &NewDBHealthEventStore{db: db}
}
func (n *NewDBHealthEventStore) InsertHealthEvents(ctx context.Context, events ...*datastore.HealthEventWithStatus) error {
// Your database-specific implementation
for _, event := range events {
_, err := n.db.ExecContext(ctx, `
INSERT INTO health_events (document, node_name, status, created_at)
VALUES ($1, $2, $3, $4)
`, event.Document, event.NodeName, event.Status, event.CreatedAt)
if err != nil {
return err
}
}
return nil
}
// Implement all other HealthEventStore methods...// pkg/datastore/providers/newdb/register.go
package newdb
import "github.com/nvidia/nvsentinel/store-client/pkg/datastore"
const ProviderNewDB datastore.DataStoreProvider = "newdb"
func init() {
datastore.RegisterProvider(ProviderNewDB, NewNewDBDataStore)
}- Create
values-tilt-newdb.yaml:
global:
datastore:
provider: "newdb"
connection:
host: "nvsentinel-newdb.nvsentinel.svc.cluster.local"
port: 3306
database: "nvsentinel"
# ... newdb-specific fields- Update
configmap-datastore.yamltemplate - Create cert-manager resources (if needed)
- Update component deployment templates
// pkg/datastore/providers/newdb/datastore_test.go
func TestNewDBDataStore(t *testing.T) {
// Provider-specific tests
}
// Update behavioral_contract_test.go
func TestAllProvidersBehavior(t *testing.T) {
providers := []string{"mongodb", "postgresql", "newdb"}
// Test all providers identically
}Template: Use PostgreSQL implementation as reference - it's cleaner than MongoDB.
Symptoms:
- Component crashes with "connection refused"
- Logs: "Failed to connect to MongoDB"
Diagnostic Steps:
# 1. Check ConfigMap
kubectl get cm nvsentinel-datastore-config -o yaml | grep MONGODB_URI
# Should have: ?replicaSet=rs0&tls=true
# 2. Check MongoDB is running
kubectl get pods | grep mongodb
# Should be: mongodb-0 Running
# 3. Check certificates exist
kubectl get secret mongo-app-client-cert-secret -o yaml
# 4. Check cert path in component
kubectl logs <component-pod> | grep "CA cert"
# Should show: /etc/ssl/client-certs/ca.crt
# 5. Check component args
kubectl get deployment <component> -o yaml | grep "args:" -A5
# Should have: --database-client-cert-mount-path=/etc/ssl/client-certs
# 6. Exec into pod and verify certs
kubectl exec -it <component-pod> -- ls -la /etc/ssl/client-certs/
# Should show: ca.crt, tls.crt, tls.keyCommon Fixes:
- Add
--database-client-cert-mount-pathto deployment args - Verify
replicaSet=rs0in MongoDB URI - Check MongoDB logs:
kubectl logs mongodb-0
Symptoms:
- Component crashes with "connection refused"
- Logs: "Failed to connect to PostgreSQL"
Diagnostic Steps:
# 1. Check ConfigMap SSL settings
kubectl get cm nvsentinel-datastore-config -o yaml | grep SSL
# 2. Check PostgreSQL is running
kubectl get statefulset nvsentinel-postgresql
# 3. Check init container ran
kubectl get pod <component-pod> -o json | jq '.status.initContainerStatuses'
# Should show: "state": {"terminated": {"exitCode": 0}}
# 4. Check init container logs
kubectl logs <component-pod> -c fix-cert-permissions
# 5. Check cert permissions in pod
kubectl exec -it <component-pod> -- ls -la /etc/ssl/client-certs/
# tls.key should be: -rw------- (600)
# 6. Test PostgreSQL connection
kubectl exec -it nvsentinel-postgresql-0 -- psql -U postgresql -d nvsentinel -c "SELECT 1;"Common Fixes:
- Verify init container completed successfully
- Check cert permissions (tls.key must be 600)
- Verify SSL cert paths in ConfigMap
MongoDB:
# Check oplog is enabled
kubectl exec mongodb-0 -- mongo --eval "rs.status()"
# Should show replica set status
# Check component is watching
kubectl logs <component-pod> | grep "change stream"
# Test manually
kubectl exec mongodb-0 -- mongo nvsentinel --eval "db.HealthEvents.watch()"PostgreSQL:
# Check triggers exist
kubectl exec nvsentinel-postgresql-0 -- psql -U postgresql -d nvsentinel -c "
SELECT tgname, tgtype FROM pg_trigger WHERE tgrelid = 'health_events'::regclass;
"
# Check changelog is being populated
kubectl exec nvsentinel-postgresql-0 -- psql -U postgresql -d nvsentinel -c "
SELECT COUNT(*) FROM datastore_changelog WHERE table_name = 'health_events';
"
# Check component polling
kubectl logs <component-pod> | grep "polling"Debug Cert Path Resolution:
# Add verbose logging to component
kubectl set env deployment/<component> LOG_LEVEL=debug
# Watch logs for cert path resolution
kubectl logs -f <component-pod> | grep -i certCommon Issues:
-
Wrong path: Code looks in
/etc/ssl/mongo-clientinstead of/etc/ssl/client-certs- Fix: Add
--database-client-cert-mount-path=/etc/ssl/client-certsto args
- Fix: Add
-
Permissions (PostgreSQL only):
tls.keyhas wrong permissions- Fix: Verify init container ran and check logs
-
Missing secret: Secret not created by cert-manager
- Fix: Check cert-manager logs, verify Certificate resources
# Database type in use
kubectl get cm nvsentinel-datastore-config -o yaml | grep DATASTORE_PROVIDER
# All certificates
kubectl get certificates -n nvsentinel
# All secrets
kubectl get secrets -n nvsentinel | grep -E "mongo|postgresql"
# Component logs (last 100 lines)
kubectl logs --tail=100 <component-pod>
# Component environment variables
kubectl exec <component-pod> -- env | sort
# Component volume mounts
kubectl describe pod <component-pod> | grep -A10 "Mounts:"1. Two Implementations, One Interface
Components don't (and shouldn't) care which database is used. They program against datastore.DataStore interface. This abstraction enables database-agnostic code.
2. MongoDB = Adapted, PostgreSQL = Native
- MongoDB wraps existing implementation with adapters (evolutionary)
- PostgreSQL was built for the abstraction from scratch (revolutionary)
- This explains why PostgreSQL code is often simpler but isn't necessarily less complex
3. Certificate Paths Are Tricky Multiple systems can determine cert paths:
- Command-line flags (highest priority)
- Environment variables
- Configuration structs
- File existence checks
- Default fallbacks
Best Practice: Be explicit - always pass via --database-client-cert-mount-path arg.
4. ConfigMap is Critical
The nvsentinel-datastore-config ConfigMap is the single source of truth for runtime configuration. Template logic must handle all database providers correctly.
5. Change Detection Differs Fundamentally
- MongoDB: Real-time push (<50ms latency) via native change streams
- PostgreSQL: Polling every 5s (0-5s latency) via triggers + changelog
Components shouldn't care - both present the same ChangeStreamWatcher interface.
6. Testing Ensures Consistency
The behavioral contract tests (behavioral_contract_test.go) are critical - they ensure both databases behave identically. Always update them when adding new operations.
7. Dual Configuration System (MongoDB Only) MongoDB components run TWO parallel configuration systems:
- Legacy
client.DatabaseClient(for change stream watchers) - New
datastore.DataStore(for queries)
Both must get the same cert path or connection fails!
8. PostgreSQL Requires Operational Maintenance
The datastore_changelog table grows indefinitely and needs periodic cleanup. Plan for this in production deployments.
# MongoDB (default)
tilt up
# PostgreSQL
export USE_POSTGRESQL=1
tilt up
# Back to MongoDB
unset USE_POSTGRESQL
tilt upkubectl get cm nvsentinel-datastore-config -o yaml | grep DATASTORE_PROVIDER
kubectl get pods | grep -E "mongodb|postgresql"# Check cert paths in pod
kubectl describe pod <pod> | grep -A20 "volumeMounts"
# Check what code is looking for
kubectl logs <pod> | grep "CA cert"
# Check actual ConfigMap
kubectl get cm nvsentinel-datastore-config -o yaml | grep CERT
# Exec into pod and verify
kubectl exec -it <pod> -- ls -la /etc/ssl/client-certs/# Check datastore pods
kubectl get pods | grep -E "mongo|postgres"
# Check cert-manager certificates
kubectl get certificates
# Check if ConfigMap exists
kubectl get cm nvsentinel-datastore-config
# Check component health
kubectl get pods | grep -E "fault|node|health|platform"# Restart component
kubectl rollout restart deployment/<component>
# View component logs
kubectl logs -f deployment/<component>
# Shell into database
kubectl exec -it mongodb-0 -- mongo
kubectl exec -it nvsentinel-postgresql-0 -- psql -U postgresql -d nvsentinel
# Check database contents
kubectl exec mongodb-0 -- mongo nvsentinel --eval "db.HealthEvents.count()"
kubectl exec nvsentinel-postgresql-0 -- psql -U postgresql -d nvsentinel -c "SELECT COUNT(*) FROM health_events;"store-client/pkg/datastore/interfaces.go
Core interfaces that define the abstraction layer
store-client/pkg/datastore/config.go
Configuration loading logic and precedence rules
store-client/pkg/datastore/registry.go
Factory pattern implementation for provider selection
distros/kubernetes/nvsentinel/templates/configmap-datastore.yaml
CRITICAL: Unified configuration template
tilt/Tiltfile
Development orchestration and database selection
store-client/pkg/datastore/providers/mongodb/adapter.go
How legacy wrapping works - adapter pattern
store-client/pkg/datastore/providers/mongodb/watcher/watch_store.go
Change streams implementation (~1,200 LOC)
store-client/pkg/client/mongodb_client.go
Legacy MongoDB client (pre-abstraction)
distros/kubernetes/nvsentinel/templates/certmanager-mongodb.yaml
MongoDB certificate hierarchy
store-client/pkg/datastore/providers/postgresql/datastore.go
Main PostgreSQL datastore implementation
store-client/pkg/datastore/providers/postgresql/changestream.go
Polling-based change detection mechanism
store-client/pkg/datastore/providers/postgresql/pipeline_filter.go
MongoDB aggregation pipeline → PostgreSQL SQL translator
distros/kubernetes/nvsentinel/templates/certmanager-postgresql.yaml
PostgreSQL certificate hierarchy
store-client/pkg/datastore/behavioral_contract_test.go
Cross-provider consistency tests (CRITICAL)
store-client/pkg/datastore/interface_compliance_test.go
Interface verification tests
store-client/pkg/datastore/providers/mongodb/health_store_test.go
MongoDB-specific health store tests
store-client/pkg/datastore/providers/postgresql/changestream_test.go
PostgreSQL polling mechanism tests
distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
MongoDB Tilt configuration
distros/kubernetes/nvsentinel/values-tilt-postgresql.yaml
PostgreSQL Tilt configuration
distros/kubernetes/nvsentinel/charts/node-drainer/templates/deployment.yaml
Example component deployment pattern
distros/kubernetes/nvsentinel/charts/postgresql/
Vendored Bitnami PostgreSQL chart (65+ files)
distros/kubernetes/nvsentinel/charts/mongodb-store/charts/mongodb/
Vendored Bitnami MongoDB chart
commons/pkg/flags/database_flags.go
Certificate path resolution logic (5-level precedence)
store-client/pkg/datastore/providers/mongodb/register.go
MongoDB provider auto-registration
store-client/pkg/datastore/providers/postgresql/register.go
PostgreSQL provider auto-registration
┌─────────────────────────────────────────────────────┐
│ Component (e.g., node-drainer) │
│ │
│ ┌──────────────────┐ ┌────────────────────┐ │
│ │ Legacy Factory │ │ New DataStore │ │
│ │ (for watcher) │ │ (for queries) │ │
│ └────────┬─────────┘ └─────────┬──────────┘ │
│ │ │ │
│ │ DUAL CONFIGURATION │ │
│ │ SYSTEM (both needed) │ │
└───────────┼─────────────────────────┼────────────────┘
│ │
▼ ▼
┌────────────────┐ ┌──────────────────┐
│ MongoDB Client │ │ AdaptedMongoStore│
│ (legacy) │ │ (new wrapper) │
└────────┬───────┘ └────────┬─────────┘
│ │
└─────────┬───────────────┘
▼
┌──────────────────┐
│ MongoDB Driver │
│ (mongo.Client) │
└─────────┬────────┘
▼
┌──────────────────┐
│ MongoDB Server │
│ (Replica Set) │
│ │
│ Change Streams │ ← Real-time push (<50ms)
└──────────────────┘
┌─────────────────────────────────────┐
│ Component (e.g., node-drainer) │
│ │
│ ┌─────────────────┐ │
│ │ New DataStore │ │
│ │ (direct use) │ │
│ └────────┬────────┘ │
│ │ │
│ SINGLE CONFIGURATION │
│ SYSTEM (cleaner) │
└───────────────┼───────────────────────┘
│
▼
┌─────────────────┐
│ PostgreSQLStore │
│ (native impl) │
└────────┬────────┘
│
▼
┌─────────────────┐
│ *sql.DB │
│ (Go stdlib) │
└────────┬────────┘
│
▼
┌─────────────────┐
│ lib/pq Driver │
└────────┬────────┘
│
▼
┌─────────────────────────────┐
│ PostgreSQL Server │
│ │
│ Triggers → Changelog Table │ ← Polling every 5s
│ (change detection) │
└─────────────────────────────┘
# Generic namespace
DATASTORE_PROVIDER=mongodb
DATASTORE_HOST=mongodb-headless.nvsentinel.svc.cluster.local
DATASTORE_PORT=27017
DATASTORE_DATABASE=HealthEventsDatabase
# MongoDB-specific namespace (legacy compatibility)
MONGODB_URI=mongodb://mongodb-headless.nvsentinel.svc.cluster.local:27017/?replicaSet=rs0&tls=true
MONGODB_DATABASE_NAME=HealthEventsDatabase
MONGODB_COLLECTION_NAME=HealthEvents
MONGODB_TOKEN_COLLECTION_NAME=ResumeTokens
MONGODB_MAINTENANCE_EVENT_COLLECTION_NAME=MaintenanceEvents
# Certificate path (from deployment env, NOT ConfigMap)
MONGODB_CLIENT_CERT_MOUNT_PATH=/etc/ssl/client-certs
# Timeouts
MONGODB_PING_TIMEOUT_TOTAL_SECONDS=30
MONGODB_PING_INTERVAL_SECONDS=5
CA_CERT_MOUNT_TIMEOUT_TOTAL_SECONDS=360
CA_CERT_READ_INTERVAL_SECONDS=5
# Connection pool
MONGODB_MAX_CONNECTIONS=25
MONGODB_MAX_IDLE_CONNECTIONS=10
MONGODB_CONNECTION_MAX_LIFETIME=1h# Generic namespace (complete)
DATASTORE_PROVIDER=postgresql
DATASTORE_HOST=nvsentinel-postgresql.nvsentinel.svc.cluster.local
DATASTORE_PORT=5432
DATASTORE_DATABASE=nvsentinel
DATASTORE_USERNAME=postgresql
DATASTORE_SSLMODE=require
DATASTORE_SSLCERT=/etc/ssl/client-certs/tls.crt
DATASTORE_SSLKEY=/etc/ssl/client-certs/tls.key
DATASTORE_SSLROOTCERT=/etc/ssl/client-certs/ca.crt
# PostgreSQL-specific (for direct access)
POSTGRESQL_CLIENT_CERT_MOUNT_PATH=/etc/ssl/client-certs
# Connection pool
DATASTORE_MAX_CONNECTIONS=25
DATASTORE_MAX_IDLE_CONNECTIONS=10
DATASTORE_CONNECTION_MAX_LIFETIME=1h
DATASTORE_CONNECTION_MAX_IDLE_TIME=30m
# Polling configuration
DATASTORE_POLL_INTERVAL=5sThis consolidated guide represents the authoritative reference for MongoDB and PostgreSQL support in NVSentinel. It combines:
- Technical accuracy from multiple analysis iterations
- Practical guidance for developers
- Operational considerations for production deployments
- Debugging strategies for troubleshooting
- Testing approaches for ensuring consistency
MongoDB: ✅ Production-ready, mature, fully tested
PostgreSQL: ✅ Feature-complete, tested, ready for deployment
Unified Abstraction: ✅ Stable, proven, well-tested
- Abstraction layer enables database-agnostic components
- Behavioral tests ensure consistent behavior across providers
- Certificate management via cert-manager for both databases
- Explicit configuration prevents runtime surprises
- Comprehensive testing validates both providers
- Phase out MongoDB dual-config system for cleaner architecture
- Implement changelog cleanup for PostgreSQL production deployments
- Add new databases following PostgreSQL pattern (cleaner template)
- Standardize cert path handling across all providers
END OF CONSOLIDATED MONGODB & POSTGRESQL IMPLEMENTATION GUIDE
This document consolidates content from 9 source documents, incorporating the best technical details, practical guidance, and operational insights from all sources.
This consolidated guide was synthesized from:
- claude-mongodb-vs-postgresql-analysis.md (v2.0 revised)
- mongodb-postgresql-implementation-guide.md (colleague's original)
- mongodb-vs-postgresql-comprehensive-analysis.md (detailed session analysis)
- mongodb-postgresql-guide-critical-additions.md (supplementary content)
- claude-mongodb-vs-postgresql-analysis-critique.md (feedback incorporated)
- mongodb-postgresql-implementation-guide-critique.md (analysis feedback)
- Supporting documentation and analysis
Consolidated by: Claude (Anthropic)
Date: November 18, 2025
Version: 3.0 (Unified Guide)