@stellaraccident
Created January 26, 2026 20:58
Quartz Design Document: ROCm CI/CD Dashboard & Orchestration

Executive Summary

This document provides architectural guidance for "Quartz" - a PyTorch HUD-like system for ROCm downstream CI/CD orchestration. The junior engineer's instinct to start with status.json is understandable but insufficient for the stated requirements. A database-first approach is correct.


Requirements Recap

  1. Cascade triggering: ROCm build → PyTorch → vLLM/sglang (with artifacts)
  2. Health aggregation: Unified view of build health across ~20 downstream projects
  3. DevOps efficiency: Central team manages O(20) projects
  4. REST endpoints: Query release/branch status, artifacts, jobs, signals
  5. Dashboards: PyTorch HUD-style with drill-down capability

Alternatives Considered

Option A: Pure GitHub Native (repository_dispatch + Environments)

What it provides:

  • repository_dispatch can trigger cross-repo cascades
  • GitHub Environments can gate deployments
  • Check Runs API provides rich status reporting
  • Deployments API tracks deployment history

Why it doesn't fit:

  • ❌ No aggregated visibility - "peer-to-peer choreography" means no central view
  • ❌ No built-in dashboard - would need to build one anyway
  • ❌ Fragile at scale - 20 repos cross-triggering becomes unmaintainable
  • ❌ No cascade state - can't answer "why did this cascade stop?"
  • ❌ PAT management burden for cross-repo triggers

Verdict: Use for per-repo CI execution, but need external orchestration layer.


Option B: Zuul (OpenStack)

What it provides:

  • Purpose-built for multi-repo gating (manages 100+ OpenStack repos)
  • Cross-project dependencies via Depends-On: commit footers
  • Shared change queues for coupled projects
  • Built-in dashboard and REST API
  • GitHub Checks API integration

Why it might not fit:

  • ⚠️ Steep learning curve - designed for OpenStack's workflow
  • ⚠️ Heavy operational footprint (Kubernetes operator, Percona XtraDB, etc.)
  • ⚠️ Primarily designed for "gating" (blocking merges), not "reporting"
  • ⚠️ Overkill for 20 projects if you don't need commit-level gating

When to use: If you want sophisticated cross-repo gating where failures in downstream block upstream merges. Best for large-scale, high-discipline environments.

Verdict: Powerful but likely over-engineered for ROCm's current needs. Keep on radar for future.


Option C: GoCD

What it provides:

  • Explicit fan-in/fan-out primitives for cascade orchestration
  • Value Stream Map (VSM) visualization of end-to-end pipelines
  • Mature REST API with good coverage
  • Simpler than Zuul, still sophisticated

Why it might not fit:

  • ⚠️ Another CI system to run alongside GitHub Actions
  • ⚠️ Learning curve for pipeline-as-code DSL
  • ⚠️ Less GitHub-native than other options

When to use: If DevOps wants an off-the-shelf orchestrator with good visualization and doesn't mind running additional infrastructure.

Verdict: Strong alternative if bespoke feels too risky. Evaluate seriously.


Option D: Fork PyTorch HUD

What it provides:

  • Proven architecture (Next.js + ClickHouse + GitHub webhooks)
  • Open source at pytorch/test-infra
  • ML-powered failure classification
  • Full log viewer, artifact access, branch switching
  • Already handles complex GitHub Actions workflows

Why it might not fit:

  • ⚠️ Designed for monitoring, not orchestrating cascades
  • ⚠️ Would need significant customization for cascade tracking
  • ⚠️ ClickHouse Cloud dependency (or self-host overhead)
  • ⚠️ PyTorch-specific assumptions baked in

What it would need:

  • Add cascade DAG definition and tracking
  • Add cross-repo trigger coordination
  • Replace PyTorch-specific queries with ROCm equivalents
  • Potentially switch from ClickHouse to PostgreSQL for simplicity

Verdict: Good reference architecture, but forking adds maintenance burden. Better to learn from it and build simpler.


Option E: Bespoke Lightweight System (Recommended)

What to build:

  • Database-first: PostgreSQL for events, builds, cascades, artifacts
  • GitHub webhooks: Listen to workflow_run completed events
  • REST API: FastAPI/Express for status queries
  • Dashboard: Next.js with simple tables and drill-down
  • Cascade orchestration: Trigger downstream repos via repository_dispatch

Why this fits:

  • ✅ Right-sized for ~20 projects
  • ✅ Standard tech (PostgreSQL, Python/TypeScript, AWS)
  • ✅ Full control over cascade logic
  • ✅ Easy to spin up on AWS (RDS, Lambda/ECS, S3)
  • ✅ Junior engineer can learn by building something tractable
  • ✅ Buildable in a few days with coding agent assistance

Recommended Architecture

Core Components

┌─────────────────────────────────────────────────────────────────┐
│                         GitHub Actions                          │
│  (ROCm builds, PyTorch tests, vLLM tests, etc.)                │
└─────────────────┬───────────────────────────────────────────────┘
                  │ webhook: workflow_run.completed
                  ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Quartz Orchestrator                          │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐ │
│  │  Webhook    │  │  Cascade    │  │   GitHub Trigger        │ │
│  │  Receiver   │──│  Engine     │──│   (repository_dispatch) │ │
│  └─────────────┘  └──────┬──────┘  └─────────────────────────┘ │
│                          │                                       │
│                          ▼                                       │
│                   ┌─────────────┐                                │
│                   │ PostgreSQL  │                                │
│                   │  - builds   │                                │
│                   │  - cascades │                                │
│                   │  - artifacts│                                │
│                   └─────────────┘                                │
│                          │                                       │
│                          ▼                                       │
│                   ┌─────────────┐                                │
│                   │  REST API   │                                │
│                   │  (FastAPI)  │                                │
│                   └─────────────┘                                │
└─────────────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Quartz Dashboard                            │
│                   (Next.js / React)                              │
│  - Release/branch health grid                                    │
│  - Cascade status (ROCm → PyTorch → vLLM)                       │
│  - Drill-down to jobs, tests, logs                              │
│  - Artifact links (S3)                                          │
└─────────────────────────────────────────────────────────────────┘

Data Model (PostgreSQL)

-- Core entities
CREATE TABLE projects (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,           -- 'rocm', 'pytorch', 'vllm'
    repo TEXT NOT NULL,           -- 'ROCm/TheRock', 'pytorch/pytorch'
    category TEXT NOT NULL,       -- 'core', '1p_downstream', '3p_downstream'
    UNIQUE(repo)
);

CREATE TABLE builds (
    id SERIAL PRIMARY KEY,
    project_id INTEGER REFERENCES projects(id),
    github_run_id BIGINT NOT NULL,
    ref TEXT NOT NULL,            -- branch or tag
    sha TEXT NOT NULL,
    status TEXT NOT NULL,         -- 'pending', 'running', 'success', 'failure'
    started_at TIMESTAMPTZ,
    completed_at TIMESTAMPTZ,
    artifacts JSONB,              -- S3 URLs, checksums
    UNIQUE(github_run_id)
);

CREATE TABLE cascades (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,           -- 'nightly', 'release-6.4'
    trigger_ref TEXT NOT NULL,    -- what ref triggered this cascade
    status TEXT NOT NULL,         -- 'running', 'success', 'partial_failure', 'failure'
    started_at TIMESTAMPTZ,
    completed_at TIMESTAMPTZ
);

CREATE TABLE cascade_stages (
    id SERIAL PRIMARY KEY,
    cascade_id INTEGER REFERENCES cascades(id),
    project_id INTEGER REFERENCES projects(id),
    build_id INTEGER REFERENCES builds(id),
    stage_order INTEGER NOT NULL, -- 0=ROCm, 1=PyTorch, 2=vLLM
    depends_on INTEGER[],         -- array of project_ids that must complete first
    status TEXT NOT NULL
);

-- For tracking test results if available
CREATE TABLE test_results (
    id SERIAL PRIMARY KEY,
    build_id INTEGER REFERENCES builds(id),
    suite_name TEXT,
    passed INTEGER,
    failed INTEGER,
    skipped INTEGER,
    details_url TEXT              -- link to test report
);
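The most common dashboard query over this schema is "latest build per project per ref" (the health-grid cell value). A minimal sketch of that query, using an in-memory SQLite database with a trimmed-down version of the tables above (SQLite lacks SERIAL/TIMESTAMPTZ/JSONB, so the columns are simplified; in Postgres the same result could be had more directly with `SELECT DISTINCT ON (project_id, ref) ...`):

```python
import sqlite3

# In-memory stand-in for the projects/builds tables above (types simplified).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT, repo TEXT UNIQUE);
CREATE TABLE builds (
    id INTEGER PRIMARY KEY,
    project_id INTEGER REFERENCES projects(id),
    ref TEXT, status TEXT, completed_at TEXT
);
""")
conn.executemany("INSERT INTO projects VALUES (?, ?, ?)",
                 [(1, "rocm", "ROCm/TheRock"), (2, "pytorch", "pytorch/pytorch")])
conn.executemany("INSERT INTO builds VALUES (?, ?, ?, ?, ?)", [
    (1, 1, "main", "success", "2026-01-25T00:00:00Z"),
    (2, 1, "main", "failure", "2026-01-26T00:00:00Z"),  # newer build failed
    (3, 2, "main", "success", "2026-01-26T01:00:00Z"),
])

# Latest build per (project, ref): one row per health-grid cell.
rows = conn.execute("""
    SELECT p.name, b.ref, b.status
    FROM builds b
    JOIN projects p ON p.id = b.project_id
    WHERE b.completed_at = (
        SELECT MAX(b2.completed_at) FROM builds b2
        WHERE b2.project_id = b.project_id AND b2.ref = b.ref
    )
    ORDER BY p.name
""").fetchall()
print(rows)  # → [('pytorch', 'main', 'success'), ('rocm', 'main', 'failure')]
```

Note that the grid should show the *latest* status, not an aggregate: the rocm cell above reads failure even though an earlier build on the same ref succeeded.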

REST API Endpoints

GET  /api/v1/health
     → Overall system health

GET  /api/v1/releases/{release}/status
     → Full status for a release (branches, cascades, all projects)

GET  /api/v1/projects
     → List all tracked projects with latest build status

GET  /api/v1/projects/{project}/builds
     → Build history for a project, with filtering

GET  /api/v1/cascades
     → List cascades (nightly, release, etc.)

GET  /api/v1/cascades/{cascade_id}
     → Full cascade status with all stages

GET  /api/v1/builds/{build_id}
     → Detailed build info including artifacts, test results

POST /api/v1/cascades
     → Manually trigger a cascade (for releases)

POST /api/v1/webhooks/github
     → GitHub webhook receiver (workflow_run events)
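The cascade endpoints need a rollup rule for the top-level status field returned by GET /api/v1/cascades/{cascade_id}. The data model names four cascade states; one plausible rollup over stage statuses (the exact rule is an assumption, not specified above) is:

```python
def cascade_status(stage_statuses: list[str]) -> str:
    """Roll per-stage statuses up into the cascade-level status field.
    One interpretation of the four states in the cascades table."""
    if not stage_statuses:
        return "running"
    failed = sum(s == "failure" for s in stage_statuses)
    done = all(s in ("success", "failure") for s in stage_statuses)
    if failed == 0:
        return "success" if done else "running"
    if failed == len(stage_statuses):
        return "failure"
    # Some stages failed while others succeeded or are still pending.
    return "partial_failure"

print(cascade_status(["success", "success"]))  # → success
print(cascade_status(["success", "failure"]))  # → partial_failure
print(cascade_status(["success", "running"]))  # → running
```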

Cascade Logic (Pseudocode)

# When GitHub webhook arrives for workflow_run.completed
def handle_workflow_complete(event):
    project = lookup_project(event.repository.full_name)
    build = upsert_build(project, event.workflow_run)

    # Find cascade stages waiting on this project *and* this ref, so a
    # build on one branch can't satisfy a cascade running on another
    pending_stages = find_pending_stages(project_id=project.id, ref=build.ref)

    for stage in pending_stages:
        if all_dependencies_complete(stage):
            if all_dependencies_successful(stage):
                trigger_downstream(stage)
            else:
                mark_cascade_failed(stage.cascade_id)

def trigger_downstream(stage):
    project = get_project(stage.project_id)

    # Use repository_dispatch to trigger downstream
    github_api.post(
        f"/repos/{project.repo}/dispatches",
        json={
            "event_type": "cascade_trigger",
            "client_payload": {
                "cascade_id": stage.cascade_id,
                "upstream_artifacts": get_upstream_artifacts(stage),
                "ref": stage.cascade.trigger_ref
            }
        }
    )

    stage.status = "triggered"
    db.commit()
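The `all_dependencies_complete` / `all_dependencies_successful` checks above reduce to a lookup over the `depends_on` array in `cascade_stages`. A self-contained sketch of that resolution step, with in-memory dicts standing in for the database (names and shapes are illustrative, not part of the design):

```python
from dataclasses import dataclass, field

@dataclass
class Stage:
    project_id: int
    depends_on: list[int] = field(default_factory=list)  # upstream project_ids
    status: str = "pending"  # pending / triggered / success / failure

def resolve(stages: dict[int, Stage]) -> list[int]:
    """Return project_ids whose stage should be triggered now:
    still pending, with every upstream dependency in 'success'."""
    ready = []
    for pid, stage in stages.items():
        if stage.status != "pending":
            continue
        deps = [stages[d] for d in stage.depends_on]
        if all(d.status == "success" for d in deps):
            ready.append(pid)
    return ready

# ROCm(1) -> PyTorch(2) -> vLLM(3), the linear cascade from the requirements
stages = {
    1: Stage(1, [], "success"),
    2: Stage(2, [1]),
    3: Stage(3, [2]),
}
print(resolve(stages))  # → [2]  (PyTorch is unblocked; vLLM still waits)
```

The same function runs again after each webhook: once PyTorch reaches success, the next call returns `[3]` and vLLM is triggered.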

Dashboard Views

  1. Release Health Grid

    • Rows: Projects (ROCm, PyTorch, vLLM, ...)
    • Columns: Branches/releases (main, release-6.4, ...)
    • Cells: ✅❌🔄 with link to build details
  2. Cascade View

    • Visual DAG showing ROCm → PyTorch → vLLM
    • Each node shows status, duration, artifact links
    • Click to drill into specific build
  3. Build Detail

    • Job list with status
    • Test results summary (if available)
    • Artifact links (wheels, tarballs, etc.)
    • Link to GitHub Actions run
  4. Historical Trends

    • Success rate over time per project
    • Build duration trends
    • Flaky test identification
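The health grid in view 1 is essentially a pivot of the latest-build rows into a {project → {ref → status}} mapping. A minimal sketch (the icon mapping mirrors the ✅❌🔄 cells described above):

```python
from collections import defaultdict

# (project, ref, status) rows, e.g. from a latest-build-per-ref query
rows = [
    ("rocm", "main", "success"),
    ("rocm", "release-6.4", "failure"),
    ("pytorch", "main", "running"),
]

ICONS = {"success": "✅", "failure": "❌", "running": "🔄"}

def build_grid(rows):
    """Pivot flat status rows into the project-by-ref grid the UI renders."""
    grid = defaultdict(dict)
    for project, ref, status in rows:
        grid[project][ref] = ICONS.get(status, "?")
    return dict(grid)

grid = build_grid(rows)
print(grid["rocm"])  # → {'main': '✅', 'release-6.4': '❌'}
```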

Implementation Approach

Phase 1: Foundation (Week 1)

  • Set up PostgreSQL on RDS
  • Implement GitHub webhook receiver (Lambda or ECS)
  • Create data model and basic CRUD
  • Single project tracking (ROCm only)
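One detail worth getting right in the webhook receiver from day one: GitHub signs each delivery with an HMAC-SHA256 of the raw request body, sent in the `X-Hub-Signature-256` header, and the receiver should verify it before trusting the event. A stdlib-only sketch (the secret name is illustrative):

```python
import hashlib
import hmac

def verify_signature(secret: str, body: bytes, signature_header: str) -> bool:
    """Check GitHub's X-Hub-Signature-256 header ('sha256=<hex digest>')
    against an HMAC-SHA256 of the raw request body."""
    expected = "sha256=" + hmac.new(
        secret.encode(), body, hashlib.sha256
    ).hexdigest()
    # Constant-time comparison to avoid timing side channels.
    return hmac.compare_digest(expected, signature_header)

body = b'{"action": "completed"}'
secret = "quartz-webhook-secret"  # shared secret configured on the GitHub webhook
good = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
print(verify_signature(secret, body, good))         # → True
print(verify_signature(secret, body, "sha256=00"))  # → False
```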

Phase 2: Cascade Engine (Week 2)

  • Add cascade definitions (YAML or DB config)
  • Implement cascade trigger logic
  • Add downstream project tracking (PyTorch)
  • Test end-to-end cascade

Phase 3: REST API (Week 3)

  • FastAPI implementation
  • All status endpoints
  • Authentication (API keys or GitHub App)
  • OpenAPI documentation

Phase 4: Dashboard (Week 4)

  • Next.js scaffold
  • Health grid view
  • Cascade visualization
  • Build drill-down

Phase 5: Polish (Week 5+)

  • Test result ingestion
  • Artifact tracking improvements
  • Alerting (Slack/email on failures)
  • Historical analytics

Why "status.json" Doesn't Work

The engineer's proposed approach of materializing a status.json file in some ad-hoc location falls short on several counts:

  1. No cascade state: Can't track "ROCm passed → waiting for PyTorch → PyTorch running"
  2. No history: Can't answer "when did this start failing?"
  3. No atomicity: Race conditions when multiple builds complete
  4. No queries: Can't filter by project, branch, time range
  5. No relationships: Can't link builds to cascades to artifacts
  6. No REST: Would need to build API layer anyway

A database is the correct foundation. The engineer should start there.


Tech Stack Recommendation

Component          Technology                           Rationale
Database           PostgreSQL (RDS)                     Relational model fits the domain; AWS-native
Webhook Handler    Python (Lambda) or FastAPI (ECS)     Simple, fast to develop
REST API           FastAPI                              Modern Python; auto-generates OpenAPI docs
Dashboard          Next.js                              Same stack as PyTorch HUD; strong React ecosystem
Cascade Triggers   GitHub API (repository_dispatch)     PATs already in hand; simple
Artifact Storage   S3                                   Already in use for artifacts
Hosting            AWS (ECS or Lambda + API Gateway)    Existing AWS infrastructure

Guidance for the Engineer

  1. Start with the database schema. Model the domain (projects, builds, cascades) before writing any API code.

  2. Build the webhook receiver first. Get GitHub events flowing into the database. This validates the data model.

  3. Add REST endpoints incrementally. Start with /projects and /builds, then add cascade queries.

  4. Dashboard last. The API should be complete before building UI.

  5. Use a coding agent. Claude Code or Cursor can scaffold FastAPI + Next.js in an afternoon. Focus on domain logic, not boilerplate.

  6. Don't over-engineer. Start with 2-3 projects, linear cascade. Add complexity when needed.


Open Questions for TL

  1. GitHub App vs PAT: Should Quartz authenticate as a GitHub App (more robust) or use PATs (simpler)?

  2. 3P project handling: For true 3P projects (llamacpp example), do they push status to us, or do we poll them?

  3. Artifact handoff: How are artifacts passed between cascade stages? S3 bucket with naming convention?

  4. Test result ingestion: Do downstream projects produce JUnit XML? Where is it stored?

  5. SLA for cascade completion: How fast should ROCm → PyTorch → vLLM complete? (Affects architecture choices)
