@stellaraccident
Created January 26, 2026 20:58
Quartz Design Document: ROCm CI/CD Dashboard & Orchestration

Executive Summary

This document provides architectural guidance for "Quartz" - a PyTorch HUD-like system for ROCm downstream CI/CD orchestration. The junior engineer's instinct to start with status.json is understandable but insufficient for the stated requirements. A database-first approach is correct.


Requirements Recap

  1. Cascade triggering: ROCm build → PyTorch → vLLM/sglang (with artifacts)
  2. Health aggregation: Unified view of build health across ~20 downstream projects
  3. DevOps efficiency: Central team manages O(20) projects
  4. REST endpoints: Query release/branch status, artifacts, jobs, signals
  5. Dashboards: PyTorch HUD-style with drill-down capability

Alternatives Considered

Option A: Pure GitHub Native (repository_dispatch + Environments)

What it provides:

  • repository_dispatch can trigger cross-repo cascades
  • GitHub Environments can gate deployments
  • Check Runs API provides rich status reporting
  • Deployments API tracks deployment history

Why it doesn't fit:

  • ❌ No aggregated visibility - "peer-to-peer choreography" means no central view
  • ❌ No built-in dashboard - would need to build one anyway
  • ❌ Fragile at scale - 20 repos cross-triggering becomes unmaintainable
  • ❌ No cascade state - can't answer "why did this cascade stop?"
  • ❌ PAT management burden for cross-repo triggers

Verdict: Use for per-repo CI execution, but need external orchestration layer.


Option B: Zuul (OpenStack)

What it provides:

  • Purpose-built for multi-repo gating (manages 100+ OpenStack repos)
  • Cross-project dependencies via Depends-On: commit footers
  • Shared change queues for coupled projects
  • Built-in dashboard and REST API
  • GitHub Checks API integration

Why it might not fit:

  • ⚠️ Steep learning curve - designed for OpenStack's workflow
  • ⚠️ Heavy operational footprint (Kubernetes operator, Percona XtraDB, etc.)
  • ⚠️ Primarily designed for "gating" (blocking merges), not "reporting"
  • ⚠️ Overkill for 20 projects if you don't need commit-level gating

When to use: If you want sophisticated cross-repo gating where failures in downstream block upstream merges. Best for large-scale, high-discipline environments.

Verdict: Powerful but likely over-engineered for ROCm's current needs. Keep on radar for future.


Option C: GoCD

What it provides:

  • Explicit fan-in/fan-out primitives for cascade orchestration
  • Value Stream Map (VSM) visualization of end-to-end pipelines
  • Mature REST API with good coverage
  • Simpler than Zuul, still sophisticated

Why it might not fit:

  • ⚠️ Another CI system to run alongside GitHub Actions
  • ⚠️ Learning curve for pipeline-as-code DSL
  • ⚠️ Less GitHub-native than other options

When to use: If DevOps wants an off-the-shelf orchestrator with good visualization and doesn't mind running additional infrastructure.

Verdict: Strong alternative if bespoke feels too risky. Evaluate seriously.


Option D: Fork PyTorch HUD

What it provides:

  • Proven architecture (Next.js + ClickHouse + GitHub webhooks)
  • Open source at pytorch/test-infra
  • ML-powered failure classification
  • Full log viewer, artifact access, branch switching
  • Already handles complex GitHub Actions workflows

Why it might not fit:

  • ⚠️ Designed for monitoring, not orchestrating cascades
  • ⚠️ Would need significant customization for cascade tracking
  • ⚠️ ClickHouse Cloud dependency (or self-host overhead)
  • ⚠️ PyTorch-specific assumptions baked in

What it would need:

  • Add cascade DAG definition and tracking
  • Add cross-repo trigger coordination
  • Replace PyTorch-specific queries with ROCm equivalents
  • Potentially switch from ClickHouse to PostgreSQL for simplicity

Verdict: Good reference architecture, but forking adds maintenance burden. Better to learn from it and build simpler.


Option E: Bespoke Lightweight System (Recommended)

What to build:

  • Database-first: PostgreSQL for events, builds, cascades, artifacts
  • GitHub webhooks: Listen to workflow_run completed events
  • REST API: FastAPI/Express for status queries
  • Dashboard: Next.js with simple tables and drill-down
  • Cascade orchestration: Trigger downstream repos via repository_dispatch

Why this fits:

  • ✅ Right-sized for ~20 projects
  • ✅ Standard tech (PostgreSQL, Python/TypeScript, AWS)
  • ✅ Full control over cascade logic
  • ✅ Easy to spin up on AWS (RDS, Lambda/ECS, S3)
  • ✅ Junior engineer can learn by building something tractable
  • ✅ Buildable in a few days with coding agent assistance

Recommended Architecture

Core Components

┌─────────────────────────────────────────────────────────────────┐
│                         GitHub Actions                          │
│  (ROCm builds, PyTorch tests, vLLM tests, etc.)                │
└─────────────────┬───────────────────────────────────────────────┘
                  │ webhook: workflow_run.completed
                  ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Quartz Orchestrator                          │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐ │
│  │  Webhook    │  │  Cascade    │  │   GitHub Trigger        │ │
│  │  Receiver   │──│  Engine     │──│   (repository_dispatch) │ │
│  └─────────────┘  └──────┬──────┘  └─────────────────────────┘ │
│                          │                                       │
│                          ▼                                       │
│                   ┌─────────────┐                                │
│                   │ PostgreSQL  │                                │
│                   │  - builds   │                                │
│                   │  - cascades │                                │
│                   │  - artifacts│                                │
│                   └─────────────┘                                │
│                          │                                       │
│                          ▼                                       │
│                   ┌─────────────┐                                │
│                   │  REST API   │                                │
│                   │  (FastAPI)  │                                │
│                   └─────────────┘                                │
└─────────────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Quartz Dashboard                            │
│                   (Next.js / React)                              │
│  - Release/branch health grid                                    │
│  - Cascade status (ROCm → PyTorch → vLLM)                       │
│  - Drill-down to jobs, tests, logs                              │
│  - Artifact links (S3)                                          │
└─────────────────────────────────────────────────────────────────┘

Data Model (PostgreSQL)

-- Core entities
CREATE TABLE projects (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,           -- 'rocm', 'pytorch', 'vllm'
    repo TEXT NOT NULL,           -- 'ROCm/TheRock', 'pytorch/pytorch'
    category TEXT NOT NULL,       -- 'core', '1p_downstream', '3p_downstream'
    UNIQUE(repo)
);

CREATE TABLE builds (
    id SERIAL PRIMARY KEY,
    project_id INTEGER REFERENCES projects(id),
    github_run_id BIGINT NOT NULL,
    ref TEXT NOT NULL,            -- branch or tag
    sha TEXT NOT NULL,
    status TEXT NOT NULL,         -- 'pending', 'running', 'success', 'failure'
    started_at TIMESTAMPTZ,
    completed_at TIMESTAMPTZ,
    artifacts JSONB,              -- S3 URLs, checksums
    UNIQUE(github_run_id)
);

CREATE TABLE cascades (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,           -- 'nightly', 'release-6.4'
    trigger_ref TEXT NOT NULL,    -- what ref triggered this cascade
    status TEXT NOT NULL,         -- 'running', 'success', 'partial_failure', 'failure'
    started_at TIMESTAMPTZ,
    completed_at TIMESTAMPTZ
);

CREATE TABLE cascade_stages (
    id SERIAL PRIMARY KEY,
    cascade_id INTEGER REFERENCES cascades(id),
    project_id INTEGER REFERENCES projects(id),
    build_id INTEGER REFERENCES builds(id),
    stage_order INTEGER NOT NULL, -- 0=ROCm, 1=PyTorch, 2=vLLM
    depends_on INTEGER[],         -- array of project_ids that must complete first
    status TEXT NOT NULL
);

-- For tracking test results if available
CREATE TABLE test_results (
    id SERIAL PRIMARY KEY,
    build_id INTEGER REFERENCES builds(id),
    suite_name TEXT,
    passed INTEGER,
    failed INTEGER,
    skipped INTEGER,
    details_url TEXT              -- link to test report
);
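The most common dashboard query over this schema is "latest build per project per ref" (the health-grid cell value). A minimal sketch of that query, using an in-memory SQLite database with a trimmed-down version of the tables above (SQLite lacks SERIAL/TIMESTAMPTZ/JSONB, so the columns are simplified; in Postgres the same result could be had more directly with `SELECT DISTINCT ON (project_id, ref) ...`):

```python
import sqlite3

# In-memory stand-in for the projects/builds tables above (types simplified).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT, repo TEXT UNIQUE);
CREATE TABLE builds (
    id INTEGER PRIMARY KEY,
    project_id INTEGER REFERENCES projects(id),
    ref TEXT, status TEXT, completed_at TEXT
);
""")
conn.executemany("INSERT INTO projects VALUES (?, ?, ?)",
                 [(1, "rocm", "ROCm/TheRock"), (2, "pytorch", "pytorch/pytorch")])
conn.executemany("INSERT INTO builds VALUES (?, ?, ?, ?, ?)", [
    (1, 1, "main", "success", "2026-01-25T00:00:00Z"),
    (2, 1, "main", "failure", "2026-01-26T00:00:00Z"),  # newer build failed
    (3, 2, "main", "success", "2026-01-26T01:00:00Z"),
])

# Latest build per (project, ref): one row per health-grid cell.
rows = conn.execute("""
    SELECT p.name, b.ref, b.status
    FROM builds b
    JOIN projects p ON p.id = b.project_id
    WHERE b.completed_at = (
        SELECT MAX(b2.completed_at) FROM builds b2
        WHERE b2.project_id = b.project_id AND b2.ref = b.ref
    )
    ORDER BY p.name
""").fetchall()
print(rows)  # → [('pytorch', 'main', 'success'), ('rocm', 'main', 'failure')]
```

Note that the grid should show the *latest* status, not an aggregate: the rocm cell above reads failure even though an earlier build on the same ref succeeded.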

REST API Endpoints

GET  /api/v1/health
     → Overall system health

GET  /api/v1/releases/{release}/status
     → Full status for a release (branches, cascades, all projects)

GET  /api/v1/projects
     → List all tracked projects with latest build status

GET  /api/v1/projects/{project}/builds
     → Build history for a project, with filtering

GET  /api/v1/cascades
     → List cascades (nightly, release, etc.)

GET  /api/v1/cascades/{cascade_id}
     → Full cascade status with all stages

GET  /api/v1/builds/{build_id}
     → Detailed build info including artifacts, test results

POST /api/v1/cascades
     → Manually trigger a cascade (for releases)

POST /api/v1/webhooks/github
     → GitHub webhook receiver (workflow_run events)
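The cascade endpoints need a rollup rule for the top-level status field returned by GET /api/v1/cascades/{cascade_id}. The data model names four cascade states; one plausible rollup over stage statuses (the exact rule is an assumption, not specified above) is:

```python
def cascade_status(stage_statuses: list[str]) -> str:
    """Roll per-stage statuses up into the cascade-level status field.
    One interpretation of the four states in the cascades table."""
    if not stage_statuses:
        return "running"
    failed = sum(s == "failure" for s in stage_statuses)
    done = all(s in ("success", "failure") for s in stage_statuses)
    if failed == 0:
        return "success" if done else "running"
    if failed == len(stage_statuses):
        return "failure"
    # Some stages failed while others succeeded or are still pending.
    return "partial_failure"

print(cascade_status(["success", "success"]))  # → success
print(cascade_status(["success", "failure"]))  # → partial_failure
print(cascade_status(["success", "running"]))  # → running
```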

Cascade Logic (Pseudocode)

# When GitHub webhook arrives for workflow_run.completed
def handle_workflow_complete(event):
    project = lookup_project(event.repository.full_name)
    build = upsert_build(project, event.workflow_run)

    # Find cascade stages waiting on this project *and* this ref, so a
    # build on one branch can't satisfy a cascade running on another
    pending_stages = find_pending_stages(project_id=project.id, ref=build.ref)

    for stage in pending_stages:
        if all_dependencies_complete(stage):
            if all_dependencies_successful(stage):
                trigger_downstream(stage)
            else:
                mark_cascade_failed(stage.cascade_id)

def trigger_downstream(stage):
    project = get_project(stage.project_id)

    # Use repository_dispatch to trigger downstream
    github_api.post(
        f"/repos/{project.repo}/dispatches",
        json={
            "event_type": "cascade_trigger",
            "client_payload": {
                "cascade_id": stage.cascade_id,
                "upstream_artifacts": get_upstream_artifacts(stage),
                "ref": stage.cascade.trigger_ref
            }
        }
    )

    stage.status = "triggered"
    db.commit()
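The `all_dependencies_complete` / `all_dependencies_successful` checks above reduce to a lookup over the `depends_on` array in `cascade_stages`. A self-contained sketch of that resolution step, with in-memory dicts standing in for the database (names and shapes are illustrative, not part of the design):

```python
from dataclasses import dataclass, field

@dataclass
class Stage:
    project_id: int
    depends_on: list[int] = field(default_factory=list)  # upstream project_ids
    status: str = "pending"  # pending / triggered / success / failure

def resolve(stages: dict[int, Stage]) -> list[int]:
    """Return project_ids whose stage should be triggered now:
    still pending, with every upstream dependency in 'success'."""
    ready = []
    for pid, stage in stages.items():
        if stage.status != "pending":
            continue
        deps = [stages[d] for d in stage.depends_on]
        if all(d.status == "success" for d in deps):
            ready.append(pid)
    return ready

# ROCm(1) -> PyTorch(2) -> vLLM(3), the linear cascade from the requirements
stages = {
    1: Stage(1, [], "success"),
    2: Stage(2, [1]),
    3: Stage(3, [2]),
}
print(resolve(stages))  # → [2]  (PyTorch is unblocked; vLLM still waits)
```

The same function runs again after each webhook: once PyTorch reaches success, the next call returns `[3]` and vLLM is triggered.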

Dashboard Views

  1. Release Health Grid

    • Rows: Projects (ROCm, PyTorch, vLLM, ...)
    • Columns: Branches/releases (main, release-6.4, ...)
    • Cells: ✅❌🔄 with link to build details
  2. Cascade View

    • Visual DAG showing ROCm → PyTorch → vLLM
    • Each node shows status, duration, artifact links
    • Click to drill into specific build
  3. Build Detail

    • Job list with status
    • Test results summary (if available)
    • Artifact links (wheels, tarballs, etc.)
    • Link to GitHub Actions run
  4. Historical Trends

    • Success rate over time per project
    • Build duration trends
    • Flaky test identification
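The health grid in view 1 is essentially a pivot of the latest-build rows into a {project → {ref → status}} mapping. A minimal sketch (the icon mapping mirrors the ✅❌🔄 cells described above):

```python
from collections import defaultdict

# (project, ref, status) rows, e.g. from a latest-build-per-ref query
rows = [
    ("rocm", "main", "success"),
    ("rocm", "release-6.4", "failure"),
    ("pytorch", "main", "running"),
]

ICONS = {"success": "✅", "failure": "❌", "running": "🔄"}

def build_grid(rows):
    """Pivot flat status rows into the project-by-ref grid the UI renders."""
    grid = defaultdict(dict)
    for project, ref, status in rows:
        grid[project][ref] = ICONS.get(status, "?")
    return dict(grid)

grid = build_grid(rows)
print(grid["rocm"])  # → {'main': '✅', 'release-6.4': '❌'}
```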

Implementation Approach

Phase 1: Foundation (Week 1)

  • Set up PostgreSQL on RDS
  • Implement GitHub webhook receiver (Lambda or ECS)
  • Create data model and basic CRUD
  • Single project tracking (ROCm only)
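One detail worth getting right in the webhook receiver from day one: GitHub signs each delivery with an HMAC-SHA256 of the raw request body, sent in the `X-Hub-Signature-256` header, and the receiver should verify it before trusting the event. A stdlib-only sketch (the secret name is illustrative):

```python
import hashlib
import hmac

def verify_signature(secret: str, body: bytes, signature_header: str) -> bool:
    """Check GitHub's X-Hub-Signature-256 header ('sha256=<hex digest>')
    against an HMAC-SHA256 of the raw request body."""
    expected = "sha256=" + hmac.new(
        secret.encode(), body, hashlib.sha256
    ).hexdigest()
    # Constant-time comparison to avoid timing side channels.
    return hmac.compare_digest(expected, signature_header)

body = b'{"action": "completed"}'
secret = "quartz-webhook-secret"  # shared secret configured on the GitHub webhook
good = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
print(verify_signature(secret, body, good))         # → True
print(verify_signature(secret, body, "sha256=00"))  # → False
```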

Phase 2: Cascade Engine (Week 2)

  • Add cascade definitions (YAML or DB config)
  • Implement cascade trigger logic
  • Add downstream project tracking (PyTorch)
  • Test end-to-end cascade

Phase 3: REST API (Week 3)

  • FastAPI implementation
  • All status endpoints
  • Authentication (API keys or GitHub App)
  • OpenAPI documentation

Phase 4: Dashboard (Week 4)

  • Next.js scaffold
  • Health grid view
  • Cascade visualization
  • Build drill-down

Phase 5: Polish (Week 5+)

  • Test result ingestion
  • Artifact tracking improvements
  • Alerting (Slack/email on failures)
  • Historical analytics

Why "status.json" Doesn't Work

The engineer's proposed approach of materializing a status.json file in some ad-hoc location falls short on several counts:

  1. No cascade state: Can't track "ROCm passed → waiting for PyTorch → PyTorch running"
  2. No history: Can't answer "when did this start failing?"
  3. No atomicity: Race conditions when multiple builds complete
  4. No queries: Can't filter by project, branch, time range
  5. No relationships: Can't link builds to cascades to artifacts
  6. No REST: Would need to build API layer anyway

A database is the correct foundation. The engineer should start there.


Tech Stack Recommendation

Component          Technology                           Rationale
Database           PostgreSQL (RDS)                     Relational model fits the domain; AWS-native
Webhook Handler    Python (Lambda) or FastAPI (ECS)     Simple, fast to develop
REST API           FastAPI                              Modern Python; auto-generates OpenAPI docs
Dashboard          Next.js                              Same stack as PyTorch HUD; strong React ecosystem
Cascade Triggers   GitHub API (repository_dispatch)     PATs already in hand; simple
Artifact Storage   S3                                   Already in use for artifacts
Hosting            AWS (ECS or Lambda + API Gateway)    Existing AWS infrastructure

Guidance for the Engineer

  1. Start with the database schema. Model the domain (projects, builds, cascades) before writing any API code.

  2. Build the webhook receiver first. Get GitHub events flowing into the database. This validates the data model.

  3. Add REST endpoints incrementally. Start with /projects and /builds, then add cascade queries.

  4. Dashboard last. The API should be complete before building UI.

  5. Use a coding agent. Claude Code or Cursor can scaffold FastAPI + Next.js in an afternoon. Focus on domain logic, not boilerplate.

  6. Don't over-engineer. Start with 2-3 projects, linear cascade. Add complexity when needed.


Open Questions for TL

  1. GitHub App vs PAT: Should Quartz authenticate as a GitHub App (more robust) or use PATs (simpler)?

  2. 3P project handling: For true 3P projects (llamacpp example), do they push status to us, or do we poll them?

  3. Artifact handoff: How are artifacts passed between cascade stages? S3 bucket with naming convention?

  4. Test result ingestion: Do downstream projects produce JUnit XML? Where is it stored?

  5. SLA for cascade completion: How fast should ROCm → PyTorch → vLLM complete? (Affects architecture choices)
