This document provides architectural guidance for "Quartz" - a PyTorch HUD-like system for ROCm downstream CI/CD orchestration. The junior engineer's instinct to start with status.json is understandable but insufficient for the stated requirements. A database-first approach is correct.
Key requirements:
- Cascade triggering: ROCm build → PyTorch → vLLM/sglang (with artifacts)
- Health aggregation: Unified view of build health across ~20 downstream projects
- DevOps efficiency: Central team manages O(20) projects
- REST endpoints: Query release/branch status, artifacts, jobs, signals
- Dashboards: PyTorch HUD-style with drill-down capability
Option 1: GitHub-native primitives

What it provides:
- `repository_dispatch` can trigger cross-repo cascades
- GitHub Environments can gate deployments
- Check Runs API provides rich status reporting
- Deployments API tracks deployment history
Why it doesn't fit:
- ❌ No aggregated visibility - "peer-to-peer choreography" means no central view
- ❌ No built-in dashboard - would need to build one anyway
- ❌ Fragile at scale - 20 repos cross-triggering becomes unmaintainable
- ❌ No cascade state - can't answer "why did this cascade stop?"
- ❌ PAT management burden for cross-repo triggers
Verdict: Use for per-repo CI execution, but need external orchestration layer.
Option 2: Zuul

What it provides:
- Purpose-built for multi-repo gating (manages 100+ OpenStack repos)
- Cross-project dependencies via `Depends-On:` commit footers
- Shared change queues for coupled projects
- Built-in dashboard and REST API
- GitHub Checks API integration
Why it might not fit:
- ⚠️ Steep learning curve - designed for OpenStack's workflow
- ⚠️ Heavy operational footprint (Kubernetes operator, Percona XtraDB, etc.)
- ⚠️ Primarily designed for "gating" (blocking merges), not "reporting"
- ⚠️ Overkill for 20 projects if you don't need commit-level gating
When to use: If you want sophisticated cross-repo gating where failures in downstream block upstream merges. Best for large-scale, high-discipline environments.
Verdict: Powerful but likely over-engineered for ROCm's current needs. Keep on radar for future.
Option 3: GoCD

What it provides:
- Explicit fan-in/fan-out primitives for cascade orchestration
- Value Stream Map (VSM) visualization of end-to-end pipelines
- Mature REST API with good coverage
- Simpler than Zuul, still sophisticated
Why it might not fit:
- ⚠️ Another CI system to run alongside GitHub Actions
- ⚠️ Learning curve for pipeline-as-code DSL
- ⚠️ Less GitHub-native than other options
When to use: If DevOps wants an off-the-shelf orchestrator with good visualization and doesn't mind running additional infrastructure.
Verdict: Strong alternative if bespoke feels too risky. Evaluate seriously.
Option 4: Fork PyTorch HUD

What it provides:
- Proven architecture (Next.js + ClickHouse + GitHub webhooks)
- Open source at `pytorch/test-infra`
- ML-powered failure classification
- Full log viewer, artifact access, branch switching
- Already handles complex GitHub Actions workflows
Why it might not fit:
- ⚠️ Designed for monitoring, not orchestrating cascades
- ⚠️ Would need significant customization for cascade tracking
- ⚠️ ClickHouse Cloud dependency (or self-host overhead)
- ⚠️ PyTorch-specific assumptions baked in
What it would need:
- Add cascade DAG definition and tracking
- Add cross-repo trigger coordination
- Replace PyTorch-specific queries with ROCm equivalents
- Potentially switch from ClickHouse to PostgreSQL for simplicity
Verdict: Good reference architecture, but forking adds maintenance burden. Better to learn from it and build simpler.
Option 5: Bespoke Quartz (recommended)

What to build:
- Database-first: PostgreSQL for events, builds, cascades, artifacts
- GitHub webhooks: Listen to `workflow_run` completed events
- REST API: FastAPI/Express for status queries
- Dashboard: Next.js with simple tables and drill-down
- Cascade orchestration: Trigger downstream repos via `repository_dispatch`
Why this fits:
- ✅ Right-sized for ~20 projects
- ✅ Standard tech (PostgreSQL, Python/TypeScript, AWS)
- ✅ Full control over cascade logic
- ✅ Easy to spin up on AWS (RDS, Lambda/ECS, S3)
- ✅ Junior engineer can learn by building something tractable
- ✅ Buildable in a few days with coding agent assistance
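To make the cascade logic concrete, the DAG itself can live in a small declarative structure (YAML or, as sketched here, Python). This is an illustrative sketch only: names like `CascadeStage` and `ready_stages` are assumptions, not an existing API.

```python
from dataclasses import dataclass, field

# Hypothetical declarative cascade definition; stage names mirror the
# ROCm -> PyTorch -> vLLM/sglang cascade described above.
@dataclass
class CascadeStage:
    project: str                                     # e.g. "pytorch"
    depends_on: list = field(default_factory=list)   # upstream project names

NIGHTLY = [
    CascadeStage("rocm"),
    CascadeStage("pytorch", depends_on=["rocm"]),
    CascadeStage("vllm", depends_on=["pytorch"]),    # fan-out after PyTorch
    CascadeStage("sglang", depends_on=["pytorch"]),
]

def ready_stages(stages, completed):
    """Return stages not yet run whose upstream dependencies all succeeded."""
    return [s for s in stages
            if s.project not in completed
            and all(dep in completed for dep in s.depends_on)]
```

The cascade engine would call something like `ready_stages` each time a build completes, and fire `repository_dispatch` for each result.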
┌─────────────────────────────────────────────────────────────────┐
│ GitHub Actions │
│ (ROCm builds, PyTorch tests, vLLM tests, etc.) │
└─────────────────┬───────────────────────────────────────────────┘
│ webhook: workflow_run.completed
▼
┌─────────────────────────────────────────────────────────────────┐
│ Quartz Orchestrator │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
│ │ Webhook │ │ Cascade │ │ GitHub Trigger │ │
│ │ Receiver │──│ Engine │──│ (repository_dispatch) │ │
│ └─────────────┘ └──────┬──────┘ └─────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ PostgreSQL │ │
│ │ - builds │ │
│ │ - cascades │ │
│ │ - artifacts│ │
│ └─────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ REST API │ │
│ │ (FastAPI) │ │
│ └─────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Quartz Dashboard │
│ (Next.js / React) │
│ - Release/branch health grid │
│ - Cascade status (ROCm → PyTorch → vLLM) │
│ - Drill-down to jobs, tests, logs │
│ - Artifact links (S3) │
└─────────────────────────────────────────────────────────────────┘
-- Core entities
CREATE TABLE projects (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL, -- 'rocm', 'pytorch', 'vllm'
repo TEXT NOT NULL, -- 'ROCm/TheRock', 'pytorch/pytorch'
category TEXT NOT NULL, -- 'core', '1p_downstream', '3p_downstream'
UNIQUE(repo)
);
CREATE TABLE builds (
id SERIAL PRIMARY KEY,
project_id INTEGER REFERENCES projects(id),
github_run_id BIGINT NOT NULL,
ref TEXT NOT NULL, -- branch or tag
sha TEXT NOT NULL,
status TEXT NOT NULL, -- 'pending', 'running', 'success', 'failure'
started_at TIMESTAMPTZ,
completed_at TIMESTAMPTZ,
artifacts JSONB, -- S3 URLs, checksums
UNIQUE(github_run_id)
);
CREATE TABLE cascades (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL, -- 'nightly', 'release-6.4'
trigger_ref TEXT NOT NULL, -- what ref triggered this cascade
status TEXT NOT NULL, -- 'running', 'success', 'partial_failure', 'failure'
started_at TIMESTAMPTZ,
completed_at TIMESTAMPTZ
);
CREATE TABLE cascade_stages (
id SERIAL PRIMARY KEY,
cascade_id INTEGER REFERENCES cascades(id),
project_id INTEGER REFERENCES projects(id),
build_id INTEGER REFERENCES builds(id),
stage_order INTEGER NOT NULL, -- 0=ROCm, 1=PyTorch, 2=vLLM
depends_on INTEGER[], -- array of project_ids that must complete first
status TEXT NOT NULL
);
-- For tracking test results if available
CREATE TABLE test_results (
id SERIAL PRIMARY KEY,
build_id INTEGER REFERENCES builds(id),
suite_name TEXT,
passed INTEGER,
failed INTEGER,
skipped INTEGER,
details_url TEXT -- link to test report
);

GET /api/v1/health
→ Overall system health
GET /api/v1/releases/{release}/status
→ Full status for a release (branches, cascades, all projects)
GET /api/v1/projects
→ List all tracked projects with latest build status
GET /api/v1/projects/{project}/builds
→ Build history for a project, with filtering
GET /api/v1/cascades
→ List cascades (nightly, release, etc.)
GET /api/v1/cascades/{cascade_id}
→ Full cascade status with all stages
GET /api/v1/builds/{build_id}
→ Detailed build info including artifacts, test results
POST /api/v1/cascades
→ Manually trigger a cascade (for releases)
POST /api/v1/webhooks/github
→ GitHub webhook receiver (workflow_run events)
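As an example of what sits behind these endpoints: the release health grid reduces to a latest-build-per-(project, ref) query over the builds table. A self-contained sketch, using sqlite3 as a stand-in for PostgreSQL (the window-function SQL is the same on both; columns are simplified from the schema above):

```python
import sqlite3

# In-memory stand-in for the Postgres builds table (columns simplified).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE builds (
        project TEXT, ref TEXT, status TEXT, completed_at TEXT
    );
    INSERT INTO builds VALUES
        ('rocm',    'main', 'success', '2024-01-01T00:00:00Z'),
        ('rocm',    'main', 'failure', '2024-01-02T00:00:00Z'),
        ('pytorch', 'main', 'success', '2024-01-02T00:00:00Z');
""")

# Latest build per (project, ref): one row per cell of the health grid.
# ROW_NUMBER() OVER (...) needs SQLite 3.25+ / any modern PostgreSQL.
rows = conn.execute("""
    SELECT project, ref, status FROM (
        SELECT project, ref, status,
               ROW_NUMBER() OVER (
                   PARTITION BY project, ref
                   ORDER BY completed_at DESC) AS rn
        FROM builds
    )
    WHERE rn = 1
    ORDER BY project
""").fetchall()
```

Here `rows` holds the most recent status per project/branch, which is exactly what the grid renders.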
# When GitHub webhook arrives for workflow_run.completed
def handle_workflow_complete(event):
    project = lookup_project(event.repository.full_name)
    build = upsert_build(project, event.workflow_run)

    # Find any cascade stages waiting on this build
    pending_stages = find_pending_stages(project_id=project.id)
    for stage in pending_stages:
        if all_dependencies_complete(stage):
            if all_dependencies_successful(stage):
                trigger_downstream(stage)
            else:
                mark_cascade_failed(stage.cascade_id)

def trigger_downstream(stage):
    project = get_project(stage.project_id)
    # Use repository_dispatch to trigger the downstream repo
    github_api.post(
        f"/repos/{project.repo}/dispatches",
        json={
            "event_type": "cascade_trigger",
            "client_payload": {
                "cascade_id": stage.cascade_id,
                "upstream_artifacts": get_upstream_artifacts(stage),
                "ref": stage.cascade.trigger_ref
            }
        }
    )
    stage.status = "triggered"
    db.commit()

- Release Health Grid
  - Rows: Projects (ROCm, PyTorch, vLLM, ...)
  - Columns: Branches/releases (main, release-6.4, ...)
  - Cells: ✅❌🔄 with link to build details
- Cascade View
  - Visual DAG showing ROCm → PyTorch → vLLM
  - Each node shows status, duration, artifact links
  - Click to drill into specific build
- Build Detail
  - Job list with status
  - Test results summary (if available)
  - Artifact links (wheels, tarballs, etc.)
  - Link to GitHub Actions run
- Historical Trends
  - Success rate over time per project
  - Build duration trends
  - Flaky test identification
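The historical-trends view needs only simple aggregation over stored builds. A sketch of a per-project, per-day success-rate rollup (pure Python for illustration; in practice this would be a `GROUP BY` in SQL):

```python
from collections import defaultdict

def success_rate_by_day(builds):
    """builds: iterable of (project, day, status) tuples.

    Returns {(project, day): fraction_of_successful_builds}.
    """
    totals = defaultdict(lambda: [0, 0])  # (project, day) -> [successes, total]
    for project, day, status in builds:
        bucket = totals[(project, day)]
        bucket[1] += 1
        if status == "success":
            bucket[0] += 1
    return {key: ok / n for key, (ok, n) in totals.items()}
```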
Phase 1 - Foundation:
- Set up PostgreSQL on RDS
- Implement GitHub webhook receiver (Lambda or ECS)
- Create data model and basic CRUD
- Single project tracking (ROCm only)

Phase 2 - Cascade orchestration:
- Add cascade definitions (YAML or DB config)
- Implement cascade trigger logic
- Add downstream project tracking (PyTorch)
- Test end-to-end cascade

Phase 3 - REST API:
- FastAPI implementation
- All status endpoints
- Authentication (API keys or GitHub App)
- OpenAPI documentation

Phase 4 - Dashboard:
- Next.js scaffold
- Health grid view
- Cascade visualization
- Build drill-down

Phase 5 - Hardening:
- Test result ingestion
- Artifact tracking improvements
- Alerting (Slack/email on failures)
- Historical analytics
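The webhook receiver from the first milestone must authenticate GitHub's deliveries. GitHub signs each payload body with the webhook secret (HMAC-SHA256) and sends the result in the `X-Hub-Signature-256` header. A minimal stdlib sketch of the check, with the FastAPI wiring omitted:

```python
import hashlib
import hmac

def verify_github_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Validate the X-Hub-Signature-256 header GitHub sends with each webhook.

    GitHub computes HMAC-SHA256 over the raw request body using the shared
    webhook secret and sends it as 'sha256=<hexdigest>'. Always compare with
    hmac.compare_digest to avoid timing attacks.
    """
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)
```

The receiver should reject any request that fails this check before parsing the payload.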
The engineer's proposed approach of materializing status.json in some ad-hoc location fails for several reasons:
- No cascade state: Can't track "ROCm passed → waiting for PyTorch → PyTorch running"
- No history: Can't answer "when did this start failing?"
- No atomicity: Race conditions when multiple builds complete
- No queries: Can't filter by project, branch, time range
- No relationships: Can't link builds to cascades to artifacts
- No REST: Would need to build API layer anyway
A database is the correct foundation. The engineer should start there.
| Component | Technology | Rationale |
|---|---|---|
| Database | PostgreSQL (RDS) | Relational fits the domain, AWS-native |
| Webhook Handler | Python (Lambda) or FastAPI (ECS) | Simple, fast to develop |
| REST API | FastAPI | Modern Python, auto-generates OpenAPI |
| Dashboard | Next.js | Same as PyTorch HUD, good React ecosystem |
| Cascade Triggers | GitHub API (`repository_dispatch`) | Already have PATs, simple |
| Artifact Storage | S3 | Already using for artifacts |
| Hosting | AWS (ECS or Lambda + API Gateway) | Already have AWS infra |
- Start with the database schema. Model the domain (projects, builds, cascades) before writing any API code.
- Build the webhook receiver first. Get GitHub events flowing into the database. This validates the data model.
- Add REST endpoints incrementally. Start with `/projects` and `/builds`, then add cascade queries.
- Dashboard last. The API should be complete before building UI.
- Use a coding agent. Claude Code or Cursor can scaffold FastAPI + Next.js in an afternoon. Focus on domain logic, not boilerplate.
- Don't over-engineer. Start with 2-3 projects, linear cascade. Add complexity when needed.
- GitHub App vs PAT: Should Quartz authenticate as a GitHub App (more robust) or use PATs (simpler)?
- 3P project handling: For true 3P projects (llamacpp example), do they push status to us, or do we poll them?
- Artifact handoff: How are artifacts passed between cascade stages? S3 bucket with naming convention?
- Test result ingestion: Do downstream projects produce JUnit XML? Where is it stored?
- SLA for cascade completion: How fast should ROCm → PyTorch → vLLM complete? (Affects architecture choices)
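On the artifact-handoff question, one plausible answer is a deterministic S3 key convention, so a downstream job can derive its upstream's artifact location from cascade metadata alone, without a registry lookup. This is a hypothetical convention, not an established one:

```python
def artifact_prefix(cascade_name: str, stage_project: str, sha: str) -> str:
    """Hypothetical S3 key prefix for cascade artifacts.

    Deterministic from (cascade, project, commit), so e.g. the PyTorch stage
    can locate ROCm wheels given only the client_payload of its dispatch.
    """
    return f"cascades/{cascade_name}/{stage_project}/{sha[:12]}/"
```

A downstream workflow would then read from, say, `artifact_prefix("nightly", "rocm", sha)` and write its own outputs under its own prefix.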