@avivl
Created January 28, 2026 15:46
X Thread: SRE Skills for Clawdbot (ready to post)

X Thread: SRE Skills for Clawdbot

Post as thread on @avivl account.


I taught my AI assistant to do SRE.

Not just "summarize logs", but actual incident response, alert analysis, and engineering metrics.

Here's what an AI-powered operations toolkit looks like 🧵

1/11


Skill #1: Incident Response

My Clawdbot can now:

  • Check production health across 67 Cloud Run services
  • Correlate alerts with recent deploys
  • Suggest rollbacks or scaling fixes
  • Follow actual runbooks, not hallucinate them

2/11


The key: structured diagnostics.

It knows to check:

  1. Recent deployments (GitHub Actions)
  2. Error logs (Cloud Logging)
  3. Service metrics (latency, memory)
  4. External dependencies (API quotas)

In that order. Like an actual SRE would.

3/11
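
The ordered checklist above can be sketched as a small pipeline. This is a minimal illustration, not the skill's actual implementation: the `Finding` type, the four check functions, and their canned results are all hypothetical stand-ins for real queries against CI, logs, metrics, and quota APIs.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Finding:
    step: str    # which diagnostic produced this finding
    ok: bool     # False means the step found something suspicious
    detail: str  # short note for the human reviewer


# Placeholder checks. In practice each would wrap a real query;
# the canned results here are invented purely for illustration.
def check_recent_deployments() -> Finding:
    return Finding("recent_deployments", False, "deploy shortly before first alert")


def check_error_logs() -> Finding:
    return Finding("error_logs", False, "error rate rose after the deploy")


def check_service_metrics() -> Finding:
    return Finding("service_metrics", True, "latency and memory look normal")


def check_external_dependencies() -> Finding:
    return Finding("external_dependencies", True, "API quotas not exhausted")


# Order matters: it mirrors the checklist in the thread.
DIAGNOSTIC_ORDER: List[Callable[[], Finding]] = [
    check_recent_deployments,
    check_error_logs,
    check_service_metrics,
    check_external_dependencies,
]


def run_diagnostics(checks: List[Callable[[], Finding]] = DIAGNOSTIC_ORDER) -> List[Finding]:
    """Run every check in order and collect the findings."""
    return [check() for check in checks]


def first_suspect(findings: List[Finding]) -> Optional[Finding]:
    """Return the earliest failing step, or None if everything looks healthy."""
    return next((f for f in findings if not f.ok), None)
```

Calling `first_suspect(run_diagnostics())` returns the earliest step that flagged a problem, so a recent deploy is considered before log noise or quota issues.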


Skill #2: Alert Insights

Weekly analysis of production alerts:

  • Scans Gmail for alert patterns
  • Identifies noisy/flapping alerts
  • Cross-references with monitoring config
  • Recommends specific threshold changes

Turns alert fatigue into actionable PRs.

4/11
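
One simple proxy for a noisy alert is how often it fires in the review window. The sketch below assumes the alert emails have already been parsed into `(alert_name, state)` pairs. The function name, the state strings, and the `min_firings` threshold are hypothetical choices for illustration, not the skill's real logic.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def noisy_alerts(
    events: Iterable[Tuple[str, str]],
    min_firings: int = 5,  # hypothetical threshold, tune per team
) -> List[Tuple[str, int]]:
    """Return (alert_name, firing_count) for alerts that fired at least
    min_firings times, most frequent first.

    Each event is an (alert_name, state) pair, where state is
    "firing" or "resolved". Only "firing" events are counted.
    """
    counts = Counter(name for name, state in events if state == "firing")
    return [(name, n) for name, n in counts.most_common() if n >= min_firings]
```

An alert that fires and resolves many times a week is a candidate for a threshold change, while one that fired once is left alone.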


The magic: it reads our actual infra code.

Points to specific files: "Adjust error threshold in src/core/services/monitoring/error-reporting.ts line 47"

Not generic advice. Real code changes.

5/11


Skill #3: DORA Metrics

Tracks the 4 key DevOps metrics:

  • Deployment Frequency
  • Lead Time for Changes
  • Change Failure Rate
  • MTTR

Weekly reports with trends and per-service breakdowns.

6/11


Data sources it pulls from:

  • GitHub Actions → deploy frequency, failure rate
  • GitHub PRs → lead time (created → merged)
  • Gmail alerts → MTTR (alert → resolved)

All automated. No manual spreadsheets.

7/11
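
Once the raw records are collected, the four metrics reduce to a few lines of arithmetic. This is a rough sketch under stated assumptions: the function names and input shapes are hypothetical, and treating a failed deploy run as a failed change is a simplification of Change Failure Rate.

```python
from datetime import datetime, timedelta
from typing import Sequence, Tuple

Interval = Tuple[datetime, datetime]  # (start, end)


def mean_duration(intervals: Sequence[Interval]) -> timedelta:
    """Average of end - start.

    Works for lead time (PR created -> merged) and for
    MTTR (alert fired -> resolved).
    """
    if not intervals:
        return timedelta(0)
    total = sum((end - start for start, end in intervals), timedelta(0))
    return total / len(intervals)


def deployment_frequency(deploy_count: int, days: int) -> float:
    """Deploys per day over the reporting window."""
    return deploy_count / days


def change_failure_rate(deploy_succeeded: Sequence[bool]) -> float:
    """Fraction of deploys that failed; each item is True if that deploy succeeded."""
    if not deploy_succeeded:
        return 0.0
    failures = sum(1 for ok in deploy_succeeded if not ok)
    return failures / len(deploy_succeeded)
```

For example, two PRs merged 2 and 4 hours after creation give a mean lead time of 3 hours, and 14 deploys in a 7-day window give a frequency of 2 per day.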


The pattern: Skills = Runbooks as Code

Each skill is:

  • A markdown file with procedures
  • CLI commands it can run
  • Context about our specific infra

AI follows the runbook. Humans review the output.

8/11
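
To make "runbooks as code" concrete, here is a sketch of pulling the runnable commands out of a skill file and checking them against an allowlist before anything executes. The skill text, the `$ ` prefix convention for commands, and the allowlist are all assumptions made for this example. The real skill format is not shown in the thread.

```python
import shlex
from typing import List

# Hypothetical skill file. The "$ " prefix marking a runnable
# command is a convention invented for this sketch.
SKILL_MD = """\
# Incident Response

1. Check recent deployments
$ gh run list --limit 10
2. Check error logs
$ gcloud logging read "severity>=ERROR" --limit=20
3. List services
$ gcloud run services list
"""

# Executables the assistant may invoke (illustrative allowlist).
ALLOWED_EXECUTABLES = {"gh", "gcloud"}


def extract_commands(markdown: str) -> List[List[str]]:
    """Return each "$ "-prefixed line as an argv list, in document order."""
    commands = []
    for line in markdown.splitlines():
        if line.startswith("$ "):
            commands.append(shlex.split(line[2:]))
    return commands


def is_allowed(argv: List[str]) -> bool:
    """Accept a command only if its executable is on the allowlist."""
    return bool(argv) and argv[0] in ALLOWED_EXECUTABLES
```

Keeping the commands in the markdown file means a human can review exactly what the assistant is permitted to run by reading the runbook itself.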


What changed for us:

Before: Wake up to 47 alerts, spend 30 min triaging
After: "Golem, what happened overnight?" → 2 min summary

Before: Monthly DORA review (if we remembered)
After: Weekly automated report in my inbox

9/11


The meta insight:

SRE is mostly pattern matching + executing known procedures.

That's exactly what AI is good at.

Humans should design the runbooks and make judgment calls. AI should execute the checklist.

10/11


All of this runs on a $34/month GCP VM.

Skills are just markdown files. No fancy infra needed.

Your AI assistant can be your junior SRE, if you teach it how.

#SRE #DevOps #AI #Clawdbot #PlatformEngineering

11/11
