
The Thoughtful Skeptic's Guide to AI Agents

When Your Code Knows It Needs Help (And When It Doesn't)

Microsoft AI Dev Days | 15-20 minutes + demos


The Hook (1 minute)

"I'm going to show you how to build AI agents that can escalate to smarter models when they're stuck. But first, I need to tell you why you probably shouldn't use most of what I'm about to show you in production. Yet."

The Tension: AI agents are incredibly powerful AND frequently unreliable. Both are true. Let's be honest about both.


Section 1: The Paradigm Shift (3 minutes)

The Core Insight

Software has always been instructions + tools. What's changed:

  • Instructions used to be for machines (precise, deterministic)
  • Now instructions are for AI (fuzzy, probabilistic)

The Real Shift

  • Old skill: Learning how to use tools
  • New skill: Learning how to write instructions for AI to use tools

Honest Acknowledgment

This is genuinely exciting. It's also:

  • Unpredictable
  • Expensive at scale
  • Sometimes hilariously wrong
  • A security surface area we don't fully understand

Section 2: markdown-agent (ma) - The Tool (3 minutes)

What It Is

# A markdown file IS an agent
ma deploy-preview.copilot.md

Why Markdown?

  • Version controlled instructions
  • Human readable, AI executable
  • Portable across teams
  • Auditable - you can review what the AI was told to do
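
One way to make the version-control and audit points concrete on stage (a hedged sketch; the file name is borrowed from the example above):

# Review every change ever made to the agent's instructions
git log -p -- deploy-preview.copilot.md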

Demo Idea #1: Simple Agent

---
command: gh-copilot
---
List the last 5 commits in this repo with their authors.

Show: deterministic task, bounded scope, verifiable output.
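
To make "verifiable output" concrete, diff the agent's answer against the deterministic equivalent (a hedged sketch; the exact git flags are my choice, not part of the demo):

# Ground truth for "last 5 commits with their authors"
git log -5 --pretty=format:'%h %an %s'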


Section 3: When NOT to Use Autonomous Agents (4 minutes)

The "Don't" List

1. High-Stakes Decisions

  • Financial transactions
  • Security-critical operations
  • Anything with legal implications
  • User data modifications without human confirmation

2. When You Need Determinism

  • Build scripts that must work every time
  • CI/CD pipelines (unless wrapped in heavy guardrails)
  • Anything where "usually works" isn't good enough

3. When Cost Matters

  • Token costs add up fast
  • Rate limits are real
  • A runaway agent loop can burn through quotas

4. When You Can't Verify Output

  • If you can't tell if the answer is right, don't automate it
  • "Looks plausible" is not the same as "correct"

The Pattern That Works

Human Intent → AI Draft → Human Verification → Execution

Not:

Human Intent → AI Execution → Hope
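
A minimal shell sketch of the pattern that works, assuming the AI can emit its proposal as a patch; ai_draft is a hypothetical stand-in for whatever agent command you use:

ai_draft "upgrade the lodash dependency" > change.patch   # AI drafts; nothing is applied
git apply --check change.patch                            # cheap mechanical sanity check
less change.patch                                         # human actually reads the diff
read -rp "Apply this patch? [y/N] " answer
if [ "$answer" = "y" ]; then
  git apply change.patch                                  # execution only after explicit approval
fi

The point is that the draft is an inert artifact a human can inspect before anything executes.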

Section 4: Security - The Uncomfortable Conversation (2 minutes)

Prompt Injection is Real

  • Any user input that reaches the AI is a potential attack vector
  • Markdown files can contain malicious instructions
  • `!command` execution in imports is powerful AND dangerous

Responsible Patterns

  1. Sandboxing: Run agents in containers with limited permissions
  2. Allowlists: Restrict which commands agents can invoke (see the sketch after this list)
  3. Audit Logging: Log every command an agent runs (ma does this!)
  4. Review Before Commit: Never let agents push directly to main
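
A hedged sketch of an allowlist wrapper, assuming the agent's tool calls are routed through a script like this (the wrapper itself and the allowlist contents are illustrative, not an ma feature):

# safe-exec.sh -- usage: safe-exec.sh <command> [args...]
# Only run commands the agent is allowed to invoke
case "$1" in
  git|ls|cat|rg)
    exec "$@"   # run the request; a real wrapper would also vet subcommands (e.g. block `git push`)
    ;;
  *)
    echo "blocked: '$1' is not on the allowlist" >&2
    exit 1
    ;;
esac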

Demo Idea #2: The Audit Trail

Show the logs at ~/.markdown-agent/logs/<agent>/ - prove you can trace what happened.
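
For the live demo, a couple of commands are enough (the directory comes from the line above; the file names and log format inside it depend on ma's actual output):

ls ~/.markdown-agent/logs/                     # one directory per agent
tail -n 20 ~/.markdown-agent/logs/<agent>/*    # recent entries for the agent you just ran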


Section 5: Intelligent Escalation - The Responsible Pattern (4 minutes)

The Core Idea

Instead of one powerful (expensive) agent, use a chain:

  1. Fast/cheap model tries first (Copilot CLI, GPT-4o-mini)
  2. If stuck, escalate to smarter model (Claude Opus, GPT-4)
  3. If still stuck, escalate to human
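
In plain shell, the chain might look like this (a hedged sketch; run_cheap_agent and run_smart_agent are hypothetical stand-ins for your actual CLIs, and failure is assumed to surface as a non-zero exit code):

task="Analyze why tests are failing and suggest a fix."

if run_cheap_agent "$task"; then      # tier 1: fast/cheap model
  exit 0
elif run_smart_agent "$task"; then    # tier 2: one escalation to a stronger model
  exit 0
else
  echo "Both tiers failed; hand off to a human." >&2
  exit 1
fi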

Why This Works

  • Most tasks are simple - don't pay for Opus to run git status
  • Expensive models only used when needed
  • Built-in circuit breaker: escalation has limits

Demo Idea #3: Escalation Chain

---
command: gh-copilot
on-failure: escalate.claude.md
max-escalations: 2
---
Analyze why tests are failing and suggest a fix.

Show:

  1. Copilot tries, gets confused by complex issue
  2. Automatically escalates to Claude
  3. Claude provides better analysis
  4. Human still reviews before applying

Honest Limitation

Escalation works when you can detect failure. Many AI failures look like success (confident wrong answers).
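
One partial mitigation is to gate "success" on an objective check rather than the agent's own report (a hedged sketch; the agent file name is illustrative and the check here is simply the project's test suite):

ma fix-tests.copilot.md            # agent attempts a fix
if npm test; then
  echo "verified: tests pass"
else
  echo "tests still fail despite the agent's claim; escalate" >&2
fi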


Section 6: What AI Agents Are Bad At (2 minutes)

Permanently Bad At

  • Novel reasoning - they remix, they don't truly innovate
  • Knowing what they don't know - confidence != correctness
  • Long-term consistency - context windows are real limits
  • Understanding consequences - they don't feel the pain of their mistakes

Currently Bad At (Improving)

  • Multi-step planning
  • Tool use reliability
  • Recovering from errors gracefully

The Takeaway

Use agents for augmentation, not replacement. The human in the loop isn't a bug, it's a feature.


Section 7: Responsible Deployment Patterns (2 minutes)

Pattern 1: The Preview Pattern

ma analyze.copilot.md --dry-run
# Review output
ma analyze.copilot.md --apply

Pattern 2: The Bounded Agent

  • Time limits (see the sketch after this list)
  • Token limits
  • Allowed action lists
  • Required confirmation for destructive operations
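
The time limit is the easiest bound to show live (a hedged sketch using GNU coreutils' timeout; the 120-second budget and the ma invocation are illustrative):

timeout 120 ma analyze.copilot.md --dry-run \
  || echo "agent failed or exceeded its 2-minute budget" >&2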

Pattern 3: The Pair Programming Model

Agent suggests, human approves. Every time.

Demo Idea #4: Guardrails in Action

Show an agent that:

  1. Proposes a git commit
  2. Shows the diff
  3. Requires --confirm to actually commit
  4. Logs the decision either way
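
A hedged sketch of what that guardrail could look like as a wrapper script (the --confirm handling and the log file are illustrative, not ma's built-in behavior):

# Assumes the agent has already staged its proposed change (step 1)
git diff --staged                                         # 2. show the proposed diff
if [ "${1:-}" = "--confirm" ]; then
  git commit -m "agent-proposed change"                   # 3. commit only on explicit confirmation
  echo "$(date -u) confirmed and committed" >> agent-decisions.log
else
  echo "dry run: re-run with --confirm to commit"
  echo "$(date -u) proposed, not confirmed" >> agent-decisions.log   # 4. log the decision either way
fi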

Closing: The Honest Promise (1 minute)

What I'm NOT Promising

  • That this will replace developers
  • That agents are production-ready for everything
  • That you should automate your entire workflow

What I AM Promising

  • This changes how we think about instructions
  • There's a responsible path forward
  • The skeptics who learn this will build better systems than the enthusiasts who ignore the risks

The Call to Action

"Be the developer who understands the limitations. You'll be more valuable than the one who only knows the hype."


Resources


Appendix: Demo Checklist

| Demo | Risk Level | What Could Go Wrong | Mitigation |
| --- | --- | --- | --- |
| Simple agent | Low | Copilot API down | Have backup recording |
| Audit trail | Low | Logs empty | Pre-run agents before talk |
| Escalation | Medium | Escalation fails weirdly | Have fallback slide |
| Guardrails | Low | Demo works too well | Show it blocking something |

Timing Summary

| Section | Time | Running Total |
| --- | --- | --- |
| Hook | 1 min | 1 min |
| Paradigm Shift | 3 min | 4 min |
| markdown-agent | 3 min | 7 min |
| When NOT to Use | 4 min | 11 min |
| Security | 2 min | 13 min |
| Escalation | 4 min | 17 min |
| AI Limitations | 2 min | 19 min |
| Deployment Patterns | 2 min | 21 min |
| Closing | 1 min | 22 min |

Total: ~22 minutes (trim demos if running long)
