
The Thoughtful Skeptic's Guide to AI Agents

When Your Code Knows It Needs Help (And When It Doesn't)

Microsoft AI Dev Days | 15-20 minutes + demos


The Hook (1 minute)

"I'm going to show you how to build AI agents that can escalate to smarter models when they're stuck. But first, I need to tell you why you probably shouldn't use most of what I'm about to show you in production. Yet."

The Tension: AI agents are incredibly powerful AND frequently unreliable. Both are true. Let's be honest about both.


Section 1: The Paradigm Shift (3 minutes)

The Core Insight

Software has always been instructions + tools. What's changed:

  • Instructions used to be for machines (precise, deterministic)
  • Now instructions are for AI (fuzzy, probabilistic)

The Real Shift

  • Old skill: Learning how to use tools
  • New skill: Learning how to write instructions for AI to use tools

Honest Acknowledgment

This is genuinely exciting. It's also:

  • Unpredictable
  • Expensive at scale
  • Sometimes hilariously wrong
  • A security surface area we don't fully understand

Section 2: markdown-agent (ma) - The Tool (3 minutes)

What It Is

# A markdown file IS an agent
ma deploy-preview.copilot.md

Why Markdown?

  • Version controlled instructions
  • Human readable, AI executable
  • Portable across teams
  • Auditable - you can review what the AI was told to do
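
One way to make the version-control and audit points concrete on stage (a hedged sketch; the file name is borrowed from the example above):

# Review every change ever made to the agent's instructions
git log -p -- deploy-preview.copilot.md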

Demo Idea #1: Simple Agent

---
command: gh-copilot
---
List the last 5 commits in this repo with their authors.

Show: deterministic task, bounded scope, verifiable output.
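
To make "verifiable output" concrete, diff the agent's answer against the deterministic equivalent (a hedged sketch; the exact git flags are my choice, not part of the demo):

# Ground truth for "last 5 commits with their authors"
git log -5 --pretty=format:'%h %an %s'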


Section 3: When NOT to Use Autonomous Agents (4 minutes)

The "Don't" List

1. High-Stakes Decisions

  • Financial transactions
  • Security-critical operations
  • Anything with legal implications
  • User data modifications without human confirmation

2. When You Need Determinism

  • Build scripts that must work every time
  • CI/CD pipelines (unless wrapped in heavy guardrails)
  • Anything where "usually works" isn't good enough

3. When Cost Matters

  • Token costs add up fast
  • Rate limits are real
  • A runaway agent loop can burn through quotas

4. When You Can't Verify Output

  • If you can't tell if the answer is right, don't automate it
  • "Looks plausible" is not the same as "correct"

The Pattern That Works

Human Intent → AI Draft → Human Verification → Execution

Not:

Human Intent → AI Execution → Hope
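
A minimal shell sketch of the pattern that works, assuming the AI can emit its proposal as a patch; ai_draft is a hypothetical stand-in for whatever agent command you use:

ai_draft "upgrade the lodash dependency" > change.patch   # AI drafts; nothing is applied
git apply --check change.patch                            # cheap mechanical sanity check
less change.patch                                         # human actually reads the diff
read -rp "Apply this patch? [y/N] " answer
if [ "$answer" = "y" ]; then
  git apply change.patch                                  # execution only after explicit approval
fi

The point is that the draft is an inert artifact a human can inspect before anything executes.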

Section 4: Security - The Uncomfortable Conversation (2 minutes)

Prompt Injection is Real

  • Any user input that reaches the AI is a potential attack vector
  • Markdown files can contain malicious instructions
  • `!command` execution in imports is powerful AND dangerous

Responsible Patterns

  1. Sandboxing: Run agents in containers with limited permissions
  2. Allowlists: Restrict which commands agents can invoke (see the sketch after this list)
  3. Audit Logging: Log every command an agent runs (ma does this!)
  4. Review Before Commit: Never let agents push directly to main
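
A hedged sketch of an allowlist wrapper, assuming the agent's tool calls are routed through a script like this (the wrapper itself and the allowlist contents are illustrative, not an ma feature):

# safe-exec.sh -- usage: safe-exec.sh <command> [args...]
# Only run commands the agent is allowed to invoke
case "$1" in
  git|ls|cat|rg)
    exec "$@"   # run the request; a real wrapper would also vet subcommands (e.g. block `git push`)
    ;;
  *)
    echo "blocked: '$1' is not on the allowlist" >&2
    exit 1
    ;;
esac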

Demo Idea #2: The Audit Trail

Show the logs at ~/.markdown-agent/logs/<agent>/ - prove you can trace what happened.
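
For the live demo, a couple of commands are enough (the directory comes from the line above; the file names and log format inside it depend on ma's actual output):

ls ~/.markdown-agent/logs/                     # one directory per agent
tail -n 20 ~/.markdown-agent/logs/<agent>/*    # recent entries for the agent you just ran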


Section 5: Intelligent Escalation - The Responsible Pattern (4 minutes)

The Core Idea

Instead of one powerful (expensive) agent, use a chain:

  1. Fast/cheap model tries first (Copilot CLI, GPT-4o-mini)
  2. If stuck, escalate to smarter model (Claude Opus, GPT-4)
  3. If still stuck, escalate to human
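
In plain shell, the chain might look like this (a hedged sketch; run_cheap_agent and run_smart_agent are hypothetical stand-ins for your actual CLIs, and failure is assumed to surface as a non-zero exit code):

task="Analyze why tests are failing and suggest a fix."

if run_cheap_agent "$task"; then      # tier 1: fast/cheap model
  exit 0
elif run_smart_agent "$task"; then    # tier 2: one escalation to a stronger model
  exit 0
else
  echo "Both tiers failed; hand off to a human." >&2
  exit 1
fi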

Why This Works

  • Most tasks are simple - don't pay for Opus to run git status
  • Expensive models only used when needed
  • Built-in circuit breaker: escalation has limits

Demo Idea #3: Escalation Chain

---
command: gh-copilot
on-failure: escalate.claude.md
max-escalations: 2
---
Analyze why tests are failing and suggest a fix.

Show:

  1. Copilot tries, gets confused by complex issue
  2. Automatically escalates to Claude
  3. Claude provides better analysis
  4. Human still reviews before applying

Honest Limitation

Escalation works when you can detect failure. Many AI failures look like success (confident wrong answers).
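
One partial mitigation is to gate "success" on an objective check rather than the agent's own report (a hedged sketch; the agent file name is illustrative and the check here is simply the project's test suite):

ma fix-tests.copilot.md            # agent attempts a fix
if npm test; then
  echo "verified: tests pass"
else
  echo "tests still fail despite the agent's claim; escalate" >&2
fi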


Section 6: What AI Agents Are Bad At (2 minutes)

Permanently Bad At

  • Novel reasoning - they remix, they don't truly innovate
  • Knowing what they don't know - confidence != correctness
  • Long-term consistency - context windows are real limits
  • Understanding consequences - they don't feel the pain of their mistakes

Currently Bad At (Improving)

  • Multi-step planning
  • Tool use reliability
  • Recovering from errors gracefully

The Takeaway

Use agents for augmentation, not replacement. The human in the loop isn't a bug, it's a feature.


Section 7: Responsible Deployment Patterns (2 minutes)

Pattern 1: The Preview Pattern

ma analyze.copilot.md --dry-run
# Review output
ma analyze.copilot.md --apply

Pattern 2: The Bounded Agent

  • Time limits (see the sketch after this list)
  • Token limits
  • Allowed action lists
  • Required confirmation for destructive operations
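
The time limit is the easiest bound to show live (a hedged sketch using GNU coreutils' timeout; the 120-second budget and the ma invocation are illustrative):

timeout 120 ma analyze.copilot.md --dry-run \
  || echo "agent failed or exceeded its 2-minute budget" >&2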

Pattern 3: The Pair Programming Model

Agent suggests, human approves. Every time.

Demo Idea #4: Guardrails in Action

Show an agent that:

  1. Proposes a git commit
  2. Shows the diff
  3. Requires --confirm to actually commit
  4. Logs the decision either way
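
A hedged sketch of what that guardrail could look like as a wrapper script (the --confirm handling and the log file are illustrative, not ma's built-in behavior):

# Assumes the agent has already staged its proposed change (step 1)
git diff --staged                                         # 2. show the proposed diff
if [ "${1:-}" = "--confirm" ]; then
  git commit -m "agent-proposed change"                   # 3. commit only on explicit confirmation
  echo "$(date -u) confirmed and committed" >> agent-decisions.log
else
  echo "dry run: re-run with --confirm to commit"
  echo "$(date -u) proposed, not confirmed" >> agent-decisions.log   # 4. log the decision either way
fi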

Closing: The Honest Promise (1 minute)

What I'm NOT Promising

  • That this will replace developers
  • That agents are production-ready for everything
  • That you should automate your entire workflow

What I AM Promising

  • This changes how we think about instructions
  • There's a responsible path forward
  • The skeptics who learn this will build better systems than the enthusiasts who ignore the risks

The Call to Action

"Be the developer who understands the limitations. You'll be more valuable than the one who only knows the hype."


Resources


Appendix: Demo Checklist

| Demo | Risk Level | What Could Go Wrong | Mitigation |
| --- | --- | --- | --- |
| Simple agent | Low | Copilot API down | Have backup recording |
| Audit trail | Low | Logs empty | Pre-run agents before talk |
| Escalation | Medium | Escalation fails weirdly | Have fallback slide |
| Guardrails | Low | Demo works too well | Show it blocking something |

Timing Summary

| Section | Time | Running Total |
| --- | --- | --- |
| Hook | 1 min | 1 min |
| Paradigm Shift | 3 min | 4 min |
| markdown-agent | 3 min | 7 min |
| When NOT to Use | 4 min | 11 min |
| Security | 2 min | 13 min |
| Escalation | 4 min | 17 min |
| AI Limitations | 2 min | 19 min |
| Deployment Patterns | 2 min | 21 min |
| Closing | 1 min | 22 min |

Total: ~22 minutes (trim demos if running long)
