@ttarler
Created March 11, 2026 18:42
Post 6: Documentation That Deploys

Documentation That Deploys: How We Turned Post-Mortems Into Production Safety Nets

Between January 9 and February 6, 2026, our AI-assisted trading system had seven significant incidents. The same root patterns — silent fallbacks, unverified deployments, training/inference mismatches — kept recurring despite post-mortems and a growing LESSONS_LEARNED.md. The lessons existed. The enforcement mechanism didn't.

This post is about the documentation system that emerged from those failures. Not documentation in the wiki sense — a self-reinforcing system where every incident writes the rules that prevent the next one, and those rules are embedded in the tools that AI coding agents read at the start of every session.

Post 5 ended with a tease: "I'll go deeper on the documentation system in the next post, including why we built it, how it prevents the kind of knowledge loss that contributed to our $78K incident, and how it works with AI coding agents." Here's the full story.

The Problem: Lessons That Don't Stick

AI coding agents lose context between sessions. A lesson learned Tuesday is forgotten Thursday. This isn't a minor inconvenience — it's an operational risk.

Between January 9 and February 6, 2026, I logged seven significant incidents:

| Date | Incident | Cost | Root Pattern |
| --- | --- | --- | --- |
| Jan 9 | EventBridge rules disabled | Missed trading window | Unverified deployment |
| Jan 12 | Silent heuristic fallback | $55-90 wasted | Silent fallback |
| Jan 12 | Constant prescreening scores | $20-40 wasted | Unverified output |
| Jan 13 | Alphabetical bias (all A-stocks) | Missed diversity | Unverified output |
| Jan 14 | RL model never used (6 days) | $150-250+ wasted | Feature mismatch + silent fallback |
| Jan 30 | KXIN concentration ($78,947 loss) | $78,947 | Silent fallback + missing risk check |
| Feb 6 | SageMaker segfault | 5 hours lost | Training/inference mismatch |

These weren't seven independent failures. They were variations on three themes: silent fallbacks masking broken systems, deployments assumed to work without verification, and training code leaking into inference containers.

We had post-mortems. We had a LESSONS_LEARNED.md. Incidents kept happening because there was no mechanism to turn a lesson into a constraint that an AI agent would respect on its next session. The post-mortem for Jan 12's silent fallback explicitly documented "NEVER use try/except with silent fallback for model inference." Six days later, the Jan 30 loss was caused by the exact same pattern — the RL endpoint's silent fallback to heuristics. The lesson existed. The enforcement mechanism didn't.

The question became: how do you make documentation that doesn't just record knowledge, but actively prevents the recurrence of the failures it documents?

Four Layers of Operational Documentation

The answer turned out to be four concentric layers, each with a different audience, update frequency, and enforcement mechanism. The innermost layer is the most concentrated and the most frequently consulted. Each outer layer adds depth and context.

Layer 1: CLAUDE.md — The Team Lead

CLAUDE.md sits at the repository root. AI coding agents load it automatically at the start of every session. It's the single most important file in the documentation system because it's the only one that's always in context.

It contains 12 policies. Each one traces directly to a specific failure:

## Key Policies (from past failures)

1. **Follow docs first** — Read procedures before acting. Don't assume.
2. **Never sleep/wait** — Check status once, report to user, move on.
3. **Verify deployments** — "Deployed" ≠ "Working". Always check logs.
4. **Fail fast, never silently** — No try/except fallbacks that mask failures.
5. **CI/CD is mandatory** — Never bypass PR checks, never merge before
   tests pass, never manually deploy Lambda (except emergencies).
6. **Separate training/inference code** — Ray imports in inference
   containers = segfault. Duplicate model architecture instead.
7. **Feature consistency** — Training and inference must use identical
   feature lists. Single source of truth.
8. **Feature expansion is additive** — New features need backward-compatible
   fallbacks. Never break existing data/models.
9. **Pipeline is event-driven** — Freeze pipeline state BEFORE stopping
   jobs, or orchestrator advances unexpectedly.
10. **Urgent failures = act first** — Stop failing jobs → deploy fix to S3 →
    reset SSM → retrigger → THEN do PR paperwork.
11. **Explicit plan approval** — Never start implementing until user says
    "go ahead".
12. **Overnight jobs** — Never say "you're good to sleep" until a job has
    COMPLETED successfully.

The mapping from policies to incidents is direct:

  • Policy 3 ("Verify deployments") ← Jan 14: RL model "deployed" but never actually loaded. Agent reported success without checking logs.
  • Policy 4 ("Fail fast, never silently") ← Jan 12 and Jan 30: Silent fallback to heuristics masked complete model failure for weeks.
  • Policy 6 ("Separate training/inference code") ← Feb 6: Used RLlib training script as inference entry point. Ray import caused a segfault. Five hours lost.
  • Policy 10 ("Urgent failures = act first") ← Feb 17: SageMaker jobs burning money while we debated the correct CI/CD process.

Each policy is a scar. The CLAUDE.md file isn't a style guide or a wish list — it's a compressed version of every expensive mistake we've made, written in imperative language that an AI agent can follow without ambiguity.

Layer 2: .ai-knowledge-base.yaml — Machine-Readable Patterns

CLAUDE.md tells agents what to do and what not to do, but it doesn't explain why, and it doesn't provide code examples. That's what the knowledge base is for.

.ai-knowledge-base.yaml is a 620-line structured YAML file containing 8 patterns (do this), 6 anti-patterns (don't do this), and 4 decision-tree scenarios. Each entry has a consistent schema:

- id: feature_verification
  name: "Feature Verification Checklist"
  severity: "CRITICAL"
  category: "process"
  description: |
    "Scheduled" ≠ "Working". Features must be verified in production,
    not just marked complete when code is written.
  do:
    - "Verify feature works in production after deployment"
    - "Check worker logs, not just scheduler logs"
    - "Monitor actual outcomes (BUY:SELL ratio, position closures)"
    - "Run smoke tests after deployment"
    - "Document verification evidence"
  dont:
    - "Mark complete based on code alone"
    - "Assume scheduled = working"
    - "Skip verification due to time constraints"
  checklist:
    - "Code written"
    - "Tests passing"
    - "Deployed to production"
    - "Verified working in production"
    - "Documented with verification evidence"

The key difference from CLAUDE.md: this file is structured enough for programmatic consumption. An AI agent can look up patterns by id, filter by severity: CRITICAL, or follow a decision tree step by step. The do and dont arrays are unambiguous directives. The checklist provides sequential verification steps. The reference_files field points to the exact source code locations where the pattern applies.
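
Once an agent (or a script) parses the file with a YAML loader, that lookup is plain dictionary filtering. A minimal sketch, with two entries stubbed inline the way they would appear after loading; the deployment_process severity shown here is an assumption for illustration:

```python
# Entries as they would appear after yaml.safe_load(".ai-knowledge-base.yaml").
# Stubbed inline here; the HIGH severity on deployment_process is assumed.
patterns = [
    {"id": "feature_verification", "severity": "CRITICAL", "category": "process"},
    {"id": "deployment_process", "severity": "HIGH", "category": "process"},
]

def find_pattern(patterns, pattern_id):
    """Look up a single pattern entry by its stable id."""
    return next((p for p in patterns if p["id"] == pattern_id), None)

def critical_patterns(patterns):
    """Filter to the entries marked severity: CRITICAL."""
    return [p for p in patterns if p["severity"] == "CRITICAL"]

print(find_pattern(patterns, "feature_verification")["category"])  # process
print([p["id"] for p in critical_patterns(patterns)])
```

Because every entry shares the same schema, the same two functions work for patterns, anti-patterns, and scenarios alike.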

This is the difference between "don't use silent fallbacks" (CLAUDE.md) and "here's exactly what a silent fallback looks like in code, here's what to do instead, here's how to verify you didn't introduce one, and here's the post-mortem that proved why this matters" (knowledge base).
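
As a sketch of that contrast, here is what the anti-pattern and its fail-fast replacement might look like; the class and function names are illustrative, not the production code:

```python
class DeprecatedModel:
    """Stand-in for an RLlib policy whose compute_single_action call now fails."""
    def compute_single_action(self, obs):
        raise AttributeError("compute_single_action was removed in RLlib 2.x")

def equal_weights(obs):
    return [1.0 / len(obs)] * len(obs)

# ANTI-PATTERN: the exception is swallowed; heuristics trade silently for weeks.
def get_action_silent(model, obs):
    try:
        return model.compute_single_action(obs)
    except Exception:
        return equal_weights(obs)  # caller never learns the model is broken

# PATTERN: fail fast. A broken model halts the pipeline instead of trading on.
def get_action_fail_fast(model, obs):
    return model.compute_single_action(obs)  # any failure raises immediately

obs = [0.1, 0.2, 0.3]
print(get_action_silent(DeprecatedModel(), obs))  # looks like a valid action, isn't
```

The silent version returns something plausible on every call, which is exactly why the failure went unnoticed until the backtest results came back identical.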

Layer 3: Operational Runbooks

The docs/v2/ops/ directory contains the procedures that turn policies into actions. Morning go/no-go checklists. Deployment procedures. Hotfix and rollback playbooks. Pipeline troubleshooting guides.

Every runbook includes YAML frontmatter:

---
id: ops-05
title: Trading Morning Go/No-Go Checklist
perspective: operations
tags: [pre-market, readiness, go-no-go, health-check, checklist]
depends_on: [ops-01, ops-14]
key_files:
  - backend/app/api/v1/health.py
  - backend/app/services/monitoring/multi_account.py
  - backend/app/services/portfolio_risk.py
  - backend/app/services/broker.py
last_updated: 2026-02-28
---

The morning go/no-go checklist was a direct response to the Jan 9 EventBridge incident, where all trading rules were disabled and nobody noticed until 23 minutes before market open. Now, every trading day starts with a single endpoint call that checks service health, trading operations, risk controls, data freshness, ML model availability, and watchlist readiness. It returns a boolean: overall_ready: true or overall_ready: false. No ambiguity. No room for an AI agent to assume systems are working without checking.
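
The aggregation logic behind such an endpoint can be sketched in a few lines; the check names below echo the categories above but are otherwise hypothetical:

```python
def go_no_go(checks):
    """Aggregate individual readiness checks into one boolean.
    Any failing check blocks trading; there is no partial credit."""
    return {
        "checks": checks,
        "overall_ready": all(checks.values()),
        "blocking": sorted(name for name, ok in checks.items() if not ok),
    }

result = go_no_go({
    "service_health": True,
    "eventbridge_rules_enabled": False,  # the Jan 9 failure mode
    "risk_controls": True,
    "data_freshness": True,
    "ml_endpoints_in_service": True,
    "watchlist_ready": True,
})
print(result["overall_ready"])  # False
print(result["blocking"])       # ['eventbridge_rules_enabled']
```

The `blocking` list matters as much as the boolean: an agent that gets `overall_ready: false` can go straight to the failing check instead of re-running everything.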

The depends_on field creates a navigable graph. The go/no-go checklist depends on the daily operations doc and the monitoring setup doc. An agent troubleshooting a pre-market failure can follow these links to find the relevant context.

Layer 4: Architecture and Risk Documentation

The outermost layer provides full depth: system architecture, risk controls, model design, feature engineering, infrastructure, and more. The master index (docs/v2/INDEX.md) organizes everything into four perspectives — Architecture, Risk, Operations, and System Guide — with an "I need to..." lookup table:

| Task | Document |
| --- | --- |
| Understand the overall system | arch/01-system-overview.md |
| Debug a pipeline failure | ops/08-troubleshooting-pipeline.md |
| Deploy a code change | ops/03-deployment-checklist.md |
| Fix a production issue fast | ops/04-hotfix-and-rollback.md |
| Understand risk controls | risk/02-trading-risk-controls.md |
| Check model risk incidents | risk/15-risk-incidents-and-lessons.md |
| Pre-market morning checklist | ops/05-trading-morning-go-no-go.md |
| Understand feature engineering | risk/07-feature-engineering.md |

This table is explicitly designed for AI agents. An agent that needs to deploy a code change doesn't need to know the documentation structure — it reads the lookup table and goes directly to ops/03-deployment-checklist.md. The agent discovery section includes grep commands:

# Find docs related to a topic
grep -r "tags:.*pipeline" docs/v2/

# Find docs related to a source file
grep -r "key_files:.*execution.py" docs/v2/

# Find dependencies
grep -r "depends_on:.*arch-05" docs/v2/

The cross-references create a graph. From any document, an agent can discover related documents through depends_on, related source code through key_files, and topical connections through tags.

The Post-Mortem Feedback Loop

The four layers only matter if they're connected to the incidents that justify them. Here's how a single incident — our worst production loss — propagated through every layer of the documentation system. The point isn't the incident itself (I covered that in earlier posts). The point is the process — how a failure becomes a permanent system constraint.

Step 1: The Incident

January 30, 2026. A single stock crashes 74%. Multiple safeguards fail simultaneously — the RL model had been silently falling back to heuristics, position limits didn't account for pending orders, and compliance defaulted to "allow" for unclassified sectors.

Step 2: The Post-Mortem

The post-mortem followed a structured template — nine mandatory sections (Summary, Timeline, Root Cause, Impact, What Went Well, What Went Wrong, Action Items, Lessons Learned, Prevention). No optional sections. If you can't fill a section, you haven't investigated deeply enough.

The template's most important feature: it requires multiple root causes. This investigation identified five — infrastructure misconfiguration, silent fallback behavior, incomplete validation logic, permissive defaults for missing data, and insufficient monitoring. A single-root-cause analysis would have fixed the infrastructure issue and declared victory, leaving four other holes open for the next incident.

Step 3: Pattern Encoded

The "silent fallback" pattern was formalized in .ai-knowledge-base.yaml as a CRITICAL anti-pattern. The "feature verification" pattern was added as a CRITICAL positive pattern with an explicit checklist: code written → tests passing → deployed → verified in production → documented with evidence.

Step 4: Policy Added

Three CLAUDE.md rules trace directly to this incident:

  • Policy 3 ("Verify deployments"): "Deployed" ≠ "Working". Always check logs.
  • Policy 4 ("Fail fast, never silently"): No try/except fallbacks that mask failures.
  • Policy 7 ("Feature consistency"): Training and inference must use identical feature lists.

Step 5: Runbook Updated

The morning go/no-go checklist now includes SageMaker endpoint verification — confirming that every endpoint is InService and responding within latency thresholds. The deployment checklist now includes a mandatory log verification step.

Step 6: Code Changes

Four code changes shipped the same day — each one addressing a specific root cause from the post-mortem: remove the silent fallback (fail fast instead), block trades with missing data (reject unknowns instead of allowing), include pending orders in concentration limits, and add portfolio risk validation before every buy.
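
The risk-check side of those changes might look like the following sketch; the symbol, limits, and function names are illustrative, not the production implementation:

```python
def concentration_ok(symbol, portfolio_value, positions, pending, max_pct=0.10):
    """Include pending orders in exposure. Counting only filled positions
    was one of the Jan 30 root causes."""
    exposure = positions.get(symbol, 0.0) + pending.get(symbol, 0.0)
    return exposure / portfolio_value <= max_pct

def sector_allowed(symbol, sector_map, blocked_sectors):
    """Default deny: an unclassified symbol is rejected, never waved through."""
    sector = sector_map.get(symbol)
    return sector is not None and sector not in blocked_sectors

# Filled $8K plus pending $5K breaches a 10% cap on a $100K portfolio.
print(concentration_ok("KXIN", 100_000, {"KXIN": 8_000}, {"KXIN": 5_000}))  # False
print(sector_allowed("KXIN", {}, {"restricted"}))                           # False
```

The second function is the inverse of the pre-incident behavior, where a missing sector classification defaulted to "allow".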

Step 7: CI Gate

Feature consistency validation runs on every PR. features.py is the single source of truth for model feature lists. If the training feature list diverges from the inference feature list, CI fails. No exception. No override.
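
A CI gate of this kind can be as small as one comparison; the feature names below are placeholders, and in the real system both lists would be imported from features.py rather than stubbed inline:

```python
# Placeholders; the post describes features.py as the single source of truth.
TRAINING_FEATURES = ["rsi_14", "macd", "volume_zscore"]
INFERENCE_FEATURES = ["rsi_14", "macd", "volume_zscore"]

def check_feature_consistency(training, inference):
    """Fail the build if the feature lists diverge in content OR order."""
    if training != inference:
        missing = sorted(set(training) - set(inference))
        extra = sorted(set(inference) - set(training))
        raise SystemExit(f"Feature mismatch: missing={missing} extra={extra}")

check_feature_consistency(TRAINING_FEATURES, INFERENCE_FEATURES)
print("feature lists consistent")
```

List equality (not set equality) is deliberate: feature order matters to a model, so a reordering should fail the gate just like a missing feature.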

The feedback loop: incident → post-mortem → pattern/anti-pattern → policy → runbook → code change → CI gate.

Every incident feeds every layer. The layers reinforce each other: CLAUDE.md tells agents what to do, the knowledge base explains why and how, the runbooks provide step-by-step procedures, and CI gates enforce the rules in code. An incident that passes through all seven steps of this loop has been converted from a one-time lesson into a permanent system constraint.

The Mishap Cost Log: Quantifying Lessons

Every incident in the system gets a dollar amount. Not an estimate, not a range — a specific cost entry in MISHAP_COST_LOG.md.

The file opens with a cost reference table:

| Resource | Cost |
| --- | --- |
| SageMaker Training (ml.m5.xlarge) | ~$0.23/hour |
| SageMaker Training (ml.g4dn.xlarge GPU) | ~$0.74/hour |
| SageMaker Endpoint (ml.t2.medium) | ~$0.05/hour |
| SageMaker Serverless Endpoint | ~$0.00012/second |
| SageMaker Processing (ml.m5.xlarge) | ~$0.23/hour |
| ECS Task (Fargate) | ~$0.04/hour |

Then each session's incidents follow a consistent template:

#### [6:02 PM ET] - Silent Heuristic Fallback Producing Identical Results
- **What happened**: Multiple training jobs ($5-10 each) produced identical
  backtest results (7.41% return, 62.1% win rate). User correctly identified
  this as impossible if the model was actually learning different things.
- **Root cause**: TWO cascading failures:
  1. `compute_single_action` deprecated in RLlib 2.x, causing exceptions
  2. Code had try/except that caught the error and fell back to equal weights
- **AWS resources affected**: 1x validation job + HPO job with 15 training jobs
- **Duration**: ~20 minutes before user caught it
- **Estimated cost**: ~$5-10 (wasted training jobs)
- **Prevention**:
  1. NEVER use try/except with silent fallback for model inference
  2. Fail fast: If model inference fails, raise immediately
  3. Add validation counters: Track model_inference_count vs heuristic_count
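
The validation-counter idea from that prevention list can be sketched as a pair of helpers; the 95% threshold and the function names are assumptions for illustration:

```python
from collections import Counter

decision_source = Counter()

def record_decision(source):
    """Call with 'model' or 'heuristic' at each inference site."""
    decision_source[source] += 1

def assert_model_in_use(min_model_fraction=0.95):
    """Alarm when too few decisions actually came from the model."""
    total = sum(decision_source.values())
    model_frac = decision_source["model"] / total if total else 0.0
    if model_frac < min_model_fraction:
        raise RuntimeError(
            f"only {model_frac:.0%} of decisions used the model; "
            "a fallback may be masking failures"
        )

for _ in range(19):
    record_decision("model")
record_decision("heuristic")
assert_model_in_use(0.9)  # 19/20 model decisions passes a 90% threshold
```

A counter like this would have surfaced the Jan 14 incident on day one instead of day six: the model count would have been zero.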

The dollar amounts serve two purposes. First, accountability — it's harder to dismiss an incident as minor when you can see it cost $55-90 in wasted compute. Second, prioritization — the entries that cost the most get the most aggressive prevention measures.

What matters more than any individual dollar amount is the pattern recognition. When you aggregate costs by category rather than by date, the same root patterns emerge that the incident table showed earlier — silent fallbacks, unverified outputs, architecture mismatches. The mishap log makes these patterns impossible to ignore because they have cumulative price tags.

The $0-cost near-misses are arguably the most valuable entries. The Jan 13 alphabetical bias — where prescreening only ever considered stocks starting with "A" because the broker API returns symbols alphabetically and the code took universe[:500] — cost nothing because it was caught during paper trading review. But it revealed that the prescreening pipeline had never been validated for output diversity. That led to the broader principle: verify outputs, not just execution. The system learned before paying.
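
The bug and its fix fit in a few lines; the miniature universe below is illustrative, and random sampling stands in for whatever ranking signal the real prescreener would use:

```python
import random

# Broker APIs often return symbols alphabetically; truncating first biases to "A".
universe = ["AAPL", "ABBV", "ACN", "GOOG", "MSFT", "NVDA", "ZTS"]

biased = universe[:3]              # the Jan 13 bug in miniature: all A-stocks
rng = random.Random(7)
unbiased = rng.sample(universe, 3)  # sample (or rank by a signal) BEFORE truncating

def diverse_enough(selected, min_first_letters=2):
    """Cheap output check: did prescreening consider more than A-stocks?"""
    return len({s[0] for s in selected}) >= min_first_letters

print(biased, diverse_enough(biased))  # ['AAPL', 'ABBV', 'ACN'] False
```

The `diverse_enough` check is the "verify outputs, not just execution" principle in code: it inspects what the pipeline produced, not whether it ran.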

The mishap log creates a culture of measurement. When an agent makes a mistake, it doesn't just get a post-mortem — it gets a price tag. Over time, the cumulative cost data makes the value of the documentation system concrete. Every dollar in the mishap log is a dollar that the feedback loop is designed to prevent from recurring.

Documentation as Code: Frontmatter and Cross-References

Every document in the system includes YAML frontmatter. This isn't cosmetic — it's the mechanism that makes the documentation discoverable by AI agents.

---
id: risk-15
title: Risk Incidents and Lessons Learned
perspective: risk
tags: [incidents, post-mortems, lessons, controls, loss-events, fallback]
depends_on: [risk-10, risk-11, risk-13]
key_files:
  - docs/post-mortems/2026-01-30-RL-MODEL-NEVER-WORKED.md
  - docs/post-mortems/2026-02-06-SAGEMAKER-INFERENCE-SEGFAULT.md
  - docs/LESSONS_LEARNED.md
  - docs/MISHAP_COST_LOG.md
last_updated: 2026-02-28
---

The frontmatter fields are designed for programmatic consumption:

  • id: Stable reference for cross-linking (depends_on: [risk-10])
  • perspective: One of four categories (architecture, risk, operations, guide)
  • tags: Topical classification for search
  • depends_on: Explicit dependency graph between documents
  • key_files: Source code files that this document describes

An agent investigating a production incident can start with the "I need to..." lookup table, find the relevant runbook, follow depends_on links to understand the broader context, and trace key_files to the exact code that needs attention. The frontmatter turns a flat directory of markdown files into a navigable knowledge graph.
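
Given frontmatter like this, an agent can walk the dependency graph mechanically. A minimal sketch using ids from this post, with the documents stubbed as a dictionary instead of parsed from disk:

```python
# depends_on edges copied from the frontmatter examples in this post.
docs = {
    "risk-15": ["risk-10", "risk-11", "risk-13"],
    "ops-05": ["ops-01", "ops-14"],
    "risk-10": [], "risk-11": [], "risk-13": [],
    "ops-01": [], "ops-14": [],
}

def context_for(doc_id, docs):
    """Breadth-first walk: every document an agent should read before this one."""
    seen, queue = set(), list(docs.get(doc_id, []))
    while queue:
        dep = queue.pop(0)
        if dep not in seen:
            seen.add(dep)
            queue.extend(docs.get(dep, []))
    return sorted(seen)

print(context_for("risk-15", docs))  # ['risk-10', 'risk-11', 'risk-13']
```

The `seen` set makes the walk safe even if two documents ever depend on each other, which a flat directory of markdown files would not prevent.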

The post-mortem template itself is documentation as code. Its nine mandatory sections enforce investigative rigor:

  1. Summary
  2. Timeline
  3. Root Cause
  4. Impact (financial, users, duration)
  5. What Went Well
  6. What Went Wrong
  7. Action Items (with owner, status, due date)
  8. Lessons Learned
  9. Prevention

The "What Went Well" section is often the hardest to fill — it forces you to identify what parts of the system worked correctly during a failure, which prevents over-correction. In our worst incident, the prescreening model and the stop-loss system both functioned as designed. Those details matter because they prevent the knee-jerk reaction of dismantling working components alongside the broken ones.

What This Looks Like in Practice

The clearest way to see the documentation system's value is to compare two incidents: one before the system existed, and one after.

Before: February 6, 2026

Task: Deploy new RL model to production.

The agent used the RLlib training script as the inference entry point. Ray imported at module load time. The SageMaker inference container didn't have Ray installed. Segfault.

Five hours of debugging followed. The agent tried adding packages, removing custom entrypoints, rebuilding containers — none of it worked because the root cause (training code in the inference path) wasn't identified until hour four. Zero trades executed that morning. The fix — creating a standalone inference.py that duplicates the model architecture without Ray imports — took 30 minutes once the problem was understood. The other 4.5 hours were wasted.

After: March 2026

Same task: Deploy updated RL model to production.

The agent reads CLAUDE.md at session start. Policy 6 is explicit: "Separate training/inference code — Ray imports in inference containers = segfault. Duplicate model architecture instead." The agent opens the knowledge base, finds the deployment_process pattern, and follows the deployment checklist from ops/03-deployment-checklist.md.

The checklist includes: verify inference module has no training-only imports, build container locally, test inference with sample payload, deploy to SageMaker, verify endpoint is InService, invoke with test data, check CloudWatch logs for errors.
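
The endpoint-verification steps reduce to two checks that can be unit-tested without AWS access; `EndpointStatus` and `InService` are the fields a SageMaker describe_endpoint response actually contains, while the latency budget here is an assumed value:

```python
def endpoint_ready(desc):
    """Gate on the shape of a describe_endpoint response. InService is
    necessary but not sufficient: a test invocation is still required,
    because 'Deployed' != 'Working'."""
    return desc.get("EndpointStatus") == "InService"

def invocation_healthy(status_code, latency_ms, max_latency_ms=500.0):
    """A smoke invocation must succeed AND return within the latency budget."""
    return status_code == 200 and latency_ms <= max_latency_ms

print(endpoint_ready({"EndpointStatus": "Creating"}))  # False: still deploying
print(invocation_healthy(200, 120.0))                  # True
```

Keeping the checks as pure functions over response data means the same logic can run in CI against recorded responses and in production against live ones.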

The agent creates the standalone inference module, tests it locally, deploys, verifies the endpoint responds, checks the logs for any errors — and is done in 30 minutes. No segfault. No five-hour debugging session. No missed trading window.

The difference isn't that the agent is smarter. It's that the system contains the answer to a problem that was already solved once. The Feb 6 post-mortem identified the root cause. The CLAUDE.md policy encoded the constraint. The knowledge base provided the code pattern. The deployment checklist provided the verification steps. Each layer added a guardrail, and the guardrails compound.

This is what I mean by documentation that deploys. It doesn't sit in a wiki waiting to be read. It's loaded into the agent's context at session start, structured for programmatic lookup, and connected to the verification steps that prove the deployment actually works.

Your Turn

If you're working with AI agents and want to build a similar system, here's where to start:

  1. Start with a mishap log. Track every mistake with a dollar amount. Even $0 near-misses deserve an entry. The act of quantifying forces you to take incidents seriously and creates the data you need to prioritize prevention.

  2. Write post-mortems with multiple root causes. Single-root-cause analysis is almost always incomplete. Force yourself to find at least three contributing factors. Use a template with mandatory sections so you can't skip the hard parts.

  3. Encode policies where agents read them. If you're using AI coding agents, put your most critical rules in the file that loads automatically every session. For Claude, that's CLAUDE.md. For other tools, find the equivalent. Twelve imperative rules beat a hundred pages of guidelines.

  4. Structure knowledge for machines, not just humans. Prose documentation is for context. Structured YAML (or JSON, or whatever your agents parse) is for enforcement. Include severity levels, code examples, verification steps, and reference files.

  5. Create the feedback loop. The system only works if every incident propagates through all layers: incident → post-mortem → pattern/anti-pattern → policy → runbook → code change → CI gate. Skip any step and the loop has a gap.

  6. Treat documentation updates as part of the fix. When you close an incident, the code change is half the work. The policy update, the runbook update, and the knowledge base entry are the other half. If you don't update the documentation, you'll fix the same problem twice.

The investment is real — building and maintaining four documentation layers takes time. But the alternative is paying for the same lesson more than once. The documentation system exists to make sure every failure only happens once.


This is Post 6 of an 8-part series on building a full-stack AI trading application with LLM coding agents. Previous: The Ensemble That Actually Trades — How PPO, SAC, and TD3 Became One Decisioning System | Next: One Developer, 12 Workflows, and a Production ML System

The documentation system tells agents what NOT to do. Next week: the full development system that lets one person ship production ML at startup speed.
