Between January 9 and February 6, 2026, our AI-assisted trading system had seven significant incidents. The same root patterns — silent fallbacks, unverified deployments, training/inference mismatches — kept recurring despite post-mortems and a growing LESSONS_LEARNED.md. The lessons existed. The enforcement mechanism didn't.
This post is about the documentation system that emerged from those failures. Not documentation in the wiki sense — a self-reinforcing system where every incident writes the rules that prevent the next one, and those rules are embedded in the tools that AI coding agents read at the start of every session.
Post 5 ended with a tease: "I'll go deeper on the documentation system in the next post, including why we built it, how it prevents the kind of knowledge loss that contributed to our $78K incident, and how it works with AI coding agents." Here's the full story.
AI coding agents lose context between sessions. A lesson learned Tuesday is forgotten Thursday. This isn't a minor inconvenience — it's an operational risk.
Between January 9 and February 6, 2026, I logged seven significant incidents:
| Date | Incident | Cost | Root Pattern |
|---|---|---|---|
| Jan 9 | EventBridge rules disabled | Missed trading window | Unverified deployment |
| Jan 12 | Silent heuristic fallback | $55-90 wasted | Silent fallback |
| Jan 12 | Constant prescreening scores | $20-40 wasted | Unverified output |
| Jan 13 | Alphabetical bias (all A-stocks) | Missed diversity | Unverified output |
| Jan 14 | RL model never used (6 days) | $150-250+ wasted | Feature mismatch + silent fallback |
| Jan 30 | KXIN concentration ($78,947 loss) | $78,947 | Silent fallback + missing risk check |
| Feb 6 | SageMaker segfault | 5 hours lost | Training/inference mismatch |
These weren't seven independent failures. They were variations on three themes: silent fallbacks masking broken systems, deployments assumed to work without verification, and training code leaking into inference containers.
We had post-mortems. We had a LESSONS_LEARNED.md. Incidents kept happening because there was no mechanism to turn a lesson into a constraint that an AI agent would respect on its next session. The post-mortem for Jan 12's silent fallback explicitly documented "NEVER use try/except with silent fallback for model inference." Eighteen days later, the Jan 30 loss was caused by the exact same pattern — the RL endpoint's silent fallback to heuristics. The lesson existed. The enforcement mechanism didn't.
The question became: how do you make documentation that doesn't just record knowledge, but actively prevents the recurrence of the failures it documents?
The answer turned out to be four concentric layers, each with a different audience, update frequency, and enforcement mechanism. The innermost layer is the most concentrated and the most frequently consulted. Each outer layer adds depth and context.
CLAUDE.md sits at the repository root. AI coding agents load it automatically at the start of every session. It's the single most important file in the documentation system because it's the only one that's always in context.
It contains 12 policies. Each one traces directly to a specific failure:
```markdown
## Key Policies (from past failures)

1. **Follow docs first** — Read procedures before acting. Don't assume.
2. **Never sleep/wait** — Check status once, report to user, move on.
3. **Verify deployments** — "Deployed" ≠ "Working". Always check logs.
4. **Fail fast, never silently** — No try/except fallbacks that mask failures.
5. **CI/CD is mandatory** — Never bypass PR checks, never merge before
   tests pass, never manually deploy Lambda (except emergencies).
6. **Separate training/inference code** — Ray imports in inference
   containers = segfault. Duplicate model architecture instead.
7. **Feature consistency** — Training and inference must use identical
   feature lists. Single source of truth.
8. **Feature expansion is additive** — New features need backward-compatible
   fallbacks. Never break existing data/models.
9. **Pipeline is event-driven** — Freeze pipeline state BEFORE stopping
   jobs, or orchestrator advances unexpectedly.
10. **Urgent failures = act first** — Stop failing jobs → deploy fix to S3 →
    reset SSM → retrigger → THEN do PR paperwork.
11. **Explicit plan approval** — Never start implementing until user says
    "go ahead".
12. **Overnight jobs** — Never say "you're good to sleep" until a job has
    COMPLETED successfully.
```

The mapping from policies to incidents is direct:
- Policy 3 ("Verify deployments") ← Jan 14: RL model "deployed" but never actually loaded. Agent reported success without checking logs.
- Policy 4 ("Fail fast, never silently") ← Jan 12 and Jan 30: Silent fallback to heuristics masked complete model failure for weeks.
- Policy 6 ("Separate training/inference code") ← Feb 6: Used RLlib training script as inference entry point. Ray import caused a segfault. Five hours lost.
- Policy 10 ("Urgent failures = act first") ← Feb 17: SageMaker jobs burning money while we debated the correct CI/CD process.
Each policy is a scar. The CLAUDE.md file isn't a style guide or a wish list — it's a compressed version of every expensive mistake we've made, written in imperative language that an AI agent can follow without ambiguity.
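To make Policy 4 concrete, here is a minimal sketch of the anti-pattern next to its replacement. The function names and the counter are illustrative, not the system's actual code:

```python
# ANTI-PATTERN: a try/except that silently swaps the model for a heuristic.
# The caller has no way to know the model never ran; this is the shape of
# the failure that masked the Jan 12 and Jan 30 incidents.
def score_silently(invoke_model, fallback_score=0.5):
    try:
        return invoke_model()
    except Exception:
        return fallback_score  # looks like a valid score; hides total failure


# PATTERN: fail fast, and count inferences so monitoring can prove the
# model actually ran.
class Scorer:
    def __init__(self, invoke_model):
        self.invoke_model = invoke_model
        self.model_inference_count = 0

    def score(self):
        # No fallback: if inference fails, the exception propagates and the
        # pipeline halts loudly instead of trading on a heuristic.
        result = self.invoke_model()
        self.model_inference_count += 1
        return result
```

The counter mirrors the "validation counters" prevention item from the mishap log: after a session, monitoring can assert the inference count is nonzero instead of trusting that the model was used.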
CLAUDE.md tells agents what to do and what not to do, but it doesn't explain why, and it doesn't provide code examples. That's what the knowledge base is for.
.ai-knowledge-base.yaml is a 620-line structured YAML file containing 8 patterns (do this), 6 anti-patterns (don't do this), and 4 decision-tree scenarios. Each entry has a consistent schema:
```yaml
- id: feature_verification
  name: "Feature Verification Checklist"
  severity: "CRITICAL"
  category: "process"
  description: |
    "Scheduled" ≠ "Working". Features must be verified in production,
    not just marked complete when code is written.
  do:
    - "Verify feature works in production after deployment"
    - "Check worker logs, not just scheduler logs"
    - "Monitor actual outcomes (BUY:SELL ratio, position closures)"
    - "Run smoke tests after deployment"
    - "Document verification evidence"
  dont:
    - "Mark complete based on code alone"
    - "Assume scheduled = working"
    - "Skip verification due to time constraints"
  checklist:
    - "Code written"
    - "Tests passing"
    - "Deployed to production"
    - "Verified working in production"
    - "Documented with verification evidence"
```

The key difference from CLAUDE.md: this file is structured enough for programmatic consumption. An AI agent can look up patterns by `id`, filter by `severity: "CRITICAL"`, or follow a decision tree step by step. The `do` and `dont` arrays are unambiguous directives. The `checklist` provides sequential verification steps. The `reference_files` field points to the exact source code locations where the pattern applies.
This is the difference between "don't use silent fallbacks" (CLAUDE.md) and "here's exactly what a silent fallback looks like in code, here's what to do instead, here's how to verify you didn't introduce one, and here's the post-mortem that proved why this matters" (knowledge base).
The docs/v2/ops/ directory contains the procedures that turn policies into actions. Morning go/no-go checklists. Deployment procedures. Hotfix and rollback playbooks. Pipeline troubleshooting guides.
Every runbook includes YAML frontmatter:
```yaml
---
id: ops-05
title: Trading Morning Go/No-Go Checklist
perspective: operations
tags: [pre-market, readiness, go-no-go, health-check, checklist]
depends_on: [ops-01, ops-14]
key_files:
  - backend/app/api/v1/health.py
  - backend/app/services/monitoring/multi_account.py
  - backend/app/services/portfolio_risk.py
  - backend/app/services/broker.py
last_updated: 2026-02-28
---
```

The morning go/no-go checklist was a direct response to the Jan 9 EventBridge incident, where all trading rules were disabled and nobody noticed until 23 minutes before market open. Now, every trading day starts with a single endpoint call that checks service health, trading operations, risk controls, data freshness, ML model availability, and watchlist readiness. It returns a boolean: `overall_ready: true` or `overall_ready: false`. No ambiguity. No room for an AI agent to assume systems are working without checking.
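The aggregation logic behind such an endpoint can be sketched in a few lines. The check names here are illustrative stand-ins for the real subsystem checks listed above:

```python
def go_no_go(checks):
    """Aggregate per-subsystem health checks into a single boolean.

    `checks` maps a check name to True/False. Any failing check means
    no-go: the default posture is refusal, never assumption.
    """
    failed = sorted(name for name, ok in checks.items() if not ok)
    return {"overall_ready": not failed, "failed_checks": failed}
```

A morning with even one disabled EventBridge rule produces `overall_ready: False` plus the exact check that failed, which is what an agent (or a human) needs to start the right runbook.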
The depends_on field creates a navigable graph. The go/no-go checklist depends on the daily operations doc and the monitoring setup doc. An agent troubleshooting a pre-market failure can follow these links to find the relevant context.
The outermost layer provides full depth: system architecture, risk controls, model design, feature engineering, infrastructure, and more. The master index (docs/v2/INDEX.md) organizes everything into four perspectives — Architecture, Risk, Operations, and System Guide — with an "I need to..." lookup table:
| Task | Document |
|---|---|
| Understand the overall system | arch/01-system-overview.md |
| Debug a pipeline failure | ops/08-troubleshooting-pipeline.md |
| Deploy a code change | ops/03-deployment-checklist.md |
| Fix a production issue fast | ops/04-hotfix-and-rollback.md |
| Understand risk controls | risk/02-trading-risk-controls.md |
| Check model risk incidents | risk/15-risk-incidents-and-lessons.md |
| Pre-market morning checklist | ops/05-trading-morning-go-no-go.md |
| Understand feature engineering | risk/07-feature-engineering.md |
This table is explicitly designed for AI agents. An agent that needs to deploy a code change doesn't need to know the documentation structure — it reads the lookup table and goes directly to ops/03-deployment-checklist.md. The agent discovery section includes grep commands:
```bash
# Find docs related to a topic
grep -r "tags:.*pipeline" docs/v2/

# Find docs related to a source file
grep -r "key_files:.*execution.py" docs/v2/

# Find dependencies
grep -r "depends_on:.*arch-05" docs/v2/
```

The cross-references create a graph. From any document, an agent can discover related documents through `depends_on`, related source code through `key_files`, and topical connections through `tags`.
The four layers only matter if they're connected to the incidents that justify them. Here's how a single incident — our worst production loss — propagated through every layer of the documentation system. The point isn't the incident itself (I covered that in earlier posts). The point is the process — how a failure becomes a permanent system constraint.
January 30, 2026. A single stock crashes 74%. Multiple safeguards fail simultaneously — the RL model had been silently falling back to heuristics, position limits didn't account for pending orders, and compliance defaulted to "allow" for unclassified sectors.
The post-mortem followed a structured template with 12 mandatory sections, including Summary, Timeline, Root Cause, Impact, What Went Well, What Went Wrong, Action Items, Lessons Learned, and Prevention. No optional sections. If you can't fill a section, you haven't investigated deeply enough.
The template's most important feature: it requires multiple root causes. This investigation identified five — infrastructure misconfiguration, silent fallback behavior, incomplete validation logic, permissive defaults for missing data, and insufficient monitoring. A single-root-cause analysis would have fixed the infrastructure issue and declared victory, leaving four other holes open for the next incident.
The "silent fallback" pattern was formalized in .ai-knowledge-base.yaml as a CRITICAL anti-pattern. The "feature verification" pattern was added as a CRITICAL positive pattern with an explicit checklist: code written → tests passing → deployed → verified in production → documented with evidence.
Three CLAUDE.md rules trace directly to this incident:
- Policy 3 ("Verify deployments"): "Deployed" ≠ "Working". Always check logs.
- Policy 4 ("Fail fast, never silently"): No try/except fallbacks that mask failures.
- Policy 7 ("Feature consistency"): Training and inference must use identical feature lists.
The morning go/no-go checklist now includes SageMaker endpoint verification — confirming that every endpoint is InService and responding within latency thresholds. The deployment checklist now includes a mandatory log verification step.
Four code changes shipped the same day — each one addressing a specific root cause from the post-mortem: remove the silent fallback (fail fast instead), block trades with missing data (reject unknowns instead of allowing), include pending orders in concentration limits, and add portfolio risk validation before every buy.
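The concentration-limit fix can be sketched as a single pure function. The names, data shapes, and the 10% limit here are illustrative, not the system's actual parameters:

```python
def violates_concentration(symbol, order_value, positions, pending_orders,
                           portfolio_value, max_pct=0.10):
    """Reject a buy if filled + pending + new exposure to one symbol
    would exceed max_pct of the portfolio.

    Before the Jan 30 fix, only filled `positions` were counted, so a
    burst of pending orders could concentrate the book in one ticker.
    """
    current = positions.get(symbol, 0.0)
    pending = sum(o["value"] for o in pending_orders if o["symbol"] == symbol)
    return (current + pending + order_value) > max_pct * portfolio_value
```

The design choice worth noting: the check is a pure function over explicit inputs, so it can be unit-tested against the exact Jan 30 scenario and run identically in CI and in the live pre-trade path.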
Feature consistency validation runs on every PR. features.py is the single source of truth for model feature lists. If the training feature list diverges from the inference feature list, CI fails. No exception. No override.
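The CI gate can be as small as one assertion over the single source of truth. A sketch, with a hypothetical layout in which `features.py` exports a canonical list that both training and inference import:

```python
# features.py would hold the single source of truth; these names are
# illustrative, not the system's real feature list.
CANONICAL_FEATURES = ["rsi_14", "macd", "volume_zscore"]


def check_feature_consistency(training_features, inference_features):
    """CI gate: fail the build if either side drifts from the canon."""
    for side, feats in [("training", training_features),
                        ("inference", inference_features)]:
        if feats != CANONICAL_FEATURES:
            raise AssertionError(
                f"{side} feature list diverged from features.py: {feats}")
```

Run as a test on every PR, this turns "training and inference must use identical feature lists" from a policy an agent might forget into a build failure it cannot ignore.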
The feedback loop: incident → post-mortem → pattern/anti-pattern → policy → runbook → code change → CI gate.
Every incident feeds every layer. The layers reinforce each other: CLAUDE.md tells agents what to do, the knowledge base explains why and how, the runbooks provide step-by-step procedures, and CI gates enforce the rules in code. An incident that passes through all seven steps of this loop has been converted from a one-time lesson into a permanent system constraint.
Every incident in the system gets a dollar amount. Sometimes it's an estimate or a range, but it is always written down as a cost entry in MISHAP_COST_LOG.md.
The file opens with a cost reference table:
| Resource | Cost |
|---|---|
| SageMaker Training (ml.m5.xlarge) | ~$0.23/hour |
| SageMaker Training (ml.g4dn.xlarge GPU) | ~$0.74/hour |
| SageMaker Endpoint (ml.t2.medium) | ~$0.05/hour |
| SageMaker Serverless Endpoint | ~$0.00012/second |
| SageMaker Processing (ml.m5.xlarge) | ~$0.23/hour |
| ECS Task (Fargate) | ~$0.04/hour |
Then each session's incidents follow a consistent template:
```markdown
#### [6:02 PM ET] - Silent Heuristic Fallback Producing Identical Results

- **What happened**: Multiple training jobs ($5-10 each) produced identical
  backtest results (7.41% return, 62.1% win rate). User correctly identified
  this as impossible if the model was actually learning different things.
- **Root cause**: TWO cascading failures:
  1. `compute_single_action` deprecated in RLlib 2.x, causing exceptions
  2. Code had try/except that caught the error and fell back to equal weights
- **AWS resources affected**: 1x validation job + HPO job with 15 training jobs
- **Duration**: ~20 minutes before user caught it
- **Estimated cost**: ~$5-10 (wasted training jobs)
- **Prevention**:
  1. NEVER use try/except with silent fallback for model inference
  2. Fail fast: If model inference fails, raise immediately
  3. Add validation counters: Track model_inference_count vs heuristic_count
```

The dollar amounts serve two purposes. First, accountability — it's harder to dismiss an incident as minor when you can see it cost $55-90 in wasted compute. Second, prioritization — the entries that cost the most get the most aggressive prevention measures.
What matters more than any individual dollar amount is the pattern recognition. When you aggregate costs by category rather than by date, the same root patterns emerge that the incident table showed earlier — silent fallbacks, unverified outputs, architecture mismatches. The mishap log makes these patterns impossible to ignore because they have cumulative price tags.
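That aggregation is worth sketching. The figures below come from the incident table at the top of this post (midpoints for ranged estimates, zero for incidents whose cost was time rather than compute; Jan 14's combined "feature mismatch + silent fallback" is counted under silent fallback):

```python
from collections import defaultdict

# (root pattern, estimated cost in USD) per incident, from the table above.
INCIDENTS = [
    ("unverified deployment", 0),          # Jan 9: missed trading window
    ("silent fallback", 72.5),             # Jan 12: midpoint of $55-90
    ("unverified output", 30),             # Jan 12: midpoint of $20-40
    ("unverified output", 0),              # Jan 13: near-miss, caught in review
    ("silent fallback", 200),              # Jan 14: midpoint of $150-250+
    ("silent fallback", 78_947),           # Jan 30: KXIN concentration loss
    ("training/inference mismatch", 0),    # Feb 6: 5 hours lost, not compute
]


def cost_by_pattern(incidents):
    """Aggregate incident costs by root pattern instead of by date."""
    totals = defaultdict(float)
    for pattern, cost in incidents:
        totals[pattern] += cost
    return dict(totals)
```

Grouped this way, silent fallbacks dominate the ledger, which is exactly why they got the harshest prevention measures.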
The $0-cost near-misses are arguably the most valuable entries. The Jan 13 alphabetical bias — where prescreening only ever considered stocks starting with "A" because the broker API returns symbols alphabetically and the code took universe[:500] — cost nothing because it was caught during paper trading review. But it revealed that the prescreening pipeline had never been validated for output diversity. That led to the broader principle: verify outputs, not just execution. The system learned before paying.
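"Verify outputs, not just execution" is easy to automate. Here is a sketch of the diversity check that would have caught the alphabetical bias; the 50% threshold is illustrative:

```python
from collections import Counter


def check_symbol_diversity(symbols, max_first_letter_share=0.5):
    """Flag a prescreening output dominated by one leading letter.

    `universe[:500]` over an alphabetically sorted symbol list produces
    all-'A' output; this check fails loudly on that shape.
    """
    if not symbols:
        raise ValueError("empty prescreening output")
    top_count = Counter(s[0] for s in symbols).most_common(1)[0][1]
    share = top_count / len(symbols)
    if share > max_first_letter_share:
        raise ValueError(f"{share:.0%} of symbols share one first letter")
```

The same idea generalizes: any pipeline stage whose output has an expected statistical shape (score variance, BUY:SELL ratio, sector spread) can get a cheap post-run assertion.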
The mishap log creates a culture of measurement. When an agent makes a mistake, it doesn't just get a post-mortem — it gets a price tag. Over time, the cumulative cost data makes the value of the documentation system concrete. Every dollar in the mishap log is a dollar that the feedback loop is designed to prevent from recurring.
Every document in the system includes YAML frontmatter. This isn't cosmetic — it's the mechanism that makes the documentation discoverable by AI agents.
```yaml
---
id: risk-15
title: Risk Incidents and Lessons Learned
perspective: risk
tags: [incidents, post-mortems, lessons, controls, loss-events, fallback]
depends_on: [risk-10, risk-11, risk-13]
key_files:
  - docs/post-mortems/2026-01-30-RL-MODEL-NEVER-WORKED.md
  - docs/post-mortems/2026-02-06-SAGEMAKER-INFERENCE-SEGFAULT.md
  - docs/LESSONS_LEARNED.md
  - docs/MISHAP_COST_LOG.md
last_updated: 2026-02-28
---
```

The frontmatter fields are designed for programmatic consumption:
- `id`: Stable reference for cross-linking (`depends_on: [risk-10]`)
- `perspective`: One of four categories (architecture, risk, operations, guide)
- `tags`: Topical classification for search
- `depends_on`: Explicit dependency graph between documents
- `key_files`: Source code files that this document describes
An agent investigating a production incident can start with the "I need to..." lookup table, find the relevant runbook, follow depends_on links to understand the broader context, and trace key_files to the exact code that needs attention. The frontmatter turns a flat directory of markdown files into a navigable knowledge graph.
The post-mortem template itself is documentation as code. Its 12 mandatory sections enforce investigative rigor. Among them:
- Summary
- Timeline
- Root Cause
- Impact (financial, users, duration)
- What Went Well
- What Went Wrong
- Action Items (with owner, status, due date)
- Lessons Learned
- Prevention
The "What Went Well" section is often the hardest to fill — it forces you to identify what parts of the system worked correctly during a failure, which prevents over-correction. In our worst incident, the prescreening model and the stop-loss system both functioned as designed. Those details matter because they prevent the knee-jerk reaction of dismantling working components alongside the broken ones.
The clearest way to see the documentation system's value is to compare two incidents: one before the system existed, and one after.
Task: Deploy new RL model to production.
The agent used the RLlib training script as the inference entry point. Ray imported at module load time. The SageMaker inference container didn't have Ray installed. Segfault.
Five hours of debugging followed. The agent tried adding packages, removing custom entrypoints, rebuilding containers — none of it worked because the root cause (training code in the inference path) wasn't identified until hour four. Zero trades executed that morning. The fix — creating a standalone inference.py that duplicates the model architecture without Ray imports — took 30 minutes once the problem was understood. The other 4.5 hours were wasted.
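The shape of that fix, duplicating the policy network's forward pass without any Ray/RLlib imports, can be sketched with plain numpy. The layer sizes and the SageMaker-style `predict_fn` naming are illustrative:

```python
# inference.py sketch: no `import ray`, no RLlib. Only what the forward
# pass needs, so the inference container never sees training dependencies.
import numpy as np


class PolicyMLP:
    """Duplicate of the trained policy's architecture, with weights loaded
    from an exported checkpoint rather than from an RLlib Algorithm."""

    def __init__(self, weights):
        # weights: list of (W, b) pairs, one per layer
        self.weights = weights

    def forward(self, obs):
        x = np.asarray(obs, dtype=np.float64)
        for i, (W, b) in enumerate(self.weights):
            x = x @ W + b
            if i < len(self.weights) - 1:
                x = np.maximum(x, 0.0)  # ReLU on hidden layers only
        return x


def predict_fn(obs, model):
    """SageMaker-style handler: raise on failure, never fall back."""
    return model.forward(obs).tolist()
```

Duplicating the architecture feels like a DRY violation, but it is deliberate: the cost of a few duplicated layer definitions is far lower than the cost of a training framework leaking into the inference path.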
Same task: Deploy updated RL model to production.
The agent reads CLAUDE.md at session start. Policy 6 is explicit: "Separate training/inference code — Ray imports in inference containers = segfault. Duplicate model architecture instead." The agent opens the knowledge base, finds the deployment_process pattern, and follows the deployment checklist from ops/03-deployment-checklist.md.
The checklist includes: verify inference module has no training-only imports, build container locally, test inference with sample payload, deploy to SageMaker, verify endpoint is InService, invoke with test data, check CloudWatch logs for errors.
The agent creates the standalone inference module, tests it locally, deploys, verifies the endpoint responds, checks the logs for any errors — and is done in 30 minutes. No segfault. No five-hour debugging session. No missed trading window.
The difference isn't that the agent is smarter. It's that the system contains the answer to a problem that was already solved once. The Feb 6 post-mortem identified the root cause. The CLAUDE.md policy encoded the constraint. The knowledge base provided the code pattern. The deployment checklist provided the verification steps. Each layer added a guardrail, and the guardrails compound.
This is what I mean by documentation that deploys. It doesn't sit in a wiki waiting to be read. It's loaded into the agent's context at session start, structured for programmatic lookup, and connected to the verification steps that prove the deployment actually works.
If you're working with AI agents and want to build a similar system, here's where to start:
- **Start with a mishap log.** Track every mistake with a dollar amount. Even $0 near-misses deserve an entry. The act of quantifying forces you to take incidents seriously and creates the data you need to prioritize prevention.
- **Write post-mortems with multiple root causes.** Single-root-cause analysis is almost always incomplete. Force yourself to find at least three contributing factors. Use a template with mandatory sections so you can't skip the hard parts.
- **Encode policies where agents read them.** If you're using AI coding agents, put your most critical rules in the file that loads automatically every session. For Claude, that's CLAUDE.md. For other tools, find the equivalent. Twelve imperative rules beat a hundred pages of guidelines.
- **Structure knowledge for machines, not just humans.** Prose documentation is for context. Structured YAML (or JSON, or whatever your agents parse) is for enforcement. Include severity levels, code examples, verification steps, and reference files.
- **Create the feedback loop.** The system only works if every incident propagates through all layers: incident → post-mortem → pattern/anti-pattern → policy → runbook → code change → CI gate. Skip any step and the loop has a gap.
- **Treat documentation updates as part of the fix.** When you close an incident, the code change is half the work. The policy update, the runbook update, and the knowledge base entry are the other half. If you don't update the documentation, you'll fix the same problem twice.
The investment is real — building and maintaining four documentation layers takes time. But the alternative is paying for the same lesson more than once. The documentation system exists to make sure every failure only happens once.
This is Post 6 of an 8-part series on building a full-stack AI trading application with LLM coding agents. Previous: The Ensemble That Actually Trades — How PPO, SAC, and TD3 Became One Decisioning System | Next: One Developer, 12 Workflows, and a Production ML System
The documentation system tells agents what NOT to do. Next week: the full development system that lets one person ship production ML at startup speed.

