Distilled principles from production compilers, expert practitioners, and the greatest software engineers.
Merge lexing, parsing, and transformation into as few passes as possible. Each pass is a full traversal; fewer passes keep data cache-hot.
- esbuild uses only 3 full AST passes: (1) lex + parse + scope setup + symbol declaration, (2) bind + constant fold + syntax lower + mangle, (3) print + source maps. [esbuild architecture]
- OXC combines multiple optimizations into single passes with early termination. [OXC performance]
- Biome parses once and reuses the tree for formatting, linting, and analysis. [Biome announcement]
- Lua emits VM instructions during parsing with no AST construction at all. [Implementation of Lua 5.0 (PDF)]
"The fewer passes you have to make over your data (and also the fewer different representations you need to transform your data into), the faster your compiler will go." -- Evan Wallace [esbuild FAQ]
- ASCII lookup tables: V8 uses a 128-entry table with classification flags (ID_Start, ID_Continue, keyword start). Single table lookup + branch replaces multiple comparisons. Gains: 1.2-2.1x across token types. [V8 scanner blog]
- String views over copies: Point into the original input rather than allocating new strings for token text. [xnacly fast lexer strategies]
- Hash-while-scanning identifiers: Compute the hash value simultaneously with parsing the identifier. Avoids a separate hashing pass. [Sean Barrett lexing strategies]
- Split hot loops by radix: V8 uses three specialized number-scanning loops (decimal, hex, octal) instead of checking radix every iteration. Fewer branches in hot loops = faster scanning. [V8 scanner blog]
- Keyword detection via word-sized comparisons: If all keywords fit in 8 bytes, pack them into a 64-bit register for comparison instead of string comparison. Beat the `logos` lexer generator by 30%. [alic.dev fast lexing]
- Direct-coded vs table-driven lexers: re2c-style directly-coded engines (using goto) are 2-3x faster than flex-style table-driven engines. [alic.dev fast lexing]
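A minimal sketch combining three of the techniques above: a 128-entry ASCII classification table, span-based (zero-copy) tokens, and hash-while-scanning. All names and the FNV-1a hash choice are illustrative, not taken from any real lexer.

```typescript
// Classification flags, one byte per ASCII code point.
const ID_START = 1, ID_CONTINUE = 2, DIGIT = 4;

const CLASS = new Uint8Array(128);
for (let c = 0x41; c <= 0x5a; c++) CLASS[c] = ID_START | ID_CONTINUE; // A-Z
for (let c = 0x61; c <= 0x7a; c++) CLASS[c] = ID_START | ID_CONTINUE; // a-z
CLASS[0x5f] = ID_START | ID_CONTINUE;                                 // _
for (let c = 0x30; c <= 0x39; c++) CLASS[c] = DIGIT | ID_CONTINUE;    // 0-9

// A token is a (start, end) view into the source plus a precomputed hash.
interface Token { start: number; end: number; hash: number }

// Scan one identifier starting at `pos`, computing an FNV-1a hash in the
// same loop that finds the end: no second pass, no substring copy.
function scanIdent(src: string, pos: number): Token {
  const start = pos;
  let hash = 0x811c9dc5;
  while (pos < src.length) {
    const c = src.charCodeAt(pos);
    if (c >= 128 || !(CLASS[c] & ID_CONTINUE)) break; // one lookup + one branch
    hash = Math.imul(hash ^ c, 0x01000193) >>> 0;
    pos++;
  }
  return { start, end: pos, hash };
}
```

The token never owns its text; `src.slice(start, end)` materializes it only if output actually needs it.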
Handwritten recursive descent dominates: 8 of the top 10 Redmonk languages use handwritten parsers. Go saw an 18% parsing speedup switching to handwritten in Go 1.6. [Phil Eaton survey]
Pratt parsing for expressions: Handles all precedence levels in a single loop controlled by binding power values, eliminating one function call per precedence level that traditional recursive descent requires. Used in JSLint, rust-analyzer, and many production parsers. [matklad Pratt parsing] [Bob Nystrom Pratt parsers] [Crockford TDOP]
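A toy Pratt parser to make the single-loop shape concrete: one function, binding powers in a table, no per-precedence-level call chain. Grammar and names are illustrative (numeric atoms only, left-associative binary operators).

```typescript
// Binding powers: higher binds tighter.
const BP: Record<string, number> = { "+": 10, "-": 10, "*": 20, "/": 20 };

type Expr = number | [string, Expr, Expr];

function parse(tokens: string[]): Expr {
  let pos = 0;
  function parseExpr(minBp: number): Expr {
    let lhs: Expr = Number(tokens[pos++]);        // atom: a number literal
    while (pos < tokens.length) {
      const op = tokens[pos];
      const bp = BP[op];
      if (bp === undefined || bp <= minBp) break; // weaker operator: yield to caller
      pos++;                                      // consume the operator
      const rhs = parseExpr(bp);                  // recurse with its binding power
      lhs = [op, lhs, rhs];                       // left-associative fold
    }
    return lhs;
  }
  return parseExpr(0);
}
```

The `bp <= minBp` comparison is the entire precedence-climbing mechanism; adding an operator is one table entry, not a new function.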
Lazy parsing (pre-parsing): V8's preparser syntax-checks functions without building a full AST. The preparser is ~2x faster than the full parser. Full parsing happens on-demand when functions are actually called. [V8 lazy parsing blog]
Error-tolerant (resilient) parsing: LL parsers naturally handle well-formed prefixes of incomplete constructs. Lossless syntax trees store all whitespace and trivia, enabling perfect source reconstruction. [matklad resilient LL parsing]
Parser combinator overhead: 5-10x performance overhead vs hand-rolled recursive descent. Monadic combinators resist optimization because predicates are undecidable. Use for prototypes; replace if measurement shows problems. [HN discussion]
Flat arrays with indices: Replace pointer-based tree nodes with indices into flat arrays. 32-bit indices halve reference sizes vs 64-bit pointers. Measured 1.5-2.4x speedup from flattening alone. [Adrian Sampson flattening ASTs] [Super-flat ASTs]
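A sketch of the flattening idea: every node lives in one array and children are referenced by index rather than by pointer. Node shape and names are illustrative.

```typescript
type Kind = "num" | "add";
interface Node { kind: Kind; value: number; lhs: number; rhs: number }

const nodes: Node[] = [];

// A "NodeId" is just an index into the flat array.
function push(n: Node): number {
  nodes.push(n);
  return nodes.length - 1;
}

const num = (value: number) => push({ kind: "num", value, lhs: -1, rhs: -1 });
const add = (lhs: number, rhs: number) => push({ kind: "add", value: 0, lhs, rhs });

// Traversal chases indices, not pointers; nodes stay contiguous in memory.
function evalNode(id: number): number {
  const n = nodes[id];
  return n.kind === "num" ? n.value : evalNode(n.lhs) + evalNode(n.rhs);
}
```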
Struct-of-Arrays (SoA): Store each field in its own array instead of an array of structs. Zig reduced tokens from 64 bytes to 5 bytes. Results: 20% wall-clock reduction in tokenizer/AST creation, 40% in IR generation. [Zig AST rework PR] [Andrew Kelley DoD talk (HN)]
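The same flat-AST idea in Struct-of-Arrays form, sketched with typed arrays: each field gets its own array, so a node costs a 1-byte tag plus two u32 fields instead of a heap object. Layout and names are illustrative, not Zig's actual encoding.

```typescript
const CAP = 1 << 16;
const kind = new Uint8Array(CAP);   // 0 = num, 1 = add
const a = new Uint32Array(CAP);     // literal value (num) or left child (add)
const b = new Uint32Array(CAP);     // right child (add)
let len = 0;

function num(value: number): number {
  const id = len++;
  kind[id] = 0; a[id] = value;
  return id;
}

function add(lhs: number, rhs: number): number {
  const id = len++;
  kind[id] = 1; a[id] = lhs; b[id] = rhs;
  return id;
}

// A pass that only inspects `kind` touches 1 byte per node, not 9.
function evalNode(id: number): number {
  return kind[id] === 0 ? a[id] : evalNode(a[id]) + evalNode(b[id]);
}
```

Reusing the `a` slot for both literal values and left children mirrors the "every byte on the most-frequent struct counts" argument below.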
Arena/bump allocation: O(1) allocation (increment pointer), O(1) deallocation (free the entire arena). OXC's arena allocation yielded ~20% speedup. Arena-allocated nodes are contiguous in memory, improving cache line utilization. [OXC performance] [Ryan Fleury arena allocators] [Chris Wellons arena tips]
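A bump-allocator sketch over one preallocated buffer: allocation is a pointer increment, and "free everything" is a reset. In a GC'd language this only simulates the idea; the API is illustrative.

```typescript
class Arena {
  private top = 0;
  constructor(private view: DataView) {}

  // O(1): hand out the next `bytes` bytes as an offset into the arena.
  alloc(bytes: number): number {
    if (this.top + bytes > this.view.byteLength) throw new Error("arena full");
    const at = this.top;
    this.top += bytes;
    return at;
  }

  // O(1): free the entire arena at once.
  reset(): void { this.top = 0; }

  writeU32(at: number, v: number): void { this.view.setUint32(at, v); }
  readU32(at: number): number { return this.view.getUint32(at); }
}
```

Because successive `alloc` calls return adjacent offsets, nodes allocated together are read together from the same cache lines.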
Stop storing derivable data: Zig removed line/column info from tokens because it can be lazily computed from byte offset. Every byte saved on the most-frequent struct is multiplied by millions of instances. [Zig DoD issue]
u32 spans: OXC changed usize to u32 for span offsets (files > 4GB are pathological). Up to 5% improvement on large files. [OXC AST design]
Zero-copy parsing: Store (start, end) index pairs into the original source string. Slice only at output time. Eliminates all intermediate string allocations during parsing. In V8/JSC, String.prototype.slice() returns a view (pointer + offset + length). [Zero-copy techniques (Go)] [Manish Goregaokar zero-copy]
String interning: Deduplicate strings to integer IDs. Comparison becomes O(1) integer equality. V8 deduplicates at the scanner boundary. rustc uses an interned Symbol type. A broken interning function in rustc caused 59% excess peak memory and page faults. [matklad rust interner] [Nethercote rustc 2019]
Caveat -- interning vs parallelism: OXC found that global mutex lock contention from string-cache was a bottleneck during parallel parsing. Removing it improved parallel parsing by ~30%. Interning helps single-threaded; hurts multi-threaded. [OXC performance]
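A minimal interner sketch in the style described above: each distinct string maps to a small integer id, so equality checks become integer comparisons. Single-threaded, per the caveat; names are illustrative.

```typescript
class Interner {
  private ids = new Map<string, number>();
  private strings: string[] = [];

  // Same string in, same id out; new strings get the next id.
  intern(s: string): number {
    let id = this.ids.get(s);
    if (id === undefined) {
      id = this.strings.length;
      this.strings.push(s);
      this.ids.set(s, id);
    }
    return id;
  }

  // Map an id back to its string, e.g. at diagnostic/output time.
  resolve(id: number): string {
    return this.strings[id];
  }
}
```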
Small string optimization: Strings <= 24 bytes stored inline without heap allocation. Covers the majority of identifiers. [OXC AST design]
SIMD string scanning: simdjson processes 32 bytes simultaneously, producing bitsets of structural character positions. 4x faster than RapidJSON. Native indexOf for single chars uses SIMD internally in V8/JSC -- prefer it over manual charCodeAt loops. [simdjson] [StringZilla]
Rope concat in JSC/Bun: string `+=` is faster than array-push + `join` for small/medium segments because JSC uses rope concatenation internally. Only use array-join for very large strings with many pieces.
- Thompson NFA guarantees linear time: Tracks all states simultaneously. O(n * m) worst case where m is pattern size. No exponential blowup. Go's regexp uses this exclusively. [Russ Cox regexp1]
- Nested quantifiers cause exponential backtracking: Patterns like `(\s*-+\s*)*` are the primary danger. [Snyk ReDoS]
- Restrict `\s` to `[ \t]` when newlines are structurally invalid. `\s` includes `\n`, causing cross-line backtracking.
- Adjacent capture groups with overlapping character classes: Cause backtracking when the delimiter is absent. Replace with charCode scans.
- Regex on unbounded input is inherently risky: Prefer charCode scans that walk the string once in a single forward pass.
- When regex wins: On large inputs (650K+), V8/JSC's native regex engine (compiled to machine code) can be 2x faster than JS character loops. [ratfactor benchmarks]
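A charCode-scan sketch of the kind of replacement suggested above: splitting `"a - b - c"` on dash delimiters in one forward pass, trimming only `[ \t]`, instead of a backtracking-prone pattern like `\s*-+\s*`. Function names are illustrative, and unlike the regex it treats each `-` as its own delimiter.

```typescript
const SP = 32, TAB = 9, DASH = 45;

// Trim leading/trailing spaces and tabs with index arithmetic, no regex,
// and no newline handling -- the [ \t] restriction made explicit.
function trimLight(s: string, lo: number, hi: number): string {
  while (lo < hi && (s.charCodeAt(lo) === SP || s.charCodeAt(lo) === TAB)) lo++;
  while (hi > lo && (s.charCodeAt(hi - 1) === SP || s.charCodeAt(hi - 1) === TAB)) hi--;
  return s.slice(lo, hi);
}

// One forward pass over the input: linear time by construction.
function splitOnDash(s: string): string[] {
  const out: string[] = [];
  let start = 0;
  for (let i = 0; i < s.length; i++) {
    if (s.charCodeAt(i) === DASH) {
      out.push(trimLight(s, start, i));
      start = i + 1;
    }
  }
  out.push(trimLight(s, start, s.length));
  return out;
}
```

Worst-case behavior is O(n) regardless of input shape; there is no state to backtrack.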
Red-Green trees (Roslyn): Two layered trees. The green tree is immutable, persistent, built bottom-up with no parent references. The red tree is computed on-demand. ~95% of keyword occurrences share the exact same green node object. O(log n) incremental re-parsing. [Eric Lippert red-green trees]
Tree-sitter incremental parsing: Structural sharing between old and new trees. Only edited portions are re-parsed. Re-parsing after a small edit is sub-millisecond. [Tree-sitter GitHub]
Query-based compilation (rustc): Instead of sequential stages, the compiler has memoized queries. On re-compilation, only queries whose inputs changed are re-executed. [Rust incremental compilation]
- esbuild: Parsing and codegen are embarrassingly parallel. Go's shared heap avoids JS's serialization overhead between worker threads. [esbuild FAQ]
- SWC: Work-stealing parallelism via rayon. 20x faster than Babel single-threaded, 70x on 4 cores. [SWC website]
- Mold linker: Links Chrome (2 GB) in ~2 seconds (~2x the time to `cp` the file). 5x faster than lld, 25x faster than GNU gold. Speculative preprocessing + aggressive multi-core parallelism. [Mold slides]
- TypeScript Go port: Native compilation (3-4x) + parallel type checking (2-3x) = ~10x total. [TypeScript native port blog]
- Constant folding + propagation + DCE work iteratively: Each creates opportunities for the others. [Constant folding (Wikipedia)]
- Phase ordering matters: LLVM has 100+ passes. Mutually beneficial sub-sequences should be treated as units. [Phase ordering (arxiv)]
- SSA form: Each value defined exactly once. Enables CSE, DCE, constant propagation, and register allocation. [LLVM passes docs]
- Modular pass architecture: Each pass does one thing well. Analysis passes compute; transform passes mutate. [AOSA Book: LLVM]
- Deferred error analysis: Keep the hot parsing path free of complex error handling. Delegate to a separate semantic analyzer. (OXC: 3x faster than SWC, 5x faster than Biome.) [Pursuit of Performance (Rust Magazine)]
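A toy illustration of folding creating DCE opportunities, on a made-up expression IR (numbers, string variables, two operators). Because the rewrite works bottom-up, one pass suffices here; real pipelines iterate because folding, propagation, and DCE are separate passes.

```typescript
type Expr = number | string | { op: "+" | "*"; l: Expr; r: Expr };

function foldOnce(e: Expr): Expr {
  if (typeof e !== "object") return e;              // literal or variable
  const l = foldOnce(e.l), r = foldOnce(e.r);
  if (typeof l === "number" && typeof r === "number")
    return e.op === "+" ? l + r : l * r;            // constant folding
  if (e.op === "*" && (l === 0 || r === 0)) return 0; // other operand is now dead
  if (e.op === "*" && l === 1) return r;            // identity: 1 * x
  if (e.op === "*" && r === 1) return l;            // identity: x * 1
  if (e.op === "+" && l === 0) return r;            // identity: 0 + x
  if (e.op === "+" && r === 0) return l;            // identity: x + 0
  return { op: e.op, l, r };
}
```

The `x * 0 => 0` rewrite is where folding hands work to DCE: whatever computed `x` no longer has a consumer (sound here only because the toy IR has no side effects).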
- NaN boxing: Pack all value types into 64 bits using IEEE 754 NaN payloads. Half memory vs tagged union. ~20% speedup in Lua. Used by LuaJIT, JavaScriptCore, SpiderMonkey. [NaN boxing blog] [Crafting Interpreters optimization]
- Register-based > stack-based bytecode: Fewer instructions, avoids separate operand stack memory traffic. (Lua, LuaJIT 2.) [Crafting Interpreters bytecode]
- Trace compilation: Profile during execution, JIT-compile hot loops. Allocation sinking eliminates temporary objects that don't escape their trace. [LuaJIT allocation sinking]
- Streaming/one-pass compilation: V8's Liftoff compiler for WebAssembly iterates bytecode once, emitting machine code immediately. No IR, no optimization passes. [V8 WASM pipeline]
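A bit-level sketch of the NaN-boxing layout described above, using a `DataView` as the 64-bit slot. The tag placement (spare mantissa bits 16-18 of the high word) is illustrative, not any particular engine's scheme; bits stay in the buffer rather than round-tripping through JS numbers, since engines may canonicalize NaN payloads.

```typescript
const QNAN_HI = 0x7ff80000;  // exponent all ones + quiet bit: a quiet NaN
const TAG_INT = 1;           // example tag; 0 is reserved for real doubles

const slot = new DataView(new ArrayBuffer(8));  // one 64-bit value slot

function boxDouble(x: number): void { slot.setFloat64(0, x); }

function boxInt(i: number): void {
  slot.setUint32(0, QNAN_HI | (TAG_INT << 16)); // hi word: NaN bits + tag
  slot.setUint32(4, i >>> 0);                   // lo word: 32-bit payload
}

// Tag 0 means "a real double" (including genuine NaN, whose tag bits are 0).
function tagOf(): number {
  const hi = slot.getUint32(0);
  return (hi & 0x7ff80000) === QNAN_HI ? (hi >>> 16) & 0x7 : 0;
}

function readInt(): number { return slot.getUint32(4) | 0; }
function readDouble(): number { return slot.getFloat64(0); }
```

Every value, double or boxed, fits one 8-byte slot; a tagged union would need a separate tag word per value.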
- Hidden classes: Initialize all object properties in the same order in constructors. Objects with identical structure share a hidden class, enabling inline caching.
- Short-lived objects: Objects that die before the next minor GC (scavenge) incur near-zero GC cost.
- Monomorphic call sites: Ensure functions are called with consistent argument shapes to enable JIT optimization.
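The hidden-class and monomorphism advice above, sketched: every `Point` initializes the same fields in the same order, so all instances share one hidden class and `dot` stays a monomorphic call site. Illustrative class, not engine API.

```typescript
class Point {
  x: number;
  y: number;
  constructor(x: number, y: number) {
    this.x = x;   // always initialize every field,
    this.y = y;   // always in the same order
  }
}

// Every call sees the same object shape, so the property loads can be
// served by inline caches instead of dictionary lookups.
function dot(a: Point, b: Point): number {
  return a.x * b.x + a.y * b.y;
}
```

The anti-pattern is adding fields conditionally or in varying order (`if (flag) p.z = 0`), which forks the hidden class and turns call sites polymorphic.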
- Vec growth strategy matters: rustc changed Vec growth from `0, 1, 2, 4, 8, 16` to `0, 4, 8, 16`, reducing total allocations by 10%+. Similar principles apply to JS array pre-sizing. [Nethercote rustc optimizations]
- Babel's overhead: `babel-traverse` is 8-16x slower than `babel-walk` because it doesn't cache the exploded visitor. Each plugin adds another full AST traversal. [babel-walk]
"Your head is a faulty interpreter." Minimize state mutation. Maximize code analyzability. Inlining large functions (even 2000 lines) makes execution order obvious and prevents unintended call sites. Static code analysis is the most important practice -- more valuable for the mindset change than the bugs caught.
"The real enemy addressed by inlining is unexpected dependency and mutation of state, which functional programming solves more directly."
[Carmack on Inlined Code] [Carmack on Static Analysis]
The Three Big Lies: (1) Software is a platform -- the hardware is the platform. (2) Code should model the world -- real-world IS-A relationships don't map to data transformations. (3) Code is more important than data -- the transformation of data is the only purpose of any program.
"If you don't understand the data, you don't understand the problem."
10x improvements come from data layout transformations, not algorithmic cleverness. Design for the "multiple" case -- there is never "one" of anything important.
[CppCon 2014 talk] [Data Oriented Programming]
Polymorphism, virtual dispatch, and deep class hierarchies create indirection that defeats CPU branch prediction and cache locality. Organize code by what operations do, not by what types represent.
"I've never seen a codebase written with 'clean code' principles in mind that is also maintainable and easy to develop on top of."
The 5 Rules: (1) You can't tell where a program spends its time. (2) Measure before tuning. (3) Fancy algorithms are slow when n is small, and n is usually small. (4) Fancy algorithms have more bugs. (5) Data dominates -- if you've chosen the right data structures, the algorithms will be self-evident.
[Pike's 5 Rules] [Notes on Programming in C]
"When in doubt, use brute force." Think about how a bug came to be -- build a mental model and find where the model is wrong. Small teams produce better abstractions because committee designs accumulate everyone's favorite feature.
[Best advice from Ken Thompson (Pike)]
Good taste means eliminating edge cases through better abstractions. The linked-list `**p` example: use an indirect pointer to make the special case for deleting the first element disappear.
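A TypeScript analogue of the `**p` trick (the original is C, where `p` is a pointer to the pointer being followed): the cursor always points at the object holding the current link, and the list handle itself plays that role for the head, so deletion has one code path. Shapes and names are illustrative.

```typescript
interface Node { value: number; next: Node | null }
interface List { next: Node | null }   // the handle looks like a node's tail

function remove(list: List, value: number): void {
  let p: List | Node = list;           // p holds "the link to rewrite"
  while (p.next !== null && p.next.value !== value) p = p.next;
  if (p.next !== null) p.next = p.next.next;  // head included, no special case
}
```

The naive version (`if (head.value === value) head = head.next; else ...`) is the edge case that good taste deletes.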
"Bad programmers worry about the code. Good programmers worry about data structures and their relationships."
[Torvalds good taste (GitHub)] [Torvalds on Wikiquote]
Simple vs. Easy: Simple means not interleaved (objective). Easy means familiar (subjective). Complecting (braiding together) is the root of complexity. State complects value and time. Inheritance complects types.
"Most critical bugs come from misunderstanding the problem, not from implementation errors." (Hammock-driven development)
[Simple Made Easy transcript] [Hammock Driven Development]
"You wanted a banana but what you got was a gorilla holding the banana and the entire jungle." Let things crash. Write corrective code, not defensive code. Keep state explicit and visible.
[Why OO Sucks] [Joe Armstrong's legacy (The New Stack)]
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."
Literate programming: writing in teaching mode to another human produces fewer mistakes and takes less time. Forces poorly thought-out designs to become obvious.
[Literate Programming (PDF)] [Knuth at Quanta]
Essential complexity is inherent in the problem domain. Accidental complexity is what we create unnecessarily. Past breakthroughs (high-level languages, time-sharing, IDEs) all addressed accidental complexity. No single technique yields an order-of-magnitude improvement.
"Avoid local maximums." Question foundational premises. 35% speed improvements from data layout optimization alone, no algorithmic changes. Make allocators explicit, first-class values.
"The less memory is touched, the less pressure there will be on the CPU."
[Kelley on Zig design (Sourcegraph)] [Kelley CoRecursive podcast]
Platform values (approachability, debuggability, performance, robustness, simplicity...) conflict, and you must choose. The right language/platform is the one whose values align with your project's goals. Build observability tools as you need them.
[Software as a Reflection of Values (CoRecursive)]
Deep understanding of fundamentals enables one person to build what teams cannot. TinyGL in 8,000 lines. QEMU solo through v0.7.1. Discipline and tight feedback loops -- not genius -- explain extraordinary productivity.
[Bellard's Performance Craft (Koder.ai)]
"Write programs that do one thing and do it well. Write programs to work together. Expect the output of every program to become the input to another, as yet unknown, program."
STEADY goals: Simple, Timely, Efficient, Adaptable, Dependable, Yummy. Have a spec. Get it right. Keep it clean. Don't hide power. Don't generalize prematurely. Good fences make good neighbors.
[Hints and Principles for Computer System Design (arXiv)]
Three-fold principle: (1) Keep it simple, (2) Do not speculate -- never include unused code, (3) Do it yourself. Code should fit in 1KB blocks -- a quantum comprehensible on a single screen. Beautiful code is never larger than this.
"Complexity is the problem. Moving it from hardware to software, or vice versa, doesn't help. Simplicity is the only answer."
[Moore on Keeping It Simple (Simple Talk)]
"OOP to me means only messaging, local retention and protection and hiding of state-process, and extreme late-binding of all things." The big idea is messaging, not objects. Design how modules communicate, not what they contain.
[Alan Kay on OOP] [Deep Insights of Alan Kay]
Optimize the interpreter first, then JIT compile only CPU-intensive parts. Keep the codebase astonishingly small. Understand what the target runtime can and cannot optimize.
[Pall on interpreter optimization (HN)]
"Most of the decisions I make in programming are aesthetic decisions." C++ is burdened with every idea ever introduced. Good tools should feel like C but shed its worst baggage: no silent errors, no complex build chains, no mysterious behaviors.
[Blow on good design (Notion)]
Ten principles that appear across multiple compilers and multiple engineering legends:
1. Data first, code second. Understand the data before writing code. Choose data structures first; algorithms follow. Make data layouts cache-friendly and minimal. -- Pike Rule 5, Acton, Torvalds, Kelley, Carmack
2. Simplicity is not optional. Simple means not braided together, not "easy to type." Every abstraction must justify itself. If you can't hold it in your head, it's too complex. -- Moore, McIlroy, Hickey, Armstrong, Thompson
3. Measure, don't guess. Never optimize without profiling. Bottlenecks occur in surprising places. But when you find the critical 3%, optimize it ruthlessly. -- Pike Rules 1-2, Knuth's 97/3, Carmack
4. Eliminate edge cases through better abstractions. Good taste means reformulating problems so special cases disappear. The right data structure. The right interface. The right level of indirection. -- Torvalds, Lampson, Kay
5. Understand the machine. Know how CPUs, caches, and memory work. Smaller data structures = faster programs via cache effects. This is not premature optimization -- it's foundational design. -- Acton, Kelley, Knuth, Bellard, Pall
6. Fail explicitly, recover independently. Don't hide errors. Let things crash. Write corrective code, not defensive code. Make failures observable and isolated. -- Armstrong, Cantrill, Carmack
7. Think before you code. Most critical bugs come from misunderstanding the problem, not from typos. Build a mental model. When a bug appears, reason about how the model is wrong, not just where the symptom is. -- Hickey, Thompson/Pike, Knuth, Brooks
8. Compose small, focused units. Do one thing well. Design how modules communicate, not what they contain internally. Expect output to become input to unknown future consumers. -- McIlroy, Kay, Lampson, Hickey
9. Minimize intermediate representations. Every serialization/deserialization boundary is wasted work. Do maximum work per pass. Keep data in a single consistent format throughout the pipeline. -- Wallace/esbuild, Bellard, Pall
10. Don't speculate. Never include code that isn't used. Don't generalize before you need to. N is usually small. Fancy algorithms have big constants and more bugs. -- Moore, Brooks, Pike Rule 3