Skip to content

Instantly share code, notes, and snippets.

@leonardoalt
Last active March 10, 2026 20:35
Show Gist options
  • Select an option

  • Save leonardoalt/abda9a29e240c4d1f061ef09c659acf4 to your computer and use it in GitHub Desktop.

Select an option

Save leonardoalt/abda9a29e240c4d1f061ef09c659acf4 to your computer and use it in GitHub Desktop.

WOMIR APC Generation Profiling Results

Date: 2026-03-10 Block: 24171384, --apc-count 10

High-Level Timing

Phase Wall Time
APC generation (parallel) 57.1 min
Keygen 0.8 min
Total compile ~58 min

The Problem: Long Tail of Slow Blocks

19,031 blocks were generated (0 failures). The sum of all per-block times is 36.7 hours of single-threaded work, but parallelism brings it to 57 min wall time. However, the wall time is bottlenecked by the slowest blocks on the critical path.

Time bucket Blocks % of blocks Sum of time % of time
<1s 14,001 73.6% 2,634s 2.0%
1-10s 3,326 17.5% 9,072s 6.9%
10-60s 973 5.1% 29,094s 22.0%
60-120s 596 3.1% 46,115s 34.9%
120-600s 120 0.6% 20,022s 15.2%
>600s (>10min) 15 0.1% 25,078s 19.0%

Top 10 Slowest Blocks

Time Instrs PC Before optimizer (cols)
41.3 min 908 2211432 50,882
40.7 min 916 2152724 51,242
37.3 min 916 520876 51,242
35.4 min 884 2234668 ~50,000
33.8 min 888 2295204 ~50,000
29.8 min 746 2450600 41,814
28.3 min 754 2193372 ~42,000
27.4 min 738 2220196 ~41,000
27.0 min 766 2226264 ~42,000
25.1 min 916 2720916 51,242

Key Finding: Pathological Medium Blocks

The most disproportionately slow blocks are medium-sized (20-60 instructions):

Block Instrs Cols Time Time/instr
pc=403788 22 1,039 149s 6.8s/instr
pc=2283700 48 2,222 167s 3.5s/instr
pc=325000 58 2,177 142s 2.4s/instr
Hundreds of 29-instr blocks 29 1,335 ~83s each 2.9s/instr

Compare with large blocks (which are more efficient per-instruction):

Block Instrs Cols Time Time/instr
pc=42828 541 25,654 193s 0.4s/instr
pc=361456 3,723 170,978 1,115s 0.3s/instr

Optimizer cost per 1000 items (cols+constraints+bus)

  • Large blocks (~15K-30K cols): ~2-3s per 1000 items
  • Medium blocks (~1K-2.5K cols): 10-60s per 1000 items ← 10-30x worse

Optimizer Loop Analysis

The optimizer converges in 3-5 iterations. Iteration 3 is disproportionately slow on medium WOMIR blocks:

Iteration System size (constraints) Time Notes
1 20,671 constraints 122s Proportional to size
1 7,569 constraints 76s Proportional to size
1 4,179 constraints 33s Proportional to size
3 676 constraints 79s Disproportionate
3 339 constraints 37s Disproportionate
3 529 constraints 45s Disproportionate

Something in optimize_constraints iteration 3 (solver-based optimization + rule-based optimization + memory optimization + low-degree bus interaction optimization) is hitting a bad case on WOMIR constraint patterns.

WOMIR Instruction Overhead

WOMIR generates ~46-50 columns per instruction due to:

  • Frame-pointer overhead: every instruction reads FP from memory, then uses FP-relative addressing for all operands
  • 64-bit variants: 8-limb decomposition (vs 4 for 32-bit) doubles constraints
  • More bus interactions per operand: 2+ per operand (FP read + register read) vs 1 for RISC-V

For comparison, RISC-V generates 12,402 blocks for the same Ethereum block and completes in ~11 min.

Identical Block Duplication

Many blocks have identical constraint fingerprints:

  • Hundreds of 29-instruction blocks: exactly 1,335 cols, 1,086 constraints, 782 bus interactions
  • Hundreds of 30-instruction blocks: exactly 1,375 cols, 1,119 constraints, 806 bus interactions

Each of these runs through the full optimizer independently at ~80-85s each.

Root Causes

  1. No apc_max_instructions limit — defaults to u32::MAX, allowing blocks up to 3,723 instructions (170K cols, 18.6 min each)

  2. ~46-50 cols per WOMIR instruction — the frame-pointer overhead and 64-bit variants create ~2x more constraints than RISC-V per instruction

  3. Optimizer has superlinear behavior on certain WOMIR constraint patterns — the per-1000-items time varies from 2s (large blocks) to 60s (medium blocks). Something in the constraint solver is hitting a bad case.

  4. 0 failures — every block generates an APC, even the massive ones. No blocks are filtered out.

  5. Duplicate blocks waste optimizer time — identical constraint systems are optimized independently.

Recommended Mitigations

Quick wins

  1. Set apc_max_instructions to ~100-200 — eliminates the 15 blocks taking >10 min (19% of compute time) and most of the 120 blocks in the 2-10 min range (15%). Wall time would likely drop from 57 min to ~15-20 min.

  2. Add a per-block timeout — if a block takes >X seconds in the optimizer, skip it. Most useful APCs are the fast-to-generate ones anyway.

  3. Deduplicate blocks by constraint fingerprint — run the optimizer once per unique (cols, constraints, bus_interactions) tuple, reuse the result. Would eliminate hundreds of redundant 80s+ optimizer runs.

Deeper investigation

  1. Profile the optimizer iteration 3 pathology — add per-step timing within optimize_constraints to pinpoint which sub-step (solver-based, rule-based, memory, low-degree bus interaction) causes the superlinear behavior on medium WOMIR blocks.

  2. Review WOMIR adapter constraint design — the ~46-50 cols/instruction overhead from frame-pointer handling may be reducible.

Metrics Bug Found

In powdr/autoprecompiles/src/lib.rs lines 348-353 and 367-372, the before_opt_constraints and before_opt_interactions metrics both incorrectly use machine.unique_references().count() (which is the column count). They should use machine.constraints.len() and machine.bus_interactions.len() respectively.

Instrumentation Added

Changes were made to local ~/devel/powdr (not committed upstream):

  • autoprecompiles/src/pgo/cell/mod.rs: Per-block progress logging (info level), success/failure counts, parallel generation summary
  • autoprecompiles/src/lib.rs: Per-block timing for statements_to_symbolic_machine, optimizer, before/after constraint sizes; fixed metrics bug
  • autoprecompiles/src/optimizer.rs: Solver construction timing, per-iteration timing and stats, convergence iteration count

Run with RUST_LOG=info to see all instrumentation output.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment