WOMIR APC Generation Profiling Results

Date: 2026-03-10 Block: 24171384, --apc-count 10

High-Level Timing

Phase	Wall Time
APC generation (parallel)	57.1 min
Keygen	0.8 min
Total compile	~58 min

The Problem: Long Tail of Slow Blocks

19,031 blocks were generated (0 failures). The sum of all per-block times is 36.7 hours of single-threaded work, but parallelism brings it to 57 min wall time. However, the wall time is bottlenecked by the slowest blocks on the critical path.

Time bucket	Blocks	% of blocks	Sum of time	% of time
<1s	14,001	73.6%	2,634s	2.0%
1-10s	3,326	17.5%	9,072s	6.9%
10-60s	973	5.1%	29,094s	22.0%
60-120s	596	3.1%	46,115s	34.9%
120-600s	120	0.6%	20,022s	15.2%
>600s (>10min)	15	0.1%	25,078s	19.0%

Top 10 Slowest Blocks

Time	Instrs	PC	Before optimizer (cols)
41.3 min	908	2211432	50,882
40.7 min	916	2152724	51,242
37.3 min	916	520876	51,242
35.4 min	884	2234668	~50,000
33.8 min	888	2295204	~50,000
29.8 min	746	2450600	41,814
28.3 min	754	2193372	~42,000
27.4 min	738	2220196	~41,000
27.0 min	766	2226264	~42,000
25.1 min	916	2720916	51,242

Key Finding: Pathological Medium Blocks

The most disproportionately slow blocks are medium-sized (20-60 instructions):

Block	Instrs	Cols	Time	Time/instr
pc=403788	22	1,039	149s	6.8s/instr
pc=2283700	48	2,222	167s	3.5s/instr
pc=325000	58	2,177	142s	2.4s/instr
Hundreds of 29-instr blocks	29	1,335	~83s each	2.9s/instr

Compare with large blocks (which are more efficient per-instruction):

Block	Instrs	Cols	Time	Time/instr
pc=42828	541	25,654	193s	0.4s/instr
pc=361456	3,723	170,978	1,115s	0.3s/instr

Optimizer cost per 1000 items (cols+constraints+bus)

Large blocks (~15K-30K cols): ~2-3s per 1000 items
Medium blocks (~1K-2.5K cols): 10-60s per 1000 items ← 10-30x worse

Optimizer Loop Analysis

The optimizer converges in 3-5 iterations. Iteration 3 is disproportionately slow on medium WOMIR blocks:

Iteration	System size (constraints)	Time	Notes
1	20,671 constraints	122s	Proportional to size
1	7,569 constraints	76s	Proportional to size
1	4,179 constraints	33s	Proportional to size
3	676 constraints	79s	Disproportionate
3	339 constraints	37s	Disproportionate
3	529 constraints	45s	Disproportionate

Something in optimize_constraints iteration 3 (solver-based optimization + rule-based optimization + memory optimization + low-degree bus interaction optimization) is hitting a bad case on WOMIR constraint patterns.

WOMIR Instruction Overhead

WOMIR generates ~46-50 columns per instruction due to:

Frame-pointer overhead: every instruction reads FP from memory, then uses FP-relative addressing for all operands
64-bit variants: 8-limb decomposition (vs 4 for 32-bit) doubles constraints
More bus interactions per operand: 2+ per operand (FP read + register read) vs 1 for RISC-V

For comparison, RISC-V generates 12,402 blocks for the same Ethereum block and completes in ~11 min.

Identical Block Duplication

Many blocks have identical constraint fingerprints:

Hundreds of 29-instruction blocks: exactly 1,335 cols, 1,086 constraints, 782 bus interactions
Hundreds of 30-instruction blocks: exactly 1,375 cols, 1,119 constraints, 806 bus interactions

Each of these runs through the full optimizer independently at ~80-85s each.

Root Causes

No apc_max_instructions limit — defaults to u32::MAX, allowing blocks up to 3,723 instructions (170K cols, 18.6 min each)
~46-50 cols per WOMIR instruction — the frame-pointer overhead and 64-bit variants create ~2x more constraints than RISC-V per instruction
Optimizer has superlinear behavior on certain WOMIR constraint patterns — the per-1000-items time varies from 2s (large blocks) to 60s (medium blocks). Something in the constraint solver is hitting a bad case.
0 failures — every block generates an APC, even the massive ones. No blocks are filtered out.
Duplicate blocks waste optimizer time — identical constraint systems are optimized independently.

Recommended Mitigations

Quick wins

Set apc_max_instructions to ~100-200 — eliminates the 15 blocks taking >10 min (19% of compute time) and most of the 120 blocks in the 2-10 min range (15%). Wall time would likely drop from 57 min to ~15-20 min.
Add a per-block timeout — if a block takes >X seconds in the optimizer, skip it. Most useful APCs are the fast-to-generate ones anyway.
Deduplicate blocks by constraint fingerprint — run the optimizer once per unique (cols, constraints, bus_interactions) tuple, reuse the result. Would eliminate hundreds of redundant 80s+ optimizer runs.

Deeper investigation

Profile the optimizer iteration 3 pathology — add per-step timing within optimize_constraints to pinpoint which sub-step (solver-based, rule-based, memory, low-degree bus interaction) causes the superlinear behavior on medium WOMIR blocks.
Review WOMIR adapter constraint design — the ~46-50 cols/instruction overhead from frame-pointer handling may be reducible.

Metrics Bug Found

In powdr/autoprecompiles/src/lib.rs lines 348-353 and 367-372, the before_opt_constraints and before_opt_interactions metrics both incorrectly use machine.unique_references().count() (which is the column count). They should use machine.constraints.len() and machine.bus_interactions.len() respectively.

Instrumentation Added

Changes were made to local ~/devel/powdr (not committed upstream):

autoprecompiles/src/pgo/cell/mod.rs: Per-block progress logging (info level), success/failure counts, parallel generation summary
autoprecompiles/src/lib.rs: Per-block timing for statements_to_symbolic_machine, optimizer, before/after constraint sizes; fixed metrics bug
autoprecompiles/src/optimizer.rs: Solver construction timing, per-iteration timing and stats, convergence iteration count

Run with RUST_LOG=info to see all instrumentation output.

leonardoalt/womir-apc-profiling-results.md

Select an option

No results found