Date: 2026-03-10
Block: 24171384, --apc-count 10
| Phase | Wall Time |
|---|---|
| APC generation (parallel) | 57.1 min |
| Keygen | 0.8 min |
| Total compile | ~58 min |
19,031 blocks were generated (0 failures). The sum of all per-block times is 36.7 hours of single-threaded work, but parallelism brings it to 57 min wall time. However, the wall time is bottlenecked by the slowest blocks on the critical path.
| Time bucket | Blocks | % of blocks | Sum of time | % of time |
|---|---|---|---|---|
| <1s | 14,001 | 73.6% | 2,634s | 2.0% |
| 1-10s | 3,326 | 17.5% | 9,072s | 6.9% |
| 10-60s | 973 | 5.1% | 29,094s | 22.0% |
| 60-120s | 596 | 3.1% | 46,115s | 34.9% |
| 120-600s | 120 | 0.6% | 20,022s | 15.2% |
| >600s (>10min) | 15 | 0.1% | 25,078s | 19.0% |
| Time | Instrs | PC | Before optimizer (cols) |
|---|---|---|---|
| 41.3 min | 908 | 2211432 | 50,882 |
| 40.7 min | 916 | 2152724 | 51,242 |
| 37.3 min | 916 | 520876 | 51,242 |
| 35.4 min | 884 | 2234668 | ~50,000 |
| 33.8 min | 888 | 2295204 | ~50,000 |
| 29.8 min | 746 | 2450600 | 41,814 |
| 28.3 min | 754 | 2193372 | ~42,000 |
| 27.4 min | 738 | 2220196 | ~41,000 |
| 27.0 min | 766 | 2226264 | ~42,000 |
| 25.1 min | 916 | 2720916 | 51,242 |
The most disproportionately slow blocks are medium-sized (20-60 instructions):
| Block | Instrs | Cols | Time | Time/instr |
|---|---|---|---|---|
| pc=403788 | 22 | 1,039 | 149s | 6.8s/instr |
| pc=2283700 | 48 | 2,222 | 167s | 3.5s/instr |
| pc=325000 | 58 | 2,177 | 142s | 2.4s/instr |
| Hundreds of 29-instr blocks | 29 | 1,335 | ~83s each | 2.9s/instr |
Compare with large blocks (which are more efficient per-instruction):
| Block | Instrs | Cols | Time | Time/instr |
|---|---|---|---|---|
| pc=42828 | 541 | 25,654 | 193s | 0.4s/instr |
| pc=361456 | 3,723 | 170,978 | 1,115s | 0.3s/instr |
- Large blocks (~15K-30K cols): ~2-3s per 1000 items
- Medium blocks (~1K-2.5K cols): 10-60s per 1000 items ← 10-30x worse
The optimizer converges in 3-5 iterations. Iteration 3 is disproportionately slow on medium WOMIR blocks:
| Iteration | System size (constraints) | Time | Notes |
|---|---|---|---|
| 1 | 20,671 constraints | 122s | Proportional to size |
| 1 | 7,569 constraints | 76s | Proportional to size |
| 1 | 4,179 constraints | 33s | Proportional to size |
| 3 | 676 constraints | 79s | Disproportionate |
| 3 | 339 constraints | 37s | Disproportionate |
| 3 | 529 constraints | 45s | Disproportionate |
Something in optimize_constraints iteration 3 (solver-based optimization + rule-based optimization + memory optimization + low-degree bus interaction optimization) is hitting a bad case on WOMIR constraint patterns.
WOMIR generates ~46-50 columns per instruction due to:
- Frame-pointer overhead: every instruction reads FP from memory, then uses FP-relative addressing for all operands
- 64-bit variants: 8-limb decomposition (vs 4 for 32-bit) doubles constraints
- More bus interactions per operand: 2+ per operand (FP read + register read) vs 1 for RISC-V
For comparison, RISC-V generates 12,402 blocks for the same Ethereum block and completes in ~11 min.
Many blocks have identical constraint fingerprints:
- Hundreds of 29-instruction blocks: exactly 1,335 cols, 1,086 constraints, 782 bus interactions
- Hundreds of 30-instruction blocks: exactly 1,375 cols, 1,119 constraints, 806 bus interactions
Each of these runs through the full optimizer independently at ~80-85s each.
-
No
apc_max_instructionslimit — defaults tou32::MAX, allowing blocks up to 3,723 instructions (170K cols, 18.6 min each) -
~46-50 cols per WOMIR instruction — the frame-pointer overhead and 64-bit variants create ~2x more constraints than RISC-V per instruction
-
Optimizer has superlinear behavior on certain WOMIR constraint patterns — the per-1000-items time varies from 2s (large blocks) to 60s (medium blocks). Something in the constraint solver is hitting a bad case.
-
0 failures — every block generates an APC, even the massive ones. No blocks are filtered out.
-
Duplicate blocks waste optimizer time — identical constraint systems are optimized independently.
-
Set
apc_max_instructionsto ~100-200 — eliminates the 15 blocks taking >10 min (19% of compute time) and most of the 120 blocks in the 2-10 min range (15%). Wall time would likely drop from 57 min to ~15-20 min. -
Add a per-block timeout — if a block takes >X seconds in the optimizer, skip it. Most useful APCs are the fast-to-generate ones anyway.
-
Deduplicate blocks by constraint fingerprint — run the optimizer once per unique (cols, constraints, bus_interactions) tuple, reuse the result. Would eliminate hundreds of redundant 80s+ optimizer runs.
-
Profile the optimizer iteration 3 pathology — add per-step timing within
optimize_constraintsto pinpoint which sub-step (solver-based, rule-based, memory, low-degree bus interaction) causes the superlinear behavior on medium WOMIR blocks. -
Review WOMIR adapter constraint design — the ~46-50 cols/instruction overhead from frame-pointer handling may be reducible.
In powdr/autoprecompiles/src/lib.rs lines 348-353 and 367-372, the before_opt_constraints and before_opt_interactions metrics both incorrectly use machine.unique_references().count() (which is the column count). They should use machine.constraints.len() and machine.bus_interactions.len() respectively.
Changes were made to local ~/devel/powdr (not committed upstream):
autoprecompiles/src/pgo/cell/mod.rs: Per-block progress logging (info level), success/failure counts, parallel generation summaryautoprecompiles/src/lib.rs: Per-block timing forstatements_to_symbolic_machine, optimizer, before/after constraint sizes; fixed metrics bugautoprecompiles/src/optimizer.rs: Solver construction timing, per-iteration timing and stats, convergence iteration count
Run with RUST_LOG=info to see all instrumentation output.