- Run: unknown
- Artifact: unknown
- Hardware: unknown
- Topology: unknown
- Git SHA: unknown
- Generated At (UTC): 2026-02-24T06:41:11.812038Z
- Trace SHA256: 3c3793b69a3ab3461853542dc4d5ac4a87cbe6fede6b6818dbd479d8526f7cf2
- Observed run_ids: 563, 564, 565, 566, 567, 568, 569, 570
- Steps counted: 208
- Median: 903.888
- P90: 17838.300
- Mean: 19365.831

| Class | Count | Fraction | Median | P90 | Representative Step | Periodicity |
|---|---|---|---|---|---|---|
| light | 170 | 81.73% | 843.372 | 2548.588 | 132 | n/a |
| heavy | 38 | 18.27% | 19406.041 | 195957.718 | 5 | 11 |

| Category | Duration | Share |
|---|---|---|
| Compute | 3423731.901 | 84.84% |
| Communication | 70639.619 | 1.75% |
| Host | 0.000 | 0.00% |
| Stall | 541206.346 | 13.41% |
| Other | 0.000 | 0.00% |
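
The Share column in the category table can be reproduced by dividing each duration by the sum of all five categories. The following is a minimal sketch of that arithmetic, using only the values printed in the table; the variable names are illustrative and not taken from the tool that generated this report.

```python
# Durations copied from the category table (units as reported by the trace).
durations = {
    "Compute": 3423731.901,
    "Communication": 70639.619,
    "Host": 0.000,
    "Stall": 541206.346,
    "Other": 0.000,
}

# Each share is the category duration divided by the sum over all categories.
total = sum(durations.values())
shares = {name: 100.0 * value / total for name, value in durations.items()}

for name, pct in shares.items():
    print(f"{name:<14}{pct:6.2f}%")
```

The total used here, 4035577.866, is also the denominator that reproduces the percentages quoted in the findings at the end of the report.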

| Region Path | Depth | Count | Inclusive | Inclusive % | Exclusive | Exclusive % |
|---|---|---|---|---|---|---|
| _train_step | 1 | 90375 | 27531781.349 | 100.00% | 547709.734 | 1.99% |
| _train_step=>fused_linear_softmax_cross_entropy_loss | 2 | 1576 | 13592351.112 | 49.37% | 0.000 | 0.00% |
| _train_step=>fused_linear_softmax_cross_entropy_loss=>linear_softmax_cross_entropy_loss_bwd_pallas_mosaic_tpu=>_linear_softmax_cross_entropy_loss_bwd_pallas_mosaic_tpu_combined | 4 | 288 | 10193730.451 | 37.03% | 10168257.583 | 36.93% |
| _train_step=>_splash_attention | 2 | 8528 | 6190435.324 | 22.48% | 0.000 | 0.00% |
| _train_step=>fused_linear_softmax_cross_entropy_loss=>linear_softmax_cross_entropy_loss_fwd_pallas_mosaic_tpu | 3 | 432 | 3193852.293 | 11.60% | 3189666.994 | 11.59% |
| _train_step=>_splash_attention=>splash_mha_dkv_segmented_block_kv_dkv=512_block_kv_dkv_compute=512_block_q_dkv=512_k_layout=1_q_layout=1_v_layout=1 | 3 | 848 | 2376985.415 | 8.63% | 2376985.415 | 8.63% |
| _train_step=>_splash_attention=>splash_mha_fwd_segmented_residuals_block_kv=512_block_kv_compute=512_block_kv_dkv=512_block_kv_dkv_compute=512_block_kv_dq=512_block_q=512_block_q_dkv=512_block_q_dq=512_k_layout=1_q_layout=1_use_fused_bwd_kernel=False_v_layout=1 | 3 | 1720 | 1804531.507 | 6.55% | 1804531.507 | 6.55% |
| _train_step=>_splash_attention=>splash_mha_dq_segmented_block_kv_dq=512_block_q_dq=512_k_layout=1_q_layout=1_v_layout=1 | 3 | 848 | 1780672.132 | 6.47% | 1780672.132 | 6.47% |
| _train_step=>bsh,hm->bsm=>dot_general | 3 | 8256 | 1627871.117 | 5.91% | 1627871.117 | 5.91% |
| _train_step=>apply_rotary_embedding | 2 | 5375 | 1541515.731 | 5.60% | 0.000 | 0.00% |
| _train_step=>bsh,hd->bsd=>dot_general | 3 | 17824 | 1055872.980 | 3.84% | 1055872.980 | 3.84% |
| _train_step=>bsm,mh->bsh=>dot_general | 3 | 4944 | 758279.036 | 2.75% | 758279.036 | 2.75% |
| _train_step=>apply_rotary_embedding=>sub | 3 | 864 | 567415.754 | 2.06% | 567415.754 | 2.06% |
| _train_step=>apply_rotary_embedding=>add_any | 3 | 848 | 554502.299 | 2.01% | 554502.299 | 2.01% |
| _train_step=>convert_element_type | 2 | 12128 | 457950.500 | 1.66% | 457950.500 | 1.66% |
| _train_step=>apply_rotary_embedding=>concatenate | 3 | 3375 | 418984.626 | 1.52% | 418984.626 | 1.52% |
| _train_step=>scatter-add | 2 | 704 | 396358.419 | 1.44% | 396358.419 | 1.44% |
| _train_step=>silu=>add_any | 3 | 848 | 321158.119 | 1.17% | 321158.119 | 1.17% |
| _train_step=>mul | 2 | 4568 | 289709.177 | 1.05% | 289709.177 | 1.05% |
| _train_step=>reshape | 2 | 2592 | 228540.183 | 0.83% | 228540.183 | 0.83% |
| _train_step=>add | 2 | 7752 | 206405.777 | 0.75% | 206405.777 | 0.75% |
| _train_step=>fused_linear_softmax_cross_entropy_loss=>psum | 3 | 568 | 204453.938 | 0.74% | 204453.938 | 0.74% |
| _train_step=>_splash_attention=>hsd,hsd->hs=>dot_general | 4 | 848 | 188396.497 | 0.68% | 188396.497 | 0.68% |
| _train_step=>add_any | 2 | 1688 | 184280.692 | 0.67% | 184280.692 | 0.67% |
| _train_step=>gather | 2 | 288 | 83560.186 | 0.30% | 83560.186 | 0.30% |
| async-collective-start | 1 | 11248 | 50159.484 | 100.00% | 50159.484 | 100.00% |
| _train_step=>reduce_sum | 2 | 8040 | 39551.274 | 0.14% | 39551.274 | 0.14% |
| _train_step=>_splash_attention=>broadcast_in_dim | 3 | 4120 | 38284.319 | 0.14% | 38284.319 | 0.14% |
| copy | 1 | 19288 | 29335.933 | 100.00% | 29335.933 | 100.00% |
| _train_step=>fused_linear_softmax_cross_entropy_loss=>linear_softmax_cross_entropy_loss_bwd_pallas_mosaic_tpu=>_linear_softmax_cross_entropy_loss_bwd_pallas_mosaic_tpu_combined=>reduce_sum | 5 | 144 | 25472.868 | 0.09% | 25472.868 | 0.09% |

| Op | Canonical | Category | Count | Exclusive | Avg | Shape Signature |
|---|---|---|---|---|---|---|
| _linear_softmax_cross_entropy_loss_bwd_pallas_mosaic_tpu_combined.1 | _linear_softmax_cross_entropy_loss_bwd_pallas_mosaic_tpu_combined | compute | 144 | 10168257.583 | 70612.900 | 65536,512\|2,512,128256 |
| linear_softmax_cross_entropy_loss_fwd_pallas_mosaic_tpu.1 | linear_softmax_cross_entropy_loss_fwd_pallas_mosaic_tpu | compute | 144 | 3189666.994 | 22150.465 | 65536,128\|65536,128 |
| splash_mha_dkv_segmented_block_kv_dkv_512_block_kv_dkv_compute_512_block_q_dkv_512_k_layout_1_q_layout_1_v_layout_1.1 | splash_mha_dkv_segmented_block_kv_dkv_512_block_kv_dkv_compute_512_block_q_dkv_512_k_layout_1_q_layout_1_v_layout_1 | compute | 144 | 424731.676 | 2949.526 | 32,512,64\|32,512,64\|32,8,2048,64\|32,8,2048,64 |
| splash_mha_dkv_segmented_block_kv_dkv_512_block_kv_dkv_compute_512_block_q_dkv_512_k_layout_1_q_layout_1_v_layout_1.3 | splash_mha_dkv_segmented_block_kv_dkv_512_block_kv_dkv_compute_512_block_q_dkv_512_k_layout_1_q_layout_1_v_layout_1 | compute | 144 | 399184.020 | 2772.111 | 32,512,64\|32,512,64\|32,8,2048,64\|32,8,2048,64 |
| splash_mha_dkv_segmented_block_kv_dkv_512_block_kv_dkv_compute_512_block_q_dkv_512_k_layout_1_q_layout_1_v_layout_1.4 | splash_mha_dkv_segmented_block_kv_dkv_512_block_kv_dkv_compute_512_block_q_dkv_512_k_layout_1_q_layout_1_v_layout_1 | compute | 144 | 398870.318 | 2769.933 | 32,512,64\|32,512,64\|32,8,2048,64\|32,8,2048,64 |
| splash_mha_dkv_segmented_block_kv_dkv_512_block_kv_dkv_compute_512_block_q_dkv_512_k_layout_1_q_layout_1_v_layout_1.2 | splash_mha_dkv_segmented_block_kv_dkv_512_block_kv_dkv_compute_512_block_q_dkv_512_k_layout_1_q_layout_1_v_layout_1 | compute | 144 | 398833.430 | 2769.677 | 32,512,64\|32,512,64\|32,8,2048,64\|32,8,2048,64 |
| splash_mha_dkv_segmented_block_kv_dkv_512_block_kv_dkv_compute_512_block_q_dkv_512_k_layout_1_q_layout_1_v_layout_1.6 | splash_mha_dkv_segmented_block_kv_dkv_512_block_kv_dkv_compute_512_block_q_dkv_512_k_layout_1_q_layout_1_v_layout_1 | compute | 136 | 378361.057 | 2782.067 | 32,512,64\|32,512,64\|32,8,2048,64\|32,8,2048,64 |
| splash_mha_dkv_segmented_block_kv_dkv_512_block_kv_dkv_compute_512_block_q_dkv_512_k_layout_1_q_layout_1_v_layout_1.5 | splash_mha_dkv_segmented_block_kv_dkv_512_block_kv_dkv_compute_512_block_q_dkv_512_k_layout_1_q_layout_1_v_layout_1 | compute | 136 | 377004.915 | 2772.095 | 32,512,64\|32,512,64\|32,8,2048,64\|32,8,2048,64 |
| splash_mha_dq_segmented_block_kv_dq_512_block_q_dq_512_k_layout_1_q_layout_1_v_layout_1.4 | splash_mha_dq_segmented_block_kv_dq_512_block_q_dq_512_k_layout_1_q_layout_1_v_layout_1 | compute | 144 | 302343.169 | 2099.605 | 32,512,64\|32,8,2048,64 |
| splash_mha_dq_segmented_block_kv_dq_512_block_q_dq_512_k_layout_1_q_layout_1_v_layout_1.1 | splash_mha_dq_segmented_block_kv_dq_512_block_q_dq_512_k_layout_1_q_layout_1_v_layout_1 | compute | 144 | 302200.079 | 2098.612 | 32,512,64\|32,8,2048,64 |
| splash_mha_dq_segmented_block_kv_dq_512_block_q_dq_512_k_layout_1_q_layout_1_v_layout_1.3 | splash_mha_dq_segmented_block_kv_dq_512_block_q_dq_512_k_layout_1_q_layout_1_v_layout_1 | compute | 144 | 302113.094 | 2098.008 | 32,512,64\|32,8,2048,64 |
| splash_mha_dq_segmented_block_kv_dq_512_block_q_dq_512_k_layout_1_q_layout_1_v_layout_1.2 | splash_mha_dq_segmented_block_kv_dq_512_block_q_dq_512_k_layout_1_q_layout_1_v_layout_1 | compute | 144 | 302103.047 | 2097.938 | 32,512,64\|32,8,2048,64 |
| splash_mha_fwd_segmented_residuals_block_kv_512_block_kv_compute_512_block_kv_dkv_512_block_kv_dkv_compute_512_block_kv_dq_512_block_q_512_block_q_dkv_512_block_q_dq_512_k_layout_1_q_layout_1_use_fused_bwd_kernel_False_v_layout_1.6 | splash_mha_fwd_segmented_residuals_block_kv_512_block_kv_compute_512_block_kv_dkv_512_block_kv_dkv_compute_512_block_kv_dq_512_block_q_512_block_q_dkv_512_block_q_dq_512_k_layout_1_q_layout_1_use_fused_bwd_kernel_False_v_layout_1 | compute | 144 | 296248.720 | 2057.283 | 32,512,128\|32,512,128\|32,512,64\|32,8,2048,64\|32,8,2048,128 |
| splash_mha_dq_segmented_block_kv_dq_512_block_q_dq_512_k_layout_1_q_layout_1_v_layout_1.5 | splash_mha_dq_segmented_block_kv_dq_512_block_q_dq_512_k_layout_1_q_layout_1_v_layout_1 | compute | 136 | 286479.090 | 2106.464 | 32,512,64\|32,8,2048,64 |
| splash_mha_dq_segmented_block_kv_dq_512_block_q_dq_512_k_layout_1_q_layout_1_v_layout_1.6 | splash_mha_dq_segmented_block_kv_dq_512_block_q_dq_512_k_layout_1_q_layout_1_v_layout_1 | compute | 136 | 285433.652 | 2098.777 | 32,512,64\|32,8,2048,64 |
| splash_mha_fwd_segmented_residuals_block_kv_512_block_kv_compute_512_block_kv_dkv_512_block_kv_dkv_compute_512_block_kv_dq_512_block_q_512_block_q_dkv_512_block_q_dq_512_k_layout_1_q_layout_1_use_fused_bwd_kernel_False_v_layout_1.11 | splash_mha_fwd_segmented_residuals_block_kv_512_block_kv_compute_512_block_kv_dkv_512_block_kv_dkv_compute_512_block_kv_dq_512_block_q_512_block_q_dkv_512_block_q_dq_512_k_layout_1_q_layout_1_use_fused_bwd_kernel_False_v_layout_1 | compute | 144 | 263382.898 | 1829.048 | 32,512,128\|32,512,128\|32,512,64\|32,8,2048,64\|32,8,2048,128 |
| splash_mha_fwd_segmented_residuals_block_kv_512_block_kv_compute_512_block_kv_dkv_512_block_kv_dkv_compute_512_block_kv_dq_512_block_q_512_block_q_dkv_512_block_q_dq_512_k_layout_1_q_layout_1_use_fused_bwd_kernel_False_v_layout_1.8 | splash_mha_fwd_segmented_residuals_block_kv_512_block_kv_compute_512_block_kv_dkv_512_block_kv_dkv_compute_512_block_kv_dq_512_block_q_512_block_q_dkv_512_block_q_dq_512_k_layout_1_q_layout_1_use_fused_bwd_kernel_False_v_layout_1 | compute | 144 | 263350.079 | 1828.820 | 32,512,128\|32,512,128\|32,512,64\|32,8,2048,64\|32,8,2048,128 |
| splash_mha_fwd_segmented_residuals_block_kv_512_block_kv_compute_512_block_kv_dkv_512_block_kv_dkv_compute_512_block_kv_dq_512_block_q_512_block_q_dkv_512_block_q_dq_512_k_layout_1_q_layout_1_use_fused_bwd_kernel_False_v_layout_1.10 | splash_mha_fwd_segmented_residuals_block_kv_512_block_kv_compute_512_block_kv_dkv_512_block_kv_dkv_compute_512_block_kv_dq_512_block_q_512_block_q_dkv_512_block_q_dq_512_k_layout_1_q_layout_1_use_fused_bwd_kernel_False_v_layout_1 | compute | 144 | 263339.422 | 1828.746 | 32,512,128\|32,512,128\|32,512,64\|32,8,2048,64\|32,8,2048,128 |
| splash_mha_fwd_segmented_residuals_block_kv_512_block_kv_compute_512_block_kv_dkv_512_block_kv_dkv_compute_512_block_kv_dq_512_block_q_512_block_q_dkv_512_block_q_dq_512_k_layout_1_q_layout_1_use_fused_bwd_kernel_False_v_layout_1.7 | splash_mha_fwd_segmented_residuals_block_kv_512_block_kv_compute_512_block_kv_dkv_512_block_kv_dkv_compute_512_block_kv_dq_512_block_q_512_block_q_dkv_512_block_q_dq_512_k_layout_1_q_layout_1_use_fused_bwd_kernel_False_v_layout_1 | compute | 144 | 263318.251 | 1828.599 | 32,512,128\|32,512,128\|32,512,64\|32,8,2048,64\|32,8,2048,128 |
| splash_mha_fwd_segmented_residuals_block_kv_512_block_kv_compute_512_block_kv_dkv_512_block_kv_dkv_compute_512_block_kv_dq_512_block_q_512_block_q_dkv_512_block_q_dq_512_k_layout_1_q_layout_1_use_fused_bwd_kernel_False_v_layout_1.9 | splash_mha_fwd_segmented_residuals_block_kv_512_block_kv_compute_512_block_kv_dkv_512_block_kv_dkv_compute_512_block_kv_dq_512_block_q_512_block_q_dkv_512_block_q_dq_512_k_layout_1_q_layout_1_use_fused_bwd_kernel_False_v_layout_1 | compute | 144 | 263307.916 | 1828.527 | 32,512,128\|32,512,128\|32,512,64\|32,8,2048,64\|32,8,2048,128 |
| fusion.1 | fusion | compute | 136 | 202727.619 | 1490.644 | 128256,512 |
| psum.29 | psum | communication | 144 | 201721.845 | 1400.846 | 512,128256 |
| all-reduce.23 | all-reduce | communication | 120 | 168016.960 | 1400.141 | 128256,512 |
| fusion.48 | fusion | compute | 136 | 95951.371 | 705.525 | 512,128256\|512,128256\|512,128256 |
| fusion.38 | fusion | compute | 144 | 94670.169 | 657.432 | 32,2048,8,32\|32,2048,8,32\|32,2048,8,32\|32,2048,8,32 |

Note: FLOP proxy metrics are relative scaling heuristics from trace shapes, not hardware MFU.

| Family | Count | Exclusive | Share | Avg Exclusive | FLOP Proxy Total | FLOP Proxy/s | Example Op |
|---|---|---|---|---|---|---|---|
| loss_xent | 288 | 13357924.577 | 67.15% | 46381.683 | 12079595520.000 | 9.043e+02 | _linear_softmax_cross_entropy_loss_bwd_pallas_mosaic_tpu_combined.1 |
| attention_splash | 2560 | 5770604.832 | 29.01% | 2254.143 | 822434697576448.000 | 1.425e+08 | splash_mha_dkv_segmented_block_kv_dkv_512_block_kv_dkv_compute_512_block_q_dkv_512_k_layout_1_q_layout_1_v_layout_1.1 |
| other | 560 | 595071.005 | 2.99% | 1062.627 | 29733421056.000 | 4.997e+04 | fusion.1 |
| collective | 120 | 168016.960 | 0.84% | 1400.141 | 7880048640.000 | 4.690e+04 | all-reduce.23 |

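
The FLOP Proxy/s column in the family table is numerically consistent with FLOP Proxy Total divided by Exclusive duration. The sketch below re-derives it from the printed values; this is an inference from the numbers, not a documented formula of the tool, and the dictionary layout and function name are hypothetical.

```python
# (exclusive duration, FLOP proxy total) per family, copied from the family table.
families = {
    "loss_xent": (13357924.577, 12079595520.0),
    "attention_splash": (5770604.832, 822434697576448.0),
    "other": (595071.005, 29733421056.0),
    "collective": (168016.960, 7880048640.0),
}


def proxy_rate(exclusive: float, flop_proxy_total: float) -> float:
    """Assumed definition: proxy FLOPs per unit of exclusive duration."""
    return flop_proxy_total / exclusive


rates = {name: proxy_rate(excl, flops) for name, (excl, flops) in families.items()}

for name, rate in rates.items():
    print(f"{name:<18}{rate:.3e}")
```

Because the proxy is a relative scaling heuristic derived from trace shapes, the rates are best compared before and after a change within one family rather than across families.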
| Collective | Count | Exclusive | Avg |
|---|---|---|---|
| all-reduce | 856 | 442269.905 | 516.670 |
| async-collective | 22488 | 59490.260 | 2.645 |
| all-gather | 144 | 3648.932 | 25.340 |

| Op | Count | Total Gap | Max Gap | Avg Gap |
|---|---|---|---|---|
| copy.564 | 120 | 3176365.089 | 39767.551 | 26469.709 |
| copy.1 | 136 | 530431.752 | 4255.501 | 3900.233 |
| copy.1436 | 16 | 457528.939 | 29388.719 | 28595.559 |
| fusion.749 | 120 | 4890.014 | 41.599 | 40.750 |
| fusion.705 | 136 | 596.271 | 38.057 | 4.384 |
| copy-done.257 | 144 | 320.186 | 2.911 | 2.224 |
| copy-done.56 | 128 | 162.696 | 1.488 | 1.271 |
| add_bitcast_fusion | 144 | 162.575 | 1.418 | 1.129 |
| fusion.768 | 120 | 143.718 | 1.470 | 1.198 |
| copy.561 | 128 | 100.117 | 0.958 | 0.782 |
| negate_power_fusion | 144 | 38.615 | 2.551 | 0.268 |
| copy-done.77 | 136 | 20.373 | 1.540 | 0.150 |
| copy.1435 | 16 | 3.759 | 0.236 | 0.235 |
| copy-done.359 | 144 | 2.234 | 0.019 | 0.016 |
| copy-done.446 | 16 | 0.273 | 0.019 | 0.017 |
| rsqrt_bitcast_fusion.2 | 144 | 0.118 | 0.001 | 0.001 |
| rsqrt_bitcast_fusion.5 | 144 | 0.113 | 0.001 | 0.001 |
| rsqrt_bitcast_fusion.1 | 144 | 0.109 | 0.001 | 0.001 |
| rsqrt_bitcast_fusion.3 | 144 | 0.109 | 0.001 | 0.001 |
| async-collective-done.14 | 144 | 0.102 | 0.001 | 0.001 |
| rsqrt_bitcast_fusion.9 | 144 | 0.102 | 0.001 | 0.001 |
| async-collective-start.4 | 144 | 0.102 | 0.001 | 0.001 |
| copy.654 | 144 | 0.101 | 0.001 | 0.001 |
| async-collective-done.23 | 144 | 0.101 | 0.001 | 0.001 |
| copy-done.391 | 144 | 0.100 | 0.001 | 0.001 |
| psum.28 | 144 | 0.100 | 0.001 | 0.001 |
| async-collective-start.11 | 144 | 0.099 | 0.001 | 0.001 |
| rsqrt_bitcast_fusion.7 | 144 | 0.099 | 0.001 | 0.001 |
| fusion.12 | 136 | 0.098 | 0.001 | 0.001 |
| copy.718 | 144 | 0.098 | 0.001 | 0.001 |

| Op | Region Path | Count | Total Gap | Avg Gap |
|---|---|---|---|---|
| copy.564 | copy(model.params.blocks[0].attn.w_k) | 120 | 3176365.089 | 26469.709 |
| copy.1 | copy | 136 | 530431.752 | 3900.233 |
| copy.1436 | copy | 16 | 457528.939 | 28595.559 |
| fusion.749 | _train_step=>add | 120 | 4890.014 | 40.750 |
| fusion.705 | _train_step=>add | 16 | 596.196 | 37.262 |
| copy-done.257 | copy(copy-done) | 144 | 320.186 | 2.224 |
| copy-done.56 | copy(copy-done) | 128 | 162.696 | 1.271 |
| add_bitcast_fusion | _train_step=>_threefry_split=>broadcast_in_dim | 144 | 162.575 | 1.129 |
| fusion.768 | _train_step=>reduce_sum | 120 | 143.718 | 1.198 |
| copy.561 | copy(model.params.blocks[0].attn.w_q) | 128 | 100.117 | 0.782 |
| negate_power_fusion | _train_step=>apply_rotary_embedding=>pow | 144 | 38.615 | 0.268 |
| copy-done.77 | copy(copy-done) | 136 | 20.373 | 0.150 |
| copy.1435 | copy | 16 | 3.759 | 0.235 |
| copy-done.359 | copy(copy-done) | 144 | 2.234 | 0.016 |
| copy-done.446 | copy(copy-done) | 16 | 0.273 | 0.017 |
| rsqrt_bitcast_fusion.2 | _train_step=>mul | 144 | 0.118 | 0.001 |
| rsqrt_bitcast_fusion.5 | _train_step=>mul | 144 | 0.113 | 0.001 |
| rsqrt_bitcast_fusion.1 | _train_step=>mul | 144 | 0.109 | 0.001 |
| rsqrt_bitcast_fusion.3 | _train_step=>mul | 144 | 0.109 | 0.001 |
| async-collective-done.14 | _train_step=>bsh,hd->bsd=>dot_general | 144 | 0.102 | 0.001 |
| rsqrt_bitcast_fusion.9 | _train_step=>mul | 144 | 0.102 | 0.001 |
| async-collective-start.4 | async-collective-start | 144 | 0.102 | 0.001 |
| async-collective-done.23 | _train_step=>bsh,hd->bsd=>dot_general | 144 | 0.101 | 0.001 |
| copy-done.391 | copy(copy-done) | 144 | 0.100 | 0.001 |
| psum.28 | _train_step=>fused_linear_softmax_cross_entropy_loss=>psum | 144 | 0.100 | 0.001 |
| async-collective-start.11 | async-collective-start | 144 | 0.099 | 0.001 |
| rsqrt_bitcast_fusion.7 | _train_step=>mul | 144 | 0.099 | 0.001 |
| fusion.12 | _train_step=>bsm,mh->bsh=>dot_general | 136 | 0.098 | 0.001 |
| copy-start.42 | copy(copy-start) | 144 | 0.097 | 0.001 |
| pad_maximum_fusion | _train_step=>apply_rotary_embedding=>concatenate | 144 | 0.096 | 0.001 |

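Top op '_linear_softmax_cross_entropy_loss_bwd_pallas_mosaic_tpu_combined.1' has an exclusive duration equal to 252.0% of the category-table total (4035577.866). A value above 100% cannot be a share, so the two totals evidently cover different scopes; measured against _train_step inclusive duration, the op accounts for 36.93%.
Evidence:
- Top op: _linear_softmax_cross_entropy_loss_bwd_pallas_mosaic_tpu_combined.1
- Top op exclusive duration: 10168257.583
- Top op ratio to category-table total (4035577.866): 252.0%
- Top op share of _train_step inclusive duration (27531781.349): 36.93%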
Suggestions:
- Inspect kernel implementation and tiling/fusion opportunities for this op.
- Try alternative algorithmic variants or precision/layout adjustments.
- Run a focused microbenchmark on this op before/after tuning changes.
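
The percentage attached to the top op depends on which total it is divided by. As a rough check, the sketch below recomputes it against two totals that appear in this report: the sum of the category table and the _train_step inclusive duration. Which total the report generator intended is an assumption; the labels and variable names are illustrative.

```python
# Exclusive duration of the top op, as listed in the op table.
top_op_exclusive = 10168257.583

# Candidate denominators, both taken from tables in this report.
denominators = {
    "category table total": 3423731.901 + 70639.619 + 541206.346,
    "_train_step inclusive": 27531781.349,
}

ratios = {
    label: 100.0 * top_op_exclusive / total
    for label, total in denominators.items()
}

for label, pct in ratios.items():
    print(f"{label:<24}{pct:7.2f}%")
```

Only the second value is a share in the usual sense, since the op's time is contained in the _train_step total.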
Op 'copy.564' accumulates the largest pre-op idle gap: 3176365.089 in total across 120 occurrences, equal to 78.7% of the category-table total (4035577.866).
Evidence:
- Op with largest pre-gap: copy.564
- Total pre-gap: 3176365.089
- Max pre-gap: 39767.551
- Occurrences: 120
Suggestions:
- Inspect upstream dependencies immediately before this op.
- Look for host dispatch or synchronization barriers causing the gap.
- Use hierarchical region totals to localize where the waiting accumulates.
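
For readers unfamiliar with the metric, a pre-op gap is commonly taken to be the idle time between the end of the previously running op and the start of the current one on the same timeline. The sketch below aggregates such gaps per op name. It assumes a hypothetical event format of `(op_name, start, end)` tuples and made-up op names; how the report generator actually derives its gap columns is not stated here.

```python
from collections import defaultdict


def pre_op_gaps(events):
    """Aggregate idle time preceding each op on a single timeline.

    events: iterable of (op_name, start, end) tuples (hypothetical format).
    Returns {op_name: {"count", "total", "max", "avg"}}.
    """
    stats = defaultdict(lambda: {"count": 0, "total": 0.0, "max": 0.0})
    prev_end = None
    for name, start, end in sorted(events, key=lambda e: e[1]):
        if prev_end is not None:
            # Overlapping ops produce no gap, so clip at zero.
            gap = max(0.0, start - prev_end)
            entry = stats[name]
            entry["count"] += 1
            entry["total"] += gap
            entry["max"] = max(entry["max"], gap)
        prev_end = end if prev_end is None else max(prev_end, end)
    return {
        name: {**entry, "avg": entry["total"] / entry["count"]}
        for name, entry in stats.items()
    }


# Synthetic timeline: each copy.x waits on an idle device, fusion.a does not.
events = [
    ("fusion.a", 0.0, 10.0),
    ("copy.x", 40.0, 45.0),
    ("fusion.a", 45.0, 50.0),
    ("copy.x", 75.0, 80.0),
]
gaps = pre_op_gaps(events)
print(gaps["copy.x"])
```

Sorting the resulting dictionary by `total` gives a ranking of the same kind as the gap tables in this report.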
Steady-state p90/median ratio is 19.74, indicating intermittent slow steps.
Evidence:
- Steady median: 903.888
- Steady p90: 17838.300
- p90/median: 19.74
Suggestions:
- Correlate slow steps with collective spikes and host wait events.
- Check for periodic checkpoint/eval/input stalls during profiled range.
- Compare traces before/after disabling optional callbacks or host work.
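
The p90/median ratio quoted in this finding can be reproduced directly from the reported step statistics. The sketch below does that and also shows one way to compute the same ratio from raw per-step durations. The helper name and the synthetic sample are hypothetical, and `statistics.quantiles` with its default method is only one of several percentile conventions, so it may not match the report generator exactly.

```python
import statistics

# Values copied from the step summary of this report.
reported_median = 903.888
reported_p90 = 17838.300
reported_ratio = reported_p90 / reported_median
print(f"reported p90/median = {reported_ratio:.2f}")


def p90_over_median(step_durations):
    """Ratio of the 90th percentile to the median of per-step durations."""
    deciles = statistics.quantiles(step_durations, n=10)
    return deciles[8] / statistics.median(step_durations)


# Synthetic sample: mostly light steps with a few heavy ones.
sample = [1.0] * 8 + [20.0] * 3
sample_ratio = p90_over_median(sample)
print(f"sample p90/median = {sample_ratio:.2f}")
```

A ratio near 1 indicates uniform step times; values as large as the one reported here point to a minority of much slower steps, consistent with the heavy class in the step table.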