@dlwh
Last active February 24, 2026 06:41
Example human-readable profiling report for marin PR #2970

Profile Report (profile_summary.v1)

Run Metadata

  • Run: unknown
  • Artifact: unknown
  • Hardware: unknown
  • Topology: unknown
  • Git SHA: unknown
  • Generated At (UTC): 2026-02-24T06:41:11.812038Z

Trace Provenance

  • Trace SHA256: 3c3793b69a3ab3461853542dc4d5ac4a87cbe6fede6b6818dbd479d8526f7cf2
  • Observed run_ids: 563, 564, 565, 566, 567, 568, 569, 570

Step Time (Steady State)

  • Steps counted: 208
  • Median: 903.888
  • P90: 17838.300
  • Mean: 19365.831

Step Classes

| Class | Count | Fraction | Median | P90 | Representative Step | Periodicity |
|---|---|---|---|---|---|---|
| light | 170 | 81.73% | 843.372 | 2548.588 | 132 | n/a |
| heavy | 38 | 18.27% | 19406.041 | 195957.718 | 5 | 11 |

Time Breakdown (exclusive_duration_global_timeline)

| Category | Duration | Share |
|---|---|---|
| Compute | 3423731.901 | 84.84% |
| Communication | 70639.619 | 1.75% |
| Host | 0.000 | 0.00% |
| Stall | 541206.346 | 13.41% |
| Other | 0.000 | 0.00% |
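The Share column can be reproduced from the Duration column alone. The sketch below assumes each share is simply that category's duration divided by the sum over all five categories; the report does not state the formula, so this is an inference that happens to match the printed values.

```python
# Durations copied from the Time Breakdown table (units as reported).
durations = {
    "Compute": 3423731.901,
    "Communication": 70639.619,
    "Host": 0.000,
    "Stall": 541206.346,
    "Other": 0.000,
}

# Assumed formula: share = category duration / sum of all categories.
total = sum(durations.values())
shares = {name: 100.0 * value / total for name, value in durations.items()}

print(f"total = {total:.3f}")
for name, pct in shares.items():
    print(f"{name:<14}{pct:6.2f}%")
```

The total that comes out of this sum, 4035577.866, is also the denominator that reproduces the percentages quoted under Optimization Candidates.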

Hierarchical Regions

| Region Path | Depth | Count | Inclusive | Inclusive % | Exclusive | Exclusive % |
|---|---|---|---|---|---|---|
| _train_step | 1 | 90375 | 27531781.349 | 100.00% | 547709.734 | 1.99% |
| _train_step=>fused_linear_softmax_cross_entropy_loss | 2 | 1576 | 13592351.112 | 49.37% | 0.000 | 0.00% |
| _train_step=>fused_linear_softmax_cross_entropy_loss=>linear_softmax_cross_entropy_loss_bwd_pallas_mosaic_tpu=>_linear_softmax_cross_entropy_loss_bwd_pallas_mosaic_tpu_combined | 4 | 288 | 10193730.451 | 37.03% | 10168257.583 | 36.93% |
| _train_step=>_splash_attention | 2 | 8528 | 6190435.324 | 22.48% | 0.000 | 0.00% |
| _train_step=>fused_linear_softmax_cross_entropy_loss=>linear_softmax_cross_entropy_loss_fwd_pallas_mosaic_tpu | 3 | 432 | 3193852.293 | 11.60% | 3189666.994 | 11.59% |
| _train_step=>_splash_attention=>splash_mha_dkv_segmented_block_kv_dkv=512_block_kv_dkv_compute=512_block_q_dkv=512_k_layout=1_q_layout=1_v_layout=1 | 3 | 848 | 2376985.415 | 8.63% | 2376985.415 | 8.63% |
| _train_step=>_splash_attention=>splash_mha_fwd_segmented_residuals_block_kv=512_block_kv_compute=512_block_kv_dkv=512_block_kv_dkv_compute=512_block_kv_dq=512_block_q=512_block_q_dkv=512_block_q_dq=512_k_layout=1_q_layout=1_use_fused_bwd_kernel=False_v_layout=1 | 3 | 1720 | 1804531.507 | 6.55% | 1804531.507 | 6.55% |
| _train_step=>_splash_attention=>splash_mha_dq_segmented_block_kv_dq=512_block_q_dq=512_k_layout=1_q_layout=1_v_layout=1 | 3 | 848 | 1780672.132 | 6.47% | 1780672.132 | 6.47% |
| _train_step=>bsh,hm->bsm=>dot_general | 3 | 8256 | 1627871.117 | 5.91% | 1627871.117 | 5.91% |
| _train_step=>apply_rotary_embedding | 2 | 5375 | 1541515.731 | 5.60% | 0.000 | 0.00% |
| _train_step=>bsh,hd->bsd=>dot_general | 3 | 17824 | 1055872.980 | 3.84% | 1055872.980 | 3.84% |
| _train_step=>bsm,mh->bsh=>dot_general | 3 | 4944 | 758279.036 | 2.75% | 758279.036 | 2.75% |
| _train_step=>apply_rotary_embedding=>sub | 3 | 864 | 567415.754 | 2.06% | 567415.754 | 2.06% |
| _train_step=>apply_rotary_embedding=>add_any | 3 | 848 | 554502.299 | 2.01% | 554502.299 | 2.01% |
| _train_step=>convert_element_type | 2 | 12128 | 457950.500 | 1.66% | 457950.500 | 1.66% |
| _train_step=>apply_rotary_embedding=>concatenate | 3 | 3375 | 418984.626 | 1.52% | 418984.626 | 1.52% |
| _train_step=>scatter-add | 2 | 704 | 396358.419 | 1.44% | 396358.419 | 1.44% |
| _train_step=>silu=>add_any | 3 | 848 | 321158.119 | 1.17% | 321158.119 | 1.17% |
| _train_step=>mul | 2 | 4568 | 289709.177 | 1.05% | 289709.177 | 1.05% |
| _train_step=>reshape | 2 | 2592 | 228540.183 | 0.83% | 228540.183 | 0.83% |
| _train_step=>add | 2 | 7752 | 206405.777 | 0.75% | 206405.777 | 0.75% |
| _train_step=>fused_linear_softmax_cross_entropy_loss=>psum | 3 | 568 | 204453.938 | 0.74% | 204453.938 | 0.74% |
| _train_step=>_splash_attention=>hsd,hsd->hs=>dot_general | 4 | 848 | 188396.497 | 0.68% | 188396.497 | 0.68% |
| _train_step=>add_any | 2 | 1688 | 184280.692 | 0.67% | 184280.692 | 0.67% |
| _train_step=>gather | 2 | 288 | 83560.186 | 0.30% | 83560.186 | 0.30% |
| async-collective-start | 1 | 11248 | 50159.484 | 100.00% | 50159.484 | 100.00% |
| _train_step=>reduce_sum | 2 | 8040 | 39551.274 | 0.14% | 39551.274 | 0.14% |
| _train_step=>_splash_attention=>broadcast_in_dim | 3 | 4120 | 38284.319 | 0.14% | 38284.319 | 0.14% |
| copy | 1 | 19288 | 29335.933 | 100.00% | 29335.933 | 100.00% |
| _train_step=>fused_linear_softmax_cross_entropy_loss=>linear_softmax_cross_entropy_loss_bwd_pallas_mosaic_tpu=>_linear_softmax_cross_entropy_loss_bwd_pallas_mosaic_tpu_combined=>reduce_sum | 5 | 144 | 25472.868 | 0.09% | 25472.868 | 0.09% |
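A short sketch of how the Depth and Inclusive % columns appear to relate to the region path. It assumes that depth is the number of `=>`-separated segments and that each percentage is taken relative to the inclusive time of the path's root region. Both are inferences from the rows above, not a documented rule, and the `describe` helper is hypothetical.

```python
# (region path, inclusive) pairs copied from the Hierarchical Regions table.
rows = [
    ("_train_step", 27531781.349),
    ("_train_step=>fused_linear_softmax_cross_entropy_loss", 13592351.112),
    ("_train_step=>_splash_attention", 6190435.324),
    ("_train_step=>_splash_attention=>hsd,hsd->hs=>dot_general", 188396.497),
]

# Inclusive time of each root region (a path with no "=>" separator).
root_inclusive = {path: inc for path, inc in rows if "=>" not in path}

def describe(path, inclusive):
    """Return (depth, inclusive % of the root region) for one row."""
    parts = path.split("=>")  # "->" inside einsum specs is left untouched
    depth = len(parts)
    pct = 100.0 * inclusive / root_inclusive[parts[0]]
    return depth, pct

summary = {path: describe(path, inc) for path, inc in rows}
for path, (depth, pct) in summary.items():
    print(f"depth={depth} inclusive%={pct:6.2f} leaf={path.split('=>')[-1]}")
```

Under this reading, the root-level `async-collective-start` and `copy` rows show 100.00% because each is measured against itself rather than against `_train_step`.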

Top Ops

Op Canonical Category Count Exclusive Avg Shape Signature
_linear_softmax_cross_entropy_loss_bwd_pallas_mosaic_tpu_combined.1 _linear_softmax_cross_entropy_loss_bwd_pallas_mosaic_tpu_combined compute 144 10168257.583 70612.900 65536,512|2,512,128256
linear_softmax_cross_entropy_loss_fwd_pallas_mosaic_tpu.1 linear_softmax_cross_entropy_loss_fwd_pallas_mosaic_tpu compute 144 3189666.994 22150.465 65536,128|65536,128
splash_mha_dkv_segmented_block_kv_dkv_512_block_kv_dkv_compute_512_block_q_dkv_512_k_layout_1_q_layout_1_v_layout_1.1 splash_mha_dkv_segmented_block_kv_dkv_512_block_kv_dkv_compute_512_block_q_dkv_512_k_layout_1_q_layout_1_v_layout_1 compute 144 424731.676 2949.526 32,512,64|32,512,64|32,8,2048,64|32,8,2048,64
splash_mha_dkv_segmented_block_kv_dkv_512_block_kv_dkv_compute_512_block_q_dkv_512_k_layout_1_q_layout_1_v_layout_1.3 splash_mha_dkv_segmented_block_kv_dkv_512_block_kv_dkv_compute_512_block_q_dkv_512_k_layout_1_q_layout_1_v_layout_1 compute 144 399184.020 2772.111 32,512,64|32,512,64|32,8,2048,64|32,8,2048,64
splash_mha_dkv_segmented_block_kv_dkv_512_block_kv_dkv_compute_512_block_q_dkv_512_k_layout_1_q_layout_1_v_layout_1.4 splash_mha_dkv_segmented_block_kv_dkv_512_block_kv_dkv_compute_512_block_q_dkv_512_k_layout_1_q_layout_1_v_layout_1 compute 144 398870.318 2769.933 32,512,64|32,512,64|32,8,2048,64|32,8,2048,64
splash_mha_dkv_segmented_block_kv_dkv_512_block_kv_dkv_compute_512_block_q_dkv_512_k_layout_1_q_layout_1_v_layout_1.2 splash_mha_dkv_segmented_block_kv_dkv_512_block_kv_dkv_compute_512_block_q_dkv_512_k_layout_1_q_layout_1_v_layout_1 compute 144 398833.430 2769.677 32,512,64|32,512,64|32,8,2048,64|32,8,2048,64
splash_mha_dkv_segmented_block_kv_dkv_512_block_kv_dkv_compute_512_block_q_dkv_512_k_layout_1_q_layout_1_v_layout_1.6 splash_mha_dkv_segmented_block_kv_dkv_512_block_kv_dkv_compute_512_block_q_dkv_512_k_layout_1_q_layout_1_v_layout_1 compute 136 378361.057 2782.067 32,512,64|32,512,64|32,8,2048,64|32,8,2048,64
splash_mha_dkv_segmented_block_kv_dkv_512_block_kv_dkv_compute_512_block_q_dkv_512_k_layout_1_q_layout_1_v_layout_1.5 splash_mha_dkv_segmented_block_kv_dkv_512_block_kv_dkv_compute_512_block_q_dkv_512_k_layout_1_q_layout_1_v_layout_1 compute 136 377004.915 2772.095 32,512,64|32,512,64|32,8,2048,64|32,8,2048,64
splash_mha_dq_segmented_block_kv_dq_512_block_q_dq_512_k_layout_1_q_layout_1_v_layout_1.4 splash_mha_dq_segmented_block_kv_dq_512_block_q_dq_512_k_layout_1_q_layout_1_v_layout_1 compute 144 302343.169 2099.605 32,512,64|32,8,2048,64
splash_mha_dq_segmented_block_kv_dq_512_block_q_dq_512_k_layout_1_q_layout_1_v_layout_1.1 splash_mha_dq_segmented_block_kv_dq_512_block_q_dq_512_k_layout_1_q_layout_1_v_layout_1 compute 144 302200.079 2098.612 32,512,64|32,8,2048,64
splash_mha_dq_segmented_block_kv_dq_512_block_q_dq_512_k_layout_1_q_layout_1_v_layout_1.3 splash_mha_dq_segmented_block_kv_dq_512_block_q_dq_512_k_layout_1_q_layout_1_v_layout_1 compute 144 302113.094 2098.008 32,512,64|32,8,2048,64
splash_mha_dq_segmented_block_kv_dq_512_block_q_dq_512_k_layout_1_q_layout_1_v_layout_1.2 splash_mha_dq_segmented_block_kv_dq_512_block_q_dq_512_k_layout_1_q_layout_1_v_layout_1 compute 144 302103.047 2097.938 32,512,64|32,8,2048,64
splash_mha_fwd_segmented_residuals_block_kv_512_block_kv_compute_512_block_kv_dkv_512_block_kv_dkv_compute_512_block_kv_dq_512_block_q_512_block_q_dkv_512_block_q_dq_512_k_layout_1_q_layout_1_use_fused_bwd_kernel_False_v_layout_1.6 splash_mha_fwd_segmented_residuals_block_kv_512_block_kv_compute_512_block_kv_dkv_512_block_kv_dkv_compute_512_block_kv_dq_512_block_q_512_block_q_dkv_512_block_q_dq_512_k_layout_1_q_layout_1_use_fused_bwd_kernel_False_v_layout_1 compute 144 296248.720 2057.283 32,512,128|32,512,128|32,512,64|32,8,2048,64|32,8,2048,128
splash_mha_dq_segmented_block_kv_dq_512_block_q_dq_512_k_layout_1_q_layout_1_v_layout_1.5 splash_mha_dq_segmented_block_kv_dq_512_block_q_dq_512_k_layout_1_q_layout_1_v_layout_1 compute 136 286479.090 2106.464 32,512,64|32,8,2048,64
splash_mha_dq_segmented_block_kv_dq_512_block_q_dq_512_k_layout_1_q_layout_1_v_layout_1.6 splash_mha_dq_segmented_block_kv_dq_512_block_q_dq_512_k_layout_1_q_layout_1_v_layout_1 compute 136 285433.652 2098.777 32,512,64|32,8,2048,64
splash_mha_fwd_segmented_residuals_block_kv_512_block_kv_compute_512_block_kv_dkv_512_block_kv_dkv_compute_512_block_kv_dq_512_block_q_512_block_q_dkv_512_block_q_dq_512_k_layout_1_q_layout_1_use_fused_bwd_kernel_False_v_layout_1.11 splash_mha_fwd_segmented_residuals_block_kv_512_block_kv_compute_512_block_kv_dkv_512_block_kv_dkv_compute_512_block_kv_dq_512_block_q_512_block_q_dkv_512_block_q_dq_512_k_layout_1_q_layout_1_use_fused_bwd_kernel_False_v_layout_1 compute 144 263382.898 1829.048 32,512,128|32,512,128|32,512,64|32,8,2048,64|32,8,2048,128
splash_mha_fwd_segmented_residuals_block_kv_512_block_kv_compute_512_block_kv_dkv_512_block_kv_dkv_compute_512_block_kv_dq_512_block_q_512_block_q_dkv_512_block_q_dq_512_k_layout_1_q_layout_1_use_fused_bwd_kernel_False_v_layout_1.8 splash_mha_fwd_segmented_residuals_block_kv_512_block_kv_compute_512_block_kv_dkv_512_block_kv_dkv_compute_512_block_kv_dq_512_block_q_512_block_q_dkv_512_block_q_dq_512_k_layout_1_q_layout_1_use_fused_bwd_kernel_False_v_layout_1 compute 144 263350.079 1828.820 32,512,128|32,512,128|32,512,64|32,8,2048,64|32,8,2048,128
splash_mha_fwd_segmented_residuals_block_kv_512_block_kv_compute_512_block_kv_dkv_512_block_kv_dkv_compute_512_block_kv_dq_512_block_q_512_block_q_dkv_512_block_q_dq_512_k_layout_1_q_layout_1_use_fused_bwd_kernel_False_v_layout_1.10 splash_mha_fwd_segmented_residuals_block_kv_512_block_kv_compute_512_block_kv_dkv_512_block_kv_dkv_compute_512_block_kv_dq_512_block_q_512_block_q_dkv_512_block_q_dq_512_k_layout_1_q_layout_1_use_fused_bwd_kernel_False_v_layout_1 compute 144 263339.422 1828.746 32,512,128|32,512,128|32,512,64|32,8,2048,64|32,8,2048,128
splash_mha_fwd_segmented_residuals_block_kv_512_block_kv_compute_512_block_kv_dkv_512_block_kv_dkv_compute_512_block_kv_dq_512_block_q_512_block_q_dkv_512_block_q_dq_512_k_layout_1_q_layout_1_use_fused_bwd_kernel_False_v_layout_1.7 splash_mha_fwd_segmented_residuals_block_kv_512_block_kv_compute_512_block_kv_dkv_512_block_kv_dkv_compute_512_block_kv_dq_512_block_q_512_block_q_dkv_512_block_q_dq_512_k_layout_1_q_layout_1_use_fused_bwd_kernel_False_v_layout_1 compute 144 263318.251 1828.599 32,512,128|32,512,128|32,512,64|32,8,2048,64|32,8,2048,128
splash_mha_fwd_segmented_residuals_block_kv_512_block_kv_compute_512_block_kv_dkv_512_block_kv_dkv_compute_512_block_kv_dq_512_block_q_512_block_q_dkv_512_block_q_dq_512_k_layout_1_q_layout_1_use_fused_bwd_kernel_False_v_layout_1.9 splash_mha_fwd_segmented_residuals_block_kv_512_block_kv_compute_512_block_kv_dkv_512_block_kv_dkv_compute_512_block_kv_dq_512_block_q_512_block_q_dkv_512_block_q_dq_512_k_layout_1_q_layout_1_use_fused_bwd_kernel_False_v_layout_1 compute 144 263307.916 1828.527 32,512,128|32,512,128|32,512,64|32,8,2048,64|32,8,2048,128
fusion.1 fusion compute 136 202727.619 1490.644 128256,512
psum.29 psum communication 144 201721.845 1400.846 512,128256
all-reduce.23 all-reduce communication 120 168016.960 1400.141 128256,512
fusion.48 fusion compute 136 95951.371 705.525 512,128256|512,128256|512,128256
fusion.38 fusion compute 144 94670.169 657.432 32,2048,8,32|32,2048,8,32|32,2048,8,32|32,2048,8,32

Semantic Families

Note: FLOP proxy metrics are relative scaling heuristics from trace shapes, not hardware MFU.

| Family | Count | Exclusive | Share | Avg Exclusive | FLOP Proxy Total | FLOP Proxy/s | Example Op |
|---|---|---|---|---|---|---|---|
| loss_xent | 288 | 13357924.577 | 67.15% | 46381.683 | 12079595520.000 | 9.043e+02 | _linear_softmax_cross_entropy_loss_bwd_pallas_mosaic_tpu_combined.1 |
| attention_splash | 2560 | 5770604.832 | 29.01% | 2254.143 | 822434697576448.000 | 1.425e+08 | splash_mha_dkv_segmented_block_kv_dkv_512_block_kv_dkv_compute_512_block_q_dkv_512_k_layout_1_q_layout_1_v_layout_1.1 |
| other | 560 | 595071.005 | 2.99% | 1062.627 | 29733421056.000 | 4.997e+04 | fusion.1 |
| collective | 120 | 168016.960 | 0.84% | 1400.141 | 7880048640.000 | 4.690e+04 | all-reduce.23 |
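The derived columns in this table can be checked with a few lines of arithmetic. The sketch below assumes that Share is a family's exclusive time over the sum of the four families listed, and that FLOP Proxy/s is FLOP Proxy Total divided by Exclusive in the report's own (unstated) time unit. Both formulas are inferred from the numbers, not taken from the tool's documentation.

```python
# family -> (exclusive, FLOP proxy total), copied from the Semantic Families table.
families = {
    "loss_xent": (13357924.577, 12079595520.0),
    "attention_splash": (5770604.832, 822434697576448.0),
    "other": (595071.005, 29733421056.0),
    "collective": (168016.960, 7880048640.0),
}

total_exclusive = sum(excl for excl, _ in families.values())

derived = {}
for name, (excl, flop_proxy) in families.items():
    share = 100.0 * excl / total_exclusive  # assumed Share formula
    rate = flop_proxy / excl                # assumed FLOP Proxy/s formula
    derived[name] = (share, rate)
    print(f"{name:<18}{share:6.2f}%  {rate:.3e}")
```

Read this way, `loss_xent` takes about two thirds of the family-level exclusive time while showing a proxy rate several orders of magnitude below `attention_splash`. As the note above says, the proxy is a relative heuristic, so the gap is a prompt for inspection, not a utilization measurement.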

Communication Collectives

| Collective | Count | Exclusive | Avg |
|---|---|---|---|
| all-reduce | 856 | 442269.905 | 516.670 |
| async-collective | 22488 | 59490.260 | 2.645 |
| all-gather | 144 | 3648.932 | 25.340 |

Pre-Op Gaps

| Op | Count | Total Gap | Max Gap | Avg Gap |
|---|---|---|---|---|
| copy.564 | 120 | 3176365.089 | 39767.551 | 26469.709 |
| copy.1 | 136 | 530431.752 | 4255.501 | 3900.233 |
| copy.1436 | 16 | 457528.939 | 29388.719 | 28595.559 |
| fusion.749 | 120 | 4890.014 | 41.599 | 40.750 |
| fusion.705 | 136 | 596.271 | 38.057 | 4.384 |
| copy-done.257 | 144 | 320.186 | 2.911 | 2.224 |
| copy-done.56 | 128 | 162.696 | 1.488 | 1.271 |
| add_bitcast_fusion | 144 | 162.575 | 1.418 | 1.129 |
| fusion.768 | 120 | 143.718 | 1.470 | 1.198 |
| copy.561 | 128 | 100.117 | 0.958 | 0.782 |
| negate_power_fusion | 144 | 38.615 | 2.551 | 0.268 |
| copy-done.77 | 136 | 20.373 | 1.540 | 0.150 |
| copy.1435 | 16 | 3.759 | 0.236 | 0.235 |
| copy-done.359 | 144 | 2.234 | 0.019 | 0.016 |
| copy-done.446 | 16 | 0.273 | 0.019 | 0.017 |
| rsqrt_bitcast_fusion.2 | 144 | 0.118 | 0.001 | 0.001 |
| rsqrt_bitcast_fusion.5 | 144 | 0.113 | 0.001 | 0.001 |
| rsqrt_bitcast_fusion.1 | 144 | 0.109 | 0.001 | 0.001 |
| rsqrt_bitcast_fusion.3 | 144 | 0.109 | 0.001 | 0.001 |
| async-collective-done.14 | 144 | 0.102 | 0.001 | 0.001 |
| rsqrt_bitcast_fusion.9 | 144 | 0.102 | 0.001 | 0.001 |
| async-collective-start.4 | 144 | 0.102 | 0.001 | 0.001 |
| copy.654 | 144 | 0.101 | 0.001 | 0.001 |
| async-collective-done.23 | 144 | 0.101 | 0.001 | 0.001 |
| copy-done.391 | 144 | 0.100 | 0.001 | 0.001 |
| psum.28 | 144 | 0.100 | 0.001 | 0.001 |
| async-collective-start.11 | 144 | 0.099 | 0.001 | 0.001 |
| rsqrt_bitcast_fusion.7 | 144 | 0.099 | 0.001 | 0.001 |
| fusion.12 | 136 | 0.098 | 0.001 | 0.001 |
| copy.718 | 144 | 0.098 | 0.001 | 0.001 |

Gap Context (By Region)

| Op | Region Path | Count | Total Gap | Avg Gap |
|---|---|---|---|---|
| copy.564 | copy(model.params.blocks[0].attn.w_k) | 120 | 3176365.089 | 26469.709 |
| copy.1 | copy | 136 | 530431.752 | 3900.233 |
| copy.1436 | copy | 16 | 457528.939 | 28595.559 |
| fusion.749 | _train_step=>add | 120 | 4890.014 | 40.750 |
| fusion.705 | _train_step=>add | 16 | 596.196 | 37.262 |
| copy-done.257 | copy(copy-done) | 144 | 320.186 | 2.224 |
| copy-done.56 | copy(copy-done) | 128 | 162.696 | 1.271 |
| add_bitcast_fusion | _train_step=>_threefry_split=>broadcast_in_dim | 144 | 162.575 | 1.129 |
| fusion.768 | _train_step=>reduce_sum | 120 | 143.718 | 1.198 |
| copy.561 | copy(model.params.blocks[0].attn.w_q) | 128 | 100.117 | 0.782 |
| negate_power_fusion | _train_step=>apply_rotary_embedding=>pow | 144 | 38.615 | 0.268 |
| copy-done.77 | copy(copy-done) | 136 | 20.373 | 0.150 |
| copy.1435 | copy | 16 | 3.759 | 0.235 |
| copy-done.359 | copy(copy-done) | 144 | 2.234 | 0.016 |
| copy-done.446 | copy(copy-done) | 16 | 0.273 | 0.017 |
| rsqrt_bitcast_fusion.2 | _train_step=>mul | 144 | 0.118 | 0.001 |
| rsqrt_bitcast_fusion.5 | _train_step=>mul | 144 | 0.113 | 0.001 |
| rsqrt_bitcast_fusion.1 | _train_step=>mul | 144 | 0.109 | 0.001 |
| rsqrt_bitcast_fusion.3 | _train_step=>mul | 144 | 0.109 | 0.001 |
| async-collective-done.14 | _train_step=>bsh,hd->bsd=>dot_general | 144 | 0.102 | 0.001 |
| rsqrt_bitcast_fusion.9 | _train_step=>mul | 144 | 0.102 | 0.001 |
| async-collective-start.4 | async-collective-start | 144 | 0.102 | 0.001 |
| async-collective-done.23 | _train_step=>bsh,hd->bsd=>dot_general | 144 | 0.101 | 0.001 |
| copy-done.391 | copy(copy-done) | 144 | 0.100 | 0.001 |
| psum.28 | _train_step=>fused_linear_softmax_cross_entropy_loss=>psum | 144 | 0.100 | 0.001 |
| async-collective-start.11 | async-collective-start | 144 | 0.099 | 0.001 |
| rsqrt_bitcast_fusion.7 | _train_step=>mul | 144 | 0.099 | 0.001 |
| fusion.12 | _train_step=>bsm,mh->bsh=>dot_general | 136 | 0.098 | 0.001 |
| copy-start.42 | copy(copy-start) | 144 | 0.097 | 0.001 |
| pad_maximum_fusion | _train_step=>apply_rotary_embedding=>concatenate | 144 | 0.096 | 0.001 |

Optimization Candidates

Single op has outsized exclusive time

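Top op '_linear_softmax_cross_entropy_loss_bwd_pallas_mosaic_tpu_combined.1' has an exclusive duration equal to 252.0% of the global-timeline total reported under Time Breakdown (4035577.866). A share above 100% suggests the op total and the timeline total are not measured on the same basis, so read the figure as a relative signal and not as a fraction of wall time.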

Evidence:

  • Top op: _linear_softmax_cross_entropy_loss_bwd_pallas_mosaic_tpu_combined.1
  • Top op exclusive duration: 10168257.583
  • Top op share: 252.0%

Suggestions:

  • Inspect kernel implementation and tiling/fusion opportunities for this op.
  • Try alternative algorithmic variants or precision/layout adjustments.
  • Run a focused microbenchmark on this op before/after tuning changes.
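The 252.0% figure is easier to interpret next to a second denominator. The sketch below reproduces it by dividing the op's exclusive duration by the sum of the Time Breakdown categories, then repeats the division against the `_train_step` inclusive total from Hierarchical Regions. Which denominator the report generator actually uses is an assumption inferred from the arithmetic.

```python
# Values copied from the report.
top_op_exclusive = 10168257.583      # Top op exclusive duration (Evidence)
time_breakdown_total = 3423731.901 + 70639.619 + 541206.346  # Compute + Communication + Stall
train_step_inclusive = 27531781.349  # root row of Hierarchical Regions

# Assumed denominator behind the "252.0%" headline figure.
share_of_timeline = 100.0 * top_op_exclusive / time_breakdown_total

# Same numerator against the _train_step inclusive total.
share_of_train_step = 100.0 * top_op_exclusive / train_step_inclusive

print(f"vs. Time Breakdown total:  {share_of_timeline:.1f}%")
print(f"vs. _train_step inclusive: {share_of_train_step:.2f}%")
```

The second ratio matches the 36.93% Exclusive % shown for this kernel's region row, which is probably the more useful number when estimating how much a faster backward kernel could save.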

Large idle gaps appear before specific ops

Op 'copy.564' accumulates a significant pre-op idle gap: 3176365.089 in total, equal to 78.7% of the global-timeline total reported under Time Breakdown (4035577.866).

Evidence:

  • Op with largest pre-gap: copy.564
  • Total pre-gap: 3176365.089
  • Max pre-gap: 39767.551
  • Occurrences: 120

Suggestions:

  • Inspect upstream dependencies immediately before this op.
  • Look for host dispatch or synchronization barriers causing the gap.
  • Use hierarchical region totals to localize where the waiting accumulates.
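As a quick consistency check on the evidence above, the sketch below recomputes the average gap (total divided by count) for the three largest rows of Pre-Op Gaps and expresses each total as a percentage of the Time Breakdown sum. The choice of denominator is inferred from the 78.7% figure and is an assumption.

```python
# (op, count, total gap) copied from the first three rows of Pre-Op Gaps.
gaps = [
    ("copy.564", 120, 3176365.089),
    ("copy.1", 136, 530431.752),
    ("copy.1436", 16, 457528.939),
]

# Assumed denominator: Compute + Communication + Stall from Time Breakdown.
time_breakdown_total = 3423731.901 + 70639.619 + 541206.346

report = {}
for op, count, total_gap in gaps:
    avg_gap = total_gap / count
    share = 100.0 * total_gap / time_breakdown_total
    report[op] = (avg_gap, share)
    print(f"{op:<10} avg={avg_gap:10.3f}  share={share:5.1f}%")
```

`copy.564` and `copy.1436` both average well over 25000 per occurrence, close to the heavy-step median in Step Classes, so the heavy steps are a reasonable first place to look for the cause of these gaps.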

Steady-state step time has high jitter

Steady-state p90/median ratio is 19.74, indicating intermittent slow steps.

Evidence:

  • Steady median: 903.888
  • Steady p90: 17838.300
  • p90/median: 19.74

Suggestions:

  • Correlate slow steps with collective spikes and host wait events.
  • Check for periodic checkpoint/eval/input stalls during profiled range.
  • Compare traces before/after disabling optional callbacks or host work.
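One way to follow the first two suggestions is to flag slow steps and look at the spacing between them. The sketch below is illustrative only: `heavy_step_period` is a hypothetical helper, the 5x-median cutoff is an arbitrary choice, and the step times are synthetic, not read from the trace. Only the p90/median ratio uses numbers from the report.

```python
from collections import Counter
from statistics import median

def heavy_step_period(step_times, threshold_ratio=5.0):
    """Return (indices of slow steps, most common spacing between them or None)."""
    cutoff = threshold_ratio * median(step_times)
    heavy = [i for i, t in enumerate(step_times) if t > cutoff]
    spacings = Counter(b - a for a, b in zip(heavy, heavy[1:]))
    period = spacings.most_common(1)[0][0] if spacings else None
    return heavy, period

# p90/median ratio from the Evidence list above.
jitter_ratio = 17838.300 / 903.888
print(f"p90/median = {jitter_ratio:.2f}")

# Synthetic step times (NOT from the trace): every 11th step is slow.
synthetic = [19000.0 if i % 11 == 5 else 850.0 for i in range(66)]
heavy, period = heavy_step_period(synthetic)
print(f"slow steps at {heavy}, most common spacing {period}")
```

A stable spacing like the periodicity of 11 reported for the heavy class in Step Classes points toward scheduled work such as a callback, evaluation, or checkpoint, whereas irregular spacing would point toward input or host stalls.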