Autonomous CUDA kernel optimization for the fused 1024x2 dual persistent WaveGRU generation kernel.
Maximize throughput_sps (samples per second) while maintaining correctness: pass under the full verification path.
kernel.cu — this is the only file you modify. Everything else is read-only.
Run the harness and extract the key metrics:

uv run harness.py > run.log 2>&1
grep "^throughput_sps:\|^correctness:\|^sample_.*:\|^teacher_.*_diff:\|^h[01]_.*_diff:" run.log

If the grep output is empty, the run crashed. Run `tail -n 50 run.log` to read the Python stack trace and attempt a fix.
harness.py now validates correctness in two complementary ways before measuring throughput:
- It runs the normal autoregressive CUDA generation path and requires exact token parity against the PyTorch reference. This keeps the sampler, min-p logic, and `samples_out` writeback covered.
- It also runs a teacher-forced path: first the PyTorch reference generates the teacher token sequence, then both the PyTorch reference and the CUDA kernel are re-run with those fixed tokens forced back into the recurrence.
- In teacher-forced mode it compares the selected-token probability from the CUDA kernel against the selected-token probability from the reference at every step.
- It also compares the final `h0` and `h1` hidden states from the teacher-forced run.
This prevents a single divergent autoregressive sample from cascading into downstream mismatches while still keeping the real sampling path under test. A run should be treated as correct only when the autoregressive token parity check, the teacher-forced score comparison, and the final hidden-state comparison all pass.
Throughput is still measured on the normal generation path after correctness passes.
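The min-p filtering that the token parity check keeps covered can be sketched on the host. This is a minimal illustration assuming the standard min-p definition (tokens whose probability falls below `min_p * max_prob` are masked out before sampling); the kernel's exact variant may differ.

```cpp
#include <vector>
#include <algorithm>

// Standard min-p filter sketch (assumption: the kernel uses the usual
// definition). Returns the indices of tokens that survive the filter,
// i.e. tokens with probability >= min_p * max probability.
std::vector<int> min_p_keep(const std::vector<float>& probs, float min_p) {
    float p_max = *std::max_element(probs.begin(), probs.end());
    std::vector<int> kept;
    for (int i = 0; i < (int)probs.size(); ++i)
        if (probs[i] >= min_p * p_max) kept.push_back(i);
    return kept;
}
```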
For detailed analysis of occupancy, bandwidth, register usage, etc.:
uv run harness.py --ncu > ncu.log 2>&1

This is slow (~30-60s overhead), so use it sparingly for deep dives, not every iteration.
When an experiment is done, log it to results.tsv (tab-separated, NOT comma-separated).
The TSV has a header row and 6 columns:
commit throughput_sps kernel_us correctness status description
- git commit hash (short, 7 chars)
- throughput_sps achieved (e.g. 42000.0) — use 0.0 for crashes
- kernel_us per step in microseconds (e.g. 23.81) — use 0.0 for crashes
- correctness: `pass` or `fail` — use `fail` for crashes
- status: `keep`, `discard`, or `crash`
- short text description of what this experiment tried
Example:
commit throughput_sps kernel_us correctness status description
a1b2c3d 42000.0 23.81 pass keep baseline
b2c3d4e 44500.0 22.47 pass keep unroll GLU loop 8x
c3d4e5f 41000.0 24.39 pass discard vectorized load experiment
d4e5f6g 0.0 0.0 fail crash too many registers (launch failed)
LOOP FOREVER:
- Look at the git state and results.tsv for context on what's been tried.
- Think about optimization ideas:
- Register pressure reduction
- Shared memory usage patterns
- Warp specialization
- Loop unrolling strategies
- Memory access coalescing
- Instruction-level parallelism
- Occupancy tuning (launch bounds, block size)
- Reducing grid syncs
- Algorithmic changes to GRU/GLU/sampling phases
- Edit `kernel.cu` with an experimental idea.
- git commit
- Run the experiment: uv run harness.py > run.log 2>&1
- Read out the results: grep "^throughput_sps:\|^correctness:\|^sample_.*:\|^teacher_.*_diff:\|^h[01]_.*_diff:" run.log
- If the grep output is empty, the run crashed. Run `tail -n 50 run.log` to read the stack trace.
- Record the results in results.tsv (do NOT commit results.tsv, leave it untracked)
- If throughput improved AND correctness is pass → keep (advance the branch)
- If throughput is equal/worse OR correctness is fail → discard (git reset back)
Timeout: Each experiment should take < 1 minute (compile + run). If a run exceeds this, kill it and treat it as a failure.
Crashes: If it's a dumb fix (typo, missing import), fix and re-run. If the idea is fundamentally broken, log "crash", discard, and move on.
NEVER STOP: Once the experiment loop has begun, do NOT pause to ask the human if you should continue. The human might be asleep or away. You are autonomous. If you run out of ideas, think harder — re-read the kernel code for new angles, try combining previous near-misses, try more radical approaches. The loop runs until the human interrupts you, period.
The kernel is a fused persistent WaveGRU generation kernel:
- Uses cooperative groups for grid-wide synchronization
- Each step: L0 embed+RMS → L0 GRU → L1 prep+RMS → L1 GRU → GLU → output GEMV → sampling
- GRU layers use register-tiled weights (32 elements per lane × 32 lanes = 1024)
- Sampling uses warp-parallel bitonic sort + Gumbel-max
- Block size 256, cooperative launch
- All 1024 elements fit cleanly in registers (no memory tail)
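The Gumbel-max step in the sampling phase rests on a standard identity: adding independent Gumbel(0,1) noise to each logit and taking the argmax draws an index with the softmax probabilities. A scalar host-side sketch of just that math follows; the kernel performs it warp-parallel and combined with the bitonic sort, so this is an illustration, not the kernel's implementation.

```cpp
#include <cmath>
#include <random>
#include <vector>

// Gumbel-max trick: argmax_i (logit_i + g_i), with g_i = -log(-log(u_i))
// and u_i ~ Uniform(0,1), samples index i with probability softmax(logits)_i.
int gumbel_max_sample(const std::vector<float>& logits, std::mt19937& rng) {
    std::uniform_real_distribution<float> uni(1e-7f, 1.0f);
    int best = 0;
    float best_score = -INFINITY;
    for (int i = 0; i < (int)logits.size(); ++i) {
        float g = -std::log(-std::log(uni(rng)));  // Gumbel(0,1) noise
        float score = logits[i] + g;
        if (score > best_score) { best_score = score; best = i; }
    }
    return best;
}
```

Because the noise is per-index and independent, the argmax can be computed with any parallel reduction, which is what makes the trick a good fit for a warp-parallel sampler.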
The baseline uses both GRU layers register-tiled (L0_REG_TILED=true, L1_REG_TILED=true).
This gives maximum data reuse but very high register pressure (~255 registers).
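The register-tiled layout can be modeled on the CPU to make the data distribution concrete. In this sketch each of 32 "lanes" keeps 32 weights of one 1024-wide row in registers and computes a partial dot product, after which the warp reduces the 32 partials. The contiguous per-lane ownership here is an assumption for clarity; the real kernel may use a strided (coalesced) assignment, and its warp reduction would use shuffle instructions rather than a loop.

```cpp
#include <array>

// CPU model of one register-tiled matvec row: 32 lanes x 32 register-held
// weights = 1024 elements. Each lane accumulates a partial dot product;
// the final sum stands in for the kernel's warp-shuffle reduction.
float tiled_dot(const std::array<float, 1024>& w,
                const std::array<float, 1024>& x) {
    float partial[32];
    for (int lane = 0; lane < 32; ++lane) {
        float acc = 0.f;
        for (int k = 0; k < 32; ++k)  // 32 weights resident per lane
            acc += w[lane * 32 + k] * x[lane * 32 + k];
        partial[lane] = acc;
    }
    float sum = 0.f;  // warp-shuffle tree on the GPU; plain sum here
    for (int lane = 0; lane < 32; ++lane) sum += partial[lane];
    return sum;
}
```

Keeping both layers' tiles resident is what drives the ~255-register pressure: every lane permanently dedicates dozens of registers to weights before any temporaries are counted.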
Ideas to explore:
- Drop one layer to memory path to reduce register pressure → more blocks → better latency hiding
- Shared memory caching strategies for weight rows
- Warp specialization (input/hidden path split)
- Different launch bounds / block sizes
- Vectorized loads (float4)
- Fuse phases differently
- Reduce grid syncs by overlapping computation
- Tune the GLU/GEMV phases
- Sampling optimizations