NTT123

auto-kernel-dev

Autonomous CUDA kernel optimization for the fused 1024x2 dual persistent WaveGRU generation kernel.

Goal

Maximize throughput_sps (samples per second) while maintaining correctness: the kernel must still pass under the full verification path.
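The goal above can be read as a constrained maximization: among kernel variants, only those that pass verification count, and the fastest of those wins. A minimal sketch of that selection rule follows. The function names and the candidate tuple layout are hypothetical and not taken from the gist, and the actual measurement and verification harness is not shown here.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical record: (variant name, samples generated, elapsed seconds, passed verification)
Candidate = Tuple[str, int, float, bool]


def throughput_sps(num_samples: int, elapsed_seconds: float) -> float:
    """Samples generated per second of wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_samples / elapsed_seconds


def best_verified(candidates: Iterable[Candidate]) -> Optional[Tuple[str, float]]:
    """Return (name, throughput) of the fastest variant that passed verification.

    Variants that fail verification are ignored no matter how fast they are.
    Returns None when no variant passed.
    """
    best: Optional[Tuple[str, float]] = None
    for name, num_samples, elapsed_seconds, passed in candidates:
        if not passed:
            continue  # correctness is a hard constraint, not a trade-off
        sps = throughput_sps(num_samples, elapsed_seconds)
        if best is None or sps > best[1]:
            best = (name, sps)
    return best
```

Treating correctness as a filter rather than a weighted term keeps a fast but wrong kernel from ever being selected.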

Edit target

Claude Code Billing Header

Purpose

The x-anthropic-billing-header is a computed system message required in every Claude Code API request. It serves as an authentication/integrity check that ties each request to the Claude Code client. Without it, requests made with OAuth tokens scoped to Claude Code are rejected with:

This credential is only authorized for use with Claude Code and cannot be used for other API requests.

You are Kimi K2.5, an AI assistant developed by Moonshot AI (月之暗面).

You possess native vision for perceiving and reasoning over images users send. You have access to a set of tools for selecting appropriate actions and interfacing with external services.

Boundaries

You cannot generate downloadable files; the only exception is creating data analysis charts with the ipython tool.

For file creation requests, clearly state the limitation of not being able to directly generate files. Do NOT use language that implies "refusing to assist with creation". Then redirect users to the appropriate Kimi alternatives:

Slides (PPT) → https://www.kimi.com/slides

name: chrome-webpage-click
description: Click on web page elements with visual verification. Specify the TARGET element description and INITIAL COORDINATES. The skill will iteratively adjust coordinates until the red dot is on the target, then click automatically.
allowed-tools: mcp__claude-in-chrome__javascript_tool, mcp__claude-in-chrome__computer, mcp__claude-in-chrome__tabs_context_mcp, mcp__claude-in-chrome__read_page, mcp__claude-in-chrome__find

Chrome Webpage Click with Auto-Correction

This skill ensures accurate clicking by iteratively adjusting coordinates until the red dot is visually confirmed on the target element, then clicking directly.
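The adjust-until-confirmed loop described above could be sketched roughly as follows. This is an illustration only, not the skill's actual implementation: `measure_offset` and `click` are hypothetical callbacks standing in for the visual check (draw the red dot, inspect a screenshot, estimate how far the dot is from the target) and for the real click action, and the attempt limit is an assumption.

```python
from typing import Callable, Optional, Tuple

Point = Tuple[int, int]


def click_with_correction(
    initial: Point,
    measure_offset: Callable[[Point], Point],  # hypothetical: (dx, dy) from dot to target
    click: Callable[[Point], None],            # hypothetical: performs the click
    max_attempts: int = 5,                     # assumed safety limit
) -> Optional[Point]:
    """Adjust coordinates until the dot is confirmed on target, then click.

    Returns the clicked point, or None if the dot was never confirmed on the
    target within max_attempts checks (in which case nothing is clicked).
    """
    x, y = initial
    for _ in range(max_attempts):
        dx, dy = measure_offset((x, y))
        if (dx, dy) == (0, 0):
            # Dot is confirmed on the target: click exactly here.
            click((x, y))
            return (x, y)
        # Otherwise move the dot by the measured correction and check again.
        x, y = x + dx, y + dy
    return None
```

Clicking only after a confirmed check means a bad initial guess results in no click at all rather than a click on the wrong element.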

	// All-Gather using Cooperative Groups grid.sync() with vectorized memory access
	// RTX 5090: 170 SMs, 1 block per SM, 16 bytes (uint4) per SM to share
	// Persistent kernel: multiple rounds of all-gather, each with different buffer

	#include <cuda_runtime.h>
	#include <cooperative_groups.h>
	#include <stdio.h>
	#include <climits>

	namespace cg = cooperative_groups;
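For readers unfamiliar with the collective, the data movement named in the header comment can be modeled on the host in a few lines of Python. This is a semantic sketch only: it does not model `grid.sync()`, vectorized loads, or the persistent kernel, and the function name is hypothetical. The block count and payload size are taken from the comment (170 blocks, 16 bytes each), so each block ends a round holding 170 × 16 = 2720 bytes.

```python
from typing import List

NUM_BLOCKS = 170      # one block per SM, per the header comment
BYTES_PER_BLOCK = 16  # one uint4 per block


def all_gather(contributions: List[bytes]) -> List[bytes]:
    """Model of all-gather: every participant receives every contribution.

    contributions[i] is the payload written by block i. The result holds one
    view per block, and every view is the concatenation of all payloads in
    block order.
    """
    gathered = b"".join(contributions)
    return [gathered for _ in contributions]


# One round: block b contributes 16 copies of the byte value b.
contributions = [bytes([b]) * BYTES_PER_BLOCK for b in range(NUM_BLOCKS)]
views = all_gather(contributions)
```

In the kernel, a grid-wide barrier between the write and read phases is what makes every block's view complete; the model above simply assumes that ordering.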

	<!DOCTYPE html>
	<html lang="en">
	<head>
	<meta charset="UTF-8">
	<meta name="viewport" content="width=device-width, initial-scale=1.0">
	<title>SwiGLU 2D Activation</title>
	<style>
	* {
	margin: 0;
	padding: 0;

	"""
	Benchmark matrix multiplication with locked GPU clock for stable performance.
	Requires: pip install nvidia-ml-py torch numpy
	"""
	import pynvml
	import torch
	import random
	import os
	import numpy as np
	from torch.profiler import profile, ProfilerActivity, schedule

	<!DOCTYPE html>
	<html lang="en">
	<head>
	<meta charset="UTF-8">
	<meta name="viewport" content="width=device-width, initial-scale=1.0">
	<title>AI Chess Arena - Gemini API Chess Battle</title>
	<style>
	body {
	font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
	margin: 0 auto;