BenHamm / WEKA-KV-CACHE-TESTER.md

Created January 27, 2026 17:33

Weka KV Cache Tester - Documentation for the new load generation toolkit

Weka KV Cache Tester

A comprehensive load generation and benchmarking toolkit from Weka for testing LLM inference performance with realistic KV cache patterns.

Overview

The weka-new-kv-cache-tester is a Python-based toolkit designed to benchmark LLM inference servers with realistic agentic coding workloads. It includes multiple testing tools and a dataset of 588 real Claude Code conversation traces.

Author: Callan Fox (Weka)
License: Apache 2.0

BenHamm / BENCHMARK_RESULTS_GIST.md

Created December 17, 2025 23:35

Qwen3-32B Disaggregated Serving Benchmark Results - AIConfigurator vs Actual Performance

Qwen3-32B Disaggregated Serving Benchmark Results

Date: December 17, 2025
Model: Qwen/Qwen3-32B-FP8
Cluster: Nebius H200 (16 GPUs)
Framework: TensorRT-LLM via Dynamo

1. Cluster Configuration

BenHamm / AIC_WALKTHROUGH_GUIDE.md

Last active December 19, 2025 17:02

AIConfigurator Walkthrough: Finding Optimal LLM Deployment Configurations

AIConfigurator: Fast-Track Your LLM Deployment on NVIDIA Dynamo

What is NVIDIA Dynamo?

NVIDIA Dynamo is a high-throughput, low-latency inference framework for serving generative AI models across multi-node GPU clusters. As LLMs grow beyond what a single GPU can handle, Dynamo solves the orchestration challenge of coordinating shards, routing requests, and transferring KV cache data across distributed systems.

Key capabilities:

Disaggregated serving — Separates prefill and decode phases for optimized GPU utilization
KV-aware routing — Routes requests to workers with the highest cache hit rate
KV Block Manager — Offloads KV cache to CPU, SSD, or remote memory (G2/G3/G4) for higher throughput
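
The idea behind KV-aware routing can be illustrated with a small sketch. This is not Dynamo's router: `pick_worker`, the worker names, and the block IDs are all hypothetical, and the only signal modeled here is how many leading KV blocks of a request a worker already holds. A production router would presumably weigh other factors, such as current load, which this sketch ignores.

```python
def pick_worker(request_blocks, worker_caches):
    """Return the worker whose cache covers the longest prefix of the request.

    request_blocks: ordered list of block IDs for the incoming prompt.
    worker_caches:  dict mapping worker name -> set of cached block IDs.
    Both structures are illustrative assumptions, not a Dynamo API.
    """
    def prefix_hits(cached):
        # Count consecutive blocks from the start that are already cached;
        # stop at the first miss because later blocks depend on earlier ones.
        hits = 0
        for block in request_blocks:
            if block not in cached:
                break
            hits += 1
        return hits

    # max() returns the first worker in dict order when several tie.
    return max(worker_caches, key=lambda name: prefix_hits(worker_caches[name]))


workers = {
    "w0": {"a", "b"},
    "w1": {"a", "b", "c"},
    "w2": set(),
}
print(pick_worker(["a", "b", "c", "d"], workers))  # w1 holds the longest prefix
```

Counting only the leading run of hits, rather than all hits, reflects that a cached block is reusable only if every block before it is also cached.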

BenHamm / AIC_PREDICTION_MISMATCH_GIST.md

Last active December 1, 2025 23:24

AIConfigurator Prediction Mismatch: 7-8% vs 102-148% Disaggregated Serving Performance Gains

AIConfigurator Performance Prediction Mismatch

Summary

We tested AIConfigurator (version 0.4.0) against the performance claims in the "Advanced Disagg Perf Tuning" guide and found a significant discrepancy between AIC's predictions and the guide's reported results.

Key Finding: AIC predicts that disaggregated serving provides a 7-8% improvement, while the guide reports a 102-148% improvement, a 10-20x difference in expected gains.
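
As a quick sanity check on the size of that gap, the four percentages quoted above can be compared directly. The variable names are illustrative, and pairing the extremes (lowest guide figure against highest AIC figure, and the reverse) is an assumption about how the range is meant to be read.

```python
# Improvement percentages quoted in the summary above.
aic_low, aic_high = 7.0, 8.0          # AIC-predicted gain, in percent
guide_low, guide_high = 102.0, 148.0  # guide-reported gain, in percent

# Narrowest and widest ratios between the two ranges.
ratio_min = guide_low / aic_high   # 102 / 8
ratio_max = guide_high / aic_low   # 148 / 7

print(f"guide gain is {ratio_min:.2f}x to {ratio_max:.2f}x the AIC prediction")
```

Under that reading the gap works out to roughly 13x to 21x, the same order of magnitude as the rounded figure in the summary.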

Source Document: The guide being tested is from PR #4655 by davilu-nvidia (submitted Nov 27, 2025, currently under review and not yet merged).

BenHamm / AIPERF-PRESENTATION.md

Last active November 14, 2025 09:32

AIPerf Comprehensive Benchmarking Guide - WIP

AIPerf: Comprehensive LLM Benchmarking

Presentation Date: November 13, 2025
Tool: AIPerf v0.3.0

Block Hash: Privacy-Preserving LLM Trace Sharing

The Problem

Inference providers that want to contribute to open source projects are limited in what they can share while respecting user privacy and protecting proprietary prompts. Real LLM inference traces contain sensitive information, but without realistic traces, benchmarks can't accurately reflect production workloads.

The Solution

Block hashing converts tokens into cryptographic hash IDs that preserve prefix-matching patterns while protecting content. Based on Mooncake AI's approach (USENIX FAST'25).
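
A minimal sketch of the idea, assuming a chained SHA-256 over fixed-size token blocks. The function names, the tiny block size, and the truncated hex IDs are illustrative choices, not the exact scheme used by Mooncake or AIPerf. Each block's ID depends on the block's tokens and on the ID of the block before it, so two prompts share leading IDs exactly as far as they share leading blocks of tokens, while the tokens themselves never appear in the trace.

```python
import hashlib


def block_hash_ids(token_ids, block_size=4):
    """Turn a token sequence into a list of chained block hash IDs.

    Only full blocks are hashed; a trailing partial block is dropped
    in this sketch. block_size=4 is chosen for readability only.
    """
    ids = []
    parent = b""  # digest of the previous block, empty for the first one
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = token_ids[start:start + block_size]
        h = hashlib.sha256()
        h.update(parent)  # chaining makes each ID depend on the whole prefix
        h.update(",".join(map(str, block)).encode("utf-8"))
        parent = h.digest()
        ids.append(parent.hex()[:16])  # shortened ID for display
    return ids


def shared_prefix_blocks(ids_a, ids_b):
    """Count how many leading block IDs two traces have in common."""
    count = 0
    for a, b in zip(ids_a, ids_b):
        if a != b:
            break
        count += 1
    return count


# Two prompts with the same first 8 tokens and different continuations.
trace_a = block_hash_ids([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
trace_b = block_hash_ids([1, 2, 3, 4, 5, 6, 7, 8, 99, 100, 101, 102])
print(shared_prefix_blocks(trace_a, trace_b))  # 2
```

Because of the chaining, identical token blocks at different positions, or after different prefixes, receive different IDs, which matches how prefix caches reuse KV blocks.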

BenHamm / aiperf_benchmark_results.md

Created November 3, 2025 21:24

AIPerf Benchmark: Claude Sonnet 4.5 via NVIDIA API - 8K input, 1K output, 100 requests

AIPerf Benchmark: Claude Sonnet 4.5 via NVIDIA API

Overview

Performance benchmark of the aws/anthropic/bedrock-claude-sonnet-4-5-v1 model hosted on NVIDIA's inference API, tested with large context windows (8K input tokens, 1K output tokens).

Benchmark Date: November 3, 2025
Tool: AIPerf v0.2.0
Test Configuration: 100 requests with streaming enabled

BenHamm / brev-instance-alert-setup.md

Last active October 24, 2025 20:58

Brev Instance Alert Setup for macOS - Never forget to stop your instances!

Brev Instance Alert Setup for macOS

Never forget to stop your Brev instances again! This setup creates an alert that pops up every 4 hours if you have running Brev instances.

What the Alert Looks Like

BenHamm / QWEN32B_DISAGGREGATION_REPORT.md

Created October 22, 2025 17:02

Qwen3-32B Disaggregation Performance Analysis: Why Disaggregation Underperformed

Qwen3-32B Disaggregation Performance Analysis

Date: October 22, 2025
Environment: Nebius H200 Kubernetes Cluster
Infrastructure: Dynamo LLM Serving Platform v0.5.0
Model: Qwen/Qwen3-32B with FP8 quantization

1. Executive Summary

BenHamm / AIPerf_permutations_guide.md

Last active September 27, 2025 00:51

AI Perf Permutations

AIPerf Profiling of Text, Image, & Embeddings Endpoints

This tutorial captures end-to-end reference flows for running AIPerf against vLLM-hosted models. Each chapter covers a specific OpenAI-compatible endpoint: how to launch the vLLM server, run the AIPerf benchmark, and interpret

Ben Hamm BenHamm
