bbrowning / instructions.md
Last active November 24, 2025 18:27
Compile recent vLLM builds from source on DGX Spark

Compiling vLLM main from source on DGX Spark

I do all of this while SSH'd into the DGX Spark from another machine, so every step is a terminal command.

Install Python dev dependencies and uv:

sudo apt install python3-dev
curl -LsSf https://astral.sh/uv/install.sh | sh
bbrowning / sm120_nvfp4_moe.diff
Created November 21, 2025 20:56
Changes required to get latest main of vLLM running Qwen3 MoE NVFP4 on DGX Spark
diff --git a/csrc/ops.h b/csrc/ops.h
index f8bdc61aa..933c64db0 100644
--- a/csrc/ops.h
+++ b/csrc/ops.h
@@ -218,6 +218,7 @@ bool cutlass_scaled_mm_supports_fp4(int64_t cuda_device_capability);
bool cutlass_scaled_mm_supports_fp8(int64_t cuda_device_capability);
bool cutlass_scaled_mm_supports_block_fp8(int64_t cuda_device_capability);
bool cutlass_group_gemm_supported(int64_t cuda_device_capability);
+bool cutlass_moe_mm_supports_fp4(int64_t cuda_device_capability);
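The new `cutlass_moe_mm_supports_fp4` declaration presumably gets a body mirroring the other `cutlass_*_supports_*` checks. A minimal sketch of such a gate, assuming (hypothetically, not from the diff) that NVFP4 MoE GEMM requires a Blackwell-class compute capability so the DGX Spark's SM120 GPU passes:

```cpp
#include <cstdint>

// Hypothetical sketch: the real vLLM body may check different thresholds.
// Capability is encoded as major * 10 + minor, e.g. 120 for sm_120
// (DGX Spark's GB10 GPU).
bool cutlass_moe_mm_supports_fp4(int64_t cuda_device_capability) {
  // Assume NVFP4 MoE GEMM needs compute capability 10.0 or newer.
  return cuda_device_capability >= 100;
}
```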
bbrowning / Dockerfile.dgx_spark
Created November 21, 2025 00:14
Dockerfile to create vLLM v0.11.2 containers for DGX Spark
# A crude copy of vLLM's normal Dockerfile that installs
# a released version on DGX Spark
ARG CUDA_VERSION=13.0.2
ARG PYTHON_VERSION=3.12
ARG VLLM_VERSION=0.11.2
ARG BASE_IMAGE=nvidia/cuda:${CUDA_VERSION}-devel-ubuntu22.04
ARG PYTORCH_CUDA_INDEX_BASE_URL=https://download.pytorch.org/whl
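An ARG block like this is typically followed by a build stage that consumes those arguments; a hypothetical continuation (the stage name and package list are illustrative, not from the gist):

```dockerfile
FROM ${BASE_IMAGE} AS base
# ARGs declared before FROM must be redeclared to be visible in the stage
ARG PYTHON_VERSION
# Illustrative only: install Python and pip inside the CUDA devel image
RUN apt-get update && apt-get install -y --no-install-recommends \
        python${PYTHON_VERSION} python3-pip && \
    rm -rf /var/lib/apt/lists/*
```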
bbrowning / test_grammar.py
Created September 11, 2025 17:27
Llguidance, vllm guided_grammar, and Hermes models
import json
from openai import OpenAI
def hermes_grammar_from_tools(tools: list[dict]) -> str:
tool_funcs = ""
for tool in tools:
tool_funcs += " | " if tool_funcs else ""
tool_funcs += f"fun_{tool['function']['name']}"
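The preview cuts off mid-function. A runnable completion of the loop, where wrapping the alternation into a `start` rule is my own guess at the rest of the gist, not its actual body:

```python
def hermes_grammar_from_tools(tools: list[dict]) -> str:
    # Build a "fun_a | fun_b | ..." alternation, one rule name per tool.
    tool_funcs = ""
    for tool in tools:
        tool_funcs += " | " if tool_funcs else ""
        tool_funcs += f"fun_{tool['function']['name']}"
    # Hypothetical continuation: expose the alternation as the start rule.
    return f"start: {tool_funcs}"
```

For example, two tools named `get_weather` and `get_time` would yield `start: fun_get_weather | fun_get_time`.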
bbrowning / 0_instructions.md
Last active September 3, 2025 17:36
Running BFCL against models deployed to vLLM

Running BFCL tests against vLLM

Clone the gorilla repo and install BFCL dependencies:

git clone https://github.com/ShishirPatil/gorilla.git
cd gorilla/berkeley-function-call-leaderboard
python -m venv venv
source venv/bin/activate
pip install -e .
bbrowning / pydantic_agent_test.py
Created July 25, 2025 00:01
An example of how to use Pydantic AI with Llama Stack and the Responses API
# Dependencies:
# pip install openai pydantic-ai
# This example uses the web_search builtin tool, so it assumes you
# have a valid TAVILY_API_KEY environment variable set before starting
# your Llama Stack server.
# Usage:
#
# ollama run llama3.2:3b
bbrowning / llama4_pythonic.ebnf
Created May 22, 2025 12:38
EBNF grammar (for use with Tatsu) for Llama 4 Pythonic tool calling parsing
@@grammar::Llama4
start
=
expression $
;
expression
=
diff --git a/tests/entrypoints/openai/tool_parsers/test_pythonic_tool_parser.py b/tests/entrypoints/openai/tool_parsers/test_pythonic_tool_parser.py
index fbbbc1fb2..1f953706b 100644
--- a/tests/entrypoints/openai/tool_parsers/test_pythonic_tool_parser.py
+++ b/tests/entrypoints/openai/tool_parsers/test_pythonic_tool_parser.py
@@ -52,6 +52,27 @@ ESCAPED_STRING_FUNCTION_CALL = FunctionCall(
name="get_weather",
arguments='{"city": "Martha\'s Vineyard", "metric": "\\"cool units\\""}',
)
+PYTHON_TAGS_FUNCTION_OUTPUT="<|python_start|>[get_weather(city='San Francisco', metric='celsius')]<|python_end|>"
+PYTHON_TAGS_FUNCTION_CALL = FunctionCall(
diff --git a/llama_stack/providers/remote/inference/vllm/vllm.py b/llama_stack/providers/remote/inference/vllm/vllm.py
index 8bc733fd..eaea63f8 100644
--- a/llama_stack/providers/remote/inference/vllm/vllm.py
+++ b/llama_stack/providers/remote/inference/vllm/vllm.py
@@ -161,45 +161,52 @@ def _convert_to_vllm_finish_reason(finish_reason: str) -> StopReason:
async def _process_vllm_chat_completion_stream_response(
stream: AsyncGenerator[OpenAIChatCompletionChunk, None],
) -> AsyncGenerator:
- event_type = ChatCompletionResponseEventType.start
- tool_call_buf = UnparseableToolCall()
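The diff reworks how tool-call fragments are buffered while consuming the chat-completion stream. As a rough illustration of the buffering pattern involved (the function name and dict shape are mine, not llama_stack's):

```python
def merge_tool_call_delta(buf: dict, delta: dict) -> dict:
    # OpenAI-style streams send tool calls as partial deltas; each chunk
    # may carry a fragment of the name and/or the JSON arguments string,
    # which the consumer concatenates until the stream finishes.
    buf["name"] = buf.get("name", "") + (delta.get("name") or "")
    buf["arguments"] = buf.get("arguments", "") + (delta.get("arguments") or "")
    return buf
```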
bbrowning / llama4_pythonic.diff
Created May 14, 2025 00:42
diff of changes to llama 4 pythonic tool parser
diff --git a/tests/entrypoints/openai/tool_parsers/test_pythonic_tool_parser.py b/tests/entrypoints/openai/tool_parsers/test_pythonic_tool_parser.py
index fbbbc1fb2..5d232f44a 100644
--- a/tests/entrypoints/openai/tool_parsers/test_pythonic_tool_parser.py
+++ b/tests/entrypoints/openai/tool_parsers/test_pythonic_tool_parser.py
@@ -52,6 +52,16 @@ ESCAPED_STRING_FUNCTION_CALL = FunctionCall(
name="get_weather",
arguments='{"city": "Martha\'s Vineyard", "metric": "\\"cool units\\""}',
)
+PYTHON_TAGS_FUNCTION_OUTPUT="<|python_start|>[get_weather(city='San Francisco', metric='celsius')]<|python_end|>"
+PYTHON_TAGS_FUNCTION_CALL = FunctionCall(
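Both pythonic-parser diffs add test fixtures wrapped in Llama 4's `<|python_start|>`/`<|python_end|>` tokens, so the parser change they exercise presumably strips those tags before the pythonic call list is parsed. A minimal sketch of that stripping (the helper name is mine):

```python
PYTHON_START = "<|python_start|>"
PYTHON_END = "<|python_end|>"

def strip_python_tags(model_output: str) -> str:
    # Remove the wrapper tokens, leaving the bare "[fn(arg=...)]" list
    # for the existing pythonic parser to handle.
    if model_output.startswith(PYTHON_START):
        model_output = model_output[len(PYTHON_START):]
    if model_output.endswith(PYTHON_END):
        model_output = model_output[:-len(PYTHON_END)]
    return model_output
```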