NanoChat Inference Walkthrough

Let's walk through what happens when you send a message to the NanoChat server. We'll trace the entire journey from typing "hello world" in your browser to receiving a response, explaining everything along the way.


Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                        Browser (Frontend)                       │
│  - HTML/CSS/JavaScript UI (nanochat/ui.html)                    │
│  - Manages conversation history                                 │
│  - Streams responses via Server-Sent Events                     │
└────────────────┬────────────────────────────────────────────────┘
                 │ HTTP POST /chat/completions
                 │ JSON: {messages, temperature, top_k, max_tokens}
                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                   FastAPI Server (chat_web.py)                  │
│  - Validates requests (abuse prevention)                        │
│  - Manages worker pool                                          │
│  - Coordinates tokenization & generation                        │
└────────────────┬────────────────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                  Worker Pool (Multi-GPU)                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐           │
│  │  Worker 0    │  │  Worker 1    │  │  Worker N    │           │
│  │  GPU 0       │  │  GPU 1       │  │  GPU N       │           │
│  │  - Model     │  │  - Model     │  │  - Model     │           │
│  │  - Engine    │  │  - Engine    │  │  - Engine    │           │
│  │  - Tokenizer │  │  - Tokenizer │  │  - Tokenizer │           │
│  └──────────────┘  └──────────────┘  └──────────────┘           │
└────────────────┬────────────────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Engine (engine.py)                           │
│  - KV Cache management                                          │
│  - Prefill phase (parallel token processing)                    │
│  - Decode phase (autoregressive generation)                     │
│  - Tool use coordination (calculator)                           │
└────────────────┬────────────────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                   GPT Model (gpt.py)                            │
│  - Token embedding layer                                        │
│  - 12 Transformer blocks (configurable)                         │
│  - Rotary positional embeddings (RoPE)                          │
│  - Multi-Query Attention (MQA)                                  │
│  - ReLU² activation                                             │
│  - Language model head                                          │
└─────────────────────────────────────────────────────────────────┘

Key Components

Component     File                                  Responsibility
Frontend UI   nanochat/ui.html                      User interface, message history, streaming display
Web Server    scripts/chat_web.py                   HTTP API, validation, worker management
Worker Pool   scripts/chat_web.py (lines 98-149)    Multi-GPU load balancing
Engine        nanochat/engine.py                    Generation loop, KV cache, sampling
GPT Model     nanochat/gpt.py                       Transformer neural network
Tokenizer     nanochat/tokenizer.py                 Text ↔ token ID conversion

Complete Message Flow

Let's trace what happens when you type "hello world" and press Send.

Step 1: Frontend Captures User Input

File: nanochat/ui.html (lines 524-544)

async function sendMessage() {
  const message = chatInput.value.trim(); // "hello world"
  if (!message || isGenerating) return;

  chatInput.value = "";
  chatInput.style.height = "auto";

  const userMessageIndex = messages.length;
  messages.push({ role: "user", content: message });
  addMessage("user", message, userMessageIndex);

  await generateAssistantResponse();
}

What happens:

  • JavaScript captures the text "hello world"
  • Adds it to the messages array with role 'user'
  • Displays it in the chat UI
  • Calls generateAssistantResponse() to fetch the AI's reply

Step 2: Frontend Sends HTTP Request

File: nanochat/ui.html (lines 390-401)

const response = await fetch(`${API_URL}/chat/completions`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    messages: messages, // [{ role: 'user', content: 'hello world' }]
    temperature: currentTemperature, // e.g., 0.8
    top_k: currentTopK, // e.g., 50
    max_tokens: 512,
  }),
});

HTTP Request:

POST /chat/completions HTTP/1.1
Host: localhost:8000
Content-Type: application/json

{
  "messages": [{"role": "user", "content": "hello world"}],
  "temperature": 0.8,
  "top_k": 50,
  "max_tokens": 512
}
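
If you want to exercise the endpoint without the browser UI, a small script can reproduce the same request and consume the event stream (a sketch using the requests library, assuming the server is running on localhost:8000 as above):

import json
import requests

payload = {
    "messages": [{"role": "user", "content": "hello world"}],
    "temperature": 0.8,
    "top_k": 50,
    "max_tokens": 512,
}

# stream=True lets us read the Server-Sent Events line by line as they arrive (see Step 10)
with requests.post("http://localhost:8000/chat/completions", json=payload, stream=True) as r:
    for line in r.iter_lines(decode_unicode=True):
        if line and line.startswith("data: "):
            event = json.loads(line[len("data: "):])
            if event.get("done"):
                break
            print(event.get("token", ""), end="", flush=True)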

Step 3: FastAPI Receives and Validates Request

File: scripts/chat_web.py (lines 313-324)

@app.post("/chat/completions")
async def chat_completions(request: ChatRequest):
    """Chat completion endpoint (streaming only) - uses worker pool for multi-GPU."""

    # Basic validation to prevent abuse
    validate_chat_request(request)

    # Log incoming conversation to console
    logger.info("="*20)
    for i, message in enumerate(request.messages):
        logger.info(f"[{message.role.upper()}]: {message.content}")
    logger.info("-"*20)

Request Validation (lines 160-221):

The server applies strict limits to prevent abuse:

# Abuse prevention limits (lines 52-61)
MAX_MESSAGES_PER_REQUEST = 500
MAX_MESSAGE_LENGTH = 8000
MAX_TOTAL_CONVERSATION_LENGTH = 32000
MIN_TEMPERATURE = 0.0
MAX_TEMPERATURE = 2.0
MIN_TOP_K = 1
MAX_TOP_K = 200
MIN_MAX_TOKENS = 1
MAX_MAX_TOKENS = 4096

For our "hello world" message:

  • ✓ Message count: 1 < 500
  • ✓ Message length: 11 chars < 8000
  • ✓ Total conversation: 11 chars < 32000
  • ✓ Temperature: 0.8 ∈ [0.0, 2.0]
  • ✓ Top-k: 50 ∈ [1, 200]
  • ✓ Max tokens: 512 ∈ [1, 4096]

Why these limits? They protect against:

  • Memory exhaustion (very long conversations)
  • CPU/GPU abuse (extremely long generation)
  • Denial of Service attacks
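
A condensed sketch of the kind of checks validate_chat_request performs, using the limits defined above (illustrative, not a verbatim copy of the function in scripts/chat_web.py):

from fastapi import HTTPException

def validate_chat_request(request):
    """Reject requests that exceed the abuse-prevention limits."""
    if len(request.messages) > MAX_MESSAGES_PER_REQUEST:
        raise HTTPException(status_code=400, detail="Too many messages")
    if any(len(m.content) > MAX_MESSAGE_LENGTH for m in request.messages):
        raise HTTPException(status_code=400, detail="A message is too long")
    total_chars = sum(len(m.content) for m in request.messages)
    if total_chars > MAX_TOTAL_CONVERSATION_LENGTH:
        raise HTTPException(status_code=400, detail="Conversation is too long")
    if not (MIN_TEMPERATURE <= request.temperature <= MAX_TEMPERATURE):
        raise HTTPException(status_code=400, detail="Invalid temperature")
    if not (MIN_TOP_K <= request.top_k <= MAX_TOP_K):
        raise HTTPException(status_code=400, detail="Invalid top_k")
    if not (MIN_MAX_TOKENS <= request.max_tokens <= MAX_MAX_TOKENS):
        raise HTTPException(status_code=400, detail="Invalid max_tokens")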

Step 4: Worker Pool Assigns GPU Worker

File: scripts/chat_web.py (lines 326-328)

# Acquire a worker from the pool (will wait if all are busy)
worker_pool = app.state.worker_pool
worker = await worker_pool.acquire_worker()

Worker Pool Architecture (lines 98-149):

Each worker contains a complete copy of the model on a separate GPU:

@dataclass
class Worker:
    """A worker with a model loaded on a specific GPU."""
    gpu_id: int                        # e.g., 0, 1, 2, 3
    device: torch.device               # e.g., cuda:0
    engine: Engine                     # Generation engine with KV cache
    tokenizer: object                  # BPE tokenizer
    autocast_ctx: torch.amp.autocast   # Mixed precision context

Key Points:

  • Data Parallelism: Each GPU has a full model replica
  • Load Balancing: Requests are distributed across available workers
  • Async Queue: If all workers are busy, new requests wait in an async queue
  • Resource Isolation: Each worker is independent (no shared state)

Example with 4 GPUs:

Request 1 → Worker 0 (GPU 0) → Busy
Request 2 → Worker 1 (GPU 1) → Busy
Request 3 → Worker 2 (GPU 2) → Busy
Request 4 → Worker 3 (GPU 3) → Busy
Request 5 → (waits in queue)
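
The pool itself can be as simple as an asyncio.Queue of Worker objects; a minimal sketch of acquire/release (the real WorkerPool in chat_web.py may carry extra bookkeeping):

import asyncio

class WorkerPool:
    """Hands out workers one at a time; callers wait when all GPUs are busy."""
    def __init__(self, workers):
        self.queue = asyncio.Queue()
        for worker in workers:
            self.queue.put_nowait(worker)

    async def acquire_worker(self):
        # Waits (without blocking the event loop) until a worker is free
        return await self.queue.get()

    async def release_worker(self, worker):
        # Put the worker back so the next queued request can use it
        await self.queue.put(worker)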

Step 5: Conversation Tokenization

File: scripts/chat_web.py (lines 331-349)

# Build conversation tokens
bos = worker.tokenizer.get_bos_token_id()
user_start = worker.tokenizer.encode_special("<|user_start|>")
user_end = worker.tokenizer.encode_special("<|user_end|>")
assistant_start = worker.tokenizer.encode_special("<|assistant_start|>")
assistant_end = worker.tokenizer.encode_special("<|assistant_end|>")

conversation_tokens = [bos]
for message in request.messages:
    if message.role == "user":
        conversation_tokens.append(user_start)
        conversation_tokens.extend(worker.tokenizer.encode(message.content))
        conversation_tokens.append(user_end)
    elif message.role == "assistant":
        conversation_tokens.append(assistant_start)
        conversation_tokens.extend(worker.tokenizer.encode(message.content))
        conversation_tokens.append(assistant_end)

conversation_tokens.append(assistant_start)

For "hello world", the token sequence becomes:

[
  65527,  # <|bos|>            (Beginning of Sequence)
  65528,  # <|user_start|>     (Start of user message)
  52300,  # "hello"            (BPE token for "hello")
  883,    # " world"           (BPE token for " world" - note the leading space!)
  65529,  # <|user_end|>       (End of user message)
  65530   # <|assistant_start|> (Primes the model to respond)
]

Note: "hello world" tokenizes to just 2 tokens:

  • 52300 = "hello"
  • 883 = " world" (includes the leading space)

Why special tokens?

  • Role Separation: The model learns different behaviors for user vs assistant
  • Tool Use: Additional tokens like <|python_start|> enable tool calling
  • Structured Conversations: Clear boundaries between messages prevent confusion
  • Supervision: During training, we only compute loss on assistant tokens

Multi-turn Example:

User: "What is 2+2?"
Assistant: "4"
User: "And 3+3?"
→ Tokenization:
[
  <|bos|>,
  <|user_start|>, tokens("What is 2+2?"), <|user_end|>,
  <|assistant_start|>, tokens("4"), <|assistant_end|>,
  <|user_start|>, tokens("And 3+3?"), <|user_end|>,
  <|assistant_start|>  ← primes for next response
]
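
The same pattern, factored into a small helper for any conversation (a sketch with an illustrative name; the repo inlines this logic rather than exposing such a function):

def render_conversation(messages, tokenizer):
    """Turn a list of {role, content} messages into the token sequence the model expects."""
    bos = tokenizer.get_bos_token_id()
    specials = {
        "user": (tokenizer.encode_special("<|user_start|>"), tokenizer.encode_special("<|user_end|>")),
        "assistant": (tokenizer.encode_special("<|assistant_start|>"), tokenizer.encode_special("<|assistant_end|>")),
    }
    tokens = [bos]
    for message in messages:
        start, end = specials[message["role"]]
        tokens.append(start)
        tokens.extend(tokenizer.encode(message["content"]))
        tokens.append(end)
    # Always end with <|assistant_start|> so the model is primed to produce the next reply
    tokens.append(specials["assistant"][0])
    return tokens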

Step 6: Engine Generation - Prefill Phase

File: nanochat/engine.py (lines 208-220)

The engine performs generation in two phases: Prefill and Decode.

6.1: Prefill Phase

The prefill phase processes all input tokens in parallel to populate the KV cache.

# 1) Run a batch 1 prefill of the prompt tokens
m = self.model.config
kv_model_kwargs = {
    "num_heads": m.n_kv_head,
    "head_dim": m.n_embd // m.n_head,
    "num_layers": m.n_layer
}
kv_cache_prefill = KVCache(
    batch_size=1,
    seq_len=len(tokens),  # Only needs to store this many tokens
    **kv_model_kwargs,
)
ids = torch.tensor([tokens], dtype=torch.long, device=device)
logits = self.model.forward(ids, kv_cache=kv_cache_prefill)  # (1, T, vocab_size)
logits = logits[:, -1, :]  # Take only the last token's logits

What's happening:

  1. Create KV Cache: Allocates memory to store attention keys/values

    • Shape: (num_layers, 2, batch_size, num_heads, seq_len, head_dim)
    • Example: (12, 2, 1, 6, 6, 128) for our 6 input tokens
  2. Forward Pass: Process all 6 tokens through the Transformer

    • Token embedding
    • 12 Transformer layers (attention + MLP)
    • Each layer caches its K, V matrices
    • Output logits for ALL positions
  3. Extract Last Logits: Only the last position predicts the next token

    • Input: [bos, user_start, "hello", " world", user_end, assistant_start]
    • We want to generate the token AFTER assistant_start
    • So we take logits[:, -1, :] (logits at position 5)

Why prefill is fast:

  • Parallel Processing: All tokens computed simultaneously (not sequential)
  • Matrix Operations: GPUs excel at large matrix multiplications
  • Single Forward Pass: Processes entire prompt at once

Prefill Complexity:

  • Time: O(T²) where T = prompt length (due to attention)
  • But in practice, GPUs make this very fast for T < 2048
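
To make the cache concrete, here is a toy version with just the two operations the engine relies on: insert_kv (called from attention, shown in Step 8) and prefill (used when switching to decode in Step 7). It is a simplified sketch; the real KVCache in nanochat/engine.py has more bookkeeping:

import torch

class ToyKVCache:
    def __init__(self, batch_size, seq_len, num_heads, head_dim, num_layers):
        # One (K, V) buffer pair per layer: shape (num_layers, 2, B, H, T_max, D)
        self.kv = torch.zeros(num_layers, 2, batch_size, num_heads, seq_len, head_dim)
        self.pos = 0  # how many positions have been filled so far

    def get_pos(self):
        return self.pos

    def insert_kv(self, layer_idx, k, v):
        # k, v: (B, H, T_new, D); append them at the current position
        T_new = k.size(2)
        end = self.pos + T_new
        self.kv[layer_idx, 0, :, :, self.pos:end] = k
        self.kv[layer_idx, 1, :, :, self.pos:end] = v
        if layer_idx == self.kv.size(0) - 1:
            self.pos = end  # advance only after the last layer has written
        # Return everything cached for this layer, including the new tokens
        return self.kv[layer_idx, 0, :, :, :end], self.kv[layer_idx, 1, :, :, :end]

    def prefill(self, other):
        # Copy a smaller (e.g. batch-1) cache into this one, broadcasting over the batch dim
        T = other.pos
        self.kv[:, :, :, :, :T] = other.kv[:, :, :, :, :T]
        self.pos = T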

6.2: Sample First Token

next_ids = sample_next_token(logits, rng, temperature, top_k)  # (B, 1)
sampled_tokens = next_ids[:, 0].tolist()

This samples the first output token (e.g., "Hi") using temperature and top-k sampling (explained in detail in Step 9: Token Sampling).

Step 7: Engine Generation - Decode Phase

File: nanochat/engine.py (lines 222-296)

After prefill, we replicate the KV cache for each sample and enter the decode loop.

# 2) Replicate the KV cache for each sample/row
kv_length_hint = (len(tokens) + max_tokens) if max_tokens is not None else self.model.config.sequence_len
kv_cache_decode = KVCache(
    batch_size=num_samples,
    seq_len=kv_length_hint,
    **kv_model_kwargs,
)
kv_cache_decode.prefill(kv_cache_prefill)
del kv_cache_prefill  # Free memory

# 3) Initialize states for each sample
row_states = [RowState(tokens.copy()) for _ in range(num_samples)]

# 4) Main generation loop
num_generated = 0
first_iteration = True
while True:
    # Stop conditions
    if max_tokens is not None and num_generated >= max_tokens:
        break
    if all(state.completed for state in row_states):
        break

    if first_iteration:
        # Use the token we already sampled from prefill
        sampled_tokens = [sampled_tokens[0]] * num_samples
        first_iteration = False
    else:
        # Forward the model and get the next token for each row
        logits = self.model.forward(ids, kv_cache=kv_cache_decode)  # (B, T, vocab_size)
        logits = logits[:, -1, :]  # (B, vocab_size) at last time step
        next_ids = sample_next_token(logits, rng, temperature, top_k)  # (B, 1)
        sampled_tokens = next_ids[:, 0].tolist()

    # Process each row: choose next token, update state, handle tool use
    token_column = []
    token_masks = []
    for i, state in enumerate(row_states):
        # Determine if this token is forced (tool output) or sampled
        is_forced = len(state.forced_tokens) > 0
        token_masks.append(0 if is_forced else 1)
        next_token = state.forced_tokens.popleft() if is_forced else sampled_tokens[i]
        token_column.append(next_token)

        # Update state
        state.current_tokens.append(next_token)

        # Check for completion
        if next_token == assistant_end or next_token == bos:
            state.completed = True

        # Handle tool logic (calculator)
        # ... (see Tool Use section)

    # Yield the token column
    yield token_column, token_masks
    num_generated += 1

    # Prepare ids for next iteration (single token forward pass)
    ids = torch.tensor(token_column, dtype=torch.long, device=device).unsqueeze(1)

Decode Loop Mechanics:

Each iteration:

  1. Forward Pass: Process ONE new token through the model
  2. Sample: Get next token using temperature + top-k
  3. Update KV Cache: Append new K, V to cache
  4. Yield: Return token to caller (for streaming)
  5. Repeat: Until max_tokens or <|assistant_end|>

Example Decode Sequence:

Iteration 1: [assistant_start]     → Sample "Hi"       → Yield "Hi"
Iteration 2: ["Hi"]                 → Sample "!"        → Yield "!"
Iteration 3: ["!"]                  → Sample " How"     → Yield " How"
Iteration 4: [" How"]               → Sample " can"     → Yield " can"
Iteration 5: [" can"]               → Sample " I"       → Yield " I"
Iteration 6: [" I"]                 → Sample " help"    → Yield " help"
Iteration 7: [" help"]              → Sample "?"        → Yield "?"
Iteration 8: ["?"]                  → Sample <|assistant_end|> → STOP

Decode Complexity:

  • Time per token: O(T) where T = total sequence length so far
  • Total time for N tokens: O(N × T) ≈ O(N²) worst case
  • BUT: With KV cache, each iteration only processes 1 new token (very fast!)

Without KV Cache (naive approach):

Iteration 1: Process [assistant_start] → 1 token
Iteration 2: Process [assistant_start, "Hi"] → 2 tokens
Iteration 3: Process [assistant_start, "Hi", "!"] → 3 tokens
...
Iteration N: Process [all N tokens] → N tokens
Total: 1 + 2 + 3 + ... + N = O(N²) tokens processed

With KV Cache (our approach):

Prefill:     Process the full prompt (T tokens) in one parallel pass
Iteration 1: Process ["Hi"] → 1 token
Iteration 2: Process ["!"] → 1 token
...
Iteration N: Process [one token] → 1 token
Total: T + N tokens processed (linear!)

For a typical response of ~100 tokens, the KV cache therefore cuts the number of token positions processed by roughly 50x.
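
The same back-of-the-envelope arithmetic in code, using illustrative numbers (a 6-token prompt and a 100-token response):

T = 6      # prompt tokens: "hello world" plus the special tokens
N = 100    # generated tokens

without_cache = sum(T + i for i in range(1, N + 1))  # re-process the whole prefix at every step
with_cache = T + N                                   # prefill once, then one new token per step

print(without_cache, with_cache, without_cache / with_cache)
# 5650 106 53.3  → roughly a 50x reduction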

Step 8: Model Forward Pass

File: nanochat/gpt.py (lines 244-276)

Each iteration of the decode loop calls model.forward():

def forward(self, idx, targets=None, kv_cache=None, loss_reduction='mean'):
    B, T = idx.size()  # Batch size, sequence length

    # Grab rotary embeddings for current sequence length
    T0 = 0 if kv_cache is None else kv_cache.get_pos()
    cos_sin = self.cos[:, T0:T0+T], self.sin[:, T0:T0+T]

    # Forward the trunk of the Transformer
    x = self.transformer.wte(idx)  # Token embeddings
    x = norm(x)                     # RMSNorm after embedding
    for block in self.transformer.h:
        x = block(x, cos_sin, kv_cache)
    x = norm(x)                     # Final RMSNorm

    # Forward the lm_head (compute logits)
    logits = self.lm_head(x)
    logits = 15 * torch.tanh(logits / 15)  # Logits softcap
    return logits

Transformer Block (lines 126-135):

class Block(nn.Module):
    def forward(self, x, cos_sin, kv_cache):
        x = x + self.attn(norm(x), cos_sin, kv_cache)  # Attention with residual
        x = x + self.mlp(norm(x))                       # MLP with residual
        return x

Pre-Norm Architecture:

  • RMSNorm BEFORE attention (not after like original Transformer)
  • RMSNorm BEFORE MLP
  • Residual connections add the original x back
  • More stable training, better gradient flow
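
The norm() used throughout is an RMSNorm. Consistent with the bare norm(x) calls above, a functional, parameter-free version looks roughly like this (a sketch; the exact implementation in gpt.py may differ):

import torch

def norm(x, eps=1e-6):
    # RMSNorm: rescale each vector to unit root-mean-square; no learned weight or bias
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)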

Attention (lines 66-110):

def forward(self, x, cos_sin, kv_cache):
    B, T, C = x.size()

    # Project to Q, K, V
    q = self.c_q(x).view(B, T, self.n_head, self.head_dim)
    k = self.c_k(x).view(B, T, self.n_kv_head, self.head_dim)
    v = self.c_v(x).view(B, T, self.n_kv_head, self.head_dim)

    # Apply Rotary Embeddings (RoPE)
    cos, sin = cos_sin
    q = apply_rotary_emb(q, cos, sin)
    k = apply_rotary_emb(k, cos, sin)

    # Apply QK Normalization
    q, k = norm(q), norm(k)

    # Transpose for attention: (B, T, H, D) → (B, H, T, D)
    q, k, v = q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)

    # KV Cache: insert current k,v and get full view
    if kv_cache is not None:
        k, v = kv_cache.insert_kv(self.layer_idx, k, v)

    # Scaled Dot-Product Attention
    y = F.scaled_dot_product_attention(q, k, v, is_causal=True)

    # Re-assemble heads and project back
    y = y.transpose(1, 2).contiguous().view(B, T, -1)
    y = self.c_proj(y)
    return y

Key Optimizations:

  • Rotary Embeddings (RoPE): Encodes position without learnable parameters
  • QK Normalization: Stabilizes attention, prevents runaway attention scores
  • Multi-Query Attention (MQA): allows fewer KV heads than query heads (the base config shown here sets both to 6, i.e. standard attention; reducing n_kv_head is what yields the savings below)
    • Saves memory in KV cache
    • Faster inference
    • Minimal quality loss
  • Flash Attention: PyTorch's scaled_dot_product_attention uses optimized kernels
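
For reference, a common "rotate-half" formulation of apply_rotary_emb; it works on any tensor whose last dimension is the head dimension, such as the (B, T, H, D) q and k above. This is a sketch of the general technique, and nanochat's exact variant may pair the dimensions differently:

import torch

def apply_rotary_emb(x, cos, sin):
    # x: (..., D); cos, sin: broadcastable tensors with last dimension D/2
    d = x.size(-1) // 2
    x1, x2 = x[..., :d], x[..., d:]
    # Rotate each (x1, x2) pair by the position-dependent angle encoded in cos/sin
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)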

MLP (lines 113-123):

class MLP(nn.Module):
    def forward(self, x):
        x = self.c_fc(x)         # Linear: d_model → 4*d_model
        x = F.relu(x).square()   # ReLU²  (not GELU!)
        x = self.c_proj(x)       # Linear: 4*d_model → d_model
        return x

Why ReLU² instead of GELU?

  • Simpler (no approximations)
  • Faster to compute
  • Works well in practice
  • Novel activation used in recent models

Step 9: Token Sampling

File: nanochat/engine.py (lines 156-172)

After the model produces logits, we sample the next token:

@torch.inference_mode()
def sample_next_token(logits, rng, temperature=1.0, top_k=None):
    """Sample a single next token from logits of shape (B, vocab_size). Returns (B, 1)."""

    # Temperature = 0 → Greedy (always pick highest probability)
    if temperature == 0.0:
        return torch.argmax(logits, dim=-1, keepdim=True)

    # Top-k filtering
    if top_k is not None:
        k = min(top_k, logits.size(-1))
        vals, idx = torch.topk(logits, k, dim=-1)  # Get top-k logits
        vals = vals / temperature                   # Scale by temperature
        probs = F.softmax(vals, dim=-1)             # Convert to probabilities
        choice = torch.multinomial(probs, num_samples=1, generator=rng)
        return idx.gather(1, choice)                # Map back to vocab indices
    else:
        # Standard temperature sampling (no top-k)
        logits = logits / temperature
        probs = F.softmax(logits, dim=-1)
        return torch.multinomial(probs, num_samples=1, generator=rng)

Detailed Example:

Let's say the model outputs logits for the vocabulary:

Vocab size = 50,000 tokens
Logits (raw scores):
  "Hi"    → 8.2
  "Hello" → 7.9
  "Hey"   → 7.5
  "Greet" → 6.1
  ...
  "Zebra" → -3.2

Step-by-step with temperature=0.8, top_k=50:

  1. Top-k Filtering:

    • Extract top 50 logits (discard the remaining 49,950)
    • Top 50 might be: ["Hi": 8.2, "Hello": 7.9, "Hey": 7.5, ..., "Howdy": 4.3]
  2. Temperature Scaling:

    • Divide each by 0.8:
    "Hi"    → 8.2 / 0.8 = 10.25
    "Hello" → 7.9 / 0.8 = 9.875
    "Hey"   → 7.5 / 0.8 = 9.375
    
    • Higher temperature (>1) → flatter distribution (more random)
    • Lower temperature (<1) → sharper distribution (more deterministic)
  3. Softmax:

    • Convert to probabilities:
    "Hi"    → exp(10.25) / Z = 0.52  (52% chance)
    "Hello" → exp(9.875) / Z = 0.31  (31% chance)
    "Hey"   → exp(9.375) / Z = 0.12  (12% chance)
    ... (remaining 47 tokens share ~5%)
    
  4. Multinomial Sampling:

    Now, we pick the next token. We use torch.multinomial() which samples from a categorical distribution.

    How it works:

    Think of it like a weighted lottery:

    Probabilities:
    "Hi"    → 0.52  (52% of the probability mass)
    "Hello" → 0.31  (31% of the probability mass)
    "Hey"   → 0.12  (12% of the probability mass)
    ... (remaining 47 tokens share ~5%)
    

    Visualization as a number line:

    0.0              0.52      0.83  0.95          1.0
    |----------------|---------|-----|--------------|
           "Hi"        "Hello"  "Hey" (other tokens)
          (52%)         (31%)   (12%)     (~5%)
    

    The sampling process:

    1. Generate a random number r uniformly between 0.0 and 1.0
    2. Find which "bucket" it falls into:
    r = random.uniform(0, 1)  # e.g., r = 0.67
    
    cumulative = 0.0
    for token, prob in zip(tokens, probabilities):
        cumulative += prob
        if r < cumulative:
            return token
    
    # Example with r = 0.67:
    # cumulative = 0.0 → r (0.67) >= 0.0, continue
    # cumulative = 0.52 ("Hi") → r (0.67) >= 0.52, continue
    # cumulative = 0.83 ("Hello") → r (0.67) < 0.83, return "Hello"!

    Different random numbers → different tokens:

    r = 0.25  → Falls in [0.0, 0.52]   → "Hi"
    r = 0.67  → Falls in [0.52, 0.83]  → "Hello"
    r = 0.88  → Falls in [0.83, 0.95]  → "Hey"
    r = 0.99  → Falls in [0.95, 1.0]   → Some rare token
    

    Tokens with higher probabilities occupy more of the number line, so they're more likely to be sampled!

    In our example, let's say we sample r = 0.25, which gives us "Hi" (token ID varies by vocabulary)
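
A self-contained toy version of the same procedure, with a 5-word vocabulary and the illustrative logits from above (with only a handful of candidates the exact probabilities differ slightly from the 50-token example, but the mechanics are identical):

import torch
import torch.nn.functional as F

vocab = ["Hi", "Hello", "Hey", "Greet", "Zebra"]
logits = torch.tensor([[8.2, 7.9, 7.5, 6.1, -3.2]])    # (B=1, vocab_size=5)
rng = torch.Generator().manual_seed(0)
temperature, top_k = 0.8, 3

vals, idx = torch.topk(logits, top_k, dim=-1)           # keep the 3 best candidates
probs = F.softmax(vals / temperature, dim=-1)           # temperature-scaled probabilities
choice = torch.multinomial(probs, num_samples=1, generator=rng)
next_id = idx.gather(1, choice)                         # map back to vocabulary indices
print(vocab[next_id.item()], [round(p, 3) for p in probs[0].tolist()])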

Step 10: Streaming to Client

File: scripts/chat_web.py (lines 262-311)

As tokens are generated, they're immediately streamed to the client:

async def generate_stream(
    worker: Worker,
    tokens,
    temperature=None,
    max_new_tokens=None,
    top_k=None
) -> AsyncGenerator[str, None]:
    """Generate assistant response with streaming."""

    # Accumulate tokens for proper UTF-8 handling
    accumulated_tokens = []
    last_clean_text = ""

    with worker.autocast_ctx:
        for token_column, token_masks in worker.engine.generate(...):
            token = token_column[0]

            # Check for stopping tokens
            if token == assistant_end or token == bos:
                break

            # Accumulate token
            accumulated_tokens.append(token)

            # Decode all accumulated tokens
            current_text = worker.tokenizer.decode(accumulated_tokens)

            # Only emit if no incomplete UTF-8 sequences
            if not current_text.endswith('�'):
                new_text = current_text[len(last_clean_text):]
                if new_text:
                    yield f"data: {json.dumps({'token': new_text, 'gpu': worker.gpu_id}, ensure_ascii=False)}\n\n"
                    last_clean_text = current_text

    yield f"data: {json.dumps({'done': True})}\n\n"

UTF-8 Handling:

This is critical! Tokens don't always align with UTF-8 character boundaries.

Example with emoji "😀":

Token 1: First 2 bytes  → decode → "�" (incomplete UTF-8)
Token 2: Next 2 bytes   → decode → "😀" (complete!)

The code:

  1. Accumulates tokens
  2. Decodes ALL tokens so far
  3. Checks for '�' (Unicode replacement character)
  4. Only emits when we have a complete UTF-8 sequence

Server-Sent Events Format:

data: {"token": "Hi", "gpu": 0}
data: {"token": "!", "gpu": 0}
data: {"token": " How", "gpu": 0}
data: {"done": true}

Each line:

  • Starts with "data: "
  • Contains JSON payload
  • Ends with \n\n (double newline)
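
On the server side, an async generator like generate_stream is wrapped in a StreamingResponse with the text/event-stream media type. Roughly (a sketch; the actual wiring in chat_web.py, including worker release, appears in Step 12):

from fastapi.responses import StreamingResponse

# Inside the chat_completions handler, after acquiring a worker and tokenizing the conversation:
def make_sse_response(worker, conversation_tokens, request):
    return StreamingResponse(
        generate_stream(
            worker,
            conversation_tokens,
            temperature=request.temperature,
            max_new_tokens=request.max_tokens,
            top_k=request.top_k,
        ),
        media_type="text/event-stream",  # marks the body as a Server-Sent Events stream
    )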

Step 11: Frontend Receives Stream

File: nanochat/ui.html (lines 407-432)

The browser receives and displays tokens in real-time:

const reader = response.body.getReader();
const decoder = new TextDecoder();
let fullResponse = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  // Decode bytes to string
  const chunk = decoder.decode(value);
  const lines = chunk.split("\n");

  for (const line of lines) {
    if (line.startsWith("data: ")) {
      try {
        const data = JSON.parse(line.slice(6));
        if (data.token) {
          fullResponse += data.token;
          assistantContent.textContent = fullResponse;
          chatContainer.scrollTop = chatContainer.scrollHeight;
        }
      } catch (e) {
        // Ignore parse errors (incomplete JSON)
      }
    }
  }
}

Step 12: Worker Release and Logging

File: scripts/chat_web.py (lines 353-373)

async def stream_and_release():
    try:
        async for chunk in generate_stream(...):
            # Accumulate response for logging
            chunk_data = json.loads(chunk.replace("data: ", "").strip())
            if "token" in chunk_data:
                response_tokens.append(chunk_data["token"])
            yield chunk
    finally:
        # Log the assistant response
        full_response = "".join(response_tokens)
        logger.info(f"[ASSISTANT] (GPU {worker.gpu_id}): {full_response}")
        logger.info("="*20)

        # Release worker back to pool
        await worker_pool.release_worker(worker)

Worker lifecycle:

  1. Request arrives
  2. Worker acquired from pool (blocks if all busy)
  3. Generation happens
  4. Worker released back to pool
  5. Next request can use this worker