When using the LLM.swift package with models like Gemma-2-2B, we encounter KV (Key-Value) cache errors that cause the model to fail during inference.

Where the Issue Occurs

The issue manifests when calling model.respond(to: prompt) from the LLM package:

// File: LocalAIService.swift
// Location: sendSingle() method

@MainActor
private func sendSingle(prompt: String, model: LLM) async {
    isProcessing = true
    currentChunk = 1
    totalChunks = 1
    self.response = ""

    do {
        // ❌ ISSUE: This call can throw KV cache errors
        await model.respond(to: prompt)
        self.response = model.output

        // Sometimes the error appears in the output instead of throwing
        // Check for KV cache errors in the response
        if await checkForKVCacheError(response: model.output) {
            print("🚨 KV cache error detected, attempting automatic recovery...")
            await handleKVCacheError(originalPrompt: prompt)
            return
        }

    } catch {
        print("❌ Error during AI processing: \(error)")

        // Check if it's a KV cache related error
        if await isKVCacheError(error) {
            print("🚨 KV cache error detected in exception, attempting automatic recovery...")
            await handleKVCacheError(originalPrompt: prompt)
            return
        }

        self.response = "Error: \(error.localizedDescription)"
    }

    isProcessing = false
}

Error Patterns Detected

The KV cache errors appear in two forms:

1. As Exception Messages:

  • "failed to find kv cache slot"
  • "kv cache"
  • "llama_decode: failed to decode"
  • "ubatch"
  • "cache slot"
  • "decode failed"

2. As Response Content:

  • "failed to find kv cache slot"
  • "kv cache slot"
  • "llama_decode: failed to decode"
  • "decode: failed to find"
  • "ubatch of size"
  • "ret = 1"
  • "..." (truncated responses)

Detection Code

// MARK: - KV Cache Error Detection and Handling

/// Check if an error is related to KV cache issues
private func isKVCacheError(_ error: Error) async -> Bool {
    let errorDescription = error.localizedDescription.lowercased()
    let kvCacheErrorPatterns = [
        "failed to find kv cache slot",
        "kv cache",
        "llama_decode: failed to decode",
        "ubatch",
        "cache slot",
        "decode failed"
    ]

    return kvCacheErrorPatterns.contains { pattern in
        errorDescription.contains(pattern)
    }
}

/// Check if the response contains KV cache error messages
private func checkForKVCacheError(response: String) async -> Bool {
    let responseLower = response.lowercased()
    let kvCacheErrorPatterns = [
        "failed to find kv cache slot",
        "kv cache slot",
        "llama_decode: failed to decode",
        "decode: failed to find",
        "ubatch of size",
        "ret = 1",
        "..."
    ]

    return kvCacheErrorPatterns.contains { pattern in
        responseLower.contains(pattern)
    }
}
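
Note that the "..." pattern makes checkForKVCacheError deliberately broad: any response containing an ellipsis is treated as suspect, including legitimately truncated but otherwise healthy output. A standalone sanity check of the same substring matching (the sample strings below are illustrative, not captured from a real run):

// Standalone check of the substring matching used by the helpers above.
// The sample strings are illustrative, not captured from a real run.
let patterns = [
    "failed to find kv cache slot",
    "kv cache slot",
    "ubatch of size",
    "ret = 1"
]
let samples = [
    "decode: failed to find kv cache slot for ubatch of size 512", // expected: flagged
    "Paris is the capital of France."                              // expected: clean
]
for sample in samples {
    let lowered = sample.lowercased()
    let flagged = patterns.contains { lowered.contains($0) }
    print("\(flagged ? "flagged as KV cache error" : "clean"): \(sample)")
}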

Fix Implementation

Our workaround involves:

  1. Detecting KV cache errors (both in exceptions and response content)
  2. Clearing the model instance
  3. Reinitializing the model
  4. Retrying the original prompt

/// Handle KV cache errors with automatic recovery
@MainActor
private func handleKVCacheError(originalPrompt: String) async {
    print("πŸ”§ Starting automatic KV cache error recovery...")

    // Update UI to show recovery in progress
    self.response = "πŸ”§ KV cache error detected. Automatically clearing cache and retrying..."

    // Perform full cache reset
    await clearCacheAndReinitialize()

    // Wait a moment for the model to fully initialize
    try? await Task.sleep(nanoseconds: 1_000_000_000) // 1 second

    // Retry the original prompt with the fresh model
    if let model {
        print("🔄 Retrying original prompt after cache reset...")
        self.response = "🔄 Retrying after cache reset..."

        await model.respond(to: originalPrompt)
        self.response = model.output

        print("✅ Recovery completed successfully")
    } else {
        self.response = "❌ Failed to recover from KV cache error. Please try again or restart the application."
        print("❌ Recovery failed - model could not be reinitialized")
    }

    isProcessing = false
}

/// Clear the model cache and reinitialize with fresh memory settings
@MainActor
func clearCacheAndReinitialize() async {
    print("🧹 Clearing AI cache and reinitializing...")

    // Stop any current processing
    stop()

    // Clear the current model to free memory
    model = nil
    response = ""
    currentChunk = 0
    totalChunks = 0

    // Swift uses ARC rather than garbage collection; draining an autorelease
    // pool can only encourage earlier release of autoreleased temporaries
    autoreleasepool {
        // Intentionally empty; setting `model = nil` above is what frees the bulk of the memory
    }

    // Refresh system specs to get current memory state
    refreshSystemSpecs()

    // Reinitialize the model with fresh settings
    await initializeModel()

    print("βœ… Cache cleared and model reinitialized")
}
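
Besides the automatic recovery path, clearCacheAndReinitialize() can also serve as a manual escape hatch. Below is an illustrative sketch of how it might be surfaced in the UI; it assumes LocalAIService is an ObservableObject whose isProcessing flag is published, which is not shown in this gist:

// Illustrative SwiftUI wiring for a manual "clear cache" action alongside the
// automatic recovery. LocalAIService conforming to ObservableObject with a
// published isProcessing flag is an assumption, not shown in the gist.
import SwiftUI

struct ResetCacheButton: View {
    @ObservedObject var service: LocalAIService

    var body: some View {
        Button("Clear AI cache and reload model") {
            Task { await service.clearCacheAndReinitialize() }
        }
        .disabled(service.isProcessing) // avoid resetting mid-generation
    }
}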

Complete Integration Example

@MainActor
private func sendSingle(prompt: String, model: LLM) async {
    isProcessing = true
    currentChunk = 1
    totalChunks = 1
    self.response = ""

    do {
        await model.respond(to: prompt)
        self.response = model.output

        // ✅ FIX: Check for KV cache errors in the response
        if await checkForKVCacheError(response: model.output) {
            print("🚨 KV cache error detected, attempting automatic recovery...")
            await handleKVCacheError(originalPrompt: prompt)
            return
        }

    } catch {
        print("❌ Error during AI processing: \(error)")

        // ✅ FIX: Check if it's a KV cache related error
        if await isKVCacheError(error) {
            print("🚨 KV cache error detected in exception, attempting automatic recovery...")
            await handleKVCacheError(originalPrompt: prompt)
            return
        }

        self.response = "Error: \(error.localizedDescription)"
    }

    isProcessing = false
}

Suggested Package-Level Fix

We recommend that the LLM.swift package do the following (a usage sketch of the proposed API appears after this list):

  1. Internally handle KV cache exhaustion by:

    • Automatically clearing and reallocating the KV cache when it becomes full
    • Providing a method to manually clear the KV cache without reinitializing the entire model
    • Throwing more specific error types that can be caught and handled
  2. Add KV cache management methods:

    // Proposed API additions
    extension LLM {
        func clearKVCache() async throws
        var kvCacheStatus: KVCacheStatus { get }
        func resetContext() async throws
    }
  3. Improve error reporting:

    • Throw KVCacheError instead of generic errors
    • Include cache capacity information in error messages
    • Provide guidance on how to resolve the issue
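
With changes along these lines, the recovery path shown in handleKVCacheError could shrink to a targeted retry. The following is a hypothetical usage sketch of the proposed API; KVCacheError, clearKVCache(), and a throwing respond(to:) do not exist in LLM.swift today:

// Hypothetical usage of the proposed API; none of these symbols exist in
// LLM.swift today. This only illustrates the intended ergonomics.
func respondWithCacheRecovery(to prompt: String, using model: LLM) async throws -> String {
    do {
        try await model.respond(to: prompt)
    } catch is KVCacheError {
        // Clear just the KV cache instead of discarding and reloading the model
        try await model.clearKVCache()
        try await model.respond(to: prompt) // retry with the same instance
    }
    return model.output
}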

Environment Details

  • Package: LLM.swift
  • Model: bartowski/gemma-2-2b-it-GGUF
  • Template: Gemma
  • Max Token Count: 16384
  • Platform: macOS
  • Swift Version: 5.9+

Reproduction Steps

  1. Initialize a model with LLM(from: huggingFaceModel, maxTokenCount: 16384)
  2. Send multiple prompts in sequence using model.respond(to: prompt) (a minimal sketch of this loop follows the steps below)
  3. After several prompts (typically 3-5), the KV cache error occurs
  4. The error either:
    • Throws an exception with KV cache related messages, OR
    • Returns a response containing KV cache error strings
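
A minimal reproduction sketch of the steps above, assuming model has already been initialized as in step 1 (the prompt strings and loop bound are illustrative):

// Minimal reproduction sketch. `model` is assumed to be initialized as in
// step 1, i.e. LLM(from: huggingFaceModel, maxTokenCount: 16384); the prompts
// and loop bound are illustrative.
for i in 1...5 {
    await model.respond(to: "Summarize chunk \(i) of the document in two sentences.")
    print("Response \(i): \(model.output)")
    // Typically by the 3rd-5th iteration the output contains strings such as
    // "failed to find kv cache slot" instead of a normal completion
}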