Here is how to set up and run the LiteLLM Proxy with Apple Foundation Models using uv.

1. Initialize the Project

Run these commands in your terminal to create a dedicated environment and install the necessary dependencies:

# Create a new project directory
mkdir my-apple-proxy && cd my-apple-proxy

# Initialize a uv project
uv init

# Add dependencies (including the proxy extra and the Apple provider)
uv add "litellm[proxy]" litellm-apple-foundation-models mlx

2. The Python Code (proxy.py)

Create a file named proxy.py in your project folder. This script registers the custom Apple handler and then launches the server.

import litellm
from litellm.proxy.proxy_server import run_proxy_server
from apple_foundation_models import AppleFoundationModel

def start_proxy():
    # 1. Register the custom provider
    # This tells LiteLLM how to handle the "apple-foundation-models" prefix
    apple_handler = AppleFoundationModel()
    
    litellm.custom_provider_map = [
        {
            "provider": "apple-foundation-models",
            "custom_handler": apple_handler
        }
    ]

    # 2. Define the models you want to expose
    # You can point to any MLX-compatible model on Hugging Face
    model_list = [
        {
            "model_name": "apple-llama-3",
            "litellm_params": {
                "model": "apple-foundation-models/mlx-community/Meta-Llama-3-8B-Instruct-4bit",
            },
        }
    ]

    # 3. Launch the Proxy
    run_proxy_server(
        host="0.0.0.0",
        port=4000,
        model_list=model_list,
        debug=False
    )

if __name__ == "__main__":
    start_proxy()

3. Running the Proxy

Execute the script with uv so that all dependencies are loaded from the project's virtual environment:

uv run proxy.py

Tip

First Run: The proxy downloads the model weights (several GB) from Hugging Face the first time you call it. Make sure you have a stable internet connection and enough free disk space.
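
If you prefer to fetch the weights up front rather than on the first request, here is a minimal sketch using huggingface_hub (add it with uv add huggingface_hub if it is not already present as a transitive dependency). The repo ID matches the model referenced in proxy.py:

from huggingface_hub import snapshot_download

# Download the 4-bit MLX build into the local Hugging Face cache so the
# proxy can load it from disk instead of downloading on the first call
snapshot_download(repo_id="mlx-community/Meta-Llama-3-8B-Instruct-4bit")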


4. How the Request Flows

Once the proxy is running, any OpenAI-compatible client can talk to it. The proxy intercepts the request, routes it through the MLX framework, and generates the response on your Mac's GPU via unified memory.

Test it with curl:

curl --request POST \
  --url http://localhost:4000/v1/chat/completions \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "apple-llama-3",
    "messages": [{"role": "user", "content": "Why is Apple Silicon good for LLMs?"}]
  }'
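
Or test it from Python with the official openai client pointed at the proxy (install it with uv add openai if needed; the api_key value is arbitrary here, assuming you have not configured virtual keys on the proxy):

from openai import OpenAI

# Point the OpenAI client at the local LiteLLM proxy
client = OpenAI(base_url="http://localhost:4000/v1", api_key="sk-anything")

response = client.chat.completions.create(
    model="apple-llama-3",
    messages=[{"role": "user", "content": "Why is Apple Silicon good for LLMs?"}],
)
print(response.choices[0].message.content)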

Key Considerations

  • Hardware Requirements: This requires a Mac with Apple Silicon (M1, M2, M3, or M4).
  • Memory: If your Mac has 8 GB or 16 GB of RAM, stick to 4-bit quantized models (like the one in the code above) to avoid system slowdowns.
  • Config Files: If you want to use a config.yaml instead of hardcoding model_list, you can pass config_file="your_config.yaml" to the run_proxy_server function.

A config.yaml also lets you run Apple models alongside other local providers such as Ollama or LM Studio.
