Tutorial: Local AI code assistant on Apple Silicon (June 2024 version)

General Local AI Usage

  1. Download LM Studio from https://lmstudio.ai

  2. In LMStudio, search for:

    • bartowski/Codestral-22B-v0.1-GGUF
  3. Download the variant that ends with Q8_0 (listed on the right):

    • Note: The Q8_0 model requires at least 23.30GB of RAM when loaded. If you want it to use less RAM, choose Q6_K instead, at a slight cost in quality.
  4. In the AI Chat window, load the Codestral model.

  5. Ensure the following settings on the right:

    • Preset: Mistral Instruct
    • Clear the System Prompt
    • Context Length: 8192
    • Temperature: 0.8
    • Tokens to generate: -1
    • CPU Threads: 10
    • GPU Offload: Max
    • GPU Backend: Metal llama.cpp
    • Flash Attention: on (further down the settings panel)

Example uses for AI chat:

  • Chat about code and programming concepts
  • Ask it to generate terminal commands, Docker commands, Kubernetes commands, etc.
  • Ask it to modify formatting, e.g. restructure JSON or add markdown
  • Ask it to explain unfamiliar code
  • Ask it to rework code in a certain way, or to write comments, a PR description, or unit tests
  • Paste in a PR diff to review (this can be private code, since everything runs fully locally)
  • Paste in a GitHub issue discussion (or any long text) and ask it to summarize
  • Modify the system prompt (tab on the right) to force a certain kind of output (e.g. "You summarize discussions")

VSCode Local AI Plugin (Continue.dev)

  1. Set up LMStudio first, as described in the section above.

  2. Enable the API server in LMStudio:

    • Select the arrow tab (Local Server) on the left.
    • Click "Start Server". (A quick way to verify the server responds is shown just before the "Restart VSCode" step below.)
  3. Install the Continue.dev plugin for VSCode (also available for JetBrains):

    • Go to View -> Extensions.
    • Search for "Continue".
    • Install the plugin.
  4. Continue.dev configuration:

    • Before doing anything, go to Continue plugin settings and disable Telemetry on the main page.
    • Open the config file:
      vim ~/.continue/config.json
    • Replace its contents with:
      {
        "models": [
          {
            "title": "LM Studio",
            "provider": "lmstudio",
            "model": "codestral",
            "apiBase": "http://192.168.0.137:1234/v1/"
          },
          {
            "title": "Llama 3",
            "provider": "ollama",
            "model": "llama3"
          },
          {
            "title": "Ollama",
            "provider": "ollama",
            "model": "AUTODETECT"
          }
        ],
        "customCommands": [
          {
            "name": "test",
            "prompt": "{{{ input }}}\n\nWrite a comprehensive set of unit tests for the selected code. It should set up, run tests that check for correctness including important edge cases, and tear down. Ensure that the tests are complete and sophisticated. Give the tests just as chat output, don't edit any file.",
            "description": "Write unit tests for highlighted code"
          }
        ],
        "tabAutocompleteModel": {
          "title": "codestral",
          "provider": "lmstudio",
          "model": "codestral",
          "apiBase": "http://192.168.0.137:1234/v1/"
        },
        "allowAnonymousTelemetry": false,
        "embeddingsProvider": {
          "provider": "transformers.js"
        }
      }
    • Note: Replace BOTH apiBase URLs with the IP of the macOS host where LMStudio is running (check with ifconfig -a | grep inet), or use http://localhost:1234/v1/ if VSCode runs on the same machine.
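  • Optional: before restarting VSCode, you can confirm the server is reachable. A minimal check, assuming LMStudio's default port of 1234 and the OpenAI-compatible API it exposes (replace the IP with your own host's, or use localhost):

      # Lists the model(s) the server exposes; a JSON response means the server is up
      curl http://192.168.0.137:1234/v1/models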
  • Restart VSCode.

  • Test connection:

    • Open the Continue tab in VSCode (on the left) and type 'hello'. You can use this window as a normal chatbot AI running fully locally. (The same server can also be called from the terminal; see the curl sketch at the end of this section.)
    • You can select any code and add it to chat (Ctrl+L) or Edit (Ctrl+I).
    • Press Ctrl+I in a code window, and a window will pop up to write a prompt for generating any code. You can then accept or reject it.
    • Verify that tab auto-complete works (AI suggests code continuation in grey, press TAB to accept).
    • If auto-complete is stuck, you can always Ctrl+I and directly prompt it to continue the code.
  • Learn how to provide additional context:

    • See the Context Providers section of the Continue.dev docs. You can also click "Add Context" in the chat window, with options to bring files, folders, or terminal output into the prompt context.
  • Learn how to use Continue commands (such as the test custom command defined in the config above) directly in the code window; see the Continue.dev docs.
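
Everything the Continue plugin sends goes through the same OpenAI-compatible endpoint, so the local server can also be used from the terminal or from scripts. A minimal sketch, assuming the default port and the model name from the config above (at the time of writing, LMStudio typically serves whichever model is currently loaded, so the exact model string is not critical):

      curl http://192.168.0.137:1234/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
          "model": "codestral",
          "messages": [
            {"role": "user", "content": "Write a one-line shell command that counts files in the current directory."}
          ],
          "temperature": 0.8
        }'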

Advanced notes

You can also try other models, such as:

  • TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF - use the Q4_K_M version
  • TheBloke/deepseek-coder-33B-instruct-GGUF - use the Q4_K_M version and change the Prompt Format (right tab) to Deepseek Coder

You may need to increase the context length (both for the AI Chat window and the Local Server) from 8192 to a higher value when pasting long discussions or a lot of code. Codestral and Mixtral support up to 32768 tokens.

A larger context increases RAM use, so above a certain amount of text you can get out-of-memory errors. Use a smaller Q size (quantization) if you want to use a lot of context.
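If you are unsure whether a given quantization and context length will fit, you can check memory headroom before loading the model. A quick check with a standard macOS tool (vm_stat reports page counts; the page size is printed on its first line):

      # Free/active/inactive page counts; multiply by the page size for bytes
      vm_stat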

Use a system prompt (right tab) if the model is not behaving the way you want. Keep the system prompt simple.
