VooDisss / README.md
Created March 9, 2026 20:56
A llama-server models.ini guide for Qwen3 reranker, embedding, and chat models. Includes a fix for the Qwen3-Reranker GGUF producing near-zero scores (4.5e-23) with llama.cpp. Covers the /v1/rerank endpoint, pooling=rank, cls.output.weight, proper conversion with convert_hf_to_gguf.py, and a models.ini preset reference with all valid keys.

Running Qwen3 Models with llama-server (Embedding + Reranking + Chat)

A practical guide to running multiple Qwen3 models through a single llama-server instance using model routing. Covers embedding, reranking, and chat/vision models.

Tested on Windows with an RTX 3090 (24GB VRAM), using a llama-server build from the llama.cpp master branch. Last updated: 2025-03-09.
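With model routing, clients pick a model by name in the request body rather than by port. As a minimal sketch, here is how a reranking request to the server's /v1/rerank endpoint might look from Python; the base URL and the preset name qwen3-reranker are illustrative placeholders, not values the server defines:

```python
import json
import urllib.request

def build_rerank_payload(query, documents, model="qwen3-reranker"):
    # The "model" field selects which loaded preset llama-server routes
    # the request to; "qwen3-reranker" is an assumed preset name here.
    return {"model": model, "query": query, "documents": documents}

def rerank(query, documents, base_url="http://localhost:8080"):
    """POST to a running llama-server's /v1/rerank endpoint (sketch)."""
    req = urllib.request.Request(
        base_url + "/v1/rerank",
        data=json.dumps(build_rerank_payload(query, documents)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

The same pattern applies to the embedding and chat endpoints: only the path and the `model` value change, so one server process can serve all three model types.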


What you get