Even though the RX 480 is not a very recent GPU, its 8 GB of VRAM is quite decent and makes it possible to run LLMs much faster than on the CPU.
To get it running, three steps are necessary: installing Vulkan, building shaderc, and finally building llama.cpp with the Vulkan backend (assuming git, a C compiler, cmake, ninja, etc. are installed on a sufficiently recent Ubuntu).
Run sudo apt install libvulkan-dev vulkan-tools glslang-tools.
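To check that Vulkan actually sees the card, vulkan-tools ships vulkaninfo; a quick sanity check (the --summary flag exists in recent versions; plain vulkaninfo works everywhere but is far more verbose) is:
vulkaninfo --summary
The RX 480 should appear as a physical device (typically via the Mesa RADV driver); if it does not, fix the GPU driver before continuing.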
To install shaderc, execute the following commands (and in case of problems, consult its documentation):
git clone https://github.com/google/shaderc
cd shaderc
git checkout known-good
./update_shaderc_sources.py
cd src/
./utils/git-sync-deps
mkdir build
export BUILD_DIR=$(pwd)/build
export SOURCE_DIR=$(pwd)
cd $BUILD_DIR
cmake -GNinja -DCMAKE_BUILD_TYPE=Release $SOURCE_DIR
ninja
To have the created glslc binary available, execute export PATH=$MYPATH/shaderc/src/build/glslc:$PATH (where $MYPATH is the directory in which shaderc was cloned).
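A quick way to confirm that the build succeeded and the binary is found is a version check; glslc prints its own version together with those of its bundled dependencies (shaderc, glslang, SPIRV-Tools):
glslc --version
llama.cpp's Vulkan backend compiles its compute shaders with glslc at build time, so the binary must be reachable on the PATH for the next step.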
git clone [email protected]:ggml-org/llama.cpp.git
cd llama.cpp
mkdir build
cd build
cmake .. -DGGML_VULKAN=ON
cmake --build . --config Release
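Before loading a model, it is worth checking that the Vulkan backend registered the GPU; recent llama.cpp builds offer a --list-devices flag for this (older builds print equivalent device lines when the server starts):
bin/llama-server --list-devices
The RX 480 should show up as a Vulkan device together with the amount of usable VRAM.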
Afterwards, it is possible to run models (an example request against the running server follows the list):
- A small quantized version of Mistral-Small-24B works (bin/llama-server --gpu-layers 35 -hf unsloth/Mistral-Small-24B-Instruct-2501-GGUF:Q2_K_L); on my machine it has a high TTFT (~10 seconds), but relatively good TPS (5-10).
- DeepSeek R1 works (bin/llama-server --gpu-layers 60 -hf unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF:Q4_K_M); its chain-of-thought feature also effectively makes the TTFT high, but TPS is between 5 and 10 (not great to work with, but acceptable).
- Gemma 12B works quite nicely (bin/llama-server --gpu-layers 49 -hf unsloth/gemma-3-12b-it-GGUF:Q2_K_XL), with 15-20 TPS and TTFT below 1 second, so this seems to be the best fit for the RX 480 (results might differ with a different CPU etc., but for me this works quite well).
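Once one of these servers is running, it exposes an OpenAI-compatible HTTP API, by default on localhost:8080. A minimal smoke test with curl (assuming the default host and port) looks like this:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in one sentence."}]}'
The generated text is in the choices[0].message.content field of the JSON response; any OpenAI-compatible client library can be pointed at the same endpoint.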