@valentinsavenko
Last active August 1, 2025 17:15
Run llama.cpp on AMD GPU, verifiably. RX6700S (gfx1032) - Ubuntu 24

On my setup it was surprisingly hard to verify that a model is fully loaded and running on the dedicated GPU. The GPU can show some activity, or the model can be only partially offloaded while the CPU still does most of the work, and on Ubuntu you'll need a handful of tools to verify ROCm performance.

This is mostly a reminder for myself, but might prove useful if you have the same setup:

$ lscpu | grep -i model
Model name:                           AMD Ryzen 9 6900HS with Radeon Graphics

$ nvtop
 Device 0 [AMD Radeon RX 6700S] PCIe GEN 4@ 8x             
 Device 1 [AMD Radeon Graphics] Integrated GPU 
 
$ uname -a
Linux vs-ubuntu 6.11.0-19-generic #19~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Feb 17 11:51:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

As llama.cpp suggests, I followed the official AMD guide; ignore the compatibility list, it's fine.

https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.4.2/install/quick-start.html

Verify the installation with: https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.4.2/install/post-install.html

If at the end of this you don't see your device in the output below, fix that before proceeding:

 rocm-smi 

========================================== ROCm System Management Interface ==========================================
==================================================== Concise Info ====================================================
Device  Node  IDs              Temp    Power  Partitions          SCLK    MCLK     Fan     Perf  PwrCap  VRAM%  GPU%  
              (DID,     GUID)  (Edge)  (Avg)  (Mem, Compute, ID)                                                      
======================================================================================================================
0       1     0x73ef,   3931   57.0°C  5.0W   N/A, N/A, 0         700Mhz  96Mhz    100.0%  auto  100.0W  75%    0%    
1       2     0x1681,   52857  55.0°C  18.0W  N/A, N/A, 0         N/A     2400Mhz  0%      auto  N/A     83%    0%    
======================================================================================================================
================================================ End of ROCm SMI Log =================================================

As llama.cpp is big and bulky, it makes sense to first verify that HIP works at all. ROCm ships with some C++ samples for HIP, and the README is helpful for building and running them:

ls /opt/rocm/share/hip/samples/
0_Intro  1_Utils  2_Cookbook  CMakeLists.txt  common  packaging  README.md

But I found them very minimalistic and too small to actually see VRAM and GPU Usage fill up.
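If you only want a quick sanity check before the big stress test, a tiny device-query program is already enough to confirm that HIP sees the dedicated GPU. This is my own minimal sketch (the file name hip_query.cpp is made up; it is not one of the ROCm samples):

// hip_query.cpp - list HIP devices and their VRAM (minimal sketch)
#include <hip/hip_runtime.h>
#include <stdio.h>

int main() {
    int count = 0;
    if (hipGetDeviceCount(&count) != hipSuccess || count == 0) {
        fprintf(stderr, "No HIP devices found\n");
        return 1;
    }
    for (int i = 0; i < count; i++) {
        hipDeviceProp_t props;
        if (hipGetDeviceProperties(&props, i) != hipSuccess) continue;
        // gcnArchName should report gfx1032 for the RX 6700S
        printf("Device %d: %s (%s), %.1f GiB VRAM\n",
               i, props.name, props.gcnArchName,
               props.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}

Compile and run it the same way as the stress test below: hipcc hip_query.cpp -o hip_query && ./hip_query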

Grok helped me write a large matrix multiplication that actually fills the 8 GB of VRAM: stress_hip.cpp (full source at the end of this gist). The matrix sizes are adjustable with:

    int M = 23000;
    int N = 23000;
    int K = 23000;
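Rough sizing, my own back-of-the-envelope and not from the original code: each 23000 × 23000 float matrix takes 23000² × 4 B ≈ 1.97 GiB, so A, B and C together need about 5.9 GiB, which comes close to filling the 8 GB card once the runtime's own allocations are added.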

Make sure to set HIP_DEVICE_ID according to the device number from rocm-smi. Compile and run with the HIP compiler:

hipcc stress_hip.cpp -o stress_hip --offload-arch=gfx1030
./stress_hip

You can monitor the VRAM and GPU% with:

watch rocm-smi

# second terminal
nvtop

https://github.com/Syllo/nvtop?tab=readme-ov-file#ubuntu-focal-2004-debian-buster-stable-and-more-recent
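On Ubuntu 24.04, nvtop is available straight from the standard repositories, so the build-from-source instructions in the linked README are optional (my note, not from the gist):

sudo apt install nvtop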

Now that we know HIP works and the correct device is being used, we can move on to llama.cpp.

https://github.com/ggml-org/llama.cpp/blob/baad94885df512bb24ab01e2b22d1998fce4d00e/docs/build.md#hip

Compile llama.cpp with HIP and run qwen3 on CPU/GPU:

git clone [email protected]:ggml-org/llama.cpp.git
cd llama.cpp

export ROCM_ARCH="gfx1030"  # I use 1030 although the 6700S is gfx1032

# DEBUG BUILD, if needed
#HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
#cmake -S . -B build -DGGML_HIP=ON -DLLAMA_ALL_WARNINGS=ON -DAMDGPU_TARGETS=$ROCM_ARCH -DCMAKE_BUILD_TYPE=Debug
#cmake --build build --config Debug -j$(nproc)

# the env vars must be on the same command as cmake (note the trailing backslash)
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=$ROCM_ARCH -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)

cd build/bin

export HIP_VISIBLE_DEVICES=0
export HSA_OVERRIDE_GFX_VERSION=10.3.0

# this downloads the model magically https://ollama.com/library/qwen3:4b
./llama-run qwen3:4b 
# exit
# ctrl+c

https://github.com/ggml-org/llama.cpp/blob/baad94885df512bb24ab01e2b22d1998fce4d00e/tools/run/README.md?plain=1#L19-L20

This minor detail cost me a lot of time... it's only ever mentioned there and in some Dockerfiles: if you don't set the number of layers to offload to the GPU with -n / --ngl, it defaults to running 100% on the CPU, even though there is a minor increase in VRAM usage that will confuse you!
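For example (my own invocation based on the run README linked above; flag names can change between versions, so check --help on your build):

# offload up to 99 layers to the GPU; without --ngl, llama-run stays on the CPU
./llama-run --ngl 99 qwen3:4b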

Verbose output will show you which device is visible and how many layers are loaded onto the GPU. If you offload too many layers of a 14B model with only 8 GB of VRAM, you'll get an OOM error.

export HIP_VISIBLE_DEVICES=0
export HSA_OVERRIDE_GFX_VERSION=10.3.0

./llama-bench -m qwen3\:4b -ngl 99 -v
# ---
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 6700S, gfx1030 (0x1030), VMM: no, Wave Size: 32

# depending on the model, you'll see how many layers are actually loaded into VRAM
llama_kv_cache_unified: layer   0: dev = ROCm0
llama_kv_cache_unified: layer   1: dev = ROCm0
llama_kv_cache_unified: layer   2: dev = ROCm0
llama_kv_cache_unified: layer   3: dev = ROCm0
llama_kv_cache_unified: layer   4: dev = ROCm0
llama_kv_cache_unified: layer   5: dev = ROCm0
llama_kv_cache_unified: layer   6: dev = ROCm0
llama_kv_cache_unified: layer   7: dev = ROCm0
llama_kv_cache_unified: layer   8: dev = ROCm0
llama_kv_cache_unified: layer   9: dev = ROCm0
llama_kv_cache_unified: layer  10: dev = ROCm0
...
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | ROCm       |  99 |           tg128 |         43.27 ± 0.12 |
llama_perf_context_print:        load time =    6925.96 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /   641 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   21717.15 ms /   642 tokens
llama_perf_context_print:    graphs reused =          0

# more readable without -v:
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3 4B Q4_K - Medium         |   2.44 GiB |     4.02 B | ROCm       |  99 |           pp512 |        601.23 ± 8.72 |
| qwen3 4B Q4_K - Medium         |   2.44 GiB |     4.02 B | ROCm       |  99 |           tg128 |         39.52 ± 0.35 |

If you monitor nvtop and rocm-smi, you'll see the GPU filling its VRAM and running mostly at full load during the benchmark.

You can also switch to the integrated GPU with HIP_VISIBLE_DEVICES=1, use both with HIP_VISIBLE_DEVICES="0,1", or set -ngl to 0; all of these perform worse, especially on the output-token metric tg128.
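A quick way to compare, using the same flags as the benchmark above (my own commands, not from the original run):

# integrated GPU only
HIP_VISIBLE_DEVICES=1 ./llama-bench -m qwen3\:4b -ngl 99
# both GPUs
HIP_VISIBLE_DEVICES="0,1" ./llama-bench -m qwen3\:4b -ngl 99
# CPU only (no layers offloaded)
./llama-bench -m qwen3\:4b -ngl 0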

// stress_hip.cpp
#include <hip/hip_runtime.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int HIP_DEVICE_ID = 0; // Set to the desired device ID

#define CHECK_HIP(cmd) { \
    hipError_t err = cmd; \
    if (err != hipSuccess) { \
        fprintf(stderr, "HIP Error: %s at %s:%d\n", hipGetErrorString(err), __FILE__, __LINE__); \
        exit(1); \
    } \
}

__global__ void matrixMulHeavy(const float *A, const float *B, float *C, int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float sum = 0.0f;
        for (int i = 0; i < K; i++) {
            sum += A[row * K + i] * B[i * N + col];
        }
        // Add heavy compute to maximize GPU utilization
        for (int j = 0; j < 500; j++) { // Increased iterations
            sum = sum * 1.00001f + sinf(sum) * 0.0001f; // Math-heavy operations
        }
        C[row * N + col] = sum;
    }
}

int main() {
    // Matrix dimensions: A (M x K), B (K x N), C (M x N)
    int M = 23000;
    int N = 23000;
    int K = 23000;
    size_t size_A = (size_t)M * K * sizeof(float);
    size_t size_B = (size_t)K * N * sizeof(float);
    size_t size_C = (size_t)M * N * sizeof(float);
    printf("Allocating ~%.2f GB VRAM\n", (size_A + size_B + size_C) / (1024.0 * 1024.0 * 1024.0));

    // Set device to Radeon 6700S
    CHECK_HIP(hipSetDevice(HIP_DEVICE_ID));
    hipDeviceProp_t props;
    CHECK_HIP(hipGetDeviceProperties(&props, HIP_DEVICE_ID));
    printf("Running on device: %s\n", props.name);

    // Allocate host memory
    float *h_A = (float*)malloc(size_A);
    float *h_B = (float*)malloc(size_B);
    float *h_C = (float*)malloc(size_C);
    if (!h_A || !h_B || !h_C) {
        fprintf(stderr, "Host memory allocation failed\n");
        exit(1);
    }

    // Initialize matrices
    srand(time(NULL));
    for (size_t i = 0; i < (size_t)M * K; i++) h_A[i] = rand() / (float)RAND_MAX;
    for (size_t i = 0; i < (size_t)K * N; i++) h_B[i] = rand() / (float)RAND_MAX;

    // Allocate device memory
    float *d_A, *d_B, *d_C;
    CHECK_HIP(hipMalloc(&d_A, size_A));
    CHECK_HIP(hipMalloc(&d_B, size_B));
    CHECK_HIP(hipMalloc(&d_C, size_C));

    // Copy inputs to device
    CHECK_HIP(hipMemcpy(d_A, h_A, size_A, hipMemcpyHostToDevice));
    CHECK_HIP(hipMemcpy(d_B, h_B, size_B, hipMemcpyHostToDevice));

    // Launch kernel
    dim3 threadsPerBlock(16, 16);
    dim3 blocksPerGrid((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
                       (M + threadsPerBlock.y - 1) / threadsPerBlock.y);
    printf("Launching kernel with grid (%d, %d), block (%d, %d)\n",
           blocksPerGrid.x, blocksPerGrid.y, threadsPerBlock.x, threadsPerBlock.y);

    // Run kernel multiple times for sustained high utilization
    for (int iter = 0; iter < 30; iter++) { // Increased iterations
        printf("Iteration %d...\n", iter + 1);
        hipLaunchKernelGGL(matrixMulHeavy, blocksPerGrid, threadsPerBlock, 0, 0,
                           d_A, d_B, d_C, M, N, K);
        CHECK_HIP(hipDeviceSynchronize());
    }

    // Copy result back (optional, for verification)
    CHECK_HIP(hipMemcpy(h_C, d_C, size_C, hipMemcpyDeviceToHost));

    // Basic verification against a CPU recomputation of random elements
    printf("Verifying result...\n");
    for (int i = 0; i < 30; i++) { // Reduced checks for speed
        int row = rand() % M;
        int col = rand() % N;
        float sum = 0.0f;
        for (int k = 0; k < K; k++) {
            sum += h_A[row * K + k] * h_B[k * N + col];
        }
        for (int j = 0; j < 500; j++) {
            sum = sum * 1.00001f + sinf(sum) * 0.0001f;
        }
        if (fabs(h_C[row * N + col] - sum) > 1e-2) {
            printf("Verification failed at (%d, %d)!\n", row, col);
            break;
        }
    }
    printf("Test completed successfully\n");

    // Clean up
    CHECK_HIP(hipFree(d_A));
    CHECK_HIP(hipFree(d_B));
    CHECK_HIP(hipFree(d_C));
    free(h_A);
    free(h_B);
    free(h_C);
    return 0;
}