it's January 2026, so things will likely change at some point
- llama.cpp repo link
- build instructions
- contains instructions on how to build the debug version
- the default build creates the shared libraries 'libggml-base.so' and 'libggml-cpu.so' - the other backends probably need additional requirements set up first (typical build commands below)
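For reference, the usual CMake build (per the llama.cpp build docs) looks roughly like this; the Debug configuration is the one that keeps symbols for stepping through the loading code:

```sh
# standard release build (run from the llama.cpp checkout)
cmake -B build
cmake --build build --config Release

# debug build - useful for stepping through the code in a debugger
cmake -B build-debug -DCMAKE_BUILD_TYPE=Debug
cmake --build build-debug
```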
- looking at main.cpp of llama-simple - a CLI program that continues a given prompt, specified on the command line. Command line:
```
./llama-simple -m model.gguf [-n n_predict] [-ngl n_gpu_layers] [prompt]
```
- `-m <file>` - (mandatory) path to the model file in GGUF format
- `-ngl <n_gpu_layers>` - (default 99) number of layers to offload to the GPU (-1 means all layers)
- `-n <num_output_tokens_to_predict>` - (default 32)
- `[prompt]` - optional; the default prompt is "Hello my name is"
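A concrete invocation could look like this (the model path and prompt are made up; in a CMake build the binaries land under build/bin/):

```sh
./build/bin/llama-simple -m ./models/my-model-q4_0.gguf -ngl 99 -n 64 "The GGUF file format is"
```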
- each backend has its own shared libraries (base, cpu, cuda, etc.) - libggml-base.so, libggml-cpu.so, libggml-cuda.so
- the backend API is defined in ggml-backend-impl.h
- each backend supports a set of functions defined in struct ggml_backend_device_i (file ggml-backend-impl.h)
- Each backend implementation lives in ggml/src/<backend_name> (cpu, cuda, opencl, ...) - what is in a 'backend'?
- `ggml_backend_reg_t` - together with all the structs in this file (a rough sketch of their layout follows this list)
  - has a field for the API version
  - `void * context;`
  - `struct ggml_backend_reg_i iface;` - the 'registration interface' - this one returns the nested 'devices'
- `struct ggml_backend_reg_i`
  - `get_name` - returns the backend name
  - `get_device_count` - returns the number of devices (count - 1 is the max index to pass to `get_device`)
  - `get_device` - takes a device index, returns `ggml_backend_dev_t` - the 'device' (this contains the interface functions!)
  - `void * (*get_proc_address)(ggml_backend_reg_t reg, const char * name);` - optional; can return 'custom' functions not in the standard device interface (messy interface)
- `ggml_backend_dev_t`
  - `struct ggml_backend_device_i iface;` - finally, the interface functions of the standard device interface are here
  - `void * context;` - device-internal context (set by the implementation of `ggml_backend_init`)
  - `ggml_backend_reg_t reg;`
- `ggml_backend_device_i` - a type with function pointers that make up the interface supported by a backend 'device'.
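To make the relationships easier to see, here is a rough, compilable reconstruction of the three structs as described above. This is a sketch based on these notes, not a copy of ggml-backend-impl.h - field order, exact signatures and the (much larger) device interface differ in the real header:

```cpp
#include <cstddef>

// typedefs as they appear in ggml-backend.h
typedef struct ggml_backend_reg    * ggml_backend_reg_t;
typedef struct ggml_backend_device * ggml_backend_dev_t;

// stand-in only: the real ggml_backend_device_i has many more function pointers
// (name/description/memory queries, buffer allocation, op support checks, ...)
struct ggml_backend_device_i {
    const char * (*get_name)(ggml_backend_dev_t dev);
};

// the 'registration interface'
struct ggml_backend_reg_i {
    const char *       (*get_name)(ggml_backend_reg_t reg);
    // valid device indices are 0 .. get_device_count() - 1
    size_t             (*get_device_count)(ggml_backend_reg_t reg);
    ggml_backend_dev_t (*get_device)(ggml_backend_reg_t reg, size_t index);
    // optional: returns 'custom' functions not covered by the standard device interface
    void *             (*get_proc_address)(ggml_backend_reg_t reg, const char * name);
};

struct ggml_backend_reg {
    int api_version;                    // checked by the loader against the expected version
    struct ggml_backend_reg_i iface;    // the registration interface
    void * context;
};

struct ggml_backend_device {
    struct ggml_backend_device_i iface; // the standard device interface functions live here
    ggml_backend_reg_t reg;             // back-pointer to the owning registration
    void * context;                     // device-internal context
};
```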
- When loading a single backend instance:
  - `ggml_backend_load_best` - finds all instances of the shared library in the configured search path
  - loads each shared library & calls `ggml_backend_score` - this function returns a score number; the backend with the max positive score is loaded (a score of zero means: not supported on this machine) - a toy version of this score-and-pick loop is sketched after this list
  - to init: calls `ggml_backend_init` of the shared library - the return value is `ggml_backend_reg_t`
  - checks that the version field in the returned `ggml_backend_reg_t` is as expected
  - registers the 'backend' pointer, enumerates all contained 'devices' (that's the struct with the interface table) and registers them too
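Below is a toy version of the score-and-pick idea only - it is not llama.cpp's actual loader (`ggml_backend_load_best`), and the candidate paths and exact signatures of the exported entry points are assumptions here. Build with `-ldl`:

```cpp
#include <dlfcn.h>
#include <cstdio>
#include <string>
#include <vector>

int main() {
    // hypothetical candidate paths; the real loader scans its configured search path
    const std::vector<std::string> candidates = { "./libggml-cuda.so", "./libggml-cpu.so" };

    void * best_handle = nullptr;
    int    best_score  = 0;

    for (const auto & path : candidates) {
        void * handle = dlopen(path.c_str(), RTLD_NOW | RTLD_LOCAL);
        if (!handle) {
            continue; // library not present or missing dependencies
        }

        // dynamically loadable backends export a scoring entry point;
        // a score of zero means "not supported on this machine"
        using score_fn_t = int (*)();
        auto score_fn = (score_fn_t) dlsym(handle, "ggml_backend_score");
        const int score = score_fn ? score_fn() : 0;

        if (score > best_score) {
            if (best_handle) {
                dlclose(best_handle);
            }
            best_handle = handle;
            best_score  = score;
        } else {
            dlclose(handle);
        }
    }

    if (best_handle) {
        printf("selected backend with score %d\n", best_score);
        // the real loader would now call the library's ggml_backend_init entry point,
        // check the api_version of the returned ggml_backend_reg_t, and register
        // the backend and all of its devices
    }
    return 0;
}
```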
Models are stored in GGUF format - spec here.
- The model is stored as a single file. However, llama.cpp allows 'splitting' the model into multiple files called 'shards' (the purpose is to bypass a file size limit on Hugging Face?)
- GGUF models can be loaded via mmap - the loader works directly from memory-mapped files.
- The breakdown of a GGUF file (main sections) - a small metadata/tensor dump sketch follows this list:
- header/version
- metadata - that's a key value map.
- the keys are strings; they have names like 'general.architecture', 'llama.context_length', 'tokenizer.ggml.tokens'
- the value types can be either
- numeric (all numeric integer types, floating point number types)
- boolean
- string
- array
- an important array-typed entry in the metadata: the vocabulary
- the vocabulary is a list of basic tokens. Each basic token is a string - a sub-word unit
- the role of the vocabulary is to split the text up into token units, where each token unit is then converted into an embedding vector (that's what the LLM works with)
- how are tokens converted into embedding vectors? Each token in the vocabulary list has a token ID (that's the index of the token in the vocabulary array). The token ID is used as an index into the tensor that stores the token embeddings (the 'primary embedding' tensor) - a toy illustration of this lookup follows this list
- a series of tensors (a tensor is an n-dimensional array, n >= 1)
- each tensor has a name
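As a quick way to poke at these sections, something like the following should work when compiled and linked against ggml. Note the hedges: the gguf_* API lives in ggml's gguf.h in recent trees (older trees declared it in ggml.h), and the integer types of the counts have changed over time:

```cpp
#include "gguf.h"
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
        return 1;
    }

    // no_alloc = true: read the header, metadata and tensor descriptions,
    // but do not allocate or copy the tensor data itself
    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ nullptr };
    struct gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (!ctx) {
        fprintf(stderr, "failed to load %s\n", argv[1]);
        return 1;
    }

    // metadata: the key/value map
    const int n_kv = (int) gguf_get_n_kv(ctx);
    for (int i = 0; i < n_kv; i++) {
        printf("kv[%d]: %s\n", i, gguf_get_key(ctx, i));
    }

    // tensors: every tensor has a name
    const int n_tensors = (int) gguf_get_n_tensors(ctx);
    for (int i = 0; i < n_tensors; i++) {
        printf("tensor[%d]: %s\n", i, gguf_get_tensor_name(ctx, i));
    }

    gguf_free(ctx);
    return 0;
}
```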
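And a purely made-up, self-contained illustration of the token-ID -> embedding-row lookup described above (no llama.cpp API involved; in a real model this matrix is one of the tensors stored in the file, and is far larger):

```cpp
#include <cstdio>
#include <string>
#include <vector>

int main() {
    // toy vocabulary: the index in the array is the token ID
    const std::vector<std::string> vocab = { "Hello", " my", " name", " is" };

    // toy embedding matrix: n_vocab rows, n_embd columns
    const int n_embd = 4;
    const std::vector<float> embd = {
        0.1f, 0.2f, 0.3f, 0.4f,   // row 0 -> "Hello"
        0.5f, 0.6f, 0.7f, 0.8f,   // row 1 -> " my"
        0.9f, 1.0f, 1.1f, 1.2f,   // row 2 -> " name"
        1.3f, 1.4f, 1.5f, 1.6f,   // row 3 -> " is"
    };

    // a "tokenized" prompt: a sequence of token IDs
    const std::vector<int> tokens = { 0, 1, 2, 3 };

    for (const int id : tokens) {
        // the token ID selects one row of the embedding matrix
        const float * row = embd.data() + id * n_embd;
        printf("token %d (%s) -> [%.1f %.1f %.1f %.1f]\n",
               id, vocab[id].c_str(), row[0], row[1], row[2], row[3]);
    }
    return 0;
}
```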
loading process:
- `llama_model_load_from_file` loads a model file. Parameters:
  - full path of the model
  - `llama_model_params` - this one contains instructions on what to do (including whether to use mmap); it also holds the 'devices' to be used by the model
- `llama_model_load_from_file_impl` - does the loading
  - enumerates all devices of GPU and iGPU type backends, adds them to the `llama_model_params::devices` field
  - calls `llama_model_load`
    - delegates all the action to the `llama_model_loader` ctor (a minimal load-only example using the public API follows)
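A minimal load-only sketch against the public llama.h API, assuming a recent tree - these entry points have been renamed over time (e.g. llama_load_model_from_file became llama_model_load_from_file), so check the llama.h you build against:

```cpp
#include "llama.h"
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
        return 1;
    }

    llama_backend_init(); // initializes ggml / registers the available backends

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;   // number of layers to offload (the -ngl equivalent)
    mparams.use_mmap     = true; // load tensors from the memory-mapped file

    llama_model * model = llama_model_load_from_file(argv[1], mparams);
    if (model == nullptr) {
        fprintf(stderr, "failed to load %s\n", argv[1]);
        return 1;
    }
    printf("model loaded\n");

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```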
to-be-continued.