llama.cpp - reading the source code

It's January 2026, so things will likely change at some point.

  • llama.cpp repo link
  • build instructions
    • contains instructions on how to build the debug version
    • the default build creates the backend libraries 'libggml-base.so' and 'libggml-cpu.so' - the other backends probably need additional requirements to be set up.
  • looking at main.cpp of llama-simple - a CLI program that continues a prompt given on the command line. Command line: ./llama-simple -m model.gguf [-n n_predict] [-ngl n_gpu_layers] [prompt] (a simplified parsing sketch follows this option list)
    • -m <file> - (mandatory) path to the model file in GGUF format
    • -ngl <n_gpu_layers> : (default 99) number of layers to offload to the GPU (-1 means all layers)
    • -n <num_output_tokens_to_predict> : (default 32)
    • [prompt] : optional; the default prompt is "Hello my name is"
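To make the option handling concrete, here is a minimal, hypothetical re-creation of such an argument loop. This is not the actual main.cpp of llama-simple, just a sketch that uses the defaults listed above:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <string>

int main(int argc, char ** argv) {
    // defaults as described above (hypothetical re-creation, not the real main.cpp)
    std::string model_path;                     // -m, mandatory
    std::string prompt = "Hello my name is";    // trailing positional argument
    int n_predict      = 32;                    // -n, number of tokens to generate
    int n_gpu_layers   = 99;                    // -ngl, layers to offload to the GPU

    for (int i = 1; i < argc; i++) {
        if (strcmp(argv[i], "-m") == 0 && i + 1 < argc) {
            model_path = argv[++i];
        } else if (strcmp(argv[i], "-n") == 0 && i + 1 < argc) {
            n_predict = atoi(argv[++i]);
        } else if (strcmp(argv[i], "-ngl") == 0 && i + 1 < argc) {
            n_gpu_layers = atoi(argv[++i]);
        } else {
            prompt = argv[i];                   // anything else is treated as the prompt
        }
    }
    if (model_path.empty()) {
        fprintf(stderr, "usage: %s -m model.gguf [-n n_predict] [-ngl n_gpu_layers] [prompt]\n", argv[0]);
        return 1;
    }
    printf("model=%s ngl=%d n=%d prompt='%s'\n",
           model_path.c_str(), n_gpu_layers, n_predict, prompt.c_str());
    return 0;
}
```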

loads backend shared libraries.

  • each backend has its own shared library (base, cpu, cuda, etc.) - libggml-base.so, libggml-cpu.so, libggml-cuda.so
  • backend api is defined in ggml-backend-impl.h
    • each backend supports a set of functions defined in struct ggml_backend_device_i (file ggml-backend-impl.h)
  • Each backend implementation is in ggml/src/<backend_name> (cpu, cuda, opencl, ...)
  • what is in a 'backend' ?
    • ggml_backend_reg_t - the backend 'registration' handle (this and the following structs are all defined in this file)
      • has a field for the api version
      • void *context
      • struct ggml_backend_reg_i iface; - the 'registration interface' - this one returns the nested 'devices'
    • struct ggml_backend_reg_i
      • get_name - function returns backend name
      • get_device_count - returns the number of devices (count - 1 is the max index that can be passed to get_device)
      • get_device - takes a device index, returns ggml_backend_dev_t - the 'device' (this contains the interface functions!)
      • void * (*get_proc_address)(ggml_backend_reg_t reg, const char * name); - optional; can return 'custom' functions not in the standard device interface (messy interface)
    • ggml_backend_dev_t
      • struct ggml_backend_device_i iface; - finally, the interface functions of the standard device interface are here
      • void * context; - device internal context (set by implementation of ggml_backend_init)
      • ggml_backend_reg_t reg;
    • ggml_backend_device_i - type with function pointers that make up the interface supported by a backend 'device'.
  • When loading a single backend instance: ggml_backend_load_best
    • finds all instances of the shared library in the configured search path,
    • loads each shared library & calls ggml_backend_score - this function returns a score number
    • loads the backend with the max positive score (a score of zero means: not supported on this machine)
    • to init: calls ggml_backend_init of the shared library - the return value is ggml_backend_reg_t
    • checks that the returned version field in the ggml_backend_reg_t is as expected
      • registers the 'backend' pointer, enumerates all contained 'devices' (that's the struct with the interface table) and registers them too.
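To tie the pieces above together, below is a compressed, hand-written sketch of the registration structs and a 'load best' style selection loop. The struct names are mocks (the real definitions live in ggml-backend-impl.h), and the dlopen/dlsym details are an assumption about how the exported ggml_backend_score / ggml_backend_init entry points described above could be used; it is illustrative, not the actual loader code.

```cpp
// Simplified mock of the backend registration structures (not the real headers).
#include <dlfcn.h>
#include <cstddef>
#include <string>
#include <vector>

struct mock_backend_dev;   // stands in for what ggml_backend_dev_t points to
struct mock_backend_reg;   // stands in for what ggml_backend_reg_t points to

// mirrors struct ggml_backend_reg_i: name, device count, device lookup, extra functions
struct mock_backend_reg_i {
    const char *       (*get_name)        (mock_backend_reg * reg);
    size_t             (*get_device_count)(mock_backend_reg * reg);
    mock_backend_dev * (*get_device)      (mock_backend_reg * reg, size_t index);
    void *             (*get_proc_address)(mock_backend_reg * reg, const char * name);
};

struct mock_backend_reg {
    int                api_version;  // checked against the loader's expected version
    mock_backend_reg_i iface;        // the 'registration interface'
    void *             context;      // backend-private data
};

// 'load best' style selection: dlopen each candidate shared library, ask it for a
// score via its exported score function, and keep the one with the highest score.
mock_backend_reg * load_best_backend(const std::vector<std::string> & candidates) {
    using score_fn = int (*)();
    using init_fn  = mock_backend_reg * (*)();

    void * best_handle = nullptr;
    int    best_score  = 0;          // score 0 means "not supported on this machine"

    for (const auto & path : candidates) {
        void * handle = dlopen(path.c_str(), RTLD_NOW | RTLD_LOCAL);
        if (!handle) continue;
        auto score = (score_fn) dlsym(handle, "ggml_backend_score");
        int  s     = score ? score() : 0;
        if (s > best_score) {
            if (best_handle) dlclose(best_handle);
            best_handle = handle;
            best_score  = s;
        } else {
            dlclose(handle);
        }
    }
    if (!best_handle) return nullptr;
    auto init = (init_fn) dlsym(best_handle, "ggml_backend_init");
    return init ? init() : nullptr;  // caller still has to check api_version
}
```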

load the model

Models are stored in the GGUF format (spec here).

  • The model is stored as a single file. However, llama.cpp allows 'splitting' the model into multiple files called 'shards' (the purpose is to bypass a file size limit on HuggingFace?)
  • GGUF models can be loaded via mmap - the loader works directly from memory-mapped files.
  • The breakup of a GGUF file (main sections; a metadata-dump sketch follows this list)
    • header/version
    • metadata - that's a key-value map.
      • the keys are strings, they have names like 'general.architecture', 'llama.context_length', 'tokenizer.ggml.tokens'
      • the value types can be either
        • numeric (all numeric integer types, floating point number types)
        • boolean
        • string
        • array
      • an important array entry in the metadata: the vocabulary
        • the vocabulary is a list of basic tokens. Each basic token is a string representing a sub-word unit.
        • the role of the vocabulary is to split the input text into token units, where each token unit is then converted into an embedding vector (that's what the LLM works with)
        • how are tokens converted into embedding vectors: each token in the vocabulary has a token ID (that's the index of the token in the vocabulary array). This token ID is used as a row index into the embedding tensor (the 'primary embedding' tensor).
    • a series of tensors (a tensor is an n-dimensional array, n >= 1)
      • each tensor has a name
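Here is a small sketch of walking these sections with the gguf_* helpers that ship with ggml. The exact header location and integer widths have changed between llama.cpp versions, so treat the signatures as approximate:

```cpp
// Dump the metadata keys and tensor names of a GGUF file.
// In recent versions the declarations live in gguf.h; older versions had them in ggml.h.
#include "gguf.h"
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }

    gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ nullptr };
    gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (!ctx) { fprintf(stderr, "failed to open %s\n", argv[1]); return 1; }

    // metadata: the key-value map (general.architecture, tokenizer.ggml.tokens, ...)
    for (int64_t i = 0; i < gguf_get_n_kv(ctx); i++) {
        printf("kv     %-40s type=%s\n",
               gguf_get_key(ctx, i), gguf_type_name(gguf_get_kv_type(ctx, i)));
    }
    // tensors: each one has a name (e.g. the token embedding tensor)
    for (int64_t i = 0; i < gguf_get_n_tensors(ctx); i++) {
        printf("tensor %s\n", gguf_get_tensor_name(ctx, i));
    }
    gguf_free(ctx);
    return 0;
}
```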

loading process:

  • llama_model_load_from_file loads a model file. Parameters:
    • full path of model
    • pointer to llama_model_params - this one contains instructions on what to do (including whether to use mmap), and also holds the 'devices' to be used by the model
  • llama_model_load_from_file_impl - does the loading
    • enumerates all devices of GPU and IGPU type backends, adds them to the llama_model_params::devices field
    • calls llama_model_load
      • delegates all the action to the llama_model_loader ctor
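A minimal sketch of driving this load path from user code, assuming the current llama.h entry points (older versions used llama_load_model_from_file / llama_free_model instead):

```cpp
#include "llama.h"
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }

    llama_backend_init();                         // global init for the llama library

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;                    // how many layers to offload, as in llama-simple
    mparams.use_mmap     = true;                  // load tensors via mmap where possible

    llama_model * model = llama_model_load_from_file(argv[1], mparams);
    if (!model) { fprintf(stderr, "failed to load %s\n", argv[1]); return 1; }

    printf("model loaded\n");

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```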

tokenization of input prompt


to-be-continued.