@FlorSanders
Created April 11, 2024 15:17

Setup Guide for llama.cpp on Nvidia Jetson Nano 2GB

This is a full account of the steps I ran to get llama.cpp running on the Nvidia Jetson Nano 2GB. It consolidates fixes and tutorials from multiple sources, which are credited at the bottom of this README.

Procedure

At a high level, the procedure to install llama.cpp on a Jetson Nano consists of 4 steps.

  1. Compile the gcc 8.5 compiler from source.

  2. Compile llama.cpp from source using the gcc 8.5 compiler.

  3. Download a model.

  4. Perform inference.

As steps 1 and 2 take a long time, I have uploaded the resulting binaries for download in the repository. Simply download and unzip them, then follow steps 3 and 4 to perform inference.

GCC Compilation

  1. Compile the GCC 8.5 compiler from source on the Jetson Nano.
    NOTE: The make -j6 command takes a long time. I recommend running it overnight in a tmux session (a minimal sketch follows the commands below). Additionally, the build requires quite a bit of disk space, so make sure to leave at least 8GB of free space on the device before starting.
# Download and extract the GCC 8.5.0 sources
wget https://bigsearcher.com/mirrors/gcc/releases/gcc-8.5.0/gcc-8.5.0.tar.gz
sudo tar -zvxf gcc-8.5.0.tar.gz --directory=/usr/local/
cd /usr/local/gcc-8.5.0/
./contrib/download_prerequisites
# Configure and build out of tree
mkdir build
cd build
sudo ../configure -enable-checking=release -enable-languages=c,c++
make -j6
sudo make install
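Because the build can run for many hours, here is a minimal tmux sketch for keeping it alive if your SSH session drops (the session name gcc-build is just an example):

# start a named session, then run the build commands above inside it
tmux new -s gcc-build
# detach with Ctrl-b followed by d; reattach later with
tmux attach -t gcc-build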
  2. Once make install has completed successfully, you can free up disk space by removing the build directory.
cd /usr/local/gcc-8.5.0/
sudo rm -rf build
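To confirm how much space was reclaimed, a quick check (df ships with the stock L4T image):

df -h /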
  3. Point the CC and CXX environment variables at the newly installed GCC and G++.
export CC=/usr/local/bin/gcc
export CXX=/usr/local/bin/g++
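Note that these exports only last for the current shell session. If you want them to persist across logins (not part of the original guide, just a convenience), you can append them to ~/.bashrc:

echo 'export CC=/usr/local/bin/gcc' >> ~/.bashrc
echo 'export CXX=/usr/local/bin/g++' >> ~/.bashrc
source ~/.bashrc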
  4. Double-check that the install was indeed successful (both commands should report 8.5.0).
gcc --version
g++ --version
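For reference, the first line of each command's output should look roughly like this (the copyright and configuration lines that follow are omitted):

$ gcc --version
gcc (GCC) 8.5.0
$ g++ --version
g++ (GCC) 8.5.0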

llama.cpp Compilation

  1. Start by cloning the repository and rolling back to a known working commit.
git clone git@github.com:ggerganov/llama.cpp.git
cd llama.cpp
git checkout a33e6a0
  2. Edit the Makefile and apply the following changes
    (save the diff to file.patch and apply it with git apply file.patch; git apply --stat file.patch only previews the changes)
diff --git a/Makefile b/Makefile
index 068f6ed0..a4ed3c95 100644
--- a/Makefile
+++ b/Makefile
@@ -106,11 +106,11 @@ MK_NVCCFLAGS = -std=c++11
 ifdef LLAMA_FAST
 MK_CFLAGS     += -Ofast
 HOST_CXXFLAGS += -Ofast
-MK_NVCCFLAGS  += -O3
+MK_NVCCFLAGS += -maxrregcount=80
 else
 MK_CFLAGS     += -O3
 MK_CXXFLAGS   += -O3
-MK_NVCCFLAGS  += -O3
+MK_NVCCFLAGS += -maxrregcount=80
 endif

 ifndef LLAMA_NO_CCACHE
@@ -299,7 +299,6 @@ ifneq ($(filter aarch64%,$(UNAME_M)),)
     # Raspberry Pi 3, 4, Zero 2 (64-bit)
     # Nvidia Jetson
     MK_CFLAGS   += -mcpu=native
-    MK_CXXFLAGS += -mcpu=native
     JETSON_RELEASE_INFO = $(shell jetson_release)
     ifdef JETSON_RELEASE_INFO
         ifneq ($(filter TX2%,$(JETSON_RELEASE_INFO)),)
  • NOTE: If you would rather make the changes manually, do the following (a scripted alternative is sketched after this list):

    • Change MK_NVCCFLAGS += -O3 to MK_NVCCFLAGS += -maxrregcount=80 on line 109 and line 113.

    • Remove MK_CXXFLAGS += -mcpu=native on line 302.

  3. Build the llama.cpp source code.
make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_62 -j 6
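As a scripted alternative to the manual Makefile edits in step 2 (run this before building), the following sed sketch applies the same two changes; it assumes the Makefile is exactly the one from commit a33e6a0, so verify the result before building:

# swap the two NVCC optimization flags (lines 109 and 113)
sed -i 's/^MK_NVCCFLAGS  *+= -O3$/MK_NVCCFLAGS += -maxrregcount=80/' Makefile
# drop the aarch64-specific MK_CXXFLAGS += -mcpu=native line (line 302)
sed -i '/MK_CXXFLAGS += -mcpu=native/d' Makefile
# confirm only the intended lines changed
git diff Makefile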

Download a model

  1. Download a model to the device
wget https://huggingface.co/second-state/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf
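The Q5_K_M quantization of TinyLlama 1.1B weighs in at roughly 0.7-0.8 GB, so a quick sanity check that the download completed:

ls -lh TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf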

Perform inference

  1. Test the main inference script (the flags used here and below are summarized after this list)
./main -m ./TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf -ngl 33  -c 2048 -b 512 -n 128 --keep 48
  2. Run the live server
./server -m ./TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf -ngl 33  -c 2048 -b 512 -n 128
  3. Test the web server functionality using curl
curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'

You can now run a large language model on this tiny and cheap edge device. Have fun!

References

@dieg0varela

@acerbetti thank you for sharing.

I tried your "🐳 Docker: The Lazy Person’s Paradise" approach, but without any success.

docker run -it \
    -p 8000:8000 \
    acerbetti/l4t-jetpack-llama-cpp-python:latest \
    /bin/bash -c \
    'python3 -m llama_cpp.server \
        --model $(huggingface-downloader TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf) \
        --n_ctx 1024 \
        --n_gpu_layers 35 \
        --host 0.0.0.0 \
        --port 8000'

Fails with:

...
OSError: libcudart.so.10.2: cannot open shared object file: No such file or directory
...
RuntimeError: Failed to load shared library '/usr/local/lib/python3.10/dist-packages/llama_cpp/libllama.so': libcudart.so.10.2: cannot open shared object file: No such file or directory

even though the Nvidia runtime is set as the default:

$ sudo docker info | grep 'Default Runtime'
 Default Runtime: nvidia
* Inside the container: `pip list` shows that `llama_cpp_python 0.2.70` is available.

* Inside the container: `nvcc --version` looks okay-ish: `Cuda compilation tools, release 10.2, V10.2.300 Build cuda_10.2_r440.TC440_70.29663091_0`

* Inside the container: `echo $PATH` seems fine:
  `/usr/local/cuda/bin:/usr/local/cuda-10.2/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin`

Would you happen to have your Dockerfile available somewhere so I could debug this?

I had the same issue on a Jetson Nano 4GB. I solved it by adding --runtime=nvidia when creating the container, and it works perfectly:

docker run -it -p 8000:8000 --runtime=nvidia acerbetti/l4t-jetpack-llama-cpp-python:latest /bin/bash -c 'python3 -m llama_cpp.server \
        --model $(huggingface-downloader TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf) \
        --n_ctx 1024 \
        --n_gpu_layers 35 \
        --host 0.0.0.0 \
        --port 8000'
