The following notes were generated with ChatGPT and edited while being recorded here.
- The GPU driver acts as an interface between your operating system and the hardware.
- It ensures your OS can communicate with and utilize the GPU for tasks.
- The CUDA Toolkit is required for developing and running GPU-accelerated applications.
- It includes libraries (like cuBLAS), compilers, and tools for building and optimizing GPU programs.
- nvcc: Part of the CUDA Toolkit, it compiles CUDA programs written in C/C++ to run on the GPU.
- Necessary if you're building custom CUDA kernels.
- cuDNN: A library, downloaded separately from the CUDA Toolkit, that provides optimized routines for deep learning frameworks like TensorFlow and PyTorch.
- Libraries like TensorFlow or PyTorch require specific versions of CUDA and cuDNN to use the GPU.
- Check GPU Model:
nvidia-smi
This command lists your GPU model and current driver version. If it prints nothing or the command is not found, the GPU driver is not installed; follow the next steps.
- Download Driver: Visit NVIDIA Driver Downloads to get the appropriate driver.
- Install the Driver: Follow the installation instructions provided on the NVIDIA website. For Linux: use the .run file or a package manager like apt or yum.
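The driver check above can also be scripted. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH once the driver is installed; the helper names `parse_gpu_query` and `query_gpus` are made up for illustration.

```python
import shutil
import subprocess


def parse_gpu_query(csv_text):
    """Parse 'name, driver_version' CSV rows into (name, driver) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        # The driver version is the last comma-separated field.
        name, _, driver = line.rpartition(",")
        rows.append((name.strip(), driver.strip()))
    return rows


def query_gpus():
    """Return a list of (name, driver) tuples, or None if no working driver is found."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return parse_gpu_query(result.stdout)


if __name__ == "__main__":
    gpus = query_gpus()
    if gpus is None:
        print("No working NVIDIA driver found; install the driver first.")
    else:
        for name, driver in gpus:
            print(f"{name}: driver {driver}")
```

Returning `None` rather than raising keeps the "driver missing" case easy to branch on in a setup script.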
- Check Compatibility: Check which version of CUDA is supported by your framework (e.g., TensorFlow, PyTorch).
- Download CUDA Toolkit: Visit CUDA Toolkit Downloads and find your target CUDA version. Also check your LSB system information by running:
lsb_release -a
If the command is not found, you may need to install the lsb-core package, as some Linux distributions do not include it by default.
# sample output
Distributor ID: Ubuntu
Description: Ubuntu 20.04.4 LTS
Release: 20.04
Codename: focal
# or,
Distributor ID: Debian
Description: Debian GNU/Linux 10 (buster)
Release: 10
Codename: buster
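The download page asks for the distribution and its release, so it can help to pull those two fields out of the `lsb_release -a` output. A small sketch, assuming the `Key: value` layout shown in the sample output; `parse_lsb_release` is a hypothetical helper, not part of any library.

```python
def parse_lsb_release(text):
    """Turn 'Key: value' lines from `lsb_release -a` into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            info[key.strip()] = value.strip()
    return info


sample = """Distributor ID: Ubuntu
Description: Ubuntu 20.04.4 LTS
Release: 20.04
Codename: focal"""

info = parse_lsb_release(sample)
print(info["Distributor ID"], info["Release"])  # Ubuntu 20.04
```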
- Install CUDA: Follow the instructions for your OS (Linux, Windows, macOS). Example for Linux (using the runfile installer):
'''
Linux -> x86_64 -> Ubuntu -> 22.04 -> runfile (local)
'''
!wget https://developer.download.nvidia.com/compute/cuda/12.6.2/local_installers/cuda_12.6.2_560.35.03_linux.run
!sudo sh cuda_12.6.2_560.35.03_linux.run
Next, follow the installer's instructions. When it finishes, it suggests adding the CUDA directories to PATH and LD_LIBRARY_PATH in your environment; you need to do this. First, check the following if needed:
ls /usr/local/ # list existing CUDA installations
It may show:
..., cuda, cuda-12.0, cuda-11.0, cuda-12.6, ...
You can keep all of them or remove the ones you no longer need.
sudo rm /usr/local/cuda # [optional]
sudo rm -r /usr/local/cuda-12.0 # [optional]
Now suppose your desired CUDA version is 12.6. If /usr/local/cuda already exists, remove it first (the first command above), then create the link:
sudo ln -s /usr/local/cuda-12.6 /usr/local/cuda
- Update PATH Variables: Open ~/.bashrc or ~/.zshrc using nano and add:
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
Press Ctrl+O to save and Ctrl+X to exit.
Or append the same lines from the terminal:
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
- Source the File:
source ~/.bashrc
- Verify Installation:
nvcc -V
It should look like this:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 12.6, V11.8.89
Build cuda_12.6.r12.6/compiler.31833905_0
As shown, the release should match the version that /usr/local/cuda (the directory added to PATH) points to. To change the CUDA version, do the following:
sudo rm /usr/local/cuda
sudo ln -s /usr/local/cuda-12.8 /usr/local/cuda
Now, running nvcc -V will report CUDA 12.8.
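To confirm that the symlink and the compiler agree, the release reported by `nvcc -V` can be compared with the directory that `/usr/local/cuda` resolves to. This is a rough sketch, assuming the `cuda-X.Y` directory naming used above; all three function names are hypothetical.

```python
import os
import re


def parse_nvcc_release(nvcc_output):
    """Extract the 'release X.Y' number from `nvcc -V` output, or None."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def toolkit_version_from_path(path):
    """Extract X.Y from a directory name such as /usr/local/cuda-12.6, or None."""
    match = re.search(r"cuda-(\d+\.\d+)$", path.rstrip("/"))
    return match.group(1) if match else None


def active_toolkit(symlink="/usr/local/cuda"):
    """Resolve the symlink and return the toolkit version it points to, or None."""
    if not os.path.exists(symlink):
        return None
    return toolkit_version_from_path(os.path.realpath(symlink))


if __name__ == "__main__":
    # Shortened sample line; in practice pass the full output of `nvcc -V`.
    print(parse_nvcc_release("Cuda compilation tools, release 12.6,"))
    print(active_toolkit())
```

If the two values differ, the shell is most likely picking up `nvcc` from a different directory earlier in PATH.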
- Download cuDNN: Visit cuDNN Download Page.
- Install cuDNN: Extract and copy files to the appropriate CUDA directory. Example (Linux):
sudo cp lib/* /usr/local/cuda/lib64/
sudo cp include/* /usr/local/cuda/include/
- Run nvidia-smi to check GPU status.
- Test with a Framework:
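import torch
torch.cuda.is_available()  # True if PyTorch can use a CUDA GPU
torch.distributed.is_nccl_available()  # True if the NCCL backend is available
torch.cuda.nccl.version()  # NCCL version PyTorch uses
torch.cuda.device_count()  # number of visible GPUs
torch.version.cuda  # CUDA version PyTorch was built against
- Driver Issues: Ensure driver and CUDA versions are compatible.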
- Version Mismatch: Use the framework’s recommended CUDA version.
- CUDA Path Not Found: Ensure nvcc and libraries are correctly added to the environment.
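For the "CUDA Path Not Found" case, a quick check of the two environment variables can narrow things down. A minimal sketch, assuming the toolkit lives under `/usr/local/cuda` as configured earlier; `check_cuda_env` is an illustrative helper, not a standard tool.

```python
import os
import shutil


def path_entries(value):
    """Split a colon-separated search path into its non-empty entries."""
    return [entry.rstrip("/") for entry in value.split(":") if entry]


def check_cuda_env(env, cuda_home="/usr/local/cuda"):
    """Report whether the CUDA bin and lib64 directories are on the search paths."""
    bin_dir = f"{cuda_home}/bin"
    lib_dir = f"{cuda_home}/lib64"
    return {
        "bin_on_path": bin_dir in path_entries(env.get("PATH", "")),
        "lib64_on_ld_library_path": lib_dir in path_entries(env.get("LD_LIBRARY_PATH", "")),
    }


if __name__ == "__main__":
    print(check_cuda_env(os.environ))
    # shutil.which returns None when nvcc cannot be found on PATH.
    print("nvcc found at:", shutil.which("nvcc"))
```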
- NCCL (NVIDIA Collective Communications Library) is mandatory when running multi-GPU computations. Get the NCCL version:
locate nccl | grep "libnccl.so" | tail -n1 | sed -r 's/^.*\.so\.//'
Reason:
- The installed NCCL version might not be compatible with the CUDA Toolkit version.
Solution:
- Verify NCCL compatibility with your CUDA version (NCCL Compatibility Matrix).
- Update or downgrade the CUDA Toolkit or NCCL library as needed.
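The `sed` expression above simply strips everything up to `.so.` from the library file name. The same idea in Python might look like this; the default search pattern is an assumption (library locations vary by distribution), and both function names are made up for illustration.

```python
import glob
import re


def nccl_version_from_filename(path):
    """Return the version suffix of a libnccl shared object path, or None."""
    match = re.search(r"libnccl\.so\.(\d+(?:\.\d+)*)$", path)
    return match.group(1) if match else None


def installed_nccl_versions(pattern="/usr/lib/x86_64-linux-gnu/libnccl.so.*"):
    """List the versions found under an assumed library directory (may be empty)."""
    versions = {nccl_version_from_filename(p) for p in glob.glob(pattern)}
    return sorted(v for v in versions if v is not None)


if __name__ == "__main__":
    # Hypothetical path, used only to show the parsing.
    print(nccl_version_from_filename("/usr/lib/x86_64-linux-gnu/libnccl.so.2.18.3"))
    print(installed_nccl_versions())
```

The version found this way is the system library; PyTorch wheels typically bundle their own NCCL, which `torch.cuda.nccl.version()` reports.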
Misc
- GPU driver installation can fail due to missing kernel headers and source files. Identify your current kernel version by running:
uname -r
The output can look like 6.1.0-31-cloud-amd64 (my current system: Debian 12 (Bookworm)). We need to install the matching headers. To do that, run:
sudo apt update
sudo apt install -y linux-headers-$(uname -r) build-essential
After that, verify the kernel headers installation with:
ls -l /usr/src/linux-headers-$(uname -r)
If the directory exists and contains files, the headers are correctly installed. Now try installing the GPU driver again, either directly or via the CUDA Toolkit.
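The header check can be wrapped in a small script to run before retrying the driver install. A sketch assuming the Debian/Ubuntu layout under `/usr/src` described above; `headers_dir` and `headers_installed` are hypothetical names.

```python
import os
import platform


def headers_dir(kernel_release=None):
    """Return the expected header directory for a kernel release (Debian/Ubuntu layout)."""
    # platform.release() reports the same string as `uname -r`.
    release = kernel_release or platform.release()
    return f"/usr/src/linux-headers-{release}"


def headers_installed(kernel_release=None):
    """True if the header directory exists and is non-empty."""
    path = headers_dir(kernel_release)
    return os.path.isdir(path) and bool(os.listdir(path))


if __name__ == "__main__":
    print(headers_dir(), "installed:", headers_installed())
```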