Allow GPU passthrough

Tested on Proxmox 9

Append the following to /etc/kernel/cmdline: iommu=pt (pcie_acs_override=downstream,multifunction is not needed because we don't need separate IOMMU groups, and intel_iommu=on is not needed on kernels >= 6.8)
If you have systemd-boot: proxmox-boot-tool refresh
nano /etc/modules-load.d/pci-pass-through.conf:

vfio
vfio_iommu_type1
vfio_pci

update-initramfs -u -k all
Blacklist the host drivers that would otherwise claim the GPU, so they don't interfere with passthrough: nano /etc/modprobe.d/nvidia-passthrough-blacklist.conf:

blacklist nouveau
blacklist nvidia*

Reboot
dmesg | grep -e DMAR should return DMAR: IOMMU enabled or DMAR: Intel(R) Virtualization Technology for Directed I/O
Create the PCI device: Datacenter > PCI devices > Add, then select the GPU. You should see the warning "not in a separate IOMMU group, make sure this is intended.", but it doesn't matter: the only device in the IOMMU group is the GPU itself, so there is no security risk.
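
To double-check the claim that the GPU is alone in its IOMMU group, the groups can be listed from sysfs on the host. This is a minimal sketch, assuming the standard /sys/kernel/iommu_groups layout; the function name and its optional root argument are made up for illustration.

```shell
#!/bin/sh
# List every PCI device per IOMMU group.
# Assumption: groups are exposed under /sys/kernel/iommu_groups (standard sysfs layout).
# The optional first argument overrides that root; it only exists to ease testing.
list_iommu_groups() {
    root="${1:-/sys/kernel/iommu_groups}"
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue      # glob matched nothing: no IOMMU groups exposed
        group="${dev#"$root"/}"        # strip the root prefix
        group="${group%%/*}"           # keep only the group number
        echo "group $group: ${dev##*/}"
    done
}
```

Run list_iommu_groups after the reboot and look for the GPU's PCI address (as shown by lspci). Any other device sharing its group number would be passed through together with it.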

Create Docker VM

Execute bash -c "$(curl -fsSL https://raw.githubusercontent.com/community-scripts/ProxmoxVE/main/vm/docker-vm.sh)"
- CPU: host
- Machine: q35
- BIOS: SeaBIOS (I don't know why OVMF doesn't work with q35)
- DO NOT START THE VM now, or it will hang or even freeze (I don't know why)
Set Display to none in options
Start the VM
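
The steps above do not show how the mapped GPU gets attached to the VM. Assuming that is done in the GUI (Hardware > Add > PCI Device), a rough CLI equivalent with qm, run before starting the VM, could look like the sketch below. The VM ID 100 and the mapping name gpu are placeholders, not values from these notes, and the function only prints the commands (dry run).

```shell
#!/bin/sh
# Print (dry run) the qm commands that attach a mapped GPU and disable the display.
# Assumptions: $1 is the VM ID, $2 is the name given to the PCI device mapping.
gpu_vm_commands() {
    vmid="$1"
    mapping="$2"
    echo "qm set $vmid --hostpci0 mapping=$mapping,pcie=1"   # pcie=1 relies on the q35 machine type
    echo "qm set $vmid --vga none"                           # same as Display: none
}

gpu_vm_commands 100 gpu
```

Review the printed commands, then run them on the host once the IDs match your setup.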

Install NVIDIA drivers in VM

Inside the VM:

apt install linux-headers-$(uname -r)
add-apt-repository contrib
apt install -y wget
wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb
apt update
apt -V install nvidia-open (TODO: test the compute-only drivers instead: apt -V install nvidia-driver-cuda nvidia-kernel-open-dkms)
Reboot
Then check if it worked: nvidia-smi
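
If nvidia-smi fails, it helps to see which kernel driver is actually bound to the card inside the VM. The sketch below filters lspci -nnk output for NVIDIA devices (PCI vendor ID 10de). It assumes pciutils is installed; the function name is made up for illustration.

```shell
#!/bin/sh
# Print "<pci address> <driver>" for each NVIDIA PCI function.
# Reads `lspci -nnk` output on stdin; 10de is NVIDIA's PCI vendor ID.
nvidia_bound_drivers() {
    awk '
        # Device header lines start at column 0; detail lines are indented.
        /^[0-9a-f]/ { dev = ($0 ~ /\[10de:/) ? $1 : "" }
        dev != "" && /Kernel driver in use:/ { print dev, $NF }
    '
}
```

Usage: lspci -nnk | nvidia_bound_drivers. In the VM the driver should be nvidia. No output at all means either the GPU is not visible or no driver is bound to it.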

Unthrottle the CPU

The M720q throttles the CPU when a GPU is plugged in. Without any configuration, as soon as the GPU is under load, the PC shuts down to protect itself against excessive power draw.

Warning

You need at least a 135 W or 170 W PSU to deliver enough power to this tiny (but beefy) PC.

Just disable BD PROCHOT. This needs no BIOS changes, except that Secure Boot must be disabled (I hope to find a way to re-enable it one day).

On the host (not in VM):

apt-get install msr-tools
curl -LO https://raw.githubusercontent.com/fralapo/Disable-BD-PROCHOT-on-LINUX/main/Disable_BD_PROCHOT
chmod u+x Disable_BD_PROCHOT
./Disable_BD_PROCHOT
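
For reference, BD PROCHOT is usually toggled through bit 0 of MSR 0x1FC, read and written with rdmsr and wrmsr from msr-tools. The sketch below shows that common approach. It is an assumption that the downloaded script works the same way (not verified here), and the function names are made up for illustration.

```shell
#!/bin/sh
# Sketch of the usual BD PROCHOT toggle: clear bit 0 of MSR 0x1FC.
# Assumption: this mirrors what the downloaded script does (not verified).

# Print the given hex MSR value with bit 0 cleared.
clear_bd_prochot_bit() {
    value="${1#0x}"                        # rdmsr prints hex without a 0x prefix
    printf '0x%x\n' $(( 0x$value & ~1 ))
}

# Read MSR 0x1FC on CPU 0, clear bit 0, write the result to all CPUs (needs root).
disable_bd_prochot() {
    modprobe msr                           # exposes /dev/cpu/*/msr
    current="$(rdmsr -p 0 0x1FC)"
    wrmsr -a 0x1FC "$(clear_bd_prochot_bit "$current")"
}
```

Calling disable_bd_prochot as root on the host applies the change. MSR writes are typically refused while kernel lockdown is active under Secure Boot, which is presumably why Secure Boot has to stay off.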

Limit the GPU power

Note

If you have soldering skills, you can instead replace the 12K OCP resistor with a 15-20K one, which makes the overcurrent protection less sensitive, so you no longer need to limit the GPU power.

The M720q only accepts 50 W max on the PCIe port, so we must make sure never to draw more, or the system will halt without any notice!

This service applies:

A power draw limit of 50 watts (not enough on its own, there are still >12V spikes)
A GPU clock limit of 1702 MHz and a memory clock of at most 6001 MHz (seems very stable)

List the supported clock pairs: nvidia-smi --query-supported-clocks=mem,gr --format=csv
Select the best pair by starting with low clocks and increasing them little by little with nvidia-smi -ac <mem clock>,<graphics clock>. 6001,1702 is pretty stable with an RTX A2000 12GB
Make the service: nano /etc/systemd/system/nvidia-power-limit.service:

[Unit]
Description=NVIDIA power limitation
Wants=syslog.target
[Service]
Type=oneshot
ExecStartPre=/usr/bin/nvidia-smi -pl 50
ExecStart=/usr/bin/nvidia-smi -ac 6001,1702
[Install]
WantedBy=multi-user.target

Enable the new service: systemctl enable nvidia-power-limit.service
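
The list of supported clocks is long, so picking the next candidate pair by hand is tedious. As a small helper for the "increase little by little" step above, this sketch reads the "mem, gr" list on stdin and prints the pair with the highest graphics clock at or below a chosen cap, in the format nvidia-smi -ac expects. The function name and its two arguments are made up for illustration.

```shell
#!/bin/sh
# From a "mem, gr" clock list on stdin, print the pair for the given memory clock
# ($1) with the highest graphics clock that does not exceed the cap ($2).
# Works with or without the " MHz" unit suffix; header lines are ignored.
pick_clock_pair() {
    awk -F', *' -v mem="$1" -v cap="$2" '
        $1 + 0 == mem + 0 && $2 + 0 <= cap + 0 && $2 + 0 > best { best = $2 + 0 }
        END { if (best > 0) print mem "," best }
    '
}
```

Usage: nvidia-smi --query-supported-clocks=mem,gr --format=csv,noheader | pick_clock_pair 6001 1702. Raise the cap step by step while testing stability, then put the pair that holds into the service file.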

How to check GPU stability

Check CPU:
Check GPU: watch -n 1 'nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,utilization.memory,power.draw.instant,clocks.video,clocks.gr,clocks.sm,clocks.mem --format=csv'
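
Since short spikes above the 50 W budget are what halt the machine, it can help to flag them instead of watching the numbers scroll by. This is a minimal sketch that reads "timestamp, watts" samples on stdin and prints only those above a limit; the function name is made up for illustration.

```shell
#!/bin/sh
# Print every "timestamp, watts" sample on stdin whose power draw exceeds the limit ($1).
flag_over_limit() {
    awk -F', *' -v limit="$1" '$2 + 0 > limit + 0 { print "OVER " limit " W: " $0 }'
}
```

Usage, sampling once per second: nvidia-smi --query-gpu=timestamp,power.draw.instant --format=csv,noheader,nounits -l 1 | flag_over_limit 50. One-second sampling will likely miss the shortest spikes, so a quiet output is not proof of stability.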

Configure Docker