This Gist explains how to deploy the Habana Labs operator in an OpenShift cluster. The first step is to deploy an OpenShift 4.11 cluster. For that, follow the official OpenShift installation documentation.
The habanalabs module needs to load firmware, and the driver container will copy it to /var/lib/firmware on the node. We need to tell the node kernel to look in that folder for firmware, as it's not a default path. This is done by applying a MachineConfig to all worker nodes. The following command will do it.
$ oc apply -f https://gist.github.com/fabiendupont/8b092ea8d79f28b698e23ae82b644438/raw/machineconfig-firmware-path.yaml
All the worker nodes will reboot, so you may lose access to the OpenShift console for some time.
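To confirm that the rollout has finished and that the kernel picked up the new path, something like the following may help. It is only a sketch: it assumes the default `worker` pool and that the MachineConfig works by setting the `firmware_class.path` kernel parameter (the manifest itself is not reproduced here). `wait_for_workers` and `node_firmware_path` are hypothetical helper names.

```shell
# Hypothetical helpers, not part of the operator or of this Gist.

# Block until the worker MachineConfigPool reports the Updated condition.
# Assumes the default pool name "worker"; the 30m timeout is an arbitrary choice.
wait_for_workers() {
  oc wait machineconfigpool/worker --for=condition=Updated --timeout=30m
}

# Print the firmware search path the kernel is using on one node.
# Assumes the MachineConfig sets the firmware_class.path kernel parameter.
# $1 = node name
node_firmware_path() {
  oc debug "node/$1" -- chroot /host cat /sys/module/firmware_class/parameters/path
}
```

Run `wait_for_workers` once the pool has started updating; right after `oc apply` the pool may still report Updated from its previous state. If the assumption about the kernel parameter holds, `node_firmware_path <node-name>` should then print /var/lib/firmware.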
The Habana Labs operator relies on the Kernel Module Management (KMM) operator to deploy the kernel module and the device plugin, so we need to install it first. It's available via an OLM catalog, which we can add with the following command.
$ oc apply -f https://gist.github.com/fabiendupont/8b092ea8d79f28b698e23ae82b644438/raw/catalogsource-kmmo.yaml
Then, we can head to the OperatorHub page in the OpenShift console and install the OOT Operator (the legacy name of the KMM Operator).
Similarly, we need to add a catalog for the Habana Labs operator. Below is the command to add it.
$ oc apply -f https://gist.github.com/fabiendupont/8b092ea8d79f28b698e23ae82b644438/raw/catalogsource-habana-ai.yaml
Then, go to Operators > OperatorHub and install the operator. The default values are fine.
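If an operator does not show up in OperatorHub, it can be worth checking whether its catalog is being served. The sketch below lists every CatalogSource with its last observed connection state. It searches all namespaces because the namespace used by the catalog manifests is not stated here, and `catalog_states` is a hypothetical helper name.

```shell
# Hypothetical helper: print "<namespace>/<name><TAB><connection state>"
# for every CatalogSource in the cluster.
catalog_states() {
  oc get catalogsource -A \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.status.connectionState.lastObservedState}{"\n"}{end}'
}
```

A catalog whose state is READY should have its operators listed in OperatorHub shortly afterwards.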
Once the operator is installed, go to Operators > Installed Operators and click on Habana AI Operator. If it's not in the list, check that you are looking at the habana-ai-operator project.
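The same check can be done from the command line by listing the ClusterServiceVersions in the habana-ai-operator project. This is a minimal sketch; `habana_operator_phase` is a hypothetical helper name.

```shell
# Hypothetical helper: print the name and install phase of each
# ClusterServiceVersion in the habana-ai-operator project.
habana_operator_phase() {
  oc get csv -n habana-ai-operator \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
}
```

A phase of Succeeded means OLM has finished installing the operator.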
You can then go to the Device Config tab and click on the Create DeviceConfig button. It will open a dialog and you can use the default values. This configures the operator to apply the 1.6.0-439 version of the driver on all nodes with a Habana device (PCI vendor ID 1da3).
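To check whether a given node actually exposes a Habana device, one option is to count the PCI devices whose vendor ID is 1da3 by reading sysfs. This is independent of how the operator itself detects devices, and `count_habana_devices` is a hypothetical helper name. It is meant to be run from a shell on the node, for example after `oc debug node/<node-name>` and `chroot /host`.

```shell
# Hypothetical helper: count PCI devices with the Habana vendor ID (1da3).
# The optional argument is the sysfs mount point, defaulting to /sys.
count_habana_devices() {
  sysfs_root="${1:-/sys}"
  count=0
  for vendor_file in "$sysfs_root"/bus/pci/devices/*/vendor; do
    # Skip the unexpanded glob when no PCI devices are present.
    [ -r "$vendor_file" ] || continue
    if [ "$(cat "$vendor_file")" = "0x1da3" ]; then
      count=$((count + 1))
    fi
  done
  echo "$count"
}
```

A result of 0 means the node has no device with that vendor ID, so the driver should not be expected there.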
The following job will run a pod that simply sleeps forever, allowing us to run the hl-smi command from the pod terminal.
$ oc apply -f https://github.com/fabiendupont/habana-ai-smi/raw/main/job.yaml
Note that pods will require the CAP_SYS_RAWIO capability. For that, they'll have to run privileged.
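Once the job's pod is running, hl-smi can also be run from the command line instead of the console terminal. The sketch below relies on the `job-name` label that Kubernetes sets on pods created by a Job. The job name and namespace are parameters because they are not stated here (check job.yaml for the actual values), and `run_hl_smi` is a hypothetical helper name.

```shell
# Hypothetical helper: run hl-smi in the first pod created by a given job.
# $1 = job name (see job.yaml), $2 = namespace the job was created in.
run_hl_smi() {
  pod=$(oc get pods -n "$2" -l "job-name=$1" -o jsonpath='{.items[0].metadata.name}')
  oc exec -n "$2" "$pod" -- hl-smi
}
```

For example, `run_hl_smi <job-name> <namespace>` should print the hl-smi device table if the driver is loaded on the node hosting the pod.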