The placement service in OpenStack is an inventory management system.
Placement models the available resources in an OpenStack cloud as a set of resource providers arranged in a tree structure. Each resource provider has inventories of consumable resource classes, such as CPUs, memory, and disk, as well as qualitative capabilities known as traits.
When selecting a host to place a workload, Nova first asks placement which hosts have the capacity, and then filters and weighs those candidate hosts to schedule the workload. In this context, placement's role in the scheduling process is to track availability and to provide an atomic way to claim capacity for a workload through a mechanism known as allocations.
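For illustration, you can query placement directly with the osc-placement CLI plugin to see the same capacity data the scheduler consumes. This is a minimal sketch; the resource amounts below are arbitrary examples:
# List all resource providers known to placement
openstack resource provider list
# Ask placement which providers could satisfy a sample resource request
openstack allocation candidate list --resource VCPU=4 --resource MEMORY_MB=8192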
PCI in placement is an optional Tech Preview feature of Nova that models PCI devices as nested resource providers of compute nodes.
This feature was first introduced in the OpenStack Zed release.
PCI in placement enhances Nova's PCI passthrough support by allowing Nova to model PCI devices in the placement service.
This allows Nova to map individual PCI devices, such as GPUs, to placement resource providers with a CUSTOM resource class and optional traits.
Using this functionality, it is possible to model groups of PCI devices through a CUSTOM_GPU_GROUP_# resource class so that they can be allocated as a group.
This document describes a specific example of how to use PCI passthrough to define groups of devices and request them to be assigned.
As a cloud operator, I would like to group PCI devices so that they can be consumed as a group. This may be required to align a set of resources with the host topology, such as NUMA affinity, a high-speed interconnect such as NVLink or Infinity Fabric, or sets of devices that are otherwise physically related.
In OpenStack, if you need a set of compute nodes to have a different configuration, they need to be modeled as a separate node set or host aggregate. As such, each type of server that has a different PCI topology or hardware set should be grouped together. This needs to be accounted for during the planning stage.
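For example, hosts that share the same PCI topology can be grouped into a host aggregate. The aggregate and host names below are placeholders:
openstack aggregate create gpu-group-hosts
openstack aggregate add host gpu-group-hosts <compute_hostname>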
To enable PCI in placement, you must modify the nova.conf
file on both the OpenStack control plane nodes and the compute nodes.
On the control plane nodes, edit /etc/nova/nova.conf to enable the
feature in the scheduler and define the aliases that flavors will use:
[filter_scheduler]
pci_in_placement = true
[pci]
alias = {
"name": "GPU_GROUP_1",
"device_type": "type-PCI",
"numa_policy": "preferred",
"resource_class": "CUSTOM_GPU_GROUP_1"
}
alias = {
"name": "GPU_GROUP_2",
"device_type": "type-PF",
"numa_policy": "preferred",
"resource_class": "CUSTOM_GPU_GROUP_2"
}
NOTE: The device_type in the alias depends on the actual hardware. For a GPU that is capable of SR-IOV, the device_type needs to be set to type-PF; otherwise it needs to be set to type-PCI.
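After updating the configuration, restart the Nova scheduler and API services for the change to take effect. The exact service names and restart mechanism depend on your distribution and deployment tooling; on a systemd-based RDO-style deployment, for example:
sudo systemctl restart openstack-nova-scheduler openstack-nova-api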
On the compute nodes, edit /etc/nova/nova.conf to enable reporting to placement and to map specific
PCI device addresses to the custom resource classes.
These settings tell the nova-compute service which devices to manage and how to group them.
[pci]
report_in_placement = true
# This alias configuration is also needed on compute nodes
# for resize/cold migration operations.
alias = {
"name": "GPU_GROUP_1",
"device_type": "type-PCI",
"numa_policy": "preferred",
"resource_class": "CUSTOM_GPU_GROUP_1"
}
alias = {
"name": "GPU_GROUP_2",
"device_type": "type-PF",
"numa_policy": "preferred",
"resource_class": "CUSTOM_GPU_GROUP_2"
}
# Define the PCI groups by mapping addresses to resource classes
device_spec = {
"address": "0000:82:00.0",
"resource_class":"CUSTOM_GPU_GROUP_1"
}
device_spec = {
"address": "0000:83:00.0",
"resource_class":"CUSTOM_GPU_GROUP_1"
}
device_spec = {
"address": "0000:84:00.0",
"resource_class":"CUSTOM_GPU_GROUP_2"
}
device_spec = {
"address": "0000:85:00.0",
"resource_class":"CUSTOM_GPU_GROUP_2"
}
device_spec = {
"address": "0000:86:00.0",
"resource_class":"CUSTOM_GPU_GROUP_2"
}
NOTE: This configuration enables reporting of PCI devices to placement, defines the alias for resize/cold migration, and defines the device_spec entries that declare the PCI groups using custom resource classes.
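After restarting the nova-compute service (the service name varies by distribution), you can verify that the devices are modeled in placement using the osc-placement plugin. The hostname and UUIDs below are placeholders, and listing a provider tree with --in-tree requires a recent osc-placement release:
sudo systemctl restart openstack-nova-compute
# Find the compute node's resource provider, then inspect its tree and inventories
openstack resource provider list --name <compute_hostname>
openstack resource provider list --in-tree <compute_node_provider_uuid>
openstack resource provider inventory list <pci_provider_uuid>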
Additionally, certain kernel arguments (like intel_iommu=on or amd_iommu=on)
must be set on the hypervisor to enable IOMMU for passthrough.
This typically involves editing the bootloader (e.g., GRUB) configuration and restarting the compute node.
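As a sketch, on an Intel host using GRUB this typically looks like the following (use amd_iommu=on on AMD hosts; the GRUB configuration path and regeneration command vary by distribution):
# /etc/default/grub
GRUB_CMDLINE_LINUX="... intel_iommu=on iommu=pt"
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot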
Each PCI group requires a flavor. In the previous example, group 1 contains 2 devices and group 2 contains 3 devices.
Create a flavor for group 1:
openstack --os-compute-api-version 2.86 flavor create --ram <size_mb> \
--disk <size_gb> --vcpus <no_vcpus> \
--property "pci_passthrough:alias"="GPU_GROUP_1:2" \
gpu_group_1
This command should be executed as an OpenStack administrator. It creates a flavor called gpu_group_1.
NOTE: Since group 1 has 2 devices, the flavor asks for 2 allocations of the alias to ensure all devices in the group are assigned to the instance.
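You can confirm that the alias request was stored on the flavor by inspecting its properties:
openstack flavor show gpu_group_1 -c properties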
Create a flavor for group 2 using the same procedure but with values updated for the number of devices in the group.
openstack --os-compute-api-version 2.86 flavor create --ram <size_mb> \
--disk <size_gb> --vcpus <no_vcpus> \
--property "pci_passthrough:alias"="GPU_GROUP_2:3" \
gpu_group_2
NOTE: It is important to ensure the number of devices requested in the flavor fully consumes all devices provided by the group.
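With the flavors in place, an instance can be booted against a group in the usual way. The image and network names below are placeholders:
openstack server create --flavor gpu_group_1 \
--image <image> --network <network> \
gpu-instance-1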
- With this feature, when using PCI in placement to model PCI groups, it is not possible to enforce that a flavor fully consumes the group. Administrators need to carefully ensure they align their flavors to the PCI groups available.
- It is not possible to create a flavor that can be consumed from either group, even if both PCI groups contain the same number of identical PCI devices.
- It is not possible to have different PCI addresses on a different host within the same host aggregate or node set. If you have multiple servers with different physical device layouts, you will need to create multiple host aggregates to model that.
- There is a performance limitation in placement that constrains the practical size of a single group.
- A group with 4 PCI devices will work.
- A group with 6 PCI devices might work if the number of compute nodes in the system is limited.
- A group with 8 PCI devices will not work.
- There is work in progress to provide a workaround to the performance issue.
- As of October 2025, the placement performance issue is believed to be fixed, but your mileage may vary.
Unified limits is another feature in recent OpenStack releases that allows an administrator to define limits on any resource tracked by placement, either globally or per Keystone project.
By taking this approach of using PCI in placement, the operator can also use unified limits to impose a quota on PCI passthrough usage.
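For example, assuming Nova is configured to use the unified limits quota driver and that placement resource classes are referenced with the class: prefix, a default limit for the custom resource class can be registered in Keystone and overridden per project. The limit values below are illustrative:
openstack registered limit create --service nova \
--default-limit 2 class:CUSTOM_GPU_GROUP_1
openstack limit create --service nova --project <project> \
--resource-limit 4 class:CUSTOM_GPU_GROUP_1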
Upstream contributors have raised the topic of modeling PCI device groups natively in OpenStack as a top-level feature to address many of the limitations of the current approach.
https://specs.openstack.org/openstack/nova-specs/specs/2024.1/approved/pci-passthrough-groups.html
While this is not currently scheduled for immediate implementation, if there is enough community interest this could be explored for a future OpenStack release.