Scheduling Baremetal Resources in Pike

OpenStack Pike

For many reasons, it is common for HPC operators using OpenStack to deploy Ironic and Nova together to deliver baremetal servers to their users. In this post we look at recent changes to how Nova chooses which Ironic node to use for each user's nova boot request.


To set the scene, let's look at what a user gets to choose from when asking Nova to boot a server. While there are many options relating to the boot image, storage volumes and networking, let's ignore those and focus on the choice of Flavor.

The choice of Flavor allows the user to specify which of the predefined combinations of CPU, RAM and disk best suits their needs. In many clouds the choice of flavor maps directly to how much the user has to pay. In some clouds it is also possible to pick between a baremetal server (i.e. using the Ironic driver) or a VM (i.e. using the libvirt driver) by picking a particular flavor, while most clouds only use a single driver for all their instances.

Before Pike

Ironic manages an inventory of nodes (i.e. physical machines). We need to somehow translate Nova's flavor into a choice of Ironic node. Before the Pike release, this was done by comparing the RAM, CPU and disk resources for each node with what is defined in the flavor.

If you don't use the exact match filters in Nova, you will find Nova is happy to give users any physical machine that has at least the amount of resources requested in the flavor. This can lead to your special high-memory servers being used by people who only requested your regular type of server. Some consider this a feature: if you are out of small servers, your preference might be to give people a slightly bigger server instead.
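For clouds that do want strict matching, Nova ships exact-match scheduler filters. A minimal sketch of the relevant nova.conf fragment (the filter list here is shortened for illustration; a real deployment will enable more filters than this):

```ini
[filter_scheduler]
# The Exact* filters reject a host unless its RAM, cores and disk
# match the flavor exactly, rather than "at least as much as"
enabled_filters = RetryFilter,AvailabilityZoneFilter,ComputeFilter,ExactRamFilter,ExactCoreFilter,ExactDiskFilter
```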

All this confusion comes because we are trying to manage indivisible physical machines using a set of descriptions designed for packing VMs onto a hypervisor, possibly taking into account a degree of overcommit. Things get even harder when you consider having both VM and baremetal resources in the same region, with a single scheduler having to pick the correct resources based on the user's request. At this point you need the exact match filters for only a subset of the hosts. This problem is now starting to be resolved by the creation of Nova's placement service.

The Resource Class

The new Placement API brings its own set of new terms. Let's just say a Resource Provider has an Inventory that defines what quantity of each Resource Class the Resource Provider has available. Users can get Allocations of specific amounts of a Resource Class from a given Resource Provider. Note: while there is a set of well-known Resource Class names, you are also able to use custom names.

Furthermore, a Resource Provider can be tagged with Traits that describe the qualitative capabilities of the Resource Provider. The python library os-traits defines the standard Traits, but the system also allows custom traits. Ironic has recently added the ability to set a Resource Class on an Ironic Node.

In Pike, Nova reads the Ironic node's resource_class property and, if it has been set, updates the Inventory of the Resource Provider that represents that Ironic node to show one unit available of the corresponding custom Resource Class.

Using Ironic's Resource Classes

Lots of technical jargon in that last section. What does that really mean?

Well, it means we can divide up all Ironic nodes into distinct subsets, and we can label each distinct subset with a Resource Class. For an existing system, you can update any Node to add a Resource Class. But be careful, because once you add a Resource Class to a node, you can't change the field until the Ironic node is no longer being used (i.e. it is back in the available state). (There are good reasons why, but let's leave that for another blog post.)
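As a sketch, tagging a node looks something like this (the node name and class value are made up for illustration, and you need a reasonably recent python-ironicclient):

```shell
# Tag an available node with a resource class (hypothetical node name)
openstack baremetal node set --resource-class baremetal.gold node-01

# Confirm the field was set
openstack baremetal node show node-01 --fields name resource_class
```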

If you are adding new Nodes or creating a new cloud, you can use Ironic inspector rules to set the Resource Class to an appropriate value, in a similar way to initializing any of the other Node properties you can determine via inspection.
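For example, an inspector rule along these lines could tag nodes at introspection time. This is a sketch only; the condition field path and threshold are assumptions, so check the rule syntax for your ironic-inspector version:

```json
{
  "description": "Tag big-memory nodes as baremetal.highmem",
  "conditions": [
    {"op": "ge", "field": "data://memory_mb", "value": 131072}
  ],
  "actions": [
    {"action": "set-attribute", "path": "/resource_class", "value": "baremetal.highmem"}
  ]
}
```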

Mapping Resource Classes to Flavors

So here is where it gets more interesting. Now that we have divided the Ironic nodes into groups, we can map each group to a particular Nova flavor. Here are the docs on how you do that.
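In short, you give the flavor an extra spec requesting one unit of the custom Resource Class, and zero out the VCPU, RAM and disk resources so the scheduler stops matching on those. A sketch, with made-up flavor and class names:

```shell
# Request exactly one node tagged with resource_class "baremetal.gold"
# (Ironic's "baremetal.gold" becomes "CUSTOM_BAREMETAL_GOLD" in Placement)
openstack flavor set bm.gold --property resources:CUSTOM_BAREMETAL_GOLD=1

# Stop scheduling on the standard resource classes
openstack flavor set bm.gold \
    --property resources:VCPU=0 \
    --property resources:MEMORY_MB=0 \
    --property resources:DISK_GB=0
```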

Health warning time

You probably noticed our blog post on upgrading to Pike <(unknown)2017-09-21-pike-upgrade.rst>. Well, if you want to do this, you need to make sure you have a bug fix we helped develop to make it work. In particular, you want to be on a new enough version of Pike that you have this backport.

Without the above fix, you will find that adding flavor extra specs such as resources:VCPU=0 causes the Nova scheduler to start picking Ironic nodes that are already in use by existing instances, triggering lots of retries and, most likely, lots of build failures.

One more health warning. If you set a resource class of CUSTOM_GOLD in Ironic, it will get registered in Nova as CUSTOM_CUSTOM_GOLD. As such, it's best not to add the CUSTOM_ prefix in Ironic. There is a lot of history around why it works this way; for more details see the bug on launchpad.
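To make the renaming concrete, here is a small shell sketch of the normalization as I understand it (my own approximation, not Nova's actual code): upper-case the Ironic value, replace anything that isn't alphanumeric or underscore, then unconditionally prepend CUSTOM_ — hence the double prefix.

```shell
# Approximation of how Nova derives the Placement resource class name
# from an Ironic node's resource_class field
normalize_rc() {
    local up
    up=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]' | sed 's/[^A-Z0-9_]/_/g')
    printf 'CUSTOM_%s\n' "$up"
}

normalize_rc baremetal.gold   # -> CUSTOM_BAREMETAL_GOLD
normalize_rc CUSTOM_GOLD      # -> CUSTOM_CUSTOM_GOLD (the double prefix)
```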

An Unrelated Pike bug

While we are talking about Pike and using Ironic through Nova: if you have started using the experimental HA mode, where two or more nova-compute processes talk to one Ironic deployment, you will want to know about this bug, which means that mode is quite badly broken in Pike.

Once we have the fix for that merged, we will let you know what can be done for Pike based clouds in a future blog post.

Something you must do before you upgrade to Queens

In Pike there is a choice between the old scheduling world and the new Resource Class based world. But you must add a Resource Class to every Ironic node before you upgrade to Queens.
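A quick way to audit this before the upgrade is to list the resource_class field across all nodes and look for empty values (a sketch; the field names are as exposed by python-ironicclient):

```shell
# Any node showing an empty resource_class still needs to be tagged
openstack baremetal node list --fields uuid name resource_class
```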

For more details on the deprecation of scheduling of Ironic nodes using VCPU, RAM and disk, please see the Nova release notes.

Once you update your Ironic nodes with the Resource Class (and you are on the latest version of Pike that has the bug fix in), existing instances that previously never claimed the new Resource Class will have their allocations updated to include it.

Why not use Mogan?

I hear you ask: why bother with Nova any more, when there is a new project called Mogan that focuses on Ironic and ignores VMs?

Talking to our users, they like making use of the rich ecosystem around the Nova API that (largely) works equally well for both VMs and baremetal, be that the OpenStack support in Ansible or the support for orchestrating big data systems in OpenStack Sahara. In my opinion, this means it's worth sticking with Nova, and I am not just saying that because I used to be the Nova PTL.

Where we have got in Pike

In the SKA performance prototype we are now making use of the Resource Class based placement. This means placement picks only an Ironic Node that exactly matches what the flavor requests. Previously, because we did not use the exact filters or capabilities, we had GPU capable nodes being handed out to users who only requested a regular node.

Looking at the capacity of the cloud through the Placement API is now much simpler too: the available Resource Classes directly count the nodes of each type. You can see a prototype tool I created to query the capacity from Placement (using Ocata).
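If you have the osc-placement plugin installed (an assumption: it may be newer than your cloud's client tooling, in which case you can call the Placement REST API directly), the queries look roughly like this:

```shell
# List all resource providers (one per Ironic node, plus any hypervisors)
openstack resource provider list

# Inventory and current usage for one provider (UUID is a placeholder)
openstack resource provider inventory list <provider-uuid>
openstack resource provider usage show <provider-uuid>
```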

What is Happening in Queens and Rocky?

If you want to know more about the context around the work on the Placement API and the plans for the future, these two presentations from the Boston summit are a great place to start.

I recently attended the Project Team Gathering (PTG) in Denver. There was lots of discussion on how Ironic can make use of Traits for finer grained scheduling, including how you could use Nova flavors to pick between different RAID and BIOS configurations that are optimized for specific workloads. More on how those discussions are going, and how the SKA (Square Kilometre Array) project is looking to use those new features, in a future blog post!