The Only Constant in Life is Change

This is a story about how changes outside our control forced us to take a big pile of carefully-planned JIRA tickets outside and burn them. More usefully, it's also a story about what we've done to prevent that from happening again ...

Background: The Slurm Appliance

Back in 2021 we blogged about our Slurm appliance, which provides a batch-scheduled cluster environment. Things have moved on since then and the feature set has expanded, but the core remains the same. We use OpenTofu (replacing the Terraform used at launch) to create virtualised or baremetal nodes, servers or instances. Then we run Ansible against those instances to configure them with the Slurm daemons, monitoring stack, filesystems, identity and network services and so on, to provide an operational HPC environment.

At launch, the appliance used CentOS 8 - this was before CentOS Stream. Early on in development we added configurations for HashiCorp's Packer to allow us to build machine images containing all the packages the appliance needed. Both the CentOS repositories and the OpenHPC repositories we used for Slurm and MPI packages followed a model where packages could be updated within a minor release, and where only the latest packages remained available from mirrors after a new minor release. This created a problem: the packages installed by Ansible could change between Ansible runs. Building these "fat" machine images containing the required packages fixed this problem, giving us reproducible deployments. New nodes could be added to a cluster or existing nodes re-provisioned without being affected by changes in those upstream repositories. When CentOS Stream launched we followed OpenHPC in moving to RockyLinux, but the same issues and the same approach applied.

Upsets from Upstream

Development continued and the end of 2024 was near when several problems all arose within a few days:

Testing at a client using Lustre showed that an RDMA verbs header file was missing from their image.
While debugging the above, we found that the latest NVIDIA OFED version clashed with the OpenHPC libfabric package required by the MPI packages.
New NVIDIA CUDA packages were released, and our CI builds failed due to an incompatibility with the NVIDIA drivers.
RockyLinux 9.5 was released, so packages from RockyLinux 9.4 disappeared from mirrors.
Our automated testing hit an issue with the latest podman package provided by RockyLinux 9.5.
There was no version of NVIDIA OFED (or DOCA) which worked with RockyLinux 9.5.

The last one was critical, as that meant we couldn't even build images including OFED to begin working on the other problems.

More Passengers for the Release Train

The solution was to use the StackHPC Release Train. This provides artifacts for OpenStack installations using a StackHPC-hosted Pulp server called ark. Our use of it here is to get snapshots of upstream package repositories - mirrors of those repositories at particular points in time. The .repo files in the VMs used for the Packer image builds are re-written to point to specific snapshots. This means that image builds are now reproducible too; repeating an image build results in the same packages in the final image. While new snapshots are regularly created in ark, we decide when to use them by changing the snapshots which the build VM .repo files point to. This also means that building new images during feature development will no longer run into - and be forced to address at the same time - issues from package updates, making development cleaner and more predictable.
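
As a rough illustration of the re-writing step, the sketch below pins repositories in a .repo file to fixed snapshot URLs. It is not the appliance's actual implementation: the function name, the repo id and both URLs are hypothetical, and it assumes the .repo file parses as plain INI, which holds for typical dnf repo files.

```python
import configparser
import io


def pin_repo_to_snapshot(repo_text: str, snapshot_urls: dict[str, str]) -> str:
    """Return .repo file content with the given repo ids pinned to snapshot URLs.

    snapshot_urls maps a repo id (the [section] name) to the URL of a
    point-in-time snapshot. Repo ids not present in the file are ignored.
    Note: configparser drops comment lines when the file is written back.
    """
    # interpolation=None keeps '%' in values untouched; dnf variables such as
    # $releasever are passed through unchanged either way.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # preserve option-name case
    parser.read_string(repo_text)

    for repo_id, url in snapshot_urls.items():
        if not parser.has_section(repo_id):
            continue
        # A mirrorlist or metalink would let dnf choose a moving mirror,
        # so remove them and leave only the pinned baseurl.
        parser.remove_option(repo_id, "mirrorlist")
        parser.remove_option(repo_id, "metalink")
        parser.set(repo_id, "baseurl", url)

    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


if __name__ == "__main__":
    # Hypothetical repo definition and snapshot URL, for illustration only.
    original = (
        "[baseos]\n"
        "name=Rocky Linux $releasever - BaseOS\n"
        "mirrorlist=https://mirrors.example.org/mirrorlist?repo=BaseOS-$releasever\n"
        "gpgcheck=1\n"
    )
    pinned = pin_repo_to_snapshot(
        original,
        {"baseos": "https://pulp.example.org/rocky/9.4/BaseOS/x86_64/os/SNAPSHOT-ID"},
    )
    print(pinned)
```

Moving to a newer snapshot is then a matter of changing the URL in the mapping, which matches the idea above that the snapshot in use only changes when we decide it should.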

Of course, using ark for builds wasn't quite as simple as re-writing the URLs in the .repo files. For example, some build steps installed the epel-release package, which overwrote our custom EPEL .repo file. And some Ansible tasks running during cluster configuration didn't cope with the fact that the ark credentials had been removed from the .repo files at the end of the image build (these files are user-readable, so credentials cannot be left in them).
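
To make the credential clean-up concrete, here is a minimal sketch of stripping credentials from a .repo file at the end of a build. It assumes the credentials are stored using dnf's `username` and `password` repo options; how the appliance actually stores and removes them is not described here, so treat the function name and the example values as hypothetical.

```python
import configparser
import io

# dnf repo options that can carry HTTP basic-auth credentials.
CREDENTIAL_OPTIONS = ("username", "password")


def strip_repo_credentials(repo_text: str) -> str:
    """Return .repo file content with credential options removed from every repo.

    Note: configparser drops comment lines when the file is written back.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # preserve option-name case
    parser.read_string(repo_text)

    for repo_id in parser.sections():
        for option in CREDENTIAL_OPTIONS:
            # remove_option is a no-op if the option is not present.
            parser.remove_option(repo_id, option)

    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


if __name__ == "__main__":
    # Hypothetical repo definition, for illustration only.
    with_credentials = (
        "[epel]\n"
        "name=EPEL snapshot\n"
        "baseurl=https://pulp.example.org/epel/9/SNAPSHOT-ID\n"
        "username=build-user\n"
        "password=not-a-real-secret\n"
    )
    print(strip_repo_credentials(with_credentials))
```

Once this has run, anything that later tries to install packages from the same repositories will fail to authenticate, which is the situation the Ansible tasks mentioned above ran into.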

These problems were all worked through and finally we could build RockyLinux 9.4 images again, but using packages from ark. Inevitably, almost as soon as that was working, NVIDIA released OFED and DOCA versions which supported RockyLinux 9.5, and in a sense the original reason for doing this disappeared. However, breaking changes are guaranteed to recur with minor version upgrades, and we and our clients are now on a sounder footing to cope with them.

At that point we could move on and in a flurry of activity around the Christmas break we also:

Added support for using NVIDIA DOCA rather than OFED (including rebuilding kernel modules to match installed kernels).
Added support for using a local Pulp mirror to a) reduce egress charges from ark and b) enable installing packages during cluster deployment, not just during image build.
Worked around the NVIDIA driver/CUDA package incompatibility.
Made it easier to add additional site-specific dnf packages and include them in a build.
Added support for RockyLinux 8.10 via the Release Train.
Moved support for RockyLinux 9 onto 9.5 with a fix for the podman issue.
Fixed the issues with Lustre installs and the verbs header file.

Conclusion

All this resulted in probably the largest release for the Slurm appliance to date. Development continues and a subsequent release has not only added new functionality, but also bumped the ark snapshots - and hence packages - forward. It also contains the groundwork for some interesting new functionality, but that's a story for another time ...

StackHPC
