StackHPC


Navigating Upstream Q3 2025


Published: Fri 01 August 2025
Updated: Fri 01 August 2025
By Justin Coquillon, Gavin Heffernan, Stig Telfer, Axel Simon

In News.

tags: openstack kolla kolla-ansible slurm

Welcome to the first issue of the StackHPC Newsletter

We’re excited to bring you the first edition of our newsletter. Navigating Upstream – named after what we do in the open source waters – is a way to stay connected, share updates and highlight insights from across the StackHPC and broader open source and scientific cloud ecosystem. We’ll aim to bring you a brief quarterly roundup of what’s new: developments in OpenStack and the wider research computing landscape, updates on our projects and open source contributions, and reflections we think might be useful to others navigating similar paths.

We’re all part of a growing community working to build and run powerful, sustainable infrastructure for science, research, and innovation. We hope this newsletter helps you feel more connected to that community and the people that power it.

If you’d prefer not to receive future editions, you can opt out at any time using the link below or with a simple reply. Otherwise, we look forward to keeping in touch.

– The StackHPC team

The StackHPC team

Announcement: StackHPC Client event colocated with OpenInfra Summit Europe 2025

We are organising a StackHPC Client event on Thursday 16 October, the day before the OpenInfra Summit Europe 2025, which will take place at Saclay, near Paris in France. Our event will be at the Hilton Garden Inn Massy, a venue very near the Summit (a 10-minute drive between the two sites).

We are excited to invite you to this event, which will create an opportunity for our customers to meet with StackHPC team members, network, and share knowledge with other “fellow travellers”, as well as learn more about StackHPC’s roadmap and plans.

The event will feature a mix of presentations from StackHPC and StackHPC customers, a workshop and more casual conversations, as well as morning coffee and pastries, lunch and an afternoon coffee break.

We hope you will be able to join us and would appreciate it if you could let us know if you would like to attend as it will help with planning. To do so, please contact us by email.

The Hilton Garden Inn Massy

The View from the Release Train - Infrastructure updates

StackHPC (along with the wider OpenStack community) has been busy bringing upgrades, fixes and new features to Release Train in Q2. Here are the main ones to know of.

OpenStack

  • OpenStack Epoxy (2025.1): A major focus this quarter has been validating OpenStack Epoxy and its upgrade path ahead of starting customer upgrades in Q3.
  • Read the full release notes

StackHPC Kayobe-config

  • Version Checks for Kayobe and Kolla-Ansible: New playbooks have been added to verify that the correct versions of Kayobe and Kolla-Ansible are installed when running Kayobe operations. This improves reliability by mitigating a whole class of errors caused by out-of-date dependencies.
  • OpenBao Support for Internal TLS Certificate Management: Support for OpenBao has been introduced for TLS certificate generation. OpenBao, an open source fork of HashiCorp Vault, retains all core features and will replace Vault next year. A migration path will be provided in a future release.
  • Queue Migration Scripts for Messaging Improvements (RabbitMQ): We’ve added a script to migrate reply/fanout queues to quorum queues and streams, in preparation for the upcoming OpenStack Epoxy upgrades. This builds on our earlier work during the 2024.1 upgrade cycle, where classic mirrored queues were migrated to quorum queues.
  • GPU Passthrough Configuration Templates: Templates to simplify configuration of PCI passthrough for GPUs are now available. A list of supported GPUs can be found here.
  • Read the full release notes
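
As an illustration of the version checks mentioned above, here is a minimal Python sketch of the general idea: compare the installed versions of a set of packages against the major versions a deployment expects. This is not the playbook shipped in StackHPC Kayobe-config; the expected version numbers and all function names below are hypothetical and chosen only for the example.

```python
import re
from importlib import metadata

# Hypothetical expected major versions, for illustration only. A real
# deployment would take these from its own configuration.
EXPECTED_MAJOR = {"kayobe": 18, "kolla-ansible": 20}


def major_of(version):
    """Return the leading integer of a version string, or None if absent."""
    match = re.match(r"\d+", version)
    return int(match.group()) if match else None


def check_versions(expected, installed):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for name, want in expected.items():
        have = installed.get(name)
        if have is None:
            problems.append(f"{name}: not installed")
        elif major_of(have) != want:
            problems.append(
                f"{name}: found {have}, expected major version {want}"
            )
    return problems


def installed_versions(names):
    """Look up installed distribution versions, skipping missing packages."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass
    return found


if __name__ == "__main__":
    current = installed_versions(EXPECTED_MAJOR)
    for problem in check_versions(EXPECTED_MAJOR, current):
        print(problem)
```

Running a check like this before every operation, rather than once at install time, is what catches an environment that has drifted out of date since it was first set up.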

Kayobe

  • Ubuntu Noble (24.04): Full support for Ubuntu Noble (24.04) has been added to Kayobe as both a host and container operating system for seed, hypervisor, and overcloud hosts. This is a key step in preparing for upcoming OpenStack Epoxy upgrades, as the migration to Noble will be a prerequisite for Ubuntu-based systems.
  • Read the full release notes

Kolla

  • Prometheus and Alertmanager Updates: Prometheus services have been updated, including Alertmanager 0.28.1, which simplifies and improves support for Microsoft Teams notifications.
  • Read the full release notes

Kolla-Ansible

  • Improved Messaging Resilience (RabbitMQ): Support has been added for using quorum queues for transient/fanout queues, improving the resilience of OpenStack services against message broker failures.
  • Read the full release notes

Slurm Appliance Enhancements

In this issue, we will focus on the Slurm Appliance, as there have been a lot of improvements in the last few months. Major new features include:

Image-Based Node Upgrades via Slurm Jobs

  • We’ve introduced the ability to re-image compute nodes via Slurm jobs. This allows upgrades to be scheduled as part of the job queue, minimising disruption, while still keeping the benefits of image-based upgrades, such as reproducibility and ease of testing. This helps integrate more smoothly into production environments. Note: This feature isn’t yet supported across all appliance components and may require customisation for your site.

Enhanced Cluster Monitoring

  • Alertmanager Integration: Native support for Alertmanager improves real-time alerting.
  • Node Health Monitoring: Integration with LBNL's Node Health Checks enables much better visibility into node and cluster state.

Filesystem and Performance Improvements

  • Lustre Client Support: Support for Lustre clients (including a fix for an issue affecting the current client release on Rocky Linux 9.6).
  • Performance Tuning: TuneD can now be used for optimising system performance.

Expanded GPU and Slurm Features

  • Support for NVIDIA MIG (Multi-Instance GPUs)
  • Improved Autodetection: Enhanced compatibility with Slurm’s NVIDIA GPU autodetection.
  • Important Upgrade Note: MIG support introduces backward-incompatible changes to openhpc_ variables. Please read the release notes carefully if upgrading, or engage our support.

Major Version and Infrastructure Upgrades

  • Slurm 24.11.5 on Rocky Linux 9: Now supported, with automated Slurm database backup during upgrades.
  • OpenTofu Enhancements: Our OpenTofu infrastructure-as-code configurations are now more flexible, mature, and production-ready. Work that was previously custom to individual sites has been combined into a standardised framework. OpenTofu is an open source fork of HashiCorp Terraform, which retains all features.

Other Notable Updates

  • Package Updates: Ongoing updates to the appliance package set.
  • Rocky Linux Compatibility: The appliance currently supports Rocky Linux 9.5, with 9.6 held back due to the unavailability of NVIDIA DOCA.

More details can be found in the Slurm Appliance Release Notes.

CVE Watch

The first half of 2025 has thankfully been pretty tame, but there were still a few security vulnerabilities worth mentioning. Neither of the two CVEs below affected any of our customer deployments, but we considered them important enough to warrant sending warning emails.

24 March 2025: CVE-2025-1974 and co.: Kubernetes ingress-nginx CVEs

https://kubernetes.io/blog/2025/03/24/ingress-nginx-cve-2025-1974/

A set of vulnerabilities affecting the Kubernetes ingress-nginx controller. As our cloud portal Azimuth can deploy Kubernetes clusters (and runs as a Kubernetes application), it would appear to be well within scope. However, as Azimuth does not enable any ingress controller by default, clusters deployed by Azimuth are not affected.

Nevertheless, because ingress-nginx is widely used and the vulnerabilities are considered critical, making it possible for an attacker to take over a whole cluster, they warrant attention.

6 April 2025: CVE-2025-31492: Apache mod_auth_openidc

https://nvd.nist.gov/vuln/detail/CVE-2025-31492

A flaw in the Apache web server mod_auth_openidc extension, used among others by Slurm, the workload manager and job scheduler used by many of our customers.

This vulnerability allows unauthenticated users to access protected content via crafted HTTP POST requests in the absence of application-level gateways (reverse proxy or load balancer). However, the conditions required for this vulnerability to be exploited are strict and cumulative: the default AuthRequestMethod must have been changed from GET to POST, there must be no load balancer or reverse proxy in front of the attacked resource and, lastly, the attacker must know a valid user name. In light of this, we did not believe this to be a high-impact risk or a concern requiring immediate attention.
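
Because the conditions are cumulative, a deployment is only exposed when all three hold at once. The short Python sketch below expresses that reasoning as a triage check. It is an illustration only, not a scanner: the class and field names are hypothetical, and the values would have to be filled in by hand from a review of each deployment.

```python
from dataclasses import dataclass


@dataclass
class OidcDeployment:
    """Hypothetical summary of one mod_auth_openidc deployment."""

    auth_request_method: str        # "GET" is the default per the text above
    behind_proxy_or_lb: bool        # reverse proxy or load balancer in front
    attacker_knows_valid_user: bool


def is_exposed(deployment):
    """True only when every condition listed above holds together."""
    return (
        deployment.auth_request_method.upper() == "POST"
        and not deployment.behind_proxy_or_lb
        and deployment.attacker_knows_valid_user
    )


# Default configuration: the first condition already fails.
default = OidcDeployment("GET", behind_proxy_or_lb=False,
                         attacker_knows_valid_user=True)

# Worst case: all three conditions hold.
worst = OidcDeployment("POST", behind_proxy_or_lb=False,
                       attacker_knows_valid_user=True)

print(is_exposed(default))
print(is_exposed(worst))
```

Changing any single field of the worst case back to its safe value makes the check return False, which is why we judged the overall risk to be low.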

Upcoming Training Opportunities

Our Q1 and Q2 mixed-group training workshops have now wrapped up. A big thank you to everyone who took part! These sessions covered key OpenStack operations and infrastructure best practices. Developing a greater degree of independence by growing the skills and confidence of the teams we work with is one of the core values of StackHPC, and these training sessions are an embodiment of this.

Curious to see what these workshops involve? Take a look at our workshop overview.

Booking Now for Q3

Our next round of training is scheduled for 1-8 October 2025. If you're interested in joining or would like a tailored quote for your team, please get in touch.

StackHPC and the Community

OpenStack community user survey is live

It’s that important time of year again: time to answer the OpenStack user survey.

The numbers are important in helping the OpenInfra Foundation demonstrate the size of our community. You can answer the survey here.

Results from the previous years are visible here.

The OpenStack Scientific SIG turns 10!

To celebrate 10 Years of Scientific Computing with OpenStack, the OpenInfra Foundation wants to create a blog series highlighting the innovation, collaboration, and impact that OpenStack has had in scientific computing over the last decade.

As many of you run HPC workloads, manage research infrastructure or support cutting-edge science, you are prime candidates to share your experience. Please consider doing so in this form.

This will help celebrate the power of open collaboration and the critical role OpenStack has played in scientific discovery!

StackHPC in the Community: next events

Meet StackHPC Team Members. Here are the next events we will be attending:

  • 21st ECMWF workshop in Bologna, Italy on 15-19 September
  • HPC-AI Advisory Council 7th Annual UK Conference, Leicester, England on 14 & 15 October
  • OpenInfra Summit Europe 2025 in Saclay near Paris, France on 17-19 October
  • SuperComputing 25 (SC25) in St Louis, USA on 16-21 November
  • Kubernetes Community Days UK (KCD) in Edinburgh, Scotland on 21-22 November

From the Blog

Modernising Kolla-Ansible’s RabbitMQ offering

Published 11 June 2025, by Matt Crees

Matt reports on our successful deployment of RabbitMQ quorum queues in Kolla‑Ansible and details enhanced queue durability during OpenStack upgrades, with insights into migrations through Antelope, Caracal, and preparations for the upcoming Epoxy release.

Read the full post.

Stop Scientists Stealing Your Nodes: Evaluating Slinky for Backfilling AI Resources

Published 23 May 2025, by William Tripp

Will looks at using SchedMD’s Slinky to opportunistically backfill Slurm compute in a Kubernetes-based AI infrastructure (Dawn supercomputer). The post outlines benefits, cgroup limitations, pod preemption challenges, and future possibilities with Slurm Bridge.

Read the full post.

Driving sustainability for cutting‑edge research computing (Verne Global and StackHPC)

Published 2 May 2025, by Stig Telfer

Stig talks about our collaboration with Verne to extend the lifecycle of donated HPC servers by relocating them to Iceland's renewable-powered data centres. The initiative has slashed e‑waste and carbon emissions, extending hardware utility and cutting CO₂ by tens of thousands of tonnes.

Read the full post.

Azimuth Cloud Portal successfully completes security audit

Published 2 April 2025, by Axel Simon

Azimuth - the open source portal for easily deploying Slurm, JupyterHub, and more - has come out of a penetration test with no critical findings. Conducted by Arctic Owl, the audit validated strong hardening practices and confirmed only minor low-level security issues.

Read the full post.

Parting words

Thank you for reading this first edition of Navigating Upstream!

We would love to have your feedback and suggestions.

– The StackHPC Team

Reach out to us via BlueSky, LinkedIn or directly via our contact page.


StackHPC Ltd, registered company number 09938332. Privacy Policy