Quorum in the warren - modernising Kolla-Ansible's RabbitMQ offering


When we last left off, we had made decent progress in improving the reliability of RabbitMQ in Kolla-Ansible. Since then, we've upgraded our fleet of customer systems to the Antelope release. During this, we took om_enable_rabbitmq_high_availability into production with good results. The upgrades were much smoother than in previous years on the RabbitMQ front, especially for our ML2/OVN-based systems. No longer did we need to force reset half our services due to lingering errors raised by oslo_messaging. Sadly there were still some issues with queues going missing for our ML2/OVS-based systems, but we'll look at how we're fixing these later in the blog.

As I'm writing this, we're coming to the end of our Caracal upgrade season, so I'll also discuss how this is shaping up with regard to RabbitMQ. Here's a hint: quorum queues are looking good.

The world of cloud computing never stands still of course, so we've also been hard at work preparing for the latest OpenStack release, Epoxy. This will bring in the new RabbitMQ version 4.0, with (a hard requirement to move to) exciting new features.

Before we get stuck in, I want to extend my thanks to the upstream Kolla/Kolla-Ansible community. The work I'll be discussing here has been a collaborative effort, with development, testing and feedback not just from within StackHPC, but also from the greater Kolla team. Open source at its finest 💪

What happened in Antelope?

Our main motivation for all the RabbitMQ reliability improvements was to improve the stability of our OpenStack upgrades. So we were of course keen to see these improvements in our Yoga -> Antelope upgrade season. As such, we enabled the HA features in the Yoga release before upgrading to Antelope. RabbitMQ went through the rolling upgrade without dropping any existing queues or leaving "old incarnations" of queues behind.

As expected, but still unfortunately, ML2/OVS-based systems were still seeing issues, because ML2/OVS makes much heavier use of RabbitMQ than ML2/OVN. As seen in the diagram below, all the major components of ML2/OVS communication use RabbitMQ. Under the hood, the majority of this is via "reply" and "fanout" queues. You'll remember that these queue types are still transient. As we discussed last time, this is necessary, but unfortunately it does mean we suffer from the lack of high availability. The main issue we have been seeing is that some queues, notably the "reply" queues, are getting lost during container restarts. This is a problem because the OpenStack services treat these queues as permanent even though they're currently transient, so many of the queues are only re-declared on service startup.

If you've seen the Kolla-Ansible docs, you'll know that there is a pretty involved migration procedure required to move from transient to durable queues. While the impact to end-users is minimal on ML2/OVN-based systems, with just some downtime to the OpenStack APIs, ML2/OVS-based systems will see downtime to tenant networking, as much of the communication is through RabbitMQ. Given this high impact, we decided it was best not to enable om_enable_rabbitmq_high_availability by default upstream. Instead, we've set our sights on RabbitMQ's shiny new toy: quorum queues.

How about Caracal?

Quorum queues were enabled by default in Bobcat, and therefore Caracal was the first major release to use them. This meant we needed to support migrating to quorum queues before upgrading to Caracal, so support was backported to Antelope too. As a bonus, this would help a bit with the stability of the system when the Caracal upgrade proper is performed.

The migration process was already well-tested, as we'd opted to enable om_enable_rabbitmq_high_availability on all of our customer systems during the Yoga -> Antelope upgrades. So in Antelope/Caracal, we simply needed to upstream the migration process. This was documented in Kolla-Ansible, and we also scripted this for Kayobe so that it could be validated in CI. Downstream, we improved our previous process, adding more scripting into our stackhpc-kayobe-config; both for testing in our own CI, and for automating the rollout to customer systems.

Ultimately, while the migrations were smooth, there were some expected hurdles. Queues declared by an OpenStack service will just be given a randomly-generated name, meaning it is not possible to determine which queue belongs to which service, nor to which host. As such, we had to resort to stopping all OpenStack services which use RabbitMQ, resetting RabbitMQ's state, and then re-deploying all services again. This meant that there was an unavoidable period of API downtime. Once we were on quorum queues though, system stability was great. On all our systems where we use ML2/OVN, we didn't see any RabbitMQ-related issues during the upgrades to Caracal. ML2/OVS-based systems of course had the same issues as before, as "reply" and "fanout" queues are still transient.
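
To make the "which queues are still not quorum" question concrete, here is a minimal sketch of how one might audit a queue listing after such a migration. It assumes tab-separated text in the shape produced by `rabbitmqctl list_queues name type durable`; the sample listing and its queue names are made up for illustration, and the helper `non_quorum_queues` is hypothetical rather than anything shipped in Kolla-Ansible.

```python
# Sample text standing in for the tab-separated output of
# `rabbitmqctl list_queues name type durable`. The queue names are invented.
SAMPLE_OUTPUT = "\n".join([
    "name\ttype\tdurable",
    "conductor\tquorum\ttrue",
    "reply_5f1c0d9e2b7a4c3e8d6f0a1b2c3d4e5f\tclassic\tfalse",
    "scheduler_fanout_0a1b2c3d4e5f46789abcdef012345678\tclassic\tfalse",
    "q-plugin\tquorum\ttrue",
])


def non_quorum_queues(listing: str) -> list[str]:
    """Return the names of queues whose type is not 'quorum'."""
    leftovers = []
    for line in listing.splitlines():
        fields = line.split("\t")
        # Skip banners, blank lines and the column header row.
        if len(fields) != 3 or fields[0] == "name":
            continue
        name, queue_type, _durable = fields
        if queue_type != "quorum":
            leftovers.append(name)
    return leftovers


print(non_quorum_queues(SAMPLE_OUTPUT))
```

With the sample above, only the "reply" and "fanout" entries are reported, which matches the situation described: the durable queues have moved to quorum, while the transient ones have not. Note that the random suffixes give no clue as to which host or process owns each leftover queue.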

A tangent on SLURP

With the 4.0 release of RabbitMQ looming, we were keen to bring Kolla-Ansible's RabbitMQ version up-to-date quickly. As such, we bumped the version in both the Bobcat and the Caracal release. This is an issue for SLURP as jumping two versions of RabbitMQ is not officially supported. So we built multiple versions of RabbitMQ in the Antelope release. We added support to build 3.11, 3.12 and 3.13 in Antelope, and then introduced a new command in Kolla-Ansible to upgrade between these.
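
The stepping logic is simple enough to sketch. The function below is a hypothetical illustration, not the Kolla-Ansible command itself: given the list of RabbitMQ versions built for a release, it returns the versions to pass through so that no upgrade skips a minor version.

```python
def upgrade_path(current: str, target: str, supported: list[str]) -> list[str]:
    """Return the versions to step through, one minor release at a time.

    `supported` must be ordered from oldest to newest.
    """
    start = supported.index(current)
    end = supported.index(target)
    if end < start:
        raise ValueError("downgrades are not covered by this sketch")
    # Everything after the current version, up to and including the target.
    return supported[start + 1 : end + 1]


# The three versions built for Antelope, as described above.
ANTELOPE_BUILDS = ["3.11", "3.12", "3.13"]

print(upgrade_path("3.11", "3.13", ANTELOPE_BUILDS))
```

For a system on 3.11 heading to 3.13, this yields two hops (3.12, then 3.13) rather than one unsupported jump.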

And finally, where did we get to with Epoxy?

Up until now, we've made a concerted effort to upgrade the version of RabbitMQ with every OpenStack release. Now, at long last, we'll be able to upgrade to the latest and greatest release. However, this new release is no small step: we find ourselves staring down the barrel of RabbitMQ version 4.0. Many of the existing features which we currently support and rely on, namely classic mirroring and transient queues, are being flat-out removed from v4.0. This of course means it's crunch time again, so let's discuss how we added support for all the new features provided by oslo.messaging and RabbitMQ.

Oslo.messaging have now added support for replacing the current transient queues ("fanout" and "reply") with quorum queues. Additionally, there is further support for the new sister feature to quorum queues: RabbitMQ Streams. These are available for "fanout" queues, and should be both more performant and have a lower storage profile than simply using quorum queues for fanout communication.
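
As a rough illustration of what this looks like from a service's point of view, the sketch below renders an `[oslo_messaging_rabbit]` section with these features switched on. The option names (`rabbit_quorum_queue`, `rabbit_transient_quorum_queue`, `rabbit_stream_fanout`) are my understanding of the oslo.messaging settings and should be checked against the configuration reference for your release; the helper `render_rabbit_options` is hypothetical and is not how Kolla-Ansible templates its config.

```python
import configparser
import io


def render_rabbit_options(stream_fanout: bool) -> str:
    """Render an illustrative [oslo_messaging_rabbit] section as INI text."""
    config = configparser.ConfigParser()
    config["oslo_messaging_rabbit"] = {
        # Durable queues become quorum queues (assumed option name).
        "rabbit_quorum_queue": "true",
        # "reply" and "fanout" queues become quorum queues too (assumed name).
        "rabbit_transient_quorum_queue": "true",
        # Optionally back "fanout" with a stream instead (assumed name).
        "rabbit_stream_fanout": str(stream_fanout).lower(),
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


print(render_rabbit_options(stream_fanout=True))
```

In a real deployment these values would be set through Kolla-Ansible's own variables rather than written by hand.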

Oslo.messaging have also introduced a new feature called the Queue Manager. This is an addition to the AMQP driver which changes how queues are named by OpenStack services. Instead of random UUIDs, queues are now named based on service name, hostname, and queue type. This is a very useful feature for debugging purposes, as we can now finally determine which queue is owned by which service. Queue manager isn't just nice to have though, it's also a hard requirement for large-scale systems, particularly those that run Heat or Magnum. This is because these services make heavy use of fanout queues, and if each of these were to be randomly named then we would quickly consume all the available Erlang atoms and be unable to make any new queues. We had hopes of using this feature to avoid the migration dance again, but unfortunately the queue names do not get updated live. So we'll still need to clear the RabbitMQ state and restart the OpenStack services again. At least we should be able to use queue manager to avoid this if any new queue types get introduced in the future.
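
The difference between the two naming schemes can be shown with a toy model. This is only a sketch: the name format used in `managed_queue_name` is a hypothetical stand-in and not the exact format oslo.messaging generates. The point is simply that a random name changes on every service restart, while a name derived from queue type, service and hostname does not.

```python
import uuid


def random_queue_name(queue_type: str) -> str:
    """Old behaviour, roughly: a fresh random suffix on every declaration."""
    return f"{queue_type}_{uuid.uuid4().hex}"


def managed_queue_name(queue_type: str, service: str, hostname: str) -> str:
    """Deterministic name; this exact format is a hypothetical illustration."""
    return f"{queue_type}_{service}_{hostname}"


def distinct_names_after_restarts(make_name, restarts: int) -> int:
    """Count how many different queue names a service declares over time."""
    return len({make_name() for _ in range(restarts)})


random_total = distinct_names_after_restarts(
    lambda: random_queue_name("fanout"), restarts=100
)
managed_total = distinct_names_after_restarts(
    lambda: managed_queue_name("fanout", "heat-engine", "controller-01"),
    restarts=100,
)
print(random_total, managed_total)
```

The ever-growing set of names in the random case is the pattern that puts pressure on RabbitMQ over time, while the managed case keeps reusing a single, human-readable name that identifies its owner.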

With 4.0 finally arriving, we had to say goodbye to classic mirrored queues. As such, om_enable_rabbitmq_high_availability was removed from Kolla-Ansible.

Looking to the future

Later this year we will begin upgrading our fleets of customer systems to the Epoxy release. As with the last cycle, we will be migrating to the new queue types on Caracal before running the full version upgrade. This is particularly exciting this time around, as we should see a noticeable improvement in the resilience of service communication on our ML2/OVS systems. Every queue will be durable, and so will be persisted between service restarts. This means we should no longer see issues from "reply" queues disappearing. It's been a long time coming...

Shortly after we finalised RabbitMQ v4.0 in Kolla-Ansible, v4.1 was released. This means v4.0 is immediately out of community support. Obviously this is less than ideal, as we maintain multiple stable releases, and even our latest release is now sat on an out-of-support version of RabbitMQ. We had a one-off solution for handling multiple RabbitMQ versions before, but it would be pretty fiddly to do this every time. I made heavy use of symlinks, which proved to be a controversial decision, and the only way to keep the Kolla-Ansible side idempotent was to explicitly override the RabbitMQ container image tag. These are neither developer-friendly nor operator-friendly designs. As such, I hope to implement a new solution for handling multiple RabbitMQ versions per OpenStack release; one where it is easy to support new RabbitMQ versions, and which is backportable to existing stable OpenStack releases.

Get in touch

If you would like to get in touch we would love to hear from you. Reach out to us via BlueSky, LinkedIn or directly via our contact page.

StackHPC
