The Convergence of HPC, AI and Cloud


There are three key areas that we want to optimise in a large-scale messaging system: high availability, high performance, and low operator input. However, these areas are often in conflict. Making messages highly available is more computationally intensive, which may reduce performance. Improving either of these will usually increase the complexity of the system, so extra care is needed to avoid adding to an operator's workload. This last point is exacerbated by bugs and misconfiguration. As unexpected downtime doesn't care about the concept of a 9-5, it's highly beneficial for a system to recover automatically when things do go wrong. This blog post explores the recent work that's been done to improve how RabbitMQ is used within Kolla-Ansible. We look at the tradeoffs made between availability and consistency, and discuss how we've tried to reduce the need for operator intervention.

What's up with RabbitMQ in Kolla Ansible?

There are clear benefits to mirroring queues in RabbitMQ. Message redundancy helps to minimise the impact of a RabbitMQ node going down, whether through an unexpected failure or as a planned step during an upgrade. For this reason, Kolla Ansible had for years configured RabbitMQ for classic queue mirroring, which replicates message queues across multiple RabbitMQ nodes. However, these queues were not made durable, as there were concerns that this would impact performance at large scales. Classic queue mirroring is not designed to work with transient queues, so issues were commonplace. This, combined with other bugs that were not on the radar of the Kolla Ansible team, meant that RabbitMQ was in a particularly fragile state. The high availability wasn't particularly high.

As the messaging system is integral to many moving parts of an OpenStack system, this problem would rear its head in many different forms. We found that it would often break during OpenStack upgrades, and some of our customers had issues during the regular running of their systems, typically triggered by outages such as switch resets or power failures. In some cases, it felt like half of our job was just picking up the pieces after more messages were lost.

The Kolla-Ansible team therefore decided to remove queue mirroring entirely. After all, some downtime is preferable to a system getting stuck in a broken state. The guiding principle in removing replicated queues was that the OpenStack services should retry and recover in the event of message loss, i.e. the loss of one node in a cluster should result in a successful retry on another node. However, it turned out that many services are actually very bad at handling message loss.

So classic queue mirroring was removed, but HA was still retained in some sense. Kolla-Ansible continued deploying and supporting HA RabbitMQ clusters and exchanges, and RabbitMQ clients remained free to fail over to other nodes if there was an outage.

Revisiting HA

In November of last year, I was tasked with this goal: revive classic queue mirroring in RabbitMQ, and aim to minimise message loss to support a greater degree of high availability. Here's what has been done since then. Many of these ideas come from the Large Scale SIG's resources on configuring RabbitMQ.

First up is the big one: bringing back classic queue mirroring. This time, it follows the supported configuration and must be enabled alongside durable queues. The flag om_enable_rabbitmq_high_availability enables both options together, or neither of them. Some queues are not mirrored, notably "reply" and "fanout" queues. They are left out by an exclusionary pattern in the classic mirroring policy. This pattern is:

^(?!(amq\\.)|(.*_fanout_)|(reply_)).*

Originally, these exclusions were designed to improve the efficiency of RabbitMQ in OpenStack Ansible, as they are not expected to be long-lived queues. But fanout and reply queues are also not made durable, and as we now know, mirroring transient queues is unsupported and causes issues around failover. If a fanout queue is being mirrored while a node is then shut down, old incarnations of these queues are often left behind without any active consumers. However, some services will still send messages to these queues, grinding their communication to a halt.
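To make the effect of the exclusion pattern concrete, here is a minimal sketch that tests a few queue names against it using Python's re module. It assumes that the doubled backslash in the pattern above is string-level escaping for a literal dot, and the queue names are made up for illustration. RabbitMQ evaluates policy patterns with its own regex engine, so treat this as an approximation rather than a reproduction of broker behaviour.

```python
import re

# The policy pattern with string-level escaping removed
# (assumption: "\\." in the policy definition means a literal dot).
MIRROR_PATTERN = re.compile(r"^(?!(amq\.)|(.*_fanout_)|(reply_)).*")


def is_mirrored(queue_name: str) -> bool:
    """Return True if the mirroring policy pattern matches this queue name."""
    return MIRROR_PATTERN.match(queue_name) is not None


# Hypothetical queue names, for illustration only.
for name in ["conductor", "amq.gen-abc123", "scheduler_fanout_9f2c", "reply_5d1e"]:
    status = "mirrored" if is_mirrored(name) else "excluded"
    print(f"{name}: {status}")
```

The negative lookahead rejects any name that starts with "amq." or "reply_", or that contains "_fanout_" anywhere; everything else falls under the mirroring policy.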

Next, we explored an issue with how classic queue mirroring handles the leader queues of a cluster. By default, the value of ha-promote-on-shutdown is set to "when-synced". This means that a replica queue will only be promoted to a leader queue if the messages are fully synchronised. However, this assumes that the node is going to recover quickly after a soft shutdown. If it takes a long time or never comes back at all then the messages in these queues cannot be consumed, so the system fails to recover from the node outage. To resolve this, ha-promote-on-shutdown is set to "always" in Kolla-Ansible. This means that a follower queue will immediately be promoted to the new leader if the current leader is shut down. While this does open the possibility of some messages getting lost due to not being synchronised, this has been deemed an acceptable tradeoff to slightly reduce consistency in favour of high availability. This also eliminates the need for operator intervention, as the system can now recover automatically.
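The difference between the two settings can be summarised with a toy model. This is only a sketch of the decision described above, not RabbitMQ's implementation; the function name and its arguments are hypothetical.

```python
def can_promote_replica(promote_on_shutdown: str, replica_synchronised: bool) -> bool:
    """Toy model: may a replica take over when the leader's node shuts down?"""
    if promote_on_shutdown == "always":
        # Availability first: promote even if some messages were never copied.
        return True
    if promote_on_shutdown == "when-synced":
        # Consistency first: only a fully synchronised replica may take over.
        return replica_synchronised
    raise ValueError(f"unknown ha-promote-on-shutdown value: {promote_on_shutdown}")


# With "when-synced", an unsynchronised replica leaves the queue without a leader.
print(can_promote_replica("when-synced", replica_synchronised=False))
# With "always", the queue stays available at the risk of losing some messages.
print(can_promote_replica("always", replica_synchronised=False))
```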

I mentioned earlier that some bugs, while not directly related, made it much harder to diagnose the problems with RabbitMQ HA. One such bug was polluting the logs with errors about heartbeat connections being dropped. These had very similar stack traces to the HA failures, so it took a while to discern where the problems were.

Many OpenStack APIs run under mod_wsgi. When this is the case, the RabbitMQ heartbeat thread is changed to run in a green thread, instead of the intended native thread. Instead of checking AMQP sockets every 15 seconds, it is now suspended and resumed by eventlet. This can take a very long time if mod_wsgi isn't processing traffic regularly, which causes RabbitMQ to close the AMQP connection. The oslo.messaging team resolved this issue by setting the value of [oslo_messaging_rabbit] heartbeat_in_pthread to true. However, this then caused issues for non-wsgi services. As such, it was reset to false and is now left up to the user to choose the appropriate value. In Kolla-Ansible, heartbeat_in_pthread is now set to true only for wsgi applications to allow the RabbitMQ heartbeats to function. As the default value of this variable has changed between releases, this value is explicitly set for all services from Wallaby onwards. For more information on this, check out Hervé Beraud's blog which does a great job explaining this issue in more detail.
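As a rough illustration of the per-service choice, the sketch below renders the relevant configuration section with Python's configparser. The section and option names come from the text above; the helper function and its is_wsgi_service argument are hypothetical, and Kolla-Ansible itself generates service configuration through its own templates rather than code like this.

```python
import configparser
import io


def render_heartbeat_config(is_wsgi_service: bool) -> str:
    """Render an [oslo_messaging_rabbit] section for one service."""
    config = configparser.ConfigParser()
    config["oslo_messaging_rabbit"] = {
        # Native-thread heartbeats only for services running under mod_wsgi.
        "heartbeat_in_pthread": "true" if is_wsgi_service else "false",
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


print(render_heartbeat_config(is_wsgi_service=True))
```

Setting the value explicitly in both cases, rather than relying on the library default, is what protects a deployment from the default changing between releases.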

An OpenStack system is a complex collection of different projects and services all working together. As such, it is not unexpected that some of the issues with RabbitMQ have not been caused directly by Kolla-Ansible. After working on all the above changes for quite some time and finally being able to get RabbitMQ within Kolla-Ansible into a stable state in a Wallaby test environment, I moved on to making sure that the system would remain stable as I upgraded through OpenStack releases. Sadly, I immediately hit a new issue in Xena: if a RabbitMQ node was shut down, then OpenStack was unable to launch any new VMs. I was concerned that this was another bug in Kolla-Ansible that wasn't present in Wallaby, but eventually managed to track it down to a regression caused by a change in oslo.messaging. Thankfully, fixes are already being proposed for this bug, so in the meantime, we have been able to apply these changes locally in our downstream forks and the issue will be resolved without any further changes to Kolla-Ansible.

Supporting more features

Now a few months deep, it's safe to say we're fully down the rabbit hole. So why not keep this momentum going? There are some additional patches in Kolla-Ansible to support more configuration of RabbitMQ. These aim to make RabbitMQ more reliable in a Kolla-Ansible managed OpenStack system. (Note that, at the time of writing, some of these are still in progress.)

Support has been added to Kolla-Ansible to change the replication factor of mirrored queues. Prior to this, queues were replicated across every node. The change is mainly a performance measure, and the default now follows the advice in the RabbitMQ docs: queues are replicated across n/2+1 nodes (rounded down, i.e. a majority), where n is the total number of nodes. There are also hopes that this will help to speed up recovery from node outages, as messages will need to be replicated fewer times.
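As a worked example of the n/2+1 rule, the sketch below computes the replica count for a few cluster sizes and builds a matching policy fragment. The ha-mode and ha-params keys are RabbitMQ's classic mirroring policy keys; whether Kolla-Ansible emits exactly this definition is an assumption, and the function names are illustrative.

```python
def replica_count(total_nodes: int) -> int:
    """Majority of the cluster: n/2 + 1, using integer division."""
    if total_nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return total_nodes // 2 + 1


def mirroring_policy(total_nodes: int) -> dict:
    """Policy fragment that mirrors each queue to a fixed number of nodes."""
    return {"ha-mode": "exactly", "ha-params": replica_count(total_nodes)}


for nodes in (1, 3, 5):
    print(nodes, mirroring_policy(nodes))
```

In a three-node cluster each queue then lives on two nodes instead of three, so every message is copied once rather than twice.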

Two new configurable options have also been added to Kolla-Ansible: message TTL and queue expiry. Messages will expire after 10 minutes, and queues after an hour of inactivity. The message TTL ensures that old messages with no consumers will be removed (note this is intentionally longer than the 300s timeouts often used across OpenStack services). The queue expiry ensures that any old queues from removed or renamed nodes are cleaned up, as they may otherwise accumulate over time.
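RabbitMQ expresses both of these limits in milliseconds, via the message-ttl and expires policy keys. The sketch below converts the durations given above into a policy fragment; the constant and function names are illustrative, and the exact variables Kolla-Ansible exposes for these settings are not shown here.

```python
MESSAGE_TTL_MINUTES = 10   # messages expire after 10 minutes
QUEUE_EXPIRY_MINUTES = 60  # unused queues expire after an hour


def minutes_to_ms(minutes: int) -> int:
    """Convert minutes to the milliseconds RabbitMQ policies expect."""
    return minutes * 60 * 1000


def expiry_policy() -> dict:
    """Policy fragment with the message TTL and queue expiry in milliseconds."""
    return {
        "message-ttl": minutes_to_ms(MESSAGE_TTL_MINUTES),
        "expires": minutes_to_ms(QUEUE_EXPIRY_MINUTES),
    }


print(expiry_policy())
```

The 600000 ms message TTL is twice the 300s service timeout mentioned above, so a message is only dropped well after its sender has stopped waiting for a reply.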

There is work being done to switch the Kolla-Ansible upgrade of RabbitMQ from a full-stop upgrade to a rolling upgrade. This will involve utilising RabbitMQ's feature flags to ensure the upgrades can take place, and we will take advantage of node maintenance mode to minimise the impact of individual node downtimes.

How did we test all of this?

A lot of the testing and exploration involved deliberately breaking RabbitMQ and causing outages of the OpenStack system. Obviously, this kind of work could not have been performed on our customers' production systems, so our multinode Kayobe test environments have been invaluable. They gave me a safe, isolated testing environment that is very close to a production system, where I could test experimental changes and deliberately cause RabbitMQ-related breakages without any risk of impacting clients or users of their systems. You can find our new blog post on the multinode environment here.

Looking to the future

Classic queue mirroring is a deprecated feature in RabbitMQ. It is scheduled to be removed in version 4.0, and while there isn't yet an official release date for this, we must still consider how to proceed. Thankfully, RabbitMQ boasts an alternative: quorum queues. These are replicated FIFO queues based on the Raft consensus algorithm.

There is an important choice still to be made here. Right now, queue mirroring is disabled by default in Kolla-Ansible. While this blog post suggests that there are benefits to enabling it, doing so is not a seamless process. The migration to durable queues requires that the state of RabbitMQ be reset, and all OpenStack services which use RabbitMQ need to be shut down to recreate their queues. For some operators this is a particularly big problem as, in large environments, it can take many hours to reconfigure all the services to use durable queues. Additionally, the migration is (currently) a manual procedure. It could therefore be worth shortening the timeline for implementing quorum queues, with the option to opt in to classic queue mirroring as a stop-gap measure in the meantime. This would allow systems that are particularly sensitive to downtime to make only one migration, rather than two.

An OpenStack system is more than just Kolla-Ansible. As such, we are optimistically planning to track down and fix situations in other OpenStack services where message loss results in an unrecoverable failure. For example, we wonder if there are scenarios where different acknowledgement modes such as positive confirmations should be used, instead of automatic acknowledgements.

Get in touch

If you would like to get in touch we would love to hear from you. Reach out to us via Bluesky or directly via our contact page.

StackHPC

Climbing out the Rabbit Hole - RabbitMQ Reliability with Kolla Ansible