The Convergence of HPC, AI and Cloud


We have previously described a new kind of OpenStack infrastructure, built to combine polymorphic flexibility with HPC levels of performance, in the context of our project with the Square Kilometre Array. To take advantage of OpenStack's latest capabilities, this week we upgraded that infrastructure from Ocata to Pike.

Early on, we took a design decision to base our deployments on Kolla, which uses Docker to containerise the OpenStack control plane, transforming it into something approximating a microservice architecture.

Kolla is in reality several projects. There is the project to define the composition of the Docker containers for each OpenStack service, and then there are the projects to orchestrate the deployment of Docker containers across one or more control plane hosts. This could be done using Kolla-Kubernetes, but our preference is for Kolla-Ansible.

Kolla-Ansible builds upon a set of hosts already deployed and configured up to a baseline level where Ansible can drive the Docker deployment. Given we are typically starting from pallets of new servers in a loading dock, there is a gap to be filled to get from one to the other. For that role, we created Kayobe, loosely defined as "Kolla on Bifrost", and intended to perform a similar role to TripleO, but using only Ironic for the undercloud seed and driven by Ansible throughout. This approach has enabled us to incorporate some compelling features, such as Ansible-driven configuration of BIOS and RAID firmware parameters and network switch configuration.

There is no doubt that Kayobe has been a huge enabler for us, but what about Kolla? One of the advantages claimed for a containerised control plane is how it simplifies the upgrade process by severing the interlocking package dependencies of different services. This week we put this to the test, by upgrading a number of systems from Ocata to Pike.

This is a short guide to how we did it, and how it worked out...

Have a Working Test Plan

It may seem obvious, but a test plan is not always the obvious starting point. Make a set of tests to ensure that your OpenStack system is working before you start. Then repeat these tests at any convenient point. By starting with a test plan that you know works, you'll know for sure if you've broken it.

Otherwise in the depths of troubleshooting you'll have a lingering doubt that perhaps your cloud was broken in this way all along...
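
As an illustration of that before-and-after comparison, here is a minimal sketch of a check runner. It is not the test plan we used: the function names `run_checks` and `regressions` are hypothetical, and each check is assumed to be a plain callable that raises an exception when something is wrong.

```python
from typing import Callable, Dict, List


def run_checks(checks: Dict[str, Callable[[], None]]) -> Dict[str, bool]:
    """Run every check and record whether it passed (raised no exception)."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = True
        except Exception:
            results[name] = False
    return results


def regressions(before: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Names of checks that passed before the change but do not pass now."""
    return sorted(
        name for name, passed in before.items()
        if passed and not after.get(name, False)
    )
```

Recording the results before the upgrade and comparing them afterwards separates failures the upgrade introduced from failures that were there all along.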

Preparing the System for Upgrade

We brought the system to the latest on the stable/ocata branch. This in itself shakes out a number of issues. Just how healthy is the kernel and OS on the controller hosts? Are the Neutron agent containers spinning, looking for lost namespaces? Is the kernel blocking on most cores before spewing out reams of kernel:NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! messages?

A host in this state is unlikely to succeed in moving one patchset forward, let alone a major OpenStack release.
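
One way to spot such a host before starting is to scan its kernel log for soft lockup reports. The following is a rough sketch, assuming the log lines are already available as text (for example, captured from `dmesg`). The function name `find_soft_lockups` is hypothetical, and the pattern is based only on the message quoted above.

```python
import re
from typing import Iterable, List, Tuple

# Matches reports such as:
#   kernel:NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s!
SOFT_LOCKUP = re.compile(r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s")


def find_soft_lockups(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return a (cpu, seconds) pair for each soft lockup report found."""
    hits = []
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            hits.append((int(match.group(1)), int(match.group(2))))
    return hits
```

A non-empty result on a controller is a good reason to fix the host first and upgrade later.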

One of Kolla's strengths is the elimination of dependencies between services. It makes it possible to deploy different versions of OpenStack services without worrying about dependency conflicts. This can be a very powerful advantage.

The ability to update a Kolla container forward along the same stable release branch establishes that the basic procedure is working as expected. Getting the control plane migrated to the tip of the current release branch is a good precursor to making the version upgrade.

Staging the Upgrade

Take the leap on a staging or development system and you'll be more confident of landing in one piece on the other side. In tests on a development system, we identified and fixed a number of issues that would each have become a major problem on the production system upgrade.

Even a single-node staging system will find problems for you.

For example:

During the Pike upgrade, the Docker Python bindings package is renamed from docker_py to docker. The two are mutually exclusive. The Python environment we use for Kolla-Ansible must start the process with docker_py installed and transition to docker at the appropriate point. We found a way through and developed Kayobe to perform this orchestration.
We carried forward a piece of work to send our Kolla logs to Monasca via Fluentd, which has just made its way upstream.
We hit a problem with Kolla-Ansible's RabbitMQ containers generating duplicate entries in /etc/hosts, which we are working around while the root cause is investigated.
We found and fixed some more issues with Kolla-Ansible pre-checks for both Ironic and Murano.
We hit a bug with generating config for MariaDB - easily fixed once the problem was identified.
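
The first of these issues can be detected with a small check on the Python environment. The sketch below is not how Kayobe performs the transition. It only reports which bindings are installed, assuming the old package is registered under the distribution name `docker-py` and the new one under `docker`. The function names are hypothetical, and the `lookup` parameter exists only so the logic can be exercised without touching a real environment.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Callable, List

# Assumed distribution names: the old bindings, then the renamed ones.
BINDINGS = ("docker-py", "docker")


def installed_bindings(lookup: Callable[[str], str] = version) -> List[str]:
    """Return the Docker bindings distributions present in this environment."""
    found = []
    for name in BINDINGS:
        try:
            lookup(name)
        except PackageNotFoundError:
            continue
        found.append(name)
    return found


def bindings_conflict(lookup: Callable[[str], str] = version) -> bool:
    """True if both the old and the new bindings are installed together."""
    return len(installed_bindings(lookup)) > 1
```

Running `bindings_conflict()` before and after the transition step gives a quick signal that the environment holds exactly one of the two packages.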

Performing the Upgrade

On the day, at production scale, new problems can occur that were not exposed at the scale of a staging system.

In a production upgrade, the best results come from bringing all the technical stakeholders together while the upgrade progresses. This enables a team to draw on all the expertise it needs to work through issues encountered.

In production upgrades, we worked through new issues:

We encountered a race condition in the management of keepalived for an HAProxy cluster. It had already been reported as a bug and fixed on the master branch, so we could cherry-pick the fix.
We hit a bug with Horizon reporting End of script output before headers: django.wsgi, for which a fix was already in review upstream that we could cherry-pick.

That final point should have been found by our test plan, but was not covered (this time). Arguably it should have been found by Kolla-Ansible's CI testing too.
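
A dashboard check of the kind our test plan was missing could be as small as the sketch below. It assumes that a broken Horizon shows up as a non-200 HTTP response or a failed connection. The function names are hypothetical, and the dashboard URL is whatever your deployment exposes.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Fetch the URL and return the HTTP status code, including error codes."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx and 5xx responses; the code is still useful.
        return err.code


def dashboard_ok(url: str, get_status: Callable[[str], int] = http_status) -> bool:
    """True only if the dashboard URL answers with HTTP 200."""
    try:
        return get_status(url) == 200
    except OSError:
        # Connection refused, DNS failure, timeout and similar errors.
        return False
```

Added to the test plan, a check like this flags a dashboard that is returning server errors even while the API services are healthy.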

The Early Bird Gets The Worm

Being an early adopter has both benefits and drawbacks. Kolla, Ansible and Kayobe have made it possible to do what we did - successfully - with a small but talented team.

Our users have scientific work to do, and our OpenStack projects exist to support that.

We are working to deliver infrastructure with cutting-edge capabilities that exploit OpenStack's latest features. We are proud to take some credit for our upstream contributions, and excited to make the most of these new powers in Pike.

StackHPC

Upgrade to Pike using Kolla and Kayobe
