We have previously described a new kind of OpenStack infrastructure,
built to combine polymorphic flexibility with HPC levels of
performance, in the context of our project with the Square Kilometre Array.
To take advantage of OpenStack's latest capabilities, this week we
upgraded that infrastructure from Ocata to Pike.
Early on, we took a design decision to base our deployments on
Kolla, which uses Docker
to containerise the OpenStack control plane, transforming it into
something approximating a microservice architecture.
Kolla is in reality several projects. There is the project to define
the composition of the Docker containers for each OpenStack service,
and then there are the projects to orchestrate the deployment of Docker
containers across one or more control plane hosts. There is more than one
way to do this, but our preference is for Kolla-Ansible.
Kolla-Ansible builds upon a set of hosts already deployed and
configured up to a baseline level where Ansible can drive the Docker
deployment. Given we are typically starting from pallets of new
servers in a loading dock, there is a gap to be filled to get from
one to the other. For that role, we created Kayobe, loosely defined as
"Kolla on Bifrost",
and intended to perform a similar role to TripleO,
but using only Ironic for the undercloud seed and driven by Ansible
throughout. This approach has enabled us to incorporate some
compelling features, such as Ansible-driven configuration of BIOS
and RAID firmware parameters
and network switch configuration.
There is no doubt that Kayobe has been a huge enabler for us, but what about Kolla?
One of the advantages claimed for a containerised control plane is how it
simplifies the upgrade process by severing the interlocking package dependencies
of different services. This week we put this to the test, by upgrading
a number of systems from Ocata to Pike.
This is a short guide to how we did it, and how it worked out...
Have a Working Test Plan
It may seem obvious, but it may not be an obvious starting point.
Make a set of tests to ensure that your OpenStack system is working
before you start. Then repeat these tests at any convenient point.
By starting with a test plan that you know works, you'll know for sure
if you've broken it.
Otherwise in the depths of troubleshooting you'll have a lingering
doubt that perhaps your cloud was broken in this way all along...
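One way to make such a plan repeatable is to capture it as a set of named
checks and compare the results from before and after each step. The sketch
below is only an illustration under stated assumptions: the function names
run_test_plan and regressions are hypothetical, and each check stands in for
a real test of your own cloud.

```python
from typing import Callable, Dict, List


def run_test_plan(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run every named check and record whether it passed.

    A check that raises an exception is recorded as a failure, so one
    broken test cannot stop the rest of the plan from running.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def regressions(before: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Return the checks that passed before but fail (or are missing) after."""
    return sorted(
        name for name, passed in before.items()
        if passed and not after.get(name, False)
    )
```

Run the plan once before touching anything and keep the result. A check that
already fails in that baseline is a pre-existing fault, not something the
upgrade broke, and regressions() will not report it.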
Preparing the System for Upgrade
We brought the system to the latest on the stable/ocata branch.
This in itself shakes out a number of issues. Just how healthy is
the kernel and OS on the controller hosts? Are the Neutron agent
containers spinning, looking for lost namespaces? Is the kernel
blocking on most cores before spewing out reams of "kernel:NMI
watchdog: BUG: soft lockup - CPU#2 stuck for 23s!"?
A host in this state is unlikely to succeed in moving one patchset forward,
let alone a major OpenStack release.
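A quick health check of this kind can be scripted. The following is a minimal
sketch, assuming the kernel log is available as lines of text (for example
the output of dmesg or journalctl -k). The function name soft_lockups_by_cpu
is hypothetical, and the pattern is based only on the message quoted above.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the soft lockup message quoted in the text, capturing the CPU
# number and the reported stall time in seconds.
SOFT_LOCKUP = re.compile(r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s")


def soft_lockups_by_cpu(lines: Iterable[str]) -> Counter:
    """Count soft lockup reports per CPU in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

An empty result on every controller host is a reasonable precondition for
going any further. Anything else is worth fixing first.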
One of Kolla's strengths is the elimination of dependencies between
services. It makes it possible to deploy different versions of
OpenStack services without worrying about dependency conflicts.
This can be a very powerful advantage.
Being able to move a Kolla container forward along the same
stable release branch establishes that the basic procedure is working
as expected. Getting the control plane migrated to the tip of the
current release branch is a good precursor to making the version upgrade.
Staging the Upgrade
Take the leap on a staging or development system and you'll be more
confident of landing in one piece on the other side. In tests on
a development system, we identified and fixed a number of issues
that would each have become a major problem on the production system.
Even a single-node staging system will find problems for you.
- During the Pike upgrade, the Docker Python bindings package is renamed
from docker_py to docker. The two are mutually exclusive.
The Python environment we use for Kolla-Ansible must start the
process with docker_py installed and at the appropriate point
transition to docker. We found a way through
and developed Kayobe to perform this orchestration.
- We carried forward a piece of work to send our Kolla logs via Fluentd
to Monasca,
which has just made its way upstream.
- We hit a problem with Kolla-Ansible's RabbitMQ containers generating
duplicate entries in /etc/hosts,
which we are working around while the root cause is investigated.
- We found and fixed some more issues with Kolla-Ansible pre-checks for both
Ironic and Murano.
- We hit this bug when generating config for MariaDB. It was easily
fixed once the problem was identified.
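To illustrate the first item in the list above, the transition between the
two Docker bindings packages can be thought of as a small plan of pip
commands. This is a hypothetical sketch and not how Kayobe implements it.
The names plan_transition and is_installed are assumptions made for
illustration, and docker-py is the name under which the older package is
published on PyPI.

```python
import sys
from importlib import metadata
from typing import List

OLD_BINDINGS = "docker-py"  # package used up to and including Ocata
NEW_BINDINGS = "docker"     # package expected from Pike onwards
PIP = [sys.executable, "-m", "pip"]


def is_installed(package: str) -> bool:
    """Report whether a distribution is installed in this environment."""
    try:
        metadata.version(package)
        return True
    except metadata.PackageNotFoundError:
        return False


def plan_transition(old_installed: bool, new_installed: bool) -> List[List[str]]:
    """Return the pip commands needed to end up with only the new bindings.

    The two packages are mutually exclusive, so if the old one is present
    this sketch removes both and installs the new one afresh rather than
    trusting whatever mixture is left behind.
    """
    if not old_installed:
        return [] if new_installed else [PIP + ["install", NEW_BINDINGS]]
    to_remove = [OLD_BINDINGS] + ([NEW_BINDINGS] if new_installed else [])
    return [
        PIP + ["uninstall", "-y"] + to_remove,
        PIP + ["install", NEW_BINDINGS],
    ]
```

Calling plan_transition(is_installed(OLD_BINDINGS), is_installed(NEW_BINDINGS))
gives a list of commands that could be passed to subprocess.run one at a
time. Working out at which point in the upgrade to run them is the harder
part, and the sketch does not attempt that.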
Performing the Upgrade
On the day, at a production scale, new problems can occur that were not
exposed at the scale of a staging system.
In a production upgrade, the best results come from bringing all the technical
stakeholders together while the upgrade progresses. This enables a team to draw
on all the expertise it needs to work through issues encountered.
In production upgrades, we worked through new issues:
That final point should have been found by our test plan, but was
not covered (this time). Arguably it should have been found by Kolla-Ansible's
CI testing too.
The Early Bird Gets The Worm
Being an early adopter has both benefits and drawbacks. Kolla,
Ansible and Kayobe have made it possible to do what we did -
successfully - with a small but talented team.
Our users have scientific work to do, and our OpenStack projects
exist to support that.
We are working to deliver infrastructure with cutting-edge capabilities that
exploit OpenStack's latest features. We are proud to take some credit for our
upstream contributions, and excited to make the most of these new powers.