HPCAC Lugano 2018
The recent HPC Advisory Council 2018 Lugano Conference was another great event, with a stimulating schedule of presentations that were once again themed on emerging trends in HPC. This year, the use of containers in HPC was at the forefront.
Motivations for Containers in Research Computing
The case for containerised compute is already well established, but here's a quick recap on our view of the most prominent advantages of supporting workload containerisation:
- Flexibility: Conventional HPC infrastructure is built around a multi-user console system, a parallel filesystem and a batch-queued workload manager. This paradigm works for many applications, but a growing number of workloads do not fit the mould. Containers can serve these non-traditional HPC applications by packaging a runtime environment that the host system could not otherwise practically support: software that previously could not run on the system at all can often run once containerised.
- Convenience: For the scientist user, application development is a means to an end. The true metric in which they are interested is the "time to science". If a user can pull their application in as a container, requiring minimal consideration for adapting to the run-time environment of the HPC system, then they probably save themselves time and effort in their goal of conducting their research.
- Consistency: Users of a research computing infrastructure have often arrived there after outgrowing the compute resources of their laptop or workstation. Their home workspace may be a very different runtime environment from the HPC system. The inconvenience and frustration of porting to a new environment (or maintaining portability between multiple environments) can be avoided if the user containerises their workspace and carries it over to other systems.
- Reproducibility: The ability to repeat research performed in software environments has been a longstanding concern in science. A container, being a recipe for installing and running an application, can play a vital role in addressing this concern. If the recipe is sufficiently precise, it can describe the sources and versions of dependencies to enable others to regenerate the exact environment at a later time.
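The reproducibility argument comes down to how precise the recipe is. As a minimal sketch (package names and versions here are illustrative, not a recommendation), a container recipe that pins its base image release and its dependency versions becomes a durable record of the software environment:

```shell
# Write a minimal, version-pinned container recipe. Pinning the base image
# release and the package versions is what turns a recipe into something
# another researcher can regenerate later.
cat > Dockerfile <<'EOF'
# Pin the base image to a specific release rather than a floating "latest" tag
FROM ubuntu:16.04

# Pin dependency versions explicitly (versions shown are illustrative)
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        openmpi-bin=2.1.1-8 \
        libopenmpi-dev=2.1.1-8 && \
    rm -rf /var/lib/apt/lists/*

# Build the application from a known source tree inside the image
COPY simulate.c /src/simulate.c
RUN mpicc -O2 -o /usr/local/bin/simulate /src/simulate.c
EOF
```

An unpinned `FROM ubuntu` plus `apt-get install openmpi-bin` would build something different every few months; the pinned form fails loudly when a version disappears, which is preferable to silently drifting.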
The Contenders for Containerised HPC
Representatives from all the prominent projects spoke, and thankfully Rich Brueckner from InsideHPC was there to capture each presentation. InsideHPC has posted all presentations from the conference online here.
This post attempts to capture the salient points of each technology and make a few comparisons that hopefully are not too odious.
Christian Kniep from Docker presented a vision of a set of runtime tools built from modular components standardised through the OCI, centred on the Docker daemon. Because it executes with root privileges, the Docker daemon has often been characterised as a security risk. Christian points out that this argument unfairly singles out the Docker daemon, given that the same could be said of slurmd (but rarely is).
Through use of a mainstream set of tools, the philosophy is that containerised scientific compute is not left behind when new capabilities are introduced. And with a common user experience for working with containerised applications, both 'scientific' and not, cross-fertilisation remains easy.
If the user requirements of scientific compute can be implemented through extension of the OCI's modular ecosystem, this could become a simple way of focussing on the differences, rather than creating and maintaining an entirely disjoint toolchain. Christian's work-in-progress doxy project aims to demonstrate this approach. Watch this space.
The Docker toolchain is the de facto standard implementation. The greatest technical challenges to this approach remain around scalability, process tree lineage and the integration of HPC network fabrics.
Saverio Proto from SWITCH presented their new service for Kubernetes and how it integrates with their SWITCHEngines OpenStack infrastructure.
Kubernetes stands apart from the other projects covered here in that it is a Container Orchestration Engine rather than a runtime environment. The other projects described here use a conventional HPC workload manager to manage resources and application deployment.
Saverio picked out this OpenStack Summit talk that describes the many ways that Kubernetes can integrate with OpenStack infrastructure. At StackHPC we use the Magnum project (where available) to take advantage of the convenience of having Kubernetes provided as a service.
Saverio and the SWITCH team have been blogging about how Kubernetes is used effectively in the SWITCHengines infrastructure here.
Abhinav Thota from Indiana University presented on Singularity, and how it is used with success on IU infrastructure.
A 2017 paper in PLOS ONE describes Singularity's motivation for mobility, reproducibility and user freedom, without sacrificing performance and access to HPC technologies. Singularity takes an iconoclastic position as the "anti-Docker" of containerisation. This contrarian stance also sees Singularity eschew common causes such as the Linux Foundation's Open Container Initiative, in favour of entirely home-grown implementations of the tool chain.
Singularity is also becoming widely used within research computing and is developing a self-sustaining community around it.
Singularity requires set-UID binaries both for building container images and for executing them. As an attack surface this may be an improvement over a daemon that continuously runs as root. However, the unprivileged execution environment of Charliecloud goes further, reducing its attack surface to the bare minimum: the kernel ABI and the namespace implementations themselves.
The evident drawback of Singularity is that its policy of independence from the Docker ecosystem could lead to difficulties with portability and interoperability. Unlike the ubiquitous Docker image format, the Singularity Image Format depends on the ongoing existence, maintenance and development of the Singularity project. The sharing of scientific applications packaged in SIF is restricted to other users of Singularity, which must inevitably have an impact on the project's aims of mobility and reproducibility of science.
If these limitations were resolved, Singularity would appear to be a good choice, with its rapidly-growing user base and evolving ecosystem. It requires little administrative overhead to install, but may not be as secure as Charliecloud due to its requirement for set-UID binaries.
Shifter's genesis was a project between NERSC and Cray, to support the scalable deployment of HPC applications, packaged in Docker containers, in a batch-queued workload manager environment. Nowadays Shifter is generic and does not require a Cray to run it. However, if you do have a Cray system, Shifter is already part of the Cray software environment and supported by Cray.
Shifter's focus is on a user experience based around Docker's composition tools, using containers as a means of packaging complex application runtimes, and a bespoke runtime toolchain designed to be as similar as possible to the Docker tools. Shifter's implementation addresses the scalable deployment of the application into an HPC environment. Shifter also aims to restrict privilege escalation within the container and performs specific customisations to ensure containerisation incurs no performance overhead.
To ensure performance, the MPI libraries of the container environment are swapped for the native MPI support of the host environment (this requires use of the MPICH ABI).
To enable scalability, Shifter launches Docker containers by creating a flattened image file on the shared parallel filesystem, then mounting the image locally on each node using a loopback device. At NERSC, Shifter's scalability has been demonstrated to extend to many thousands of processes on the Cori supercomputer.
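On a Slurm-managed system with Shifter installed, the workflow looks roughly like the following batch script. This is a sketch based on the NERSC usage pattern; the image name, node counts and application name are illustrative, and `shifterimg pull` must have been run beforehand to fetch and flatten the image:

```shell
#!/bin/bash
# Beforehand, from a login node (illustrative image name):
#   shifterimg pull docker:myrepo/myapp:latest
# This converts the Docker image into Shifter's flattened format on the
# shared filesystem, ready for loopback mounting on each compute node.
#SBATCH --image=docker:myrepo/myapp:latest
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32

# srun launches one containerised process per task; Shifter substitutes
# the host's native MPI for the container's (relying on the MPICH ABI),
# so the HPC fabric is used without a performance penalty.
srun shifter ./myapp
```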
CSCS work has removed several perceived issues with the Shifter architecture. CSCS have been developing Shifter to improve the pulling of images from Dockerhub (or local user-owned Docker image repositories), and have added the ability to import images from tar files.
Shifter appears to be a good choice for sites that have a conventional HPC batch-queued infrastructure and are seeking to provide a scalable and performant solution while retaining as much compatibility as possible with the Docker workflow. Shifter requires more administrative setup than Singularity or Charliecloud.
Shifter is available on NERSC's github site.
Michael Jennings from Los Alamos National Lab presented Charliecloud and the concepts upon which it is built.
Charliecloud's development at LANL resulted in a solution developed in a site with strict security requirements. Cluster systems in such an environment typically have no external internet connectivity. System applications are closely scrutinised, in particular those that involve privileged execution.
In these environments, Charliecloud's distinct advantage is its use of the newly-introduced user namespace to support non-privileged launch of containerised applications. This technique was described in the 2017 Singularity paper as being "not deemed stable by multiple prominent distributions of Linux". User namespaces were actually introduced in 2013, but their use widened exposure to a new kernel attack surface, and as a result their path to maturity has been complex and slow. They are now a standard feature of the latest releases of all major Linux distributions; configuration for Debian, RHEL and CentOS is described here. (For environments where unprivileged user namespaces cannot be supported, Charliecloud can fall back to using set-UID binaries.)
The user namespace is itself unprivileged: it can be created without root privileges, and within it all of the other, privileged, namespaces can then be created. In this way a containerised application can be launched without requiring privileged access on the host.
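The effect is easy to demonstrate with the `unshare` utility from util-linux, on a kernel with unprivileged user namespaces enabled: an ordinary user can map themselves to root inside a new user namespace, with no set-UID helper or daemon involved.

```shell
# As an unprivileged user, create a new user namespace and map our own UID
# to root inside it (--map-root-user). Prints 0: inside the namespace we
# appear as UID 0, while on the host nothing has changed.
unshare --user --map-root-user id -u
```

This is the kernel primitive beneath Charliecloud's launch path: the "root" obtained is confined to the namespace, which is why the attack surface reduces to the kernel's namespace implementations themselves.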
Development for a Charliecloud environment involves using the Docker composition tools locally. Unlike with Docker, the container is then flattened to a single archive file in preparation for execution. Execution scales through distribution of this single archive, which is unpacked into a tmpfs environment locally on each compute node. Charliecloud has been demonstrated scaling to 10,000 nodes on LANL's Trinity Cray system.
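The flatten-and-unpack step can be sketched with plain `tar` standing in for Charliecloud's own helpers (directory and file names here are illustrative):

```shell
# Build a toy "image" directory standing in for a flattened container root
mkdir -p image/bin image/etc
echo 'hello from the container' > image/etc/motd

# Flatten: a single archive on the shared filesystem means each compute
# node issues one large streaming read, rather than thousands of small
# metadata operations against the parallel filesystem
tar -C image -czf app.tar.gz .

# Unpack: each compute node extracts the archive into fast local tmpfs,
# after which the containerised application runs from local storage
mkdir -p /tmp/charlie-demo
tar -C /tmp/charlie-demo -xzf app.tar.gz

cat /tmp/charlie-demo/etc/motd
```

The single-file distribution is what makes the launch scalable: the expensive shared-filesystem traffic is one sequential read per node, and everything after that is local.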
Charliecloud appears to be a good choice for sites that are seeking scalability, but with strong requirements for runtime security. The Docker development environment and composition tools are also helpful for users on a learning curve for containerisation.
Further details on Charliecloud can be found in the informative paper presented at Supercomputing 2017. Michael has provided a Charliecloud Vagrantfile to help people familiarise themselves with it. Charliecloud packages are expected to ship in the next major release of OpenHPC.
The Road Ahead
The ecosystem around container technology is rapidly evolving, and this is also true in the niche of HPC.
The Open Container Initiative
The tools for this niche are somewhat bespoke, but thanks to the efforts of the OCI to break down the established Docker tools into modular components, there is new scope to build a specialist solution upon a common foundation.
This initiative has already spurred innovation. Rootless RunC is an approach to using the runc tool for unprivileged container launch; the approach and its current limitations are well documented at the link above.
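The key ingredient in a rootless runc bundle is a user namespace with an identity mapping from the unprivileged user to root inside the container. A minimal fragment of the OCI `config.json` illustrates this (the host UID/GID of 1000 is illustrative):

```json
{
  "linux": {
    "namespaces": [
      { "type": "user" },
      { "type": "pid" },
      { "type": "mount" }
    ],
    "uidMappings": [
      { "containerID": 0, "hostID": 1000, "size": 1 }
    ],
    "gidMappings": [
      { "containerID": 0, "hostID": 1000, "size": 1 }
    ]
  }
}
```

This is the same user-namespace mechanism that Charliecloud exploits, expressed here through the OCI's standard configuration format rather than a bespoke toolchain.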
In a similar vein, the CRI-O project is working on a lightweight container runtime interface that displaces the Docker daemon from Kubernetes compute nodes, in favour of any OCI-compliant runtime.
Shifter, Charliecloud and Singularity are not OCI-compliant runtimes, as they predate the OCI's relatively recent existence. However, when the OCI's tools become suitably capable and mature, they are likely to be adopted in Charliecloud and Shifter.
Challenges for HPC
There are signs of activity around developing better support for RDMA in containerised environments. The RDMA cgroup, introduced in Linux 4.11, adds support for controlling the consumption of RDMA communication resources, and is already included in the spec for the OCI runtime.
RDMA isolation (for example, through a namespace) does not currently appear to be possible: current implementations can only pass through the host's RDMA context. This works fine for HPC configurations whose scheduling policy does not share a compute node between workloads.
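In the OCI runtime specification, the RDMA cgroup surfaces as per-device limits under `linux.resources.rdma`. A fragment such as the following (device name and limit values illustrative) caps the number of HCA handles and objects a container may consume:

```json
{
  "linux": {
    "resources": {
      "rdma": {
        "mlx5_1": {
          "hcaHandles": 3,
          "hcaObjects": 10
        }
      }
    }
  }
}
```

Note that this limits resource consumption only; it does not isolate one container's RDMA traffic from another's, which is why the node-exclusive scheduling policy above remains necessary.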
The greatest advantages of specialist solutions appear to address challenges that remain unique to scientific computing. For example:
- Scalable launch of containerised workloads. The approach taken by Singularity, Shifter and Charliecloud involves using a parallel filesystem for the distribution of the application container image. This addresses one of the major differences in use case and design. Distributing the container as a single image file also greatly reduces filesystem metadata operations.
- Launching multi-node MPI applications in research computing containers. The Docker runtime complicates interaction with MPI's Process Management Interface. Shifter's innovation of replacing container MPI libraries with host MPI libraries is an intriguing way of specialising a generalised environment. Given that multi-node MPI applications are the standard workload of conventional HPC infrastructure, running containerised applications of this form is likely to remain a tightly specialised niche use case.
(Most) Paths Converge
A future direction in which HPC runtime frameworks for containerised applications have greater commonality with the de facto standard ecosystem around OCI and/or Docker's tools has considerable appeal. The development environment for Docker containers is rich, and developing rapidly. The output is portable and interchangeable between different runtimes. As Michael Jennings of Charliecloud says, “If we can all play in the same playground, that saves everyone time, effort and money”.
The HPCAC 2018 Swiss Conference brought together experts from all the leading options for containerised scientific compute, and was a rare opportunity to make side-by-side comparisons. Given the rapid development of this technology I am sure things will have changed again in time for the 2019 conference.