The Contenders for Containerised HPC
Representatives from all the prominent projects spoke, and thankfully
Rich Brueckner from InsideHPC was there
to capture each presentation. InsideHPC has posted all presentations
from the conference online here.
This post attempts to capture the salient points of each technology
and make a few comparisons that hopefully are not too odious.
Christian Kniep from Docker Inc spoke on the work he has been doing
to enhance the integration of the Docker engine with
HPC runtime environments. (Christian's slides are uploaded here.)
Christian's vision is of a set of runtime tools based on modular
components standardised through the OCI, centred upon use of the Docker
daemon. As it executes with root privileges, the Docker daemon has
often been characterised as a security risk. Christian points out
that this argument unfairly singles out the Docker daemon, given that
the same could be said of slurmd (but rarely is).
Through use of a mainstream set of tools, the philosophy is that
containerised scientific compute is not left behind when new
capabilities are introduced. And with a common user experience for
working with containerised applications, both 'scientific' and not,
cross-fertilisation remains easy.
If the user requirements of scientific compute can be implemented
through extension of the OCI's modular ecosystem, this could become
a simple way of focussing on the differences, rather than creating
and maintaining an entirely disjoint toolchain. Christian's
work-in-progress doxy project
aims to demonstrate this approach. Watch this space.
The Docker toolchain is the de facto standard implementation.
The greatest technical challenges to this approach remain around
scalability, process tree lineage and the integration of HPC network
Saverio Proto from SWITCH presented their
new service for Kubernetes and how it integrates with their
SWITCHEngines OpenStack infrastructure.
Kubernetes stands apart from the other projects covered here in
that it is a Container Orchestration Engine rather than a runtime
environment. The other projects described here use a conventional
HPC workload manager to manage resources and application deployment.
Saverio picked out this OpenStack Summit talk
that describes the many ways that Kubernetes can integrate with
OpenStack infrastructure. At StackHPC we use the Magnum project (where available) to
take advantage of the convenience of having Kubernetes provided
as a service.
Saverio and the SWITCH team have been blogging about how Kubernetes
is used effectively in the SWITCHengines infrastructure here.
Abhinav Thota from Indiana University presented on Singularity, and how it is used with success
on IU infrastructure.
A 2017 paper in PLOS ONE
describes Singularity's motivation for mobility, reproducibility
and user freedom, without sacrificing performance and access to HPC
technologies. Singularity takes an iconoclastic position as the
"anti-Docker" of containerisation. This contrarian stance also
sees Singularity eschew common causes such as the Linux Foundation's
Open Container Initiative, in
favour of entirely home-grown implementations of the tool chain.
Singularity is also becoming widely used within research computing
and is developing a self-sustaining community around it.
Singularity requires set-UID binaries for both building container
images and for executing them. As an attack surface this may be
an improvement over a daemon that continuously runs as root. However
the unprivileged execution environment of Charliecloud goes further,
and reduces its attack surface to the bare minimum - the kernel ABI and
namespace implementations themselves.
The evident drawback of Singularity is that its policy of independence
from the Docker ecosystem could lead to difficulties with portability
and interoperability. Unlike the ubiquitous Docker image format,
the Singularity Image Format depends on the ongoing existence,
maintenance and development of the Singularity project. The sharing
of scientific applications packaged in SIF is restricted to other
users of Singularity, which must inevitably have an impact on the
project's aims of mobility and reproducibility of science.
If these limitations were resolved, Singularity would be
a good choice, given its rapidly growing user base and evolving
ecosystem. It also requires little administrative overhead to
install, but may not be as secure as Charliecloud due to its
requirement for set-UID.
Alberto Madonna from CSCS gave an overview
of Shifter, and
an update on recent work at CSCS to improve it.
Shifter's genesis was a project between NERSC and Cray, to support
the scalable deployment of HPC applications, packaged in Docker
containers, in a batch-queued workload manager environment. Nowadays
Shifter is generic and does not require a Cray to run it. However,
if you do have a Cray system, Shifter is already part of the Cray
software environment and supported by Cray.
Shifter's focus is on a user experience based around Docker's
composition tools, using containers as a means of packaging complex
application runtimes, and a bespoke runtime toolchain designed to
be as similar as possible to the Docker tools. Shifter's implementation
addresses the scalable deployment of the application into an HPC
environment. Shifter also aims to restrict privilege escalation
within the container and performs specific customisations to ensure
containerisation incurs no performance overhead.
To ensure performance, the MPI libraries of the container environment
are swapped for the native MPI support of the host environment (this
requires use of the MPICH ABI).
To enable scalability, Shifter handles Docker container launches
by creating a flattened image file on the shared parallel filesystem,
and then mounting the image locally on each node using a loopback
device. At NERSC, Shifter's scalability has been demonstrated to
extend well to many thousands of processes on the Cori supercomputer.
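
To make "flattening" concrete, here is a minimal Python sketch that collapses an ordered stack of image layers into a single file tree. It is not Shifter's implementation. It models each layer as a dict from path to contents and assumes the OCI image-layer whiteout convention, in which a file named `.wh.<name>` in a later layer deletes `<name>` from the layers beneath it. The function name `flatten_layers` and the sample layers are illustrative, and opaque whiteouts are deliberately not handled.

```python
import posixpath

WHITEOUT_PREFIX = ".wh."  # OCI layer convention for "delete this path"


def flatten_layers(layers):
    """Merge layers (lowest first) into one {path: content} mapping."""
    flat = {}
    for layer in layers:
        # Apply this layer's deletions first, so that a layer can remove a
        # directory and re-create files beneath it in the same step.
        for path in layer:
            name = posixpath.basename(path)
            if name.startswith(WHITEOUT_PREFIX):
                target = posixpath.join(
                    posixpath.dirname(path), name[len(WHITEOUT_PREFIX):]
                )
                for existing in list(flat):
                    if existing == target or existing.startswith(target + "/"):
                        del flat[existing]
        # Then add or overwrite the regular files from this layer.
        for path, content in layer.items():
            if not posixpath.basename(path).startswith(WHITEOUT_PREFIX):
                flat[path] = content
    return flat


# Hypothetical three-layer image, purely for illustration.
example_layers = [
    {"bin/sh": "shell", "etc/motd": "hello", "opt/app/old.bin": "v1"},
    {"etc/motd": "welcome", "opt/.wh.app": ""},
    {"opt/app/new.bin": "v2"},
]
flat_image = flatten_layers(example_layers)
```

The result is a single tree with no layer history, which is the kind of artefact that can be written once to a shared filesystem and mounted read-only on every node.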
CSCS work has removed several perceived issues with the Shifter
architecture. CSCS have been developing Shifter to improve the
pulling of images from Dockerhub (or local user-owned Docker image
repositories), and have added the ability to import images from tar
Shifter appears to be a good choice for sites that have a conventional
HPC batch-queued infrastructure and are seeking to provide a scalable
and performant solution, while retaining as much compatibility as
possible with the Docker workflow. Shifter requires more administrative
setup than Singularity or Charliecloud.
Shifter is available on NERSC's github site.
Michael Jennings from Los Alamos National Lab presented Charliecloud and the concepts upon which it is
at LANL resulted in a solution developed in a site with strict
security requirements. Cluster systems in such an environment
typically have no external internet connectivity. System applications
are closely scrutinised, in particular those that involve privileged
In these environments, Charliecloud's distinct advantage is the
usage of the newly-introduced user namespace to support non-privileged
launch of containerised applications. This technique was described
in the 2017 Singularity paper as being "not deemed stable by multiple
prominent distributions of Linux". It was actually introduced in
2013, but its use widened
exposure to a new kernel attack surface. As a result its maturation
has been complex and slow, but user namespaces are now a standard
feature of the latest releases of all major Linux distributions.
Configuration of Debian, RHEL and CentOS is described here.
(For environments where unprivileged user namespaces cannot be supported,
Charliecloud can fall back to using setuid binaries).
The user namespace is an unprivileged namespace: it
can be created without requiring root privileges. Within a user
namespace all other privileged namespaces can be created. In this
way, a containerised application can be launched without requiring
privileged access on the host.
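
The mechanism can be sketched in a few lines of Python. This is not Charliecloud's code. It is a minimal illustration, assuming a Linux host with unprivileged user namespaces enabled and a single-threaded process, of the sequence described in the user_namespaces(7) manual page: call `unshare(2)` with `CLONE_NEWUSER`, then map the caller's own UID and GID to 0 inside the new namespace. The helper names `id_map_line` and `enter_user_namespace` are made up for this example.

```python
import ctypes
import os

CLONE_NEWUSER = 0x10000000  # flag value from <linux/sched.h>


def id_map_line(inside_id, outside_id, count=1):
    """Format one line of /proc/<pid>/uid_map or gid_map."""
    return f"{inside_id} {outside_id} {count}\n"


def enter_user_namespace():
    """Move this process into a new user namespace where it appears as root.

    No privileges are needed on the host. Raises OSError if the kernel
    refuses, for example when unprivileged user namespaces are disabled.
    """
    # Record the host IDs first: they are no longer visible after unshare.
    uid, gid = os.getuid(), os.getgid()
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(CLONE_NEWUSER) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    # An unprivileged process must disable setgroups before writing gid_map.
    with open("/proc/self/setgroups", "w") as f:
        f.write("deny")
    with open("/proc/self/uid_map", "w") as f:
        f.write(id_map_line(0, uid))
    with open("/proc/self/gid_map", "w") as f:
        f.write(id_map_line(0, gid))
```

After a successful call, the process holds a full set of capabilities inside the new namespace only, which is what allows it to go on to create mount, PID and other namespaces without privileged access on the host.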
Development for a Charliecloud environment involves using the Docker
composition tools locally. Unlike Docker, a container is flattened
to a single archive file in preparation for execution. Execution
scales through distribution of this archive, which is
unpacked into a tmpfs environment locally on each compute node.
Charliecloud has been demonstrated scaling to 10,000 nodes on LANL's
Trinity Cray system.
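
The pack-then-unpack pattern can be illustrated with the Python standard library alone. This is a sketch of the idea rather than Charliecloud's own tooling: the function names are hypothetical, the gzip-compressed tar format is an assumption, and a temporary directory stands in for both the shared filesystem and the node-local tmpfs.

```python
import os
import tarfile
import tempfile


def pack_image(rootfs_dir, archive_path):
    """Flatten a directory tree into one gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(rootfs_dir, arcname=".")
    return archive_path


def unpack_image(archive_path, node_local_dir):
    """Unpack the archive into node-local storage (for example a tmpfs mount)."""
    os.makedirs(node_local_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python releases: refuse unsafe members such as absolute paths.
            tar.extractall(node_local_dir, filter="data")
        else:
            tar.extractall(node_local_dir)
    return node_local_dir


def demo():
    """Round-trip a tiny stand-in image and return the unpacked file's contents."""
    with tempfile.TemporaryDirectory() as work:
        rootfs = os.path.join(work, "rootfs")
        os.makedirs(os.path.join(rootfs, "bin"))
        with open(os.path.join(rootfs, "bin", "hello.txt"), "w") as f:
            f.write("hello from the image\n")
        archive = pack_image(rootfs, os.path.join(work, "image.tar.gz"))
        target = unpack_image(archive, os.path.join(work, "node-local"))
        with open(os.path.join(target, "bin", "hello.txt")) as f:
            return f.read()
```

The point of the design is that the shared filesystem serves one large sequential read per node, instead of the many small metadata operations that a layered image would generate.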
Charliecloud appears to be a good choice for sites that are seeking
scalability, but with strong requirements for runtime security. The
Docker development environment and composition tools are also helpful
for users on a learning curve for containerisation.
Further details on Charliecloud can be found in the informative
paper presented at Supercomputing 2017.
Michael has provided a Charliecloud Vagrantfile
to help people familiarise themselves with it. Charliecloud packages
are expected to ship in the next major release of OpenHPC.
The Road Ahead
The ecosystem around container technology is rapidly evolving, and this
is also true in the niche of HPC.
The Open Container Initiative
The tools for this niche are somewhat bespoke, but thanks to the
efforts of the OCI to break down the established Docker tools into
modular components, there is new scope to build a specialist solution
upon a common foundation.
This initiative has brought about new innovation. Rootless RunC is an approach for using the runc
tool for unprivileged container launch. This approach and its
current limitations are well documented in the above link.
In a similar vein, the CRI-O project is working
on a lightweight container runtime interface that displaces the
Docker daemon from Kubernetes compute nodes, in favour of any
Shifter, Charliecloud and Singularity are not OCI-compliant runtimes,
as they predate the OCI’s relatively recent existence. However,
when the OCI's tools become suitably capable and mature
they are likely to be adopted in Charliecloud and Shifter.
Challenges for HPC
There are signs of activity around developing better support for
RDMA in containerised environments. The RDMA Cgroup
introduced in Linux 4.11 adds support for controlling the
consumption of RDMA communication resources. This is already being
included in the spec for the
RDMA isolation (for example, through a namespace) doesn’t seem to
be currently possible. Current implementations can only pass-through
the host’s RDMA context. This will work fine for HPC configurations
with a scheduling policy not to share a compute node between workloads.
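
For a sense of what the RDMA cgroup controls, the sketch below parses and formats limits in the nested-keyed format that the kernel's cgroup documentation gives for the `rdma.max` file: one line per device, followed by `hca_handle` and `hca_object` limits, with `max` meaning unlimited. The function names are hypothetical and the sample text follows the example in the kernel documentation. Nothing here touches a real cgroup hierarchy.

```python
def parse_rdma_limits(text):
    """Parse rdma.max-style text into {device: {resource: limit}}.

    A limit of None means "max", i.e. no limit is configured.
    """
    limits = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        device, entries = fields[0], fields[1:]
        limits[device] = {}
        for entry in entries:
            key, _, value = entry.partition("=")
            limits[device][key] = None if value == "max" else int(value)
    return limits


def format_rdma_limit(device, **resources):
    """Build one line suitable for writing to an rdma.max file."""
    parts = [f"{key}={'max' if value is None else value}"
             for key, value in resources.items()]
    return " ".join([device] + parts)


sample = "mlx4_0 hca_handle=2 hca_object=2000\nocrdma1 hca_handle=3 hca_object=max\n"
parsed = parse_rdma_limits(sample)
```

Limits of this kind cap how much of a device a container can consume. They do not provide isolation, which is why whole-node scheduling remains the simplest safe configuration.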
The greatest advantages of specialist solutions appear to address
challenges that remain unique to scientific computing. For example:
- Scalable launch of containerised workloads. The approach taken by
Singularity, Shifter and Charliecloud involves using a parallel filesystem
for the distribution of the application container image. This addresses
one of the major differences in use case and design. Distributing
the container as a single image file also greatly reduces filesystem
- Launching multi-node MPI applications in research computing containers.
The Docker runtime complicates interaction with
MPI's Process Management Interface. Shifter's innovation around
replacing container MPI libraries with host MPI libraries is an
intriguing way of specialising a generalised environment. Given
multi-node MPI applications are the standard environment of
conventional HPC infrastructure, running containerised applications
of this form is likely to be a tightly specialised niche use case.
(Most) Paths Converge
A future direction in which HPC runtime frameworks for containerised
applications have greater commonality with the de facto standard
ecosystem around OCI and/or Docker's tools has considerable appeal.
The development environment for Docker containers is rich, and
developing rapidly. The output is portable and interchangeable
between different runtimes. As Michael Jennings of Charliecloud
says, “If we can all play in the same playground, that saves everyone
time, effort and money”.
The HPCAC 2018 Swiss Conference brought together experts from all
the leading options for containerised scientific compute, and was a
rare opportunity to make side-by-side comparisons. Given the rapid
development of this technology I am sure things will have changed
again in time for the 2019 conference.