OpenHPC v2 - Enhancements and Demos

UPDATED: See OpenHPC v2.1 Released below.

The OpenHPC project is a key part of the "HPC" side of StackHPC. It essentially provides a set of packages which make it easy to install the system software for a scheduler-based HPC cluster, including the scheduler itself, compilers, MPI libraries, maths and I/O libraries, performance tools, etc., all integrated into a neat set of hierarchical modules. While OpenHPC’s own documentation has recipes for using Warewulf to create and deploy images for compute nodes, we generally use our stackhpc.openhpc Ansible Galaxy role which can configure all nodes in a cluster from a base image with a single command.

OpenHPC v2.0 was released in October 2020 and we've since deployed it to client systems using our Galaxy role, including the 1000+ node Top 100 and Top 500 systems we recently blogged about. OpenHPC v2.0 is a significant upgrade from the v1.3.9 which preceded it (released in November 2019), with new versions of software at all levels of the stack. However, it requires CentOS 8, so it is not a trivial upgrade, although presumably at least some of these enhancements will eventually show up in the CentOS 7-based v1.3.10 which was originally planned for the end of 2020. Let's take a look at what that upgrade gets you.

Starting with the scheduler, OpenHPC v2.0 updates Slurm to v20.02.5. This adds a "configless" mode which enables compute and login nodes to pull configuration information directly from the Slurm control daemon, rather than having to distribute the config file to every node. As well as simplifying the overall cluster configuration, this approach means that changes to Slurm configuration, such as adding or removing nodes, no longer need to be replicated in compute node images or mounted over a network filesystem, which makes building images for compute nodes considerably simpler.
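As a rough sketch of what configless mode involves (the hostname here is an assumption, and our role handles this automatically), only the control host needs an extra slurm.conf parameter:

```ini
# slurm.conf on the control host only (sketch; hostname is an assumption)
SlurmctldHost=control-0
SlurmctldParameters=enable_configless
```

Compute and login nodes then launch slurmd with --conf-server control-0 (or locate the controller via a DNS SRV record) and cache the fetched configuration locally, so no slurm.conf needs to be shipped to them at all.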

OpenHPC provides a variety of compilers and MPI libraries, but the GCC + OpenMPI combination will be of interest to many users needing a FOSS toolchain. OpenHPC v2.0 updates the packaged GNU compilers from v8 to v9, and OpenMPI from v3 to v4, but possibly the most significant change for users is that OpenMPI is now built against UCX. UCX is a communications framework which aims to provide optimized performance with a unified interface both “up” to the user and “down” to developers across a range of hardware and platforms. While adding yet another layer into the already-complicated HPC interconnect/networking/fabric stack may feel somewhat unhelpful, UCX does simplify life for users. For example, we recently carried out benchmarking of a range of MPI applications to compare performance between InfiniBand and RoCE interconnects. Getting RoCE to work on Mellanox ConnectX-4 cards with the "native" OpenMPI Byte Transport Layer required setting:

OMPI_MCA_btl=openib,self,vader
OMPI_MCA_btl_openib_if_include=mlx5_1:1
OMPI_OPENIB_ROCE_QUEUES="--mca btl_openib_receive_queues P,128,64,32,32,32:S,2048,1024,128,32:S,12288,1024,128,32:S,65536,1024,128,32"

and even then, pingpong tests showed poor performance at some message sizes. Rather than optimising the queues to try to improve this, using UCX provided more consistent RoCE performance by simply setting:

UCX_NET_DEVICES=mlx5_1:1

As a bonus, the MPICH packages in OpenHPC v2.0 also use UCX, and Intel MPI supports it from v2019.5, so working with multiple MPI libraries gets significantly simpler.
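By contrast with the openib settings above, a rough equivalent with the UCX-enabled OpenMPI looks like the following (the application binary is just a placeholder):

```shell
# Select the RoCE interface via UCX and explicitly request OpenMPI's UCX PML
export UCX_NET_DEVICES=mlx5_1:1
mpirun --mca pml ucx ./my_mpi_app
```

The same UCX_NET_DEVICES setting applies unchanged when switching to a UCX-enabled MPICH, which is where the simplification really pays off.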

We've just released v0.7.0 of our Ansible Galaxy OpenHPC role. The first major enhancement in this version is support for the new "configless" mode when using OpenHPC v2.0 (on CentOS 8 - support for OpenHPC v1.x/CentOS 7 is still included, but without this mode). The second major enhancement is new options to configure slurmdbd and the accounting storage plugin. This significantly enhances the accounting information available via sacct compared to the default text-file-based storage, and enables us to build job-specific monitoring dashboards for the cluster. Less obviously, a number of internal tweaks have been made to improve using the role in image build pipelines for compute nodes. For a full list of what's new, see the v0.7.0 release notes.

To maintain backwards compatibility these features aren't enabled by default, so as an example of what's needed, here's the configuration for a Slurm cluster in configless mode, with slurmdbd-based accounting enabled and two partitions:

- name: Setup slurm
  hosts: openhpc
  become: yes
  tags:
    - openhpc
  tasks:
    - import_role:
        name: stackhpc.openhpc
      vars:
        openhpc_enable:
          control: "{{ inventory_hostname in groups['control'] }}"
          batch: "{{ inventory_hostname in groups['compute'] }}"
          database: "{{ inventory_hostname in groups['control'] }}"
          runtime: true
        openhpc_slurm_accounting_storage_type: 'accounting_storage/slurmdbd'
        openhpc_slurmdbd_mysql_password: "{{ secrets_openhpc_mysql_slurm_password }}"
        openhpc_slurm_control_host: "{{ groups['control'] | first }}"
        openhpc_slurm_partitions:
          - name: "hpc"
            default: YES
            maxtime: "3-0" # 3 days 0 hours
          - name: "express"
            default: NO
            maxtime: "1:0:0" # 1 hour 0m 0s
        openhpc_slurm_configless: true
        openhpc_login_only_nodes: login
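For reference, a minimal inventory sketch matching the group names the play above assumes (the hostnames are made up) might look like:

```ini
[control]
ohpc-control-0

[compute]
ohpc-compute-[0:3]

[login]
ohpc-login-0

[openhpc:children]
control
compute
login
```

The control host runs slurmctld and slurmdbd, the compute group runs slurmd, and the login group gets the runtime environment only.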

Using this role for the core Slurm functionality, we're now building a flexible Ansible-based "Slurm appliance" around it which automates deployment and configuration of an entire HPC environment. At present it includes the Slurm-based monitoring as mentioned above, post-deployment performance tests, and additional filesystems, as well as providing production-ready configuration for aspects such as PAM and user limits. Our OpenHPC role makes deploying a Slurm cluster easy, and we're excited that this appliance will provide the same ease of use for a much richer user experience. Watch this space for details ...

OpenHPC v2.1 Released

This version of OpenHPC was released on 6th April 2021 and supports CentOS 8.3. While this release is numbered as a minor version change (and many of the included packages do have minor version upgrades), this is a fairly significant change for Slurm-based systems. Slurm moves from 20.02.5 to 20.11.3, and the release notes for this version list some major changes. New features such as "dynamic future nodes", which allow nodes to be specified by hardware configuration rather than by name, are interesting, but there are two potential pitfalls.

Firstly, the filetxt plugin for accounting storage is no longer supported. While it only supported basic accounting features, it was enabled simply by setting a slurm.conf parameter (which our Ansible Galaxy OpenHPC role did by default). Now, enabling accounting requires setting up a MySQL or MariaDB database and the Slurm database daemon. For production clusters this is probably preferable anyway (and is supported by our OpenHPC role), but in some situations the simplicity of the filetxt approach will be missed. One partial mitigation is to use job completion accounting instead, which allows sacct -c to at least show job completion information. This is again a simple slurm.conf change and is supported by our Galaxy OpenHPC role.
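As a sketch of the job completion mitigation (the log path here is an assumption), this is a two-line slurm.conf change:

```ini
# slurm.conf: log job completion records to a flat file
JobCompType=jobcomp/filetxt
JobCompLoc=/var/log/slurm/jobcomp.log
```

sacct -c then reads its completion records from this file rather than from the accounting storage.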

Secondly, a 20.11.3 slurmd cannot communicate with a 20.02.5 slurmctld. As per Slurm's versioning scheme the major release is given by the first two parts of the version number, and a slurmd may be older than its slurmctld but never newer, so this is not surprising. As the newer version can read statefiles, etc., from the old version, a smooth upgrade is possible.
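To make the versioning scheme concrete, here is a trivial shell sketch of how the major release is derived from a version string:

```shell
# Slurm's "major release" is the first two components of the version string,
# so 20.02.5 and 20.11.3 belong to different major releases.
major() { echo "$1" | cut -d. -f1-2; }
major 20.02.5   # prints 20.02
major 20.11.3   # prints 20.11
```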

The problem is that both OpenHPC v2.x versions are in the same repos. So using our Galaxy OpenHPC role on a CentOS 8.x system before 6th April created an OpenHPC v2.0 node using Slurm 20.02.5, whereas now it creates an OpenHPC v2.1 node using Slurm 20.11.3. So not only did the role start failing in CI (due to the now-unsupported default accounting configuration), but adding a compute node to an existing cluster by rerunning the role failed, as the updated packages on the new compute node meant it couldn't communicate with the older slurmctld.

If required, OpenHPC version pinning should be achievable by modifying the installed repo configurations, but this will be messy and will need amending for each new version. For now, note that:

  • Running old versions of our Galaxy OpenHPC role against an existing cluster (with no new nodes) will not cause problems, as the role does not update packages itself.
  • Do not run a yum/dnf update of *-ohpc packages unless done as part of a Slurm upgrade.
  • Use the new v0.8 release of our Galaxy OpenHPC role for all new CentOS 8.x clusters, if using the default accounting configuration. This version disables accounting by default and is therefore compatible with Slurm 20.11.3.
  • Adding nodes to existing OpenHPC v2.0 clusters should probably be done using existing images, rather than by directly installing OpenHPC on the node (whether via our Galaxy OpenHPC role or any other means).
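One possible approach to pinning, offered as a sketch we haven't relied on here (the plugin package ships in the standard CentOS 8 repos), is dnf's versionlock plugin:

```shell
# Pin the currently-installed Slurm packages so a later "dnf update" skips them
dnf install -y python3-dnf-plugin-versionlock
dnf versionlock add 'slurm*-ohpc'
dnf versionlock list
```

This pins whatever Slurm package versions are currently installed, so it would need rerunning as part of any deliberate Slurm upgrade.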

Get in touch

If you would like to get in touch we would love to hear from you. Reach out to us via Twitter or directly via our contact page.
