A team from Intel including Sunil Mahawar, Yih Leong Sun and Jeff
Adams have already presented their work
at the latest OpenStack summit in Boston.
Independently, we have been working on our own OpenHPC clusters as
one of our scientific cluster applications on the SKA performance prototype system, and
I'm going to share some of the components we have used to make this
project happen. For example, our OpenHPC compute image is composed from
diskimage-builder elements, defined like this:
os_images_list:
  # Build of OpenHPC image on a CentOS base
  - name: "CentOS7-OpenHPC"
    elements:
      - "centos7"
      - "epel"
      - "openhpc"
      - "selinux-permissive"
      - "dhcp-all-interfaces"
      - "vm"
    env:
      DIB_OPENHPC_GRPLIST: "ohpc-base-compute ohpc-slurm-client 'InfiniBand Support'"
      DIB_OPENHPC_PKGLIST: "lmod-ohpc mrsh-ohpc lustre-client-ohpc ntp"
      DIB_OPENHPC_DELETE_REPO: "n"
    properties:
      os_distro: "centos"
      os_version: 7

os_images_elements: ../stackhpc-image-elements
The cluster infrastructure itself is created by an Ansible playbook that invokes
our stackhpc.cluster-infra role:

---
# This playbook uses the Ansible OpenStack modules to create a cluster
# using a number of baremetal compute node instances, and configure it
# for a SLURM partition
- hosts: openstack
  roles:
    - role: stackhpc.cluster-infra
      cluster_name: "{{ cluster_name }}"
      cluster_params:
        cluster_prefix: "{{ cluster_name }}"
        cluster_keypair: "{{ cluster_keypair }}"
        cluster_groups: "{{ cluster_groups }}"
        cluster_net: "{{ cluster_net }}"
        cluster_roles: "{{ cluster_roles }}"
Authenticating our users: On this prototype system, we currently
have a small number of users, and these users are locally defined within
Keystone. In a larger production environment, a more likely scenario would
be that the users of an OpenStack cloud are stored within external
authentication infrastructure, such as LDAP.
Equivalent user accounts must be created on our OpenHPC cluster. Users
need to be able to log in on the externally-facing login node. The users
should also be defined on the batch compute nodes, but they should not be able
to log in to those instances.
Our solution is to enable our users to authenticate using Keystone on
the login node. This is done using two projects,
PAM-Python and
PAM-Keystone - a minimal
PAM module that performs auth requests using the Keystone API.
Using these, our users benefit from common authentication across OpenStack and
all the resources created on it.
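To give a flavour of how this fits together, here is a minimal sketch in Ansible, not
our exact configuration: it assumes a login node inventory group called cluster_login
and a hypothetical install path of /opt/pam-keystone/pam_keystone.py for the
PAM-Keystone script.

# A sketch, not a drop-in: prepend a Keystone entry to the sshd PAM stack on
# the login node, using PAM-Python's pam_python.so to invoke PAM-Keystone.
# The group name and script path are assumptions for illustration.
- hosts: cluster_login
  become: true
  tasks:
    - name: Authenticate SSH logins against Keystone via PAM-Python
      blockinfile:
        path: /etc/pam.d/sshd
        insertbefore: BOF
        marker: "# {mark} ANSIBLE MANAGED: Keystone auth"
        block: |
          auth sufficient pam_python.so /opt/pam-keystone/pam_keystone.py

Marking the entry as sufficient means that local system accounts, such as admin and
service users, still authenticate normally if Keystone is unreachable.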
Access to cluster filesystems: OpenHPC clusters require a common filesystem
mounted across all nodes managed by the workload manager. One possible solution
here would be to use Manila, but our
bare metal infrastructure may complicate its use, so this remains an area for future
exploration in this project.
We are using CephFS, exported from our local Ceph cluster, with an all-SSD
pool for metadata and a journaled pool for file data. Our solution defines
a CephX key, shared between project users, which enables access to the CephFS
storage pools and metadata server. This CephX key is stored in Barbican. Barbican appears to be an area where
support in Shade and Ansible's
own OpenStack modules
is limited. We have written an Ansible role for retrieving secrets from
Barbican and storing them as facts, and we'll be working to package it and publish
it on Galaxy in due course.
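Until that role is published, the sketch below illustrates the general approach rather
than the role itself: it fetches the secret payload using the python-barbicanclient
plugin for the OpenStack CLI, records it as a fact, and uses it to mount CephFS. The
variable names (cephx_secret_href, cephfs_mon_host, cephfs_user), the mount point and
the exact CLI flags are assumptions to adapt to your own deployment.

# Illustrative only: retrieve a CephX key from Barbican and mount CephFS with it.
# All variable names and paths below are placeholders, not project defaults.
- hosts: cluster
  become: true
  tasks:
    - name: Fetch the CephX key payload from Barbican
      # Assumes the python-barbicanclient CLI plugin and OpenStack credentials
      # are available on the Ansible control host.
      command: openstack secret get "{{ cephx_secret_href }}" --payload -f value
      register: cephx_secret
      changed_when: false
      delegate_to: localhost
      become: false

    - name: Record the key as a fact for later tasks
      set_fact:
        cephx_key: "{{ cephx_secret.stdout }}"

    - name: Ensure /etc/ceph exists
      file:
        path: /etc/ceph
        state: directory
        mode: "0755"

    - name: Write the key to a root-only secret file
      copy:
        content: "{{ cephx_key }}"
        dest: /etc/ceph/cephfs.secret
        mode: "0600"

    - name: Mount CephFS using the retrieved key
      mount:
        path: /mnt/cephfs
        src: "{{ cephfs_mon_host }}:/"
        fstype: ceph
        opts: "name={{ cephfs_user }},secretfile=/etc/ceph/cephfs.secret"
        state: mounted

Keeping the key in Barbican rather than in the playbook or inventory means it never
needs to be committed to version control, and the same pattern extends to any other
per-project secret the cluster needs.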
Converting infrastructure into platform: Once we have built upon the
infrastructure to add the support we need, the next phase is to configure and
start the platform services. In this case, we generate a SLURM configuration
that draws on the infrastructure inventory to define the controllers and workers
of the SLURM partition.
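As a simplified illustration of the idea (the inventory group names cluster_control
and cluster_compute are assumptions, and this is not our actual template), a
Jinja2-templated slurm.conf fragment can derive its node and partition definitions
directly from the Ansible inventory:

# slurm.conf fragment rendered with Ansible's template module.
# The inventory group names below are assumptions for illustration.
ClusterName={{ cluster_name }}
ControlMachine={{ groups['cluster_control'] | first }}

{% for host in groups['cluster_compute'] %}
NodeName={{ host }} State=UNKNOWN
{% endfor %}

PartitionName=compute Nodes={{ groups['cluster_compute'] | join(',') }} Default=YES State=UP

Because the controller and worker lists come straight from the inventory generated
when the infrastructure was created, rebuilding or resizing the cluster automatically
yields a consistent SLURM configuration.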
Adding Value in a Cloud Context
In the first instance, cloud admins recreate application environments
that meet user requirements, defined in software and deployed on demand.
The convenience of their creation is probably offset by a slight overhead
in performance, and on balance an indifferent user might not see a
compelling benefit to working this way. Our OpenHPC-as-a-Service example
described here largely falls into this category.
Don't stop here.
Software-defined cloud methodologies enable us to do some more
imaginative things in order to make our clusters the best they
possibly can be. We can introduce infrastructure services for
consuming and processing syslog streams, simplifying the administrative
workload of cluster operation. We can automate monitoring services
to ensure smooth cluster operation, and provide application performance
telemetry as standard to assist users with optimisation. We can
help admins secure the cluster.
All of these things are attainable, because we have moved from
managing a deployment to developing the automation of that deployment.
Reducing the Time to Science
Our users have scientific work to do, and our OpenStack projects
exist to support that.
We believe that OpenStack infrastructure can go beyond simply recreating
conventional scientific application clusters, and deliver Cluster-as-a-Service
deployments that integrate cloud technologies to make those clusters even better.