StackHPC Ltd//www.stackhpc.com/2022-10-26T12:00:00+01:00Azimuth - enabling computational scientists to manage science workflows in the cloud2022-10-26T12:00:00+01:002022-10-26T12:00:00+01:00Matt Ansontag:www.stackhpc.com,2022-10-26:/azimuth-introduction.html<p class="first last">Azimuth - enabling scientists to manage science environments in the cloud.</p>
<p>Computing is a core requirement of many scientific disciplines. Historically the preserve of the physical sciences, computational methods have in recent years become a central theme of the life and social sciences too, with the advent of disciplines like bioinformatics and data science.</p>
<p>To enable these endeavours, High Performance- and High Throughput Computing (HPC and HTC) have long been the cornerstone of computing in science. Typically, institutions and organisations have a central HPC system with a job scheduler that queues up scientists' requests for an allocation of computing infrastructure, then makes it available in a fair way. The "fairness weighting" of a request can depend on many factors - the overall utilisation of the HPC system, the "dimensions" (CPU, memory and time) of the request, and whether additional bits of hardware - like a GPU - are required for the job.</p>
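<p>As a concrete illustration, a batch job typically declares these "dimensions" up front in its submission script. The following is a minimal, hypothetical example for the Slurm scheduler - the job name, resource values and program are placeholders:</p>
<div class="highlight"><pre>#!/bin/bash
#SBATCH --job-name=example-job
#SBATCH --ntasks=4              # CPU cores
#SBATCH --mem-per-cpu=4G        # memory per core
#SBATCH --time=02:00:00         # wall-clock time limit
#SBATCH --gres=gpu:1            # optional: request a GPU

srun ./my_simulation
</pre></div>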
<div class="section" id="fair-isn-t-always-flexible-enter-the-cloud">
<h2>Fair isn't always flexible - enter the cloud!</h2>
<p>While HPC systems and their schedulers are designed for fairness, this doesn't always translate to flexibility. The requirements of modern science workloads frequently depart from the environment provided by traditional HPC configurations - for example, the software that you want to use might require a specific operating system, or its RAM requirements may exceed the RAM per CPU core (or even a whole HPC compute node!) available in traditionally specced and configured HPC systems. Additionally, data-science-heavy disciplines making use of Artificial Intelligence (AI) and Machine Learning (ML) often depend on interactive and collaborative exploration of data using tools like Jupyter Notebooks. This process doesn't necessarily fit into the paradigm of queueing up an HPC job that allocates your resources when the scheduler says so, and for exactly the amount of time requested!</p>
<p><strong>Enter the cloud!</strong> Many organisations are now dedicating part or all of their datacentre real estate to on-premise private cloud, and many more are trying to remove the barriers to using public cloud that scientists have experienced in the past. Using cloud computing infrastructure unlocks complete flexibility for researchers - you can use an operating system that fits the requirements of your software tooling, you can configure your environment exactly as you like because you are able to use the <tt class="docutils literal">root</tt> account, and you can pick a "flavour" (or specification) of your compute environment to exactly fit your workload.</p>
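<p>For example, with the OpenStack command-line client, a researcher can inspect the available flavours and launch an instance tailored to their workload (the flavour, image, key and network names below are illustrative):</p>
<div class="highlight"><pre>$ openstack flavor list
$ openstack server create --flavor gpu.large --image ubuntu-20.04 \
    --key-name my-key --network my-network my-workstation
</pre></div>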
</div>
<div class="section" id="flexibility-almost-always-means-complexity">
<h2>Flexibility almost always means complexity</h2>
<p>While the promise of the flexibility of cloud computing is great, the reality is that flexibility introduces complexity. That is not to say that crafting a well-written HPC job submission script isn't complex, but configuring your own scientific computing environment in the cloud from scratch certainly isn't always intuitive, particularly for anyone that doesn't have some background in managing or operating compute infrastructure. In particular, configuring virtual networking, security groups and SSH keys is a notorious barrier to researchers becoming productive in the cloud in a performant and secure way.</p>
<p>Indeed, this is where the new breed of workflow managers such as <a class="reference external" href="https://nextflow.io">Nextflow</a>, <a class="reference external" href="https://snakemake.github.io">Snakemake</a> and <a class="reference external" href="https://toil.ucsc-cgl.org">Toil</a> can help ease the transition from running workloads on traditional HPC to running in the cloud. These tools abstract the process of creating, using and destroying short-lived cloud infrastructure specifically for the purposes of the workload that they are running, but they are not designed to provide long-lived cloud resources such as Linux desktops, Kubernetes clusters and batch computing clusters with Slurm.</p>
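<p>For instance, Nextflow can dispatch the same pipeline to a laptop, an HPC scheduler or the cloud simply by selecting a different execution profile - an illustrative invocation of a community pipeline might look like:</p>
<div class="highlight"><pre>$ nextflow run nf-core/rnaseq -profile test,docker
</pre></div>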
</div>
<div class="section" id="introducing-azimuth">
<h2>Introducing Azimuth</h2>
<div class="figure">
<a class="reference external image-reference" href="//www.stackhpc.com/images/azimuth-logo-blue-text.png"><img alt="Azimuth logo in blue" src="//www.stackhpc.com/images/azimuth-logo-blue-text.png" style="width: 400px;" /></a>
</div>
<p>Azimuth provides a self-service portal for managing long(er)-lived cloud resources - "science platforms" - with a focus on simplifying the use of cloud for scientific computing and artificial intelligence (AI) use cases. It is currently capable of targeting OpenStack clouds, which many organisations and institutions provide as an on-premise private cloud, although it is specifically architected to be cloud-agnostic.</p>
<p>Azimuth is based on prior work with <a class="reference external" href="https://jasmin.ac.uk/">JASMIN Cloud</a> and is a simplified version of the <a class="reference external" href="https://docs.openstack.org/horizon/latest/">OpenStack Horizon</a> dashboard, with the aim of reducing the "getting-started" complexity for users that don't have a background in running compute infrastructure. Azimuth abstracts away complexity with opinionated, safe default configuration to enable scientists and researchers to "get on with the science" without ever having to encounter cloud-specific concepts like virtual networking or security groups. It offers functionality with a focus on simplicity for scientific use cases, including the ability to create complex science platforms via a user-friendly web interface.</p>
<p>Science platforms range from single-machine workstations with graphical desktops and consoles available securely via the web, to entire multi-server <a class="reference external" href="https://slurm.schedmd.com/">Slurm</a> clusters and platforms such as <a class="reference external" href="https://jupyter.org/hub">JupyterHub</a> and other <a class="reference external" href="https://kubernetes.io/">Kubernetes</a>-native platforms. It even supports creating <a class="reference external" href="https://sonobuoy.io/">Sonobuoy</a>-conformant Kubernetes clusters that you can use to provide an execution environment for your favourite workflow manager!</p>
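<p>As a rough sketch, the conformance of such a cluster can be verified with the Sonobuoy command-line tool, assuming a kubeconfig for the cluster is in place:</p>
<div class="highlight"><pre>$ sonobuoy run --mode quick --wait
$ sonobuoy results $(sonobuoy retrieve)
</pre></div>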
</div>
<div class="section" id="azimuth-science-platforms">
<h2>Azimuth science platforms</h2>
<p>StackHPC distribute a range of science platforms with Azimuth which, after extensive feedback from the JASMIN and other science communities, aim to enable a variety of science workloads.</p>
<table border="1" class="colwidths-given docutils">
<caption><strong>Science Platforms Distributed with Azimuth</strong></caption>
<colgroup>
<col width="30%" />
<col width="70%" />
</colgroup>
<tbody valign="top">
<tr><td><strong>Linux workstation</strong></td>
<td><ul class="first last simple">
<li>Provides secure web access to a Linux cloud console (command line) or a full, graphical desktop environment - a "bigger laptop".</li>
<li>Allows scientists to use a familiar GUI/desktop, but with Linux underneath.</li>
<li>Access via a web browser means no complicated setup of SSH clients.</li>
<li>If supported by the underlying OpenStack cloud, scientists can access esoteric hardware configurations like GPUs and large-memory machines.</li>
</ul>
</td>
</tr>
<tr><td><strong>Jupyter Notebook</strong></td>
<td><ul class="first last simple">
<li>Provides a <a class="reference external" href="https://mybinder.org/">Binder</a>-like experience, turning a compliant git repository into an interactive Jupyter Notebook using <a class="reference external" href="https://github.com/jupyterhub/repo2docker">repo2docker</a>.</li>
</ul>
</td>
</tr>
<tr><td><strong>Slurm batch computing cluster</strong></td>
<td><ul class="first last simple">
<li>Provides a completely configured installation of the <a class="reference external" href="https://github.com/stackhpc/ansible-slurm-appliance">StackHPC Slurm appliance</a>.</li>
<li>Clusters are accessible via SSH, or the <a class="reference external" href="https://openondemand.org/">Open OnDemand</a> web interface.</li>
</ul>
</td>
</tr>
<tr><td><strong>Kubernetes cluster</strong></td>
<td><ul class="first last simple">
<li>Deploys a fully functional Kubernetes cluster, optionally with autoscaling, so that the overall size of your Kubernetes cluster is responsive to the size of your workload.</li>
<li>Easily generate configuration files for the <a class="reference external" href="https://kubernetes.io/docs/reference/kubectl/">kubectl</a> and <a class="reference external" href="https://helm.sh/">Helm</a> command-line tools, so that you can customise and install packages on your Kubernetes cluster (see the sketch below this table).</li>
</ul>
</td>
</tr>
<tr><td><strong>JupyterHub/DaskHub/Pangeo</strong></td>
<td><ul class="first last simple">
<li>Installs a multi-user JupyterHub/DaskHub/Pangeo onto an Azimuth-deployed Kubernetes cluster.</li>
<li>Single Sign-On integration with OpenStack (via the <a class="reference external" href="https://github.com/stackhpc/zenith">Zenith</a> tunnelling HTTP(S) proxy) - only authenticated members of your OpenStack project can create and access JupyterHub/DaskHub/Pangeo resources.</li>
</ul>
</td>
</tr>
</tbody>
</table>
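<p>As a sketch of the kubectl/Helm workflow referenced in the table above: once Azimuth has generated a kubeconfig for your cluster, the standard tools work as usual (the file path, repository and release names below are illustrative):</p>
<div class="highlight"><pre>$ export KUBECONFIG=~/my-cluster-kubeconfig
$ kubectl get nodes
$ helm repo add bitnami https://charts.bitnami.com/bitnami
$ helm install my-db bitnami/postgresql
</pre></div>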
<div class="figure">
<a class="reference external image-reference" href="//www.stackhpc.com/images/azimuth-platform-picker.png"><img alt="Azimuth platform picking dialog" src="//www.stackhpc.com/images/azimuth-platform-picker.png" style="width: 750px;" /></a>
</div>
<p>All platforms come with their own monitoring stack based on <a class="reference external" href="https://prometheus.io">Prometheus</a> and <a class="reference external" href="https://grafana.com">Grafana</a>, accessible via a web browser; this gives immediate insight into how your workload interacts with your compute environment.</p>
</div>
<div class="section" id="try-azimuth">
<h2>Try Azimuth</h2>
<p>Azimuth is <a class="reference external" href="https://github.com/stackhpc/azimuth">free and open-source</a>, and it is designed to run on the same OpenStack cloud that it creates science platforms on.</p>
<p>If your organisation uses OpenStack to provide cloud infrastructure, and you are a cloud operator or a keen researcher with some OpenStack quota - we provide an <a class="reference external" href="https://stackhpc.github.io/azimuth-config/try/">easy-to-deploy demo configuration</a>, which you can use to try Azimuth on your own cloud.</p>
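<p>The gist of the demo deployment, at the time of writing, is to clone the configuration repository and run the provisioning playbook against the bundled demo environment. The commands below are indicative only - the linked documentation is the authoritative reference:</p>
<div class="highlight"><pre>$ git clone https://github.com/stackhpc/azimuth-config
$ cd azimuth-config
$ # set up a Python virtualenv and activate the demo environment
$ # as described in the documentation, then:
$ ansible-playbook stackhpc.azimuth_ops.provision
</pre></div>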
<p>For production-ready deployments and further information on the inner workings of Azimuth, we provide an in-depth <a class="reference external" href="https://stackhpc.github.io/azimuth-config/">guide</a> to its continuous deployment, management and architecture.</p>
</div>
<div class="section" id="acknowledgements">
<h2>Acknowledgements</h2>
<p>StackHPC greatly appreciate the support provided through the Science and Technology Facilities Council (<a class="reference external" href="https://www.ukri.org/councils/stfc/">STFC</a>), and in particular the <a class="reference external" href="https://www.iris.ac.uk/">IRIS</a>, <a class="reference external" href="https://dirac.ac.uk/">DiRAC</a> and <a class="reference external" href="https://jasmin.ac.uk/">JASMIN</a> communities.</p>
</div>
<div class="section" id="get-in-touch">
<h2>Get in touch</h2>
<p>If you would like to get in touch we would love to hear
from you. Reach out to us via <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a>
or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact page</a>.</p>
</div>
Achieving Consistency in an Inconsistent World: Hardware Anomaly Detection2022-10-24T14:30:00+01:002022-10-24T14:30:00+01:00Matt Creestag:www.stackhpc.com,2022-10-24:/advise-announcement.html<p class="first last">Change is inevitable. Servers get repurposed. We don't always get quite the hardware that we asked for. How do we make it good?</p>
<div class="figure">
<a class="reference external image-reference" href="//www.stackhpc.com/images/advise-stock.jpg"><img alt="Photo by `Randy Fath <https://unsplash.com/@randyfath?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">`_ on `Unsplash <https://unsplash.com/s/photos/odd-one-out?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText>`_" src="//www.stackhpc.com/images/advise-stock.jpg" style="width: 450px;" /></a>
</div>
<p>Change is inevitable. Servers get repurposed. We don't always get quite the hardware that we asked for, with occasional early mortality, misconfiguration or simple inconsistency between servers supplied by a vendor. As such, it is important to consider that a system which is intended to have identical components may in fact have hidden differences. It is possible to gather hardware introspection data with both <a class="reference external" href="https://docs.openstack.org/kayobe/latest/deployment.html#saving-hardware-introspection-data">Kayobe (via Bifrost)</a> and user-space <a class="reference external" href="https://docs.openstack.org/ironic/latest/admin/inspection.html">Ironic</a>. At StackHPC, we use this data extensively during infrastructure commissioning, as <a class="reference external" href="//www.stackhpc.com/ironic-idrac-ztp.html">documented previously</a> in this blog. However, using this data to identify server discrepancies has always been challenging, and reviewing this data can be a particularly daunting task.</p>
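<p>With Kayobe, for example, saving the introspection data for all overcloud hosts to the control host is a single command:</p>
<div class="highlight"><pre>$ kayobe overcloud introspection data save
</pre></div>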
<div class="section" id="advise-anomaly-detection-visualiser">
<h2>ADVise - Anomaly Detection Visualiser</h2>
<p>We have developed <a class="reference external" href="https://github.com/stackhpc/ADVise">ADVise</a> (Anomaly Detection VISualisEr). This tool will analyse hardware introspection data and provide graphs and summaries to help you identify unexpected hardware and performance anomalies. ADVise follows a two-pronged approach. It will extract and visualise differences between the reported hardware attributes, and will analyse and graph any benchmarked performance metrics.</p>
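<p>A hypothetical invocation might look like the following - the command name and paths are purely illustrative, and the ADVise repository documents the actual interface:</p>
<div class="highlight"><pre>$ # hypothetical usage - see the ADVise README for the real interface
$ advise --input introspection-data/ --output results/
</pre></div>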
<div class="section" id="hardware-attritubes">
<h3>Hardware Attritubes</h3>
<p>Here we have an anonymised case study on a selection of 143 compute nodes that are intended to be identical systems. Through the use of ADVise, we instead found five nodes which stray from the collective. The manufacturer has provided an unexpected gift: one node has a newer motherboard version than the rest. Three of the nodes were previously used as controllers, and after being recommissioned as compute nodes, they still require a BIOS update. We also found two nodes which were not reporting any logical cores, one of which specifically had multithreading disabled. While only some of these anomalies are critical enough to require further action, they are all worth being aware of.</p>
<div class="figure">
<a class="reference external image-reference" href="//www.stackhpc.com/images/advise-hardware-attributes.png"><img alt="ADVise hardware attributes" src="//www.stackhpc.com/images/advise-hardware-attributes.png" style="width: 550px;" /></a>
<p class="caption"><cite>Systems grouping, with a difference in firmware. Difference visualisation (left) and data on the differing values (right).</cite></p>
</div>
</div>
<div class="section" id="benchmarking-for-anomalies-in-performance">
<h3>Benchmarking for Anomalies in Performance</h3>
<p>ADVise also analyses the benchmarking data to identify any potentially abnormal behaviour. Groups may have a high variance in performance, which could indicate issues, as identical hardware should be expected to perform at roughly the same level. The tool also highlights where individual nodes may be over- or underperforming compared to the rest of the group. This could warrant further investigation into potential causes, particularly if a node is found to be consistently underperforming. The plot below is an example of how anomalous data may appear. In this case, a few nodes were marked as underperforming outliers on a memory benchmark. We can then review the rest of the memory benchmark data on this group. The remaining benchmarks did not have any outliers, suggesting that there are not any faults with the memory of the systems.</p>
<div class="figure">
<a class="reference external image-reference" href="//www.stackhpc.com/images/advise-boxplot.png"><img alt="Boxplot of memory benchmark data" src="//www.stackhpc.com/images/advise-boxplot.png" style="width: 750px;" /></a>
<p class="caption"><cite>Box plot of a memory benchmark across the compute nodes. In these plots, the box itself covers the 25th to 75th percentiles. The whiskers extend to the furthest datapoint within 1.5 times the interquartile range. Outliers beyond this are plotted as individual points. Two nodes are seen to be particularly significant outliers.</cite></p>
</div>
<div class="figure">
<a class="reference external image-reference" href="//www.stackhpc.com/images/advise-performance.png"><img alt="Additonal memory benchmark results" src="//www.stackhpc.com/images/advise-performance.png" style="width: 750px;" /></a>
<p class="caption"><cite>List of other memory benchmarks on the same group. These are marked as 'CONSISTENT', indicating that there are no outlier nodes.</cite></p>
</div>
<p>While this was only one isolated instance, multiple plots showing the same node as an outlier would indicate statistical significance. If a node is consistently underperforming across all metrics, this could suggest that an external factor is hampering performance. For example, we've previously encountered offline power supplies, which caused some servers to slow down. While these performance metrics cannot always identify causes, they do allow us to be aware that the issues exist, and help to narrow down which nodes need our attention.</p>
<p>When working with large-scale systems, things will never be in the exact state that we would like. With the help of ADVise, we can discover where expectations differ from reality and manage hardware differences before they cause problems.</p>
</div>
</div>
<div class="section" id="credit">
<h2>Credit</h2>
<p>ADVise revives the package <tt class="docutils literal">cardiff</tt> from <a class="reference external" href="https://github.com/redhat-cip/hardware/">https://github.com/redhat-cip/hardware/</a>, with thanks and appreciation to its <a class="reference external" href="https://github.com/ErwanAliasr1">original author</a>.</p>
<p>ADVise integrates the package <tt class="docutils literal">mungetout</tt> from <a class="reference external" href="https://github.com/stackhpc/mungetout">https://github.com/stackhpc/mungetout</a>.</p>
</div>
<div class="section" id="get-in-touch">
<h2>Get in touch</h2>
<p>If you would like to get in touch we would love to hear
from you. Reach out to us via <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a>
or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact page</a>.</p>
</div>
CREATE: OpenStack at King's College London2022-03-25T16:00:00+00:002022-03-25T16:00:00+00:00John Taylortag:www.stackhpc.com,2022-03-25:/kcl-create.html<p class="first last">StackHPC and King’s College London collaborate on extending
the reach of OpenStack Kayobe to Ubuntu</p>
<p><strong>Bristol, England</strong>: <a class="reference external" href="https://www.stackhpc.com">StackHPC Ltd</a>
are pleased to announce today the culmination of a major collaboration
with King’s College London on the provision of a Research Computing
environment built upon StackHPC’s preferred OpenStack distribution
environment, <a class="reference external" href="https://docs.openstack.org/kayobe/latest">OpenStack Kayobe</a>, with significant input from King’s
e-Research staff in making this environment available on Ubuntu
Focal 20.04. Previously, OpenStack Kayobe was only supported on
CentOS and latterly CentOS Stream (see the <a class="reference external" href="https://docs.openstack.org/kayobe/latest/support-matrix.html">matrix of supported options</a>).</p>
<p>The new OpenStack deployment forms the core of the newly launched
King’s Computational Research Engineering and Technology Environment
(<a class="reference external" href="https://docs.er.kcl.ac.uk/">CREATE</a>). CREATE provides the
following services:</p>
<ul class="simple">
<li>CREATE Cloud: a private cloud platform to provide flexible,
scalable development environments and allow users greater control
over their own research computing resources using virtual machines;</li>
<li>CREATE HPC (High Performance Computing): a compute cluster with
CPU and GPU nodes, fast network interconnects and shared storage,
for large scale simulations and data analytics;</li>
<li>CREATE RDS (Research Data Storage): a very large, highly resilient
storage area for longer term curation of research data;</li>
<li>CREATE TRE (Trusted Research Environment): tightly controlled
project areas making use of Cloud and HPC resources to process
sensitive datasets (e.g. clinical PIID) complying with NHS Digital
audit standards (DSPT);</li>
<li>CREATE Web: a self-service web hosting platform for static content
(HTML/CSS/JS) and WordPress sites;</li>
</ul>
<p>See our accompanying <a class="reference external" href="//www.stackhpc.com/resources/StackHPC-KCL-CREATE.pdf">press release</a> for full details.</p>
All aboard the Release Train2022-03-24T12:00:00+00:002022-03-24T12:00:00+00:00Mark Goddardtag:www.stackhpc.com,2022-03-24:/all-aboard-the-release-train.html<p class="first last">How we learned to stop worrying and deploy CentOS Stream.</p>
<p>Spare a thought for CentOS users: it's been a bumpy ride for the last few
years. CentOS 7 to 8 migrations, CentOS Linux 8 end-of-life, CentOS Stream. New
RHEL clones such as Rocky Linux and Alma Linux stepping up to fill the void.
How to respond?</p>
<p>In this blog post, we'll look at how these events affected us and our clients,
and the steps we took to mitigate the risks involved in using CentOS Stream.</p>
<div class="section" id="timeline">
<h2>Timeline</h2>
<p>Here is a brief timeline of the major events in the CentOS saga.</p>
<dl class="docutils">
<dt>24th September 2019</dt>
<dd>CentOS Linux 8 general availability. CentOS Stream 8 is introduced at the
same time, described as an alternative, parallel distribution, upstream of
RHEL, with a rolling release model.</dd>
<dt>16th October 2019</dt>
<dd>OpenStack Train release general availability.</dd>
<dt>1st January 2020</dt>
<dd>Python 2 reaches end of life, adding pressure to reinstall CentOS 7
systems with the then shiny new CentOS Linux 8.</dd>
<dt>8th December 2020</dt>
<dd>CentOS project accelerates CentOS Linux end-of-life, bringing it forward by
8 years to 31st December 2021. CentOS Stream remains.</dd>
<dt>11th December 2020</dt>
<dd>CentOS co-founder Gregory Kurtzer announces Rocky Linux, a RHEL clone.</dd>
<dt>31st December 2021</dt>
<dd>RIP CentOS Linux :(</dd>
</dl>
</div>
<div class="section" id="openstack-train-release">
<h2>OpenStack Train release</h2>
<p>The OpenStack Train release (not to be confused with the StackHPC Release
train!) was a big one for the Kolla projects. With CentOS 7 providing
insufficient support for Python 3, we needed to get to CentOS 8 to support
Python 3 before the end-of-life of Python 2. Train therefore needed to support
both CentOS 7 (with Python 2) and CentOS 8 (with Python 3), and to provide a route
for migration. Major OS version upgrades are not supported in CentOS, so a reinstall is
required. The <a class="reference external" href="https://opendev.org/openstack/kolla-ansible/src/branch/master/specs/centos8-migration.rst">Kolla Ansible spec</a>
provides some of the gory details.</p>
<p>It took significant development effort to make this migration possible.
While the automation provided by Kayobe and Kolla Ansible helps significantly
to perform the migration, it still involves significant operator effort to
reinstall each host and keep the cloud running.</p>
</div>
<div class="section" id="centos-linux-eol">
<h2>CentOS Linux EOL</h2>
<p>Having fairly recently migrated our clients' systems to CentOS 8, the
announcement of the end-of-life of CentOS Linux came as quite a blow.
It also required us to make a decision about how to proceed.</p>
<p>Should we jump into the Stream? CentOS Stream seemed to divide the community,
with some claiming it to be not production-ready, and others having faith in
its CI testing.</p>
<p>Should we try out one of the new RHEL clones? Rocky Linux seemed promising,
but would not be released until June 2021, and would need to build a
sustainable community around it to be viable.</p>
<p>Should we switch to another distribution entirely? We started development of
Ubuntu support for Kayobe around this time. Tempting, but migrating from
CentOS to Ubuntu could be a bigger challenge than migrating from CentOS 7 to 8,
and a hard sell for our existing clients.</p>
</div>
<div class="section" id="taking-the-plunge">
<h2>Taking the plunge</h2>
<p>In the end, with a little encouragement from the direction taken by CERN, we
learned to stop worrying and deploy CentOS Stream. Of course, in order to stop
worrying, we needed to mitigate the risks of using a rolling release
distribution.</p>
<p>CentOS Linux was a rebuild of RHEL, meaning that it followed the same release
cycle, albeit delayed by a few weeks. The main CentOS mirrors served the latest
minor release (e.g. 8.2), and were updated every few months with a new minor
release. Security patches and critical bug fixes were applied between minor
releases. While the minor releases sometimes introduced issues that needed to
be fixed or worked around, this model provided fairly long periods of
stability.</p>
<p>With CentOS Stream, there is a constant trickle of updates, and the state at
any point lies somewhere between one minor release of RHEL and the next. In
fact, it may be worse than this, if packages are upgraded then downgraded again
before the next minor RHEL release. There is some CI testing around changes to
Stream; however, it clearly won't be as rigorous as the testing applied to a
RHEL minor release.</p>
</div>
<div class="section" id="islands-in-the-stream">
<h2>Islands in the Stream</h2>
<p>How to cope with this instability? For StackHPC, the answer lies in creating
snapshots of the CentOS Stream package repositories. Stable, tested islands in
the stream.</p>
<p>This approach is one we had been starting to apply to some of our client
deployments, even with CentOS Linux. Having a local mirror with repository
snapshots improves repeatability, as well as reducing dependence on an
external resource (the upstream mirrors), and avoiding a fan-in effect on
package downloads from the Internet.</p>
<p>This approach worked well for a while, but it soon became clear that having
separate package repositories for each deployment was not optimal. Each site
was using a different set of packages, a different set of locally-built
container images, and no common source of configuration beyond that provided by
Kayobe and Kolla Ansible. Without some changes, this would soon become a
scaling bottleneck for the company.</p>
</div>
<div class="section" id="release-train">
<h2>Release Train</h2>
<p>As they say, <em>never waste a good crisis</em>. CentOS Stream gave us an opportunity to take the StackHPC Release Train
project off our (long) backlog. It's an effort to make our OpenStack
deployments more consistent, reproducible and reliable, by releasing a set of
tested artifacts. This includes:</p>
<ul class="simple">
<li>Package repositories</li>
<li>Kolla container images</li>
<li>Binary artifacts (images, etc.)</li>
<li>Source code repositories</li>
<li>Kayobe and Kolla configuration</li>
</ul>
<p>We are initially targeting CentOS Stream, but we will likely add support for
other distributions.</p>
</div>
<div class="section" id="pulp">
<h2>Pulp</h2>
<p>We use <a class="reference external" href="https://pulpproject.org">Pulp</a> to manage and host content for the
release train. Pulp is a content server that manages repositories of software
packages and facilitates their distribution to content consumers.
It is implemented in Python using the Django framework, and is composed of a
<a class="reference external" href="https://docs.pulpproject.org/pulpcore/concepts.html#what-is-pulpcore">core</a>
and <a class="reference external" href="https://docs.pulpproject.org/pulpcore/concepts.html#content-management-with-plugins">plugins</a>
for the various content types (e.g. RPMs, containers). A key feature of Pulp
is fine-grained control over snapshots of repositories.</p>
</div>
<div class="section" id="pulp-concepts">
<h2>Pulp concepts</h2>
<p>It's useful to have an awareness of the <a class="reference external" href="https://docs.pulpproject.org/pulpcore/concepts.html">core concepts</a> in Pulp, since they are
common to the various plugins.</p>
<dl class="docutils">
<dt>Repository</dt>
<dd>A Pulp repository of artifacts and metadata</dd>
<dt>Remote</dt>
<dd>A reference to an upstream repository from which content can be synced</dd>
<dt>Repository version</dt>
<dd>A snapshot of a repository at a point in time</dd>
<dt>Sync</dt>
<dd>Sync a repository with a remote, creating a repository version</dd>
<dt>Publication</dt>
<dd>A published repository version, including metadata</dd>
<dt>Distribution</dt>
<dd>Content made available to users</dd>
</dl>
<p>For example, the RPM plugin might have a <tt class="docutils literal"><span class="pre">centos-baseos</span></tt> <em>repository</em>, which
<em>syncs</em> from the upstream <tt class="docutils literal"><span class="pre">centos-baseos</span></tt> <em>remote</em> at
<a class="reference external" href="http://mirror.centos.org/centos/8-stream/BaseOS/">http://mirror.centos.org/centos/8-stream/BaseOS/</a>, creating a new <em>repository
version</em> when new content is available. In order to make the content available,
we might create a <em>publication</em> of the new <em>repository version</em>, then associate
a <tt class="docutils literal"><span class="pre">centos-baseos-development</span></tt> <em>distribution</em> with the <em>publication</em>.
Another <em>distribution</em>, <tt class="docutils literal"><span class="pre">centos-baseos-production</span></tt>, might be associated with an older
<em>publication</em> and <em>repository version</em>. Pulp can serve both <em>distributions</em> at
the same time, allowing us to test a new snapshot of the repository without
affecting users of existing snapshots.</p>
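<p>These concepts map fairly directly onto commands. The following is a rough sketch of the above workflow using the <tt class="docutils literal"><span class="pre">pulp-cli</span></tt> tool (shown for illustration - our automation drives the Pulp API via Ansible instead):</p>
<div class="highlight"><pre>$ pulp rpm remote create --name centos-baseos \
    --url http://mirror.centos.org/centos/8-stream/BaseOS/x86_64/os/
$ pulp rpm repository create --name centos-baseos --remote centos-baseos
$ pulp rpm repository sync --name centos-baseos    # new repository version
$ pulp rpm publication create --repository centos-baseos
$ pulp rpm distribution create --name centos-baseos-development \
    --base-path centos-baseos-development --publication &lt;publication_href&gt;
</pre></div>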
</div>
<div class="section" id="automation-continuous-integration-ci">
<h2>Automation & Continuous Integration (CI)</h2>
<p>Automation and CI are key aspects of the release train. The additional control
provided by the release train comes at a cost in maintenance and complexity,
which must be offset via automation and CI. In general, Ansible is used for
automation, and GitHub Actions provide CI.</p>
<p>The intention is to have as much as possible of the release train automated and
run via CI. Typically, workflows may go through the following stages as they
evolve:</p>
<ol class="arabic simple">
<li>automated via Ansible, manually executed</li>
<li>executed by GitHub Actions workflows, manually triggered by <a class="reference external" href="https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#workflow_dispatch">workflow
dispatch</a>
or <a class="reference external" href="https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#schedule">schedule</a></li>
<li>executed by GitHub Actions workflows, automatically triggered by an event
e.g. pull request or another workflow</li>
</ol>
<p>This sequence discourages putting too much automation into the GitHub Actions
workflows, ensuring it is possible to run them manually.</p>
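<p>At stage 2, a workflow can be dispatched from the command line as well as from the web UI - for example with the GitHub CLI (the workflow file name here is hypothetical):</p>
<div class="highlight"><pre>$ gh workflow run sync-package-repos.yml --ref main
</pre></div>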
<p>We typically use Ansible to drive the Pulp API, and have made <a class="reference external" href="https://github.com/pulp/squeezer/pulls?q=is%3Apr+involves%3Amarkgoddard+">several</a>
<a class="reference external" href="https://github.com/pulp/squeezer/pulls?q=is%3Apr+involves%3Acityofships">contributions</a> to
the squeezer collection. We also created the <a class="reference external" href="https://github.com/stackhpc/ansible-collection-pulp">stackhpc.pulp</a> Ansible collection
as a higher level interface to define repositories, remotes, and other Pulp
resources.</p>
</div>
<div class="section" id="architecture">
<h2>Architecture</h2>
<div class="figure">
<a class="reference external image-reference" href="//www.stackhpc.com/images/release-train.png"><img alt="Release train architecture" src="//www.stackhpc.com/images/release-train.png" style="width: 750px;" /></a>
</div>
<p><a class="reference external" href="https://ark.stackhpc.com">Ark</a> is our production Pulp server, and is hosted
on <a class="reference external" href="https://create.leaf.cloud">Leafcloud</a>. It comprises a single compute
instance, with content stored in Leafcloud's object storage. We use
<a class="reference external" href="https://github.com/pulp/pulp_installer">pulp_installer</a> to deploy Pulp.
Ark is the master copy of development and released content. Access to the API
and artifacts is controlled via client certificates and passwords.</p>
<p>Clients access Ark via a Pulp service deployed on their local infrastructure.
Content is synced from Ark to the local Pulp service, and control plane hosts
acquire the content from there.</p>
<p>A test Pulp service runs on the <a class="reference external" href="https://api.sms-lab.cloud/">SMS lab</a> cloud.
Content is synced from Ark to the test Pulp service, where it is used to build
container images and run tests. In some respects, the test Pulp service may be
considered a client.</p>
</div>
<div class="section" id="content-types">
<h2>Content types</h2>
<p>Various different types of content are hosted by Pulp, including:</p>
<ul class="simple">
<li>RPM package repositories (<a class="reference external" href="https://docs.pulpproject.org/pulp_rpm/">Pulp RPM plugin</a>)<ul>
<li>CentOS distribution packages</li>
<li>Third party packages</li>
</ul>
</li>
<li>Container image repositories (<a class="reference external" href="https://docs.pulpproject.org/pulp_container/">Pulp container plugin</a>)<ul>
<li>Kolla container images</li>
</ul>
</li>
</ul>
<p>We also anticipate supporting the following content:</p>
<ul class="simple">
<li>Apt package repositories (<a class="reference external" href="https://docs.pulpproject.org/pulp_deb/">Pulp Deb plugin</a>)<ul>
<li>Ubuntu distribution packages</li>
<li>Third party packages</li>
</ul>
</li>
<li>File repositories (<a class="reference external" href="https://docs.pulpproject.org/pulp_file/">Pulp file plugin</a>)<ul>
<li>Disk images</li>
</ul>
</li>
</ul>
<p>Some of this content may be mirrored from upstream sources, while others are
the result of release train build processes.</p>
</div>
<div class="section" id="access-control">
<h2>Access control</h2>
<p>Access to released Pulp content is restricted to clients with a support
agreement. Build and test processes also need access to unreleased content.</p>
<p>Access to package repositories is controlled via <a class="reference external" href="https://docs.pulpproject.org/pulp_certguard/">Pulp x509 cert guards</a>. A <a class="reference external" href="https://vault.stackhpc.com/">HashiCorp Vault</a> service acts as a Certificate Authority (CA)
for the cert guards. Two cert guards are in use - <tt class="docutils literal">development</tt> and
<tt class="docutils literal">release</tt>. The <tt class="docutils literal">development</tt> cert guard is assigned to unreleased content,
while the <tt class="docutils literal">release</tt> cert guard is assigned to released content. Clients are
provided with a client certificate which they use when syncing package
repositories in their local Pulp service with Ark. Clients' client certificates
are authorised to access content protected by the <tt class="docutils literal">release</tt> cert guard. Build
and test processes are provided with a client certificate that is authorised to
access both the <tt class="docutils literal">development</tt> and <tt class="docutils literal">release</tt> cert guards. The latter is made
possible via the CA chain.</p>
<p>Access to container images is controlled by token authentication, which uses
Django users in the backend. Two container namespaces are in use -
<tt class="docutils literal"><span class="pre">stackhpc-dev</span></tt> and <tt class="docutils literal">stackhpc</tt>. The <tt class="docutils literal"><span class="pre">stackhpc-dev</span></tt> namespace is used for
unreleased content, while the <tt class="docutils literal">stackhpc</tt> namespace is used for released
content. Clients are provided with a set of credentials, which they use when
syncing container image repositories in their local Pulp service with Ark.
Clients' credentials are authorised to pull from the <tt class="docutils literal">stackhpc</tt> namespace.
Build and test processes are provided with credentials that are authorised to
push to the <tt class="docutils literal"><span class="pre">stackhpc-dev</span></tt> namespace.</p>
</div>
<div class="section" id="syncing-package-repositories">
<h2>Syncing package repositories</h2>
<p>The <a class="reference external" href="https://github.com/stackhpc/stackhpc-release-train">stackhpc-release-train</a> repository provides
Ansible-based automation and GitHub Actions workflows for the release train.</p>
<p>A GitHub Actions workflow syncs CentOS Stream and other upstream package
repositories nightly into Ark, creating new snapshots when there are changes.
The workflow may also be run on demand. Publications and distributions are
created using the <tt class="docutils literal">development</tt> content guard, to ensure that untested
content is not accessible to clients. We sync using the <a class="reference external" href="https://docs.pulpproject.org/pulpcore/workflows/on-demand-downloading.html">immediate</a>
policy, to ensure content remains available if it is removed from upstream
mirrors. This workflow also syncs the content to the test Pulp service.</p>
<p>Package repository distributions are versioned based on the date/time stamp at
the beginning of the sync workflow, e.g. <tt class="docutils literal">20211122T102435</tt>. This version
string is used as the final component of the path at which the corresponding
distribution is hosted. For example, a CentOS Stream 8 BaseOS snapshot may be
hosted at
<a class="reference external" href="https://ark.stackhpc.com/pulp/content/centos/8-stream/BaseOS/x86_64/os/20220105T044843/">https://ark.stackhpc.com/pulp/content/centos/8-stream/BaseOS/x86_64/os/20220105T044843/</a>.</p>
<p>The rationale behind using a date/time stamp is that there is no sane way to
version a large collection of content, such as a repository, in a way in which
the version reflects changes in the content (e.g. SemVer). While the timestamp
used is fairly arbitrary, it does at least provide a reasonable guarantee of
ordering, and is easily automated.</p>
</div>
<div class="section" id="building-container-images">
<h2>Building container images</h2>
<p>Kolla container images are built from the test Pulp service package
repositories and pushed to the Ark container registry.</p>
<p>Build and test processes run on SMS cloud, to avoid excessive running costs.
All content in Ark that is required by the build and test processes is synced
to the test Pulp service running in SMS cloud, minimising data egress from Ark.</p>
<p>Kolla container images are built via Kayobe, using a <tt class="docutils literal">builder</tt> environment in
<a class="reference external" href="https://github.com/stackhpc/stackhpc-kayobe-config">StackHPC Kayobe config</a>.
The configuration uses the package repositories in Ark when building
containers. Currently this is run manually, but will eventually run as a CI
job. The <tt class="docutils literal"><span class="pre">stackhpc-dev</span></tt> namespace in Ark contains <a class="reference external" href="https://docs.pulpproject.org/pulp_container/workflows/push.html">container push
repositories</a>, which are
pushed to using Kayobe. (Currently this is rather slow due to a <a class="reference external" href="https://github.com/pulp/pulp_container/issues/494">Pulp bug</a>.)</p>
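<p>The build itself is a standard Kayobe operation, run against the <tt class="docutils literal">builder</tt> environment - roughly:</p>
<div class="highlight"><pre>$ # from a checkout configured for the builder environment:
$ kayobe overcloud container image build --push
</pre></div>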
<p>A Github Actions workflow runs on demand, syncing container repositories in
test Pulp service with those in Ark. It also configures container image
distributions to be private, since they are public by default.</p>
<p>Kolla container images are versioned based on the OpenStack release name and
the date/time stamp at the beginning of the build workflow, e.g.
<tt class="docutils literal"><span class="pre">wallaby-20211122T102435</span></tt>. This version string is used as the image tag.
Unlike package repositories, container image tags allow multiple versions to be
present in a distribution of a container repository simultaneously. We
therefore use separate namespaces for development (<tt class="docutils literal"><span class="pre">stackhpc-dev</span></tt>) and
release (<tt class="docutils literal">stackhpc</tt>).</p>
</div>
<div class="section" id="testing">
<h2>Testing</h2>
<p>Release Train content is tested via a Kayobe deployment of OpenStack. An
<tt class="docutils literal">aio</tt> environment in <a class="reference external" href="https://github.com/stackhpc/stackhpc-kayobe-config">StackHPC Kayobe config</a> provides a converged
control/compute host for testing. Currently this is run manually, but will
eventually run as a CI job.</p>
</div>
<div class="section" id="promotion">
<h2>Promotion</h2>
<p>Whether content is mirrored from an upstream source or built locally, it is not
immediately released. Promotion describes the process whereby release candidate
content is made into a release that is available to clients.</p>
<p>For package repositories, promotion does not affect how content is accessed,
only who may access it. Promotion involves changing the content guard for the
distribution to be released from <tt class="docutils literal">development</tt> to <tt class="docutils literal">release</tt>. This makes the
content accessible to clients using their x.509 client certificates.</p>
<p>The <tt class="docutils literal">stackhpc</tt> container namespace contains regular container repositories,
which cannot be pushed to via <tt class="docutils literal">docker push</tt>. Instead, we use the Pulp API to
sync specific tags from <tt class="docutils literal"><span class="pre">stackhpc-dev</span></tt> to <tt class="docutils literal">stackhpc</tt>.</p>
</div>
<div class="section" id="configuration">
<h2>Configuration</h2>
<p>StackHPC maintains a <a class="reference external" href="https://github.com/stackhpc/stackhpc-kayobe-config">base Kayobe configuration</a> which includes settings
required to consume the release train, as well as various generally applicable
configuration changes. Release Train consumers merge this configuration into
their own, and apply site and environment-specific changes. This repository
provides configuration and playbooks to:</p>
<ul class="simple">
<li>deploy a local Pulp service as a container on the seed</li>
<li>define which package repository versions to use</li>
<li>define which container image tags to use</li>
<li>sync all necessary content from Ark into the local Pulp service</li>
<li>use the local Pulp repository mirrors on control plane hosts</li>
<li>use the local Pulp container registry on control plane hosts</li>
</ul>
<p>This configuration is in active development and is expected to evolve over the
coming releases. It currently supports the OpenStack Victoria and Wallaby
releases.</p>
<p>Further documentation of this configuration is available in the <a class="reference external" href="https://github.com/stackhpc/stackhpc-kayobe-config/blob/stackhpc/wallaby/README.rst">readme</a>.</p>
</div>
<div class="section" id="all-aboard">
<h2>All aboard</h2>
<p>In the second half of 2021, with the end-of-life of CentOS Linux looming large,
the pressure to switch our clients to CentOS Stream was increasing. We had to
make some tough calls in order to implement the necessary parts of the release
train and put them into production. The StackHPC team then had to work hard
migrating our clients' systems to our new release train and CentOS Stream.</p>
<p>Of course, there were some teething problems, but overall, the adoption of the
release train went fairly smoothly.</p>
</div>
<div class="section" id="release-train-day-2">
<h2>Release Train day 2</h2>
<p>Clouds don't stand still, and at some point something will require a patch. How
well does the release train cope with changes? A test soon came in the form of
a zero-day Grafana exploit, <a class="reference external" href="https://grafana.com/blog/2021/12/07/grafana-8.3.1-8.2.7-8.1.8-and-8.0.7-released-with-high-severity-security-fix/">CVE-2021-43798</a>.
Grafana applied fixes and cut some releases. We needed to get the updated
packages into our Grafana container image, and roll it out to affected
deployments. This turned out to be a little more clunky than I'd hoped, with
repository snapshot versions tracked in multiple places, and various pull
requests. It looked a bit like this:</p>
<ul class="simple">
<li>Manual package repository sync (the previous night's sync had failed)</li>
<li>Update test repo versions, commit only grafana version bump in
stackhpc-release-train repository</li>
<li>Sync and publish repos in test pulp service</li>
<li>Bump grafana repository version in stackhpc-kayobe-config repository</li>
<li>Build & push grafana image</li>
<li>Promote container images with the new tag</li>
<li>Sync container images to test Pulp service</li>
<li>Bump grafana container image tag in stackhpc-kayobe-config repository</li>
<li>Test the new container</li>
</ul>
<p>Needless to say, this could be simpler. However, it was a good catalyst for
some head scratching and a whiteboard session with Matt Anson, resulting in
a design for release train mk2.</p>
</div>
<div class="section" id="release-train-mk2">
<h2>Release Train mk2</h2>
<p>The changes in mk2 are mostly quite subtle, but the key shift was to using
stackhpc-kayobe-config as the single source of truth for all repository
versions and container image tags. The stackhpc-release-train repository would
not include any 'live' state, just a set of Ansible playbooks and Github
Actions workflows to drive the various workflows.</p>
<p>Another change in mk2 is around the branching model for stackhpc-kayobe-config,
adding the ability to stage changes for an upcoming release, without making
them available to clients. There are also some usability improvements to the
automation, making it easier to perform actions on specific package
repositories or container images.</p>
<div class="figure">
<a class="reference external image-reference" href="//www.stackhpc.com/images/release-train-mk2.png"><img alt="Release train mk2 whiteboard session" src="//www.stackhpc.com/images/release-train-mk2.png" style="width: 750px;" /></a>
</div>
</div>
<div class="section" id="next-stop">
<h2>Next stop?</h2>
<p>Where next for the release train? Within the current scope of package
repositories and container images, there is plenty of room for improvement.
Usability enhancements, more automation and CI, and better automated building,
security scanning and testing of changes before they are released. We'll also
likely expand to support other distributions than CentOS Stream.</p>
<p>Release Train also has ambitions around our source code, improving our
development processes with better CI, more automation for our (few)
repositories with downstream changes, and increased reproducibility of builds
and installations.</p>
<p>Another avenue for growth of the release train is moving up the stack to our
platforms, such as Slurm and Kubernetes.</p>
</div>
<div class="section" id="centos-stream-in-retrospect">
<h2>CentOS Stream in retrospect</h2>
<p>Overall, our experience with CentOS Stream has been acceptable, given the steps
we have taken to mitigate risks of unexpected changes. Indeed, there have been
a few occasions where changes in Stream have <a class="reference external" href="https://review.opendev.org/q/I30c2a7b6850350901b15fe196175508634c8e9a5">broken the Kolla upstream CI jobs</a>
that we have been insulated from.</p>
<p>There are some other downsides to Stream. Mellanox does not publish OFED
packages built against Stream, only RHEL minor releases. While it is generally
possible to force the use of a compatible kernel in Stream, this kernel will
not receive the security updates that a RHEL minor release kernel would.
In many cases it may be possible to use the in-box Mellanox drivers, however
these may lack support for some of the latest hardware or features. This is
a consideration for the points where we make our snapshots.</p>
<p>Kayobe and Kolla Ansible have recently added support for Rocky Linux as a host
OS, as well as the ability to run libvirt as a host daemon rather than in a
container. This reduces the coupling between the host and containers, making it
safer to mix their OS distributions, e.g. running Rocky hosts with CentOS Stream
container images.</p>
</div>
<div class="section" id="resources">
<h2>Resources</h2>
<ul class="simple">
<li><a class="reference external" href="https://stackhpc.github.io/stackhpc-release-train/">Release Train documentation</a></li>
</ul>
</div>
<div class="section" id="get-in-touch">
<h2>Get in touch</h2>
<p>If you would like to get in touch we would love to hear
from you. Reach out to us via <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a>
or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact page</a>.</p>
</div>
Automating OpenStack database backups with Kayobe and GitHub Actions2022-03-14T12:00:00+00:002022-03-14T12:00:00+00:00Pierre Riteautag:www.stackhpc.com,2022-03-14:/openstack-backups-kayobe-github-actions.html<p class="first last">Using GitHub Actions and Kayobe Ansible playbooks to automate
database backups, including off-site transfer.</p>
<p>Developed by StackHPC and the upstream open source community, <a class="reference external" href="//www.stackhpc.com/pages/kayobe.html">Kayobe</a> provides automation for deploying OpenStack
using infrastructure-as-code principles. Beyond the initial software
deployment, so-called <em>day 2</em> operations are also crucial to automate to
increase productivity and ensure continued operation and knowledge transfer
throughout the lifetime of the infrastructure.</p>
<p>In this short blog post, we are looking at how Kayobe and GitHub Actions can
work together to automate OpenStack database backups.</p>
<div class="section" id="openstack-database-backups-with-kayobe">
<h2>OpenStack database backups with Kayobe</h2>
<p>OpenStack services rely on a relational database to store information.
OpenStack deployments generally use a MariaDB or MySQL database. Keeping
regular backups of database content is critical to be able to recover from a
failure of the control plane.</p>
<p>Thankfully, Kolla Ansible makes it easy to perform database <a class="reference external" href="https://docs.openstack.org/kolla-ansible/latest/admin/mariadb-backup-and-restore.html">backups</a>
using <a class="reference external" href="https://mariadb.com/kb/en/mariabackup-overview/">Mariabackup</a>.
This functionality is also available via <a class="reference external" href="https://docs.openstack.org/kayobe/latest/administration/overcloud.html#performing-database-backups">a Kayobe command</a>.</p>
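<p>Taking a backup is then a one-liner, with an optional flag for an incremental rather than full backup:</p>
<div class="highlight"><pre>$ kayobe overcloud database backup                # full backup
$ kayobe overcloud database backup --incremental  # incremental backup
</pre></div>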
</div>
<div class="section" id="kayobe-automation">
<h2>Kayobe automation</h2>
<p>We work with many different organisations that are running OpenStack. Local
teams using GitOps to manage their infrastructure will already have experience
with a specific CI/CD platform. Imposing a different one for Kayobe could add
much friction and discourage automation. Instead, we strive to make Kayobe easy
to integrate with any CI/CD platform.</p>
<p>To achieve this, we developed <a class="reference external" href="https://github.com/stackhpc/kayobe-automation">kayobe-automation</a>. In addition to running any
Kayobe command, it generates a Kolla configuration diff for change requests,
making it easy to visualise the Kolla configuration changes to be applied on
each OpenStack host.</p>
<p>Initially developed for <a class="reference external" href="https://about.gitlab.com/">GitLab</a>, we have since
integrated it with <a class="reference external" href="https://github.com/features/actions">GitHub Actions</a>
using a local self-hosted runner with access to the Kayobe admin network.</p>
</div>
<div class="section" id="automating-backups-with-github-actions">
<h2>Automating backups with GitHub Actions</h2>
<p>When integrated with GitHub Actions, each Kayobe command is implemented in a
different workflow, which can be triggered manually through the
<a class="reference external" href="https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#workflow_dispatch">workflow_dispatch</a>
event. This allows an operator to perform a database backup at the push of a
button. However, this is still not fully automated. GitHub Actions provides the
<a class="reference external" href="https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#schedule">schedule event</a>,
which uses a cron syntax to define when to run the workflow:</p>
<div class="highlight"><pre><span></span><span class="nt">on</span><span class="p">:</span>
<span class="nt">schedule</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="nt">cron</span><span class="p">:</span> <span class="s">'0</span><span class="nv"> </span><span class="s">30</span><span class="nv"> </span><span class="s">*</span><span class="nv"> </span><span class="s">*</span><span class="nv"> </span><span class="s">*'</span>
</pre></div>
<p>And just like that, our OpenStack database is now backed up every night, with
GitHub Actions taking care of the scheduling, providing us with status and logs
of each workflow run through its web interface, and sending us an email
notification in case of failure.</p>
</div>
<div class="section" id="but-where-is-my-backup">
<h2>But where is my backup?</h2>
<p>Kolla Ansible runs the database backup process on the first host in
<tt class="docutils literal">mariadb_shard_group</tt>, which will usually map to the first controller. The
backup file is stored in the <tt class="docutils literal">mariadb_backup</tt> Docker volume on this host.
This makes it prone to loss in case of hardware failure. To be really effective
against disaster, backup files should be transferred off-site onto a separate
storage system. This can be easily customised for each deployment using Kayobe
hooks. <a class="reference external" href="https://docs.openstack.org/kayobe/latest/custom-ansible-playbooks.html#hooks">Hooks</a>
allow operators to automatically execute <a class="reference external" href="https://docs.openstack.org/kayobe/latest/custom-ansible-playbooks.html">custom Ansible playbooks</a> at
certain points during the execution of a Kayobe command. For example, we can
write a custom playbook called <tt class="docutils literal"><span class="pre">upload-database-backup.yml</span></tt> which will upload
the latest backup file somewhere safe, for example to an off-site object store:</p>
<div class="highlight"><pre><span></span><span class="nn">---</span>
<span class="c1"># This playbook uploads MariaDB backups to a Swift object store.</span>
<span class="p p-Indicator">-</span> <span class="nt">hosts</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">controllers[0]</span>
<span class="nt">vars</span><span class="p">:</span>
<span class="nt">backup_directory</span><span class="p">:</span> <span class="s">"/var/lib/docker/volumes/mariadb_backup/_data"</span>
<span class="nt">swift_venv</span><span class="p">:</span> <span class="s">"/opt/kayobe/venvs/swift"</span>
<span class="nt">tasks</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="nt">name</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">Ensure swift client is available</span>
<span class="nt">pip</span><span class="p">:</span>
<span class="nt">name</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="l l-Scalar l-Scalar-Plain">python-keystoneclient</span>
<span class="p p-Indicator">-</span> <span class="l l-Scalar l-Scalar-Plain">python-swiftclient</span>
<span class="nt">virtualenv</span><span class="p">:</span> <span class="s">"{{</span><span class="nv"> </span><span class="s">swift_venv</span><span class="nv"> </span><span class="s">}}"</span>
<span class="p p-Indicator">-</span> <span class="nt">name</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">Upload backup files</span>
<span class="nt">shell</span><span class="p">:</span> <span class="p p-Indicator">></span>
<span class="no">cd {{ backup_directory }} && /opt/kayobe/venvs/swift/bin/swift --auth-version 2 \</span>
<span class="no">--os-auth-url {{ backup_swift_url }} --os-username {{ secrets_backup_swift_user }} \</span>
<span class="no">--os-password {{ secrets_backup_swift_key }} --os-project-name {{ backup_swift_project }} \</span>
<span class="no">upload --skip-identical {{ mysql_backup_container }} .</span>
<span class="nt">become</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">True</span>
</pre></div>
<p>To run this playbook after each execution of <tt class="docutils literal">kayobe overcloud
database backup</tt> (including when run through a CI/CD platform), create a
hook in the form of a symbolic link from
<tt class="docutils literal"><span class="pre">$KAYOBE_CONFIG_PATH/hooks/overcloud-database-backup/post.d/10-upload-database-backup.yml</span></tt>
to <tt class="docutils literal"><span class="pre">$KAYOBE_CONFIG_PATH/ansible/upload-database-backup.yml</span></tt>:</p>
<div class="highlight"><pre><span></span><span class="gp">$</span> readlink <span class="nv">$KAYOBE_CONFIG_PATH</span>/hooks/overcloud-database-backup/post.d/10-upload-database-backup.yml
<span class="go">../../../ansible/upload-database-backup.yml</span>
</pre></div>
<p>Similarly, we can run another playbook to delete any old backups from the
controllers, to avoid consuming large amounts of storage over time.</p>
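<p>A sketch of what such a cleanup playbook might look like, assuming a
seven-day retention period (the variable names and retention period are
illustrative), using Ansible's <tt class="docutils literal">find</tt> and
<tt class="docutils literal">file</tt> modules:</p>
<div class="highlight"><pre>---
# Illustrative cleanup playbook: removes backup files older than a
# configurable retention period from the first controller.
- hosts: controllers[0]
  vars:
    backup_directory: "/var/lib/docker/volumes/mariadb_backup/_data"
    backup_retention_days: 7
  tasks:
    - name: Find backup files older than the retention period
      find:
        paths: "{{ backup_directory }}"
        age: "{{ backup_retention_days }}d"
      register: old_backups
      become: true

    - name: Delete old backup files
      file:
        path: "{{ item.path }}"
        state: absent
      loop: "{{ old_backups.files }}"
      become: true
</pre></div>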
</div>
<div class="section" id="restoring-from-backup">
<h2>Restoring from backup</h2>
<p>Like taking out an insurance policy, we perform backups hoping we will never have
to use them. However, it is important to know how to restore from backups and
to check that the process actually works. The Kolla Ansible documentation
covers <a class="reference external" href="https://docs.openstack.org/kolla-ansible/latest/admin/mariadb-backup-and-restore.html#restoring-backups">how to restore backups</a>.
Note that OpenStack will lose track of anything that happened since the backup
was taken, so there may be some resources, such as virtual machines or volumes,
that will require manual cleanup.</p>
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>Automating OpenStack database backups is only one of the many things possible
with Kayobe and CI/CD platforms. Future blog posts will describe other
features, such as Kolla configuration diffs.</p>
</div>
<div class="section" id="get-in-touch">
<h2>Get in touch</h2>
<p>If you would like to get in touch we would love to hear
from you. Reach out to us via <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a>
or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact page</a>.</p>
</div>
OpenInfra Live: OpenStack in Academia2022-03-05T18:00:00+00:002022-03-05T18:00:00+00:00Stig Telfertag:www.stackhpc.com,2022-03-05:/openinfra-academia.html<p class="first last">StackHPC participated in an OpenInfra Live discussion on using
OpenStack in academia</p>
<div class="section" id="speaking-on-behalf-of-the-scientific-sig">
<h2>Speaking on behalf of the Scientific SIG</h2>
<p>StackHPC tackles an increasing breadth of activities, but our core
business remains helping academic institutions to get the most out
of OpenStack private cloud infrastructure.</p>
<p>One very rewarding way in which we contribute is through participation
in the OpenStack Scientific SIG. The Scientific SIG has an open
membership, and on a day-to-day basis convenes on a Slack workspace
with around 150 members at the time of writing. The SIG's primary purpose is
to provide <em>social infrastructure</em>, maintaining social connections
and information sharing between various scientific and academic
institutions.
<a class="reference external" href="//www.stackhpc.com/pages/contact.html">Get in touch</a> if you would like
a link to join.</p>
<p>The OpenInfra Foundation organises weekly webinars, <a class="reference external" href="https://openinfra.dev/live/">OpenInfra Live</a>, covering all things open infra.
This week it was Kendall's turn to lead a discussion on <a class="reference external" href="https://youtu.be/xqGvMn0B7jc">How OpenStack
is Used in Academia</a>, and as SIG
representative it was my pleasure to join a great panel of researchers
and service providers for academic institutions:</p>
<ul class="simple">
<li>Steve Quenette from <a class="reference external" href="https://www.monash.edu/researchinfrastructure/eresearch">Monash University eResearch Centre</a>.</li>
<li>Rémi Sharrock and Marc Jeanmougin from <a class="reference external" href="https://www.telecom-paris.fr/en/research/laboratories/information-processing-and-communication-laboratory-ltci/open-software-and-innovation">Télécom Paris Center for Open Source Innovation</a>.</li>
<li>Lance Albertson from <a class="reference external" href="https://osuosl.org">Oregon State University Open Source Lab</a>.</li>
</ul>
<p>You can see our discussion here:</p>
<div class="youtube"><iframe src="https://www.youtube.com/embed/xqGvMn0B7jc" width="500" height="333" allowfullscreen seamless frameBorder="0"></iframe></div><div class="section" id="id1">
<h3>Get in touch</h3>
<p>If you would like to get in touch we would love to hear
from you. Reach out to us via <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a>
or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact page</a>.</p>
</div>
</div>
StackHPC is Recruiting2021-11-13T18:00:00+00:002021-11-13T18:00:00+00:00Justin Coquillontag:www.stackhpc.com,2021-11-13:/recruiting-3.html<p class="first last">StackHPC continues to grow and seek new Cloud Engineers to
join our enthusiastic team.</p>
<div class="section" id="stackhpc-cloud-engineer">
<h2>StackHPC Cloud Engineer</h2>
<p>StackHPC is a dynamic OpenStack and cloud consultancy that works
with leading research institutions to provide high-performance cloud
infrastructure for data-intensive scientific challenges. By focusing
on client needs we identify and develop solutions that address the
gaps in cloud for high performance research computing.</p>
<p>We’ve been growing for five years and are approaching 20 team
members, but still retain a startup mindset.</p>
<p>An important aspect of StackHPC’s corporate culture is dedication
to open source and the additional values of open design, development
and community. All of our development is upstream open source;
StackHPC is a committed member of the Open Infrastructure community
and a founding Silver Member of the Open Infrastructure Foundation.
Principal staff are active in community
efforts around OpenStack and research computing, and our CTO is
co-founder and co-chair of the OpenStack Foundation's Scientific
Special Interest Group.</p>
<div class="section" id="the-role">
<h3>The Role</h3>
<p>StackHPC is looking for our next team member to join our growing
cloud business and work with some of the best software engineers
around. You would be interested in a career in cloud systems
engineering, and ideally be familiar with working with complex cloud
and HPC systems.</p>
<div class="section" id="typical-work-activities">
<h4>Typical work activities</h4>
<ul class="simple">
<li>Design assistance and HPC consultancy.</li>
<li>Creating new HPC deployments.</li>
<li>Migration and upgrades of systems.</li>
<li>Supporting existing client systems and resolving operational problems.</li>
<li>Delivering training and knowledge transfer.</li>
<li>Producing technical documentation for customers and our blog.</li>
</ul>
</div>
<div class="section" id="skills-and-experience">
<h4>Skills and Experience</h4>
<p><strong>Preferred</strong></p>
<ul class="simple">
<li>Deployment and administration of Linux operating systems.</li>
<li>Ansible configuration management.</li>
<li>Infrastructure lifecycle management using, e.g. Terraform.</li>
<li>Cloud infrastructure concepts such as cloud-init.</li>
<li>Systems and process automation using Python and Bash.</li>
<li>Docker containerisation methods.</li>
<li>Development lifecycle tools, such as Git, Jira.</li>
<li>Use of monitoring and reporting tools, such as Prometheus and Grafana.</li>
</ul>
<p><strong>Desirable</strong></p>
<ul class="simple">
<li>HPC application experience.</li>
<li>Performance profiling, monitoring tools, and software performance optimization.</li>
<li>Knowledge of configuring and optimising Object and File Systems used in research computing.</li>
<li>System-level hardware and performance optimisation.</li>
<li>Setup of HPC middleware.</li>
<li>Experience of Kubernetes.</li>
<li>Experience of Slurm.</li>
<li>Public Cloud (e.g. AWS, Azure and GCP).</li>
<li>Exposure to the design or deployment of OpenStack or other cloud services.</li>
<li>Technical knowledge of Linux-based cloud infrastructure technologies, such as:
virtualisation, containerisation, software-defined networking
(SDN), network function virtualisation (NFV), high speed networks,
orchestration, storage, metrics, control consoles, code management
systems, CI or test frameworks etc.</li>
</ul>
</div>
<div class="section" id="our-technology-stack">
<h4>Our Technology Stack</h4>
<p>We work with and deploy a wide range of OpenStack technologies and
services, so experience with the following is always beneficial.</p>
<ul class="simple">
<li><strong>O/S</strong>: CentOS, Ubuntu, RHEL, Rocky Linux</li>
<li><strong>Storage</strong>: Ceph, Cinder, Manila</li>
<li><strong>Supporting Components</strong>: Prometheus, Grafana, ElasticSearch, Kibana, Cloudkitty, Keystone, Horizon</li>
<li><strong>Networking</strong>: Neutron, SRIOV</li>
<li><strong>Compute and Workloads</strong>: Nova, Magnum, Slurm, Kubernetes, Octavia</li>
</ul>
<p>Our technology stack is constantly evolving to meet market and user
requirements, so we will help you get up to speed as required.</p>
</div>
<div class="section" id="our-location">
<h4>Our Location</h4>
<p>We have team members based across the globe but our HQ is in central
Bristol - voted <a class="reference external" href="https://www.thetimes.co.uk/article/why-bristol-best-place-to-live-uk-h5wjpk0r3">one of the best places to live in the UK</a>. This location is a short walk from Bristol Temple Meads
station.</p>
</div>
<div class="section" id="salary-and-benefits">
<h4>Salary and benefits</h4>
<ul class="simple">
<li>Competitive salary and bonus.</li>
<li>Generous stock option scheme.</li>
<li>Discretionary remote working / flexible working practices.</li>
<li>Flexible working hours and home/office hybrid working.</li>
<li>25 days paid holiday.</li>
<li>Pension contribution.</li>
<li>Support for travel to conferences and delivering presentations.</li>
<li>Learning and employee development is a priority, including work time dedicated to R&D and technology.</li>
</ul>
</div>
</div>
<div class="section" id="get-in-touch">
<h3>Get in touch</h3>
<p><em>We have an existing recruiter relationship, so there is no need for recruiters to contact us about this role</em></p>
<p>If you would like to get in touch we would love to hear
from you. Reach out to us via <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a>
or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact page</a>.</p>
</div>
</div>
OpenInfra Live: OpenStack-powered supercomputing2021-08-26T18:00:00+01:002021-08-26T18:00:00+01:00Stig Telfertag:www.stackhpc.com,2021-08-26:/oil-supercomputers.html<p class="first last">StackHPC's CTO Stig Telfer moderated an OpenInfra Live discussion
on OpenStack and software-defined supercomputers</p>
<p>StackHPC is proud to support the Open Infrastructure Foundation's
<a class="reference external" href="https://openinfra.dev/live/">OpenInfra Live</a>, a growing archive of
discussions around all things open infra.</p>
<p>As co-chair of the <a class="reference external" href="https://wiki.openstack.org/wiki/Scientific_SIG">OpenStack Scientific SIG</a>, StackHPC's CTO Stig Telfer moderated
a lively discussion between a panel of technology leaders and
trail-blazers in the emerging phenomenon of software-defined
supercomputing:</p>
<ul class="simple">
<li><strong>Sadaf Alam</strong>, CTO of <a class="reference external" href="https://www.cscs.ch">CSCS</a> the Swiss national supercomputer centre.</li>
<li><strong>Happy Sithole</strong>, Director of <a class="reference external" href="https://www.chpc.ac.za">CHPC</a> the Cape Town Centre for High-Performance Computing.</li>
<li><strong>Steve Quenette</strong>, Deputy Director of <a class="reference external" href="https://www.monash.edu/researchinfrastructure/eresearch">Monash University e-Research Centre</a>.</li>
<li><strong>Jon Mills</strong>, Cloud Computing Technical Team Lead, <a class="reference external" href="https://www.nccs.nasa.gov">NASA Center for Climate Simulation (NCCS)</a>, NASA Goddard Space Flight Center.</li>
</ul>
<div class="youtube"><iframe src="https://www.youtube.com/embed/fOJTHanmOFg" width="500" height="333" allowfullscreen seamless frameBorder="0"></iframe></div><p>The Open Infrastructure Foundation's <em>SuperUser</em> blog has put together an excellent <a class="reference external" href="https://superuser.openstack.org/articles/large-scale-openstack-discussing-software-defined-supercomputers-openinfra-live-recap/">summary of the session</a>.</p>
<p>OpenStack's supporting role in HPC environments is well established, in a pattern where
private cloud provides flexible compute resources in tandem with a dedicated HPC supercomputer.
The OpenStack system serves to create models and scenarios for input, and analyse the data
generated as outputs from an HPC workload.</p>
<p>This discussion also highlighted the emergence of another, more prominent role for OpenStack, in providing
a software-defined supercomputer built entirely using open infrastructure.</p>
<div class="section" id="get-in-touch">
<h2>Get in touch</h2>
<p>If you would like to get in touch we would love to hear
from you. Reach out to us via <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a>
or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact page</a>.</p>
</div>
OpenInfra Live: Ironic in Production2021-08-12T18:00:00+01:002021-08-12T18:00:00+01:00Stig Telfertag:www.stackhpc.com,2021-08-12:/oil-ironic.html<p class="first last">StackHPC's Mark Goddard participated in the OpenInfra Live
session on Ironic in Production</p>
<p>StackHPC is proud to support the Open Infrastructure Foundation's
<a class="reference external" href="https://openinfra.dev/live/">OpenInfra Live</a>, a growing archive of
discussions around all things open infra.</p>
<p>As PTL of the <a class="reference external" href="https://docs.openstack.org/kolla/latest/">Kolla project</a>,
core team member of the <a class="reference external" href="https://docs.openstack.org/ironic/latest/">Ironic project</a> and Senior Tech Lead
from StackHPC, Mark Goddard joined a lively and informative discussion
around how Ironic is used today in production, including some of the
cool projects we have underway with Ironic at their heart.</p>
<div class="youtube"><iframe src="https://www.youtube.com/embed/m81Q2bU9bGU" width="500" height="333" allowfullscreen seamless frameBorder="0"></iframe></div><p>Ironic plays a fundamental role in StackHPC's vision of the
"software-defined supercomputer", and StackHPC is a <a class="reference external" href="//www.stackhpc.com/baremetal-program.html">founding member</a>
of the Open Infrastructure Foundation's <a class="reference external" href="https://www.openstack.org/use-cases/bare-metal/">bare metal program</a> and SIG.</p>
<div class="figure">
<img alt="Bare metal program logo" src="//www.stackhpc.com/images/baremetal-program-logo.png" style="width: 300px;" />
</div>
<p>Mark adds, "it was a pleasure to take part in this discussion about running
Ironic in production. Ironic is central to the provisioning layer of Kayobe
(our OpenStack deployment tool of choice), and provides multi-tenant access to
bare metal compute for demanding user workloads".</p>
<div class="section" id="get-in-touch">
<h2>Get in touch</h2>
<p>If you would like to get in touch we would love to hear
from you. Reach out to us via <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a>
or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact page</a>.</p>
</div>
SuperCompCloud 4 at ISC '212021-07-02T18:00:00+01:002021-07-02T18:00:00+01:00Stig Telfertag:www.stackhpc.com,2021-07-02:/supercompcloud-4.html<p class="first last">StackHPC CTO Stig Telfer presented a keynote address
to SuperCompCloud - the 4th Workshop on Interoperability of Supercomputing
and Cloud Technologies at ISC '21</p>
<p>The <a class="reference external" href="https://sites.google.com/view/supercompcloud/archives/isc21-4th-supercompcloud-workshop?authuser=0">SuperCompCloud workshop</a>
returned as part of <a class="reference external" href="https://www.isc-hpc.com">ISC '21</a>. StackHPC's CTO Stig Telfer was
thrilled to present a workshop keynote:</p>
<p><em>In recent years the proposition of cloud-native supercomputing has
matured and is a compelling alternative to conventional HPC
infrastructure. Cloud means many things, and on-premise private
cloud infrastructure covers the full range of design choices. In
this talk, Stig will present recent technical work that shifts the
balance in design trade-offs, and may change the way people think
about deploying software-defined infrastructure for supercomputing.</em></p>
<p>The presentation slides are <a class="reference external" href="https://drive.google.com/file/d/1ym32l_t4Xr3JzhnxyhZPk3Gpmr8I-gZT/view?usp=sharing">available here</a>.</p>
<div class="section" id="get-in-touch">
<h2>Get in touch</h2>
<p>If you would like to get in touch we would love to hear
from you. Reach out to us via <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a>
or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact page</a>.</p>
</div>
High Performance Ethernet - to IB or not to IB2021-05-24T14:00:00+01:002021-05-24T14:00:00+01:00John Taylortag:www.stackhpc.com,2021-05-24:/ethernet-hpc-2.html<p class="first last">The second of a trilogy of blogs
on the use of modern Ethernet stacks (sometimes referred to by the
generic term High Performance Ethernet) as a viable alternative to
other interconnects such as Infiniband in HPC and AI.</p>
<p>This is the second [<a class="reference external" href="//www.stackhpc.com/ethernet-hpc.html">first one here</a>] of a trilogy of
blogs on the use of modern Ethernet stacks (sometimes referred to
by the generic term High Performance Ethernet) as a viable alternative
to other interconnects such as Infiniband in HPC and AI. The reasons
and motivations for this are many:</p>
<ol class="arabic simple">
<li>Modern Ethernet stacks support RDMA and as such come close to
the base latencies (critical to the performance of a bona fide
HPC network) of alternative technologies, viz. order 1µs.</li>
<li>Convergence of the NIC hardware used in both Ethernet and Infiniband
means that the delay in higher line rate Ethernet technologies is
not as great as in previous generations.</li>
<li>Ethernet as an IEEE standard means that suppliers of innovative
storage and new AI acceleration technologies choose Ethernet by
default.</li>
<li>In virtualised contexts, the properties of Ethernet in supporting
VLAN extensions significantly enhance security and multi-tenancy
without performance penalties, although to some extent <a class="reference external" href="//www.stackhpc.com/bare-metal-infiniband.html">IB pkeys</a>
can also be used.</li>
<li>If Ethernet can address these aspects, then the complexity of
medium-sized systems can be reduced, saving cost, time and risk as
only one data transport is required along with the control networks
which are always Ethernet. <strong>I say medium-sized systems as there is
still a question relating to congestion management at large-scale
(see previous blog) together with the scalability of global reduction
operations without hardware offload.</strong></li>
</ol>
<p>For StackHPC, however, the standard nature of Ethernet has the potential
for some interesting side-effects, unless care is taken to ensure the
configuration is optimised. This blog explores
these aspects, drawing upon the team's many years of experience
in SDN and HPC. We also assess these aspects in a novel manner, in
particular through extensive use of hardware monitoring of the application
and the network stack. We also note here that we are only considering
the case of bare metal. In the case of virtualised compute, we point
you to the following article on advances in performance and
functionality using SR-IOV.</p>
<p>Previously we had compared and contrasted the performance of 25GbE
and 100Gbps EDR for a range of benchmarks up to 8 nodes. Our
experience of “medium” scale HPC operations with customers over
the past 5 years suggests that, in a typical operational profile of
running applications, 8 nodes represents a good
median - and one that remains pretty constant as new node
types (potentially doubling the number of cores) are purchased or
new nodes with acceleration are introduced.</p>
<p>Of course, as this is a median, there may well exist power-users who say
they need (or, probably more correctly, want) the capability of the
whole machine for a single application - such as a time-critical
scientific deadline - and while such black swan events do occur,
they are, ipso facto, not common. Hence the focus here on medium-sized
(or <a class="reference external" href="http://www.hpcadvisorycouncil.com/pdf/Intersect360_HPC_AI_Market_Sept_2020.pdf">high-end</a>)
systems - the largest tranche of the HPC market.</p>
<p>Thus in this article we first extend the results up to and including
a single Top-of-Rack switch for both Ethernet and IB, and then
explore the performance across multiple racks. Typically, subject
to data centre constraints, this ranges from 56 to 128 nodes
(two to three racks' worth), or around the 1-2MUSD price tag (disclaimer
here), probably representing 80% of the HPC/AI market.</p>
<p>In doing this however, we of course need to be cognisant of the
following caveats, given the test equipment to hand. So, here we
go.</p>
<ol class="arabic simple">
<li>The system under test (SUT) comprises multiple racks of 56 nodes
of Cascade Lake with each node comprising a dual NIC configured as
one <strong>50GbE</strong> and one <strong>100Gbps</strong> HDR. These are Mellanox ConnectX-6. Each
interface is connected to a Mellanox Cumulus SN3700 and a Mellanox
HDR-200 IB Switch respectively.</li>
<li>Each Top-of-rack switch is then connected in a Leaf-Spine topology
with the following over-subscription ratio. <strong>For Ethernet this is
14:1 and for IB this is 2.3:1</strong>.</li>
<li>Priority Flow Control was configured on the network, with MTU set at 9000.</li>
</ol>
<p>As an aside, there is an excellent overview on these networking
aspects at <a class="reference external" href="https://dug.com/hpc-networking-a-ballad-in-four-parts-part-3-resilience/">Down Under Geophysical’s blog</a>
(I’ve tagged blog #3 here). The reader is also pointed at the RoCE
vs IB comparison <a class="reference external" href="https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet">here</a>.</p>
<div class="figure">
<img alt="Rack-level network oversubscription" src="//www.stackhpc.com/images/network-oversubscription.png" style="width: 500px;" />
</div>
<p>The system is built from the ground up using OpenStack and the
OpenStack Ironic bare metal service; an <a class="reference external" href="//www.stackhpc.com/ohpcv2.html">OpenHPC v2 Slurm Ansible playbook</a>
is then applied to the base infrastructure to create the
necessary platform. Performance of the individual nodes is captured
by a sophisticated life-cycle management process, ensuring that
components only enter service once they have transitioned
through a set of health and performance checks. More on that to
come in subsequent blogs.</p>
<p>The resulting system is a cloud-native HPC environment that
offers a great deal of software-defined flexibility for the
customer. For this analysis it provided a convenient programmatic
interface for manipulating switch configurations, and easy
deployment of a Prometheus/Redfish service stack for monitoring
the nodes and Ethernet switches within the system. This proved
invaluable in debugging the environment.</p>
<p>Given these hardware characteristics, and depending (that dreadful
verb again) on the application we use, we would expect inter-switch
performance to be governed by the over-subscription ratios of item 2
above, together with the ratio of additional inter-switch links (ISLs).</p>
<p>As yet we have not been able to find an application/data-set
combination in which there is a significant difference between IB
and RoCE as we scale up nodes. So here we again focus on using
Linpack as the base test, as we know that network performance - and
in particular bi-sectional bandwidth - is a strong influence on
scalability, and we should be able to expose this with the application.
N.B. the system as a whole reached #98 in the most recent (November
2020) top500 list - more details are available here, but suffice to say
this result was with IB.</p>
<p>However, let's first start with base latencies and bandwidth for an
MPI PingPong.</p>
<table border="1" class="docutils">
<colgroup>
<col width="16%" />
<col width="45%" />
<col width="39%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">Network</th>
<th class="head">PingPong Latency (microseconds)</th>
<th class="head">PingPong Bandwidth (MB/sec)</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>IB 100Gbps</td>
<td>1.09</td>
<td>12069.36</td>
</tr>
<tr><td>ROCE 50Gbps</td>
<td>1.49</td>
<td>5738.79</td>
</tr>
</tbody>
</table>
<p>[sources: <a class="reference external" href="https://github.com/stackhpc/hpc-tests/blob/master/output/csd3/cclake-ib-icc19-impi19-ucx/imb/IMB_PingPong/rfm_IMB_PingPong_job.out">IB</a>, <a class="reference external" href="https://github.com/stackhpc/hpc-tests/blob/master/output/csd3/cclake-roce-icc19-impi19-ucx/imb/IMB_PingPong/rfm_IMB_PingPong_job.out">RoCE</a>]. We note here that switch latency is also
included in this measure; it is higher in the case of RoCE, and is
another factor to consider at scale.</p>
<p>Secondly, we now compare the performance within a single rack. The
final headline numbers are shown in the Table below, but how we got
there is an interesting journey.</p>
<p>Single rack, 56 nodes:</p>
<table border="1" class="docutils">
<colgroup>
<col width="41%" />
<col width="59%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">Network</th>
<th class="head">GFLOPS (Linpack)</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>IB 100Gbps</td>
<td>1.34976e+05</td>
</tr>
<tr><td>ROCE 50Gbps</td>
<td>1.33943e+05</td>
</tr>
</tbody>
</table>
<p>N.B. the RoCE number at 50Gbps is within 99% of the IB number.</p>
<p>The first step in the analysis was to make sure we could easily
toggle between the two types of interconnect. In previous tests we
had used openmpi3 and the pmi layer within Slurm. For this SUT we
decided to use the UCX transport, which provides a much easier way
to select the interconnect. However, we did find an interesting
side-effect when using this transport: without a specific setting
(e.g. <tt class="docutils literal">UCX_NET_DEVICES</tt>) to select one of the
mlx5_[0,1] interfaces, the run-time
assumes the system to be a dual-rail configuration, with a perverse
unintended consequence.</p>
<p>The behaviour was identified by observing node metrics -
in particular system CPU - combined with metrics from the Ethernet
switch. The screenshot below shows this behaviour, and proved a
compelling diagnostic tool given that multiple engineers were involved in
the configuration and set-up.</p>
<div class="figure">
<img alt="Grafana dashboard of LINPACK telemetry" src="//www.stackhpc.com/images/linpack-telemetry-1.png" style="width: 750px;" />
</div>
<p>Here we observe a system CPU pattern not seen in previous
experiments, where system CPU is flat and near zero and user CPU
flat at 100%. On a different dashboard we were also monitoring
Ethernet traffic through the switch, as well as packets through the
different mlx devices. This quickly pinpointed the unintended
dual-rail behaviour.</p>
<div class="figure">
<img alt="Grafana dashboard of LINPACK telemetry" src="//www.stackhpc.com/images/linpack-telemetry-2.png" style="width: 750px;" />
</div>
<p>Once we had resolved the single rack performance we moved on to
multiple rack measurements, where armed with the appropriate
dashboards we can begin to monitor the effects due to bi-sectional
bandwidth.</p>
<p>The results are shown in the graph below. Between one and two racks, the
RoCE performance is within 99% of the IB performance, even
taking into account the reduced bi-sectional bandwidth.</p>
<div class="figure">
<img alt="LINPACK performance scaling" src="//www.stackhpc.com/images/linpack-perf.png" style="width: 750px;" />
</div>
<p>We are building up a body of evidence on RoCE vs IB
performance, and will be adding further information as more application
performance data is gathered. These data are being documented in the
following <a class="reference external" href="https://github.com/stackhpc/hpc-tests">GitHub repository</a> and
will be detailed in subsequent blogs. Further work will also look
at I/O performance.</p>
<p>Of course, many stalwarts of HPC interconnects will remain in the
IB camp (and for good reason) but we still see many organisations
moving from IB to Ethernet in the middle of the HPC pyramid. We
also expect the number of options in the market to increase in the
next two years.</p>
<div class="section" id="acknowledgements">
<h2>Acknowledgements</h2>
<p>We would like to take this opportunity to thank members of the
University of Cambridge, University Information Services for help
and support in this article.</p>
</div>
<div class="section" id="get-in-touch">
<h2>Get in touch</h2>
<p>If you would like to get in touch we would love to hear
from you. Reach out to us via <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a>
or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact page</a>.</p>
</div>
An Ansible-driven Slurm "Appliance" for an HPC Environment2021-04-23T16:00:00+01:002021-04-23T16:00:00+01:00Steve Brasiertag:www.stackhpc.com,2021-04-23:/slurm-app.html<p class="first last">We showcase a new Ansible-based Slurm appliance for production HPC clusters</p>
<p>Recently we <a class="reference external" href="https://www.stackhpc.com/ohpcv2.html">discussed</a> our <a class="reference external" href="https://galaxy.ansible.com/stackhpc/openhpc">stackhpc.openhpc</a> Ansible Galaxy role which can deploy a Slurm cluster based on the <a class="reference external" href="https://openhpc.community/">OpenHPC project</a>. Using that role as a base, we have created a <a class="reference external" href="https://github.com/stackhpc/ansible-slurm-appliance">"Slurm appliance"</a> which configures a fully functional and production-ready HPC workload management environment. Currently in preview, this set of Ansible playbooks, roles and configuration creates a CentOS 8 / OpenHPC v2-based Slurm cluster with:</p>
<ul class="simple">
<li>Multiple NFS filesystems - using servers both within or external to the appliance-managed cluster.</li>
<li>Slurm accounting using a MySQL backend.</li>
<li>A monitoring backend integrated with Slurm using Prometheus and ElasticSearch.</li>
<li>Grafana with dashboards for both individual nodes and Slurm jobs.</li>
<li>Production-ready Slurm defaults for access and memory.</li>
<li>Post-deploy MPI-based tests for floating point performance, bandwidth and latency using Intel MPI Benchmarks and the High Performance Linpack suites.</li>
<li>A Packer-based build pipeline for compute node images.</li>
<li>Slurm-driven reimaging of compute nodes.</li>
</ul>
<p>The "software appliance" moniker reflects the fact it is intended to be as "plug-and-play" as possible, with configuration only required to define which services to run where. Yet unlike e.g. a storage appliance it is hardware agnostic, working with anything from general-use cloud VMs to baremetal HPC nodes. In addition, its modular design means that the environment can be customised based on site-specific requirements. The initial list of features above is just a starting point and there are plans to add support for identity/access management, high-performance filesystems, and Spack-based software toolchains. We fully intend for this to be a community effort and that users will propose and integrate new features based on their local needs. We think this combination of useability, flexibility and extendability is genuinely novel and quite different from existing "HPC-in-a-box" type offerings.</p>
<p>The appliance also supports multiple environments in a single repository, making it simple to manage differences between development, staging and production environments.</p>
<p>Future blogs will explore some of these features such as the Slurm-driven reimaging. For now, let's look at the monitoring features in more detail. The appliance can deploy and automatically configure:</p>
<ul class="simple">
<li><a class="reference external" href="https://github.com/prometheus/node_exporter">Prometheus node-exporters</a> to gather information on hardware- and OS-level metrics, such as CPU/memory/network use.</li>
<li>A <a class="reference external" href="https://prometheus.io/docs/introduction/overview/">Prometheus server</a> to scrape and store that data.</li>
<li>A <a class="reference external" href="https://www.mysql.com/">MySQL</a> server and the <a class="reference external" href="https://slurm.schedmd.com/slurmdbd.html">Slurm database daemon</a> to provide enhanced Slurm accounting information.</li>
<li><a class="reference external" href="https://opendistro.github.io/">OpenDistro</a>'s containerised ElasticSearch for archiving and retrieval of log files.</li>
<li>Containerised <a class="reference external" href="https://www.elastic.co/kibana">Kibana</a> for visualisation and searching with ElasticSearch.</li>
<li>Containerised <a class="reference external" href="https://www.elastic.co/beats/filebeat">Filebeat</a> to parse log files and ship to ElasticSearch.</li>
<li><a class="reference external" href="https://podman.io/">Podman</a> to manage these containers.</li>
<li><a class="reference external" href="https://github.com/stackhpc/ansible_collection_slurm_openstack_tools/tree/main/roles/slurm-stats">Tools</a> to convert output from Slurm's <cite>sacct</cite> for ingestion into ElasticSearch.</li>
<li><a class="reference external" href="https://grafana.com/">Grafana</a> to serve browser-based dashboards.</li>
</ul>
<p>That's a complex software stack but to deploy it you just tell the appliance which host(s) should run it. This gives cluster users a Grafana dashboard displaying Slurm jobs:</p>
<div class="figure">
<img alt="Slurm job list dashboard showing 3 CP2K jobs" src="//www.stackhpc.com/images/jobs-cp2-2k.png" style="width: 750px;" />
</div>
<p>Clicking on a job shows a job-specific dashboard which aggregates metrics from all the nodes running the job, for example CPU and network usage as shown here:</p>
<div class="figure">
<img alt="Slurm job details dashboard showing timeseries for CPU load and network traffic" src="//www.stackhpc.com/images/cp2k-3.png" style="width: 750px;" />
</div>
<p>Moving on to the post-deployment tests, these have proved to be a useful tool for picking up issues, especially in combination with the monitoring. Getting an MPI environment working properly is often not straightforward, with potential for incompatibilities between the selected compiler, MPI library, MPI launcher, scheduler integration and batch script options. The key here is that these tests deploy a "known-good" software stack using specific OpenHPC- and Intel-provided packages with pre-defined job options. These tests therefore both ensure that the other aspects of the environment work as expected (hardware, operating system and network issues) and provide a working baseline MPI configuration to test alternative MPI configurations against. Four tests are currently defined:</p>
<ul>
<li><p class="first">Intel MPI Benchmarks PingPong on two (scheduler-selected) nodes, providing a basic look at latency and bandwidth.</p>
</li>
<li><p class="first">A "ping matrix" using a similar ping-pong style zero-size message latency test, but running on every pair-wise combination of nodes. As well as summary statistics this also generates a heat-map style table which exposes problems with specific links or switches. The results below are from a run using 224 nodes in four racks - differences between intra- and inter-rack latency are clearly visible.</p>
<table border="1" class="docutils">
<colgroup>
<col width="50%" />
<col width="50%" />
</colgroup>
<tbody valign="top">
<tr><td><div class="first last figure">
<img alt="Start of pair-wise latency test results with node IDs in first row and column and latency values in cells." src="//www.stackhpc.com/images/pingmatrix-N224-detail.png" style="width: 300px;" />
<p class="caption">Start of results (<a class="reference external" href="//www.stackhpc.com/images/pingmatrix-N224-detail.png">larger image</a>).</p>
</div>
</td>
<td><div class="first last figure">
<img alt="Overview of results showing blocks of colour from latency differences between/within racks." src="//www.stackhpc.com/images/pingmatrix-N224-large.png" style="width: 200px;" />
<p class="caption">Larger extract (<a class="reference external" href="//www.stackhpc.com/images/pingmatrix-N224-large.png">larger image</a>).</p>
</div>
</td>
</tr>
</tbody>
</table>
</li>
<li><p class="first">High Performance Linpack (HPL) on every node individually, to show up "weak" nodes and provide a measure of single-node performance.</p>
</li>
<li><p class="first">HPL using all nodes individually, to test the impact of inter-node communications.</p>
</li>
</ul>
<p>Many papers have been written on selecting optimal parameters for HPL but here only the processor-model-specific block size "NB" needs to be set, with selection of all other parameters being automated. The sustained use of all processor cores in these tests also provides quite a severe stress test for other aspects of the environment such as cooling.</p>
<p>Early versions of this Slurm appliance are in production now on various systems. We plan for it to gradually replace earlier client-specific codebases as we upgrade their Slurm clusters. Not only will this commonality make it possible for us to spend more time adding new features, it will make it much easier for us to bring fixes and new features to those deployments.</p>
<div class="section" id="get-in-touch">
<h2>Get in touch</h2>
<p>If you would like to get in touch we would love to hear
from you. Reach out to us via <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a>
or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact page</a>.</p>
</div>
Cloud-Native HPC at Nvidia GTC2021-04-14T10:00:00+01:002021-04-14T10:00:00+01:00John Taylortag:www.stackhpc.com,2021-04-14:/cambridge-gtc.html<p class="first last">The software-defined supercomputer looks to bring to bear
the modern techniques of research operations (ResOps) through
automation and infrastructure as code.</p>
<p>The cloud-native (or, as we like to call it, <em>software-defined</em>)
supercomputer looks to bring to bear the modern techniques of
research operations (ResOps) through automation and infrastructure
as code.</p>
<p>For HPC this recognises the fact that while the use of public cloud
increases, organisations currently remain better suited to exploit
their own on-premise resources, in order to maximise investment in
advanced high-performance technologies.</p>
<p><a class="reference external" href="https://gtc21.event.nvidia.com/media/Introducing%20Cloud-Native%20Supercomputing%3A%20Bare-Metal%2C%20Secured%20Supercomputing%20Architecture%20%20%5BS32021%5D/1_hpj4tc51?ncid=em-even-113928&amp;amp;amp;">At GTC21</a>
<em>(GTC registration required)</em>, Prof. DK Panda and Dr. Paul Calleja
gave presentations on the new Nvidia Data Processing Unit (DPU) and
its use in cloud native supercomputing environments, respectively,
to deliver secure HPC platforms for clinical research without
compromising performance. StackHPC has been collaborating with the
University for a number of years on this new mode of operation for
HPC services.</p>
<p>However, this does not mean that the use of public cloud for HPC
will remain static: over time, particular workflows may well migrate
to public cloud, as pointed out by Dr. Calleja. In order to be
prepared for these circumstances, on-premise HPC needs to move to
a more cloud-native model, ensuring that operations can take advantage
of a range of cloud resources (not necessarily fixed to one Cloud
Service Provider) and adopt the Hybrid Cloud model. Achieving this
state of interoperability, however, requires renewed investment in
DevOps.</p>
<p>StackHPC's experience and expertise in high performance
networking and cloud methodologies provide a unique capability to
help address these aspects and minimise the impact of this new
engineering method.</p>
<div class="section" id="the-software-defined-supercomputer">
<h2>The Software-Defined Supercomputer</h2>
<p>For more details, please watch our <a class="reference external" href="https://www.openstack.org/videos/summits/virtual/Lessons-learnt-expanding-Cambridge-Universitys-CSD3-Supercomputer-with-OpenStack">recent presentation from the
2020 OpenInfra Summit</a>:</p>
<div class="youtube"><iframe src="https://www.youtube.com/embed/bbw-Fj0F1iY" width="500" height="333" allowfullscreen seamless frameBorder="0"></iframe></div></div>
<div class="section" id="get-in-touch">
<h2>Get in touch</h2>
<p>If you would like to get in touch we would love to hear
from you. Reach out to us via <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a>
or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact page</a>.</p>
</div>
VF-LAG Networking in Kayobe2021-03-30T14:00:00+01:002021-03-30T14:00:00+01:00Stig Telfertag:www.stackhpc.com,2021-03-30:/vflag-kayobe.html<p class="first last">In a virtualised environment, SR-IOV brings closer access to
underlying hardware. Adding VF-LAG overcomes limitations of redundancy
and fault-tolerance, bringing together performance with enterprise features.</p>
<p><em>StackHPC contributed this research to a challenging project with
our clients at</em> <a class="reference external" href="https://www.gresearch.co.uk/article/vf-lag-networking-in-openstack-with-kayobe/">G Research</a>,
<em>technology leaders in the field of high-performance cloud.</em></p>
<p>One aspect of open infrastructure that makes it an exciting field to work in
is that it is a continually changing landscape. This is particularly true in
the arena of high performance networking.</p>
<div class="section" id="open-infrastructure">
<h2>Open Infrastructure</h2>
<p>The concept of Open Infrastructure may not be familiar to all. It
seeks to replicate the open source revolution in compute, exemplified
by Linux, for the distributed management frameworks of cloud
computing. At its core is OpenStack, the cloud operating system
that has grown to become one of the most popular open source projects
in the world.</p>
<p><a class="reference external" href="//www.stackhpc.com/pages/kayobe.html">Kayobe</a> is an open source project
for deploying and operating OpenStack in a model that packages all
OpenStack’s components as containerised microservices and orchestrates
the logic of their deployment, reconfiguration and life cycle using
Ansible.</p>
</div>
<div class="section" id="high-performance-networking">
<h2>High Performance Networking</h2>
<p>In cloud environments, Single-Root IO Virtualisation (SR-IOV) is
still the way to achieve highest performance in virtualised networking,
and remains popular for telcos, high-performance computing (HPC)
and other network-intensive use cases. RDMA (the HPC-derived network
technology for bypassing kernel network stacks) in VMs is only possible
through use of SR-IOV.</p>
<p>The concept of SR-IOV is that the hardware resources of a physical
NIC (the <em>physical function</em>, or PF) are presented as many additional
<em>virtual functions</em> (or VFs). The VFs are treated as separable devices
and can be passed through to VMs to provide them with direct access to
networking hardware.</p>
<p>Historically, this performance has been counter-balanced by limitations
on its use:</p>
<ul class="simple">
<li>SR-IOV configurations usually bypass the security groups that implement
firewall protection for VMs.</li>
<li>SR-IOV used to prevent key operational features such as live
migration (we will cover the new
live migration capabilities added <a class="reference external" href="https://docs.openstack.org/neutron/latest/admin/config-sriov#known-limitations">in the Train release</a>
and their consequences in a follow-up article).</li>
<li>SR-IOV configurations can be complex to set up.</li>
<li>VMs require hardware drivers to enable use of the SR-IOV interfaces.</li>
<li>SR-IOV lacked fault tolerance. In standard configurations SR-IOV
is associated with a single physical network interface.</li>
</ul>
<p>The lack of support for high-availability in networking can be addressed - with
the right network hardware.</p>
</div>
<div class="section" id="mellanox-vf-lag-fault-tolerance-for-sr-iov">
<h2>Mellanox VF-LAG: Fault-tolerance for SR-IOV</h2>
<p>In a resilient design for a virtualised data centre, hypervisors
use bonded NICs to provide network access for control plane data
services and workloads running in VMs. This design provides
active-active use of a pair of high-speed network interfaces, but
would normally exclude the most demanding network-intensive use
cases.</p>
<p>Mellanox NICs have a feature, VF-LAG, which claims to enable SR-IOV
to work in configurations where the ports of a 2-port NIC are bonded
together.</p>
<div class="figure">
<img alt="VF-LAG hypervisor networking" src="//www.stackhpc.com/images/vflag-hypervisor.png" style="width: 500px;" />
<p class="caption"><em>VMs configured using VF-LAG, combining SR-IOV and bonded physical interfaces</em></p>
</div>
<p>In NICs that support it, VF-LAG uses the same technology underpinning
<a class="reference external" href="https://www.mellanox.com/products/ASAP2">ASAP2 OVS hardware offloading</a>;
much of the process for creation of VF-LAG configurations is common
with ASAP2.</p>
<div class="section" id="system-requirements">
<h3>System Requirements</h3>
<ul class="simple">
<li>VF-LAG requires Mellanox ConnectX-5 (or later) NICs.</li>
<li>VF-LAG only works for two ports on the same physical NIC. It cannot
be used for LAGs created using multiple NICs.</li>
<li>Open vSwitch version 2.12 or later (the Train release of Kolla shipped
with Open vSwitch 2.12).</li>
</ul>
</div>
</div>
<div class="section" id="the-process-for-vf-lag-creation">
<h2>The Process for VF-LAG Creation</h2>
<p>A <a class="reference external" href="//www.stackhpc.com/sriov-kayobe.html">standard procedure can be applied for SR-IOV</a>, but for VF-LAG
support some changes are required due to specific ordering restrictions
with VF-LAG hardware initialisation.</p>
<div class="section" id="system-configuration">
<h3>System Configuration</h3>
<div class="section" id="nic-firmware-configuration">
<h4>NIC Firmware Configuration</h4>
<p>Both ports must be put into Ethernet mode. SR-IOV must be enabled,
and the limit on the number of Virtual Functions must be set to the
maximum number of VF-LAG VFs planned during hypervisor operation.</p>
<p>To set firmware parameters, the <tt class="docutils literal">mft</tt> package is required from
<a class="reference external" href="https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed">Mellanox OFED</a>.</p>
<p><em>When OFED packages are installed, take care not to enable the
Mellanox interface manager or openibd service. These will interfere
with device initialisation ordering.</em></p>
<p>Use of Mellanox Firmware Tools is described on a <a class="reference external" href="https://community.mellanox.com/s/article/getting-started-with-mellanox-firmware-tools--mft--for-linux">Mellanox community
support page here</a>.</p>
<div class="highlight"><pre><span></span>mst start
mst status
<span class="c1"># Set Ethernet mode on both ports</span>
mlxconfig -d /dev/mst/<device_name> <span class="nb">set</span> <span class="nv">LINK_TYPE_P1</span><span class="o">=</span><span class="m">2</span> <span class="nv">LINK_TYPE_P2</span><span class="o">=</span><span class="m">2</span>
<span class="c1"># Enable SRIOV and set a maximum number of VFs on each port (here, 8 VFs)</span>
mlxconfig -d /dev/mst/<device_name> <span class="nb">set</span> <span class="nv">SRIOV_EN</span><span class="o">=</span><span class="m">1</span> <span class="nv">NUM_OF_VFS</span><span class="o">=</span><span class="m">8</span>
reboot
</pre></div>
<p>Once applied, these settings are persistent and should not require further changes.</p>
</div>
<div class="section" id="bios-configuration">
<h4>BIOS Configuration</h4>
<p>BIOS support is required for SR-IOV and hardware I/O virtualisation. On Intel systems
this may refer to <a class="reference external" href="https://en.wikipedia.org/wiki/X86_virtualization#Intel-VT-d">VT-d</a>.
No further configuration is required to support VF-LAG.</p>
</div>
<div class="section" id="kernel-boot-parameters">
<h4>Kernel Boot Parameters</h4>
<p>Kernel boot parameters are required to support direct access to SR-IOV hardware using
I/O virtualisation.</p>
<p>For Intel systems these are:</p>
<blockquote>
<div class="highlight"><pre><span></span>intel_iommu=on iommu=pt
</pre></div>
</blockquote>
<p>Similarly for AMD systems:</p>
<blockquote>
<div class="highlight"><pre><span></span>amd_iommu=on iommu=pt
</pre></div>
</blockquote>
<p>(For performance-optimised configurations you might also want to set kernel boot parameters
for static huge pages and processor C-states).</p>
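<p>As an indication of the kind of parameters meant here (the values are
workload-dependent and purely illustrative, not a recommendation):</p>
<blockquote>
<div class="highlight"><pre>default_hugepagesz=1G hugepagesz=1G hugepages=16 processor.max_cstate=1 intel_idle.max_cstate=1
</pre></div>
</blockquote>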
</div>
<div class="section" id="open-vswitch-configuration">
<h4>Open vSwitch Configuration</h4>
<p>Open vSwitch is configured to enable hardware offload of the OVS
data plane. This configuration must be applied to every hypervisor
using VF-LAG, and is applied to the <tt class="docutils literal">openvswitch_vswitchd</tt>
container.</p>
<p>Open vSwitch must be at version 2.12 or later (available as standard
from <a class="reference external" href="http://mirror.centos.org/centos/8/cloud/x86_64/openstack-train/Packages/o/">RDO package archives</a>
and <a class="reference external" href="https://hub.docker.com/r/kolla/centos-binary-openvswitch-vswitchd/tags">Kolla-Ansible</a>
Train release, or later).</p>
<div class="highlight"><pre><span></span><span class="gp">#</span> Enable hardware offloads:
<span class="go">docker exec openvswitch_vswitchd ovs-vsctl set Open_vSwitch . other_config:hw-offload=true</span>
<span class="gp">#</span> Verify hardware offloads have been enabled:
<span class="go">docker exec openvswitch_vswitchd ovs-vsctl get Open_vSwitch . other_config:hw-offload</span>
<span class="go">"true"</span>
</pre></div>
</div>
<div class="section" id="openstack-configuration">
<h4>OpenStack Configuration</h4>
<p>OpenStack Nova and Neutron must be configured for SR-IOV as usual.
See the <a class="reference external" href="//www.stackhpc.com/sriov-kayobe.html">previous blog post</a>
for further details on how this is done.</p>
<p>OpenStack networking must be configured without a Linuxbridge
directly attached to the <tt class="docutils literal">bond0</tt> interface. In standard
configurations this can mean that only tagged VLANs (and <em>not</em> the
native untagged VLAN) can be used with the <tt class="docutils literal">bond0</tt> device.</p>
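<p>In Kayobe configuration terms, this might translate into network
attributes along the following lines (the network name, VLAN and subnet
are illustrative assumptions):</p>
<div class="highlight"><pre># Illustrative extract from etc/kayobe/networks.yml: workload traffic
# uses a tagged VLAN interface on the bond, not bond0 itself.
tenant_net_cidr: 10.100.0.0/24
tenant_net_interface: bond0.100
tenant_net_vlan: 100
</pre></div>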
<div class="figure">
<img alt="VF-LAG Network bridging" src="//www.stackhpc.com/images/vflag-bridging.png" style="width: 750px;" />
<p class="caption"><em>Kayobe hypervisor virtual networking configuration that supports hardware offloading</em></p>
</div>
</div>
</div>
<div class="section" id="boot-time-initialisation">
<h3>Boot-time Initialisation</h3>
<p>The divergence from the standard bootup procedures for SR-IOV starts here.</p>
<div class="section" id="early-setup-creation-of-sr-iov-vfs">
<h4>Early Setup: Creation of SR-IOV VFs</h4>
<p>VF-LAG hardware and driver initialisation must be applied in a strict order with regard
to other network subsystem initialisations. Early hardware and driver configuration must be
performed before the interfaces are bonded together.</p>
<p>Systemd dependencies are used to ensure that VF-LAG initialisation
is applied after device driver initialisation and before starting
the rest of networking initialisation.</p>
<div class="figure">
<img alt="VF-LAG boot dependencies" src="//www.stackhpc.com/images/vflag-boot-dependencies.png" style="width: 750px;" />
<p class="caption"><em>Systemd is used to ensure VF-LAG bootup ordering dependencies are met.</em></p>
</div>
<p>As a systemd unit, define the dependencies as follows to ensure
VF-LAG setup happens after drivers are loaded but before networking initialisation commences
(here we use the physical NIC names <tt class="docutils literal">ens3f0</tt> and <tt class="docutils literal">ens3f1</tt>):</p>
<div class="highlight"><pre><span></span><span class="nv">Requires</span><span class="o">=</span>sys-subsystem-net-devices-ens3f0.device
<span class="nv">After</span><span class="o">=</span>sys-subsystem-net-devices-ens3f0.device
<span class="nv">Requires</span><span class="o">=</span>sys-subsystem-net-devices-ens3f1.device
<span class="nv">After</span><span class="o">=</span>sys-subsystem-net-devices-ens3f1.device
<span class="nv">Before</span><span class="o">=</span>network-pre.target
<span class="nv">Requires</span><span class="o">=</span>network-pre.target
</pre></div>
<p>Virtual functions (VFs) are created on both physical NICs in the bond.</p>
<p>For pass-through to VMs, the VFs are unbound from the Ethernet driver in the host kernel.
VF configuration is managed at "arm's length" through use of a <em>representor</em> device,
created as a placeholder for referring to the VF without taking ownership of it.</p>
<p>This script implements the required process and ordering:</p>
<div class="highlight"><pre><span></span><span class="ch">#!/bin/bash</span>
<span class="o">[</span> -f /etc/sysconfig/sriov <span class="o">]</span> <span class="o">&&</span> <span class="nb">source</span> /etc/sysconfig/sriov
<span class="c1"># Defaults</span>
<span class="c1"># The network devices on which we create VFs.</span>
<span class="nv">SRIOV_PFS</span><span class="o">=</span><span class="si">${</span><span class="nv">SRIOV_PFS</span><span class="k">:-</span><span class="s2">"ens3f0 ens3f1"</span><span class="si">}</span>
<span class="c1"># The number of VFs to create on each PF</span>
<span class="nv">SRIOV_VF_COUNT</span><span class="o">=</span><span class="si">${</span><span class="nv">SRIOV_VF_COUNT</span><span class="k">:-</span><span class="nv">8</span><span class="si">}</span>
<span class="c1"># The number of combined channels to enable on each PF</span>
<span class="nv">SRIOV_PF_CHANNELS</span><span class="o">=</span><span class="si">${</span><span class="nv">SRIOV_PF_CHANNELS</span><span class="k">:-</span><span class="nv">63</span><span class="si">}</span>
<span class="c1"># The number of combined channels to enable on each representor</span>
<span class="nv">SRIOV_VF_CHANNELS</span><span class="o">=</span><span class="si">${</span><span class="nv">SRIOV_VF_CHANNELS</span><span class="k">:-</span><span class="nv">18</span><span class="si">}</span>
<span class="k">function</span> sriov_vf_create
<span class="o">{</span>
<span class="nv">PF_NIC</span><span class="o">=</span><span class="nv">$1</span>
<span class="nv">VF_COUNT</span><span class="o">=</span><span class="nv">$2</span>
<span class="nb">cd</span> /sys/class/net/<span class="nv">$PF_NIC</span>/device
<span class="nv">PF_PCI</span><span class="o">=</span>pci/<span class="k">$(</span>basename <span class="k">$(</span>realpath <span class="nv">$PWD</span><span class="k">))</span>
logger -t mlnx-vflag-early <span class="s2">"Creating </span><span class="nv">$VF_COUNT</span><span class="s2"> VFs for </span><span class="nv">$PF_NIC</span><span class="s2"> (</span><span class="nv">$PF_PCI</span><span class="s2">)"</span>
<span class="nb">echo</span> <span class="nv">$VF_COUNT</span> > sriov_numvfs
<span class="k">for</span> i in <span class="k">$(</span>readlink virtfn*<span class="k">)</span>
<span class="k">do</span>
logger -t mlnx-vflag-early <span class="s2">"Unbinding </span><span class="k">$(</span>basename <span class="nv">$i</span><span class="k">)</span><span class="s2">"</span>
<span class="nb">echo</span> <span class="k">$(</span>basename <span class="nv">$i</span><span class="k">)</span> > /sys/bus/pci/drivers/mlx5_core/unbind
<span class="k">done</span>
<span class="c1"># Put the NIC eSwitch into devlink mode</span>
devlink dev eswitch <span class="nb">set</span> <span class="nv">$PF_PCI</span> mode switchdev
logger -t mlnx-vflag-early <span class="s2">"After enabling switchdev: </span><span class="k">$(</span>devlink dev eswitch show <span class="nv">$PF_PCI</span><span class="k">)</span><span class="s2">"</span>
<span class="o">}</span>
<span class="k">function</span> enable_tc_offload
<span class="o">{</span>
<span class="nv">PF_NIC</span><span class="o">=</span><span class="nv">$1</span>
<span class="nv">TC_OFFLOAD</span><span class="o">=</span><span class="k">$(</span>ethtool -k <span class="nv">$PF_NIC</span> <span class="p">|</span> awk <span class="s1">'{print $2}'</span><span class="k">)</span>
<span class="k">if</span> <span class="o">[[</span> <span class="s2">"</span><span class="nv">$TC_OFFLOAD</span><span class="s2">"</span> !<span class="o">=</span> <span class="s2">"on"</span> <span class="o">]]</span>
<span class="k">then</span>
logger -t mlnx-vflag-early <span class="s2">"Enabling HW TC offload for </span><span class="nv">$PF_NIC</span><span class="s2">"</span>
ethtool -K <span class="nv">$PF_NIC</span> hw-tc-offload on
<span class="k">fi</span>
<span class="o">}</span>
<span class="k">function</span> hwrep_ethtool
<span class="o">{</span>
<span class="c1"># There isn't an obvious way to connect a representor port</span>
<span class="c1"># back to the PF or VF, so apply tuning to all representor ports</span>
<span class="c1"># served by the mlx5e_rep driver.</span>
<span class="nv">hwrep_devs</span><span class="o">=</span><span class="k">$(</span><span class="nb">cd</span> /sys/devices/virtual/net<span class="p">;</span> <span class="k">for</span> i in *
<span class="k">do</span>
ethtool -i <span class="nv">$i</span> <span class="m">2</span>> /dev/null <span class="p">|</span>
awk -v <span class="nv">dev</span><span class="o">=</span><span class="nv">$i</span> <span class="s1">'$1=="driver:" && $2=="mlx5e_rep" {print dev}'</span>
<span class="k">done)</span>
<span class="k">for</span> i in <span class="nv">$hwrep_devs</span>
<span class="k">do</span>
logger -t mlnx-vflag-early <span class="s2">"Tuning receive channels for representor </span><span class="nv">$i</span><span class="s2">"</span>
ethtool -L <span class="nv">$i</span> combined <span class="nv">$SRIOV_VF_CHANNELS</span>
<span class="c1"># Enable hardware TC offload for each representor device</span>
enable_tc_offload <span class="nv">$i</span>
<span class="k">done</span>
<span class="o">}</span>
<span class="k">for</span> PF in <span class="nv">$SRIOV_PFS</span>
<span class="k">do</span>
<span class="c1"># Validate that the NIC exists as a network device</span>
<span class="k">if</span> <span class="o">[[</span> ! -d /sys/class/net/<span class="nv">$PF</span> <span class="o">]]</span>
<span class="k">then</span>
logger -t mlnx-vflag-early <span class="s2">"NIC </span><span class="nv">$PF</span><span class="s2"> not found, aborting"</span>
<span class="nb">echo</span> <span class="s2">"mlnx-vflag-early: NIC </span><span class="nv">$PF</span><span class="s2"> not found"</span> ><span class="p">&</span><span class="m">2</span>
<span class="nb">exit</span> -1
<span class="k">fi</span>
<span class="c1"># Validate that the NIC is not already up and active in a bond</span>
<span class="c1"># It appears this could be fatal.</span>
<span class="nv">dev_flags</span><span class="o">=</span><span class="k">$(</span>ip link show dev <span class="nv">$PF</span> <span class="p">|</span> grep -o <span class="s1">'<.*>'</span><span class="k">)</span>
grep -q <span class="s1">'\<SLAVE\>'</span> <span class="o"><<<</span> <span class="nv">$dev_flags</span>
<span class="k">if</span> <span class="o">[[</span> <span class="nv">$?</span> -eq <span class="m">0</span> <span class="o">]]</span>
<span class="k">then</span>
logger -t mlnx-vflag-early <span class="s2">"NIC </span><span class="nv">$PF</span><span class="s2"> already part of a bond, aborting"</span>
<span class="nb">echo</span> <span class="s2">"mlnx-vflag-early: NIC </span><span class="nv">$PF</span><span class="s2"> already part of a bond"</span> ><span class="p">&</span><span class="m">2</span>
<span class="nb">exit</span> -1
<span class="k">fi</span>
sriov_vf_create <span class="nv">$PF</span> <span class="nv">$SRIOV_VF_COUNT</span>
enable_tc_offload <span class="nv">$PF</span>
<span class="c1"># Raise the receive channels configured for this PF, if too low</span>
logger -t mlnx-vflag-early <span class="s2">"Tuning receive channels for PF </span><span class="nv">$PF</span><span class="s2">"</span>
ethtool -L <span class="nv">$PF</span> combined <span class="nv">$SRIOV_PF_CHANNELS</span>
<span class="k">done</span>
hwrep_ethtool
</pre></div>
</div>
<div class="section" id="late-boot-binding-vfs-back">
<h4>Late Boot: Binding VFs back</h4>
<p>A second systemd unit is required to run later in the boot process: after networking setup
is complete but before the containerised OpenStack services are started.</p>
<p>This can be achieved with the following dependencies:</p>
<div class="highlight"><pre><span></span><span class="nv">Wants</span><span class="o">=</span>network-online.target
<span class="nv">After</span><span class="o">=</span>network-online.target
<span class="nv">Before</span><span class="o">=</span>docker.service
</pre></div>
<p>At this point, the VFs are rebound to the Mellanox network driver. The following script
serves as an example:</p>
<div class="highlight"><pre><span></span><span class="ch">#!/bin/bash</span>
<span class="o">[</span> -f /etc/sysconfig/sriov <span class="o">]</span> <span class="o">&&</span> <span class="nb">source</span> /etc/sysconfig/sriov
<span class="k">function</span> sriov_vf_bind
<span class="o">{</span>
<span class="nv">PF_NIC</span><span class="o">=</span><span class="nv">$1</span>
<span class="k">if</span> <span class="o">[[</span> ! -d /sys/class/net/<span class="nv">$PF_NIC</span> <span class="o">]]</span>
<span class="k">then</span>
logger -t mlnx-vflag-final <span class="s2">"NIC </span><span class="nv">$PF_NIC</span><span class="s2"> not found, aborting"</span>
<span class="nb">echo</span> <span class="s2">"mlnx-vflag-final: NIC </span><span class="nv">$PF_NIC</span><span class="s2"> not found"</span> ><span class="p">&</span><span class="m">2</span>
<span class="nb">exit</span> -1
<span class="k">fi</span>
<span class="c1"># Validate that the NIC is configured to be part of a bond.</span>
<span class="nv">dev_flags</span><span class="o">=</span><span class="k">$(</span>ip link show dev <span class="nv">$PF_NIC</span> <span class="p">|</span> grep -o <span class="s1">'<.*>'</span><span class="k">)</span>
grep -q <span class="s1">'\<SLAVE\>'</span> <span class="o"><<<</span> <span class="nv">$dev_flags</span>
<span class="k">if</span> <span class="o">[[</span> <span class="nv">$?</span> -ne <span class="m">0</span> <span class="o">]]</span>
<span class="k">then</span>
logger -t mlnx-vflag-final <span class="s2">"NIC </span><span class="nv">$PF_NIC</span><span class="s2"> not part of a bond, VF-LAG abort"</span>
<span class="nb">echo</span> <span class="s2">"mlnx-vflag-final: NIC </span><span class="nv">$PF_NIC</span><span class="s2"> not part of a bond, VF-LAG abort"</span> ><span class="p">&</span><span class="m">2</span>
<span class="nb">exit</span> -1
<span class="k">fi</span>
<span class="c1"># It appears we need to rebind the VFs to NIC devices, and then</span>
<span class="c1"># attach the NIC devices to the OVS bridge to which our bond is attached.</span>
<span class="nb">cd</span> /sys/class/net/<span class="nv">$PF_NIC</span>/device
<span class="nv">PF_PCI</span><span class="o">=</span>pci/<span class="k">$(</span>basename <span class="k">$(</span>realpath <span class="nv">$PWD</span><span class="k">))</span>
<span class="k">for</span> i in <span class="k">$(</span>readlink virtfn*<span class="k">)</span>
<span class="k">do</span>
logger -t mlnx-vflag-final <span class="s2">"Binding </span><span class="k">$(</span>basename <span class="nv">$i</span><span class="k">)</span><span class="s2">"</span>
<span class="nb">echo</span> <span class="k">$(</span>basename <span class="nv">$i</span><span class="k">)</span> > /sys/bus/pci/drivers/mlx5_core/bind
<span class="k">done</span>
<span class="o">}</span>
<span class="c1"># The network devices on which we create VFs.</span>
<span class="nv">SRIOV_PFS</span><span class="o">=</span><span class="si">${</span><span class="nv">SRIOV_PFS</span><span class="k">:-</span><span class="s2">"ens3f0 ens3f1"</span><span class="si">}</span>
<span class="k">for</span> PF in <span class="nv">$SRIOV_PFS</span>
<span class="k">do</span>
sriov_vf_bind <span class="nv">$PF</span>
<span class="k">done</span>
</pre></div>
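<p>With both scripts wrapped in units, enable them so that they run on every boot (the unit
names below are the hypothetical ones matching the log tags used above):</p>
<div class="highlight"><pre><span></span># systemctl daemon-reload
# systemctl enable mlnx-vflag-early.service mlnx-vflag-final.service
</pre></div>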
</div>
</div>
</div>
<div class="section" id="performance-tuning">
<h2>Performance Tuning</h2>
<p>High-performance networking will also benefit from configuring multiple receive channels.
These should be configured for the PF and also the representors. Check the current setting
using <tt class="docutils literal">ethtool <span class="pre">-l</span></tt> and adjust if necessary:</p>
<div class="highlight"><pre><span></span><span class="go">ethtool -L ens3f0 combined 18</span>
</pre></div>
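<p>The resulting settings can be confirmed with <tt class="docutils literal">ethtool <span class="pre">-l</span></tt>
(output abridged, values illustrative):</p>
<div class="highlight"><pre><span></span># ethtool -l ens3f0
Channel parameters for ens3f0:
Pre-set maximums:
Combined:       63
Current hardware settings:
Combined:       18
</pre></div>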
<p>Mellanox maintains a comprehensive <a class="reference external" href="https://community.mellanox.com/s/article/performance-tuning-for-mellanox-adapters">performance tuning guide</a>
for their NICs.</p>
</div>
<div class="section" id="using-vf-lag-interfaces-in-vms">
<h2>Using VF-LAG interfaces in VMs</h2>
<p>As with other SR-IOV and OVS hardware-offloaded ports, ports using VF-LAG with OVS hardware
offloading must be created explicitly, with custom parameters:</p>
<div class="highlight"><pre><span></span><span class="go">openstack port create --network $net_name --vnic-type=direct --binding-profile '{"capabilities": ["switchdev"]}' $hostname-vflag</span>
</pre></div>
<p>A VM instance can be created specifying the VF-LAG port. In this example, it is one of
two ports connected to the VM:</p>
<div class="highlight"><pre><span></span><span class="go">openstack server create --key-name $keypair --image $image --flavor $flavor --nic net-id=$tenant_net --nic port-id=$vflag_port_id $hostname</span>
</pre></div>
<p>The VM image should include the Mellanox NIC kernel drivers in order to use the VF-LAG interface.</p>
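<p>If in doubt, the binding attributes of the port can be inspected after creation; a sketch
(output abridged and illustrative):</p>
<div class="highlight"><pre><span></span>$ openstack port show $hostname-vflag -c binding_profile -c binding_vnic_type
| binding_profile   | capabilities='[switchdev]' |
| binding_vnic_type | direct                     |
</pre></div>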
</div>
<div class="section" id="troubleshooting-is-it-working">
<h2>Troubleshooting: Is it Working?</h2>
<p>Check that the Mellanox Ethernet driver is managing the LAG correctly:</p>
<div class="highlight"><pre><span></span><span class="gp">#</span> dmesg <span class="p">|</span> grep <span class="s1">'mlx5.*lag'</span>
<span class="go">[ 44.064025] mlx5_core 0000:37:00.0: lag map port 1:2 port 2:2</span>
<span class="go">[ 44.196781] mlx5_core 0000:37:00.0: modify lag map port 1:1 port 2:1</span>
<span class="go">[ 46.491380] mlx5_core 0000:37:00.0: modify lag map port 1:2 port 2:2</span>
<span class="go">[ 46.591272] mlx5_core 0000:37:00.0: modify lag map port 1:1 port 2:2</span>
</pre></div>
<p>Check that the VFs have been created during bootup:</p>
<div class="highlight"><pre><span></span><span class="gp">#</span> lspci <span class="p">|</span> grep Mellanox
<span class="go">5d:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]</span>
<span class="go">5d:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]</span>
<span class="go">5d:00.2 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]</span>
<span class="go">5d:00.3 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]</span>
<span class="go">5d:00.4 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]</span>
<span class="go">5d:00.5 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]</span>
<span class="go">5d:00.6 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]</span>
<span class="go">5d:00.7 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]</span>
<span class="go">5d:01.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]</span>
<span class="go">5d:01.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]</span>
<span class="go">5d:01.2 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]</span>
<span class="go">5d:01.3 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]</span>
<span class="go">5d:01.4 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]</span>
<span class="go">5d:01.5 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]</span>
<span class="go">5d:01.6 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]</span>
<span class="go">5d:01.7 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]</span>
<span class="go">5d:02.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]</span>
<span class="go">5d:02.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]</span>
</pre></div>
<p>Data about VFs for a given NIC (PF) can also be retrieved using <tt class="docutils literal">ip link</tt> (here for <tt class="docutils literal">ens3f0</tt>):</p>
<div class="highlight"><pre><span></span><span class="gp">#</span> ip link show dev ens3f0
<span class="go">18: ens3f0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000</span>
<span class="go"> link/ether 24:8a:07:b4:30:8a brd ff:ff:ff:ff:ff:ff</span>
<span class="go"> vf 0 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off</span>
<span class="go"> vf 1 link/ether 3a:c0:c7:a5:ab:b2 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off</span>
<span class="go"> vf 2 link/ether 82:f5:8f:52:dc:2f brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off</span>
<span class="go"> vf 3 link/ether 3a:62:76:ef:69:d3 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off</span>
<span class="go"> vf 4 link/ether da:07:4c:3d:29:7a brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off</span>
<span class="go"> vf 5 link/ether 7e:9b:4c:98:3b:ff brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off</span>
<span class="go"> vf 6 link/ether 42:28:d1:6a:0d:5d brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off</span>
<span class="go"> vf 7 link/ether 86:d2:c8:a4:1b:c6 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off</span>
</pre></div>
<p>The configured number of VFs should also be available via sysfs (shown here for NIC <tt class="docutils literal">eno5</tt>):</p>
<div class="highlight"><pre><span></span><span class="go">cat /sys/class/net/eno5/device/sriov_numvfs</span>
<span class="go">8</span>
</pre></div>
<p>Check that the Mellanox NIC eSwitch has been put into <tt class="docutils literal">switchdev</tt> mode, not <tt class="docutils literal">legacy</tt> mode
(use the PCI bus address for the NIC from <tt class="docutils literal">lspci</tt>, here <tt class="docutils literal">37:00.0</tt>):</p>
<div class="highlight"><pre><span></span><span class="gp">#</span> devlink dev eswitch show pci/0000:37:00.0
<span class="go">pci/0000:37:00.0: mode switchdev inline-mode none encap enable</span>
</pre></div>
<p>Check that tc hardware offloads are enabled on the physical NICs and also the representor ports
(shown here for a NIC <tt class="docutils literal">ens3f0</tt> and a representor <tt class="docutils literal">eth0</tt>):</p>
<div class="highlight"><pre><span></span><span class="gp">#</span> ethtool -k ens3f0 <span class="p">|</span> grep hw-tc-offload
<span class="go">hw-tc-offload: on</span>
<span class="gp">#</span> ethtool -k eth0 <span class="p">|</span> grep hw-tc-offload
<span class="go">hw-tc-offload: on</span>
</pre></div>
<p>Check that Open vSwitch is at version 2.12 or later:</p>
<div class="highlight"><pre><span></span><span class="gp">#</span> docker <span class="nb">exec</span> openvswitch_vswitchd ovs-vsctl --version
<span class="go">ovs-vsctl (Open vSwitch) 2.12.0</span>
<span class="go">DB Schema 8.0.0</span>
</pre></div>
<p>Check that Open vSwitch has hardware offloads enabled:</p>
<div class="highlight"><pre><span></span><span class="gp">#</span> docker <span class="nb">exec</span> openvswitch_vswitchd ovs-vsctl get Open_vSwitch . other_config:hw-offload
<span class="go">"true"</span>
</pre></div>
<p>Once a VM has been created and has network activity on an SR-IOV interface,
check for hardware-offloaded flows in Open vSwitch. Look for offloaded flows coming in
on both <tt class="docutils literal">bond0</tt> and on the SR-IOV VF:</p>
<div class="highlight"><pre><span></span><span class="gp">#</span> docker <span class="nb">exec</span> openvswitch_vswitchd ovs-appctl dpctl/dump-flows --names <span class="nv">type</span><span class="o">=</span>offloaded
<span class="go">in_port(bond0),eth(src=98:5d:82:b5:d2:e5,dst=fa:16:3e:44:44:71),eth_type(0x8100),vlan(vid=540,pcp=0),encap(eth_type(0x0800),ipv4(frag=no)), packets:29, bytes:2842, used:0.550s, actions:pop_vlan,eth9</span>
<span class="go">in_port(eth9),eth(src=fa:16:3e:44:44:71,dst=00:1c:73:00:00:99),eth_type(0x0800),ipv4(frag=no), packets:29, bytes:2958, used:0.550s, actions:push_vlan(vid=540,pcp=0),bond0</span>
</pre></div>
<p>Watch out for Open vSwitch errors logged in this form, a sign that flows are not being offloaded successfully:</p>
<div class="highlight"><pre><span></span><span class="go">2020-03-19T11:42:17.028Z|00001|dpif_netlink(handler223)|ERR|failed to offload flow: Operation not supported: bond0</span>
</pre></div>
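<p>Since VF-LAG tracks the state of the kernel bond, it is also worth confirming that the
bond itself is healthy (output abridged and illustrative; the bonding mode depends on your
configuration):</p>
<div class="highlight"><pre><span></span># cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
MII Status: up
</pre></div>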
</div>
<div class="section" id="performance">
<h2>Performance</h2>
<p>The performance achieved rewards the effort. VMs connected with
VF-LAG come close to saturating a PCIe gen 3 bus (just
under 100Gb/s). Message latency overheads, measured using standard HPC
benchmarking tools, are measurable but small.</p>
<p>Looking ahead, we'll be building on this capability in follow-up blog
posts.</p>
</div>
<div class="section" id="references">
<h2>References</h2>
<p>Documentation on VF-LAG is hard to find (hence this post):</p>
<ul class="simple">
<li>A <a class="reference external" href="https://community.mellanox.com/s/article/Configuring-VF-LAG-using-TC">Mellanox community post</a>
with step-by-step guidance for a manual setup of VF-LAG.</li>
</ul>
<p>However, ASAP2 integration with OpenStack is well covered in some online
sources:</p>
<ul class="simple">
<li>Mellanox provides a useful page on <a class="reference external" href="https://community.mellanox.com/s/article/ASAP-Basic-Debug">debugging ASAP2</a>, whose requirements are similar to this VF-LAG use case.</li>
<li>OpenStack-Ansible has a page on <a class="reference external" href="https://docs.openstack.org/openstack-ansible-os_neutron/latest/app-openvswitch-asap.html">configuring for ASAP2</a>.</li>
<li>The Neutron documentation has another useful page on <a class="reference external" href="https://docs.openstack.org/neutron/latest/admin/config-ovs-offload.html">configuring OVS hardware offloads</a>.</li>
</ul>
</div>
<div class="section" id="get-in-touch">
<h2>Get in touch</h2>
<p>If you would like to get in touch we would love to hear
from you. Reach out to us via <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a>
or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact page</a>.</p>
</div>
OpenHPC v2 - Enhancements and Demos2021-02-26T16:00:00+00:002021-04-11T19:00:00+01:00Steve Brasiertag:www.stackhpc.com,2021-02-26:/ohpcv2.html<p class="first last">We look at the key changes in OpenHPC v2 and some Ansible making use of those capabilities</p>
<p><strong>UPDATED:</strong> See <a class="reference internal" href="#openhpc-v2-1-released">OpenHPC v2.1 Released</a> <strong>below.</strong></p>
<p>The <a class="reference external" href="https://openhpc.community/">OpenHPC project</a> is a key part of the "HPC" side of StackHPC. It essentially provides a set of packages which make it easy to install the system software for a
scheduler-based HPC cluster, including the scheduler itself, compilers, MPI libraries, maths and I/O libraries, performance tools, etc., all integrated into a
neat set of hierarchical modules. While OpenHPC’s own documentation has recipes for using Warewulf to create and deploy images for compute nodes, we generally use our
<a class="reference external" href="https://galaxy.ansible.com/stackhpc/openhpc">stackhpc.openhpc</a> Ansible Galaxy role which can configure all nodes in a cluster from a base image with a single command.</p>
<p>OpenHPC v2.0 was <a class="reference external" href="https://openhpc.community/openhpc-2-0-released/">released in October 2020</a> and we've since deployed it to client systems using our Galaxy role, including
the 1000+ node Top 100 and Top 500 systems we recently <a class="reference external" href="https://www.stackhpc.com/sc20-top500.html">blogged about</a>. OpenHPC v2.0 is a significant upgrade from its predecessor v1.3.9 (released in November 2019), with new versions of software at all levels of the stack. However, it requires CentOS 8, so it is not a trivial upgrade, although presumably at least some of these enhancements will eventually show up in the CentOS 7-based v1.3.10 which was <a class="reference external" href="https://groups.io/g/OpenHPC-users/message/3833">originally planned</a> for the end of 2020. Let's take a look at what that upgrade gets you.</p>
<p>Starting with the scheduler, OpenHPC v2.0 updates Slurm to v20.02.5. This adds a <a class="reference external" href="https://slurm.schedmd.com/configless_slurm.html">"configless"</a> mode which enables compute
and login nodes to pull configuration information directly from the Slurm control daemon rather than having to distribute the config file to every node. As well as simplifying the overall cluster configuration, this approach means that changes to Slurm configuration, such as added or removed nodes, no longer need to be replicated in compute node images
or mounted over a network filesystem, simplifying building images for compute nodes.</p>
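<p>Under the hood this is a small configuration change: roughly, one flag on the control
daemon and one flag telling <tt class="docutils literal">slurmd</tt> where to fetch its configuration
from. A sketch (our role takes care of this for you):</p>
<div class="highlight"><pre><span></span># slurm.conf on the control host:
SlurmctldParameters=enable_configless
# compute/login nodes then start slurmd pointing at the control host:
slurmd --conf-server &lt;control-host&gt;
</pre></div>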
<p>OpenHPC provides a variety of compilers and MPI libraries, but the GCC + OpenMPI combination will be of interest to many users needing a FOSS toolchain. OpenHPC v2.0 updates the
packaged GNU compilers from v8 to v9, and OpenMPI from v3 to v4, but the building of OpenMPI against <a class="reference external" href="https://www.openucx.org/">UCX</a> is possibly the most obvious change to users.
UCX is a communications framework which aims to provide optimized performance with a unified interface both “up” to the user and “down” to developers across a range of hardware and platforms. While adding yet another layer into the already-complicated HPC interconnect/networking/fabric stack may feel somewhat unhelpful, UCX does simplify life for users. For example we recently carried out <a class="reference external" href="https://github.com/stackhpc/hpc-tests/">benchmarking</a> of a range of MPI applications to compare performance between InfiniBand and RoCE interconnects. Getting RoCE to work on Mellanox ConnectX4 cards with the "native" OpenMPI Byte Transport Layer required setting:</p>
<div class="highlight"><pre><span></span><span class="nv">OMPI_MCA_btl</span><span class="o">=</span>openib,self,vader
<span class="nv">OMPI_MCA_btl_openib_if_include</span><span class="o">=</span>mlx5_1:1
<span class="nv">OMPI_OPENIB_ROCE_QUEUES</span><span class="o">=</span><span class="s2">"--mca btl_openib_receive_queues P,128,64,32,32,32:S,2048,1024,128,32:S,12288,1024,128,32:S,65536,1024,128,32"</span>
</pre></div>
<p>with pingpong tests showing some poor performance at some message sizes. Rather than optimising the queues to try to improve this, using UCX provided more consistent RoCE performance by simply setting:</p>
<div class="highlight"><pre><span></span><span class="nv">UCX_NET_DEVICES</span><span class="o">=</span>mlx5_1:1
</pre></div>
<p>As a bonus, the MPICH packages in OpenHPC v2.0 also use UCX and Intel MPI supports it from v2019.5, so using multiple MPI libraries gets significantly simpler.</p>
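<p>When debugging interconnect selection, UCX's own introspection tool is handy for showing
which devices and transports it can use on a node (the module names follow the OpenHPC
hierarchy but are illustrative):</p>
<div class="highlight"><pre><span></span>$ module load gnu9 openmpi4
$ ucx_info -d | grep -i 'transport\|device'
</pre></div>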
<p>We've just released v0.7.0 of our Ansible Galaxy <a class="reference external" href="https://galaxy.ansible.com/stackhpc/openhpc">OpenHPC role</a>. The first major enhancement in this version is support for the new "configless" mode when using OpenHPC v2.0 (on Centos 8 - support for OpenHPC v1.x/Centos 7 is still included but without this mode). The second major enhancement is new options to configure <cite>slurmdb</cite> and the accounting plugin. This significantly enhances the accounting information available via <cite>sacct</cite> compared to the default text-file-based storage, and enables us to build job-specific monitoring dashboards for the cluster. Less obviously, a number of internal tweaks have been made to improve using the role in image build pipelines for compute nodes. For a full list of what's new see the <a class="reference external" href="https://github.com/stackhpc/ansible-role-openhpc/releases/tag/v0.7.0">v0.7.0 release notes</a>.</p>
<p>To maintain backwards compatibility these features aren't turned on by default, so as an example of what's needed here's the configuration for a Slurm cluster in configless mode, slurmdb/enhanced accounting enabled and 2x partitions:</p>
<div class="highlight"><pre><span></span><span class="p p-Indicator">-</span> <span class="nt">name</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">Setup slurm</span>
<span class="nt">hosts</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">openhpc</span>
<span class="nt">become</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">yes</span>
<span class="nt">tags</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="l l-Scalar l-Scalar-Plain">openhpc</span>
<span class="nt">tasks</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="nt">import_role</span><span class="p">:</span>
<span class="nt">name</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">stackhpc.openhpc</span>
<span class="nt">vars</span><span class="p">:</span>
<span class="nt">openhpc_enable</span><span class="p">:</span>
<span class="nt">control</span><span class="p">:</span> <span class="s">"{{</span><span class="nv"> </span><span class="s">inventory_hostname</span><span class="nv"> </span><span class="s">in</span><span class="nv"> </span><span class="s">groups['control']</span><span class="nv"> </span><span class="s">}}"</span>
<span class="nt">batch</span><span class="p">:</span> <span class="s">"{{</span><span class="nv"> </span><span class="s">inventory_hostname</span><span class="nv"> </span><span class="s">in</span><span class="nv"> </span><span class="s">groups['compute']</span><span class="nv"> </span><span class="s">}}"</span>
<span class="nt">database</span><span class="p">:</span> <span class="s">"{{</span><span class="nv"> </span><span class="s">inventory_hostname</span><span class="nv"> </span><span class="s">in</span><span class="nv"> </span><span class="s">groups['control']</span><span class="nv"> </span><span class="s">}}"</span>
<span class="nt">runtime</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">true</span>
<span class="nt">openhpc_slurm_accounting_storage_type</span><span class="p">:</span> <span class="s">'accounting_storage/slurmdbd'</span>
<span class="nt">openhpc_slurmdbd_mysql_password</span><span class="p">:</span> <span class="s">"{{</span><span class="nv"> </span><span class="s">secrets_openhpc_mysql_slurm_password</span><span class="nv"> </span><span class="s">}}"</span>
<span class="nt">openhpc_slurm_control_host</span><span class="p">:</span> <span class="s">"{{</span><span class="nv"> </span><span class="s">groups['control']</span><span class="nv"> </span><span class="s">|</span><span class="nv"> </span><span class="s">first</span><span class="nv"> </span><span class="s">}}"</span>
<span class="nt">openhpc_slurm_partitions</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="nt">name</span><span class="p">:</span> <span class="s">"hpc"</span>
<span class="nt">default</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">YES</span>
<span class="nt">maxtime</span><span class="p">:</span> <span class="s">"3-0"</span> <span class="c1"># 3 days 0 hours</span>
<span class="p p-Indicator">-</span> <span class="nt">name</span><span class="p">:</span> <span class="s">"express"</span>
<span class="nt">default</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">NO</span>
<span class="nt">maxtime</span><span class="p">:</span> <span class="s">"1:0:0"</span> <span class="c1"># 1 hour 0m 0s</span>
<span class="nt">openhpc_slurm_configless</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">true</span>
<span class="nt">openhpc_login_only_nodes</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">login</span>
</pre></div>
<p>Using this role for the core Slurm functionality, we're now building a flexible Ansible-based "Slurm appliance" around it which automates deployment and configuration of an entire HPC environment. At present it includes the Slurm-based monitoring as mentioned above, post-deployment performance tests, and additional filesystems, as well as providing production-ready configuration for aspects such as PAM and user limits. Our OpenHPC role makes deploying a Slurm cluster easy, and we're excited that this appliance will provide the same ease of use for a much richer user experience. Watch this space for details ...</p>
<div class="section" id="openhpc-v2-1-released">
<h2>OpenHPC v2.1 Released</h2>
<p>This version of OpenHPC was <a class="reference external" href="https://github.com/openhpc/ohpc/releases/tag/v2.1.GA">released</a> on 6th April 2021 and supports CentOS 8.3. While this release is numbered as a minor version change (and many of the included packages do have minor version upgrades), this is a fairly significant change for Slurm-based systems. The Slurm version changes from 20.02.5 to 20.11.3 and the <a class="reference external" href="https://slurm.schedmd.com/archive/slurm-20.11.3/news.html">release notes</a> for this version show some significant changes. New features such as <a class="reference external" href="https://slurm.schedmd.com/slurm.conf.html#OPT_Dynamic-Future-Nodes">"dynamic future nodes"</a>, allowing nodes to be specified by hardware configuration rather than by name, are interesting, but there are two potential pitfalls.</p>
<p>Firstly, the <tt class="docutils literal">filetxt</tt> plugin for accounting storage is no longer supported. While only supporting basic accounting features, it was enabled simply by setting a slurm.conf parameter (which our Ansible Galaxy <a class="reference external" href="https://galaxy.ansible.com/stackhpc/openhpc">OpenHPC role</a> did by default). Now, enabling accounting requires setting up a MySQL or MariaDB database and the Slurm database daemon. For production clusters this is probably preferable anyway (and is supported by our OpenHPC role), but for some situations the simplicity of the <tt class="docutils literal">filetxt</tt> approach will be missed. One partial mitigation is to use job completion logging instead, which allows <tt class="docutils literal">sacct <span class="pre">-c</span></tt> to at least show job completion information. This is again a simple slurm.conf change, and is supported by our Galaxy OpenHPC role.</p>
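<p>A sketch of the slurm.conf settings for job completion logging (the log location is an
illustrative assumption):</p>
<div class="highlight"><pre><span></span>JobCompType=jobcomp/filetxt
JobCompLoc=/var/log/slurm/job_completions.log
</pre></div>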
<p>Secondly, a 20.11.3 <tt class="docutils literal">slurmd</tt> cannot communicate with a 20.02.5 <tt class="docutils literal">slurmctld</tt>. As per Slurm's <a class="reference external" href="https://slurm.schedmd.com/quickstart_admin.html#upgrade">versioning scheme</a> the major release is given by combining the first two parts of the version number, so this is not surprising. As the newer version can read statefiles, etc., from the old version, a smooth upgrade is possible.</p>
<p>The problem is that both OpenHPC v2.x versions are in the same repos. So using our Galaxy OpenHPC role on a CentOS 8.x system before 6th April created an OpenHPC v2.0 node using Slurm 20.02.5, and now creates an OpenHPC v2.1 node using Slurm 20.11.3. Not only did the role start failing in CI (due to the now-unsupported default accounting configuration), but adding a compute node to an existing cluster by rerunning the role failed, as the updated packages on the new compute node meant it couldn't communicate with the older slurmctld.</p>
<p>If required, OpenHPC version pinning should be achievable through modification of the installed repo configurations, but this will be messy and require amending for each new version; a lighter-weight alternative is sketched after the list below. For now, note that:</p>
<ul class="simple">
<li>Running old versions of our Galaxy OpenHPC role against an existing cluster (with no new nodes) will not cause problems, as the role does not update packages itself.</li>
<li>Do not run a <tt class="docutils literal">yum/dnf</tt> update of <tt class="docutils literal"><span class="pre">*-ohpc</span></tt> packages unless done as part of a Slurm upgrade.</li>
<li>Use the new v0.8 release of our Galaxy OpenHPC role for all new CentOS 8.x clusters, if using the default accounting configuration. This version disables accounting by default and is therefore compatible with Slurm 20.11.3.</li>
<li>Adding nodes to existing OpenHPC v2.0 clusters should probably be done using existing images, rather than by directly installing OpenHPC on the node (whether via our Galaxy OpenHPC role or any other means).</li>
</ul>
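<p>As a lighter-weight alternative to editing the repo definitions, the <tt class="docutils literal">dnf</tt>
versionlock plugin can pin the Slurm packages in place; a sketch (the package glob is illustrative):</p>
<div class="highlight"><pre><span></span># dnf install 'dnf-command(versionlock)'
# dnf versionlock add 'slurm*'
</pre></div>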
<div class="section" id="get-in-touch">
<h3>Get in touch</h3>
<p>If you would like to get in touch we would love to hear
from you. Reach out to us via <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a>
or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact page</a>.</p>
</div>
</div>
OpenStack in the TOP5002020-11-18T11:00:00+00:002020-11-18T11:00:00+00:00John Garbutttag:www.stackhpc.com,2020-11-18:/sc20-top500.html<p class="first last">For the first time, the November Top500 list includes fully
OpenStack-based Software-Defined Supercomputers. Firstly at #99 is
UM6P’s Toubkal, and at #421 is Cambridge University’s Cascade Lake
extension of CSD3. StackHPC is thrilled to have been involved in
both.</p>
<p>For the first time, the November TOP500 list (published to coincide
with <a class="reference external" href="https://sc20.supercomputing.org">Supercomputing 2020</a>)
includes fully OpenStack-based Software-Defined Supercomputers:</p>
<ul class="simple">
<li>At <a class="reference external" href="https://www.top500.org/system/179908/">#99 is UM6P’s Toubkal</a></li>
<li>At <a class="reference external" href="https://www.top500.org/system/179909/">#421 is Cambridge University’s</a>
Cascade Lake <a class="reference external" href="https://www.gov.uk/government/news/12-billion-for-the-worlds-most-powerful-weather-and-climate-supercomputer">extension of CSD3</a>.</li>
</ul>
<p>Drawing on experience including from the <a class="reference external" href="https://www.skatelescope.org">SKA Telescope</a> Science Data Processor <a class="reference external" href="//www.stackhpc.com/ironic-idrac-ztp.html">Performance
Prototyping Platform</a> and <a class="reference external" href="https://verneglobal.com/">Verne
Global's</a> <a class="reference external" href="//www.stackhpc.com/verne-globals-hpcdirect-service-bare-metal-powered-by-molten-rock.html">hpcDIRECT</a> project,
StackHPC has helped bootstrap and is providing support for these
OpenStack deployments. They are deployed and operated using <a class="reference external" href="https://docs.openstack.org/kayobe">OpenStack
Kayobe</a> and <a class="reference external" href="https://docs.openstack.org/kolla-ansible">OpenStack
Kolla-Ansible</a>.</p>
<p>A key part of the solution is being able to deploy an <a class="reference external" href="https://openhpc.community">OpenHPC-2.0</a>
Slurm cluster on server infrastructure managed by <a class="reference external" href="https://www.openstack.org/use-cases/bare-metal/">OpenStack Ironic</a>.
The Dell C6420 servers are imaged with CentOS 8, and we use our
<a class="reference external" href="https://galaxy.ansible.com/stackhpc/openhpc">OpenHPC Ansible role</a> to both configure the system and build images.
Updated images are deployed in a non-impacting way through a <a class="reference external" href="https://github.com/stackhpc/slurm-openstack-tools">custom
Slurm reboot script</a>.</p>
<p>With OpenStack in control, you can quickly rebalance which workloads
are deployed. Users can move capacity between multiple Bare Metal,
Virtual Machine and Container-based workloads. In particular,
OpenStack Magnum provides on-demand creation of Kubernetes clusters,
<a class="reference external" href="https://www.openstack.org/blog/10-years-of-openstack-tim-bell-at-cern/">an approach popularised by CERN</a>.</p>
<p>In addition to user workloads, the solution interacts with iDRAC
and Redfish management interfaces to control server configurations,
remediate faults and deliver overall system metrics. This was
critical in optimising the data centre environment and resulted in
the high efficiency achieved in the TOP500 list.</p>
<img alt="Redfish telemetry gathered while running LINPACK benchmarks" src="//www.stackhpc.com/images/sc20-top500-redfish.png" style="width: 750px;" />
<p>For more details, please watch our recent presentation from the
<a class="reference external" href="https://www.youtube.com/playlist?list=PLKqaoAnDyfgq5YNWZ3Pk9vXf9Smo-gFxw">OpenInfra Summit</a>:</p>
<div class="youtube"><iframe src="https://www.youtube.com/embed/bbw-Fj0F1iY" width="500" height="333" allowfullscreen seamless frameBorder="0"></iframe></div><div class="section" id="get-in-touch">
<h2>Get in touch</h2>
<p>If you would like to get in touch we would love to hear
from you. Reach out to us via <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a>
or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact page</a>.</p>
</div>
StackHPC is OpenInfra!2020-10-20T12:00:00+01:002020-10-20T12:00:00+01:00Stig Telfertag:www.stackhpc.com,2020-10-20:/openinfra.html<p class="first last">StackHPC is proud to be a founding member of the Open
Infrastructure Foundation. What does this mean for OpenStack and
open infrastructure?</p>
<p>Open infrastructure will underpin the next decade of transformation
for cloud infrastructure. With the virtual <a class="reference external" href="https://www.openstack.org/summit/2020/">Open Infrastructure
Summit</a> well underway,
the first major announcement has been the formation of a new
foundation, the <a class="reference external" href="https://openinfra.dev">Open Infrastructure Foundation</a>.
StackHPC is proud to be a founding member.</p>
<img alt="Open Infastructure Foundation" src="//www.stackhpc.com/images/OpenInfraFoundation-MemberLogo-RGB-Silver-750x285.png" style="width: 500px;" />
<p>StackHPC's CEO, John Taylor, comments "We are extremely pleased to
be a part of the new decade of Open Infrastructure and welcome the
opportunity to continue to transfer the values of "Open" to our
clients."</p>
<p>StackHPC's CTO, Stig Telfer, recorded a short video describing how the concept of
open infrastructure is essential to our work, and how as a company we contribute to open infrastructure
as a central part of what we do:</p>
<div class="youtube"><iframe src="https://www.youtube.com/embed/D5SjJ4PDWfk" width="500" height="333" allowfullscreen seamless frameBorder="0"></iframe></div><div class="section" id="get-in-touch">
<h2>Get in touch</h2>
<p>If you would like to get in touch we would love to hear
from you. Reach out to us via <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a>
or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact page</a>.</p>
</div>
StackHPC Shortlisted for OpenStack SuperUser Award2020-09-28T18:00:00+01:002020-09-28T18:00:00+01:00Stig Telfertag:www.stackhpc.com,2020-09-28:/superuser-nomination.html<p class="first last">In recognition of our contribution to the open infrastructure
community, StackHPC has been selected for the shortlist for the
OpenStack SuperUser award.</p>
<p>With the virtual <a class="reference external" href="https://www.openstack.org/summit/2020/">Open Infrastructure Summit</a>
just a few weeks away, the <a class="reference external" href="https://superuser.openstack.org/articles/meet-the-2020-superuser-awards-nominees/">SuperUser Award shortlist</a>
has been announced, and StackHPC is thrilled to have been <a class="reference external" href="https://superuser.openstack.org/articles/2020-superuser-awards-nominee-stackhpc/">selected as a nominee</a>.</p>
<img alt="SuperUser StackHPC nomination" src="//www.stackhpc.com/images/superuser-nomination-stackhpc.png" style="width: 387px;" />
<p>Since our formation about five years ago, we have followed a vision
of the opportunities offered by open infrastructure for scientific
and research computing. This nomination is a tremendous validation
of our contribution to the open infrastructure community in that
time.</p>
<p>Fingers crossed for the winner announcement during the opening
<a class="reference external" href="https://www.openstack.org/summit/2020/summit-schedule/events/24743/open-infrastructure-keynotes">virtual keynote</a>!</p>
<div class="section" id="get-in-touch">
<h2>Get in touch</h2>
<p>If you would like to get in touch we would love to hear
from you. Reach out to us via <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a>
or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact page</a>.</p>
</div>
Monasca on Kayobe, tips and tricks2020-09-03T16:30:00+01:002020-09-03T16:30:00+01:00Isaac Priortag:www.stackhpc.com,2020-09-03:/kayobe-monasca.html<p class="first last">We provide a quick guide to deploying Monasca using Kayobe.</p>
<p>Here at StackHPC we've used and experimented with Monasca in a variety of ways,
<a class="reference external" href="https://www.stackalytics.com/?metric=loc&module=monasca-group&release=all&company=stackhpc">contributing upstream wherever possible</a>.</p>
<p>For the benefit of those using or considering either Monasca or Kayobe we
thought we'd share some of our tips for deploying and configuring it.</p>
<p>This tutorial will follow on from the
<a class="reference external" href="https://github.com/stackhpc/a-universe-from-nothing/tree/stable/train">Kayobe a-universe-from-nothing tutorial on Train</a>
to demonstrate how to deploy and customise Monasca with Kolla-Ansible.</p>
<p>Assuming you've got a Kayobe environment (see our helpful
<a class="reference external" href="http://www.stackhpc.com/universe-from-nothing.html">universe-from-nothing</a>
blog post if you haven't already) you're only a few steps away from having a
deployed Monasca stack. Here's how.</p>
<div class="section" id="before-we-begin">
<h2>Before we begin</h2>
<p>From your designated Ansible control host, source the Kayobe virtualenv,
kayobe-env and admin credentials files.
Assuming the virtualenv and kayobe-config locations are the same as in the
<a class="reference external" href="https://github.com/stackhpc/a-universe-from-nothing/tree/stable/train">a-universe-from-nothing tutorial</a>:</p>
<div class="highlight"><pre><span></span><span class="gp">$</span> <span class="nb">source</span> ~/kayobe-venv/bin/activate
<span class="gp">$</span> <span class="nb">cd</span> ~/kayobe/config/src/kayobe-config/
<span class="gp">$</span> <span class="nb">source</span> kayobe-env
<span class="gp">$</span> <span class="nb">source</span> etc/kolla/admin-openrc.sh
</pre></div>
<p>Any reference to a filesystem path from this point in the guide will be
relative to the <tt class="docutils literal"><span class="pre">kayobe-config</span></tt> directory above.</p>
<div class="section" id="optional">
<h3>Optional</h3>
<p>Optionally, enable Kayobe shell completion:</p>
<div class="highlight"><pre><span></span><span class="gp">$</span> <span class="nb">source</span> <<span class="o">(</span>kayobe <span class="nb">complete</span><span class="o">)</span>
</pre></div>
</div>
</div>
<div class="section" id="containers">
<h2>Containers</h2>
<p>First you'll need Kolla containers; these can either be pulled from
<a class="reference external" href="https://hub.docker.com/search?q=centos-source-monasca&type=image">Docker Hub</a>
or built using Kayobe.
Kolla containers are either of the source or binary variety, depending on how
they were built, and this is reflected in their image name. Note that not every
component
<a class="reference external" href="https://docs.openstack.org/kolla/train/support_matrix.html#x86-64-images">supports both build types</a>;
Monasca is only available from source. In practice,
this means we'll need to tell Kolla which container images to build (unless
pulling from Docker Hub) and Kolla-Ansible which images to deploy.</p>
<div class="section" id="pulling-from-docker-hub">
<h3>Pulling from Docker Hub</h3>
<p>If you've followed a universe-from-nothing build the following
script can be used to pull the relevant containers from the Docker Hub
<a class="reference external" href="https://hub.docker.com/u/kolla">Kolla repositories</a> and push them to the
seed:</p>
<div class="highlight"><pre><span></span><span class="gp">#</span>!/bin/bash
<span class="go">set -e</span>
<span class="go">tag=${1:-train}</span>
<span class="go">images="kolla/centos-binary-zookeeper</span>
<span class="go">kolla/centos-binary-kafka</span>
<span class="go">kolla/centos-binary-storm</span>
<span class="go">kolla/centos-binary-logstash</span>
<span class="go">kolla/centos-binary-kibana</span>
<span class="go">kolla/centos-binary-elasticsearch</span>
<span class="go">kolla/centos-binary-influxdb</span>
<span class="go">kolla/centos-source-monasca-api</span>
<span class="go">kolla/centos-source-monasca-notification</span>
<span class="go">kolla/centos-source-monasca-persister</span>
<span class="go">kolla/centos-source-monasca-agent</span>
<span class="go">kolla/centos-source-monasca-thresh</span>
<span class="go">kolla/centos-source-monasca-grafana"</span>
<span class="go">registry=192.168.33.5:4000</span>
<span class="go">for image in $images; do</span>
<span class="go"> ssh stack@192.168.33.5 sudo docker pull $image:$tag</span>
<span class="go"> ssh stack@192.168.33.5 sudo docker tag $image:$tag $registry/$image:$tag</span>
<span class="go"> ssh stack@192.168.33.5 sudo docker push $registry/$image:$tag</span>
<span class="go">done</span>
</pre></div>
</div>
<div class="section" id="building-using-kayobe">
<h3>Building using Kayobe</h3>
<p>Building your own containers is the recommended approach for production
OpenStack and is required if customising the Kolla Dockerfiles.
The following Kayobe commands can be used to build Monasca and related
containers:</p>
<div class="highlight"><pre><span></span><span class="gp">$</span> kayobe overcloud container image build kafka influxdb kibana elasticsearch zookeeper storm logstash --push
<span class="gp">$</span> kayobe overcloud container image build monasca -e <span class="nv">kolla_install_type</span><span class="o">=</span><span class="nb">source</span> --push
</pre></div>
<p>The <tt class="docutils literal"><span class="pre">--push</span></tt> argument will push these containers to the Docker registry on
the seed node once built.</p>
</div>
</div>
<div class="section" id="configuring-kayobe">
<h2>Configuring Kayobe</h2>
<p>StackHPC usually recommends a cluster of 3 separate nodes for monitoring
infrastructure but with sufficient available resources it is possible to
configure the controllers as monitoring nodes.
For separate monitoring nodes see <a class="reference external" href="https://docs.openstack.org/kayobe/latest/control-plane-service-placement.html#example-1-adding-network-hosts">here</a>
for an example of adding another node type.</p>
<p>If instead you are running monitoring services on controllers then add the
following to <tt class="docutils literal">etc/kayobe/inventory/groups</tt>:</p>
<div class="highlight"><pre><span></span><span class="go">[monitoring:children]</span>
<span class="gp">#</span> Add controllers to monitoring group
<span class="go">controllers</span>
</pre></div>
</div>
<div class="section" id="configuring-kolla-ansible">
<h2>Configuring Kolla-Ansible</h2>
<p>Add the following to the contents of <tt class="docutils literal">etc/kayobe/kolla/globals.yml</tt>:</p>
<div class="highlight"><pre><span></span><span class="gp">#</span> Roles which grant read/write access to Monasca APIs
<span class="go">monasca_default_authorized_roles:</span>
<span class="go">- admin</span>
<span class="go">- monasca-user</span>
<span class="gp">#</span> Roles which grant write access to Monasca APIs
<span class="go">monasca_agent_authorized_roles:</span>
<span class="go">- monasca-agent</span>
<span class="gp">#</span> Project name to send control plane logs and metrics to
<span class="go">monasca_control_plane_project: monasca_control_plane</span>
</pre></div>
<p>This <a class="reference external" href="https://docs.openstack.org/kayobe/train/configuration/kolla-ansible.html#custom-global-variables">configures Kolla-Ansible</a>
with some sane defaults for user and agent roles and finally
names the OpenStack project for metrics as <tt class="docutils literal">monasca_control_plane</tt>.</p>
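<p>Once the project and roles exist (they are set up during deployment, below), a user can be
granted access to the control plane metrics and logs with something like the following (the
user name is hypothetical):</p>
<div class="highlight"><pre><span></span>$ openstack role add --project monasca_control_plane --user alice monasca-user
</pre></div>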
</div>
<div class="section" id="configuring-monasca">
<h2>Configuring Monasca</h2>
<p>StackHPC makes regular use of the
Slack <a class="reference external" href="https://github.com/openstack/monasca-notification#plugins">notification plugin</a>
for alerts.
To demonstrate how this works we'll enable and customise this feature.
Customising Monasca requires creating configuration under directories that do not
yet exist, so first create both the
<a class="reference external" href="https://docs.openstack.org/kayobe/latest/configuration/kolla-ansible.html#service-configuration">Kolla config Monasca directory</a>
and a subdirectory for alarm notification templates:</p>
<div class="highlight"><pre><span></span><span class="gp">$</span> mkdir -p etc/kayobe/kolla/config/monasca/notification_templates
</pre></div>
<div class="section" id="monasca-notification-configuration">
<h3>Monasca-Notification configuration</h3>
<p>Populate the monasca-notification container's configuration at
<tt class="docutils literal">etc/kayobe/kolla/config/monasca/notification.conf</tt> to enable Slack webhooks
and set the notification template:</p>
<div class="highlight"><pre><span></span><span class="go">[notification_types]</span>
<span class="go">enabled = slack,webhook</span>
<span class="go">[slack_notifier]</span>
<span class="go">message_template = "/etc/monasca/slack_template.j2"</span>
<span class="go">timeout = 5</span>
<span class="go">ca_certs = "/etc/ssl/certs/ca-bundle.crt"</span>
<span class="go">insecure = False</span>
<span class="go">[webhook_notifier]</span>
<span class="go">timeout = 5</span>
</pre></div>
</div>
<div class="section" id="slack-webhook-notification-template">
<h3>Slack webhook notification template</h3>
<p>Custom Slack notification templates should be placed in
<tt class="docutils literal">etc/kayobe/kolla/config/monasca/notification_templates/slack_template.j2</tt>.
If you've followed the a-universe-from-nothing tutorial then the following Jinja
will work as-is:</p>
<div class="highlight"><pre><span></span><span class="go">{% raw %}{% set base_url = "http://{% endraw %}{{ aio_vip_address }}{% raw %}:3001/plugins/monasca-app/page/alarms" -%}</span>
<span class="go">Alarm: `{{ alarm_name }}`</span>
<span class="go">{%- if metrics[0].dimensions.hostname is defined -%}</span>
<span class="go">{% set hosts = metrics|map(attribute='dimensions.hostname')|unique|list %} on host(s): `{{ hosts|join(', ') }}` moved to <{{ base_url }}?dimensions=hostname:{{ hosts|join('|') }}|status>: `{{ state }}`</span>
<span class="go">{%- else %} moved to <{{ base_url }}|status>: `{{ state }}`</span>
<span class="go">{%- endif %}.{% endraw %}</span>
</pre></div>
<p>If you've prepared your own deployment then <tt class="docutils literal">{{ aio_vip_address }}</tt> will need
to be replaced with the address of an accessible VIP interface as defined in
<tt class="docutils literal">etc/kayobe/networks.yml</tt>.</p>
<p>Astute Jinja practitioners may notice that the notification template is wrapped
inside <tt class="docutils literal">{% raw %}</tt> tags except for the VIP address: this allows Kayobe to
insert a variable not visible at the time Kolla-Ansible templates the file.</p>
</div>
<div class="section" id="adding-dashboards-datasources">
<h3>Adding Dashboards & Datasources</h3>
<p>Monasca-Grafana will need to be configured with the monasca-api address as a
metric source. Note that Elasticsearch can also be configured as a datasource
to visualise log data.
Optionally, custom dashboards can also be defined in the same file,
<tt class="docutils literal">etc/kayobe/grafana.yml</tt>:</p>
<div class="highlight"><pre><span></span><span class="gp">#</span> Path to git repo containing Grafana dashboards. Eg.
<span class="gp">#</span> https://github.com/stackhpc/grafana-reference-dashboards.git
<span class="go">grafana_monitoring_node_dashboard_repo: "https://github.com/stackhpc/grafana-reference-dashboards.git"</span>
<span class="gp">#</span> Dashboard repo version. Optional, defaults to <span class="s1">'HEAD'</span>.
<span class="go">grafana_monitoring_node_dashboard_repo_version: "stable/train"</span>
<span class="gp">#</span> The path, relative to the grafana_monitoring_node_dashboard_repo_checkout_path
<span class="gp">#</span> containing the dashboards. Eg. /prometheus/control_plane
<span class="go">grafana_monitoring_node_dashboard_repo_path: "/monasca/control_plane"</span>
<span class="gp">#</span> A dict of datasources to configure. See the stackhpc.grafana-conf role
<span class="gp">#</span> <span class="k">for</span> all supported datasources.
<span class="go">grafana_datasources:</span>
<span class="go"> monasca_api:</span>
<span class="go"> port: 8070</span>
<span class="go"> host: "{{ aio_vip_address }}"</span>
<span class="go"> elasticsearch:</span>
<span class="go"> port: 9200</span>
<span class="go"> host: "{{ aio_vip_address }}"</span>
<span class="go"> project_id: "{{ monasca_control_plane_project_id | default('') }}"</span>
</pre></div>
</div>
<div class="section" id="pulling-containers-to-the-overcloud">
<h3>Pulling containers to the overcloud</h3>
<p>Once the configuration is in place, it is recommended to prepare for the next step
by pulling the new containers from the seed registry to the relevant overcloud
nodes:</p>
<div class="highlight"><pre><span></span><span class="gp">$</span> kayobe overcloud container image pull
</pre></div>
<p>This also serves to check that all of the required container images are available.</p>
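<p>As a quick sanity check - a minimal sketch, assuming a controller named
<tt class="docutils literal">ctrl0</tt> and standard Kolla image naming - you can confirm the Monasca
images have arrived on an overcloud node:</p>
<div class="highlight"><pre><span></span>$ ssh ctrl0 docker image ls | grep -i monasca
</pre></div>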
</div>
</div>
<div class="section" id="deploying-monasca">
<h2>Deploying Monasca</h2>
<p>Deploying Monasca and friends using Kayobe can take a considerable length of
time due to the number of checks that Kayobe and Kolla-Ansible both perform.
If you are familiar with Kolla-Ansible, some of these tasks can be skipped
with the
<a class="reference external" href="https://docs.openstack.org/kayobe/train/usage.html#tags">--kolla-tags</a>
argument:</p>
<div class="highlight"><pre><span></span><span class="gp">$</span> kayobe overcloud service deploy --kolla-tags monasca,elasticsearch,influxdb,mariadb,kafka,kibana,grafana,storm,kafka,zookeeper,haproxy,common
</pre></div>
<p>The above command will deploy only Monasca and related services.
A word of caution, however: limiting tasks in this fashion can have unexpected
consequences for inexperienced users, and honestly doesn't save much time.
If in doubt about which tags are required, run a full deploy with:</p>
<div class="highlight"><pre><span></span><span class="gp">$</span> kayobe overcloud service deploy
</pre></div>
<p>Now would be a good point to grab a cup of tea.</p>
<p>Run in a production environment, this command shouldn't cause any disruption
to tenant services (on sufficient hardware), but HAProxy will restart,
potentially interrupting connections to the APIs for a brief period.</p>
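<p>If you'd like to observe this for yourself, a crude sketch is to poll one of
the APIs from a separate terminal while the deploy runs (the VIP address
<tt class="docutils literal">192.168.7.1</tt> is illustrative - substitute your own):</p>
<div class="highlight"><pre><span></span>$ watch -n 1 'curl -s -o /dev/null -w "%{http_code}\n" http://192.168.7.1:5000/v3'
</pre></div>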
<div class="section" id="id1">
<h3>Adding Dashboards & Datasources</h3>
<p>Assuming the deployment completed successfully, additional tasks are still
required to configure Grafana with the datasources and dashboards defined in
<tt class="docutils literal">etc/kayobe/grafana.yml</tt>:</p>
<div class="highlight"><pre><span></span><span class="gp">$</span> kayobe overcloud post configure --tags grafana
</pre></div>
</div>
</div>
<div class="section" id="testing">
<h2>Testing</h2>
<p>You should now be able to navigate to Grafana and Kibana, found by default on
ports 3001 & 5601 respectively.</p>
<p>To start using the Monasca CLI, install it from PyPI and assign the relevant
Keystone roles to authenticate against the Monasca project. Create and activate
a fresh venv for the purpose:</p>
<div class="highlight"><pre><span></span><span class="gp">$</span> deactivate
<span class="gp">$</span> python3 -m venv ~/monasca-venv
<span class="gp">$</span> <span class="nb">source</span> ~/monasca-venv/bin/activate
<span class="gp">$</span> pip install python-openstackclient
<span class="gp">$</span> pip install python-monascaclient
<span class="gp">$</span> <span class="nb">source</span> etc/kolla/admin-openrc.sh
</pre></div>
<p>Optionally enable shell completion:</p>
<div class="highlight"><pre><span></span><span class="gp">$</span> <span class="nb">source</span> <<span class="o">(</span>openstack <span class="nb">complete</span><span class="o">)</span>
<span class="gp">$</span> <span class="nb">source</span> <<span class="o">(</span>monasca <span class="nb">complete</span><span class="o">)</span>
</pre></div>
<p>Add the admin user to the <tt class="docutils literal">monasca_control_plane</tt> project (and double check):</p>
<div class="highlight"><pre><span></span><span class="gp">$</span> openstack role add --user admin --project monasca_control_plane admin
<span class="gp">$</span> openstack role assignment list --names --project monasca_control_plane
</pre></div>
<p>Switch to the <tt class="docutils literal">monasca_control_plane</tt> project and view available metric names:</p>
<div class="highlight"><pre><span></span><span class="gp">$</span> <span class="nb">export</span> <span class="nv">OS_PROJECT_NAME</span><span class="o">=</span>monasca_control_plane
<span class="gp">$</span> <span class="nb">unset</span> OS_TENANT_NAME
<span class="gp">$</span> monasca metric-name-list
</pre></div>
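<p>From here you can drill down into individual metrics. The metric name and
start time below are illustrative - what is available depends on the agent
plugins you have enabled:</p>
<div class="highlight"><pre><span></span>$ monasca metric-list --name cpu.user_perc
$ monasca measurement-list cpu.user_perc 2020-01-01T00:00:00Z
</pre></div>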
</div>
<div class="section" id="alarms">
<h2>Alarms!</h2>
<p>Don't forget to install the Slack
<a class="reference external" href="https://api.slack.com/messaging/webhooks">Incoming webhooks</a>
integration in order to
make use of Monasca alerts. Once that is installed and configured in a channel,
you'll be provided with a webhook URL - since this is a private URL it should be
secured before being added to <tt class="docutils literal"><span class="pre">kayobe-config</span></tt> (for more information see the
<a class="reference external" href="https://docs.openstack.org/kayobe/latest/configuration/kayobe.html#encryption-of-secrets">Kayobe documentation on secrets</a>).</p>
<div class="section" id="deploying-alarms-from-a-custom-playbook">
<h3>Deploying Alarms from a custom playbook</h3>
<p>Alarms and notification definitions can be created using the <a class="reference external" href="https://opendev.org/openstack/monasca-notification#user-content-slack-plugin">monasca CLI</a>,
but in keeping with the configuration-as-code approach thus far we'd recommend
<a class="reference external" href="https://galaxy.ansible.com/stackhpc/monasca_default_alarms">our ansible role</a>
for the task - it contains a <em>reasonably</em> sane set of alarms for monitoring
both overcloud nodes and OpenStack services.</p>
<p>With some additional configuration, the role can be
<a class="reference external" href="https://docs.openstack.org/kayobe/train/custom-ansible-playbooks.html">installed and used by Kayobe</a>.</p>
<p>First create the directory and provide symlinks to Kayobe Ansible as per
<a class="reference external" href="https://docs.openstack.org/kayobe/train/custom-ansible-playbooks.html#packaging-custom-playbooks-with-configuration">the documentation</a>:</p>
<div class="highlight"><pre><span></span><span class="gp">$</span> mkdir -p etc/kayobe/ansible
<span class="gp">$</span> <span class="nb">cd</span> etc/kayobe/ansible
<span class="gp">$</span> ln -s ../../../../kayobe/ansible/filter_plugins/ filter_plugins
<span class="gp">$</span> ln -s ../../../../kayobe/ansible/group_vars/ group_vars
<span class="gp">$</span> ln -s ../../../../kayobe/ansible/test_plugins/ test_plugins
<span class="gp">$</span> <span class="nb">cd</span> -
</pre></div>
<p>And then <tt class="docutils literal">etc/kayobe/ansible/requirements.yml</tt> to specify the role:</p>
<div class="highlight"><pre><span></span><span class="go">---</span>
<span class="go">- src: stackhpc.monasca_default_alarms</span>
<span class="go"> version: 1.3.0</span>
</pre></div>
<p>An example playbook to deploy only the system-level alerts
(CPU, disk and memory usage) can be placed in <tt class="docutils literal">etc/kayobe/ansible/monasca_alarms.yml</tt>.
This assumes you've created a variable for your Slack webhook called
<tt class="docutils literal">secrets_monasca_slack_webhook</tt> and that the Monasca CLI virtualenv is in
<tt class="docutils literal"><span class="pre">~/monasca-venv</span></tt>:</p>
<div class="highlight"><pre><span></span><span class="go">- name: Create Monasca notification method and alarms</span>
<span class="go"> hosts: localhost</span>
<span class="go"> gather_facts: yes</span>
<span class="go"> vars:</span>
<span class="go"> keystone_url: "http://{{ aio_vip_address }}:5000/v3"</span>
<span class="go"> keystone_project: "monasca_control_plane"</span>
<span class="go"> monasca_endpoint_interface: ["internal"]</span>
<span class="go"> notification_address: "{{ secrets_monasca_slack_webhook }}"</span>
<span class="go"> notification_name: "Default Slack Notification"</span>
<span class="go"> notification_type: "SLACK"</span>
<span class="go"> monasca_client_virtualenv_dir: "~/monasca-venv"</span>
<span class="go"> virtualenv_become: "no"</span>
<span class="go"> skip_tasks: ["misc", "openstack", "monasca", "ceph"]</span>
<span class="go"> roles:</span>
<span class="go"> - {role: stackhpc.monasca_default_alarms, tags: [alarms]}</span>
</pre></div>
<p>The Ansible Galaxy role can be installed using Kayobe with:</p>
<div class="highlight"><pre><span></span><span class="gp">$</span> kayobe control host bootstrap
</pre></div>
<p>And the playbook invoking it can be executed with:</p>
<div class="highlight"><pre><span></span><span class="gp">$</span> kayobe playbook run <span class="si">${</span><span class="nv">KAYOBE_CONFIG_PATH</span><span class="si">}</span>/ansible/monasca_alarms.yml
</pre></div>
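<p>Assuming the playbook ran cleanly, the notification method and alarm
definitions it created can be verified with the Monasca CLI:</p>
<div class="highlight"><pre><span></span>$ monasca notification-list
$ monasca alarm-definition-list
</pre></div>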
</div>
</div>
Kayobe & Kolla - sane OpenStack deployment2020-07-28T10:00:00+01:002020-07-28T10:00:00+01:00Stig Telfertag:www.stackhpc.com,2020-07-28:/os-meetup.html<p class="first last">StackHPC's Mark Goddard to present on Kayobe and Kolla-Ansible
at the London and Manchester virtual OpenInfra meetup.</p>
<p>Coming up at the <a class="reference external" href="https://www.meetup.com/Manchester-OpenInfra-Meetup/events/272083295/">London and Manchester (virtual) OpenInfra meetup</a>
on <strong>Thursday July 30th 6pm-9pm UK time</strong> (17:00-20:00 UTC):
Mark Goddard, Kolla PTL and StackHPC team member, will be talking on
"Kayobe & Kolla - sane OpenStack deployment".</p>
<img alt="Mark at RCUK Cloud Workshop 2019" src="//www.stackhpc.com/images/mark-at-cloudwg-2019.png" style="width: 300px;" />
<p>In this talk Mark will introduce <a class="reference external" href="//www.stackhpc.com/pages/kayobe.html">Kayobe</a>,
the latest addition to the OpenStack Kolla project. Learn how
Kayobe uses <a class="reference external" href="https://docs.openstack.org/bifrost/latest/">Bifrost</a>
to support bare metal provisioning, and extends <a class="reference external" href="https://docs.openstack.org/kolla-ansible/latest/">Kolla Ansible</a> to offer an
end-to-end cloud deployment tool.</p>
<p>Mark will be joined by:</p>
<ul class="simple">
<li>Ildikó Vancsa and Gergely Csatari, who will present on <em>Edge ecosystem, use cases and architectures</em>.</li>
<li>Belmiro Moreira, who will present on <em>7 years of CERN Cloud - From 0 to 300k cores</em>.</li>
</ul>
<p><a class="reference external" href="https://www.meetup.com/Manchester-OpenInfra-Meetup/events/272083295/ical/Edge+ecosystem%252C+and+sane+OpenStack+deployments.ics">Add to your calendar</a>.</p>
<p>If you're interested in finding out more about OpenStack and Kayobe,
check out our <a class="reference external" href="//www.stackhpc.com/pages/workshops.html">OpenStack HIIT training courses</a>.</p>
<div class="section" id="get-in-touch">
<h2>Get in touch</h2>
<p>If you would like to get in touch we would love to hear
from you. Reach out to us via <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a>
or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact page</a>.</p>
</div>
Software RAID support in OpenStack Ironic2020-06-17T10:00:00+01:002020-06-18T13:00:00+01:00Stig Telfertag:www.stackhpc.com,2020-06-17:/software-raid-in-ironic.html<p class="first last">We describe our experiences using the new support for software
RAID in Ironic.</p>
<p>OpenStack Ironic operates in a curious world. Each release of Ironic
introduces ever more inventive implementations of the abstractions
of virtualisation. However, bare metal is wrapped up in hardware-defined
concrete: devices and configurations that have no equivalent in
software-defined cloud. To exist, Ironic must provide pure abstractions,
but to succeed it must also offer real-world circumventions.</p>
<p>For decades the conventional role of an HPC system administrator
has included deploying bare metal machines, sometimes at large
scale. Automation becomes essential beyond trivial numbers of systems
to ensure repeatability, scalability and efficiency. Thus far, that
automation has evolved in domain-specific ways, loaded with simplifying
assumptions that enable large-scale infrastructure to be provisioned
and managed from a minimal service. Ironic is the first framework
to define the provisioning of bare metal infrastructure in the
paradigm of cloud.</p>
<p>So much for the theory: working with hardware has always been a
little hairy, never as predictable or reliable as expected.
Software-defined infrastructure, the method underpinning the modern
mantra of agility, accelerates the interactions with hardware
services by orders of magnitude. Ironic strives to deliver results
in the face of unreliability (minimising the need to ask someone
in the data centre to whack a machine with a large stick).</p>
<div class="section" id="hpc-infrastructure-for-seismic-analysis">
<h2>HPC Infrastructure for Seismic Analysis</h2>
<p>As a leader in the seismic processing industry, <a class="reference external" href="https://www.iongeo.com">ION Geophysical</a> maintains a hyperscale production HPC
infrastructure, and operates a phased procurement model that results
in several generations of hardware being active within the production
environment at any time. Field failures and replacements add further
divergence. Providing a consistent software environment across
multiple hardware configurations can be a challenge.</p>
<p>ION is migrating on-premise HPC infrastructure into an OpenStack
private cloud. The OpenStack infrastructure is deployed and configured
using Kayobe, a project that integrates Ironic (for hardware
deployment) and Kolla-Ansible (for OpenStack deployment), all within
an Ansible framework. Ansible provides a consistent interface to
everything, from the physical layer to the application workloads
themselves.</p>
<p>This journey began with some older-generation HPE SL230 compute
nodes and a transfer of control to OpenStack management. Each node
has two HDDs. To meet the workload requirements these are provisioned
as two RAID volumes - one mirrored (for the OS) and one striped
(for scratch space for the workloads).</p>
<p>Each node also has a hardware RAID controller, and standard practice
in Ironic would be to make use of this. However, after analysing the
hardware it was found that:</p>
<ul class="simple">
<li>The hardware RAID controller needed to be enabled via the BIOS, but the
BIOS administration tool failed on many nodes because the 'personality
board' had failed, preventing the tool from retrieving the server model
number.</li>
<li>The RAID controller required a proprietary kernel driver which was not
available for recent CentOS releases. The driver was not just required
for administering the controller, but for mounting the RAID volumes.</li>
</ul>
<p>Taking these and other factors into account, it was decided that
the hardware RAID controller was unusable.
Thankfully, Ironic developed a software-based alternative.</p>
</div>
<div class="section" id="provisioning-to-software-raid">
<h2>Provisioning to Software RAID</h2>
<p>Linux servers are often deployed with their root filesystem on a
mirrored RAID-1 volume. This requirement exemplifies the inherent
tensions within the Ironic project. The abstractions of virtualisation
demand that the guest OS is treated like a black box, but the
software RAID implementation is Linux-specific. However, not
supporting Linux software RAID would be a limitation for the primary
use case. Without losing Ironic's generalised capability, the guest
OS “black box” becomes a white box in exceptional cases such as
this. Recent work led by CERN has contributed software RAID support
to the <a class="reference external" href="https://docs.openstack.org/releasenotes/ironic/train.html#relnotes-13-0-0-stable-train">Ironic
Train release</a>.</p>
<p>The CERN team have documented the software RAID support <a class="reference external" href="http://techblog.web.cern.ch/techblog/post/ironic_software_raid/">on their tech blog</a>.</p>
<p>In its initial implementation, the software RAID capability is
constrained. A bare metal node is assigned a persistent software
RAID configuration, applied whenever a node is cleaned and used for
all instance deployments. Prior work involving the StackHPC team
to develop <a class="reference external" href="//www.stackhpc.com/bespoke-bare-metal.html">instance-driven RAID configurations</a> is not yet
available for software RAID. However, the current driver implementation
provides exactly the right amount of functionality for Kayobe's
cloud infrastructure deployment.</p>
</div>
<div class="section" id="the-method">
<h2>The Method</h2>
<p>RAID configuration in Ironic is described in greater detail in the
<a class="reference external" href="https://docs.openstack.org/ironic/latest/admin/raid.html#software-raid">Ironic Admin Guide</a>. A higher-level overview is presented here.</p>
<p>Software RAID with UEFI boot is not supported until the Ussuri release, where
it can be used in conjunction with a rootfs UUID hint stored as image
metadata in a service such as Glance. For Bifrost users this
means that legacy BIOS boot mode is the only choice, ruling out secure
boot and NVMe devices for now.</p>
<p>In this case the task was to provision a large number of compute nodes with
OpenStack Train, each with two physical spinning disks and configured for
legacy BIOS boot mode. These were provisioned according to the <a class="reference external" href="https://docs.openstack.org/ironic/train/admin/raid.html#software-raid">OpenStack documentation</a> with
some background provided by the CERN <a class="reference external" href="http://techblog.web.cern.ch/techblog/post/ironic_software_raid/">blog article</a>. Two
RAID devices were specified in the RAID configuration set on each node;
the first for the operating system, and the second for use by Nova
as scratch space for VMs.</p>
<div class="highlight"><pre><span></span><span class="go">{</span>
<span class="go"> "logical_disks": [</span>
<span class="go"> {</span>
<span class="go"> "raid_level": "1",</span>
<span class="go"> "size_gb" : 100,</span>
<span class="go"> "controller": "software"</span>
<span class="go"> },</span>
<span class="go"> {</span>
<span class="go"> "raid_level": "0",</span>
<span class="go"> "size_gb" : "800",</span>
<span class="go"> "controller": "software"</span>
<span class="go"> }</span>
<span class="go"> ]</span>
<span class="go">}</span>
</pre></div>
<p>Note that although you can use all remaining space when creating a logical
disk by setting <tt class="docutils literal">size_gb</tt> to <tt class="docutils literal">MAX</tt>, you may wish to leave a little spare
to ensure that a failed disk can be rebuilt if it is replaced by a model
with marginally different capacity.</p>
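<p>For reference, the target RAID configuration is set on each node with
something like the following (a sketch - the node name and file name are
placeholders):</p>
<div class="highlight"><pre><span></span>$ openstack baremetal node set node-01 --target-raid-config raid-config.json
</pre></div>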
<p>The RAID configuration was then applied with the following cleaning
steps as detailed in the <a class="reference external" href="https://docs.openstack.org/ironic/train/admin/raid.html#software-raid">OpenStack documentation</a>:</p>
<div class="highlight"><pre><span></span><span class="go">[{</span>
<span class="go"> "interface": "raid",</span>
<span class="go"> "step": "delete_configuration"</span>
<span class="go"> },</span>
<span class="go"> {</span>
<span class="go"> "interface": "deploy",</span>
<span class="go"> "step": "erase_devices_metadata"</span>
<span class="go"> },</span>
<span class="go"> {</span>
<span class="go"> "interface": "raid",</span>
<span class="go"> "step": "create_configuration"</span>
<span class="go"> }]</span>
</pre></div>
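<p>Manual cleaning requires the node to be in the <tt class="docutils literal">manageable</tt> state, so the
sequence looks roughly like this (again, the node name is a placeholder and the
clean steps are assumed to be saved in a local JSON file):</p>
<div class="highlight"><pre><span></span>$ openstack baremetal node manage node-01
$ openstack baremetal node clean --clean-steps clean-steps.json node-01
$ openstack baremetal node provide node-01
</pre></div>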
<p>A RAID-1 device was selected for the OS so that the hypervisor would
remain functional in the event of a single disk failure. RAID-0 was
used for the scratch space to take advantage of the performance
benefit and additional storage space offered by this configuration.
It should be noted that this configuration is specific to the
intended use case, and may not be optimal for all deployments.</p>
<p>As noted in the CERN <a class="reference external" href="http://techblog.web.cern.ch/techblog/post/ironic_software_raid/">blog article</a>,
the <tt class="docutils literal">mdadm</tt> package was installed into the Ironic Python Agent (IPA)
ramdisk for the purpose of configuring the RAID array during cleaning.
<tt class="docutils literal">mdadm</tt> was also installed into the deploy image to support
the installation of the <tt class="docutils literal">grub2</tt> bootloader onto the physical disks for the
purposes of loading the operating system from either disk should one
fail. Finally, <tt class="docutils literal">mdadm</tt> was added to the deploy image ramdisk, so that
when the node booted from disk, it could pivot into the root filesystem.
Although we would generally use Disk Image Builder, a simple trick for
the last step is to use <tt class="docutils literal"><span class="pre">virt-customize</span></tt>:</p>
<div class="highlight"><pre><span></span><span class="go">virt-customize -a deployment_image.qcow2 --run-command 'dracut --regenerate-all -fv --mdadmconf --fstab --add=mdraid --add-driver="raid1 raid0"'</span>
</pre></div>
</div>
<div class="section" id="open-source-open-development">
<h2>Open Source, Open Development</h2>
<p>As an open source project, Ironic depends on a thriving user base
contributing back to the project. Our experiences covered new ground:
hardware not used before by the software RAID driver. Inevitably,
new problems are found.</p>
<p>The first observation was that configuration of the RAID devices
during cleaning would fail on about 25% of the nodes from a sample
of 56. The nodes which failed logged the following message:</p>
<div class="highlight"><pre><span></span><span class="go">mdadm: super1.x cannot open /dev/sdXY: Device or resource busy</span>
</pre></div>
<p>where <tt class="docutils literal">X</tt> was either <tt class="docutils literal">a</tt> or <tt class="docutils literal">b</tt> and Y either <tt class="docutils literal">1</tt> or <tt class="docutils literal">2</tt>, denoting the
physical disk and partition number respectively.
These nodes had previously been deployed with software RAID,
either by Ironic or by other means.</p>
<p>Inspection of the kernel logs showed that in all cases, the device
marked as busy had been ejected from the array by the kernel:</p>
<div class="highlight"><pre><span></span><span class="go">md: kicking non-fresh sdXY from array!</span>
</pre></div>
<p>The device which had been ejected, which may or may not have been
synchronised, appeared in <tt class="docutils literal">/proc/mdstat</tt> as part of a RAID-1 array.
The other drive, having been erased, was missing from the output. It was
concluded that the ejected device had bypassed the cleaning steps
designed to remove all previous configuration, and had later
resurrected itself, thereby preventing the formation of the array
during the <tt class="docutils literal">create_configuration</tt> cleaning step.</p>
<p>For cleaning to succeed, a manual workaround of stopping this RAID-1 device
and zeroing signatures in the superblocks was applied:</p>
<div class="highlight"><pre><span></span><span class="go">mdadm --zero-superblock /dev/sdXY</span>
</pre></div>
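<p>Putting the workaround together for a node where the resurrected array had
appeared as <tt class="docutils literal">md127</tt> (the device names here are illustrative), the sequence
was roughly:</p>
<div class="highlight"><pre><span></span>$ cat /proc/mdstat
$ mdadm --stop /dev/md127
$ mdadm --zero-superblock /dev/sda1 /dev/sdb1
</pre></div>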
<p>Removal of all pre-existing state greatly increased the reliability
of software RAID device creation by Ironic. The remaining question
was why some servers exhibited this issue and others did not. Further
inspection showed that although many of the disks were old, there
were no reported SMART failures, the disks passed self tests and
although generally close, had not exceeded their mean time before
failure (MTBF). No signs of failure were reported by the kernel in
addition to the removal of a device from the array. Actively seeking
errors, for example by running tools such as <tt class="docutils literal">badblocks</tt> to exercise
the entire disk media, showed that only a very small number of disks
had issues. Benchmarking, burn-in and anomaly detection may have
identified those devices sooner.</p>
<p>Further research may help us identify whether the disks that exhibit
this behaviour are at fault in any other way. An additional line
of investigation could be to increase thresholds such as retries
and timeouts for the drives in the kernel. For now the details are
noted in a <a class="reference external" href="https://storyboard.openstack.org/#!/story/2007573">bug report</a>.</p>
<p>The second issue observed occurred when the nodes booted from the
RAID-1 device. These nodes, running IPA and deploy images based on
Centos <tt class="docutils literal">7.7.1908</tt> with kernel version <tt class="docutils literal"><span class="pre">3.10.0-1062</span></tt>, would show
degraded RAID-1 arrays, with the same message seen during failed
cleaning cycles:</p>
<div class="highlight"><pre><span></span><span class="go">md: kicking non-fresh sdXY from array!</span>
</pre></div>
<p>A workaround for this issue was developed by running a Kayobe custom
playbook against the nodes to add <tt class="docutils literal">sdXY</tt> back into the array. In all
cases the ejected device was observed to resync with the RAID device.
The state of the RAID arrays is monitored using OpenStack Monasca,
ingesting data from a recent release candidate of Prometheus Node
Exporter containing some enhancements around <a class="reference external" href="https://github.com/prometheus/node_exporter/tree/v1.0.0-rc.1">MD/RAID monitoring</a>.
Software RAID status can be visualised using a simple dashboard:</p>
<div class="figure">
<img alt="Monasca dashboard" src="//www.stackhpc.com/images/mdraid.png" style="width: 750px;" />
<p class="caption"><a class="reference external" href="//www.stackhpc.com/images/mdraid.png">Monasca MD/RAID Grafana dashboard</a> using data scraped from Prometheus node exporter.</p>
</div>
<p>The plot in the top left shows the percentage of blocks synchronised
on each RAID device. A single RAID-1 array can be seen recovering
after a device was forcibly failed and added back to simulate the
failure and replacement of a disk. Unfortunately it is not yet
possible to differentiate between the RAID-0 and RAID-1 devices on
each node since Ironic <a class="reference external" href="https://docs.openstack.org/ironic/train/admin/raid.html#optional-properties">does not support the name field for software RAID</a>.
The names for the RAID-0 and RAID-1 arrays therefore alternate
randomly between md126 and md127. Top right: The simulated failed
device is visible within seconds. This is a good metric to generate
an alert from. Bottom left: The device is marked as recovering
whilst the array rebuilds. Bottom right: No manual re-sync was
initiated. The device is seen as recovering by MD/RAID and does not
show up in this figure.</p>
<p>The root cause of these two issues is not yet identified, but they
are likely to be connected, and relate to an interaction between
these disks and the kernel MD/RAID code.</p>
</div>
<div class="section" id="open-source-open-community">
<h2>Open Source, Open Community</h2>
<p>Software that interacts with hardware soon builds up an extensive
"case law" of exceptions and workarounds. Open projects like Ironic
survive and indeed thrive when users become contributors. Equivalent
projects that do not draw on community contribution have ultimately
fallen short.</p>
<p>The original contribution made by the team at CERN (and others in
the OpenStack community) enabled StackHPC and ION Geophysical to
deploy infrastructure for seismic processing in an optimal way.
Whilst in this case we would have liked to have gone further with
our own contributions, we hope that by sharing our experience we
can inspire other users to get involved with the project.</p>
<div class="section" id="get-in-touch">
<h3>Get in touch</h3>
<p>If you would like to get in touch we would love to hear
from you. Reach out to us via <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a>
or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact page</a>.</p>
</div>
</div>
Flatten the Learning Curve with OpenStack HIIT2020-05-05T16:00:00+01:002020-05-05T16:00:00+01:00Stig Telfertag:www.stackhpc.com,2020-05-05:/openstack-hiit.html<p class="first last">Training for open infrastructure specially developed
for teams in lockdown.</p>
<p>With the current Coronavirus lockdown affecting many countries
(including all the countries in which we work), remote working and
videoconference has become the only way to be productive.</p>
<p>At StackHPC our flexible and distributed team is already used to
working this way with clients. We have gone further, and developed
online training for workshops we would normally deliver in person.</p>
<div class="section" id="openstack-hiit-openstack-in-six-sessions">
<h2>OpenStack HIIT: OpenStack in Six Sessions</h2>
<p>With a nod to the intensity of OpenStack's infamous learning curve,
we've called our new workshop format <a class="reference external" href="//www.stackhpc.com/pages/workshops.html">OpenStack HIIT</a>.</p>
<p>OpenStack HIIT is a remote workshop, delivered by video conference.
The workshop is organised into six sessions. Session topics include:</p>
<ol class="arabic simple">
<li>Step-by-step deployment of an OpenStack control plane into
a virtualised lab environment.</li>
<li>A deep dive into the control plane to understand how it fits
together and how it works.</li>
<li>Operations and Site Reliability Engineering (SRE) principles.
Best practices for operating cloud infrastructure.</li>
<li>Monitoring and logging for OpenStack infrastructure and workloads.</li>
<li>Deploying platforms and applications to OpenStack infrastructure.</li>
<li>OpenStack software-defined networking deep dive.</li>
<li>Ceph storage and OpenStack.</li>
<li>Contributing to a self-sustaining open source community.</li>
<li>Deploying Kubernetes using OpenStack Magnum.</li>
</ol>
<p>Each session is led by a Senior Tech Lead from StackHPC's team.
The workshop is designed to be interactive and up to six attendees
can be supported.</p>
<p>Because it is remotely delivered, the sessions can be spread out,
enabling attendees to read around the subject, practice content learned
and prepare ahead for the next session.</p>
<p>The interactive sessions use lab infrastructure provided as part
of the workshop. In some circumstances a client's own infrastructure
can be used, which gives a client the opportunity to retain the lab
environment and to use it between sessions. Additional provision
for qualification of a client environment is required in this case.</p>
<p>For further details see <a class="reference external" href="//www.stackhpc.com/resources/StackHPC-HIIT-Workshops.pdf">our HIIT workshop brochure</a></p>
<img alt="OpenStack HIIT" src="//www.stackhpc.com/images/hiit-clip.jpg" style="width: 750px;" />
</div>
<div class="section" id="get-in-touch">
<h2>Get in touch</h2>
<p>If you would like to get in touch we would love to hear
from you. Reach out to us via <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a>
or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact page</a>.</p>
</div>
StackHPC Under African Skies: Kayobe in Cape Town2020-04-30T16:00:00+01:002020-04-30T16:00:00+01:00John Taylortag:www.stackhpc.com,2020-04-30:/chpc-linomtha.html<p class="first last">Working with our local partners, Linomtha ICT, to provide
cloud infrastructure for South African researchers and Academia</p>
<p>StackHPC are pleased to announce along with our partner, Linomtha ICT,
a new OpenStack system at the <a class="reference external" href="https://www.chpc.ac.za">Centre for High Performance Computing</a> to support researchers and academics
across South Africa. StackHPC worked with <a class="reference external" href="http://www.linomthaid.co.za">Linomtha</a>, <a class="reference external" href="https://www.supermicro.com/en/">Supermicro</a> and <a class="reference external" href="https://www.mellanox.com">Mellanox</a> to jointly engineer the system and
support project management. The system deploys OpenStack <a class="reference external" href="http://www.stackhpc.com/pages/kayobe.html">Kayobe</a> together with a billing
system engineered around <a class="reference external" href="http://www.stackhpc.com/cloudkitty-and-monasca-1.html">CloudKitty & Monasca</a>.</p>
<p><em>The text below can also be found on</em> <a class="reference external" href="http://www.linomthaid.co.za/chpc.html">Linomtha's blog</a>.</p>
<img alt="LinomthaID logo" src="//www.stackhpc.com/images/linomtha-logo.png" style="width: 379px;" />
<p>The Centre for High Performance Computing (CHPC) is proud to announce
a new on-premise cloud infrastructure that has been delivered
recently under exceptional circumstances. The delivery of the system
is testament to the close collaboration CHPC has with Linomtha ICT
(SA) and their strategic technology partners StackHPC Ltd (UK),
Supermicro (SA), and Mellanox (IL), and will ensure that the CHPC
has a stable environment to continue to deliver on its mandate. The
OpenStack Production Cloud Services caters for the CHPC scientific
users executing, for example, custom workflows, embarrassingly
parallel workloads and webhosting. The OpenStack services will also
be a road-header for such HPC configuration in the future. It is
envisaged that this platform will build both the skills and operational
experience for CHPC, to develop, provision and operate a National
federated OpenStack platform, which will be linked with other
countries, that are involved in the Square Kilometre Array (SKA)
project.</p>
<p>The Cloud infrastructure has been designed in such a way that the
transportation of data to and from the CHPC to the external
institutions that are connected to the NICIS network or those that
want to utilise the DIRISA long term storage can be achieved.</p>
<p>Linomtha, a majority black-owned company comprising an energetic
mix of business people, entrepreneurs and engineers with experience
and skills from various fields, together with CHPC, successfully
completed the installation of the OpenStack Production Cloud Service
project.</p>
<p>Linomtha recognises the important role that ICT can play in terms
of economic growth, social inclusion and government efficiency. The
key individuals driving Linomtha all have extensive practical
experience in the field of ICT, working on large scale government
and private sector projects across the country and are recognized
as experts, both locally and internationally. Linomtha is a value-added
reseller of StackHPC as well as Supermicro, the key technology
partners in responding to CHPC's RFP. LinomthaICT's
sister company, LinomthaID, provided the Billing/Invoicing portal
for the solution through its VOIS platform.</p>
<p>The CHPC has been running a VMware virtual environment or cluster
(IT-Shop) previously, as an alternative to support scientific
projects or applications which were not best suited for High
Performance Computing Platform. Projects were mostly hosted on the
IT-Shop Cluster as web portals to support these special scientific
groups to share data-knowledge or compute their specific scientific
workflows.</p>
<p>The IT-Shop cluster is currently over-provisioned, especially for
memory resources, due to the large demand of numerous projects
requiring high-spec virtual machines and has become an unreliable
environment, no longer able to adequately serve the users, as the
performance and available capacity has deteriorated over time.</p>
<p>The CHPC OpenStack Production Cloud will provide a sufficient and efficient environment to continue to support these kinds of projects from the IT-Shop. In addition, the CHPC Cloud Solution will offer the following benefits and functionalities which were not met on the current IT-Shop:</p>
<ul class="simple">
<li><strong>Self-Service Portal</strong>. CHPC Cloud users will now have the ability
to deploy application on-demand with limited technical support to
promote rapid and efficient IT Service.</li>
<li><strong>Metered Service and Resource Monitoring</strong>. CHPC will now be
able to monitor resource utilization from individual users or
projects to prepare billing statement as per our cost-recovery
model.</li>
<li><strong>Avoid Vendor Lock-In</strong>. The OpenStack solution is open source.
CHPC will Reduce-On-Cost related to proprietary software such as
the VMware vSphere Solution.</li>
<li><strong>Enable Rapid Innovations (DevOps)</strong>. The CHPC Staff can
significantly reduce on development and testing periods and have
more freedom to experiment with new technology or even do customisation
to expand the capabilities of the OpenStack Cloud.</li>
</ul>
<p>The CentOS based OpenStack Cloud is a self-service Virtual Machine
(VM) provisioning portal for CHPC Administrators where common
administrative tasks like VM creation, recoup unused resources, and
infrastructure maintenance tasks are automated and capacity analysis,
utilization, and end-user costing reports can be generated.</p>
<p>Through this project, CHPC administrators have been exposed to the
initial implementation of the OpenStack system and have hands on
experience of performing the various required tasks.</p>
<p>Linomtha together with Supermicro, Mellanox, StackHPC and LinomthaID
have jointly-engineered the CSIR OpenStack Cloud Solution. This
solution is built on Supermicro Server and Storage systems that
deliver first to market innovation and optimized for value, performance
and efficiency. Using the <a class="reference external" href="https://www.supermicro.com/en/products/twinpro">Supermicro TwinPro</a> Servers to provide
320 cores/640 threads (2.50-3.90GHz) and over 3TB of DDR4 2933 memory,
providing some 9GB RAM per core, all in just 4U of rack space,
connected through Mellanox 100GB Ethernet Networking to <a class="reference external" href="https://www.supermicro.com/en/products/ultra">Supermicro
Ultra</a> and <a class="reference external" href="https://www.supermicro.com/en/products/top-loading-storage">Supermicro
Simply Double Servers</a>
providing a CEPH Storage cluster with over 1.5PB (1500TB) of
Mechanical Disk Storage and more than 220TB of Flash Storage.</p>
<p>OpenStack was deployed with OpenStack Kayobe, a tool largely developed
and maintained by StackHPC within the OpenStack Foundation. Kayobe
provides for easy management of the deployment process across all
compute, storage and networking infrastructure using a high degree
of automation through infrastructure as code. Kayobe invokes a
containerised Kolla control plane providing for easier upgrades and
maintainability. In addition to the infrastructure element, Kayobe
also deploys rating, monitoring and logging services providing
insight on resources and their use.</p>
<p>The integration of the invoicing engine and portal, VOIS, was
undertaken by LinomthaID who extracted the billing information of
the Openstack Usage provided by <a class="reference external" href="https://docs.openstack.org/cloudkitty/latest/">CloudKitty</a>, and localised
and customised the invoicing to CHPC requirements.</p>
<p>Ensuring there was constant and clear communication during the
project, the Linomtha project team ensured daily stand-up calls,
weekly progress meetings and utilised tools such as Slack and Google
Meet - which allowed for quick turnaround times for addressing
queries.</p>
<blockquote>
<em>We were impressed with the Slack communication and the shared Google
drive provided for documentation between team members, it made the
sharing of thoughts much easier resulting in solving problems quickly
and collaboratively.</em></blockquote>
<p>A single point of contact was identified from each stakeholder
involved in the project, allowing for communication to flow to the
right people and ensuring action items were accomplished and
ultimately, meeting the challenging deadline.</p>
<p>One component of the project was training which initially was to
take place on-site, but due to the restraints of COVID-19, the team
improvised and the training was successfully delivered remotely,
over a five-day period. The training was deemed a great success!
The training has ensured that the CHPC Administrators have sufficient
knowledge and confidence to efficiently manage the environment.</p>
<blockquote>
<em>The training was one of the best we've attended, the setup was
great, the trainer's expertise and their quick thinking or rather
well-considered answers in providing solutions to our questions was
impressive. The information gathered and shared is helping us with
our OpenStack operations and we can only grow strong from here with
our OpenStack expertise as well.</em></blockquote>
<p>No project is without challenges and this one was no exception. One
of the lessons learnt was that the time between the initial workshop
and implementation was too compressed. It did not allow for all
team members, including technical resources, to fully understand
the finer technical detail of the project and allow them to all
contribute.</p>
<p>Despite the challenges encountered during the project, through the
professional Linomtha Project Management deployment, milestones
were met, the deadline accomplished, quality documentation drafted,
successful training delivered and the handover to operations completed
within the required deadline and budget.</p>
<div class="section" id="get-in-touch">
<h2>Get in touch</h2>
<p>If you would like to get in touch we would love to hear
from you. Reach out to us via <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a>
or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact page</a>.</p>
</div>
Kata Containers on The New Stack2020-04-29T14:40:00+01:002020-04-29T14:40:00+01:00Bharat Kunwartag:www.stackhpc.com,2020-04-29:/brtknr-kata.html<p class="first last">StackHPC's Bharat Kunwar participated in a webinar hosted by The New Stack
as an expert on Kata containers and their performance characteristics.</p>
<p>Our team draws on a broad base of expertise in the technologies
used to build the high-performance cloud. Occasionally our research
breaks new ground, and we are always thrilled with the opportunity
to talk about it.</p>
<p><a class="reference external" href="https://thenewstack.io/">The New Stack</a> recently approached
Bharat from our team to participate in a webinar on <a class="reference external" href="https://katacontainers.io/">Kata containers</a>. Often Kata containers are pitched
with the soundbite "the speed of containers, the security of VMs".
Bharat's <a class="reference external" href="//www.stackhpc.com/kata-io-1.html">previous research on IO performance</a>
suggested the real picture was more nuanced.</p>
<p>The end result is a great article and webinar (with <a class="reference external" href="https://twitter.com/egernst">Eric Ernst</a> from Ampere), which can be <a class="reference external" href="https://thenewstack.io/kata-containers-demo-a-container-experience-with-vm-security/">read
here</a>.
Bharat's presentation can be <a class="reference external" href="//www.stackhpc.com/images/IO-Performance-of-Kata-Containers-TheNewStack.pdf">downloaded here</a> (as PDF).</p>
<div class="youtube"><iframe src="https://www.youtube.com/embed/uDh8UhOrR8I" width="500" height="350" allowfullscreen seamless frameBorder="0"></iframe></div><div class="section" id="get-in-touch">
<h2>Get in touch</h2>
<p>If you would like to get in touch we would love to hear
from you. Reach out to us via <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a>
or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact page</a>.</p>
</div>
StackHPC and COVID-192020-03-17T19:00:00+00:002020-03-17T19:00:00+00:00John Taylortag:www.stackhpc.com,2020-03-17:/covid19.html<p class="first last">An update from StackHPC's CEO John Taylor on the company's
strategy for mitigating the consequences of COVID-19.</p>
<p>As COVID-19 continues to spread, I would like to update you on the
steps StackHPC is taking to ensure business continuity, in a secure,
responsible and reliable manner, to the benefit of us, our business
customers and contacts as well as to the wider community.</p>
<p>StackHPC has decided to have all employees work remotely for the
benefit of their safety and well-being. As a consultancy company,
we have a business continuity plan in place, and are confident that
our teams have the resources required to ensure that our activities
will not be compromised, despite the obvious challenges we are all
experiencing.</p>
<p>Rest assured that you can rely on StackHPC to support all of our
customers, partners and business contacts during these uncertain
times.</p>
<div class="section" id="get-in-touch">
<h2>Get in touch</h2>
<p>If you would like to get in touch we would love to hear
from you. Reach out to us via <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a>
or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact page</a>.</p>
</div>
Scaling up: Monasca Performance Improvements2020-03-05T18:00:00+00:002020-03-05T18:00:00+00:00Bharat Kunwartag:www.stackhpc.com,2020-03-05:/monitoring-performance-improvements.html<p class="first last">We describe a multi-pronged approach we used to improve the user-facing
query performance of a Monasca-based monitoring stack we
currently have deployed at client sites.</p>
<div class="figure">
<img alt="Monasca project mascot" src="//www.stackhpc.com/images/monasca-mascot.png" style="width: 150px;" />
</div>
<p>At StackHPC, we use <a class="reference external" href="http://www.stackhpc.com/monasca-comes-to-kolla.html">Kolla-Ansible to deploy Monasca</a>, a multi-tenant
monitoring-as-a-service solution that integrates with OpenStack, which allows
users to deploy <a class="reference external" href="https://influxdb-python.readthedocs.io">InfluxDB</a> as a
time-series database. As this database fills up over time with an unbounded
retention period, it is not surprising that the response time of the database
will be different to when it was initially deployed. Long term operation of
Monasca by our clients in production has required a proactive approach to keep
the monitoring and logging services running optimally. In particular, the
problems we have seen relate to query performance, which has been directly
affecting our customers and other Monasca users. In this article, we tell a
story of how we overcame these issues and introduce an <em>opt-in</em> database per
tenant capability we pushed upstream into Monasca for the benefit of our
customers and the wider OpenStack community who may be dealing with similar
challenges of monitoring at scale.</p>
<div class="section" id="the-challenges-of-monitoring-at-scale">
<h2>The Challenges of Monitoring at Scale</h2>
<p>Our journey starts at a point where the following disparate issues - related
in the sense that they are all symptoms of a growing OpenStack
deployment - were brought to our attention:</p>
<ul class="simple">
<li>When a user views collected metrics on a Monasca Grafana dashboard (which
uses Monasca as the data source), the dashboard first aims to dynamically
obtain a list of host names. This query was not respecting the time
boundary that can be selected on the dashboard, and was instead scanning
results from the entire database. Naturally, this went unnoticed when
the database was small, but as the cardinality of the collected metrics
grew over time (345 million at its peak on one site - that is <em>345 million
unique time series</em>), the duration of this query was taking up to an
hour before eventually timing out. In the meantime it would be blocking
resources for additional queries.</li>
<li>A user from a new OpenStack project would experience the same delay in
query time against the Monasca API as a user from another project with a much
larger metrics repository. This is because Monasca currently implements a
single InfluxDB database by default and project scoped metrics are filtered
using a <tt class="docutils literal">WHERE</tt> statement. This was a clear bottleneck.</li>
<li>Last but not least, all metrics being gathered were subject to
the same retention policy. InfluxDB has support for multiple retention
policies per database. To keep things further isolated, it is also possible
to have a database per tenant, each with its own default retention policy.
Not only does this increase the portability of projects, it also removes the
overhead of filtering results by project each time a query is executed,
naturally improving performance.</li>
</ul>
<p>To address these issues, we implemented the following <em>quick fixes</em>, and while
they alleviate the symptoms in the short term, we would not consider either of
them sustainable or scalable solutions as they will soon require further manual
intervention:</p>
<ul class="simple">
<li>Disabling dynamic host name lookup by providing a static inventory of host
names (which could be automated at deploy time for static projects).
However, for dynamic inventories, this approach relies on manual update of
the inventory.</li>
<li>Deleting metrics with highly variable dimensions, which contribute
disproportionately to increasing the database cardinality (larger cardinality
leads to increased query time for InfluxDB, although other time series
databases, e.g. TimescaleDB, claim not to be affected in a similar way). Many
metric sources expose metrics with highly variable dimensions and avoiding
this is an intrinsically hard problem and not one confined to Monasca. For
example, sources like cAdvisor expose a lot of metrics with highly variable
dimensions by default and one has to be judicious about which metrics to
scrape. In our Kolla-Ansible based deployment, the low hanging fruits were
mostly metrics matching the regex pattern <tt class="docutils literal">log.*</tt> originating from the
OpenStack control plane useful for triggering alarms and then for a finite
time horizon for auditing. However, since all data is currently stored under
the same database and retention policy (since the Monasca API currently does
not have a way of setting per project retention policies), it is not possible
to define project specific data expiry date. For example, we were able to
reduce 345 million unique time series down to a mere 227 thousand, 0.07% of
the original by deleting these log metrics (deleting at a rate of 7 million
series per hour for a total of 49 hours). Similarly, at another site, we were
able to cut down from 2 million series to 186 thousand, 9% of the original
(deleting at a rate of 29 thousand series per hour for 77 hours). In both
cases, we managed to significantly cut down the query time from a state where
queries were timing out down to a few seconds. However, employing database
per tenancy with fine control over retention period remained the holy grail
for delivering sustained performance. (A sketch for inspecting cardinality follows this list.)</li>
</ul>
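<p>For anyone facing a similar situation, series cardinality and the offending
measurements can be inspected directly in InfluxDB. This is a minimal sketch,
assuming a Kolla-Ansible deployment writing to the default <tt class="docutils literal">monasca</tt>
database:</p>
<div class="highlight"><pre><span></span>$ docker exec -it influxdb influx -database monasca -execute "SHOW SERIES CARDINALITY"
$ docker exec -it influxdb influx -database monasca -execute "SHOW MEASUREMENTS WITH MEASUREMENT =~ /^log\./"
</pre></div>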
</div>
<div class="section" id="towards-greater-performance">
<h2>Towards Greater Performance</h2>
<p>Our multi-pronged approach to make the monitoring stack more performant and
resilient can be summarised in the following ways:</p>
<ul class="simple">
<li>The first part of our effort to improve the situation is by introducing a
database per tenancy feature to Monasca. <a class="reference external" href="https://review.opendev.org/#/q/topic:story/2006331">The enabling patches</a> affecting
<tt class="docutils literal"><span class="pre">monasca-{api,persister}</span></tt> projects have now merged upstream and are
available from <em>OpenStack Train</em> release. This paves the way for using an
instance of InfluxDB per tenant to further decouple the database back-end
between tenants. In summary, these changes enable end users to:<ul>
<li>Enable a database per tenant within a single InfluxDB instance on an opt-in
basis by setting <tt class="docutils literal">db_per_tenant</tt> to <tt class="docutils literal">True</tt> in
<tt class="docutils literal"><span class="pre">monasca-{api,persister}</span></tt> configuration files (see the sketch after this list).</li>
<li>Set a default retention policy by defining <tt class="docutils literal">default_retention_hours</tt> in
<tt class="docutils literal"><span class="pre">monasca-persister</span></tt> configuration file. Further development of this
thread would involve giving project owners the ability to set retention
policy of their tenancy via the API.</li>
<li>Migrate an existing monolithic database to a database per tenant model
using an efficient migration tool <a class="reference external" href="https://opendev.org/openstack/monasca-persister/src/branch/master/monasca_persister/tools/influxdb/db-per-tenant">we proudly upstreamed</a>.</li>
</ul>
</li>
<li>We also introduced experimental changes to limit the search results to the
query time window selected on the Grafana dashboard. <a class="reference external" href="https://review.opendev.org/#/q/topic:story/2006204">The required changes</a> spanning several
projects (<tt class="docutils literal"><span class="pre">monasca-{api,grafana-datasource,tempest-plugin}</span></tt>) have all
merged upstream and also available from <em>Openstack Train</em> release. Since the
only option previously was to search the entire database, queries targeting
large databases were timing out which can now be avoided. The only caveat
with this approach is that the results are approximate, i.e., the accuracy of
the returned result is determined by the length of the <tt class="docutils literal">shardGroupDuration</tt>
which resolves to 1 week by default when the retention policy is infinite.
This defaults to 1 day when the retention policy is 2 weeks. Considering that
the earlier behaviour was to scan the entire database, this approach yields a
considerable improvement, despite a minor loss in precision.</li>
</ul>
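<p>As a rough sketch of what opting in looks like under Kolla-Ansible - the
override file names and the <tt class="docutils literal">[influxdb]</tt> option group are assumptions, so
check your custom config layout - the flag is dropped into the service
overrides and the services reconfigured:</p>
<div class="highlight"><pre><span></span>$ printf '[influxdb]\ndb_per_tenant = True\n' >> etc/kayobe/kolla/config/monasca/monasca-api.conf
$ printf '[influxdb]\ndb_per_tenant = True\n' >> etc/kayobe/kolla/config/monasca/monasca-persister.conf
$ kayobe overcloud service reconfigure --kolla-tags monasca
</pre></div>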
<p>These additional features have allowed us to further reduce the query time to
less than a second in a large, 100+ node deployment with 1 year retention
policy; a dramatic improvement compared to queries without any time boundary
where our users were frequently hitting query timeouts. Additionally, we
have facilitated a more sustainable way to manage the life-cycle of data
being generated and consumed by different tenants. For example, this allows
tenancy for the control plane logs to have a short retention duration.</p>
</div>
<div class="section" id="a-well-rehearsed-migration-strategy">
<h2>A Well-Rehearsed Migration Strategy</h2>
<p>Existing production environments hoping to reap the benefit of capabilities we
have discussed so far may also wish to migrate their existing monolithic
database to a database per tenant model. <em>A good migration tool requires a great
migration strategy.</em> In order to ensure minimal disruptions for our customers,
we rehearsed the following migration strategy in a pre-production environment
before applying the changes in production.</p>
<p>First of all, carry out a migration of the current snapshot of the database up
to a desired <tt class="docutils literal"><span class="pre">--migrate-end-time-offset</span></tt>, e.g. 52 weeks into the past. This
is much like a Virtual Machine migration: we start by syncing the majority of
the data across, which requires a minimum of free disk space equivalent to the
current size of the database. The following example is relevant to
Kolla-Ansible based deployments:</p>
<div class="highlight"><pre><span></span>docker <span class="nb">exec</span> -it -u root monasca_persister bash
<span class="nb">source</span> /var/lib/kolla/venv/bin/activate
pip install -U monasca-persister
<span class="nb">exit</span>
docker <span class="nb">exec</span> -it -u root monasca_persister python /var/lib/kolla/venv/lib/python2.7/site-packages/monasca_persister/tools/influxdb/db-per-tenant/migrate-to-db-per-tenant.py <span class="se">\</span>
--config-file /etc/monasca/persister.conf <span class="se">\</span>
--migrate-retention-policy project_1:2,project_2:12,project_3:52 <span class="se">\</span>
--migrate-skip-regex ^log<span class="se">\\</span>..+ <span class="se">\</span>
--migrate-time-unit w <span class="se">\</span>
--migrate-start-time-offset <span class="m">0</span> <span class="se">\</span>
--migrate-end-time-offset <span class="m">52</span>
</pre></div>
<p>The initial migration is likely to take some time depending on the amount of
data being migrated and the type of disk under the hood. While this is
happening, the <tt class="docutils literal">monasca_persister</tt> container is inserting new metrics into
the original database which will need re-syncing after the initial migration is
complete. Take a note of the length of time this phase of migration takes as
this will determine the portion of the database that will need to be
remigrated. You will be able to see that a new database with project specific
retention policy of <tt class="docutils literal">2w</tt> has been created as follows for <tt class="docutils literal">project_1</tt>:</p>
<div class="highlight"><pre><span></span>docker <span class="nb">exec</span> -it influxdb influx -host <span class="m">192</span>.168.7.1 -database monasca_project_1 -execute <span class="s2">"SHOW RETENTION POLICIES"</span>
name duration shardGroupDuration replicaN default
---- -------- ------------------ -------- -------
2w 336h0m0s 24h0m0s <span class="m">1</span> <span class="nb">true</span>
</pre></div>
<p>Once the initial migration is complete, stop the <tt class="docutils literal">monasca_persister</tt>
container and confirm that it has stopped. For deployments with multiple
controllers, you will need to ensure this is the case on all nodes.</p>
<div class="highlight"><pre><span></span>docker stop monasca_persister
docker ps <span class="p">|</span> grep monasca_persister
</pre></div>
<p>Once the persister has stopped, nothing new is written to the original
database, while any new entries are buffered on Kafka topics. It is a good
idea to back up the database at this point, for which InfluxDB provides a
handy command line interface:</p>
<div class="highlight"><pre><span></span>docker <span class="nb">exec</span> -it influxdb influxd backup -portable /var/lib/influxdb/backup
</pre></div>
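<p>Should you need to roll back, the same interface can restore the snapshot;
a minimal sketch, assuming the backup path used above:</p>
<div class="highlight"><pre><span></span># Restore the portable backup; note that a database cannot be restored
# over an existing database of the same name
docker exec -it influxdb influxd restore -portable /var/lib/influxdb/backup
</pre></div>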
<p>Upgrade the Monasca containers to the <em>OpenStack Train</em> release with database
per tenancy features. For example, Kayobe/Kolla-Ansible users can run the
following Kayobe CLI command, which also ensures that the new versions of the
<tt class="docutils literal">monasca_persister</tt> containers are back up and running on all the
controllers, writing entries to a database per tenant:</p>
<div class="highlight"><pre><span></span>kayobe overcloud service reconfigure -kt monasca
</pre></div>
<p>Populate the new databases with the missing database entries (the minimum is 1
unit of time). InfluxDB automatically prevents duplicate entries, so it is not
a problem if there is an overlap in the migration window. In the following
command, we assume that the original migration took less than a week to
complete, and therefore set <tt class="docutils literal"><span class="pre">--migrate-end-time-offset</span></tt> to 1:</p>
<div class="highlight"><pre><span></span>docker <span class="nb">exec</span> -it -u root monasca_persister python /var/lib/kolla/venv/lib/python2.7/site-packages/monasca_persister/tools/influxdb/db-per-tenant/migrate-to-db-per-tenant.py <span class="se">\</span>
--config-file /etc/monasca/persister.conf <span class="se">\</span>
--migrate-retention-policy project_1:2,project_2:12,project_3:52 <span class="se">\</span>
--migrate-skip-regex ^log<span class="se">\\</span>..+ <span class="se">\</span>
--migrate-time-unit w <span class="se">\</span>
--migrate-start-time-offset <span class="m">0</span> <span class="se">\</span>
--migrate-end-time-offset <span class="m">1</span>
</pre></div>
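<p>Finally, it is worth sanity-checking that the per-tenant databases contain
the expected data. A quick check, reusing the <tt class="docutils literal">influx</tt> CLI invocation from
earlier (host address as per your deployment):</p>
<div class="highlight"><pre><span></span>docker exec -it influxdb influx -host 192.168.7.1 -database monasca_project_1 -execute "SHOW MEASUREMENTS"
</pre></div>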
</div>
<div class="section" id="acknowledgements">
<h2>Acknowledgements</h2>
<p>This development work was generously funded by <a class="reference external" href="https://verneglobal.com/">Verne Global</a> who are already using the optimised capabilities
to provide enhanced services for <a class="reference external" href="https://hpcdirect.com">hpcDIRECT</a> users.</p>
</div>
<div class="section" id="contact-us">
<h2>Contact Us</h2>
<p>If you would like to get in touch we would love to hear from you. Reach out to
us on <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a> or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact
page</a>.</p>
</div>
SR-IOV Networking in Kayobe2020-02-14T10:00:00+00:002020-02-14T11:00:00+00:00Michal Nasiadkatag:www.stackhpc.com,2020-02-14:/sriov-kayobe.html<p class="first last">In a virtualised environment, SR-IOV enables closer access to
underlying hardware, trading operational flexibility for greater
performance.</p>
<div class="section" id="a-brief-introduction-to-single-root-i-o-virtualisation-sr-iov">
<h2>A Brief Introduction to Single-Root I/O Virtualisation (SR-IOV)</h2>
<div class="line-block">
<div class="line">In a virtualised environment, SR-IOV enables closer access to underlying
hardware, trading greater performance for reduced operational flexibility.</div>
<div class="line"><br /></div>
<div class="line">This involves the creation of virtual functions (VFs), which are presented as
a copy of the physical function (PF) of the hardware device. The VF is
passed-through to a VM, resulting in bypass of the hypervisor operating
system for network activity. The principles of SR-IOV are presented in slightly greater depth
in a short <a class="reference external" href="https://www.intel.com/content/dam/doc/white-paper/pci-sig-single-root-io-virtualization-support-in-virtualization-technology-for-connectivity-paper.pdf">Intel white paper</a>, and the OpenStack fundamentals are described in the <a class="reference external" href="https://docs.openstack.org/neutron/latest/admin/config-sriov">Neutron online documentation</a>.</div>
</div>
<div class="line-block">
<div class="line">A VF can be bound to a given VLAN, or (on some hardware, such as recent
Mellanox NICs) it can be bound to a given VXLAN VNI. The result is direct
access to a physical NIC attached to a tenant or provider network.</div>
<div class="line"><br /></div>
<div class="line">Note that there is no support for security groups or similar richer network
functionality as the VM is directly connected to the physical network
infrastructure, which provides no interface for injecting firewall rules or
other externally managed packet handling.</div>
<div class="line">Mellanox also offer a more advanced capability, known as <a class="reference external" href="https://www.mellanox.com/products/ASAP2">ASAP2</a>, which builds
on SR-IOV to also offload Open vSwitch (OVS) functions from the hypervisor.
This is more complex and not in scope for this investigation.</div>
</div>
</div>
<div class="section" id="setup-for-sr-iov">
<h2>Setup for SR-IOV</h2>
<p>Aside from OpenStack, deployment of SR-IOV involves configuration at many levels.</p>
<ul>
<li><p class="first">BIOS needs to be configured to enable both <cite>Virtualization Technology</cite> and
<cite>SR-IOV</cite>.</p>
</li>
<li><p class="first">Mellanox NIC firmware must be configured to enable the creation of SR-IOV
VFs and define the maximum number of VFs to support. This requires the
installation of the Mellanox Firmware Tools (MFT) package from <a class="reference external" href="https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed">Mellanox OFED</a>.</p>
</li>
<li><p class="first">Kernel boot parameters are required to support direct access to SR-IOV
hardware:</p>
<div class="highlight"><pre><span></span><span class="na">intel_iommu</span><span class="o">=</span><span class="s">on iommu=pt</span>
</pre></div>
</li>
<li><p class="first">A number of VFs can be created by writing the required number to a file under
<tt class="docutils literal">/sys</tt>, for example:
<tt class="docutils literal">/sys/class/net/eno6/device/sriov_numvfs</tt></p>
<p><em>NOTE: There are certain NIC models (e.g. Mellanox Connect-X 3) that do not
support management via sysfs, those need to be configured using modprobe
(see modprobe.d man page).</em></p>
</li>
<li><p class="first">This is typically done as a <cite>udev</cite> trigger script on insertion of the PF
device. The upper limit set for VFs is given by another (read-only) file in
the same directory.</p>
</li>
</ul>
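<p>A minimal sketch of such a <cite>udev</cite> rule, assuming a PF named <tt class="docutils literal">eno6</tt> and
8 VFs (the file path and values are illustrative):</p>
<div class="highlight"><pre><span></span># /etc/udev/rules.d/70-sriov.rules
# When the PF appears, set the number of VFs to create
ACTION=="add", SUBSYSTEM=="net", KERNEL=="eno6", ATTR{device/sriov_numvfs}="8"
</pre></div>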
<p>As a framework for management using infrastructure-as-code principles
and <a class="reference external" href="https://docs.ansible.com/ansible/latest/index.html">Ansible</a>
at every level, <a class="reference external" href="//www.stackhpc.com/pages/kayobe.html">Kayobe</a> provides
support for running <a class="reference external" href="https://docs.openstack.org/kayobe/latest/custom-ansible-playbooks.html">custom Ansible playbooks</a>
on the inventory and groups of the infrastructure deployment. Over
time StackHPC has developed a number of roles to perform additional
configuration as a custom site playbook. A recent addition is a
<a class="reference external" href="https://galaxy.ansible.com/stackhpc/sriov">Galaxy role for SR-IOV setup</a></p>
<p>A simple custom site playbook could look like this:</p>
<div class="highlight"><pre><span></span><span class="nn">---</span>
<span class="p p-Indicator">-</span> <span class="nt">name</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">Configure SR-IOV</span>
<span class="nt">hosts</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">compute_sriov</span>
<span class="nt">tasks</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="nt">include_role</span><span class="p">:</span>
<span class="nt">name</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">stackhpc.sriov</span>
<span class="nt">handlers</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="nt">name</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">reboot</span>
<span class="nt">include_tasks</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">tasks/reboot.yml</span>
<span class="nt">tags</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">reboot</span>
<span class="nn">...</span>
</pre></div>
<p>This playbook would then be invoked from the Kayobe CLI:</p>
<div class="highlight"><pre><span></span><span class="o">(</span>kayobe<span class="o">)</span> $ kayobe playbook run sriov.yml
</pre></div>
<p>Once the system is prepared for supporting SR-IOV, OpenStack
configuration is required to enable VF resource management, scheduling
according to VF availability, and pass-through of the VF to VMs
that request it.</p>
<div class="section" id="sr-iov-and-lags">
<h3>SR-IOV and LAGs</h3>
<p>An additional complication might be that hypervisors use bonded
NICs to provide network access for VMs. This provides greater fault
tolerance. However, a VF is normally associated with only one PF
(and the two PFs in a bond would lead to inconsistent connectivity).</p>
<p>Mellanox NICs have a feature, VF-LAG, which claims to enable SR-IOV
to work in configurations where the ports of a 2-port NIC are bonded
together.</p>
<p><em>Setup for VF-LAG requires additional steps and complexities, and we'll be
covering it in greater detail in another blog post soon.</em></p>
</div>
</div>
<div class="section" id="nova-configuration">
<h2>Nova Configuration</h2>
<div class="section" id="scheduling-with-hardware-resource-awareness">
<h3>Scheduling with Hardware Resource Awareness</h3>
<p>SR-IOV VFs are managed in the same way as PCI-passthrough hardware (e.g. GPUs).
Each VF is managed as a hardware resource. The Nova scheduler must be
configured not to schedule instances requesting SR-IOV resources to hypervisors
with none available. This is done using the <tt class="docutils literal">PciPassthroughFilter</tt> scheduler
filter.</p>
<p>In Kayobe config, the Nova scheduler filters are configured by defining
non-default parameters in <tt class="docutils literal">nova.conf</tt>. In the <tt class="docutils literal"><span class="pre">kayobe-config</span></tt> repo, add this to
<tt class="docutils literal">etc/kayobe/kolla/config/nova.conf</tt>:</p>
<div class="highlight"><pre><span></span><span class="k">[filter_scheduler]</span>
<span class="na">available_filters</span> <span class="o">=</span> <span class="s">nova.scheduler.filters.all_filters</span>
<span class="na">enabled_filters</span> <span class="o">=</span> <span class="s">other-filters,PciPassthroughFilter</span>
</pre></div>
<p>(The other filters listed may vary according to other configuration applied
to the system).</p>
</div>
<div class="section" id="hypervisor-hardware-resources-for-passthrough">
<h3>Hypervisor Hardware Resources for Passthrough</h3>
<p>The nova-compute service on each hypervisor requires configuration to define
which hardware/VF resources are to be made available for passthrough to VMs.
In addition, for infrastructure with multiple physical networks, an association
must be made to define which VFs connect to which physical network.
This is done by defining a whitelist (<tt class="docutils literal">pci_passthrough_whitelist</tt>) of available
hardware resources on the compute hypervisors. This can be tricky to configure
in an environment with multiple variants of hypervisor hardware specification,
where the available resources differ between hosts.
One solution using Kayobe's inventory is to define whitelist hardware mappings
either globally, in group variables, or even in individual host variables, as
follows:</p>
<div class="highlight"><pre><span></span><span class="c1"># Physnet to device mappings for SR-IOV, used for the pci</span>
<span class="c1"># passthrough whitelist and sriov-agent configs</span>
<span class="nt">sriov_physnet_mappings</span><span class="p">:</span>
<span class="nt">p4p1</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">physnet2</span>
</pre></div>
<p>This state can then be applied by adding a macro-expanded term to
<tt class="docutils literal">etc/kayobe/kolla/config/nova.conf</tt>:</p>
<div class="highlight"><pre><span></span>{% raw %}
[pci]
passthrough_whitelist = [{% for dev, physnet in sriov_physnet_mappings.items() %}{{ (loop.index0 > 0)|ternary(',','') }}{ "devname": "{{ dev }}", "physical_network": "{{ physnet }}" }{% endfor %}]
{% endraw %}
</pre></div>
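<p>With the single mapping above, the template renders to the following in
<tt class="docutils literal">nova.conf</tt>:</p>
<div class="highlight"><pre><span></span>[pci]
passthrough_whitelist = [{ "devname": "p4p1", "physical_network": "physnet2" }]
</pre></div>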
<p>We have used the network device name for designation here, but other
options are available:</p>
<ul>
<li><div class="first line-block">
<div class="line"><strong>devname</strong>: <em>network-device-name</em></div>
<div class="line">(as used above)</div>
</div>
</li>
<li><div class="first line-block">
<div class="line"><strong>address</strong>: <em>pci-bus-address</em></div>
<div class="line">Takes the form <tt class="docutils literal"><span class="pre">[[[[<domain>]:]<bus>]:][<slot>][.[<function>]]</span></tt>.</div>
<div class="line">This is a good way of unambiguously selecting a single device in the
hardware device tree.</div>
</div>
</li>
<li><div class="first line-block">
<div class="line"><strong>address</strong>: <em>mac-address</em></div>
<div class="line">Can be wild-carded.</div>
<div class="line">Useful if the vendor of the SR-IOV NIC is different from all other NICs in
the configuration, so that selection can be made by OUI.</div>
</div>
</li>
<li><div class="first line-block">
<div class="line"><strong>vendor_id</strong>: <em>pci-vendor</em> <strong>product_id</strong>: <em>pci-device</em></div>
<div class="line">A good option for selecting a single hardware device model, wherever they
are located.</div>
<div class="line">These values are 4-digit hexadecimal (but the conventional 0x prefix is not
required).</div>
</div>
</li>
</ul>
<p>The vendor ID and device ID are available from <tt class="docutils literal">lspci <span class="pre">-nn</span></tt> (or <tt class="docutils literal">lspci <span class="pre">-x</span></tt> for
the hard core). The IDs supplied should be those of the virtual function
(VF), not the physical function, which may be slightly different.</p>
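<p>For example, on a hypothetical host with a Mellanox ConnectX-5 NIC, the PF
and VF report different device IDs (output abbreviated and illustrative):</p>
<div class="highlight"><pre><span></span>$ lspci -nn | grep Mellanox
3b:00.0 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
3b:00.2 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function] [15b3:1018]
</pre></div>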
</div>
</div>
<div class="section" id="neutron-configuration">
<h2>Neutron Configuration</h2>
<div class="line-block">
<div class="line">Kolla-Ansible documents SR-IOV configuration well here:
<a class="reference external" href="https://docs.openstack.org/kolla-ansible/latest/reference/networking/sriov.html">https://docs.openstack.org/kolla-ansible/latest/reference/networking/sriov.html</a>.</div>
<div class="line">See <a class="reference external" href="https://docs.openstack.org/neutron/train/admin/config-sriov.html">https://docs.openstack.org/neutron/train/admin/config-sriov.html</a> for full
details from Neutron's documentation.</div>
<div class="line">For Kayobe configuration, we set a global flag <tt class="docutils literal">kolla_enable_neutron_sriov</tt>
in <tt class="docutils literal">etc/kayobe/kolla.yml</tt>:</div>
</div>
<div class="highlight"><pre><span></span><span class="nt">kolla_enable_neutron_sriov</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">true</span>
</pre></div>
<div class="section" id="neutron-server">
<h3>Neutron Server</h3>
<p>SR-IOV usually connects to VLANs; here we assume Neutron has already been
configured to support this.
The <tt class="docutils literal">sriovnicswitch</tt> ML2 mechanism driver must be enabled. In Kayobe config,
this is added to <tt class="docutils literal">etc/kayobe/neutron.yml</tt>:</p>
<div class="highlight"><pre><span></span><span class="c1"># List of Neutron ML2 mechanism drivers to use. If unset the kolla-ansible</span>
<span class="c1"># defaults will be used.</span>
<span class="nt">kolla_neutron_ml2_mechanism_drivers</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="l l-Scalar l-Scalar-Plain">openvswitch</span>
<span class="p p-Indicator">-</span> <span class="l l-Scalar l-Scalar-Plain">l2population</span>
<span class="p p-Indicator">-</span> <span class="l l-Scalar l-Scalar-Plain">sriovnicswitch</span>
</pre></div>
</div>
<div class="section" id="neutron-sr-iov-nic-agent">
<h3>Neutron SR-IOV NIC Agent</h3>
<p>Neutron requires an additional agent to run on compute hypervisors with SR-IOV
resources. The SR-IOV agent must be configured with mappings between physical
network name and the interface name of the SR-IOV PF.
In Kayobe config, this should be added in a file
<tt class="docutils literal">etc/kayobe/kolla/config/neutron/sriov_agent.ini</tt>.
Again we can do an expansion using the variables drawn from Kayobe config's
inventory and extra variables:</p>
<div class="highlight"><pre><span></span>{% raw %}
[sriov_nic]
physical_device_mappings = {% for dev, physnet in sriov_physnet_mappings.items() %}{{ (loop.index0 > 0)|ternary(',','') }}{{ physnet }}:{{ dev }}{% endfor %}
exclude_devices =
{% endraw %}
</pre></div>
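<p>Again, with the example mapping above, this renders to:</p>
<div class="highlight"><pre><span></span>[sriov_nic]
physical_device_mappings = physnet2:p4p1
exclude_devices =
</pre></div>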
</div>
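<p>With the server and agent configured, connecting an instance via SR-IOV is
then a matter of creating a Neutron port with a <tt class="docutils literal">vnic_type</tt> of <tt class="docutils literal">direct</tt>
and attaching it at boot. A sketch, with network, image and flavor names
purely illustrative:</p>
<div class="highlight"><pre><span></span># Create a port backed by an SR-IOV VF on an existing VLAN network
openstack port create --network sriov-net --vnic-type direct sriov-port

# Boot an instance attached to that port
openstack server create --flavor m1.large --image CentOS7 --port sriov-port sriov-vm
</pre></div>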
</div>
StackHPC Winter Design Summit2020-01-20T09:00:00+00:002020-01-20T09:00:00+00:00Stig Telfertag:www.stackhpc.com,2020-01-20:/winter-2020.html<p class="first last">StackHPC's team gathered at our Bristol base for our
semi-annual all-hands design summit.</p>
<p>Our team is becoming increasingly international, and while we
work well as a virtual organisation, sometimes there is no substitute
for gathering in a room and charting the course of the company for
the ensuing months.</p>
<p>We have been holding design summits for the last few years, with
the purpose of reviewing new technologies, considering improvements
to our team processes and updating everyone on the growth and
financial position of the firm.</p>
<p>In addition, we issue employee stock options to broaden the team's
stake in the company's success.</p>
<p>With our growing team, and growing customer base, we spent a good
deal of time discussing how we can continue to work as effectively
as we do while the company grows and takes on new commitments
to deliver. The agility of our working practices has been our
strength, and we intend to keep it that way.</p>
<p>This is an exciting time to be working on the creation of
high-performance cloud infrastructure, and our discussions reflected
the pace of innovation occurring on many fronts. Watch this space
for 2020!</p>
<div class="figure">
<img alt="Map of StackHPC's design summit" src="//www.stackhpc.com/images/winter-2020-design-summit.jpeg" style="width: 750px;" />
</div>
<div class="section" id="get-in-touch">
<h2>Get in touch</h2>
<p>If you would like to get in touch we would love to hear
from you. Reach out to us via <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a>
or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact page</a>.</p>
</div>
StackHPC at Supercomputing 20192019-11-26T09:00:00+00:002019-11-26T09:00:00+00:00Stig Telfertag:www.stackhpc.com,2019-11-26:/sc19.html<p class="first last">John Taylor and Stig Telfer attended Supercomputing 2019,
at 13,950 attendees the biggest ever, and heard the latest on
the adoption of cloud technologies for HPC workloads.</p>
<div class="figure">
<img alt="Stig Telfer presenting at SuperCompCloud" src="//www.stackhpc.com/images/stig-supercompcloud-sc2019.jpeg" style="width: 500px;" />
</div>
<p>Supercomputing is <em>massive</em>, that much is clear. The same convention
centre used for the Open Infrastructure summit earlier in the year
was packed to the rafters, and the technical program schedule included
significant content addressing the convergence of HPC and Cloud -
StackHPC's home territory.</p>
<div class="section" id="supercompcloud-workshop">
<h2>SuperCompCloud Workshop</h2>
<p><a class="reference external" href="https://sites.google.com/view/supercompcloud">SuperCompCloud</a>
is the Workshop on Interoperability of Supercomputing and Cloud
Technologies. Supercomputing 2019 was the first edition of this
workshop with a steering committee drawn from CSCS Switzerland,
Indiana University, Jülich Supercomputing Centre, Los Alamos National
Laboratory, University of Illinois, US Department of Defense and
Google.</p>
<p>The program schedule included some very prestigious speakers, and
StackHPC was thrilled to be included:</p>
<ul class="simple">
<li><a class="reference external" href="https://sc19.supercomputing.org/presentation/?id=ws_scc106&sess=sess120">Scalability and data security: deep learning with health data on future HPC platforms</a> - Georgia Tourassi, Director – Health Data Sciences Institute, ORNL, USA</li>
<li><a class="reference external" href="https://sc19.supercomputing.org/presentation/?id=ws_scc104&sess=sess120">Cloud and Supercomputing Platforms at NCI Australia: the Why, the How and the Future</a> - Allan Williams, Associate Director for Services and Technology at NCI, Australia</li>
<li><a class="reference external" href="https://sc19.supercomputing.org/?post_type=page&p=3479&id=ws_scc105&sess=sess120">HPC and Cloud Operations at CERN</a> - Maria Girone, CTO, CERN openlab, Switzerland</li>
<li><a class="reference external" href="https://sc19.supercomputing.org/?post_type=page&p=3479&id=ws_scc102&sess=sess120">Computing Without Borders: Combining Cloud and HPC to Advance Experimental Science</a> - Debbie Bard, Data Science Engagement Group Lead, NERSC, USA</li>
<li><a class="reference external" href="https://sc19.supercomputing.org/?post_type=page&p=3479&id=ws_scc103&sess=sess120">OpenStack and the Software-Defined Supercomputer</a> - Stig Telfer, StackHPC, UK</li>
<li><a class="reference external" href="https://sc19.supercomputing.org/?post_type=page&p=3479&id=ws_scc101&sess=sess120">Perform Like a Supercomputer, Run Like a Cloud</a> - Stathis Papaefstathiou, Senior VP for R&D, Cray Inc, USA</li>
</ul>
</div>
<div class="section" id="the-openstack-scientific-sig-at-supercomputing">
<h2>The OpenStack Scientific SIG at Supercomputing</h2>
<p>Part of the OpenStack <a class="reference external" href="https://wiki.openstack.org/wiki/Scientific_SIG">Scientific SIG</a>'s remit is to advocate for the
use of OpenStack at scientific computing conferences, and a BoF at
Supercomputing is an ideal forum for making that case. A panel of
regular participants from the Scientific SIG gathered to describe their
use cases and discuss the pros and cons of private cloud.</p>
<div class="figure">
<img alt="The OpenStack Scientific SIG panel" src="//www.stackhpc.com/images/sc19-sig-bof.jpeg" style="width: 750px;" />
</div>
<p><em>The SIG BoF panel L-R: Mike Lowe (Indiana University), Bob Budden (NASA GSFC), Blair Bethwaite (NESI), Stig Telfer (StackHPC), Tim Randles (LANL), Martial Michel (Data Machines)</em></p>
<p>The BoF, titled <a class="reference external" href="https://sc19.supercomputing.org/?post_type=page&p=3479&id=bof132&sess=sess332">Cloud and Open Infrastructure Solutions To Run HPC Workloads</a>,
was well attended, with good audience participation around issues
such as VM performance for I/O-intensive workloads, and OpenStack's
overall health.</p>
<div class="section" id="get-in-touch">
<h3>Get in touch</h3>
<p>If you would like to get in touch we would love to hear
from you. Reach out to us via <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a>
or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact page</a>.</p>
</div>
</div>
High Performance Ethernet for HPC – Are we there yet?2019-11-18T02:00:00+00:002019-11-18T02:00:00+00:00John Taylortag:www.stackhpc.com,2019-11-18:/ethernet-hpc.html<p class="first last">Recently there has been a resurgence of interest around
the use of Ethernet for HPC workloads, most notably following Cray's
recent announcements around Slingshot. In this article I examine
some of the history around Ethernet in HPC and look at some of the
advantages within modern HPC Clouds.</p>
<p>Recently there has been a resurgence of interest around the use of
Ethernet for HPC workloads, most notably following <a class="reference external" href="https://www.nextplatform.com/2019/08/16/how-cray-makes-ethernet-suited-for-hpc-and-ai-with-slingshot/">recent announcements
from Cray about Slingshot</a>. In this article I examine some of the
history around Ethernet in HPC and look at some of the advantages
within modern HPC Clouds.</p>
<p>Of course, Ethernet has long been the mainstay of many organisations running
large-scale High Throughput Computing cluster environments (e.g. geophysics,
particle physics), although it does not (generally) hold the mind-share in
organisations where conventional HPC workloads predominate - notwithstanding
the fact that in many of these environments, the operational workload for a
particular application rarely exceeds a small to moderate number of nodes.
Here Infiniband has held sway for many years now.
A recent look at the <a class="reference external" href="https://www.top500.org/statistics/list/">TOP500</a>
gives some indication of the spread of Ethernet vs. Infiniband vs.
custom or proprietary interconnects for both system share and performance
share - or, as I often refer to them, the price-performance and performance
segments of the HPC market.</p>
<div class="figure">
<img alt="Ethernet share of the TOP500" src="//www.stackhpc.com/images/top500-ethernet-share.png" style="width: 750px;" />
</div>
<p>My interest in Ethernet was piqued some 15-20 years ago because it is a
standard, and very early on there were mechanisms to obviate kernel
overheads, which allowed some level of scalability even back in the
days of 1Gbps. Even then, this meant that one could exploit
Landed-on-Motherboard network technology instead of more expensive
PCI add-in cards. Since then, as we moved to 10Gbps and beyond (and
I coincidentally joined Gnodal, later acquired by Cray), RDMA enablement
(through RoCE and iWARP) allowed standard MPI environment support,
and with the 25, 50 and 100Gbps implementations, bandwidth and
latency promised on par with Infiniband. As a standard, we would
expect a healthy ecosystem of players within both the smart NIC and
switch markets to flourish. For most switches such support is now
standard (see next section). In terms of rNICs, Broadcom, Chelsio,
Marvell and Mellanox currently offer products supporting either, or
both, of the RDMA Ethernet protocols.</p>
<div class="section" id="pause-for-thought-pun-intended">
<h2>Pause for Thought (Pun Intended)</h2>
<p>I think the answer to the question “are we there yet” is (isn’t
it always?) going to be “it depends”. That “depends” will largely
be influenced by the market segmentation into the Performance,
Price-Performance and Price regimes. The question is whether Ethernet can
address the “Price” and “Price-Performance” segments, as opposed to
the “Performance” segment, where some of the deficiencies of Ethernet
RDMA may well be exposed - e.g. multi-switch congestion at large
scale. For moderately sized clusters with nodes spanning only a
single switch, Ethernet may well be a better fit.</p>
<p>So, for example, consider a cluster of 128 nodes (minus nodes for management,
access and storage): if it were possible to assess that 25GbE vs 100Gbps
EDR was sufficient, then I could build the system from a single 32-port
100GbE switch (using break-out cables) as opposed to multiple 36-port
EDR switches - which, if I take the standard practice of over-subscription,
would end up with similar cross-sectional bandwidth to the single
Ethernet switch anyway. Of course, within the bounds of a single
switch the bandwidth would be higher for IB. I guess down the line,
with 400GbE devices coming to a data centre soon, this balance will
change.</p>
<p>Recently I had the chance to revisit this when running test benchmarks
on a bare-metal OpenStack system being used for prototyping of the
SKA (I’ll come on to OpenStack a bit later on but just to remark
here that this system runs OpenStack to prototype an operating
environment for the Science Data Processing Platform of the SKA).</p>
<p>I wanted to stress-test the networks, compute nodes and to some
extent the storage. StackHPC operate the system as a performance
prototype platform on behalf of astronomers across the SKA community
and so ensuring performance is maintained across the system is
critical. The system, eponymously named ALaSKA, looks like this.</p>
<div class="figure">
<img alt="ALaSKA - A la SKA" src="//www.stackhpc.com/images/alaska-p3.png" style="width: 750px;" />
</div>
<p>ALaSKA is used to software-define various platforms of interest to
various aspects of the SKA-Science Data Processor. The two predominant
platforms of interest currently are a Container Orchestration
environment (previously Docker-Swarm but now Kubernetes) and a
Slurm-as-a-Service HPC platform.</p>
<p>Here we focus on the latter of these, which gives us a good opportunity
to look at 100G IB vs 25G RoCE vs 25Gbps TCP vs 10G (network not
shown in the above diagram but is used for provisioning) to compare
performance. First let us look more closely at the Slurm PaaS. From
the base, compute, storage and network infrastructure we use OpenStack
Kayobe to deploy the OpenStack control plane (based on Kolla-Ansible)
and then marshal the creation of bare-metal compute nodes via the
OpenStack Ironic service. The flow looks something like this, with
the Ansible control host being used to configure OpenStack (via
a Bifrost service running on the seed node) as well as the configuration
of network switches. GitHub provides the source repositories.</p>
<div class="figure">
<img alt="ALaSKA - A la SKA" src="//www.stackhpc.com/images/deploying-scientific-compute-platforms.png" style="width: 750px;" />
</div>
<p>Further Ansible playbooks together with OpenStack Heat permit the
deployment of the Slurm platform, based on the latest <a class="reference external" href="http://www.openhpc.community/">OpenHPC</a> image and various high performance
storage subsystems, in this case using <a class="reference external" href="https://www.stackhpc.com/ansible-role-beegfs.html">BeeGFS Ansible playbooks</a>. The graphic
above depicts the resulting environment with the addition of OpenStack
Monasca Monitoring and Logging Service (depicted by the lizard
logo). As we will see later on, this provides valuable insight to
system metrics (for both system administrators and the end user).</p>
<p>So let us assume that we first want to address the Price-Performance
and Price driven markets. At scale we need to be concerned about
East-West traffic congestion between switches. This can be somewhat
mitigated by the fact that with modern 100GbE switches we can
break out to 25/50GbE, which increases the arity of a single switch
(and likely reduces inter-switch congestion). Of course, this means we
need to be able to justify the reduction in bandwidth at the NIC. And
if the total system spans only a single switch, then congestion may not
be an issue at all, although further work may be required to understand
end-point congestion.</p>
<p>To test the system's performance I used (my preference) HPCC and
OpenFOAM as two benchmark environments. All tests used gcc,
MKL and openmpi3, and no attempt was made to further optimise the
applications. After all, all I want to do is run comparative tests
of the same binary, changing run-time variables to target the
underlying fabric. For Open MPI, this can be achieved with the
following (see below). The system uses an OpenHPC image. At the
BIOS level, the system has hyperthreading enabled, so I was
careful to ensure that process placement pinned only half
the number of available slots (I'm using Slurm) and mapped by CPU.
This is important to know when we come to examine the performance
dashboards below. Here are the specific mca parameters for targeting
the fabrics.</p>
<div class="highlight"><pre><span></span><span class="nv">DEV</span><span class="o">=</span><span class="s2">" roce ibx eth 10Geth"</span>
<span class="k">for</span> j in <span class="nv">$DEV</span><span class="p">;</span>
<span class="k">do</span>
<span class="k">if</span> <span class="o">[</span> <span class="nv">$j</span> <span class="o">==</span> ibx <span class="o">]</span><span class="p">;</span> <span class="k">then</span>
<span class="nv">MCA_PARAMS</span><span class="o">=</span><span class="s2">"--bind-to core --mca btl openib,self,vader --mca btl_openib_if_include mlx5_0:1 "</span>
<span class="k">fi</span>
<span class="k">if</span> <span class="o">[</span> <span class="nv">$j</span> <span class="o">==</span> roce <span class="o">]</span><span class="p">;</span> <span class="k">then</span>
<span class="nv">MCA_PARAMS</span><span class="o">=</span><span class="s2">"--bind-to core --mca btl openib,self,vader --mca btl_openib_if_include mlx5_1:1</span>
<span class="s2">fi</span>
<span class="s2">if [ </span><span class="nv">$j</span><span class="s2"> == eth ]; then</span>
<span class="s2">MCA_PARAMS="</span>--bind-to core --mca btl tcp,self,vader --mca btl_tcp_if_include p3p2<span class="s2">"</span>
<span class="s2">fi</span>
<span class="s2">if [ </span><span class="nv">$j</span><span class="s2"> == 10Geth ]; then</span>
<span class="s2">MCA_PARAMS="</span>--bind-to core --mca btl tcp,self,vader --mca btl_tcp_if_include em1<span class="s2">"</span>
<span class="s2">fi</span>
<span class="s2">if [ </span><span class="nv">$j</span><span class="s2"> == ipoib ]; then</span>
<span class="s2">MCA_PARAMS="</span>--bind-to core --mca btl tcp,self,vader --mca btl_tcp_if_include ib0<span class="s2">"</span>
<span class="s2">fi</span>
</pre></div>
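<p>The loop body then launches each benchmark with the selected parameters. A
minimal sketch of what that might look like (the binary name and process
count are illustrative, not the exact commands used):</p>
<div class="highlight"><pre><span></span># Run HPCC over 8 nodes, 256 MPI ranks, mapping and binding by core
mpirun $MCA_PARAMS -np 256 --map-by core ./hpcc
</pre></div>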
<p>In the results below, I’m comparing the performance across each
network using HPCC for a size of 8 nodes (up to 256 cores, albeit
512 virtual cores are available as described above). I think this
would cover the vast majority of cases in Research Computing.</p>
</div>
<div class="section" id="results">
<h2>Results</h2>
<div class="section" id="hpcc-benchmark">
<h3>HPCC Benchmark</h3>
<p>The results for major operations of the HPCC suite are shown below
together with a personal narrative of the performance. A more
thorough description of the benchmarks <a class="reference external" href="https://pdfs.semanticscholar.org/8666/f61e94355b203287d18ee43c32ee8bd69b12.pdf">can be found here</a>.</p>
<p>8 nodes 256 cores</p>
<table border="1" class="docutils">
<colgroup>
<col width="44%" />
<col width="15%" />
<col width="15%" />
<col width="12%" />
<col width="14%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">Benchmark</th>
<th class="head">10GbE (TCP)</th>
<th class="head">25GbE (TCP)</th>
<th class="head">100Gb IB</th>
<th class="head">25GbE RoCE</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>HPL_Tflops</td>
<td>3.584</td>
<td>4.186</td>
<td>5.476</td>
<td>5.233</td>
</tr>
<tr><td>PTRANS_GBs</td>
<td>5.656</td>
<td>16.458</td>
<td>44.179</td>
<td>17.803</td>
</tr>
<tr><td>MPIRandomAccess_GUPs</td>
<td>0.005</td>
<td>0.004</td>
<td>0.348</td>
<td>0.230</td>
</tr>
<tr><td>StarFFT_Gflops</td>
<td>1.638</td>
<td>1.635</td>
<td>1.636</td>
<td>1.640</td>
</tr>
<tr><td>SingleFFT_Gflops</td>
<td>2.279</td>
<td>2.232</td>
<td>2.343</td>
<td>2.322</td>
</tr>
<tr><td>MPIFFT_Gflops</td>
<td>27.961</td>
<td>62.236</td>
<td>117.341</td>
<td>59.523</td>
</tr>
<tr><td>RandomlyOrderedRingLatency_usec</td>
<td>87.761</td>
<td>100.142</td>
<td>3.054</td>
<td>2.508</td>
</tr>
<tr><td>RandomlyOrderedRingBandwidth_GBytes</td>
<td>0.027</td>
<td>0.077</td>
<td>0.308</td>
<td>0.092</td>
</tr>
</tbody>
</table>
<ul class="simple">
<li>HPL – We can see here that it is evenly balanced between low-latency and b/w with RoCE and IB on a par even with the reduction in b/w of RoCE. In one sense this performance underlies the graphics shown above in terms of HPL, where Ethernet occupies ~50% of the share of total clusters which is not matched in terms of the performance share.</li>
<li>PTRANS – Performance pretty much in line with b/w</li>
<li>GUPS – latency dominated. IB wins by some margin</li>
<li>STARFFT– Embarrassingly Parallel (HTC use-case) no network effect.</li>
<li>SINGLEFFT – No effect no comms.</li>
<li>MPIFFT – Heavily b/w dominated see effect of 100 vs 25 Gbps (no latency effect)</li>
<li>Random Ring Latency – see effect of RDMA vs. TCP. Not sure why RoCE is better than IB, but may be due to the random order?</li>
<li>Random Ring B/W – In line with 100Gbps (IB) vs 25Gbps (RDMA) vs TCP networks.</li>
</ul>
</div>
<div class="section" id="openfoam">
<h3>OpenFOAM</h3>
<p>I took the standard Motorbike benchmark and ran this on 128 (4
nodes) and 256 (8 nodes) cores on the same networks as above. I did
not change the mesh sizing between runs and thus on higher processor
counts, comms will be imbalanced. The results are shown below,
showing very little difference between the RDMA networks despite
the bandwidth difference.</p>
<table border="1" class="docutils">
<colgroup>
<col width="28%" />
<col width="18%" />
<col width="19%" />
<col width="18%" />
<col width="18%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">Nodes(Processors)</th>
<th class="head">100Gbps IB</th>
<th class="head">25Gbps ROCE</th>
<th class="head">25Gbps TCP</th>
<th class="head">10Gbps TCP</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>8/(256)</td>
<td>87.64</td>
<td>93.35</td>
<td>560.37</td>
<td>591.23</td>
</tr>
<tr><td>4/(128)</td>
<td>99.83</td>
<td>101.49</td>
<td>347.19</td>
<td>379.32</td>
</tr>
</tbody>
</table>
<p>Elapsed Time in Seconds. NB the increase in time for TCP when running on more processors!</p>
</div>
<div class="section" id="future-work">
<h3>Future Work</h3>
<p>So at present I have only looked at MPI communication. The next big
thing to look at is storage, where the advantages of Ethernet need
to be assessed not only in terms of performance but also the natural
advantage the Ethernet standard has in connectivity for many
network-attached devices.</p>
</div>
<div class="section" id="why-openstack">
<h3>Why OpenStack</h3>
<p>As was mentioned above, one of the prototypical aspects of the
ALaSKA system is to model operational aspects of the Science Data
Processor element of the SKA. A good description of the SDP, its
operational scenarios, and the prototyping of its architecture can be
<a class="reference external" href="http://ska-sdp.org/sites/default/files/attachments/sdp_memo_069_p3-alaska_openstack_prototyping_part_1_-_signed.pdf">found here</a>.</p>
<p>Using Ethernet, and in particular High Performance Ethernet
("HPC Ethernet" in the parlance of Cray), holds a particular
benefit in the case of on-premise cloud, where infrastructure may need
to be isolated between multiple tenants. For IB and OPA this can be
achieved using ad-hoc methods for the respective network. For Ethernet,
however, multi-tenancy is native.</p>
<p>For many HPC scenarios, multi-tenancy is not important, nor even a
requirement. For others, it is key and mandatory, e.g. secure clouds
for clinical research. One aspect of multi-tenancy is shown in the
analysis of the results, where we use OpenStack
Monasca (a multi-tenant monitoring and logging service) and Grafana
dashboards. More information on the architecture of Monasca can be
found in a previous blog article.</p>
</div>
</div>
<div class="section" id="appendix-openstack-monasca-monitoring-o-p">
<h2>Appendix – OpenStack Monasca Monitoring O/P</h2>
<div class="section" id="hpcc">
<h3>HPCC</h3>
<p>The plot below shows CPU usage and network b/w for the runs of HPCC,
using a Grafana dashboard and OpenStack Monasca monitoring as a
service. The 4 epochs shown correspond to the IB, RoCE, 25Gbps (TCP) and
10Gbps (TCP) runs. The total CPU usage peaks at 50%, as these are
HT-enabled nodes mapped by core with 1 thread per core; thus
we are only consuming 50% of the available resources. Network
bandwidth is shown for 3 of the epochs: “Inbound ROCE Network
Traffic”, “Inbound Infiniband Network Traffic” and “Inbound Bulk
Data Network Traffic” – Bulk Data Network refers to an erstwhile
name for the ingest network for the SDP.</p>
<div class="figure">
<img alt="HPCC performance data in Monasca" src="//www.stackhpc.com/images/hpcc-monasca.png" style="width: 750px;" />
</div>
<p>For the case of CPU usage, a reduction in performance is observed
for the TCP cases. This is further evidenced by a second plot, which
shows the system CPU and reveals heavy system overhead across the 4
separate epochs.</p>
<div class="figure">
<img alt="HPCC CPU performance data in Monasca" src="//www.stackhpc.com/images/hpcc-monasca-roce.png" style="width: 750px;" />
</div>
</div>
</div>
Shanghai Open Infrastructure Summit: OpenStack Goes East2019-11-15T09:00:00+00:002019-11-15T09:00:00+00:00Stig Telfertag:www.stackhpc.com,2019-11-15:/os-shanghai.html<p class="first last">Stig Telfer attended the Open Infrastructure Summit in Shanghai
and got a taste for the project's adoption in Asia-Pacific.</p>
<p>Shanghai was anticipated to be an impressive backdrop to the latest
Open Infrastructure summit, and the mega-city (and the fabulous conference
centre) did not disappoint.</p>
<p>The dual-language summit worked out well enough. Parallel tracks in different
languages enabled every attendee to get something from the conference.
The fishbowl sessions were mostly conducted in English - occasionally
bilingual speakers would flip between languages, to foster inclusivity.</p>
<div class="figure">
<img alt="The Open Infra Shanghai Skyline" src="//www.stackhpc.com/images/openstack-shanghai-skyline.jpeg" style="width: 500px;" />
</div>
<div class="section" id="the-future-of-openstack">
<h2>The Future of OpenStack</h2>
<p>In line with recent trends, the Shanghai Summit was smaller and subject to less
attention from vendors. However, analysis of the scale of adoption of
OpenStack and related open infrastructure technologies showed that the market
continues to grow, and is predicted to carry on doing so.</p>
<div class="figure">
<img alt="OpenStack market projections 451 Research" src="//www.stackhpc.com/images/shanghai-openstack-market.jpeg" style="width: 750px;" />
</div>
<p>This confirmed StackHPC's own view, that the OpenStack market, now shorn of
much of its initial hype, is maturing alongside the project to become the
<em>de facto</em> workhorse of private cloud infrastructure.</p>
</div>
<div class="section" id="the-scientific-sig">
<h2>The Scientific SIG</h2>
<p>Unfortunately the location and scheduling did not work out well for
the Scientific SIG. Usually, the SIG events at open infrastructure
summits draw a hundred or more attendees for discussion that could
run far beyond the allotted time. This time, through budgetary
pressures, issues with clearance for travel, placement in the
developer schedule after the main summit, or various other reasons,
attendance fell well short. Nevertheless, it was a pleasure to meet and
chat with the people who made it.</p>
<div class="figure">
<img alt="Now THAT's a Train!" src="//www.stackhpc.com/images/shanghai-maglev.jpeg" style="width: 500px;" />
</div>
<p><em>Now THAT's a Train!</em></p>
<p>All in all, a great summit. StackHPC is already looking forward to
Vancouver in June 2020!</p>
<div class="section" id="get-in-touch">
<h3>Get in touch</h3>
<p>If you would like to get in touch we would love to hear
from you. Reach out to us via <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a>
or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact page</a>.</p>
</div>
</div>
Worlds Collide: Virtual Machines & Bare Metal in OpenStack2019-11-06T02:00:00+00:002019-11-06T02:00:00+00:00Mark Goddardtag:www.stackhpc.com,2019-11-06:/mixed-vm-bare-metal.html<p class="first last">Virtual machine or bare metal? Choose the infrastructure model that
fits your application.</p>
<div class="figure">
<img alt="Ironic's mascot, Pixie Boots" src="//www.stackhpc.com/images/pixie-boots.png" style="width: 350px;" />
</div>
<div class="section" id="to-virtualise-or-not-to-virtualise">
<h2>To virtualise or not to virtualise?</h2>
<p>If performance is what you need, then there's no debate - bare metal still
beats virtual machines; particularly for I/O intensive applications. However,
unless you can guarantee to keep it fully utilised, iron comes at a price.
In this article we describe how Nova can be used to provide access to both
hypervisors and bare metal compute nodes in a unified manner.</p>
</div>
<div class="section" id="scheduling">
<h2>Scheduling</h2>
<p>When support for bare metal compute via Ironic was first introduced to
Nova, it could not easily coexist with traditional hypervisor-based workloads.
Reported workarounds typically involved the use of host aggregates and flavor
properties.</p>
<p>Scheduling of bare metal is covered in detail in our
<a class="reference external" href="http://www.stackhpc.com/bespoke-bare-metal.html">bespoke bare metal</a> blog
article (see Recap: Scheduling in Nova).</p>
<p>Since the <a class="reference external" href="https://docs.openstack.org/placement/latest/">Placement</a> service
was introduced, scheduling has significantly changed for bare metal. The
standard vCPU, memory and disk resources were replaced with a single unit of a
custom resource class for each Ironic node. There are two key side-effects of
this:</p>
<ul class="simple">
<li>a bare metal node is either entirely allocated or not at all</li>
<li>the resource classes used by virtual machines and bare metal are disjoint, so
we could not end up with a VM flavor being scheduled to a bare metal node</li>
</ul>
<p>A flavor for a 'tiny' VM might look like this:</p>
<div class="highlight"><pre><span></span><span class="go">openstack flavor show vm-tiny -f json -c name -c vcpus -c ram -c disk -c properties</span>
<span class="go">{</span>
<span class="go"> "name": "vm-tiny",</span>
<span class="go"> "vcpus": 1,</span>
<span class="go"> "ram": 1024,</span>
<span class="go"> "disk": 1,</span>
<span class="go"> "properties": ""</span>
<span class="go">}</span>
</pre></div>
<p>A bare metal flavor for 'gold' nodes could look like this:</p>
<div class="highlight"><pre><span></span><span class="go">openstack flavor show bare-metal-gold -f json -c name -c vcpus -c ram -c disk -c properties</span>
<span class="go">{</span>
<span class="go"> "name": "bare-metal-gold",</span>
<span class="go"> "vcpus": 64,</span>
<span class="go"> "ram": 131072,</span>
<span class="go"> "disk": 371,</span>
<span class="go"> "properties": "resources:CUSTOM_GOLD='1',</span>
<span class="go"> resources:DISK_GB='0',</span>
<span class="go"> resources:MEMORY_MB='0',</span>
<span class="go"> resources:VCPU='0'"</span>
<span class="go">}</span>
</pre></div>
<p>Note that the vCPU/RAM/disk resources are informational only, and are zeroed
out via properties for scheduling purposes. We will discuss this further later
on.</p>
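<p>For completeness, here is a sketch of how such a flavor and the
corresponding Ironic node resource class might be set up (names as per the
example above):</p>
<div class="highlight"><pre><span></span># Tag the Ironic node with a custom resource class
openstack baremetal node set --resource-class GOLD <node UUID>

# Request one unit of the custom class and zero out standard resources
openstack flavor set bare-metal-gold \
  --property resources:CUSTOM_GOLD=1 \
  --property resources:VCPU=0 \
  --property resources:MEMORY_MB=0 \
  --property resources:DISK_GB=0
</pre></div>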
<p>With flavors in place, users choosing between VMs and bare metal is handled by
picking the correct flavor.</p>
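<p>For example, a user requesting bare metal would simply run something like
the following (image, network and key names purely illustrative):</p>
<div class="highlight"><pre><span></span>openstack server create --flavor bare-metal-gold --image CentOS7 \
  --network default --key-name mykey my-baremetal-server
</pre></div>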
</div>
<div class="section" id="what-about-networking">
<h2>What about networking?</h2>
<p>In our mixed environment, we might want our VMs and bare metal instances to be
able to communicate with each other, or we might want them to be isolated from
each other. Both models are possible, and work in the same way as a typical
cloud - Neutron networks are isolated from each other until connected via a
Neutron router.</p>
<p>Bare metal compute nodes typically use VLAN or flat networking, although with
the right combination of network hardware and Neutron plugins other models may
be possible. With VLAN networking, assuming that hypervisors are connected to
the same physical network as bare metal compute nodes, then attaching a VM to
the same VLAN as a bare metal compute instance will provide L2 connectivity
between them. Alternatively, it should be possible to use a Neutron router to
join up bare metal instances on a VLAN with VMs on another network e.g. VXLAN.</p>
<p>What does this look like in practice? We need a combination of Neutron
plugins/drivers that support both VM and bare metal networking. To connect bare
metal servers to tenant networks, it is necessary for Neutron to configure
physical network devices. We typically use the <a class="reference external" href="https://docs.openstack.org/networking-generic-switch/latest/">networking-generic-switch</a> ML2 mechanism
driver for this, although the <a class="reference external" href="https://networking-ansible.readthedocs.io/en/latest/">networking-ansible</a> driver is emerging as
a promising vendor-neutral alternative. These drivers support bare metal
ports, that is Neutron ports with a <tt class="docutils literal">VNIC_TYPE</tt> of <tt class="docutils literal">baremetal</tt>.
Vendor-specific drivers are also available, and may support both VMs and bare
metal.</p>
</div>
<div class="section" id="where-s-the-catch">
<h2>Where's the catch?</h2>
<p>One issue that more mature clouds may encounter is around the transition from
scheduling based on standard resource classes (vCPU, RAM, disk), to scheduling
based on custom resource classes. If old bare metal instances exist that were
created in the Rocky release or earlier, they may have standard resource class
inventory in Placement, in addition to their custom resource class. For
example, here is the inventory reported to Placement for such a node:</p>
<div class="highlight"><pre><span></span><span class="gp">$</span> openstack resource provider inventory list <node UUID>
<span class="go">+----------------+------------------+----------+----------+-----------+----------+--------+</span>
<span class="go">| resource_class | allocation_ratio | max_unit | reserved | step_size | min_unit | total |</span>
<span class="go">+----------------+------------------+----------+----------+-----------+----------+--------+</span>
<span class="go">| VCPU | 1.0 | 64 | 0 | 1 | 1 | 64 |</span>
<span class="go">| MEMORY_MB | 1.0 | 131072 | 0 | 1 | 1 | 131072 |</span>
<span class="go">| DISK_GB | 1.0 | 371 | 0 | 1 | 1 | 371 |</span>
<span class="go">| CUSTOM_GOLD | 1.0 | 1 | 0 | 1 | 1 | 1 |</span>
<span class="go">+----------------+------------------+----------+----------+-----------+----------+--------+</span>
</pre></div>
<p>If this node is allocated to an instance whose flavor requested (or did not
explicitly zero out) standard resource classes, we will have a usage like this:</p>
<div class="highlight"><pre><span></span><span class="gp">$</span> openstack resource provider usage show <node UUID>
<span class="go">+----------------+--------+</span>
<span class="go">| resource_class | usage |</span>
<span class="go">+----------------+--------+</span>
<span class="go">| VCPU | 64 |</span>
<span class="go">| MEMORY_MB | 131072 |</span>
<span class="go">| DISK_GB | 371 |</span>
<span class="go">| CUSTOM_GOLD | 1 |</span>
<span class="go">+----------------+--------+</span>
</pre></div>
<p>If this instance is deleted, the standard resource class inventory will become
available, and may be selected by the scheduler for a VM. This is not likely to
end well. What we must do is ensure that these resources are not reported to
Placement. This is done by default in the Stein release of Nova, and Rocky may
be configured to do the same by setting the following in <tt class="docutils literal">nova.conf</tt>:</p>
<div class="highlight"><pre><span></span><span class="k">[workarounds]</span>
<span class="na">report_ironic_standard_resource_class_inventory</span> <span class="o">=</span> <span class="s">False</span>
</pre></div>
<p>However, if we do that, then Nova will attempt to remove inventory from
Placement resource providers that is already consumed by our instance, and will
receive a HTTP 409 Conflict. This will quickly fill our logs with unhelpful
noise.</p>
<div class="section" id="flavor-migration">
<h3>Flavor migration</h3>
<p>Thankfully, there is a solution. We can modify the embedded flavor in our
existing instances to remove the standard resource class inventory, which will
result in the removal of the allocation of these resources from Placement.
This will allow Nova to remove the inventory from the resource provider. There
is a <a class="reference external" href="https://review.opendev.org/637217/">Nova patch</a> started by Matt
Riedemann which will remove our standard resource class inventory. The patch
needs pushing over the line, but works well enough to be <a class="reference external" href="https://github.com/stackhpc/nova/commit/da1b23b8ea66bb2be428d002258daf9e7e535cc1">cherry-picked</a>
to Rocky.</p>
<p>The migration can be done offline or online. We chose to do it offline, to
avoid the need to deploy this patch. For each node to be migrated:</p>
<div class="highlight"><pre><span></span><span class="go">nova-manage db ironic_flavor_migration --resource_class <node resource class> --host <host> --node <node UUID></span>
</pre></div>
<p>Alternatively, if all nodes have the same resource class:</p>
<div class="highlight"><pre><span></span><span class="go">nova-manage db ironic_flavor_migration --resource_class <node resource class> --all</span>
</pre></div>
<p>You can check the instance embedded flavors have been updated correctly via the
database:</p>
<div class="highlight"><pre><span></span><span class="go">sql> use nova</span>
<span class="go">sql> select flavor from instance_extra;</span>
</pre></div>
<p>Now (Rocky only), standard resource class inventory reporting can be disabled.
After the nova compute service has been running for a while, Placement will be
updated:</p>
<div class="highlight"><pre><span></span><span class="gp">$</span> openstack resource provider inventory list <node UUID>
<span class="go">+----------------+------------------+----------+----------+-----------+----------+-------+</span>
<span class="go">| resource_class | allocation_ratio | max_unit | reserved | step_size | min_unit | total |</span>
<span class="go">+----------------+------------------+----------+----------+-----------+----------+-------+</span>
<span class="go">| CUSTOM_GOLD | 1.0 | 1 | 0 | 1 | 1 | 1 |</span>
<span class="go">+----------------+------------------+----------+----------+-----------+----------+-------+</span>
<span class="gp">$</span> openstack resource provider usage show <node UUID>
<span class="go">+----------------+--------+</span>
<span class="go">| resource_class | usage |</span>
<span class="go">+----------------+--------+</span>
<span class="go">| CUSTOM_GOLD | 1 |</span>
<span class="go">+----------------+--------+</span>
</pre></div>
</div>
</div>
<div class="section" id="summary">
<h2>Summary</h2>
<p>We hope this shows that OpenStack is now in a place where VMs and bare metal
can coexist peacefully, and that even for those pesky pets, there is a path
forward to this brave new world. Thanks to the Nova team for working hard to
make Ironic a first class citizen.</p>
</div>
StackHPC joins the OpenStack Marketplace2019-11-02T09:00:00+00:002019-11-02T09:00:00+00:00Stig Telfertag:www.stackhpc.com,2019-11-02:/marketplace.html<p class="first last">Following OpenStack Foundation membership, StackHPC joins the OpenStack marketplace</p>
<p>In many areas, our participation in the OpenStack community is no secret.</p>
<ul class="simple">
<li>StackHPC are members of the OpenStack Foundation's <a class="reference external" href="https://www.openstack.org/bare-metal/">Baremetal program</a>, which we announced <a class="reference external" href="https://www.stackhpc.com/baremetal-program.html">here</a>.</li>
<li>Our CTO Stig Telfer is co-chair of the OpenStack <a class="reference external" href="https://wiki.openstack.org/wiki/Scientific_SIG">Scientific SIG</a>, and lead author of the OpenStack Foundation's book on <a class="reference external" href="https://www.openstack.org/science">OpenStack for Scientific Research</a>.</li>
<li>Our team includes PTLs for <a class="reference external" href="https://docs.openstack.org/kolla/latest/">Kolla</a> and <a class="reference external" href="https://docs.openstack.org/blazar/latest/">Blazar</a>, and core team members for <a class="reference external" href="https://docs.openstack.org/nova/latest/">Nova</a>, <a class="reference external" href="https://docs.openstack.org/ironic/latest/">Ironic</a>, <a class="reference external" href="https://docs.openstack.org/magnum/latest/">Magnum</a> and <a class="reference external" href="https://docs.openstack.org/monasca-api/latest/">Monasca</a>.</li>
<li>Our contributions to the code base are <a class="reference external" href="https://www.stackalytics.com/?company=stackhpc&metric=commits">visible in Stackalytics</a>.</li>
<li>We've even been in <a class="reference external" href="https://youtu.be/IapriwL4EnM?t=338">the opening keynote</a> at the Barcelona summit.</li>
</ul>
<p>One area we haven't focussed on is our commercial representation within the OpenStack Foundation. As <a class="reference external" href="https://www.stackhpc.com/pages/about.html">described here</a>, StackHPC works with clients to solve challenging problems with cloud infrastructure. Our business has been won through word of mouth.</p>
<p>Now our services can also be found in the <a class="reference external" href="https://www.openstack.org/marketplace/consulting/stackhpc/stackhpc-consulting-for-openstack">OpenStack Marketplace</a>.</p>
<p>John Taylor, StackHPC's co-founder and CEO, adds:</p>
<p><em>We are pleased to announce our OpenStack Foundation membership and inclusion
in the OpenStack Marketplace. Our success in driving the HPC and
Research Computing use-case in cloud has been in no small part
coupled to working closely with the OpenStack Foundation and the
open community it fosters. The era of hybrid cloud and the emergence
of converged AI/HPC infrastructure and coupled workflows is now
upon us, driving the need for architectures that seamlessly transition
across these resources while not compromising on performance. We
look forward to continuing our partnership with OpenStack through
the Scientific SIG and to active participation within OpenStack
projects.</em></p>
<div class="section" id="get-in-touch">
<h2>Get in touch</h2>
<p>If you would like to get in touch we would love to hear
from you. Reach out to us via <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a>
or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact page</a>.</p>
</div>
Kubeflow on Baremetal OpenStack2019-10-15T11:00:00+01:002019-10-15T11:00:00+01:00Bharat Kunwartag:www.stackhpc.com,2019-10-15:/kubeflow-baremetal-openstack.html<p class="first last">We discuss how we deployed Kubeflow on baremetal OpenStack
managed infrastructure to support cloud-native machine
learning workload use cases along with monitoring
infrastructure to give full visibility to the end user.</p>
<div class="figure">
<img alt="Kubeflow logo" src="//www.stackhpc.com/images/kubeflow.png" style="width: 200px;" />
</div>
<p><em>DISCLAIMER: No GANs were harmed in the writing of the blog.</em></p>
<p><a class="reference external" href="https://www.kubeflow.org/docs/">Kubeflow</a> is a machine learning
toolkit for Kubernetes. It aims to bring popular tools and libraries
under a single umbrella to allow users to:</p>
<ul class="simple">
<li>Spawn Jupyter notebooks with persistent volume for exploratory work.</li>
<li>Build, deploy and manage machine learning pipelines, with initial
support for the <a class="reference external" href="https://www.tensorflow.org/api_docs">TensorFlow</a>
ecosystem, since expanded to include other libraries that have
recently gained popularity in the research community, such as <a class="reference external" href="https://pytorch.org/docs/stable/index.html">PyTorch</a>.</li>
<li>Tune hyperparameters, serve models, etc.</li>
</ul>
<p>In our ongoing effort to demonstrate that OpenStack managed baremetal
infrastructure is a suitable platform for performing cutting-edge
science, we set out to deploy this popular machine learning framework on
top of an underlying <a class="reference external" href="https://kubernetes.io/docs/home/">Kubernetes</a>
container orchestration layer deployed via <a class="reference external" href="https://docs.openstack.org/magnum/latest/">OpenStack Magnum</a>. The control plane for the
baremetal OpenStack cloud consists of <a class="reference external" href="https://docs.openstack.org/kolla/latest/">Kolla</a> containers deployed using
<a class="reference external" href="https://kayobe.readthedocs.io/en/latest/">Kayobe</a>, which deploys
containerised OpenStack to baremetal and is how we manage the vast majority of
our deployments to customer sites. The justification for running baremetal
instances is to minimise the performance overhead of virtualisation.</p>
<div class="section" id="apparatus">
<h2>Apparatus</h2>
<ul class="simple">
<li>Baremetal OpenStack cloud (minimum Rocky) except for OpenStack Magnum
(which must be at least Stein 8.1.0 for various reasons detailed later
but critically in order to support Fedora Atomic 29 which addresses a
<a class="reference external" href="https://nvd.nist.gov/vuln/detail/CVE-2019-13509">CVE present in earlier Docker version</a>).</li>
<li>A few spare baremetal instances (minimum 2 for 1 master and 1 worker).</li>
</ul>
</div>
<div class="section" id="deployment-steps">
<h2>Deployment Steps</h2>
<ul class="simple">
<li>Provision a Kubernetes cluster using OpenStack Magnum. For this
step, we recommend using <a class="reference external" href="https://www.terraform.io/docs/index.html">Terraform</a> or <a class="reference external" href="https://docs.ansible.com">Ansible</a>. Since Ansible 2.8,
<tt class="docutils literal">os_coe_cluster_template</tt> and <tt class="docutils literal">os_coe_cluster</tt> modules are
available to support Magnum cluster template and cluster creation.
However, in our case, we opted for Terraform which has a nicer user
experience because it understands the interdependency between the
cluster template and the cluster and therefore automatically
determines the order in which they need to be created and updated. To
be exact, we create our cluster using a Terraform template defined
<a class="reference external" href="https://github.com/stackhpc/kubeflow-demo">in this repo</a> where the
<tt class="docutils literal">README.md</tt> has details of the how to setup Terraform, upload image
and bootstrap Ansible in order to deploy Kubeflow. The key labels we
pass to the cluster template are as follows:</li>
</ul>
<div class="highlight"><pre><span></span>cgroup_driver="cgroupfs"
ingress_controller="traefik"
tiller_enabled="true"
tiller_tag="v2.14.3"
monitoring_enabled="true"
kube_tag="v1.14.6"
cloud_provider_tag="v1.14.0"
heat_container_agent_tag="train-dev"
</pre></div>
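<p>For reference, the same cluster template and cluster could be created directly
with the Magnum CLI rather than Terraform or Ansible. A minimal sketch, in
which the image, flavor and network names are assumptions for illustration:</p>
<div class="highlight"><pre><span></span>openstack coe cluster template create kubeflow-template \
  --coe kubernetes \
  --image fedora-atomic-29 \
  --flavor baremetal-worker \
  --master-flavor baremetal-master \
  --external-network public \
  --network-driver calico \
  --server-type bm \
  --labels cgroup_driver=cgroupfs,ingress_controller=traefik,tiller_enabled=true,tiller_tag=v2.14.3,monitoring_enabled=true,kube_tag=v1.14.6,cloud_provider_tag=v1.14.0,heat_container_agent_tag=train-dev
openstack coe cluster create kubeflow \
  --cluster-template kubeflow-template \
  --master-count 1 \
  --node-count 1
</pre></div>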
<ul class="simple">
<li>Run <tt class="docutils literal">./terraform init && ./terraform apply</tt> to create the cluster.</li>
<li>Once the cluster is ready, source <tt class="docutils literal"><span class="pre">magnum-tiller.sh</span></tt> to use tiller
enabled by Magnum and run our Ansible playbook to deploy Kubeflow
along with ingress to all the services (edit <tt class="docutils literal">variables/example.yml</tt>
to suit your OpenStack environment):</li>
</ul>
<div class="highlight"><pre><span></span>ansible-playbook k8s.yml -e @variables/example.yml
</pre></div>
<ul class="simple">
<li>At this point, we should see a list of ingresses which use
<tt class="docutils literal"><span class="pre">*-minion-0</span></tt> as the ingress node by default when we run <tt class="docutils literal">kubectl get
ingress <span class="pre">-A</span></tt>. We are using a <tt class="docutils literal">nip.io</tt> based wildcard DNS service so that
traffic originating from different subdomains maps to the various services we have
deployed. For example, the Kubeflow dashboard is deployed as
<tt class="docutils literal"><span class="pre">ambassador-ingress</span></tt> and the Tensorboard dashboard is deployed as
<tt class="docutils literal"><span class="pre">tensorboard-ingress</span></tt>. Similarly, the Grafana dashboard deployed by placing
<tt class="docutils literal">monitoring_enabled=True</tt> label is deployed as <tt class="docutils literal"><span class="pre">monitoring-ingress</span></tt>. The
<tt class="docutils literal"><span class="pre">mnist-ingress</span></tt> ingress is currently functioning as a placeholder for the
next part where we train and serve a model using the Kubeflow ML pipeline.</li>
</ul>
<div class="highlight"><pre><span></span>$ kubectl get ingress -A
NAMESPACE NAME HOSTS ADDRESS PORTS AGE
kubeflow ambassador-ingress kubeflow.10.145.0.8.nip.io <span class="m">80</span> 35h
kubeflow mnist-ingress mnist.10.145.0.8.nip.io <span class="m">80</span> 35h
kubeflow tensorboard-ingress tensorboard.10.145.0.8.nip.io <span class="m">80</span> 35h
monitoring monitoring-ingress grafana.10.145.0.8.nip.io <span class="m">80</span> 35h
</pre></div>
<ul class="simple">
<li>The next step is to deploy an ML workflow to Kubeflow. We stepped through
the instructions in the <a class="reference external" href="https://github.com/kubeflow/examples/tree/master/mnist">README for MNIST on Kubeflow example</a>
ourselves and, with minimal <a class="reference external" href="https://github.com/stackhpc/kubeflow-examples/pull/1/files">customisation for use with kustomize</a>,
managed to train the model and serve it through a nice frontend. The
web interface should be reachable through the <tt class="docutils literal"><span class="pre">mnist-ingress</span></tt>
endpoint.</li>
</ul>
<div class="highlight"><pre><span></span>git clone https://github.com/stackhpc/kubeflow-examples examples -b dell
<span class="nb">cd</span> examples/mnist <span class="o">&&</span> bash deploy-kustomizations.sh
</pre></div>
</div>
<div class="section" id="notes-on-monitoring">
<h2>Notes on Monitoring</h2>
<p>Kubeflow comes with a Tensorboard service which allows users to
visualise machine learning model training logs and model architecture, and
also to assess the efficacy of the model itself by reducing the latent space of
the weights in the final layer before the model makes a classification.</p>
<p>The extensibility of the <a class="reference external" href="https://docs.openstack.org/monasca-api/latest/">OpenStack Monasca</a> service also lends itself
well to integration into machine learning model training loops, provided that
the agent is configured to accept non-local traffic on workers. This can be
done by setting the following values inside <tt class="docutils literal">/etc/monasca/agent/agent.yaml</tt>
and restarting the <tt class="docutils literal"><span class="pre">monasca-agent.target</span></tt> service:</p>
<div class="highlight"><pre><span></span><span class="na">monasca_statsd_port: 8125</span>
<span class="na">non_local_traffic: true</span>
</pre></div>
<p>On the client side where the machine learning model example is running, metrics
of interest can now be posted to the Monasca agent. For example, we can provide
a callback function to <a class="reference external" href="https://docs.fast.ai">FastAI</a>, a machine learning
wrapper library which uses PyTorch primitives underneath with an emphasis on
transfer learning (and can be launched as a <a class="reference external" href="https://github.com/brtknr/fastai-docker/blob/master/fastai-v3/Dockerfile">GPU flavored notebook container</a> on
Kubeflow) for tasks such as image and natural language processing. The training
loop of the library hooks into the callback functions encapsulated within the
<tt class="docutils literal">PostMetrics</tt> class defined below, at the end of every batch and at the end of
every epoch of the model training process:</p>
<div class="highlight"><pre><span></span><span class="c1"># Import the module.</span>
<span class="kn">from</span> <span class="nn">fastai.callbacks.loss_metrics</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">import</span> <span class="nn">monascastatsd</span> <span class="k">as</span> <span class="nn">mstatsd</span>
<span class="n">conn</span> <span class="o">=</span> <span class="n">mstatsd</span><span class="o">.</span><span class="n">Connection</span><span class="p">(</span><span class="n">host</span><span class="o">=</span><span class="s1">'openhpc-login-0'</span><span class="p">,</span> <span class="n">port</span><span class="o">=</span><span class="mi">8125</span><span class="p">)</span>
<span class="c1"># Create the client with optional dimensions</span>
<span class="n">client</span> <span class="o">=</span> <span class="n">mstatsd</span><span class="o">.</span><span class="n">Client</span><span class="p">(</span><span class="n">connection</span><span class="o">=</span><span class="n">conn</span><span class="p">,</span> <span class="n">dimensions</span><span class="o">=</span><span class="p">{</span><span class="s1">'env'</span><span class="p">:</span> <span class="s1">'fastai'</span><span class="p">})</span>
<span class="c1"># Create a gauge called fastai</span>
<span class="n">gauge</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">get_gauge</span><span class="p">(</span><span class="s1">'fastai'</span><span class="p">,</span> <span class="n">dimensions</span><span class="o">=</span><span class="p">{</span><span class="s1">'env'</span><span class="p">:</span> <span class="s1">'fastai'</span><span class="p">})</span>
<span class="k">class</span> <span class="nc">PostMetrics</span><span class="p">(</span><span class="n">LearnerCallback</span><span class="p">):</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">stop</span> <span class="o">=</span> <span class="kc">False</span>
<span class="k">def</span> <span class="nf">on_batch_end</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">last_loss</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">:</span><span class="n">Any</span><span class="p">)</span><span class="o">-></span><span class="kc">None</span><span class="p">:</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">stop</span><span class="p">:</span> <span class="k">return</span> <span class="kc">True</span> <span class="c1">#to skip validation after stopping during training</span>
<span class="c1"># Record a gauge 50% of the time.</span>
<span class="n">gauge</span><span class="o">.</span><span class="n">send</span><span class="p">(</span><span class="s1">'trn_loss'</span><span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="n">last_loss</span><span class="p">),</span> <span class="n">sample_rate</span><span class="o">=</span><span class="mf">1.0</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">on_epoch_end</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">last_loss</span><span class="p">,</span> <span class="n">epoch</span><span class="p">,</span> <span class="n">smooth_loss</span><span class="p">,</span> <span class="n">last_metrics</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">:</span><span class="n">Any</span><span class="p">):</span>
<span class="n">val_loss</span><span class="p">,</span> <span class="n">error_rate</span> <span class="o">=</span> <span class="n">last_metrics</span>
<span class="n">gauge</span><span class="o">.</span><span class="n">send</span><span class="p">(</span><span class="s1">'val_loss'</span><span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="n">val_loss</span><span class="p">),</span> <span class="n">sample_rate</span><span class="o">=</span><span class="mf">1.0</span><span class="p">)</span>
<span class="n">gauge</span><span class="o">.</span><span class="n">send</span><span class="p">(</span><span class="s1">'error_rate'</span><span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="n">error_rate</span><span class="p">),</span> <span class="n">sample_rate</span><span class="o">=</span><span class="mf">1.0</span><span class="p">)</span>
<span class="n">gauge</span><span class="o">.</span><span class="n">send</span><span class="p">(</span><span class="s1">'smooth_loss'</span><span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="n">smooth_loss</span><span class="p">),</span> <span class="n">sample_rate</span><span class="o">=</span><span class="mf">1.0</span><span class="p">)</span>
<span class="n">gauge</span><span class="o">.</span><span class="n">send</span><span class="p">(</span><span class="s1">'trn_loss'</span><span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="n">last_loss</span><span class="p">),</span> <span class="n">sample_rate</span><span class="o">=</span><span class="mf">1.0</span><span class="p">)</span>
<span class="n">gauge</span><span class="o">.</span><span class="n">send</span><span class="p">(</span><span class="s1">'epoch'</span><span class="p">,</span> <span class="nb">int</span><span class="p">(</span><span class="n">epoch</span><span class="p">),</span> <span class="n">sample_rate</span><span class="o">=</span><span class="mf">1.0</span><span class="p">)</span>
<span class="c1"># Pass PostMetrics() callback function to cnn_learner's training loop</span>
<span class="n">learn</span> <span class="o">=</span> <span class="n">cnn_learner</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">models</span><span class="o">.</span><span class="n">resnet34</span><span class="p">,</span> <span class="n">metrics</span><span class="o">=</span><span class="n">error_rate</span><span class="p">,</span> <span class="n">bn_final</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">callbacks</span><span class="o">=</span><span class="p">[</span><span class="n">PostMetrics</span><span class="p">()])</span>
</pre></div>
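<p>Assuming the Monasca CLI client is installed and credentials are loaded, one
quick way to confirm that the gauges are arriving is to query the API for the
metric name and dimensions used above:</p>
<div class="highlight"><pre><span></span>monasca metric-list --name fastai --dimensions env=fastai
monasca measurement-list fastai 2019-10-15T00:00:00Z --dimensions env=fastai
</pre></div>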
<p>These metrics are sent to the OpenStack Monasca API and can then be
visualised on a Grafana dashboard against GPU power consumption, allowing
a user to determine the tradeoff between power draw and model accuracy,
as shown in the following figure:</p>
<div class="figure">
<img alt="Kubeflow logo" src="//www.stackhpc.com/images/kubeflow-fastai-monasca-grafana.png" style="width: 100%;" />
</div>
<p>In addition, general resource usage monitoring may also be of interest.
There are two <a class="reference external" href="https://prometheus.io/docs/introduction/overview/">Prometheus</a>
based monitoring options available on Magnum:</p>
<ul class="simple">
<li>The first, non-helm-based method uses the <tt class="docutils literal">prometheus_monitoring</tt> label
which, when set to <tt class="docutils literal">True</tt>, deploys a monitoring stack consisting of a
Prometheus service, a Grafana service and a DaemonSet (Kubernetes
terminology which translates to a service per node in the cluster) of
node exporters. However, the deployed Grafana service does not
provide any useful dashboards that act as an interface with the
collected metrics, due to a change in how default dashboards are loaded
in recent versions of Grafana. A dashboard can be installed manually
but it does not allow the user to drill down into the visible metrics
further and presents the information in a flat way.</li>
<li>The second, helm-based method (recommended) requires the
<tt class="docutils literal">monitoring_enabled</tt> and <tt class="docutils literal">tiller_enabled</tt> labels to be set to
<tt class="docutils literal">True</tt>. It deploys a similar monitoring stack as above but, because
it is helm based, it is also upgradable. In this case, the Grafana
service comes preloaded with several dashboards that present the
metrics collected by the node exporters in a meaningful way, allowing
users to drill down to various levels of detail and types of
groupings, e.g. by cluster, namespace, pod, node, etc.</li>
</ul>
<p>Of course, it is also possible to deploy a Prometheus based monitoring stack
without having it managed by Magnum. Additionally, we have demonstrated that it
is also an option to deploy the Monasca agent inside a container and have it
post metrics to the Monasca API, which may already be available where Monasca
is used to monitor the control plane.</p>
</div>
<div class="section" id="why-we-recommend-upgrading-magnum-to-stein-8-1-0-release">
<h2>Why we recommend upgrading Magnum to Stein (8.1.0 release)</h2>
<ul class="simple">
<li>OpenStack Magnum (Rocky) supports up to Fedora Atomic 27, which is EOL.
Support for Fedora Atomic 29 (with the fixes for the CVE mentioned earlier)
requires a backport of various fixes from the master branch that reinstate
support for the two network plugin types supported by Magnum (namely Calico
and Flannel).</li>
<li>Additionally, there have been changes to the Kubernetes API which are outside
of the Magnum project's control. Rocky only supports versions of Kubernetes
up to v1.13.x, and the Kubernetes project maintainers only actively maintain a
development branch and 3 stable releases. The current development release is
v1.17.x, which means v1.16.x, v1.15.x and v1.14.x can expect updates and
backports of critical fixes. Support for v1.15.x and v1.16.x is coming in the
Train release, but upgrading to Stein will enable us to support up to v1.14.x.</li>
<li>The <tt class="docutils literal">traefik</tt> ingress controller deployed by magnum is no longer working in
Rocky release due to the fact the former behaviour was to always deploy the
<tt class="docutils literal">latest</tt> tag. However, a new major version (2.0.0) has been released with
breaking changes to the API which inevitably fails. Stein 8.1.0 has the
necessary fixes and additionally, also supports the more popular <tt class="docutils literal">nginx</tt>
based ingress controller.</li>
</ul>
</div>
<div class="section" id="get-in-touch">
<h2>Get in touch</h2>
<p>If you would like to get in touch we would love to hear
from you. Reach out to us via <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a>
or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact page</a>.</p>
</div>
Bespoke Bare Metal: Ironic Deploy Templates2019-09-23T12:00:00+01:002019-09-23T12:00:00+01:00Mark Goddardtag:www.stackhpc.com,2019-09-23:/bespoke-bare-metal.html<p class="first last">Flexible, dynamic deployment of bare metal using Ironic's deploy
templates API</p>
<div class="figure">
<img alt="Ironic's mascot, Pixie Boots" src="//www.stackhpc.com/images/pixie-boots.png" style="width: 350px;" />
</div>
<div class="section" id="iron-is-solid-and-inflexible-right">
<h2>Iron is solid and inflexible, right?</h2>
<p>OpenStack Ironic's Deploy Templates feature brings us closer to a world where bare metal
servers can be automatically configured for their workload.</p>
<p>In this article we discuss the <a class="reference external" href="https://www.youtube.com/watch?v=DrQcTljx_eM">Bespoke Bare Metal</a> (<a class="reference external" href="https://docs.google.com/presentation/d/1MujO5s23YxpxBfP6ul_HeCi8JLrcrMt_syDCQz5UpKQ/edit?usp=sharing">slides</a>)
presentation given at the Open Infrastructure summit in Denver in April 2019.</p>
<div class="youtube"><iframe src="https://www.youtube.com/embed/DrQcTljx_eM" width="450" height="300" allowfullscreen seamless frameBorder="0"></iframe></div></div>
<div class="section" id="bios-raid">
<h2>BIOS & RAID</h2>
<p>The most requested features driving the deploy templates work are dynamic BIOS
and RAID configuration. Let's consider the state of things prior to deploy
templates.</p>
<p>Ironic has for a long time supported a feature called <em>cleaning</em>. This is
typically used to perform actions to sanitise hardware, but can also perform
some one-off configuration tasks. There are two modes - automatic and manual.
Automatic cleaning happens when a node is deprovisioned. A typical use case for
automatic cleaning is shredding disks to remove sensitive data. Manual cleaning
happens on demand, when a node is not in use. The following diagram shows a
simplified view of the node states related to cleaning.</p>
<div class="figure">
<img alt="Ironic cleaning states (simplified)" src="//www.stackhpc.com/images/ironic-cleaning-states.png" style="width: 700px;" />
</div>
<p>Cleaning works by executing a list of <em>clean steps</em>, which map to methods
exposed by the Ironic driver in use. Each clean step has the following fields:</p>
<ul class="simple">
<li><tt class="docutils literal">interface</tt>: One of <tt class="docutils literal">deploy</tt>, <tt class="docutils literal">power</tt>, <tt class="docutils literal">management</tt>, <tt class="docutils literal">bios</tt>,
<tt class="docutils literal">raid</tt></li>
<li><tt class="docutils literal">step</tt>: Method (function) name on the driver interface</li>
<li><tt class="docutils literal">args</tt>: Dictionary of keyword arguments</li>
<li><tt class="docutils literal">priority</tt>: Order of execution (higher runs earlier)</li>
</ul>
<div class="section" id="bios">
<h3>BIOS</h3>
<p><a class="reference external" href="https://docs.openstack.org/ironic/latest/admin/bios.html">BIOS configuration</a> support was added
in the Rocky cycle. The <tt class="docutils literal">bios</tt> driver interface provides two clean steps:</p>
<ul class="simple">
<li><tt class="docutils literal">apply_configuration</tt>: apply BIOS configuration</li>
<li><tt class="docutils literal">factory_reset</tt>: reset BIOS configuration to factory defaults</li>
</ul>
<p>Here is an example of a clean step that uses the BIOS driver interface to
disable HyperThreading:</p>
<div class="highlight"><pre><span></span><span class="p">{</span>
<span class="nt">"interface"</span><span class="p">:</span> <span class="s2">"bios"</span><span class="p">,</span>
<span class="nt">"step"</span><span class="p">:</span> <span class="s2">"apply_configuration"</span><span class="p">,</span>
<span class="nt">"args"</span><span class="p">:</span> <span class="p">{</span>
<span class="nt">"settings"</span><span class="p">:</span> <span class="p">[</span>
<span class="p">{</span>
<span class="nt">"name"</span><span class="p">:</span> <span class="s2">"LogicalProc"</span><span class="p">,</span>
<span class="nt">"value"</span><span class="p">:</span> <span class="s2">"Disabled"</span>
<span class="p">}</span>
<span class="p">]</span>
<span class="p">}</span>
<span class="p">}</span>
</pre></div>
</div>
<div class="section" id="raid">
<h3>RAID</h3>
<p>Support for <a class="reference external" href="https://docs.openstack.org/ironic/latest/admin/raid.html">RAID configuration</a> was added in the
Mitaka cycle. The <tt class="docutils literal">raid</tt> driver interface provides two clean steps:</p>
<ul class="simple">
<li><tt class="docutils literal">create_configuration</tt>: create RAID configuration</li>
<li><tt class="docutils literal">delete_configuration</tt>: delete all RAID virtual disks</li>
</ul>
<p>The target RAID configuration must be set in a separate API call prior to
cleaning.</p>
<div class="highlight"><pre><span></span><span class="p">{</span>
<span class="nt">"interface"</span><span class="p">:</span> <span class="s2">"raid"</span><span class="p">,</span>
<span class="nt">"step"</span><span class="p">:</span> <span class="s2">"create_configuration"</span><span class="p">,</span>
<span class="nt">"args"</span><span class="p">:</span> <span class="p">{</span>
<span class="nt">"create_root_volume"</span><span class="p">:</span> <span class="kc">true</span><span class="p">,</span>
<span class="nt">"create_nonroot_volumes"</span><span class="p">:</span> <span class="kc">true</span>
<span class="p">}</span>
<span class="p">}</span>
</pre></div>
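<p>For illustration, that separate call might look like the following with the
bare metal CLI, here targeting a 42GB RAID1 root volume (the node name and
sizing are assumptions):</p>
<div class="highlight"><pre><span></span>cat << EOF > target-raid-config.json
{
  "logical_disks": [
    {"size_gb": 42, "raid_level": "1", "is_root_volume": true}
  ]
}
EOF
openstack baremetal node set gold-node-1 --target-raid-config target-raid-config.json
</pre></div>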
<p>Of course, support for BIOS and RAID configuration is hardware-dependent.</p>
</div>
<div class="section" id="limitations">
<h3>Limitations</h3>
<p>While BIOS and RAID configuration triggered through cleaning can be useful, it
has a number of limitations. The configuration is not integrated into
Ironic node deployment, so users cannot select a configuration on demand.
Cleaning is not available to Nova users, so it is accessible only to
administrators. Finally, the requirement for a separate API call to set the
target RAID configuration is quite clunky, and prevents the configuration of
RAID in automated cleaning.</p>
<p>With these limitations in mind, let's consider the goals for bespoke bare
metal.</p>
</div>
</div>
<div class="section" id="goals">
<h2>Goals</h2>
<p>We want to allow a pool of hardware to be applied to various tasks, with an
optimal server configuration used for each task. Some examples:</p>
<ul class="simple">
<li>A Hadoop node with Just a Bunch of Disks (JBOD)</li>
<li>A database server with mirrored & striped disks (RAID 10)</li>
<li>A High Performance Computing (HPC) compute node, with tuned BIOS parameters</li>
</ul>
<p>In order to avoid partitioning our hardware, we want to be able to dynamically
configure these things when a bare metal instance is deployed.</p>
<p>We also want to make it <em>cloudy</em>. It should not require administrator
privileges, and should be abstracted from hardware specifics. The operator
should be able to control what can be configured and who can configure it.
We'd also like to use existing interfaces and concepts where possible.</p>
</div>
<div class="section" id="recap-scheduling-in-nova">
<h2>Recap: Scheduling in Nova</h2>
<p>Understanding the mechanics of deploy templates requires a reasonable knowledge
of how scheduling works in Nova with Ironic. The <a class="reference external" href="https://docs.openstack.org/placement/latest/">Placement service</a> was added to Nova in the
Newton cycle, and extracted into a separate project in Stein. It provides an
API for tracking resource inventory & consumption, with support for both
quantitative and qualitative aspects.</p>
<p>Let's start by introducing the key concepts in Placement.</p>
<ul class="simple">
<li>A <strong>Resource Provider</strong> provides an <strong>Inventory</strong> of resources of different
<strong>Resource Classes</strong></li>
<li>A <strong>Resource Provider</strong> may be tagged with one or more <strong>Traits</strong></li>
<li>A <strong>Consumer</strong> may have an <strong>Allocation</strong> that consumes some of a <strong>Resource
Provider</strong>’s <strong>Inventory</strong></li>
</ul>
<div class="section" id="scheduling-virtual-machines">
<h3>Scheduling Virtual Machines</h3>
<p>In the case of Virtual Machines, these concepts map as follows:</p>
<ul class="simple">
<li>A <strong>Compute Node</strong> provides an <strong>Inventory</strong> of <strong>vCPU</strong>, <strong>Disk</strong> &
<strong>Memory</strong> resources</li>
<li>A <strong>Compute Node</strong> may be tagged with one or more <strong>Traits</strong></li>
<li>An <strong>Instance</strong> may have an <strong>Allocation</strong> that consumes some of a
<strong>Compute Node</strong>’s <strong>Inventory</strong></li>
</ul>
<p>A hypervisor with 35GB disk, 5825MB RAM and 4 CPUs might have a resource
provider inventory record in Placement accessed via <tt class="docutils literal">GET
<span class="pre">/resource_providers/{uuid}/inventories</span></tt> that looks like this:</p>
<div class="highlight"><pre><span></span><span class="p">{</span>
<span class="nt">"inventories"</span><span class="p">:</span> <span class="p">{</span>
<span class="nt">"DISK_GB"</span><span class="p">:</span> <span class="p">{</span>
<span class="nt">"allocation_ratio"</span><span class="p">:</span> <span class="mf">1.0</span><span class="p">,</span> <span class="nt">"max_unit"</span><span class="p">:</span> <span class="mi">35</span><span class="p">,</span> <span class="nt">"min_unit"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
<span class="nt">"reserved"</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span> <span class="nt">"step_size"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="nt">"total"</span><span class="p">:</span> <span class="mi">35</span>
<span class="p">},</span>
<span class="nt">"MEMORY_MB"</span><span class="p">:</span> <span class="p">{</span>
<span class="nt">"allocation_ratio"</span><span class="p">:</span> <span class="mf">1.5</span><span class="p">,</span> <span class="nt">"max_unit"</span><span class="p">:</span> <span class="mi">5825</span><span class="p">,</span> <span class="nt">"min_unit"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
<span class="nt">"reserved"</span><span class="p">:</span> <span class="mi">512</span><span class="p">,</span> <span class="nt">"step_size"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="nt">"total"</span><span class="p">:</span> <span class="mi">5825</span>
<span class="p">},</span>
<span class="nt">"VCPU"</span><span class="p">:</span> <span class="p">{</span>
<span class="nt">"allocation_ratio"</span><span class="p">:</span> <span class="mf">16.0</span><span class="p">,</span> <span class="nt">"max_unit"</span><span class="p">:</span> <span class="mi">4</span><span class="p">,</span> <span class="nt">"min_unit"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
<span class="nt">"reserved"</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span> <span class="nt">"step_size"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="nt">"total"</span><span class="p">:</span> <span class="mi">4</span>
<span class="p">}</span>
<span class="p">},</span>
<span class="nt">"resource_provider_generation"</span><span class="p">:</span> <span class="mi">7</span>
<span class="p">}</span>
</pre></div>
<p>Note that the inventory tracks all of a hypervisor's resources, whether they
are consumed or not. Allocations track what has been consumed by instances.</p>
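<p>The consumed side can be inspected per instance. A sketch using the
osc-placement plugin, with UUIDs elided and the output shape illustrative:</p>
<div class="highlight"><pre><span></span>$ openstack resource provider allocation show <instance UUID>
+-------------------+------------+-----------------------------------------------+
| resource_provider | generation | resources                                     |
+-------------------+------------+-----------------------------------------------+
| <hypervisor UUID> |          8 | {'VCPU': 1, 'MEMORY_MB': 2048, 'DISK_GB': 10} |
+-------------------+------------+-----------------------------------------------+
</pre></div>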
</div>
<div class="section" id="scheduling-bare-metal">
<h3>Scheduling Bare Metal</h3>
<p>The scheduling described above for VMs does not apply cleanly to bare metal.
Bare metal nodes are indivisible units, and cannot be shared by multiple
instances or overcommitted. They're either in use or not. To resolve this
issue, we use Placement slightly differently with Nova and Ironic.</p>
<ul class="simple">
<li>A <strong>Bare Metal Node</strong> provides an <strong>Inventory</strong> of
<strong>one unit of a custom resource</strong></li>
<li>A <strong>Bare Metal Node</strong> may be tagged with one or more <strong>Traits</strong></li>
<li>An <strong>Instance</strong> may have an <strong>Allocation</strong> that consumes all of a
<strong>Bare Metal Node</strong>’s <strong>Inventory</strong></li>
</ul>
<p>If we now look at the resource provider inventory record for a bare metal node,
it might look like this:</p>
<div class="highlight"><pre><span></span><span class="p">{</span>
<span class="nt">"inventories"</span><span class="p">:</span> <span class="p">{</span>
<span class="nt">"CUSTOM_GOLD"</span><span class="p">:</span> <span class="p">{</span>
<span class="nt">"allocation_ratio"</span><span class="p">:</span> <span class="mf">1.0</span><span class="p">,</span>
<span class="nt">"max_unit"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
<span class="nt">"min_unit"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
<span class="nt">"reserved"</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
<span class="nt">"step_size"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
<span class="nt">"total"</span><span class="p">:</span> <span class="mi">1</span>
<span class="p">}</span>
<span class="p">},</span>
<span class="nt">"resource_provider_generation"</span><span class="p">:</span> <span class="mi">1</span>
<span class="p">}</span>
</pre></div>
<p>We have just one unit of one resource class, in this case <tt class="docutils literal">CUSTOM_GOLD</tt>. The
resource class comes from the <tt class="docutils literal">resource_class</tt> field of the node in Ironic,
upper-cased, and with a prefix of <tt class="docutils literal">CUSTOM_</tt> to denote that it is a custom
resource class as opposed to a standard one like <tt class="docutils literal">VCPU</tt>.</p>
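<p>The node-side half of this mapping is a single field on the Ironic node; for
example (the node name is an assumption):</p>
<div class="highlight"><pre><span></span>$ openstack baremetal node set gold-node-1 --resource-class GOLD
</pre></div>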
<p>What sort of Nova flavor would be required to schedule to this node?</p>
<div class="highlight"><pre><span></span><span class="go">openstack flavor show bare-metal-gold -f json \</span>
<span class="go"> -c name -c ram -c properties -c vcpus -c disk</span>
<span class="go">{</span>
<span class="go"> "name": "bare-metal-gold",</span>
<span class="go"> "vcpus": 4,</span>
<span class="go"> "ram": 4096,</span>
<span class="go"> "disk": 1024,</span>
<span class="go"> "properties": "resources:CUSTOM_GOLD='1',</span>
<span class="go"> resources:DISK_GB='0',</span>
<span class="go"> resources:MEMORY_MB='0',</span>
<span class="go"> resources:VCPU='0'"</span>
<span class="go">}</span>
</pre></div>
<p>Note that the standard fields (<tt class="docutils literal">vcpus</tt> etc.) may be specified for
informational purposes, but should be zeroed out using properties as shown.</p>
</div>
<div class="section" id="traits">
<h3>Traits</h3>
<p>So far we have covered scheduling based on quantitative resources. Placement
uses <em>traits</em> to model qualitative resources. These are associated with
resource providers. For example, we might query <tt class="docutils literal">GET
<span class="pre">/resource_providers/{uuid}/traits</span></tt> for a resource provider that has an FPGA to
find some information about the class of the FPGA device.</p>
<div class="highlight"><pre><span></span><span class="p">{</span>
<span class="nt">"resource_provider_generation"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
<span class="nt">"traits"</span><span class="p">:</span> <span class="p">[</span>
<span class="s2">"CUSTOM_HW_FPGA_CLASS1"</span><span class="p">,</span>
<span class="s2">"CUSTOM_HW_FPGA_CLASS3"</span>
<span class="p">]</span>
<span class="p">}</span>
</pre></div>
<p>Ironic nodes can have traits assigned to them, in addition to their resource
class: <tt class="docutils literal">GET <span class="pre">/nodes/{uuid}?fields=name,resource_class,traits</span></tt>:</p>
<div class="highlight"><pre><span></span><span class="p">{</span>
<span class="nt">"Name"</span><span class="p">:</span> <span class="s2">"gold-node-1"</span><span class="p">,</span>
<span class="nt">"Resource Class"</span><span class="p">:</span> <span class="s2">"GOLD"</span><span class="p">,</span>
<span class="nt">"Traits"</span><span class="p">:</span> <span class="p">[</span>
<span class="s2">"CUSTOM_RAID0"</span><span class="p">,</span>
<span class="s2">"CUSTOM_RAID1"</span><span class="p">,</span>
<span class="p">]</span>
<span class="p">}</span>
</pre></div>
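<p>Traits like these can be assigned with the bare metal CLI; a brief sketch,
reusing the node name assumed above:</p>
<div class="highlight"><pre><span></span>$ openstack baremetal node add trait gold-node-1 CUSTOM_RAID0
$ openstack baremetal node add trait gold-node-1 CUSTOM_RAID1
</pre></div>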
<p>Similarly to quantitative scheduling, traits may be specified via a flavor when
creating an instance.</p>
<div class="highlight"><pre><span></span><span class="go">openstack flavor show bare-metal-gold -f json -c name -c properties</span>
<span class="go">{</span>
<span class="go"> "name": "bare-metal-gold",</span>
<span class="go"> "properties": "resources:CUSTOM_GOLD='1',</span>
<span class="go"> resources:DISK_GB='0',</span>
<span class="go"> resources:MEMORY_MB='0',</span>
<span class="go"> resources:VCPU='0',</span>
<span class="go"> trait:CUSTOM_RAID0='required'"</span>
<span class="go">}</span>
</pre></div>
<p>This flavor will select bare metal nodes with a <tt class="docutils literal">resource_class</tt> of
<tt class="docutils literal">CUSTOM_GOLD</tt>, and a list of traits including <tt class="docutils literal">CUSTOM_RAID0</tt>.</p>
<p>To allow Ironic to take action based upon the requested traits, the list of
required traits is stored in the Ironic node object under the
<tt class="docutils literal">instance_info</tt> field.</p>
</div>
</div>
<div class="section" id="ironic-deploy-steps">
<h2>Ironic deploy steps</h2>
<p>The <a class="reference external" href="https://docs.openstack.org/ironic/latest/admin/node-deployment.html#node-deployment-deploy-steps">Ironic deploy steps framework</a>
was added in the Rocky cycle as a first step towards making the deployment
process more flexible. It is based on the clean step model described earlier,
and allows drivers to define steps available to be executed during deployment.
Here is the simplified state diagram we saw earlier, this time highlighting the
deploying state in which deploy steps are executed.</p>
<div class="figure">
<img alt="Ironic deployment states (simplified)" src="//www.stackhpc.com/images/ironic-deploy-states.png" style="width: 700px;" />
</div>
<p>Each deploy step has:</p>
<ul class="simple">
<li><tt class="docutils literal">interface</tt>: One of <tt class="docutils literal">deploy</tt>, <tt class="docutils literal">power</tt>, <tt class="docutils literal">management</tt>, <tt class="docutils literal">bios</tt>,
<tt class="docutils literal">raid</tt></li>
<li><tt class="docutils literal">step</tt>: Method (function) name on the driver interface</li>
<li><tt class="docutils literal">args</tt>: Dictionary of keyword arguments</li>
<li><tt class="docutils literal">priority</tt>: Order of execution (higher runs earlier)</li>
</ul>
<p>Notice that this is the same as for clean steps.</p>
<div class="section" id="the-mega-step">
<h3>The mega step</h3>
<p>In the Rocky cycle, the majority of the deployment process was moved to a
single step called <tt class="docutils literal">deploy</tt> on the <tt class="docutils literal">deploy</tt> interface with a priority of
100. This step roughly does the following:</p>
<ul class="simple">
<li>power on the node to boot up the agent</li>
<li>wait for the agent to boot</li>
<li>write the image to disk</li>
<li>power off</li>
<li>unplug from provisioning networks</li>
<li>plug tenant networks</li>
<li>set boot mode</li>
<li>power on</li>
</ul>
<p>Drivers can currently add steps before or after this step. The plan is to split
this into multiple core steps for more granular control over the deployment
process.</p>
</div>
<div class="section" id="id1">
<h3>Limitations</h3>
<p>Deploy steps are static for a given set of driver interfaces, and are currently
all out of band - it is not possible to execute steps on the deployment agent.
Finally, the mega step limits ordering of the steps.</p>
</div>
</div>
<div class="section" id="ironic-deploy-templates">
<h2>Ironic deploy templates</h2>
<p>The Ironic <a class="reference external" href="https://docs.openstack.org/ironic/latest/admin/node-deployment.html#deploy-templates">deploy templates API</a>
was added in the Stein cycle and allows deployment templates to be registered
which have:</p>
<ul class="simple">
<li>a name, which must be a valid trait</li>
<li>a list of deployment steps</li>
</ul>
<p>For example, a deploy template could be registered via <tt class="docutils literal">POST
/v1/deploy_templates</tt>:</p>
<div class="highlight"><pre><span></span><span class="p">{</span>
<span class="nt">"name"</span><span class="p">:</span> <span class="s2">"CUSTOM_HYPERTHREADING_ON"</span><span class="p">,</span>
<span class="nt">"steps"</span><span class="p">:</span> <span class="p">[</span>
<span class="p">{</span>
<span class="nt">"interface"</span><span class="p">:</span> <span class="s2">"bios"</span><span class="p">,</span>
<span class="nt">"step"</span><span class="p">:</span> <span class="s2">"apply_configuration"</span><span class="p">,</span>
<span class="nt">"args"</span><span class="p">:</span> <span class="p">{</span>
<span class="nt">"settings"</span><span class="p">:</span> <span class="p">[</span>
<span class="p">{</span>
<span class="nt">"name"</span><span class="p">:</span> <span class="s2">"LogicalProc"</span><span class="p">,</span>
<span class="nt">"value"</span><span class="p">:</span> <span class="s2">"Enabled"</span>
<span class="p">}</span>
<span class="p">]</span>
<span class="p">},</span>
<span class="nt">"priority"</span><span class="p">:</span> <span class="mi">150</span>
<span class="p">}</span>
<span class="p">]</span>
<span class="p">}</span>
</pre></div>
<p>This template has a name of <tt class="docutils literal">CUSTOM_HYPERTHREADING_ON</tt> (which is also a valid
trait name), and references a deploy step on the <tt class="docutils literal">bios</tt> interface that sets
the <tt class="docutils literal">LogicalProc</tt> BIOS setting to <tt class="docutils literal">Enabled</tt> in order to enable
Hyperthreading on a node.</p>
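<p>Using the bare metal CLI (shown again in the end to end flow below), the same
template could be registered as follows; the file name here is arbitrary:</p>
<div class="highlight"><pre><span></span>cat << EOF > ht-on-steps.json
[
  {
    "interface": "bios",
    "step": "apply_configuration",
    "args": {"settings": [{"name": "LogicalProc", "value": "Enabled"}]},
    "priority": 150
  }
]
EOF
openstack baremetal deploy template create CUSTOM_HYPERTHREADING_ON --steps ht-on-steps.json
</pre></div>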
</div>
<div class="section" id="tomorrows-raid">
<h2>Tomorrow’s RAID</h2>
<p>In the Stein release we have the deploy templates and steps frameworks, but
lack drivers with deploy step implementations to make this useful. As part of
the demo for the Bespoke Bare Metal talk, we built and demoed a proof of
concept deploy step for configuring RAID during deployment on Dell machines.
This code has been polished and is working its way upstream at the time of
writing, and has also influenced deploy steps for the HP iLO driver. Thanks to
Shivanand Tendulker for extracting and polishing some of the code from the PoC.</p>
<p>We now have an <tt class="docutils literal">apply_configuration</tt> deploy step available on the RAID
interface which accepts RAID configuration as an argument, to avoid the
separate API call required in cleaning.</p>
<p>The first pass at implementing this in the iDRAC driver took over 30 minutes to
complete deployment. This was streamlined to just over 10 minutes by combining
deletion and creation of virtual disks into a single deploy step, and avoiding
an unnecessary reboot.</p>
<div class="section" id="end-to-end-flow">
<h3>End to end flow</h3>
<p>Now we know what a deploy template looks like, how are they used?</p>
<p>First of all, the cloud operator creates deploy templates via the Ironic API
to execute deploy steps for allowed actions. In this example, we have a
deploy template used to create a 42GB RAID1 virtual disk.</p>
<div class="highlight"><pre><span></span><span class="go">cat << EOF > raid1-steps.json</span>
<span class="go">[</span>
<span class="go"> {</span>
<span class="go"> "interface": "raid",</span>
<span class="go"> "step": "apply_configuration",</span>
<span class="go"> "args": {</span>
<span class="go"> "raid_config": {</span>
<span class="go"> "logical_disks": [</span>
<span class="go"> {</span>
<span class="go"> "raid_level": "1",</span>
<span class="go"> "size_gb": 42,</span>
<span class="go"> "is_root_volume": true</span>
<span class="go"> }</span>
<span class="go"> ]</span>
<span class="go"> }</span>
<span class="go"> },</span>
<span class="go"> "priority": 150</span>
<span class="go"> }</span>
<span class="go">]</span>
<span class="go">EOF</span>
<span class="go">openstack baremetal deploy template create \</span>
<span class="go"> CUSTOM_RAID1 \</span>
<span class="go"> --steps raid1-steps.json</span>
</pre></div>
<p>Next, the operator creates Nova flavors or Glance images with required traits
that reference the names of deploy templates.</p>
<div class="highlight"><pre><span></span><span class="go">openstack flavor create raid1 \</span>
<span class="go"> --property resources:VCPU=0 \</span>
<span class="go"> --property resources:MEMORY_MB=0 \</span>
<span class="go"> --property resources:DISK_GB=0 \</span>
<span class="go"> --property resources:CUSTOM_COMPUTE=1 \</span>
<span class="go"> --property trait:CUSTOM_RAID1=required</span>
</pre></div>
<p>Finally, a user creates a bare metal instance using one of these flavors
that is accessible to them.</p>
<div class="highlight"><pre><span></span><span class="go">openstack server create \</span>
<span class="go"> --name test \</span>
<span class="go"> --flavor raid1 \</span>
<span class="go"> --image centos7 \</span>
<span class="go"> --network mynet \</span>
<span class="go"> --key-name mykey</span>
</pre></div>
<p>What happens? A bare metal node is scheduled by Nova which has all of the
required traits from the flavor and/or image. Those traits are then used by
Ironic to find deploy templates with matching names, and the deploy steps from
those templates are executed in addition to the core step, in an order
determined by their priorities. In this case, the RAID <tt class="docutils literal">apply_configuration</tt>
deploy step runs before the core step because it has a higher priority.</p>
</div>
</div>
<div class="section" id="future-challenges">
<h2>Future Challenges</h2>
<p>There is still work to be done to improve the flexibility of bare metal
deployment. We need to split out the mega step. We need to support executing
steps in the agent running on the node, which would enable deployment-time use
of the <a class="reference external" href="https://techblog.web.cern.ch/techblog/post/ironic_software_raid/">software RAID support</a> recently
developed by Arne Wiebalck from CERN.</p>
<p>Drivers need to expose more deploy steps for BIOS, RAID and other functions. We
should agree on how to handle executing a step multiple times, and all the
tricky corner cases involved.</p>
<p>We have discussed the Nova use case here, but we could also make use of deploy
steps in standalone mode, by passing a list of steps to execute to the Ironic
<tt class="docutils literal">provision</tt> API call, similar to manual cleaning. There is also a <a class="reference external" href="https://review.opendev.org/672252">spec</a> proposed by Madhuri Kumari which would
allow reconfiguring active nodes to do things like tweak BIOS settings without
requiring redeployment.</p>
<p>Thanks to everyone who has been involved in designing, developing and reviewing
the series of features in Nova and Ironic that got us this far. In particular
John Garbutt who proposed the specs for <a class="reference external" href="https://specs.openstack.org/openstack/ironic-specs/specs/approved/deployment-steps-framework.html">deploy steps</a>
and <a class="reference external" href="https://specs.openstack.org/openstack/ironic-specs/specs/approved/deploy-templates.html">deploy templates</a>,
and Ruby Loo who implemented the deploy steps framework.</p>
</div>
StackHPC at the CERN Ceph Day 20192019-09-20T09:00:00+01:002019-09-20T09:00:00+01:00Stig Telfertag:www.stackhpc.com,2019-09-20:/cern-ceph.html<p class="first last">Stig Telfer and John Garbutt presented recent work involving interesting ways of managing data in research computing.</p>
<p>It is always exciting to visit CERN, even more so for <a class="reference external" href="https://indico.cern.ch/event/765214/">CERN Ceph
day for Research and Non-profits</a>,
and the opportunity to present some recent work on Ceph and research
data storage was too good to miss.</p>
<p>Stig Telfer, Michal Nasiadka and John Garbutt attended the one-day
event, and Stig and John presented a double-bill presentation
<em>Ad-hoc Filesystems for Dynamic Science Workloads</em>, covering some of
our recent work:</p>
<ul class="simple">
<li>Converged CephFS storage for the <a class="reference external" href="https://www.euclid-ec.org/">Euclid space
telescope</a> as part of the <a class="reference external" href="https://www.iris.ac.uk/">IRIS compute
federation</a>.</li>
<li>The <a class="reference external" href="https://rse-cambridge.github.io/data-acc/">Data Accelerator</a>
project at Cambridge University, which creates dynamic filesystems
as burst buffers and recently took the top slot in the <a class="reference external" href="https://www.vi4io.org/io500/list/19-06/start">IO-500 list</a> of the world's fastest storage systems.</li>
</ul>
<p>John and Stig's <a class="reference external" href="https://indico.cern.ch/event/765214/contributions/3517141/attachments/1908894/3153554/2019-09-17-TelferGarbutt-Ad-hoc-Filesystems.pdf">presentation is available here</a> and a <a class="reference external" href="https://cds.cern.ch/record/2691828">video is available here</a>.</p>
<div class="figure">
<img alt="Stig and John presenting at CERN" src="//www.stackhpc.com/images/cern-ceph-stig-johng.jpeg" style="width: 640px;" />
</div>
<div class="section" id="get-in-touch">
<h2>Get in touch</h2>
<p>If you would like to get in touch we would love to hear
from you. Reach out to us via <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a>
or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact page</a>.</p>
</div>
Migrating a running OpenStack to containerisation with Kolla2019-09-17T15:00:00+01:002019-09-17T15:00:00+01:00Pierre Riteautag:www.stackhpc.com,2019-09-17:/migrating-to-kolla.html<p class="first last">We describe how we migrated a running OpenStack to a containerised
solution with Kolla and Kayobe.</p>
<p>Deploying OpenStack infrastructures with containers brings many operational
benefits, such as isolation of dependencies and repeatability of deployment, in
particular when coupled with a CI/CD approach. The <a class="reference external" href="https://docs.openstack.org/kolla/latest/">Kolla project</a> provides tooling that helps deploy
and operate containerised OpenStack deployments. Configuring a new OpenStack
cloud with Kolla containers is well documented and can benefit from the sane
defaults provided by the highly opinionated <a class="reference external" href="https://docs.openstack.org/kolla-ansible/latest/">Kolla Ansible subproject</a>. However, migrating
existing OpenStack deployments to Kolla containers can require a more ad hoc
approach, particularly to minimise impact on end users.</p>
<p>We recently helped an organisation migrate an existing OpenStack Queens
production deployment to a containerised solution using Kolla and <a class="reference external" href="https://docs.openstack.org/kayobe/latest/">Kayobe</a>, a subproject designed to
simplify the provisioning and configuration of bare-metal nodes. This blog post
describes the migration strategy we adopted in order to reduce impact on end
users and shares what we learned in the process.</p>
<div class="section" id="existing-openstack-deployment">
<h2>Existing OpenStack deployment</h2>
<p>The existing cloud was running the <a class="reference external" href="https://www.openstack.org/software/queens/">OpenStack Queens release</a> deployed using <a class="reference external" href="https://docs.openstack.org/install-guide/environment-packages-rdo.html">CentOS RPM
packages</a>.
This cloud was managed by a control plane of 16 nodes, with each service
deployed over two (for OpenStack services) or three (for Galera and RabbitMQ)
servers for high availability. Around 40 hypervisor nodes from different
generations of hardware were available, resulting in a heterogeneous mix of CPU
models, amounts of RAM, and even network interface names (with some nodes using
onboard Ethernet interfaces and others using PCI cards).</p>
<p>A separate Ceph cluster was used as a backend for all OpenStack services
requiring large amounts of storage: Glance, Cinder, Gnocchi, and also disks of
Nova instances (i.e. none of the user data was stored on hypervisors).</p>
</div>
<div class="section" id="a-new-infrastructure">
<h2>A new infrastructure</h2>
<p>With a purchase of new control plane hardware also being planned, we advised
the following configuration, based on our experience and <a class="reference external" href="https://docs.openstack.org/kolla-ansible/latest/admin/production-architecture-guide.html">recommendations from
Kolla Ansible</a>:</p>
<ul class="simple">
<li>three controller nodes hosting control services like APIs and databases,
using an odd number for quorum</li>
<li>two network nodes hosting Neutron agents along with HAProxy / Keepalived</li>
<li>three monitoring nodes providing centralized logging, metrics collection and
alerting, a feature which was critically lacking from the existing deployment</li>
</ul>
<p>Our goal was to migrate the entire OpenStack deployment to use Kolla containers
and be managed by Kolla Ansible and Kayobe, with control services running on
the new control plane hardware and hypervisors reprovisioned and reconfigured,
with little impact on users and their workflows.</p>
</div>
<div class="section" id="migration-strategy">
<h2>Migration strategy</h2>
<p>Using a small-scale candidate environment, we developed our migration strategy.
The administrators of the infrastructure would install CentOS 7 on the new
control plane, using their existing provisioning system, Foreman. We would
<a class="reference external" href="https://docs.openstack.org/kayobe/latest/configuration/hosts.html">configure the host OS</a> of the
new nodes with Kayobe to make them ready to deploy Kolla containers: configure
multiple VLAN interfaces and networks, create LVM volumes, install Docker, etc.</p>
<p>We would then deploy OpenStack services on this control plane. To reduce the
risk of the migration, our strategy was to progressively reconfigure the load
balancers to point to the new controllers for each OpenStack service while
validating that they were not causing errors. If any issue arose, we would be
able to quickly revert to the API services running on the original control
plane. Fresh Galera, Memcached, and RabbitMQ clusters would also be set up on
the new controllers, although the existing ones would remain in use by the
OpenStack services for now. We would then gradually shut down the original
services after making sure that all resources were managed by the new OpenStack
services.</p>
<p>Then, during a scheduled downtime, we would copy the content of the SQL
database, reconfigure all services (on the control plane and also on
hypervisors) to use the new Galera, Memcached, and RabbitMQ clusters, and move
the virtual IP of the load balancer over to the new network nodes, where
HAProxy and Keepalived would be deployed.</p>
<p>The animation below depicts the process of migrating from the original to the
new control plane, with only a subset of the services displayed for clarity.</p>
<div class="figure">
<img alt="Migration from the original to the new control plane" src="//www.stackhpc.com/images/kolla_control_plane_migration.gif" style="width: 750px;" />
</div>
<p>Finally, we would use live migration to free up several hypervisors, redeploy
OpenStack services on them after reprovisioning, and live migrate virtual
machines back on them. The animation below shows the transition of hypervisors
to Kolla:</p>
<div class="figure">
<img alt="Migration of hypervisors to Kolla" src="//www.stackhpc.com/images/migrating_hypervisors_to_kolla.gif" style="width: 750px;" />
</div>
</div>
<div class="section" id="tips-tricks">
<h2>Tips & Tricks</h2>
<p>Having described the overall migration strategy, we will now cover tasks that
required special care and provide tips for operators who would like to follow
the same approach.</p>
<div class="section" id="translating-the-configuration">
<h3>Translating the configuration</h3>
<p>In order to make the migration seamless, we wanted to keep the configuration of
services deployed on the new control plane as close as possible to the original
configuration. In some cases, this meant moving away from Kolla Ansible's sane
defaults and making use of its extensive customisation capabilities. In this
section, we describe how to integrate an existing configuration into Kolla
Ansible.</p>
<p>The original configuration management tool kept entire OpenStack configuration
files under source control, with unique values templated using <a class="reference external" href="https://palletsprojects.com/p/jinja/">Jinja</a>. The existing deployment had been
upgraded several times, and configuration files had not been updated with
deprecation and removal of some configuration options. In comparison, Kolla
Ansible uses a layered approach where configuration generated by Kolla Ansible
itself is merged with additions or overrides specified by the operator either
globally, per role (<tt class="docutils literal">nova</tt>), per service (<tt class="docutils literal"><span class="pre">nova-api</span></tt>), or per host
(<tt class="docutils literal">hypervisor042</tt>). This has the advantage of reducing the amount of
configuration to check at each upgrade, since Kolla Ansible will track
deprecations and removals of the options it uses.</p>
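<p>As a sketch of this layering (the paths below follow the Kayobe and Kolla
Ansible documented conventions, rather than being taken from this particular
deployment), an override such as a CPU allocation ratio could live at any of
these levels in a <tt class="docutils literal"><span class="pre">kayobe-config</span></tt> repository:</p>
<div class="highlight"><pre><span></span># Applies to all Nova services on all hosts:
#   etc/kayobe/kolla/config/nova.conf
# Applies only to the nova-api service:
#   etc/kayobe/kolla/config/nova/nova-api.conf
# Applies only to the host hypervisor042:
#   etc/kayobe/kolla/config/nova/hypervisor042/nova.conf
#
# Each file contains plain INI overrides, for example:
[DEFAULT]
cpu_allocation_ratio = 4.0
</pre></div>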
<p>The <tt class="docutils literal"><span class="pre">oslo-config-validator</span></tt> tool from the <a class="reference external" href="https://docs.openstack.org/oslo.config/latest/">oslo.config</a> project helps with the task
of auditing an existing configuration for outdated options. While introduced in
Stein, it may be possible to run it against older releases if the API has not
changed substantially. For example, to audit <tt class="docutils literal">nova.conf</tt> using code from the
<tt class="docutils literal">stable/queens</tt> branch:</p>
<div class="highlight"><pre><span></span><span class="gp">$</span> git clone -b stable/queens https://opendev.org/openstack/nova.git
<span class="gp">$</span> <span class="nb">cd</span> nova
<span class="gp">$</span> tox -e venv -- pip install --upgrade oslo.config <span class="c1"># Update to the latest oslo.config release</span>
<span class="gp">$</span> tox -e venv -- oslo-config-validator --config-file etc/nova/nova-config-generator.conf --input-file /etc/nova/nova.conf
</pre></div>
<p>This would output messages identifying removed and deprecated options:</p>
<div class="highlight"><pre><span></span><span class="go">ERROR:root:DEFAULT/verbose not found</span>
<span class="go">WARNING:root:Deprecated opt DEFAULT/notify_on_state_change found</span>
<span class="go">WARNING:root:Deprecated opt DEFAULT/notification_driver found</span>
<span class="go">WARNING:root:Deprecated opt DEFAULT/auth_strategy found</span>
<span class="go">WARNING:root:Deprecated opt DEFAULT/scheduler_default_filters found</span>
</pre></div>
<p>Once updated to match the deployed release, all the remaining options could be
moved to a role configuration file used by Kolla Ansible. However, we
preferred to audit each one against Kolla Ansible templates, such as
<a class="reference external" href="https://opendev.org/openstack/kolla-ansible/src/branch/master/ansible/roles/nova/templates/nova.conf.j2">nova.conf.j2</a>,
to avoid keeping redundant options and detect any potential conflicts. Future
upgrades will be made easier by reducing the amount of custom configuration
compared to Kolla Ansible's defaults.</p>
<p>Templating also needs to be adapted from the original configuration management
system. Kolla Ansible relies on Jinja which can use variables set in Ansible.
However, when called from Kayobe, <a class="reference external" href="https://storyboard.openstack.org/#!/story/2006542">extra group variables cannot be set in Kolla
Ansible's inventory</a>, so
instead of <tt class="docutils literal">cpu_allocation_ratio = {{ cpu_allocation_ratio }}</tt> you would have
to use a different approach:</p>
<div class="highlight"><pre><span></span><span class="cp">{%</span> <span class="k">if</span> <span class="nv">inventory_hostname</span> <span class="k">in</span> <span class="nv">groups</span><span class="o">[</span><span class="s1">'compute_big_overcommit'</span><span class="o">]</span> <span class="cp">%}</span>
<span class="l l-Scalar l-Scalar-Plain">cpu_allocation_ratio = 16.0</span>
<span class="cp">{%</span> <span class="k">elif</span> <span class="nv">inventory_hostname</span> <span class="k">in</span> <span class="nv">groups</span><span class="o">[</span><span class="s1">'compute_small_overcommit'</span><span class="o">]</span> <span class="cp">%}</span>
<span class="l l-Scalar l-Scalar-Plain">cpu_allocation_ratio = 4.0</span>
<span class="cp">{%</span> <span class="k">else</span> <span class="cp">%}</span>
<span class="l l-Scalar l-Scalar-Plain">cpu_allocation_ratio = 1.0</span>
<span class="cp">{%</span> <span class="k">endif</span> <span class="cp">%}</span>
</pre></div>
</div>
<div class="section" id="configuring-kolla-ansible-to-use-existing-services">
<h3>Configuring Kolla Ansible to use existing services</h3>
<p>We described earlier that our migration strategy was to progressively deploy
OpenStack services on the new control plane while using the existing Galera,
Memcached, and RabbitMQ clusters. This section explains how this can be
configured with Kayobe and Kolla Ansible.</p>
<p>In Kolla Ansible, many deployment settings are configured in
<a class="reference external" href="https://opendev.org/openstack/kolla-ansible/src/branch/master/ansible/group_vars/all.yml">ansible/group_vars/all.yml</a>,
including the RabbitMQ transport URL (<tt class="docutils literal">rpc_transport_url</tt>) and the database
connection (<tt class="docutils literal">database_address</tt>).</p>
<p>An operator can override these values from Kayobe using
<tt class="docutils literal">etc/kayobe/kolla/globals.yml</tt>:</p>
<div class="highlight"><pre><span></span><span class="nt">rpc_transport_url</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">rabbit://username:password@ctrl01:5672,username:password@ctrl02:5672,username:password@ctrl03:5672</span>
</pre></div>
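<p>The database connection can be redirected to the existing Galera cluster in
the same file. A minimal sketch, assuming the existing cluster is reachable at
the hypothetical address <tt class="docutils literal">192.168.0.10</tt>:</p>
<div class="highlight"><pre><span></span>database_address: 192.168.0.10
</pre></div>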
<p>Another approach is to populate the groups that Kolla Ansible uses to generate
these variables. In Kayobe, we can create an extra group for each existing
service (e.g. <tt class="docutils literal">ctrl_rabbitmq</tt>), populate it with existing hosts, and
customise the Kolla Ansible inventory to map services to them.</p>
<p>In <tt class="docutils literal">etc/kayobe/kolla.yml</tt>:</p>
<div class="highlight"><pre><span></span><span class="nt">kolla_overcloud_inventory_top_level_group_map</span><span class="p">:</span>
<span class="nt">control</span><span class="p">:</span>
<span class="nt">groups</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="l l-Scalar l-Scalar-Plain">controllers</span>
<span class="nt">network</span><span class="p">:</span>
<span class="nt">groups</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="l l-Scalar l-Scalar-Plain">network</span>
<span class="nt">compute</span><span class="p">:</span>
<span class="nt">groups</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="l l-Scalar l-Scalar-Plain">compute</span>
<span class="nt">monitoring</span><span class="p">:</span>
<span class="nt">groups</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="l l-Scalar l-Scalar-Plain">monitoring</span>
<span class="nt">storage</span><span class="p">:</span>
<span class="nt">groups</span><span class="p">:</span>
<span class="s">"{{</span><span class="nv"> </span><span class="s">kolla_overcloud_inventory_storage_groups</span><span class="nv"> </span><span class="s">}}"</span>
<span class="nt">ctrl_rabbitmq</span><span class="p">:</span>
<span class="nt">groups</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="l l-Scalar l-Scalar-Plain">ctrl_rabbitmq</span>
<span class="nt">kolla_overcloud_inventory_custom_components</span><span class="p">:</span> <span class="s">"{{</span><span class="nv"> </span><span class="s">lookup('template',</span><span class="nv"> </span><span class="s">kayobe_config_path</span><span class="nv"> </span><span class="s">~</span><span class="nv"> </span><span class="s">'/kolla/inventory/overcloud-components.j2')</span><span class="nv"> </span><span class="s">}}"</span>
</pre></div>
<p>In <tt class="docutils literal">etc/kayobe/inventory/hosts</tt>:</p>
<div class="highlight"><pre><span></span><span class="k">[ctrl_rabbitmq]</span>
<span class="na">ctrl01 ansible_host</span><span class="o">=</span><span class="s">192.168.0.1</span>
<span class="na">ctrl02 ansible_host</span><span class="o">=</span><span class="s">192.168.0.2</span>
<span class="na">ctrl03 ansible_host</span><span class="o">=</span><span class="s">192.168.0.3</span>
</pre></div>
<p>We copy <a class="reference external" href="https://opendev.org/openstack/kayobe/src/branch/master/ansible/roles/kolla-ansible/templates/overcloud-components.j2">overcloud-components.j2</a>
from the Kayobe source tree to
<tt class="docutils literal"><span class="pre">etc/kayobe/kolla/inventory/overcloud-components.j2</span></tt> in our <tt class="docutils literal"><span class="pre">kayobe-config</span></tt>
repository and customise it:</p>
<div class="highlight"><pre><span></span><span class="k">[rabbitmq:children]</span>
<span class="na">ctrl_rabbitmq</span>
<span class="k">[outward-rabbitmq:children]</span>
<span class="na">ctrl_rabbitmq</span>
</pre></div>
<p>While better integrated with Kolla Ansible, this approach should be used with
care so that the original control plane is not reconfigured in the process.
Operators can use the <tt class="docutils literal"><span class="pre">--limit</span></tt> and <tt class="docutils literal"><span class="pre">--kolla-limit</span></tt> options of Kayobe to
restrict Ansible playbooks to specific groups or hosts.</p>
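<p>For example, to deploy services only to the new controllers while leaving the
hosts in the <tt class="docutils literal">ctrl_rabbitmq</tt> group untouched (a sketch using
the group names from the inventory above):</p>
<div class="highlight"><pre><span></span>$ kayobe overcloud service deploy --kolla-limit controllers
</pre></div>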
</div>
<div class="section" id="customising-kolla-images">
<h3>Customising Kolla images</h3>
<p>Even though Kolla Ansible can be configured extensively, it is sometimes
required to customise Kolla images. For example, we had to rebuild the
<tt class="docutils literal"><span class="pre">heat-api</span></tt> container image so it would use a different Keystone domain name:
Kolla uses <tt class="docutils literal">heat_user_domain</tt> while the existing deployment used <tt class="docutils literal">heat</tt>.</p>
<p>Once a modification has been pushed to the Kolla repository configured to be
pulled by Kayobe, one can simply rebuild images with the <tt class="docutils literal">kayobe overcloud
container image build</tt> <a class="reference external" href="https://docs.openstack.org/kayobe/latest/deployment.html#id4">command</a>.</p>
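<p>For example, to rebuild and push only the Heat images (assuming a container
registry has been configured in Kayobe):</p>
<div class="highlight"><pre><span></span>$ kayobe overcloud container image build heat --push
</pre></div>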
</div>
<div class="section" id="deploying-services-on-the-new-control-plane">
<h3>Deploying services on the new control plane</h3>
<p>Before deploying services on the new control plane, it can be useful to
double-check that our configuration is correct. Kayobe can generate the
configuration used by Kolla Ansible with the following command:</p>
<div class="highlight"><pre><span></span><span class="gp">$</span> kayobe overcloud service configuration generate --node-config-dir /tmp/kolla
</pre></div>
<p>To deploy only specific services, the operator can restrict Kolla Ansible to
specific roles using tags:</p>
<div class="highlight"><pre><span></span><span class="gp">$</span> kayobe overcloud service deploy --kolla-tags glance
</pre></div>
</div>
<div class="section" id="migrating-resources-to-new-services">
<h3>Migrating resources to new services</h3>
<p>Most OpenStack services will start managing existing resources immediately
after deployment. However, a few require manual intervention from the operator
to perform the transition, particularly when services are not configured for
high availability.</p>
<div class="section" id="cinder">
<h4>Cinder</h4>
<p>Even when volume data is kept on a distributed backend like a Ceph cluster,
each volume can be associated with a specific <tt class="docutils literal"><span class="pre">cinder-volume</span></tt> service. The
service can be identified from the <tt class="docutils literal"><span class="pre">os-vol-host-attr:host</span></tt> field in the
output of <tt class="docutils literal">openstack volume show</tt>.</p>
<div class="highlight"><pre><span></span><span class="gp">$</span> openstack volume show <volume_uuid> -c os-vol-host-attr:host -f value
<span class="go">ctrl01@rbd</span>
</pre></div>
<p>There is a <tt class="docutils literal"><span class="pre">cinder-manage</span></tt> command that can be used to migrate volumes from
one <tt class="docutils literal"><span class="pre">cinder-volume</span></tt> service to another:</p>
<div class="highlight"><pre><span></span><span class="gp">$</span> cinder-manage volume update_host --currenthost ctrl01@rbd --newhost newctrl01@rbd
</pre></div>
<p>However there is no command to migrate specific volumes only, so if you are
migrating to a larger number of <tt class="docutils literal"><span class="pre">cinder-volume</span></tt> services, some will have
no volumes to manage until the Cinder scheduler allocates new volumes on them.</p>
<p>Do not confuse this command with <tt class="docutils literal">cinder migrate</tt> which is designed to
transfer volume data between different backends. Be advised that when the
destination is a <tt class="docutils literal"><span class="pre">cinder-volume</span></tt> service using the same Ceph backend, it will
happily delete your volume data!</p>
</div>
<div class="section" id="neutron">
<h4>Neutron</h4>
<p>Unless Layer 3 High Availability is configured in Neutron, routers will be
assigned to a specific <tt class="docutils literal"><span class="pre">neutron-l3-agent</span></tt> service. The existing service can
be replaced with the commands:</p>
<div class="highlight"><pre><span></span><span class="gp">$</span> openstack network agent remove router --l3 <old-agent-uuid> <router-uuid>
<span class="gp">$</span> openstack network agent add router --l3 <new-agent-uuid> <router-uuid>
</pre></div>
<p>Similarly, you can use the <tt class="docutils literal">openstack network agent remove network <span class="pre">--dhcp</span></tt>
and <tt class="docutils literal">openstack network agent add network <span class="pre">--dhcp</span></tt> commands for DHCP agents.</p>
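<p>A sketch of the equivalent DHCP agent move, with placeholder UUIDs obtained
from <tt class="docutils literal">openstack network agent list</tt>:</p>
<div class="highlight"><pre><span></span>$ openstack network agent remove network --dhcp <old-agent-uuid> <network-uuid>
$ openstack network agent add network --dhcp <new-agent-uuid> <network-uuid>
</pre></div>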
</div>
</div>
<div class="section" id="live-migrating-instances">
<h3>Live migrating instances</h3>
<p>In addition to the new control plane, several additional compute hosts were
added to the system, in order to provide free resources that could host the
first batch of live migrated instances. Once configured as Nova hypervisors, we
discovered that we could not migrate instances to them because CPU flags didn't
match, even though source hypervisors were using the same hardware.</p>
<p>This was caused by a mismatch in BIOS versions: the existing hypervisors in
production had been updated to the latest BIOS to protect against the Spectre
and Meltdown vulnerabilities, but these new hypervisors had not, resulting in
different CPU flags.</p>
<p>This is a good reminder that in a heterogeneous infrastructure, operators
should check the <tt class="docutils literal">cpu_mode</tt> used by Nova. <a class="reference external" href="https://www.openstack.org/videos/summits/berlin-2018/effective-virtual-cpu-configuration-in-nova">Kashyap Chamarthy's talk on
effective virtual CPU configuration in Nova</a>
gives a good overview of available options.</p>
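<p>One common mitigation, sketched below for <tt class="docutils literal">nova.conf</tt>, is to pin
all hypervisors to a common baseline CPU model rather than relying on host CPU
flags; the model name here is purely illustrative and must be chosen to match
the oldest CPU in the fleet:</p>
<div class="highlight"><pre><span></span>[libvirt]
cpu_mode = custom
cpu_model = Haswell-noTSX
</pre></div>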
</div>
</div>
<div class="section" id="what-about-downtime">
<h2>What about downtime?</h2>
<p>While we wanted to minimise the impact on end users and their workflows, there
were no critical services running on this cloud that would have needed a
zero-downtime approach. Had that been a requirement, we would have explored
dynamically adding new control plane nodes to the existing clusters before
removing the old ones. Instead, it was a welcome opportunity to reinitialise
the configuration of several critical components to a clean slate.</p>
</div>
<div class="section" id="the-road-ahead">
<h2>The road ahead</h2>
<p>This OpenStack deployment is now ready to benefit from all the improvements
developed by the Kolla community, which released <a class="reference external" href="https://docs.openstack.org/releasenotes/kolla/stein.html">Kolla 8.0.0</a> and <a class="reference external" href="https://docs.openstack.org/releasenotes/kolla-ansible/stein.html">Kolla Ansible
8.0.0</a> for
the Stein cycle earlier this summer and <a class="reference external" href="https://docs.openstack.org/releasenotes/kayobe/stein.html">Kayobe 6.0.0</a> at the end of
August. The community is now actively working on releases for OpenStack Train.</p>
<p>If you would like to get in touch we would love to hear from you. Reach out to
us via <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a> or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact
page</a>.</p>
</div>
Fabric control in Intel MPI2019-09-13T16:30:00+01:002019-09-13T16:30:00+01:00Steve Brasiertag:www.stackhpc.com,2019-09-13:/intel-mpi-fabric.html<p class="first last">Changes to control of communication fabrics in Intel MPI and recommended settings for Microsoft Azure.</p>
<p>High Performance Computing usually involves some sort of parallel computing and process-level parallelisation using the <a class="reference external" href="https://en.wikipedia.org/wiki/Message_Passing_Interface">MPI
(Message Passing Interface) protocol</a> has been a common approach on "traditional" HPC clusters. Although alternative approaches are
gaining some ground, getting good MPI performance will continue to be crucially important for many big scientific workloads even in
a cloudy new world of software-defined infrastructure.</p>
<p>There are several high-quality MPI implementations available and deciding which one to use is important as applications must be compiled against
specific MPI libraries - the different MPI libraries are (broadly) source-compatible but not binary-compatible. Unfortunately selecting the
"right" one to use is not straightforward as a search for benchmarks will quickly show, with different implementations coming out on top
in different situations. Intel's MPI has historically been a strong contender, with easy
"<a class="reference external" href="https://software.intel.com/en-us/articles/installing-intel-free-libs-and-python-yum-repo">yum install</a>" deployment, good performance
(especially on Intel processors), and being - unlike Intel's compilers - free to use. Intel MPI 2018 remains relevant even for new installs as the 2019 versions have had various issues, including the fairly-essential
hydra manager appearing not to work with at least some AMD processors. A fix for this is <a class="reference external" href="https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/816696">apparently planned</a> for 2019 update 5 but there is no release date
for this yet.</p>
<p>MPI can run over many different types of interconnect or "fabrics" that actually carry the inter-process communications,
such as Ethernet, InfiniBand etc. By default, the Intel MPI runtime will automatically try to select a fabric which works. Knowing
how to control fabric choices is still important, however, as there is no guarantee it will select the optimal fabric, and falling back
through non-working options can lead to slow startup or lots of worrying error messages for the user.</p>
<p>Intel significantly changed the fabric control between 2018 and 2019 MPI versions but this isn't immediately obvious from the changelog
and you have to jump about between the developer references and developer guides to get the full picture. In both MPI versions the
<code>I_MPI_FABRICS</code> environment variable specifies the fabric, but the values it takes are quite different:</p>
<ul class="simple">
<li>For 2018 options are <code>shm</code>, <code>dapl</code>, <code>tcp</code>, <code>tmi</code>, <code>ofa</code> or <code>ofi</code>, or you can use <code>x:y</code> to control intra- and inter-node communications separately (see the docs for which combinations are valid).</li>
<li>For 2019 options are only <code>ofi</code>, <code>shm:ofi</code> or <code>shm</code>, with the 2nd option setting intra- and inter-node communications separately as before.</li>
</ul>
<p>The most generally-useful options are probably:</p>
<ul class="simple">
<li><code>shm</code> (2018 & 2019): The shared memory transport; only applicable to intra-node communication so generally used with another transport as suggested above - see the docs for details.</li>
<li><code>tcp</code> (2018 only): A TCP/IP capable fabric e.g. Ethernet or IB via IPoIB.</li>
<li><code>ofi</code> (2018 & 2019): An "OpenFabrics Interfaces-capable fabric". These use a library called libfabric (either an Intel-supplied or "external" version) which provides a fixed application-facing API while talking to one of several "OFI providers" which communicate with the interconnect hardware. Really your choice of provider here depends on the hardware, with possibilities being:<ul>
<li><code>psm2</code>: Intel OmniPath</li>
<li><code>verbs</code>: InfiniBand or iWARP</li>
<li><code>RxM</code>: A utility provider supporting <code>verbs</code></li>
<li><code>sockets</code>: Again a TCP/IP capable fabric but this time through libfabric. It's not intended to be faster than the 2018 <code>tcp</code> option, but allows developing/debugging libfabric codes without actually having a faster interconnect available.</li>
</ul>
</li>
</ul>
<p>With both 2018 and 2019 you can use <code>I_MPI_OFI_PROVIDER_DUMP=enable</code> to see which providers MPI thinks are available.</p>
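<p>As a concrete sketch, fabric selection for each version might look like the
following; the provider choice here is an assumption that depends entirely on
your hardware:</p>
<div class="highlight"><pre><span></span># Intel MPI 2018: shared memory intra-node, TCP inter-node
export I_MPI_FABRICS=shm:tcp

# Intel MPI 2019: OFI is the only inter-node option; optionally pin
# the libfabric provider (e.g. verbs for InfiniBand)
export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=verbs

# In both versions, show which providers libfabric can see at startup
export I_MPI_OFI_PROVIDER_DUMP=enable
mpirun -np 2 ./hello_mpi
</pre></div>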
<p>2018 also supported some additional options which have gone away in 2019:</p>
<ul class="simple">
<li><code>ofa</code> (2018): "OpenFabrics Alliance" e.g. InfiniBand (through OFED Verbs) & possibly also iWARP and RoCE?</li>
<li><code>dapl</code> (2018): "Direct Access Programming Library" e.g. InfiniBand and iWARP.</li>
<li><code>tmi</code> (2018): "Tag Matching Interface" e.g. Intel True Scale Fabric, Intel Omni-Path Architecture, Myrinet</li>
</ul>
<p>With any of these fabrics there are additional variables to tweak things. 2018 has <code>I_MPI_FABRICS_LIST</code> which allows specification of a list of available fabrics to try, plus variables to control fallback through this list. These variables are all gone in 2019 now that there are fewer fabric options. Intel have clearly decided to concentrate on OFI/libfabric, which unifies (or restricts, depending on your view!) the application-facing interface.</p>
<p>If you're using the 2018 MPI over InfiniBand you might be wondering which option to use; at least back in 2012 performance between DAPL and OFA/OFED Verbs was <a class="reference external" href="https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/279244#comment-1468484">apparently generally similar</a> although the transport options available varied, so which is usable/best if both are available will depend on your application and hardware.</p>
<div class="section" id="hpc-fabrics-in-the-public-cloud">
<h2>HPC Fabrics in the Public Cloud</h2>
<p>Hybrid and public cloud HPC solutions have been gaining increasing attention, with scientific users looking to burst peak usage out to the cloud, or investigating the impact of wholesale migration.</p>
<p>Azure have been pushing their capabilities for HPC hard recently, <a class="reference external" href="https://www.youtube.com/watch?v=XKUQYYhiV1g&feature=youtu.be">showcasing</a> ongoing work to get closer to bare-metal performance and <a class="reference external" href="https://azure.microsoft.com/en-gb/blog/introducing-the-new-hbv2-azure-virtual-machines-for-high-performance-computing/">launching</a> a 2nd generation of "HB-series" VMs which provide 120 cores of AMD Epyc 7002 processors. With InfiniBand interconnects and as many as 80,000 cores of HBv2 available for jobs for (some) customers, Azure looks to be providing pay-as-you-go access to some very serious (virtual) hardware. And in addition to providing a platform for new HPC workloads in the cloud, for organisations which are already embedded in the Microsoft ecosystem Azure may seem an obvious route to acquiring a burst capacity for on-premises HPC workloads.</p>
<p>If you're running in a virtualised environment such as Azure, MPI configuration is likely to have additional complexities and a careful read of any and all documentation you can get your hands on is likely to be needed.</p>
<p>For example for Azure, the recommended Intel MPI settings <a class="reference external" href="https://docs.microsoft.com/en-us/azure/virtual-machines/linux/sizes-hpc#rdma-capable-instances">described here</a>, <a class="reference external" href="https://docs.microsoft.com/en-us/azure/cyclecloud/hb-hc-best-practices">here</a> and in the <a class="reference external" href="https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/hpc/overview">suite of pages here</a> vary depending on which type of VM you are using:</p>
<ul class="simple">
<li>Standard and most compute-optimised nodes only have Ethernet (needing <code>tcp</code> or <code>sockets</code>) which is likely to make them uninteresting for multi-node MPI jobs.</li>
<li>H-series VMs (the RDMA-capable "r" sizes) and some others have FDR InfiniBand but need specific drivers (provided in an Azure image), Intel MPI <em>2016</em> and the DAPL provider set to <code>ofa-v2-ib0</code>.</li>
<li>HC44 and HB60 VMs have EDR InfiniBand and can theoretically use any MPI (although for HB60 VMs note the issues with Intel 2019 MPI on AMD processors mentioned above) but need the <a class="reference external" href="https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/hpc/setup-mpi#intel-mpi">appropriate fabric</a> to be manually set.</li>
</ul>
<p>InfiniBand on Azure still seems to be undergoing considerable development with for example <a class="reference external" href="http://mvapich.cse.ohio-state.edu/performance/mv2-azure-pt_to_pt/">new drivers for MVAPICH2 coming out around now</a> so treat any guidance with a pinch of salt until you know it's not stale, to mix metaphors!</p>
<p>---</p>
<p>If you would like to get in touch we would love to hear from you. Reach out to us on <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a> or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact page</a>.</p>
</div>
CloudKitty and Monasca: OpenStack charging without Telemetry2019-07-30T23:00:00+01:002019-07-31T09:50:00+01:00Pierre Riteautag:www.stackhpc.com,2019-07-30:/cloudkitty-and-monasca-1.html<p class="first last">We explore options to use CloudKitty to charge for OpenStack usage
without a full Telemetry stack.</p>
<div class="figure">
<img alt="CloudKitty and Monasca project mascots" src="//www.stackhpc.com/images/cloudkitty-monasca.png" style="width: 350px;" />
</div>
<p>Tracking resource usage, and charging for it, is a requirement for many cloud
deployments. Public clouds obviously need to bill their customers, but private
clouds can also use <a class="reference external" href="https://en.wikipedia.org/wiki/IT_chargeback_and_showback">chargeback and showback policies</a> to encourage more
efficient use of resources. In the OpenStack world, <a class="reference external" href="https://docs.openstack.org/cloudkitty/latest/">CloudKitty</a> is the standard <em>rating</em>
solution. It works by applying rating rules, which turn metric measurements
into rated usage information.</p>
<p>For several years, gathering metrics in OpenStack has been implemented by two
separate project teams: Telemetry and, more recently, Monasca. The future of
<a class="reference external" href="https://governance.openstack.org/tc/reference/projects/telemetry.html">Telemetry</a>,
which produces the <a class="reference external" href="https://opendev.org/openstack/ceilometer">Ceilometer software</a>, is uncertain: historical
contributors have stopped working on the project and its de-facto back end for
measurements, <a class="reference external" href="https://gnocchi.xyz/">Gnocchi</a>, is also seeing <a class="reference external" href="https://github.com/gnocchixyz/gnocchi/commit/0be1bc0431441d10d2844611d75e557640f2af48">low activity</a>.
Although <a class="reference external" href="http://lists.openstack.org/pipermail/openstack-discuss/2019-March/003897.html">Telemetry users have volunteered to maintain the project</a>,
the <a class="reference external" href="https://governance.openstack.org/tc/reference/projects/monasca.html">Monasca</a> project
appears to be healthier and more active.</p>
<p>Since <a class="reference external" href="//www.stackhpc.com/monasca-comes-to-kolla.html">deploying Monasca</a> is
our preferred choice to monitor OpenStack, we asked ourselves: can we use
CloudKitty to charge for usage without deploying a full Telemetry software
stack?</p>
<div class="section" id="ceilometer-monasca-ceilosca">
<h2>Ceilometer + Monasca = Ceilosca</h2>
<p>Ceilometer is well integrated in OpenStack and can collect usage data from
various OpenStack services, either by <a class="reference external" href="https://docs.openstack.org/ceilometer/latest/contributor/architecture.html">polling or listening for notifications</a>.
Ceilometer is designed to publish this data to the Gnocchi time series database
for storage and querying.</p>
<p>In Monasca, metrics collected by the <a class="reference external" href="https://opendev.org/openstack/monasca-agent/src/branch/master/docs/Agent.md">Monasca Agent</a>
focus more on monitoring the health and performance of the infrastructure and
its services, rather than resource usage from end users (although it can gather
instance metrics via the <a class="reference external" href="https://github.com/openstack/monasca-agent/blob/master/docs/Libvirt.md">Libvirt plugin</a>).
Monasca stores these metrics in a time series database, with support for
<a class="reference external" href="https://www.influxdata.com/products/influxdb-overview/">InfluxDB</a> and
<a class="reference external" href="http://cassandra.apache.org/">Cassandra</a>.</p>
<p>Despite this, we are not required to deploy and maintain Gnocchi just to collect
usage data via Ceilometer: <a class="reference external" href="https://opendev.org/openstack/monasca-ceilometer">monasca-ceilometer</a>, also known as Ceilosca,
enables Ceilometer to publish data to the Monasca API for storage in its metrics
database. Although Ceilosca currently lives in its own repository and must be
installed by <a class="reference external" href="https://opendev.org/openstack/monasca-ceilometer/src/branch/stable/stein#installation-instructions-for-setting-up-ceilosca-manually">adding it to the Ceilometer source tree</a>,
there is <a class="reference external" href="https://review.opendev.org/#/c/562400/">an ongoing effort to integrate it directly into Ceilometer</a>.</p>
<p>By default, Ceilosca will push several metrics based on instance detailed
information, such as <tt class="docutils literal">disk.root.size</tt>, <tt class="docutils literal">memory</tt>, and <tt class="docutils literal">vcpus</tt>, to Monasca
under the <tt class="docutils literal">service</tt> tenant. Each metric will be associated with a specific
instance ID via the <tt class="docutils literal">resource_id</tt> dimension. Metric dimensions also include
user and project IDs. For example, to retrieve metrics associated with the
<tt class="docutils literal">p3</tt> project, we can use the Monasca Python client:</p>
<div class="highlight"><pre><span></span><span class="go">monasca metric-list \</span>
<span class="go">--tenant-id $(openstack project show service -c id -f value) \</span>
<span class="go">--dimensions project_id=$(openstack project show p3 -c id -f value)</span>
</pre></div>
<p>Once stored in Monasca, these metrics can be used by CloudKitty, thanks to the
<a class="reference external" href="https://www.objectif-libre.com/en/blog/2018/03/14/integration-monasca-et-cloudkitty/">inclusion of a Monasca collector since the Queens release</a>.</p>
<p>Let's see how we can apply a charge to the <tt class="docutils literal">vcpus</tt> metric. We need to configure
CloudKitty with the <tt class="docutils literal">metrics.yml</tt> file to know about our metric:</p>
<div class="highlight"><pre><span></span><span class="nt">metrics</span><span class="p">:</span>
<span class="nt">vcpus</span><span class="p">:</span>
<span class="nt">unit</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">vcpus</span>
<span class="nt">groupby</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="l l-Scalar l-Scalar-Plain">resource_id</span>
<span class="nt">extra_args</span><span class="p">:</span>
<span class="nt">resource_key</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">resource_id</span>
</pre></div>
<p>Then, we configure the hashmap rating rules to apply a rate to CPU usage. We
create a <tt class="docutils literal">vcpus</tt> service and then create a mapping with a cost of 0.5 per CPU
hour:</p>
<div class="highlight"><pre><span></span><span class="gp">$</span> cloudkitty hashmap service create vcpus
<span class="go">+-------+--------------------------------------+</span>
<span class="go">| Name | Service ID |</span>
<span class="go">+-------+--------------------------------------+</span>
<span class="go">| vcpus | cb72cd89-43ef-46b9-b047-58e0b5335992 |</span>
<span class="go">+-------+--------------------------------------+</span>
<span class="gp">$</span> cloudkitty hashmap mapping create <span class="m">0</span>.5 -s cb72cd89-43ef-46b9-b047-58e0b5335992 -t flat
<span class="go">+--------------------------------------+-------+------------+------+----------+--------------------------------------+----------+------------+</span>
<span class="go">| Mapping ID | Value | Cost | Type | Field ID | Service ID | Group ID | Project ID |</span>
<span class="go">+--------------------------------------+-------+------------+------+----------+--------------------------------------+----------+------------+</span>
<span class="go">| 68465dad-7c68-4f8e-a256-6a62735c1e3b | None | 0.50000000 | flat | None | cb72cd89-43ef-46b9-b047-58e0b5335992 | None | None |</span>
<span class="go">+--------------------------------------+-------+------------+------+----------+--------------------------------------+----------+------------+</span>
</pre></div>
<p>We then launch an instance. Once the instance becomes active, a notification is
processed by Ceilometer and published to Monasca, recording that instance
<tt class="docutils literal"><span class="pre">b7d926a8-cd63-4205-8f90-e3c610aeaad5</span></tt> has 64 vCPUs.</p>
<div class="highlight"><pre><span></span><span class="gp">$</span> monasca metric-statistics --tenant-id <span class="k">$(</span>openstack project show service -c id -f value<span class="k">)</span> vcpus avg <span class="s2">"2019-07-30T14:00:00"</span> --merge_metrics --group_by resource_id --period <span class="m">1</span>
<span class="go">+-------+---------------------------------------------------+----------------------+--------------+</span>
<span class="go">| name | dimensions | timestamp | avg |</span>
<span class="go">+-------+---------------------------------------------------+----------------------+--------------+</span>
<span class="go">| vcpus | resource_id: b7d926a8-cd63-4205-8f90-e3c610aeaad5 | 2019-07-30T14:43:01Z | 64.000 |</span>
<span class="go">+-------+---------------------------------------------------+----------------------+--------------+</span>
</pre></div>
<p>With the default Kolla configuration, Nova also sends a report notification
every hour, which is also stored in Monasca. Similarly, when an instance is
terminated, a notification is published and converted into a final measurement
in Monasca. However, using the default CloudKitty configuration, every instance
measurement is interpreted as if the associated instance ran for the whole
hour. For example, an instance launched at 10:45 and terminated at 11:15 would
result in two whole hours being charged, instead of just 30 minutes. This can
be mitigated by reducing the <tt class="docutils literal"><span class="pre">[collect]/period</span></tt> setting in
<tt class="docutils literal">cloudkitty.conf</tt>, for example down to one minute, and adjusting the charge
rate to match the new period. For this approach to work, we need to have at
least one measurement stored for each period. This isn't possible with audit
notifications sent by Nova because one hour is the lowest possible period. An
alternative is to rely on continuously updated metrics collected by Ceilometer,
such as CPU utilisation. However, this kind of Ceilometer metric is
unavailable in our bare metal environment.</p>
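<p>For example, to rate usage at one-minute granularity (the value is in
seconds, and any per-hour costs in the rating rules must be scaled down
accordingly):</p>
<div class="highlight"><pre><span></span>[collect]
period = 60
</pre></div>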
<p>Once CloudKitty has analysed usage metrics, we can extract rated data to CSV
format. As can be seen below, two whole hours have been charged at a rate of
0.5 per vCPU, i.e. 32.0 for each hour of the 64-vCPU instance.
In this case, the instance had been launched around 14:45 and terminated around
15:20. We have compared using pure Ceilometer and Gnocchi instead of Ceilosca
and Monasca and noticed the exact same issue.</p>
<div class="highlight"><pre><span></span><span class="gp">$</span> cloudkitty dataframes get -f df-to-csv --format-config-file cloudkitty-csv.yml
<span class="go">Begin,End,Metric Type,Qty,Cost,Project ID,Resource ID,User ID</span>
<span class="go">2019-07-30T14:00:00,2019-07-30T15:00:00,vcpus,64.0,32.0,35be5437552f40cba2aa6e5cb47df613,b7d926a8-cd63-4205-8f90-e3c610aeaad5,53ed408e5a7a4e79baa76803e1df61d6</span>
<span class="go">2019-07-30T15:00:00,2019-07-30T16:00:00,vcpus,64.0,32.0,35be5437552f40cba2aa6e5cb47df613,b7d926a8-cd63-4205-8f90-e3c610aeaad5,53ed408e5a7a4e79baa76803e1df61d6</span>
</pre></div>
<p>A downside of using Ceilosca instead of Ceilometer with Gnocchi is that
metadata such as instance flavour is not available for CloudKitty to use for
rating by default, at least in the Rocky release that we used. We will update
this post if we can develop a configuration for Ceilosca that supports this
feature.</p>
</div>
<div class="section" id="openstack-usage-metrics-without-ceilometer">
<h2>OpenStack usage metrics without Ceilometer</h2>
<p>Monasca has <a class="reference external" href="http://specs.openstack.org/openstack/monasca-specs/specs/stein/approved/monasca-events-listener.html">plans</a>
to capture OpenStack notifications and store them with the Monasca Events API,
although this is not yet implemented. CloudKitty would require changes to
support charging based on these events, since it is currently designed around
metrics. It is worth pointing out that <a class="reference external" href="https://review.opendev.org/#/c/673461/">an ElasticSearch storage driver has
just been proposed in CloudKitty</a>, so
these two new designs may line up in the future.</p>
<p>In the meantime, an alternative is to bypass Ceilometer completely and rely on
another mechanism to publish metrics to Monasca. As mentioned earlier in this
article, Monasca can provide instance metrics via the <a class="reference external" href="https://github.com/openstack/monasca-agent/blob/master/docs/Libvirt.md">Libvirt plugin</a>.
However, this won't cover other services for which we may want to charge, such
as volume usage.</p>
<p>Since the Monasca Agent can scrape metrics from Prometheus exporters, we are
exploring whether we can leverage <a class="reference external" href="https://github.com/openstack-exporter/openstack-exporter">openstack-exporter</a> to provide metrics
to be rated by CloudKitty. Stay tuned for the next blog post on this topic!</p>
</div>
A Universe from Nothing: Try Kayobe in your own Model Universe2019-06-05T12:00:00+01:002019-06-05T23:00:00+01:00Isaac Priortag:www.stackhpc.com,2019-06-05:/universe-from-nothing.html<p class="first last">Following a highly successful workshop at the Denver Open Infrastructure
Summit, here's a step-by-step guide for how to recreate the lab on your own
hardware and try out Kayobe for yourself.</p>
<p>There is momentum building behind <a class="reference external" href="https://kayobe.readthedocs.io/">Kayobe</a>, our deployment tool of choice
for deploying OpenStack in performance-intensive and research-oriented
use cases. At the recent Open Infrastructure Summit in Denver,
Maciej Kucia and Maciej Siczek gave <a class="reference external" href="https://www.openstack.org/videos/summits/denver-2019/deploying-openstack-what-options-do-we-have">a great presentation</a>
in which they spoke positively about their experiences with Kayobe:</p>
<div class="youtube"><iframe src="https://www.youtube.com/embed/8ODdvCogwl8?start=696" width="450" height="300" allowfullscreen seamless frameBorder="0"></iframe></div><p>At the same summit, our hands-on workshop on Kayobe deployment had
people queuing to get in, and received plenty of positive feedback
from attendees who rolled up their sleeves and worked through the
experience.</p>
<div class="figure">
<img alt="Universe from Nothing Workshop" src="//www.stackhpc.com/images/universe-from-nothing-attendees.jpg" style="width: 750px;" />
</div>
<p>One significant piece of feedback from the workshop was that people wanted to
be able to try this workshop out at home, on their own resources, to enable
them to share the experience and understand at their leisure how it all fits
together.</p>
<p>So we added a page to the Kayobe docs for people looking to recreate
<a class="reference external" href="https://kayobe.readthedocs.io/en/latest/resources.html#a-universe-from-nothing">A Universe from Nothing</a>
in their own time and space. As well as a <a class="reference external" href="https://github.com/stackhpc/a-universe-from-nothing/blob/master/README.rst">step-by-step README</a>
all the scripts for <a class="reference external" href="https://github.com/stackhpc/a-universe-from-nothing">creating the lab environment</a> are provided.</p>
<div class="figure">
<img alt="Universe from Nothing Tenks" src="//www.stackhpc.com/images/universe-from-nothing-tenks.png" style="width: 750px;" />
</div>
<p>To recreate the lab, a single server is required with a fairly relaxed
baseline of requirements:</p>
<ul class="simple">
<li>At least 32GB of RAM</li>
<li>At least 40GB of disk</li>
<li>CentOS 7 installed</li>
<li>Passwordless <cite>sudo</cite> for the lab user</li>
<li>Processor virtualisation should be enabled (nested virt if it is a VM)</li>
</ul>
<p>Have fun!</p>
<div class="figure">
<img alt="Universe from Nothing Logo" src="//www.stackhpc.com/images/universe-from-nothing-logo.png" style="width: 750px;" />
</div>
StackHPC at the CERN OpenStack Day 20192019-05-31T09:00:00+01:002019-05-31T09:00:00+01:00Stig Telfertag:www.stackhpc.com,2019-05-31:/cern-os2019.html<p class="first last">StackHPC visited our friends at CERN for the CERN OpenStack Day 2019</p>
<p>With a subtitle of <em>Accelerating Science with OpenStack</em>, the
<a class="reference external" href="https://indico.cern.ch/event/776411/timetable/#20190527">CERN OpenStack day</a> was
always going to be our kind of conference. The <a class="reference external" href="https://indico.cern.ch/event/776411/timetable/#20190527">schedule</a>
was packed with interesting content and the audience was packed
with interesting people.</p>
<p>Stig had the privilege of co-presenting two projects that StackHPC have
supported - with Chiara Ferrari for the <a class="reference external" href="https://www.skatelescope.org/">SKA radio telescope</a>, and with Jani Heikkinen for the
<a class="reference external" href="https://dcc.sib.swiss/biomedit/">BioMedIT project</a>.</p>
<div class="figure">
<img alt="Stig co-presenting with Chiara Ferrari" src="//www.stackhpc.com/images/stig-chiara-ska.jpg" style="width: 500px;" />
</div>
<p>In addition to the projects themselves, Stig promoted the
OpenStack <a class="reference external" href="https://wiki.openstack.org/wiki/Scientific_SIG">Scientific SIG</a>
as a forum for information sharing for scientific use cases.</p>
<div class="figure">
<img alt="Stig co-presenting with Jani Heikkinen" src="//www.stackhpc.com/images/stig-jani-biomedit.jpg" style="width: 500px;" />
</div>
<p>Thanks to Belmiro and the CERN team for all their effort to make
the day such a great success!</p>
I/O performance of Kata containers2019-05-09T12:00:00+01:002019-05-09T16:00:00+01:00Bharat Kunwartag:www.stackhpc.com,2019-05-09:/kata-io-1.html<p class="first last">We compare the I/O performance of Kata containers against runC and
bare metal cases and establish the I/O cost of the added layer
of security gained through hardware virtualisation
in container infrastructure.</p>
<div class="figure">
<img alt="Kata project logo" src="//www.stackhpc.com/images/kata-logo.png" style="width: 350px;" />
</div>
<p>This analysis was performed using Kata containers version 1.6.2, the latest at
the time of writing.</p>
<p>After attending a <a class="reference external" href="https://katacontainers.io/docs/">Kata Containers</a> workshop
at <a class="reference external" href="https://openinfradays.co.uk">OpenInfra Days 2019</a> in London, we were
impressed by their start-up time, only marginally slower compared to ordinary
runC containers in a Kubernetes cluster. We were naturally curious about their
disk I/O bound performance and whether they also live up to the speed claims.
In this article we explore this subject with a view to understanding the
trade offs of using this technology in environments where I/O bound performance
and security are both critical requirements.</p>
<div class="section" id="what-are-kata-containers">
<h2>What are Kata containers?</h2>
<p>Kata containers are lightweight VMs designed to integrate seamlessly with
container orchestration software like Docker and Kubernetes. One envisaged use
case is running untrusted workloads, exploiting the additional isolation gained
by not sharing the Operating System kernel with the host.
However, the unquestioning assumption that using a guest kernel leads to
additional security is challenged in a <a class="reference external" href="https://arxiv.org/abs/1904.12226">recent survey of virtual machines and
containers</a>. Kata has roots in Intel Clear
Containers and Hyper runV technology. They are also often mentioned alongside
<a class="reference external" href="https://gvisor.dev/docs/">gVisor</a>, which aims to solve a similar problem by
filtering and redirecting system calls to a separate user space kernel. As a
result gVisor suffers from runtime performance penalties. Further
discussion on gVisor is out of scope in this blog.</p>
</div>
<div class="section" id="configuring-kubernetes-for-kata">
<h2>Configuring Kubernetes for Kata</h2>
<p>Kata containers are <a class="reference external" href="https://www.opencontainers.org">OCI conformant</a> which
means that a Container Runtime Interface (CRI) that supports external runtime
classes can use Kata to run workloads. Examples of these CRIs currently include
<a class="reference external" href="https://cri-o.io">CRI-O</a> and <a class="reference external" href="https://containerd.io/docs/">containerd</a>
which both use <tt class="docutils literal">runC</tt> by default, but this can be swapped for the <tt class="docutils literal"><span class="pre">kata-qemu</span></tt>
runtime. From Kubernetes 1.14 onwards, the <tt class="docutils literal">RuntimeClass</tt> feature flag has been
promoted to beta and is therefore enabled by default, making the setup relatively
straightforward.</p>
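<p>A minimal sketch of wiring this up on Kubernetes 1.14, assuming the
<tt class="docutils literal"><span class="pre">kata-qemu</span></tt> handler has already been configured in containerd or
CRI-O:</p>
<div class="highlight"><pre><span></span>apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: kata-qemu
handler: kata-qemu
---
apiVersion: v1
kind: Pod
metadata:
  name: kata-test
spec:
  runtimeClassName: kata-qemu
  containers:
    - name: test
      image: busybox
      command: ["sleep", "3600"]
</pre></div>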
<p>At present Kata supports <tt class="docutils literal">qemu</tt> and <tt class="docutils literal">firecracker</tt> hypervisor
backends, but the support for the latter is considered preliminary,
<a class="reference external" href="https://github.com/Kata-containers/documentation/issues/351">especially a lack of host to guest file sharing</a>.
This leaves us with <tt class="docutils literal"><span class="pre">kata-qemu</span></tt> as the current option, in which
<tt class="docutils literal"><span class="pre">virtio-9p</span></tt> provides the basic shared filesystem functionalities
critical for this analysis (the test path is a network filesystem
mounted on the host).</p>
<p>This <a class="reference external" href="https://gist.github.com/brtknr/09a5fdd70d6497648d01e49ef5d0b17c">example Gist</a> shows how
to swap runC for Kata runtime in a Minikube cluster. Note that at the time of
writing, Kata containers have additional host requirements:</p>
<ul class="simple">
<li>Kata will only run on a machine configured to support nested virtualisation.</li>
<li>Kata requires <a class="reference external" href="https://github.com/Kata-containers/runtime/commit/8cfb06f1a92893348bba730059e436439f1f28f4#diff-f4ef6cd0d71cf6781e3b30ef4489901cR64">at least a Westmere processor architecture</a>.</li>
</ul>
<p>Without these prerequisites Kata startup will fail silently
(we learnt this the hard way).</p>
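<p>Before scheduling any workloads, it is worth checking both prerequisites on
each host; a quick sketch for Intel processors:</p>
<div class="highlight"><pre><span></span># CPU virtualisation extensions present?
egrep -c '(vmx|svm)' /proc/cpuinfo

# Nested virtualisation enabled (expect Y)?
cat /sys/module/kvm_intel/parameters/nested
</pre></div>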
<p>For this analysis a bare metal Kubernetes cluster was deployed, using OpenStack Heat to
provision the machines via our <a class="reference external" href="https://github.com/stackhpc/p3-appliances/pull/62">appliances playbooks</a> and <a class="reference external" href="https://github.com/kubernetes-sigs/kubespray">Kubespray</a> to configure them as a
Kubernetes cluster. Kubespray supports
specification of container runtimes other than <tt class="docutils literal">Docker</tt>, e.g. <tt class="docutils literal"><span class="pre">CRI-O</span></tt> and
<tt class="docutils literal">containerd</tt>, which is required to support the Kata runtime.</p>
</div>
<div class="section" id="designing-the-i-o-performance-study">
<h2>Designing the I/O Performance Study</h2>
<p>To benchmark the I/O performance of Kata containers, we present equivalent
scenarios in bare metal and runC container cases to draw comparison. In all
cases, we use <tt class="docutils literal">fio</tt> (version 3.1) as the I/O benchmarking tool invoked as
follows where <tt class="docutils literal">$SCRATCH_DIR</tt> is the path to our BeeGFS (described in more
detail later in this section) network storage mounted on the host:</p>
<div class="highlight"><pre><span></span>fio fio_jobfile.fio --fallocate<span class="o">=</span>none --runtime<span class="o">=</span><span class="m">30</span> --directory<span class="o">=</span><span class="nv">$SCRATCH_DIR</span> --output-format<span class="o">=</span>json+ --blocksize<span class="o">=</span><span class="m">65536</span> --output<span class="o">=</span><span class="m">65536</span>.json
</pre></div>
<p>The <tt class="docutils literal">fio_jobfile.fio</tt> file referenced above reads as follows:</p>
<div class="highlight"><pre><span></span>[global]
; Parameters common to all test environments
; Ensure that jobs run for a specified time limit, not I/O quantity
time_based=1
; To model application load at greater scale, each test client will maintain
; a number of concurrent I/Os.
ioengine=libaio
iodepth=8
; Note: these two settings are mutually exclusive
; (and may not apply for Windows test clients)
direct=1
buffered=0
; Set a number of workers on this client
thread=0
numjobs=4
group_reporting=1
; Each file for each job thread is this size
filesize=32g
size=32g
filename_format=$jobnum.dat
[fio-job]
; FIO_RW is read, write, randread or randwrite
rw=${FIO_RW}
</pre></div>
<p>In order to understand how the performance scales with the number of I/O bound
clients, we look at 1, 8 and 64 clients. While the single client is
instantiated on a single instance, for the cases with 8 and 64 clients, they
run in parallel across 2 worker instances, with 4 and 32 clients per bare metal
instance respectively. Additionally, each <tt class="docutils literal">fio</tt> client instantiates 4 threads
which randomly and sequentially read and write a 32G file per thread, depending on
the scenario.</p>
<p>All scenarios are configured with a block size of 64K. It is worth noting that the
<tt class="docutils literal">direct=true</tt> flag has not been supplied to <tt class="docutils literal">fio</tt> for these tests as it is
not representative of a typical use case.</p>
<p>The test infrastructure is set up in an optimal configuration for data-intensive analytics.
The storage backend which consists of NVMe devices is implemented with <a class="reference external" href="https://www.beegfs.io">BeeGFS</a>, a parallel file system for which we have an <a class="reference external" href="https://galaxy.ansible.com/stackhpc/beegfs">Ansible
Galaxy role</a> and have <a class="reference external" href="https://www.stackhpc.com/ansible-role-beegfs.html">previously
written about</a>. The
network connectivity between the test instances and BeeGFS storage platform uses
RDMA over a 100G Infiniband fabric.</p>
<table border="1" class="docutils">
<colgroup>
<col width="31%" />
<col width="35%" />
<col width="33%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">Scenario</th>
<th class="head">Number of clients</th>
<th class="head">Disk I/O pattern</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>bare metal</td>
<td>1</td>
<td>sequential read</td>
</tr>
<tr><td>runC containers</td>
<td>8</td>
<td>random read</td>
</tr>
<tr><td>Kata containers</td>
<td>64</td>
<td>sequential write</td>
</tr>
<tr><td> </td>
<td> </td>
<td>random write</td>
</tr>
</tbody>
</table>
<blockquote>
<em>The parameter space explored for the I/O performance
study covers 36 combinations of scenarios, number of clients and
disk I/O pattern.</em></blockquote>
</div>
<div class="section" id="results">
<h2>Results</h2>
<div class="section" id="disk-i-o-bandwidth">
<h3>Disk I/O Bandwidth</h3>
<p>In these results we plot the aggregate bandwidth across all clients,
demonstrating the scale-up bandwidth achievable by a single client and the
scale-out throughput achieved across many clients.</p>
<div class="figure">
<img alt="Comparison of disk I/O bandwidth" src="//www.stackhpc.com/images/scenario-cumulative-aggregate-bw-all-clients.png" style="width: 100%;" />
<p class="caption"><em>Comparison of disk I/O bandwidth between between bare metal, runC and Kata. In
all cases, the bandwidth achieved with runC containers is slightly below
bare metal. However, Kata containers generally fare much worse, achieving
around 15% of the bare metal read bandwidth and a much smaller proportion of
random write bandwidth when there are 64 clients. The only exception
is the sequential write case using 64 clients, where Kata
containers appear to outperform baremetal scenario by approximately 25%.</em></p>
</div>
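<p>Aggregate bandwidth figures of this kind can be pulled from the <tt class="docutils literal">fio</tt>
JSON output with a one-liner, for example (a sketch using <tt class="docutils literal">jq</tt>; with
<tt class="docutils literal">group_reporting</tt> enabled each result file contains a single aggregated
job entry, and <tt class="docutils literal">bw</tt> is reported in KiB/s):</p>
<div class="highlight"><pre><span></span># Sum the read bandwidth across all job entries in one result file.
jq '[.jobs[].read.bw] | add' randread-65536.json
</pre></div>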
</div>
<div class="section" id="commit-latency-cumulative-distribution-function-cdf">
<h3>Commit Latency Cumulative Distribution Function (CDF)</h3>
<p>In latency-sensitive workloads, I/O latency can dominate. I/O
operation commit latency is plotted on a logarithmic scale, to fit
a very broad range of data points.</p>
<div class="figure">
<img alt="Comparison of commit latency CDF" src="//www.stackhpc.com/images/scenario-cumulative-aggregate-cf-all-clients.png" style="width: 100%;" />
<p class="caption"><em>Comparison of commit latency CDF between bare metal, runC and Kata
container environments for 1, 8 and 64 clients respectively. There is a
small discrepancy between running fio jobs in bare metal compared to
running them as runC containers. However, comparing bare metal to Kata
containers, the overhead is significant in all cases.</em></p>
</div>
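<p>Percentile latencies such as those tabulated below can be read straight out
of the same <tt class="docutils literal">fio</tt> JSON output, for example (a sketch; recent <tt class="docutils literal">fio</tt>
versions report <tt class="docutils literal">clat_ns</tt> in nanoseconds, whereas the table below is in
microseconds):</p>
<div class="highlight"><pre><span></span># Extract the median and 99th percentile commit latencies for reads.
jq '.jobs[].read.clat_ns.percentiles | {p50: ."50.000000", p99: ."99.000000"}' \
    randread-65536.json
</pre></div>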
<table border="1" class="docutils">
<colgroup>
<col width="22%" />
<col width="12%" />
<col width="10%" />
<col width="11%" />
<col width="11%" />
<col width="10%" />
<col width="11%" />
<col width="12%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head" colspan="2">Number of clients ></th>
<th class="head" colspan="2">1</th>
<th class="head" colspan="2">8</th>
<th class="head" colspan="2">64</th>
</tr>
<tr><th class="head">Mode</th>
<th class="head">Scenario</th>
<th class="head">50%</th>
<th class="head">99%</th>
<th class="head">50%</th>
<th class="head">99%</th>
<th class="head">50%</th>
<th class="head">99%</th>
</tr>
</thead>
<tbody valign="top">
<tr><td rowspan="3">sequential read</td>
<td>bare</td>
<td>1581</td>
<td>2670</td>
<td>2416</td>
<td>3378</td>
<td>14532</td>
<td>47095</td>
</tr>
<tr><td>runC</td>
<td>2007</td>
<td>2506</td>
<td>2391</td>
<td>3907</td>
<td>15062</td>
<td>46022</td>
</tr>
<tr><td>Kata</td>
<td>4112</td>
<td>4620</td>
<td>12648</td>
<td>46464</td>
<td>86409</td>
<td>563806</td>
</tr>
<tr><td rowspan="3">random read</td>
<td>bare</td>
<td>970</td>
<td>2342</td>
<td>2580</td>
<td>3305</td>
<td>14935</td>
<td>43884</td>
</tr>
<tr><td>runC</td>
<td>1155</td>
<td>2277</td>
<td>2506</td>
<td>3856</td>
<td>15378</td>
<td>42229</td>
</tr>
<tr><td>Kata</td>
<td>5472</td>
<td>6586</td>
<td>13517</td>
<td>31080</td>
<td>109805</td>
<td>314277</td>
</tr>
<tr><td rowspan="3">sequential write</td>
<td>bare</td>
<td>1011</td>
<td>1728</td>
<td>2592</td>
<td>15023</td>
<td>3730</td>
<td>258834</td>
</tr>
<tr><td>runC</td>
<td>1011</td>
<td>1990</td>
<td>2547</td>
<td>14892</td>
<td>4308</td>
<td>233832</td>
</tr>
<tr><td>Kata</td>
<td>3948</td>
<td>4882</td>
<td>4102</td>
<td>6160</td>
<td>14821</td>
<td>190742</td>
</tr>
<tr><td rowspan="3">random write</td>
<td>bare</td>
<td>1269</td>
<td>2023</td>
<td>3698</td>
<td>11616</td>
<td>19722</td>
<td>159285</td>
</tr>
<tr><td>runC</td>
<td>1286</td>
<td>1957</td>
<td>3928</td>
<td>11796</td>
<td>19374</td>
<td>151756</td>
</tr>
<tr><td>Kata</td>
<td>4358</td>
<td>5275</td>
<td>4566</td>
<td>14254</td>
<td>1780559</td>
<td>15343845</td>
</tr>
</tbody>
</table>
<blockquote>
<em>Table summarising the 50% and the 99% commit latencies (in μs)
corresponding to the figure shown earlier.</em></blockquote>
</div>
</div>
<div class="section" id="looking-ahead">
<h2>Looking Ahead</h2>
<p>In an I/O intensive scenario such as this one, Kata containers do not yet match the
performance of conventional containers.</p>
<p>It is clear from the results that there are significant trade-offs
to consider when choosing between bare metal, runC and Kata containers.
While runC containers provide valuable abstractions for most use
cases, they still leave the host kernel vulnerable to exploits, with
the system call interface as the attack surface. Kata containers provide
hardware-supported isolation, but currently at a significant
performance overhead, especially for disk I/O bound operations.</p>
<p>Kata's development roadmap and pace of evolution provide substantial
grounds for optimism. The Kata team are aware of the performance
drawbacks of using <tt class="docutils literal"><span class="pre">virtio-9p</span></tt> as the storage driver for sharing
paths between host and guest VMs.</p>
<p>Kata version 1.7 (due on 15 May 2019) is expected to ship with
experimental support for <a class="reference external" href="https://virtio-fs.gitlab.io/">virtio-fs</a>, which should address these I/O
performance issues. Preliminary results look encouraging, with
other published benchmarks reporting the <tt class="docutils literal"><span class="pre">virtio-fs</span></tt> driver
demonstrating a <a class="reference external" href="https://lwn.net/Articles/774495/">2x to 8x disk I/O bandwidth improvement</a> over <tt class="docutils literal"><span class="pre">virtio-9p</span></tt>.
We will repeat our analysis when the new capabilities become available.</p>
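<p>For those wanting to experiment once it lands, the shared filesystem is
expected to become selectable in Kata's <tt class="docutils literal">configuration.toml</tt>, along
these lines (an assumption based on the announced design rather than a tested
configuration):</p>
<div class="highlight"><pre><span></span>[hypervisor.qemu]
# Assumed toggle between the shared filesystem implementations;
# virtio-9p remains the default for now.
shared_fs = "virtio-fs"
</pre></div>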
<p>In the meantime, if you would like to get in touch we would love to hear
from you, especially if there is a specific configuration which we may not
have considered. Reach out to us on <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a>
or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact page</a>.</p>
</div>
StackHPC Joins the OpenStack Bare Metal Program2019-04-29T09:00:00+01:002019-04-29T09:00:00+01:00Stig Telfertag:www.stackhpc.com,2019-04-29:/baremetal-program.html<p class="first last">Announcing our participation in the Open Infrastructure Foundation's
new Bare Metal Program</p>
<p>At StackHPC, our client requirements often take the form that we
must deliver cloud-native infrastructure without making any sacrifice
to existing levels of performance. This can be challenging at
times, but would not be possible at all without OpenStack <a class="reference external" href="https://docs.openstack.org/ironic/latest/">Ironic</a>, the engine that makes
software-defined bare metal work.</p>
<p>Ironic enables our clients to deploy on-premise high-performance
computing infrastructure using the same methods they would use to
deploy infrastructure in the cloud. This is driving a revolution
in research computing infrastructure management.</p>
<div class="figure">
<img alt="Bare metal program logo" src="//www.stackhpc.com/images/baremetal-program-logo.png" style="width: 500px;" />
</div>
<p>The StackHPC team's commitment to Ironic is long and deep, and
pre-dates the formation of StackHPC itself. Within StackHPC we
have made it a core component of our expertise. At the <a class="reference external" href="https://www.openstack.org/summit/denver-2019">Open
Infrastructure Summit</a>
this week in Denver, check out StackHPC team member Mark Goddard's
presentation of his recent work on <a class="reference external" href="https://www.openstack.org/summit/denver-2019/summit-schedule/events/23365/ironic-deploy-templates-bespoke-bare-metal">deep reconfigurability of bare metal</a>. And come along to our
hands-on workshop, <a class="reference external" href="https://www.openstack.org/summit/denver-2019/summit-schedule/events/23426/a-universe-from-nothing-containerised-openstack-deployment-using-kolla-ansible-and-kayobe">A Universe from Nothing</a> to get familiar with
<a class="reference external" href="https://kayobe.readthedocs.io/en/latest/">Kayobe</a>, our Ironic-centric
deployment tool for Kolla-Ansible OpenStack. Both are on Tuesday afternoon.</p>
<p>We'll also be talking about our commitment to high performance computing
(and no doubt touching on the role Ironic can play in delivering it) in
John Garbutt's presentation <a class="reference external" href="https://www.openstack.org/summit/denver-2019/summit-schedule/events/23106/lessons-learnt-federating-openstack-powered-supercomputers">Lessons learnt federating OpenStack powered supercomputers</a> on Monday afternoon, and Stig Telfer's panel session
<a class="reference external" href="https://www.openstack.org/summit/denver-2019/summit-schedule/events/23209/hpc-using-openstack">HPC using OpenStack</a> on Wednesday morning.</p>
<p>Finally, the <a class="reference external" href="https://www.openstack.org/summit/denver-2019/summit-schedule/events/23741/scientific-sig-bof-and-lightning-talks">Scientific SIG</a> on Monday afternoon always includes a boat load of
bare metal.</p>
Blazar 3.0.0: Highlights of the Stein Release2019-04-10T09:00:00+01:002019-04-10T09:00:00+01:00Pierre Riteautag:www.stackhpc.com,2019-04-10:/blazar-stein.html<p class="first last">Announcing the release of Blazar 3.0.0. This release provides
floating IP reservation and integrates with the Placement service.</p>
<p><a class="reference external" href="https://docs.openstack.org/blazar/latest/">Blazar</a> is a resource reservation
service for OpenStack. Initially started in 2013 under the name Climate, Blazar
was revived during the <a class="reference external" href="https://www.openstack.org/software/ocata">Ocata</a>
release cycle and became an official OpenStack project during the <a class="reference external" href="https://www.openstack.org/software/queens/">Queens</a> release cycle. It has just
shipped its third official release (the fifth since the revival of the project)
as part of the <a class="reference external" href="https://www.openstack.org/software/stein/">OpenStack Stein release</a>.</p>
<p>While Blazar’s ambition has always been to provide reservations for the various
types of resources managed by OpenStack, it has only supported compute
resources so far, in the form of instance reservations and physical host
reservations. Both were supported purely by integrating with <a class="reference external" href="https://docs.openstack.org/nova/latest/">Nova</a>. This is changing in Stein in two
ways.</p>
<p>First, the Blazar community has added support for reserving floating IPs by
integrating with <a class="reference external" href="https://docs.openstack.org/neutron/latest/">Neutron</a>.
Public IPv4 addresses are usually scarce resources which need to be carefully
managed. Users can now request to reserve one or several floating IPs for a
specific time period to ensure their future availability, and even bundle a
floating IP reservation with a reservation of compute resources inside the same
<a class="reference external" href="https://docs.openstack.org/blazar/latest/user/introduction.html#glossary-of-terms">lease</a>.
While the implementation of this feature is not fully complete in Stein and is
thus classified as a <em>preview</em>, most of the missing pieces are in client
support and documentation, and should be completed soon. <a class="reference external" href="https://www.chameleoncloud.org/">Chameleon</a>, a testbed for large-scale cloud research,
has already <a class="reference external" href="https://www.chameleoncloud.org/blog/2019/04/01/chameleon-changelog-march-2019/">made available</a>
this new feature to its users.</p>
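<p>As a sketch of how this looks from the client side (the reservation
properties follow the Stein documentation; the network ID, dates and amounts
are illustrative):</p>
<div class="highlight"><pre><span></span># Reserve two floating IPs from an external network for a fixed window.
blazar lease-create \
    --reservation resource_type=virtual:floatingip,network_id=EXTERNAL_NET_UUID,amount=2 \
    --start-date "2019-06-01 10:00" \
    --end-date "2019-06-02 10:00" \
    fip-lease
</pre></div>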
<p>Second, the instance reservation feature is now leveraging the <a class="reference external" href="https://docs.openstack.org/placement/latest/">Placement API
service</a>. Originally introduced
within Nova, OpenStack Placement provides an HTTP service for managing,
selecting, and claiming providers of classes of inventory representing
available resources in a cloud. Placement was extracted from Nova in the Stein
release and is now a separate project. This change allows Blazar to support all
types of affinity policies for instance reservation, instead of being limited
to anti-affinity as in previous releases. While Blazar initially leverages
Placement only for instance reservation, it paves the way for extending
reservation to other types of resources when they integrate with Placement
themselves. It will also help Blazar to provide reservation of bare-metal nodes
managed by <a class="reference external" href="https://docs.openstack.org/ironic/latest/">Ironic</a>.</p>
<p>Blazar also includes a new <a class="reference external" href="https://developer.openstack.org/api-ref/reservation/v1/index.html#resource-allocations">Resource Allocation API</a>,
allowing operators to query the reserved state of their cloud resources. This
provides a foundation for developing new tools such as a graphical calendar
view, which we hope can be made available upstream in a future release.</p>
<p>More details about all the notable changes in Stein are available in the
<a class="reference external" href="https://docs.openstack.org/releasenotes/blazar/stein.html">Blazar release notes</a>.</p>
<p>On May 1, two of the Blazar core reviewers will be presenting a <a class="reference external" href="https://www.openstack.org/summit/denver-2019/summit-schedule/events/23722/blazar-project-update">Project Update</a>
at the <a class="reference external" href="https://www.openstack.org/summit/denver-2019">Denver 2019 Open Infrastructure Summit</a>. Join them to learn more about
these changes and discuss how reservations can make better use of cloud
resources!</p>
<p>With the <a class="reference external" href="https://releases.openstack.org/train/schedule.html">Train release</a>
on the horizon, the Blazar community is planning to go full steam ahead by:</p>
<ul class="simple">
<li>extending its integration with Neutron with reservation of network segments
(e.g. VLANs and VXLANs);</li>
<li>making Blazar compatible with bare-metal nodes managed by Ironic, <a class="reference external" href="http://lists.openstack.org/pipermail/openstack-discuss/2019-April/004780.html">possibly
without using Nova</a>;</li>
<li>providing a graphical reservation calendar within <a class="reference external" href="https://docs.openstack.org/horizon/latest/">Horizon</a>;</li>
<li>integrating with <a class="reference external" href="https://www.openstack.org/videos/summits/berlin-2018/science-demonstrations-preemptible-instances-at-cern-and-bare-metal-containers-for-hpc-at-ska">preemptible instances</a>.</li>
</ul>
<p>StackHPC sees resource reservation as one of OpenStack’s functional gaps for
<a class="reference external" href="https://www.stackhpc.com/openstack-and-hpc-workloads.html">meeting the needs of research computing</a>. Blazar can
provide a critical service, enabling users to reserve in advance enough
resources for running large-scale workloads.</p>
<div class="figure">
<img alt="Blazar project mascot" src="//www.stackhpc.com/images/blazar-horizontal.png" style="width: 500px;" />
</div>
Kayobe 5.0.0: The Rocky Release2019-02-25T22:00:00+00:002019-02-25T22:00:00+00:00Stig Telfertag:www.stackhpc.com,2019-02-25:/kayobe-5.html<p class="first last">Announcing the release of Kayobe 5.0.0. This release
supports deployment of OpenStack Rocky. Lots of new features
and fixes.</p>
<p><a class="reference external" href="https://kayobe.readthedocs.io/en/latest/">Kayobe</a> is a free and open source
deployment tool for containerised OpenStack control planes, based
on <a class="reference external" href="https://docs.openstack.org/kolla/latest/">Kolla</a> and
<a class="reference external" href="https://docs.openstack.org/kolla-ansible/latest/">Kolla-Ansible</a>,
and embodying current best practices. Kayobe is seeing broad
adoption for research computing configurations and use cases.</p>
<p>After its beginnings with OpenStack Ocata, Kayobe is now onto its
fourth major OpenStack release with support for <a class="reference external" href="https://www.openstack.org/software/rocky/">Rocky</a>.</p>
<p>Admittedly, Rocky was finalised back in November 2018. StackHPC's
dedicated team (who drive much of the work on Kayobe) have been
busy with some major pieces of work, both within StackHPC and <a class="reference external" href="https://www.stackalytics.com/?company=stackhpc">around
the OpenStack ecosystem</a>.
Thanks to growing strength and breadth, the team was actually quicker
with this release than it was with Queens, and expects to be quicker
still with the forthcoming Stein release.</p>
<p>In addition to support for deploying and managing Rocky, the <a class="reference external" href="https://kayobe-release-notes.readthedocs.io/en/latest/rocky.html">release
notes</a>
describe many new features in this release.</p>
<p><a class="reference external" href="author/mark-goddard.html">Mark Goddard</a> presented our
work on Kayobe at the recent <a class="reference external" href="https://cloud.ac.uk/workshops/feb2019/">UKRI Cloud Workshop</a> at the Francis Crick
Institute in London.</p>
<div class="figure">
<img alt="Mark Goddard at Cloud WG Workshop 2019" src="//www.stackhpc.com/images/mark-at-cloudwg-2019.png" style="width: 360px;" />
</div>
<p>Mark says, "The Kayobe 5.0.0 release includes a number of useful
features. We now have a full upgrade path for the seed services
from Ocata to Rocky. The Python package now includes the Ansible
playbooks, meaning that you can now use Kayobe without a copy of
the source code repository. This sets us up for more reproducible
and easy to install Kayobe control host environments. Thanks to
everyone who contributed to the release. Now onto Stein - get in
touch via #openstack-kayobe or the <a class="reference external" href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-discuss">openstack-discuss mailing list</a>
to help shape the next release!"</p>
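<p>With the playbooks now shipped in the Python package, setting up a Kayobe
control host can be as simple as the following (a sketch, assuming a suitable
virtualenv):</p>
<div class="highlight"><pre><span></span># Install Kayobe from PyPI into an isolated environment.
virtualenv kayobe-venv
source kayobe-venv/bin/activate
pip install kayobe

# Bootstrap the Ansible control host.
kayobe control host bootstrap
</pre></div>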
StackHPC at the UKRI Cloud Workshop2019-02-14T14:00:00+00:002019-02-14T14:00:00+00:00Stig Telfertag:www.stackhpc.com,2019-02-14:/cloudwg-2019.html<p class="first last">StackHPC participates in the 4th UKRI Cloud WG Workshop,
held at the Francis Crick Institute in London.</p>
<p>We always enjoy attending the <a class="reference external" href="https://cloud.ac.uk/workshops/feb2019/">UKRI Cloud Working Group Workshop</a> held annually at the
awesome <a class="reference external" href="https://www.crick.ac.uk/">Francis Crick Institute</a>. The
sizeable crowd it draws and the high quality of content are both
healthy signs of the vitality of cloud for research computing.</p>
<p>This year's workshop demonstrated a maturing approach to use of
cloud, with some notable focus on various methods for harnessing
hybrid and public clouds for dynamic and bursting workloads. Public
cloud companies presented on new and forthcoming HPC-aware features,
while research organisations presented on mobility to avoid lock-in
to cloud vendors. It will be interesting to see how these two
contrasting tensions play out over the next few years.</p>
<p>There was also a welcome focus on operating and sustaining cloud-hosted
infrastructure and platforms. In particular, Matt Pryor from
<a class="reference external" href="http://www.jasmin.ac.uk/">STFC/JASMIN</a> presented their current
project on a user-friendly application portal, coupled with
Cluster-as-a-Service deployments of Slurm and Kubernetes, with focus
on both usability for scientists and day-2 operations for administrators.
StackHPC is proud to be working with the JASMIN team on implementing this
well-considered initiative and we hope to write more about it in due course.</p>
<p>We always participate as much as possible, and this year StackHPC was
more involved than we have ever been before. Five members of our team
attended, and in a one-day programme three presentations were delivered
by the team - a real achievement for a ten-person company.</p>
<p>We presented three prominent areas of recent work. <a class="reference external" href="author/john-garbutt.html">John Garbutt</a> spoke about our recent work on storage
for the software-defined supercomputer, in particular SKA SDP buffer
prototyping and the Cambridge Data Accelerator.</p>
<div class="figure">
<img alt="John Garbutt at Cloud WG Workshop 2019" src="//www.stackhpc.com/images/johng-at-cloudwg-2019.jpg" style="width: 720px;" />
</div>
<p><em>Pictured here with David Yuan of EMBL and Matt Pryor of STFC</em></p>
<p><a class="reference external" href="author/mark-goddard.html">Mark Goddard</a> presented our work on <a class="reference external" href="https://kayobe.readthedocs.io/en/latest/">Kayobe</a>, a free and open source
deployment tool for containerised OpenStack control planes, based
on <a class="reference external" href="https://docs.openstack.org/kolla/latest/">Kolla</a> and
<a class="reference external" href="https://docs.openstack.org/kolla-ansible/latest/">Kolla-Ansible</a>,
and embodying current best practices. Kayobe is seeing broad
adoption for research computing configurations and use cases.</p>
<div class="figure">
<img alt="Mark Goddard at Cloud WG Workshop 2019" src="//www.stackhpc.com/images/mark-at-cloudwg-2019.png" style="width: 720px;" />
</div>
<p><a class="reference external" href="author/bharat-kunwar.html">Bharat Kunwar</a> delivered a demonstration
of <a class="reference external" href="https://pangeo.io/">Pangeo</a>, the second of the day after Jacob
Tomlinson presented the work of the <a class="reference external" href="https://www.metoffice.gov.uk/about-us/what/informatics-lab">Met Office Informatics Lab</a>.
With a focus on data-intensive analytics on private cloud infrastructure,
Bharat demonstrated the deployment of Pangeo on a bare metal HPC
OpenStack deployment, using Kubernetes deployed by <a class="reference external" href="https://docs.openstack.org/magnum/latest/user/">Magnum</a>. In addition
to demonstrating containers running on bare metal, Bharat demonstrated
storage attachments backed by Ceph and RDMA-enabled <a class="reference external" href="https://www.beegfs.io/content/">BeeGFS</a>. All of that in ten minutes!</p>
<div class="figure">
<img alt="Bharat Kunwar at Cloud WG Workshop 2019" src="//www.stackhpc.com/images/bharat-at-cloudwg-2019.jpg" style="width: 720px;" />
</div>
Scientific OpenStack Hackathon2019-02-08T22:00:00+00:002019-02-08T22:00:00+00:00Stig Telfertag:www.stackhpc.com,2019-02-08:/sos-hackathon-2019.html<p class="first last">A movement to build greater community around users of
OpenStack for scientific computing being spearheaded by a
consortium of UK research institutions.</p>
<p>This week we have been hosting a gathering of technical teams
from a number of prominent UK scientific institutions affiliated to the
<a class="reference external" href="https://www.iris.ac.uk/">IRIS consortium</a>, including
<a class="reference external" href="http://www.ccfe.ac.uk/">the Culham Centre for Fusion Energy</a>,
<a class="reference external" href="https://www.manchester.ac.uk/">Manchester University</a>,
<a class="reference external" href="https://www.roe.ac.uk/">the Royal Observatory Edinburgh</a>,
<a class="reference external" href="https://www.cam.ac.uk/">Cambridge University</a>,
<a class="reference external" href="https://stfc.ukri.org/about-us/where-we-work/rutherford-appleton-laboratory/">Rutherford Appleton Laboratory</a> and
<a class="reference external" href="https://www.diamond.ac.uk/Home.html">the Diamond Light Source</a>.
We were also joined by our friends from <a class="reference external" href="https://www.bristolisopen.com/">Bristol is Open</a>.</p>
<p>The group was gathered for a hackathon, aimed at helping to spread
technical knowledge about
<a class="reference external" href="https://docs.openstack.org/kolla/latest/">Kolla</a>,
<a class="reference external" href="https://docs.openstack.org/kolla-ansible/latest/">Kolla-Ansible</a>
and <a class="reference external" href="https://kayobe.readthedocs.io/en/latest/">Kayobe</a>, and how they
can be used together to create OpenStack deployments optimised for
scientific computing use cases - a concept we informally refer to
as <em>Scientific OpenStack</em>.</p>
<p>Kayobe is a free and open source deployment tool for containerised
OpenStack control planes, embodying current best practices.
Kayobe is seeing broad adoption for research computing configurations
and use cases.</p>
<div class="figure">
<img alt="Scientific OpenStack hackathon 2019" src="//www.stackhpc.com/images/sos-hackathon-2019.jpg" style="width: 720px;" />
</div>
<p>Aside from helping make progress with many new OpenStack projects,
a secondary aim of the hackathon has been, along with other users
of Kayobe worldwide and the OpenStack <a class="reference external" href="https://wiki.openstack.org/wiki/Scientific_SIG">Scientific SIG</a>, to cement a
strong set of inter-institutional technical relationships, enabling
a self-supporting community to grow for this space.</p>
Ceph on the Brain: A Year with the Human Brain Project2018-12-17T12:00:00+00:002018-12-17T13:00:00+00:00Stig Telfertag:www.stackhpc.com,2018-12-17:/ceph-on-the-brain-a-year-with-the-human-brain-project.html<p class="first last">Working with our partners at Cray, a year of investigation
into technologies for data movement and using Ceph in support of
demanding storage use cases.</p>
<div class="section" id="background">
<h2>Background</h2>
<p>The <a class="reference external" href="https://www.humanbrainproject.eu/en/">Human Brain Project (HBP)</a>
is a 10-year EU <a class="reference external" href="http://ec.europa.eu/programmes/horizon2020/en/h2020-section/fet-flagships">FET flagship project</a>
seeking to <em>“provide researchers worldwide with tools and mathematical
models for sharing and analysing large brain data they need for
understanding how the human brain works and for emulating its
computational capabilities”</em>. This ambitious and far-sighted goal
has become increasingly relevant during the lifetime of the project
with the rapid uptake of Machine Learning and AI (in its various
forms) for a broad range of new applications.</p>
<p>A significant portion of the HBP is concerned with massively parallel
applications in neuro-simulation, in analysis techniques to interpret
data produced by such applications, and in platforms to enable
these. The advanced requirements of the HBP in terms of mixed
workload processing, storage and access models are way beyond current
technological capabilities and will therefore drive innovation in
the HPC industry. The Pre-commercial procurement (PCP) is a funding
vehicle developed by the European Commission, in which an industrial
body co-designs with a public institution an innovative solution
to a real-world technical problem, with the intention of providing
the solution as commercialized IP.</p>
<p>The <a class="reference external" href="https://www.fz-juelich.de/">Jülich Supercomputer Centre</a> on behalf of
the Human Brain Project entered into a competitive three-phased PCP
programme to design next-generation supercomputers for the demanding
brain simulation, analysis and data-driven problems facing the wider
Human Brain Project. Two consortia - NVIDIA and IBM, and Cray and
Intel - were selected to build prototypes of their proposed solutions.
The phase III projects ran until January 2017, but Cray’s project
deferred significant R&D investment, and was amended and extended.
Following significant activity supporting the research efforts
at Jülich, JULIA was finally decommissioned at the end of November.</p>
</div>
<div class="section" id="introducing-julia">
<h2>Introducing JULIA</h2>
<div class="figure">
<img alt="JULIA" src="//www.stackhpc.com/images/cray-julia-cabinets.jpg" style="width: 300px; height: 446px;" />
</div>
<p>In 2016, Cray installed a prototype named <a class="reference external" href="https://hbp-hpc-platform.fz-juelich.de/?page_id=1063">JULIA</a>, with the
aim of exploring APIs for access to dense memory and storage, and
the effective support of mixed workloads. In this context, mixed
workloads may include interactive visualisation of live simulation
data and the possibility of applying feedback to "steer" a simulation
based on early output. Flexible exploitation of new hardware and
software aligns well with Cray's <a class="reference external" href="https://www.cray.com/company">vision of adaptive supercomputing</a>.</p>
<p>JULIA is based on a <a class="reference external" href="https://www.cray.com/products/computing/cs-series/cs400-ac">Cray CS400</a>
system, but extended with some novel hardware and software technologies:</p>
<ul class="simple">
<li>60 Intel Knights Landing compute nodes</li>
<li>8 visualisation nodes with NVIDIA GPUs</li>
<li>4 data nodes with Intel Xeon processors and 2x Intel Fultondale P3600 SSDs</li>
<li>All system partitions connected using the Omnipath interconnect</li>
<li>Installation of a remote visualization system for concurrent,
post-processing and in-transit visualization of data primarily from
neurosimulation.</li>
<li>An installed software environment combining conventional HPC toolchains
(Cray, Intel, GNU compilers), and machine learning software stacks
(e.g. Theano, caffe, TensorFlow)</li>
<li>A storage system consisting of SSD-backed Ceph</li>
</ul>
<div class="figure">
<img alt="JULIA" src="//www.stackhpc.com/images/julia-system-architecture.png" style="width: 700px;" />
</div>
<p>StackHPC was sub-contracted by Cray in order to perform analysis
and optimisation of the Ceph cluster. Analysis work started in
August 2017.</p>
<div class="section" id="ceph-on-julia">
<h3>Ceph on JULIA</h3>
<p>The Ceph infrastructure comprises four data nodes, each equipped with two
P3600 NVME devices and a 100G Omnipath high-performance network:</p>
<div class="figure">
<img alt="JULIA" src="//www.stackhpc.com/images/julia-node-architecture.png" style="width: 700px;" />
</div>
<p>Each of the NVME devices is configured with four partitions. Each partition
is provisioned as a Ceph OSD, providing a total of 32 OSDs.</p>
<div class="figure">
<img alt="JULIA" src="//www.stackhpc.com/images/julia-osd-partitioning.png" style="width: 350px;" />
</div>
<p>The Ceph cluster was initially running the Jewel release of Ceph
(current at the time). After characterising the performance, we
started to look for areas for optimisation.</p>
</div>
<div class="section" id="high-performance-fabric">
<h3>High-Performance Fabric</h3>
<p>The JULIA system uses a 100G <a class="reference external" href="https://www.intel.com/content/www/us/en/high-performance-computing-fabrics/omni-path-driving-exascale-computing.html">Intel Omni-Path</a>
RDMA-centric network fabric, also known as OPA. This network is conceptually derived
and evolved from <a class="reference external" href="https://www.infinibandta.org/">InfiniBand</a>, and
reuses a large proportion of the InfiniBand software stack, including
the <a class="reference external" href="https://en.wikipedia.org/wiki/InfiniBand#API">Verbs message-passing API</a>.</p>
<p>Ceph's predominant focus on TCP/IP-based networking is supported
through <a class="reference external" href="https://www.kernel.org/doc/Documentation/infiniband/ipoib.txt">IP-over-InfiniBand</a>, a
kernel network driver that enables the Omni-Path network to carry
layer-3 IP traffic.</p>
<p>The <tt class="docutils literal">ipoib</tt> network driver enables connectivity, but does not
unleash the full potential of the network. Performance is good on
architectures where a processor core is sufficiently powerful to
sustain a significant proportion of line rate in spite of the
protocol overhead.</p>
<p>This <a class="reference external" href="https://en.wikipedia.org/wiki/Sankey_diagram">Sankey diagram</a>
illustrates the connectivity between different hardware components
within JULIA:</p>
<div class="figure">
<img alt="JULIA" src="//www.stackhpc.com/images/julia-iperf-sankey.png" style="width: 600px;" />
</div>
<p>In places there are two arrows, as the TCP performance was found
to be highly variable. Despite some investigation, the underlying
reason for the variability is still unclear to us.</p>
<p>Using native APIs, Omni-Path will comfortably saturate the 100G
network link. However, the <tt class="docutils literal">ipoib</tt> interface falls short of the
mark, particularly on the Knights Landing processors.</p>
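<p>This kind of comparison can be reproduced with standard tooling (a sketch;
hostnames are illustrative, and each command needs a matching server-side
instance started first):</p>
<div class="highlight"><pre><span></span># Native RDMA read bandwidth; ib_read_bw is from the perftest package
# (run "ib_read_bw" with no arguments on the server side).
ib_read_bw data-node-1

# TCP/IP bandwidth over the same fabric via the ipoib interface
# (run "iperf3 -s" on the server side).
iperf3 -c data-node-1-ipoib
</pre></div>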
</div>
<div class="section" id="raw-block-device-performance">
<h3>Raw Block Device Performance</h3>
<p>In order to understand the overhead of filesystem and network
protocol, we attempt to benchmark the system at every level, moving
from the raw devices up to the end-to-end performance between client
and server. In this way, we can identify the achievable performance
at each level, and where there is most room for improvement.</p>
<p>Using the <tt class="docutils literal">fio</tt> <a class="reference external" href="https://github.com/axboe/fio">I/O benchmarking tool</a>,
we measure the aggregated block read performance of all NVME
partitions in a single JULIA data server. We used four <tt class="docutils literal">fio</tt>
clients per partition (32 in total) and 64KB reads. The results
are stacked to get the raw aggregate bandwidth for a single node:</p>
<div class="figure">
<img alt="JULIA" src="//www.stackhpc.com/images/julia-raw-device-perf.png" style="width: 400px;" />
</div>
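<p>Each per-partition measurement was of this general form, consistent with the
parameters described above (a sketch; the device path, queue depth and runtime
are illustrative):</p>
<div class="highlight"><pre><span></span># 64KB sequential reads from one raw NVME partition, with four
# concurrent fio jobs per partition.
fio --name=raw-read --filename=/dev/nvme0n1p1 --rw=read --blocksize=65536 \
    --ioengine=libaio --iodepth=8 --direct=1 --numjobs=4 \
    --runtime=60 --time_based --group_reporting
</pre></div>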
<p>The aggregate I/O read performance achieved by the data server is
approximately <strong>5200 MB/s</strong>. If we compare the I/O read performance
per node with the TCP/IP performance across the <tt class="docutils literal">ipoib</tt> interface,
we can see that the two are broadly comparable (within
the observed bounds of variability in <tt class="docutils literal">ipoib</tt> performance):</p>
<div class="figure">
<img alt="JULIA" src="//www.stackhpc.com/images/julia-ipoib-rawio.png" style="width: 750px;" />
</div>
<p>Given that realistic access patterns are likely to be served in part
from the kernel buffer cache, which can occupy a sizeable proportion
of each data node's 64G of RAM, the <tt class="docutils literal">ipoib</tt> network
performance is likely to become a bottleneck.</p>
</div>
<div class="section" id="jewel-to-luminous">
<h3>Jewel to Luminous</h3>
<p>Preserving the format of the backend data store, the major version
of Ceph was upgraded from Jewel to Luminous. Single-client performance
was tested using <tt class="docutils literal">rados bench</tt> before and after the upgrade:</p>
<div class="figure">
<img alt="JULIA" src="//www.stackhpc.com/images/julia-jewel-luminous.png" style="width: 750px;" />
</div>
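<p>The <tt class="docutils literal">rados bench</tt> invocations were of this general form (a sketch;
the pool name, duration and concurrency are illustrative):</p>
<div class="highlight"><pre><span></span># Write phase: 64K objects for 30 seconds with 16 concurrent ops,
# keeping the objects so that they can be read back afterwards.
rados bench -p bench 30 write -b 65536 -t 16 --no-cleanup

# Sequential read phase over the objects written above.
rados bench -p bench 30 seq -t 16
</pre></div>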
<p>The results that we see indicate a solid improvement for smaller objects
(below 64K) but negligible difference otherwise, and no increase in
peak performance.</p>
</div>
<div class="section" id="filestore-to-bluestore">
<h3>Filestore to Bluestore</h3>
<p>The Luminous release of Ceph introduced major improvements in the
Bluestore backend data store. The Ceph cluster was migrated to Bluestore
and tested again with a single client node and <tt class="docutils literal">rados bench</tt>:</p>
<div class="figure">
<img alt="JULIA" src="//www.stackhpc.com/images/julia-filestore-bluestore.png" style="width: 750px;" />
</div>
<p>There is a dramatic uplift in performance for larger objects for
both reads and writes. The peak RADOS object bandwidth is also
within the bounds of the observed limits achieved by the <tt class="docutils literal">ipoib</tt>
network interface. This level of performance is becoming less of
an I/O problem and more of a networking problem.</p>
<p>That's a remarkable jump. What just happened?</p>
<p>The major differences appear to be the greater efficiency of
a bespoke storage back-end over a general-purpose filesystem,
and a reduction in the amount of data handling, since objects
are no longer written first to a journal and then to the main store.</p>
</div>
</div>
<div class="section" id="write-amplification">
<h2>Write Amplification</h2>
<p>For every byte written to Ceph via the RADOS protocol, how many
bytes are actually written to disk? To find this, we sample
disk activity using <cite>iostat</cite>, aggregate across all devices in
the cluster and compare with the periodic bandwidth reports of
<cite>rados bench</cite>. The result is a pair of graphs, plotting RADOS
bandwidth against bandwidth of the underlying devices, over time.</p>
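<p>The device-side series can be collected with something along these lines
(a sketch; the device name pattern is illustrative):</p>
<div class="highlight"><pre><span></span># Sum MB_wrtn/s (column four of "iostat -m") across all NVME partitions,
# emitting one aggregate figure as each new one-second report begins.
iostat -m 1 | awk '/^nvme/  { total += $4 }
                   /^Device/ { print total; total = 0 }'
</pre></div>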
<p>Here are the results for the filestore backend:</p>
<div class="figure">
<img alt="Filestore backend, RADOS bandwidth" src="//www.stackhpc.com/images/ceph-filestore-rados.png" style="width: 750px;" />
</div>
<div class="figure">
<img alt="Filestore backend, iostat bandwidth" src="//www.stackhpc.com/images/ceph-filestore-iostat.png" style="width: 750px;" />
</div>
<p>There appears to be a write amplification factor of approximately 4.5x
- the combination of a 2x replication factor, having every
object written first through a collocated write journal, and
a small amount of additional overhead for filesystem metadata.</p>
<p>What is interesting to observe is the periodic freezes in activity
as the test progresses. These are believed to be the filestore back-end
subdividing object store directories when they exceed a given threshold.</p>
<p>Plotted with the same axes, the bluestore configuration is strikingly different:</p>
<div class="figure">
<img alt="Filestore backend, RADOS bandwidth" src="//www.stackhpc.com/images/ceph-bluestore-rados.png" style="width: 750px;" />
</div>
<div class="figure">
<img alt="Filestore backend, iostat bandwidth" src="//www.stackhpc.com/images/ceph-bluestore-iostat.png" style="width: 750px;" />
</div>
<p>The device I/O performance is approximately doubled, and sustained.
The write amplification is reduced from 4.5x to just over 2x (because
we are benchmarking here with 2x replication). It is the combination
of these factors that give us the dramatic improvement in write
performance.</p>
</div>
<div class="section" id="sustained-write-effects">
<h2>Sustained Write Effects</h2>
<p>Using the P3600 devices, performing sustained writes for long periods
eventually leads to performance degradation. This can be observed
in a halving of device write performance, and erratic and occasionally
lengthy commit times.</p>
<p>This effect can be seen in the results of <cite>rados bench</cite> when plotted over time.
In this graph, bandwidth is plotted in green and commit times are impulses in red:</p>
<div class="figure">
<img alt="JULIA" src="//www.stackhpc.com/images/julia-nvme-sustained-writes.png" style="width: 750px;" />
</div>
<p>This effect made it very hard to generate repeatable write benchmark results. It
was assumed the cause was activity within the NVME controller as its
pool of free blocks became depleted.</p>
<div class="section" id="scaling-the-client-load">
<h3>Scaling the Client Load</h3>
<p>During idle periods on the JULIA system it was possible to harness
larger numbers of KNL systems as Ceph benchmark clients. Using concurrent runs
of <cite>rados bench</cite> and aggregating the results, we could get a reasonable idea of
Ceph's scalability (within the bounds of the client resources available).</p>
<p>We were able to test with configurations of up to 20 clients at a time:</p>
<div class="figure">
<img alt="Luminous Ceph, RADOS read performance" src="//www.stackhpc.com/images/julia-luminous-knl.png" style="width: 750px;" />
</div>
<p>It was interesting to see how the cluster performance became erratic under
heavy load and high client concurrency.</p>
<p>The storage cluster BIOS and kernel parameters were reconfigured
to a low-latency / high-performance profile, and processor C-states
were disabled. This appeared to help with sustaining performance under
high load (superimposed here in black):</p>
<div class="figure">
<img alt="Luminous Ceph, RADOS read performance" src="//www.stackhpc.com/images/julia-luminous-knl-nocstates.png" style="width: 750px;" />
</div>
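<p>On Red Hat and CentOS systems this kind of reconfiguration is typically
applied with <tt class="docutils literal">tuned</tt> and kernel boot parameters, for example (a sketch
of one way to do it, rather than the exact profile used here):</p>
<div class="highlight"><pre><span></span># Apply a low-latency system profile.
tuned-adm profile latency-performance

# Prevent the processors from entering deep C-states (takes effect
# after a reboot).
grubby --update-kernel=ALL --args="processor.max_cstate=1 intel_idle.max_cstate=0"
</pre></div>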
<p>Recalling that the raw I/O read performance of each OSD server was benchmarked
at 5200 MB/s, giving an aggregate performance across all four servers of 20.8 GB/s,
our peak RADOS read performance of 16.5 GB/s represents about 80% of peak raw performance.</p>
</div>
</div>
<div class="section" id="spectre-meltdown-strikes">
<h2>Spectre/Meltdown Strikes</h2>
<p>At this point, microcode and kernel mitigations were applied for the Spectre/Meltdown
CVEs. After retesting, the aggregate raw I/O read performance per OSD
server was found to have dropped by over 15%, from 5200 MB/s to <strong>4400 MB/s</strong>. The aggregate
raw read performance of the Ceph cluster was now 17.6 GB/s.</p>
<div class="section" id="luminous-to-mimic">
<h3>Luminous to Mimic</h3>
<p>Along with numerous improvements and optimisations, the Mimic release also heralded
the deprecation of support for raw partitions for OSD backing, in favour of standardising
on LVM volumes.</p>
<p>Using an <a class="reference external" href="https://galaxy.ansible.com/mrlesmithjr/manage-lvm/">Ansible Galaxy role</a>,
we zapped our cluster and recreated a similar configuration within LVM. We retained the
same configuration of four OSDs associated with each physical NVME device. Benchmarking
the I/O performance using <tt class="docutils literal">fio</tt> revealed little discernible difference.</p>
<p>We redeployed the cluster using LVM and <cite>ceph-ansible</cite> and re-ran the <cite>rados bench</cite> tests.
At the Ceph level, however, the difference was dramatic for object sizes of 64K and bigger:</p>
<div class="figure">
<img alt="Mimic Ceph, LVM OSDs, RADOS read performance" src="//www.stackhpc.com/images/julia-mimic-knl-lvm.png" style="width: 750px;" />
</div>
<p>Reprovisioning again with partitions (and ignoring the deprecation warnings) restored, and
even increased, the previous levels of performance:</p>
<div class="figure">
<img alt="Mimic Ceph, raw partition OSDs, RADOS read performance" src="//www.stackhpc.com/images/julia-mimic-knl-raw.png" style="width: 750px;" />
</div>
<p>Taking into account the Spectre/Meltdown mitigations, Ceph Mimic
is delivering up to <strong>92%</strong> efficiency over the RADOS protocol.</p>
<p><strong>UPDATE</strong>: After presenting these findings at <a class="reference external" href="https://ceph.com/cephdays/ceph-day-berlin/">Ceph Day Berlin</a>,
Sage Weil introduced me to the Ceph performance team at Red Hat, and in particular Mark Nelson. Mark helped
confirm the issue and assisted with analysis of the root cause. It looks likely that Bluestore+LVM suffers the same issue
as XFS+LVM on Intel NVMe devices <a class="reference external" href="https://access.redhat.com/solutions/3406851">as reported here</a> (Red Hat
subscription required). The fix is to upgrade the kernel to the latest available for Red Hat / CentOS systems.</p>
<p>Unfortunately by this time JULIA had reached the end of its project lifespan and we were not able to verify this. However,
on a different system with a newer hardware configuration, I was able to confirm that the performance
issues occur with <tt class="docutils literal"><span class="pre">kernel-3.10.0-862.14.4.el7</span></tt> and are resolved in <tt class="docutils literal"><span class="pre">kernel-3.10.0-957.1.3.el7</span></tt>.</p>
</div>
<div class="section" id="native-network-performance-for-hpc-enabled-ceph">
<h3>Native Network Performance for HPC-Enabled Ceph</h3>
<p>When profiling the performance of this system using <tt class="docutils literal">perf</tt> and
<a class="reference external" href="http://www.brendangregg.com/flamegraphs.html">flame graph analysis</a>,
I found that under high load 52.5% of the time appeared to be spent
in networking, either in the Ceph messenger threads, the kernel
TCP/IP stack or the low-level device drivers.</p>
<div class="figure">
<img alt="Mimic Ceph, flame graph profile" src="//www.stackhpc.com/images/ceph-rados-flame-highlighted.png" style="width: 750px;" />
</div>
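<p>A profile of this kind can be captured with <tt class="docutils literal">perf</tt> and Brendan
Gregg's FlameGraph scripts, for example (a sketch; the sampling frequency and
duration are illustrative):</p>
<div class="highlight"><pre><span></span># Sample on-CPU call stacks of one OSD process for 30 seconds.
perf record -F 99 -g -p $(pidof -s ceph-osd) -- sleep 30

# Fold the stacks and render an SVG flame graph (scripts from
# https://github.com/brendangregg/FlameGraph).
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > ceph-osd-flame.svg
</pre></div>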
<p>A substantial amount of this time is actually spent in servicing
page faults (a side-effect of the Spectre/Meltdown mitigations)
when copying socket data between kernel space and user space. This
performance data makes a strong case, at least for systems with
this balance of compute, storage and networking, for bypassing
kernel space, bypassing TCP/IP (with its inescapable copying of
data) and moving to a messenger class that offers <a class="reference external" href="https://en.wikipedia.org/wiki/Remote_direct_memory_access">RDMA</a>.</p>
<p>When the end of the JULIA project was announced, and our users had left the system,
we upgraded Ceph one final time, from Mimic to the master branch.</p>
</div>
</div>
<div class="section" id="ceph-rdma-and-opa">
<h2>Ceph, RDMA and OPA</h2>
<p>Ceph has included messenger classes for RDMA for some time. However,
our previous experience of using these with a range of RDMA-capable
network fabrics (<a class="reference external" href="https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet">RoCE</a>,
InfiniBand and now
OPA) was that the messenger classes for RDMA work reasonably well
for RoCE but not for InfiniBand or OPA.</p>
<p>For RDMA support, the <tt class="docutils literal">systemd</tt> unit files for all communicating
Ceph processes must have virtual memory page pinning permitted, and
access to the devices required for direct communication with the network
fabric adapter:</p>
<p>For example, in <tt class="docutils literal"><span class="pre">/usr/lib/systemd/system/ceph-mon@.service</span></tt>, add:</p>
<div class="highlight"><pre><span></span><span class="o">[</span>Service<span class="o">]</span>
<span class="nv">LimitMEMLOCK</span><span class="o">=</span>infinity
<span class="nv">PrivateDevices</span><span class="o">=</span>no
</pre></div>
<p>Clients also require support for memory locking, which can be added by
inserting the following into <tt class="docutils literal">/etc/security/limits.conf</tt>:</p>
<div class="highlight"><pre><span></span>* hard memlock unlimited
* soft memlock unlimited
</pre></div>
<p>Fortunately Intel recently contributed <a class="reference external" href="https://github.com/ceph/ceph/blob/master/src/msg/async/rdma/RDMAIWARPConnectedSocketImpl.cc">support for iWARP</a> (another
RDMA-enabled network transport), which is not actually iWARP-specific
but does introduce use of a protocol parameter broker known as the
<a class="reference external" href="https://github.com/ofiwg/librdmacm">RDMA connection manager</a>, which
provides greater portability for RDMA connection establishment on a range
of different fabrics.</p>
<p>To enable this support in <tt class="docutils literal">/etc/ceph/ceph.conf</tt> (here for the OPA <tt class="docutils literal">hfi1</tt> NIC):</p>
<div class="highlight"><pre><span></span><span class="nv">ms_async_rdma_device_name</span> <span class="o">=</span> hfi1_0
<span class="nv">ms_async_rdma_polling_us</span> <span class="o">=</span> <span class="m">0</span>
<span class="nv">ms_async_rdma_type</span> <span class="o">=</span> iwarp
<span class="nv">ms_async_rdma_cm</span> <span class="o">=</span> True
<span class="nv">ms_type</span> <span class="o">=</span> async+rdma
</pre></div>
<p>Using the iWARP RDMA messenger classes (<strong>but actually on OPA and
InfiniBand</strong>) got us a lot further thanks to the connection manager
support. However, with OPA the maintenance of cluster membership
was irregular and unreliable. Further work is required to iron out
these issues, but unfortunately our time on JULIA has come to an end.</p>
<div class="section" id="looking-ahead">
<h3>Looking Ahead</h3>
<p>The project drew to a close before our work on RDMA could be completed
to satisfaction, and it is premature to post results here. I am
aware of other people becoming increasingly active in the Ceph RDMA
messaging space. In 2019 I hope to see the release of a <a class="reference external" href="https://github.com/Mellanox/ceph/tree/ucx">development
project by Mellanox</a>
to develop a new RDMA-enabled messenger class based on the <a class="reference external" href="http://www.openucx.org/">UCX
communication library</a>. (An equivalent
effort based on <a class="reference external" href="https://ofiwg.github.io/libfabric/">libfabric</a> could be even more compelling).</p>
<p>Looking further ahead, the adoption of Scylla's <a class="reference external" href="http://seastar.io/">Seastar</a> could potentially become a game-changer for
future developments with high-performance hardware-offloaded
networking.</p>
<p>For RDMA technologies to be adopted more widely, the biggest barriers
appear to be testing and documentation of best practice. If
we can, at StackHPC we hope to become more active in these areas
through 2019.</p>
</div>
<div class="section" id="acknowledgements">
<h3>Acknowledgements</h3>
<p>This work would not have been possible (or been far less informative) without
the help and support of a wide group of people:</p>
<ul class="simple">
<li>Adrian Tate and the team from the <a class="reference external" href="https://www.cray.com/company/partners/organizations-initiatives/cray-emea-research-lab">Cray EMEA Research Lab</a></li>
<li>Dan van der Ster from CERN</li>
<li>Mark Nelson, Sage Weil and the team from Red Hat</li>
<li>Lena Oden, Bastian Tweddell and the team from Jülich Supercomputer Centre</li>
</ul>
</div>
</div>
Federation and identity brokering using Keycloak2018-11-27T15:00:00+00:002018-11-27T15:00:00+00:00Nick Jonestag:www.stackhpc.com,2018-11-27:/federation-and-identity-brokering-using-keycloak.html<p>Using Keycloak to help facilitate federated cloud deployments</p><p>Federated cloud deployments encompass an ever-evolving set of requirements, particularly within areas of industry, commerce and research supporting high-performance (HPC) and high-throughput (HTC) scientific workloads. It's an area in which OpenStack really shines, through excellent support for <a href="https://docs.openstack.org/security-guide/identity/federated-keystone.html">federation protocols</a> and its standard API for the manipulation of infrastructure primitives, but at the same time the reality is that no two deployments are entirely alike - and this can cause problems for both users and operators.</p>
<p>If you're an optimistic sort then you could in fact view this as a strength of the platform, as it means that a given installation can be tailored according to the workload. For example, you might have one installation at a particular institution that's designed for provisioning and presenting interfaces to databases, and another which is developed to run a HPC job scheduler such as SLURM. Practically speaking though, the fact that each is installed on completely different selections of hardware, each according to the workload, is of no interest to our users. What they do care about is having access to each in a way that isn't bogged down with too much bureaucracy or burdensome tooling.</p>
<p>To that end, one of the key themes that overarches almost any architectural discussion with regards to federated cloud workloads is that of authentication and authorisation infrastructure (AAI). The need to provide a secure and compliant solution, yet one which is (ostensibly) seamless and as pain-free for users, is certainly a challenge - and getting this right is fundamental to a platform's adoption and its success.</p>
<p>Fortunately there are myriad tools and technologies available to help meet this. <a href="https://keycloak.org">Keycloak</a> is one such application, and it's one that we at StackHPC have been experimenting with as a proof-of-concept in order to provide the AAI 'glue' between two cloud deployments that we help to operate.</p>
<p>A few people have asked us to share our experiences and so this blog post is an attempt at summarising those. Also note that this post focuses on browser-driven interactions, as you'll see that a lot of the redirections make use of <a href="http://saml.xml.org/web-sso">WebSSO</a> where a web browser is required. A future blog post will delve into interactions using the API via OpenStack and related CLI tools.</p>
<h2>Introducing Keycloak</h2>
<p>First and foremost, Keycloak is an identity and access management service, capable of brokering authentication on behalf of clients using standard protocols such as <a href="https://openid.net/connect/faq/">OpenID Connect</a> (OIDC) and <a href="https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=security">OASIS SAML</a>. It's the upstream version of RedHat's enterprise <a href="https://access.redhat.com/products/red-hat-single-sign-on">Single Sign-On</a> offering and as such is well supported, developed and maintained.</p>
<p>Keystone already has good support for federated authentication in a variety of contexts, and there are existing services such as EGI's <a href="https://www.egi.eu/services/check-in/">Check-in</a> and Jisc's <a href="https://www.jisc.ac.uk/assent">Assent</a> that provide identity brokering using compatible protocols, so why introduce another moving component into the mix?</p>
<p>There are a number of reasons:</p>
<ul>
<li>You might want to be able to associate identity from a number of different sources for a particular user's account;</li>
<li>You might want to standardise the authentication protocol for Keystone on OpenID Connect but offer support for integration with identity providers using SAML;</li>
<li>A hub-and-spoke architecture (as outlined in the <a href="https://aarc-project.eu/architecture/">AARC Blueprint</a>) might be preferable to tightly coupling individual OpenStack clouds with one another in a full-mesh configuration, which is the case when using Keystone-to-Keystone;</li>
<li>You might need to be able to federate user authentication via internal sources (such as an Active Directory instance);</li>
<li>... and so on.</li>
</ul>
<p>For us, it was about being able to add another layer of control and flexibility into the authorisation piece. A proof-of-concept federated OpenStack using two disparate deployments was integrated using the aforementioned EGI Check-in solution, and while this worked well, we often found ourselves wanting more control over the attributes, or claims, that are presented and subsequently mapped via Keystone as part of the authorisation stage.</p>
<h1>Identity brokering with Keycloak</h1>
<h2>Authentication</h2>
<p>Let's review how Keycloak fits into the equation. A user makes a resource request via their service provider, which in return expects them to be authenticated. When Keystone is configured to use an identity provider (IdP), the user is redirected to the IdP's landing page - which in our case is Keycloak. Here the user is presented with a selection of login choices. Depending on their selection, they're then redirected a second time in order to perform the authentication step, and then Keycloak handles the transparent redirection and security assertion back to Keystone. At this point access is granted, assuming the right mappings are in place to grant membership of the user to a group which has permissions within scope of a project. On paper this sounds somewhat convoluted, but in practice it's reasonably slick and intuitive from a user's point of view:</p>
<p><img src="../images/alaska-keycloak-egi.gif" width="600"></p>
<p>In my case (in the video above), the Horizon login page redirected me to a Keycloak instance, which presented me with three authentication options. I selected the option which deferred this to EGI's Check-in service, in which I again deferred to Google, and I then used my StackHPC credentials. This provided Check-in with some context, which in turn was passed back to Keycloak, and onwards to Keystone. As part of this final step, Keystone is configured to map a particular OIDC claim containing my company affiliation to a <code>stackhpc</code> group which has the member role assigned in a <code>stackhpc</code> project, thus granting me access to resources on this service provider.</p>
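<p>For completeness, the group and project grant that this final mapping step relies on can be created with standard Keystone commands, along these lines (a sketch; the names are illustrative):</p>
<div class="highlight"><pre><span></span># Create the group that the federation mapping places users into,
# and grant it the member role on the target project.
openstack group create stackhpc
openstack project create stackhpc
openstack role add --group stackhpc --project stackhpc member
</pre></div>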
<p>This little demonstration neatly shows one of the immediate benefits of introducing Keycloak into your federation infrastructure - being able to maintain control over a diverse selection of potential authentication sources. At this point, I have an identity within Keycloak that an administrator can associate with other AAI primitives, including multiple IdPs, groups, security policies, and so on.</p>
<h3>Integrating OTP</h3>
<p>A little side note on what else is possible with Keycloak, a feature that could be of use even if you aren't interested in delegating authentication to another service. Keycloak provides support for One Time Passwords (OTP) - either time-based or counter-based - via FreeOTP or Google Authenticator. Thus, it's possible to federate your users with something such as Active Directory and at the same time add in another layer of security in the form of two-factor authentication.</p>
<p>Once a user has first signed up to Keycloak, either directly (such as via an invitation link), or indirectly by delegation to another configured IdP, they can log in to Keycloak and associate their login with an authenticator. With that in place, they can then access cloud resources on a given service provider using the credentials for their Keycloak account:</p>
<p><img src="../images/keycloak-login.png" width="700"></p>
<p>And then when prompted, enter the code generated using the Google Authenticator application:</p>
<p><img src="../images/keycloak-otp.png" width="700"></p>
<p>If that's successful then they're redirected back to Horizon with access to OpenStack resources - secured via the two factors used during the authentication process.</p>
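<p>For operators, the realm's OTP policy (time-based or counter-based, number of digits, and so on) can also be driven from Keycloak's admin CLI. A quick sketch, assuming a realm named <code>openstack</code> and a reachable admin account:</p>
<div class="highlight"><pre><span></span># Authenticate the admin CLI, then set a time-based OTP policy on the realm
kcadm.sh config credentials --server https://keycloak.example.com/auth \
    --realm master --user admin
kcadm.sh update realms/openstack -s otpPolicyType=totp -s otpPolicyDigits=6
</pre></div>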
<h2>Authorisation</h2>
<p>As mentioned earlier, one of the problems we were trying to solve was normalising, or having control over, the identity-related attributes (claims) that are presented to Keystone. It's these claims that give the cloud administrator control over who gets granted access to what, and services such as Check-in provide this entitlement context by hooking into external attribute sources such as <a href="https://www.internet2.edu/products-services/trust-identity/comanage/">COmanage</a> or <a href="https://perun-aai.org">Perun</a>. However, Keycloak can also assume the role of these components and populate claims based on its knowledge of a particular user. Let's take a quick look at this mapping process, and then at the Keycloak configuration used to influence it.</p>
<p>Here's a snippet of the JSON mappings file that's configured in Keystone and consulted whenever authentication is triggered via this IdP (Keycloak):</p>
<div class="highlight"><pre><span></span><span class="p">[</span>
<span class="p">{</span>
<span class="nt">"local"</span><span class="p">:</span> <span class="p">[</span>
<span class="p">{</span>
<span class="nt">"group"</span><span class="p">:</span> <span class="p">{</span>
<span class="nt">"id"</span><span class="p">:</span> <span class="s2">"44a46f4e41504e01ae77008c88dfc2da"</span>
<span class="p">},</span>
<span class="nt">"user"</span><span class="p">:</span> <span class="p">{</span>
<span class="nt">"name"</span><span class="p">:</span> <span class="s2">"{0}"</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">],</span>
<span class="nt">"remote"</span><span class="p">:</span> <span class="p">[</span>
<span class="p">{</span>
<span class="nt">"type"</span><span class="p">:</span> <span class="s2">"OIDC-preferred_username"</span>
<span class="p">},</span>
<span class="p">{</span>
<span class="nt">"type"</span><span class="p">:</span> <span class="s2">"HTTP_OIDC_ISS"</span><span class="p">,</span>
<span class="nt">"any_one_of"</span><span class="p">:</span> <span class="p">[</span>
<span class="s2">"https://aai-dev.egi.eu/oidc/"</span>
<span class="p">],</span>
<span class="p">},</span>
<span class="p">{</span>
<span class="nt">"type"</span><span class="p">:</span> <span class="s2">"OIDC-edu_person_scoped_affiliations"</span><span class="p">,</span>
<span class="nt">"any_one_of"</span><span class="p">:</span> <span class="p">[</span>
<span class="s2">"^.*@StackHPC$"</span>
<span class="p">],</span>
<span class="nt">"regex"</span><span class="p">:</span> <span class="kc">true</span>
<span class="p">}</span>
<span class="p">]</span>
<span class="p">}</span>
<span class="p">]</span>
</pre></div>
<p>This basically tells Keystone to create a federated user, associate it with the group that has the ID <code>44a46..</code>, and give it a username of whatever <code>OIDC-preferred_username</code> contains, as long as these two conditions are met:</p>
<ul>
<li>The <code>HTTP_OIDC_ISS</code> attribute is <code>https://aai-dev.egi.eu/oidc/</code>;</li>
<li>The <code>edu_person_scoped_affiliations</code> claim matches the regular expression <code>^.*@StackHPC$</code>.</li>
</ul>
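<p>For completeness, a rules file like this is registered with Keystone and bound to the identity provider via the federation API; the resource names below are assumptions for illustration:</p>
<div class="highlight"><pre><span></span># Register the mapping, then attach it to the Keycloak IdP for the OIDC protocol
openstack mapping create --rules keycloak_rules.json keycloak_mapping
openstack federation protocol create openid \
    --mapping keycloak_mapping --identity-provider keycloak
</pre></div>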
<p>These are standard claims returned from EGI Check-in. However, what if we wanted to make use of our own arbitrary grouping, so that we're not relying solely on the above selection to associate users with a particular group? Keycloak lets you create your own group and user associations, and these are then (by default) in scope for claims presented to Keystone for mapping consideration. So we can expand the above example, and perhaps replace the <code>OIDC-edu_person_scoped_affiliations</code> section with something that makes use of what we get via Keycloak:</p>
<div class="highlight"><pre><span></span><span class="p">{</span>
<span class="nt">"type"</span><span class="p">:</span> <span class="s2">"OIDC-group"</span><span class="p">,</span>
<span class="nt">"any_one_of"</span><span class="p">:</span> <span class="p">[</span>
<span class="s2">"StackHPC"</span>
<span class="p">],</span>
<span class="p">}</span>
</pre></div>
<p>Now, any user in Keycloak associated with the 'StackHPC' group, regardless of their authentication source (so long as it's valid!), will be able to access resources with whatever role is associated with the OpenStack group ID shown in the previous example. Here's what it looks like from Keycloak's point of view; in this group I have three users, each of which has a different linked IdP:</p>
<p><img src="../images/keycloak-groups.png" width="700"></p>
<p>If we look at the linked IdP for my personal Google account:</p>
<p><img src="../images/keycloak-user-idp-link.png" width="700"></p>
<p>The grouping is an abstraction handled by Keycloak, but it gives us control over which users have access to our OpenStack deployment. This is a simple example and it's possible to do much, much more - including specifying additional attributes to provide further authorisation context, as well as more complicated abstractions such as nested groups - but hopefully this gives you a flavour of what's possible.</p>
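<p>Under the hood, a claim such as <code>OIDC-group</code> is produced by a protocol mapper on the Keycloak client. A sketch of what the relevant mapper might look like in the client's representation (the values here are illustrative assumptions):</p>
<div class="highlight"><pre><span></span>{
    "name": "groups",
    "protocol": "openid-connect",
    "protocolMapper": "oidc-group-membership-mapper",
    "config": {
        "claim.name": "group",
        "full.path": "false",
        "id.token.claim": "true",
        "userinfo.token.claim": "true"
    }
}
</pre></div>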
<h2>Further information</h2>
<p>Our proof-of-concept and investigation with federated Keystone would have been immeasurably more difficult if it wasn't for Colleen Murphy's fantastic blog posts, <a href="http://www.gazlene.net/federation-devstack.html">here</a> and <a href="http://www.gazlene.net/demystifying-keystone-federation.html">here</a>.</p>
<p>It's also worth mentioning that the Keystone team are currently working on identity provider proxying functionality, which might make Keycloak redundant in a lot of cases. The Etherpad used at the Berlin Summit for the Stein release to gather requirements is <a href="https://etherpad.openstack.org/p/BER-stein-keystone-as-idp">here</a>, and there's some more information in the Stein PTG Etherpad <a href="https://etherpad.openstack.org/p/keystone-stein-ptg">here</a>.</p>
<p>Finally, we'd also like to thank our friends and colleagues at the <a href="https://www.hpc.cam.ac.uk/">University of Cambridge</a> for their assistance with the infrastructure resources that made this proof-of-concept possible.</p>Monasca comes to Kolla2018-11-08T11:00:00+00:002018-11-08T11:00:00+00:00Doug Szumskitag:www.stackhpc.com,2018-11-08:/monasca-comes-to-kolla.html<p class="first last">StackHPC and the Kolla community have added support for Monasca
in Kolla.</p>
<p>Back near the dawn of time in December, 2016, Sam Yaple created a
<a class="reference external" href="https://blueprints.launchpad.net/kolla/+spec/monasca-containers">spec</a>
to add Monasca containers to Kolla. Aeons later
<a class="reference external" href="https://blueprints.launchpad.net/kolla-ansible/+spec/monasca-roles">Kolla-Ansible</a>
finally supports deploying Monasca out-of-the-box. Much like crossing the
<a class="reference external" href="https://imgur.com/r/ImagesOfEngland/4pMhdMl">Magic Roundabout</a>
in Swindon, many things had to line up to make it happen. The ground was
paved by adding support for
<a class="reference external" href="https://kafka.apache.org/">Apache Kafka</a>,
<a class="reference external" href="https://zookeeper.apache.org/">Zookeeper</a>,
<a class="reference external" href="http://storm.apache.org/">Storm</a> and
<a class="reference external" href="https://www.elastic.co/products/logstash">Logstash</a>.
Then came the Monasca services, rolled out one-by-one until the
<a class="reference external" href="https://www.fluentd.org/">Fluentd</a>
firehose was coupled up to the Monasca Log API. The CI system
creaked, the lights went dim and the core reviewers groaned as Zuul unleashed
a colossal chunk of Ansible. No longer did one have to carefully deploy,
configure and maintain an uncountable number of services. Injurious crashes
were reduced by
<a class="reference external" href="http://www.trb.org/Publications/Blurbs/164470.aspx">three quarters</a>
and sanity returned to the Monasca sysadmins. So what exactly did the end
result look like?</p>
<div class="figure">
<img alt="Monasca overview" src="//www.stackhpc.com/images/monasca.png" style="width: 720px;" />
</div>
<p>At this stage you might be thinking that
<a class="reference external" href="https://www.graphviz.org/">Graphviz</a>
has just exploded on your screen, or even that someone keeled over and died
while drawing the diagram. But if you defocus your eyes a little further,
you'll see that there can actually be three, or even more of everything.
Three APIs, three instances of almost anything you can see, with traffic
pirouetting through a Kafka cluster in between. The only things which spoil
the fun are InfluxDB, which requires an enterprise license for clustering,
and the Monasca fork of Grafana, which just doesn't seem to play nicely
with load balancing.</p>
<p>So what is it like to run this monster in production? Does it deliver? Why
on Earth would you want to do it? We actually have some compelling reasons
which we'll summarise below:</p>
<ul class="simple">
<li><dl class="first docutils">
<dt>Horizontally scalable</dt>
<dd>We love working with small deployments, and supporting these matters
greatly to us, but in the world of HPC, machines can get really huge.
Indeed, it's not uncommon for small deployments to morph into large ones,
and with Monasca, no matter where you start, you can seamlessly scale
with demand.</dd>
</dl>
</li>
<li><dl class="first docutils">
<dt>Multi-tenant</dt>
<dd>Add value to your OpenStack deployment. Through the power of automation
it's true that you could stamp out a monitoring and logging solution per
tenant without too much fuss. However, it's hard to beat simply logging
in via a public endpoint with your OpenStack credentials.</dd>
</dl>
</li>
<li><dl class="first docutils">
<dt>Highly available / fault tolerant</dt>
<dd>Kolla Monasca has been designed to provide a single pane of glass for
monitoring the health of your OpenStack deployment. If a wheel falls
off, we don't want you scrambling for the spare tyre. All critical
monitoring and logging services can be deployed in a highly available
and fault tolerant configuration.</dd>
</dl>
</li>
<li><dl class="first docutils">
<dt>Support for push-metrics</dt>
<dd>In big systems there are often complex interactions and understanding
these is part of the art of HPC. What's more, complex interactions don't
tend to happen at fixed time intervals. Support for push-metrics allows
users to stream batches of data into Monasca with a sampling frequency
of whatever they like. So whether you're tuning traffic flows in your
network fabric, or optimising your MPI routine, Monasca has you covered.</dd>
</dl>
</li>
</ul>
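<p>As a taster, switching Monasca on in an existing deployment is pleasingly brief - a minimal sketch, assuming a Kolla Ansible release that ships the Monasca roles:</p>
<div class="highlight"><pre><span></span># /etc/kolla/globals.yml
enable_monasca: "yes"
</pre></div>
<p>followed by the usual <tt class="docutils literal"><span class="pre">kolla-ansible</span> pull</tt> and <tt class="docutils literal"><span class="pre">kolla-ansible</span> deploy</tt>.</p>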
<p>So without further ado, we're going to hand you over to the
<a class="reference external" href="https://docs.openstack.org/kolla-ansible/latest/reference/logging-and-monitoring/monasca-guide.html">Kolla documentation</a>.
Unlike the Magic Roundabout you'll have two paths to
follow: the brave can enable Monasca in their existing Kolla Ansible
deployment, and the cautious can choose to
<a class="reference external" href="https://docs.openstack.org/kolla-ansible/latest/reference/logging-and-monitoring/monasca-guide.html#stand-alone-configuration-optional">deploy Monasca standalone</a>
and integrate, if they wish, with an external instance of Keystone which
doesn't need to be provided by Kolla. We hope that you like it, and
most of all we hope that you find it useful.</p>
Kubernetes, HPC and MPI2018-11-07T12:00:00+00:002018-11-07T12:00:00+00:00Kitrick Sheetstag:www.stackhpc.com,2018-11-07:/k8s-mpi.html<p class="first last">Convergence of HPC and Cloud will not stop at the
infrastructure level. How can applications and users take the
greatest advantage from cloud-native technologies to deliver on
HPC-native requirements? How can we separate true progress from a
blind love of the shiny?</p>
<p>The last decade has continued the rapid movement toward the
consolidation of hardware platforms around common processor
architectures and the adoption of Linux as the defacto base operating
system, leading to the emergence of large scale clusters applied
to the HPC market. Then came the adoption of elastic computing
concepts around AWS, OpenStack, and Google Cloud. While these elastic
computing frameworks have been focused on the ability to provide
on-demand computing capabilities, they have also introduced the
powerful notion of self-service software deployments. The ability
to pull from any number of sources (most commonly open source
projects) for content to stitch together powerful software ecosystems
has become the norm for those leveraging these cloud infrastructures.</p>
<p>The quest for ultimate performance has come at a significant price
for HPC application developers over the years. Tapping into the
full performance of an HPC platform typically involves integration
with the vendor’s low-level “special sauce”, which entails vendor
lock-in. For example, developing and running an application on an
IBM Blue Gene system is significantly different than HP Enterprise
or a Cray machine. Even in cases where the processor and even the
high-speed interconnects are the same, the operating runtime, storage
infrastructure, programming environment, and batch infrastructure
are likely to be different in key respects. This means that running
the same simulations on machines from different vendors within or
across data centers requires significant customization effort.
Further, the customer is at the mercy of the system vendor for
software updates to the base operating systems on the nodes or
programming environment libraries, which in many cases significantly
inhibits a customer’s ability to take advantage of the latest updates
to common utilities or even entire open source component ecosystems.</p>
<p>For these and other reasons, HPC customers are now clamoring
for the ability to run their own ‘user defined’ software stacks
using familiar containerized software constructs.</p>
<div class="section" id="the-case-for-containers">
<h2>The Case for Containers</h2>
<p>Containers hold great promise for enabling the delivery of
<a class="reference external" href="https://permalink.lanl.gov/object/tr?what=info:lanl-repo/lareport/LA-UR-16-22370">user-defined software stacks</a>.
We have covered <a class="reference external" href="//www.stackhpc.com/the-state-of-hpc-containers.html">the state of HPC containers</a> in
a previous post.</p>
<p>Cloud computing users are given the freedom to leverage a variety
of pre-packaged images or even build their own images and deploy
them into their provisioned compute spaces to address their specific
needs. Container infrastructures have taken this a step further by
leveraging the <a class="reference external" href="http://man7.org/linux/man-pages/man7/namespaces.7.html">namespace isolation</a> capabilities
of contemporary Linux kernels to provide light-weight, efficient,
and secure packaging and runtime environment in which to execute
sophisticated applications. Container images are immutable and
self-sufficient, which make them very portable and for the most
part immune to the OS distribution on which they are deployed.</p>
</div>
<div class="section" id="kubernetes-once-more-unto-the-breach">
<h2>Kubernetes - Once More Unto the Breach...</h2>
<p>Over recent years, containerization (outside of HPC) has consolidated
around two main technologies, <a class="reference external" href="https://www.docker.com/">Docker</a>
and <a class="reference external" href="https://kubernetes.io/">Kubernetes</a>. Docker provides a core
infrastructure for the construction and maintenance of software
stacks, while Kubernetes provides a robust container orchestrator
that manages the coordination and distribution of containerized
applications within a distributed environment.</p>
<p>Kubernetes has risen to the top in the challenge to provide
orchestration and management for containerized software components
due to its rich ecosystem and scaling properties. Kubernetes has
proven quite successful for cloud-native workloads, high-throughput
computing and data analytics workflows. But <a class="reference external" href="https://stackoverflow.com/questions/38093438/kubernetes-and-mpi">what about conventional
HPC workloads</a>? As
we will discuss below, there are some significant challenges to the
full integration of Kubernetes with the conventional HPC problem
space but is there a path to convergence?</p>
</div>
<div class="section" id="a-bit-of-history">
<h2>A Bit of History</h2>
<p>To understand the challenges facing the full adoption of open
container ecosystems for HPC, it is helpful to <a class="reference external" href="//www.stackhpc.com/openstack-and-hpc-workloads.html">present some of the
unique needs</a> of this
problem space. We’ve provided a survey of the current state of
<a class="reference external" href="//www.stackhpc.com/the-state-of-hpc-containers.html">containers in HPC</a>
in a previous blog post.</p>
<div class="section" id="taxonomy-of-hpc-workloads">
<h3>Taxonomy of HPC Workloads</h3>
<p>Conventionally, HPC workloads have been made up of a set of
purpose-driven applications designed to solve specific scientific
simulations. These simulations can consist of a series of small
footprint short-lived ‘experiments’, whose results are aggregated
to obtain a particular target result; or large-scale, data-parallel
applications that can execute across many thousands of nodes within
the system. These two types of applications are commonly referred
to as <em>capacity</em> and <em>capability</em> applications respectively.</p>
<div class="figure">
<img alt="Submitted Jobs vs Requested Cores" src="//www.stackhpc.com/images/slurm-jobs-vs-cores.png" style="width: 750px;" />
<p class="caption"><em>Data from an operational HPC cluster demonstrating that dominant
usage of this resource is for sequential or single-node
multi-threaded workloads. What is not shown here is that the
large-scale parallel workloads have longer runtimes, resulting
in a balanced mix of use cases for the infrastructure.</em></p>
</div>
<p>Capability computing refers to applications built to leverage the
unique capabilities or attributes of an HPC system. This could be
a special high performance network with exceptional <a class="reference external" href="https://courses.cs.washington.edu/courses/csep524/99wi/lectures/lecture7/sld006.htm">bisection
bandwidth</a>
to support large scale applications, nodes with large memory capacity
or specialized computing capabilities of the system (e.g., GPUs)
or simply the scale of the system that enables the execution of
extreme-scale applications. Capacity computing, on the other hand,
refers to the ability of a system to hold large numbers of simultaneous
jobs, essentially providing extreme throughput of small and modest
sized jobs from the user base.</p>
<p>There are several critical attributes that HPC system users and
managers demand to support an effective infrastructure for these
classes of jobs. A few of the most important include:</p>
<ol class="arabic">
<li><p class="first"><strong>High Job Throughput</strong></p>
<p>Due to the significant financial commitment required to build and
operate large HPC systems, the ability to maximize these resources
on the solution of real science problems is critical. In most HPC
data centers, accounting for the utilization of system resources
is a primary focus of the data center manager. For this reason,
much work has been expended on the development of Workload Managers
(WLMs) to efficiently and effectively schedule and manage large
numbers of application jobs on to HPC systems. These WLMs sometimes
integrate tightly with system vendor capabilities for advanced
node allocation and task placement to ensure most effective use
of the underlying computing resource.</p>
</li>
<li><p class="first"><strong>Low Service Overhead</strong></p>
<p>For research scientists, time to solution is key. One important
example is weather modeling. Simulations have a very strict time
deadline as results must be provided in a timely way to release
to the public. The amount of computing capacity available to apply
to these simulations directly impacts the accuracy, granularity
and scope of the results that can be produced.</p>
<p>Such large-scale simulations are commonly referred to as <a class="reference external" href="https://computing.llnl.gov/tutorials/parallel_comp">data
parallel</a> applications. These applications typically process a
large data set in manageable pieces, spread in parallel across many
tasks. Parallelism occurs both within nodes and between nodes -
for which data is exchanged between tasks over high speed networking
fabrics using communication libraries such as Partitioned Global
Address Space (<a class="reference external" href="http://www.nersc.gov/users/training/online-tutorials/introduction-to-pgas-languages/">PGAS</a>) or Message Passing Interface (<a class="reference external" href="https://computing.llnl.gov/tutorials/mpi/">MPI</a>).</p>
<p>These distributed applications are highly synchronized and typically
exchange data after some fixed period of computation. Due to this
synchronization, they are very sensitive to, amongst other things,
drift between the tasks (nodes). Any deviation by an individual
node will often cause a delay in the continuation of the overall
simulation. This deviation is commonly referred to as <a class="reference external" href="https://www.computerworld.com/article/2548377/linux/lightweight-linux-for-high-performance-computing.html">jitter</a>.
A significant amount of work has been done to mitigate or eliminate
such effects within HPC software stacks. So much so, that many
large HPC system manufacturers have spent significant resources
to identify and eliminate or isolate tasks that have the potential
to induce jitter in the Linux kernels that they ship with their
systems. As customers reap direct benefit from these changes, it
would be expected that any containerized infrastructure would be
assumed to carry forward similar benefits. This would presume
that any on-node presence supporting container scheduling or
deployment would present minimal impact to the application workload.</p>
</li>
<li><p class="first"><strong>Advanced Scheduling Capabilities</strong></p>
<p>Many HPC applications have specific requirements relative to where
they are executed within the system. Each task (rank) of
an application may need to communicate with specific neighboring
tasks, and so prefers to be placed topologically close to those
neighbors. Other
tasks within the application may be sensitive to the performance
of the I/O subsystem and as such may prefer to be placed in areas
of the system where I/O throughput or response times are more
favorable. Finally, individual tasks of an application may require
access to specialized computing hardware, including nodes with
specific processor types attached processing accelerators (e.g.,
GPUs). What’s more, individual threads of a task are scheduled
in such a way as to avoid interference by work unrelated to the
user’s job (e.g., operating system services or support infrastructure,
such as monitoring). Interference with the user’s job by these
supporting components has a direct and measurable impact on overall
job performance.</p>
</li>
</ol>
<div class="figure">
<img alt="St George and the Dragon (Wikipedia, public domain)" src="//www.stackhpc.com/images/st-george.jpg" style="width: 300px;" />
</div>
</div>
</div>
<div class="section" id="the-role-of-pmi-x">
<h2>The Role of PMI(x)</h2>
<p>The <a class="reference external" href="http://mpitutorial.com/tutorials/mpi-introduction/">Message Passing Interface (MPI)</a> is the most
common mechanism used by data-parallel applications to exchange
information. There are many implementations of MPI, ranging from
OpenMPI, which is a community effort, to vendor-specific MPI
implementations, which integrate closely with vendor-supplied
programming environments. One key building block on which all MPI
implementations are built is the Process Management Interface (<a class="reference external" href="https://www.mcs.anl.gov/papers/P1760.pdf">PMI</a>). PMI provides the
infrastructure for an MPI application to distribute the information
about all of the other participants across an entire application.</p>
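<p>To make this concrete: even the smallest MPI program relies on this exchange, which happens inside <tt class="docutils literal">MPI_Init</tt> before any ranks communicate. A minimal sketch (compiled with <tt class="docutils literal">mpicc</tt> and launched with <tt class="docutils literal">mpiexec</tt> or <tt class="docutils literal">srun</tt>):</p>
<div class="highlight"><pre><span></span>/* A minimal MPI program: the PMI(x) wire-up happens inside MPI_Init,
   letting every rank discover the job size and its own rank. */
#include &lt;mpi.h&gt;
#include &lt;stdio.h&gt;

int main(int argc, char **argv)
{
    MPI_Init(&amp;argc, &amp;argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &amp;rank);
    MPI_Comm_size(MPI_COMM_WORLD, &amp;size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
</pre></div>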
<p>PMI is a standardized interface which has gone through a few
iterations, each with improvements to support increased job scale
with reduced overhead. The most recent version, <a class="reference external" href="https://pmix.org/">PMIx</a>,
is an attempt to develop a standardized process management library
capable of supporting the exchange of connection details for
applications deployed on exascale systems reaching upwards of 100K
nodes and a million ranks. The goal of the project is to achieve
this ambitious scaling without compromising the needs of more modest
sized clusters. In this way, PMIx intends to support the full range
of existing and anticipated HPC systems.</p>
<p>Early evaluation of launch performance in the wire-up phase of PMIx
is quite illuminating, as can be seen from this <a class="reference external" href="https://slurm.schedmd.com/SC17/Mellanox_Slurm_pmix_UCX_backend_v4.pdf">SuperComputing '17
presentation</a>.
This presentation shows the performance advantages in launch times
as the number of on-node ranks increases by utilizing a native PMIx
runtime TCP interchange to distribute wire-up information rather
than using <a class="reference external" href="https://slurm.schedmd.com/">Slurm</a>’s integrated RPC
capability. The presentation then goes on to show how an additional
two orders of magnitude of improvement can be gained by leveraging native communication
interfaces of the platform through the <a class="reference external" href="http://www.openucx.org/introduction/">UCX communication stack</a>. While this discussion
isn’t intended to focus on the merits of one specific approach over
another for launching and initializing a data parallel application,
it does help to illustrate the sensitivity of these applications
to the underlying distributed application support infrastructure.</p>
<div class="figure">
<img alt="Dürer's Rhinoceros (Wikipedia, public domain)" src="//www.stackhpc.com/images/durers-rhinoceros.png" style="width: 400px;" />
</div>
</div>
<div class="section" id="full-integration-of-open-container-frameworks-with-conventional-hpc-workflows">
<h2>Full Integration of Open Container Frameworks with Conventional HPC Workflows</h2>
<p>There are projects underway with the goal of integrating Kubernetes
with MPI. One notable approach, <a class="reference external" href="https://github.com/everpeace/kube-openmpi">kube-openmpi</a>, uses Kubernetes to
launch a cluster of containers capable of supporting the target
application set. Once this Kubernetes namespace is created, it is
possible to use <tt class="docutils literal">kubectl</tt> to launch and <tt class="docutils literal">mpiexec</tt> applications into
the namespace and leverage the deployed OpenMPI environment.
(<tt class="docutils literal"><span class="pre">kube-openmpi</span></tt> only supports OpenMPI, as the name suggests).</p>
<p>Another framework, <a class="reference external" href="https://www.kubeflow.org/">Kubeflow</a>,
also supports execution of MPI tasks atop Kubernetes. Kubeflow’s
focus is evidence that the driving force for MPI-Kubernetes integration
will be large-scale machine learning. Kubeflow uses a secondary
scheduler within Kubernetes, <a class="reference external" href="https://kubernetes-sigs.github.io/kube-batch/">kube-batch</a>,
to support the scheduling, and uses OpenMPI and a companion <cite>ssh</cite> daemon for the <a class="reference external" href="https://github.com/kubeflow/kubeflow/blob/master/kubeflow/openmpi/assets/init.sh">launch of MPI-based jobs</a>.</p>
<p>While approaches such as <tt class="docutils literal"><span class="pre">kube-openmpi</span></tt> and <tt class="docutils literal">kubeflow</tt> provide the
ability to launch MPI-based applications as Kubernetes jobs atop a containerized
cluster, they essentially <a class="reference external" href="https://www.mail-archive.com/devel@lists.open-mpi.org/msg20533.html">replicate existing <em>flat earth</em> models</a>
for data-parallel application launch within the context of an
ephemeral container space. Such approaches do not fully leverage
the flexibility of the elastic Kubernetes infrastructure, or support
the critical requirements of large-scale HPC environments, as
described above.</p>
<p>In some respects, <tt class="docutils literal"><span class="pre">kube-openmpi</span></tt> is another example of the <em>fixed
use</em> approach to the use of containers within HPC environments. For
the most part there have been two primary approaches. Either launch
containers into a conventional HPC environment using existing
application launchers (e.g., <a class="reference external" href="http://www.nersc.gov/research-and-development/user-defined-images/">Shifter</a>,
<a class="reference external" href="https://singularity.lbl.gov/">Singularity</a>, etc.), or emulate a
conventional data parallel HPC environment atop a container deployment
(à la <tt class="docutils literal"><span class="pre">kube-openmpi</span></tt>).</p>
<p>While these approaches are serviceable for single-purpose environments
or environments with relatively static or purely ephemeral use
cases, problems arise when considering a mixed environment where
consumers wish to leverage conventional workload manager-based
workflows in conjunction with a native container environment. In
cases where such a mixed workload is desired, the problem becomes
how to coordinate the submission of work between the batch scheduler
(e.g., Slurm) and the container orchestrator (e.g., Kubernetes).</p>
<p>Another approach to this problem is to <a class="reference external" href="https://kubernetes.io/blog/2017/08/kubernetes-meets-high-performance/">use a meta-scheduler</a>
that coordinates the work across the disparate domains. This approach
has been developed and promoted by Univa through their <a class="reference external" href="https://blogs.univa.com/2017/04/running-mixed-workloads-on-kubernetes-with-navops-command/">Navops
Command infrastructure</a>.
Navops is based on the former Sun Grid Engine, originally developed
by Sun Microsystems, then acquired by Oracle, before eventually landing
at Univa.</p>
<p>While Navops provides an effective approach to addressing these
mixed use coordination issues, it is a proprietary approach and
limits the ability to leverage common and open solutions across the
problem space. Given the momentum of this space and the desire to
leverage emerging technologies for user-defined software stacks
without relinquishing the advances made in the scale supported by
the predominant workload schedulers, it should be possible to develop
cleanly integrated, open solutions which support the set of existing
and emerging use cases.</p>
<div class="figure">
<img alt="He, over all the starres doth raigne, that unto wisdome can attaine..." src="//www.stackhpc.com/images/wither-emblem-wisdom.jpg" style="width: 350px;" />
</div>
</div>
<div class="section" id="what-next">
<h2>What Next?</h2>
<p>So what will it take to truly develop and integrate a fully open,
scalable, and flexible HPC stack that can leverage the wealth of
capabilities provided by an elastic infrastructure? The following
presents items on our short list:</p>
<ol class="arabic">
<li><p class="first"><strong>Peaceful Coexistence of Slurm with Kubernetes</strong></p>
<p>Slurm has become the de facto standard for open management of
conventional HPC batch-oriented, distributed workloads. Likewise,
Kubernetes dominates in the management of flexible, containerized
application workloads. Melding these two leading technologies
cleanly in a way that leverages the strengths of each without
compromising the capabilities of either will be key to the
realization of the full potential of elastic computing within
the HPC problem space.</p>
<p>Slurm already integrates with existing custom (and ostensibly
closed) frameworks such as Cray’s Application Level Placement
Scheduler (ALPS). It has been proven through integration
efforts such as this that there is significant gain to be made
by leveraging capabilities provided by such infrastructures.
ALPS has been designed to manage application launch at scale and
manage the runtime ecosystem (including network and compute
resources) required by large, hero-class applications.</p>
<p>Like these scaled job launchers, Kubernetes provides significant
capability for placement, management, and deployment of applications.
However, it provides a much richer set of capabilities to manage
containerized workflows that are familiar to those who are
leveraging cloud-based ecosystems.</p>
<p>While the flexibility of cloud computing allows users to easily
spin up a modest-sized set of cooperating resources on which to
launch distributed applications, within a conventional HPC
infrastructure, designed for the execution of petascale and
(coming soon) exascale applications, there are real resource
constraints at play that require a more deliberate approach at
controlling and managing the allocation and assignment of these
resources.</p>
<p>The ability to manage such a conventional workload-based placement
strategy in conjunction with emerging container-native workflows
has the potential of significantly extending the reach and
broadening the utility of high performance computing platforms.</p>
</li>
<li><p class="first"><strong>Support for Elasticity within Slurm</strong></p>
<p>Slurm is quite effective in the management of the scheduling and
placement of conventional distributed applications onto nodes
within an HPC infrastructure. As with most conventional job
schedulers, Slurm assumes that it is managing a relatively static
set of compute resources. Compute entities (nodes) can come and
go during the lifetime of a Slurm cluster. However, Slurm prefers
that the edges of the cluster be known a priori so that all hosts
can be aware of all others. In other words, the list of compute
hosts is distributed to all hosts in the cluster when the Slurm
instance is initialized. Slurm then manages the workload across
this set of hosts. However, management of a dynamic infrastructure
within Slurm can be a challenge, although its power-saving hooks
offer a starting point (see the sketch after this list).</p>
</li>
<li><p class="first"><strong>Mediation of Scheduler Overhead</strong></p>
<p>There is a general consensus that there are tangible advantages
to the use of on-demand computing to solve high performance
computing problems. There is also general consensus that the
flexibility of an elastic infrastructure brings with it a few
undesirable traits. The one that receives the most attention is
added overhead. Any additional overhead has a direct impact on
the usable computing cycles that can be applied by the target
platform to the users’ applications. The source of that overhead,
however, is in the eye of the beholder. If you ask someone focused
on the delivery of containers, they would point to the bare-metal
or virtual machine infrastructure management (e.g., OpenStack)
as a significant source of this overhead. If you were to ask an
application consumer attempting to scale a large, distributed
application, they would likely point at the container scheduling
infrastructure (e.g., Kubernetes) as a significant scaling
concern. For this reason, it is common to hear comments like,
“OpenStack doesn’t scale”, or “Kubernetes doesn’t scale”. Both
are true… and neither is true. It really depends on your
perspective and the way in which you are trying to build the
infrastructure.</p>
<p>This attitude tends to cause a stovepiping of solutions to address
specific portions of the problem space. What is really needed
is a holistic view, covering a range of capabilities and solutions
and a concerted effort to provide integrated solutions. An
ecosystem that exposes the advantages of each of the components
of elastic infrastructure management, containerized software
delivery, and scaled, distributed application support, while
providing seamless coexistence of familiar workflows across these
technologies would provide tremendous opportunities for the
delivery of high performance computing solutions into the next
generation.</p>
</li>
</ol>
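<p>On the elasticity point above, Slurm's power-saving support (documented as "Elastic Computing") hints at what a more dynamic model could look like. A sketch of the relevant <tt class="docutils literal">slurm.conf</tt> knobs - the option names are real, while the scripts and values are assumptions:</p>
<div class="highlight"><pre><span></span># slurm.conf -- hooks Slurm can use to create and destroy nodes on demand
SuspendProgram=/usr/local/bin/delete-instance
ResumeProgram=/usr/local/bin/create-instance
SuspendTime=600
NodeName=cloud[001-100] State=CLOUD
</pre></div>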
<p>If you would like to get in touch we would love to hear
from you. Reach out to us via <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a>
or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact page</a>.</p>
</div>
Deploying Performant Parallel Filesystems: Ansible and BeeGFS2018-10-10T12:00:00+01:002018-10-10T12:00:00+01:00Bharat Kunwartag:www.stackhpc.com,2018-10-10:/ansible-role-beegfs.html<p class="first last">We explore the features of our Ansible BeeGFS role to provide
disaggregated and hyperconverged storage solution.</p>
<p><a class="reference external" href="https://www.beegfs.io">BeeGFS</a> is a parallel file system suitable
for High Performance Computing with a proven track record in the
scalable storage solution space. In this article, we explore how different
components of BeeGFS are pieced together and how we have incorporated
them into an Ansible role for a seamless storage cluster deployment
experience.</p>
<div class="figure">
<img alt="BeeGFS logo" src="//www.stackhpc.com/images/beegfs-logo.png" style="width: 200px;" />
</div>
<p>We've previously described ways of integrating <a class="reference external" href="//www.stackhpc.com/openstack-and-high-performance-data.html">OpenStack and
High-Performance Data</a>. In
this post we'll focus on some practical details for how to dynamically
provision BeeGFS filesystems and/or clients running in cloud
environments. There are actually no dependencies on OpenStack here
- although we do like to draw our Ansible inventory from
<a class="reference external" href="//www.stackhpc.com/cluster-as-a-service.html">Cluster-as-a-Service infrastructure</a>.</p>
<p><a class="reference external" href="https://www.beegfs.io/docs/whitepapers/Introduction_to_BeeGFS_by_ThinkParQ.pdf">As described here</a>,
BeeGFS has components which may be familiar concepts to those working
in the parallel file system solution space:</p>
<ul class="simple">
<li>Management service: for registering and watching all other services</li>
<li>Storage service: for storing the distributed file contents</li>
<li>Metadata service: for storing access permissions and striping info</li>
<li>Client service: for mounting the file system to access stored data</li>
<li>Admon service (optional): for presenting administration and
monitoring options through a graphical user interface.</li>
</ul>
<div class="figure">
<img alt="Ansible logo" src="//www.stackhpc.com/images/ansible-thumbnail.png" style="width: 200px;" />
</div>
<p>Introducing our Ansible role for BeeGFS...</p>
<p>We have an Ansible role published on <a class="reference external" href="https://galaxy.ansible.com/stackhpc/beegfs">Ansible Galaxy</a> which handles the
end-to-end deployment of BeeGFS. It takes care of details all the way
from deployment of management, storage and metadata servers to setting
up client nodes and mounting the filesystem. To install, simply run:
<div class="highlight"><pre><span></span>ansible-galaxy install stackhpc.beegfs
</pre></div>
<p>There is <a class="reference external" href="https://github.com/stackhpc/ansible-role-beegfs/blob/master/README.md">a README</a>
that describes the role parameters and example usage.</p>
<p>An Ansible inventory is organised into groups, each representing a
different role within the filesystem (or its clients). An example
<tt class="docutils literal"><span class="pre">inventory-beegfs</span></tt> file with two hosts <tt class="docutils literal">bgfs1</tt> and <tt class="docutils literal">bgfs2</tt> may
look like this:</p>
<div class="highlight"><pre><span></span><span class="k">[leader]</span>
<span class="na">bgfs1 ansible_host</span><span class="o">=</span><span class="s">172.16.1.1 ansible_user=centos</span>
<span class="k">[follower]</span>
<span class="na">bgfs2 ansible_host</span><span class="o">=</span><span class="s">172.16.1.2 ansible_user=centos</span>
<span class="k">[cluster:children]</span>
<span class="na">leader</span>
<span class="na">follower</span>
<span class="k">[cluster_beegfs_mgmt:children]</span>
<span class="na">leader</span>
<span class="k">[cluster_beegfs_mds:children]</span>
<span class="na">leader</span>
<span class="k">[cluster_beegfs_oss:children]</span>
<span class="na">leader</span>
<span class="na">follower</span>
<span class="k">[cluster_beegfs_client:children]</span>
<span class="na">leader</span>
<span class="na">follower</span>
</pre></div>
<p>Through controlling the membership of each inventory group, it is
possible to create <a class="reference external" href="https://www.beegfs.io/wiki/SystemArchitecture">a variety of use cases and configurations</a>. For example,
client-only deployments, server-only deployments, or hyperconverged
use cases in which the filesystem servers are also the clients
(as above).</p>
<p>A minimal Ansible playbook which we shall refer to as <tt class="docutils literal">beegfs.yml</tt> to
configure the cluster may look something like this:</p>
<div class="highlight"><pre><span></span><span class="nn">---</span>
<span class="p p-Indicator">-</span> <span class="nt">hosts</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="l l-Scalar l-Scalar-Plain">cluster_beegfs_mgmt</span>
<span class="p p-Indicator">-</span> <span class="l l-Scalar l-Scalar-Plain">cluster_beegfs_mds</span>
<span class="p p-Indicator">-</span> <span class="l l-Scalar l-Scalar-Plain">cluster_beegfs_oss</span>
<span class="p p-Indicator">-</span> <span class="l l-Scalar l-Scalar-Plain">cluster_beegfs_client</span>
<span class="nt">roles</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="nt">role</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">stackhpc.beegfs</span>
<span class="nt">beegfs_state</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">present</span>
<span class="nt">beegfs_enable</span><span class="p">:</span>
<span class="nt">mgmt</span><span class="p">:</span> <span class="s">"{{</span><span class="nv"> </span><span class="s">inventory_hostname</span><span class="nv"> </span><span class="s">in</span><span class="nv"> </span><span class="s">groups['cluster_beegfs_mgmt']</span><span class="nv"> </span><span class="s">}}"</span>
<span class="nt">oss</span><span class="p">:</span> <span class="s">"{{</span><span class="nv"> </span><span class="s">inventory_hostname</span><span class="nv"> </span><span class="s">in</span><span class="nv"> </span><span class="s">groups['cluster_beegfs_oss']</span><span class="nv"> </span><span class="s">}}"</span>
<span class="nt">meta</span><span class="p">:</span> <span class="s">"{{</span><span class="nv"> </span><span class="s">inventory_hostname</span><span class="nv"> </span><span class="s">in</span><span class="nv"> </span><span class="s">groups['cluster_beegfs_mds']</span><span class="nv"> </span><span class="s">}}"</span>
<span class="nt">client</span><span class="p">:</span> <span class="s">"{{</span><span class="nv"> </span><span class="s">inventory_hostname</span><span class="nv"> </span><span class="s">in</span><span class="nv"> </span><span class="s">groups['cluster_beegfs_client']</span><span class="nv"> </span><span class="s">}}"</span>
<span class="nt">admon</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">no</span>
<span class="nt">beegfs_mgmt_host</span><span class="p">:</span> <span class="s">"{{</span><span class="nv"> </span><span class="s">groups['cluster_beegfs_mgmt']</span><span class="nv"> </span><span class="s">|</span><span class="nv"> </span><span class="s">first</span><span class="nv"> </span><span class="s">}}"</span>
<span class="nt">beegfs_oss</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="nt">dev</span><span class="p">:</span> <span class="s">"/dev/sdb"</span>
<span class="nt">port</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">8003</span>
<span class="p p-Indicator">-</span> <span class="nt">dev</span><span class="p">:</span> <span class="s">"/dev/sdc"</span>
<span class="nt">port</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">8103</span>
<span class="p p-Indicator">-</span> <span class="nt">dev</span><span class="p">:</span> <span class="s">"/dev/sdd"</span>
<span class="nt">port</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">8203</span>
<span class="nt">beegfs_client</span><span class="p">:</span>
<span class="nt">path</span><span class="p">:</span> <span class="s">"/mnt/beegfs"</span>
<span class="nt">port</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">8004</span>
<span class="nt">beegfs_interfaces</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="s">"ib0"</span>
<span class="nt">beegfs_fstype</span><span class="p">:</span> <span class="s">"xfs"</span>
<span class="nt">beegfs_force_format</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">no</span>
<span class="nt">beegfs_rdma</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">yes</span>
<span class="nn">...</span>
</pre></div>
<p>To create a BeeGFS cluster spanning the two nodes as defined in the
inventory, run a single Ansible playbook to handle the setup and the
teardown of BeeGFS storage cluster components by setting the
<tt class="docutils literal">beegfs_state</tt> flag to <tt class="docutils literal">present</tt> or <tt class="docutils literal">absent</tt>:</p>
<div class="highlight"><pre><span></span><span class="c1"># build cluster</span>
ansible-playbook beegfs.yml -i inventory-beegfs -e <span class="nv">beegfs_state</span><span class="o">=</span>present
<span class="c1"># teardown cluster</span>
ansible-playbook beegfs.yml -i inventory-beegfs -e <span class="nv">beegfs_state</span><span class="o">=</span>absent
</pre></div>
<p>The playbook is designed to fail if the path specified for the BeeGFS
storage service under <cite>beegfs_oss</cite> is already in use by another
service. To override this behaviour, pass an extra option as <tt class="docutils literal"><span class="pre">-e</span>
beegfs_force_format=yes</tt>. Be warned that this will cause data loss:
it formats the disk if a block device is specified, and will also erase
management and metadata server data if there is an existing BeeGFS
deployment.</p>
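<p>Once a deployment has converged, it's worth a quick sanity check from one of the client nodes using the standard BeeGFS utilities:</p>
<div class="highlight"><pre><span></span># Confirm that all registered management, metadata and storage services are reachable
beegfs-check-servers
# Show capacity and consumption of the metadata and storage targets
beegfs-df
</pre></div>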
<p>Highlights of the Ansible role for BeeGFS:</p>
<ul class="simple">
<li>The idempotent role will leave state unchanged if the configuration
has not changed compared to the previous deployment.</li>
<li>The tuning parameters <a class="reference external" href="https://www.beegfs.io/wiki/StorageServerTuning">recommended</a> by
the BeeGFS maintainers themselves for optimal performance of the
storage servers are automatically set.</li>
<li>The role can be used to deploy both <a class="reference external" href="https://www.beegfs.io/wiki/SystemArchitecture">storage-as-a-service and
hyperconverged</a>
architecture by the nature of how roles are ascribed to hosts in
the Ansible inventory. For example, the hyperconverged case would
have storage and client services running on the same nodes while
in the disaggregated case, the clients are not aware of storage
servers.</li>
</ul>
<p>Other things we learnt along the way:</p>
<ul class="simple">
<li>BeeGFS is sensitive to hostnames: it prefers them to be consistent
and permanent. If the hostname changes, <a class="reference external" href="https://www.beegfs.io/wiki/FAQ#hostname_changed">services refuse to start</a>. As a result,
this is worth being mindful of during the initial setup.</li>
<li>This is unrelated to BeeGFS specifically, but under instruction from
Dell we had to set a <cite>-K</cite> flag when formatting NVMe devices in
order to prevent blocks from being discarded; otherwise the disk would
disappear with the following error message:</li>
</ul>
<div class="highlight"><pre><span></span><span class="o">[</span> <span class="m">7926</span>.276759<span class="o">]</span> nvme nvme3: Removing after probe failure status: -19
<span class="o">[</span> <span class="m">7926</span>.349051<span class="o">]</span> nvme3n1: detected capacity change from <span class="m">3200631791616</span> to <span class="m">0</span>
</pre></div>
<div class="section" id="looking-ahead">
<h2>Looking Ahead</h2>
<p>The simplicity of BeeGFS deployment and configuration makes it a
great fit for automated cloud-native deployments. We have seen
a lot of potential in the performance of BeeGFS, and we hope to be
publishing more details from our tests in a future post.</p>
<p>We are also investigating the current state of Kubernetes integration,
using the emerging <a class="reference external" href="https://kubernetes-csi.github.io/docs/">CSI driver API</a> to support the attachment of
BeeGFS filesystems to Kubernetes-orchestrated containerised workloads.</p>
<p>Watch this space!</p>
<p>In the meantime, if you would like to get in touch we would love to hear
from you. Reach out to us via <a class="reference external" href="https://twitter.com/stackhpc">Twitter</a>
or directly via our <a class="reference external" href="//www.stackhpc.com/pages/contact.html">contact page</a>.</p>
</div>
Kayobe and Rundeck2018-10-01T17:00:00+01:002018-10-01T17:00:00+01:00Nick Jonestag:www.stackhpc.com,2018-10-01:/kayobe-and-rundeck.html<p>Operational Hygiene for Infrastructure as Code</p><h1>Operational Hygiene for Infrastructure as Code</h1>
<p><img src="../images/rundeck-logo3-2000w.png" width="200" title="Rundeck logo" alt="The Rundeck logo"></p>
<p><a href="http://rundeck.org">Rundeck</a> is an infrastructure automation tool, aimed at simplifying and streamlining operational process when it comes to performing a particular task, or ‘job’. That sounds pretty grand, but basically what it boils down to is being able to click a button on a web-page or hit a simple API in order to drive a complex task; For example - something that would otherwise involve SSH’ing into a server, setting up an environment, and then running a command with a specific set of options and parameters which, if you get them wrong, can have catastrophic consequences.</p>
<p>This can be the case with a tool as powerful and all-encompassing as <a href="https://github.com/openstack/kayobe">Kayobe</a>. The flexibility and agility of the CLI are wonderful when first configuring an environment, but what about when it comes to day-two operations and business-as-usual (BAU)? How do you ensure that your cloud operators are following the right process when reconfiguring a service? Perhaps you introduced 'run books', but how do you ensure a rigorous degree of consistency to this process? And how do you glue it together with some additional automation? So many questions!</p>
<p>Of course, when you can't answer any or all of these questions, it's difficult to maintain a semblance of 'operational hygiene'. Not having a good handle on whether or not a change is live in an environment, how it's been propagated, or by whom, can leave infrastructure operators in a difficult position. This is especially true when it's a service delivered on a platform as diverse as OpenStack.</p>
<p>Fortunately, there are applications which can help with solving some of these problems - and Rundeck is precisely one of those.</p>
<h1>Integrating Kayobe</h1>
<p>Kayobe has a rich set of features and options, but often in practice - especially in BAU - there's perhaps only a subset of these options and their associated parameters that are required. For our purposes at StackHPC, we've mostly found those to be confined to:</p>
<ul>
<li>Deployment and upgrade of Kayobe and an associated configuration;</li>
<li>Synchronisation of version-controlled kayobe-config;</li>
<li>Container image refresh (pull);</li>
<li>Service deployment, (re)configuration and upgrade.</li>
</ul>
<p>This isn't an exhaustive list, but these have been the most commonly run jobs with a standard set of options, i.e. those targeting a particular service. A deployment will eventually end up with a 'library' of jobs in Rundeck that are capable of handling the majority of Kayobe's functionality, but in our case and in the early stages we found it useful to focus on what's immediately required in practical terms, refactoring and refining as we go.</p>
<h1>Structure and Usage</h1>
<p>Rundeck has no shortage of options when it comes to triggering jobs, including the ability to fire off Ansible playbooks directly - which in some ways makes it a poor facsimile of <a href="https://github.com/ansible/awx">AWX</a>. Rundeck's power, though, comes from its flexibility, so having considered the available options, the most obvious solution seemed to be utilising a simple wrapper script around <code>kayobe</code> itself, which would act as the interface between the two - managing the initialisation of the working environment and passing a set of options based on the selections presented to the user.</p>
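<p>A heavily simplified sketch of the idea behind such a wrapper - the paths, virtualenv location and action names here are assumptions, not our production script:</p>
<div class="highlight"><pre><span></span>#!/usr/bin/env bash
# Map a Rundeck job option onto the corresponding Kayobe invocation.
set -euo pipefail

ACTION=${1:?usage: kayobe-wrapper ACTION [kayobe args...]}; shift

source /opt/kayobe/venv/bin/activate          # assumed virtualenv location
export KAYOBE_CONFIG_PATH=/opt/kayobe-config/etc/kayobe

case "$ACTION" in
    sync)        git -C /opt/kayobe-config pull --ff-only ;;
    pull)        kayobe overcloud container image pull "$@" ;;
    reconfigure) kayobe overcloud service reconfigure "$@" ;;
    *)           echo "unknown action: $ACTION" &gt;&amp;2; exit 1 ;;
esac
</pre></div>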
<p>Rundeck allows you to call jobs from other projects, so we started off by creating a <code>library</code> project which contains common jobs that will be referenced elsewhere such as this Kayobe wrapper. The individual jobs themselves then take a set of options and pass these through to our script, with an action that reflects the job's name. This keeps things reasonably modular and is a nod towards <a href="https://en.wikipedia.org/wiki/Don%27t_repeat_yourself">DRY</a> principles.</p>
<p>The other thing to consider is the various 'roles' of operators (and I use this in the broadest sense of the term) within a team, or the different hats that people need to wear during the course of their working day. We've found that three roles have been sufficient up until now - the omnipresent administrator, a role for seeding new environments, and a 'read-only' role for BAU.</p>
<p>Finally, it's worth mentioning Rundeck's support for concurrency. It's entirely possible to kick off multiple instances of a job at the same time; however, this is something to be avoided when implementing workflows based around tools such as Kayobe.</p>
<p>With those building blocks in place we were then able to start to build other jobs around these on a per-project (environment) basis.</p>
<h2>Example</h2>
<p>Let's run through a quick example, in which I pull in a change that's been merged upstream on GitHub and then reconfigure a service (Horizon).</p>
<p>The first step is to synchronise the version-controlled configuration repository from which Kayobe will deploy our changes. There aren't any user-configurable options for this job (the 'root' path is set by an administrator) so we can just go ahead and run it:</p>
<p><img src="../images/rd_sync_ss.png" width="700" title="Image sync options" alt="Screenshot showing image sync options"></p>
<p>The default here is to 'follow execution' with 'log output', which will echo the (standard) output of the job as it's run:</p>
<p><img src="../images/rd_sync_ss2.png" width="700" title="Completed sync" alt="Screenshot showing completed image sync"></p>
<p>Note that this step could be automated entirely with webhooks that call out to Rundeck to run that job when our pull request has been merged (with the requisite passing tests and approvals).</p>
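<p>Triggering that from a webhook boils down to a single call against Rundeck's API - a sketch, with the host, token and API version as placeholder assumptions (the job UUID is the one shown in the CLI example below):</p>
<div class="highlight"><pre><span></span>curl -X POST -H "X-Rundeck-Auth-Token: $RD_TOKEN" \
    https://rundeck.example.com/api/24/job/2d917313-7d4b-4a4e-8c8f-2096a4a1d6a3/run
</pre></div>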
<p>With the latest configuration in place on my deployment host, I can now go ahead and run the job that will reconfigure Horizon for me:</p>
<p><img src="../images/rd_conf_ss.png" width="700" title="Service reconfiguration selection" alt="Screenshot showing service reconfiguration options"></p>
<p>And again, I can watch Kayobe's progress as it's echoed to <code>stdout</code> for the duration of the run:</p>
<p><img src="../images/rd_deploy_ss.png" width="700" title="Monitoring reconfiguration" alt="Screenshot showing service reconfiguration output"></p>
<p>Note that jobs can be aborted, just in case something unintended happens during the process.</p>
<p>Of course, no modern DevOps automation tool would be complete without some kind of Slack integration. In our #rundeck channel we get notifications from every job that's been triggered, along with its status:</p>
<p><img src="../images/rd_slack_ss.png" width="700" title="Rundeck job status in Slack" alt="Screenshot showing Rundeck job status in Slack"></p>
<p>Once the service reconfiguration job has completed, our change is then live in the environment - consistency, visibility and ownership maintained throughout.</p>
<h2>CLI</h2>
<p>For those with an aversion to GUIs, you'll be happy to learn that Rundeck has a <a href="https://rundeck.org/docs/api/">comprehensive API</a> and a <a href="https://github.com/rundeck/rundeck-cli">CLI tool</a> that wraps it, so all of the above can be done from the comfort of your favourite terminal emulator. Taking the synchronisation job as an example:</p>
<div class="highlight"><pre><span></span><span class="o">[</span>stack@dev-director nick<span class="o">]</span>$ rd <span class="nb">jobs</span> list <span class="p">|</span> grep -i sync
2d917313-7d4b-4a4e-8c8f-2096a4a1d6a3 Kayobe/Configuration/Synchronise
<span class="o">[</span>stack@dev-director nick<span class="o">]</span>$ rd run -j Kayobe/Configuration/Synchronise -f
<span class="c1"># Found matching job: 2d917313-7d4b-4a4e-8c8f-2096a4a1d6a3 Kayobe/Configuration/Synchronise</span>
<span class="c1"># Execution started: [145] 2d917313-7d4b-4a4e-8c8f-2096a4a1d6a3 Kayobe/Configuration/Synchronise <http://10.60.210.1:4440/project/AlaSKA/execution/show/145></span>
Already on <span class="s1">'alaska-alt-1'</span>
Already up-to-date.
</pre></div>
<h1>Conclusions and next steps</h1>
<p>Even with just a relatively basic operational subset of Kayobe's features being exposed via Rundeck, we've already added a great deal of value to the process around managing OpenStack infrastructure as code. Leveraging Rundeck gives us a central point of focus for how change, no matter how small, is delivered into an environment. This provides immediate answers to those difficult questions posed earlier, such as when a change is made and by whom, all the while streamlining the process and exposing these new operational functions via Rundeck's API, offering further opportunities for integration.</p>
<p>Our plan for now is to try and standardise - at least in principle - our approach to managing OpenStack installations via Kayobe with Rundeck. Although it's already proved useful, further development and testing are required to refine the workflow and to expand its scope to cover operational outliers. On the subject of visibility, the next thing on our list to integrate is <a href="https://github.com/openstack/ara">ARA</a>.</p>
<p>If you fancy giving Rundeck a go, getting started is surprisingly easy thanks to the official <a href="https://hub.docker.com/r/rundeck/rundeck/">Docker images</a> as well as some <a href="https://github.com/rundeck/docker-zoo">configuration examples</a>. There's also <a href="https://github.com/yankcrime/riab">this repository</a>, which comprises some of our own customisations, including a minor fix for the integration with Ansible.</p>
<p>Kick things off via <code>docker-compose</code> and in a minute or two you'll have a couple of containers, one for Rundeck itself and one for MariaDB:</p>
<div class="highlight"><pre><span></span>nick@bluetip:~/src/riab> docker-compose up -d
Starting riab_mariadb_1 ... <span class="k">done</span>
Starting riab_rundeck_1 ... <span class="k">done</span>
nick@bluetip:~/src/riab> docker-compose ps
Name Command State Ports
---------------------------------------------------------------------------------------
riab_mariadb_1 docker-entrypoint.sh mysqld Up <span class="m">0</span>.0.0.0:3306->3306/tcp
riab_rundeck_1 /opt/boot mariadb /opt/run Up <span class="m">0</span>.0.0.0:4440->4440/tcp, <span class="m">4443</span>/tcp
</pre></div>
<p>Point your browser at the host where you've deployed these containers on port 4440, and all being well you'll be greeted by the login page.</p>
<p>Feel free to reach out on <a href="https://twitter.com/stackhpc">Twitter</a> or via IRC (#stackhpc on Freenode) with any comments or feedback!</p>Bare Metal VMs: Virtually Ironic?2018-09-21T00:00:00+01:002018-09-21T00:00:00+01:00Will Millertag:www.stackhpc.com,2018-09-21:/tenks.html<p>A summary of the virtual cluster management tool, Tenks, whose
development formed the lion's share of Will Miller's summer internship with
StackHPC.</p><h2>Background</h2>
<p>When I joined StackHPC for the summer, <a href="author/mark-goddard.html">Mark</a> gave me
a problem description. Testing OpenStack Ironic tends to be tough, due to the
difficulty of finding and provisioning bare metal hardware for the purpose. The
normal solution to this is to set up one or more virtual machines to act as
pseudo-bare metal servers to simulate a cluster. Automation of this task
already exists in Bifrost's test scripts<sup id="fnref:bifrost"><a class="footnote-ref" href="#fn:bifrost">1</a></sup>, TripleO's
quickstart<sup id="fnref:tripleo"><a class="footnote-ref" href="#fn:tripleo">2</a></sup>, Ironic's own DevStack script<sup id="fnref:devstack"><a class="footnote-ref" href="#fn:devstack">3</a></sup>, and possibly
others too. However, none of these has become a <em>de facto</em> tool for virtual
cluster deployment, be it due to lack of extensibility, tight coupling with
their respective environment, or an architecture centred around monolithic
shell scripts.</p>
<p><img alt="DevStack plugin" src="//www.stackhpc.com/images/devstack.png" title="DevStack plugin"></p>
<h2>...enter <strong>Tenks</strong>!</h2>
<p>Tenks<sup id="fnref:etymology"><a class="footnote-ref" href="#fn:etymology">4</a></sup> is a new tool designed to solve this problem.</p>
<table>
<thead>
<tr>
<th align="center"><img alt="xkcd 927" src="//www.stackhpc.com/images/standards.png" title="xkcd 927"></th>
</tr>
</thead>
<tbody>
<tr>
<td align="center"><em>Source: xkcd 927 <sup id="fnref:xkcd"><a class="footnote-ref" href="#fn:xkcd">5</a></sup></em></td>
</tr>
</tbody>
</table>
<p>Written mostly in Ansible with custom Python plugins where necessary, it is
designed to be a general virtual cluster orchestration utility. As well as
allowing a developer to spin up a test cluster with minimal configuration, Mark
explained that neither the Kolla<sup id="fnref:kolla"><a class="footnote-ref" href="#fn:kolla">6</a></sup> nor the Kayobe project<sup id="fnref:kayobe"><a class="footnote-ref" href="#fn:kayobe">7</a></sup>
performs testing of Ironic in any of its continuous integration jobs; this is
another key use case which Tenks should assist with.</p>
<h2>Implementation</h2>
<h3>Key Features</h3>
<p>After some discussion and drafting of an Ironic feature specification<sup id="fnref:spec"><a class="footnote-ref" href="#fn:spec">8</a></sup>,
the scope was broken down into a few key components, and a set of desired
capabilities was created. Some of the features that have been implemented
include:</p>
<ul>
<li>
<p><strong>Multi-hypervisor support</strong>. While the commonest use case may well be
all-in-one deployment (when all cluster nodes are hosted on the same machine
from which Tenks is executed: <em>localhost</em>), it was deemed important for Tenks
to allow the use of multiple different hypervisors, each hosting a
subset of the cluster nodes (<em>localhost</em> optionally also performing this
role).</p>
<p>This feature was implemented almost "for free", thanks to Ansible's
multi-host nature<sup id="fnref:ansible-hosts"><a class="footnote-ref" href="#fn:ansible-hosts">9</a></sup>. Different Ansible groups are defined to
represent the different roles within Tenks, and the user creates an inventory
file to add hosts to each group as they please (see the sketch after this
list).</p>
</li>
<li>
<p><strong>Extensibility</strong>. Tenks was designed with a set of default assumptions.
These include the assumption that the user will want to employ the
Libvirt/QEMU/KVM stack to provide virtualisation on the hypervisors and
Virtual BMC<sup id="fnref:vbmc"><a class="footnote-ref" href="#fn:vbmc">10</a></sup> to provide a virtual platform management interface using
IPMI; it is also assumed that the user will have provisioned the physical
network infrastructure between the hypervisors.</p>
<p>However, these assumptions do not impact the extensibility of Tenks. The
tool is written as a set of Ansible playbooks, each of which delegates
different tasks to different hosts. If, in future, there is a use case for
OpenStack to be added as a provider in place of the Libvirt stack (as the
sushy-tools emulator<sup id="fnref:sushy-tools"><a class="footnote-ref" href="#fn:sushy-tools">11</a></sup>, for example, already allows), the
existing plays need not be modified. The new provider can be added as a
sub-group of the existing Ansible <em>hypervisors</em> group: any existing
Libvirt-specific tasks will automatically be omitted, and new tasks can be
added for the new group.</p>
</li>
<li>
<p><strong>Tear-down and reconfiguration</strong>. Whether used as an ephemeral cluster for
a CI job or for a developer's test environment, a deployed Tenks cluster will
need to be cleaned up afterwards. A developer may also wish to reconfigure
their cluster to add or remove nodes, without affecting the existing nodes.</p>
<p>Tear-down was reasonably easy to implement due to the modular nature of
Tenks' tasks. To a large extent, tear-down involves running the deployment
playbooks "in reverse". The Ansible convention of having a <code>state</code> variable
with values <code>present</code> and <code>absent</code> means that minimal duplication of logic
was required to do this.</p>
<p>Reconfiguration required a little more thought, but this too was
implemented without too much disruption to the core playbooks. The
scheduling of nodes to hypervisors is performed by a custom Ansible action
plugin, which is where most of the reconfiguration logic lies. The state of
the existing cluster (if any) is preserved in a YAML file, and this is
compared to the cluster configuration that was given to Tenks. The
scheduling plugin decides which existing nodes should be purged, and how
many new nodes need to be provisioned. It eventually outputs an updated
cluster state which is saved in the YAML file, then the rest of Tenks runs
as normal from this configuration.</p>
</li>
</ul>
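<p>Pulling together the inventory and tear-down conventions above, here is a hedged sketch of an all-in-one run. The <em>hypervisors</em> group name comes from the discussion above, but the inventory layout and playbook names are assumptions for illustration rather than documented Tenks usage:</p>
<div class="highlight"><pre># Minimal inventory: everything runs on the machine executing Tenks.
$ cat inventory
[hypervisors]
localhost ansible_connection=local

# Deploy the virtual cluster, then tear it down again - tear-down
# largely runs the deployment "in reverse" via state=absent.
$ ansible-playbook -i inventory deploy.yml
$ ansible-playbook -i inventory teardown.yml</pre></div>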
<h3>Architecture</h3>
<p>The below diagram represents the interaction of the different Ansible
components within Tenks. The <code>libvirt-vm</code> and <code>libvirt-host</code> roles are imported
from StackHPC's Ansible Galaxy collection<sup id="fnref:galaxy"><a class="footnote-ref" href="#fn:galaxy">12</a></sup>; the rest are internal to
Tenks.</p>
<!-- Use HTML here to allow scaling of the image as a proportion of the size of
the container, while maintaining the original image resolution. -->
<p><a href="//www.stackhpc.com/images/tenks_ansible_structure.png"><img src="//www.stackhpc.com/images/tenks_ansible_structure.png" alt="Ansible structure diagram" width="100%" /></a></p>
<h3>Networking</h3>
<p>To allow testing of a wider range of scenarios, we decided it was important
that Tenks support multihomed nodes. This would represent an improvement on the
capabilities of DevStack, by allowing an arbitrary number of networks to be
configured and connected to any node.</p>
<p>Tenks has a concept of 'physical network' which currently must map one-to-one
to the hardware networks plugged into the hypervisors. It requires device
mappings to be specified on a hypervisor for each physical network that is to
be connected to nodes on that hypervisor. This device can be an interface, a
Linux bridge or an Open vSwitch bridge. For each physical network that is given
a mapping on a hypervisor, a new Tenks-managed Open vSwitch bridge is created.
If the device mapped to this physnet is an interface, it is plugged directly
into the new bridge. If the device is an existing Linux bridge, a veth pair is
created to connect the existing bridge to the new bridge. If the device is an
existing Open vSwitch bridge, an Open vSwitch patch port is created to link the
two bridges.</p>
<p>A new veth pair is created for each physical network that each node on each
hypervisor is connected to, and one end of the pair is plugged into the Tenks
Open vSwitch bridge for that physical network; the other end will be plugged
into the node itself. Creation of these veth pairs is necessary (at least for
the Libvirt provider) to ensure that an interface is present in Open vSwitch
even when the node itself is powered off.</p>
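<p>The plumbing described above boils down to a handful of <code>ovs-vsctl</code> and <code>ip</code> commands. Tenks drives these via Ansible, so the following is purely illustrative, with invented bridge and interface names:</p>
<div class="highlight"><pre># Tenks-managed Open vSwitch bridge for physnet0:
ovs-vsctl add-br brtenks0

# Mapped device is a plain interface - plug it straight in:
ovs-vsctl add-port brtenks0 eth1

# Mapped device is an existing Linux bridge - join via a veth pair:
ip link add p-physnet0-ovs type veth peer name p-physnet0-lnx
ip link set p-physnet0-lnx master br-existing
ovs-vsctl add-port brtenks0 p-physnet0-ovs
ip link set p-physnet0-ovs up
ip link set p-physnet0-lnx up

# Mapped device is an existing OVS bridge - link with patch ports:
ovs-vsctl add-port brtenks0 patch-tenks -- \
    set interface patch-tenks type=patch options:peer=patch-existing
ovs-vsctl add-port br-existing patch-existing -- \
    set interface patch-existing type=patch options:peer=patch-tenks</pre></div>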
<table>
<thead>
<tr>
<th align="center"><img alt="Networking structure diagram" src="//www.stackhpc.com/images/tenks_networking_structure.png" title="Networking structure diagram"></th>
</tr>
</thead>
<tbody>
<tr>
<td align="center"><em>An example of the networking structure of Tenks. In this example, one node was requested to be connected to physnet0 and physnet1, and two nodes were requested to be connected just to physnet1.</em></td>
</tr>
</tbody>
</table>
<h2>Desirable Extensions</h2>
<p>There are many features and extensions to the functionality of Tenks that I
would make if I had more time at StackHPC. A few examples of these follow.</p>
<h4>More Providers and Platform Management Systems</h4>
<p>As mentioned earlier, it would be useful to extend Tenks to support providers
other than Libvirt/QEMU/KVM - for example, VirtualBox, VMware or OpenStack.
Redfish is also gaining momentum in the Ironic community as an alternative
to IPMI-over-LAN, so adding support for this to Tenks would widen its appeal.</p>
<h4>Increased Networking Complexity</h4>
<p>As described in the <em>Networking</em> section above, making the assumption that each
network to which nodes are connected will have a physical counterpart imposes
some limitations. For example, if a hypervisor has fewer interfaces than
physical networks exist in Tenks, either one or more physical networks will not
be usable by nodes on that hypervisor, or multiple networks will have to share
the same interface, breaking network isolation.</p>
<p>It would be useful for Tenks to support more complex software-defined
networking. This could allow multiple 'physical networks' to safely share the
same physical link on a hypervisor. VLAN tagging is used by certain OpenStack
networking drivers (networking-generic-switch, for example) to provide tenant
isolation for instance traffic. While this in itself is outside of the scope of
Tenks, it would need to be taken into account if VLANs were also used for
network separation in Tenks, due to potential gotchas when using nested VLANs.</p>
<h4>More Intelligent Scheduling</h4>
<p>The current system used to choose a hypervisor to host each node is rather
naïve: it uses a round-robin approach to cycle through the hypervisors. If
the next hypervisor in the cycle is not able to host the node, it will check
the others as well. However, the incorporation of more advanced scheduling
heuristics to inform more optimal placement of nodes would be desirable. All of
Ansible's gathered facts about each hypervisor are available to the scheduling
plugin, so it would be relatively straightforward to use facts about
total/available memory or CPU load to shift the load balance towards more
capable hypervisors.</p>
<h2>Evaluation</h2>
<p>Overall, I'm happy with the progress I've been able to make during my
internship. The rest of the team has been very welcoming and I've got a lot out
of the experience (special thanks to Mark for supervising and painstakingly
reviewing all my pull requests!). Roughly the first half of my time here was
spent reacquainting myself with OpenStack and the technologies around it, by
way of performing test OpenStack deployments with various configurations on a
virtual machine and preparing patches to be submitted upstream to fix issues as
I encountered them. While this process meant that I didn't spend my entire
internship working directly on Tenks, it was a very useful opportunity to 'dip
my toes' back into the OpenStack ecosystem and it helped to shape the design
decisions I made later on when developing Tenks.</p>
<p>Due to the time constraints of my placement, the initial work on Tenks was
started outside of the OpenStack ecosystem. It has been an enjoyable summer
project, but it would be more gratifying to see it continue in development and
use not only within the company, but hopefully with the OpenStack community as
an upstream project. Feel free to check out the project repository<sup id="fnref:tenks-repo"><a class="footnote-ref" href="#fn:tenks-repo">13</a></sup>
and read the documentation about contribution if you'd like to get involved.
Finally, to my colleagues at StackHPC: <em>so long and Tenks for all the fish!</em></p>
<h2>Footnotes</h2>
<div class="footnote">
<hr>
<ol>
<li id="fn:bifrost">
<p><a href="https://github.com/openstack/bifrost/tree/master/playbooks/roles/bifrost-create-vm-nodes">Bifrost create-vm-nodes role</a> <a class="footnote-backref" href="#fnref:bifrost" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:tripleo">
<p><a href="https://github.com/openstack/tripleo-quickstart">TripleO Quickstart</a> <a class="footnote-backref" href="#fnref:tripleo" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:devstack">
<p><a href="https://github.com/openstack/ironic/blob/master/devstack/lib/ironic">DevStack Ironic plugin</a> <a class="footnote-backref" href="#fnref:devstack" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:etymology">
<p><a href="https://review.openstack.org/#/c/579583/7/specs/approved/virtual-bare-metal-cluster-manager.rst@104"><em>Tenks</em> etymology</a> <a class="footnote-backref" href="#fnref:etymology" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:xkcd">
<p><a href="https://xkcd.com/927/">Relevant XKCD source</a> <a class="footnote-backref" href="#fnref:xkcd" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
<li id="fn:kolla">
<p><a href="https://wiki.openstack.org/wiki/Kolla">Kolla</a> <a class="footnote-backref" href="#fnref:kolla" title="Jump back to footnote 6 in the text">↩</a></p>
</li>
<li id="fn:kayobe">
<p><a href="https://github.com/openstack/kayobe">Kayobe</a> <a class="footnote-backref" href="#fnref:kayobe" title="Jump back to footnote 7 in the text">↩</a></p>
</li>
<li id="fn:spec">
<p><a href="https://review.openstack.org/#/c/579583/">Virtual bare metal clusters Ironic spec</a> <a class="footnote-backref" href="#fnref:spec" title="Jump back to footnote 8 in the text">↩</a></p>
</li>
<li id="fn:ansible-hosts">
<p><a href="https://docs.ansible.com/ansible/2.6/user_guide/intro_inventory.html#hosts-and-groups">Ansible hosts documentation</a> <a class="footnote-backref" href="#fnref:ansible-hosts" title="Jump back to footnote 9 in the text">↩</a></p>
</li>
<li id="fn:vbmc">
<p><a href="https://github.com/openstack/virtualbmc">Virtual BMC</a> <a class="footnote-backref" href="#fnref:vbmc" title="Jump back to footnote 10 in the text">↩</a></p>
</li>
<li id="fn:sushy-tools">
<p><a href="https://github.com/openstack/sushy-tools">sushy-tools</a> <a class="footnote-backref" href="#fnref:sushy-tools" title="Jump back to footnote 11 in the text">↩</a></p>
</li>
<li id="fn:galaxy">
<p><a href="https://galaxy.ansible.com/stackhpc">StackHPC Ansible Galaxy</a> <a class="footnote-backref" href="#fnref:galaxy" title="Jump back to footnote 12 in the text">↩</a></p>
</li>
<li id="fn:tenks-repo">
<p><a href="https://github.com/stackhpc/tenks">Tenks repository</a> <a class="footnote-backref" href="#fnref:tenks-repo" title="Jump back to footnote 13 in the text">↩</a></p>
</li>
</ol>
</div>Heads Up: Ansible Galaxy Breaks the World2018-09-04T16:40:00+01:002018-09-04T16:40:00+01:00Stig Telfertag:www.stackhpc.com,2018-09-04:/galaxy-broken.html<p class="first last">The software repository within which we build our ecosystem
breaks everything without apparent notice.</p>
<p>One of the great advantages arising from our technology choices has
been that through standardising on <a class="reference external" href="https://docs.ansible.com/ansible/index.html">Ansible</a> we have been able
to use a single, simple tool to drive everything we do.</p>
<p>Ansible is not really a programming language, and modularity cannot
be ensured without some amount of programmer discipline. One great
tool in providing a level of modularity and component reuse has
been <a class="reference external" href="https://galaxy.ansible.com/home">Ansible Galaxy</a>. Our
<a class="reference external" href="https://galaxy.ansible.com/stackhpc">OpenStack deployment toolbag</a>
has been steadily growing and we've been thrilled to see others
make use of our components as well. Share and enjoy!</p>
<p>Unfortunately, we are writing this post because of an event today
which apparently without notice broke all our builds, and also all
the work of our clients who use our technology.</p>
<div class="section" id="it-s-working-great-what-could-possibly-go-wrong">
<h2>It's Working Great, What Could Possibly Go Wrong?</h2>
<p>We started to notice oddities when updating some of our roles on
Galaxy earlier today. The first thing was that the implicit naming convention
used for <tt class="docutils literal">git</tt> repos such as <a class="reference external" href="https://github.com/stackhpc/ansible-role-beegfs">our new BeeGFS role</a>
was no longer being honoured, so that the role name on Galaxy
changed from <tt class="docutils literal">beegfs</tt> to <tt class="docutils literal"><span class="pre">ansible-role-beegfs</span></tt>. As a result,
the role could no longer be found by playbooks that required it.</p>
<p>This we fixed through adding a metadata tag <tt class="docutils literal">role_name</tt> which
explicitly sets the name. We did this to each of our 32 roles.
Our repos are long established, many are cloned, some are forked.
We can't simply rename them on a whim.</p>
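<p>For reference, the tag lives in each role's <tt class="docutils literal">meta/main.yml</tt>, along these (abridged) lines:</p>
<pre class="literal-block">
$ head meta/main.yml
---
galaxy_info:
  role_name: beegfs
  author: stackhpc
  # ...remainder of the role metadata...
</pre>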
<p>On pushing the change that sets this metadata tag, every one of our
roles with a hyphenated name was silently converted to using
underscores instead. This may seem innocuous, but the consequence
is that, again, every playbook that referenced these roles - which
is <em>every playbook we write</em> - could no longer retrieve the roles
it required from Ansible Galaxy.</p>
<p>The root cause appears to be the combined effect of two changes.
Ansible has removed the implicit naming convention for the <tt class="docutils literal">git</tt>
repos that back Galaxy roles. Around the same time they have
introduced a newer, stricter <a class="reference external" href="https://galaxy.ansible.com/docs/contributing/creating_role.html#role-names">naming convention for Galaxy roles</a>
that prevents names containing hyphens. The backwards-compatibility
plans for these two changes are mutually exclusive. Unfortunately
most of our roles fall into both categories.</p>
<p>We are not out of the woods as it appears the <tt class="docutils literal">role_name</tt> tag
that we now require to explicitly set the correct name for our roles
may also be <a class="reference external" href="https://github.com/ansible/galaxy/issues/1042">about to be deprecated</a>. This may leave us
needing to rename all the <tt class="docutils literal">git</tt> repos for our roles.</p>
</div>
<div class="section" id="what-about-kayobe">
<h2>What about Kayobe?</h2>
<p>OpenStack <a class="reference external" href="https://kayobe.readthedocs.io">Kayobe</a> is a project that
makes extensive use of Galaxy for reuse and modularity. At the time
of writing Kayobe's CI is also broken by this change, and an extensive
<a class="reference external" href="https://review.openstack.org/#/c/599552/">search-and-replace patchset</a>
is required, pending the outcome of our requests for upstream resolution.</p>
</div>
<div class="section" id="what-do-our-clients-need-to-do">
<h2>What Do Our Clients Need to Do?</h2>
<p>In summary, there seem to be a number of tedious but simple changes that
must be applied everywhere:</p>
<ul class="simple">
<li>All our roles have underscores instead of hyphens in their names from now on.
This appears to be an inevitable change to accommodate forwards compatibility
for future versions of Galaxy. We'd <a class="reference external" href="https://github.com/ansible/galaxy/issues/1128">like to see a server-side fix</a> to Galaxy to enable
recognition of either hyphens or underscores, thus enabling a smooth
transition.</li>
<li>The requirements and role invocations of every playbook that references
them will need to be updated to replace occurrences of <tt class="docutils literal">-</tt> with <tt class="docutils literal">_</tt>
(see the example after this list).
We will commit those changes to our repos, but all clients will need to pull
in the new changes. This should happen automatically when repos are cloned.</li>
<li>We might not be done with these build-breaking changes yet, although
hopefully there will be a way forward that doesn't break things for users.</li>
</ul>
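<p>The mechanical part of that second change lends itself to a crude
search-and-replace. For example, for one of our roles (repeat per role,
adjust the path to your own requirements file, and sanity-check the
resulting diff):</p>
<pre class="literal-block">
$ sed -i 's/stackhpc\.os-images/stackhpc.os_images/g' ansible/requirements.yml
</pre>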
<p>Let's hope this kind of event doesn't happen too often in future...</p>
<div class="figure">
<img alt="Mushroom cloud!" src="//www.stackhpc.com/images/nuclear-bomb.png" style="width: 750px;" />
</div>
</div>
Pierre Riteau Joins our Team2018-07-04T14:20:00+01:002018-07-12T22:00:00+01:00Stig Telfertag:www.stackhpc.com,2018-07-04:/pierre-riteau-joins-our-team.html<p class="first last">Pierre Riteau joins the StackHPC team, bringing new
skills to broaden our capabilities and experience of operating
Ironic at scale.</p>
<p>We are very happy to announce that our team has grown further, with
the addition of Pierre Riteau. Pierre is a core team member of the
<a class="reference external" href="https://docs.openstack.org/blazar/latest/">OpenStack Blazar project</a>,
has been highly active in the <a class="reference external" href="https://wiki.openstack.org/wiki/Scientific_SIG">OpenStack Scientific SIG</a>
and joins us from the <a class="reference external" href="https://www.chameleoncloud.org/">Chameleon testbed</a>,
where he was lead DevOps engineer.</p>
<p>StackHPC sees the huge potential of the Blazar reservation service,
particularly if it could be combined with the <a class="reference external" href="//www.stackhpc.com/openstack-forum-vancouver-2018.html">recent work on
preemptible instances</a>,
and we hope Pierre can help us to deliver this vision for our clients
and the Scientific OpenStack community.</p>
<p>We will also draw on Pierre's expertise gained from designing, deploying and
managing a bare metal research cloud that has pushed OpenStack's boundaries in
some disruptive ways.</p>
<p>Pierre adds "StackHPC has been doing amazing work building research computing
platforms with OpenStack. I am honoured and delighted to be joining this
talented team to help deliver even more powerful capabilities."</p>
<p>Follow Pierre on Twitter <a class="reference external" href="https://twitter.com/priteau">@priteau</a>.</p>
<div class="figure">
<img alt="Pierre Riteau" src="//www.stackhpc.com/images/priteau.jpg" style="width: 300px;" />
</div>
StackHPC at OpenStack Vancouver2018-06-06T10:20:00+01:002018-06-21T15:40:00+01:00Stig Telfertag:www.stackhpc.com,2018-06-06:/stackhpc-at-openstack-vancouver.html<p class="first last">Roundup from a huge week in Vancouver for the StackHPC team</p>
<img alt="Vancouver Seaplanes" src="//www.stackhpc.com/images/vancouver-seaplanes.jpg" style="width: 540px;" />
<p>Stig Telfer and John Garbutt had a busy week in Vancouver. Aside from taking
in all the content packed into a summit, John and Stig both presented, participated
actively in many forum sessions and the activities of the Scientific SIG.</p>
<div class="section" id="preemptible-instances-and-bare-metal-containers">
<h2>Preemptible Instances and Bare Metal Containers</h2>
<div class="youtube"><iframe src="https://www.youtube.com/embed/K5N4LYrupSs" width="450" height="300" allowfullscreen seamless frameBorder="0"></iframe></div><p>John presented with Belmiro Moreira from CERN on the recent fruits of the
collaboration between OpenStack teams at CERN and SKA.</p>
</div>
<div class="section" id="the-openstack-forum">
<h2>The OpenStack Forum</h2>
<p>John has already provided a comprehensive round-up of much of the discussion
at the forum, and you can find <a class="reference external" href="//www.stackhpc.com/openstack-forum-vancouver-2018.html">his full report here</a>.</p>
</div>
<div class="section" id="openstack-and-hpc-panel">
<h2>OpenStack and HPC Panel</h2>
<div class="youtube"><iframe src="https://www.youtube.com/embed/xgMk3hOGWCU" width="450" height="300" allowfullscreen seamless frameBorder="0"></iframe></div><p>Stig was privileged to take part in a packed-out panel session on using
OpenStack for HPC, hosted by Martial Michel and with co-panelists Jonathan
Mills, Blair Bethwaite, Mike Lowe, Robert Budden and Jim Golden.</p>
</div>
<div class="section" id="the-scientific-sig">
<h2>The Scientific SIG</h2>
<img alt="The Three Amigos" src="//www.stackhpc.com/images/stig-blair-martial-vancouver.jpg" style="width: 450px;" />
<p>The SIG was busy as ever, with many new faces and a stimulating series of
lightning talks. The Scientific SIG renewed its interest in
best practice for management of controlled-access datasets within
OpenStack. This may well be a reflection of the burgeoning use case
for life sciences workloads on Scientific OpenStack infrastructure.</p>
<p>What a great week!</p>
</div>
OpenStack Forum in Vancouver 2018 Summary2018-06-04T17:30:00+01:002018-06-04T17:30:00+01:00John Garbutttag:www.stackhpc.com,2018-06-04:/openstack-forum-vancouver-2018.html<p class="first last">A Scientific Computing focused summary of some discussions during
the OpenStack Forum in Vancouver.</p>
<p>Here is a Scientific Computing focused summary of some discussions during the OpenStack Forum. The full list of etherpads is available at: <a class="reference external" href="https://wiki.openstack.org/wiki/Forum/Vancouver2018">https://wiki.openstack.org/wiki/Forum/Vancouver2018</a></p>
<p>This blog jumps into the middle of several ongoing conversations with the upstream community. In an attempt to ensure this blog is published while it is still relevant, I have only spent a limited amount of time setting context around each topic. Hopefully it is still a useful collection of ideas from the recent Forum discussions.</p>
<div class="figure">
<img alt="Picture of Vancouver" src="//www.stackhpc.com/images/openstack-vancouver-2018.jpg" style="width: 750px;" />
</div>
<p>As you can see from the pictures the location was great, the discussions were good too...</p>
<div class="section" id="role-based-access-control-and-quotas">
<h2>Role Based Access Control and Quotas</h2>
<p>The refresh of OpenStack’s policy system is going well. RBAC in OpenStack still suffers from a default global admin role, and from many services allowing access to everything except global-admin operations once you have any role in a given project. Care in defining the scope of each API action (system vs project scope) and mapping it to one of the three default roles (read-only, member, admin) should help get us to a place where we can add some more interesting defaults.</p>
<p>Hierarchical Quotas discussions were finally more concrete, with a clear path towards making progress. With unified limits largely implemented, agreement was needed on the shape of <tt class="docutils literal">oslo.limits</tt>. The current plan focuses around getting what we hope is the 80% use case working, i.e. when you apply a limit, you have the option of ensuring it also includes any resources used in the tree of projects below. There was an attempt to produce a rule set that would allow for a very efficient implementation, but that has been abandoned as the rule set doesn’t appear to map to many (if any) current use cases. For Nova, it was discussed that the move to using placement to count resources could coincide with a move to <tt class="docutils literal">oslo.limits</tt>, to avoid too many transition periods for operators that make extensive use of quotas.</p>
</div>
<div class="section" id="preemptible-servers">
<h2>Preemptible Servers</h2>
<p>To understand the context behind preemptible servers, please take a look at the video of a joint presentation I was involved with: <a class="reference external" href="https://www.openstack.org/videos/vancouver-2018/containers-on-baremetal-and-preemptible-vms-at-cern-and-ska">Containers on Baremetal and Preemptible VMs at CERN and SKA</a>.</p>
<div class="youtube"><iframe src="https://www.youtube.com/embed/K5N4LYrupSs" width="450" height="300" allowfullscreen seamless frameBorder="0"></iframe></div><p>The preemptible servers effort is moving forward, hoping to bring higher utilisation to clouds that use quotas to provide a level of fair sharing of resources. The prototype being built by CERN is progressing well, and Nova changes are being discussed in a spec. There was agreement around using soft shutdown to notify the running instance that its time is up.</p>
<p>The previous nova-spec blocker around the notifications has been resolved. The reaper only wants a subset of the notifications, while at the same time not affecting ceilometer or other similar services. Firstly we agreed to consider the <a class="reference external" href="https://github.com/openstack/nova/blob/90b728afeaf390b845ad5469edfe6b352b8b0b20/nova/conductor/manager.py#L755">schedule_instances notifications</a> for this purpose, and proposed looking at the <a class="reference external" href="https://github.com/openstack/oslo.messaging/blob/ea8fad47a57b1c8784a0a7c1fc435c4e38200863/oslo_messaging/notify/_impl_routing.py#L133">routing backend</a> for oslo messaging to ensure the reaper only gets a subset of the notifications.</p>
<p>I personally hope to revisit Blazar once preemptible servers are working. The current future reservation model in Blazar leads to much underutilisation that could be helped by preemptibles. Blazar also seems too disconnected from Nova’s create server API. I do wonder if there is a new workflow making use of the new pending instance status that would feel more cloud native. For example, when a cloud is full, rather than having a reaper to delete instances to make space, maybe you also have a promoter that helps rebuild pending instances as soon as some space becomes free, and maybe says only pick me if all the instances in the group can be built, and you will let me run for at least 8 hours. More thought is required, ideally without building yet another Slurm/HTCondor like batch job scheduler.</p>
</div>
<div class="section" id="federation">
<h2>Federation</h2>
<p>Federation was a big topic of conversation, with the edge computing use cases being heavily discussed. Hallway track conversations covered the <a class="reference external" href="https://aarc-project.eu/architecture/">AARC Blueprint Architecture</a> and <a class="reference external" href="https://wiki.geant.org/display/eduGAIN/Federation+Architectures">eduGAIN Federation Architectures</a>, and how they relate back to keystone. There is a blog post brewing on the work we at StackHPC have done with the STFC IRIS cloud federation project.</p>
<p>The AARC blueprint involves having a “proxy” that provides a single point of management for integrating with multiple external Identity Providers. It uses these IdPs purely for authentication, allowing for multiple ways of authenticating a single user while centralising the decisions about what that user is authorized to do. In particular, the proxy holds information on which virtual organisations (or communities) the user is associated with, and what roles they have in each organisation. Ideally the task of authorizing users for a particular role is delegated to the manager of that organisation, rather than to the operator of the service. When considering the latter part of the use case, this application becomes really interesting (thanks to Kristi Nikolla for telling me about it!): <a class="reference external" href="https://github.com/CCI-MOC/ksproj">https://github.com/CCI-MOC/ksproj</a></p>
<p>Pulling these ideas together, we get something like this:</p>
<div class="figure">
<img alt="Keystone Federation Ideas" src="//www.stackhpc.com/images/openstack-vancouver-2018-keystone-federation.jpg" style="width: 750px;" />
</div>
<p>Talking through a few features of the diagram, it is worth noting:</p>
<ul class="simple">
<li>Authentication to the central keystone is via an assertion from a trusted Identity Provider, mapped to a local user via a federation mapping, such that additional roles can be assigned locally, optionally via <tt class="docutils literal">ksproj</tt> workflows</li>
<li>Authentication to a leaf keystone is via an assertion from the central keystone, with authorization info extracted via a locally-stored federation mapping</li>
<li>Application credentials are local to each keystone</li>
<li>Keystone tokens can also be shared, with all those regions listed in each other's service catalogs</li>
<li>There is no nasty long-distance synchronisation of keystone databases</li>
</ul>
<p>It is believed this should work today, but there are some things we could consider changing in keystone:</p>
<ul class="simple">
<li>Federation mappings of groups versus projects + roles are inconsistent (a minimal sketch of such a mapping follows this list). One lasts for the life of the keystone token, the other lasts forever. One is totally ephemeral, the other is written to the DB. One refreshes the full list of permissions on each authentication, the other can be used to slowly accrue more and more permissions as assertions change over time. For leaf keystone deployments we would like a refresh on every authentication, with a time-to-live applied to the group membership that is added to the DB.</li>
<li>Note that using groups in the federation maps breaks trusts and application credentials; among other things this breaks Heat and Magnum. Ideally the above federation mapping work ensures everything is written to the database in a way that ensures trusts and application credentials work (see <a class="reference external" href="https://bugs.launchpad.net/keystone/+bug/1589993">https://bugs.launchpad.net/keystone/+bug/1589993</a>)</li>
<li>Keystone to Keystone federation is fairly non-standard (see <a class="reference external" href="http://www.gazlene.net/demystifying-keystone-federation.html">http://www.gazlene.net/demystifying-keystone-federation.html</a>) as it is IdP initiated, not the more “normal” SP initiated federation. We should be able to make both configurations possible as we move to SP initiated keystone to keystone federation</li>
<li>There is no central service catalog without allowing tokens to work across all those regions. It might be nice to have pointers to service catalogs in the central component (this could be the list of service providers used for keystone to keystone federation, but this seems to not quite fit here).</li>
</ul>
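<p>For illustration, a minimal group-based federation mapping on a leaf keystone might look like the following. The remote attribute, group name and mapping name are invented for the example:</p>
<pre class="literal-block">
$ cat rules.json
[
  {
    "local": [
      {"user": {"name": "{0}"}},
      {"group": {"name": "iris-users", "domain": {"name": "Default"}}}
    ],
    "remote": [
      {"type": "REMOTE_USER"}
    ]
  }
]
$ openstack mapping create --rules rules.json central-keystone-mapping
</pre>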
<p>It turns out, the Edge Computing use cases have very similar needs. Application credentials allow scripts to have a user local to a specific keystone instance, in a way that authentication still works even when the central federation keystone isn’t available.</p>
<p>There is a lot more to this story, and we hope to bring you more blog entries on the details very soon!</p>
</div>
<div class="section" id="ironic">
<h2>Ironic</h2>
<p>Ironic sees lots of usage, including from some of the largest users (looking at you, Yahoo/Oath). The effort around extending traits to provide deploy templates seems really important to lots of people. The engagement of Redfish developers was also notable, with interest in making better use of the TPM, and in better, more efficient cleaning of a system, including resetting all firmware and configuration.</p>
</div>
<div class="section" id="kayobe-kolla-and-kolla-ansible">
<h2>Kayobe, Kolla and Kolla-Ansible</h2>
<p>There seems to be growing usage of kolla containers, in particular triple-o has moved to use them (alongside kolla-ansible). There was talk of more lightweight alternatives to kolla’s containers (as used by openstack-ansible I think), but folks expressed how they liked the current extensibility of the kolla container workflow.</p>
<p>StackHPC makes use of <a class="reference external" href="https://kayobe.readthedocs.io">Kayobe</a>, which combines kolla-ansible and bifrost. We discussed how the new kolla-cli built on top of kolla-ansible and used by Oracle’s OpenStack product could be made to work well with kayobe and its configuration patterns. We also spoke a little about how Kayobe is evolving to help deal with different kolla-ansible configurations for different environments (for example production vs pre-production).</p>
</div>
<div class="section" id="manila">
<h2>Manila</h2>
<p>Interesting conversations about CephFS support and usage of Manila from both CERN and RedHat. There was talk about how it is believed that a large public cloud user makes extensive use of Manila, but they were not present to comment on that. Some rough edges were discussed, but the well trodden paths seem to be working well for many users.</p>
</div>
<div class="section" id="glance-and-the-image-lifecycle">
<h2>Glance and the Image Lifecycle</h2>
<p>Glance caching was discussed, largely thinking about Edge Computing use cases, but the solution is more broadly applicable. It was a reminder that you can deploy additional glance API servers close to the places that want to download, each of which will cache images on its local disk using a simple LRU policy. While that is all available today - I remember working on it at Rackspace public cloud - there were some good discussions on how we could add better visibility and control, particularly around pre-seeding the cache with images.</p>
<p>Personally I want to see an OpenStack ecosystem-wide way to select “the latest CentOS 7” image. Currently each cloud, or federation of clouds, ends up building its own way of doing something a bit like that. Having helped run a public cloud, the “hard” bit is updating images with security fixes without breaking people.</p>
<p>This discussion led to the establishing of three key use cases:</p>
<p>The cloud operator is free to update “latest” images to include all security fixes, while not breaking users:</p>
<ul class="simple">
<li>By default, listing images shows only the “latest” image for each image family<ul>
<li>Users can request the “latest” image for a given family, ideally with the same REST API call across all OpenStack certified clouds</li>
<li>Users' automation scripts can request a specific version of an image via image uuid, allowing them to test any new images before having to use them (from a given image the user is consuming, it should be easy to find the new updated image)</li>
</ul>
</li>
<li>Allow cloud operators to stop users booting new servers from a given image, hiding the image from the list of images, all without breaking existing running instances</li>
</ul>
<p>In summary, we support (and recommend) this sort of image lifecycle (sketched with CLI commands after the list):</p>
<ul class="simple">
<li>Cloud operator tests new image in production</li>
<li>Optional, public beta for new image</li>
<li>Promote image to new default/latest image for a given image family</li>
<li>Eventually hide image from default image list, but users can still use and find the image if needed</li>
<li>Stop users building from a given image, but keep image to not break live migrations</li>
<li>Finally delete the image, once no existing instances require the image (may never happen).</li>
</ul>
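<p>To make the lifecycle concrete, a couple of hedged CLI steps - the <tt class="docutils literal">family</tt> property is a naming convention assumed for this example, and the hide step assumes the hidden images spec listed below lands in roughly its proposed form:</p>
<pre class="literal-block">
# Upload a new image version, tagged with its (assumed) family property
$ openstack image create --file centos7-20180604.qcow2 \
      --property family=centos7 centos7-20180604

# Later: hide it from the default listing without breaking existing
# instances (depends on the hidden images spec listed below)
$ openstack image set --hidden centos7-20180604
</pre>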
<p>Things that need doing to make these use cases possible:</p>
<ul class="simple">
<li>Hidden images spec: <a class="reference external" href="https://review.openstack.org/#/c/545397/">https://review.openstack.org/#/c/545397/</a></li>
<li>CLI tooling for “image family” metadata, including server side property verification, refstack tests to adopt family usage pattern</li>
<li>Allow Nova to download a deactivated image, probably using the new service token system that is used by Nova when talking to Neutron and Cinder.</li>
</ul>
</div>
<div class="section" id="openstack-upgrades">
<h2>OpenStack Upgrades</h2>
<p>It seems there is agreement on how best to collaborate upstream on supporting OpenStack releases for longer, mostly due to agreeing the concept of fast forward upgrades. The backporting policy has been extended to cover these longer lived branches. By the next forum, we will have the first release branch go into extended maintenance, and could have some initial feedback on how well the new system is working.</p>
</div>
<div class="section" id="kubernetes">
<h2>Kubernetes</h2>
<p>Did I put this section in to make my write-up more cool? Well no, there is real stuff happening. The OpenStack provider and the community interactions are evolving at quite a pace right now.</p>
<p>Using Keystone for authentication and possibly authorization looked to be taking shape. I am yet to work out how this fits into the above federation picture in the longer term.</p>
<p>When it comes to Nova integration, you can install kubernetes like any other service; Magnum is simply one way you can offer that installation as a service.</p>
<p>Looking at Cinder, there are two key scenarios to consider: firstly, k8s running inside an OpenStack VM; secondly, k8s running on (optionally OpenStack-provisioned) baremetal. Given that much work has been done around Cinder, let us take these in turn. When running in a VM, you want to attach the cinder volume to the VM via Nova, but in baremetal mode (or indeed an install outside of OpenStack altogether) you can use cinder to attach the volume inside the OS that is running k8s. The plan is to support both cases, with the latter generally being called “standalone cinder”, focused on k8s deployed outside of OpenStack.</p>
<p>It will be interesting to see how this all interacts with the discussions around natively support multi-tenancy in k8s, including the work on kata-containers (<a class="reference external" href="https://blog.jessfraz.com/post/hard-multi-tenancy-in-kubernetes/">https://blog.jessfraz.com/post/hard-multi-tenancy-in-kubernetes/</a>).</p>
<p>It does seem like many of these approaches are viable, depending on your use case. For me, the current winner is using OpenStack to software define your data centre, and treating the COE as an application that runs on that infrastructure. Magnum is a great attempt at providing that application as a service. Deployment on both Baremetal and VMs makes sense depending on the specific use case involved. More details on how I see some of this currently being used is available in this presentation I was involved with in Vancouver:
<a class="reference external" href="https://www.openstack.org/videos/vancouver-2018/containers-on-baremetal-and-preemptible-vms-at-cern-and-ska">https://www.openstack.org/videos/vancouver-2018/containers-on-baremetal-and-preemptible-vms-at-cern-and-ska</a></p>
</div>
<div class="section" id="cyborg-and-fpgas">
<h2>Cyborg and FPGAs</h2>
<p>I am disappointed to not have more to report here. OpenStack now has reasonable GPU and vGPU support, but there appears to be little progress around FPGAs. Sadly it is currently a long way down the list of things I would like to work on.</p>
<p>There was discussion on how naive PCI passthrough of the FPGA is a bad idea, because an attacker would be able to damage the hardware. There does appear to be work on proprietary shells that allow a protected level of access. There was also discussion on how various physical interfaces (such as a network interface, or DMA to main memory) will be integrated on the board containing the FPGAs, and how that could be properly integrated with the rest of OpenStack.</p>
<p>There also appears to be some interest in delivering specific pre-built accelerators built using FPGAs, and some discussion on sharing function contexts and multi-tenancy, but few specifics were shared.</p>
<p>On reflection the ecosystem still seems very vendor specific. As various clouds offer a selection of different services based around FPGAs, it will be interesting to see what approaches gain traction. This is just one example of the possible approaches:
<a class="reference external" href="https://github.com/aws/aws-fpga/blob/master/hdk/docs/AWS_Shell_Interface_Specification.md">https://github.com/aws/aws-fpga/blob/master/hdk/docs/AWS_Shell_Interface_Specification.md</a></p>
</div>
<div class="section" id="monitoring-and-self-healing">
<h2>Monitoring and Self-Healing</h2>
<p>There are way more interesting sessions than one person can attend, and this section of the update suffers from that. I hope to read many of the other Forum summary blog posts as they start to appear.</p>
<p>Discussing NFV-related monitoring and self-healing requirements, it became clear the use cases are very similar to those of some Scientific workloads. Fast detection of problems appeared to be key.</p>
<p>Personally I enjoyed the discussion around high resolution sampling and how it relates to pull (Prometheus) vs push (OpenStack Monasca) style systems. I had forgotten that you can do high resolution sampling processing in a distributed way, i.e. locally on each node (not unlike what <a class="reference external" href="https://opensource.google.com/projects/mtail">mtail</a> can do for logs when using Prometheus). You can then push alarms directly from the node, alongside pulling less time critical messages (i.e. being both a prometheus scrape endpoint and a client to prometheus alert manager). Although it sounds dangerously like an SNMP trap, it seems a useful blend of different approaches.</p>
</div>
<div class="section" id="scientific-sig">
<h2>Scientific SIG</h2>
<p>It was great to see the vibrant discussions and lightning talks during the SIG’s sessions. I am even happier seeing those discussions spilling out into all the general Forum sessions throughout the week, as I am attempting to document here. For example, it was great to see the overlap between the telco (Edge/NFV/etc) use cases and the scientific use cases.</p>
<p>I was humbled to win the lightning talk prize. If you want to find out more about the data accelerator I am helping build, please take a look at:
<a class="reference external" href="https://insidehpc.com/2018/04/introducing-cambridge-data-accelerator/">https://insidehpc.com/2018/04/introducing-cambridge-data-accelerator/</a></p>
<div class="youtube"><iframe src="https://www.youtube.com/embed/bJAGeJ_EEHY" width="450" height="300" allowfullscreen seamless frameBorder="0"></iframe></div></div>
<div class="section" id="overall">
<h2>Overall</h2>
<p>I feel energized after attending the Forum in Vancouver. I was in lots of Forum sessions that saw real progress made by having engagement from a broad cross section of the community. There was a great combination of new and old voices, all working towards getting real stuff done with OpenStack. I hope the only material change to the Forum, PTG and Ops Summit is to have the events back to back in the same location. I am all for being more efficient, but we must keep the great open collaboration.</p>
</div>
Future directions for OpenStack Monasca2018-06-01T13:30:00+01:002018-06-01T13:30:00+01:00Doug Szumskitag:www.stackhpc.com,2018-06-01:/future-directions-for-monasca.html<p class="first last">StackHPC have been helping to improve deployment,
configuration and multi-tenancy in OpenStack Monasca.</p>
<p>Monasca provides monitoring as a service for OpenStack. It’s scalable,
fault tolerant and supports multi-tenancy with Keystone integration.
You can bolt it on to your existing OpenStack distribution and it will
happily go about collecting logs and metrics, not just for your control
plane, but for tenant workloads too.</p>
<p>So how do you get started? Errr... well, one of the drawbacks of Monasca’s
microservice architecture is the complexity of deploying and managing the
services within it. Sound familiar? On the other hand this microservice
architecture is one of Monasca’s strengths. The deployment is flexible and
you can horizontally scale out components as your ingest rate increases.
But how do you do all of this?</p>
<p>Enter <a class="reference external" href="https://docs.openstack.org/kolla/latest/">OpenStack Kolla</a>.
Back in 2017, Steven Dake, the founder of the Kolla project, <a class="reference external" href="https://sdake.io/2017/01/09/crucial-strategy-for-openstack-in-2017/">wrote
about</a>
the significant human resource costs of running an OpenStack managed
cloud, and how the Kolla project offers a pathway to reduce them.
By providing robust deployment and upgrade mechanisms, Kolla helps
to keep OpenStack competitive with proprietary offerings, and at
StackHPC we want to bring the same improvements in operational
efficiency to the Monasca project. In doing so we’ve picked up the
baton for deploying Monasca with Kolla and we don't expect to put
it down until the job is finished. Indeed, since Kolla already
provides many required services and <a class="reference external" href="https://blueprints.launchpad.net/kolla-ansible/+spec/monasca-roles">support for deploying the APIs
has just been merged</a>,
we're hoping that this isn't too long.</p>
<p>So what else is new in the world of Monasca? One of the key things that we
believe differentiates Monasca is support for multi-tenancy. By allowing a
single set of infrastructure to be used for monitoring both the control plane
and tenant workloads, operational efficiency is increased. Furthermore,
because the data is all in one place, it becomes easy to augment tenant
data with what are typically admin-only metrics. We envisage a tenant
being able to log in and see something like this:</p>
<div class="figure">
<img alt="Cluster overview" src="//www.stackhpc.com/images/cluster_vision.png" style="width: 720px;" />
</div>
<p>By providing a <a class="reference external" href="http://michaelnielsen.org/reinventing_explanation/index.html">suitable medium for thought</a>, the tenant
no longer has to sift through streams of data to understand that their job
was running slow because Ceph was heavily loaded, or the new intern had
saturated the external gateway. Of course, exposing such data needs to be
done carefully and we hope to expand more upon this in a later blog post.</p>
<p>So how else can we help tenants? A second area that we've been looking at is
logging. Providing a decent logging service which can quickly and
easily offer insight into the complex and distributed jobs that tenants run
can save them a lot of time. To this effect we've been adding
support for <a class="reference external" href="https://review.openstack.org/#/c/570599/">querying tenant logs via the Monasca Log API</a>. After all
tenants can POST logs in, so why not support getting them out? One particular
use case that we've had is to monitor jobs orchestrated by
<a class="reference external" href="https://www.stackhpc.com/magnum-queens.html">Docker Swarm</a>. As part of
this work we knocked up a proof of concept Compose file
<a class="reference external" href="https://github.com/stackhpc/p3-appliances/pull/6/files">which deploys the Monasca Agent and Fluentd as global services</a>
across the Swarm cluster. With
a local instance of Fluentd running the <a class="reference external" href="https://github.com/stackhpc/fluentd-monasca">Monasca plugin</a>, container stdout can
be streamed directly into Monasca by selecting the Fluentd Docker log driver.
The tenant can then go to Grafana and see both container metrics and logs
all in one place, and with proper tenant isolation. Of course, we don't see
this as a replacement for Kibana, but it has its use cases.</p>
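<p>For a feel of how little glue this needs on the container side, a hedged example follows - the service and image names are invented, and we assume a local Fluentd instance carrying the Monasca plugin is listening on the usual port on each node:</p>
<pre class="literal-block">
$ docker service create --name myjob \
      --log-driver fluentd \
      --log-opt fluentd-address=localhost:24224 \
      --log-opt tag=myjob \
      myorg/myjob:latest
</pre>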
<p>Thirdly, an HPC blog post wouldn't be complete without mentioning Slurm. As
part of our work to provide intuitive visualisations we've developed a
<a class="reference external" href="https://github.com/stackhpc/stackhpc-monasca-agent-plugins">Monasca plugin</a>
which integrates with the
<a class="reference external" href="https://grafana.com/plugins/natel-discrete-panel">Discrete plugin</a> for
Grafana. By using the plugin to harvest Slurm job data we can present the
overall state of the Slurm cluster to anyone with access to see it:</p>
<div class="figure">
<img alt="Slurm Ganntt chart" src="//www.stackhpc.com/images/monasca_slurm.png" style="width: 720px;" />
</div>
<p>The coloured blocks map to Slurm jobs, and as a cluster admin I can
immediately see that there’s been a fair bit of activity. So as a user running
a Slurm job, can I easily get detailed information on the performance of my
job? It’s a little clunky at the moment, but this is something we want to
work on, both in the scale of the visualisation (we’re talking thousands of
nodes, not 8) and in the quality of the interface. As an example of what we
have today, here’s the CPU usage and some InfiniBand stats for 3 jobs running
on nodes 0 and 1:</p>
<div class="figure">
<img alt="Slurm drill down" src="//www.stackhpc.com/images/monasca_slurm_job.png" style="width: 720px;" />
</div>
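<p>Incidentally, the Discrete panel is not bundled with Grafana by default. A minimal sketch of installing it on the Grafana host, assuming <tt class="docutils literal"><span class="pre">grafana-cli</span></tt> is available:</p>
<div class="highlight"><pre># install the Discrete panel plugin, then restart Grafana to load it
grafana-cli plugins install natel-discrete-panel
sudo systemctl restart grafana-server
</pre></div>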
<p>Finally, we'll finish up with a summary. We've talked about helping to drive
forward progress in areas such as deployment, data visualisation and logging
within the Monasca project. Indeed, <a class="reference external" href="http://stackalytics.com/report/contribution/monasca-group/90">we're far from the only people</a>
with a goal for bettering Monasca, and we're very grateful for the others
that share it with us. However, we don't want you to think that we're living
in a bubble. In fact, speaking of driving, we see Monasca as an old car.
Not a bad one, rather a potential classic. One where you can still open the
bonnet and easily swap in and out parts. It's true that there is a little
rust. The forked version of Grafana with Keystone integration prevents users
from getting their hands on shiny new Grafana features. The forked Kafka
client means that we can't use the most recent version of Kafka, deployable
out of the box with Kolla. Similar issues exist with InfluxDB. And whilst the
rust is being repaired (and it <em>is</em> <a class="reference external" href="https://etherpad.openstack.org/p/monasca-ptg-rocky">being repaired</a>) newer, more tightly
integrated cars are coming out with long life servicing. One of these
is <a class="reference external" href="https://prometheus.io/">Prometheus</a>, which compared to Monasca is exceptionally easy
to deploy and manage. But with tight integration comes less flexibility. One
size fits all doesn't fit everyone. Prometheus doesn't officially support
multi-tenancy, <a class="reference external" href="https://docs.google.com/document/d/1C7yhMnb1x2sfeoe45f4mnnKConvroWhJ8KQZwIHJOuw">yet</a>.
We look forward to exploring other monitoring and logging frameworks in
future blog posts.</p>
Bare Metal InfiniBand2018-05-24T20:00:00+01:002018-05-24T20:00:00+01:00Mark Goddardtag:www.stackhpc.com,2018-05-24:/bare-metal-infiniband.html<p class="first last">In this post we take a look at recent StackHPC projects involving
integration of OpenStack Neutron, InfiniBand (IB) networks, and bare
metal compute.</p>
<p>InfiniBand networks are commonplace in the High Performance Computing (HPC)
sector, but less so in cloud. As the two sectors converge, InfiniBand support
becomes important for OpenStack. This article covers our recent work
integrating InfiniBand networks with OpenStack and bare metal compute.</p>
<div class="section" id="infiniband">
<h2>InfiniBand</h2>
<p>InfiniBand (IB) is a standardised network interconnect technology frequently
used in High Performance Computing (HPC), due to its high throughput and low
latency, and intrinsic support for Remote Direct Memory Access (RDMA) based
communication.</p>
<p>After initial competition in the IB market, only one vendor remains - Mellanox.
Despite this, IB remains a popular choice for HPC and storage clusters.</p>
<p>This slightly old roadmap from the InfiniBand Trade Association shows the
data rates of historical and future InfiniBand standards.</p>
<div class="figure">
<img alt="InfiniBand roadmap" src="//www.stackhpc.com/images/InfiniBand_Roadmap_113015.jpg" style="width: 720px;" />
</div>
</div>
<div class="section" id="unmanaged-ib">
<h2>Unmanaged IB</h2>
<p>Integration of Mellanox IB and OpenStack is <a class="reference external" href="https://wiki.openstack.org/wiki/Mellanox-OpenStack">documented</a> for virtualised
compute, but support for bare metal compute with Ironic is listed as <em>TBD</em>.</p>
<p>The simplest solution to that <em>TBD</em> is to treat the IB network as
an unmanaged resource, outside of OpenStack's control, and give
instances unrestricted access to the network. There are a couple
of drawbacks to this approach. First, there is no isolation between
different users of the system, since all users share a single
partition. Second, IP addresses on the IB network must be managed
outside of Neutron and configured manually.</p>
</div>
<div class="section" id="mellanox-ib-and-virtualised-compute">
<h2>Mellanox IB and Virtualised Compute</h2>
<p>Before we look at IB for bare metal compute, let's look at the (relatively)
well trodden path - <a class="reference external" href="https://wiki.openstack.org/wiki/Mellanox-Neutron-Pike-InfiniBand">virtualised compute</a>.
Mellanox provides a <a class="reference external" href="https://github.com/openstack/networking-mlnx">Neutron plugin</a> with two ML2 mechanism
drivers that allow Mellanox ConnectX-[3-5] NICs to be plugged into VMs via
SR-IOV. IB partitions are treated as VLANs as far as Neutron is concerned.</p>
<p>The first driver, <tt class="docutils literal">mlnx_infiniband</tt>, binds to IB ports. The companion agent,
<tt class="docutils literal"><span class="pre">neutron-mlnx-agent</span></tt>, runs on the compute nodes and manages their NICs, their
SR-IOV Virtual Functions (VFs), and ensures they are members of the correct IB
partitions.</p>
<p>The second driver, <tt class="docutils literal">mlnx_sdn_assist</tt>, programs the InfiniBand subnet manager
to ensure that compute nodes are members of the required partitions. This may
sound simple, but actually involves a chain of services. Mellanox NEO is an
SDN controller, Mellanox UFM is an InfiniBand fabric manager, and OpenSM is an
open source InfiniBand subnet manager. This driver is optional since
<tt class="docutils literal"><span class="pre">neutron-mlnx-agent</span></tt> manages partition membership on the compute nodes, but
does add an extra layer of security.</p>
<p>Connectivity to IB partitions from the Open vSwitch bridges on the Neutron
network host is made possible via Mellanox's eIPoIB kernel module, provided
with OFED, which allows Ethernet to be tunnelled over an InfiniBand network.
This allows Neutron to provide DHCP and L3 routing services.</p>
<div class="figure">
<img alt="OpenStack and Mellanox NEO for InfiniBand" src="//www.stackhpc.com/images/Mellanox-NEO-IB.png" style="width: 720px;" />
</div>
<div class="section" id="eipoib-contraband">
<h3>eIPoIB ContraBand?</h3>
<p>As of the 4.0 release, Mellanox OFED no longer provides the eIPoIB kernel
module that is required for Neutron DHCP and L3 routing. No official solution
to this problem is provided, nor is it mentioned in the documentation on the
wiki. The OFED software is fairly tightly coupled to a particular kernel, so
using an older release of OFED is typically not an option when using a recent
OS.</p>
</div>
</div>
<div class="section" id="mellanox-ib-and-bare-metal-compute">
<h2>Mellanox IB and Bare Metal Compute</h2>
<p>How do we translate this to the world of bare metal compute?</p>
<p>The <a class="reference external" href="https://specs.openstack.org/openstack/ironic-specs/specs/6.2/add-infiniband-support.html">Ironic InfiniBand specification</a>
provides us with a few pointers. The specification includes two principal
changes:</p>
<ol class="arabic simple">
<li>inspection of InfiniBand devices using Ironic Inspector</li>
<li>PXE booting compute nodes using InfiniBand devices</li>
</ol>
<p>Inspection is necessary, but we don't need to PXE boot over IB, since both of
our clients' systems have Ethernet networks as well.</p>
<p>Comparing with the solution for virtualised compute, we don't need SR-IOV
since compute instances have full control over their hardware. We also don't
need the <tt class="docutils literal">mlnx_infiniband</tt> ML2 driver and associated <tt class="docutils literal"><span class="pre">neutron-mlnx-agent</span></tt>,
since we can't run an agent on the compute instance. The <tt class="docutils literal">mlnx_sdn_assist</tt>
driver is required, as the only way to enforce partition membership for a bare
metal compute node is via the subnet manager.</p>
<p>Ironic port <a class="reference external" href="https://specs.openstack.org/openstack/ironic-specs/specs/not-implemented/physical-network-awareness.html">physical network tags</a>
are applied during hardware inspection, and can be used to ensure a correct
mapping between Neutron ports and Ironic ports in the presence of multiple
networks.</p>
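<p>As a sketch (the port UUID and physical network name below are hypothetical), an Ironic port's physical network can also be set manually:</p>
<div class="highlight"><pre># associate a bare metal port with the IB physical network
openstack baremetal port set ${PORT_UUID} --physical-network physnet-ib
</pre></div>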
<div class="section" id="addressing">
<h3>Addressing</h3>
<p>A quick diversion into addressing of InfiniBand in OpenStack. An InfiniBand
GUID is a 64 bit identifier, composed of a 24 bit manufacturer prefix followed
by a 40 bit device identifier. In Ironic, ports are registered with an
Ethernet-style 48 bit address, by concatenating the first and last 24 bits of
the GUID. For example:</p>
<div class="line-block">
<div class="line">Port GUID: <tt class="docutils literal">0x0002c90300002f79</tt></div>
<div class="line">Ironic port address: <tt class="docutils literal">00:02:c9:00:2f:79</tt></div>
</div>
<p>The <a class="reference external" href="https://github.com/openstack/ironic-python-agent/blob/master/ironic_python_agent/hardware_managers/mlnx.py">Mellanox hardware manager</a>
in Ironic Python Agent (IPA) registers a client ID for InfiniBand ports that is
used for DHCP, and also within Ironic for classifying a port as InfiniBand. The
client ID is formed by concatenating a vendor-specific prefix with the port's
GUID. For example, for a Mellanox ConnectX family NIC:</p>
<div class="line-block">
<div class="line">Port GUID: <tt class="docutils literal">0x0002c90300002f79</tt></div>
<div class="line">Client ID: <tt class="docutils literal">ff:00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:2f:79</tt></div>
</div>
<p>If using IP over InfiniBand (IPoIB), the IPoIB device has a 20 byte hardware
address which is formed by concatenating 4 bytes of flags, 8 bytes of subnet
identifier, and the 8 byte port GUID. For example:</p>
<div class="line-block">
<div class="line">Port GUID: <tt class="docutils literal">0x0002c90300002f79</tt></div>
<div class="line">Flags: <tt class="docutils literal">00:00:00:86</tt></div>
<div class="line">Default subnet ID: <tt class="docutils literal">fe:80:00:00:00:00:00:00</tt></div>
<div class="line">IPoIB hardware address: <tt class="docutils literal">00:00:00:86:fe:80:00:00:00:00:00:00:00:02:c9:03:00:00:2f:79</tt></div>
</div>
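<p>These derivations are mechanical, so a short shell sketch using the example GUID above may help to make them concrete:</p>
<div class="highlight"><pre>guid=0002c90300002f79

# Ironic port address: first and last 24 bits of the GUID
echo "${guid:0:6}${guid: -6}" | fold -w2 | paste -sd: -

# client ID: vendor-specific prefix followed by the full GUID
echo "ff000000000002000002c900${guid}" | fold -w2 | paste -sd: -

# IPoIB hardware address: flags + default subnet ID + port GUID
echo "00000086fe80000000000000${guid}" | fold -w2 | paste -sd: -
</pre></div>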
</div>
</div>
<div class="section" id="alaska">
<h2>ALaSKA</h2>
<p>Our first bare metal IB project, the SKA's ALaSKA system, is a shared computing
resource, and does not have strict project isolation requirements. We therefore
decided to treat the IB network as a 'flat' network, without partitions. This
avoids the requirement for Mellanox NEO and UFM, and the associated software
license. Instead, we run an OpenSM subnet manager on the OpenStack controller.</p>
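<p>A rough sketch of what this looks like on a CentOS controller, assuming the <tt class="docutils literal">opensm</tt> package from the distribution's RDMA stack:</p>
<div class="highlight"><pre># run an open source InfiniBand subnet manager on the controller
sudo yum -y install opensm
sudo systemctl enable opensm
sudo systemctl start opensm
</pre></div>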
<p>In order for Neutron to bind to the IB ports, we still use the
<tt class="docutils literal">mlnx_sdn_assist</tt> mechanism driver. We developed a patch for the driver that
allows us to configure the driver not to forward configuration to NEO. We're
working on pushing this patch upstream.</p>
<p>The ALaSKA controller runs CentOS 7.4, which means that we need to use Mellanox
OFED 4.1+. As mentioned previously, this prevents us from using eIPoIB, the
prescribed method for providing Neutron DHCP and L3 routing services on
InfiniBand.</p>
<div class="section" id="config-drive-to-the-rescue">
<h3>Config Drive to the Rescue</h3>
<p>Thankfully, Stig (our CTO) came to the rescue with his suggestion of using a
<a class="reference external" href="https://docs.openstack.org/nova/queens/user/config-drive.html">config drive</a> to configure
IP addresses on the IB network. If a Neutron network has <tt class="docutils literal">enable_dhcp</tt> set to
<tt class="docutils literal">False</tt>, then Nova will populate a file, <tt class="docutils literal">network_data.json</tt>, in the
config drive with IP configuration for the instance.</p>
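<p>A minimal sketch of the admin-side setup (the network names and subnet range here are hypothetical):</p>
<div class="highlight"><pre># create the IB network with DHCP disabled, so that Nova writes IP
# configuration into network_data.json on the config drive instead
openstack network create ib-net \
    --provider-network-type flat --provider-physical-network physnet-ib
openstack subnet create ib-subnet \
    --network ib-net --subnet-range 10.0.0.0/24 --no-dhcp
</pre></div>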
<p>This works because we don't need to PXE boot via the IB network. We still don't
get L3 routing, but that isn't a requirement for our current use cases.</p>
</div>
<div class="section" id="cloud-init">
<h3>Cloud-init</h3>
<p>Config drives are typically processed at boot by a service such as
<tt class="docutils literal"><span class="pre">cloud-init</span></tt>, prior to configuring network interfaces. One thing we need to
ensure is that the required kernel modules (<tt class="docutils literal">ib_ipoib</tt> and <tt class="docutils literal">mlx4_ib</tt> or
<tt class="docutils literal">mlx5_ib</tt>) are loaded prior to processing the config drive. For this we
use systemd's <tt class="docutils literal"><span class="pre">/etc/modules-load.d/</span></tt> mechanism, and have built a <a class="reference external" href="https://github.com/stackhpc/stackhpc-image-elements/tree/master/elements/systemd-modules-load">Diskimage
Builder (DIB) element</a>
that injects configuration for this into user images.</p>
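<p>The effect is equivalent to shipping a file such as the following in the image (a sketch; the file name is arbitrary, and <tt class="docutils literal">mlx4_ib</tt> or <tt class="docutils literal">mlx5_ib</tt> is chosen to match the NIC generation):</p>
<div class="highlight"><pre># /etc/modules-load.d/infiniband.conf
# modules loaded by systemd-modules-load before cloud-init runs
ib_ipoib
mlx5_ib
</pre></div>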
<p>Problem sorted? Sadly not. Due to the discrepancy between the 20 byte IPoIB
hardware address and the 6 byte Ethernet style address in the config drive,
cloud-init is not able to determine which interface to configure. We developed
a <a class="reference external" href="https://github.com/stackhpc/cloud-init/commit/0fec13275831c857ff4c1c0bb0c14f8fef9abb28">patch for cloud-init</a>
that resolves this dual identity issue.</p>
</div>
<div class="section" id="zooming-out">
<h3>Zooming Out</h3>
<p>Putting all of this together, from a user's perspective, we have:</p>
<ul class="simple">
<li>a new Neutron network that user instances can attach to, in addition to the
two existing Ethernet networks</li>
<li>updated user images that include the kernel module loading config and patched
cloud-init package</li>
<li>a requirement for users to set <tt class="docutils literal"><span class="pre">config-drive=True</span></tt> when creating instances</li>
</ul>
<p>This will lead to instances with an IP address on their IPoIB interface.</p>
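<p>For example, a boot request satisfying the above might look like this (image, flavour, network and key names are hypothetical):</p>
<div class="highlight"><pre>openstack server create my-node \
    --image centos7-ib --flavor baremetal \
    --network ethernet-net --network ib-net \
    --config-drive True --key-name my-key
</pre></div>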
</div>
</div>
<div class="section" id="verne-global-adding-multitenancy">
<h2>Verne Global - Adding Multitenancy</h2>
<p>Verne Global operate <a class="reference external" href="https://verneglobal.com/solutions/hpcdirect">hpcDIRECT</a>, a ground-breaking
HPC-as-a-Service cloud <a class="reference external" href="//www.stackhpc.com/verne-globals-hpcdirect-service-bare-metal-powered-by-molten-rock.html">based on OpenStack</a>. hpcDIRECT
inherently has strict tenant isolation requirements. We therefore
added two Mellanox services to the setup - NEO and UFM - to provide
isolation between projects via InfiniBand partitions.</p>
<div class="section" id="neo-ufm">
<h3>NEO & UFM</h3>
<p>The hpcDIRECT control plane is deployed in containers via Kolla Ansible, and
the team at Verne Global were keen to ensure the benefits of containerisation
extend to NEO and UFM. To this end, we created a set of image recipes and
Ansible deployment roles:</p>
<ul class="simple">
<li>NEO <a class="reference external" href="https://github.com/stackhpc/docker-mlnx-neo">image</a> and <a class="reference external" href="https://github.com/stackhpc/ansible-role-mlnx-neo">deployment
role</a></li>
<li>UFM <a class="reference external" href="https://github.com/stackhpc/docker-mlnx-ufm">image</a> and <a class="reference external" href="https://github.com/stackhpc/ansible-role-mlnx-ufm">deployment
role</a></li>
</ul>
<p>These tools have been integrated into the hpcDIRECT Kayobe configuration as an
extension to the standard deployment workflow.</p>
<p>Neither service would be classed as a typical container workload - the
containers run systemd and several services each. We also require host
networking for access to the InfiniBand network. Still, containers allow us to
decouple these services from the host OS, and from other services running on the
same host.</p>
<p>The setup did not work initially, but after receiving a patched version of
NEO from the team at Mellanox, we were on our way.</p>
<p>NEO and UFM allow some ports to be configured, but both seem to require the use
of ports 80 and 443. These are popular ports, and it's easy to hit a conflict
with another service such as Horizon.</p>
</div>
<div class="section" id="fabric-visibility">
<h3>Fabric Visibility</h3>
<p>In an isolated environment, compute nodes should only have visibility
into the partitions of which they are a member. This principle
should also apply to low-level InfiniBand fabric tools such as
<tt class="docutils literal">ibnetdiscover</tt>.</p>
<p>Isolation at this layer requires the <a class="reference external" href="http://www.mellanox.com/related-docs/prod_software/Mellanox_OFED_Linux_User_Manual_v4_3.pdf">Secure Host</a>
feature of Mellanox ConnectX NICs.</p>
</div>
</div>
<div class="section" id="a-fellow-traveller">
<h2>A Fellow Traveller</h2>
<p>The solution presented here is based on work done recently
by the Mellanox team working with Jacob Anders from <a class="reference external" href="https://www.csiro.au/">CSIRO</a> in Australia.</p>
<p>You can see the presentation Jacob made with Moshe and Erez from Mellanox
at the OpenStack summit in Vancouver:</p>
<div class="youtube"><iframe src="https://www.youtube.com/embed/Z_3XlL7ocJU" width="450" height="300" allowfullscreen seamless frameBorder="0"></iframe></div></div>
<div class="section" id="summary">
<h2>Summary</h2>
<p>After integrating a Mellanox InfiniBand network with OpenStack and bare metal
compute for two clients, it's clear there are significant obstacles - not least
the removal of eIPoIB from OFED without a replacement solution. Despite this,
we were able to satisfy the requirements at each site. As Ironic gains
popularity, particularly for Scientific Computing, we expect to see more
deployments of this kind.</p>
<p>Thanks to Moshe Levi from Mellanox for helping us to piece together
some of the missing pieces of the puzzle.</p>
</div>
<div class="section" id="further-reading">
<h2>Further Reading</h2>
<ul class="simple">
<li><a class="reference external" href="https://wiki.openstack.org/wiki/Mellanox-Neutron-Pike-Redhat-InfiniBand">Mellanox Neutron Pike install guide</a></li>
<li><a class="reference external" href="https://skatelescope.org/">SKA Telescope</a></li>
<li><a class="reference external" href="http://ska-sdp.org/">SKA Science Data Processor (SDP)</a></li>
<li><a class="reference external" href="https://verneglobal.com/solutions/hpcdirect">hpcDIRECT</a></li>
</ul>
</div>
HPC Container Orchestration with Bare Metal Magnum2018-05-22T10:00:00+01:002018-05-22T10:00:00+01:00Bharat Kunwartag:www.stackhpc.com,2018-05-22:/magnum-queens.html<p class="first last">We summarise the challenges involved in our recent upgrade of
Magnum from Pike to Queens on an OpenStack Pike deployment using
Kayobe, where we additionally used a custom Fedora Atomic 27
image with the latest Docker release and support for RDMA-enabled
Gluster mounts over InfiniBand.</p>
<p>Our project with the <a class="reference external" href="https://www.stackhpc.com/kolla-kayobe-pike.html">Square Kilometre Array</a> includes a
requirement for high-performance containerised runtime environments. We
have been building a system with bare metal infrastructure, multiple
physical networks, high-performance data services and optimal
integrations between OpenStack and container orchestration engines (such
as Kubernetes and Docker Swarm).</p>
<p>We have previously documented our upgrade of OpenStack deployment from
<a class="reference external" href="https://www.stackhpc.com/kolla-kayobe-pike.html">Ocata to Pike</a>.
This upgrade impacted Docker Swarm and Kubernetes: provisioning of both
COEs in a bare metal environment failed after the upgrade. We resolved
the issues with Docker Swarm but left Kubernetes for patching over a
major release upgrade.</p>
<p>A fix was announced with the Queens release, along with <tt class="docutils literal"><span class="pre">swarm-mode</span></tt> support
for Docker. This strengthened the case for upgrading Magnum to Queens on an
underlying OpenStack Pike. The design ethos of <a class="reference external" href="https://docs.openstack.org/kolla-ansible/latest/">Kolla-Ansible</a> and <a class="reference external" href="https://kayobe.readthedocs.io/en/latest/">Kayobe</a>, using containerisation to
avoid the nightmares of dependency interlock, made the targeted upgrade
of Magnum a relatively smooth ride.</p>
<div class="section" id="fixing-magnum-deployment-by-upgrading-from-pike-to-queens">
<h2>Fixing Magnum deployment by upgrading from Pike to Queens</h2>
<p>We use Kayobe to manage the configuration of our Kolla deployment.
Changing the version of a single OpenStack service (in this case,
Magnum) is as simple as setting the version of a specific Docker
container tag, as follows:</p>
<blockquote>
<ul>
<li><p class="first">Prepare the Kaybobe environment (assuming it is <a class="reference external" href="https://kayobe.readthedocs.io/en/latest/installation.html">already installed</a>):</p>
<div class="highlight"><pre><span></span><span class="nb">cd</span> src/kayobe-config
git checkout BRANCH-NAME
git pull
<span class="nb">source</span> kayobe-env
<span class="nb">cd</span> ../kayobe
<span class="nb">source</span> ../../venv/kayobe/bin/activate
<span class="nb">export</span> <span class="nv">KAYOBE_VAULT_PASSWORD</span><span class="o">=</span>**secret**
</pre></div>
</li>
<li><p class="first">Add <tt class="docutils literal">magnum_tag: 6.0.0.0</tt> to
<tt class="docutils literal"><span class="pre">kayobe-config/etc/kayobe/kolla/globals.yml</span></tt>.</p>
</li>
<li><p class="first">Finally, build and deploy the new version of Magnum to the control
plane. To ensure that other OpenStack services are not affected
during the deployment, we use <tt class="docutils literal"><span class="pre">--kolla-tags</span></tt> and
<tt class="docutils literal"><span class="pre">--kolla-skip-tags</span></tt>:</p>
<div class="highlight"><pre><span></span>kayobe overcloud container image build magnum --push
-e <span class="nv">kolla_source_version</span><span class="o">=</span>stable/queens
-e <span class="nv">kolla_openstack_release</span><span class="o">=</span><span class="m">6</span>.0.0.0
-e <span class="nv">kolla_source_url</span><span class="o">=</span>https://git.openstack.org/openstack/kolla
kayobe overcloud container image pull --kolla-tags magnum
--kolla-skip-tags common
kayobe overcloud service upgrade --kolla-tags magnum
--kolla-skip-tags common
</pre></div>
</li>
</ul>
</blockquote>
<p>That said, the upgrade came with a few unforeseen issues:</p>
<blockquote>
<ul>
<li><p class="first">We discovered that Kolla Ansible, a tool that Kayobe uses to deploy
Magnum containers, assumes that host machines running Kubernetes are
able to communicate with Keystone on an internal endpoint, not an
option in our case since the internal endpoints were internal to the
control plane, which does not include tenant networks and instances
(which could be baremetal nodes or VMs). Since this is generally an
invalid assumption, <a class="reference external" href="https://review.openstack.org/#/c/566361/">a patch was pushed upstream</a> which has been quickly
approved in the code review process. After applying this patch, it
is necessary to reconfigure default configuration templates for
Heat, made possible by a single Kayobe command:</p>
<div class="highlight"><pre><span></span>kayobe overcloud service reconfigure --kolla-tags heat
--kolla-skip-tags common
</pre></div>
</li>
<li><p class="first">Docker community edition (v17.03.0-ce onwards) uses <tt class="docutils literal">cgroupfs</tt>
as the <tt class="docutils literal">native.cgroupdriver</tt>. However, Magnum assumes that
this is <tt class="docutils literal">systemd</tt> and does not explicitly demand this be the
case. As a result, deployment fails. This was addressed in <a class="reference external" href="https://github.com/stackhpc/magnum/pull/2">this
pull request</a>.</p>
</li>
<li><p class="first">By default, Magnum's behaviour is to assign a floating IP to each
server in a container infrastructure cluster. This means that all
the traffic flows through the control plane (when accessing the
cluster from an external location; internal traffic is direct).
Disabling floating IPs appeared to have no effect, which we filed as a
<a class="reference external" href="https://bugs.launchpad.net/magnum/+bug/1772433">bug on Launchpad</a>.
A patch to fix Magnum to correctly handle disabling of floating IPs in
swarm mode is <a class="reference external" href="https://github.com/stackhpc/magnum/pull/3">currently under way</a>.</p>
</li>
<li><p class="first"><a class="reference external" href="https://github.com/SKA-ScienceDataProcessor/alaska-kayobe-config/pull/62">Patch kayobe-config</a>
to update <tt class="docutils literal">magnum_tag</tt> to <tt class="docutils literal">6.0.0.0</tt> as well as point
<tt class="docutils literal">magnum_conductor_footer</tt> and <tt class="docutils literal">magnum_api_footer</tt> to a
patched Magnum Queens fork <tt class="docutils literal">stackhpc/queens</tt> on our Github
account.</p>
</li>
</ul>
</blockquote>
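<p>On the cgroup driver point above, a hedged sketch of checking which driver a host's Docker daemon is using, and one way of pinning it to <tt class="docutils literal">systemd</tt> explicitly:</p>
<div class="highlight"><pre># report the native cgroup driver in use
docker info --format '{{ .CgroupDriver }}'

# pin the driver explicitly, then restart the daemon to apply it
echo '{ "exec-opts": ["native.cgroupdriver=systemd"] }' \
    | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker
</pre></div>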
</div>
<div class="section" id="fedora-atomic-27-image-for-containers">
<h2>Fedora Atomic 27 image for containers</h2>
<p>A recently released Fedora Atomic 27 image (<a class="reference external" href="http://dl05.fedoraproject.org/pub/alt/atomic/stable/Fedora-Atomic-27-20180326.1/CloudImages/x86_64/images/Fedora-Atomic-27-20180326.1.x86_64.qcow2">download link</a>)
comes packaged with bare metal and Mellanox drivers, therefore it is no
longer necessary to build a custom image using <tt class="docutils literal"><span class="pre">diskimage-builder</span></tt>
to incorporate these drivers. However, it was still necessary to make a few
one-off manual changes, which we applied to the image through the
<tt class="docutils literal">virsh</tt> console:</p>
<blockquote>
<ul>
<li><p class="first">First, boot into the image using <tt class="docutils literal"><span class="pre">cloud-init</span></tt> credentials
defined within <tt class="docutils literal">init.iso</tt> (built from <a class="reference external" href="http://www.projectatomic.io/docs/quickstart/">these instructions</a>):</p>
<div class="highlight"><pre><span></span>sudo virt-install --name fa27 --ram <span class="m">2048</span> --vcpus <span class="m">2</span> --disk
<span class="nv">path</span><span class="o">=</span>/var/lib/libvirt/images/Fedora-Atomic-27-20180326.1.x86_64.qcow2
--os-type linux --os-variant fedora25 --network <span class="nv">bridge</span><span class="o">=</span>virbr0
--cdrom /var/lib/libvirt/images/init.iso --noautoconsole
sudo virsh console fa27
</pre></div>
</li>
<li><p class="first">The images are shipped with Docker v1.13.1 which is 12 releases
behind the current stable release, Docker v18.03.1-ce (note that
versioning scheme changed after v1.13.1 to v17.03.0-ce). To obtain
up-to-date features required by our customers, we upgraded this to
the latest release.</p>
<div class="highlight"><pre><span></span>sudo su
<span class="nb">cd</span> /etc/yum.repos.d/
curl -O https://download.docker.com/linux/fedora/docker-ce.repo
rpm-ostree override remove docker docker-common cockpit-docker
rpm-ostree install docker-ce -r
</pre></div>
</li>
<li><p class="first">Fedora Atomic 27 comes installed with packages for a scalable
network file system called <a class="reference external" href="https://docs.gluster.org/en/latest/">GlusterFS</a>. However, one of our
customer requirements was to support RDMA capability for GlusterFS,
in order to maximise IOPS for data-intensive tasks compared to
IP-over-InfiniBand. The package was available in the <tt class="docutils literal"><span class="pre">rpm-ostree</span></tt>
repository as <tt class="docutils literal"><span class="pre">glusterfs-rdma</span></tt>. It was installed and enabled
as follows:</p>
<div class="highlight"><pre><span></span><span class="c1"># installing glusterfs</span>
sudo su
rpm-ostree upgrade
rpm-ostree install glusterfs-rdma fio
systemctl <span class="nb">enable</span> rdma
rpm-ostree install perftest infiniband-diags
</pre></div>
</li>
<li><p class="first">Write <tt class="docutils literal"><span class="pre">cloud-init</span></tt> script that runs once to resize root volume
partition due to the fact that root volume is mounted as LVM and
conventional <tt class="docutils literal"><span class="pre">cloud-init</span></tt> script to grow this partition fails
leading to containers deployed inside swarm cluster to quickly fill
up. The following script placed under
<tt class="docutils literal">/etc/cloud/cloud.cfg.d/99_growpart.cfg</tt> did the trick in our
case which generalises to different types of root block devices:</p>
<div class="highlight"><pre><span></span><span class="c1">#cloud-config</span>
<span class="c1"># resize volume</span>
<span class="nt">runcmd</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="l l-Scalar l-Scalar-Plain">lsblk</span>
<span class="p p-Indicator">-</span> <span class="l l-Scalar l-Scalar-Plain">PART=$(pvs | awk '$2 == "atomicos" { print $1 }')</span>
<span class="p p-Indicator">-</span> <span class="l l-Scalar l-Scalar-Plain">echo $PART</span>
<span class="p p-Indicator">-</span> <span class="l l-Scalar l-Scalar-Plain">/usr/bin/growpart ${PART:</span><span class="p p-Indicator">:</span> <span class="l l-Scalar l-Scalar-Plain">-1} ${PART</span><span class="p p-Indicator">:</span> <span class="l l-Scalar l-Scalar-Plain">-1}</span>
<span class="p p-Indicator">-</span> <span class="l l-Scalar l-Scalar-Plain">pvresize ${PART}</span>
<span class="p p-Indicator">-</span> <span class="l l-Scalar l-Scalar-Plain">lvresize -r -l 100%FREE /dev/atomicos/root</span>
<span class="p p-Indicator">-</span> <span class="l l-Scalar l-Scalar-Plain">lsblk</span>
</pre></div>
</li>
<li><p class="first">Cleaning up the image makes it lighter but prevents users rolling
back the installation, an action we do not anticipate users to need
to perform. Removing <tt class="docutils literal"><span class="pre">cloud-init</span></tt> config allows it to run
again and removes authorisation details. Cleaning service logs gives
the image a fresh start.</p>
<div class="highlight"><pre><span></span><span class="c1"># undeploy old images</span>
sudo atomic host status
<span class="c1"># current deployment <0/1> should have a * next to it</span>
sudo ostree admin undeploy <<span class="m">0</span>/1>
sudo rpm-ostree cleanup -r
<span class="c1"># image cleanup</span>
sudo rm -rf /var/log/journal/*
sudo rm -rf /var/log/audit/*
sudo rm -rf /var/lib/cloud/*
sudo rm /var/log/cloud-init*.log
sudo rm -rf /etc/sysconfig/network-scripts/ifcfg-*
<span class="c1"># auth cleanup</span>
sudo rm ~/.ssh/authorized_keys
sudo passwd -d fedora
<span class="c1"># virsh cleanup</span>
<span class="c1"># press Ctrl+Shift+] to exit virsh console</span>
sudo virsh shutdown fa27
sudo virsh undefine fa27
</pre></div>
</li>
</ul>
</blockquote>
</div>
<div class="section" id="ansible-roles-for-managing-container-infrastructure">
<h2>Ansible roles for managing container infrastructure</h2>
<p>There are official Ansible modules for various OpenStack projects like
Nova, Heat, Keystone, etc. However, Magnum does not currently have one,
in particular for creating, updating and managing container
infrastructure and its inventory. Magnum also lacks certain useful features
that are possible indirectly through the Nova API, such as attaching multiple
network interfaces to each node in the clusters that it creates. The ability to
generate and reuse an existing cluster inventory is further necessitated
by a specific requirement of this project to mount GlusterFS volumes to
each node in the container infrastructure cluster.</p>
<p>In order to lay the foundation for performing preliminary data
consumption tests for the Square Kilometre Array's (SKA) Performance
Prototype Platform (P3), we needed to attach each node in the container
infrastructure cluster to multiple high speed network interfaces:</p>
<blockquote>
<ul class="simple">
<li>10G Ethernet</li>
<li>25G High Throughput Ethernet</li>
<li>100G InfiniBand</li>
</ul>
</blockquote>
<p>We have submitted a <a class="reference external" href="https://blueprints.launchpad.net/magnum/+spec/multiple-external-networks">blueprint</a>
to support multiple networks using the Magnum API, since Nova already allows
multiple network interfaces to be attached. In the meantime, we wrote an
<a class="reference external" href="https://github.com/stackhpc/ansible-role-os-container-infra">Ansible role to drive Magnum</a>
and generate an Ansible inventory from the cluster deployment. Using
this inventory, further playbooks apply our enhancements to the deployment.</p>
<p>The role allows us to declare the specification of the container infrastructure
required, including a variable listing the networks to attach to
cluster nodes. A bespoke Ansible module, <tt class="docutils literal">os_container_infra</tt>,
creates, updates or deletes the cluster as specified using
<tt class="docutils literal"><span class="pre">python-magnumclient</span></tt>. Another module called
<tt class="docutils literal">os_stack_facts</tt> then gathers facts about the container infrastructure
using <tt class="docutils literal"><span class="pre">python-heatclient</span></tt>, allowing us to generate an inventory of
the cluster. Finally, a module called <tt class="docutils literal">os_server_interface</tt> uses
<tt class="docutils literal"><span class="pre">python-novaclient</span></tt> to attach each node in the container
infrastructure cluster to additional network interfaces declared in the
specifications.</p>
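<p>As a rough sketch, fetching the role and driving it might look like this (the playbook and its variables are placeholders; the role's README documents the real interface):</p>
<div class="highlight"><pre># fetch the role straight from GitHub
ansible-galaxy install \
    git+https://github.com/stackhpc/ansible-role-os-container-infra

# apply a playbook that uses the role, with the cluster specification
# (name, template, node counts, extra networks) set via role variables
ansible-playbook container-infra.yml
</pre></div>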
<p>We make use of the recently announced <tt class="docutils literal">openstacksdk</tt> Python module
for talking to OpenStack, which was conceived with the aim of assimilating
the <tt class="docutils literal">shade</tt> and <tt class="docutils literal">os_client_config</tt> projects, which have been
performing similar functions under separate umbrellas. We enjoyed the
experience of using the <tt class="docutils literal">openstacksdk</tt> API, which is largely consistent with
the parent projects. Ansible plans to eventually transition to
<tt class="docutils literal">openstacksdk</tt> but they do not currently have specific plans to support
plugin libraries like <tt class="docutils literal"><span class="pre">python-magnumclient</span></tt>,
<tt class="docutils literal"><span class="pre">python-heatclient</span></tt> and <tt class="docutils literal"><span class="pre">python-novaclient</span></tt> which provide
wider coverage in terms of the range of interaction with their API
compared to <tt class="docutils literal">openstacksdk</tt>, which offers only a set of
common denominators across various OpenStack cloud platforms.</p>
<p>With Magnum playing an increasingly important role in the OpenStack
ecosystem by allowing users to create and manage container orchestration
engines like Kubernetes, we expect this role will make lives easier for
those of us who regularly use Ansible to manage complex and large scale
HPC infrastructure.</p>
</div>
<div class="section" id="further-reading">
<h2>Further Reading</h2>
<ul class="simple">
<li><a class="reference external" href="https://docs.openstack.org/magnum/queens/user/">Magnum</a></li>
<li><a class="reference external" href="https://kayobe.readthedocs.io/en/latest/">Kayobe</a></li>
<li><a class="reference external" href="https://docs.openstack.org/kolla-ansible/latest/">Kolla-Ansible</a></li>
<li><a class="reference external" href="https://kubernetes.io/docs/reference/">Kubernetes</a></li>
<li><a class="reference external" href="https://docs.docker.com">Docker</a></li>
<li><a class="reference external" href="https://docs.gluster.org/en/latest/">Gluster</a></li>
<li><a class="reference external" href="https://community.mellanox.com/docs/DOC-1465">Mellanox</a></li>
<li><a class="reference external" href="https://skatelescope.org/">SKA Telescope</a></li>
<li><a class="reference external" href="http://ska-sdp.org/">SKA Science Data Processor (SDP)</a></li>
</ul>
</div>
Kayobe Update2018-04-25T10:00:00+01:002018-04-25T10:50:00+01:00Mark Goddardtag:www.stackhpc.com,2018-04-25:/kayobe-update.html<p class="first last">15 months on from the inception of the Kayobe OpenStack deployment
tool, we provide an update on the project, covering the recent move to
become an OpenStack-related project, and why we think Kayobe is
quickly becoming a viable alternative to TripleO that provides
significant advantages.</p>
<p><a class="reference external" href="http://kayobe.readthedocs.io/en/latest/">Kayobe</a> is an OpenStack deployment
tool I started at the beginning of 2017. Things have come a long way since
then, and this article aims to provide an update on where the project is
currently, recent changes, and where we expect it to go next.</p>
<div class="section" id="becoming-an-openstack-related-project">
<h2>Becoming an OpenStack-Related Project</h2>
<p>My main area of focus for Kayobe recently has been the move to become an
OpenStack-related project. This is the current name for non-official projects
that use OpenStack infrastructure - much like the 'big tent'.</p>
<p>There were a number of factors that led to this, including:</p>
<ul class="simple">
<li>promote wider adoption through conformity</li>
<li>use of OpenStack's excellent Continuous Integration (CI)
infrastructure</li>
</ul>
<p>The <a class="reference external" href="https://docs.openstack.org/infra/manual/creators.html">OpenStack project creator's guide</a> is a long read, with
many serialised steps. Thankfully, not all steps were applicable, and the
infrastructure was up and running relatively quickly - thanks, infra team!</p>
<p>The headline changes for users and developers of Kayobe are:</p>
<ul class="simple">
<li>Bugs and features are now tracked via <a class="reference external" href="https://storyboard.openstack.org/#!/project/928">Storyboard</a>, rather than Github.
Storyboard is OpenStack's shiny new task management tool, and will be
gradually replacing Launchpad which is no longer supported by Canonical.</li>
<li>Code changes are submitted via <a class="reference external" href="https://review.openstack.org">Gerrit</a>, rather than Github pull requests.</li>
<li>Code changes are tested via <a class="reference external" href="https://docs.openstack.org/infra/zuul/">Zuul</a>,
rather than <a class="reference external" href="https://travis-ci.org/">TravisCI</a>.</li>
</ul>
</div>
<div class="section" id="testing-continuous-integration-ci">
<h2>Testing & Continuous Integration (CI)</h2>
<p>Over time we built up a set of CI jobs for Kayobe in TravisCI. It's an
impressive service - they provide free CI for any public repository on Github,
with pull request integration. A full control plane deployment test remained
elusive however. TravisCI provides only Ubuntu 12.04 and 14.04 as test
instance images, and these releases are no longer supported by OpenStack. We
tried installing Kayobe in a Docker container, but continued to hit issues.</p>
<p>While moving to OpenStack's infrastructure, I started reading up on Zuul v3.
The latest release of the project uses a simple in-repo job definition syntax
based on YAML. Jobs are implemented using Ansible, and a form of inheritence
is supported that allows jobs to be built up in layers and reused. Zuul is
currently going through the process of becoming a project in its own right -
still under the OpenStack Foundation umbrella, but separate from OpenStack.
Hopefully this will increase adoption of this great tool.</p>
<p>At the time of writing, Kayobe has 10 Zuul jobs, including one for deploying a
single node control plane, and one for deploying a seed host. We still have a
way to go to improve test coverage, but these jobs have already shown their
worth and should allow us to move forward with greater confidence.</p>
</div>
<div class="section" id="upcoming-changes">
<h2>Upcoming Changes</h2>
<p>Here are some of the work items in the pipeline:</p>
<ul class="simple">
<li><a class="reference external" href="https://storyboard.openstack.org/#!/story/2001863">support for the Queens release of OpenStack</a></li>
<li><a class="reference external" href="https://storyboard.openstack.org/#!/story/2001864">support for development against master branch of OpenStack projects</a></li>
<li><a class="reference external" href="https://storyboard.openstack.org/#!/story/2001649">support more recent versions of Ansible</a> - we're currently
limited to Ansible 2.3</li>
</ul>
<p>A feature that I'm particularly excited about is <a class="reference external" href="https://storyboard.openstack.org/#!/story/2001663">supporting extension points
for custom behaviour</a>.
This will allow users to integrate their own Ansible playbooks and roles with
Kayobe, and configure them to run at specific points during existing workflows.
This will allow the core of Kayobe to remain small without limiting the
flexibility of the tool.</p>
</div>
<div class="section" id="dublin-ptg">
<h2>Dublin PTG</h2>
<p>I attended the snowy Project Teams Gathering (PTG) in Dublin along with other
folks from StackHPC. During the Kolla sessions <a class="reference external" href="https://etherpad.openstack.org/p/kolla-rocky-ptg-kayobe">I presented Kayobe</a> to the team. There
was interest from a number of parties, ranging from wanting to try out Kayobe
to considering how the two projects could collaborate in future.</p>
<p>I also gave a video update on the project while in Dublin. Thanks to Rich
Bowen from Red Hat for putting these interviews together.</p>
<div class="youtube"><iframe src="https://www.youtube.com/embed/0axlEFIYi2s" width="450" height="300" allowfullscreen seamless frameBorder="0"></iframe></div></div>
<div class="section" id="demo">
<h2>Demo!</h2>
<p>The following demo runs through deployment of an OpenStack control plane in a
Vagrant VM, using the Kayobe development environment tooling. The commands
were taken from the <a class="reference external" href="http://kayobe.readthedocs.io/en/latest/development/manual.html">Kayobe development guide</a>, and it
should be possible to replicate the demo on a modern laptop.</p>
<p>The asciicast has been edited to limit idle time to 0.5 seconds, and playback
is in double time. The real-time video is available <a class="reference external" href="https://asciinema.org/a/176883">here</a>, and the normal-speed, idle-time-limited
video is available <a class="reference external" href="https://asciinema.org/a/176888">here</a>.</p>
<script src="https://asciinema.org/a/176888.js" id="asciicast-176888" async data-speed=2></script></div>
<div class="section" id="manchester-openstack-meetup">
<h2>Manchester OpenStack Meetup</h2>
<p>Thanks to the Manchester OpenStack meetup for allowing me to <a class="reference external" href="https://www.meetup.com/Manchester-OpenStack-Meetup/events/248435996/">speak on Kayobe</a>
recently. Always interesting to meet new folks in the community, and great to
see OpenStack remains strong in Manchester. The talk was not recorded, but the
<a class="reference external" href="https://www.slideshare.net/MarkGoddard2/to-kayobe-or-not-to-kayobe">slides are available</a>.</p>
<blockquote class="twitter-tweet" data-lang="en-gb"><p lang="en" dir="ltr">Mark Goddard (<a href="https://twitter.com/markgoddard86?ref_src=twsrc%5Etfw">@markgoddard86</a>) from <a href="https://twitter.com/stackhpc?ref_src=twsrc%5Etfw">@stackhpc</a> delving into Kayobe <a href="https://twitter.com/hashtag/mcropenstack?src=hash&ref_src=twsrc%5Etfw">#mcropenstack</a> <a href="https://t.co/LJMASifKaf">pic.twitter.com/LJMASifKaf</a></p>— Danny Abukalam (@dabukalam) <a href="https://twitter.com/dabukalam/status/973295933613015041?ref_src=twsrc%5Etfw">12 March 2018</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></div>
<div class="section" id="kayobe-s-history-in-the-beginning">
<h2>Kayobe's History: In the Beginning</h2>
<p>Like many projects, Kayobe started life as a pile of shell scripts. These
provided automation of <a class="reference external" href="https://docs.openstack.org/kolla-ansible/latest/">Kolla Ansible</a> to deploy the control
plane for the Square Kilometre Array's (SKA) Performance Prototype Platform
(P3). It soon became apparent that a more capable automation tool than bash
would be required, and I switched to my old friend <a class="reference external" href="https://www.ansible.com/">Ansible</a>.</p>
<p>Before long, the core of Kayobe as it currently stands began to emerge, filling
several gaps in Kolla Ansible:</p>
<ul class="simple">
<li>control plane server provisioning via <a class="reference external" href="https://docs.openstack.org/bifrost/latest/">Bifrost</a> (similar to TripleO's
undercloud)</li>
<li>configuration of the control plane hosts</li>
<li>improved support for bare metal compute node provisioning</li>
</ul>
</div>
<div class="section" id="widening-the-audience">
<h2>Widening the Audience</h2>
<p>P3 is a High Performance Computing (HPC) cluster, and Kayobe inevitably took on
an HPC focus initially. We quickly realised that the project would have wider
applicability, and made an effort not to restrict its use to HPC.</p>
<p>Kayobe's next opportunity came from <a class="reference external" href="https://verneglobal.com/">Verne Global</a>, who use it to deploy OpenStack for <a class="reference external" href="https://verneglobal.com/solutions/hpcdirect">hpcDIRECT</a>, their HPC-as-a-service
offering powered by green Icelandic energy. Thanks to the team at Verne for
their early adoption of Kayobe, allowing the project to really thrive.</p>
<p>I'd like to thank Kevin Tibi (IRC: ktibi) and the folks at <a class="reference external" href="https://www.devoteam.com/">Devoteam</a> for seeing Kayobe's potential early on while
building their CI/CD platform. This led to the addition of support for
virtualised compute, greatly widening the potential audience for Kayobe.</p>
</div>
<div class="section" id="a-viable-alternative-to-tripleo">
<h2>A Viable Alternative to TripleO</h2>
<p>We feel that Kayobe has become a viable alternative to TripleO for users
wanting to deploy an OpenStack private cloud. It provides an end-to-end
solution for provisioning, reconfiguring, and upgrading the control plane.</p>
<p>It also provides several advantages over TripleO, including:</p>
<p><strong>Simplicity</strong></p>
<ul class="simple">
<li>Ansible for everything vs. many technologies in TripleO</li>
<li>standalone Ironic vs. full undercloud in TripleO</li>
<li>benefits of containers without using kubernetes (which is on the TripleO
roadmap) - one less turtle to worry about!</li>
</ul>
<p><strong>Modularity</strong></p>
<ul class="simple">
<li>separation of concerns - Kayobe for provisioning, Kolla Ansible for
containerised OpenStack deployment</li>
<li>make targeted changes, e.g. reconfigure or upgrade a specific service vs.
full overcloud Heat stack update in TripleO</li>
</ul>
<p><strong>Extensibility</strong></p>
<ul class="simple">
<li>set any OpenStack config option via Kolla Ansible</li>
<li>run custom Ansible playbooks & roles (coming soon!)</li>
</ul>
</div>
<div class="section" id="get-involved">
<h2>Get Involved</h2>
<p>There are many ways to get involved with Kayobe:</p>
<ul class="simple">
<li>read the <a class="reference external" href="http://kayobe.readthedocs.io/en/latest">documentation</a></li>
<li>try out the <a class="reference external" href="http://kayobe.readthedocs.io/en/latest/development/automated.html">development environment</a> on
your laptop</li>
<li>say <tt class="docutils literal">o/</tt> on IRC: <tt class="docutils literal"><span class="pre">#openstack-kayobe</span></tt></li>
</ul>
<p>That's all for now, thanks for reading.</p>
</div>
<div class="section" id="further-reading">
<h2>Further Reading</h2>
<ul class="simple">
<li><a class="reference external" href="https://kayobe.readthedocs.io/en/latest/">Kayobe</a></li>
<li><a class="reference external" href="https://docs.openstack.org/kolla-ansible/latest/">Kolla-Ansible</a></li>
<li><a class="reference external" href="https://skatelescope.org/">SKA Telescope</a></li>
<li><a class="reference external" href="http://ska-sdp.org/">SKA Science Data Processor (SDP)</a></li>
<li><a class="reference external" href="https://www.stackhpc.com/ironic-idrac-ztp.html">Zero touch provisioning of P3</a></li>
</ul>
</div>
HPCAC Conference Keynote: Ceph on the Brain2018-04-17T10:20:00+01:002018-04-17T18:40:00+01:00Stig Telfertag:www.stackhpc.com,2018-04-17:/hpcac-conference-keynote-ceph-on-the-brain.html<p class="first last">StackHPC returned to the HPC Advisory Council conference in
Lugano, Switzerland. Stig was delighted to be able to participate in
a keynote address, on using Ceph within the Human Brain Project.</p>
<p>The <a class="reference external" href="http://www.hpcadvisorycouncil.com/index.php">HPC Advisory Council</a>
held its <a class="reference external" href="http://www.hpcadvisorycouncil.com/events/2018/swiss-workshop/agenda.php">2018 workshop in Lugano</a>
in the beautiful region of Ticino, Switzerland.</p>
<p><a class="reference external" href="http://insidehpc.com">InsideHPC</a> has recorded footage and gathered
presentation slides from much of the conference, <a class="reference external" href="https://insidehpc.com/2018-swiss-hpc-conference/">available here</a>.
Thanks Rich!</p>
<p>I was able to return to the conference and deliver a keynote
presentation, in partnership with Adrian Tate from <a class="reference external" href="https://www.cray.com/cerl">Cray EMEA
Research Lab</a>. We spoke on the
Cray-Intel pre-commercial procurement (PCP) project, codenamed
<a class="reference external" href="http://www.fz-juelich.de/portal/EN/Research/ITBrain/Supercomputer/JULIA_JURON/_node.html">JULIA</a>,
for the <a class="reference external" href="https://www.humanbrainproject.eu/en/">Human Brain Project</a>
at Jülich Supercomputer Centre. Adrian spoke on Cray's recent R&D work
on data movement within application workflows. I spoke on work to
optimise Ceph for NVMe and HPC network architectures.</p>
<p>InsideHPC <a class="reference external" href="https://insidehpc.com/2018/04/ceph-brain-storage-data-movement-supporting-human-brain-project/">covered it here</a>.
Their footage of the presentation is also available on their YouTube
channel:</p>
<div class="youtube"><iframe src="https://www.youtube.com/embed/iTbwnbiItM4" width="750" height="500" allowfullscreen seamless frameBorder="0"></iframe></div><p>An HPC conference can bring together hot technologies such as
software-defined storage and non-volatile memory, and appraise them
in the context of HPC's long tradition of maximising the capability
of computation at scale.</p>
<p>An exciting time for StackHPC to be standing at this crossroads!</p>
The State of HPC Containers2018-04-16T16:20:00+01:002018-04-18T10:00:00+01:00Stig Telfertag:www.stackhpc.com,2018-04-16:/the-state-of-hpc-containers.html<p class="first last">A conference attended by representatives from all
the leading container runtime environments used in
research computing brought the opportunity to understand
their commonalities and differences.</p>
<div class="section" id="hpcac-lugano-2018">
<h2>HPCAC Lugano 2018</h2>
<p>The recent <a class="reference external" href="http://www.hpcadvisorycouncil.com/events/2018/swiss-workshop/agenda.php">HPC Advisory Council 2018 Lugano Conference</a>
was another great event, with a stimulating schedule of presentations
that were once again themed on emerging trends in HPC. This year,
the use of containers in HPC was at the forefront.</p>
<div class="figure">
<img alt="Swiss mountains" src="//www.stackhpc.com/images/swiss-mountains.jpg" style="width: 750px;" />
</div>
<div class="section" id="motivations-for-containers-in-research-computing">
<h3>Motivations for Containers in Research Computing</h3>
<p>The case for containerised compute is already well established, but
here's a quick recap on our view of the most prominent advantages
of supporting workload containerisation:</p>
<ul class="simple">
<li><strong>Flexibility</strong>: Conventional HPC infrastructure is built around
a multi-user console system, a parallel filesystem and a batch-queued
workload manager. This paradigm works for many applications, but
there is an increasing number of workloads that do not fit this mould.
Users who require applications classed as non-traditional HPC
can often be supported through the ability of containers to package a
runtime that otherwise could not practically be provided.
Some software that wouldn’t otherwise be possible to run at all can be,
simply by containerising it.</li>
<li><strong>Convenience</strong>: For the scientist user, application development
is a means to an end. The true metric in which they are interested
is the "time to science". If a user can pull their application
in as a container, requiring minimal consideration for adapting
to the run-time environment of the HPC system, then they probably
save themselves time and effort in their goal of conducting their
research.</li>
<li><strong>Consistency</strong>: Users of a research computing infrastructure
have often arrived there after outgrowing the compute resources
of their laptop or workstation. Their home workspace may be a
very different runtime environment from the HPC system. The
inconvenience and frustration that could be incurred by porting
to a new environment (or maintaining portability between multiple
environments) is an inconvenience that can be avoided if the
user is able to containerise their workspace and carry over to
other systems.</li>
<li><strong>Reproducibility</strong>: The ability to repeat research performed
in software environments has been a longstanding concern in
science. A container, being a recipe for installing and running
an application, can play a vital role in addressing this concern.
If the recipe is sufficiently precise, it can describe the sources
and versions of dependencies to enable others to regenerate the
exact environment at a later time.</li>
</ul>
</div>
</div>
<div class="section" id="the-contenders-for-containerised-hpc">
<h2>The Contenders for Containerised HPC</h2>
<p>Representatives from all the prominent projects spoke, and thankfully
Rich Brueckner from <a class="reference external" href="https://insidehpc.com">InsideHPC</a> was there
to capture each presentation. InsideHPC has posted all presentations
from the conference <a class="reference external" href="https://insidehpc.com/2018-swiss-hpc-conference/">online here</a>.</p>
<p>This post attempts to capture the salient points of each technology
and make a few comparisons that hopefully are not too odious.</p>
<div class="section" id="docker">
<h3>Docker</h3>
<p>Christian Kniep from Docker Inc <a class="reference external" href="https://insidehpc.com/2018/04/video-state-containers-convergence-hpc-bigdata/">spoke on the work he has been doing</a>
to enhance the integration of the Docker engine with
HPC runtime environments. (<a class="reference external" href="https://www.slideshare.net/insideHPC/state-of-containers-and-the-convergence-of-hpc-and-bigdata">Christian's slides uploaded here</a>).</p>
<div class="youtube"><iframe src="https://www.youtube.com/embed/r9bPHKiagco" width="450" height="300" allowfullscreen seamless frameBorder="0"></iframe></div><p>Christian's vision is of a set of runtime tools based on modular
components standardised through the <a class="reference external" href="https://www.opencontainers.org/">OCI</a>, centred upon use of the Docker
daemon. As it executes with root privileges, the Docker daemon has
often been characterised as a security risk. Christian points out
that this argument is unfairly singling out the Docker daemon, given
the same could be said of <tt class="docutils literal">slurmd</tt> (but rarely is).</p>
<p>Through use of a mainstream set of tools, the philosophy is that
containerised scientific compute is not left behind when new
capabilities are introduced. And with a common user experience for
working with containerised applications, both 'scientific' and not,
cross-fertilisation remains easy.</p>
<p>If the user requirements of scientific compute can be implemented
through extension of the OCI's modular ecosystem, this could become
a simple way of focussing on the differences, rather than creating
and maintaining an entirely disjoint toolchain. Christian's
<a class="reference external" href="https://github.com/qnib/doxy">work-in-progress doxy project</a>
aims to demonstrate this approach. Watch this space.</p>
<p>The Docker toolchain is the de facto standard implementation.
The greatest technical challenges to this approach remain around
scalability, process tree lineage and the integration of HPC network
fabrics.</p>
</div>
<div class="section" id="kubernetes">
<h3>Kubernetes</h3>
<p>Saverio Proto from <a class="reference external" href="http://www.switch.ch/">SWITCH</a> presented their
new service for Kubernetes and how it integrates with their
SWITCHengines OpenStack infrastructure.</p>
<div class="youtube"><iframe src="https://www.youtube.com/embed/iYIz1ClNWXw" width="450" height="300" allowfullscreen seamless frameBorder="0"></iframe></div><p>Kubernetes stands apart from the other projects covered here in
that it is a <em>Container Orchestration Engine</em> rather than a runtime
environment. The other projects described here use a conventional
HPC workload manager to manage resources and application deployment.</p>
<p>Saverio picked out <a class="reference external" href="https://www.openstack.org/videos/sydney-2017/kubernetes-on-openstack-the-technical-details">this OpenStack Summit talk</a>
that describes the many ways that Kubernetes can integrate with
OpenStack infrastructure. At StackHPC we use the <a class="reference external" href="https://docs.openstack.org/magnum/latest/">Magnum project</a> (where available) to
take advantage of the convenience of having Kubernetes provided
as a service.</p>
<p>Saverio and the SWITCH team have been blogging about how Kubernetes
is used effectively in the <a class="reference external" href="https://cloudblog.switch.ch/category/k8s/">SWITCHengines infrastructure here</a>.</p>
</div>
<div class="section" id="singularity">
<h3>Singularity</h3>
<p>Abhinav Thota from Indiana University presented on <a class="reference external" href="http://singularity.lbl.gov/">Singularity</a>, and how it is used with success
on IU infrastructure.</p>
<div class="figure">
<img alt="Singularity" src="//www.stackhpc.com/images/singularity-logo.png" style="width: 200px;" />
</div>
<p>A 2017 paper <a class="reference external" href="http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0177459">in PLOS ONE</a>
describes Singularity's motivation for mobility, reproducibility
and user freedom, without sacrificing performance and access to HPC
technologies. Singularity takes an iconoclastic position as the
"anti-Docker" of containerisation. This contrarian stance also
sees Singularity eschew common causes such as the Linux Foundation's
<a class="reference external" href="https://www.opencontainers.org/">Open Container Initiative</a>, in
favour of entirely home-grown implementations of the tool chain.</p>
<p>Singularity is also becoming widely used within research computing
and is developing a self-sustaining community around it.</p>
<p>Singularity requires set-UID binaries both for building container
images and for executing them. As an attack surface this may be
an improvement over a daemon that continuously runs as root. However,
the unprivileged execution environment of Charliecloud goes further,
reducing its attack surface to the bare minimum - the kernel ABI and
namespace implementations themselves.</p>
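<p>For a flavour of the workflow, a typical Singularity 2.x session
might look like the sketch below (the recipe and script names are
placeholders). Note that the build step needs root (or the set-UID
helpers), while execution runs as the calling user:</p>
<pre class="literal-block">
# Build an image from a Singularity recipe file (privileged step)
sudo singularity build myapp.simg Singularity.myapp

# Run a workload inside the image as an ordinary user
singularity exec myapp.simg ./run_analysis.sh
</pre>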
<p>The evident drawback of Singularity is that its policy of independence
from the Docker ecosystem could lead to difficulties with portability
and interoperability. Unlike the ubiquitous Docker image format,
the Singularity Image Format depends on the ongoing existence,
maintenance and development of the Singularity project. The sharing
of scientific applications packaged in SIF is restricted to other
users of Singularity, which must inevitably have an impact on the
project's aims of mobility and reproducibility of science.</p>
<p>If these limitations were resolved, Singularity would appear to be
a good choice, as its rapidly-growing user base and evolving
ecosystem attest. It also requires little administrative overhead to
install, but may not be as secure as Charliecloud due to its
requirement for set-UID binaries.</p>
</div>
<div class="section" id="shifter">
<h3>Shifter</h3>
<p>Alberto Madonna from CSCS <a class="reference external" href="https://insidehpc.com/2018/04/shifter-docker-containers-hpc/">gave an overview</a>
of <a class="reference external" href="http://www.nersc.gov/research-and-development/user-defined-images/">Shifter</a>, and
an update on recent work at CSCS to improve it.</p>
<div class="youtube"><iframe src="https://www.youtube.com/embed/YgsiRzAm6cU" width="450" height="300" allowfullscreen seamless frameBorder="0"></iframe></div><p>Shifter's genesis was a project between NERSC and Cray, to support
the scalable deployment of HPC applications, packaged in Docker
containers, in a batch-queued workload manager environment. Nowadays
Shifter is generic and does not require a Cray to run it. However,
if you do have a Cray system, Shifter is already part of the Cray
software environment and supported by Cray.</p>
<p>Shifter's focus is on a user experience based around Docker's
composition tools, using containers as a means of packaging complex
application runtimes, and a bespoke runtime toolchain designed to
be as similar as possible to the Docker tools. Shifter's implementation
addresses the scalable deployment of the application into an HPC
environment. Shifter also aims to restrict privilege escalation
within the container and performs specific customisations to ensure
containerisation incurs no performance overhead.</p>
<p>To ensure performance, the MPI libraries of the container environment
are swapped for the native MPI support of the host environment (this
requires use of the MPICH ABI).</p>
<p>To enable scalability, Shifter approaches Docker container launches
by creating a flattened image file on the shared parallel filesystem,
and then mounting the image locally on each node using a loopback
device. At NERSC, Shifter's scalability has been demonstrated to
extend well to many thousands of processes on the <a class="reference external" href="http://www.nersc.gov/users/computational-systems/cori/">Cori supercomputer</a>.</p>
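<p>In practice this surfaces to the user as a small addition to a
Slurm batch script, roughly as in this sketch of NERSC-style usage
(the image and executable names are placeholders):</p>
<pre class="literal-block">
#!/bin/bash
# Ask Shifter to stage a Docker image onto the compute nodes
#SBATCH --image=docker:myrepo/myapp:latest
#SBATCH --nodes=2

# Launch the MPI ranks inside the container runtime
srun -n 64 shifter ./mpi_app
</pre>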
<p>Work at CSCS has removed several perceived issues with the Shifter
architecture. The CSCS team has been developing Shifter to improve the
pulling of images from Docker Hub (or local user-owned Docker image
repositories), and has added the ability to import images from tar
files.</p>
<p>Shifter appears to be a good choice for sites that have a conventional
HPC batch-queued infrastructure and are seeking to provide a scalable
and performant solution while retaining as much compatibility as
possible with the Docker workflow. Shifter requires more administrative
setup than Singularity or Charliecloud.</p>
<p>Shifter is available on <a class="reference external" href="https://github.com/NERSC/shifter">NERSC's github site</a>.</p>
</div>
<div class="section" id="charliecloud">
<h3>Charliecloud</h3>
<p>Michael Jennings from Los Alamos National Lab <a class="reference external" href="https://youtu.be/ESsZgcaP-ZQ">presented Charliecloud</a> and the concepts upon which it is
built.</p>
<div class="youtube"><iframe src="https://www.youtube.com/embed/ESsZgcaP-ZQ" width="450" height="300" allowfullscreen seamless frameBorder="0"></iframe></div><p><a class="reference external" href="https://hpc.github.io/charliecloud/">Charliecloud's</a> development
at LANL resulted in a solution shaped by a site with strict
security requirements. Cluster systems in such an environment
typically have no external internet connectivity. System applications
are closely scrutinised, in particular those that involve privileged
execution.</p>
<p>In these environments, Charliecloud's distinct advantage is the
usage of the newly-introduced user namespace to support non-privileged
launch of containerised applications. This technique was described
in the 2017 Singularity paper as being "not deemed stable by multiple
prominent distributions of Linux". It was actually <a class="reference external" href="https://lwn.net/Articles/532593/">introduced in
2013</a>, but its use widened
exposure to a new kernel attack surface. As a result its path to
maturity has been slow and complex, but user namespaces are now a standard
feature of the latest releases of all major Linux distributions.
Configuration of Debian, RHEL and CentOS is <a class="reference external" href="https://hpc.github.io/charliecloud/install.html#run-time">described here</a>.
(For environments where unprivileged user namespaces cannot be supported,
Charliecloud can fall back to using <tt class="docutils literal">setuid</tt> binaries).</p>
<p>The user namespace is unprivileged: it can be created
without root privileges, and within it all of the other, privileged,
namespaces can be created. In this way, a containerised application
can be launched without requiring privileged access on the host.</p>
<p>Development for a Charliecloud environment involves using the Docker
composition tools locally. Unlike Docker, a container is flattened
to a single archive file in preparation for execution. Execution
is scaled by the scalable distribution of this archive, which is
unpacked into a tmpfs environment locally on each compute node.
Charliecloud has been demonstrated scaling to 10,000 nodes on LANL's
<a class="reference external" href="http://www.lanl.gov/projects/trinity/">Trinity Cray system</a>.</p>
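<p>With the Charliecloud command set of the time, the workflow just
described might look something like this sketch (image and script
names are placeholders):</p>
<pre class="literal-block">
# Build the image locally using the Docker composition tools
ch-build -t myapp ./myapp

# Flatten the image to a single archive for scalable distribution
ch-docker2tar myapp /var/tmp

# On each compute node: unpack the archive and launch unprivileged
ch-tar2dir /var/tmp/myapp.tar.gz /var/tmp
ch-run /var/tmp/myapp -- ./run_analysis.sh
</pre>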
<div class="figure">
<img alt="Charliecloud" src="//www.stackhpc.com/images/charliecloud-logo.png" style="width: 250px;" />
</div>
<p>Charliecloud appears to be a good choice for sites that are seeking
scalability, but with strong requirements for runtime security. The
Docker development environment and composition tools are also helpful
for users on a learning curve for containerisation.</p>
<p>Further details on Charliecloud can be found in the informative
paper presented at <a class="reference external" href="https://dl.acm.org/citation.cfm?doid=3126908.3126925">Supercomputing 2017</a>.
Michael has provided a <a class="reference external" href="https://gist.github.com/mej/bdb1c1c6abe2632e8d88a22ee3b5af81">Charliecloud Vagrantfile</a>
to help people familiarise themselves with it. Charliecloud packages
are expected to ship in the <a class="reference external" href="http://openhpc.community/">next major release of OpenHPC</a>.</p>
</div>
</div>
<div class="section" id="the-road-ahead">
<h2>The Road Ahead</h2>
<p>The ecosystem around container technology is rapidly evolving, and this
is also true in the niche of HPC.</p>
<div class="section" id="the-open-container-initiative">
<h3>The Open Container Initiative</h3>
<p>The tools for this niche are somewhat bespoke, but thanks to the
efforts of the OCI to break down the established Docker tools into
modular components, there is new scope to build a specialist solution
upon a common foundation.</p>
<p>This initiative has brought about new innovation. <a class="reference external" href="https://rootlesscontaine.rs/">Rootless RunC</a> is an approach for using the <tt class="docutils literal">runc</tt>
tool for unprivileged container launch. This approach and its
current limitations are well documented in the above link.</p>
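<p>As a taste of what this looks like, rootless mode can be exercised
in a couple of commands, sketched below. This assumes the bundle's
<tt class="docutils literal">rootfs</tt> has already been populated (for example
from an extracted image), and the caveats in the link above apply:</p>
<pre class="literal-block">
# Prepare a bundle directory; rootfs/ holds the extracted image
mkdir -p mycontainer/rootfs
cd mycontainer

# Generate an OCI runtime spec adjusted for rootless operation
runc spec --rootless

# Launch the container entirely without root privileges
runc --root /tmp/runc run mycontainer
</pre>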
<p>In a similar vein, <a class="reference external" href="http://cri-o.io/">the CRI-O project</a> is working
on a lightweight container runtime interface that displaces the
Docker daemon from Kubernetes compute nodes, in favour of any
OCI-compliant runtime.</p>
<p>Shifter, Charliecloud and Singularity are not OCI-compliant runtimes,
as they predate the OCI's relatively recent formation. However,
when the OCI's tools become suitably capable and mature,
they are likely to be adopted by Charliecloud and Shifter.</p>
</div>
<div class="section" id="challenges-for-hpc">
<h3>Challenges for HPC</h3>
<p>There are signs of activity around developing better support for
RDMA in containerised environments. The <a class="reference external" href="https://www.kernel.org/doc/Documentation/cgroup-v1/rdma.txt">RDMA cgroup</a>,
introduced in Linux 4.11, adds support for controlling the
consumption of RDMA communication resources. This is already being
<a class="reference external" href="https://github.com/opencontainers/runtime-spec/pull/942">included in the spec</a> for the
OCI runtime.</p>
<p>RDMA isolation (for example, through a namespace) does not currently
appear to be possible. Current implementations can only pass through
the host's RDMA context. This works fine for HPC configurations
whose scheduling policy does not share a compute node between workloads.</p>
<p>The greatest advantages of the specialist solutions lie in addressing
challenges that remain unique to scientific computing. For example:</p>
<ul class="simple">
<li>Scalable launch of containerised workloads. The approach taken by
Singularity, Shifter and Charliecloud involves using a parallel filesystem
for the distribution of the application container image. This addresses
one of the major differences in use case and design. Distributing
the container as a single image file also greatly reduces filesystem
metadata operations.</li>
<li>Launching multi-node MPI applications in research computing containers.
The Docker runtime creates complications when interacting with
MPI's Process Management Interface. Shifter's innovation of
replacing container MPI libraries with host MPI libraries is an
intriguing way of specialising a generalised environment. Given
that multi-node MPI applications are the standard workload of
conventional HPC infrastructure, running containerised applications
of this form is likely to remain a tightly specialised niche use case.</li>
</ul>
</div>
<div class="section" id="most-paths-converge">
<h3>(Most) Paths Converge</h3>
<p>A future direction in which HPC runtime frameworks for containerised
applications have greater commonality with the de facto standard
ecosystem around OCI and/or Docker's tools has considerable appeal.
The development environment for Docker containers is rich, and
developing rapidly. The output is portable and interchangeable
between different runtimes. As Michael Jennings of Charliecloud
says, “If we can all play in the same playground, that saves everyone
time, effort and money”.</p>
<p>The HPCAC 2018 Swiss Conference brought together experts from all
the leading options for containerised scientific compute, and was a
rare opportunity to make side-by-side comparisons. Given the rapid
development of this technology I am sure things will have changed
again in time for the 2019 conference.</p>
</div>
</div>
Scientific SIG at the Dublin PTG2018-03-05T14:20:00+00:002018-03-05T17:00:00+00:00Stig Telfertag:www.stackhpc.com,2018-03-05:/scientific-sig-at-the-dublin-ptg.html<p class="first last">The OpenStack PTG invited the Scientific SIG to participate,
and we were certainly glad that we did.</p>
<p>We settled on a half-day session in the cross-project phase of the
<a class="reference external" href="https://www.openstack.org/ptg/">Project Teams Gathering</a> (PTG).
This turned out to be a great choice: our session, formed from a
development-centric core of regional SIG members, was greatly
enhanced with a number of leaders from the Nova and Ironic projects,
who contributed hugely in advancing the discussion.</p>
<div class="figure">
<img alt="Snowy Croke Park, Dublin" src="//www.stackhpc.com/images/crokepark-snowpenstack.jpg" style="width: 750px;" />
</div>
<p>The proceedings of the discussion were <a class="reference external" href="https://etherpad.openstack.org/p/scientific-sig-ptg-rocky">tracked in an Etherpad</a>.</p>
<p>Here are some highlights from my own point of view.</p>
<div class="section" id="preemptible-instances">
<h2>Preemptible Instances</h2>
<p>Theodoros Tsioutsias updated the SIG on the <a class="reference external" href="https://gitlab.cern.ch/ttsiouts/ReaperServicePrototype">Reaper service
prototype</a>
he has been developing at CERN. This is described in greater
detail in a <a class="reference external" href="http://openstack-in-production.blogspot.ie/2018/02/maximizing-resource-utilization-with.html">CERN OpenStack in Production blog post</a>.</p>
<p>The concept of the service is to introduce a second class of
instances, managed outside of a user's standard quota, which can
take opportunistic advantage of temporary resource availability.
The flipside is that when the cloud infrastructure becomes
fully-utilised these instances will be terminated at short notice
in order to service prioritised requests for resources.</p>
<p>The Reaper process is designed to intercept a <tt class="docutils literal">NoValidHostFound</tt>
event in order to search for a preemptible victim, shut it down,
harvest the resources and retry the original request (which
should hopefully now succeed).</p>
<p>CERN's approach is to only trigger the Reaper when the cloud is
totally full. Some use cases may favour reaping when utilisation
exceeds a given threshold. For example:</p>
<ul class="simple">
<li>In a cloud with bare metal compute instances, reaping preemptible
bare metal resources could take significantly longer, particularly if
Ironic node cleaning is enabled.</li>
<li>In a cloud with high turnover of instances, the additional time
required to perform the reaping action could result in many requests
contending for reaped resources. This could add delay to instance
creation, and the race for claiming reaped resources is not guaranteed
to be fair.</li>
</ul>
<p>The SKA project has taken a strong interest in this work, which is
seen as key to raising utilisation on a finite cloud resource. John
Garbutt from StackHPC has been providing technical input and upstream
assistance on the SKA project's behalf. The long-term hope is to
introduce the minimal extensions to Nova required to enable services
like the Reaper to function, and deliver preemptible instances as
effectively as possible.</p>
<p>The discussion and design work continues - <a class="reference external" href="https://review.openstack.org/#/c/438640/">here is the spec</a>.</p>
</div>
<div class="section" id="nested-projects-and-quotas">
<h2>Nested Projects and Quotas</h2>
<p>At the inaugural session for the Scientific Working Group at the
<a class="reference external" href="https://etherpad.openstack.org/p/scientific-wg-austin-summit-agenda">2016 OpenStack summit in Austin</a>,
one pain point articulated by Tim Bell of CERN was OpenStack's
issues with managing quotas across hierarchical projects. He
provided a detailed use case in a <a class="reference external" href="http://openstack-in-production.blogspot.ie/2016/04/resource-management-at-cern.html">subsequent blog post</a>.
Not much progress has been made since then - until now.</p>
<p>John Garbutt updated the SIG on a <a class="reference external" href="https://etherpad.openstack.org/p/unified-limits-rocky-ptg">cross project PTG session on
identity</a>
- this discussion had been around refactoring quota management to
be handled by Keystone and a new Oslo library. The advantage of
this is that it brings quota management to the service that understands
the hierarchy of nested projects. Managing quota for users in
nested projects has long been a pain point for large-scale users
of OpenStack in research computing.</p>
</div>
<div class="section" id="cinder-volume-multi-attach">
<h2>Cinder Volume Multi-Attach</h2>
<p>OpenStack Queens introduces multi-attach Cinder volumes, potentially
enabling many nodes to share a common volume. This introduces some
potential for new ways of scaling research computing infrastructure:</p>
<ul class="simple">
<li>Enabling a common cluster of nodes to boot from the same <strong>read-only</strong>
volume, leading to minimal infrastructure state per instance and strong
immutability.</li>
<li>Enabling the same capability for large-scale bare metal infrastructure.</li>
</ul>
<p>Nova's documentation already includes details on <a class="reference external" href="https://docs.openstack.org/nova/latest/admin/manage-volumes.html#volume-multi-attach">using multi-attach</a>.</p>
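<p>For those wanting to experiment, the Queens workflow is roughly as
sketched below (names are placeholders, and a multi-attach capable
Cinder backend is a prerequisite, as described in the docs above):</p>
<pre class="literal-block">
# Create a volume type that permits multi-attach
openstack volume type create multiattach
openstack volume type set --property multiattach="&lt;is&gt; True" multiattach

# Create a volume of that type and attach it to two servers
openstack volume create --type multiattach --size 100 shared-vol
openstack server add volume server-a shared-vol
openstack server add volume server-b shared-vol
</pre>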
<p>The initial implementation of Cinder volume multi-attach is a huge step
forward but unlikely to meet all our requirements. We'll test this
capability out in due course and see how close we get.</p>
</div>
<div class="section" id="ironic-advanced-deployments">
<h2>Ironic Advanced Deployments</h2>
<p>Ironic's new flexibility with regard to deployment has introduced
new and compelling possibilities for the ways in which it can be used for
infrastructure management.</p>
<p>To ensure the advanced deployment requirements articulated in the
meeting have an enduring impact, SIG members were asked to capture
use cases in RFEs for Ironic in Storyboard. This has already
started happening:</p>
<ul class="simple">
<li><strong>Boot to RAMdisk</strong>: booting Ironic compute nodes <a class="reference external" href="https://storyboard.openstack.org/#!/story/1753842">directly into
a RAMdisk image</a>
could create a very rapid and scalable deployment-free infrastructure
provisioning process. This can only realistically support small
software images and special purpose deployment, but in general
could unlock the potential for using Ironic for extreme-scale HPC
environments.</li>
<li><strong>Deploy with kexec</strong>: Some SIG members find the extra reboot
in Ironic's deployment cycle unbearably tiresome. That's not
because they have a chronically short attention span - rather
that there are systems out there with enough RAM and other devices
to initialise that a power cycle can take of the order of hours
instead of minutes. The possibility of <a class="reference external" href="https://storyboard.openstack.org/#!/story/1757137">using kexec to avoid a
power cycle</a>
makes a huge usability difference in these circumstances.</li>
</ul>
</div>
<div class="section" id="and-finally">
<h2>And Finally...</h2>
<p>I had the opportunity to talk about the Scientific SIG and the aims
of the group in an interview for SuperUser.</p>
<div class="youtube"><iframe src="https://www.youtube.com/embed/4Vo0Eu_euLk" width="750" height="500" allowfullscreen seamless frameBorder="0"></iframe></div></div>
The Gathering of the (Snow) Clouds2018-03-05T14:20:00+00:002018-03-05T17:00:00+00:00Stig Telfertag:www.stackhpc.com,2018-03-05:/the-gathering-of-the-snow-clouds.html<p class="first last">StackHPC sent a full team away to the Rocky PTG in Dublin.
Here's the trip report from an eventful week.</p>
<p>With the global OpenStack developer community gathering so close by,
StackHPC sent a team of four to the PTG in Dublin to participate to
the fullest extent possible.</p>
<div class="figure">
<img alt="Snowy Dublin streets" src="//www.stackhpc.com/images/dublin-snowpenstack.jpg" style="width: 750px;" />
</div>
<p>The team took on individual roles to give us the greatest amount of
interaction.</p>
<div class="section" id="scientific-sig">
<h2>Scientific SIG</h2>
<div class="youtube"><iframe src="https://www.youtube.com/embed/4Vo0Eu_euLk" width="450" height="300" allowfullscreen seamless frameBorder="0"></iframe></div><p>The Scientific SIG took part in the cross-project sessions, which
ran before the scheduling for the major project teams.</p>
<p>The session was very well attended and there was considerable useful
discussion around requirements driven by the use cases of Scientific
OpenStack. More information about this session is covered in Stig's
<a class="reference external" href="//www.stackhpc.com/scientific-sig-at-the-dublin-ptg.html">Scientific SIG PTG blog post</a>.</p>
</div>
<div class="section" id="kayobe-grows">
<h2>Kayobe Grows</h2>
<div class="youtube"><iframe src="https://www.youtube.com/embed/0axlEFIYi2s" width="450" height="300" allowfullscreen seamless frameBorder="0"></iframe></div><p>StackHPC's Mark Goddard gave an update on the progress of <a class="reference external" href="https://kayobe.readthedocs.io/en/latest/">Kayobe</a>, our tool for deployment
of HPC-oriented OpenStack that builds upon <a class="reference external" href="https://docs.openstack.org/bifrost/latest/">Bifrost</a>, <a class="reference external" href="https://docs.openstack.org/kolla/latest/">Kolla</a> and <a class="reference external" href="https://docs.openstack.org/kolla-ansible/latest/">Kolla-Ansible</a>.</p>
<p>Kayobe's source repo has migrated to <a class="reference external" href="https://github.com/openstack/kayobe">OpenStack on GitHub</a> with infra set up for
<a class="reference external" href="https://storyboard.openstack.org/#!/project/928">Storyboard</a>
and <a class="reference external" href="https://review.openstack.org/#/q/project:openstack/kayobe">Gerrit code review</a>.</p>
<p>Mark will report on Kayobe's growing momentum on our blog shortly.</p>
</div>
<div class="section" id="ironic">
<h2>Ironic</h2>
<p>Mark spent the majority of his time with the Ironic team in a room with an
incredible panoramic city view and no heating. John and Stig also joined a
number of sessions. Interesting <a class="reference external" href="https://etherpad.openstack.org/p/ironic-rocky-ptg">sessions</a> included:</p>
<ul class="simple">
<li><a class="reference external" href="https://etherpad.openstack.org/p/ironic-rocky-ptg-deploy-steps">deploy steps</a> - a
proposal for making the deployment process configurable and extendable using
a mechanism similar to the existing node cleaning feature. The long term
plan here is to change things like BIOS settings and RAID configuration at
deployment time based on traits requested by an instance and/or flavor.</li>
<li><a class="reference external" href="https://etherpad.openstack.org/p/ironic-rocky-ptg-location-awareness">location awareness</a> -
how a single Ironic service might be able to support nodes in several
locations.</li>
<li><a class="reference external" href="https://etherpad.openstack.org/p/nova-ptg-rocky">cross-project session with nova</a> that was a great example
of the many cross-project sessions aiming to improve the interactions between
projects.</li>
</ul>
</div>
<div class="section" id="id1">
<h2>Kolla</h2>
<p>Mark attended a number of the <a class="reference external" href="https://etherpad.openstack.org/p/kolla-rocky-ptg-planning">Kolla sessions</a>, starting with
<a class="reference external" href="https://etherpad.openstack.org/p/kolla-rocky-ptg-db-backup-restore">database backup and recovery</a>. The
Kolla team looks set to adopt StackHPC's Nick Jones's proposal and prototype
implementation.</p>
<p>Thursday morning included a number of sessions proposed by Mark:</p>
<ul class="simple">
<li><a class="reference external" href="https://etherpad.openstack.org/p/kolla-rocky-ptg-kayobe">kayobe overview &amp; collaboration possibilities</a></li>
<li><a class="reference external" href="https://etherpad.openstack.org/p/kolla-rocky-ptg-haproxy-config">customisation of non-OpenStack configuration files</a>, in
particular HAProxy</li>
<li>using the <a class="reference external" href="https://molecule.readthedocs.io/en/latest/">Molecule</a> test
framework to <a class="reference external" href="https://etherpad.openstack.org/p/kolla-rocky-ptg-test-coverage">improve the test coverage of Ansible roles</a></li>
</ul>
</div>
<div class="section" id="brrrr">
<h2>Brrrr...</h2>
<p>The Beast from the East kept us frozen in place for a day or two longer,
but the welcome in Dublin remained warm and friendly.</p>
<p>We also met our latest new recruit - Doug and Stig's mohican-sporting snowman!</p>
<div class="figure">
<img alt="Snowy Dublin streets" src="//www.stackhpc.com/images/doug-snowman.jpg" style="width: 600px;" />
</div>
</div>
StackHPC at the RCUK Cloud Workshop2018-01-08T14:20:00+00:002018-01-10T17:00:00+00:00Stig Telfertag:www.stackhpc.com,2018-01-08:/stackhpc-at-the-rcuk-cloud-workshop.html<p class="first last">Nick Jones, Stig Telfer, John Taylor and John Garbutt attended
the third Research Councils UK Cloud Workshop. Stig and John G
gave presentations.</p>
<p>The RCUK Cloud Working Group exists to define and propagate best
practice for using cloud services and technologies for research
computing in the UK. We were delighted to be invited back to speak
at their third annual Cloud Workshop, held again at the Francis
Crick Institute in London.</p>
<p>Stig delivered a presentation based on the collaboration between
the SKA and CERN that <a class="reference external" href="//www.stackhpc.com/future-science-on-future-openstack-cern-and-ska.html">we have previously written about</a>. The slides for the
presentation are <a class="reference external" href="https://cloud.ac.uk/workshops/jan2018/workshops-jan2018-cern-ska/">available online here</a>.</p>
<p>John presented some of his recent work on using <a class="reference external" href="https://docs.openstack.org/manila/latest/">OpenStack Manila</a> for managing filesystems
in research computing applications. In particular, John has been
looking at how to manage GlusterFS filesystems for achieving HPC
levels of data throughput. John's slides are <a class="reference external" href="https://cloud.ac.uk/workshops/jan2018/manila/">available online</a>.</p>
Future Science on Future OpenStack: CERN and SKA2017-12-17T23:20:00+00:002017-12-20T12:00:00+00:00Stig Telfertag:www.stackhpc.com,2017-12-17:/future-science-on-future-openstack-cern-and-ska.html<p class="first last">Stig presented at OpenStack Sydney on the CERN/SKA future
science platforms collaboration with Belmiro Moreira from CERN.
Here we describe that collaboration in a little more depth and
how we see it taking shape in future.</p>
<p>Housed at Cambridge University, the ALaSKA SDP Performance Prototype
Platform has been quick to take advantage of the Big Data Cooperation
Agreement <a class="reference external" href="https://skatelescope.org/news/ska-signs-big-data-cooperation-agreement-cern/">signed over the summer with CERN</a>.
ALaSKA uses OpenStack to deliver a flexible but performant bare
metal compute environment to enable SKA project scientists to
experiment with and explore software technologies and make objective
performance comparisons.</p>
<div class="figure">
<img alt="ASKAP" src="//www.stackhpc.com/images/CSIRO-ASKAP.jpg" style="width: 600px; height: 400px;" />
</div>
<p>The ALaSKA system uses several OpenStack technologies that are
already in full-scale production at CERN. Conversely, to develop
ALaSKA's capability some advanced technologies have been developed
by the StackHPC team managing ALaSKA. The CERN team have identified
several areas where ALaSKA's experience can inform the ongoing development
of CERN's compute infrastructure, particularly as they scale up to meet
the challenges of the forthcoming high-luminosity upgrade.</p>
<div class="figure">
<img alt="CMS" src="//www.stackhpc.com/images/CERN-CMS.jpg" style="width: 600px; height: 400px;" />
</div>
<p>Following the collaboration, on several occasions through the autumn
CERN and SKA project staff have met to talk over the practical
details of sharing effort, and StackHPC's team was delighted to participate
in the discussions relating to OpenStack. These discussions led
to a <a class="reference external" href="https://www.openstack.org/videos/sydney-2017/future-science-on-future-openstack-developing-next-generation-infrastructure-at-cern-and-ska">co-presentation at OpenStack Sydney</a>
by Stig and Belmiro Moreira, a Cloud Architect in
the team at CERN.</p>
<div class="youtube"><iframe src="https://www.youtube.com/embed/XmQR06Mwd5g" width="750" height="500" allowfullscreen seamless frameBorder="0"></iframe></div><p>Stig will also be elaborating on the same topic at the
<a class="reference external" href="https://cloud.ac.uk/">Research Councils UK Cloud Workshop 2018</a> at the
Francis Crick Institute in London on 8th January 2018.</p>
<p>Beyond meeting the goals of major flagship science programmes,
this collaboration speaks to a wider unmet need among the
growing Scientific OpenStack community for OpenStack infrastructure
that enables the next generation of flexible HPC.</p>
<p>A number of key priorities have been identified so far:</p>
<ul class="simple">
<li>Using Ironic for infrastructure management.</li>
<li>Better support for bare metal compute resources in Magnum.</li>
<li>High performance Ceph, especially using RDMA-enabled interconnects.</li>
<li>Intelligent workload scheduling in federated environments.</li>
<li>Opportunistic "spot" instances and preemption in OpenStack.</li>
</ul>
<div class="section" id="enter-the-scientific-sig">
<h2>Enter the Scientific SIG</h2>
<p>The teams from CERN and the SKA see these issues as far from specific
to their use cases, and are certainly not the only people around the
world working on making OpenStack even better for research computing.</p>
<p>Through the <a class="reference external" href="https://wiki.openstack.org/wiki/Scientific_SIG">Scientific SIG</a>,
the teams are starting to coordinate activities between themselves, and
also with other research institutions active in the Scientific OpenStack
community.</p>
<p>This open collaboration will be actively discussed during the Scientific SIG's
scheduled session at <a class="reference external" href="https://www.openstack.org/ptg/#tab_about">the Project Teams Gathering (PTG)</a>,
coming up in Dublin, February 26th - March 2nd 2018.</p>
</div>
Verne Global's hpcDIRECT Service: Bare Metal Powered by Molten Rock2017-12-12T10:20:00+00:002017-12-12T22:00:00+00:00Stig Telfertag:www.stackhpc.com,2017-12-12:/verne-globals-hpcdirect-service-bare-metal-powered-by-molten-rock.html<p class="first last">Verne Global announce hpcDIRECT, built upon bare metal
OpenStack, to deliver "the cleanest HPC on Earth".</p>
<p>After some months of development, prototyping and early customer trials,
Verne Global's hpcDIRECT service has been announced. This system
builds upon OpenStack Ironic to deliver the performance of bare metal
combined with the dynamic flexibility of cloud. Iceland's abundant
geothermal energy ensures the energy is clean but the power costs are
kept low.</p>
<p>You can find plenty of details about Verne Global's announcement,
for example as covered by <a class="reference external" href="https://insidehpc.com/2017/12/verne-global-launches-hpcdirect-hpc-service-platform/">Inside HPC</a>
or <a class="reference external" href="https://www.nextplatform.com/2017/12/06/renting-cleanest-hpc-earth/">The Next Platform</a>.
However, we can also talk about some of the transformative technologies
going on within the infrastructure.</p>
<p>Working in partnership with the technical team at Verne Global, StackHPC has
designed, developed and deployed an OpenStack control plane that pushes
the boundaries of OpenStack and Ironic:</p>
<ul class="simple">
<li><strong>OpenStack Pike</strong>, the latest release, in an agile deployment that tracks
upstream closely.</li>
<li><strong>Multi-network support</strong>, including both Ethernet and Infiniband, with tenant
isolation throughout.</li>
<li><strong>Custom resource classes</strong> for supporting Nova scheduling to multiple bare metal server specs.</li>
<li><strong>Kolla-Ansible control plane</strong> deployed using <a class="reference external" href="https://kayobe.readthedocs.io">Kayobe</a>,
the de-facto solution for automating Kolla-Ansible deployments from scratch.</li>
</ul>
<p>In many other ways, hpcDIRECT is a showcase for the use of OpenStack for
HPC applications:</p>
<ul class="simple">
<li><strong>Flexible delivery</strong> of complex application environments using Ansible and Heat.</li>
<li><strong>HPC-oriented storage options</strong>, designed to meet a wide
range of use cases for HPC and data-intensive analytics.</li>
<li><strong>HPC workload telemetry</strong> gathered using <a class="reference external" href="http://monasca.io">Monasca</a> or
<a class="reference external" href="http://prometheus.io">Prometheus</a> and shared with users to provide performance
insights into the behaviour of their codes.</li>
<li><strong>Consistent management tools everywhere</strong> - for the first time, a system that is
managed at every level from firmware to OS to application using the same tools
(<a class="reference external" href="http://docs.ansible.com/ansible">Ansible</a> and friends).</li>
</ul>
<p>Lewis Tucker, Verne Global Enterprise Solutions Architect, comments "From the outset
we had an ambitious goal to provide a bare metal on demand HPC Service. StackHPC's
expertise has enabled us to execute our plans to timescales and budget and
offer our customers an innovative and agile HPC service to meet their exacting requirements.
We look forward to developing our relationship as hpcDIRECT grows".</p>
<p>John Taylor, CEO of StackHPC, adds "the hpcDIRECT project has validated our vision
of an HPC-enabled OpenStack and gone even further to make it available as a service.
We are thrilled to have been involved in this cutting-edge project."</p>
<div class="figure align-center">
<object data="//www.stackhpc.com/images/hpc-logo-navy.svg" style="width: 400px;" type="image/svg+xml">
hpcDIRECT</object>
</div>
Baremetal Cloud Capacity2017-11-15T14:00:00+00:002017-11-21T09:42:00+00:00John Garbutttag:www.stackhpc.com,2017-11-15:/baremetal-cloud-capacity.html<p class="first last">Discussion of scheduling baremetal resources using OpenStack Pike
and looking at news from the Denver PTG about what is likely going
to happen during Queens and beyond.</p>
<img alt="OpenStack Pike" src="//www.stackhpc.com/images/openstack-pike-logo.png" style="width: 80px;" />
<p>For many reasons, it is common for HPC users of OpenStack to use Ironic and
Nova together to deliver baremetal servers to their users. In this post we
look at recent changes to how Nova chooses which Ironic node to use for each
user's <tt class="docutils literal">nova boot</tt> request.</p>
<div class="section" id="ska-sdp-performance-prototype">
<h2>SKA SDP Performance Prototype</h2>
<p>As part of the development of the <a class="reference external" href="https://skatelescope.org">SKA</a>, StackHPC has
created an OpenStack based cloud that researchers can use to prototype how best
to process the data output from the telescopes. For this discussion of managing
cloud capacity, the key points to note are:</p>
<ul class="simple">
<li>Several types of nodes: Regular, a few GPU nodes, a few NVMe nodes, some
high memory nodes.</li>
<li>Several physical networks, including 10G Ethernet, 25G Ethernet and 100Gb/s Infiniband.</li>
<li>A globally distributed team of scientists.</li>
<li>Some work will need exclusive use of the whole system, to benchmark the
performance of prototype scientific pipelines.</li>
</ul>
</div>
<div class="section" id="managing-capacity">
<h2>Managing Capacity</h2>
<p>Public cloud must present the illusion of infinite capacity. For
private cloud use cases, and research computing in particular, the
amount of available, unused capacity is of great interest. Most
small clouds soon hit the reality of running out of space. There
are two main approaches to dealing with capacity problems:</p>
<ul class="simple">
<li>Explicit assignment (and pre-emption)</li>
<li>Co-operative multitasking</li>
</ul>
<p>Given our situation described above, we have opted for co-operative
multitasking, where the users delete their own instances when they are finished
with those nodes, allowing others to do what they need.</p>
<p>To help reduce the strain on resources we are also prototyping
shared execution frameworks such as a Heat-provisioned
<a class="reference external" href="http://openhpc.community">OpenHPC</a> Slurm cluster, a Magnum-provisioned
Docker Swarm cluster and a Sahara-provisioned RDMA-enabled
Spark cluster from <a class="reference external" href="http://hibd.cse.ohio-state.edu/">HiBD</a>.</p>
<p>In this blog we are focusing on the capacity of Ironic-based clouds. When you
add virtualisation into the mix, there are many questions around how different
combinations of flavors fit onto a single hypervisor and how to avoid
wasted space. Similarly, we are focusing on statically sized private clouds,
so this blog will ignore the whole world of capacity planning.</p>
</div>
<div class="section" id="openstack-security-model">
<h2>OpenStack Security Model</h2>
<p>You can argue about this being a security model, or just the details of the
abstraction OpenStack presents, but the public APIs try their best to hide any
idea of physical hosts and capacity from non-cloud-admin users.</p>
<p>When building a public cloud as a publicly traded company, exposing via the
API in real time how many physical hosts you run or how much free capacity you
have could probably break the law in some countries. But when you run a private
cloud, you really want a nice view of what your friends are using.</p>
</div>
<div class="section" id="co-operative-capacity-management">
<h2>Co-operative Capacity Management</h2>
<p>"Play nice, or I will delete all your servers every Friday afternoon!"</p>
<p>That is a very tempting thing to say, and I basically have said that. But it's
hard to play nice when you have no idea how much capacity is in use. So we
have a solution: the capacity dashboard.</p>
<img alt="SKA capacity dashboard" src="//www.stackhpc.com/images/ska-capacity-dashboard.png" style="width: 700px;" />
<p>Talking to the users of P3, it's clear that having a visual representation of who is
currently using what has been much more useful than a wiki page of requests
that quickly drifted out of sync with reality. In the future we may consider
<a class="reference external" href="https://docs.openstack.org/mistral">Mistral</a> to enforce lease times of
servers, or maybe <a class="reference external" href="https://docs.openstack.org/blazar">Blazar</a> for a more
formal reservation system, but for now giving the scientists the flexibility
of a more informal system is working well.</p>
</div>
<div class="section" id="building-the-dashboard">
<h2>Building the Dashboard</h2>
<p>Firstly we have our monitoring infrastructure. This is currently built using
OpenStack Monasca, making use of Kafka and Influx DB. (We also use Monasca with
ELK to generate metrics from our logs, but that is a story for another day):</p>
<ul class="simple">
<li><a class="reference external" href="http://monasca.io/docs/architecture.html">http://monasca.io/docs/architecture.html</a></li>
<li><a class="reference external" href="https://docs.openstack.org/monasca-api/latest">https://docs.openstack.org/monasca-api/latest</a></li>
</ul>
<p>The dashboard is built using Grafana. There is a Monasca plugin, which means we
can use the Monasca API as a data source, and a Keystone plugin that is used
to authenticate and authorise access to both Grafana and its use of the Monasca
APIs:
<a class="reference external" href="https://grafana.com/plugins/monasca-datasource">https://grafana.com/plugins/monasca-datasource</a></p>
<p>Our system metrics are kept in a project that general users don't have access to,
but the capacity metrics and dashboards are associated with the project that
all the users of the system have access to.</p>
<p>Now that we have a system in place to ingest, store, query and visualize metrics in
a multi-tenant way, we need a tool to query the capacity and send metrics
into Monasca.</p>
</div>
<div class="section" id="querying-baremetal-capacity">
<h2>Querying Baremetal Capacity</h2>
<p>We have created a small CLI tool called <a class="reference external" href="https://github.com/stackhpc/os-capacity">os-capacity</a>. It uses <a class="reference external" href="https://github.com/openstack/os-client-config">os_client_config</a> and <a class="reference external" href="https://github.com/openstack/cliff">cliff</a> to query the <a class="reference external" href="https://developer.openstack.org/api-ref/placement/">Placement API</a> for details about the
current cloud capacity and usage. It also uses Nova APIs and Keystone APIs to
get hold of useful friendly names for the information that is in placement.</p>
<p>For the Capacity dashboard we use data from two particular CLI calls. Firstly
we look at the capacity by calling:</p>
<pre class="literal-block">
os-capacity resources group
+----------------------------------+-------+------+------+-------------+
| Resource Class Groups | Total | Used | Free | Flavors |
+----------------------------------+-------+------+------+-------------+
| VCPU:1,MEMORY_MB:512,DISK_GB:20 | 5 | 1 | 4 | my-flavor-1 |
| VCPU:2,MEMORY_MB:1024,DISK_GB:40 | 2 | 0 | 2 | my-flavor-2 |
+----------------------------------+-------+------+------+-------------+
</pre>
<p>This tool is currently very focused on baremetal clouds. The flavor mapping is
done assuming the flavors should exactly match all the available resources for
a given Resource Provider. This is clearly not true for a virtualised scenario.
It is also not true in some baremetal clouds, but this works OK for our cloud.
Of course, patches welcome :)</p>
<p>Secondly we can look at the usage of the cloud by calling:</p>
<pre class="literal-block">
os-capacity usages group user --max-width 70
+----------------------+----------------------+----------------------+
| User | Current Usage | Usage Days |
+----------------------+----------------------+----------------------+
| 1e6abb726dd04d4eb4b8 | Count:4, | Count:410, |
| 94e19c397d5e | DISK_GB:1484, | DISK_GB:152110, |
| | MEMORY_MB:524288, | MEMORY_MB:53739520, |
| | VCPU:256 | VCPU:26240 |
| 4661c3e5f2804696ba26 | Count:1, | Count:3, |
| 56b50dbd0f3d | DISK_GB:371, | DISK_GB:1113, |
| | MEMORY_MB:131072, | MEMORY_MB:393216, |
| | VCPU:64 | VCPU:192 |
+----------------------+----------------------+----------------------+
</pre>
<p>You can also group by project, but in the current SKA cloud all users are in
the same project, so grouping by user works best.</p>
<p>The only additional step is converting the above information into metrics that
are fed into Monasca. For now this has also been integrated into the
<tt class="docutils literal"><span class="pre">os-capacity</span></tt> tool, by a magic environment variable. Ideally we would feed
the json based output of os-capacity into a separate tool that manages sending
metrics, but that is a nice task for a rainy day.</p>
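<p>For illustration only, a gauge like the "Free" count above could be
posted by hand with the Monasca CLI, along these lines (the metric
name and dimension are hypothetical, not the ones our dashboard
actually uses):</p>
<pre class="literal-block">
# Post a single capacity gauge reading into Monasca
monasca metric-create capacity.resources.free 4 \
    --dimensions flavor=my-flavor-1
</pre>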
</div>
<div class="section" id="what-s-next">
<h2>What's Next?</h2>
<p>Through our project in SKA we are starting to <a class="reference external" href="https://skatelescope.org/news/ska-signs-big-data-cooperation-agreement-cern">work very closely with CERN</a>.
As part of that work we are looking at helping with the CERN prototype of
preemptible instances, and looking at many other ways that both the SKA and
CERN can work together to help our scientists be even more productive.</p>
<p>The ultimate goal is to deliver private cloud infrastructure for research
computing use cases that achieves levels of utilisation comparable to the
best examples of well-run conventional research computing clusters. Being
able to track available capacity is an important step in that direction.</p>
</div>
Nick Jones Joins our Team2017-11-13T10:20:00+00:002017-11-21T12:00:00+00:00Stig Telfertag:www.stackhpc.com,2017-11-13:/nick-jones-joins-our-team.html<p class="first last">Nick Jones joins the StackHPC team, extending our developer
capability and bringing deep experience of OpenStack cloud operations.</p>
<p>We are excited to announce our newest team member: Nick Jones joins us
from DataCentred. Nick is well known within the OpenStack community,
as co-organiser of <a class="reference external" href="https://openstackdays.uk/2017/">OpenStack Days UK 2017</a>, the <a class="reference external" href="https://www.meetup.com/Manchester-OpenStack-Meetup">Manchester OpenStack Meetup</a>, and as an
active participant in the creation of the <a class="reference external" href="https://wiki.openstack.org/wiki/PublicCloudWorkingGroup">OpenStack Public Cloud WG</a>.</p>
<p>StackHPC will be drawing on Nick's great depth of experience as head
of cloud computing at DataCentred, to assist our clients in research
computing with their transition into operation.</p>
<p>We will also be benefiting from Nick's strong technical skills to
help power our development activities to achieve our goals for
Scientific OpenStack.</p>
<p>Nick adds "As someone who grew up dragging my parents to Jodrell Bank
on a regular basis, I'm extraordinarily proud to be given the
opportunity to work with StackHPC at such an exciting time in the
company's evolution."</p>
<p>Follow Nick on Twitter <a class="reference external" href="https://twitter.com/yankcrime">@yankcrime</a>.</p>
<div class="figure">
<img alt="Nick Jones" src="//www.stackhpc.com/images/nick-jones.jpg" style="width: 300px;" />
</div>
StackHPC at OpenStack Sydney2017-11-11T10:20:00+00:002017-11-11T18:40:00+00:00Stig Telfertag:www.stackhpc.com,2017-11-11:/stackhpc-at-openstack-sydney.html<p class="first last">Stig attended the OpenStack Sydney summit with the team
from Cambridge University. Stig helped lead two Scientific SIG
sessions. In addition to that he did a lightning talk and two
presentations. What a week!</p>
<p>Stig had a busy week at the <a class="reference external" href="https://www.openstack.org/summit/sydney-2017/">OpenStack Sydney Summit</a>. The Scientific
SIG had a meeting and a BoF session, which was organised as lightning
talks.</p>
<p>Stig co-presented with Belmiro Moreira of the CERN team on the
collaboration activities between CERN and SKA:</p>
<div class="youtube"><iframe src="https://www.youtube.com/embed/XmQR06Mwd5g" width="750" height="500" allowfullscreen seamless frameBorder="0"></iframe></div><p>Stig also co-presented with Erez Cohen of Mellanox on "Five Ways
with Hypervisor Networking", a study on the different strategies
for virtualised networking and their consequential impact on
performance:</p>
<div class="youtube"><iframe src="https://www.youtube.com/embed/E1VI1p4mCBM" width="750" height="500" allowfullscreen seamless frameBorder="0"></iframe></div><p>We will follow up in due course with details from this talk in a
blog post.</p>
Scheduling Baremetal Resources in Pike2017-10-26T15:00:00+01:002017-11-09T23:01:00+00:00John Garbutttag:www.stackhpc.com,2017-10-26:/baremetal-scheduling-pike.html<p class="first last">Discussion of scheduling baremetal resources using OpenStack Pike
and looking at news from the Denver PTG about what is likely going
to happen during Queens and beyond.</p>
<img alt="OpenStack Pike" src="//www.stackhpc.com/images/openstack-pike-logo.png" style="width: 80px;" />
<p>For many reasons, it is common for HPC users of OpenStack to use Ironic and
Nova together to deliver baremetal servers to their users. In this post we
look at recent changes to how Nova chooses which Ironic node to use for each
user's <tt class="docutils literal">nova boot</tt> request.</p>
<div class="section" id="flavors">
<h2>Flavors</h2>
<p>To set the scene, let's look at what a user gets to choose from when
asking Nova to boot a server. While there are many options relating to
the boot image, storage volumes and networking, let's ignore these and focus
on the choice of <cite>Flavor</cite>.</p>
<p>The choice of <cite>Flavor</cite> allows the user to specify which of the predefined
combinations of CPU, RAM and disk best suits their needs. In many
clouds the choice of flavor maps directly to how much the user has to pay.
In some clouds it is also possible to pick between a baremetal server (i.e.
using the Ironic driver) or a VM (i.e. using the libvirt driver) by picking
a particular flavor, while most clouds only use a single driver for all their
instances.</p>
</div>
<div class="section" id="before-pike">
<h2>Before Pike</h2>
<p>Ironic manages an inventory of nodes (i.e. physical machines). We
need to somehow translate Nova's flavor into a choice of Ironic node.
Before the Pike release, this was done by comparing the RAM, CPU and disk
resources for each node with what is defined in the flavor.</p>
<p>If you don't use the exact match filters in Nova, you will find Nova is happy
to give users any physical machine that has at least the amount of
resources requested in the flavor. This can lead to your special high memory
servers being used by people who only requested your regular type of server.
Some find this is a feature; if you are out of small servers your preference
might be giving people a slightly bigger server instead.</p>
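<p>If exact matching is what you want, the pre-Pike answer was the
exact-match scheduler filters. A minimal sketch of the relevant
<tt class="docutils literal">nova.conf</tt> snippet might look like this (your
full filter list will vary by deployment):</p>
<pre class="literal-block">
[DEFAULT]
# Only pick hosts whose RAM, disk and CPU counts exactly match the flavor
scheduler_default_filters = RetryFilter,AvailabilityZoneFilter,ComputeFilter,ExactRamFilter,ExactDiskFilter,ExactCoreFilter
</pre>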
<p>All this confusion comes because we are trying to manage indivisible physical
machines using a set of descriptions designed for packing VMs onto a
hypervisor, possibly taking into account a degree of overcommit. Things get
even harder when you consider having both VM and baremetal resources in the
same region, with a single scheduler having to pick the correct resources
based on the user's request. At this point you need the exact match filters
for only a subset of the hosts. This problem is now starting to be resolved by
the creation of Nova's placement service.</p>
</div>
<div class="section" id="the-resource-class">
<h2>The Resource Class</h2>
<p>The new Placement API brings its own set of new terms.
Let's just say a <cite>Resource Provider</cite> has an <cite>Inventory</cite> that defines what
quantity of each available <cite>Resource Class</cite> the <cite>Resource Provider</cite> has. Users
can get a set of <cite>Allocations</cite> for specific amounts of a <cite>Resource Class</cite> from
a given <cite>Resource Provider</cite>. Note: while there is a set of well-known
<cite>Resource Class</cite> names, you are also able to have custom names.</p>
<p>Furthermore, a <cite>Resource Provider</cite> can be tagged with <cite>Traits</cite> that describe
the qualitative capabilities of the <cite>Resource Provider</cite>. The python library
<tt class="docutils literal"><span class="pre">os-traits</span></tt> defines the standard <cite>Traits</cite>, but the system also allows custom
traits. Ironic has <a class="reference external" href="https://specs.openstack.org/openstack/ironic-specs/specs/approved/node-resource-class.html">recently added</a>
the ability to set a <cite>Resource Class</cite> on an Ironic Node.</p>
<p>In Pike, Nova reads the Ironic node <tt class="docutils literal">resource_class</tt> property and, if it
has been set, updates the <cite>Inventory</cite> of the <cite>Resource Provider</cite> that represents
that Ironic node to have an amount of 1 available of the given custom
<cite>Resource Class</cite>.</p>
</div>
<div class="section" id="using-ironic-s-resource-classes">
<h2>Using Ironic's Resource Classes</h2>
<p>Lots of technical jargon in that last section. What does that really mean?</p>
<p>Well it means we can divide up all Ironic nodes into distinct subsets, and we
can label each distinct subset with a <cite>Resource Class</cite>. For an existing system,
you can update any Node to add a Resource Class. But be careful, because once
you add a Resource Class to a node, you can't change the field until the Ironic
node is no longer being used (i.e. in the <tt class="docutils literal">available</tt> state). (There are good
reasons why, but let's leave that for another blog post.)</p>
<p>If you are adding new Nodes or creating a new cloud, you can use Ironic
inspector rules to set the Resource Class to an appropriate value, in a similar
way to initializing any of the other Node properties you can determine via
inspection.</p>
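<p>As a hedged sketch, such an inspector rule might look like the
following; the condition field and values here are hypothetical, so
adapt them to whatever your hardware reports during inspection:</p>
<pre class="literal-block">
# Hypothetical rule: tag matching nodes with a resource class at inspection
cat &gt; resource-class-rule.json &lt;&lt;'EOF'
[{
  "description": "Set the resource class on GPU-capable nodes",
  "conditions": [
    {"op": "eq", "field": "data://inventory.system_vendor.product_name",
     "value": "PowerEdge R730"}
  ],
  "actions": [
    {"action": "set-attribute", "path": "/resource_class",
     "value": "baremetal.gpu"}
  ]
}]
EOF
openstack baremetal introspection rule import resource-class-rule.json
</pre>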
</div>
<div class="section" id="mapping-resource-classes-to-flavors">
<h2>Mapping Resource Classes to Flavors</h2>
<p>So here is where it gets more interesting. Now that we have defined these groups of
Ironic nodes, we can map these groups to a particular <cite>Nova</cite> flavor. <a class="reference external" href="https://docs.openstack.org/ironic/pike/install/configure-nova-flavors.html#scheduling-based-on-resource-classes">Here are
the docs</a>
on how you do that.</p>
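<p>Following those docs, the shape of the configuration is roughly as
below (the resource class and flavor names are placeholders). Note how
the custom class gains a <tt class="docutils literal">CUSTOM_</tt> prefix and
upper-casing on the Nova side:</p>
<pre class="literal-block">
# Tag a subset of Ironic nodes with a resource class
openstack baremetal node set --resource-class baremetal.gold &lt;node-uuid&gt;

# Have the matching flavor request exactly one unit of that class...
openstack flavor set bm.gold --property resources:CUSTOM_BAREMETAL_GOLD=1

# ...and stop scheduling on the standard resource classes
openstack flavor set bm.gold --property resources:VCPU=0
openstack flavor set bm.gold --property resources:MEMORY_MB=0
openstack flavor set bm.gold --property resources:DISK_GB=0
</pre>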
<div class="section" id="health-warning-time">
<h3>Health warning time</h3>
<p>You probably noticed our
<a class="reference external" href="//www.stackhpc.com/kolla-kayobe-pike.html">blog post on upgrading to Pike</a>.
Well if you want to do this, you need to make sure you have a bug fix we have
helped develop to make this work. In particular you want to be on a new enough
version of Pike that you <a class="reference external" href="https://github.com/openstack/nova/commit/842641c9b0b8f60d0e19b38ad1180078e9c4330c">get this backport</a>.</p>
<p>Without the above fix, you will find that adding flavor extra specs such as
<tt class="docutils literal">resources:VCPU=0</tt> causes the Nova scheduler to start picking Ironic nodes
that are already being used by existing instances, triggering lots of retries
and likely lots of build failures.</p>
<p>One more health warning. If you set a resource class of <tt class="docutils literal">CUSTOM_GOLD</tt> in
Ironic, that will get registered in Nova as <tt class="docutils literal">CUSTOM_CUSTOM_GOLD</tt>. As such
it's best not to add the <tt class="docutils literal">CUSTOM_</tt> prefix in Ironic. There is a lot of history
around why it works this way; for more details see <a class="reference external" href="https://bugs.launchpad.net/nova/+bug/1724524">the bug on launchpad</a>.</p>
</div>
<div class="section" id="an-unrelated-pike-bug">
<h3>An Unrelated Pike bug</h3>
<p>While we are talking about Pike and using Ironic through Nova, if you have
started using the experimental HA mode, where two or more <tt class="docutils literal"><span class="pre">nova-compute</span></tt>
processes talk to one Ironic deploy, you will want to know about <a class="reference external" href="https://bugs.launchpad.net/nova/+bug/1714248">this bug</a> that
means it is quite badly broken in Pike.</p>
<p>Once we have the fix for that merged, we will let you know what can be done
for Pike based clouds in a future blog post.</p>
</div>
</div>
<div class="section" id="something-you-must-do-before-you-upgrade-to-queens">
<h2>Something you must do before you upgrade to Queens</h2>
<p>In Pike there is a choice between the old scheduling world and the new
<cite>Resource Class</cite> based world. But you must add a <cite>Resource Class</cite> to every
Ironic node before you upgrade to Queens.</p>
<p>For more details on the deprecation of scheduling of Ironic nodes using VCPU,
RAM and disk, please see the <a class="reference external" href="https://docs.openstack.org/releasenotes/nova/pike.html#deprecation-notes">Nova release notes</a>.</p>
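<p>If you have many nodes to migrate, this can be scripted. A minimal sketch,
assuming a single default class (the class name is a hypothetical example, and
each node must be in the <tt class="docutils literal">available</tt> state):</p>
<div class="highlight"><pre># Set a default Resource Class on every Ironic node.
# "baremetal.general" is a hypothetical example class name.
for node in $(openstack baremetal node list -f value -c UUID); do
    openstack baremetal node set $node --resource-class baremetal.general
done
</pre></div>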
<p>Once you update your Ironic nodes with the <cite>Resource Class</cite> (and you are on
the latest version of Pike that has the bug fix in), existing instances that
previously never claimed the new <cite>Resource Class</cite> will have their allocations
updated to include it.</p>
</div>
<div class="section" id="why-not-use-mogan">
<h2>Why not use Mogan?</h2>
<p>I hear you ask: why bother with Nova any more, when there is this new project
called Mogan that focuses on Ironic and ignores VMs?</p>
<p>Talking to our users, they like making use of the rich ecosystem around the
Nova API that (largely) works equally well for both VMs and Baremetal, be that
the OpenStack support in Ansible or the support for orchestrating big data
systems in OpenStack Sahara. In my opinion, this means it's worth sticking
with Nova, and I am not just saying that because I used to be the Nova PTL.</p>
</div>
<div class="section" id="where-we-have-got-in-pike">
<h2>Where we have got in Pike</h2>
<p>In the <a class="reference external" href="http://www.skatelescope.org">SKA</a> performance prototype we are now
making use of the Resource Class based placement. This means placement picks
only an Ironic Node that exactly matches what the flavor requests. Previously,
because we did not use the exact filters or capabilities, we had GPU capable
nodes being handed out to users who only requested a regular node.</p>
<p>Looking at the capacity of the cloud through the Placement API is also now
much simpler when you consider the available Resource Classes. You can see
a <a class="reference external" href="https://github.com/johngarbutt/os-capacity">prototype tool</a> I created to
query the capacity from Placement (using Ocata).</p>
</div>
<div class="section" id="what-is-happening-in-queens-and-rocky">
<h2>What is Happening in Queens and Rocky?</h2>
<p>If you want to know more about the context around the work on the Placement API
and the plans for the future, these two presentations from the Boston summit
are a great place to start:</p>
<ul class="simple">
<li><a class="reference external" href="https://www.openstack.org/videos/boston-2017/scheduler-wars-a-new-hope">https://www.openstack.org/videos/boston-2017/scheduler-wars-a-new-hope</a></li>
<li><a class="reference external" href="https://www.openstack.org/videos/boston-2017/scheduler-wars-revenge-of-the-split">https://www.openstack.org/videos/boston-2017/scheduler-wars-revenge-of-the-split</a></li>
</ul>
<p>I recently attended the Project Team Gathering (PTG) in Denver. There was lots
of discussion on how Ironic can make use of <cite>Traits</cite> for finer grained
scheduling, including how you could use Nova flavors to pick between different
RAID and BIOS configurations that are optimized for specific workloads.
More on how those discussions are going, and how the SKA (Square Kilometre
Array) project is looking to use those new features in a future blog post!</p>
</div>
Upgrade to Pike using Kolla and Kayobe2017-09-21T12:00:00+01:002017-09-21T12:00:00+01:00Stig Telfertag:www.stackhpc.com,2017-09-21:/kolla-kayobe-pike.html<p class="first last">A control plane upgrade from Ocata to Pike, using Kolla
and Kayobe. How we prepared for it and what happened on the day.</p>
<img alt="OpenStack Pike" src="//www.stackhpc.com/images/openstack-pike-logo.png" style="width: 100px;" />
<p>We have previously described a new kind of OpenStack infrastructure,
built to combine polymorphic flexibility with HPC levels of
performance, in the context of our project with the <a class="reference external" href="//www.stackhpc.com/hpc-networking-in-openstack-1.html">Square Kilometre
Array</a>.
To take advantage of OpenStack's latest capabilities, this week we
upgraded that infrastructure from Ocata to Pike.</p>
<p>Early on, we took a design decision to base our deployments on
<a class="reference external" href="https://docs.openstack.org/kolla/latest/">Kolla</a>, which uses Docker
to containerise the OpenStack control plane, transforming it into
something approximating a microservice architecture.</p>
<p>Kolla is in reality several projects. There is the project to define
the composition of the Docker containers for each OpenStack service,
and then there are the projects to orchestrate the deployment of Docker
containers across one or more control plane hosts. This could be done
using <a class="reference external" href="https://docs.openstack.org/kolla-kubernetes/latest/">Kolla-Kubernetes</a>,
but our preference is for <a class="reference external" href="https://docs.openstack.org/kolla-ansible/latest/">Kolla-Ansible</a>.</p>
<p>Kolla-Ansible builds upon a set of hosts already deployed and
configured up to a baseline level where Ansible can drive the Docker
deployment. Given we are typically starting from pallets of new
servers in a loading dock, there is a gap to be filled to get from
one to the other. For that role, we created <a class="reference external" href="https://kayobe.readthedocs.io/en/latest/">Kayobe</a>, loosely defined as
"Kolla on <a class="reference external" href="https://docs.openstack.org/bifrost/latest/">Bifrost</a>",
and intended to perform a similar role to <a class="reference external" href="http://tripleo.org/">TripleO</a>,
but using only Ironic for the undercloud seed and driven by Ansible
throughout. This approach has enabled us to incorporate some
compelling features, such as Ansible-driven configuration of <a class="reference external" href="//www.stackhpc.com/ansible-drac.html">BIOS
and RAID</a> firmware parameters
and <a class="reference external" href="//www.stackhpc.com/hpc-networking-in-openstack-2.html">Network switch configuration</a>.</p>
<p>There is no doubt that Kayobe has been a huge enabler for us, but what about Kolla?
One of the advantages claimed for a containerised control plane is how it
simplifies the upgrade process by severing the interlocking package dependencies
of different services. This week we put this to the test, by upgrading
a number of systems from Ocata to Pike.</p>
<p>This is a short guide to how we did it, and how it worked out...</p>
<div class="section" id="have-a-working-test-plan">
<h2>Have a Working Test Plan</h2>
<p>It may seem obvious, but it is not always the obvious starting point.
Make a set of tests to ensure that your OpenStack system is working
before you start. Then repeat these tests at any convenient point.
By starting with a test plan that you know works, you'll know for sure
if you've broken it.</p>
<p>Otherwise in the depths of troubleshooting you'll have a lingering
doubt that perhaps your cloud was broken in this way all along...</p>
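<p>As a flavour of what that can look like, here is a minimal smoke-test sketch
using the OpenStack CLI; all the names are hypothetical, and a real plan would
also exercise volumes, networking and bare metal workflows:</p>
<div class="highlight"><pre># Minimal smoke test: boot an instance, check it goes ACTIVE, clean up.
# Image, flavor, network and keypair names are hypothetical examples.
openstack server create --image CentOS7 --flavor m1.small \
    --network internal --key-name ops-key --wait smoke-test-01
openstack server show smoke-test-01 -f value -c status   # expect ACTIVE
openstack server delete --wait smoke-test-01
</pre></div>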
</div>
<div class="section" id="preparing-the-system-for-upgrade">
<h2>Preparing the System for Upgrade</h2>
<p>We brought the system to the latest on the <tt class="docutils literal">stable/ocata</tt> branch.
This in itself shakes out a number of issues. Just how healthy are
the kernel and OS on the controller hosts? Are the Neutron agent
containers spinning, looking for lost namespaces? Is the kernel
blocking on most cores before spewing out reams of <tt class="docutils literal">kernel:NMI
watchdog: BUG: soft lockup - CPU#2 stuck for 23s!</tt>?</p>
<p>A host in this state is unlikely to succeed in moving one patchset forward,
let alone a major OpenStack release.</p>
<p>One of Kolla's strengths is the elimination of dependencies between
services. It makes it possible to deploy different versions of
OpenStack services without worrying about dependency conflicts.
This can be a very powerful advantage.</p>
<p>The ability to update a Kolla container forward along the same
stable release branch establishes that the basic procedure is working
as expected. Getting the control plane migrated to the tip of the
current release branch is a good precursor to making the version
upgrade.</p>
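<p>Mechanically, both the same-branch update and the release upgrade follow the
same pull-then-upgrade cycle in Kolla-Ansible. A hedged sketch, in which the
inventory path is a hypothetical example:</p>
<div class="highlight"><pre># Pre-fetch the target container images, then roll the services forward.
# The inventory path here is a hypothetical example.
kolla-ansible -i /etc/kolla/inventory pull
kolla-ansible -i /etc/kolla/inventory upgrade
</pre></div>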
</div>
<div class="section" id="staging-the-upgrade">
<h2>Staging the Upgrade</h2>
<p>Take the leap on a staging or development system and you'll be more
confident of landing in one piece on the other side. In tests on
a development system, we identified and fixed a number of issues
that would each have become a major problem on the production system
upgrade.</p>
<p>Even a single-node staging system will find problems for you.</p>
<p>For example:</p>
<ul class="simple">
<li>During the Pike upgrade, the Docker Python bindings package renames
from <tt class="docutils literal">docker_py</tt> to <tt class="docutils literal">docker</tt>. They are mutually exclusive.
The Python environment we use for Kolla-Ansible must start the
process with <tt class="docutils literal">docker_py</tt> installed and at the appropriate point
transition to <tt class="docutils literal">docker</tt> (see the sketch after this list). We <a class="reference external" href="https://github.com/stackhpc/kayobe/commit/c2312561dd9b3c45b7846a8dd59ff53f9a7695ef">found a way through</a>
and developed Kayobe to <a class="reference external" href="https://github.com/stackhpc/kayobe/commit/e5b8ce6cbf9543e11b8e0f81210c297f78836ba5">perform this orchestration</a>.</li>
<li>We carried forward a piece of work to enable our Kolla logs via Fluentd
to go to Monasca,
<a class="reference external" href="https://review.openstack.org/#/c/483026/">which just made its way upstream</a>.</li>
<li>We hit a problem with Kolla-Ansible's RabbitMQ containers generating
<a class="reference external" href="https://github.com/stackhpc/kayobe/issues/14">duplicate entries in /etc/hosts</a>,
which we work around while the root cause is investigated.</li>
<li>We found and fixed some more issues with Kolla-Ansible pre-checks for both
<a class="reference external" href="https://review.openstack.org/#/c/500535/">Ironic</a> and <a class="reference external" href="https://review.openstack.org/#/c/500765/">Murano</a>.</li>
<li>We hit this bug with <a class="reference external" href="https://bugs.launchpad.net/kolla-kubernetes/+bug/1707856">generating config for mariadb</a> - easily
fixed once the problem was identified.</li>
</ul>
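<p>The <tt class="docutils literal">docker_py</tt> to <tt class="docutils literal">docker</tt> transition mentioned above boils down to
something like the following sketch, assuming Kolla-Ansible runs from a
dedicated virtualenv (the path is a hypothetical example):</p>
<div class="highlight"><pre># Swap the mutually exclusive Docker Python bindings at the right moment.
# The virtualenv path is a hypothetical example.
source /opt/kolla-venv/bin/activate
pip uninstall -y docker-py
pip install docker
</pre></div>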
</div>
<div class="section" id="performing-the-upgrade">
<h2>Performing the Upgrade</h2>
<p>On the day, at a production scale, new problems can occur that were not
exposed at the scale of a staging system.</p>
<p>In a production upgrade, the best results come from bringing all the technical
stakeholders together while the upgrade progresses. This enables a team to draw
on all the expertise it needs to work through issues encountered.</p>
<p>In production upgrades, we worked through new issues:</p>
<ul class="simple">
<li>A race condition encountered in the management of keepalived for an <tt class="docutils literal">haproxy</tt>
cluster. This was identified as the race condition reported in
<a class="reference external" href="https://bugs.launchpad.net/kolla-ansible/+bug/1714407">this bug</a> and already
fixed on the master branch, which we could cherry-pick.</li>
<li>We hit this bug with <a class="reference external" href="https://bugs.launchpad.net/kolla/+bug/1717922">Horizon reporting End of script output before headers: django.wsgi</a>, for which a bug fix was
<a class="reference external" href="https://review.openstack.org/#/c/505569/">already in review upstream</a> that
we could cherry-pick.</li>
</ul>
<p>That final point should have been found by our test plan, but was
not covered (this time). Arguably it should have been found by Kolla-Ansible's
CI testing too.</p>
</div>
<div class="section" id="the-early-bird-gets-the-worm">
<h2>The Early Bird Gets The Worm</h2>
<p>Being an early adopter has both benefits and drawbacks. Kolla,
Ansible and Kayobe have made it possible to do what we did -
successfully - with a small but talented team.</p>
<p>Our users have scientific work to do, and our OpenStack projects
exist to support that.</p>
<p>We are working to deliver infrastructure with cutting-edge capabilities that
exploit OpenStack's latest features. We are proud to take some credit for our
upstream contributions, and excited to make the most of these new powers
in Pike.</p>
</div>
<div class="section" id="further-reading">
<h2>Further Reading</h2>
<ul class="simple">
<li><a class="reference external" href="https://www.openstack.org/software/pike/">Lots more on OpenStack Pike</a></li>
<li><a class="reference external" href="http://www.stackhpc.com/openstack-and-hpc-infrastructure.html">OpenStack and HPC Infrastructure Management, StackHPC blog</a></li>
<li><a class="reference external" href="https://www.openstack.org/science/">OpenStack for Scientific Research</a></li>
</ul>
</div>
John Garbutt Joins our Team2017-08-17T10:20:00+01:002017-08-17T10:20:00+01:00Stig Telfertag:www.stackhpc.com,2017-08-17:/john-garbutt-joins-our-team.html<p class="first last">John Garbutt joins the StackHPC team, giving us new and enhanced
capabilities to push the boundaries of Scientific OpenStack.</p>
<p>We are excited to announce our star summer signing: John Garbutt comes to us
from Rackspace and the OpenStack Innovation Center (OSIC). John is also
active as a <a class="reference external" href="https://wiki.openstack.org/wiki/Nova/CoreTeam">Nova core</a>
and sits on the <a class="reference external" href="https://www.openstack.org/foundation/tech-committee/">OpenStack Technical Committee</a>. We are thrilled
to have John's input from both of these vital areas.</p>
<p>John adds "I'm really excited to join the talented StackHPC team.
Looking forward to the part we will play in shaping how the Scientific
Computing community does their computing."</p>
<p>Aside from drawing on John's depth of technical input and experience, our new
team member will be helping us to drive forward our ambitious plans for the
development of Scientific OpenStack. In the coming months we hope to provide
further details on what we will be getting up to together.</p>
<p>Follow John on Twitter <a class="reference external" href="https://twitter.com/johnthetubaguy">@johnthetubaguy</a>.</p>
<div class="figure">
<img alt="John Garbutt" src="//www.stackhpc.com/images/john-garbutt.jpg" style="width: 231px;" />
</div>
Clusters for Scientific Applications: as-a-Service2017-08-03T17:00:00+01:002017-08-03T17:00:00+01:00Stig Telfertag:www.stackhpc.com,2017-08-03:/cluster-as-a-service.html<p class="first last">The marriage of Heat and Ansible for delivering scientific
applications that are simple to deploy and manage.</p>
<p>How can we make a workload easier on cloud? In a previous article
we <a class="reference external" href="//www.stackhpc.com/openstack-and-hpc-workloads.html">presented the lay of the land</a> for HPC
workload management in an OpenStack environment. A substantial
part of the work done to date focuses on automating the creation
of a software-defined workload management environment -
<em>SLURM-as-a-Service</em>.</p>
<div class="figure">
<img alt="SLURM logo" src="//www.stackhpc.com/images/slurm-logo.png" style="width: 192px;" />
</div>
<p>SLURM is only one narrow (but widely-used) use case in a broad ecosystem of multi-node
scientific application clusters: let's not over-specialise to that.
That raises the question: what is needed to make a generally useful,
flexible system for Cluster-as-a-Service?</p>
<div class="section" id="what-do-users-really-want">
<h2>What do Users Really Want?</h2>
<p>A user of the system will not care how elegantly the infrastructure
is orchestrated:</p>
<ul class="simple">
<li>Users will want support for the science tools they need, and when
new tools are needed, the users will want support for those too.</li>
<li>Users will want to get started with minimal effort. The learning
curve they must climb to deploy tools needs to be shallow.</li>
<li>Users will want easy access to the datasets upon which their
research is based.</li>
</ul>
<p>Scientists certainly do not want to be given clusters which in truth
are just replicated infrastructure. We must provide immediately
useful environments that don’t require scientists to be sysadmins.
The time to science (or more bluntly, the time to paper) is pretty
much the foremost consideration. Being able to use automation
to reliably reproduce research findings comes a close second.</p>
</div>
<div class="section" id="application-building-blocks">
<h2>Application Building Blocks</h2>
<p>A really useful application cluster service is built with several
key design criteria taken into account:</p>
<ul>
<li><p class="first"><em>The right model for sharing</em>. Do we provide a globally-available shared
infrastructure, project-level multi-user infrastructure or per-user
isolation? Per-user isolation might work until the user decides
they prefer to collaborate. But can a user trust every other user
in their project? Per-project sharing might work unless the users
don't actually trust one another (which might in itself be a bigger
problem).</p>
</li>
<li><p class="first"><em>Users on the cluster</em>. Are the members of the project also to
be the users of the cluster? In this case, we should do what we can to
offer a cluster deployment that is tightly integrated with the
OpenStack environment. Why not authenticate with their OpenStack
credentials, for example?</p>
<p>In a different service model, the OpenStack project members are the
cluster admins, not its users. If the cluster is being provided as
a service to others who have no connection with the infrastructure,
an external mechanism is required to list the users, and to
authenticate them. Some flexibility should be supported in an effective
solution.</p>
</li>
<li><p class="first"><em>Easy data access</em>. Copying user files into a cluster introduces
a boundary to cross, which adds inconvenience for using that resource.
Furthermore, copying data in requires the same data to be stored
in two (or more) places.</p>
<p>Where a cluster requires a shared filesystem, creating its own
ad-hoc filesystem is unlikely to be the best solution. In the
same manner as provider networks, a cluster should support "provider
filesystems" - site production filesystems that are exported from
other infrastructure in the data centre.</p>
<p>Large scientific datasets may also be required, and are often
mediated using platform data services such as <a class="reference external" href="https://irods.org/">iRODS</a>, or <a class="reference external" href="https://support.hdfgroup.org/HDF5/whatishdf5.html">HDF5</a> (note,
that's a <em>5</em>).</p>
<p>Object storage (S3 in particular) is seen as the long-term solution
for connecting applications with datasets, and does appear to be
the direction of travel for many. However, sharing read-only
filesystems, at either file or block level, are simple and universal
approaches that work perfectly well. Both are also well-established choices.</p>
</li>
<li><p class="first"><em>Scaling up and down</em>. The cloud access model does not entail queuing
and waiting for resources like a conventional HPC batch queuing
system. Equally, cloud resources are assumed to grow and shrink
dynamically, as required. Perhaps this could even happen
automatically, triggered by peaks or troughs in demand. To
maximise overall utilisation, cluster resizing is actually quite
important, and managing all the resources of a cluster together
enables us to do it well.</p>
</li>
<li><p class="first"><em>Self-service creation</em> - for some definition of <em>self</em>. Users shouldn't
need to learn sysadmin skills in order to create a resource for
doing their science. For some, the Horizon web interface might
be too complex to bother learning. Enter the <em>ResOps</em> role - for
example described in a recent <a class="reference external" href="http://superuser.openstack.org/articles/scientific-wg-update/">SuperUser article on the Scientific
WG</a>
- a specialist embedded within the project team, trained on working
with cloud infrastructure to deliver the best outcomes for that
team. To help the task of cluster creation, it should also be
automated to the fullest extent possible.</p>
</li>
</ul>
</div>
<div class="section" id="bringing-it-all-together-openhpc-as-a-service">
<h2>Bringing it all together: OpenHPC-as-a-Service</h2>
<p>The <a class="reference external" href="http://openhpc.community/">OpenHPC project</a> is an initiative
to build a community package distribution and ecosystem around a
common software framework for HPC.</p>
<div class="figure">
<img alt="OpenHPC logo" src="//www.stackhpc.com/images/openhpc-logo.png" style="width: 256px;" />
</div>
<p>OpenHPC clusters are built around the popular
<a class="reference external" href="https://slurm.schedmd.com/overview.html">SLURM</a> workload manager.</p>
<p>A team from Intel including Sunil Mahawar, Yih Leong Sun and Jeff
Adams have already <a class="reference external" href="https://www.openstack.org/videos/boston-2017/openstack-openhpc-hpc-cloud">presented their work</a>
at the latest OpenStack summit in Boston:</p>
<div class="youtube"><iframe src="https://www.youtube.com/embed/CShEig6mgJA" width="600" height="400" allowfullscreen seamless frameBorder="0"></iframe></div><p>Independently, we have been working on our own OpenHPC clusters as
one of our scientific cluster applications on the <a class="reference external" href="http://www.skatelescope.org">SKA</a> performance prototype system, and
I'm going to share some of the components we have used to make this
project happen.</p>
<div class="figure">
<img alt="SKA Alaska: Performance Prototype Platform" src="//www.stackhpc.com/images/alaska-compute.jpg" style="width: 256px;" />
</div>
<ul>
<li><p class="first"><em>Software base image:</em> An image with major components baked in saves
network bandwidth and deployment time for large clusters. We use
our <a class="reference external" href="https://galaxy.ansible.com/stackhpc/os-images/">os-images role</a>,
available on Ansible Galaxy. We use this role with <a class="reference external" href="https://github.com/stackhpc/stackhpc-image-elements/tree/master/elements/openhpc">our custom elements</a>
written for <a class="reference external" href="https://docs.openstack.org/developer/diskimage-builder/">Diskimage-builder</a>.</p>
<p>Here's an example of the configuration to provide for the <tt class="docutils literal"><span class="pre">os-images</span></tt> role:</p>
</li>
</ul>
<div class="highlight"><pre><span></span><span class="nt">os_images_list</span><span class="p">:</span>
<span class="c1"># Build of OpenHPC image on a CentOS base</span>
<span class="p p-Indicator">-</span> <span class="nt">name</span><span class="p">:</span> <span class="s">"CentOS7-OpenHPC"</span>
<span class="nt">elements</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="s">"centos7"</span>
<span class="p p-Indicator">-</span> <span class="s">"epel"</span>
<span class="p p-Indicator">-</span> <span class="s">"openhpc"</span>
<span class="p p-Indicator">-</span> <span class="s">"selinux-permissive"</span>
<span class="p p-Indicator">-</span> <span class="s">"dhcp-all-interfaces"</span>
<span class="p p-Indicator">-</span> <span class="s">"vm"</span>
<span class="nt">env</span><span class="p">:</span>
<span class="nt">DIB_OPENHPC_GRPLIST</span><span class="p">:</span> <span class="s">"ohpc-base-compute</span><span class="nv"> </span><span class="s">ohpc-slurm-client</span><span class="nv"> </span><span class="s">'InfiniBand</span><span class="nv"> </span><span class="s">Support'"</span>
<span class="nt">DIB_OPENHPC_PKGLIST</span><span class="p">:</span> <span class="s">"lmod-ohpc</span><span class="nv"> </span><span class="s">mrsh-ohpc</span><span class="nv"> </span><span class="s">lustre-client-ohpc</span><span class="nv"> </span><span class="s">ntp"</span>
<span class="nt">DIB_OPENHPC_DELETE_REPO</span><span class="p">:</span> <span class="s">"n"</span>
<span class="nt">properties</span><span class="p">:</span>
<span class="nt">os_distro</span><span class="p">:</span> <span class="s">"centos"</span>
<span class="nt">os_version</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">7</span>
<span class="nt">os_images_elements</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">../stackhpc-image-elements</span>
</pre></div>
<ul>
<li><p class="first"><em>Flexible infrastructure definition:</em> A flexible definition is needed
because it soon becomes apparent that application clusters do not all
conform to a template of 1 master, <em>n</em> workers. We use a Heat template
in which the instances to create are parameterised as a list of groups.
This takes Heat close to the limits of its expressiveness - and requires
the Newton release of Heat as a minimum. Some users of Heat take an
alternative approach of code-generated Heat templates.</p>
<p>Our flexible Heat template is encapsulated within an <a class="reference external" href="https://galaxy.ansible.com/stackhpc/cluster-infra/">Ansible
role</a>
available on Ansible Galaxy. This role includes a task to <a class="reference external" href="https://github.com/stackhpc/ansible-role-cluster-infra/blob/master/tasks/main.yml#L11-L14">generate
a static inventory file</a>,
extended with user-supplied node groupings, which is suitable for
higher-level configuration.</p>
<p>The definition of the compute node components cluster takes a simple form:</p>
</li>
</ul>
<div class="highlight"><pre><span></span><span class="nt">cluster_groups</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="s">"{{</span><span class="nv"> </span><span class="s">slurm_login</span><span class="nv"> </span><span class="s">}}"</span>
<span class="p p-Indicator">-</span> <span class="s">"{{</span><span class="nv"> </span><span class="s">slurm_compute</span><span class="nv"> </span><span class="s">}}"</span>
<span class="nt">slurm_login</span><span class="p">:</span>
<span class="nt">name</span><span class="p">:</span> <span class="s">"login"</span>
<span class="nt">flavor</span><span class="p">:</span> <span class="s">"compute-B"</span>
<span class="nt">image</span><span class="p">:</span> <span class="s">"CentOS7-OpenHPC"</span>
<span class="nt">num_nodes</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">1</span>
<span class="nt">slurm_compute</span><span class="p">:</span>
<span class="nt">name</span><span class="p">:</span> <span class="s">"compute"</span>
<span class="nt">flavor</span><span class="p">:</span> <span class="s">"compute-A"</span>
<span class="nt">image</span><span class="p">:</span> <span class="s">"CentOS7-OpenHPC"</span>
<span class="nt">num_nodes</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">8</span>
</pre></div>
<p>The invocation of the infrastructure role becomes equally simple:</p>
<div class="highlight"><pre><span></span><span class="nn">---</span>
<span class="c1"># This playbook uses the Ansible OpenStack modules to create a cluster</span>
<span class="c1"># using a number of baremetal compute node instances, and configure it</span>
<span class="c1"># for a SLURM partition</span>
<span class="p p-Indicator">-</span> <span class="nt">hosts</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">openstack</span>
<span class="nt">roles</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="nt">role</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">stackhpc.cluster-infra</span>
<span class="nt">cluster_name</span><span class="p">:</span> <span class="s">"{{</span><span class="nv"> </span><span class="s">cluster_name</span><span class="nv"> </span><span class="s">}}"</span>
<span class="nt">cluster_params</span><span class="p">:</span>
<span class="nt">cluster_prefix</span><span class="p">:</span> <span class="s">"{{</span><span class="nv"> </span><span class="s">cluster_name</span><span class="nv"> </span><span class="s">}}"</span>
<span class="nt">cluster_keypair</span><span class="p">:</span> <span class="s">"{{</span><span class="nv"> </span><span class="s">cluster_keypair</span><span class="nv"> </span><span class="s">}}"</span>
<span class="nt">cluster_groups</span><span class="p">:</span> <span class="s">"{{</span><span class="nv"> </span><span class="s">cluster_groups</span><span class="nv"> </span><span class="s">}}"</span>
<span class="nt">cluster_net</span><span class="p">:</span> <span class="s">"{{</span><span class="nv"> </span><span class="s">cluster_net</span><span class="nv"> </span><span class="s">}}"</span>
<span class="nt">cluster_roles</span><span class="p">:</span> <span class="s">"{{</span><span class="nv"> </span><span class="s">cluster_roles</span><span class="nv"> </span><span class="s">}}"</span>
</pre></div>
<ul>
<li><p class="first"><em>Authenticating our users:</em> On this prototype system, we currently
have a small number of users, and these users are locally defined within
Keystone. In a larger production environment, a more likely scenario would
be that the users of an OpenStack cloud are stored within external
authentication infrastructure, such as LDAP.</p>
<p>Equivalent user accounts must be created on our OpenHPC cluster. Users
need to be able to log in on the externally-facing login node. The users
should be defined on the batch compute nodes, but they should not be able
to log in on these instances.</p>
<p>Our solution is to enable our users to authenticate using Keystone on
the login node. This is done using two projects,
<a class="reference external" href="http://pam-python.sourceforge.net/">PAM-Python</a> and
<a class="reference external" href="https://github.com/stackhpc/pam-keystone">PAM-Keystone</a> - a minimal
PAM module that performs auth requests using the Keystone API.
Using this, our users benefit from common authentication on OpenStack and
all the resources created on it.</p>
</li>
<li><p class="first"><em>Access to cluster filesystems:</em> OpenHPC clusters require a common filesystem
mounted across all nodes in a workload manager. One possible solution here
would be to use <a class="reference external" href="https://wiki.openstack.org/wiki/Manila">Manila</a>, but our
bare metal infrastructure may complicate its usage. It is an area for future
exploration for this project.</p>
<p>We are using CephFS, exported from our local Ceph cluster, with an all-SSD
pool for metadata and a journaled pool for file data. Our solution defines
a CephX key, shared between project users, which enables access to the CephFS
storage pools and metadata server. This CephX key is stored in <a class="reference external" href="https://wiki.openstack.org/wiki/Barbican">Barbican</a>. This appears to be an area where
support in <a class="reference external" href="https://docs.openstack.org/shade/latest/">Shade</a> and Ansible's
own <a class="reference external" href="http://docs.ansible.com/ansible/latest/list_of_cloud_modules.html#openstack">OpenStack modules</a>
is limited. We have written an Ansible role for retrieving secrets from
Barbican and storing them as facts, and we'll be working to package it and
publish it on Galaxy in due course (see the sketch after this list).</p>
</li>
<li><p class="first"><em>Converting infrastructure into platform:</em> Once we have built upon the
infrastructure to add the support we need, the next phase is to configure and
start the platform services. In this case, we build a SLURM configuration
that draws from the infrastructure inventory to define the workers and controllers
in the SLURM configuration.</p>
</li>
</ul>
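<p>As a concrete flavour of the Barbican interaction described above, here is a
hedged sketch using the Barbican OpenStack client plugin; the secret name and
key payload are hypothetical, and our Ansible role drives the equivalent API
calls:</p>
<div class="highlight"><pre># Store a CephX key as a Barbican secret (name and payload are hypothetical).
openstack secret store --name cephfs-cephx-key \
    --payload "AQD4LPlZAAAAABAAs5oGiXjSq1sJpDA0LIbfbA=="

# Retrieve the payload later, using the secret href returned by the store.
openstack secret get --payload $SECRET_HREF
</pre></div>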
<div class="section" id="adding-value-in-a-cloud-context">
<h3>Adding Value in a Cloud Context</h3>
<p>In the first instance, cloud admins recreate application environments,
defined by software and deployed on demand. These environments
meet user requirements. The convenience of their creation is
probably offset by a slight overhead in performance. On balance,
an indifferent user might not see compelling benefit to working
this way. Our OpenHPC-as-a-Service example described here largely
falls into this category.</p>
<p><em>Don't stop here</em>.</p>
<p>Software-defined cloud methodologies enable us to do some more
imaginative things in order to make our clusters the best they
possibly can be. We can introduce infrastructure services for
consuming and processing syslog streams, simplifying the administrative
workload of cluster operation. We can automate monitoring services
for ensuring smooth cluster operation, and application performance
telemetry as standard to assist users with optimisation. We can
help admins secure the cluster.</p>
<p>All of these things are attainable, because we have moved from
managing a deployment to developing the automation of that deployment.</p>
</div>
<div class="section" id="reducing-the-time-to-science">
<h3>Reducing the Time to Science</h3>
<p>Our users have scientific work to do, and our OpenStack projects
exist to support that.</p>
<p>We believe that OpenStack infrastructure can go beyond simply recreating
conventional scientific application clusters to generate Cluster-as-a-Service
deployments that integrate cloud technologies to be even better.</p>
</div>
<div class="section" id="further-reading">
<h3>Further Reading</h3>
<ul class="simple">
<li><a class="reference external" href="http://www.stackhpc.com/openstack-and-hpc-workloads.html">OpenStack and HPC Workload Management, StackHPC blog</a></li>
<li><a class="reference external" href="https://www.openstack.org/science/">OpenStack for Scientific Research</a></li>
</ul>
</div>
</div>
HPC Networking in OpenStack: Part 22017-07-14T20:00:00+01:002017-07-14T20:00:00+01:00Mark Goddardtag:www.stackhpc.com,2017-07-14:/hpc-networking-in-openstack-2.html<p class="first last">Part 2 of our series on HPC networking in OpenStack covers managing
physical and virtual network infrastructure as code using the Kayobe
project.</p>
<p>This post is the second in a series on HPC networking in OpenStack. In
the series we'll discuss StackHPC's current and future work on integrating
OpenStack with high performance network technologies. This post discusses how
the <a class="reference external" href="https://kayobe.readthedocs.io/en/latest/">Kayobe</a> project uses Ansible
to define physical and virtual network infrastructure as code.</p>
<p>If you've not read it yet, why not begin with the <a class="reference external" href="https://www.stackhpc.com/hpc-networking-in-openstack-1.html">first post in this series</a>.</p>
<div class="section" id="the-network-as-code">
<h2>The Network as Code</h2>
<p>Operating a network has for a long time been a difficult and high risk task.
Networks can be fragile, and the consequences of incorrect configuration can be
far reaching. Automation using scripts allows us to improve on this, reaping
some of the benefits of established software development practices such as
version control.</p>
<img alt="Ansible" src="//www.stackhpc.com/images/ansible-thumbnail.png" style="width: 200px;" />
<p>The recent influx of DevOps and configuration management tools are well suited
to the task of network management, with Ansible in particular having a large
selection of modules for configuration of network devices. Of course, the
network doesn't end at the switch, and Ansible is equally well suited to
driving network configuration on the attached hosts - from simple interface
configuration to complex virtual networking topologies.</p>
</div>
<div class="section" id="openstack-networks">
<h2>OpenStack Networks</h2>
<p>OpenStack can be deployed in a number of configurations, with various levels of
networking complexity. Some of the classes of networks that may be used in an
OpenStack cluster include:</p>
<dl class="docutils">
<dt>Power & out-of-band management network</dt>
<dd>Access to power management devices and out-of-band management systems
(e.g. BMCs) of control and compute plane hosts.</dd>
<dt>Overcloud provisioning network</dt>
<dd>Used by the seed host to provision the control plane and virtualised
compute plane hosts.</dd>
<dt>Workload inspection network</dt>
<dd>Used by the control plane hosts to inspect the hardware of the bare metal
compute hosts.</dd>
<dt>Workload provisioning network</dt>
<dd>Used by the control plane hosts to provision the bare metal compute hosts.</dd>
<dt>Workload cleaning network</dt>
<dd>Used by the control plane hosts to clean the bare metal compute hosts after
use.</dd>
<dt>Internal network</dt>
<dd>Used by the control plane for internal communication and access to the
internal and admin OpenStack API endpoints.</dd>
<dt>External network</dt>
<dd>Hosts the public OpenStack API endpoints and provides external network
access for the hosts in the system.</dd>
<dt>Tenant networks</dt>
<dd>Used by tenants for communication between compute instances. Multiple
networks can provide isolation between tenants. These may be overlay
networks such as GRE or VXLAN tunnels but are more commonly VLANs in bare
metal compute environments.</dd>
<dt>Storage network</dt>
<dd>Used by control and compute plane hosts for access to storage systems.</dd>
<dt>Storage management network</dt>
<dd>Used by storage systems for internal communication.</dd>
</dl>
<p>Hey wait, where are you going? Don't worry, not all clusters require all of
these classes of networks, and in general it's possible to map more than one of
these to a single virtual or physical network.</p>
</div>
<div class="section" id="kayobe-kolla-ansible">
<h2>Kayobe & Kolla-ansible</h2>
<p>Kayobe heavily leverages the <a class="reference external" href="https://docs.openstack.org/kolla-ansible/latest/">Kolla-ansible</a> project to deploy a
containerised OpenStack control plane. In general, Kolla-ansible performs very
little direct configuration of the hosts that it manages - most is limited to
Docker containers and volumes. This leads to a very reliable and portable
tool, but does leave wide open the question of how to configure the underlying
hosts to the point where they can run Kolla-ansible's containers. This is
where Kayobe comes in.</p>
</div>
<div class="section" id="host-networking">
<h2>Host networking</h2>
<p>Kolla-ansible takes as input the names of network interfaces that map to the
various classes of network that it differentiates. The following variables
should be set in <tt class="docutils literal">globals.yml</tt>.</p>
<div class="highlight"><pre><span></span><span class="c1"># Internal network</span>
<span class="nt">api_interface</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">br-eth1</span>
<span class="c1"># External network</span>
<span class="nt">kolla_external_vip_interface</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">br-eth1</span>
<span class="c1"># Storage network</span>
<span class="nt">storage_interface</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">eth2</span>
<span class="c1"># Storage management network</span>
<span class="nt">cluster_interface</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">eth2</span>
</pre></div>
<p>In this example we have two physical interfaces, <tt class="docutils literal">eth1</tt> and <tt class="docutils literal">eth2</tt>. A
software bridge <tt class="docutils literal"><span class="pre">br-eth1</span></tt> exists, into which <tt class="docutils literal">eth1</tt> is plugged.
Kolla-ansible expects these interfaces to be up and configured with an IP
address.</p>
<p>Rather than reinventing the wheel, Kayobe makes use of existing Ansible roles
available on <a class="reference external" href="https://galaxy.ansible.com/">Ansible Galaxy</a>. Galaxy can be a
bit of a wild west, with many overlapping and unmaintained roles of dubious
quality. That said, with a little perseverance it's possible to find good
quality roles such as <a class="reference external" href="https://github.com/michaelrigart/ansible-role-interfaces">MichaelRigart.interfaces</a>. Kayobe uses this
role to configure network interfaces, bridges, and IP routes on the control
plane hosts.</p>
<p>Here's an example of a simple Ansible playbook that uses the
<tt class="docutils literal">MichaelRigart.interfaces</tt> role to configure <tt class="docutils literal">eth2</tt> for DHCP, and a bridge
<tt class="docutils literal">breth1</tt> with a static IP address, static IP route and a single port,
<tt class="docutils literal">eth1</tt>.</p>
<div class="highlight"><pre><span></span><span class="p p-Indicator">-</span> <span class="nt">name</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">Ensure network interfaces are configured</span>
<span class="nt">hosts</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">localhost</span>
<span class="nt">become</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">yes</span>
<span class="nt">roles</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="nt">role</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">MichaelRigart.interfaces</span>
<span class="c1"># List of Ethernet interfaces to configure.</span>
<span class="nt">interfaces_ether_interfaces</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="nt">device</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">eth2</span>
<span class="nt">bootproto</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">dhcp</span>
<span class="c1"># List of bridge interfaces to configure.</span>
<span class="nt">interfaces_bridge_interfaces</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="nt">device</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">breth1</span>
<span class="nt">bootproto</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">static</span>
<span class="nt">address</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">192.168.1.150</span>
<span class="nt">netmask</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">255.255.255.0</span>
<span class="nt">gateway</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">192.168.1.1</span>
<span class="nt">mtu</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">1500</span>
<span class="nt">route</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="nt">network</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">192.168.200.0</span>
<span class="nt">netmask</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">255.255.255.0</span>
<span class="nt">gateway</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">192.168.1.1</span>
<span class="nt">ports</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="l l-Scalar l-Scalar-Plain">eth1</span>
</pre></div>
<p>If you are following along at home, ensure that Ansible maintains a stable control
connection to the hosts being configured, and that this connection is not
reconfigured by the <tt class="docutils literal">MichaelRigart.interfaces</tt> role.</p>
<p>Kayobe uses Open vSwitch as the Neutron ML2 mechanism for providing network
services such as DHCP and routing on the control plane hosts. Kolla-ansible
deploys a containerised Open vSwitch daemon, and creates OVS bridges for
Neutron which are attached to existing network interfaces. Kayobe creates
virtual Ethernet pairs using its <a class="reference external" href="https://github.com/stackhpc/kayobe/tree/master/ansible/roles/veth">veth</a> role, and
configures Kolla-ansible to connect the OVS bridges to these.</p>
<p>A quick namecheck of other Galaxy roles used by Kayobe for host network
configuration: <a class="reference external" href="https://galaxy.ansible.com/ahuffman/resolv/">ahuffman.resolv</a>
configures the DNS resolver, and <a class="reference external" href="https://galaxy.ansible.com/resmo/ntp/">resmo.ntp</a> configures the NTP daemon. Thanks go
to the maintainers of these roles.</p>
</div>
<div class="section" id="the-bigger-picture">
<h2>The bigger picture</h2>
<p>The previous examples show how one might configure a set of network interfaces
on a single host, but how can we extend that configuration to cover multiple
hosts in a cluster in a declarative manner, without unnecessary repetition?
Ansible's combination of YAML and Jinja2 templating turns out to be great at
this.</p>
<p>A network in Kayobe is assigned a name, which is used as a prefix for all
variables that describe the network's attributes. Here's the global
configuration for a hypothetical <tt class="docutils literal">example</tt> network that would typically be
added to Kayobe's <tt class="docutils literal">networks.yml</tt> configuration file.</p>
<div class="highlight"><pre><span></span><span class="c1"># Definition of 'example' network.</span>
<span class="nt">example_cidr</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">10.0.0.0/24</span>
<span class="nt">example_gateway</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">10.0.0.1</span>
<span class="nt">example_allocation_pool_start</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">10.0.0.3</span>
<span class="nt">example_allocation_pool_end</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">10.0.0.127</span>
<span class="nt">example_vlan</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">42</span>
<span class="nt">example_mtu</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">1500</span>
<span class="nt">example_routes</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="nt">cidr</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">10.1.0.0/24</span>
<span class="nt">gateway</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">10.0.0.2</span>
</pre></div>
<p>Defining each option as a top level variable allows them to be overridden
individually, if necessary.</p>
<p>We define the network's IP subnet, VLAN, IP routes, MTU, and a pool of IP
addresses for Kayobe to assign to the control plane hosts. Static IP addresses
are allocated automatically using Kayobe's <a class="reference external" href="https://github.com/stackhpc/kayobe/tree/master/ansible/roles/ip-allocation">ip-allocation</a>
role, but may be manually defined by pre-populating <tt class="docutils literal"><span class="pre">network-allocation.yml</span></tt>.</p>
<p>There are also some per-host configuration items that allow us to define how
hosts attach to networks. These would typically be added to a group or host
variable file.</p>
<div class="highlight"><pre><span></span><span class="c1"># Definition of network interface for 'example' network.</span>
<span class="nt">example_interface</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">breth1</span>
<span class="nt">example_bridge_ports</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="l l-Scalar l-Scalar-Plain">eth1</span>
</pre></div>
<p>Kayobe defines classes of networks which can be mapped to the actual networks
that have been configured. In our example, we may want to use the <tt class="docutils literal">example</tt>
network for both internal and external control plane communication. We would
then typically define the following in <tt class="docutils literal">networks.yml</tt>.</p>
<div class="highlight"><pre><span></span><span class="c1"># Map internal network communication to 'example' network.</span>
<span class="nt">internal_net_name</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">example</span>
<span class="c1"># Map external network communication to 'example' network.</span>
<span class="nt">external_net_name</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">example</span>
</pre></div>
<p>These network classes are used to determine to which networks each host should
be attached, and how to configure Kolla-ansible. The default network list may
be extended if necessary by setting <tt class="docutils literal">controller_extra_network_interfaces</tt> in
<tt class="docutils literal">controllers.yml</tt>.</p>
<p>The final piece of this puzzle is a set of <a class="reference external" href="https://github.com/stackhpc/kayobe/blob/master/ansible/filter_plugins/networks.py">custom Jinja2 filters</a>
that allow us to query various attributes of these networks, using the name of
the network.</p>
<div class="highlight"><pre><span></span><span class="c1"># Get the MTU for the 'example' network.</span>
<span class="nt">example_mtu</span><span class="p">:</span> <span class="s">"{{</span><span class="nv"> </span><span class="s">'example'</span><span class="nv"> </span><span class="s">|</span><span class="nv"> </span><span class="s">net_mtu</span><span class="nv"> </span><span class="s">}}"</span>
</pre></div>
<p>We can also query attributes of other hosts.</p>
<div class="highlight"><pre><span></span><span class="c1"># Get the network interface for the 'example' network on host 'controller1'.</span>
<span class="nt">example_interface</span><span class="p">:</span> <span class="s">"{{</span><span class="nv"> </span><span class="s">'example'</span><span class="nv"> </span><span class="s">|</span><span class="nv"> </span><span class="s">net_interface('controller1')</span><span class="nv"> </span><span class="s">}}"</span>
</pre></div>
<p>Finally, we can remove the explicit reference to our site-specific network
name, <tt class="docutils literal">example</tt>.</p>
<div class="highlight"><pre><span></span><span class="c1"># Get the CIDR for the internal network.</span>
<span class="nt">internal_cidr</span><span class="p">:</span> <span class="s">"{{</span><span class="nv"> </span><span class="s">internal_net_name</span><span class="nv"> </span><span class="s">|</span><span class="nv"> </span><span class="s">net_interface</span><span class="nv"> </span><span class="s">}}"</span>
</pre></div>
<p>The decoupling of network definitions from network classes enables Kayobe to be
very flexible in how it configures a cluster. In our experience this is an area
in which TripleO is a little rigid.</p>
<p>Further information on configuration of networks can be found in the <a class="reference external" href="http://kayobe.readthedocs.io/en/latest/configuration.html#network-configuration">Kayobe
documentation</a>.</p>
<p>The <a class="reference external" href="https://github.com/SKA-ScienceDataProcessor/alaska-kayobe-config/tree/alaska-prod/etc/kayobe">Kayobe configuration</a>
for the Square Kilometre Array (SKA) Performance Prototype Platform (P3) system
provides a good example of how this works in a real system. We used the
<a class="reference external" href="https://github.com/skydive-project/skydive">Skydive project</a> to visualise
the network topology within one of the OpenStack controllers in the P3 system.
In the Pike release, Kolla-ansible adds support for deploying Skydive on the
control plane and virtualised compute hosts. We had to make a small change to
Skydive to fix discovery of the relationship between VLAN interfaces and their
parent link, and we'll contribute this upstream. Click the image link to see
it in its full glory.</p>
<a class="reference external image-reference" href="//www.stackhpc.com/images/skydive-p3-controller.png"><img alt="Skydive on P3 controller" src="//www.stackhpc.com/images/skydive-p3-controller.png" style="width: 700px;" /></a>
</div>
<div class="section" id="physical-networking">
<h2>Physical networking</h2>
<p>Hosts are of little use without properly configured network devices to connect
them. Kayobe has the capability to manage the configuration of physical
network switches using Ansible's network modules. Currently Dell OS6 and Dell
OS9 switches are supported, while Juniper switches will soon be added to the
list.</p>
<p>Returning to our example of the SKA P3 system, we note that in HPC clusters it
is common to need to manage multiple physical networks.</p>
<div class="figure">
<a class="reference external image-reference" href="//www.stackhpc.com/images/alaska-networks-stylised.png"><img alt="Physical networks in the P3 deployment" src="//www.stackhpc.com/images/alaska-networks-stylised.png" style="width: 700px;" /></a>
</div>
<p>Each switch is configured as a host in the Ansible inventory, with host
variables used to specify the switch's management IP address and admin user
credentials.</p>
<div class="highlight"><pre><span></span><span class="nt">ansible_host</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">10.0.0.200</span>
<span class="nt">ansible_user</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain"><admin username></span>
<span class="nt">ansible_ssh_password</span><span class="p">:</span> <span class="err">******</span>
</pre></div>
<p>Kayobe doesn't currently provide a lot of abstraction around switch
configuration - it is specified using three per-host variables.</p>
<div class="highlight"><pre><span></span><span class="c1"># Type of switch. One of 'dellos6', 'dellos9'.</span>
<span class="nt">switch_type</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">dellos6</span>
<span class="c1"># Global configuration. List of global configuration lines.</span>
<span class="nt">switch_config</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="s">"ip</span><span class="nv"> </span><span class="s">ssh</span><span class="nv"> </span><span class="s">server"</span>
<span class="p p-Indicator">-</span> <span class="s">"hostname</span><span class="nv"> </span><span class="s">\"{{</span><span class="nv"> </span><span class="s">inventory_hostname</span><span class="nv"> </span><span class="s">}}\""</span>
<span class="c1"># Interface configuration. Dict mapping switch interface names to configuration</span>
<span class="c1"># dicts. Each dict contains a 'description' item and a 'config' item which should</span>
<span class="c1"># contain a list of per-interface configuration.</span>
<span class="nt">switch_interface_config</span><span class="p">:</span>
<span class="nt">Gi1/0/1</span><span class="p">:</span>
<span class="nt">description</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">controller1</span>
<span class="nt">config</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="s">"switchport</span><span class="nv"> </span><span class="s">mode</span><span class="nv"> </span><span class="s">access"</span>
<span class="p p-Indicator">-</span> <span class="s">"switchport</span><span class="nv"> </span><span class="s">access</span><span class="nv"> </span><span class="s">vlan</span><span class="nv"> </span><span class="s">{{</span><span class="nv"> </span><span class="s">internal_net_name</span><span class="nv"> </span><span class="s">|</span><span class="nv"> </span><span class="s">net_vlan</span><span class="nv"> </span><span class="s">}}"</span>
<span class="p p-Indicator">-</span> <span class="s">"lldp</span><span class="nv"> </span><span class="s">transmit-mgmt"</span>
<span class="nt">Gi1/0/2</span><span class="p">:</span>
<span class="nt">description</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">compute1</span>
<span class="nt">config</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="s">"switchport</span><span class="nv"> </span><span class="s">mode</span><span class="nv"> </span><span class="s">access"</span>
<span class="p p-Indicator">-</span> <span class="s">"switchport</span><span class="nv"> </span><span class="s">access</span><span class="nv"> </span><span class="s">vlan</span><span class="nv"> </span><span class="s">{{</span><span class="nv"> </span><span class="s">internal_net_name</span><span class="nv"> </span><span class="s">|</span><span class="nv"> </span><span class="s">net_vlan</span><span class="nv"> </span><span class="s">}}"</span>
<span class="p p-Indicator">-</span> <span class="s">"lldp</span><span class="nv"> </span><span class="s">transmit-mgmt"</span>
</pre></div>
<p>In this example we define the type of the switch as <tt class="docutils literal">dellos6</tt> to instruct
Kayobe to use the <a class="reference external" href="http://docs.ansible.com/ansible/list_of_network_modules.html#dellos6">dellos6</a>
Ansible modules. Note the use of the custom filters seen earlier to limit the
proliferation of the internal VLAN ID throughout the configuration. The
<a class="reference external" href="https://github.com/stackhpc/kayobe/tree/master/ansible/roles/dell-switch">Kayobe dell-switch</a>
role applies the configuration to the switches when the following command is run:</p>
<div class="highlight"><pre><span></span><span class="go">kayobe physical network configure --group <group name></span>
</pre></div>
<p>The group defines the set of switches to be configured.</p>
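<p>For illustration, such a group might be defined in the Ansible inventory along these lines (a sketch using the YAML inventory format; the switch names are invented, while the <tt class="docutils literal"><span class="pre">mgmt-switches</span></tt> group name follows the P3 example linked below):</p>
<div class="highlight"><pre><span></span># Hypothetical inventory fragment grouping two management switches.
mgmt-switches:
  hosts:
    ethsw-1:
    ethsw-2:
</pre></div>
<p>Running <tt class="docutils literal">kayobe physical network configure --group mgmt-switches</tt> would then apply the configuration to both switches.</p>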
<p>Once more, the P3 system's Kayobe configuration provides some good examples.
In particular, check out the configuration for <a class="reference external" href="https://github.com/SKA-ScienceDataProcessor/alaska-kayobe-config/blob/alaska-prod/etc/kayobe/inventory/host_vars/ethsw-b16-u39">one of the management switches</a>,
and the <a class="reference external" href="https://github.com/SKA-ScienceDataProcessor/alaska-kayobe-config/blob/alaska-prod/etc/kayobe/inventory/group_vars/mgmt-switches">associated group variables</a>.</p>
</div>
<div class="section" id="next-time">
<h2>Next Time</h2>
<p>In the next article in this series we'll look at how StackHPC is working
upstream on improvements to networking in the <a class="reference external" href="https://github.com/openstack/ironic">Ironic</a> and <a class="reference external" href="https://github.com/openstack/networking-generic-switch">Neutron Networking Generic Switch
ML2 mechanism driver</a>
projects.</p>
</div>
<div class="section" id="further-reading">
<h2>Further Reading</h2>
<ul class="simple">
<li><a class="reference external" href="https://skatelescope.org/">SKA Telescope</a></li>
<li><a class="reference external" href="http://ska-sdp.org/">SKA Science Data Processor (SDP)</a></li>
<li><a class="reference external" href="https://www.stackhpc.com/openstack-and-hpc-networks.html">OpenStack and HPC Network Fabrics</a></li>
<li><a class="reference external" href="https://kayobe.readthedocs.io/en/latest/">Kayobe</a></li>
<li><a class="reference external" href="https://docs.openstack.org/kolla-ansible/latest/">Kolla-ansible</a></li>
<li><a class="reference external" href="https://www.stackhpc.com/ironic-idrac-ztp.html">Zero touch provisioning of P3</a></li>
</ul>
</div>
Ethernet's future is approaching - fast2017-07-07T17:20:00+01:002017-07-07T17:40:00+01:00Stig Telfertag:www.stackhpc.com,2017-07-07:/ethernets-future-is-approaching-fast.html<p class="first last">StackHPC attended the Mellanox product launch for a new generation of
switches, built around their Spectrum-2 ASIC, and delivering link speeds up to
400Gbits/s.</p>
<p>Mellanox certainly know how to throw a party.</p>
<p>We joined them at the <a class="reference external" href="http://skygarden.london/">Sky Garden</a>, a private venue
atop one of London's most iconic skyscrapers. With a hint of "Bond villain's lair"
about it, this luxurious location made a perfect backdrop to announce some special
networking products.</p>
<div class="figure">
<img alt="The Sky Garden" src="//www.stackhpc.com/images/skygarden-plants.jpg" style="width: 500px;" />
</div>
<p>During the launch, StackHPC's CEO John Taylor discussed the scale
of the data challenges of the <a class="reference external" href="https://skatelescope.org/">Square Kilometre Array</a>
with Mellanox CEO Eyal Waldman.</p>
<div class="figure">
<img alt="John Taylor and Eyal Waldman" src="//www.stackhpc.com/images/skygarden-john.jpg" style="width: 500px;" />
</div>
<p>Spectrum-2 supports ground-breaking Ethernet link speeds: 200Gbits/s
and 400Gbits/s were announced. 400Gbits/s is pushing the envelope
to the point where it doesn't even have a physical cabling standard
ratified yet.</p>
<p>Customers with the most demanding network-intensive workloads may
hope to soon have access to the <a class="reference external" href="http://www.mellanox.com/page/products_dyn?product_family=268&mtag=connectx_6_en_ic">Mellanox ConnectX-6 200Gbits/s NIC</a>.
The next generation NIC also promises enhanced support for Open vSwitch offloads.
There isn't a NIC announced for 400Gbits/s yet...</p>
<div class="section" id="the-future-of-sdn">
<h2>The Future of SDN</h2>
<p>Aside from raw speed, there were also some really interesting features
for SDN. The Spectrum-2 ASIC includes support for the emerging <a class="reference external" href="http://p4.org/">P4
language</a>, embodying the next generation of SDN,
and we will be watching for details of that as they become public.</p>
</div>
<div class="section" id="open-ethernet-supported">
<h2>Open Ethernet supported</h2>
<div class="figure">
<img alt="Open Ethernet" src="//www.stackhpc.com/images/skygarden-oe.jpg" style="width: 500px;" />
</div>
<p>Mellanox CEO Eyal Waldman also affirmed Spectrum-2 support for
<a class="reference external" href="http://www.mellanox.com/open-ethernet/">Open Ethernet</a>, enabling
customers to choose an alternative network OS (Mellanox OS or Cumulus
Linux) - although neither choice would be considered open source.</p>
</div>
Our loss is Norway's gain: StackHPC is recruiting2017-07-07T12:20:00+01:002017-07-07T12:40:00+01:00Stig Telfertag:www.stackhpc.com,2017-07-07:/our-loss-is-norways-gain-stackhpc-is-recruiting.html<p class="first last">We say farewell to Steve Simpson, our highly valued technical
lead on monitoring, and look ahead to replacing him with new blood
for the team.</p>
<p>A new life in Norway finally beckons for Steve and Linn. In his time at
StackHPC, Steve has contributed hugely to our vision for game-changing
HPC infrastructure monitoring.</p>
<p>Now we are looking for fresh talent to take up the mantle and lead our
ongoing efforts. If you're interested in a role with us at StackHPC,
<a class="reference external" href="http://techfolk.co.uk/current-jobs/software-engineer-openstack-hpc-central-bristol-remote-tl216">find more information with our recruiter</a>.</p>
<blockquote>
<div class="figure">
<img alt="StackHPC team in the Castle" src="//www.stackhpc.com/images/stackhpc-in-pub.jpg" style="width: 500px;" />
<p class="caption"><em>Room for you at our usual table in the Castle Inn, Cambridge?</em></p>
</div>
</blockquote>
<p>Good luck Steve and Linn from all of us at StackHPC!</p>
HPC Networking in OpenStack: Part 12017-06-21T15:00:00+01:002017-07-14T18:00:00+01:00Mark Goddardtag:www.stackhpc.com,2017-06-21:/hpc-networking-in-openstack-1.html<p class="first last">Part 1 of our series on HPC networking in OpenStack showcases the
varied networking capabilities of one of our recent OpenStack
deployments, the Performance Prototype Platform (P3) built for the
Square Kilometre Array (SKA) telescope's Science Data Processor
(SDP).</p>
<p>This post is the first in a series on HPC networking in OpenStack. In
the series we'll discuss StackHPC's current and future work on integrating
OpenStack with high performance network technologies. This post sets the
scene and showcases the varied networking capabilities of one
of our recent OpenStack deployments, the Performance Prototype Platform (P3),
built for the <a class="reference external" href="https://skatelescope.org">Square Kilometre Array (SKA)</a>
telescope's <a class="reference external" href="http://ska-sdp.org/">Science Data Processor (SDP)</a>.</p>
<div class="section" id="not-too-distant-cousins">
<h2>(Not Too) Distant Cousins</h2>
<p>There are many similarities between the cloud and HPC worlds, driving
the adoption of OpenStack for scientific computing.
Viewed from a networking perspective however, HPC clusters and modern cloud
infrastructure can seem worlds apart.</p>
<p>OpenStack clouds tend to rely on overlay network technologies such as GRE and
VXLAN tunnels to provide separation between tenants. These are often implemented in
software, running atop a statically configured physical Ethernet fabric.
Conversely, HPC clusters may feature a variety of physical networks,
potentially including technologies such as <a class="reference external" href="https://en.wikipedia.org/wiki/InfiniBand">Infiniband</a> and <a class="reference external" href="https://www.intel.com/content/www/us/en/high-performance-computing-fabrics/omni-path-architecture-fabric-overview.html">Intel Omnipath Architecture</a>.
Low overhead access to these networks is crucial, with applications accessing
the network directly in bare metal environments, or via SR-IOV when running
in virtual machines. Performance may be further enhanced by using NICs with
support for Remote Direct Memory Access (RDMA).</p>
</div>
<div class="section" id="background-the-ska-and-its-sdp">
<h2>Background: the SKA and its SDP</h2>
<p>The SKA is an awe-inspiring project, to which any short description of ours is
unlikely to do justice. Here's what the SKA website has to say:</p>
<blockquote>
The Square Kilometre Array (SKA) project is an international effort to
build the world’s largest radio telescope, with eventually over a square
kilometre (one million square metres) of collecting area. The scale of the
SKA represents a huge leap forward in both engineering and research &
development towards building and delivering a unique instrument, with the
detailed design and preparation now well under way. As one of the largest
scientific endeavours in history, the SKA will bring together a wealth of
the world’s finest scientists, engineers and policy makers to bring the
project to fruition.</blockquote>
<p>The SDP Consortium forms part of the SKA project, aiming to build a
supercomputer-scale computing facility to process and store the data generated
by the SKA telescope. The data ingested by the SDP is expected to exceed the
global Internet traffic per day. Phew!</p>
<div class="figure">
<img alt="Artist's impression of SKA dishes in South Africa" src="//www.stackhpc.com/images/ska-sa-closeup.jpg" style="width: 600px;" />
<p class="caption"><em>The SKA will use around 3000 dishes, each 15 m in diameter. Credit: SKA
Organisation</em></p>
</div>
</div>
<div class="section" id="performance-prototype-platform-a-high-performance-melting-pot">
<h2>Performance Prototype Platform: a High Performance Melting Pot</h2>
<p>The SDP architecture is still being developed, but is expected to incorporate
the concept of a <em>compute island</em>, a scalable unit of compute resources and
associated network connectivity. The SDP workloads will be partitioned and
scheduled across these compute islands.</p>
<p>During its development, a complex project such as the SDP has many variables
and unknowns. For the SDP this includes a variety of workloads and an
assortment of new hardware and software technologies which are becoming
available.</p>
<p>The Performance Prototype Platform (P3) aims to provide a platform that roughly
models a single compute island, and allows SDP engineers to evaluate a number
of different technologies against the anticipated workloads.
P3 provides a variety of interesting compute, storage and network
technologies including GPUs, NVMe memory, SSDs, high speed Ethernet and
Infiniband.</p>
<p>OpenStack offers a compelling solution for managing the diverse infrastructure in the
P3 system, and StackHPC is proud to have built an OpenStack management
plane that allows the SDP team to get the most out of the system. The compute
plane is managed as a bare metal compute resource using <a class="reference external" href="https://docs.openstack.org/developer/ironic/deploy/user-guide.html">Ironic</a>. The
<a class="reference external" href="https://wiki.openstack.org/wiki/Magnum">Magnum</a> and <a class="reference external" href="https://wiki.openstack.org/wiki/Sahara">Sahara</a> services allow the SDP team to
explore workloads based on container and data processing technologies, taking
advantage of the native performance provided by bare metal compute.</p>
</div>
<div class="section" id="how-many-networks">
<h2>How Many Networks?</h2>
<p>The P3 system features multiple physical networks with different
properties:</p>
<ul class="simple">
<li>1GbE out of band management network for BMC management</li>
<li>10GbE control and provisioning network for bare metal provisioning, private
workload communication and external network access</li>
<li>25/100GbE Bulk Data Network (BDN)</li>
<li>100Gbit/s EDR Infiniband Low Latency Network (LLN)</li>
</ul>
<div class="figure">
<img alt="Physical networks in the deployment" src="//www.stackhpc.com/images/alaska-networks-stylised.png" style="width: 700px;" />
</div>
<p>On this physical topology we provision a set of static VLANs for the control
plane and external network access, and dynamic VLANs for use by workloads.
Neutron manages the control/provisioning network switches, but due to current
limitations in ironic it cannot also manage the BDN or LLN, so these are
provided as a shared resource.</p>
<p>The complexity of the networking in the P3 system means that automation
is crucial to making the system manageable. With the help of Ansible's network
modules, the <a class="reference external" href="https://github.com/stackhpc/kayobe">Kayobe</a> deployment tool is
able to configure the physical and virtual networks of the switches and control
plane hosts using a declarative YAML format.</p>
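<p>As a rough sketch of that format, each network is named by a variable, and attributes such as the VLAN and CIDR live in variables derived from that name, which custom filters like <tt class="docutils literal">net_vlan</tt> resolve. The values below are invented, and the exact variable naming is an assumption based on the examples in this series:</p>
<div class="highlight"><pre><span></span># Hypothetical network definition in the Kayobe style. The network's
# name is held in internal_net_name; attribute variables derived from
# that name are then resolved by filters such as net_vlan and net_cidr.
internal_net_name: internal
internal_cidr: 10.0.0.0/24
internal_vlan: 100
</pre></div>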
<p>Ironic's networking capabilities are improving rapidly, adding features such as
multi-tenant network isolation and port groups, but still have a way to go to
reach parity with VMs. In a later post we'll discuss the work being done
upstream in ironic by StackHPC to support <a class="reference external" href="https://specs.openstack.org/openstack/ironic-specs/specs/not-implemented/physical-network-awareness.html">multiple physical networks</a>.</p>
</div>
<div class="section" id="next-time">
<h2>Next Time</h2>
<p>In the <a class="reference external" href="http://www.stackhpc.com/hpc-networking-in-openstack-2.html">next article in this series</a> we'll discuss
how the <a class="reference external" href="https://github.com/stackhpc/kayobe">Kayobe</a> project uses Ansible's
network modules to define physical and virtual network infrastructure as code.</p>
</div>
<div class="section" id="further-reading">
<h2>Further Reading</h2>
<ul class="simple">
<li><a class="reference external" href="https://skatelescope.org/">SKA Telescope</a></li>
<li><a class="reference external" href="http://ska-sdp.org/">SKA Science Data Processor (SDP)</a></li>
<li><a class="reference external" href="https://www.stackhpc.com/openstack-and-hpc-networks.html">OpenStack and HPC Network Fabrics</a></li>
<li><a class="reference external" href="https://github.com/stackhpc/kayobe">Kayobe</a></li>
<li><a class="reference external" href="https://www.stackhpc.com/ironic-idrac-ztp.html">Zero touch provisioning of P3</a></li>
</ul>
</div>
Extending TripleO for the HPC-enabled Overcloud2017-06-06T14:00:00+01:002017-06-06T14:00:00+01:00Stig Telfertag:www.stackhpc.com,2017-06-06:/tripleo-dib-ofed.html<p class="first last">A rainy day activity led to the refactoring of large
swathes of our TripleO-driven deployment process for HPC-enabled
OpenStack infrastructure.</p>
<p>What is the single biggest avoidable toil in deployment of OpenStack?
Hard to choose, but at StackHPC we've recently been looking at one
of our pet grievances, around the way that we've been creating the
images for provisioning an HPC-enabled overcloud.</p>
<p>An HPC-enabled overcloud might differ in various ways, in order to
offer high performance connectivity, or greater efficiency - whether
that be in compute overhead or data movement.</p>
<p>In this specific instance, we are looking at incorporating <a class="reference external" href="https://www.openfabrics.org">Open
Fabrics</a> for binding the network-oriented
data services that our hypervisor is providing to its guests.</p>
<div class="section" id="open-fabrics-on-mellanox-ethernet">
<h2>Open Fabrics on Mellanox Ethernet</h2>
<p>We take the view that <em>CPU cycles spent in the hypervisor are taken
from our clients</em>, and we do what we can to minimise this. We've
had good success in <a class="reference external" href="//www.stackhpc.com/vxlan-ovs-bandwidth.html">demonstrating the advantages</a> of both SR-IOV and RDMA
for trimming the fat from hypervisor data movement.</p>
<p>Remote DMA (RDMA) is supported by integrating packages from Open
Fabrics enterprise distribution (OFED), an alternative networking
stack that bypasses the kernel's TCP/IP stack to deliver data
directly to the processes requesting it. Mellanox produce their
<a class="reference external" href="http://www.mellanox.com/page/products_dyn?product_family=26&mtag=linux_sw_drivers">own version of OFED</a>,
developed and targeted specifically for their NICs.</p>
</div>
<div class="section" id="tripleo-and-the-red-hat-openstack-ecosystem">
<h2>TripleO and the Red Hat OpenStack Ecosystem</h2>
<p>Red Hat's ecosystem is built upon <a class="reference external" href="http://tripleo.org">TripleO</a>,
which uses <a class="reference external" href="https://docs.openstack.org/developer/diskimage-builder/">DiskImage-Builder (DIB)</a> - with
a good deal of extra customisation in the form of DIB elements.</p>
<p>The TripleO project have done a load of good work to integrate the
invocation of DIB into the OpenStack client. The images created
in TripleO's process include the overcloud images used for hypervisors
and controller nodes deployed using TripleO. Conventionally the
same image is used for all overcloud roles, but <a class="reference external" href="//www.stackhpc.com/tripleo-numa-vcpu-pinning.html">as we've shown in
previous articles</a>
we can build distinct images tailored to compute, control, networking
or storage as required.</p>
</div>
<div class="section" id="introducing-the-tar-pit">
<h2>Introducing the Tar Pit</h2>
<p>We'd been following a process of taking the output from the OpenStack
client's <tt class="docutils literal">openstack overcloud image build</tt> command (the overcloud
images are in QCOW2 format at this point) and then using
<tt class="docutils literal"><span class="pre">virt-customize</span></tt> to boot a captive VM in order to apply site-specific
transformations, including the deployment of OFED.</p>
<p>We've <a class="reference external" href="//www.stackhpc.com/building-mellanox-ofed.html">previously covered</a>
the issues around creating Mellanox OFED packages specifically built
for the kernel version embedded in OpenStack overcloud images. The
repo produced is made available on our intranet, and accessed
by the captive VM instantiated by <tt class="docutils literal"><span class="pre">virt-customize</span></tt>.</p>
<p>This admittedly works, but sucks in numerous ways:</p>
<ul class="simple">
<li>It adds a heavyweight extra stage to our deployment process (and one that
requires a good deal of extra software dependencies).</li>
<li>OFED really fattens up the image and this is probably the slowest possible way
in which it could be integrated into the deployment.</li>
<li>It adds significant complexity to scripting an automated ground-up redeployment.</li>
</ul>
</div>
<div class="section" id="the-rainy-day">
<h2>The Rainy Day</h2>
<p>Through our work on Kolla-on-Bifrost <a class="reference external" href="https://github.com/stackhpc/kayobe">(a.k.a Kayobe)</a> we have been building our
own DiskImage-Builder elements. Our deployments for the <a class="reference external" href="http://skatelescope.org">Square
Kilometre Array telescope</a> have had us
looking again at the image building process. A quiet afternoon led
us to put the work in to integrating our own HPC-specific DIB
elements into a single-step process for generating overcloud images.
For TripleO deployments, we now integrate our steps into the
invocation of the TripleO OpenStack CLI, as described in <a class="reference external" href="http://tripleo.org/basic_deployment/basic_deployment_cli.html#get-images">the TripleO
online documentation</a>.</p>
<p>Here's how:</p>
<ul class="simple">
<li>We install our MLNX-OFED repo on an intranet webserver acting
as a package repo as before. In TripleO this can easily be the
undercloud seed node. It's best for future control plane
upgrades if it is a server that is reachable from the
OpenStack overcloud instances when they are active.</li>
<li>We use a git repo of <a class="reference external" href="https://github.com/stackhpc/stackhpc-image-elements">StackHPC's toolbox of DIB elements</a></li>
<li>We define some YAML for adding our element to TripleO's
<tt class="docutils literal"><span class="pre">overcloud-full</span></tt> image build (call this <tt class="docutils literal"><span class="pre">overcloud-images-stackhpc.yaml</span></tt>):</li>
</ul>
<div class="highlight"><pre><span></span><span class="nt">disk_images</span><span class="p">:</span>
<span class="p p-Indicator">-</span>
<span class="nt">imagename</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">overcloud-full</span>
<span class="nt">elements</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="l l-Scalar l-Scalar-Plain">mlnx-ofed</span>
<span class="nt">environment</span><span class="p">:</span>
<span class="c1"># Example: point this to your intranet's unpacked MLNX-OFED repo</span>
<span class="nt">DIB_MLNX_OFED_VERSION</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">4.0-2</span>
<span class="nt">DIB_MLNX_OFED_REPO</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">http://172.16.8.2/repo/MLNX_OFED_LINUX-4.0-2.0.0.1-rhel7.3-x86_64</span>
<span class="nt">DIB_MLNX_OFED_DELETE_REPO</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">n</span>
<span class="nt">DIB_MLNX_OFED_PKGLIST</span><span class="p">:</span> <span class="s">"mlnx-ofed-hypervisor</span><span class="nv"> </span><span class="s">mlnx-fw-updater"</span>
</pre></div>
<ul class="simple">
<li>Define some environment variables. Here we choose to build Ocata stable images.
DiskImage-Builder doesn't extend any existing value assigned for <tt class="docutils literal">ELEMENTS_PATH</tt>,
so we must define all of TripleO's element locations, plus our own:</li>
</ul>
<div class="highlight"><pre><span></span><span class="nb">export</span> <span class="nv">STABLE_RELEASE</span><span class="o">=</span><span class="s2">"ocata"</span>
<span class="nb">export</span> <span class="nv">DIB_YUM_REPO_CONF</span><span class="o">=</span><span class="s2">"/etc/yum.repos.d/delorean*"</span>
<span class="nb">export</span> <span class="nv">ELEMENTS_PATH</span><span class="o">=</span>/home/stack/stackhpc-image-elements/elements:<span class="se">\</span>
/usr/share/tripleo-image-elements:<span class="se">\</span>
/usr/share/instack-undercloud:<span class="se">\</span>
/usr/share/tripleo-puppet-elements
</pre></div>
<ul class="simple">
<li>Invoke the OpenStack client providing configurations - here for a CentOS overcloud image -
plus our <tt class="docutils literal"><span class="pre">overcloud-images-stackhpc.yaml</span></tt> fragment:</li>
</ul>
<div class="highlight"><pre><span></span>openstack overcloud image build <span class="se">\</span>
--config-file /usr/share/openstack-tripleo-common/image-yaml/overcloud-images.yaml<span class="se">\</span>
--config-file /usr/share/openstack-tripleo-common/image-yaml/overcloud-images-centos7.yaml <span class="se">\</span>
--config-file /home/stack/stackhpc-image-elements/overcloud-images-stackhpc.yaml
</pre></div>
<p>All going to plan, the result is an RDMA-enabled overcloud image, done right (or at least, better
than it was before).</p>
<p>Share and enjoy!</p>
</div>
<div class="section" id="further-reading">
<h2>Further Reading</h2>
<ul class="simple">
<li><a class="reference external" href="https://skatelescope.org/">SKA Telescope</a></li>
<li><a class="reference external" href="http://ska-sdp.org/">SKA Science Data Processor (SDP)</a></li>
<li><a class="reference external" href="https://github.com/stackhpc/kayobe">Kayobe</a></li>
</ul>
</div>
StackHPC at ACCU 20172017-05-05T12:20:00+01:002017-05-05T12:40:00+01:00Stig Telfertag:www.stackhpc.com,2017-05-05:/stackhpc-at-accu-2017.html<p class="first last">StackHPC presents on asynchronous C++ for I/O and networking at ACCU 2017</p>
<p><a class="reference external" href="https://conference.accu.org/stories/2017/schedule.html">ACCU 2017</a> was
held in our home city of Bristol, and StackHPC senior technical lead
Steve Simpson took the opportunity to present an overview of asynchronous
I/O and networking - in C++, naturally.</p>
<p>The video of his popular talk is available here:</p>
<div class="youtube"><iframe src="https://www.youtube.com/embed/Z8tbjyZFAVQ" width="750" height="500" allowfullscreen seamless frameBorder="0"></iframe></div>HPCAC Conference End Note: The Case for a Scientific OpenStack2017-04-24T10:20:00+01:002017-04-24T12:40:00+01:00Stig Telfertag:www.stackhpc.com,2017-04-24:/hpcac-conference-end-note-the-case-for-a-scientific-openstack.html<p class="first last">StackHPC attended the HPC Advisory Council conference in
Lugano, Switzerland. Stig was delighted to be able to give the
endnote address, on Scientific OpenStack, to close the conference.</p>
<p>The <a class="reference external" href="http://www.hpcadvisorycouncil.com/index.php">HPC Advisory Council</a>
held its <a class="reference external" href="http://www.hpcadvisorycouncil.com/events/2017/swiss-workshop/agenda.php">2017 workshop in Lugano</a>
in the beautiful region of Ticino, Switzerland.</p>
<p><a class="reference external" href="http://insidehpc.com">InsideHPC</a> has recorded footage and gathered
presentation slides from much of the conference, <a class="reference external" href="http://insidehpc.com/video-gallery-switzerland-hpc-conference-2017/">available here</a>.
Thanks Rich!</p>
<p>I was able to attend this year, and was thoroughly delighted to do
so. The conference content included some stimulating presentations
outlining the future directions of HPC. Three general themes were
strong in much of the content.</p>
<div class="section" id="deep-learning-ai">
<h2>Deep Learning & AI</h2>
<p>It is clear that Deep Learning algorithms are transforming many
areas of scientific computing. With presentations ranging from the
giant corporations (such as IBM) to new and nimble startups (such
as DeepCube), there is huge interest and clear potential for where
Deep Learning techniques can be applied.</p>
</div>
<div class="section" id="gpus-and-accelerators">
<h2>GPUs and Accelerators</h2>
<p>If Deep Learning holds great promise for researchers looking for
solutions to the problems of modern science, it also holds great
promise for the bottom line of companies developing accelerators
such as GPUs and more exotic hardware architectures! The highly
computationally-intensive nature of Deep Learning algorithms was
the subject of several very interesting talks, including
refactoring software with awareness of PCI bus congestion, direct
communication between GPUs and HPC networks, and the new OpenCAPI
initiative for OpenPOWER.</p>
</div>
<div class="section" id="openstack">
<h2>OpenStack</h2>
<p>OpenStack is clearly thriving in this space. <a class="reference external" href="http://nowlab.cse.ohio-state.edu/member/panda/">DK Panda</a> from <a class="reference external" href="http://nowlab.cse.ohio-state.edu/">Ohio State
University</a> presented two
compelling keynotes describing their latest work on enhancing MVAPICH2
MPI for virtualisation and big data software stacks. Francis Lam
from Huawei talked about HPC hardware, citing OpenStack use cases.</p>
<p>Mike Lowe and Dave Hancock from Indiana University presented the
OpenStack journey they took with getting <a class="reference external" href="http://jetstream-cloud.org/technology.php">Jetstream</a> into production. With
Jetstream's deployment Mike took a very hands-on approach, and as a
result they have a system that performs very well for research
computing requirements, but is also well understood by Mike and his
team at a fundamental level.</p>
<p>Saverio Proto from <a class="reference external" href="http://www.switch.ch/services/engines/">SWITCH</a>
described how they are able to integrate cloud instances
into the layer-2 data centre networks of the research faculties they
are supporting, giving seamless scalability without any inconvenience
to the researchers using the platform.</p>
<p>I had the honour of the closing end note address, and I took the
opportunity to lay out our case for using OpenStack for meeting the
needs of modern research computing.</p>
<img alt="Stig talking at HPCAC Lugano 2017" src="//www.stackhpc.com/images/robot_shakespeare.jpg" style="width: 400px;" />
<p>InsideHPC <a class="reference external" href="http://insidehpc.com/2017/04/openstack-research-computing/">covered it here</a>.
Their footage of the presentation is also available on their YouTube
channel:</p>
<div class="youtube"><iframe src="https://www.youtube.com/embed/UNoE6Mj_oew" width="750" height="500" allowfullscreen seamless frameBorder="0"></iframe></div><p>An HPC conference can bring together hot technologies such as
accelerators, deep learning and virtualisation, and appraise them
in the context of HPC's long tradition of maximising the capability
of computation at scale.</p>
<p>An exciting time for StackHPC to be standing at this crossroads!</p>
</div>
Zero-Touch Provisioning using Ironic Inspector and Dell iDRAC2017-03-28T22:00:00+01:002018-11-13T17:00:00+00:00Mark Goddardtag:www.stackhpc.com,2017-03-28:/ironic-idrac-ztp.html<p class="first last">Ironic Inspector's new capabilities unlock the possibility of
using OpenStack for the zero-touch provisioning of new hardware. Here
we investigate these features and demonstrate a way of commissioning
new hardware with minimal operator involvement.</p>
<p>How long does it take a team to bring up new hardware for private
cloud? Despite long days at the data centre, why does it always
seem to take longer than initial expectations? The commissioning
of new hardware is tedious, with much unnecessary operator toil.</p>
<p>The scientific computing sector is already well served by tools for
streamlining this process. The commercial product <a class="reference external" href="https://www.brightcomputing.com/product-offerings/bright-cluster-manager-for-hpc">Bright Cluster
Manager</a>
and open-source project <a class="reference external" href="https://xcat.org/">xCAT</a> (originally from
IBM) are good examples. The OpenStack ecosystem can learn a lot
from the approaches taken by these packages, and some of the gaps
in what OpenStack Ironic can do have been painfully inconvenient
when using projects such as <a class="reference external" href="https://tripleo.org/">TripleO</a> and
<a class="reference external" href="https://github.com/openstack/bifrost">Bifrost</a> at scale.</p>
<p>This post covers how this landscape is changing. Using new
capabilities in OpenStack's Ironic Inspector, and new support for
manipulating network switches using Ansible, we can build a system
that uses Ironic and Ansible together to bring zero-touch provisioning
to OpenStack private clouds.</p>
<div class="section" id="provision-this">
<h2>Provision This...</h2>
<p>Recently we have been working on a performance prototyping platform
for the <a class="reference external" href="https://skatelescope.org/">SKA telescope</a>. In a nutshell,
this project aims to identify promising technologies to pick up and
run with as the SKA development ramps up over the next few years.
The system features a number of compute nodes with more exotic hardware
configurations for the SKA scientists to explore.</p>
<p>This system uses Dell R630 compute nodes, running as bare metal,
managed using OpenStack Ironic, with an OpenStack control plane
deployed using Kolla.</p>
<p>The system has a number of networks that must be managed effectively
by OpenStack, without incurring any performance overhead. All nodes
have rich network connectivity - something which has been a problem
for Ironic, and which we are also <a class="reference external" href="https://review.openstack.org/#/c/435781">working on</a>.</p>
<div class="figure">
<img alt="Physical networks in the deployment" src="//www.stackhpc.com/images/alaska-networks-stylised.png" style="width: 700px;" />
</div>
<ul class="simple">
<li><strong>Power Management</strong>. Ironic requires access to the compute server
baseboard management controllers (BMCs). This enables Ironic to
power nodes on and off, access serial consoles and reconfigure BIOS
and RAID settings.</li>
<li><strong>Provisioning and Control</strong>. When a bare metal compute node is being
provisioned, Ironic uses this network interface to network-boot the
compute node, and transfer the instance software image to the compute
node's disk. When a compute node has been deployed and is active, this
network is configured as the primary network for external access
to the instance.</li>
<li><strong>High Speed Ethernet</strong>. This network will be used for modelling the
high-bandwidth data feeds being delivered from the telescope's
Central Signal Processor (CSP). Some Ethernet-centric storage
technologies will also use this network.</li>
<li><strong>High Speed Infiniband</strong>. This network will be reserved for
low-latency, high-bandwidth messaging, either between tightly-coupled
compute or compute that is tightly coupled to storage.</li>
</ul>
</div>
<div class="section" id="automagic-provisioning-using-xcat">
<h2>Automagic Provisioning Using xCAT</h2>
<p>Before we dive into the OpenStack details, let's make a quick detour
with an overview of how xCAT performs what it calls "Automagic
provisioning".</p>
<ul class="simple">
<li>This technique only works if your hardware attempts a PXE boot
in its factory default configuration. If it doesn't, well that's
unfortunate!</li>
<li>We start with the server hardware racked and cabled up to the
provisioning network switches. The servers don't need configuring
- that is automatically done later.</li>
<li>The provisioning network must be configured with management access
and SNMP read access enabled. The required VLAN state must be
configured on the server access ports. The VLAN has to be isolated
for the exclusive use of xCAT provisioning.</li>
<li>xCAT is configured with addresses and credentials for SNMP access to
the provisioning network switches. A physical network topology
must also be defined in xCAT, which associates switches and ports with
connected servers. At this point, this is all xCAT knows about a
server: that it is an object attached to a given network port.</li>
<li>The server is powered on (OK, this is manual; perhaps "zero-touch"
is an exaggeration...), and performs a PXE boot. For a DHCP request
from an unidentified MAC, xCAT will provide a generic introspection
image for PXE-boot.</li>
<li>xCAT uses SNMP to trace the request to a switch and network port.
If this network port is in xCAT's database, the server object associated
with the port is populated with introspection details (such as the MAC
address).</li>
<li>At this stage, the server awaits further instructions. Commissioning
new hardware may involve firmware upgrades, in-band BIOS configuration,
BMC credentials, etc. These are performed using site-specific
operations at this point.</li>
</ul>
<p>We have had some experience of various approaches - Bright
Cluster Manager, xCAT, TripleO and OpenStack Ironic. This gives
us a pretty good idea of what is possible, and the benefits and
weaknesses of each. As an example, the xCAT flow offers many
advantages over a manual approach to hardware commissioning - once
it is set up. Some cloud-centric infrastructure management techniques
can be applied to simplify that process.</p>
</div>
<div class="section" id="an-ironic-inspector-calls">
<h2>An Ironic Inspector Calls</h2>
<p>Here's how we put together a system, built around Ironic, for
streamlined infrastructure commissioning using OpenStack tools.
We've collected together our Ansible playbooks and supporting scripts
as part of our new Kolla-based OpenStack deployment project, <a class="reference external" href="https://github.com/stackhpc/kayobe">Kayobe</a>.</p>
<p>One principal difference with the xCAT workflow is that Ironic's Inspector
does not make modifications to server state, and by default does not keep
the introspection ramdisk active after introspection or enable an SSH
login environment. This prevents us from using xCAT's technique of
invoking custom commands to perform site-specific commissioning actions.
We'll cover how those actions are achieved below.</p>
<ul class="simple">
<li>We use Ansible <a class="reference external" href="https://docs.ansible.com/ansible/list_of_network_modules.html">network modules</a>
to configure the management switches for the Provisioning and Control
network. In this case, they are <a class="reference external" href="https://i.dell.com/sites/doccontent/shared-content/data-sheets/en/Documents/Dell-Networking-S-Series-S6010-ON-Spec-Sheet.pdf">Dell S6010-ON</a>
network-booting switches, and we have configured Ironic Inspector's
dnsmasq server to boot them with Dell/Force10 OS9. We use the <a class="reference external" href="https://docs.ansible.com/ansible/list_of_network_modules.html#dellos9">Dellos9
Ansible module</a>.</li>
<li>Using tabulated YAML data mapping switch ports to compute hosts, Ansible
configures the switches with port descriptions and membership of the
provisioning VLAN for all the compute node access ports. Some other
basic configuration is applied to set the switches up for operation, such
as enabling LLDP and configuring trunk links.</li>
<li>Ironic Inspector's dnsmasq service is configured to boot unrecognised MAC
addresses (for servers not in the Ironic inventory) to perform
introspection on those servers.</li>
<li>The LLDP datagrams from the switch are received during introspection,
including the switch port description label we assigned using Ansible.</li>
<li>The introspection data gathered from those nodes is used to populate the
Ironic inventory to register the new node using Inspector's <a class="reference external" href="https://docs.openstack.org/developer/ironic-inspector/usage.html#discovery">discovery
capabilities</a>.</li>
<li>With Ironic Inspector's <a class="reference external" href="https://docs.openstack.org/developer/ironic-inspector/usage.html#introspection-rules">rule-based transformations</a>,
we can populate the server's state in Ironic with BMC credentials,
deployment image IDs and other site-specific information. We name the
nodes using a rule that extracts the switch port description received via
LLDP.</li>
</ul>
<p>So far, so good, but there's a catch...</p>
<div class="section" id="dell-os9-and-lldp">
<h3>Dell OS9 and LLDP</h3>
<p>Early development was done on other switches, including simpler
beasts running Dell Network OS6. It appears that Dell Network OS9 does not
yet support some simple-but-critical features we had taken for granted
for the cross-referencing of switch ports with servers. Specifically,
Dell Network OS9 does not support transmitting LLDP's port description TLV
that we were assigning using Ansible network modules.</p>
<p>To work around this we decided to fall back to the same method
used in xCAT: we match using switch address and port ID. To do this,
we create Inspector rules for matching each port and performing the
appropriate assignment. And with that, the show rolls on.</p>
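<p>A sketch of such a rule follows. This assumes that Inspector's LLDP processing makes the port ID available under <tt class="docutils literal">lldp_processed</tt> in the introspection data for each interface; the interface, port and node names here are invented for illustration:</p>
<div class="highlight"><pre><span></span># Hypothetical rule naming the node attached to a known switch port.
- description: "Name the node attached to port Te1/1/1"
  conditions:
    - field: "all_interfaces.eth0.lldp_processed.switch_port_id"
      op: "eq"
      value: "Te1/1/1"
  actions:
    - action: "set-attribute"
      path: "name"
      value: "compute-1"
</pre></div>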
<p><strong>Update</strong>: Since the publication of this article, newer versions of Dell
Networking OS9 have gained the capability of advertising port descriptions,
making this workaround unnecessary. For S6010-ON switches, this is available
since version 9.11(2.0P1) using the <tt class="docutils literal">advertise <span class="pre">interface-port-desc</span>
description</tt> LLDP configuration.</p>
</div>
<div class="section" id="ironic-s-catch-22">
<h3>Ironic's Catch-22</h3>
<p>Dell's defaults for the iDRAC BMC assign a default IP address and
credentials. All BMC ports on the Power Management network begin with the
same IP. IPMI is initially disabled, allowing access only through the
WSMAN protocol used by the idracadm client. In order for Ironic to manage
these nodes, their BMCs each need a unique IP address on the Power
Management subnet.</p>
<p>Ironic Inspection is designed to be a read-only process. While the IPA
ramdisk can discover the IP address of a server's BMC, there is currently
no mechanism for setting the IP address of a newly discovered node's BMC.</p>
<p>Our solution involves more use of the Ansible network modules.</p>
<ul class="simple">
<li>Before inspection takes place we traverse our port-mapping YAML tables,
putting the network port of each new server's BMC in turn into a
dedicated commissioning VLAN.</li>
<li>Within the commissioning VLAN, the default IP address can be
addressed in isolation. We connect to the BMC via idracadm, assign it
the required IP address, and enable IPMI.</li>
<li>The network port for this BMC is reverted to the Power Management
VLAN.</li>
</ul>
<p>At this point the nodes are ready to be inspected. The BMCs' new IP
addresses will be discovered by Inspector and used to populate the nodes'
driver info fields in Ironic.</p>
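<p>The VLAN flip in the first step might be expressed as a task along these lines (a sketch assuming the <tt class="docutils literal">dellos9_config</tt> module and Dell OS9 CLI syntax; the VLAN ID and port name are invented):</p>
<div class="highlight"><pre><span></span># Hypothetical task moving a single BMC port into the commissioning VLAN.
- name: Move the BMC port into the commissioning VLAN
  dellos9_config:
    parents:
      - "interface vlan 99"
    lines:
      - "untagged tengigabitethernet 1/10"
</pre></div>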
</div>
<div class="section" id="automate-all-the-switches">
<h3>Automate all the Switches</h3>
<p>The configuration of a network, when applied manually, can quickly become
complex and poorly understood. Even simple CLI-based automation like that
used by the <tt class="docutils literal">dellos</tt> Ansible network modules can help to grow confidence
in making changes to a system, without the complexity of an SDN controller.</p>
<p>Some modern switches such as the Dell S6010-ON support network booting an
Operating System image. Kayobe's <a class="reference external" href="https://github.com/stackhpc/kayobe/tree/5c1d05bdfa7d1920249080febe2b4cc03b4d7026/ansible/roles/dell-switch-bmp">dell-switch-bmp role</a>
configures a network boot environment for these switches in a Kolla-ansible
managed Bifrost container.</p>
<p>Once booted, these switches need to be configured. We developed the simple
<a class="reference external" href="https://github.com/stackhpc/kayobe/tree/5c1d05bdfa7d1920249080febe2b4cc03b4d7026/ansible/roles/dell-switch">dell-switch role</a>
to configure the required global and per-interface options.</p>
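<p>Internally, a role of this kind can loop over the interface map with a task of roughly the following shape (a sketch assuming the <tt class="docutils literal">dellos9_config</tt> module, not the exact implementation of the role):</p>
<div class="highlight"><pre><span></span># Sketch: apply each interface's configuration lines within its
# interface context on the switch.
- name: Ensure switch interface configuration is applied
  dellos9_config:
    parents:
      - "interface {{ item.key }}"
    lines: "{{ item.value.config }}"
  with_dict: "{{ switch_interface_config }}"
</pre></div>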
<p>Switch configuration is codified as Ansible host variables (<tt class="docutils literal">host_vars</tt>)
for each switch. The following is an excerpt from one of our switch's host
variables files:</p>
<div class="highlight"><pre><span></span><span class="c1"># Host/IP on which to access the switch via SSH.</span>
<span class="nt">ansible_host</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain"><switch IP></span>
<span class="c1"># Interface configuration.</span>
<span class="nt">switch_interface_config</span><span class="p">:</span>
<span class="nt">Te1/1/1</span><span class="p">:</span>
<span class="nt">description</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">compute-1</span>
<span class="nt">config</span><span class="p">:</span> <span class="s">"{{</span><span class="nv"> </span><span class="s">switch_interface_config_all</span><span class="nv"> </span><span class="s">}}"</span>
<span class="nt">Te1/1/2</span><span class="p">:</span>
<span class="nt">description</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">compute-2</span>
<span class="nt">config</span><span class="p">:</span> <span class="s">"{{</span><span class="nv"> </span><span class="s">switch_interface_config_all</span><span class="nv"> </span><span class="s">}}"</span>
</pre></div>
<p>As described previously, the interface description provides the necessary
mapping from interface name to compute host. We reference the
<tt class="docutils literal">switch_interface_config_all</tt> variable which is kept in an Ansible group
variables (<tt class="docutils literal">group_vars</tt>) file to keep things <a class="reference external" href="https://en.wikipedia.org/wiki/Don't_repeat_yourself">DRY</a>. The following
snippet is taken from such a file:</p>
<div class="highlight"><pre><span></span><span class="c1"># User to access the switch via SSH.</span>
<span class="nt">ansible_user</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain"><username></span>
<span class="c1"># Password to access the switch via SSH.</span>
<span class="nt">ansible_ssh_pass</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain"><password></span>
<span class="c1"># Interface configuration for interfaces with controllers or compute</span>
<span class="c1"># nodes attached.</span>
<span class="nt">switch_interface_config_all</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="s">"no</span><span class="nv"> </span><span class="s">shutdown"</span>
<span class="p p-Indicator">-</span> <span class="s">"switchport"</span>
<span class="p p-Indicator">-</span> <span class="s">"protocol</span><span class="nv"> </span><span class="s">lldp"</span>
<span class="p p-Indicator">-</span> <span class="s">"</span><span class="nv"> </span><span class="s">advertise</span><span class="nv"> </span><span class="s">dot3-tlv</span><span class="nv"> </span><span class="s">max-frame-size"</span>
<span class="p p-Indicator">-</span> <span class="s">"</span><span class="nv"> </span><span class="s">advertise</span><span class="nv"> </span><span class="s">management-tlv</span><span class="nv"> </span><span class="s">management-address</span><span class="nv"> </span><span class="s">system-description</span><span class="nv"> </span><span class="s">system-name"</span>
<span class="p p-Indicator">-</span> <span class="s">"</span><span class="nv"> </span><span class="s">advertise</span><span class="nv"> </span><span class="s">interface-port-desc"</span>
<span class="p p-Indicator">-</span> <span class="s">"</span><span class="nv"> </span><span class="s">no</span><span class="nv"> </span><span class="s">disable"</span>
<span class="p p-Indicator">-</span> <span class="s">"</span><span class="nv"> </span><span class="s">exit"</span>
</pre></div>
<p>Interfaces attached to compute hosts are enabled as switchports and have
several LLDP TLVs enabled to support inspection.</p>
<p>We wrap this up in a <a class="reference external" href="https://github.com/stackhpc/kayobe/blob/5c1d05bdfa7d1920249080febe2b4cc03b4d7026/ansible/physical-network.yml">playbook</a>
and make it user-friendly through our <a class="reference external" href="https://github.com/stackhpc/kayobe/blob/5c1d05bdfa7d1920249080febe2b4cc03b4d7026/kayobe/cli/commands.py#L130">CLI</a>
as the command <tt class="docutils literal">kayobe physical network configure</tt>.</p>
</div>
<div class="section" id="idrac-commissioning">
<h3>iDRAC Commissioning</h3>
<p>The <a class="reference external" href="https://github.com/stackhpc/kayobe/blob/5c1d05bdfa7d1920249080febe2b4cc03b4d7026/ansible/idrac-bootstrap.yml">idrac-bootstrap.yml playbook</a>
used to commission the box-fresh iDRACs required some relatively complex
task sequencing across multiple hosts using multiple plays and roles.</p>
<p>A key piece of the puzzle involves the use of an Ansible <a class="reference external" href="https://github.com/stackhpc/kayobe/blob/5c1d05bdfa7d1920249080febe2b4cc03b4d7026/ansible/idrac-bootstrap-one.yml">task file</a>
included multiple times using a <tt class="docutils literal">with_dict</tt> loop, in a play targeted at
the switches using <tt class="docutils literal">serial: 1</tt>. This allows us to execute a set of
tasks for each BMC in turn. A simplified example of this is shown in the
playbook below:</p>
<div class="highlight"><pre><span></span><span class="p p-Indicator">-</span> <span class="nt">name</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">Execute multiple tasks for each interface on each switch serially</span>
<span class="nt">hosts</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">switches</span>
<span class="nt">serial</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">1</span>
<span class="nt">tasks</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="nt">name</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">Execute multiple tasks for an interface</span>
<span class="nt">include</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">task-file.yml</span>
<span class="nt">with_dict</span><span class="p">:</span> <span class="s">"{{</span><span class="nv"> </span><span class="s">switch_interface_config</span><span class="nv"> </span><span class="s">}}"</span>
</pre></div>
<p>Here we reference the <tt class="docutils literal">switch_interface_config</tt> variable seen previously.
<tt class="docutils literal"><span class="pre">task-file.yml</span></tt> might look something like this:</p>
<div class="highlight"><pre><span></span><span class="p p-Indicator">-</span> <span class="nt">name</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">Display the name of the interface</span>
<span class="nt">debug</span><span class="p">:</span>
<span class="nt">var</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">item.key</span>
<span class="p p-Indicator">-</span> <span class="nt">name</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">Display the description of the interface</span>
<span class="nt">debug</span><span class="p">:</span>
<span class="nt">var</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">item.value.description</span>
</pre></div>
<p>This commissioning technique is clearly not perfect, having an execution
time that scales linearly with the number of servers being commissioned.
That said, it automated a labour-intensive manual task on the critical path
of our deployment, and the time taken remains modest - about 20 seconds
per node.</p>
<p>We think there is room for a solution that is more integrated with Ironic
Inspector and would like to return to the problem before our next
deployment.</p>
</div>
<div class="section" id="introspection-rules">
<h3>Introspection Rules</h3>
<p>Ironic Inspector's introspection rules API provides a flexible mechanism
for processing the data returned from the introspection ramdisk that does
not require any server-side code changes.</p>
<p>There is currently no upstream Ansible module for automating the creation
of these rules. We developed the <a class="reference external" href="https://github.com/stackhpc/kayobe/tree/d4df9cc7b816421700a6d5d0677d72275bff57fa/ansible/roles/ironic-inspector-rules">ironic-inspector-rules role</a>
to fill the gap and continue boldly into the land of
<a class="reference external" href="https://en.wikipedia.org/wiki/Infrastructure_as_Code">Infrastructure-as-code</a>. At the core of
this role is the <a class="reference external" href="https://github.com/stackhpc/kayobe/blob/d4df9cc7b816421700a6d5d0677d72275bff57fa/ansible/roles/ironic-inspector-rules/library/os_ironic_inspector_rule.py">os_ironic_inspector_rule module</a>
which follows the patterns of the upstream <tt class="docutils literal">os_*</tt> modules and provides us
with an Ansible-compatible interface to the introspection rules API. The
role ensures required python dependencies are installed and allows
configuration of multiple rules.</p>
<p>With this role in place, we can define our required introspection rules as
<a class="reference external" href="https://github.com/stackhpc/kayobe/blob/08b83abc22cd3a54a7d1ca2cafb2e79ba41a54ca/ansible/group_vars/all/inspector#L55">Ansible variables</a>.
For example, here is a rule definition used to update the BMC credentials
of a newly discovered node:</p>
<div class="highlight"><pre><span></span><span class="c1"># Ironic inspector rule to set IPMI credentials.</span>
<span class="nt">inspector_rule_ipmi_credentials</span><span class="p">:</span>
<span class="nt">description</span><span class="p">:</span> <span class="s">"Set</span><span class="nv"> </span><span class="s">IPMI</span><span class="nv"> </span><span class="s">driver_info</span><span class="nv"> </span><span class="s">if</span><span class="nv"> </span><span class="s">no</span><span class="nv"> </span><span class="s">credentials"</span>
<span class="nt">conditions</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="nt">field</span><span class="p">:</span> <span class="s">"node://driver_info.ipmi_username"</span>
<span class="nt">op</span><span class="p">:</span> <span class="s">"is-empty"</span>
<span class="p p-Indicator">-</span> <span class="nt">field</span><span class="p">:</span> <span class="s">"node://driver_info.ipmi_password"</span>
<span class="nt">op</span><span class="p">:</span> <span class="s">"is-empty"</span>
<span class="nt">actions</span><span class="p">:</span>
<span class="p p-Indicator">-</span> <span class="nt">action</span><span class="p">:</span> <span class="s">"set-attribute"</span>
<span class="nt">path</span><span class="p">:</span> <span class="s">"driver_info/ipmi_username"</span>
<span class="nt">value</span><span class="p">:</span> <span class="s">"{{</span><span class="nv"> </span><span class="s">inspector_rule_var_ipmi_username</span><span class="nv"> </span><span class="s">}}"</span>
<span class="p p-Indicator">-</span> <span class="nt">action</span><span class="p">:</span> <span class="s">"set-attribute"</span>
<span class="nt">path</span><span class="p">:</span> <span class="s">"driver_info/ipmi_password"</span>
<span class="nt">value</span><span class="p">:</span> <span class="s">"{{</span><span class="nv"> </span><span class="s">inspector_rule_var_ipmi_password</span><span class="nv"> </span><span class="s">}}"</span>
</pre></div>
<p>By adding a layer of indirection to the credentials, we can provide a rule
template that is reusable between different systems. The username and
password are then configured separately:</p>
<div class="highlight"><pre><span></span><span class="c1"># IPMI username referenced by inspector rule.</span>
<span class="nt">inspector_rule_var_ipmi_username</span><span class="p">:</span>
<span class="c1"># IPMI password referenced by inspector rule.</span>
<span class="nt">inspector_rule_var_ipmi_password</span><span class="p">:</span>
</pre></div>
<p>This pattern is commonly used in Ansible as it allows granular
customisation without the need to redefine the entirety of a complex
variable.</p>
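<p>A site-specific variables file might then fill in the credentials like so (the values are illustrative, and <tt class="docutils literal">vault_ipmi_password</tt> is a hypothetical secret kept in Ansible Vault):</p>
<div class="highlight"><pre><span></span># Illustrative site-specific overrides for the rule variables above.
inspector_rule_var_ipmi_username: admin
inspector_rule_var_ipmi_password: "{{ vault_ipmi_password }}"
</pre></div>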
</div>
</div>
<div class="section" id="share-and-enjoy">
<h2>Share and Enjoy</h2>
<p>Bringing it all together, our deployment uses Ironic, Ansible and
friends to boot and configure Dell network switches, and then in turn
boot, commission and configure Dell compute servers.</p>
<p>This deployment demonstrates using OpenStack to deliver zero-touch
provisioning. Everything about the deployment infrastructure is
defined in code. What's more, by using OpenStack our zero-touch
provisioning will develop at the pace of cloud technology. We
believe this project will rapidly surpass what is possible using
conventional cluster management techniques.</p>
<div class="figure">
<img alt="Alaska control and compute nodes" src="//www.stackhpc.com/images/alaska-compute.jpg" style="width: 250px;" />
</div>
<p>Everything described here is available as part of <a class="reference external" href="https://github.com/stackhpc/kayobe">Kayobe</a>, our new open source project for
automating inventory management within a Bifrost and Kolla environment.
Where any of the roles provided by Kayobe may have wider appeal, we will
consider making them available on Ansible Galaxy for others to use.</p>
<p>We have grand ambitions for Kayobe, and we hope to be speaking more
about the capabilities the project is developing in due course.</p>
</div>
<div class="section" id="acknowledgements">
<h2>Acknowledgements</h2>
<ul class="simple">
<li>With thanks to Matt Raso-Barnett from Cambridge University for
a detailed overview of how xCAT zero-touch provisioning was used
for deploying the compute resource for the <a class="reference external" href="http://casu.ast.cam.ac.uk/surveys-projects/gaia">Gaia project</a>.</li>
</ul>
</div>
<div class="section" id="further-reading">
<h2>Further Reading</h2>
<ul class="simple">
<li><a class="reference external" href="https://skatelescope.org/">SKA Telescope</a></li>
<li><a class="reference external" href="http://ska-sdp.org/">SKA Science Data Processor (SDP)</a></li>
<li><a class="reference external" href="https://docs.openstack.org/developer/ironic-inspector/usage.html#introspection-rules">Ironic Inspector introspection rules</a></li>
<li><a class="reference external" href="https://github.com/stackhpc/kayobe">Kayobe</a></li>
</ul>
</div>
StackHPC at PGDay Paris 20172017-03-27T14:00:00+01:002017-03-27T14:40:00+01:00Stig Telfertag:www.stackhpc.com,2017-03-27:/stackhpc-at-pgday-paris-2017.html<p class="first last">Extending Monasca to use Postgres for OpenStack monitoring and logging</p>
<p>StackHPC participated in <a class="reference external" href="http://2017.pgday.paris/">PG Day Paris</a>
last week, to share our vision of using Postgres to improve OpenStack
monitoring, logging and telemetry.</p>
<p><a class="reference external" href="https://www.postgresql.eu/events/schedule/pgdayparis2017/session/1517-infrastructure-monitoring-with-postgres/">Steve's talk</a>
takes a Monasca deployment and presents ways to simplify the
monitoring infrastructure to create something more manageable for
small deployments. The inherent flexibility of Postgres makes it
a good fit for a diverse range of roles - including configuration,
time-series telemetry metrics and JSON-formatted semi-structured
log data. Given native support across the board, the result is a
far simpler deployment yet without sacrificing performance.</p>
<p>Looking ahead from here, Steve's work on delivering an implementation
for Monasca's <a class="reference external" href="//www.stackhpc.com/monasca-log-api.html">multi-tenant log API</a> will bring new
services to our OpenStack ecosystem, and new convenience for our
users.</p>
Logging Services for Guest Workloads: A Step Closer2017-03-14T12:30:00+00:002017-03-14T12:30:00+00:00Steve Simpsontag:www.stackhpc.com,2017-03-14:/monasca-log-api.html<p class="first last">Monasca's multi-tenant capabilities are being extended
to support multi-tenant logging. Here's our take on how this new
service will add value for doing research computing on OpenStack.</p>
<p>How can we make a workload easier on cloud? In a previous article
we <a class="reference external" href="//www.stackhpc.com/openstack-and-hpc-workloads.html">presented the lay of the land</a> for HPC
workload management in an OpenStack environment. A substantial
part of the work done to date focuses on automating the creation
of a software-defined workload management environment -
<em>SLURM-as-a-Service</em>. The projects that look at enriching the
environment available to workload management services once they are
up and running in the cloud appear to be less common.</p>
<p>One example that came along last week was the merge upstream of a
new spec for <a class="reference external" href="https://review.openstack.org/#/c/433016/">multi-tenant log retrieval in Monasca</a>. This proposal was made
and seen through by StackHPC's Steve Simpson.</p>
<div class="section" id="monasca-and-multi-tenant-monitoring">
<h2>Monasca and Multi-Tenant Monitoring</h2>
<p>Monasca monitors OpenStack, but it goes further than that.</p>
<p>From its inception, Monasca has been designed with the distinction
of supporting multi-tenant telemetry. Any tenant host, service or
workload can submit telemetry data to a Monasca API endpoint, and
have it collected and salted away. Later, the user can log in to
a dashboard (Grafana in many cases), and interactively explore the
telemetry data that they collected about the operation of their
instances.</p>
<p><em>Can your tenants do that?</em></p>
<p>The intention is that complex services like telemetry and monitoring
are provided as a service, without requiring the users to create and
deploy their own.</p>
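<p>As an illustrative sketch (the metric name, value and dimensions here are
hypothetical), a tenant might submit a measurement and read it back using the
<tt class="docutils literal">monasca</tt> CLI:</p>
<pre class="literal-block">
# post a measurement to the Monasca API as the current tenant
$ monasca metric-create cpu.idle_perc 72.0 --dimensions hostname=vm-1
# retrieve measurements for that metric since a given start time
$ monasca measurement-list cpu.idle_perc 2017-03-14T00:00:00Z --dimensions hostname=vm-1
</pre>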
</div>
<div class="section" id="adding-logging-to-the-mix">
<h2>Adding Logging to the Mix</h2>
<p>Time-series telemetry is certainly useful, but is only one part of a
comprehensive solution. We also want to gather data on events that
occur, and logs of activity from the services and operating systems
that underpin our research computing platforms.</p>
<p>The Monasca project (led by the team from Fujitsu) have been working
on logging support for a little while. They first presented their
work at the Tokyo summit:</p>
<div class="youtube"><iframe src="https://www.youtube.com/embed/ghZ5gnySlWo" width="750" height="500" allowfullscreen seamless frameBorder="0"></iframe></div><p>Logging for system and OpenStack services has been up and running
in Monasca for a few releases.</p>
<p>What has been missing (until now) has been a way of providing multi-tenant
access to log retrieval.</p>
</div>
<div class="section" id="reducing-the-time-to-science">
<h2>Reducing the Time to Science</h2>
<p>It's clear our users have work to do, and our OpenStack projects
exist to support that.</p>
<p>Using Monasca, we can already present log data inline with telemetry
data for system administration use cases. For example, here's log
and telemetry data collected from monitoring RabbitMQ services,
drawn from Monasca and presented together on a Grafana dashboard:</p>
<div class="figure">
<img alt="A Grafana dashboard displaying telemetry and log messages for RabbitMQ" src="//www.stackhpc.com/images/rabbitmq-dashboard-with-logs.png" style="width: 600px;" />
</div>
<p>Once the new multi-tenant logging API is implemented, we'll be providing our
users with the same services for telemetry and logging of their own infrastructure,
platforms and workloads.</p>
</div>
<div class="section" id="further-reading">
<h2>Further Reading</h2>
<ul class="simple">
<li><a class="reference external" href="https://wiki.openstack.org/wiki/Monasca/Logging">Monasca Logging Wiki</a></li>
<li><a class="reference external" href="http://www.stackhpc.com/openstack-and-hpc-workloads.html">OpenStack and HPC Workload Management, StackHPC blog</a></li>
<li><a class="reference external" href="https://www.openstack.org/science/">OpenStack for Scientific Research</a></li>
</ul>
</div>
StackHPC at the Sanger Centre OpenStack Day2017-03-10T12:20:00+00:002017-03-10T12:40:00+00:00Stig Telfertag:www.stackhpc.com,2017-03-10:/stackhpc-at-the-sanger-centre-openstack-day.html<p class="first last">StackHPC presents plans for using OpenStack to support SKA prototyping infrastructure.</p>
<p>In the countryside on the outskirts of Cambridge, a very special gathering
took place. At the invitation of the OpenStack team at the
<a class="reference external" href="http://www.sanger.ac.uk/">Wellcome Trust Sanger Institute</a>,
the regional Scientific OpenStack community got together for a day of
presentations and discussion.</p>
<p>The Sanger Institute put on a great event, and a good deal of
birds-of-a-feather discussion was stimulated.</p>
<blockquote>
<img alt="Stig presenting ALaSKA" src="//www.stackhpc.com/images/sanger-stig-presenting-alaska.jpg" style="width: 700px;" />
</blockquote>
<p>As part of a fascinating schedule including presentations from Sanger,
the Francis Crick Institute, the European Bioinformatics Institute,
RackSpace, Public Health England and Cambridge University, Stig
presented StackHPC's recent work for the <a class="reference external" href="http://skatelescope.org/">SKA telescope</a> project.</p>
<p>This project is the Science Data Processor (SDP) Performance
Prototype. The project is a technology exploration vehicle for
evaluating various strategies for the extreme data challenges posed
by the SKA.</p>
<blockquote>
<img alt="ALaSKA compute rack" src="//www.stackhpc.com/images/alaska-compute.jpg" style="width: 484px;" />
</blockquote>
<p>OpenStack is delivering multi-tenant access to a rich and diverse
range of bare metal hardware, and cloud-native methodologies are
being used to deliver an equally broad range of software stacks.
We call it Alaska (A La SKA).</p>
<p>Alaska is a really exciting project that embodies our ethos of
driving OpenStack development for the scientific use case.
StackHPC is thrilled to be delivering the infrastructure to support it.</p>
StackHPC at FOSDEM/PGDay 20172017-02-21T12:20:00+00:002017-02-21T12:40:00+00:00Stig Telfertag:www.stackhpc.com,2017-02-21:/stackhpc-at-fosdempgday-2017.html<p class="first last">StackHPC presents designs for extending Monasca with support for Postgres</p>
<p><a class="reference external" href="https://2017.fosdempgday.org/">PG Day</a> covered all things Postgres
at <a class="reference external" href="https://fosdem.org/2017/">FOSDEM 2017</a>, and Steve Simpson,
one of StackHPC's senior technical leads, presented at PG Day on
his thoughts for how some of the advanced features of Postgres could
really shine as a backing store for telemetry, logging and monitoring.</p>
<p>As Steve describes in his <a class="reference external" href="https://2017.fosdempgday.org/interview_Steven_Simpson/">interview for FOSDEM PG Day</a>, he
understands Postgres from the intimate vantage point of having
worked with the code base, and gained respect for its implementation
under the hood in addition to its capabilities as an RDBMS.</p>
<p>Through exploiting the unique strengths of Postgres, Steve sees an
opportunity to both simplify and enhance OpenStack monitoring <em>in
one move</em>. He'll be elaborating on his proposed designs and the
progress of this project in a <a class="reference external" href="https://www.stackhpc.com/blog.html">StackHPC blog</a> post in due course.</p>
<p>Steve's talk was <a class="reference external" href="https://fosdem.org/2017/schedule/event/postgresql_infrastructure_monitoring/">recorded</a>
and slides are available <a class="reference external" href="http://www.slideshare.net/StevenSimpson30/infrastructure-monitoring-with-postgres">on slideshare</a>.</p>
TripleO, NUMA and vCPU Pinning: Improving Guest Performance2017-02-03T10:30:00+00:002017-02-03T10:30:00+00:00Mark Goddardtag:www.stackhpc.com,2017-02-03:/tripleo-numa-vcpu-pinning.html<p class="first last">Virtualised performance can be boosted through using NUMA passthrough and vCPU Pinning.
This article describes how it's done in an environment deployed with TripleO</p>
<p>The hardware powering modern cloud and High Performance Computing (HPC) systems
is variable and complex. The assumption that access to memory and devices
across a system is uniform is often incorrect. Without knowledge
of the properties of the physical hardware of the host, the virtualised guests
running atop can perform poorly.</p>
<p>This post covers how we can take advantage of knowledge of the system
architecture in OpenStack Nova to improve guest VM performance, and how to
configure OpenStack TripleO to support this.</p>
<div class="section" id="non-uniform-memory-access-numa">
<h2>Non-Uniform Memory Access (NUMA)</h2>
<p>Server CPU clock speeds <a class="reference external" href="https://www.comsol.com/blogs/havent-cpu-clock-speeds-increased-last-years/">ceased increasing</a> long ago.
In order to continue to improve system performance, CPU and system vendors
now scale outwards instead of upwards, offering servers with multiple CPU
sockets and multiple cores per CPU. In multi-socket systems access to memory
and devices is no longer uniform between the CPU nodes across all memory, as
the inter-node communication paths are limited. This leads to variable memory
bandwidth and latency, and is known as Non-Uniform Memory Access (NUMA).</p>
<p>When virtualisation is used on NUMA systems, typically guest VMs will have no
knowledge of the memory architecture of the physical host. Consequently they
will make poor use of the system's resources, making many expensive memory
accesses across the interconnect bus. The same is true when accessing I/O
devices such as Network Interface Cards (NICs).</p>
<p>To avoid these issues it is possible to expose all or a subset of the memory
architecture of the physical system to the guest VM, allowing it to make more
intelligent decisions around the use of memory and how tasks are scheduled to
CPU cores.</p>
<p>We call this process the "physicalisation" of virtualisation. Revealing the
underlying hardware sacrifices some generality and flexibility, but delivers
performance gains through informed placement and scheduling. A compromise is
struck; we find we can get most of the benefits of software defined infrastructure
without paying a price in performance.</p>
</div>
<div class="section" id="vcpu-pinning">
<h2>vCPU Pinning</h2>
<p>In KVM, the virtual CPUs of a guest VM are emulated by host tasks in userspace
of the hypervisor. As such they may be scheduled across any of the cores in
the system. This behaviour can lead to sub-optimal cache performance as
virtual CPUs are scheduled between CPU cores within a NUMA node or worse,
between NUMA nodes. With virtual CPU pinning, we can improve this behaviour by
restricting the physical CPU cores on which each virtual CPU can run.</p>
<p>It can in some scenarios be advantageous to also restrict host processes to a
subset of the available CPU cores to avoid adverse interactions between
hypervisor processes and the application workloads.</p>
</div>
<div class="section" id="numa-and-vcpu-pinning-in-nova">
<h2>NUMA and vCPU Pinning in Nova</h2>
<p>Support for NUMA topology awareness and vCPU pinning in OpenStack Nova was
first introduced in October 2014 with the Juno release. These features allow
Nova to more intelligently schedule VM instances onto the available hardware.
Essentially, we can request a NUMA topology via Nova flavor keys or Glance
image properties when creating a Nova instance. The same is true for vCPU
pinning. The <a class="reference external" href="https://docs.openstack.org/admin-guide/compute-cpu-topologies.html">OpenStack admin guide</a> provides
some useful information on how to use these features. There is a good <a class="reference external" href="http://redhatstackblog.redhat.com/2015/05/05/cpu-pinning-and-numa-topology-awareness-in-openstack-compute/">blog
article</a>
by Red Hat on this topic which is worth reading. We are going to build on that
here by describing how to deliver these capabilities in TripleO.</p>
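<p>For example (a sketch; the flavor name is hypothetical), a two-node NUMA
topology and dedicated CPUs can be requested through flavor extra specs:</p>
<pre class="literal-block">
# request a guest NUMA topology spanning two NUMA nodes
$ openstack flavor set m1.numa --property hw:numa_nodes=2
# request dedicated (pinned) CPUs for instances of this flavor
$ openstack flavor set m1.numa --property hw:cpu_policy=dedicated
</pre>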
</div>
<div class="section" id="numa-and-vcpu-pinning-in-tripleo">
<h2>NUMA and vCPU Pinning in TripleO</h2>
<p>The OpenStack <a class="reference external" href="http://www.tripleo.org">TripleO</a> project provides tools to
deploy an OpenStack cloud, and Red Hat's popular <a class="reference external" href="https://access.redhat.com/documentation/en/red-hat-openstack-platform">OpenStack Platform (OSP)</a> is
based on TripleO. The default configuration of TripleO is not optimal for
NUMA placement and vCPU pinning, so we'll outline a few steps that can be
taken to improve the situation.</p>
<div class="section" id="kernel-command-line-arguments">
<h3>Kernel Command Line Arguments</h3>
<p>We can use the <tt class="docutils literal">isolcpus</tt> kernel command line argument to restrict
host processes to a subset of the total available CPU cores. The argument
specifies a list of ranges of CPU IDs from which host processes should be
isolated. In other words, the CPUs we will use for guest VMs. For example, to
reserve CPUs 4 through 23 exclusively for guest VMs we could specify:</p>
<pre class="literal-block">
isolcpus=4-23
</pre>
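<p>Once a node has booted with this argument, it can be verified from the live
kernel command line and, for example, from the CPU affinity of PID 1 (output
abbreviated and illustrative):</p>
<pre class="literal-block">
$ cat /proc/cmdline
... isolcpus=4-23
$ taskset -pc 1
pid 1's current affinity list: 0-3
</pre>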
<p>Ideally we want this argument to be applied on the first and subsequent boots
rather than applying it dynamically during deployment, to avoid waiting for our
compute nodes to reboot. Currently Ironic does not provide a mechanism to
specify additional kernel arguments on a per-node or per-image basis, so we
must bake them into the overcloud image instead.</p>
<p>If using the Grub bootloader, additional arguments can be provided to the
kernel by modifying the <tt class="docutils literal">GRUB_CMDLINE_LINUX</tt> variable in
<tt class="docutils literal">/etc/default/grub</tt> in the overcloud compute image, then rebuilding the Grub
configuration. We use the <tt class="docutils literal"><span class="pre">virt-customize</span></tt> command to apply post-build
configuration to the overcloud images:</p>
<pre class="literal-block">
$ export ISOLCPUS=4-23
$ function cpu_pinning_args {
CPU_PINNING_ARGS="isolcpus=${ISOLCPUS}"
echo --run-command \"echo GRUB_CMDLINE_LINUX=\"'\\\"'\"\\$\{GRUB_CMDLINE_LINUX\} ${CPU_PINNING_ARGS}\"'\\\"'\" \>\> /etc/default/grub\"
}
$ (cpu_pinning_args) | xargs virt-customize -v -m 4096 --smp 4 -a overcloud-compute.qcow2
</pre>
<p>(We structure the execution this way because typically we are composing a
string of operations into a single invocation of <tt class="docutils literal"><span class="pre">virt-customize</span></tt>).
Alternatively this change could be applied with a custom
<tt class="docutils literal"><span class="pre">diskimage-builder</span></tt> element.</p>
</div>
<div class="section" id="one-size-does-not-fit-all-multiple-overcloud-images">
<h3>One Size Does Not Fit All: Multiple Overcloud Images</h3>
<p>While the isolcpus argument may provide performance benefits for guest VMs on
compute nodes, it would be seriously harmful to limit host processes in the same way on
controller and storage nodes. With different sets of nodes requiring different arguments,
we now need multiple overcloud images. Thankfully, TripleO provides an (undocumented) set of
options to set the image for each of the overcloud roles. We'll use the name
<tt class="docutils literal"><span class="pre">overcloud-compute</span></tt> for the compute image here.</p>
<p>When uploading overcloud images to Glance, use the <tt class="docutils literal">OS_IMAGE</tt> environment
variable to reference an image with a non-default name:</p>
<pre class="literal-block">
$ export OS_IMAGE=overcloud-compute.qcow2
$ openstack overcloud image upload
</pre>
<p>We can execute this command multiple times to register multiple images.
To specify a different image for the overcloud compute roles, create a Heat
environment file containing the following:</p>
<pre class="literal-block">
parameter_defaults:
NovaImage: overcloud-compute
</pre>
<p>Ensure that the image name matches the one registered with Glance and that the
environment file is referenced when deploying or updating the overcloud.
Other node roles will continue to use the default image <tt class="docutils literal"><span class="pre">overcloud-full</span></tt>.
Our specialised kernel configuration is now only applied where it is needed,
and not where it is harmful.</p>
</div>
<div class="section" id="kvm-and-libvirt">
<h3>KVM and Libvirt</h3>
<p>The Nova compute service will not advertise the NUMA topology of its host if it
determines that the versions of libvirt and KVM are inappropriate. As of the
Mitaka release, the following version restrictions are applied:</p>
<ul class="simple">
<li><tt class="docutils literal">libvirt: >= 1.2.8, != 1.2.9.7</tt></li>
<li><tt class="docutils literal"><span class="pre">qemu-kvm:</span> >= 2.1.0</tt></li>
</ul>
<p>On CentOS 7.3, the <tt class="docutils literal"><span class="pre">qemu-kvm</span></tt> package is at version <tt class="docutils literal">1.5.3</tt>. This can be
updated to a more contemporary <tt class="docutils literal">2.4.1</tt> by adding the <a class="reference external" href="http://mirror.centos.org/centos/7/virt/x86_64/kvm-common/">kvm-common</a> Yum repository
and installing <cite>qemu-kvm-ev</cite>:</p>
<pre class="literal-block">
$ cat << EOF | sudo tee /etc/yum.repos.d/kvm-common.repo
[kvm-common]
name=KVM common
baseurl=http://mirror.centos.org/centos/7/virt/x86_64/kvm-common/
gpgcheck=0
EOF
$ sudo yum -y install qemu-kvm-ev
</pre>
<p>This should be applied to the compute nodes either after deployment or during
the overcloud compute image build procedure. If applied after deployment, the
<tt class="docutils literal"><span class="pre">openstack-nova-compute</span></tt> service should be restarted on the compute nodes to
ensure it checks the library versions again:</p>
<pre class="literal-block">
$ sudo systemctl restart openstack-nova-compute
</pre>
</div>
<div class="section" id="kernel-same-page-merging-ksm">
<h3>Kernel Same-page Merging (KSM)</h3>
<p>Kernel Same-page Merging (KSM) allows for more efficient use of memory by guest
VMs by allowing identical memory pages used by different processes to be backed
by a single page of Copy On Write (COW) physical memory. When used on a NUMA
system this can have adverse effects if pages are merged belonging to VMs on
different nodes. In Linux kernels prior to 3.14 KSM was not NUMA-aware, which
could lead to VM performance and stability issues. It is possible to disable
KSM merging memory across NUMA nodes:</p>
<pre class="literal-block">
$ echo 0 > /sys/kernel/mm/ksm/merge_across_nodes
</pre>
<p>Some <a class="reference external" href="https://www.centos.org/forums/viewtopic.php?t=59749">additional commands</a> are required if any
shared pages are already in use. If memory is not constrained, greater
performance may be achieved by disabling KSM altogether:</p>
<pre class="literal-block">
$ echo 0 > /sys/kernel/mm/ksm/run
</pre>
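<p>If shared pages are already in use, writing <tt class="docutils literal">2</tt> to
the same control file stops KSM and asks the kernel to unmerge all currently
merged pages first:</p>
<pre class="literal-block">
# stop ksmd and unmerge all currently merged pages
$ echo 2 > /sys/kernel/mm/ksm/run
</pre>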
</div>
<div class="section" id="nova-scheduler-configuration">
<h3>Nova Scheduler Configuration</h3>
<p>The Nova scheduler provides the <tt class="docutils literal">NUMATopologyFilter</tt> filter to incorporate
NUMA topology information into the placement process. TripleO does not appear
to provide a mechanism to append additional filters to the default list
(although it may be possible with sufficient 'puppet-fu'). To override the
default scheduler filter list, use a Heat environment file like the following:</p>
<pre class="literal-block">
parameter_defaults:
controllerExtraConfig:
nova::scheduler::filter::scheduler_default_filters:
- RetryFilter
- AvailabilityZoneFilter
- RamFilter
- DiskFilter
- ComputeFilter
- ComputeCapabilitiesFilter
- ImagePropertiesFilter
- ServerGroupAntiAffinityFilter
- ServerGroupAffinityFilter
- NUMATopologyFilter
</pre>
<p>The <tt class="docutils literal">controllerExtraConfig</tt> parameter (recently renamed to
<tt class="docutils literal">ControllerExtraConfig</tt>) allows us to specialise the overcloud configuration.
Here <tt class="docutils literal"><span class="pre">nova::scheduler::filter::scheduler_default_filters</span></tt> references a
variable in the Nova scheduler <a class="reference external" href="https://github.com/openstack/puppet-nova/blob/stable/mitaka/manifests/scheduler/filter.pp">puppet manifest</a>.
Be sure to include this environment file in your <cite>openstack overcloud deploy</cite>
command as a <cite>-e</cite> argument.</p>
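<p>For example, assuming the environment file above is saved as
<tt class="docutils literal"><span class="pre">scheduler-filters.yaml</span></tt> (the
name is arbitrary):</p>
<pre class="literal-block">
$ openstack overcloud deploy --templates -e scheduler-filters.yaml
</pre>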
</div>
<div class="section" id="nova-compute-configuration">
<h3>Nova Compute Configuration</h3>
<p>The Nova compute service can be configured to pin virtual CPUs to a subset of
the physical CPUs. We can use the set of CPUs previously isolated via kernel
arguments. It is also prudent to reserve an amount of memory for the host
processes. In TripleO we can again use a Heat environment file to set these
options:</p>
<pre class="literal-block">
parameter_defaults:
NovaComputeExtraConfig:
nova::compute::vcpu_pin_set: 4-23
nova::compute::reserved_host_memory: 2048
</pre>
<p>Here we are using CPUs 4 through 23 for vCPU pinning and reserving 2GB of
memory for host processes. As before, remember to include this environment
file when managing the TripleO overcloud.</p>
</div>
</div>
<div class="section" id="performance-for-all">
<h2>Performance For All</h2>
<p>We hope this guide helps the community to improve the performance of their
TripleO-based OpenStack deployments. Thanks to the University of Cambridge for
the use of their development cloud while developing this configuration.</p>
</div>
<div class="section" id="further-reading">
<h2>Further Reading</h2>
<ul class="simple">
<li><a class="reference external" href="https://www.openstack.org/science/">OpenStack for Scientific Research</a></li>
<li><a class="reference external" href="https://stackhpc.com/hpc-and-virtualisation.html">StackHPC blog: OpenStack and Virtualised HPC</a></li>
<li><a class="reference external" href="https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Virtualization_Tuning_and_Optimization_Guide/sect-Virtualization_Tuning_Optimization_Guide-NUMA-NUMA_and_libvirt.html">Red Hat Virtualization Tuning and Optimization Guide</a></li>
<li><a class="reference external" href="https://specs.openstack.org/openstack/nova-specs/specs/juno/implemented/virt-driver-numa-placement.html">OpenStack Nova NUMA placement specification</a></li>
<li><a class="reference external" href="https://specs.openstack.org/openstack/nova-specs/specs/juno/approved/virt-driver-cpu-pinning.html">OpenStack Nova vCPU pinning specification</a></li>
<li><a class="reference external" href="https://stackhpc.com/vxlan-ovs-bandwidth.html">StackHPC blog: Understanding VXLAN & OVS bandwidth</a></li>
</ul>
</div>
Managing BIOS and RAID in the Hyperscale Era2017-01-09T15:20:00+00:002017-01-09T15:20:00+00:00Mark Goddardtag:www.stackhpc.com,2017-01-09:/ansible-drac.html<p class="first last">Using Ansible and Ironic for firmware configuration management.</p>
<p>Have you ever had the nuisance of configuring a server BIOS? How
about a rack full of servers? Or an aisle, a hall, an entire
facility even? It gets to be tedious toil even before the second
server, and it also becomes increasingly unreliable to apply a
consistent configuration with increasing scale.</p>
<p>In this post we describe how we apply some modern tools from the
cloud toolbox (Ansible, Ironic and Python) to tackle this age-old
problem.</p>
<div class="section" id="server-management-in-the-21st-century">
<h2>Server Management in the 21st Century</h2>
<p>Baseboard management controllers (BMCs) are a valuable tool for
easing the inconvenience of hardware management. By using a BMC
we can configure our firmware using remote access, avoiding a trip
to the data centre and stepping from server to server with a crash
cart. This is already a big win.</p>
<p>However, BMCs are still pretty slow to apply changes, and are
manipulated individually. Through automation, we could address these
shortcomings.</p>
<p>I've seen some pretty hairy early efforts at automation, for example
playing out timed keystroke macros across a hundred terminals of BMC
sessions. This might work, but it's a desperate hack. Using the tools
created for configuration management we can do so much better.</p>
<div class="section" id="a-quick-tour-of-openstack-server-hardware-management">
<h3>A Quick Tour of OpenStack Server Hardware Management</h3>
<p>OpenStack deployment usually draws upon some hardware inventory
management intelligence. In our recent project with the University
of Cambridge this was <a class="reference external" href="https://access.redhat.com/documentation/en/red-hat-openstack-platform/">Red Hat OSP Director</a>.
The heart of OSP Director is <a class="reference external" href="http://www.tripleo.org">TripleO</a>
and the heart of TripleO is <a class="reference external" href="https://wiki.openstack.org/wiki/Ironic">OpenStack Ironic</a>.</p>
<p>Ironic is OpenStack's bare metal manager. It masquerades as a
virtualisation driver for OpenStack Nova, and provisions bare metal
hardware when a user asks for a compute instance to be created.
TripleO uses this capability to good effect to create
OpenStack-on-OpenStack (OoO), in which the servers of the OpenStack
control plane are instances created within another OpenStack layer
beneath.</p>
<p>Our new tools fit neatly into the TripleO process between <a class="reference external" href="http://tripleo.org/basic_deployment/basic_deployment_cli.html#register-nodes">registration</a>
and <a class="reference external" href="http://tripleo.org/basic_deployment/basic_deployment_cli.html#introspect-nodes">introspection</a>
of undercloud nodes, and are complementary to the existing functionality
offered by TripleO.</p>
</div>
<div class="section" id="idrac-dell-s-server-management-toolkit">
<h3>iDRAC: Dell's Server Management Toolkit</h3>
<p>The system at Cambridge makes extensive use of Dell server hardware,
including:</p>
<ul class="simple">
<li>R630 servers for OpenStack controllers.</li>
<li>C6320 servers for high-density compute nodes.</li>
<li>R730 servers for high performance storage.</li>
</ul>
<p>Deploying a diverse range of servers in a diverse range of roles
requires flexible (but consistent) management of firmware configuration.</p>
<p>These Dell server models feature Dell's proprietary
BMC, the <a class="reference external" href="http://en.community.dell.com/techcenter/systems-management/w/wiki/3204.dell-remote-access-controller-drac-idrac">integrated Dell Remote Access Controller (iDRAC)</a>.
This is what we use for remote configuration of our Dell server hardware.</p>
</div>
</div>
<div class="section" id="a-cloud-centric-approach-to-firmware-configuration-management">
<h2>A Cloud-centric Approach to Firmware Configuration Management</h2>
<p>OpenStack Ironic tracks hardware state for every server in an OpenStack deployment.</p>
<p>A simple overview can be seen with <tt class="docutils literal">ironic <span class="pre">node-list</span></tt>:</p>
<pre class="literal-block">
+--------------------------------------+----------+--------------------------------------+-------------+--------------------+-------------+
| UUID | Name | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+----------+--------------------------------------+-------------+--------------------+-------------+
| 415c254f-3e82-446d-a63b-232af5816e4e | control1 | 3d27b7d2-729c-467c-a21b-74649f1b1203 | power on | active | False |
| 2646ece4-a24e-4547-bbe8-786eca16da82 | control2 | 8a066c7e-36ec-4c45-9e1b-5d0c5635f256 | power on | active | False |
| 2412f0ef-dedb-49c8-a923-778db36a57d9 | control3 | 6a62936f-40ec-49e7-a820-6f3329e5bb0c | power on | active | False |
| 81676b2d-9c37-4111-a32a-456a9f933e57 | compute0 | aac2866c-7d16-4089-9d94-611bfc38467e | power on | active | False |
| c6a5fbe7-566a-447e-a806-9e33676be5ea | compute1 | 619476ae-fec4-42c6-b3f5-3a4f5296d3bc | power on | active | False |
| c7f27dd4-67a7-42b9-93ab-2e444802c5c2 | compute2 | a074c3f8-eb87-46d6-89c8-f360fbf2a3df | power on | active | False |
| 025d84dc-a590-46c5-a456-211d5c1e8f1a | compute3 | 11524318-2ecf-4880-a1cf-76cd62935b00 | power on | active | False |
+--------------------------------------+----------+--------------------------------------+-------------+--------------------+-------------+
</pre>
<p>Ironic's node data includes how to access the BMC of every server
in the node inventory.</p>
<p>We extract the data from Ironic's inventory to generate a <a class="reference external" href="http://docs.ansible.com/ansible/intro_dynamic_inventory.html">dynamic inventory</a> for use
with Ansible. Instead of a file of hostnames, or a list of command line parameters,
a dynamic inventory is the output from an executed command. A dynamic inventory
executable accepts a few simple arguments and emits node inventory data in JSON format.
Using Python and the <tt class="docutils literal">ironicclient</tt> module simplifies the implementation.</p>
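<p>A heavily truncated sketch of such output (the group names here are
illustrative, reusing the node UUID and name from the listing above; host
variables are elided):</p>
<pre class="literal-block">
$ ./ironic_inventory.py --list
{
    "baremetal": ["415c254f-3e82-446d-a63b-232af5816e4e", "..."],
    "node_control1": ["415c254f-3e82-446d-a63b-232af5816e4e"],
    "_meta": {"hostvars": {"...": "..."}}
}
</pre>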
<p>To perform fact gathering and configuration, two new Ansible roles were
developed and published on Ansible Galaxy.</p>
<dl class="docutils">
<dt>DRAC configuration</dt>
<dd>Provides the <tt class="docutils literal">drac</tt> Ansible module for configuration of BIOS settings
and RAID controllers. A single task is provided to execute the module.
The role is available on Ansible Galaxy as
<a class="reference external" href="https://galaxy.ansible.com/stackhpc/drac">stackhpc.drac</a> and the
source code is available on Github as
<a class="reference external" href="https://github.com/stackhpc/drac">stackhpc/drac</a>.</dd>
<dt>DRAC fact gathering</dt>
<dd>Provides the <tt class="docutils literal">drac_facts</tt> Ansible module for gathering facts from a
DRAC card. The module is not executed by this role but is available to
subsequent tasks and roles.
The role is available on Ansible Galaxy as
<a class="reference external" href="https://galaxy.ansible.com/stackhpc/drac-facts">stackhpc.drac-facts</a>
and the source code is available on Github as
<a class="reference external" href="https://github.com/stackhpc/drac-facts">stackhpc/drac-facts</a>.</dd>
</dl>
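<p>Both roles can be installed from Ansible Galaxy in the usual way:</p>
<pre class="literal-block">
$ ansible-galaxy install stackhpc.drac stackhpc.drac-facts
</pre>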
<p>We use the <tt class="docutils literal"><span class="pre">python-dracclient</span></tt> module as a high-level interface
for querying and configuring the DRAC via the WSMAN protocol. This
module was developed by the Ironic team to support the DRAC family
of controllers. The module provides a useful level of abstraction
for these Ansible modules, hiding the complexities of the WSMAN
protocol.</p>
</div>
<div class="section" id="example-playbooks">
<h2>Example Playbooks</h2>
<p>The source code for all of the following examples is available on Github at
<a class="reference external" href="https://github.com/stackhpc/ansible-drac-examples">stackhpc/ansible-drac-examples</a>. The playbooks are not
large, and we encourage you to read through them.</p>
<p>A Docker image providing all dependencies has also been created and
made available on Dockerhub at the <a class="reference external" href="https://hub.docker.com/r/stackhpc/ansible-drac-examples/">stackhpc/ansible-drac-examples
repository</a>.
To use this image, run:</p>
<pre class="literal-block">
$ docker run --name ansible-drac-examples -it --rm docker.io/stackhpc/ansible-drac-examples
</pre>
<p>This will start a Bash shell in the /ansible-drac-examples directory where
there is a checkout of the <tt class="docutils literal"><span class="pre">ansible-drac-examples</span></tt> repository. The
<tt class="docutils literal">stackhpc.drac</tt> and <tt class="docutils literal"><span class="pre">stackhpc.drac-facts</span></tt> roles are installed under
<tt class="docutils literal">/etc/ansible/roles/</tt>. Once the shell is exited the container will be
removed.</p>
<div class="section" id="ironic-inventory">
<h3>Ironic Inventory</h3>
<p>In the example repository, the inventory script is
<tt class="docutils literal">inventory/ironic_inventory.py</tt>. We need to provide this script with the
following environment variables to allow it to communicate with Ironic:
<tt class="docutils literal">OS_USERNAME</tt>, <tt class="docutils literal">OS_PASSWORD</tt>, <tt class="docutils literal">OS_TENANT_NAME</tt> and <tt class="docutils literal">OS_AUTH_URL</tt>.
For the remainder of this article we will assume that a file, <tt class="docutils literal">cloudrc</tt>, is
available and exports these variables. To see the output of the inventory
script:</p>
<pre class="literal-block">
$ source cloudrc
$ ./inventory/ironic_inventory.py --list
</pre>
<p>To use this dynamic inventory with <tt class="docutils literal"><span class="pre">ansible-playbook</span></tt>, use the <tt class="docutils literal"><span class="pre">-i</span></tt>
argument:</p>
<pre class="literal-block">
$ source cloudrc
$ ansible-playbook -i inventory ...
</pre>
<p>The inventory will contain all Ironic nodes, named by their UUID. For
convenience, an Ansible group is created for each named node using its name
with a prefix of <cite>node_</cite>.</p>
<p>The inventory also contains groupings for servers in Ironic maintenance
mode, and for servers in different states in Ironic's hardware state
machine. Groups are also created for each server profile defined
by TripleO: <tt class="docutils literal">controller</tt>, <tt class="docutils literal">compute</tt>, <tt class="docutils literal"><span class="pre">block-storage</span></tt>, etc..</p>
<p>In the following examples, the playbooks will execute against all Ironic nodes
discovered by the inventory script. To limit the hosts against which a play is
executed, use the <tt class="docutils literal"><span class="pre">--limit</span></tt> argument to <tt class="docutils literal"><span class="pre">ansible-playbook</span></tt>.</p>
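<p>For example, to run the fact-gathering playbook from the first example below
against only the node named <tt class="docutils literal">control1</tt>, via its
generated group:</p>
<pre class="literal-block">
$ ansible-playbook -i inventory --limit node_control1 drac-facts.yml
</pre>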
<p>If you would rather not make any changes to the systems in the inventory, use
the <tt class="docutils literal"><span class="pre">--check</span></tt> argument to <tt class="docutils literal"><span class="pre">ansible-playbook</span></tt>. This will display the changes
that would have been made if the <tt class="docutils literal"><span class="pre">--check</span></tt> argument were not passed.</p>
</div>
<div class="section" id="example-1-gather-and-display-facts-about-firmware-configuration">
<h3>Example 1: Gather and Display Facts About Firmware Configuration</h3>
<p>The <tt class="docutils literal"><span class="pre">drac-facts.yml</span></tt> playbook shows how the <tt class="docutils literal"><span class="pre">stackhpc.drac-facts</span></tt> role
can be used to query the DRAC module of each node in the inventory. It also
displays the results. Run the following command to execute the playbook:</p>
<pre class="literal-block">
$ source cloudrc
$ ansible-playbook -i inventory drac-facts.yml
</pre>
</div>
<div class="section" id="example-2-configure-the-numlock-bios-setting">
<h3>Example 2: Configure the <tt class="docutils literal">NumLock</tt> BIOS Setting</h3>
<p><em>NOTE</em>: This example may make changes to systems in the inventory.</p>
<p>The <tt class="docutils literal"><span class="pre">drac-bios-numlock.yml</span></tt> playbook demonstrates how the <tt class="docutils literal">stackhpc.drac</tt>
role can be used to configure BIOS settings. It sets the <tt class="docutils literal">NumLock</tt> BIOS
setting to either <tt class="docutils literal">On</tt> or <tt class="docutils literal">Off</tt>.</p>
<p>The playbook specifies the <tt class="docutils literal">drac_reboot</tt> variable as <tt class="docutils literal">False</tt>, so the setting
will not be applied immediately. A reboot of the system is required for this
pending setting to be applied. The <tt class="docutils literal">drac_facts</tt> module provides information
on any pending BIOS configuration changes, as may be seen in the first example.</p>
<p>Run the following command to execute the playbook and configure the setting:</p>
<pre class="literal-block">
$ source cloudrc
$ ansible-playbook -i inventory -e numlock=<value> drac-bios-numlock.yml
</pre>
<p>Set the <tt class="docutils literal">numlock</tt> variable to the required value (<tt class="docutils literal">On</tt> or <tt class="docutils literal">Off</tt>).
The <tt class="docutils literal">drac_result</tt> variable is registered by the role and contains the results
returned by the <tt class="docutils literal">drac</tt> module. The playbook displays this variable after the
role is executed. Of particular interest is the <tt class="docutils literal">reboot_required</tt> variable
which indicates whether a reboot is required to apply the changes. If a reboot
is required, this must be performed before making further BIOS configuration
changes.</p>
</div>
<div class="section" id="example-3-configure-a-raid-1-virtual-disk">
<h3>Example 3: Configure a RAID-1 Virtual Disk</h3>
<p><em>NOTE</em>: This example may make changes to systems in the inventory.</p>
<p>The <tt class="docutils literal"><span class="pre">drac-raid1.yml</span></tt> playbook shows how the <tt class="docutils literal">stackhpc.drac</tt> role can be
used to configure RAID controllers. In this example we configure a RAID1
virtual disk.</p>
<p>Ensure that <tt class="docutils literal">raid_pdisk1</tt> and <tt class="docutils literal">raid_pdisk2</tt> are set to the IDs of two
physical disks in the system that are attached to the same RAID controller and
not already part of another virtual disk. The facts gathered in the first
example may be useful here. This time we specify the <tt class="docutils literal">drac_reboot</tt> variable
as <tt class="docutils literal">True</tt>. This means that if required, the <tt class="docutils literal">drac</tt> module will reboot the
system to apply changes.</p>
<p>Run the following command to execute the playbook and configure the system.
The task will likely take a long time to execute if the virtual disk
configuration is not already as requested, as the system will need to be
rebooted:</p>
<pre class="literal-block">
$ source cloudrc
$ ansible-playbook -i inventory -e raid_pdisk1=<pdisk1> -e raid_pdisk2=<pdisk2> drac-raid1.yml
</pre>
</div>
</div>
<div class="section" id="under-the-hood">
<h2>Under The Hood</h2>
<p>The vast majority of the useful code provided by these roles takes the form of
python Ansible modules. This takes advantage of the capability of Ansible roles
to contain modules under a <tt class="docutils literal">library</tt> directory, and means that no python code
needs to be installed on the system or included with the core or extra Ansible
modules.</p>
<div class="section" id="the-drac-facts-module">
<h3>The <tt class="docutils literal">drac_facts</tt> Module</h3>
<p>The <tt class="docutils literal">drac_facts</tt> module is relatively simple. It queries the state of BIOS
settings, RAID controllers and the DRAC job queues. The results are translated
to a JSON-friendly format and returned as facts.</p>
</div>
<div class="section" id="the-drac-module">
<h3>The <tt class="docutils literal">drac</tt> Module</h3>
<p>The <tt class="docutils literal">drac</tt> module is more complex than the <tt class="docutils literal">drac_facts</tt> module.
The DRAC API provides a split-phase execution model, allowing changes
to be staged before either committing or aborting them. Committed
changes are applied by rebooting the system. To further complicate
matters, the BIOS settings and each of the RAID controllers represents
a separate configuration channel. Upon execution of the <tt class="docutils literal">drac</tt>
module these channels may have uncommitted or committed pending
changes. We must therefore determine a minimal sequence of steps
to realise the requested configuration for an arbitrary initial
state, which may affect more than one of these channels.</p>
<p>The <tt class="docutils literal"><span class="pre">python-dracclient</span></tt> module provided almost all of the necessary
input data with one exception. When querying the virtual disks, the
returned objects did not contain the list of physical disks that each virtual
disk is composed of. We developed the required functionality and submitted it
to the <tt class="docutils literal"><span class="pre">python-dracclient</span></tt> project.</p>
<p>Thanks go to the <tt class="docutils literal"><span class="pre">python-dracclient</span></tt> community for their help in
implementing the feature.</p>
</div>
</div>
Understanding VXLAN+OVS Bandwidth Issues2016-11-30T10:20:00+00:002016-12-05T18:40:00+00:00Stig Telfertag:www.stackhpc.com,2016-11-30:/vxlan-ovs-bandwidth.html<p class="first last">A study of some of the VXLAN network performance issues
we encountered while working on the OpenStack/HPC cloud at Cambridge
University, and what we did to resolve them.</p>
<p>With Cambridge University, StackHPC has been working on our goal
of an HPC-enabled OpenStack cloud. I have previously presented on
the architecture and approach taken in deploying the system at
Cambridge, for example at the <a class="reference external" href="https://www.stackhpc.com/stackhpc-at-openstack-day-uk.html">OpenStack days UK event in Bristol</a>.</p>
<p>Our project there uses Mellanox ConnectX4-LX 50G Ethernet NICs for
high-speed networking. Over the summer we worked on our TripleO
configuration to unlock the SR-IOV capabilities of this NIC,
delivering HPC-style RDMA protocols direct to our compute instances.
We also have Cinder volumes backed by the iSER RDMA protocol plugged
into our hypervisors. Those components are working well and delivering
on the promise of an HPC-enabled OpenStack cloud.</p>
<p>However, SR-IOV does not fit every problem. On ConnectX4-LX, virtual
functions bypass all but the most basic of OpenStack's SDN capabilities:
we can attach a VF to a tenant VLAN, and that's it. All security
groups and other richer network functions are circumvented.
Furthermore, an instance using SR-IOV cannot be migrated (at least
not yet). We use SR-IOV when we want Lustre and MPI, but we need
a generic solution for all other types of network. However,
performance of virtualised networking has been particularly elusive
for our project... but read on!</p>
<p>In addition to SR-IOV networking, TripleO creates for us a more
standard OpenStack network configuration: VXLAN-encapsulated tenant
networks and a hierarchy of Open vSwitch bridges to plumb together
the data plane. In this case we have two OVS bridges: <tt class="docutils literal"><span class="pre">br-int</span></tt>
and <tt class="docutils literal"><span class="pre">br-tun</span></tt>.</p>
<div class="section" id="performance-analysis">
<h2>Performance Analysis</h2>
<p>A very simple test case reveals the performance we achieve over
VXLAN. We spin up a couple of VMs on different hypervisors and use
<tt class="docutils literal">iperf</tt> to benchmark the bandwidth of <strong>single TCP stream</strong> between them.</p>
<div class="figure">
<img alt="iperf bandwidth baseline" src="//www.stackhpc.com/images/iperf-config-0-streams-1.png" style="width: 600px;" />
</div>
<p>Despite using a 50G network, we hit about 1.2 gbits/s TCP bandwidth
in our instances. Ouch! But wait - the <a class="reference external" href="https://community.mellanox.com/docs/DOC-1456">Mellanox community website</a> describes a use
case, similar to the Cambridge use case but with an older version
of NIC, delivering 20.51 gbits/s between VMs for VXLAN-encapsulated
networking.</p>
<p>Something had to be done...</p>
<p>We started low-level, applying the recommendations of the <a class="reference external" href="http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf">Mellanox tuning guide</a>
for BIOS and kernel settings. Principally this involved turning
off all the hardware power-saving features, plus a chipset mode
that adversely affects the performance of the Mellanox NIC.</p>
<div class="figure">
<img alt="iperf bandwidth after BIOS tweaks and kernel tuning" src="//www.stackhpc.com/images/iperf-config-1-streams-1.png" style="width: 600px;" />
</div>
<p>On our hardware this delivered roughly another 0.5 gbits/s: relatively
a decent improvement, but still far from where we needed to be in
absolute terms.</p>
<p>A perf profile and <a class="reference external" href="http://www.brendangregg.com/flamegraphs.html">flame graph</a> of the host OS
provides insight into the overhead incurred by VXLAN, Open vSwitch
and the software-defined networking performed in the hypervisor:</p>
<div class="figure">
<img alt="perf profile of host OS work during iperf benchmark" src="//www.stackhpc.com/images/vxlan-ovs-flame-graph.png" style="width: 600px;" />
</div>
<p>In this profile, the smaller "flame" on the left is time spent in
the guest VM (where <tt class="docutils literal">iperf</tt> is running), while the larger "flame"
on the right is time spent in service of OVS and hypervisor networking.
The relative width of each - not the height - is the significant
metric. The bottleneck is the software-defined networking in the host OS.</p>
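<p>For reference, a profile like this can be captured with
<tt class="docutils literal">perf</tt> and rendered using Brendan Gregg's
FlameGraph scripts (a sketch, assuming the scripts are on the
<tt class="docutils literal">PATH</tt>):</p>
<pre class="literal-block">
# sample stacks across the whole system for 30 seconds
$ perf record -a -g -- sleep 30
# fold the stacks and render an SVG flame graph
$ perf script | stackcollapse-perf.pl | flamegraph.pl > vxlan-ovs.svg
</pre>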
<p>The team at Mellanox came up with a breakthrough: our systems,
based on RHEL 7 and CentOS 7, have a kernel that lacks some features
developed upstream for efficient hardware offloading of VXLAN-encapsulated
frames. We migrated a test system to kernel 4.7.10 and Mellanox OFED 3.5 -
and saw a step change:</p>
<div class="figure">
<img alt="iperf bandwidth after moving to kernel 4.7.10" src="//www.stackhpc.com/images/iperf-config-3-streams-1.png" style="width: 600px;" />
</div>
<p>This is a good improvement to about 11 gbits/s - except that changing
the kernel breaks supportability for Cambridge's production
environment. For our own interest, we continued the investigation
in order to know the untapped potential of our system.</p>
<p>The question of hyperthreading divides the HPC and cloud worlds:
in HPC, it is almost never enabled but in cloud it is almost always
enabled. By disabling it, we saw a surprisingly big uplift in performance
to about 18.3 gbits/s (but we halved the number of cores in the system at
the same time).</p>
<div class="figure">
<img alt="iperf bandwidth after disabling hyperthreading" src="//www.stackhpc.com/images/iperf-config-4-streams-1.png" style="width: 600px;" />
</div>
<p>The performance with hyperthreading disabled has jumped higher, but
has also become quite erratic. I guessed that this was because the
system does not yet pin virtual CPU cores onto physical CPU cores.
In Nova this is easily done (Kilo versions or later). A clearly-written
<a class="reference external" href="http://redhatstackblog.redhat.com/2015/05/05/cpu-pinning-and-numa-topology-awareness-in-openstack-compute/">blog post by Steve Gordon of Red Hat</a>
describes the next set of optimisations; CPU pinning itself is relatively
straightforward to implement.</p>
<ul>
<li><p class="first">On each Nova compute hypervisor, update <tt class="docutils literal">/etc/nova/nova.conf</tt> to
set <tt class="docutils literal">vcpu_pin_set</tt> to a run-length list of CPU cores to which
QEMU/KVM should pin the guest VM processes.</p>
<p>On our system, we tested with pinning guest VMs to cores 4-23,
ie leaving 4 CPU cores 0-3 exclusively available to the host OS:</p>
<p><tt class="docutils literal"><span class="pre">vcpu_pin_set=4-23</span></tt></p>
</li>
<li><p class="first">We also make sure that the host OS has RAM reserved for it. In our case,
we reserved 2GB for the host OS:</p>
<p><tt class="docutils literal">reserved_host_memory_mb=2048</tt></p>
</li>
<li><p class="first">Once these changes are applied, restart the Nova compute service
on the updated hypervisor:</p>
<p><tt class="docutils literal">systemctl restart <span class="pre">openstack-nova-compute</span></tt></p>
</li>
<li><p class="first">To select which instances get their CPUs pinned, tag the images
with metadata such as <tt class="docutils literal">hw_cpu_policy</tt></p>
<p><tt class="docutils literal">glance <span class="pre">image-update</span> <image ID> <span class="pre">--property</span> hw_cpu_policy=dedicated</tt></p>
<p>(Steve Gordon's blog describes the alternative approach in which
Nova flavors are tagged with <tt class="docutils literal">hw:cpu_policy=dedicated</tt>, and
then scheduled to specific availability zones tagged for pinned
instances. In our use case we schedule them across the whole
system instead).</p>
</li>
</ul>
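<p>With these changes in place, the resulting pinning can be inspected on the
hypervisor with libvirt (the domain name and output here are illustrative):</p>
<pre class="literal-block">
$ virsh vcpupin instance-00000001
VCPU: CPU Affinity
----------------------------------
   0: 4
   1: 5
   ...
</pre>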
<p>After we had pinned all the VCPUs onto physical cores, performance was steady
again but (surprisingly to me) no higher:</p>
<div class="figure">
<img alt="iperf bandwidth after VCPU pinning" src="//www.stackhpc.com/images/iperf-config-5-streams-1.png" style="width: 600px;" />
</div>
<p>There are more optimisations available in the hypervisor environment. Finally
we tried a few more of these:</p>
<ul>
<li><p class="first">Isolate the host OS from the cores to which guest VMs are pinned. This is
done by updating the kernel boot parameters (and rebooting the hypervisor).
Taking the CPU cores provided to Nova in <tt class="docutils literal">vcpu_pin_set</tt>, pass them
to the kernel using the command-line boot parameter <tt class="docutils literal">isolcpus</tt></p>
<p><tt class="docutils literal"><span class="pre">isolcpus=4-23</span></tt></p>
<p>After rebooting, htop (or similar) should demonstrate scheduling is confined
to the CPUs specified.</p>
</li>
<li><p class="first">Pass-through of NUMA: this requires some legwork on CentOS systems, because
the QEMU version that ships with CentOS 7.2 (at the time of writing, 1.5.3)
is too old for Nova, which requires a minimum of 2.1 for NUMA (and handling
of huge pages).</p>
<p>The packages from the <a class="reference external" href="http://mirror.centos.org/centos/7/virt/x86_64/kvm-common/">CentOS 7 virt KVM repo</a>
bring a CentOS system up to spec with a sufficiently recent version
of QEMU-KVM for Nova's needs.</p>
</li>
</ul>
<p>After enabling CPU isolation and NUMA passthrough, we get another boost:</p>
<div class="figure">
<img alt="iperf bandwidth after CPU isolation and NUMA passthrough" src="//www.stackhpc.com/images/iperf-config-6-streams-1.png" style="width: 600px;" />
</div>
<p>The instance TCP bandwidth has risen to about 24.3 gbits/s. From where we
started this is a big improvement. But we are only at 50% line rate -
<strong>we are still losing half our ultimate performance</strong>.</p>
</div>
<div class="section" id="alternatives-to-vxlan-and-ovs">
<h2>Alternatives to VXLAN and OVS</h2>
<p>There are other software-defined networking options that don't use
OVS (or indeed VXLAN), and I am researching those capabilities with
the hope of evaluating them in a future project.</p>
<p>Using modern high-performance Ethernet NICs, SR-IOV delivers a far greater
level of performance. However, that performance comes at a cost of convenience
and flexibility.</p>
<ul class="simple">
<li>As mentioned previously, SR-IOV bypasses security groups and
associated rich functionality of software-defined networking. It
should not be used on any network that is intended to be externally
visible.</li>
<li>Using SR-IOV currently prevents the live migration of instances.</li>
<li>Our Mellanox ConnectX-4 LX NICs only support SR-IOV in conjunction with
VLAN tenant network segmentation. VXLAN networks are not currently supported
in conjunction with SR-IOV on our NIC hardware.</li>
</ul>
<p>Taking all of the above limitations on board, here is the same benchmark run
between VMs using SR-IOV network port bindings:</p>
<div class="figure">
<img alt="iperf bandwidth over SR-IOV" src="//www.stackhpc.com/images/iperf-config-6-streams-1-sriov-virt-10core-pinned-isolated.png" style="width: 600px;" />
</div>
<p>Our bandwidth has risen to just under 42 gbits/s.</p>
<p>If raw performance is the ultimate priority, a bare metal
solution using OpenStack Ironic is the best option. Ironic offers
a combination of the performance of bare metal with (some of) the
flexibility of software-defined infrastructure:</p>
<div class="figure">
<img alt="iperf bandwidth on bare metal" src="//www.stackhpc.com/images/iperf-baremetal-streams-1-hwtuned.png" style="width: 600px;" />
</div>
<p>46 gbits/s on a single TCP stream, rising to line rate for multiple
TCP streams. Not bad!</p>
</div>
<div class="section" id="the-road-ahead-is-less-rocky">
<h2>The Road Ahead is Less Rocky</h2>
<p>The kernel upgrade we performed restricts this investigation to
being an experimental result. For customers in production there
is always a risk in adopting advanced kernels, and doing so will
inevitably break the terms of commercial support.</p>
<p>However, for people in the same situation there are grounds for
hope. RHEL 7.3 (and CentOS 7.3) kernels include a back-port of the
capabilities we were using for supporting hardware offloading of
encapsulated traffic. We will be using this when our control plane gets
upgraded to 7.3. For Ubuntu users the Xenial hardware enablement
kernel is a 4.x kernel.</p>
<p>Further ahead, a more powerful solution is being developed for
Mellanox ConnectX4-LX NICs. Mellanox are calling it <a class="reference external" href="http://www.mellanox.com/blog/2016/12/three-ways-asap2-beats-dpdk-for-cloud-and-nfv/">Accelerated
Switching and Packet Processing (ASAP2)</a>.
This technology uses the embedded eSwitch in the Mellanox NIC as a
hardware offload of OVS. SR-IOV capabilities can then be used
instead of paravirtualised virtio NICs in the VMs. The intention
is that it should be transparent to users. The code to support ASAP2
must first make its way upstream before it will appear in production
OpenStack deployments. We will follow its progress with interest.</p>
</div>
<div class="section" id="acknowledgements">
<h2>Acknowledgements</h2>
<p>Thanks go to Scientific Working Group colleagues Blair Bethwaite and
Mike Lowe for sharing information and guidance on what works for them.</p>
<p>Throughout the course of this investigation I worked closely with
Mellanox support and engineering teams, and I'm hugely grateful to
them for their input and assistance.</p>
</div>
StackHPC at Supercomputing 20162016-11-15T10:20:00+00:002016-11-15T15:40:00+00:00Stig Telfertag:www.stackhpc.com,2016-11-15:/stackhpc-at-supercomputing-2016.html<p class="first last">Momentum gathers for Scientific OpenStack and StackHPC</p>
<p>This week at <a class="reference external" href="http://sc16.supercomputing.org">Supercomputing 2016</a>
there has been a good deal of activity around
OpenStack. Working Group co-chair Blair Bethwaite put
together <a class="reference external" href="http://superuser.openstack.org/articles/openstack-supercomputing-2016/">a pick of the schedule's OpenStack highlights</a>,
and used the event to launch the book edition of Stig's OpenStack/HPC
studies, featuring user stories and case studies from many prominent
WG members.</p>
<blockquote>
<img alt="OpenStack/HPC book cover" src="//www.stackhpc.com/images/openstack-hpc-book-cover.png" style="width: 700px;" />
</blockquote>
<p>In testament to the momentum building behind using OpenStack for
research computing, the Scientific OpenStack BoF packed out the room,
and was even bigger than the Scientific OpenStack BoF at the OpenStack
summit in Barcelona. Topics discussed included the trade-offs between
virtualisation and containers for scientific use cases, and different
strategies for finding performance on Scientific OpenStack clouds.</p>
<blockquote>
<img alt="Scientific OpenStack BoF crowd" src="//www.stackhpc.com/images/sc2016-openstack-bof-crowd-small.jpg" style="width: 700px;" />
</blockquote>
<p>One interesting observation came from a show of hands on who in the
room was already using OpenStack - it turned out to be no more than
a significant minority. OpenStack already feels like it has arrived
with a bang in HPC, but there's substantially more interest out
there, and the clear implication is that even greater things are
yet to come.</p>
<p>Earlier in the day, the Indiana University booth hosted a series of
lightning talks, including from Stig and several Scientific WG members.
Blair Bethwaite gave an excellent and informative talk on the recent
studies he has done at Monash University on the overhead of virtualisation.</p>
<blockquote>
<img alt="OpenStack lightning talks" src="//www.stackhpc.com/images/sc2016-lightning-talks-small.jpg" style="width: 700px;" />
</blockquote>
<p>OpenStack's presence at the SC2016 conference was wrapped up with a
panel session featuring a great set of luminaries from the melting pot of
OpenStack and research computing:</p>
<ul>
<li><p class="first">Blair Bethwaite and Steve Quenette from Monash University in Melbourne,
Australia</p>
</li>
<li><p class="first">Jonathan Mills from NASA Goddard Space Flight Center</p>
</li>
<li><p class="first">Kate Keahey from University of Chicago and Argonne National Laboratory</p>
</li>
<li><p class="first">Mike Lowe from Indiana University</p>
</li>
<li><p class="first">Robert Budden from Pittsburgh Supercomputer Center</p>
<div class="figure">
<img alt="OpenStack HPC panel session" src="//www.stackhpc.com/images/sc2016-openstack-hpc-panel.jpg" style="width: 700px;" />
<p class="caption"><em>Photo with thanks to Chris Hoge, OpenStack Foundation</em></p>
</div>
</li>
</ul>
<p>Stig took the chair as moderator. With such a great panel it's no
surprise that many interesting points came up from the discussion,
including OpenStack's true capabilities for HPC, and the level of
investment in (but the ultimate value of) self-supported OpenStack.
There were also some insightful comments on a wishlist for future
OpenStack development.</p>
<p>Finally, we ended the conference with a WG dinner out, confirming
the trend that when it comes to socials, the Scientific WG is the
team to beat!</p>
<p>The OpenStack Foundation has also made a splash this week
with the launch of the <a class="reference external" href="https://www.openstack.org/science">Scientific OpenStack landing page</a>, which highlights the contributions
of the Scientific Working Group (including a free digital download of
the OpenStack/HPC book). The Scientific OpenStack landing page was
also promoted as the masthead banner graphic for the <a class="reference external" href="https://www.openstack.org">openstack.org home
page</a>. What an accolade!</p>
StackHPC at OpenStack Barcelona2016-10-28T10:20:00+01:002016-11-08T15:40:00+00:00Stig Telfertag:www.stackhpc.com,2016-10-28:/stackhpc-at-openstack-barcelona.html<p>Even in the fast-moving world of cloud compute, we don't see many weeks
like this one...</p>
<p>Although it has officially been running for
only six months, the <a class="reference external" href="http://wiki.openstack.org/wiki/Scientific_working_group">Scientific Working Group</a> took centre
stage at the opening keynote at the OpenStack summit in Barcelona.
Stig took to the stage (approx 19 minutes into the movie clip below) to
talk about the great value of the working group, and talked with Lauren
Sell about the forthcoming book written by Stig with expert contributions
by a large number of WG members.</p>
<div class="youtube"><iframe src="https://www.youtube.com/embed/eDcDe485DGk?start=19m00s" width="750" height="500" allowfullscreen seamless frameBorder="0"></iframe></div><p>The Scientific Working Group went on to
have a busy summit:</p>
<ul>
<li><p class="first">The <a class="reference external" href="https://www.openstack.org/videos/search?search=HPC%20%2F%20Research">HPC/Research speaker track</a>,
selected by track judges drawn from the WG, featured seven talks on
subjects relevant to research computing use cases.</p>
</li>
<li><p class="first">The <a class="reference external" href="https://etherpad.openstack.org/p/scientific-wg-barcelona-agenda">Scientific Working Group committee meeting</a>
was well attended and new directions were set for the coming development
cycle.</p>
</li>
<li><p class="first">The Scientific OpenStack BoF was great fun, with the prize for the
best lightning talk (sponsored by Dell/EMC) being won by Adam Huffman
from the Crick Institute in London.</p>
<img alt="Attendees at the Scientific OpenStack BoF" src="//www.stackhpc.com/images/barcelona-scientific-openstack-bof.jpg" style="width: 400px;" />
<img alt="George Mihaiescu at the Scientific OpenStack BoF" src="//www.stackhpc.com/images/barcelona-scientific-openstack-bof-george.jpg" style="width: 400px;" />
</li>
<li><p class="first"><a class="reference external" href="https://twitter.com/oneswig/status/791808362627366912">Scientific WG evening social</a> was attended
by 57 hungry research computing specialists, and generously subsidised
by Mellanox.</p>
</li>
</ul>
<p>What a great week!</p>
SuperUser Reports on OpenStack Foundation Visit to Cambridge2016-10-21T10:20:00+01:002016-10-21T18:40:00+01:00Stig Telfertag:www.stackhpc.com,2016-10-21:/superuser-reports-on-openstack-foundation-visit-to-cambridge.html<p>SuperUser Magazine has <a class="reference external" href="http://superuser.openstack.org/articles/cambridge-openstack/">published an article</a> about
Jonathan Bryce and Lauren Sell's recent visit to Cambridge after the great
fun <a class="reference external" href="//www.stackhpc.com/stackhpc-at-openstack-day-uk.html">we all had</a> at
<a class="reference external" href="https://openstackday.uk/">OpenStack Day UK 2016</a>.</p>
<p>This follows <a class="reference external" href="//www.stackhpc.com/openstack-historical-newton.html">our own article</a> on their visit and
we are very excited by the interest and affirmation that the Foundation
has shown in the scientific use case for OpenStack in recent months.</p>
A Little Fun in a Historical Context2016-10-12T10:20:00+01:002016-10-12T18:40:00+01:00Stig Telfertag:www.stackhpc.com,2016-10-12:/openstack-historical-newton.html<p class="first last">A Little Fun in a Historical Context</p>
<p>Following the great success of <a class="reference external" href="https://openstackday.uk/">OpenStack Day UK Bristol</a> (at which <a class="reference external" href="https://youtu.be/hkw810UqLuk">I presented for StackHPC</a>), I invited Jonathan Bryce and Lauren
Sell to call in and see the team at Cambridge University and talk
scientific compute on OpenStack. To our great delight, they accepted!</p>
<p>How do you entertain such VIP guests? We talked radio astronomy,
medical informatics and genomics. OpenStack's momentum in research
computing brings it to new territory, in which it could be argued that
OpenStack's supporting role in the advancement of science leads in some
small way to the advancement of humanity...</p>
<div class="section" id="cambridge-a-player-on-the-world-stage-for-800-years">
<h2>Cambridge: A Player on the World Stage for 800 Years</h2>
<p>To mark the occasion, we thought of a bit of historical fun. To coincide
with the release of OpenStack's 14th version, known as "Newton", we
went in search of some memorabilia of Isaac Newton, father of the laws
of motion, theories of gravity and co-creator of mathematical calculus.</p>
<p>At Trinity College, the Wren Library holds a precious first edition of
his <em>Principia Mathematica</em>. We were lucky in the timing of our
visit: that day they had received a third edition, dating from 1726,
which we could handle.</p>
<div class="figure">
<img alt="Principia Mathematica, first edition" src="/images/newton-principia_mathematica.jpg" style="width: 600px;" />
<p class="caption">Isaac Newton's first edition of Principia Mathematica, safely in its
case in the Wren Library, Trinity College, Cambridge</p>
</div>
<div class="figure">
<img alt="Principia Mathematica, first edition" src="/images/newton-jonathan_lauren.jpg" style="width: 600px;" />
<p class="caption">Lauren Sell and Jonathan Bryce from the OpenStack Foundation</p>
</div>
<div class="figure">
<img alt="Principia Mathematica, first edition" src="/images/newton-group.jpg" style="width: 600px;" />
<p class="caption">L-R Lauren Sell, Stig Telfer, Jonathan Bryce, Paul Calleja, John Taylor</p>
</div>
</div>
<div class="section" id="acknowledgements">
<h2>Acknowledgements</h2>
<p>With thanks to the Master and Fellows of Trinity College, Cambridge.</p>
</div>
OpenStack and High Performance Data2016-10-10T10:20:00+01:002016-10-12T18:40:00+01:00Stig Telfertag:www.stackhpc.com,2016-10-10:/openstack-and-high-performance-data.html<p class="first last">OpenStack and High Performance Data</p>
<div class="section" id="id1">
<h2>OpenStack and High Performance Data</h2>
<p>What can data requirements mean in an HPC context? The range of use cases
is almost boundless. With considerable generalisation we can consider
some broad criteria for requirements, which expose the inherent tensions
between HPC-centric and cloud-centric storage offerings:</p>
<ul class="simple">
<li>The <strong>data access</strong> model: data objects could be stored and retrieved
using file-based, block-based, object-based or stream-based access.
HPC storage tends to focus on a model of file-based shared data storage
(with an emerging trend for object-based storage proposed for achieving
new pinnacles of scalability). Conversely cloud infrastructure favours
block-based storage models, often backed with and extended by object-based
storage. Support for data storage through shared filesystems is still
maturing in OpenStack.</li>
<li>The <strong>data sharing</strong> model: applications may request the same data
from many clients, or the clients may make data accesses that are
segregated from one another. This distinction can have significant
consequences for storage architecture. Cloud storage and HPC storage
are both highly distributed, but often differ in the way in which data
access is parallelised. Providing high-performance access for many
clients to a shared dataset can be a niche requirement specific to HPC.
Cloud-centric storage architectures typically focus on delivering high
aggregate throughput on many discrete data accesses.</li>
<li>The level of <strong>data persistence</strong>. An HPC-style tiered data storage
architecture does not need to incorporate data redundancy at every level
of the hierarchy. This can improve performance for tiers caching data
closer to the processor.</li>
</ul>
<p>The cloud model offers capabilities that enable new possibilities for HPC:</p>
<ul class="simple">
<li><strong>Automated provisioning</strong>. Software-defined infrastructure automates the
provisioning and configuration of compute resources, including storage.
Users and group administrators are able to create and configure storage
resources to their specific requirements at the exact time they are
needed.</li>
<li><strong>Multi-tenancy</strong>. HPC storage does not offer multi-tenancy with the level
of segregation that cloud can provide. A virtualised storage resource
can be reserved for the private use of a single user, or could be shared
between a controlled group of collaborating users, or could even be
accessible by all users.</li>
<li><strong>Data isolation</strong>. Sensitive data requires careful data management.
Medical informatics workloads may contain patient genomes. Engineering
simulations may contain data that is trade secret. OpenStack’s
segregation model is stronger than ownership and permissions on a
POSIX-compliant shared filesystem, and also provides finer-grained
access control.</li>
</ul>
<p>There is clear value in increased flexibility - but at what cost in
performance? In more demanding environments, HPC storage tends to focus
on and be tuned for delivering the requirements of a confined subset
of workloads. This is the opposite approach to the conventional cloud
model, in which assumptions may not be possible about the storage access
patterns of the supported workloads.</p>
<p>This study will describe some of these divergences in greater detail, and
demonstrate how OpenStack can integrate with HPC storage infrastructure.
Finally some methods of achieving high performance data management on
cloud-native storage infrastructure will be discussed.</p>
</div>
<div class="section" id="file-based-data-hpc-parallel-filesystems-in-openstack">
<h2>File-based Data: HPC Parallel Filesystems in OpenStack</h2>
<p>Conventionally in HPC, file-based data services are delivered
by parallel filesystems such as Lustre and Spectrum Scale (GPFS).
A parallel filesystem is a shared resource. Typically it is mounted on
all compute nodes in a system and available to all users of a system.
Parallel filesystems excel at providing low-latency, high-bandwidth
access to data.</p>
<p>Parallel filesystems can be integrated into an OpenStack environment in
a variety of configuration models.</p>
<div class="section" id="provisioned-client-model">
<h3>Provisioned Client Model</h3>
<p>Access to an external parallel filesystem is provided through an OpenStack
provider network. OpenStack compute instances - virtualised or bare
metal - mount the site filesystem as clients.</p>
<p>This use case is fairly well established. In the virtualised use case,
performance is achieved through use of SR-IOV (with only a moderate
level of overhead). In the case of Lustre, with a layer-2 VLAN provider
network the o2ib client drivers can use RoCE to perform Lustre data
transport using RDMA.</p>
<p>Cloud-hosted clients on a parallel filesystem raise issues around root
access in a cloud compute context. On cloud infrastructure, privileged accesses
from a client do not have the same degree of trust as on conventional HPC
infrastructure. Lustre approaches this issue by introducing Kerberos
authentication for filesystem mounts and subsequent file accesses.
Kerberos credentials for Lustre filesystems can be supplied to OpenStack
instances upon creation as instance metadata.</p>
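<p>As a sketch of that last point, instance metadata is easy to attach at
creation time with the openstacksdk Python library. The metadata keys and
the keytab distribution scheme below are purely illustrative:</p>
<pre class="literal-block">
# Hypothetical sketch: pass Kerberos credentials for a Lustre mount to a
# new instance as Nova metadata, consumed by cloud-init on first boot.
import openstack

conn = openstack.connect(cloud='mycloud')

server = conn.compute.create_server(
    name='lustre-client-1',
    image_id=conn.image.find_image('centos7-lustre').id,
    flavor_id=conn.compute.find_flavor('m1.large').id,
    networks=[{'uuid': conn.network.find_network('lustre-provider-net').id}],
    # Metadata is exposed to the guest through the metadata service.
    # Values are limited in size, so reference a keytab rather than
    # embedding one.
    metadata={
        'krb5_principal': 'lustre/client1@EXAMPLE.COM',
        'krb5_keytab_url': 'https://keyserver.example.com/client1.keytab',
    },
)
</pre>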
</div>
<div class="section" id="provisioned-filesystem-model">
<h3>Provisioned Filesystem Model</h3>
<p>There are use cases where the dynamic provisioning of software-defined
parallel filesystems has considerable appeal. There have been
proof-of-concept demonstrations of provisioning Lustre filesystems from
scratch using OpenStack compute, storage and network resources.</p>
<p>The OpenStack Manila project aims to provision and manage shared
filesystems as an OpenStack service. IBM’s Spectrum Scale integrates
with Manila to re-export GPFS parallel filesystems using the user-space
Ganesha NFS server.</p>
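<p>At the API level, requesting a share from Manila is pleasingly simple.
Here is a sketch using the python-manilaclient library; the Keystone
endpoint, credentials, sizes and names are hypothetical:</p>
<pre class="literal-block">
# Sketch: create a 100 GB NFS share through Manila and grant access to a
# client subnet. Manila delegates the actual export to its configured
# backend driver (for example the GPFS/Ganesha driver described above).
from keystoneauth1.identity import v3
from keystoneauth1 import session
from manilaclient import client as manila_client

auth = v3.Password(auth_url='https://keystone.example.com:5000/v3',
                   username='demo', password='secret', project_name='demo',
                   user_domain_name='Default', project_domain_name='Default')
manila = manila_client.Client('2', session=session.Session(auth=auth))

share = manila.shares.create(share_proto='NFS', size=100, name='dataset1')

# Allow read/write access from a tenant subnet; instances then mount the
# resulting export path like any other NFS filesystem.
manila.shares.allow(share.id, access_type='ip',
                    access='10.0.0.0/24', access_level='rw')
</pre>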
<p>Currently these projects demonstrate functionality over performance.
In future evolutions the overhead of dynamically provisioned parallel
filesystems on OpenStack infrastructure may be reduced.</p>
</div>
<div class="section" id="a-parallel-data-substrate-for-openstack-services">
<h3>A Parallel Data Substrate for OpenStack Services</h3>
<p>IBM positions Spectrum Scale as a distributed data service for
underpinning OpenStack services such as Cinder, Glance, Swift and Manila.
More information about using Spectrum Scale in this manner can be found
in IBM Research’s red paper on the subject (listed in the Further
Reading section).</p>
</div>
</div>
<div class="section" id="applying-hpc-technologies-to-enhance-data-io">
<h2>Applying HPC Technologies to Enhance Data IO</h2>
<p>A recurring theme throughout this study has been the use of remote DMA
for efficient data transfer in HPC environments. The advantages of this
technology are especially pertinent in data intensive environments.
OpenStack’s flexibility enables the introduction of RDMA protocols
for many cloud infrastructure operations to reduce latency, increase
bandwidth and enhance processor efficiency:</p>
<p>Cinder block data IO can be performed using iSER (iSCSI extensions
for RDMA). iSER is a drop-in replacement for iSCSI that is easy to
configure and set up. Through providing tightly-coupled IO resources
using RDMA technologies, the functional equivalent of HPC-style burst
buffers can be added to the storage tiers of cloud infrastructure.</p>
<p>Ceph data transfers can be performed using the Accelio RDMA transport.
This technology was demonstrated some years ago but does not appear
to have achieved production levels of stability or gained significant
mainstream adoption.</p>
<p>The NOWLAB group at Ohio State University have developed extensions to
data analytics platforms such as HBase, Hadoop, Spark and Memcached to
optimise data movements using RDMA.</p>
</div>
<div class="section" id="optimising-ceph-storage-for-data-intensive-workloads">
<h2>Optimising Ceph Storage for Data-Intensive Workloads</h2>
<p>The versatility of Ceph embodies the cloud-native approach to storage,
and consequently Ceph has become a popular choice of storage technology
for OpenStack infrastructure. A single Ceph deployment can support
various protocols and data access models.</p>
<p>Ceph is capable of delivering strong read bandwidth. For large reads
from OpenStack block devices, Ceph is able to parallelise the delivery
of the read data across multiple OSDs.</p>
<p>Ceph’s data consistency model commits writes to multiple OSDs before
a write transaction is completed. By default a write is replicated
three times. This can result in higher latency and lower performance
on write bandwidth.</p>
<p>Ceph can run on clusters of commodity hardware configurations. However,
in order to maximise the performance (or price performance) of a Ceph
cluster some design rules of thumb can be applied:</p>
<p>Use separate physical network interfaces for external storage network and
internal storage management. On the NICs and switches, enable Ethernet
flow control and raise the MTU to support jumbo frames.</p>
<p>Each drive used for Ceph storage is managed by an OSD process.
A Ceph storage node usually contains multiple drives (and multiple
OSD processes).</p>
<p>The best price/performance and highest density is achieved using fat
storage nodes, typically containing 72 HDDs. These work well for
large scale deployments, but can lead to very costly units of failure
in smaller deployments. Node configurations of 12-32 HDDs are usually
found in deployments of intermediate scale.</p>
<p>Ceph storage nodes usually contain a higher-speed write journal, which is
dedicated to servicing a number of HDDs. An SSD journal can typically
feed 6 HDDs while an NVMe flash device can typically feed up to 20 HDDs.</p>
<p>About 10G of external storage network bandwidth balances the read
bandwidth of up to 15 HDDs. The internal storage management network
should be similarly scaled.</p>
<p>A rule of thumb for RAM is to provide 0.5GB-1GB of RAM per TB per
OSD daemon.</p>
<p>On multi-socket storage nodes, close attention should be paid to NUMA
considerations. The PCI storage devices attached to each socket should
be working together. Journal devices should be connected with HDDs
attached to HBAs on the same socket. IRQ affinity should be confined
to cores on the same socket. Associated OSD processes should be pinned
to the same cores.</p>
<p>For tiered storage applications in which data can be regenerated from
other storage, the replication count can safely be reduced from 3 to
2 copies.</p>
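<p>Taken together, these rules of thumb reduce to some simple arithmetic.
The sketch below encodes the ratios above for a hypothetical storage node;
the numbers are starting points for a design, not a substitute for
benchmarking:</p>
<pre class="literal-block">
# Back-of-the-envelope Ceph storage node sizing, using the rules of thumb
# described above. All ratios are approximate.

def size_ceph_node(hdd_count, hdd_tb, journal='ssd'):
    hdds_per_journal = {'ssd': 6, 'nvme': 20}[journal]
    journal_devices = -(-hdd_count // hdds_per_journal)  # ceiling division

    # ~10G of external network bandwidth per 15 HDDs of read bandwidth;
    # the internal replication network should be similarly scaled.
    external_net_gbits = 10 * hdd_count / 15.0

    # 0.5 GB - 1 GB of RAM per TB per OSD daemon: take the upper bound.
    ram_gb = hdd_count * hdd_tb * 1.0

    return {'osds': hdd_count,
            'journal_devices': journal_devices,
            'external_net_gbits': external_net_gbits,
            'ram_gb': ram_gb}

# An intermediate-scale node of 24 x 4TB HDDs with SSD journals:
print(size_ceph_node(24, 4, journal='ssd'))
# {'osds': 24, 'journal_devices': 4, 'external_net_gbits': 16.0, 'ram_gb': 96.0}
</pre>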
</div>
<div class="section" id="the-cancer-genome-collaboratory-large-scale-genomics-on-openstack">
<h2>The Cancer Genome Collaboratory: Large-scale Genomics on OpenStack</h2>
<p>Genome datasets can be hundreds of terabytes in size, sometimes requiring
weeks or months to download and significant resources to store and
process.</p>
<img alt="OICR logo" class="align-right" src="//www.stackhpc.com/images/high_performance_data-oicr_logo.jpg" style="width: 300px;" />
<p>The Ontario Institute for Cancer Research built the Cancer Genome
Collaboratory (or simply The Collaboratory) as a biomedical research
resource built upon OpenStack infrastructure. The Collaboratory aims
to facilitate research on the world’s largest and most comprehensive
cancer genome dataset, currently produced by the International Cancer
Genome Consortium (ICGC).</p>
<p>By making the ICGC data available in cloud compute form in the
Collaboratory, researchers can bring their analysis methods to the cloud,
yielding benefits from the high availability, scalability and economy
offered by OpenStack, and avoiding the large investment in compute
resources and the time needed to download the data.</p>
<div class="section" id="an-openstack-architecture-for-genomics">
<h3>An OpenStack Architecture for Genomics</h3>
<p>The Collaboratory’s requirements for the project were to build a cloud
computing environment providing 3000 compute cores and 10-15 PB of raw
data stored in a scalable and highly-available storage. The project
has also met constraints of budget, data security, confined data centre
space, power and connectivity. In selecting the storage architecture,
capacity was considered to be more important than latency and performance.</p>
<p>Each rack hosts 16 compute nodes using 2U high-density chassis, and
between 6 and 8 Ceph storage nodes. Hosting a mix of compute and storage
nodes in each rack keeps some of the Nova-Ceph traffic in the same rack,
while also lowering the power requirement for these high density racks
(2 x 60A circuits are provided to each rack).</p>
<p>As of September 2016, Collaboratory has 72 compute nodes (2600 CPU
cores, Hyper-Threaded) with a physical configuration optimized for large
data-intensive workflows: 32 or 40 CPU cores and a large amount of RAM
(256 GB per node). The workloads make extensive use of high performance
local disk, incorporating hardware RAID10 across 6 x 2TB SAS drives.</p>
<p>The networking is provided by Brocade ICX 7750-48C top-of-rack switches
that use 6x40Gb cables to interconnect the racks in a ring stack topology,
providing 240 Gbps non-blocking redundant inter-rack connectivity,
at a 2:1 oversubscription ratio.</p>
<p>The Collaboratory is deployed using entirely community-supported free
software. The OpenStack control plane is Ubuntu 14.04 and deployment
configuration is based on Ansible. The Collaboratory was initially
deployed using OpenStack Juno and a year later upgraded to Kilo and
then Liberty.</p>
<p>Collaboratory deploys a standard HA stack based on HAProxy/Keepalived and
MariaDB Galera using three controller nodes. The controller nodes also
perform the role of Ceph-mon and Neutron L3-agents, using three separate
RAID1 sets of SSD drives for the MySQL, Ceph-mon and MongoDB processes.</p>
<p>The compute nodes have 10G Ethernet with GRE and SDN capabilities
for virtualized networking. The Ceph nodes use 2x10G NICs bonded for
client traffic and 2x10G NICs bonded for storage replication traffic.
The Controller nodes have 4x10G NICs in an active-active bond (802.3ad)
using layer3+4 hashing for better link utilisation. The OpenStack tenant
routers are highly-available with two routers distributed across the three
controllers. The configuration does not use Neutron DVR out of concern
for limiting the number of servers directly attached to the Internet.
The public VLAN is carried only on the trunk ports facing the controllers
and the monitoring server.</p>
</div>
<div class="section" id="optimising-ceph-for-genomics-workloads">
<h3>Optimising Ceph for Genomics Workloads</h3>
<p>Upon workload start, the instances usually download data stored in Ceph's
object storage. OICR developed a download client that controls access
to sensitive ICGC protected data through managed tokens. Downloading a
100GB file stored in Ceph takes around 18 minutes, with another 10-12
minutes used to automatically check its integrity (md5sum), and is mostly
limited by the instance’s local disk.</p>
<p>The ICGC storage system adds a layer of control on top of Ceph’s
object storage. Currently this is a 2-node cluster behind an HAProxy
instance serving the ICGC storage client. The server component uses
OICR’s authorization and metadata systems to provide secure access to
related objects stored in Ceph. By using OAuth-based access tokens,
researchers can be given access to the Ceph data without having to
configure Ceph permissions. Access to individual project groups can
also be implemented in this layer.</p>
<p>Each Ceph storage node consists of 36 OSD drives (4, 6 or 8 TB) in
a large Ceph cluster currently providing 4 PB of raw storage, using
three-replica pools. The radosgw pool accounts for 90% of the Ceph space,
reserved for storing protected ICGC datasets, including the very large
whole genome aligned reads for almost 2000 donors. The remaining 10% of
Ceph space is used as a scalable and highly-available backend for Glance
and Cinder. Ceph radosgw was tuned for the specific genomic workloads,
mostly by increasing read-ahead on the OSD nodes, and by setting a 65 MB
RADOS object stripe size for radosgw and 8 MB for RBD.</p>
</div>
<div class="section" id="further-considerations-and-future-directions">
<h3>Further Considerations and Future Directions</h3>
<p>In the course of the development of the OpenStack infrastructure at the
Collaboratory, several issues have been encountered and addressed:</p>
<p>The instances used in cancer research are usually short lived
(hours/days/weeks), but with high resource requirements in terms of CPU
cores, memory and disk allocation. As a consequence of this pattern of
usage the Collaboratory OpenStack infrastructure does not support live
migration as a standard operating procedure.</p>
<p>The Collaboratory have encountered a few problems caused by Radosgw bugs
involving overlapping multipart uploads. However, these were detected by
the Collaboratory’s monitoring system, and did not result in data loss.
The Collaboratory created a monitoring system that uses automated Rally
tests to monitor end-to-end functionality, and also downloads a random
large S3 object (around 100 GB) to confirm data integrity and monitor
object storage performance.</p>
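<p>The shape of that end-to-end check is simple to sketch against the S3
API exposed by radosgw. The endpoint, credentials, bucket and expected
digest below are hypothetical:</p>
<pre class="literal-block">
# Sketch: download an object from the radosgw S3 endpoint and compare its
# MD5 digest against the value recorded when the object was uploaded.
import hashlib

import boto3

s3 = boto3.client('s3',
                  endpoint_url='https://radosgw.example.com',
                  aws_access_key_id='ACCESS_KEY',
                  aws_secret_access_key='SECRET_KEY')

def object_is_intact(bucket, key, expected_md5, chunk=8 * 1024 * 1024):
    md5 = hashlib.md5()
    body = s3.get_object(Bucket=bucket, Key=key)['Body']
    # Stream the object in chunks so a ~100 GB download never needs to be
    # held in memory.
    for data in iter(lambda: body.read(chunk), b''):
        md5.update(data)
    return md5.hexdigest() == expected_md5

# Placeholder digest; in practice this comes from the upload manifest.
ok = object_is_intact('monitoring-bucket', 'large-test-object',
                      'd41d8cd98f00b204e9800998ecf8427e')
print('integrity ok' if ok else 'integrity check FAILED')
</pre>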
<p>Because of the mix of very large (BAM), medium (VCF) and very small
(XML, JSON) files, the Ceph OSD nodes have imbalanced load, and the
Collaboratory has to regularly monitor and rebalance data.</p>
<p>Currently, the Collaboratory is hosting 500TB of data from 2,000 donors.
Over the next 2 years, OICR will increase the number of ICGC genomes
available in the Collaboratory, with the goal of having the entire ICGC
data set of 25,000 donors estimated to be 5PB when the project completes
in 2018.</p>
<p>Although the Collaboratory is in a closed beta phase, with only a few
research labs having accounts, more than 19,000 instances were started in
the last 18 months, and almost 7,000 in the last three months. One project that
uses the Collaboratory heavily is the PanCancer Analysis of Whole Genomes
(PCAWG), which characterizes the somatic and germline variants from
over 2,800 ICGC cancer whole genomes in 20 primary tumour sites.</p>
<p>In conclusion, the Collaboratory environment has been running well for
OICR and its partners. George Mihaiescu, senior cloud architect at OICR,
has many future plans for OpenStack and the Collaboratory:</p>
<blockquote>
“We hope to add new Openstack projects to the Collaboratory’s offering
of services, with Ironic and Heat being the first candidates. We would
also like to provide new compute node configurations with RAID0 instead
of RAID10, or even SSD based local storage for improved IO performance.”</blockquote>
</div>
</div>
<div class="section" id="climb-openstack-parallel-filesystems-and-microbial-bioinformatics">
<h2>CLIMB: OpenStack, Parallel Filesystems and Microbial Bioinformatics</h2>
<p>The Cloud Infrastructure for Microbial Bioinformatics (CLIMB) is a
collaboration between four UK universities (Swansea, Warwick, Cardiff
and Birmingham) and funded by the UK’s Medical Research Council.
CLIMB provides compute and storage as a free service to academic
microbiologists in the UK. After an extended period of testing, the
CLIMB service was formally launched in July 2016.</p>
<img alt="CLIMB hardware" class="align-right" src="//www.stackhpc.com/images/high_performance_data-climb.jpg" style="width: 400px;" />
<p>CLIMB is a federation of 4 sites, configured as OpenStack regions.
Each site has an approximately equivalent configuration of compute nodes,
network and storage.</p>
<p>The compute node hardware configuration is tailored to support the
memory-intensive demands of bioinformatics workloads. The system as
a whole comprises 7680 CPU cores, in fat 4-socket compute nodes with
512GB RAM. Each site also has three large memory nodes with 3TB of RAM
and 192 hyper-threaded cores.</p>
<p>The infrastructure is managed and deployed using xCAT cluster management
software. The system runs the Kilo release of OpenStack, with packages
from the RDO distribution. Configuration management is automated
using Salt.</p>
<p>Each site has 500 TB of GPFS storage. Every hypervisor is a GPFS client,
and uses an InfiniBand fabric to access the GPFS filesystem. GPFS is
used for scratch storage space in the hypervisors.</p>
<p>For longer term data storage, to share datasets and VMs, and to provide
block storage for running VMs, CLIMB deploys a storage solution based
on Ceph. The Ceph storage is replicated between sites. Each site has 27
Dell R730XD nodes for Ceph storage servers. Each storage server contains
16x 4TB HDDs for Ceph OSDs, giving a total raw storage capacity of 6912TB.
After 3-way replication this yields a usable capacity of 2304TB.</p>
<p>On two sites Ceph is used as the storage back end for Swift, Cinder
and Glance. At Birmingham GPFS is used for Cinder and Glance, with
plans to migrate to Ceph.</p>
<p>In addition to the InfiniBand network, a Brocade 10G Ethernet fabric is
used, in conjunction with dual-redundant Brocade Vyatta virtual routers
to manage cross-site connectivity.</p>
<p>In the course of deploying and trialling the CLIMB system, a number of
issues have been encountered and overcome.</p>
<ul class="simple">
<li>The Vyatta software routers were initially underperforming with
consequential impact on inter-site bandwidth.</li>
<li>Some performance issues have been encountered due to NUMA topology
awareness not being passed through to VMs.</li>
<li>Driver stability problems with the controllers' Broadcom 10GBaseT NICs
led to reliability issues. (Thankfully the HA failover mechanisms were
found to work as required.)</li>
<li>Problems with interactions between Ceph and Dell hardware RAID cards.</li>
<li>Issues with InfiniBand and GPFS configuration.</li>
</ul>
<p>CLIMB has future plans for developing their OpenStack infrastructure,
including:</p>
<ul class="simple">
<li>Migrating from regions to Nova cells as the federation model between
sites.</li>
<li>Integrating OpenStack Manila for exporting shared filesystems from
GPFS to guest VMs.</li>
</ul>
</div>
<div class="section" id="further-reading">
<h2>Further Reading</h2>
<p>An IBM research study on integrating GPFS
(Spectrum Scale) within OpenStack environments:
<a class="reference external" href="http://www.redbooks.ibm.com/redpapers/pdfs/redp5331.pdf">http://www.redbooks.ibm.com/redpapers/pdfs/redp5331.pdf</a></p>
<p>A 2015 presentation from ATOS on using Kerberos authentication in Lustre:
<a class="reference external" href="http://cdn.opensfs.org/wp-content/uploads/2015/04/Lustre-and-Kerberos_Buisson.pdf">http://cdn.opensfs.org/wp-content/uploads/2015/04/Lustre-and-Kerberos_Buisson.pdf</a></p>
<p>Glyn Bowden of HPE and Alex Macdonald from SNIA discuss OpenStack
storage (including the Provisioned Filesystem Model using Lustre):
<a class="reference external" href="https://www.brighttalk.com/webcast/663/168821">https://www.brighttalk.com/webcast/663/168821</a></p>
<p>The High-Performance Big Data team at Ohio State University:
<a class="reference external" href="http://hibd.cse.ohio-state.edu">http://hibd.cse.ohio-state.edu</a></p>
<p>A useful talk from the 2016 Austin OpenStack Summit on Ceph design:
<a class="reference external" href="https://www.openstack.org/videos/video/designing-for-high-performance-ceph-at-scale">https://www.openstack.org/videos/video/designing-for-high-performance-ceph-at-scale</a></p>
<p>The Ontario Institute for Cancer Research Collaboratory:
<a class="reference external" href="http://www.cancercollaboratory.org">http://www.cancercollaboratory.org</a></p>
<p>Further details on the International Cancer Genome Consortium:
<a class="reference external" href="http://icgc.org/">http://icgc.org/</a></p>
<p>Dr Tom Connor presented CLIMB at the 2016 Austin OpenStack summit:
<a class="reference external" href="https://www.openstack.org/videos/video/the-cloud-infrastructure-for-microbial-bioinformatics-breaking-biological-silos-using-openstack">https://www.openstack.org/videos/video/the-cloud-infrastructure-for-microbial-bioinformatics-breaking-biological-silos-using-openstack</a></p>
</div>
<div class="section" id="acknowledgements">
<h2>Acknowledgements</h2>
<p>This document was written by Stig Telfer of StackHPC Ltd with the support
of Cambridge University, with contributions, guidance and feedback from
subject matter experts:</p>
<ul class="simple">
<li><strong>George Mihaiescu</strong>, <strong>Bob Tiernay</strong>, <strong>Andy Yang</strong>, <strong>Junjun Zhang</strong>,
<strong>Francois Gerthoffert</strong>, <strong>Christina Yung</strong>, <strong>Vincent Ferretti</strong>
from the Ontario Institute for Cancer Research. The authors wish
to acknowledge the funding support from the Discovery Frontiers:
Advancing Big Data Science in Genomics Research program (grant
no. RGPGR/448167-2013, ‘The Cancer Genome Collaboratory’), which
is jointly funded by the Natural Sciences and Engineering Research
Council (NSERC) of Canada, the Canadian Institutes of Health Research
(CIHR), Genome Canada, and the Canada Foundation for Innovation (CFI),
and with in-kind support from the Ontario Research Fund of the Ministry
of Research, Innovation and Science.</li>
<li><strong>Dr Tom Connor</strong> from Cardiff University and the CLIMB collaboration.</li>
</ul>
<div class="figure">
<img alt="Creative commons licensing" src="//www.stackhpc.com/images/cc-by-sa.png" style="width: 100px;" />
<p class="caption">This document is provided as open source with a Creative Commons license
with Attribution + Share-Alike (CC-BY-SA)</p>
</div>
</div>
StackHPC at OpenStack Day UK2016-09-15T10:20:00+01:002016-09-15T18:40:00+01:00Stig Telfertag:www.stackhpc.com,2016-09-15:/stackhpc-at-openstack-day-uk.html<p><a class="reference external" href="https://openstackday.uk/">OpenStack Day UK 2016</a>, was held at the HPE campus in
StackHPC's home town of Bristol.</p>
<p>Stig presented a lightning talk on high-performance IO for "hot data", and then
presented the <a class="reference external" href="https://openstackday.uk/stig-telfer/">closing session</a> of the main conference.</p>
<p>The main presentation is available on YouTube here:</p>
<div class="youtube"><iframe src="https://www.youtube.com/embed/hkw810UqLuk" width="750" height="500" allowfullscreen seamless frameBorder="0"></iframe></div>OpenStack and HPC Workload Management2016-09-12T10:20:00+01:002016-10-07T18:40:00+01:00Stig Telfertag:www.stackhpc.com,2016-09-12:/openstack-and-hpc-workloads.html<p class="first last">OpenStack and HPC Workload Management</p>
<div class="section" id="workload-management-in-hpc-and-cloud">
<h2>Workload Management in HPC and Cloud</h2>
<p>The approach taken for managing workloads is a major difference between
conventional use cases of HPC and cloud.</p>
<p>A conventional approach to HPC workload management is likely to involve
one or more head nodes of an HPC cluster for login, development,
compilation and job submission services. Parallel workloads would
be submitted from a head node to job batch queues of the workload
manager, which control access to parallel partitions of compute nodes.
Such partitions may equate to mappings of types of compute nodes and the
specific resources (CPU, memory, storage and networking) that applications
require. Each compute node runs a workload manager agent which configures
resources, launches application processes and monitors utilisation.</p>
<div class="section" id="pain-points-in-conventional-hpc-workload-management">
<h3>Pain Points in Conventional HPC Workload Management</h3>
<p>On a large multi-user HPC system the login node is a continual source
of noisy neighbour problems. Inconsiderate users may, for example,
consume system resources by performing giant compilations with wide
task parallelism, open giant logfiles from their task executions,
or run recursive finds across the filesystem to look for forgotten files.</p>
<p>An HPC system must often support a diverse mix of workloads. Different
workloads may have a wide range of dependencies. With increasing
diversity comes an increasing test matrix, which increases the toil
involved in making any changes. How can an administrator be sure of the
effects of any change to the software packages installed? What must
be done to support a new version of an ISV application? What are the
side-effects of updating the version of a dependency? What if a security
update leads to a dependency conflict? As the flexibility of an HPC
software environment grows, so too does the complexity of maintaining it.</p>
<p>In an environment where data is sensitive, local scratch space and
parallel filesystems for HPC workloads can often have default access
permissions with an undesirable level of openness. Data security can be
problematic in a shared HPC resource in which the tenants are not trusted.</p>
</div>
<div class="section" id="the-case-for-workload-management-on-openstack-infrastructure">
<h3>The Case for Workload Management on OpenStack Infrastructure</h3>
<p>The flexibility of OpenStack can ease a number of pain points of HPC
cluster administration:</p>
<ul class="simple">
<li>With software-defined OpenStack infrastructure, a new compute node or
head node is created through software processes - not a trip to the
data centre. Through intelligent orchestrated automated provisioning,
the administrative burden of managing changes to resource configuration
can be eliminated. And from a user’s perspective, a self-service
process for resizing their resource allocation is much more responsive
and devolves control to the user.</li>
<li>Through OpenStack it becomes a simple process to automatically provision
and manage any number of login nodes and compute nodes. The multi-tenancy
access control of cloud infrastructure ensures that compute resources
allocated to a project are only seen and accessible to members of that
project. OpenStack does not pretend to change the behaviour of noisy
neighbours, but it helps to remove the strangers from a neighbourhood.</li>
<li>OpenStack’s design ethos is the embracing (not replacing) of data
centre diversity. Supporting a diverse mix of HPC workloads is not
materially different from supporting the breadth of cloud-native
application platforms. One of the most significant advances of cloud
computing has been in the effective management of software images. Once a
user project has dedicated workload management resources allocated to it,
the software environment of those compute resources can be tailored to
the specific needs of that project without infringing on any conflicting
requirements of other users.</li>
<li>The cloud multi-tenancy implemented by OpenStack enforces segregation
so that tenants are only visible to one another through the interfaces
that they choose to expose. The isolation of tenants applies to all
forms of resources - compute, networking and storage. The fine-grained
control over what is shared (and what is not shared) results in greater
data security than a conventional multi-user HPC system.</li>
</ul>
<p>All of this can be done using conventional HPC infrastructure and
conventional management techniques, but to do so would demand using
industry best practices as a baseline, and require the continual
attention of a number of competent system administrators to keep it
running smoothly, securely and to the satisfaction of the users.</p>
<p>Organisations working on the convergence of HPC and cloud often refer
to this subject as cluster-as-a-service. How can a cloud resource
be equipped with the interfaces familiar to users of batch-queued
conventional HPC resources?</p>
</div>
</div>
<div class="section" id="delivering-an-hpc-platform-upon-openstack-infrastructure">
<h2>Delivering an HPC Platform upon OpenStack Infrastructure</h2>
<p>HPC usually entails a platform, not an infrastructure. How is OpenStack
orchestrated to provision an HPC cluster and workload manager?</p>
<p>Addressing this market are proprietary products and open source projects.
The tools available in the OpenStack ecosystem also ensure that a
home-grown cluster orchestration solution is readily attainable.
An example of each approach is included here.</p>
<p>Broadly the cluster deployment workflow would follow these steps:</p>
<ol class="arabic simple">
<li>The creation of the HPC cluster can be instigated through the command
line. In some projects a custom panel for managing clusters is added
to Horizon, the OpenStack web dashboard.</li>
<li>Resources for the cluster must be allocated from the OpenStack
infrastructure. Compute node instances, networks, ports, routers,
images and volumes must all be assigned to the new cluster.</li>
<li>One or more head nodes must be deployed to manage the cluster node
instances, provide access for end users and workload management.
The head node may boot a customised image (or volume snapshot) with
the HPC cluster management software installed. Alternatively, it may
boot a stock cloud image and install the required software packages as
a secondary phase of deployment.</li>
<li>Once the head nodes are deployed with base OS and HPC cluster
management packages, an amount of site-specific and deployment-specific
configuration must be applied. This can be delivered through instance
metadata or a configuration management language such as Ansible or Puppet.
A Heat-orchestrated deployment can use a combination of instance metadata
and a configuration management tool (usually Puppet, though more recently
Ansible also provides such capability).</li>
<li>A number of cluster node instances must be deployed. The process of
node deployment can follow different paths. Typically the cluster nodes
would be deployed in the same manner as the head nodes by booting from
OpenStack images or volumes, and applying post-deployment configuration.</li>
<li>The head nodes and cluster nodes will share one or more networks, and
the cluster nodes will register with the HPC workload management service
deployed on the head nodes.</li>
</ol>
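<p>As a toy illustration of steps 2, 3 and 5, the openstacksdk cloud layer
can allocate the instances and pass cloud-init user-data standing in for
the post-deployment configuration phase. The image, flavor, network and
package names below are hypothetical:</p>
<pre class="literal-block">
# Toy cluster-as-a-service sketch: boot a head node and a set of compute
# nodes, with cloud-init user-data performing basic package installation.
# Real deployments would layer Heat, Ansible or Puppet on top of this.
import openstack

conn = openstack.connect(cloud='mycloud')

HEAD_USERDATA = """#cloud-config
packages: [slurm-slurmctld]
"""

NODE_USERDATA = """#cloud-config
packages: [slurm-slurmd]
"""

# Head node: login services and the workload manager controller.
head = conn.create_server('cluster-head', image='centos7',
                          flavor='m1.large', network='cluster-net',
                          userdata=HEAD_USERDATA, wait=True)

# Cluster compute nodes, which register with the head node's workload
# management service once cloud-init completes.
nodes = [conn.create_server('cluster-node-%d' % i, image='centos7',
                            flavor='m1.xlarge', network='cluster-net',
                            userdata=NODE_USERDATA)
         for i in range(4)]
</pre>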
<div class="section" id="open-platforms-for-cluster-as-a-service">
<h3>Open Platforms for Cluster-as-a-Service</h3>
<p>The simplest implementation is arguably ElastiCluster, developed and
released as GPL open source by a research computing services group at
the University of Zurich. ElastiCluster supports OpenStack, Google
Compute Engine and Amazon EC2 as back-end cloud infrastructure, and
can deploy (among others) clusters offering SLURM, Grid Engine, Hadoop,
Spark and Ceph.</p>
<p>ElastiCluster is somewhat simplistic and its capabilities are
comparatively limited. For example, it doesn't currently support Keystone
v3 authentication - a requirement for deployments where a private
cloud is divided into a number of administrative domains. A cluster
is defined using an INI-format configuration template. When creating
a SLURM cluster, virtual cluster compute nodes and a single head node
are provisioned as VMs from the OpenStack infrastructure. The compute
nodes are interconnected using a named OpenStack virtual network.
All post-deployment configuration is carried out using Ansible playbooks.
The head node is a SLURM controller, login node and NFS file server for
/home mounting onto the compute nodes.</p>
<p>Trinity from ClusterVision uses OpenStack to manage bare metal
infrastructure, and creates a dynamic HPC-as-a-Service platform comprising
SLURM workload management and Docker containers (running on bare metal)
for the virtual cluster compute nodes. Management of virtual clusters is
more user-friendly in Trinity than in ElastiCluster. A custom panel has
been added to the OpenStack Horizon dashboard to enable users to create,
manage and monitor their virtual clusters.</p>
<p>Trinity is developed as open source, but has a very small group of
developers. The ‘bus factor’ of this project has been exposed by
the recent departure from ClusterVision of Trinity’s core contributor.</p>
</div>
<div class="section" id="bright-computing-cluster-on-demand">
<h3>Bright Computing Cluster-on-Demand</h3>
<p>Bright Computing has developed its proprietary products for HPC
cluster management and adapted them for installation, configuration and
administration of OpenStack private clouds. The product is capable of
partitioning a system into a mix of bare metal HPC compute and OpenStack
private cloud.</p>
<div class="figure">
<img alt="Bright Computing Cluster-on-Demand" src="//www.stackhpc.com/images/hpc_workload-bright_caas.png" style="width: 600px;" />
</div>
<p>Bright Computing also provides an OpenStack distribution with
Bright-themed OpenStack web interface and an additional panel for
management of Cluster-on-Demand deployments.</p>
<p>Cluster-on-Demand uses OpenStack Heat for orchestrating the allocation and
provisioning of virtualised cluster resources. When a virtual cluster is
created, the Nova flavors (virtualised hardware templates) for head node
and cluster compute node are specified. OpenStack networking details
are also provided. Bright OpenStack is capable of deploying OpenStack
with SR-IOV support, and Cluster-on-Demand is capable of booting cluster
compute nodes with SR-IOV networking.</p>
<p>Cluster-on-Demand deployment begins with pre-built generic head
node images. Those can then be quickly instantiated (via optional
copy-on-write semantics) and automatically customized to the user’s
requirements. Bright’s deployment solution differs slightly from other
approaches by using Bright Cluster Manager on the virtualised head node
to deploy the virtual cluster nodes as though they were bare metal.
This approach neatly nests the usage model of Bright Cluster Manager
within a virtualised environment, preserving the familiar workflow of
bare metal deployment. However, as a result it does not exploit the
efficiencies of cloud infrastructure for compute node deployment at scale.
A virtualised cluster of “typical” size can be deployed on-demand from
scratch in several minutes, at which point it is ready to accept HPC jobs.</p>
<p>Bright provide configurations for a wide range of workload managers, big
data services (Spark, Hadoop), deep learning tools, or even virtualised
OpenStack clouds (OpenStack on OpenStack). Bright Cluster-on-Demand
can also dynamically burst to public clouds (AWS) when more resources
are needed (e.g. GPU nodes) or during heavy load spikes.</p>
<p>Cluster-on-Demand focuses on delivering the flexibility advantages of
self-service cluster provisioning, but can also deliver performance with
minimised virtualisation overhead through use of SR-IOV.</p>
<p>A distinctive feature of Bright OpenStack is the ability to easily deploy
virtualised HPC compute nodes next to physical ones, and run HPC workloads
in an environment spanning mixture of physical and virtual compute nodes.
Doing so provides the admin with a whole new level of flexibility.
For example, it allows the assignment of high priority HPC job queues
to physical compute nodes, and low priority job queues, or long running
jobs, to virtual compute nodes. This in turn allows the VMs to be
live migrated across the datacentre (e.g. due to hardware maintenance)
without impacting the long-running HPC jobs hosted on them.</p>
</div>
<div class="section" id="extending-slurm-and-openstack-to-orchestrate-mvapich2-virt-configuration">
<h3>Extending SLURM and OpenStack to Orchestrate MVAPICH2-Virt Configuration</h3>
<p>The NOWLAB group at Ohio State University has developed a virtualised
variant of their MPI library, MVAPICH2-Virt. MVAPICH2-Virt is described
in greater detail in the section OpenStack and HPC Network Fabrics.</p>
<p>NOWLAB has also developed plugins for SLURM, called SLURM-V, to extend
SLURM with virtualization-oriented capabilities such as submitting jobs to
dynamically created VMs with isolated SR-IOV and inter-VM shared memory
(IVSHMEM) resources. Through MVAPICH2-Virt runtime, the workload is
able to take advantage of the configured SR-IOV and IVSHMEM resources
efficiently. The NOWLAB model is slightly different from the approach
taken in Cluster-as-a-Service, in that a MVAPICH2-Virt based workload
launches into a group of VMs provisioned specifically for that workload.</p>
<blockquote>
"The model we chose to create VMs for the lifetime of each job seems
a clear way of managing virtualized resources for HPC workloads. This
approach can avoid having long-lived VMs on compute nodes, which makes
the HPC resources always in the virtualised state. Through the SLURM-V
model, both bare-metal and VM based jobs can be launched on the same set
of compute nodes since the VMs are provisioned and configured dynamically
only when the jobs need virtualised environments", says Prof. DK Panda
and Dr. Xiaoyi Lu of NOWLAB.</blockquote>
<p>The IVSHMEM component runs as a software device driver in the host kernel.
Every parallel workload has a separate instance of the IVSHMEM device
for communication between co-resident VMs. The IVSHMEM device is mapped
into the workload VMs as a paravirtualised device. The NOWLAB team has
developed extensions to Nova to add the connection of the IVSHMEM device
on VM creation, and recover the resources again on VM deletion.</p>
<p>Users can also hotplug/unplug the IVSHMEM device to/from specified
running virtual machines. The NOWLAB team provides a tool with
MVAPICH2-Virt (details can be found in the <a class="reference external" href="http://mvapich.cse.ohio-state.edu/userguide/virt/#_support_for_integration_with_openstack_for_vms">MVAPICH2-Virt userguide</a>)
to hotplug an IVSHMEM device to a virtual machine and unplug an IVSHMEM
device from a virtual machine.</p>
<p>The SLURM-V extensions have been developed to work with KVM directly.
However, the NOWLAB group have extended their project to enable SLURM-V
to make OpenStack API calls to orchestrate the creation of workload VMs.
In this model of usage, SLURM-V uses OpenStack to allocate VM instances,
isolate networks and attach SR-IOV and IVSHMEM devices to workload VMs.
OpenStack has already provided scalable and efficient mechanisms for
creation, deployment, and reclamation of VMs on a large number of
physical nodes.</p>
<p>SLURM-V is likely to be one of many sources competing for
OpenStack-managed resources. If other cloud users consume all resources,
leaving SLURM-V unable to launch sufficient workload VMs, then the new
submitted jobs will be queued in SLURM to wait for available resources.
As soon as one job completes and the corresponding resources are
reclaimed, SLURM will find another job in the queue to execute based on
the configured scheduling policy and resource requirements of jobs.</p>
</div>
<div class="section" id="combining-the-strengths-of-cloud-with-hpc-workload-management">
<h3>Combining the Strengths of Cloud with HPC Workload Management</h3>
<p>At Los Alamos National Lab, there is a desire to increase the flexibility
of the user environment of their HPC clusters.</p>
<p>To simplify their workload, administrators want every software image
to be the same, everywhere. LANL systems standardise on a custom
Linux distribution, based on Red Hat 6 and tailored for their demanding
requirements. Sustaining the evolution of that system to keep it current
with upstream development whilst maintaining local code branches is an
ongoing challenge.</p>
<p>The users demand ever increasing flexibility, but have requirements
that are sometimes contradictory. Some users have applications with
complex package dependencies that are out of date or not installed in
the LANL distribution. Some modern build systems assume internet access
at build time, which is not available on LANL HPC clusters. Conversely,
some production applications are built from a code base that is decades
old, and has dependencies on very old versions of libraries. Not all
software updates are backwards compatible.</p>
<p>Tim Randles, a senior Linux administrator and OpenStack architect at the
Lab, uses OpenStack and containers to provide solutions. Woodchuck is
LANL’s third-generation system aimed at accommodating these conflicting
needs. The 192-node system has a physical configuration optimised for
data-intensive analytics: a large amount of RAM per CPU core, local disk
for scratch space for platforms such as HDFS and 10G Ethernet with VXLAN
and SDN capabilities for virtualised networking.</p>
<p>Reid Priedhorsky at LANL has developed an unprivileged containerised
runtime environment, dubbed “Charliecloud”, upon which users can
run applications packaged using Docker tools. This enables users to
develop and build their packages on their (comparatively open) laptops
or workstations, pulling in the software dependencies they require.</p>
<p>One issue arising from this development cycle is that in a
security-conscious network such as LANL, the process of transferring
application container images to the HPC cluster involves copying large
amounts of data through several hops. This process was soon found to
have drawbacks:</p>
<ul class="simple">
<li>It quickly became time-consuming and frustrating.</li>
<li>It could not be incorporated into continuous integration frameworks.</li>
<li>The application container images were being stored for long periods of
time on Lustre-backed scratch space, which has a short data retention
policy, was occasionally unreliable and not backed up.</li>
</ul>
<p>Tim’s solution was to use OpenStack Glance as a portal between the
user’s development environment on their workstation and the HPC cluster.
Compared with the previous approach, the Glance API was accessible from
both the user’s workstations and the HPC cluster management environment.
The images stored in Glance were backed up, and OpenStack’s user model
provided greater flexibility than traditional Unix users and groups,
enabling fine-grained control over the sharing of application images.</p>
<div class="figure">
<img alt="Integration of Glance and SLURM" src="//www.stackhpc.com/images/hpc_workload-slurm_glance.png" style="width: 600px;" />
</div>
<p>Tim developed SLURM plugins to interact with Glance for validating the
image and the user’s right to access it. When the job was scheduled for
execution, user and image were both revalidated and the application image
downloaded and deployed ready for launch in the Charliecloud environment.</p>
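<p>The Glance-facing half of that interaction is straightforward to sketch
with the openstacksdk Python library. The image name and destination path
are illustrative, and a real SLURM plugin would of course be written
against SLURM's plugin API rather than as a standalone script:</p>
<pre class="literal-block">
# Sketch: validate that a container image exists and is visible to the
# user, download it to node-local disk, and re-verify Glance's recorded
# checksum before handing it to the Charliecloud runtime.
import hashlib

import openstack

conn = openstack.connect(cloud='mycloud')

def fetch_image(name, dest):
    image = conn.image.find_image(name)
    if image is None:
        raise RuntimeError('image %s not found or not accessible' % name)

    md5 = hashlib.md5()
    with open(dest, 'wb') as f:
        # stream=True returns the raw response, so a large image is never
        # held in memory in one piece.
        response = conn.image.download_image(image, stream=True)
        for chunk in response.iter_content(chunk_size=1024 * 1024):
            md5.update(chunk)
            f.write(chunk)

    # Glance records an md5 checksum at upload time; re-verify it here.
    if image.checksum and md5.hexdigest() != image.checksum:
        raise RuntimeError('checksum mismatch for %s' % name)

fetch_image('charliecloud-app', '/var/tmp/charliecloud-app.tar.gz')
</pre>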
<p>Future plans for this work include using Neutron to create and manage
virtual tenant networks for each workload, and releasing the plugins
developed as open source contributions to SLURM's codebase.</p>
</div>
<div class="section" id="hpc-and-cloud-converge-at-the-university-of-melbourne">
<h3>HPC and Cloud Converge at the University of Melbourne</h3>
<p>Research compute clusters are typically designed according to the demands
of a small group of influential researchers representing an ideal use
case. Once built, however, the distribution of use cases can change as
a broader group of researchers come on board. These new use cases may
not match the expected ideal, and in some cases conflict. If job queues
and computation times stretch out, it can drive the proliferation of
isolated department-level clusters which are more expensive to maintain,
lack scale, and are all too often orphaned when the responsible researcher
moves on.</p>
<div class="figure">
<img alt="Spartan logo" src="//www.stackhpc.com/images/hpc_workload-spartan_logo.png" style="width: 400px;" />
</div>
<div class="section" id="introducing-spartan">
<h4>Introducing Spartan</h4>
<p>In 2016 the University of Melbourne launched a new cluster called Spartan.
It takes an empirical approach, driven by the job profiles observed in
its predecessor, Edward, in the prior year. In particular, single-core
and low-memory jobs dominated: 76% of jobs were single-core, and 97%
used less than 4 GB of memory. High-core-count, task-parallel jobs were often delayed
due to competition with these single core jobs, leading to research
funds being directed towards department level resources. National peak
facilities were often rejected as an option due to their long queue
times and restrictive usage requirements.</p>
<p>Spartan takes advantage of the availability of an existing and very large
research cloud (NeCTAR) to provide additional computation capacity and
common login and management infrastructure. This is
combined with a small but more powerful partition of tightly coupled
bare-metal compute nodes, and specialist high-memory and GPU partitions.</p>
<p>This hybrid arrangement offers the following advantages:</p>
<ul class="simple">
<li>Users with data parallel jobs have access to the much larger research
cloud and can soak up the spare cycles available with cloud bursting,
reducing their job wait time.</li>
<li>Users with task parallel jobs have access to optimised bare-metal HPC,
supported by high-speed networking and storage.</li>
<li>The larger task parallel jobs remain segregated from less
resource-intensive data parallel jobs, reducing contention.</li>
<li>Job demands can be continually monitored, and the cloud and bare metal
partitions selectively expanded as and when the need arises.</li>
<li>Departments and research groups can co-invest in Spartan. If they need
more processing time or a certain type of hardware, they can attach it
directly to Spartan and have priority access. This avoids the added
overheads of administering their own system, including the software
environment, login and management nodes.</li>
<li>Management nodes can be readily migrated to new hardware, allowing us
to upgrade or replace hardware without bringing the entire cluster down.</li>
<li>Spartan can continue beyond the life of its original hardware, as
different partitions are resized or replaced, a common management and
usage platform remains.</li>
</ul>
<p>Spartan does not have extraordinary hardware or software, and its
peak performance does not exceed that of other HPC systems. Instead,
it seeks to segregate compute loads into partitions with different
performance characteristics according to their demands. This will
result in shorter queues, better utilisation, cost-effectiveness, and,
above all, faster time to results for our research community.</p>
</div>
<div class="section" id="job-and-resource-management">
<h4>Job and Resource Management</h4>
<p>Previous HPC systems at the University utilised Moab Workload Manager
for job scheduling and Terascale Open-source Resource and QUEue Manager
(TORQUE) as a resource manager. The Spartan team adopted the SLURM
Workload Manager for the following reasons:</p>
<ul class="simple">
<li>Existing community of users at nearby Victorian Life Sciences Compute
Initiative (VLSCI) facility.</li>
<li>Similar syntax to the PBS scripts used on Edward, simplifying user
transition.</li>
<li>Highly configurable through add-on modules.</li>
<li>Importantly, support for cloud bursting, for example to the Amazon
Elastic Compute Cloud (EC2) or, in Spartan's case, the NeCTAR research
cloud (a sketch of a resume program follows this list).</li>
</ul>
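<p>SLURM implements cloud bursting through its elastic computing
support: nodes marked as cloud nodes are powered up on demand by a
site-supplied resume program. The following is a minimal sketch of
such a program targeting an OpenStack cloud with openstacksdk; the
cloud, image, flavor and network names are placeholders, and a real
deployment also needs a matching suspend program:</p>
<pre class="literal-block">
#!/usr/bin/env python3
# Sketch of a SLURM ResumeProgram: SLURM invokes it with a hostlist
# of cloud nodes to bring online, and we create matching instances.
import subprocess
import sys
import openstack

conn = openstack.connect(cloud="nectar")   # hypothetical clouds.yaml entry

# Expand e.g. "spartan-cloud[001-004]" into individual node names.
hostnames = subprocess.run(
    ["scontrol", "show", "hostnames", sys.argv[1]],
    capture_output=True, text=True, check=True,
).stdout.split()

image = conn.compute.find_image("spartan-compute")   # placeholder
flavor = conn.compute.find_flavor("m2.medium")       # placeholder

for name in hostnames:
    conn.compute.create_server(
        name=name, image_id=image.id, flavor_id=flavor.id,
        networks=[{"uuid": "NETWORK-UUID"}],         # placeholder
    )
</pre>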
</div>
<div class="section" id="account-management">
<h4>Account Management</h4>
<p>Integration with a central staff and student Active Directory was
initially considered, but ultimately rejected due to the verbose usernames
required (i.e. email addresses). The Spartan team instead reverted to
an LDAP-based system, as used on previous clusters, combined with
a custom user management application.</p>
</div>
<div class="section" id="application-environment">
<h4>Application Environment</h4>
<p>EasyBuild was used as a build and installation framework, with the LMod
environmental modules system selected to manage application loading by
users. These tools tightly integrate, binding the specific toolchains
and compilation environment to the applications loaded by users.
EasyBuild's abstraction in its scripts sometimes required additional
administrative overhead, and not all software had a pre-canned script
ready for modification, so some had to be written from scratch.</p>
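<p>Easyconfigs are short files in Python syntax. The following
illustrative example shows their general shape; the package and
toolchain are examples, not Spartan's actual build configuration:</p>
<pre class="literal-block">
# Illustrative easyconfig (e.g. "zlib-1.2.8-foss-2016a.eb");
# not taken from Spartan's software stack.
easyblock = 'ConfigureMake'

name = 'zlib'
version = '1.2.8'

homepage = 'http://www.zlib.net/'
description = "zlib compression library"

# Binds the application to a specific compiler/MPI toolchain.
toolchain = {'name': 'foss', 'version': '2016a'}

source_urls = ['http://zlib.net/']
sources = [SOURCE_TAR_GZ]   # EasyBuild template for zlib-1.2.8.tar.gz

moduleclass = 'lib'
</pre>
<p>Running <tt class="docutils literal">eb</tt> on such a file with
dependency resolution enabled builds the package and its toolchain,
and generates the matching LMod module for users to load.</p>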
</div>
<div class="section" id="training">
<h4>Training</h4>
<p>Training been a particular focus for the implementation of Spartan.
Previous HPC training for researchers was limited, with only 38
researcher/days of training conducted in the 2012-2014 period.
The Spartan team now engage in weekly training, rotating across the
following sessions:</p>
<ul class="simple">
<li>Introductory, targeting researchers with little or no HPC or Linux
experience.</li>
<li>Transition, targeting existing Edward users who need to port their jobs
to Spartan.</li>
<li>Shell scripting.</li>
<li>Parallel programming.</li>
</ul>
<p>The team collaborate closely with researchers to drive this curriculum,
serving a range of experience levels, research disciplines, and software
applications.</p>
</div>
<div class="section" id="the-future">
<h4>The Future</h4>
<p>Bernard Meade, Spartan project sponsor, adds:</p>
<blockquote>
“The future configuration of Spartan will be driven by how it is
actually used. We continue to monitor what applications are run,
how long they take, and what resources they require. While Spartan
has considerable elasticity on the cloud side, we’re also able to
incrementally invest in added bare-metal and specialist nodes (high
memory, GPU) as the need arises. Given the diversity in HPC job
characteristics will only grow, we believe this agile approach is the
best means to serve the research community.”</blockquote>
</div>
</div>
</div>
<div class="section" id="cloud-infrastructure-does-not-yet-provide-all-the-answers">
<h2>Cloud Infrastructure Does Not (yet) Provide All the Answers</h2>
<div class="section" id="openstack-control-plane-responsiveness-and-job-startup">
<h3>OpenStack Control Plane Responsiveness and Job Startup</h3>
<p>Implementations of HPC workload management that create new VMs for worker
nodes for every job in the batch queue can have consequential impact
on the overall utilisation of the system if the jobs in the queue are
comparatively short-lived:</p>
<ul class="simple">
<li>Job startup time can be substantially increased. Even a fast boot
for a VM is of the order of 20 seconds. Similarly, job cleanup time can
add more overhead while the VM is destroyed and its resources harvested.</li>
<li>A high churn of VM creation and deletion can add considerable load to
the OpenStack control plane.</li>
</ul>
<p>The Cluster-as-a-Service pattern of virtualised workload managers does
not typically create VMs for every workload. However, the OpenStack
control plane can still have an impact on job startup time, for example
if the application image must be retrieved and distributed, or a virtual
tenant network must be created. Empirical tests have measured the time
to create a virtual tenant network to grow linearly with the number of
ports in the network, which could have an impact on the startup time
for large parallel workloads.</p>
</div>
<div class="section" id="workload-managers-optimise-placement-decisions">
<h3>Workload Managers Optimise Placement Decisions</h3>
<p>A sophisticated workload manager can use awareness of physical network
topology to optimise application performance through placing the workload
on physical nodes with close network proximity.</p>
<p>On a private cloud system such as OpenStack, the management of
the physical network is delegated to a network management platform.
OpenStack avoids physical network knowledge and focuses on defining the
intended state, leaving physical network management platforms to apply
architecture-specific configuration.</p>
<p>In a Cluster-as-a-Service use case, there are two scheduling operations
where topology-aware placement could be usefully applied:</p>
<ul class="simple">
<li>When the virtual cluster compute node instances are created, their
placement is determined by the OpenStack Nova scheduler.</li>
<li>When a queued job in the workload manager is being allocated to virtual
cluster compute nodes.</li>
</ul>
<p>Through use of Availability Zones, OpenStack Nova can be configured to
perform a simple form of topology-aware workload placement, but without
any hierarchical grouping of nodes. Nova’s scheduler filter API
provides a mechanism which could be used for implementing topology-aware
placement in a more intelligent fashion.</p>
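<p>A topology-aware filter would plug into that API as a class with a
<tt class="docutils literal">host_passes()</tt> method. The following
is a minimal sketch against the Nova filter interface; the static
topology map and the scheduler hint name are hypothetical, and a real
implementation would query the network management platform:</p>
<pre class="literal-block">
# Sketch of a topology-aware Nova scheduler filter: pass only hosts
# attached to the same leaf switch as a host named in a scheduler hint.
from nova.scheduler import filters

# Hypothetical static map from hypervisor hostname to leaf switch.
HOST_TO_SWITCH = {"cn001": "leaf1", "cn002": "leaf1", "cn003": "leaf2"}


class SameSwitchFilter(filters.BaseHostFilter):

    def host_passes(self, host_state, spec_obj):
        hint = spec_obj.get_scheduler_hint("same_switch_host")
        if not hint:
            return True   # no topology constraint requested
        wanted = HOST_TO_SWITCH.get(hint)
        return HOST_TO_SWITCH.get(host_state.host) == wanted
</pre>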
</div>
<div class="section" id="openstacks-flexibility-is-stretched-by-the-economics-of-utilisation">
<h3>OpenStack’s Flexibility is Stretched by the Economics of Utilisation</h3>
<p>With its decoupled execution model, batch queue job submission is an
ideal use case for off-peak compute resources. The AWS spot market
auctions time on idle cores for opportunistic usage at up to a 90%
discount from the on-demand price.</p>
<p>There is no direct equivalent to the AWS spot market in OpenStack.
More generally, management of pricing and billing is considered outside of
OpenStack’s scope. OpenStack does not currently have the capabilities
required for supporting opportunistic spot usage.</p>
<p>However, work is underway to implement the software capabilities
necessary for supporting preemptible spot instances, and it is hoped
that OpenStack will support this use case in due course. At that point,
Cluster-as-a-Service deployments could grow or shrink in response to
the availability of under-utilised compute resources on an OpenStack
private cloud.</p>
</div>
<div class="section" id="the-difficulty-of-future-resource-commitments">
<h3>The Difficulty of Future Resource Commitments</h3>
<p>HPC facilities possess a greater degree of oversight and coordination,
enabling users to request exclusive advance reservations of large sections
of an HPC system to perform occasional large-scale workloads.</p>
<p>In private cloud, there is no direct mainstream equivalent to this.
However, the Blazar project aims to extend OpenStack compute with support
for resource reservations. Blazar works by changing the management
of resource allocation for a segregated block of nodes. Within the
partition of nodes allocated to Blazar, resources can only be managed
through advance reservations.</p>
<p>A significant drawback of Blazar is that it does not support the
intermingling of reservations with on-demand usage. Without the ability
to gracefully preempt running instances, Blazar can only support advance
reservations by segregating a number of nodes exclusively for that mode
of usage.</p>
</div>
</div>
<div class="section" id="summary">
<h2>Summary</h2>
<p>OpenStack delivers new capabilities to flexibly manage compute clusters
as on-demand resources. The ability to define a compute cluster
and workload manager through code, data and configuration plays to
OpenStack’s strengths.</p>
<p>With the exception of some niche high-end requirements, OpenStack can
be configured to deliver Cluster-as-a-Service with minimal performance
overhead compared with a conventional bare metal HPC resource.</p>
</div>
<div class="section" id="further-reading">
<h2>Further Reading</h2>
<p>The ElastiCluster project from University of Zurich is
open source. Online documentation is available here:
<a class="reference external" href="https://elasticluster.readthedocs.io/en/latest/index.html">https://elasticluster.readthedocs.io/en/latest/index.html</a></p>
<p>The Trinity project from ClusterVision is developed as open source:
<a class="reference external" href="http://clustervision.com/solutions/trinity/">http://clustervision.com/solutions/trinity/</a></p>
<p>Bright Computing presented their proprietary
Bright OpenStack and Cluster-as-a-Service products
at the OpenStack Austin summit in April 2016:
<a class="reference external" href="https://www.openstack.org/videos/video/bright-computing-high-performance-computing-hpc-and-big-data-on-demand-with-cluster-as-a-service-caas">https://www.openstack.org/videos/video/bright-computing-high-performance-computing-hpc-and-big-data-on-demand-with-cluster-as-a-service-caas</a></p>
<p>The NOWLAB’s publication on Slurm-V: Extending Slurm
for Building Efficient HPC Cloud with SR-IOV and IVShmem:
<a class="reference external" href="http://link.springer.com/chapter/10.1007/978-3-319-43659-3_26">http://link.springer.com/chapter/10.1007/978-3-319-43659-3_26</a></p>
<p>Tim Randles from Los Alamos presented his work on
integrating SLURM with Glance on the HPC/Research speaker
track at the OpenStack Austin summit in April 2016:
<a class="reference external" href="https://www.openstack.org/videos/video/glance-and-slurm-user-defined-image-management-on-hpc-clusters">https://www.openstack.org/videos/video/glance-and-slurm-user-defined-image-management-on-hpc-clusters</a></p>
<p>The Spartan OpenStack/HPC system at the University of Melbourne:
<a class="reference external" href="http://newsroom.melbourne.edu/news/new-age-computing-launched-university-melbourne">http://newsroom.melbourne.edu/news/new-age-computing-launched-university-melbourne</a>
<a class="reference external" href="http://insidehpc.com/2016/07/spartan-hpc-service/">http://insidehpc.com/2016/07/spartan-hpc-service/</a></p>
<p>Topology-aware placement in SLURM is described here:
<a class="reference external" href="http://slurm.schedmd.com/topology.html">http://slurm.schedmd.com/topology.html</a></p>
<p>Some research describing a method of adding
topology-aware placement to the OpenStack Nova scheduler:
<a class="reference external" href="http://charm.cs.illinois.edu/newPapers/13-01/paper.pdf">http://charm.cs.illinois.edu/newPapers/13-01/paper.pdf</a></p>
<p>HPC resource management at CERN and some current
OpenStack pain points are described in detail here:
<a class="reference external" href="http://openstack-in-production.blogspot.co.uk/2016/04/resource-management-at-cern.html">http://openstack-in-production.blogspot.co.uk/2016/04/resource-management-at-cern.html</a></p>
<p>OpenStack Pre-emptible Instances Extension (OPIE) from Indigo Datacloud
is available here: <a class="reference external" href="https://github.com/indigo-dc/opie">https://github.com/indigo-dc/opie</a></p>
</div>
<div class="section" id="acknowledgements">
<h2>Acknowledgements</h2>
<p>This document was written by Stig Telfer of StackHPC Ltd with the support
of Cambridge University, with contributions, guidance and feedback from
subject matter experts:</p>
<ul class="simple">
<li><strong>Piotr Wachowicz</strong>, Cloud Integration Lead at Bright Computing</li>
<li><strong>Professor DK Panda</strong> and <strong>Dr. Xiaoyi Lu</strong> from NOWLAB, Ohio State University.</li>
<li><strong>Tim Randles</strong> from Los Alamos National Laboratory.</li>
<li><strong>Lev Lafayette</strong>, <strong>Bernard Meade</strong>, <strong>David Perry</strong>, <strong>Greg Sauter</strong>
and <strong>Daniel Tosello</strong> from the University of Melbourne.</li>
</ul>
<div class="figure">
<img alt="Creative commons licensing" src="//www.stackhpc.com/images/cc-by-sa.png" style="width: 100px;" />
<p class="caption">This document is provided as open source with a Creative Commons license
with Attribution + Share-Alike (CC-BY-SA)</p>
</div>
</div>
OpenStack and HPC Infrastructure Management2016-08-29T10:20:00+01:002016-10-06T18:40:00+01:00Stig Telfertag:www.stackhpc.com,2016-08-29:/openstack-and-hpc-infrastructure.html<p class="first last">OpenStack and HPC Infrastructure Management</p>
<p>In this section we discuss the emerging OpenStack use case for management
of HPC infrastructure. We introduce Ironic, the OpenStack bare metal
service and describe some of the differences, advantages and limitations
of managing HPC infrastructure as a bare metal OpenStack cloud.</p>
<p>Compared with OpenStack, established approaches to HPC infrastructure
management are very different. Conventional solutions offer much higher
scale, and much lower management plane overhead. However, they are also
inflexible, difficult to use and slow to evolve.</p>
<p>Through differences in the approach taken by cloud infrastructure
management, OpenStack brings new flexibility to HPC infrastructure
management:</p>
<ul class="simple">
<li>OpenStack’s integrated support for multi-tenancy infrastructure
introduces segregation between users and projects that require isolation.</li>
<li>The cloud model enables the infrastructure deployed for different projects
to use entirely different software stacks.</li>
<li>The software-defined orchestration of deployments is assumed.
This approach, sometimes referred to as “infrastructure as code”,
ensures that infrastructure is deployed and configured according to a
prescriptive formula, often maintained under source control in the same
manner as source code.</li>
<li>The range of platforms supported by Ironic is highly diverse. Just about
any hardware can and has been used in this context.</li>
<li>The collaborative open development model of OpenStack ensures that
community support is quick and easy to obtain.</li>
</ul>
<p>The “infrastructure as code” concept is also gaining traction among
some HPC infrastructure management platforms that are adopting proven
tools and techniques from the cloud infrastructure ecosystem.</p>
<div class="section" id="deploying-hpc-infrastructure-at-scale">
<h2>Deploying HPC Infrastructure at Scale</h2>
<p>HPC infrastructure deployment is not the same as cloud deployment.
A cloud assumes large numbers of users, each administering a small
number of instances compared to the overall size of the system. In a
multi-tenant environment, each user may use different software images.
Without coordination between the tenants, it would be very unlikely for
more than a few instances to be deployed at any one time. The software
architecture of the cloud deployment process is designed around this
assumption.</p>
<p>Conversely, HPC infrastructure deployment has markedly different
properties:</p>
<ul class="simple">
<li>A single user (the cluster administrator). HPC infrastructure is a
managed service, not user-administered.</li>
<li>A single software image. All user applications will run in a single
common environment.</li>
<li>Large proportions of the HPC cluster will be deployed simultaneously.</li>
<li>Many HPC infrastructures use diskless compute nodes that network-boot
a common software image.</li>
</ul>
<p>In the terminology of the cloud world, a typical HPC infrastructure
deployment might even be considered a “black swan event”. Cloud
deployment strategies do not exploit the simplifying assumptions that
deployments are usually across many nodes using the same image and for
the same user. Consequently, OpenStack Ironic deployments tend to scale
to the low thousands of compute nodes with current software releases
and best-practice configurations. Network booting a common image is a
capability that only recently has become possible in OpenStack and has
yet to become an established practice.</p>
</div>
<div class="section" id="bare-metal-management-using-openstack-ironic">
<h2>Bare Metal Management Using OpenStack Ironic</h2>
<p>Using Ironic, bare metal compute nodes are automatically provisioned at
a user’s request. Once the compute allocation is released, the bare
metal hardware is automatically decommissioned ready for its next use.</p>
<p>Ironic requires no presence on the compute node instances that it manages.
The software-defined infrastructure configuration that would typically
be applied in the hypervisor environment must instead be applied in
the hardware objects that interface with the bare metal compute node:
local disks, network ports, etc.</p>
<div class="section" id="support-for-a-wide-range-of-hardware">
<h3>Support for a Wide Range of Hardware</h3>
<p>A wide range of hardware is supported, from full-featured BMCs on
enterprise server equipment down to devices whose power can only be
controlled through an SNMP-enabled data centre power strip.</p>
<p>An inventory of compute nodes is registered with Ironic and stored in
Ironic’s node database. Ironic records configuration details and
current hardware state, including the following (a registration sketch
appears after the list):</p>
<ul class="simple">
<li>Physical properties of the compute node, including CPU count, RAM size
and disk capacity.</li>
<li>The MAC address of the network interface used for provisioning instance
software images.</li>
<li>The hardware drivers used to control and interact with the compute node.</li>
<li>Details needed by those drivers to address this specific compute node
(for example, BMC IP address and login credentials).</li>
<li>The current power state and provisioning state of the compute node,
including whether it is in active service.</li>
</ul>
</div>
<div class="section" id="inventory-grooming-through-hardware-inspection">
<h3>Inventory Grooming through Hardware Inspection</h3>
<p>A node is initially registered with a minimal set of identifying
credentials - sufficient to power it on and boot a ramdisk. Ironic
generates a detailed hardware profile of every compute node through a
process called Hardware Inspection.</p>
<p>Hardware inspection uses this minimal bootstrap configuration provided
during node registration. During the inspection phase a custom ramdisk
is booted which probes the hardware configuration and gathers data.
The data is posted back to Ironic to update the node inventory. Large
amounts of additional hardware profile data are also made available for
later analysis.</p>
<p>The inspection process can optionally run benchmarks to identify
performance anomalies across a group of nodes. Anomalies in the hardware
inspection dataset of a group of nodes can be analysed using a tool
called Cardiff. Performance anomalies, once identified, can often be
traced to configuration anomalies. This process helps to isolate and
eliminate potential issues before a new system enters production.</p>
</div>
<div class="section" id="bare-metal-and-network-isolation">
<h3>Bare Metal and Network Isolation</h3>
<p>The ability for Ironic to support multi-tenant network isolation is a
new capability, first released in OpenStack’s Newton release cycle.
This capability requires some mapping of the network switch ports
connected to each compute node. The mapping of an Ironic network port to
its link partner switch port is maintained with identifiers for switch
and switch port. These are stored as attributes in the Ironic network
port object. Currently the generation of the network mapping is not
automated by Ironic.</p>
<p>Multi-tenant networking is implemented through configuring state in the
attached switch port. The state could be the access port VLAN ID for
a VLAN network, or VTEP state for a VXLAN network. Currently only a
subset of Neutron drivers are able to perform the physical switch port
state manipulations needed by Ironic. Switches with VXLAN VTEP support
and controllable through the OVSDB protocol are likely to be supported.</p>
<p>Ironic maintains two private networks of its own, dedicated to node
provisioning and node cleaning; both are defined in Neutron as provider
networks. When a node is deployed, its network port is placed into the
provisioning network. Upon successful deployment the node is connected
to the virtual tenant network for active service. Finally, when the node
is destroyed it is placed on the cleaning network. Maintaining distinct
networks for each role enhances security, and the logical separation of
traffic enables different QoS attributes to be assigned for each network.</p>
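<p>As a sketch, the provisioning and cleaning networks might be defined
as VLAN provider networks with openstacksdk, with their identifiers
then referenced from Ironic's configuration; the physical network
name, VLAN IDs and address ranges are placeholders:</p>
<pre class="literal-block">
# Sketch: create Ironic's provisioning and cleaning provider networks.
import openstack

conn = openstack.connect(cloud="hpc-cloud")

for name, vlan in [("provisioning", 100), ("cleaning", 101)]:
    net = conn.network.create_network(
        name=name,
        provider_network_type="vlan",
        provider_physical_network="physnet1",   # placeholder
        provider_segmentation_id=vlan,
    )
    conn.network.create_subnet(
        network_id=net.id, ip_version=4,
        cidr="10.%d.0.0/24" % vlan,             # placeholder ranges
    )
    print(name, net.id)   # IDs are then set in ironic.conf
</pre>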
</div>
<div class="section" id="current-limitations-of-ironic-multi-tenant-networking">
<h3>Current Limitations of Ironic Multi-tenant Networking</h3>
<p>In HPC hardware configurations, compute nodes are attached to multiple
networks. Separate networks dedicated to management and high-speed data
communication are typical.</p>
<p>Current versions of Ironic do not have adequate support for attaching
nodes to multiple physical networks. Multiple physical interfaces can
be defined for a node, and a node can be attached to multiple Neutron
networks. However, it is not possible to attach specific physical
interfaces to specific networks.</p>
<p>Consequently, with current capabilities only a single network interface
should be managed by Ironic. Other physical networks must be managed
outside of OpenStack’s purview, and as a result do not benefit from
OpenStack's multi-tenant network capabilities. Furthermore, Ironic only
supports a single network per physical port: all network switch ports
for Ironic nodes are access ports. Trunk ports are not yet supported
although this feature is in the development backlog.</p>
</div>
<div class="section" id="remote-console-management">
<h3>Remote Console Management</h3>
<p>Many server management products include support for remote consoles,
both serial and video. Ironic includes drivers for serial consoles,
built upon support in the underlying hardware.</p>
<p>Recently-developed capabilities within Ironic have seen bare metal
consoles integrated with OpenStack Nova’s framework for managing
virtual consoles. Ironic’s node kernel boot parameters are extended
with a serial console port, which is then redirected by the BMC to
serial-over-LAN. Server consoles can be presented in the Horizon web
interface in the same manner as virtualised server consoles.</p>
<p>Currently this capability is only supported for IPMI-based server
management.</p>
</div>
<div class="section" id="security-and-integrity">
<h3>Security and Integrity</h3>
<p>When bare metal compute is sold as an openly-accessible service,
privileged access is granted to a bare metal system. There is substantial
scope for a malicious user to embed malware payloads in the BIOS and
device firmware of the system.</p>
<p>Ironic counters this threat in several ways:</p>
<ul class="simple">
<li><strong>Node Cleaning</strong>: Ironic’s node state machine includes states where
hardware state is reset and consistency checks can be run to detect
attempted malware injection. Ironic’s default hardware manager does
not support these hardware-specific checks. However, custom hardware
drivers can be developed to include BIOS configuration settings and
firmware integrity tests.</li>
<li><strong>Network Isolation</strong>: Through using separate networks for node provisioning,
active tenant service and node cleaning, the opportunities for a
compromised system to probe and infect other systems across the network
are greatly reduced.</li>
<li><strong>Trusted Boot</strong>: use of a Trusted Platform Module (TPM), and a chain
of trust built upon it, to verify the integrity of the boot process.
These processes are used to secure public cloud deployments of
Ironic-administered bare metal compute today.</li>
</ul>
<p>None of these capabilities is enabled by default. Hardening Ironic’s
security model requires expertise and some amount of effort.</p>
</div>
<div class="section" id="provisioning-at-scale">
<h3>Provisioning at Scale</h3>
<p>The cloud model makes different assumptions from HPC. A cloud
is expected to support a large number of individual users. At any
time, each user is assumed to make comparatively small changes to their
compute resource usage. The HPC infrastructure use case is dramatically
different. HPC infrastructure typically runs a single software image
across the entire compute partition, and is likely to be deployed jointly
in one operation.</p>
<p>Ironic’s current deployment models do not scale as well as the models
used by conventional HPC infrastructure management platforms. xCAT uses
a hierarchy of subordinate service nodes to fan out an iSCSI-based
image deployment. Rocks cluster toolkit uses BitTorrent to distribute
RPM packages to all nodes. In the Rocks model, each deployment target
is a torrent peer. The capacity of the deployment infrastructure grows
alongside the number of targets being deployed.</p>
<p>However, the technologies for content distribution and caching that are
widely adopted by the cloud can be incorporated to address this issue.
Caching proxy servers can be used to speed up deployment at scale.</p>
<p>With appropriate configuration choices, Ironic can scale to handle
deployment to multiple thousands of servers.</p>
<div class="figure">
<img alt="Ironic node deployment flow diagram" src="//www.stackhpc.com/images/hpc_infrastructure-ironic.png" style="width: 600px;" />
<p class="caption"><em>An overview of Ironic’s node deployment process when using the Ironic
Python Agent ramdisk and Swift URLs for image retrieval. This strategy
demonstrates good scalability, but the deploy disk image cannot be bigger
than the RAM available on the node.</em></p>
</div>
</div>
</div>
<div class="section" id="building-upon-ironic-to-convert-infrastructure-into-hpc-platforms">
<h2>Building Upon Ironic to Convert Infrastructure into HPC Platforms</h2>
<p>The strengths of cloud infrastructure tooling become apparent once Ironic
has completed deployment. From this point a set of unconfigured compute
nodes must converge into the HPC compute platform required to meet the
users’ needs. A rich ecosystem of flexible tools is available to
serve this purpose.</p>
<p>See the section
<a class="reference external" href="//www.stackhpc.com/openstack-and-hpc-workloads.html">OpenStack and HPC Workload Management</a>
for further details of some of the available approaches.</p>
<div class="section" id="chameleon-an-experimental-testbed-for-computer-science">
<h3>Chameleon: An Experimental Testbed for Computer Science</h3>
<div class="figure">
<img alt="Chameleon logo" src="//www.stackhpc.com/images/hpc_infrastructure-chameleon_logo.jpg" style="width: 400px;" />
</div>
<p>Chameleon is an infrastructure project implementing an experimental
testbed for Computer Science led by University of Chicago, with the Texas
Advanced Computing Center (TACC), University of Texas at San Antonio
(UTSA), Northwestern University and Ohio State University as partners.
The Chameleon project is funded by the National Science Foundation.</p>
<p>The current system comprises ~600 nodes split between sites at TACC in
Austin and University of Chicago. The sites are interconnected with a
100G network. The compute nodes are divided into twelve racks, referred
to as “standard cloud units”, comprising 42 compute nodes, 4 storage
nodes with 16 2 TB hard drives each, and 10G Ethernet connecting all nodes
with an SDN-enabled top-of-rack switch. Each SCU has 40G Ethernet uplinks
into the Chameleon core network fabric. Onto this largely homogeneous
framework were grafted heterogeneous elements allowing for different
types of experimentation. One SCU has Mellanox ConnectX-3 Infiniband.
Two compute nodes were set up as storage hierarchy nodes with 512 GB
of memory, two Intel P3700 NVMe of 2.0 TB each, four Intel S3610 SSDs of
1.6 TB each, and four 15K SAS HDDs of 600 GB each. Two additional nodes
are equipped with NVIDIA Tesla K80 accelerators and two with NVIDIA
Tesla M40 accelerators.</p>
<p>In the near term, additional heterogeneous cloud units for experimentation
with alternate processors and networks will be incorporated, including
FPGAs, Intel Atom microservers and ARM microservers. Compute nodes with
GPU accelerators have already been added to Chameleon.</p>
<p>Chameleon’s public launch was at the end of July 2015; since then
it has supported over 200 research projects into computer science and
cloud computing.</p>
<p>The system is designed to be deeply reconfigurable and adaptive, to
produce a wide range of flexible configurations for computer science
research. Chameleon uses the OpenStack Blazar project to manage advance
reservation of compute resources for research projects.</p>
<p>Chameleon deploys OpenStack packages from RDO, orchestrated using
OpenStack Puppet modules. Chameleon’s management services currently
run CentOS 7 and OpenStack Liberty. Through Ironic a large proportion
of the compute nodes are provided to researchers as bare metal (a
few SCUs are dedicated to virtualised compute instances using KVM).
Chameleon’s Ironic configuration uses the popular driver pairing of
PXE-driven iSCSI deployment and IPMItool power management.</p>
<p>Ironic’s capabilities have expanded dramatically in the year since
Chameleon first went into production, and many of the new capabilities
will be integrated into this project.</p>
<p>The Chameleon project’s wish list for Ironic capabilities includes:</p>
<ul class="simple">
<li>Ironic-Cinder integration, orchestrating the attachment of network block
devices to bare metal instances. This capability has been under active
development in Ironic and at the time of writing it is nearing completion.</li>
<li>Network isolation, placing different research projects onto different
VLANs to minimise their interference with one another. Chameleon hosts
projects researching radically different forms of networking, which must
be segregated.</li>
<li>Bare metal consoles, enabling researchers to interact with their allocated
compute nodes at the bare metal level.</li>
<li>BIOS parameter management, enabling researchers to (safely) change
BIOS parameters, and then to restore default parameters at the end of
an experiment.</li>
</ul>
<p>Pierre Riteau, DevOps lead for the Chameleon project, sees Chameleon as
an exciting use case for Ironic, which is currently developing many of
these features:</p>
<blockquote>
<p>“With the Ironic project, OpenStack provides a modern bare-metal
provisioning system benefiting from an active upstream community, with
each new release bringing additional capabilities. Leveraging Ironic
and the rest of the OpenStack ecosystem, we were able to launch Chameleon
in a very short time.”</p>
<p>“However, the Ironic software is still maturing, and can lack in
features or scalability compared to some other bare-metal provisioning
software, especially in an architecture without a scalable Swift
installation.”</p>
<p>“Based on our experience, we recommend getting familiar with the
other core OpenStack projects when deploying Ironic. Although Ironic
can be run as standalone using Bifrost, when deployed as part of an
OpenStack it interacts closely with Nova, Neutron, Glance, and Swift.
And as with all bare-metal provisioning systems, it is crucial to
have serial console access to compute nodes in order to troubleshoot
deployment failures, which can be caused by all sorts of hardware issues
and software misconfigurations.”</p>
<p>“We see the future of OpenStack in this area as providing a fully
featured system capable of efficiently managing data centre resources,
from provisioning operating systems to rolling out firmware upgrades
and identifying performance anomalies.”</p>
</blockquote>
</div>
<div class="section" id="bridges-a-next-generation-hpc-resource-for-data-analytics">
<h3>BRIDGES: A Next-Generation HPC Resource for Data Analytics</h3>
<p>Bridges is a supercomputer at the Pittsburgh Supercomputer Center funded
by the National Science Foundation. It is designed as a uniquely flexible
HPC resource, intended to support both traditional and non-traditional
workflows. The name implies the system’s aim to “bridge the research
community with HPC and Big Data.”</p>
<p>Bridges supports a diverse range of use cases, including graph analytics,
machine learning and genomics. As a flexible resource, Bridges supports
traditional SLURM-based batch workloads, Docker containers and interactive
web-based workflows.</p>
<p>Bridges has 800 compute nodes, 48 of which have dual-GPU accelerators
from NVIDIA. There are also 46 high-memory nodes, including 4 with
12 TB of RAM each. The entire system is interconnected with an OmniPath
high-performance 100G network fabric.</p>
<p>Bridges is deployed using community-supported free software. The
OpenStack control plane is CentOS 7 and Red Hat RDO (a freely available
packaging of OpenStack for Red Hat systems). OpenStack deployment
configuration is based on the PackStack project. Bridges was deployed
using OpenStack Liberty and is scheduled to be upgraded to OpenStack
Mitaka in the near future.</p>
<p>Most of the nodes are deployed in a bare metal configuration using Ironic.
Puppet is used to select the software role of a compute node at boot
time, avoiding the need to re-image. For example, a configuration for
MPI, Hadoop or virtualisation could be selected according to workload
requirements.</p>
<p>OmniPath networking is delivered using the OFED driver stack. Compute
nodes use IP over OPA for general connectivity. HPC apps use RDMA verbs
to take full advantage of OmniPath’s capabilities.</p>
<div class="figure">
<img alt="PSC BRIDGES network architecture" src="//www.stackhpc.com/images/hpc_infrastructure-bridges.png" style="width: 600px;" />
<p class="caption"><em>Visualisation of the Bridges OmniPath network topology. 800 General
purpose compute nodes and GPU nodes are arrayed along the bottom of the
topology. Special purpose compute nodes, storage and control plane nodes
are arrayed across the top of the topology. 42 compute nodes connect to
each OmniPath ToR switch (in yellow), creating a “compute island”,
with 7:1 oversubscription into the upper stages of the network.</em></p>
</div>
</div>
<div class="section" id="bridges-exposes-issues-at-scale">
<h3>Bridges Exposes Issues at Scale</h3>
<p>The Bridges system is a very large deployment for Ironic. While there are
no exact figures, Ironic is reported to scale to thousands of nodes.</p>
<p>Coherency issues between Nova Scheduler and Ironic could arise if
too many nodes were deployed simultaneously. Introducing delays
during the scripting of the "nova boot" commands kept things in check.
Node deployments would be held to five ‘building’ instances with
subsequent instances staggered by 25 seconds, resulting in automated
deployment of the entire machine taking 1-2 days.</p>
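<p>The pacing logic amounts to something like the following sketch
(not PSC's actual tooling; the cloud, image, flavor and network names
are placeholders):</p>
<pre class="literal-block">
# Sketch: pace bare metal instance creation so that no more than five
# instances are in the 'building' state at once, staggering launches
# by 25 seconds.
import time
import openstack

conn = openstack.connect(cloud="bridges")            # placeholder

image = conn.compute.find_image("bridges-compute")   # placeholder
flavor = conn.compute.find_flavor("bare-metal")      # placeholder

for i in range(800):
    # Wait until fewer than five instances are still building.
    while sum(1 for s in conn.compute.servers()
              if s.status == "BUILD") >= 5:
        time.sleep(25)
    conn.compute.create_server(
        name="r%03d" % i, image_id=image.id, flavor_id=flavor.id,
        networks=[{"uuid": "NETWORK-UUID"}],         # placeholder
    )
    time.sleep(25)   # stagger subsequent launches
</pre>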
<p>Within Ironic the periodic polling of driver power states is serialised.
BMCs can be very slow to respond, and this can lead to the time taken
to poll all power states in series to grow quite large. On Bridges,
the polling takes approximately 8 minutes to complete. This can also
lead to apparent inconsistencies of state between Nova and Ironic, and
the admin team work around this issue by enforcing “settling time”
between deleting a node and reprovisioning it.</p>
</div>
<div class="section" id="benefiting-from-openstack-and-contributing-back">
<h3>Benefiting from OpenStack and Contributing Back</h3>
<p>The team at PSC have found benefits from using OpenStack for HPC system
management:</p>
<ul class="simple">
<li>The ability to manage system image creation using OpenStack tools such
as diskimage-builder.</li>
<li>Ironic’s automation of the management of PXE node booting.</li>
<li>The prescriptive repeatable deployment process developed by the team
using Ironic and Puppet.</li>
</ul>
<p>Robert Budden, senior cluster systems developer at PSC, has many future
plans for OpenStack and Bridges:</p>
<ul class="simple">
<li>Using other OpenStack services such as Magnum (containerised workloads),
Sahara (Hadoop on the fly) and Trove (database as a service).</li>
<li>Developing Ironic support for network boot over OmniPath.</li>
<li>Diskless boot of extremely large memory nodes using Ironic’s Cinder
integration.</li>
<li>Deployment of a containerised OpenStack control plane using Kolla.</li>
<li>Increased convergence between bare metal and virtualised OpenStack
deployments.</li>
</ul>
<p>Robert adds:</p>
<blockquote>
<p>“One of the great things is that as OpenStack improves, Bridges can
improve. As these new projects come online, we can incorporate those
features and the Bridges architecture can grow with the community."</p>
<p>“A big thing for me is to contribute back. I’m a developer by nature,
I want to fix some of the bugs and scaling issues that I’ve seen and
push these back to the OpenStack community.”</p>
</blockquote>
</div>
<div class="section" id="a-200-million-openstack-powered-supercomputer">
<h3>A $200 Million OpenStack-Powered Supercomputer</h3>
<p>In 2014 and 2015 the US Department of Energy announced three new
giant supercomputers, totalling $525 million, to be procured under the
CORAL (Collaboration of Oak Ridge, Argonne and Livermore) initiative.
Argonne National Laboratory’s $200 million system, Aurora, features a
peak performance of 180 PFLOPs delivered by over 50,000 compute nodes.
Aurora is expected to be 18 times more powerful than Argonne’s current
flagship supercomputer (Mira).</p>
<p>Aurora is to be deployed in 2018 by Intel, in partnership with Cray.
Aurora exemplifies the full capabilities of Intel’s Scalable Systems
Framework initiative. Whilst Intel are providing the processors,
memory technology and fabric interconnect, Cray’s long experience
and technical expertise in system integration are also fundamental to
Aurora’s successful delivery.</p>
<div class="figure">
<img alt="Aurora floorplan render" src="//www.stackhpc.com/images/hpc_infrastructure-aurora.jpg" style="width: 600px;" />
</div>
<div class="section" id="crays-vision-of-the-openstack-powered-supercomputer">
<h4>Cray’s Vision of the OpenStack-Powered Supercomputer</h4>
<p>Cray today sells a wide range of products for supercomputing, storage
and high-performance data analytics. Aside from the company’s core
offering of supercomputer systems, much of Cray’s product line has come
through acquisition. As a result of this historical path the system
management of each product is different, has different capabilities,
and different limitations.</p>
<p>The system management software that powers Cray’s supercomputers has
developed through long experience to become highly scalable and efficient.
The software stack is bespoke and specialised in delivering this single
capability. In some ways, its inflexible excellence represents the
antithesis of OpenStack and software-defined cloud infrastructure.</p>
<p>Faced with these challenges, and with customer demands for open management
interfaces, in 2013 Cray initiated a development programme for a unified
and open solution for system management across the product range.
Cray’s architects quickly settled on OpenStack. OpenStack relieves
the Cray engineering team of the generic aspects of system management
and frees them up to focus on problems specific to the demanding nature
of the products.</p>
<p>Successful OpenStack development strategies strongly favour an open
approach. Cray teams have worked with OpenStack developer communities to
bring forward the capabilities required for effective HPC infrastructure
management, for example:</p>
<ul class="simple">
<li><strong>Enhanced Ironic deployment</strong>, using the Bareon ramdisk derived from the Fuel
deployment project. Cray management servers require complex deployment
configurations featuring multiple partitions and system images.</li>
<li><strong>Diskless Ironic deployment</strong>, through active participation in the
development of Cinder and Ironic integration.</li>
<li><strong>Ironic multi-tenant networking</strong>, through submission of bug fixes and
demonstration use cases.</li>
<li><strong>Containerised OpenStack deployment</strong>, through participation in the OpenStack
Kolla project.</li>
<li><strong>Scalable monitoring infrastructure</strong>, through participation in the Monasca
project.</li>
</ul>
<p>Fundamental challenges still remain for Cray to deliver
OpenStack-orchestrated system management for supercomputer systems on
the scale of Aurora. Kitrick Sheets, senior principal engineer at Cray
and architect of Cray’s OpenStack strategy, comments:</p>
<blockquote>
<p>“Cray has spent many years developing infrastructure management
capabilities for high performance computing environments. The emergence
of cloud computing and OpenStack has provided a foundation for common
infrastructure management APIs. The abstractions provided within the
framework of OpenStack provide the ability to support familiar outward
interfaces for users who are accustomed to emerging elastic computing
environments while supporting the ability to provide features and
functions required for the support of HPC-class workloads. Normalizing
the user and administrator interfaces also has the advantage of increasing
software portability, thereby increasing the pace of innovation.”</p>
<p>“While OpenStack presents many advantages for the management of HPC
environments, there are many opportunities for improvement to support the
high performance, large scale use cases. Areas such as bulk deployment
of large collections of nodes, low-overhead state management, scalable
telemetry, etc. are a few of these. Cray will continue to work with
the community on these and other areas directly related to support of
current and emerging HPC hardware and software ecosystems.”</p>
<p>“We believe that additional focus on performance and scale which
drive toward the support of the highest-end systems will pay dividends
on systems of all sizes. In addition, as system sizes increase,
the incidents of hardware and software component failures become more
frequent, requiring increased resilience of services to support continual
operation. The community's efforts toward live service updates is one
area that will move us much further down that path.”</p>
<p>“OpenStack provides significant opportunities for providing core
management capabilities for diverse hardware and software ecosystems.
We look forward to continuing our work with the community to enhance
and extend OpenStack to address the unique challenges presented by high
performance computing environments.”</p>
</blockquote>
</div>
</div>
</div>
<div class="section" id="most-of-the-benefits-of-software-defined-infrastructure">
<h2>Most of the Benefits of Software-Defined Infrastructure...</h2>
<p>In the space of HPC infrastructure management, OpenStack’s attraction
is centred on the prospect of having all the benefits of software-defined
infrastructure while paying none of the performance overhead.</p>
<p>To date there is no single solution that can provide this. However,
a workable trade-off can be struck in various ways:</p>
<ul class="simple">
<li>Fully-virtualised infrastructure provides all the capabilities of cloud,
but with much of the performance overhead of cloud.</li>
<li>Virtualised infrastructure using techniques such as SR-IOV and PCI
pass-through dramatically improves performance for network and IO
intensive workloads, but imposes some constraints on the flexibility of
software-defined infrastructure.</li>
<li>Bare metal infrastructure management using Ironic incurs no performance
overhead, but has further restrictions on flexibility.</li>
</ul>
<p>Each of these strategies is continually improving. Fully-virtualised
infrastructure using OpenStack private cloud provides control over
performance-sensitive parameters like resource over-commitment and
hypervisor tuning. It is anticipated that infrastructure using hardware
device pass-through optimisations will soon be capable of supporting cloud
capabilities like live migration. Ironic’s bare metal infrastructure
management is continually developing new ways of presenting physical
compute resources as though they were virtual.</p>
<p>OpenStack has already arrived in the HPC infrastructure management
ecosystem. Projects using Ironic for HPC infrastructure management
have already demonstrated success. As it matures, its proposition
of software-defined infrastructure without the overhead will become
increasingly compelling.</p>
<div class="section" id="a-rapidly-developing-project">
<h3>A Rapidly Developing Project</h3>
<p>While it is rapidly becoming popular, Ironic is a relatively young
project within OpenStack. Some areas are still being actively developed.
For sites seeking to deploy Ironic-administered compute hardware, some
limitations remain. However, Ironic has a rapid pace of progress,
and new capabilities are released with every OpenStack release cycle.</p>
<p>HPC infrastructure management using OpenStack Ironic has been demonstrated
at over 800 nodes, while Ironic is claimed to scale to managing thousands
of nodes. However, new problems become apparent at scale. Currently,
large deployments using Ironic should plan for an investment in the
skill set of the administration team and active participation within
the Ironic developer community.</p>
</div>
</div>
<div class="section" id="further-reading">
<h2>Further Reading</h2>
<p>A clear and helpful introduction into the workings of Ironic in greater
detail: <a class="reference external" href="http://docs.openstack.org/developer/ironic/deploy/user-guide.html">http://docs.openstack.org/developer/ironic/deploy/user-guide.html</a></p>
<p>Deployment of Ironic as a standalone tool:
<a class="reference external" href="http://docs.openstack.org/developer/bifrost/readme.html">http://docs.openstack.org/developer/bifrost/readme.html</a></p>
<p>Kate Keahey from University of Chicago presented an architecture
show-and-tell on Chameleon at the OpenStack Austin summit in April 2016:
<a class="reference external" href="https://www.openstack.org/videos/video/chameleon-an-experimental-testbed-for-computer-science-as-application-of-cloud-computing-1">https://www.openstack.org/videos/video/chameleon-an-experimental-testbed-for-computer-science-as-application-of-cloud-computing-1</a></p>
<p>Chameleon Cloud’s home page is at: <a class="reference external" href="https://www.chameleoncloud.org">https://www.chameleoncloud.org</a></p>
<p>Robert Budden presented an architecture show-and-tell
on Bridges at the OpenStack Austin summit in April 2016:
<a class="reference external" href="https://www.openstack.org/videos/video/deploying-openstack-for-the-national-science-foundations-newest-supercomputers">https://www.openstack.org/videos/video/deploying-openstack-for-the-national-science-foundations-newest-supercomputers</a></p>
<p>Further information on Bridges is available at its home page at PSC:
<a class="reference external" href="http://www.psc.edu/index.php/bridges">http://www.psc.edu/index.php/bridges</a></p>
<p>Argonne National Lab’s home page for Aurora: <a class="reference external" href="http://aurora.alcf.anl.gov">http://aurora.alcf.anl.gov</a></p>
<p>A presentation from Intel giving an overview of Aurora:
<a class="reference external" href="http://www.intel.com/content/dam/www/public/us/en/documents/presentation/intel-argonne-aurora-announcement-presentation.pdf">http://www.intel.com/content/dam/www/public/us/en/documents/presentation/intel-argonne-aurora-announcement-presentation.pdf</a></p>
<p>Intel’s Scalable System Framework: <a class="reference external" href="http://www.intel.co.uk/content/www/uk/en/high-performance-computing/product-solutions.html">http://www.intel.co.uk/content/www/uk/en/high-performance-computing/product-solutions.html</a></p>
</div>
<div class="section" id="acknowledgements">
<h2>Acknowledgements</h2>
<p>This document was written by Stig Telfer of StackHPC Ltd with the support
of Cambridge University, with contributions, guidance and feedback from
subject matter experts:</p>
<ul class="simple">
<li><strong>Pierre Riteau</strong>, University of Chicago and Chameleon DevOps lead.</li>
<li><strong>Kate Keahey</strong>, University of Chicago and Chameleon Science Director.</li>
<li><strong>Robert Budden</strong>, Senior Cluster Systems Developer, Pittsburgh Supercomputer Center.</li>
<li><strong>Kitrick Sheets</strong>, Senior Principal Engineer, Cray Inc.</li>
</ul>
<div class="figure">
<img alt="Creative commons licensing" src="//www.stackhpc.com/images/cc-by-sa.png" style="width: 100px;" />
<p class="caption">This document is provided as open source with a Creative Commons license
with Attribution + Share-Alike (CC-BY-SA)</p>
</div>
</div>
OpenStack and HPC Network Fabrics2016-08-15T10:20:00+01:002016-10-05T18:40:00+01:00Stig Telfertag:www.stackhpc.com,2016-08-15:/openstack-and-hpc-networks.html<p class="first last">OpenStack and HPC network fabrics</p>
<p>HPC and cloud infrastructure are not built to the same requirements.
As much as anything else, networking exemplifies the divergent criteria
between HPC applications and the typical workloads served by cloud
infrastructure.</p>
<p>With sweeping generalisations, one typically assumes an HPC
parallel workload is tightly-coupled, and a cloud-native workload is
loosely-coupled. A typical HPC parallel workload might be computational
fluid dynamics using a partitioned geometric grid. The application
code is likely to be structured in a bulk synchronous parallel model,
comprising phases of compute and data exchange between neighbouring
workers. Progress is made in lock-step, and is blocked until all workers
complete each phase.</p>
<p>Compare this with typical cloud-native application, which might be
a microservice architecture consisting of a number of communicating
sequential processes. The overall application structure is completely
different, and workers do not have the same degree of dependency upon
other workers in order to make progress.</p>
<p>The different requirements of HPC and cloud-native applications have led
to different architectural choices being made at every level in order to
deliver optimal and cost-effective solutions for each target application.</p>
<p>A cloud environment experiences workload diversity to a far greater extent
than seen in HPC. This has become the quintessential driving force of
software-defined infrastructure. Cloud environments are designed to
be flexible and adaptable. As a result, cloud infrastructure has the
flexibility to accommodate HPC requirements.</p>
<p>The flexibility of cloud infrastructure is delivered through layers
of abstraction. OpenStack’s focus is on defining the intent of
the multi-tenant cloud infrastructure. Dedicated network management
applications decide on the implementation. As an orchestrator, OpenStack
delegates knowledge of physical network connectivity to the network
management platforms to which it connects.</p>
<p>OpenStack’s surging momentum has ensured that support already exists
for all but the most exotic of HPC network architectures. This article
will describe several solutions for delivering HPC networking in an
OpenStack cloud.</p>
<div class="section" id="using-sr-iov-for-virtualised-hpc-networking">
<h2>Using SR-IOV for Virtualised HPC Networking</h2>
<p>SR-IOV is a technology that demonstrates how software-defined
infrastructure can introduce new flexibility in the management of HPC
resources, whilst retaining the high-performance benefits. With current
generation devices, there is a slight increase in I/O latency when using
SR-IOV virtual functions. However, this overhead is negligible for all
but the most latency-sensitive of applications.</p>
<div class="section" id="the-process-flow-for-using-sr-iov">
<h3>The Process Flow for Using SR-IOV</h3>
<p>The Nova compute hypervisor is configured at boot time with kernel flags
to support extensions for SR-IOV hardware management.</p>
<p>The network kernel device driver is configured to create virtual
functions. These are present alongside the physical function. When they
are not assigned to a guest workload instance, the virtual functions
are visible in the device tree of the hypervisor.</p>
<p>The OpenStack services are configured with identifiers or addresses
of devices configured to support SR-IOV. This is most easily done by
identifying the physical function (for example using its network device
name, PCI bus address, or PCI vendor/device IDs). All virtual functions
associated with this device will be made available for virtualised
compute instances. The configuration that identifies SR-IOV devices is
known as the whitelist.</p>
<p>To use an SR-IOV virtual function for networking in an instance,
a special direct-bound network port is created and connected with
the VM. This causes one of the virtual functions to be configured and
passed-through from the hypervisor into the VM.</p>
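<p>In openstacksdk terms, this sequence reduces to creating a port with
the <tt class="docutils literal">direct</tt> vNIC type and booting an
instance with it. A brief sketch, with the cloud, network, image and
flavor names as placeholders:</p>
<pre class="literal-block">
# Sketch: launch a VM with an SR-IOV virtual function.
import openstack

conn = openstack.connect(cloud="hpc-cloud")

net = conn.network.find_network("sriov-net")    # VLAN provider network
port = conn.network.create_port(
    network_id=net.id,
    binding_vnic_type="direct",   # request a VF rather than a vNIC
)

image = conn.compute.find_image("centos7")      # placeholder
flavor = conn.compute.find_flavor("hpc.large")  # placeholder

conn.compute.create_server(
    name="mpi-node-01",
    image_id=image.id, flavor_id=flavor.id,
    networks=[{"port": port.id}],
)
</pre>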
<p>Support for launching an instance using SR-IOV network interfaces from
OpenStack’s Horizon web interface was introduced in the OpenStack
Mitaka release (April 2016). Prior to this, it was only possible to
launch instances using SR-IOV ports through a sequence of command-line
invocations (or through direct interaction with the OpenStack APIs).</p>
</div>
<div class="section" id="the-limitations-of-using-sr-iov-in-cloud-infrastructure">
<h3>The Limitations of Using SR-IOV in Cloud Infrastructure</h3>
<p>SR-IOV places some limitations on the cloud computing model that can be
detrimental to the overall flexibility of the infrastructure:</p>
<ul class="simple">
<li>Current SR-IOV hardware implementations support flat (unsegregated)
and VLAN network separation but not VXLAN for tenant networks. This
limitation can constrain the configuration options for the network fabric.
Layer-3 IP-based fabrics using technologies such as ECMP are unlikely
to interoperate with VLAN-based network separation.</li>
<li>Live migration of VMs connected using SR-IOV is not possible with current
hardware and software. The capability is being actively developed for
Mellanox SR-IOV NICs. It is not confirmed whether live migration of
RDMA applications will be possible.</li>
<li>SR-IOV devices bypass OpenStack’s security groups, and consequently
should only be used for networks that are not externally connected.</li>
</ul>
</div>
</div>
<div class="section" id="virtualisation-aware-mpi-for-tightly-coupled-cloud-workloads">
<h2>Virtualisation-aware MPI for Tightly-Coupled Cloud Workloads</h2>
<p>The MVAPICH2 library implements MPI-3 (based on the MPI 3.1 standard)
using the IB verbs low-level message passing primitives. MVAPICH2 was
created and developed by the <a class="reference external" href="http://nowlab.cse.ohio-state.edu/">Network-Based Computing Laboratory</a> (NOWLAB) at The Ohio State
University, and has been freely available for download for 15 years.
Over that time MVAPICH2 has been continuously developed and now runs on
systems as big as 500,000 cores.</p>
<p>An Infiniband NIC with SR-IOV capability was first developed by Mellanox
in the ConnectX-3 generation of its product, unlocking the possibility
of achieving near-native Infiniband performance in a virtualised
environment. MVAPICH2-Virt was introduced in 2015 to bring HPC levels
of performance to cloud infrastructure. The techniques adopted by
MVAPICH2-Virt currently support KVM and Docker based cloud environments.
MVAPICH2-Virt introduces Inter-VM Shared Memory (IVSHMEM) support to KVM
hypervisors, increasing performance between co-resident VMs. In order
to run MVAPICH2-Virt based applications on top of OpenStack-based cloud
environments easily, several extensions to set up SR-IOV and IVSHMEM
devices in VMs have been developed for OpenStack’s Nova compute manager.</p>
<p>MVAPICH2-Virt has two principal optimisation strategies for KVM-based
cloud environments:</p>
<ul class="simple">
<li>Dynamic locality awareness for MPI communication among co-resident
VMs. A new communication channel, IVSHMEM, introduces a memory-space
communication mechanism between different VMs co-resident on the same
hypervisor. Inter-node communication continues to use the SR-IOV
virtual function.</li>
<li>Tuning of MPI performance for both SR-IOV and IVSHMEM channels.</li>
</ul>
<div class="figure">
<img alt="IV-SHMEM device for intra-hypervisor communication" src="//www.stackhpc.com/images/hpc_fabrics-ivshmem.png" style="width: 400px;" />
</div>
<p>Similarly, MVAPICH2-Virt has two principal optimisation strategies for
Docker based cloud environments:</p>
<ul class="simple">
<li>Dynamic locality awareness for MPI communication among co-resident
containers. All intra-node MPI communication can go through either the
IPC-SHM channel or the CMA channel, whether the communicating processes
are in the same container or in different ones. Inter-node,
inter-container MPI communication leverages the InfiniBand HCA channel.</li>
<li>Tuning of MPI performance for all different channels, including IPC-SHM,
CMA, and InfiniBand HCA.</li>
</ul>
<p>With these strategies in effect, the performance overhead of KVM and
Docker based virtualisation on standard MPI benchmarks and applications
is less than 10%.</p>
<blockquote>
<p>"The novel designs introduced in MVAPICH2-Virt take advantage of the
latest advances in virtualisation technologies and promise to design
next-generation HPC cloud environments with good performance", says
Prof. DK Panda and Dr. Xiaoyi Lu of NOWLAB.</p>
<p>OpenStack as an environment for supporting MPI based HPC workloads has
many benefits such as fast VM or container deployment for setting up
MPI job execution environments, security, enabling resource sharing,
providing privileged access in virtualized environments, supporting
high-performance networking technologies (e.g. SR-IOV), etc.</p>
<p>Currently, OpenStack still cannot fully support or work seamlessly
with technologies in HPC environments, such as IVSHMEM, SLURM, PBS,
etc. But with several extensions proposed by NOWLAB researchers, running
MPI based HPC workloads on top of OpenStack managed environments seems
a promising approach for building efficient clouds.</p>
<p>The future direction of MVAPICH2-Virt includes:</p>
<ul class="simple">
<li>Further support for different kinds of virtualised environments</li>
<li>Further improvements to MPI application performance on cloud
environments through novel designs</li>
<li>Support for live migration of MPI applications in SR-IOV and IVSHMEM
enabled VMs</li>
</ul>
</blockquote>
</div>
<div class="section" id="infiniband-and-other-non-ethernet-fabrics">
<h2>Infiniband and other Non-Ethernet Fabrics</h2>
<p>Infiniband is the dominant fabric interconnect for HPC clusters. Of the
TOP500 list published in June 2016, 41% of entries use Infiniband.</p>
<p>In part, OpenStack’s flexibility comes from avoiding many rigid
assumptions in infrastructure management. However, OpenStack networking
does have some expectations of an Ethernet and IP-centric network
architecture, which can present challenges for the network architectures
often used in HPC. The Neutron driver for Infiniband circumvents this
assumption by applying Neutron’s layer-3 network configuration to an
IP-over-IB interface, and mapping Neutron’s layer-2 network segmentation
ID to Infiniband pkeys.</p>
<p>However, Neutron is limited to an allocation of 126 pkeys, which imposes
a restrictive upper limit on the number of distinct tenant networks an
OpenStack Infiniband cloud can support.</p>
<p>A technical lead with experience of using OpenStack on Infiniband reports
mixed experiences from an evaluation performed in 2015. The overall
result led him to conclude that HPC fabrics such as Infiniband are
only worthwhile in an OpenStack environment if one is also using RDMA
communication protocols in the client workload:</p>
<blockquote>
"MPI jobs were never a targeted application for our system. Rather, the
goal for our OpenStack was to accommodate all the scientific audiences
for whom big HPC clusters, and batch job schedulers, weren't a best fit.
So, no hard, fast requirement for a low-latency medium. What we realized
was that it's complex. It may be hard to keep it running in production.
IPoIB on FDR, in unconnected mode, is slower than 10Gbps Ethernet,
and if you're not making use of RDMA, then you're really just kind of
hurting yourself. Getting data in and out is tricky. All the big data
we have is on a physically separate IB fabric, and no one wanted to span
those fabrics, and doing something involving IP routing would break down
the usefulness of RDMA."</blockquote>
<p>IP-over-IB performance and scalability has improved substantially in
subsequent hardware and software releases. A modern Infiniband host
channel adaptor with a current driver stack operating in connected mode
can sustain 35-40 Gbits/sec in a single TCP stream on FDR Infiniband.</p>
<p>The Canadian HPC4Health consortium have deployed a federation of OpenStack
private clouds using a Mellanox FDR Infiniband network fabric.</p>
<p>Intel’s Omnipath network architecture is starting to emerge in the
Scientific OpenStack community. At Pittsburgh Supercomputer Center,
the BRIDGES system entered production in early 2016 for HPC and data
analytics workloads. It comprises over 800 compute nodes with an
Omnipath fabric interconnect. In its current product generation,
Omnipath does not support SR-IOV. BRIDGES is a bare metal system,
managed using OpenStack Ironic. The Omnipath network is managed
independently of OpenStack.</p>
<p>BRIDGES is described in greater detail in the section OpenStack and HPC
Infrastructure Management.</p>
</div>
<div class="section" id="an-rdma-centric-bioinformatics-cloud-at-cambridge-university">
<h2>An RDMA-Centric Bioinformatics Cloud at Cambridge University</h2>
<p>Cambridge University’s Research Computing Services group has a long
track record as a user of RDMA technologies such as Infiniband and
Lustre across all its HPC infrastructure platforms. When scoping a new
bioinformatics compute resource in 2015, the desire to combine this
proven HPC technology with a flexible self-service platform led to a
requirements specification for an RDMA-centric OpenStack cloud.</p>
<p>Bioinformatics workloads can be IO-intensive in nature, and can also
feature IO access patterns that are highly sensitive to IO latency.
Whilst this class of workload is typically a weakness of virtualised
infrastructure, the effects are mitigated through use of HPC technologies
such as RDMA and virtualisation technologies such as SR-IOV to maximise
efficiency and minimise overhead.</p>
<p>The added complexity of introducing HPC networking technologies is
considerable, but remains hidden from the bioinformatics users of
the system. Block-based IO via RDMA is delivered to the kernel of the
KVM hypervisor. The compute instances simply see a paravirtualised
block device. File-based IO via RDMA is delivered using the Lustre
filesystem client drivers running in the VM instances. Through use
of SR-IOV virtual functions, this is identical to a bare metal compute
node in a conventional HPC configuration. Similarly, MPI communication
is performed on the virtualised network interfaces with no discernible
difference for the user of the compute instance.</p>
<div class="figure">
<img alt="Node software stack" src="//www.stackhpc.com/images/hpc_fabrics-node_stack.png" style="width: 400px;" />
<p class="caption"><em>Software architecture of the compute node of an RDMA-centric
OpenStack cloud.</em></p>
</div>
<p>The cloud contains 80 compute nodes, 3 management nodes and a number
of storage nodes of various kinds. The system runs Red Hat OpenStack
Platform (OSP) and is deployed using Red Hat’s TripleO-based process.
All the HPC-centric features of the system have been implemented
using custom configuration and extensions to TripleO. Post-deployment
configuration management is performed using Ansible-OpenStack playbooks,
resulting in a devops approach for managing an HPC system.</p>
<p>To deploy a system with RDMA networking enabled in the compute node
hypervisor, overcloud management QCOW2 images are created with OpenFabrics
installed. Cinder is configured to use iSER (iSCSI Extensions for RDMA)
as a transport protocol.</p>
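<p>As a hedged sketch (the backend name is hypothetical; option names as
they stood in contemporary Cinder releases), enabling iSER on an LVM
backend amounts to a small configuration change:</p>
<pre class="literal-block">
# Sketch: switch the data path of an LVM backend from plain iSCSI to
# iSER, so that block I/O is carried over RDMA
cat >> /etc/cinder/cinder.conf <<EOF
[rdma-lvm]
volume_driver = cinder.volume.drivers.lvm.LVMVolumeDriver
iscsi_protocol = iser
EOF
</pre>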
<p>The cloud uses a combination of Mellanox 50G Ethernet NICs and 100G
Ethernet switches for its HPC network fabric. RDMA support using RoCEv1
requires layer-2 network connectivity. Consequently, OpenStack’s
networking is configured to use VLANs for control plane traffic and HPC
tenant networking. VXLAN is used for other classes of tenant networking.</p>
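<p>A minimal sketch of an ML2 configuration mixing the two segmentation
types (the physical network name and the ranges are illustrative):</p>
<pre class="literal-block">
# Sketch: VLAN segmentation for HPC tenant networks alongside VXLAN
# for general-purpose tenant networking
cat >> /etc/neutron/plugins/ml2/ml2_conf.ini <<EOF
[ml2]
type_drivers = flat,vlan,vxlan
tenant_network_types = vlan,vxlan

[ml2_type_vlan]
network_vlan_ranges = physnet1:100:200

[ml2_type_vxlan]
vni_ranges = 65537:69999
EOF
</pre>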
<p>A multi-path layer-2 network fabric is created using multi-chassis LAGs.
Traffic is distributed across multiple physical links whilst presenting a
single logical link for the Ethernet network topology. Port memberships
of the tenant network VLANs are managed dynamically using the NEO
network management platform from Mellanox, which integrates with
OpenStack Neutron.</p>
<div class="figure">
<img alt="Cambridge system overview" src="//www.stackhpc.com/images/hpc_fabrics-cambridge.png" style="width: 600px;" />
</div>
</div>
<div class="section" id="the-forces-driving-hpc-and-cloud-diverge-in-network-management">
<h2>The Forces Driving HPC and Cloud Diverge in Network Management</h2>
<p>At the pinnacle of HPC, ultimate performance is achieved through
exploiting full knowledge of all hardware details: the microarchitecture
of a processor, the I/O subsystem of a server - or the physical location
within a network. HPC network management delivers performance by enabling
workload placement with awareness of the network topology.</p>
<p>The cloud model succeeds because of its abstraction. Cloud infrastructure
commits to delivering a virtualised flat network to its instances.
All details of the underlying physical topology are obscured. Where an
HPC network management system can struggle to handle changes in physical
network topology, cloud infrastructure adapts.</p>
<p>OpenStack provides a limited solution to locality-aware placement, through
use of availability zones (AZ). By defining an AZ per top-of-rack
switch, a user can request instances be scheduled to be co-resident
on the same edge switch. However, this can be a clumsy interface for
launching instances on a large private cloud, and AZs cannot be nested
to provide multiple levels of locality for co-locating larger workloads.</p>
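<p>As a sketch of the approach (the aggregate, host and flavor names are
hypothetical), an operator can expose a rack as an availability zone
through a host aggregate, which users then select at boot time:</p>
<pre class="literal-block">
# Define an aggregate whose availability zone maps to one rack
nova aggregate-create rack-d5 rack-d5
nova aggregate-add-host rack-d5 compute-d5-01

# Users request placement within that rack
nova boot --availability-zone rack-d5 \
    --flavor hpc.small --image centos7 vm-01
</pre>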
<p>OpenStack depends on other network management platforms for physical
network knowledge, and delegates to them all aspects of physical
network management. Network management and monitoring packages such
as Observium and Mellanox NEO are complementary to the functionality
provided by OpenStack.</p>
<p>Another key theme in HPC network management is the gathering of
network-centric performance telemetry.</p>
<p>While HPC does not deliver on all of its promise in this area, there is
greater focus within HPC network management on the ability to collect
telemetry data on the performance of a network for optimising the
workload.</p>
<p>Cloud and HPC take very different approaches in this sector.</p>
<p>In general, HPC performance monitoring is done at the application level.
HPC application performance analysis typically follows a model in which
runtime trace data is gathered during execution for later aggregation
and visualisation. This approach avoids overhead when monitoring is
not required and minimises the overhead when monitoring is active.
When application monitoring is active, leading packages such as OVIS
minimise overhead by using RDMA for aggregation of runtime telemetry data.
Application performance visualisation is performed using tools such
as VAMPIR. All these HPC-derived application performance monitoring
tools will also work for applications running within an OpenStack/HPC
environment.</p>
<p>At a system level, HPC network performance analysis is more limited in
scope, but developments such as PAVE at Lawrence Livermore and more
recently INAM^2 from Ohio State University are able to demonstrate
a more holistic capability to identify adverse interactions between
applications sharing a network, in addition to performance bottlenecks
within an application itself.</p>
<p>The pace of development of cloud infrastructure monitoring is
faster, and in many cases is derived from open-source equivalents of
hyperscaler-developed capabilities. Twitter’s Zipkin is a distributed
application performance monitoring framework derived from the concepts
of Google’s Dapper. LinkedIn developed and published
Kafka, a distributed near-real-time message log. However, the layers of
abstraction that give cloud its flexibility can prevent cloud monitoring
from providing performance insights from the physical domain that inform
performance in the virtual domain.</p>
<p>At OpenStack Paris in November 2014 Intel demonstrated Apex Lake,
a project which aims to provide performance telemetry across these
abstraction boundaries - including across virtual/physical network
abstractions. Some of these features may have been incorporated into
Intel’s open source Snap telemetry/monitoring framework.</p>
<p>At present, through use of SR-IOV network devices,
cloud network infrastructure has demonstrated that it is capable of
achieving performance levels that are typically within 1-9% of bare metal.
OpenStack can be viewed as the integration and orchestration of existing
technology platforms. The physical network performance telemetry of
cloud network infrastructure is delegated to the technology platforms
upon which it is built. In future, projects such as INAM^2 on the HPC
side and Apex Lake on the cloud side may lead to a telemetry monitoring
framework capable of presenting performance data from virtual and physical
domains in the context of one another.</p>
</div>
<div class="section" id="further-reading">
<h2>Further Reading</h2>
<p>This Intel white paper provides a useful introduction to SR-IOV:
<a class="reference external" href="http://www.intel.in/content/dam/doc/white-paper/pci-sig-single-root-io-virtualization-support-in-virtualization-technology-for-connectivity-paper.pdf">http://www.intel.in/content/dam/doc/white-paper/pci-sig-single-root-io-virtualization-support-in-virtualization-technology-for-connectivity-paper.pdf</a></p>
<p>A step-by-step guide to setting up Mellanox Infiniband
with a Red Hat variant of Linux and OpenStack Mitaka:
<a class="reference external" href="https://wiki.openstack.org/wiki/Mellanox-Neutron-Mitaka-Redhat-InfiniBand">https://wiki.openstack.org/wiki/Mellanox-Neutron-Mitaka-Redhat-InfiniBand</a></p>
<p>A presentation by Professor DK Panda from NOWLAB at Ohio State University
on MVAPICH2-Virt: <a class="reference external" href="https://youtu.be/m0p2fibwukY">https://youtu.be/m0p2fibwukY</a></p>
<p>Further information on MVAPICH2-Virt can be found here:
<a class="reference external" href="http://mvapich.cse.ohio-state.edu">http://mvapich.cse.ohio-state.edu</a></p>
<p>Some papers from the team at NOWLAB describing MVAPICH2-Virt in greater depth:</p>
<ul class="simple">
<li>[HiPC'14] High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters.
Jie Zhang, Xiaoyi Lu, Jithin Jose, Rong Shi, Mingzhe Li, and Dhabaleswar
K. (DK) Panda. Proceedings of the 21st annual IEEE International
Conference on High Performance Computing (HiPC), 2014.</li>
<li>[CCGrid'15] MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach
to Build HPC Clouds. Jie Zhang, Xiaoyi Lu, Mark Arnold, and Dhabaleswar
K. (DK) Panda. Proceedings of the 15th IEEE/ACM International Symposium
on Cluster, Cloud and Grid Computing (CCGrid), 2015.</li>
</ul>
<p>The OVIS HPC application performance monitoring framework:
<a class="reference external" href="https://ovis.ca.sandia.gov/mediawiki/index.php/Main_Page">https://ovis.ca.sandia.gov/mediawiki/index.php/Main_Page</a></p>
<p>PAVE - Performance Analysis and Visualisation
at Exascale at Lawrence Livermore:
<a class="reference external" href="http://computation.llnl.gov/projects/pave-performance-analysis-visualization-exascale">http://computation.llnl.gov/projects/pave-performance-analysis-visualization-exascale</a></p>
<p>The introduction of INAM^2 for real-time
Infiniband network performance monitoring:
<a class="reference external" href="http://mvapich.cse.ohio-state.edu/static/media/publications/abstract/subramoni-isc16-inam.pdf">http://mvapich.cse.ohio-state.edu/static/media/publications/abstract/subramoni-isc16-inam.pdf</a></p>
<p>Further information on INAM^2 can be found here:
<a class="reference external" href="http://mvapich.cse.ohio-state.edu/tools/osu-inam/">http://mvapich.cse.ohio-state.edu/tools/osu-inam/</a></p>
<p>Open Zipkin is a distributed application performance monitoring framework
developed at Twitter, based on Google’s paper on their Dapper monitoring
framework: <a class="reference external" href="http://zipkin.io">http://zipkin.io</a></p>
<p>Intel Snap is a new monitoring framework for virtualised infrastructure:
<a class="reference external" href="http://snap-telemetry.io">http://snap-telemetry.io</a></p>
<p>Observium is a network mapping and monitoring platform built upon SNMP:
<a class="reference external" href="http://observium.org/">http://observium.org/</a></p>
<p>A useful discussion on the value of high-resolution network telemetry
for researching issues with maximum latency in a cloud environment:
<a class="reference external" href="https://engineering.linkedin.com/performance/who-moved-my-99th-percentile-latency">https://engineering.linkedin.com/performance/who-moved-my-99th-percentile-latency</a></p>
</div>
<div class="section" id="acknowledgements">
<h2>Acknowledgements</h2>
<p>This document was written by Stig Telfer of StackHPC Ltd with the support
of Cambridge University, with contributions, guidance and feedback from
subject matter experts:</p>
<ul class="simple">
<li><strong>Professor DK Panda</strong> and <strong>Dr. Xiaoyi Lu</strong> from NOWLAB, Ohio State University.</li>
<li><strong>Jonathan Mills</strong> from NASA Goddard Spaceflight Center.</li>
</ul>
<div class="figure">
<img alt="Creative commons licensing" src="//www.stackhpc.com/images/cc-by-sa.png" style="width: 100px;" />
<p class="caption">This document is provided as open source with a Creative Commons license
with Attribution + Share-Alike (CC-BY-SA)</p>
</div>
</div>
OpenStack and Virtualised HPC (Stig Telfer, 2016-08-01, updated 2016-10-04): /hpc-and-virtualisation.html
<p class="first last">HPC and the overhead of virtualisation</p>
<p>Doubts over the adoption of OpenStack typically centre around the impact
of infrastructure virtualisation. From the skeptical perspective of an
HPC architect, why OpenStack?</p>
<ul class="simple">
<li><em>I have heard the hype</em></li>
<li><em>I am skeptical to some degree</em></li>
<li><em>I need evidence of benefit</em></li>
</ul>
<p>In this section, we will describe the different forms of overhead that
can be introduced by virtualisation, and provide technical details
of solutions that mitigate, eliminate or bypass the overheads of
software-defined infrastructure.</p>
<div class="section" id="the-overhead-of-virtualisation">
<h2>The Overhead of Virtualisation</h2>
<p>Analysis typically shows that the overhead of virtualisation for
applications that are CPU or memory intensive is minimal.</p>
<p>Similarly, applications that depend on high-bandwidth I/O or network
communication for bulk data transfers can achieve levels of performance
that are close to equivalent bare metal configurations.</p>
<p>Where a significant performance impact is observed, it can often be
ascribed to overcommitment of hardware resources or “noisy neighbours”
- issues that could equally apply in non-virtualised configurations.</p>
<p>However, there remains a substantial class of applications whose
performance is significantly impacted by virtualisation. Some of the
causes of that performance overhead are described here.</p>
<div class="section" id="increased-software-overhead-on-i-o-operations">
<h3>Increased Software Overhead on I/O Operations</h3>
<p>Factors such as storage IOPs and network message latency are often
critical for HPC application performance.</p>
<p>HPC applications that are sensitive to these factors are poor performers
in a conventional virtualised environment. Fully-virtualised environments
incur additional overhead per I/O operation that can impact performance
for applications that depend on such patterns of I/O.</p>
<p>The additional overhead is mitigated through paravirtualisation, in
which the guest OS includes support for running within a virtualised
environment. The guest OS cooperates with the host OS to reduce
the overhead of hardware device management. Direct hardware device
manipulation is performed in the host OS, keeping the micro-management
of hardware closer to the physical device. The hypervisor then presents
a more efficient software interface to a simpler driver in the guest OS.
Performance improves through streamlining interactions between guest OS
and the virtual hardware devices presented to it.</p>
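<p>With the KVM/libvirt driver, paravirtualised devices are typically
requested through image properties. A sketch (the image name is
hypothetical):</p>
<pre class="literal-block">
# Sketch: ask for virtio paravirtualised disk and network devices for
# guests booted from this image
glance image-update --property hw_disk_bus=virtio \
    --property hw_vif_model=virtio centos7
</pre>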
</div>
<div class="section" id="hardware-offload-in-a-virtualised-network">
<h3>Hardware Offload in a Virtualised Network</h3>
<p>All modern Ethernet NICs provide hardware offload of IP, TCP and
other protocols. To varying degrees, these free up CPU cycles from the
transformations necessary between data in user buffers and packets on
the wire (and vice versa).</p>
<p>In a virtualised environment, the network traffic of a guest VM passes
from a virtualised network device into the software-defined network
infrastructure running in the hypervisor. Packet processing is usually
considerably more complex than in a typical HPC configuration. Hardware
offload capabilities are often unable to operate or are ineffective
in this mode. As a result, networking performance in a virtualised
environment can be less performant and more CPU-intensive than an
equivalent bare metal environment.</p>
</div>
<div class="section" id="increased-jitter-in-virtualised-network-latency">
<h3>Increased Jitter in Virtualised Network Latency</h3>
<p>To varying degrees, virtualised environments generate increased system
noise effects. These effects result in a longer tail on latency
distribution for interrupts and I/O operations.</p>
<p>A bulk synchronous parallel workload, iterating in lock-step, moves at
the speed of the slowest worker. If the slowest worker is determined
by jitter effects in I/O latency, overall application progress becomes
affected by the increased system noise of a virtualised environment.</p>
</div>
</div>
<div class="section" id="using-openstack-to-deliver-virtualised-hpc">
<h2>Using OpenStack to Deliver Virtualised HPC</h2>
<p>There is considerable development activity in the area of virtualisation.
New levels of performance and capability are continually being introduced
at all levels: processor architecture, hypervisor, operating system and
cloud orchestration.</p>
<div class="section" id="best-practice-for-virtualised-system-performance">
<h3>Best Practice for Virtualised System Performance</h3>
<p>The twice-yearly cadence of OpenStack software releases leads to rapid
development of new capabilities, which improve its performance and
flexibility.</p>
<p>Across the OpenStack operators community, there is a continual
collaborative process of testing and improvement of hypervisor
efficiency. Empirical studies of different configurations of tuning
parameters are frequently published and reviewed. Clear improvements
are collected into a curated guide on hypervisor performance tuning
best practice.</p>
<p>OpenStack’s Nova compute service supports exposing many hypervisor
features for improving virtualised performance. For example:</p>
<ul class="simple">
<li>Enabling processor architecture extensions for virtualisation.</li>
<li>Controlling hypervisor techniques for efficiently managing many
guests, such as Kernel Same-page Merging (KSM). This can add CPU
overhead in return for varying degrees of improvement in memory usage
by de-duplicating identical pages. For supporting memory-intensive
workloads, KSM can be configured to prevent merging between NUMA
nodes. For performance-critical HPC, it can be disabled altogether.</li>
<li>Pinning virtual cores to physical cores.</li>
<li>Passing through the NUMA topology of the physical host to the guest
enables the guest to perform NUMA-aware memory allocation and task
scheduling optimisations.</li>
<li>Passing through the specific processor model of the physical CPUs
can enable use of model-specific architectural extensions and runtime
microarchitectural optimisations in high-performance scientific libraries.</li>
<li>Backing guest memory with huge pages reduces the impact of host
Translation Lookaside Buffer (TLB) misses.</li>
</ul>
<p>By using optimisation techniques such as these, the overhead of
virtualisation for CPU-bound and memory-bound workloads is reduced to
typically one–two percent of bare metal performance. More information
can be found in Further Reading for this section, below.</p>
<p>Conversely, by constraining the virtual architecture more narrowly,
these tuning parameters make VM migration more difficult in a cloud
infrastructure consisting of heterogeneous hypervisor hardware; in
particular, this may preclude live migration.</p>
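<p>Several of the techniques listed above are requested through flavor
extra specs. A hedged sketch (the flavor name and sizing are
hypothetical; property names per the libvirt driver of the era):</p>
<pre class="literal-block">
# Create an HPC flavor: 64GB RAM, 20GB disk, 16 vCPUs
nova flavor-create hpc.tuned auto 65536 20 16

# Pin vCPUs to physical cores, expose a single NUMA node and back
# guest memory with huge pages
nova flavor-key hpc.tuned set hw:cpu_policy=dedicated \
    hw:numa_nodes=1 hw:mem_page_size=large
</pre>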
</div>
<div class="section" id="hardware-support-for-i-o-virtualisation">
<h3>Hardware Support for I/O Virtualisation</h3>
<p>Hardware devices that support Single Root I/O Virtualization (SR-IOV)
enable the hardware resources of the physical function of a device to be
presented as many virtual functions. Each of these can be individually
configured and passed through into a different VM. In this way, the
hardware resources of a network card can provide performance with close
to no additional overhead, simultaneously serving the diverse needs of
many VMs.</p>
<p>Through direct access to physical hardware, SR-IOV networking places
some limitations on software-defined infrastructure. It is not typically
possible to apply security group policies to a network interface mapped
to an SR-IOV virtual function. This may raise security concerns for
externally accessible networks, but should not prevent SR-IOV networking
being used internally for high-performance communication between the
processes of an OpenStack hosted parallel workload.</p>
<p>Recent empirical studies have found that using SR-IOV for high-performance
networking can reduce the overhead of virtualisation typically to 1-9%
of bare metal performance for network-bound HPC workloads. Links to
some examples can be found in the Further Reading section below.</p>
</div>
<div class="section" id="using-physical-devices-in-a-virtualised-environment">
<h3>Using Physical Devices in a Virtualised Environment</h3>
<p>Some classes of HPC applications make intensive use of hardware
acceleration in the form of GPUs, Xeon Phi, etc.</p>
<p>Specialised compute hardware in the form of PCI devices can be included
in software-defined infrastructure by pass-through. The device is
mapped directly into the device tree of a guest VM, providing that VM
with exclusive access to the device.</p>
<p>A virtual machine that makes specific requirements for hardware
accelerators can be scheduled to a hypervisor with the resources
available, and the VM is ‘composed’ by passing through the hardware
it needs from the environment of the host.</p>
<p>The resource management model of GPU devices does not adapt to SR-IOV.
A GPU device is passed-through to a guest VM in its entirety. A host
system with multiple GPUs can pass-through different devices to different
systems. Similarly, multiple GPU devices can be passed-through into
a single instance and GPUdirect peer-to-peer data transfers can be
performed between GPU devices and also with RDMA-capable NICs.</p>
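<p>A hedged sketch of GPU pass-through configuration (the vendor and
product IDs shown are illustrative values for an NVIDIA device, and the
alias and flavor names are hypothetical):</p>
<pre class="literal-block">
# Sketch: whitelist the GPU for pass-through and give it an alias
cat >> /etc/nova/nova.conf <<EOF
[DEFAULT]
pci_passthrough_whitelist = { "vendor_id": "10de", "product_id": "102d" }
pci_alias = { "vendor_id": "10de", "product_id": "102d", "name": "K80" }
EOF

# Flavors then request one (or more) GPUs via the alias
nova flavor-key gpu.large set pci_passthrough:alias=K80:1
</pre>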
<p>Device pass-through, however, can have a performance impact on virtualised
memory management. The IOMMU configuration required for pass-through
restricts the use of transparent huge pages. Memory must, therefore,
be pinned in a guest VM using pass-through devices. This can limit the
flexibility of software-defined infrastructure to over-commit virtualised
resources (although over-committed resources are generally unlikely to
be worthwhile in an HPC use case). Static huge pages can still be used
to provide a boost to virtual memory performance.</p>
<p>The performance overhead of virtualised GPU-intensive scientific
workloads has been found to be as little as 1% of bare metal performance.
More information can be found in the Further Reading section below.</p>
<div class="figure">
<img alt="Three forms of virtualised hardware" src="//www.stackhpc.com/images/hpc_virtualisation-3_forms.png" style="width: 600px;" />
<p class="caption"><em>Different strategies for efficient handling of hardware devices. Here a
network card is used as example. In paravirtualisation a virtual
network device is created in software that is designed for the most
efficient software interface. In PCI-passthrough a physical device is
transferred exclusively from the hypervisor to a guest VM. In SR-IOV,
a physical device creates a number of virtual functions, sharing the
physical resources. Virtual functions can be passed-through to a guest
VM leaving the physical device behind in the hypervisor.</em></p>
</div>
</div>
<div class="section" id="os-level-virtualisation-containers">
<h3>OS-level Virtualisation: Containers</h3>
<p>The overheads of virtualisation are almost eliminated by moving to a
different model of compute abstraction. Containers, popularised by
Docker, package an application plus its dependencies as a lightweight
self-contained execution environment instead of an entire virtual machine.
The simplified execution model brings benefits in memory usage and
I/O overhead.</p>
<p>Currently, HPC networking using RDMA can be performed within containers,
but with limitations. The OFED software stack lacks awareness of network
namespaces and cgroups, which prevents per-container control and isolation
of RDMA resources. However, containers configured with host networking
can use RDMA.</p>
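<p>As a sketch of the host-networking workaround (the container image
name is hypothetical), the InfiniBand device files can be mapped into
a container directly:</p>
<pre class="literal-block">
# Sketch: share the host network namespace and expose the RDMA device
# nodes so that verbs can be used from inside the container
docker run --net=host \
    --device=/dev/infiniband/uverbs0 \
    --device=/dev/infiniband/rdma_cm \
    -it mpi-app:latest /bin/bash
</pre>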
</div>
<div class="section" id="bare-metal-virtualisation-openstacks-project-ironic">
<h3>Bare Metal Virtualisation: OpenStack’s Project Ironic</h3>
<p>OpenStack’s software-defined infrastructure does not need to be virtual.</p>
<p>Ironic is a virtualisation driver. Through some artful abstraction it
presents bare metal compute nodes as though they were virtualised compute
resources. Ironic’s design philosophy results in zero overhead to the
performance of the compute node, whilst providing many of the benefits
of software-defined infrastructure management.</p>
<p>Through Ironic, a user gains bare metal performance from their compute
hardware, but retains the flexibility to run any software image they
choose.</p>
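<p>A sketch of the user experience (the flavor, image, network and key
names are hypothetical): a bare metal instance is requested through
exactly the same interface as a VM.</p>
<pre class="literal-block">
# Sketch: "boot" a bare metal node through the ordinary Nova API
nova boot --flavor bare-metal.compute --image centos7-wholedisk \
    --nic net-id=&lt;NETWORK_UUID&gt; --key-name hpc-key bm-node-01
</pre>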
<p>The Ironic project is developing rapidly, with new capabilities being
introduced with every release. OpenStack’s latest release delivers
some compelling new functionality:</p>
<ul class="simple">
<li>Serial consoles</li>
<li>Volume attachment</li>
<li>Multi-tenant networking</li>
</ul>
<p>Complex image deployment (over multiple disks, for example) is an
evolving capability.</p>
<p>Using Ironic has some limitations:</p>
<ul class="simple">
<li>Ironic bare metal instances cannot be dynamically intermingled with
virtualised instances. However, they can be organised as separate cells
or regions within the same OpenStack private cloud.</li>
<li>Some standard virtualisation features could never be supported, such as
overcommitment and migration.</li>
</ul>
<p>See the section <a class="reference external" href="//www.stackhpc.com/openstack-and-hpc-infrastructure.html">OpenStack and HPC Infrastructure Management</a>
for further details about Ironic.</p>
</div>
</div>
<div class="section" id="virtualised-hpc-on-openstack-at-monash-university">
<h2>Virtualised HPC on OpenStack at Monash University</h2>
<p>From its inception in 2012, Australian scientific research has benefited
from the NeCTAR Research Cloud federation. Now comprising eight
institutions from across the country, NeCTAR was an early adopter of
OpenStack, and has been at the forefront of development of the project
ever since.</p>
<p>NeCTAR’s federated cloud compute infrastructure supports a wide range of
scientific research with diverse requirements. Monash Advanced Research
Computing Hybrid (MonARCH) was commissioned in 2015/2016 to provide a
flexible and dynamic HPC resource.</p>
<div class="figure">
<img alt="NeCTAR federation across Australia" src="//www.stackhpc.com/images/hpc_virtualisation-NeCTAR_Australia.png" style="width: 400px;" />
</div>
<p>MonARCH has 35 dual-socket Haswell-based compute nodes and 820 CPU
cores. MonARCH exploits cloud-bursting techniques to grow elastically by
using resources from across the NeCTAR federation. The infrastructure
uses a fabric of 56G Mellanox Ethernet for a converged, high-speed
network. The cloud control plane is running Ubuntu Trusty and the KVM
hypervisor. OpenStack Liberty (as of Q3’2016) was deployed using Ubuntu
distribution packages (including selected patches as maintained by NeCTAR
Core Services), orchestrated and configured using Puppet.</p>
<div class="figure">
<img alt="RACmon group at Monash University" src="//www.stackhpc.com/images/hpc_virtualisation-racmon.png" style="width: 400px;" />
</div>
<p>MonARCH makes extensive use of SR-IOV for accessing its HPC network
fabric. The high-speed network is configured to use VLANs for virtual
tenant networking, enabling layer-2 RoCEv1 (RDMA over Converged
Ethernet). RDMA is used in guest instances in support of tightly coupled
parallel MPI workloads, and for high-speed access to 300TB of Lustre
storage.</p>
<p>Following MonARCH, Monash University recently built a mixed CPU &
GPU cluster called M3, the latest system for the MASSIVE (Multi-modal
Australian ScienceS Imaging and Visualisation Environment) project. Within
M3, there are 1700 Haswell CPU cores along with 16 quad-GPU compute nodes
and an octo-GPU compute node, based upon the NVIDIA K80 dual-GPU. Staff
at Monash University’s R@CMon cloud research group have integrated
SR-IOV networking and GPU pass-through into their compute instances.</p>
<div class="figure">
<img alt="GPU passthrough in a virtualised environment" src="//www.stackhpc.com/images/hpc_virtualisation-gpu_passthrough.png" style="width: 400px;" />
</div>
<p>Specific high-performance OpenStack flavors are defined to require
pass-through of one or more dedicated GPUs. This enables instances with
one to four GPUs to run concurrently on a dual-K80 compute node, e.g.,
to support CUDA accelerated HPC workloads and/or multiple interactive
visualisation virtual-workstations.</p>
<p>Blair Bethwaite, senior HPC consultant at Monash University, said:</p>
<blockquote>
<p>“Using OpenStack brings us a high degree of flexibility in the HPC
environment. Applying cloud provisioning and management techniques
also helps to make the HPC-stack more generic, manageable and quick to
deploy. Plus, we benefit from the constant innovation from the OpenStack
community, with the ability to pick and choose new services and projects
from the ecosystem. OpenStack’s flexibility in the SDN space also
offers compelling new avenues to integrate researchers’ personal or
lab servers with the HPC service.</p>
<p>“However, before racing out to procure your next HPC platform driven
by OpenStack, I’d recommend evaluating your potential workloads
and carefully planning and testing the appropriate mix of hardware
capabilities, particularly acceleration features. KVM, OpenStack’s most
popular hypervisor, can certainly perform adequately for HPC—in recent
testing we are getting 98 percent on average and up to 99.9 percent
of bare metal in Linpack tests—but a modern HPC system is likely to
require some subset of bare metal infrastructure. If I was planning a
new deployment today I’d seriously consider including Ironic so that a
mix of bare metal and virtual cloud nodes can be provisioned and managed
consistently. As Ironic is maturing and becoming more feature-complete,
I expect to see many more highly integrated deployments and reference
architectures emerging in the years to come.”</p>
</blockquote>
</div>
<div class="section" id="optimising-for-time-to-paper-using-hpc-on-openstack">
<h2>Optimising for “Time to Paper” using HPC on OpenStack</h2>
<p>When evaluating OpenStack as a candidate for HPC infrastructure for
research computing, the “time to paper” metric of the scientists
using the resource should be included in consideration.</p>
<p>Skeptics of using cloud compute for HPC infrastructure inevitably cite
the various overheads of virtualisation in the case against OpenStack.
With a rapidly-developing technology, these arguments can often be
outdated. Furthermore, cloud infrastructure presents a diminishing
number of trade-offs in return for an increasing number of compelling
new capabilities.</p>
<p>Unlike conventional HPC system management, OpenStack provides, for
example:</p>
<ul class="simple">
<li><strong>Standardisation</strong> as users can interact with the system through a
user-friendly web interface, a command line interface or a software API.</li>
<li><strong>Flexibility</strong> and agility as users allocate compute resources as required
and have exclusive use of the virtual resources. There is fine-grained
control of the extent to which physical resources are shared.</li>
<li>Users can <strong>self-serve</strong> and boot a software image of their choosing without
requiring operator assistance. It is even possible for users to create
their own software images to run - a powerful advantage that eliminates
toil for the administrators and delay for the users.</li>
<li>Additional <strong>security</strong> as users have a higher degree of separation from
each other. They cannot observe other users and are isolated from one
another on the network.</li>
</ul>
<p>Through careful consideration, an HPC-aware configuration of OpenStack is
capable of realising all the benefits of software-defined infrastructure
whilst incurring minimal overhead. In its various forms, virtualisation
strikes a balance between new capabilities and consequential overhead.</p>
</div>
<div class="section" id="further-reading">
<h2>Further Reading</h2>
<ul class="simple">
<li>The OpenStack Hypervisor Tuning Guide is a living document detailing best practice for virtualised performance: <a class="reference external" href="https://wiki.openstack.org/wiki/Documentation/HypervisorTuningGuide">https://wiki.openstack.org/wiki/Documentation/HypervisorTuningGuide</a></li>
<li>CERN’s OpenStack in Production blog is a good example of the continual community process of hypervisor tuning: <a class="reference external" href="http://openstack-in-production.blogspot.co.uk/">http://openstack-in-production.blogspot.co.uk/</a></li>
<li>As an example of the continuous evolution of hypervisor development, the MIKELANGELO project is currently working on optimisations for reducing the latency of virtualised IO using their sKVM project: <a class="reference external" href="https://www.mikelangelo-project.eu/2015/10/how-skvm-will-beat-the-io-performance-of-kvm/">https://www.mikelangelo-project.eu/2015/10/how-skvm-will-beat-the-io-performance-of-kvm/</a></li>
<li>The OpenStack Foundation has published a detailed white paper on using containers within OpenStack: <a class="reference external" href="https://www.openstack.org/assets/pdf-downloads/Containers-and-OpenStack.pdf">https://www.openstack.org/assets/pdf-downloads/Containers-and-OpenStack.pdf</a></li>
<li>An informative paper describing recent developments enabling GPUdirect peer-to-peer transfers between GPUs and RDMA-enabled NICs: <a class="reference external" href="http://grids.ucs.indiana.edu/ptliupages/publications/15-md-gpudirect%20(3).pdf">http://grids.ucs.indiana.edu/ptliupages/publications/15-md-gpudirect%20(3).pdf</a></li>
<li>Whilst the focus of this paper is on comparing virtualisation strategies on the ARM architecture, the background information is accessible and the comparisons made with the x86 architecture are insightful: <a class="reference external" href="http://www.cs.columbia.edu/~cdall/pubs/isca2016-dall.pdf">http://www.cs.columbia.edu/~cdall/pubs/isca2016-dall.pdf</a></li>
<li>For more information about MonARCH at Monash University, see the R@CMon blog: <a class="reference external" href="https://rcblog.erc.monash.edu.au/">https://rcblog.erc.monash.edu.au/</a></li>
</ul>
</div>
<div class="section" id="acknowledgements">
<h2>Acknowledgements</h2>
<p>This document was written by Stig Telfer of StackHPC Ltd with the support
of Cambridge University, with contributions, guidance and feedback from
subject matter experts:</p>
<ul class="simple">
<li><strong>Professor DK Panda</strong> and <strong>Dr. Xiaoyi Lu</strong> from NOWLAB, Ohio State University.</li>
<li><strong>Blair Bethwaite</strong>, Senior HPC Consultant at Monash University.</li>
</ul>
<div class="figure">
<img alt="Creative commons licensing" src="//www.stackhpc.com/images/cc-by-sa.png" style="width: 100px;" />
<p class="caption">This document is provided as open source with a Creative Commons license
with Attribution + Share-Alike (CC-BY-SA)</p>
</div>
</div>
Mellanox OFED for the Overcloud (Stig Telfer, 2016-06-03, updated 2016-06-04): /building-mellanox-ofed.html
<p class="first last">Enabling RDMA Support in OpenStack Overcloud Images</p>
<div class="section" id="building-mellanox-ofed-for-openstack-overcloud-images">
<h2>Building Mellanox OFED for OpenStack Overcloud Images</h2>
<p>A typical HPC system environment is continuously managed: a maintained
configuration, updated over time as new software packages become
available. The cloud compute model takes a different approach:
infrastructure, (such as the OS and run-time environment of an HPC
system) is like code, and that infrastructure gets managed through
"recompilation", not through being updated in place.</p>
<p>This can have far-reaching consequences for the workflow with which
systems are managed. One clear benefit is that any system can be
rebuilt according to a formula (likely to be a collection of scripts,
Ansible playbooks or Puppet manifests), and that formula can be
managed and developed under source control. If a formula is
sufficiently precise, it can provide an increased level of
repeatability: at some future point we can check out the formula
and use it again to rebuild servers to a similar configuration.</p>
<p>One quirky side-effect of this approach is that a cloud-model image
contains no 'baggage': no old, superseded packages lying unused
after an upgrade. In particular, cloud-model images do not contain
the kernel package that originally shipped with that distribution,
which has since been updated several times over. This turns out to be
a nuisance when installing
when installing
<a class="reference external" href="http://www.mellanox.com/page/products_dyn?product_family=26">Mellanox OFED</a>,
which assumes this original kernel is present.</p>
<p>When building a cloud-model system image we are typically also
building something different from the environment of the build host.
The build host is likely to be a well-stocked sysadmin's toolbox; the
image should be pared-down and minimally sufficient for the (virtualised)
task at hand. We need to prevent the configuration of the build
host from polluting the cloud image.</p>
<div class="section" id="how-mellanox-ofed-is-built">
<h3>How Mellanox OFED is Built</h3>
<p>The new kernel should be installed on the build system. Assuming a
kernel RPM package is being used, both the kernel and the kernel-devel
RPMs should be installed.</p>
<p>Mellanox OFED is downloaded as a tarball, or an equivalent ISO
image. Within the image is a yum repo of RPMs and a set of scripts
for automating building, installing and uninstalling.</p>
<p>After unpacking the tarball (or mounting the ISO), we use the build
automation script, <tt class="docutils literal">mlnx_add_kernel_support.sh</tt>:</p>
<pre class="literal-block">
KVER=3.10.0-327.22.2.el7.x86_64
./mlnx_add_kernel_support.sh \
--mlnx_ofed $PWD --kernel $KVER --yes --verbose --make-iso
</pre>
</div>
<div class="section" id="how-to-use-the-output">
<h3>How to Use the Output</h3>
<p>If the build succeeds, an updated version of the Mellanox OFED
tarball (or ISO image) is generated in /tmp. This output can be used
to install in exactly the same way as the Mellanox OFED distribution
that was originally downloaded.</p>
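<p>A sketch of installing from the regenerated ISO (the exact file name
depends on the OFED version; the installer script ships inside the
image):</p>
<pre class="literal-block">
# Sketch: mount the rebuilt ISO and run the bundled installer
mount -o loop /tmp/MLNX_OFED_LINUX-*-ext.iso /mnt
/mnt/mlnxofedinstall --force
</pre>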
</div>
</div>
Talking the Scientific Working Group at OpenStack Austin (Stig Telfer, 2016-05-02): /talking-the-scientific-working-group-at-openstack-austin.html
<p>Stig Telfer (from StackHPC Ltd) and Blair Bethwaite (from Monash
University) talk to SuperUser TV about OpenStack and research computing
with Flanders from the OpenStack Foundation:</p>
<div class="youtube"><iframe src="https://www.youtube.com/embed/neZvmr1nzFE" width="750" height="500" allowfullscreen seamless frameBorder="0"></iframe></div>StackHPC at Computing Insight UK2015-12-08T10:20:00+00:002016-10-20T18:40:00+01:00Stig Telfertag:www.stackhpc.com,2015-12-08:/stackhpc-at-computing-insight-uk.html<p><a class="reference external" href="https://eventbooking.stfc.ac.uk/news-events/computing-insight-uk-2015">Computing Insight UK 2015</a>, was held at the Ricoh Arena in Coventry, UK.</p>
<p>Stig presented on the challenges of integrating HPC with OpenStack.</p>
<img alt="Stig at CIUK 2015" src="//www.stackhpc.com/images/computing-insight-uk-2015-stig.png" style="width: 600px;" />
StackHPC at OpenStack Tokyo (Stig Telfer, 2015-10-29, updated 2016-10-20): /stackhpc-at-openstack-tokyo.html
<p>Stig presented at the <a class="reference external" href="https://www.openstack.org/summit/tokyo-2015/">OpenStack Tokyo Summit</a>, on behalf of Cambridge University and with our friends at Canonical.</p>
<p>Stig presented on "The Case for a Scientific OpenStack":</p>
<div class="youtube"><iframe src="https://www.youtube.com/embed/WP96uLPvOdc" width="750" height="500" allowfullscreen seamless frameBorder="0"></iframe></div>