Kubernetes, RDMA and OpenStack?
It is actually quite like oil, balsamic vinegar and bread.
Done right, it's really tasty!
This blog describes how we get RDMA networking inside
Kubernetes pods that are running inside OpenStack VMs.
Moreover, we then make this available on demand via
Azimuth.
Why Kubernetes and OpenStack?
When you want to create a Kubernetes cluster on some hardware, you need
a way to manage that hardware and the associated infrastructure.
OpenStack gives you a nice set of APIs to manage your
infrastructure.
StackHPC has been using OpenStack to manage the storage, networking
and compute for both virtual machines and bare metal machines.
Moreover, StackHPC knows how to help you use OpenStack to get the
most out of your hardware investment, be that maximum performance or
maximum utilisation.
K8s Cluster API Provider OpenStack
We have always tried to treat Kubernetes clusters as cattle.
We started that journey with OpenStack Magnum.
Recently we have started building a new Magnum driver that
makes use of K8s Cluster API,
its OpenStack Provider,
and the K8s image builder.
The first part of that journey has been wrapping up our preferences in the form
of our CAPO Helm charts, which make use of Cluster API, and a custom operator
that installs add-ons into those clusters. We have an ongoing upstream
discussion on converging on a common approach to the add-ons.
Cluster API brings with it the ability to auto-scale and auto-heal.
The cluster template comes with "batteries included": the add-ons
include GPU drivers, network card drivers and
Cloud Provider OpenStack.
We also include Grafana, Loki and Prometheus to help users monitor the clusters.
Putting this all together, we now have a K8s-native way to define
cluster templates and create them as required on OpenStack.
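As a rough sketch of what this looks like in practice, a cluster can be
created by passing a small values file to the openstack-cluster chart in our
capi-helm-charts repository. The field names below reflect our reading of the
chart, and the image, flavor and credential names are placeholders, so treat
it as illustrative rather than definitive.

    # Illustrative values for the openstack-cluster Helm chart
    # (field names are approximate; check the chart for the current schema)
    kubernetesVersion: 1.24.2
    machineImageId: <uuid-of-image-built-with-the-k8s-image-builder>
    cloudCredentialsSecretName: cluster-cloud-credentials
    controlPlane:
      machineFlavor: vm.small
    nodeGroups:
      - name: workers
        machineCount: 3
        machineFlavor: vm.rdma.large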
SR-IOV with Mellanox VF-LAG
So Cluster API lets us create k8s clusters on OpenStack.
We make use of VF-LAG
to get performant RDMA networking within VMs.
On a single ConnectX-5 using bonded 2x100GbE on a PCIe Gen 4
platform, inside VMs we can see RDMA bandwidths of over 195 Gb/s,
TCP bandwidth of around 180 Gb/s, and MPI latency of under 5 microseconds.
For more details, please see our presentation from KubeCon Detroit, co-presented with the team from Nvidia Networking.
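Within Cluster API Provider OpenStack, the way we understand this fitting
together is that the worker's port is requested as an SR-IOV (direct) port, so
the VM receives a VF from the bonded ConnectX-5. The snippet below is a minimal
sketch assuming the v1alpha6 API; the network, flavor and image names are
placeholders rather than a definitive configuration.

    # Hypothetical worker machine template requesting a direct (SR-IOV) port
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha6
    kind: OpenStackMachineTemplate
    metadata:
      name: rdma-workers
    spec:
      template:
        spec:
          flavor: vm.rdma.large          # placeholder flavor
          image: ubuntu-2004-kube-v1.24  # placeholder image from the image builder
          ports:
            - network:
                name: rdma-net           # placeholder Neutron network
              vnicType: direct           # binds an SR-IOV VF, backed by VF-LAG on the host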
Now that we have the VMs working nicely, the next thing is getting this working
inside the pods running in those VMs.
RDMA using MACVLAN and multus
To use macvlan, we make use of the Mellanox network operator:
https://github.com/Mellanox/network-operator
Using the operator, we make use of multus to allow pods to
request extra networking.
We then use macvlan to provide the pods with a network interface
that has a random MAC address and an appropriate IP, such that
pods can reach other pods and external appliances that might
need RDMA. In addition, we make use of the RDMA shared device
plugin to ensure pods get access to the RDMA device without
needing to be privileged pods with host networking.
Finally, the OFED drivers are installed using a node selector to
only target hosts with Mellanox NICs.
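As an illustration, the network operator provides a MacvlanNetwork resource
that generates the multus NetworkAttachmentDefinition for us. A minimal sketch,
assuming whereabouts for IPAM, might look like the following; the interface
name, subnet and namespace are placeholders and the exact fields may vary
between operator versions.

    # Hypothetical macvlan network managed by the Mellanox network operator
    apiVersion: mellanox.com/v1alpha1
    kind: MacvlanNetwork
    metadata:
      name: rdma-macvlan
    spec:
      networkNamespace: default
      master: ens3        # placeholder: the RDMA-capable interface inside the VM
      mode: bridge
      mtu: 9000
      ipam: |
        {
          "type": "whereabouts",
          "range": "10.0.0.0/24"
        }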
To prove this all works we need to run some benchmarks. To make this easier
we have created an operator, kube-perftest.
The operator uses iperf2, ib_read_bw, ib_read_lat, MPI ping-pong and more
to help us understand the performance.
In particular, kube-perftest is able to request an additional
multus network, such as the macvlan network described above.
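To give a feel for how a benchmark pod consumes all of this, the sketch below
attaches the macvlan network via a multus annotation, requests the RDMA device
from the shared device plugin and runs ib_read_bw from the perftest suite. The
image and the rdma/rdma_shared_device_a resource name are assumptions based on
the usual shared device plugin examples, not a description of exactly what
kube-perftest generates.

    # Hypothetical benchmark server pod; kube-perftest creates similar pods for us
    apiVersion: v1
    kind: Pod
    metadata:
      name: ib-read-bw-server
      annotations:
        k8s.v1.cni.cncf.io/networks: rdma-macvlan   # the multus network defined above
    spec:
      containers:
        - name: perftest
          image: mellanox/rping-test      # placeholder image containing the perftest tools
          command: ["ib_read_bw"]         # server side; a client pod connects to this IP
          securityContext:
            capabilities:
              add: ["IPC_LOCK"]           # required to register RDMA memory regions
          resources:
            limits:
              rdma/rdma_shared_device_a: 1  # assumed resource name from the device plugin config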
Self-Service RDMA enabled Apps using Azimuth
To make it easy to create Kubernetes clusters from the templates above, we make
use of Azimuth, so users can create
clusters that come with working RDMA via a simple user interface.
Get in touch
If you would like to get in touch, we would love to hear
from you. Reach out to us via Twitter
or directly via our contact page.