This case study was contributed to the Open Infrastructure for AI white paper.
AI workloads demand some of the most performance-intensive infrastructure on the market today. Harnessing its full potential is critical to maximising the return on substantial hardware and software investments.
6G AI Sweden has the ambition to provide Swedish companies with world-class AI capabilities, whilst maintaining absolute data sovereignty. To realise this ambition, 6G AI Sweden selected a technology stack centred around OpenStack, Kubernetes and open infrastructure solutions, designed and deployed in partnership with StackHPC.
Hardware Infrastructure
Compute
6G AI Sweden's AI compute nodes are based on the Nvidia "HGX" reference architecture:
- 8x Nvidia H200 GPUs
- 8x 400G NDR Infiniband adapters for high-speed networking
- 8x NVMe drives, one local to each GPU
- 2x200G BlueField-3 Ethernet SmartNICs
- 1G Ethernet for provisioning
- Dedicated BMC connection
All server hardware is managed using the industry-standard Redfish protocol.
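Redfish exposes each server's management controller as JSON resources over HTTPS. The sketch below is a minimal illustration using a hand-written sample payload (not output from this fleet), showing how an operator might pull power and health state from a Redfish `ComputerSystem` resource:

```python
import json

# A trimmed example of a Redfish ComputerSystem resource, as returned by
# GET /redfish/v1/Systems/<id> on a BMC. Field names follow the DMTF
# Redfish schema; the values here are invented for illustration.
SAMPLE_SYSTEM = json.loads("""
{
  "@odata.id": "/redfish/v1/Systems/1",
  "PowerState": "On",
  "Status": {"State": "Enabled", "Health": "OK"}
}
""")

def summarise_system(system: dict) -> dict:
    """Extract the fields an operator typically polls via Redfish."""
    return {
        "power": system["PowerState"],
        "health": system.get("Status", {}).get("Health"),
    }

print(summarise_system(SAMPLE_SYSTEM))
```

In a real deployment the payload would come from an authenticated HTTPS request to the BMC rather than an inline string.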
A dedicated hypervisor resource provides virtualised compute for control, monitoring and supporting functions.
Networking
6G AI Sweden's high-speed fabric is built with Nvidia 400G NDR Infiniband, deployed as a multi-rail network interconnecting the H200 GPU devices.

An example Infiniband network fabric, in a fat tree topology, in which each compute node (bottom row) has 8 Infiniband links. Multiple links may exist between switches to produce a non-blocking network, in which all nodes could communicate simultaneously at full network bandwidth.
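To give a rough sense of the scaling behind the figure, a non-blocking two-level fat tree built from switches of radix r supports r leaf switches, each with r/2 host ports and r/2 uplinks (one to each of r/2 spine switches). The sketch below assumes 64-port NDR switches for illustration; this is not the deployed switch count:

```python
def two_level_fat_tree_ports(radix: int) -> int:
    """Maximum host ports in a non-blocking two-level fat tree of
    switches with `radix` ports each: radix leaf switches, each giving
    half its ports to hosts and half to the spine layer."""
    leaves = radix
    host_ports_per_leaf = radix // 2
    return leaves * host_ports_per_leaf

# Assuming 64-port 400G switches (an assumption for illustration):
host_ports = two_level_fat_tree_ports(64)   # 2048 host ports
nodes = host_ports // 8                     # 8 Infiniband rails per HGX node
per_node_bw_gbits = 8 * 400                 # 3.2 Tbit/s injection per node
```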
The infrastructure's Ethernet network, again from Nvidia, is based on 800G switches and network fabric, broken out into 2x200G bonded links connecting the compute nodes and storage.
In addition, management Ethernet networks exist for server provisioning, control and power/management.
Storage
6G AI Sweden selected high-performance storage from VAST Data.
VAST Data storage provides the 6G AI cloud with object, block and file storage services using industry-standard protocols, along with self-service flexibility, multi-tenancy and industry-leading performance.
OpenStack Cloud Infrastructure
For AI use cases a significant advantage of OpenStack is its native support for bare metal, virtualisation and containerisation, all within the same reconfigurable infrastructure. OpenStack's multi-tenancy model and fine-grained policy are also well-suited to cloud-native service providers for AI infrastructure.
OpenStack Kayobe provisions clouds using infrastructure-as-code principles. Hardware configuration, OS provisioning and network configuration are all defined using a single version-controlled source repo. OpenStack services are deployed and configured via Kolla-Ansible. Kayobe's precise configuration and management of infrastructure was selected for its flexibility, ease of operation and built-in support for high-performance computing.
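As a sketch of what this looks like in practice, enabling bare metal support in a Kayobe configuration repository might resemble the fragment below. The file path and variable names follow Kayobe's conventions, but this is illustrative rather than the deployed configuration:

```yaml
# etc/kayobe/kolla.yml -- illustrative fragment, not the real config.
# Enable the OpenStack services used for multi-tenant bare metal compute.
kolla_enable_ironic: true
kolla_enable_neutron_provider_networks: true
```

Because the whole configuration lives in a version-controlled repository, changes like this are reviewed and applied as code rather than made by hand on individual hosts.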
For the AI compute nodes a bare metal cloud infrastructure was created using OpenStack Ironic. The servers are deployed using Ironic's virtual media driver, streamlining the bootstrap process and avoiding the need for DHCP and iPXE steps during the deployment.
Multi-tenant isolation is a critical requirement for 6G AI's business model. Bare metal compute infrastructure is provisioned into client tenancies. Ironic implements functionality for effective management of bare metal in a multi-tenant environment, such as configurable steps for deployment and cleaning.
Neutron networking is implemented using OVN. Multi-tenant network isolation is implemented using virtual tenant networks, connecting a tenant's virtual machines and bare metal compute nodes.
- Tenant networks in Ethernet are VLANs, enabling easy intercommunication between bare metal and virtualised compute resources. Multi-tenant Ethernet network isolation is implemented using the networking-generic-switch Neutron driver. Separate networks for cleaning and provisioning keep other phases of the infrastructure lifecycle safely behind the scenes.
- Tenant networks in Infiniband are partitions, allocated as partition keys (pkeys) and managed with the networking-mellanox Neutron driver, integrated with Nvidia's Unified Fabric Manager (UFM).
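An Infiniband pkey is a 16-bit value: the top bit marks full (versus limited) partition membership, leaving 15 bits for the partition number. A small sketch of the encoding, illustrative rather than taken from the driver's source:

```python
FULL_MEMBER_BIT = 0x8000  # top bit of a 16-bit pkey marks full membership

def make_pkey(partition: int, full_member: bool = True) -> int:
    """Encode an Infiniband partition key from a 15-bit partition
    number and a membership flag, as allocated per tenant network."""
    if not 0 <= partition <= 0x7FFF:
        raise ValueError("partition number must fit in 15 bits")
    return partition | (FULL_MEMBER_BIT if full_member else 0)

# e.g. tenant partition 0x1a with full membership encodes as pkey 0x801a
```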
The VAST Data storage provides backing services for Glance, Cinder and Manila, plus S3 object storage for tenant workloads. For virtualised infrastructure, block storage is implemented using VAST Data's newly developed NVMe-over-TCP Cinder driver.
File storage is implemented using VAST Data's Manila driver, providing multi-tenant orchestration of high-performance file storage, all using the latest optimisations of industry-standard NFS.
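As a hypothetical example of the self-service workflow this enables, a tenant might create and export an NFS share with the standard Manila CLI. The share type name and size below are invented for illustration:

```shell
# Create a 10 TiB NFS share (the "vastdata" share type is an assumption)
manila create NFS 10240 --name training-data --share-type vastdata

# Allow read/write access from a tenant subnet
manila access-allow training-data ip 10.0.0.0/24 --access-level rw
```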
Azimuth and Kubernetes
World-class AI capabilities require world-class compute platforms, building on high-performance infrastructure to enable users to get straight to AI productivity. StackHPC is the lead developer and custodian of the Azimuth Cloud Portal, an intuitive portal for providing compute platforms on a self-service basis. Azimuth is free and open source software, and StackHPC offers services for deployment, configuration, extension, maintenance, and support of the project.

Platform creation in Azimuth involves selecting from a curated set of predefined infrastructure-as-code recipes, pre-configured with deep integrations into the underlying infrastructure.
Most AI workloads are deployed as containerised applications using Kubernetes. In the 6G AI cloud, Kubernetes clusters span VM resources and bare metal AI compute nodes. Kubernetes can be deployed as a platform in Azimuth, or as a GitOps-driven workflow via FluxCD. In both cases, Kubernetes is deployed using industry-standard Cluster API and configured using Helm. 6G AI Sweden provides the Nvidia NGC Catalog of containerised software applications and Helm charts.
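A Cluster API cluster deployed via Helm is described by a values file. The hypothetical fragment below sketches a cluster mixing a virtualised control plane with bare metal GPU workers; the flavor names, counts and version are invented for illustration, not the 6G AI configuration:

```yaml
# Illustrative Helm values for a Cluster API managed Kubernetes cluster.
kubernetesVersion: 1.30.2
controlPlane:
  machineFlavor: vm.medium          # control plane on virtualised compute
  machineCount: 3
nodeGroups:
  - name: gpu-workers
    machineFlavor: baremetal.hgx    # hypothetical Ironic bare metal flavor
    machineCount: 4
```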
Get in touch
If you would like to get in touch, we would love to hear from you. Reach out to us via Bluesky, LinkedIn, or directly via our contact page.