OpenStack and High Performance Data
What can data requirements mean an HPC context? The range of use cases
is almost boundless. With considerable generalisation we can consider
some broad criteria for requirements, which expose the inherent tensions
between HPC-centric and cloud-centric storage offerings:
- The data access model: data objects could be stored and retrieved
using file-based, block-based, object-based or stream-based access.
HPC storage tends to focus on a model of file-based shared data storage
(with an emerging trend for object-based storage proposed for achieving
new pinnacles of scalability). Conversely cloud infrastructure favours
block-based storage models, often backed with and extended by object-based
storage. Support for data storage through shared filesystems is still
maturing in OpenStack.
- The data sharing model: applications may request the same data
from many clients, or the clients may make data accesses that are
segregated from one another. This distinction can have significant
consequences for storage architecture. Cloud storage and HPC storage
are both highly distributed, but often differ in the way in which data
access is parallelised. Providing high-performance access for many
clients to a shared dataset can be a niche requirement specific to HPC.
Cloud-centric storage architectures typically focus on delivering high
aggregate throughput on many discrete data accesses.
- The level of data persistence. An HPC-style tiered data storage
architecture does not need to incorporate data redundancy at every level
of the hierarchy. This can improve performance for tiers caching data
closer to the processor.
The cloud model offers capabilities that enable new possibilities for HPC:
- Automated provisioning. Software-defined infrastructure automates the
provisioning and configuration of compute resources, including storage.
Users and group administrators are able to create and configure storage
resources to their specific requirements at the exact time they are
- Multi-tenancy. HPC storage does not offer multi-tenancy with the level
of segregation that cloud can provide. A virtualised storage resource
can be reserved for the private use of a single user, or could be shared
between a controlled group of collaborating users, or could even be
accessible by all users.
- Data isolation. Sensitive data requires careful data management.
Medical informatics workloads may contain patient genomes. Engineering
simulations may contain data that is trade secret. OpenStack’s
segregation model is stronger than ownership and permissions on a
POSIX-compliant shared filesystem, and also provides finer-grained
There is clear value in increased flexibility - but at what cost in
performance? In more demanding environments, HPC storage tends to focus
on and be tuned for delivering the requirements of a confined subset
of workloads. This is the opposite approach to the conventional cloud
model, in which assumptions may not be possible about the storage access
patterns of the supported workloads.
This study will describe some of these divergences in greater detail, and
demonstrate how OpenStack can integrate with HPC storage infrastructure.
Finally some methods of achieving high performance data management on
cloud-native storage infrastructure will be discussed.
File-based Data: HPC Parallel Filesystems in OpenStack
Conventionally in HPC, file-based data services are delivered
by parallel filesystems such as Lustre and Spectrum Scale (GPFS).
A parallel filesystem is a shared resource. Typically it is mounted on
all compute nodes in a system and available to all users of a system.
Parallel filesystems excel at providing low-latency, high-bandwidth
access to data.
Parallel filesystems can be integrated into an OpenStack environment in
a variety of configuration models.
Provisioned Client Model
Access to an external parallel filesystem is provided through an OpenStack
provider network. OpenStack compute instances - virtualised or bare
metal - mount the site filesystem as clients.
This use case is fairly well established. In the virtualised use case,
performance is achieved through use of SR-IOV (with only a moderate
level of overhead). In the case of Lustre, with a layer-2 VLAN provider
network the o2ib client drivers can use RoCE to perform Lustre data
transport using RDMA.
Cloud-hosted clients on a parallel filesystem raise issues with root in
a cloud compute context. On cloud infrastructure, privileged accesses
from a client do not have the same degree of trust as on conventional HPC
infrastructure. Lustre approaches this issue by introducing Kerberos
authentication for filesystem mounts and subsequent file accesses.
Kerberos credentials for Lustre filesystems can be supplied to OpenStack
instances upon creation as instance metadata.
Provisioned Filesystem Model
There are use cases where the dynamic provisioning of software-defined
parallel filesystems has considerable appeal. There have been
proof-of-concept demonstrations of provisioning Lustre filesystems from
scratch using OpenStack compute, storage and network resources.
The OpenStack Manila project aims to provision and manage shared
filesystems as an OpenStack service. IBM’s Spectrum Scale integrates
with Manila to re-export GPFS parallel filesystems using the user-space
Ganesha NFS server.
Currently these projects demonstrate functionality over performance.
In future evolutions the overhead of dynamically provisioned parallel
filesystems on OpenStack infrastructure may improve.
A Parallel Data Substrate for OpenStack Services
IBM positions Spectrum Scale as a distributed data service for
underpinning OpenStack services such as Cinder, Glance, Swift and Manila.
More information about using Spectrum Scale in this manner can be found
in IBM Research’s red paper on the subject (listed in the Further
Applying HPC Technologies to Enhance Data IO
A recurring theme throughout this study has been the use of remote DMA
for efficient data transfer in HPC environments. The advantages of this
technology are especially pertinent in data intensive environments.
OpenStack’s flexibility enables the introduction of RDMA protocols
for many cloud infrastructure operations to reduce latency, increase
bandwidth and enhance processor efficiency:
Cinder block data IO can be performed using iSER (iSCSI extensions
for RDMA). iSER is a drop-in replacement for iSCSI that is easy to
configure and set up. Through providing tightly-coupled IO resources
using RDMA technologies, the functional equivalent of HPC-style burst
buffers can be added to the storage tiers of cloud infrastructure.
Ceph data transfers can be performed using the Accelio RMDA transport.
This technology was demonstrated some years ago but does not appear
to have achieved production levels of stability or gained significant
The NOWLAB group at Ohio State University have developed extensions to
data analytics platforms such as HBase, Hadoop, Spark and Memcached to
optimise data movements using RDMA.
Optimising Ceph Storage for Data-Intensive Workloads
The versatility of Ceph embodies the cloud-native approach to storage,
and consequently Ceph has become a popular choice of storage technology
for OpenStack infrastructure. A single Ceph deployment can support
various protocols and data access models.
Ceph is capable of delivering strong read bandwidth. For large reads
from OpenStack block devices, Ceph is able to parallelise the delivery
of the read data across multiple OSDs.
Ceph’s data consistency model commits writes to multiple OSDs before
a write transaction is completed. By default a write is replicated
three times. This can result in higher latency and lower performance
on write bandwidth.
Ceph can run on clusters of commodity hardware configurations. However,
in order to maximise the performance (or price performance) of a Ceph
cluster some design rules of thumb can be applied:
Use separate physical network interfaces for external storage network and
internal storage management. On the NICs and switches, enable Ethernet
flow control and raise the MTU to support jumbo frames.
Each drive used for Ceph storage is managed by an OSD process.
A Ceph storage node usually contains multiple drives (and multiple
The best price/performance and highest density is achieved using fat
storage nodes, typically containing 72 HDDs. These work well for
large scale deployments, but can lead to very costly units of failure
in smaller deployments. Node configurations of 12-32 HDDs are usually
found in deployments of intermediate scale.
Ceph storage nodes usually contain a higher-speed write journal, which is
dedicated to service of a number of HDDs. An SSD journal can typically
feed 6 HDDs while an NVMe flash device can typically feed up to 20 HDDs.
About 10G of external storage network bandwidth balances the read
bandwidth of up to 15 HDDs. The internal storage management network
should be similarly scaled.
A rule of thumb for RAM is to provide 0.5GB-1GB of RAM per TB per
On multi-socket storage nodes, close attention should be paid to NUMA
considerations. The PCI storage devices attached to each socket should
be working together. Journal devices should be connected with HDDs
attached to HBAs on the same socket. IRQ affinity should be confined
to cores on the same socket. Associated OSD processes should be pinned
to the same cores.
For tiered storage applications in which data can be regenerated from
other storage, the replication count can safely be reduced from 3 to
The Cancer Genome Collaboratory: Large-scale Genomics on OpenStack
Genome datasets can be hundreds of terabytes in size, sometimes requiring
weeks or months to download and significant resources to store and
The Ontario Institute for Cancer Research built the Cancer Genome
Collaboratory (or simply The Collaboratory) as a biomedical research
resource built upon OpenStack infrastructure. The Collaboratory aims
to facilitate research on the world’s largest and most comprehensive
cancer genome dataset, currently produced by the International Cancer
Genome Consortium (ICGC).
By making the ICGC data available in cloud compute form in the
Collaboratory, researchers can bring their analysis methods to the cloud,
yielding benefits from the high availability, scalability and economy
offered by OpenStack, and avoiding the large investment in compute
resources and the time needed to download the data.
An OpenStack Architecture for Genomics
The Collaboratory’s requirements for the project were to build a cloud
computing environment providing 3000 compute cores and 10-15 PB of raw
data stored in a scalable and highly-available storage. The project
has also met constraints of budget, data security, confined data centre
space, power and connectivity. In selecting the storage architecture,
capacity was considered to be more important than latency and performance.
Each rack hosts 16 compute nodes using 2U high-density chassis, and
between 6 and 8 Ceph storage nodes. Hosting a mix of compute and storage
nodes in each rack keeps some of the Nova-Ceph traffic in the same rack,
while also lowering the power requirement for these high density racks
(2 x 60A circuits are provided to each rack).
As of September 2016, Collaboratory has 72 compute nodes (2600 CPU
cores, Hyper-Threaded) with a physical configuration optimized for large
data-intensive workflows: 32 or 40 CPU cores and a large amount of RAM
(256 GB per node). The workloads make extensive use of high performance
local disk, incorporating hardware RAID10 across 6 x 2TB SAS drives.
The networking is provided by Brocade ICX 7750-48C top-of-rack switches
that use 6x40Gb cables to interconnect the racks in a ring stack topology,
providing 240 Gbps non-blocking redundant inter-rack connectivity,
at a 2:1 oversubscription ratio.
The Collaboratory is deployed using entirely community-supported free
software. The OpenStack control plane is Ubuntu 14.04 and deployment
configuration is based on Ansible. The Collaboratory was initially
deployed using OpenStack Juno and a year later upgraded to Kilo and
Collaboratory deploys a standard HA stack based on Haproxy/Keepalived and
Mariadb-Galera using three controller nodes. The controller nodes also
perform the role of Ceph-mon and Neutron L3-agents, using three separate
RAID1 sets of SSD drives for MySQL, Ceph-mon and Mongodb processes.
The compute nodes have 10G Ethernet with GRE and SDN capabilities
for virtualized networking. The Ceph nodes use 2x10G NICs bonded for
client traffic and 2x10G NICs bonded for storage replication traffic.
The Controller nodes have 4x10G NICs in an active-active bond (802.3ad)
using layer3+4 hashing for better link utilisation. The Openstack tenant
routers are highly-available with two routers distributed across the three
controllers. The configuration does not use Neutron DVR out of concern
for limiting the number of servers directly attached to the Internet.
The public VLAN is carried only on the trunk ports facing the controllers
and the monitoring server.
Optimising Ceph for Genomics Workloads
Upon workload start, the instances usually download data stored in Ceph's
object storage. OICR developed a download client that controls access
to sensitive ICGC protected data through managed tokens. Downloading a
100GB file stored in Ceph takes around 18 minutes, with another 10-12
minutes used to automatically check its integrity (md5sum), and is mostly
limited by the instance’s local disk.
The ICGC storage system adds a layer of control on top of Ceph’s
object storage. Currently this is a 2-node cluster behind an Haproxy
instance serving the ICGC storage client. The server component uses
OICR’s authorization and metadata systems to provide secure access to
related objects stored in Ceph. By using OAuth-based access tokens,
researchers can be given access to the Ceph data without having to
configure Ceph permissions. Access to individual project groups can
also be implemented in this layer.
Each Ceph storage node consists of 36 OSD drives (4, 6 or 8 TB) in
a large Ceph cluster currently providing 4 PB of raw storage, using
three replica pools. The radosgw pool has 90% of the Ceph space being
reserved for storing protected ICGC datasets, including the very large
whole genome aligned reads for almost 2000 donors. The remaining 10% of
Ceph space is used as a scalable and highly-available backend for Glance
and Cinder. Ceph radosgw was tuned for the specific genomic workloads,
mostly by increasing read-ahead on the OSD nodes, 65 MB as rados object
stripe for Radosgw and 8 MB for RBD.
Further Considerations and Future Directions
In the course of the development of the OpenStack infrastructure at the
Collaboratory, several issues have been encountered and addressed:
The instances used in cancer research are usually short lived
(hours/days/weeks), but with high resource requirements in terms of CPU
cores, memory and disk allocation. As a consequence of this pattern of
usage the Collaboratory OpenStack infrastructure does not support live
migration as a standard operating procedure.
The Collaboratory have encountered a few problems caused by Radosgw bugs
involving overlapping multipart uploads. However, these were detected by
the Collaboratory’s monitoring system, and did not result in data loss.
The Collaboratory created a monitoring system that uses automated Rally
tests to monitor end-to-end functionality, and also download a random
large S3 object (around 100 GB) to confirm data integrity and monitor
object storage performance.
Because of the mix of very large (BAM), medium (VCF) and very small
(XML, JSON) files, the Ceph OSD nodes have imbalanced load and we have
to regularly monitor and rebalance data.
Currently, the Collaboratory is hosting 500TB of data from 2,000 donors.
Over the next 2 years, OICR will increase the number of ICGC genomes
available in the Collaboratory, with the goal of having the entire ICGC
data set of 25,000 donors estimated to be 5PB when the project completes
Although in a closed beta phase with only a few research labs having
accounts, there were more than 19,000 instances started in the last 18
months, with almost 7,000 in the last three months. One project that
uses the Collaboratory heavily is the PanCancer Analysis of Whole Genomes
(PCAWG), which characterizes the somatic, and germline variants from
over 2,800 ICGC cancer whole genomes in 20 primary tumour sites.
In conclusion, the Collaboratory environment has been running well for
OICR and its partners. George Mihaiescu, senior cloud architect at OICR,
has many future plans for OpenStack and the Collaboratory:
“We hope to add new Openstack projects to the Collaboratory’s offering
of services, with Ironic and Heat being the first candidates. We would
also like to provide new compute node configurations with RAID0 instead
of RAID10, or even SSD based local storage for improved IO performance.”
CLIMB: OpenStack, Parallel Filesystems and Microbial Bioinformatics
The Cloud Infrastructure for Microbial Bioinformatics (CLIMB) is a
collaboration between four UK universities (Swansea, Warwick, Cardiff
and Birmingham) and funded by the UK’s Medical Research Council.
CLIMB provides compute and storage as a free service to academic
microbiologists in the UK. After an extended period of testing, the
CLIMB service was formally launched in July 2016.
CLIMB is a federation of 4 sites, configured as OpenStack regions.
Each site has an approximately equivalent configuration of compute nodes,
network and storage.
The compute node hardware configuration is tailored to support the
memory-intensive demands of bioinformatics workloads. The system as
a whole comprises 7680 CPU cores, in fat 4-socket compute nodes with
512GB RAM. Each site also has three large memory nodes with 3TB of RAM
and 192 hyper-threaded cores.
The infrastructure is managed and deployed using xCAT cluster management
software. The system runs the Kilo release of OpenStack, with packages
from the RDO distribution. Configuration management is automated
Each site has 500 TB of GPFS storage. Every hypervisor is a GPFS client,
and uses an infiniband fabric to access the GPFS filesystem. GPFS is
used for scratch storage space in the hypervisors.
For longer term data storage, to share datasets and VMs, and to provide
block storage for running VMs, CLIMB deploys a storage solution based
on Ceph. The Ceph storage is replicated between sites. Each site has 27
Dell R730XD nodes for Ceph storage servers. Each storage server contains
16x 4TB HDDs for Ceph OSDs, giving a total raw storage capacity of 6912TB.
After 3-way replication this yields a usable capacity of 2304TB.
On two sites Ceph is used as the storage back end for Swift, Cinder
and Glance. At Birmingham GPFS is used for Cinder and Glance, with
plans to migrate to Ceph.
In addition to the infiniband network, a Brocade 10G Ethernet fabric is
used, in conjunction with dual-redundant Brocade Vyatta virtual routers
to manage cross-site connectivity.
In the course of deploying and trialling the CLIMB system, a number of
issues have been encountered and overcome.
- The Vyatta software routers were initially underperforming with
consequential impact on inter-site bandwidth.
- Some performance issues have been encountered due to NUMA topology
awareness not being passed through to VMs.
- Stability problems with Broadcom 10GBaseT drivers in the controllers
led to reliability issues. (Thankfully the HA failover mechanisms were
found to work as required).
- Problems with interactions between Ceph and Dell hardware RAID cards.
- Issues with Infiniband and GPFS configuration.
CLIMB has future plans for developing their OpenStack infrastructure,
- Migrating from regions to Nova cells as the federation model between
- Integrating OpenStack Manila for exporting shared filesystems from
GPFS to guest VMs.
This document was written by Stig Telfer of StackHPC Ltd with the support
of Cambridge University, with contributions, guidance and feedback from
subject matter experts:
- George Mihaiescu, Bob Tiernay, Andy Yang, Junjun Zhang,
Francois Gerthoffert, Christina Yung, Vincent Ferretti
from the Ontario Institute for Cancer Research. The authors wish
to acknowledge the funding support from the Discovery Frontiers:
Advancing Big Data Science in Genomics Research program (grant
no. RGPGR/448167-2013, ‘The Cancer Genome Collaboratory’), which
is jointly funded by the Natural Sciences and Engineering Research
Council (NSERC) of Canada, the Canadian Institutes of Health Research
(CIHR), Genome Canada, and the Canada Foundation for Innovation (CFI),
and with in-kind support from the Ontario Research Fund of the Ministry
of Research, Innovation and Science.
- Dr Tom Connor from Cardiff University and the CLIMB collaboration.
This document is provided as open source with a Creative Commons license
with Attribution + Share-Alike (CC-BY-SA)