The Human Brain Project (HBP)
is a 10-year EU FET flagship project
seeking to “provide researchers worldwide with tools and mathematical
models for sharing and analysing large brain data they need for
understanding how the human brain works and for emulating its
computational capabilities”. This ambitious and far-sighted goal
has become increasingly relevant during the lifetime of the project
with the rapid uptake of Machine Learning and AI (in its various
forms) for a broad range of new applications.
A significant portion of the HBP is concerned with massively parallel
applications in neuro-simulation, in analysis techniques to interpret
data produced by such applications, and in platforms to enable
these. The advanced requirements of the HBP in terms of mixed
workload processing, storage and access models are way beyond current
technological capabilities and will therefore drive innovation in
the HPC industry. The Pre-commercial procurement (PCP) is a funding
vehicle developed by the European Commission, in which an industrial
body co-designs with a public institution an innovative solution
to a real-world technical problem, with the intention of providing
the solution as commercialized IP.
The Jülich Supercomputer Centre on behalf of
the Human Brain Project entered into a competitive three-phased PCP
programme to design next-generation supercomputers for the demanding
brain simulation, analysis and data-driven problems facing the wider
Human Brain Project. Two consortia - NVIDA and IBM, and Cray and
Intel were selected to build prototypes of their proposed solutions.
The phase III projects ran until January 2017, but Cray’s project
deferred significant R&D investment, and was amended and extended.
Following significant activity supporting the research efforts
at Jülich, JULIA was finally decommissioned at the end of November.
In 2016, Cray installed a prototype named JULIA, with the
aim of exploring APIs for access to dense memory and storage, and
the effective support of mixed workloads. In this context, mixed
workloads may include interactive visualisation of live simulation
data and the possibility of applying feedback to "steer" a simulation
based on early output. Flexible exploitation of new hardware and
software aligns well with Cray's vision of adaptive supercomputing.
JULIA is based on a Cray CS400
system, but extended with some novel hardware and software technologies:
- 60 Intel Knights Landing compute nodes
- 8 visualisation nodes with NVIDIA GPUs
- 4 data nodes with Intel Xeon processor and 2x Intel Fultondate P3600 SSDs
- All system partitions connected using the Omnipath interconnect
- Installation of a remote visualization system for concurrent,
post-processing and in-transit visualization of data primarily from
- An installed software environment combining conventional HPC toolchains
(Cray, Intel, GNU compilers), and machine learning software stacks
(e.g. Theano, caffe, TensorFlow)
- A storage system consisting of SSD-backed Ceph
StackHPC was sub-contracted by Cray in order to perform analysis
and optimisation of the Ceph cluster. Analysis work started in
Ceph on JULIA
The Ceph infrastructure comprises four data nodes, each equipped with two
P3600 NVME devices and a 100G Omnipath high-performance network:
Each of the NVME devices is configured with four partitions. Each partition
is provisioned as a Ceph OSD, providing a total of 32 OSDs.
The Ceph cluster was initially running the Jewel release of Ceph
(current at the time). After characterising the performance, we
started to look for areas for optimisation.
The JULIA system uses a 100G Intel Omni-Path
RDMA-centric network fabric, also known as OPA. This network is conceptually derived
and evolved from InfiniBand, and
reuses a large proportion of the InfiniBand software stack, including
the Verbs message-passing API.
Ceph's predominant focus on TCP/IP-based networking is supported
through IP-over-InfiniBand, a
kernel network driver that enables the Omni-Path network to carry
layer-3 IP traffic.
The ipoib network driver enables connectivity, but does not
unleash the full potential of the network. Performance is good on
architectures where a processor core is sufficiently
powerful to maintain a significant proportion of line speed and
This sankey diagram
illustrates the connectivity between different hardware components
In places there are two arrows, as the TCP performance was found
to be highly variable. Despite some investigation, the underlying
reason for the variability is still unclear to us.
Using native APIs, Omni-Path will comfortably saturate the 100G
network link. However, the ipoib interface falls short of the
mark, particularly on the Knights Landing processors.
Raw Block Device Performance
In order to understand the overhead of filesystem and network
protocol, we attempt to benchmark the system at every level, moving
from the raw devices up to the end-to-end performance between client
and server. In this way, we can identify the achievable performance
at each level, and where there is most room for improvement.
Using the fio I/O benchmarking tool,
we measure the aggregated block read performance of all NVME
partitions in a single JULIA data server. We used four fio
clients per partition (32 in total) and 64KB reads. The results
are stacked to get the raw aggregate bandwidth for single node:
The aggregate I/O read performance achieved by the data server is
approximately 5200 MB/s. If we compare the I/O read performance
per node with the TCP/IP performance across the ipoib interface,
we can see that actually the two are somewhat comparable (within
the observed bounds of variability in ipoib performance):
Taking into account that heuristic access patterns are likely to
include serving data from the kernel buffer cache taking a sizeable
proportion of each data node's 64G RAM, the ipoib network
performance is likely to become a bottleneck.
Jewel to Luminous
Preserving the format of the backend data store, the major version
of Ceph was upgraded from Jewel to Luminous. Single-client performance
was tested using rados bench before and after the upgrade:
The results that we see indicate a solid improvement for smaller objects
(below 64K) but negligible difference otherwise, and no increase in
Filestore to Bluestore
The Luminous release of Ceph introduced major improvements in the
Bluestore backend data store. The Ceph cluster was migrated to Bluestore
and tested again with a single client node and rados bench:
There is a dramatic uplift in performance for larger objects for
both reads and writes. The peak RADOS object bandwidth is also
within the bounds of the observed limits achieved by the ipoib
network interface. This level of performance is becoming less of
an I/O problem and more of a networking problem.
That's a remarkable jump. What just happened?
The major differences appear to be the greater efficiency of
a bespoke storage back-end over a general-purpose filesystem,
and also reduction in the amount of data handling through
avoiding writing first to a journal, and then to the main store.
For every byte written to Ceph via the RADOS protocol, how many
bytes are actually written to disk? To find this, we sample
disk activity using iostat, aggregate across all devices in
the cluster and compare with the periodic bandwidth reports of
rados bench. The result is a pair of graphs, plotting RAODS
bandwidth against bandwidth of the underlying devices, over time.
Here's the results for the filestore backend:
There appears to be a write amplification factor of approx 4.5x
- the combination of a 2x replication factor, having every
object written first through a collocated write journal, and
an small amount of additional overhead for filesystem metadata.
What is interesting to observe is the periodic freezes in activity
as the test progresses. These are believed to be the filestore back-end
subdividing object store directories when they exceed a given threshold.
Plotted with the same axes, the bluestore configuration is strikingly different:
The device I/O performance is approximately doubled, and sustained.
The write amplification is reduced from 4.5x to just over 2x (because
we are benchmarking here with 2x replication). It is the combination
of these factors that give us the dramatic improvement in write
Sustained Write Effects
Using the P3600 devices, performing sustained writes for long periods
eventually leads to performance degradation. This can be observed
in a halving of device write performance, and erratic and occastionally
lengthy commit times.
This effect can be seen in the results of rados bench when plotted over time.
In this graph, bandwidth is plotted in green and commit times are impulses in red:
This effect made it very hard to generate repeatable write benchmark results. It
was assumed the cause was activity within the NVME controller when the available
resource of free blocks became depleted.
Scaling the Client Load
During idle periods on the JULIA system it was possible to harness
larger numbers of KNL systems as Ceph benchmark clients. Using concurrent runs
of rados bench and aggregating the results, we could get a reasonable idea of
Ceph's scalability (within the bounds of the client resources availalbe).
We were able to test with up configurations of to 20 clients at a time:
It was interesting to see how the cluster performance became erratic under
heavy load and high client concurrency.
The storage cluster BIOS and kernel parameters were reconfigured
to a low-latency / high-performance profile, and processor C-states
were disabled. This appeared to help with sustaining performance under
high load (superimposed here in black):
Recalling that the raw I/O read performance of each OSD server was benchmarked
at 5200 MB/s, giving an aggregate performance across all four servers of 20.8 GB/s,
our peak RADOS read performance of 16.5 GB/s represents about 80% of peak raw performance.
At this point, microcode and kernel mitigations were applied for the Spectre/Meltdown
CVEs. After retesting, the raw I/O read performance the aggregate performance per OSD
server was found to have dropped by over 15%, from 5200 MB/s to 4400 MB/s. The aggregate
raw read performance of the Ceph cluster was now 17.6 GB/s.
Luminous to Mimic
Along with numerous improvements and optimisations, the Mimic release also heralded
the deprecation of support for raw partitions for OSD backing, in favour of standardising
on LVM volumes.
Using an Ansible Galaxy role,
we zapped our cluster and recreated a similar configuration within LVM. We retained the
same configuration of four OSDs associated with each physical NVME device. Benchmarking
the I/O performance using fio revealed little discernable difference.
We redeployed the cluster using LVM and ceph-ansible and re-ran the rados bench tests.
The difference when using Ceph was dramatic for object sizes of 64K and bigger:
Reprovisioning again with partitions (and ignoring the deprecation warnings) restored and
increased levels of performance:
Taking into account the Spectre/Meltdown mitigations, Ceph Mimic
is delivering up to 92% efficiency over RADOS protocol.
UPDATE: After presenting these findings at Ceph Day Berlin,
Sage Weil introduced me to the Ceph performance team at Red Hat, and in particular Mark Nelson. Mark helped
confirm the issue and with analysis on the root cause. It looks likely that Bluestore+LVM suffers the same issue
as XFS+LVM on Intel NVMe devices as reported here (Red Hat
subscription required). The fix is to ugrade the kernel to the latest available for Red Hat / CentOS systems.
Unfortunately by this time JULIA reached the end of the project lifespan and we were not able to verify this. However,
on a different system with a newer hardware configuration, I was able to confirm that the performance
issues occur with kernel-3.10.0-862.14.4.el7 and are resolved in kernel-3.10.0-957.1.3.el7.
Native Network Performance for HPC-Enabled Ceph
When profiling the performance of this system using perf and
flame graph analysis,
I found that under high load 52.5% of the time appeared to be spent
in netowrking, either in the Ceph messenger threads, the kernel
TCP/IP stack or the low-level device drivers.
A substantial amount of this time is actually spent in servicing
page faults (a side-effect of the Spectre/Meltdown mitigations)
when copying socket data between kernel space and user space. This
performance data makes a strong case, at least for systems with
this balance of compute, storage and networking, for bypassing
kernel space, bypassing TCP/IP (with its inescapable copying of
data) and moving to a messenger class that offers RDMA.
When the Julia project end was announced, and our users left the system,
we upgraded Ceph one final time, from Mimic to master branch.
Ceph, RDMA and OPA
Ceph has included messenger classes for RDMA for some time. However,
our previous experience of using these with a range of RDMA-capable
network fabrics (RoCE,
InfiniBand and now
OPA) was that the messenger classes for RDMA work reasonably well
for RoCE but not for Infiniband or OPA.
For RDMA support, the systemd unit files for all communicating
Ceph processes must have virtual memory page pinning permitted, and
access to the devices required for direct communication with the network
For example, in /usr/lib/systemd/system/ceph-mon@.service, add:
Clients also require support for memory locking, which can be added by
inserting the following into /etc/security/limits.conf:
* hard memlock unlimited
* soft memlock unlimited
Fortunately Intel recently contributed support for iWARP (another
RDMA-enabled network transport), which is not actually iWARP-specific
but does introduce use of a protocol parameter broker known as the
RDMA connection manager, which
provides greater portability for RDMA connection establishment on a range
of different fabrics.
To enable this support in /etc/ceph/ceph.conf (here for the OPA hfi1 NIC):
ms_async_rdma_device_name = hfi1_0
ms_async_rdma_polling_us = 0
ms_async_rdma_type = iwarp
ms_async_rdma_cm = True
ms_type = async+rdma
Using the iWARP RDMA messenger classes (but actually on OPA and
InfiniBand) got us a lot further thanks to the connection manager
support. However, with OPA the maintenance of cluster membership
was irregular and unreliable. Further work is required to iron out
these issues, but unfortunately our time on JULIA has completed.
The project drew to a close before our work on RDMA could be completed
to satisfaction, and it is premature to post results here. I am
aware of other people becoming increasingly active in the Ceph RDMA
messaging space. In 2019 I hope to see the release of a development
project by Mellanox
to develop a new RDMA-enabled messenger class based on the UCX
communication library. (An equivalent
effort to perform the equivalent in libfabric could be even more compelling).
Looking further ahead, the adoption of Scylla's Seastar could potentially become a game-changer for
future developments with high-performance hardware-offloaded
For RDMA technologies to be adopted more widely, the biggest barriers
appear to be testing and documentation of best practice. If
we can, at StackHPC we hope to become more active in these areas
This work would not have been possible (or been far less informative) without
the help and support of a wide group of people:
- Adrian Tate and the team from the Cray EMEA Research Lab
- Dan van der Ster from CERN
- Mark Nelson, Sage Weil and the team from Red Hat
- Lena Oden, Bastian Tweddell and the team from Jülich Supercomputer Centre