Fabric control in Intel MPI


High Performance Computing usually involves some sort of parallel computing, and process-level parallelisation using the MPI (Message Passing Interface) protocol has been a common approach on "traditional" HPC clusters. Although alternative approaches are gaining some ground, getting good MPI performance will continue to be crucially important for many big scientific workloads, even in a cloudy new world of software-defined infrastructure.

There are several high-quality MPI implementations available, and deciding which one to use is important because applications must be compiled against a specific MPI library - the different MPI libraries are (broadly) source-compatible but not binary-compatible. Unfortunately, selecting the "right" one is not straightforward, as a search for benchmarks will quickly show, with different implementations coming out on top in different situations. Intel's MPI has historically been a strong contender, with easy "yum install" deployment, good performance (especially on Intel processors), and being - unlike Intel's compilers - free to use. Intel MPI 2018 remains relevant even for new installs, as the 2019 versions have had various issues, including the fairly essential hydra manager appearing not to work with at least some AMD processors. A fix for this is apparently planned for 2019 update 5, but there is no release date for this yet.

MPI can run over many different types of interconnect, or "fabrics", which actually carry the inter-process communications (Ethernet, InfiniBand and so on), and the Intel MPI runtime will, by default, automatically try to select a fabric which works. Knowing how to control fabric choices is still important, however, as there is no guarantee it will select the optimal fabric, and fall-back through non-working options can lead to slow startup or lots of worrying error messages for the user.

Intel significantly changed the fabric control between 2018 and 2019 MPI versions but this isn't immediately obvious from the changelog and you have to jump about between the developer references and developer guides to get the full picture. In both MPI versions the I_MPI_FABRICS environment variable specifies the fabric, but the values it takes are quite different:

For 2018 options are shm, dapl, tcp, tmi, ofa or ofi, or you can use x:y to control intra- and inter-node communications separately (see the docs for which combinations are valid).
For 2019 options are only ofi, shm:ofi or shm, with the 2nd option setting intra- and inter-node communications separately as before.
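The difference between the two versions can be sketched as a small sanity check to run before launching a job. This is only an illustration built from the two lists above: the function name is made up, and for 2018 it only checks that each half of an x:y value is a known fabric name, not that the combination is one the docs actually allow.

```python
# Fabric names transcribed from the two lists above (not an Intel API).
FABRICS_2018 = {"shm", "dapl", "tcp", "tmi", "ofa", "ofi"}
FABRICS_2019 = {"ofi", "shm:ofi", "shm"}


def fabrics_value_plausible(mpi_year, value):
    """Return True if `value` looks like a legal I_MPI_FABRICS setting.

    Hypothetical helper: only the 2018 and 2019 releases are modelled.
    """
    if mpi_year == 2019:
        # 2019 accepts exactly three values.
        return value in FABRICS_2019
    if mpi_year == 2018:
        # 2018 accepts a single fabric, or intra:inter as "x:y".
        parts = value.split(":")
        if len(parts) > 2:
            return False
        # Assumption: we do NOT check whether the pair is a valid
        # combination; see the Intel docs for that.
        return all(part in FABRICS_2018 for part in parts)
    raise ValueError(f"only 2018 and 2019 are modelled, got {mpi_year}")
```

For example, `fabrics_value_plausible(2018, "shm:dapl")` is true, but the same value is rejected for 2019, which is the sort of thing that bites when a job script is carried over between versions.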

The most generally-useful options are probably:

shm (2018 & 2019): The shared memory transport; only applicable to intra-node communication so generally used with another transport as suggested above - see the docs for details.
tcp (2018 only): A TCP/IP capable fabric e.g. Ethernet or IB via IPoIB.
ofi (2018 & 2019): An "OpenFabrics Interfaces-capable fabric". These use a library called libfabric (either an Intel-supplied or "external" version) which provides a fixed application-facing API while talking to one of several "OFI providers" which communicate with the interconnect hardware. Really your choice of provider here depends on the hardware, with possibilities being:
- psm2: Intel OmniPath
- verbs: InfiniBand or iWARP
- RxM: A utility provider supporting verbs
- sockets: Again a TCP/IP-capable fabric, but this time through libfabric. It's not intended to be faster than the 2018 tcp option, but allows developing/debugging libfabric codes without actually having a faster interconnect available.

With both 2018 and 2019 you can use I_MPI_OFI_PROVIDER_DUMP=enable to see which providers MPI thinks are available.
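As a rough sketch of how that might be wired into a launcher script, the helper below copies an environment and sets the two variables mentioned so far. The function name and the default of "ofi" (chosen because it appears in both the 2018 and 2019 lists) are illustrative choices, not anything Intel prescribes.

```python
import os


def provider_dump_env(fabrics="ofi", base_env=None):
    """Return a copy of the environment with the OFI provider dump enabled.

    Hypothetical helper: `base_env` defaults to the current process
    environment, which is left unmodified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["I_MPI_FABRICS"] = fabrics
    # Ask Intel MPI to report which OFI providers it thinks are available.
    env["I_MPI_OFI_PROVIDER_DUMP"] = "enable"
    return env
```

The result could then be passed to something like `subprocess.run(["mpirun", "-n", "2", "./hello_mpi"], env=provider_dump_env())`, where `./hello_mpi` stands in for whatever small MPI test program you have to hand.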

2018 also supported some additional options which have gone away in 2019:

ofa (2018): "OpenFabrics Alliance" e.g. InfiniBand (through OFED Verbs) & possibly also iWARP and RoCE?
dapl (2018): "Direct Access Programming Library" e.g. InfiniBand and iWARP.
tmi (2018): "Tag Matching Interface" e.g. Intel True Scale Fabric, Intel Omni-Path Architecture, Myrinet

With any of these fabrics there are additional variables to tweak things. 2018 has I_MPI_FABRICS_LIST, which allows specification of a list of available fabrics to try, plus variables to control fallback through this list. These variables are all gone in 2019 now that there are fewer fabric options. Intel have clearly decided to concentrate on OFI/libfabric, which unifies (or restricts, depending on your view!) the application-facing interface.
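For 2018 only, a fabric list with fallback might be assembled along these lines. Treat the details as assumptions to check against the developer reference for your installed version: I_MPI_FALLBACK is the fallback variable name as we recall it from the 2018 reference, and the set of names accepted in the list (which, as far as we remember, excludes shm) is likewise from memory.

```python
# Assumption: fabrics accepted by I_MPI_FABRICS_LIST in Intel MPI 2018.
LIST_FABRICS_2018 = {"dapl", "tcp", "tmi", "ofa", "ofi"}


def fabrics_list_env_2018(candidates, allow_fallback=True):
    """Build the 2018-only variables for an ordered list of fabrics to try.

    Hypothetical helper; raises ValueError for names not in the set above.
    """
    unknown = [name for name in candidates if name not in LIST_FABRICS_2018]
    if unknown:
        raise ValueError(f"not a 2018 list fabric: {unknown}")
    return {
        # Fabrics are tried in the order given.
        "I_MPI_FABRICS_LIST": ",".join(candidates),
        # Assumption: I_MPI_FALLBACK toggles falling back through the list.
        "I_MPI_FALLBACK": "enable" if allow_fallback else "disable",
    }
```

Disabling fallback is the more useful setting when benchmarking, since a job which silently drops from dapl to tcp will still run, just slowly.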

If you're using the 2018 MPI over InfiniBand you might be wondering which option to use; at least back in 2012 performance between DAPL and OFA/OFED Verbs was apparently generally similar although the transport options available varied, so which is usable/best if both are available will depend on your application and hardware.

HPC Fabrics in the Public Cloud

Hybrid and public cloud HPC solutions have been gaining increasing attention, with scientific users looking to burst peak usage out to the cloud, or investigating the impact of wholesale migration.

Azure have been pushing their capabilities for HPC hard recently, showcasing ongoing work to get closer to bare-metal performance and launching a 2nd generation of "HB-series" VMs which provide 120 cores of AMD Epyc 7002 processors. With InfiniBand interconnects and as many as 80,000 cores of HBv2 available for jobs for (some) customers, Azure looks to be providing pay-as-you-go access to some very serious (virtual) hardware. And in addition to providing a platform for new HPC workloads in the cloud, for organisations which are already embedded in the Microsoft ecosystem Azure may seem an obvious route to acquiring a burst capacity for on-premises HPC workloads.

If you're running in a virtualised environment such as Azure, MPI configuration is likely to have additional complexities and a careful read of any and all documentation you can get your hands on is likely to be needed.

For example, for Azure, the recommended Intel MPI settings described across Microsoft's documentation vary depending on which type of VM you are using:

Standard and most compute-optimised nodes only have Ethernet (needing tcp or sockets) which is likely to make them uninteresting for multi-node MPI jobs.
Hr-series VMs and some others have FDR InfiniBand but need specific drivers (provided in an Azure image), Intel MPI 2016 and the DAPL provider set to ofa-v2-ib0.
HC44 and HB60 VMs have EDR InfiniBand and can theoretically use any MPI (although for HB60 VMs note the issues with Intel 2019 MPI on AMD processors mentioned above) but need the appropriate fabric to be manually set.
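One way to keep this straight in a job-submission script is a small lookup keyed on the kind of VM. The sketch below just transcribes the list above: the category labels are made up, I_MPI_DAPL_PROVIDER is the Intel MPI variable we believe carries the DAPL provider name, and pairing it with shm:dapl is our assumption rather than something stated above.

```python
# Hints transcribed from the list above; the keys are hypothetical labels.
AZURE_VM_HINTS = {
    # Ethernet only: tcp with Intel MPI 2018 (2019 would need ofi/sockets).
    "ethernet-only": {"I_MPI_FABRICS": "tcp"},
    # FDR InfiniBand VMs: Intel MPI 2016 with the DAPL provider below.
    "fdr-infiniband": {
        "I_MPI_FABRICS": "shm:dapl",  # assumption, not stated in the text
        "I_MPI_DAPL_PROVIDER": "ofa-v2-ib0",
    },
}


def azure_mpi_env(vm_class):
    """Return suggested Intel MPI variables for a (hypothetical) VM class."""
    if vm_class == "edr-infiniband":
        # HC44/HB60: any MPI can work, so there is no single answer here.
        raise ValueError("EDR VMs need the fabric chosen manually")
    if vm_class not in AZURE_VM_HINTS:
        raise ValueError(f"unknown VM class: {vm_class}")
    # Return a copy so callers cannot mutate the table.
    return dict(AZURE_VM_HINTS[vm_class])
```

Raising for the EDR case rather than guessing a default is deliberate: the right fabric there depends on which MPI and version you have chosen.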

InfiniBand on Azure still seems to be undergoing considerable development, with, for example, new drivers for MVAPICH2 coming out around now, so treat any guidance with a pinch of salt until you know it's not stale, to mix metaphors!
