Over the last decade, remote direct memory access (RDMA) technologies have grown in popularity in data center networks and in the HPC domain, particularly as high-bandwidth, low-latency access to large datasets has become critical for many parties. Modern commodity, RDMA-capable NICs often provide 100 Gb/s (soon 200 Gb/s in the Mellanox ConnectX-6) of throughput and O(µs) latency, enabling new types of applications for HPC and wide-area networking. However, recent work has found that some RDMA features do not scale well due to sub-optimal hardware implementations and expensive software abstractions in the kernel and in user-space.
Our group is currently exploring the low-level performance characteristics of InfiniBand hardware using Chameleon’s ConnectX-3 nodes, including the relatively low performance of RDMA reads, the performance of RDMA atomics, and the trade-offs of one-sided and synchronous communication models. In addition, we are working to design custom, high-performance InfiniBand device drivers for specialized, lightweight kernels within our Nautilus Aerokernel framework, developed by my advisor Dr. Kyle Hale. As far as we are aware, we have the first working, open-source, custom InfiniBand driver (outside of the vendor’s Linux driver). It is also the first such driver specialized to a unikernel-like environment.
While general-purpose operating systems (like Linux) leverage complex address spaces and demand paging, many specialized OSes use a very simple paging setup to optimize for TLB performance and to minimize jitter introduced by unpredictable memory management behavior. Because these network devices make extensive use of direct memory access (DMA), they must deal with the intricacies of address translation and protection, which comes at a cost. We believe that we can optimize the device driver for cases in which such complex memory management is unnecessary.
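To make the cost of address translation concrete, here is a minimal sketch that counts how many pages (and so how many translation entries a device might need) a DMA buffer spans at different page sizes. The buffer address, buffer length, page sizes, and the function name `translation_entries` are illustrative assumptions, not details of our driver or of the ConnectX-3 hardware.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def translation_entries(addr: int, length: int, page_size: int) -> int:
    """Count the pages a buffer [addr, addr + length) touches.

    Each touched page needs its own translation entry, so a larger
    page size means fewer entries for the same buffer.
    """
    if length <= 0:
        return 0
    first_page = addr // page_size
    last_page = (addr + length - 1) // page_size
    return last_page - first_page + 1


if __name__ == "__main__":
    # Hypothetical 64 MiB buffer starting at the 1 GiB boundary.
    buf_addr, buf_len = 1 * GIB, 64 * MIB
    for label, page_size in (("4 KiB", 4 * KIB), ("2 MiB", 2 * MIB), ("1 GiB", GIB)):
        count = translation_entries(buf_addr, buf_len, page_size)
        print(f"{label} pages: {count} entries")
```

Under these assumptions, a kernel that maps memory with a few large pages can describe the same buffer with far fewer entries than one using small, demand-paged mappings, which is the kind of simplification a specialized driver could exploit.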
While there are several exciting applications for customized InfiniBand drivers, one we are currently exploring involves using the InfiniBand fabric as a delegation mechanism allowing specialized OSes to offload functionality onto a general-purpose OS like Linux.
We have put considerable time and effort into understanding the hardware and building a custom driver that is RDMA-capable and tailored for high-performance, parallel workloads. We make extensive use of Chameleon to test our driver (InfiniBand hardware is expensive!) and to perform experiments that analyze and optimize important IB operations like RDMA reads, RDMA writes, and atomic operations.
RDMA reads and atomic operations are known to be expensive in the HPC domain. This is because RDMA requires point-to-point queue pair (QP) connections. Such connections are not well suited to large-scale deployment in data center networks, because the NIC has limited capacity to process many point-to-point QPs in parallel.
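The scaling pressure from point-to-point QPs can be illustrated with a back-of-the-envelope count. The sketch below assumes a simple all-to-all pattern in which every thread on a node holds one connected QP per remote node. It compares this with a datagram transport, where a single QP per thread can exchange send/receive messages with any peer (datagram QPs do not carry RDMA operations, so this is a comparison of connection state only). The cluster size, thread count, and function names are hypothetical.

```python
def connected_qps_per_node(nodes: int, threads_per_node: int) -> int:
    """QPs one node holds if each thread connects to every remote node.

    Assumption: one connected QP per (local thread, remote node) pair.
    """
    if nodes < 1 or threads_per_node < 0:
        raise ValueError("need at least one node and a non-negative thread count")
    return threads_per_node * (nodes - 1)


def datagram_qps_per_node(threads_per_node: int) -> int:
    """QPs one node holds if each thread uses a single datagram QP."""
    if threads_per_node < 0:
        raise ValueError("thread count must be non-negative")
    return threads_per_node


if __name__ == "__main__":
    # Hypothetical cluster sizes; 16 threads per node is an assumption.
    for nodes in (2, 100, 1000):
        connected = connected_qps_per_node(nodes, threads_per_node=16)
        datagram = datagram_qps_per_node(threads_per_node=16)
        print(f"{nodes} nodes: {connected} connected QPs vs {datagram} datagram QPs")
```

The connected count grows linearly with cluster size on every node, while the datagram count stays fixed, which is why per-QP state on the NIC becomes a concern at scale.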
Our current experiments involve running our Nautilus kernel on KVM on two Compute Haswell IB nodes with pass-through access to the cards, and measuring the interactions between them. As InfiniBand requires a subnet manager to be active on the subnet to dole out local identifiers to nodes, we are also making use of the IB switch fabric in Chameleon. Managed versions of these switches are prohibitively expensive for a small lab, especially in the early prototyping stages of development work. While our driver is still in its preliminary stages, Chameleon has been an invaluable resource in helping us make progress. Without the ability to reboot nodes remotely, making progress on the driver would have been very difficult. Our current set of experiments involves a limit study measuring the minimum latencies achievable by the card across the various transport channels provided by the NIC. This will give us better insight into the NIC’s performance and into the software overheads in the driver that may reduce performance.
[Figure 1: UC] [Figure 2: UD]
The figures above compare the minimum latency (round-trip time) of the network card on Linux and on Nautilus, both running on Chameleon bare-metal nodes, for the UC (Unreliable Connection) and UD (Unreliable Datagram) transport types provided by the card. The X-axis is the packet size and the Y-axis is the latency in nanoseconds for a single iteration.
Our initial latency results, measured with respect to the MTU (Maximum Transmission Unit) of the ConnectX-3 network card, indicate that Nautilus has lower latency than Linux across increasing packet sizes for the UD and UC transport services provided by the card. We are still exploring the complex details of the ConnectX-3 hardware (such as the memory translation unit, Blue Flame, Completion Queue Coalescing, and Interrupt Moderation) in order to achieve maximum performance from the network card.
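The shape of such a latency limit study can be sketched as a ping-pong loop that records the smallest round-trip time observed for each packet size. This is only an illustration of the measurement idea, not our actual benchmark: the `ping_pong` callable is a hypothetical stand-in for posting a send and waiting for the echoed reply, the packet sizes are arbitrary, and taking the minimum over several iterations is an assumption about how one might run the study.

```python
import time
from typing import Callable, Dict, Iterable


def min_round_trip_ns(
    ping_pong: Callable[[int], None],
    size: int,
    iterations: int,
    clock_ns: Callable[[], int] = time.perf_counter_ns,
) -> int:
    """Return the smallest round-trip time (ns) seen over `iterations` runs."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    best = None
    for _ in range(iterations):
        start = clock_ns()
        ping_pong(size)  # hypothetical: send `size` bytes, wait for the echo
        elapsed = clock_ns() - start
        if best is None or elapsed < best:
            best = elapsed
    return best


def sweep(
    ping_pong: Callable[[int], None],
    sizes: Iterable[int],
    iterations: int,
    clock_ns: Callable[[], int] = time.perf_counter_ns,
) -> Dict[int, int]:
    """Map each packet size to its minimum observed round-trip time (ns)."""
    return {
        size: min_round_trip_ns(ping_pong, size, iterations, clock_ns)
        for size in sizes
    }


if __name__ == "__main__":
    def local_stand_in(size: int) -> None:
        # Placeholder work instead of a real network round trip.
        bytes(size)

    for size, latency in sweep(local_stand_in, (64, 256, 1024), iterations=100).items():
        print(f"{size} bytes: {latency} ns")
```

Reporting the minimum rather than the mean filters out scheduling noise and interrupts, so the result approximates the floor imposed by the hardware and the driver's fast path.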
In the future, we would like to experiment with other hardware, such as FPGAs, NVMe SSDs, and TPUs, and evaluate the performance of specialized device drivers for accelerators in a specialized operating system.
Our current driver is freely available as part of Nautilus’s open-source codebase, and can be found at https://github.com/hexsa-lab/nautilus. Related publications and pointers to experiments and resources (such as Docker containers) can also be found at https://nautilus.halek.co. More information about our lab generally and ongoing projects can be found at https://hexsa.halek.co.