Project “NVIDIA HPC Infiniband Homelab GPU Cluster”: Part 3: RDMA Performance Testing

Before moving on to Part 3 of this project, let’s review what we have accomplished thus far.

In Part 1 and Part 2 we:

  • Did a bit of planning and scoping
  • Built a 3-node GPU cluster (viper, columbia, prometheus)
  • Interconnected the nodes with InfiniBand
  • Installed and validated ConnectX-4 NICs and the RDMA stack (mlx5, ib_core, etc.)
  • Brought up the InfiniBand fabric using OpenSM (links active, LIDs assigned)
  • Verified topology and connectivity (ibstat, ibnetdiscover)
  • Configured IP over InfiniBand for basic networking between nodes
  • Identified PCIe/NUMA limitations affecting optimal GPU↔NIC performance

We are now ready to do some performance testing of our InfiniBand network.


Pre-Test Setup

Before we can get started on our perf testing, we have a bit of work to do. We are going to install a few packages and configure some tunables.

Diagnostic Tools

First, let’s make sure the tools we need are installed. The perftest package provides ib_send_lat and ib_send_bw, which we use later in this post.

sudo dnf install infiniband-diags libibverbs-utils librdmacm-utils perftest -y
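A quick way to confirm the install worked is to check that each tool is on the PATH. This is a minimal sketch; missing_tools is a helper name made up for this post, not part of any package.

```shell
# Print the name of every command given as an argument that is NOT on PATH.
# Prints nothing when all of them are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Calling it as missing_tools ibstat ibv_devinfo ibv_devices ib_send_lat ib_send_bw should print nothing on a host that is ready for the tests below.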

Kernel Modules

InfiniBand and GPUDirect require specific modules to load at boot. The command below creates (or overwrites) /etc/modules-load.d/hpc.conf, which systemd-modules-load reads at boot to load each listed module automatically. Run this on each host.

sudo tee /etc/modules-load.d/hpc.conf >/dev/null <<'EOF'
ib_ipoib
ib_umad
ib_uverbs
nvidia-peermem
EOF

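Then load the modules now, without waiting for a reboot. The -a flag matters here: without it, modprobe loads only the first module and treats the remaining names as parameters for it.

sudo modprobe -a ib_ipoib ib_umad ib_uverbs nvidia-peermem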

Below is a short breakdown/description for each module.

 Module          How it’s used
 ------          -------------
 ib_ipoib        Provides IP networking over InfiniBand (e.g., ib0) for SSH, NFS, TCP/IP
 ib_umad         Enables userspace IB management tools (e.g., ibstat, fabric queries)
 ib_uverbs       Core RDMA interface used by applications (MPI, NCCL, libibverbs)
 nvidia-peermem  Enables GPUDirect RDMA for direct GPU ↔ NIC memory transfers (no CPU copy)
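To confirm the modules actually loaded, we can compare the list against lsmod. The sketch below is one way to do that; missing_modules is a hypothetical helper, and it converts dashes to underscores because the kernel lists nvidia-peermem as nvidia_peermem.

```shell
# Read lsmod-style text on stdin and print each wanted module that is absent.
# Module names are passed as arguments; "-" is normalized to "_".
missing_modules() {
    loaded=$(awk '{ print $1 }')          # first column = module name
    for mod in "$@"; do
        name=$(printf '%s' "$mod" | tr '-' '_')
        printf '%s\n' "$loaded" | grep -qxF "$name" || printf '%s\n' "$mod"
    done
}
```

Run it as lsmod | missing_modules ib_ipoib ib_umad ib_uverbs nvidia-peermem; no output means all four modules are present.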

Locked Memory Limits

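RDMA works by “pinning” (locking) memory so the OS cannot swap it to disk while the NIC is reading or writing it. Pinned memory counts against the per-user locked-memory limit (memlock), which is small by default, so we raise it by creating /etc/security/limits.d/99-hpc.conf as shown below. The new limit applies to new login sessions, so log out and back in before testing.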

sudo tee /etc/security/limits.d/99-hpc.conf >/dev/null <<'EOF'
* soft memlock unlimited
* hard memlock unlimited
EOF
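After logging back in, it is worth checking that the new limit took effect. A minimal sketch, assuming a shell whose ulimit supports -l (bash does); memlock_is_unlimited is a name made up for this example.

```shell
# Succeed only when the given `ulimit -l` value is "unlimited".
memlock_is_unlimited() {
    [ "$1" = "unlimited" ]
}

# Check the limit of the current shell session.
if memlock_is_unlimited "$(ulimit -l)"; then
    echo "memlock: unlimited"
else
    echo "memlock: still limited ($(ulimit -l)); log out and back in, then recheck"
fi
```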

Performance & RDMA Benchmarking

Health Check

First, let’s run the following commands on each host under test to make sure the InfiniBand network is healthy before we start any testing. Run each line individually and make note of the output.

hostname
ibstat
ibv_devinfo | grep -E 'hca_id|transport|fw_ver|port:|link_layer|active_mtu|sm_lid|port_lid'

You are specifically interested in the following:

  • Device Present (mlx5)
  • State: Active
  • Physical state: LinkUp
  • Link layer: InfiniBand
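If you would rather not eyeball the output on every node, the checks above can be scripted. The sketch below assumes the usual ibstat layout, with "State:", "Physical state:" and "Link layer:" lines for each port; ib_ports_healthy is a hypothetical helper name.

```shell
# Read `ibstat` output on stdin. Exit 0 only if at least one port is listed
# and every port is Active, LinkUp and running the InfiniBand link layer.
ib_ports_healthy() {
    awk '
        BEGIN { ports = 0; active = 0; up = 0; ib = 0 }
        /^[ \t]*State:/          { ports++; if ($2 == "Active") active++ }
        /^[ \t]*Physical state:/ { if ($3 == "LinkUp") up++ }
        /^[ \t]*Link layer:/     { if ($3 == "InfiniBand") ib++ }
        END { exit !(ports > 0 && active == ports && up == ports && ib == ports) }
    '
}
```

Usage: ibstat | ib_ports_healthy && echo "fabric looks healthy".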

Confirm HCA Name and Port Number

Run this on each device under test. We need the HCA name on both the receiver and the sender side; the port number appears on the port: line of the ibv_devinfo output from the health check.

ibv_devices

Output from columbia.lab.

 device          	   node GUID
 ------          	----------------
 mlx5_0          	248a070300ac5414

Output from prometheus.lab

 device          	   node GUID
 ------          	----------------
 mlx5_0          	248a070300ac5610
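For scripting, the device name can be pulled out of that output directly. A small sketch, assuming the two-line header shown above; hca_names is a made-up helper name.

```shell
# Read `ibv_devices` output on stdin and print one device name per line.
# The first two lines are the column header and the dashed rule.
hca_names() {
    awk 'NR > 2 && NF >= 2 { print $1 }'
}
```

Usage: ibv_devices | hca_names, which on the nodes above would give the mlx5_0 value passed to -d in the tests that follow.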


Run the RDMA Latency Test

For our ib_send_lat (latency test) our device IP addresses are as follows.

  • columbia.lab – 172.16.50.12
  • prometheus.lab – 172.16.50.11

On our first device, columbia.lab, we run the following and leave it running.

ib_send_lat -d mlx5_0 -i 1

Now over on prometheus, run the command below, replacing <columbia_ip> with columbia’s IP address listed above (172.16.50.12). You will see a good bit of output in your terminal window.

ib_send_lat -d mlx5_0 -i 1 <columbia_ip>

Key configuration details

So, assuming the test did not fail, you are going to see some data spit out. Let’s make sense of some of it.

 Parameter    Value       Meaning
 ---------    -----       -------
 Device       mlx5_0      ConnectX-4 (mlx5 driver)
 Transport    IB (RC)     Reliable Connection (standard RDMA mode)
 MTU          4096        Optimal for IB performance
 Queue Pairs  1           Single stream test
 Inline data  236B        Small messages optimized
 Link type    InfiniBand  Correct mode

What this test is actually doing

ib_send_lat:

  • Registers memory with the NIC
  • Creates RDMA queue pairs
  • Sends messages using:
    • ibv_post_send()
  • Measures completion latency via completion queues (CQs)

This is direct RDMA messaging, not IP networking.

Our Overall results

  • Average latency: ~1.15 µs
  • Typical latency: ~1.14 µs
  • Minimum latency: 1.06 µs
  • Outliers: up to 11.41 µs
  • Conclusion: Healthy RDMA performance
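To pull these numbers out of a saved run instead of reading them off the screen, something like the sketch below works. It assumes the result rows use the column order #bytes, #iterations, t_min, t_max, t_typical, t_avg (as recent perftest builds print them), so check the header of your own output first; lat_summary is a hypothetical helper name.

```shell
# Read ib_send_lat output on stdin and print, for each result row:
#   bytes  t_min  t_typical  t_avg  t_max   (latencies in microseconds)
# Assumed input columns: #bytes #iterations t_min t_max t_typical t_avg ...
lat_summary() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 6 { print $1, $3, $5, $6, $4 }'
}
```

Usage: ib_send_lat -d mlx5_0 -i 1 <columbia_ip> | lat_summary on the client side.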

An active InfiniBand link does not by itself prove that RDMA works. Our test, ib_send_lat, specifically uses RDMA verbs, so a successful result proves RDMA is working.

In the results above, our average latency confirms that RDMA is functioning, as is kernel bypass. Note that while we use TCP/IP to set up the test, the actual data transfer is NIC to NIC and memory to memory. The queue pair exchange confirms the RDMA session: QPs were created on both nodes and transitioned through the required queue pair states shown below.

 State  Name              Purpose                             Analogy
 -----  ----              -------                             -------
 INIT   Initialize        Local QP setup                      Phone powered on
 RTR    Ready to Receive  Can receive remote data             You know the other person’s number
 RTS    Ready to Send     Fully operational (send + receive)  Call connected and talking

Run the RDMA Bandwidth Test

For this test we will run ib_send_bw. This test measures the following.

  • Throughput (bandwidth) of RDMA send operations
  • NIC-to-NIC data transfer rate
  • Memory → NIC → fabric → NIC → memory

Again, this test uses IP only to establish the initial connection between nodes. The data itself moves over RDMA verbs, so we are testing IB traffic, not IP traffic.

So over on our first node (columbia.lab) we run the following.

ib_send_bw -d mlx5_0 -i 1 -a

Why these flags

 Flag       Purpose
 ----       -------
 -d mlx5_0  Select your ConnectX-4 device
 -i 1       Use IB port 1
 -a         Sweep all message sizes

And on our second node we run the command shown below.

ib_send_bw -d mlx5_0 -i 1 -a <columbia_ip>

Assuming that this command does not fail, you will see a bunch of output that we need to interpret. This output is truncated, but I wanted to give you an idea of what to expect in the output.

ib_send_bw -d mlx5_0 -i 1 -a 172.16.50.12
---------------------------------------------------------------------------------------
Send BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON Lock-free : OFF
WARNING: CPU is not PCIe relaxed ordering compliant.
WARNING: You should disable PCIe RO with `--disable_pcie_relaxed` for both server and client.
ibv_wr* API : ON Using DDP : OFF
TX depth : 128
CQ Moderation : 100
CQE Poll Batch : 16
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x02 QPN 0x0107 PSN 0xed831e
remote address: LID 0x01 QPN 0x0107 PSN 0xa2c511
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MiB/sec] BW average[MiB/sec] MsgRate[Mpps]
Conflicting CPU frequency values detected: 1200.000000 != 1300.046000. CPU Frequency is not max.
2 1000 7.79 7.40 3.879797

Keep in mind that our IB bottleneck is our 40 Gb/s IB switch. Here is what we can interpret from our test data.

  • The plateau was about 3776.9 MiB/s, which is about 31.7 Gbit/s
  • That is a normal practical result for a nominal 40 Gb InfiniBand-class link; if the link is running at QDR, 8b/10b encoding leaves 32 Gbit/s for data, so we are at roughly 99% of the usable rate
  • Our plateau is consistent and stable, which is good
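The MiB/s to Gbit/s conversion is easy to get wrong, so here is the arithmetic as a small sketch: one MiB is 2^20 bytes, a byte is 8 bits, and a Gbit is 10^9 bits. Both function names are made up for this post, and plateau_mib assumes the five-column result rows shown in the output above.

```shell
# Print the largest "BW average" value (column 4) found in ib_send_bw
# output read from stdin. Non-result lines (headers, warnings) are skipped.
plateau_mib() {
    awk '($1 ~ /^[0-9]+$/) && (NF == 5) && ($4 + 0 > max) { max = $4 + 0 }
         END { print max + 0 }'
}

# Convert MiB/s to Gbit/s: MiB * 2^20 bytes * 8 bits / 10^9.
mib_to_gbit() {
    awk -v mib="$1" 'BEGIN { printf "%.1f\n", mib * 1048576 * 8 / 1e9 }'
}

mib_to_gbit 3776.9   # prints 31.7
```

With a saved run, mib_to_gbit "$(plateau_mib < bw.log)" gives the plateau in Gbit/s, where bw.log is whatever file you redirected the client output to.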

Our InfiniBand link is healthy enough to sustain near-expected throughput, and there is no obvious severe bottleneck or broken configuration. We are seeing some CPU frequency warnings and some “PCIe relaxed ordering” warnings, so let’s fix those and try the test again.

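What is PCIe Relaxed Ordering? It is a performance feature that allows certain PCIe memory transactions between the NIC and the CPU to be reordered. This can improve throughput by reducing stalls and increasing parallelism. The warning in our output says perftest does not consider our CPU compliant, so we disable the feature with --disable_pcie_relaxed on both sides.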

On both hosts, set the CPU frequency governor to performance with the command below.

sudo cpupower frequency-set -g performance
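To confirm the governor change stuck on every core, we can read it back from sysfs. A minimal sketch, assuming the standard Linux cpufreq layout under /sys/devices/system/cpu; non_performance_cpus is a hypothetical helper name.

```shell
# Print every CPU whose scaling governor is not "performance", one per line,
# as "<cpu directory> <governor>". Prints nothing when all CPUs are set.
# $1 optionally overrides the sysfs CPU directory.
non_performance_cpus() {
    root=${1:-/sys/devices/system/cpu}
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue
        gov=$(cat "$f")
        [ "$gov" = "performance" ] || printf '%s %s\n' "${f%/cpufreq/scaling_governor}" "$gov"
    done
}
```

Running non_performance_cpus with no arguments on each host should print nothing once the governor is set.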


Now back on the first host, kick off the listen side of the test.

ib_send_bw -d mlx5_0 -i 1 -a -q 4 --disable_pcie_relaxed

And on the other server we kick off the test itself.

ib_send_bw -d mlx5_0 -i 1 -a -q 4 --disable_pcie_relaxed <columbia_ip>

Why these flags

 Flag                    Purpose
 ----                    -------
 -d mlx5_0               Select your ConnectX-4 device
 -i 1                    Use IB port 1
 -a                      Sweep all message sizes
 -q 4                    Use multiple queue pairs (better utilization)
 --disable_pcie_relaxed  Match your CPU capabilities and remove warning

So let’s summarize our output.

  • Almost identical throughput to the initial test
  • Slight improvement in consistency
  • Cleaner test conditions (CPU frequency governor set to performance)
  • Multiple QPs established (we see 4 QPNs)
  • We still see “CPU Frequency is not max”; however, this is a non-issue, as we have already saturated our links.
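The QP count can also be checked from a saved run. The sketch below assumes perftest prints one "local address:" line per queue pair, as in the single-QP output earlier; qp_count is a made-up helper name.

```shell
# Count the "local address:" lines in ib_send_bw output read from stdin.
# Assumption: one such line is printed per queue pair.
qp_count() {
    grep -c 'local address:'
}
```

For the -q 4 run above, piping the client output through qp_count should report 4.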


Wrap Up

In our previous post, we stood up our IB network and performed some basic fabric tests. Today was all about performance testing with the actual RDMA verbs stack. We found that our fabric performed pretty much as expected out of the box with minimal tuning, as we are hitting near-theoretical limits for our 40Gb hardware.

We have a stable, low latency, high bandwidth IB Fabric.

I was hoping to get to GPUDirect testing today; however, that looks like it might be a bit of a beast, so I think I will call it a day and do a bit more research on the topic.
