Before moving on to Part 3 of this project, let's review what we have accomplished thus far.
- Did a bit of planning and scoping
- Built a 3-node GPU cluster (viper, columbia, prometheus)
- Interconnected with InfiniBand
- Installed and validated ConnectX-4 NICs and RDMA stack (mlx5, ib_core, etc.)
- Brought up the InfiniBand fabric using OpenSM (links active, LIDs assigned)
- Verified topology and connectivity (ibstat, ibnetdiscover)
- Configured IP over InfiniBand for basic networking between nodes
- Identified PCIe/NUMA limitations affecting optimal GPU↔NIC performance
We are now ready to do some performance testing of our InfiniBand network.
Pre-Test Setup
Before we can get started on our perf testing, we have a bit of work to do. We are going to install a few packages and configure some tunables.
Diagnostic Tools
First, let's make sure we have a few diagnostic tools installed by installing some RPMs.
sudo dnf install infiniband-diags libibverbs-utils librdmacm-utils -y
Kernel Modules
InfiniBand and GPUDirect require specific modules to load at boot, so let's create hpc.conf in /etc/modules-load.d/. The command below creates (or overwrites) /etc/modules-load.d/hpc.conf, which systemd-modules-load reads at boot to load each listed module automatically. Run this on each host.
sudo tee /etc/modules-load.d/hpc.conf >/dev/null <<'EOF'
ib_ipoib
ib_umad
ib_uverbs
nvidia-peermem
EOF
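Then force load the modules. The -a flag is needed because modprobe otherwise treats everything after the first module name as parameters for that module.
sudo modprobe -a ib_ipoib ib_umad ib_uverbs nvidia-peermem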
Below is a short breakdown/description for each module.
| Module | How it’s used |
|---|---|
| ib_ipoib | Provides IP networking over InfiniBand (e.g., ib0) for SSH, NFS, TCP/IP |
| ib_umad | Enables userspace IB management tools (e.g., ibstat, fabric queries) |
| ib_uverbs | Core RDMA interface used by applications (MPI, NCCL, libibverbs) |
| nvidia-peermem | Enables GPUDirect RDMA for direct GPU ↔ NIC memory transfers (no CPU copy) |
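To confirm the modules actually loaded, a small check like the one below can help. This is a minimal sketch: the function name missing_modules and its file argument are illustrative, not part of any tool. It relies on the kernel listing loaded modules in /proc/modules with hyphens shown as underscores, so nvidia-peermem appears there as nvidia_peermem.

```shell
# Print every module from the argument list that is NOT currently loaded.
# $1 is the module listing to read (normally /proc/modules);
# the remaining arguments are module names.
missing_modules() {
    listing="$1"
    shift
    for mod in "$@"; do
        # The kernel reports hyphens as underscores in /proc/modules.
        name=$(printf '%s' "$mod" | tr '-' '_')
        if ! grep -q "^${name} " "$listing"; then
            echo "$mod"
        fi
    done
}

# On a cluster node: no output means all four modules are loaded.
if [ -r /proc/modules ]; then
    missing_modules /proc/modules ib_ipoib ib_umad ib_uverbs nvidia-peermem
fi
```

Any name it prints is worth a second look with modprobe and dmesg before moving on.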
Locked Memory Limits
RDMA works by “pinning” memory so the OS cannot swap it to disk. The default locked-memory limit is usually too low for that, so we raise it by creating /etc/security/limits.d/99-hpc.conf as shown below.
sudo tee /etc/security/limits.d/99-hpc.conf >/dev/null <<'EOF'
* soft memlock unlimited
* hard memlock unlimited
EOF
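Files in limits.d are applied by pam_limits when a session starts, so the new limit only shows up after logging out and back in. A quick way to verify it is to look at ulimit -l. The helper below is a minimal sketch; the function name memlock_status and its messages are illustrative.

```shell
# Interpret a memlock value as printed by `ulimit -l`
# (either the word "unlimited" or a size in KiB).
memlock_status() {
    if [ "$1" = "unlimited" ]; then
        echo "ok: memlock is unlimited"
    else
        echo "warn: memlock is $1 KiB; log out and back in, then re-check"
    fi
}

# Check the limit of the current shell session.
memlock_status "$(ulimit -l)"
```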
Performance & RDMA Benchmarking
Health Check
First, let's run the following commands on each host under test to make sure the InfiniBand network is healthy before we start any testing. Run each line individually and make note of the output.
hostname
ibstat
ibv_devinfo | grep -E 'hca_id|transport|fw_ver|port:|link_layer|active_mtu|sm_lid|port_lid'
You are specifically interested in the following:
- Device Present (mlx5)
- State: Active
- Physical state: LinkUp
- Link layer: InfiniBand
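The checklist above can also be scripted against the ibstat output. The sketch below assumes a single-port HCA like the mlx5_0 devices in this cluster, because it only checks that each of the three strings appears somewhere in the output rather than tracking them per port. The function name ib_port_healthy is illustrative.

```shell
# Read `ibstat` output on stdin. Succeeds only if the output contains
# an Active state, a LinkUp physical state, and an InfiniBand link layer.
ib_port_healthy() {
    awk '
        /State: Active/          { active = 1 }
        /Physical state: LinkUp/ { linkup = 1 }
        /Link layer: InfiniBand/ { ib = 1 }
        END { exit !(active && linkup && ib) }
    '
}

# Only run the check where ibstat is installed.
if command -v ibstat >/dev/null 2>&1; then
    if ibstat | ib_port_healthy; then
        echo "port looks healthy"
    else
        echo "port not healthy; inspect the ibstat output"
    fi
fi
```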
Confirm HCA Name and Port Number
Run this on each node under test. We will need the device name and port number on both the receiver and the sender side.
ibv_devices
Output from columbia.lab.
device node GUID
------ ----------------
mlx5_0 248a070300ac5414
Output from prometheus.lab
device node GUID
------ ----------------
mlx5_0 248a070300ac5610
Run the RDMA Latency Test
For our ib_send_lat (latency test) our device IP addresses are as follows.
- columbia.lab: 172.16.50.12
- prometheus.lab: 172.16.50.11
On our first device, columbia.lab, we run the following and leave it running.
ib_send_lat -d mlx5_0 -i 1
Now over on prometheus, run the command below. Insert the IP from columbia captured above. You will see a good bit of output in your terminal window.
ib_send_lat -d mlx5_0 -i 1 <columbia_ip>
Key configuration details
So assuming the test did not fail, you are going to see some data spit out. Let's make sense of some of it.
| Parameter | Value | Meaning |
|---|---|---|
| Device | mlx5_0 | ConnectX-4 (mlx5 driver) |
| Transport | IB (RC) | Reliable Connection (standard RDMA mode) |
| MTU | 4096 | Optimal for IB performance |
| Queue Pairs | 1 | Single stream test |
| Inline data | 236B | Small messages optimized |
| Link type | InfiniBand | Correct mode |
What this test is actually doing
ib_send_lat:
- Registers memory with the NIC
- Creates RDMA queue pairs
- Sends messages using ibv_post_send()
- Measures completion latency via completion queues (CQs)
This is direct RDMA messaging, not IP networking.
Our Overall results
- Average latency: ~1.15 µs
- Typical latency: ~1.14 µs
- Minimum latency: 1.06 µs
- Outliers: up to 11.41 µs
- Conclusion: Healthy RDMA performance
A working InfiniBand link does not by itself prove that RDMA works. Our test, ib_send_lat, specifically uses RDMA verbs, so a successful result proves RDMA is working.
In the results above, our average latency confirms that RDMA is functioning, as is kernel bypass. Note that while we use TCP/IP to set up the test, the actual data transfer is NIC to NIC and memory to memory. The queue pair exchange confirms the RDMA session: QPs were created on both nodes and transitioned through the required queue pair states shown below.
| State | Name | Purpose | Analogy |
|---|---|---|---|
| INIT | Initialize | Local QP setup | Phone powered on |
| RTR | Ready to Receive | Can receive remote data | You know the other person’s number |
| RTS | Ready to Send | Fully operational (send + receive) | Call connected and talking |
Run the RDMA Bandwidth Test
For this test we will run ib_send_bw. This test measures the following.
- Throughput (bandwidth) of RDMA send operations
- NIC-to-NIC data transfer rate
- Memory → NIC → fabric → NIC → memory
Again, this test uses IP only to establish the initial connection between nodes. Make no mistake: the data path uses RDMA verbs, so we are testing IB traffic (not IP traffic).
So over on our first node (columbia.lab) we run the following.
ib_send_bw -d mlx5_0 -i 1 -a
Why these flags
| Flag | Purpose |
|---|---|
| -d mlx5_0 | Select your ConnectX-4 device |
| -i 1 | Use IB port 1 |
| -a | Sweep all message sizes |
And on our second node we run the command shown below.
ib_send_bw -d mlx5_0 -i 1 -a <columbia_ip>
Assuming that this command does not fail, you will see a bunch of output that we need to interpret. This output is truncated, but I wanted to give you an idea of what to expect in the output.
ib_send_bw -d mlx5_0 -i 1 -a 172.16.50.12
---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON           Lock-free      : OFF
 WARNING: CPU is not PCIe relaxed ordering compliant.
 WARNING: You should disable PCIe RO with `--disable_pcie_relaxed` for both server and client.
 ibv_wr* API     : ON           Using DDP      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 CQE Poll Batch  : 16
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x02 QPN 0x0107 PSN 0xed831e
 remote address: LID 0x01 QPN 0x0107 PSN 0xa2c511
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MiB/sec]    BW average[MiB/sec]   MsgRate[Mpps]
Conflicting CPU frequency values detected: 1200.000000 != 1300.046000. CPU Frequency is not max.
 2          1000             7.79               7.40                  3.879797
Keep in mind that our IB bottleneck is our 40 Gb IB switch. Here is what we can interpret from our test data.
- Plateau was about 3776.9 MiB/s, which is about 31.7 Gbit/s
- That is a normal practical result for a nominal 40 Gb InfiniBand-class link
- Our plateau is consistent and stable, which is good
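The MiB/s to Gbit/s conversion above can be reproduced with a one-line awk helper. This is a minimal sketch; the function name mib_to_gbit is illustrative and not part of perftest. As a point of reference, if the 40 Gb switch is QDR (an assumption, since the speed grade is not stated here), 8b/10b encoding leaves 32 Gbit/s for data, which would put this plateau very close to line rate.

```shell
# Convert a bandwidth in MiB/s (as reported by ib_send_bw) to Gbit/s.
# 1 MiB = 1048576 bytes, 8 bits per byte, 1 Gbit = 10^9 bits.
mib_to_gbit() {
    awk -v mib="$1" 'BEGIN { printf "%.1f\n", mib * 1048576 * 8 / 1e9 }'
}

mib_to_gbit 3776.9   # plateau from the run above; prints 31.7
```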
Our InfiniBand link is healthy enough to sustain near-expected throughput, and there is no obvious severe bottleneck or broken configuration. We are seeing some CPU frequency warnings and some “PCIe relaxed ordering” warnings, so let's fix those and try the test again.
What is PCIe Relaxed Ordering? It is a performance feature that allows the CPU/NIC to reorder memory transactions, which can improve throughput by reducing stalls and increasing parallelism. The warning in our output means the test detected a CPU that is not relaxed-ordering compliant, which is why it suggests running with --disable_pcie_relaxed on both sides.
To address the CPU frequency warning, run the command below on both hosts to switch to the performance governor.
sudo cpupower frequency-set -g performance
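To confirm the governor change took effect on every core, the cpufreq entries in sysfs can be summarized. This is a minimal sketch that assumes the standard /sys/devices/system/cpu layout; the function name governor_summary and its optional directory argument are illustrative.

```shell
# Count how many CPUs are using each cpufreq governor.
# $1 is the CPU sysfs directory (defaults to /sys/devices/system/cpu).
governor_summary() {
    root="${1:-/sys/devices/system/cpu}"
    cat "$root"/cpu[0-9]*/cpufreq/scaling_governor 2>/dev/null | sort | uniq -c
}

# Expect a single line showing the CPU count followed by "performance".
governor_summary
```

If more than one governor shows up, some cores did not pick up the change.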
Now back on the first host, kick off the listen side of the test.
ib_send_bw -d mlx5_0 -i 1 -a -q 4 --disable_pcie_relaxed
And on the other server we kick off the test itself.
ib_send_bw -d mlx5_0 -i 1 -a -q 4 --disable_pcie_relaxed <columbia_ip>
Why these flags
| Flag | Purpose |
|---|---|
| -d mlx5_0 | Select your ConnectX-4 device |
| -i 1 | Use IB port 1 |
| -a | Sweep all message sizes |
| -q 4 | Use multiple queue pairs (better utilization) |
| --disable_pcie_relaxed | Match your CPU capabilities and remove warning |
So let's summarize our output.
- Throughput almost identical to the initial test
- Slight improvement in consistency
- Cleaner test conditions (set cpu-frequency to performance)
- Multiple QPs established (we see 4 QPNs)
- We still see "CPU Frequency is not max"; however, this is a non-issue as we have already saturated our link.
Wrap Up
In our previous post, we stood up our IB network, and performed some basic fabric tests. Today was all about performance testing and testing with the actual RDMA verb stack. We found that our fabric was pretty much performing as expected out of the box with minimal tuning, as we are hitting near-theoretical limits for our 40Gb hardware.
We have a stable, low-latency, high-bandwidth IB fabric.
I was hoping to get to GPUDirect testing today; however, that looks like it might be a bit of a beast, so I think I will call it a day and do a bit more research on the topic.