Before moving on to Part 3 of this project, let's review what we have accomplished thus far.

In Part 1 and Part 2, we:

Did a bit of planning and scoping
Built a 3-node GPU cluster (viper, columbia, prometheus) interconnected with InfiniBand
Installed and validated ConnectX-4 NICs and the RDMA stack (mlx5, ib_core, etc.)
Brought up the InfiniBand fabric using OpenSM (links active, LIDs assigned)
Verified topology and connectivity (ibstat, ibnetdiscover)
Configured IP over InfiniBand for basic networking between nodes
Identified PCIe/NUMA limitations affecting optimal GPU↔NIC performance

We are now ready to do some performance testing of our InfiniBand network.

Pre-Test Setup

Before we can get started on our perf testing, we have a bit of work to do. We are going to install a few packages and configure some tunables.

Diagnostic Tools

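First, let's make sure we have the tools we need by installing a few RPMs. The perftest package is included because it provides the ib_send_lat and ib_send_bw benchmarks used later in this post.

sudo dnf install infiniband-diags libibverbs-utils librdmacm-utils perftest -y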
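
Before going further, it can help to confirm that the tools used in this post are actually on the PATH. The function below is a minimal sketch, not part of any package: missing_tools is a hypothetical helper name, and it simply prints each command name it cannot find.

```shell
# Hypothetical helper: print every command name given as an argument
# that cannot be found on PATH. No output means everything was found.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Run it as `missing_tools ibstat ibv_devinfo ibv_devices ib_send_lat ib_send_bw`. If ib_send_lat or ib_send_bw show up as missing, they usually come from the perftest package on Fedora/RHEL-family systems.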

Kernel Modules

InfiniBand and GPUDirect RDMA rely on specific kernel modules, so let's make sure they load at boot. The command below creates (or overwrites) /etc/modules-load.d/hpc.conf, which systemd-modules-load reads at boot to load each listed module. Run this on each host.

			
sudo tee /etc/modules-load.d/hpc.conf >/dev/null <<'EOF'
ib_ipoib
ib_umad
ib_uverbs
nvidia-peermem
EOF

		

Then load the modules right away, without waiting for a reboot.

sudo modprobe -a ib_ipoib ib_umad ib_uverbs nvidia-peermem

Below is a short description of each module.

Module	How it’s used
ib_ipoib	Provides IP networking over InfiniBand (e.g., `ib0`) for SSH, NFS, TCP/IP
ib_umad	Enables userspace IB management tools (e.g., `ibstat`, fabric queries)
ib_uverbs	Core RDMA interface used by applications (MPI, NCCL, libibverbs)
nvidia-peermem	Enables GPUDirect RDMA for direct GPU ↔ NIC memory transfers (no CPU copy)
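
To confirm that all four modules are loaded, one option is to filter the output of lsmod. The sketch below assumes the module list from this post; missing_modules is a hypothetical helper name. Note that lsmod shows module names with underscores, so nvidia-peermem appears as nvidia_peermem.

```shell
# Hypothetical helper: read `lsmod` output on stdin and print each
# expected module that is not loaded. The expected names are the four
# modules from hpc.conf, written the way lsmod prints them.
missing_modules() {
    awk '
        BEGIN {
            n = split("ib_ipoib ib_umad ib_uverbs nvidia_peermem", want, " ")
        }
        { loaded[$1] = 1 }    # first column of lsmod is the module name
        END {
            for (i = 1; i <= n; i++)
                if (!(want[i] in loaded)) print want[i]
        }
    '
}
```

Run it as `lsmod | missing_modules`. No output means every module is loaded.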

Locked Memory Limits

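RDMA works by “pinning” memory so the OS cannot swap it to disk. Pinned memory counts against each user's locked memory (memlock) limit, which is small by default, so we need to raise it. To do that, create /etc/security/limits.d/99-hpc.conf as shown below. The new limits apply to new login sessions, so log out and back in afterwards.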

			
sudo tee /etc/security/limits.d/99-hpc.conf >/dev/null <<'EOF'
* soft memlock unlimited
* hard memlock unlimited
EOF
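
After logging back in, the new limit can be checked with `ulimit -l`. The function below is a small sketch around that check; check_memlock is a hypothetical helper name, and the optional argument exists only so the logic can be tried without changing real limits.

```shell
# Hypothetical helper: report whether a memlock limit is suitable for RDMA.
# With no argument it inspects the current shell via `ulimit -l`.
check_memlock() {
    local limit="${1:-$(ulimit -l)}"
    if [ "$limit" = "unlimited" ]; then
        echo "memlock OK: unlimited"
    else
        echo "memlock is limited to ${limit} KiB"
        return 1
    fi
}
```

Calling `check_memlock` with no arguments in a fresh login shell should report unlimited once the limits file is in place.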

Performance & RDMA Benchmarking

Health Check

First, let's run the following commands on each host under test, just to make sure the InfiniBand network is healthy before we start any testing. Run each line individually and make note of the output.

			
hostname
ibstat
ibv_devinfo | grep -E 'hca_id|transport|fw_ver|port:|link_layer|active_mtu|sm_lid|port_lid'

You are specifically interested in the following:

Device Present (mlx5)
State: Active
Physical state: LinkUp
Link layer: InfiniBand
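
The same checklist can be applied to the ibstat output automatically. This is a rough sketch, assuming the usual ibstat field labels listed above; ib_port_healthy is a hypothetical helper name, and it does not distinguish between ports on a multi-port HCA.

```shell
# Hypothetical helper: read `ibstat` output on stdin and succeed only if it
# contains State: Active, Physical state: LinkUp and Link layer: InfiniBand.
ib_port_healthy() {
    awk '
        /State: Active/          { active = 1 }
        /Physical state: LinkUp/ { linkup = 1 }
        /Link layer: InfiniBand/ { ib = 1 }
        END { exit !(active && linkup && ib) }
    '
}
```

Run it as `ibstat | ib_port_healthy && echo "port looks healthy"`.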

Confirm HCA Name and Port Number

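Run this on each host under test. We will need the HCA name on both the receiver and the sender side of our tests. The port number is listed in the ibstat and ibv_devinfo output from the health check.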

ibv_devices

Output from columbia.lab.

 device          	   node GUID
 ------          	----------------
 mlx5_0          	248a070300ac5414

Output from prometheus.lab

 device          	   node GUID
 ------          	----------------
 mlx5_0          	248a070300ac5610

Run the RDMA Latency Test

For our ib_send_lat latency test, our hosts' IP addresses are as follows.

columbia.lab – 172.16.50.12
prometheus.lab – 172.16.50.11

On our first host, columbia.lab, we run the following and leave it running. With no address given, it acts as the server and waits for a client to connect.

ib_send_lat -d mlx5_0 -i 1

Now over on prometheus, run the command below, replacing <columbia_ip> with columbia's IP address from above. You will see a good bit of output in your terminal window.

ib_send_lat -d mlx5_0 -i 1 <columbia_ip>

Key configuration details

Assuming the test did not fail, you are going to see some data spit out. Let's make sense of some of it.

Parameter	Value	Meaning
Device	`mlx5_0`	ConnectX-4 (mlx5 driver)
Transport	IB (RC)	Reliable Connection (standard RDMA mode)
MTU	4096	Optimal for IB performance
Queue Pairs	1	Single stream test
Inline data	236B	Small messages optimized
Link type	InfiniBand	Correct mode

What this test is actually doing

ib_send_lat:

Registers memory with the NIC
Creates RDMA queue pairs
Sends messages using ibv_post_send()
Measures completion latency via completion queues (CQs)

This is direct RDMA messaging, not IP networking.

Our Overall results

Average latency: ~1.15 µs
Typical latency: ~1.14 µs
Minimum latency: 1.06 µs
Outliers: up to 11.41 µs
Conclusion: Healthy RDMA performance

A working InfiniBand link does not by itself prove that RDMA works. However, ib_send_lat specifically uses RDMA verbs, so a successful result shows that RDMA is working.

Our average latency of about 1.15 µs confirms that RDMA is functioning, as is kernel bypass. Note that while we are using TCP/IP to set up the test, the actual data transfer is NIC to NIC and memory to memory. The queue pair exchange confirms the RDMA session: QPs were created on both nodes and transitioned through the required queue pair states shown below.

State	Name	Purpose	Analogy
INIT	Initialize	Local QP setup	Phone powered on
RTR	Ready to Receive	Can receive remote data	You know the other person’s number
RTS	Ready to Send	Fully operational (send + receive)	Call connected and talking

Run the RDMA Bandwidth Test

For this test we will run ib_send_bw. This test measures the following.

Throughput (bandwidth) of RDMA send operations
NIC-to-NIC data transfer rate
Memory → NIC → fabric → NIC → memory

Again, this test uses IP only to establish the initial connection between nodes. Make no mistake: the measured traffic uses RDMA verbs over IB, not IP.

So over on our first node (columbia.lab) we run the following.

ib_send_bw -d mlx5_0 -i 1 -a

Why these flags

Flag	Purpose
`-d mlx5_0`	Select your ConnectX-4 device
`-i 1`	Use IB port 1
`-a`	Sweep all message sizes

And on our second node we run the command shown below.

ib_send_bw -d mlx5_0 -i 1 -a <columbia_ip>

Assuming the command does not fail, you will see a good amount of output that we need to interpret. The output below is truncated, but it gives you an idea of what to expect.

			
ib_send_bw -d mlx5_0 -i 1 -a 172.16.50.12
---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF		Device         : mlx5_0
 Number of qps   : 1		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 PCIe relax order: ON		Lock-free      : OFF
 WARNING: CPU is not PCIe relaxed ordering compliant.
 WARNING: You should disable PCIe RO with `--disable_pcie_relaxed` for both server and client.
 ibv_wr* API     : ON		Using DDP      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 CQE Poll Batch  : 16
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs	 : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x02 QPN 0x0107 PSN 0xed831e
 remote address: LID 0x01 QPN 0x0107 PSN 0xa2c511
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MiB/sec]    BW average[MiB/sec]   MsgRate[Mpps]
Conflicting CPU frequency values detected: 1200.000000 != 1300.046000. CPU Frequency is not max.
 2          1000             7.79               7.40   		     3.879797

		

Keep in mind that our IB bottleneck is our 40 Gb/s IB switch. Here is what we can interpret from our test data.

The plateau was about 3776.9 MiB/s, which is about 31.7 Gbit/s
That is a normal practical result for a nominal 40 Gb/s InfiniBand link (if the link is running at QDR, 8b/10b encoding leaves at most 32 Gbit/s for data)
Our plateau is consistent and stable, which is good
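
The MiB/s to Gbit/s conversion above is easy to get wrong, because ib_send_bw reports binary mebibytes while link speeds are quoted in decimal gigabits. A minimal sketch of the arithmetic, using a hypothetical helper named mib_to_gbit:

```shell
# Hypothetical helper: convert MiB/s (as printed by ib_send_bw) to Gbit/s.
# 1 MiB = 1048576 bytes, 1 byte = 8 bits, 1 Gbit = 10^9 bits.
mib_to_gbit() {
    awk -v mib="$1" 'BEGIN { printf "%.1f\n", mib * 1048576 * 8 / 1e9 }'
}

mib_to_gbit 3776.9   # prints 31.7
```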

Our InfiniBand link is healthy enough to sustain near-expected throughput, and there is no obvious severe bottleneck or broken configuration. We are seeing some CPU frequency warnings and some “PCIe relaxed ordering” warnings, so let's fix those and try the test again.

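What is PCIe Relaxed Ordering? It is a performance feature that allows certain PCIe memory transactions between the NIC and the CPU to be reordered. This can improve throughput by reducing stalls and increasing parallelism. The warning in our output says our CPU is not compliant, which is why the test suggests turning it off with --disable_pcie_relaxed on both server and client.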

To address the CPU frequency warning, set the CPU frequency governor to performance. On both hosts, run the command below.

sudo cpupower frequency-set -g performance
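
To double-check that the governor change took effect on every core, one option is to read the cpufreq entries in sysfs. This is a sketch that assumes the standard /sys/devices/system/cpu layout; non_performance_cpus is a hypothetical helper name, and the optional argument only exists so the function can be pointed at a different directory.

```shell
# Hypothetical helper: list the scaling_governor files whose value is not
# "performance". The optional argument overrides the sysfs root.
non_performance_cpus() {
    local root="${1:-/sys/devices/system/cpu}"
    local f
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue    # skip if cpufreq is not exposed
        [ "$(cat "$f")" = "performance" ] || echo "$f"
    done
}
```

Running `non_performance_cpus` with no arguments should print nothing once every core is set to performance. It also prints nothing on systems that do not expose cpufreq at all.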

Now back on the first host, kick off the listen side of the test.

ib_send_bw -d mlx5_0 -i 1 -a -q 4 --disable_pcie_relaxed

And on the other server we kick off the test itself.

ib_send_bw -d mlx5_0 -i 1 -a -q 4 --disable_pcie_relaxed <columbia_ip>

Why these flags

Flag	Purpose
`-d mlx5_0`	Select your ConnectX-4 device
`-i 1`	Use IB port 1
`-a`	Sweep all message sizes
`-q 4`	Use multiple queue pairs (better utilization)
`--disable_pcie_relaxed`	Match your CPU capabilities and remove warning

So let's summarize our output.

Almost identical throughput to the initial test
Slight improvement in consistency
Cleaner test conditions (CPU frequency governor set to performance)
Multiple QPs established (we see 4 QPNs)
We still see the “CPU Frequency is not max” warning; however, this is a non-issue, as we have already saturated our link.

Wrap Up

In our previous post, we stood up our IB network, and performed some basic fabric tests. Today was all about performance testing and testing with the actual RDMA verb stack. We found that our fabric was pretty much performing as expected out of the box with minimal tuning, as we are hitting near-theoretical limits for our 40Gb hardware.

We have a stable, low-latency, high-bandwidth IB fabric.

I was hoping to get to GPUDirect testing today; however, that looks like it might be a bit of a beast, so I think I will call it a day and do a bit more research on the topic.

Chris Paquin

Project “NVIDIA HPC Infiniband Homelab GPU Cluster”: Part 3: RDMA Performance Testing
