Project “NVIDIA HPC InfiniBand Homelab GPU Cluster”: Part 2: InfiniBand Setup


Part 1 of this project log can be found here

Now that the 3x Mellanox MCX455A-ECAT ConnectX-4 adapters have arrived, it's time to install them into their respective servers (columbia.lab, prometheus.lab, and viper.lab).


Verify Mellanox CX4s are Detected

Once installed, log into the iDRAC of each host and verify that the CX-4 appears in the system inventory. Sample output below from one of the hosts.

Note that you may need to reboot the system for the CX-4 to appear in the iDRAC inventory (as “Collect System Inventory on Restart” (CSIOR) runs during startup).

InfiniBand.Slot.1-1 - PCI Device
BusNumber 129
DataBusWidth 16x or x16
Description ConnectX-4 VPI IB EDR/100 GbE Single Port QSFP28 Adapter
Device Type PCIDevice
DeviceDescription InfiniBand.Slot.1-1
DeviceNumber 0
FQDD InfiniBand.Slot.1-1
FunctionNumber 0
InstanceID InfiniBand.Slot.1-1
LastSystemInventoryTime 2026-03-14T22:33:28
LastUpdateTime 2026-03-15T03:33:07
Manufacturer Mellanox Technologies
PCIDeviceID 1013
PCISubDeviceID 0033
PCISubVendorID 15B3
PCIVendorID 15B3
SlotLength Long Length
SlotType PCI Express Gen 3

Once each machine has booted to the running OS, you can confirm that RHEL properly detects the CX-4 with lspci:

lspci | grep -i mel
44:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]

Verify NUMA Topology via nvidia-smi

For best performance, your GPUs and InfiniBand adapters should be NUMA-local to each other. If you were deploying a similar setup in a production environment, NUMA alignment would be critical.

Our lab setup is less than ideal due to the limited number of PCIe slots. In many Dell servers, PCIe risers for GPUs have only one PCIe slot. Stick two of these risers in a single server, and you end up with only 3 slots free on riser 1 (half length), which is where we had to install our CX-4s.

All this being said, we “should” be fine for functional testing. Let's review each of our 3 nodes below. Since the NVIDIA drivers are already installed on all three systems, we can run nvidia-smi, confirm that the CX-4 is in the output, and review the topology.


Nvidia-smi output on host viper.lab

This server, viper.lab, is a Dell R720 running RHEL 9, and has 2x NVIDIA Tesla P4 GPUs installed along with the CX-4.

nvidia-smi topo -m
GPU0 GPU1 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X PHB SYS 0,2,4,6,8,10 0 N/A
GPU1 PHB X SYS 0,2,4,6,8,10 0 N/A
NIC0 SYS SYS X

In the output above, both GPUs are local to NUMA node 0 and connected to each other through a PCIe host bridge (PHB), while the ConnectX NIC (mlx5_0) is topologically remote from both GPUs (SYS), making the setup workable but not ideal for GPUDirect RDMA performance. This should be OK for our lab, as we are performing functional tests and performance is secondary. Time will tell.

Nvidia-smi output on host columbia.lab

Columbia.lab is a Dell R730 with 1x NVIDIA Tesla T4 installed along with one CX-4 (it would be possible to move the CX-4 to another slot if we had a full-length bracket for it, or half-length brackets for the NICs on riser 2).

nvidia-smi topo -m
GPU0 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X SYS 0,2,4,6,8,10 0 N/A
NIC0 SYS X

In the output above, we can see that this system has one GPU (T4) on NUMA node 0 and one CX-4, but the GPU-to-NIC path is SYS, indicating a topologically distant connection that is usable but sub-optimal for GPUDirect RDMA performance. Again, this may be fine for functional testing.

Nvidia-smi output on host prometheus.lab

This Dell R730 has 2x NVIDIA Tesla T4s installed, as well as the recently installed CX-4.

nvidia-smi topo -m
GPU0 GPU1 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X PHB SYS 0,2,4,6,8,10 0 N/A
GPU1 PHB X SYS 0,2,4,6,8,10 0 N/A
NIC0 SYS SYS X

This output shows two GPUs on the same NUMA node with a moderate GPU-to-GPU path (PHB) and relatively distant NIC connectivity (SYS), which is acceptable for many workloads but not ideal for high-performance GPU-to-NIC or GPUDirect-style traffic.
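Since all three hosts show the same pattern, it can be handy to scan the matrix programmatically. A minimal sketch below: the sample matrix is hard-coded and simplified to the legend columns so the example runs anywhere, and the function names are ours. On a live host you would feed it the real “nvidia-smi topo -m” output instead.

```shell
# Flag any GPU<->NIC path that crosses the inter-socket link (SYS).
# The matrix below is a simplified version of the prometheus.lab
# sample above; on a real host, pipe in: nvidia-smi topo -m
topo_matrix() {
cat <<'EOF'
GPU0 GPU1 NIC0
GPU0 X PHB SYS
GPU1 PHB X SYS
NIC0 SYS SYS X
EOF
}

check_gpu_nic_paths() {
  topo_matrix | awk '
    NR == 1 { for (i = 1; i <= NF; i++) col[i] = $i; next }
    $1 ~ /^GPU/ {
      for (i = 2; i <= NF; i++)
        if (col[i-1] ~ /^NIC/ && $i == "SYS")
          printf "%s <-> %s: SYS (crosses socket interconnect)\n", $1, col[i-1]
    }'
}

check_gpu_nic_paths
```

On our hosts this flags both GPU-to-NIC paths, matching what we read manually above.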


Verify Drivers Loaded Properly

In the first steps, we saw the CX-4 in the iDRAC and in the output of lspci. We will now check that the driver has loaded properly. Rinse and repeat on each host.

lsmod | egrep 'mlx5_core|mlx5_ib|ib_core'
mlx5_ib 561152 0
macsec 73728 1 mlx5_ib
mlx5_core 3153920 2 mlx5_fwctl,mlx5_ib
mlxfw 49152 1 mlx5_core
psample 20480 1 mlx5_core
tls 159744 2 bonding,mlx5_core
pci_hyperv_intf 12288 1 mlx5_core
ib_uverbs 217088 2 rdma_ucm,mlx5_ib
ib_core 573440 12 rdma_cm,ib_ipoib,rpcrdma,ib_srpt,iw_cm,ib_iser,ib_umad,ib_isert,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm

Let's review the important/relevant output below.

mlx5_ib 561152 0
mlx5_core 3153920 2 mlx5_fwctl,mlx5_ib
ib_uverbs 217088 2 rdma_ucm,mlx5_ib
ib_core 573440 12 ...
  • mlx5_core – main low-level kernel driver for Mellanox/NVIDIA ConnectX-4/5-class adapters.
    • the kernel sees the adapter family and has loaded the base driver
    • this is required for the card to function at all
    • other Mellanox modules depend on it
  • mlx5_ib – This is the InfiniBand/RDMA driver layer for mlx5 devices.
    • the adapter is not just using the generic Ethernet driver path
    • the system has the RDMA / InfiniBand-capable driver loaded
    • the kernel is prepared to expose the card as an IB/RDMA device
  • ib_core – the core InfiniBand subsystem in the kernel.
    • the Linux IB stack is loaded
    • multiple RDMA/IB-related modules are attached to it
    • the host is set up for InfiniBand/RDMA functionality, not just plain NIC support
  • ib_uverbs – This is the userspace verbs interface.
    • userspace RDMA tools and libraries should be able to talk to the device
    • commands like ibv_devinfo, ibstat, and RDMA applications have the proper kernel interface available
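To rinse and repeat this check quickly, the module list can be verified with a small loop. A sketch below: the lsmod output is hard-coded from the capture above so the example is self-contained, and sample_lsmod/check_modules are our own helper names. On a live host, replace sample_lsmod with the real lsmod.

```shell
# Verify the RDMA driver stack is loaded. The lsmod output below is
# hard-coded from the capture above; on a live host, substitute the
# real lsmod command for sample_lsmod.
sample_lsmod() {
cat <<'EOF'
mlx5_ib 561152 0
mlx5_core 3153920 2 mlx5_fwctl,mlx5_ib
ib_uverbs 217088 2 rdma_ucm,mlx5_ib
ib_core 573440 12 rdma_cm,ib_uverbs,mlx5_ib
EOF
}

check_modules() {
  missing=0
  for mod in mlx5_core mlx5_ib ib_core ib_uverbs; do
    if sample_lsmod | awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }'; then
      echo "OK: $mod loaded"
    else
      echo "MISSING: $mod"
      missing=1
    fi
  done
  return $missing
}

check_modules
```

The function returns non-zero if any module is missing, so it drops into scripts and cron checks easily.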

Show Devices and Port State

First we need to install some prerequisites.

sudo dnf install rdma-core infiniband-diags libibverbs-utils -y

ibv_devices

We can list InfiniBand devices with ibv_devices, which shows the local RDMA/InfiniBand devices that the OS can see on that host. It does not enumerate remote hosts, switches, or the rest of the IB fabric.

We will run this command on each host and capture the output.

On columbia.lab

[root@columbia ~]# ibv_devices
device node GUID
------ ----------------
mlx5_0 248a070300ac5414

On viper.lab

root@viper:~# ibv_devices
device node GUID
------ ----------------
mlx5_0 248a070300ac5f6c

On prometheus.lab

[root@prometheus ~]$ ibv_devices
device node GUID
------ ----------------
mlx5_0 248a070300ac5610

ibstat

Now that we have confirmed all devices are present and accounted for, let's check for links. In the output below you can see that we have link (“Physical state: LinkUp”), but since we have not configured a subnet manager on any of our nodes, the logical port state is “State: Initializing”.

ibstat
CA 'mlx5_0'
CA type: MT4115
Number of ports: 1
Firmware version: 12.28.4512
Hardware version: 0
Node GUID: 0x248a070300ac5610
System image GUID: 0x248a070300ac5610
Port 1:
State: Initializing
Physical state: LinkUp
Rate: 40
Base lid: 65535
LMC: 0
SM lid: 0
Capability mask: 0x2659e848
Port GUID: 0x248a070300ac5610
Link layer: InfiniBand

Run ibstat on your remaining nodes and ensure that you see “Physical state: LinkUp”. You may also want to make note of “Firmware version: 12.28.4512”; we have the same firmware on all three CX-4s.


ibv_devinfo

We can also run ibv_devinfo, which gives a detailed view of the local RDMA/InfiniBand device and its ports. It is more detailed than ibv_devices and overlaps somewhat with ibstat, but presents things from the verbs/RDMA stack perspective.

Example output below:

 ibv_devinfo
hca_id:	mlx5_0
	transport:			InfiniBand (0)
	fw_ver:				12.28.4512
	node_guid:			248a:0703:00ac:5414
	sys_image_guid:			248a:0703:00ac:5414
	vendor_id:			0x02c9
	vendor_part_id:			4115
	hw_ver:				0x0
	board_id:			DEL2180110032
	phys_port_cnt:			1
		port:	1
			state:			PORT_INIT (2)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			0
			port_lid:		65535
			port_lmc:		0x00
			link_layer:		InfiniBand

This output shows us the following…

  • Local RDMA devices
    • Example: mlx5_0, mlx5_1
  • Port state
    • Example: PORT_ACTIVE, PORT_DOWN, PORT_INIT
  • Physical link state
    • Example: LINK_UP, POLLING, DISABLED
  • Negotiated link details
    • Speed and link width
  • Fabric info
    • Local LID and SM LID
  • Device identifiers
    • Node GUID, port GUID, system image GUID
  • Transport / firmware details
    • Transport type and device-specific details
  • RDMA capabilities
    • Limits such as QPs, CQs, MR size, atomic support, GID table size
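If you are scripting these checks, the interesting fields can be pulled straight out of the ibv_devinfo text. A sketch below, with the sample output above hard-coded so it runs anywhere; devinfo_sample, port_state, and fw_version are our own helper names, not part of any tool.

```shell
# Pull the fields we care about out of ibv_devinfo output. The sample
# is hard-coded from the columbia.lab capture above; on a live host,
# replace devinfo_sample with the real ibv_devinfo command.
devinfo_sample() {
cat <<'EOF'
hca_id: mlx5_0
        fw_ver:         12.28.4512
        phys_port_cnt:  1
                port:   1
                        state:  PORT_INIT (2)
                        link_layer:     InfiniBand
EOF
}

port_state() { devinfo_sample | awk '$1 == "state:" { print $2; exit }'; }
fw_version() { devinfo_sample | awk '$1 == "fw_ver:" { print $2; exit }'; }

echo "firmware: $(fw_version), port state: $(port_state)"
```

Once OpenSM is running (next section), the same check should report PORT_ACTIVE instead of PORT_INIT.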

rdma link

The current output of “rdma link” shows us that our InfiniBand ports are connected, but, as we know, the fabric is not initialized.

rdma link
link mlx5_0/1 subnet_prefix fe80:0000:0000:0000 lid 65535 sm_lid 0 lmc 0 state INIT physical_state LINK_UP

Specifically the output shows us the following.

  • mlx5_0/1
    • Device mlx5_0, port 1
  • subnet_prefix fe80:0000:0000:0000
    • Normal default InfiniBand subnet prefix
  • lid 65535
    • The port does not have a valid assigned LID yet
  • sm_lid 0
    • No subnet manager is detected
  • lmc 0
    • LID mask control is 0; not important here
  • state INIT
    • The port is not fully active yet
  • physical_state LINK_UP
    • The physical link is up and the cable/port side is working

Setup Subnet Manager on one Host

For our lab, we are going to set up a subnet manager on only one host. Pick your always-on host. Multiple subnet manager instances can be run, but again, that is not needed for our current objective.

On the selected host run the following to install required packages.

 dnf install -y rdma-core opensm infiniband-diags

Next, start and enable the service

sudo systemctl enable --now opensm

Then check to ensure that the service started without error.

sudo systemctl status opensm --no-pager
journalctl -u opensm -b --no-pager

Now we can re-check the fabric from each host. The port state is now “State: Active”.

 ibstat
CA 'mlx5_0'
	CA type: MT4115
	Number of ports: 1
	Firmware version: 12.28.4512
	Hardware version: 0
	Node GUID: 0x248a070300ac5f6c
	System image GUID: 0x248a070300ac5f6c
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 40
		Base lid: 4
		LMC: 0
		SM lid: 1
		Capability mask: 0x2659e848
		Port GUID: 0x248a070300ac5f6c
		Link layer: InfiniBand

rdma link shows similar output.

rdma link
link mlx5_0/1 subnet_prefix fe80:0000:0000:0000 lid 4 sm_lid 1 lmc 0 state ACTIVE physical_state LINK_UP

Confirm InfiniBand Fabric Topology

You can run the following commands to confirm that your fabric is up and running.

Run “ibnetdiscover” to see host adapters, links, GUIDs, port relationships, and switches. Example output below:

ibnetdiscover
#
# Topology file: generated on Sat Mar 14 21:15:43 2026
#
# Initiated from node 248a070300ac5f6c port 248a070300ac5f6c
vendid=0x2c9
devid=0xbd36
sysimgguid=0x2c902004cf11b
switchguid=0x2c902004cf118(2c902004cf118)
Switch 8 "S-0002c902004cf118" # "Infiniscale-IV Mellanox Technologies" base port 0 lid 3 lmc 0
[1] "H-248a070300ac5f6c"[1](248a070300ac5f6c) # "viper mlx5_0" lid 4 4xQDR
[2] "H-248a070300ac5414"[1](248a070300ac5414) # "columbia mlx5_0" lid 1 4xQDR
[3] "H-248a070300ac5610"[1](248a070300ac5610) # "prometheus mlx5_0" lid 2 4xQDR
vendid=0x2c9
devid=0x1013
sysimgguid=0x248a070300ac5414
caguid=0x248a070300ac5414
Ca 1 "H-248a070300ac5414" # "columbia mlx5_0"
[1](248a070300ac5414) "S-0002c902004cf118"[2] # lid 1 lmc 0 "Infiniscale-IV Mellanox Technologies" lid 3 4xQDR
vendid=0x2c9
devid=0x1013
sysimgguid=0x248a070300ac5610
caguid=0x248a070300ac5610
Ca 1 "H-248a070300ac5610" # "prometheus mlx5_0"
[1](248a070300ac5610) "S-0002c902004cf118"[3] # lid 2 lmc 0 "Infiniscale-IV Mellanox Technologies" lid 3 4xQDR
vendid=0x2c9
devid=0x1013
sysimgguid=0x248a070300ac5f6c
caguid=0x248a070300ac5f6c
Ca 1 "H-248a070300ac5f6c" # "viper mlx5_0"
[1](248a070300ac5f6c) "S-0002c902004cf118"[1] # lid 4 lmc 0 "Infiniscale-IV Mellanox Technologies" lid 3 4xQDR

In the output above we can see the following

  • One Mellanox InfiniScale-IV switch is present in the fabric
    • Switch GUID: 0x2c902004cf118
    • Switch LID: 3
    • Model family shown as Infiniscale-IV Mellanox Technologies
    • It is an 8-port switch
  • Three hosts are connected to the switch
    • columbia on switch port 2, LID 1
    • prometheus on switch port 3, LID 2
    • viper on switch port 1, LID 4
  • All three hosts are being seen as CA / HCA nodes
    • columbia mlx5_0
    • prometheus mlx5_0
    • viper mlx5_0
  • All discovered links are running at:
    • 4xQDR
    • That means a 4-lane QDR InfiniBand link, which aligns with a 40 Gb/s class IB link

This output confirms that OpenSM is working properly, that the switch is visible, and all three nodes are connected to the fabric.

Run “ibnodes”, which provides similar output to ibnetdiscover, albeit a bit less verbose.

ibnodes
Ca : 0x248a070300ac5610 ports 1 "prometheus mlx5_0"
Ca : 0x248a070300ac5414 ports 1 "columbia mlx5_0"
Ca : 0x248a070300ac5f6c ports 1 "viper mlx5_0"
Switch : 0x0002c902004cf118 ports 8 "Infiniscale-IV Mellanox Technologies" base port 0 lid 3 lmc 0
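A quick scripted sanity check on top of this output: count the CAs and switches and make sure the totals match what we expect. The sample output is hard-coded below (ibnodes_sample is our own helper name); on a live host, pipe the real ibnodes instead.

```shell
# Count CAs (hosts) and switches from the ibnodes capture above.
ibnodes_sample() {
cat <<'EOF'
Ca : 0x248a070300ac5610 ports 1 "prometheus mlx5_0"
Ca : 0x248a070300ac5414 ports 1 "columbia mlx5_0"
Ca : 0x248a070300ac5f6c ports 1 "viper mlx5_0"
Switch : 0x0002c902004cf118 ports 8 "Infiniscale-IV Mellanox Technologies" base port 0 lid 3 lmc 0
EOF
}

cas=$(ibnodes_sample | grep -c '^Ca ')
switches=$(ibnodes_sample | grep -c '^Switch ')
echo "fabric: ${cas} host adapters, ${switches} switch(es)"
```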

Run “ibswitches” to see switches only.

 ibswitches
Switch	: 0x0002c902004cf118 ports 8 "Infiniscale-IV Mellanox Technologies" base port 0 lid 3 lmc 0

Run “iblinkinfo” to see link-by-link InfiniBand topology info.

 iblinkinfo
CA: viper mlx5_0:
      0x248a070300ac5f6c      4    1[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>       3    1[  ] "Infiniscale-IV Mellanox Technologies" ( )
CA: columbia mlx5_0:
      0x248a070300ac5414      1    1[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>       3    2[  ] "Infiniscale-IV Mellanox Technologies" ( )
Switch: 0x0002c902004cf118 Infiniscale-IV Mellanox Technologies:
           3    1[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>       4    1[  ] "viper mlx5_0" ( )
           3    2[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>       1    1[  ] "columbia mlx5_0" ( )
           3    3[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>       2    1[  ] "prometheus mlx5_0" ( )
           3    4[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           3    5[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           3    6[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           3    7[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           3    8[  ] ==(                Down/ Polling)==>             [  ] "" ( )
CA: prometheus mlx5_0:
      0x248a070300ac5610      2    1[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>       3    3[  ] "Infiniscale-IV Mellanox Technologies" ( )

In the output above, we see …

  • 4 lanes at 10 Gb/s per lane
  • InfiniBand speed class is QDR
  • aggregate raw signaling rate is about 40 Gb/s
  • Down/Polling – unused/not-connected switch ports

And finally, “sminfo” will show you info on the subnet manager.

Specifically, in the output below we see that the subnet manager is reachable at LID 1, with a GUID of 0x248a070300ac5414 (which belongs to columbia mlx5_0). We also see “activity count 446”, which shows that the subnet manager has processed fabric-management activity 446 times since startup (basically a liveness/activity counter).

Additionally, the output below shows us the priority of the subnet manager instance (0 in this case), while “state 3 SMINFO_MASTER” shows us that this instance of the subnet manager is in the master state and is the active controller in our IB fabric (assigning LIDs, managing paths/routing).

sminfo
sminfo: sm lid 1 sm guid 0x248a070300ac5414, activity count 446 priority 0 state 3 SMINFO_MASTER

Configuring IP over InfiniBand

IP over InfiniBand, or IPoIB, allows an InfiniBand fabric to carry normal IP traffic between hosts. That means systems connected by InfiniBand can use familiar network tools and services such as ping, ssh, scp, NFS, and other TCP/IP-based applications over the IB link instead of only using native RDMA-aware software.

IPoIB is not required for RDMA itself, and it is also not inherently required for technologies like GPUDirect RDMA. RDMA and GPUDirect RDMA operate through the RDMA/verbs stack and the InfiniBand fabric, not through the IP emulation layer that IPoIB provides. NVIDIA’s current networking/operator docs describe RDMA and GPUDirect RDMA enablement separately from IPoIB, and they also document IPoIB as an optional deployment pattern rather than a prerequisite.

We use IPoIB when we want the simplicity of standard IP networking on top of the higher-speed, low-latency InfiniBand fabric. In a small lab or cluster, this is useful for private host-to-host traffic, storage traffic, migration traffic, testing, or other east-west communication, while leaving the normal Ethernet interfaces in place for management access, internet access, and general connectivity.

IPoIB Addresses for our lab

Our lab uses 10.1.x.x for its existing IP scheme, so to avoid any confusion, we will use 172.16.x.x addresses for our small private subnet on the IB network. Note that we do not need a gateway.

HOST            IPoIB ADDRESS      INTERFACE
prometheus.lab  172.16.50.11/24    ibp129s0
columbia.lab    172.16.50.12/24    ibp129s0
viper.lab       172.16.50.13/24    ibp68s0

As part of our initial temporary test, we will apply the IPoIB addresses to the indicated interfaces on each host, as outlined above. The example temporary config below is for one host; we will run the (modified) commands on each host in our cluster.

sudo ip link set ibp129s0 up
sudo ip addr add 172.16.50.11/24 dev ibp129s0
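Rather than retyping the commands on each host, the per-host variants can be generated from the address table above. A dry-run sketch: it only echoes the commands rather than executing them, and gen_ipoib_cmds is our own helper name.

```shell
# Generate the temporary IPoIB commands for every host in one pass.
# Dry run: this echoes the commands instead of executing them.
# Table format: host interface address
gen_ipoib_cmds() {
  while read -r host ifname addr; do
    echo "# on ${host}:"
    echo "sudo ip link set ${ifname} up"
    echo "sudo ip addr add ${addr} dev ${ifname}"
  done <<'EOF'
prometheus.lab ibp129s0 172.16.50.11/24
columbia.lab ibp129s0 172.16.50.12/24
viper.lab ibp68s0 172.16.50.13/24
EOF
}

gen_ipoib_cmds
```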

As you go host to host, verify that the address was assigned correctly. Note the MTU of 2044 in the output below, which is the default for IPoIB datagram mode.

9: ibp129s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 1000
link/infiniband 00:00:03:f2:fe:80:00:00:00:00:00:00:24:8a:07:03:00:ac:56:10 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
inet 172.16.50.11/24 scope global ibp129s0
valid_lft forever preferred_lft forever

Also verify your routing table.

 netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
0.0.0.0         10.1.10.1       0.0.0.0         UG        0 0          0 bridge0
10.1.10.0       0.0.0.0         255.255.255.0   U         0 0          0 bridge0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 idrac
172.16.50.0     0.0.0.0         255.255.255.0   U         0 0          0 ibp129s0

Now perform ping tests from each host and ensure that it can reach the remaining hosts in your cluster. For example, from prometheus.lab:

ping -I ibp129s0 -c 2 172.16.50.12
ping -I ibp129s0 -c 2 172.16.50.13

Once you have tested all three hosts, we can move forward with configuring persistent network configs.


Persistent RHEL 10 / NetworkManager setup

prometheus.lab

sudo nmcli connection add type infiniband ifname ibp129s0 con-name ib-ibp129s0
sudo nmcli connection modify ib-ibp129s0 ipv4.method manual ipv4.addresses 172.16.50.11/24 ipv6.method disabled
sudo nmcli connection up ib-ibp129s0

columbia.lab

sudo nmcli connection add type infiniband ifname ibp129s0 con-name ib-ibp129s0
sudo nmcli connection modify ib-ibp129s0 ipv4.method manual ipv4.addresses 172.16.50.12/24 ipv6.method disabled
sudo nmcli connection up ib-ibp129s0

viper.lab

sudo nmcli connection add type infiniband ifname ibp68s0 con-name ib-ibp68s0
sudo nmcli connection modify ib-ibp68s0 ipv4.method manual ipv4.addresses 172.16.50.13/24 ipv6.method disabled
sudo nmcli connection up ib-ibp68s0

Confirm Routing

Use “ip route” to ensure that we have the proper route in place for our IPoIB network

ip route
default via 10.1.10.1 dev bridge0 proto static metric 425
10.1.10.0/24 dev bridge0 proto kernel scope link src 10.1.10.25 metric 425
169.254.0.0/16 dev idrac proto kernel scope link src 169.254.0.2 metric 100
172.16.50.0/24 dev ibp68s0 proto kernel scope link src 172.16.50.13 metric 150

Also confirm that NetworkManager sees our IB devices correctly (as InfiniBand).

nmcli device status
DEVICE TYPE STATE CONNECTION
bridge0 bridge connected bridge0
idrac ethernet connected idrac
ibp68s0 infiniband connected ib-ibp68s0
bond0 bond connected bond0
enp65s0f0 ethernet connected bond0-port0
enp65s0f1 ethernet connected bond0-port1


IP, IPoIB, and RDMA Usage Matrix

We can use this simple decision matrix to ensure that we understand when to use traditional IP for host-to-host communication, when to use IPoIB, and when to use native RDMA/IB.

Use case                                                      Ethernet      IPoIB                                    Native RDMA / IB
Host management, SSH, web UI, package installs                Best choice   Possible, but usually unnecessary        No
Internet access / default route                               Best choice   No                                       No
General admin traffic between hosts                           Best choice   Good for isolated lab traffic            No
Simple host-to-host testing with ping, ssh, scp, rsync        No            Best choice                              No
NFS/SMB using normal IP networking over the IB fabric         No            Best choice                              No
Fast private storage or migration traffic (standard TCP/IP)   No            Best choice                              No
RDMA-aware apps using verbs/libibverbs                        No            No                                       Best choice
MPI or cluster workloads built for native IB/RDMA             No            Sometimes, if app specifically uses IP   Best choice
GPUDirect RDMA / high-performance GPU-to-network workflows    No            No                                       Best choice
Lowest latency / highest efficiency IB data path              No            No                                       Best choice
Easiest troubleshooting, least risk of routing mistakes       Best choice   Good if kept isolated                    More specialized

Next Steps.

So, in Part 2 of our project, we focused on getting InfiniBand up and running, as well as IPoIB. We validated connectivity, set up a subnet manager, and made sure that our fabric was initialized. We learned a number of IB-related commands and how to read their output. Good stuff.

In our next post we will start working with the various NVIDIA tools and projects, many of which will rely on our IB network. Additionally, we may try to update the firmware on our CX-4s and our IB switch; however, I may skip this step or circle back to it later.
