
Part 1 of this Project Log can be found here.
Now that the 3x Mellanox MCX455A-ECAT ConnectX-4 adapters have arrived, it's time to install them into their respective servers (columbia.lab, prometheus.lab, and viper.lab).
Verify Mellanox CX4s are Detected
Once installed, log into the iDRAC of each host and verify that the CX-4 appears in the system inventory. Sample output below from one of the hosts.
Note that you may need to reboot the system for the CX-4 to appear in the iDRAC inventory, as “Collect System Inventory on Restart” (CSIOR) runs during startup.
InfiniBand.Slot.1-1 - PCI Device
BusNumber               129
DataBusWidth            16x or x16
Description             ConnectX-4 VPI IB EDR/100 GbE Single Port QSFP28 Adapter
Device Type             PCIDevice
DeviceDescription       InfiniBand.Slot.1-1
DeviceNumber            0
FQDD                    InfiniBand.Slot.1-1
FunctionNumber          0
InstanceID              InfiniBand.Slot.1-1
LastSystemInventoryTime 2026-03-14T22:33:28
LastUpdateTime          2026-03-15T03:33:07
Manufacturer            Mellanox Technologies
PCIDeviceID             1013
PCISubDeviceID          0033
PCISubVendorID          15B3
PCIVendorID             15B3
SlotLength              Long Length
SlotType                PCI Express Gen 3
Once each machine has booted into the running OS, you can confirm that RHEL properly detects the CX-4 with lspci.
lspci | grep -i mel
44:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
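To save a little typing when checking multiple hosts, the lspci check can be wrapped in a tiny helper. This is just a convenience sketch of our own; `find_cx4` is a made-up name, and it simply greps lspci-style output for a ConnectX adapter.

```shell
# Hypothetical helper: scan lspci-style output for a Mellanox ConnectX adapter.
# Reads stdin, prints matching lines, returns non-zero if nothing is found.
find_cx4() {
  grep -i 'Mellanox.*ConnectX' || { echo "no ConnectX adapter detected" >&2; return 1; }
}

# On a live host:
#   lspci | find_cx4
```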
Verify NUMA Topology via nvidia-smi
Ideally, for best performance, your GPUs and InfiniBand adapters will be NUMA-local to each other. If you were deploying a similar setup in a production environment, NUMA alignment would be critical.
Our lab setup is less than ideal due to the limited number of PCIe slots. In many Dell servers, PCIe risers for GPUs have only one PCIe slot. Stick two of these risers in a single server, and you end up with only 3 free slots on riser 1 (half length), which is where we had to install our CX-4s.
All this being said, we “should” be fine for functional testing. Let's review each of our 3 nodes below. Since the NVIDIA drivers are already installed on all three systems, we can run nvidia-smi, confirm that the CX-4 appears in the output, and review the topology.
nvidia-smi output on host viper.lab
This server, viper.lab, is a Dell R720 running RHEL 9, and has 2x NVIDIA Tesla P4 GPUs installed along with the CX-4.
nvidia-smi topo -m
        GPU0    GPU1    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     SYS     0,2,4,6,8,10    0               N/A
GPU1    PHB      X      SYS     0,2,4,6,8,10    0               N/A
NIC0    SYS     SYS      X
In the output above, both GPUs are local to NUMA node 0 and connected to each other through a PCIe host bridge (PHB), while the ConnectX NIC (mlx5_0) is topologically remote from both GPUs (SYS), making the setup workable but not ideal for GPUDirect RDMA performance. This should be OK for our lab, as we are performing functional tests and performance is secondary. Time will tell.
nvidia-smi output on host columbia.lab
Columbia.lab is a Dell R730 with 1x NVIDIA Tesla T4 installed along with one CX-4 (it would be possible to move the CX-4 to another slot if we had the full-length bracket for it, or half-length brackets for our NICs on riser 2).
nvidia-smi topo -m
        GPU0    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     0,2,4,6,8,10    0               N/A
NIC0    SYS      X
In the output above, we can see that this system has one GPU (T4) on NUMA node 0 and one CX-4, but the GPU-to-NIC path is SYS, indicating a topologically distant connection that is usable but sub-optimal for GPUDirect RDMA performance. Again, this may be fine for functional testing.
nvidia-smi output on host prometheus.lab
This Dell R730 has 2x NVIDIA Tesla T4s installed, as well as the recently installed CX-4.
nvidia-smi topo -m
        GPU0    GPU1    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     SYS     0,2,4,6,8,10    0               N/A
GPU1    PHB      X      SYS     0,2,4,6,8,10    0               N/A
NIC0    SYS     SYS      X
This output shows two GPUs on the same NUMA node with a moderate GPU-to-GPU path (PHB) and relatively distant NIC connectivity (SYS), which is acceptable for many workloads but not ideal for high-performance GPU-to-NIC or GPUDirect-style traffic.
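If you want to double-check the NUMA placement that nvidia-smi reports, the kernel also exposes the NUMA node of each RDMA device under sysfs. The sketch below is hedged: the sysfs root is parameterized purely so the parsing logic can be exercised anywhere; on a real host you would leave it at the default /sys.

```shell
# Print the NUMA node for each RDMA device by reading sysfs
# (/sys/class/infiniband/<dev>/device/numa_node). A value of -1 or 0 on a
# single-socket system is expected; on dual-socket hosts this shows locality.
numa_report() {
  local root="${1:-/sys}" dev
  for dev in "$root"/class/infiniband/*; do
    [ -e "$dev/device/numa_node" ] || continue
    printf '%s numa_node=%s\n' "$(basename "$dev")" "$(cat "$dev/device/numa_node")"
  done
}

# Live usage:  numa_report
```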
Verify Drivers Loaded Properly
In the first step we saw the CX-4 in the iDRAC and in the output of lspci. We will now check that the driver has loaded properly. Rinse and repeat on each host.
lsmod | egrep 'mlx5_core|mlx5_ib|ib_core'
mlx5_ib               561152  0
macsec                 73728  1 mlx5_ib
mlx5_core            3153920  2 mlx5_fwctl,mlx5_ib
mlxfw                  49152  1 mlx5_core
psample                20480  1 mlx5_core
tls                   159744  2 bonding,mlx5_core
pci_hyperv_intf        12288  1 mlx5_core
ib_uverbs             217088  2 rdma_ucm,mlx5_ib
ib_core               573440  12 rdma_cm,ib_ipoib,rpcrdma,ib_srpt,iw_cm,ib_iser,ib_umad,ib_isert,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
Let's review the important/relevant output below.
mlx5_ib               561152  0
mlx5_core            3153920  2 mlx5_fwctl,mlx5_ib
ib_uverbs             217088  2 rdma_ucm,mlx5_ib
ib_core               573440  12 ...
- mlx5_core – main low-level kernel driver for Mellanox/NVIDIA ConnectX-4/5-class adapters.
- the kernel sees the adapter family and has loaded the base driver
- this is required for the card to function at all
- other Mellanox modules depend on it
- mlx5_ib – This is the InfiniBand/RDMA driver layer for mlx5 devices.
- the adapter is not just using the generic Ethernet driver path
- the system has the RDMA / InfiniBand-capable driver loaded
- the kernel is prepared to expose the card as an IB/RDMA device
- ib_core – the core InfiniBand subsystem in the kernel.
- the Linux IB stack is loaded
- multiple RDMA/IB-related modules are attached to it
- the host is set up for InfiniBand/RDMA functionality, not just plain NIC support
- ib_uverbs – This is the userspace verbs interface.
- userspace RDMA tools and libraries should be able to talk to the device
- commands like ibv_devinfo and ibstat, as well as RDMA applications, have the proper kernel interface available
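To repeat this driver check quickly on each host, the required modules can be verified with a small loop. This is a sketch of our own making; `check_modules` just parses lsmod-style output and flags anything missing.

```shell
# Verify the four modules discussed above are loaded. Pass `lsmod` output
# as the first argument; prints MISSING lines and returns non-zero on failure.
required="mlx5_core mlx5_ib ib_core ib_uverbs"
check_modules() {
  local missing=0 m
  for m in $required; do
    if ! printf '%s\n' "$1" | awk '{print $1}' | grep -qx "$m"; then
      echo "MISSING: $m"
      missing=1
    fi
  done
  return $missing
}

# Live usage:  check_modules "$(lsmod)"
```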
Show Devices and Port State
First we need to install some prerequisites.
sudo dnf install rdma-core infiniband-diags libibverbs-utils -y
ibv_devices
We can show InfiniBand devices with ibv_devices, which shows local RDMA/InfiniBand devices that the OS can see on that host. It does not enumerate remote hosts, switches, or the rest of the IB fabric.
We will run this command on each host and capture the output.
On columbia.lab
[root@columbia ~]# ibv_devices
    device                 node GUID
    ------              ----------------
    mlx5_0              248a070300ac5414
On viper.lab
root@viper:~# ibv_devices
    device                 node GUID
    ------              ----------------
    mlx5_0              248a070300ac5f6c
On prometheus.lab
[root@prometheus ~]$ ibv_devices
    device                 node GUID
    ------              ----------------
    mlx5_0              248a070300ac5610
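Rather than logging into each host, the node GUIDs can also be collected centrally. The helper below is an assumption of ours, not part of infiniband-diags: it just strips the header lines from ibv_devices output and prints the GUID column.

```shell
# Extract node GUIDs from `ibv_devices` output (skips the two header rows).
node_guid() { awk 'NR>2 && NF>=2 {print $2}'; }

# With SSH keys in place you could gather all three at once:
#   for h in columbia.lab prometheus.lab viper.lab; do
#     echo "$h $(ssh root@$h ibv_devices | node_guid)"
#   done
```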
ibstat
Now that we have confirmed all devices are present and accounted for, let's check for links. In the output below you can see that we have link (“Physical state: LinkUp”), but since we have not configured a subnet manager on any of our nodes, the logical fabric is “State: Initializing”.
ibstat
CA 'mlx5_0'
        CA type: MT4115
        Number of ports: 1
        Firmware version: 12.28.4512
        Hardware version: 0
        Node GUID: 0x248a070300ac5610
        System image GUID: 0x248a070300ac5610
        Port 1:
                State: Initializing
                Physical state: LinkUp
                Rate: 40
                Base lid: 65535
                LMC: 0
                SM lid: 0
                Capability mask: 0x2659e848
                Port GUID: 0x248a070300ac5610
                Link layer: InfiniBand
Run ibstat on each of your remaining nodes. Ensure that you see “Physical state: LinkUp”. You may also want to make note of “Firmware version: 12.28.4512”. We have the same firmware on all three CX-4s.
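Since we want identical firmware on all three CX-4s, it's handy to pull just the firmware version out of ibstat so the values can be diffed across hosts. `fw_version` below is our own small helper, not a stock command.

```shell
# Print only the firmware version from `ibstat` output.
fw_version() { awk '/Firmware version:/ {print $3}'; }

# Live usage:  ibstat | fw_version
```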
ibv_devinfo
We can also run ibv_devinfo, which gives a detailed view of the local RDMA / InfiniBand device and its ports. It is more detailed than ibv_devices and overlaps somewhat with ibstat, but presents the information from the verbs / RDMA stack perspective.
Example output below:
ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 12.28.4512
node_guid: 248a:0703:00ac:5414
sys_image_guid: 248a:0703:00ac:5414
vendor_id: 0x02c9
vendor_part_id: 4115
hw_ver: 0x0
board_id: DEL2180110032
phys_port_cnt: 1
port: 1
state: PORT_INIT (2)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 65535
port_lmc: 0x00
link_layer: InfiniBand
This output shows us the following…
- Local RDMA devices
  - Example: mlx5_0, mlx5_1
- Port state
  - Example: PORT_ACTIVE, PORT_DOWN, PORT_INIT
- Physical link state
  - Example: LINK_UP, POLLING, DISABLED
- Negotiated link details
  - Speed and link width
- Fabric info
  - Local LID and SM LID
- Device identifiers
  - Node GUID, port GUID, system image GUID
- Transport / firmware details
  - Transport type and device-specific details
- RDMA capabilities
  - Limits such as QPs, CQs, MR size, atomic support, GID table size
rdma link
The current output of “rdma link” shows us that our InfiniBand ports are connected, but as we know, the fabric is not initialized.
rdma link
link mlx5_0/1 subnet_prefix fe80:0000:0000:0000 lid 65535 sm_lid 0 lmc 0 state INIT physical_state LINK_UP
Specifically, the output shows us the following.
- mlx5_0/1 – device mlx5_0, port 1
- subnet_prefix fe80:0000:0000:0000 – the normal default InfiniBand subnet prefix
- lid 65535 – the port does not have a valid assigned LID yet
- sm_lid 0 – no subnet manager is detected
- lmc 0 – LID mask control is 0; not important here
- state INIT – the port is not fully active yet
- physical_state LINK_UP – the physical link is up and the cable/port side is working
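Once a subnet manager comes online (next section), the port should transition from INIT to ACTIVE on its own. A small poll loop can watch for that. This is a sketch under our own naming; `parse_state` pulls the logical state token out of a `rdma link` line, and `wait_active` retries until it reads ACTIVE or gives up.

```shell
# Extract the logical port state (INIT/ACTIVE/...) from `rdma link` output.
parse_state() { awk '{for (i = 1; i < NF; i++) if ($i == "state") print $(i+1)}'; }

# Poll until the port goes ACTIVE, checking every 2 seconds (default ~60s).
wait_active() {
  local tries="${1:-30}"
  while [ "$tries" -gt 0 ]; do
    [ "$(rdma link | parse_state)" = "ACTIVE" ] && return 0
    sleep 2
    tries=$((tries - 1))
  done
  return 1
}
```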
Set Up a Subnet Manager on One Host
For our lab, we are going to set up a subnet manager on only one host. Pick your always-on host. Multiple instances of the subnet manager can be used, but again, that is not needed for our current objective.
On the selected host run the following to install required packages.
dnf install -y rdma-core opensm infiniband-diags
Next, start and enable the service
sudo systemctl enable --now opensm
Then check to ensure that the service started without error.
sudo systemctl status opensm --no-pager
journalctl -u opensm -b --no-pager
Now we can re-check the fabric on each host, and we see that the fabric status is “State: Active”.
ibstat
CA 'mlx5_0'
CA type: MT4115
Number of ports: 1
Firmware version: 12.28.4512
Hardware version: 0
Node GUID: 0x248a070300ac5f6c
System image GUID: 0x248a070300ac5f6c
Port 1:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 4
LMC: 0
SM lid: 1
Capability mask: 0x2659e848
Port GUID: 0x248a070300ac5f6c
Link layer: InfiniBand
rdma link shows similar output.
rdma link
link mlx5_0/1 subnet_prefix fe80:0000:0000:0000 lid 4 sm_lid 1 lmc 0 state ACTIVE physical_state LINK_UP
Confirm InfiniBand Fabric Topology
You can run the following commands to confirm that your fabric is up and running.
Run “ibnetdiscover” to see host adapters, links, GUIDs, port relationships, and switches. Example output below.
ibnetdiscover
#
# Topology file: generated on Sat Mar 14 21:15:43 2026
#
# Initiated from node 248a070300ac5f6c port 248a070300ac5f6c

vendid=0x2c9
devid=0xbd36
sysimgguid=0x2c902004cf11b
switchguid=0x2c902004cf118(2c902004cf118)
Switch  8 "S-0002c902004cf118"  # "Infiniscale-IV Mellanox Technologies" base port 0 lid 3 lmc 0
[1]  "H-248a070300ac5f6c"[1](248a070300ac5f6c)  # "viper mlx5_0" lid 4 4xQDR
[2]  "H-248a070300ac5414"[1](248a070300ac5414)  # "columbia mlx5_0" lid 1 4xQDR
[3]  "H-248a070300ac5610"[1](248a070300ac5610)  # "prometheus mlx5_0" lid 2 4xQDR

vendid=0x2c9
devid=0x1013
sysimgguid=0x248a070300ac5414
caguid=0x248a070300ac5414
Ca  1 "H-248a070300ac5414"  # "columbia mlx5_0"
[1](248a070300ac5414)  "S-0002c902004cf118"[2]  # lid 1 lmc 0 "Infiniscale-IV Mellanox Technologies" lid 3 4xQDR

vendid=0x2c9
devid=0x1013
sysimgguid=0x248a070300ac5610
caguid=0x248a070300ac5610
Ca  1 "H-248a070300ac5610"  # "prometheus mlx5_0"
[1](248a070300ac5610)  "S-0002c902004cf118"[3]  # lid 2 lmc 0 "Infiniscale-IV Mellanox Technologies" lid 3 4xQDR

vendid=0x2c9
devid=0x1013
sysimgguid=0x248a070300ac5f6c
caguid=0x248a070300ac5f6c
Ca  1 "H-248a070300ac5f6c"  # "viper mlx5_0"
[1](248a070300ac5f6c)  "S-0002c902004cf118"[1]  # lid 4 lmc 0 "Infiniscale-IV Mellanox Technologies" lid 3 4xQDR
In the output above we can see the following:
- One Mellanox InfiniScale-IV switch is present in the fabric
  - Switch GUID: 0x2c902004cf118
  - Switch LID: 3
  - Model family shown as Infiniscale-IV Mellanox Technologies
  - It is an 8-port switch
- Three hosts are connected to the switch
  - columbia on switch port 2, LID 1
  - prometheus on switch port 3, LID 2
  - viper on switch port 1, LID 4
- All three hosts are seen as CA / HCA nodes
  - columbia mlx5_0, prometheus mlx5_0, viper mlx5_0
- All discovered links are running at 4xQDR
  - That means a 4-lane QDR InfiniBand link, which aligns with a 40 Gb/s class IB link
This output confirms that OpenSM is working properly, that the switch is visible, and all three nodes are connected to the fabric.
Run “ibnodes”, which provides output similar to ibnetdiscover, albeit a bit less verbose.
ibnodes
Ca     : 0x248a070300ac5610 ports 1 "prometheus mlx5_0"
Ca     : 0x248a070300ac5414 ports 1 "columbia mlx5_0"
Ca     : 0x248a070300ac5f6c ports 1 "viper mlx5_0"
Switch : 0x0002c902004cf118 ports 8 "Infiniscale-IV Mellanox Technologies" base port 0 lid 3 lmc 0
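As a quick sanity check, the ibnodes output can be tallied and compared against what we expect (three CAs and one switch). `count_nodes` is a throwaway helper of our own.

```shell
# Count CA and Switch lines in `ibnodes` output.
count_nodes() {
  awk '/^Ca/ {ca++} /^Switch/ {sw++} END {printf "cas=%d switches=%d\n", ca, sw}'
}

# Live usage:  ibnodes | count_nodes
```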
Run “ibswitches” to see switches only.
ibswitches
Switch : 0x0002c902004cf118 ports 8 "Infiniscale-IV Mellanox Technologies" base port 0 lid 3 lmc 0
iblinkinfo will show you InfiniBand topology info
iblinkinfo
CA: viper mlx5_0:
0x248a070300ac5f6c 4 1[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 3 1[ ] "Infiniscale-IV Mellanox Technologies" ( )
CA: columbia mlx5_0:
0x248a070300ac5414 1 1[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 3 2[ ] "Infiniscale-IV Mellanox Technologies" ( )
Switch: 0x0002c902004cf118 Infiniscale-IV Mellanox Technologies:
3 1[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 4 1[ ] "viper mlx5_0" ( )
3 2[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 1 1[ ] "columbia mlx5_0" ( )
3 3[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 2 1[ ] "prometheus mlx5_0" ( )
3 4[ ] ==( Down/ Polling)==> [ ] "" ( )
3 5[ ] ==( Down/ Polling)==> [ ] "" ( )
3 6[ ] ==( Down/ Polling)==> [ ] "" ( )
3 7[ ] ==( Down/ Polling)==> [ ] "" ( )
3 8[ ] ==( Down/ Polling)==> [ ] "" ( )
CA: prometheus mlx5_0:
0x248a070300ac5610 2 1[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 3 3[ ] "Infiniscale-IV Mellanox Technologies" ( )
In the output above, we see…
- 4X 10.0 Gbps – four lanes at 10 Gb/s each
  - the InfiniBand speed class is QDR
  - the aggregate raw signaling rate is about 40 Gb/s
- Down/ Polling – unused/not-connected switch ports
And finally, “sminfo” will show you info on the subnet manager.
Specifically (below) we see that the subnet manager is reachable on LID 1, with a GUID of 0x248a070300ac5414 (which belongs to columbia mlx5_0). We also see “activity count 446”, which shows that the subnet manager has processed fabric-management activity 446 times since startup (basically a liveness/activity counter).
Additionally, the output below shows us the priority of the subnet manager instance (0 in this case), while “state 3 SMINFO_MASTER” shows us that this instance of the subnet manager is in the master state and is the active controller in our IB fabric (assigning LIDs and managing paths/routing).
sminfo
sminfo: sm lid 1 sm guid 0x248a070300ac5414, activity count 446 priority 0 state 3 SMINFO_MASTER
Configuring IP over InfiniBand
IP over InfiniBand, or IPoIB, allows an InfiniBand fabric to carry normal IP traffic between hosts. That means systems connected by InfiniBand can use familiar network tools and services such as ping, ssh, scp, NFS, and other TCP/IP-based applications over the IB link instead of only using native RDMA-aware software.
IPoIB is not required for RDMA itself, and it is also not inherently required for technologies like GPUDirect RDMA. RDMA and GPUDirect RDMA operate through the RDMA/verbs stack and the InfiniBand fabric, not through the IP emulation layer that IPoIB provides. NVIDIA’s current networking/operator docs describe RDMA and GPUDirect RDMA enablement separately from IPoIB, and they also document IPoIB as an optional deployment pattern rather than a prerequisite.
We use IPoIB when we want the simplicity of standard IP networking on top of the higher-speed, low-latency InfiniBand fabric. In a small lab or cluster, this is useful for private host-to-host traffic, storage traffic, migration traffic, testing, or other east-west communication, while leaving the normal Ethernet interfaces in place for management access, internet access, and general connectivity.
IPoIB Addresses for our lab
Our lab uses 10.1.x.x for its existing IP scheme, so to avoid any confusion, we will use 172.16.x.x addresses for our small private subnet on the IB network. Note that we do not need a gateway.
| HOST | IPoIB Address | INTERFACE |
|---|---|---|
| prometheus.lab | 172.16.50.11/24 | ibp129s0 |
| columbia.lab | 172.16.50.12/24 | ibp129s0 |
| viper.lab | 172.16.50.13/24 | ibp68s0 |
As part of our initial temporary test, we will apply the IPoIB addresses to the indicated interfaces on each host, as outlined above. The example temporary config below is for one host; we will run the commands (modified appropriately) on each host in our cluster.
sudo ip link set ibp129s0 up
sudo ip addr add 172.16.50.11/24 dev ibp129s0
As you go host to host, verify that the address was assigned correctly.
9: ibp129s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 1000
    link/infiniband 00:00:03:f2:fe:80:00:00:00:00:00:00:24:8a:07:03:00:ac:56:10 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 172.16.50.11/24 scope global ibp129s0
       valid_lft forever preferred_lft forever
Also verify your routing table.
netstat -rn
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
0.0.0.0 10.1.10.1 0.0.0.0 UG 0 0 0 bridge0
10.1.10.0 0.0.0.0 255.255.255.0 U 0 0 0 bridge0
169.254.0.0 0.0.0.0 255.255.0.0 U 0 0 0 idrac
172.16.50.0 0.0.0.0 255.255.255.0 U 0 0 0 ibp129s0
Now perform ping tests from each host and ensure that it can reach the remaining hosts in your cluster. For example:
ping -I ibp129s0 -c 2 172.16.50.12
ping -I ibp129s0 -c 2 172.16.50.13
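The ping tests can be rolled into a small mesh loop run on each host. This is a sketch with our lab's addresses baked in as defaults (including the host's own address, which is a cheap check that the interface is up); the PING variable is overridable only so the loop logic can be exercised without a live fabric.

```shell
# Ping every IPoIB peer over the IB interface and report OK/FAIL per address.
IB_IF="${IB_IF:-ibp129s0}"
PEERS="${PEERS:-172.16.50.11 172.16.50.12 172.16.50.13}"
PING="${PING:-ping -I $IB_IF -c 2 -W 2}"

ping_mesh() {
  local rc=0 ip
  for ip in $PEERS; do
    if $PING "$ip" >/dev/null 2>&1; then
      echo "OK:   $ip"
    else
      echo "FAIL: $ip"
      rc=1
    fi
  done
  return $rc
}
```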
Once you have tested all three hosts, we can move forward with configuring persistent network configs.
Persistent RHEL 10 / NetworkManager setup
prometheus.lab
sudo nmcli connection add type infiniband ifname ibp129s0 con-name ib-ibp129s0
sudo nmcli connection modify ib-ibp129s0 ipv4.method manual ipv4.addresses 172.16.50.11/24 ipv6.method disabled
sudo nmcli connection up ib-ibp129s0
columbia.lab
sudo nmcli connection add type infiniband ifname ibp129s0 con-name ib-ibp129s0
sudo nmcli connection modify ib-ibp129s0 ipv4.method manual ipv4.addresses 172.16.50.12/24 ipv6.method disabled
sudo nmcli connection up ib-ibp129s0
viper.lab
sudo nmcli connection add type infiniband ifname ibp68s0 con-name ib-ibp68s0
sudo nmcli connection modify ib-ibp68s0 ipv4.method manual ipv4.addresses 172.16.50.13/24 ipv6.method disabled
sudo nmcli connection up ib-ibp68s0
Confirm Routing
Use “ip route” to ensure that we have the proper route in place for our IPoIB network.
ip route
default via 10.1.10.1 dev bridge0 proto static metric 425
10.1.10.0/24 dev bridge0 proto kernel scope link src 10.1.10.25 metric 425
169.254.0.0/16 dev idrac proto kernel scope link src 169.254.0.2 metric 100
172.16.50.0/24 dev ibp68s0 proto kernel scope link src 172.16.50.13 metric 150
Also confirm that NetworkManager sees our IB devices correctly (as infiniband).
nmcli device status
DEVICE     TYPE        STATE      CONNECTION
bridge0    bridge      connected  bridge0
idrac      ethernet    connected  idrac
ibp68s0    infiniband  connected  ib-ibp68s0
bond0      bond        connected  bond0
enp65s0f0  ethernet    connected  bond0-port0
enp65s0f1  ethernet    connected  bond0-port1
IP, IPoIB, and RDMA Usage Matrix
We can use this simple decision matrix to ensure that we understand when to use traditional IP (Ethernet) for host-to-host communication, when to use IPoIB, and when to use native RDMA/IB.
| Use case | Ethernet | IPoIB | Native RDMA / IB |
|---|---|---|---|
| Host management, SSH, web UI, package installs | Best choice | Possible, but usually unnecessary | No |
| Internet access / default route | Best choice | No | No |
| General admin traffic between hosts | Best choice | Good for isolated lab traffic | No |
| Simple host-to-host testing with ping, ssh, scp, rsync over IB fabric | No | Best choice | No |
| NFS/SMB using normal IP networking over the IB fabric | No | Best choice | No |
| Fast private storage or migration traffic using standard TCP/IP apps | No | Best choice | No |
| RDMA-aware apps using verbs/libibverbs | No | No | Best choice |
| MPI or cluster workloads built for native IB/RDMA | No | Sometimes, if app specifically uses IP | Best choice |
| GPUDirect RDMA / high-performance GPU-to-network workflows | No | No | Best choice |
| Lowest latency / highest efficiency IB data path | No | No | Best choice |
| Easiest troubleshooting and least risk of routing mistakes | Best choice | Good if kept isolated | More specialized |
Next Steps
So in Part 2 of our project, we focused on getting InfiniBand up and running, as well as IPoIB. We validated connectivity, set up the subnet manager, and made sure that our fabric was initialized. We learned a number of IB-related commands and how to read their output. Good stuff.
In our next post we will start working with the various NVIDIA tools and projects, many of which will rely on our IB network. Additionally, we may try to update the firmware on our CX-4s and our IB switch; however, I may skip this step or circle back to it later.