Project “NVIDIA HPC InfiniBand Homelab GPU Cluster”: Part 2: InfiniBand Setup


Part 1 of this project log can be found here

Now that the 3x Mellanox MCX455A-ECAT ConnectX-4 adapters have arrived, it's time to install them into their respective servers (columbia.lab, prometheus.lab, and viper.lab).


Verify Mellanox CX4s are Detected

Once installed, log into the iDRAC of each host and verify that the CX-4 appears in the system inventory. Sample output below from one of the hosts.

Note that you may need to reboot the system for the CX-4 to appear in the iDRAC inventory (as “Collect System Inventory on Restart” (CSIOR) runs during startup).

InfiniBand.Slot.1-1 - PCI Device
BusNumber 129
DataBusWidth 16x or x16
Description ConnectX-4 VPI IB EDR/100 GbE Single Port QSFP28 Adapter
Device Type PCIDevice
DeviceDescription InfiniBand.Slot.1-1
DeviceNumber 0
FQDD InfiniBand.Slot.1-1
FunctionNumber 0
InstanceID InfiniBand.Slot.1-1
LastSystemInventoryTime 2026-03-14T22:33:28
LastUpdateTime 2026-03-15T03:33:07
Manufacturer Mellanox Technologies
PCIDeviceID 1013
PCISubDeviceID 0033
PCISubVendorID 15B3
PCIVendorID 15B3
SlotLength Long Length
SlotType PCI Express Gen 3

Once each machine has booted to the running OS, you can confirm that RHEL properly detects the CX-4 with lspci:

lspci | grep -i mel
44:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]

Verify NUMA Topology via nvidia-smi

For best performance, your GPUs and InfiniBand adapters should be NUMA-local to each other. If you were deploying a similar setup in a production environment, NUMA alignment would be critical.

Our lab setup is less than ideal due to the limited number of PCIe slots. In many Dell servers, PCIe risers for GPUs have only one PCIe slot. Stick two of these risers in a single server, and you end up with only 3 slots free on riser 1 (half length), which is where we had to install our CX-4s.

All this being said, we “should” be fine for functional testing. Let's review each of our 3 nodes below. Since the NVIDIA drivers are already installed on all three systems, we can run nvidia-smi, confirm that the CX-4 is in the output, and review the topology.


Nvidia-smi output on host viper.lab

This server, viper.lab, is a Dell R720 running RHEL 9, and has 2x NVIDIA Tesla P4 GPUs installed along with the CX-4.

nvidia-smi topo -m
GPU0 GPU1 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X PHB SYS 0,2,4,6,8,10 0 N/A
GPU1 PHB X SYS 0,2,4,6,8,10 0 N/A
NIC0 SYS SYS X

In the output above, both GPUs are local to NUMA node 0 and connected to each other through a PCIe host bridge (PHB), while the ConnectX NIC (mlx5_0) is topologically remote from both GPUs (SYS), making the setup workable but not ideal for GPUDirect RDMA performance. This should be OK for our lab, as we are performing functional tests and performance is secondary. Time will tell.

Nvidia-smi output on host columbia.lab

Columbia.lab is a Dell R730 with 1x NVIDIA Tesla T4 installed along with one CX-4 (it would be possible to move the CX-4 to another slot if we had a full-length bracket for it, or half-length brackets for the NICs on riser 2).

nvidia-smi topo -m
GPU0 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X SYS 0,2,4,6,8,10 0 N/A
NIC0 SYS X

In the output above, we can see that this system has one GPU (T4) on NUMA node 0 and one CX-4, but the GPU-to-NIC path is SYS, indicating a topologically distant connection that is usable but sub-optimal for GPUDirect RDMA performance. Again, this may be fine for functional testing.

Nvidia-smi output on host prometheus.lab

This Dell R730 has 2x NVIDIA Tesla T4s installed, as well as the recently installed CX-4.

nvidia-smi topo -m
GPU0 GPU1 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X PHB SYS 0,2,4,6,8,10 0 N/A
GPU1 PHB X SYS 0,2,4,6,8,10 0 N/A
NIC0 SYS SYS X

This output shows two GPUs on the same NUMA node with a moderate GPU-to-GPU path (PHB) and relatively distant NIC connectivity (SYS), which is acceptable for many workloads but not ideal for high-performance GPU-to-NIC or GPUDirect-style traffic.
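Since all three hosts show the same pattern, it can be handy to scan the matrix programmatically. A minimal sketch below: the sample matrix is hard-coded and simplified to the legend columns so the example runs anywhere, and the function names are ours. On a live host you would feed it the real “nvidia-smi topo -m” output instead.

```shell
# Flag any GPU<->NIC path that crosses the inter-socket link (SYS).
# The matrix below is a simplified version of the prometheus.lab
# sample above; on a real host, pipe in: nvidia-smi topo -m
topo_matrix() {
cat <<'EOF'
GPU0 GPU1 NIC0
GPU0 X PHB SYS
GPU1 PHB X SYS
NIC0 SYS SYS X
EOF
}

check_gpu_nic_paths() {
  topo_matrix | awk '
    NR == 1 { for (i = 1; i <= NF; i++) col[i] = $i; next }
    $1 ~ /^GPU/ {
      for (i = 2; i <= NF; i++)
        if (col[i-1] ~ /^NIC/ && $i == "SYS")
          printf "%s <-> %s: SYS (crosses socket interconnect)\n", $1, col[i-1]
    }'
}

check_gpu_nic_paths
```

On our hosts this flags both GPU-to-NIC paths, matching what we read manually above.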


Verify Drivers Loaded Properly

In the first steps, we saw the CX-4 in the iDRAC and in the output of lspci. We will now check that the driver has loaded properly. Rinse and repeat on each host.

lsmod | egrep 'mlx5_core|mlx5_ib|ib_core'
mlx5_ib 561152 0
macsec 73728 1 mlx5_ib
mlx5_core 3153920 2 mlx5_fwctl,mlx5_ib
mlxfw 49152 1 mlx5_core
psample 20480 1 mlx5_core
tls 159744 2 bonding,mlx5_core
pci_hyperv_intf 12288 1 mlx5_core
ib_uverbs 217088 2 rdma_ucm,mlx5_ib
ib_core 573440 12 rdma_cm,ib_ipoib,rpcrdma,ib_srpt,iw_cm,ib_iser,ib_umad,ib_isert,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm

Let's review the important/relevant output below.

mlx5_ib 561152 0
mlx5_core 3153920 2 mlx5_fwctl,mlx5_ib
ib_uverbs 217088 2 rdma_ucm,mlx5_ib
ib_core 573440 12 ...
  • mlx5_core – main low-level kernel driver for Mellanox/NVIDIA ConnectX-4/5-class adapters.
    • the kernel sees the adapter family and has loaded the base driver
    • this is required for the card to function at all
    • other Mellanox modules depend on it
  • mlx5_ib – This is the InfiniBand/RDMA driver layer for mlx5 devices.
    • the adapter is not just using the generic Ethernet driver path
    • the system has the RDMA / InfiniBand-capable driver loaded
    • the kernel is prepared to expose the card as an IB/RDMA device
  • ib_core – the core InfiniBand subsystem in the kernel.
    • the Linux IB stack is loaded
    • multiple RDMA/IB-related modules are attached to it
    • the host is set up for InfiniBand/RDMA functionality, not just plain NIC support
  • ib_uverbs – This is the userspace verbs interface.
    • userspace RDMA tools and libraries should be able to talk to the device
    • commands like ibv_devinfo, ibstat, and RDMA applications have the proper kernel interface available
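To rinse and repeat this check quickly, the module list can be verified with a small loop. A sketch below: the lsmod output is hard-coded from the capture above so the example is self-contained, and sample_lsmod/check_modules are our own helper names. On a live host, replace sample_lsmod with the real lsmod.

```shell
# Verify the RDMA driver stack is loaded. The lsmod output below is
# hard-coded from the capture above; on a live host, substitute the
# real lsmod command for sample_lsmod.
sample_lsmod() {
cat <<'EOF'
mlx5_ib 561152 0
mlx5_core 3153920 2 mlx5_fwctl,mlx5_ib
ib_uverbs 217088 2 rdma_ucm,mlx5_ib
ib_core 573440 12 rdma_cm,ib_uverbs,mlx5_ib
EOF
}

check_modules() {
  missing=0
  for mod in mlx5_core mlx5_ib ib_core ib_uverbs; do
    if sample_lsmod | awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }'; then
      echo "OK: $mod loaded"
    else
      echo "MISSING: $mod"
      missing=1
    fi
  done
  return $missing
}

check_modules
```

The function returns non-zero if any module is missing, so it drops into scripts and cron checks easily.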

Show Devices and Port State

First we need to install some prerequisites.

sudo dnf install rdma-core infiniband-diags libibverbs-utils -y

ibv_devices

We can list InfiniBand devices with ibv_devices, which shows the local RDMA/InfiniBand devices that the OS can see on that host. It does not enumerate remote hosts, switches, or the rest of the IB fabric.

We will run this command on each host and capture the output.

On columbia.lab

[root@columbia ~]# ibv_devices
device node GUID
------ ----------------
mlx5_0 248a070300ac5414

On viper.lab

root@viper:~# ibv_devices
device node GUID
------ ----------------
mlx5_0 248a070300ac5f6c

On prometheus.lab

[root@prometheus ~]$ ibv_devices
device node GUID
------ ----------------
mlx5_0 248a070300ac5610

ibstat

Now that we have confirmed all devices are present and accounted for, let's check for links. In the output below you can see that we have link (“Physical state: LinkUp”), but since we have not configured a subnet manager on any of our nodes, the logical port state is “State: Initializing”.

ibstat
CA 'mlx5_0'
CA type: MT4115
Number of ports: 1
Firmware version: 12.28.4512
Hardware version: 0
Node GUID: 0x248a070300ac5610
System image GUID: 0x248a070300ac5610
Port 1:
State: Initializing
Physical state: LinkUp
Rate: 40
Base lid: 65535
LMC: 0
SM lid: 0
Capability mask: 0x2659e848
Port GUID: 0x248a070300ac5610
Link layer: InfiniBand

Run ibstat on your remaining nodes and ensure that you see “Physical state: LinkUp”. You may also want to make note of “Firmware version: 12.28.4512”; we have the same firmware on all three CX-4s.


ibv_devinfo

We can also run ibv_devinfo, which gives a detailed view of the local RDMA/InfiniBand device and its ports. It is more detailed than ibv_devices and overlaps somewhat with ibstat, but presents things from the verbs/RDMA stack perspective.

Example output below:

 ibv_devinfo
hca_id:	mlx5_0
	transport:			InfiniBand (0)
	fw_ver:				12.28.4512
	node_guid:			248a:0703:00ac:5414
	sys_image_guid:			248a:0703:00ac:5414
	vendor_id:			0x02c9
	vendor_part_id:			4115
	hw_ver:				0x0
	board_id:			DEL2180110032
	phys_port_cnt:			1
		port:	1
			state:			PORT_INIT (2)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			0
			port_lid:		65535
			port_lmc:		0x00
			link_layer:		InfiniBand

This output shows us the following…

  • Local RDMA devices
    • Example: mlx5_0, mlx5_1
  • Port state
    • Example: PORT_ACTIVE, PORT_DOWN, PORT_INIT
  • Physical link state
    • Example: LINK_UP, POLLING, DISABLED
  • Negotiated link details
    • Speed and link width
  • Fabric info
    • Local LID and SM LID
  • Device identifiers
    • Node GUID, port GUID, system image GUID
  • Transport / firmware details
    • Transport type and device-specific details
  • RDMA capabilities
    • Limits such as QPs, CQs, MR size, atomic support, GID table size
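If you are scripting these checks, the interesting fields can be pulled straight out of the ibv_devinfo text. A sketch below, with the sample output above hard-coded so it runs anywhere; devinfo_sample, port_state, and fw_version are our own helper names, not part of any tool.

```shell
# Pull the fields we care about out of ibv_devinfo output. The sample
# is hard-coded from the columbia.lab capture above; on a live host,
# replace devinfo_sample with the real ibv_devinfo command.
devinfo_sample() {
cat <<'EOF'
hca_id: mlx5_0
        fw_ver:         12.28.4512
        phys_port_cnt:  1
                port:   1
                        state:  PORT_INIT (2)
                        link_layer:     InfiniBand
EOF
}

port_state() { devinfo_sample | awk '$1 == "state:" { print $2; exit }'; }
fw_version() { devinfo_sample | awk '$1 == "fw_ver:" { print $2; exit }'; }

echo "firmware: $(fw_version), port state: $(port_state)"
```

Once OpenSM is running (next section), the same check should report PORT_ACTIVE instead of PORT_INIT.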

rdma link

The current output of “rdma link” shows us that our InfiniBand ports are connected, but, as we know, the fabric is not initialized.

rdma link
link mlx5_0/1 subnet_prefix fe80:0000:0000:0000 lid 65535 sm_lid 0 lmc 0 state INIT physical_state LINK_UP

Specifically the output shows us the following.

  • mlx5_0/1
    • Device mlx5_0, port 1
  • subnet_prefix fe80:0000:0000:0000
    • Normal default InfiniBand subnet prefix
  • lid 65535
    • The port does not have a valid assigned LID yet
  • sm_lid 0
    • No subnet manager is detected
  • lmc 0
    • LID mask control is 0; not important here
  • state INIT
    • The port is not fully active yet
  • physical_state LINK_UP
    • The physical link is up and the cable/port side is working

Setup Subnet Manager on one Host

For our lab, we are going to set up a subnet manager on only one host. Pick your always-on host. Multiple subnet manager instances can be run, but again, that is not needed for our current objective.

On the selected host run the following to install required packages.

 dnf install -y rdma-core opensm infiniband-diags

Next, start and enable the service

sudo systemctl enable --now opensm

Then check to ensure that the service started without error.

sudo systemctl status opensm --no-pager
journalctl -u opensm -b --no-pager

Now we can re-check the fabric from each host. The port state is now “State: Active”.

 ibstat
CA 'mlx5_0'
	CA type: MT4115
	Number of ports: 1
	Firmware version: 12.28.4512
	Hardware version: 0
	Node GUID: 0x248a070300ac5f6c
	System image GUID: 0x248a070300ac5f6c
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 40
		Base lid: 4
		LMC: 0
		SM lid: 1
		Capability mask: 0x2659e848
		Port GUID: 0x248a070300ac5f6c
		Link layer: InfiniBand

rdma link shows similar output.

rdma link
link mlx5_0/1 subnet_prefix fe80:0000:0000:0000 lid 4 sm_lid 1 lmc 0 state ACTIVE physical_state LINK_UP

Confirm InfiniBand Fabric Topology

You can run the following commands to confirm that your fabric is up and running.

Run “ibnetdiscover” to see host adapters, links, GUIDs, port relationships, and switches. Example output below:

ibnetdiscover
#
# Topology file: generated on Sat Mar 14 21:15:43 2026
#
# Initiated from node 248a070300ac5f6c port 248a070300ac5f6c
vendid=0x2c9
devid=0xbd36
sysimgguid=0x2c902004cf11b
switchguid=0x2c902004cf118(2c902004cf118)
Switch 8 "S-0002c902004cf118" # "Infiniscale-IV Mellanox Technologies" base port 0 lid 3 lmc 0
[1] "H-248a070300ac5f6c"[1](248a070300ac5f6c) # "viper mlx5_0" lid 4 4xQDR
[2] "H-248a070300ac5414"[1](248a070300ac5414) # "columbia mlx5_0" lid 1 4xQDR
[3] "H-248a070300ac5610"[1](248a070300ac5610) # "prometheus mlx5_0" lid 2 4xQDR
vendid=0x2c9
devid=0x1013
sysimgguid=0x248a070300ac5414
caguid=0x248a070300ac5414
Ca 1 "H-248a070300ac5414" # "columbia mlx5_0"
[1](248a070300ac5414) "S-0002c902004cf118"[2] # lid 1 lmc 0 "Infiniscale-IV Mellanox Technologies" lid 3 4xQDR
vendid=0x2c9
devid=0x1013
sysimgguid=0x248a070300ac5610
caguid=0x248a070300ac5610
Ca 1 "H-248a070300ac5610" # "prometheus mlx5_0"
[1](248a070300ac5610) "S-0002c902004cf118"[3] # lid 2 lmc 0 "Infiniscale-IV Mellanox Technologies" lid 3 4xQDR
vendid=0x2c9
devid=0x1013
sysimgguid=0x248a070300ac5f6c
caguid=0x248a070300ac5f6c
Ca 1 "H-248a070300ac5f6c" # "viper mlx5_0"
[1](248a070300ac5f6c) "S-0002c902004cf118"[1] # lid 4 lmc 0 "Infiniscale-IV Mellanox Technologies" lid 3 4xQDR

In the output above we can see the following

  • One Mellanox InfiniScale-IV switch is present in the fabric
    • Switch GUID: 0x2c902004cf118
    • Switch LID: 3
    • Model family shown as Infiniscale-IV Mellanox Technologies
    • It is an 8-port switch
  • Three hosts are connected to the switch
    • columbia on switch port 2, LID 1
    • prometheus on switch port 3, LID 2
    • viper on switch port 1, LID 4
  • All three hosts are being seen as CA / HCA nodes
    • columbia mlx5_0
    • prometheus mlx5_0
    • viper mlx5_0
  • All discovered links are running at:
    • 4xQDR
    • That means a 4-lane QDR InfiniBand link, which aligns with a 40 Gb/s class IB link

This output confirms that OpenSM is working properly, that the switch is visible, and all three nodes are connected to the fabric.

Run “ibnodes”, which provides similar output to ibnetdiscover, albeit a bit less verbose.

ibnodes
Ca : 0x248a070300ac5610 ports 1 "prometheus mlx5_0"
Ca : 0x248a070300ac5414 ports 1 "columbia mlx5_0"
Ca : 0x248a070300ac5f6c ports 1 "viper mlx5_0"
Switch : 0x0002c902004cf118 ports 8 "Infiniscale-IV Mellanox Technologies" base port 0 lid 3 lmc 0
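A quick scripted sanity check on top of this output: count the CAs and switches and make sure the totals match what we expect. The sample output is hard-coded below (ibnodes_sample is our own helper name); on a live host, pipe the real ibnodes instead.

```shell
# Count CAs (hosts) and switches from the ibnodes capture above.
ibnodes_sample() {
cat <<'EOF'
Ca : 0x248a070300ac5610 ports 1 "prometheus mlx5_0"
Ca : 0x248a070300ac5414 ports 1 "columbia mlx5_0"
Ca : 0x248a070300ac5f6c ports 1 "viper mlx5_0"
Switch : 0x0002c902004cf118 ports 8 "Infiniscale-IV Mellanox Technologies" base port 0 lid 3 lmc 0
EOF
}

cas=$(ibnodes_sample | grep -c '^Ca ')
switches=$(ibnodes_sample | grep -c '^Switch ')
echo "fabric: ${cas} host adapters, ${switches} switch(es)"
```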

Run “ibswitches” to see switches only.

 ibswitches
Switch	: 0x0002c902004cf118 ports 8 "Infiniscale-IV Mellanox Technologies" base port 0 lid 3 lmc 0

Run “iblinkinfo” to see link-by-link InfiniBand topology info.

 iblinkinfo
CA: viper mlx5_0:
      0x248a070300ac5f6c      4    1[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>       3    1[  ] "Infiniscale-IV Mellanox Technologies" ( )
CA: columbia mlx5_0:
      0x248a070300ac5414      1    1[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>       3    2[  ] "Infiniscale-IV Mellanox Technologies" ( )
Switch: 0x0002c902004cf118 Infiniscale-IV Mellanox Technologies:
           3    1[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>       4    1[  ] "viper mlx5_0" ( )
           3    2[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>       1    1[  ] "columbia mlx5_0" ( )
           3    3[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>       2    1[  ] "prometheus mlx5_0" ( )
           3    4[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           3    5[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           3    6[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           3    7[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           3    8[  ] ==(                Down/ Polling)==>             [  ] "" ( )
CA: prometheus mlx5_0:
      0x248a070300ac5610      2    1[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>       3    3[  ] "Infiniscale-IV Mellanox Technologies" ( )

In the output above, we see …

  • 4 lanes at 10 Gb/s per lane
  • InfiniBand speed class is QDR
  • aggregate raw signaling rate is about 40 Gb/s
  • Down/Polling – unused/not-connected switch ports

And finally, “sminfo” will show you info on the subnet manager.

Specifically, in the output below we see that the subnet manager is reachable at LID 1, with a GUID of 0x248a070300ac5414 (which belongs to columbia mlx5_0). We also see “activity count 446”, which shows that the subnet manager has processed fabric-management activity 446 times since startup (basically a liveness/activity counter).

Additionally, the output below shows us the priority of the subnet manager instance (0 in this case), while “state 3 SMINFO_MASTER” shows us that this instance of the subnet manager is in the master state and is the active controller in our IB fabric (assigning LIDs, managing paths/routing).

sminfo
sminfo: sm lid 1 sm guid 0x248a070300ac5414, activity count 446 priority 0 state 3 SMINFO_MASTER

Configuring IP over InfiniBand

IP over InfiniBand, or IPoIB, allows an InfiniBand fabric to carry normal IP traffic between hosts. That means systems connected by InfiniBand can use familiar network tools and services such as ping, ssh, scp, NFS, and other TCP/IP-based applications over the IB link instead of only using native RDMA-aware software.

IPoIB is not required for RDMA itself, and it is also not inherently required for technologies like GPUDirect RDMA. RDMA and GPUDirect RDMA operate through the RDMA/verbs stack and the InfiniBand fabric, not through the IP emulation layer that IPoIB provides. NVIDIA’s current networking/operator docs describe RDMA and GPUDirect RDMA enablement separately from IPoIB, and they also document IPoIB as an optional deployment pattern rather than a prerequisite.

We use IPoIB when we want the simplicity of standard IP networking on top of the higher-speed, low-latency InfiniBand fabric. In a small lab or cluster, this is useful for private host-to-host traffic, storage traffic, migration traffic, testing, or other east-west communication, while leaving the normal Ethernet interfaces in place for management access, internet access, and general connectivity.

IPoIB Addresses for our lab

Our lab uses 10.1.x.x for its existing IP scheme, so to avoid any confusion, we will use 172.16.x.x addresses for our small private subnet on the IB network. Note that we do not need a gateway.

HOST            IPoIB ADDRESS      INTERFACE
prometheus.lab  172.16.50.11/24    ibp129s0
columbia.lab    172.16.50.12/24    ibp129s0
viper.lab       172.16.50.13/24    ibp68s0

As part of our initial temporary test, we will apply the IPoIB addresses to the indicated interfaces on each host, as outlined above. The example temporary config below is for one host; we will run the (modified) commands on each host in our cluster.

sudo ip link set ibp129s0 up
sudo ip addr add 172.16.50.11/24 dev ibp129s0
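Rather than retyping the commands on each host, the per-host variants can be generated from the address table above. A dry-run sketch: it only echoes the commands rather than executing them, and gen_ipoib_cmds is our own helper name.

```shell
# Generate the temporary IPoIB commands for every host in one pass.
# Dry run: this echoes the commands instead of executing them.
# Table format: host interface address
gen_ipoib_cmds() {
  while read -r host ifname addr; do
    echo "# on ${host}:"
    echo "sudo ip link set ${ifname} up"
    echo "sudo ip addr add ${addr} dev ${ifname}"
  done <<'EOF'
prometheus.lab ibp129s0 172.16.50.11/24
columbia.lab ibp129s0 172.16.50.12/24
viper.lab ibp68s0 172.16.50.13/24
EOF
}

gen_ipoib_cmds
```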

As you go host to host, verify that the address was assigned correctly. Note the MTU of 2044 in the output below, which is the default for IPoIB datagram mode.

9: ibp129s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 1000
link/infiniband 00:00:03:f2:fe:80:00:00:00:00:00:00:24:8a:07:03:00:ac:56:10 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
inet 172.16.50.11/24 scope global ibp129s0
valid_lft forever preferred_lft forever

Also verify your routing table.

 netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
0.0.0.0         10.1.10.1       0.0.0.0         UG        0 0          0 bridge0
10.1.10.0       0.0.0.0         255.255.255.0   U         0 0          0 bridge0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 idrac
172.16.50.0     0.0.0.0         255.255.255.0   U         0 0          0 ibp129s0

Now perform ping tests from each host and ensure that it can reach the remaining hosts in your cluster. For example, from prometheus.lab:

ping -I ibp129s0 -c 2 172.16.50.12
ping -I ibp129s0 -c 2 172.16.50.13

Once you have tested all three hosts, we can move forward with configuring persistent network configs.


Persistent RHEL 10 / NetworkManager setup

prometheus.lab

sudo nmcli connection add type infiniband ifname ibp129s0 con-name ib-ibp129s0
sudo nmcli connection modify ib-ibp129s0 ipv4.method manual ipv4.addresses 172.16.50.11/24 ipv6.method disabled
sudo nmcli connection up ib-ibp129s0

columbia.lab

sudo nmcli connection add type infiniband ifname ibp129s0 con-name ib-ibp129s0
sudo nmcli connection modify ib-ibp129s0 ipv4.method manual ipv4.addresses 172.16.50.12/24 ipv6.method disabled
sudo nmcli connection up ib-ibp129s0

viper.lab

sudo nmcli connection add type infiniband ifname ibp68s0 con-name ib-ibp68s0
sudo nmcli connection modify ib-ibp68s0 ipv4.method manual ipv4.addresses 172.16.50.13/24 ipv6.method disabled
sudo nmcli connection up ib-ibp68s0

Confirm Routing

Use “ip route” to ensure that we have the proper route in place for our IPoIB network

ip route
default via 10.1.10.1 dev bridge0 proto static metric 425
10.1.10.0/24 dev bridge0 proto kernel scope link src 10.1.10.25 metric 425
169.254.0.0/16 dev idrac proto kernel scope link src 169.254.0.2 metric 100
172.16.50.0/24 dev ibp68s0 proto kernel scope link src 172.16.50.13 metric 150

Also confirm that NetworkManager sees our IB devices correctly (as InfiniBand).

nmcli device status
DEVICE TYPE STATE CONNECTION
bridge0 bridge connected bridge0
idrac ethernet connected idrac
ibp68s0 infiniband connected ib-ibp68s0
bond0 bond connected bond0
enp65s0f0 ethernet connected bond0-port0
enp65s0f1 ethernet connected bond0-port1


IP, IPoIB, and RDMA Usage Matrix

We can use this simple decision matrix to ensure that we understand when to use traditional IP for host-to-host communication, when to use IPoIB, and when to use native RDMA/IB.

Use case                                                      Ethernet      IPoIB                                    Native RDMA / IB
Host management, SSH, web UI, package installs                Best choice   Possible, but usually unnecessary        No
Internet access / default route                               Best choice   No                                       No
General admin traffic between hosts                           Best choice   Good for isolated lab traffic            No
Simple host-to-host testing with ping, ssh, scp, rsync        No            Best choice                              No
NFS/SMB using normal IP networking over the IB fabric         No            Best choice                              No
Fast private storage or migration traffic (standard TCP/IP)   No            Best choice                              No
RDMA-aware apps using verbs/libibverbs                        No            No                                       Best choice
MPI or cluster workloads built for native IB/RDMA             No            Sometimes, if app specifically uses IP   Best choice
GPUDirect RDMA / high-performance GPU-to-network workflows    No            No                                       Best choice
Lowest latency / highest efficiency IB data path              No            No                                       Best choice
Easiest troubleshooting, least risk of routing mistakes       Best choice   Good if kept isolated                    More specialized

Next Steps.

So, in Part 2 of our project, we focused on getting InfiniBand up and running, as well as IPoIB. We validated connectivity, set up a subnet manager, and made sure that our fabric was initialized. We learned a number of IB-related commands and how to read their output. Good stuff.

In our next post we will start working with the various NVIDIA tools and projects, many of which will rely on our IB network. Additionally, we may try to update the firmware on our CX-4s and our IB switch; however, I may skip this step or circle back to it later.
