Introduction

InfiniBand is a mature interconnect technology known for high bandwidth and low latency. It has long been used in supercomputing and HPC environments, and has also been deployed in certain storage and clustered infrastructure designs as an alternative to Fibre Channel.

More recently, InfiniBand has seen strong continued adoption in large-scale AI and GPU clusters, where its high bandwidth, ultra-low latency, and support for technologies such as RDMA, GPUDirect RDMA, and NCCL make it well suited for distributed training and other GPU-to-GPU communication workloads.

This project covers the architecture, deployment, and optimization of a high-speed InfiniBand (IB) fabric that provides low-latency, high-throughput communication between three dual-homed, GPU-enabled servers running RHEL 9 and RHEL 10.1.

By integrating Mellanox ConnectX-4 adapters with an InfiniScale IV switch, we will establish a dedicated Remote Direct Memory Access (RDMA) backend separate from the standard management LAN.

Additionally, we will become more familiar with the setup, configuration, and troubleshooting of InfiniBand networks and adapters, while also exploring the broad set of NVIDIA tools and technologies currently available to support multi-GPU clusters.

This project also encompasses the installation and configuration of NVIDIA drivers, CUDA, the NVIDIA Container Toolkit, and other supported elements of the NVIDIA software stack needed to support HPC/AI clusters and environments.

Primary Objectives Summary

Fabric orchestration: Deploy and manage a QDR InfiniBand fabric using OpenSM.
RDMA enablement: Configure IPoIB in Connected Mode with a 65,520 MTU and validate RDMA functionality across the fabric.
GPU acceleration: Enable and test GPUDirect RDMA with nvidia-peermem for NVIDIA Tesla T4 and P4 GPUs.
Platform enablement: Install and configure NVIDIA drivers, CUDA, the NVIDIA Container Toolkit, and other supported NVIDIA software stack components required for GPU-enabled workloads.
Operations and telemetry: Develop hands-on familiarity with InfiniBand diagnostics, troubleshooting, and NVIDIA GPU monitoring tools.
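
As a rough illustration of the RDMA enablement objective, the sketch below reads an IPoIB interface's mode and MTU from sysfs and compares them with the Connected Mode / 65,520 MTU target. The interface name ib0 and the helper name check_ipoib are assumptions for illustration, not part of the deployed lab.

```python
from pathlib import Path

# Targets taken from the objectives above.
EXPECTED_MODE = "connected"
EXPECTED_MTU = 65520


def check_ipoib(iface: str = "ib0", sysfs_root: str = "/sys/class/net") -> dict:
    """Read the IPoIB mode and MTU for one interface from sysfs.

    `iface` defaults to "ib0", which is an assumed interface name.
    `sysfs_root` is a parameter only so the function can be tested
    against a temporary directory.
    """
    base = Path(sysfs_root) / iface
    mode = (base / "mode").read_text().strip()      # "connected" or "datagram"
    mtu = int((base / "mtu").read_text().strip())
    return {
        "iface": iface,
        "mode": mode,
        "mtu": mtu,
        "ok": mode == EXPECTED_MODE and mtu == EXPECTED_MTU,
    }
```

Calling check_ipoib("ib0") on a host returns a small dictionary whose "ok" field is true only when both the mode and the MTU match the targets.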

Bill of Materials (BOM)

Below is the BOM for this project.

Component	Qty	Part	Description
Switch	1x	Mellanox InfiniScale IV IS5022 Switch	8-port non-blocking unmanaged 40 Gb/s InfiniBand switch system
NICs	3x	Mellanox ConnectX-4 MCX455A-ECAT	1-port EDR 100 Gb/s InfiniBand/Ethernet adapter, QSFP28
GPU (New)	2x	NVIDIA Tesla P4	8 GB GDDR5 (with active cooling mods)
GPU (Existing)	3x	NVIDIA Tesla T4	16 GB GDDR6 (with active cooling mods)
Cabling	3x	FS 40 Gb/s QSFP+ 2 m passive DAC (QSFP-PC02)	3x host to 1x switch
Server	1x	Dell R720 (RHEL 9)	2x Intel Xeon E5-2697 v2 (Ivy Bridge), twelve-core, 2.7 GHz, 8.0 GT/s, 30 MB cache, LGA 2011; 128 GB
Server	1x	Dell R730 (RHEL 10.1)	2x Intel Xeon E5-2690 v4 (Broadwell-EP) @ 2.40 GHz; 256 GB
Server	1x	Dell R730 (RHEL 10.1)	2x Intel Xeon E5-2699 v4 (Broadwell-EP) @ 2.20 GHz; 768 GB

Notes On the Bill of Materials (BOM)

There has been a bit of flux in the exact BOM for this project (and my lab). A Dell R720 was recently added to my lab, replacing a mammoth T620. While I like the tower form factor for its ample PCIe slots and spare SATA power cables, it has to sit on a rack-mount shelf, takes up 5U, and is really heavy and hard to move.

So the CPUs and memory from the T620 were migrated to the R720. However, RHEL 10 dropped support for Intel v2 (Ivy Bridge) processors because it requires the x86-64-v3 microarchitecture level, so I had to deploy RHEL 9 instead of RHEL 10 on that host. RHEL 9 supports the CX-3 with inbox drivers, but CX-3 support was dropped from RHEL 10, so I had to switch to CX-4s, which were more costly. Further research also suggested that GPUDirect RDMA requires a CX-4 or newer, so I would have needed to move to the CX-4 anyway.

IB Switch – I was able to pick up the unmanaged Mellanox InfiniScale switch on eBay quite cheaply. Being unmanaged, it does not run a subnet manager, so I will need to run one on my primary host. It is limited to 40 Gb/s (QDR) per port, so the EDR-capable CX-4 adapters will link at QDR speed.
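
To put the 40 Gb/s per-port limit in context, here is a short worked calculation of usable data rate for a 4x link. It relies on the standard InfiniBand figures (QDR signals at 10 Gb/s per lane with 8b/10b encoding, EDR at 25.78125 Gb/s per lane with 64b/66b encoding). The function names are illustrative only, and real throughput will be lower once protocol and PCIe overheads are included.

```python
# Per-lane signaling rate (Gb/s) and line encoding (payload bits, line bits)
# for the two InfiniBand generations present in this lab.
IB_RATES = {
    "QDR": {"lane_gbps": 10.0, "encoding": (8, 10)},       # 8b/10b
    "EDR": {"lane_gbps": 25.78125, "encoding": (64, 66)},  # 64b/66b
}

# Ordered slowest to fastest; only the rates relevant to this lab are listed.
RATE_ORDER = ["QDR", "EDR"]


def effective_gbps(rate: str, lanes: int = 4) -> float:
    """Usable data rate of a link after removing line-encoding overhead."""
    spec = IB_RATES[rate]
    payload_bits, line_bits = spec["encoding"]
    return lanes * spec["lane_gbps"] * payload_bits / line_bits


def negotiated_rate(adapter_max: str, switch_max: str) -> str:
    """A link runs at the fastest rate that both ends support."""
    return min(adapter_max, switch_max, key=RATE_ORDER.index)


rate = negotiated_rate("EDR", "QDR")  # CX-4 adapter into the QDR switch
print(rate, effective_gbps(rate), "Gb/s usable")
```

With the QDR switch in the path, each host link carries at most 32 Gb/s of data, compared with 100 Gb/s if both ends were EDR.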

IB Adapters – As stated above, I initially intended to use Mellanox CX-3 adapters because they were incredibly inexpensive (around $12 USD). Because the CX-3 driver was dropped in RHEL 10, I switched to CX-4s, which are supported out of the box on both RHEL 9 and RHEL 10. These adapters were not as cheap, but still affordable. I also needed low-profile brackets to fit the open PCIe slots in my servers. I did not want to use OFED drivers, as I wanted the adapters supported out of the box and was not interested in fiddling with drivers. Additionally, the GPUDirect RDMA support docs listed CX-4 or newer as a requirement.

GPUs – Workstation-class GPUs (like my 1x 3070 and my 2x 3060 12 GB cards) are not supported. I needed datacenter-class NVIDIA GPUs installed in all 3 Dell servers. I already owned 3x NVIDIA Tesla T4 (installed in the R730s). I picked up 2x additional GPUs (NVIDIA Tesla P4), which are for functional validation only and were not purchased for their performance or VRAM. An additional requirement was to stick with low-power GPUs that run on PCIe slot power alone and do not require additional power connections.

NVIDIA T4/P4 Comparison and Feature Support

Our LUT (Lab Under Test) consists of 5 GPUs deployed across 3 servers. Details are below.

Feature	NVIDIA T4	NVIDIA Tesla P4
Architecture	Turing	Pascal
Release date	September 12, 2018	September 12, 2016
CUDA cores	2,560	2,560
Tensor cores	320	None
vRAM	16 GB GDDR6	8 GB GDDR5
Memory bandwidth	300–320+ GB/s	192 GB/s
PCIe interface	PCIe Gen3 x16	PCIe Gen3 x16
Form factor	Low-profile, single-slot, passive	Low-profile, single-slot, passive
Max power	70 W	75 W
ECC memory support	Yes	Yes
NVENC / NVDEC	Yes	Yes
NVIDIA vGPU support	Yes	Yes
GPUDirect RDMA	Yes, conditionally supported	Yes, conditionally supported
RDMA NIC requirement	ConnectX-4 or later	ConnectX-4 or later
GPUDirect RDMA topology requirement	GPU and NIC should share the same upstream PCIe root complex for best support/performance	GPU and NIC should share the same upstream PCIe root complex for best support/performance
NVLink	No	No
MIG	No	No
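
The topology requirement in the table can be roughly checked from sysfs. The sketch below compares the PCIe host bridge (the pciDDDD:BB component of a device's resolved sysfs path) for a GPU and a NIC. Treating "same host bridge" as a stand-in for "same upstream root complex" is a simplifying assumption, the PCI addresses shown are hypothetical, and nvidia-smi topo -m remains the more authoritative view.

```python
import os
import re

# Matches the host-bridge component of a resolved sysfs path, e.g. "pci0000:80".
_HOST_BRIDGE = re.compile(r"^pci[0-9a-fA-F]{4}:[0-9a-fA-F]{2}$")


def resolve_pci_device(bdf: str) -> str:
    """Resolve a PCI address such as '0000:03:00.0' to its real sysfs path."""
    return os.path.realpath(f"/sys/bus/pci/devices/{bdf}")


def host_bridge(resolved_path: str) -> str:
    """Return the host-bridge component (pciDDDD:BB) of a resolved sysfs path."""
    for part in resolved_path.split("/"):
        if _HOST_BRIDGE.match(part):
            return part
    raise ValueError(f"no PCI host bridge found in {resolved_path!r}")


def same_root_complex(gpu_path: str, nic_path: str) -> bool:
    """True when both devices hang off the same host bridge (assumed proxy)."""
    return host_bridge(gpu_path) == host_bridge(nic_path)


# Hypothetical paths: a GPU and a NIC on the same socket of a dual-socket host.
gpu = "/sys/devices/pci0000:00/0000:00:02.0/0000:03:00.0"
nic = "/sys/devices/pci0000:00/0000:00:03.0/0000:04:00.0"
print(same_root_complex(gpu, nic))
```

On a real host, the two paths would come from resolve_pci_device() using the addresses reported by lspci. On dual-socket servers like the R720/R730, a GPU and NIC attached to different CPU sockets would typically show different host bridges.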

Additional GPUs Available

In addition to the 5x GPUs currently installed in my existing lab servers, I have a few additional GPUs that are not currently deployed. I will include them below for reference.

Generally speaking, these are either “older”, “hungrier”, or “hotter” than the GPUs that I already have in service.

Feature	NVIDIA Tesla K20	NVIDIA Tesla P100
Architecture	Kepler	Pascal
Release date	November 2012	April 5, 2016
CUDA cores	2,496	3,584
Tensor cores	None	None
vRAM	5 GB GDDR5	16 GB HBM2
Memory bandwidth	208 GB/s	732 GB/s
PCIe interface	PCIe Gen2 x16	PCIe Gen3 x16
Form factor	Full-height, dual-slot	Full-height, dual-slot
Max power	225 W	250 W
ECC memory support	Yes	Yes
NVENC / NVDEC	No	No
NVIDIA vGPU support	No	Yes
GPUDirect RDMA	Yes, conditionally supported	Yes, conditionally supported
RDMA NIC requirement	ConnectX-4 or later	ConnectX-4 or later
GPUDirect RDMA topology requirement	GPU and NIC should share the same upstream PCIe root complex for best support/performance	GPU and NIC should share the same upstream PCIe root complex for best support/performance
NVLink	No	Depends on model; PCIe P100: No, SXM2 P100: Yes
MIG	No	No

NVIDIA Tool Test Matrix

Below is a matrix of technologies, their supportability in my soon-to-be-deployed stack, and a brief description of each technology.

Technology / Product	Supported	Description	Priority	Notes
RDMA (InfiniBand / verbs)	Supported	Remote Direct Memory Access. It is a networking technology that lets one computer access or transfer data directly to the memory of another computer without involving the remote CPU or operating system in the data path.	High	Direct memory access over the network using your ConnectX-4 InfiniBand cards. This is the base networking capability you will use for low-latency, high-throughput node-to-node transfers. Native fit for CX-4 + IB switch
GPUDirect RDMA	Supported	GPUDirect RDMA is an NVIDIA technology that lets a third-party PCIe device directly read from or write to GPU memory without first copying the data through system RAM.	High	Lets a supported NIC perform RDMA directly to/from GPU memory, bypassing extra CPU copies. NVIDIA documents GPUDirect RDMA for Tesla/Quadro GPUs and requires ConnectX-4 or later NICs, with best results when GPU and NIC share the same upstream PCIe root complex. (NVIDIA Docs) One of the most relevant GPU+IB features for this setup
GPUDirect Storage	Unclear	Designed for direct data movement between storage and GPU memory. It is primarily positioned around storage stacks rather than IB switching alone, so whether this lab can validate it depends on OS, filesystem, NVMe/storage path, and supported software stack rather than just T4/P4 + CX-4. (NVIDIA Docs)	Low	NVIDIA technology that allows data to move directly between storage and GPU memory using DMA, instead of first bouncing through CPU memory. Works best with NVMe storage; the Dell models under test have only SSDs and HDDs, with no NVMe support.
MIG (Multi-Instance GPU)	Not supported	MIG starts with newer architectures and is documented in NVIDIA’s MIG guide as an Ampere-era feature. NVIDIA’s cloud-native docs explicitly note that Tesla T4 does not support MIG. P4 also predates MIG. (NVIDIA Docs)	None	Not supported. Multi-Instance GPU. It is an NVIDIA technology that lets a single supported GPU be partitioned into multiple smaller, isolated GPU instances
Time-slicing / shared GPU scheduling	Supported	Allows multiple workloads to share one physical GPU by giving each workload a small turn on the GPU scheduler. Good for shared-lab/VM experiments.	Medium	Since T4 does not support MIG, NVIDIA documents time-slicing as a way to share T4 across multiple smaller jobs. This is useful for Kubernetes/OpenShift experiments or general shared-lab validation. P4 can also be shared through virtualization/software scheduling rather than MIG. (NVIDIA Docs)
NVIDIA vGPU	Supported	NVIDIA’s virtual GPU stack allows partitioning/sharing GPUs across VMs for compute, VDI, or graphics use cases. Both T4 and P4 are in NVIDIA’s supported vGPU product documentation. (NVIDIA Docs)	High	NVIDIA’s docs describe it as enabling multiple VMs to have simultaneous, direct access to a single physical GPU using NVIDIA drivers inside the guest OS. It lets multiple virtual machines share one physical NVIDIA GPU.
DCGM (Data Center GPU Manager)	Supported	NVIDIA’s primary datacenter GPU management and telemetry framework for health, diagnostics, topology, clocks, thermals, ECC, profiling, and integration with cluster tooling. It is explicitly built for Tesla/datacenter GPUs. (NVIDIA Docs)	High	DCGM is NVIDIA’s datacenter GPU management and monitoring framework. NVIDIA describes it as a lightweight user-space library/agent for administering NVIDIA datacenter GPUs in clusters and datacenters.
DCGM Exporter	Supported	Prometheus exporter built on top of DCGM that exposes GPU metrics over HTTP for scraping. Good fit for validating telemetry, dashboards, and alerting with your servers. (NVIDIA Docs)	High	NVIDIA’s Prometheus exporter for GPU metrics. Will feed my existing Grafana instance. Easy to validate and useful operationally.
NVIDIA Merlin	Supported	Merlin is NVIDIA’s recommender-system framework stack for training and especially inference pipelines. T4 is a strong fit; P4 may work for smaller or older inference experiments, but T4 is the more relevant target. Support is practical rather than “card-specific” in docs, since Merlin rides on the CUDA/framework/container stack. (NVIDIA Developer)	Medium–Low	NVIDIA framework for building recommender systems. A recommender system is the kind of ML system used for things like product recommendations, “people also watched” suggestions, next-best-content ranking, search or feed results, and ad / click-through prediction.
TensorRT	Supported	TensorRT is NVIDIA’s SDK/runtime for optimizing trained neural-network models for inference on NVIDIA GPUs. It takes a model from frameworks such as TensorFlow, PyTorch, or ONNX and builds an optimized inference engine that can use precision modes such as FP32, FP16, and INT8 where supported.	High	NVIDIA’s inference optimizer/runtime. Very relevant for T4 and still usable on P4. T4 benefits significantly from Tensor Cores, so it is the better platform for validation. (NVIDIA Developer)
CUDA	Supported	Core GPU programming/runtime stack. Required for most of the technologies you listed and the base layer for custom validation, benchmarks, peer access tests, and GPU-aware applications. (NVIDIA Docs)	High	Foundation for most modern NVIDIA workflows
NVIDIA Container Toolkit	Supported	Enables Docker/Podman/Kubernetes containers to access NVIDIA GPUs cleanly. Useful for validating Merlin, TensorRT, PyTorch, RAPIDS, and exporter containers. (NVIDIA Docs)	High	Foundation for most modern NVIDIA workflows, integrate with Podman on RHEL
NVIDIA GPU Operator	Supported	Kubernetes/OpenShift operator that automates driver, toolkit, DCGM, exporter, and related GPU software deployment. Best fit if you want to turn a lab into a small cluster validation environment. (NVIDIA Docs)	Low	Good if you want Kubernetes/OpenShift validation, however no plans to run OCP in near future.
NVIDIA Fabric Manager	Not supported / not applicable	NVIDIA Fabric Manager is software for managing NVSwitch / NVLink GPU fabrics inside supported multi-GPU servers. NVIDIA says it configures the NVSwitch memory fabric to form a single memory fabric among participating GPUs and monitors the NVLinks that support that fabric.	None	Fabric Manager is for NVSwitch-based systems, not T4/P4 PCIe accelerator setups. Installed cards do not use NVSwitch. (NVIDIA Docs) Not supported here
NVLink	Not supported	NVLink is NVIDIA’s high-speed direct interconnect for GPUs. It provides a much faster path for GPU-to-GPU communication than ordinary PCIe alone, and in some platforms it is also used for CPU/GPU or switch-based interconnect designs. NVIDIA describes it as a direct GPU-to-GPU interconnect used to scale multi-GPU I/O within a server.	None	Not Supported. Neither T4 nor P4 provides NVLink. Multi-node connectivity would be via InfiniBand/RDMA, not GPU-to-GPU NVLink. (NVIDIA)
NVIDIA NIM / inference microservices	Unclear / limited	NVIDIA NIM is NVIDIA’s set of prebuilt, optimized, containerized inference microservices for running AI models on NVIDIA GPUs. NVIDIA describes NIM as portable microservices that simplify deployment of AI models across cloud, datacenter, workstation, and edge environments, typically exposing standard APIs for integration into applications	Low	Possible in some cases, but modern NIM profiles often assume newer GPUs and larger memory footprints than P4, and sometimes more than T4 depending on model size. It is worth testing selectively with small models, but I would not assume broad support on P4/T4 without checking the specific NIM/model requirements. (NVIDIA Docs)
Mixed precision inference / training	T4: Supported / P4: Limited	Mixed precision means using a mix of higher-precision and lower-precision numeric formats in AI workloads so you get better speed and lower memory use without giving up model quality where precision still matters.	Low	T4 supports Tensor Cores and is much better for FP16/INT8 inference acceleration. P4 lacks Tensor Cores, so mixed-precision benefits are more limited and workload-dependent. (NVIDIA)
NCCL (multi-GPU collectives)	Supported, but topology-dependent	NVIDIA Collective Communications Library. It is NVIDIA’s library for fast GPU-to-GPU communication, including both multi-GPU within a server and multi-node across servers. NVIDIA describes it as a topology-aware library of collective communication primitives optimized for NVIDIA GPUs and networking.	Medium	Useful for experimenting with multi-GPU and possibly multi-node communication patterns. It can work over PCIe and network paths, but the quality of results depends heavily on topology and software stack. This is a practical support judgment rather than a clean per-card matrix in the cited pages. (NVIDIA Docs) Supported on both T4 and P4. Planned tests: single-host NCCL across multiple GPUs in one server; multi-node NCCL over InfiniBand; GPUDirect RDMA-assisted NCCL when the stack and PCIe topology cooperate.
GPUDirect P2P / peer-to-peer	Unclear / topology-dependent	GPU peer-to-peer memory access: one NVIDIA GPU can directly access or copy data to another NVIDIA GPU’s memory without staging the transfer through host RAM		Peer-to-peer GPU memory access can work in some PCIe topologies, but support and performance vary a lot by motherboard, root complex, ACS/IOMMU behavior, and driver stack. Possibly worth validating experimentally in homelab. (NVIDIA Docs)
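
As a small illustration of the DCGM Exporter row, the sketch below parses Prometheus text-format output and pulls out per-GPU temperatures. The sample text is made up for illustration (labels and values are not real lab data), and the parser is a minimal one that ignores escaping rules in the full exposition format.

```python
import re

# One sample line: metric name, optional {labels}, then a numeric value.
_SAMPLE = re.compile(
    r"^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)"
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text: str) -> list:
    """Parse Prometheus text format into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks, HELP/TYPE lines
            continue
        match = _SAMPLE.match(line)
        if not match:
            continue
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def gpu_temps(text: str) -> dict:
    """Map GPU index label -> temperature from DCGM_FI_DEV_GPU_TEMP samples."""
    return {
        labels.get("gpu", "?"): value
        for name, labels, value in parse_metrics(text)
        if name == "DCGM_FI_DEV_GPU_TEMP"
    }


# Illustrative sample only; not captured from the lab.
SAMPLE_TEXT = """\
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",modelName="Tesla T4"} 45
DCGM_FI_DEV_GPU_TEMP{gpu="1",modelName="Tesla P4"} 51
"""
print(gpu_temps(SAMPLE_TEXT))
```

In practice the text would be fetched from the exporter's /metrics endpoint (port 9400 by default, assuming a stock deployment) rather than from an inline sample, and Prometheus/Grafana would normally do this scraping for you.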

Project Status

Currently I am in the “waiting for hardware to arrive” stage of the project (mainly due to the switch from CX-3s to CX-4s). So let's take stock of where we are in the project and outline our next steps.

Current State

IB Switch Racked
GPUs physically installed in all systems
Rough list of supported technologies to install and test
Basic installation steps and IB troubleshooting documented

Next Steps

Install all CX-4 adapters, ensure drivers are installed and loaded properly, and possibly update the CX-4 firmware
Install NVIDIA drivers, CUDA, and the Container Toolkit on all GPU-enabled hardware
Install IB cables and power up the Mellanox IB switch (and hope it's not too loud)
Install one active Subnet Manager (OpenSM) instance on one of the target hosts
Configure IB IPs
Work through listed/supported technologies in the NVIDIA matrix above.
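
Once the adapters, cables, and OpenSM are in place, a quick way to confirm the fabric came up is to read the port state from sysfs. The sketch below is a minimal version of that check. The device name mlx5_0 used in the usage note is an assumption (it is the usual name for the first ConnectX-4 port), and the diagnosis messages are my own wording.

```python
from pathlib import Path


def read_port(device: str, port: int = 1,
              sysfs_root: str = "/sys/class/infiniband") -> dict:
    """Read logical state, physical state, and rate for one IB port."""
    base = Path(sysfs_root) / device / "ports" / str(port)
    return {
        "state": (base / "state").read_text().strip(),            # e.g. "4: ACTIVE"
        "phys_state": (base / "phys_state").read_text().strip(),  # e.g. "5: LinkUp"
        "rate": (base / "rate").read_text().strip(),              # e.g. "40 Gb/sec (4X QDR)"
    }


def diagnose(info: dict) -> str:
    """Turn raw port fields into a one-line hint about what to check next."""
    state = info["state"].split(":", 1)[-1].strip()
    phys = info["phys_state"].split(":", 1)[-1].strip()
    if phys != "LinkUp":
        return "no physical link: check cable, switch power, and port"
    if state == "INIT":
        return "link is up but no subnet manager has configured it: check OpenSM"
    if state == "ACTIVE":
        return f"port is active at {info['rate']}"
    return f"unexpected state: {state}"
```

For example, diagnose(read_port("mlx5_0")) on a cabled host with no subnet manager running would be expected to report the INIT case, which is the cue to check OpenSM.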

My goal is to document my progress in future posts.

Project “NVIDIA HPC Infiniband Homelab GPU Cluster”: Part 1: Project Overview
