Introduction
InfiniBand is a mature interconnect technology known for high bandwidth and low latency. It has long been used in supercomputing and HPC environments, and has also been deployed in certain storage and clustered infrastructure designs as an alternative to Fibre Channel.
More recently, InfiniBand has seen strong continued adoption in large-scale AI and GPU clusters, where its high bandwidth, ultra-low latency, and support for technologies such as RDMA, GPUDirect RDMA, and NCCL make it well suited for distributed training and other GPU-to-GPU communication workloads.
This project covers the architecture, deployment, and optimization of a high-speed InfiniBand (IB) fabric that provides low-latency, high-throughput communication between three dual-homed, GPU-enabled RHEL 9/10.1 servers.
By integrating Mellanox ConnectX-4 adapters with an InfiniScale IV switch, we will establish a dedicated Remote Direct Memory Access (RDMA) backend separate from the standard management LAN.
Additionally, we will become more familiar with the setup, configuration, and troubleshooting of InfiniBand networks and adapters, while also exploring the broad set of NVIDIA tools and technologies currently available to support multi-GPU clusters.
This project also encompasses the installation and configuration of NVIDIA drivers, CUDA, the NVIDIA Container Toolkit, and other supported elements of the NVIDIA software stack needed to support HPC/AI clusters and environments.
Primary Objectives Summary
- Fabric orchestration: Deploy and manage a QDR InfiniBand fabric using OpenSM.
- RDMA enablement: Configure IPoIB in Connected Mode with a 65,520 MTU and validate RDMA functionality across the fabric.
- GPU acceleration: Enable and test GPUDirect RDMA with nvidia-peermem for NVIDIA Tesla T4 and P4 GPUs.
- Platform enablement: Install and configure NVIDIA drivers, CUDA, the NVIDIA Container Toolkit, and other supported NVIDIA software stack components required for GPU-enabled workloads.
- Operations and telemetry: Develop hands-on familiarity with InfiniBand diagnostics, troubleshooting, and NVIDIA GPU monitoring tools.
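As a rough illustration of the RDMA enablement objective, here is a minimal Python sketch that reads the IPoIB mode and MTU the kernel exposes under sysfs and compares them with the connected-mode, 65,520 MTU target. The interface name `ib0` and the function names are assumptions made for this example, and the sketch only reports settings; it does not change them.

```python
from pathlib import Path

CONNECTED_MODE_MTU = 65520  # MTU target taken from the objectives list


def read_ipoib_settings(ifname, sysfs_root="/sys/class/net"):
    """Return the IPoIB mode and MTU reported for one interface.

    `ifname` (for example "ib0") is an assumed interface name.
    """
    base = Path(sysfs_root) / ifname
    mode = (base / "mode").read_text().strip()   # "datagram" or "connected"
    mtu = int((base / "mtu").read_text().strip())
    return {"mode": mode, "mtu": mtu}


def meets_objective(settings):
    """True when the interface is in connected mode with the 65,520 MTU."""
    return (
        settings["mode"] == "connected"
        and settings["mtu"] == CONNECTED_MODE_MTU
    )
```

On a host with an IPoIB interface, `meets_objective(read_ipoib_settings("ib0"))` would give a quick pass/fail signal before moving on to RDMA validation.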
Bill of Materials (BOM)
Below is the BOM for this project.
| Component | Qty | Part | Description |
|---|---|---|---|
| Switch | 1x | Mellanox InfiniScale IV IS5022 Switch | 8-port non-blocking unmanaged 40Gb/s InfiniBand switch system |
| NICs | 3x | Mellanox ConnectX-4 MCX455A-ECAT | 1-port EDR 100Gb/s IB QSFP28 InfiniBand/Ethernet adapter |
| GPU (New) | 2x | NVIDIA Tesla P4 | 8GB GDDR5 (with active cooling mods) |
| GPU (Existing) | 3x | NVIDIA Tesla T4 | 16GB GDDR6 (with active cooling mods) |
| Cabling | 3x | FS 40Gbps QSFP+ 2M Passive DAC (QSFP-PC02) | 3x Host to 1x Switch |
| Server | 1x | Dell R720 (RHEL 9) | 2x Intel Xeon E5-2697 v2 (Ivy Bridge), twelve-core, 2.7GHz, 8.0GT/s, 30MB, LGA 2011, 128GB |
| Server | 1x | Dell R730 (RHEL 10.1) | 2x Intel Xeon E5-2690 v4 (Broadwell-EP) @ 2.40GHz, 256GB |
| Server | 1x | Dell R730 (RHEL 10.1) | 2x Intel Xeon E5-2699 v4 (Broadwell-EP) @ 2.20GHz, 768GB |
Notes On the Bill of Materials (BOM)
There has been a bit of flux in the exact BOM for this project (and my lab). A Dell R720 was recently added to my lab, replacing a mammoth T620. While I like the tower form factor for its ample number of PCIe slots and spare SATA power cables, it has to sit on a rack-mount shelf, takes up 5RU, and is really heavy and hard to move.
So the CPUs and memory from the T620 were migrated to the R720. However, RHEL 10 dropped support for Intel v2 (Ivy Bridge) processors, so I had to deploy RHEL 9 on that host instead of RHEL 10. RHEL 9 supported the CX-3 with in-box drivers, but CX-3 support was dropped from RHEL 10, so I had to switch to CX-4s, which were more costly. Further research then showed that GPUDirect RDMA is documented as requiring ConnectX-4 or later, so I needed to move to the CX-4 (or newer) anyway.
IB Switch – I was able to pick up the unmanaged Mellanox InfiniScale IV switch on eBay quite cheaply. Being unmanaged, it does not run a subnet manager, so I will need to run one on my primary host. The switch is limited to 40Gb/s (QDR) per port.
IB Adapters – Initially, as stated above, I intended to use Mellanox CX-3 adapters, as they were incredibly inexpensive (around $12 USD). However, because the CX-3 driver was dropped in RHEL 10, I switched to CX-4s, which are supported out of the box on both RHEL 9 and RHEL 10. These adapters were not as cheap, but still affordable. I also needed low-profile brackets to fit the existing open PCIe slots in my servers. I did not want to use OFED drivers, as I wanted the adapters supported out of the box and was not interested in fiddling with drivers. Additionally, the support docs for GPUDirect RDMA state that CX-4 or newer is required.
GPUs – Workstation-class GPUs (like my 1x 3070 and my 2x 12GB 3060s) are not supported. I needed datacenter-class NVIDIA GPUs installed in all 3 Dell servers. I already owned 3x NVIDIA Tesla T4 (installed in the R730s). I picked up 2x additional GPUs (NVIDIA Tesla P4), which are for functional validation only and were not purchased for their performance or VRAM. An additional requirement was to stick with low-power GPUs that run on PCIe slot power alone and do not require additional power connections.
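To put the switch's 40Gb/s per-port limit in context, the arithmetic below works out signaling rate versus usable data rate for a 4x QDR link and a 4x EDR link. It relies on the standard InfiniBand figures (QDR: 10 Gb/s per lane with 8b/10b encoding; EDR: 25.78125 Gb/s per lane with 64b/66b encoding). The class and variable names are made up for this sketch, and the assumption that an EDR adapter on a QDR switch port runs at the lower of the two rates should be confirmed on the real hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkRate:
    name: str
    lane_signaling_gbps: float   # raw signaling rate per lane
    payload_bits: int            # data bits per encoded block
    block_bits: int              # total bits per encoded block
    lanes: int = 4               # 4x links, as on QSFP ports

    @property
    def signaling_gbps(self):
        return self.lane_signaling_gbps * self.lanes

    @property
    def data_gbps(self):
        # Usable rate after line-encoding overhead.
        return self.signaling_gbps * self.payload_bits / self.block_bits


QDR = LinkRate("QDR", 10.0, 8, 10)        # 8b/10b encoding
EDR = LinkRate("EDR", 25.78125, 64, 66)   # 64b/66b encoding

# Assumption: the link runs at the slower side, here the QDR switch port.
expected = min(QDR, EDR, key=lambda rate: rate.data_gbps)

for rate in (QDR, EDR):
    print(f"{rate.name}: {rate.signaling_gbps:g} Gb/s signaling, "
          f"{rate.data_gbps:g} Gb/s data ({rate.data_gbps / 8:g} GB/s)")
print(f"Expected ceiling in this lab: {expected.name}")
```

So even though the ConnectX-4 adapters are EDR-capable, the ceiling for any bandwidth test in this lab should be read against roughly 32 Gb/s (about 4 GB/s) of data per port, not 100 Gb/s.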
NVIDIA T4/P4 Comparison and Feature Support
Our LUT (Lab Under Test) consists of 5 GPUs deployed across 3 servers. Details are below.
| Feature | NVIDIA T4 | NVIDIA Tesla P4 |
|---|---|---|
| Architecture | Turing | Pascal |
| Release date | September 12, 2018 | September 12, 2016 |
| CUDA cores | 2,560 | 2,560 |
| Tensor cores | 320 | None |
| vRAM | 16 GB GDDR6 | 8 GB GDDR5 |
| Memory bandwidth | 300–320+ GB/s | 192 GB/s |
| PCIe interface | PCIe Gen3 x16 | PCIe Gen3 x16 |
| Form factor | Low-profile, single-slot, passive | Low-profile, single-slot, passive |
| Max power | 70 W | 75 W |
| ECC memory support | Yes | Yes |
| NVENC / NVDEC | Yes | Yes |
| NVIDIA vGPU support | Yes | Yes |
| GPUDirect RDMA | Yes, conditionally supported | Yes, conditionally supported |
| RDMA NIC requirement | ConnectX-4 or later | ConnectX-4 or later |
| GPUDirect RDMA topology requirement | GPU and NIC should share the same upstream PCIe root complex for best support/performance | GPU and NIC should share the same upstream PCIe root complex for best support/performance |
| NVLink | No | No |
| MIG | No | No |
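The topology requirement in the table (GPU and NIC under the same upstream PCIe root complex) can be approximated from sysfs. The sketch below takes the resolved sysfs path of two PCI devices and compares the `pciDDDD:BB` host-bridge component in each path. Treating that component as a stand-in for "root complex" is a simplification, and the example device addresses are hypothetical.

```python
import os
import re
from pathlib import PurePosixPath

# Matches host-bridge path components such as "pci0000:80".
_HOST_BRIDGE = re.compile(r"^pci[0-9a-fA-F]{4}:[0-9a-fA-F]{2}$")


def resolve_device(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Resolve a PCI address such as "0000:82:00.0" to its full sysfs path."""
    return os.path.realpath(os.path.join(sysfs_root, bdf))


def host_bridge(resolved_path):
    """Return the host-bridge component (e.g. "pci0000:80") of a sysfs path."""
    for part in PurePosixPath(resolved_path).parts:
        if _HOST_BRIDGE.match(part):
            return part
    raise ValueError(f"no PCI host bridge found in {resolved_path!r}")


def same_host_bridge(gpu_path, nic_path):
    """True when both devices hang off the same host bridge."""
    return host_bridge(gpu_path) == host_bridge(nic_path)


# Hypothetical resolved paths, for illustration only.
gpu = "/sys/devices/pci0000:80/0000:80:02.0/0000:82:00.0"
nic = "/sys/devices/pci0000:80/0000:80:03.0/0000:83:00.0"
other = "/sys/devices/pci0000:00/0000:00:03.0/0000:04:00.0"

print(same_host_bridge(gpu, nic))    # True: same socket-side bridge
print(same_host_bridge(gpu, other))  # False: traffic crosses sockets
```

On the actual servers, `nvidia-smi topo -m` is the more direct way to see how each GPU relates to each NIC; the sysfs comparison is only a cross-check that is easy to script.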
Additional GPUs Available
In addition to the 5x GPUs currently installed in my existing lab servers, I have a few additional GPUs that are not currently deployed. I will include them below for reference.
Generally speaking, these are either “older”, “hungrier”, or “hotter” than the GPUs that I already have in service.
| Feature | NVIDIA Tesla K20 | NVIDIA Tesla P100 |
|---|---|---|
| Architecture | Kepler | Pascal |
| Release date | November 2012 | April 5, 2016 |
| CUDA cores | 2,496 | 3,584 |
| Tensor cores | None | None |
| vRAM | 5 GB GDDR5 | 16 GB HBM2 |
| Memory bandwidth | 208 GB/s | 732 GB/s |
| PCIe interface | PCIe Gen2 x16 | PCIe Gen3 x16 |
| Form factor | Full-height, dual-slot | Full-height, dual-slot |
| Max power | 225 W | 250 W |
| ECC memory support | Yes | Yes |
| NVENC / NVDEC | No | No |
| NVIDIA vGPU support | No | Yes |
| GPUDirect RDMA | Yes, conditionally supported | Yes, conditionally supported |
| RDMA NIC requirement | ConnectX-4 or later | ConnectX-4 or later |
| GPUDirect RDMA topology requirement | GPU and NIC should share the same upstream PCIe root complex for best support/performance | GPU and NIC should share the same upstream PCIe root complex for best support/performance |
| NVLink | No | Depends on model; PCIe P100: No, SXM2 P100: Yes |
| MIG | No | No |
NVIDIA Tool Test Matrix
Below is a matrix of technologies, their supportability in my soon-to-be-deployed stack, and a brief description of each technology.
| Technology / Product | Supported | Description | Priority | Notes |
|---|---|---|---|---|
| RDMA (InfiniBand / verbs) | Supported | Remote Direct Memory Access. It is a networking technology that lets one computer access or transfer data directly to the memory of another computer without involving the remote CPU or operating system in the data path. | High | Direct memory access over the network using your ConnectX-4 InfiniBand cards. This is the base networking capability you will use for low-latency, high-throughput node-to-node transfers. Native fit for CX-4 + IB switch |
| GPUDirect RDMA | Supported | GPUDirect RDMA is an NVIDIA technology that lets a third-party PCIe device directly read from or write to GPU memory without first copying the data through system RAM. | High | Lets a supported NIC perform RDMA directly to/from GPU memory, bypassing extra CPU copies. NVIDIA documents GPUDirect RDMA for Tesla/Quadro GPUs and requires ConnectX-4 or later NICs, with best results when GPU and NIC share the same upstream PCIe root complex. (NVIDIA Docs) One of the most relevant GPU+IB features for this setup. |
| GPUDirect Storage | Unclear | Designed for direct data movement between storage and GPU memory. It is primarily positioned around storage stacks rather than IB switching alone, so whether your exact lab can validate it depends on OS, filesystem, NVMe/storage path, and supported software stack rather than just T4/P4 + CX-4. (NVIDIA Docs) | Low | NVIDIA technology that allows data to move directly between storage and GPU memory using DMA, instead of first bouncing through CPU memory. Works best with NVMe storage; I have only SSDs and HDDs, and there is no NVMe support in the Dell models under test. |
| MIG (Multi-Instance GPU) | Not supported | MIG starts with newer architectures and is documented in NVIDIA’s MIG guide as an Ampere-era feature. NVIDIA’s cloud-native docs explicitly note that Tesla T4 does not support MIG. P4 also predates MIG. (NVIDIA Docs) | None | Not supported. Multi-Instance GPU. It is an NVIDIA technology that lets a single supported GPU be partitioned into multiple smaller, isolated GPU instances |
| Time-slicing / shared GPU scheduling | Supported | Good for shared-lab/VM experiments. Allows multiple workloads to share one physical GPU by giving each workload a small turn on the GPU scheduler. | Medium | Since T4 does not support MIG, NVIDIA documents time-slicing as a way to share T4 across multiple smaller jobs. This is useful for Kubernetes/OpenShift experiments or general shared-lab validation. P4 can also be shared through virtualization/software scheduling rather than MIG. (NVIDIA Docs) |
| NVIDIA vGPU | Supported | NVIDIA’s virtual GPU stack allows partitioning/sharing GPUs across VMs for compute, VDI, or graphics use cases. Both T4 and P4 are in NVIDIA’s supported vGPU product documentation. (NVIDIA Docs) | High | NVIDIA’s docs describe it as enabling multiple VMs to have simultaneous, direct access to a single physical GPU using NVIDIA drivers inside the guest OS. It lets multiple virtual machines share one physical NVIDIA GPU. |
| DCGM (Data Center GPU Manager) | Supported | NVIDIA’s primary datacenter GPU management and telemetry framework for health, diagnostics, topology, clocks, thermals, ECC, profiling, and integration with cluster tooling. It is explicitly built for Tesla/datacenter GPUs. (NVIDIA Docs) | High | DCGM is NVIDIA’s datacenter GPU management and monitoring framework. NVIDIA describes it as a lightweight user-space library/agent for administering NVIDIA datacenter GPUs in clusters and datacenters. |
| DCGM Exporter | Supported | Prometheus exporter built on top of DCGM that exposes GPU metrics over HTTP for scraping. Good fit for validating telemetry, dashboards, and alerting with your servers. (NVIDIA Docs) | High | NVIDIA’s Prometheus exporter for GPU metrics; I will use my existing Grafana instance. Easy to validate and useful operationally. |
| NVIDIA Merlin | Supported | Merlin is NVIDIA’s recommender-system framework stack for training and especially inference pipelines. T4 is a strong fit; P4 may work for smaller or older inference experiments, but T4 is the more relevant target. Support is practical rather than “card-specific” in docs, since Merlin rides on the CUDA/framework/container stack. (NVIDIA Developer) | Medium–Low | NVIDIA framework for building recommender systems. A recommender system is the kind of ML system used for things like product recommendations, “people also watched” suggestions, next-best content, ranking search or feed results, and ad / click-through prediction. |
| TensorRT | Supported | TensorRT is NVIDIA’s SDK/runtime for optimizing trained neural-network models for inference on NVIDIA GPUs. It takes a model from frameworks such as TensorFlow, PyTorch, or ONNX and builds an optimized inference engine that can use precision modes such as FP32, FP16, and INT8 where supported. | High | NVIDIA’s inference optimizer/runtime. Very relevant for T4 and still usable on P4. T4 benefits significantly from Tensor Cores, so it is the better platform for validation. (NVIDIA Developer) |
| CUDA | Supported | Core GPU programming/runtime stack. Required for most of the technologies you listed and the base layer for custom validation, benchmarks, peer access tests, and GPU-aware applications. (NVIDIA Docs) | High | Foundation for most modern NVIDIA workflows |
| NVIDIA Container Toolkit | Supported | Enables Docker/Podman/Kubernetes containers to access NVIDIA GPUs cleanly. Useful for validating Merlin, TensorRT, PyTorch, RAPIDS, and exporter containers. (NVIDIA Docs) | High | Foundation for most modern NVIDIA workflows, integrate with Podman on RHEL |
| NVIDIA GPU Operator | Supported | Kubernetes/OpenShift operator that automates driver, toolkit, DCGM, exporter, and related GPU software deployment. Best fit if you want to turn a lab into a small cluster validation environment. (NVIDIA Docs) | Low | Good if you want Kubernetes/OpenShift validation, however no plans to run OCP in near future. |
| NVIDIA Fabric Manager | Not supported / not applicable | NVIDIA Fabric Manager is software for managing NVSwitch / NVLink GPU fabrics inside supported multi-GPU servers. NVIDIA says it configures the NVSwitch memory fabric to form a single memory fabric among participating GPUs and monitors the NVLinks that support that fabric. | None | Fabric Manager is for NVSwitch-based systems, not T4/P4 PCIe accelerator setups. Installed cards do not use NVSwitch. (NVIDIA Docs) Not supported here |
| NVLink | Not supported | NVLink is NVIDIA’s high-speed direct interconnect for GPUs. It provides a much faster path for GPU-to-GPU communication than ordinary PCIe alone, and in some platforms it is also used for CPU/GPU or switch-based interconnect designs. NVIDIA describes it as a direct GPU-to-GPU interconnect used to scale multi-GPU I/O within a server. | None | Not Supported. Neither T4 nor P4 provides NVLink. Multi-node connectivity would be via InfiniBand/RDMA, not GPU-to-GPU NVLink. (NVIDIA) |
| NVIDIA NIM / inference microservices | Unclear / limited | NVIDIA NIM is NVIDIA’s set of prebuilt, optimized, containerized inference microservices for running AI models on NVIDIA GPUs. NVIDIA describes NIM as portable microservices that simplify deployment of AI models across cloud, datacenter, workstation, and edge environments, typically exposing standard APIs for integration into applications | Low | Possible in some cases, but modern NIM profiles often assume newer GPUs and larger memory footprints than P4, and sometimes more than T4 depending on model size. It is worth testing selectively with small models, but I would not assume broad support on P4/T4 without checking the specific NIM/model requirements. (NVIDIA Docs) |
| Mixed precision inference / training | T4: Supported / P4: Limited | Mixed precision means using a mix of higher-precision and lower-precision numeric formats in AI workloads so you get better speed and lower memory use without giving up model quality where precision still matters. | Low | T4 supports Tensor Cores and is much better for FP16/INT8 inference acceleration. P4 lacks Tensor Cores, so mixed-precision benefits are more limited and workload-dependent. (NVIDIA) |
| NCCL (multi-GPU collectives) | Supported, but topology-dependent | NVIDIA Collective Communications Library. It is NVIDIA’s library for fast GPU-to-GPU communication, including both multi-GPU within a server and multi-node across servers. NVIDIA describes it as a topology-aware library of collective communication primitives optimized for NVIDIA GPUs and networking. | Medium | Useful for experimenting with multi-GPU and possibly multi-node communication patterns. It can work over PCIe and network paths, but the quality of results depends heavily on topology and software stack. This is a practical support judgment rather than a clean per-card matrix in the cited pages. (NVIDIA Docs) Supported by both T4 and P4. Candidate tests: single-host NCCL across multiple GPUs in one server; multi-node NCCL over InfiniBand; GPUDirect RDMA-assisted NCCL when the stack and PCIe topology cooperate. |
| GPUDirect P2P / peer-to-peer | Unclear / topology-dependent | GPU peer-to-peer memory access: one NVIDIA GPU can directly access or copy data to another NVIDIA GPU’s memory without staging the transfer through host RAM. | | Peer-to-peer GPU memory access can work in some PCIe topologies, but support and performance vary a lot by motherboard, root complex, ACS/IOMMU behavior, and driver stack. Possibly worth validating experimentally in the homelab. (NVIDIA Docs) |
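Since the DCGM Exporter row describes GPU metrics exposed over HTTP for Prometheus scraping, here is a small Python sketch of what reading one metric from that endpoint could look like. The sample payload is hand-written for illustration (it is not output captured from this lab), the `localhost:9400` URL assumes the exporter's usual default port, and the function names are invented for this example.

```python
import re
from urllib.request import urlopen

# Hand-written sample in Prometheus text format; values are illustrative.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",modelName="Tesla T4",Hostname="node1"} 41
DCGM_FI_DEV_GPU_TEMP{gpu="1",modelName="Tesla P4",Hostname="node1"} 38
DCGM_FI_DEV_POWER_USAGE{gpu="0",modelName="Tesla T4",Hostname="node1"} 27.5
"""

_LINE = re.compile(
    r"^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+"
    r"(?P<value>\S+)(?:\s+\d+)?$"
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def fetch_metrics(url="http://localhost:9400/metrics", timeout=5.0):
    """Download the exporter's metrics page (URL is an assumed default)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_metric(text, metric):
    """Return {gpu index: value} for one metric name."""
    readings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = _LINE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(_LABEL.findall(match.group("labels")))
        readings[labels.get("gpu", "?")] = float(match.group("value"))
    return readings


print(parse_metric(SAMPLE, "DCGM_FI_DEV_GPU_TEMP"))
```

In practice Prometheus and Grafana would do the scraping and graphing; a script like this is mainly handy for confirming the exporter is up and that each T4 and P4 is reporting before wiring up dashboards.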
Project Status
Currently I am in the “waiting for hardware to arrive” stage of the project (mainly due to the switch from CX-3s to CX-4s). So let’s take stock of where we are in the project and outline our next steps.
Current State
- IB Switch Racked
- GPUs physically installed in all systems
- Rough list of supported technologies to install and test
- Basic installation steps and IB troubleshooting documented
Next Steps
- Install all CX-4 adapters, ensure drivers are installed and loaded properly, and possibly update the CX-4 firmware
- Install NVIDIA drivers, CUDA, and the Container Toolkit on all GPU-enabled hardware
- Install IB cables, and power up the Mellanox IB switch (and hope it’s not too loud)
- Install one active Subnet Manager (OpenSM) instance on one of the target hosts
- Configure IB IP addresses
- Work through listed/supported technologies in the NVIDIA matrix above.
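Once the adapters are cabled and OpenSM is running, the port state reported by `ibstat` is a quick way to tell whether the subnet manager has actually brought the fabric up. The sketch below parses `ibstat`-style text and flags ports that are physically linked but still waiting for a subnet manager. The sample text is hand-written to resemble `ibstat` output (the device name and values are illustrative, not captured from this lab), and the helper names are hypothetical.

```python
import re

# Hand-written sample resembling `ibstat` output; values are illustrative.
SAMPLE = """\
CA 'mlx5_0'
    CA type: MT4115
    Number of ports: 1
    Port 1:
        State: Initializing
        Physical state: LinkUp
        Rate: 40
        Base lid: 0
        SM lid: 0
        Link layer: InfiniBand
"""


def parse_ibstat(text):
    """Return one dict per port with the key/value fields listed under it."""
    ports = []
    ca_name = None
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        ca_match = re.match(r"CA '(.+)'", line)
        if ca_match:
            ca_name, current = ca_match.group(1), None
            continue
        port_match = re.match(r"Port (\d+):", line)
        if port_match:
            current = {"ca": ca_name, "port": int(port_match.group(1))}
            ports.append(current)
            continue
        if current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return ports


def diagnose(port):
    """Give a one-line hint based on the logical and physical port state."""
    state = port.get("State", "unknown")
    physical = port.get("Physical state", "unknown")
    if state == "Active":
        return "active: a subnet manager has configured this port"
    if state == "Initializing" and physical == "LinkUp":
        return "link is up but no subnet manager has activated it yet"
    if state == "Down":
        return "port is down: check cabling and switch power"
    return f"unexpected state {state!r} (physical: {physical!r})"


for port in parse_ibstat(SAMPLE):
    print(port["ca"], port["port"], "->", diagnose(port))
```

A port stuck in the "Initializing" state with a physical state of "LinkUp" is the typical symptom of no subnet manager on the fabric, which matters here because the switch is unmanaged and relies on OpenSM running on a host.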
My goal is to document my progress in future posts.