Project “NVIDIA HPC Infiniband Homelab GPU Cluster”: Part 3: RDMA Performance Testing

The project review outlines the successful construction and configuration of a 3-node GPU cluster with InfiniBand networking. It details performance testing procedures and confirms healthy RDMA performance with minimal tuning.

The Homelab Dilemma: Living With Enterprise Servers

Enterprise servers are loud and hot, and while a couple of 13th-generation Dell servers kept a room nice and toasty in the cold of winter, they are a double whammy to your electric bill in the spring and summer. The noise can be problematic as well, especially since I no longer have a basement, where I relied …

Project “NVIDIA HPC Infiniband Homelab GPU Cluster”: Part 2: Infiniband Setup

The post details the installation and configuration of three Mellanox ConnectX-4 adapters across multiple servers. It covers verifying detection, driver loading, InfiniBand setup, subnet management, and IP over InfiniBand configuration for effective connectivity and testing in a lab environment.

Project “NVIDIA HPC Infiniband Homelab GPU Cluster”: Part 1: Project Overview

Introduction: InfiniBand is a mature interconnect technology known for high bandwidth and low latency. It has long been used in supercomputing and HPC environments, and has also been deployed in certain storage and clustered infrastructure designs as an alternative to Fibre Channel. More recently, InfiniBand has seen strong continued adoption in large-scale AI and GPU …

How to Set Up NVIDIA CUDA and Container Toolkits on RHEL 10

This guide details installing NVIDIA drivers, CUDA Toolkit, and Container Toolkit on RHEL 10.1, utilizing simplified methods and new repositories for streamlined setup and verification processes.

RHEL 10 – Enable Health Monitoring for NVIDIA GPUs Using DCGM Exporter

NVIDIA Data Center GPU Manager (DCGM) facilitates GPU health monitoring, performance telemetry, and diagnostics for NVIDIA GPUs on servers. This post outlines installation and setup for RHEL 10, including driver validation and metric visualization.