The project review outlines the successful construction and configuration of a 3-node GPU cluster with InfiniBand networking. It details performance testing procedures and confirms healthy RDMA performance with minimal tuning.
Author: Christopher Paquin
The Homelab Dilemma: Living With Enterprise Servers
Enterprise servers are loud and hot, and while a couple of 13th-generation Dell servers kept a room nice and toasty in the cold of winter, they are a double whammy to your electric bill in the spring and summer. The noise can be problematic as well, especially since I no longer have a basement, where I relied …
Project “NVIDIA HPC Infiniband Homelab GPU Cluster”: Part 2: Infiniband Setup
The document details the installation and configuration of three Mellanox ConnectX-4 Adapters across multiple servers. It covers verifying detection, driver loading, InfiniBand setup, subnet management, and IP over InfiniBand configuration for effective connectivity and testing in a lab environment.
Project “NVIDIA HPC Infiniband Homelab GPU Cluster”: Part 1: Project Overview
InfiniBand is a mature interconnect technology known for high bandwidth and low latency. It has long been used in supercomputing and HPC environments, and has also been deployed in certain storage and clustered infrastructure designs as an alternative to Fibre Channel. More recently, InfiniBand has seen strong continued adoption in large-scale AI and GPU …
How to Set Up NVIDIA CUDA and Container Toolkits on RHEL 10
This guide details installing NVIDIA drivers, CUDA Toolkit, and Container Toolkit on RHEL 10.1, utilizing simplified methods and new repositories for streamlined setup and verification processes.
CPU Overclocking on Ubuntu 24.04
In this post we will overclock the Intel Core i7-8086K Special Edition CPU, Intel's first CPU to hit 5.0 GHz out of the box.
RHEL 10 – Enable Health Monitoring for NVIDIA GPUs Using DCGM Exporter
NVIDIA Data Center GPU Manager (DCGM) provides health monitoring, performance telemetry, and diagnostics for NVIDIA GPUs on servers. This post outlines installation and setup on RHEL 10, including driver validation and metric visualization.