Project “NVIDIA HPC Infiniband Homelab GPU Cluster”: Part 2: Infiniband Setup

This part details the installation and configuration of three Mellanox ConnectX-4 adapters across multiple servers. It covers verifying detection, driver loading, InfiniBand setup, subnet management, and IP over InfiniBand (IPoIB) configuration for connectivity and testing in a lab environment.

Project “NVIDIA HPC Infiniband Homelab GPU Cluster”: Part 1: Project Overview

Introduction: InfiniBand is a mature interconnect technology known for high bandwidth and low latency. It has long been used in supercomputing and HPC environments, and has also been deployed in certain storage and clustered infrastructure designs as an alternative to Fibre Channel. More recently, InfiniBand has seen strong continued adoption in large-scale AI and GPU …

Dell OpenManage Server Administrator: Comprehensive Guide for Hardware Monitoring (RHEL)(Dell 12 Gen)

Dell OpenManage Server Administrator (OMSA) is Dell’s on-host hardware management and monitoring framework for PowerEdge servers. It runs inside the operating system and provides direct visibility into system hardware such as RAID controllers, physical and virtual disks, power supplies, fans, temperatures, memory, processors, and chassis health. OMSA communicates with the server’s iDRAC and hardware controllers …