Project “NVIDIA HPC Infiniband Homelab GPU Cluster”: Part 2: Infiniband Setup

This post details the installation and configuration of three Mellanox ConnectX-4 adapters across multiple servers. It covers verifying detection, driver loading, InfiniBand setup, subnet management, and IP over InfiniBand configuration for connectivity and testing in a lab environment.

Project “NVIDIA HPC Infiniband Homelab GPU Cluster”: Part 1: Project Overview

Introduction: InfiniBand is a mature interconnect technology known for high bandwidth and low latency. It has long been used in supercomputing and HPC environments, and has also been deployed in certain storage and clustered infrastructure designs as an alternative to Fibre Channel. More recently, InfiniBand has seen strong continued adoption in large-scale AI and GPU …

How to Set Up NVIDIA CUDA and Container Toolkits on RHEL 10

This guide details installing NVIDIA drivers, the CUDA Toolkit, and the Container Toolkit on RHEL 10.1, using simplified methods and new repositories to streamline setup and verification.

RHEL 10 – Enable Health Monitoring for NVIDIA GPUs Using DCGM Exporter

NVIDIA Data Center GPU Manager (DCGM) provides health monitoring, performance telemetry, and diagnostics for NVIDIA GPUs on servers. This post outlines installation and setup for RHEL 10, including driver validation and metric visualization.

Fix GPG Check Failed Error on RHEL 10.1

Overview: On some RHEL 10.1 installs, users run into this error post-install when attempting to install packages via dnf. It is unclear whether the issue is isolated to installs from the full DVD ISO, installs from the minimal boot ISO, or deployments of RHEL 10.1 via kickstart. GPG keys are configured as: file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release …

Moving Beyond OMSA: A Guide to Dell iSM Installation on RHEL 10 and PowerEdge R730

The shift from srvadmin (OMSA) to iSM (iDRAC Service Module) marks the end of bloated, "in-band" server management. This shift occurred between Dell's 12th and 13th generation servers. If you have a 12th-generation Dell server, you can still use Dell srvadmin (iDRAC 7). I wrote a post on it here. While OMSA ran …