This post details the installation and configuration of three Mellanox ConnectX-4 adapters across multiple servers. It covers verifying detection, driver loading, InfiniBand setup, subnet management, and IP over InfiniBand configuration for connectivity and testing in a lab environment.
RHEL10
Project “NVIDIA HPC Infiniband Homelab GPU Cluster”: Part 1: Project Overview
Introduction: InfiniBand is a mature interconnect technology known for high bandwidth and low latency. It has long been used in supercomputing and HPC environments, and has also been deployed in certain storage and clustered infrastructure designs as an alternative to Fibre Channel. More recently, InfiniBand has seen strong continued adoption in large-scale AI and GPU …
How to Set Up NVIDIA CUDA and Container Toolkits on RHEL 10
This guide details installing NVIDIA drivers, CUDA Toolkit, and Container Toolkit on RHEL 10.1, utilizing simplified methods and new repositories for streamlined setup and verification processes.
RHEL 10 – Enable Health Monitoring for NVIDIA GPUs Using DCGM Exporter
NVIDIA Data Center GPU Manager (DCGM) provides GPU health monitoring, performance telemetry, and diagnostics for NVIDIA GPUs on servers. This post outlines installation and setup on RHEL 10, including driver validation and metric visualization.
Fix GPG Check Failed Error on RHEL 10.1
Overview: On some RHEL 10.1 installs, users run into this error after installation when attempting to install packages via dnf. It is unclear whether the issue is isolated to installs from the full DVD ISO, installs from the minimal boot ISO, or deployments via kickstart. GPG keys are configured as: file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release …
Configuring LACP on TP-Link SX3008F for RHEL 9/10
This guide details setting up three LACP port-channels on a TP-Link SX3008F switch for RHEL 9/10 hosts, enabling 20GbE connectivity for efficient NFS backups.
Moving Beyond OMSA: A Guide to Dell iSM Installation on RHEL 10 and PowerEdge R730
The shift from srvadmin (OMSA) to iSM (iDRAC Service Module) marks the end of bloated, "in-band" server management. This shift occurred between Dell's 12th-generation and 13th-generation servers. If you have a 12th-generation Dell server, you can still use Dell srvadmin (iDRAC 7). I wrote a post on it here. While OMSA ran …