Step-by-Step Nvidia Driver, CUDA Toolkit, & Container Toolkit Install for RHEL9

Introduction

In this step-by-step guide we will replace the out-of-the-box nouveau driver on RHEL9 with the Nvidia drivers. We will also install the Nvidia CUDA Toolkit and the Nvidia Container Toolkit.


GPU and Driver Inspection

First, we need to make sure that our Nvidia GPU is recognized by Red Hat Enterprise Linux 9 (RHEL9).

lspci -nn | grep -i nvidia
b6:00.0 3D controller [0302]: NVIDIA Corporation GA102GL [A40] [10de:2235] (rev a1)
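
As an aside (an alternative, not required by this guide), lspci can report the device and its bound kernel driver in one step:

```shell
# -k adds the "Kernel driver in use" and "Kernel modules" lines to the output.
lspci -nnk | grep -iA3 nvidia
```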

Using the command below, we can see that we are currently using the open-source nouveau driver.

# lspci | grep ' NVIDIA ' | cut -d" " -f 1 | xargs -i lspci -v -s {}
b6:00.0 3D controller: NVIDIA Corporation GA102GL [A40] (rev a1)
	Subsystem: NVIDIA Corporation Device 145a
	Flags: bus master, fast devsel, latency 0, IRQ 32, NUMA node 0
	Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
	Memory at 38d000000000 (64-bit, prefetchable) [size=64G]
	Memory at 38f040000000 (64-bit, prefetchable) [size=32M]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] Null
	Capabilities: [78] Express Legacy Endpoint, MSI 00
	Capabilities: [b4] Vendor Specific Information: Len=14 <?>
	Capabilities: [c8] MSI-X: Enable- Count=6 Masked-
	Capabilities: [100] Virtual Channel
	Capabilities: [258] L1 PM Substates
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] Secondary PCI Express
	Capabilities: [bb0] Physical Resizable BAR
	Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV)
	Capabilities: [c14] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [c1c] Physical Layer 16.0 GT/s <?>
	Capabilities: [d00] Lane Margining at the Receiver <?>
	Capabilities: [e00] Data Link Feature <?>
	Kernel driver in use: nouveau
	Kernel modules: nouveau

Configuring Repositories for the Nvidia Driver Install

First we need to enable the RHEL9 CodeReady Builder repo. Note that we are running these commands as root.

# subscription-manager repos --enable codeready-builder-for-rhel-9-$(uname -i)-rpms

Next we will need to install and configure the EPEL repo.

# dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm

Now we install the ELRepo project repo; this provides nvidia-detect, which we will use later.

# dnf -y install https://www.elrepo.org/elrepo-release-9.el9.elrepo.noarch.rpm

Prerequisites for Nvidia Driver Install

Now we need to install dependencies and build tools.

# dnf install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r) gcc make dkms acpid libglvnd-glx libglvnd-opengl libglvnd-devel pkgconfig

Install Nvidia Drivers

Install nvidia-detect from the ELRepo project repo.

# dnf -y install nvidia-detect

Now install the Nvidia Drivers.

# dnf -y install $(nvidia-detect)

Now reboot.


Confirming Nvidia Driver Installation

Now let's run the command below one more time.

[root@gpu ~]# lspci | grep ' NVIDIA ' | cut -d" " -f 1 | xargs -i lspci -v -s {}
b6:00.0 3D controller: NVIDIA Corporation GA102GL [A40] (rev a1)
	Subsystem: NVIDIA Corporation Device 145a
	Flags: bus master, fast devsel, latency 0, IRQ 32, NUMA node 0
	Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
	Memory at 38d000000000 (64-bit, prefetchable) [size=64G]
	Memory at 38f040000000 (64-bit, prefetchable) [size=32M]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] Null
	Capabilities: [78] Express Legacy Endpoint, MSI 00
	Capabilities: [b4] Vendor Specific Information: Len=14 <?>
	Capabilities: [c8] MSI-X: Enable- Count=6 Masked-
	Capabilities: [100] Virtual Channel
	Capabilities: [250] Latency Tolerance Reporting
	Capabilities: [258] L1 PM Substates
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] Secondary PCI Express
	Capabilities: [bb0] Physical Resizable BAR
	Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV)
	Capabilities: [c14] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [c1c] Physical Layer 16.0 GT/s <?>
	Capabilities: [d00] Lane Margining at the Receiver <?>
	Capabilities: [e00] Data Link Feature <?>
	Kernel driver in use: nvidia
	Kernel modules: nouveau, nvidia_drm, nvidia

As you can see in the output above, the kernel is now using the Nvidia driver. The nouveau kernel module is still listed under "Kernel modules", but that is fine, as it is not loaded. We can confirm this with the command below.

# lsmod | grep nouveau

The above command should produce no output, while the command below should list the loaded nvidia modules.

# lsmod | grep nvidia

Configure Nvidia Persistenced

Start and enable nvidia-persistenced.service. This enables persistence mode, which keeps the Nvidia device state initialized so it does not go “stale” between client uses.

# systemctl enable nvidia-persistenced.service
# systemctl start nvidia-persistenced.service
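
To confirm the daemon is running and persistence mode is on, the checks below can be used (assuming nvidia-smi was installed along with the driver; the query flags are standard nvidia-smi options):

```shell
# The service should report "active".
systemctl is-active nvidia-persistenced.service

# Persistence mode should report "Enabled" for each GPU.
nvidia-smi --query-gpu=persistence_mode --format=csv,noheader
```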

Installing the Nvidia CUDA Toolkit

We will now follow the official guide and install the Nvidia CUDA Toolkit. Per that guide, we need to enable a few repos. Two of those should already be enabled by default, and the third we enabled above, but I will list all three here for the sake of documentation.

# subscription-manager repos --enable=rhel-9-for-x86_64-appstream-rpms
# subscription-manager repos --enable=rhel-9-for-x86_64-baseos-rpms
# subscription-manager repos --enable=codeready-builder-for-rhel-9-x86_64-rpms

Now we install the Nvidia repo for the CUDA toolkit.

# dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo

Now install the CUDA toolkit as shown below.

# dnf -y install cuda-toolkit

Confirm that the toolkit is installed and note the version.

# rpm -qa cuda-toolkit
cuda-toolkit-12.8.1-1.x86_64

Add the following to your .bashrc. If you intend to run or install anything as root, you may want to add it to root's .bashrc as well. Note that the CUDA version in the path should match the version you installed above.

export PATH=/usr/local/cuda-12.8/bin:$PATH
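
Depending on what you build against the toolkit, you may also want the CUDA libraries on the dynamic loader path. The Nvidia installation guide suggests an LD_LIBRARY_PATH entry alongside the PATH entry above; this is optional, and the path below assumes the same 12.8 install:

```shell
# Append only if LD_LIBRARY_PATH is already set; otherwise set it fresh.
export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
```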

Now test nvcc as shown below.

# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Feb_21_20:23:50_PST_2025
Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0
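
Beyond checking the version, you can compile and run a minimal CUDA program as a sanity check. The file name and contents below are illustrative, and running the binary requires the driver installed earlier plus a working GPU:

```shell
# Write a trivial kernel that prints from four GPU threads.
cat > /tmp/hello.cu <<'EOF'
#include <cstdio>

__global__ void hello() {
    printf("Hello from GPU thread %d\n", threadIdx.x);
}

int main() {
    hello<<<1, 4>>>();          // launch 1 block of 4 threads
    cudaDeviceSynchronize();    // wait for the kernel (and its printf) to finish
    return 0;
}
EOF

# Compile with nvcc and run the resulting binary.
nvcc /tmp/hello.cu -o /tmp/hello && /tmp/hello
```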

Installing the Nvidia Container Toolkit

Next we will install the Nvidia Container Toolkit, which allows users to run GPU-accelerated containerized applications.

A bit about Container Management in RHEL 9

The default container packages in RHEL 9 are as follows.

  1. Podman – daemonless container engine
  2. Buildah – tool for building OCI (Open Container Initiative) container images
  3. Skopeo – tool for managing container images and repos
  4. CRIU – tool to create and save running container checkpoints to disk
  5. Udica – tool for managing SELinux policies for containers

Installation of the toolkit

We will follow the official Nvidia Container Toolkit install guide (see References).

First we configure the repo.

# curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
  sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo

Then install via dnf.

# dnf install -y nvidia-container-toolkit

Configuring the Container Toolkit for Podman

Generate the CDI specification file using the command below.

# nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

Now let's check the names of the generated device(s).

# nvidia-ctk cdi list
INFO[0000] Found 3 CDI devices                          
nvidia.com/gpu=0
nvidia.com/gpu=GPU-7e880be2-891c-72e3-9515-0fd51240e7f4
nvidia.com/gpu=all
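
With the CDI spec generated, you can pass a GPU to a container by device name. The command below is adapted from the Nvidia CDI support documentation; the UBI 9 image is just an example, since nvidia-smi is injected into the container by the CDI spec:

```shell
# Run nvidia-smi inside a container with all GPUs attached via CDI.
# --security-opt=label=disable relaxes SELinux labeling for the injected
# driver files (see the Nvidia docs for SELinux-specific alternatives).
podman run --rm --security-opt=label=disable \
  --device nvidia.com/gpu=all \
  registry.access.redhat.com/ubi9/ubi nvidia-smi -L
```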

References

  1. https://medium.com/@blackhorseya/step-by-step-guide-to-installing-nvidia-drivers-on-rhel-9-1107e0cd641d
  2. https://access.redhat.com/discussions/227d2101-b4e3-490a-aa1c-601c407ec038
  3. https://darryldias.me/2022/install-nvidia-drivers-on-rhel-9/
  4. https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
  5. https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#network-repo-installation-for-rhel-rocky
  6. https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html
