Essential Commands to Monitor Nvidia GPUs in Linux

Identify Your GPU Via the Linux CLI

First, verify that the OS recognizes your card with the hwinfo command below.

# hwinfo --gfxcard --short
graphics card:                                                  
                       nVidia TU104GL [Tesla T4]
                       nVidia TU104GL [Tesla T4]
                       Matrox G200eR2

Primary display adapter: #58

Or you can see similar output with lshw

# lshw -C display
  
  *-display
       description: 3D controller
       product: TU104GL [Tesla T4]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:43:00.0
       logical name: /dev/fb0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm bus_master cap_list fb
       configuration: depth=32 driver=nvidia latency=0 mode=1280x1024 visual=truecolor xres=1280 yres=1024
       resources: iomemory:3800-37ff iomemory:3810-380f irq:106 memory:d0000000-d0ffffff memory:38000000000-3800fffffff memory:38110000000-38111ffffff memory:d1000000-d13fffff memory:38010000000-3810fffffff memory:38112000000-38131ffffff

Nvidia-smi

The NVIDIA System Management Interface (nvidia-smi) is a CLI utility that facilitates management and monitoring of NVIDIA GPUs (mainly Tesla, GRID, Quadro, and Titan products). It ships with the NVIDIA GPU driver on Linux and is built on top of the NVIDIA Management Library (NVML). Official documentation for nvidia-smi can be found here.

The output comprises two tables. The first table provides comprehensive details about all detected GPUs (e.g., one GPU in the provided example), while the second table enumerates the processes actively utilizing the GPUs. Below are detailed explanations of each parameter:

  • Temp (Temperature): Indicates the GPU core temperature in Celsius. Typically, temperature regulation is managed by data center infrastructure or external cooling solutions. Values like “44°C” are normal operating conditions, but sustained temperatures exceeding 90°C should trigger immediate action to prevent hardware degradation.
  • Perf (Performance State): Represents the current performance state of the GPU, ranging from P0 (highest performance) to P12 (lowest performance).
  • Persistence-M (Persistence Mode): Specifies whether the NVIDIA driver remains loaded in memory even when no active clients (such as nvidia-smi) are connected. When “On,” this mode reduces driver load latency for GPU-dependent applications such as CUDA workloads.
  • Pwr: Usage/Cap (Power Usage/Capacity): Displays the current power draw of the GPU relative to its total power capacity, measured in Watts.
  • Bus-Id: Represents the PCI bus address of the GPU in the format domain:bus:device.function (hexadecimal). This identifier is critical for targeting specific GPUs in systems with multiple devices.
  • Disp.A (Display Active): Denotes whether memory on the GPU is allocated for display purposes. An “Off” value signifies no display context is associated with the GPU, making it dedicated to compute tasks.
  • Memory-Usage: Indicates memory utilization on the GPU, expressed as the amount of memory in use versus total available memory. Machine learning frameworks like TensorFlow may preallocate the full GPU memory capacity upon initialization, irrespective of immediate requirements.
  • Volatile Uncorr. ECC (Volatile Uncorrectable ECC): Counts uncorrectable memory errors since the last driver load. Error Correction Code (ECC) is designed to detect and correct memory errors, ensuring data integrity during GPU operations.
  • GPU-Util (GPU Utilization): Reports the percentage of time over the sample interval during which one or more kernels actively used the GPU.
  • Compute M. (Compute Mode): Specifies the GPU’s compute mode. In “Default” mode, multiple processes can access the GPU concurrently. Other modes may restrict access to a single process or prohibit access entirely.
  • GPU (Index): Enumerates the GPUs detected in the system. The index corresponds to the NVML (NVIDIA Management Library) device index, enabling precise identification in multi-GPU environments.
  • PID (Process ID): Lists the process identifier of applications utilizing GPU resources.
  • Type: Describes the context of GPU usage—“C” for Compute tasks, “G” for Graphics tasks, and “C+G” for combined Compute and Graphics contexts.
  • Process Name: Identifies the executable or application utilizing GPU resources.
  • GPU Memory Usage: Reports the GPU memory utilized by each individual process.

Use the -a switch (a deprecated alias for -q) for full per-GPU details

Sample Nvidia-SMI Commands

Query total, free, and used memory

# nvidia-smi --query-gpu=index,name,uuid,memory.total,memory.free,memory.used --format=csv
index, name, uuid, memory.total [MiB], memory.free [MiB], memory.used [MiB]
0, Tesla T4, GPU-9491a3e6-ea29-ba4e-4403-083244d5575c, 15360 MiB, 14928 MiB, 2 MiB
1, Tesla T4, GPU-1d877ac8-5df1-34b0-4f86-59945e37d2ba, 15360 MiB, 14928 MiB, 2 MiB
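Because the CSV output is machine-readable, it is easy to post-process in a script. Below is a minimal sketch that computes the percentage of memory in use per GPU; the printf supplies the sample output above as canned input, and on a live system you would pipe nvidia-smi in directly instead.

```shell
#!/bin/sh
# Report percent of GPU memory in use, parsed from CSV fields
# index, memory.total, memory.used (MiB). On a real host, replace
# the printf with:
#   nvidia-smi --query-gpu=index,memory.total,memory.used --format=csv,noheader,nounits
printf '0, 15360, 2\n1, 15360, 2\n' |
awk -F', *' '{ printf "GPU %s: %.2f%% used\n", $1, 100 * $3 / $2 }'
```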

Query Temperatures

# nvidia-smi --query-gpu=name,temperature.gpu --format=csv
name, temperature.gpu
Tesla T4, 34
Tesla T4, 30
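This query is handy for ad-hoc alerting. A minimal sketch that flags any GPU above a temperature threshold; the threshold and the canned input lines are illustrative, and on a real host you would pipe in live nvidia-smi output.

```shell
#!/bin/sh
# Warn about GPUs running hotter than THRESHOLD degrees C.
# The canned sample stands in for:
#   nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits
THRESHOLD=90
printf '0, 34\n1, 95\n' |
awk -F', *' -v max="$THRESHOLD" '$2 + 0 > max + 0 { print "GPU " $1 " is hot: " $2 "C" }'
```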

Query PCI Slot

# nvidia-smi --query-gpu=index,name,pci.bus_id --format=csv
index, name, pci.bus_id
0, Tesla T4, 00000000:02:00.0
1, Tesla T4, 00000000:43:00.0

Show NUMA Affinity

Non-Uniform Memory Access (NUMA) comes into play on systems with more than one bus/CPU. In the example below, my GPU is installed in the NUMA node local to CPU0. Use the command nvidia-smi topo -m to display the topology.
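You can cross-check the reported affinity against sysfs. The sketch below normalizes the bus id format nvidia-smi prints (8-digit PCI domain) to the 4-digit form used under /sys/bus/pci/devices; the final lookup is commented out so the snippet runs on any machine, and the bus id is taken from the earlier example.

```shell
#!/bin/sh
# nvidia-smi prints bus ids like 00000000:43:00.0, but sysfs uses a
# 4-digit PCI domain (0000:43:00.0). Strip the extra leading zeros
# and lowercase any hex letters, then the NUMA node can be read from
# sysfs on a host that actually has the device.
busid="00000000:43:00.0"
sysid=$(printf '%s\n' "$busid" | sed 's/^0000//' | tr 'A-F' 'a-f')
echo "$sysid"
# cat "/sys/bus/pci/devices/$sysid/numa_node"   # e.g. 0 for CPU0-local
```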


Show Running Stats with dmon

The dmon subcommand is used to show running statistics for one or more GPUs at one-second intervals. It accepts a slew of options, which are explained here.

Below are the available base metrics and their associated switch letters.

SWITCH   DESCRIPTION
p        Power Usage and Temperature
u        Utilization
c        Proc and Mem Clocks
v        Power and Thermal Violations
m        FB, Bar1 and CC Protected Memory
e        ECC Errors and PCIe Replay errors
t        PCIe Rx and Tx Throughput
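As a concrete example, the sketch below assembles a dmon invocation from the switch letters above and only runs it if the driver is present, so it is safe to paste on any host; the metric selection and sample count are illustrative.

```shell
#!/bin/sh
# Watch power/temperature (p) and utilization (u) for five
# one-second samples. The guard keeps the sketch harmless on a
# machine without the NVIDIA driver installed.
metrics="pu"
cmd="nvidia-smi dmon -s $metrics -c 5"
echo "$cmd"
if command -v nvidia-smi >/dev/null 2>&1; then
    $cmd
fi
```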

The nvidia-smi dmon command can also query available GPM (GPU Performance Monitoring) metrics, as shown in the example below.

# nvidia-smi dmon --gpm-metrics <gpmMetric1,gpmMetric2,...,gpmMetricN>

The table below shows some of the available metrics and associated metric number. A complete list of metrics can be found here.

METRIC                 VALUE
Graphics Activity      1
SM Activity            2
SM Occupancy           3
Integer Activity       4
Tensor Activity        5
DFMA Tensor Activity   6
HMMA Tensor Activity   7
IMMA Tensor Activity   9
DRAM Activity          10
FP64 Activity          11
FP32 Activity          12
FP16 Activity          13
PCIe TX                20
PCIe RX                21
NVDEC 0-7 Activity     30-37
NVJPG 0-7 Activity     40-47
NVOFA 0 Activity       50
NVLink Total RX        60
NVLink Total TX        61
NVLink L0-17 RX        62, 64, 66, …, 96
NVLink L0-17 TX        63, 65, 67, …, 97

Enable persistence mode to reduce initialization overhead by keeping the driver loaded even when no clients are active.

# sudo nvidia-smi -pm 1
Enabled persistence mode for GPU 00000000:43:00.0.
All done.
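To apply this across several GPUs, you can query the current state and only toggle the ones that report persistence mode disabled. In the sketch below the canned input stands in for the live query, and the sudo command is echoed rather than executed so the snippet is safe to run anywhere.

```shell
#!/bin/sh
# Enable persistence mode only where it is currently Disabled.
# The canned sample stands in for:
#   nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader
printf '0, Disabled\n1, Enabled\n' |
awk -F', *' '$2 == "Disabled" { print $1 }' |
while read -r idx; do
    echo "would run: sudo nvidia-smi -i $idx -pm 1"
done
```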

NVtop

nvtop is an interactive, htop-like monitor for GPUs. More info on nvtop can be found here

$ sudo apt install nvtop

Output example below


GPUstat

A wrapper of sorts for nvidia-smi. More info here.

# apt install gpustat -y

Output below

galactica    Thu Feb  6 22:43:59 2025  535.183.01
[0] Tesla T4 | 36°C,   0 % |     2 / 15360 MB |
[1] Tesla T4 | 30°C,   0 % |     2 / 15360 MB |

Installing the Nvidia Container Toolkit on Ubuntu 22.04

The NVIDIA Container Toolkit is a set of tools that enables the use of NVIDIA GPUs within Docker and other container runtimes. It allows GPU-accelerated applications to run inside containers by exposing the host's driver libraries, devices, and runtime components to them. Instructions for installing the NVIDIA Container Toolkit are below. The official NVIDIA doc can be found here, where you can also find guides for installing via DNF/Yum or Zypper.

~# curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/stable/deb/$(ARCH) /
#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/experimental/deb/$(ARCH) /

Now refresh the package list

~# sudo apt-get update

Now install the Container Toolkit

# sudo apt-get install -y nvidia-container-toolkit

Configure the NVIDIA-Container Toolkit with Containerd

For Ubuntu, the default runtime is containerd. In the example below we configure integration with containerd, which modifies /etc/containerd/config.toml

# sudo nvidia-ctk runtime configure --runtime=containerd
# sudo systemctl restart containerd

Alternatively, configure Docker as shown below.

Configure the NVIDIA Container Toolkit with Docker

The nvidia-ctk command modifies the /etc/docker/daemon.json file on the host. The file is updated so that Docker can use the NVIDIA Container Runtime.

# sudo nvidia-ctk runtime configure --runtime=docker

I have also seen that it may be necessary to add the default-runtime parameter directly to /etc/docker/daemon.json.

Run the command below to see what runtimes Docker is using

docker info | grep "Runtime"
 Runtimes: io.containerd.runc.v2 nvidia runc
 Default Runtime: runc

Make a backup copy of /etc/docker/daemon.json

# cp /etc/docker/daemon.json /etc/docker/daemon.json.ORIG

Modify the file as shown below.

# cat daemon.json
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
         } 
    },
    "default-runtime": "nvidia" 
}
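A typo in daemon.json can leave the Docker daemon unable to start, so it is worth validating the syntax before restarting. The sketch below checks a temporary copy of the file above with Python's stdlib JSON parser; on a real host, point it at /etc/docker/daemon.json instead.

```shell
#!/bin/sh
# Validate daemon.json syntax before restarting Docker.
f=$(mktemp)
cat > "$f" <<'EOF'
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
EOF
if python3 -m json.tool "$f" >/dev/null; then
    echo "daemon.json: valid JSON"
fi
rm -f "$f"
```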

Restart Docker

# systemctl restart docker

Check the output of docker info and ensure that nvidia is the default runtime

# docker info | grep "Runtime"
 Runtimes: io.containerd.runc.v2 nvidia runc
 Default Runtime: nvidia

Find your Nvidia Devices in /dev

# sudo ls -la /dev | grep nvidia
crw-rw-rw- 1 root root 195, 0 Feb 4 03:54 nvidia0
crw-rw-rw- 1 root root 195, 1 Feb 4 03:54 nvidia1

NGC CLI

NVIDIA NGC (NVIDIA GPU Cloud) CLI is a command-line interface tool for managing Docker containers in the NVIDIA NGC registry. Download the CLI here.

Once downloaded, unzip the archive and make the binary executable

chmod u+x ngc-cli/ngc

Add the directory containing the binary to your PATH

echo "export PATH=\"\$PATH:$(pwd)/ngc-cli\"" >> ~/.bash_profile && source ~/.bash_profile

You will need an NVIDIA Cloud account and an API key; follow the setup guide here to get started.

Then log in with docker login as shown below, using $oauthtoken as the username and your API key as the password

# docker login nvcr.io
Username: $oauthtoken
Password: 
WARNING! Your password will be stored unencrypted in /root/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded

nvidia-ctk

A Container Device Interface (CDI) device is a standard way to describe container hardware access. More specifically, the NVIDIA Container Toolkit uses CDI to assign GPUs to containers.

Run the command below to generate the CDI specification file

# sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

Then run the list command below to see what GPUs were detected.

# nvidia-ctk cdi list
INFO[0000] Found 5 CDI devices                          
nvidia.com/gpu=0
nvidia.com/gpu=1
nvidia.com/gpu=GPU-1d877ac8-5df1-34b0-4f86-59945e37d2ba
nvidia.com/gpu=GPU-9491a3e6-ea29-ba4e-4403-083244d5575c
nvidia.com/gpu=all

Running a Sample Docker Workload

Run the command below to verify that Docker can see and use the GPUs

# sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

The command above should run nvidia-smi once and then exit.

GPU Burn With Docker

Open a second terminal window, start nvtop, and then run the command below; you should see load on your GPUs. In the example, 60 is the number of seconds to run the test.

# sudo docker run --gpus all --rm oguzpastirmaci/gpu-burn 60

See below. GPUs running at 100% load.



Reference

  1. https://org.ngc.nvidia.com/setup/installers/cli
  2. https://docs.nvidia.com/deploy/nvidia-smi/index.html
  3. https://taozhi.medium.com/monitor-nvidia-gpu-by-nvidia-smi-cli-56198fbf8e62
  4. https://www.gpu-mart.com/blog/monitor-gpu-utilization-with-nvidia-smi
  5. https://programmersought.com/article/84455484104/
  6. https://docs.nvidia.com/deploy/driver-persistence/index.html
  7. https://www.incredibuild.com/integrations/cuda
  8. https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
  9. https://docs.nvidia.com/deploy/nvml-api/group__nvmlGpmEnums.html
  10. https://docs.nvidia.com/deploy/pdf/NVML_API_Reference_Guide.pdf
  11. https://hub.docker.com/r/oguzpastirmaci/gpu-burn
