Identify Your GPU Via the Linux CLI
Verify that your card is recognized by the OS using the hwinfo command shown below
# hwinfo --gfxcard --short
graphics card:
nVidia TU104GL [Tesla T4]
nVidia TU104GL [Tesla T4]
Matrox G200eR2
Primary display adapter: #58
Or you can see similar output with lshw
# lshw -C display
*-display
description: 3D controller
product: TU104GL [Tesla T4]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:43:00.0
logical name: /dev/fb0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm bus_master cap_list fb
configuration: depth=32 driver=nvidia latency=0 mode=1280x1024 visual=truecolor xres=1280 yres=1024
resources: iomemory:3800-37ff iomemory:3810-380f irq:106 memory:d0000000-d0ffffff memory:38000000000-3800fffffff memory:38110000000-38111ffffff memory:d1000000-d13fffff memory:38010000000-3810fffffff memory:38112000000-38131ffffff
Nvidia-smi
The NVIDIA System Management Interface (nvidia-smi) is a CLI utility that facilitates management and monitoring of NVIDIA GPUs (mainly Tesla, GRID, Quadro, and Titan products). It ships with the NVIDIA GPU driver on Linux and is built on top of the NVIDIA Management Library (NVML). Official documentation for nvidia-smi can be found here.

The output comprises two tables. The first table provides comprehensive details about all detected GPUs (e.g., one GPU in the provided example), while the second table enumerates the processes actively utilizing the GPUs. Below are detailed explanations of each parameter:
- Temp (Temperature): Indicates the GPU core temperature in Celsius. Typically, temperature regulation is managed by data center infrastructure or external cooling solutions. Values like “44°C” are normal operating conditions, but sustained temperatures exceeding 90°C should trigger immediate action to prevent hardware degradation.
- Perf (Performance State): Represents the current performance state of the GPU, ranging from P0 (highest performance) to P12 (lowest performance).
- Persistence-M (Persistence Mode): Specifies whether the NVIDIA driver remains loaded in memory even in the absence of active clients such as nvidia-smi. When “On,” this mode reduces driver load latency for GPU-dependent applications such as CUDA workloads.
- Pwr: Usage/Cap (Power Usage/Capacity): Displays the current power draw of the GPU relative to its total power capacity, measured in Watts.
- Bus-Id: Represents the PCI bus address of the GPU in the format domain:bus:device.function (hexadecimal). This identifier is critical for targeting specific GPUs in systems with multiple devices.
- Disp.A (Display Active): Denotes whether memory on the GPU is allocated for display purposes. An “Off” value signifies no display context is associated with the GPU, making it dedicated to compute tasks.
- Memory-Usage: Indicates memory utilization on the GPU, expressed as the amount of memory in use versus total available memory. Machine learning frameworks like TensorFlow may preallocate the full GPU memory capacity upon initialization, irrespective of immediate requirements.
- Volatile Uncorr. ECC (Volatile Uncorrectable ECC): Tracks uncorrectable memory errors since the last driver load. Error Correction Code (ECC) is designed to detect and correct memory errors, ensuring data integrity during GPU operations.
- GPU-Util (GPU Utilization): Reports the percentage of time over the sample interval during which one or more kernels actively used the GPU.
- Compute M. (Compute Mode): Specifies the GPU’s compute mode. In “Default” mode, multiple processes can access the GPU concurrently. Other modes may restrict access to a single process or prohibit access entirely.
- GPU (Index): Enumerates the GPUs detected in the system. The index corresponds to the NVML (NVIDIA Management Library) device index, enabling precise identification in multi-GPU environments.
- PID (Process ID): Lists the process identifier of applications utilizing GPU resources.
- Type: Describes the context of GPU usage—“C” for Compute tasks, “G” for Graphics tasks, and “C+G” for combined Compute and Graphics contexts.
- Process Name: Identifies the executable or application utilizing GPU resources.
- GPU Memory Usage: Reports the GPU memory utilized by each individual process.
Use the -a switch (an alias for -q) for more detailed output.
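The per-process table can also be pulled in machine-readable CSV form for scripting. A minimal sketch; the guard is only there so the snippet degrades gracefully on hosts without the driver:

```shell
# Query the processes currently using the GPUs as CSV; --query-compute-apps
# takes the same --format options as --query-gpu.
if command -v nvidia-smi >/dev/null 2>&1; then
  procs=$(nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv)
else
  procs="nvidia-smi not available on this host"
fi
echo "$procs"
```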

Sample Nvidia-SMI Commands
Query total, free, and used memory
# nvidia-smi --query-gpu=index,name,uuid,memory.total,memory.free,memory.used --format=csv
index, name, uuid, memory.total [MiB], memory.free [MiB], memory.used [MiB]
0, Tesla T4, GPU-9491a3e6-ea29-ba4e-4403-083244d5575c, 15360 MiB, 14928 MiB, 2 MiB
1, Tesla T4, GPU-1d877ac8-5df1-34b0-4f86-59945e37d2ba, 15360 MiB, 14928 MiB, 2 MiB
Query Temperatures
# nvidia-smi --query-gpu=name,temperature.gpu --format=csv
name, temperature.gpu
Tesla T4, 34
Tesla T4, 30
Query PCI Slot
# nvidia-smi --query-gpu=index,name,pci.bus_id --format=csv
index, name, pci.bus_id
0, Tesla T4, 00000000:02:00.0
1, Tesla T4, 00000000:43:00.0
Show Numa Affinity
Non-Uniform Memory Access (NUMA) is a term used on systems with more than one bus/CPU socket. In the example below, my GPU is installed in the NUMA node local to CPU0. Use the command nvidia-smi topo -m
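A sketch of the topology query; the guard only covers hosts where the NVIDIA driver is not installed:

```shell
# Print the GPU/CPU topology matrix, including NUMA and CPU affinity columns.
if command -v nvidia-smi >/dev/null 2>&1; then
  topo_out=$(nvidia-smi topo -m)
else
  topo_out="nvidia-smi not available on this host"
fi
echo "$topo_out"
```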

Show Running Stats with dmon
The dmon subcommand is used to show running statistics for one or more GPUs at 1 s intervals. dmon accepts a slew of options, which are explained here.

Below are the available base metrics and associated metric letter.
| SWITCH | DESCRIPTION |
| --- | --- |
| p | Power Usage and Temperature |
| u | Utilization |
| c | Proc and Mem Clocks |
| v | Power and Thermal Violations |
| m | FB, Bar1 and CC Protected Memory |
| e | ECC Errors and PCIe Replay errors |
| t | PCIe Rx and Tx Throughput |
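For example, to sample the power, utilization, and memory groups from the table above every two seconds for five samples on GPU 0 (guarded for hosts without the driver):

```shell
# -s selects metric groups (here p, u, m); -d is the interval in seconds;
# -c is the sample count; -i restricts the report to GPU index 0.
if command -v nvidia-smi >/dev/null 2>&1; then
  dmon_out=$(nvidia-smi dmon -s pum -d 2 -c 5 -i 0)
else
  dmon_out="nvidia-smi not available on this host"
fi
echo "$dmon_out"
```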
The nvidia-smi dmon command is also able to query available GPM (GPU Performance Monitor) metrics, as shown in the example below
# nvidia-smi dmon --gpm-metrics <gpmMetric1,gpmMetric2,...,gpmMetricN>
The table below shows some of the available metrics and associated metric number. A complete list of metrics can be found here.
| METRIC | VAR |
| --- | --- |
| Graphics Activity | 1 |
| SM Activity | 2 |
| SM Occupancy | 3 |
| Integer Activity | 4 |
| Tensor Activity | 5 |
| DFMA Tensor Activity | 6 |
| HMMA Tensor Activity | 7 |
| IMMA Tensor Activity | 9 |
| DRAM Activity | 10 |
| FP64 Activity | 11 |
| FP32 Activity | 12 |
| FP16 Activity | 13 |
| PCIe TX | 20 |
| PCIe RX | 21 |
| NVDEC 0-7 Activity | 30-37 |
| NVOFA 0 Activity | 50 |
| NVJPG 0-7 Activity | 40-47 |
| NVLink Total RX | 60 |
| NVLink Total TX | 61 |
| NVLink L0-17 RX | 62, 64, 66, …, 96 |
| NVLink L0-17 TX | 63, 65, 67, …, 97 |
Enable persistence mode to reduce initialization overhead and keep the GPU active and running.
# sudo nvidia-smi -pm 1
Enabled persistence mode for GPU 00000000:43:00.0.
All done.
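You can confirm the new state per GPU with a query; persistence_mode is a valid --query-gpu property, and the guard only covers hosts without the driver:

```shell
# Report persistence mode for every GPU, one CSV line per device.
if command -v nvidia-smi >/dev/null 2>&1; then
  pm_state=$(nvidia-smi --query-gpu=index,persistence_mode --format=csv)
else
  pm_state="nvidia-smi not available on this host"
fi
echo "$pm_state"
```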
NVtop
More info on nvtop can be found here
$ sudo apt install nvtop
Output example below

GPUstat
A wrapper of sorts for nvidia-smi. More info here.
# apt install gpustat -y
Output below
galactica Thu Feb 6 22:43:59 2025 535.183.01
[0] Tesla T4 | 36°C, 0 % | 2 / 15360 MB |
[1] Tesla T4 | 30°C, 0 % | 2 / 15360 MB |
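gpustat also supports watch mode and JSON output for scripting. A small sketch, guarded for hosts where the tool is not installed:

```shell
if command -v gpustat >/dev/null 2>&1; then
  gpustat_out=$(gpustat --json)   # machine-readable snapshot of all GPUs
  # gpustat -i 1                  # watch mode, refreshing every second
else
  gpustat_out="gpustat not installed on this host"
fi
echo "$gpustat_out"
```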
Installing the Nvidia Container Toolkit on Ubuntu 22.04
The NVIDIA Container Toolkit is a set of tools that enables the use of NVIDIA GPUs within Docker and other container runtimes. It allows GPU-accelerated applications to run inside containers by providing the necessary drivers, libraries, and runtime components. Instructions for installing the NVIDIA Container Toolkit are below. The official NVIDIA doc can be found here, where you can also find guides for installing via DNF/Yum or Zypper.
~# curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/stable/deb/$(ARCH) /
#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/experimental/deb/$(ARCH) /
Now refresh the package list
~# sudo apt-get update
Now install the Container Toolkit
# sudo apt-get install -y nvidia-container-toolkit
Configure the NVIDIA-Container Toolkit with Containerd
For Ubuntu, the default runtime is containerd. In the example below we configure integration with containerd, which modifies /etc/containerd/config.toml
# sudo nvidia-ctk runtime configure --runtime=containerd
# sudo systemctl restart containerd
Or use Docker as shown below.
Configure the NVIDIA Container Toolkit with Docker
The nvidia-ctk command modifies the /etc/docker/daemon.json file on the host. The file is updated so that Docker can use the NVIDIA Container Runtime.
# sudo nvidia-ctk runtime configure --runtime=docker
I have also seen that it may be necessary to add the default-runtime parameter directly to the file “/etc/docker/daemon.json”.
Run the command below to see what runtimes Docker is using
docker info | grep "Runtime"
Runtimes: io.containerd.runc.v2 nvidia runc
Default Runtime: runc
Make a backup copy of /etc/docker/daemon.json
# cp /etc/docker/daemon.json /etc/docker/daemon.json.ORIG
Modify the file as shown below.
# cat daemon.json
{
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
},
"default-runtime": "nvidia"
}
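A malformed daemon.json will prevent the Docker daemon from starting, so it is worth syntax-checking the JSON before restarting. The sketch below validates a copy in /tmp; point the final command at /etc/docker/daemon.json on a real host:

```shell
# Write the example configuration to a scratch file and syntax-check it
# with Python's stdlib JSON validator.
cat > /tmp/daemon.json <<'EOF'
{
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia"
}
EOF
json_check=$(python3 -m json.tool /tmp/daemon.json >/dev/null && echo "daemon.json is valid JSON")
echo "$json_check"
```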
Restart Docker
# systemctl restart docker
Check the output of docker info and ensure that nvidia is the default runtime
# docker info | grep "Runtime"
Runtimes: io.containerd.runc.v2 nvidia runc
Default Runtime: nvidia
Find your Nvidia Devices in /dev
# sudo ls -la /dev | grep nvidia
crw-rw-rw- 1 root root 195, 0 Feb 4 03:54 nvidia0
crw-rw-rw- 1 root root 195, 1 Feb 4 03:54 nvidia1
NGC CLI
NVIDIA NGC (NVIDIA GPU Cloud) CLI is a command-line interface tool for managing Docker containers in the NVIDIA NGC Registry. Download the CLI here.
Once downloaded, unzip the Zip file and make the binary executable
chmod u+x ngc-cli/ngc
Add the binary path to your path
echo "export PATH=\"\$PATH:$(pwd)/ngc-cli\"" >> ~/.bash_profile && source ~/.bash_profile
You will need an NVIDIA Cloud account and an API key; follow the setup guide here to get started.
Then log in to Docker as shown below, using your API key as your password
# docker login nvcr.io
Username: $oauthtoken
Password:
WARNING! Your password will be stored unencrypted in /root/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store
Login Succeeded
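Once logged in, the NGC CLI itself can browse the registry. A hedged sketch: ngc config set prompts interactively for the API key, and the guard covers hosts where the CLI is not on the PATH:

```shell
if command -v ngc >/dev/null 2>&1; then
  ngc config set                             # paste your API key when prompted
  ngc_images=$(ngc registry image list | head -5)
else
  ngc_images="ngc CLI not installed on this host"
fi
echo "$ngc_images"
```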
nvidia-ctk
A Container Device Interface (CDI) device is a standard way to manage container hardware access. More specifically, it is used to assign GPUs to containers through the NVIDIA Container Toolkit.
Run the command below to generate the CDI specification file
# sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
Then run the list command below to see what GPUs were detected.
# nvidia-ctk cdi list
INFO[0000] Found 5 CDI devices
nvidia.com/gpu=0
nvidia.com/gpu=1
nvidia.com/gpu=GPU-1d877ac8-5df1-34b0-4f86-59945e37d2ba
nvidia.com/gpu=GPU-9491a3e6-ea29-ba4e-4403-083244d5575c
nvidia.com/gpu=all
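With the spec generated, CDI-aware runtimes can request GPUs by those names. A sketch assuming Podman (Docker also supports CDI when it is enabled in the daemon configuration); the guard covers hosts without Podman or the generated spec:

```shell
# --device takes a fully-qualified CDI name from `nvidia-ctk cdi list`.
if command -v podman >/dev/null 2>&1 && [ -f /etc/cdi/nvidia.yaml ]; then
  cdi_out=$(podman run --rm --device nvidia.com/gpu=0 ubuntu nvidia-smi -L)
else
  cdi_out="podman or the CDI spec not available on this host"
fi
echo "$cdi_out"
```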
Running a Sample Docker Workload
Run the command below to test that Docker is working properly
# sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
The command above should run nvidia-smi once and then exit.
GPU Burn With Docker
Pop open a second terminal window, run nvtop, and then run the command below. You should see load on your GPUs. In the example below, 60 is the number of seconds to run the test.
# sudo docker run --gpus all --rm oguzpastirmaci/gpu-burn 60
See below. GPUs running at 100% load.

Reference
- https://org.ngc.nvidia.com/setup/installers/cli
- https://docs.nvidia.com/deploy/nvidia-smi/index.html
- https://taozhi.medium.com/monitor-nvidia-gpu-by-nvidia-smi-cli-56198fbf8e62
- https://www.gpu-mart.com/blog/monitor-gpu-utilization-with-nvidia-smi
- https://programmersought.com/article/84455484104/
- https://docs.nvidia.com/deploy/driver-persistence/index.html
- https://www.incredibuild.com/integrations/cuda
- https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
- https://docs.nvidia.com/deploy/nvml-api/group__nvmlGpmEnums.html
- https://docs.nvidia.com/deploy/pdf/NVML_API_Reference_Guide.pdf
- https://hub.docker.com/r/oguzpastirmaci/gpu-burn
