Getting Started with RamaLama with NVIDIA CUDA Support on Ubuntu 24.04

Introduction to RamaLama

Streamlining AI Deployment with OCI Containers

RamaLama is an open-source project developed to simplify AI model deployment and management using OCI (Open Container Initiative) containers. RamaLama enables seamless execution of AI workloads across different hardware configurations, supporting both GPU-accelerated and CPU-based environments.

By leveraging container engines like Podman and Docker, RamaLama ships all necessary dependencies inside the container image, eliminating complex installs and dependency nightmares.

RamaLama integrates with AI model registries such as Hugging Face and Ollama, providing flexibility in model selection. Key features include automatic GPU detection, CPU fallback, and optional direct execution on the host system.
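
As a quick illustration of that last point, the RamaLama CLI docs describe a --nocontainer flag that skips the container engine and runs the model runtime directly on the host. A minimal sketch (the model name is just an example):

$ ramalama --nocontainer run ollama://mistral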

Prerequisites

Updating Ubuntu

First, let's confirm our Ubuntu version:

$ sudo lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 24.04.1 LTS
Release:	24.04
Codename:	noble

Run the two commands below to update your package cache and install any updates. Reboot if required.

$ sudo apt-get update
$ sudo apt-get upgrade -y
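
If you are unsure whether a reboot is needed, Ubuntu drops a marker file when one is pending, so a quick check looks like this:

$ [ -f /var/run/reboot-required ] && echo "reboot needed" || echo "no reboot needed"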

Installing Podman

$ sudo apt -y install podman
$ podman --version
podman version 5.0.3
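
To sanity-check the install, podman info reports how the engine is configured. For example, a quick way to confirm whether you are running rootless (a Go template query against fields that podman info exposes):

$ podman info --format '{{.Host.Security.Rootless}}'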

Installing NVIDIA Drivers

Assuming you enabled third-party repositories at install time, we should be able to check the suggested NVIDIA driver version. Do so with the command below.

$ nvidia-detector
nvidia-driver-545
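
If you want to see every candidate driver rather than just the suggested one, the same ubuntu-drivers-common tooling can list them per device:

$ sudo ubuntu-drivers devices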

The command below confirms that we do not have an NVIDIA driver loaded.

$ cat /proc/driver/nvidia/version
cat: /proc/driver/nvidia/version: No such file or directory

Now let’s install the NVIDIA driver.

$ sudo ubuntu-drivers --gpgpu install

And we need to install the nvidia-utils package. Make sure the package version matches your installed driver.

$ sudo apt install nvidia-utils-535-server

Now reboot.

Once your system is back up, run the command below to verify that the drivers installed correctly. At the top of the output you should see your driver version and CUDA version.

$ sudo nvidia-smi
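
The /proc entry that was missing earlier should now exist as well, which makes for a quick scriptable check that the kernel module is loaded:

$ cat /proc/driver/nvidia/version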

You must configure the persistence daemon (nvidia-persistenced) to start at boot and run continuously. Otherwise, the driver may unload, deinitializing the Tesla GPUs and forcing a full reinitialization the next time nvidia-smi is executed. Failing to keep nvidia-persistenced running can also lead to more severe issues, such as GPU crashes, depending on the workload.

Enable and start the service, then verify that it is running:

$ sudo systemctl enable --now nvidia-persistenced
$ sudo systemctl status nvidia-persistenced
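
To confirm persistence mode actually took effect on the GPUs, nvidia-smi can report it directly (the same field also appears in the table at the top of plain nvidia-smi output):

$ nvidia-smi -q | grep "Persistence Mode"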

Installing the NVIDIA CUDA Toolkit

Run the command below to install the nvidia-cuda-toolkit from the default Ubuntu repos.

$ sudo apt install nvidia-cuda-toolkit -y

Now test to ensure a proper install and that the new binaries are in your PATH.

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
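
Beyond checking the version, compiling and running a trivial kernel is a more complete smoke test of the toolkit and driver together. Below is a minimal sketch; hello.cu is just a hypothetical file name, and the kernel simply has each GPU thread print its index:

$ cat > hello.cu <<'EOF'
#include <cstdio>

// Trivial kernel: every GPU thread prints its own index.
__global__ void hello() { printf("Hello from GPU thread %d\n", threadIdx.x); }

int main() {
    hello<<<1, 4>>>();        // launch 1 block of 4 threads
    cudaDeviceSynchronize();  // wait for the kernel (and its printf) to finish
    return 0;
}
EOF
$ nvcc hello.cu -o hello && ./hello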

Configuring Podman with NVIDIA CUDA Support

Following along with the CUDA support document from the RamaLama GitHub page (see the Resources section below), we first need to install the nvidia-container-toolkit.

First, configure the repo. Note that this is one command. See the NVIDIA Container Toolkit installation guide for more info.

$ curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

Refresh the repos

$ sudo apt-get update

Then install the toolkit

$ sudo apt install nvidia-container-toolkit -y

Then run the command below to create the CDI spec file

$ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

Now let's check the detected devices.

$ nvidia-ctk cdi list

My two Tesla T4s have been detected:

INFO[0000] Found 5 CDI devices                          
nvidia.com/gpu=0
nvidia.com/gpu=1
nvidia.com/gpu=GPU-1d877ac8-5df1-34b0-4f86-59945e37d2ba
nvidia.com/gpu=GPU-9491a3e6-ea29-ba4e-4403-083244d5575c
nvidia.com/gpu=all

Test the install/config

$ sudo podman run --rm --device=nvidia.com/gpu=all fedora nvidia-smi
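
The CDI device names also let you expose a single GPU to a container instead of all of them, by index or by UUID from the list above:

$ sudo podman run --rm --device=nvidia.com/gpu=0 fedora nvidia-smi -L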

Installing Python pip

Let's first make sure that Python 3 is installed.

$ python3 --version
Python 3.12.7

Now we need to install the venv module for Python. Make sure the package version matches your default Python 3 version.

$ sudo apt install python3.12-venv

Installing RamaLama via pip in a Python Virtual Env

Create a directory for ramalama, and cd to that directory

$ mkdir ramalama && cd ramalama

Create the virtual env and source its activate script:

$ python3 -m venv --upgrade-deps venv
$ source venv/bin/activate

Now pip install

$ pip install ramalama
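
A quick way to confirm the CLI landed in the venv before running anything (version is one of the documented ramalama subcommands):

$ ramalama version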

Now run a model as a test

$ ramalama run instructlab/merlinite-7b-lab

In another window, run the commands shown below to view the ramalama container running from your Python virtual env. Note that the output of podman ps will be empty unless you activate your virtual env first.

$ source venv/bin/activate
$ podman ps


Example CLI Commands

Pull a model.

$ ramalama pull ollama://mistral

List downloaded models.

$ ramalama list
NAME                             MODIFIED       SIZE   
ollama://mistral:latest          42 seconds ago 3.83 GB
ollama://merlinite-7b-lab:latest 10 hours ago   4.07 GB

Run a model.

$ ramalama run mistral:latest
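
If you want an HTTP endpoint rather than an interactive chat, ramalama can also serve a model. The sketch below assumes the default port of 8080 from the ramalama docs and the OpenAI-compatible API exposed by the underlying llama.cpp server:

$ ramalama serve mistral:latest
$ curl http://127.0.0.1:8080/v1/models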

Get info on ramalama itself. The output will tell you which GPUs were detected and what driver is being used.

$ ramalama info

The --dryrun flag prints the podman command used to serve/run a model.

$ ramalama --dryrun run instructlab/merlinite-7b-lab
podman run --rm -i --label ai.ramalama --name ramalama_6buqEjuCUm --env=HOME=/tmp --init --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --label ai.ramalama.model=instructlab/merlinite-7b-lab --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.command=run --pull=newer -t --device /dev/dri --device nvidia.com/gpu=all -e CUDA_VISIBLE_DEVICES=0 --network none --mount=type=bind,src=/home/cpaquin/.local/share/ramalama/models/ollama/merlinite-7b-lab:latest,destination=/mnt/models/model.file,ro quay.io/ramalama/cuda:latest llama-run -c 2048 --temp 0.8 --ngl 999 /mnt/models/model.file 
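
When you are finished, ramalama can list and stop its own containers without dropping down to podman. The container name below is the one generated in the --dryrun output above; yours will differ:

$ ramalama containers
$ ramalama stop ramalama_6buqEjuCUm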


There are a whole load of other topics that I will eventually get into with ramalama:

  1. GPU Support/Enablement
  2. RAG
  3. Whisper

More to come at a later date. In the meantime, take a look at the “Resources” section below.

Resources

  1. https://github.com/containers/ramalama
  2. https://github.com/containers/ramalama/blob/main/docs/ramalama.1.md
  3. https://developers.redhat.com/articles/2024/11/22/how-ramalama-makes-working-ai-models-boring
  4. https://developers.redhat.com/blog/2024/12/17/simplifying-ai-ramalama-and-llama-run
  5. https://www.linkedin.com/pulse/ollama-much-try-ramalama-surya-rekha-tw5kf/
