Introduction to RamaLama
Streamlining AI Deployment with OCI Containers
RamaLama is an open-source project that simplifies AI model deployment and management using OCI (Open Container Initiative) containers. RamaLama enables seamless execution of AI workloads across different hardware configurations, supporting both GPU-accelerated and CPU-based environments.
By leveraging container engines like Podman and Docker, RamaLama ships all necessary dependencies inside container images, eliminating complex installations and dependency nightmares.
RamaLama integrates with AI model registries such as Hugging Face and Ollama, providing flexibility in model selection. Key features include automatic GPU detection, CPU fallback, and optional direct execution on the host system.
Prerequisites
Updating Ubuntu
First, let’s confirm our Ubuntu version.
$ sudo lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.5 LTS
Release: 22.04
Codename: jammy
Run the two commands below to update your package cache and install any updates. Reboot if required.
$ sudo apt-get update
$ sudo apt-get upgrade -y
Installing podman
$ sudo apt -y install podman
$ podman --version
podman version 5.0.3
Installing Nvidia Drivers
Assuming you enabled third-party repositories at install time, we should be able to check the suggested NVIDIA driver version. Do so with the command below
$ nvidia-detector
nvidia-driver-545
The command below confirms that we do not have an NVIDIA driver loaded.
$ cat /proc/driver/nvidia/version
cat: /proc/driver/nvidia/version: No such file or directory
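Another way to make the same check is to look for a loaded NVIDIA kernel module with lsmod; a quick sketch, guarded so it degrades gracefully on a host where the driver is not installed:

```shell
# Check whether an NVIDIA kernel module is already loaded.
if lsmod | grep -q '^nvidia'; then
    echo "nvidia module loaded"
else
    echo "nvidia module not loaded"
fi
```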
Now let’s install the NVIDIA driver.
$ sudo ubuntu-drivers --gpgpu install
And we need to install the nvidia-utils package. Make sure the package version matches your installed driver.
$ sudo apt install nvidia-utils-535-server
Now reboot.
Once your system is back up, run the command below to verify that the drivers installed correctly. At the top of the output you should see your driver version and CUDA API version.
$ sudo nvidia-smi
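For scripting, nvidia-smi also has a query mode that emits machine-readable CSV. A small sketch below: the query call is guarded so the snippet is safe on any host, and the parsing step runs against a sample line (your GPU name and versions will differ):

```shell
# Query GPU name, driver version, and memory as headerless CSV.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
fi

# Parse one line of that CSV output (sample shown for illustration).
sample='Tesla T4, 535.183.01, 15360 MiB'
driver=$(printf '%s\n' "$sample" | awk -F', ' '{print $2}')
echo "driver: $driver"
```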
You must configure the persistence daemon (nvidia-persistenced) to start at boot and run continuously. Otherwise, the driver may unload, causing the Tesla GPUs to deinitialize, requiring a full reinitialization when nvidia-smi is executed. Additionally, failing to keep nvidia-persistenced running could lead to more severe issues, such as GPU crashes, depending on the workload.
Enable and start the service with the commands below
$ sudo systemctl enable nvidia-persistenced
$ sudo systemctl start nvidia-persistenced
$ sudo systemctl status nvidia-persistenced
Installing Nvidia Cuda Toolkit
Run the command below to install the nvidia-cuda-toolkit from the default Ubuntu repos.
$ sudo apt install nvidia-cuda-toolkit -y
Now test to ensure a proper install and that the new binaries are in your path.
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
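As an optional smoke test, you can compile and run a trivial program with nvcc. This is a sketch, not part of the original walkthrough; the file name and path are arbitrary choices, and the compile step is guarded so the snippet is harmless on hosts without the toolkit:

```shell
# Write a minimal CUDA source file (host-only code compiles fine with nvcc).
cat > /tmp/hello.cu <<'EOF'
#include <cstdio>
int main() { printf("nvcc build OK\n"); return 0; }
EOF

# Compile and run it if nvcc is available.
if command -v nvcc >/dev/null 2>&1; then
    nvcc /tmp/hello.cu -o /tmp/hello && /tmp/hello
else
    echo "nvcc not found; skipping compile test"
fi
```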
Configuring Podman with Nvidia Cuda Support
Following along with this document from the RamaLama GitHub page, we first need to install the nvidia-container-toolkit.
First configure the repo. Note that this is one command. See here for more info.
$ curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
Refresh the repos
$ sudo apt-get update
Then install the toolkit
$ sudo apt install nvidia-container-toolkit -y
Then run the command below to create the CDI spec file
$ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
Now let’s check the detected devices
$ nvidia-ctk cdi list
My two Tesla T4s have been detected
INFO[0000] Found 5 CDI devices
nvidia.com/gpu=0
nvidia.com/gpu=1
nvidia.com/gpu=GPU-1d877ac8-5df1-34b0-4f86-59945e37d2ba
nvidia.com/gpu=GPU-9491a3e6-ea29-ba4e-4403-083244d5575c
nvidia.com/gpu=all
Test the install/config
$ sudo podman run --rm --device=nvidia.com/gpu=all fedora nvidia-smi
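The same --device syntax accepts any name shown by `nvidia-ctk cdi list`, so you can hand a container a single GPU instead of all of them. A sketch, guarded so it is a no-op on hosts without podman or the CDI spec:

```shell
# Pass one GPU by index (or by UUID) rather than nvidia.com/gpu=all.
if command -v podman >/dev/null 2>&1 && [ -e /etc/cdi/nvidia.yaml ]; then
    sudo podman run --rm --device=nvidia.com/gpu=0 fedora nvidia-smi
else
    echo "skipping: podman or /etc/cdi/nvidia.yaml not available"
fi
```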
Installing Python pip
Lets first make sure that python 3 is installed
$ python3 --version
Python 3.12.7
Now we need to install the venv module for Python. Note that the package version should match your installed Python version.
$ sudo apt install python3.10-venv
Installing RamaLama via Pip in a Python Virt Env
Create a directory for ramalama, and cd to that directory
$ mkdir ramalama && cd ramalama
Create virtual env and source to activate
$ python3 -m venv --upgrade-deps venv
$ source venv/bin/activate
Now pip install
$ pip install ramalama
Now run a model as a test
$ ramalama run instructlab/merlinite-7b-lab
In another window, run the commands shown below to view the RamaLama container running in your Python virtual env. Note that the output of podman ps will be empty unless you activate your virtual env.
$ source venv/bin/activate
$ podman ps
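On a busy host you can narrow podman ps to RamaLama's containers by filtering on the ai.ramalama label that RamaLama applies to each container it starts (the label is visible in the --dryrun output further down). A sketch, guarded for hosts where podman is not installed:

```shell
# List only RamaLama-managed containers, showing name and status.
if command -v podman >/dev/null 2>&1; then
    podman ps --filter label=ai.ramalama --format '{{.Names}}  {{.Status}}'
else
    echo "podman not installed"
fi
```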
Example CLI Commands
Pull a model.
$ ramalama pull ollama://mistral
List downloaded models.
$ ramalama list
NAME                              MODIFIED        SIZE
ollama://mistral:latest           42 seconds ago  3.83 GB
ollama://merlinite-7b-lab:latest  10 hours ago    4.07 GB
Run a model.
$ ramalama run mistral:latest
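Besides the interactive run command, RamaLama can also serve a model over a REST API with `ramalama serve`. A sketch: the port and endpoint below are the llama.cpp-style OpenAI-compatible defaults at the time of writing, so check `ramalama serve --help` on your version before relying on them; the serve and curl lines are commented out since they need a running model.

```shell
# Serve a model instead of chatting with it interactively:
# ramalama serve mistral:latest

# From another shell, send a standard OpenAI chat-completions request.
payload='{"model":"mistral","messages":[{"role":"user","content":"Say hello"}]}'
echo "$payload"
# curl -s http://localhost:8080/v1/chat/completions \
#     -H 'Content-Type: application/json' -d "$payload"
```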
Show info on RamaLama itself. The output will tell you which GPUs were detected and what driver is being used.
$ ramalama info
The --dryrun flag provides the podman command used to serve/run a model
$ ramalama --dryrun run instructlab/merlinite-7b-lab
podman run --rm -i --label ai.ramalama --name ramalama_6buqEjuCUm --env=HOME=/tmp --init --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --label ai.ramalama.model=instructlab/merlinite-7b-lab --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.command=run --pull=newer -t --device /dev/dri --device nvidia.com/gpu=all -e CUDA_VISIBLE_DEVICES=0 --network none --mount=type=bind,src=/home/cpaquin/.local/share/ramalama/models/ollama/merlinite-7b-lab:latest,destination=/mnt/models/model.file,ro quay.io/ramalama/cuda:latest llama-run -c 2048 --temp 0.8 --ngl 999 /mnt/models/model.file
There are a whole load of other topics that I will eventually get into with RamaLama
- GPU Support/Enablement
- RAG
- Whisper
More to come at a later date. In the meantime, take a look at the “Resources” section below.
Resources
- https://github.com/containers/ramalama
- https://github.com/containers/ramalama/blob/main/docs/ramalama.1.md
- https://developers.redhat.com/articles/2024/11/22/how-ramalama-makes-working-ai-models-boring
- https://developers.redhat.com/blog/2024/12/17/simplifying-ai-ramalama-and-llama-run
- https://www.linkedin.com/pulse/ollama-much-try-ramalama-surya-rekha-tw5kf/