Ollama CLI Quick Start Guide and Tutorial for Beginners – Part 1

This two-part guide is written specifically for those who are just getting started with Ollama. Note that I originally wrote this post with the Nvidia Jetson Orin Nano in mind, as that is where I was initially running Ollama; in part 2 I switch to something more powerful.

That being said, the information below regarding installing Ollama and pulling a model is not specific to the Jetson and should work for anyone who wants to get started quickly with Ollama. It is also worth noting that in this quick start guide we install Ollama as a service, not as a container (as you would if using jetson-containers [1]).

Installing ollama

Use the command below to install ollama.

$ sudo curl -fsSL https://ollama.com/install.sh | sh

The install script downloads Ollama and the required JetPack 6 components (on Jetson devices), creates the ollama user, exposes an API endpoint, and enables and starts the ollama service.

$ sudo curl -fsSL https://ollama.com/install.sh | sh 
>>> Installing ollama to /usr/local
>>> Downloading Linux arm64 bundle
######################################################################## 100.0%
>>> Downloading JetPack 6 components
######################################################################## 100.0%
>>> Creating ollama user...
>>> Adding ollama user to render group...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
Created symlink /etc/systemd/system/default.target.wants/ollama.service → /etc/systemd/system/ollama.service.
>>> NVIDIA JetPack ready.
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.

Run the command below to test functionality and verify the installed version of Ollama.

$ ollama --version
ollama version is 0.5.7

Configure ollama to Listen on all Interfaces

By default, Ollama listens only on 127.0.0.1 (localhost). If you want it to listen on all interfaces so that you can interact with it remotely, you will need to modify the service configuration as shown below.

$ sudo vi /etc/systemd/system/ollama.service

Add the following line to the [Service] section of the file. Note the straight quotes; curly "smart" quotes will break the unit file.

Environment="OLLAMA_HOST=0.0.0.0"

Below is the service file post edit.

$ cat ollama.service
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/usr/local/cuda-12.6/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin"
Environment="OLLAMA_HOST=0.0.0.0"
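Editing the unit file directly works, but the installer may rewrite it on a future upgrade. A drop-in override (a standard systemd mechanism) survives that; the sketch below achieves the same OLLAMA_HOST change.

```shell
# Sketch: the same setting as a systemd drop-in, which upgrades will not overwrite.
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama.service
```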

Reload the systemd daemon so the change will take effect.

$ sudo systemctl daemon-reload

Restart the Service

$ sudo systemctl restart ollama.service

Check to ensure Ollama is listening on all interfaces/IPs. (On newer systems that ship without netstat, "ss -tln | grep 11434" shows the same thing.)

$ netstat -a | grep 11434
tcp6       0      0 [::]:11434              [::]:*                  LISTEN     

Test connectivity from a remote host using telnet.

$ telnet 10.1.10.11 11434

Configure Alternative Download Directory For Ollama Models

This step is optional, but you can change the model download directory with the parameter below, added to the [Service] section of the unit file. The default location is "~/.ollama/models" relative to the user running Ollama.

Environment="OLLAMA_MODELS=/home/cpaquin/Download/ollama/models"

You will need to reload systemd and restart the ollama service for this to take effect. Also note that the new directory must be readable and writable by the ollama user.


Additional Packages (for non-jetson users)

For those not running a Jetson (and therefore not installing JetPack), you may need a few additional packages.

Installing a Model

Pull a Llama Model (Optimized for Jetson)

For the Jetson Orin Nano (4 GB), it is recommended to use a small or quantized model, such as:

  • TinyLlama (1B) → Good for Jetson Nano.
  • Llama 2 7B (Q4_0 or Q8) → Use GGUF quantization for lower RAM usage (I did not do this at first, and ran into issues below).
  • Mistral 7B (Q4_K) → More efficient than Llama 2.
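A rough rule of thumb explains why quantization matters on a 4 GB board: model memory is roughly parameters times bytes per weight (about 2 bytes at FP16, 1 at Q8, 0.5 at Q4), and actual usage is higher still once the KV cache and runtime overhead are added. A quick back-of-envelope calculation:

```shell
# Back-of-envelope model memory: parameters x bytes per weight.
# Real usage is higher (KV cache, runtime overhead).
awk 'BEGIN {
  printf "7B @ FP16 (2 bytes/weight):  %.1f GB\n", 7e9*2.0/1e9
  printf "7B @ Q8   (1 byte/weight):   %.1f GB\n", 7e9*1.0/1e9
  printf "7B @ Q4   (0.5 byte/weight): %.1f GB\n", 7e9*0.5/1e9
  printf "1.1B @ Q4 (tinyllama-class): %.2f GB\n", 1.1e9*0.5/1e9
}'
```

This lines up with what we see later: llama2:7b downloads as a 3.8 GB Q4 file and gets OOM-killed on a 4 GB board, while tinyllama fits comfortably.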

Running the command below shows us that we have yet to install a model.

$ ollama list
NAME    ID    SIZE    MODIFIED 

Run the command below to download a model. In the example below we are downloading/installing Llama 2 7B (3.8 GB).

$ ollama pull llama2:7b

PRO TIP:
You can browse available models at https://ollama.com/

Let's confirm a successful download and install with "ollama list".

$ ollama list
NAME         ID              SIZE      MODIFIED      
llama2:7b    78e26419b446    3.8 GB    3 minutes ago

Interacting with Ollama via the CLI

You can interact with Ollama (and the loaded model) via the CLI in one of two ways.

  1. Predefined Prompt
  2. Interactive Mode

Via Predefined Prompt

In this mode, you call the model and pass the prompt in one step.

~$ ollama run llama2:7b "Tell me about the Nvidia Jetson Orin Nano"
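The same one-shot prompt can also be sent over Ollama's HTTP API via the /api/generate endpoint. A sketch (it assumes the service is reachable on localhost:11434):

```shell
# Sketch: the same one-shot prompt via the HTTP API.
cat > prompt.json <<'EOF'
{
  "model": "llama2:7b",
  "prompt": "Tell me about the Nvidia Jetson Orin Nano",
  "stream": false
}
EOF
# Then, with the service running:
#   curl -s http://127.0.0.1:11434/api/generate -d @prompt.json
```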

Interactive Mode

In this mode, the model loads and drops you into an interactive prompt (type /bye to exit).

$ ollama run llama2:7b

A Wild Error Appears

While attempting to run an interactive session, we see that the ollama runner process was terminated.

~$ ollama run llama2:7b 
Error: llama runner process has terminated: signal: killed

Let's follow the output of journalctl and watch for errors as we try again.

$ sudo journalctl -f -u ollama.service

While watching journalctl we see the following

Feb 02 17:21:26 jetson.lab ollama[2306]: time=2025-02-02T17:21:25.650-05:00 level=WARN source=server.go:562 msg="client connection closed before server finished loading, aborting load"
Feb 02 17:21:26 jetson.lab ollama[2306]: time=2025-02-02T17:21:25.668-05:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="timed out waiting for llama runner to start: context canceled"
Feb 02 17:21:26 jetson.lab ollama[2306]: [GIN] 2025/02/02 - 17:21:25 | 499 |  4.679059726s |       127.0.0.1 | POST     "/api/generate"
Feb 02 17:21:26 jetson.lab systemd[1]: ollama.service: Failed with result 'oom-kill'.

We are being "oom-killed": the kernel's out-of-memory killer is terminating the runner because the 7B model does not fit in the available RAM.

Let's watch free memory while we execute "ollama run" again, using watch to refresh "free -m" every second.

$ watch -d -n 1 free -m

We can see free memory drop to about 500 MB, which may or may not be enough to run the rest of the system. When Ollama is not running, we have about 2485 MB of free memory (shown below).

$ free -m
               total        used        free      shared  buff/cache   available
Mem:            3601         940        2485           0         176        2483
Swap:           1800         338        1462
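If you want to script checks like this, the "available" column is the figure that matters (it accounts for reclaimable caches); it can be read straight from /proc/meminfo, which is where free gets it.

```shell
# Read available memory (MiB) from /proc/meminfo; this is the same figure
# that "free -m" reports in its "available" column.
avail_mb=$(awk '/^MemAvailable:/ {print int($2/1024)}' /proc/meminfo)
echo "available: ${avail_mb} MiB"
```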

So we have a few options

  1. Free up system memory so more is available to Ollama (let's try)
  2. Tune ollama to attempt to use less memory (might be possible)
  3. Modify oom-killer behavior (probably a bad idea)
  4. Change to a lighter weight model (best idea)

Freeing up System Memory

Let's disable the desktop GUI in Ubuntu and see what that buys us in free memory. Note that you can probably just skip this section, as I eventually move to a smaller model, but there is some good troubleshooting information here for those new to Linux.

$ sudo systemctl set-default multi-user.target

And now immediately switch to CLI mode without a reboot.

$ sudo init 3

Output from "free -m" has not changed significantly.

$ sudo free -m
               total        used        free      shared  buff/cache   available
Mem:            3601         859        2402           4         339        2554
Swap:           1800         164        1636

We will reboot just in case... OK, that is a tiny bit better.

$ sudo free -m
               total        used        free      shared  buff/cache   available
Mem:            3601         458        2709          18         433        2938
Swap:           1800           0        1800

Now let's see if we can find a few services that we do not need, and stop and disable them.

$ sudo systemctl disable --now bluetooth
$ sudo systemctl disable --now avahi-daemon.service avahi-daemon.socket

Then reboot. Once back up and running, a cursory check of "free -m" shows that our efforts were mostly in vain. Let's try another model.

So this time we are going with a small model to test basic functionality and see if we are still running into OOM errors.

$ ollama pull tinyllama

Ok much better…

~$ ollama run tinyllama
>>> Send a message (/? for help)

Let's check memory and see how much headroom we have... it's not a lot.

~$ free -m
               total        used        free      shared  buff/cache   available
Mem:            3601        1835         924           0         841        1575
Swap:           1800         147        1653

Inspecting System Utilization

For this step we are going to use a couple of tools, most of which are custom to the Jetson.

First we launch jtop [2]. In another terminal window, we load up the tinyllama model and enter our prompt. While Ollama is working, we observe jtop.

So our prompt goes… (note how much of the model's answer below is hallucinated: Digital Equipment Corporation was actually founded in 1957 in Maynard, Massachusetts by Ken Olsen and Harlan Anderson)

can you tell me the history of the company Digital and their line of PDP computers
Certainly! The company Digital was founded in England in 1963 by two college students, John Cocking and Michael Kearns. They were inspired to start a computer company after 
witnessing the emergence of the personal digital assistant (PDA) market, which had been dominated by smaller, less advanced companies such as Acorn Computers and Marmalade.....trunc...

While this is running we are watching jtop. Below we can see that our GPU is at 80% load and at times approached 100%. Memory usage is high, and CPU utilization is low.


Measuring Tokens Per Second

This seems to be the measurement that many use to determine how fast their machine is, so let's give it a try. To output tokens per second, we add the "--verbose" flag.

~$ ollama run tinyllama --verbose "Can you tell me as much as you know about the Dell T620 Server"

I get a six-bullet list that is almost completely incorrect. What exactly is a "DelT620"? And no, the T620 was Intel Xeon powered, not powered by AMD's EPYC 7551 processor. But that does not matter; what comes next is what we are looking for.

total duration:       25.885975737s
load duration:        1.246583429s
prompt eval count:    52 token(s)
prompt eval duration: 283ms
prompt eval rate:     183.75 tokens/s
eval count:           464 token(s)
eval duration:        24.354s
eval rate:            19.05 tokens/s

Let’s break down the output above line by line.

total duration: 25.885975737s
The total time taken for the entire process of generating a response. This includes all stages, such as:
  • Model Loading: If the model isn't already loaded into memory, this time accounts for loading it.
  • Tokenization: Converting input text into tokens that the model can process.
  • Inference Time: The time spent by the model generating the response token by token.
  • Post-processing: Any steps taken after generation, such as formatting the output.
  • Communication Overhead: Time spent handling requests and responses, especially if running in a client-server setup.

load duration: 1.246583429s
The amount of time spent loading the model into memory before it can start processing input. This step includes:
  • Model Retrieval: If the model is not already cached in memory, Ollama retrieves it from disk or another source.
  • Model Initialization: Preparing the model, including loading weights into VRAM (if using a GPU) or RAM (if running on a CPU).
  • Graph Compilation (if applicable): Some backends may optimize or compile the model for execution.
  • Memory Allocation: Ensuring that enough memory is available for inference.

prompt eval count: 52 token(s)
The number of tokens that were processed (evaluated) from the initial prompt before the model starts generating a response.
  • Prompt Tokens: Before the model generates any output, it first processes (evaluates) the input text (prompt).
  • Evaluation: The model tokenizes the input text and processes these tokens through its neural network.
  • Count Meaning: This number represents how many tokens were in the prompt that the model had to evaluate before responding.

prompt eval duration: 283ms
The total time taken to process (evaluate) the input prompt before the model begins generating a response.
  • Tokenization: The input text is broken down into tokens that the model can process.
  • Model Inference on Prompt Tokens: The model evaluates these tokens, passing them through its neural network layers to set up the internal context.
  • Context Initialization: The model updates its internal state based on the prompt before starting to generate output.

prompt eval rate: 183.75 tokens/s
The speed at which the model processes the tokens from the input prompt, measured in tokens per second: prompt eval rate = prompt eval count / prompt eval duration.

eval count: 464 token(s)
The total number of tokens processed during the generation phase, i.e., the number of new tokens the model generated as output.

eval duration: 24.354s
The total time taken for the model to generate its response, measured from the moment it starts producing tokens until it finishes.

eval rate: 19.05 tokens/s
The speed at which the model generates tokens during the response phase, measured in tokens per second: eval rate = eval count / eval duration.
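The two rate figures are simply derived from the counts and durations above, which you can sanity-check yourself:

```shell
# Sanity-check the reported throughput figures: rate = tokens / duration.
awk 'BEGIN {
  printf "prompt eval rate: %.2f tokens/s\n", 52/0.283    # matches 183.75
  printf "eval rate:        %.2f tokens/s\n", 464/24.354  # matches 19.05
}'
```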

Customizing Model Behavior

You can customize how the model responds with the "/set" command. For example, let's say I want to tell the model to respond to me as if it were a child.

$ ollama run tinyllama
>>> /set system For any question asked can you respond as if you were a child, using only small words, short sentences, and lacking any specific technical details or jargon
Set system message.
>>> /save kid2
Created new model 'kid2'
>>> /bye
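An alternative to the interactive /set and /save flow is to capture the same customization in a Modelfile and build a derived model with "ollama create" (FROM and SYSTEM are standard Modelfile directives; the file contents here are a sketch).

```shell
# Sketch: the same "kid" persona captured as a Modelfile.
cat > Modelfile <<'EOF'
FROM tinyllama
SYSTEM "For any question asked, respond as if you were a child, using only small words and short sentences."
EOF
# Build the derived model from it:
#   ollama create kid2 -f Modelfile
```

This produces the same kind of derived model as /save, but keeps the customization in a file you can version-control.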

Then we load the newly saved model.

~$ ollama run kid2
>> what was the US civil war about 
The United States Civil War was fought between the Union and Confederate states over whether the union should remain intact as a single nation or become a separate nation known 
as the Confederate States of America. The war began in April 1861 after President Abraham Lincoln announced that he would veto any bill to abolish the constitutional right to 
secession unless Congress passed a similar law.

As the war progressed, each side unleashed their forces on the other and suffered significant losses. The north won major victories in the first year of the war, but southern 
states began withdrawing from the union due to economic hardship, making it necessary for Lincoln to appeal to them with a proposal to call off fighting and establish a 
government in the south. In response, the Confederate army launched its own attack on April 12, 1865, which led to Lincoln's assassination a few months later. The US civil war 
was one of the most devastating conflicts in world history, causing significant destruction to both sides and leaving millions dead or wounded.

OK, so that was not exactly child-like; however, this is probably the fault of the model I am using. Let's see if I can coax it to answer the question differently.

>> /set system For any question asked can you respond as if you were a scholar, using a large vocabulary of diverse words. Be very specific in your responses and provide colorful deta
... ils
Set system message.
>>> what was the US civil war about
The United States Civil War was fought between the Union and Confederate states over whether the union should remain intact as a single nation or become a separate nation known 
as the Confederate States of America. The war began in April 1861 after President Abraham Lincoln announced that he would veto any bill to abolish the constitutional right to 
secession unless Congress passed a similar law.

The main cause of the war was the belief by some members of the southern states that they had been wronged and mistreated by the federal government, as well as a desire for 
self-determination and statehood. The north won major victories in the first year of the war, but southern states began withdrawing from the union due to economic hardship, 
making it necessary for Lincoln to appeal to them with a proposal to call off fighting and establish a government in the south. In response, the Confederate army launched its own 
attack on April 12, 1865, which led to Lincoln's assassination a few months later. The US civil war was one of the most devastating conflicts in world history, causing 
significant destruction to both sides and leaving millions dead or wounded.

OK, so not great, but again, probably the model. Anyway, you get the point. I'll try this again with a different model in the near future on one of my more powerful servers.


Removing a Model

Remove a model with the "rm" subcommand, as shown below.

~$ ollama list
NAME                ID              SIZE      MODIFIED    
kid2:latest         ca8452f00cd5    637 MB    2 hours ago    
kid:latest          ce8e59f0d306    637 MB    2 hours ago    
tinyllama:latest    2644915ede35    637 MB    3 hours ago    
llama2:7b           78e26419b446    3.8 GB    5 hours ago    
cpaquin@jetson:~$ ollama rm kid:latest kid2:latest llama2:7b 
deleted 'kid:latest'
deleted 'kid2:latest'
deleted 'llama2:7b'

Training a Model

Most available models operate pretty well when you ask them about information they have been trained on. Above, you can see that the tinyllama model has not been trained to even remotely respond as if it were a child. In order to do that, the model would need to be trained.

So let's "train" tinyllama. (Strictly speaking, what follows is in-context learning rather than training: we are only adding information to the conversation, and the model's weights never change.) Keep in mind that these models have short-term memory constraints, meaning they only retain what you tell them during an active conversation. Once you close the session and start a new one, the model will not remember any information from the previous interaction. To overcome this limitation, you would need a backend database or some other form of persistent storage, which would allow you to save and retrieve relevant information across sessions. Depending on your needs, you could use:

  • Relational Databases (SQL) – MySQL, PostgreSQL, or SQLite for structured data.
  • NoSQL Databases – MongoDB, Redis, or Firebase for more flexible storage.
  • Vector Databases – Pinecone, FAISS, or ChromaDB for storing embeddings in AI applications.
  • File Storage – JSON, CSV, or other formats for lightweight persistence.
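As a minimal sketch of the file-storage option, a chat history could be persisted as JSON and replayed into Ollama's /api/chat endpoint when a new session starts (the endpoint is part of Ollama's API; the file name and messages here are purely illustrative).

```shell
# Sketch: persist chat history as JSON so a later session can be seeded with it.
cat > history.json <<'EOF'
{
  "model": "tinyllama",
  "messages": [
    { "role": "user",      "content": "Remember: my favorite show is Star Trek." },
    { "role": "assistant", "content": "Noted! Your favorite show is Star Trek." },
    { "role": "user",      "content": "What is my favorite show?" }
  ],
  "stream": false
}
EOF
# Replaying the saved history into a new session:
#   curl -s http://127.0.0.1:11434/api/chat -d @history.json
```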

The backend system could then integrate with the AI model, fetching and updating information as needed, effectively giving it "memory" beyond a single session. More about this later. For now, let's "teach" tinyllama about Star Trek. Specifically, I prompted the model with the question below, and its response was incomplete.

>>> do you know about any of the star trek tv shows
Certainly! Here are some popular Star Trek TV shows:

1. Star Trek: The Original Series (1966-1968)
2. Star Trek: The Next Generation (1987-1994)
3. Star Trek: Deep Space Nine (1993-1999)
4. Star Trek: Voyager (1995-2001)
5. Star Trek: Enterprise (2001-2005)
6. Star Trek: Discovery (2017-)

I then made sure that was all the data it had on Star Trek TV shows.

>>> is that all the star trek shows
Yes, that's all the Star Trek TV series I could find.

And we are off to the races! Here is what I "taught" the model.

 Let me tell you about some other Start trek TV shows that you are unaware of. Star Trek: The Animated Series premiered in 1972 and ran for 2 seasons, it featured the same character
... s that appeared Star Trek: The Original Series. Star Trek: Lower Decks premiered in 2020 and ran for 5 seasons. Sadly it was recently cancelled. It was an animated show which was  
... comedic in nature, often making fun of the tropes features throughout the other Star Trek shows and movies. Star Trek: Prodigy premiered in 2021 and ran for 2 seasons, it was an an
... animated show that originally aired on the kids station Nickelodeon. It was geared towards children, but was still fun for adults to watch as well. Especially with their children. 
... Star Trek: Strange New Worlds premiered in 2022 and has been running for 2 seasons, a third season is on its way. This show takes place on the starship Enterprise, but before Capta
... in Kirk was in command. The Captain, is Captain Pike, who was a character originally introduced in the 2 part pilot episode of the Original Series. Would you like to know more about star trek?

Here we are witnessing what is called a hallucination, which is when an artificial intelligence (AI) model generates incorrect or misleading output.

We do see that the model now knows about Star Trek: Strange New Worlds, but it basically made up the plot. There is no Captain Alex Danforth anywhere on Google, and I have never heard of the U.S.S. Explorer. Ava Acres is a real actor, but has nothing to do with Star Trek. James Frain played Sarek in Star Trek: Discovery, so at least he is associated with the franchise. The model did remember that I mentioned a Star Trek show that aired on Nickelodeon, and got the series dates correct, but associated it with a completely made-up show.

Fascinating…

>>> Tell me about Star Trek: Strange New Worlds 
Yes, I do have more information on Star Trek: Stragrg New Worlds. This is a science fiction TV show that originally aired on Nickelodeon between 2021 and 2023. It's set in the 
same universe as Star Trek: Lower Decks, but takes place in the 25th century.

The show follows the crew of the U.S.S. Explorer, a new ship that was built specifically for this mission. The crew is made up of Captain Alex Danforth (played by James Frain), 
First Officer Renee "Ren" Dahla (played by Ava Acres), and Engineer Jesse Johnson (played by David Hewison).

The Explorer has been sent to explore a new planet that's rich in resources. The crew encounters various alien species, including a group of intelligent, hostile beings who have 
been observing their activities for some time.

The show is filled with action and adventure as the Explorer faces numerous challenges on its mission. It also explores themes such as colonialism, power dynamics, and 
responsibility to others. The show's premise may seem familiar to Star Trek fans, but it features new characters and a fresh take on the classic series universe.

If you enjoy Nickelodeon-style science fiction and have enjoyed Star Trek: Lower Decks, then I think you will enjoy Star Trek: Stragrg New Worlds.

Part 2…. coming soon.

Resources

  1. https://www.hostinger.com/tutorials/ollama-cli-tutorial
  2. https://www.jetson-ai-lab.com/tutorial_ollama.html
  3. https://www.kdnuggets.com/ollama-tutorial-running-llms-locally-made-super-simple
  4. https://sarinsuriyakoon.medium.com/run-ollama-in-5-minutes-free-open-source-llm-on-your-local-bonus-section-e520e452de86

Reference

[1] https://github.com/dusty-nv/jetson-containers

[2] https://github.com/rbonghi/jetson_stats
