Ollama

Running LLMs locally on commodity hardware can be challenging: the setup is not trivial, the hardware requirements are high, and the end result is often disappointing because of poor performance.

Ollama is one of the solutions that address these issues.

Ollama can run several preconfigured LLMs, including Llama 2, Mistral, and Orca, in different versions. The full list is available on the website.

Models, parameters, and quantisation

There are several “brands” to choose from, and it is not always clear which alternative is the most appropriate. The rule of thumb is to pick a specialised model when possible (e.g. Codellama for coding) and fall back to general models if none is available.

Different versions of the same model differ in the number of parameters (in billions) and in the type of quantisation.

The number of parameters determines the quality of the response produced by the model, but also its requirements:

Size  RAM required
7B    8 GB
13B   16 GB
33B   32 GB

Quite obviously, the number of parameters translates into size on disk: a model can range from a couple of GB to 70 GB or even more.
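A rough back-of-the-envelope estimate (ignoring metadata and other overhead) is size ≈ parameters × bits per weight ÷ 8: a 7B model with the default 4-bit quantisation takes about 7 × 10⁹ × 4 ÷ 8 ≈ 3.5 GB, while the same model in f16 needs around 14 GB.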

Ollama's library page contains a list of the available models with all their versions. Each version has a different number of parameters and quantisation. The number of parameters is identified by a number followed by b: 7b indicates a model with 7 billion parameters. Ollama uses 4-bit quantisation by default, and the alternatives are tagged with the letter q followed by a number: q8 indicates a model with 8-bit quantisation.

Putting things together, orca-mini:3b-q8_0 identifies an Orca Mini model with 3 billion parameters and 8-bit quantisation.
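To download that exact variant rather than the default tag, pass the full tag to the pull command (introduced in the Setup section below):

ollama pull orca-mini:3b-q8_0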

The q tags can seem confusing, especially the _0 and _1 suffixes. These variants indicate the quantisation scheme: q4_0 and q4_1 are both 4-bit quantised models, but their weights are computed in different ways.

It’s important to remember that q2 (2-bit quantisation) produces less accurate models with a smaller footprint compared to q8 (8-bit quantisation). f16 is 16-bit floating point (half precision, i.e. unquantised) and is considered the baseline.

Interestingly, according to the developers of llama.cpp, q6 is within 0.1% perplexity (a measure of accuracy) of f16. For practical purposes, the trade-off between f16 and q6 therefore tends to favour the latter: comparable perplexity, but a much smaller footprint.

The main lever on quality, measured as perplexity, is the number of parameters: more is better, but more parameters also mean higher hardware requirements and slower responses.

Setup

With Ollama, the location of the models cannot be customised directly, but it is possible to move the physical directory elsewhere (for example, to an external drive) and leave a symlink in its place:

mv ~/.ollama/models /some/external/drive/
ln -s /some/external/drive/models ~/.ollama/models

Using Ollama is straightforward. Start the main service with:

ollama serve
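A quick way to check that the service is up is to hit its port with a plain GET, which should return Ollama is running:

curl http://localhost:11434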

Pull a model with:

ollama pull orca-mini

When the download is complete, the model should appear in the list of models:

ollama list

Returning:

NAME              ID            SIZE    MODIFIED       
orca-mini:latest  2dbd9f439647  2.0 GB  33 seconds ago  

To use the model:

ollama run orca-mini

The command above opens a prompt, and it’s immediately possible to play with it:

>>> who are you?

I am an AI assistant designed to assist with various tasks and activities as required.

Type /bye to quit.
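ollama run also accepts the prompt as a command-line argument, which is handy for one-shot, non-interactive use: it prints the answer and exits instead of opening the interactive prompt.

ollama run orca-mini "who are you?"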

REST APIs

The prompt is a nice way to test a model, but the REST API is a much more powerful way to interact with Ollama: it allows extra control over the queries and can serve as the backend for a service.

To try it, start Ollama with ollama serve, then call it with a POST request:

curl --location --request POST 'http://localhost:11434/api/generate' \
--header 'Content-Type: application/json' \
--data-raw '{
  "model": "orca-mini",
  "stream": false,
  "prompt": "who are you?"
}'

Ollama will respond with something like:

{
    "model": "orca-mini",
    "created_at": "2024-01-19T10:24:55.565197Z",
    "response": " I am an AI assistant designed to help and assist you with your tasks. My purpose is to provide helpful information, answer questions, and help you achieve your goals.",
    "done": true,
    "context": [],
    "total_duration": 831663375,
    "prompt_eval_count": 44,
    "prompt_eval_duration": 272847000,
    "eval_count": 33,
    "eval_duration": 552068000
}
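Note the "stream": false in the request: it makes Ollama return a single JSON object. With streaming enabled (the default), the same call returns one JSON fragment per generated token, each on its own line, with the final one carrying "done": true and the timing statistics. Streaming is what an interactive frontend would use to display the response as it is generated:

curl --location --request POST 'http://localhost:11434/api/generate' \
--header 'Content-Type: application/json' \
--data-raw '{
  "model": "orca-mini",
  "prompt": "who are you?"
}'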

There are two interesting options in the generate endpoint:

  • "raw":true, to enable “raw” mode and be able to customise the prompt template.
  • "options":{}, to control the parameters.

Raw mode

The default prompt can be inspected with:

ollama show orca-mini --template

plus

ollama show orca-mini --system

For orca-mini the commands above return:

{{- if .System }}
### System:
{{ .System }}
{{- end }}

### User:
{{ .Prompt }}

### Response:

and

You are an AI assistant that follows instruction extremely well. Help as much as you can.

Let’s compare what happens in regular and raw mode for the same prompt:

Regular, with Tell me something about Napoleon:

Napoleon Bonaparte was a French military and political leader who rose to prominence during the Napoleonic Wars. He was born on May 14, 1769, in Corsica, which was then a French colony. Napoleon became a general in the French army in 1785 and later served as the First Consul of France from 1799 to 1804, which gave him dictatorial powers. He invaded Egypt in 1798 and conquered much of Europe in 1805-1812, making himself Emperor of France. However, his rule was marked by controversy and controversy, and he was eventually defeated in 1812 at the Battle of Waterloo. Napoleon's legacy is still debated today, with some arguing that he played a positive role in European history while others see him as a destructive force who led to years of war and chaos.

Raw, using this prompt: "### System:\nYou are an AI assistant. You respond with bullet points.\n\n### User:\nTell me something about Napoleon\n\n### Response:"

1. Napoleon Bonaparte was a French military and political leader who rose to prominence during the French Revolution.

2. He was born on the island of Corsica in 1769 and became a general in the French army.

3. In 1799, he rose to power in a coup d'état and became the first consul of France.

4. He is known for his military conquests and his reforms of the French government and economy.

5. Napoleon was emperor of France from 1804 until his abdication in 1814.
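For reference, this is the full request used for the raw call above; note "raw": true and the manually assembled prompt, which follows orca-mini's template shown earlier:

curl --location --request POST 'http://localhost:11434/api/generate' \
--header 'Content-Type: application/json' \
--data-raw '{
  "model": "orca-mini",
  "stream": false,
  "raw": true,
  "prompt": "### System:\nYou are an AI assistant. You respond with bullet points.\n\n### User:\nTell me something about Napoleon\n\n### Response:"
}'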

Raw mode is an essential tool for prompt engineering.

Options

The options can control various aspects of the generation: the full list is available in the documentation.

The following is a demonstration of temperature, top_k, and top_p: these three parameters control the level of randomness of the generation.

With the prompt Tell me which books did Socrates write and the default settings, the LLM explains that Socrates never wrote a book.

Socrates is known for his contributions to philosophy, but he did not write any books. Instead, he is known for his teachings and discussions with fellow citizens in ancient Athens, which are recorded in some of the world's oldest known works of Western philosophy such as the "Apology" and "The Republic."

With the options tuned to maximise the model’s “imagination”, the result changes:

curl --location --request POST 'http://localhost:11434/api/generate' \
--header 'Content-Type: application/json' \
--data-raw '{
  "model": "orca-mini",
  "stream": false,
  "raw": false,
  "prompt": "Tell me which books did Socrates write",
  "options": {
    "temperature": 100,
    "top_k": 100,
    "top_p": 1
  }
}'

Some results correctly report that Socrates did not write any book, but more than half have hallucinations:

Socrates is known for writing several philosophical dialogues, which have been translated into many languages. Some of his most well-known works include the "The Republic" and "Crito". However, Socrates was not himself a prolific writer and instead used other writers to record his ideas in the form of dialogues and letters.

or

Socrates is known primarily for his written work, the Apology. He also wrote other works, but these two are the most notable ones.

or

Socrates is known for writing many famous works, but some of his most notable ones are "The Republic," "Crito," and "Apology."
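For the opposite effect, turning the temperature down to 0 makes the sampling greedy and the output (nearly) deterministic, a better fit for factual queries; a seed option is also available to make sampled runs repeatable:

curl --location --request POST 'http://localhost:11434/api/generate' \
--header 'Content-Type: application/json' \
--data-raw '{
  "model": "orca-mini",
  "stream": false,
  "prompt": "Tell me which books did Socrates write",
  "options": {
    "temperature": 0,
    "seed": 42
  }
}'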

More to explore

This is just the tip of the iceberg. There is much more that can be done with Ollama, including:

  • Chat APIs
  • Specialised models for coding or images
  • Model files to create configurations
  • Import of custom models
  • Generation of embeddings

I will explore them in future articles.

Tags: AI Ollama LLM