Step 0 - preparation
Make sure you have enough disk space. I bought an external drive (Crucial X9 2TB SSD) for this task. The initial download required 331 GB, and a minimum of 1 TB is recommended to work with Llama.
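If you want to check the free space on the drive programmatically, here is a minimal Python sketch (the mount point is an assumption, adjust it to wherever your disk lives):
import shutil
# Query total/used/free bytes for the external drive and print them in GB
total, used, free = shutil.disk_usage("/Volumes/CrucialX9")
print(f"free: {free / 1e9:.0f} GB of {total / 1e9:.0f} GB")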
Step 1 - initial download
Open the Llama repository and follow the instructions:
- clone the repo
- request a download link at meta.com
- get the key via email
- launch download.sh and paste the key
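A quick sanity check that the weights actually landed on the disk (the path is an assumption; I keep the llama repo next to llama.cpp, which is also what the later commands assume):
from pathlib import Path
# List the downloaded files with their sizes in GB
model_dir = Path("../llama/llama-2-7b-chat")
for f in sorted(model_dir.iterdir()):
    print(f"{f.name:30} {f.stat().st_size / 1e9:.2f} GB")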
Step 2 - download llama.cpp
Download llama.cpp and compile it:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
Convert the model:
python3 -m venv .env
source .env/bin/activate
pip install -r requirements.txt
python convert.py ../llama/llama-2-7b-chat
This converts the model to the GGUF format. The resulting file is pretty big (note: the exact file name may differ):
$ du -ch ../llama/llama-2-7b-chat/ggml-model-f16.gguf
13G ../llama/llama-2-7b-chat/ggml-model-f16.gguf
To see which quantization types are available, run ./quantize with no arguments. The ppl column is the perplexity increase compared with the unquantized F16 model (lower is better):
2 or Q4_0 : 3.56G, +0.2166 ppl @ LLaMA-v1-7B
3 or Q4_1 : 3.90G, +0.1585 ppl @ LLaMA-v1-7B
8 or Q5_0 : 4.33G, +0.0683 ppl @ LLaMA-v1-7B
9 or Q5_1 : 4.70G, +0.0349 ppl @ LLaMA-v1-7B
10 or Q2_K : 2.63G, +0.6717 ppl @ LLaMA-v1-7B
12 or Q3_K : alias for Q3_K_M
11 or Q3_K_S : 2.75G, +0.5551 ppl @ LLaMA-v1-7B
12 or Q3_K_M : 3.07G, +0.2496 ppl @ LLaMA-v1-7B
13 or Q3_K_L : 3.35G, +0.1764 ppl @ LLaMA-v1-7B
15 or Q4_K : alias for Q4_K_M
14 or Q4_K_S : 3.59G, +0.0992 ppl @ LLaMA-v1-7B
15 or Q4_K_M : 3.80G, +0.0532 ppl @ LLaMA-v1-7B
17 or Q5_K : alias for Q5_K_M
16 or Q5_K_S : 4.33G, +0.0400 ppl @ LLaMA-v1-7B
17 or Q5_K_M : 4.45G, +0.0122 ppl @ LLaMA-v1-7B
18 or Q6_K : 5.15G, -0.0008 ppl @ LLaMA-v1-7B
7 or Q8_0 : 6.70G, +0.0004 ppl @ LLaMA-v1-7B
1 or F16 : 13.00G @ 7B
0 or F32 : 26.00G @ 7B
COPY : only copy tensors, no quantizing
Quantize
./quantize ../llama/llama-2-7b-chat/ggml-model-f16.gguf ../llama/llama-2-7b-chat/ggml-model-f16_q4_0.bin Q4_0
The new file is much smaller:
$ du -ch ../llama/llama-2-7b-chat/ggml-model-f16_q4_0.bin
3.6G ../llama/llama-2-7b-chat/ggml-model-f16_q4_0.bin
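If you want to compare several quantization levels, you can loop over the quantize binary from Python instead of typing each command by hand. A rough sketch, run from the llama.cpp directory; the paths and the list of types are assumptions:
import subprocess
from pathlib import Path
src = Path("../llama/llama-2-7b-chat/ggml-model-f16.gguf")
for qtype in ["Q4_0", "Q5_K_M", "Q8_0"]:
    # Same CLI as above: ./quantize <input> <output> <type>
    dst = src.with_name(f"ggml-model-{qtype.lower()}.gguf")
    subprocess.run(["./quantize", str(src), str(dst), qtype], check=True)
    print(f"{dst.name}: {dst.stat().st_size / 1e9:.2f} GB")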
Step 3 - try it
Create a project with PyCharm. I used Python 3.12 (installed with brew in /opt/homebrew/Cellar/python@3.12/3.12.0/bin/python3.12) and a virtual environment in ~/venvs/llama.
Note: PyCharm seems to struggle with virtual environments in paths containing non-alphanumeric characters. If you run into errors, move the environment to a simpler location.
Install llama-cpp-python
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python
Test it with
from llama_cpp import Llama
# Path to the quantized model produced in step 2
model_path = "../llama/llama-2-7b-chat/ggml-model-f16_q4_0.bin"
# n_gpu_layers = 1 is enough to enable Metal offloading; use_mlock keeps the model in memory
model = Llama(model_path = model_path, n_ctx = 2048, n_gpu_layers = 1, use_mlock = True)
prompt = "[INST] Tell me a joke [/INST]"
output = model(prompt = prompt, max_tokens = 120, temperature = 0.2)
print(output['choices'][0]['text'])
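llama-cpp-python can also apply the Llama 2 chat template for you, so you don't have to write the [INST] tags by hand. A minimal sketch reusing the same model object (if your version doesn't pick the template automatically, you can pass chat_format="llama-2" to Llama(...)):
# Let the library build the chat prompt from role/content messages
output = model.create_chat_completion(
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke"},
    ],
    max_tokens = 120,
    temperature = 0.2,
)
print(output['choices'][0]['message']['content'])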