Step 0 - preparation
Make sure you have enough disk space. I bought an external drive (Crucial X9 2TB SSD) for this task. The initial download required 331 GB, and a minimum of 1 TB is recommended to work with Llama.
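If you want to check the free space on the drive programmatically, here is a minimal Python sketch (the mount point is an assumption, adjust it to wherever your disk lives):
import shutil
# Query total/used/free bytes for the external drive and print them in GB
total, used, free = shutil.disk_usage("/Volumes/CrucialX9")
print(f"free: {free / 1e9:.0f} GB of {total / 1e9:.0f} GB")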
Step 1 - initial download
Open the Llama repository and follow the instructions:
- clone the repo
- request a download link at meta.com
- get the key via email
- launch download.sh and paste the key
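A quick sanity check that the weights actually landed on the disk (the path is an assumption; I keep the llama repo next to llama.cpp, which is also what the later commands assume):
from pathlib import Path
# List the downloaded files with their sizes in GB
model_dir = Path("../llama/llama-2-7b-chat")
for f in sorted(model_dir.iterdir()):
    print(f"{f.name:30} {f.stat().st_size / 1e9:.2f} GB")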
Step 2 - download llama.cpp
Download llama.cpp and compile it:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
Convert the model:
python3 -m venv .env
source .env/bin/activate
pip install -r requirements.txt
python convert.py ../llama/llama-2-7b-chat
This converts the model to the GGUF format. The resulting file is pretty big (note: the exact file name may differ):
$ du -ch ../llama/llama-2-7b-chat/ggml-model-f16.gguf
13G ../llama/llama-2-7b-chat/ggml-model-f16.gguf
To see which quantization types are available, run ./quantize with no arguments. The ppl column is the perplexity increase compared with the unquantized F16 model (lower is better):
2 or Q4_0 : 3.56G, +0.2166 ppl @ LLaMA-v1-7B
3 or Q4_1 : 3.90G, +0.1585 ppl @ LLaMA-v1-7B
8 or Q5_0 : 4.33G, +0.0683 ppl @ LLaMA-v1-7B
9 or Q5_1 : 4.70G, +0.0349 ppl @ LLaMA-v1-7B
10 or Q2_K : 2.63G, +0.6717 ppl @ LLaMA-v1-7B
12 or Q3_K : alias for Q3_K_M
11 or Q3_K_S : 2.75G, +0.5551 ppl @ LLaMA-v1-7B
12 or Q3_K_M : 3.07G, +0.2496 ppl @ LLaMA-v1-7B
13 or Q3_K_L : 3.35G, +0.1764 ppl @ LLaMA-v1-7B
15 or Q4_K : alias for Q4_K_M
14 or Q4_K_S : 3.59G, +0.0992 ppl @ LLaMA-v1-7B
15 or Q4_K_M : 3.80G, +0.0532 ppl @ LLaMA-v1-7B
17 or Q5_K : alias for Q5_K_M
16 or Q5_K_S : 4.33G, +0.0400 ppl @ LLaMA-v1-7B
17 or Q5_K_M : 4.45G, +0.0122 ppl @ LLaMA-v1-7B
18 or Q6_K : 5.15G, -0.0008 ppl @ LLaMA-v1-7B
7 or Q8_0 : 6.70G, +0.0004 ppl @ LLaMA-v1-7B
1 or F16 : 13.00G @ 7B
0 or F32 : 26.00G @ 7B
COPY : only copy tensors, no quantizing
Quantize
./quantize ../llama/llama-2-7b-chat/ggml-model-f16.gguf ../llama/llama-2-7b-chat/ggml-model-f16_q4_0.bin Q4_0
The new file is much smaller:
$ du -ch ../llama/llama-2-7b-chat/ggml-model-f16_q4_0.bin
3.6G ../llama/llama-2-7b-chat/ggml-model-f16_q4_0.bin
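If you want to compare several quantization levels, you can loop over the quantize binary from Python instead of typing each command by hand. A rough sketch, run from the llama.cpp directory; the paths and the list of types are assumptions:
import subprocess
from pathlib import Path
src = Path("../llama/llama-2-7b-chat/ggml-model-f16.gguf")
for qtype in ["Q4_0", "Q5_K_M", "Q8_0"]:
    # Same CLI as above: ./quantize <input> <output> <type>
    dst = src.with_name(f"ggml-model-{qtype.lower()}.gguf")
    subprocess.run(["./quantize", str(src), str(dst), qtype], check=True)
    print(f"{dst.name}: {dst.stat().st_size / 1e9:.2f} GB")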
Step 3 - try it
Create a project with PyCharm. I used Python 3.12 (installed with brew in /opt/homebrew/Cellar/python@3.12/3.12.0/bin/python3.12) and a virtual environment in ~/venvs/llama.
Note: PyCharm seems to struggle with virtual environments in paths containing non-alphanumeric characters. If you run into errors, move the environment to a simpler location.
Install llama-cpp-python
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python
Test it with
from llama_cpp import Llama
# Path to the quantized model produced in step 2
model_path = "../llama/llama-2-7b-chat/ggml-model-f16_q4_0.bin"
# n_gpu_layers = 1 is enough to enable Metal offloading; use_mlock keeps the model in memory
model = Llama(model_path = model_path, n_ctx = 2048, n_gpu_layers = 1, use_mlock = True)
prompt = "[INST] Tell me a joke [/INST]"
output = model(prompt = prompt, max_tokens = 120, temperature = 0.2)
print(output['choices'][0]['text'])
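llama-cpp-python can also apply the Llama 2 chat template for you, so you don't have to write the [INST] tags by hand. A minimal sketch reusing the same model object (if your version doesn't pick the template automatically, you can pass chat_format="llama-2" to Llama(...)):
# Let the library build the chat prompt from role/content messages
output = model.create_chat_completion(
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke"},
    ],
    max_tokens = 120,
    temperature = 0.2,
)
print(output['choices'][0]['message']['content'])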