---
base_model: HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1
license: apache-2.0
---

# Importance-Matrix quantizations of HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1, the first Mixtral-8x22B finetune 💫

The imatrix.dat file was calculated over 1000 chunks of wikitext.train.raw (included in this repo).
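
For reference, an imatrix like this can be produced with llama.cpp's `imatrix` tool; a minimal sketch, assuming an unquantized GGUF of the base model (the `zephyr-orpo-141b-f16.gguf` name is hypothetical):

```
./imatrix -m zephyr-orpo-141b-f16.gguf -f wikitext.train.raw -o imatrix.dat --chunks 1000 -t 64
```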

A bit of custom C++ was written to avoid quantizing certain layers; the result is tested fully compatible with llama.cpp as of 10 April 2024.
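
Without such a patch, the stock workflow would simply feed the imatrix to llama.cpp's quantize tool; a sketch with hypothetical file names and an assumed quant type:

```
./quantize --imatrix imatrix.dat zephyr-orpo-141b-f16.gguf orpo4ns.gguf q4_k_m
```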

To merge everything into a single file (not needed with llama.cpp, which autodetects the chunks, but it can help when troubleshooting ollama):

```
cat orpo4ns-0000* > orpo4ns.gguf
```

Careful: this can take 5 minutes, or 10-15 on slow instances; check progress with `ls -la`.
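
Recent llama.cpp builds also include a `gguf-split` tool that can merge the shards; an alternative sketch, assuming your build ships it:

```
./gguf-split --merge orpo4ns-00001-of-00005.gguf orpo4ns.gguf
```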

# Run with llama.cpp

```
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp/ && make -j

./main -m ~/orpo4ns-00001-of-00005.gguf -n 256 -t 64 --temp 0.2 --color -p "How to build a city on mars via aldrin cycler orbits?"
```
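
Here `-n 256` caps generation at 256 tokens, `-t 64` sets the thread count, and `--temp 0.2` keeps sampling conservative. The same shards also load in llama.cpp's `server` example if you want an HTTP endpoint; a sketch, as flags can differ across versions:

```
./server -m ~/orpo4ns-00001-of-00005.gguf -t 64 -c 4096 --host 0.0.0.0 --port 8080
```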

# Perplexity benchmarks

The command below is what I used on a 48-core, CPU-only machine; add `-ngl 16` (or more) to offload layers to a GPU.

```
./perplexity -m ~/orpo4ns.gguf -f wiki.test.raw --chunks 12 -t 48
```
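
`--chunks 12` scores only the first 12 chunks as a quick sanity check; dropping the flag evaluates the full test set for a more stable perplexity number (much slower):

```
./perplexity -m ~/orpo4ns.gguf -f wiki.test.raw -t 48
```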
|