nisten committed
Commit 8fda89e · verified · 1 Parent(s): b00b931

Update README.md

Files changed (1):
  1. README.md +29 -0
README.md CHANGED
@@ -1,3 +1,32 @@
  ---
+ base_model: HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1
  license: apache-2.0
  ---
+
+ # Importance-Matrix quantizations of HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1, the first Mixtral-8x22B finetune 💫
+
+ The imatrix.dat file was calculated over 1,000 chunks of wikitext.train.raw (included in this repo).
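+
+ For reference, the importance matrix can be regenerated with llama.cpp's `imatrix` tool. A minimal sketch, assuming a full-precision GGUF of the model as input (the f16 filename here is a placeholder, not a file from this repo):
+
+ ```
+ # Recompute imatrix.dat over 1000 chunks of the calibration text
+ ./imatrix -m ./zephyr-orpo-141b-f16.gguf -f wikitext.train.raw --chunks 1000 -o imatrix.dat
+ ```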
+
+ Wrote a bit of custom C++ to avoid quantizing certain layers; tested fully compatible with llama.cpp as of 10 April 2024.
+
+ To merge everything into a single file (not needed with llama.cpp, which auto-detects the shards, but it can help when troubleshooting Ollama):
+
+ ```
+ cat orpo4ns-0000* > orpo4ns.gguf
+ ```
+
+ Careful: this can take 5 minutes, or 10-15 on slow instances; check progress with `ls -la`.
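+
+ Alternatively, newer llama.cpp builds include a `gguf-split` tool that can merge shards; a sketch under the assumption that your build ships it (point it at the first shard):
+
+ ```
+ # Merge the shards into one GGUF without cat
+ ./gguf-split --merge orpo4ns-00001-of-00005.gguf orpo4ns.gguf
+ ```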
+
+ # Run with llama.cpp
+
+ ```
+ git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp/ && make -j
+
+ ./main -m ~/orpo4ns-00001-of-00005.gguf -n 256 -t 64 --temp 0.2 --color -p "How to build a city on Mars via Aldrin cycler orbits?"
+ ```
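+
+ To serve the model over HTTP instead of one-shot generation, the `server` binary built alongside `main` works the same way; a minimal sketch (port, context size, and thread count here are arbitrary picks, not values from this repo):
+
+ ```
+ # Start llama.cpp's HTTP server on port 8080 with a 4096-token context
+ ./server -m ~/orpo4ns-00001-of-00005.gguf -c 4096 -t 64 --host 0.0.0.0 --port 8080
+ ```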
+ # Perplexity benchmarks
+
+ Command I used to run these on a 48-core, CPU-only machine; you can add `-ngl 16` to offload 16 layers (or more) to your GPU.
+
+ ```
+ ./perplexity -m ~/orpo4ns.gguf -f wiki.test.raw --chunks 12 -t 48
+ ```
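+
+ For example, with the GPU offload mentioned above (requires a GPU-enabled build; raise `-ngl` as far as your VRAM allows):
+
+ ```
+ # Same benchmark with 16 layers offloaded to the GPU
+ ./perplexity -m ~/orpo4ns.gguf -f wiki.test.raw --chunks 12 -t 48 -ngl 16
+ ```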