Update README.md
README.md
CHANGED
@@ -2,51 +2,49 @@
base_model: HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1
license: apache-2.0
---

# Importance-Matrix quantizations of HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1
# first Mixtral 8x22B finetune 💫

This is a hand-rolled quantization built on a custom but backwards-compatible fork of llama.cpp.
Hoping to push the edgequants to the main llama.cpp repo soon.

## MAKE SURE TO MERGE THE TWO PARTS TOGETHER AFTER DOWNLOADING
## I.e. download the 3bit orpo4ns.gguf.part0 & orpo4ns.gguf.part1 files, then:
```
cd ~/Downloads

cat orpo4ns.gguf.part* > orpo4ns.gguf

cd llamacppFolderLocation

./server -m ~/Downloads/orpo4ns.gguf -ngl 56
```
Careful: this can take 5 minutes, or 10-15 on slow instances; check progress with ls -la.
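
If you want to confirm the merge finished before deleting the parts, a quick size comparison is enough. A minimal sketch, assuming the same ~/Downloads layout as above:

```
# Sanity-check the merge: the merged file should be exactly as large
# as the sum of its parts.
cd ~/Downloads
parts_total=$(ls -l orpo4ns.gguf.part* | awk '{sum += $5} END {print sum}')
merged_total=$(ls -l orpo4ns.gguf | awk '{print $5}')
if [ "$parts_total" -eq "$merged_total" ]; then
  echo "merge OK: $merged_total bytes"
else
  echo "merge incomplete or still running: $merged_total of $parts_total bytes"
fi
```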

For LM Studio you need to copy the full orpo3ns.gguf file to ~/.cache/lm-studio/models/YourNAME/
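
A minimal sketch of that copy step, assuming the merged orpo3ns.gguf is sitting in ~/Downloads and using YourNAME as a placeholder folder name:

```
# Put the merged GGUF where LM Studio scans for local models
mkdir -p ~/.cache/lm-studio/models/YourNAME
cp ~/Downloads/orpo3ns.gguf ~/.cache/lm-studio/models/YourNAME/
```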

## orpo4ns.gguf is the fastest and recommended; a 2bit version is also done but not recommended.

The imatrix.dat file was calculated over 1000 chunks of wikitext.train.raw (included).
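
For reference, an importance matrix like this is normally produced with llama.cpp's imatrix tool. A sketch of what that run would look like; the F16 input filename is a placeholder, not a file shipped in this repo:

```
# Compute an importance matrix over 1000 chunks of the included wikitext data
./imatrix -m zephyr-orpo-141b-A35b-v0.1-f16.gguf -f wikitext.train.raw --chunks 1000 -o imatrix.dat
```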

Wrote a bit of custom C++ to avoid quantizing certain layers; tested fully compatible with llama.cpp as of 10 April 2024.

I'm no longer using gguf-split tensor sharding because the memory swapping slows down GPU inference a lot.
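
The layer-skipping itself lives in the custom fork, so no stock flag reproduces it exactly. The closest stock workflow applies the imatrix at quantization time, roughly like this (filenames and the quant type are illustrative placeholders; --leave-output-tensor only spares the output tensor, not arbitrary layers):

```
# Stock llama.cpp quantization step using the importance matrix computed above
./quantize --imatrix imatrix.dat --leave-output-tensor \
    zephyr-orpo-141b-A35b-v0.1-f16.gguf orpo4ns.gguf IQ4_NL
```
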
# Run with llama.cpp

```
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp/ && make -j

./main -m orpo4ns.gguf -n 256 -t 64 --temp 0.2 --color -p "How to build a city on mars via aldrin cycler orbit shipments?"
```
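
The same build also produces the server binary used at the top; once ./server is running you can check it is up with a quick completion request. A sketch, assuming llama.cpp's default port 8080:

```
# Query the running llama.cpp server
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "How to build a city on mars?", "n_predict": 64}'
```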

# Perplexity benchmarks

Command I used to run these on a 48-core, CPU-only machine; you can add -ngl 16 to offload 16 or more layers to your GPU.

```./perplexity -m orpo4ns.gguf -f wiki.test.raw --chunks 12 -t 48```
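
For example, the GPU-offloaded variant of the same run (a sketch; pick the -ngl value to match your VRAM) would be:

```./perplexity -m orpo4ns.gguf -f wiki.test.raw --chunks 12 -t 48 -ngl 16```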

# Lower is better. The F16 baseline is ~2.3; the 3bit 58GB version, however, is surprisingly not far off.
# orpo4ns.gguf is the fastest because of the 4bit/8bit optimizations in most hardware.

@@ -76,5 +74,8 @@ orpo2ns.gguf FILESIZE: 44026 MB
```
[1]3.0077,[2]3.5575,[3]4.1028,[4]4.4088,[5]4.2206,[6]4.1056,[7]4.1029,[8]4.1305,[9]4.1791,[10]4.3247,[11]4.4759,[12]4.4659,
Final estimate: PPL = 4.4659 +/- 0.16582
```

People on Twitter seem very happy with the 4bit version, getting 3x higher speeds (13.01 tps on an M3 Max MacBook) than 4bit MLX (4.5 tps).

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6379683a81c1783a4a2ddba8/id32eagz3KNxiK3NC6cTv.png)

# The 3bit version is surprisingly usable even though it is only 58GB. Use 3ns or 3nm if you have a 64GB Mac.