nisten committed on
Commit bbad105 · verified · 1 Parent(s): 44b2a1f

Update README.md

Files changed (1)
  1. README.md +20 -19

README.md CHANGED
@@ -2,51 +2,49 @@
  base_model: HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1
  license: apache-2.0
  ---
- ## MAKE SURE TO MERGE TOGETHER THE TWO PARTS AFTER DOWNLOADING OR IT WON'T WORK
- ## I.e. download the 3bit orpo3ns.gguf.part0 & orpo3ns.gguf.part1 files, then:
  ```
  cd ~/Downloads

- cat orpo3ns.gguf.part* > orpo3ns.gguf

  cd llamacppFolderLocation

- ./server -m ~/Downloads/orpo3ns.gguf -ngl 56
  ```
  For LM Studio you need to copy the full orpo3ns.gguf file to ~/.cache/lm-studio/models/YourNAME/

- ## orpo4ns.gguf is good to go, 2bit also done but not recommended.
-
- # Importance-Matrix quantizations of HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1
- # first mixtral8x22b finetune 💫

  the imatrix.dat file was calculated over 1000 chunks with wikitext.train.raw (included)

  Wrote a bit of custom C++ to avoid quantizing certain layers; tested fully compatible with llama.cpp as of 10 April 2024.

- To put it all in a single file (this is not needed with llama.cpp, which autodetects the chunks, but it can help when troubleshooting ollama):
- ```
- cat orpo4ns.gguf.part* > orpo4ns.gguf
- ```
- Careful: this can take 5 minutes, or up to 10-15 on slow instances; check progress with ls -la.

  # Run with llama.cpp
  ```
  git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp/ && make -j

- ./main -m ~/orpo4ns-00001-of-00005.gguf -n 256 -t 64 --temp 0.2 --color -p "How to build a city on Mars via Aldrin cycler orbits?"
  ```

  # Perplexity benchmarks

  Command I used to run these on a 48-core CPU-only machine; you can add -ngl 16 to offload 16 or more layers to the GPU on your own.

- ```./perplexity -m ~/orpo4ns.gguf -f wiki.test.raw --chunks 12 -t 48```

  # Lower is better. F16 baseline is ~2.3; the 3bit 58GB version, however, is surprisingly not far.
  # orpo4ns.gguf is the fastest because of 4bit/8bit optimizations in most hardware.
@@ -76,5 +74,8 @@ orpo2ns.gguf FILESIZE: 44026 MB
  [1]3.0077,[2]3.5575,[3]4.1028,[4]4.4088,[5]4.2206,[6]4.1056,[7]4.1029,[8]4.1305,[9]4.1791,[10]4.3247,[11]4.4759,[12]4.4659,
  Final estimate: PPL = 4.4659 +/- 0.16582
  ```

- # The 3bit version is surprisingly usable even though only 58GB
  base_model: HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1
  license: apache-2.0
  ---
+
+ # Importance-Matrix quantizations of HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1
+ # first mixtral8x22b finetune 💫
+
+ This is a handrolled quantization off of a custom but backwards-compatible fork of llama.cpp.
+ Hoping to push edgequants to the main llama.cpp repo soon.
+
+ ## MAKE SURE TO MERGE TOGETHER THE TWO PARTS AFTER DOWNLOADING
+ ## I.e. download the 4bit orpo4ns.gguf.part0 & orpo4ns.gguf.part1 files, then:
  ```
  cd ~/Downloads

+ cat orpo4ns.gguf.part* > orpo4ns.gguf

  cd llamacppFolderLocation

+ ./server -m ~/Downloads/orpo4ns.gguf -ngl 56
  ```
+ Careful: the merge can take 5 minutes, or up to 10-15 on slow instances; check progress with ls -la.
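+ A quick way to confirm the merge actually finished, as a sketch: the merged file should be exactly the sum of the part sizes (`stat -f %z` is the macOS spelling; GNU coreutils uses `stat -c %s`).
+
+ ```
+ # total bytes across the downloaded parts
+ stat -f %z orpo4ns.gguf.part* | awk '{s+=$1} END {print s}'
+ # should print the same number for the merged file
+ stat -f %z orpo4ns.gguf
+ ```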

  For LM Studio you need to copy the full orpo3ns.gguf file to ~/.cache/lm-studio/models/YourNAME/
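+ That is just a plain file copy; a minimal sketch, where `YourNAME` (and whether your LM Studio version wants a per-model subfolder) is a placeholder:
+
+ ```
+ mkdir -p ~/.cache/lm-studio/models/YourNAME
+ cp ~/Downloads/orpo3ns.gguf ~/.cache/lm-studio/models/YourNAME/
+ ```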

+ ## orpo4ns.gguf is the fastest and recommended; 2bit is also done but not recommended.

  the imatrix.dat file was calculated over 1000 chunks with wikitext.train.raw (included)
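+ For orientation, this is roughly how such an imatrix file is produced with llama.cpp's stock imatrix tool; a sketch with a hypothetical f16 input name (the custom fork may differ):
+
+ ```
+ # compute importance statistics over 1000 chunks of the calibration text
+ ./imatrix -m zephyr-orpo-141b-f16.gguf -f wikitext.train.raw -o imatrix.dat --chunks 1000
+ ```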

  Wrote a bit of custom C++ to avoid quantizing certain layers; tested fully compatible with llama.cpp as of 10 April 2024.
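+ The quantization step itself looks roughly like the stock quantize tool below; a hedged sketch with hypothetical file names and quant type, since the layer-skipping logic lives in the custom C++ rather than in stock llama.cpp:
+
+ ```
+ # apply the importance matrix while quantizing
+ ./quantize --imatrix imatrix.dat zephyr-orpo-141b-f16.gguf orpo4ns.gguf Q4_K_M
+ ```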

+ I'm no longer using gguf-split tensor sharding because the memory swapping slows down GPU inference a lot.
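+ If you still have one of the earlier sharded uploads (the orpo4ns-00001-of-00005.gguf style shown above), llama.cpp's gguf-split tool can merge the shards back into one file; a sketch, assuming the tool was built alongside the main binaries:
+
+ ```
+ ./gguf-split --merge orpo4ns-00001-of-00005.gguf orpo4ns.gguf
+ ```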

  # Run with llama.cpp

  ```
  git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp/ && make -j

+ ./main -m orpo4ns.gguf -n 256 -t 64 --temp 0.2 --color -p "How to build a city on Mars via Aldrin cycler orbit shipments?"
  ```
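+ If you started ./server as above instead of ./main, you can sanity-check the model over HTTP; a minimal sketch against the server's completion endpoint, assuming the default port 8080:
+
+ ```
+ curl http://localhost:8080/completion \
+   -H "Content-Type: application/json" \
+   -d '{"prompt": "How to build a city on Mars?", "n_predict": 64}'
+ ```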
  # Perplexity benchmarks

  Command I used to run these on a 48-core CPU-only machine; you can add -ngl 16 to offload 16 or more layers to the GPU on your own.

+ ```./perplexity -m orpo4ns.gguf -f wiki.test.raw --chunks 12 -t 48```
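+ For reference, the number ./perplexity reports is the exponentiated average negative log-likelihood over the evaluated chunks, so lower means the quant's predictions stay closer to the reference text:
+
+ $$ \mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \ln p(x_i \mid x_{<i})\right) $$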

  # Lower is better. F16 baseline is ~2.3; the 3bit 58GB version, however, is surprisingly not far.
  # orpo4ns.gguf is the fastest because of 4bit/8bit optimizations in most hardware.

  [1]3.0077,[2]3.5575,[3]4.1028,[4]4.4088,[5]4.2206,[6]4.1056,[7]4.1029,[8]4.1305,[9]4.1791,[10]4.3247,[11]4.4759,[12]4.4659,
  Final estimate: PPL = 4.4659 +/- 0.16582
  ```
+ People on Twitter seem very happy with the 4bit version, getting ~3x higher speeds (13.01 tps on an M3 Max MacBook) than 4bit MLX (4.5 tps).
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6379683a81c1783a4a2ddba8/id32eagz3KNxiK3NC6cTv.png)

+ # The 3bit version is surprisingly usable even though it's only 58GB. Use 3ns or 3nm if you have a 64GB Mac.
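+ Rough sizing note for that 64GB case: the 58GB of weights plus the KV cache have to fit inside what macOS lets Metal wire, which by default is less than the full 64GB. On recent macOS you can raise that limit with a sysctl; a hedged sketch with a hypothetical value (the setting resets on reboot):
+
+ ```
+ # allow up to ~56GB (value in MB) to be wired for the GPU
+ sudo sysctl iogpu.wired_limit_mb=57344
+ ```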