nisten committed on
Commit bbad105 · verified · 1 Parent(s): 44b2a1f

Update README.md

Files changed (1)
  1. README.md +20 -19

README.md CHANGED
@@ -2,51 +2,49 @@
  base_model: HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1
  license: apache-2.0
  ---
- ## MAKE SURE TO MERGE TOGETHER THE TWO PARTS AFTER DOWNLOADING OR IT WON'T WORK
- ## I.e. download the 3bit orpo3ns.gguf.part0 & orpo3ns.gguf.part1 files, then:
  ```
  cd ~/Downloads

- cat orpo3ns.gguf.part* > orpo3ns.gguf

  cd llamacppFolderLocation

- ./server -m ~/Downloads/orpo3ns.gguf -ngl 56
  ```
  For LM Studio you need to copy the full orpo3ns.gguf file to ~/.cache/lm-studio/models/YourNAME/

- ## orpo4ns.gguf is good to go, 2bit also done but not recommended.
-
- # Importance-Matrix quantizations of HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1
- # first mixtral8x22b finetune 💫

  the imatrix.dat file was calculated over 1000 chunks with wikitext.train.raw (included)

  Wrote a bit of custom C++ to avoid quantizing certain layers; tested fully compatible with llama.cpp as of 10 April 2024.

- To put it all in a single file (this is not needed with llama.cpp, which autodetects the chunks, but it can help when troubleshooting ollama):
- ```
- cat orpo4ns.gguf.part* > orpo4ns.gguf
- ```
- Careful: this can take 5 minutes, or up to 10-15 on slow instances; check progress with ls -la.

  # Run with llama.cpp
  ```
  git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp/ && make -j

- ./main -m ~/orpo4ns-00001-of-00005.gguf -n 256 -t 64 --temp 0.2 --color -p "How to build a city on Mars via Aldrin cycler orbits?"
  ```

  # Perplexity benchmarks

  Command I used to run these on a 48-core CPU-only machine; you can add -ngl 16 to offload 16 or more layers to the GPU on your own.

- ```./perplexity -m ~/orpo4ns.gguf -f wiki.test.raw --chunks 12 -t 48```

  # Lower is better. F16 baseline is ~2.3; the 3bit 58GB version, however, is surprisingly not far.
  # orpo4ns.gguf is the fastest because of 4bit/8bit optimizations in most hardware.
@@ -76,5 +74,8 @@ orpo2ns.gguf FILESIZE: 44026 MB
  [1]3.0077,[2]3.5575,[3]4.1028,[4]4.4088,[5]4.2206,[6]4.1056,[7]4.1029,[8]4.1305,[9]4.1791,[10]4.3247,[11]4.4759,[12]4.4659,
  Final estimate: PPL = 4.4659 +/- 0.16582
  ```

- # The 3bit version is surprisingly usable even though only 58GB
  base_model: HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1
  license: apache-2.0
  ---
+
+ # Importance-Matrix quantizations of HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1
+ # first mixtral8x22b finetune 💫
+
+ This is a handrolled quantization off of a custom but backwards-compatible fork of llama.cpp.
+ Hoping to push edgequants to the main llama.cpp repo soon.
+
+ ## MAKE SURE TO MERGE TOGETHER THE TWO PARTS AFTER DOWNLOADING
+ ## I.e. download the 4bit orpo4ns.gguf.part0 & orpo4ns.gguf.part1 files, then:
  ```
  cd ~/Downloads

+ cat orpo4ns.gguf.part* > orpo4ns.gguf

  cd llamacppFolderLocation

+ ./server -m ~/Downloads/orpo4ns.gguf -ngl 56
  ```
+ Careful: the merge can take 5 minutes, or up to 10-15 on slow instances; check progress with ls -la.
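+ A quick way to confirm the merge actually finished, as a sketch: the merged file should be exactly the sum of the part sizes (`stat -f %z` is the macOS spelling; GNU coreutils uses `stat -c %s`).
+
+ ```
+ # total bytes across the downloaded parts
+ stat -f %z orpo4ns.gguf.part* | awk '{s+=$1} END {print s}'
+ # should print the same number for the merged file
+ stat -f %z orpo4ns.gguf
+ ```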

  For LM Studio you need to copy the full orpo3ns.gguf file to ~/.cache/lm-studio/models/YourNAME/
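+ That is just a plain file copy; a minimal sketch, where `YourNAME` (and whether your LM Studio version wants a per-model subfolder) is a placeholder:
+
+ ```
+ mkdir -p ~/.cache/lm-studio/models/YourNAME
+ cp ~/Downloads/orpo3ns.gguf ~/.cache/lm-studio/models/YourNAME/
+ ```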

+ ## orpo4ns.gguf is the fastest and recommended; 2bit is also done but not recommended.

  the imatrix.dat file was calculated over 1000 chunks with wikitext.train.raw (included)
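+ For orientation, this is roughly how such an imatrix file is produced with llama.cpp's stock imatrix tool; a sketch with a hypothetical f16 input name (the custom fork may differ):
+
+ ```
+ # compute importance statistics over 1000 chunks of the calibration text
+ ./imatrix -m zephyr-orpo-141b-f16.gguf -f wikitext.train.raw -o imatrix.dat --chunks 1000
+ ```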

  Wrote a bit of custom C++ to avoid quantizing certain layers; tested fully compatible with llama.cpp as of 10 April 2024.
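+ The quantization step itself looks roughly like the stock quantize tool below; a hedged sketch with hypothetical file names and quant type, since the layer-skipping logic lives in the custom C++ rather than in stock llama.cpp:
+
+ ```
+ # apply the importance matrix while quantizing
+ ./quantize --imatrix imatrix.dat zephyr-orpo-141b-f16.gguf orpo4ns.gguf Q4_K_M
+ ```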

+ I'm no longer using gguf-split tensor sharding because the memory swapping slows down GPU inference a lot.
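+ If you still have one of the earlier sharded uploads (the orpo4ns-00001-of-00005.gguf style shown above), llama.cpp's gguf-split tool can merge the shards back into one file; a sketch, assuming the tool was built alongside the main binaries:
+
+ ```
+ ./gguf-split --merge orpo4ns-00001-of-00005.gguf orpo4ns.gguf
+ ```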

  # Run with llama.cpp

  ```
  git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp/ && make -j

+ ./main -m orpo4ns.gguf -n 256 -t 64 --temp 0.2 --color -p "How to build a city on Mars via Aldrin cycler orbit shipments?"
  ```
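+ If you started ./server as above instead of ./main, you can sanity-check the model over HTTP; a minimal sketch against the server's completion endpoint, assuming the default port 8080:
+
+ ```
+ curl http://localhost:8080/completion \
+   -H "Content-Type: application/json" \
+   -d '{"prompt": "How to build a city on Mars?", "n_predict": 64}'
+ ```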
  # Perplexity benchmarks

  Command I used to run these on a 48-core CPU-only machine; you can add -ngl 16 to offload 16 or more layers to the GPU on your own.

+ ```./perplexity -m orpo4ns.gguf -f wiki.test.raw --chunks 12 -t 48```
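+ For reference, the number ./perplexity reports is the exponentiated average negative log-likelihood over the evaluated chunks, so lower means the quant's predictions stay closer to the reference text:
+
+ $$ \mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \ln p(x_i \mid x_{<i})\right) $$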

  # Lower is better. F16 baseline is ~2.3; the 3bit 58GB version, however, is surprisingly not far.
  # orpo4ns.gguf is the fastest because of 4bit/8bit optimizations in most hardware.

  [1]3.0077,[2]3.5575,[3]4.1028,[4]4.4088,[5]4.2206,[6]4.1056,[7]4.1029,[8]4.1305,[9]4.1791,[10]4.3247,[11]4.4759,[12]4.4659,
  Final estimate: PPL = 4.4659 +/- 0.16582
  ```
+ People on Twitter seem very happy with the 4bit version, getting ~3x higher speeds (13.01 tps on an M3 Max MacBook) than 4bit MLX (4.5 tps).
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6379683a81c1783a4a2ddba8/id32eagz3KNxiK3NC6cTv.png)

+ # The 3bit version is surprisingly usable even though it's only 58GB. Use 3ns or 3nm if you have a 64GB Mac.
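+ Rough sizing note for that 64GB case: the 58GB of weights plus the KV cache have to fit inside what macOS lets Metal wire, which by default is less than the full 64GB. On recent macOS you can raise that limit with a sysctl; a hedged sketch with a hypothetical value (the setting resets on reboot):
+
+ ```
+ # allow up to ~56GB (value in MB) to be wired for the GPU
+ sudo sysctl iogpu.wired_limit_mb=57344
+ ```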