---
base_model: HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1
license: apache-2.0
---
# Importance-Matrix quantizations of HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1
# First Mixtral-8x22B finetune 💫
This is a hand-rolled quantization built on a custom but backwards-compatible fork of llama.cpp.
Hoping to push the edge-quant changes to the main llama.cpp repo soon.
## MAKE SURE TO MERGE THE TWO PARTS TOGETHER AFTER DOWNLOADING
## E.g. download the orpo4ns.gguf.part0 & orpo4ns.gguf.part1 files, then:
```
cd ~/Downloads
cat orpo4ns.gguf.part* > orpo4ns.gguf
cd llamaCppFolderLocation
./server -m ~/Downloads/orpo4ns.gguf -ngl 56
```
Careful: merging can take 5 minutes, or 10-15 on slow instances. Check progress with `ls -la`.
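Beyond watching the size grow, you can sanity-check the finished merge. A minimal sketch (assumes the two part files and the merged file are in the current directory):

```shell
# Compare the merged file's byte count against the parts combined.
parts_bytes=$(cat orpo4ns.gguf.part* | wc -c)
merged_bytes=$(wc -c < orpo4ns.gguf)
if [ "$parts_bytes" -eq "$merged_bytes" ]; then
  echo "merge OK: $merged_bytes bytes"
else
  echo "size mismatch: parts=$parts_bytes merged=$merged_bytes"
fi
# Every valid GGUF file also starts with the 4-byte magic "GGUF".
head -c 4 orpo4ns.gguf
```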
For LM Studio, copy the full merged orpo3ns.gguf file to your ~/.cache/lm-studio/models/YourNAME/
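As a concrete sketch (LM Studio's folder layout may differ by version; `YourNAME` is the placeholder publisher folder from the note above):

```shell
# Create the publisher folder LM Studio scans, then copy the merged model in.
mkdir -p ~/.cache/lm-studio/models/YourNAME
cp ~/Downloads/orpo3ns.gguf ~/.cache/lm-studio/models/YourNAME/
```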
## orpo4ns.gguf is the fastest and recommended. A 2-bit version is also done, but not recommended.
The imatrix.dat file was calculated over 1000 chunks of wikitext.train.raw (included).
Wrote a bit of custom C++ to avoid quantizing certain layers; tested fully compatible with llama.cpp as of 10 April 2024.
I'm no longer using gguf-split tensor sharding because the memory swapping slows down GPU inference a lot.
# Run with llama.cpp
```
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp/ && make -j
./main -m orpo4ns.gguf -n 256 -t 64 --temp 0.2 --color -p "How to build a city on Mars via Aldrin cycler orbit shipments?"
```
# Perplexity benchmarks
Command I used to run these on a 48-core CPU-only machine; add `-ngl 16` (or more) to offload that many layers to your GPU.
```
./perplexity -m orpo4ns.gguf -f wiki.test.raw --chunks 12 -t 48
```
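For context (my gloss, not from the tool's output itself): the bracketed numbers that `./perplexity` prints are cumulative running estimates, exp of the mean per-token negative log-likelihood over all chunks so far, which is why the last bracketed value matches the final estimate. If you ever combine chunk-level numbers yourself, average in log space rather than averaging perplexities directly:

```shell
# Toy example: combining two equal-sized chunks with PPL 2.0 and 8.0 gives
# exp((ln 2 + ln 8) / 2) = 4.0, the geometric mean -- not (2 + 8) / 2 = 5.
echo "2.0 8.0" | awk '{ printf "%.1f\n", exp((log($1) + log($2)) / 2) }'
# prints 4.0
```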
# Lower is better. The F16 baseline is ~2.3; the 3-bit 58GB version, however, is surprisingly not far off.
# orpo4ns.gguf is the fastest because most hardware has 4-bit/8-bit optimizations.
```bash
orpo4ns.gguf FILESIZE: 71260 MB
[1]2.6970,[2]3.1781,[3]3.7390,[4]3.4159,[5]2.8977,[6]2.7126,[7]2.5597,[8]2.5013,[9]2.5279,[10]2.5175,[11]2.5315,[12]2.5455,
Final estimate: PPL = 2.5455 +/- 0.07697
orpo3ns.gguf FILESIZE: 58536 MB
[1]2.8042,[2]3.3418,[3]3.9400,[4]3.5859,[5]3.2042,[6]3.0524,[7]2.9738,[8]2.9695,[9]3.0232,[10]3.0099,[11]3.0510,[12]3.0589,
Final estimate: PPL = 3.0589 +/- 0.09882
orpo3nm.gguf FILESIZE: 60828 MB
[1]2.8435,[2]3.2998,[3]3.8984,[4]3.4821,[5]3.1084,[6]2.9597,[7]2.8[9]2.9155,[10]2.9218,[11]2.9613,[12]2.9709,
Final estimate: PPL = 2.9709 +/- 0.09419
orpo3nl.gguf FILESIZE: 65405 MB
[1]2.8175,[2]3.2506,[3]3.8241,[4]3.4152,[5]2.9970,[6]2.8455,[7]2.7358,[8]2.7120,[9]2.7955,[10]2.8003,[11]2.8254,[12]2.8371,
Final estimate: PPL = 2.8371 +/- 0.08781
orpo2n.gguf FILESIZE: 49420 MB
[1]3.0082,[2]3.5829,[3]4.1414,[4]4.1671,[5]3.8567,[6]3.7209,[7]3.7150,[8]3.7210,[9]3.8445,[10]3.9332,[11]4.0879,[12]4.0884,
Final estimate: PPL = 4.0884 +/- 0.1499
orpo2ns.gguf FILESIZE: 44026 MB
[1]3.0077,[2]3.5575,[3]4.1028,[4]4.4088,[5]4.2206,[6]4.1056,[7]4.1029,[8]4.1305,[9]4.1791,[10]4.3247,[11]4.4759,[12]4.4659,
Final estimate: PPL = 4.4659 +/- 0.16582
```
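To see the tradeoff at a glance, here's a small sketch that recomputes each quant's perplexity increase relative to the 4-bit file, using only the final estimates from the table above:

```shell
# Relative PPL degradation of each quant vs the 4-bit baseline (2.5455).
awk 'BEGIN {
  base = 2.5455                                    # orpo4ns final estimate
  ppl["orpo3nl"] = 2.8371; ppl["orpo3nm"] = 2.9709; ppl["orpo3ns"] = 3.0589
  ppl["orpo2n"]  = 4.0884; ppl["orpo2ns"] = 4.4659
  for (m in ppl)
    printf "%s: +%.1f%% PPL vs 4-bit\n", m, 100 * (ppl[m] - base) / base
}'
```

By these numbers the 3-bit files trade roughly 11-20% higher perplexity for their smaller size, while the 2-bit files degrade much more steeply (+60% and up), which matches the "not recommended" note above.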
People on Twitter seem very happy with the 4-bit version, getting ~3x higher speeds (13.01 tps on an M3 Max MacBook) than 4-bit MLX (4.5 tps).
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6379683a81c1783a4a2ddba8/id32eagz3KNxiK3NC6cTv.png)
# The 3-bit version is surprisingly usable even though it's only 58GB. Use 3ns or 3nm if you have a 64GB Mac.