---
base_model: HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1
license: apache-2.0
---

# Importance-Matrix quantizations of HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1

# first mixtral8x22b finetune 💫

This is a hand-rolled quantization made with a custom but backwards-compatible fork of llama.cpp.

Hoping to push edgequants to the main llama.cpp repo soon.

## MAKE SURE TO MERGE TOGETHER THE TWO PARTS AFTER DOWNLOADING

## I.e. download the 4bit orpo4ns.gguf.part0 & orpo4ns.gguf.part1 files, then:

```
cd ~/Downloads

# Concatenate the parts in order into a single GGUF file
cat orpo4ns.gguf.part* > orpo4ns.gguf

cd llamacppFolderLocation

# Start the server with 56 layers offloaded to the GPU
./server -m ~/Downloads/orpo4ns.gguf -ngl 56
```

Careful: the merge can take 5 minutes, or up to 10-15 minutes on slow instances. Check progress with `ls -la`.

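A quick sanity check after merging, as a sketch: compare the merged file's byte count with the sum of the parts. `verify_merge` is a made-up helper name, not something shipped with llama.cpp, and it assumes the `.part*` files sit next to the merged file.

```shell
#!/usr/bin/env bash
# verify_merge is a hypothetical helper, not part of llama.cpp.
# Usage: verify_merge orpo4ns.gguf
verify_merge() {
  local merged="$1" total=0 part size merged_size
  # Add up the byte counts of every downloaded part.
  for part in "$merged".part*; do
    [ -f "$part" ] || continue
    size=$(wc -c < "$part")
    total=$((total + size))
  done
  if [ "$total" -eq 0 ] || [ ! -f "$merged" ]; then
    echo "missing parts or merged file for $merged" >&2
    return 2
  fi
  merged_size=$(( $(wc -c < "$merged") ))
  # The merge is complete only when both byte counts match.
  if [ "$merged_size" -eq "$total" ]; then
    echo "OK: $merged is complete ($merged_size bytes)"
  else
    echo "INCOMPLETE: $merged has $merged_size of $total bytes" >&2
    return 1
  fi
}
```

Run `verify_merge orpo4ns.gguf` from the download folder once `cat` has finished. A non-zero exit status means the merged file is still shorter (or longer) than the parts combined.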

For LM Studio you need to copy the full orpo3ns.gguf file to your ~/.cache/lm-studio/models/YourNAME/ folder.

## orpo4ns.gguf is the fastest and recommended. 2bit is also done but not recommended.

The imatrix.dat file was calculated over 1000 chunks of wikitext.train.raw (included).

I wrote a bit of custom C++ to avoid quantizing certain layers. Tested fully compatible with llama.cpp as of 10 April 2024.

I'm no longer using the gguf-split tensor sharding because the memory swapping slows down GPU inference a lot.

# Run with llama.cpp

```
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp/ && make -j

./main -m orpo4ns.gguf -n 256 -t 64 --temp 0.2 --color -p "How to build a city on mars via aldrin cycler orbit shipments?"
```

# Perplexity benchmarks

This is the command I used to run these on a 48-core, CPU-only machine. You can add -ngl 16 to offload 16 layers (or more) to the GPU on your own.

```./perplexity -m orpo4ns.gguf -f wiki.test.raw --chunks 12 -t 48 ```

# Lower is better. The F16 baseline is ~2.3; the 3bit 58GB version, however, is surprisingly not far off

# orpo4ns.gguf is the fastest because of 4bit/8bit optimizations in most hardware.

```bash
orpo4ns.gguf FILESIZE: 71260 MB
[1]2.6970,[2]3.1781,[3]3.7390,[4]3.4159,[5]2.8977,[6]2.7126,[7]2.5597,[8]2.5013,[9]2.5279,[10]2.5175,[11]2.5315,[12]2.5455,
Final estimate: PPL = 2.5455 +/- 0.07697

orpo3ns.gguf FILESIZE: 58536 MB
[1]2.8042,[2]3.3418,[3]3.9400,[4]3.5859,[5]3.2042,[6]3.0524,[7]2.9738,[8]2.9695,[9]3.0232,[10]3.0099,[11]3.0510,[12]3.0589,
Final estimate: PPL = 3.0589 +/- 0.09882

orpo3nm.gguf FILESIZE: 60828 MB
[1]2.8435,[2]3.2998,[3]3.8984,[4]3.4821,[5]3.1084,[6]2.9597,[7]2.8[9]2.9155,[10]2.9218,[11]2.9613,[12]2.9709,
Final estimate: PPL = 2.9709 +/- 0.09419

orpo3nl.gguf FILESIZE: 65405 MB
[1]2.8175,[2]3.2506,[3]3.8241,[4]3.4152,[5]2.9970,[6]2.8455,[7]2.7358,[8]2.7120,[9]2.7955,[10]2.8003,[11]2.8254,[12]2.8371,
Final estimate: PPL = 2.8371 +/- 0.08781

orpo2n.gguf FILESIZE: 49420 MB
[1]3.0082,[2]3.5829,[3]4.1414,[4]4.1671,[5]3.8567,[6]3.7209,[7]3.7150,[8]3.7210,[9]3.8445,[10]3.9332,[11]4.0879,[12]4.0884,
Final estimate: PPL = 4.0884 +/- 0.1499

orpo2ns.gguf FILESIZE: 44026 MB
[1]3.0077,[2]3.5575,[3]4.1028,[4]4.4088,[5]4.2206,[6]4.1056,[7]4.1029,[8]4.1305,[9]4.1791,[10]4.3247,[11]4.4759,[12]4.4659,
Final estimate: PPL = 4.4659 +/- 0.16582
```
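To put the final estimates next to the F16 baseline, here is a small sketch that prints each quant's perplexity increase in percent. The helper name `ppl_increase` is hypothetical, and 2.3 is only the approximate baseline quoted in this README, so treat the percentages as rough.

```shell
#!/usr/bin/env bash
# ppl_increase is a hypothetical helper.
# The default baseline of 2.3 is the approximate F16 figure from this README.
ppl_increase() {
  local ppl="$1" baseline="${2:-2.3}"
  # Relative increase over the baseline, in percent, one decimal place.
  LC_ALL=C awk -v p="$ppl" -v b="$baseline" 'BEGIN { printf "%.1f\n", (p / b - 1) * 100 }'
}

# Final PPL estimates copied from the benchmark runs in this README.
for entry in orpo4ns:2.5455 orpo3nl:2.8371 orpo3nm:2.9709 orpo3ns:3.0589 orpo2n:4.0884 orpo2ns:4.4659; do
  printf '%-8s +%s%%\n' "${entry%%:*}" "$(ppl_increase "${entry##*:}")"
done
```

Each run only covers 12 chunks and the error bars are not tiny, so small differences between neighbouring quants should not be over-read.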

People on twitter seem very happy with the 4bit version, getting roughly 3x higher speeds (13.01tps on an M3 Max macbook) than 4bit MLX (4.5tps).

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6379683a81c1783a4a2ddba8/id32eagz3KNxiK3NC6cTv.png)

# The 3bit version is surprisingly usable even though it is only 58GB. Use 3ns or 3nm if you have a 64GB mac.
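As a rough way to pick a file for your machine, this sketch lists the quants whose file size fits under a memory budget. `quants_that_fit` is a hypothetical helper, the sizes are the FILESIZE values from the benchmarks in this README, and the 4096 MB default headroom is purely an assumption (the real overhead depends on context size and whatever else is running).

```shell
#!/usr/bin/env bash
# quants_that_fit is a hypothetical helper.
# Usage: quants_that_fit <memory_mb> [headroom_mb]
quants_that_fit() {
  local budget_mb="$1" headroom_mb="${2:-4096}"  # headroom is an assumed value
  local limit=$((budget_mb - headroom_mb)) entry name size
  # name:size pairs, with sizes in MB as reported in this README, largest first.
  for entry in orpo4ns:71260 orpo3nl:65405 orpo3nm:60828 orpo3ns:58536 orpo2n:49420 orpo2ns:44026; do
    name=${entry%%:*}
    size=${entry##*:}
    if [ "$size" -le "$limit" ]; then
      echo "$name.gguf"
    fi
  done
}

# Example: a 64GB machine, taken here as 65536 MB.
quants_that_fit 65536
```

With these assumptions the 64GB case leaves out orpo4ns and orpo3nl and starts at orpo3nm, which lines up with the 3ns/3nm suggestion above.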