---
base_model: HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1
license: apache-2.0
---

# Importance-Matrix quantizations of HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1, the first Mixtral-8x22B finetune 💫

The imatrix.dat file was calculated over 1,000 chunks of wikitext.train.raw (included in this repo).
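
A minimal sketch of how an importance matrix like this can be generated with llama.cpp's `imatrix` tool (the f16 input filename below is a placeholder, not the exact file used here):

```
# compute the importance matrix over 1000 chunks of the calibration text
./imatrix -m zephyr-orpo-141b-f16.gguf -f wikitext.train.raw -o imatrix.dat --chunks 1000
```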

Wrote a bit of custom C++ to avoid quantizing certain layers; tested fully compatible with llama.cpp as of 10 April 2024.
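
For reference, the stock llama.cpp quantization step that consumes the imatrix looks roughly like this (filenames and quant type are placeholders; the layer-skipping logic mentioned above was a custom patch and is not part of stock `quantize`):

```
# quantize the f16 model with the importance matrix (stock llama.cpp invocation)
./quantize --imatrix imatrix.dat zephyr-orpo-141b-f16.gguf orpo4ns.gguf Q4_K_S
```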

To put it all back together as a single file (not needed with llama.cpp, which autodetects the chunks, but it can help when troubleshooting Ollama):

```
cat orpo4ns.gguf.part* > orpo4ns.gguf

```
Careful: this can take 5 minutes, or 10-15 on slow instances; check progress with `ls -la`.

# Run with llama.cpp 

```
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp/ && make -j

./main -m ~/orpo4ns-00001-of-00005.gguf -n 256 -t 64 --temp 0.2 --color -p "How to build a city on mars via aldrin cycler orbits?"

```
# Perplexity benchmarks

Command I used to run these on a 48-core, CPU-only machine; you can add `-ngl 16` (or more) to offload 16 or more layers to the GPU.

```
./perplexity -m ~/orpo4ns.gguf -f wiki.test.raw --chunks 12 -t 48
```
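
For example, the same run with 16 layers offloaded to the GPU (assuming a CUDA or Metal build of llama.cpp):

```
./perplexity -m ~/orpo4ns.gguf -f wiki.test.raw --chunks 12 -t 48 -ngl 16
```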