---
base_model: HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1
license: apache-2.0
---
## Alex, download the 3bit orpo3ns.gguf.part0 & orpo3ns.gguf.part1 files then:
``` 
cd ~/Downloads

cat orpo3ns.gguf.part* > orpo3ns.gguf && rm -rf orpo3ns.gguf.part*

```

For LM Studio, copy the full orpo3ns.gguf file to ~/.cache/lm-studio/models/YourNAME/

## orpo4ns.gguf is good to go; the 2bit version is also done but not recommended.


# Importance-Matrix quantizations of HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1 
# first mixtral8x22b finetune 💫

The imatrix.dat file was calculated over 1000 chunks of wikitext.train.raw (included).

I wrote a bit of custom C++ to avoid quantizing certain layers; the result is tested as fully compatible with llama.cpp as of 10 April 2024.

To join the parts into a single file (not needed with llama.cpp, which autodetects the chunks, but it can help when troubleshooting ollama):

```
cat orpo4ns.gguf.part* > orpo4ns.gguf

```
Careful: this can take 5 minutes, or up to 10-15 on slow instances. Check progress with ls -la.
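
Since the join runs for several minutes, it can be worth confirming the result before deleting any parts. The sketch below is a hypothetical helper (`join_parts` is not part of this repo or of llama.cpp): it joins the parts, then checks that the output size equals the sum of the part sizes. It assumes the parts sort correctly under the shell glob, as `.part0` and `.part1` do.

```bash
# join_parts OUT: concatenate OUT.part* into OUT and verify the byte count.
# Hypothetical helper for illustration only.
join_parts() {
  local out="$1" expected=0 actual part
  # Sum the sizes of all parts (the glob expands in sorted order).
  for part in "$out".part*; do
    [ -f "$part" ] || { echo "no parts found for $out" >&2; return 1; }
    expected=$(( expected + $(wc -c < "$part") ))
  done
  cat "$out".part* > "$out" || return 1
  actual=$(( $(wc -c < "$out") ))
  if [ "$actual" -eq "$expected" ]; then
    echo "ok: $out is $actual bytes"
  else
    echo "size mismatch: got $actual, expected $expected" >&2
    return 1
  fi
}

# Usage: join_parts orpo4ns.gguf
```

A size check only catches a truncated or interrupted join, not corrupted downloads.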

# Run with llama.cpp 

```
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp/ && make -j

./main -m ~/orpo4ns-00001-of-00005.gguf -n 256 -t 64 --temp 0.2 --color -p "How to build a city on mars via aldrin cycler orbits?"

```
# Perplexity benchmarks

This is the command I used to run these on a 48-core CPU-only machine. You can add -ngl 16 to offload 16 layers (or more) to the GPU.

```
./perplexity -m ~/orpo4ns.gguf -f wiki.test.raw --chunks 12 -t 48
```
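
The run ends with a line of the form `Final estimate: PPL = 2.5455 +/- 0.07697`, as in the results in this README. If you save the output to a log, a small helper can pull out just that number for comparison across files. This is a sketch: `final_ppl` and the log name `ppl.log` are made up for illustration.

```bash
# final_ppl LOGFILE: print the PPL value from the "Final estimate" line.
# Hypothetical helper; assumes the line format shown in this README.
final_ppl() {
  awk '/Final estimate: PPL =/ { print $5 }' "$1"
}

# Usage (capture both stdout and stderr into a log, then extract):
#   ./perplexity -m ~/orpo4ns.gguf -f wiki.test.raw --chunks 12 -t 48 2>&1 | tee ppl.log
#   final_ppl ppl.log
```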

# Lower is better. The F16 baseline is ~2.3; the 3bit 58GB version, however, is surprisingly not far off

```bash
orpo4ns.gguf 71260MB
[1]2.6970,[2]3.1781,[3]3.7390,[4]3.4159,[5]2.8977,[6]2.7126,[7]2.5597,[8]2.5013,[9]2.5279,[10]2.5175,[11]2.5315,[12]2.5455,
Final estimate: PPL = 2.5455 +/- 0.07697

orpo2ns.gguf 44026MB
[1]3.0077,[2]3.5575,[3]4.1028,[4]4.4088,[5]4.2206,[6]4.1056,[7]4.1029,[8]4.1305,[9]4.1791,[10]4.3247,[11]4.4759,[12]4.4659,
Final estimate: PPL = 4.4659 +/- 0.16582

orpo2n.gguf 49420MB
[1]3.0082,[2]3.5829,[3]4.1414,[4]4.1671,[5]3.8567,[6]3.7209,[7]3.7150,[8]3.7210,[9]3.8445,[10]3.9332,[11]4.0879,[12]4.0884,
Final estimate: PPL = 4.0884 +/- 0.1499

orpo3ns.gguf 58536MB
[1]2.8042,[2]3.3418,[3]3.9400,[4]3.5859,[5]3.2042,[6]3.0524,[7]2.9738,[8]2.9695,[9]3.0232,[10]3.0099,[11]3.0510,[12]3.0589,
Final estimate: PPL = 3.0589 +/- 0.09882
```
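
To put these numbers next to the F16 baseline, the sketch below prints each final PPL as a percent increase over a baseline value. `ppl_vs_baseline` is a hypothetical helper, and because the baseline is only quoted as ~2.3, the percentages are rough.

```bash
# ppl_vs_baseline BASELINE PPL...: print each PPL's percent increase over BASELINE.
# Hypothetical helper for illustration only.
ppl_vs_baseline() {
  local base="$1" ppl
  shift
  for ppl in "$@"; do
    awk -v b="$base" -v p="$ppl" \
      'BEGIN { printf "%s: +%.1f%%\n", p, (p / b - 1) * 100 }'
  done
}

# Usage with the final estimates above (4bit, 3bit, 2n, 2ns):
#   ppl_vs_baseline 2.3 2.5455 3.0589 4.0884 4.4659
```

With a 2.3 baseline this puts the 4bit file at roughly +11% and the 3bit file at roughly +33%, while both 2bit files land above +75%.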

# The 3bit version is surprisingly usable even though it is only 58GB