brucethemoose committed
Commit 284bbd8
1 Parent(s): 08479db

Create README.md

Files changed (1):
  1. README.md +120 -0

---
license: other
license_name: yi-license
license_link: https://huggingface.co/01-ai/Yi-34B/blob/main/LICENSE
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- text-generation-inference
---

**Dolphin-2.2-yi-34b-200k**, **Nous-Capybara-34B**, **Tess-M-v1.4**, **Airoboros-3_1-yi-34b-200k**, **PlatYi-34B-200K-Q**, and **Una-xaberius-34b-v1beta** merged with a new, experimental implementation of "dare ties" via mergekit.

Quantized with the git version of exllamav2, using a calibration set of 200 rows (400K tokens) drawn from a long Orca-Vicuna-format chat, a selected sci-fi story, and a fantasy story. This should hopefully yield better chat/storytelling performance than the short, default wikitext calibration.

4bpw is enough for **~47K context on a 24GB GPU.** I would highly recommend running it in exui for speed at long context. I go into more detail in this [Reddit post](https://old.reddit.com/r/LocalLLaMA/comments/1896igc/how_i_run_34b_models_at_75k_context_on_24gb_fast/).
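
As a rough sanity check on that figure, here is a back-of-envelope VRAM estimate in plain Python. The architecture constants are Yi-34B's published config values (60 layers, 8 KV heads, head dim 128, ~34.4B parameters); the ~4.1 average bits/weight, the 8-bit KV cache, and the flat overhead term are my own assumptions, so treat the result as an estimate rather than a guarantee:
```
# Back-of-envelope VRAM estimate for the "~47K context on 24GB" figure.
# Assumptions: ~4.1 avg bits/weight (4bpw body + 6-bit head), exllamav2's 8-bit KV cache,
# and ~0.5 GB of activations/CUDA overhead.
GB = 1e9

weights_gb = 34.4e9 * 4.1 / 8 / GB              # ~17.6 GB of quantized weights
kv_bytes_per_token = 2 * 60 * 8 * 128 * 1       # K+V, 60 layers, 8 KV heads, dim 128, 1 byte each
cache_gb = 47_000 * kv_bytes_per_token / GB     # ~5.8 GB of KV cache at 47K tokens
overhead_gb = 0.5

total_gb = weights_gb + cache_gb + overhead_gb
print(f"weights ~{weights_gb:.1f} GB + cache ~{cache_gb:.1f} GB + overhead ~{overhead_gb:.1f} GB "
      f"= ~{total_gb:.1f} GB")                  # just about fits a 24 GB card
```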

Merged with the following config, and the tokenizer from chargoddard's Yi-Llama:
```
models:
  - model: /home/alpha/Storage/Models/Raw/chargoddard_Yi-34B-200K-Llama
    # no parameters necessary for base model
  - model: /home/alpha/Storage/Models/Raw/migtissera_Tess-34B-v1.4
    parameters:
      weight: 0.19
      density: 0.44
  - model: /home/alpha/Storage/Models/Raw/bhenrym14_airoboros-3_1-yi-34b-200k
    parameters:
      weight: 0.14
      density: 0.34
  - model: /home/alpha/Storage/Models/Raw/Nous-Capybara-34B
    parameters:
      weight: 0.19
      density: 0.44
  - model: /home/alpha/Storage/Models/Raw/kyujinpy_PlatYi-34B-200K-Q
    parameters:
      weight: 0.14
      density: 0.34
  - model: /home/alpha/FastModels/ehartford_dolphin-2.2-yi-34b-200k
    parameters:
      weight: 0.19
      density: 0.44
  - model: /home/alpha/FastModels/fblgit_una-xaberius-34b-v1beta
    parameters:
      weight: 0.15
      density: 0.08
merge_method: dare_ties
base_model: /home/alpha/Storage/Models/Raw/chargoddard_Yi-34B-200K-Llama
parameters:
  int8_mask: true
dtype: bfloat16
```
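
If you want to reproduce a merge like this, a minimal sketch using mergekit's Python entry point is below. It assumes a recent mergekit install (the experimental dare branch used for this merge may expose a slightly different interface); the config filename and output path are placeholders:
```
# Sketch: run a dare_ties merge from a YAML config via mergekit's Python API.
# Assumes a recent mergekit; the experimental "dare" branch may differ slightly.
import yaml
import torch

from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

with open("dare_ties.yml", "r", encoding="utf-8") as fp:
    merge_config = MergeConfiguration.model_validate(yaml.safe_load(fp))

run_merge(
    merge_config,
    "./CaPlatTessDolXaBoros-Yi-34B-200K-DARE-Ties",  # output directory (placeholder)
    options=MergeOptions(
        cuda=torch.cuda.is_available(),  # use the GPU for tensor math if available
        copy_tokenizer=True,             # copy the base (Yi-Llama) tokenizer into the output
        lazy_unpickle=True,              # stream shards to keep RAM usage down
    ),
)
```
The `mergekit-yaml` command-line entry point does the same job; the Python form just makes the options explicit.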

First exllama quantization pass (measurement only, saved via `-om`):
```
python convert.py --in_dir /home/alpha/FastModels/CaPlatTessDolXaBoros-Yi-34B-200K-DARE-Ties -o /home/alpha/FastModels/scratch -om /home/alpha/FastModels/mes.json --cal_dataset /home/alpha/Documents/smol.parquet -l 2048 -r 80 -ml 2048 -mr 40 -gr 40 -ss 4096 -nr -b 4.0 -hb 6
```

Second exllama quantization pass (reuses the measurement via `-m` and compiles the final 4bpw model via `-cf`):
```
python convert.py --in_dir /home/alpha/FastModels/CaPlatTessDolXaBoros-Yi-34B-200K-DARE-Ties -o /home/alpha/FastModels/scratch -m /home/alpha/FastModels/mes.json --cal_dataset /home/alpha/Documents/medium.parquet -l 2048 -r 200 -ml 2048 -mr 40 -gr 200 -ss 4096 -b 4.0 -hb 6 -cf /home/alpha/FastModels/CaPlatTessDolXaBoros-Yi-34B-200K-DARE-Ties-exl2-4bpw-fiction -nr
```

## Testing Notes

Various densities were tested with perplexity tests and high-context prompts. Relatively high densities seem to perform better, contrary to the findings of the Super Mario paper.

Weights that add up to 1 seem to be optimal.

Dare Ties also seems to produce better, lower-perplexity merges than a regular ties merge, task arithmetic, or a slerp merge.

Xaberius is not a 200K model, hence it was merged at a very low density to try and preserve Yi 200K's long-context performance while still inheriting some of Xaberius's performance.

I chose not to include other finetunes because they aren't trained on the 200K base. If any other 200K finetunes pop up, let me know.

***

## Prompt template: Orca-Vicuna?
```
SYSTEM: {system_message}
USER: {prompt}
ASSISTANT:
```
It might recognize ChatML from Dolphin+Xaberius, and Llama-chat from Airoboros.

Sometimes the model "spells out" the stop token as `</s>` like Capybara, so you may need to add `</s>` as an additional stopping condition.
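
A backend-agnostic way to handle both points above is to build prompts with that template and treat the literal `</s>` text as an extra stop string. A minimal sketch (the helper names are mine, not part of any library):
```
# Format an Orca-Vicuna prompt and trim a "spelled out" stop token from the output.
def build_prompt(system_message: str, user_message: str) -> str:
    return f"SYSTEM: {system_message}\nUSER: {user_message}\nASSISTANT:"

def trim_stop(completion: str, stop: str = "</s>") -> str:
    # The model sometimes writes "</s>" as literal text instead of emitting EOS,
    # so cut the completion at the first occurrence of the stop string.
    idx = completion.find(stop)
    return (completion[:idx] if idx != -1 else completion).rstrip()
```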

***

## Running

Being a Yi model, try disabling the BOS token and/or running a lower temperature with 0.05-0.13 MinP, a little repetition penalty, and no other samplers. Yi tends to run "hot" by default.
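
As a rough illustration, sampler settings along those lines with exllamav2's Python generator might look like the sketch below. This assumes a current exllamav2 build (attribute and argument names can shift between versions), and the model path is a placeholder:
```
# Sketch: conservative Yi sampling with exllamav2 (check names against your installed version).
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_8bit, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/path/to/CaPlatTessDolXaBoros-Yi-34B-200K-DARE-Ties-exl2-4bpw-fiction"
config.prepare()
config.max_seq_len = 47104                       # ~47K context; scale to your VRAM

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_8bit(model, lazy=True)    # 8-bit cache roughly halves KV memory
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7                       # run cooler than usual
settings.min_p = 0.1                             # within the 0.05-0.13 range
settings.top_k = 0                               # disable other samplers
settings.top_p = 1.0
settings.token_repetition_penalty = 1.05         # a little repetition penalty

prompt = "SYSTEM: You are a helpful assistant.\nUSER: Hello!\nASSISTANT:"
print(generator.generate_simple(prompt, settings, 256, add_bos=False))   # no BOS token
```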

24GB GPUs can run Yi-34B-200K models at **45K-75K context** with exllamav2. I go into more detail in this [post](https://old.reddit.com/r/LocalLLaMA/comments/1896igc/how_i_run_34b_models_at_75k_context_on_24gb_fast/).

I recommend exl2 quantizations profiled on data similar to the desired task. The model is especially sensitive to the quantization data at low bpw!

To load this in full-context backends like transformers and vllm, you *must* change `max_position_embeddings` in config.json to a value lower than 200,000, otherwise you will OOM!
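
For example, a quick way to make that edit with plain Python before loading in transformers or vllm (32768 is just an example value; pick whatever fits your hardware):
```
# Cap the advertised context length in config.json so full-context backends don't
# try to allocate for 200K tokens. 32768 is an arbitrary example value.
import json
from pathlib import Path

config_path = Path("/path/to/CaPlatTessDolXaBoros-Yi-34B-200K-DARE-Ties/config.json")
cfg = json.loads(config_path.read_text())
cfg["max_position_embeddings"] = 32768
config_path.write_text(json.dumps(cfg, indent=2))
```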

***

## Credits:

https://github.com/turboderp/exllamav2

https://github.com/cg123/mergekit/tree/dare

https://huggingface.co/ehartford/dolphin-2.2-yi-34b-200k

https://huggingface.co/kyujinpy/PlatYi-34B-200K-Q

https://huggingface.co/NousResearch/Nous-Capybara-34B/

https://huggingface.co/bhenrym14/airoboros-3_1-yi-34b-200k

https://huggingface.co/migtissera/Tess-M-v1.4

https://huggingface.co/fblgit/una-xaberius-34b-v1beta

https://huggingface.co/chargoddard/Yi-34B-200K-Llama

https://huggingface.co/01-ai/Yi-34B-200K