LDJnr committed
Commit c2734cc · 1 Parent(s): f1c90f6

Update README.md

Files changed (1):
  1. README.md +93 -4

README.md CHANGED
@@ -33,7 +33,7 @@ Notable mentions for assisting in some of the training issues goes to: Caseus an

 Redmond-Puffin-13B-V1.3 is a new model trained for multiple epochs on a dataset of 3,000 carefully curated GPT-4 examples, most of which are long context conversations between a real human and GPT-4.

- Additional data came from carefully curated subsections of datasets such as CamelAI's Physics, Chemistry, Biology and Math.
+ Additional data came from carefully curated sub sections of datasets such as CamelAI's Physics, Chemistry, Biology and Math.

 ## Prompt Format

@@ -53,12 +53,24 @@ Optional recommended pre-prompt / system prompt:
 ### response: Sure! sounds good.
 ```

-
 ## Improvements over previous version:

 The original Puffin model was loved by many; however, it was quickly discovered to have dataset errors in a significant number of the conversations.
 The Puffin-V1.3 dataset solves this issue, and the resulting fixed model has now fully finished training!

+ ## When should I use Puffin or Hermes 2?
+
+ Puffin and Hermes-2 both beat the previous SOTA on the GPT4All benchmarks, with Hermes-2 winning by a 0.1% margin over Puffin.
+
+ - Hermes 2 is trained purely on single-turn instruction examples.
+
+ - Puffin is trained mostly on long multi-turn, long-context GPT-4 conversations, as well as curated single-turn examples covering Physics, Biology, Math and Chemistry.
+
+ For these reasons, it's recommended to give Puffin a try if you want multi-turn conversations and/or long-context communication.
+
+ That being said, it's important to note that the commonly referenced benchmarks are all single-turn tests, and despite this, Puffin reaches within 0.1% of the Hermes-2 GPT4All average score.
+
+ Puffin also beats Hermes-2 for the #1 spot in Arc-E, HellaSwag and Winogrande, and perfectly ties with Hermes-2 in PIQA with the exact score of 80.69 (PIQA is a single-turn benchmark for common-sense reasoning about the physical world).

 ## Notable Features:

@@ -90,6 +102,83 @@ In the near future we plan on leveraging the help of domain specific expert volu

 If you have at least a bachelor's in mathematics, physics, biology or chemistry and would like to volunteer even just 30 minutes of your expertise time, please contact ldj on Discord!

- ## Benchmarks coming soon

- benchmarks coming soon!
+ ## Benchmarks!
+
+ As of Puffin's release, it achieves a new SOTA for the GPT4All benchmarks, supplanting Hermes for the #1 position!
+ (Rounded to nearest tenth)
+
+ Previous SOTA: Hermes - 68.8
+ New SOTA: Puffin - 69.9 (+1.1)
+
+ Note: after release, Puffin has since had its average GPT4All score beaten by 0.1% by Nous' very own model, Hermes-2!
+ Latest SOTA w/ Hermes-2 - 70.0 (+0.1 over Puffin's 69.9 score)
+
+ That said, Puffin still supplants even Hermes-2 for the #1 spot in Arc-E, HellaSwag and Winogrande!
+
+ Puffin also perfectly ties with Hermes in PIQA.
+
+ GPT4All:
+
+ ```
+ | Task |Version| Metric |Value | |Stderr|
+ |-------------|------:|--------|-----:|---|-----:|
+ |arc_challenge| 0|acc |0.4983|± |0.0146|
+ | | |acc_norm|0.5068|± |0.0146|
+ |arc_easy | 0|acc |0.7980|± |0.0082|
+ | | |acc_norm|0.7757|± |0.0086|
+ |boolq | 1|acc |0.8150|± |0.0068|
+ |hellaswag | 0|acc |0.6132|± |0.0049|
+ | | |acc_norm|0.8043|± |0.0040|
+ |openbookqa | 0|acc |0.3560|± |0.0214|
+ | | |acc_norm|0.4560|± |0.0223|
+ |piqa | 0|acc |0.7954|± |0.0094|
+ | | |acc_norm|0.8069|± |0.0092|
+ |winogrande | 0|acc |0.7245|± |0.0126|
+ ```
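For reference, the 69.9 GPT4All average cited above can be reproduced from this table. Which metric (`acc` vs `acc_norm`) feeds the average for each task is an assumption on our part; the combination below is one that matches the reported number:

```python
# Sketch: reproduce Puffin's reported GPT4All average (69.9) from the table above.
# The acc vs acc_norm choice per task is an assumption, not stated in the README;
# this particular selection reproduces the reported rounded average.
scores = {
    "arc_challenge": 0.4983,  # acc
    "arc_easy":      0.7980,  # acc
    "boolq":         0.8150,  # acc
    "hellaswag":     0.8043,  # acc_norm
    "openbookqa":    0.4560,  # acc_norm
    "piqa":          0.7954,  # acc
    "winogrande":    0.7245,  # acc
}

# Unweighted mean over the seven tasks, expressed as a percentage.
average = 100 * sum(scores.values()) / len(scores)
print(round(average, 1))  # → 69.9
```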
+
+
+ BigBench:
+
+ ```
+ | Task |Version| Metric |Value | |Stderr|
+ |------------------------------------------------|------:|---------------------|-----:|---|-----:|
+ |bigbench_causal_judgement | 0|multiple_choice_grade|0.5368|± |0.0363|
+ |bigbench_date_understanding | 0|multiple_choice_grade|0.7127|± |0.0236|
+ |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.3023|± |0.0286|
+ |bigbench_geometric_shapes | 0|multiple_choice_grade|0.1003|± |0.0159|
+ | | |exact_str_match |0.0000|± |0.0000|
+ |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.2520|± |0.0194|
+ |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.1743|± |0.0143|
+ |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.4200|± |0.0285|
+ |bigbench_movie_recommendation | 0|multiple_choice_grade|0.2900|± |0.0203|
+ |bigbench_navigate | 0|multiple_choice_grade|0.5000|± |0.0158|
+ |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.5430|± |0.0111|
+ |bigbench_ruin_names | 0|multiple_choice_grade|0.4442|± |0.0235|
+ |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.2074|± |0.0128|
+ |bigbench_snarks | 0|multiple_choice_grade|0.5083|± |0.0373|
+ |bigbench_sports_understanding | 0|multiple_choice_grade|0.4970|± |0.0159|
+ |bigbench_temporal_sequences | 0|multiple_choice_grade|0.3260|± |0.0148|
+ |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.2136|± |0.0116|
+ |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1326|± |0.0081|
+ |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.4200|± |0.0285|
+ ```
+
+ AGI Eval:
+
+ ```
+ | Task |Version| Metric |Value | |Stderr|
+ |------------------------------|------:|--------|-----:|---|-----:|
+ |agieval_aqua_rat | 0|acc |0.2283|± |0.0264|
+ | | |acc_norm|0.2244|± |0.0262|
+ |agieval_logiqa_en | 0|acc |0.2780|± |0.0176|
+ | | |acc_norm|0.3164|± |0.0182|
+ |agieval_lsat_ar | 0|acc |0.2348|± |0.0280|
+ | | |acc_norm|0.2043|± |0.0266|
+ |agieval_lsat_lr | 0|acc |0.3392|± |0.0210|
+ | | |acc_norm|0.2961|± |0.0202|
+ |agieval_lsat_rc | 0|acc |0.4387|± |0.0303|
+ | | |acc_norm|0.3569|± |0.0293|
+ |agieval_sat_en | 0|acc |0.5874|± |0.0344|
+ | | |acc_norm|0.5194|± |0.0349|
+ |agieval_sat_en_without_passage| 0|acc |0.4223|± |0.0345|
+ | | |acc_norm|0.3447|± |0.0332|
+ |agieval_sat_math | 0|acc |0.3364|± |0.0319|
+ | | |acc_norm|0.2773|± |0.0302|
+ ```
+ | | |acc_norm|0.2773|± |0.0302|```