Redmond-Puffin-13B-V1.3 is a new model trained for multiple epochs on a dataset of 3,000 carefully curated GPT-4 examples, most of which are long-context conversations between a real human and GPT-4.

Additional data came from carefully curated subsections of datasets such as CamelAI's Physics, Chemistry, Biology, and Math.

## Prompt Format

Optional recommended pre-prompt / system prompt:

```
### response: Sure! sounds good.
```
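To run the model with this format, here is a minimal inference sketch using Hugging Face transformers. The repo id and the `### human:` opening marker are assumptions (only `### response:` appears in the excerpt above), so adjust both to match the actual model card:

```python
# Minimal inference sketch. Assumptions: the model is published as
# "NousResearch/Redmond-Puffin-13B" (repo id may differ), and prompts
# open with a "### human:" turn matching the format shown above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NousResearch/Redmond-Puffin-13B"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit a 13B model on one GPU
    device_map="auto",
)

# Leave the final "### response:" open for the model to complete.
prompt = "### human: Summarize the water cycle in two sentences.\n\n### response:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens after the prompt.
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```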
## Improvements over previous version:

The original Puffin model was loved by many, but it was quickly discovered to have dataset errors in a significant number of the conversations. The Puffin-V1.3 dataset fixes this issue, and the resulting corrected model has now fully finished training!
## When should I use Puffin or Hermes 2?

Puffin and Hermes-2 both beat the previous SOTA on the GPT4All benchmarks, with Hermes-2 winning by a 0.1% margin over Puffin.

- Hermes 2 is trained purely on single-turn instruction examples.
- Puffin is trained mostly on long multi-turn, long-context GPT-4 conversations, as well as curated single-turn examples covering Physics, Biology, Math, and Chemistry.

For these reasons, it's recommended to give Puffin a try if you want multi-turn conversations and/or long-context communication; a sketch of building such a multi-turn prompt follows below.

That being said, it's important to note that the commonly referenced benchmarks are all single-turn tests, and despite this, Puffin comes within 0.1% of the Hermes-2 GPT4All average score.

Puffin also beats Hermes-2 for the #1 spot in ARC-E, HellaSwag, and Winogrande, and perfectly ties with Hermes-2 in PIQA with the exact score of 80.69. (PIQA is a single-turn benchmark for common-sense reasoning about the physical world.)
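As a concrete illustration of the multi-turn use case, here is a small sketch of one way to stack a conversation history into a single prompt string. The helper is hypothetical (not part of this repo), and it assumes the `### human:` / `### response:` markers from the Prompt Format section:

```python
# Hypothetical helper: stacks prior turns into a Puffin-style prompt,
# assuming the "### human:" / "### response:" markers shown above.
def build_puffin_prompt(history, next_user_msg):
    """history: list of (human_message, model_response) pairs already exchanged."""
    parts = []
    for human, response in history:
        parts.append(f"### human: {human}\n\n### response: {response}\n\n")
    # Leave the final response open for the model to complete.
    parts.append(f"### human: {next_user_msg}\n\n### response:")
    return "".join(parts)

history = [("Can you help me plan a physics study session?", "Sure! sounds good.")]
print(build_puffin_prompt(history, "Great, let's start with kinematics."))
```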
## Notable Features:

In the near future we plan on leveraging the help of domain-specific expert volunteers. If you have at least a bachelor's degree in mathematics, physics, biology, or chemistry and would like to volunteer even just 30 minutes of your expertise time, please contact ldj on Discord!
## Benchmarks!

As of Puffin's release, it achieves a new SOTA for the GPT4All benchmarks, supplanting Hermes for the #1 position! (Scores rounded to the nearest tenth.)

Previous SOTA: Hermes - 68.8
New SOTA: Puffin - 69.9 (+1.1)

Note: after release, Puffin's average GPT4All score has since been beaten by 0.1% by Nous' very own model, Hermes-2!
Latest SOTA w/ Hermes-2: 70.0 (+0.1 over Puffin's 69.9 score)

That being said, Puffin still supplants even Hermes-2 for the #1 spot in ARC-E, HellaSwag, and Winogrande!

Puffin also perfectly ties with Hermes in PIQA.
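The per-task tables below follow the output style of EleutherAI's lm-evaluation-harness. As a rough, unverified sketch of how such numbers could be reproduced (assuming a harness version that exposes `evaluator.simple_evaluate` with the `hf-causal` model type, plus the assumed repo id from above):

```python
# Unverified sketch: scoring GPT4All-style tasks with EleutherAI's
# lm-evaluation-harness. Assumptions: an older harness version exposing
# evaluator.simple_evaluate with these arguments, the "hf-causal" model
# type, and the repo id "NousResearch/Redmond-Puffin-13B".
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=NousResearch/Redmond-Puffin-13B",
    tasks=["arc_challenge", "arc_easy", "boolq", "hellaswag",
           "openbookqa", "piqa", "winogrande"],
    num_fewshot=0,
)
print(results["results"])  # per-task acc / acc_norm, as in the tables below
```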
GPT4All:

```
| Task        |Version| Metric |Value | |Stderr|
|-------------|------:|--------|-----:|---|-----:|
|arc_challenge|      0|acc     |0.4983|± |0.0146|
|             |       |acc_norm|0.5068|± |0.0146|
|arc_easy     |      0|acc     |0.7980|± |0.0082|
|             |       |acc_norm|0.7757|± |0.0086|
|boolq        |      1|acc     |0.8150|± |0.0068|
|hellaswag    |      0|acc     |0.6132|± |0.0049|
|             |       |acc_norm|0.8043|± |0.0040|
|openbookqa   |      0|acc     |0.3560|± |0.0214|
|             |       |acc_norm|0.4560|± |0.0223|
|piqa         |      0|acc     |0.7954|± |0.0094|
|             |       |acc_norm|0.8069|± |0.0092|
|winogrande   |      0|acc     |0.7245|± |0.0126|
```
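As a sanity check on the headline average, here is a small sketch recomputing it from the table above. Which metric feeds the GPT4All average is an assumption here; taking acc_norm where reported and acc otherwise lands within rounding distance of the reported 69.9:

```python
# Recomputing the GPT4All average from the table above. Assumption:
# acc_norm is used where reported, plain acc otherwise (boolq and
# winogrande report only acc). With these choices the mean is ~69.85,
# within rounding of the reported 69.9; the author's exact convention
# is not stated in this README.
scores = {
    "arc_challenge": 0.5068,  # acc_norm
    "arc_easy":      0.7757,  # acc_norm
    "boolq":         0.8150,  # acc
    "hellaswag":     0.8043,  # acc_norm
    "openbookqa":    0.4560,  # acc_norm
    "piqa":          0.8069,  # acc_norm
    "winogrande":    0.7245,  # acc
}
average = 100 * sum(scores.values()) / len(scores)
print(f"GPT4All average: {average:.2f}")  # 69.85
```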
BigBench:

```
| Task                                            |Version| Metric              |Value | |Stderr|
|-------------------------------------------------|------:|---------------------|-----:|---|-----:|
|bigbench_causal_judgement                        |      0|multiple_choice_grade|0.5368|± |0.0363|
|bigbench_date_understanding                      |      0|multiple_choice_grade|0.7127|± |0.0236|
|bigbench_disambiguation_qa                       |      0|multiple_choice_grade|0.3023|± |0.0286|
|bigbench_geometric_shapes                        |      0|multiple_choice_grade|0.1003|± |0.0159|
|                                                 |       |exact_str_match      |0.0000|± |0.0000|
|bigbench_logical_deduction_five_objects          |      0|multiple_choice_grade|0.2520|± |0.0194|
|bigbench_logical_deduction_seven_objects         |      0|multiple_choice_grade|0.1743|± |0.0143|
|bigbench_logical_deduction_three_objects         |      0|multiple_choice_grade|0.4200|± |0.0285|
|bigbench_movie_recommendation                    |      0|multiple_choice_grade|0.2900|± |0.0203|
|bigbench_navigate                                |      0|multiple_choice_grade|0.5000|± |0.0158|
|bigbench_reasoning_about_colored_objects         |      0|multiple_choice_grade|0.5430|± |0.0111|
|bigbench_ruin_names                              |      0|multiple_choice_grade|0.4442|± |0.0235|
|bigbench_salient_translation_error_detection     |      0|multiple_choice_grade|0.2074|± |0.0128|
|bigbench_snarks                                  |      0|multiple_choice_grade|0.5083|± |0.0373|
|bigbench_sports_understanding                    |      0|multiple_choice_grade|0.4970|± |0.0159|
|bigbench_temporal_sequences                      |      0|multiple_choice_grade|0.3260|± |0.0148|
|bigbench_tracking_shuffled_objects_five_objects  |      0|multiple_choice_grade|0.2136|± |0.0116|
|bigbench_tracking_shuffled_objects_seven_objects |      0|multiple_choice_grade|0.1326|± |0.0081|
|bigbench_tracking_shuffled_objects_three_objects |      0|multiple_choice_grade|0.4200|± |0.0285|
```
AGI Eval:

```
| Task                          |Version| Metric |Value | |Stderr|
|-------------------------------|------:|--------|-----:|---|-----:|
|agieval_aqua_rat               |      0|acc     |0.2283|± |0.0264|
|                               |       |acc_norm|0.2244|± |0.0262|
|agieval_logiqa_en              |      0|acc     |0.2780|± |0.0176|
|                               |       |acc_norm|0.3164|± |0.0182|
|agieval_lsat_ar                |      0|acc     |0.2348|± |0.0280|
|                               |       |acc_norm|0.2043|± |0.0266|
|agieval_lsat_lr                |      0|acc     |0.3392|± |0.0210|
|                               |       |acc_norm|0.2961|± |0.0202|
|agieval_lsat_rc                |      0|acc     |0.4387|± |0.0303|
|                               |       |acc_norm|0.3569|± |0.0293|
|agieval_sat_en                 |      0|acc     |0.5874|± |0.0344|
|                               |       |acc_norm|0.5194|± |0.0349|
|agieval_sat_en_without_passage|      0|acc     |0.4223|± |0.0345|
|                               |       |acc_norm|0.3447|± |0.0332|
|agieval_sat_math               |      0|acc     |0.3364|± |0.0319|
|                               |       |acc_norm|0.2773|± |0.0302|
```