LDJnr committed
Commit c2734cc · 1 Parent(s): f1c90f6

Update README.md

Files changed (1):
  1. README.md +93 -4

README.md CHANGED
@@ -33,7 +33,7 @@ Notable mentions for assisting in some of the training issues goes to: Caseus an

 Redmond-Puffin-13B-V1.3 is a new model trained for multiple epochs on a dataset of 3,000 carefully curated GPT-4 examples, most of which are long context conversations between a real human and GPT-4.

- Additional data came from carefully curated subsections of datasets such as CamelAI's Physics, Chemistry, Biology and Math.
+ Additional data came from carefully curated sub sections of datasets such as CamelAI's Physics, Chemistry, Biology and Math.

 ## Prompt Format

@@ -53,12 +53,24 @@ Optional recommended pre-prompt / system prompt:
 ### response: Sure! sounds good.
 ```

-
 ## Improvements over previous version:

 The original Puffin model was loved by many; however, it was quickly discovered to have dataset errors in a significant number of the conversations.
 The Puffin-V1.3 dataset solves this issue, and the resulting fixed model has now fully finished training!

+ ## When should I use Puffin or Hermes 2?
+
+ Puffin and Hermes-2 both beat the previous SOTA on the GPT4All benchmarks, with Hermes-2 winning by a 0.1% margin over Puffin.
+
+ - Hermes 2 is trained purely on single-turn instruction examples.
+
+ - Puffin is trained mostly on long multi-turn, long-context GPT-4 conversations, as well as curated single-turn examples covering Physics, Biology, Math and Chemistry.
+
+ For these reasons, it's recommended to give Puffin a try if you want multi-turn conversations and/or long-context communication.
+
+ That being said, it's important to note that the commonly referenced benchmarks are all single-turn tests, and despite this, Puffin reaches within 0.1% of the Hermes-2 GPT4All average score.
+
+ Puffin also beats Hermes-2 for the #1 spot in Arc-E, HellaSwag and Winogrande, and perfectly ties with Hermes-2 in PIQA with the exact score of 80.69 (PIQA is a single-turn benchmark for common-sense reasoning about the physical world).

 ## Notable Features:

@@ -90,6 +102,83 @@ In the near future we plan on leveraging the help of domain specific expert volu

 If you have at least a bachelor's in mathematics, physics, biology or chemistry and would like to volunteer even just 30 minutes of your expertise time, please contact ldj on Discord!

- ## Benchmarks coming soon

- benchmarks coming soon!
+ ## Benchmarks!
+
+ As of Puffin's release, it achieves a new SOTA for the GPT4All benchmarks, supplanting Hermes for the #1 position!
+ (Rounded to nearest tenth)
+
+ Previous SOTA: Hermes - 68.8
+ New SOTA: Puffin - 69.9 (+1.1)
+
+ Note: after release, Puffin has since had its average GPT4All score beaten by 0.1% by Nous' very own model, Hermes-2!
+ Latest SOTA w/ Hermes-2 - 70.0 (+0.1 over Puffin's 69.9 score)
+
+ That said, Puffin still supplants even Hermes-2 for the #1 spot in Arc-E, HellaSwag and Winogrande!
+
+ Puffin also perfectly ties with Hermes in PIQA.
+
+ GPT4All:
+
+ ```
+ | Task |Version| Metric |Value | |Stderr|
+ |-------------|------:|--------|-----:|---|-----:|
+ |arc_challenge| 0|acc |0.4983|± |0.0146|
+ | | |acc_norm|0.5068|± |0.0146|
+ |arc_easy | 0|acc |0.7980|± |0.0082|
+ | | |acc_norm|0.7757|± |0.0086|
+ |boolq | 1|acc |0.8150|± |0.0068|
+ |hellaswag | 0|acc |0.6132|± |0.0049|
+ | | |acc_norm|0.8043|± |0.0040|
+ |openbookqa | 0|acc |0.3560|± |0.0214|
+ | | |acc_norm|0.4560|± |0.0223|
+ |piqa | 0|acc |0.7954|± |0.0094|
+ | | |acc_norm|0.8069|± |0.0092|
+ |winogrande | 0|acc |0.7245|± |0.0126|
+ ```
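For reference, the 69.9 GPT4All average cited above can be reproduced from this table. Which metric (`acc` vs `acc_norm`) feeds the average for each task is an assumption on our part; the combination below is one that matches the reported number:

```python
# Sketch: reproduce Puffin's reported GPT4All average (69.9) from the table above.
# The acc vs acc_norm choice per task is an assumption, not stated in the README;
# this particular selection reproduces the reported rounded average.
scores = {
    "arc_challenge": 0.4983,  # acc
    "arc_easy":      0.7980,  # acc
    "boolq":         0.8150,  # acc
    "hellaswag":     0.8043,  # acc_norm
    "openbookqa":    0.4560,  # acc_norm
    "piqa":          0.7954,  # acc
    "winogrande":    0.7245,  # acc
}

# Unweighted mean over the seven tasks, expressed as a percentage.
average = 100 * sum(scores.values()) / len(scores)
print(round(average, 1))  # → 69.9
```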
+
+
+ BigBench:
+
+ ```
+ | Task |Version| Metric |Value | |Stderr|
+ |------------------------------------------------|------:|---------------------|-----:|---|-----:|
+ |bigbench_causal_judgement | 0|multiple_choice_grade|0.5368|± |0.0363|
+ |bigbench_date_understanding | 0|multiple_choice_grade|0.7127|± |0.0236|
+ |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.3023|± |0.0286|
+ |bigbench_geometric_shapes | 0|multiple_choice_grade|0.1003|± |0.0159|
+ | | |exact_str_match |0.0000|± |0.0000|
+ |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.2520|± |0.0194|
+ |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.1743|± |0.0143|
+ |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.4200|± |0.0285|
+ |bigbench_movie_recommendation | 0|multiple_choice_grade|0.2900|± |0.0203|
+ |bigbench_navigate | 0|multiple_choice_grade|0.5000|± |0.0158|
+ |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.5430|± |0.0111|
+ |bigbench_ruin_names | 0|multiple_choice_grade|0.4442|± |0.0235|
+ |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.2074|± |0.0128|
+ |bigbench_snarks | 0|multiple_choice_grade|0.5083|± |0.0373|
+ |bigbench_sports_understanding | 0|multiple_choice_grade|0.4970|± |0.0159|
+ |bigbench_temporal_sequences | 0|multiple_choice_grade|0.3260|± |0.0148|
+ |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.2136|± |0.0116|
+ |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1326|± |0.0081|
+ |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.4200|± |0.0285|
+ ```
+
+ AGI Eval:
+
+ ```
+ | Task |Version| Metric |Value | |Stderr|
+ |------------------------------|------:|--------|-----:|---|-----:|
+ |agieval_aqua_rat | 0|acc |0.2283|± |0.0264|
+ | | |acc_norm|0.2244|± |0.0262|
+ |agieval_logiqa_en | 0|acc |0.2780|± |0.0176|
+ | | |acc_norm|0.3164|± |0.0182|
+ |agieval_lsat_ar | 0|acc |0.2348|± |0.0280|
+ | | |acc_norm|0.2043|± |0.0266|
+ |agieval_lsat_lr | 0|acc |0.3392|± |0.0210|
+ | | |acc_norm|0.2961|± |0.0202|
+ |agieval_lsat_rc | 0|acc |0.4387|± |0.0303|
+ | | |acc_norm|0.3569|± |0.0293|
+ |agieval_sat_en | 0|acc |0.5874|± |0.0344|
+ | | |acc_norm|0.5194|± |0.0349|
+ |agieval_sat_en_without_passage| 0|acc |0.4223|± |0.0345|
+ | | |acc_norm|0.3447|± |0.0332|
+ |agieval_sat_math | 0|acc |0.3364|± |0.0319|
+ | | |acc_norm|0.2773|± |0.0302|
+ ```
+ | | |acc_norm|0.2773|± |0.0302|```