VictorSanh committed · Commit 3449b08 · 1 Parent(s): 68ebcd0

update training datasets list

README.md CHANGED
@@ -61,15 +61,17 @@ We trained different variants T0 with different mixtures of datasets.

 |Model|Training datasets|
 |--|--|
-|T0_11B|- Multiple-Choice QA: CommonsenseQA, DREAM, QUAIL, QuaRTz, Social IQA, WiQA, Cosmos, QASC, Quarel, SciQ, Wiki Hop<br>- Extractive QA: Adversarial QA, Quoref, TyDiQA, DuoRC, ROPES<br>- Closed-Book QA: Hotpot QA
-|T0p_11B|Same as T0_11B with
-|T0pp_11B|Same as T0p_11B with a few additional datasets from SuperGLUE:<br>- BoolQ<br>- COPA<br>- MultiRC<br>- ReCoRD<br>- WiC<br>- WSC|
 |T0_11B_single_prompt|Same as T0_11B but only one prompt per training dataset|
 |T0_11B_original_task_only|Same as T0_11B but only original tasks templates|
 |T0_3B|Same as T0_11B but starting from a T5-LM XL (3B parameters) pre-trained model|

 For reproducibility, we release the data we used for training (and evaluation) in the [P3 dataset](TODO). Prompts examples can be found on the dataset page.

 # Evaluation data

 We systematically evaluate our models on a suite of held-out tasks:
@@ -82,20 +84,20 @@ We systematically evaluate our models on a suite of held-out tasks:
 |Sentence completion|COPA, HellaSwag, Story Cloze|

 We also evaluate T0_11B, T0p_11B and T0pp_11B on the a subset of the [BIG-bench benchmark](https://github.com/google/BIG-bench):
--
--
--
--
-- Language
--
--
--
--
--
--
--
 - VitaminC
--

 # Limitations


 |Model|Training datasets|
 |--|--|
+|T0_11B|- Multiple-Choice QA: CommonsenseQA, DREAM, QUAIL, QuaRTz, Social IQA, WiQA, Cosmos, QASC, Quarel, SciQ, Wiki Hop<br>- Extractive QA: Adversarial QA, Quoref, TyDiQA, DuoRC, ROPES<br>- Closed-Book QA: Hotpot QA*, Wiki QA<br>- Structure-To-Text: Common Gen, Wiki Bio<br>- Sentiment: Amazon, App Reviews, IMDB, Rotten Tomatoes, Yelp<br>- Summarization: CNN Daily Mail, Gigaword, MultiNews, SamSum, XSum<br>- Topic Classification: AG News, DBPedia, TREC<br>- Paraphrase Identification: MRPC, PAWS, QQP|
+|T0p_11B|Same as T0_11B with additional datasets from GPT-3's evaluation suite:<br>- Multiple-Choice QA: ARC, OpenBook QA, PiQA, RACE, HellaSwag<br>- Extractive QA: SQuAD v2<br>- Closed-Book QA: Trivia QA, Web Questions|
+|T0pp_11B|Same as T0p_11B with a few additional datasets from SuperGLUE (excluding NLI sets):<br>- BoolQ<br>- COPA<br>- MultiRC<br>- ReCoRD<br>- WiC<br>- WSC|
 |T0_11B_single_prompt|Same as T0_11B but only one prompt per training dataset|
 |T0_11B_original_task_only|Same as T0_11B but only original tasks templates|
 |T0_3B|Same as T0_11B but starting from a T5-LM XL (3B parameters) pre-trained model|

 For reproducibility, we release the data we used for training (and evaluation) in the [P3 dataset](TODO). Prompts examples can be found on the dataset page.

+*: We recast Hotpot QA as closed-book QA due to long input sequence length.
+
 # Evaluation data

 We systematically evaluate our models on a suite of held-out tasks:

 |Sentence completion|COPA, HellaSwag, Story Cloze|

 We also evaluate T0_11B, T0p_11B and T0pp_11B on the a subset of the [BIG-bench benchmark](https://github.com/google/BIG-bench):
+- Code description task
+- Conceptual combinations
+- Hindu knowledge json
+- Known unknowns
+- Language identification
+- Logic grid puzzle task
+- Logical deduction
+- Common misconceptions
+- Movie dialog same or different
+- Novel concepts
+- Strategyqa
+- Formal fallacies syllogisms negation
 - VitaminC
+- Winowhy multiple choice

 # Limitations

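
The updated table describes the T0_11B, T0p_11B and T0pp_11B training mixtures cumulatively: each variant is "same as" the previous one plus extra datasets. A minimal sketch of that relationship, assuming the dataset names exactly as they appear in the added rows of this diff. The names `T0_DATASETS`, `T0P_EXTRA`, `T0PP_EXTRA`, `merge_mixtures`, `MIXTURES` and `count_datasets` are hypothetical and are not taken from the T0 codebase or the P3 dataset.

```python
# Dataset names are copied from the "+" rows of the table in this diff.
# Hotpot QA is listed under Closed-Book QA because the README footnote says
# it was recast as closed-book QA due to long input sequence length.
T0_DATASETS = {
    "Multiple-Choice QA": [
        "CommonsenseQA", "DREAM", "QUAIL", "QuaRTz", "Social IQA", "WiQA",
        "Cosmos", "QASC", "Quarel", "SciQ", "Wiki Hop",
    ],
    "Extractive QA": ["Adversarial QA", "Quoref", "TyDiQA", "DuoRC", "ROPES"],
    "Closed-Book QA": ["Hotpot QA", "Wiki QA"],
    "Structure-To-Text": ["Common Gen", "Wiki Bio"],
    "Sentiment": ["Amazon", "App Reviews", "IMDB", "Rotten Tomatoes", "Yelp"],
    "Summarization": ["CNN Daily Mail", "Gigaword", "MultiNews", "SamSum", "XSum"],
    "Topic Classification": ["AG News", "DBPedia", "TREC"],
    "Paraphrase Identification": ["MRPC", "PAWS", "QQP"],
}

# Extra datasets for T0p_11B (from GPT-3's evaluation suite, per the table).
T0P_EXTRA = {
    "Multiple-Choice QA": ["ARC", "OpenBook QA", "PiQA", "RACE", "HellaSwag"],
    "Extractive QA": ["SQuAD v2"],
    "Closed-Book QA": ["Trivia QA", "Web Questions"],
}

# Extra datasets for T0pp_11B. The table lists them without a task category,
# so the key below is an illustrative label, not a category from the README.
T0PP_EXTRA = {
    "SuperGLUE (excluding NLI sets)": [
        "BoolQ", "COPA", "MultiRC", "ReCoRD", "WiC", "WSC",
    ],
}


def merge_mixtures(base, extra):
    """Return a new mixture: a copy of `base` with the `extra` datasets appended."""
    merged = {task: list(names) for task, names in base.items()}
    for task, names in extra.items():
        merged.setdefault(task, []).extend(names)
    return merged


def count_datasets(mixture):
    """Count the datasets across all task categories of a mixture."""
    return sum(len(names) for names in mixture.values())


MIXTURES = {"T0_11B": T0_DATASETS}
MIXTURES["T0p_11B"] = merge_mixtures(MIXTURES["T0_11B"], T0P_EXTRA)
MIXTURES["T0pp_11B"] = merge_mixtures(MIXTURES["T0p_11B"], T0PP_EXTRA)

if __name__ == "__main__":
    for model, mixture in MIXTURES.items():
        print(model, count_datasets(mixture))
```

Because `merge_mixtures` copies the base lists before extending them, building T0p_11B and T0pp_11B leaves the T0_11B entry unchanged.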