VictorSanh committed on
Commit 3449b08 · 1 Parent(s): 68ebcd0

update training datasets list

Files changed (1)
  1. README.md +18 -16
README.md CHANGED
@@ -61,15 +61,17 @@ We trained different variants T0 with different mixtures of datasets.
 
  |Model|Training datasets|
  |--|--|
- |T0_11B|- Multiple-Choice QA: CommonsenseQA, DREAM, QUAIL, QuaRTz, Social IQA, WiQA, Cosmos, QASC, Quarel, SciQ, Wiki Hop<br>- Extractive QA: Adversarial QA, Quoref, TyDiQA, DuoRC, ROPES<br>- Closed-Book QA: Hotpot QA, Wiki QA<br>- Structure-To-Text: Common Gen, Wiki Bio<br>- Sentiment: Amazon, App Reviews, IMDB, Rotten Tomatoes, Yelp<br>- Summarization: CNN Daily Mail, Gigaword, MultiNews, SamSum, XSum<br>- Topic Classification: AG News, DBPedia, TREC<br>- Paraphrase Identification: MRPC, PAWS, QQP|
- |T0p_11B|Same as T0_11B with a few additional datasets:<br>- Multiple-Choice QA: ARC, Circa, MC-TACO, Open Book QA, PiQA, RACE<br>- Extractive QA: CoQA, DROP, QA SRL,QuAC, ReCoRD, SQuAD v2<br>- Closed-Book QA: NQ Open, Trivia QA, Web Questions|
- |T0pp_11B|Same as T0p_11B with a few additional datasets from SuperGLUE:<br>- BoolQ<br>- COPA<br>- MultiRC<br>- ReCoRD<br>- WiC<br>- WSC|
  |T0_11B_single_prompt|Same as T0_11B but only one prompt per training dataset|
  |T0_11B_original_task_only|Same as T0_11B but only original tasks templates|
  |T0_3B|Same as T0_11B but starting from a T5-LM XL (3B parameters) pre-trained model|
 
  For reproducibility, we release the data we used for training (and evaluation) in the [P3 dataset](TODO). Prompts examples can be found on the dataset page.
 
  # Evaluation data
 
  We systematically evaluate our models on a suite of held-out tasks:
@@ -82,20 +84,20 @@ We systematically evaluate our models on a suite of held-out tasks:
  |Sentence completion|COPA, HellaSwag, Story Cloze|
 
  We also evaluate T0_11B, T0p_11B and T0pp_11B on the a subset of the [BIG-bench benchmark](https://github.com/google/BIG-bench):
- - code description task
- - conceptual_combinations
- - hindu_knowledge_json
- - known_unknowns
- - Language Identification
- - logic_grid_puzzle_task
- - logical_deduction
- - common_misconceptions
- - movie_dialog_same_or_different
- - novel_concepts
- - strategyqa
- - formal_fallacies_syllogisms_negation
  - VitaminC
- - winowhy_multiple_choice
 
  # Limitations
 
 
  |Model|Training datasets|
  |--|--|
+ |T0_11B|- Multiple-Choice QA: CommonsenseQA, DREAM, QUAIL, QuaRTz, Social IQA, WiQA, Cosmos, QASC, Quarel, SciQ, Wiki Hop<br>- Extractive QA: Adversarial QA, Quoref, TyDiQA, DuoRC, ROPES<br>- Closed-Book QA: Hotpot QA*, Wiki QA<br>- Structure-To-Text: Common Gen, Wiki Bio<br>- Sentiment: Amazon, App Reviews, IMDB, Rotten Tomatoes, Yelp<br>- Summarization: CNN Daily Mail, Gigaword, MultiNews, SamSum, XSum<br>- Topic Classification: AG News, DBPedia, TREC<br>- Paraphrase Identification: MRPC, PAWS, QQP|
+ |T0p_11B|Same as T0_11B with additional datasets from GPT-3's evaluation suite:<br>- Multiple-Choice QA: ARC, OpenBook QA, PiQA, RACE, HellaSwag<br>- Extractive QA: SQuAD v2<br>- Closed-Book QA: Trivia QA, Web Questions|
+ |T0pp_11B|Same as T0p_11B with a few additional datasets from SuperGLUE (excluding NLI sets):<br>- BoolQ<br>- COPA<br>- MultiRC<br>- ReCoRD<br>- WiC<br>- WSC|
  |T0_11B_single_prompt|Same as T0_11B but only one prompt per training dataset|
  |T0_11B_original_task_only|Same as T0_11B but only original tasks templates|
  |T0_3B|Same as T0_11B but starting from a T5-LM XL (3B parameters) pre-trained model|
 
  For reproducibility, we release the data we used for training (and evaluation) in the [P3 dataset](TODO). Prompts examples can be found on the dataset page.
 
+ *: We recast Hotpot QA as closed-book QA due to long input sequence length.
+
  # Evaluation data
 
  We systematically evaluate our models on a suite of held-out tasks:
 
  |Sentence completion|COPA, HellaSwag, Story Cloze|
 
  We also evaluate T0_11B, T0p_11B and T0pp_11B on the a subset of the [BIG-bench benchmark](https://github.com/google/BIG-bench):
+ - Code description task
+ - Conceptual combinations
+ - Hindu knowledge json
+ - Known unknowns
+ - Language identification
+ - Logic grid puzzle task
+ - Logical deduction
+ - Common misconceptions
+ - Movie dialog same or different
+ - Novel concepts
+ - Strategyqa
+ - Formal fallacies syllogisms negation
  - VitaminC
+ - Winowhy multiple choice
 
  # Limitations
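
The updated table describes the training mixtures cumulatively: T0p_11B is T0_11B plus extra datasets, and T0pp_11B is T0p_11B plus the SuperGLUE sets. The sketch below only illustrates that layering. The dataset names are copied from the new side of the diff, while the dict layout and the helper names `extend_mixture` and `count_datasets` are assumptions made for this example and are not part of the repository.

```python
# Illustrative sketch only: names come from the updated README table,
# the data layout and helpers are hypothetical.
T0 = {
    "Multiple-Choice QA": ["CommonsenseQA", "DREAM", "QUAIL", "QuaRTz", "Social IQA",
                           "WiQA", "Cosmos", "QASC", "Quarel", "SciQ", "Wiki Hop"],
    "Extractive QA": ["Adversarial QA", "Quoref", "TyDiQA", "DuoRC", "ROPES"],
    "Closed-Book QA": ["Hotpot QA", "Wiki QA"],  # Hotpot QA is recast as closed-book
    "Structure-To-Text": ["Common Gen", "Wiki Bio"],
    "Sentiment": ["Amazon", "App Reviews", "IMDB", "Rotten Tomatoes", "Yelp"],
    "Summarization": ["CNN Daily Mail", "Gigaword", "MultiNews", "SamSum", "XSum"],
    "Topic Classification": ["AG News", "DBPedia", "TREC"],
    "Paraphrase Identification": ["MRPC", "PAWS", "QQP"],
}

T0P_ADDITIONS = {
    "Multiple-Choice QA": ["ARC", "OpenBook QA", "PiQA", "RACE", "HellaSwag"],
    "Extractive QA": ["SQuAD v2"],
    "Closed-Book QA": ["Trivia QA", "Web Questions"],
}

T0PP_ADDITIONS = {
    "SuperGLUE (excluding NLI sets)": ["BoolQ", "COPA", "MultiRC", "ReCoRD", "WiC", "WSC"],
}


def extend_mixture(base, additions):
    """Return a new mixture: a copy of `base` with `additions` appended per category."""
    merged = {category: list(names) for category, names in base.items()}
    for category, names in additions.items():
        merged.setdefault(category, []).extend(names)
    return merged


def count_datasets(mixture):
    """Count dataset names across all categories of a mixture."""
    return sum(len(names) for names in mixture.values())


T0P = extend_mixture(T0, T0P_ADDITIONS)
T0PP = extend_mixture(T0P, T0PP_ADDITIONS)

if __name__ == "__main__":
    for name, mixture in [("T0_11B", T0), ("T0p_11B", T0P), ("T0pp_11B", T0PP)]:
        print(name, count_datasets(mixture))
```

Copying the base lists before extending keeps `T0` unchanged when the larger mixtures are built, so each variant can be inspected on its own.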