wenhu committed on
Commit
58300f4
1 Parent(s): af5aae4

Update README.md

  1. README.md +1 -3
README.md CHANGED
@@ -24,9 +24,8 @@ The models are fine-tuned with the MetricInstruct dataset using the original Lla
 
  TIGERScore significantly surpasses traditional metrics, i.e. BLEU, ROUGE, BARTScore, and BLEURT, and emerging LLM-based metrics as reference-free metrics. Though our dataset was originally sourced from ChatGPT, our distilled model actually outperforms ChatGPT itself, which proves the effectiveness of our filtering strategy. On the unseen task of story generation, TIGERScore also demonstrates reasonable generalization capability.
 
- | Tasks→ | Summarization | Translation | Data2Text | Long-form QA | MathQA | Inst-Fol | Story-Gen | Average |
+ | Tasks→ | Summarization | Translation | Data2Text | Long-form QA | MathQA | Instruction Following | Story-Gen | Average |
  |-------------------------------------------|----------------|----------------|----------------|-----------------|----------------|----------------|----------------|----------------|
- | Metrics↓ Datasets→ | SummaEval | WMT22-zh-en | WebNLG2020 | ASQA+ | gsm8k | LIMA+ | ROC | |
  | GPT-3.5-turbo (few-shot) | **38.50** | 40.53 | 40.20 | 29.33 | **66.46** | 23.20 | 4.77 | 34.71 |
  | GPT-4 (zero-shot) | 36.46 | **43.87** | **44.04** | **48.95** | 51.71 | **58.53** | **32.48** | **45.15** |
  | BLEU | 11.98 | 19.73 | 33.29 | 11.38 | 21.12 | **46.61** | -1.17 | 20.42 |
@@ -48,7 +47,6 @@ TIGERScore significantly surpasses traditional metrics, i.e. BLEU, ROUGE, BARTSc
  | TIGERScore-13B (ours) | 36.81 | 44.99 | **45.88** | 46.22 | **23.32** | **47.03** | **46.36** | **41.52** |
  | Δ (ours - best reference-free) | -2 | -3 | +12 | +5 | +9 | +14 | +13 | +16 |
 
-
  ## Formatting
 
 
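The README does not say how the Average column is computed. For the four rows visible in this diff, the reported values are consistent with an unweighted mean over the seven task columns. The sketch below checks that arithmetic; the names `rows` and `row_average` are hypothetical helpers introduced here, and the unweighted-mean definition is an assumption, not something the README states.

```python
# Per-task scores copied from the table rows visible in this diff.
# Column order: Summarization, Translation, Data2Text, Long-form QA,
# MathQA, Instruction Following, Story-Gen.
rows = {
    "GPT-3.5-turbo (few-shot)": [38.50, 40.53, 40.20, 29.33, 66.46, 23.20, 4.77],
    "GPT-4 (zero-shot)": [36.46, 43.87, 44.04, 48.95, 51.71, 58.53, 32.48],
    "BLEU": [11.98, 19.73, 33.29, 11.38, 21.12, 46.61, -1.17],
    "TIGERScore-13B (ours)": [36.81, 44.99, 45.88, 46.22, 23.32, 47.03, 46.36],
}


def row_average(scores):
    """Unweighted mean of the per-task scores (assumed definition of Average)."""
    return sum(scores) / len(scores)


# Print each row's mean to two decimals for comparison with the Average column.
for name, scores in rows.items():
    print(f"{name}: {row_average(scores):.2f}")
```

Rounded to two decimals, the means match the Average values shown for these rows (34.71, 45.15, 20.42, 41.52). Rows elided from the diff are not checked here.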