leaderboard-pr-bot commited on
Commit
ffdf04e
·
verified ·
1 Parent(s): c6d1adf

Adding Evaluation Results

Browse files

This is an automated PR created with https://huggingface.co/spaces/Weyaxi/open-llm-leaderboard-results-pr

The purpose of this PR is to add evaluation results from the Open LLM Leaderboard to your model card.

If you encounter any issues, please report them to https://huggingface.co/spaces/Weyaxi/open-llm-leaderboard-results-pr/discussions

Files changed (1) hide show
  1. README.md +113 -5
README.md CHANGED
@@ -1,15 +1,110 @@
1
  ---
2
- pipeline_tag: text-generation
 
 
3
  tags:
4
  - granite
5
  - ibm
6
  - lab
7
  - labrador
8
  - labradorite
9
- license: apache-2.0
10
- language:
11
- - en
12
  base_model: ibm/granite-7b-base
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
13
  ---
14
  # Model Card for Granite-7b-lab [Paper](https://arxiv.org/abs/2403.01081)
15
 
@@ -93,4 +188,17 @@ We advise utilizing the system prompt employed during the model's training for o
93
 
94
  **Bias, Risks, and Limitations**
95
 
96
- Granite-7b-lab is a base model and has not undergone any safety alignment, there it may produce problematic outputs. In the absence of adequate safeguards and RLHF, there exists a risk of malicious utilization of these models for generating disinformation or harmful content. Caution is urged against complete reliance on a specific language model for crucial decisions or impactful information, as preventing these models from fabricating content is not straightforward. Additionally, it remains uncertain whether smaller models might exhibit increased susceptibility to hallucination in ungrounded generation scenarios due to their reduced sizes and memorization capacities. This aspect is currently an active area of research, and we anticipate more rigorous exploration, comprehension, and mitigations in this domain.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  ---
2
+ language:
3
+ - en
4
+ license: apache-2.0
5
  tags:
6
  - granite
7
  - ibm
8
  - lab
9
  - labrador
10
  - labradorite
 
 
 
11
  base_model: ibm/granite-7b-base
12
+ pipeline_tag: text-generation
13
+ model-index:
14
+ - name: granite-7b-instruct
15
+ results:
16
+ - task:
17
+ type: text-generation
18
+ name: Text Generation
19
+ dataset:
20
+ name: IFEval (0-Shot)
21
+ type: HuggingFaceH4/ifeval
22
+ args:
23
+ num_few_shot: 0
24
+ metrics:
25
+ - type: inst_level_strict_acc and prompt_level_strict_acc
26
+ value: 29.72
27
+ name: strict accuracy
28
+ source:
29
+ url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=ibm-granite/granite-7b-instruct
30
+ name: Open LLM Leaderboard
31
+ - task:
32
+ type: text-generation
33
+ name: Text Generation
34
+ dataset:
35
+ name: BBH (3-Shot)
36
+ type: BBH
37
+ args:
38
+ num_few_shot: 3
39
+ metrics:
40
+ - type: acc_norm
41
+ value: 12.64
42
+ name: normalized accuracy
43
+ source:
44
+ url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=ibm-granite/granite-7b-instruct
45
+ name: Open LLM Leaderboard
46
+ - task:
47
+ type: text-generation
48
+ name: Text Generation
49
+ dataset:
50
+ name: MATH Lvl 5 (4-Shot)
51
+ type: hendrycks/competition_math
52
+ args:
53
+ num_few_shot: 4
54
+ metrics:
55
+ - type: exact_match
56
+ value: 0.6
57
+ name: exact match
58
+ source:
59
+ url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=ibm-granite/granite-7b-instruct
60
+ name: Open LLM Leaderboard
61
+ - task:
62
+ type: text-generation
63
+ name: Text Generation
64
+ dataset:
65
+ name: GPQA (0-shot)
66
+ type: Idavidrein/gpqa
67
+ args:
68
+ num_few_shot: 0
69
+ metrics:
70
+ - type: acc_norm
71
+ value: 4.7
72
+ name: acc_norm
73
+ source:
74
+ url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=ibm-granite/granite-7b-instruct
75
+ name: Open LLM Leaderboard
76
+ - task:
77
+ type: text-generation
78
+ name: Text Generation
79
+ dataset:
80
+ name: MuSR (0-shot)
81
+ type: TAUR-Lab/MuSR
82
+ args:
83
+ num_few_shot: 0
84
+ metrics:
85
+ - type: acc_norm
86
+ value: 8.82
87
+ name: acc_norm
88
+ source:
89
+ url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=ibm-granite/granite-7b-instruct
90
+ name: Open LLM Leaderboard
91
+ - task:
92
+ type: text-generation
93
+ name: Text Generation
94
+ dataset:
95
+ name: MMLU-PRO (5-shot)
96
+ type: TIGER-Lab/MMLU-Pro
97
+ config: main
98
+ split: test
99
+ args:
100
+ num_few_shot: 5
101
+ metrics:
102
+ - type: acc
103
+ value: 14.29
104
+ name: accuracy
105
+ source:
106
+ url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=ibm-granite/granite-7b-instruct
107
+ name: Open LLM Leaderboard
108
  ---
109
  # Model Card for Granite-7b-lab [Paper](https://arxiv.org/abs/2403.01081)
110
 
 
188
 
189
  **Bias, Risks, and Limitations**
190
 
191
+ Granite-7b-lab is a base model and has not undergone any safety alignment, there it may produce problematic outputs. In the absence of adequate safeguards and RLHF, there exists a risk of malicious utilization of these models for generating disinformation or harmful content. Caution is urged against complete reliance on a specific language model for crucial decisions or impactful information, as preventing these models from fabricating content is not straightforward. Additionally, it remains uncertain whether smaller models might exhibit increased susceptibility to hallucination in ungrounded generation scenarios due to their reduced sizes and memorization capacities. This aspect is currently an active area of research, and we anticipate more rigorous exploration, comprehension, and mitigations in this domain.
192
+ # [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
193
+ Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_ibm-granite__granite-7b-instruct)
194
+
195
+ | Metric |Value|
196
+ |-------------------|----:|
197
+ |Avg. |11.80|
198
+ |IFEval (0-Shot) |29.72|
199
+ |BBH (3-Shot) |12.64|
200
+ |MATH Lvl 5 (4-Shot)| 0.60|
201
+ |GPQA (0-shot) | 4.70|
202
+ |MuSR (0-shot) | 8.82|
203
+ |MMLU-PRO (5-shot) |14.29|
204
+