nm-research committed
Commit 0f12672 · verified · 1 Parent(s): 1a350b6

Update README.md

Files changed (1):
  1. README.md +77 -23

README.md CHANGED
@@ -42,7 +42,7 @@ from transformers import AutoTokenizer
  from vllm import LLM, SamplingParams
 
  max_model_len, tp_size = 4096, 1
- model_name = "neuralmagic-ent/granite-3.1-2b-base-FP8-dynamic"
+ model_name = "neuralmagic/granite-3.1-2b-base-FP8-dynamic"
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True)
  sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])
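For context, a minimal end-to-end sketch of the deployment flow this hunk configures; the prompt string below is an illustrative placeholder, not taken from the README:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "neuralmagic/granite-3.1-2b-base-FP8-dynamic"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, tensor_parallel_size=1, max_model_len=4096, trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])

# Granite 3.1 2B base is a plain completion model, so pass raw text rather than a chat template.
prompt = "The Eiffel Tower is located in"  # illustrative placeholder prompt
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```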
@@ -65,6 +65,9 @@ vLLM also supports OpenAI-compatible serving. See the [documentation](https://do
 
  This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.
 
+ <details>
+ <summary>Model Creation Code</summary>
+
  ```bash
  python quantize.py --model_id ibm-granite/granite-3.1-2b-base --save_path "output_dir/"
  ```
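The `quantize.py` invoked above appears in this diff only through its final lines. As a rough sketch of what a data-free FP8-dynamic quantization script built on llm-compressor typically looks like; the imports, recipe, and argument parsing below follow the library's published FP8_DYNAMIC example and are assumptions, not the repository's exact script:

```python
import argparse

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", type=str, required=True)
    parser.add_argument("--save_path", type=str, default="output_dir/")
    args = parser.parse_args()

    model = AutoModelForCausalLM.from_pretrained(args.model_id, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(args.model_id)

    # FP8 weights with dynamic per-token activation scales; lm_head is left unquantized.
    recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

    # FP8_DYNAMIC is data-free, so no calibration dataset is passed to oneshot.
    oneshot(model=model, recipe=recipe)

    model.save_pretrained(args.save_path)
    tokenizer.save_pretrained(args.save_path)


if __name__ == "__main__":
    main()
```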
@@ -109,16 +112,20 @@ def main():
  if __name__ == "__main__":
      main()
  ```
+ </details>
 
  ## Evaluation
 
- The model was evaluated on OpenLLM Leaderboard [V1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard) and on [HumanEval](https://github.com/neuralmagic/evalplus), using the following commands:
+ The model was evaluated on OpenLLM Leaderboard [V1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard), OpenLLM Leaderboard [V2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/) and on [HumanEval](https://github.com/neuralmagic/evalplus), using the following commands:
 
+ <details>
+ <summary>Evaluation Commands</summary>
+
  OpenLLM Leaderboard V1:
  ```
  lm_eval \
  --model vllm \
- --model_args pretrained="neuralmagic-ent/granite-3.1-2b-base-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
+ --model_args pretrained="neuralmagic/granite-3.1-2b-base-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks openllm \
  --write_out \
  --batch_size auto \
@@ -130,7 +137,7 @@ lm_eval \
  ##### Generation
  ```
  python3 codegen/generate.py \
- --model neuralmagic-ent/granite-3.1-2b-base-FP8-dynamic \
+ --model neuralmagic/granite-3.1-2b-base-FP8-dynamic \
  --bs 16 \
  --temperature 0.2 \
  --n_samples 50 \
@@ -140,35 +147,82 @@ python3 codegen/generate.py \
  ##### Sanitization
  ```
  python3 evalplus/sanitize.py \
- humaneval/neuralmagic-ent--granite-3.1-2b-base-FP8-dynamic_vllm_temp_0.2
+ humaneval/neuralmagic--granite-3.1-2b-base-FP8-dynamic_vllm_temp_0.2
  ```
  ##### Evaluation
  ```
  evalplus.evaluate \
  --dataset humaneval \
- --samples humaneval/neuralmagic-ent--granite-3.1-2b-base-FP8-dynamic_vllm_temp_0.2-sanitized
+ --samples humaneval/neuralmagic--granite-3.1-2b-base-FP8-dynamic_vllm_temp_0.2-sanitized
  ```
+ </details>
 
  ### Accuracy
 
- #### OpenLLM Leaderboard V1 evaluation scores
-
-
- | Metric | ibm-granite/granite-3.1-2b-instruct | neuralmagic-ent/granite-3.1-2b-base-FP8-dynamic |
- |-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
- | ARC-Challenge (Acc-Norm, 25-shot) | 55.63 | 53.50 |
- | GSM8K (Strict-Match, 5-shot) | 60.96 | 46.10 |
- | HellaSwag (Acc-Norm, 10-shot) | 75.21 | 77.76 |
- | MMLU (Acc, 5-shot) | 54.38 | 52.61 |
- | TruthfulQA (MC2, 0-shot) | 55.93 | 39.84 |
- | Winogrande (Acc, 5-shot) | 69.67 | 74.43 |
- | **Average Score** | **61.98** | **57.37** |
- | **Recovery** | **100.00** | **99.52** |
+ <table>
+ <thead>
+ <tr>
+ <th>Category</th>
+ <th>Metric</th>
+ <th>ibm-granite/granite-3.1-2b-instruct</th>
+ <th>neuralmagic/granite-3.1-2b-base-FP8-dynamic</th>
+ <th>Recovery (%)</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td rowspan="7"><b>OpenLLM Leaderboard V1</b></td>
+ <td>ARC-Challenge (Acc-Norm, 25-shot)</td>
+ <td>55.63</td>
+ <td>53.50</td>
+ <td>96.17</td>
+ </tr>
+ <tr>
+ <td>GSM8K (Strict-Match, 5-shot)</td>
+ <td>60.96</td>
+ <td>46.10</td>
+ <td>75.63</td>
+ </tr>
+ <tr>
+ <td>HellaSwag (Acc-Norm, 10-shot)</td>
+ <td>75.21</td>
+ <td>77.76</td>
+ <td>103.39</td>
+ </tr>
+ <tr>
+ <td>MMLU (Acc, 5-shot)</td>
+ <td>54.38</td>
+ <td>52.61</td>
+ <td>96.75</td>
+ </tr>
+ <tr>
+ <td>TruthfulQA (MC2, 0-shot)</td>
+ <td>55.93</td>
+ <td>39.84</td>
+ <td>71.23</td>
+ </tr>
+ <tr>
+ <td>Winogrande (Acc, 5-shot)</td>
+ <td>69.67</td>
+ <td>74.43</td>
+ <td>106.84</td>
+ </tr>
+ <tr>
+ <td><b>Average Score</b></td>
+ <td><b>61.98</b></td>
+ <td><b>57.37</b></td>
+ <td><b>92.57</b></td>
+ </tr>
+ <tr>
+ <td rowspan="2"><b>HumanEval</b></td>
+ <td>HumanEval Pass@1</td>
+ <td>30.00</td>
+ <td>30.40</td>
+ <td><b>101.33</b></td>
+ </tr>
+ </tbody>
+ </table>
 
- #### HumanEval pass@1 scores
- | Metric | ibm-granite/granite-3.1-2b-base | neuralmagic-ent/granite-3.1-2b-base-FP8-dynamic |
- |-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
- | HumanEval Pass@1 | 30.00 | 30.40 |
 
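The Recovery (%) column introduced by the new table is the quantized model's score expressed as a percentage of the listed baseline's score, which matches the figures shown; for example, for ARC-Challenge:

```python
# Recovery = 100 * quantized_score / baseline_score
baseline, quantized = 55.63, 53.50  # ARC-Challenge (Acc-Norm, 25-shot)
print(f"{100 * quantized / baseline:.2f}")  # 96.17, as reported in the table
```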
 
 