munish0838 committed on
Commit
db2d46f
·
verified ·
1 Parent(s): 283af16

Upload README.md with huggingface_hub

  1. README.md +296 -0
README.md ADDED
@@ -0,0 +1,296 @@
---
language:
- pt
model-index:
- name: sabia-7b
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: ENEM Challenge (No Images)
      type: eduagarcia/enem_challenge
      split: train
      args:
        num_few_shot: 3
    metrics:
    - type: acc
      value: 55.07
      name: accuracy
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=maritaca-ai/sabia-7b
      name: Open Portuguese LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: BLUEX (No Images)
      type: eduagarcia-temp/BLUEX_without_images
      split: train
      args:
        num_few_shot: 3
    metrics:
    - type: acc
      value: 47.71
      name: accuracy
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=maritaca-ai/sabia-7b
      name: Open Portuguese LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: OAB Exams
      type: eduagarcia/oab_exams
      split: train
      args:
        num_few_shot: 3
    metrics:
    - type: acc
      value: 41.41
      name: accuracy
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=maritaca-ai/sabia-7b
      name: Open Portuguese LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Assin2 RTE
      type: assin2
      split: test
      args:
        num_few_shot: 15
    metrics:
    - type: f1_macro
      value: 46.68
      name: f1-macro
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=maritaca-ai/sabia-7b
      name: Open Portuguese LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Assin2 STS
      type: eduagarcia/portuguese_benchmark
      split: test
      args:
        num_few_shot: 15
    metrics:
    - type: pearson
      value: 1.89
      name: pearson
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=maritaca-ai/sabia-7b
      name: Open Portuguese LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: FaQuAD NLI
      type: ruanchaves/faquad-nli
      split: test
      args:
        num_few_shot: 15
    metrics:
    - type: f1_macro
      value: 58.34
      name: f1-macro
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=maritaca-ai/sabia-7b
      name: Open Portuguese LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HateBR Binary
      type: ruanchaves/hatebr
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: f1_macro
      value: 61.93
      name: f1-macro
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=maritaca-ai/sabia-7b
      name: Open Portuguese LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: PT Hate Speech Binary
      type: hate_speech_portuguese
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: f1_macro
      value: 64.13
      name: f1-macro
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=maritaca-ai/sabia-7b
      name: Open Portuguese LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: tweetSentBR
      type: eduagarcia-temp/tweetsentbr
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: f1_macro
      value: 46.64
      name: f1-macro
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=maritaca-ai/sabia-7b
      name: Open Portuguese LLM Leaderboard

---

[![QuantFactory Banner](https://lh7-rt.googleusercontent.com/docsz/AD_4nXeiuCm7c8lEwEJuRey9kiVZsRn2W-b4pWlu3-X534V3YmVuVc2ZL-NXg2RkzSOOS2JXGHutDuyyNAUtdJI65jGTo8jT9Y99tMi4H4MqL44Uc5QKG77B0d6-JfIkZHFaUA71-RtjyYZWVIhqsNZcx8-OMaA?key=xt3VSDoCbmTY7o-cwwOFwQ)](https://hf.co/QuantFactory)

# QuantFactory/sabia-7b-GGUF
This is a quantized version of [maritaca-ai/sabia-7b](https://huggingface.co/maritaca-ai/sabia-7b), created using llama.cpp.

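As a rough illustration of how a GGUF file from this repository might be used, the sketch below loads a quantized file through the `llama-cpp-python` bindings and completes a prompt up to the first newline. Several things here are assumptions rather than facts from this README: the use of `llama-cpp-python` (any llama.cpp front end should work), the file name in `GGUF_PATH` (the actual file names are not listed here), and the helper name `complete`. The context size of 2048 follows the maximum sequence length stated in the original model card.

```python
# Hypothetical local path: the actual GGUF file names in this repository are
# not listed in this README, so replace this with the file you downloaded.
GGUF_PATH = "sabia-7b.Q4_K_M.gguf"


def complete(prompt: str, model_path: str = GGUF_PATH, max_tokens: int = 8) -> str:
    """Return the model's continuation of `prompt`, stopping at the first newline."""
    # Imported lazily so this module loads even without llama-cpp-python installed.
    from llama_cpp import Llama

    # n_ctx=2048 matches the maximum sequence length given in the model card.
    llm = Llama(model_path=model_path, n_ctx=2048)
    result = llm(prompt, max_tokens=max_tokens, stop=["\n"])
    return result["choices"][0]["text"].strip()
```

Because the base model is not instruction-tuned, `complete` is best called with a few-shot prompt that ends in an open label such as `Classe:`.
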
# Original Model Card

Sabiá-7B is a Portuguese language model developed by [Maritaca AI](https://www.maritaca.ai/).

**Input:** The model accepts only text input.

**Output:** The model generates text only.

**Model Architecture:** Sabiá-7B is an auto-regressive language model that uses the same architecture as LLaMA-1-7B.

**Tokenizer:** It uses the same tokenizer as LLaMA-1-7B.

**Maximum sequence length:** 2048 tokens.

**Pretraining data:** Starting from the weights of LLaMA-1-7B, the model was further pretrained for 10 billion tokens on a dataset of 7 billion tokens from the Portuguese subset of ClueWeb22, which corresponds to approximately 1.4 epochs over that dataset.

**Data Freshness:** The pretraining data has a cutoff of mid-2022.

**License:** The license is the same as LLaMA-1's, which restricts the model to research use only.

**Paper:** For more details, please refer to our paper: [Sabiá: Portuguese Large Language Models](https://arxiv.org/pdf/2304.07880.pdf)

## Few-shot Example

Because Sabiá-7B was trained solely with a language modeling objective and was not fine-tuned to follow instructions, it is better suited to few-shot tasks than to zero-shot tasks, as in the example below.

```python
import torch
from transformers import LlamaTokenizer, LlamaForCausalLM

tokenizer = LlamaTokenizer.from_pretrained("maritaca-ai/sabia-7b")
model = LlamaForCausalLM.from_pretrained(
    "maritaca-ai/sabia-7b",
    device_map="auto",  # Automatically loads the model onto the GPU, if there is one. Requires pip install accelerate
    low_cpu_mem_usage=True,
    torch_dtype=torch.bfloat16  # If your GPU does not support bfloat16, change to torch.float16
)

prompt = """Classifique a resenha de filme como "positiva" ou "negativa".

Resenha: Gostei muito do filme, é o melhor do ano!
Classe: positiva

Resenha: O filme deixa muito a desejar.
Classe: negativa

Resenha: Apesar de longo, valeu o ingresso.
Classe:"""

input_ids = tokenizer(prompt, return_tensors="pt")

output = model.generate(
    input_ids["input_ids"].to(model.device),  # Follow the device chosen by device_map instead of hard-coding "cuda"
    max_length=1024,
    eos_token_id=tokenizer.encode("\n"))  # Stop generation when a "\n" token is detected

# The output contains the input tokens, so we have to skip them.
output = output[0][len(input_ids["input_ids"][0]):]

print(tokenizer.decode(output, skip_special_tokens=True))
```
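
For more than a handful of examples, the prompt can be assembled programmatically instead of being written by hand. The following is a minimal sketch, not part of the original model card: the helper name `build_few_shot_prompt` and its parameters are hypothetical, and it simply reproduces the `Resenha:` / `Classe:` layout used above.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    instruction: str,
    examples: List[Tuple[str, str]],
    query: str,
    input_label: str = "Resenha",
    output_label: str = "Classe",
) -> str:
    """Format labeled examples plus an unlabeled query as a few-shot prompt."""
    blocks = [instruction]
    for text, label in examples:
        blocks.append(f"{input_label}: {text}\n{output_label}: {label}")
    # The final block leaves the label empty so the model completes it.
    blocks.append(f"{input_label}: {query}\n{output_label}:")
    return "\n\n".join(blocks)


prompt = build_few_shot_prompt(
    'Classifique a resenha de filme como "positiva" ou "negativa".',
    [
        ("Gostei muito do filme, é o melhor do ano!", "positiva"),
        ("O filme deixa muito a desejar.", "negativa"),
    ],
    "Apesar de longo, valeu o ingresso.",
)
print(prompt)
```

The resulting string ends with an open `Classe:` line, so stopping generation at the first newline yields just the predicted label.
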

If your GPU does not have enough memory, try loading the model in int8 precision.
However, expect some degradation in output quality compared to fp16 or bf16.
```python
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(
    "maritaca-ai/sabia-7b",
    device_map="auto",
    low_cpu_mem_usage=True,
    load_in_8bit=True,  # Requires pip install bitsandbytes
)
```
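
To get a feel for why lower precision helps, the back-of-the-envelope sketch below estimates the memory taken by the weights alone at different precisions. It assumes a nominal 7 billion parameters (the exact parameter count is not given in this README) and ignores activations, the KV cache, and framework overhead, so real usage will be higher. The function name `weight_memory_gib` is illustrative only.

```python
def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


# Nominal "7B" parameter count; an assumption, not a figure from the model card.
NUM_PARAMS = 7e9

for name, nbytes in [("bf16/fp16", 2), ("int8", 1)]:
    print(f"{name}: ~{weight_memory_gib(NUM_PARAMS, nbytes):.1f} GiB")
```

Under these assumptions the weights need roughly 13 GiB in bf16/fp16 and about half that in int8.
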

## Results in Portuguese

Below we show the results on the Poeta benchmark, which consists of 14 Portuguese datasets.

For more information on the Normalized Preferred Metric (NPM), please refer to our paper.

|Model | NPM |
|--|--|
|LLaMA-1-7B| 33.0|
|LLaMA-2-7B| 43.7|
|Sabiá-7B| 48.5|

## Results in English

Below we show the average results on 6 English datasets: PIQA, HellaSwag, WinoGrande, ARC-e, ARC-c, and OpenBookQA.

|Model | NPM |
|--|--|
|LLaMA-1-7B| 50.1|
|Sabiá-7B| 49.0|

## Citation

Please use the following BibTeX entry to cite our paper:
```
@InProceedings{10.1007/978-3-031-45392-2_15,
    author="Pires, Ramon
    and Abonizio, Hugo
    and Almeida, Thales Sales
    and Nogueira, Rodrigo",
    editor="Naldi, Murilo C.
    and Bianchi, Reinaldo A. C.",
    title="Sabi{\'a}: Portuguese Large Language Models",
    booktitle="Intelligent Systems",
    year="2023",
    publisher="Springer Nature Switzerland",
    address="Cham",
    pages="226--240",
    isbn="978-3-031-45392-2"
}
```

# [Open Portuguese LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/eduagarcia-temp/llm_pt_leaderboard_raw_results/tree/main/maritaca-ai/sabia-7b)

|          Metric          |  Value  |
|--------------------------|---------|
|Average                   |**47.09**|
|ENEM Challenge (No Images)|    55.07|
|BLUEX (No Images)         |    47.71|
|OAB Exams                 |    41.41|
|Assin2 RTE                |    46.68|
|Assin2 STS                |     1.89|
|FaQuAD NLI                |    58.34|
|HateBR Binary             |    61.93|
|PT Hate Speech Binary     |    64.13|
|tweetSentBR               |    46.64|

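
The Average row appears to be the unweighted mean of the nine task scores. That is an assumption, since the leaderboard's aggregation rule is not stated here, but the short check below reproduces the reported value from the numbers in the table.

```python
# Scores copied from the table above.
scores = {
    "ENEM Challenge (No Images)": 55.07,
    "BLUEX (No Images)": 47.71,
    "OAB Exams": 41.41,
    "Assin2 RTE": 46.68,
    "Assin2 STS": 1.89,
    "FaQuAD NLI": 58.34,
    "HateBR Binary": 61.93,
    "PT Hate Speech Binary": 64.13,
    "tweetSentBR": 46.64,
}

# Unweighted mean over all tasks (assumed aggregation).
average = sum(scores.values()) / len(scores)
print(f"{average:.2f}")  # 47.09
```

Note how much the very low Assin2 STS score (1.89) pulls the mean down relative to the other eight tasks.
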