---
language:
- pt
---

Sabiá-7B is a Portuguese language model developed by [Maritaca AI](https://www.maritaca.ai/).

**Input:** The model accepts only text input.

**Output:** The model generates text only.

**Model Architecture:** Sabiá-7B is an auto-regressive language model that uses the same architecture as LLaMA-1-7B.

**Tokenizer:** It uses the same tokenizer as LLaMA-1-7B.

**Maximum sequence length:** 2048 tokens.
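
As a quick sketch of how the tokenizer and the 2048-token context window interact, the snippet below counts tokens and truncates an over-long input. The Hub repo id `maritaca-ai/sabia-7b` is an assumption here and may need adjusting.

```python
# A minimal sketch, assuming the model is hosted on the Hugging Face Hub
# under the (assumed) repo id "maritaca-ai/sabia-7b".
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("maritaca-ai/sabia-7b")

text = "O sabiá é um pássaro comum no Brasil."
ids = tokenizer(text)["input_ids"]
print(f"{len(ids)} tokens")

# Inputs beyond the 2048-token maximum sequence length must be truncated.
ids = tokenizer(text, truncation=True, max_length=2048)["input_ids"]
```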

**Pretraining data:** The model was pretrained on the Portuguese subset of ClueWeb22, a corpus of 7 billion tokens. Starting from the weights of LLaMA-1-7B, it was trained for an additional 10 billion tokens, approximately 1.4 epochs of the training dataset.

**Data Freshness:** The pretraining data has a cutoff of mid-2022.

**License:** The licensing is the same as LLaMA-1's, restricting the model's use to research purposes only.

**Paper:** For more details, please refer to our paper: [Sabiá: Portuguese Large Language Models](https://arxiv.org/pdf/2304.07880.pdf)

Given that Sabiá-7B was trained solely on a language modeling objective without fine-tuning for instruction following, it is recommended for few-shot tasks rather than zero-shot tasks.
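
For example, a sentiment-classification prompt can embed a few labeled examples directly in the input. The sketch below uses the `transformers` library; the repo id `maritaca-ai/sabia-7b`, the dtype, and the generation settings are assumptions, not prescribed values.

```python
# A minimal few-shot sketch, assuming the (assumed) Hub repo id
# "maritaca-ai/sabia-7b" and a GPU that fits the 7B weights in bfloat16.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "maritaca-ai/sabia-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# In-context examples stand in for instructions, since the model was
# trained only on next-token prediction.
prompt = """Classifique a resenha de filme como "positiva" ou "negativa".

Resenha: Gostei muito do filme. É o melhor do ano!
Classe: positiva

Resenha: O filme deixa muito a desejar.
Classe: negativa

Resenha: Apesar de longo, valeu o ingresso.
Classe:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=5, do_sample=False)

# Decode only the newly generated tokens (the predicted class).
new_tokens = output[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```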

**Results**

Below we show the results on the Poeta benchmark, which consists of 14 Portuguese datasets:

| Model      | NPM  |
|------------|------|
| LLaMA-1-7B | 33.0 |
| LLaMA-2-7B | 43.7 |
| Sabiá-7B   | 48.5 |

For more information on the Normalized Preferred Metric (NPM), please check out our paper.

Please use the following BibTeX entry to cite our paper:

```
@InProceedings{10.1007/978-3-031-45392-2_15,
author="Pires, Ramon
and Abonizio, Hugo
and Almeida, Thales Sales
and Nogueira, Rodrigo",
editor="Naldi, Murilo C.
and Bianchi, Reinaldo A. C.",
title="Sabi{\'a}: Portuguese Large Language Models",
booktitle="Intelligent Systems",
year="2023",
publisher="Springer Nature Switzerland",
address="Cham",
pages="226--240",
isbn="978-3-031-45392-2"
}
```