Update README.md
---
language:
...
license:
- apache-2.0
tags:
- FLOR
- bloom
- spanish
- catalan
...
example_title: Entidades-Nombradas
---

# FLOR-760M

## Table of Contents
<details>
...

## Model description

**FLOR-760M** is a 760M-parameter transformer-based causal language model for Catalan, Spanish, and English.
It is the result of a language adaptation technique performed on [BLOOM-1.1B](https://huggingface.co/bigscience/bloom-1b1),
which involves modifying the model's vocabulary and embedding layer and continuously pre-training the model with 26B tokens in our target languages.
...

## Intended uses and limitations

The **FLOR-760M** model is ready to use only for causal language modeling.
It can perform text-generation tasks and be fine-tuned for specific scenarios.

## How to use
...

```python
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

input_text = "Sovint em trobo pensant en tot allò que"

model_id = "BSC-LT/FLOR-760M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
generator = pipeline(
    "text-generation",
    model=model_id,       # remainder of the snippet reconstructed;
    tokenizer=tokenizer,  # the card's exact generation arguments may differ
)
print(generator(input_text, max_new_tokens=50)[0]["generated_text"])
```
...

### Language adaptation and training

The language adaptation technique used to create FLOR-760M requires the vocabulary of the source model
to be adapted before continuing its pre-training with data in the target languages. Specifically, we proceeded as follows:
1) We trained our own BPE tokenizer for Catalan, Spanish, and English, and replaced the original BLOOM tokenizer and vocabulary with it. This procedure implied a downsizing of the original BLOOM's embedding layer and, therefore, a model compression from 1.1B parameters to 760M.
2) The embeddings corresponding to tokens that are present in both the original and the target vocabulary (matching tokens) were used for initialization.
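The two steps above can be sketched with toy, stdlib-only helpers. This is an illustration of the idea only: `train_bpe` and `init_adapted_embeddings` are hypothetical names, not the project's code, and a real adaptation would use the trained tokenizer's vocabulary and the model's actual embedding matrix.

```python
import random
from collections import Counter

def train_bpe(corpus, num_merges):
    """Toy BPE training: repeatedly merge the most frequent adjacent pair."""
    words = [list(w) + ["</w>"] for w in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        for w in words:  # apply the new merge everywhere
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i : i + 2] = [a + b]
                else:
                    i += 1
    return merges

def init_adapted_embeddings(src_vocab, tgt_vocab, src_emb, dim, seed=0):
    """Copy rows for tokens present in both vocabularies (matching tokens);
    initialize the remaining rows randomly."""
    rng = random.Random(seed)
    tgt_emb = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in tgt_vocab]
    matched = 0
    for tok, tgt_id in tgt_vocab.items():
        src_id = src_vocab.get(tok)
        if src_id is not None:
            tgt_emb[tgt_id] = list(src_emb[src_id])
            matched += 1
    return tgt_emb, matched
```

Because the target vocabulary is much smaller than BLOOM's original one, the rebuilt embedding matrix has fewer rows, which is where the parameter reduction comes from.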
...

## Evaluation

FLOR-760M has been evaluated in a 5-shot setting, using EleutherAI's Evaluation Harness implementation, on several datasets in Catalan, Spanish, and English, with particular emphasis on Catalan datasets.

The tasks were chosen to cover several evaluation areas in order to provide a comprehensive overview of the model's capabilities. The baselines used to compare our results are multilingual and English open-source 1.3B models: mGPT-1.3B, GPT-Neo-1.3B, Pythia-1.4B, OPT-1.3B, Falcon-rw-1.3B, and Cerebras-GPT-1.3B.

Our implementation of EleutherAI's *LM Evaluation Harness* can be found [here](https://github.com/langtech-bsc/lm-evaluation-harness/tree/FLOR-eval).

The following is a list of evaluation areas and their respective datasets:
- Reading Comprehension: [Belebele](https://huggingface.co/datasets/facebook/belebele)
...
- Translation: [FLoRes](https://huggingface.co/datasets/flores)
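As an illustration of what "5-shot" means for the datasets above, the prompt shown to the model is preceded by five solved examples. This is a generic sketch with a hypothetical `build_few_shot_prompt` helper, not the harness's actual prompt format:

```python
def build_few_shot_prompt(examples, query, k=5):
    """Prepend k solved (question, answer) examples to the query,
    so the model can imitate the pattern without any fine-tuning."""
    lines = [f"Q: {q}\nA: {a}" for q, a in examples[:k]]
    lines.append(f"Q: {query}\nA:")
    return "\n\n".join(lines)
```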
![image/png](https://cdn-uploads.huggingface.co/production/uploads/635ba692dc371b8f91005172/nKvFF6Ap7ocdAtSBQyD6Q.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/635ba692dc371b8f91005172/OcCNfkKyGB4zXi2pXjbB4.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/635ba692dc371b8f91005172/d3iW68sAubt1uU0-le5hX.png)

## Additional information