gonzalez-agirre committed
Commit 779a575 · Parent(s): 446952a
Update README.md

README.md CHANGED
@@ -10,6 +10,7 @@ tags:
 - bloom
 - spanish
 - catalan
+- english
 pipeline_tag: text-generation
 widget:
 - text: |-
@@ -74,8 +75,6 @@ widget:
 It is the result of a language adaptation technique performed on [BLOOM-1.1B](https://huggingface.co/bigscience/bloom-1b1),
 which involves modifying the model's vocabulary and embedding layer and continuously pre-training the model with 26B tokens in our target languages.

-This model has been developed as part of a scientific research submitted to [LREC-COLING 2024](https://lrec-coling-2024.org/), and is currently undergoing a peer review process.
-
 ## Intended uses and limitations

 The **FLOR-760M** model is ready-to-use only for causal language modeling.
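Consistent with `pipeline_tag: text-generation` above, a minimal generation sketch with the Hugging Face `transformers` API could look as follows; the repository ID is an assumption and should be replaced with this model's actual Hub ID.

```python
# Minimal sketch: causal language modeling (text generation) with FLOR-760M.
# "projecte-aina/FLOR-760M" is an assumed Hub ID; substitute the real repository.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "projecte-aina/FLOR-760M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "El mercat del barri és"  # any Catalan, Spanish, or English prompt
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```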
@@ -118,13 +117,13 @@ on multiple web sources. We intend to conduct research in these areas in the fut

 ### Language adaptation and training

-The language adaptation technique used to create FLOR-
+The language adaptation technique used to create FLOR-760M requires the vocabulary of the source model
 to be adapted before continuing its pre-training with data in the target languages. Specifically, we proceeded as follows:
-1) We trained our own BPE tokenizer for Catalan, Spanish, and English, and replaced the original BLOOM tokenizer and vocabulary with it. This procedure implied a downsizing of the original BLOOM's embedding layer and, therefore, a model compression from 1.
+1) We trained our own BPE tokenizer for Catalan, Spanish, and English, and replaced the original BLOOM tokenizer and vocabulary with it. This procedure implied a downsizing of the original BLOOM's embedding layer and, therefore, a model compression from 1.1B parameters to 760M.
 2) The embeddings corresponding to tokens that are present in both the original and the target vocabulary (matching tokens) were used for initialization.
 3) The embeddings from tokens not present in BLOOM's original vocabulary were initialized as the average of all embeddings.
-4) The model was initialized with the weights from BOOM-1.
-5) The model was then trained on a corpus that contains a mixture of Catalan, Spanish and English data.
+4) The model was initialized with the weights from BLOOM-1.1B, and with our adapted tokenizer (step 1) and embeddings (steps 2-3).
+5) The model was then trained on a corpus that contains a mixture of Catalan, Spanish, and English data.

 ### Training data
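As a rough illustration of steps 2-3 of the adaptation recipe in the hunk above, the embedding re-initialization could be sketched as follows; this is not the authors' code, and the tokenizer path is a placeholder.

```python
# Illustrative sketch of the embedding re-initialization in steps 2-3 (not the
# authors' code). Embeddings of tokens shared by the old and new vocabularies are
# copied over; embeddings of new tokens are set to the mean of all old embeddings.
from transformers import AutoModelForCausalLM, AutoTokenizer

source_id = "bigscience/bloom-1b1"
new_tokenizer_path = "path/to/new-bpe-tokenizer"  # placeholder: tokenizer from step 1

old_tok = AutoTokenizer.from_pretrained(source_id)
new_tok = AutoTokenizer.from_pretrained(new_tokenizer_path)
model = AutoModelForCausalLM.from_pretrained(source_id)

old_emb = model.get_input_embeddings().weight.data        # [old_vocab_size, hidden]
mean_emb = old_emb.mean(dim=0)                             # fallback for unseen tokens
new_emb = mean_emb.repeat(len(new_tok), 1).clone()         # [new_vocab_size, hidden]

old_vocab = old_tok.get_vocab()
for token, new_id in new_tok.get_vocab().items():
    old_id = old_vocab.get(token)
    if old_id is not None:                                  # matching token: copy embedding
        new_emb[new_id] = old_emb[old_id]

# Shrink the (tied) embedding layer to the new vocabulary and load the new weights.
model.resize_token_embeddings(len(new_tok))
model.get_input_embeddings().weight.data.copy_(new_emb)
```

Steps 4-5 then amount to continuing pre-training from this re-initialized checkpoint on the Catalan, Spanish, and English mixture.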
@@ -187,7 +186,7 @@ using the [cs-1.9.1](https://github.com/Cerebras/modelzoo/releases/tag/Release_1


 ## Evaluation
-FLOR-760M has been evaluated on 5-shot, using EleutherAI's Evaluation Harness implementation, on several datasets in Catalan, Spanish and English, with particular emphasis on Catalan datasets.
+FLOR-760M has been evaluated on 5-shot, using EleutherAI's Evaluation Harness implementation, on several datasets in Catalan, Spanish, and English, with particular emphasis on Catalan datasets.

 The tasks were chosen to cover several evaluation areas in order to provide a comprehensive overview of the model's capabilities. The baselines used to compare our results are multilingual and English open-source 1.3B models: mGPT-1.3B, GPT-Neo-1.3B, Pythia-1.4B, OPT-1.3B, Falcon-rw-1.3B, and Cerebras-GPT-1.3B.
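For reference, a 5-shot run with EleutherAI's lm-evaluation-harness (v0.4+ Python API) could be sketched as below; the task names and model ID are placeholders rather than the exact configuration behind the reported results.

```python
# Hedged sketch of a 5-shot evaluation with EleutherAI's lm-evaluation-harness.
# The model ID and task names are placeholders, not the exact setup used here.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=projecte-aina/FLOR-760M",  # assumed Hub ID
    tasks=["xstorycloze_es", "xnli_es"],              # placeholder task names
    num_fewshot=5,
)
print(results["results"])
```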