Commit 779a575 by gonzalez-agirre · 1 parent: 446952a

Update README.md

Files changed (1)
  1. README.md +6 -7
README.md CHANGED
@@ -10,6 +10,7 @@ tags:
  - bloom
  - spanish
  - catalan
+ - english
  pipeline_tag: text-generation
  widget:
  - text: |-
@@ -74,8 +75,6 @@ widget:
  It is the result of a language adaptation technique performed on [BLOOM-1.1B](https://huggingface.co/bigscience/bloom-1b1),
  which involves modifying the model's vocabulary and embedding layer and continuously pre-training the model with 26B tokens in our target languages.
 
- This model has been developed as part of a scientific research submitted to [LREC-COLING 2024](https://lrec-coling-2024.org/), and is currently undergoing a peer review process.
-
  ## Intended uses and limitations
 
  The **FLOR-760M** model is ready-to-use only for causal language modeling.
@@ -118,13 +117,13 @@ on multiple web sources. We intend to conduct research in these areas in the fut
 
  ### Language adaptation and training
 
- The language adaptation technique used to create FLOR-1.3B requires the vocabulary of the source model
+ The language adaptation technique used to create FLOR-760M requires the vocabulary of the source model
  to be adapted before continuing its pre-training with data in the target languages. Specifically, we proceeded as follows:
- 1) We trained our own BPE tokenizer for Catalan, Spanish, and English, and replaced the original BLOOM tokenizer and vocabulary with it. This procedure implied a downsizing of the original BLOOM's embedding layer and, therefore, a model compression from 1.7B parameters to 1.3B.
+ 1) We trained our own BPE tokenizer for Catalan, Spanish, and English, and replaced the original BLOOM tokenizer and vocabulary with it. This procedure implied a downsizing of the original BLOOM's embedding layer and, therefore, a model compression from 1.1B parameters to 760M.
  2) The embeddings corresponding to tokens that are present in both the original and the target vocabulary (matching tokens) were used for initialization.
  3) The embeddings from tokens not present in BLOOM's original vocabulary were initialized as the average of all embeddings.
- 4) The model was initialized with the weights from BOOM-1.7B, and with our adapted tokenizer (step 1) and embeddings (steps 2-3).
- 5) The model was then trained on a corpus that contains a mixture of Catalan, Spanish and English data.
+ 4) The model was initialized with the weights from BOOM-1.1B, and with our adapted tokenizer (step 1) and embeddings (steps 2-3).
+ 5) The model was then trained on a corpus that contains a mixture of Catalan, Spanish, and English data.
 
  ### Training data
 
@@ -187,7 +186,7 @@ using the [cs-1.9.1](https://github.com/Cerebras/modelzoo/releases/tag/Release_1
 
 
  ## Evaluation
- FLOR-760M has been evaluated on 5-shot, using EleutherAI's Evaluation Harness implementation, on several datasets in Catalan, Spanish and English, with particular emphasis on Catalan datasets.
+ FLOR-760M has been evaluated on 5-shot, using EleutherAI's Evaluation Harness implementation, on several datasets in Catalan, Spanish, and English, with particular emphasis on Catalan datasets.
 
  The tasks were chosen to cover several evaluation areas in order to provide a comprehensive overview of the model's capabilities. The baselines used to compare our results are multilingual and English open-source 1.3B models: mGPT-1.3B, GPT-Neo-1.3B, Pythia-1.4B, OPT-1.3B, Falcon-rw-1.3B, and Cerebras-GPT-1.3B.
 
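
Steps 2 and 3 of the "Language adaptation and training" section touched by this commit describe how the new embedding table is initialized. The following is a minimal sketch of that idea in plain Python, not the authors' implementation. The function name `init_adapted_embeddings`, the toy vocabularies, and the two-dimensional vectors are all hypothetical, and the sketch assumes that "the average of all embeddings" means the mean over every row of the source embedding table.

```python
from statistics import fmean


def init_adapted_embeddings(source_vocab, source_embeddings, target_vocab):
    """Build an embedding table for target_vocab from a source model's table.

    source_vocab:      dict mapping token -> row index in source_embeddings
    source_embeddings: list of equal-length lists of floats
    target_vocab:      list of tokens; list position is the new token id
    """
    dim = len(source_embeddings[0])
    # Assumption: new tokens get the mean over ALL source rows (step 3).
    mean_vector = [fmean(row[d] for row in source_embeddings) for d in range(dim)]

    adapted = []
    for token in target_vocab:
        if token in source_vocab:
            # Matching token: reuse the source embedding (step 2).
            adapted.append(list(source_embeddings[source_vocab[token]]))
        else:
            # Token unseen by the source tokenizer: start from the mean (step 3).
            adapted.append(list(mean_vector))
    return adapted


# Toy data, invented for illustration only.
source_vocab = {"la": 0, "casa": 1, "the": 2}
source_embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
target_vocab = ["casa", "gat", "the"]

adapted = init_adapted_embeddings(source_vocab, source_embeddings, target_vocab)
print(adapted)
```

Here "casa" and "the" keep their source vectors, while "gat", which is absent from the toy source vocabulary, starts from the mean vector. The resulting table has one row per target token, which is why replacing a large vocabulary with a smaller one shrinks the embedding layer and the total parameter count, as step 1 notes.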