gonzalez-agirre committed
Commit 779a575 · Parent(s): 446952a
Update README.md

README.md CHANGED
@@ -10,6 +10,7 @@ tags:
 - bloom
 - spanish
 - catalan
+- english
 pipeline_tag: text-generation
 widget:
 - text: |-
@@ -74,8 +75,6 @@ widget:
 It is the result of a language adaptation technique performed on [BLOOM-1.1B](https://huggingface.co/bigscience/bloom-1b1),
 which involves modifying the model's vocabulary and embedding layer and continuously pre-training the model with 26B tokens in our target languages.

-This model has been developed as part of a scientific research submitted to [LREC-COLING 2024](https://lrec-coling-2024.org/), and is currently undergoing a peer review process.
-
 ## Intended uses and limitations

 The **FLOR-760M** model is ready-to-use only for causal language modeling.
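Consistent with `pipeline_tag: text-generation` above, a minimal generation sketch with the Hugging Face `transformers` API could look as follows; the repository ID is an assumption and should be replaced with this model's actual Hub ID.

```python
# Minimal sketch: causal language modeling (text generation) with FLOR-760M.
# "projecte-aina/FLOR-760M" is an assumed Hub ID; substitute the real repository.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "projecte-aina/FLOR-760M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "El mercat del barri és"  # any Catalan, Spanish, or English prompt
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```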
@@ -118,13 +117,13 @@ on multiple web sources. We intend to conduct research in these areas in the fut

 ### Language adaptation and training

-The language adaptation technique used to create FLOR-
+The language adaptation technique used to create FLOR-760M requires the vocabulary of the source model
 to be adapted before continuing its pre-training with data in the target languages. Specifically, we proceeded as follows:
-1) We trained our own BPE tokenizer for Catalan, Spanish, and English, and replaced the original BLOOM tokenizer and vocabulary with it. This procedure implied a downsizing of the original BLOOM's embedding layer and, therefore, a model compression from 1.
+1) We trained our own BPE tokenizer for Catalan, Spanish, and English, and replaced the original BLOOM tokenizer and vocabulary with it. This procedure implied a downsizing of the original BLOOM's embedding layer and, therefore, a model compression from 1.1B parameters to 760M.
 2) The embeddings corresponding to tokens that are present in both the original and the target vocabulary (matching tokens) were used for initialization.
 3) The embeddings from tokens not present in BLOOM's original vocabulary were initialized as the average of all embeddings.
-4) The model was initialized with the weights from BOOM-1.
-5) The model was then trained on a corpus that contains a mixture of Catalan, Spanish and English data.
+4) The model was initialized with the weights from BLOOM-1.1B, and with our adapted tokenizer (step 1) and embeddings (steps 2-3).
+5) The model was then trained on a corpus that contains a mixture of Catalan, Spanish, and English data.

 ### Training data
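As a rough illustration of steps 2-3 of the adaptation recipe in the hunk above, the embedding re-initialization could be sketched as follows; this is not the authors' code, and the tokenizer path is a placeholder.

```python
# Illustrative sketch of the embedding re-initialization in steps 2-3 (not the
# authors' code). Embeddings of tokens shared by the old and new vocabularies are
# copied over; embeddings of new tokens are set to the mean of all old embeddings.
from transformers import AutoModelForCausalLM, AutoTokenizer

source_id = "bigscience/bloom-1b1"
new_tokenizer_path = "path/to/new-bpe-tokenizer"  # placeholder: tokenizer from step 1

old_tok = AutoTokenizer.from_pretrained(source_id)
new_tok = AutoTokenizer.from_pretrained(new_tokenizer_path)
model = AutoModelForCausalLM.from_pretrained(source_id)

old_emb = model.get_input_embeddings().weight.data        # [old_vocab_size, hidden]
mean_emb = old_emb.mean(dim=0)                             # fallback for unseen tokens
new_emb = mean_emb.repeat(len(new_tok), 1).clone()         # [new_vocab_size, hidden]

old_vocab = old_tok.get_vocab()
for token, new_id in new_tok.get_vocab().items():
    old_id = old_vocab.get(token)
    if old_id is not None:                                  # matching token: copy embedding
        new_emb[new_id] = old_emb[old_id]

# Shrink the (tied) embedding layer to the new vocabulary and load the new weights.
model.resize_token_embeddings(len(new_tok))
model.get_input_embeddings().weight.data.copy_(new_emb)
```

Steps 4-5 then amount to continuing pre-training from this re-initialized checkpoint on the Catalan, Spanish, and English mixture.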
@@ -187,7 +186,7 @@ using the [cs-1.9.1](https://github.com/Cerebras/modelzoo/releases/tag/Release_1


 ## Evaluation
-FLOR-760M has been evaluated on 5-shot, using EleutherAI's Evaluation Harness implementation, on several datasets in Catalan, Spanish and English, with particular emphasis on Catalan datasets.
+FLOR-760M has been evaluated on 5-shot, using EleutherAI's Evaluation Harness implementation, on several datasets in Catalan, Spanish, and English, with particular emphasis on Catalan datasets.

 The tasks were chosen to cover several evaluation areas in order to provide a comprehensive overview of the model's capabilities. The baselines used to compare our results are multilingual and English open-source 1.3B models: mGPT-1.3B, GPT-Neo-1.3B, Pythia-1.4B, OPT-1.3B, Falcon-rw-1.3B, and Cerebras-GPT-1.3B.
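For reference, a 5-shot run with EleutherAI's lm-evaluation-harness (v0.4+ Python API) could be sketched as below; the task names and model ID are placeholders rather than the exact configuration behind the reported results.

```python
# Hedged sketch of a 5-shot evaluation with EleutherAI's lm-evaluation-harness.
# The model ID and task names are placeholders, not the exact setup used here.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=projecte-aina/FLOR-760M",  # assumed Hub ID
    tasks=["xstorycloze_es", "xnli_es"],              # placeholder task names
    num_fewshot=5,
)
print(results["results"])
```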