ibaucells committed on
Commit 446952a · 1 Parent(s): 51ee47f

Update README.md

Files changed (1)
  1. README.md +11 -13
README.md CHANGED
@@ -6,7 +6,7 @@ language:
  licence:
  - apache-2.0
  tags:
- - cabloom
+ - FLOR
  - bloom
  - spanish
  - catalan
@@ -52,7 +52,7 @@ widget:
  example_title: Entidades-Nombradas
  ---
 
- # CaBLOOM-760M
+ # FLOR-760M
 
  ## Table of Contents
  <details>
@@ -70,7 +70,7 @@ widget:
 
  ## Model description
 
- **CaBLOOM-760M** is a 760M-parameter transformer-based causal language model for Catalan, Spanish, and English.
+ **FLOR-760M** is a 760M-parameter transformer-based causal language model for Catalan, Spanish, and English.
  It is the result of a language adaptation technique performed on [BLOOM-1.1B](https://huggingface.co/bigscience/bloom-1b1),
  which involves modifying the model's vocabulary and embedding layer and continuously pre-training the model with 26B tokens in our target languages.
 
@@ -78,7 +78,7 @@ This model has been developed as part of a scientific research submitted to [LRE
 
  ## Intended uses and limitations
 
- The **CaBLOOM-760M** model is ready-to-use only for causal language modeling.
+ The **FLOR-760M** model is ready-to-use only for causal language modeling.
  It can perform text-generation tasks and be fine-tuned for specific scenarios.
 
  ## How to use
@@ -88,7 +88,7 @@ from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
 
  input_text = "Sovint em trobo pensant en tot allò que"
 
- model_id = "BSC-LT/CaBLOOM-760M"
+ model_id = "BSC-LT/FLOR-760M"
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  generator = pipeline(
  "text-generation",
@@ -118,7 +118,7 @@ on multiple web sources. We intend to conduct research in these areas in the fut
 
  ### Language adaptation and training
 
- The language adaptation technique used to create CaBLOOM-1.3B requires the vocabulary of the source model
+ The language adaptation technique used to create FLOR-1.3B requires the vocabulary of the source model
  to be adapted before continuing its pre-training with data in the target languages. Specifically, we proceeded as follows:
  1) We trained our own BPE tokenizer for Catalan, Spanish, and English, and replaced the original BLOOM tokenizer and vocabulary with it. This procedure implied a downsizing of the original BLOOM's embedding layer and, therefore, a model compression from 1.7B parameters to 1.3B.
  2) The embeddings corresponding to tokens that are present in both the original and the target vocabulary (matching tokens) were used for initialization.
@@ -187,11 +187,11 @@ using the [cs-1.9.1](https://github.com/Cerebras/modelzoo/releases/tag/Release_1
 
 
  ## Evaluation
- CaBLOOM-760M has been evaluated on 5-shot, using EleutherAI's Evaluation Harness implementation, on several datasets in Catalan, Spanish and English, with particular emphasis on Catalan datasets.
+ FLOR-760M has been evaluated on 5-shot, using EleutherAI's Evaluation Harness implementation, on several datasets in Catalan, Spanish and English, with particular emphasis on Catalan datasets.
 
  The tasks were chosen to cover several evaluation areas in order to provide a comprehensive overview of the model's capabilities. The baselines used to compare our results are multilingual and English open-source 1.3B models: mGPT-1.3B, GPT-Neo-1.3B, Pythia-1.4B, OPT-1.3B, Falcon-rw-1.3B, and Cerebras-GPT-1.3B.
 
- Our implementation of EleutherAI's *LM Evaluation Harness* can be found [here](https://github.com/langtech-bsc/lm-evaluation-harness/tree/cabloom-eval).
+ Our implementation of EleutherAI's *LM Evaluation Harness* can be found [here](https://github.com/langtech-bsc/lm-evaluation-harness/tree/FLOR-eval).
 
  The following is a list of evaluation areas and their respective datasets:
  - Reading Comprehension: [Belebele](https://huggingface.co/datasets/facebook/belebele)
@@ -202,13 +202,11 @@ The following is a list of evaluation areas and their respective datasets:
  - Translation: [FLoRes](https://huggingface.co/datasets/flores)
 
 
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/635ba692dc371b8f91005172/o595pF7dw-iTuR1_x4MVy.png)
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/635ba692dc371b8f91005172/nKvFF6Ap7ocdAtSBQyD6Q.png)
 
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/635ba692dc371b8f91005172/OcCNfkKyGB4zXi2pXjbB4.png)
 
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/635ba692dc371b8f91005172/DhrkZG8Xqob7Ml4n6zQcY.png)
-
-
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/635ba692dc371b8f91005172/PxgzqXAelUoWY-23zXvPm.png){ width: 200px; }
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/635ba692dc371b8f91005172/d3iW68sAubt1uU0-le5hX.png)
 
 
  ## Additional information