ibaucells committed on
Commit
c803ee7
·
1 Parent(s): b9865de

Upload README.md

  1. README.md +245 -1
README.md CHANGED
@@ -1,3 +1,247 @@
1
  ---
2
- license: apache-2.0
3
  ---
1
  ---
2
+ language:
3
+ - en
4
+ - es
5
+ - ca
6
+ license:
7
+ - apache-2.0
8
+ tags:
9
+ - cabloom
10
+ - bloom
11
+ - spanish
12
+ - catalan
13
+ pipeline_tag: text-generation
14
+ widget:
15
+ - text: |-
16
+ Respon a la pregunta següent.
17
+ Pregunta: "Quina és la capital de Suècia?"
18
+ Resposta: "La capital de Suècia és Estocolm."
19
+ ----
20
+ Respon a la pregunta següent.
21
+ Pregunta: "Quina beguda es consumeix als matins per despertar-se?"
22
+ Resposta: "La majoria de gent consumeix cafè per despertar-se."
23
+ ----
24
+ Respon a la pregunta següent.
25
+ Pregunta: "Explica com funciona un motor de combustió"
26
+ Resposta:
27
+ example_title: Pregunta-Resposta
28
+ - text: |-
29
+ Extrae las entidades nombradas del siguiente texto:
30
+ Texto: "Me llamo Wolfgang y vivo en Berlin"
31
+ Entidades: Wolfgang:PER, Berlin:LOC
32
+ ----
33
+ Extrae las entidades nombradas del siguiente texto:
34
+ Texto: "Hoy voy a visitar el parc güell tras salir del barcelona supercomputing center"
35
+ Entidades: parc güell:LOC, barcelona supercomputing center:LOC
36
+ ----
37
+ Extrae las entidades nombradas del siguiente texto:
38
+ Texto: "Maria y Miguel no tienen ningún problema contigo"
39
+ Entidades: Maria:PER, Miguel:PER
40
+ ----
41
+ Extrae las entidades nombradas del siguiente texto:
42
+ Texto: "Damián se cortó el pelo"
43
+ Entidades: Damián:PER
44
+ ----
45
+ Extrae las entidades nombradas del siguiente texto:
46
+ Texto: "Lo mejor de Barcelona es el bar de mi amigo Pablo"
47
+ Entidades: Pablo:PER, Barcelona:LOC
48
+ ----
49
+ Extrae las entidades nombradas del siguiente texto:
50
+ Texto: "Carlos comparte piso con Marc"
51
+ Entidades:
52
+ example_title: Entidades-Nombradas
53
  ---
54
+
55
+ # CaBLOOM-760M
56
+
57
+ ## Table of Contents
58
+ <details>
59
+ <summary>Click to expand</summary>
60
+
61
+ - [Model description](#model-description)
62
+ - [Intended uses and limitations](#intended-uses-and-limitations)
63
+ - [How to use](#how-to-use)
64
+ - [Limitations and bias](#limitations-and-bias)
65
+ - [Training](#training)
66
+ - [Evaluation](#evaluation)
67
+ - [Additional information](#additional-information)
68
+
69
+ </details>
70
+
71
+ ## Model description
72
+
73
+ **CaBLOOM-760M** is a 760M-parameter transformer-based causal language model for Catalan, Spanish, and English.
74
+ It was obtained by applying a language adaptation technique to [BLOOM-1.1B](https://huggingface.co/bigscience/bloom-1b1),
74
+ which involves modifying the model's vocabulary and embedding layer and then continuing pre-training on 26B tokens in the target languages.
76
+
77
+ This model was developed as part of a research paper submitted to [LREC-COLING 2024](https://lrec-coling-2024.org/), which is currently under peer review.
78
+
79
+ ## Intended uses and limitations
80
+
81
+ The **CaBLOOM-760M** model is ready to use only for causal language modeling.
82
+ It can perform text-generation tasks and be fine-tuned for specific scenarios.
83
+
84
+ ## How to use
85
+ ```python
86
+ import torch
87
+ from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
88
+
89
+ input_text = "Sovint em trobo pensant en tot allò que"
90
+
91
+ model_id = "BSC-LT/CaBLOOM-760M"
92
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
93
+ generator = pipeline(
94
+     "text-generation",
95
+     model=model_id,
96
+     tokenizer=tokenizer,
97
+     torch_dtype=torch.bfloat16,
98
+     trust_remote_code=True,
99
+     device_map="auto",
100
+ )
101
+ generation = generator(
102
+     input_text,
103
+     do_sample=True,
104
+     top_k=10,
105
+     eos_token_id=tokenizer.eos_token_id,
106
+ )
107
+
108
+ print(f"Result: {generation[0]['generated_text']}")
109
+ ```
110
+
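The widget examples in this card's metadata use a few-shot layout: an instruction, a question, and an answer per block, with blocks separated by a `----` line and the last answer left open. The following is a minimal sketch of building such a prompt. The helper `build_few_shot_prompt` and its parameter names are hypothetical and not part of `transformers` or of this repository.

```python
def build_few_shot_prompt(instruction, examples, question,
                          question_label="Pregunta", answer_label="Resposta"):
    """Format solved examples plus one open question, separated by '----' lines."""
    blocks = []
    for example_question, example_answer in examples:
        blocks.append(
            f'{instruction}\n{question_label}: "{example_question}"\n'
            f'{answer_label}: "{example_answer}"'
        )
    # The final block leaves the answer empty so the model can complete it.
    blocks.append(f'{instruction}\n{question_label}: "{question}"\n{answer_label}:')
    return "\n----\n".join(blocks)


prompt = build_few_shot_prompt(
    instruction="Respon a la pregunta següent.",
    examples=[
        ("Quina és la capital de Suècia?", "La capital de Suècia és Estocolm."),
        ("Quina beguda es consumeix als matins per despertar-se?",
         "La majoria de gent consumeix cafè per despertar-se."),
    ],
    question="Explica com funciona un motor de combustió",
)
print(prompt)
```

The resulting string can be passed to the text-generation pipeline in place of `input_text`.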
111
+ ## Limitations and bias
112
+ At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model.
113
+ However, we are well aware that our models may be biased since the corpora have been collected using crawling techniques
114
+ on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
115
+
116
+
117
+ ## Training
118
+
119
+ ### Language adaptation and training
120
+
121
+ The language adaptation technique used to create CaBLOOM-760M requires the vocabulary of the source model
121
+ to be adapted before continuing its pre-training with data in the target languages. Specifically, we proceeded as follows:
122
+ 1) We trained our own BPE tokenizer for Catalan, Spanish, and English, and replaced the original BLOOM tokenizer and vocabulary with it. This required shrinking the original BLOOM embedding layer and therefore compressed the model from 1.1B parameters to 760M.
123
+ 2) The embeddings corresponding to tokens that are present in both the original and the target vocabulary (matching tokens) were used for initialization.
124
+ 3) The embeddings for tokens not present in BLOOM's original vocabulary were initialized as the average of all embeddings.
125
+ 4) The model was initialized with the weights from BLOOM-1.1B, and with our adapted tokenizer (step 1) and embeddings (steps 2-3).
127
+ 5) The model was then trained on a corpus that contains a mixture of Catalan, Spanish and English data.
128
+
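Steps 2 and 3 above can be illustrated with a small sketch on toy data. This is not the authors' implementation: the function name, the plain-list representation, and the toy vocabularies are assumptions made for illustration. It also assumes that "the average of all embeddings" means the mean over every row of the source embedding matrix.

```python
def init_adapted_embeddings(source_vocab, source_embeddings, target_vocab):
    """Build an embedding matrix for a new vocabulary from an existing one.

    source_vocab: dict mapping token -> row index in source_embeddings
    source_embeddings: list of equal-length float vectors
    target_vocab: list of tokens, in the order of the new token ids
    """
    dim = len(source_embeddings[0])
    count = len(source_embeddings)
    # Mean of all source embeddings, used for tokens with no match (step 3).
    mean_vector = [sum(row[d] for row in source_embeddings) / count
                   for d in range(dim)]

    target_embeddings = []
    for token in target_vocab:
        if token in source_vocab:
            # Matching token: reuse the source embedding (step 2).
            target_embeddings.append(list(source_embeddings[source_vocab[token]]))
        else:
            target_embeddings.append(list(mean_vector))
    return target_embeddings


# Toy example with a 3-token source vocabulary and 2-dimensional embeddings.
source_vocab = {"the": 0, "casa": 1, "ab": 2}
source_embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
target_vocab = ["casa", "però", "the"]

adapted = init_adapted_embeddings(source_vocab, source_embeddings, target_vocab)
print(adapted)
```

Because the new vocabulary is smaller than the original one, the resulting matrix has fewer rows, which is where the reduction in parameter count comes from.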
129
+ ### Training data
130
+
131
+ The training corpus is the same one used to train [Ǎguila-7B](https://huggingface.co/projecte-aina/aguila-7b).
131
+ It consists of 26B tokens drawn from several corpora gathered from web crawls and public domain data.
133
+
134
+ | Dataset | Language | Words (per-epoch) | Epochs |
135
+ |---------------------|----------|--------------------|--------------|
136
+ | Wikipedia | en | 2169.97M | 1.428144485 |
137
+ | C4_es | es | 53709.80M | 0.1049686196 |
138
+ | Biomedical | es | 455.03M | 0.7140722425 |
139
+ | Legal | es | 995.70M | 0.7140722425 |
140
+ | Wikipedia | es | 693.60M | 1.428144485 |
141
+ | Gutenberg | es | 53.18M | 0.7140722425 |
142
+ | C4_ca | ca | 2826.00M | 2.142216727 |
143
+ | Biomedical | ca | 11.80M | 1.428144485 |
144
+ | RacoCatalà Noticias | ca | 17.16M | 2.142216727 |
145
+ | RacoCatalà Forums | ca | 333.73M | 2.142216727 |
146
+ | CaWaC | ca | 57.79M | 2.142216727 |
147
+ | Wikipedia | ca | 228.01M | 3.570361212 |
148
+ | Vilaweb | ca | 50.34M | 2.142216727 |
149
+
150
+ ### Languages
151
+
152
+ The training data contains roughly equal amounts of Catalan and Spanish text, and a smaller amount of English data.
153
+ The table below shows the final language distribution:
154
+
155
+ |Language|Percentage|
156
+ |--------|----------|
157
+ | English (EN) | 16.84% |
158
+ | Spanish (ES) | 41.38% |
159
+ | Catalan (CA) | 41.79% |
160
+
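As a rough cross-check, per-language shares can be estimated from the training data table by weighting each corpus's word count by its number of epochs. The sketch below does this on word counts (in millions), copied from that table. The result comes out close to, but not identical to, the percentages reported above; the reported figures may have been computed differently (for example on tokens rather than words), which is an assumption rather than something stated in this card.

```python
# (dataset, language, words per epoch in millions, epochs), from the training data table.
CORPORA = [
    ("Wikipedia", "en", 2169.97, 1.428144485),
    ("C4_es", "es", 53709.80, 0.1049686196),
    ("Biomedical", "es", 455.03, 0.7140722425),
    ("Legal", "es", 995.70, 0.7140722425),
    ("Wikipedia", "es", 693.60, 1.428144485),
    ("Gutenberg", "es", 53.18, 0.7140722425),
    ("C4_ca", "ca", 2826.00, 2.142216727),
    ("Biomedical", "ca", 11.80, 1.428144485),
    ("RacoCatalà Noticias", "ca", 17.16, 2.142216727),
    ("RacoCatalà Forums", "ca", 333.73, 2.142216727),
    ("CaWaC", "ca", 57.79, 2.142216727),
    ("Wikipedia", "ca", 228.01, 3.570361212),
    ("Vilaweb", "ca", 50.34, 2.142216727),
]


def language_shares(corpora):
    """Return the percentage of epoch-weighted words per language."""
    totals = {}
    for _name, lang, words_m, epochs in corpora:
        totals[lang] = totals.get(lang, 0.0) + words_m * epochs
    grand_total = sum(totals.values())
    return {lang: 100.0 * total / grand_total for lang, total in totals.items()}


shares = language_shares(CORPORA)
for lang, pct in sorted(shares.items()):
    print(f"{lang}: {pct:.2f}%")
```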
161
+ ### Training hyperparameters
162
+ - seed: 1
163
+ - distributed_type: [WSE-2](https://www.cerebras.net/product-chip/)
164
+ - num_devices: 1
165
+ - train_batch_size: 60
166
+ - eval_batch_size: 60
167
+ - optimizer: AdamW
168
+ - betas: (0.9,0.95)
169
+ - epsilon: 1e-08
170
+ - weight_decay_rate: 0.1
171
+ - learning_rate:
172
+   - scheduler: "Linear"
172
+     initial_learning_rate: 0.0
173
+     end_learning_rate: 4.1e-5
174
+     steps: 3050
175
+   - scheduler: "CosineDecay"
176
+     initial_learning_rate: 4.1e-5
177
+     end_learning_rate: 3.4e-6
178
+     steps: 209133
179
+   - scheduler: "Constant"
180
+     learning_rate: 2.2e-6
182
+ - num_epochs: 1.0
183
+
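The three-phase learning-rate schedule listed above can be sketched as a function of the training step. This is an illustration only: the half-cosine formula is the commonly used form and is assumed here, the exact behaviour of the Cerebras scheduler may differ, and the function and constant names are hypothetical.

```python
import math

WARMUP_STEPS = 3050        # "Linear" phase
DECAY_STEPS = 209133       # "CosineDecay" phase
PEAK_LR = 4.1e-5           # end of warmup, start of cosine decay
DECAY_END_LR = 3.4e-6      # end of cosine decay
CONSTANT_LR = 2.2e-6       # "Constant" phase, as listed


def learning_rate(step):
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0.0 to PEAK_LR.
        return PEAK_LR * step / WARMUP_STEPS
    decay_step = step - WARMUP_STEPS
    if decay_step < DECAY_STEPS:
        # Half-cosine decay from PEAK_LR down to DECAY_END_LR.
        progress = decay_step / DECAY_STEPS
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        return DECAY_END_LR + (PEAK_LR - DECAY_END_LR) * cosine
    # After both phases, the listed constant value applies.
    return CONSTANT_LR


for step in (0, 1525, 3050, 100000, 212183):
    print(step, learning_rate(step))
```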
184
+ ### Framework versions
185
+ The training was conducted on a Cerebras [CS-2 system](https://www.cerebras.net/product-system/)
186
+ using the [cs-1.9.1](https://github.com/Cerebras/modelzoo/releases/tag/Release_1.9.1) release of their software.
187
+
188
+
189
+ ## Evaluation
190
+ CaBLOOM-760M was evaluated in a 5-shot setting, using EleutherAI's LM Evaluation Harness, on several datasets in Catalan, Spanish, and English, with particular emphasis on Catalan datasets.
191
+
192
+ The tasks were chosen to cover several evaluation areas in order to provide a comprehensive overview of the model's capabilities. The baselines used to compare our results are multilingual and English open-source 1.3B models: mGPT-1.3B, GPT-Neo-1.3B, Pythia-1.4B, OPT-1.3B, Falcon-rw-1.3B, and Cerebras-GPT-1.3B.
193
+
194
+ Our implementation of EleutherAI's *LM Evaluation Harness* can be found [here](https://github.com/langtech-bsc/lm-evaluation-harness/tree/cabloom-eval).
195
+
196
+ The following is a list of evaluation areas and their respective datasets:
197
+ - Reading Comprehension: [Belebele](https://huggingface.co/datasets/facebook/belebele)
198
+ - Question Answering: [XQuAD](https://huggingface.co/datasets/xquad), [CatalanQA](https://huggingface.co/datasets/projecte-aina/catalanqa), [CoQCat](https://huggingface.co/datasets/projecte-aina/CoQCat)
199
+ - Natural Language Inference: [XNLI](https://huggingface.co/datasets/xnli) and its translation to Catalan ([XNLI-ca](https://huggingface.co/datasets/projecte-aina/xnli-ca)), [TE-ca](https://huggingface.co/datasets/projecte-aina/teca)
200
+ - Paraphrase Identification: [PAWS-X](https://huggingface.co/datasets/paws-x) and its translation to Catalan ([PAWS-ca](https://huggingface.co/datasets/projecte-aina/PAWS-ca)), [Parafraseja](https://huggingface.co/datasets/projecte-aina/Parafraseja)
201
+ - Commonsense Reasoning: [COPA](https://people.ict.usc.edu/~gordon/copa.html) and its translation to Catalan ([COPA-ca](https://huggingface.co/datasets/projecte-aina/COPA-ca))
202
+ - Translation: [FLoRes](https://huggingface.co/datasets/flores)
203
+
204
+
205
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/635ba692dc371b8f91005172/o595pF7dw-iTuR1_x4MVy.png)
206
+
207
+
208
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/635ba692dc371b8f91005172/DhrkZG8Xqob7Ml4n6zQcY.png)
209
+
210
+
211
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/635ba692dc371b8f91005172/PxgzqXAelUoWY-23zXvPm.png)
212
+
213
+
214
+ ## Additional information
215
+
216
+ ### Author
217
+ The Language Technologies Unit from Barcelona Supercomputing Center.
218
+
219
+ ### Contact
220
+ For further information, please send an email to <[email protected]>.
221
+
222
+ ### Copyright
223
+ Copyright (c) 2023 by Language Technologies Unit, Barcelona Supercomputing Center.
224
+
225
+ ### License
226
+ [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
227
+
228
+ ### Funding
229
+ This work was funded by [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en)) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).
230
+
231
+ ### Disclaimer
232
+
233
+ <details>
234
+ <summary>Click to expand</summary>
235
+
236
+ The model published in this repository is intended for general-purpose use and is available to third parties under the permissive Apache License, Version 2.0.
237
+
238
+ Be aware that the model may have biases and/or any other undesirable distortions.
239
+
240
+ When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it)
241
+ or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and,
242
+ in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.
243
+
244
+ In no event shall the owner and creator of the model (Barcelona Supercomputing Center)
245
+ be liable for any results arising from the use made by third parties.
246
+
247
+ </details>