main merged to 1.1
- README.md +11 -11
- images/corpus_languages_1.0.png +0 -0
README.md
CHANGED
@@ -87,7 +87,7 @@ The pre-training corpus contains text in 35 European languages and code.

### Hyperparameters

-The full list of hyperparameters for each model can be found [here](https://github.com/langtech-bsc/salamandra/
+The full list of hyperparameters for each model can be found [here](https://github.com/langtech-bsc/salamandra/tree/main/configs).

### Architecture

@@ -141,7 +141,7 @@ All models were trained on [MareNostrum 5](https://www.bsc.es/ca/marenostrum/mar
operated by Barcelona Supercomputing Center.

The accelerated partition is composed of 1,120 nodes with the following specifications:
-- 4x Nvidia Hopper GPUs with
+- 4x Nvidia Hopper GPUs with 64GB HBM2 memory
- 2x Intel Sapphire Rapids 8460Y+ at 2.3GHz and 32c each (64 cores)
- 4x NDR200 (BW per node 800Gb/s)
- 512 GB of main memory (DDR5)
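For a sense of scale, the node specification above implies the following aggregate figures for the accelerated partition. This is derived arithmetic based only on the numbers listed, not data stated in the model card:

```python
# Aggregate figures implied by the node specification above
# (derived for illustration; only the per-node numbers come from the text).
nodes = 1_120
gpus_per_node = 4
hbm_per_gpu_gb = 64
ddr5_per_node_gb = 512

print(f"Total GPUs:          {nodes * gpus_per_node:,}")                              # 4,480
print(f"Aggregate HBM2 (TB): {nodes * gpus_per_node * hbm_per_gpu_gb / 1024:,.0f}")   # 280
print(f"Aggregate DDR5 (TB): {nodes * ddr5_per_node_gb / 1024:,.0f}")                 # 560
```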
@@ -309,8 +309,8 @@ This adjustment resulted in a total of 2.68 trillion tokens, distributed as outl

![lang distrib](./images/corpus_languages_1.1.png)

-The pretraining corpus is predominantly composed of data from Colossal OSCAR, which contributes a significant 53
-Following this, Starcoder provides 13
+The pretraining corpus is predominantly composed of data from Colossal OSCAR, which contributes a significant 53.05% of the total tokens.
+Following this, Starcoder provides 13.67%, and FineWeb-Edu (350BT subset) adds 10.24%. The next largest sources are HPLT at 4.21% and French-PD at 3.59%.
Other notable contributions include MaCoCu, Legal-ES, and EurLex, each contributing between 1.41% and 1.72%.
These major sources collectively form the bulk of the corpus, ensuring a rich and diverse dataset for training the language model.
The remaining 10% comes from smaller sources in various languages.
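As a rough illustration, the shares quoted above can be converted into approximate token counts against the 2.68 trillion total. The counts below are derived here and are not part of the original corpus documentation:

```python
# Back-of-the-envelope token counts implied by the corpus shares above.
# The percentages come from the text; the absolute counts are derived.
TOTAL_TOKENS = 2.68e12

shares = {
    "Colossal OSCAR": 53.05,
    "Starcoder": 13.67,
    "FineWeb-Edu (350BT subset)": 10.24,
    "HPLT": 4.21,
    "French-PD": 3.59,
}

for source, pct in shares.items():
    tokens = TOTAL_TOKENS * pct / 100
    print(f"{source:<28} {pct:>6.2f}%  ~{tokens / 1e9:,.0f}B tokens")
```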
@@ -590,8 +590,8 @@ especially if the content originates from less-regulated sources or user-generat

**How was the data collected?**

This dataset was created by combining several sources, whose acquisition methods can be classified into three groups:
-- Web-sourced datasets with some preprocessing available under permissive license.
-- Domain-specific or language-specific raw crawls, always respecting robots.txt.
+- Web-sourced datasets with some preprocessing available under permissive license (e.g. Common Crawl).
+- Domain-specific or language-specific raw crawls, always respecting robots.txt (e.g. Spanish Crawling).
- Manually curated data obtained through collaborators, data providers (by means of legal assignment agreements) or open source projects
(e.g. CATalog).
@@ -644,7 +644,7 @@ The original raw data was not kept.

**Is the software that was used to preprocess/clean/label the data available? If so, please provide a link or other access point.**

-Yes, the preprocessing and filtering software is open-sourced. The [CURATE](https://github.com/langtech-bsc/CURATE) pipeline was used for
+Yes, the preprocessing and filtering software is open-sourced. The [CURATE](https://github.com/langtech-bsc/CURATE) pipeline was used for Spanish Crawling and CATalog,
and the [Ungoliant](https://github.com/oscar-project/ungoliant) pipeline was used for the OSCAR project.

#### Uses
@@ -724,7 +724,7 @@ We only use tasks that are either human generated, human translated, or with a s

During the implementation of the evaluation we observed a series of issues worth considering when replicating and interpreting the results presented. These issues include ≈1.5% variance in performance on some tasks depending on the version of the `transformers` library used, and on the use (or lack of use) of tensor parallelism when loading a model. When implementing existing tasks, we carry out a comprehensive quality evaluation of the dataset, the Harness task itself, and the kind of input models see during evaluation. Our implementation (see links above) addresses multiple existing problems such as errors in datasets and prompts, and lack of pre-processing. All this means that results will vary if using other Harness implementations, and may slightly vary depending on the replication setup.

-It should be noted that these results are subject to all the drawbacks of every current gold-standard evaluation, and that the figures do not fully represent the
+It should be noted that these results are subject to all the drawbacks of every current gold-standard evaluation, and that the figures do not fully represent the models' capabilities and potential. We thus advise caution when reading and interpreting the results.

A full list of results compared to other baselines, a discussion of the model's performance across tasks and its implications, and details regarding problem-solving with task implementation will soon be available in the technical report.

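Given the reported sensitivity to the `transformers` version and to tensor parallelism, a replication run would reasonably record its environment alongside results. A minimal sketch, using a hypothetical `eval_environment` helper that is not part of the released evaluation code:

```python
# Hypothetical helper (not from the repo): collect the environment details
# that the paragraph above says can shift scores by ~1.5% on some tasks.
import torch
import transformers

def eval_environment() -> dict:
    return {
        "transformers": transformers.__version__,
        "torch": torch.__version__,
        "cuda": torch.version.cuda,                 # None on CPU-only builds
        "gpu_count": torch.cuda.device_count(),     # >1 may imply tensor parallelism
    }

print(eval_environment())
```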
@@ -1065,7 +1065,7 @@ We report that while performance is high (accuracies between 0.69 and 0.87 depen
the model performs very poorly in ambiguous settings, which is indicative of the presence of societal biases that need to be addressed in post-training phases.

We additionally analyse model generations using the Regard dataset and classifier in Catalan, Spanish, and English using backtranslation and manual revision of the
-translations. We find no statistically significant difference in regard between majority and minority groups for any regard
+translations. We find no statistically significant difference in regard between majority and minority groups for any regard type,
with the exception of negative regard in Catalan, where model generations are actually slightly worse for social majorities.
Our analyses on societal biases show that while these biases are capable of interfering with model performance as expressed in the results on the BBQ dataset,
their tendency for representational harm is limited given the results of the Regard dataset. We highlight that our analyses of these biases are by no means exhaustive
@@ -1075,7 +1075,7 @@ in future work.

Our cognitive bias analysis focuses on positional effects in 0-shot settings, and majority class bias in few-shot settings.
For positional effects, we leverage the ARC Multiple Choice Question dataset (Clark et al., 2018).
We observe moderate to strong primacy effects, whereby the model shows a preference for answers towards the beginning of the list of provided answers.
-We measure
+We measure majority class effects in few-shot settings using SST-2 (Socher et al., 2013). We detect moderate effects,
implying that outputs can be influenced by the prompts.

We highlight that these results can be expected from a pretrained model that has not yet been instruction-tuned or aligned.
@@ -1133,4 +1133,4 @@ Technical report coming soon.
|:---:|:---:|:---:|
|2B| [Link](https://huggingface.co/BSC-LT/salamandra-2b) | [Link](https://huggingface.co/BSC-LT/salamandra-2b-instruct) |
|7B| [Link](https://huggingface.co/BSC-LT/salamandra-7b) | [Link](https://huggingface.co/BSC-LT/salamandra-7b-instruct) |
-|40B| [Link](https://huggingface.co/BSC-LT/ALIA-40b) | WiP |
+|40B| [Link](https://huggingface.co/BSC-LT/ALIA-40b) | WiP |
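A minimal usage sketch for the checkpoints linked in the table above, using the standard Hugging Face `transformers` API. The checkpoint choice and prompt are illustrative rather than taken from the model card, and `device_map="auto"` additionally assumes `accelerate` is installed:

```python
# Load one of the checkpoints from the table above and generate a completion.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BSC-LT/salamandra-2b-instruct"  # any row of the table works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Catalan prompt, reflecting the multilingual pre-training corpus.
inputs = tokenizer("El mercat del barri és", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=25)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```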
images/corpus_languages_1.0.png
DELETED
Binary file (352 kB)