joanllop committed
Commit b90f22c · 2 Parent(s): dd93102 4fe94c2

main merged to 1.1

Files changed (2)
  1. README.md +11 -11
  2. images/corpus_languages_1.0.png +0 -0
README.md CHANGED
@@ -87,7 +87,7 @@ The pre-training corpus contains text in 35 European languages and code.

  ### Hyperparameters

- The full list of hyperparameters for each model can be found [here](https://github.com/langtech-bsc/salamandra/blob/main/configs/bsc_7b.yaml).
+ The full list of hyperparameters for each model can be found [here](https://github.com/langtech-bsc/salamandra/tree/main/configs).

  ### Architecture

@@ -141,7 +141,7 @@ All models were trained on [MareNostrum 5](https://www.bsc.es/ca/marenostrum/mar
  operated by Barcelona Supercomputing Center.

  The accelerated partition is composed of 1,120 nodes with the following specifications:
- - 4x Nvidia Hopper GPUs with 64GB HBM2 memory
+ - 4x Nvidia Hopper GPUs with 64 HBM2 memory
  - 2x Intel Sapphire Rapids 8460Y+ at 2.3Ghz and 32c each (64 cores)
  - 4x NDR200 (BW per node 800Gb/s)
  - 512 GB of Main memory (DDR5)
@@ -309,8 +309,8 @@ This adjustment resulted in a total of 2.68 trillion tokens, distributed as outl

  ![lang distrib](./images/corpus_languages_1.1.png)

- The pretraining corpus is predominantly composed of data from Colossal OSCAR, which contributes a significant 53,05% of the total tokens.
- Following this, Starcoder provides 13,67%, and FineWeb-Edu (350BT subset) adds 10,24%. The next largest sources are HPLT at 4,21% and French-PD at 3,59%.
+ The pretraining corpus is predominantly composed of data from Colossal OSCAR, which contributes a significant 53.05% of the total tokens.
+ Following this, Starcoder provides 13.67%, and FineWeb-Edu (350BT subset) adds 10.24%. The next largest sources are HPLT at 4.21% and French-PD at 3.59%.
  Other notable contributions include MaCoCu, Legal-ES, and EurLex, each contributing around 1.72% to 1.41%.
  These major sources collectively form the bulk of the corpus, ensuring a rich and diverse dataset for training the language model.
  The remaining 10% comes from smaller sources in various languages.
@@ -590,8 +590,8 @@ especially if the content originates from less-regulated sources or user-generat
  **How was the data collected?**

  This dataset is constituted by combining several sources, whose acquisition methods can be classified into three groups:
- - Web-sourced datasets with some preprocessing available under permissive license.
- - Domain-specific or language-specific raw crawls, always respecting robots.txt.
+ - Web-sourced datasets with some preprocessing available under permissive license (p.e. Common Crawl).
+ - Domain-specific or language-specific raw crawls, always respecting robots.txt (p.e. Spanish Crawling).
  - Manually curated data obtained through collaborators, data providers (by means of legal assignment agreements) or open source projects
  (p.e. CATalog).

@@ -644,7 +644,7 @@ The original raw data was not kept.

  **Is the software that was used to preprocess/clean/label the data available? If so, please provide a link or other access point.**

- Yes, the preprocessing and filtering software is open-sourced. The [CURATE](https://github.com/langtech-bsc/CURATE) pipeline was used for CATalog and other curated sources,
+ Yes, the preprocessing and filtering software is open-sourced. The [CURATE](https://github.com/langtech-bsc/CURATE) pipeline was used for Spanish Crawling and CATalog,
  and the [Ungoliant](https://github.com/oscar-project/ungoliant) pipeline was used for the OSCAR project.

  #### Uses
@@ -724,7 +724,7 @@ We only use tasks that are either human generated, human translated, or with a s

  During the implementation of the evaluation we observed a series of issues worth considering when replicating and interpreting the results presented. These issues include ≈1.5% variances in performance in some tasks depending on the version of the `transformers` library used, and depending on the use (or lack of use) of tensor parallelism when loading a model. When implementing existing tasks, we carry out a comprehensive quality evaluation of the dataset, the Harness task itself, and what kind of input models see during evaluation. Our implementation (see links above) addresses multiple existing problems such as errors in datasets and prompts, and lack of pre-processing. All this means that results will vary if using other Harness implementations, and may slightly vary depending on the replication setup.

- It should be noted that these results are subject to all the drawbacks of every current gold-standard evaluation, and that the figures do not fully represent the model's capabilities and potential. We thus advise caution when reading and interpreting the results.
+ It should be noted that these results are subject to all the drawbacks of every current gold-standard evaluation, and that the figures do not fully represent the models capabilities and potential. We thus advise caution when reading and interpreting the results.

  A full list of results compared to other baselines, a discussion of the model's performance across tasks and its implications, and details regarding problem-solving with task implementation will soon be available in the technical report.

@@ -1065,7 +1065,7 @@ We report that while performance is high (accuracies between 0.69 and 0.87 depen
  the model performs very poorly in ambiguous settings, which is indicative of the presence of societal biases which need to be addressed in post-training phases.

  We additionally analyse model generations using the Regard dataset and classifier in Catalan, Spanish, and English using backtranslation and manual revision of the
- translations. We find no statistically significant difference in regard between majority and minority groups for any regard type,
+ translations. We find no statistically significant difference in regard between majority and minority groups for any regard types,
  with the exception of negative regard in Catalan where model generations are actually slightly worse for social majorities.
  Our analyses on societal biases show that while these biases are capable of interfering with model performance as expressed in the results on the BBQ dataset,
  their tendency for representational harm is limited given the results of the Regard dataset. We highlight that our analyses of these biases are by no means exhaustive
@@ -1075,7 +1075,7 @@ in future work.
  Our cognitive bias analysis focuses on positional effects in 0-shot settings, and majority class bias in few-shot settings.
  For positional effects, we leverage the ARC Multiple Choice Question dataset (Clark et al., 2018).
  We observe moderate to strong primacy effects, whereby the model shows a preference for answers towards the beginning of the list of provided answers.
- We measure the effects of majority class effects in few-shot settings using SST-2 (Socher et al., 2013). We detect moderate effects,
+ We measure effects of majority class effects in few-shot settings using SST-2 (Socher et al., 2013). We detect moderate effects,
  implying that outputs can be influenced by the prompts.

  We highlight that these results can be expected from a pretrained model that has not yet been instruction-tuned or aligned.
@@ -1133,4 +1133,4 @@ Technical report coming soon.
  |:---:|:---:|:---:|
  |2B| [Link](https://huggingface.co/BSC-LT/salamandra-2b) | [Link](https://huggingface.co/BSC-LT/salamandra-2b-instruct) |
  |7B| [Link](https://huggingface.co/BSC-LT/salamandra-7b) | [Link](https://huggingface.co/BSC-LT/salamandra-7b-instruct) |
- |40B| [Link](https://huggingface.co/BSC-LT/ALIA-40b) | WiP |
+ |40B| [Link](https://huggingface.co/BSC-LT/ALIA-40b) | WiP |
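As a quick sanity check on the corpus hunk above, the quoted shares can be converted into approximate absolute token counts against the stated 2.68 trillion-token total. The sketch below is illustrative only and is not part of the commit; all source names and percentages are taken verbatim from the diff.

```python
# Back-of-the-envelope token counts implied by the corpus shares in the diff.
# TOTAL_TOKENS and the percentages come straight from the README text.
TOTAL_TOKENS = 2.68e12  # 2.68 trillion tokens

shares = {
    "Colossal OSCAR": 53.05,
    "Starcoder": 13.67,
    "FineWeb-Edu (350BT subset)": 10.24,
    "HPLT": 4.21,
    "French-PD": 3.59,
}

for source, pct in shares.items():
    print(f"{source}: ~{TOTAL_TOKENS * pct / 100 / 1e12:.2f}T tokens ({pct}%)")

# The listed sources sum to ~84.76%; MaCoCu, Legal-ES, EurLex and the
# long tail of smaller sources account for the remaining ~15%.
print(f"Listed sources combined: {sum(shares.values()):.2f}%")
```

Colossal OSCAR alone works out to roughly 1.42T of the 2.68T tokens, which matches its majority share in the distribution plot referenced by the hunk.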
images/corpus_languages_1.0.png DELETED
Binary file (352 kB)
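The evaluation hunk above warns of ≈1.5% score variance across `transformers` versions and with tensor-parallel loading, so replications should record both. Below is a minimal, hypothetical loading sketch for the 7B base checkpoint linked in the table; the dtype, device placement, and prompt are assumptions for illustration, not the setup used by the authors.

```python
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

# Record the library version: the README reports ~1.5% task-score variance
# across `transformers` versions and with/without tensor parallelism.
print(f"transformers=={transformers.__version__}")

model_id = "BSC-LT/salamandra-7b"  # base 7B checkpoint from the links table

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # assumption: bf16 inference
    device_map="auto",            # requires the `accelerate` package
)

# Placeholder Catalan prompt; any text works for a smoke test.
inputs = tokenizer("El mercat del barri és", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=25)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Pinning the exact `transformers` version in a requirements file makes reported figures easier to reproduce across replication setups.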