jsaizant committed on
Commit 65a5630 · verified · 1 Parent(s): 30d8413

Update README.md

Files changed (1): README.md +45 -52

README.md CHANGED
@@ -464,28 +464,26 @@ We provide an extensive Datasheet section following the best practices defined by
 
 **For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.**
 
- The purpose of creating this dataset is to pre-train the Salamandra family of multilingual models with high performance in a large number of
- European languages (35) and code (including 92 different programming languages). In addition, we aim to represent especially the co-official
- languages of Spain: Spanish, Catalan, Galician, and Basque. This is the reason why we carry out an oversampling of these languages.
+ The purpose of creating this dataset is to pre-train the Salamandra family of multilingual models with high performance in a large number of European languages (35)
+ and programming languages (92). We also want to represent the co-official languages of Spain: Spanish, Catalan, Galician and Basque. For this reason, we oversample
+ these languages by a factor of 2.
 
- We detected that there is a great lack of massive multilingual data, especially in minority languages (Ostendorff & Rehm, 2023), so part of
- our efforts in the creation of this pre-training dataset have resulted in the contribution to large projects such as the Community OSCAR
- (Brack et al., 2024), which includes 151 languages and 40T words, or CATalog (Palomar-Giner et al., 2024), the largest open dataset in
- Catalan in the world.
+ There is a great lack of massive multilingual data, especially in minority languages (Ostendorff & Rehm, 2023), so part of our efforts in the creation of
+ this pre-training dataset has resulted in contributions to large projects such as the Community OSCAR (Brack et al., 2024), which includes 151 languages
+ and 40T words, or CATalog (Palomar-Giner et al., 2024), the largest open dataset in Catalan in the world.
 
 **Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?**
 
- The dataset has been created by the Language Technologies unit (LangTech) of the Barcelona Supercomputing Center - Centro Nacional de
- Supercomputación (BSC-CNS), which aims to advance the field of natural language processing through cutting-edge research and development
- and the use of HPC. In particular, it was created by the unit's data team, the main contributors being Javier Saiz, Ferran Espuña, and
- Jorge Palomar.
+ The dataset has been created by the Language Technologies unit (LangTech) of the Barcelona Supercomputing Center - Centro Nacional de Supercomputación (BSC-CNS),
+ which aims to advance the field of natural language processing through cutting-edge research and development and the use of HPC. In particular, it was created by
+ the unit's data team, the main contributors being José Javier Saiz, Ferran Espuña and Jorge Palomar.
 
- However, the creation of the dataset would not have been possible without the collaboration of a large number of collaborators, partners,
- and public institutions, which can be found in detail in the acknowledgements.
+ However, the creation of the dataset would not have been possible without the collaboration of a large number of collaborators, partners and public institutions,
+ which can be found in detail in the acknowledgements.
 
 **Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.**
 
- This work has been promoted and financed by the Government of Catalonia through the [Aina Project](https://projecteaina.cat/).
+ This work has been promoted and financed by the Government of Catalonia through the [Aina project](https://projecteaina.cat/).
 
 This work is funded by the _Ministerio para la Transformación Digital y de la Función Pública_ - Funded by EU – NextGenerationEU
 within the framework of [ILENIA Project](https://proyectoilenia.es/) with reference 2022/TL22/00215337.
@@ -526,14 +524,14 @@ sources were sampled in proportion to their occurrence.
 
 **What data does each instance consist of? “Raw” data (e.g., unprocessed text or images) or features? In either case, please provide a description.**
 
- Each instance consists of a text document processed for deduplication, language identification, and source-specific filtering. Some
- documents required optical character recognition (OCR) to extract text from non-text formats such as PDFs.
+ Each instance consists of a text document processed for deduplication, language identification, and source-specific filtering. Some documents required
+ optical character recognition (OCR) to extract text from non-text formats such as PDFs.
 
 **Is there a label or target associated with each instance? If so, please provide a description.**
 
- Each instance is labeled with a unique identifier, the primary language of the content, and the URL for web-sourced instances. Additional
- labels were automatically assigned to detect specific types of content harmful or toxic content and to assign preliminary indicators of
- undesired qualities —very short documents, high density of symbols, etc.— which were used for filtering instances.
+ Each instance is labelled with a unique identifier, the primary language of the content, and the URL for web-sourced instances. Additional labels were
+ automatically assigned to detect specific types of content (harmful or toxic content) and to assign preliminary indicators of undesired qualities (very
+ short documents, high density of symbols, etc.), which were used for filtering instances.
 
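To make the labelling scheme above concrete, here is a minimal sketch of what a single instance record might look like; the field names are illustrative assumptions, not the dataset's actual schema:

```python
# Hypothetical instance record; field names are illustrative assumptions,
# not the dataset's actual schema.
instance = {
    "id": "doc-000123",                 # unique identifier
    "lang": "ca",                       # primary language of the content
    "url": "https://example.org/page",  # present only for web-sourced instances
    "quality_indicators": {
        "too_short": False,             # very short document
        "symbol_density": 0.02,         # density of symbols in the text
        "harmful": False,               # harmful or toxic content flag
    },
}
```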
 **Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g., redacted text.**
 
@@ -545,12 +543,12 @@ Instances are related through shared metadata, such as source and language ident
 
 **Are there recommended data splits (e.g., training, development/validation, testing)? If so, please provide a description of these splits, explaining the rationale behind them.**
 
- The dataset is split randomly into training, validation, and test sets.
+ The dataset is randomly divided into training, validation and test sets, where the validation and test sets are each 1% of the total corpus.
 
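The new answer pins the split sizes down; a sketch of how such a random 98/1/1 split could be produced, assuming splitting is done at the document level (this is not the project's actual code):

```python
import random

def split_dataset(doc_ids, val_frac=0.01, test_frac=0.01, seed=42):
    """Randomly split document ids into train/validation/test sets,
    with validation and test each ~1% of the corpus."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    n_val = int(len(ids) * val_frac)
    n_test = int(len(ids) * test_frac)
    return ids[n_val + n_test:], ids[:n_val], ids[n_val:n_val + n_test]

train, val, test = split_dataset(range(100_000))
```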
 **Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description.**
 
- Despite removing duplicated instances within each source, redundancy remains at the paragraph and sentence levels, particularly in
- web-sourced instances where SEO techniques and templates contribute to repeated textual patterns. Some instances may also be duplicated
+ Despite removing duplicated instances within each source, redundancy remains at the paragraph and sentence levels, particularly in web-sourced
+ instances where search engine optimization techniques and templates contribute to repeated textual patterns. Some instances may also be duplicated
 across sources due to format variations.
 
 **Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? If it links to or relies on external resources, a) are there guarantees that they will exist, and remain constant, over time; b) are there official archival versions of the complete dataset (i.e., including the external resources as they existed at the time the dataset was created); c) are there any restrictions (e.g., licenses, fees) associated with any of the external resources that might apply to a dataset consumer? Please provide descriptions of all external resources and any restrictions associated with them, as well as links or other access points, as appropriate.**
@@ -574,10 +572,10 @@ The dataset does not explicitly identify any subpopulations.
 
 **Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset? If so, please describe how.**
 
- Web-sourced instances in the dataset may contain personally identifiable information (PII) that is publicly available on the Web, such as
- names, IP addresses, email addresses, and phone numbers. While it would be possible to indirectly identify individuals through the
- combination of multiple data points, the nature and scale of web data makes it difficult to parse such information. In any case, efforts are
- made to filter or anonymize sensitive data during pre-processing, but some identifiable information may remain in the dataset.
+ Web-sourced instances in the dataset may contain personally identifiable information (PII) that is publicly available on the Web, such as names,
+ IP addresses, email addresses, and phone numbers. While it would be possible to indirectly identify individuals through the combination of multiple
+ data points, the nature and scale of web data makes it difficult to parse such information. In any case, efforts are made to filter or anonymize
+ sensitive data (Mina et al., 2024), but some identifiable information may remain in the dataset.
 
 **Does the dataset contain data that might be considered sensitive in any way? If so, please provide a description.**
 
@@ -590,29 +588,28 @@ especially if the content originates from less-regulated sources or user-generat
 
 **How was the data collected?**
 
 This dataset is constituted by combining several sources, whose acquisition methods can be classified into three groups:
- - Web-sourced datasets with some preprocessing available under permissive license (p.e. Common Crawl).
- - Domain-specific or language-specific raw crawls, always respecting robots.txt (p.e. Spanish Crawling).
- - Manually curated data obtained through collaborators, data providers (by means of legal assignment agreements) or open source projects
- (p.e. CATalog).
+ - Web-sourced datasets with some preprocessing available under permissive license.
+ - Domain-specific or language-specific raw crawls.
+ - Manually curated data obtained through collaborators, data providers (by means of legal assignment agreements) or open source projects (e.g. CATalog).
 
 **What mechanisms or procedures were used to collect the data? How were these mechanisms or procedures validated?**
 
- According to the three groups previously defined, these are the mechanisms used in each of them:
- - Open direct download. Validation: data integrity tests.
- - Ad-hoc scrapers or crawlers. Validation: software unit and data integrity tests.
- - Direct download via FTP, SFTP, API or S3. Validation: data integrity tests.
+ The data collection process was carried out using three different mechanisms, each corresponding to one of the groups defined in the previous answer. The specific methods used and their respective validation procedures are outlined below:
+ - Open direct download: Data were obtained directly from publicly accessible sources, such as websites or repositories that provide open data downloads. We validate the data with a data integrity check, which ensures that the downloaded files are complete, uncorrupted and in the expected format and structure.
+ - Ad hoc scrapers or crawlers: Custom web scraping scripts or crawlers were used to extract data from various online sources where direct downloads were not available. These scripts navigate web pages, extract relevant data and store it in a structured format. We validate this method with software unit tests to evaluate the functionality of individual components of the scraping programs, checking for errors or unexpected behaviour. In addition, data integrity tests were performed to verify that the collected data remained complete throughout the extraction and storage process.
+ - Direct download via FTP, SFTP, API or S3: Some datasets were acquired using secure transfer protocols such as FTP (File Transfer Protocol), SFTP (Secure File Transfer Protocol), or API (Application Programming Interface) requests from cloud storage services such as Amazon S3. As with the open direct download method, data integrity tests were used to validate the completeness of the files to ensure that the files were not altered or corrupted during the transfer process.
 
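All three mechanisms above are validated with data integrity tests; a minimal sketch of such a test, assuming the data provider publishes a SHA-256 checksum for each file (the helper name is hypothetical):

```python
import hashlib

def verify_checksum(path: str, expected_sha256: str) -> bool:
    """Data integrity test: compare a file's SHA-256 digest against the
    checksum published by the data provider."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large dumps do not need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256
```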
 **If the dataset is a sample from a larger set, what was the sampling strategy?**
 
- The sampling strategy was to use the whole dataset resulting from the filtering explained in the preprocessing/cleaning/labelling section,
- with the particularity that an upsampling of 2 (i.e. twice the probability of sampling a document) was performed for the co-official
- languages of Spain (Spanish, Catalan, Galician, Basque), and a downsampling of 1/2 was applied for code (half the probability of sampling a
- code document, evenly distributed among all programming languages).
+ The sampling strategy was to use the whole dataset resulting from the filtering explained in the 'preprocessing/cleaning/labelling' section,
+ with the particularity that an upsampling of 2 (i.e. twice the probability of sampling a document) was performed for the co-official languages
+ of Spain (Spanish, Catalan, Galician, Basque), and a downsampling of 1/2 was applied for code (half the probability of sampling a code document,
+ evenly distributed among all programming languages).
 
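The up- and downsampling factors above reduce to a per-document sampling weight; a sketch of that arithmetic, assuming sampling is implemented as a simple probability multiplier (the language codes are illustrative):

```python
# Co-official languages of Spain: Spanish, Catalan, Galician, Basque.
UPSAMPLED_LANGS = {"es", "ca", "gl", "eu"}

def sampling_weight(lang: str, is_code: bool) -> float:
    """Relative probability of sampling a document during pre-training."""
    if is_code:
        return 0.5   # downsampling of 1/2, evenly across programming languages
    if lang in UPSAMPLED_LANGS:
        return 2.0   # upsampling of 2 for the co-official languages of Spain
    return 1.0       # other languages sampled in proportion to occurrence

assert sampling_weight("ca", is_code=False) == 2.0
assert sampling_weight("en", is_code=False) == 1.0
```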
 **Who was involved in the data collection process and how were they compensated?**
 
- This data is generally extracted, filtered and sampled by automated processes. The code required to run these processes has been developed
- entirely by members of the LangTech data team, or otherwise obtained from open-source software. Furthermore, there has been no monetary
+ This data is generally extracted, filtered and sampled by automated processes. The code required to run these processes has been developed entirely
+ by members of the Language Technologies data team, or otherwise obtained from open-source software. Furthermore, there has been no monetary
 consideration for acquiring data from suppliers.
 
 **Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances? If not, please describe the timeframe in which the data associated with the instances was created.**
@@ -631,12 +628,9 @@ ethical and legal point of view, respectively.
 
 **Was any preprocessing/cleaning/labeling of the data done? If so, please provide a description. If not, you may skip the remaining questions in this section.**
 
- Instances of text documents were not altered, but web-sourced documents were filtered based on specific criteria along two dimensions:
- - Quality: documents with a score lower than 0.8, based on undesired qualities, such as documents with low number of lines, very short
- sentences, presence of long footers and headers, and high percentage of punctuation, obtained through CURATE (Palomar-Giner et al., 2024)
- were filtered out.
- - Harmful or adult content: documents originating from Colossal OSCAR were filtered using LLM-Datasets (Ostendorff et al., 2024) based on
- the perplexity from a language model (‘harmful_pp’ field) provided by the Ungoliant pipeline (Abadji et al., 2021).
+ No changes were made to the content of individual text document instances. However, the web-sourced documents underwent a filtering process based on specific criteria along two key dimensions:
+ - Quality filtering: The text processing pipeline CURATE (Palomar-Giner et al., 2024) calculates a quality score for each document based on a set of filtering criteria that identify undesirable textual characteristics. Any document with a score below the 0.8 threshold was excluded from the dataset.
+ - Harmful or adult content filtering: To reduce the amount of harmful or inappropriate material in the dataset, documents from Colossal OSCAR were filtered using the Ungoliant pipeline (Abadji et al., 2021), which uses the `harmful_pp` field, a perplexity-based score generated by a language model.
 
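Both filtering dimensions amount to threshold checks on per-document scores; a sketch under the assumption that each document carries a CURATE quality score and the Ungoliant `harmful_pp` value (the `harmful_pp` cutoff and its direction are placeholders, the card does not state them):

```python
QUALITY_THRESHOLD = 0.8  # documents scoring below this were excluded

def keep_document(doc: dict, harmful_pp_cutoff: float = 500.0) -> bool:
    """Apply the two filtering dimensions described above.

    doc["quality_score"] is assumed to come from the CURATE pipeline and
    doc["harmful_pp"] from Ungoliant; the cutoff value is illustrative.
    """
    if doc["quality_score"] < QUALITY_THRESHOLD:
        return False
    # Low perplexity under a harmful-content language model suggests
    # harmful text, so such documents are dropped (assumed direction).
    if doc["harmful_pp"] < harmful_pp_cutoff:
        return False
    return True
```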
 **Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data? If so, please provide a link or other access point to the “raw” data.**
 
@@ -644,7 +638,7 @@ The original raw data was not kept.
 
 **Is the software that was used to preprocess/clean/label the data available? If so, please provide a link or other access point.**
 
- Yes, the preprocessing and filtering software is open-sourced. The [CURATE](https://github.com/langtech-bsc/CURATE) pipeline was used for Spanish Crawling and CATalog,
+ Yes, the preprocessing and filtering software is open-sourced. The [CURATE](https://github.com/langtech-bsc/CURATE) pipeline was used for CATalog and other curated datasets,
 and the [Ungoliant](https://github.com/oscar-project/ungoliant) pipeline was used for the OSCAR project.
 
 #### Uses
@@ -695,11 +689,10 @@ The dataset will not be updated.
 
 **If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances? If so, please describe these limits and explain how they will be enforced.**
 
- The dataset does not keep sensitive data that could allow direct identification of individuals, apart from the data that is publicly
- available in web-sourced content. Due to the sheer volume and diversity of web data, it is not feasible to notify individuals or manage data
- retention on an individual basis. However, efforts are made to mitigate the risks associated with sensitive information through
- pre-processing and filtering to remove identifiable or harmful content. Despite these measures, vigilance is maintained to address potential
- privacy and ethical issues.
+ The dataset does not keep sensitive data that could allow direct identification of individuals, apart from the data that is publicly available in
+ web-sourced content. Due to the sheer volume and diversity of web data, it is not feasible to notify individuals or manage data retention on an
+ individual basis. However, efforts are made to mitigate the risks associated with sensitive information through pre-processing and filtering to
+ remove identifiable or harmful content. Despite these measures, vigilance is maintained to address potential privacy and ethical issues.
 
 **Will older versions of the dataset continue to be supported/hosted/maintained? If so, please describe how. If not, please describe how its obsolescence will be communicated to dataset consumers.**
 
 