**For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.**

The purpose of creating this dataset is to pre-train the Salamandra family of multilingual models with high performance in a large number of European languages (35) and programming languages (92). We also want to give particular weight to the co-official languages of Spain: Spanish, Catalan, Galician and Basque. For this reason, we oversample these languages by a factor of 2.

There is a great lack of massive multilingual data, especially in minority languages (Ostendorff & Rehm, 2023), so part of our efforts in the creation of this pre-training dataset have resulted in contributions to large projects such as Community OSCAR (Brack et al., 2024), which includes 151 languages and 40T words, or CATalog (Palomar-Giner et al., 2024), the largest open dataset in Catalan in the world.

**Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?**

The dataset has been created by the Language Technologies unit (LangTech) of the Barcelona Supercomputing Center - Centro Nacional de Supercomputación (BSC-CNS), which aims to advance the field of natural language processing through cutting-edge research and development and the use of HPC. In particular, it was created by the unit's data team, the main contributors being José Javier Saiz, Ferran Espuña and Jorge Palomar.

However, the creation of the dataset would not have been possible without the collaboration of a large number of partners and public institutions, which are listed in detail in the acknowledgements.

**Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.**

This work has been promoted and financed by the Government of Catalonia through the [Aina project](https://projecteaina.cat/).

This work is also funded by the _Ministerio para la Transformación Digital y de la Función Pública_ - Funded by EU – NextGenerationEU
within the framework of the [ILENIA Project](https://proyectoilenia.es/) with reference 2022/TL22/00215337.

**What data does each instance consist of? “Raw” data (e.g., unprocessed text or images) or features? In either case, please provide a description.**

Each instance consists of a text document processed for deduplication, language identification, and source-specific filtering. Some documents required optical character recognition (OCR) to extract text from non-text formats such as PDFs.

**Is there a label or target associated with each instance? If so, please provide a description.**

Each instance is labelled with a unique identifier, the primary language of the content and, for web-sourced instances, the URL. Additional labels were automatically assigned to detect specific types of content (harmful or toxic content) and to flag preliminary indicators of undesired qualities (very short documents, high density of symbols, etc.), which were used for filtering instances.

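To make the instance structure concrete, here is a minimal illustrative record. The field names are hypothetical, since the datasheet does not publish a schema; only the kinds of metadata listed above are taken from the text.

```python
# Hypothetical instance record: field names are illustrative, not the
# dataset's actual schema. Only the kinds of metadata (identifier, language,
# URL, content labels, quality indicators) come from the datasheet.
example_instance = {
    "id": "doc-000123",                  # unique identifier
    "text": "Full processed document text goes here.",
    "lang": "ca",                        # primary language of the content
    "url": "https://example.org/page",   # present only for web-sourced instances
    "harmful_content": False,            # automatically assigned content label
    "quality_flags": ["very_short"],     # preliminary indicators of undesired qualities
}
```
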
**Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g., redacted text.**

**Are there recommended data splits (e.g., training, development/validation, testing)? If so, please provide a description of these splits, explaining the rationale behind them.**

The dataset is randomly divided into training, validation and test sets, where the validation and test sets are each 1% of the total corpus.

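As a rough sketch of how such a split could be reproduced, the snippet below randomly partitions a list of documents, holding out 1% for validation and 1% for test. The implementation is an assumption; only the split fractions come from the text.

```python
import random

def split_corpus(docs, val_frac=0.01, test_frac=0.01, seed=42):
    """Randomly partition documents into train/validation/test sets."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test

train, val, test = split_corpus([f"doc-{i}" for i in range(1000)])
print(len(train), len(val), len(test))  # 980 10 10
```
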
**Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description.**

Despite removing duplicated instances within each source, redundancy remains at the paragraph and sentence levels, particularly in web-sourced instances where search engine optimization (SEO) techniques and templates contribute to repeated textual patterns. Some instances may also be duplicated across sources due to format variations.

**Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? If it links to or relies on external resources, a) are there guarantees that they will exist, and remain constant, over time; b) are there official archival versions of the complete dataset (i.e., including the external resources as they existed at the time the dataset was created); c) are there any restrictions (e.g., licenses, fees) associated with any of the external resources that might apply to a dataset consumer? Please provide descriptions of all external resources and any restrictions associated with them, as well as links or other access points, as appropriate.**
**Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset? If so, please describe how.**

Web-sourced instances in the dataset may contain personally identifiable information (PII) that is publicly available on the Web, such as names, IP addresses, email addresses, and phone numbers. While it would be possible to indirectly identify individuals through the combination of multiple data points, the nature and scale of web data make it difficult to parse such information. In any case, efforts are made to filter or anonymize sensitive data during pre-processing (Mina et al., 2024), but some identifiable information may remain in the dataset.

**Does the dataset contain data that might be considered sensitive in any way? If so, please provide a description.**

This dataset is constituted by combining several sources, whose acquisition methods can be classified into three groups:
- Web-sourced datasets with some preprocessing available under permissive license.
- Domain-specific or language-specific raw crawls.
- Manually curated data obtained through collaborators, data providers (by means of legal assignment agreements) or open source projects (e.g. CATalog).

**What mechanisms or procedures were used to collect the data? How were these mechanisms or procedures validated?**

The data collection process was carried out using three different mechanisms, each corresponding to one of the groups defined in the previous answer. The specific methods and their respective validation procedures are outlined below, and a sketch of the integrity check follows the list:
- Open direct download: Data were obtained directly from publicly accessible sources, such as websites or repositories that provide open data downloads. We validate the data with a data integrity check, which ensures that the downloaded files are complete, uncorrupted, and in the expected format and structure.
- Ad hoc scrapers or crawlers: Custom web scraping scripts or crawlers were used to extract data from online sources where direct downloads were not available. These scripts navigate web pages, extract the relevant data and store it in a structured format. We validate this method with software unit tests that evaluate the functionality of individual components of the scraping programs, checking for errors or unexpected behaviour. In addition, data integrity tests were performed to verify that the collected data remained complete throughout the extraction and storage process.
- Direct download via FTP, SFTP, API or S3: Some datasets were acquired using transfer protocols such as FTP (File Transfer Protocol) and SFTP (SSH File Transfer Protocol), API (Application Programming Interface) requests, or cloud storage services such as Amazon S3. As with the open direct download method, data integrity tests were used to verify that the files were not altered or corrupted during the transfer process.

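The datasheet does not name the integrity-checking tooling, so the following is a minimal sketch of one common approach: verifying a downloaded file against a published SHA-256 checksum.

```python
import hashlib
from pathlib import Path

def sha256sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, streaming to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_download(path: Path, expected_sha256: str) -> bool:
    """Return True if the downloaded file matches its published checksum."""
    return sha256sum(path) == expected_sha256
```
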
**If the dataset is a sample from a larger set, what was the sampling strategy?**

The sampling strategy was to use the whole dataset resulting from the filtering explained in the 'preprocessing/cleaning/labelling' section, with the particularity that an upsampling of 2 (i.e. twice the probability of sampling a document) was applied to the co-official languages of Spain (Spanish, Catalan, Galician, Basque), and a downsampling of 1/2 was applied to code (half the probability of sampling a code document, evenly distributed among all programming languages).

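The up- and downsampling factors above translate directly into per-document sampling weights. The sketch below is a hypothetical illustration of that arithmetic; only the factors (2 for co-official languages, 1/2 for code) come from the text, while the record layout and language codes are assumptions.

```python
# Illustrative sketch: per-document sampling weights. Only the factors
# (2x for co-official languages of Spain, 0.5x for code) are from the text.
CO_OFFICIAL_LANGS = {"es", "ca", "gl", "eu"}  # Spanish, Catalan, Galician, Basque

def sampling_weight(doc: dict) -> float:
    if doc.get("is_code"):
        return 0.5   # code: half the probability of being sampled
    if doc.get("lang") in CO_OFFICIAL_LANGS:
        return 2.0   # co-official languages: twice the probability
    return 1.0       # all other natural languages: baseline

docs = [{"lang": "ca"}, {"lang": "en"}, {"lang": "python", "is_code": True}]
weights = [sampling_weight(d) for d in docs]
total = sum(weights)
probs = [w / total for w in weights]  # normalized sampling probabilities
```
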
**Who was involved in the data collection process and how were they compensated?**

This data is generally extracted, filtered and sampled by automated processes. The code required to run these processes has been developed entirely by members of the Language Technologies data team, or otherwise obtained from open-source software. Furthermore, there has been no monetary consideration for acquiring data from suppliers.

**Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances? If not, please describe the timeframe in which the data associated with the instances was created.**
**Was any preprocessing/cleaning/labeling of the data done? If so, please provide a description. If not, you may skip the remaining questions in this section.**

No changes were made to the content of individual text document instances. However, web-sourced documents underwent a filtering process along two key dimensions, sketched in code after this list:
- Quality filtering: The text processing pipeline CURATE (Palomar-Giner et al., 2024) calculates a quality score for each document based on a set of filtering criteria that identify undesirable textual characteristics, such as a low number of lines, very short sentences, long footers and headers, and a high percentage of punctuation. Any document with a score below the 0.8 threshold was excluded from the dataset.
- Harmful or adult content filtering: To reduce the amount of harmful or inappropriate material, documents originating from Colossal OSCAR were filtered using LLM-Datasets (Ostendorff et al., 2024) based on the 'harmful_pp' field, a perplexity-based score generated by a language model and provided by the Ungoliant pipeline (Abadji et al., 2021).

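A condensed sketch of these two filters follows. The 0.8 quality threshold comes from the text; the 'harmful_pp' cutoff value and its direction (low perplexity treated as likely harmful) are assumptions, since the datasheet does not state them.

```python
# Sketch of the two filtering passes described above. QUALITY_THRESHOLD is
# from the datasheet; HARMFUL_PP_CUTOFF and the comparison direction are
# assumptions (low perplexity assumed to indicate likely harmful content).
QUALITY_THRESHOLD = 0.8
HARMFUL_PP_CUTOFF = 1000.0

def keep_document(doc: dict) -> bool:
    # Quality filtering: drop documents that CURATE scored below threshold.
    if doc["quality_score"] < QUALITY_THRESHOLD:
        return False
    # Harmful/adult content filtering, applied to Colossal OSCAR documents.
    if doc.get("source") == "colossal_oscar":
        if doc.get("harmful_pp", float("inf")) <= HARMFUL_PP_CUTOFF:
            return False
    return True

corpus = [{"quality_score": 0.95, "source": "colossal_oscar", "harmful_pp": 4500.0}]
kept = [d for d in corpus if keep_document(d)]
```
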
**Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data? If so, please provide a link or other access point to the “raw” data.**

The original raw data was not kept.

**Is the software that was used to preprocess/clean/label the data available? If so, please provide a link or other access point.**

Yes, the preprocessing and filtering software is open-sourced. The [CURATE](https://github.com/langtech-bsc/CURATE) pipeline was used for CATalog and other curated datasets, and the [Ungoliant](https://github.com/oscar-project/ungoliant) pipeline was used for the OSCAR project.

#### Uses
**If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances? If so, please describe these limits and explain how they will be enforced.**

The dataset does not keep sensitive data that could allow direct identification of individuals, apart from data that is publicly available in web-sourced content. Due to the sheer volume and diversity of web data, it is not feasible to notify individuals or manage data retention on an individual basis. However, efforts are made to mitigate the risks associated with sensitive information through pre-processing and filtering to remove identifiable or harmful content. Despite these measures, vigilance is maintained to address potential privacy and ethical issues.

**Will older versions of the dataset continue to be supported/hosted/maintained? If so, please describe how. If not, please describe how its obsolescence will be communicated to dataset consumers.**
