Update README.md
README.md
CHANGED
@@ -590,8 +590,8 @@ especially if the content originates from less-regulated sources or user-generat
 **How was the data collected?**
 
 This dataset is constituted by combining several sources, whose acquisition methods can be classified into three groups:
-- Web-sourced datasets with some preprocessing available under permissive license
-- Domain-specific or language-specific raw crawls, always respecting robots.txt
+- Web-sourced datasets with some preprocessing available under permissive license.
+- Domain-specific or language-specific raw crawls, always respecting robots.txt.
 - Manually curated data obtained through collaborators, data providers (by means of legal assignment agreements) or open source projects
 (p.e. CATalog).
 
@@ -644,7 +644,7 @@ The original raw data was not kept.
 
 **Is the software that was used to preprocess/clean/label the data available? If so, please provide a link or other access point.**
 
-Yes, the preprocessing and filtering software is open-sourced. The [CURATE](https://github.com/langtech-bsc/CURATE) pipeline was used for
+Yes, the preprocessing and filtering software is open-sourced. The [CURATE](https://github.com/langtech-bsc/CURATE) pipeline was used for CATalog and other curated sources,
 and the [Ungoliant](https://github.com/oscar-project/ungoliant) pipeline was used for the OSCAR project.
 
 #### Uses