Papers
arxiv:2109.07765
Spanish Biomedical Crawled Corpus: A Large, Diverse Dataset for Spanish Biomedical Language Models
Published on Sep 16, 2021
Authors:
Abstract
We introduce CoWeSe (the Corpus Web Salud Espa\~nol), the largest Spanish biomedical corpus to date, consisting of 4.5GB (about 750M tokens) of clean plain text. CoWeSe is the result of a massive crawler on 3000 Spanish domains executed in 2020. The corpus is openly available and already preprocessed. CoWeSe is an important resource for biomedical and health NLP in Spanish and has already been employed to train domain-specific language models and to produce word embbedings. We released the CoWeSe corpus under a Creative Commons Attribution 4.0 International license, both in Zenodo (https://zenodo.org/record/4561971\#.YTI5SnVKiEA).
Models citing this paper 4
Datasets citing this paper 0
No dataset linking this paper
Cite arxiv.org/abs/2109.07765 in a dataset README.md to link it from this page.
Spaces citing this paper 0
No Space linking this paper
Cite arxiv.org/abs/2109.07765 in a Space README.md to link it from this page.
Collections including this paper 0
No Collection including this paper
Add this paper to a
collection
to link it from this page.