---
language: es
tags:
- QA
- Q&A
datasets:
- BSC-TeMU/SQAC
---
|
|
|
# Spanish Longformer fine-tuned on **SQAC** for Spanish **QA** 📖❓
|
[longformer-base-4096-spanish](https://huggingface.co/mrm8488/longformer-base-4096-spanish) fine-tuned on [SQAC](https://huggingface.co/datasets/BSC-TeMU/SQAC) for the **Q&A** downstream task.
|
|
|
## Details of the model 🧠
|
[longformer-base-4096-spanish](https://huggingface.co/mrm8488/longformer-base-4096-spanish) is a BERT-like model initialized from a RoBERTa checkpoint (**BERTIN** in this case) and pre-trained with an *MLM* objective on long documents from BETO's `all_wikis` corpus. It supports sequences of up to **4,096** tokens!
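Below is a minimal usage sketch with the 🤗 `transformers` question-answering pipeline. The repo id of this fine-tuned checkpoint is not stated in this card, so `MODEL_ID` is a placeholder assumption; replace it with the actual model id.

```python
from transformers import pipeline

# Placeholder repo id (assumption): replace with the actual id of this fine-tuned checkpoint.
MODEL_ID = "mrm8488/longformer-base-4096-spanish-finetuned-sqac"

# Extractive QA pipeline; long contexts (up to 4,096 tokens) are supported by the model.
qa = pipeline("question-answering", model=MODEL_ID, tokenizer=MODEL_ID)

context = (
    "La Biblioteca Nacional de España se encuentra en Madrid y fue fundada "
    "por Felipe V en 1712."
)
question = "¿Dónde se encuentra la Biblioteca Nacional de España?"

result = qa(question=question, context=context)
print(result["answer"], result["score"])
```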
|
|
|
## Details of the dataset 📚
|
|
|
This dataset contains 6,247 contexts and 18,817 questions with their answers, between 1 and 5 per fragment.
|
The sources of the contexts are:
|
* Encyclopedic articles from [Wikipedia in Spanish](https://es.wikipedia.org/), used under the [CC-by-sa licence](https://creativecommons.org/licenses/by-sa/3.0/legalcode).
|
* News from [Wikinews in Spanish](https://es.wikinews.org/), used under the [CC-by licence](https://creativecommons.org/licenses/by/2.5/).
|
* Text from the Spanish corpus [AnCora](http://clic.ub.edu/corpus/en), a mix of different newswire and literature sources, used under the [CC-by licence](https://creativecommons.org/licenses/by/4.0/legalcode).
|
This dataset can be used to build extractive QA systems.
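
As a rough sketch, the dataset can be loaded with 🤗 `datasets` as shown below; the SQuAD-style field names (`question`, `context`, `answers`) are an assumption, so check the dataset card to confirm them.

```python
from datasets import load_dataset

# Dataset id taken from this card.
sqac = load_dataset("BSC-TeMU/SQAC")

# Inspect one example; the SQuAD-style field names below are an assumption.
sample = sqac["train"][0]
print(sample["question"])
print(sample["context"][:200])
print(sample["answers"])
```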