---
library_name: transformers
language:
  - da
license: openrail
base_model: chcaa/xls-r-300m-danish
datasets:
  - generator
metrics:
  - wer
  - cer
model-index:
  - name: roest-315m
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: CoRal read-aloud
          type: alexandrainst/coral
          split: test
          args: read_aloud
        metrics:
          - name: CER
            type: cer
            value: 6.9% ± 0.2%
          - name: WER
            type: wer
            value: 14.9% ± 0.4%
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Danish Common Voice 17
          type: mozilla-foundation/common_voice_17_0
          split: test
          args: da
        metrics:
          - name: CER
            type: cer
            value: 5.1% ± 0.6%
          - name: WER
            type: wer
            value: 13.2% ± 0.8%
pipeline_tag: automatic-speech-recognition
---

# Røst-315m

This is a state-of-the-art Danish speech recognition model, trained by [the Alexandra
Institute](https://alexandra.dk/).

## Quick Start

Start by installing the required libraries:

```shell
$ pip install transformers kenlm pyctcdecode
```

You can then use the model via the `transformers` Python package as follows:

```python
>>> from transformers import pipeline
>>> audio = get_audio()  # a placeholder for your own 16 kHz mono audio array
>>> transcriber = pipeline(model="alexandrainst/roest-315m")
>>> transcriber(audio)
{'text': 'your transcription'}
```
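
The pipeline expects raw mono audio sampled at 16 kHz. If your recording uses a
different sample rate, resample it first. Below is a minimal linear-interpolation
sketch; `resample_to_16khz` is a hypothetical helper written for illustration, and for
real use a dedicated resampler such as `librosa.resample` or
`torchaudio.transforms.Resample` is preferable:

```python
import numpy as np

def resample_to_16khz(audio: np.ndarray, orig_rate: int) -> np.ndarray:
    """Resample a mono signal to 16 kHz via simple linear interpolation."""
    target_rate = 16_000
    n_target = int(round(audio.shape[0] * target_rate / orig_rate))
    old_times = np.arange(audio.shape[0]) / orig_rate
    new_times = np.arange(n_target) / target_rate
    return np.interp(new_times, old_times, audio)

# One second of 44.1 kHz audio becomes 16,000 samples.
audio_44k = np.sin(np.linspace(0, 440 * 2 * np.pi, 44_100))
audio_16k = resample_to_16khz(audio_44k, 44_100)
```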

## Evaluation Results

We have evaluated both our own and existing models on the CoRal test set as well as the
Danish Common Voice 17 test set. To ensure as robust an evaluation as possible, we have
bootstrapped the results 1000 times and report here the mean scores along with a 95%
confidence interval (lower is better; best scores in **bold**, second-best in
*italics*):

| Model | Number of parameters | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER | [Danish Common Voice 17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/da/test) CER | [Danish Common Voice 17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/da/test) WER |
|:---|---:|---:|---:|---:|---:|
| Røst-315m (this model) | 315M | **6.9% ± 0.2%** | **14.9% ± 0.4%** | *5.1% ± 0.6%* | *13.2% ± 0.8%* |
| [chcaa/xls-r-300m-danish-nst-cv9](https://hf.co/chcaa/xls-r-300m-danish-nst-cv9) | 315M | *14.4% ± 0.3%* | *36.5% ± 0.6%* | **4.1% ± 0.5%** | **12.0% ± 0.8%** |
| [mhenrichsen/hviske](https://hf.co/mhenrichsen/hviske) | 1540M | 15.8% ± 0.7% | *36.5% ± 1.0%* | 5.3% ± 0.4% | 14.5% ± 0.8% |
| [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | 16.5% ± 1.3% | 36.8% ± 1.9% | 7.6% ± 0.6% | 18.3% ± 1.1% |
| [openai/whisper-large-v2](https://hf.co/openai/whisper-large-v2) | 1540M | 19.7% ± 1.8% | 42.2% ± 2.6% | 10.6% ± 1.6% | 23.3% ± 2.0% |
| [openai/whisper-large](https://hf.co/openai/whisper-large) | 1540M | 19.5% ± 1.3% | 42.4% ± 1.7% | 12.8% ± 0.8% | 28.3% ± 1.3% |
| [openai/whisper-medium](https://hf.co/openai/whisper-medium) | 764M | 21.5% ± 1.7% | 47.4% ± 2.6% | 13.3% ± 0.8% | 30.0% ± 1.3% |
| [openai/whisper-small](https://hf.co/openai/whisper-small) | 242M | 26.1% ± 1.2% | 57.9% ± 1.5% | 22.9% ± 4.3% | 49.3% ± 6.3% |
| [openai/whisper-base](https://hf.co/openai/whisper-base) | 73M | 50.8% ± 3.6% | 100.1% ± 5.6% | 43.1% ± 5.0% | 85.1% ± 7.9% |
| [openai/whisper-tiny](https://hf.co/openai/whisper-tiny) | 38M | 63.7% ± 3.9% | 120.3% ± 5.7% | 58.5% ± 5.8% | 106.4% ± 8.7% |
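
The bootstrapping procedure described above can be sketched as follows. This is only an
illustration over synthetic per-utterance error rates with made-up numbers, not the
actual evaluation code:

```python
import numpy as np

rng = np.random.default_rng(seed=4242)

# Hypothetical per-utterance word error rates for a 500-utterance test set.
per_sample_wer = rng.uniform(0.0, 0.4, size=500)

# Resample the test set with replacement 1000 times, recording the mean WER
# of each bootstrap sample.
boot_means = np.array([
    rng.choice(per_sample_wer, size=per_sample_wer.size, replace=True).mean()
    for _ in range(1000)
])

# Report the overall mean and a 95% confidence interval from the
# 2.5th and 97.5th percentiles of the bootstrap distribution.
mean_wer = boot_means.mean()
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"WER: {mean_wer:.1%} (95% CI: {lower:.1%} to {upper:.1%})")
```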

### Detailed Evaluation Across Demographics on the CoRal Test Set

| Dialect | CER | WER |
|:---|---:|---:|
| Københavnsk | 2.8% | 6.6% |
| Sjællandsk | 3.9% | 8.6% |
| Fynsk | 7.3% | 15.7% |
| Sønderjysk | 13.3% | 27.1% |
| Vestjysk | 11.1% | 24.9% |
| Østjysk | 3.2% | 7.6% |
| Nordjysk | 2.7% | 5.7% |
| Sydømål | 6.0% | 12.3% |
| Bornholmsk | 9.1% | 19.9% |
| Non-native | 7.3% | 16.7% |

| Gender | CER | WER |
|:---|---:|---:|
| Female | 8.0% | 17.1% |
| Male | 5.8% | 12.8% |

| Age group | CER | WER |
|:---|---:|---:|
| 0-25 | 5.5% | 12.1% |
| 25-50 | 5.9% | 13.3% |
| 50+ | 8.2% | 17.5% |

## Training Data

This model is the result of four different stages of training:

1. "Pretraining" on 436,000 hours of unlabelled multilingual publicly available data,
   13,628 hours of which is Danish. Pretraining here means that the model learnt to
   "fill in" gaps of raw audio - no transcriptions were used (or available) during
   this process. The pretraining data is distributed as follows:
   - 372,000 hours from [VoxPopuli](https://aclanthology.org/2021.acl-long.80/), being
     speeches from the European Parliament in 23 European languages. This includes
     13,600 hours of Danish speech.
   - 51,000 hours from [Multilingual
     LibriSpeech](https://doi.org/10.21437/Interspeech.2020-2826), being audiobooks in
     8 European languages. This does not include any Danish speech.
   - 7,000 hours from [Common Voice 6](https://doi.org/10.48550/arXiv.1912.06670),
     being read-aloud speech in 60 diverse languages. This does not include any Danish
     speech.
   - 6,600 hours from [VoxLingua107](https://doi.org/10.1109/SLT48900.2021.9383459),
     being audio from YouTube videos in 107 languages. This includes 28 hours of
     Danish speech.
   - 1,000 hours from [BABEL](https://eprints.whiterose.ac.uk/152840/), being
     conversational telephone speech in 17 African and Asian languages. This does not
     include any Danish speech.
2. Continued pretraining on 141,000 hours of Danish radio (more specifically, DR P1
   and Radio24Syv from 2005 to 2021).
3. "Finetuning" on 373 hours of labelled Danish publicly available data. "Finetuning"
   indicates that this stage of training was supervised, i.e. the model was trained on
   both audio and transcriptions to perform the speech-to-text task (also known as
   automatic speech recognition). The finetuning data is as follows:
   - The read-aloud training split of the [CoRal
     dataset](https://huggingface.co/datasets/alexandrainst/coral) (revision
     fb20199b3966d3373e0d3a5ded2c5920c70de99c), consisting of 361 hours of Danish
     read-aloud speech, diverse across dialects, accents, ages and genders.
   - The Danish training split of the [Common Voice 17
     dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0),
     consisting of 12 hours of Danish read-aloud speech.
4. An n-gram language model has been trained separately, and is used to guide the
   transcription generation of the finetuned speech recognition model. This n-gram
   language model has been trained on all of the [Danish
   Wikipedia](https://huggingface.co/datasets/alexandrainst/scandi-wiki/viewer/da)
   (approximately 287,000 articles).
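
For intuition on step 4: an n-gram language model assigns probabilities to word
sequences from simple counts over a text corpus. The real model is built with KenLM
over Danish Wikipedia; the toy bigram scorer below, with made-up training sentences,
just illustrates the mechanism:

```python
from collections import Counter

def train_bigram_counts(sentences):
    """Count unigrams and bigrams over whitespace-tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def bigram_score(sentence, unigrams, bigrams, vocab_size):
    """Product of add-one smoothed bigram probabilities P(w_i | w_{i-1})."""
    words = sentence.split()
    score = 1.0
    for prev, word in zip(words, words[1:]):
        score *= (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    return score

# A tiny made-up corpus standing in for Danish Wikipedia.
corpus = ["jeg bor i danmark", "jeg taler dansk", "vi bor i danmark"]
unigrams, bigrams = train_bigram_counts(corpus)
vocab = len(unigrams)

# A plausible word order scores higher than a scrambled one, which is exactly
# how the LM nudges the decoder towards likely transcriptions.
likely = bigram_score("jeg bor i danmark", unigrams, bigrams, vocab)
unlikely = bigram_score("danmark i bor jeg", unigrams, bigrams, vocab)
```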

The first step was carried out by [Babu et al.
(2021)](https://doi.org/10.48550/arXiv.2111.09296), the second step by [Hansen
(2022)](https://huggingface.co/chcaa/xls-r-300m-danish), and the third and fourth
steps by [Nielsen et al. (2024)](https://huggingface.co/alexandrainst/roest-315m).

The final product combines the finetuned model with the n-gram language model, and
this combination is what runs when you use the model as described in the Quick Start
section above.
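
At inference time, the acoustic model emits one character distribution per short audio
frame, and CTC decoding collapses these frame-level outputs into text. The actual
decoding uses `pyctcdecode` beam search guided by the n-gram model, but a greedy
collapse over hypothetical per-frame outputs shows the basic mechanics:

```python
BLANK = "_"  # stand-in symbol for the CTC blank token

def ctc_greedy_decode(frame_chars: list[str]) -> str:
    """Collapse consecutive repeats, then drop blanks (the standard CTC rule)."""
    collapsed = []
    prev = None
    for char in frame_chars:
        if char != prev:  # keep only the first of each run of repeats
            collapsed.append(char)
        prev = char
    return "".join(char for char in collapsed if char != BLANK)

# Hypothetical per-frame argmax outputs spelling out "røst": repeats arise when
# one sound spans several frames, and blanks separate distinct characters.
frames = ["r", "r", "_", "ø", "ø", "s", "s", "_", "t", "t"]
decoded = ctc_greedy_decode(frames)
print(decoded)  # røst
```

During beam search, the language model re-scores the candidate collapses, which is
where the Wikipedia-trained n-gram model comes in.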

## Intended Use Cases

This model is intended to be used for Danish automatic speech recognition.

Note that biometric identification is not allowed using the CoRal dataset and/or
models derived from it. For more information, see addition 4 in our
[license](https://huggingface.co/datasets/alexandrainst/roest-315m/blob/main/LICENSE).

## Why the name Røst?

Røst is both the [Danish word for the human
voice](https://ordnet.dk/ddo/ordbog?query=r%C3%B8st) and the name of [one of the
cold-water coral reefs in
Scandinavia](https://da.wikipedia.org/wiki/Koralrev#Koldtvandskoralrev).

## License

The model is licensed under a custom license, adapted from OpenRAIL-M, which allows
commercial use with a few restrictions (speech synthesis and biometric identification
are not allowed). See the
[license](https://huggingface.co/datasets/alexandrainst/roest-315m/blob/main/LICENSE).

## Creators and Funders

The CoRal project is funded by the [Danish Innovation
Fund](https://innovationsfonden.dk/) and consists of the following partners:

- [Alexandra Institute](https://alexandra.dk/)
- [University of Copenhagen](https://www.ku.dk/)
- [Agency for Digital Government](https://digst.dk/)
- [Alvenir](https://www.alvenir.ai/)
- [Corti](https://www.corti.ai/)

## Citation

We will submit a research paper soon, but until then, if you use this model in your
research or development, please cite it as follows:

```bibtex
@dataset{coral2024,
  author = {Dan Saattrup Nielsen and Sif Bernstorff Lehmann and Simon Leminen Madsen and Anders Jess Pedersen and Anna Katrine van Zee and Anders Søgaard and Torben Blach},
  title  = {CoRal: A Diverse Danish ASR Dataset Covering Dialects, Accents, Genders, and Age Groups},
  year   = {2024},
  url    = {https://hf.co/datasets/alexandrainst/coral},
}
```