---
library_name: transformers
language:
  - da
license: openrail
base_model: chcaa/xls-r-300m-danish
datasets:
  - generator
metrics:
  - wer
  - cer
model-index:
  - name: roest-315m
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: CoRal read-aloud
          type: alexandrainst/coral
          split: test
          args: read_aloud
        metrics:
          - name: CER
            type: cer
            value: 6.9% ± 0.2%
          - name: WER
            type: wer
            value: 14.9% ± 0.4%
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Danish Common Voice 17
          type: mozilla-foundation/common_voice_17_0
          split: test
          args: da
        metrics:
          - name: CER
            type: cer
            value: 5.1% ± 0.6%
          - name: WER
            type: wer
            value: 13.2% ± 0.8%
pipeline_tag: automatic-speech-recognition
---

# Røst-315m

This is a state-of-the-art Danish speech recognition model, trained by [the Alexandra
Institute](https://alexandra.dk/).

## Quick Start

Start by installing the required libraries:

```shell
$ pip install transformers kenlm pyctcdecode
```

You can then use the model via the `transformers` Python package as follows:

```python
>>> from transformers import pipeline
>>> audio = get_audio()  # a placeholder for your own 16 kHz mono audio array
>>> transcriber = pipeline(model="alexandrainst/roest-315m")
>>> transcriber(audio)
{'text': 'your transcription'}
```
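
The pipeline expects raw mono audio sampled at 16 kHz. If your recording uses a
different sample rate, resample it first. Below is a minimal linear-interpolation
sketch; `resample_to_16khz` is a hypothetical helper written for illustration, and for
real use a dedicated resampler such as `librosa.resample` or
`torchaudio.transforms.Resample` is preferable:

```python
import numpy as np

def resample_to_16khz(audio: np.ndarray, orig_rate: int) -> np.ndarray:
    """Resample a mono signal to 16 kHz via simple linear interpolation."""
    target_rate = 16_000
    n_target = int(round(audio.shape[0] * target_rate / orig_rate))
    old_times = np.arange(audio.shape[0]) / orig_rate
    new_times = np.arange(n_target) / target_rate
    return np.interp(new_times, old_times, audio)

# One second of 44.1 kHz audio becomes 16,000 samples.
audio_44k = np.sin(np.linspace(0, 440 * 2 * np.pi, 44_100))
audio_16k = resample_to_16khz(audio_44k, 44_100)
```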

## Evaluation Results

We have evaluated both our own and existing models on the CoRal test set as well as the
Danish Common Voice 17 test set. To ensure as robust an evaluation as possible, we have
bootstrapped the results 1000 times and report here the mean scores along with a 95%
confidence interval (lower is better; best scores in **bold**, second-best in
*italics*):

| Model | Number of parameters | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER | [Danish Common Voice 17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/da/test) CER | [Danish Common Voice 17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/da/test) WER |
|:---|---:|---:|---:|---:|---:|
| Røst-315m (this model) | 315M | **6.9% ± 0.2%** | **14.9% ± 0.4%** | *5.1% ± 0.6%* | *13.2% ± 0.8%* |
| [chcaa/xls-r-300m-danish-nst-cv9](https://hf.co/chcaa/xls-r-300m-danish-nst-cv9) | 315M | *14.4% ± 0.3%* | *36.5% ± 0.6%* | **4.1% ± 0.5%** | **12.0% ± 0.8%** |
| [mhenrichsen/hviske](https://hf.co/mhenrichsen/hviske) | 1540M | 15.8% ± 0.7% | *36.5% ± 1.0%* | 5.3% ± 0.4% | 14.5% ± 0.8% |
| [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | 16.5% ± 1.3% | 36.8% ± 1.9% | 7.6% ± 0.6% | 18.3% ± 1.1% |
| [openai/whisper-large-v2](https://hf.co/openai/whisper-large-v2) | 1540M | 19.7% ± 1.8% | 42.2% ± 2.6% | 10.6% ± 1.6% | 23.3% ± 2.0% |
| [openai/whisper-large](https://hf.co/openai/whisper-large) | 1540M | 19.5% ± 1.3% | 42.4% ± 1.7% | 12.8% ± 0.8% | 28.3% ± 1.3% |
| [openai/whisper-medium](https://hf.co/openai/whisper-medium) | 764M | 21.5% ± 1.7% | 47.4% ± 2.6% | 13.3% ± 0.8% | 30.0% ± 1.3% |
| [openai/whisper-small](https://hf.co/openai/whisper-small) | 242M | 26.1% ± 1.2% | 57.9% ± 1.5% | 22.9% ± 4.3% | 49.3% ± 6.3% |
| [openai/whisper-base](https://hf.co/openai/whisper-base) | 73M | 50.8% ± 3.6% | 100.1% ± 5.6% | 43.1% ± 5.0% | 85.1% ± 7.9% |
| [openai/whisper-tiny](https://hf.co/openai/whisper-tiny) | 38M | 63.7% ± 3.9% | 120.3% ± 5.7% | 58.5% ± 5.8% | 106.4% ± 8.7% |
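
The bootstrapping procedure described above can be sketched as follows. This is only an
illustration over synthetic per-utterance error rates with made-up numbers, not the
actual evaluation code:

```python
import numpy as np

rng = np.random.default_rng(seed=4242)

# Hypothetical per-utterance word error rates for a 500-utterance test set.
per_sample_wer = rng.uniform(0.0, 0.4, size=500)

# Resample the test set with replacement 1000 times, recording the mean WER
# of each bootstrap sample.
boot_means = np.array([
    rng.choice(per_sample_wer, size=per_sample_wer.size, replace=True).mean()
    for _ in range(1000)
])

# Report the overall mean and a 95% confidence interval from the
# 2.5th and 97.5th percentiles of the bootstrap distribution.
mean_wer = boot_means.mean()
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"WER: {mean_wer:.1%} (95% CI: {lower:.1%} to {upper:.1%})")
```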

### Detailed Evaluation Across Demographics on the CoRal Test Set

| Dialect | CER | WER |
|:---|---:|---:|
| Københavnsk | 2.8% | 6.6% |
| Sjællandsk | 3.9% | 8.6% |
| Fynsk | 7.3% | 15.7% |
| Sønderjysk | 13.3% | 27.1% |
| Vestjysk | 11.1% | 24.9% |
| Østjysk | 3.2% | 7.6% |
| Nordjysk | 2.7% | 5.7% |
| Sydømål | 6.0% | 12.3% |
| Bornholmsk | 9.1% | 19.9% |
| Non-native | 7.3% | 16.7% |

| Gender | CER | WER |
|:---|---:|---:|
| Female | 8.0% | 17.1% |
| Male | 5.8% | 12.8% |

| Age group | CER | WER |
|:---|---:|---:|
| 0-25 | 5.5% | 12.1% |
| 25-50 | 5.9% | 13.3% |
| 50+ | 8.2% | 17.5% |

## Training Data

This model is the result of four different stages of training:

1. "Pretraining" on 436,000 hours of unlabelled multilingual publicly available data,
   13,628 hours of which is Danish. Pretraining here means that the model learnt to
   "fill in" gaps of raw audio - no transcriptions were used (or available) during
   this process. The pretraining data is distributed as follows:
   - 372,000 hours from [VoxPopuli](https://aclanthology.org/2021.acl-long.80/), being
     speeches from the European Parliament in 23 European languages. This includes
     13,600 hours of Danish speech.
   - 51,000 hours from [Multilingual
     LibriSpeech](https://doi.org/10.21437/Interspeech.2020-2826), being audiobooks in
     8 European languages. This does not include any Danish speech.
   - 7,000 hours from [Common Voice 6](https://doi.org/10.48550/arXiv.1912.06670),
     being read-aloud speech in 60 diverse languages. This does not include any Danish
     speech.
   - 6,600 hours from [VoxLingua107](https://doi.org/10.1109/SLT48900.2021.9383459),
     being audio from YouTube videos in 107 languages. This includes 28 hours of
     Danish speech.
   - 1,000 hours from [BABEL](https://eprints.whiterose.ac.uk/152840/), being
     conversational telephone speech in 17 African and Asian languages. This does not
     include any Danish speech.
2. Continued pretraining on 141,000 hours of Danish radio (more specifically, DR P1
   and Radio24Syv from 2005 to 2021).
3. "Finetuning" on 373 hours of labelled Danish publicly available data. "Finetuning"
   indicates that this stage of training was supervised, i.e. the model was trained on
   both audio and transcriptions to perform the speech-to-text task (also known as
   automatic speech recognition). The finetuning data is as follows:
   - The read-aloud training split of the [CoRal
     dataset](https://huggingface.co/datasets/alexandrainst/coral) (revision
     fb20199b3966d3373e0d3a5ded2c5920c70de99c), consisting of 361 hours of Danish
     read-aloud speech, diverse across dialects, accents, ages and genders.
   - The Danish training split of the [Common Voice 17
     dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0),
     consisting of 12 hours of Danish read-aloud speech.
4. An n-gram language model has been trained separately, and is used to guide the
   transcription generation of the finetuned speech recognition model. This n-gram
   language model has been trained on all of the [Danish
   Wikipedia](https://huggingface.co/datasets/alexandrainst/scandi-wiki/viewer/da)
   (approximately 287,000 articles).
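
For intuition on step 4: an n-gram language model assigns probabilities to word
sequences from simple counts over a text corpus. The real model is built with KenLM
over Danish Wikipedia; the toy bigram scorer below, with made-up training sentences,
just illustrates the mechanism:

```python
from collections import Counter

def train_bigram_counts(sentences):
    """Count unigrams and bigrams over whitespace-tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def bigram_score(sentence, unigrams, bigrams, vocab_size):
    """Product of add-one smoothed bigram probabilities P(w_i | w_{i-1})."""
    words = sentence.split()
    score = 1.0
    for prev, word in zip(words, words[1:]):
        score *= (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    return score

# A tiny made-up corpus standing in for Danish Wikipedia.
corpus = ["jeg bor i danmark", "jeg taler dansk", "vi bor i danmark"]
unigrams, bigrams = train_bigram_counts(corpus)
vocab = len(unigrams)

# A plausible word order scores higher than a scrambled one, which is exactly
# how the LM nudges the decoder towards likely transcriptions.
likely = bigram_score("jeg bor i danmark", unigrams, bigrams, vocab)
unlikely = bigram_score("danmark i bor jeg", unigrams, bigrams, vocab)
```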

The first step was carried out by [Babu et al.
(2021)](https://doi.org/10.48550/arXiv.2111.09296), the second step by [Hansen
(2022)](https://huggingface.co/chcaa/xls-r-300m-danish), and the third and fourth
steps by [Nielsen et al. (2024)](https://huggingface.co/alexandrainst/roest-315m).

The final product combines the finetuned model with the n-gram language model, and
this combination is what runs when you use the model as described in the Quick Start
section above.
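
At inference time, the acoustic model emits one character distribution per short audio
frame, and CTC decoding collapses these frame-level outputs into text. The actual
decoding uses `pyctcdecode` beam search guided by the n-gram model, but a greedy
collapse over hypothetical per-frame outputs shows the basic mechanics:

```python
BLANK = "_"  # stand-in symbol for the CTC blank token

def ctc_greedy_decode(frame_chars: list[str]) -> str:
    """Collapse consecutive repeats, then drop blanks (the standard CTC rule)."""
    collapsed = []
    prev = None
    for char in frame_chars:
        if char != prev:  # keep only the first of each run of repeats
            collapsed.append(char)
        prev = char
    return "".join(char for char in collapsed if char != BLANK)

# Hypothetical per-frame argmax outputs spelling out "røst": repeats arise when
# one sound spans several frames, and blanks separate distinct characters.
frames = ["r", "r", "_", "ø", "ø", "s", "s", "_", "t", "t"]
decoded = ctc_greedy_decode(frames)
print(decoded)  # røst
```

During beam search, the language model re-scores the candidate collapses, which is
where the Wikipedia-trained n-gram model comes in.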

## Intended Use Cases

This model is intended to be used for Danish automatic speech recognition.

Note that biometric identification is not allowed using the CoRal dataset and/or
models derived from it. For more information, see addition 4 in our
[license](https://huggingface.co/datasets/alexandrainst/roest-315m/blob/main/LICENSE).

## Why the name Røst?

Røst is both the [Danish word for the human
voice](https://ordnet.dk/ddo/ordbog?query=r%C3%B8st) and the name of [one of the
cold-water coral reefs in
Scandinavia](https://da.wikipedia.org/wiki/Koralrev#Koldtvandskoralrev).

## License

The model is licensed under a custom license, adapted from OpenRAIL-M, which allows
commercial use with a few restrictions (speech synthesis and biometric identification
are not allowed). See the
[license](https://huggingface.co/datasets/alexandrainst/roest-315m/blob/main/LICENSE).

## Creators and Funders

The CoRal project is funded by the [Danish Innovation
Fund](https://innovationsfonden.dk/) and consists of the following partners:

- [Alexandra Institute](https://alexandra.dk/)
- [University of Copenhagen](https://www.ku.dk/)
- [Agency for Digital Government](https://digst.dk/)
- [Alvenir](https://www.alvenir.ai/)
- [Corti](https://www.corti.ai/)

## Citation

We will submit a research paper soon, but until then, if you use this model in your
research or development, please cite it as follows:

```bibtex
@dataset{coral2024,
  author = {Dan Saattrup Nielsen and Sif Bernstorff Lehmann and Simon Leminen Madsen and Anders Jess Pedersen and Anna Katrine van Zee and Anders Søgaard and Torben Blach},
  title  = {CoRal: A Diverse Danish ASR Dataset Covering Dialects, Accents, Genders, and Age Groups},
  year   = {2024},
  url    = {https://hf.co/datasets/alexandrainst/coral},
}
```