|
--- |
|
tags: |
|
- espnet |
|
- audio |
|
- automatic-speech-recognition |
|
- speech-translation |
|
language: multilingual |
|
datasets: |
|
- owsm_v3.1 |
|
license: cc-by-4.0 |
|
--- |
|
|
|
## OWLS: Open Whisper-style Large-scale neural model Suite |
|
|
|
[Paper](https://arxiv.org/abs/2502.10373) |
|
|
|
OWLS is a suite of Whisper-style models, designed to help researchers understand the scaling properties of speech models. |
|
OWLS models range from 0.25B to 18B parameters and are trained on up to 360K hours of data.
|
|
|
OWLS models are developed using [ESPnet](https://github.com/espnet/espnet), and support multilingual Speech Recognition and Translation. |
|
|
|
It is part of the [OWSM](https://www.wavlab.org/activities/2024/owsm/) project, which aims to develop fully open speech foundation models using publicly available data and open-source toolkits. |
|
|
|
The model in this repo has 1.12B parameters in total and is trained on 180K hours of public speech data.
|
Specifically, it supports the following speech-to-text tasks: |
|
- Speech recognition |
|
- Any-to-any-language speech translation |
|
- Utterance-level alignment |
|
- Long-form transcription |
|
- Language identification |
|
|
|
## Use this model |
|
|
|
You can use this model in your projects with the following code: |
|
|
|
```python |
|
# make sure espnet is installed: pip install espnet
import soundfile

from espnet2.bin.s2t_inference import Speech2Text

model = Speech2Text.from_pretrained("espnet/owls_1B_180K")

speech, rate = soundfile.read("speech.wav")
text, *_ = model(speech)[0]
|
``` |
|
|
|
|
|
## Citations |
|
|
|
``` |
|
@article{chen2025owls, |
|
title={OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models}, |
|
author={Chen, William and Tian, Jinchuan and Peng, Yifan and Yan, Brian and Yang, Chao-Han Huck and Watanabe, Shinji}, |
|
journal={arXiv preprint arXiv:2502.10373}, |
|
year={2025} |
|
} |
|
|
|
``` |
|
|
|
|
|
|