---
library_name: transformers
license: llama3
language:
- th
- en
pipeline_tag: text-generation
---

# Typhoon-Audio Preview

**llama-3-typhoon-v1.5-8b-audio-preview** is a 🇹🇭 Thai *audio-language* model. It natively accepts both text and audio as input and produces text as output. This version (August 2024) is our first audio-language model, released as part of our multimodal effort, and it is a research *preview*. The base language model is our [llama-3-typhoon-v1.5-8b-instruct](https://huggingface.co/scb10x/llama-3-typhoon-v1.5-8b-instruct).

More details can be found in our [release blog](https://blog.opentyphoon.ai/typhoon-audio-preview-release-6fbb3f938287) and [technical report]().

*To acknowledge Meta's effort in creating the foundation model and to comply with the license, we explicitly include "llama-3" in the model name.*

## Model Description

- **Model type**: The LLM is based on Typhoon-1.5-8b-instruct, and the audio encoder is based on Whisper's encoder and BEATs.
- **Requirement**: transformers 4.38.0 or newer.
- **Primary Language(s)**: Thai 🇹🇭 and English 🇬🇧
- **Demo**: https://audio.opentyphoon.ai/
- **License**: [Llama 3 Community License](https://llama.meta.com/llama3/license/)
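
The version requirement listed above can be checked before loading the model. The following is a minimal sketch, not part of the model's own tooling: `parse_release` and `meets_requirement` are hypothetical helper names, and the parser only looks at the leading numeric `X.Y.Z` components, so pre-release suffixes such as `.dev0` or `rc1` are ignored.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum transformers release stated in the model description.
MINIMUM = (4, 38, 0)


def parse_release(v: str) -> tuple:
    """Hypothetical helper: return the leading numeric (major, minor, patch) of a version string."""
    parts = []
    for piece in v.split(".")[:3]:
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break  # stop at suffixes such as "rc1"
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)  # treat "4.38" as "4.38.0"
    return tuple(parts)


def meets_requirement(installed: str, minimum: tuple = MINIMUM) -> bool:
    """Hypothetical helper: True if the installed release is at least the minimum."""
    return parse_release(installed) >= minimum


if __name__ == "__main__":
    try:
        installed = version("transformers")
        status = "OK" if meets_requirement(installed) else "too old, please upgrade"
        print(f"transformers {installed}: {status}")
    except PackageNotFoundError:
        print("transformers is not installed")
```
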

## Usage Example

```python
import torch
from transformers import AutoModel

# Initialize from the trained model.
# trust_remote_code=True is needed because the model class ships with the repository.
model = AutoModel.from_pretrained(
    "scb10x/llama-3-typhoon-v1.5-8b-audio-preview",
    torch_dtype=torch.float16,
    trust_remote_code=True
)
model.to("cuda")
model.eval()

# Run generation
prompt_pattern = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n<Speech><SpeechHere></Speech> {}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
response = model.generate(
    wav_path="path_to_your_audio.wav",
    prompt="transcribe this audio",
    prompt_pattern=prompt_pattern,
    do_sample=False,
    max_length=1200,
    repetition_penalty=1.1,
    num_beams=1,
    # temperature=0.4,
    # top_p=0.9,
    # streamer=streamer  # supports TextIteratorStreamer
)
print(response)
```
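
To make the role of `prompt_pattern` more concrete, the sketch below shows how the text instruction could end up inside the Llama 3 chat template. It rests on two assumptions that this card does not confirm: that the `{}` placeholder is filled with Python's `str.format`, and that `<SpeechHere>` marks the position where the audio features are placed. `build_prompt` is a hypothetical helper for illustration, not part of the model's API.

```python
# Same pattern as in the usage example, split across lines for readability.
PROMPT_PATTERN = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "<Speech><SpeechHere></Speech> {}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)


def build_prompt(instruction: str, pattern: str = PROMPT_PATTERN) -> str:
    """Hypothetical helper: fill the `{}` placeholder with the text instruction."""
    return pattern.format(instruction.strip())


text = build_prompt("transcribe this audio")

# Assumption: the text on each side of <SpeechHere> surrounds the audio features.
before_audio, after_audio = text.split("<SpeechHere>")
print(repr(before_audio))
print(repr(after_audio))
```

Changing only the `prompt` argument (for example, to a translation or question-answering instruction) leaves the surrounding template untouched under these assumptions.
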

## Evaluation Results

| Model                 | ASR-en (WER↓) | ASR-th (WER↓) | En2Th (BLEU↑) | X2Th (BLEU↑) | Th2En (BLEU↑) |
|:----------------------|:--------------|:--------------|:--------------|:-------------|:--------------|
| SALMONN-13B           | 5.79          | 98.07         | 0.07          | 0.10         | 14.97         |
| DiVA-8B               | 30.28         | 65.21         | 9.82          | 5.31         | 7.97          |
| Gemini-1.5-pro-001    | 5.98          | 13.56         | 20.69         | 13.52        | 22.54         |
| Typhoon-Audio-Preview | 8.72          | 14.17         | 17.52         | 10.67        | 24.14         |

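
For readers unfamiliar with the WER columns, word error rate is the word-level edit distance (substitutions, deletions, insertions) between a hypothesis and a reference transcript, divided by the number of reference words. The sketch below is a generic illustration and not the evaluation script behind the table. It assumes whitespace-separated words; Thai text is not written with spaces between words, so it would first need a word segmenter, and this card does not say which one was used. The table values appear to be percentages, so the example multiplies by 100.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{100 * wer:.2f}")  # prints 33.33
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many extra words.
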
| Model                 | Gender-th (Acc) | SpokenQA-th (F1) | SpeechInstruct-th |
|:----------------------|:----------------|:-----------------|:------------------|
| SALMONN-13B           | 93.26           | 2.95             | 1.18              |
| DiVA-8B               | 50.12           | 15.13            | 2.68              |
| Gemini-1.5-pro-001    | 81.32           | 62.10            | 3.93              |
| Typhoon-Audio-Preview | 93.74           | 64.60            | 6.11              |

## Intended Uses & Limitations

This model is a pretrained base model. Thus, it may not be able to follow human instructions without one-shot or few-shot learning or instruction fine-tuning. The model does not have any moderation mechanisms and may generate harmful or inappropriate responses.

## Follow us & Support

- https://twitter.com/opentyphoon
- https://discord.gg/CqyBscMFpg

## Acknowledgements

We would like to thank the SALMONN team for open-sourcing their code and data, and the Biomedical and Data Lab at Mahidol University for releasing the fine-tuned Whisper model whose encoder we adopted. We also thank the many other open-source projects for sharing their knowledge, data, code, and model weights.

## Typhoon Team

*Potsawee Manakul*, Sittipong Sripaisarnmongkol, Natapong Nitarach, Warit Sirichotedumrong, Adisai Na-Thalang, Phatrasek Jirabovonvisut, Parinthapat Pengpun,
Krisanapong Jirayoot, Pathomporn Chokchainant, Kasima Tharnpipitchai, *Kunat Pipatanakul*