---
library_name: transformers
license: llama3
language:
- th
- en
pipeline_tag: text-generation
---
# Typhoon-Audio Preview
**llama-3-typhoon-v1.5-8b-audio-preview** is a 🇹🇭 Thai *audio-language* model. It natively supports both text and audio inputs, while its output is text. This version (August 2024) is our first audio-language model as part of our multimodal effort, and it is a research *preview* release. The base language model is our [llama-3-typhoon-v1.5-8b-instruct](https://huggingface.co/scb10x/llama-3-typhoon-v1.5-8b-instruct).

More details can be found in our [release blog](https://blog.opentyphoon.ai/typhoon-audio-preview-release-6fbb3f938287) and [technical report](). *To acknowledge Meta's effort in creating the foundation model and to comply with the license, we explicitly include "llama-3" in the model name.*
## Model Description
- **Model type**: The LLM is based on Typhoon-1.5-8b-instruct, with an audio encoder built from Whisper's encoder and BEATs (see the sketch after this list).
- **Requirement**: transformers 4.38.0 or newer.
- **Primary Language(s)**: Thai 🇹🇭 and English 🇬🇧
- **Demo**: https://audio.opentyphoon.ai/
- **License**: [Llama 3 Community License](https://llama.meta.com/llama3/license/)
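
To make the components above concrete, here is a minimal, illustrative sketch of a SALMONN-style audio connector (the project acknowledges SALMONN below): features from the Whisper encoder and BEATs are fused and projected into the LLM's embedding space. Every class name and dimension here is an assumption for illustration only; the model's actual implementation ships with the checkpoint and runs via `trust_remote_code=True`.

```python
import torch
import torch.nn as nn

class AudioToLLMConnector(nn.Module):
    """Illustrative (hypothetical) connector: fuse Whisper-encoder and
    BEATs features, then project them into the LLM embedding space."""

    def __init__(self, whisper_dim=1280, beats_dim=768, llm_dim=4096):
        super().__init__()
        # A single linear projection for illustration; the real model may
        # use a Q-Former or another learned adapter instead.
        self.proj = nn.Linear(whisper_dim + beats_dim, llm_dim)

    def forward(self, whisper_feats, beats_feats):
        # whisper_feats: (batch, frames, whisper_dim)
        # beats_feats:   (batch, frames, beats_dim) -- assumed time-aligned
        fused = torch.cat([whisper_feats, beats_feats], dim=-1)
        return self.proj(fused)  # (batch, frames, llm_dim), fed to the LLM

connector = AudioToLLMConnector()
audio_embeds = connector(torch.randn(1, 100, 1280), torch.randn(1, 100, 768))
print(audio_embeds.shape)  # torch.Size([1, 100, 4096])
```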
## Usage Example
```python
import torch
from transformers import AutoModel

# Initialize from the trained model (custom code ships with the checkpoint)
model = AutoModel.from_pretrained(
    "scb10x/llama-3-typhoon-v1.5-8b-audio-preview",
    torch_dtype=torch.float16,
    trust_remote_code=True
)
model.to("cuda")
model.eval()

# Llama-3 chat template; <Speech><SpeechHere></Speech> marks where the audio
# embeddings are spliced in, and {} is filled with the text prompt.
prompt_pattern = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n<Speech><SpeechHere></Speech> {}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

# Run generation
response = model.generate(
    wav_path="path_to_your_audio.wav",
    prompt="transcribe this audio",
    prompt_pattern=prompt_pattern,
    do_sample=False,
    max_length=1200,
    repetition_penalty=1.1,
    num_beams=1,
    # temperature=0.4,
    # top_p=0.9,
    # streamer=streamer  # supports TextIteratorStreamer
)
print(response)
```
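
Since `model.generate` accepts a `streamer` (see the commented argument above), tokens can be printed as they are produced. A minimal sketch, assuming the checkpoint ships a compatible tokenizer and that `generate` forwards `streamer` to the underlying LLM:

```python
from threading import Thread
from transformers import AutoTokenizer, TextIteratorStreamer

# Assumption: the repo's tokenizer is loadable the same way as the model.
tokenizer = AutoTokenizer.from_pretrained(
    "scb10x/llama-3-typhoon-v1.5-8b-audio-preview", trust_remote_code=True
)
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

generation_kwargs = dict(
    wav_path="path_to_your_audio.wav",
    prompt="transcribe this audio",
    prompt_pattern=prompt_pattern,
    do_sample=False,
    max_length=1200,
    streamer=streamer,
)
# Run generation in a background thread and consume tokens as they arrive.
Thread(target=model.generate, kwargs=generation_kwargs).start()
for new_text in streamer:
    print(new_text, end="", flush=True)
```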
## Evaluation Results
| Model | ASR-en (WER↓) | ASR-th (WER↓) | En2Th (BLEU↑) | X2Th (BLEU↑) | Th2En (BLEU↑) |
|:----------------------|:--------------|:--------------|:--------------|:-------------|:--------------|
| SALMONN-13B           | 5.79          | 98.07         | 0.07          | 0.10         | 14.97         |
| DiVA-8B               | 30.28         | 65.21         | 9.82          | 5.31         | 7.97          |
| Gemini-1.5-pro-001    | 5.98          | 13.56         | 20.69         | 13.52        | 22.54         |
| Typhoon-Audio-Preview | 8.72          | 14.17         | 17.52         | 10.67        | 24.14         |

| Model | Gender-th (Acc↑) | SpokenQA-th (F1↑) | SpeechInstruct-th |
|:----------------------|:-----------------|:------------------|:------------------|
| SALMONN-13B           | 93.26            | 2.95              | 1.18              |
| DiVA-8B               | 50.12            | 15.13             | 2.68              |
| Gemini-1.5-pro-001    | 81.32            | 62.10             | 3.93              |
| Typhoon-Audio-Preview | 93.74            | 64.60             | 6.11              |
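
For reference, the WER columns measure word-level edit distance between the model transcript and a reference transcript. A quick way to compute the same metric on your own data is the `jiwer` package; this is a sketch, and the exact text normalization behind the table may differ. Note also that Thai has no whitespace word boundaries, so Thai WER depends on the word segmenter used.

```python
import jiwer  # pip install jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# WER = (substitutions + deletions + insertions) / reference word count
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")  # 2 edits / 9 words = 22.22%
```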
## Intended Uses & Limitations
This model is a research preview: it may not always follow human instructions accurately and can hallucinate. It also has no moderation mechanisms and may generate harmful or inappropriate responses.
## Follow us & Support
- https://twitter.com/opentyphoon
- https://discord.gg/CqyBscMFpg
## Acknowledgements
We would like to thank the SALMONN team for open-sourcing their code and data, and the Biomedical and Data Lab at Mahidol University for releasing the fine-tuned Whisper whose encoder we adopted. We also thank the many other open-source projects that shared useful knowledge, data, code, and model weights.
## Typhoon Team
*Potsawee Manakul*, Sittipong Sripaisarnmongkol, Natapong Nitarach, Warit Sirichotedumrong, Adisai Na-Thalang, Phatrasek Jirabovonvisut, Parinthapat Pengpun,
Krisanapong Jirayoot, Pathomporn Chokchainant, Kasima Tharnpipitchai, *Kunat Pipatanakul*