library_name: transformers
license: llama3
language:
- th
- en
pipeline_tag: text-generation
Typhoon-Audio Preview
llama-3-typhoon-v1.5-8b-audio-preview is a 🇹🇭 Thai audio-language model. It natively accepts both text and audio as input and produces text as output. This version (August 2024) is our first audio-language model, released as a research preview as part of our multimodal effort. The base language model is our llama-3-typhoon-v1.5-8b-instruct.
More details can be found in our release blog and technical report.

*To acknowledge Meta's effort in creating the foundation model and to comply with the license, we explicitly include "llama-3" in the model name.
Model Description
- Model type: The LLM is based on Typhoon-1.5-8b-instruct, and the audio encoder is based on Whisper's encoder and BEATs.
- Requirement: transformers 4.38.0 or newer.
- Primary Language(s): Thai 🇹🇭 and English 🇬🇧
- Demo: https://audio.opentyphoon.ai/
- License: Llama 3 Community License
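The `transformers` requirement listed above can be checked before loading the model. The following is a minimal sketch, not part of the official release: the helper names `version_tuple` and `meets_minimum` are hypothetical, and the comparison only looks at the numeric release components, so a pre-release such as `4.38.0rc1` is treated the same as `4.38.0`.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def version_tuple(version_string: str) -> tuple:
    """Return the first three numeric release components, e.g. '4.39.0.dev0' -> (4, 39, 0)."""
    parts = []
    for piece in version_string.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at non-numeric segments such as 'dev0'
        parts.append(int(match.group()))
    parts = parts[:3]
    while len(parts) < 3:
        parts.append(0)  # pad short versions such as '4.38' to (4, 38, 0)
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "4.38.0") -> bool:
    """Check an installed version string against the minimum stated in this card."""
    return version_tuple(installed) >= version_tuple(minimum)


if __name__ == "__main__":
    try:
        installed = version("transformers")
        print(installed, "OK" if meets_minimum(installed) else "too old, need >= 4.38.0")
    except PackageNotFoundError:
        print("transformers is not installed")
```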
Usage Example
import torch
from transformers import AutoModel

# Initialize from the trained model (custom modeling code is loaded from the repo)
model = AutoModel.from_pretrained(
    "scb10x/llama-3-typhoon-v1.5-8b-audio-preview",
    torch_dtype=torch.float16,
    trust_remote_code=True
)
model.to("cuda")
model.eval()

# Run generation
prompt_pattern = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n<Speech><SpeechHere></Speech> {}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
response = model.generate(
    wav_path="path_to_your_audio.wav",
    prompt="transcribe this audio",
    prompt_pattern=prompt_pattern,
    do_sample=False,
    max_length=1200,
    repetition_penalty=1.1,
    num_beams=1,
    # temperature=0.4,
    # top_p=0.9,
    # streamer=streamer  # supports TextIteratorStreamer
)
print(response)
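To make the role of `prompt_pattern` in the example above more concrete, here is a small sketch of how the `{}` placeholder and the `<SpeechHere>` marker could be handled. This is an illustration under assumptions, not the model's actual remote code: the helper names `build_prompt` and `split_at_audio` are hypothetical, and splitting the text around `<SpeechHere>` so audio features can be placed between the two halves is an assumed, SALMONN-style convention.

```python
PROMPT_PATTERN = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "<Speech><SpeechHere></Speech> {}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)


def build_prompt(instruction: str, pattern: str = PROMPT_PATTERN) -> str:
    """Fill the single `{}` slot of the pattern with the text instruction."""
    return pattern.format(instruction.strip())


def split_at_audio(prompt_text: str, placeholder: str = "<SpeechHere>") -> tuple:
    """Split the prompt into the text before and after the audio placeholder.

    Assumption: the audio embeddings are inserted where the placeholder sits.
    """
    before, separator, after = prompt_text.partition(placeholder)
    if not separator:
        raise ValueError(f"placeholder {placeholder!r} not found in prompt")
    return before, after


if __name__ == "__main__":
    full_prompt = build_prompt("transcribe this audio")
    before, after = split_at_audio(full_prompt)
    print(repr(before))
    print(repr(after))
```

Keeping the pattern separate from the instruction means the same Llama 3 chat template can be reused while only the instruction text changes between calls.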
Evaluation Results
Model | ASR-en (WER↓) | ASR-th (WER↓) | En2Th (BLEU↑) | X2Th (BLEU↑) | Th2En (BLEU↑) |
---|---|---|---|---|---|
SALMONN-13B | 5.79 | 98.07 | 0.07 | 0.10 | 14.97 |
DiVA-8B | 30.28 | 65.21 | 9.82 | 5.31 | 7.97 |
Gemini-1.5-pro-001 | 5.98 | 13.56 | 20.69 | 13.52 | 22.54 |
Typhoon-Audio-Preview | 8.72 | 14.17 | 17.52 | 10.67 | 24.14 |
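For readers unfamiliar with the WER columns above, the sketch below shows the standard word error rate computation: the word-level edit distance between reference and hypothesis, divided by the number of reference words and expressed as a percentage. It is an illustration only. The text normalization and tokenization used for the reported scores are not stated here, and because Thai is written without spaces between words, ASR-th scoring would need a word segmenter rather than the whitespace split assumed in this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent using whitespace tokenization (an assumption)."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1      # reference word missing from hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current

    return 100.0 * previous[-1] / len(ref_words)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(round(word_error_rate("the cat sat on the mat", "the cat sat on mat"), 2))
```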
Model | Gender-th (Acc) | SpokenQA-th (F1) | SpeechInstruct-th |
---|---|---|---|
SALMONN-13B | 93.26 | 2.95 | 1.18 |
DiVA-8B | 50.12 | 15.13 | 2.68 |
Gemini-1.5-pro-001 | 81.32 | 62.10 | 3.93 |
Typhoon-Audio-Preview | 93.74 | 64.60 | 6.11 |
Intended Uses & Limitations
This model is a research preview. Although its base LLM is instruction-tuned, it may not always follow human instructions accurately. The model does not have any moderation mechanisms and may generate harmful or inappropriate responses.
Acknowledgements
We would like to thank the SALMONN team for open-sourcing their code and data, and the Biomedical and Data Lab at Mahidol University for releasing the fine-tuned Whisper model whose encoder we adopted. We also thank the many other open-source projects for sharing their knowledge, data, code, and model weights.
Typhoon Team
Potsawee Manakul, Sittipong Sripaisarnmongkol, Natapong Nitarach, Warit Sirichotedumrong, Adisai Na-Thalang, Phatrasek Jirabovonvisut, Parinthapat Pengpun, Krisanapong Jirayoot, Pathomporn Chokchainant, Kasima Tharnpipitchai, Kunat Pipatanakul