metadata

library_name: transformers
license: llama3
language:
  - th
  - en
pipeline_tag: text-generation

Typhoon-Audio Preview

llama-3-typhoon-v1.5-8b-audio-preview is a 🇹🇭 Thai audio-language model. It supports both text and audio input modalities natively while the output is text. This version (August 2024) is our first audio-language model as a part of our multimodal effort, and it is a research preview version. The base language model is our llama-3-typhoon-v1.5-8b-instruct.

More details can be found in our release blog and technical report. *To acknowledge Meta's effort in creating the foundation model and to comply with the license, we explicitly include "llama-3" in the model name.

Model Description

Model type: The LLM is based on Typhoon-1.5-8b-instruct, and the audio encoder is based on Whisper's encoder and BEATs.
Requirement: transformers 4.38.0 or newer.
Primary Language(s): Thai 🇹🇭 and English 🇬🇧
Demo: https://audio.opentyphoon.ai/
License: Llama 3 Community License

Usage Example

from transformers import AutoModel

# Initialize from the trained model
model = AutoModel.from_pretrained(
    "scb10x/llama-3-typhoon-v1.5-8b-audio-preview", 
    torch_dtype=torch.float16,
    trust_remote_code=True
)
model.to("cuda")
model.eval()

# Run generation
prompt_pattern="<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n<Speech><SpeechHere></Speech> {}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
response = model.generate(
    wav_path="path_to_your_audio.wav",
    prompt="transcribe this audio",
    prompt_pattern=prompt_pattern,
    do_sample=False,
    max_length=1200,
    repetition_penalty=1.1,
    num_beams=1,
    # temperature=0.4,
    # top_p=0.9,
    # streamer=streamer # supports TextIteratorStreamer
)
print(response)

Evaluation Results

Model	ASR-en (WER↓)	ASR-th (WER↓)	En2Th (BLEU↑)	X2Th (BLEU↑)	Th2En (BLEU↑)
SALMONN-13B	5.79	98.07	0.07	0.10	14.97
DiVA-8B	30.28	65.21	9.82	5.31	7.97
Gemini-1.5-pro-001	5.98	13.56	20.69	13.52	22.54
Typhoon-Audio-Preview	8.72	14.17	17.52	10.67	24.14

Model	Gender-th (Acc)	SpokenQA-th (F1)	SpeechInstruct-th
SALMONN-13B	93.26	2.95	1.18
DiVA-8B	50.12	15.13	2.68
Gemini-1.5-pro-001	81.32	62.10	3.93
Typhoon-Audio-Preview	93.74	64.60	6.11

Intended Uses & Limitations

This model is a pretrained base model. Thus, it may not be able to follow human instructions without using one/few-shot learning or instruction fine-tuning. The model does not have any moderation mechanisms, and may generate harmful or inappropriate responses.

Follow us & Support

Acknowledgements

We would like to thank the SALMONN team for open-sourcing their code and data, and thanks to the Biomedical and Data Lab at Mahidol University for releasing the fine-tuned Whisper that allowed us to adopt its encoder. Thanks to many other open-source projects for their useful knowledge sharing, data, code, and model weights.

Typhoon Team

Potsawee Manakul, Sittipong Sripaisarnmongkol, Natapong Nitarach, Warit Sirichotedumrong, Adisai Na-Thalang, Phatrasek Jirabovonvisut, Parinthapat Pengpun, Krisanapong Jirayoot, Pathomporn Chokchainant, Kasima Tharnpipitchai, Kunat Pipatanakul