metadata
library_name: transformers
license: llama3
language:
  - th
  - en
pipeline_tag: text-generation

Typhoon-Audio Preview

llama-3-typhoon-v1.5-8b-audio-preview is a 🇹🇭 Thai audio-language model. It natively accepts both text and audio as input and produces text as output. This version (August 2024) is a research preview and our first audio-language model, released as part of our multimodal effort. The base language model is our llama-3-typhoon-v1.5-8b-instruct.

More details can be found in our release blog and technical report.

*To acknowledge Meta's effort in creating the foundation model and to comply with the license, we explicitly include "llama-3" in the model name.

Model Description

  • Model type: The LLM is based on Typhoon-1.5-8b-instruct, and the audio encoder is based on Whisper's encoder and BEATs.
  • Requirement: transformers 4.38.0 or newer.
  • Primary Language(s): Thai 🇹🇭 and English 🇬🇧
  • Demo: https://audio.opentyphoon.ai/
  • License: Llama 3 Community License
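The version requirement listed above can be checked before loading the model. The following is a minimal sketch, not part of the model's own code: the helper names `parse_version`, `meets_minimum`, and `installed_transformers_version` are hypothetical, and the comparison only looks at the leading numeric release components, so a development build such as `4.38.0.dev0` is treated like `4.38.0`.

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Return the leading numeric components, e.g. "4.38.0.dev0" -> (4, 38, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed: str, minimum: str = "4.38.0") -> bool:
    """True if `installed` is at least `minimum` (numeric release parts only)."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that "5.0" and "5.0.0" compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_transformers_version() -> Optional[str]:
    """Return the installed transformers version, or None if it is not installed."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None
```

A typical use would be to call `installed_transformers_version()` at start-up and stop with a clear message when the result is `None` or `meets_minimum(...)` returns `False`.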

Usage Example

import torch  # needed for torch.float16 below
from transformers import AutoModel

# Initialize from the trained model
model = AutoModel.from_pretrained(
    "scb10x/llama-3-typhoon-v1.5-8b-audio-preview", 
    torch_dtype=torch.float16,
    trust_remote_code=True
)
model.to("cuda")
model.eval()

# Run generation
# "{}" is the slot for the text instruction; <SpeechHere> marks the audio position
prompt_pattern = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n<Speech><SpeechHere></Speech> {}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
response = model.generate(
    wav_path="path_to_your_audio.wav",
    prompt="transcribe this audio",
    prompt_pattern=prompt_pattern,
    do_sample=False,
    max_length=1200,
    repetition_penalty=1.1,
    num_beams=1,
    # temperature=0.4,
    # top_p=0.9,
    # streamer=streamer # supports TextIteratorStreamer
)
print(response)
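The `prompt_pattern` used above is an ordinary Python format string following the Llama 3 chat layout. The sketch below illustrates how such a pattern could be filled and split around the audio placeholder. It assumes plain `str.format` substitution for the `{}` slot; how the model's remote code handles the pattern internally is not documented here, and the helper names `build_prompt` and `split_around_audio` are hypothetical.

```python
from typing import Tuple

PROMPT_PATTERN = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "<Speech><SpeechHere></Speech> {}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

AUDIO_PLACEHOLDER = "<SpeechHere>"


def build_prompt(instruction: str, pattern: str = PROMPT_PATTERN) -> str:
    """Fill the text instruction into the pattern's "{}" slot."""
    if AUDIO_PLACEHOLDER not in pattern:
        raise ValueError("pattern has no <SpeechHere> placeholder for the audio")
    return pattern.format(instruction.strip())


def split_around_audio(full_prompt: str) -> Tuple[str, str]:
    """Return the text before and after the audio placeholder."""
    left, right = full_prompt.split(AUDIO_PLACEHOLDER, 1)
    return left, right


if __name__ == "__main__":
    text = build_prompt("transcribe this audio")
    before, after = split_around_audio(text)
    print(repr(before))
    print(repr(after))
```

Printing the two halves makes it easy to confirm that the instruction lands after the `</Speech>` tag and that the prompt ends with the assistant header, which is where generation is expected to start.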

Evaluation Results

| Model | ASR-en (WER↓) | ASR-th (WER↓) | En2Th (BLEU↑) | X2Th (BLEU↑) | Th2En (BLEU↑) |
|---|---|---|---|---|---|
| SALMONN-13B | 5.79 | 98.07 | 0.07 | 0.10 | 14.97 |
| DiVA-8B | 30.28 | 65.21 | 9.82 | 5.31 | 7.97 |
| Gemini-1.5-pro-001 | 5.98 | 13.56 | 20.69 | 13.52 | 22.54 |
| Typhoon-Audio-Preview | 8.72 | 14.17 | 17.52 | 10.67 | 24.14 |

| Model | Gender-th (Acc) | SpokenQA-th (F1) | SpeechInstruct-th |
|---|---|---|---|
| SALMONN-13B | 93.26 | 2.95 | 1.18 |
| DiVA-8B | 50.12 | 15.13 | 2.68 |
| Gemini-1.5-pro-001 | 81.32 | 62.10 | 3.93 |
| Typhoon-Audio-Preview | 93.74 | 64.60 | 6.11 |
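For readers unfamiliar with the ASR metric, word error rate (WER) is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below assumes the scores are reported as percentages and that the text is already split into words by whitespace. Thai is written without spaces between words, so Thai WER needs a word segmentation step first; the segmentation and text normalization behind the reported numbers are not specified here, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only one previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
    print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 2))
```

Because insertions count as errors, WER can exceed 100 when the output is much longer than the reference.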

Intended Uses & Limitations

This model is a pretrained base model, so it may not follow human instructions without one-shot or few-shot prompting or instruction fine-tuning. It has no moderation mechanisms and may generate harmful or inappropriate responses.

Acknowledgements

We would like to thank the SALMONN team for open-sourcing their code and data, and the Biomedical and Data Lab at Mahidol University for releasing the fine-tuned Whisper model whose encoder we adopted. We also thank the many other open-source projects that shared their knowledge, data, code, and model weights.

Typhoon Team

Potsawee Manakul, Sittipong Sripaisarnmongkol, Natapong Nitarach, Warit Sirichotedumrong, Adisai Na-Thalang, Phatrasek Jirabovonvisut, Parinthapat Pengpun, Krisanapong Jirayoot, Pathomporn Chokchainant, Kasima Tharnpipitchai, Kunat Pipatanakul