Update README.md

391fbeb verified 3 months ago

4.27 kB

	---
	library_name: transformers
	license: llama3
	language:
	- th
	- en
	pipeline_tag: text-generation
	---

	# Typhoon-Audio Preview

	llama-3-typhoon-v1.5-8b-audio-preview is a 🇹🇭 Thai audio-language model. It supports both text and audio input modalities natively while the output is text. This version (August 2024) is our first audio-language model as a part of our multimodal effort, and it is a research preview version. The base language model is our [llama-3-typhoon-v1.5-8b-instruct](https://huggingface.co/scb10x/llama-3-typhoon-v1.5-8b-instruct).

	More details can be found in our [release blog](https://blog.opentyphoon.ai/typhoon-audio-preview-release-6fbb3f938287) and [technical report](). *To acknowledge Meta's effort in creating the foundation model and to comply with the license, we explicitly include "llama-3" in the model name.

	## Model Description

	- Model type: The LLM is based on Typhoon-1.5-8b-instruct, and the audio encoder is based on Whisper's encoder and BEATs.
	- Requirement: transformers 4.38.0 or newer.
	- Primary Language(s): Thai 🇹🇭 and English 🇬🇧
	- Demo: https://audio.opentyphoon.ai/
	- License: [Llama 3 Community License](https://llama.meta.com/llama3/license/)

	## Usage Example

	```python
	from transformers import AutoModel

	# Initialize from the trained model
	model = AutoModel.from_pretrained(
	"scb10x/llama-3-typhoon-v1.5-8b-audio-preview",
	torch_dtype=torch.float16,
	trust_remote_code=True
	)
	model.to("cuda")
	model.eval()

	# Run generation
	prompt_pattern="<\|begin_of_text\|><\|start_header_id\|>user<\|end_header_id\|>\n\n<Speech><SpeechHere></Speech> {}<\|eot_id\|><\|start_header_id\|>assistant<\|end_header_id\|>\n\n"
	response = model.generate(
	wav_path="path_to_your_audio.wav",
	prompt="transcribe this audio",
	prompt_pattern=prompt_pattern,
	do_sample=False,
	max_length=1200,
	repetition_penalty=1.1,
	num_beams=1,
	# temperature=0.4,
	# top_p=0.9,
	# streamer=streamer # supports TextIteratorStreamer
	)
	print(response)
	```

	## Evaluation Results

	\| Model \| ASR-en (WER↓) \| ASR-th (WER↓) \| En2Th (BLEU↑) \| X2Th (BLEU↑) \| Th2En (BLEU↑) \|
	\|:----------------------------\|:-------------------\|:--------------\|:--------------\|:-------------\|:--------------\|
	\| SALMONN-13B \| 5.79 \| 98.07 \| 0.07 \| 0.10 \| 14.97 \|
	\| DiVA-8B \| 30.28 \| 65.21 \| 9.82 \| 5.31 \| 7.97 \|
	\| Gemini-1.5-pro-001 \| 5.98 \| 13.56 \| 20.69 \| 13.52 \| 22.54 \|
	\| Typhoon-Audio-Preview \| 8.72 \| 14.17 \| 17.52 \| 10.67 \| 24.14 \|


	\| Model \| Gender-th (Acc) \| SpokenQA-th (F1) \| SpeechInstruct-th \|
	\|:-------------------------------\|:---------------\|:-------------------\|:-------------------\|
	\| SALMONN-13B \| 93.26 \| 2.95 \| 1.18 \|
	\| DiVA-8B \| 50.12 \| 15.13 \| 2.68 \|
	\| Gemini-1.5-pro-001 \| 81.32 \| 62.10 \| 3.93 \|
	\| Typhoon-Audio-Preview \| 93.74 \| 64.60 \| 6.11 \|


	## Intended Uses & Limitations
	This model is a pretrained base model. Thus, it may not be able to follow human instructions without using one/few-shot learning or instruction fine-tuning. The model does not have any moderation mechanisms, and may generate harmful or inappropriate responses.

	## Follow us & Support
	- https://twitter.com/opentyphoon
	- https://discord.gg/CqyBscMFpg

	## Acknowledgements
	We would like to thank the SALMONN team for open-sourcing their code and data, and thanks to the Biomedical and Data Lab at Mahidol University for releasing the fine-tuned Whisper that allowed us to adopt its encoder. Thanks to many other open-source projects for their useful knowledge sharing, data, code, and model weights.

	## Typhoon Team
	Potsawee Manakul, Sittipong Sripaisarnmongkol, Natapong Nitarach, Warit Sirichotedumrong, Adisai Na-Thalang, Phatrasek Jirabovonvisut, Parinthapat Pengpun,
	Krisanapong Jirayoot, Pathomporn Chokchainant, Kasima Tharnpipitchai, Kunat Pipatanakul