---
library_name: transformers
license: llama3
language:
- th
- en
pipeline_tag: text-generation
---

# Typhoon-Audio Preview

**llama-3-typhoon-v1.5-8b-audio-preview** is a 🇹🇭 Thai *audio-language* model. It natively accepts both text and audio as input and produces text as output. This version (August 2024) is our first audio-language model, released as part of our multimodal effort, and is a research *preview*. The base language model is our [llama-3-typhoon-v1.5-8b-instruct](https://huggingface.co/scb10x/llama-3-typhoon-v1.5-8b-instruct).

More details can be found in our [release blog](https://blog.opentyphoon.ai/typhoon-audio-preview-release-6fbb3f938287) and [technical report]().

*To acknowledge Meta's effort in creating the foundation model and to comply with the license, we explicitly include "llama-3" in the model name.*

## Model Description

- **Model type**: The LLM is based on Typhoon-1.5-8b-instruct, and the audio encoder is based on Whisper's encoder and BEATs.
- **Requirement**: transformers 4.38.0 or newer.
- **Primary Language(s)**: Thai 🇹🇭 and English 🇬🇧
- **Demo**: https://audio.opentyphoon.ai/
- **License**: [Llama 3 Community License](https://llama.meta.com/llama3/license/)
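
The version requirement above can be checked before loading the model. The snippet below is a minimal sketch, not part of the official release: the helper names `release_tuple` and `meets_minimum` are hypothetical, and the comparison only looks at the leading numeric components of the version string, so a pre-release such as `4.38.0rc1` is treated the same as `4.38.0`.

```python
from importlib.metadata import PackageNotFoundError, version

MIN_TRANSFORMERS = "4.38.0"  # minimum stated in the model description


def release_tuple(v: str, width: int = 3) -> tuple:
    """Parse leading digits of each dotted component: '4.38.0rc1' -> (4, 38, 0)."""
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(digits))
    # Pad with zeros so that '4.38' compares equal to '4.38.0'.
    parts += [0] * (width - len(parts))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = MIN_TRANSFORMERS) -> bool:
    """Return True if the installed version is at least the minimum."""
    return release_tuple(installed) >= release_tuple(minimum)


try:
    installed = version("transformers")
    status = "OK" if meets_minimum(installed) else "older than " + MIN_TRANSFORMERS
    print(f"transformers {installed}: {status}")
except PackageNotFoundError:
    print("transformers is not installed")
```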

## Usage Example

```python
import torch
from transformers import AutoModel

# Initialize from the trained model
model = AutoModel.from_pretrained(
    "scb10x/llama-3-typhoon-v1.5-8b-audio-preview", 
    torch_dtype=torch.float16,
    trust_remote_code=True
)
model.to("cuda")
model.eval()

# Run generation
prompt_pattern = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n<Speech><SpeechHere></Speech> {}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
response = model.generate(
    wav_path="path_to_your_audio.wav",
    prompt="transcribe this audio",
    prompt_pattern=prompt_pattern,
    do_sample=False,
    max_length=1200,
    repetition_penalty=1.1,
    num_beams=1,
    # temperature=0.4,
    # top_p=0.9,
    # streamer=streamer # supports TextIteratorStreamer
)
print(response)
```
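
The `prompt_pattern` above has two slots: a `{}` placeholder and a `<SpeechHere>` marker. How the model's remote code consumes them is not shown in this card. The sketch below assumes that `{}` is filled with the text prompt via `str.format` and that `<SpeechHere>` marks where the audio features are spliced in. The `build_prompt` helper is hypothetical and only illustrates the resulting text layout; it does not call the model.

```python
# Same pattern as in the usage example, split across lines for readability.
PROMPT_PATTERN = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "<Speech><SpeechHere></Speech> {}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)


def build_prompt(instruction: str, pattern: str = PROMPT_PATTERN) -> str:
    """Fill the `{}` slot with the text instruction (assumed behavior)."""
    return pattern.format(instruction.strip())


full_prompt = build_prompt("transcribe this audio")

# Assumption: the text on either side of <SpeechHere> is tokenized separately
# and the audio representation is inserted between the two halves.
before_audio, after_audio = full_prompt.split("<SpeechHere>")

print(repr(before_audio))
print(repr(after_audio))
```

Keeping the pattern separate from the instruction means the instruction can be swapped (for example, from transcription to translation) without touching the chat-template tokens.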

## Evaluation Results

| Model                       | ASR-en (WER↓)      | ASR-th (WER↓) | En2Th (BLEU↑) | X2Th (BLEU↑) | Th2En (BLEU↑) |
|:----------------------------|:-------------------|:--------------|:--------------|:-------------|:--------------|
| SALMONN-13B                 | 5.79      | 98.07         | 0.07         | 0.10        | 14.97        |
| DiVA-8B                     | 30.28     | 65.21         | 9.82         | 5.31        | 7.97         |
| Gemini-1.5-pro-001          | 5.98      | 13.56         | 20.69        | 13.52       | 22.54        |
| Typhoon-Audio-Preview       | 8.72      | 14.17         | 17.52        | 10.67       | 24.14        |


| Model                          | Gender-th (Acc) | SpokenQA-th (F1)   | SpeechInstruct-th |
|:-------------------------------|:---------------|:-------------------|:-------------------|
| SALMONN-13B                   |     93.26       |    2.95     |        1.18         |
| DiVA-8B                       |     50.12       |    15.13    |        2.68         |
| Gemini-1.5-pro-001            |     81.32       |    62.10    |        3.93         |
| Typhoon-Audio-Preview         |     93.74       |    64.60    |        6.11         |
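
For readers unfamiliar with the WER columns in the first table, the sketch below shows the standard word error rate: the word-level edit distance between reference and hypothesis, divided by the number of reference words. It is an illustration only, not the evaluation script behind these results. It assumes whitespace tokenization and reports a percentage; the card does not state the text normalization used, and Thai text would first need a word segmenter, which is not shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage, using whitespace-separated words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein row update).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return 100.0 * prev[-1] / len(ref)


# One deleted word out of six reference words.
score = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER: {score:.2f}%")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.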


## Intended Uses & Limitations
This model is a pretrained base model, so it may not follow human instructions reliably without one- or few-shot learning or instruction fine-tuning. The model has no moderation mechanisms and may generate harmful or inappropriate responses.

## Follow us & Support
- https://twitter.com/opentyphoon
- https://discord.gg/CqyBscMFpg

## Acknowledgements
We would like to thank the SALMONN team for open-sourcing their code and data, and the Biomedical and Data Lab at Mahidol University for releasing the fine-tuned Whisper model whose encoder we adopted. We also thank the many other open-source projects for sharing their knowledge, data, code, and model weights.

## Typhoon Team
*Potsawee Manakul*, Sittipong Sripaisarnmongkol, Natapong Nitarach, Warit Sirichotedumrong, Adisai Na-Thalang, Phatrasek Jirabovonvisut, Parinthapat Pengpun, 
Krisanapong Jirayoot, Pathomporn Chokchainant, Kasima Tharnpipitchai, *Kunat Pipatanakul*