---
license: other
license_name: seallms
license_link: https://huggingface.co/SeaLLMs/SeaLLM-13B-Chat/blob/main/LICENSE
language:
- en
- zh
- vi
- id
- th
- ms
- km
- lo
- my
- tl
tags:
- multilingual
- sea
---

<p align="center">
  <img src="sealmmm.png" width="200" />
</p>

> SeaLLM will be able to "see"!

# *SeaLMMM-7B* - Large Multilingual Multimodal Models for Southeast Asia


<p align="center">
<a href="https://damo-nlp-sg.github.io/SeaLLMs/" target="_blank" rel="noopener">Website</a>
&nbsp;&nbsp;
<a href="https://huggingface.co/SeaLLMs/SeaLMMM-7B-v0.1" target="_blank" rel="noopener"> 🤗 Tech Memo</a>
&nbsp;&nbsp;
<a href="https://huggingface.co/spaces/SeaLLMs/SeaLLM-7B" target="_blank" rel="noopener"> 🤗 DEMO</a>
&nbsp;&nbsp;
<a href="https://github.com/DAMO-NLP-SG/SeaLLMs" target="_blank" rel="noopener">Github</a>
&nbsp;&nbsp;
<a href="https://arxiv.org/pdf/2312.00738.pdf" target="_blank" rel="noopener">Technical Report</a>
</p>

<!-- 🔥<span style="color: #ff3860">[HOT]</span> SeaLLMs project now has a dedicated website - [damo-nlp-sg.github.io/SeaLLMs](https://damo-nlp-sg.github.io/SeaLLMs/) -->


We introduce and [showcase](https://huggingface.co/spaces/SeaLLMs/SeaLLM-7B) the first iteration of [SeaLMMM](https://huggingface.co/SeaLLMs/SeaLMMM-7B-v0.1) -- a unified multilingual and multimodal model that excels in both text-only and vision tasks across multiple languages spoken in Southeast Asia.

### SeaLMMM-7B abilities
* SeaLMMM-7B is one of the strongest 7B vision-language models at **text-only tasks**, with performance similar to [SeaLLM-7B-v2](https://huggingface.co/SeaLLMs/SeaLLM-7B-v2). It is a text-first, vision-second model.
* SeaLMMM-7B **is** able to handle most SEA languages, making it more multilingual than the English-only Llava or the bilingual (En+Zh) Qwen-VL and Yi-VL.
* Unlike Llava or specialized VLMs, which expect a single image at the beginning of the conversation, SeaLMMM-7B can seamlessly handle text-only turns at the start, accept visual instructions mid-conversation, and support topic and language switching, as sketched below.
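
To illustrate the last point, a conversation might start text-only and bring in an image only in a later turn. Below is a minimal sketch using the chat format documented in the Usage section later in this card (the image token is `<|image|>`); the specific turn contents are hypothetical.

```python
# A sketch of a conversation that begins text-only and introduces an
# image mid-dialogue, using the chat format from the Usage section below.
# Earlier turns could equally be in a SEA language such as Vietnamese.
prompt = """<|im_start|>system
You are a helpful assistant.</s>
<|im_start|>user
Can you recommend a simple recipe for fried rice?</s>
<|im_start|>assistant
Sure! You need day-old rice, eggs, soy sauce, and scallions ...</s>
<|im_start|>user
<|image|>
What is in the image?</s>
<|im_start|>assistant
"""
```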


### Release and DEMO

- DEMO: [SeaLLMs/SeaLLM-7b](https://huggingface.co/spaces/SeaLLMs/SeaLLM-7B).
- Model weights:
  - [SeaLMMM-7B-v0.1](https://huggingface.co/SeaLLMs/SeaLMMM-7B-v0.1).
- Explore SeaLLMs:
  - [SeaLLMs/SeaLLM-7B-v2.5](https://huggingface.co/spaces/SeaLLMs/SeaLLM-7B-v2.5).
  - [SeaLLMs/SeaLLM-7B-v2](https://huggingface.co/spaces/SeaLLMs/SeaLLM-7B-v2).
  - [SeaLLMs/SeaLLM-7B-v1](https://huggingface.co/spaces/SeaLLMs/SeaLLM-7B-v1).


<blockquote style="color:red">
<p><strong style="color: red">Terms of Use and License</strong>: 
By using our released weights, codes, and demos, you agree to and comply with the terms and conditions specified in our <a href="https://huggingface.co/SeaLLMs/SeaLLM-13B-Chat/blob/main/LICENSE" target="_blank" rel="noopener">SeaLLMs Terms Of Use</a>.</p>
</blockquote>

> **Disclaimer**:
> We must note that even though the weights, codes, and demos are released in an open manner, similar to other pre-trained language models, and despite our best efforts in red teaming and safety fine-tuning and enforcement, our models come with potential risks, including but not limited to inaccurate, misleading or potentially harmful generation.
> Developers and stakeholders should perform their own red teaming and provide related security measures before deployment, and they must abide by and comply with local governance and regulations.
> In no event shall the authors be held liable for any claim, damages, or other liability arising from the use of the released weights, codes, or demos.

> The logo was generated by DALL-E 3.


## Overview

SeaLMMM-7B-v0.1 is a multimodal extension of SeaLLM-7B-v2. It adopts the Llava-1.6 (Llava-NEXT) architecture. It is trained jointly on SeaLLM's multilingual text-only datasets, Llava-1.5's English-only vision data, in-house synthetically generated multilingual multimodal vision data, and open-source data such as [ThaiIDCardSynt](https://huggingface.co/datasets/matichon/ThaiIDCardSynt).


### English Vision QA Tasks

| Multimodal Models | VQAv2 | GQA | VizWiz | SQA-IMG | TextVQA |
| --- | --- | --- | --- | --- | --- |
| Qwen-VL-Chat | 78.20 | 57.50 | 38.90 | 68.20 | 61.50 |
| Llava-1.5-7b | 78.50 | 62.00 | 50.00 | 66.80 | 58.20 |
| Llava-1.5-13b | 80.00 | 63.30 | 53.60 | 71.60 | 61.30 |
| [SeaLMMM-7B-v0.1](https://huggingface.co/SeaLLMs/SeaLMMM-7B-v0.1) | ? | 61.58 | 58.00 | 71.79 | 63.47 |

### Multilingual Text-only World Knowledge

We evaluate models on 3 benchmarks following the recommended default setups: 5-shot MMLU for En, 3-shot [M3Exam](https://arxiv.org/pdf/2306.05179.pdf) (M3e) for En, Zh, Vi, Id, Th, and zero-shot [VMLU](https://vmlu.ai/) for Vi.
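
For concreteness, "k-shot" here means prepending k answered exemplars to each test question. Below is a minimal sketch of how such a multiple-choice prompt can be assembled; the exemplar contents are hypothetical placeholders, and the reported scores follow each benchmark's recommended harness rather than this exact code.

```python
# A minimal sketch of a k-shot multiple-choice prompt, as in the
# MMLU / M3Exam setups. Exemplars are hypothetical placeholders.
def format_item(question, choices, answer=None):
    lines = [f"Question: {question}"]
    lines += [f"{label}. {text}" for label, text in zip("ABCD", choices)]
    lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n".join(lines)

def build_k_shot_prompt(exemplars, test_question, test_choices):
    # exemplars: list of (question, choices, answer) tuples, e.g. k=5 for MMLU
    shots = [format_item(q, c, a) for q, c, a in exemplars]
    shots.append(format_item(test_question, test_choices))
    return "\n\n".join(shots)
```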

On text-only benchmarks, [SeaLMMM-7B-v0.1](https://huggingface.co/SeaLLMs/SeaLMMM-7B-v0.1) is generally on par with [SeaLLM-7B-v2](https://huggingface.co/SeaLLMs/SeaLLM-7B-v2), its base LLM. This demonstrates that our multimodal training regime does not substantially degrade text-only performance.

| Model | Langs | En<br>MMLU | En<br>M3e | Zh<br>M3e | Vi<br>M3e | Id<br>M3e | Th<br>M3e
|-----| -----  | --- |  -- | ----- | ---- | --- | --- |
| GPT-3.5         | Multi | 68.90 | 75.46 | 60.20 | 58.64 | 49.27 | 37.41
| Vistral-7B-chat | Mono  | 56.86 | 67.00 | 44.56 | 54.33 | 36.49 | 25.27
| Qwen1.5-7B-chat | Multi | 61.00 | 52.07 | 81.96 | 43.38 | 24.29 | 20.25
| SailorLM        | Multi | 52.72 | 59.76 | 67.74 | 50.14 | 39.53 | 37.73
| SeaLLM-7B-v2    | Multi | 61.89 | 70.91 | 55.43 | 51.15 | 42.25 | 35.52
| SeaLLM-7B-v2.5  | Multi | 64.05 | 76.87 | 62.54 | 63.11 | 48.64 | 46.86
| [SeaLMMM-7B-v0.1](https://huggingface.co/SeaLLMs/SeaLMMM-7B-v0.1) | Multi | 60.31 | 70.43 | 52.78 | 50.47 | 42.37 | 33.53



## Multilingual Multimodal Showcases

[SeaLMMM-7B-v0.1](https://huggingface.co/SeaLLMs/SeaLMMM-7B-v0.1) has better vision understanding and problem-solving abilities in languages beyond English and Chinese, especially SEA languages such as Vietnamese and Indonesian.

![two_cat.png](two_cat.png)

Image: find "x" in Vietnamese. Left: Llava-1.6-34B. Right: SeaLMMM-7B-v0.1. 
<div class="row" style="display: flex; clear: both;">
    <img src="llava_1.6_34b_find_x_vi.png" alt="Forest" style="float: left; width: 39%">
    <img src="find_x_vi.png" alt="Snow" style="float: left; width: 59%">
</div>


### Limitations
* Despite being multilingual, SeaLMMM-7B-v0.1's multi-modal capabilities still work best in English; we are working to improve them in other languages.
* For OCR, it can only read English.
* Due to its existing text-only SFT data, SeaLMMM-7B-v0.1 sometimes still claims it cannot process images in multi-turn settings; future versions will fix this.
* Multi-modal multi-turn capabilities are still limited.


### Usage

#### Instruction format

**Unlike other models, the image token is `<|image|>`.**

```python
prompt = """<|im_start|>system
You are a helpful assistant.</s>
<|im_start|>user
<|image|>
What is in the image?</s>
<|im_start|>assistant
There are 2 cats in the image.</s>"""

# <|im_start|> is not a special token.
# The Transformers chat_template should be consistent with this format (also used with vLLM).

# ! ENSURE one and only one bos `<s>` at the beginning of the sequence.
print(tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt)))
```
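
To go from the prompt format to actual generation, here is a minimal sketch. It assumes the checkpoint loads with the Llava-NEXT classes in `transformers` (the Overview notes the model adopts the Llava-1.6 / Llava-NEXT architecture); please verify against the official [DEMO](https://huggingface.co/spaces/SeaLLMs/SeaLLM-7B) before relying on it.

```python
# A sketch, assuming compatibility with the Llava-NEXT classes in
# `transformers`; not an official recipe from this model card.
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "SeaLLMs/SeaLMMM-7B-v0.1"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("two_cat.png")
# End the prompt at the assistant header so the model continues from there.
prompt = """<|im_start|>system
You are a helpful assistant.</s>
<|im_start|>user
<|image|>
What is in the image?</s>
<|im_start|>assistant
"""

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```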

## Acknowledgement to Our Linguists

We would like to express our special thanks to our professional and native linguists, Tantong Champaiboon, Nguyen Ngoc Yen Nhi and Tara Devina Putri, who helped build, evaluate, and fact-check our sampled pretraining and SFT datasets, as well as evaluate our models across different aspects, especially safety.

## Citation

If you find our project useful, we hope you will kindly star our repo and cite our work as follows. Corresponding author: [[email protected]](mailto:[email protected])

**Author list and order will change!**

* `*` and `^` are equal contributions.

```
@article{damonlpsg2023seallm,
  author = {Xuan-Phi Nguyen*, Wenxuan Zhang*, Xin Li*, Mahani Aljunied*, Weiwen Xu, Hou Pong Chan,
            Zhiqiang Hu, Chenhui Shen^, Yew Ken Chia^, Xingxuan Li, Jianyu Wang,
            Qingyu Tan, Liying Cheng, Guanzheng Chen, Yue Deng, Sen Yang,
            Chaoqun Liu, Hang Zhang, Lidong Bing},
  title = {SeaLLMs - Large Language Models for Southeast Asia},
  year = 2023,
  Eprint = {arXiv:2312.00738},
}
```