---
license: mit
language:
- ko
pipeline_tag: text-generation
tags:
- Language
- Dialect
---
<img src="https://cdn-uploads.huggingface.co/production/uploads/66ac53d3f202443fb03d5c70/MiejxjzWeKjKJgw13yor5.png" width="50%">


# JEJUMA-001 [WIP; Model is not fully trained]
Preserving our disappearing dialects with LLMs, project 1: the Jeju Island dialect

## Why did we start this project?
### A rapidly disappearing regional dialect: Jeju
* Many regional dialects, and the Jeju dialect in particular, are disappearing quickly.
* UNESCO has classified the Jeju language (Jeju dialect) as a **critically endangered language**.
* Only **36.1% of Jeju residents know the Jeju language**.
* In particular, as exchange with other regions has grown, younger people tend to prefer standard Korean over the Jeju language.

### Language models are weak on regional dialects
* Online sources are written largely in standard Korean, so models know little about regional dialects, for which far less material exists.
* The Jeju language differs substantially from standard Korean, so models cannot understand it beyond a few well-known words and sentences.

## How did we address this?
* The language model converts hard-to-understand Jeju speech into standard Korean, so that the Jeju language is not forgotten.
* The language model generates the Jeju version of a standard Korean sentence, so that users can check it.
* We chose a language model so that the broad knowledge it has already learned carries over.

## About the model
* We fine-tuned Llama3.1 on Jeju dialect data so that it can perform several tasks related to the Jeju dialect.
* `JEJUMA-001` can currently convert between the dialect and standard Korean and detect the dialect, among other tasks.
* To train `JEJUMA-001`, we collected about 1.05 million paired Jeju dialect and Seoul speech examples, and selected the 170,000 pairs in which Jeju features are most evident.
* From these, we generated training data covering four tasks, about 340,000 examples in total.
* We trained with LoRA using LlamaFactory, for one epoch over all of the data.
* On difficult Jeju expressions, it shows higher translation accuracy than GPT-4o and the Korean-developed models Upstage Solar and Naver HCX.

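As a rough illustration of how one dialect/standard pair could be expanded into examples for the four tasks, here is a minimal sketch. The instruction strings are the ones used in `JejuPromptTemplate` in this card; everything else is an assumption: the Alpaca-style `instruction`/`input`/`output` record layout, the extra `task` bookkeeping field, the helper name `make_examples`, and especially the target strings for the two detection tasks, whose real format is not described here. The card also does not say how the 170,000 pairs were distributed across tasks to reach about 340,000 examples, so this sketch simply emits one record per task.

```python
import json

# Instruction strings copied from JejuPromptTemplate in this model card.
PROMPTS = {
    "dialect_to_standard": "Convert the following sentence or word which is Jeju island dialect to standard Korean: ",
    "standard_to_dialect": "Convert the following sentence or word which is standard Korean to Jeju island dialect: ",
    "detect_dialect": "Detect the following sentence or word is Jeju island dialect or standard Korean: ",
    "detect_dialect_and_convert": "Detect the following sentence or word is Jeju island dialect or standard Korean and convert the following sentence or word to Jeju island dialect or standard Korean: ",
}


def make_examples(dialect: str, standard: str) -> list[dict]:
    """Expand one (Jeju, standard) pair into one record per task (hypothetical layout)."""

    def record(task: str, source: str, target: str) -> dict:
        return {"task": task, "instruction": PROMPTS[task] + source, "input": "", "output": target}

    return [
        record("dialect_to_standard", dialect, standard),
        record("standard_to_dialect", standard, dialect),
        # ASSUMPTION: the label wording below is invented for illustration.
        record("detect_dialect", dialect, "Jeju island dialect"),
        record("detect_dialect_and_convert", dialect, "Jeju island dialect -> " + standard),
    ]


# Pair taken from the standard -> Jeju comparison table in this card.
pair = ("미깡낭 경 가심 너네 아방 좀 데령", "귤나무에 그냥 가서 너네 아버지좀 찾아와라.")
for example in make_examples(*pair):
    print(json.dumps(example, ensure_ascii=False))
```
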
### Jeju -> Standard Korean

| **Input** | **자이 폴에 독솔 막 난 거 보난 언 생이우다** |
| --- | --- |
| **Reference** | **재 팔에 닭살이 막 난 거 보니, 추운 모양이다.** ("Judging by the goosebumps all over that kid's arm, they must be cold.") |
| Upstage Solar output | 그 바위에 뱀이 나타나는 걸 보니까 정말 놀랐다. ("Seeing a snake appear on that rock, I was really startled.") |
| Naver HCX output | 재의 풀에 독초가 마구 난 것을 보니 어린 소나무입니다. ("Seeing poisonous weeds growing thick in the ash's grass, it is a young pine tree.") |
| GPT-4o output | 저기 바위에 독사가 막 나타난 걸 보니까 정말 놀랐다. ("Seeing a viper suddenly appear on that rock over there, I was really startled.") |
| **JEJUMA-001 output** | |

### Standard Korean -> Jeju

| **Input** | **귤나무에 그냥 가서 너네 아버지좀 찾아와라.** ("Just go over to the tangerine tree and fetch your father.") |
| --- | --- |
| **Reference** | **미깡낭 경 가심 너네 아방 좀 데령** |
| Upstage Solar output | 귤 나무에 가서 네 아버지를 좀 찾아와. ("Go to the tangerine tree and fetch your father.") |
| Naver HCX output | 귤낭에 강 느네 아방 좀 데령오라. |
| GPT-4o output | 귤나무에 걍 가서 햄신 아방 좀 찾아와라. |
| **JEJUMA-001 output** | |

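This card does not state how translation accuracy is measured. As one simple, illustrative proxy (not the authors' metric), the sketch below scores each system output from the Jeju -> standard Korean table against the reference using character-level similarity from Python's standard `difflib`. The helper name `char_similarity` is hypothetical.

```python
from difflib import SequenceMatcher


def char_similarity(hypothesis: str, reference: str) -> float:
    """Return a character-level similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, hypothesis.strip(), reference.strip()).ratio()


# Reference and system outputs copied from the Jeju -> standard Korean table.
reference = "재 팔에 닭살이 막 난 거 보니, 추운 모양이다."
outputs = {
    "Upstage Solar": "그 바위에 뱀이 나타나는 걸 보니까 정말 놀랐다.",
    "Naver HCX": "재의 풀에 독초가 마구 난 것을 보니 어린 소나무입니다.",
    "GPT-4o": "저기 바위에 독사가 막 나타난 걸 보니까 정말 놀랐다.",
}

# Print systems from most to least similar to the reference.
scores = {name: char_similarity(text, reference) for name, text in outputs.items()}
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.2f}")
```

Surface overlap like this rewards shared characters rather than shared meaning, so it is only a quick sanity check; a proper evaluation would need a metric and test set chosen for the task.
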
## How do I use it?
* Use one of the templates defined in `JejuPromptTemplate`: `dialect_to_standard`, `standard_to_dialect`, `detect_dialect`, or `detect_dialect_and_convert`.
* `dialect_to_standard`: converts Jeju to standard Korean
* `standard_to_dialect`: converts standard Korean to Jeju
* `detect_dialect`: detects whether the input is Jeju or standard Korean
* `detect_dialect_and_convert`: automatically detects whether the input is Jeju or standard Korean and converts it to the other
  
```python
import transformers
import torch

model_id = "JEJUMA/JEJUMA-001"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

class JejuPromptTemplate:
    @staticmethod
    def dialect_to_standard(text):
        return [{"role":"user", "content":"Convert the following sentence or word which is Jeju island dialect to standard Korean: " + text},]

    @staticmethod
    def standard_to_dialect(text):
        return [{"role":"user", "content":"Convert the following sentence or word which is standard Korean to Jeju island dialect: " + text},]

    @staticmethod
    def detect_dialect(text):
        return [{"role":"user", "content":"Detect the following sentence or word is Jeju island dialect or standard Korean: " + text},]

    @staticmethod
    def detect_dialect_and_convert(text):
        return [{"role":"user", "content":"Detect the following sentence or word is Jeju island dialect or standard Korean and convert the following sentence or word to Jeju island dialect or standard Korean: " + text},]


# Convert a Jeju dialect sentence to standard Korean.
outputs = pipeline(
    JejuPromptTemplate.dialect_to_standard("자이 폴에 독솔 막 난 거 보난 언 생이우다"),
    max_new_tokens=512,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.1,
    top_p=0.9,
)

# The last message in generated_text is the assistant's reply.
print(outputs[0]["generated_text"][-1])
```
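
The `print` call above shows the whole last message as a dictionary. If only the reply text is needed, a small helper along these lines may be convenient. It is a sketch that assumes the chat-style result layout used above (a list whose first item holds `generated_text` as a list of role/content messages); `extract_reply` is a hypothetical name, and `fake_outputs` is a stand-in with placeholder text, not real model output.

```python
def extract_reply(outputs: list[dict]) -> str:
    """Return the assistant's text from a chat-style text-generation result."""
    messages = outputs[0]["generated_text"]
    last = messages[-1]
    if last.get("role") != "assistant":
        raise ValueError("last message is not an assistant reply")
    return last["content"].strip()


# Stand-in for a real pipeline result so the helper can be tried without loading the model.
fake_outputs = [
    {
        "generated_text": [
            {"role": "user", "content": "Convert the following sentence or word which is Jeju island dialect to standard Korean: 미깡낭"},
            {"role": "assistant", "content": " <model reply here> "},
        ]
    }
]

print(extract_reply(fake_outputs))
```
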

## Future plans
* JEJUMA-001 is currently being trained and evaluated.
* JEJUMA-002 will be trained on data from all dialects in Korea so that it can perform the same tasks.
* JEJUMA-003 will build on data from all dialects in Korea, plus tasks that explain them, so that it can partially translate varieties absent from the training data (the Yanbian dialect, North Korean, and third languages).
* JEJUMA-003 is the final stage of this research, which is why we use an 8B language model rather than a dedicated translation model or a smaller model.