# ke_t5_base_bongsoo_en_ko

This model is a fine-tuned version of KETI-AIR/ke-t5-base on the bongsoo/news_news_talk_en_ko dataset. See the notebook translation_ke_t5_base_bongsoo_en_ko.ipynb for details.

## Model description

KE-T5 is a T5 (Text-to-Text Transfer Transformer) model pretrained on Korean and English corpora, developed by KETI (Korea Electronics Technology Institute, 한국전자연구원). Its vocabulary consists of 64,000 sub-word tokens and was built with Google's SentencePiece. The SentencePiece model was trained to cover 99.95% of a 30GB corpus with an approximate 7:3 mix of Korean and English.

## Intended uses & limitations

Translation from English to Korean

## Usage

You can use this model directly with a translation pipeline:

>>> from transformers import pipeline
>>> translator = pipeline('translation', model='chunwoolee0/ke_t5_base_bongsoo_en_ko')

>>> translator("Let us go for a walk after lunch.")
[{'translation_text': '점심을 마치고 산책을 하러 가자.'}]

>>> translator("The BRICS countries welcomed six new members from three different continents on Thursday.")
[{'translation_text': '브릭스 국가들은 지난 24일 3개 대륙 6명의 신규 회원을 환영했다.'}]

>>> translator("The BRICS countries welcomed six new members from three different continents on Thursday, marking a historic milestone that underscored the solidarity of BRICS and developing countries and determination to work together for a better future, officials and experts said.",max_length=400)
[{'translation_text': '๋ธŒ๋ ™์Šค ๊ตญ๊ฐ€๋Š” ์ง€๋‚œ 7์ผ 3๊ฐœ ๋Œ€๋ฅ™ 6๋ช…์˜ ์‹ ๊ทœ ํšŒ์›์„ ํ™˜์˜ํ•˜๋ฉฐ BRICS์™€ ๊ฐœ๋ฐœ๋„์ƒ๊ตญ์˜ ์—ฐ๋Œ€์™€ ๋” ๋‚˜์€ ๋ฏธ๋ž˜๋ฅผ ์œ„ํ•ด ํ•จ๊ป˜ ๋…ธ๋ ฅํ•˜๊ฒ ๋‹ค๋Š” ์˜์ง€๋ฅผ ์žฌํ™•์ธํ•œ ์—ญ์‚ฌ์ ์ธ ์ด์ •ํ‘œ๋ฅผ ์žฅ์‹ํ–ˆ๋‹ค๊ณ  ๊ด€๊ณ„์ž๋“ค๊ณผ ์ „๋ฌธ๊ฐ€๋“ค์€ ์ „ํ–ˆ๋‹ค.'}]

>>> translator("Biden’s decree zaps lucrative investments in China’s chip and AI sectors")
[{'translation_text': '바이든 장관의 행정명령은 중국 칩과 AI 분야의 고수익 투자를 옥죄는 것이다.'}]

>>> translator("It is most likely that China’s largest chip foundry, a key piece of the puzzle in Beijing’s efforts to achieve greater self-sufficiency in semiconductors, would not have been able to set up its first plant in Shanghai’s suburbs in the early 2000s without funding from American investors such as Walden International and Goldman Sachs.", max_length=400)
[{'translation_text': '반도체의 더 큰 자립성을 이루기 위해 베이징이 애쓰는 퍼즐의 핵심 조각인 중국 최대 칩 파운드리가 월덴인터내셔널, 골드만삭스 등 미국 투자자로부터 자금 지원을 받지 못한 채 2000년대 초 상하이 시내에 첫 공장을 지을 수 없었을 가능성이 크다.'}]
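
Each call above returns a list holding one dictionary keyed by `translation_text`. As a minimal sketch, a small helper can unwrap that structure when translating several sentences. The name `translate_all` and its parameters are hypothetical and introduced here only for illustration; they are not part of this model or of `transformers`. The helper assumes `translator` behaves like the pipeline object created above.

```python
from typing import Callable, List


def translate_all(translator: Callable, sentences: List[str], max_length: int = 400) -> List[str]:
    """Translate each sentence and return only the translated strings.

    Assumes `translator` returns [{'translation_text': ...}] for a single
    input string, as the pipeline calls above do.
    """
    results = []
    for sentence in sentences:
        # max_length is forwarded to the pipeline, as in the long examples above.
        output = translator(sentence, max_length=max_length)
        # Unwrap the single-element list and pick the translated text.
        results.append(output[0]["translation_text"])
    return results
```

With the `translator` pipeline from the example above, `translate_all(translator, ["Let us go for a walk after lunch."])` would return a list with one Korean string instead of the nested list of dictionaries.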

## Training and evaluation data

Because of Google Colab resource limits, only one third of the original 1,200,000 training examples were used.

## Training procedure

Because of Google Colab limits, the model was trained for only one epoch. Even so, the translation quality is reasonably good for a single epoch of training.

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0005
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 1
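
The listed values are related by simple arithmetic. The sketch below recomputes the effective batch size from `train_batch_size` and `gradient_accumulation_steps`, estimates how many examples were seen over the 5625 steps reported in the training results, and evaluates a linear learning-rate decay. It assumes a single device and zero warmup steps, neither of which is stated in this card. `linear_lr` is a hypothetical helper written for illustration, not a library function.

```python
# Values copied from the hyperparameter list and the training results table.
LEARNING_RATE = 5e-4
TRAIN_BATCH_SIZE = 32
GRADIENT_ACCUMULATION_STEPS = 2
TOTAL_STEPS = 5625  # optimizer steps reported for the single epoch

# Effective batch size per optimizer step (assumes one device).
total_train_batch_size = TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS

# Approximate number of training examples seen in the epoch
# (an upper bound if the last batch was partial).
examples_per_epoch = TOTAL_STEPS * total_train_batch_size


def linear_lr(step: int, base_lr: float = LEARNING_RATE, total_steps: int = TOTAL_STEPS) -> float:
    """Linear decay from base_lr at step 0 to 0 at total_steps (assumes no warmup)."""
    remaining = max(0, total_steps - step)
    return base_lr * (remaining / total_steps)


print(total_train_batch_size)  # 64
print(examples_per_epoch)      # 360000
print(linear_lr(0))            # 0.0005
print(linear_lr(TOTAL_STEPS))  # 0.0
```

The first printed value matches the `total_train_batch_size` of 64 listed above.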

### Training results

| Training Loss | Epoch | Step | Validation Loss | Bleu   |
|:-------------:|:-----:|:----:|:---------------:|:------:|
| No log        | 1.0   | 5625 | 2.4075          | 8.2272 |

- CPU memory usage: 4.8/12.7GB
- GPU memory usage: 13.0/15.0GB
- Running time: 3h

### Framework versions

- Transformers 4.32.0
- PyTorch 2.0.1+cu118
- Datasets 2.14.4
- Tokenizers 0.13.3