---
language:
  - ko
library_name: transformers
license: mit
tags:
  - maxtext
---

devngho/llama-ablation-large-korean-corpus_edu

This model was pretrained with the Llama architecture on about 20.7B tokens (approximately 34.5 epochs), using MaxText.

Checkpoints are provided every 500 steps.
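If the intermediate checkpoints are published as separate revisions of this repository, they can be loaded by passing `revision` to `from_pretrained`. This is only a minimal sketch; the revision name below is hypothetical and should be replaced with an actual branch or tag listed on the model page.

```python
from transformers import AutoModelForCausalLM

# Load a specific intermediate checkpoint (the revision name here is hypothetical).
model = AutoModelForCausalLM.from_pretrained(
    "devngho/llama-ablation-large-korean-corpus_edu",
    revision="step-5000",  # replace with a real revision from the repository
)
```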

This research was supported with Cloud TPUs from Google's TPU Research Cloud (TRC). ⚑

Examples

ꡡ은 뢀뢄이 μž…λ ₯μž…λ‹ˆλ‹€.

  • max_new_tokens: 500
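A minimal generation sketch with transformers is shown below. The prompt is illustrative, and any sampling settings beyond max_new_tokens are assumptions, not the exact settings used for the examples.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "devngho/llama-ablation-large-korean-corpus_edu"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Illustrative prompt; the examples below were generated with max_new_tokens=500.
inputs = tokenizer("인곡지λŠ₯은", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0]))
```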

Example 1 <s> 인곡지λŠ₯은 2015년에 μ „ μ„Έκ³„μ—μ„œ κ°€μž₯ λΉ λ₯Έ μ†λ„λ‘œ λ°œμ „ν•˜κ³  μžˆλ‹€. 2015년에 100μ–΅ κ°œκ°€ λ„˜λŠ” 인곡지λŠ₯ λ‘œλ΄‡μ΄ 개발되고 2020λ…„μ—λŠ” 100μ–΅ κ°œκ°€ λ„˜λŠ” 인곡지λŠ₯ λ‘œλ΄‡μ΄ 개발될 κ²ƒμœΌλ‘œ μ˜ˆμƒλœλ‹€. 2020λ…„μ—λŠ” 100μ–΅ κ°œκ°€ λ„˜λŠ” 인곡지λŠ₯ λ‘œλ΄‡μ΄ 개발될 κ²ƒμœΌλ‘œ μ „λ§λœλ‹€. 2020λ…„μ—λŠ” 100μ–΅ κ°œκ°€ λ„˜λŠ” 인곡지λŠ₯ λ‘œλ΄‡μ΄ 개발될 κ²ƒμœΌλ‘œ μ˜ˆμƒλœλ‹€.</s>

Example 2 <s> ν•œκΈ€μ˜ νŠΉμ§•μ€ ν•œκΈ€μ΄ λ§Œλ“€μ–΄μ§€κΈ° μ΄μ „μ˜ λ¬Έμžμ˜€λ‹€λŠ” 것이닀. ν•œκΈ€μ€ 1443년에 창제된 ν›ˆλ―Όμ •μŒμ˜ 창제 이후뢀터 1907년에 이λ₯΄κΈ°κΉŒμ§€ μ•½ 250λ…„ λ™μ•ˆμ— κ±Έμ³μ„œ λ§Œλ“€μ–΄μ‘Œλ‹€. ν•œκΈ€μ€ 1443년에 창제된 ν›ˆλ―Όμ •μŒμ˜ λ°˜ν¬μ™€ ν•¨κ»˜ 1446년에 반포된 ν›ˆλ―Όμ •μŒ, 곧 ν•œκΈ€μ˜ κΈ°μ›μ„μ˜ μ†Œλ¦Ώκ°’μœΌλ‘œ 보고 μžˆλ‹€. ν•œκΈ€μ€ 1443년에 창제된 ν›ˆλ―Όμ •μŒμ˜ 창제 이후뢀터 1517년에 반포된 ν›ˆλ―Όμ •μŒ, 곧 ν•œκΈ€μ˜ κΈ°μ›μ„μ˜ μ†Œλ¦Ώκ°’μœΌλ‘œ 보고 μžˆλ‹€.</s>

Example 3 <s> μ»€ν”ΌλŠ” 17μ„ΈκΈ°κ²½λΆ€ν„° 유럽 κ°κ΅­μ—μ„œ 컀피λ₯Ό λ§ˆμ…¨κ³ , 18μ„ΈκΈ° λ§μ—λŠ” 영ꡭ과 ν”„λž‘μŠ€μ—μ„œ 컀피λ₯Ό λ§ˆμ‹œκ²Œ λ˜μ—ˆλ‹€. 19μ„ΈκΈ° μ΄ˆμ—λŠ” μ˜κ΅­μ—μ„œ 컀피가 λŒ€λŸ‰μœΌλ‘œ μˆ˜μž…λ˜μ—ˆλ‹€. 19μ„ΈκΈ° μ΄ˆμ—λŠ” μ˜κ΅­μ—μ„œ 컀피가 λŒ€λŸ‰μœΌλ‘œ μˆ˜μž…λ˜μ—ˆλ‹€. 19μ„ΈκΈ° μ΄ˆμ—λŠ” μ˜κ΅­μ—μ„œ 컀피가 λŒ€λŸ‰μœΌλ‘œ μˆ˜μž…λ˜μ—ˆλ‹€.</s>

μƒλ‹Ήν•œ ν™˜κ°κ³Ό 어색함, 반볡이 μžˆμŠ΅λ‹ˆλ‹€.

Details

  • Made by: devngho
  • Language: ko
  • License: mit

Training details

  • learning_rate: 6e-4 (cosine, initial/end 6e-5; see the schedule sketch after this list)
  • warmup_ratio: 0.05
  • batch_size: 1024 (fsdp 16 * per-device 8 * gradient accumulation 8)
  • optimizer: adamw(b1=0.9, b2=0.95, eps=1e-5, weight_decay=0.01)
  • duration: about 27h 50m
  • steps: 10000
  • You can check all the configs and training results on wandb.
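As referenced above, here is a minimal optax sketch of the stated schedule and optimizer. This is not the actual MaxText configuration; it only restates the hyperparameters listed above (warmup steps 0.05 Γ— 10,000 = 500, batch 16 Γ— 8 Γ— 8 = 1,024).

```python
import optax

total_steps = 10_000
warmup_steps = int(0.05 * total_steps)  # warmup_ratio 0.05 -> 500 steps

# Cosine schedule: warm up from 6e-5 to the 6e-4 peak, then decay back to 6e-5.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=6e-5,
    peak_value=6e-4,
    warmup_steps=warmup_steps,
    decay_steps=total_steps,
    end_value=6e-5,
)

# AdamW with the listed hyperparameters.
optimizer = optax.adamw(
    learning_rate=schedule,
    b1=0.9,
    b2=0.95,
    eps=1e-5,
    weight_decay=0.01,
)
```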

Training devices

TPU v4-32

Training datasets

AI Hub, λͺ¨λ‘μ˜λ§λ­‰μΉ˜λ₯Ό dedup, length filtering ν›„ devngho/ko_edu_classifier_v2_nlpai-lab_KoE5둜 ν‰κ°€ν–ˆμ„ λ•Œ 3점 이상인 데이터(μ•½ 8%, 1,354,234ν–‰) μ‚¬μš©

AI Hub, λͺ¨λ‘μ˜λ§λ­‰μΉ˜ κ·œμ •μœΌλ‘œ 인해 데이터셋을 κ³΅κ°œν•  수 μ—†μ§€λ§Œ, 원본 데이터λ₯Ό μ€€λΉ„ν•œλ‹€λ©΄ devngho/dataset-preprocess의 κ³Όμ •μœΌλ‘œ λ™μΌν•˜κ²Œ μ „μ²˜λ¦¬ν•  수 μžˆμŠ΅λ‹ˆλ‹€. λΆ„λ₯˜κΈ° 필터링은 λ”°λ‘œ μˆ˜ν–‰ν•΄μ•Ό ν•©λ‹ˆλ‹€.

Software

jax==0.4.35

devngho/MaxText, a fork of MaxText

Training results

  • learning/loss: 1.6112642288208008
  • eval/avg_loss: 2.0766192864296023
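If eval/avg_loss is the mean per-token cross-entropy in nats (an assumption, not stated above), it corresponds to an evaluation perplexity of about exp(2.0766) β‰ˆ 8.0.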

Benchmark results are provided below.

Benchmark graphs