---
language:
- ko
library_name: transformers
license: mit
tags:
- maxtext
---
devngho/llama-ablation-large-korean-corpus_edu
A model pretrained with the Llama architecture, trained on about 20.7B tokens (approximately 34.5 epochs) using MaxText.
Checkpoints are provided every 500 steps.
This research was supported with Cloud TPUs from Google's TPU Research Cloud (TRC). ⚡
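The released checkpoints can be loaded with transformers. Below is a minimal sketch; the `revision` value is a hypothetical placeholder, since the exact naming of the 500-step checkpoints on the Hub is not specified here.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "devngho/llama-ablation-large-korean-corpus_edu"

# Final weights from the main revision.
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

# An intermediate checkpoint, assuming the 500-step checkpoints are exposed as
# Hub revisions/branches; "step-5000" is a hypothetical name, so check the
# repository's branches or files for the actual one.
intermediate = AutoModelForCausalLM.from_pretrained(repo, revision="step-5000")
```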
Examples
The part in bold is the input.
- max_new_tokens: 500
Example 1 <s> 인공지능은 2015년에 전 세계에서 가장 빠른 속도로 발전하고 있다. 2015년에 100억 개가 넘는 인공지능 로봇이 개발되고 2020년에는 100억 개가 넘는 인공지능 로봇이 개발될 것으로 예상된다. 2020년에는 100억 개가 넘는 인공지능 로봇이 개발될 것으로 전망된다. 2020년에는 100억 개가 넘는 인공지능 로봇이 개발될 것으로 예상된다.</s>
Example 2 <s> 한글의 특징은 한글이 만들어지기 이전에 문자였다는 것이다. 한글은 1443년에 창제된 훈민정음의 창제 이후부터 1907년에 이르기까지 약 250년 동안에 걸쳐서 만들어졌다. 한글은 1443년에 창제된 훈민정음의 반포와 함께 1446년에 반포된 훈민정음, 곧 한글의 기원에서 자릿값으로 보고 있다. 한글은 1443년에 창제된 훈민정음의 창제 이후부터 1517년에 반포된 훈민정음, 곧 한글의 기원에서 자릿값으로 보고 있다.</s>
Example 3 <s> 커피는 17세기경부터 유럽 각국에서 커피를 마셨고, 18세기 말에는 영국과 프랑스에서 커피를 마시게 되었다. 19세기 초에는 영국에서 커피가 대량으로 수입되었다. 19세기 초에는 영국에서 커피가 대량으로 수입되었다. 19세기 초에는 영국에서 커피가 대량으로 수입되었다.</s>
The outputs show considerable hallucination, awkwardness, and repetition.
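Generations like the samples above can be reproduced roughly as follows. Only max_new_tokens=500 is documented; the dtype, sampling settings, and the prompt (taken from the start of Example 1) are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "devngho/llama-ablation-large-korean-corpus_edu"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16)
model.eval()

# The prompt mirrors the input prefix of Example 1; sampling parameters other
# than max_new_tokens are assumptions, not values documented on this card.
inputs = tokenizer("인공지능은", return_tensors="pt")
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=500,
        do_sample=True,
        top_p=0.95,
        temperature=0.7,
    )
print(tokenizer.decode(output[0], skip_special_tokens=False))
```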
Details
- Made by: devngho
- Language: ko
- License: mit
Training details
- learning_rate: 6e-4 (cosine, initial/end 6e-5)
- warmup_ratio: 0.05
- batch_size: 1024 (fsdp 16 * per device 8 * ga 8)
- optimizer: adamw (b1=0.9, b2=0.95, eps=1e-5, weight_decay=0.01); see the schedule/optimizer sketch after this list
- duration: about 27h 50m
- steps: 10000
- You can check all the configs and training results on wandb.
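The schedule and optimizer listed above correspond roughly to the following optax setup. This is an illustrative sketch, not MaxText's actual training code; the warmup step count is inferred from warmup_ratio 0.05 over 10,000 steps.

```python
import optax

TOTAL_STEPS = 10_000
WARMUP_STEPS = int(0.05 * TOTAL_STEPS)  # warmup_ratio 0.05 -> 500 steps

# Global batch size: 16 (fsdp) * 8 (per-device batch) * 8 (grad accumulation) = 1024.

# Cosine schedule: 6e-5 -> 6e-4 over the warmup, decaying back to 6e-5.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=6e-5,
    peak_value=6e-4,
    warmup_steps=WARMUP_STEPS,
    decay_steps=TOTAL_STEPS,
    end_value=6e-5,
)

# AdamW with the hyperparameters listed above.
optimizer = optax.adamw(
    learning_rate=schedule,
    b1=0.9,
    b2=0.95,
    eps=1e-5,
    weight_decay=0.01,
)
```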
Training devices
TPU v4-32
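For reference, a TPU v4-32 slice normally shows up as 16 JAX devices, which matches the fsdp 16 factor in the batch-size breakdown above. A minimal sketch (it assumes it actually runs on such a slice; the axis name is illustrative, not necessarily what MaxText uses internally):

```python
import jax
import numpy as np
from jax.sharding import Mesh

# On a v4-32 slice this is expected to print 16 (one device per chip).
print(jax.device_count())

# A 1D mesh over those devices for FSDP-style sharding.
devices = np.array(jax.devices()).reshape(16,)
mesh = Mesh(devices, axis_names=("fsdp",))
```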
Training datasets
A corpus from AI Hub and the Modu Corpus was deduplicated and length-filtered, and only the data scoring >= 3 points when evaluated with devngho/ko_edu_classifier_v2_nlpai-lab_KoE5 (about 8%, 1,354,234 rows) was used.
The training dataset cannot be made public because of the terms of AI Hub and the Modu Corpus, but if you prepare the raw data yourself, you can reproduce the same preprocessing with devngho/dataset-preprocess. Filtering with the edu classifier has to be applied separately from that preprocessing; a sketch of the filtering step follows.
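A rough sketch of what that filtering step could look like, assuming devngho/ko_edu_classifier_v2_nlpai-lab_KoE5 loads as a sequence-classification model that returns a single educational-quality score (the actual output head, tokenization settings, and batching may differ):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CLASSIFIER = "devngho/ko_edu_classifier_v2_nlpai-lab_KoE5"

tokenizer = AutoTokenizer.from_pretrained(CLASSIFIER)
model = AutoModelForSequenceClassification.from_pretrained(CLASSIFIER)
model.eval()

def edu_score(text: str) -> float:
    # Assumes a single regression logit is used as the score; max_length=512
    # is an assumption, not a documented setting.
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.squeeze().item()

# Keep only documents scoring >= 3, as described above.
docs = ["문서 1 ...", "문서 2 ..."]  # placeholder documents
kept = [d for d in docs if edu_score(d) >= 3]
```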
Software
jax==0.4.35
devngho/MaxText, a fork of MaxText
Training results
- learning/loss: 1.6112642288208008
- eval/avg_loss: 2.0766192864296023
Benchmark results are provided below.