Update README.md
Correct the training process explanation (reversed)
README.md CHANGED
@@ -46,14 +46,14 @@ for name, param in model.named_parameters():
Our strategy involved a selective freeze of model parameters. Specifically, we kept most parameters of the base model unchanged while focusing on enhancing the Korean language capabilities. Through our experiments, we discovered:

-1. Freezing the `
-2. Unfreezing the `
+1. Freezing the `embed_tokens` layer for existing tokens is crucial to maintain overall performance.
+2. Unfreezing the `lm_head` layer for existing tokens actually boosts performance.

As a result, we froze the internal layers and the first 32,000 `embed_tokens`, directing our training efforts on a rich mix of Korean and multi-lingual corpora. This balanced approach has notably improved the model’s proficiency in Korean, without compromising its original language capabilities.

### Usage and Limitations

-Keep in mind
+Keep in mind that this model hasn't been fine-tuned with instruction-based training. While it excels in Korean language tasks, we advise careful consideration and further training for specific applications.

### Training Details
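The hunk's context line (`for name, param in model.named_parameters():`) indicates the README freezes parameters in a plain PyTorch loop. As a rough companion to the corrected explanation, here is a minimal sketch of such a selective freeze, assuming a Llama-style Hugging Face checkpoint and a gradient-hook workaround for the partial embedding freeze; the model name, hook, and parameter-name matching are illustrative assumptions, not the repository's actual training code. The hook is needed because `requires_grad` applies to whole tensors, so the first 32,000 embedding rows can only stay fixed by zeroing their gradients.

```python
# Illustrative sketch only, not the repository's training script. The base
# checkpoint, the gradient hook, and the parameter-name matching are assumptions;
# the 32,000-row cutoff and the "freeze internals, train embeddings and lm_head"
# split come from the README text above.
import torch
from transformers import AutoModelForCausalLM

NUM_ORIGINAL_TOKENS = 32_000  # the first 32,000 embed_tokens rows stay frozen

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder base model
# (after extending the tokenizer) model.resize_token_embeddings(len(tokenizer))

# Freeze the internal transformer layers; keep embeddings and the LM head trainable.
for name, param in model.named_parameters():
    param.requires_grad = name.endswith("embed_tokens.weight") or name.endswith("lm_head.weight")

# requires_grad is per-tensor, so the original embedding rows are held fixed by
# zeroing their gradient before each optimizer step.
def mask_original_rows(grad: torch.Tensor) -> torch.Tensor:
    grad = grad.clone()
    grad[:NUM_ORIGINAL_TOKENS] = 0.0
    return grad

model.get_input_embeddings().weight.register_hook(mask_original_rows)
# lm_head stays fully trainable, matching point 2 of the corrected hunk.
```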
@@ -73,13 +73,13 @@ Our model’s training was comprehensive and diverse:
3. **Manual Tokenizer Construction:** We then built the target tokenizer, focusing on these new Korean tokens.

-4. **Frequency Analysis:** Using target tokenizer, we processed a 100GB Korean corpus to count each token's frequency.
+4. **Frequency Analysis:** Using the target tokenizer, we processed a 100GB Korean corpus to count each token's frequency.

5. **Refinement of Token List:** We removed tokens appearing less than 6,000 times, ensuring to secure enough tokens to train models later.

-6. **Inclusion of Single-Letter Characters:** Counted missing Korean single-letter characters and added them to the target tokenizer that
+6. **Inclusion of Single-Letter Characters:** Counted missing Korean single-letter characters and added them to the target tokenizer that appeared more than 6,000 times.

-7. **Iterative Refinement:** We repeated steps 2 to 6 until there
+7. **Iterative Refinement:** We repeated steps 2 to 6 until there were no tokens to drop or add.

8. **Training Bias Towards New Tokens:** Our training data was biased to include more texts with new tokens, for effective learning.
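Steps 4, 5, and 7 amount to a count-and-filter loop over the candidate token list. Below is a minimal sketch of one such frequency pass, assuming a Hugging Face tokenizer directory and a line-delimited corpus file; the paths and variable names are illustrative, and only the 6,000-occurrence threshold comes from the README.

```python
# Illustrative sketch of the frequency pass in steps 4 and 5; paths and the
# streaming approach are assumptions, only the 6,000 threshold is from the README.
from collections import Counter
from transformers import AutoTokenizer

MIN_FREQUENCY = 6_000
tokenizer = AutoTokenizer.from_pretrained("./target-tokenizer")  # placeholder path

# Count how often each token id occurs across the Korean corpus.
counts = Counter()
with open("korean_corpus.txt", encoding="utf-8") as corpus:  # placeholder corpus file
    for line in corpus:
        counts.update(tokenizer(line, add_special_tokens=False)["input_ids"])

# Keep candidate tokens that clear the threshold; the rest are dropped from the
# token list before the next iteration (step 7).
kept_tokens = {
    tokenizer.convert_ids_to_tokens(token_id)
    for token_id, frequency in counts.items()
    if frequency >= MIN_FREQUENCY
}
print(f"kept {len(kept_tokens)} of {len(counts)} distinct tokens")
```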