naxalpha committed on
Commit
7a42e6d
·
1 Parent(s): 9c22e75

add training procedure

.ipynb_checkpoints/README-checkpoint.md CHANGED
@@ -36,3 +36,8 @@ Since it is not based on [transformers](https://github.com/huggingface/transform
  model.net.to_logits[1].weight.requires_grad_(False)
  model.net.to_logits[1].weight.copy_(emb)
  ```
+
+
+ ## Training procedure
+
+ Primarily, it has been trained with a language-modeling objective. However, I added a few tricks to further optimize the training. The main trick, using pretrained embeddings, is explained in the Devlog blog post linked above. The batch size is 8 with a sequence length of 128, and the optimizer is AdamW with a learning rate of 2e-5. Gradients are accumulated over 4 steps, so the effective batch size is 32. Training uses two losses: a simple cross-entropy loss for next-token prediction and a distillation loss from GPT2-xl. The two losses are alternated during training. The gradient norm is also clipped at 1.0.
README.md CHANGED
@@ -36,3 +36,8 @@ Since it is not based on [transformers](https://github.com/huggingface/transform
  model.net.to_logits[1].weight.requires_grad_(False)
  model.net.to_logits[1].weight.copy_(emb)
  ```
+
+
+ ## Training procedure
+
+ Primarily, it has been trained with a language-modeling objective. However, I added a few tricks to further optimize the training. The main trick, using pretrained embeddings, is explained in the Devlog blog post linked above. The batch size is 8 with a sequence length of 128, and the optimizer is AdamW with a learning rate of 2e-5. Gradients are accumulated over 4 steps, so the effective batch size is 32. Training uses two losses: a simple cross-entropy loss for next-token prediction and a distillation loss from GPT2-xl. The two losses are alternated during training. The gradient norm is also clipped at 1.0.
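
The schedule described in the added paragraph can be sketched without any framework. This is only an illustration of the bookkeeping, not the author's training script: the loss labels, the `MicroStep` and `schedule` names are hypothetical, and it assumes the two losses alternate on every micro-batch, which the README does not specify.

```python
from dataclasses import dataclass

# Hyperparameters stated in the README.
BATCH_SIZE = 8
SEQ_LEN = 128
LEARNING_RATE = 2e-5
GRAD_ACCUM_STEPS = 4
MAX_GRAD_NORM = 1.0

# Hypothetical labels for the two losses; these names are not from the repository.
LOSSES = ("cross_entropy", "distillation")


@dataclass
class MicroStep:
    index: int            # 0-based micro-batch index
    loss: str             # which loss this micro-batch would use
    optimizer_step: bool  # True when accumulated gradients would be clipped and applied


def effective_batch_size(batch_size=BATCH_SIZE, accum=GRAD_ACCUM_STEPS):
    """Number of samples contributing to one optimizer update."""
    return batch_size * accum


def schedule(num_micro_batches, accum=GRAD_ACCUM_STEPS):
    """Yield the plan for each micro-batch.

    Assumption: the losses alternate on every micro-batch. The README only
    says they are alternated, not at which granularity.
    """
    for i in range(num_micro_batches):
        yield MicroStep(
            index=i,
            loss=LOSSES[i % len(LOSSES)],
            optimizer_step=(i + 1) % accum == 0,
        )


if __name__ == "__main__":
    print("effective batch size:", effective_batch_size())
    for step in schedule(8):
        print(step)
```

In a PyTorch loop, a micro-batch with `optimizer_step=True` would typically correspond to calling `torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)`, then `optimizer.step()` and `optimizer.zero_grad()`. Whether the actual training code is arranged this way is not stated in the README.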