lapp0 committed 275c26d (1 parent: 4e4f88f)

End of training
README.md CHANGED
@@ -44,7 +44,7 @@ More information needed
  
  # Resource Usage Comparison
  
- - VRAM Use: 15.6974 GB
+ - VRAM Use: 15.6991 GB
  
  # Distillation (Teacher -> Student) Architecture Difference:
  
@@ -75,7 +75,7 @@ More information needed
  <br/>
  
  # Train Dataset
- Trained on 521,374,680 tokens from the [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) dataset.
+ Trained on 521,408,413 tokens from the [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) dataset.
  
  - Num Samples: `990,000`
  - Subset: `20231101.en`
@@ -94,7 +94,7 @@ The following hyperparameters were used during training:
  <details>
  <summary>Expand</summary>
  
- - learning_rate: `0.0001`
+ - learning_rate: `0.0002`
  - train_batch_size: `16`
  - eval_batch_size: `8`
  - seed: `42`
@@ -103,7 +103,7 @@ The following hyperparameters were used during training:
  - num_epochs: `1.0`
  - distillation_objective: `DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=5, loss_fn=raw_mse, layer_mapper=layer-2, norm=layernorm_teacher_only, projector=mlp))`
  - train_embeddings: `True`
- - lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at 0x7f460cccc760>`
+ - lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at 0x7f9be8f9ba30>`
  - student_model_name_or_path: `None`
  - student_config_name_or_path: `distilbert/distilgpt2`
  - student_model_config: `None`
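The `distillation_objective` string in the diff above records a composite loss: KL divergence on the logits (weight 1) plus raw MSE on attention maps (weight 5), with layer normalization applied to the teacher side only and the student's maps passed through an MLP projector. A minimal sketch of such an objective follows; the `projector` and `teacher_norm` modules and the layer pairing are illustrative assumptions, not the training library's actual implementation (in particular, the exact semantics of `layer_mapper=layer-2` are internal to that library).

```python
import torch
import torch.nn.functional as F

def kl_logits_loss(student_logits, teacher_logits):
    # loss_fn=kl: KL divergence from the teacher's token distribution
    # to the student's, averaged over the batch.
    return F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )

def attn_mse_loss(student_attns, teacher_attns, projector, teacher_norm):
    # loss_fn=raw_mse with norm=layernorm_teacher_only: normalize only the
    # teacher's attention maps, project the student's through an MLP
    # (projector=mlp), then take a plain MSE per mapped layer pair.
    # ASSUMPTION: layer_mapper=layer-2 is modeled here as pairing student
    # layer i with teacher layer 2*i (distilgpt2 has half gpt2's layers).
    losses = []
    for i, s_attn in enumerate(student_attns):
        t_attn = teacher_norm(teacher_attns[2 * i])
        losses.append(F.mse_loss(projector(s_attn), t_attn))
    return torch.stack(losses).mean()

def distillation_loss(student_out, teacher_out, projector, teacher_norm,
                      logits_weight=1.0, attn_weight=5.0):
    # Composite objective matching the weights recorded in the card.
    return (
        logits_weight * kl_logits_loss(student_out.logits, teacher_out.logits)
        + attn_weight * attn_mse_loss(
            student_out.attentions, teacher_out.attentions,
            projector, teacher_norm,
        )
    )
```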
logs/attn_norm=layernorm_teacher_only, attn_projector=mlp, attn_weight=5, learning_rate=0.0002, per_device_train_batch_size=16, warmup_ratio=0/events.out.tfevents.1725179970.849724f928d2 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5a6b8dfced35e5bbf5b9442a50fb20fb67f8826fa8af42dd25ff177d3693ca91
+ size 29625548
logs/attn_norm=layernorm_teacher_only, attn_projector=mlp, attn_weight=5, learning_rate=0.0002, per_device_train_batch_size=16, warmup_ratio=0/events.out.tfevents.1725198120.849724f928d2 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d778d2efa3ee0cb6eb70b83389af8dc796dd3bcbe3ac74e8fd6c5ef6654c28c5
+ size 529
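The two added files are TensorBoard event logs stored as Git LFS pointers (the ~28 MB file holds the full training run; the 529-byte one is a short follow-up event file). A sketch of pulling and inspecting such a log via standard `huggingface_hub` and TensorBoard APIs; the `repo_id` is an assumption, since this commit page does not show the repository name.

```python
from huggingface_hub import hf_hub_download
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

log_path = hf_hub_download(
    repo_id="lapp0/<model-repo>",  # hypothetical; substitute the real repo id
    filename=(
        "logs/attn_norm=layernorm_teacher_only, attn_projector=mlp, "
        "attn_weight=5, learning_rate=0.0002, per_device_train_batch_size=16, "
        "warmup_ratio=0/events.out.tfevents.1725179970.849724f928d2"
    ),
)

acc = EventAccumulator(log_path)
acc.Reload()                   # parse the event file
print(acc.Tags()["scalars"])   # e.g. the recorded loss curves
```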
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:510bf40fbbc84e43e0abf99badf7f5b3a5ca81bc31697dec5efab16b1d634a7a
+ oid sha256:93a3b84ead81078ee12e4a9bd9762e9b68224e572b7738e0c5e882bd73dfddbd
  size 163832792
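The weights file keeps the same size (163,832,792 bytes); only the pointer's oid changed, i.e. the tensors were overwritten by this training run. Per the Git LFS spec, the oid is simply the SHA-256 hex digest of the file contents, so a downloaded copy can be checked against the pointer:

```python
import hashlib

def lfs_oid(path: str, chunk_size: int = 1 << 20) -> str:
    # Git LFS oids are the SHA-256 hex digest of the raw file bytes.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

# After this commit, the downloaded weights should hash to the new oid.
expected = "93a3b84ead81078ee12e4a9bd9762e9b68224e572b7738e0c5e882bd73dfddbd"
assert lfs_oid("model.safetensors") == expected
```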
training_args.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:3dee14f3db7cafcc7f99167311abb29cc48054bc06f56275bf53dd809c46f09d
+ oid sha256:5fbf0affc717120a5cfc2e1143ea4893cc06465ea0457eae81e475adc6c5550d
  size 5624
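By `transformers` Trainer convention, `training_args.bin` is the pickled training-arguments object saved alongside checkpoints; its size is unchanged because only field values (e.g. the learning rate) differ between the two commits. A sketch of inspecting it, noting that recent PyTorch requires `weights_only=False` to unpickle non-tensor objects:

```python
import torch

# training_args.bin is a pickled arguments object; loading it executes
# pickle, so only do this for repositories you trust.
args = torch.load("training_args.bin", weights_only=False)
print(args.learning_rate)                 # expect 0.0002 after this commit
print(args.per_device_train_batch_size)   # expect 16
print(args.seed)                          # expect 42
```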