progen2_cross_attention_only_h

This model is a fine-tuned version of an unspecified base model on an unspecified dataset. It achieves the following results on the evaluation set (a quick perplexity check follows the list):

  • Loss: 2.4917
  • Perplexity: 12.0823
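
The reported perplexity is simply the exponential of the evaluation loss (the mean per-token cross-entropy in nats), which can be verified directly:

```python
import math

# Perplexity is exp(cross-entropy loss); the evaluation loss above is 2.4917.
eval_loss = 2.4917
perplexity = math.exp(eval_loss)
print(f"perplexity = {perplexity:.4f}")  # ~12.08, matching the reported 12.0823
```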

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a configuration sketch follows the list):

  • learning_rate: 0.0005
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 32
  • optimizer: adamw_torch with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
  • lr_scheduler_type: cosine_with_restarts
  • lr_scheduler_warmup_ratio: 0.1
  • training_steps: 5000
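
These settings map onto the standard Hugging Face TrainingArguments fields. The original training script is not published here, so the following is only a minimal sketch of an equivalent configuration (output_dir is a placeholder):

```python
from transformers import TrainingArguments

# Sketch of the hyperparameters above expressed as TrainingArguments;
# the actual training script may have differed.
training_args = TrainingArguments(
    output_dir="progen2_cross_attention_only_h",  # placeholder
    learning_rate=5e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    gradient_accumulation_steps=4,      # effective train batch size: 8 * 4 = 32
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="cosine_with_restarts",
    warmup_ratio=0.1,                   # 10% of 5000 steps = 500 warmup steps
    max_steps=5000,
)
```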

Training results

| Training Loss | Epoch   | Step | Validation Loss | Perplexity |
|:-------------:|:-------:|:----:|:---------------:|:----------:|
| 32.5486       | 0.2909  | 100  | 7.6093          | 2016.7766  |
| 22.5738       | 0.5818  | 200  | 2.8788          | 17.7926    |
| 11.4858       | 0.8727  | 300  | 2.8572          | 17.4123    |
| 11.3910       | 1.1658  | 400  | 2.8481          | 17.2545    |
| 11.6307       | 1.4567  | 500  | 2.6227          | 13.7734    |
| 10.2311       | 1.7476  | 600  | 2.4862          | 12.0155    |
| 9.9477        | 2.0407  | 700  | 2.4658          | 11.7733    |
| 9.8694        | 2.3316  | 800  | 2.6730          | 14.4827    |
| 9.8291        | 2.6225  | 900  | 2.4811          | 11.9541    |
| 31.1466       | 2.9135  | 1000 | 8.7851          | 6536.0332  |
| 34.9023       | 3.2065  | 1100 | 7.7230          | 2259.8149  |
| 30.5868       | 3.4975  | 1200 | 7.5959          | 1990.0344  |
| 30.4004       | 3.7884  | 1300 | 7.5865          | 1971.3219  |
| 31.7038       | 4.0815  | 1400 | 8.0208          | 3043.6248  |
| 31.3893       | 4.3724  | 1500 | 7.2647          | 1428.9806  |
| 25.8028       | 4.6633  | 1600 | 5.7546          | 315.6425   |
| 22.4188       | 4.9542  | 1700 | 5.3616          | 213.0554   |
| 21.2490       | 5.2473  | 1800 | 5.3029          | 200.9226   |
| 20.9864       | 5.5382  | 1900 | 5.3000          | 200.3277   |
| 20.9816       | 5.8291  | 2000 | 5.1496          | 172.3635   |
| 20.6328       | 6.1222  | 2100 | 4.6971          | 109.6314   |
| 18.4146       | 6.4131  | 2200 | 4.5423          | 93.9023    |
| 17.0501       | 6.7040  | 2300 | 3.8270          | 45.9244    |
| 15.6660       | 6.9949  | 2400 | 3.4366          | 31.0810    |
| 15.9270       | 7.2880  | 2500 | 3.9706          | 53.0142    |
| 13.5433       | 7.5789  | 2600 | 2.9892          | 19.8694    |
| 12.3278       | 7.8698  | 2700 | 3.1080          | 22.3761    |
| 12.0588       | 8.1629  | 2800 | 2.7287          | 15.3123    |
| 11.1222       | 8.4538  | 2900 | 2.6745          | 14.5055    |
| 10.9132       | 8.7447  | 3000 | 2.6467          | 14.1074    |
| 10.9437       | 9.0378  | 3100 | 2.6341          | 13.9301    |
| 10.8436       | 9.3287  | 3200 | 3.8787          | 48.3626    |
| 10.6462       | 9.6196  | 3300 | 2.6104          | 13.6050    |
| 10.5014       | 9.9105  | 3400 | 2.6434          | 14.0614    |
| 10.4753       | 10.2036 | 3500 | 2.6008          | 13.4750    |
| 10.4235       | 10.4945 | 3600 | 2.5825          | 13.2301    |
| 10.2556       | 10.7855 | 3700 | 2.5495          | 12.8001    |
| 10.2415       | 11.0785 | 3800 | 2.5396          | 12.6741    |
| 10.1531       | 11.3695 | 3900 | 2.5290          | 12.5413    |
| 10.1279       | 11.6604 | 4000 | 2.5270          | 12.5158    |
| 10.0816       | 11.9513 | 4100 | 2.5152          | 12.3687    |
| 10.0384       | 12.2444 | 4200 | 2.5198          | 12.4260    |
| 10.0156       | 12.5353 | 4300 | 2.5003          | 12.1862    |
| 9.9928        | 12.8262 | 4400 | 2.4984          | 12.1632    |
| 10.0172       | 13.1193 | 4500 | 2.4940          | 12.1100    |
| 9.9678        | 13.4102 | 4600 | 2.4955          | 12.1281    |
| 9.9605        | 13.7011 | 4700 | 2.4927          | 12.0943    |
| 9.9324        | 13.9920 | 4800 | 2.4920          | 12.0851    |
| 9.9536        | 14.2851 | 4900 | 2.4916          | 12.0804    |
| 9.9154        | 14.5760 | 5000 | 2.4917          | 12.0823    |

Framework versions

  • Transformers 4.47.1
  • PyTorch 2.1.0.post301
  • Datasets 3.0.2
  • Tokenizers 0.21.0
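
To reproduce results with this checkpoint, a quick way to confirm that a local environment matches the versions above (assuming the packages were installed from PyPI under their usual distribution names):

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in this model card; "PyTorch" is distributed as "torch".
expected = {
    "transformers": "4.47.1",
    "torch": "2.1.0.post301",
    "datasets": "3.0.2",
    "tokenizers": "0.21.0",
}
for package, wanted in expected.items():
    try:
        found = version(package)
    except PackageNotFoundError:
        found = "not installed"
    print(f"{package}: expected {wanted}, found {found}")
```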

Model size

  • 1.02B parameters (Safetensors, F32)